Tuning Linux for MongoDB

Tim Vaillancourt | August 12, 2016 | Posted In: MongoDB

In this post, we’ll discuss tuning Linux for MongoDB deployments.

By far the most common operating system you’ll see MongoDB running on is Linux 2.6 and 3.x. Linux flavors such as CentOS and Debian do a fantastic job of being a stable, general-purpose operating system. Linux runs software on hardware ranging from tiny computers like the Raspberry Pi up to massive data center servers. To make this flexibility work, however, Linux defaults to some “lowest common denominator” tunings so that the OS will boot on anything.

Working with databases, we often focus on the queries, patterns and tunings that happen inside the database process itself. This means we sometimes forget that the operating system below it is the life support of the database, the air that it breathes, so to speak. Of course, a highly scalable database such as MongoDB runs fine on these general-purpose defaults without complaint, but the efficiency can be equivalent to running in regular shoes instead of sleek runners. At small scale, you might not notice the lost efficiency, but at large scale (especially when data exceeds RAM) improved tunings equate to fewer servers and lower operational costs. For all use cases and scales, good OS tunings also provide some improvement in response times and remove extra “what if…?” questions when troubleshooting.

Overall, memory, network and disk are the system resources important to MongoDB. This article covers how to optimize each of these areas. Of course, while we have successfully deployed these tunings to many live systems, it’s always best to test before applying changes to your servers.

If you plan on applying these changes, I suggest performing them with one full reboot of the host. Some of these changes don’t require a reboot, but test that they get re-applied if you reboot in the future. MongoDB’s clustered nature should make this relatively painless, plus it might be a good time to do that dreaded “yum upgrade” / “aptitude upgrade”, too.

Linux Ulimit

To prevent a single user from impacting the entire system, Linux has a facility to place constraints on processes, file handles and other system resources on a per-user basis. For medium-to-high-usage MongoDB deployments, the default limits are almost always too low. Considering MongoDB generally uses dedicated hardware, it makes sense to allow the Linux user running MongoDB (e.g., “mongod”) to use a majority of the available resources.

Now you might be thinking: “Why not disable the limit (or set it to unlimited)?” This is a common recommendation for database servers. I think you should avoid this for two reasons:

If you hit a problem, a lack of a limit on system resources can allow a relatively smaller problem to spiral out of control, often bringing down other services (such as SSH) crucial to solving the original problem.
All systems DO have an upper-limit, and understanding those limitations instead of masking them is an important exercise.

In most cases, a limit of 64,000 “max user processes” and 64,000 “open files” (both have defaults of 1024) will suffice. To be more exact you need to do some math on the number of applications/clients, the maximum size of their connection pools and some case-by-case tuning for the number of inter-node connections between replica set members and sharding processes. (We might address this in a future blog post.)

You can deploy these limits by adding a file in “/etc/security/limits.d” (or appending to “/etc/security/limits.conf” if there is no “limits.d”). Below is an example file for the Linux user “mongod”, raising open-file and max-user-process limits to 64,000:

mongod soft nproc 64000
mongod hard nproc 64000
mongod soft nofile 64000
mongod hard nofile 64000

Note: this change only applies to new shells, meaning you must restart “mongod” or “mongos” to apply this change!
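Because limits are inherited when a process starts, it can help to confirm what a running mongod actually received. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper name soft_limit is hypothetical and not part of any MongoDB tooling:

```shell
# Print the soft limit for a named resource from a /proc/<pid>/limits-style file.
# soft_limit is a hypothetical helper name used only for this illustration.
soft_limit() {
    limits_file="$1"   # e.g. /proc/12345/limits
    limit_name="$2"    # e.g. "Max open files" or "Max processes"
    # For these two rows the columns are: name, soft limit, hard limit, units,
    # so the soft limit is the third field from the end of the line.
    awk -v name="$limit_name" 'index($0, name) == 1 { print $(NF - 2) }' "$limits_file"
}
```

For example, after restarting mongod, running soft_limit "/proc/$(pidof mongod)/limits" "Max open files" should report the new value rather than the old default.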

Virtual Memory

Dirty Ratio

The “dirty_ratio” is the percentage of total system memory that can hold dirty pages. The default on most Linux hosts is between 20-30%. When you exceed the limit the dirty pages are committed to disk, creating a small pause. To avoid this hard pause there is a second ratio: “dirty_background_ratio” (default 10-15%) which tells the kernel to start flushing dirty pages to disk in the background without any pause.

20-30% is a good general default for “dirty_ratio”, but on large-memory database servers this can be a lot of memory! For example, on a 128GB-memory host a 30% ratio can allow up to 38.4GB of dirty pages, and a 10% background ratio won’t kick in until 12.8GB! We recommend that you lower this setting and monitor the impact on query performance and disk IO. The goal is to reduce memory usage without negatively impacting query performance. Reducing cache sizes also means data gets written to disk in smaller batches more frequently, which gives better disk throughput than huge bulk writes done less often.
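The arithmetic behind those figures can be sketched as follows. This is an approximation that treats each ratio as a share of total RAM, as the text above does; the helper name dirty_limit_gb is hypothetical:

```shell
# Print how many GB of dirty pages a given ratio allows on a host.
# dirty_limit_gb is a hypothetical helper: $1 is RAM in GB, $2 is a percentage.
dirty_limit_gb() {
    awk -v ram_gb="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", ram_gb * pct / 100 }'
}

dirty_limit_gb 128 30   # vm.dirty_ratio = 30 on a 128GB host
dirty_limit_gb 128 10   # vm.dirty_background_ratio = 10
dirty_limit_gb 128 15   # vm.dirty_ratio = 15
dirty_limit_gb 128 5    # vm.dirty_background_ratio = 5
```

On a 128GB host this prints 38.4, 12.8, 19.2 and 6.4, which shows how much smaller the flush batches become with lower ratios.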

A recommended setting for dirty ratios on large-memory (64GB+ perhaps) database servers is: “vm.dirty_ratio = 15” and “vm.dirty_background_ratio = 5”, or possibly less. (Red Hat recommends lower ratios of 10 and 3 for high-performance/large-memory servers.)

You can set this by adding the following lines to “/etc/sysctl.conf”:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

To check the current running values:

$ sysctl -a | grep -E "vm.dirty.*_ratio"
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

Swappiness

“Swappiness” is a Linux kernel setting, ranging from 0 to 100, that influences how readily the Virtual Memory manager swaps memory out to disk. A setting of “0” tells the kernel to swap only to avoid out-of-memory problems. A setting of 100 tells it to swap aggressively to disk. The Linux default is usually 60, which is not ideal for database usage.

It is common to see a setting of “0” (or sometimes “10”) on database servers, telling the kernel to prefer keeping data in memory over swapping to disk, for better response times. However, Ovais Tariq details a known bug (or feature) when using a setting of “0” in this blog post: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Due to this bug, we recommend using a setting of “1” (or “10” if you prefer some disk swapping) by adding the following to your “/etc/sysctl.conf”:

vm.swappiness = 1

To check the current swappiness:

$ sysctl vm.swappiness
vm.swappiness = 1

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply a dirty_ratio or swappiness change!

Transparent HugePages

*Does not apply to Debian/Ubuntu or CentOS/RedHat 5 and lower*

Transparent HugePages (THP) is an optimization introduced in CentOS/RedHat 6.0, with the goal of reducing overhead on systems with large amounts of memory. However, due to the way MongoDB uses memory, this feature actually does more harm than good, as its memory accesses are rarely contiguous.

Disable THP entirely by adding the following flag to your Linux kernel boot options:

transparent_hugepage=never

Usually this requires changes to the GRUB boot-loader config in the directory “/boot/grub” or “/etc/grub.d” on newer systems. Red Hat covers this in more detail in this article (same method on CentOS): https://access.redhat.com/solutions/46111.

Note: We recommend rebooting the system to clear out any previous huge pages and validate that the setting will persist on reboot.
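After the reboot, the running THP state can be read back from sysfs, where the active value is shown in square brackets. A minimal sketch follows; the helper name active_choice is hypothetical, and the sysfs path is an assumption that holds on mainline kernels (some older RHEL 6 kernels expose it under /sys/kernel/mm/redhat_transparent_hugepage instead):

```shell
# Print the active choice from a sysfs-style selector string such as
# "always madvise [never]" (the active value is the one in square brackets).
# active_choice is a hypothetical helper name.
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Assumed location of the THP switch; adjust if your kernel differs.
thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "THP is currently: $(active_choice "$(cat "$thp_file")")"
fi
```

If the boot flag took effect, the reported value should be “never”.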

NUMA (Non-Uniform Memory Access) Architecture

Non-Uniform Memory Access is a recent memory architecture that takes into account the locality of caches and CPUs for lower latency. Unfortunately, MongoDB is not “NUMA-aware”, and leaving NUMA in its default configuration can cause a severe memory imbalance.

There are two ways to disable NUMA: one is via an on/off switch in the system BIOS config, the second is using the “numactl” command to set NUMA interleaved mode (similar effect to disabling NUMA) when starting MongoDB. Both methods achieve the same result. I lean towards using the “numactl” command, because it keeps NUMA available should MongoDB gain NUMA awareness in the future, which seems mostly inevitable. On CentOS 7+ you may need to install the “numactl” yum/rpm package.

To make mongod start in interleaved mode, add “numactl --interleave=all” before your regular “mongod” command:

$ numactl --interleave=all mongod <options here>

To check mongod’s NUMA setting:

$ sudo numastat -p $(pidof mongod)

Per-node process memory usage (in MBs) for PID 7516 (mongod)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                        28.53           28.53
Stack                        0.20            0.20
Private                      7.55            7.55
----------------  --------------- ---------------
Total                       36.29           36.29

If you see only one NUMA-node column (“Node 0”), NUMA is disabled. If you see more than one NUMA-node column, make sure the metric numbers (“Heap”, etc.) are balanced between nodes. Otherwise, NUMA is NOT in “interleave” mode.
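Before reading the numastat output, it can be useful to know how many NUMA nodes the host exposes at all. A minimal sketch, assuming the usual sysfs layout where each node appears as a node<N> directory; the helper name numa_node_count is hypothetical:

```shell
# Count NUMA nodes by counting node<N> directories in a sysfs-style directory.
# numa_node_count is a hypothetical helper; the default path is the usual
# sysfs location on Linux.
numa_node_count() {
    node_dir="${1:-/sys/devices/system/node}"
    count=0
    for entry in "$node_dir"/node[0-9]*; do
        # If the glob matched nothing, "$entry" is the literal pattern and is skipped.
        if [ -d "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A result of 1 corresponds to the single-column case described above; anything higher means the balance between node columns is worth checking.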

Note: some MongoDB packages already ship logic to disable NUMA in the init/startup script. Check for this using “grep” first. Your hardware or BIOS manual should cover disabling NUMA via the system BIOS.

Block Device IO Scheduler and Read-Ahead

For tuning flexibility, we recommend that MongoDB data sits on its own disk volume, preferably with its own dedicated disks/RAID array. While it may complicate backups, for the best performance you can also dedicate a separate volume for the MongoDB journal to separate its disk activity noise from the main data set. The journal does not yet have its own config/command-line setting, so you’ll need to mount a volume to the “journal” directory inside the dbPath. For example, “/var/lib/mongo/journal” would be the journal mount-path if the dbPath was set to “/var/lib/mongo”.

Aside from good hardware, the block device MongoDB stores its data on can benefit from two major adjustments:

IO Scheduler

The IO scheduler is the algorithm the kernel uses to commit reads and writes to disk. By default most Linux installs use the CFQ (Completely Fair Queuing) scheduler. This is designed to work well for many general use cases, but with few latency guarantees. Two other popular schedulers are “deadline” and “noop”. Deadline excels at latency-sensitive use cases (like databases) and noop is closer to no scheduling at all.

We generally suggest using the “deadline” IO scheduler for cases where you have real, non-virtualised disks under MongoDB. (For example, a “bare metal” server.) In some cases I’ve seen “noop” perform better with certain hardware RAID controllers, however. The difference between “deadline” and “cfq” can be massive for disk-bound deployments.

If you are running MongoDB inside a VM (which has its own IO scheduler beneath it) it is best to use “noop” and let the virtualization layer take care of the IO scheduling itself.

Read-Ahead

Read-ahead is a per-block-device performance tuning in Linux that causes data ahead of a requested block on disk to be read and then cached in the filesystem cache. Read-ahead assumes that there is a sequential read pattern and that something will benefit from those extra blocks being cached. MongoDB tends to have very random disk patterns and often does not benefit from the default read-ahead setting, wasting memory that could be used for more hot data. Most Linux systems have a default setting of 128KB/256 sectors (128KB = 256 x 512-byte sectors). This means if MongoDB fetches a 64KB document from disk, 128KB of filesystem cache is used and maybe the extra 64KB is never accessed later, wasting memory.

For this setting, we suggest a starting-point of 32 sectors (=16KB) for most MongoDB workloads. From there you can test increasing/reducing this setting and then monitor a combination of query performance, cached memory usage and disk read activity to find a better balance. You should aim to use as little cached memory as possible without dropping the query performance or causing significant disk activity.
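Since read-ahead is quoted in sectors in some tools and in kilobytes in others, a small conversion helps when comparing values. A minimal sketch, assuming 512-byte sectors as in the text above; the helper name sectors_to_kb is hypothetical:

```shell
# Convert a read-ahead value in 512-byte sectors (as reported by
# "blockdev --getra") to kilobytes. sectors_to_kb is a hypothetical helper.
sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

sectors_to_kb 256   # common default
sectors_to_kb 32    # suggested starting point
```

This prints 128 and 16, matching the default and the suggested starting point.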

Both the IO scheduler and read-ahead can be changed by adding a file to the udev configuration at “/etc/udev/rules.d”. In this example I am assuming the block device serving mongo data is named “/dev/sda” and I am setting “deadline” as the IO scheduler and 16KB/32 sectors as read-ahead:

# set deadline scheduler and 16kb read-ahead for /dev/sda
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

To check the IO scheduler was applied ([square-brackets] = enabled scheduler):

$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

To check the current read-ahead setting:

$ sudo blockdev --getra /dev/sda
32

Note: this change should be applied and tested with a full system reboot!

Filesystem and Options

It is recommended that MongoDB uses only the ext4 or XFS filesystems for on-disk database data. ext3 should be avoided due to its poor pre-allocation performance. If you’re using WiredTiger (MongoDB 3.0+) as a storage engine, it is strongly advised that you ONLY use XFS due to serious stability issues on ext4.

Each time you read a file, the filesystem performs an access-time metadata update by default. However, MongoDB (like most applications) does not use this access-time information. This means you can disable access-time updates on MongoDB’s data volume, which eliminates the small amount of disk IO activity those updates cause.

You can disable access-time updates by adding the flag “noatime” to the filesystem options field in the file “/etc/fstab” (4th field) for the disk serving MongoDB data:

/dev/mapper/data-mongodb /var/lib/mongo ext4 defaults,noatime 0 0

Use “grep” to verify that the volume is currently mounted with “noatime”:

$ grep "/var/lib/mongo" /proc/mounts
/dev/mapper/data-mongodb /var/lib/mongo ext4 rw,seclabel,noatime,data=ordered 0 0

Note: to apply a filesystem-options change, you must remount (umount + mount) the volume again after stopping MongoDB, or reboot.
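The same check can be scripted so that it fails loudly if the option goes missing after a reboot. A minimal sketch, assuming the /proc/mounts field layout shown above; the helper name has_noatime is hypothetical:

```shell
# Succeed if the given mount point carries the "noatime" option in a
# /proc/mounts-style file. has_noatime is a hypothetical helper name.
has_noatime() {
    mount_point="$1"
    mounts_file="${2:-/proc/mounts}"
    # Field 2 is the mount point, field 4 the comma-separated option list.
    # Wrapping the list in commas avoids matching similar names like "relatime".
    awk -v mp="$mount_point" '
        $2 == mp && ("," $4 ",") ~ /,noatime,/ { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}
```

For example, has_noatime /var/lib/mongo || echo "noatime is missing" could be run from a post-boot check.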

Network Stack

Several defaults of the Linux kernel network tunings are either not optimal for MongoDB, limit a typical host with 1000Mbps network interfaces (or better) or cause unpredictable behavior with routers and load balancers. We suggest some increases to the relatively low throughput settings (net.core.somaxconn and net.ipv4.tcp_max_syn_backlog) and a decrease in keepalive settings, seen below.

Make these changes permanent by adding the following to “/etc/sysctl.conf” (or a new file /etc/sysctl.d/mongodb-sysctl.conf – if /etc/sysctl.d exists):

net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096

To check the current values of any of these settings:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

Note: you must run the command “/sbin/sysctl -p” as root/sudo (or reboot) to apply this change!
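To confirm that the running values match what the file requests, the configured value can be read back and compared with the output of sysctl. A minimal sketch, assuming a simple “key = value” file format; the helper name conf_value is hypothetical:

```shell
# Print the value configured for a key in a sysctl.conf-style file.
# conf_value is a hypothetical helper; it skips comment lines, tolerates
# spaces around "=", and reports the last matching entry.
conf_value() {
    conf_file="$1"
    key="$2"
    awk -F= -v key="$key" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        {
            k = $1; v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", k)      # trim the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim the value
            if (k == key) result = v
        }
        END { if (result != "") print result }
    ' "$conf_file"
}
```

For example, the output of conf_value /etc/sysctl.conf net.core.somaxconn can be compared with that of sysctl -n net.core.somaxconn; a mismatch suggests the file was edited but not yet applied.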

NTP Daemon

All of these deeper tunings make it easy to forget about something as simple as your clock source. As MongoDB is a cluster, it relies on a consistent time across nodes. Thus the NTP Daemon should run permanently on all MongoDB hosts, mongos and arbiters included. Be sure to check that the time syncing won’t fight with any guest-based virtualization tools like “VMware Tools” and “VirtualBox Guest Additions”.

This is installed on RedHat/CentOS with:

$ sudo yum install ntp

And on Debian/Ubuntu:

$ sudo apt-get install ntp

Note: Start and enable the NTP Daemon (for starting on reboots) after installation. The commands to do this vary by OS and OS version, so please consult your documentation.

Security-Enhanced Linux (SELinux)

Security-Enhanced Linux is a kernel-level security access control module that has an unfortunate tendency to be disabled or set to warn-only on Linux deployments. As SELinux is a strict access control system, sometimes it can cause unexpected errors (permission denied, etc.) with applications that were not configured properly for SELinux. Often people disable SELinux to resolve the issue and forget about it entirely. While implementing SELinux is not an end-all solution, it massively reduces the local attack surface of the server. We recommend deploying MongoDB with SELinux “Enforcing” mode on.

The modes of SELinux are:

Enforcing – Block and log policy violations.
Permissive – Log policy violations only.
Disabled – Completely disabled.

As database servers are usually dedicated to one purpose, such as running MongoDB, the work of setting up SELinux is a lot simpler than a multi-use server with many processes and users (such as an application/web server, etc.). The OS access pattern of a database server should be extremely predictable. Introducing “Enforcing” mode at the very beginning of your testing/installation instead of after-the-fact avoids a lot of gotchas with SELinux. Logging for SELinux is directed to “/var/log/audit/audit.log” and the configuration is at “/etc/selinux”.

Luckily, Percona Server for MongoDB RPM packages (CentOS/RedHat) are SELinux “Enforcing” mode compatible as they install/enable an SELinux policy at RPM install time! Debian/Ubuntu SELinux support is still in planning.

Here you can see the SELinux policy shipped in the Percona Server for MongoDB version 3.2 server package:

$ rpm -ql Percona-Server-MongoDB-32-server | grep selinux
/etc/selinux/targeted/modules/active/modules/mongod.pp

To change the SELinux mode to “Enforcing”:

$ sudo setenforce Enforcing

To check the running SELinux mode:

$ sudo getenforce
Enforcing
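Note that “setenforce” changes only the running mode; the mode used at boot comes from the SELINUX= line in the SELinux config file. A minimal sketch for reading that persisted value, assuming the standard “/etc/selinux/config” format; the helper name selinux_boot_mode is hypothetical:

```shell
# Print the mode that SELinux will use at boot, read from a file in the
# format of /etc/selinux/config. selinux_boot_mode is a hypothetical helper.
selinux_boot_mode() {
    config_file="${1:-/etc/selinux/config}"
    # Take the last uncommented SELINUX= line; SELINUXTYPE= lines do not match.
    sed -n 's/^SELINUX=\(.*\)$/\1/p' "$config_file" | tail -n 1
}
```

If this prints anything other than “enforcing”, the running mode shown by “getenforce” will not survive a reboot.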

Linux Kernel and Glibc Version

The version of the Linux kernel and Glibc itself may be more important than you think. Some community benchmarks show a significant improvement on OLTP throughput benchmarks with the recent Linux 3.x kernels versus the 2.6 still widely deployed. To avoid serious bugs, MongoDB should at minimum use Linux 2.6.36 and Glibc 2.13 or newer.
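A quick way to compare a host against that kernel minimum is sketched below. It assumes “sort -V” from GNU coreutils is available; the helper name version_ge is hypothetical:

```shell
# Succeed if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on "sort -V" (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix (e.g. "-327.el7.x86_64") before comparing.
kernel="$(uname -r)"
kernel="${kernel%%-*}"
if version_ge "$kernel" 2.6.36; then
    echo "kernel $kernel meets the 2.6.36 minimum"
else
    echo "kernel $kernel is older than 2.6.36"
fi
```

The same helper can be pointed at the Glibc version, which “getconf GNU_LIBC_VERSION” reports on glibc-based systems, to compare against 2.13.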

I hope to create a follow-up post on the specific differences seen under MongoDB workloads on Linux 3.2+ versus 2.6. Until then, I recommend you test the difference using your own workloads and any results/feedback are appreciated.

What’s Next?

What’s the next thing to tune? At this point, tuning becomes case-by-case and open-ended. I appreciate any comments on use-case/tunings pairings that worked for you. Also, look out for follow-ups to this article for a few tunings I excluded due to lack of testing.

Not knowing the next step might mean you’re done tuning, or that you need more visibility into your stack to find the next bottleneck. Good monitoring and data visibility are invaluable for this type of investigation. Look out for future posts regarding monitoring your MongoDB (or MySQL) deployment and consider using Percona Monitoring and Management as an all-in-one monitoring solution. You could also try using Percona-Lab/prometheus_mongodb_exporter, prometheus/node_exporter and Percona-Lab/grafana_mongodb_dashboards for monitoring MongoDB/Linux with Prometheus and Grafana.

The road to an efficient database stack requires patience, analysis and iteration. Tomorrow a new hardware architecture or a change in kernel behavior could come along; be the first to spot the next bottleneck! Happy hunting.

Original: https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/