May 19, 2014, 12:35 PM - Monitoring, anticipating, and reacting to server load is a full-time job in some organizations. Unexpected spikes in resource usage can indicate a software or hardware problem. Gradual increases over time can help you predict hardware growth requirements. Underutilization can show you opportunities to use hardware more efficiently. CPU load is one of the most important metrics for measuring hardware usage.
These days, RAM and storage are cheap and plentiful. More often it’s the CPU causing resource shortages, especially if you operate a virtualized environment. When you create a new virtual machine, the VM requires at least 1 CPU core to operate. It’s recommended that each CPU core you allocate to a VM match up with a physical CPU core. That means your host server can only run as many single-core virtual machines as it has cores (minus 1 for the host server itself), and usually a VM needs more than 1 core if it’s doing any real work. Allocating cores so that you can run the most VMs efficiently is the goal of any virtualized system.
If you’re used to Windows-style CPU reporting, which shows utilization as a percentage, Linux load reporting can be a little confusing.
Under Linux, CPU load is reported as a series of three decimals, such as the three load averages printed at the end of the output of the ‘uptime’ command.
The first decimal represents the average CPU load over the past minute. The second decimal is the average load over a 5-minute period. The third and final number is the average load over a 15-minute period. Using these 3 measurements, you can get a sense of whether a spike was a short-term occurrence or a prolonged event. If the third number is too high, you’ve got a problem to deal with. But what is ‘too high’?
The decimal represents the number of active tasks requesting CPU resources to perform an action (on Linux, the count also includes tasks in uninterruptible sleep, which are typically waiting on disk I/O). If you think of the number in terms of percentage utilization, 1.0 represents 100% of a single CPU core. Anything over 1.0 represents the number of processes waiting in line to be executed. In this way, the Linux style of measurement is more informative than the Windows percentage style because it doesn’t just tell you a CPU is overloaded, it also tells you by how much and over what time period.
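As a rough sketch of where these three numbers can be read from a script: on Linux they appear as the first three fields of /proc/loadavg, and Python exposes them through os.getloadavg(). The names LoadAverage, parse_loadavg, and current_loadavg are made up for this illustration, and the sample line in the usage note is invented rather than taken from a real server.

```python
import os
from typing import NamedTuple


class LoadAverage(NamedTuple):
    """The three load averages, in the order Linux reports them."""
    one_min: float
    five_min: float
    fifteen_min: float


def parse_loadavg(text: str) -> LoadAverage:
    """Parse the first three fields of a /proc/loadavg-style line."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected at least three load average fields")
    return LoadAverage(float(fields[0]), float(fields[1]), float(fields[2]))


def current_loadavg() -> LoadAverage:
    """Read the live values; os.getloadavg() raises OSError if unavailable."""
    return LoadAverage(*os.getloadavg())
```

For example, parse_loadavg("0.52 0.61 0.58 1/123 4567").fifteen_min gives the 15-minute figure from that invented line, and current_loadavg() returns the same three values for the machine the script runs on.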
An important note is that this number scales alongside CPU cores. If you have 4 CPUs, for example, 4.0 is equal to 100% utilization across all cores. The standard rule of thumb is that 70% utilization, or 0.70 per CPU core, is healthy. Once you're consistently above 70%, you need to start planning for expansion or else optimize your software.
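The per-core scaling can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above: the function names are hypothetical, the 0.70 limit is the figure quoted in this article, and os.cpu_count() reports logical CPUs, which may differ from physical cores on hyperthreaded hosts.

```python
import os

HEALTHY_LIMIT = 0.70  # the article's rule of thumb, expressed per core


def per_core_load(load: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_within_healthy_range(load: float, cores: int,
                            limit: float = HEALTHY_LIMIT) -> bool:
    """True when the per-core load is at or below the limit."""
    return per_core_load(load, cores) <= limit


def check_this_host() -> bool:
    """Apply the check to the 15-minute average of the current machine."""
    fifteen_min = os.getloadavg()[2]
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return is_within_healthy_range(fifteen_min, cores)
```

With 4 cores, a load of 4.0 normalizes to 1.0 (fully utilized), while a load of 2.0 normalizes to 0.5 and passes the check.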
Personally, I like to use htop for resource monitoring on Linux. It gives you a view of all CPU core usage in addition to load averages, memory usage, and more.
In this example, the server has 4 CPU cores. The load average over 15 minutes is 1.15. If you divide that number by the number of cores (4), you get the average single-core load: 0.2875 or 28.75%. That’s pretty low usage, but you want to monitor the number over a period of time to get a variety of readings before jumping to any conclusions about overprovisioning. If I’m keeping my eye out for this server reaching the warning threshold of 70% usage, the number I’m looking for is 0.70 * the number of cores (4): 2.80. If the 15-minute average is at or near 2.8, I know I need to start considering some options soon.
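The arithmetic in this example can be reproduced with a short Python sketch. The function names are hypothetical; the inputs (4 cores, a 15-minute load of 1.15, a 70% target) are the figures from the paragraph above.

```python
def single_core_load(load: float, cores: int) -> float:
    """Average load carried by each core."""
    return load / cores


def warning_threshold(cores: int, target: float = 0.70) -> float:
    """Load average at which the server reaches the target utilization."""
    return target * cores


cores = 4
load_15min = 1.15

utilization = single_core_load(load_15min, cores)
threshold = warning_threshold(cores)

print(f"per-core load: {utilization:.2%}")      # per-core load: 28.75%
print(f"warning threshold: {threshold:.2f}")    # warning threshold: 2.80
print(f"headroom: {threshold - load_15min:.2f}")  # headroom: 1.65
```

The headroom figure is simply the gap between the current 15-minute average and the 2.80 threshold, which is one way to watch how close the server is getting to the warning level.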
On the flip side, if you have a ton of CPU cores allocated to a VM that’s not using them, you’re wasting resources. I recently noticed a server with 8 CPU cores running at around 1.40 load average, or 17.5% utilization. After monitoring it for a couple of weeks, we determined that we could reclaim 4 CPU cores from that VM and still operate under 70%: the same 1.40 load spread across 4 cores is 35% utilization. Gaining those 4 cores allows us to spin up another 4-CPU VM on the same hardware, which is a great gain in resource utilization.
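A minimal sketch of that right-sizing check, assuming the load average stays roughly the same after cores are removed (which is a simplification, and the reason for monitoring over a couple of weeks first). The function names are hypothetical and the 0.70 target is this article's rule of thumb.

```python
import math


def utilization(load: float, cores: int) -> float:
    """Per-core utilization for a given load average and core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def min_cores_for_target(load: float, target: float = 0.70) -> int:
    """Smallest core count that keeps utilization at or below the target."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    return max(1, math.ceil(load / target))


load = 1.40
print(f"8 cores: {utilization(load, 8):.1%}")  # 8 cores: 17.5%
print(f"4 cores: {utilization(load, 4):.1%}")  # 4 cores: 35.0%
```

Dropping to 4 cores leaves the example server at half of the 70% limit. A call such as min_cores_for_target(3.0) returns 5, since 4 cores would put a load of 3.0 at 75%.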
The goal is to utilize your resources effectively. In an ideal world, each server would run at 100% CPU utilization without any increase or decrease. Obviously that’s not going to happen. By monitoring your CPU loads over time, however, you can make the best decisions for your servers and avoid any surprise CPU lockups.