How busy is the CPU, really?
Q:
In April's column you said that CPU usage is inaccurate -- but by
how much, and does it matter?
A: The error is minimal at high usage levels, but ranges up to 80 percent or more at low
levels. The problem is that usage is underreported, and the range of error
increases on faster CPUs.
At a real usage level of 5 percent busy, you'll often see vmstat reporting that
the system is only 1 percent busy -- underreporting by 80 percent
of the true value. You could also look at this as a 400 percent
error in the reported value.
As an example of the kind of problem this can cause, consider a system
planned to cope with a load of up to 1000 users. If you measure the
average process activity of the first 20 users, they appear to use only
1 percent of the system (but in fact use 5 percent). There appears to be
sufficient capacity for 2000 users, but really there is only enough
capacity for 400. As the total user load increases and the measurement
error shrinks, the amount of CPU used by each user also appears to increase.
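The arithmetic behind those figures can be checked with a short sketch. The function names are mine, chosen for illustration, and the capacity estimate assumes CPU use scales linearly with the number of users, which is the same simplification the example above makes.

```python
def supportable_users(sample_users, cpu_percent):
    """Users the system could carry if CPU use scaled linearly from the sample.

    Assumes whole-number percentages so the arithmetic stays exact.
    """
    return sample_users * 100 // cpu_percent


def error_percentages(true_percent, reported_percent):
    """Return the error as a percentage of (true value, reported value)."""
    gap = true_percent - reported_percent
    return 100 * gap // true_percent, 100 * gap // reported_percent


print(supportable_users(20, 1))  # apparent capacity: 2000
print(supportable_users(20, 5))  # real capacity: 400
print(error_percentages(5, 1))   # (80, 400)
```

The same 4 percentage point gap is an 80 percent error measured against the true value and a 400 percent error measured against the reported value.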
I built a tool to measure the errors, collected data on a few
systems, and plotted the results. I would like to get more data, so
the tool has been folded into an updated copy of the process
monitoring update bundle. If you like, you can monitor accuracy on your own
systems and send me the results. I'll start with a more
detailed explanation of the problem, then describe the tool I built,
and show you plots of the initial results.
CPU usage measurements
Normally, CPU time is measured by sampling the state of all CPUs
at the clock interrupt, which occurs 100 times per second.
Process scheduling employs the same clock interrupt used to measure CPU usage,
which leads to systematic errors in the sampled data. Microstate accounting,
discussed in April's Performance Q&A, is much more accurate than sampled measurements.
To illustrate how errors occur, I'll excerpt the following example
from April's column:
Consider a performance monitor that wakes up every 10 seconds,
reads some data from the kernel, then prints the results and sleeps. On a fast
system, the total CPU time consumed per wake-up might be a few milliseconds.
On exit from the clock interrupt, the scheduler wakes up processes and kernel
threads that have been sleeping. Processes that sleep consume less than their
allotted CPU time-quanta and always run at the highest timeshare priority.
On a lightly loaded system there is no queue for access to the CPU, so
immediately after the clock interrupt, it's likely that the performance monitor will be
scheduled. If it runs for less than 10 milliseconds it will have completed its task
and be sleeping again by the time the next clock interrupt comes along. Now,
given that CPU time is allocated based on what is running when the clock
interrupt occurs, you can see that the performance monitor could be sneaking a
bite of CPU time whenever the clock interrupt isn't looking.
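To make the effect concrete, here is a minimal sketch of tick-based sampling. It is an idealized model of my own, not the kernel's actual accounting code: a single process starts running a fixed offset after a clock tick, runs for a fixed burst, and is charged a tick only if it is on the CPU at the instant the tick fires. The names simulate, offset_us, and run_us are hypothetical.

```python
TICK_US = 10_000  # one clock interrupt every 10 ms (100 per second)


def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def simulate(wake_every_ticks, run_us, offset_us, total_ticks):
    """Return (true_busy_fraction, sampled_busy_fraction) for one process.

    Hypothetical model: the process starts running offset_us after every
    wake_every_ticks-th clock tick and runs for run_us microseconds.
    Bursts are assumed not to overlap. A tick is charged to the process
    only if it is running at the instant the tick fires.
    """
    total_us = total_ticks * TICK_US
    busy_us = 0
    charged_ticks = 0
    for tick in range(0, total_ticks, wake_every_ticks):
        start = tick * TICK_US + offset_us
        end = min(start + run_us, total_us)
        if start >= end:
            continue
        busy_us += end - start
        # Count tick instants k * TICK_US with start <= k * TICK_US < end.
        charged_ticks += ceil_div(end, TICK_US) - ceil_div(start, TICK_US)
    return busy_us / total_us, charged_ticks / total_ticks


# A process that runs for 500 microseconds shortly after every tick.
true_busy, sampled_busy = simulate(1, 500, 100, 1000)
print(true_busy, sampled_busy)
```

In this model the process is truly 5 percent busy, yet it is never running when a tick fires, so the sampled figure is zero. Real systems are less tidy, which is presumably why measured values such as 1 percent show up rather than nothing at all.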
In the diagram below, a process wakes