From: www.itworld.com

Prying into processes and workloads

May 7, 2001

 

Q: How can I tell which processes are causing problems and which ones are stuck in a bottleneck?



A: A significant amount of process data is available that the ps command does not show. In
addition, there are cleverer ways to process and display that data than the ones top and proctool use. A new extension to the SE toolkit implements some of my ideas in this area. Along the way it becomes clear that the CPU usage measurements everyone relies on are somewhat inaccurate.


Process data sources

I described process data sources in my August 1996 Performance Q&A
column, but this time I'll go a step further with the data.
These data structures are described in full in the proc(4)
manual page. They are also available in the SE toolkit, so if you want
to obtain the data and play around with it, you should look at the code
for ps-ax.se and msacct.se.


The interface to /proc involves sending ioctl
commands or opening special pseudo-files and reading them (a new
feature of Solaris 2.6). The data that ps uses is called
PIOCPSINFO. Here's what you get back from
ioctl (you get slightly different data if you read it from
the pseudo-file):


proc(4)                   File Formats                    proc(4)
  PIOCPSINFO
     This returns miscellaneous process information such as that
     reported by ps(1). p is a pointer to a prpsinfo structure
     containing at least the following fields:
     typedef struct prpsinfo {
       char        pr_state;    /* numeric process state (see pr_sname) */
       char        pr_sname;    /* printable character representing pr_state */
       char        pr_zomb;     /* !=0: process terminated but not waited for */
       char        pr_nice;     /* nice for cpu usage */
       u_long      pr_flag;     /* process flags */
       int         pr_wstat;    /* if zombie, the wait() status */
       uid_t       pr_uid;      /* real user id */
       uid_t       pr_euid;     /* effective user id */
       gid_t       pr_gid;      /* real group id */
       gid_t       pr_egid;     /* effective group id */
       pid_t       pr_pid;      /* process id */
       pid_t       pr_ppid;     /* process id of parent */
       pid_t       pr_pgrp;     /* pid of process group leader */
       pid_t       pr_sid;      /* session id */
       caddr_t     pr_addr;     /* physical address of process */
       long        pr_size;     /* size of process image in pages */
       long        pr_rssize;   /* resident set size in pages */
       u_long      pr_bysize;   /* size of process image in bytes */
       u_long      pr_byrssize; /* resident set size in bytes */
       caddr_t     pr_wchan;    /* wait addr for sleeping process */
       short       pr_syscall;  /* system call number (if in syscall) */
       id_t        pr_aslwpid;  /* lwp id of the aslwp; zero if no aslwp */
       timestruc_t pr_start;    /* process start time, sec+nsec since epoch */
       timestruc_t pr_time;     /* usr+sys cpu time for this process */
       timestruc_t pr_ctime;    /* usr+sys cpu time for reaped children */
       long        pr_pri;      /* priority, high value is high priority */
       char        pr_oldpri;   /* pre-SVR4, low value is high priority */
       char        pr_cpu;      /* pre-SVR4, cpu usage for scheduling */
       u_short     pr_pctcpu;   /* % of recent cpu time, one or all lwps */
       u_short     pr_pctmem;   /* % of system memory used by the process */
       dev_t       pr_ttydev;   /* controlling tty device (PRNODEV if none) */
       char        pr_clname[PRCLSZ]; /* scheduling class name */
       char        pr_fname[PRFNSZ]; /* last component of exec()ed pathname */
       char        pr_psargs[PRARGSZ];/* initial characters of arg list */
       int         pr_argc;     /* initial argument count */
       char        **pr_argv;   /* initial argument vector */
       char        **pr_envp;   /* initial environment vector */
     } prpsinfo_t;


You can get the data for each lightweight process of a multithreaded process separately. While
there's a lot of useful-looking information there, there's no sign of the high-resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE:


proc(4)                   File Formats                    proc(4)
  PIOCUSAGE
     When applied to the process file descriptor, PIOCUSAGE
     returns the process usage information; when applied to an
     lwp file descriptor, it returns usage information for the
     specific lwp.   p points to a prusage structure which is
     filled by the operation. The prusage structure contains at
     least the following fields:
     typedef struct prusage {
          id_t           pr_lwpid;    /* lwp id.  0: process or defunct */
          u_long         pr_count;    /* number of contributing lwps */
          timestruc_t    pr_tstamp;   /* current time stamp */
          timestruc_t    pr_create;   /* process/lwp creation time stamp */
          timestruc_t    pr_term;     /* process/lwp termination timestamp */
          timestruc_t    pr_rtime;    /* total lwp real (elapsed) time */
          timestruc_t    pr_utime;    /* user level CPU time */
          timestruc_t    pr_stime;    /* system call CPU time */
          timestruc_t    pr_ttime;    /* other system trap CPU time */
          timestruc_t    pr_tftime;   /* text page fault sleep time */
          timestruc_t    pr_dftime;   /* data page fault sleep time */
          timestruc_t    pr_kftime;   /* kernel page fault sleep time */
          timestruc_t    pr_ltime;    /* user lock wait sleep time */
          timestruc_t    pr_slptime;  /* all other sleep time */
          timestruc_t    pr_wtime;    /* wait-cpu (latency) time */
          timestruc_t    pr_stoptime; /* stopped time */
          u_long         pr_minf;     /* minor page faults */
          u_long         pr_majf;     /* major page faults */
          u_long         pr_nswap;    /* swaps */
          u_long         pr_inblk;    /* input blocks */
          u_long         pr_oublk;    /* output blocks */
          u_long         pr_msnd;     /* messages sent */
          u_long         pr_mrcv;     /* messages received */
          u_long         pr_sigs;     /* signals received */
          u_long         pr_vctx;     /* voluntary context switches */
          u_long         pr_ictx;     /* involuntary context switches */
          u_long         pr_sysc;     /* system calls */
          u_long         pr_ioch;     /* chars read and written */
     } prusage_t;
     PIOCUSAGE can be applied to a zombie   process   (see
     PIOCPSINFO).
     Applying PIOCUSAGE to a process that does not have micro-
     state accounting enabled will enable microstate accounting
     and return an estimate of times spent in the various states
     up to this point.   Further invocations of PIOCUSAGE will
     yield accurate microstate time accounting from this point.
     To disable microstate accounting, use PIOCRESET with the
     PR_MSACCT flag.


You'll find a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se as follows:


Elapsed time         3:20:50.049  Current time Fri Jul 26 12:49:28 1996
User CPU time           2:11.723  System call time        1:54.890
System trap time           0.006  Text pfault sleep          0.000
Data pfault sleep          0.023  Kernel pfault sleep        0.000
User lock sleep            0.000  Other sleep time     3:16:43.022
Wait for CPU time          0.382  Stopped time               0.000


Microstate accounting

Microstate accounting is not turned on by default, as it slows the
system down very slightly. It was on by default up to
Solaris 2.3; from Solaris 2.4 on it is enabled for a process the first time you
read that process's usage data. CPU time is normally measured by sampling,
100 times per second, the state of all the CPUs from the clock
interrupt. Microstate accounting, on the other hand, takes a
high-resolution timestamp on every state change, every system call, every
page fault, and every scheduler change. Microstate accounting
doesn't miss anything, and the results are much more accurate than
those from sampled measurements. The normal measures of CPU user and
system time made by sampling can be off by 20 percent or more
because the sample is biased, not random. Process scheduling employs
the same clock interrupt used to measure CPU usage, and this
approach leads to systematic errors in the sampled data. The
microstate-measured CPU-usage data does not suffer from such
errors.


For example, consider a performance monitor that wakes up every 10
seconds, reads some data from the kernel, then prints the results and
sleeps. On a fast system, the total CPU time consumed per wake-up
might be a few milliseconds. On exit from the clock interrupt,
the scheduler wakes up processes and kernel threads that have been
sleeping. Processes that sleep consume less than their allotted CPU
time-quanta and always run at the highest timeshare priority.


On a lightly loaded system there is no queue for access to the CPU, so
immediately after the clock interrupt, it's likely that the performance monitor
will be scheduled. If it runs for less than 10 milliseconds it will have
completed its task and be sleeping again by the time the next clock interrupt
comes along. Now, given that CPU time is allocated based on what is running
when the clock interrupt occurs, you can see that the performance monitor could
be sneaking a bite of CPU time whenever the clock interrupt isn't looking.
This is an artifact of the dual functions of the clock interrupt -- if two
independent unsynchronized interrupts were used, one for scheduling and one
for performance measurement, the errors would be averaged away over time.
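
The effect can be reproduced with a small simulation. The sketch below is written in Python rather than SE, and everything in it (the function names, the 3-millisecond burst length, the number of bursts) is an illustrative assumption rather than a measurement. It counts how many 100-hertz clock ticks land inside a process's CPU bursts, first when every burst starts on a tick boundary, as happens when the scheduler wakes the process on exit from the clock interrupt, and then when the bursts start at random times.

```python
import random

TICK_US = 10_000  # 100 Hz clock interrupt, in microseconds


def ticks_charged(bursts):
    """Count clock ticks that fall strictly inside any (start, end) CPU burst."""
    charged = 0
    for start, end in bursts:
        first_tick = (start // TICK_US + 1) * TICK_US  # first tick after start
        if first_tick < end:
            charged += (end - 1 - first_tick) // TICK_US + 1
    return charged


def synchronized_bursts(count, wake_every_us, burst_us):
    """Bursts that begin exactly on a tick, right after the clock interrupt."""
    return [(i * wake_every_us, i * wake_every_us + burst_us) for i in range(count)]


def random_bursts(count, wake_every_us, burst_us, seed=1):
    """Bursts that begin at a random offset, as if woken by an unrelated timer."""
    rng = random.Random(seed)
    bursts = []
    for i in range(count):
        start = i * wake_every_us + rng.randrange(TICK_US)
        bursts.append((start, start + burst_us))
    return bursts


def sampled_seconds(bursts):
    """CPU time as the tick sampler would report it."""
    return ticks_charged(bursts) * TICK_US / 1_000_000


def actual_seconds(bursts):
    """CPU time as a timestamp on every state change would report it."""
    return sum(end - start for start, end in bursts) / 1_000_000


if __name__ == "__main__":
    sync = synchronized_bursts(20_000, 1_000_000, 3_000)
    rand = random_bursts(20_000, 1_000_000, 3_000)
    print("synchronized: sampled", sampled_seconds(sync), "actual", actual_seconds(sync))
    print("random:       sampled", sampled_seconds(rand), "actual", actual_seconds(rand))
```

In the synchronized case the sampler never charges a single tick, however long the run, while the timestamp-based total keeps growing. With random start times the sampled figure converges on the true one, which is the averaging described above.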


Another approach to the problem is to sample more frequently by
running the clock interrupt more often. This does not remove the
bias, but it makes it harder to hide small bites of the CPU.
Splitting the clock into two separate interrupts is not worth the
implementation overhead. And, while it's possible to increase the
clock interrupt rate for the sake of more accurate measurements, doing so
creates a higher overhead than using direct microstate measurement. In any case, microstate
measurement is far more useful and accurate, as it measures more
interesting state transitions. When there is a significant amount of
queuing for CPU time, the performance monitor will be delayed by a
random amount of time, so it will be seen by the clock interrupt
some of the time.


As a simple experiment I ran vmstat with a one-second
interval and output redirected to /dev/null so that it would
not be delayed by display or filesystem operations.


The ptime command uses microstate accounting to accurately
measure the CPU time used. I left this running for a long time on a
pair of fairly idle systems. I modified the ps-p.se script
to show CPU time used down to the 100-hertz resolution of the underlying
measurements. After a few minutes the process had accumulated only 15
ticks of CPU time on an 85-megahertz microSPARC, and only one tick of CPU time on a dual 300-megahertz UltraSPARC. After an hour the number of ticks had not increased at all! (The error increases on a quieter system with a faster CPU, as it is easier to sneak a bite of the CPU time without the clock noticing.)


Using microstate accounting to measure the same processes, however, it
turned out that about 4.8 seconds of CPU time had been used on the
85-megahertz microSPARC and 1.2 seconds on the 300-megahertz
UltraSPARC. This is an extreme case, but the fact remains that the
actual CPU usage is far more than is being reported by the normal
mechanism. Because the tick count has stopped increasing while the real
CPU usage keeps growing, the error is unbounded. The longer I let this run, the larger it gets:


micro85% /usr/proc/bin/ptime vmstat 1 >/dev/null

^C
real  1:03:26.115
user        2.913
sys         1.891

ultra300% /usr/proc/bin/ptime vmstat 1 >/dev/null

^C
real  1:02:01.626
user        0.621
sys         0.555


Just before stopping vmstat, I ran my modified ps-p.se on it on the two systems:


micro85% se ps-p.se 6513
   PID TT       S  TIME    COMMAND
  6513 pts/1    S  0:00.16 vmstat 1

ultra300% se ps-p.se 21560
   PID TT       S  TIME    COMMAND
 21560 pts/3    S  0:00.01 vmstat 1   
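
To put numbers on the gap, the figures from the two transcripts above can be compared directly. This is only a worked calculation in Python; the dictionary layout and the function name are my own choices for illustration, and the values are copied from the ptime and ps-p.se output shown.

```python
# user and sys seconds from ptime, and the TIME column from the modified ps-p.se
runs = {
    "micro85": {"user": 2.913, "sys": 1.891, "sampled": 0.16},
    "ultra300": {"user": 0.621, "sys": 0.555, "sampled": 0.01},
}


def underreport_factor(run):
    """How many times larger the microstate CPU time is than the sampled one."""
    return (run["user"] + run["sys"]) / run["sampled"]


if __name__ == "__main__":
    for name, run in runs.items():
        measured = run["user"] + run["sys"]
        print(f"{name}: microstate {measured:.3f}s, "
              f"sampled {run['sampled']:.2f}s, "
              f"factor {underreport_factor(run):.1f}")
```

On these figures the sampled measurement under-reports by a factor of about 30 on the microSPARC and by well over 100 on the UltraSPARC.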


What is needed is a way to monitor process CPU usage more accurately
and using more convenient commands than ptime and
msacct.se. I decided to extend the SE toolkit to include
a process class, process_class.se, that could be reused by
several commands and would provide the information that I really want
about each process on a system. I've tried to get performance-tool
vendors interested in microstate data without any success. Hopefully,
offering an example of how to get and use this information will
generate increased user demand for this kind of tool.


The process class

The basic requirement for the process class was that it should collect
both the psinfo and usage data for every
process on the system. For consistency, all data should be collected at
once, and as quickly as possible, then offered for display one process
at a time. This avoids the problem inherent in the ps
command, where the data for the last process displayed is measured
after all the other processes have been measured and displayed, so the
data is not associated with a consistent timestamp.


The psinfo data contains a measure of recent average CPU
usage, but I really want all the data measured over the time interval
since the last reading. This gets complex as new processes arrive and
old ones die. Matching up all the data is not as trivial as measuring
the performance deltas for the CPUs or disks in the system. There also
can be up to 32000 processes to keep track of.


The resulting code is quite complex, but it does the job, and all the
complexity is hidden in the class code in
/opt/RICHPse/include/process_class.se. Note that this is not
part of SE3.0, as I wrote it after that release. It is provided as a
tar file that can be loaded over the top of an SE3.0 installation, and
an improved version will be included in the next release of SE.


Much of the data is left as a difference over the interval. To
calculate rates, it can be divided by the interval that is provided as
part of the data and that is accurately measured as the difference in
time for each process separately. If the collection process is delayed
on a busy system by other processing, the measurements are still
accurate. I'll discuss the data in detail next.
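
The bookkeeping described above can be sketched in a few lines. This is a Python illustration of the idea, not the SE code in process_class.se: the snapshot layout, the field names and the function name are all assumptions made for the example. Each process is differenced against its own previous timestamp, new processes are averaged over their lifetime so far, and processes that have disappeared are returned separately.

```python
def diff_snapshots(previous, current):
    """Return (per-process report, pids that died) for two snapshots.

    Each snapshot maps pid -> {"timestamp", "creation", "user", "system"},
    with all values in seconds.
    """
    report = {}
    for pid, now in current.items():
        before = previous.get(pid)
        if before is None:
            # New process: average over its whole lifetime so far.
            interval = now["timestamp"] - now["creation"]
            user, system = now["user"], now["system"]
        else:
            # Known process: use this process's own measured interval.
            interval = now["timestamp"] - before["timestamp"]
            user = now["user"] - before["user"]
            system = now["system"] - before["system"]
        if interval <= 0:
            continue  # nothing measurable yet
        report[pid] = {
            "interval": interval,
            "usr_pct": 100.0 * user / interval,
            "sys_pct": 100.0 * system / interval,
        }
    dead = sorted(set(previous) - set(current))
    return report, dead


if __name__ == "__main__":
    prev = {
        10: {"timestamp": 100.0, "creation": 50.0, "user": 1.00, "system": 0.5},
        11: {"timestamp": 100.1, "creation": 60.0, "user": 0.10, "system": 0.1},
    }
    curr = {
        10: {"timestamp": 110.5, "creation": 50.0, "user": 1.21, "system": 0.5},
        12: {"timestamp": 110.6, "creation": 108.6, "user": 0.50, "system": 0.1},
    }
    print(diff_snapshots(prev, curr))
```

Because the interval is taken per process, a collector that is delayed halfway through a pass still divides each delta by the time that actually elapsed for that process.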

	
			Control Entries
/* codes for action$ */
#define PROC_ACTION_INIT       0    /* starting point -> next index */
#define PROC_ACTION_PID        1    /* get the specified pid */
#define PROC_ACTION_NEXT_INDEX 2    /* index order is based on /proc  */
#define PROC_ACTION_NEXT_PID   3    /* search for pid and return data */

class proc_class_t {
	/* input controls */
	int	index$;	             /* always contains current index or -1 */
	int	pid$;	             /* always contains current pid */
	int	action$;


It is a convention in SE that control variables for a class have a
$ sign attached to them. When process data is read it is
returned in the order that /proc provides entries, not in
order of process ID. The index$ entry counts through the
data in this order. When all processes have been read, it returns
-1. This is a sign to the calling program that it should
sleep a while before reading any more data. On the next read, all the
process data is captured, then data for the first process is returned.
By default, subsequent reads return data in index order. The
pid$ entry is always updated to contain the process ID.
The action$ entry controls the automatic behavior of the
class. It starts off by initializing the class and changes itself to
PROC_ACTION_NEXT_INDEX. If you would rather get data for a
particular pid, you can set action$ to
PROC_ACTION_PID and set pid$ to specify which
one. If you want data to be returned in order of increasing pid, you set action$ to PROC_ACTION_NEXT_PID. This mode is less efficient in this implementation.
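
The read protocol is easier to see in a toy model. The class below is a hypothetical Python analogue of the index$ and pid$ controls only, not a translation of process_class.se; the class name, the method name and the list_pids callback are invented for the example. It captures the whole process list at the start of a pass, hands entries back in that order, and sets the index to -1 when the pass is complete.

```python
class ProcCursor:
    """Hypothetical Python analogue of the index$/pid$ controls."""

    def __init__(self, list_pids):
        self.list_pids = list_pids  # callable returning pids in /proc order
        self.index = -1             # -1 means "pass finished, sleep now"
        self.pid = -1
        self._pids = []

    def next(self):
        """Advance to the next process; return False at the end of a pass."""
        if self.index == -1:
            # Start of a pass: capture everything at once for consistency.
            self._pids = list(self.list_pids())
        self.index += 1
        if self.index >= len(self._pids):
            self.index = -1
            return False
        self.pid = self._pids[self.index]
        return True


if __name__ == "__main__":
    cursor = ProcCursor(lambda: [0, 3, 1, 322, 299])
    while cursor.next():
        print(cursor.index, cursor.pid)
    # cursor.index is -1 here, so the caller would sleep before the next pass
```

Returning entries in increasing pid order would need a sort or a search on top of this, which fits the remark that the pid-ordered mode is the less efficient one.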


			Summary Data
	/* summary totals */
	double  lasttime;       /* timestamp for the end of the last update */
	int	nproc;          /* current number of processes */
	int	newproc;        /* number of new processes this time */
	int	deadproc;       /* number of processes that died */


The timestamp indicates that all process data was collected before that
time. The current number of processes and the number of new ones are
easy to understand; the handling of dead processes is a bit odd. Rather
than being ignored, dead processes are provided after all current
processes are reported. This allows a last chance to see what the
process did before it died, as the data for that process is erased once
it is reported for the last time. It would be nice to report the
process accounting record, but that does not include the pid. Also
processes may have come and gone completely between samples. These will
show up as child CPU activity below.


			Per-Process Data
	/* output data for specified process */
	double interval;        /* measured time interval for averages */
	double timestamp;       /* last time process was measured */
	double creation;        /* process start time */
	double termination;     /* process termination time stamp */
	double elapsed;         /* elapsed time for all lwps in process */
	double total_user;      /* current totals in seconds */
	double total_system;
	double total_child;     /* child processes that have exited */
	double user_time;       /* user time in this interval */
	double system_time;     /* system call time in this interval */
	double trap_time;       /* system trap time in interval */
	double child_time;      /* child CPU in this interval */
	double text_pf_time;    /* text page fault wait in interval */
	double data_pf_time;    /* data page fault wait in interval */
	double kernel_pf_time;  /* kernel page fault wait in interval */
	double user_lock_time;  /* user lock wait in interval */
	double sleep_time;      /* all other sleep time */
	double cpu_wait_time;   /* time on runqueue waiting for CPU */
	double stoptime;        /* time stopped from ^Z */
	ulong syscalls;         /* syscall/interval for this process */
	ulong inblocks;         /* input blocks/interval - metadata only - not interesting */
	ulong outblocks;        /* output blocks/interval - metadata only - not interesting */
	ulong vmem_size;        /* size in KB */
	ulong rmem_size;        /* RSS in KB */
#ifdef XMAP                     /* XMAP not yet implemented */
	ulong pmem_size;        /* private mem in KB */
	ulong smem_size;        /* shared mem in KB */
#endif
	ulong maj_faults;       /* majf/interval */
	ulong min_faults;       /* minf/interval - always zero - bug? */
	ulong total_swaps;      /* swapout count */
	long  priority;         /* current sched priority */
	long  niceness;         /* current nice value */
	char  sched_class[PRCLSZ];	/* name of class */
	ulong messages;         /* msgin+msgout/interval */
	ulong signals;          /* signals/interval */
	ulong vcontexts;        /* voluntary context switches/interval */
	ulong icontexts;        /* involuntary context switches/interval */
	ulong charios;          /* characters in and out/interval */
	ulong lwp_count;        /* number of lwps for the process */
	int   uid;              /* current uid */
	long  ppid;             /* parent pid */
	char  fname[PRFNSZ];    /* last component of exec'd pathname */
	char  args[PRARGSZ];    /* initial part of command name and arg list */
	proc$() {
		/* lots of complex code and data hides in here */
	}
};


A future version of this class will also include the extended memory
information described in last month's column and shown above as #ifdef
XMAP. Most of the above data is self-explanatory. All times are
in seconds in double precision with microsecond accuracy. The minor
fault counter seems to be broken as it always reports zero. The
inblock and outblock counters are
uninteresting as they only refer to filesystem metadata for the
old-style buffer cache. The charios counter includes all
read and write data for all file descriptors so you can see the file
I/O rate. The lwp_count is not the current number of lwps;
it is a count of how many lwps the process has ever had. If the number
is more than one the process is multithreaded. It's possible to access
each lwp in turn and read its psinfo and
usage data. The process data is the sum of these.


Child data is accumulated when a child process exits. The CPU used by
the child is added into the data for the parent. This can be used to
find processes that are forking lots of little short-lived commands.


Data access permissions

To get at process data you must have access permissions for entries in
/proc or run as a setuid root command. In Solaris 2.5.1, using
the ioctl access method for /proc, you can only
access processes that you own, unless you log in as root. In Solaris
2.6, although you cannot access the /proc/pid entry for every
process, you can read /proc/pid/psinfo and
/proc/pid/usage for every process. This means that the full
functionality of ps and the process class can be employed
by any user. The code for process_class.se conditionally
uses the new Solaris 2.6 access method and the slightly changed
definition of the psinfo data structure.


% ls -l /proc/3209
total 2217
-rw-------   1 adrianc  9506     1118208 Mar  5 22:39 as
-r--------   1 adrianc  9506         152 Mar  5 22:39 auxv
-r--------   1 adrianc  9506          36 Mar  5 22:39 cred
--w-------   1 adrianc  9506           0 Mar  5 22:39 ctl
lr-x------   1 adrianc  9506           0 Mar  5 22:39 cwd -> /
dr-x------   2 adrianc  9506         416 Mar  5 22:39 fd/
-r--r--r--   1 adrianc  9506         120 Mar  5 22:39 lpsinfo
-r--------   1 adrianc  9506         912 Mar  5 22:39 lstatus
-r--r--r--   1 adrianc  9506         536 Mar  5 22:39 lusage
dr-xr-xr-x   3 adrianc  9506          48 Mar  5 22:39 lwp/
-r--------   1 adrianc  9506        1440 Mar  5 22:39 map
dr-x------   2 adrianc  9506         288 Mar  5 22:39 object/
-r--------   1 adrianc  9506        1808 Mar  5 22:39 pagedata
-r--r--r--   1 adrianc  9506         336 Mar  5 22:39 psinfo
-r--------   1 adrianc  9506        1440 Mar  5 22:39 rmap
lr-x------   1 adrianc  9506           0 Mar  5 22:39 root -> /
-r--------   1 adrianc  9506        1440 Mar  5 22:39 sigact
-r--------   1 adrianc  9506        1232 Mar  5 22:39 status
-r--r--r--   1 adrianc  9506         256 Mar  5 22:39 usage
-r--------   1 adrianc  9506           0 Mar  5 22:39 watch
-r--------   1 adrianc  9506        2280 Mar  5 22:39 xmap


The pea.se script


Usage: se [-DWIDE] pea.se [interval]


The pea.se script is an extended process monitor that acts
as a test program for process_class.se and displays very
useful information that is not extracted by any standard tool. It is
based on the microstate accounting information described above. The
script runs continuously and reports on the average data for each
active process in the measured interval. This reporting is very
different from that of tools such as top or ps, which
print the current data only. There are two display modes: By default
pea.se fits into an 80-column format, but the wide mode
has much more information. The initial data display includes all
processes and shows their average data since the process was created.
Any new processes that appear are also treated this way. When a process
is measured a second time its averages for the measured interval are
displayed if it has consumed any CPU time. Idle processes are ignored.
The output is generated every 10 seconds by default. The script can report only on processes that it has permission to access, so it must be run as root to see everything on Solaris 2.5.1. As described above, on Solaris 2.6 it sees everything without needing root permissions.


% se pea.se
09:34:06 name  lwp   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf
olwm             1   322   299  9506   0.01   0.01  0.03  0.00   2328  1032  0.0
maker5X.exe      1 21508     1  9506   0.55   0.33  0.04  0.00  29696 19000  0.0
perfmeter        1   348     1  9506   0.04   0.02  0.00  0.00   3776  1040  0.0
cmdtool          1   351     1  9506   0.01   0.00  0.03  0.00   3616   960  0.0
cmdtool          1 22815   322  9506   0.08   0.03  2.28  0.00   3616  1552  2.2
xterm            1 22011  9180  9506   0.04   0.03  0.30  0.00   2840  1000  0.0
se.sparc.5.5.1   1 23089 22818  9506   1.92   0.07  0.00  0.00   1744  1608  0.0
fa.htmllite      1 21559     1  9506   0.00   0.00  0.00  0.00   1832    88  0.0
fa.tooltalk      1 21574     1  9506   0.00   0.00  0.00  0.00   2904  1208  0.0
nproc 31  newproc 0  deadproc 0

% se -DWIDE pea.se
09:34:51 name  lwp   pid  ppid   uid   usr%   sys% wait% chld%   size   rss   pf  inblk outblk chario   sysc   vctx   ictx   msps
maker5X.exe      1 21508     1  9506   0.86   0.36  0.10  0.00  29696 19088  0.0   0.00   0.00   5811    380  60.03   0.30   0.20
perfmeter        1   348     1  9506   0.03   0.02  0.00  0.00   3776  1040  0.0   0.00   0.00    263     12   1.39   0.20   0.29
cmdtool          1 22815   322  9506   0.04   0.00  0.04  0.00   3624  1928  0.0   0.00   0.00    229      2   0.20   0.30   0.96
se.sparc.5.5.1   1  3792   341  9506   0.12   0.01  0.00  0.00   9832  3376  0.0   0.00   0.00      2      9   0.20   0.10   4.55
se.sparc.5.5.1   1 23097 22818  9506   0.75   0.06  0.00  0.00   1752  1616  0.0   0.00   0.00    119     19   0.10   0.30  20.45
fa.htmllite      1 21559     1  9506   0.00   0.00  0.00  0.00   1832    88  0.0   0.00   0.00      0      0   0.10   0.00   0.06
nproc 31  newproc 0  deadproc 0
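
The msps column in the wide output can be cross-checked against the other columns. The sketch below assumes, based only on the figures shown, that msps is the CPU time used per context switch in milliseconds, that usr% and sys% are percentages of one CPU, and that vctx and ictx are switches per second. The function name is hypothetical and this is not code from pea.se.

```python
def ms_per_slice(usr_pct, sys_pct, vctx_per_s, ictx_per_s):
    """CPU milliseconds consumed per context switch; 0.0 if there were none."""
    switches = vctx_per_s + ictx_per_s
    if switches == 0:
        return 0.0
    cpu_ms_per_s = (usr_pct + sys_pct) / 100.0 * 1000.0
    return cpu_ms_per_s / switches


if __name__ == "__main__":
    # Rows taken from the wide pea.se output above
    print(ms_per_slice(0.86, 0.36, 60.03, 0.30))  # maker5X.exe, reported 0.20
    print(ms_per_slice(0.75, 0.06, 0.10, 0.30))   # se.sparc.5.5.1, reported 20.45
```

The results agree with the reported column to within the rounding of the printed inputs. A process such as maker5X.exe that gives up the CPU about 60 times a second uses very little CPU per slice, while the SE interpreter runs for about 20 milliseconds at a stretch.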


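The pea.se script is 90 lines of code, a few simple printfs in a loop. The real work is done in process_class.se (over 500 lines of code) and can be used by any other script. The default data shown by pea.se consists of the current time and the process name; the number of lwps for the process, so you can see which processes are multithreaded; the process ID, parent process ID, and user ID; the user and system CPU percentages, measured accurately by microstate accounting; the percentage of time spent waiting in the run queue or for page faults to complete (wait%); the CPU percentage accumulated from child processes that have exited (chld%); the virtual address space size and resident set size in kilobytes; and the page fault rate for the process over the interval (pf).


When the command is run in wide mode, the following data is added: input and output blocks (inblk, outblk); characters transferred by read and write calls (chario); system calls (sysc); voluntary context switches, where the process slept for a reason (vctx); involuntary context switches, where the process was interrupted by higher-priority work or exceeded its time slice (ictx); and milliseconds per slice (msps), the calculated average amount of CPU time consumed between context switches.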


Process class implementation overhead

It's quite hard to handle large amounts of dynamic data in SE. In the
end I used a very crude approach based on an array of pointers indexed
by process ID (i.e. 128 kilobytes of memory) with malloced data
structures to hold the information. A problem with this is that after
collecting all the data, the class does a sweep through the array
looking for dead processes. This adds some CPU load, but it's not that
bad and doesn't increase as you add more processes. On my 85-megahertz
microSPARC with Solaris 2.6 pea.se uses 15 percent of the
CPU at a 10-second interval (i.e. 1.5 seconds per invocation). On the
300-megahertz UltraSPARC with Solaris 2.5.1 pea.se uses
three percent (i.e. 0.3 seconds per invocation). In both cases about 80
processes were being monitored. Since then, Richard Pettit and I have
decided that SE needs better ways to handle dynamic data and pointer
handling, so Richard is working on extensions to the language. I'm
going to rewrite process_class.se to be far smaller and more efficient. The code will be more like standard C as well.
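
The shape of that crude approach is shown in the sketch below. It is a Python illustration under stated assumptions, not the SE implementation: the class and method names are invented, and the table size of 32768 slots is my guess at how 128 kilobytes arises (32768 four-byte pointers). It shows why the sweep for dead processes costs the same whether 8 or 800 processes are running: every slot is visited on every pass.

```python
MAX_PID = 32768          # assumed table size
POINTER_BYTES = 4        # assumed 32-bit pointers


class PidTable:
    """Array indexed by pid; each used slot holds one process's record."""

    def __init__(self):
        self.slots = [None] * MAX_PID

    def update(self, samples):
        """Store fresh data for every pid seen in this pass."""
        for pid, data in samples.items():
            self.slots[pid] = {"data": data, "seen": True}

    def sweep(self):
        """Visit every slot; return pids not refreshed since the last sweep."""
        dead = []
        for pid, slot in enumerate(self.slots):
            if slot is None:
                continue
            if slot["seen"]:
                slot["seen"] = False  # must be refreshed before the next sweep
            else:
                dead.append(pid)
                self.slots[pid] = None
        return dead


if __name__ == "__main__":
    print(MAX_PID * POINTER_BYTES // 1024, "KB of pointers")
    table = PidTable()
    table.update({1: "init", 322: "olwm"})
    print(table.sweep())   # nothing has died yet
    table.update({1: "init"})
    print(table.sweep())   # pid 322 was not refreshed
```

A dictionary keyed by pid would avoid the full sweep, which is the kind of dynamic data handling the planned language extensions are meant to make possible.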


A read of the usage data itself turns on microstate accounting for that
process. This increases the overhead for each system call. To measure
the overhead, I put a C-shell into a while loop and watched the
systemwide system call rate. I then ran the shell with microstate
accounting enabled for that process. The call rate reduced from 110,000
system calls per second to 98,000 system calls per second. Both these
rates are far higher than normal, and are measured using a single
300-megahertz UltraSPARC with Solaris 2.5.1. That puts the worst-case
overhead at about 10 percent for system call intensive processes.
Another way of looking at it is that it adds about one microsecond to
each system call. In normal use I doubt that the overhead is
measurable.
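
Both figures follow from the two measured rates. The short Python calculation below is only a worked version of that arithmetic; the function names are mine.

```python
def per_call_overhead_us(rate_without, rate_with):
    """Extra microseconds per system call implied by two calls-per-second rates."""
    return (1.0 / rate_with - 1.0 / rate_without) * 1_000_000


def slowdown_pct(rate_without, rate_with):
    """Percentage drop in the system call rate."""
    return 100.0 * (rate_without - rate_with) / rate_without


if __name__ == "__main__":
    print(f"{per_call_overhead_us(110_000, 98_000):.2f} microseconds per call")
    print(f"{slowdown_pct(110_000, 98_000):.1f} percent fewer calls per second")
```

Each call takes about 9.1 microseconds without microstate accounting and about 10.2 microseconds with it, so the difference is a little over one microsecond and the rate drops by roughly 11 percent.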


Workload-based summarization

When you have a lot of processes, you want to group them together to
make it more manageable. If you group by user name and command
you can form workloads, which are a very powerful way to view the
system. I have also built a workload class that sits on top of the
process class. It pattern matches on user name, command, and arguments.
It can work on a first-fit basis, where each process is included only
in the first workload that matches. It can also work on a summary
basis, where each process is included in every workload that matches.
The code is quite simple, 160 lines or so, and by default it allows up
to 10 workloads to be specified. SE includes a neat regular expression
pattern match comparison operator "string =~ expression",
but this could be translated to C using the regexp library
routines. The workload_class.se file is provided in the tar
bundle along with the process_class.se file.


Test program for workload class -- pw.se

The challenge is how to specify workloads. It would be nice to have a
GUI, but to get me started I resorted to my old favorite of using
environment variables. The first variable is PW_COUNT, the
number of workloads. This is then followed by PW_CMD_n,
PW_ARGS_n, and PW_USER_n, where
n is from 0 to PW_COUNT - 1. If no pattern is
provided, it automatically matches anything. Running pw.se
with nothing specified gives you all processes accumulated into a
single catch-all workload. The size value is accumulated
as it is related to the total swap space usage for the workload. The
rss value is not, as too much memory is shared for the
result to have any meaning.
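
The matching rules can be sketched as follows. This is a Python illustration, not workload_class.se: the function names and the process record layout are invented, the default of 10 workloads is taken from the output shown in this column, and I am assuming that the SE =~ operator behaves like an unanchored regular expression search. A missing pattern becomes the empty string, which matches anything.

```python
import re


def load_workloads(env):
    """Build the workload list from PW_COUNT, PW_CMD_n, PW_ARGS_n, PW_USER_n."""
    count = int(env.get("PW_COUNT", "10"))  # assumed default
    return [
        {
            "cmd": env.get(f"PW_CMD_{n}", ""),
            "args": env.get(f"PW_ARGS_{n}", ""),
            "user": env.get(f"PW_USER_{n}", ""),
        }
        for n in range(count)
    ]


def matches(workload, proc):
    """True if all three patterns are found in the process's fields."""
    return all(
        re.search(workload[key], proc[key]) is not None
        for key in ("cmd", "args", "user")
    )


def classify(proc, workloads, first_fit=True):
    """Return the workload numbers this process is counted in."""
    hits = [n for n, w in enumerate(workloads) if matches(w, proc)]
    return hits[:1] if first_fit else hits


if __name__ == "__main__":
    env = {"PW_COUNT": "3", "PW_CMD_0": "dtmail", "PW_CMD_1": "dt"}
    proc = {"cmd": "dtmail", "args": "dtmail", "user": "adrianc"}
    workloads = load_workloads(env)
    print(classify(proc, workloads, first_fit=True))
    print(classify(proc, workloads, first_fit=False))
```

On a first-fit basis the order of the patterns matters: a specific pattern such as dtmail has to come before a general one such as dt, or the general one will claim the process first.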


12:46:54 nproc 31  newproc 0  deadproc 0
wk  command    args    user procs   usr%   sys% wait% chld%   size   pf
 0                             31    2.2    0.7   0.2   0.0 112176    0
 1                              0    0.0    0.0   0.0   0.0      0    0
 2                              0    0.0    0.0   0.0   0.0      0    0
 3                              0    0.0    0.0   0.0   0.0      0    0
 4                              0    0.0    0.0   0.0   0.0      0    0
 5                              0    0.0    0.0   0.0   0.0      0    0
 6                              0    0.0    0.0   0.0   0.0      0    0
 7                              0    0.0    0.0   0.0   0.0      0    0
 8                              0    0.0    0.0   0.0   0.0      0    0
 9                              0    0.0    0.0   0.0   0.0      0    0


To make life easier, I built a small script that sets up a workload suitable for monitoring a desktop workstation that is also running a Netscape Web server:


% more pw.sh
#!/bin/csh

setenv PW_CMD_0 ns-httpd
setenv PW_CMD_1 'se.sparc'
setenv PW_CMD_2 'dtmail'
setenv PW_CMD_3 'dt'
setenv PW_CMD_4 'roam'
setenv PW_CMD_5 'netscape'
setenv PW_CMD_6 'X'
setenv PW_USER_7 'adrianc'
setenv PW_USER_8 'root'
setenv PW_COUNT 10
exec /opt/RICHPse/bin/se -DWIDE pw.se 60


This runs with a one-minute update rate and uses the wide mode by default. A
workload with a high wait% is being starved of either memory (waiting for page faults) or CPU power. A high number of page faults for a workload indicates that it is starting many new processes, doing a lot of filesystem I/O, or short of memory.


12:53:06 nproc 85  newproc 2  deadproc 0
wk  command    args    user count   usr%   sys% wait% chld%   size   pf  inblk outblk chario   sysc   vctx   ictx   msps
 0 ns-httpd                     2    0.0    0.0   0.0   0.0  17736    0      0      0      6      1      0      0   0.00
 1 se.sparc                     1    0.6    0.0   0.0   0.0   2120    0      0      0     44     10      1      0   6.42
 2   dtmail                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 3       dt                     6    0.0    0.0   0.0   0.0  20656    0      0      0     95      3      0      0   0.00
 4     roam                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 5 netscape                     0    0.0    0.0   0.0   0.0      0    0      0      0      0      0      0      0   0.00
 6        X                     2    0.4    0.3   0.0   0.0 151032    0      0      0   2071    166     14      0   0.49
 7                  adrianc    27    0.1    0.0   0.0   0.0  83840    0      0      0    652     59      3      0   0.42
 8                     root    41    0.6    0.1   0.1   0.5  70640    0      0      0   3583     66      4      0   1.85
 9                              4    1.2    0.0   0.3   0.0   4216    0      0      0    138   3016      1      0  11.94


Wrap up

After whining about the lack of use that microstate accounting data was
getting for several years, I finally spent just a few days writing this
code. It's not yet as efficient as I'd like, and it's probably a bit
buggy, but it seems to open up another very useful window on what is
going on inside a system. You can download a tar file from the regular
SE3.0 download page that contains workload_class.se and
process_class.se, pea.se and pw.se,
a new version of the proc.se header file and the pw.sh script.
When you untar it as root, it automatically puts the SE files in the
/opt/RICHPse directory, and it puts pw.sh in your
current directory.


New book update

You should be able to get my new book in the shops this month.
The title is Sun Performance and Tuning -- Java and the Internet, by Adrian
Cockcroft and Richard Pettit, Sun Press/PTR Prentice Hall, ISBN
0-13-095249-4. At the time of writing the book, I had written the
process class, so pea.se is described, but I had not
written the workload class, so pw.se is not in the book.
