From: www.itworld.com

Clearing up swap space confusion

May 7, 2001

 

Q: The swap space numbers reported by swap -s, swap -l, sar, and vmstat don't add up. Why not?


A: When I tried to summarize the way that swap space works for the
second edition of my book, I discovered that I didn't really
understand it myself. After looking back through my April 1996
Performance Q&A column, "How does swap space work?"
and reading Inside Solaris columnist Jim Mauro's two-part look at swap space implementation, I decided that Jim didn't quite have the full picture either. After many hours of deep detective work I figured it all out, and discovered some minor bugs in the tools that had us both confused.


Since then I've seen this question many times. The answer is too
complex to include in an e-mail, so I've decided to base this month's column
on a section of my book. In the future, I'll be able to point readers with
swap-space queries in the direction of this column.


For all practical purposes, the swapping out of entire processes can be
ignored in Solaris 2, which no longer implements the time-based soft
swap-outs that occur in SunOS 4. vmstat -s reports total
numbers of swap-ins and swap-outs, and they're almost always zero.
Prolonged memory shortages can still trigger
swap-outs of inactive processes, however. Swapping out idle processes helps the
performance of machines with less than 32 megabytes (MB) of RAM.
The number of idle swapped-out processes is reported as the
swap queue length by vmstat. This measurement is not explained properly
in the manual page, because it used to be the number of active
swapped-out processes waiting to be swapped back in. As soon as a
swapped-out process wakes up again, it will swap its basic data
structures back into the kernel, and page in its code and data as
they are accessed. This activity requires so little memory that it
can always happen immediately.


Code Example 1


% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 s1 s5   in   sy   cs us sy id
...
 0 0 4 314064  5728   0   7  2  1  1  0  0  0 177 91 22 132  514   94  1  0 98


If you come across a system with a non-zero swap queue reported by
vmstat (the w column, which shows 4 in Code Example 1), take it as a
sign that at some time in the past, free memory went low for long
enough to trigger the swapping out of idle processes. This is the only
useful conclusion you can draw from such a measure.


Swap space operations

Swap space is really a misnomer for paging space, since
almost all accesses are page related. Swap space is actually
used to hold anonymous memory, so called because it's not associated
with a named file in the file system. The kernel uses anon in the name
of many of the swap-related variables.


Swap space is allocated from spare RAM and from a swap disk. The
measures provided are based on two sets of underlying numbers. One
set relates to a physical swap disk, while the other set relates to RAM
used as swap space by pages in memory.


Swap space is used in two stages: When memory is requested (for
example, as a result of a malloc call), swap is reserved, and a
mapping is made against the /dev/zero device. To start with, reservations are made
against available disk-based swap. When disk swap is all
used up, RAM is reserved. When these pages are first accessed,
physical pages are obtained from the free list and filled with
zeros, and pages of swap become allocated rather than reserved. In
effect, reservations are initially taken out of disk-based swap, but
allocations are initially taken out of RAM-based swap. When a page
of anonymous RAM is stolen by the page scanner, the data is written
out to the swap space; that is, the swap allocation is moved from
memory to disk, and the memory is freed.
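
The two stages described above can be sketched as a toy model in Python. This is only an illustration of the bookkeeping, not the kernel's implementation: the class name, field names, and page counts are all hypothetical, and the real accounting is more subtle.

```python
from dataclasses import dataclass


@dataclass
class SwapModel:
    """Toy model of two-stage swap use; every quantity is a page count."""
    disk_total: int          # disk-based swap (hypothetical size)
    ram_total: int           # RAM usable as swap (hypothetical size)
    disk_reserved: int = 0
    ram_reserved: int = 0
    ram_allocated: int = 0
    disk_allocated: int = 0

    def reserve(self, pages: int) -> bool:
        """Stage 1 (for example malloc): reserve disk swap first, then RAM."""
        disk_left = self.disk_total - self.disk_reserved
        ram_left = self.ram_total - self.ram_reserved
        if pages > disk_left + ram_left:
            return False     # reservation failure: the process cannot grow
        from_disk = min(pages, disk_left)
        self.disk_reserved += from_disk
        self.ram_reserved += pages - from_disk
        return True

    def touch(self, pages: int) -> None:
        """Stage 2 (first access): zero-filled pages become allocated in RAM."""
        reserved = self.disk_reserved + self.ram_reserved
        allocated = self.ram_allocated + self.disk_allocated
        if allocated + pages > reserved:
            raise ValueError("cannot allocate more than was reserved")
        self.ram_allocated += pages

    def page_out(self, pages: int) -> None:
        """Page scanner steals pages: the allocation moves from RAM to disk."""
        pages = min(pages, self.ram_allocated)
        self.ram_allocated -= pages
        self.disk_allocated += pages


model = SwapModel(disk_total=100, ram_total=20)
ok_first = model.reserve(90)    # fits entirely in disk-based swap
ok_second = model.reserve(25)   # 10 pages from disk, 15 from RAM
ok_third = model.reserve(10)    # only 5 pages remain, so this fails
model.touch(30)                 # first access allocates pages in RAM
model.page_out(12)              # scanner moves 12 allocated pages to disk
print(ok_first, ok_second, ok_third, model)
```

Note that the 85 pages reserved but never touched still count against swap, which is the behavior the next paragraph describes.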


Memory space that is mapped but never used stays in the reserved
state, and the reservation consumes swap space. This behavior is
common for large database systems and is the reason why large
amounts of swap disk must be configured to run applications like
Oracle and SAP R/3, even though they are unlikely to allocate all
the reserved space. Before Solaris 2.6, the shared memory area used
by databases such as Oracle and Sybase also required a swap
reservation. This memory is normally allocated as intimate shared
memory (ISM) and is locked in RAM. The swap reservation can never be
used, so in Solaris 2.6 a special case was made and ISM no longer
reserves swap space. If you have a 2-gigabyte (GB) shared memory
segment, 2 GB of swap disk will be freed for other purposes
when you upgrade from Solaris 2.5.1 to 2.6.


The first swap partition allocated is also used as the system dump
space for storage of a kernel crash dump. It's a good idea to have
plenty of disk space set aside in /var/crash and to enable
savecore by un-commenting the commands in
/etc/rc2.d/S20sysetup. If you forget and think that there may have
been an unsaved crash dump, you can try running savecore long after
the system has rebooted. The crash dump is stored at the very end of
the swap partition, and the savecore command will tell you if it has been
overwritten yet.


Swap space calculations

Please refer to Jim Mauro's December 1997 and January 1998
Inside Solaris columns in Unix Insider for a
detailed explanation of how swap space works.


Disk space used for swap is listed by the swap -l command. All swap
space segments must be 2 GB or less in size. Any extra
space is ignored. The anoninfo structure in the kernel keeps track
of anonymous memory. In Solaris 2.6, this structure was renamed
to k_anoninfo, but the three values described below are unchanged. This illustrates
why it's best to rely on the more stable kstat interface rather
than the raw kernel data. In this case the data provided is so
confusing that I feel the need to see how the kstat data is derived:

anoninfo.ani_max is the total amount of disk-based swap space.

anoninfo.ani_resv is the amount reserved thus far from both disk and
RAM.

anoninfo.ani_free is the amount of unallocated physical space plus
the amount of reserved unallocated RAM.


If ani_resv is greater than ani_max, we've reserved all the
disk swap and some RAM-based swap. Otherwise, the amount of
disk-resident swap space available to be reserved is calculated
by subtracting ani_resv from ani_max.

swapfs_minfree is set to physmem/8 (with a minimum of 3.5 megabytes)
and acts as a limit on the amount of memory used to hold anonymous
data.



availrmem is the amount of resident, unswappable memory in the
system. It varies and can be read from the system_pages kstat shown
in Code Example 2.


The amount of swap space that can be reserved from memory is calculated
by subtracting swapfs_minfree from availrmem. The total amount available for
reservation is thus


MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)
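
As a worked check of this formula, the short Python sketch below plugs in the page counts that appear in Code Examples 2 and 3. The function names are hypothetical, and the 8 KB page size is an assumption inferred from the figures in Code Example 3 (134664 K allocated for 16833 pages).

```python
PAGE_KB = 8  # assumed page size in KB, inferred from Code Example 3


def swapfs_minfree_pages(physmem_pages, page_kb=PAGE_KB):
    """physmem/8, with a floor of 3.5 MB expressed in pages."""
    floor_pages = int(3.5 * 1024) // page_kb
    return max(physmem_pages // 8, floor_pages)


def reservable_pages(ani_max, ani_resv, availrmem, swapfs_minfree):
    """MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)."""
    disk_part = max(ani_max - ani_resv, 0)   # unreserved disk swap
    ram_part = availrmem - swapfs_minfree    # RAM that may back reservations
    return disk_part + ram_part


minfree = swapfs_minfree_pages(15778)        # physmem from Code Example 2
total = reservable_pages(ani_max=54814, ani_resv=19429,
                         availrmem=13859, swapfs_minfree=minfree)
print(minfree, total)  # 1972 47272
```

Both results agree with the swapfs_minfree and swap_avail values printed in Code Example 3.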


A reservation failure will prevent a process from
starting or growing. Allocations aren't really interesting. The
counters provided by the kernel to commands such as vmstat and sar
are part of the vminfo kstat structure. These counters accumulate
once per second, so average swap usage over a measured interval
can be determined. The swap -s command reads the kernel directly to
obtain a snapshot of the current anoninfo values, so the numbers
will never match exactly. Also, the simple act of running a program
changes the values, so you can't get an exact match. The vminfo
calculations are as follows:


swap_resv += ani_resv
swap_alloc += MAX(ani_resv, ani_max) - ani_free
swap_avail += MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)
swap_free += ani_free + (availrmem - swapfs_minfree)
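
Because each counter is a running total that grows once per second, an average over an interval is the difference between two readings divided by the number of updates in between. The Python sketch below illustrates that arithmetic only; the per-second values are made up for the example, and the function names are hypothetical.

```python
def accumulate(samples):
    """Mimic a counter that adds the current value once per second."""
    total = 0
    totals = []
    for value in samples:
        total += value
        totals.append(total)
    return totals


def interval_average(counter_start, counter_end, updates):
    """Average value per update between two readings of a running total."""
    return (counter_end - counter_start) / updates


# Hypothetical instantaneous swap_avail values (pages), one per second.
per_second_swap_avail = [47272, 47272, 47000, 46800, 47100, 47272]
totals = accumulate(per_second_swap_avail)

# Read the counter at the first and last second: five updates in between.
avg = interval_average(totals[0], totals[-1], len(totals) - 1)
print(avg)
```

A tool that samples this way reports an interval average, which is one reason its figure need not equal the instantaneous snapshot that swap -s takes.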


Code Example 2


system_pages:
physmem 15778 nalloc 7745990 nfree 5600412 nalloc_calls 2962 nfree_calls 2047 
kernelbase 268461504 econtig 279511040 freemem 4608 availrmem 13849 lotsfree 256
    desfree 100 minfree 61 fastscan 7884 slowscan 500 nscan 0 desscan 125 
pp_kernel 1920 pagesfree 4608 pageslocked 1929 pagesio 0 pagestotal 15769 


To figure out how the numbers really do add up, I wrote a short
program in SE and compared it to the example data shown in Code
Example 3. To get the numbers to match, I needed some odd
combinations for sar and swap -s. In summary, the only useful
measure is swap_avail, as printed by swap -s, vmstat, and sar -r
(though sar labels it freeswap, and before Solaris 2.5 sar actually
displayed swap_free rather than swap_avail). The other measures are
mislabelled and confusing. The code for the SE program in Code
Example 4 shows how the data is calculated and suggests a more
useful display that is also simpler to calculate.


Code Example 3

	
# se swap.se
ani_max 54814  ani_resv 19429  ani_free 37981  availrmem 13859  swapfs_minfree 1972
ramres 11887  swap_resv 19429  swap_alloc 16833  swap_avail 47272  swap_free 49868

Misleading data printed by swap -s
134664 K allocated + 20768 K reserved = 155432 K used, 378176 K available
Corrected labels:
134664 K allocated + 20768 K unallocated = 155432 K reserved, 378176 K available

Mislabelled sar -r 1
freeswap (really swap available) 756352 blocks

Useful swap data: Total swap 520 M
available 369 M  reserved 151 M  Total disk 428 M  Total RAM 92 M
# swap -s
total: 134056k bytes allocated + 20800k reserved = 154856k used, 378752k available
# sar -r 1
18:40:51 freemem freeswap
18:40:52    4152   756912


The only thing you need to know about SE to read this code is that reading kvm$name causes the current value of the kernel variable name to be read. The preprocessor variable MINOR_RELEASE is set to 51 for Solaris 2.5.1 and 60 for Solaris 2.6.


Code Example 4


/* extract all the swap data and generate the numbers */
/* must be run as root to read kvm variables */

struct anon {
	int ani_max;
	int ani_free;
	int ani_resv;
	};

int max(int a, int b) {
	if (a > b) {
		return a;
	} else {
		return b;
	}
}

main() {
#if MINOR_RELEASE < 60
	anon kvm$anoninfo;
#else
	anon kvm$k_anoninfo;
#endif
	anon tmpa;
	int  kvm$availrmem;
	int  availrmem;
	int  kvm$swapfs_minfree;
	int  swapfs_minfree;
	int  ramres;
	int  swap_alloc;
	int  swap_avail;
	int  swap_free;
	int  kvm$pagesize;
	int  ptok = kvm$pagesize/1024;
	int  res_but_not_alloc;
#if MINOR_RELEASE < 60
	tmpa = kvm$anoninfo;
#else
	tmpa = kvm$k_anoninfo;
#endif
	availrmem = kvm$availrmem;
	swapfs_minfree = kvm$swapfs_minfree;

	ramres = availrmem - swapfs_minfree;
	swap_alloc = max(tmpa.ani_resv, tmpa.ani_max) - tmpa.ani_free;
	swap_avail =  max(tmpa.ani_max - tmpa.ani_resv, 0) + ramres;
	swap_free = tmpa.ani_free + ramres;
	res_but_not_alloc = tmpa.ani_resv - swap_alloc;

	/* raw kernel values and derived counters, all in pages */
	printf("ani_max %d  ani_resv %d  ani_free %d  availrmem %d  swapfs_minfree %d\n",
		tmpa.ani_max, tmpa.ani_resv, tmpa.ani_free,
		availrmem, swapfs_minfree);
	printf("ramres %d  swap_resv %d  swap_alloc %d  swap_avail %d  swap_free %d\n",
		ramres, tmpa.ani_resv, swap_alloc, swap_avail, swap_free);

	/* same values in kilobytes, first with the labels swap -s uses */
	printf("\nMisleading data printed by swap -s\n");
	printf("%d K allocated + %d K reserved = %d K used, %d K available\n",
		swap_alloc * ptok, res_but_not_alloc * ptok,
		tmpa.ani_resv * ptok, swap_avail * ptok);
	printf("Corrected labels:\n");
	printf("%d K allocated + %d K unallocated = %d K reserved, %d K available\n",
		swap_alloc * ptok, res_but_not_alloc * ptok,
		tmpa.ani_resv * ptok, swap_avail * ptok);

	/* sar reports 512-byte blocks, so double the kilobyte figure */
	printf("\nMislabelled sar -r 1\n");
	printf("freeswap (really swap available) %d blocks\n",
		swap_avail * ptok * 2);

	/* simpler summary in megabytes */
	printf("\nUseful swap data: Total swap %d M\n",
		swap_avail * ptok / 1024 + tmpa.ani_resv * ptok / 1024);
	printf("available %d M  reserved %d M  Total disk %d M  Total RAM %d M\n",
		swap_avail * ptok / 1024, tmpa.ani_resv * ptok / 1024,
		tmpa.ani_max * ptok / 1024, ramres * ptok / 1024);
}
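
For readers without SE installed, the unit conversions that link the three displays can be checked with a few lines of Python. This is a sketch rather than a port of the SE program: the helper names are hypothetical, the 8 KB page size is an assumption inferred from Code Example 3, and the inputs are the page counts printed there.

```python
PAGE_KB = 8  # assumed page size in KB, inferred from Code Example 3


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Kernel counters are in pages; swap -s prints kilobytes."""
    return pages * page_kb


def kb_to_blocks(kb):
    """sar -r prints 512-byte blocks, two per kilobyte."""
    return kb * 2


def kb_to_mb(kb):
    """Whole megabytes, truncated as integer division does."""
    return kb // 1024


swap_avail_pages = 47272   # swap_avail from Code Example 3
ani_resv_pages = 19429     # ani_resv from Code Example 3

avail_kb = pages_to_kb(swap_avail_pages)
resv_kb = pages_to_kb(ani_resv_pages)
freeswap_blocks = kb_to_blocks(avail_kb)
total_mb = kb_to_mb(avail_kb) + kb_to_mb(resv_kb)

print(avail_kb, freeswap_blocks, total_mb)  # 378176 756352 520
```

The same available figure therefore appears as 378176 K in swap -s terms and 756352 blocks in sar terms, which matches the SE output in Code Example 3.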


Wrap up

Over the years, many people have struggled to understand the Solaris
2 swap system. If you've experienced confusion in trying to add up
the numbers from the commands, it's not your fault. It
really is confusing, and the numbers don't add up unless you know
how they were calculated in the first place!


I've started a new alias specifically for people developing code
using the SE toolkit. We now have a few thousand people using SE to
monitor their systems, and relatively few people writing code in SE.
Rich Pettit and I would like to talk about upcoming language
features with a core group of people who have their own code to
contribute or maintain. Send e-mail to the regular
se-feedback@chessie.eng.sun.com alias asking to be added to the developers' alias, and tell us what kind of things you've
developed.


Finally, thanks for buying the book. The first print run sold out,
and the second is now shipping. We made a few corrections to the
second print run. Most of the problems were format- and layout-related, and the index page numbers should now be properly
synchronized throughout. The references chapter of the book is now
online at http://www.sun.com/sun-on-net/performance/book2ref.html.

Resources and Related Links

Other Cockcroft columns at www.sun.com