The right disk configurations for servers
Q: I'm setting up a large server, and I'm not sure how to configure the
disks. The new server replaces several smaller ones, so it will do a bit
of everything: NFS, a couple of large databases, home directories,
number crunching, intranet home pages. Can I just make one big
filesystem and throw the lot in, or do I need to set up lots of
different collections of disks?
--Clueless in Crivitz
A: There are several trade-offs, and no single right solution.
The main factors you need to balance are the administrative complexity,
resilience to disk failure, and performance requirements of each
workload. There are two underlying factors to take into account:
whether the storage is accessed as raw disk or through a filesystem,
and whether the access pattern is primarily random or sequential. I've
shown this as a two-by-two table, with the typical workloads for each
combination. To cover all that you mentioned,
I would combine NFS, home directories, and intranet home pages into the
random access filesystem category. Databases have less overhead, and
less interaction with other workloads if they use separate raw disks.
Databases that don't use raw disks should be given their own filesystem
setup, as should number crunching applications that read/write large
files sequentially.

|Primary workload|Random access|Sequential access|
|Raw disk|Database indexes|Database table scans|
|Filesystem|Home directories|Number crunching|
Managing the trade-offs
We need to make a trade-off between performance and complexity. The
best solution is to configure a small number of separate disk spaces,
each optimized for a type of workload. You can get more performance by
increasing the number of separate disk spaces, but with fewer it is
easier to share the spare capacity and to add more disks in the future
without going through a major reorganization.
Another trade-off is between cost, performance, and availability. You
can be fast, cheap, or safe; pick one, balance two, but you can't have all
three. One thing about combining disks into a stripe is that the more
disks you have, the more likely it is that one will fail and take out
the whole stripe.
If each disk has a mean time between failure (MTBF) of 500,000 hours,
this implies that 100 disks have a combined MTBF of only 5,000 hours
(about 30 weeks). If you have 1,000 disks you can expect to have a
failure every three weeks on average. It is also worth noting that the
very latest, fastest disks are much more likely to fail than disks that
have been well debugged over several years, regardless of the MTBF
quoted on the specification sheet.
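The scaling above can be checked with a quick calculation. The only figure added here is the 168 hours-per-week divisor:

```shell
# Combined MTBF of a stripe where any single disk failure takes the
# whole stripe down: per-disk MTBF divided by the number of disks
# (assuming independent failures at a constant rate).
mtbf=500000
for n in 100 1000; do
  awk -v m="$mtbf" -v n="$n" \
    'BEGIN { printf "%4d disks: MTBF %.0f hours (%.1f weeks)\n", n, m/n, m/n/168 }'
done
```

This reproduces the figures in the text: 5,000 hours (about 30 weeks) for 100 disks, and roughly three weeks for 1,000 disks.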
There are two consequences of failure: one is loss of data, and the other
is downtime. In some cases, data can be regenerated (e.g., database indexes,
number crunching output) or restored from backup tapes. If you can afford
the time it takes to restore the data, and it is unlikely to happen
often, there is no need to provide a resilient disk subsystem. This lets
you configure for the highest performance. If data integrity or high
availability is important there are two common approaches, mirroring
and parity (typically RAID-5). Mirroring has the highest performance,
especially for write-intensive workloads, but requires twice as many
disks to implement it. Parity uses one extra disk in each stripe to
hold the redundant information required to reconstruct data after a
failure. Writes require read-modify-write operations, and there is the
extra overhead of calculating the parity.
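To make the disk-count trade-off concrete, here is a sketch of usable capacity under each scheme. The 12-disk count and 9-GB disk size are illustrative assumptions, not figures from the column:

```shell
# Mirroring stores every block twice; RAID-5 gives up one disk per
# stripe to parity. Disk count and size are example values.
disks=12; gb_per_disk=9
echo "mirrored: $(( disks / 2 * gb_per_disk )) GB usable"
echo "RAID-5  : $(( (disks - 1) * gb_per_disk )) GB usable"
```

With these numbers, mirroring yields 54 GB and a single RAID-5 stripe yields 99 GB from the same twelve disks, which is why parity is attractive when capacity matters more than write performance.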
It is sometimes the case that the cost of a high performance, RAID-5
array controller exceeds the cost of the extra disks you would need to
do simple mirroring. To get high performance, these controllers use
non-volatile memory to perform write-behind safely, and to coalesce
adjacent writes into single operations. Implementing RAID-5 without
non-volatile memory will give you very poor write performance. The
other problem with parity-based arrays is that when a disk has failed,
extra work is needed to reconstruct the missing data, and performance is
degraded until the failed disk has been replaced and rebuilt.
The home, NFS, and Web choice
In this particular case, I will assume that the filesystem dedicated to
home directories, NFS, and Web server home pages is mostly read-intensive,
and that there is some kind of array controller available (a
SPARCstorage Array or one of the many third party RAID controllers) that
has non-volatile memory configured and enabled. Note that SPARCstorage
Arrays default to having it disabled; you need to use the ssaadm
command to turn on fast writes. The non-volatile fast writes greatly
speed up NFS response times for writes, and as long as high-throughput
applications do not saturate this filesystem, it is a good candidate
for a RAID-5 configuration. The extra resilience saves you from data
loss and keeps users happy without wasting disk space on mirroring.
The default UFS filesystem parameters should be tuned slightly, as
there is no need to waste 10 percent on free space, and almost as much
on inodes. I would configure 1 or 2 percent free space (default is 10
percent) and an 8 kilobyte average file size per inode (default is 2
kilobytes) unless you are configuring a filesystem that is under one
gigabyte in size:
# newfs -i 8192 -m 1 /dev/raw_big_disk_device
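To see what the -i option saves, this sketch counts the inodes newfs would allocate at the default and tuned densities. The 4-GB filesystem size is an assumed example:

```shell
# Inodes allocated is roughly filesystem bytes divided by the
# bytes-per-inode density given to newfs -i.
fs_bytes=$(( 4 * 1024 * 1024 * 1024 ))
echo "default (-i 2048): $(( fs_bytes / 2048 )) inodes"
echo "tuned   (-i 8192): $(( fs_bytes / 8192 )) inodes"
```

Raising the density to 8 kilobytes per inode allocates a quarter as many inodes, reclaiming the metadata space for file data.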
The big_disk_device itself should be created by combining groups of
disks together into RAID-5 protected arrays, then concatenating the
arrays to make the final filesystem. If you need to extend its size in
the future, make up a new group of disks into a RAID-5 array and extend
the filesystem onto it. It is possible to grow a filesystem on-line if
necessary, so there is no need to rebuild the whole array and restore
it from backup tapes. Each RAID-5 array should contain between 5 and 30
disks. I've used a 25-disk, RAID-5 setup on a SPARCstorage
Array. For highest performance, keep it to the lower end of this range,
and concatenate several smaller arrays together. We have found that a
128-kilobyte interlace is optimal for this largely random access workload.
Another issue to consider is the filesystem check required on
reboot. If the system shut down cleanly, fsck can tell it is safe to
skip the check. If it went down in a power outage or
crash, it could take tens of minutes to more than an hour to check a
really huge filesystem. The solution is to use a logging filesystem,
where a separate disk stores all the changes. On reboot, fsck just
reads the log in a few seconds and it is done.
With Solstice Disk Suite (SDS), this is set up using a "metatrans"
device, and the normal UFS filesystem. In fact, an existing SDS
hosted filesystem can have the metatrans log added without any
disruption to the data. With the Veritas Volume Manager, it is necessary
to use the Veritas filesystem, VxFS, as the UFS logging hooks are only
usable by SDS.
For good performance, the log should live on a dedicated disk. For
resilience, the log should be mirrored. In extreme cases, the log disk
might saturate and require striping over more than one disk. In low
usage cases, the log can be situated in a small partition at the start
of a data disk.
An example result is shown below. To extend the capacity, you would
make up another array of disks and concatenate it. There is nothing to
prevent you making each array a different size either. Unless the log
disk maxes out, a mirrored pair of log disks should not need to be
extended.

2 disks combined as a mirrored filesystem log
|Log 1|Log 2|

30 disks combined as three concatenated RAID-5 arrays of 10 disks
|Array 1|Array 1|Array 1|Array 1|Array 1|Array 1|Array 1|Array 1|Array 1|Array 1|
|Array 2|Array 2|Array 2|Array 2|Array 2|Array 2|Array 2|Array 2|Array 2|Array 2|
|Array 3|Array 3|Array 3|Array 3|Array 3|Array 3|Array 3|Array 3|Array 3|Array 3|
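A quick capacity check on this layout; the 9-GB disk size is an assumed example:

```shell
# Three 10-disk RAID-5 arrays: one disk's worth of each array holds
# parity, so nine of every ten disks store data.
arrays=3; per_array=10; gb_per_disk=9
data_disks=$(( arrays * (per_array - 1) ))
echo "usable: $data_disks of $(( arrays * per_array )) data disks"
echo "capacity: $(( data_disks * gb_per_disk )) GB"
```

Twenty-seven of the thirty disks carry data, a 10 percent overhead for parity versus the 50 percent that mirroring would cost.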
The high performance number crunching choice
High throughput applications such as number crunching that want to do
large, high-speed, sequential reads and writes should use a
completely separate collection of disks on their own controllers. In
most cases, data can be regenerated if there is a disk failure, or
occasional snapshots of the data could be compressed and archived into
the home directory space. The key thing is to off-load frequent,
I/O-intensive activity from the home directories into this
high-performance space.
Configure as many fast-wide (20 megabytes per second) or Ultra-SCSI (40
megabytes per second) disk controllers as you can. Each disk should be
able to stream sequential data at between 3 and 5 megabytes per
second, so don't put too many on each bus. Non-volatile memory in array
controllers may help in some cases, but it may also get in the way. The
cost may also tempt you to use too few controllers. A large number of
SBus SCSI interfaces is a better investment for this particular
workload.
If you need to run at sustained rates of more than 20 to 30
megabytes per second of sequential activity on a single file, you will
run into problems with the default UFS filesystem. The UFS indirect
block structure and data layout strategy work well for general purpose
accesses such as the home directories, but cause too many random seeks
for high-speed sequential performance. The Veritas VxFS filesystem is
an extent-based structure, which avoids the indirect block problem. It
also allows individual files to be designated as "direct" for raw
unbuffered access. This bypasses the problems caused by UFS trying to
cache all files in RAM, which is inappropriate for large sequential
access files and stresses the pager. It is possible to get 100
megabytes per second or more with a carefully set up VxFS
configuration.
A log-based filesystem may slow down high-speed sequential operations
by limiting you to the throughput of the log. It should only log
synchronous updates, such as directory changes and file creation/deletion,
so see how it goes with and without a log for your own workload. If it
doesn't get in the way of the performance you need, use a log to keep
reboot times down.
Database workloads are very different again. Reads may be done in small
random blocks (when looking up indexes), or large sequential blocks
(when doing a full table scan). Writes are normally synchronous for
safe commits of new data. On a mixed workload system, running databases
through the filesystem can cause virtual memory "churning" due to the
high levels of paging and scanning associated with filesystem I/O. This
can affect other applications adversely, so where possible it is best
to use raw disks or direct unbuffered I/O to a filesystem that supports
it (such as VxFS).
Both Oracle and Sybase default to a 2-kilobyte block size. A small
block size keeps the disk service time low for random lookups of
indexes and small amounts of data. When a full table scan occurs, the
database may read multiple blocks in one operation, causing larger I/O
sizes and sequential patterns.
Databases have two characteristics that are greatly assisted by an
array controller that contains non-volatile RAM. One is that a large
proportion of the writes are synchronous, and are on the critical path
for user response times. The service time for a 2-kilobyte write is
often reduced from about 10 to 15 milliseconds to 1 to 2 milliseconds.
The other is that synchronous sequential writes often occur as a stream
of small blocks, typically of only 2 kilobytes at a time. The array
controller can coalesce together multiple adjacent writes into a
smaller number of much larger operations, which can be written to disk
far faster. Throughput can increase by as much as three to four times
on a per-disk basis.
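The coalescing effect can be sketched as an operation count. The 64-kilobyte coalesced transfer size is an assumption for illustration, not a figure from the column:

```shell
# A 1 MB burst of 2 KB synchronous writes, before and after the array
# controller coalesces adjacent writes into larger transfers.
burst_kb=1024; small_kb=2; big_kb=64
echo "uncoalesced: $(( burst_kb / small_kb )) disk operations"
echo "coalesced  : $(( burst_kb / big_kb )) disk operations"
```

Turning 512 small operations into 16 large ones is how the controller recovers sequential-write throughput even though each user-level write remains a small synchronous commit.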
Data integrity is important, but some sections of a database can be
regenerated after a failure. You can trade off performance against
availability by making temporary tablespaces and perhaps indexes out of
wide unprotected stripes of disks. Tables that are largely read only or
not on the critical performance path can be assigned to RAID-5 stripes.
Safe, high performance writes should be handled with mirrored stripes.
The same basic techniques described in previous sections can be used
to configure the arrays. Use concatenations of stripes, either
unprotected, RAID-5, or mirrored as appropriate, with a 128-kilobyte
interlace.
I have scratched the surface of a large and possibly contentious
subject here. I hope this gives you the basis for a solution.
The important thing is to divide the problem into subproblems by
separating the workloads according to their performance
characteristics. Balance your solution on appropriate measures of
performance, cost, and availability.
Resources and Related Links
- Adrian's column, "Manage Performance, or It Will Manage You!"
- "New Release of the SE Performance Toolkit"
- "Solaris 2.5 Performance Update"
- "Confessions of an Ultra 1 User"
- "Advanced Monitoring and Tuning"
- "System Performance Monitoring"
- If you want to build performance tools and utilities, get a copy of the SE Performance Toolkit
- If you like Adrian's column, you'll probably want a copy of his book, Sun Performance and Tuning
- And be sure to take a look at Adrian Cockcroft's profile