From: www.itworld.com
October 31, 2001 —
Most of us think of the network as a single
entity that can be managed, coached, and coaxed as needed. That's
certainly the goal, and even the theme of Sun's latest advertising
campaign. In reality, any network is a loosely managed team of
components, spanning wiring closets, configuration files, and
applications that use or abuse resources. It's rare that a single point
of failure can be identified easily, or that the same element fails
repeatedly. Failures in one part of the infrastructure affect
applications or services several logical layers away, masking the true
source of the problem. Physical plant problems ripple upstream,
disrupting name or file services. When one player fails, the whole team
goes south.
As businesses become more dependent on the correct operation of
networks for groupware, Internet, intranet, and e-mail capabilities,
network reliability will become another buzzword adored by analysts and
touted by vendors. We'll look at the issues underlying network
reliability, including but not limited to network performance and
traffic control. We'll examine physical access and security problems,
bandwidth allocation, configuration files, and service dependencies.
From there it's on to application exposures, such as unduly long
network latency or resending lost requests. We'll conclude with some
suggested metrics and measurement techniques, designed to help you
demonstrate that you have your network team under control.
Go team! Yeah team! An end-to-end approach
How do you take a team approach to network reliability? Instead of
focusing on the individual components, look at how they interact and
form an end-to-end system, and examine the relationships of network
devices and subsystems to higher level services. Taking the team
analogy a step further, define some ideal attributes for a reliable,
well-built network:
A good "team approach" example is the United States telephone system.
No matter where you go to plug in your handset, the RJ-11 jacks appear
the same. The dialtone is the same, as are the touchtones and dialing
sequences from any access point. There is only minor variation in the
time required to connect a call, and the system has enough fault
tolerance built in to survive a number of failures on the natural
disaster scale. The telephone service model is appropriately strong
because it describes a distributed data center -- predictable, regular,
reliable performance right to the end user access point. Perhaps the
most impressive feature of the telephone system is that its reliability
is transparent -- you don't "feel" the system. Rarely does a caller
notice the changes in topology that occur in response to congestion or
failures, because reliability is designed into the system in layers.
We'll cover the layers of network reliability from the ground wire up:
Unfortunately, most people associate "network reliability" with simple
congestion control and end-to-end connectivity. When packets are
flowing, they consider the network reliable. This narrow definition
skirts issues listed above, and it's also the source of problems that
make distributed computing seem riskier and more costly than
centralized, host-based architectures. Reliability has to be built
into every component in the system: it's a function of design, not
simply location. You're building a distributed data center, and the
network connecting the distributed components has to be as reliable as
the centrally managed ones. When the pieces are assembled in a
well-designed, well-matched system, the end result is both reliable and
predictable.
Plant it: Cables and access point management
The logical starting point for reliable network design is the physical
cable plant -- the wires, fibers, hubs, transceivers, closets, cable
raceways, and wallplates that form the foundation of your network. You
want the physical layer to be as reliable as possible, but your power to
guarantee that is limited. Mechanical components from hammers to network cables
invariably fail, and always at the least opportune moment. Network
wiring is especially susceptible to failure because it is exposed.
Often, Ethernet taps run under desks, and thinwire (10base2) daisy
chains are unintentional targets of the cleaning service's vacuum
cleaners. The trend toward switching hubs, with point-to-point
twisted-pair wiring, bodes well for improved reliability. A single failure
affects only one machine, not an entire leg of the network or cluster
of machines on the same run. Minimizing the number of components
affected by a failure is your first design principle, and one that
we'll revisit later.
Your best defense against cable faults is a good offense, complete
with diagnostic tools and equipment. A simple ping command
or script that performs a "reachability" test is not sufficient to
diagnose all cable problems. A host may answer the ping
after a very long delay caused by a fault that injects noise or runt
packets onto the network. It's best to know the warning signs of a
noisy, broken or failing cable:
A high collision rate. Compute the rate from the output of netstat -i,
dividing the number of collisions by the number of output packets. Note
that collisions are only counted when ...
Errors reported in the netstat -i output. Output errors occur when ...
In general, it's a good idea to have the appropriate test equipment on
hand if you're going to be responsible for the spaghetti that often
resembles a cable nest. Wiring and network contractors that install and
maintain the physical plant should be properly equipped, but if you're
going to do those jobs yourself, make sure you have access to the right
gear for your cabling. When half the machines on the network are
spewing messages about network jams, it's embarrassing to ask for
purchase order approval to buy a network analyzer. At the very least,
locate a shop that will lease equipment on a short-term basis.
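The collision-rate check mentioned earlier is easy to script. What follows is a minimal Python sketch, not a finished tool. It assumes netstat -i prints Solaris-style columns (Name, Mtu, Net/Dest, Address, Ipkts, Ierrs, Opkts, Oerrs, Collis, Queue). The sample text, the function name collision_rates, and the 5 percent warning level are illustrative assumptions, not figures from this article.

```python
def collision_rates(netstat_output):
    """Return {interface: collisions / output packets} from `netstat -i` text."""
    lines = [ln.split() for ln in netstat_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    name_col = header.index("Name")
    opkts_col = header.index("Opkts")
    collis_col = header.index("Collis")
    rates = {}
    for fields in rows:
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        opkts = int(fields[opkts_col])
        collis = int(fields[collis_col])
        # An interface that has sent nothing has no meaningful rate.
        rates[fields[name_col]] = collis / opkts if opkts else 0.0
    return rates


# Made-up sample output, for illustration only.
SAMPLE = """\
Name  Mtu  Net/Dest  Address    Ipkts   Ierrs Opkts   Oerrs Collis Queue
lo0   8232 loopback  localhost  12000   0     12000   0     0      0
le0   1500 engnet    db1        500000  12    400000  3     30000  0
"""

WARN_LEVEL = 0.05  # assumed threshold; pick one that suits your network

for ifname, rate in collision_rates(SAMPLE).items():
    status = "investigate" if rate > WARN_LEVEL else "ok"
    print(f"{ifname}: {rate:.1%} collisions ({status})")
```

Run against live output on a schedule, a check like this shows a rising trend before users start calling, which is more useful than a single reading.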
What kind of cabling should you use? From a reliability perspective,
you need to worry about the quality of the connections between cable
endpoints and other devices, and the impact of radio frequency
interference (RFI) on the signal quality. High levels of RFI, for
example from cable raceways tucked behind the elevator motor room or
from poorly shielded high-voltage power lines, call for
shielded twisted-pair wiring, or fiber in extreme cases. If you're
going to use fiber, it pays to have a professional installation done
because the fiber requires careful treatment of the ends and connection
jackets.
The final physical issue to address is security of the network
access points. If you worry about unattended PCs being used to tap into
your network, you should also worry about unattended network
connections becoming a conduit as well. Any connectors left out in the
open are ripe for attack, a risk that affects desktop devices as
well as wide-area and leased lines that leave your building. Are you
sure that your private lines come into a secure area? Could an intruder
put a pair of alligator clips on your incoming phone line, and watch
the 1s and 0s go by on your leased line? When you make your living on
the net, and losing your connection is fatal, it's worth at least
ensuring that the line comes into a locked room. The best reliability
and security engineering don't help if your network access points are
left dangling, unprotected, where untrusted persons can have a field
day with them.
If this seems to be a bit of hyperbolic Usenet ranting, think again
about your consultants and contractors. Have you security screened
them? Are they included in the web of trust you extend to employees and
others who can reach under a desk and unplug a live Ethernet tap?
Extend the scope of your reliability and security analysis out to the
value of your data, and the extent to which someone might go to gain
access to it. Plugging a rogue PC into an available twisted pair
connector is far easier than breaking into a workstation. The new
network addition may be added with the best of intentions, but its
choice of IP address or protocol stacks may interfere with other
network traffic. The larger these risks, the less you want to
leave the access points standing naked.
Guaranteed traffic: Dealing with network rush hour
Making sure the bits get from one end of the network to the other is
the first problem, but it's also the easiest one to solve. As you move
up a layer from the physical wiring and connectivity to the network
traffic level, you get into issues of bandwidth consumption,
performance, and other black arts. Ensuring network reliability means
that you have an unencumbered path between any two points, and that
you'll be able to guarantee some minimum throughput over that path.
Bandwidth and throughput are numbers that vendors love to toss
around. Bandwidth is what you can theoretically achieve out of the wire
-- it's an upper limit, and one that is rarely reached. Throughput
tells you what you can obtain in practice. Many factors reduce
effective throughput over a network: congestion resulting in contention
for the media, inefficient protocols, overloaded routers, and hubs that
drop packets. Monitor your network utilization, as well as the typical
throughput between pairs of machines, looking for periods of peak usage
that contribute to network problems. Users may think that the network
is "down" but in fact it's excruciatingly slow due to heavy demand and
traffic volumes. Be sure to tabulate the types of traffic you see at
peak times -- are you suffering from an ftp party at the end of the
day, or is there a regularly scheduled video conference that saps every
available network pipe on Tuesday afternoons?
Watch for traffic anomalies, such as broadcast storms, caused by
machines that are configured with improper broadcast or IP addresses.
A simple snoop filter, for example, snoop -d le0 broadcast, will show you
the volume of ARP and other broadcast packets. Malformed sequences, for
example, multiple replies to a
single ARP request, or broadcasts that result in a flurry of ARP
requests for IP addresses ending in .0 or .255, are the first isobars
of a pending broadcast storm. While normal traffic peaks may die down
quickly, broadcast storms often take several minutes to subside. For a
quick and dirty indicator of storm-like activity, try playing with the
audio option of snoop or etherman:
snoop -a -d le0 broadcast
When your speaker box sounds
less like a Geiger counter and more like a cheap electronic keyboard,
you need to take a look at the traffic statistics.
Sometimes local traffic patterns, such as those caused by unduly
large ftp or http transfers, disrupt network availability to the point
where you want to regulate access to services. In other cases, you may
have a bona fide denial of service attack underway, in which
someone is flooding your network with noise or generating non-stop
connection requests to one of your servers. The solution to both
problems is to install firewall or access controls at the entrance to
the network, and on the hosts from which the services are provided.
Having trouble with the local ftp maniac? Stick a firewall and proxy
agent in between that turns off outgoing ftp connections before 5 pm.
Techniques for filtering connections and enforcing access controls are
discussed in this month's System Administration column.
In addition to providing selective access to services, firewalls and
proxies let you build up minimal throughput guarantees by eliminating
or re-directing non-essential traffic. Classes of service are nearly
completely absent from the TCP/IP world, although class of service
(COS) is popular in SNA networks. It's amazingly difficult to relegate
less important traffic to the back of the network output queue, and to
minimize latency for critical messages. The next generation of the IP
protocol, IPv6, promises to support some mechanisms for defining class
of service. For now, however, you have to rely on brute-force
techniques of separating traffic with logically separate networks, or
enforcing access controls based on host name, time of day, and traffic
type. (See the list of resources at the end of this story for more on
IPv6.)
Don't underestimate the importance of bandwidth guarantees. From the
perspective of a desktop user, a network that's running red-hot with
traffic is as useful as one that's not connected. Calls for traffic
segregation start to fall along business lines: a decision support
query that returns a megabyte to an eager marketeer is going to impact
the OLTP transactions run by the front office. It's laudable that
marketing is using the network to improve customer choices, but it
helps if the customers can make choices in the first place. Fighting
network congestion through careful design and monitoring is the first
logical step you take toward providing reliability that is
seen and felt by the end users and customers.
Don't decreasing networking costs and increasing bandwidth make this
concern over guarantees a bit misplaced? Bandwidth isn't infinite, and
it's certainly going to be a critical factor where heavy payloads
converge. Let's say you've left your 10 megabits per second Ethernet
dangling in the ceiling of your old building, and wired a new campus
with 100 megabits per second Fast Ethernet. If you bring four of those
networks together in the data center, each utilized at 50 percent or
more, you're going to need at least 200 megabits per second just to
handle the traffic streams without introducing latency or dropping
bits.
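The arithmetic behind that figure is worth keeping in a reusable form. This is a small Python sketch; the function name required_capacity_mbps is an illustrative assumption, and the model is deliberately crude (sustained load only, no burst headroom).

```python
def required_capacity_mbps(link_speeds_mbps, utilization):
    """Sustained load that converges where several networks meet.

    link_speeds_mbps: nominal speed of each incoming network, in Mbit/s
    utilization:      fraction of each link that is busy (0.0 to 1.0)
    """
    return sum(speed * utilization for speed in link_speeds_mbps)


# Four Fast Ethernet networks, each 50 percent utilized, as in the text.
campus = [100, 100, 100, 100]
print(required_capacity_mbps(campus, 0.50))  # 200.0
```

Because the result is a floor for sustained traffic, any utilization above 50 percent or any bursty application pushes the real requirement higher.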
When everybody wants a nice 200 kilobyte per second stream from your
video server you need to chop up the available bandwidth to make
everyone happy. You can stagger the retrievals, and give each user full
use of the network in round-robin fashion. Betting on politeness over
politics is not a sure thing, so you could opt to have the
users duke it out with simultaneous requests, then listen to them
complain about the terrible network reliability. The ideal solution is
to institute a bidding system for available bandwidth -- those users
willing to pay a premium get a better quality feed, and those who are
cost constrained get a low-resolution stream. These kinds of
interactive, on-the-fly auctions will be a fundamental part of any kind
of electronic commerce done over internets. Building the bidding tools,
auctioneers, and service exchanges is a wide-open arena for Java
applets and SafeTcl scripts.
After tackling the first two layers of reliability, you should have
a solid plan in place for moving bit streams and providing a constant
performance level even under peak loads. There's still a chunk of glue
needed between applications and the wire -- configuration files and
naming services -- that has to be made just as robust as the underlying
pieces.
Soft consistency: Configuration information and service dependencies
Basic network configuration information includes host to IP address
mappings (/etc/hosts), network routes, names of
network services (/etc/services) and distributed filesystem
usage (/etc/vfstab, /etc/fstab and the automounter).
Network configuration data should promote network isotropism -- the net
appears the same from any vantage point. Naming services like DNS, NIS,
and NIS+ provide some measure of consistency. Policies for making
changes, propagating the deltas, and coordinating updates among
multiple management groups complement the name service. We've covered
techniques for managing the change control process and the integration
of network service configuration files in previous System
Administration columns. (See the list of resources at the end of this
story.)
The key questions to answer are:
Once we reach the configuration information layer, we can modify a
machine's view of the network and services offered on the wire. By
editing configuration files and network interface setups on the fly,
it's possible to change a host's network usage, moving it from one
network to another or from one target server to another. The
configuration level is the first stage in which a fault in a lower
level can be hidden from applications and users. Here's a simple
example: let's say you connect two Ethernets to all of your servers, so
that a failure in the hub, cabling or other machines on the one network
doesn't render the server unreachable. In the event of a network
failure, the server will switch from its primary network interface to
the secondary one. How do you get the clients to start looking for the
server on its new network?
The best answer is that the change should be transparent to the
clients. That is, the server should not change names or IP addresses
after it switches network interfaces. The clients may see a long
timeout or delay while the server reconfigures itself, but the clients
should not have to look up new IP addresses or walk down a list of host
names to handle a server network failure. The easiest approach is to
use a "virtual" IP address for the server, switching it from the
primary to the secondary interface after a failure is detected on the
primary side. Virtual IP addresses are created by adding new hostname
and IP address pairs to an existing physical interface. For example,
consider the following fragment of /etc/hosts:
192.9.200.10 db1
192.9.201.10 db1-primary # on le0
192.9.202.10 db1-secondary # on le1
The host calls itself db1-primary, using the .201 network on
the primary interface and the .202 address on the secondary interface.
Management tools can talk to the machine over either interface, and the
interfaces can ping each other and test for connectivity
using these "private" names. To make the clients' life easier, however,
the virtual address 192.9.200.10 is assigned first to the primary
interface:
# ifconfig le0:1 db1 up broadcast + netmask +
Packets sent to host "db1" show up on the primary interface, and the
server responds over the same network. Now assume that the cable
connecting the primary interface to the network fails, leaving the host
temporarily disconnected. At this point, you'd fail the interface over
to the redundant network:
# ifconfig le0:1 down
# ifconfig le1:1 db1 up broadcast + netmask +
These actions can be initiated by a script that monitors the network
connectivity. To the clients, nothing has changed -- the same host is
responding to queries, with the same IP address, albeit over a
different cable. It's easiest to use different network numbers to avoid
confusing the routing tables, but with some care and deliberate
sequencing of the monitoring and reconfiguration scripts you can use a
single network number and build up/tear down IP addresses as needed to
migrate the public, virtual IP address to live interfaces.
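A monitoring script of the kind described above might look something like this Python sketch. It is an outline under stated assumptions, not a production failover agent. The names failover_commands and monitor, the three-strikes threshold, and the five-second polling interval are all hypothetical. The connectivity probe and the command runner are passed in as functions so the decision logic can be exercised without touching a real interface.

```python
import time


def failover_commands(primary, secondary, virtual_host):
    """Build the two ifconfig invocations shown above as argument lists."""
    return [
        ["ifconfig", f"{primary}:1", "down"],
        ["ifconfig", f"{secondary}:1", virtual_host, "up",
         "broadcast", "+", "netmask", "+"],
    ]


def monitor(probe, run, primary="le0", secondary="le1", virtual_host="db1",
            max_failures=3, interval_s=5.0, sleep=time.sleep):
    """Poll probe(); after max_failures consecutive misses, fail over once.

    probe: callable returning True while the primary network is reachable
    run:   callable that executes one command (an argument list)
    """
    failures = 0
    while True:
        if probe():
            failures = 0  # any success resets the count
        else:
            failures += 1
            if failures >= max_failures:
                for command in failover_commands(primary, secondary,
                                                 virtual_host):
                    run(command)
                return
        sleep(interval_s)


# One possible wiring on a real host (hypothetical peer name "peer-201"):
#   import subprocess
#   monitor(probe=lambda: subprocess.call(["ping", "peer-201"]) == 0,
#           run=subprocess.check_call)
```

Requiring several consecutive failures before switching is a design choice that keeps one dropped ping from bouncing the virtual address between interfaces.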
What if the server that fails is providing a key service, such as
NIS, NIS+, or DNS? With replicated servers for each on a network, the
clients will eventually find a new source of network information, but
what happens when the name service is corrupted to the point where it
requires system administrator intervention? If you can't get the name
service back on its feet, you may have trouble righting other network
ships that start following it down. Most name and file services are
tightly coupled to other network functions, so a failure in the network
or in the service takes down the application as well. Once you've
provided reliability at the network infrastructure level, including
redundant network paths, it's time to put insurance plans into place
for system software failures.
Here's a simple example of service dependency that leads to failure
ripples: The RPC port mapper (rpcbind or portmap) exits without
explanation on your sole NIS server on one network. An application on a
client machine goes to read a file from a library server, residing on a
central NFS server that is two router hops away. The application's NFS
request hits the automounter on the client machine, and the automounter
tries to resolve the host name into an IP address -- using NIS. With
the portmapper out of service, the client can't bind to the NIS server,
so the automounter hangs waiting on NIS. To the application, it appears
that the local machine has hung, the shared library NFS server has
hung, or that the network is "out" somewhere along the double router
hop. After a few minutes, the "NIS server not responding" messages
begin to appear, and the problem can be resolved with some additional
prying on the local network. But what happens when NIS is taken out of
the equation, and DNS is used to resolve host names? The timeouts may
get longer, and the longer period of silence makes it harder to track
down the root cause of the problem.
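One way to shorten that period of silence is to put a deadline on the name lookup itself, so the application can report "name service not answering" instead of simply hanging. The Python sketch below illustrates the idea. The function name resolve_with_deadline, the status strings, and the two-second default are assumptions made for this example.

```python
import concurrent.futures
import socket


def resolve_with_deadline(hostname, timeout_s=2.0,
                          resolver=socket.gethostbyname):
    """Resolve hostname, but give up after timeout_s seconds.

    Returns (status, value) where status is "ok", "name-service-timeout",
    or "name-service-error". The lookup runs in a helper thread because
    the resolver call itself offers no timeout parameter.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return ("name-service-timeout", None)
    except OSError as exc:  # includes socket.gaierror for unknown hosts
        return ("name-service-error", str(exc))
    finally:
        # Do not wait for a stuck lookup; let the helper thread finish alone.
        pool.shutdown(wait=False)
```

An application that logs the returned status before it attempts the NFS or database operation leaves a trail that points at the name service rather than at the file server or the routers.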
Here are failure modes and recovery tactics to consider when
layering critical applications on top of network services like NIS, DNS
and network license managers:
Having safely secured services from the physical through the
configuration levels, it's time to isolate application dependencies and
make your code insensitive to changes in the network fabric.
Lost Worlds: Dealing with unreliable transports
Applications built on top of TCP enjoy sequenced, reliable data
transmission from one socket endpoint to another, provided the
intervening routers, hosts, and configuration data are in order. The
downside of using TCP is that it requires end-to-end handshaking to
ensure data delivery; this means that the communicating hosts must
remain in contact until the data is received and acknowledged. When UDP
is used, the sequencing and delivery guarantees go away, and the
application must make arrangements for requests arriving out of order,
multiple times, or not at all.
There are a large number of application exposures with which to
contend. All of the name and file service dependencies mentioned above,
for example, apply to applications. If your transaction processing
engine reads an NFS-mounted file for access control information (a bad
idea we'll explore shortly), you've made it dependent on the remote
server, on the network between it and the client, and possibly on an NIS
server as well. Dealing with other network failures is more subtle,
because it requires analysis of the handshaking between client and
server. For example, an application that considers a TCP message
"delivered" once the write() on the socket completes is
going to fail when the TCP segments aren't delivered because the
receiver was disconnected from the network. The application will
eventually detect an error while writing on the socket, but without a
clear acknowledgement protocol it's close to impossible to know the
last request that was safely received by the other end.
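A clear acknowledgement protocol does not have to be elaborate. In the Python sketch below, every request carries an identifier and the sender treats it as delivered only when the peer echoes that identifier back. The wire format (a 4-byte id and a 4-byte length, then the payload) and the function names are assumptions invented for this illustration, not part of any standard.

```python
import socket
import struct


def recv_exact(sock, count):
    """Read exactly count bytes, or raise if the peer goes away."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data


def send_with_ack(sock, request_id, payload, timeout_s=5.0):
    """Send one request; return True only if the peer acknowledges it."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(struct.pack("!II", request_id, len(payload)) + payload)
        ack = recv_exact(sock, 4)
    except (socket.timeout, ConnectionError):
        return False  # a completed write alone proves nothing about delivery
    return struct.unpack("!I", ack)[0] == request_id


def serve_one(sock):
    """Receiver side: read one request, then acknowledge it by id."""
    request_id, length = struct.unpack("!II", recv_exact(sock, 8))
    payload = recv_exact(sock, length)
    sock.sendall(struct.pack("!I", request_id))
    return request_id, payload
```

With this in place, the sender always knows the highest request id that was acknowledged. After a failure it can resend everything after that point, provided the receiver can tolerate duplicates.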
Other network failures that impact applications include:
Enforcing performance policies, traffic management, and configuration
control will let you govern the typical network latency, and presumably
smooth out the spikes even under periods of peak demand. But knowing
when and where packets fell off the network is another story. For
example, look at the design decisions implemented by NFS to handle
network failures and clock synchronization.
NFS resends any request that is not acknowledged. This requires that
the requests be idempotent, that is, they are self-describing and can
be sent repeatedly. NFS does just that, sending a request that appears
to have been lost again and again until it is executed and acknowledged
by the server. Not all requests can be repeated ad nauseam --
try removing a file several times, or creating a new directory. NFS
accounts for these non-idempotent requests by caching recently executed
requests and suppressing the error-prone duplicates. (See the list of
resources at the end of this story.)
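The duplicate-suppression idea can be shown in a few lines. This Python sketch is a simplified model of a request cache, not the actual NFS server code. The class name, the (client, transaction id) key, and the fixed capacity are assumptions chosen for illustration.

```python
from collections import OrderedDict


class DuplicateRequestCache:
    """Remember recent replies so a retransmitted request is not re-executed."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._replies = OrderedDict()  # (client, xid) -> reply, oldest first

    def execute(self, client, xid, operation):
        key = (client, xid)
        if key in self._replies:
            return self._replies[key]  # retransmission: replay the old reply
        reply = operation()
        self._replies[key] = reply
        if len(self._replies) > self.capacity:
            self._replies.popitem(last=False)  # discard the oldest entry
        return reply


# Example: removing a file is not idempotent.
files = {"report.txt"}

def remove_report():
    if "report.txt" not in files:
        return "ENOENT"
    files.remove("report.txt")
    return "OK"

cache = DuplicateRequestCache()
print(cache.execute("client-a", 41, remove_report))  # OK
print(cache.execute("client-a", 41, remove_report))  # OK (replayed from cache)
print(cache.execute("client-a", 42, remove_report))  # ENOENT (a new request)
```

Without the cache, the second copy of request 41 would return an error for an operation that had in fact succeeded.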
Realizing that the client and server clocks may have drifted,
NFS-aware versions of ls, ar, and
ranlib disregard minor variations in the current time.
When ls displays file modification times, it shows
anything older than six months as a month, day, and year rather than a
month, day, and time. ls
subtracts the modification time from the current time to gauge the
file's age, which is fine until you have clock drift that makes the
file appear to be modified a few minutes in the future -- making the
time delta negative. The NFS-aware ls compensates for the
clock skew, rather than displaying the file as having been modified in
1969.
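The compensation amounts to treating a slightly negative age as zero. The Python sketch below illustrates that logic and is not the actual ls source. The five-minute tolerance, the 182-day cutoff, and the function name format_mtime are assumptions made for the example.

```python
import time

SIX_MONTHS_S = 182 * 24 * 3600  # approximate cutoff, in seconds


def format_mtime(mtime, now, skew_tolerance_s=300):
    """Format a modification time the way a skew-tolerant listing might."""
    age = now - mtime
    if -skew_tolerance_s <= age < 0:
        age = 0  # a few minutes "in the future" is treated as clock skew
    stamp = time.gmtime(mtime)
    if 0 <= age < SIX_MONTHS_S:
        return time.strftime("%b %d %H:%M", stamp)  # recent: show the time
    return time.strftime("%b %d  %Y", stamp)        # otherwise: show the year
```

For example, with now set to the server's clock reading, a file stamped two minutes ahead is listed with a time of day like any other recent file, while one stamped a day ahead falls through to the year format.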
With your applications safely ensconced in network-reliable
frameworks, it's time to look at the last risk posed by the network --
exposure of data or relationships that creates a security hole.
Safe and secure: Data exposure and network reliability
One of the primary drawbacks of distributed computing is that the
safety of the centralized, access controlled data center is gone. When
you run a transaction over a network, you also run the risk of exposing
the contents of the transaction, its participants, and their
relationships. If the source and destinations are not properly
protected, you run the risk of having (maliciously) incorrect data
inserted in the transaction. Consider the previous example of a
transaction system that reads an access control list from an NFS
mounted file. If a network intruder "spoofed" the NFS server identity,
creating a machine that appeared to be the valid NFS server, he or she
could easily insert a hand-crafted access control file. Anyone with
unrestricted network access -- say from a PC on an unprotected network
tap -- could listen for the NFS request and return a bogus NFS buffer.
Protecting the identities of parties to a transaction is just as
important as securing its contents. Wall Street's "Chinese Wall"
separating investment banking and trading operations is supposed to
prevent the flow of inside information. But what happens when the
traders can watch packets going to and from the corporate gateway, and
surmise the existence of an investment banking relationship with ABC,
Inc. because traffic is exchanged between their bank and abc.com?
Protecting the players, their roles, and their data requires careful
analysis of the paths over which the transactions travel, and the
degree of protection that each requires.
Keeping score
So what do you measure to prove that you're making progress on the
reliability front? Strive for consistency in all areas -- flat response
times even under the heaviest load and in the face of growing network
usage, fast and regular responses to calls for help, and aggressive
resolution of network failures. Instrument applications as well, so you
can determine where the clock time is expended -- on the client, the
server, or the network. If 90 percent of the request response time is
consistently taken up by host or client processing, then the network
should shoulder minimal blame for performance or reliability problems.
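A simple breakdown of this kind can be computed without synchronized clocks. The Python sketch below assumes the client times the whole request and its own processing, and that the server reports its processing time in the reply. The function name time_breakdown and that reporting convention are illustrative assumptions.

```python
def time_breakdown(total_s, client_s, server_s):
    """Split one request's elapsed time into client, server, and network shares.

    total_s:  wall-clock time for the whole request, measured on the client
    client_s: time spent in client-side processing, measured on the client
    server_s: processing time the server reports in its reply
    Every duration is measured on a single machine, so clock drift between
    hosts does not distort the result.
    """
    network_s = max(total_s - client_s - server_s, 0.0)
    return {
        "client": client_s / total_s,
        "server": server_s / total_s,
        "network": network_s / total_s,
    }


shares = time_breakdown(total_s=2.0, client_s=0.5, server_s=1.3)
for component, share in shares.items():
    print(f"{component}: {share:.0%}")
```

In this made-up sample the network accounts for about a tenth of the elapsed time, so tuning effort would be better spent on the hosts.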
Here are additional metrics to track your reliability efforts:
You're looking for a measure of repeatability, that is, things work the
same way even with minor variations in usage and load. Making the
network more reliable and more seamless allows you to extend the
boundary of the data center out to the desktop network connection. It
won't solve the problems of distributed computing, but it will make the
distributed system predictable, and therefore as comfortable as the
glass walls of a data center.
Unix Insider