From: www.itworld.com
August 5, 2003 —
Imagine a very, very large corn field. Think about a field the size of
Texas or perhaps a factor of ten bigger than that. Now consider how you
would harvest the ripe corn in that field.
Plan A: Get a single harvesting machine and start methodically chewing
through the corn, strip by strip.
Plan B: Get many, many harvesting machines and get them methodically
chewing through strips of corn in parallel with each other.
Plan A, I think you will agree, demonstrates about as much higher-order
intelligence as a single ear of corn in that field. Plan B is obviously
the way to go. Why? Because from a harvesting point of view, one strip
of corn in one part of Texas is utterly unrelated to any other. Thus
the harvesting can happen in parallel. Blindingly obvious, right?
Parallelizing the work has other benefits apart from throughput. We can
use cheap and cheerful machines to do the work rather than a very
expensive uber-machine. Who cares if the individual cheap machines break
down? We just replace the ones that fail and forge on. The chances of
corn production stopping completely because of hardware failure are
vanishingly small. What are the chances of hundreds,
perhaps thousands, of independent machines all failing at the same time?
Effectively zero.
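The "effectively zero" claim rests on a simple calculation. If we assume, purely for illustration, that each machine fails independently with the same probability p over some period, then the chance that all n machines are down at once is p raised to the power n. The Python sketch below uses a made-up 5% failure rate; `prob_all_fail` is a hypothetical helper, not a library function. Real failures are not always independent (a shared power feed or network link can take out many machines together), so treat this as an idealized figure.

```python
def prob_all_fail(p_fail: float, n_machines: int) -> float:
    """Probability that every one of n machines is down at the same time.

    Assumes (hypothetically) that each machine fails independently with the
    same probability p_fail over the period of interest.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability in [0, 1]")
    if n_machines < 1:
        raise ValueError("need at least one machine")
    # Independent events: the joint probability is the product of the parts.
    return p_fail ** n_machines


if __name__ == "__main__":
    # Illustrative 5% per-machine failure rate.
    for n in (1, 10, 100, 1000):
        # For n = 1000 the result underflows to 0.0 in floating point.
        print(n, prob_all_fail(0.05, n))
```

Each extra machine multiplies the joint probability by another factor of p, which is why it shrinks so quickly as the fleet grows.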
In IT terminology, the harvesting system exhibits high availability,
fault tolerance and a linear scaling relationship between throughput and
processing power. Nice.
Now let's switch our reverie from corn to, say, customers. Imagine a
very large set of customers. We need to look at what they have bought
from us in the last month and generate invoices, one for each customer.
Plan A: Get a single invoice processing "machine". Start at one corner
of the customer list and work methodically through to the end.
Plan B: Get many, many invoice processing machines and get them
methodically chewing through strips of customers in parallel.
From a common sense perspective, Plan B is just as compelling as it is
in the corn harvesting example. However, by and large, enterprise
computing does not work that way. We go for Plan A most of the time:
bigger, faster, more expensive individual machines to process tasks
that could be done faster and more cheaply by multiple machines working
in parallel.
In fact, Plan B is even more compelling in computing than it is in
harvesting corn. The unit cost of processing machines continues to fall
through the floor. So much so that a large Web search engine provider,
one that makes extensive use of parallelism, does not even replace
individual processing machines when they break down. Why? Because the
cost of dispatching an engineer to replace it is higher than the cost of
just tacking a new one on the end of the processing rack. This is a
pretty radical shift in the economics of computing.
What an embarrassment of riches in computing power surrounds us! Yet, by
and large, in enterprise computing, we do the equivalent of heading off
into a corner of the corn field with a single machine, working through
our data strip by strip. If we were engineers in the agricultural
sector we would be laughed at. Seriously.
Two questions arise, I guess. First, why do we gravitate towards Plan A
in enterprise computing and, second, what should be done about it?
I think the first question is best answered by saying that we do it the
way we do it because we have *always* done it that way. Ever since John
von Neumann's brilliant insights into general-purpose processing
machines[1], we have been building mental models of processing around
the idea of a single, all-powerful CPU (the 'C' stands for 'Central',
after all). The so-called von Neumann architecture is endemic in the way
we think about systems.
It is a curse of sorts. The easiest way to see it in action is to ask a
developer how long it will take to process a million invoices if each
invoice takes 10 seconds of processing time. After some calculation the
answer will be that it will take about 116 days (nearly four months). Without even
thinking about it, most developers will serialize the invoice processing
and assume that everything will be routed through a single processor,
which will chew through invoices at the rate of one every 10 seconds.
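The serial arithmetic is easy to check. This is a minimal Python sketch; `serial_days` is a hypothetical helper, and it assumes exactly one invoice is processed at a time with no gaps or overhead between invoices.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def serial_days(n_invoices: int, seconds_per_invoice: float) -> float:
    """Days needed to process invoices one after another on a single processor."""
    total_seconds = n_invoices * seconds_per_invoice
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{serial_days(1_000_000, 10):.1f} days")  # prints "115.7 days"
    # Read the other way round: one day of serial work covers this many invoices.
    print(SECONDS_PER_DAY // 10)  # prints 8640
```

In other words, a single processor running flat out for a full day gets through only 8,640 of the million invoices.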
The answer to the second question doubles as a rallying cry. There is a
ton of lore about parallel computing out there. There are real systems
such as Beowulf clusters in the Linux world that show the power of
massive parallelism. There is the Grid Forum[2], which is emerging as a
focal point for activity in parallel computing technologies.
Now here is the rallying cry. A lot of the interest in massive
parallelism comes from scientists interested in fluid dynamics, N-Body
problems, quantum chemistry and the like. All really important stuff, but
where are the commercial IT people? The benefits of parallel computing
for enterprise computing are enormous, not only in terms of the
availability of computational power but also in terms of cutting costs,
availing of computational power on demand and so on.
I think part of the problem is that many commercial IT problems are,
in parallel computing terminology, 'trivially parallelizable' and thus
not a source of scientific interest. It is true that finding a way to
distribute the calculation of Nth-degree polynomials on a grid is a lot
harder than distributing invoice generation on a grid. The latter is an
example of a problem that is amenable to what is known as 'domain
decomposition'[3]. Simply put, the problem is like a large corn field:
it is trivial to perform the work in parallel. The more machines the
better.
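Here is a minimal Python sketch of domain decomposition applied to the invoice example. Everything in it is hypothetical: the customer list, the `purchases` mapping and `make_invoice` stand in for whatever a real billing system does, and a thread pool stands in for a grid of separate machines. Because CPython threads share one interpreter lock, CPU-heavy work would need `ProcessPoolExecutor` or genuinely separate nodes; the decomposition logic stays the same.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for last month's purchases: customer -> amounts.
Purchases = Dict[str, List[float]]


def split_into_strips(items: Sequence, n_strips: int) -> List[Sequence]:
    """Cut the customer list into n_strips contiguous, near-equal strips."""
    if n_strips < 1:
        raise ValueError("n_strips must be at least 1")
    size, extra = divmod(len(items), n_strips)
    strips, start = [], 0
    for i in range(n_strips):
        # The first `extra` strips take one additional customer each.
        end = start + size + (1 if i < extra else 0)
        strips.append(items[start:end])
        start = end
    return strips


def make_invoice(customer: str, purchases: Purchases) -> Tuple[str, float]:
    """Hypothetical per-customer work: total the customer's purchases."""
    return customer, round(sum(purchases.get(customer, []), 0.0), 2)


def process_strip(strip: Sequence[str], purchases: Purchases) -> List[Tuple[str, float]]:
    # A strip reads only its own customers' data, so strips need no coordination.
    return [make_invoice(c, purchases) for c in strip]


def generate_invoices(customers: Sequence[str], purchases: Purchases,
                      n_workers: int = 4) -> List[Tuple[str, float]]:
    strips = split_into_strips(customers, n_workers)
    # Threads stand in for grid nodes here; each worker gets one strip.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_strip = list(pool.map(process_strip, strips, [purchases] * len(strips)))
    # map() yields results in strip order, so flattening keeps customer order.
    return [invoice for strip_result in per_strip for invoice in strip_result]


if __name__ == "__main__":
    customers = ["alice", "bob", "carol", "dave", "erin"]
    purchases = {"alice": [10.0, 5.5], "bob": [20.0], "dave": [1.25, 1.25]}
    for name, total in generate_invoices(customers, purchases, n_workers=2):
        print(name, total)
```

The only coordination is the split at the start and the concatenation at the end; the strips themselves never talk to each other, which is what makes the problem "trivially parallelizable".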
It seems to me that enterprise IT people need to start wrapping their
heads around this stuff. In the million-invoice example, given a grid
with a million nodes that you can tap into, you could process all your
invoices in 10 seconds. Let's double (no, treble) that figure to take
account of bandwidth and data transmission. That gives 30 seconds versus
10 million seconds, nearly four months, on a single processor. Nice.
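The back-of-the-envelope estimate can be written as a small Python function. It is a sketch under stated assumptions: every task takes the same time, tasks are fully independent, workers are identical, and the data-transmission cost is modelled as one fixed multiplier (the "treble" factor). `parallel_seconds` and `overhead_factor` are hypothetical names chosen for illustration.

```python
def parallel_seconds(n_tasks: int, seconds_per_task: float, n_workers: int,
                     overhead_factor: float = 1.0) -> float:
    """Rough wall-clock time for n_tasks independent, equal-sized tasks.

    Each worker processes its share one task after another; the busiest
    worker has ceil(n_tasks / n_workers) tasks. overhead_factor is an assumed
    multiplier for moving data to and from the workers (3.0 = "treble").
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    tasks_per_worker = -(-n_tasks // n_workers)  # integer ceiling division
    return tasks_per_worker * seconds_per_task * overhead_factor


if __name__ == "__main__":
    n_invoices, secs_each = 1_000_000, 10
    # One processor, no transmission overhead: 10,000,000 s (about 116 days).
    print(parallel_seconds(n_invoices, secs_each, n_workers=1))
    # Grids of 1,000 and 1,000,000 nodes with trebled times: 30,000 s and 30 s.
    for nodes in (1_000, 1_000_000):
        print(nodes, parallel_seconds(n_invoices, secs_each, nodes, overhead_factor=3.0))
```

Under these assumptions the time falls in direct proportion to the number of workers until there is one task per worker, which is the linear scaling relationship described for the corn harvest.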
I think it's time to take a long hard look at Plan B.
[1] http://www.wikipedia.org/wiki/Von_Neumann_architecture
[2] http://www.gridforum.org
[3] http://cauchy.math.colostate.edu/Projects/Garrison/paper.html