Computer programming logic is easy, really. It all boils down to endless variations
of the following pseudo-English constructs:
- do <some thing>
- if <some condition is true> do <some thing>
- while <some condition is true> do <some thing>
That's it. Nothing else to it. Granted, there are a million and one ways to
express all three of these. There are also a million and one poetically beautiful
ways to create new language constructs that hide the do/if/while stuff from
direct view. But, when it is all boiled down, it is just do/if/while in endless
permutations and combinations.
On the face of it, these are very primitive constructs. Constructs that application
designers can feel free to use in whatever way they choose. However, it is becoming
increasingly apparent that the third one of these - the while construct - needs
to be treated with great care if you want your application to scale to
internet-size proportions easily.
The easiest way to illustrate what I am talking about is with an example, so
here goes. Imagine you have some calculation to perform on some type of file.
You design your application according to classic decomposition theory. That
is, you write a self-contained chunk of code that works on one file. You then
stick a while loop over the top. Your program has this shape:
do <pick the next file>
while <we have a file to work on>
    do <work on that file>
    do <pick the next file>
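For concreteness, here is a minimal Python sketch of that shape. The names work_on_file and process_all are made up for illustration, and counting lines is only a stand-in for whatever the real calculation might be.

```python
from pathlib import Path
from typing import Dict, Iterable


def work_on_file(path: Path) -> int:
    """Hypothetical stand-in for the real calculation: count the lines."""
    with path.open("rb") as handle:
        return sum(1 for _ in handle)


def process_all(paths: Iterable[Path]) -> Dict[Path, int]:
    """The classic shape: a self-contained core with a while loop over the top."""
    results: Dict[Path, int] = {}
    remaining = iter(paths)
    current = next(remaining, None)                # do <pick the next file>
    while current is not None:                     # while <we have a file to work on>
        results[current] = work_on_file(current)   # do <work on that file>
        current = next(remaining, None)            # do <pick the next file>
    return results
```

Note that the while loop and the real work live in the same program, so the files can only ever be handled one after another.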
Everything works fine until one day your program is invoked with 1.5 million
input files. Oops! The program is correct and will give a result eventually
but, well, it might take quite a while to chew through 1.5 million input files.
What to do? Well, think of the application as consisting of two parts. The
first part does the real work: working on a file-by-file basis. The second part
is to feed files one by one into the first part. It is in this second part that
the troublesome while loop exists.
Here is an alternative way of thinking about the problem. Imagine you had
at your disposal a standard mechanism for running the core of your program
over any number of files without you having to code the while loop explicitly.
Imagine that this standard mechanism automatically handles spreading the work
across a bunch of processing nodes for you. Now also imagine that you could
get your hands on the results of each separate invocation and pull all that
stuff together afterwards without you having to worry about failures/retries
or any of that stuff.
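To make the idea a little more tangible, here is a rough Python sketch of what such a mechanism might look like from the caller's side. The function run_over and its max_attempts parameter are invented for this illustration, and a local thread pool from the standard library stands in for the "bunch of processing nodes"; a real cluster framework would look different in detail. Items are assumed to be hashable and distinct.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Hashable, Iterable


def run_over(
    worker: Callable[[Any], Any],
    items: Iterable[Hashable],
    executor: Executor,
    max_attempts: int = 3,  # hypothetical retry budget per item
) -> Dict[Hashable, Any]:
    """Run worker on every item, retry failures, and collect the results."""
    attempts = {item: 1 for item in items}
    pending = {executor.submit(worker, item): item for item in attempts}
    results: Dict[Hashable, Any] = {}
    while pending:
        # Snapshot the outstanding futures; retries are picked up next round.
        for future in as_completed(list(pending)):
            item = pending.pop(future)
            try:
                results[item] = future.result()
            except Exception:
                if attempts[item] >= max_attempts:
                    raise  # give up on this item and surface the error
                attempts[item] += 1
                pending[executor.submit(worker, item)] = item
    return results


if __name__ == "__main__":
    # The local pool is only a stand-in for a set of processing nodes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_over(len, ["a", "bb", "ccc"], pool))
```

The loop has not vanished, of course. It has moved into the mechanism, where it can be parallelised and hardened once instead of being rewritten inside every application.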
With such a facility, handling 1.5 million files is no longer such a big deal.
To be able to take advantage of this sort of facility you need to think differently
about your primary while loops. Where possible, don't bake them deep into your
code; keep them external. If you do this carefully, then you
will find that the MapReduce world into which a significant chunk of data
processing at internet scale is heading will be yours to exploit - even if
you do not need it right now.
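As a small illustration of keeping the loop external, here is a Python sketch in which the core works on exactly one file and the results are combined by a separate function. The names count_words, merge and run_locally are hypothetical, and word counting is just an example calculation.

```python
from collections import Counter
from functools import reduce
from pathlib import Path
from typing import Iterable


def count_words(path: Path) -> Counter:
    """Core of the program: works on exactly one file, with no loop over files."""
    return Counter(path.read_text(encoding="utf-8").split())


def merge(left: Counter, right: Counter) -> Counter:
    """Combine two partial results; the order of combination does not matter."""
    return left + right


def run_locally(paths: Iterable[Path]) -> Counter:
    """One possible driver. The iteration over files lives here, outside the core."""
    return reduce(merge, map(count_words, paths), Counter())
```

Because count_words and merge know nothing about how the files are iterated, run_locally could later be swapped for a distributed driver without touching either of them.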
On the other hand, if you do not watch the placement of your while loops, your
application may require significant re-engineering in the event that it proves
such a runaway success that it needs to be run over 1.5 million files.