From: www.itworld.com
March 31, 2008 —
Computer programming logic is easy, really. It all boils down to endless variations of the following three pseudo-English constructs: do this; if this, then do that; while this, do that.
That's it. Nothing else to it. Granted, there are a million and one ways to express all three of these. There are also a million and one poetically beautiful ways to create new language constructs that hide the do/if/while stuff from direct view. But when it is all boiled down, it is just do/if/while in endless permutations and combinations.
On the face of it, these are very primitive constructs that application designers can feel free to use in whatever way they choose. However, it is becoming increasingly apparent that the third one of these - the while construct - needs to be treated with great care if you want your application to scale easily to internet-size proportions.
The easiest way to illustrate what I am talking about is with an example, so here goes. Imagine you have some calculation to perform on some type of file. You design your application according to classic decomposition theory. That is, you write a self-contained chunk of code that works on one file. You then stick a while loop over the top. Your program has this shape:
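A minimal Python sketch of that shape follows. It assumes a hypothetical per-file calculation (counting words), and the names `process_file` and `main` are illustrative only. The loop over files is spelled as a `for` loop, which is the idiomatic Python form of the while loop described above.

```python
from pathlib import Path


def process_file(path):
    """Self-contained unit of work on one file (hypothetical: count its words)."""
    return len(Path(path).read_text(encoding="utf-8").split())


def main(paths):
    """Feed the files one by one into process_file and total the results."""
    total = 0
    # The loop "over the top": strictly sequential, so file N+1 cannot
    # start until file N has finished.
    for path in paths:
        total += process_file(path)
    return total
```

Calling `main` with a list of file paths returns the combined word count. The running time grows in step with the number of files because nothing here can run in parallel.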
Everything works fine until one day your program is invoked with 1.5 million input files. Oops! The program is correct and will give a result eventually but, well, it might take quite a while to chew through 1.5 million input files.
What to do? Well, think of the application as consisting of two parts. The first part does the real work: working on a file-by-file basis. The second part is to feed files one by one into the first part. It is in this second part that the troublesome while loop exists.
Here is an alternative way of thinking about the problem. Imagine you had at your disposal a standard mechanism for running the core of your program over any number of files without having to code the while loop explicitly. Imagine that this standard mechanism automatically handles spreading the work across a bunch of processing nodes for you. Now also imagine that you could get your hands on the results of each separate invocation and pull all that stuff together afterwards without having to worry about failures, retries, or any of that stuff.
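As a rough sketch of the idea, the Python below keeps the per-file work free of any loop and lets the caller supply the mechanism that applies it to every file. The names `process_file`, `run`, and `run_with_workers` are hypothetical, and word counting stands in for the real calculation. A local thread pool from the standard library stands in for the "bunch of processing nodes". It does not retry failures the way a real MapReduce framework would, so an error in any one file simply propagates to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_file(path):
    """Core work on one file; knows nothing about how many files there are."""
    return len(Path(path).read_text(encoding="utf-8").split())


def run(paths, mapper=map):
    """Apply process_file to every path via `mapper`, then combine the results.

    `mapper` is any callable shaped like the built-in map(func, iterable).
    """
    # The sum is the "pull it all together afterwards" step.
    return sum(mapper(process_file, paths))


def run_with_workers(paths, workers=4):
    """Same program, with the loop handed to a pool of local worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return run(paths, mapper=pool.map)
```

Because the loop lives outside `process_file`, swapping the built-in `map` for `pool.map` (or, in principle, for a distributed framework's equivalent) requires no change to the core calculation.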
With such a facility, handling 1.5 million files is no longer such a big deal. To be able to take advantage of this sort of facility, you need to think differently about your primary while loops. Where possible, don't bake them deep into your code; keep them external. If you do this carefully, then you will find that the MapReduce world into which a significant chunk of data processing at internet scale is heading will be yours to exploit - even if you do not need it right now.
On the other hand, if you do not watch the placement of your while loops, your application may require significant re-engineering in the event that it proves such a runaway success that it needs to be run over 1.5 million files.