November 28, 2009, 12:24 PM — What have the Roman Empire and the relational database got in common? Not too much at the moment, I would suggest, but in a few short years I think we will be seeing an interesting similarity in their life histories. Pretend we are in the next decade somewhere ... Here is how it might look in Wikipedia if I'm right. "Like the Roman Empire, the Relational Database grew gradually to be an extremely dominant force. Like the Roman Empire, it suffered gradual collapse, the causes of which are debated by historians to this day ... "
Of course, from the current vantage point of 2009 -- still some years away, I suspect, from the collapse of the relational database hegemony -- it is difficult to predict how history will be written. Will the relational database have the equivalent of the Romulus Augustulus abdication moment? Will some historians attribute some modern-day IT equivalent of lead poisoning to its demise? Will I prove to be utterly wrong in the grand tradition of technology prognosticators?
I am predicting that the fall of the relational database will indeed happen and furthermore, I am predicting that history will not point to any one single event as the trigger. Rather, I suspect that a combination of forces, a variety of ideological movements and technological developments, acting contemporaneously, will create a perfect storm for the relational database. I see seven main forces at work. I have given each force a somewhat whimsical name in what follows. Here is a list of all seven with a brief explanation of each. The rest of the article goes into a little more detail on each one.
- The hierarchicalists: Promote the benefits of hierarchical information models such as XML and the emergence of viable mechanisms for querying and processing large corpora of such information. The poster boy here is the XQuery language and to a lesser extent Microsoft's Xlinq.
- The chaoticians: Promote the benefits of ex-post-facto information structuring, i.e. rather than worry about finding a perfect structured model for information, chaoticians keep it all loose – maybe as a set of office documents or HTML pages or spreadsheets – and then use tools to retro-fit or reverse engineer structure on top of the purposely loose, chaotic corpus. The poster boy of this movement is, of course, the Google search engine, which manages to achieve a high degree of findability from a low level of explicit information structure. In fact, Google makes a public point of not using the metadata capabilities of HTML, capabilities that could be used to add structure on top of a corpus of HTML pages. Also in the chaotician camp are products such as Microsoft Sharepoint, Lotus Notes and Autonomy, which purposely blur the distinction between structured and un-structured information and support findability of both within their information retrieval features.
- The steganographers: Promote the benefits of sprinkling structured information inside largely unstructured information and then using parsing software to dig it out and synthesize it, thus creating a graph of structure on top of a set of unstructured information artifacts. Microformats and RDFa fit into this movement. The concept of a mashup and the "linked data" initiatives are emerging as the poster children here. The steganographers and the chaoticians have much in common, but steganographers are more inclined to try to work out in advance what the data model should look like.
- The democriticians: Promote the benefits of decomposing information into triples and treating all higher levels of structure as derived from this atomic level. Web 3.0, OWL and the semantic web are all in this category.
- The parallelizers: Promote the benefits of simple key/value information models and like to point out how enormous compute power and storage can be brought to bear cheaply to derive higher order information from simple key/value models. Google's MapReduce is the poster child here.
- The agilists: Promote the benefits of iterative development of information models. Rather than the classic relational approach of building your data model and then building the applications around it, agilists often hold the view that the model must be as fluid as the applications built around it. The relational model, as implemented in many database management systems, has many positive attributes, but fluidity is not one of them.
- The temporalists: Point to the weakness of the relational model when it comes to one of the most frequent concerns in IT systems. Namely, how information changes over time. Although the time dimension can be factored into relational systems, it is not something that the model itself promotes. In fact, it can be argued that relational data normalization is antithetical to the common requirement of capturing "point in time" views of a business process or a corpus of content.
The rise of the hierarchicalists
XML, like SGML before it, takes the view that much information is naturally hierarchical in form. SGML never really caught on outside of some niche areas but XML – its successor – is slowly but surely carving out a following in the mainstream database management arena. For all its faults, the W3C XML Schema Language (XSD) is one reason for this, but I think it is XQuery that is really responsible for the shift to center stage. Not many technologists under the age of, say, forty will remember, but believe it or not, there was life in the field of hierarchical database management long before SGML and XML came along, in the form of IMS. It can be argued that technology and adoption are really just catching up with an idea that is now over forty years old.
The intriguing thing about XQuery is that it sets out not so much to kill the relational database as to extend it. It does this in the time-honored way of treating the enemy as a mere special case of a more powerful data modelling abstraction. That is, to the hierarchicalists, relational tables are merely very regular, shallow and non-recursive hierarchies.
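The hierarchicalist claim -- that a relational table is just a degenerate hierarchy -- can be sketched in a few lines of Python (standing in here for an XQuery engine; the table and field names below are invented for illustration):

```python
# A relational table viewed as a shallow, regular XML hierarchy:
# each <row> is a child of <table>, each field a child of <row>.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<table name="customers">
  <row><id>1</id><name>Ada</name><city>London</city></row>
  <row><id>2</id><name>Bob</name><city>Paris</city></row>
</table>
""")

# SELECT name FROM customers WHERE city = 'Paris' becomes a walk over
# the hierarchy (XQuery would write it as //row[city = 'Paris']/name).
names = [row.findtext("name")
         for row in doc.iter("row")
         if row.findtext("city") == "Paris"]
print(names)  # ['Bob']
```

The point is the direction of generalization: the hierarchical query works unchanged on recursive, irregular trees, while the relational query cannot.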
The rise of the democriticians
Democritus commonly gets credit for being the first person to speculate that matter is really all made up of super small indivisible particles: atoms. In information theory, it is common to think of a triple consisting of a subject, a predicate and an object as being an atom of information. Democriticians hold that the best way to create data models is to start at this triple level and build everything up from there. The idea has a long, long history. It can be argued that Prolog explored this approach in the Seventies. It can also be argued that the CODASYL model popular in the days of COBOL -- with its network approach to data modelling -- also covered this territory. Indeed, the philosopher C.S. Peirce was arguably drawing RDF diagrams with a quill pen back in 1885.
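The democritician idea is easy to sketch: store nothing but (subject, predicate, object) atoms and derive everything else by pattern matching. The Python fragment below is a toy illustration; the facts and predicate names are invented:

```python
# A minimal triple store: every fact is a (subject, predicate, object) atom.
triples = [
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "dublin"),
    ("bob", "worksFor", "acme"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Higher-level structure (a "table" of employees) is derived, not stored:
employees = [s for (s, _, _) in match(p="worksFor", o="acme")]
print(employees)  # ['alice', 'bob']
```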
The rise of the parallelizers
It is pretty evident, I think, that we are entering the age of the parallel -- whether we like it or not. Moore's law may still hold for transistor counts, but individual processor cores are not getting faster at the rate we have become used to; instead, more and more of them are being crammed into each chip. The term "CPU" is becoming increasingly inaccurate. The Von Neumann architecture that has served so long as the fundamental abstraction no longer fits the facts. The facts increasingly consist of umpteen virtualized machines, bottomless pits of storage and an increasingly "on demand" approach to compute resource allocation.
It is true that some developers creating cloud computing platforms, or utilizing the cloud ecosystems being created by companies such as IBM, Microsoft, Amazon and Google, start by firing up a relational database but a goodly number are jumping straight for designs based on MapReduce or Hadoop or CouchDB to name but three.
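The contract at the heart of MapReduce is small enough to sketch as a single-process Python toy: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group. A real framework distributes these same phases across thousands of machines; only the contract is shown here:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (key, value) pair per word -- the classic word-count example.
    for word in doc.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Fold all the values seen for one key into a single result.
    return (key, sum(values))

docs = ["to be or not to be", "be here now"]

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for doc in docs:
    for key, value in map_phase(doc):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts["be"])  # 3
```

Because map and reduce are pure functions over key/value pairs, the framework is free to run them anywhere, which is exactly the property the parallelizers prize.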
The rise of the chaoticians
Over the last decade, we have seen an increasingly jaundiced eye being turned toward what I would call the library sciences. Foundational concepts such as controlled vocabularies, taxonomies, data types, part/whole relationships, relational algebras etc. all require data modelling work up front. Often, they require a lot of work up front. The theory goes that the time spent up front on the information design will allow the rest of the project to proceed waterfall-style and will lead to systems that are optimal in terms of performance and accuracy.
There are a number of problems with this theory, say the chaoticians. Firstly, you never know up front what the information model will need to be. Rather, you discover it as you go along (a world view that resonates with the agilists also). Secondly, it is no longer such a big deal to have an optimal model in terms of storage or performance. Who cares if a more chaotic model entails some extra processing or eats some more storage or creates some "false positives" in the results? An abundance of cheap processing power and storage density deals with the former and human nature deals with the latter. Look at a set of results from a search engine. Some are false positives. In fact, the majority of them may be false positives. The search "hits" are statistical in nature or, put another way, wrong. The chaoticians argue that it often makes more sense to deal with the statistical nature of the results than slave for years to find the perfect, normalized, relational model.
The chaoticians often point to the memo fields and blob storage layers in relational databases as evidence to support their cause. Over time, it is not unusual for memo fields to end up as repositories of very rich information in a relational database. Once out of the rigorously controlled, rigorously data-typed table/field structure, it is "in the database" but it effectively bypasses the data model. "If a lot of the good stuff is going to end up in memo fields", say the chaoticians, "why bother building an elaborate table/field structure that will atrophy over time anyway?"
The rise of the steganographers
This movement overlaps to some extent with the democriticians and the chaoticians. Steganographers point out that "structured" is really a subset of "un-structured" when it comes to information. If there are a few identifiable integer, date and dollar fields to be found in all the invoices, curriculum vitae, recipes, product descriptions etc., why not simply smuggle them inside the word processor files or HTML pages that hold all the rest of the information? To the steganographers, much of the world's information is semi-structured at best. Their position is that it is better to start with an open-ended design in which anything goes, i.e. the use of very loose data models such as text fields, word processor files, spreadsheets etc., and layer on whatever islands of structure you can. Aiding their cause is the emergence of tools and techniques to effectively index large corpora of semi-structured text. Many search engines support the creation of "fields" that can be embedded into otherwise unstructured documents; these are indexed and queried in a way very analogous to how relational databases function. The primary difference, say the steganographers, is that the messy, irregular real-world documents remain the real deal and the indexing sub-system is simply a finding aid - not the repository per se.
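A toy illustration of the approach in Python: the name:value field syntax below is invented for this sketch (microformats and RDFa define real conventions), but the shape of the idea is the same -- structure smuggled into free-form text, then parsed back out:

```python
import re

# A free-form memo with two islands of structure smuggled inside it.
memo = """
Met the new supplier today. Good meeting overall.
  invoice-number: 10447
  amount: 1250.00
Follow up next week about the delivery schedule.
"""

# Lines of the form "name: value" are the embedded fields.
FIELD = re.compile(r"^[ \t]*([a-z-]+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)

# The parser digs the structure back out; the messy document remains
# the system of record and the extracted dict is just a finding aid.
fields = dict(FIELD.findall(memo))
print(fields["amount"])  # 1250.00
```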
The rise of the agilists
Agilists prize one thing above all else and that is the speed with which an IT system can change shape over time. Back in the Eighties, the world experimented with 4GLs, many of which had the notion of evolving a data model hand in glove with the applications built on top of it. More recently, web-oriented database application development frameworks like Django and Ruby on Rails have done much to promote the idea that the application-level data structures are really primary and that the relational data model "falls out" as just one possible way to represent the application model at the storage layer.
The intriguing thing about this is that the data modelling language is not predicated on a relational storage model. It just so happens that the first back-ends for these frameworks have been relational. The very fact that both frameworks speak of relational databases as one possible "back end" speaks volumes for what is going to happen in this space. Namely, we will see more and more back-ends for these frameworks that are not relational at all.
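The inversion is easy to sketch in plain Python: the application model is primary, and storage hides behind a tiny interface that any back-end, relational or not, could implement. The classes below are invented for illustration and are not how Django or Rails actually implement the idea:

```python
class DictBackend:
    """An in-memory, schema-free back end; a relational or document
    back end could implement the same save/load contract."""
    def __init__(self):
        self.rows = {}

    def save(self, key, record):
        self.rows[key] = dict(record)

    def load(self, key):
        return self.rows[key]

class Customer:
    # The model is plain application code; nothing here assumes tables.
    backend = DictBackend()

    def __init__(self, cid, name):
        self.cid, self.name = cid, name

    def save(self):
        self.backend.save(self.cid, {"name": self.name})

    @classmethod
    def get(cls, cid):
        return cls(cid, cls.backend.load(cid)["name"])

Customer(1, "Ada").save()
print(Customer.get(1).name)  # Ada
```

Swapping `DictBackend` for any other object with the same two methods changes the storage story without touching the model, which is precisely the agilist point.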
The rise of the temporalists
Many, many real-world systems have to fight the realities of time's arrow. How many systems do you know that have to store data that changes through time? Or report on how data has changed over time? Or allow modification to themselves over time? A sizable subset I suspect. And yet, the concept of time is not primary in the relational model. Of course, it is possible to model time in a relational database and implement a layer on top that adds the time dimension but it is not the relational database's strong point. Indeed, the concept of data normalization and the removal of duplication in general, has a nasty habit of making point-in-time reporting very problematic indeed. Consider the classic example of normalizing a design that contains customer information. You want to store a single copy of the customer's contact information - or so the standard wisdom holds. But what if you need to find out who used to be the contact before the current person took over? In many classically designed relational information models your only recourse is to backups or historical reports. In this day and age, when storage is effectively free and technology has developed to the point where storing information deltas between time points can be done very efficiently, does it really make sense to throw any historical information away? Does it make sense to have to manually account for time's arrow in every data model?
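A toy temporal store makes the point: never overwrite, append timestamped values, and answer point-in-time queries from the retained history. The Python below is purely illustrative; the names and dates are invented:

```python
class TemporalField:
    """A field whose history is kept: updates append, never overwrite."""
    def __init__(self):
        self.history = []  # (timestamp, value) pairs, appended in time order

    def set(self, t, value):
        self.history.append((t, value))

    def as_of(self, t):
        """Return the value that was in force at time t."""
        value, found = None, False
        for ts, v in self.history:
            if ts <= t:
                value, found = v, True
        if not found:
            raise KeyError("no value recorded at or before %r" % t)
        return value

contact = TemporalField()
contact.set(2001, "J. Smith")
contact.set(2006, "A. Jones")

print(contact.as_of(2009))  # A. Jones -- the current contact
print(contact.as_of(2003))  # J. Smith -- the previous contact survives
```

In a normalized relational design the 2001 value would typically have been overwritten in place; here the "who was the contact before?" question is a one-line query.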
One of the reasons why this dimension of data modeling is under scrutiny is that software developers are increasingly used to the highly time-oriented data management approaches used in source code control systems such as Subversion, Git, Mercurial, Darcs, Microsoft Visual Source Safe and Perforce. Managing a complex corpus of source code has much in common with managing a complex corpus of product data, manufacturing data, personnel data ... the similarities are not being lost on developers who are becoming increasingly used to being able to mix and match structured and un-structured information and manage it all under a system that makes the time dimension easy to access and exploit.
I do not think that any one of the above camps can deliver a killer blow to the pre-eminence of the relational database but taken together, I think they have enough momentum to topple the giant. For years I thought that the relational database was unassailable. After all, the last time a challenger entered the fray -- the object database that accompanied the object oriented analysis and design revolution -- it was summarily dismissed. This time it is different: the enemy is diverse and attacking from all sides. People are revisiting the writings of the early heretics. Terms like NoSQL are being coined. The term "schema free" is accruing acceptability. Open source projects in this space are appearing at a rate of knots: MongoDB, Cassandra and CouchDB to name but three. The phrase 'non-relational persistence' produces quite a few hits in search engines nowadays. Also, I am increasingly detecting the use of the word "legacy" in connection with relational databases!
As Gandhi said, "first they ignore you, then they laugh at you, then they fight you, then you win". I think we are now in stage 3 of that progression.
Pass the popcorn.