What have the Roman Empire and the relational database got in common? Not too much at the moment, I would suggest, but in a few short years I think we will be seeing an interesting similarity in their life histories. Pretend we are somewhere in the next decade... Here is how it might look in Wikipedia if I'm right: “Like the Roman Empire, the Relational Database grew gradually to be an extremely dominant force. Like the Roman Empire, it suffered gradual collapse, the causes of which are debated by historians to this day...”
Of course, from the current vantage point of 2009 – still some years away, I suspect, from the collapse of the relational database hegemony – it is difficult to predict how history will be written. Will the relational database have the equivalent of the Romulus Augustulus abdication moment? Will some historians attribute its demise to some modern-day IT equivalent of lead poisoning? Will I prove to be utterly wrong in the grand tradition of technology prognosticators?
I am predicting that the fall of the relational database will indeed happen and, furthermore, I am predicting that history will not point to any one single event as the trigger. Rather, I suspect that a combination of forces, a variety of ideological movements and technological developments, acting contemporaneously, will create a perfect storm for the relational database. I see seven main forces at work. I have given each force a somewhat whimsical name in what follows. Here is a list of all seven with a brief explanation of each. The rest of the article goes into a little more detail on each one.
- The hierarchicalists: promote the benefits of hierarchical information models such as XML and the emergence of viable mechanisms for querying and processing large corpora of such information. The poster boy here is the XQuery language and, to a lesser extent, Microsoft's XLinq.
- The chaoticians: promote the benefits of ex-post-facto information structuring, i.e. rather than worry about finding a perfect structured model for information, chaoticians keep it all loose – maybe as a set of office documents or HTML pages or spreadsheets – and then use tools to retro-fit or reverse engineer structure on top of the purposely loose, chaotic corpus. The poster boy of this movement is, of course, the Google search engine, which manages to achieve a high degree of findability from a low level of explicit information structure. In fact, Google make a public point of not using the metadata capabilities of HTML, capabilities that could be used to add structure on top of a corpus of HTML. Also in the chaotician camp are products like Microsoft SharePoint, Lotus Notes and Autonomy, which purposely blur the distinction between structured and un-structured information and support findability of both within their information retrieval features.
- The steganographers: promote the benefits of sprinkling structure information inside largely unstructured information and then using parsing software to dig it out and synthesize it, thus creating a graph of structure on top of a set of unstructured information artifacts. Microformats and RDFa fit into this movement. The concept of a mashup and the “linked data” initiatives are emerging as the poster children here. The steganographers and the chaoticians have much in common, but steganographers are more inclined to try to work out in advance what the data model should look like.
- The democriticians: promote the benefits of decomposing information into triples and treating all higher levels of structure as derived from this atomic level. Web 3.0, OWL and the semantic web are all in this category.
- The parallelizers: promote the benefits of simple key/value information models and like to point out how enormous compute power and storage can be brought to bear cheaply to derive higher order information from simple key/value models. Google's MapReduce is the poster child here.
- The agilists: promote the benefits of iterative development of information models. Rather than the classic relational approach of building your data model and then building the applications around it, agilists often hold the view that the model needs to be as fluid as the applications built around it. The relational model, as implemented in many database management systems, has many positive attributes but fluidity is not one of them.
- The temporalists: point to the weakness of the relational model when it comes to one of the most frequent concerns in IT systems, namely how information changes over time. Although the time dimension can be factored into relational systems, it is not something that the model itself promotes. In fact, it can be argued that relational data normalization is antithetical to the common requirement of capturing “point in time” views of a business process or a corpus of content.
The rise of the hierarchicalists
XML, like SGML before it, takes the view that much information is naturally hierarchical in form. SGML never really caught on outside of some niche areas but XML – its successor – is slowly but surely carving out a following in the mainstream database management arena. For all its faults, the W3C XML Schema Language (XSD) is one reason for this, but I think it is XQuery that is really responsible for the shift to center stage. Not many technologists under the age of, say, forty will remember but, believe it or not, there was life in the field of hierarchical database management long before SGML and XML came along, in the form of IMS. It can be argued that technology and adoption are really just catching up with an idea that is now over forty years old.
The intriguing thing about XQuery is that it sets out not so much to kill the relational database but more to extend it. It does this in the time-honored way of treating the enemy as a mere special case of a more powerful data modelling abstraction. That is, to the hierarchicalists, relational tables are merely very regular, shallow and non-recursive hierarchies.
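To make the "tables are just shallow hierarchies" point concrete, here is a minimal sketch in Python. It is not XQuery: it uses the limited XPath subset in the standard library's ElementTree module as a stand-in, and the table, column names and helper functions are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: column names and rows are invented for this sketch.
COLUMNS = ("id", "name", "city")
ROWS = [
    (1, "Ada", "London"),
    (2, "Linus", "Helsinki"),
    (3, "Grace", "London"),
]

def table_to_tree(table_name, columns, rows):
    """Render a flat table as a two-level hierarchy: table -> row -> column."""
    table = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return table

def names_in_city(table, city):
    """Path-style analogue of: SELECT name FROM customers WHERE city = ?"""
    matching_rows = table.findall(f"row[city='{city}']")
    return [row.findtext("name") for row in matching_rows]

customers = table_to_tree("customers", COLUMNS, ROWS)
print(names_in_city(customers, "London"))
```

Every row has the same children and nothing nests any deeper, which is exactly the "regular, shallow and non-recursive" special case. The same query machinery would carry on working if some rows grew extra, nested children.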
The rise of the democriticians
Democritus commonly gets credit for being the first person to speculate that matter is really all made up of super small indivisible particles: atoms. In information theory, it is common to think of a triple consisting of a subject, a predicate and an object as being an atom of information. Democriticians hold that the best way to create data models is to start at this triple level and build everything up from there. The idea has a long, long history. It can be argued that Prolog explored this approach in the Seventies. It can also be argued that the CODASYL model popular in the days of COBOL – with its network approach to data modelling – also covered this territory. Indeed, the philosopher C.S. Peirce was arguably drawing RDF diagrams with a quill pen back in 1885.
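A rough sketch of the democritician position in Python follows. It is a toy in-memory triple list, not RDF or any real triple store; the facts, identifiers and function names are hypothetical. The point is only that a record-like view can be derived from the atoms rather than designed up front.

```python
from collections import defaultdict

# Hypothetical facts, each an atomic (subject, predicate, object) triple.
TRIPLES = [
    ("invoice:17", "type", "Invoice"),
    ("invoice:17", "customer", "customer:ada"),
    ("invoice:17", "total", 250),
    ("customer:ada", "type", "Customer"),
    ("customer:ada", "name", "Ada"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

def as_records(triples):
    """Derive a row-like view: one dict per subject, built from its triples."""
    records = defaultdict(dict)
    for s, p, o in triples:
        records[s][p] = o
    return dict(records)

print(match(TRIPLES, predicate="type", obj="Invoice"))
print(as_records(TRIPLES)["customer:ada"])
```

Adding a new kind of fact means appending a triple, not altering a schema. The higher-level "customer record" is just one derived view among many.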
The rise of the parallelizers
It is pretty evident, I think, that we are entering the age of the parallel – whether we like it or not. It seems that Moore's law is slowing down. Individual processor cores are not getting faster at the rate we have become used to. Instead, there are more and more of them crammed into each chip. The term “CPU” is becoming increasingly inaccurate. The Von Neumann architecture that has served so long as the fundamental abstraction no longer fits the facts. The facts increasingly consist of umpteen virtualized machines, bottomless pits of storage and an increasingly "on demand" approach to compute resource allocation.
It is true that some developers creating cloud computing platforms, or utilizing the cloud ecosystems being created by companies such as IBM, Microsoft, Amazon and Google, start by firing up a relational database, but a goodly number are jumping straight for designs based on MapReduce or Hadoop or CouchDB, to name but three.
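For readers who have not met the style, here is a single-process sketch of the map/shuffle/reduce pattern in Python. It does not use the Hadoop or Google APIs and runs on one machine; the log lines and function names are invented. What it shows is how little the data model asks for: nothing but key/value pairs.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical page-view log, one "user url" entry per line.
LOG = ["ada /home", "linus /docs", "ada /docs", "grace /home", "ada /home"]

def views_by_url(line):
    user, url = line.split()
    yield url, 1

def total(key, values):
    return sum(values)

result = reduce_phase(shuffle(map_phase(LOG, views_by_url)), total)
print(result)
```

Because each mapper call looks at one record and each reducer call looks at one key, the work can in principle be spread across as many machines as you care to rent, which is the whole attraction for the parallelizers.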
The rise of the chaoticians
Over the last decade, we have seen an increasingly jaundiced eye being turned toward what I would call the library sciences. Foundational concepts like controlled vocabularies, taxonomies, data types, part/whole relationships, relational algebras etc. all require data modelling work up front. Often, they require a lot of work up front. The theory goes that the time spent up front on the information design will allow the rest of the project to proceed waterfall-style and will lead to systems that are optimal in terms of performance and accuracy.
There are a number of problems with this theory, say the chaoticians. Firstly, you never know up front what the information model will need to be. Rather, you discover it as you go along (a world view that resonates with the agilists also). Secondly, it is no longer such a big deal to have an optimal model in terms of storage or performance. Who cares if a more chaotic model entails some extra processing or eats some more storage or creates some “false positives” in the results? An abundance of cheap processing power and storage density deals with the former and human nature deals with the latter. Look at a set of results from a search engine. Some are false positives. In fact, the majority of them may be false positives. The search "hits" are statistical in nature or, put another way, wrong. The chaoticians argue that it often makes more sense to deal with the statistical nature of the results than to slave for years to find the perfect, normalized, relational model.
The chaoticians often point to the memo fields and blob storage layers in relational databases as evidence to support their cause. Over time, it is not unusual for memo fields to end up as repositories of very rich information in a relational database. Once out of the rigorously controlled, rigorously data-typed table/field structure, that information is "in the database" but it effectively bypasses the data model. "If a lot of the good stuff is going to end up in memo fields", say the chaoticians, "why bother building an elaborate table/field structure that will atrophy over time anyway?"
The rise of the steganographers
This movement overlaps to some extent with the democriticians and the chaoticians. Steganographers point out that “structured” is really a subset of “un-structured” when it comes to information. If there are a few identifiable integer, date and dollar fields to be found in all the invoices, curriculum vitae, recipes, product descriptions etc., why not simply smuggle them inside the word processor or HTML pages that hold all the rest of the information? To the steganographers, much of the world's information is semi-structured at best. Their position is that it is better to start with an open-ended design in which anything goes, i.e. the use of very loose data models such as text fields, word processor files, spreadsheets etc., and layer on whatever islands of structure you can. Aiding their cause is the emergence of tools and techniques to effectively index large corpora of semi-structured text. Many search engines support the creation of "fields" that can be embedded into otherwise unstructured documents, and these are indexed and queried in a way closely analogous to how relational databases function. The primary difference, say the steganographers, is that the messy, irregular real world documents remain the real deal and the indexing sub-system is simply a finding aid - not the repository per se.
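Here is a small sketch of the smuggling idea in Python, using only the standard library's HTML parser. The `property` attribute is borrowed from RDFa, but this is not an RDFa processor: the document, the field names and the extractor class are hypothetical, and the code assumes annotated elements contain no nested tags.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect the text of elements that carry a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field being read, if any

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("property")
        if name:
            self._current = name
            self.fields[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: assumes annotated elements contain no nested tags.
        self._current = None

# Hypothetical semi-structured document with three smuggled fields.
DOCUMENT = """
<p>Thanks for your order! Invoice <span property="invoice_no">A-1043</span>
was issued on <span property="issued">2009-03-02</span>. The total came to
<b property="total">250.00</b> dollars, as we discussed on the phone.</p>
"""

def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields

print(extract_fields(DOCUMENT))
```

The extracted dictionary is the finding aid. The letter itself, chatty prose and all, stays the real record.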
The rise of the agilists
Agilists prize one thing above all else and that is the speed with which an IT system can change shape over time. Back in the Eighties, the world experimented with 4GLs, many of which had the notion of evolving a data model hand in glove with the applications built on top of it. More recently, web-oriented database application development systems like Django and Ruby on Rails have done much to promote the idea that the application level data structures are really primary and that the relational data model “falls out” as just one possible way to represent the application model at the storage layer.
The intriguing thing about this is that the data modelling language is not predicated on a relational storage model. It just so happens that the first back-ends for these frameworks have been relational. The very fact that both frameworks speak of relational databases as one possible "back end" speaks volumes for what is going to happen in this space. Namely, we will see more and more back-ends for these frameworks that are not relational at all.
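The shape of the idea can be sketched in a few lines of Python. This is not how Django or Rails are implemented; the `Article` model and both backend classes are hypothetical, and both backends assume the model has a `slug` field to use as its key. The application-level class comes first, and the relational schema is derived from it as one storage option alongside a non-relational one.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, fields

@dataclass
class Article:
    """The application-level model: this is the primary definition."""
    slug: str
    title: str
    body: str

class RelationalBackend:
    """Derives a table from the model's fields and stores rows in SQLite."""

    def __init__(self, model):
        self.model = model
        self.table = model.__name__.lower()
        self.columns = [f.name for f in fields(model)]
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(f"CREATE TABLE {self.table} ({', '.join(self.columns)})")

    def save(self, obj):
        marks = ", ".join("?" for _ in self.columns)
        values = [getattr(obj, column) for column in self.columns]
        self.conn.execute(f"INSERT INTO {self.table} VALUES ({marks})", values)

    def load(self, slug):
        row = self.conn.execute(
            f"SELECT * FROM {self.table} WHERE slug = ?", (slug,)
        ).fetchone()
        return self.model(*row) if row else None

class DocumentBackend:
    """Stores each object as a JSON document keyed by slug; no tables at all."""

    def __init__(self, model):
        self.model = model
        self.docs = {}

    def save(self, obj):
        self.docs[obj.slug] = json.dumps(asdict(obj))

    def load(self, slug):
        raw = self.docs.get(slug)
        return self.model(**json.loads(raw)) if raw is not None else None

article = Article("fall-of-rome", "The Fall", "Like the Roman Empire...")
for backend in (RelationalBackend(Article), DocumentBackend(Article)):
    backend.save(article)
    print(type(backend).__name__, backend.load("fall-of-rome") == article)
```

The application code in the loop neither knows nor cares which backend it is talking to, and that indifference is what leaves the door open for non-relational stores.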
The rise of the temporalists