The small world of big data

Big data's history includes a lot of interconnected technologies and one notable figure, Hadoop

When we talk about big data and data warehousing, it is almost inevitable that Hadoop will be mentioned. But Hadoop didn't come out of a vacuum -- like most big data technologies, it bears a close relationship with other technologies in this sector. Hadoop uses map/reduce techniques to form a framework on which data is stored and on which applications that get at that data can run. It can trace its origins back to another kind of data warehouse technology: enterprise search.

Enterprise search -- also known as realtime search -- is a method of data storage that takes the concept of searching and applies it to collections, at times very large, of unstructured or partially structured data, such as documents. The best document storage system will utilize some sort of XML or SGML-based tagging to keep those documents' content nice and organized. But in reality, documents will fall quite a bit short of that ideal mark.

That's when enterprise search comes into play. Enterprise search products -- such as ElasticSearch, Apache Lucene, and Apache Solr -- use a concept called facets that enables you to treat data within documents as you would fields within a relational database. Facets are essentially inverted indexes that let you find specific pieces of information in a document, like an address or other customer information.

Enterprise search is ideal if you have a large set of these types of documents to comb through, and need to do some straightforward data mining or business intelligence analysis. The more structured the data, the better: enterprise search does particularly well with documents like weblogs, which are structured uniformly enough to enable deeper data mining.

The connection between enterprise search and the currently much-hyped Hadoop lies in the creator of both technologies: Doug Cutting. Cutting, currently an architect at commercial Hadoop vendor Cloudera, put Lucene together as a Java search engine library in 1998.
But life (and the Internet boom) pulled Cutting away from his Java project. By the time 2000 rolled around, Cutting opted to take this perfectly good search engine library and open source it under the GPL on SourceForge. After sharp pushback from potential users, Cutting would later switch the license to the less-restrictive LGPL. When the project was invited to join the Apache Software Foundation in 2001, Cutting was urged to take them up on the offer, and from then on Lucene would be under the ASF umbrella and licensed under the Apache Software License.

Cutting would continue to work on Lucene, developing the technology into the open source Nutch search engine, which was a full-on application as opposed to a library like Lucene. Nutch was very much geared towards web search and shared many of the features found in enterprise search, such as web crawling, document format and language detection, and parsing.

But, as powerful as Nutch would prove to be, it was not scalable enough to search enterprise-level datasets. Multi-node installations, even with as few as four nodes, proved difficult to manage, and space allocation and resource management became the limiting factors for anything over 100 million pages.

Thus Hadoop was born: split out of Nutch as its own project in 2006 and made a top-level Apache project in 2008, it uses distributed computing techniques to provide the framework on which Nutch would run. That framework pairs the Hadoop distributed filesystem with MapReduce, both of which were modeled on Google projects.

Cutting's Lucene would not only foster the creation of the MapReduce-based Hadoop technology, but it would also form the basis of other enterprise search technologies. In particular, ElasticSearch and Apache Solr are both enterprise search tools, accessed over the web, that make use of the Lucene Java library. There is much debate in the enterprise search sector about which of these two tools does a better job. Solr is reputedly very fast, but ElasticSearch's distributed capabilities mean that the job can be shared across many distributed resources and therefore deliver similar performance.

The evolution of technology is interesting, but not just from a purely academic standpoint. Understanding how these technologies fit together will give users a better idea of which solution is right for them.
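
For readers who want to see the inverted-index idea mentioned earlier in concrete form, here is a minimal sketch in Python. It is not how Lucene, Solr, or ElasticSearch are actually implemented; the function names (tokenize, build_inverted_index, search) and the sample documents are invented purely for illustration. The sketch only shows the core trick: instead of scanning every document for a word, you build a map from each word to the documents that contain it, and then look the word up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # Intersect the document sets of all query terms (a simple AND query).
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical sample documents, invented for illustration only.
documents = {
    "doc1": "Customer Jane Doe, 12 Elm Street, Springfield",
    "doc2": "Invoice for John Smith, 40 Oak Avenue, Springfield",
    "doc3": "Support ticket from Jane Roe about a late delivery",
}

index = build_inverted_index(documents)
print(search(index, "springfield"))       # ['doc1', 'doc2']
print(search(index, "jane springfield"))  # ['doc1']
```

Real search engines layer a great deal on top of this (relevance scoring, stemming, on-disk index formats, and distribution across machines), but the lookup-by-term structure is the part that lets loosely structured documents be queried somewhat like database fields.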

Read more of Brian Proffitt's Zettatag and Open for Discussion blogs and follow the latest IT news at ITworld. Drop Brian a line or follow Brian on Twitter at @TheTechScribe. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.
