Perhaps the biggest challenge facing those pursuing Big Data is finding a platform that can cost-effectively store all current and future information and keep it available online for analysis. That means a highly scalable platform. Such platforms consist of storage technologies, query languages, analytics tools, content analysis tools, and transport infrastructures -- there are many moving parts for IT to deploy and look after.
These tools are available from many proprietary and open source providers, often startups but also established cloud technology companies such as Amazon.com and Google -- in fact, use of the cloud helps solve the Big Data scalability issue, both for data storage and for computational capability. However, Big Data does not necessarily have to be a "roll your own" type of deployment. Large vendors such as IBM and EMC offer tools for Big Data projects, though their costs can be high and hard to justify.
Hadoop: The core of most Big Data efforts
In the open source realm, the big name is Hadoop, a project administered by the Apache Software Foundation that consists of Google-derived technologies for building a platform to consolidate, combine, and understand data.
Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. The goal of those services is to provide a foundation where the fast, reliable analysis of both structured and complex data becomes a reality. In many cases, enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old and new data sets in powerful new ways. Hadoop allows enterprises to easily explore complex data using custom analyses tailored to their information and questions.
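To make the MapReduce idea concrete, here is a minimal, single-machine sketch of its three stages (map, shuffle, reduce) applied to a word count. It illustrates the technique only and does not use Hadoop's actual API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and a real cluster would run the map and reduce steps in parallel across many servers rather than in one process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key so each key is reduced once."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (illustrative names only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one input split stored on a different server.
    splits = ["big data needs big storage", "hadoop stores big data"]
    print(word_count(splits))
```

Because each call to map_phase depends only on its own split, and each call to reduce_phase depends only on one key's values, both stages can in principle be spread across as many servers as are available, which is where the parallelism comes from.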
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data -- and run large-scale, high-performance processing jobs -- in spite of system changes or failures.
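One way to picture this self-healing behavior is as block replication: HDFS keeps several copies of each data block on different servers and makes new copies when a server disappears. The toy model below sketches that idea under simplifying assumptions. The ToyCluster class and its methods are hypothetical and are not part of Hadoop, and the placement rule (pick the emptiest server) is an assumption chosen for brevity, not HDFS's real placement policy.

```python
from typing import Dict, List, Set


class ToyCluster:
    """Hypothetical in-memory model of replicated block storage."""

    def __init__(self, replication: int = 3) -> None:
        self.replication = replication
        self.servers: Dict[str, Set[str]] = {}  # server name -> block ids held

    def add_server(self, name: str) -> None:
        """Join a server, then top up any under-replicated blocks."""
        self.servers.setdefault(name, set())
        self._heal()

    def remove_server(self, name: str) -> None:
        """Drop a server (planned or failed), then restore lost copies."""
        self.servers.pop(name, None)
        self._heal()

    def store(self, block_id: str) -> None:
        """Write a new block with as many copies as the cluster allows."""
        if not self.servers:
            raise ValueError("cannot store a block in an empty cluster")
        self._place(block_id)

    def holders(self, block_id: str) -> List[str]:
        """Return the sorted names of servers holding a copy of the block."""
        return sorted(n for n, blocks in self.servers.items() if block_id in blocks)

    def _place(self, block_id: str) -> None:
        # Never ask for more copies than there are servers.
        target = min(self.replication, len(self.servers))
        while len(self.holders(block_id)) < target:
            candidates = [n for n in self.servers if block_id not in self.servers[n]]
            # Assumed policy: copy to the server holding the fewest blocks.
            emptiest = min(candidates, key=lambda n: (len(self.servers[n]), n))
            self.servers[emptiest].add(block_id)

    def _heal(self) -> None:
        # Only blocks with at least one surviving copy can be re-replicated.
        surviving = set().union(*self.servers.values())
        for block_id in sorted(surviving):
            self._place(block_id)


if __name__ == "__main__":
    cluster = ToyCluster(replication=2)
    for name in ["a", "b", "c"]:
        cluster.add_server(name)
    cluster.store("blk1")
    print("before failure:", cluster.holders("blk1"))
    cluster.remove_server("a")
    print("after failure: ", cluster.holders("blk1"))
```

In this sketch a block stays readable as long as one copy survives, which is why losing a single server does not interrupt access to the data.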
Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-ons, cross-integration, and custom implementations of the technology. To that end, Hadoop offers subprojects, which add functionality and new capabilities to the platform: