ComputerWorld Canada –
Open-source platforms for big data have exploded in popularity. And in the past few months, it seems like nearly everyone is feeling the fallout.
Cost, flexibility and the availability of trained personnel are major reasons for the open-source boom. Hadoop, R and NoSQL are now the supporting pillars of many enterprises' big data strategies, whether they involve managing unstructured data or performing complex statistical analyses on it.
It's almost hard to keep up: SAP AG recently released a new product, SAP BusinessObjects Predictive Analysis, software that integrates algorithms from the open-source R language, which is used extensively in the academic community for advanced statistical modelling.
A few weeks before that, Teradata Corp. announced that its new integrated analytics portfolio would include R functionality as well as a connection to GeoServer, a Java-based open-source geolocation platform. Countless other companies are rushing to build links to Hadoop.
Widespread adoption, feverish innovation
James Kobielus, then an analyst at Forrester Research Inc. (he's now senior program director for product marketing of big data analytics solutions at IBM Corp.), wrote in an e-mail message that "open-source approaches have the momentum of the most widespread adoption and the most feverish innovation."
But what's the rush?
First of all, Kobielus explains, just as open-source products ranging from Mozilla to Android have earned widespread acceptance in the IT community after some birth pains, open-source data storage and analysis software has now matured ("no longer the risky bet they were just a year or two ago," as he puts it).
Secondly, Kobielus wrote, platforms like Hadoop, R and NoSQL have enjoyed an advantage over proprietary software because they were able to evolve faster. They're also being continuously developed and refined by many different parties. Pretty soon, he predicts, open-source will begin to dominate the big data world.
"As the footprint of closed-source software shrinks in many data/analytics environments, many incumbent vendors will evolve their business models toward open-source approaches," he wrote, "and also ramp up professional services and systems integration to assist customers in their moves towards open-source, cloud-oriented analytics, much of it focused on Hadoop and R.
"Forrester regards Hadoop, for example, as the nucleus of the next-generation enterprise data warehouse (EDW) in the cloud, and R as a key codebase in the coming wave of integrated big data development tools. We also expect various open-source NoSQL databases and tools to coalesce into rich alternatives to closed-source content analytics offerings."
The Red Hat model
Different enterprises are approaching open-source integration in different ways. Some, like SAP, have opted to use their own in-house expertise to develop products with Hadoop or R functionality, while others, like Teradata [NYSE: TDC], hand over much of the work to firms like Revolution Analytics Inc., a company that is somewhat like the Red Hat Inc. of big data. The company offers a commercialized version of R geared towards enterprises, much as Red Hat does with Linux.
A small company standing among big data giants, the firm specializes in modifying R for distinct business processes, says David Smith, vice-president of marketing and community at Revolution Analytics. "In particular," he says, "we make it run with really big data sets."
Using open-source in their products is a way for companies to differentiate themselves in the market, says Smith. "By definition," he says, "it means that you're not doing what your competitors are doing."
Smith says that for organizations that take a progressive, scientific approach to big data analysis, open-source technologies are a natural choice. "Those companies that have a bit of a culture around data science, around exploration and curiosity with data, have really gravitated towards open-source technologies because they're so flexible and they lend themselves to these different ways of just thinking about working with data and exploring different things you can do with that."
Scott Gnau, president of Teradata Labs, which has partnered with Revolution Analytics, says large enterprises will benefit most from commercial packages of open-source technology so they can keep their focus on their particular line of business.
"There is a lot of value to be created in adopting some of the newer technologies that are developed in a Hadoop and MapReduce environment, but to deploy them as an enterprise-class kind of software, where there's dependable version control, and there's dependable scalability and there's support available.
"It's got to be packaged and dependable to get into the mainstream because the mainstream doesn't want to be a software development house," he says.
Will Davis, product marketing manager at EMC Greenplum, agrees. Larger companies, he says, need more stable, reliable incarnations of open-source big data platforms, whether they add the polish themselves or rely on others to do it for them.
"A lot of the enterprises... traditional customers of EMC, these sort of large Fortune 500 companies, really need their deployment of this technology to be enterprise-ready, to meet strict SLAs, to be always available," he says.
Some early adopters of open-source technology developed the expertise to go it alone, but "the second wave" of companies, he says, is anxious to get up and running quickly and might not have the internal talent for a do-it-yourself approach.
Enter the data scientist
Big data talent is indeed in great demand these days, and companies are realizing that by running open-source platforms, they'll be in the best position to attract trained people. Open-source technologies, particularly R, are widely used in academia.
These data scientists, moreover, work better with open-source platforms. Imran Ahmad is a data scientist who has developed his own grid-computing algorithm, a Hadoop competitor called Bileg, which is based on the open-source Globus toolkit (GT4). The president of Cloudanum Inc., a Toronto-based company that develops data analysis technologies for cloud environments, he says the fundamental advantage in an open-source platform is that people like him can see its underlying mathematical basis.
"If it's in open-source, you can dig down and see why I'm getting these results, why these results are the optimal ones," Ahmad says.
Proprietary data analytics software will work reasonably well most of the time, he adds. But it's when an "unusual scenario" comes up that you won't be able to trust your results. "They'll be way off from what you're looking for," he says. "And that is a really scary situation."
Not surprisingly, the most brilliant minds with backgrounds in statistical modeling are also in the highest demand, especially since organizations in other sectors, like financial institutions, are scooping them up.
"They've hired a bunch of people out of school to a data science department or an R&D department and a modeling department," says Smith, "and they've found that all of them have been trained in R, and not in, say, SAS."
"We provide a consulting arm of Greenplum," says Davis, "which is our data science team, [who] are PhDs that have expertise in a variety of industries and verticals. I have brainiacs, to be honest with you, who are working with customers to enable them to make use of their data."
Jason Kuo, group marketing manager at SAP, says "without a doubt" companies that need to perform complex tasks like predictive analysis are hunting for manpower in the universities. He says SAP's new product, which incorporates a user-friendly interface and drag-and-drop capabilities, will ease the data scientists' transition into the corporate world.
"Those people are bringing their R expertise, their R background, and are asking for tools around R," he says. "Now what's interesting is, in an academic environment, for whatever reason, whether it's budget or familiarity, they are much more likely to be working with R without a GUI, without a strong graphical interface. And now they walk into a corporate world where their demands are higher, the turnaround frame for projects is faster, maybe ROIs are being tracked and so forth.
"Companies are able to say... what do you need to be more successful? How can we make you more productive? And they have a budget for these statisticians who may not, in the past, have had it."
If you can't beat them...
Paul Kent, vice-president of platform development at SAS Institute Inc., works for a company often seen as belonging to the opposite side of the big data divide, developing proprietary data analysis algorithms that are alternatives to those used in open-source languages like R.
Kent says that to a certain extent, SAS does regard the open-source community as a competitor it needs to keep up with. New techniques can be developed in open-source environments very quickly, while his company may need more time to study them before turning them into a marketable product feature.
"[It] takes a little bit longer for us to react to the technique and to test out all the different corners and permutations of the way you would use it. So, we might be a bit slower to respond."
However, he says SAS has the advantage of a large technical support segment and has the expertise to make certain techniques work for different organizations, whether retail businesses, banks, or healthcare institutions. SAS's strength lies in "the application of mathematics to particular domains," says Kent.
At the same time, he says, SAS keeps abreast of the trends and has opted to give its customers open-source options just the same. Kent says SAS has "built a bridge to R" just like it has with Hadoop. Whenever the open-source community comes up with a good idea, Kent says, SAS is paying attention.
"It's more useful in the long run to build a bridge or an interface to that idea than to try to pretend that it's not there."