"It doesn't seem there are any limitations other than good engineering required to get there," he said. "Moore's law or not, we have essentially unlimited computation available."
Spector sees a day when distributed computing systems will offer Web developers what he calls "totally transparent processing," where the big data analytics engines learn over time to, say, translate file or block data into whatever language is preferred based on a user's profile history, or act as a moderator on a Web site, identifying spam and culling it out. "We want to make these capabilities available to users through a prediction API. You can provide data sets and train machine algorithms on those data sets," he said.
Yahoo has been the largest contributor to Hadoop to date, creating about 70% of the code. The search site uses it across all lines of its businesses, and has standardized on Apache Hadoop, favoring its open source attributes.
Papaioannou said Yahoo has 43,000 servers, many of which are configured in Hadoop clusters. By the end of the year, he expects his server farms to have grown to 60,000 because the site is generating 50TB of data per day and has stored more than 200 petabytes.
"Our approach is not to throw any data away," he said.
That is exactly what other corporations are hoping to accomplish: Use every piece of data to their businesses' advantage so that nothing goes to waste.
Lucas Mearian covers storage, disaster recovery and business continuity, financial services infrastructure and health care IT for Computerworld. Follow Lucas on Twitter at @lucasmearian or subscribe to Lucas's RSS feed . His e-mail address is email@example.com .
Read more about bi and analytics in Computerworld's BI and Analytics Topic Center.