March 04, 2014, 5:07 PM — The U.S. National Institute of Standards and Technology (NIST) wants to bring some metrics and rigor to the nascent but rapidly growing field of data science.
The government agency is embarking on a project to develop, by 2016, a framework that industries can use to plan, and measure the results of, data science and big data projects.
NIST, an agency of the U.S. Department of Commerce, is holding a symposium on Tuesday and Wednesday at its Gaithersburg, Maryland, headquarters with big data specialists and data scientists to better understand the challenges around the emerging discipline.
"Data science is pretty convoluted because it involves multiple data types, structured and unstructured," said event speaker Ashit Talukder, NIST chief for its information access division. "So metrics to measure the performance of data science solutions is going to be pretty complex."
Starting with this symposium, the organization plans to seek feedback from industry about the challenges and successes of data science and big data projects. It then hopes to build a common taxonomy with the community that can be used across different domains of expertise, allowing best practices to be shared among multiple industries, Talukder said.
While computer-based data analysis is nothing new, many of the speakers at the event talked about a fundamental industry shift now underway in how data is analyzed.
Doug Cutting, who originally created the Hadoop data processing platform, noted that what made Hadoop unique is that it took a different approach to working with data. Instead of moving the data to a place where it can be analyzed -- the approach used with data warehouses -- the analysis takes place where the data itself is stored.
"You can't move [large] data sets without major performance penalties," Cutting said. Since its creation in 2005, Apache Hadoop has set the stage for storing and analyzing data sets so large that they cannot fit into a standard relational database, hence the term "big data."
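The data-locality idea Cutting describes can be sketched in a few lines. This is not Hadoop's actual API, just a minimal illustration: the hypothetical `partitions` list stands in for data blocks stored on different nodes, the map step runs "where the data lives" and emits only small per-partition counts, and only those small intermediate results are merged centrally.

```python
from collections import Counter

# Hypothetical partitions, standing in for data blocks stored on different nodes.
partitions = [
    ["big data needs new tools", "data science is growing"],
    ["hadoop moves code to data", "data stays put"],
]

def map_partition(lines):
    """Map step: runs locally where each block is stored,
    emitting only a small per-partition word count."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partial_counts):
    """Reduce step: merges the small intermediate results,
    so the large raw data never has to cross the network."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

word_counts = reduce_counts(map_partition(p) for p in partitions)
print(word_counts["data"])  # "data" appears 4 times across both partitions
```

The point of the pattern is that only the compact `Counter` objects move between the map and reduce steps, while the (potentially huge) raw records stay put.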
As these data sets grow larger, the tools for working with them are changing as well, noted Volker Markl, a professor and chair of the database systems and information management group at the Technische Universität Berlin.
"Data analysis is becoming more complex," Markl said. As a discipline, data science is challenging in that it requires both an understanding of the technologies used to handle the data, such as Hadoop and R, and the statistics and other forms of mathematics needed to harvest useful information from the data, Markl said.