Ensuring data consistency and accuracy is one of the biggest challenges. "We're making decisions with data, yet it's very hard to actually make sure that the data is correct," says Mardenfeld, who's focused on building the infrastructure that powers Etsy's big data projects. "We put a lot of work into error checking, making sure our collection pipelines are working. Data is a little bit of a different beast. You can't just get your code to compile. You have to compile and also make sure that it makes sense. I think that's the hardest part about this."
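To make the "compile and also make sure that it makes sense" point concrete, here is a minimal sketch of one kind of sanity check a collection pipeline might run. It is an illustration only and not Etsy's actual tooling: the `DailyCount` record, the `find_suspect_days` function, and the 50 percent drop threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DailyCount:
    """One day's event total for a hypothetical metric."""
    day: str     # ISO date string, e.g. "2013-06-01"
    events: int  # number of events collected that day


def find_suspect_days(counts: Iterable[DailyCount], max_drop: float = 0.5) -> List[str]:
    """Return days whose volume looks wrong.

    A day is flagged if its count is negative, or if it fell by more than
    `max_drop` (a fraction) relative to the previous day. Either case
    suggests a collection problem rather than a real change in behavior.
    The threshold is an assumed value chosen for illustration.
    """
    suspects = []
    previous = None
    for row in counts:
        if row.events < 0:
            suspects.append(row.day)
        elif previous is not None and previous > 0:
            if row.events < previous * (1 - max_drop):
                suspects.append(row.day)
        previous = row.events
    return suspects


if __name__ == "__main__":
    history = [
        DailyCount("2013-06-01", 100),
        DailyCount("2013-06-02", 95),
        DailyCount("2013-06-03", 30),  # sharp drop: likely a logging gap
        DailyCount("2013-06-04", 32),
    ]
    print(find_suspect_days(history))
```

The code runs without error whether or not the data is right, which is exactly the problem Mardenfeld describes; a check like this is what catches the day the numbers stop making sense.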
On the platform side, Hadoop plays a key role in storing and processing the data. Etsy runs dozens of workflows each night on Amazon's cloud-based Elastic MapReduce service. Rather than keeping a single cluster running continuously, Etsy brings up a new cluster for each job so it can tailor the number and types of instances to the workload.
"We have our own custom event-logging frameworks, and we store all the data in [Hadoop Distributed File System (HDFS)]. We process the data into ETL using a data flow language known as Cascading, and then we push it downstream to a data warehouse, which is Vertica," Mardenfeld says.
Etsy also uses Elastic MapReduce clusters to analyze the data and perform predictive analytics. "Hadoop is an important part of our pipeline. I don't think we'd be able to do any of this without it," Mardenfeld says.
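The flow Mardenfeld describes (raw event logs, an ETL step, then a load into a warehouse) can be sketched in a few lines. This is a simplified stand-in written in plain Python, not Cascading code and not Etsy's implementation: the tab-separated log format and the function names `parse_event` and `daily_event_counts` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_event(line: str) -> Tuple[str, str]:
    """Extract (date, event_type) from one raw log line.

    Assumes a hypothetical tab-separated format with three fields:
    ISO timestamp, event type, user id.
    """
    timestamp, event_type, _user_id = line.rstrip("\n").split("\t")
    return timestamp[:10], event_type  # first 10 chars of ISO timestamp = date


def daily_event_counts(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Aggregate raw events into (date, event_type, count) rows.

    The sorted rows are the kind of compact summary that would be pushed
    downstream to a warehouse table for SQL analysis.
    """
    counts = Counter(parse_event(line) for line in lines if line.strip())
    return sorted((day, event, n) for (day, event), n in counts.items())


if __name__ == "__main__":
    raw = [
        "2013-06-01T10:00:00\tview\tu1",
        "2013-06-01T11:00:00\tview\tu2",
        "2013-06-01T12:00:00\tpurchase\tu1",
        "2013-06-02T09:00:00\tview\tu3",
    ]
    for row in daily_event_counts(raw):
        print(row)
```

In a real Hadoop job the parsing and counting would be distributed across the cluster; the shape of the work (parse, group, aggregate, emit rows) stays the same.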
To digest the data, Etsy has built a number of homegrown tools. "We write a bunch of custom UIs for this, for our internal tools. One of them is what we call the A/B Analyzer, which allows us to easily do analysis on experiments that we run. We also have our own internal funnel tool and our own dashboard tool," Mardenfeld says.
The homegrown presentation tools make it easier for teams throughout Etsy to access and use data for experimentation and to inform product development, even if they don't have statistical expertise. A launch calendar keeps track of all active experiments at Etsy, and employees can click on any experiment to see its results to date in the homegrown dashboards.
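The article does not say which statistics the A/B Analyzer computes. As one common example of the kind of analysis such a tool could automate for non-statisticians, the sketch below runs a two-proportion z-test on conversion counts from a control and a variant. The function name and the sample numbers are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Compare conversion rates of control (a) and variant (b).

    Returns (z, p_value), where p_value is two-sided and based on the
    pooled-variance normal approximation. n_a and n_b must be positive.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (everyone or no one converted): no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 100 of 1,000 control visitors convert vs. 150 of 1,000 in the variant.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Wrapping a test like this behind a dashboard is what lets someone click on an experiment and read a result without writing the statistics themselves.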
"We had a lot of questions that were of the same type, so we've generalized those so it's easy for people to get the answers to those questions without doing a lot of work," Mardenfeld says. "For more custom questions, you can answer questions in Vertica, you can use SQL, and for more in-depth data mining and analysis and building products, then you can jump down to writing things in MapReduce and Cascading."