Inspired by Cutting's work and sense of humour, Mattmann has started his own project called Tika - a text analysis tool that detects and extracts metadata and structured text content from various documents using existing parser libraries. It is named after a soft toy belonging to the daughter of his partner in the project.
"Most of the open source work I do is through Apache, a lot of it has to do with the Apache licence being a very permissive licence," Mattman says. "It allows people downstream that leverage Apache based software to use that upstream open source component in arbitrary ways. It makes it so the software I build -- when we distribute it to customers, or others we collaborate with, we don't have to give them any surprises."
Mattmann says NASA has been an active user of open source software for around 15 years but only recently has it become active on the production side. For the past two years NASA has held open source summits, outlining its contribution to open source.
NASA categorises its data into different levels, and in the next-generation earth science system satellite area where Mattmann works, it is publicly distributed via DAACs (Distributed Active Archive Centres). He says the programs and tools used to process data vary depending on the preferences of the scientists involved in the project. "A lot of times the software itself is coupled to the instrument."
Level zero data is raw data that comes off the instrument, and level one data is data that has started to be calibrated from raw voltages.
Mattmann says that the public can have access, through the DAACs, to level two data. This is data that is calibrated, geospatially identified and mapped to a physical model (measurements that can be mapped in space and time).
"It's so voluminous, because it's raw measurements in space and time from an instrument. You probably won't use that in your IT organisations, it might be too big for you," he says.
It's when you get to level three data, which is typically mapped or gridded information, that the user can really "crank on it" because the files are a lot smaller and more manageable, says Mattmann. This information is often used in discussions about temperature and climate change.
"With each level of processing there are more assumptions that are codified into the data. More scientific assumptions that you didn't necessarily make," Mattmann points out.
His enthusiasm for big data projects is contagious, but when asked how he came to have a career as a NASA computer scientist, he says it's a "lame story".