How to work with firehose data
Some data comes in at a constant, and even massive, rate. How can IT systems contend with such data flows, while still analyzing that data? PostgreSQL expert Josh Berkus outlines the important guidelines.
It goes without saying that big data is, well, big. But it's not just the size that's an obstacle when dealing with big data: it's the rate at which that data can be coming into your data storage infrastructure.
Not too long ago, before the age of data automation, data would typically come into an organization at predetermined, set times: a good example would be business hours. When data was entered from nine to five, it grew at perfectly predictable rates, and was accessed and analyzed at equally predictable rates. Even better, the downtime that occurred when everyone went home at night would enable the night-owl DBAs to make updates and repairs to the database in question.
There may have even been - are you sitting? - overtime pay in it for them.
Many businesses still work with data in this manner, some even exclusively. (Gone, in many cases, is this strange word known as "overtime.") But more and more often, data is coming in from automated sources that don't have downtime, and could be firing data to an organization every second of every day. And significant amounts, at that.
This, then, is what the data gurus call firehose data - a steady and powerful stream of data that your IT infrastructure may be required to manage, and when all is said and done, actually use for business decisions.
According to Josh Berkus, CEO of PostgreSQL Experts Inc., there are four inherent challenges of working with firehose data. Berkus addressed those characteristics in a talk Jan. 22 at the Southern California Linux Expo.
First, the firehose will have a lot of volume: anywhere from hundreds to thousands of facts per second. That volume may not arrive at a steady rate, Berkus added, as it can have spikes, come from multiple uncoordinated sources, and may grow over time.
The second challenge is that, while the volume can vary, the flow itself will be nearly constant, arriving on a 24/7 cycle. This means DBAs can't stop their systems to process the data, nor bring down an entire infrastructure for maintenance. This, and the fact that data can also arrive out of order, means extract, transform, and load (ETL) operations are pretty much not happening.
The third obstacle, Berkus told his audience, was that the database itself was going to be large - with multiple terabytes to petabytes of data to be handled.
"This means a lot of hardware," Berkus said, "because single-node database management systems aren't going to be enough." It's not just catching the data, either - analytic operations on data sets this big are also extremely resource-intensive. Complicating matters further is database growth, "because no one ever wants to throw data away."
The final hurdle is dealing with component failure. "All components fail," Berkus declared, "but collection of data must continue - even if the network fails."
These challenges are the ones Berkus and his team meet head on when they work on new projects. One such project PostgreSQL Experts tackled recently was for Upwind, a wind farm management company.
Windmills, it seems, generate a lot of data. You might say a hurricane's worth: speed, wind speed, heat, and vibration are just some of the data generated every second by those massive power generators that dot landscapes across the planet. In fact, Berkus explained, each turbine can push out 90 to 700 facts per second. With 100 turbines per farm, and Upwind managing 40-plus such farms, this gives the company upwards of 300,000 facts per second with which to contend.
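For readers who want to check the math, the figures quoted above can be multiplied out in a few lines of Python. This is only a back-of-the-envelope sketch: the function and constant names are made up for illustration, and "40-plus" farms is treated as exactly 40, so the results are a floor rather than a measurement.

```python
# Figures quoted in the article; the names themselves are illustrative.
FACTS_PER_TURBINE_LOW = 90    # facts per second, quiet turbine
FACTS_PER_TURBINE_HIGH = 700  # facts per second, busy turbine
TURBINES_PER_FARM = 100
FARMS = 40                    # assumption: "40-plus" taken as exactly 40


def fleet_rate(facts_per_turbine, turbines_per_farm=TURBINES_PER_FARM, farms=FARMS):
    """Return total facts per second across every turbine on every farm."""
    return facts_per_turbine * turbines_per_farm * farms


low_rate = fleet_rate(FACTS_PER_TURBINE_LOW)    # 360,000 facts per second
high_rate = fleet_rate(FACTS_PER_TURBINE_HIGH)  # 2,800,000 facts per second

print(f"{low_rate:,} to {high_rate:,} facts per second")
```

Even with every turbine at its quietest, the total clears the 300,000 mark, and a fleet running flat out would push it toward 2.8 million.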
Complicating this is that Upwind is working with multiple turbine owners, who want their own windmills' data separately and may have different algorithms and analytics to measure that data - techniques they assuredly do not want their competition to have.
Berkus described to his audience a series of solutions that featured very elastic connections to handle the peaks and valleys of incoming data and a parallel infrastructure to deal with the multi-tenant requirements. The systems described by Berkus in his talk highlighted the need to deal with out-of-order data (connections from the turbines to the datacenter can and do fail occasionally), as well as the need to be extremely fault-tolerant.
Naturally, the solution Berkus described had multiple PostgreSQL nodes at its heart, but there were also Hadoop nodes in place to manage the system's multiple time-based rollup tables for working with aggregate data.
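To see why time-based rollups sit comfortably alongside out-of-order data, consider a minimal Python sketch. It is not the PostgreSQL and Hadoop system Berkus described; the `MinuteRollup` class, the one-minute bucket size, and the `(turbine_id, event_ts, value)` shape of a fact are all assumptions made for illustration. The idea is simply that each fact is filed under the minute in which it was recorded, not the moment it arrived.

```python
from collections import defaultdict


class MinuteRollup:
    """Hypothetical per-turbine, per-minute aggregate keyed on event time."""

    def __init__(self):
        # (turbine_id, minute_start) -> running count and sum
        self._buckets = defaultdict(lambda: {"count": 0, "total": 0.0})

    def add(self, turbine_id, event_ts, value):
        """File one fact under the minute it was recorded in.

        event_ts is assumed to be seconds since the epoch, stamped at the turbine.
        """
        minute_start = int(event_ts) // 60 * 60
        bucket = self._buckets[(turbine_id, minute_start)]
        bucket["count"] += 1
        bucket["total"] += value

    def mean(self, turbine_id, minute_start):
        """Return the average for one bucket, or None if nothing landed there."""
        bucket = self._buckets.get((turbine_id, minute_start))
        if bucket is None:
            return None
        return bucket["total"] / bucket["count"]


rollup = MinuteRollup()
# Facts arrive out of order: the second one is older than the first.
rollup.add("turbine-1", 125.0, 10.0)
rollup.add("turbine-1", 61.0, 20.0)
rollup.add("turbine-1", 130.0, 30.0)
```

Because the bucket is chosen from the fact's own timestamp, a late arrival lands in the same aggregate it would have joined had it shown up on time.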
Berkus summarized his discussion with five key elements that a firehose data system had to have:
- Queuing software to manage out-of-sequence data
- Buffering techniques to deal with component outages
- Materialized views that update data into aggregate tables
- Configuration management for all the systems in the solution
- Comprehensive monitoring to look for failures
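The buffering item on that list is perhaps the easiest to picture in code. The Python sketch below is one simple way to keep collecting while a downstream component is out; it is not taken from Berkus' talk. The `BufferedSender` class and the `send` callable it wraps are hypothetical, and the buffer lives in memory only, where a real collector would presumably spill to disk.

```python
from collections import deque


class BufferedSender:
    """Hypothetical collector that holds facts while the downstream store is out."""

    def __init__(self, send):
        # send: assumed callable that raises ConnectionError when delivery fails
        self._send = send
        self._pending = deque()

    @property
    def pending(self):
        """Number of facts still waiting for delivery."""
        return len(self._pending)

    def submit(self, fact):
        """Accept a fact unconditionally, then try to drain the buffer."""
        self._pending.append(fact)
        self.flush()

    def flush(self):
        """Deliver buffered facts oldest first; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # downstream still out; the fact stays buffered
            self._pending.popleft()
        return True


# A stand-in for a database connection that can be switched off and on.
delivered = []
link = {"up": False}


def flaky_send(fact):
    if not link["up"]:
        raise ConnectionError("downstream unavailable")
    delivered.append(fact)


sender = BufferedSender(flaky_send)
sender.submit("fact-1")  # link is down, so these two are held
sender.submit("fact-2")
link["up"] = True
sender.submit("fact-3")  # link is back, so the backlog drains in order
```

A fact is only removed from the buffer after `send` returns without error, so an outage delays delivery rather than losing data.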
With these key principles in place, firehose data management is a survivable event for any business that needs to handle a constant flow of data without getting knocked down.