New York turns to big data to solve big tree problem
NEW YORK CITY--It may seem strange if you don't live in this urban concrete-and-glass jungle, but New Yorkers love their trees. Tourists may flock to Times Square, but New Yorkers know their parks are the city's heart and soul: Central Park in Manhattan, Prospect Park in Brooklyn, Flushing Meadows Corona Park in Queens, Van Cortlandt Park in the Bronx, the Greenbelt in Staten Island and the hundreds of smaller parks and urban green spaces that dot the five boroughs. And, of course, there are the trees that line the streets.
In all, there are roughly 2.5 million trees in New York City. The city's residents love them, but for City of New York Parks & Recreation they are a big problem, albeit one that big data analytics can solve.
Brian Dalessandro, data ambassador for DataKind, leads a DataDive on tree pruning data from City of New York Parks & Recreation.
It's not just a dollars-and-cents problem either; it's about lives. In an 11-month span from 2009 to 2010, four people were killed or seriously injured by falling tree limbs in Central Park alone, including a six-month-old girl who was crushed to death in June 2010. Nearly a year earlier, a 100-pound limb fell from an oak tree in Central Park, fracturing the skull and partially severing the spine of a 37-year-old Google software engineer.
Arborists believe that pruning and otherwise maintaining trees can keep them healthier and make them more likely to withstand a storm, decreasing the likelihood of property damage, injuries and deaths. While this is the conventional wisdom, there hasn't been any research or data to back it up, says Brian Dalessandro, vice president of Data Science at media6degrees (m6d), provider of a machine learning-based ad targeting platform. Dalessandro is also a Data Ambassador for DataKind, an organization that helps unite volunteer data scientists with nonprofit and civic organizations that have big data problems.
Leveraging Machine Learning Skills to Answer a Causal Question
"Years ago, NYC Parks created a program for taking better preventative care of the city's trees," Dalessandro says. This program involves a regular schedule of pruning and grooming large trees in an effort to reduce the risk of damage from storms and high winds. For years, the department kept a record of which blocks were pruned, as well as how many times they had to dispatch a crew to remove fallen branches and upended trees.
Armed with all of this data, they approached DataKind with the following question: "Does pruning trees in one year reduce the number of hazardous tree conditions in the following year?"
Savvy advertisers and those schooled in analytics will recognize that the department was asking a causal question, and causal analysis is one of the most difficult forms of analysis one can do without a formal experiment. And let's face it, Dalessandro says, you can't A/B test the problem because you'd essentially be experimenting with people's lives.
But with the right data, you can statistically recreate an experiment, Dalessandro says, and his experience in the advertising world equipped him with the skillset to do just that. Only a few years ago, his team at m6d figured out how to estimate the causal impact of ads by analyzing impression logs. But approaching the city's problem wasn't cut-and-dried. After all, while the city had been collecting lots of data, it had been collecting it for reporting purposes, not for actionable insight.
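The article does not say which technique Dalessandro used to "statistically recreate an experiment." One common approach is stratification: compare pruned and unpruned blocks only within groups of otherwise similar blocks, then combine the groups. The sketch below is a minimal illustration of that idea, not the method used on the Parks data. The stratum labels, the observations and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical block-level observations (not real Parks data):
# (stratum of similar blocks, pruned this year?, cleanups the following year)
observations = [
    ("low_density", True, 1), ("low_density", True, 1),
    ("low_density", False, 2), ("low_density", False, 2),
    ("high_density", True, 3), ("high_density", True, 5),
    ("high_density", False, 6), ("high_density", False, 4),
]


def stratified_relative_reduction(obs):
    """Estimate the relative reduction in cleanups associated with pruning.

    Within each stratum, compute the mean cleanups for pruned and unpruned
    blocks, then weight each stratum by its number of blocks. This adjusts
    only for the stratifying variable; unobserved confounders remain.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for stratum, pruned, cleanups in obs:
        groups[stratum][pruned].append(cleanups)

    weighted_pruned = 0.0
    weighted_unpruned = 0.0
    for group in groups.values():
        if not group[True] or not group[False]:
            continue  # a stratum with only one kind of block allows no comparison
        n_blocks = len(group[True]) + len(group[False])
        weighted_pruned += n_blocks * sum(group[True]) / len(group[True])
        weighted_unpruned += n_blocks * sum(group[False]) / len(group[False])

    if weighted_unpruned == 0:
        raise ValueError("no comparable strata with cleanups on unpruned blocks")
    return 1 - weighted_pruned / weighted_unpruned


reduction = stratified_relative_reduction(observations)
print(f"Estimated relative reduction: {reduction:.1%}")
```

The result is only as credible as the strata. If pruned and unpruned blocks differ in ways the strata do not capture, the estimate will still be biased.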
Data Collection Is a Key Issue
"Their data had not been designed in a relational way," Dalessandro says. "They weren't really thinking about joining the data sets together."
For example, the data sets had different levels of granularity. Data on past pruning work was recorded at the block level, while data on cleanups was recorded at the address level.
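A mismatch like this is typically resolved by rolling the finer-grained records up to the coarser level before joining. The sketch below shows that step on made-up records. The field names, block identifiers and addresses are hypothetical, and it assumes each address has already been mapped to a block, which in practice would require a geocoding step.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
pruning_by_block = [
    {"block_id": "B001", "year": 2010, "pruned": True},
    {"block_id": "B002", "year": 2010, "pruned": False},
]

# Assumes each address has already been assigned a block_id.
cleanups_by_address = [
    {"address": "12 Elm St", "block_id": "B001", "year": 2011},
    {"address": "18 Elm St", "block_id": "B001", "year": 2011},
    {"address": "40 Oak Ave", "block_id": "B002", "year": 2011},
    {"address": "44 Oak Ave", "block_id": "B002", "year": 2011},
    {"address": "48 Oak Ave", "block_id": "B002", "year": 2011},
]


def roll_up_cleanups(cleanups):
    """Count cleanups per (block_id, year), dropping the address detail."""
    return Counter((c["block_id"], c["year"]) for c in cleanups)


def join_block_years(pruning, cleanup_counts):
    """Build one row per block: pruning status in one year, cleanups in the next."""
    rows = []
    for record in pruning:
        next_year = record["year"] + 1
        rows.append({
            "block_id": record["block_id"],
            "pruned_year": record["year"],
            "pruned": record["pruned"],
            # Blocks with no recorded cleanups get a count of zero.
            "cleanups_next_year": cleanup_counts.get((record["block_id"], next_year), 0),
        })
    return rows


block_rows = join_block_years(pruning_by_block, roll_up_cleanups(cleanups_by_address))
for row in block_rows:
    print(row)
```

Each resulting row pairs a block's pruning status in one year with its cleanup count in the following year, which is the shape the department's question calls for.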
"One of the biggest challenges of this is determining the fundamental unit of analysis," Dalessandro says. "As a statistician, you divide the world into entities. What would be the equivalent of a single row? They don't give a unique identifier to every tree. It's a balance between having the data as granular as possible while also having the greatest coverage possible."
Eventually, they settled on the city block as the fundamental unit of analysis. With the blessing of m6d's CEO, Dalessandro devoted a few work hours here and there to downloading, cleaning, merging and analyzing the data. He was even able to use the firm's high-powered server infrastructure to run some intensive modeling. He found the answer to the city's question: pruning trees for certain types of hazards caused a 22% reduction in the number of times the department had to send a crew for emergency cleanups.
"A block that is pruned this year will have 22% fewer hazardous problems the following year," Dalessandro says. "We were told this is the first time this number has ever been generated."
Using Analytics to Build a Risk Profile
While an important first step, this number is only a beginning. After all, the city already has a pruning program in place. But even a city as large as New York doesn't have the resources to prune every block every year. The department has to choose which blocks to target for pruning.
"The first thing is to create a baseline so the Parks Department can work with supervisors to determine how much of their resources they can allocate," Dalessandro says. "The second phase will be intelligent pruning. Part of my vision for this is I want to help them set up the analytics on their end so they can start asking different questions and getting the answers themselves. That's all part of building the risk profile of a block: the number of trees, types of trees, whether the block is in a flood zone or storm zone. These are all the types of questions that can be answered."
Now, he says, the department is armed with tangible results that can provide justification for investments in their data infrastructure and data collection efforts.
"Ultimately, this could not only improve New York's pruning program, it could be applied to any other city that has a similar pruning program," says Jake Porway, founder and executive director of DataKind, which brought Dalessandro and the Parks department together.
DataKind Pioneering New Form of Corporate Civic Engagement
Made up of a group of "conscientious data scientists, NPO/NGO gurus, do-gooder CIOs and dedicated organizers," DataKind has a mission to use data in the service of humanity. For instance, Porway says, DataKind works with Refugees United to use analytics and recommendations to bring together families of refugees who have been displaced and separated in the process.
"You're not just using data to report," Porway says. "You're applying it to real humanitarian issues."
DataKind regularly schedules weekend events called DataDives, which team volunteer data scientists with three selected social organizations that have well-defined data problems. The events are completely free. The Parks department had come to DataKind for a DataDive.
In addition, DataKind maintains the "DataCorps," a select group of data scientists who work on volunteer or contract data projects part-time. They work for one to six months on targeted data projects. They are either paid by the organization or sponsored by the private companies they work for. Finally, DataKind maintains a full-time in-house staff of data scientists to take on the most pressing and high-impact problems.
"I didn't do all this on the weekend and late at night," Dalessandro says. "I did a lot of this while I was at work, but I did it with the blessing of my CEO and marketing people. They thought it was a great idea. It wasn't weeks and months of my time. It was a few hours here and there. I was able to use company servers and they were completely cool with it. They like it when the data scientists are active with the community."
"This, to me, is an amazing example of corporate engagement," Dalessandro adds. "If more companies can actually donate people who are highly skilled in a particular area to these underfunded organizations, whether charities or civic organizations, to me it's the most amazing form of corporate civic engagement I can think of. This is something I want to continue doing as long as I'm in the private sector."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.