Maybe you should read the Bad Data Handbook
Bad Data Handbook by Q. Ethan McCallum et al., O'Reilly, 2012
If you're starting on a project and know you are going to be dealing with a LOT of data that won't be arriving in a format that anticipates your needs, you might save yourself a lot of time and trouble by picking up a copy of this book. Q. Ethan McCallum and 18 others got together and wrote about their experiences, their methods and their wins in dealing with datasets that are "bad" in any of a large number of ways.
Bad data might be inaccurate, incomplete, unreliable, encoded in some fashion that makes the data hard to work with, or formatted in some way that makes extraction difficult or time-consuming. Data records might be inconsistent or poorly annotated. Whatever the issue, you can end up spending a lot of time sampling, analyzing, cleaning, extracting, reformatting, questioning, verifying and accumulating data, only to find out that you have to start over again because of some problem you didn't discover until late in the game.
Bad Data offers tips for ...
- test driving your data to make sure it's ready to be analyzed
- working spreadsheet data into a usable format
- handling encoding problems
- developing a good web-scraping strategy
- using natural language processing to detect attitude in online reviews
- addressing cloud issues that can complicate analysis
- avoiding policies that create analysis roadblocks
- taking a systematic approach to quality analysis
The 19 chapters provide a series of real life "bad data" stories with lots of examples, sample code, lessons learned, and explanations of how you can avoid getting stuck and feeling as if you're the first person to ever have to deal with the type of data problems you're struggling with. Some chapters will likely jump out at you as being immediately useful. Others might settle on you more slowly.
Chapter 1 Setting the Pace: What Is Bad Data?
Chapter 2 Is It Just Me or Does This Data Smell Funny?
Chapter 3 Data Intended for Human Consumption, Not Machine Consumption
Chapter 4 Bad Data Lurking in Plain Text
Chapter 5 (Re)Organizing the Web's Data
Chapter 6 Detecting Liars and the Confused in Contradictory Online Reviews
Chapter 7 Will the Bad Data Please Stand Up?
Chapter 8 Blood, Sweat, and Urine
Chapter 9 When Data and Reality Don't Match
Chapter 10 Subtle Sources of Bias and Error
Chapter 11 Don't Let the Perfect be the Enemy of the Good: Is Bad Data Really Bad?
Chapter 12 When Databases Attack: A Guide for When to Stick to Files
Chapter 13 Crouching Table, Hidden Network
Chapter 14 Myths of Cloud Computing
Chapter 15 The Dark Side of Data Science
Chapter 16 How to Feed and Care for Your Machine-Learning Experts
Chapter 17 Data Traceability
Chapter 18 Social Media: Erasable Ink?
Chapter 19 Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Some of my favorite points were ...
Chapter 7's explanation of how easy it is to confuse typical with average. Oh, how wrong you can be! The author starts this section by talking about people getting on the highway and going in the wrong direction, not realizing their problem until they find themselves on the beach instead of in the mountains. Oops! How easy it is to simply get the wrong picture and create useless analyses.
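To make the typical-versus-average distinction concrete, here is a minimal sketch. It is not taken from the book, and the salary figures are invented purely for illustration. It shows how a single outlier can pull the mean far away from what most of the records look like, while the median stays put.

```python
from statistics import mean, median

# Hypothetical annual salaries (invented for illustration):
# nine modest earners and one extreme outlier.
salaries = [40_000] * 9 + [1_000_000]

average = mean(salaries)    # dragged upward by the single outlier
typical = median(salaries)  # the middle value, which the outlier does not move

print(f"mean:   {average:,.0f}")
print(f"median: {typical:,.0f}")
```

Here the mean comes out more than three times the median, even though nine of the ten records are identical. Reporting the mean as the "typical" salary would paint the wrong picture.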
Chapter 19's grappling with the issue of knowing when your data is good enough. How do you determine what you should care about? Do a couple of unanswered questions or missing data points invalidate everything? Or can you step back and establish your own criteria for quality?
Bad Data Handbook is not the easiest read, but it provides insights that could save a lot of us considerable time and money. If you're going to be working with big data sets and you don't get to order them to your own specifications, this book might become your new best friend.