Clean your data

By Cameron Laird

Think of it this way: when a procedural drama enacts a climactic scene in which someone announces, "the computer found a match! We've got our 'perp' ...", I have trouble doing anything but giggling.

While there must be a healthier response, I haven't figured it out yet. I've seen inside "mission-critical" databases, and I know how ... imperfect they are. Bluntly, the databases which have so much power over lives--credit reports, school grades, police reports, medical histories, inventories, bank records, and many more--are filled with dirt.

No individual programmer can solve this. In most instances, there is no technical solution; the sponsor or hosting organization isn't particularly bothered by being inaccurate (at least, until an inaccuracy lands with someone powerful enough to cause embarrassment).

What you can do is know the facts. As best I've been able to determine, they include:

  • The data which your programs manage and process are inaccurate. Don't waste your time in surprise; yes, you were told that everything had already been corrected (document that!), but, unless your organization is very, very unusual, it's simply not true;
  • On the other hand, you might be no worse than average. Your competitors are likely no better;
  • It could well happen that no one cares when you correct mistakes, or, even worse, that people are actively hostile to your efforts. Perhaps your supervisors regard you as paid to program, not to deal with clerical details. You have a chance to think in advance how you feel about that; and
  • Although it's unlikely any "technical fix" in isolation will make a difference, there's a lot you can do at least to diagnose the state of your data, and perhaps suggest what processes can help.

Take advantage of the techniques at hand: use Assert()s and similar mechanisms. Don't just document that the second argument to myfunc() is an accountNumber; make it explicit, in code, that in this context all whitespace has been stripped from accountNumber, case is significant, and, yes, accountNumber should contain exactly twelve characters. If automated, readable test cases are impossible, at least put a salient example value or two in comments.
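
Here is a minimal sketch of that idea in Python. The three rules come straight from the paragraph above; the helper name check_account_number, the first argument transaction, and the example value are assumptions made up for illustration, not part of any real system.

```python
def check_account_number(account_number):
    """Fail loudly if account_number breaks the documented assumptions."""
    assert isinstance(account_number, str), "accountNumber must be a string"
    assert len(account_number) == 12, (
        f"accountNumber must have exactly 12 characters, got {len(account_number)}"
    )
    assert not any(ch.isspace() for ch in account_number), (
        "accountNumber must already have all whitespace stripped"
    )
    # Case is significant, so no upper() or lower() normalization happens here.
    # Salient example value: "AB12cd34EF56" (not the same account as "ab12cd34ef56").
    return account_number


def myfunc(transaction, account_number):
    # 'transaction' is a hypothetical first argument; the text names only the second.
    check_account_number(account_number)
    return account_number  # placeholder for the real work
```

One caveat: Python drops assert statements when run with the -O option, so checks that must survive in production are better written as explicit tests that raise ValueError.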

You'll emerge from any analysis phase with a whole collection of "assumptions" (preconditions, ...). Turn those into executable code. Well, use engineering judgment--if you're running fluid mechanics simulations, and CPU time already is one of your biggest expenses, I don't want you checking bounds on all intermediate numbers for every iteration. In principle, though, there's a lot you can do to validate data that flows through your hands. Certainly if it's structured as XML or a DBMS, you have abundant possibilities for catching errors.
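
As one illustration of what a DBMS offers, the sketch below pushes the accountNumber assumptions into a CHECK constraint, using the sqlite3 module from Python's standard library. The table name, the balance_cents column, and the helper functions are hypothetical; a real schema will carry different rules.

```python
import sqlite3


def make_db():
    """Create an in-memory database whose schema encodes the assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE accounts (
            account_number TEXT PRIMARY KEY
                -- exactly twelve characters, and no embedded spaces
                -- (this catches spaces only, not tabs or newlines)
                CHECK (length(account_number) = 12
                       AND account_number NOT LIKE '% %'),
            -- hypothetical column, included only to make the table realistic
            balance_cents INTEGER NOT NULL
        )
        """
    )
    return conn


def try_insert(conn, account_number, balance_cents):
    """Return None on success, or the database's complaint as a string."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO accounts VALUES (?, ?)",
                (account_number, balance_cents),
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)
```

The point of returning the complaint rather than discarding it is that each rejected row becomes a concrete example to trace back to whoever, or whatever, produced it.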

The real value of techniques like these isn't that they correct your data. Dedicated users are far too clever for such simple means; they can introduce fiendishly difficult errors, given the chance. What you'll find, though, is that even simple checks will turn up problems that you should trace back to root causes. With real examples in hand, you can start to have meaningful conversations about how entering "Flood damage" in some places, and "Flooding damage" in others, seems like a small thing right up until the day that $37 million class-action lawsuit arrives (particulars fuzzed to protect bystanders).
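
A simple check of this kind might look like the following sketch, which lists pairs of distinct labels that are suspiciously similar to each other. It relies on difflib.SequenceMatcher from Python's standard library; the function name suspicious_pairs and the 0.85 threshold are assumptions chosen for illustration, and any real threshold needs tuning against your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def suspicious_pairs(values, threshold=0.85):
    """Return pairs of distinct labels whose spellings are nearly the same."""
    labels = sorted(set(values))
    pairs = []
    for a, b in combinations(labels, 2):
        # Compare case-insensitively; the original spellings are reported.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


claims = ["Flood damage", "Flooding damage", "Flood damage", "Fire damage"]
for a, b in suspicious_pairs(claims):
    print(f"Possible duplicate category: {a!r} vs {b!r}")
```

This compares every pair of distinct labels, so it suits a column with hundreds of categories, not millions. It only raises questions; deciding whether two labels really mean the same thing remains a conversation with the people who enter the data.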
