PDF widely misunderstood

PDF is in wide use, and plenty of developers bump into the standard on a daily basis. It's also widely, and sometimes deeply, misunderstood, if my reading of popular discussion sites for programmers is at all representative. "Smart Development" will open the first week of September 2009 by talking over a few of the commonest mistakes that turn up.

PDF shines at display, not persistence. I frequently run into programmers who report things such as, "my boss told me we have 23,000 PDF scans of resumes, and he assigned me to make a database by reading the names, telephone numbers and addresses from them." Some of these programmers are so inexperienced that they sink several weeks into such a project before they realize they must tell their bosses this particular project is a bad idea. Humans can read PDF images and extract useful content. Software can do some of the same, but generally only with difficulty and many errors. In programmers' terms, PDF has plenty of setters but few getters.

If you ever receive such an assignment, your basic choices are:

  • Let your boss know that the task is much more expensive and less rewarding than he realizes;
  • Use some of the PDF-to-text tools already available to make the best of a bad situation; or
  • "Swim upstream", to discover the true home of the data you're after.
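
To give a feel for the second option, here is a minimal Python sketch. It assumes some PDF-to-text tool has already produced plain text; the sample text, the `PHONE_RE` pattern, and the `find_phone_numbers` helper are all made up for illustration, and the pattern only covers a few US-style formats. It is not a recommended extractor, just a demonstration of how quietly data goes missing.

```python
import re

# Hypothetical pattern: US-style numbers such as (555) 123-4567 or 555.987.6543.
# Real resumes will contain many formats this does not cover.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})\b")


def find_phone_numbers(text):
    """Return the phone numbers found in text, normalized to NNN-NNN-NNNN."""
    return ["-".join(match.groups()) for match in PHONE_RE.finditer(text)]


# Made-up sample standing in for the output of a PDF-to-text tool.
# The fax line contains a typical recognition error: a lowercase "l" for "1".
sample = (
    "Jane Doe\n"
    "Tel: (555) 123-4567\n"
    "Alt: 555.987.6543\n"
    "Fax: 555-l23-4567\n"
)

numbers = find_phone_numbers(sample)
print(numbers)
```

The first two numbers come back; the third is silently dropped because a single character was misread. Multiply that by 23,000 documents and you have the "difficulty and many errors" described above.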

It can easily happen that, when a manager says, "get the telephone numbers off these PDFs", what he really means is, "See these telephone numbers? I need data like that. I don't care how you get it; I'm just showing you this particular representation, because you're a programmer, and we rarely understand each other." If the PDF images are generated from, say, an existing database, or correspond to a known XML feed, your best bet is to use the database or XML directly.

Later this week, I'll say a bit about PDF's security features, bookmarks, portfolios, how to do things with PDF you shouldn't do, and my favorite PDF automation. Have questions or criticism? Let me know; reader comments will largely determine how the rest of the week goes.

