Friends don't let friends extract PDF content

Unless you have an exceptional reason to do so, don't try to retrieve the text from a PDF

Here's a slightly paraphrased example of a dialogue that seems to turn up periodically in every forum I frequent:

"Please help me retrieve the document from PDF Web using XYZ Software."

"What does that mean?  XYZ does Q with PDF, but it doesn't 'retrieve the document' in any way I know."

"Please help immediately.  My manager says I must program to take values from PDF."

"Are you talking about the textual content of a PDF instance?  You really don't want to do that--and certainly XYZ doesn't offer anything like you're describing."

 ...

A few more rounds, and we're quickly in the neighborhood of Godwin's law, with the questioner wondering why the respondents are so rude as to say "no", and the answerers annoyed at being distracted from their usual work with XYZ.

I confess this is a slight exaggeration: I haven't seen this conversation play out in a discussion group about high-altitude forage crops. Even there, though, it could be that I've just not been paying close attention ...

Straight scoop on content retrieval from PDF

The facts of this situation are starkly simple:

  • Extraction or retrieval of content or text from PDF instances is a bad idea. Bad, bad. Very bad. Yes, it's possible, and yes, I do it often enough to have written about it before, but, unless you're already experienced in this area, it's almost certainly an order of magnitude more difficult than you naively assume;
  • Decision-makers ask for or even demand it constantly. I have a lot of sympathy for the questioners caricatured above, because I have no doubt they're sincere: someone truly did say to them, "You're a programmer; program XYZ to get G from PDFs." I know how often managers have asked me for the same;
  • Managerial and sales types use language differently from the way we do. When they say, "You must write a program that extracts the applicant's home address from the PDF of his résumé!", they sometimes mean, "I don't care if you get the address from the corporate database our group already accesses." In fact, part of our professional responsibility as programmers is to "push back" on just such matters: we always need to question the requirements presented to us. You might hear, "This Java program has to be patched today!" perfectly clearly, but recognize that the speaker often doesn't care whether you're editing $PROGRAM.java or $PROGRAM.class.

I'm experienced enough to know that nothing short of global meltdown can forestall eternal September; certainly this one posting has no hope of straightening out even all the newcomers to PDF who read English. I do want you to know, though, Reader: unless you have an exceptional reason to do so, don't try to retrieve the text from "electronic paper". When you see others trying, feel free to advise them that there's probably a better way to get what they're after.
