August 31, 2009, 10:43 PM — A "best practice" in PDF content extraction applies far beyond the specialty of PDF. This will take a bit of explanation.
I used to argue against content extraction from PDF at all: it was a bad idea, the available products were excessively flawed, even the best results disappointed end-users, and out-of-pocket costs were high. The technology has improved enormously in the last decade; I do a lot of content extraction now, and frequently "pitch" it to decision-makers. I especially like to scan collections of PDF documents to construct various specialized search capabilities. Even though defects remain (most prominently, the text extraction utilities miss a word or two every few score pages), these enhancements are generally a hit with end users.
There's a principle at work here, though, that apparently never comes up in college courses, and therefore is important to mention in "Smart Development": choose providers--vendors and/or open-source maintainers--who are responsive. pdftk, for example, was a "breakthrough" open-source utility, but I've cut my use back to no more than a half-dozen times a week. I'm finding substitutes for everything pdftk does for me because, from what I've been told, author Sid Steward has moved his life away from computing. As fond as I continue to be of pdftk, it's just not keeping up with the PDF world.
This could change; it's possible someone else will take over where Mr. Steward left off, and there are rumors that exactly that might happen as soon as this fall. In the meantime, I'm working with vendors who *like* customer reports, because they give the vendor an opportunity to improve the product. While it's common in some swaths of software--database managers, for example--for functionality to be planned irrevocably years in advance, my experience in working with PDF is that at least a few of the better companies are eager to receive "bug reports", ones that provide specific details about PDF oddities. These providers incorporate the examples in the latest regression tests, and quickly cycle the enhancements back to customers. Responsiveness like that can be even more important to some projects than price or base functionality.
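To make the "bug report becomes a regression test" cycle concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `ReportedCase`, `run_regression`, `toy_extract` and `fixed_extract` are names I made up, and the "extractors" are stand-ins that work on plain UTF-8 bytes rather than real PDF files. I don't know how any particular vendor structures its test suite; this only shows the general shape of the idea.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReportedCase:
    """One customer report: a sample plus the words the extractor must find."""
    name: str
    sample: bytes
    required_words: List[str]


def run_regression(extract: Callable[[bytes], str],
                   cases: List[ReportedCase]) -> Dict[str, List[str]]:
    """Return, for each failing case, the required words missing from the output."""
    failures: Dict[str, List[str]] = {}
    for case in cases:
        found = set(extract(case.sample).lower().split())
        missing = [w for w in case.required_words if w.lower() not in found]
        if missing:
            failures[case.name] = missing
    return failures


def toy_extract(sample: bytes) -> str:
    """Stand-in for a text engine (assumption: NOT a real PDF parser).

    It silently drops any word containing a non-ASCII character, which
    mimics an engine that misses words set with ligature glyphs.
    """
    text = sample.decode("utf-8")
    return " ".join(w for w in text.split() if w.isascii())


def fixed_extract(sample: bytes) -> str:
    """The 'enhanced' stand-in: expands ligatures with NFKC normalization."""
    return unicodedata.normalize("NFKC", sample.decode("utf-8"))


# A customer reports that a word typeset with the "fi" ligature (U+FB01) is lost.
CASES = [
    ReportedCase(
        name="ligature-report",
        sample="the \ufb01nal invoice".encode("utf-8"),
        required_words=["final", "invoice"],
    ),
]

if __name__ == "__main__":
    print("before fix:", run_regression(toy_extract, CASES))
    print("after fix: ", run_regression(fixed_extract, CASES))
```

Once a reported sample lives in `CASES`, every later release of the engine gets checked against it, which is what keeps a fixed oddity from quietly breaking again.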
PDF makes this pattern particularly important, because PDF continues to develop each year. There seem always to be new constructs and standards to accommodate.
Talk things over with your vendor; you'll get the best results not just from a "punch list" of features you think you need, but from a deeper conversation about your usage pattern. Find out what a prospective vendor thinks when you say that you're planning to use his product once a week, or ten times a second, or whatever is real for you. Does he have a way to issue enhancements to his text engine when customers find PDF instances the existing product mishandles?