PDF and HTML: Splitting the difference

August 13, 2007, 01:33 PM, ITworld.com

I never thought I would hear myself saying this, but I think the world needs another file format for storing images.



Like many people in this industry, I have often had to fight the file format fight, converting images endlessly from format A to format B and back again to achieve some result or work around some application limitation. More than once I have said to myself, "It's only pixels, darn it! How many sensible ways can there possibly be to store these things?" And now I find myself advocating the creation of another one? What gives?



Here is where my head is at. In my day job I regularly come across situations where very tight control over the presentation of textual information is required. Situations in which it is important to know for sure that information appears in a browser pretty much exactly as it appears on the paper produced through a good old-fashioned publishing cycle. Situations where allowing a browser to rearrange text and graphics to suit itself would be extremely undesirable.



Obviously, I could create images of the relevant material, perhaps in JPEG or TIFF, and drop those into the web pages. This solves the layout problem, but at the expense of creating a whole bunch of other problems. The text can no longer be seen by search engines. Browsers have nothing to work with in trying to make the underlying text copy/pasteable. Browsers have their hands tied in trying to support accessibility requirements. And so on.



Alternatively, I could drop the layout-sensitive information into a PDF and pop that onto the web page. This is better in many respects but still falls short. PDF is a page painter. Inside a PDF you tell the computer to move to X,Y. Draw some text. Move to some other X,Y. Draw some more text. And so on. By the time the text hits PDF, critical information about what text follows what other text is missing. Simply put, the flow order of the text has disappeared. This is a real problem, as anyone who has attempted to extract text from PDF can tell you. For simple cases it works great. For complex cases involving, say, multiple columns, tables or footnotes... Well, let's just say that a variety of infuriatingly bad things can happen.
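To make the flow-order problem concrete, here is a minimal Python sketch. It does not parse real PDF; the `TextOp` type and the coordinates are assumptions made up for illustration. It models a page as a list of "move to X,Y, draw text" operations and shows what a naive extractor gets when it sorts them top-to-bottom, then left-to-right on a two-column page.

```python
from typing import List, NamedTuple


class TextOp(NamedTuple):
    """One hypothetical 'move to (x, y), draw text' operation."""
    x: float
    y: float  # measured from the top of the page, to keep the sketch simple
    text: str


# A two-column page. The intended reading order is the whole left column,
# then the whole right column, but nothing in the operations records that.
page = [
    TextOp(50, 100, "Left column, line 1."),
    TextOp(50, 120, "Left column, line 2."),
    TextOp(300, 100, "Right column, line 1."),
    TextOp(300, 120, "Right column, line 2."),
]


def naive_extract(ops: List[TextOp]) -> List[str]:
    """Guess reading order by sorting top-to-bottom, then left-to-right."""
    ordered = sorted(ops, key=lambda op: (op.y, op.x))
    return [op.text for op in ordered]


extracted = naive_extract(page)
for line in extracted:
    print(line)
```

The sort interleaves the two columns, because position is the only clue left once the text has been painted. Real extractors use smarter heuristics than this, but they are still guessing at an order the format never stored.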



And thus we arrive at my tentative conclusion, which is a wish list for a new file format. I want:



- a file format that is primarily an image. Something that a browser can render without any risk to the visual representation of the primarily textual information therein.



- a file format that allows HTML markup to be embedded within it, so that markup and text can be carried around with the image. Applications such as search engines, copy-and-paste tools and so on would have access to the text as text rather than as image pixels.
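As a rough illustration of the wish list, and not a proposal for the actual format, the Python sketch below writes a tiny PNG and tucks an HTML fragment into one of PNG's standard `iTXt` text chunks, then reads it back. The keyword `html-text` is a hypothetical name invented here, and no browser or search engine is known to look for it.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
HTML_KEYWORD = "html-text"  # hypothetical keyword, not a registered one


def chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def itxt(keyword: str, text: str) -> bytes:
    """Uncompressed iTXt chunk with empty language tag and translated keyword."""
    data = (keyword.encode("latin-1") + b"\x00"  # keyword + separator
            + b"\x00\x00"                        # compression flag, method
            + b"\x00"                            # empty language tag
            + b"\x00"                            # empty translated keyword
            + text.encode("utf-8"))
    return chunk(b"iTXt", data)


def build_png(html: str) -> bytes:
    """A 1x1 white grayscale PNG carrying the HTML alongside the pixels."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\xff")  # filter byte 0, one white pixel
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr) + itxt(HTML_KEYWORD, html)
            + chunk(b"IDAT", pixels) + chunk(b"IEND", b""))


def read_itxt(png: bytes, keyword: str) -> Optional[str]:
    """Walk the chunks and return the text of the first matching iTXt."""
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        pos += 12 + length  # length + type + data + CRC
        if kind == b"iTXt":
            key, rest = data.split(b"\x00", 1)
            if key.decode("latin-1") == keyword:
                _lang, rest = rest[2:].split(b"\x00", 1)  # skip flag, method
                _translated, text = rest.split(b"\x00", 1)
                return text.decode("utf-8")
    return None


html = "<p>First paragraph.</p><p>Second paragraph.</p>"
png_bytes = build_png(html)
recovered = read_itxt(png_bytes, HTML_KEYWORD)
```

An image viewer renders the pixels and ignores the extra chunk, while a tool that knows the keyword gets the text in flow order. What this does not provide is any guarantee that the embedded markup actually matches the pixels.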



It is possible, I guess, to do this with XMP[1], but my sense of it so far is that (a) it requires stretching the use case of XMP to breaking point and (b) folks are not using XMP for this in any great numbers.



Am I nuts? Have I missed something? Can it really be that the world needs another file format?




[1] http://www.adobe.com/products/xmp/

 
