ITworld.com
PDF and HTML: Splitting the difference
E-BUSINESS IN THE ENTERPRISE --- 08/21/2007

Sean McGrath

I never thought I would hear myself saying this, but I think the world needs another file format for storing images. 

Like many people in this industry, I have often had to fight the file format fight, converting images endlessly from format A to format B and back again to achieve some result or work around some application limitation. More than once I have said to myself, "It's only pixels, darn it! How many sensible ways can there possibly be to store these things?" And now I find myself advocating the creation of another one? What gives?

Here is where my head is at. In my day job I regularly come across situations where very tight control over the presentation of textual information is required. Situations in which it is important to know for sure that information appears in a browser pretty much exactly as it appears on the paper produced through a good old-fashioned publishing cycle. Situations where allowing a browser to rearrange text and graphics to suit itself would be extremely undesirable.

Obviously, I could create images of the relevant material, perhaps in JPEG or TIFF, and drop those into the web pages. This solves the layout problem, but at the expense of creating a whole bunch of other problems. The text can no longer be seen by search engines. Browsers have nothing to work with in trying to make the underlying text copy/pasteable. Browsers have their hands tied in trying to support accessibility requirements. And so on.

Alternatively, I could drop the layout-sensitive information into a PDF and pop that onto the web page. This is better in many respects but still falls short. PDF is a page painter. Inside a PDF you tell the computer to move to X,Y. Draw some text. Move to some other X,Y. Draw some more text. And so on. By the time the text hits PDF, critical information about what text follows what other text is missing. Simply put, the flow order of the text has disappeared. This is a real problem, as anyone who has attempted to extract text from PDF can tell you. For simple cases it works great. For complex cases involving, say, multiple columns, tables or footnotes... Well, let's just say that a variety of infuriatingly bad things can happen.
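
To make the flow-order problem concrete, here is a minimal Python sketch. It is not real PDF parsing: the draw_ops list, its coordinates and both function names are invented for illustration, and y is assumed to grow downward to keep things simple. It models a two-column page as a bag of "move to X,Y, draw text" instructions and shows how an extractor that only has positions to go on can interleave the columns.

```python
# Hypothetical "move to X,Y, draw text" instructions for a two-column page.
# Assumption for this sketch: y grows downward (larger y = lower on the page).
draw_ops = [
    (50, 100, "Left column, line 1"),
    (300, 100, "Right column, line 1"),
    (50, 120, "Left column, line 2"),
    (300, 120, "Right column, line 2"),
]


def extract_by_position(ops):
    """Naive extraction: read top to bottom, then left to right."""
    ordered = sorted(ops, key=lambda op: (op[1], op[0]))
    return [text for _, _, text in ordered]


def extract_by_column(ops, column_split_x):
    """Column-aware extraction. The split point is not stored on the page,
    so a real extractor would have to guess it from the layout."""
    left = sorted((op for op in ops if op[0] < column_split_x), key=lambda op: op[1])
    right = sorted((op for op in ops if op[0] >= column_split_x), key=lambda op: op[1])
    return [text for _, _, text in left + right]


if __name__ == "__main__":
    print(extract_by_position(draw_ops))     # columns interleaved line by line
    print(extract_by_column(draw_ops, 200))  # left column first, then right
```

The second function only gets the order right because it was handed the column boundary. Nothing on the painted page says where that boundary is, which is the missing information described above.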

And thus we arrive at my tentative conclusion, which is a wish list for a new file format. I want:

- a file format that is primarily an image. Something that a browser can render without any risk to the visual representation of the primarily textual information therein.

- the file format should allow HTML markup to be embedded within it so that markup and text can be carried around with the image. Applications such as search engines, copy-and-paste tools and so on would have access to the text as text rather than as image pixels.
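
As a rough picture of what the two wishes above might look like in practice, here is a Python sketch that tucks an HTML fragment into a text chunk of an ordinary PNG and reads it back. This is only an assumption-laden illustration, not the proposed format: the use of PNG's iTXt chunk, the "html-flow" keyword and the function names are all my own choices for the example, and the image is a single white pixel standing in for a rendered page.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def make_png_with_html(html: str, keyword: str = "html-flow") -> bytes:
    """Return a 1x1 grayscale PNG carrying `html` in an uncompressed iTXt chunk.
    The keyword "html-flow" is a made-up name for this sketch."""
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale), then defaults.
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # iTXt layout: keyword NUL, compression flag, compression method,
    # language tag NUL, translated keyword NUL, UTF-8 text.
    itxt = (keyword.encode("latin-1") + b"\x00" + b"\x00\x00"
            + b"\x00" + b"\x00" + html.encode("utf-8"))
    idat = zlib.compress(b"\x00\xff")  # filter byte 0 plus one white pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr) + make_chunk(b"iTXt", itxt)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def read_html(png: bytes, keyword: str = "html-flow"):
    """Walk the chunks and return the text of the matching iTXt chunk, or None."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        pos += 12 + length  # length + type + data + CRC
        if chunk_type == b"iTXt":
            key, rest = data.split(b"\x00", 1)
            if key.decode("latin-1") == keyword and rest[0] == 0:  # uncompressed only
                _lang, rest = rest[2:].split(b"\x00", 1)
                _translated, text = rest.split(b"\x00", 1)
                return text.decode("utf-8")
    return None


if __name__ == "__main__":
    png = make_png_with_html("<p>First column.</p><p>Second column.</p>")
    print(read_html(png))
```

A viewer that knows nothing about the chunk still shows the pixels, while a search engine or copy-and-paste tool that does know about it could get at the text in flow order. Whether an existing image format should be stretched this way is exactly the open question.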

It is possible, I guess, to do this with XMP [1], but my sense of it so far is that (a) it requires stretching the use case of XMP to breaking point and (b) folks are not using XMP for this in any great numbers.

Am I nuts? Have I missed something? Can it really be that the world needs another file format?

[1] http://www.adobe.com/products/xmp/

 

Sean McGrath is CTO of Propylon. He is an internationally acknowledged authority on XML and related standards. He served as an invited expert to the W3C's Expert Group that defined XML in 1998. He is the author of three books on markup languages published by Prentice Hall. Visit his site at: http://seanmcgrath.blogspot.com.



Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.