Like many people in this industry, I have often had to fight the file
format fight, converting images endlessly from format A to format B and
back again to achieve some result or work around some application
limitation. More than once I have said to myself "It's only pixels darn
it! How many sensible ways can there possibly be to store these
things?" And now I find myself advocating the creation of another one?
What gives?
Here is where my head is at. In my day job I regularly come across
situations where very tight control over the presentation of textual
information is required. Situations in which it is important to know for
sure that information appears in a browser pretty much exactly as it
appears on the paper produced through a good old-fashioned publishing
cycle. Situations where allowing a browser to re-arrange text and
graphics to suit itself would be extremely undesirable.
Obviously, I could create images of the relevant material - perhaps in
JPEG or TIFF - and drop those into the web pages. This solves the layout
problem, but at the expense of creating a whole bunch of other problems.
The text can no longer be seen by search engines. Browsers have
nothing to work with in trying to make the underlying text
copy/pasteable. Browsers have their hands tied in trying to support
accessibility requirements. And so on.
Alternatively, I could drop the layout-sensitive information into a PDF
and pop that onto the web page. This is better in many respects but
still falls short. PDF is a page painter. Inside a PDF you tell the
computer to move to X,Y. Draw some text. Move to some other X,Y. Draw
some more text. And so on. By the time the text hits PDF, critical
information about what text follows what other text is missing. Simply
put, the flow order of the text has disappeared. This is a real problem,
as anyone who has attempted to extract text from a PDF can tell you. For
simple cases it works great. For complex cases involving, say, multiple
columns, tables or footnotes... Well, let's just say that a variety of
infuriatingly bad things can happen.
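To make the flow-order problem concrete, here is a minimal sketch in
Python. It is not a PDF parser. The text runs, coordinates and gutter
position are all made up for illustration, and y grows downward to keep
things simple. It only shows why positioned text with no flow
information comes out interleaved when a page has two columns.

```python
# Each run is (x, y, text), roughly how a page painter positions text.
# All values are hypothetical; y grows downward here for simplicity.
RUNS = [
    (50, 100, "Column one, line one."),
    (300, 100, "Column two, line one."),
    (50, 120, "Column one, line two."),
    (300, 120, "Column two, line two."),
]


def naive_extract(runs):
    """Read runs top to bottom, then left to right, ignoring columns."""
    ordered = sorted(runs, key=lambda r: (r[1], r[0]))
    return [text for _, _, text in ordered]


def column_aware_extract(runs, gutter_x):
    """Read everything left of gutter_x first, then everything right of it."""
    left = sorted((r for r in runs if r[0] < gutter_x), key=lambda r: r[1])
    right = sorted((r for r in runs if r[0] >= gutter_x), key=lambda r: r[1])
    return [text for _, _, text in left + right]


if __name__ == "__main__":
    print(naive_extract(RUNS))             # lines from both columns interleave
    print(column_aware_extract(RUNS, 200))  # column one, then column two
```

The second function only works because I told it where the gutter is.
A real extractor has to guess that from the coordinates, which is where
tables and footnotes cause trouble.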
And thus we arrive at my tentative conclusion which is a wish list for a
new file format. I want:
- a file format that is primarily an image. Something that a browser can
render without any risk to the visual representation of the primarily
textual information therein.
- a file format that allows HTML markup to be embedded within it so
that markup and text can be carried around with the image. Applications
such as search engines, copy/paste tools etc. would have access to the
text as text rather than image pixels.
It is possible, I guess, to do this with XMP [1], but my sense of it so
far is that (a) it requires stretching the use case of XMP to breaking
point and (b) folks are not using XMP for this in any great numbers.
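As a rough sketch of what "an image that carries its own markup" could
look like, the Python below builds a tiny PNG and tucks an HTML string
into a standard iTXt text chunk, then reads it back. PNG is my choice
for the illustration only, not a proposal. The keyword "html-source"
is hypothetical and not a registered PNG keyword, and the function
names are invented here.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
HTML_KEYWORD = "html-source"  # hypothetical keyword, not a registered one


def chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def itxt(keyword, text):
    """Build an uncompressed iTXt chunk holding UTF-8 text."""
    # keyword, NUL, compression flag 0, compression method 0,
    # empty language tag + NUL, empty translated keyword + NUL, then text
    body = keyword.encode("latin-1") + b"\x00" * 5 + text.encode("utf-8")
    return chunk(b"iTXt", body)


def build_png(html):
    """Return a 1x1 white grayscale PNG that also carries the HTML."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\xff")  # filter byte + one white pixel
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr) + itxt(HTML_KEYWORD, html)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))


def read_embedded_html(png, keyword=HTML_KEYWORD):
    """Walk the chunks and return the text of the matching iTXt chunk."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        pos += 12 + length  # length + type + data + CRC
        if kind == b"iTXt":
            key, rest = data.split(b"\x00", 1)
            if key.decode("latin-1") == keyword:
                # Assumes the layout written by itxt(): flag, method,
                # and two empty NUL-terminated fields before the text.
                return rest[4:].decode("utf-8")
    return None


if __name__ == "__main__":
    png = build_png("<p>Some <em>carefully laid out</em> text.</p>")
    print(read_embedded_html(png))
```

An image viewer that knows nothing about the keyword just draws the
pixels, since text chunks are ancillary and safe to ignore. What this
does not give you is any agreement between tools on what the embedded
markup means, which is really the point of the wish list.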
Am I nuts? Have I missed something? Can it really be that the world
needs another file format?
[1] http://www.adobe.com/products/xmp/