ITworld.com
The Real Difference Between Structured and Unstructured Data

XML IN PRACTICE --- 06/06/2002



RTF is more structured than HTML.

This highly contentious statement is best made out of earshot of XML people who, almost unanimously, would disagree in the strongest possible terms. I happen to believe it is true and that thinking about why it is true is a useful exercise. It is useful because it forces us to zoom in on what we mean by the terms "structured" and "unstructured". In my experience, we do not all mean the same thing by these terms, which results in confusion and flame wars in equal measure.

If your first reaction on reading the first statement was, "Wow! McGrath has finally lost it!", then please bear with me for a few more paragraphs. If you still feel I've lost it by the time we reach the end, then please come visit me in the Asylum to which I will surely be dispatched having flown so far over the Cuckoo's Nest.

I remember an episode of the documentary series "The Simpsons" where Dr. Nick Riviera's [1] hospital burns down when an errant firecracker launched by Bart and Milhouse sets fire to a tank marked "inflammable". Bewildered, Dr. Nick exclaims, "Inflammable means flammable?! Boy, what a country." Sometimes I think the words "structured" and "unstructured", beloved by the XML community, are much like flammable and inflammable -- more synonymous than antonymous when applied to data formats.

What does the word "structured" mean anyway? What characteristic makes RTF unstructured? What is it about HTML that makes it unstructured? Are HTML and RTF unstructured in the same way, or is there a difference between the two? Why are they both so different from XML? Are they really different? And just what does XHTML give us by combining HTML and XML?

Let's start with RTF. It can be argued that RTF is, in fact, highly structured. Its simple stack-based control language [2] can be unambiguously parsed using a simple pushdown automaton. Oh, and it is fully compatible with ISO-646. How much more structured can you get?
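To make the pushdown-automaton point concrete, here is a minimal sketch of a parser for an RTF-like group syntax. It is an illustration under loud assumptions, not an RTF implementation: the function name `parse_rtf_like` is hypothetical, and it handles only braces, control words made of letters plus optional digits, and plain text. Real RTF also has escaped characters, negative parameters, binary data, and destinations, all of which are ignored here.

```python
def parse_rtf_like(text):
    """Parse a simplified RTF-like string into nested Python lists.

    Assumption: only '{', '}', backslash control words, and plain text
    are recognised. This is a toy subset, not the RTF specification.
    """
    root = []
    stack = [root]  # the pushdown stack: one entry per open group
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "{":
            # Open a group: push a new list onto the stack.
            group = []
            stack[-1].append(group)
            stack.append(group)
            i += 1
        elif ch == "}":
            # Close a group: pop the stack, rejecting unbalanced input.
            if len(stack) == 1:
                raise ValueError(f"unbalanced '}}' at offset {i}")
            stack.pop()
            i += 1
        elif ch == "\\":
            # Control word: letters followed by an optional numeric parameter.
            j = i + 1
            while j < n and text[j].isalpha():
                j += 1
            while j < n and text[j].isdigit():
                j += 1
            stack[-1].append(("control", text[i + 1:j]))
            if j < n and text[j] == " ":
                j += 1  # a single space only delimits the control word
            i = j
        else:
            # Plain text runs until the next brace or backslash.
            j = i
            while j < n and text[j] not in "{}\\":
                j += 1
            stack[-1].append(("text", text[i:j]))
            i = j
    if len(stack) != 1:
        raise ValueError("unclosed group at end of input")
    return root


tree = parse_rtf_like(r"{\rtf1 {\b Bold} plain}")
```

Every input either yields exactly one nested structure or is rejected outright. That is the narrow sense of "structured" being argued for here.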

Let's move over to HTML, another famously unstructured format. Compared to RTF, HTML is definitely unstructured. Unlike RTF, HTML is not unambiguously parseable. Different HTML parsers can and do produce different parse trees of the underlying information, but the most common result of this inherent ambiguity is that browsers produce different renderings of the same content.

On this basis, I stated that RTF is more structured than HTML. Two RTF processors are more likely to agree on the fundamental tokenized data structure they emit than two HTML processors. Does it follow that we would have been better off using a stripped-down RTF rather than HTML on the Web?

The answer is a firm "no", but the reason has nothing to do with structured versus unstructured formats as the terms are commonly used. Rather, the reason has everything to do with the different conceptual model each takes towards expressing the document concept.

RTF is a highly structured format that expresses a conceptual document model in terms of instructions to make marks onto pages of a fixed width and height. It is an *imperative* document model: the author provides detailed instructions to an RTF processor to guide the rendering process.

HTML is an unstructured format used to express a conceptual document model as a hierarchy of information chunks. However, the page layout is not constrained to fixed sizes. Thus, HTML is a *declarative* document model that describes the hierarchical relationship between bits of information, but does not specify, in detail, how the information should be rendered.

HTML's declarative nature makes it a better choice for the Web than RTF, despite the fact that HTML is less structured than RTF and harder to parse. Declarative models are much easier to use in a wide variety of contexts than imperative models, which tend to be tied to one particular use (e.g., printing onto US Legal sized paper).
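The contrast can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "imperative" document is a hypothetical list of place-this-text-at-row-and-column instructions authored for a 40-column page, and the "declarative" document is a hypothetical dictionary that states only that the text is a paragraph. Neither is RTF or HTML.

```python
import textwrap

# Imperative (hypothetical): fixed positions chosen for a 40-column page.
imperative_doc = [
    (0, 0, "Declarative models adapt to"),
    (1, 0, "the width they are given."),
]

# Declarative (hypothetical): says what the content is, not where it goes.
declarative_doc = {
    "type": "paragraph",
    "text": "Declarative models adapt to the width they are given.",
}


def render_imperative(instructions, width):
    """Follow the placement instructions; anything past `width` is clipped."""
    lines = []
    for row, col, text in instructions:
        while len(lines) <= row:
            lines.append("")
        lines[row] = (lines[row].ljust(col) + text)[:width]
    return lines


def render_declarative(doc, width):
    """Let the renderer decide the line breaks for whatever width it has."""
    return textwrap.wrap(doc["text"], width=width)


narrow_imperative = render_imperative(imperative_doc, 20)
narrow_declarative = render_declarative(declarative_doc, 20)
```

At the width it was authored for, the imperative version renders as intended. At 20 columns it loses text, while the declarative version simply rewraps.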

Unfortunately, HTML has, over various iterations, developed numerous features that allow it to be more RTF-like (i.e., more imperative) in its approach to rendering information. As a consequence, producing HTML that is, essentially, RTF in drag has become possible. This happens when HTML is wired to a particular screen size, makes use of nested tables to achieve fancy layout tricks, or buries textual content in graphics to make it look nice.

So, what then can we make of XHTML, an XML-compatible version of HTML? XHTML fixes the unstructured side of HTML by making it unambiguously parseable, but does not address the fundamental issue of declarative versus imperative information models. Creating XHTML that is wired to a particular screen size, makes use of nested tables to achieve fancy layout tricks, and buries textual content in graphics to make it look nice remains entirely possible.
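The parseability half of that claim can be sketched with two modules from the Python standard library. `html.parser.HTMLParser` is a lenient tokenizer that reports tags as it sees them and leaves tree building to the caller. `xml.etree.ElementTree` is a strict XML parser that rejects markup that is not well formed. The helper names `TagLogger`, `html_events`, and `is_well_formed_xml` are hypothetical, and this illustrates the principle rather than how any browser behaves.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagLogger(HTMLParser):
    """Record start and end tags in the order the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize markup leniently; no nesting rules are enforced."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


sloppy = "<p><b>bold <i>both</b> italic</i></p>"
tidy = "<p><b>bold <i>both</i></b><i> italic</i></p>"

events = html_events(sloppy)             # accepted, overlapping tags and all
sloppy_ok = is_well_formed_xml(sloppy)   # rejected by the XML parser
tidy_ok = is_well_formed_xml(tidy)       # accepted, with one possible tree
```

The lenient tokenizer hands the overlapping `b` and `i` tags to its caller, and each consumer must decide for itself what tree they imply. The strict parser refuses to guess. Note that the tidy version can still carry purely presentational markup, so passing the strict parser says nothing about whether a document is declarative.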

The difference that really matters between structured and unstructured documents has more to do with imperative versus declarative models than with how cleanly they parse. If your HTML is imperative in its view of the world rather than declarative, then converting it to XHTML gains you very little.

NOTES

[1] "Trilogy of Error"
http://www.thespringfieldshopper.com/reviewcabf14.htm

[2] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec_3.asp

 




Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.