RTF is more structured than HTML.
This highly contentious statement is best made out of earshot of XML
people who, almost unanimously, would disagree in the strongest possible
terms. I happen to believe it is true and that thinking about why it is
true is a useful exercise. It is useful because it forces us to zoom in
on what we mean by the terms "structured" and "unstructured". In my
experience, we do not all mean the same thing by these terms, which
results in confusion and flame wars in equal measure.
If your first reaction on reading the first statement was, "Wow! McGrath
has finally lost it!", then please bear with me for a few more
paragraphs. If you still feel I've lost it by the time we reach the end,
then please come visit me in the Asylum to which I will surely be
dispatched having flown so far over the Cuckoo's Nest.
I remember an episode of the documentary series "The Simpsons" where Dr.
Nick Riviera's [1] hospital burns down when an errant firecracker
launched by Bart and Milhouse sets fire to a tank marked "inflammable".
Bewildered, Dr. Nick exclaims, "Inflammable means flammable?! Boy, what a
country." Sometimes I think the words "structured" and "un-structured",
beloved by the XML community, are much like flammable and inflammable --
more synonymous than antonymous when applied to data formats.
What does the word "structured" mean anyway? What characteristic makes
RTF un-structured? What is it about HTML that makes it un-structured?
Are HTML and RTF un-structured in the same way, or is there a difference
between the two? Why are they both so different from XML? Are they
really different? And just what does XHTML give us by combining HTML and
XML?
Let's start with RTF. It can be argued that RTF is, in fact, highly
structured. Its simple stack-based control language [2] can be
unambiguously parsed using a simple pushdown automaton. Oh, and it is
fully compatible with ISO-646. How much more structured can you get?
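To make the pushdown-automaton point concrete, here is a minimal sketch in Python of a parser for a simplified, RTF-like group syntax. It is an illustration under stated assumptions, not an implementation of the RTF specification: the function name parse_rtf_groups is hypothetical, and escapes, numeric parameters, and binary data are all ignored. The only state it needs is a stack of open groups.

```python
def parse_rtf_groups(text):
    # Simplified RTF-like grammar (an assumption, not the full spec):
    #   '{' opens a group, '}' closes it,
    #   a backslash followed by letters/digits is a control word,
    #   anything else is literal text.
    root = []
    stack = [root]  # the pushdown automaton's stack of open groups
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            group = []
            stack[-1].append(group)
            stack.append(group)
            i += 1
        elif ch == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}' at offset %d" % i)
            stack.pop()
            i += 1
        elif ch == "\\":
            j = i + 1
            while j < len(text) and text[j].isalnum():
                j += 1
            stack[-1].append(("control", text[i + 1:j]))
            # One space after a control word is a delimiter, not content.
            if j < len(text) and text[j] == " ":
                j += 1
            i = j
        else:
            j = i
            while j < len(text) and text[j] not in "{}\\":
                j += 1
            stack[-1].append(("text", text[i:j]))
            i = j
    if len(stack) != 1:
        raise ValueError("unclosed '{'")
    return root


tree = parse_rtf_groups(r"{\rtf1 Hello {\b bold} world}")
print(tree)
```

Every brace either pushes or pops, so there is exactly one tree for any balanced input, and unbalanced input is rejected rather than guessed at.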
Let's move over to HTML, another famously un-structured format. Compared
to RTF, HTML is definitely un-structured. Unlike RTF, HTML is not
unambiguously parseable. Different HTML parsers can and do produce
different parse trees of the underlying information, but the most common
result of this inherent ambiguity is that browsers produce different
renderings of the same content.
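The ambiguity can be sketched with Python's standard html.parser module, which only tokenizes and leaves tree building to the caller. The TreeBuilder class and its two recovery rules below are hypothetical, chosen for illustration; neither is claimed to match what any particular browser does. The point is simply that mis-nested markup forces the processor to pick a rule, and different rules give different trees.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds (tag, children) tuples using one of two recovery rules."""

    def __init__(self, close_implicitly):
        super().__init__()
        self.close_implicitly = close_implicitly
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_data(self, data):
        self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        open_tags = [name for name, _ in self.stack[1:]]  # skip synthetic root
        if open_tags and open_tags[-1] == tag:
            self.stack.pop()
        elif self.close_implicitly and tag in open_tags:
            # Rule B: close every element down to the matching one.
            while self.stack.pop()[0] != tag:
                pass
        # Rule A (and any unmatched tag): ignore the stray end tag.


def build(markup, close_implicitly):
    builder = TreeBuilder(close_implicitly)
    builder.feed(markup)
    builder.close()
    return builder.root


soup = "<b>bold <i>both</b> italic?</i>"
print(build(soup, close_implicitly=False))
print(build(soup, close_implicitly=True))
```

Under the first rule the trailing text ends up inside the i element; under the second it ends up outside both b and i. Both results are defensible readings of the same bytes.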
On this basis, I stated that RTF is more structured than HTML. Two RTF
processors are more likely to agree on the fundamental tokenized data
structure they emit than two HTML processors. Does it follow that we
would have been better off using a stripped-down RTF rather than
HTML on the Web?
The answer is a firm "no", but the reason has nothing to do with
structured versus un-structured formats as the terms are commonly used.
Rather, the reason has everything to do with the different conceptual
model each takes towards expressing the document concept.
RTF is a highly structured format that expresses a conceptual document
model in terms of instructions to make marks onto pages of a fixed width
and height. It is an *imperative* document model: the author provides
detailed instructions to an RTF processor to guide the rendering
process.
HTML is an unstructured format used to express a conceptual document
model as a hierarchy of information chunks. However, the page layout is
not constrained to fixed sizes. Thus, HTML is a *declarative* document
model that describes the hierarchical relationship between bits of
information, but does not specify, in detail, how the information should
be rendered.
HTML's declarative nature makes it a better choice for the Web than RTF,
despite the fact that HTML is less structured than RTF and harder to
parse. Declarative models are much easier to use in a wide variety of
contexts than imperative models, which tend to be tied to one particular
use (e.g., printing onto US Legal-sized paper).
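The contrast between the two models can be sketched in a few lines of Python. Everything here is a toy built for illustration: the tuple-based document shapes, the operation names move_to and show, and both render functions are assumptions, not anything defined by RTF or HTML. The declarative document says what each chunk is; the imperative one records where each mark goes on a page that is 30 columns wide.

```python
import textwrap

# Declarative: names the role of each chunk and leaves layout open.
declarative_doc = [
    ("heading", "Structure"),
    ("para", "Declarative markup leaves layout decisions to the renderer."),
]

# Imperative: line breaks and positions were fixed in advance for a
# 30-column page (hypothetical operation names).
imperative_doc = [
    ("move_to", 0, 0), ("show", "STRUCTURE"),
    ("move_to", 0, 1), ("show", "Declarative markup leaves"),
    ("move_to", 0, 2), ("show", "layout decisions to the"),
    ("move_to", 0, 3), ("show", "renderer."),
]


def render_declarative(doc, width):
    """Lay the chunks out for whatever width the caller asks for."""
    lines = []
    for kind, text in doc:
        if kind == "heading":
            lines.append(text.upper())
        else:
            lines.extend(textwrap.wrap(text, width))
    return lines


def render_imperative(ops):
    """Replay the instructions; there is no width parameter to honor."""
    rows = {}
    col = row = 0
    for op in ops:
        if op[0] == "move_to":
            col, row = op[1], op[2]
        elif op[0] == "show":
            rows[row] = rows.get(row, "").ljust(col) + op[1]
    return [rows[r] for r in sorted(rows)]


for width in (30, 60):
    print(width, render_declarative(declarative_doc, width))
print(render_imperative(imperative_doc))
```

The declarative document reflows to one line of body text at 60 columns and three lines at 30. The imperative document always produces the 30-column layout, because the layout decisions were baked into the data.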
Unfortunately, HTML has, over various iterations, developed numerous
features that allow it to be more RTF-like (i.e., more imperative) in
its approach to rendering information. As a consequence, producing HTML
that is, essentially, RTF in drag has become possible. This happens when
HTML is wired to a particular screen size, makes use of nested tables to
achieve fancy layout tricks, or buries textual content in graphics to
make it look nice.
So, what then can we make of XHTML, the XML-compatible version of HTML?
XHTML fixes the unstructured side of HTML by making it unambiguously
parseable, but does not address the fundamental issue of declarative
versus imperative information models. Creating XHTML that is wired to a
particular screen size, makes use of nested tables to achieve fancy
layout tricks, and buries textual content in graphics to make it look
nice remains entirely possible.
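A small sketch using Python's standard xml.etree.ElementTree shows both halves of that claim. The helper is_well_formed and the sample markup (including the file name heading.gif) are hypothetical; only well-formedness is checked here, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Mis-nested tags: an XML parser rejects this instead of guessing.
misnested = "<p><b>bold <i>both</b> italic?</i></p>"

# Well-formed, yet still pixel-width tables wrapped around a picture of text.
layout_tables = (
    '<table width="800"><tr><td>'
    '<table><tr><td><img src="heading.gif" alt="" /></td></tr></table>'
    '</td></tr></table>'
)

print(is_well_formed(misnested))
print(is_well_formed(layout_tables))
```

The first check fails and the second passes: the parser guarantees a single tree, but it has no opinion on whether that tree describes information or dictates layout.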
The difference between structured and unstructured documents has more to
do with imperative versus declarative models than with how unambiguously
the markup can be parsed. If your HTML is imperative in its view of the
world rather than declarative, then converting it to XHTML gains you
very little.
NOTES
[1] "Trilogy of Error"
http://www.thespringfieldshopper.com/reviewcabf14.htm
[2]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec_3.asp