June 06, 2002, 12:00 AM
RTF is more structured than HTML.
This highly contentious statement is best made out of earshot of XML
people who, almost unanimously, would disagree in the strongest possible
terms. I happen to believe it is true and that thinking about why it is
true is a useful exercise. It is useful because it forces us to zoom in
on what we mean by the terms "structured" and "unstructured". In my
experience, we do not all mean the same thing by these terms, which
results in confusion and flame wars in equal measure.
If your first reaction on reading the first statement was, "Wow! McGrath
has finally lost it!", then please bear with me for a few more
paragraphs. If you still feel I've lost it by the time we reach the end,
then please come visit me in the Asylum to which I will surely be
dispatched having flown so far over the Cuckoo's Nest.
I remember an episode of the documentary series "The Simpsons" where Dr.
Nick Riviera's hospital burns down when an errant firecracker
launched by Bart and Milhouse sets fire to a tank marked "inflammable".
Bewildered, Dr. Nick exclaims, "Inflammable means flammable?! Boy, what a
country." Sometimes I think the words "structured" and "un-structured",
beloved by the XML community, are much like flammable and inflammable --
more synonymous than antonymous when applied to data formats.
What does the word "structured" mean anyway? What characteristic makes
RTF un-structured? What is it about HTML that makes it un-structured?
Are HTML and RTF un-structured in the same way, or is there a difference
between the two? Why are they both so different from XML? Are they
really different? And just what does XHTML give us by combining HTML and
XML?
Let's start with RTF. It can be argued that RTF is, in fact, highly
structured. Its simple stack-based control language can be
unambiguously parsed using a simple pushdown automaton. Oh, and it is
fully compatible with ISO-646. How much more structured can you get?
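To make the pushdown-automaton claim concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not a conforming RTF reader: it knows only about groups, control words, control symbols and text, and it ignores details such as hex escapes, binary data and line-ending rules. The names `TOKEN` and `parse_rtf` are my own, invented for this example.

```python
import re

# One alternative per token kind: group open, group close, control word
# (with optional numeric parameter and one optional delimiting space),
# control symbol, or a run of plain text.
TOKEN = re.compile(
    r"(?P<open>\{)"
    r"|(?P<close>\})"
    r"|\\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)? ?"
    r"|\\(?P<symbol>[^a-zA-Z])"
    r"|(?P<text>[^\\{}]+)"
)

def parse_rtf(source):
    """Build a nested-list tree from RTF-like input using an explicit stack."""
    root = []
    stack = [root]  # the pushdown store: one entry per open group
    for m in TOKEN.finditer(source):
        if m.group("open"):
            group = []
            stack[-1].append(group)  # attach new group to its parent
            stack.append(group)      # push
        elif m.group("close"):
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()              # pop
        elif m.group("word"):
            param = m.group("param")
            value = int(param) if param else None
            stack[-1].append(("control", m.group("word"), value))
        elif m.group("symbol"):
            stack[-1].append(("symbol", m.group("symbol")))
        else:
            stack[-1].append(("text", m.group("text")))
    if len(stack) != 1:
        raise ValueError("unbalanced '{'")
    return root

tree = parse_rtf(r"{\rtf1\ansi Hello {\b world}}")
```

The point of the sketch is that there is exactly one push and one pop rule, so any two implementations written this way should build the same tree for the same balanced input.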
Let's move over to HTML, another famously un-structured format. Compared
to RTF, HTML is definitely un-structured. Unlike RTF, HTML is not
unambiguously parseable. Different HTML parsers can and do produce
different parse trees of the underlying information, but the most common
result of this inherent ambiguity is that browsers produce different
renderings of the same content.
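A small Python sketch shows how two parsers can disagree. Both tree builders below consume the same event stream from the standard library's `html.parser`, but they apply different recovery policies to a mismatched end tag. The `TreeBuilder` class, the `close_through` flag and both policies are hypothetical, invented for this example; neither is claimed to match what any particular browser does.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds (tag, children) tuples; close_through selects the recovery policy."""

    def __init__(self, close_through):
        super().__init__()
        self.close_through = close_through
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        open_tags = [node[0] for node in self.stack]
        if open_tags[-1] == tag:
            self.stack.pop()  # well-nested case: both policies agree
        elif self.close_through and tag in open_tags[1:]:
            # Policy B: close every open element up to and including the match.
            while self.stack.pop()[0] != tag:
                pass
        # Policy A (and unmatched tags under B): ignore the stray end tag.

    def handle_data(self, data):
        self.stack[-1][1].append(data)

def build(markup, close_through):
    builder = TreeBuilder(close_through)
    builder.feed(markup)
    builder.close()
    return builder.root

markup = "<b>bold <i>both</b> italic</i>"
tree_a = build(markup, close_through=False)
tree_b = build(markup, close_through=True)
```

Under the first policy the text " italic" ends up inside the `i` element, nested within `b`. Under the second it ends up at the top level, outside both. Same bytes in, two different trees out.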
On this basis, I stated that RTF is more structured than HTML. Two RTF
processors are more likely to agree on the fundamental tokenized data
structure they emit than two HTML processors.