The Real Difference Between Structured and Unstructured Data

RTF is more structured than HTML.

This highly contentious statement is best made out of earshot of XML

people who, almost unanimously, would disagree in the strongest possible

terms. I happen to believe it is true and that thinking about why it is

true is a useful exercise. It is useful because it forces us to zoom in

on what we mean by the terms "structured" and "unstructured". In my

experience, we do not all mean the same thing by these terms, which

results in confusion and flame wars in equal measure.

If your first reaction on reading the first statement was, "Wow! McGrath

has finally lost it!", then please bear with me for a few more

paragraphs. If you still feel I've lost it by the time we reach the end,

then please come visit me in the Asylum to which I will surely be

dispatched having flown so far over the Cuckoo's Nest.

I remember an episode of the documentary series "The Simpsons" where Dr.

Nick Riviera's [1] hospital burns down when an errant firecracker

launched by Bart and Milhouse sets fire to a tank marked "inflammable".

Bewildered, Dr. Nick exclaims, "Inflammable means flammable?! Boy, what a

country." Sometimes I think the words "structured" and "un-structured",

beloved by the XML community, are much like flammable and inflammable --

more synonymous than antonymous when applied to data formats.

What does the word "structured" mean anyway? What characteristic makes

RTF un-structured? What is it about HTML that makes it un-structured?

Are HTML and RTF un-structured in the same way, or is there a difference

between the two? Why are they both so different from XML? Are they

really different? And just what does XHTML give us by combining HTML and

XML?

Let's start with RTF. It can be argued that RTF is, in fact, highly

structured. Its simple stack-based control language [2] can be

unambiguously parsed using a simple pushdown automaton. Oh, and it is

fully compatible with ISO-646. How much more structured can you get?
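
To make the pushdown-automaton point concrete, here is a minimal Python sketch of a parser for RTF-style groups and control words. It is an illustration under simplifying assumptions, not a conforming RTF reader: the function name parse_rtf_groups is hypothetical, and binary data, \'hh escapes, and destinations are not treated specially.

```python
def parse_rtf_groups(text):
    """Parse RTF-style braces and control words into nested lists.

    Simplified sketch: every group becomes a list, every control word
    or run of text becomes a string inside the enclosing list.
    """
    root = []
    stack = [root]              # the automaton's stack of open groups
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "{":           # open a group: push
            group = []
            stack[-1].append(group)
            stack.append(group)
            i += 1
        elif ch == "}":         # close a group: pop
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace at offset %d" % i)
            stack.pop()
            i += 1
        elif ch == "\\":
            j = i + 1
            if j < n and not text[j].isalpha():
                j += 1          # control symbol: backslash plus one character
                stack[-1].append(text[i:j])
            else:
                while j < n and text[j].isalpha():
                    j += 1      # control word letters
                if j < n and text[j] == "-":
                    j += 1      # optional sign of the numeric parameter
                while j < n and text[j].isdigit():
                    j += 1      # optional numeric parameter
                stack[-1].append(text[i:j])
                if j < n and text[j] == " ":
                    j += 1      # one trailing space is a delimiter, not content
            i = j
        else:                   # plain text up to the next brace or backslash
            j = i
            while j < n and text[j] not in "{}\\":
                j += 1
            stack[-1].append(text[i:j])
            i = j
    if len(stack) != 1:
        raise ValueError("unclosed group at end of input")
    return root
```

Every character is consumed by exactly one branch, and the only state is the stack of open groups. Any two implementations of these rules should arrive at the same nesting for the same input, which is the sense of "structured" being argued for here.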

Let's move over to HTML, another famously un-structured format. Compared

to RTF, HTML is definitely un-structured. Unlike RTF, HTML is not

unambiguously parseable. Different HTML parsers can and do produce

different parse trees of the underlying information, but the most common

result of this inherent ambiguity is that browsers produce different

renderings of the same content.
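
The ambiguity can be sketched in Python. The tokenizer in the standard library's html.parser accepts tag soup without complaint, so the shape of the tree depends on whatever recovery policy the tree builder applies. The two policies below (selected by the hypothetical implicit_close flag) are assumptions made for illustration and do not reproduce the behavior of any particular browser.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build a (tag, children) tree from HTML events under one of two policies."""

    def __init__(self, implicit_close):
        super().__init__()
        self.implicit_close = implicit_close   # hypothetical recovery policy
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Policy B only: a new <p> implicitly closes a <p> that is still open.
        if self.implicit_close and tag == "p" and self.stack[-1][0] == "p":
            self.stack.pop()
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def parse(markup, implicit_close):
    builder = TreeBuilder(implicit_close)
    builder.feed(markup)
    builder.close()
    return builder.root


soup = "<p>one<p>two"
nested = parse(soup, implicit_close=False)   # second <p> inside the first
siblings = parse(soup, implicit_close=True)  # two <p> elements side by side
```

Both results are defensible readings of the same bytes. One builder nests the second paragraph inside the first and the other makes them siblings, so any stylesheet or script that walks the tree sees two different documents.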

On this basis, I stated that RTF is more structured than HTML. Two RTF

processors are more likely to agree on the fundamental tokenized data

structure they emit than two HTML processors. Does it follow that we

would have been better off using a stripped-down RTF rather than

HTML on the Web?

The answer is a firm "no", but the reason has nothing to do with

structured versus un-structured formats as the terms are commonly used.

Rather, the reason has everything to do with the different conceptual

model each takes towards expressing the document concept.

RTF is a highly structured format that expresses a conceptual document

model in terms of instructions to make marks onto pages of a fixed width

and height. It is an *imperative* document model that involves providing

detailed instructions to an RTF processor to guide the rendering

process.

HTML is an unstructured format used to express a conceptual document

model as a hierarchy of information chunks. However, the page layout is

not constrained to fixed sizes. Thus, HTML is a *declarative* document

model that describes the hierarchical relationship between bits of

information, but does not specify, in detail, how the information should

be rendered.

HTML's declarative nature makes it a better choice for the Web than RTF,

despite the fact that HTML is less structured than RTF and harder to

parse. Declarative models are much easier to use in a wide variety of

contexts than imperative models, which tend to be tied to one particular

use (e.g., printing onto US Legal-sized paper).
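
The contrast can be sketched in a few lines of Python. Both document formats below are hypothetical toys invented for this illustration: the imperative one is a list of "move" and "text" instructions aimed at a page of fixed size, and the declarative one is a titled section holding paragraphs with no layout information at all.

```python
import textwrap

# Imperative: explicit instructions for marks on a fixed 40-column page.
imperative_doc = [
    ("move", 0, 0), ("text", "Structured Data"),
    ("move", 0, 2), ("text", "RTF tells the renderer where marks go."),
]

# Declarative: a hierarchy of information chunks, with no layout.
declarative_doc = (
    "section",
    "Structured Data",
    ["HTML says what the chunks are and leaves layout to the renderer."],
)


def render_imperative(doc, width=40, height=4):
    """Execute the instructions on a width x height grid of characters."""
    page = [[" "] * width for _ in range(height)]
    x = y = 0
    for op in doc:
        if op[0] == "move":
            _, x, y = op
        elif op[0] == "text":
            for ch in op[1]:
                if x < width and y < height:
                    page[y][x] = ch    # marks that fall off the page are lost
                x += 1
    return ["".join(row).rstrip() for row in page]


def render_declarative(doc, width):
    """Lay out the section for whatever width the caller happens to have."""
    _, title, paragraphs = doc
    lines = [title, ""]
    for para in paragraphs:
        lines.extend(textwrap.wrap(para, width))
    return lines
```

Rendering the imperative document on a 20-column device cuts its second line off mid-word, because the instructions assumed a wider page. The declarative document simply rewraps, and no content is lost at either width.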

Unfortunately, HTML has, over various iterations, developed numerous

features that allow it to be more RTF-like (i.e., more imperative) in

its approach to rendering information. As a consequence, producing HTML

that is, essentially, RTF in drag has become possible. This happens when

HTML is wired to a particular screen size, makes use of nested tables to

achieve fancy layout tricks, or buries textual content in graphics to

make it look nice.

So, what then can we make of XHTML, an XML-compatible version of HTML?

XHTML fixes the unstructured side of HTML by making it unambiguously

parseable, but does not address the fundamental issue of declarative

versus imperative information models. Creating XHTML that is wired to a

particular screen size, makes use of nested tables to achieve fancy

layout tricks, and buries textual content in graphics to make it look

nice remains entirely possible.
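
A short Python sketch shows both halves of this point, using the standard library's xml.etree.ElementTree as the XML parser. The helper name is_well_formed and the sample markup are made up for illustration, and the check covers well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Tag soup that an HTML browser would quietly repair.
soup = "<html><body><p>one<p>two</body></html>"

# The same content as XHTML: every element is explicitly closed.
xhtml = "<html><body><p>one</p><p>two</p></body></html>"

# Perfectly well-formed, yet wired to an 800-pixel screen via nested tables.
layout_xhtml = (
    '<table width="800"><tr><td>'
    "<table><tr><td>Title</td></tr></table>"
    "</td></tr></table>"
)
```

The XML parser rejects the tag soup and accepts the other two. It has no opinion on whether the markup describes information or dictates layout, so well-formedness alone does nothing to make a document declarative.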

The real difference between structured and unstructured documents has more to

do with imperative versus declarative models than with how cleanly they parse. If your HTML is imperative

in its view of the world rather than declarative, then converting it to

XHTML gains you very little.

NOTES

[1] "Trilogy of Error"

http://www.thespringfieldshopper.com/reviewcabf14.htm

[2]

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec_3.asp
