"XML is just a tree" is, perhaps, the most potent half-truth of early
21st century white paper executive summaries. Although the primary
abstraction fostered by the XML family of standards is, indeed, that
most information has a primary hierarchical structure, modeled in plain
text using things called "tags", the de jure reality is somewhat
different. The half-truth does, however, approximate the de facto
reality quite nicely.
Firstly, XML instances are not trees rooted at a single root element.
One level above that is needed to house anything that precedes or
follows the root element. Whatever comes before the top-level start-tag
is called the "prolog". Whatever comes after the top-level end-tag is
colloquially called the "epilog"; it does not actually have a name in
the XML standard.
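To make the extra level concrete, here is a minimal sketch using Python's standard xml.dom.minidom module. The sample document and the helper name top_level_node_names are my own inventions for illustration. It simply lists what the parser hangs directly off the document node, which sits one level above the root element.

```python
from xml.dom import minidom

# Hypothetical sample: a comment and a stylesheet processing instruction
# before the root element, and one comment after its end-tag.
SAMPLE = (
    '<?xml version="1.0"?>'
    '<!-- comment in the prolog -->'
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<root><child/></root>'
    '<!-- comment after the root end-tag -->'
)


def top_level_node_names(xml_text):
    """Return the names of the nodes that are siblings of the root element."""
    doc = minidom.parseString(xml_text)
    return [node.nodeName for node in doc.childNodes]


names = top_level_node_names(SAMPLE)
print(names)
```

The root element shows up as just one child among several, which is the sense in which the real tree starts one level higher than the root element.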
The prolog can bite you when you realize that DOCTYPE declarations, XSL
style-sheet processing instructions, and character set information are
all part of the prolog. The epilog stuff can bite you if you wish to
concatenate XML instances into a single stream of data and split them
apart later. It is not possible to tell if certain constructs
(processing instructions, comments) are part of the epilog of one XML
document or part of the prolog of the next!
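The ambiguity is easy to reproduce. The sketch below, again in Python with only the standard library, takes a made-up two-document stream and cuts it in two different places. The stream contents and the helper is_well_formed are assumptions of mine, not anything from a real system. Both cuts produce two well-formed documents, so nothing in the data says which reading is the intended one.

```python
import xml.etree.ElementTree as ET

# Hypothetical stream: two tiny documents with a comment between them.
STREAM = "<a/><!-- note --><b/>"
COMMENT_START = STREAM.index("<!--")
COMMENT_END = STREAM.index("-->") + len("-->")


def is_well_formed(text):
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Reading 1: the comment trails the first document.
split_as_epilog = (STREAM[:COMMENT_END], STREAM[COMMENT_END:])
# Reading 2: the comment leads the second document.
split_as_prolog = (STREAM[:COMMENT_START], STREAM[COMMENT_START:])

epilog_reading_ok = all(is_well_formed(part) for part in split_as_epilog)
prolog_reading_ok = all(is_well_formed(part) for part in split_as_prolog)
whole_stream_ok = is_well_formed(STREAM)

print(epilog_reading_ok, prolog_reading_ok, whole_stream_ok)
```

The unsplit stream is rejected because it has two root elements, while either split is accepted. A splitter therefore needs some out-of-band convention, such as a delimiter or a length prefix, to decide where one document ends.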
So, the XML "tree" is actually a tree one level further up than most
people conceptualize it, in order to accommodate the prolog and epilog.
But this is only the first level at which the XML "tree" is not what
most people conceptualize it as. XML is actually *two* tree structures
nested perfectly one inside the other. The one we all know and love is
called the "logical structure" -- the one composed of start-tags,
end-tags, attributes, and data content. The other one, less well known
and significantly less well loved, is called the "physical structure"
and is composed of "entities".
The entity structure allows you to assemble a logical tree from a
collection of physical pieces -- typically files. References to these
entities are introduced with an "&" and end with a ";", with the entity
name in the middle. "amp" and "lt" are two simple examples that are
actually built into the XML standard. Here is another one:
<!DOCTYPE foo SYSTEM "foo.dtd" [
<!ENTITY bar SYSTEM "bar.xml">
]>
<foo>
&bar;
</foo>
The upshot of this structure from the parser's perspective is that the
contents of the file bar.xml are spliced into the document to replace
the "&bar;" reference. But here is the thing: the file bar.xml can
itself contain entity references, whose replacement text can contain
further entity references, and so on. All in a perfectly nested tree
structure.
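The splicing can be watched in action with a small Python sketch. To keep it self-contained I have assumed internal entities, declared inline in the DOCTYPE, in place of the external file bar.xml used above; the entity name "inner" and the element name "item" are likewise invented for illustration. The nesting behaviour is the same: the replacement text of one entity contains a reference to another.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "bar" is an internal entity (not an external file)
# whose replacement text itself refers to a second entity, "inner".
DOC = """<!DOCTYPE foo [
<!ENTITY inner "nested text">
<!ENTITY bar "<item>&inner;</item>">
]>
<foo>&bar;</foo>"""

# The parser expands &bar;, and then &inner; inside it, while building
# the logical tree, so no trace of the entity boundaries remains.
root = ET.fromstring(DOC)
item = root[0]
serialized = ET.tostring(root, encoding="unicode")

print(item.tag, "/", item.text)
print(serialized)
```

By the time the application sees the tree, the physical structure has vanished into the logical one. That is convenient for consumers, but it also means the entity boundaries are lost unless the parser is asked to report them.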
These two trees -- the logical and the physical -- are far from equal
in the XML world. The logical structure is at the top of most people's
conceptualization of XML. The entity structure is more IT architect
fodder. For those who discover the entity structure there is a
temptation to use it and a sense of fear: "How come this stuff isn't
more widely utilized? Is there a deep gotcha four levels deep into this
entity theory?"
Personally, I believe the entity structure should be eschewed. Firstly,
it is based on a "declare before use" model that is not the way the
Web works. Secondly, it is tightly bound to DTDs in an unpleasant
way. What has entity structure got to do with validation? Exactly!
Thirdly, nobody seems to want what the entity structure has to offer.
More than one developer of my acquaintance has looked at it and
said, "Nah! Thanks, but no thanks." Fourthly, XInclude, once it is cut
down to size, will, I believe, provide most of the benefits with none
of the syntactic baggage.
The benefits and drawbacks I see in XInclude as currently formulated
will be the subject of a future article.