Nothing in the world of XML seems as harmless as the <p> tag: a
universally accepted way of saying "here is a paragraph", a concept
familiar to anyone with even a passing acquaintance with the Web. It's
as if the <p> tag has always been with us, a fundamental truth, part of
the fabric of the universe. Discovered rather than invented. Simple,
elegant, perfect....
Of course, the <p> tag is too simple for some pedants who insist on
using <para>, or even <paragraph>, tags. Such pretensions! Plain
country folk like me like our beer cold, our apple pie warm, and our
paragraphs surrounded by good ol' <p> tags just like grandma used to
make. However, peel off a layer or two and those little old <p> tags
show their teeth, revealing a vista of complexity that is at the heart
of the "XML for data" versus "XML for documents" debate.
Two features differentiate data-oriented XML from document-oriented
XML. Firstly, the depth of tagging is irregular and unbounded in
"documents". In data-oriented XML, the tagging is regular and bounded:
all tags occur in the same order, record after record, and the same
depth of tagging is used throughout. Secondly, in "documents" plain
text can be intermixed at the same level as tags to create what is
called "mixed content". In data-oriented XML, everything is tagged;
there is no free-standing text and thus no mixed content.
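
To make the distinction concrete, here is a minimal sketch in Python
using the standard library's xml.etree.ElementTree. The sample XML and
the helper names (records, has_mixed_content, full_text) are made up for
illustration and are not part of any standard. The record-style input
maps cleanly onto dictionaries; the <p> input does not, because its text
is scattered between child tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-oriented sample: every value sits in its own tag,
# in the same order and at the same depth, record after record.
DATA_XML = (
    "<people>"
    "<person><name>Ann</name><age>34</age></person>"
    "<person><name>Bob</name><age>51</age></person>"
    "</people>"
)

# Hypothetical document-oriented sample: text and tags share one level.
DOC_XML = "<p>Plain <b>country</b> folk like <i>cold</i> beer.</p>"


def records(xml_text):
    """Turn each child of the root into a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in rec} for rec in root]


def has_mixed_content(elem):
    """True if elem has child elements with free-standing text beside them."""
    children = list(elem)
    if not children:
        return False
    if (elem.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in children)


def full_text(elem):
    """Collect all text under elem, in document order."""
    return "".join(elem.itertext())


if __name__ == "__main__":
    print(records(DATA_XML))
    paragraph = ET.fromstring(DOC_XML)
    print(has_mixed_content(paragraph))
    print(repr(paragraph.text))   # only the text before the first child
    print(full_text(paragraph))
```

Note that asking the <p> element for its text yields only the fragment
before the first child tag; the rest lives in the children and their
tails, which is exactly the kind of surprise that record-style code
never has to deal with.
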
In both cases, the <p> tag is center stage. In fact, if you see a <p>
tag in an XML document, then you can infer a lot about the type of
issues you are likely to face. If you find <para> or <paragraph> tags,
then you know that the issues are the same but you are also dealing
with a pedant.
- If <p> tags occur in an XML document, you can be pretty sure that
it will not be possible to treat the data as a collection of
records. In particular, it may prove impossible to get the
contents of any particular tag easily. Upside-down, event-driven
programming typically results.
- If <p> tags occur in an XML document, you can be pretty sure that
white space is significant in some places. In other words, it
will not be possible to simply strip any white space surrounding
tags without potentially damaging the content.
- If <p> tags occur in an XML document, you can be pretty sure that
typography issues will be troublesome. The "paragraph" is the
fundamental block into which rendering engines flow text for
presentation. However, the print world has long worked on the
basis of "margins", often measured in tiny fractions of an inch,
to specify the locations of paragraphs.
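
The first two points in the list above can be sketched in Python with
the standard library's re and xml.sax modules. The sample strings and
the names strip_around_tags, text_of, ParagraphCollector and paragraphs
are assumptions made for illustration only. The regular expression is a
deliberately naive stripper, not a recommended technique, and the SAX
handler shows the "upside down" style in which the parser calls your
code rather than the other way around.

```python
import re
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical sample paragraph with white space between inline tags.
DOC_XML = "<p>Our <b>beer</b> <i>cold</i>, our pie warm.</p>"


def strip_around_tags(xml_text):
    """Naively delete any white space that touches a tag."""
    return re.sub(r"\s*(<[^>]+>)\s*", r"\1", xml_text)


def text_of(xml_text):
    """Return all character data in the document, in order."""
    return "".join(ET.fromstring(xml_text).itertext())


class ParagraphCollector(xml.sax.ContentHandler):
    """Event-driven collection of the text of each outermost <p>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <p> elements currently open
        self.buffer = []      # text fragments of the current paragraph
        self.paragraphs = []  # finished paragraphs

    def startElement(self, name, attrs):
        if name == "p":
            self.depth += 1
            if self.depth == 1:
                self.buffer = []

    def characters(self, content):
        # Text arrives in pieces, interleaved with tag events.
        if self.depth > 0:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "p":
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self.buffer))


def paragraphs(xml_text):
    """Parse xml_text and return the text of every outermost <p>."""
    handler = ParagraphCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paragraphs


if __name__ == "__main__":
    print(text_of(DOC_XML))
    print(text_of(strip_around_tags(DOC_XML)))  # words run together
    print(paragraphs("<doc>" + DOC_XML + "<p>Pie.</p></doc>"))
```

With record-style XML the same stripping would be harmless, since no
text sits between the tags in the first place.
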
On the Web, where sub-millimeter control over paragraph layout is
neither practical nor desirable, an alternative paragraph-positioning
model is needed. The answer, to date, has relied on the single most
abused element in the HTML tag bag -- the table. Much to the chagrin
of typographers and XML data modelers alike, the borderless table
has replaced pretty much every other geometry model for laying out
paragraphs of text.
CSS2 has made it possible to exert fine control over paragraph
positioning using pre-Web methods such as left indent, negative first
line indent, and so on. However, until the likes of CSS2 become
standard in all browsers, we are likely to see table trickery remain.
So, in summary, the <p> tag is not so simple. Its presence or absence
tells you a lot about the type of XML you are dealing with, not to
mention the world-view of whoever created it. If you work purely with
data-oriented XML, you may never come across one, but if you work with
document-oriented XML, it will be a source of constant trouble and
complexity, and also of endless fascination for us easily amused
doc-heads!