XML tags tell you what data actually means. XML tags turn mere data into
bright shiny information nuggets just dripping with juicy semantics. The
truth is out there -- in the data where it belongs -- thanks to XML.
Truth is, some of the most egregious lies I have ever seen were
expressed in terms of fully validating, XML 1.0 compliant XML documents.
Dripping with tags? Yes. Dripping with truth? Er, no.
Perhaps the most common form of XML lies is known in the trade as tag
abuse [1]. Tag abuse occurs whenever a tag is used in such a way that it
meets all the rules of XML yet does not fulfill the role intended by the
application designer. WYSIWYG editing environments are a common cause of
tag abuse.
Let's say that you, as an author, wish to emphasize a phrase by putting it
into italics. Something like this:
<p>Insert the widget <Emphasis>very
carefully</Emphasis>.</p>
You ask your wonderful XML-aware editing tool what tags are valid at
this point in the document and it pops up a list containing these
options:
ReplaceableBattery
AmphibiousLandingCraft
Table
Emphasis
You notice that the ReplaceableBattery tag renders its contents in
italic and, since it is first on the list and quickest to insert, you
use it! Your document now contains this:
<p>Insert the widget <ReplaceableBattery>very
carefully</ReplaceableBattery>.</p>
As an author, do you care? Using the first tag that comes to hand makes
your life easier, the document prints okay and looks fine on the Web
site....
In my first decade as a markup geek, I confess to being on the
engineering side of this argument. In a word, I was horrified. Tag abuse
used to drive me crazy! What is it with those authors?!? They should get
with the program and start marking up the data using the tags we
engineers properly make available. No shortcuts!
Then I started to write books. Boy did that change my perspective on the
problem! As a writer, I find XML -- indeed, any form of structured
authoring -- a real pain. It gets in the way! When I am in full flow,
trying to squeeze sensibly structured English out of my brain, the last
thing I want is an interface that beeps at me and insists I select from
long lists of available tags. Half of the time, I would not be in a
position to pick a tag if I tried. Why? Because writing is a creative
process. As the words flow through my fingers, I do not have a
comprehensive ontological map of the territory. It's just words and
ideas. The names of the right tags will be obvious, but only after the
content has come into existence. Not before and certainly not during the
content creation process. I used to abuse tags with the best of 'em!
The engineers, not the authors, need to get with the program and realize
that XML markup cannot be an impediment to the creation of content. If
it is, it will be subverted.
There is another reason -- this time a linguistic one -- that causes
lies to creep into XML tagging. Natural languages, such as English,
complement vocabularies with grammars. In that sense, they are much like
XML applications, which also complement vocabularies (tags) with
grammars (schemas). Ever wonder why the most commonly used constructs in
English break the rules of English grammar?
Humans are fond of what George Kingsley Zipf calls the Principle of Least Effort
[2]. Basically, we humans will break rules left, right, and center in
order to make communications easier. Whether in natural languages
(English, French) or artificial languages (UBL, DocBook), the result is
the same: We break the rules of grammars to make our life easier. Ergo,
tag abuse happens, so deal with it.
A third form of lies through markup is also a manifestation of human
nature but of a different kind. What if, having searched the list of
available tags in your schema, you cannot find one that suits your
needs? Given the typically high cost of modifying schemata, authoring
environments, downstream processes, etc., the temptation to just pick
a tag that is "close enough" is very great. Indeed, in some environments
where XML data capture is invoiced based on character counts plus tags,
there can be an economic impetus to just pick a valid tag and get on
with it.
It is amusing to watch schema designers lulled into a false sense of
accomplishment as requests for changes to their schemas dwindle.
More often than not, it is not that the tags are perfect and
comprehensive, but that the tag users have found ways around them.
Deluded engineers think all is well because the documents pass the
validating XML parser.
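To make the gap concrete, here is a minimal sketch in Python of why a passing parser proves so little. It is an illustration under stated assumptions, not a real validation pipeline: the standard-library ElementTree parser only checks well-formedness, so the hypothetical ALLOWED_IN_P set stands in for a schema's content model, and the function names and the battery word list are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model for <p>, mirroring the tag list offered by the editor.
ALLOWED_IN_P = {"ReplaceableBattery", "AmphibiousLandingCraft", "Table", "Emphasis"}

HONEST = "<p>Insert the widget <Emphasis>very carefully</Emphasis>.</p>"
ABUSED = ("<p>Insert the widget "
          "<ReplaceableBattery>very carefully</ReplaceableBattery>.</p>")


def structurally_valid(xml_text):
    """True if the text is well-formed and every child of <p> is an allowed tag.

    This is a stand-in for schema validation: it checks structure, not meaning.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "p" and all(child.tag in ALLOWED_IN_P for child in root)


def suspicious_batteries(xml_text, vocabulary=("battery", "cell", "volt")):
    """Return the text of ReplaceableBattery elements that mention no battery word.

    A crude, assumed heuristic: real semantic checks need domain knowledge.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    for elem in root.iter("ReplaceableBattery"):
        text = "".join(elem.itertext()).strip()
        if not any(word in text.lower() for word in vocabulary):
            flagged.append(text)
    return flagged


if __name__ == "__main__":
    for name, doc in (("honest", HONEST), ("abused", ABUSED)):
        print(name, structurally_valid(doc), suspicious_batteries(doc))
```

Both documents pass the structural check. Only a separate look at the content, however crude, notices that "very carefully" is an odd thing to find inside a battery.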
Oftentimes, if there is a doubt as to the correctness of a tag in a
particular context, then the best thing to do is not enter a tag at all.
Unfortunately, we humans typically prefer positive action. We like to
pick a tag; any old tag will do, because any tag is better than none. It's in
our nature, which is unfortunate. With apologies to Wittgenstein, that
which we cannot tag, we should pass over in silence.
NOTES
[1] SDATA Society for the Definitive Abolition of Tag Abuse:
http://www.ucc.ie/sdata/
[2] http://pespmc1.vub.ac.be/ASC/PRINCI_EFFOR.html