Oscar Wilde famously quipped that a cynic is someone who knows the
cost of everything but the value of nothing. If this is true (and if I
have my inference logic the right way round), then it can be concluded
that all schema writers are idealists. They seem to know the value of
everything (how to tag it) and the cost of nothing (money consumed by
tags).
Do you sense a short fuse burning? A sense that the writer is not at
peace with the universe in general and XML schema writers in
particular? You are darned right my fuse is burning. For weeks now, I
have been cutting code to do some XML data processing that would have
been much easier if only the schema creator had shown some
understanding of the true cost of XML tagging.
"Ah", I hear you say, "You are being selfish again. You want all the
XML to be perfect for your specific needs. Typical becubicled,
besandled, bearded, belligerent, bellicose, back-end system development
type."
Well, yes, of course I would like my life made as easy as possible, but
hear this: Where do you think all the money goes on XML markup? In the
Total Cost of Ownership calculations to do with XML tags, where are the
real costs?
Did you say, "Data modeling/schema creation"?
Bzzzzzt. Thank you for playing, but that is not the correct answer.
The bulk of the costs accumulate in data processing, and they are
attributable to each and every lowly XML tag. Every time a developer
writes an XPath expression, codes a SAX handler, or weaves a DOM
NodeList, he or she is
contributing to the XML tags' cost of ownership. Every time a developer
backs off from cutting code because of the sheer complexity of the XML
structure being manipulated, you are accumulating costs.
Where does the complexity come from? How complicated can processing a
few cuddly little tags peppered with some good old PCDATA be?
Complexity Comes from the Schemas
A schema (be it DTD, XSD, or RNG) makes saying, "At this point, A or B
or C or D can occur zero or more times", very easy. Sounds harmless,
but look at it from a programmer's perspective. At this point in the
document structure, elements A to D can occur any number of times in
any order. Let's keep things simple by restricting the elements to
having two values. That means that my "business logic" has to deal with
16 main branches at this point in the structure (two choices for each
of four elements, so 2 x 2 x 2 x 2). Now add in another element E with
the same two possible values. The main branches in my business logic
now number 32.
Simply put, my software's so-called "state space" grows exponentially
with the number of things that can occur at a given point in a document
structure. The number of possible occurrences at any given point in the
structure is directly related to the schema.
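To make the doubling concrete, here is a minimal sketch in Python. It assumes the simplest reading of the scenario: each element appears exactly once and carries one of two values, with ordering ignored. Allowing repeats and arbitrary order only makes the picture worse. The function names and the placeholder values "v1" and "v2" are invented for illustration.

```python
from itertools import product


def value_combinations(elements, values=("v1", "v2")):
    """List every way of assigning one value to each element.

    Assumption: each element occurs exactly once and order is ignored.
    """
    return [
        dict(zip(elements, combo))
        for combo in product(values, repeat=len(elements))
    ]


def branch_count(elements, values=("v1", "v2")):
    """Number of distinct cases the business logic would have to consider."""
    return len(value_combinations(elements, values))


before = branch_count("ABCD")   # elements A, B, C, D
after = branch_count("ABCDE")   # the same, plus element E
print(after // before)          # one extra two-valued element doubles the count
```

Every further two-valued element the schema permits at this point doubles the count again, which is what exponential growth looks like from the programmer's chair.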
Anything that grows exponentially is bad news for software development
(except of course caffeine levels). Introducing non-terminals into the
schema can significantly reduce state space explosion. By non-
terminals, I mean nodes that serve to scaffold sub-structures rather
than to carry their own semantic meaning.
An example common in many document-centric DTDs is "list". What is a
list but a holder of list elements? Most developers would hate to see
the holding list element go from the schema, as it gives them a hook to
cleanly snag where a list starts and ends. When manipulating XML as a
tree structure, it gives developers a single point to treat as a
holdall for the list.
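The difference the holding element makes can be sketched with Python's standard xml.etree.ElementTree module. The two sample documents and the function names below are invented for illustration; they are not taken from any real DTD. The first function relies on the "list" wrapper, the second has to reconstruct list boundaries from runs of adjacent "item" elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: the same content with and without the wrapper.
WRAPPED = (
    "<doc><p>Intro</p><list><item>a</item><item>b</item></list>"
    "<p>Middle</p><list><item>c</item></list></doc>"
)
FLAT = (
    "<doc><p>Intro</p><item>a</item><item>b</item>"
    "<p>Middle</p><item>c</item></doc>"
)


def lists_from_wrapped(xml_text):
    """With a wrapper, each list is a single node that can be grabbed whole."""
    root = ET.fromstring(xml_text)
    return [
        [item.text for item in lst.findall("item")]
        for lst in root.iter("list")
    ]


def lists_from_flat(xml_text):
    """Without a wrapper, list boundaries must be inferred from sibling runs."""
    root = ET.fromstring(xml_text)
    lists = []
    current = None  # the list being built, or None when outside a run of items
    for child in root:
        if child.tag == "item":
            if current is None:
                current = []
                lists.append(current)
            current.append(child.text)
        else:
            current = None  # any other element ends the current run
    return lists


print(lists_from_wrapped(WRAPPED))
print(lists_from_flat(FLAT))
```

Both functions recover the same two lists here, but the flat version carries its own little state machine, and it has no way to tell two lists apart when one follows directly after the other.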
In data-centric XML, the "record" tag is an example of a non-terminal
element. What is a record but a sequence of repeating fields? If the
repetition can be worked out automatically, then isn't the record tag
superfluous? Again, developers would hate to see the record tag go.
So, intermediate "non-terminal" tags are good and tags in general are
bad (i.e., the fewer tags the better for everyone). Think of tags as
global variables. The professor would tell you to minimize them because
of the thorny logic and spaghetti code they foster. I look at SAX XML
processing code and what do I see? Spaghetti!
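For a taste of the problem, here is a deliberately small SAX handler written against Python's standard xml.sax module. The document vocabulary ("book", "chapter", "title") and the names ChapterTitleHandler and chapter_titles are hypothetical, chosen only for illustration. All it does is collect the text of title elements that occur inside a chapter, yet it already needs two flags and a text buffer that every callback reads or writes.

```python
import xml.sax


class ChapterTitleHandler(xml.sax.ContentHandler):
    """Collect the text of <title> elements that occur inside a <chapter>."""

    def __init__(self):
        super().__init__()
        # Shared mutable state that every callback below depends on.
        self.in_chapter = False
        self.in_title = False
        self._text = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "chapter":
            self.in_chapter = True
        elif name == "title" and self.in_chapter:
            self.in_title = True
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered.
        if self.in_title:
            self._text.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.titles.append("".join(self._text))
            self.in_title = False
        elif name == "chapter":
            self.in_chapter = False


def chapter_titles(xml_bytes):
    """Parse a document held in memory and return its chapter titles."""
    handler = ChapterTitleHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles
```

Each additional tag the code has to care about tends to bring another flag and another branch in all three callbacks, which is how the spaghetti gets cooked.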
In XML land, not only are the equivalents of "global variables" created
with wild abandon, but their creators often see fit to invoice based on
the number they create for you. An unfortunate schism exists in XML
software development between the team that develops the schema and the
team processing the XML that conforms to the schema. Too often, these
are not the same team.
This state of affairs does not encourage schema design that takes into
account the needs of the software development phase, which, after all,
is where most of the money will be spent.
What Can Be Done?
I remember an example of lateral thinking in a book by Edward De Bono.
A company suspected of water pollution applied for permission to draw
fresh water from the same stream it was pumping effluent into.
Permission was granted on condition that the water intake occur
downstream from the effluent discharge.
Stretching the analogy to its breaking point, those who wish to create
schemas must work at a software development level with the XML they
themselves have modeled. That will teach them the real cost of XML
tagging. Then they will be grumpy too. Harrumph!