Before I say anything else, please notice that the word "schemas" in the
title of this column has a lowercase "s". The schema cheat techniques
that follow are equally applicable to all the schema languages out
there, including DTDs, RELAX NG, and W3C XML Schema.
Also, by way of important preamble, I should point out that creating a
schema for a corpus of information is a "No Pain, No Gain" process -- it
has to hurt to be of any use. Cheating will only cause you more pain in
the end. Having said all that, if you really want to avoid (actually,
defer) the pain, here are two excellent, time-honored techniques for
doing so:
1. Model the hard bits as attributes
2. Make all elements optional
Model the Hard Bits as Attributes
Schema languages, by and large, focus on providing mechanisms for
expressing constraints on the order of XML elements. For example,
expressing the constraint, "If there is an element X there must be an
element Y immediately after it", is possible in all three common schema
languages. It is also possible to express the constraint, "An element A
must be followed by either one or more B elements, or one or more C
elements", and so on.
With attributes, the expressible constraints are much simpler, often no
more than mandatory/optional occurrence constraints. This limitation on
attribute constraints is the basis for this first cheat.
Let's say you are faced with modeling a collection of customers and a
collection of partners that have complex inter-relationships known as
"deals". You could create an element "customer" and an element
"partner". Then you can express the order that mixtures of the two can
occur in valid deal elements. You might end up with something like this
in your schema (using pseudo DTD syntax):
deal = (partner,customer) | (partner,partner) |
(partner,customer,customer?) | (customer,partner,customer)*
This says that a deal involves either a partner/customer pair, two
partners, a partner/customer pair with an optional second customer, or a
series of triples containing two customers and a partner.
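To see what a model like this buys you, here is a minimal sketch in Python rather than in any real schema language. It assumes each child of a deal can be written as one letter (p for partner, c for customer), so the content model becomes an ordinary regular expression. The names STRICT_DEAL, child_codes and is_valid_strict are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# One letter per child element: p = partner, c = customer.
# The four alternatives mirror the pseudo-DTD deal model above.
STRICT_DEAL = re.compile(r"pc|pp|pcc?|(?:cpc)*")

def child_codes(deal_xml):
    """Return the children of a <deal> as a string such as 'pcc'."""
    codes = {"partner": "p", "customer": "c"}
    root = ET.fromstring(deal_xml)
    # Any unexpected element becomes '?', which the pattern never matches.
    return "".join(codes.get(child.tag, "?") for child in root)

def is_valid_strict(deal_xml):
    """True if the deal's children match the whole content model."""
    return STRICT_DEAL.fullmatch(child_codes(deal_xml)) is not None
```

With this sketch, a partner followed by a customer is accepted and a deal of two customers is rejected. Note that the last alternative ends in `*`, so, read literally, the model also admits an empty deal.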
Now, using the magic of attributes, you can make the complexity of this
model go away. We can think of partners and customers as "actors" in a
deal. In this way of thinking, partners and customers become special
cases of a more general purpose thing we are calling an "actor". We can
model an actor as a thing that has a "type" attribute that can be one of
"partner" or "customer", i.e.
<actor type = "partner"> / <actor type = "customer">
And now the schema model for a deal looks like this:
deal = actor+
There! Isn't that a lot simpler?
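For the curious, here is roughly everything the actor model can check, as a small Python sketch. The function name is_valid_actor_model is my own invention, and the sketch assumes the type attribute is mandatory and limited to the two values given above.

```python
import xml.etree.ElementTree as ET

def is_valid_actor_model(deal_xml):
    """Check deal = actor+, each actor typed as partner or customer."""
    root = ET.fromstring(deal_xml)
    children = list(root)
    # At least one child, every child an <actor> with an allowed type.
    return len(children) >= 1 and all(
        child.tag == "actor" and child.get("type") in ("partner", "customer")
        for child in children
    )
```

Nothing in the check looks at the order of the actors or at how many of each type there are. A deal made of three customers and no partner passes just as easily as a partner/customer pair.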
Make All Elements Optional
To illustrate this cheat, we will use the same example as before, i.e.
modeling potentially complex combinations of partners and customers
making up deals.
This time, we keep partners and customers in separate tags rather than
using one element with a type attribute to distinguish them. So, unlike
in the last example where we used:
<actor type = "partner"> / <actor type = "customer">
we will use:
<partner> / <customer>
Now, using the trivial observation that all deals consist of either
partners or customers, we can model any complex deal like this:
deal = (partner|customer)+
This says that a deal consists of one or more partners or customers.
There! Isn't that a lot simpler?
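Written as a regular expression over one-letter codes (p for partner, c for customer, an encoding I am assuming purely for illustration), the whole model fits in a single character class. LOOSE_DEAL and is_valid_loose are hypothetical names.

```python
import re

# deal = (partner|customer)+ with p = partner, c = customer.
LOOSE_DEAL = re.compile(r"[pc]+")

def is_valid_loose(codes):
    """True for any non-empty run of partners and customers."""
    return LOOSE_DEAL.fullmatch(codes) is not None
```

The only deal this sketch turns away is an empty one. Any other sequence of partners and customers is accepted, in any order and of any length.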
Yes and no, for both cheats. Yes, in both cases the cheat models work
fine. Your boss (or you) can be thrilled with the ease with which all
conceivable combinations of partners and customers can be catered for in
the model of a deal in the XML schema. Any scenario you dream up can be
captured in XML form and validated to be 100% XML and schema language
compliant. Great!
In reality, what has happened is that the burden of providing useful,
meaningful validation of the structure of your information has simply
moved. It has moved further along the workflow. It has moved over to the
programmers creating Java programs, .NET web services, and XSLT
stylesheets to process the data.
Such a shift of data validation into code is almost always a bad idea.
In both cheats presented here, you end up with schemas that do not tell
you very much about the real structure of your data. Consequently, over
and over again, the programs that process the data must check the
constraints that the schemas do not check. The result is that validation
ends up buried inside programs. Worse, it ends up being duplicated and
buried inside programs. Over and over again.
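Here is a sketch of what that tends to look like, in Python for brevity. Everything in it is hypothetical: bill_deal and report_deal stand in for two separate programs that consume deals, and the pattern is the original deal model re-expressed over one-letter codes (p for partner, c for customer).

```python
import re
import xml.etree.ElementTree as ET

CODES = {"partner": "p", "customer": "c"}  # one letter per child element

def bill_deal(deal_xml):
    """Hypothetical billing step; returns the number of customers to invoice."""
    deal = ET.fromstring(deal_xml)
    shape = "".join(CODES.get(child.tag, "?") for child in deal)
    # The rule the schema should have enforced, re-implemented here.
    if not re.fullmatch(r"pc|pp|pcc?|(?:cpc)*", shape):
        raise ValueError("unsupported mix of partners and customers")
    return shape.count("c")

def report_deal(deal_xml):
    """Hypothetical reporting step; returns the actors in document order."""
    deal = ET.fromstring(deal_xml)
    shape = "".join(CODES.get(child.tag, "?") for child in deal)
    # The same rule again, copied into a second program.
    if not re.fullmatch(r"pc|pp|pcc?|(?:cpc)*", shape):
        raise ValueError("unsupported mix of partners and customers")
    return [child.tag for child in deal]
```

Both functions carry their own copy of the constraint. When the business changes the rule, every copy has to be found and changed, and nothing guarantees that they stay in step.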
If this happens, the real pain will occur in your wallet. You will end
up paying over and over again to enforce the same constraints in
multiple places in your systems. Then you will pay more to have them
changed when the business requires the addition/modification of the
constraints. All as a direct result of making the schema designs
simpler.
No pain, no gain.