Mark Johnson
While a well-formed document is well-formed because it follows rules
defined by the XML spec, a valid document is valid because it matches
its document type definition (DTD). The DTD is the grammar for a markup
language, defined by the designer of the markup language. The DTD
specifies what elements may exist, what attributes the elements may
have, what elements may or must be found inside other elements, and in
what order.
Nonvalidating parsers read the XML and, if it's well-formed, give you
back the document structure as a tree of objects. We'll discuss the
document structure you get from a parser in the section below
entitled "The Document Object Model." If the document is well-formed
but the elements are nonsensical (as was the case with the two <Qty>
elements in the <Ingredient> above), that's your problem.
This is, in fact, how HTML browsers work. Generally, HTML parsers are
nonvalidating. The various "HTML checking" parsers, which report syntax
errors in HTML, are essentially validating HTML parsers (with
additional functionality, like link checking).
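To make the nonvalidating behavior concrete, here is a minimal sketch using the JAXP parser that ships with the JDK. The class name, method names, and recipe text are all made up for illustration. It shows that a well-formed recipe with two <Qty> elements is accepted without complaint, while a document with an unclosed tag is rejected.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the recipe content is made up for illustration.
class NonvalidatingDemo {

    // Well-formed but nonsensical: a single <Ingredient> with two <Qty> elements.
    static final String RECIPE =
          "<Recipe>"
        + "<Name>Example</Name>"
        + "<Ingredients><Ingredient>"
        + "<Qty unit=\"cup\">1</Qty>"
        + "<Qty unit=\"cup\">2</Qty>"
        + "<Item>Sugar</Item>"
        + "</Ingredient></Ingredients>"
        + "</Recipe>";

    // Parses without validation and returns the document tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // already the default; stated for emphasis
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // True if the parser accepts the text, that is, if it is well-formed.
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (Exception e) {
            return false; // the parser may also log the problem to stderr
        }
    }

    // Counts <Qty> elements in the parsed tree; -1 if the text is not well-formed.
    static int countQty(String xml) {
        try {
            return parse(xml).getElementsByTagName("Qty").getLength();
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("well-formed: " + isWellFormed(RECIPE));
        System.out.println("Qty elements: " + countQty(RECIPE));
        System.out.println("unclosed tag well-formed: "
            + isWellFormed("<Recipe><Name>Oops</Recipe>"));
    }
}
```

The parser hands back both <Qty> elements in the tree; deciding that two quantities make no sense is left to the application.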
Validating parsers read XML, verify that it's well-formed (just as
nonvalidating parsers do), and then go on to determine whether the
document's element tags are legal, whether the attribute names make
sense, whether every element nested inside another element belongs
there, and so on.
The DTD defines the document type. It accounts for the Extensible in
XML. The DTD is how you actually define a new markup language -- what I
often call a dialect of XML. DTDs currently are being written for an
enormous number of different problem domains, and each DTD defines a
new markup language. New markup languages now exist, or are being
designed, to mark up the plays of Shakespeare; to define general data
resources (RDF); to model information in the health care industry (HL7
SGML/XML); to typeset, display, and actively use mathematical equations
(MathML); and to perform electronic data interchange (XML/EDI). There's
even a proposal for a markup language for business data in the footwear
industry (FDX). (No, I'm not joking.)
Central to each of these new languages is a DTD that describes what
tags the markup language has, what those tags' attributes may be, and
how they may be combined. A DTD specifies very clearly what information
may or may not be included in a markup language. For instance, the DTD
for HTML does not allow for markup tags to select paper size for
printing.
Let's take a look at a DTD for a recipe XML. I'm going to call it
JWSRML (JavaWorld Scary Recipe Markup Language). Apologies to anyone
already using that acronym.
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
<!-- Assumed declaration: Step is referenced above but not declared -->
<!ELEMENT Step (#PCDATA)>
Listing 1. The DTD for JWSRML
The document type definition in Listing 1 defines a language for a
validating parser to accept -- meaning, the parser will produce errors
if the rules listed in the DTD aren't followed. To get a general idea
of how a DTD works, let's look at what a few of the lines in this file
mean.
- <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
The <!ELEMENT...> statement defines a tag in the document. This
statement defines a <Recipe> tag, stating that it must contain a <Name>,
followed by an optional <Description> (the question mark [?] denotes
optionality), an optional <Ingredients> tag, and an optional
<Instructions> tag, in that order.
- <!ELEMENT Name (#PCDATA)>
This simply states that a <Name> tag can contain character data
and nothing else.
- <!ATTLIST Item optional CDATA "0" isVegetarian CDATA "true">
This section states that the <Item> tag has two possible
attributes: optional, whose default value is 0; and isVegetarian,
whose default value is true. Notice that attribute values aren't
limited to numbers; they can be any text.
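The attribute defaults described above can be seen in action with a short sketch. It assumes the JDK's built-in JAXP parser and carries only the two Item declarations from Listing 1 in an inline DTD; the class name, the helper method, and the sample item are hypothetical. The document sets isVegetarian explicitly and leaves optional out, so the parser should supply the declared default for it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class showing DTD attribute defaults.
class AttributeDefaultsDemo {

    // The Item declarations from Listing 1, followed by one sample <Item>.
    static final String XML =
          "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE Item ["
        + "<!ELEMENT Item (#PCDATA)>"
        + "<!ATTLIST Item optional CDATA \"0\" isVegetarian CDATA \"true\">"
        + "]>"
        + "<Item isVegetarian=\"false\">Bacon</Item>";

    // Returns the value of the named attribute on the <Item> element.
    static String attribute(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Element item = doc.getDocumentElement();
            return item.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "optional" is absent from the document, so the DTD default applies.
        System.out.println("optional = " + attribute("optional"));
        // "isVegetarian" is given explicitly, so the default is overridden.
        System.out.println("isVegetarian = " + attribute("isVegetarian"));
    }
}
```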
A DTD is associated with an XML document by way of a document type
declaration, which appears at the top of the XML file (after the
<?xml...?> line). The document type declaration may either contain an
inline copy of the document type definition or refer to that definition
by a system filename or URI (Uniform Resource Identifier). For
example,
<!DOCTYPE Recipe SYSTEM "example.dtd">
tells the parser to start looking for a <Recipe> tag as the top-level
tag of the document. It also declares that the DTD is in the system
file example.dtd. There are other characters and notations in the DTD,
but writing DTDs is a topic unto itself.
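Here is a rough sketch of a validating parse, again assuming the JDK's JAXP parser. To stay self-contained it uses the inline form of the document type declaration rather than a separate example.dtd file. The Step declaration is my assumption (Listing 1 refers to Step without declaring it), and the class name, method name, and sample recipes are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: validates recipe text against an inline copy of the DTD.
class ValidatingDemo {

    // Listing 1 embedded as an internal DTD subset.
    static final String DOCTYPE =
          "<!DOCTYPE Recipe ["
        + "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>"
        + "<!ELEMENT Name (#PCDATA)>"
        + "<!ELEMENT Description (#PCDATA)>"
        + "<!ELEMENT Ingredients (Ingredient)*>"
        + "<!ELEMENT Ingredient (Qty, Item)>"
        + "<!ELEMENT Qty (#PCDATA)>"
        + "<!ATTLIST Qty unit CDATA #REQUIRED>"
        + "<!ELEMENT Item (#PCDATA)>"
        + "<!ATTLIST Item optional CDATA \"0\" isVegetarian CDATA \"true\">"
        + "<!ELEMENT Instructions (Step)+>"
        + "<!ELEMENT Step (#PCDATA)>" // assumed; Listing 1 does not declare Step
        + "]>";

    // Returns the validity problems the parser reports; empty means valid.
    static List<String> validate(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored here */ }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            problems.add(e.getMessage()); // not well-formed, or some other failure
        }
        return problems;
    }

    public static void main(String[] args) {
        String valid = "<Recipe><Name>Example</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"cup\">1</Qty><Item>Sugar</Item>"
            + "</Ingredient></Ingredients></Recipe>";
        String twoQty = "<Recipe><Name>Example</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"cup\">1</Qty><Qty unit=\"cup\">2</Qty><Item>Sugar</Item>"
            + "</Ingredient></Ingredients></Recipe>";
        System.out.println("valid recipe problems: " + validate(valid));
        System.out.println("two-Qty recipe problems: " + validate(twoQty));
    }
}
```

Validity errors are recoverable as far as the parser is concerned, which is why the sketch collects them in an error handler rather than relying on an exception.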
You now know a lot about how XML is structured and controlled, but you
haven't heard what it's good for. Why are people so excited about this
technology?
So, what good is made-up markup?
Here are some benefits of representing information in XML:
- XML is at least as readable as HTML and probably more so
Anyone who understands, more or less, what HTML is probably
understands what the markup means, since the markup
uses fairly intuitive terms (compare <Ingredient> with <OBJECT
CLASSID="000DDA23432...">).
- The tags don't have anything to do with how the document is displayed
Listing 1 is pure content: It's information. The markup
indicates what the information means, not how to display it. The
formatting information for an XML file (if there is any need for
formatting) is usually written in a style language and stored
separately from the XML. (See the sections on CSS and XSL below
for more on formatting XML.) Separation of content and
presentation is a key concept inherited from SGML.
- A lot of the programming is already done for you
If you write a DTD and use a validating parser, much of the error
checking for the validity of your input is done by the parser.
There's no need to write the parser yourself, since there are so
many high-quality parsers available for free. If you want to
change the language, you simply change the DTD; the parser then
obeys your new rules. Moreover, if your system needs to
interoperate with other systems, you can choose a standard DTD
(like XML/EDI, for example), so that other systems will
automatically understand your system's vocabulary, and vice versa.
What if there were a way to turn XML into a text file, a PostScript
document, a photo-typesetting file, or input to a text-to-speech system
for the visually impaired? Or what if the XML could somehow be
transformed into HTML and displayed in a browser?
In fact, CSS (Cascading Style Sheets) and XSL (the Extensible
Stylesheet Language) do precisely that: They're the style languages for
XML. Let's take a quick look at these two technologies.
The members of the appropriate committees at the W3C have addressed
these concerns with two specifications: CSS and XSL. While both are
declarative languages (meaning that there are no instructions in the
first-do-this, then-do-that sense), they serve different functions. CSS
is a current W3C recommendation, usable with HTML or XML; it is simpler
to use and less powerful than XSL, and most current-generation browsers
support it (to varying degrees). XSL is used exclusively to format XML
or SGML and is more complex and more powerful than CSS.
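As a rough illustration of transforming XML into HTML, the sketch below uses the javax.xml.transform API in the JDK. Note the assumption: the stylesheet is written in the XSLT 1.0 syntax that the W3C finalized after this article's working draft, so it may differ from what the draft implementations mentioned here accept. The class name, method name, stylesheet, and recipe are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: turns a recipe into a small HTML page.
class RecipeToHtml {

    // Declarative rules (XSLT 1.0 syntax, an assumption): a heading for the
    // recipe name and one list item per ingredient.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/Recipe\">"
        + "<html><body><h1><xsl:value-of select=\"Name\"/></h1>"
        + "<ul><xsl:for-each select=\"Ingredients/Ingredient\">"
        + "<li><xsl:value-of select=\"Qty\"/><xsl:text> </xsl:text>"
        + "<xsl:value-of select=\"Qty/@unit\"/><xsl:text> </xsl:text>"
        + "<xsl:value-of select=\"Item\"/></li>"
        + "</xsl:for-each></ul></body></html>"
        + "</xsl:template></xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the HTML.
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String recipe = "<Recipe><Name>Example</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"cup\">1</Qty><Item>Sugar</Item>"
            + "</Ingredient></Ingredients></Recipe>";
        System.out.println(toHtml(recipe));
    }
}
```

The recipe itself never mentions headings or lists; all of the presentation lives in the stylesheet, which is the separation of content and presentation at work.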
Great strides have been made with XSL in the past year. While XSL is
still just a "working draft" (meaning its design isn't yet complete),
you can experiment today with working implementations of the draft.
Just this month (March 18, 1999), Microsoft released Internet Explorer
5.0, which includes support for part of the XSL specification. And
Mozilla (the open source project based on the Netscape source code) can
display XML using CSS. At the XTech '99 conference in San Jose, CA, in
early March, Sun Microsystems "pre-announced" a request for proposals
(for a grant) and a contest relating to the implementation of an XSL
batch-processor and the addition of full XSL to Mozilla.
Again, the purpose of creating these new standards is to make most
things very simple for most people, just as HTML has made hypertext
and structured documents accessible to your grandma (or your
nine-year-old).