ITworld.com
Make Up a Markup

XML IN PRACTICE --- 01/25/2001



Mark Johnson

While a well-formed document is well-formed because it follows rules defined by the XML spec, a valid document is valid because it matches its document type definition (DTD). The DTD is the grammar for a markup language, defined by the designer of the markup language. The DTD specifies what elements may exist, what attributes the elements may have, what elements may or must be found inside other elements, and in what order.

Nonvalidating parsers read the XML and, if it's well-formed, give you back the document structure as a tree of objects. We'll discuss the document structure you get from a parser in the section below entitled "The Document Object Model." If the document is well-formed but the elements are nonsensical (as was the case with the two <Qty> elements in the <Ingredient> above), that's your problem.
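A document of the kind described here might look like the following sketch. The quantities and the item are invented for illustration. Every tag is closed and properly nested, so a nonvalidating parser accepts it without complaint, even though an ingredient with two quantities makes little sense.

```xml
<?xml version="1.0"?>
<!-- Well-formed: every tag is closed and properly nested. -->
<Ingredient>
  <Qty unit="cup">1</Qty>
  <Qty unit="tablespoon">2</Qty>  <!-- a second quantity: which one applies? -->
  <Item>Flour</Item>
</Ingredient>
```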

This is, in fact, how HTML browsers work. Generally, HTML parsers are nonvalidating. The various "HTML checking" parsers, which report syntax errors in HTML, are essentially validating HTML parsers (with additional functionality, like link checking).

Validating parsers read XML, verify that it's well-formed (just as nonvalidating parsers do), and then go on to determine whether the document's element tags are legal, whether the attribute names make sense, whether every element nested inside another element belongs there, and so on.

The DTD defines the document type. It accounts for the Extensible in XML. The DTD is how you actually define a new markup language -- what I often call a dialect of XML. DTDs currently are being written for an enormous number of different problem domains, and each DTD defines a new markup language. New markup languages now exist, or are being designed, to mark up the plays of Shakespeare; to define general data resources (RDF); to model information in the health care industry (HL7 SGML/XML); to typeset, display, and actively use mathematical equations (MathML); and to perform electronic data interchange (XML/EDI). There's even a proposal for a markup language for business data in the footwear industry (FDX). (No, I'm not joking.)

Central to each of these new languages is a DTD that describes what tags the markup language has, what those tags' attributes may be, and how they may be combined. A DTD specifies very clearly what information may or may not be included in a markup language. For instance, the DTD for HTML does not allow for markup tags to select paper size for printing.

Let's take a look at a DTD for a recipe XML. I'm going to call it JWSRML (JavaWorld Scary Recipe Markup Language). Apologies to anyone already using that acronym.

<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
               isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>

Listing 1. The DTD for JWSRML

The document type definition in Listing 1 defines a language for a validating parser to accept -- meaning, the parser will produce errors if the rules listed in the DTD aren't followed. To get a general idea of how a DTD works, let's look at what a few of the lines in this file mean.

  • <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)> The <!ELEMENT...> statement defines a tag in the document. This one defines a <Recipe> tag, stating that it contains a <Name>, followed by an optional <Description> (the question mark [?] denotes optionality), an optional <Ingredients> tag, and an optional <Instructions> tag, in that order.
  • <!ELEMENT Name (#PCDATA)> This simply states that a <Name> tag can contain character data and nothing else.
  • <!ATTLIST Item optional CDATA "0" isVegetarian CDATA "true"> This statement says that the <Item> tag has two possible attributes: optional, whose default value is 0; and isVegetarian, whose default value is true. Notice that attribute values aren't limited to numbers; they can be any text.
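To see these rules at work, consider the sketch below. The recipe data is invented for illustration. The first <Ingredient> follows Listing 1, and the second breaks it in two places that a validating parser should report.

```xml
<Recipe>
  <Name>Toast</Name>
  <Ingredients>
    <!-- Follows Listing 1: Qty comes before Item, and Qty carries the
         required unit attribute. Item omits both of its attributes, so a
         parser that reads the DTD reports the defaults: optional="0" and
         isVegetarian="true". -->
    <Ingredient>
      <Qty unit="slice">2</Qty>
      <Item>Bread</Item>
    </Ingredient>
    <!-- Breaks Listing 1 twice: Item comes before Qty, and Qty is missing
         its required unit attribute. -->
    <Ingredient>
      <Item optional="1">Butter</Item>
      <Qty>1</Qty>
    </Ingredient>
  </Ingredients>
</Recipe>
```

Both ingredients are well-formed, so only a validating parser would tell them apart.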

A DTD is associated with an XML document by way of a document type declaration, which appears at the top of the XML file (after the <?xml...?> line). The document type declaration may contain either an inline copy of the document type definition or a reference to it as a system filename or URI (uniform resource identifier). For example,

<!DOCTYPE Recipe SYSTEM "example.dtd">

tells the parser to start looking for a <Recipe> tag as the top-level tag of the document. It also declares that the DTD is in the system file example.dtd. There are other characters and notations in the DTD, but writing DTDs is a topic unto itself.
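For comparison, here is a sketch of the inline form, in which the declarations sit between square brackets inside the document type declaration itself. The recipe data is invented, and the <Step> declaration is an assumption: Listing 1 refers to <Step> without declaring it, so one is added here to let the sample validate.

```xml
<?xml version="1.0"?>
<!DOCTYPE Recipe [
  <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Description (#PCDATA)>
  <!ELEMENT Ingredients (Ingredient)*>
  <!ELEMENT Ingredient (Qty, Item)>
  <!ELEMENT Qty (#PCDATA)>
  <!ATTLIST Qty unit CDATA #REQUIRED>
  <!ELEMENT Item (#PCDATA)>
  <!ATTLIST Item optional CDATA "0"
                 isVegetarian CDATA "true">
  <!ELEMENT Instructions (Step)+>
  <!-- Assumption: not part of Listing 1; added so Step is declared. -->
  <!ELEMENT Step (#PCDATA)>
]>
<Recipe>
  <Name>Toast</Name>
  <Description>Invented sample data.</Description>
  <Ingredients>
    <Ingredient>
      <Qty unit="slice">2</Qty>
      <Item>Bread</Item>
    </Ingredient>
  </Ingredients>
  <Instructions>
    <Step>Toast the bread.</Step>
  </Instructions>
</Recipe>
```

The inline form keeps the grammar and the data in one file, which is handy for small experiments. The SYSTEM form lets many documents share a single DTD.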

You now know a lot about how XML is structured and controlled, but you haven't heard what it's good for. Why are people so excited about this technology?

So, what good is made-up markup?
Here are some benefits of representing information in XML:

  • XML is at least as readable as HTML, and probably more so. Anyone who understands, more or less, what HTML is probably understands what the markup means, since the markup uses fairly intuitive terms (<Ingredient>, say, rather than <OBJECT CLASSID="000DDA23432...">).
  • The tags don't have anything to do with how the document is displayed. Listing 1 is pure content: It's information. The markup indicates what the information means, not how to display it. The formatting information for an XML file (if there is any need for formatting) is usually written in a style language and stored separately from the XML. (See the sections on CSS and XSL below for more on formatting XML.) Separation of content and presentation is a key concept inherited from SGML.
  • A lot of the programming is already done for you. If you write a DTD and use a validating parser, much of the error checking for the validity of your input is done by the parser. There's no need to write the parser yourself, since there are so many high-quality parsers available for free. If you want to change the language, you simply change the DTD; the parser then obeys your new rules. Moreover, if your system needs to interoperate with other systems, you can choose a standard DTD (like XML/EDI, for example), so that other systems will automatically understand your system's vocabulary, and vice versa.

CSS (Cascading Style Sheets) and XSL (the Extensible Stylesheet Language) are the style languages for XML mentioned above. Let's take a quick look at these two technologies.

What if there were a way to turn XML into a text file, a PostScript document, a photo-typesetting file, or input to a text-to-speech system for the visually impaired? Or what if the XML could somehow be transformed into HTML and displayed in a browser?

The members of the appropriate committees at the W3C have addressed these concerns with two specifications: CSS and XSL. While both are declarative languages (meaning that there are no instructions in the first-do-this, then-do-that sense), they serve different functions. CSS is a current W3C recommendation that can be used with HTML or XML; it is simpler to use and less powerful than XSL, and most current-generation browsers support it (to varying degrees). XSL is used exclusively to format XML or SGML and is more complex and powerful than CSS.
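One way an XML document can point at its stylesheet is the xml-stylesheet processing instruction defined by the W3C. The sketch below assumes a CSS file named recipe.css, which is hypothetical, and uses invented recipe data. Whether and how a given browser honors the instruction varies.

```xml
<?xml version="1.0"?>
<!-- recipe.css is a hypothetical file name used only for illustration. -->
<?xml-stylesheet type="text/css" href="recipe.css"?>
<Recipe>
  <Name>Toast</Name>
  <Description>Invented sample data.</Description>
</Recipe>
```

The document still contains only content. All of the display rules live in the separate stylesheet, so the same XML can be paired with a different stylesheet without being edited.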

Great strides have been made with XSL in the past year. While XSL is still just a "working draft" (meaning its design isn't yet complete), you can experiment today with working implementations of the draft. Just this month (March 18, 1999), Microsoft released Internet Explorer 5.0, which includes support for part of the XSL specification. And Mozilla (the open source project based on the Netscape source code) can display XML using CSS. At the XTech '99 conference in San Jose, CA, in early March, Sun Microsystems "pre-announced" a request for proposals (for a grant) and a contest relating to the implementation of an XSL batch-processor and the addition of full XSL to Mozilla.

Again, the purpose of creating these new standards is to make most things very simple for most people, just as HTML has made hypertext and structured documents attainable to your grandma (or your nine-year-old).

 

Mark Johnson is president of Elucify Technical Communications, a Colorado-based training and consulting company dedicated to clarifying novel or complex ideas through clear explanation and examples.


Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.