XML for the absolute beginner
Summary: In just a few short years, the World Wide Web and HTML have taken the world by storm. But HTML's limitations and the ever-increasing demand for more flexibility in Internet systems have XML, the Extensible Markup Language, brewing on the horizon. Further, Java applications that move data around need a data representation format as portable as Java itself. Developers who learn XML now will find it a powerful tool for data representation, storage, modelling, and interoperation.
Mark Johnson steps away from his popular JavaBeans column this month to introduce you to the world of XML: where it came from, why it's necessary, how it interoperates with existing Internet technology, and how to use it in your designs. You'll learn about Cascading Style Sheets and XSL, then follow up with a look at the XML and Java technology base at a promising Internet startup, with comments from that company's CEO and technical lead. By the time you've finished reading Mark's article, you'll understand why so many people are paying so much attention to this new data representation standard.
HTML and the World Wide Web are everywhere. As an example of their
ubiquity, I'm going to Central America for Easter this year, and if I
want to, I'll be able to surf the Web, read my e-mail, and even do
online banking from Internet cafés in Antigua Guatemala and
Belize City. (I don't intend to, however, since doing so would take
time away from a date I have with a palm tree and a rum-filled coconut.)
And yet, despite the omnipresence and popularity of HTML, it is
severely limited in what it can do. It's fine for disseminating
informal documents, but HTML is now being used to do things it was
never designed for. Trying to design heavy-duty, flexible,
interoperable data systems from HTML is like trying to build an
aircraft carrier with hacksaws and soldering irons: the tools
(HTML and HTTP) just aren't up to the job.
The good news is that many of the limitations of HTML have been
overcome in XML, the Extensible Markup Language. XML is easily
comprehensible to anyone who understands HTML, but it is much more
powerful. More than just a markup language, XML is a
metalanguage -- a language used to define new markup
languages. With XML, you can create a language crafted specifically for
your application or domain.
XML will complement, rather than replace, HTML. Whereas HTML is used
for formatting and displaying data, XML represents the contextual
meaning of the data.
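To make the idea of a language crafted for your own domain a little more concrete, here is a minimal sketch. The trip vocabulary (the trip and destination tags and the country attribute) is invented purely for illustration, and the code assumes the DOM parser that ships with current JDKs (javax.xml.parsers), which is not necessarily the tooling used elsewhere in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CustomMarkupSketch {

    // Hypothetical vocabulary made up for this sketch: the tag names say
    // what the data is, not how it should look on screen.
    static final String TRIP_XML =
        "<trip season=\"Easter\">"
        + "<destination country=\"Guatemala\">Antigua Guatemala</destination>"
        + "<destination country=\"Belize\">Belize City</destination>"
        + "</trip>";

    // Parses an XML string into a DOM tree using the JDK's built-in parser.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Collects every destination as "place (country)" by asking for the
    // elements by name rather than by position or formatting.
    static List<String> destinations(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("destination");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element destination = (Element) nodes.item(i);
            result.add(destination.getTextContent()
                + " (" + destination.getAttribute("country") + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String destination : destinations(TRIP_XML)) {
            System.out.println(destination);
        }
    }
}
```

Because the tags name the data rather than its appearance, the program can ask for destinations directly. How those destinations are eventually displayed is a separate decision.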
This article will present the history of markup languages and how XML
came to be. We'll look at sample data in HTML and move gradually
into XML, demonstrating why it provides a superior way to represent
data. We'll explore the reasons you might need to invent a
custom markup language, and I'll teach you how to do it.
We'll cover the basics of XML notation, and how to
display XML with two different sorts of style languages. Then, we'll
dive into the Document Object Model, a powerful tool for manipulating
documents as objects (or manipulating object structures as documents,
depending upon how you look at it). We'll go over how to write Java
programs that extract information from XML documents, with a pointer to
a free program useful for experimenting with these new concepts.
Finally, we'll take a look at an Internet company that's basing its
core technology strategy on XML and Java.
Is XML for you?
Though this article is written for anyone interested in XML, it has a
special relationship to the JavaWorld series on XML
JavaBeans. (See Resources for links to related articles.) If you've been reading that series and aren't quite "getting it," this article should clarify how to use XML with beans. If you are getting it, this article serves as the perfect companion piece to the XML JavaBeans series, since it covers topics untouched therein.
And, if you're one of the lucky few who still have the XML JavaBeans
articles to look forward to, I recommend that you read the present
article first as introductory material.
A note about Java
There's so much recent XML activity in the computer world that even an
article of this length can only skim the surface. Still, the whole
point of this article is to give you the context you need to use XML in
your Java program designs. This article also covers how XML operates
with existing Web technology, since many Java programmers work in such
an environment.
XML opens the Internet and Java programming to portable, nonbrowser
functionality. XML frees Internet content from the browser in much the
same way Java frees program behavior from the platform. XML makes
Internet content available to real applications.
Java is an excellent platform for using XML, and XML is an outstanding
data representation for Java applications. I'll point out some of
Java's strengths with XML as we go along.
Let's begin with a history lesson.
The origins of markup languages
The HTML we all know and love (well, that we know, anyway) was
originally designed by Tim Berners-Lee at CERN (le Conseil
Européen pour la Recherche Nucléaire, or the
European Laboratory for Particle Physics) in Geneva to allow physics
nerds (and even non-nerds) to communicate with each other. HTML was
released in December 1990 within CERN, and became publicly available in
the summer of 1991 for the rest of us. CERN and Berners-Lee gave away
the specifications for HTML, HTTP, and URLs, in the fine old tradition
of Internet share-and-enjoy.
Berners-Lee defined HTML in SGML, the Standard Generalized Markup
Language. SGML, like XML, is a metalanguage -- a language used for
defining other languages. Each so-defined language is called an
application of SGML. HTML is an application of SGML.
SGML emerged from research done primarily at IBM on text document
representation in the late '60s. IBM created GML ("Generalized Markup
Language"), a predecessor language to SGML, and in 1978 the
American National Standards Institute (ANSI) began work on what became
SGML. An industry standard based on that work was released in 1983, the
draft international standard followed in 1985, and the final standard,
ISO 8879, was published in 1986. Interestingly enough, the first SGML
standard was published
using an SGML system developed by Anders Berglund at CERN, the
organization that, as we have seen, gave us HTML and the Web.
SGML is widely used by governments and by large companies in industries
such as aerospace, automotive, and telecommunications. SGML is
used as a document standard at the United States Department of Defense
and the Internal Revenue Service. (For readers outside of the US, the
IRS are the tax guys.)
Albert Einstein said everything should be made as simple as possible,
and no simpler. The reason SGML isn't found in more places is that
it's extremely sophisticated and complex. And HTML, which you can find
everywhere, is very simple; for a lot of applications, it's too
simple.
HTML: All form and no substance
HTML is a language designed to "talk about" documents:
headings, titles, captions, fonts, and so on. It's heavily document
structure- and presentation-oriented.
Admittedly, artists and hackers have been able to work miracles with
the relatively dull tool called HTML. But HTML has serious drawbacks
that make it a poor fit for designing flexible, powerful, evolutionary
information systems. Here are a few of the major complaints:
- HTML isn't extensible
An extensible markup
language would allow application developers to define custom tags for
application-specific situations. Unless you're a 600-pound gorilla (and
maybe not even then) you can't require all browser manufacturers to
implement all the markup tags necessary for your application. So,
you're stuck with what the big browser makers, or the W3C (World Wide
Web Consortium) will let you have. What we need is a language that
allows us to make up our own markup tags without having to call the
browser manufacturer.
- HTML is display-centric
HTML is a fine language for display purposes, unless you require a lot
of precise formatting or transformation control (in which case it
stinks). HTML represents a mixture of document logical structure
(titles, paragraphs, and such) with presentation tags (bold, image
alignment, and so on). Since almost all of the HTML tags have
to do with how to display information in a browser, HTML is useless for
other common network applications -- like data replication or
application services. We need a way to unify these common functions
with display, so the same server used to browse data can also, for
example, perform enterprise business functions and interoperate with
other systems.
- HTML isn't directly reusable
Creating documents in word processors and then exporting them as HTML is
somewhat automated but still requires, at the very least, some tweaking
of the output in order to achieve acceptable results. If the data from
which the document was produced change, the entire HTML translation
needs to be redone. Web sites that show the current weather around the
globe, around the clock, usually handle this automatic reformatting
very well. The content and the presentation style of the document are
separated, because the system designers understand that their content
(the temperatures, forecasts, and so on) changes constantly.
What we need is a way to specify data presentation in terms of
structure, so that when data are updated, the formatting can be
"reapplied" consistently and easily.
- HTML provides only one view of data
It's difficult to write HTML that displays the same data in different ways
based on user requests. Dynamic HTML is a start, but it requires an
enormous amount of scripting and isn't a general solution to this
problem. (Dynamic HTML is discussed in more detail below.) What we need
is a way to get all the information we may want to browse at once, and
look at it in various ways on the client.
- HTML has little semantic structure
Web applications would benefit from an ability to represent data by
meaning rather than by layout. For example, it can be very difficult to
find what you're looking for on the Internet, because there's no
indication of the meaning of the data in HTML files (aside from META
tags, which are usually misleading). Type red into a search
engine, and you'll get links to Red Skelton, red herring, red snapper,
the red scare, Red Letter Day, and probably a page or two of
"Books I've Red." HTML has no way to specify what a
particular page item means. A more useful markup language would
represent information in terms of its meaning. What we need is a
language that tells us not how to display information, but
rather, what a given block of information is so we know what
to do with it.
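The last complaint is the easiest to picture in code. The sketch below assumes a hypothetical set of search hits in which every item is tagged with what it is; the tag names (hits, comedian, fish, idiom) are made up for this example, and the parser is the DOM implementation bundled with current JDKs. A plain text search still matches all three hits, but now each match arrives labeled with its meaning.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MeaningfulSearchSketch {

    // Hypothetical markup invented for this sketch: each hit is tagged
    // by what it is, not by how it should be displayed.
    static final String HITS_XML =
        "<hits>"
        + "<comedian>Red Skelton</comedian>"
        + "<fish>red snapper</fish>"
        + "<idiom>red herring</idiom>"
        + "</hits>";

    // Maps the text of each matching hit to the name of the element that
    // holds it, so a caller knows what kind of thing each match is.
    static Map<String, String> classify(String xml, String term) {
        Map<String, String> found = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other non-element nodes
                }
                String text = child.getTextContent();
                if (text.toLowerCase().contains(term.toLowerCase())) {
                    found.put(text, child.getNodeName());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        classify(HITS_XML, "red")
            .forEach((text, kind) -> System.out.println(kind + ": " + text));
    }
}
```

With HTML, all three hits would be indistinguishable runs of text. Here a program looking only for fish can throw away the comedian and the idiom without guessing.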
SGML has none of these weaknesses, but in order to be general, it's
hair-tearingly complex (at least in its complete form). The language
used to format SGML (its "style language"), called DSSSL
(Document Style Semantics and Specification Language), is
extremely powerful but difficult to use. How do we get a language
that's roughly as easy to use as HTML but has most of the power of