Mark Johnson
In last week's newsletter, I presented the types of entities that can
be found in an XML document. This week, I'll explain the "XML
declaration", which should occupy the first line of every XML file. As
in last week's newsletter, I'll set off XML vocabulary in *asterisks*,
to make it stand out.
Every XML file should start with an *XML declaration*, which indicates
several pieces of information that an XML-processing program uses to
parse the file. The XML declaration indicates that a document is XML,
what *XML version* the document uses, the *encoding* for the document,
and whether the document is *standalone*. (I'll explain what these
things mean in a moment.) A typical XML declaration might look like
this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
The *XML version* number indicates the version of the XML spec to which
the document conforms. The only currently valid version number is "1.0"
since that's the only official version of the XML specification. The
version number is the only mandatory attribute of the XML declaration;
in other words, the minimum XML declaration looks like this:
<?xml version="1.0"?>
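To see what a parser actually pulls out of the declaration, here is a
minimal sketch in Python using the standard library's expat bindings.
The function name read_xml_declaration and the two sample documents are
my own illustrations, not anything defined by the XML specification.

```python
import xml.parsers.expat


def read_xml_declaration(data):
    """Return the version, encoding and standalone values declared in data."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (omitted).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration  # called once per XML declaration
    parser.Parse(data, True)
    return found


# Hypothetical sample documents: the typical and the minimal declaration.
typical = b'<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?><doc/>'
minimal = b'<?xml version="1.0"?><doc/>'

print(read_xml_declaration(typical))
print(read_xml_declaration(minimal))
```

For the minimal declaration, the encoding and standalone entries come
back as None, since the document never states them.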
A document's *encoding* describes how the program processing the XML
document should interpret the bytes in the file. A character set
defines how sequences of one or more bytes map to characters for
display. XML handles character sets in a general manner, since it was
designed to be international. The XML specification mentions the
following encoding names, and spells out which character sets these
names encode:
Character Set              Strings
-------------              -------
Unicode (ISO/IEC 10646)    UTF-8, UTF-16, ISO-10646-UCS-2,
                           ISO-10646-UCS-4
ISO 8859                   ISO-8859-1 .. ISO-8859-9
JIS X-0208-1997            ISO-2022-JP, Shift_JIS, EUC-JP
XML processors typically recognize other encodings, too. ASCII, for
example, is a subset of UTF-8, so any processor that reads UTF-8 can
read ASCII. Encoding names are case-insensitive by definition. Most
commercial products should be able to handle ASCII,
ISO-8859-1, the UTF encodings, and probably some of the JIS encodings.
The Annotated XML specification recommends choosing one of those.
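To make the role of the encoding concrete, here is a small sketch in
Python using the standard library's ElementTree parser. The sample
documents and the helper names word_text and parses are illustrative
assumptions. Both documents contain the same bytes, including the single
byte 0xE9, which is an accented "e" in ISO-8859-1 but is not valid UTF-8
in this position. Only the declared encoding differs.

```python
import xml.etree.ElementTree as ET

# Same bytes in both documents; only the encoding declaration differs.
BODY = b"<word>caf\xe9</word>"
DECLARED = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY
UNDECLARED = b'<?xml version="1.0"?>' + BODY


def word_text(data):
    """Parse data and return the text inside the root element."""
    return ET.fromstring(data).text


def parses(data):
    """Report whether the parser accepts data at all."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(word_text(DECLARED))   # the byte 0xE9 is read as an ISO-8859-1 character
print(parses(UNDECLARED))    # with no encoding given, the parser assumes UTF-8
```

With the encoding declared, the parser maps the byte to the intended
character. Without it, this parser falls back to UTF-8 and rejects the
document.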
The *standalone* document declaration (SDD) is the third possible
item in the XML declaration. The standalone declaration indicates
whether the document contents can be fully interpreted without getting
information from elsewhere. Certain declarations in the DTD (for
example, external entity declarations) can affect the document's
content when an XML-processing program reads it. For example, if your
document uses an entity defined in an external file, then the document
isn't "standalone". The XML processor has to read and use the DTD to
properly interpret the document contents. The value of the declaration
must be either 'yes' (if the document itself contains all of the data
needed to interpret it), or 'no'. Like the encoding, the standalone
declaration is optional.
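As a rough illustration of a document that really is standalone, here
is a sketch in Python using the standard library's ElementTree parser.
The sample document, the entity name "author", and the helper note_text
are all hypothetical. The entity is declared inside the document itself,
so the parser can expand it without reading anything from elsewhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical standalone document: the entity it uses is declared
# in the document's own internal DTD subset, not in an external file.
STANDALONE_DOC = (
    '<?xml version="1.0" standalone="yes"?>\n'
    '<!DOCTYPE note [\n'
    '  <!ENTITY author "Mark Johnson">\n'
    ']>\n'
    '<note>&author;</note>'
)


def note_text(xml_text):
    """Parse xml_text and return the text of the root element."""
    return ET.fromstring(xml_text).text


print(note_text(STANDALONE_DOC))
```

If the entity were declared in an external file instead, the document
would depend on that file and the declaration would have to say "no".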
As a final note, you'll notice that I said every XML file "should"
start with an XML declaration. That's because an XML declaration is
optional. The XML specification doesn't absolutely require the
declaration, since a great deal of SGML and HTML already exists as
well-formed (or nearly well-formed) XML. Absolutely requiring the XML
declaration would have made these otherwise-compliant legacy files
non-well-formed. Therefore, the specification leads recommend the XML
declaration instead of requiring it. Tim Bray, one of the XML
specification editors, says, "You should definitely use an XML
declaration unless you have a *really* good reason not to." In
addition, many popular XML parsers treat the absence of a declaration
as an error, so if you don't have a declaration, your file may not
parse, much less validate.