Mark Johnson
Reading SGML-related specifications such as the W3C's XML, XSL, or XML
Schema recommendations can be frustrating and confusing. Over the past
thirty years, the SGML community has developed its own jargon, and XML
inherited it. Terms like "notation" or "external unparsed entity" are
perfectly clear to an SGML expert, but confusing to the uninitiated.
Since XML, XSL, and related technologies descend from SGML, their
specifications are written using this jargon. In this week's
newsletter, I'll cover some terminology basics so you can read the
specifications for yourself.
Keep in mind that when I say "XML" in this newsletter, I actually
mean "XML and similar technologies". I'll set off each term below with
*asterisks*, so you'll know when you're seeing XML terminology. First,
we'll address some basic concepts.
Structure
XML documents have both logical and physical structure. The *logical
structure* is simply the elements (and attributes) in the document and
their order. The *physical structure* is the arrangement of physical
data sources (such as files or URLs) that produces the logical structure.
For example, you've probably seen something like this in an XML
document:
<!DOCTYPE PurchaseOrder SYSTEM "PurchaseOrder.dtd">
This XML document's DTD is in an external file, so the document's
physical structure involves both the original XML document and this
external file. The logical structure is simply the element contents
after the physical dependencies have been resolved.
Entities
XML documents use storage units called *entities* to arrange physical
structures to produce a logical structure. Entities define blocks of
text for reuse in documents or in DTDs, and include data from other
storage units (such as files). Several characteristics determine an
entity's type. Every entity is either *internal* or *external*;
*parsed* or *unparsed*; and a *general entity* or a *parameter entity*.
An *internal entity* is defined in a document's prolog (along with or
within the DTD), and is not associated with any external file or data
source. An *external entity* is also defined in the prolog, but depends
on some external file or data source. For example:
<!ENTITY Alpha "Á"> <!-- Internal -->
<!ENTITY Chars SYSTEM "chars.dtd"> <!-- External -->
A *parsed entity* is parsed by the XML processor, and its contents are
part of the document's logical structure. An *unparsed entity* is a
reference to data that may or may not be XML. Each unparsed entity is
associated with a *notation*, which indicates what sort of processor
can access the unparsed entity. All internal entities are parsed,
whereas external entities may be parsed or unparsed.
A *general entity* represents text in the body of a document, and a
*parameter entity* represents text in a DTD. To use a general entity,
reference it by name between an ampersand (&) and a semicolon (;). For
example:
<!-- Define an internal (general parsed) entity -->
<!ENTITY Copyright "Copyright (c) 2001">
<!ENTITY Author "Mark Johnson">
<!-- Use the entity -->
<TITLE>
&Copyright; by &Author;. All rights reserved.
</TITLE>
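To see what the parser does with these references, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The entity names and text come from the example above; wrapping the declarations in an internal DTD subset and using TITLE as the root element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the two entities are declared in the
# internal DTD subset and referenced in the element content.
DOCUMENT = """<!DOCTYPE TITLE [
  <!ENTITY Copyright "Copyright (c) 2001">
  <!ENTITY Author "Mark Johnson">
]>
<TITLE>&Copyright; by &Author;. All rights reserved.</TITLE>"""

title = ET.fromstring(DOCUMENT)

# The parser has already replaced each reference with the entity's
# replacement text, so the application never sees "&Author;".
print(title.text)
```

The text that reaches the application is part of the logical structure; the entity declarations themselves belong to the physical structure.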
A *parameter entity* is declared and referenced with a percent sign (%),
and can be used only within a DTD, like this:
<!ENTITY % Colors "Red|Green|Blue|Black|Brown">
<!ELEMENT SHOE EMPTY>
<!ATTLIST SHOE
SIZE CDATA #REQUIRED
COLOR (%Colors;) #REQUIRED>
<!ELEMENT TIE EMPTY>
<!ATTLIST TIE
TYPE (Bola|Cravat|Ascot) #REQUIRED
COLOR (%Colors;) #REQUIRED>
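In the sample above, the parameter entity Colors (which is also an
internal, parsed entity) is defined with the ENTITY keyword and supplies
the allowed values of the COLOR attribute for both SHOE and TIE. A
parameter entity reference is recognized only inside a DTD; in the body
of a document, %Colors; is treated as ordinary text and is not expanded.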
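The three ways of classifying an entity can be checked mechanically. The following is a rough Python sketch, not a definitive tool, that uses the EntityDeclHandler callback of the standard library's xml.parsers.expat module to label each declaration. The sample document, the Logo entity, the gif notation, and the classify_entities helper are all hypothetical names introduced for illustration.

```python
import xml.parsers.expat

# Hypothetical declarations, loosely modelled on the examples in this
# newsletter. Logo and the gif notation are invented for illustration.
DOCUMENT = """<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY Author "Mark Johnson">
  <!ENTITY Chars SYSTEM "chars.dtd">
  <!ENTITY Logo SYSTEM "logo.gif" NDATA gif>
  <!ENTITY % Colors "Red|Green|Blue|Black|Brown">
]>
<doc/>"""


def classify_entities(document):
    """Return {name: (storage, parsing, kind)} for each declared entity."""
    found = {}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Internal entities carry their replacement text in the declaration;
        # external entities carry a system identifier instead.
        storage = "internal" if value is not None else "external"
        # Only unparsed entities name a notation (the NDATA keyword).
        parsing = "unparsed" if notation is not None else "parsed"
        kind = "parameter" if is_parameter else "general"
        found[name] = (storage, parsing, kind)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(document, True)
    return found


for name, labels in classify_entities(DOCUMENT).items():
    print(name, labels)
```

Nothing in the sketch reads chars.dtd or logo.gif; the parser only reports the declarations, which is enough to tell the entity types apart.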
The resources below will provide you with enough additional detail to
come "up to speed" on entity terminology. Meanwhile, I'll cover more
XML terminology in upcoming newsletters.