Mark Johnson
HTML programmers are accustomed to having a large set of valid
character entities for producing characters other than the common ASCII
characters from space to ~ (hex 20 through hex 7e). These characters
include:
- "international" characters, such as letters with an accent
(&agrave; = à, an a with a grave accent) or a circumflex
(&ocirc; = ô, an o with a circumflex, or caret, above it);
- "special" characters like &oelig; (œ), the joined o and e
in archaic spellings like "encyclopœdia";
- various symbols, such as the Greek alphabet (&alpha; = α, &beta; = β, and so
on) and the "for all" symbol (&forall; = ∀, an upside-down
capital "A").
But how do you encode such characters in XML?
Character entities, like all other entities, can be defined in a DTD
with an <!ENTITY> declaration. XHTML (the reformulation of HTML 4 as
an XML document type) defines these entities, but XML itself does not (with
the exceptions of &amp;, &lt;, &gt;, &apos;, and &quot;). So, what do you
do if you want to use these characters in XML?
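For a handful of characters, one option is to declare the entities yourself. The following is a minimal sketch, not the only way to do it: the element name Note is hypothetical, and the values are numeric character references for the Unicode code points of à (224) and α (945).

```xml
<?xml version="1.0"?>
<!DOCTYPE Note [
<!ELEMENT Note (#PCDATA)>
<!-- Each declaration maps an entity name to one character -->
<!ENTITY agrave "&#224;">
<!ENTITY alpha "&#945;">
]>
<Note>voil&agrave;, &alpha;</Note>
```

This works well for two or three characters, but declaring every entity by hand quickly becomes tedious, which is where ready-made entity files help.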
The World Wide Web Consortium (W3C), an international consortium dedicated to
open Web standards, provides three entity definition files as a part of
XHTML. These files define the character entities for XHTML, but they're
usable in XML as well. You simply have to include the contents of those
files in the DTD for your document. These files are:
- http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent (Latin characters)
- http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent (Special
characters)
- http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent (Symbols)
To include one of these files in your DTD, place the following
declaration and reference in your DTD:
<!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;
(This is the declaration for xhtml-symbol. Change the entity name, public
identifier, and URL accordingly to use the other two files.) The http: URL
above indicates the .ent file's location. However, you may not always be
online, so you can copy the .ent file to a local directory and replace the
http: URL with a reference to the local file.
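As a sketch of the offline variant, assume that xhtml-symbol.ent has been copied into the same directory as the XML document (that location is an assumption made for illustration). The system identifier then becomes a relative reference, which a parser resolves against the location of the document containing the declaration:

```xml
<!-- Assumes xhtml-symbol.ent sits next to the XML document -->
<!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols for XHTML//EN"
"xhtml-symbol.ent">
%HTMLsymbol;
```

The public identifier stays the same; only the system identifier changes.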
The following small XML document demonstrates the use of these
character entities:
<?xml version="1.0"?>
<!-- Start DTD -->
<!DOCTYPE ThisDoc [
<!ELEMENT ThisDoc (AnyEntity)>
<!ELEMENT AnyEntity (#PCDATA)>
<!-- Define entities for symbols -->
<!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;
<!-- Define entities for special characters -->
<!ENTITY % HTMLspecial PUBLIC
"-//W3C//ENTITIES Special for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
%HTMLspecial;
<!-- Define entities for Latin-1 characters -->
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]> <!-- End DTD -->
<!-- Start document -->
<ThisDoc>
<AnyEntity>
Try some international characters: &aacute; &zeta; &eta; &theta;
&dagger;
</AnyEntity>
</ThisDoc> <!-- End document -->
<!-- End of example -->