Mark Johnson
Sometimes, you want the XML parser to leave your text alone. XML
requires the ampersand ('&') and less-than ('<') be represented by the
general entities &amp; and &lt;, respectively. This restriction can
make for some tedious typing, and hard-to-read, harder-to-write XML.
Putting the second sentence of this paragraph in an XML file would
require you to encode it like this:
"XML requires the ampersand ('&amp;') and less-than ('&lt;') be
represented by the general entities &amp;amp; and &amp;lt;,
respectively."
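If you would rather not do that encoding by hand, most XML toolkits can do it for you. As a minimal sketch, assuming Python and its standard xml.sax.saxutils module (neither is part of the original discussion), the sentence can be encoded and decoded like this:

```python
from xml.sax.saxutils import escape, unescape

# The sentence as a human would read it, entity names included.
sentence = ("XML requires the ampersand ('&') and less-than ('<') be "
            "represented by the general entities &amp; and &lt;, respectively.")

# escape() rewrites & first, then < and >, so the entity names
# themselves become &amp;amp; and &amp;lt;.
encoded = escape(sentence)
print(encoded)

# unescape() reverses the encoding and recovers the readable sentence.
decoded = unescape(encoded)
print(decoded == sentence)
```

Note that escape() also rewrites '>' as &gt;, which XML permits but does not require in ordinary text.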
Pretty bad, huh? It gets worse. Imagine you want something looking like
XSLT in your XML parser output. For example, this XSLT template
implements a new tag '<tag/>' that formats everything within it in bold
code font:
<!-- Format contents of "tag" as a tag -->
<xsl:template match="tag">
<b><code>&lt;<xsl:apply-templates/>&gt;</code></b>
</xsl:template>
But imagine writing a document to show this XSLT rule in the output
document, just as it looks above. It would have to be encoded like
this:
&lt;!-- Format contents of "tag" as a tag --&gt;
&lt;xsl:template match="tag"&gt;
&lt;b&gt;&lt;code&gt;&amp;lt;&lt;xsl:apply-templates/&gt;&amp;gt;&lt;/code&gt;&lt;/b&gt;
&lt;/xsl:template&gt;
Yuck! You can see the encoding requirement makes for pretty awkward XML.
Fortunately, an easy XML trick called a "CDATA section" gives you a
temporary reprieve from the & and < encoding rules. (I mean, &amp; and
&lt;.) A CDATA section starts with the delimiter '<![CDATA[' and ends
with the delimiter ']]>'. It can occur anywhere in an XML document
that character data can occur. So, the XSLT rule above can be encoded
as:
<![CDATA[
<!-- Format contents of "tag" as a tag -->
<xsl:template match="tag">
<b><code>&lt;<xsl:apply-templates/>&gt;</code></b>
</xsl:template>
]]>
Much better!
The CDATA section tells the XML processor to pass through anything
inside, verbatim - no entity substitution, no whitespace processing.
The XML processor doesn't parse what's inside a CDATA section, except
to look for the CDATA section's closing delimiter ']]>'. So, you can
include text just as you want it to appear to processors downstream
from the XML parser.
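To see the pass-through behavior for yourself, here is a small sketch, assuming Python's standard xml.etree.ElementTree parser. The <example> wrapper element is hypothetical and exists only for this illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element holding one CDATA section. The section
# contains markup-like text, an entity reference, and a bare ampersand.
doc = "<example><![CDATA[<b>&lt;tag/&gt;</b> & more]]></example>"

# The parser hands back the section's contents untouched: <b> is not
# treated as an element, and &lt; is not expanded to <.
text = ET.fromstring(doc).text
print(text)
```

Outside a CDATA section, the same characters would either be parsed as markup or rejected as a well-formedness error.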
Data inside a CDATA section is just plain character data. The XML
parser clips the text out of the CDATA section, pastes the enclosed
text block into its output, and then "forgets" a CDATA section ever
existed. To programs using XML parsers, CDATA sections are
indistinguishable from any other block of text. So you can't, for
example, write an XSLT rule that matches only text in CDATA sections.
CDATA sections are just a notational convention for temporarily
disabling XML's input parsing.
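The "forgetting" is easy to observe. The following sketch, again assuming Python's xml.etree.ElementTree and a made-up <p> element, parses the same text written two ways and then writes one of the results back out:

```python
import xml.etree.ElementTree as ET

# The same character data, once in a CDATA section and once with entities.
with_cdata = ET.fromstring("<p><![CDATA[a < b && c]]></p>")
with_entities = ET.fromstring("<p>a &lt; b &amp;&amp; c</p>")

# Downstream code sees identical text and cannot tell the two apart.
same_text = with_cdata.text == with_entities.text
print(same_text)

# Serializing the CDATA version shows the section is gone: the text
# comes back out in the entity-encoded form.
serialized = ET.tostring(with_cdata, encoding="unicode")
print(serialized)
```

Some parser APIs can be configured to report CDATA boundaries, but a program should not rely on them to carry meaning.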
One final point: Don't confuse a CDATA section:
<![CDATA[Hello, XML!]]>
with using the CDATA keyword in a DTD ATTLIST:
<!ATTLIST Address City CDATA #REQUIRED>
or with #PCDATA in an element definition:
<!ELEMENT Address (#PCDATA)>
The three notations are completely separate concepts.
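Now, a pop quiz: How could you represent the string '<![CDATA[Hello,
XML!]]>' in an XML document? Remember that a CDATA section ends at the
first ']]>' the parser sees, so the string's own ']]>' cannot sit
inside a single section. If you understand the following answer
(which is just one way to do it), then you understand CDATA sections:
<![CDATA[<![CDATA[Hello, XML!]]]]><![CDATA[>]]>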
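A quick way to check any candidate answer is to hand it to a parser and compare the resulting text with the target string. The sketch below assumes Python's xml.etree.ElementTree and a hypothetical <doc> wrapper element. It tries two candidates: one that splits the closing delimiter across two adjacent CDATA sections, and one that uses plain entity references with no CDATA section at all.

```python
import xml.etree.ElementTree as ET

# The string we want the parser to deliver to downstream programs.
target = "<![CDATA[Hello, XML!]]>"

# Candidate 1: end the first section after ']]' and put the final '>'
# in a second section, so ']]>' never appears inside either one.
split_form = "<doc><![CDATA[<![CDATA[Hello, XML!]]]]><![CDATA[>]]></doc>"

# Candidate 2: no CDATA section, just the usual entity encoding.
escaped_form = "<doc>&lt;![CDATA[Hello, XML!]]&gt;</doc>"

split_text = ET.fromstring(split_form).text
escaped_text = ET.fromstring(escaped_form).text

print(split_text == target)
print(escaped_text == target)
```

Both candidates produce the same text, which is one more sign that CDATA sections are only a notation and leave no trace in the parser's output.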