Ampersand Attrition in XML and HTML

July 3, 2002, 11:00 PM, ITworld

Have you ever played the "spot the development platform" game? In my
version of it, points are awarded to players who correctly guess what
programming language an application is written in, simply by looking at
the user interface of the application.

Many telltale signs can be spotted, ranging from the shape of the
hover-text that appears on buttons through to the general pattern of URLs
generated in HTTP GET requests. Visual C++ thick client binaries,
Vignette, and JBOSS all have pretty distinctive attributes that can be
perceived on close inspection of a running application's GUI.

With XML and HTML, a more challenging game is possible, namely,
"diagnose the problems with ampersand characters". Note that this game
is about diagnosis, not detection. Detecting problems with ampersand
characters in XML/HTML applications yields no prizes because ampersands
in XML/HTML applications *always* cause problems.

Just now, I searched for the string "amp;amp" with Google and received
about 22,000 hits. If you are interested in this phenomenon, then I'd
suggest following some of the links, viewing the source, and marveling
at the number of "amps" on show. Sometimes you'll find a single amp, and
other times as many as twenty!

The Root Cause of the Problem
The ampersand character has special meaning in SGML, HTML, and XML
markup languages. If you wish to use it literally, you must "escape" it.
The escaped form consists of an ampersand sign (! -- more on this
later), the string "amp", and a semi-colon. However, a literal ampersand
sign can occur within an XML document without causing parsing problems
in certain cases. For example, ampersands can occur in un-escaped form
inside CDATA sections and inside comments; they are used to introduce
"entity references" for special characters such as "lt" for less than
and "quot" for double quote; they are also used to introduce so called
"character references" such as "#x0041", which is the Unicode code point
for a capital A character.
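
The different roles described above can be seen side by side in a small sketch. It assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the sample document and its element names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escaping a literal ampersand for use in character data.
escaped = escape("fish & chips")

# One made-up document in which the ampersand plays several roles.
doc = (
    "<doc>"
    "<a>fish &amp; chips</a>"            # escaped literal ampersand
    "<b><![CDATA[fish & chips]]></b>"    # un-escaped inside a CDATA section
    "<c>1 &lt; 2</c>"                    # introduces an entity reference
    "<d>&#x0041;</d>"                    # introduces a character reference
    "<!-- salt & vinegar -->"            # un-escaped inside a comment
    "</doc>"
)

# The default parser discards the comment and resolves the references.
root = ET.fromstring(doc)
texts = {child.tag: child.text for child in root}
```

After parsing, elements a and b carry the same text even though only one of them used the escaped form, which is why a purely lexical view of the source cannot treat every ampersand alike.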

The multiple uses of ampersand characters -- some special, some not --
are the cause of the trouble. Let us say you are in the process of
adding markup to a document. It does not parse yet, so you are doing all
your text processing lexically (i.e. by editing with a text editor or
performing string processing using some sort of search/replace or
regular expression capability). You know that some literal ampersands
are scattered throughout the document's text so you fire off a
search/replace to escape them all.

Trouble is, if there are any ampersands in CDATA sections or comments,
or any that introduce entity references, they are also escaped -- causing
"amp;" to appear in your final output. Furthermore, any ampersands in the
true text of the document that had already been escaped would then be
double escaped -- thus again causing "amp;" to appear in your final output.
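
The failure is easy to reproduce. The sketch below is illustrative only: the function names and the sample string are hypothetical, and the "careful" variant is just one possible lexical workaround, not a recommended fix. It still knows nothing about CDATA sections or comments, and it cannot tell whether an existing "&lt;" was meant as markup or as literal text.

```python
import re

def naive_escape(text):
    """Blind search/replace: escapes every ampersand it sees."""
    return text.replace("&", "&amp;")

# An ampersand that does NOT already start a reference such as
# &amp; or &lt; or &#65; or &#x41;
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def careful_escape(text):
    """Escape only ampersands that do not look like the start of a reference."""
    return BARE_AMP.sub("&amp;", text)

# Hypothetical input: one bare ampersand, one entity reference,
# and one ampersand that has already been escaped.
source = "R&D costs &lt; 5% at Smith &amp; Sons"

naive = naive_escape(source)
careful = careful_escape(source)
```

The naive pass turns the entity reference into "&amp;lt;" and the already escaped ampersand into "&amp;amp;". The careful pass leaves both alone and can be run again without changing its own output.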

This process may be repeated, depending on the number of steps involved
in the document production workflow. Like wood-rings in a felled tree,
you can get a feel for the number of seasons in a document workflow by
seeing how many times erroneously escaped ampersands are escaped!
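
Counting the rings can be automated. This is a minimal sketch with hypothetical names: it simulates a workflow that blindly escapes the same text several times and then measures the longest run of "amp;" that follows an ampersand.

```python
import re

# An ampersand followed by one or more repetitions of "amp;"
AMP_RUN = re.compile(r"&((?:amp;)+)")

def escape_depth(text):
    """Return the longest run of 'amp;' following any ampersand (0 if none)."""
    runs = [len(m.group(1)) // len("amp;") for m in AMP_RUN.finditer(text)]
    return max(runs, default=0)

def run_workflow(text, steps):
    """Simulate a workflow in which every step blindly escapes ampersands."""
    for _ in range(steps):
        text = text.replace("&", "&amp;")
    return text

after_three = run_workflow("Smith & Sons", 3)
depth = escape_depth(after_three)
```

A depth of 1 is what a correctly escaped ampersand looks like; anything higher suggests that many escaping passes were stacked on top of one another.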

What is the essence of the problem here? Why is it that, after all these
years of SGML experience, ampersand attrition rates are still so
dreadful? I suspect the problem is a parallel of the problem in the Unix
world known as the "two to the n minus 1 backslash problem".

In Unix, a backslash has special meaning in numerous contexts. To escape
it, you add another backslash. However, if you are creating syntax for a
command that will pass through a couple of backslash sensitive layers
before hitting its final target, you need to escape the backslash by
adding backslashes. If there are 2 intermediate layers, you need 3
backslashes. For 3 layers you need 7, and so on.
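
The growth can be checked with a few lines of code. The sketch assumes the simplest possible model, in which each layer consumes one level of escaping, so preparing text for one more layer means doubling every backslash. The function names are hypothetical.

```python
def add_layer(text):
    """Escape text for one more layer by doubling every backslash."""
    return text.replace("\\", "\\\\")

def extra_backslashes(layers):
    """Count the backslashes added so that one literal backslash survives."""
    text = "\\"                     # the single backslash we want to deliver
    for _ in range(layers):
        text = add_layer(text)
    return len(text) - 1            # everything beyond the original one

# Number of added backslashes for 1 to 4 layers.
counts = [extra_backslashes(n) for n in range(1, 5)]
```

Under this model the count for n layers is two to the n minus 1, which is where the problem gets its name.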

In both ampersand escaping and backslash escaping, we see the same
phenomenon. Namely, the character to be escaped is, itself, used in the
escaping mechanism. In the case of ampersands, the first character in
the escape sequence is *another* ampersand. In the case of backslashes,
the escape sequence is *another* backslash.

The last thing you want to do if you find yourself in a hole is to keep
digging. It seems to me that this is exactly what the ampersand escape
mechanism does by adding more ampersands.

Now We Know the Problem...What's the Cure?
So how should this be fixed? Can it be fixed? I'm not sure, but one idea
that I believe bears investigation is the use of a pre-defined empty
element type in XML, called amp, in the XML namespace bound to the
reserved prefix "xml:". That way, I can represent a literal ampersand
in text as <xml:amp/>. In so doing, I would be able to cleanly separate
literal ampersands in the text of a document from ampersands that are
part of the surface syntax of the markup.
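
To see why such an element would help, consider this string-level sketch. The amp element is only the proposal made above; it is not defined by any XML specification, and the function names are hypothetical. The sketch also assumes the input is plain text with no entity references in it.

```python
# Hypothetical element from the proposal above; not part of any XML spec.
AMP_ELEMENT = "<xml:amp/>"

def mark_ampersands(text):
    """Replace each literal ampersand in plain text with the proposed element."""
    return text.replace("&", AMP_ELEMENT)

def render(text):
    """Turn the proposed element back into a literal ampersand for display."""
    return text.replace(AMP_ELEMENT, "&")

once = mark_ampersands("Smith & Sons")
twice = mark_ampersands(once)       # a second, accidental pass
```

Because the marked-up form contains no ampersand, a second pass has nothing left to escape, so the text cannot accumulate layers no matter how many workflow steps touch it.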

It will not have escaped your attention that this article does not
contain a single literal ampersand. To do so would be to invite Murphy
to mess one up. That would make this article self-referential in a way I
would rather avoid. There is enough ampersand attrition in the world
without this article adding to it!

» posted by ITworld staff
