Ampersand Attrition in XML and HTML
Have you ever played the "spot the development platform" game? In my
version of it, points are awarded to players who correctly guess what
programming language an application is written in, simply by looking at
the user interface of the application.
Many telltale signs can be spotted, ranging from the shape of the
hover text that appears on buttons through to the general pattern of URLs
generated in HTTP GET requests. Visual C++ thick client binaries,
Vignette, and JBoss all have pretty distinctive attributes that can be
perceived on close inspection of a running application's GUI.
With XML and HTML, a more challenging game is possible, namely,
"diagnose the problems with ampersand characters". Note that this game
is about diagnosis, not detection. Detecting problems with ampersand
characters in XML/HTML applications yields no prizes because ampersands
in XML/HTML applications *always* cause problems.
Just now, I searched for the string "amp;amp" with Google and received
about 22,000 hits. If you are interested in this phenomenon, then I'd
suggest following some of the links, viewing the source, and marveling
at the number of "amps" on show. Sometimes you'll find a single amp, and
other times as many as twenty!
The Root Cause of the Problem
The ampersand character has special meaning in the SGML, HTML, and XML
markup languages. If you wish to use it literally, you must "escape" it.
The escaped form consists of an ampersand sign (! -- more on this
later), the string "amp", and a semicolon. However, an ampersand can
also occur within an XML document in other ways without causing parsing
problems. Ampersands can appear in un-escaped form inside CDATA sections
and inside comments. They are used to introduce "entity references" for
special characters, such as "lt" for less than and "quot" for double
quote ("apos" is the one for single quote). They are also used to
introduce so-called "character entities" (more properly, numeric
character references) such as "#x0041", which is the Unicode code point
for a capital A character.
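To make these cases concrete, here is a minimal sketch using Python's standard library. The sample menu document and the helper name is_well_formed are invented for illustration; they are not part of any real application. It shows an escaped ampersand, a CDATA section, a comment, entity references, and a character reference all parsing cleanly, while a bare ampersand in ordinary text does not.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A literal ampersand in ordinary text must be written as &amp;
escaped = escape("Fish & Chips")

# Hypothetical sample document covering each case described above.
doc = (
    "<menu>"
    "<item>" + escaped + "</item>"
    "<item><![CDATA[Salt & Vinegar]]></item>"
    "<!-- R&D: a comment may hold a bare ampersand -->"
    "<item>1 &lt; 2, &quot;quoted&quot;, &#x0041;</item>"
    "</menu>"
)

root = ET.fromstring(doc)
items = [item.text for item in root.findall("item")]


def is_well_formed(xml_text):
    """Return True if xml_text parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(escaped)
print(items)
print(is_well_formed("<item>Fish & Chips</item>"))
```

After parsing, the text of each item contains plain characters again: the parser has turned the references back into "&", "<", the double quote, and "A".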
The multiple uses of ampersand characters -- some special, some not --
are the cause of the trouble. Let us say you are in the process of
adding markup to a document. It does not parse yet, so you are doing all
your text processing lexically (i.e. by editing with a text editor or
performing string processing using some sort of search/replace or
regular expression capability). You know that some literal ampersands
are scattered throughout the document's text so you fire off a
search/replace to escape them all.
Trouble is, if there are any ampersands in CDATA sections or comments,
or introducing entities, they are escaped as well. An entity reference
such as "&lt;" becomes "&amp;lt;", causing a stray "amp;" to appear in
your final output.
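The failure is easy to reproduce. The sketch below contrasts the blanket search/replace with a more careful lexical pass that leaves CDATA sections, comments, and existing references alone. The function names and the sample string are invented for illustration, and the pattern is a simplification: it only recognises plain alphanumeric entity names, and it cannot know whether text that looks like a reference was meant literally.

```python
import re


def naive_escape(text):
    """Blanket search/replace: escapes every ampersand it sees."""
    return text.replace("&", "&amp;")


# Alternatives are tried in order at each position, so protected spans
# (CDATA sections, comments, existing references) are matched whole
# before the final alternative can see a bare ampersand.
PROTECTED_OR_AMP = re.compile(
    r"<!\[CDATA\[.*?\]\]>"                                   # CDATA section
    r"|<!--.*?-->"                                           # comment
    r"|&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);"    # reference
    r"|&",                                                   # bare ampersand
    re.DOTALL,
)


def careful_escape(text):
    """Escape only ampersands that are not already doing a special job."""
    def fix(match):
        token = match.group(0)
        return "&amp;" if token == "&" else token
    return PROTECTED_OR_AMP.sub(fix, text)


sample = "Tom & Jerry &lt; <![CDATA[a & b]]> <!-- c & d --> &#x41;"

print(naive_escape(sample))
print(careful_escape(sample))

# Running the blanket replace repeatedly is how the chains of amps grow.
chain = "&"
for _ in range(3):
    chain = naive_escape(chain)
print(chain)
```

The careful version can be run repeatedly without changing its own output, because every "&amp;" it produces is itself a recognised reference. The blanket version adds another "amp;" on every pass.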

Great explanation
..but how do you escape them?