Ampersand Attrition in XML and HTML
Have you ever played the "spot the development platform" game? In my
version of it, points are awarded to players who correctly guess what
programming language an application is written in, simply by looking at
the user interface of the application.
Many telltale signs can be spotted, ranging from the shape of the
hover text that appears on buttons through to the general pattern of URLs
generated in HTTP GET requests. Visual C++ thick client binaries,
Vignette, and JBOSS all have pretty distinctive attributes that can be
perceived on close inspection of a running application's GUI.
With XML and HTML, a more challenging game is possible, namely,
"diagnose the problems with ampersand characters". Note that this game
is about diagnosis not detection. Detecting problems with ampersand
characters in XML/HTML applications yields no prizes because ampersands
in XML/HTML applications *always* cause problems.
Just now, I searched for the string "amp;amp" with Google and received
about 22,000 hits. If you are interested in this phenomenon, then I'd
suggest following some of the links, viewing the source, and marveling
at the number of "amps" on show. Sometimes you'll find a single amp, and
other times as many as twenty!
The Root Cause of the Problem
The ampersand character has special meaning in SGML, HTML, and XML
markup languages. If you wish to use it literally, you must "escape" it.
The escaped form consists of an ampersand sign (! -- more on this
later), the string "amp", and a semi-colon. However, an ampersand
can occur within an XML document without being escaped in certain
cases. Literal ampersands can appear in un-escaped form inside CDATA
sections and inside comments. Ampersands are also used to introduce
"entity references" for special characters, such as "lt" for less than
and "quot" for the double quote ("apos" is the single quote), and to
introduce so-called "character entities" (properly, character
references) such as "#x0041", which is the Unicode code point for a
capital A character.
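
To make these cases concrete, here is a minimal sketch in Python using only the standard library. The sample menu document and the variable names are illustrative assumptions, not taken from any real application. It shows an escaped literal ampersand, an un-escaped one inside a CDATA section, and a character reference next to an entity reference, all coming back as plain text after parsing.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A literal ampersand in ordinary text must be escaped before it goes into XML.
literal = "Fish & Chips"
escaped = escape(literal)

# Hypothetical sample document:
#   item 1 holds the escaped text,
#   item 2 holds an un-escaped ampersand inside a CDATA section,
#   item 3 holds a character reference and an entity reference.
doc = (
    "<menu>"
    "<item>" + escaped + "</item>"
    "<item><![CDATA[Salt & Vinegar]]></item>"
    "<item>&#x0041;&lt;B</item>"
    "</menu>"
)

# After parsing, every item is plain text again; the markup-level
# differences between the three forms have disappeared.
items = [item.text for item in ET.fromstring(doc)]
print(escaped)
print(items)
```

The point of the sketch is that a parser sorts out which ampersands are special and which are not; the difficulty described next arises when no parser can be used yet.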
The multiple uses of ampersand characters -- some special, some not --
are the cause of the trouble. Let us say you are in the process of
adding markup to a document. It does not parse yet, so you are doing all
your text processing lexically (i.e. by editing with a text editor or
performing string processing using some sort of search/replace or
regular expression capability). You know that some literal ampersands
are scattered throughout the document's text so you fire off a
search/replace to escape them all.
Trouble is, any ampersands inside CDATA sections, inside comments, or
introducing entity references are escaped as well, causing stray "amp;"
strings to appear in your final output.
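
The failure mode can be reproduced in a few lines of Python. The sketch below is an illustration under stated assumptions: `naive_escape`, `careful_escape`, and the sample string are hypothetical names invented for this example. The "careful" version is only a lexical heuristic that skips ampersands which already look like the start of a reference; it knows nothing about CDATA sections or comments, and it cannot tell a genuine reference from literal text that happens to look like one.

```python
import re

# An ampersand that does NOT already begin something shaped like a named
# reference (&lt;), a decimal reference (&#65;), or a hex reference (&#x41;).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def naive_escape(text):
    """Blind search/replace: escapes every ampersand it sees."""
    return text.replace("&", "&amp;")

def careful_escape(text):
    """Heuristic: escape only ampersands that do not start a reference."""
    return BARE_AMPERSAND.sub("&amp;", text)

sample = "Tom & Jerry &lt;3 &#x0041;"

once = naive_escape(sample)    # existing references are damaged
twice = naive_escape(once)     # each further pass adds another "amp;"
safe = careful_escape(careful_escape(sample))  # repeated passes are harmless

print(once)
print(twice)
print(safe)
```

Running the blind replacement a second time is how strings such as "amp;amp" end up in published pages. The heuristic version is idempotent on this sample, but it remains a sketch rather than a substitute for escaping text exactly once at a well-defined point.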

Great explanation
..but how do you escape them?