In my youth, as an aspiring software developer, I used to write
software in Intel 8086 assembly language. I freely admit that I was
writing assembly code for some time before the full meaning of
everything I wrote into my programs was clear to me. In particular, I
remember starting my programs with the incantation:
code segment para public
For quite a while, I did not know what that meant -- except that all
assembler programs I looked at seemed to have it, all the books used
it, all my peers were putting it into their programs, and leaving it
out caused undecipherable error messages to come from the assembler. As
a new programmer not wishing to look stupid, I rattled off the
incantation at the top of my programs with a flourish and even told new
programmers to do the same on the grounds that "the parser needs it".
An XML analog of this anecdote can be found in the incantation:
<?xml version="1.0" encoding="utf-8"?>
The implications of the utf-8 part of this statement are as lost on
some developers as the "code segment para public" statement was on me.
A lot of XML documents seem to have it, all the books use it, other XML
people put it into their documents, and leaving it out can cause
undecipherable error messages to come from the parser. Ask a room full
of XML developers why it is there and the answer "the parser needs it"
will feature prominently.
Let's face facts: the world has not yet hit a critical mass of Unicode
(http://www.unicode.org) compliant tools. I believe that a lot of XML
that says "here be Unicode" is processed by systems that will do
unpleasant things if you feed them anything outside the plain vanilla
US ASCII range.
The situation is not helped by the fact that it is not possible to say
in your XML "just use seven bit US ASCII". Yes, you can specify US
ASCII like this:
<?xml version="1.0" encoding="us-ascii"?>
BUT, this is not officially part of the XML 1.0 standard, which obliges
parsers to support only UTF-8 and UTF-16. I have yet to come across a
tool that does not support US ASCII, but use it and you run the risk
that someone will accuse you of using non-standard XML, a charge that is
difficult to refute from the letter of the standard.
To make matters worse, if you use a declaration like this:
<?xml version="1.0"?>
or worse, no declaration at all, the default behaviour of the parser is
to treat the content as UTF-8. In other words, "here be Unicode".
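The default can be seen in a minimal sketch, assuming Python's standard xml.etree.ElementTree parser (chosen here purely for illustration; other parsers may word their errors differently). The element name and sample word are made up. The same document is saved twice with no encoding declared: once as UTF-8, which parses, and once as Latin-1, which the parser rejects because it reads the bytes as UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with no encoding declared.
SOURCE = '<?xml version="1.0"?><word>caf\u00e9</word>'

# Saved as UTF-8: the parser's default assumption matches the bytes.
utf8_doc = SOURCE.encode("utf-8")
utf8_text = ET.fromstring(utf8_doc).text
print(utf8_text)

# Saved as Latin-1: the bytes are not valid UTF-8, and nothing in the
# document tells the parser to expect anything else.
latin1_doc = SOURCE.encode("latin-1")
try:
    ET.fromstring(latin1_doc)
    latin1_accepted = True
except ET.ParseError as err:
    latin1_accepted = False
    print("rejected:", err)
```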
The practical upshot of this is that if you wish to use a subset of
Unicode -- say US ASCII, Greek, or Cyrillic -- you cannot express that
constraint in your XML documents. People can send you XML that you are
expecting to be all 7 bit US ASCII but with some Gaelic in the middle.
The results can range from benign through to severe. Doing the right
thing with Unicode data affects everything from the programming
language you use to the types of output renderings you can create. The
very meaning of some concepts we are, perhaps, inured to in the West,
such as "regular expressions" and "uppercase text", is significantly
complicated in the face of fully blown Unicode.
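Since the constraint cannot be stated in the document itself, one option is to check it in the receiving application. What follows is only a rough sketch, again assuming Python's standard xml.etree.ElementTree; the function name non_ascii_characters and the sample document are hypothetical. It reports every character outside 7 bit US ASCII found in element text and attribute values (element and attribute names are not checked).

```python
import xml.etree.ElementTree as ET


def non_ascii_characters(xml_bytes):
    """Return (tag, character) pairs for each character outside US ASCII.

    Hypothetical helper: checks element text, trailing text, and
    attribute values, but not element or attribute names.
    """
    root = ET.fromstring(xml_bytes)
    offenders = []
    for element in root.iter():
        chunks = [element.text or "", element.tail or ""]
        chunks.extend(element.attrib.values())
        for chunk in chunks:
            offenders.extend(
                (element.tag, ch) for ch in chunk if ord(ch) > 0x7F
            )
    return offenders


# A document expected to be plain ASCII, with some Gaelic in the middle.
gaelic_doc = '<?xml version="1.0"?><greeting>F\u00e1ilte</greeting>'.encode("utf-8")
plain_doc = b'<?xml version="1.0"?><greeting>Welcome</greeting>'

print(non_ascii_characters(gaelic_doc))
print(non_ascii_characters(plain_doc))
```

A receiver could refuse or quarantine any document for which the returned list is not empty, rather than discovering the problem further downstream.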
So much for the engineering department. Let's head over to
sales/marketing and find out what is going on over there about this
Unicode issue:
Potential Customer: "Does your software support Japanese?"
Sales Person: "Oh yes, our software is fully Unicode compliant and,
thus, we support Japanese."
Yikes! As anyone involved in internationalization will tell you,
supporting Japanese requires much more than sticking a UTF-8 or a
UTF-16 encoding into your XML and perhaps using a programming language,
such as Java or Python, that can handle wide characters.
I support Unicode. Unicode is a good thing. However, the "all or
nothing" way it must be used with XML and the sales propaganda that
Unicode support in a programming language magically solves
internationalization issues are not in the best interests of either the
Unicode or the XML cause.
A piece of positive news on Unicode subsetting: I have just found out
that it is possible to restrict the range of Unicode characters in a
document using a W3C XML Schema lexical constraint. All I need to do
now is to understand the other 99.995% of that spec!