ITworld.com
Unicode: Confessions of a Latinate

XML IN PRACTICE --- 12/06/2001

Although Unicode isn't the answer to code internationalization, XML's continued support and improved subsetting are certainly helping it along.



In my youth, as an aspiring software developer, I used to write software in Intel 8086 assembly language. I freely admit that I was writing assembly code for some time before the full meaning of everything I wrote into my programs was clear to me. In particular, I remember starting my programs with the incantation:

code segment para public

For quite a while, I did not know what that meant -- except that all the assembler programs I looked at seemed to have it, all the books used it, all my peers were putting it into their programs, and leaving it out caused undecipherable error messages to come from the assembler. As a new programmer not wishing to look stupid, I rattled off the incantation at the top of my programs with a flourish and even told new programmers to do the same on the grounds that "the parser needs it".

An XML analog of this anecdote can be found in the incantation:

<?xml version="1.0" encoding="utf-8"?>

The implications of the utf-8 part of this statement are as lost on some developers as the "code segment para public" statement was on me. A lot of XML documents seem to have it, all the books use it, other XML people put it into their documents, and leaving it out can cause undecipherable error messages to come from the parser. Ask a room full of XML developers why it is there and the answer "the parser needs it" will feature prominently.

Let's face facts: the world has not yet hit a critical mass of Unicode-compliant tools (http://www.unicode.org). I believe that a lot of XML that says "here be Unicode" is processed by systems that will do unpleasant things if you feed them anything outside the plain vanilla US ASCII range.

The situation is not helped by the fact that it is not possible to say in your XML, in a way the standard guarantees, "just use seven-bit US ASCII". Yes, you can specify US ASCII like this:

<?xml version="1.0" encoding="us-ascii"?>

BUT this is not officially part of the XML 1.0 standard, which only obliges parsers to support UTF-8 and UTF-16 and leaves every other encoding optional. I have yet to come across a tool that does not support US ASCII, but if you use it you run the risk that someone will accuse you of using non-standard XML, a charge that is difficult to refute from the letter of the standard.
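As a rough illustration of what the declaration buys you when a tool does support it, the sketch below uses Python's standard xml.etree.ElementTree module as a stand-in for "a tool". The helper name parses_ok and the two sample documents are invented for this example, and other parsers may well behave differently.

```python
import xml.etree.ElementTree as ET


def parses_ok(document: bytes) -> bool:
    """Return True if the parser accepts the document, False if it rejects it."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


DECL = b'<?xml version="1.0" encoding="us-ascii"?>'

# Plain seven-bit content is consistent with the us-ascii declaration.
plain = DECL + b"<greeting>hello</greeting>"

# Byte 0xE9 lies outside the seven-bit range, so it contradicts the declaration.
stray = DECL + b"<greeting>caf\xe9</greeting>"

if __name__ == "__main__":
    print("plain accepted:", parses_ok(plain))
    print("stray accepted:", parses_ok(stray))
```

With this particular parser the declaration acts as a crude seven-bit guard, but only because the parser happens to implement an encoding it is not required to.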

To make matters worse, if you use a declaration like this:

<?xml version="1.0"?>

or worse, no declaration at all, the default behaviour of the parser is to treat the content as UTF-8 (or as UTF-16, if the document starts with a byte order mark). In other words, "here be Unicode".
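A minimal sketch of that default, again assuming Python's standard xml.etree.ElementTree as the parser. The helper text_of and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_of(document: bytes):
    """Parse a document and return the root element's text, or None if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


# No encoding named: the bytes C3 A9 are read as the UTF-8 form of U+00E9.
utf8_doc = b'<?xml version="1.0"?><word>caf\xc3\xa9</word>'

# The same word saved as ISO-8859-1 (single byte E9), still with no encoding named.
# Read as UTF-8, that lone byte is malformed, so the parser gives up.
latin1_doc = b'<?xml version="1.0"?><word>caf\xe9</word>'

if __name__ == "__main__":
    print(repr(text_of(utf8_doc)))
    print(repr(text_of(latin1_doc)))
```

The second document is the classic way to meet an undecipherable parser error: the file looks fine in an editor, but its bytes do not match the encoding the parser assumed.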

The practical upshot of this is that if you wish to use a subset of Unicode -- US ASCII, Greek, or Cyrillic -- you cannot express that constraint in your XML documents. People can send you XML that you are expecting to be all 7-bit US ASCII but with some Gaelic in the middle. The results can range from benign through to severe. Doing the right thing with Unicode data affects everything from the programming language you use to the types of output renderings you can create. The very meaning of some concepts we are, perhaps, inured to in the West, such as "regular expressions" and "uppercase text", is significantly complicated in the face of fully blown Unicode.
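To make the "uppercase text" and "regular expressions" points concrete, here is a small Python sketch. The sample words are chosen for this example only, and other languages and regex engines draw these lines differently.

```python
import re

# "Uppercase text": one character can turn into two, so length is not preserved.
street = "stra\u00dfe"           # German word written with sharp s (U+00DF)
upper_street = street.upper()    # sharp s has no single-character uppercase here

# "Regular expressions": an ASCII-minded character class misses accented letters.
word = "f\u00e1ilte"             # Irish Gaelic "welcome", with a-acute (U+00E1)
ascii_match = re.fullmatch(r"[a-z]+", word)   # a-acute is outside a-z
unicode_match = re.fullmatch(r"\w+", word)    # \w is Unicode-aware for str patterns

if __name__ == "__main__":
    print(street, "->", upper_street)
    print("matches [a-z]+:", ascii_match is not None)
    print("matches \\w+:", unicode_match is not None)
```

Code that assumes uppercasing keeps string length, or that "a letter" means a-z, works on US ASCII input and quietly misbehaves on the Gaelic in the middle.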

So much for the engineering department. Let's head over to sales/marketing and find out what is going on over there about this Unicode issue:

Potential Customer: "Does your software support Japanese?"

Sales Person: "Oh yes, our software is fully Unicode compliant and, thus, we support Japanese."

Yikes! As anyone involved in internationalization will tell you, supporting Japanese requires much more than sticking a UTF-8 or a UTF-16 encoding into your XML and perhaps using a programming language, such as Java or Python, that can handle wide characters.

I support Unicode. Unicode is a good thing. However, the "all or nothing" way it must be used with XML, and the sales propaganda that Unicode support in a programming language magically solves internationalization issues, are not in the best interests of either the Unicode or the XML cause.

A piece of positive news on Unicode subsetting: I have just found out that it is possible to restrict the range of Unicode characters in a document using a W3C XML Schema lexical constraint. All I need to do now is to understand the other 99.995% of that spec!
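The sketch below is not schema syntax. It is a Python approximation of the kind of check such a constraint expresses: every character of a value must fall inside a named set of Unicode blocks. The block ranges follow the Unicode code charts, while the names BLOCKS and outside_subset are invented for this example and belong to no schema API.

```python
# Inclusive code point ranges of three Unicode blocks.
BLOCKS = {
    "basic-latin": (0x0000, 0x007F),  # the seven-bit US ASCII range
    "greek": (0x0370, 0x03FF),        # Greek and Coptic
    "cyrillic": (0x0400, 0x04FF),
}


def outside_subset(text, allowed):
    """Return (index, character) pairs for characters outside the allowed blocks."""
    ranges = [BLOCKS[name] for name in allowed]
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if not any(lo <= ord(ch) <= hi for lo, hi in ranges)
    ]


if __name__ == "__main__":
    # Pure ASCII passes; the a-acute in the Gaelic word is reported.
    print(outside_subset("hello", ["basic-latin"]))
    print(outside_subset("f\u00e1ilte", ["basic-latin"]))
    # Greek letters pass once the Greek block is allowed alongside Basic Latin.
    print(outside_subset("\u03b1\u03b2 ok", ["greek", "basic-latin"]))
```

A check like this can run on element content after parsing, which is where a receiver expecting 7-bit US ASCII would otherwise first discover the stray characters.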

 




Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.