Pet Peeves - Unicode

Making it possible to write software that will work in any language in any country, in any culture in the world is an extremely laudable goal. A goal that I wholeheartedly sign up to. We should make sure that the software arts make it possible. It is simultaneously the commercially sensible thing to do and the right thing to do from many human perspectives. A rare alignment of drivers indeed.

At this point, a voice somewhere in the room (is that a pointy-haired boss I see out of the corner of my eye?) proclaims the answer to be obvious. It goes like this: "Just use Unicode!".


Where to start? The trouble with explaining why this is not true - not even close to being true - is that it takes time and takes more knowledge of the technical ins-and-outs than most care to carry around in their heads.

Let us start by looking at the four main variables in the space: language, country, culture and technology.

Language. Let us take that one first. How hard can that be? Well, what list of languages do you want to target? Please don't say "all of them". There is no such thing as a definitive list of languages and, even if there were, the set of languages supported by Unicode changes from one version of Unicode to the next. Oh, and there are languages with unbounded sets of "characters", such as Chinese, which literally cannot be fully described in Unicode.

Let us dig out a globe and do a quick tour to find suitable languages to target. Western Europe? Not too bad. Some variations on ASCII but nothing too worrying. Let's head further east. Oops! Greek. Those guys still write left to right but they have a different character for everything! Oops! What is that I see as I fly over the Urals? Cyrillic! Another entirely different set of symbols. Life becomes more complicated the further we travel from the Western world.

Let us keep going... After all, sometimes problems get easier as you generalize them. Oh dear, when we hit the Middle East we hit languages that drive the wrong way on the page. Right to left! That changes essentially everything user-interface-related in most software. Further east we hit ideographs. The concept of a "letter" has just flown out of the occidental window. Not only that, but the text may be laid out top to bottom. What to do?
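To make the right-to-left point concrete, here is a minimal Python sketch. It assumes nothing beyond the standard `unicodedata` module; the helper name `direction_classes` is made up for illustration. It only reports the direction class that the Unicode character database assigns to each character. Actually laying out mixed-direction text is a much bigger job.

```python
import unicodedata


def direction_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# "L" is left-to-right, "R" is right-to-left, "AL" is right-to-left Arabic.
samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
}
for label, ch in samples.items():
    print(label, unicodedata.bidirectional(ch))
```

A single string can mix all three classes, which is roughly where the user-interface trouble starts.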

Let us back up the truck and start again. Let's look at the country variable this time. Maybe that is a better starting point? The British speak English, right? Well, yes and no. There is British English, which is rather different from, say, US English and different again from, say, [...] two letters to allocate codes to countries or three? Do we end up making political statements in our software as a result of the choice of which countries to include and which to leave out? Probably.

What about Wales? Well, the Welsh speak Welsh as well as English, and Welsh is very, very different from English. Then there is Scots Gaelic, Cornish, Ulster Scots... Maybe the British Isles was a complicated case? Let us try somewhere else, like Spain? Oops! Castilian, Catalan, Basque, Galician, Asturian, Aranese... Let us go further afield. How about India? Oops! Hindi, Bengali, Malayalam, Marathi, Telugu, Urdu - and those are just the ones with more than 30 million speakers...

Maybe the concept of cultures is a better place to start? What would our software need to do to operate in, say, Europe? Well, apart from selecting a subset of the myriad languages (France alone has 30+), we need to deal with the many "gotchas" of internationalization. Some examples:

- Some cultures have languages in which changing a string of text between uppercase and lowercase changes the length of the string (Croatia is an example). Think of what that can do for your database fields.

- Some cultures have languages in which the shape of a character changes depending on which characters precede it. Arabic and Thai are two examples.

- Some cultures habitually sort names by given name, then family name (Iceland is an example). Think of what that can do for your database reports. In fact, the simple-sounding concept of sorting strings turns out to be fiendishly complex. In Japanese, for example, the rules change based on the type of Japanese script you are using. Worse, your average SQL database and your average operating system almost certainly sort text differently.

- Some cultures use "," as a decimal point. Think of what that can do for your text processing code.

- Some cultures do not use zip codes - Ireland is an example.

- Some cultures have names for people in which the family name comes first and the given name comes second. Think of what that can do to your mail merge.

- Some cultures think of "27-29 Main St." in an address as being a range of buildings, whereas in others this would be "location 29 in building 27". Think of what that can do to your address processing.

- Some cultures think of red as being a joyful color - China is an example. Other cultures think of red as indicating danger. Think of what that can do for your GUIs.

- Some cultures find words like "failed" insulting. What are you going to do with all your "failed to connect to Internet" messages?
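A few of the gotchas above are easy to reproduce. The Python sketch below is only an illustration under stated assumptions: it uses the German "ß" as the case-change example (not the Croatian case mentioned in the list), the helper names `upper_changes_length` and `parse_decimal` are made up, and `parse_decimal` deliberately ignores grouping separators such as the "." in "1.234,56".

```python
def upper_changes_length(text):
    """True if uppercasing text changes its length in code points."""
    return len(text.upper()) != len(text)


def parse_decimal(text, decimal_sep=","):
    """Hypothetical helper: parse a number that uses decimal_sep as its decimal point.

    Grouping separators are not handled here.
    """
    return float(text.replace(decimal_sep, "."))


# 1. Case change alters length: "ß" uppercases to "SS".
print(upper_changes_length("stra\u00dfe"))   # "straße" becomes "STRASSE"

# 2. Decimal comma: float("3,14") raises ValueError, so the input needs converting.
print(parse_decimal("3,14"))

# 3. Naive sorting compares code points, not what a reader would expect:
#    capitals sort before lowercase, and accented letters sort after "z".
words = ["zebra", "Zulu", "apple", "\u00c9mile"]
naive = sorted(words)
print(naive)
```

None of this is a fix. It only shows how quickly "plain" string handling stops being plain. Real code would normally lean on a locale-aware library for parsing and collation.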

Whew. Now let's take a quick look at the technology aspects. Unicode is Unicode, right? Well, no. Unicode is, sadly, more of a state of mind, thanks to the way IT standards and IT competitive forces operate. The first version of Unicode dates from 1991 and the latest - version 5.1 - has a significantly expanded character repertoire. There have been a dozen editions in total. So, just use the latest one? Not so fast. We need to use versions of Unicode that are supported by the operating systems, programming environments and virtual machines we want to deploy on.
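One way to see the versioning problem for yourself is to ask the runtime which Unicode version it was built against. The Python sketch below uses the standard `unicodedata` module; the helper names `runtime_unicode_version` and `is_known` are made up for illustration, and the output will differ from one Python build to the next.

```python
import unicodedata


def runtime_unicode_version():
    """Return the Unicode database version of this Python build as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def is_known(ch):
    """True if this runtime's Unicode database has a name for ch.

    Note that control characters have no name either, so this is only a
    rough test for "assigned in the version I am running on".
    """
    return unicodedata.name(ch, None) is not None


print("Unicode version in this runtime:", unicodedata.unidata_version)
print("Knows LATIN CAPITAL LETTER A:", is_known("A"))
```

A character added in a later version of Unicode simply looks unknown to an older runtime, which is the gap the paragraph above describes.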

For example, Microsoft Windows XP has good support for Unicode 3.0. It will mostly do the right thing with Unicode 4.0 but it does not ship with fonts that cover all the code points in Unicode 4.0. Also, it does not support varying a character's shape depending on which characters precede it - as needed in some Arabic and Thai scenarios. Ubuntu, to take a Unix example, tries very hard to keep up as Unicode evolves, but the degree of Unicode support you will see in any given setup is very dependent on the mix of applications you have installed.

Then, of course, there is the whole area of programming languages. Some recent and trendy languages have less Unicode support than one might think. PHP and Ruby are two examples. Python has had Unicode support for some time but some significant changes related to Unicode handling are part of the Python 3k project.
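As a small illustration of why language-level support matters, here is a sketch in Python 3, where text (`str`) and encoded bytes (`bytes`) are separate types. The helper name `sizes` is made up for illustration. The point is simply that "how long is this string?" has several answers depending on what you count.

```python
def sizes(text):
    """Return the length of text in code points and in two common encodings."""
    return {
        "code points": len(text),
        "UTF-8 bytes": len(text.encode("utf-8")),
        "UTF-16 bytes": len(text.encode("utf-16-le")),
    }


# "café" written with an escape so the example does not depend on the
# encoding of this file.
print(sizes("caf\u00e9"))
```

A language that treats strings as plain bytes will give the byte count when you ask for the length, which is rarely what a user means by "four letters".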

Then there are the big VMs out there. .NET supports Unicode out of the box but favours a representational device known as UTF-16 and supports a complicated trick known as "surrogate pairs" to expand the character repertoire beyond what 16 bits would support. The JVM (version 5) is Unicode 4.0 based and does the UTF-16 thing, but the concept of a Char in the language has a checkered history. Prior to version 5, aspects of the Java language documentation [...] where Han Unification is controversial. Ruby appears to be planning to take a different tack in the form of m17n, in which Unicode becomes a special case of a Universal Character Set inside a Character Set Independent framework...
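The "surrogate pairs" trick mentioned above can be sketched in a few lines. The Python below follows the standard UTF-16 arithmetic for code points above U+FFFF; the function name `to_surrogate_pair` is made up for illustration, and this is a sketch of the encoding rule, not of how .NET or the JVM implement it internally.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1F600 does not fit in 16 bits, so UTF-16 stores it as two 16-bit units.
high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())
```

The practical consequence is that in a UTF-16 world one "character" may occupy two Char slots, so indexing and length calculations that assume one slot per character can quietly go wrong.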

Need I go on? "Just use Unicode"?

I wish it were that simple.
