Pet Peeves - Unicode

By Sean McGrath

Making it possible to write software that will work in any language in any country, in any culture in the world is an extremely laudable goal. A goal that I wholeheartedly sign up to. We should make sure that the software arts make it possible. It is simultaneously the commercially sensible thing to do and the right thing to do from many human perspectives. A rare alignment of drivers indeed.

At this point, a voice somewhere in the room (is that a pointy-haired boss I see out of the corner of my eye?) proclaims the answer to be obvious. It goes like this: "Just use Unicode!"

Sigh.

Where to start? The trouble with explaining why this is not true - not even close to being true - is that it takes time and takes more knowledge of the technical ins-and-outs than most care to carry around in their heads.

Let us start by looking at four main variables in the space: language, country, culture and technology. Language. Let us take that one first. How hard can that be? Well, what list of languages do you want to target? Please don't say "all of them". There is no such thing as a definitive list of languages and even if there were, the list of languages supported in Unicode changes across various incarnations of Unicode. Oh, and there are languages with unbounded sets of "characters" such as Chinese which literally cannot be fully described in Unicode.

Let us dig out a globe and do a quick tour to find suitable languages to target. Western Europe? Not too bad. Some variations on ASCII but nothing too worrying. Let's head further east. Oops! Greek. Those guys still write left to right but they have a different character for everything! Oops! What is that I see as I fly over the Urals? Cyrillic! Another entirely different set of symbols. Life is becoming more complicated the further we travel from the Western world.

Let us keep going... After all, sometimes problems get easier as you generalize them. Oh dear, when we hit the Middle East we hit languages that drive the wrong way on the page. Right to left! That changes essentially everything user-interface-related in most software. Further east we hit ideographs. The concept of a "letter" has just flown out of the occidental window. Not only that but the text may be laid out top to bottom. What to do?
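To make the right-to-left point concrete: every Unicode character carries a bidirectional class, and layout code has to consult it. The following is a minimal Python sketch using the standard `unicodedata` module; the helper name `direction` and its three coarse labels are illustrative assumptions, not part of any standard, and real layout needs the full bidirectional algorithm rather than a per-character lookup.

```python
import unicodedata


def direction(ch):
    """Classify one character by its Unicode bidirectional class.

    Hypothetical helper: collapses the many bidi classes into three
    coarse labels purely for illustration.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):  # Hebrew-style and Arabic-style letters
        return "right-to-left"
    if bidi == "L":
        return "left-to-right"
    return "neutral or other"  # digits, punctuation, spaces, ...


# Latin a, Hebrew alef, Arabic alef, and a European digit
for ch in ("a", "\u05d0", "\u0627", "1"):
    print(repr(ch), direction(ch))
```

Note that knowing the direction of each character only tells the software what it is dealing with. Mirroring the user interface is still a separate job.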

Let us back up the truck and start again. Let's look at the country variable this time. Maybe that is a better starting point? The British speak English, right? Well, yes and no. There is British English which is rather different from, say, US English and different again from, say, Hinglish. Hmmm. When does a language stop and a vernacular or a patois or a dialect start? How do these relate to these locale things that computers are fond of setting? Hmmm. The boundaries are very fuzzy and the standards keep changing. Do we need just two letters to allocate codes to countries or three? Do we end up making political statements in our software as a result of the choice of what countries to include and what ones not to include? Probably.

What about Wales? Well, the Welsh speak Welsh as well as English and Welsh is very, very different from English. Then there is Scottish Gaelic, Cornish, Ulster Scots... Maybe the British Isles was a complicated case? Let us try somewhere else, like Spain? Oops! Castilian, Catalan, Basque, Galician, Asturian, Aranese... Let us go further afield. How about India? Oops! Hindi, Bengali, Malayalam, Marathi, Telugu, Urdu - and that is just the ones with more than 30 million speakers...

Maybe the concept of cultures is a better place to start? What would our software need to do to operate in, say, Europe? Well, apart from selecting a subset of the myriad languages (France alone has 30+), we need to deal with the many "gotchas" of internationalization. Some examples:

- Some cultures have languages in which changing a string of text between uppercase and lowercase changes the length of the string (Croatia is an example). Think of what that can do for your database fields.

- Some cultures have languages in which the shape of a character changes depending on what characters precede it. Arabic and Thai are two examples.

- Some cultures habitually sort names by given name, then family name (Iceland is an example). Think of what that can do for your database reports. In fact, the simple-sounding concept of sorting strings turns out to be fiendishly complex. In Japanese, for example, the rules change based on the type of Japanese script you are using. Worse, your average SQL database and your average operating system almost certainly sort text differently.

- Some cultures use "," as a decimal point. Think of what that can do for your text processing code.

- Some cultures do not use zip codes - Ireland is an example.

- Some cultures have names for people in which the family name comes first and the given name comes second. Think of what that can do to your mail merge.

- Some cultures think of "27-29 Main St." in an address as being a range of buildings whereas in others, this would be "location 29 in building 27". Think of what that can do to your address processing.

- Some cultures think of red as being a joyful color - China is an example. Other cultures think of red as indicating danger. Think of what that can do for your GUIs.

- Some cultures find words like "failed" insulting. What are you going to do with all your "failed to connect to Internet" messages?
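A few of these gotchas are easy to reproduce. The sketch below is Python and makes some assumptions: it uses German "ß" as the case-mapping example because that one is certain to change length under Python's default, locale-independent case mapping, and `parse_decimal` is a hypothetical helper invented here rather than a library function.

```python
# 1. Changing case can change the length of a string.
street = "straße"
shouted = street.upper()          # "ß" has no single-character uppercase here
print(len(street), len(shouted))  # the second number is one larger

# 2. Sorting by raw code point is not alphabetical order for anyone.
words = ["Zebra", "apple", "Äpfel"]
print(sorted(words))              # uppercase first, accented letters last


# 3. A decimal comma defeats naive number parsing.
def parse_decimal(text, decimal_sep=",", group_sep="."):
    """Hypothetical helper: parse a number written with the given separators."""
    normalized = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(normalized)


print(parse_decimal("1.234,56"))  # read as one thousand two hundred thirty-four point five six
```

In real code the separators and the sort order would come from locale data rather than from hard-coded defaults, which is exactly where the "which locale?" question comes back to bite.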

Whew. Now let's take a quick look at the technology aspects. Unicode is Unicode, right? Well, no. Unicode is, sadly, more of a state of mind thanks to the way IT standards and IT competitive forces operate. The first version of Unicode dates from 1991 and the latest - version 5.1 - has a significantly expanded character repertoire. There have been a dozen editions in total. So, just use the latest one? Not so fast. We need to use versions of Unicode that are supported by the operating systems, programming environments and virtual machines we want to deploy on.

For example, Microsoft Windows XP has good support for Unicode 3.0. It will mostly do the right thing with Unicode 4.0 but it does not ship with fonts that cover all the code points in Unicode 4.0. Also, it does not support varying a character's shape depending on what characters precede it - as needed in some Arabic and Thai scenarios. Ubuntu, to take a Unix example, tries very hard to keep up as Unicode evolves but the degree of Unicode support you will see in any given setup is very dependent on the mix of applications you have installed.

Then of course, there is the whole area of programming languages. Some recent and trendy languages have less Unicode support than one might think. PHP and Ruby are two examples. Python has had Unicode support for some time but some significant changes related to Unicode handling are part of the Python 3k project.
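One way to see the "which Unicode?" problem for yourself is to ask a runtime which version of the character database it was built with. Here is a small Python sketch using the standard `unicodedata` module; the helper names `unicode_version` and `is_assigned` are illustrative assumptions. The answer depends on the interpreter you run it on, which is rather the point.

```python
import sys
import unicodedata


def unicode_version():
    """Return this runtime's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def is_assigned(ch):
    """True if this runtime's database knows the code point.

    Category "Cn" means "not assigned" as far as this runtime is concerned.
    """
    return unicodedata.category(ch) != "Cn"


print("Python", sys.version_info[:2], "implements Unicode", unicodedata.unidata_version)
print(is_assigned("A"))       # a character every version knows
print(is_assigned("\uffff"))  # a permanent noncharacter, never assigned
```

A character added in a newer Unicode version will report as unassigned on an older runtime, so two machines can disagree about whether the same text is even valid.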

Then there are the big VMs out there. .NET supports Unicode out of the box but favours a representational device known as UTF-16 and supports a complicated trick known as "surrogate pairs" to expand the character repertoire beyond what 16 bits would support. The JVM (version 5) is Unicode 4.0 based and does the UTF-16 thing, but the concept of a char in the language has a checkered history. Prior to version 5, aspects of the Java language documentation were rather misleading on Unicode.
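The surrogate pair trick is easier to believe once you see the arithmetic. The following Python sketch takes one character from outside the 16-bit range and shows that UTF-16 needs two code units to hold it. The helper names `utf16_code_units` and `surrogate_pair` are illustrative assumptions; the bit manipulation follows the standard UTF-16 encoding rule.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch):
    """Split one code point above U+FFFF into its high and low surrogates."""
    offset = ord(ch) - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

print(len(clef))                # code points, as Python 3 counts them
print(utf16_code_units(clef))   # code units, as a UTF-16 based VM counts them
print([hex(unit) for unit in surrogate_pair(clef)])
```

Any code that equates "length" with "number of 16-bit units" will count this single character as two, which is how field-length validation quietly goes wrong.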

Oh and finally. As if all this was not complicated enough. Not everyone likes where Unicode is going. Han Unification is controversial. Ruby appears to be planning to take a different tack in the form of m17n in which Unicode becomes a special case of a Universal Character Set inside a Character Set Independent framework...

Need I go on? "Just use Unicode"?

I wish it were that simple.

7 comments

    Anonymous 3 years ago
    If the point of the article is indeed that Unicode doesn't handle all aspects of internationalization and localization, then that point got lost in the details of the article. The details of the article focus on particular issues that Unicode does address and provide misinformation about Unicode's inadequacies in doing so. For example, there are "controversies" over Unicode support for Han characters, but it is quite slanted to insist that Unicode falls short by saying "Oh, and there are languages with unbounded sets of "characters" such as Chinese which literally cannot be fully described in Unicode." This plays into a propagandist-fueled controversy that doesn't really exist. Are there actual characters in real software you can point to that Unicode cannot accommodate? Unicode has driven the process of logographic/ideographic character assignment and now has provision for over 140,000 code points for ideographs alone. Combine that with the 256 variation selectors and that provides for over 33 million potential ideograph variations. While there are legitimate issues with Han unification (especially since with the deprecation of language tags it now fails to support Unicode's own goal of multilingual document support), the actual number of available code points is not really a practical issue.
    Implementations of Unicode certainly have some holes, but it is an unimportant implementation detail whether an implementation uses UTF-16 as its underlying representation of characters or uses Ruby's m17n approach. The one minor practical implication of some UTF-16 based implementations is that they might errantly miscount characters in a validation context where the number of characters should not exceed n. However, with composite characters, such character/grapheme cluster counts require special attention regardless of the underlying data structure.
    On the issue of capitalization, the treatment of capitalization is one of the problems Unicode solves (not a problem with Unicode). Unicode also solves it with some flexibility for implementers. For example, while the length of a string can change with Unicode's full case folding algorithms (such as in German where "SS" becomes "ß"), this need not even be handled in that way. For example, the "ß" can be treated as a ligature using an OpenType font so that "SS" still becomes "ss" but is displayed with the ligature "ß".
    Finally, the issue of the Turkish dotted i is not really a unique feature of Turkish so much as a consequence of the peculiar encoding of Turkish in Unicode. Because the Turkish dotless and dotted i are treated as the same characters as similar characters already encoded in Unicode, a special exception is needed for Turkish and no other language. It would be difficult to break from this tradition now, but if Turkish were simply assigned its own dotted and dotless "i" (in both lower and upper case variants), then Turkish would not require this exceptional treatment. Such a solution would create confusion, though, since Turkish users would need to be aware of the need to enter an ASCII i and an ASCII I as distinct from the Turkish characters that look the same, for example when entering a non-Turkish URL in a web browser. However, many Turkish computer users are likely already aware of these subtle differences (more so than those of us outside Turkey).
    So yes, internationalization and localization require many things beyond Unicode. However, the swipes at Unicode in this article are largely unfounded.
    Sean McGrath 3 years ago
    Coyote,
    Thanks for the detailed clarification. I cannot remember where I found the ref to Croatian string size changing but it was obviously wrong. My bad.
    Thanks again,
    Sean
    Anonymous 3 years ago
      "- Some cultures have languages in which changing a string of text between uppercase and lowercase changes the length of the string (Croatia is an example)."
    I don't know where you heard this but it is not true. The only problem with Croatian is sorting. We have three double-letter pairs which, when sorting, should be considered as single letters:
    - "lj" pair comes after all other pairs starting with "l"
    - "nj" pair comes after all other pairs starting with "n"
    - "dž" pair (that's "d" and "z" with a caron or inverted circumflex) comes after all other pairs starting with "d", but since "ž" is the last in the alphabet then it is naturally so...
    However, capitalization does not work on pairs but rather on the single letters. If only the first letter of the word should be capitalized then:
    dž -> Dž
    lj -> Lj
    nj -> Nj
    If complete words should be capitalized then:
    dž -> DŽ
    lj -> LJ
    nj -> NJ
    So there is no changing of the string size.
    In accordance with sorting, words starting with those pairs of letters are considered separately (for example in dictionaries) from other words starting with only the first letter from the pairs listed above. That's because those three pairs actually represent specific phonemes and therefore are treated as single consonants. Writing in Croatian is almost totally phonemic. Each letter is exactly one phoneme (sound). Therefore it is very easy for children to learn to read, although proper writing is a bit more difficult because of many local dialects which children at early ages usually mix with standard language. Reading is also relatively easy for strangers, but they often miss the proper accent, especially because accentuation is seldom written in Croatian.
    Just as a side note - the Serbian Cyrillic alphabet has distinct letters for the above three pairs (dž, lj, nj). However, when Serbian is written using the Latin alphabet then it uses the same pairing as Croatian.
    Anonymous 3 years ago
    Great article. I want to give an example. Most Java applications, IDEs and application servers fail to run on Turkish locale machines and Unicode can do nothing about it. There is a very simple reason for that catastrophic problem. The Turkish language has a unique characteristic. Unlike other languages:
    i -uppercase-> İ (not I)
    I -lowercase-> ı (not i)
    Language Code of Locale -- Lower Case -- Upper Case -- Description
    tr (Turkish) -- u0069 -- u0130 -- small letter i -> capital letter I with dot above
    tr (Turkish) -- u0131 -- u0049 -- small letter dotless i -> capital letter I
    As you can see, there are 4 possibilities in Turkish: ı (small letter i without a dot) - I - i - İ (capital letter I with a dot above).
    So when there is a character conversion in application code (strings that are invisible to the user), applications fail. For example, id becomes İD, interceptor becomes İNTERCEPTOR and exceptions occur. This is happening because the toLowerCase and toUpperCase methods don't enforce a locale argument and most developers (who are unaware of this issue and expect that the uppercase of i is the same everywhere in the world) don't use these methods with locale arguments. If you want a Java program to run on Turkish locale computers, you must use these two methods with the English locale (since Java code is always English, this is not a problem) for user-invisible strings.
    Anonymous 3 years ago
    1. Unicode is exactly the answer for many of the things you mention: it displays Cyrillic, Greek, Chinese, etc. characters just fine.
    2. Your GUI system should know about directions of text. Try to run Gnome in Arabic, and see how the entire user interface seamlessly works.
    3. A lot of the cultural/language specific items you mention should be resolved by the translators. What would you like to do, use Google Translate to translate your whole app? You need to have your application strings translated by professionals. Read about how the gettext library works.
    4. Read about iconv(), it makes switching between codesets much more straightforward.
    5. Printing localized numbers should be easy. I know glibc has done that correctly for at least 3-4 years.
    Yes, there are quirks that are hard to get right with localization. But, if you use the proper tools and libraries, not localizing your software is a sign of your lack of professionalism.
    P.S. Being a native speaker of Bosnian/Croatian/Serbian, this is the first time I hear about "changing a string of text between uppercase and lowercase changes the length of the string". AFAIK, that's not true for Croatian, but might be true of German ß.
    Sean McGrath 3 years ago
    Hans,
    "I'm assuming the author is actually aware that unicode doesn't govern all aspects of localization -- maybe that's the point of the article..."
    Exactly so. The core point of the article is that many who say "just use Unicode" do not know that, which is unfortunate.
    regards,
    Sean
    Anonymous 3 years ago
    Yes, it's really too bad that not everyone in the world speaks the same language, writes with the same script -- heck, it'd be nice if everyone just ate burgers and pizza too.
    It is pretty tedious to read someone whine about how difficult and complicated it is to write localized software. Yeah, it's harder to write localized software than non-localized. It's also harder to write software that has any sort of UI than software that doesn't. No matter how hard we try to change this, people tend to want software to work with them instead of vice versa.
    Finally, there seems to be some implicit confusion here between "unicode" and localization. I'm assuming the author is actually aware that unicode doesn't govern all aspects of localization -- maybe that's the point of the article (in which case the title could probably be a little less misleading).
