We previously reported on the ability of several languages to process Unicode correctly. Recall that Unicode is the current standard for encoding most human written languages. This installment of Regular Expressions is a tutorial for English-speaking developers on how to get started with these capabilities.
There's a wealth of available literature on computer representations of human languages. The Resources below point to leading sites that will satisfy even the largest appetites for details on this subject. Our method today is to simplify this abundance down to a few steps that are sure to bring a Unicode newcomer quick successes.
In the United States, computer users conventionally use a keyboard that corresponds closely or exactly to the ASCII encoding chart. That chart does not include any accented characters. Even in Western Europe, a region so culturally close to the United States, keyboards are typically localized to facilitate writing with richer character sets. Nearly every language used in Europe other than English requires diacritics for correct spelling.
Most modern desktop computers are delivered with the ability to display at least the languages of Western Europe. The first problem for a US user is just to enter accented characters. Netscape, especially during the heyday of its browser, did a commendable job of documenting this information. If you're sitting at a Windows desktop, for example, Netscape is right to recommend that you use the ALT-integers for character entry. Suppose you want to write:
déjà
With a keyboard from Western Europe, you can probably do this directly. In the United States, though, you'll likely need to press these keys, in succession:
- d
- Set NumLock. While holding down ALT, pick the digits 0, 2, 3, and 3 from the numeric keypad. This sequence is often abbreviated as \0233.
- j
- Hold ALT down again to type \0224.
At this point, you should see déjà.
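To see why those particular digits work, it can help to look at the numbers themselves. The following is a minimal Python 3 sketch, assuming that the ALT codes select values from the Windows-1252 code page (typical for a US Windows desktop, but an assumption on our part). The helper name from_alt_code is ours, invented for illustration.

```python
def from_alt_code(value: int) -> str:
    """Decode one code value, assumed to be Windows-1252, into a string."""
    return bytes([value]).decode("cp1252")

# ALT+0233 is e-acute and ALT+0224 is a-grave, as in the steps above.
word = "d" + from_alt_code(233) + "j" + from_alt_code(224)
print(word)

# For these two characters the code page value equals the Unicode code point.
same_as_unicode = from_alt_code(233) == chr(233) and from_alt_code(224) == chr(224)
```

For the accented letters of Western Europe, the code page values and the Unicode code points coincide, which is why the same decimal 224 appears again when we turn to programming languages.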
The Resources explain how to do similar operations for different Unix varieties, Mac OS, and BeOS. Write us if you have a desktop that doesn't appear in our references.
How computer languages see character constants
That was a big step! Now you can read and write languages used by almost half the planet's population. The good news is that the scripting languages you're most likely to use are ready before you are. Perl, Python, and Tcl, for example, are fully Unicode-capable internally. While there's still ambiguity and occasionally even error in moving data between its Unicode encoding and what you see on a screen or on paper, the maintainers of these languages have made them quite reliable for most operations.
Tcl was the first of these languages to support Unicode internally. There, the grave-a we learned to type as \0224 can appear as any of the following:
# Let the keyboard figure out the value.
set grave-a à
# Hexadecimal E0 is decimal 224.
set grave-a \xe0
# Hexadecimal 00E0 is decimal 224. "\u"
# means we can enter up to four hexadecimal
# digits (16 bits) of Unicode value.
set grave-a \u00e0
# This is the most direct way to enter a
# value from the Netscape "Accent Input"
# table.
set grave-a [format %c 224]
Perl 5.6 recognizes similar syntaxes, including:
# Double quotes are required; single quotes
# would leave \xE0 as four literal characters.
$grave_a = "\xE0";
Python offers even more flexibility in converting between different representations. See Marc-André Lemburg's Python Unicode Tutorial for more on the subject. Note that in this column we occasionally blur distinctions between Unicode and related encodings, such as UTF-16. Our aim in this first pass is to simplify explanations, not to make them rigorous.
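As a rough illustration of the Python side, here is a minimal sketch that mirrors the Tcl forms above. It assumes Python 3, where every string is Unicode; the variable names are ours. It also shows one place where the distinction between Unicode and its encodings becomes visible: the same character turns into different bytes depending on the encoding chosen.

```python
import unicodedata

# A typed literal, normalized to NFC in case the editor stored it
# as a plain "a" followed by a combining grave accent.
literal = unicodedata.normalize("NFC", "à")

hex_escape = "\xe0"          # hexadecimal E0 is decimal 224
unicode_escape = "\u00e0"    # four hexadecimal digits
from_decimal = chr(224)      # straight from the decimal value
by_name = "\N{LATIN SMALL LETTER A WITH GRAVE}"

forms = [literal, hex_escape, unicode_escape, from_decimal, by_name]
all_same = all(form == literal for form in forms)

# One character, three different byte sequences.
utf8_bytes = literal.encode("utf-8")       # C3 A0
utf16_bytes = literal.encode("utf-16-be")  # 00 E0
latin1_bytes = literal.encode("latin-1")   # E0

print(all_same, utf8_bytes.hex(), utf16_bytes.hex(), latin1_bytes.hex())
```

All five spellings produce the same one-character string; only when the string is encoded for storage or transmission do the representations diverge.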
Moving forward
Now you can read, write, and compute with the symbols and words of Western Europe and the New World. What's next?
One direction you can go is to other alphabets and writing systems. Some desktops are distributed with at least one Greek, Cyrillic, Arabic, or Hebrew font. If you work much with these, though, or with Asian languages, you'll need to augment your host's built-in complement of fonts. This becomes a particular issue for Chinese, Japanese, and other languages with ideographic scripts. Fonts for these languages are typically at least an order of magnitude larger than Occidental ones. For example, Netscape recommends the popular Bitstream Cyberbit font. Although the version available for no-charge licensing isn't complete in its coverage of Chinese-Japanese-Korean (CJK), it fills a zip file of more than 6 megabytes.
Figure: From a Suchenwirth project to build a Unicode-savvy editor
Along with the advice available from Netscape, application developer Richard Suchenwirth has documented his experiments with Unicode and related encodings. He makes all of this freely available through the public Tcl Wiki. One place to start is his Practical Guide to Choosing Fonts.
With a nice collection of fonts installed, you can follow Suchenwirth's lead and experiment with A Little Unicode Editor, synthesize your own keyboard widget, or begin to learn A Simple Arabic Renderer.
Note the simultaneous use of several different human languages with their appropriate glyphs.
Toolkit choices
Although Suchenwirth codes his examples in Tcl/Tk, most can be easily translated to Tkinter or Perl/Tk. Moving to other GUI toolkits presents more of a challenge: as our recent series on toolkits explained, many popular toolkits do not fully support Unicode. There are even a few outstanding inconsistencies in Tk display of Unix fonts. In principle, Java support of Unicode is particularly complete; however, Java's Swing GUI toolkit itself is poorly standardized on different platforms, so we prefer Tk or Qt for most of our Unicode-oriented work.
This is only the beginning of a working knowledge of internationalization and localization. In fact, many aspects of Unicode remain unsettled and even controversial. Parts of Unicode are still in beta, and native writers have rejected a few others. However, a little practice with the topics presented here will at least allow you to reproduce on your own desktop the work of the experts with human languages other than English. Remember: scripting languages' interactivity and responsiveness make them great vehicles for experimental learning.
Resources