Previously, we have reported on the ability of several languages to
process Unicode correctly. Recall that Unicode is the current standard
for encoding most human written languages. This installment of Regular
Expressions is a tutorial for English-speaking developers on how to get
started with these capabilities.
There's a wealth of available literature on computer representations of
human languages. The Resources below point to leading sites that will
satisfy even the largest appetites for details on this subject. Our
method today is to simplify this abundance down to a few steps that are
sure to bring a Unicode newcomer quick successes.
In the United States, computer users conventionally use a keyboard that
corresponds closely or exactly to the ASCII encoding chart. That
alphabet includes no accented characters. Even in Western
Europe, a region so culturally close to the United States, keyboards
are typically localized to facilitate writing with richer character
sets. Nearly every European language other than English requires
diacritics for correct spelling.
Most modern desktop computers are delivered with the ability to display
at least the languages of Western Europe. The first problem for a US
user is just to enter accented characters. Netscape, especially during
the heyday of its browser, did a commendable job of documenting this
information. If you're sitting at a Windows desktop, for example,
Netscape rightly recommends the ALT-plus-digits method for character
entry. Suppose you want to write déjà.
With a keyboard from Western Europe, you can probably do this directly.
In the United States, though, you'll likely need to press these keys:
1. Type the letter d.
2. Set NumLock. While holding down ALT, pick the digits 0, 2, 3,
   and 3 from the numeric keypad. This sequence is often
   abbreviated as \0233.
3. Type the letter j.
4. Hold ALT down again to type \0224.
At this point, you should see déjà.
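As a cross-check (a Python sketch of our own, not part of the ALT-key procedure), the decimal codes in those steps name Unicode code points directly:

```python
# ALT-0233 and ALT-0224 name decimal Unicode code points:
# 233 is é (U+00E9) and 224 is à (U+00E0).
assert chr(233) == "é"
assert chr(224) == "à"

# Assembling "déjà" character by character, as the ALT-key steps do:
word = "d" + chr(233) + "j" + chr(224)
print(word)                    # déjà
print([ord(c) for c in word])  # [100, 233, 106, 224]
```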
The Resources explain how to do similar operations for different Unix
varieties, Mac OS, and BeOS. Write us if you have a desktop that
doesn't appear in our references.
How computer languages see character constants
That was a big step! Now you can read and write languages used by
almost half the planet's population. The good news is that the
scripting languages you're most likely to use are ready before you are.
Perl, Python, and Tcl, for example, are fully Unicode-capable
internally. While there's still ambiguity and occasionally even error
in moving data between a program's internal Unicode encoding and what
you see on a
screen or on paper, the maintainers of these languages have made them
quite reliable for most operations.
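In a modern Python, for example, accented characters behave as ordinary string data. This minimal sketch is ours, not from the column; the variable name is arbitrary:

```python
# Strings are Unicode throughout, so accented characters count
# and transform like any other character.
s = "déjà"
print(len(s))     # 4 characters, not a byte count
print(s.upper())  # DÉJÀ
print("é" in s)   # True
```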
Tcl was the first of these languages to support Unicode internally.
There, the grave-a we learned to type as \0224 can appear in any of the
following forms:
# Let the keyboard figure out the value.
set grave-a à
# Hexadecimal E0 is decimal 224.
set grave-a \xe0
# Hexadecimal 00E0 is decimal 224. "\u"
# means we can enter a full four hex
# digits (two bytes) of Unicode value.
set grave-a \u00e0
# This is the most direct way to enter a
# value from the Netscape "Accent Input"
set grave-a [format %c 224]
Perl 5.6 recognizes similar syntaxes, including:
$grave_a = "\xE0";
Python offers even more flexibility in converting between different
representations. See Marc-André Lemburg's Python Unicode Tutorial for
more on the subject. Note that in this column we occasionally blur
distinctions between Unicode and related encodings, such as UTF-16. Our
aim in this first pass is to simplify explanations, not to make them
rigorous.
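For comparison with the Tcl forms above, here is a Python sketch of our own (the grave_a name is ours) showing several equivalent spellings of the same character, plus conversions to concrete encodings:

```python
# Equivalent ways to produce à (U+00E0), paralleling the Tcl examples.
grave_a = "\xe0"            # hex escape, like Tcl's \xe0
assert grave_a == "\u00e0"  # four-hex-digit escape, like Tcl's \u00e0
assert grave_a == chr(224)  # from the decimal value, like [format %c 224]

# Converting the abstract code point to concrete encodings:
print(grave_a.encode("utf-8"))      # b'\xc3\xa0'
print(grave_a.encode("utf-16-be"))  # b'\x00\xe0'
```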
Now you can read, write, and compute with the symbols and words of
Western Europe and the New World. What's next?
One direction you can go is to other alphabets and writing systems.
Some desktops are distributed with at least one Greek, Cyrillic,
Arabic, or Hebrew font. If you work much with these, though, or with
Asian languages, you'll need to augment your host's built-in complement
of fonts. This becomes a particular issue for Chinese, Japanese, and
other languages with ideographic scripts. Fonts for these languages are
typically at least an order of magnitude larger than Occidental ones.
For example, Netscape recommends the popular Bitstream Cyberbit font.
Although the version available for no-charge licensing isn't complete
in its coverage of Chinese-Japanese-Korean (CJK), it fills a zip file
of more than 6 megabytes.
Along with the advice available from Netscape, application developer
Richard Suchenwirth has documented his experiments with Unicode and
related encodings. He makes all of this freely available through the
public Tcl Wiki. One place to start is his Practical Guide to Choosing
Fonts.
With a nice collection of fonts installed, you can follow Suchenwirth's
lead and experiment with A Little Unicode Editor, synthesize your own
keyboard widget, or begin to learn A Simple Arabic Renderer.
Note the simultaneous use of several different human languages with
their appropriate glyphs.
Although Suchenwirth codes his examples in Tcl/Tk, most can be easily
translated to Tkinter or Perl/Tk. Moving to other GUI toolkits presents
more of a challenge: as our recent series on toolkits explained, many
popular toolkits do not fully support Unicode. There are even a few
outstanding inconsistencies in Tk display of Unix fonts. In principle,
Java support of Unicode is particularly complete; however, Java's Swing
GUI toolkit itself is poorly standardized on different platforms, so we
prefer Tk or Qt for most of our Unicode-oriented work.
This is only the beginning of a working knowledge of
internationalization and localization. In fact, many aspects of Unicode
remain unsettled and even controversial. Parts of Unicode are still in
beta, and native writers have rejected a few others. However, a little
practice with the topics presented here will at least allow you to
reproduce on your own desktop the work of the experts with human
languages other than English. Remember: scripting languages'
interactivity and responsiveness make them great vehicles for this kind
of experimentation.