October 04, 2001, 12:00 AM
Previously, we have reported on the ability of several languages to
process Unicode correctly. Recall that Unicode is the current standard
for encoding most human written languages. This installment of Regular
Expressions is a tutorial for English-speaking developers on how to get
started with these capabilities.
There's a wealth of available literature on computer representations of
human languages. The Resources below point to leading sites that will
satisfy even the largest appetites for details on this subject. Our
aim today is to distill this abundance into a few steps that should
bring a Unicode newcomer quick success.
In the United States, computer users conventionally use a keyboard that
corresponds closely or exactly to the ASCII encoding chart. The
alphabet does not include any accented characters. Even in Western
Europe, a region so culturally close to the United States, keyboards
are typically localized to facilitate writing with richer character
sets. Every language used in Europe other than English requires
diacritics for correct spelling.
Most modern desktop computers are delivered with the ability to display
at least the languages of Western Europe. The first problem for a US
user is just to enter accented characters. Netscape, especially during
the heyday of its browser, did a commendable job of documenting this
information. If you're sitting at a Windows desktop, for example,
Netscape is right to recommend that you use the ALT-integers for
character entry. Suppose you want to write:
    déjà
With a keyboard from Western Europe, you can probably do this directly.
In the United States, though, you'll likely need to press these keys:
1. Type d.
2. Set NumLock. While holding down ALT, pick the digits 0, 2, 3,
   and 3 from the numeric keypad. This sequence is often
   abbreviated as \0233.
3. Type j.
4. Hold ALT down again to type \0224.
At this point, you should see déjà.
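To see why those two ALT codes yield the accented letters, it can help to treat each code as a single byte in the Windows code page. The following is a minimal Python 3 sketch, not something from the Netscape documentation: it assumes the desktop uses code page 1252 (typical for US and Western European Windows installations), and the helper name alt_code_to_char is purely illustrative.

```python
def alt_code_to_char(code: str, codepage: str = "cp1252") -> str:
    """Map a zero-prefixed ALT keypad code such as "0233" to its character.

    Assumption: the code names one byte in the given Windows code page.
    """
    return bytes([int(code)]).decode(codepage)


# Keystrokes for the example word: plain letters and ALT codes, in order.
keystrokes = ["d", "0233", "j", "0224"]
word = "".join(
    alt_code_to_char(k) if k.isdigit() else k
    for k in keystrokes
)

print(word)                         # déjà
print([hex(ord(c)) for c in word])  # ['0x64', '0xe9', '0x6a', '0xe0']
```

On a system configured for a different code page, the same ALT codes may produce different characters, which is why the code page is a parameter here rather than a constant.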
The Resources explain how to do similar operations for different Unix
varieties, Mac OS, and BeOS. Write us if you have a desktop that
doesn't appear in our references.
How computer languages see character constants
That was a big step! Now you can read and write languages used by
almost half the planet's population. The good news is that the
scripting languages you're most likely to use are ready before you are.
Perl, Python, and Tcl, for example, are fully Unicode-capable
internally. While there's still ambiguity and occasionally even error
in moving data between its Unicode encoding and what you see on a
screen or on paper, the maintainers of these languages have made them
quite reliable for most operations.
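As a rough illustration of the difference between a language's internal Unicode and the encodings used outside it, here is a short sketch in Python 3. The syntax is an assumption of a current interpreter rather than the releases available when this column appeared, and the variable names are only illustrative.

```python
# The same four-character word, written three ways as a string constant.
literal = "déjà"
escaped = "d\u00e9j\u00e0"  # Unicode escapes for é and à
named = "d\N{LATIN SMALL LETTER E WITH ACUTE}j\N{LATIN SMALL LETTER A WITH GRAVE}"
assert literal == escaped == named

# Internally the string is four code points, whatever the external encoding.
print(len(literal))                        # 4

# Externally, the byte count depends on the encoding chosen.
utf8_bytes = literal.encode("utf-8")
latin1_bytes = literal.encode("latin-1")
print(len(utf8_bytes), len(latin1_bytes))  # 6 4

# Decoding with the wrong encoding is one source of display errors:
# the bytes are intact, but the characters that come out are not.
garbled = utf8_bytes.decode("latin-1")
print(garbled == literal)                      # False
print(utf8_bytes.decode("utf-8") == literal)   # True
```

The point of the sketch is that the string itself is unambiguous inside the language; ambiguity enters only when it is converted to bytes for a file, terminal, or printer and the reader guesses the wrong encoding.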
Tcl was the first of these languages to support Unicode internally.