More Than Just English

Previously, we have reported on the ability of several languages to

process Unicode correctly. Recall that Unicode is the current standard

for encoding most human written languages. This installment of Regular

Expressions is a tutorial for English-speaking developers on how to get

started with these capabilities.

There's a wealth of available literature on computer representations of

human languages. The Resources below point to leading sites that will

satisfy even the largest appetites for details on this subject. Our

method today is to simplify this abundance down to a few steps that are

sure to bring a Unicode newcomer quick successes.

In the United States, computer users conventionally use a keyboard that

corresponds closely or exactly to the ASCII encoding chart. The

alphabet does not include any accented characters. Even in Western

Europe, a region so culturally close to the United States, keyboards

are typically localized to facilitate writing with richer character

sets. Nearly every language used in Europe other than English requires

diacritics for correct spelling.

Most modern desktop computers are delivered with the ability to display

at least the languages of Western Europe. The first problem for a US

user is just to enter accented characters. Netscape, especially during

the heyday of its browser, did a commendable job of documenting this

information. If you're sitting at a Windows desktop, for example,

Netscape is right to recommend that you use the ALT-integers for

character entry. Suppose you want to write:

déjà


With a keyboard from Western Europe, you can probably do this directly.

In the United States, though, you'll likely need to press these keys,

in succession:

1. d

2. Set NumLock. While holding down ALT, type the digits 0, 2, 3,

and 3 from the numeric keypad. This sequence is often

abbreviated as \0233.

3. j

4. Hold ALT down again to type \0224.

At this point, you should see déjà.
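
The ALT codes in the steps above are simply decimal character codes. As a minimal sketch in Python, and assuming that the ALT codes for these two letters coincide with their Unicode code points (true for 233 and 224, though not for every ALT code), the same word can be assembled from the numbers. The constant names are our own, chosen for illustration.

```python
# Decimal codes typed on the numeric keypad in the steps above.
E_ACUTE = 233   # ALT+0233
A_GRAVE = 224   # ALT+0224

# chr() turns a code point into a one-character string.
word = "d" + chr(E_ACUTE) + "j" + chr(A_GRAVE)

print(word)                      # the accented word from the steps above
print([ord(c) for c in word])    # the code point of each character
```

Running this in an interactive session is a quick way to check that your terminal can display the accented letters at all.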

The Resources explain how to do similar operations for different Unix

varieties, Mac OS, and BeOS. Write us if you have a desktop that

doesn't appear in our references.

How computer languages see character constants

That was a big step! Now you can read and write languages used by

almost half the planet's population. The good news is that the

scripting languages you're most likely to use are ready before you are.

Perl, Python, and Tcl, for example, are fully Unicode-capable

internally. While there's still ambiguity and occasionally even error

in moving data between its Unicode encoding and what you see on a

screen or on paper, the maintainers of these languages have made them

quite reliable for most operations.
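
To make the ambiguity concrete, here is a small Python sketch of one character moving between its internal form and three byte encodings. The choice of encodings is ours, for illustration; the point is only that the same character becomes different bytes, and that decoding with the wrong encoding produces wrong text without raising any error.

```python
a_grave = "\u00e0"  # one character, code point 224

# The same character becomes different byte sequences in different encodings.
latin1 = a_grave.encode("latin-1")     # one byte
utf8 = a_grave.encode("utf-8")         # two bytes
utf16 = a_grave.encode("utf-16-be")    # two bytes, big-endian, no byte-order mark

# Decoding with the matching codec recovers the character.
assert utf8.decode("utf-8") == a_grave

# Decoding with the wrong codec "succeeds" but yields the wrong text.
garbled = utf8.decode("latin-1")       # two characters instead of one

print(latin1, utf8, utf16, repr(garbled))
```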

Tcl was the first of these languages to support Unicode internally.

There, the grave-a we learned to type as \0224 can appear as any of the

following forms:


# Let the keyboard figure out the value.

set grave-a à

# Hexadecimal E0 is decimal 224.

set grave-a \xe0

# Hexadecimal 00E0 is decimal 224. "\u"

# means we can enter up to four hexadecimal

# digits of Unicode value.

set grave-a \u00e0

# This is the most direct way to enter a

# value from the Netscape "Accent Input"

# table.

set grave-a [format %c 224]

Perl 5.6 recognizes similar syntaxes, including:

# Double quotes are required: Perl does not expand \x inside single quotes.

$grave_a = "\xE0";
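
For comparison, the four Tcl forms above might be written in Python 3 roughly as follows. This is a sketch rather than a complete catalog of Python's syntax, and the variable names are ours.

```python
# Four spellings of the same one-character string, mirroring the Tcl forms.
typed = "à"             # typed directly; Python 3 reads source files as UTF-8 by default
hex_escape = "\xe0"     # hexadecimal E0 is decimal 224
u_escape = "\u00e0"     # \u takes exactly four hexadecimal digits in Python
from_number = chr(224)  # built from the decimal value

forms = [typed, hex_escape, u_escape, from_number]
print(all(f == typed for f in forms))
print(hex(ord(typed)))
```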

Python offers even more flexibility in converting between different

representations. See Marc-Andre Lemburg's Python Unicode Tutorial for

more on the subject. Note that in this column we occasionally blur

distinctions between Unicode and related encodings, such as UTF-16. Our

aim in this first pass is to simplify explanations, not to make them


Moving forward

Now you can read, write, and compute with the symbols and words of

Western Europe and the New World. What's next?

One direction you can go is to other alphabets and writing systems.

Some desktops are distributed with at least one Greek, Cyrillic,

Arabic, or Hebrew font. If you work much with these, though, or with

Asian languages, you'll need to augment your host's built-in complement

of fonts. This becomes a particular issue for Chinese, Japanese, and

other languages with ideographic scripts. Fonts for these languages are

typically at least an order of magnitude larger than Occidental ones.

For example, Netscape recommends the popular Bitstream Cyberbit font.

Although the version available for no-charge licensing isn't complete

in its coverage of Chinese-Japanese-Korean (CJK), it fills a zip file

of more than 6 megabytes.
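
Fonts only govern what you see; the characters themselves are identified by number whether or not a glyph is installed. As a rough Python sketch, with one sample letter per writing system picked by us purely for illustration, the standard unicodedata module can report what each code point is even on a host that cannot draw it.

```python
import unicodedata

# One sample letter from each writing system named above (illustrative picks).
samples = {
    "Greek": "\u03b1",
    "Cyrillic": "\u0436",
    "Arabic": "\u0628",
    "Hebrew": "\u05d0",
    "CJK": "\u4e2d",
}

for script, ch in samples.items():
    # unicodedata.name() consults the Unicode character database, not a font.
    print(f"{script:9} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

If the names print correctly but the characters appear as boxes or question marks when you display them, the missing piece is most likely a font rather than the language's Unicode support.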

Along with the advice available from Netscape, application developer

Richard Suchenwirth has documented his experiments with Unicode and

related encodings. He makes all of this freely available through the

public Tcl Wiki. One place to start is his Practical Guide to Choosing


With a nice collection of fonts installed, you can follow Suchenwirth's

lead and experiment with A Little Unicode Editor, synthesize your own

keyboard widget, or begin to study A Simple Arabic Renderer.

Note the simultaneous use of several different human languages with

their appropriate glyphs.

Toolkit choices

Although Suchenwirth codes his examples in Tcl/Tk, most can be easily

translated to Tkinter or Perl/Tk. Moving to other GUI toolkits presents

more of a challenge: as our recent series on toolkits explained, many

popular toolkits do not fully support Unicode. There are even a few

outstanding inconsistencies in Tk display of Unix fonts. In principle,

Java support of Unicode is particularly complete; however, Java's Swing

GUI toolkit itself is poorly standardized on different platforms, so we

prefer Tk or Qt for most of our Unicode-oriented work.
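
As a hedged illustration of what a Tkinter translation could look like, the sketch below shows a few sample strings in a Tk label. It is not one of Suchenwirth's examples: the sample strings, the font choice, and the function names are all our own assumptions, and what actually appears depends on the fonts installed on your host.

```python
# Hypothetical sample strings; any text your installed fonts cover will do.
SAMPLES = {
    "French": "d\u00e9j\u00e0",
    "Greek": "\u03b1\u03b2\u03b3",
    "Russian": "\u043c\u0438\u0440",
}


def label_text(samples):
    """Join the samples into one multi-line string for display."""
    return "\n".join(f"{name}: {text}" for name, text in samples.items())


def show(samples=SAMPLES):
    """Open a small Tk window showing the samples (needs a display)."""
    import tkinter as tk  # imported here so the rest runs without a GUI

    root = tk.Tk()
    label = tk.Label(root, text=label_text(samples), font=("Helvetica", 16))
    label.pack(padx=20, pady=20)
    root.mainloop()
```

Calling show() from an interactive session opens the window; label_text() can be tried on its own where no display is available.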

This is only the beginning of a working knowledge of

internationalization and localization. In fact, many aspects of Unicode

remain unsettled and even controversial. Parts of Unicode are still in

beta, and native writers have rejected a few others. However, a little

practice with the topics presented here will at least allow you to

reproduce on your own desktop the work of the experts with human

languages other than English. Remember: scripting languages'

interactivity and responsiveness make them great vehicles for

experimental learning.
