ITworld.com
More than just English

Unix Insider 3/1/01

A practical introduction to scripting the majority languages

Cameron Laird and Kathryn Soraiz, Unix Insider

Previously, we reported on the ability of several languages to process Unicode correctly. Recall that Unicode is the current standard for encoding most human written languages. This installment of Regular Expressions is a tutorial for English-speaking developers on how to get started with these capabilities.

There's a wealth of available literature on computer representations of human languages. The Resources below point to leading sites that will satisfy even the largest appetites for details on this subject. Our method today is to simplify this abundance down to a few steps that are sure to bring a Unicode newcomer quick successes.

In the United States, computer users conventionally use a keyboard that corresponds closely or exactly to the ASCII encoding chart. That character set does not include any accented characters. Even in Western Europe, a region so culturally close to the United States, keyboards are typically localized to facilitate writing with richer character sets. Nearly every language used in Europe other than English relies on diacritics for correct spelling.

Most modern desktop computers are delivered with the ability to display at least the languages of Western Europe. The first problem for a US user is just to enter accented characters. Netscape, especially during the heyday of its browser, did a commendable job of documenting this information. If you're sitting at a Windows desktop, for example, Netscape is right to recommend ALT codes for character entry: hold down ALT and type the character's decimal code on the numeric keypad. Suppose you want to write:

déjà

With a keyboard from Western Europe, you can probably do this directly. In the United States, though, you'll likely need to press these keys, in succession:

  1. d
  2. Set NumLock. While holding down ALT, type the digits 0, 2, 3, and 3 on the numeric keypad. This sequence is often abbreviated as \0233.
  3. j
  4. Hold ALT down again and type 0, 2, 2, and 4 on the numeric keypad (\0224).

At this point, you should see déjà.
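
The digits typed on the keypad are simply decimal character codes. As a small sketch in Python 3 (one of the scripting languages discussed below; the variable names are ours, chosen for illustration), the same word can be assembled from those numbers. This assumes a Western European Windows setup, where the codes for these two characters coincide with their Unicode values:

```python
# The ALT sequences \0233 and \0224 are decimal character codes.
# For these two characters the code equals the Unicode code point.
E_ACUTE = chr(233)   # e with acute accent
A_GRAVE = chr(224)   # a with grave accent

# Assemble the word in the same order as the keystrokes above.
word = "d" + E_ACUTE + "j" + A_GRAVE

# Show the decimal code of each character in the word.
print([ord(c) for c in word])

# Confirm that the result matches the word typed directly.
print(word == "déjà")
```

Reading the codes back with ord() is a quick way to check what a keystroke or a pasted string really contains.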

The Resources explain how to do similar operations for different Unix varieties, Mac OS, and BeOS. Write us if you have a desktop that doesn't appear in our references.

How computer languages see character constants
That was a big step! Now you can read and write languages used by almost half the planet's population. The good news is that the scripting languages you're most likely to use are ready before you are. Perl, Python, and Tcl, for example, are fully Unicode-capable internally. While there's still ambiguity, and occasionally even error, in moving data between a language's internal Unicode representation and what you see on a screen or on paper, the maintainers of these languages have made them quite reliable for most operations.

Tcl was the first of these languages to support Unicode internally. There, the grave-a we learned to type as \0224 can appear as any of the following:


      # Let the keyboard figure out the value.
      set grave_a à

      # Hexadecimal E0 is decimal 224.
      set grave_a \xe0

      # Hexadecimal 00E0 is decimal 224. "\u" takes up to
      #    four hexadecimal digits (16 bits) of
      #    Unicode value.
      set grave_a \u00e0

      # This is the most direct way to enter a
      #    value from the Netscape "Accent Input"
      #    table.
      set grave_a [format %c 224]
   

Perl 5.6 recognizes similar syntaxes, including:


      # Double quotes are required; single quotes would leave \xE0 as four literal characters.
      $grave_a = "\xE0";
   

Python offers even more flexibility in converting between different representations. See Marc-André Lemburg's Python Unicode Tutorial for more on the subject. Note that in this column we occasionally blur distinctions between Unicode and related encodings, such as UTF-16. Our aim in this first pass is to simplify explanations, not to make them rigorous.
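
As a rough Python 3 counterpart to the Tcl and Perl snippets above, the following sketch writes the same character several ways and then converts it to different byte encodings. It is our own illustration, not taken from Lemburg's tutorial, and the variable names are arbitrary; the Python releases current when this column appeared spelled Unicode literals with a u"..." prefix instead.

```python
# Four spellings of the same one-character string (U+00E0).
literal = "à"
hex_escape = "\xe0"
unicode_escape = "\u00e0"
from_number = chr(224)
assert literal == hex_escape == unicode_escape == from_number

# The abstract character is distinct from its byte encodings.
as_latin1 = literal.encode("latin-1")    # one byte: e0
as_utf8 = literal.encode("utf-8")        # two bytes: c3 a0
as_utf16 = literal.encode("utf-16-be")   # two bytes: 00 e0

for name, data in [("latin-1", as_latin1),
                   ("utf-8", as_utf8),
                   ("utf-16-be", as_utf16)]:
    print(name, data.hex(), len(data), "byte(s)")

# Decoding with the wrong encoding yields the wrong text:
# the two UTF-8 bytes come back as two unrelated characters.
garbled = as_utf8.decode("latin-1")
print(len(garbled))
```

The last step is one concrete form of the errors that can creep in when data moves between a program's internal Unicode and the outside world.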

Moving forward
Now you can read, write, and compute with the symbols and words of Western Europe and the New World. What's next?

One direction you can go is to other alphabets and writing systems. Some desktops are distributed with at least one Greek, Cyrillic, Arabic, or Hebrew font. If you work much with these, though, or with Asian languages, you'll need to augment your host's built-in complement of fonts. This becomes a particular issue for Chinese, Japanese, and other languages with ideographic scripts. Fonts for these languages are typically at least an order of magnitude larger than Occidental ones. For example, Netscape recommends the popular Bitstream Cyberbit font. Although the version available for no-charge licensing isn't complete in its coverage of Chinese-Japanese-Korean (CJK), it fills a zip file of more than 6 megabytes.
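
One way to check what a newly installed font covers is to prepare a short sample string with a character from each script of interest. The following Python 3 sketch is our own illustration; the SAMPLES table and the script_of helper are hypothetical names. It uses the standard unicodedata module to label each character, taking the first word of the character's Unicode name as a rough indicator of its script rather than a formal script property:

```python
import unicodedata

# One sample character per script, written as escapes so the source stays ASCII.
SAMPLES = {
    "Greek": "\u03b1",     # alpha
    "Cyrillic": "\u042f",  # ya
    "Hebrew": "\u05d0",    # alef
    "Arabic": "\u0628",    # beh
    "CJK": "\u4e2d",       # a common Chinese-Japanese ideograph
}


def script_of(char):
    """Rough heuristic: return the first word of the Unicode character name."""
    return unicodedata.name(char).split()[0]


# Print the code point and official name of each sample character.
for label, char in SAMPLES.items():
    print(f"{label}: U+{ord(char):04X} {unicodedata.name(char)}")

# Join the samples into one string for pasting into an editor or widget.
sample_text = "".join(SAMPLES.values())
```

Pasting sample_text into an editor or text widget shows at a glance which glyphs the installed fonts can supply.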

[Figure: From a Suchenwirth project to build a Unicode-savvy editor]

Along with the advice available from Netscape, application developer Richard Suchenwirth has documented his experiments with Unicode and related encodings. He makes all of this freely available through the public Tcl Wiki. One place to start is his Practical Guide to Choosing Fonts.

With a nice collection of fonts installed, you can follow Suchenwirth's lead and experiment with A Little Unicode Editor, synthesize your own keyboard widget, or begin to learn A Simple Arabic Renderer.

Note the simultaneous use of several different human languages with their appropriate glyphs.

Toolkit choices

Although Suchenwirth codes his examples in Tcl/Tk, most can be easily translated to Tkinter or Perl/Tk. Moving to other GUI toolkits presents more of a challenge: as our recent series on toolkits explained, many popular toolkits do not fully support Unicode. There are even a few outstanding inconsistencies in Tk display of Unix fonts. In principle, Java support of Unicode is particularly complete; however, Java's Swing GUI toolkit itself is poorly standardized on different platforms, so we prefer Tk or Qt for most of our Unicode-oriented work.

This is only the beginning of a working knowledge of internationalization and localization. In fact, many aspects of Unicode remain unsettled and even controversial. Parts of Unicode are still in beta, and native writers have rejected a few others. However, a little practice with the topics presented here will at least allow you to reproduce on your own desktop the work of the experts with human languages other than English. Remember: scripting languages' interactivity and responsiveness make them great vehicles for experimental learning.

Resources


Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.