Volume 3, No. 4 
October 1999
Gabe Bokor
Translation Journal
 
Non-English Computing


The Multilingual Web

by Gabe Bokor
 

Translators and Web page design
Web page design has become an increasingly valuable skill for translators, both for creating their own Web sites and for meeting their customers' needs for multilingual Web pages. The question is: "How can a non-English Web page be created that appears correctly on most browsers and platforms, with a variety of user settings and different fonts installed in the author's and the users' systems?" 

    Although WYSIWYG Web authoring software is becoming increasingly international, it usually lags behind current developments in Web standards. While such software allows people to create Web pages without learning and using codes, some knowledge of HTML and related standards is useful when creating more sophisticated or multilingual Web pages.


From ASCII to Latin-1

The original character set of the Internet is ASCII (American Standard Code for Information Interchange), whose printable characters are those you can find on the keyboard of a standard American computer or typewriter. Even today, most information is transmitted over the Internet in the form of ASCII characters, with each character being encoded as a number in the range from 0 to 127. In HTML, the language of the Web, this encoding is represented as &#xxx; with the x's standing for the character's code number. Thus, the letter "a" can be encoded as &#97;, the letter "A" as &#65;, and the curly bracket "{" as &#123;.
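As a minimal sketch of this notation (written in Python, which the article itself does not use; the helper name numeric_reference is hypothetical), the numeric reference of a character can be derived from its code number:

```python
import html

def numeric_reference(char: str) -> str:
    """Return the decimal HTML numeric character reference for one character."""
    return f"&#{ord(char)};"

for ch in ("a", "A", "{"):
    print(ch, "->", numeric_reference(ch))

# html.unescape turns a reference back into the character it stands for.
print(html.unescape("&#97;&#65;&#123;"))
```

The standard-library function html.unescape performs the reverse step, which is roughly what a browser does when it displays such a reference.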
   It was not difficult to extend this notation from ASCII to the characters used by most Western European languages, resulting in the "Latin-1" character set (also known as ISO-8859-1 or, somewhat erroneously, as the "extended ASCII" character set). Whereas there are 128 possible 7-bit ASCII codes, the 8-bit Latin-1 encoding allows 256 codes to be represented. Latin-1 includes, in addition to the regular ASCII characters, non-English letters such as "é" (&#233;) and "ñ" (&#241;) and some special symbols, such as the section sign (§), encoded as &#167;. The bullet (•) is often treated as part of this set as well, but it belongs to the Windows extension of Latin-1 (Code Page 1252, position 149); its portable forms are &#8226; and &bull;. Most of the "extended ASCII" characters also have a named form, which is easier to remember than the numeric form. For example, "é" can be used in a Web page as either &#233; or &eacute;, "ñ" as &#241; or &ntilde;, and "§" as &#167; or &sect;. All the major Web browsers support both forms of encoding.
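The relationship between the numeric and the named forms can be illustrated with Python's standard html.entities table. This is only a sketch; the helper name named_reference is hypothetical, and the table reflects the HTML 4 entity names rather than what any particular browser supports.

```python
from html.entities import codepoint2name

def named_reference(char: str) -> str:
    """Return the named entity for a character, or the numeric form if none exists."""
    code = ord(char)
    name = codepoint2name.get(code)
    return f"&{name};" if name else f"&#{code};"

# Print the character, its numeric reference, and its named reference.
for ch in ("é", "ñ", "§"):
    print(ch, f"&#{ord(ch)};", named_reference(ch))
```

Characters without a named form, such as plain ASCII letters, fall back to the numeric notation.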
   The numerical and named HTML codes of the Latin-1 character set can be found, for example, at http://www.owlnet.rice.edu/~jwmitch/iso8859-1.html.
   The newer versions of Netscape Navigator and Microsoft Internet Explorer don't need any encoding for the Latin-1 characters (ISO-8859-1, which Windows Code Page 1252 extends with a few additional characters) when configured for the Western European character set (default for U.S. and Western European browsers). In that case, a Web page text typed using the standard Windows keyboard (U.S. or U.S. International) and the Windows character set appears correctly on browsers running on either Mac or Windows machines, although the Mac doesn't support some characters of the Latin-1/ISO-8859-1 character set.


From Latin-1 to Unicode

Most languages of the world, however, are not restricted to the ISO-8859-1 character set. Hungarian, for example, has the characters ő, Ő, ű, and Ű; Czech has ů and ř; Romanian has ţ and other characters that are not part of Latin-1. Then there are characters of alphabets other than our Roman alphabet, such as Cyrillic (Кириллица), Hebrew, Arabic and others. Over the years, several forms of encoding have been devised for these characters, either by replacing some of the 256 Latin-1 codes with other characters or by using a completely different 8-bit or 16-bit encoding. Some examples of these character sets are given in Table 1 below.

Table 1

Character set                       Languages
iso-8859-1 or CP-1252 or Latin-1    Western European
iso-8859-2 or CP-1250 or Latin-2    Eastern European
iso-8859-3                          Esperanto, Galician, Maltese, Turkish
iso-8859-4                          Scandinavian, Baltic
iso-8859-5                          Cyrillic
iso-8859-6                          Arabic
iso-8859-7 or CP-1253               Greek
iso-8859-8                          Hebrew
GB2312                              Simplified Chinese
Big5                                Traditional Chinese
Shift_JIS, EUC-JP                   Japanese
KOI8-R                              Russian
ISO-2022-KR, EUC-KR                 Korean

Outside the ISO-8859-1 character set, a character generated in a word processor or editor may not be interpreted correctly by a browser. For example, the character "Û" generated in a Windows word processor will be displayed by a browser as "Û" under the Western European (ISO-8859-1) encoding, as a (Hungarian) "Ű" under the Central European encoding (ISO-8859-2), as a (Russian) "л" under ISO-8859-5, as "Ы" under Windows-1251, and as "ш" under KOI8-R.
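This effect can be reproduced with a short Python sketch: the single byte that a Windows editor stores for "Û" (hexadecimal DB) is decoded under each of the character sets mentioned. The codec names are Python's own and are assumed here to correspond to the charsets named in the text; the function name interpretations is hypothetical.

```python
RAW = b"\xdb"  # the byte stored for "Û" by a Western European Windows editor

# Python codec names, assumed to match the charsets discussed in the text.
CHARSETS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "cp1251", "koi8_r"]

def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under every charset in CHARSETS."""
    return {name: raw.decode(name) for name in CHARSETS}

for name, text in interpretations(RAW).items():
    print(f"{name:12} {text}")
```

The byte never changes; only the table used to interpret it does, which is why the browser has to be told (or has to guess) which table applies.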
   The Web author has two options to make sure the characters of the page will be displayed as intended:

  1. She may convert each character to the proper code (either manually, using code tables, or automatically, using appropriate word processing or Web authoring software);
  2. She can use the special characters unencoded and instead tell the browser how to interpret them.

If the latter option is selected, a META tag containing the "charset" (character set) attribute is inserted in the header of the Web page. For example, the following META tag tells the browser that the page is a Russian page generated with the Windows Code Page 1251 character set:

<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=WINDOWS-1251">

Tags are not case-sensitive.

Another Russian encoding system (KOI8-R) would use the META tag

<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=koi8-r">.
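To illustrate how software might read such a declaration, here is a minimal Python sketch built on the standard html.parser module. It is not how any particular browser works; the class name MetaCharsetParser and the function charset_from_meta are hypothetical.

```python
from html.parser import HTMLParser

class MetaCharsetParser(HTMLParser):
    """Collect the charset declared in a META HTTP-EQUIV="Content-Type" tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so META and meta match alike.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # CONTENT looks like "TEXT/HTML; CHARSET=WINDOWS-1251".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()

def charset_from_meta(markup: str):
    """Return the declared charset in lowercase, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(markup)
    return parser.charset

TAG = '<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=WINDOWS-1251">'
print(charset_from_meta(TAG))
```

Because the comparison is done in lowercase, the sketch accepts the tag in any mix of upper and lower case.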

Another hint to the browser about the type of encoding is the META tag containing the language attribute. For Russian, this tag would be the following:

<META NAME="LANGUAGE" CONTENT="RU">

These codes, together with the direction-of-text codes DIR and BDO (with their two possible values LTR and RTL for left-to-right and right-to-left), can also be applied at the level of Web page elements, such as paragraph <P> or table cell <TD>. For example, a paragraph in Hebrew might have the following tag:

<P CHARSET=ISO-8859-8 LANGUAGE="he" DIR="rtl">

Unfortunately, at this time not all browsers understand the language and direction-of-text tags, and they do not implement the character set/language indicated in the META tag and at page element level consistently. 
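As a sketch of how an author's tool might choose the DIR value for an element, the following Python function looks at the first strongly directional character of a text, using the standard unicodedata module. The function name guess_dir and the fallback to left-to-right are assumptions made for this illustration.

```python
import unicodedata

def guess_dir(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi_class == "L":           # Latin, Cyrillic, Greek, etc.
            return "ltr"
    return "ltr"  # no strong character found; fall back to left-to-right

hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters, written as escapes
print(guess_dir(hebrew_word))
print(guess_dir("Кириллица"))
```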
   If the Web author has not indicated the type of encoding of the page with the charset and/or language attribute in the META tag, the reader must select the proper character set in his or her browser. In Microsoft Internet Explorer 4.0, this selection is made under View - Fonts, in Netscape Navigator 4.0 and Microsoft Internet Explorer 5.0 under View - Encoding, and in Netscape Navigator 4.5 under View - Character Set.
   Regardless of the type of encoding used by the Web author, the reader of the Web page must have the appropriate fonts installed in order to correctly read the page.
    Most of the historic character sets listed in Table 1 (except those of the ISO family) are mutually incompatible. Due to the inconsistent manner in which different browsers interpret page-level and page element-level charset and language codes, results can be unpredictable when languages encoded with different character sets, such as Chinese and Russian, are used on the same Web page. A character set that would include all the above character sets as subsets was needed. Unicode (ISO-10646), in which each character is encoded using 16 bits (compared to 7 bits in ASCII), is such a universal character set. Unicode theoretically allows 65,536 characters to be encoded, covering all the known languages of the world, many graphic symbols, computer commands, and more. A projected extension of the standard will allow this number to be increased to about a million. As of this writing, about 40,000 Unicode characters have been defined.
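The figures quoted above follow from simple arithmetic, sketched here in Python. The second calculation rests on an assumption not spelled out in the article: that the extension works by pairing two reserved ranges of 1,024 code values each (the surrogate mechanism).

```python
BITS = 16
BASIC_CODES = 2 ** BITS          # code values available with 16 bits

# Assumption: the extension combines two reserved ranges of 1,024 code values
# each into pairs, which adds 1,024 * 1,024 further characters.
EXTENDED_CODES = 1024 * 1024

print(BASIC_CODES)
print(EXTENDED_CODES)
```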
   The structure of Unicode is the following (the four characters after the U+ symbol indicate the range of codes assigned to that set in hexadecimal notation):

U+0000 - U+007F ASCII (Standard English; can be combined with other blocks)
U+0080 - U+00FF Latin 1 (Danish, Dutch, Spanish, French, Italian, etc.)
U+0100 - U+017F European Latin (Czech, Polish, Romanian, etc.)
U+0180 - U+01FF Extended Latin (Croatian digraphs and Pinyin diacritic vowels)
U+0250 - U+02AF Standard Phonetic (International Phonetic Association characters)
U+02B0 - U+02FF Modifiers (glottal stops, tone transcription letters, etc.)
U+0300 - U+036F Generic Diacritics (umlauts, Vietnamese tone marks, etc.)
U+0370 - U+03FF Greek and Coptic
U+0400 - U+04FF Cyrillic and Cyrillic variants (Serbian, etc.)
U+0530 - U+058F Armenian
U+0590 - U+05FF Hebrew and Yiddish
U+0600 - U+06FF Arabic
U+0900 - U+097F Devanagari
U+0980 - U+09FF Bengali
U+0A00 - U+0A7F Gurmukhi
U+0A80 - U+0AFF Gujarati
U+0B00 - U+0B7F Oriya
U+0B80 - U+0BFF Tamil
U+0C00 - U+0C7F Telugu
U+0C80 - U+0CFF Kannada
U+0D00 - U+0D7F Malayalam
U+0E00 - U+0E7F Thai
U+0E80 - U+0EFF Lao
U+0F00 - U+0FFF Tibetan
U+10A0 - U+10FF Georgian
U+2000 - U+27BF General Punctuation, Symbols, Dingbats, Arrows, Blocks, etc.
U+3000 - U+303F CJK (Chinese, Japanese, Korean) Symbols and Punctuation
U+3040 - U+309F Hiragana
U+30A0 - U+30FF Katakana
U+3100 - U+312F Bopomofo (Chinese/Mandarin phonetic characters for teaching)
U+3130 - U+318F Hangul Elements (Korean)
U+3190 - U+33FF CJK Marks, Letters, Enclosed Ideographs, etc.
U+4E00 - U+9FFF Chinese/Japanese/Korean Han Ideographic characters
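The table can be used as a simple lookup: given a character, find the range its code falls into. The Python sketch below copies only a few rows of the table for illustration; the names BLOCKS and block_of are hypothetical.

```python
# A few (start, end, name) rows copied from the table above; not the full list.
BLOCKS = [
    (0x0000, 0x007F, "ASCII"),
    (0x0080, 0x00FF, "Latin 1"),
    (0x0100, 0x017F, "European Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic and Cyrillic variants"),
    (0x0590, 0x05FF, "Hebrew and Yiddish"),
    (0x0600, 0x06FF, "Arabic"),
    (0x2000, 0x27BF, "General Punctuation, Symbols, Dingbats, Arrows, Blocks, etc."),
]

def block_of(char: str) -> str:
    """Return the name of the range that contains the character's code."""
    code = ord(char)
    for start, end, name in BLOCKS:
        if start <= code <= end:
            return name
    return "not in this excerpt"

for ch in ("é", "ő", "К"):
    print(ch, f"U+{ord(ch):04X}", block_of(ch))
```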

The notation &#xxxx; used for the Latin-1 character set is understood by the most recent versions of Web browsers when it's extended to the higher Unicode characters. Thus, the double-dagger symbol ‡ can be coded as &#8225;, where 8225 is the decimal equivalent of the character's code U+2021 in the hexadecimal notation of Table 2. The character is located in the U+2000 - U+27BF range (General Punctuation, Symbols, etc.) of Table 2.
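The conversion from the hexadecimal U+ notation to the decimal reference is a plain change of number base, as this Python sketch shows. The function name reference_from_code is hypothetical, and the input is assumed to start with "U+".

```python
import html

def reference_from_code(code: str) -> str:
    """Convert a code in U+XXXX notation to a decimal numeric character reference."""
    hex_digits = code[2:]              # drop the leading "U+"
    return f"&#{int(hex_digits, 16)};"

reference = reference_from_code("U+2021")
print(reference)
print(html.unescape(reference))
```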
   One encoding form of Unicode, known as UTF-8 (Unicode Transformation Format, 8-bit), uses variable-length encoding: one byte (7 bits) for ASCII and 2 to 6 bytes (up to 31 bits) for the other character sets. Thus, ASCII is a subset of UTF-8, and the ASCII characters can be used unencoded on a UTF-8 Web page. The META tag with the charset attribute in the header of a UTF-8-encoded page is the following: 

<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=UTF-8">.
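The variable length of UTF-8 can be observed with a few sample characters in Python. This is a sketch only; the function name utf8_length is hypothetical. Note that the current definition of UTF-8 limits a character to 4 bytes, so Python never produces the 5- and 6-byte forms.

```python
SAMPLES = ["A", "é", "ő", "‡", "中"]

def utf8_length(char: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(char.encode("utf-8"))

# Print the code, the byte count, and the bytes in hexadecimal.
for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", utf8_length(ch), ch.encode("utf-8").hex())
```

The ASCII letter is stored exactly as it would be in an ASCII file, which is why plain English text is already valid UTF-8.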

More information about the Unicode standard can be found at http://www.unicode.org/unicode/standard/standard.html. The charts of all Unicode characters, in .pdf format, can be accessed from the page http://www.unicode.org/charts/.


How to generate and read Unicode text

Microsoft Word v. 8 provides Unicode support, allowing the user to save a file as "Unicode Text," but I've found this feature to be quite buggy. This version of Word uses not UTF-8, but 16-bit encoding, i.e., even ASCII characters are encoded with two bytes; the second byte appears as an empty square on the screen. Chris Pratley, Program Manager of Microsoft Office (chrispr@MICROSOFT.com), informed me that the version of Word that comes with Office 2000 has solved this and other problems. 
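The two-bytes-per-character behavior can be made visible with a short Python sketch. It assumes little-endian byte order, which is usual on Windows but is not stated in the article; the function name as_utf16_bytes is hypothetical.

```python
def as_utf16_bytes(text: str) -> bytes:
    """Encode text as 16-bit little-endian code units, without a byte-order mark."""
    return text.encode("utf-16-le")

print(as_utf16_bytes("Web"))        # every ASCII letter is followed by a zero byte
print("Web".encode("utf-8"))        # UTF-8 keeps ASCII letters as single bytes
```

A program that reads such a file as 8-bit text sees the zero byte after each letter as a separate, undisplayable character.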
   You can also generate a UTF-8-encoded Web page in Microsoft FrontPage 2000 by selecting File - Properties - Language - Save Document As - Multilingual (UTF-8) or by manually changing the charset META tag in the HTML view of the page. Selecting UTF-8 under Tools - Page Options - Default Font - Multilingual UTF-8 will not do the trick: the page defaults back to Windows-1252 or the character set of the first character typed.
    Regarding other software for generating UTF-8 code, Otto Stolz (Otto.Stolz@uni-konstanz.de) informed me as follows:

"You could edit your HTML source with UniEdit and store it in UTF-8. See http://www.lang.duke.edu/uniintro.htm.
    You may also wish to try Tango Creator, the Unicode-capable HTML editor from Alis. See http://www.alis.com/internet_products/creator/creator.html."

Both Netscape Navigator and Microsoft Internet Explorer v. 4.0 and higher are fully UTF-8-compatible. 
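Where none of the tools mentioned is at hand, a UTF-8 page can also be produced by a short script. The following Python sketch is one possible approach, not a method described by the article's sources; the sample page and the function name encode_page are hypothetical.

```python
PAGE = """<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=UTF-8">
<TITLE>Multilingual sample</TITLE>
</HEAD>
<BODY>
<P>Hungarian: ő ű</P>
<P>Russian: Кириллица</P>
</BODY>
</HTML>
"""

def encode_page(markup: str) -> bytes:
    """Return the page as UTF-8 bytes, matching the charset named in its META tag."""
    return markup.encode("utf-8")

data = encode_page(PAGE)
print(len(PAGE), "characters,", len(data), "bytes")
```

Writing data to a file opened in binary mode yields a page whose actual encoding agrees with its META tag.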


If everything else fails

If the compatibility problems between the Web author and the intended reader cannot be solved via coding, the last-resort solution is the graphic format. Of course, any character can be converted into, and displayed as, graphics, in which case no character encoding is required and the reader does not need to have any special font installed in his or her system. The graphic formats handled by Web browsers are JPEG and GIF. The relative merits of these two formats are beyond the scope of this article, but it's good to remember that graphics always take up more disk space and load more slowly than the respective text files. They are also more difficult to modify if the text has to be changed for any reason at a later time. Therefore, for fast-loading, sharp, editable characters in any language, text is preferable. Use graphics for compatibility where loading speed is of little or no concern.