Digital Language Access

Scripts, Transliteration, and Computer Access

John Clews
Chairman of ISO/TC46/SC2: Conversion of Written Languages
SESAME Computer Projects
8 Avenue Road, Harrogate, HG2 7PG, United Kingdom
[email protected]

D-Lib Magazine, March 1997

ISSN 1082-9873

Introduction
Only Three Types of Script!
How These Scripts Work
Libraries, Transliteration, and Standards
Standards: What Are They and How Are They Made?
End users rule - OK!
End-Users and Transliteration

Appendix 1: Structure of ISO/TC46/SC2
Appendix 2: Electronic Access to ISO/TC46/SC2

1. Introduction

Today's world is an increasingly global culture -- business and science are certainly moving that way. Moreover, there is now a vast amount of cultural information accessible via the Internet, as various religious and cultural foundations worldwide are putting a lot of information onto the World Wide Web, and sending e-mail internationally in a wide variety of languages. But foreign languages can be a barrier to information even as communications technologies increase our potential to build our own Tower of Babel. How can we access this information? We cannot all learn every language in use worldwide, even though the effort to learn one or more new languages is reward enough, enabling us to access the riches of a completely new culture.

Students of a new language will have noted the common linguistic features between one language and another, which can help provide a route map into learning additional new languages and, with new languages, different points of view. These common linguistic characteristics are not specialist linguistic features, to be learned by professional linguists, but rather are features that become obvious to anyone actively involved in this learning, particularly in terms of loan words and expressions spread across languages.

Clearly language is more than just words, and learning a language is more than memorizing vocabulary. Besides learning the concepts, jargon, slang, and "feel" of a new language, we also must learn to write it, particularly if we want to communicate via computing and the net.

Outside Western Europe, Africa, America and Australia, many languages are written in non-Latin scripts, which poses an even greater barrier. It is one, however, with a rich history of problem-solving in library collections.

The work of the International Organization for Standardization (ISO) helps to provide solutions in a wide range of areas to ease people's digital language access in an increasingly on-line world.

2. Only Three Types of Script!

Although the number of languages in use worldwide still runs into the thousands, the number of scripts currently used to write them is only around two dozen. To simplify things further, it is rarely realised that there are, in fact, only three basic types of script: ideographic scripts (mostly used in China, and also in Japan and Korea); derivatives of Brahmi script, developed mainly in South Asia; and scripts derived from Phoenician, now used in Europe, the Middle East and North Africa.

Good ideas travel and are widely adopted. If we can get around the script barrier -- and ways are described later in this article -- we can access much more of the world's culture.

The three types of scripts grew up in three areas, each largely separated by mountainous and desert areas:

in East Asia, where Chinese ideographic script has had a major influence;

in South Asia, in and around the Indian Subcontinent, influenced by Brahmi script; and

in West Asia and around the Mediterranean, influenced by Phoenician script.

These three basic scripts (see Figure 1) have a major influence on the scripts we use today. Understanding these three base scripts can give us a major insight into all the other scripts and languages we may come across.

Figure 1: Historical Script Families and Derived Scripts Used Today


    Latin   Cyrillic            Devanagari - - - Tibetan
       \     /                 /  Gujarati
        \   / - Armenian      /   Bengali      _ Mongolian
         \ /                 /    Gurumukhi    /
        Greek - Georgian    /     Oriya    SOGDIAN   Chinese
          |                /               SCRIPT    /
          |               /       Telugu            /
      PHOENICIAN       BRAHMI - - Kannada      IDEOGRAPHIC - Japanese
     /  SCRIPT  \      SCRIPT     Malayalam    SCRIPT       \
    /     |      \        \       Tamil                      \
 Hebrew   |     Arabic     \                                 Korean
          |        \        \ - - Sinhala
          |                  \
          |         \         \ - Burmese
          |                    \  Khmer
          |          \          \
       Ethiopic      Divehi      \ - Thai
      (Ethiopia,    (Maldives)       Lao
       Eritrea)

Key: Scripts shown in CAPS are the historical source scripts for the other scripts shown. These scripts are used in over 99% of the world's official languages. Sogdian (a fourth historical script) shares some aspects of neighbouring scripts.

3. How These Scripts Work

This section describes scripts from East to West, taking in East Asian, South Asian and West Asian scripts (from which European scripts originate). Most of these spread with specific cultural and religious contact in neighbouring regions of the world. In understanding the diffusion of scripts, we are seeing population migrations through the lens of language.

3.1 East Asia

The oldest scripts are ideographic (symbols directly representing meaning rather than sound); they are still used in China. Chinese characters were adopted and adapted -- with additional or alternative phonetic characters -- in Japan and Korea, and at one time, in Vietnam, particularly when various religions such as Taoism, Confucianism and Buddhism spread from or through China at various times in the past.

East Asia has many great treasures of writings contained in manuscripts. Obviously copying of manuscripts by hand could never have led to the availability of publication we are used to today. Xylographs (wood-block printing) were developed, primarily to spread religious texts, many of the earliest coming from East Asia and South Asia.

Thus, the East Asian written languages which use Chinese characters today -- Chinese, Japanese, and to a lesser extent, Korean -- draw on a writing tradition going back over millennia. The most recent changes to the written language reflect government-backed standardization on the range of characters used at various educational levels. In the People's Republic of China, a major simplification of the written forms of characters has occurred. Still, the 214 traditional radicals (character elements) have been a traditional way of arranging Chinese characters in dictionaries. Although the same arrangement also operates in Unicode, computer technology enables sorting and input by various other keys, such as phonetic value.
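The point about sorting by different keys can be illustrated with a minimal sketch. It assumes a tiny hand-made table of Pinyin readings (without tones) for four characters; the table, the function names and the sample list are illustrative assumptions, and a real system would draw its readings from a dictionary or character database.

```python
# Illustrative data only: four common characters with Pinyin readings
# (tones omitted). Supplied by hand for this sketch.
PINYIN = {
    "中": "zhong",
    "国": "guo",
    "人": "ren",
    "大": "da",
}


def sort_by_code_point(chars):
    """Order characters by their Unicode code point."""
    return sorted(chars, key=ord)


def sort_by_phonetic_value(chars, readings=PINYIN):
    """Order characters alphabetically by a phonetic key (here, Pinyin)."""
    return sorted(chars, key=lambda ch: readings[ch])


sample = ["国", "大", "中", "人"]
print("".join(sort_by_code_point(sample)))      # 中人国大
print("".join(sort_by_phonetic_value(sample)))  # 大国人中
```

The same four characters come out in a different order depending on the key chosen, which is the flexibility that fixed dictionary arrangements could not offer.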

3.2 West Asia

At the other end of Asia, West Asia also provided the earliest script to be used in the Mediterranean region. The early Phoenicians, who lived in what is now Lebanon, were major traders. Their 22-letter script influenced nearby scripts like Hebrew and ultimately the European scripts.

3.3 European scripts

Later the Hebrew letters were adopted for, and reshaped in, Greek, as can be seen from the common order in Figure 2. The Greek script itself was later adapted for other European scripts like Latin, Cyrillic, Georgian, and Armenian.

Just as Phoenician script and other contemporary variants were adopted in Hebrew script, Hebrew in turn was adopted and adapted for writing Greek, which again in turn was adopted as a model for writing the Latin, Cyrillic, Georgian and Armenian scripts, each of which spread under the influence of Christianity.

Latin script languages were the first to develop movable-type printing, which led to a publication revolution worldwide. Countries using other scripts with similar characteristics to Latin script -- using separate rather than cursive (joined-up) characters -- such as Greek and Cyrillic, were the first to benefit from this technology. But typographers soon turned their attention to most other scripts in use. By the time that printing was mechanised in the nineteenth century, large numbers of trained machine operators controlled hot metal processes to produce regular publications in large quantities in many languages. Movable type systems were developed for all scripts in current use in official languages, and others too, in the West and in Asia. This process continues today with the computer revolution.

There is an astonishing amount of correspondence between the different scripts of Europe, the Middle East and North Africa, all deriving from their common origin in Phoenician script, as Figure 2 shows below, reinforced by ebbs and flows of populations and ideas. Some of these correspondences relate to voiced/unvoiced versions of the same letter (compare Latin c with g in most other scripts).

The Latin representations used here are designed to highlight the common origins of scripts and do not represent any specific transliteration or transcription.

Figure 2: Correlating Scripts of Europe, the Middle East, and North Africa

                                 Part 1: a-q
______________________________________________________________________________
 LAT a b   c d e f  g h           i j      k        l     m   n    o    p   q
______________________________________________________________________________
 GRE a b   g d e    z e     th    i        k        l     m   n ks o    p
 CYR a b v g d e zh z             i j      k        l     m   n    o    p
 GEO a b   g d e v  z e     t     i        k        l     m   n    o    p   zh
 ARM a b   g d e    z e  e  t  zh i l x c  k  h  j  l  ch m y n sh o j  p   q
______________________________________________________________________________
 LAT a b   c d e f  g h           i j      k        l     m   n    o    p   q
______________________________________________________________________________
 HEB a b   g d h w  z h     t     y        k        l     m   n s  `    p c q
*ETH a b   g d h w  z h     t     y        k        l     m   n s  `    f c q
 ARA a b   j d   r  z   s d t z            k        l     m   n    ` gh f   q
  "     t   h dh
  "      th  kh
______________________________________________________________________________
 LAT a b   c d e f  g h           i j      k        l     m   n    o    p   q
______________________________________________________________________________

* [original order in Ge'ez]

                                Part 2: r-z
 ______________________________________________________________________________
 LAT r s    t       u  v w x                                  y              z
 ______________________________________________________________________________
 GRE r s    t       u  f   kh            ps                            o
 CYR r s    t       u  f   kh ts ch sh shch                '  y  "     e  yu ya
 GEO r s    t       u  f   kh g  q  sh ch c d  s  c  x  j  x  y  xh w  o     f
 ARM r s  v t r c   w  f   kh                                       ew o (u) f
 ______________________________________________________________________________
 LAT r s    t       u  v w x                                  y              z
 ______________________________________________________________________________
 HEB r s    t
 ETH r s    t
 ARA              h      w                                    y


 ______________________________________________________________________________
 LAT r s    t       u  v w x                                  y              z
 ______________________________________________________________________________

* [original order in Ge'ez]

ISO/TC46/SC2 has the following working groups (see ISO/TC46/SC2 N 384).

WG1: Transliteration of Cyrillic
WG2: Transliteration of Arabic
WG3: Transliteration of Hebrew
WG4: Transliteration of Korean
WG5: Transliteration of Greek
WG6: Transliteration of Chinese
WG7: Transliteration of Japanese
WG8: Joint ISO/TC46/SC4 Working Group: Relations between transliteration and machine representation of characters
WG9: Transliteration of Thai
WG10: Transliteration of Mongolian
WG11: Transliteration of Persian; formerly in WG2

3.4 The Middle East and North Africa

Although Arabic and Hebrew are both written from right-to-left, the basic writing systems share many features with other European and West Asian scripts, which are written left-to-right. For example, consider the alphabetic order of these scripts.

Hebrew script has been used for a variety of languages as Judaism spread with the Jewish diaspora, such as Yiddish and Ladino in Europe. Similarly, due to the expansion of Islam, Arabic script has been used not just for Arabic, but also for languages like Persian. In the wake of Islam's reach into Asia, the Perso-Arabic tradition and the Perso-Arabic script had a major influence in parts of South Asia and Central Asia too, even as far as Western China.

Neither Hebrew nor Arabic generally indicates vowels; usually the context enables the user to determine the likely text. This apparent weakness is, in fact, a major strength. As an experiment, use your word processor to globally delete all occurrences of a, e, i, o and u from a sample text, and try to read the result: after an initial hurdle, you will be surprised how much you can read! Arabic and Hebrew are therefore very efficient writing systems. If completely unambiguous text is essential -- as in religious and educational texts -- various vowel signs can be added above or below other letters.
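The experiment can also be run in a few lines of code rather than in a word processor. The following is a minimal sketch; the function name strip_vowels and the sample sentence are illustrative choices, not part of any standard.

```python
import re


def strip_vowels(text):
    """Delete every a, e, i, o and u (upper or lower case) from the text."""
    return re.sub(r"[aeiouAEIOU]", "", text)


sample = "Arabic and Hebrew are very efficient writing systems"
print(strip_vowels(sample))
# rbc nd Hbrw r vry ffcnt wrtng systms
```

Longer words usually remain recognisable, while very short words such as "are" lose most of their shape, which is where context has to do the work.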

3.5 South Asia

Most South Asian and Southeast Asian scripts also represent vowel sounds by vowel-signs above or below letters, but in this case their use is mandatory. They derive from the Brahmi script used on the Indian sub-continent many centuries ago, and are written from left-to-right. All these alphabets follow a very logical phonetic order -- so logical that the International Phonetic Association (IPA) adopted it (with modifications) for the International Phonetic Alphabet. Several of these scripts also combine letters as ligatures, or conjunct consonants, with many more ligatures than in European scripts. Thus, all computer equipment for Indian languages needs to provide for this degree of complexity in both display and printing but without adding any extra complexity to the keyboard or other input system.

Most users will recognise Devanagari script used for Hindi, Marathi, and some other North Indian languages by its appearance. It is the only script to hang from an overline. Most other South Asian scripts map to this very closely in the same way that Latin, Gothic, and Gaelic scripts map to each other. In fact, many existing computer systems treat different South Asian scripts as different fonts. This use of fonts also provides a fairly consistent transliteration between scripts and is not just limited to a Latin script transliteration.
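The close mapping between these scripts is visible in Unicode, where the main Indian scripts occupy parallel blocks of 128 positions (a layout inherited from the Indian national standard ISCII), so that a given letter generally sits at the same offset in each block. The sketch below relies on that layout to move text from one script to another; it is an illustration under that assumption, not a description of any particular font-based system. The function name convert_script and the sample word are hypothetical, and because not every position is assigned in every block (Tamil, for example, has fewer letters), a practical converter would need a table of exceptions.

```python
# First code point of each script's Unicode block; each block is 0x80 wide.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def convert_script(text, source, target):
    """Shift each character of the source block to the same offset in the target block."""
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # spaces, digits and other scripts pass through
    return "".join(out)


word = "\u0915\u092e\u0932"  # Devanagari letters KA, MA, LA
print(convert_script(word, "Devanagari", "Bengali"))
print(convert_script(word, "Devanagari", "Gujarati"))
```

Because the conversion is a fixed offset, applying it in the opposite direction restores the original text exactly.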

Some South and Southeast Asian scripts also add a considerable number of additional consonants and vowels to the repertoire that they derive ultimately from the same Brahmi script. This is particularly true for Thai and Lao, and also for Tibetan script. By comparison, these scripts lack the independent vowels that all other South Asian scripts have. Indeed, Tamil has far fewer letters than most other scripts used in the Indian sub-continent.

4. Libraries, Transliteration and Standards

Many libraries have large multilingual collections, simply because knowledge is not restricted to one language. Large, prestigious libraries provided a Latin script key -- transliteration systems -- to enable readers to access collections in non-Latin scripts. By developing transliteration systems, libraries in fact provided the original multilingual information system.

Several transliteration systems have been used for most scripts. Various libraries, scholarly journals, encyclopaedias, maps, and atlases each provided ways of achieving the same aim, most of them developed during the last century or so. Many librarians will be familiar with the Library of Congress transliteration schemes -- but other types of users listed above will be more familiar with other schemes. More recently, character set limitations on the Internet -- particularly for many e-mail users -- have led to a much more widespread use of different transliteration systems for that specific purpose. Some other computer users have developed transliteration systems designed as input systems for non-Latin script languages.

Clearly, in a global information environment, if one approach can be developed which will meet the needs of most users, there can be great advantages for all information users -- i.e. advantages for all of us. Finding a useful approach -- or approaches -- is the aim of ISO's international standardisation committee ISO/TC46/SC2 (Conversion of Written Languages). Originally aimed just at library use, this now aims at providing solutions for all potential users, wherever transliteration might be of use.

Current ISO standards for the conversion of scripts developed by the ISO/TC46/SC2 committee are: ISO 9 (Cyrillic); ISO 233 (Arabic); ISO 259 (Hebrew); ISO 843.2 (Greek); ISO 3602 (Japanese); ISO 7098 (Chinese); ISO 9984 (Georgian) and ISO 9985 (Armenian).

Standards in development (currently at the WD, CD, DIS or DTR stage) include ISO 11940 (Thai); ISO TR 11941 (Korean); and ISO 14522 (Mongolian). New standards are also being planned from 1997.

ISO/TC46/SC2 has also set up the tc46sc2@elot.gr e-mail discussion list on transliteration to speed up its work, together with an associated web site <http://www.elot.gr/tc46sc2/list/>.

At its annual meeting in Oslo in May 1996, ISO/TC46/SC2 decided to use the UCS Universal Coded Character Set standard (ISO/IEC 10646 and Unicode) as its base reference standard in place of ISO 5426.

This would involve a plan to provide transliteration standards for all scripts present in ISO/IEC 10646 used in official languages worldwide, and to use helpful features already present in ISO/IEC 10646, such as character identifiers described in pDAM 9 to ISO/IEC 10646.

5. Standards: What Are They and How Are They Made?

We live in a world of standards, covering everything from equipment to flight schedules, and enabling a level of travel and exchange that would otherwise be lost in a welter of incompatibility. Within the last century, government policies all over the world have determined and shaped some aspects of language use, and standardised things like spelling and the range of characters that can be used. The use of computers has also had a standardising effect, in determining the range of characters used, and their sorting order.

5.1 Computer standards

The advent of computer standards like Unicode (with identical coding to the international standard ISO/IEC 10646) brings the prospect of being able to code information using these standards, and to make it universally accessible on the Internet, tantalizingly close. Some of this material is already available. But particularly when it is in non-Latin scripts, it is often unavailable to many who wish to see it, due to incompatibilities with and among legacy software and older equipment. In due course, Unicode adoption should solve all that.

5.2 Transliteration standards

Despite all the work on ISO/IEC 10646 and Unicode, there will always be a need for transliteration. Why? Few people will have the same level of competence in other scripts as they have in the script used for their mother tongue.

It is easy to imagine short-term needs requiring us to deal with languages radically different from our own, or to deal with mechanical or computerised equipment which does not provide all the scripts or characters required.

We are now beginning to realise that transliteration may have more indirect impact on other aspects of multilingual computing than previously appreciated. For example, it can provide useful keyboard methods that overlay QWERTY and similar layouts where possible, avoiding the need to learn several different keyboard layouts when using different languages with different scripts.

The number of correspondences, compared to the number of differences, generally makes transliteration a useful device. It is not always a simple matter, given the divergence in use of characters among different languages over the centuries, and the fact that many resemblances are not on a one-to-one basis, as Figure 2 shows.

5.3 Transliteration and transcription

What is the difference between transliteration (TL) and transcription (TS)? TL is the representation of letters of one script by the letters of another; TS is the representation of sounds of one language in letters of one script.

The difference is a fairly crucial one. Transcription has a superficial attraction in that it is generally more readable, because it is expressing sounds associated with one script using the sounds associated with a language using another script. However, that can lead to wide variations in output, which can lead to ambiguities, and certainly reduces the scope for handling the output on computers. Dealing with sounds means that different dialects and language conventions come into play. As these change over time and place, they are always more ambiguous than transliteration rules.
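The letter-based nature of transliteration can be shown with a minimal sketch. The table below is deliberately tiny and hypothetical: it is not the ISO 9 table or any other published scheme, it covers only ten lower-case Cyrillic letters, and every entry is one letter to one letter so that the mapping can simply be inverted. The function name transliterate and the sample word are likewise illustrative.

```python
# Hypothetical one-to-one table for illustration; NOT a published standard.
CYR_TO_LAT = {
    "а": "a", "в": "v", "д": "d", "к": "k", "м": "m",
    "н": "n", "о": "o", "р": "r", "с": "s", "т": "t",
}
# Because no two Cyrillic letters share a Latin letter, the table inverts cleanly.
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def transliterate(text, table):
    """Replace each letter found in the table; pass anything else through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


latin = transliterate("москва", CYR_TO_LAT)
print(latin)                             # moskva
print(transliterate(latin, LAT_TO_CYR))  # москва
```

Nothing in the table refers to pronunciation, which is why the output is the same whoever applies it. A transcription would instead need to know how the word is spoken in a particular language or dialect, and so cannot be reduced to a fixed letter table of this kind.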

For that reason, the IPA (International Phonetic Alphabet) is not suitable as a transliteration alphabet, because it only relates to sounds, and not to letters. In addition, many IPA characters are not available on most computers so computer processing using an IPA alphabet transliteration would have various limitations. Some users will also be familiar with IPA: many will not. All users will, however, be able to work with the letters a-z, even though they may sometimes associate different sounds with them. The human brain is very adaptable: most people have no great difficulty in taking different situations in their stride: people will read the words box, choux, and the Chinese town of Xi'an in the same sentence without even thinking that there is a potential problem with the letter x.

5.4 Benefits of transliteration

Although in some cases the transliteration rules can look complex, once mastered they do provide the unambiguous output that transcription cannot. A good transliteration system should also produce results that are as readable as any transcription, and that can be read by an end-user without anybody needing to consider the rules that produced them. What's more, computers can apply the rules simply and unambiguously, enabling a variety of different transliteration outputs to suit different specific needs.

The most recent standards developed by ISO/TC46/SC2 (Conversion of Written Languages) -- such as those for Greek and Korean -- use only plain 7-bit ASCII (or ISO/IEC 646) characters: A-Z, a-z, 0-9 plus limited punctuation. This contrasts with the large number of diacritics used in earlier ISO/TC46/SC2 standards. As a result, reading transliterated text and using computers in transliteration has become much more straightforward. Although computers can cope with a variety of character sets, the incompatibilities between them mean that there are great advantages in sticking to a simple character repertoire.
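A simple repertoire is also easy to check mechanically. The sketch below tests whether a transliterated string stays within A-Z, a-z, 0-9 and a small punctuation set. The text does not enumerate the "limited punctuation", so the set used here (space, apostrophe, hyphen, full stop and comma) is an assumption, as is the function name.

```python
import string

# Letters and digits as described; the punctuation set is an assumption.
ALLOWED = set(string.ascii_letters + string.digits + " '-.,")


def uses_plain_repertoire(text, allowed=ALLOWED):
    """Return True if every character of text belongs to the allowed set."""
    return all(ch in allowed for ch in text)


print(uses_plain_repertoire("Xi'an"))       # True
print(uses_plain_repertoire("Čajkovskij"))  # False: Č carries a diacritic
```

A check like this could be applied to the output of any transliteration table to confirm that it will survive systems limited to 7-bit characters.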

Transliteration can work very well for extremely phonetic languages and scripts like the Cyrillic, Greek, Armenian, and Georgian scripts in Europe, and most scripts of South and Southeast Asia, such as Devanagari, Panjabi, Gujerati, Bengali and Oriya; Kannada, Malayalam, Telugu, Tamil; Sinhala; Maldivian; Lao; Burmese and Khmer; and for Amharic (used in Ethiopia and Eritrea).

However, transcription works better for languages like Chinese. Even here the term "Ideographic transliteration" is being considered since the outcome is still converting one or more written characters in one script to one or more written characters in another script.

5.5 Special cases

Various scripts do need more sensitive treatment: both Hebrew and Arabic generally omit vowel signs in text. While unvowelled text looks fine in those scripts, and can easily be read, it is much harder to decipher unvowelled Latin text -- especially as there are no language clues present from a Latin script language. This is true even for those with a fluent reading knowledge of both English and Hebrew, or of both English and Arabic.

Vowelled Hebrew or Arabic text does not present this problem, but converting unvowelled Hebrew or Arabic text into readable vowelled Latin script requires a mixture of transliteration and transcription. A new approach, called Phonemic transliteration, is currently being developed in ISO 259-3 to deal with this issue.

There are also some complications for Thai: various Thai letters have special uses for tones, and some letters can have different phonetic values in different parts of the syllable, especially as final letters. Strict transliteration is certainly possible, and the present draft standard for Thai provides unambiguous transliteration. However, a transcription of Thai may be far more readable than a strict transliteration of Thai.

The same applies to Tibetan. This monosyllabic language uses different spellings, often with silent letters, to disambiguate words with the same pronunciation (much as Chinese uses different single characters for different words with the same pronunciation). While transliteration can allow reversible text in either Latin or Tibetan script, some users will require a transcription in Latin script, particularly if they are not yet familiar with the spoken language.

5.6 How are standards produced?

Standards are supposed to be the result of international consensus. The mechanism is a mixture of development by experts, dissemination of proposals, and monitoring and publication by the secretariat. Moves are underway to streamline some of these procedures.

Certain stages have to be followed: for new standards, a New Work Item (NWI) has to be proposed, and enough members found to develop this work. Committee drafts (CDs) are produced initially, and later, mature stages become a DIS (Draft International Standard). When finally approved, they become an IS (International Standard). At each of these stages, and intervening stages, voting has to take place by the 'P' member bodies. This is quite a lengthy process. WDs (Working Documents) can also be produced by ISO/TC46/SC2 and its Working Groups (WGs) at any stage to back up technical work.

(a) Technical work is done at working group meetings.
(b) Voting is done by national bodies, based on draft standards circulated through the ISO/TC46/SC2 Secretariat. This is independent of both the working group meetings, and independent of the ISO/TC46/SC2 meeting.
(c) The annual meeting of ISO/TC46/SC2 is largely a business meeting overseeing the work of the whole committee, its working groups and its liaisons.

At its last meeting in Oslo, ISO/TC46/SC2 decided to set up a Group with Advisory Functions which we are calling ISO/TC46/SC2/STRAG (Strategic Action Group). Coordinating the time scales of (a), (b), and (c) will be its most urgent task.

6. End-users rule - OK!

Standards are intended to be designed with end users' needs in mind. However, with the best will in the world, nobody can design a standard that will suit all people in all conditions without input from interested parties. The ISO standardisation process is designed to enable input from various organisations, by ensuring that drafts are circulated via its national member bodies (national standards organisations like BSI in the UK, ANSI (or NISO) in the USA, AFNOR in France, DIN in Germany etc) to interested bodies who are likely to be represented on the relevant committees.

In an increasing number of areas, national standards organisations are no longer developing national standards which differ from ISO standards, but concentrating on national input to specific international standards under development, or under revision (standards have to be reviewed by ISO national member bodies, and confirmed, revised or withdrawn every five years, depending on the consensus of international opinion).

Such processes should lead to worldwide consultation. Nevertheless, national standards bodies tend to be a bit remote, and end users are often unaware of their chance to influence standards development through the appropriate channels. This is a pity, considering that standards may affect various aspects of their lives.

7. End-Users and Transliteration Standards

In the case of transliteration, the standards which are used may affect whether or not certain information in libraries, or in books, is found. In some countries such standards may have the force of law.

Outside libraries, there are further issues. When a Greek citizen moved between Greece and Germany not long ago, varying transliterations of his name caused expensive legal actions up to the level of the European Court. Some passport and immigration authorities are also considering developing fairly rigid rules on representing names in passports, which may have similar consequences in the future.

ISO/TC46/SC2 will hold its next meeting, and a series of working group meetings, at the British Standards Institution in London, from May 12-14, 1997. Widening the scope of its work, enabling its working groups to be even more effective, widening its influence, and finding out the views of end users will be a major part of its concerns.

How do you raise your concerns? You can contact your national member body of ISO, or an active liaison organization (e.g. the ISSN International Centre) or, as a last resort, you can contact the Sub-Committee Secretariat. ISO/TC46/SC2 meets once a year to review the activities of its working groups, and to review standards under development. Normally most delegates are nominated by national member bodies.

However, some national member bodies are less active than others. For some purposes in ISO/TC46/SC2, the tc46sc2@elot.gr e-mail list on transliteration may be a more effective way of raising certain issues. This is one of several e-mail discussion lists now used by different standards committees as a way of ensuring input and feedback from their user communities, particularly in the information technology area. Some also have associated World Wide Web sites, like <http://www.elot.gr/tc46sc2/list/> in the case of ISO/TC46/SC2. The general structure and contact points for ISO/TC46/SC2 are given in an appendix to this article.

Given the short time scale between the appearance of this article and the next meeting of ISO/TC46/SC2, anybody wishing to present a point of view, either personally or through other organisations, should contact the chair or secretary of ISO/TC46/SC2, and/or a specific working group leader, before May 1997.

Transliteration Standards affect you -- it's worth using them, and making your voice heard when necessary.