UNICODE(7)          Linux Programmer's Manual          UNICODE(7)



NAME

       Unicode - the unified 16-bit super character set


DESCRIPTION

       The international standard ISO 10646 defines the Universal
       Character Set (UCS).  UCS contains all characters  of  all
       other  character  set standards. It also guarantees round-
       trip compatibility, i.e., conversion tables can  be  built
       such  that  no  information  is lost when a string is con-
       verted from any other encoding to UCS and back.

       UCS contains the characters required to represent almost
       all known languages. Besides the many languages that use
       extensions of the Latin script, this includes the
       following scripts and languages: Greek, Cyrillic, Hebrew,
       Arabic, Armenian, Georgian, Japanese, Chinese, Hiragana,
       Katakana, Korean, Hangul, Devanagari, Bengali, Gurmukhi,
       Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai,
       Lao, Bopomofo, and a number of others. Work is under way
       to include further scripts such as Tibetan, Khmer, Runic,
       Ethiopian, Hieroglyphics, various Indo-European
       languages, and many others. For most of these latter
       scripts, it was not yet clear how they could best be
       encoded when the standard was published in 1993. In
       addition to the characters required by these scripts, a
       large number of graphical, typographical, mathematical,
       and scientific symbols have been included, such as those
       provided by TeX, PostScript, MS-DOS, Macintosh,
       Videotext, OCR, and many word processing systems, as well
       as special codes that guarantee round-trip compatibility
       with all other existing character set standards.

       The UCS standard (ISO 10646) describes a 31-bit character
       set architecture. Today, however, only the first 65534
       code positions (0x0000 to 0xfffd), which are called the
       Basic Multilingual Plane (BMP), have been assigned
       characters, and it is expected that only very exotic
       characters (e.g. Hieroglyphics) for special scientific
       purposes will ever get a place outside this 16-bit BMP.
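
       The two ranges named above can be written down as simple
       range checks. The following is only a sketch; the
       function names ucs_is_valid and ucs_in_bmp are made up
       for this illustration and are not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if c fits into the 31-bit UCS
 * code space described in the text (0x00000000 to 0x7fffffff). */
int ucs_is_valid(uint32_t c)
{
    return c <= UINT32_C(0x7fffffff);
}

/* Hypothetical helper: nonzero if c lies in the BMP range
 * 0x0000 to 0xfffd given in the text. */
int ucs_in_bmp(uint32_t c)
{
    return c <= UINT32_C(0xfffd);
}
```

       A program that restricts itself to the BMP could use
       ucs_in_bmp to reject any code that does not fit into 16
       bits.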

       The UCS characters 0x0000 to 0x007f are identical to those
       of the classic US-ASCII character set and  the  characters
       in  the  range  0x0000 to 0x00ff are identical to those in
       the ISO 8859-1 Latin-1 character set.
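
       Because of this identity, converting between ISO 8859-1
       and UCS needs no table. The sketch below assumes 16-bit
       BMP codes held in a uint16_t; the function names are
       invented for this example.

```c
#include <stdint.h>

/* Hypothetical helper: widen one ISO 8859-1 byte to its UCS
 * code. The numeric value is unchanged. */
uint16_t latin1_to_ucs(unsigned char byte)
{
    return (uint16_t)byte;
}

/* Hypothetical helper: narrow a UCS code to ISO 8859-1.
 * Returns -1 if the code lies outside 0x0000 to 0x00ff and
 * therefore has no Latin-1 equivalent. */
int ucs_to_latin1(uint16_t c)
{
    return c <= 0x00ff ? (int)c : -1;
}

/* Hypothetical helper: nonzero if c is in the US-ASCII range
 * 0x0000 to 0x007f. */
int ucs_is_ascii(uint16_t c)
{
    return c <= 0x007f;
}
```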


COMBINING CHARACTERS

       Some code points in UCS have been assigned to combining
       characters. These are similar to the non-spacing accent
       keys on a typewriter. A combining character just adds an
       accent to the previous character. The most important
       accented characters have codes of their own in UCS;
       however, the combining character mechanism allows accents
       and other diacritical marks to be added to any character.
       Combining characters always follow the character they
       modify. For example, the German character Umlaut-A
       ("Latin capital letter A with diaeresis") can be
       represented either by the precomposed UCS code 0x00c4 or
       as the combination of a normal "Latin capital letter A"
       followed by a "combining diaeresis": 0x0041 0x0308.
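
       The relation between the two representations can be
       sketched as a table lookup. The table below is a tiny,
       hand-picked sample covering only the German umlauts, and
       the names are invented for this example; a real
       implementation would need a far larger table derived
       from the standard.

```c
#include <stddef.h>
#include <stdint.h>

struct composition {
    uint16_t base;         /* base character */
    uint16_t mark;         /* combining character that follows it */
    uint16_t precomposed;  /* equivalent single UCS code */
};

/* Illustrative sample only: letters with combining diaeresis. */
static const struct composition sample_table[] = {
    { 0x0041, 0x0308, 0x00c4 },  /* A */
    { 0x004f, 0x0308, 0x00d6 },  /* O */
    { 0x0055, 0x0308, 0x00dc },  /* U */
    { 0x0061, 0x0308, 0x00e4 },  /* a */
    { 0x006f, 0x0308, 0x00f6 },  /* o */
    { 0x0075, 0x0308, 0x00fc },  /* u */
};

/* Hypothetical helper: return the precomposed code for the
 * pair (base, mark), or 0 if the pair is not in the sample
 * table. */
uint16_t ucs_compose(uint16_t base, uint16_t mark)
{
    size_t i;

    for (i = 0; i < sizeof sample_table / sizeof sample_table[0]; i++)
        if (sample_table[i].base == base && sample_table[i].mark == mark)
            return sample_table[i].precomposed;
    return 0;
}
```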


IMPLEMENTATION LEVELS

       As not all systems are expected to support advanced mecha-
       nisms like combining characters, ISO 10646  specifies  the
       following three implementation levels of UCS:

       Level 1  Combining  characters  and Hangul Jamo characters
                (a special,  more  complicated  encoding  of  the
                Korean  script,  where Hangul syllables are coded
                as two or three subcharacters) are not supported.

       Level 2  Like level 1, except that some combining
                characters are allowed in some scripts (e.g. for
                Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
                Gujarati, Oriya, Tamil, Telugu, Kannada,
                Malayalam, Thai and Lao).

       Level 3  All UCS characters are supported.
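
       A rough level 1 check might look like the sketch below.
       It is deliberately incomplete: it tests only the generic
       combining diacritical marks block (0x0300 to 0x036f) and
       the Hangul Jamo block (0x1100 to 0x11ff), whereas the
       standard lists further combining characters in
       individual scripts. The function names are invented for
       this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if c falls into one of the two
 * blocks this sketch knows about. NOT a complete list of the
 * characters excluded from level 1. */
int ucs_is_combining_or_jamo(uint16_t c)
{
    return (c >= 0x0300 && c <= 0x036f)    /* combining diacritical marks */
        || (c >= 0x1100 && c <= 0x11ff);   /* Hangul Jamo */
}

/* Hypothetical helper: nonzero if none of the n codes in s is
 * rejected by the partial check above. */
int ucs_looks_like_level1(const uint16_t *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ucs_is_combining_or_jamo(s[i]))
            return 0;
    return 1;
}
```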

       The Unicode 1.1 standard published by the Unicode
       Consortium contains exactly the UCS Basic Multilingual
       Plane at implementation level 3, as described in ISO
       10646. Unicode 1.1 also adds semantic definitions for
       some characters to the definitions of ISO 10646.


UNICODE UNDER LINUX

       Under Linux, only the BMP at implementation level 1 should
       be used at the moment, in order to keep the implementation
       complexity of combining characters low. The higher
       implementation levels are more suitable for special word
       processing formats than as a generic system character
       set. On Linux, the C type wchar_t is a signed 32-bit
       integer type, and its values are interpreted as UCS-4
       codes.
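
       The width and signedness of wchar_t can be inspected on a
       given system with a few lines of C. This is a sketch
       only; the function names are invented for this example,
       and the results may differ between platforms and
       compilers.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: 1 if wchar_t is a signed type here,
 * 0 if it is unsigned. */
int wchar_is_signed(void)
{
    return (wchar_t)-1 < (wchar_t)0;
}
```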

       The locale setting specifies whether the system character
       encoding is, for example, UTF-8 or ISO 8859-1. Library
       functions like wctomb, mbtowc, or wprintf can be used to
       transform the internal wchar_t characters and strings into
       the system character encoding and back.
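
       The following sketch shows one way to use wctomb and
       mbtowc for a single character. The helper names are
       invented for this example. The conversion follows
       whatever locale is currently in effect, so without a
       prior call to setlocale it uses the default "C" locale.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes wc occupies in the
 * current locale's encoding, or -1 if it cannot be
 * represented there. */
int wc_mb_length(wchar_t wc)
{
    char buf[MB_LEN_MAX];

    return wctomb(buf, wc);
}

/* Hypothetical helper: 1 if wc survives conversion to the
 * locale encoding and back, 0 otherwise. The null character
 * is reported as 0 because mbtowc returns 0 for it. */
int wc_roundtrip_ok(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    wchar_t back;
    int n = wctomb(buf, wc);

    if (n <= 0)
        return 0;
    return mbtowc(&back, buf, (size_t)n) == n && back == wc;
}
```

       A program would normally call setlocale(LC_CTYPE, "")
       from <locale.h> first, so that the user's locale is
       used. In a UTF-8 locale, wc_mb_length(0x00c4) would then
       be expected to return 2.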


PRIVATE AREA

       In the BMP, the range 0xe000 to 0xf8ff will never be
       assigned any characters by the standard and is reserved
       for private usage. For the Linux community, this private
       area has been subdivided further into the range 0xe000 to
       0xefff, which can be used individually by any end-user,
       and the Linux zone in the range 0xf000 to 0xf8ff, where
       extensions are coordinated among all Linux users. The
       registry of the characters assigned to the Linux zone is
       currently maintained by H. Peter Anvin
       <Peter.Anvin@linux.org>, Yggdrasil Computing, Inc. It
       contains some DEC VT100 graphics characters missing in
       Unicode, gives direct access to the characters in the
       console font buffer, and contains the characters used by
       a few advanced scripts like Klingon.
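
       The subdivision described above can be expressed as a
       small classification function. This is a sketch; the
       enum and function names are invented for this example.

```c
#include <stdint.h>

enum private_zone {
    NOT_PRIVATE,     /* outside 0xe000 to 0xf8ff */
    END_USER_ZONE,   /* 0xe000 to 0xefff */
    LINUX_ZONE       /* 0xf000 to 0xf8ff */
};

/* Hypothetical helper: classify a BMP code according to the
 * ranges given in the text. */
enum private_zone ucs_private_zone(uint16_t c)
{
    if (c >= 0xe000 && c <= 0xefff)
        return END_USER_ZONE;
    if (c >= 0xf000 && c <= 0xf8ff)
        return LINUX_ZONE;
    return NOT_PRIVATE;
}
```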


LITERATURE

       * Information technology - Universal Multiple-Octet  Coded
         Character  Set  (UCS)  -  Part 1: Architecture and Basic
         Multilingual Plane.  International Standard ISO 10646-1,
         International  Organization for Standardization, Geneva,
         1993.

         This is the official specification of UCS.  Pretty offi-
         cial,  pretty  thick, and pretty expensive. For ordering
         information, check www.iso.ch.

       * The Unicode Standard - Worldwide Character Encoding Ver-
         sion 1.0.  The Unicode Consortium, Addison-Wesley, Read-
         ing, MA, 1991.

         Unicode 1.1.4 is already available. The changes to the
         1.0 book are available from ftp.unicode.org. Unicode 2.0
         will again be published as a book in 1996.

       * S. Harbison, G. Steele. C - A Reference  Manual.  Fourth
         edition,  Prentice  Hall,  Englewood  Cliffs, 1995, ISBN
         0-13-326224-3.

         A good reference book about the C programming language.
         The fourth edition now also covers the 1994 Amendment 1
         to the ISO C standard (ISO/IEC 9899:1990), which adds a
         large number of new C library functions for handling
         wide character sets.


BUGS

       At the time when this man page was written, the Linux libc
       support for UCS was far from complete.


AUTHOR

       Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>


SEE ALSO

       utf-8(7)



Linux                       1995-12-27                          3