ISO/IEC 8859

ISO/IEC 8859 is a joint ISO and IEC standard for 8-bit character encodings for use by computers. The standard is divided into numbered, separately published parts, such as ISO/IEC 8859-1 and ISO/IEC 8859-2, each of which may be informally referred to as a standard in itself. As of 2006 there are 15 parts, excluding the abandoned ISO/IEC 8859-12.

ISO/IEC 8859 parts 1, 2, 3 and 4 are also Ecma International ECMA-94.

Introduction

While the bit patterns of the 95 printable ASCII characters are sufficient to exchange information in modern English, most other languages that use the Latin alphabet need additional symbols not covered by ASCII, such as "ß" (German), "ñ" (Spanish), "å" (Swedish and other Nordic languages) and "ő" (Hungarian). ISO/IEC 8859 sought to remedy this problem by using the eighth bit of an 8-bit byte to allow positions for another 128 characters. (This bit was previously used for data transmission protocol information, or was left unused.) However, more characters were needed than could fit in a single 8-bit character encoding, so several mappings were developed, including at least 10 just to cover the Latin script.
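The effect of the eighth bit can be illustrated with a minimal Python sketch. It assumes Python's built-in codec names `iso8859_1` and `iso8859_2` as stand-ins for parts 1 and 2 of the standard; the helper name `single_byte` is hypothetical and only for illustration.

```python
# Characters named in the text. The codec names are Python's aliases for
# ISO/IEC 8859-1 and ISO/IEC 8859-2 (an assumption of this sketch).
SAMPLES = ["ß", "ñ", "å", "ő"]


def single_byte(ch, codec):
    """Return the one-byte encoding of ch in codec, or None if the
    character has no position in that part."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


for ch in SAMPLES:
    # Every successful result is a single byte with the eighth bit set.
    print(ch, single_byte(ch, "iso8859_1"), single_byte(ch, "iso8859_2"))
```

No single part holds all four sample characters, which is the reason several mappings were needed.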

The ISO/IEC 8859-n encodings only contain printable characters, and were designed to be used in conjunction with control characters mapped to the unassigned bytes. To this end, a series of encodings registered with the IANA add the C0 control set (control characters mapped to bytes 0 to 31) from ISO 646 and the C1 control set (control characters mapped to bytes 128 to 159) from ISO 6429, resulting in full 8-bit character maps with most, if not all, bytes assigned. These sets have ISO-8859-n as their preferred MIME name or, in cases where a preferred MIME name isn't specified, their canonical name. Many people use the terms ISO/IEC 8859-n and ISO-8859-n interchangeably. ISO/IEC 8859-11 did not get such a charset assigned, presumably because it was almost identical to TIS 620.
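The layout of a full 8-bit map can be sketched as a small classifier. The region labels and the function name `byte_region` are illustrative assumptions, not terms defined by the standard; the ranges follow the usual convention that C0 occupies 0x00 to 0x1F, DEL sits at 0x7F, and C1 occupies 0x80 to 0x9F.

```python
def byte_region(value):
    """Classify a byte value by the region it falls in.

    Labels are illustrative: "C0" and "C1" are the control sets,
    "ASCII" is the 95 printable ASCII characters, "DEL" is byte 0x7F,
    and "8859" is the range each ISO/IEC 8859 part assigns itself.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("not an 8-bit value")
    if value <= 0x1F:
        return "C0"
    if value <= 0x7E:
        return "ASCII"
    if value == 0x7F:
        return "DEL"
    if value <= 0x9F:
        return "C1"
    return "8859"


# Count how many byte values fall in each region.
counts = {}
for v in range(256):
    counts[byte_region(v)] = counts.get(byte_region(v), 0) + 1
print(counts)
```

The count for the last region comes out to 96, the number of positions each part has available for its own characters.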

Characters

The ISO/IEC 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO/IEC 8859 standards, or use Unicode instead.

As a rule of thumb, if a character or symbol was not already part of a widely used data-processing character set and was also not usually provided on typewriter keyboards for a national language, it was not included. Hence the directional double quotation marks "«" and "»" used for some European languages were included, but not the directional double quotation marks "“" and "”" used for English and some other languages. French did not get its "œ" and "Œ" ligatures because they could be typed as "oe". "Ÿ", needed for all-caps text, was left out as well. These characters were, however, included later in ISO/IEC 8859-15, which also introduced the new euro sign character "€". Likewise, Dutch did not get the "ij" and "IJ" letters, because Dutch speakers had become used to typing these as two letters instead. Romanian did not initially get its "Ș/ș" and "Ț/ț" (with comma below) letters, because these letters were initially unified with "Ş/ş" and "Ţ/ţ" (with cedilla) by the Unicode Consortium, which considered the shapes with comma below to be glyph variants of the shapes with cedilla. However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO/IEC 8859-16.
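Which characters made it into which part can be checked with a short Python sketch. It assumes Python's built-in `iso8859_N` codecs reflect the published parts; the helper name `encodable` is hypothetical.

```python
def encodable(ch, codec):
    """Return True if ch has a position in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# (character, part where the text says it is absent, part where it is present)
CASES = [
    ("œ", "iso8859_1", "iso8859_15"),
    ("Ÿ", "iso8859_1", "iso8859_15"),
    ("€", "iso8859_1", "iso8859_15"),
    ("ș", "iso8859_2", "iso8859_16"),  # s with comma below
    ("ț", "iso8859_2", "iso8859_16"),  # t with comma below
]

for ch, absent_in, present_in in CASES:
    print(ch, encodable(ch, absent_in), encodable(ch, present_in))

# Guillemets are in part 1; curly double quotes are not.
print(encodable("«", "iso8859_1"), encodable("“", "iso8859_1"))
```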

Most of the ISO/IEC 8859 encodings provide diacritic marks required for various European languages using the Latin script. Others provide non-Latin alphabets: Greek, Cyrillic, Hebrew, Arabic and Thai. Most of the encodings contain only spacing characters, although the Thai, Hebrew, and Arabic ones also contain combining characters. However, the standard makes no provision for the scripts of East Asian languages ("CJK"), as their ideographic writing systems require many thousands of code points. Although it uses Latin-based characters, Vietnamese does not fit into 96 positions (without using combining diacritics) either. Each Japanese syllabic alphabet (hiragana or katakana, see Kana) would fit, but like several other alphabets of the world they are not encoded in the ISO/IEC 8859 system.

The Parts of ISO/IEC 8859

ISO/IEC 8859 is divided into the following parts:

Each part of ISO/IEC 8859 is designed to support languages that often borrow from each other, so the characters needed by each language are usually accommodated by a single part. However, there are some characters and language combinations that are not accommodated without transcriptions. Efforts were made to make conversions as smooth as possible. For example, German has all of its seven special characters at the same positions in all Latin variants (1-4, 9-10, 13-16), and in many positions the characters only differ in the diacritics between the sets. In particular, variants 1-4 were designed jointly, and have the property that every encoded character appears either at a given position or not at all.
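The claim about German can be checked mechanically. The sketch below assumes the seven special characters are Ä, Ö, Ü, ä, ö, ü and ß (the text does not list them) and relies on Python's built-in `iso8859_N` codecs.

```python
# Latin variants of the standard, as listed in the text.
LATIN_PARTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]

# Assumed set of German special characters (not enumerated in the text).
GERMAN = "ÄÖÜäöüß"


def positions(part):
    """Encode the German characters with the codec for the given part."""
    return GERMAN.encode(f"iso8859_{part}")


reference = positions(1)
mismatches = [p for p in LATIN_PARTS if positions(p) != reference]

print(reference.hex(" "))  # byte positions shared by the variants
print(mismatches)          # parts that disagree with part 1, if any
```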

Table

Position 0xA0 always holds the non-breaking space, and 0xAD mostly holds the soft hyphen, which is only displayed at line breaks. Other empty fields are either unassigned or cannot be displayed by the system in use.
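Both positions can be inspected across the parts with a minimal Python sketch, assuming Python's built-in `iso8859_N` codecs match the published code charts. The helper name `decode_byte` is hypothetical.

```python
# All published parts; part 12 was abandoned.
PARTS = [n for n in range(1, 17) if n != 12]


def decode_byte(part, value):
    """Decode one byte with the codec for the given part, or return
    None if the position is unassigned in that codec."""
    try:
        return bytes([value]).decode(f"iso8859_{part}")
    except UnicodeDecodeError:
        return None


# 0xA0 should be NO-BREAK SPACE (U+00A0) in every part.
nbsp_everywhere = all(decode_byte(p, 0xA0) == "\u00a0" for p in PARTS)

# Parts where 0xAD is something other than SOFT HYPHEN (U+00AD).
no_soft_hyphen = [p for p in PARTS if decode_byte(p, 0xAD) != "\u00ad"]

print(nbsp_everywhere, no_soft_hyphen)
```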

The ISO/IEC 8859-7:2003 and ISO/IEC 8859-8:1999 editions introduced new additions. LRM stands for left-to-right mark (U+200E) and RLM stands for right-to-left mark (U+200F).

Relationship to Unicode and the UCS

Since 1991, the Unicode Consortium has been working with ISO and IEC to develop the Unicode Standard and ISO/IEC 10646, the Universal Character Set (UCS), in tandem. This pair of standards was created to unify the ISO/IEC 8859 character repertoire, among others, by assigning each character, initially, to a 16-bit code value, with some code values left unassigned. Over time, their models adapted to map characters to abstract numeric code points rather than fixed bit-width values, so that more code points and encoding methods could be supported.

Unicode and ISO/IEC 10646 currently assign about 100,000 characters to a code space consisting of over a million code points, and they define several standard encodings that are capable of representing every available code point. The standard encodings of Unicode and the UCS use sequences of one to four 8-bit code values (UTF-8), sequences of one or two 16-bit code values (UTF-16), or one 32-bit code value (UTF-32 or UCS-4). There is also an older encoding that uses one 16-bit code value (UCS-2), capable of representing one-seventeenth of the available code points. Of these encoding forms, only UTF-8's byte sequences are in a fixed order; the others are subject to platform-dependent byte ordering issues that may be addressed via special codes or indicated via out-of-band means.
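The code-unit counts and the byte-order issue can be seen in a short Python sketch. The helper name `code_units` is hypothetical; the codecs used are Python's standard UTF codecs.

```python
def code_units(text):
    """Count the code units needed for text in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }


for ch in ["A", "ß", "€", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# The same 16-bit code unit serialized in the two possible byte orders.
little = "€".encode("utf-16-le")
big = "€".encode("utf-16-be")
print(little.hex(" "), big.hex(" "))
```

A character outside the first 65,536 code points, such as U+1F600, needs two 16-bit units in UTF-16, which is why a single 16-bit value (UCS-2) cannot reach it.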

Newer editions of ISO/IEC 8859 express characters in terms of their Unicode/UCS names and the "U+nnnn" notation, effectively making each part of ISO/IEC 8859 a Unicode/UCS character encoding scheme that maps a very small subset of the UCS to single 8-bit bytes. The first 256 characters in Unicode and the UCS are identical to those in ISO-8859-1.
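The identity between the first 256 code points and the part 1 charset can be verified directly. This sketch assumes Python's `iso8859_1` codec, which covers all 256 byte values including the control positions.

```python
# Decode every possible byte value with the part 1 codec.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859_1")

# Each byte value should map to the code point with the same number.
identical = [ord(c) for c in decoded] == list(range(256))
print(identical)

# Other parts do not share this property; in part 15, for example,
# byte 0xA4 is the euro sign rather than U+00A4.
euro = bytes([0xA4]).decode("iso8859_15")
print(f"U+{ord(euro):04X}")
```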

Single-byte character sets including the parts of ISO/IEC 8859 and derivatives of them were favored throughout the 1990s, having the advantages of being well-established and more easily implemented in software: the equation of one byte to one character is simple and adequate for most single-language applications, and there are no combining characters or variant forms.

As the relative cost, in computing resources, of using more than one byte per character began to diminish, programming languages and operating systems added native support for Unicode alongside their system of code pages. Windows NT was quite an early adopter of Unicode. However, Unicode support in Windows 9x required either linking with a special compatibility layer or restricting a design to a very small subset of the Windows API, which discouraged its use. As Unicode-enabled operating systems became more widespread, ISO/IEC 8859 and other legacy encodings became less popular. While remnants of ISO/IEC 8859 and single-byte character models remain entrenched in many operating systems, programming languages, data storage systems, networking applications, display hardware, and end-user application software, most modern computing applications use Unicode internally and rely on conversion tables to map to and from other encodings when necessary.
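Converting between a legacy encoding and Unicode typically amounts to a decode step followed by an encode step. The following is a minimal sketch using Python's built-in conversion tables; the function name `transcode` and its parameters are hypothetical.

```python
def transcode(data, source, target="utf-8"):
    """Convert bytes from a legacy single-byte encoding to another
    encoding by going through Unicode text internally."""
    text = data.decode(source)   # legacy bytes -> Unicode code points
    return text.encode(target)   # Unicode code points -> target bytes


# Byte 0xF5 is "ő" in part 2 (Latin-2).
legacy = b"\xf5"
converted = transcode(legacy, "iso8859_2")
print(converted.hex(" "))

# The reverse direction only works for characters the legacy part contains.
round_trip = transcode(converted, "utf-8", "iso8859_2")
print(round_trip == legacy)
```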

Development status

The ISO/IEC 8859 standard was maintained by ISO/IEC Joint Technical Committee 1, Subcommittee 2, Working Group 3 (ISO/IEC JTC 1/SC 2/WG 3). In June 2004, WG 3 disbanded, and maintenance duties were transferred to SC 2. The standard is not currently being updated, as the Subcommittee's only remaining working group, WG 2, is concentrating on development of ISO/IEC 10646.

References

* Published versions of each part of ISO/IEC 8859 are available, for a fee, from the [http://www.iso.ch/iso/en/stdsdevelopment/tc/tclist/TechnicalCommitteeStandardsListPage.TechnicalCommitteeStandardsList?COMMID=23 ISO catalogue site] and from the [http://webstore.iec.ch/webstore/webstore.nsf/searchview/?searchView=&SearchOrder=4&SearchWV=TRUE&SearchMax=1000&Submit=OK&Query=ISO/IEC%208859 IEC Webstore].

* PDF versions of the final drafts of some parts of ISO/IEC 8859, as submitted for review and publication by ISO/IEC JTC 1/SC 2/WG 3, are available at the [http://anubis.dkuug.dk/JTC1/SC2/WG3/ WG 3 web site]:
** [http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n411.pdf ISO/IEC 8859-1:1998] - 8-bit single-byte coded graphic character sets, Part 1: Latin alphabet No. 1 "(draft dated February 12 1998, published April 15 1998)"
** [http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n413.pdf ISO/IEC 8859-4:1998] - 8-bit single-byte coded graphic character sets, Part 4: Latin alphabet No. 4 "(draft dated February 12 1998, published July 1 1998)"
** [http://anubis.dkuug.dk/jtc1/sc2/open/02n3329.pdf ISO/IEC 8859-7:1999] - 8-bit single-byte coded graphic character sets, Part 7: Latin/Greek alphabet "(draft dated June 10 1999; superseded by ISO/IEC 8859-7:2003, published October 10 2003)"
** [http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n415.pdf ISO/IEC 8859-10:1998] - 8-bit single-byte coded graphic character sets, Part 10: Latin alphabet No. 6 "(draft dated February 12 1998, published July 15 1998)"
** [http://anubis.dkuug.dk/jtc1/sc2/open/02n3333.pdf ISO/IEC 8859-11:1999] - 8-bit single-byte coded graphic character sets, Part 11: Latin/Thai character set "(draft dated June 22 1999; superseded by ISO/IEC 8859-11:2001, published 15 December 2001)"
** [http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n451.pdf ISO/IEC 8859-13:1998] - 8-bit single-byte coded graphic character sets, Part 13: Latin alphabet No. 7 "(draft dated April 15 1998, published October 15 1998)"
** [http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n404.pdf ISO/IEC 8859-15:1998] - 8-bit single-byte coded graphic character sets, Part 15: Latin alphabet No. 9 "(draft dated August 1 1997; superseded by ISO/IEC 8859-15:1999, published March 15 1999)"
** [http://anubis.dkuug.dk/jtc1/sc2/open/02n3389.pdf ISO/IEC 8859-16:2000] - 8-bit single-byte coded graphic character sets, Part 16: Latin alphabet No. 10 "(draft dated November 15 1999; superseded by ISO/IEC 8859-16:2001, published July 15 2001)"

* ECMA standards, which in intent correspond exactly to the ISO/IEC 8859 character set standards, can be found at:
** [http://www.ecma-international.org/publications/standards/Ecma-094.htm Standard ECMA-94] : 8-Bit Single Byte Coded Graphic Character Sets - Latin Alphabets No. 1 to No. 4 "2nd edition (June 1986)"
** [http://www.ecma-international.org/publications/standards/Ecma-113.htm Standard ECMA-113] : 8-Bit Single-Byte Coded Graphic Character Sets - Latin/Cyrillic Alphabet "3rd edition (December 1999)"
** [http://www.ecma-international.org/publications/standards/Ecma-114.htm Standard ECMA-114] : 8-Bit Single-Byte Coded Graphic Character Sets - Latin/Arabic Alphabet "2nd edition (December 2000)"
** [http://www.ecma-international.org/publications/standards/Ecma-118.htm Standard ECMA-118] : 8-Bit Single-Byte Coded Graphic Character Sets - Latin/Greek Alphabet "(December 1986)"
** [http://www.ecma-international.org/publications/standards/Ecma-121.htm Standard ECMA-121] : 8-Bit Single-Byte Coded Graphic Character Sets - Latin/Hebrew Alphabet "2nd edition (December 2000)"
** [http://www.ecma-international.org/publications/standards/Ecma-128.htm Standard ECMA-128] : 8-Bit Single-Byte Coded Graphic Character Sets - Latin Alphabet No. 5 "2nd edition (December 1999)"
** [http://www.ecma-international.org/publications/standards/Ecma-144.htm Standard ECMA-144] : 8-Bit Single-Byte Coded Character Sets - Latin Alphabet No. 6 "3rd edition (December 2000)"

* ISO/IEC 8859-1 to Unicode [ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859 mapping tables] as plain text files are at the Unicode FTP site.

* Informal descriptions and code charts for most ISO/IEC 8859 standards are available in [http://czyborra.com/charsets/iso8859.html ISO/IEC 8859 Alphabet Soup] [http://www.lysator.liu.se/~jmo/czyborra_index.html (Mirror)]


Wikimedia Foundation. 2010.
