UTF-32/UCS-4

UTF-32/UCS-4

UTF-32 (or UCS-4) is a protocol for encoding Unicode characters that uses exactly 32 bits for each Unicode code point. All other "Unicode transformation formats" use variable-length encodings.

Because UTF-32 uses 4 bytes for every character it is quite space inefficient. Specifically, non-BMP characters are so rare in most texts, they may as well be considered non-existent for sizing discussions, making UTF-32 between twice and four times the size of other encodings.

Though a fixed number of bytes per code point seems convenient, it is not used as much as the other Unicode encodings. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing.

History

The original ISO 10646 standard defines a 31-bit "encoding form" called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly "code value" in the "code space" of integers between 0 and hexadecimal 7FFFFFFF.

UCS-4 is sufficient to represent all of the Unicode code space, which has 1114112 (= 220+216) code points and therefore requires only up to hexadecimal 10FFFF. Some people consider it wasteful to reserve such a large code space for mapping a relatively small set of code points, so a new encoding form, UTF-32, was proposed. UTF-32 is a subset of UCS-4 that uses 32-bit code values only in the 0 to 10FFFF code space.

UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.

Accordingly UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

See also

* Comparison of Unicode encodings

External links

* [http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf The Unicode Standard 4.1, chapter 3] - formally defines UTF-32 in §3.10, D43-D45
* [http://www.unicode.org/reports/tr19/tr19-9.html Unicode Standard Annex #19] - formally defined UTF-32 for Unicode 3.x (March 2001; last updated March 2002)
* [http://mail.apps.ietf.org/ietf/charsets/msg01095.html Registration of new charsets: UTF-32, UTF-32BE, UTF-32LE] - announcement of UTF-32 being added to the IANA charset registry (April 2002)


Wikimedia Foundation. 2010.

Игры ⚽ Нужен реферат?

Look at other dictionaries:

  • UTF-16/UCS-2 — In computing, UTF 16 (16 bit Unicode Transformation Format) is a variable length character encoding for Unicode, capable of encoding the entire Unicode repertoire.The encoding form maps each character to a sequence of 16 bit words. Characters are …   Wikipedia

  • UTF-7 — (7 bit Unicode Transformation Format) is a variable length character encoding that was proposed for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in… …   Wikipedia

  • UTF-8 — (UCS transformation format 8 bits) est un format de codage de caractères. Chaque caractère ou graphème est représenté dans le répertoire du jeu universel de caractères sous la forme d’une suite d’un ou plusieurs « caractères abstraits » …   Wikipédia en Français

  • Utf-8 — Wikipédia …   Wikipédia en Français

  • UTF-8 — (UCS Transformation Format  8 bit[1]) is a multibyte character encoding for Unicode. Like UTF 16 and UTF 32, UTF 8 can represent every character in the Unicode character set. Unlike them, it is backward compatible with ASCII and avoids the… …   Wikipedia

  • UCS-4 — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

  • UTF-32LE — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

  • UTF-32 Little Endian — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

  • UTF — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

  • UTF-7 — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”