Unicode Specials

Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 code points, 5 are assigned as of Unicode 5.0:

* U+FFF9 "INTERLINEAR ANNOTATION ANCHOR", marks the start of annotated text
* U+FFFA "INTERLINEAR ANNOTATION SEPARATOR", marks the start of annotating text
* U+FFFB "INTERLINEAR ANNOTATION TERMINATOR", marks the end of annotating text
* U+FFFC "OBJECT REPLACEMENT CHARACTER", a placeholder in the text for another unspecified object, for example in a compound document
* U+FFFD "REPLACEMENT CHARACTER", used to replace an unknown or unprintable character
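As a rough illustration, the assigned code points of the block can be listed with Python's standard `unicodedata` module. This is only a sketch: the helper name `assigned_specials` is made up for this example, and the result reflects the Unicode version bundled with the Python interpreter rather than Unicode 5.0 specifically.

```python
import unicodedata

# Bounds of the Specials block as given in the text.
SPECIALS_START, SPECIALS_END = 0xFFF0, 0xFFFF

def assigned_specials():
    """Return {code point: name} for every named character in the block.

    Hypothetical helper for illustration; unassigned code points and
    noncharacters have no name and are skipped.
    """
    named = {}
    for cp in range(SPECIALS_START, SPECIALS_END + 1):
        name = unicodedata.name(chr(cp), None)  # None when there is no name
        if name is not None:
            named[cp] = name
    return named

for cp, name in assigned_specials().items():
    print(f"U+{cp:04X} {name}")
```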

U+FFFE and U+FFFF are not unassigned in the usual sense, but are 'guaranteed not to be a Unicode character at all'. They can be used to guess a text's encoding scheme, since any text containing them is by definition not correctly encoded Unicode text. U+FEFF is Unicode's byte-order mark, named "ZERO WIDTH NO-BREAK SPACE" (so that its inclusion in text goes unnoticed). If this character is read in the wrong byte order, it appears as 0xFFFE, which is not a legal Unicode character.
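The byte-order check described above can be sketched as follows. The function name `guess_utf16_byte_order` is hypothetical, and the sketch assumes the data is known to be UTF-16; it does not try to distinguish UTF-16 from UTF-32 or from text without a byte-order mark.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading byte-order mark.

    Hypothetical helper for illustration only; assumes UTF-16 input.
    """
    if data[:2] == b"\xfe\xff":
        return "big-endian"      # bytes are in the order of U+FEFF
    if data[:2] == b"\xff\xfe":
        return "little-endian"   # U+FEFF stored with its bytes swapped
    return "unknown"             # no byte-order mark present

# Reading the big-endian mark with the wrong byte order yields 0xFFFE,
# which is why a reader can tell that it guessed wrong.
wrong_order = int.from_bytes(b"\xfe\xff", "little")
print(hex(wrong_order))
```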

Replacement character

The replacement character � (often displayed as a black diamond with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the "Specials" table. It is used to indicate problems when a system such as a text parser is unable to decode a stream of data to a correct symbol.

Consider a text file created with Notepad in Microsoft Windows and saved with Windows-1252 encoding (a code page that Microsoft usually calls "ANSI"). The file contains the German word "für". Its three letters correspond to the byte values 0x66 0xFC 0x72.

This file is now opened in a Linux environment. Many Linux text editors now use UTF-8 as the default encoding. As the first byte (0x66) is within the range 0x00–0x7F, UTF-8 correctly interprets it as an "f". The second byte (0xFC) is binary 1111 1100, which does not start a valid sequence in UTF-8 encoded data. A text editor could therefore insert the replacement character at this point to "warn" the user that something went wrong. The last byte (0x72) is again within the range 0x00–0x7F and can be decoded correctly. The whole string now looks like this: "f�r".
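The decoding step can be reproduced with Python's built-in codecs. This is a minimal sketch that assumes the editor behaves like Python's `errors="replace"` handler; real editors may handle invalid bytes differently.

```python
# The three bytes written by the Windows-1252 editor for "für".
raw = bytes([0x66, 0xFC, 0x72])

# Decoding with the encoding actually used recovers the word.
original = raw.decode("cp1252")

# A strict UTF-8 decoder rejects the byte 0xFC outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder substitutes U+FFFD for the undecodable byte.
lenient = raw.decode("utf-8", errors="replace")
print(original, strict_failed, lenient)
```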

If this file is now saved as UTF-8, the text file data will look like this: 0x66 0xEF 0xBF 0xBD 0x72. The "new" data, 0xEF 0xBF 0xBD, is the correct UTF-8 encoding of Unicode code point U+FFFD. Therefore, the original 0xFC, "ü", has been replaced with 0xEF 0xBF 0xBD, "�".
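The saved bytes can be checked directly. The variable names below are chosen for this example only.

```python
# The string the editor holds after the failed decode: f, U+FFFD, r.
damaged = "f\ufffdr"

# Saving as UTF-8 turns U+FFFD into the three bytes EF BF BD.
saved = damaged.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in saved))
```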

Back in the Windows environment, this modified text file is opened with Notepad using Windows-1252 encoding. As 0xEF = ï, 0xBF = ¿ and 0xBD = ½, the whole text file is displayed in the editor like this: "fï¿½r".
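Reopening the modified file as Windows-1252 can be simulated in the same way. The sketch assumes Python's `cp1252` codec matches what the Windows editor does for these particular bytes.

```python
# Bytes of the file after it was saved as UTF-8 with U+FFFD inside.
saved = bytes([0x66, 0xEF, 0xBF, 0xBD, 0x72])

# Each of the three bytes of U+FFFD is now read as its own character.
reopened = saved.decode("cp1252")
print(reopened)
```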

Once data has been transformed as in the example above (different symbols replaced with a single replacement character), there is no trivial way to get back the original data other than manually finding and restoring the correct character from context.
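The loss of information can be made concrete with two different inputs that end up identical. The helper `open_as_utf8` and the second string "fär" are arbitrary choices for this illustration, not part of the original example.

```python
def open_as_utf8(raw: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD.

    Hypothetical stand-in for an editor that defaults to UTF-8.
    """
    return raw.decode("utf-8", errors="replace")

# Two different Windows-1252 files...
first = open_as_utf8("für".encode("cp1252"))
second = open_as_utf8("fär".encode("cp1252"))

# ...become the same string, so the original letter cannot be
# recovered from the result alone.
print(first == second)
```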

Some websites incorrectly declare their encoding as UTF-8 when they actually use, for example, Windows-1252. In some web browsers (such as Firefox), this results in all umlauts, ß and some other characters in the upper range of Windows-1252 (with the most significant bit set to 1) being displayed as � instead. Other web browsers, such as newer versions of Internet Explorer, try to work out which code page was meant to be used.
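One very simple way to guess the intended code page is to try the declared encoding strictly and fall back to another one on failure. The sketch below is an assumption made for illustration; it is not how any particular browser works, and the function name and default parameters are hypothetical.

```python
def decode_with_fallback(raw: bytes,
                         declared: str = "utf-8",
                         fallback: str = "cp1252") -> str:
    """Decode with the declared encoding, else with a fallback code page.

    Hypothetical heuristic: real encoding detection is far more involved.
    """
    try:
        return raw.decode(declared)          # strict: fails on invalid bytes
    except UnicodeDecodeError:
        # Bytes undefined in the fallback code page still become U+FFFD.
        return raw.decode(fallback, errors="replace")

# A page declared as UTF-8 but actually stored as Windows-1252.
page = "Straße für".encode("cp1252")
print(decode_with_fallback(page))
```

Text that is valid UTF-8 passes through the first branch unchanged, so the fallback only affects data that could not have been UTF-8 in the first place.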

See also

* Mojibake

External links

* [http://www.unicode.org/charts/PDF/UFFF0.pdf Unicode's Specials table]
* [http://www.decodeunicode.org/w3.php?nodeId=65635&page=1&lang=2&zoom=&prop= Decodeunicode's entry for the replacement character]


Wikimedia Foundation. 2010.
