Unicode Specials
Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 code points, 5 are assigned as of Unicode 5.0:
* U+FFF9 "INTERLINEAR ANNOTATION ANCHOR", marks the start of annotated text
* U+FFFA "INTERLINEAR ANNOTATION SEPARATOR", marks the start of annotating text
* U+FFFB "INTERLINEAR ANNOTATION TERMINATOR", marks the end of annotating text
* U+FFFC "OBJECT REPLACEMENT CHARACTER", a placeholder in the text for another unspecified object, for example in a compound document
* U+FFFD "REPLACEMENT CHARACTER", used to replace an unknown or unprintable character
U+FFFE and U+FFFF are not unassigned in the usual sense, but "guaranteed not to be a Unicode character at all". They can be used to guess a text's encoding scheme, since any text containing them is by definition not correctly encoded Unicode text. U+FEFF is Unicode's byte-order mark, named "zero-width no-break space" (since its inclusion in text should not be noticeable). If this character is read in the wrong byte order, it reads as 0xFFFE, which is not a valid Unicode character.

Replacement character
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the "Specials" table. It is used to indicate problems when a system such as a text parser was not able to decode a stream of data to a correct symbol.
Consider a text file created with Notepad in Microsoft Windows and saved with Windows-1252 encoding (Microsoft usually calls this code page "ANSI"). This file has the content "für", a German word. These three letters correspond to the byte values 0x66 0xFC 0x72.
This file is now opened within a Linux environment. Many Linux text editors nowadays have UTF-8 as the preset encoding. As the first byte (0x66) is within the ASCII range 0x00–0x7F, a UTF-8 decoder correctly interprets it as an "f". The second byte (0xFC) translates to binary 1111 1100, which is not a valid byte in well-formed UTF-8. A text editor could therefore insert the replacement character to "warn" the user that something went wrong. The last byte (0x72) is again within the range 0x00–0x7F and can be decoded correctly. The whole string now looks like this: "f�r".
If this file is now saved in UTF-8 form, the text file data will look like this: 0x66 0xEF 0xBF 0xBD 0x72. The "new" data, 0xEF 0xBF 0xBD, is the correct UTF-8 encoding of Unicode code point U+FFFD. Therefore, the original 0xFC, "ü", has been replaced with 0xEF 0xBF 0xBD, "�".
Back in the Windows environment, this modified text file is opened with Microsoft's editor using Windows-1252 encoding. As 0xEF = ï, 0xBF = ¿ and 0xBD = ½, the whole text file will be displayed within the editor like this: "fï¿½r".
Once data has been transformed as in the example above (different symbols replaced with a single replacement character), there is no trivial way to get back the original data other than manually finding and restoring the correct character from context.
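The round trip described above can be reproduced in a few lines of Python. This is only a minimal sketch and does not show how any particular editor works: it assumes that Python's built-in "cp1252" and "utf-8" codecs stand in for Notepad and the Linux editor, and that the `errors="replace"` handler plays the role of an editor that inserts U+FFFD. The helper name `round_trip` is made up for illustration.

```python
def round_trip(text: str) -> str:
    """Simulate the Windows -> Linux -> Windows journey from the example.

    Hypothetical helper: the codecs stand in for the editors in the text.
    """
    # Step 1: the text is saved as Windows-1252 bytes.
    original_bytes = text.encode("cp1252")
    # Step 2: the bytes are decoded as UTF-8; each invalid byte is
    # replaced by U+FFFD (the replacement character).
    decoded = original_bytes.decode("utf-8", errors="replace")
    # Step 3: the result is saved again, this time as UTF-8.
    saved_bytes = decoded.encode("utf-8")
    # Step 4: the file is reopened as Windows-1252.
    return saved_bytes.decode("cp1252")


original = "für".encode("cp1252")
print(original.hex(" "))                        # 66 fc 72

seen_on_linux = original.decode("utf-8", errors="replace")
print(seen_on_linux)                            # f�r
print(seen_on_linux.encode("utf-8").hex(" "))   # 66 ef bf bd 72

print(round_trip("für"))                        # fï¿½r

# Different inputs collapse to the same output, so the loss cannot be undone.
print(round_trip("für") == round_trip("fär"))   # True
```

The last line illustrates why the damage is irreversible: once "ü" and "ä" have both been mapped to U+FFFD, nothing in the saved bytes records which letter was there originally.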
Some websites incorrectly declare their encoding as UTF-8 when the encoding actually used is, for example, Windows-1252. In some web browsers (such as Firefox), this results in all umlauts, ß and some other characters in the upper range of Windows-1252 (those with the most significant bit set to 1) being displayed as � instead. Other web browsers, such as newer versions of Internet Explorer, try their best to figure out which code page was meant to be used.

See also
* Mojibake

External links
* [http://www.unicode.org/charts/PDF/UFFF0.pdf Unicode's Specials table]
* [http://www.decodeunicode.org/w3.php?nodeId=65635&page=1&lang=2&zoom=&prop= Decodeunicode's entry for the replacement character]
Wikimedia Foundation. 2010.