- Byte-order mark
A byte-order mark (BOM) is the Unicode character at code point U+FEFF ("zero-width no-break space") when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32. It is conventionally used as a marker to indicate that text is encoded in UTF-8, UTF-16 or UTF-32.

Usage
In most character encodings the BOM is a pattern which is unlikely to be seen in other contexts (it would usually look like a sequence of obscure control codes). If a BOM is misinterpreted as an actual character within Unicode text, it will generally be invisible because it is a "zero-width no-break space". Use of the U+FEFF character for non-BOM purposes was deprecated in Unicode 3.2 (which provides an alternative, U+2060, for those other purposes), allowing U+FEFF to be used solely with the semantics of a BOM.

In
UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

* If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
* if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.

The Unicode value
U+FFFE is guaranteed never to be assigned as a Unicode character; this implies that in a Unicode context the 0xFF, 0xFE byte pattern can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it could not be a U+FFFE character expressed in big-endian byte order).

While
UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. It only identifies a file as UTF-8 and does not state anything about byte order. [ [http://unicode.org/faq/utf_bom.html#29 FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?] Retrieved 2008-03-29. ] Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files. However, in Unix-like systems (which make heavy use of text files for file formats as well as for inter-process communication) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source code for programming languages that do not recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, the BOM has the subtle effect of starting page output to the browser, which prevents the PHP script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.

Although a BOM could be used with
UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

For the IANA registered charsets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE a "byte order mark" must not be used; an initial U+FEFF has to be interpreted as a (deprecated) "zero width no-break space", because the names of these charsets already determine the byte order. For the registered charsets UTF-16 and UTF-32, an initial U+FEFF indicates the byte order.

Representations of byte order marks by encoding
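The byte sequences described in the preceding sections can be collected into a small detection routine. The following is a minimal sketch in Python, not a definitive implementation: the names `BOM_SIGNATURES`, `detect_bom` and `decode_with_bom` are hypothetical and chosen for illustration, and only UTF-8, UTF-16 and UTF-32 are covered. The signature bytes come from the constants in Python's standard `codecs` module. The sketch assumes that a leading FF FE 00 00 means UTF-32 little-endian, although the same bytes could also be a UTF-16 little-endian BOM followed by U+0000.

```python
import codecs

# Signatures are ordered longest-first so that the UTF-32 LE mark
# (FF FE 00 00) is tested before the UTF-16 LE mark (FF FE), its prefix.
BOM_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads the data."""
    for signature, encoding in BOM_SIGNATURES:
        if data.startswith(signature):
            return encoding, len(signature)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode with the encoding it implies.

    Falling back to `default` when there is no BOM is an assumption of this
    sketch, not a rule taken from the text above.
    """
    encoding, skip = detect_bom(data)
    return data[skip:].decode(encoding or default)
```

For example, `detect_bom(b"\xff\xfeh\x00i\x00")` reports little-endian UTF-16 with a two-byte mark, and `decode_with_bom` on the same bytes yields the text without the BOM. Stripping the mark explicitly and decoding with an endian-specific codec avoids leaving a stray U+FEFF at the start of the decoded string.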
* In UTF-8, this is not really a "byte order" mark. It identifies the text as UTF-8 but doesn't say anything about the byte order, because UTF-8 does not have byte order issues. [ [http://tools.ietf.org/html/rfc3629#section-6 STD 63: UTF-8, a transformation of ISO 10646] Byte Order Mark (BOM)]
* In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary, and xx depends on the next character (the first character after the BOM). Hence, technically, the fourth byte is not purely a part of the BOM, but also contains information about the next (non-BOM) character. For xx = 00, 01, 10, 11, this byte is, respectively, 38, 39, 2B, or 2F when encoded as base64. If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
* SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6. [ [http://www.unicode.org/reports/tr6/#Signature UTR #6: Signature Byte Sequence for SCSU] ]
* For BOCU-1 a signature changes the state of the decoder. Octet 0xFF resets the decoder to the initial state. [ [http://www.unicode.org/notes/tn6/#Signature UTN #6: Signature Byte Sequence] ]

See also
* Left-to-right mark

External links
* [http://www.unicode.org/faq/utf_bom.html Unicode FAQ: "UTF-8, UTF-16, UTF-32 & BOM"]
* [http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273 The Unicode Standard, chapter 2.6 "Encoding Schemes"]
* [http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G9354 The Unicode Standard, chapter 2.13 "Special Characters and Noncharacters", section "Byte Order Mark (BOM)"]
* [http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#G25817 The Unicode Standard, chapter 16.8 "Specials", section "Byte Order Mark (BOM): U+FEFF"]
Wikimedia Foundation. 2010.