Binary Ordered Compression for Unicode
BOCU-1 is a MIME-compatible Unicode compression scheme. BOCU stands for Binary Ordered Compression for Unicode. BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note. [Markus Scherer, Mark Davis, "UTN #6: BOCU-1", 2006-02-04, http://www.unicode.org/notes/tn6/#Introduction, accessed 2008-05-18]

For comparison, SCSU was adopted as a standard Unicode compression scheme, with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME "text" media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compact more efficiently. [Doug Ewell, "UTN #14: A survey of Unicode compression" (PDF), 2004-01-30, http://unicode.org/notes/tn14, accessed 2008-06-13]

Both SCSU [IANA registration record for SCSU, http://www.iana.org/assignments/charset-reg/SCSU] and BOCU-1 [IANA registration record for BOCU-1, http://www.iana.org/assignments/charset-reg/BOCU-1] are IANA-registered charsets.
Details
All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040.

The normalization mapping and the byte sequences assigned to each difference range are tabulated in the BOCU-1 specification.
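The state handling described above can be sketched in Python. This is a minimal illustration, not the reference implementation: the three script-specific ranges and their target values are not given in this text and are assumed here from UTN #6 as best understood, so they should be checked against the specification. The function names `normalize_prev` and `differences` are made up for this example.

```python
def normalize_prev(c: int) -> int:
    """Return the 'previous code point' state after encoding code point c.

    Assumption: the special ranges below follow UTN #6; every other code
    point maps to the middle (offset 0x40) of its 128-code-point row.
    """
    if 0x3040 <= c <= 0x309F:   # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:   # CJK unified ideographs
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:   # Hangul syllables
        return 0xC1D1
    return (c & ~0x7F) + 0x40


def differences(text: str):
    """Yield (code point, difference) pairs for a string.

    The difference is None for U+0000..U+0020, which are written as the
    corresponding byte value instead of as a difference.
    """
    prev = 0x40                      # initial state
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            if c != 0x20:            # control codes reset the state,
                prev = 0x40          # the space leaves it untouched
            yield c, None
        else:
            yield c, c - prev
            prev = normalize_prev(c)


if __name__ == "__main__":
    for c, d in differences("\u0430\u0431 ab"):
        print(f"U+{c:04X}", "as is" if d is None else f"{d:+X}")
```

In this sketch the first Cyrillic letter costs a large difference from the initial state, while the following letter in the same block yields a small one, which is what makes short runs in a single script compact.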
Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.

Any ASCII input U+0000 to U+007F, excluding the space U+0020, resets the encoder to U+0040. Because these values include the line-end code points U+000D and U+000A, which are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas in SCSU it can affect the entire document.

BOCU-1 offers similar robustness for input texts that do not contain the values mentioned above, through the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The BOCU-1 specification does not recommend the use of 0xFF reset bytes, because it conflicts with other BOCU-1 design goals, notably the binary order.

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state from U+0040 to U+FEC0, the normalized value for U+FEFF. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen "protected" code points encoded as single octets, BOCU-1 can use the remaining 243 octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not "protected" and can occur as a trail byte.
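The lead byte and base-243 trail bytes can be illustrated with a short Python sketch that packs a single difference into bytes. The range limits and lead-byte start values used here (for example 90 as the byte for a zero difference, and D0, FB, FE as the first positive lead bytes) do not appear in this text; they are assumptions based on UTN #6 and should be verified against the specification. The name `pack_difference` is illustrative only.

```python
# The thirteen excluded byte values listed above; the other 243 byte values,
# taken in ascending order, serve as the "digits" of the base-243 trail bytes.
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}
TRAIL_BYTES = [b for b in range(0x100) if b not in EXCLUDED]
TRAIL_COUNT = len(TRAIL_BYTES)  # 243

# (upper limit, first lead byte, base difference, number of trail bytes)
# Assumed values, see the paragraph above.
POSITIVE = [(0x2910, 0xD0, 0x40, 1),
            (0x2DD0B, 0xFB, 0x2911, 2),
            (0x10FFBF, 0xFE, 0x2DD0C, 3)]
# (lower limit, lead byte just above the range, base difference, trail bytes)
NEGATIVE = [(-0x2911, 0x50, -0x40, 1),
            (-0x2DD0C, 0x25, -0x2911, 2),
            (-0x10FF9F, 0x22, -0x2DD0C, 3)]


def pack_difference(diff: int) -> bytes:
    """Pack one code point difference into one to four bytes."""
    if -0x40 <= diff <= 0x3F:
        return bytes([0x90 + diff])          # single byte around the middle
    if diff > 0:
        row = next(r for r in POSITIVE if diff <= r[0])
    else:
        row = next(r for r in NEGATIVE if diff >= r[0])
    _, lead_start, base, count = row
    rest = diff - base
    trail = []
    for _ in range(count):
        # Floor division keeps the remainder in 0..242 for negative values too.
        rest, digit = divmod(rest, TRAIL_COUNT)
        trail.append(TRAIL_BYTES[digit])
    return bytes([lead_start + rest] + trail[::-1])


if __name__ == "__main__":
    for d in (0x1156B, 0x1156C, 0xFEFF - 0x40):
        print(f"{d:X}", pack_difference(d).hex(" ").upper())
```

Under these assumptions the sketch reproduces the byte sequences quoted in the text for the differences 1156B and 1156C and for the signature U+FEFF, and consecutive differences compare in the same order as their byte sequences.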
See also
* UTF-1 contains a comparison of the UTF-1, UTF-8, and BOCU-1 designs
* International Components for Unicode, a library that can convert between BOCU-1 and other Unicode encodings
Wikimedia Foundation. 2010.