Standard Compression Scheme for Unicode

The Standard Compression Scheme for Unicode (SCSU) [UTS #6: Compression Scheme for Unicode, http://www.unicode.org/reports/tr6/, 2005-05-06, accessed 2008-06-13] is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping byte values in the range 128–255 to offsets within particular blocks ("windows") of 128 characters. The initial state of the encoder means that existing strings in ASCII and ISO-8859-1 that contain no C0 control codes other than NUL, TAB, CR and LF can be treated as SCSU strings. Since most alphabets reside in blocks of contiguous Unicode code points, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character (plus setup overhead, which for common languages is often only 1 byte). Most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic scripts, such as CJK ideographs.
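The window mapping described above can be illustrated with a deliberately tiny sketch. It is not a conforming SCSU encoder: it handles only pass-through ASCII and a single 128-character window, and the function names `toy_scsu_encode` and `toy_scsu_decode` are hypothetical. The tag value 0x12 (select dynamic window 2) and the default window start U+0400 (Cyrillic) are taken from UTS #6 as understood here and should be checked against the specification before being relied on.

```python
# Toy illustration of SCSU single-byte mode. NOT a complete or conforming
# implementation: one window, no quoting, no Unicode (UTF-16) mode.

SC2 = 0x12             # assumed tag: "select dynamic window 2" (per UTS #6)
WINDOW2_BASE = 0x0400  # assumed default start of dynamic window 2 (Cyrillic)

# Bytes that stand for themselves: NUL, TAB, LF, CR and 0x20..0x7F.
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))


def toy_scsu_encode(text):
    """Encode text that uses only ASCII plus one 128-character window."""
    out = bytearray([SC2])  # one byte of setup overhead
    for ch in text:
        cp = ord(ch)
        if cp in PASS_THROUGH:
            out.append(cp)
        elif WINDOW2_BASE <= cp < WINDOW2_BASE + 128:
            # Map the code point to a byte in the range 128..255.
            out.append(0x80 + (cp - WINDOW2_BASE))
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's window")
    return bytes(out)


def toy_scsu_decode(data):
    """Inverse of toy_scsu_encode, for streams produced by it."""
    if not data or data[0] != SC2:
        raise ValueError("this sketch expects the stream to start with SC2")
    chars = []
    for b in data[1:]:
        if b >= 0x80:
            chars.append(chr(WINDOW2_BASE + (b - 0x80)))
        elif b in PASS_THROUGH:
            chars.append(chr(b))
        else:
            raise ValueError(f"tag byte 0x{b:02X} is not handled here")
    return "".join(chars)


sample = "Привет, мир"
encoded = toy_scsu_encode(sample)
print(len(encoded), len(sample.encode("utf-8")), len(sample.encode("utf-16-le")))
# prints: 12 20 22
```

For this 11-character sample the sketch emits one tag byte plus one byte per character, against two bytes per Cyrillic letter in UTF-8 or UTF-16.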

Reuters, the organization that floated the first draft of SCSU, is believed to use SCSU internally.

Comparison with external compression schemes

SCSU has not been a resounding success. Few applications need to compress so much Unicode text that it is worth using a special-purpose compression scheme that lacks widespread support. Also, while SCSU can be used as a text encoding, it can be difficult to handle internally.

Treated purely as a compression algorithm, SCSU is inferior to most commonly used general-purpose algorithms for texts of over a few kilobytes. A further problem is that the savings of SCSU over UTF-16 or UTF-8 drop, often dramatically, once external compression is applied [Doug Ewell, UTN #14: A survey of Unicode compression, http://unicode.org/notes/tn14, 2004-01-30, PDF, accessed 2008-06-13].

SCSU does have the advantage that it can usefully compress texts that are only a few characters long, whereas most full-scale compressors need a few kilobytes of data to break even against their own overhead.
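The break-even point for a general-purpose compressor can be seen with a very short input. The sketch below uses Python's standard `zlib` module purely as a stand-in for "a full-scale compressor"; the sample string is an arbitrary choice, and the exact byte counts depend on the zlib build.

```python
import zlib

# A six-letter Cyrillic word, 12 bytes in UTF-8.
short = "Привет".encode("utf-8")

# Compress at the highest level; the zlib container alone adds a
# 2-byte header and a 4-byte checksum before any payload is stored.
packed = zlib.compress(short, 9)

print(len(short), len(packed))
```

On an input this short the compressed form comes out larger than the original, which is the overhead the paragraph above refers to.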

See also

* Binary Ordered Compression for Unicode (BOCU-1)
* International Components for Unicode, a library that can convert between SCSU and other Unicode encodings


Wikimedia Foundation. 2010.
