Rzip

Rzip

The rzip program is huge-scale data compression software designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by Bzip2-based Burrows-Wheeler transform (BWT) and entropy coding (Huffman) on 900 kB output chunks.

Compression Algorithm

rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.

It is quite common these days to need to compress files that contain long distance redundancies. For example, when compressing a set of home directories several users might have copies of the same file, or of quite similar files. It is also common to have a single file that contains large duplicated chunks over long distances, such as PDF files containing repeated copies of the same image. Most compression programs won't be able to take advantage of this redundancy, and thus might achieve a much lower compression ratio than rzip can achieve.

The intermediate interface between the two stages is made of a byte-aligned data stream of which there are two commands, a "literal" ("add") with length and data:

type:8 = 0 => literal/add range of count bytes count:16 = 1..65535 data:8..∞ = literal data to be inserted (n whole bytes)

and a "match" ("copy") with length and offset parameters:

type:8 = 1 => match/copy range of count bytes count:16 = 31..65535 offset:32 = offset to position to be copied from

Literal or match/copy lengths of greater than 65535 bytes are split into multiple instructions. End-of-stream is indicated with a zero-length literal/add (type=0,count=0) command and immediately followed by a 32-bit CRC checksum.

Reference implementation

A rolling-checksum algorithm based on the one in rsync is used to locate potential matches from over such a large dataset. As the hash buckets fill up, previous hashes ("tags") are discarded based on twice. The tags are discarded in such a manner as to provide "fairly good" coverage, with a gradually decreasing match granularity as the distance increases. This implementation does not search for match lengths of less than 31 consecutive bytes.

Advantages

The key difference between rzip and other well known compression algorithms is its ability to take advantage of very long distance redundancy. The well known deflate algorithm used in gzip uses a maximum history buffer of 32 k. The BWT block sorting algorithm used in bzip2 is limited to 900 k of history. The history buffer in rzip can be up to 900MB long, several orders of magnitude larger than gzip or bzip2. Interestingly, rzip is often much faster than bzip2, despite using the bzip2 library as a backend. This is because rzip feeds bzip2 with shrunken data, so that bzip2 has to do less work. A simple comparison (although too small for an authoritative benchmark) can be found in [http://www.linux-mag.com/content/view/2528/0/1/0/] (log-in required). Another one is found in [http://rzip.samba.org/ rzip's webpage] .

Disadvantages

Rzip is not suited for every purpose. The two biggest disadvantages of rzip are that it cannot be pipelined (so it cannot read from standard input or write to standard output), and that it uses a high amount of memory: a typical compression run on a large file might use hundreds of megabytes of RAM. If there is a lot of RAM to spare and a very high compression rate is required, rzip should be used, but if these conditions are not satisfied, alternate compression methods such as gzip and bzip2, which are less memory-intensive, should be used instead of rzip.

History

rzip was originally written by Andrew Tridgell as part of his PhD research.

ee also

*List of archive formats
*List of file archivers
*Comparison of file archivers
*rsync, by the same author (Andrew Tridgell) and containing a similar matching/search algorithm to the first-stage of rzip.
* [http://ck.kolivas.org/apps/lrzip/ lrzip] , an improvement to 'rzip' allowing the second stage bzip2 reduction to be replaced by LZMA, LZO, or no second-stage (raw, dictionary-only compression). The author is Con Kolivas who states that 'lrzip' stands for 'Long Range ZIP'.

External links

* [http://rzip.samba.org rzip]
* [http://datacompression.info/LZSS.shtml DataCompression.info - LZ77/LZSS and derivatives]


Wikimedia Foundation. 2010.

Игры ⚽ Поможем решить контрольную работу

Look at other dictionaries:

  • Rzip — est un logiciel libre de compression de données à grande échelle, construit autour d une recherche de chaînes initiale de type LZ77 avec une fenêtre de recherche de 900 MB, suivi d une transformation de Burrows Wheeler (BWT) de type Bzip2 et… …   Wikipédia en Français

  • rzip — est un logiciel libre de compression de données à grande échelle, construit autour d une recherche de chaînes initiale de type LZ77 avec une fenêtre de recherche de 900 MB, suivi d une transformation de Burrows Wheeler (BWT) de type bzip2 et… …   Wikipédia en Français

  • ZIP (file format) — unzip redirects here. For the program, see Info ZIP. ZIP Filename extension .zip .zipx (newer compression algorithms) Internet media type application/zip Uniform Type Identifier com.pkware.zip archive Magic …   Wikipedia

  • StuffIt — Developer(s) Aladdin Systems, Smith Micro Software Stable release 14 / September 15, 2009; 2 years ago ( …   Wikipedia

  • List of archive formats — This is a list of file formats used by archivers and compressors used to create archive files. Contents 1 Archiving only 2 Compression only 3 Archiving and compression 4 …   Wikipedia

  • PeaZip — running on Windows 7 Developer(s) …   Wikipedia

  • Andrew Tridgell — Infobox person name=Andrew Tridgell birth date=birth date and age|df=yes|1967|02|28 occupation=Programmer known for=rsync, Samba nationality=Australian employer=IBM other names=TridgeAndrew Tridge Tridgell (born 28 February 1967) is an Australian …   Wikipedia

  • Data compression — Source coding redirects here. For the term in computer programming, see Source code. In computer science and information theory, data compression, source coding or bit rate reduction is the process of encoding information using fewer bits than… …   Wikipedia

  • LAME — This article is about the audio encoder. For other uses, see Lame. LAME Developer(s) The LAME development team …   Wikipedia

  • Speex — Filename extension .spx Internet media type audio/x speex, audio/speex, audio/ogg Developed by Xiph.Org Foundation, Jean Marc Valin Type of format Audio Contained by Ogg …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”