Canonical Huffman code

A canonical Huffman code is a particular type of Huffman code whose codebook can be described very compactly.

Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook and then convert it to canonical form before using it.

Algorithm

The normal Huffman coding algorithm assigns a variable-length code to every symbol in the alphabet. More frequently used symbols are assigned shorter codes. For example, suppose we have the following non-canonical codebook:

    A = 11
    B = 0
    C = 101
    D = 100

Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a canonical Huffman code, the codes are renumbered. The bit lengths stay the same, and the codebook is sorted first by codeword length and secondly by alphabetical value:

    B = 0
    A = 11
    C = 101
    D = 100

Each of the existing codes is replaced with a new one of the same length, using the following algorithm:

* The first symbol in the list is assigned a codeword of the same length as the symbol's original codeword, but consisting entirely of zeros. This will often be a single zero ('0').
* Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
* When a longer codeword is reached, then after incrementing, zeros are appended until the length of the new codeword equals the length of the old codeword. This can be thought of as a left shift.

Following these three rules produces the canonical version of the codebook:

    B = 0
    A = 10
    C = 110
    D = 111
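The three rules can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `canonicalize` and the representation of a codebook as a dict of bit strings are assumptions made for the example.

```python
def canonicalize(codebook):
    """Return a canonical codebook with the same code lengths.

    `codebook` maps each symbol to its code, written as a string of
    '0' and '1' characters (an assumed representation).
    """
    # Sort first by codeword length, then by symbol value.
    ordered = sorted(codebook, key=lambda sym: (len(codebook[sym]), sym))
    if not ordered:
        return {}

    canonical = {}
    code = 0  # rule 1: the first symbol gets all zeros
    prev_len = len(codebook[ordered[0]])
    for i, sym in enumerate(ordered):
        length = len(codebook[sym])
        if i > 0:
            # Rule 2: take the next binary number.
            # Rule 3: after incrementing, left-shift to the new length.
            code = (code + 1) << (length - prev_len)
        # Render the value as a zero-padded binary string of `length` bits.
        canonical[sym] = format(code, "0{}b".format(length))
        prev_len = length
    return canonical


original = {"A": "11", "B": "0", "C": "101", "D": "100"}
print(canonicalize(original))
```

Applied to the example codebook, this yields B = 0, A = 10, C = 110, D = 111, matching the table above.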

Encoding the Codebook

The whole advantage of a canonical Huffman tree is that one can encode the description (the codebook) in fewer bits than a fully-described tree.

Let us take our original Huffman codebook:

    A = 11
    B = 0
    C = 101
    D = 100

There are several ways we could encode this Huffman tree. For example, we could write each symbol followed by the number of bits and code:

('A',2,11), ('B',1,0), ('C',3,101), ('D',3,100)

Since we are listing the symbols in sequential alphabetical order, we can omit the symbols themselves, listing just the number of bits and code:

(2,11), (1,0), (3,101), (3,100)

With our canonical version we know that the symbols are in sequential alphabetical order and that a later code will always be higher in value than an earlier one. The only parts left to transmit are the bit lengths (number of bits) for each symbol. Note that our canonical Huffman tree always has higher code values for longer bit lengths, and that symbols of the same bit length (C and D) have higher code values for higher symbols:

    A = 10   (code value: 2 decimal, bits: 2)
    B = 0    (code value: 0 decimal, bits: 1)
    C = 110  (code value: 6 decimal, bits: 3)
    D = 111  (code value: 7 decimal, bits: 3)
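The two ordering properties noted here can be checked mechanically. The sketch below uses hypothetical helper names (`is_prefix_free`, `has_canonical_order`) and again assumes a dict of bit strings; it tests only the properties stated in the text and is not a complete test for canonical form.

```python
from itertools import permutations


def is_prefix_free(codebook):
    """True if no codeword is a prefix of another symbol's codeword."""
    return not any(
        b.startswith(a) for a, b in permutations(codebook.values(), 2)
    )


def has_canonical_order(codebook):
    """True if code values strictly increase when the symbols are
    sorted by (bit length, symbol)."""
    ordered = sorted(codebook, key=lambda s: (len(codebook[s]), s))
    values = [int(codebook[s], 2) for s in ordered]
    return all(x < y for x, y in zip(values, values[1:]))


canonical = {"A": "10", "B": "0", "C": "110", "D": "111"}
original = {"A": "11", "B": "0", "C": "101", "D": "100"}

print(is_prefix_free(canonical), has_canonical_order(canonical))
print(is_prefix_free(original), has_canonical_order(original))
```

Both codebooks are prefix-free, but only the canonical one passes the ordering check: in the original, C (101, value 5) is followed by D (100, value 4).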

Since two-thirds of the constraints are known, only the number of bits for each symbol need be transmitted:

2, 1, 3, 3

With knowledge of the canonical Huffman algorithm, it is then possible to recreate the entire table (symbol and code values) from just the bit-lengths. Unused symbols are normally transmitted as having zero bit length.
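As an illustration of the receiving side, the sketch below rebuilds the code table from the transmitted bit lengths alone. It uses the counting formulation described in the DEFLATE specification (RFC 1951, section 3.2.2) rather than the step-by-step rules above; for the example it produces the same codes. The function name `codes_from_lengths` and the use of `None` for unused symbols are assumptions made for the example.

```python
def codes_from_lengths(lengths):
    """Rebuild canonical codes from per-symbol bit lengths.

    `lengths[i]` is the bit length of the i-th symbol of the alphabet;
    a length of 0 marks an unused symbol, which receives no code (None).
    """
    max_len = max(lengths, default=0)

    # Count how many codes exist for each bit length (unused symbols ignored).
    count = [0] * (max_len + 1)
    for n in lengths:
        if n > 0:
            count[n] += 1

    # Compute the smallest code value for each bit length.
    next_code = [0] * (max_len + 1)
    code = 0
    for n in range(1, max_len + 1):
        code = (code + count[n - 1]) << 1
        next_code[n] = code

    # Hand out consecutive values, in symbol order, within each length.
    codes = []
    for n in lengths:
        if n == 0:
            codes.append(None)
        else:
            codes.append(format(next_code[n], "0{}b".format(n)))
            next_code[n] += 1
    return codes


# Bit lengths for A, B, C, D as transmitted in the example.
for symbol, code in zip("ABCD", codes_from_lengths([2, 1, 3, 3])):
    print(symbol, code)
```

With the lengths 2, 1, 3, 3 this recovers A = 10, B = 0, C = 110, D = 111. A zero in the list, such as `[2, 1, 0, 3, 3]`, leaves that symbol without a code and does not affect the others.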

Pseudo code

Pseudo code for construction of a canonical Huffman table could look something like the following:

    code = 0
    while more symbols (taken in order of bit length, then symbol value):
        print symbol, code
        code = code + 1
        if next bit length > current bit length:
            code = code << (next bit length - current bit length)


Wikimedia Foundation. 2010.
