Dictionary coder

Dictionary coder

A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure.

Methods and applications

Some dictionary coders use a 'static dictionary', one whose full set of strings is determined before coding begins and does not change during the coding process. This approach is most often used when the message or set of messages to be encoded is fixed and large; for instance, a software package that stores the contents of the Bible in the limited storage space of a PDA generally builds a static dictionary from a concordance of the text and then uses that dictionary to compress the verses.

More common are methods where the dictionary starts in some predetermined state but the contents change during the encoding process, based on the data that has already been encoded. Both the LZ77 and LZ78 algorithms work on this principle. In LZ77, a data structure called the "sliding window" is used to hold the last N bytes of data processed; this window serves as the dictionary, effectively storing every substring that has appeared in the past N bytes as dictionary entries. Instead of a single index identifying a dictionary entry, two values are needed: the length, indicating the length of the matched text, and the offset (also called the distance), indicating that the match is found in the sliding window starting offset bytes before the current text.

LZ78 uses a more explicit dictionary structure; at the beginning of the encoding process, the dictionary only needs to contain entries for the symbols of the alphabet used in the text to be compressed, but the indexes are numbered so as to leave spaces for many more entries. At each step of the encoding process, the longest entry in the dictionary that matches the text is found, and its index is written to the output; the combination of that entry and the character that followed it in the text is then added to the dictionary as a new entry.


Example: The text to be compressed starts "HURLYBURLY". In the first six steps of the encoding, we output the indexes for H, U, R, L, Y, and B, and we add to the dictionary new entries for HU, UR, RL, LY, YB, and BU. On the seventh step, we are at the start of "URLY"; the longest entry in our dictionary that matches the text is "UR", an entry we added on the second step. We send the index for "UR" to the output, and add an entry for "URL" to the dictionary. On the eighth step, we send the index for "LY" to the output, and assuming that a space follows HURLYBURLY in the text, we add an entry for "LY " to the dictionary. If later in the text we were to encounter the word "HURLYBURLY" again, we could encode it this time (assuming we started at the H) in no more than five indexes:- HU, RL, YB, URL, and Y.

The LZ78 decoder receives each symbol and, if it already has a previous prefix, adds the prefix plus the symbol to its own separate dictionary. It then outputs the symbol and sets the prefix to the last character of the symbol. One "gotcha" here is that if the encoder sees a sequence of the form STRING STRING CHARACTER, where STRING is currently in the dictionary, it will output a symbol that is one higher than the decoder's last dictionary entry. The decoder must detect such an event and output the previous symbol plus its first character. This symbol will always be only one higher than the last numbered symbol in the decoder's dictionary.

R -- (UR is in the dictionary already)
Y RLY (doesn't matter)

Example: The encoder is encoding BANANANANA; after outputting the indexes for B, A, N and AN the encoder has in its dictionary entries for BA, AN, NA, and ANA and the decoder has entries for BA, AN, and NA. The encoder can match "ANA" so it sends the index for "ANA" and adds "ANAN" to the dictionary. However, the decoder doesn't have "ANA" in its dictionary. It must guess that this new symbol is the prefix (the last symbol it received, "AN") plus its first character ("A"). It then outputs "ANA" and adds the prefix plus the last character of the output ("A" again) to the dictionary. Decoding can continue from there.

Another dictionary coding scheme is byte pair encoding, where a byte that does not appear in the source text is assigned to represent the most commonly appearing two-byte combination. This can be done repeatedly as long as there are bytes that do not appear in the source text, and bytes that are already representing combinations of other bytes can themselves appear in combinations.

See also

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Data compression — Source coding redirects here. For the term in computer programming, see Source code. In computer science and information theory, data compression, source coding or bit rate reduction is the process of encoding information using fewer bits than… …   Wikipedia

  • LZ77 and LZ78 — are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively [http://www.patentstorm.us/patents/5532693 description.html] .… …   Wikipedia

  • Incremental encoding — Incremental encoding, also known as front compression, back compression, or front coding, is a type of delta encoding compression algorithm whereby common prefixes or suffixes and their lengths are recorded so that they need not be duplicated.… …   Wikipedia

  • PAQ — A sample session of PAQ8O PAQ is a series of lossless data compression archivers that have evolved through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory… …   Wikipedia

  • National Clinical Coding Qualification (UK) — The National Clinical Coding Qualification (UK) is the only nationally recognised qualification for clinical coders working in the NHS[1]. Upon passing the examination a clinical coder is able to use the Post nominal letters ACC. Contents 1… …   Wikipedia

  • JBIG2 — is an image compression standard for bi level images, developed by the Joint Bi level Image Experts Group. It is suitable for both lossless and lossy compression. According to a press release [ [http://www.jpeg.org/public/mauijbig.pdf Press… …   Wikipedia

  • Dynamic Markov compression — (DMC) is a lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool [1]. It uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time (rather… …   Wikipedia

  • Content analysis — or textual analysis is a methodology in the social sciences for studying the content of communication. Earl Babbie defines it as the study of recorded human communications, such as books, websites, paintings and laws. According to Dr. Farooq… …   Wikipedia

  • Pierre Hérigone —  Ne doit pas être confondu avec Érigone. Une démonstration de l Optique d Euclide traduite par Hérigone (Tome V) Pierre Hérigone, d origine basque …   Wikipédia en Français

  • Translitteration des hieroglyphes — Translittération des hiéroglyphes Des hiéroglyphes du temple de Komombo. En égyptologie, la translittération est le processus permettant de transcrire un texte écrit en hiéroglyphes ou en hiératique en utilisant des symboles alphabétiques, de… …   Wikipédia en Français

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”