- Shannon's source coding theorem
In
information theory , Shannon's source coding theorem (or noiseless coding theorem) establishes the limits to possibledata compression , and the operational meaning of theShannon entropy .The source coding theorem shows that (in the limit, as the length of a stream of i.i.d. data tends to infinity) it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without it being virtually certain that information will be lost. However it is possible to get the code rate arbitrarily close to the Shannon entropy, with negligible probability of loss.
The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a
random variable ) and of the size of the target alphabet.Statements
"Source coding" is a mapping from (a sequence of) symbols from an information source to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind
data compression .Source coding theorem
In information theory, the source coding theorem (Shannon 1948) informally states that:
:"N" i.i.d. random variables each with entropy "H(X)" can be compressed into more than "NH(X)"
bits with negligible risk of information loss, as "N" tends to infinity; but conversely, if they are compressed into fewer than "NH(X)" bits it is virtually certain that information will be lost." (MacKay 2003).Source coding theorem for symbol codes
Let Sigma_1, Sigma_2 denote two finite alphabets and let Sigma_1^* and Sigma_2^* denote the set of all finite words from those alphabets (respectively).
Suppose that "X" is a random variable taking values in Sigma_1 and let "f" be a decipherable code from Sigma_1^* to Sigma_2^* where Sigma_2|=a. Let "S" denote the random variable given by the wordlength "f(X)".
If "f" is optimal in the sense that it has the minimal expected wordlength for "X", then
:frac{H(X)}{log_2 a} leq mathbb{E}S < frac{H(X)}{log_2 a} +1 (Shannon 1948)
Proof: Source coding theorem
Given X is an i.i.d. source, its
time series "X"1, ..., "X""n" is i.i.d. withentropy "H"("X") in the discrete-valued case anddifferential entropy in the continuous-valued case. The Source coding theorem states that for any epsilon > 0 for any rate larger than theentropy of the source, there is large enough n and an encoder that takes n i.i.d. repetition of the source, X^{1:n} , and maps it to n.(H(X)+epsilon) binary bits such that the source symbols X^{1:n} are recoverable from the binary bits with probability at least 1 - epsilon .Proof for achievability
Fix some for any epsilon > 0 . The typical set,A_n^epsilon, is defined as follows:
A_n^epsilon =; left{x_1^n : left|-frac{1}{n} log p(X_1, X_2, ..., X_n) - H_n(X) ight|
AEP shows that for large enough "n", the probability that a sequence generated by the source lies in the typical set,A_n^epsilon, defined as approaches one. In particular there for large enough "n", P(A_n^epsilon)>1-epsilon (See
AEP for a proof):The definition of typical sets implies that those sequences that lie in the typical set satisfy:
:2^{-n(H(X)+epsilon)} leq p(x_1, x_2, ..., x_n) leq 2^{-n(H(X)-epsilon)}
Note that:
*The probability of a sequence from X being drawn from A_epsilon}^{(n)} is greater than 1-epsilon
*left| {A_epsilon}^{(n)} ight| leq 2^{n(H(X)+epsilon)} since the probability of the whole set A_epsilon}^{(n)} is at most one.
*left| {A_epsilon}^{(n)} ight| geq (1-epsilon)2^{n(H(X)-epsilon)}. For the proof, use the upper bound on the probability of each term in typical set and the lower bound on the probability of the whole set A_epsilon}^{(n)}.
Since left| {A_epsilon}^{(n)} ight| leq 2^{n(H(X)+epsilon)}, n.(H(X)+epsilon); bits are enough to point to any string in this set.
The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary n.(H(X)+epsilon) digit number. As long as the input sequence lies within the typical set (with probability at least 1-epsilon), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by epsilon
Proof for converse The converse is proved by showing that any set of size smaller than A_n^epsilon (in the sense of exponent) would cover a set of probability bounded away from 1.
Proof: Source coding theorem for symbol codes
Let s_i denote the wordlength of each possible x_i (i=1,ldots,n). Define q_i = a^{-s_i}/C, where "C" is chosen so that sum q_i = 1.
Then
:egin{matrix}H(X) &=& - sum_{i=1}^n p_i log_2 p_i \&& \&leq& - sum_{i=1}^n p_i log_2 q_i \&& \&=& - sum_{i=1}^n p_i log_2 a^{-s_i} + sum_{i=1}^n p_i log_2 C \&& \&=& - sum_{i=1}^n p_i log_2 a^{-s_i} + log_2 C \&& \&leq& - sum_{i=1}^n - s_i p_i log_2 a \&& \&leq& mathbb{E}S log_2 a \end{matrix}
where the second line follows from
Gibbs' inequality and the fifth line follows fromKraft's inequality : C = sum_{i=1}^n a^{-s_i} leq 1 so log C leq 0.For the second inequality we may set
:s_i = lceil - log_a p_i ceil
so that
:log_a p_i leq s_i < -log_a p_i + 1
and so
:a^{-s_i} leq p_i
and
:sum a^{-s_i} leq sum p_i = 1
and so by Kraft's inequality there exists a prefix-free code having those wordlengths. Thus the minimal "S" satisfies
:egin{matrix}mathbb{E}S & = & sum p_i s_i \&& \& < & sum p_i left( -log_a p_i +1 ight) \&& \& = & sum - p_i frac{log_2 p_i}{log_2 a} +1 \&& \& = & frac{H(X)}{log_2 a} +1 \end{matrix}
Extension to non-stationary independent sources
Fixed Rate lossless source coding for discrete time non-stationary independent sources
Define typical set A_n^epsilon as:
: A_n^epsilon =; {x_1^n : left|-frac{1}{n} log p(X_1, X_2, ..., X_n) - ar{H_n}(X) ight|
Then, for given delta>0, for n large enough, Pr(A_n^epsilon)> 1-delta . Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than 2^{n(ar{H_n}(X)+epsilon)}. Thus, on an average, ar{H_n}(X)+epsilon bits suffice for encoding with probability greater than 1-delta, where epsilon; mbox{ and }; delta can be made arbitrarily small, by making n larger.
ee also
*
Channel coding
* Noisy Channel Coding Theorem
*Error exponent
* Asymptotic Equipartition Property (AEP)
*Data compression References
* cite book
last = Cover
first = Thomas M.
title = Elements of Information Theory
chapter = Chapter 5: Data Compression
year = 2006
publisher = John Wiley & Sons
isbn = 0-471-24195-4
* C.E. Shannon, " [http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf A Mathematical Theory of Communication] ", "Bell System Technical Journal ", vol. 27, pp. 379-423, 623-656, July, October, 1948
Wikimedia Foundation. 2010.