Noisy channel model
The noisy channel model is a framework used in spell checkers, question answering, speech recognition, and machine translation. In this model, the goal is to find the intended word given a word where the letters have been scrambled in some manner.
Definition
Given an alphabet Σ, let Σ* be the set of all finite strings over Σ. Let the dictionary D of valid words be some subset of Σ*, i.e., D ⊆ Σ*.
The noisy channel is the matrix Γ with entries
    Γ_ws = Pr(s | w),
where w ∈ D is the intended word and s ∈ Σ* is the scrambled word that was actually received.
Example
Consider the English alphabet Σ = {a,b,c,...,y,z,A,B,...,Z,...}. Some subset D ⊆ Σ* makes up the dictionary of valid English words.
There are several mistakes that may occur while typing, including:
- Missing letters, e.g., leter instead of letter
- Accidental letter additions, e.g., misstake instead of mistake
- Swapping letters, e.g., recieved instead of received
- Replacing letters, e.g., fimite instead of finite
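The four kinds of mistake listed above can be enumerated mechanically. The following is a minimal sketch, assuming a lowercase ASCII alphabet and exactly one error per word; the function name `single_error_variants` is hypothetical and chosen only for illustration.

```python
import string


def single_error_variants(word, alphabet=string.ascii_lowercase):
    """Return every string that differs from `word` by exactly one typing error."""
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    # Missing letter: drop the first character of the right part.
    deletions = {a + b[1:] for a, b in splits if b}
    # Accidental addition: put one extra letter at the cut.
    insertions = {a + c + b for a, b in splits for c in alphabet}
    # Swapped letters: exchange the two characters just after the cut.
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replaced letter: overwrite the first character of the right part.
    replacements = {a + c + b[1:] for a, b in splits if b for c in alphabet}

    # Some edits reproduce the word itself (e.g. swapping "tt"); exclude it.
    return (deletions | insertions | swaps | replacements) - {word}
```

With this sketch, `single_error_variants("letter")` contains `leter`, and the variants of `mistake`, `received`, and `finite` contain `misstake`, `recieved`, and `fimite` respectively.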
To construct the noisy channel matrix Γ, we must consider the probability of each mistake, given the intended word (Pr(s | w) for all w ∈ D and s ∈ Σ*). These probabilities may be estimated, for example, by considering the Levenshtein distance between s and w or by comparing the draft of an essay with one that has been manually edited for spelling.
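One possible way to turn Levenshtein distance into channel probabilities is sketched below. The exponential weighting, the `decay` parameter, and the restriction of each row of Γ to a finite list of observed strings are all assumptions made for illustration; the article does not prescribe a formula. Note that plain Levenshtein distance counts a swap of two adjacent letters as two edits.

```python
import math


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def channel_row(w, observations, decay=1.0):
    """Approximate Pr(s | w) for each s in `observations` (assumed non-empty).

    Assumption: the weight of s falls off as exp(-decay * distance), and the
    weights are normalised over `observations` rather than over all of Σ*.
    """
    weights = {s: math.exp(-decay * levenshtein(w, s)) for s in observations}
    total = sum(weights.values())
    return {s: weight / total for s, weight in weights.items()}
```

For instance, `channel_row("letter", ["letter", "leter", "lever"])` gives the most mass to `letter` (distance 0), less to `leter` (distance 1), and the least to `lever` (distance 2).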
Error-correction
The goal of the noisy channel model is to find the intended word given the scrambled word that was received. The decision function, a map from Σ* to D, takes a scrambled word and returns the word it judges to have been intended.
Methods of constructing a decision function include the maximum likelihood rule, the maximum a posteriori rule, and the minimum distance rule.
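The three rules can be compared side by side in a short sketch. The maximum likelihood rule picks the w that maximises Pr(s | w). The maximum a posteriori rule maximises Pr(s | w) Pr(w), which by Bayes' rule is proportional to Pr(w | s). The minimum distance rule picks the dictionary word closest to s. The function names, the three-word dictionary, and every probability below are hypothetical values chosen only so that the rules disagree.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def maximum_likelihood(s, dictionary, likelihood):
    # arg max over w of Pr(s | w)
    return max(dictionary, key=lambda w: likelihood(s, w))


def maximum_a_posteriori(s, dictionary, likelihood, prior):
    # arg max over w of Pr(s | w) * Pr(w); Pr(s) does not depend on w
    return max(dictionary, key=lambda w: likelihood(s, w) * prior(w))


def minimum_distance(s, dictionary, distance=levenshtein):
    # arg min over w of the distance between s and w
    return min(dictionary, key=lambda w: distance(s, w))


# Hypothetical toy model for the single observation "recieved".
DICTIONARY = ["received", "relieved", "receiver"]
TOY_LIKELIHOOD = {"received": 0.002, "relieved": 0.003, "receiver": 0.00001}
TOY_PRIOR = {"received": 0.7, "relieved": 0.2, "receiver": 0.1}


def toy_likelihood(s, w):
    """Made-up Pr(s | w); only defined for s == "recieved"."""
    return TOY_LIKELIHOOD[w] if s == "recieved" else 0.0


def toy_prior(w):
    """Made-up Pr(w)."""
    return TOY_PRIOR[w]
```

Under these made-up numbers, the maximum likelihood and minimum distance rules both return `relieved` for the input `recieved`, while the maximum a posteriori rule returns `received` because the prior favours the more common word.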
In some cases, it may be better to accept the scrambled word as the intended word rather than attempt to find an intended word in the dictionary. For example, the word schönfinkeling may not be in the dictionary, but might in fact be the intended word.
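A decision function can be wrapped so that it falls back to the received word when no dictionary entry is close enough. The sketch below uses the similarity ratio from Python's standard `difflib` module as a stand-in for a channel probability, and the `cutoff` value of 0.8 is an arbitrary assumption; the name `correct_or_keep` is hypothetical.

```python
import difflib


def correct_or_keep(s, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or s itself if nothing is close enough."""
    if s in dictionary:
        return s
    # get_close_matches returns at most n candidates whose similarity
    # ratio to s is at least `cutoff`, best match first.
    matches = difflib.get_close_matches(s, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else s
```

With a dictionary such as `["finite", "letter", "mistake", "received"]`, the input `leter` is corrected to `letter`, while `schönfinkeling` has no sufficiently similar entry and is returned unchanged.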
Wikimedia Foundation. 2010.