Language model
A statistical language model assigns a probability to a sequence of m words, P(w_1,\ldots,w_m), by means of a probability distribution. Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence. When used in information retrieval, a language model is associated with a document in a collection. With query Q as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query, P(Q|M_d).
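As a minimal sketch of this query-likelihood ranking, the snippet below scores two toy documents against a query; the documents, the query, and the additive smoothing constant alpha are illustrative assumptions rather than part of any particular retrieval system (smoothing is assumed so that unseen query terms do not zero out the product).

# Sketch of query-likelihood ranking: documents are ranked by P(Q|M_d),
# where M_d is a unigram language model estimated from the document's terms.
from collections import Counter

def score(query_terms, doc_terms, vocab_size, alpha=0.1):
    # Maximum-likelihood estimate with additive (Laplace-style) smoothing,
    # assumed here so unseen query terms do not make the product zero.
    counts = Counter(doc_terms)
    total = len(doc_terms)
    p = 1.0
    for term in query_terms:
        p *= (counts[term] + alpha) / (total + alpha * vocab_size)
    return p

docs = {
    "d1": "the red house on the hill".split(),
    "d2": "a language model for retrieval".split(),
}
vocab = {t for d in docs.values() for t in d}
query = "language model".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d], len(vocab)), reverse=True)
print(ranked)  # ['d2', 'd1']: d2's model is more likely to generate the query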
Estimating the probability of sequences can become difficult in corpora, in which phrases or sentences can be arbitrarily long, so some sequences are not observed during training of the language model (the data sparseness problem of overfitting). For that reason these models are often approximated using smoothed n-gram models.

N-gram models
In an n-gram model, the probability P(w_1,\ldots,w_m) of observing the sentence w_1,\ldots,w_m is approximated as
P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})
Here, it is assumed that the probability of observing the i-th word w_i in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words (n-th order Markov property). The conditional probability can be calculated from n-gram frequency counts:

P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}
The terms bigram and trigram language model denote n-gram language models with n = 2 and n = 3, respectively.
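The count-based estimate above can be computed directly. Below is a minimal sketch for n = 2 (bigram) probabilities; the tiny two-sentence corpus is a made-up assumption for illustration.

# Estimating bigram probabilities from frequency counts:
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
from collections import Counter

corpus = [["i", "saw", "the", "red", "house"],
          ["i", "saw", "the", "dog"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def p_bigram(word, prev):
    # Maximum-likelihood estimate from the counts; no smoothing applied.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("the", "saw"))   # count(saw, the)/count(saw) = 2/2 = 1.0
print(p_bigram("red", "the"))   # count(the, red)/count(the) = 1/2 = 0.5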
Example
In a bigram (n = 2) language model, the probability of the sentence "I saw the red house" is approximated as

P(I, saw, the, red, house) \approx P(I)\, P(saw \mid I)\, P(the \mid saw)\, P(red \mid the)\, P(house \mid red)
whereas in a trigram (n = 3) language model, the approximation is

P(I, saw, the, red, house) \approx P(I)\, P(saw \mid I)\, P(the \mid I, saw)\, P(red \mid saw, the)\, P(house \mid the, red)
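The bigram approximation is just a left-to-right product of conditional probabilities. The sketch below computes it for the example sentence; the individual probability values and the <s> start-of-sentence symbol (so that P(I) is modeled as P(I | <s>)) are assumptions made up for illustration.

# Worked bigram approximation of P(I, saw, the, red, house).
# The probability values below are assumed, not estimated from any corpus.
probs = {
    ("<s>", "i"): 0.2,      # stands in for P(I)
    ("i", "saw"): 0.3,
    ("saw", "the"): 0.4,
    ("the", "red"): 0.1,
    ("red", "house"): 0.5,
}

sentence = ["i", "saw", "the", "red", "house"]
p = 1.0
prev = "<s>"
for word in sentence:
    p *= probs[(prev, word)]  # multiply in P(w_i | w_{i-1})
    prev = word
print(p)  # 0.2 * 0.3 * 0.4 * 0.1 * 0.5 = 0.0012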
See also

* Factored language model

References
* J. M. Ponte and W. B. Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval, pp. 275-281. http://citeseer.ist.psu.edu/ponte98language.html
* F. Song and W. B. Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval, pp. 279-280. http://citeseer.ist.psu.edu/song99general.html