Bigram

Bigrams are groups of two written letters, two syllables, or two words, and are commonly used as the basis for simple statistical analysis of text. They are used in one of the most successful language models for speech recognition. [Michael Collins. "A new statistical parser based on bigram lexical dependencies". In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA. 1996. pp. 184-191.] They are the special case of the n-gram with n = 2.

"Gappy bigrams" or "skipping bigrams" are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).

"Head word bigrams" are gappy bigrams with an explicit dependency relationship.
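As an illustration of the gappy bigrams described above, the following Python sketch collects word pairs that may skip over intervening words. The function name `skip_bigrams` and the parameter `max_gap` are hypothetical names chosen for this example, and whitespace tokenization is an assumption. The sketch covers only the plain gappy case; it does not model the dependency relationships of head word bigrams.

```python
def skip_bigrams(tokens, max_gap=2):
    """Return pairs (tokens[i], tokens[j]) with at most max_gap words skipped between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # j runs over the positions 1 to max_gap + 1 places to the right of i.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


# With max_gap=0 this reduces to ordinary adjacent bigrams.
example_pairs = skip_bigrams("the black cat sat".split(), max_gap=1)
print(example_pairs)
```

With `max_gap=1`, the pair ("the", "cat") is produced alongside the adjacent pair ("the", "black"), which is how a gap can step over a modifier.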

The term is also used in cryptography, where "bigram frequency attacks" have sometimes been used to attempt to solve cryptograms. See frequency analysis.

Bigrams help provide the conditional probability of a word given the preceding word, by applying the definition of conditional probability:

$$P(W_n \mid W_{n-1}) = \frac{P(W_{n-1}, W_n)}{P(W_{n-1})}$$

That is, the probability P() of a word W_n given the preceding word W_{n-1} is equal to the probability of their bigram, or the co-occurrence of the two words P(W_{n-1},W_n), divided by the probability of the preceding word.
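In practice this ratio is often estimated from counts in a corpus: the number of times the pair occurs divided by the number of times the preceding word occurs. The Python sketch below shows one way to do this, assuming a whitespace-tokenized text and no smoothing; the function name `bigram_conditional_probs` is hypothetical and chosen for this example.

```python
from collections import Counter


def bigram_conditional_probs(tokens):
    """Estimate P(w_n | w_{n-1}) as count(w_{n-1}, w_n) / count(w_{n-1})."""
    # Count every adjacent word pair.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word in its role as the first element of a pair
    # (the final token has no successor, so it is excluded).
    prev_counts = Counter(tokens[:-1])
    return {
        (prev, word): count / prev_counts[prev]
        for (prev, word), count in pair_counts.items()
    }


tokens = "the cat sat on the mat".split()
probs = bigram_conditional_probs(tokens)
print(probs[("the", "cat")])  # 0.5
```

Here "the" is followed once by "cat" and once by "mat", so each continuation gets an estimated probability of 0.5. Pairs that never occur in the corpus receive no entry at all, which is why practical language models usually add some form of smoothing.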

Bigram Frequency in the English language

The most common letter bigrams in the English language are listed below, with the expected number of occurrences per 2000 letters. In the analysis here, the bigrams are not permitted to span across consecutive words. [Friedman & Callimahos, "Military Cryptanalytics Part I", Aegean Park Press, Laguna Hills, CA. 1985]

TH 50    AT 25    ST 20
ER 40    EN 25    IO 18
ON 39    ES 25    LE 18
AN 38    OF 25    IS 17
RE 36    OR 25    OU 17
HE 33    NT 24    AR 16
IN 31    EA 22    AS 16
ED 30    TI 22    DE 16
ND 30    TO 22    RT 16
HA 26    IT 20    VE 16
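A count of this kind can be sketched in Python as follows. The sketch counts letter pairs inside each word only, so no bigram spans two consecutive words, and scales the result to occurrences per 2000 letters. The function name `letter_bigram_rates` is hypothetical, and both the restriction to the letters A-Z and the normalization by total letter count are assumptions of this example, not details taken from the cited source.

```python
import re
from collections import Counter


def letter_bigram_rates(text, per=2000):
    """Return letter-bigram occurrences per `per` letters, counting within words only."""
    # Treat each maximal run of ASCII letters as a word (an assumption of this sketch).
    words = [w.upper() for w in re.findall(r"[A-Za-z]+", text)]
    total_letters = sum(len(w) for w in words)
    if total_letters == 0:
        return {}
    counts = Counter()
    for w in words:
        # Adjacent letter pairs inside one word; pairs never cross a word boundary.
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return {bigram: n * per / total_letters for bigram, n in counts.items()}


rates = letter_bigram_rates("The then")
print(sorted(rates))  # ['EN', 'HE', 'TH']
```

In the two-word input, "ET" does not appear in the result even though "e" and "t" are neighbours in the running text, because the pair would cross a word boundary.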

However, these counts differ from other published results. One, from the Cornell University Math Explorer's Project [Cornell Math Explorer's Project – Substitution Ciphers, http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.html] (measured over 40,000 words, or about 200,000 letters), gives the first five as follows:

th 5532
he 4657
in 3429
er 3420
an 3005

See also

* Digraph (orthography)
* N-gram


Wikimedia Foundation. 2010.
