Statistical machine translation

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

The first ideas of statistical machine translation were introduced by Warren Weaver in 1949 [W. Weaver (1955). Translation (1949). In: "Machine Translation of Languages", MIT Press, Cambridge, MA.], including the idea of applying Claude Shannon's information theory. Statistical machine translation was re-introduced in 1991 by researchers at IBM's Thomas J. Watson Research Center [P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer (1993). The mathematics of statistical machine translation: parameter estimation. "Computational Linguistics", 19(2), 263-311.] and has contributed to the significant resurgence of interest in machine translation in recent years. As of 2006, it is by far the most widely studied machine translation paradigm.


The most frequently cited benefits of statistical machine translation over traditional paradigms are:

* Better use of resources
** There is a great deal of natural language text available in machine-readable format.
** Generally, SMT systems are not tailored to any specific pair of languages.
** Rule-based translation systems require the manual development of linguistic rules, which can be costly and which often do not generalize to other languages.
* More natural translations

The ideas behind statistical machine translation come out of information theory. Essentially, a document is translated according to the probability distribution p(e|f) that a string e in the native language (for example, English) is the translation of a string f in the foreign language (for example, French). Generally, these probabilities are estimated using techniques of parameter estimation.

Bayes' theorem is applied to p(e|f), the probability that the foreign string f produces the native string e, to get $p(e|f) \propto p(f|e)\,p(e)$, where the translation model p(f|e) is the probability that the foreign string is the translation of the native string, and the language model p(e) is the probability of seeing that native string. Mathematically speaking, finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:

$$\tilde{e} = \arg\max_{e \in e^*} p(e|f) = \arg\max_{e \in e^*} p(f|e)\,p(e).$$
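The selection rule can be illustrated with a minimal Python sketch. It assumes a small, fixed list of candidate translations and two toy probability tables; the names `best_translation`, `TM` and `LM`, and all of the numbers, are made up for illustration and are not taken from any real system.

```python
import math

def best_translation(f, candidates, tm_logprob, lm_logprob):
    """Return the candidate e that maximizes log p(f|e) + log p(e)."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))

# Hypothetical translation model p(f|e) and language model p(e).
TM = {
    ("la maison", "the house"): 0.7,
    ("la maison", "house the"): 0.7,
    ("la maison", "the home"): 0.2,
}
LM = {"the house": 0.05, "house the": 0.0001, "the home": 0.03}

def tm_logprob(f, e):
    # Unseen pairs get a tiny floor probability instead of zero.
    return math.log(TM.get((f, e), 1e-12))

def lm_logprob(e):
    return math.log(LM.get(e, 1e-12))

best = best_translation("la maison", list(LM), tm_logprob, lm_logprob)
print(best)
```

In this toy setting the translation model alone cannot separate "the house" from "house the"; the language model term p(e) is what penalizes the unnatural word order.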

A rigorous implementation would have to perform an exhaustive search through all strings e^* in the native language. Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while keeping translation quality acceptable. This trade-off between quality and time usage can also be found in speech recognition.
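One common way to limit the search space is beam search. The sketch below is a deliberately simplified illustration, not a description of any particular decoder: it assumes word-for-word translation options, no re-ordering, and a bigram language model, and the tables `OPTIONS` and `BIGRAMS` contain invented probabilities.

```python
import math

def beam_decode(foreign_words, options, bigram_prob, beam_size=2):
    """Translate left to right, keeping only the beam_size best partial hypotheses."""
    beam = [(0.0, ["<s>"])]  # (log score, native words so far)
    for f in foreign_words:
        expanded = []
        for score, words in beam:
            for e, tm_prob in options[f]:
                new_score = (score + math.log(tm_prob)
                             + math.log(bigram_prob(words[-1], e)))
                expanded.append((new_score, words + [e]))
        # Pruning step: hypotheses outside the beam are discarded for good.
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    best_score, best_words = beam[0]
    return best_words[1:], best_score

# Hypothetical translation options p(f|e) and bigram probabilities.
OPTIONS = {
    "la": [("the", 0.9), ("it", 0.1)],
    "maison": [("house", 0.6), ("home", 0.4)],
}
BIGRAMS = {
    ("<s>", "the"): 0.5, ("<s>", "it"): 0.3,
    ("the", "house"): 0.4, ("the", "home"): 0.2,
    ("it", "house"): 0.01, ("it", "home"): 0.01,
}

def bigram_prob(prev, word):
    return BIGRAMS.get((prev, word), 1e-6)

words, score = beam_decode(["la", "maison"], OPTIONS, bigram_prob)
print(" ".join(words))
```

A smaller `beam_size` makes decoding faster but increases the risk of pruning the hypothesis that would have scored best overall, which is the quality versus time trade-off in miniature.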

Because translation systems cannot store all native strings and their translations, a document is typically translated sentence by sentence, but even this is not enough. Language models are typically approximated by smoothed "n"-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages.
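As an illustration of a smoothed "n"-gram model, the following sketch trains a bigram language model with add-one (Laplace) smoothing. Add-one smoothing is chosen here only because it is short to write down; it is an assumption of this example, and practical systems generally use more refined smoothing methods. The function name and the toy corpus are hypothetical.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Return (prob, vocab), where prob(prev, word) is an add-one smoothed bigram estimate."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])           # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    size = len(vocab)

    def prob(prev, word):
        # Adding 1 to every count keeps unseen bigrams from getting probability zero.
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + size)

    return prob, vocab

prob, vocab = train_bigram_lm(["the house is small", "the house is big"])
print(prob("the", "house"), prob("house", "the"))
```

The seen bigram "the house" receives a much higher probability than the unseen "house the", yet the latter is still non-zero, so a sentence containing it is not ruled out entirely.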

The statistical translation models were initially word-based (Models 1-5 from IBM), but significant advances were made with the introduction of phrase-based models [P. Koehn, F.J. Och, and D. Marcu (2003). Statistical phrase based translation. In "Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL)".]. Recent work has incorporated syntax or quasi-syntactic structures [D. Chiang (2005). A Hierarchical Phrase-Based Model for Statistical Machine Translation. In "Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)".].

Word-based translation

In word-based translation, the translated elements are words. Typically, the number of words in a sentence and in its translation differs because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Simple word-based translation cannot handle language pairs with fertility rates different from one. To cope with, for instance, high fertility rates, a word-based system can be allowed to map a single word to multiple words, but not vice versa. For instance, when translating from French to English, each English word could produce zero or more French words, but there is no way to group two English words so that they produce a single French word.

An example of a word-based translation system is the freely available GIZA++ package (GPLed), which includes IBM models.
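To give a feel for how word translation probabilities can be estimated from sentence-aligned text, here is a minimal sketch of expectation-maximization training in the style of IBM Model 1. It is not GIZA++ code, and it makes simplifying assumptions: there is no NULL word, no fertility or distortion modelling, and the three-sentence corpus is invented.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """pairs: list of (foreign_tokens, english_tokens). Returns t[(f, e)], an estimate of p(f|e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: spread each foreign word over the English words of its sentence.
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]
t = train_model1(corpus)
print(round(t[("maison", "house")], 3), round(t[("la", "house")], 3))
```

Although no word alignments are given, the co-occurrence pattern across sentences is enough for the estimate of p(maison|house) to grow at the expense of p(la|house). Each English word still explains foreign words one at a time, which is the one-to-many restriction described above.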

Phrase-based translation

Phrase-based translation aims to reduce the restrictions of word-based translation by translating sequences of words to sequences of words, where the lengths can differ. The sequences of words are called, for instance, blocks or phrases; typically they are not linguistic phrases but phrases found in the corpus using statistical methods. Restricting the phrases to linguistic phrases has been shown to decrease translation quality.
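One widely used statistical method for finding such phrases is to extract all phrase pairs that are consistent with a word alignment. The sketch below is a simplified version of that idea under stated assumptions: the word alignment is given by hand, phrases are limited to `max_len` words, and the usual extension over unaligned words is omitted. The function name and the example sentence are hypothetical.

```python
def extract_phrase_pairs(e_words, f_words, alignment, max_len=3):
    """alignment: set of (e_index, f_index) links. Returns a set of (e_phrase, f_phrase)."""
    pairs = set()
    for e_start in range(len(e_words)):
        for e_end in range(e_start, min(e_start + max_len, len(e_words))):
            # Foreign positions linked to the native span [e_start, e_end].
            f_points = [j for (i, j) in alignment if e_start <= i <= e_end]
            if not f_points:
                continue
            f_start, f_end = min(f_points), max(f_points)
            if f_end - f_start >= max_len:
                continue
            # Consistency: nothing inside the foreign span may link outside the native span.
            consistent = all(e_start <= i <= e_end
                             for (i, j) in alignment if f_start <= j <= f_end)
            if consistent:
                pairs.add((" ".join(e_words[e_start:e_end + 1]),
                           " ".join(f_words[f_start:f_end + 1])))
    return pairs

pairs = extract_phrase_pairs(
    "the blue house".split(),
    "la maison bleue".split(),
    {(0, 0), (1, 2), (2, 1)},  # the-la, blue-bleue, house-maison
)
for pair in sorted(pairs):
    print(pair)
```

The pair ("blue house", "maison bleue") is extracted even though the two words swap order, while ("the blue", "la maison") is rejected because "maison" is linked to "house", which lies outside the native span.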

Syntax-based translation

Challenges with statistical machine translation

Problems that statistical machine translation has to deal with include:

Compound words



Different word orders

Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, so that one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance in where the modifiers of nouns are located.

In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks, in order. This is not always the case with the same text in two languages. In SMT, the translation model can only translate small sequences of words, so word order has to be taken into account separately. A typical solution has been re-ordering models, where a distribution of location changes for each item of translation is approximated from aligned bi-text. Different location changes can be ranked with the help of the language model, and the best one can be selected.
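As a rough illustration of approximating a distribution of location changes from aligned bi-text, the sketch below counts relative jumps between consecutively translated items. It assumes a simple one-to-one alignment representation and plain relative-frequency estimation, both chosen for this example only; the data are invented.

```python
from collections import Counter

def estimate_jump_distribution(aligned_sentences):
    """aligned_sentences: one list per sentence giving, for each target item in
    output order, the position of the source item it translates."""
    counts = Counter()
    for source_positions in aligned_sentences:
        previous = -1
        for pos in source_positions:
            # 0 means the source was read monotonically; other values are jumps.
            counts[pos - previous - 1] += 1
            previous = pos
    total = sum(counts.values())
    return {jump: c / total for jump, c in counts.items()}

# Hypothetical alignments: the second sentence swaps its last two items.
dist = estimate_jump_distribution([[0, 1, 2], [0, 2, 1], [0, 1, 2, 3]])
print(dist)
```

A decoder could multiply such a jump probability into the score of each hypothesis, so that unusual re-orderings are accepted only when the language model prefers them strongly enough.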


Out of vocabulary (OOV) words

SMT systems store different word forms as separate symbols without any relation to each other, so word forms or phrases that were not in the training data cannot be translated. The main reasons for out-of-vocabulary words are limited training data, domain changes and morphology.
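The effect can be measured with a simple out-of-vocabulary rate. The following sketch assumes whitespace tokenization and treats every distinct surface form as its own symbol, as described above; the function name and the sentences are made up for illustration.

```python
def oov_rate(training_sentences, test_sentences):
    """Return (fraction of test tokens unseen in training, list of those tokens)."""
    vocab = {w for s in training_sentences for w in s.split()}
    test_tokens = [w for s in test_sentences for w in s.split()]
    if not test_tokens:
        return 0.0, []
    unknown = [w for w in test_tokens if w not in vocab]
    return len(unknown) / len(test_tokens), unknown

rate, unknown = oov_rate(
    ["the house is small", "the houses are big"],
    ["the house is big", "small houses were built"],
)
print(rate, unknown)
```

Note that "houses" is known only because that exact form occurred in the training data; a form such as "housing" would count as unknown even though "house" was seen, which is how morphology inflates the OOV rate.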

See also

*Machine translation
*Example-based machine translation
*Language Weaver
*Google Translate
*Asia Online

External links

* Statistical Machine Translation: includes introduction to research, conference, corpus and software listings.
* Moses: a state-of-the-art open source SMT system
* Annotated list of statistical natural language processing resources: includes links to freely available statistical machine translation software.


Wikimedia Foundation. 2010.
