Language identification

Language identification is the process of determining which natural language a given piece of content is in. Traditionally, identification of written language, as practiced for instance in library science, has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have treated language identification as a kind of text categorization, a natural language processing task that relies on statistical methods.

Non-Computational Approaches

In the field of library science, language identification is important for categorizing materials. As librarians often have to categorize materials which are in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While identifying a single such word or character may not suffice to distinguish a language from another with a similar orthography, identifying several is often highly reliable.

Statistical Approaches

One statistical approach identifies the language of a text by comparing the compressibility of the text with the compressibility of texts in a set of known languages. This approach is known as a mutual information based distance measure [http://www.xs4all.nl/~ajwp/langident.pdf]. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.
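The compression idea can be illustrated with a minimal Python sketch. It assumes zlib as the compressor and scores each language by how many extra compressed bytes the unknown text costs when appended to that language's reference text. This simple difference is a stand-in chosen for illustration, not the distance measure defined in the cited work, and the names `compressed_size`, `identify_by_compression`, and `REFERENCES` are hypothetical. The reference texts are far too short for real use.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the number of bytes zlib needs to store the data."""
    return len(zlib.compress(data, 9))


def identify_by_compression(text: str, references: Mapping[str, str]) -> str:
    """Pick the language whose reference text makes `text` cheapest to compress."""
    sample = text.encode("utf-8")

    def extra_bytes(reference: str) -> int:
        ref = reference.encode("utf-8")
        # Cost of the sample given the reference: size(ref + sample) - size(ref).
        return compressed_size(ref + b" " + sample) - compressed_size(ref)

    return min(references, key=lambda lang: extra_bytes(references[lang]))


# Toy reference texts (assumption: real systems would use much larger corpora).
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on "
               "the mat while the children were playing in the garden",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat zat op "
             "de mat terwijl de kinderen in de tuin speelden",
}

print(identify_by_compression("the children were playing in the garden", REFERENCES))
```

The intuition is that a compressor finds more repeated substrings when the unknown text shares vocabulary and spelling patterns with the reference, so the added cost is smallest for the matching language.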

Another technique, described by Dunning (1994), is to create a language n-gram model from a "training text" for each language. For any piece of text to be identified, a similar model is built and compared with each stored language model. The stored model most similar to the model of the text indicates the most likely language.
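The model-comparison procedure described above might be sketched as follows. The sketch assumes character trigram counts as the model and cosine similarity as the comparison. The text does not specify either, so this is not necessarily the method of Dunning (1994). The names `ngram_profile`, `cosine_similarity`, `identify_by_ngrams`, and `TRAINING_TEXTS` are hypothetical, and the training texts are toy examples.

```python
import math
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify_by_ngrams(text: str, models: dict, n: int = 3) -> str:
    """Return the language whose stored model is most similar to the text's model."""
    profile = ngram_profile(text, n)
    return max(models, key=lambda lang: cosine_similarity(profile, models[lang]))


# Toy training texts (assumption: real models need far more data per language).
TRAINING_TEXTS = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on "
               "the mat while the children were playing in the garden",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat zat op "
             "de mat terwijl de kinderen in de tuin speelden",
}
MODELS = {lang: ngram_profile(text) for lang, text in TRAINING_TEXTS.items()}

print(identify_by_ngrams("the dog and the cat", MODELS))
```

Character n-grams are a common choice here because they capture spelling patterns without requiring word segmentation, but the value of `n` and the similarity measure are design choices that would need tuning on real data.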

See also

* Algorithmic information theory
* Artificial grammar learning
* Kolmogorov complexity
* Machine translation
* Translation

References

* Benedetto, D., E. Caglioti and V. Loreto. (2002) "Language trees and zipping". Physical Review Letters, 88:4. [http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/BenedettoCaLo.pdf], [http://pil.phys.uniroma1.it/~loreto/complexity.htm], [http://www.hpcwire.com/dsstar/02/0507/104225.html]

* Cilibrasi, Rudi and Paul M.B. Vitanyi. "Clustering by compression". IEEE Transactions on Information Theory 51(4), April 2005, 1523-1545. [http://homepages.cwi.nl/~paulv/papers/cluster.pdf]

* Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University.

* Goodman, Joshua. (2002) Extended comment on "Language Trees and Zipping". Microsoft Research, Feb 21 2002. (This is a criticism of the data compression approach in favor of the Naive Bayes method.) [http://research.microsoft.com/~joshuago/physicslongcomment.ps]

* Poutsma, Arjen. (2001) "Applying Monte Carlo techniques to language identification". SmartHaven, Amsterdam. Presented at CLIN 2001 [http://hmi.ewi.utwente.nl/Conferences/clin2001.html]

* The Economist. (2002) "The elements of style: Analysing compressed data leads to impressive results in linguistics". [http://www.economist.com/science/displayStory.cfm?story_id=975770]

* Survey of the State of the Art in Human Language Technology, (1996), section 8.7 Automatic Language Identification [http://cslu.cse.ogi.edu/HLTsurvey/ch8node9.html#SECTION87]

External links

* Unknown Language Identification at Georgetown University [http://complingone.georgetown.edu/~langid/]

* Links to LID tools by Gertjan van Noord [http://www.let.rug.nl/~vannoord/TextCat/competitors.html]

* Implementation of an n-gram based LID tool in Python by Damir Cavar [http://jones.ling.indiana.edu/~dcavar/lid/]


Wikimedia Foundation. 2010.
