Sliding window based part-of-speech tagging

Sliding window based part-of-speech tagging is used to part-of-speech tag a text.

A high percentage of words in a natural language are words which are independently of context can be assigned more than one morphological analysis. The percentage of these ambiguous words is typically around 30%, although it depends greatly on the language. Solving this problem is very important in many areas of natural language processing. For example in machine translation changing the part-of-speech of a word can dramatically change its translation.

Sliding window based part-of-speech taggers are programs which assign a single part-of-speech to a given lexical form of a word, by looking at a fixed sized "window" of words around the word to be disambiguated.

The two main advantages of this approach are:

* It is possible to automatically train the tagger, getting rid of the need of manually tagging a corpus.
* The tagger can be implemented as a finite state automata (Mealy machine)

Formal definition

Let

: $Gamma = { gamma_{1}, gamma_{2}, ldots, gamma_ }$

be the set of grammatical tags of the application, that is, the set of all possible tags which may be assigned to a word, and let

: $W = { w1, w2, ldots }$

be the vocabulary of the application. Let

: $T : W
ightarrow P ( Gamma )$

be a function for morphological analysis which assigns each $w$ its set of possible tags, $T ( w ) subseteq Gamma$ , that can be implemented by a full-form lexicon, or a morphological analyser. Let

: $Sigma = { sigma_{1}, sigma_{2}, ldots, sigma_ }$

be the set of word classes, that in general will be a partition of $W$ with the restriction that for each $sigma in Sigma$ all of the words $w , Sigma , sigma$ will receive the same set of tags, that is, all of the words in each word class ( $sigma$ ) belong to the same ambiguity class.

Normally, $Sigma$ is constructed in a way that for high frequency words, each word class contains a single word, while for low frequency words, each word class corresponds to a single ambiguity class. This allows good performance for high frequency ambiguous words, and doesn't require too many parameters for the tagger.

With these definitions it is possible to state problem in the following way: Given a text $w [1] w [2] ldots w [L] in W^*$ each word $w [t]$ is assigned a word class $T ( w [t] ) in Sigma$ (either by using the lexicon or morphological analyser) in order to get an ambiguously tagged text $sigma [1] sigma [2] ldots sigma [L] in W^*$ . The job of the tagger is to get a tagged text $gamma [1] gamma [2] ldots gamma [L]$ (with $gamma [t] in T( sigma [t] ) )$ as correct as possible.

A statistical tagger looks for the most probable tag for an ambiguously tagged text $sigma [1] sigma [2] ldots sigma [L]$ :

: $gamma^* [1] ldots gamma^* [L] = argmax_{gamma [t] epsilon T ( sigma [t] )} p(gamma [1] ldots gamma [L] sigma [1] ldots sigma [L] )$

Using Bayes formula, this is converted into:

: $gamma^* [1] ldots gamma^* [L] = argmax_{gamma [t] epsilon T ( sigma [t] )} p(gamma [1] ldots gamma [L] ) p(sigma [1] ldots sigma [L] gamma [1] ldots gamma [L] )$

where $p(gamma [1] gamma [2] ldots gamma [L] )$ is the probability that a particular tag (syntactic probability) and $p(sigma [1] dots sigma [L] gamma [1] ldots gamma [L] )$ is the probability that this tag corresponds to the text $sigma [1] ldots sigma [L]$ (lexical probability).

In a Markov model, these probabilities are approximated as products. The syntactic probabilities are modelled by a first order Markov process:

: $p(gamma [1] gamma [2] ldots gamma [L] ) = prod_{t=1}^{t=L} p(gamma [t+1] gamma [t] )$

where $gamma [0]$ and $gamma [L+1]$ are delimiter symbols.

Lexical probabilities are independent of context:

: $p(sigma [1] sigma [2] ldots sigma [L] gamma [1] gamma [2] ldots gamma [L] ) = prod_{t=1}^{t=L} p(sigma [t] gamma [t] )$

One form of tagging is to approximate the first probability formula:

: $p(sigma [1] sigma [2] ldots sigma [L] gamma [1] gamma [2] ldots gamma [L] ) = prod_{t=1}^{t=L} p(gamma [t] C_{(-)} [t] sigma [t] C_{(+)} [t] )$

where $C_{(-)} [t] = sigma [t - N_{(-)} sigma [t - N_{(-)}] ldots sigma [t - 1]$ is the right context of the size $N_{(+)}$ .

In this way the sliding window algorithm only has to take into account a context of size $N_{(-)} + N_{(+)} + 1$ . For most applications $N_{(-)} = N_{(+)} = 1$ . For example to tag the ambiguous word "run" in the sentence "He runs from danger", only the tags of the words 'He' and 'from' are needed to be taken into account.

Further reading

* Sanchez-Villamil, E., Forcada, M. L., and Carrasco, R. C. (2005). " [http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf Unsupervised training of a finite-state sliding-window part-of-speech tagger] ". "Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence", vol. 3230, p. 454-463

Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Academic Dictionaries and Encyclopedias

Sliding window based part-of-speech tagging

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Sliding window based part-of-speech tagging

Share the article and excerpts

Direct link