Grammar induction

Grammar induction

Grammatical induction, also known as grammatical inference or syntactic pattern recognition, refers to the process in machine learning of inducing a formal grammar (usually in the form of "re-write rules" or "productions") from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. Grammatical inference is distinguished from traditional decision rules and other such methods principally by the nature of the resulting model, which in the case of grammatical inference relies heavily on hierarchical substitutions. Whereas a traditional decision rule set is geared toward assessing object classification, a grammatical rule set is geared toward the generation of examples. In this sense, the grammatical induction problem can be said to seek a "generative" model, while the decision rule problem seeks a "descriptive" model.

Methodologies

There are a wide variety of methods for grammatical inference. Two of the classic sources are Harvtxt|Fu|1977 and Harvtxt|Fu|1982. Harvtxt|Duda|Hart|Stork|2001 also devote a brief section to the problem, and cite a number of references. The basic trial-and-error method they present is discussed below.

Grammatical inference by trial-and-error

The method proposed in Section 8.7 of Harvtxt|Duda|Hart|Stork|2001 suggests successively guessing grammar rules (productions) and testing them against positive and negative observations. The rule set is expanded so as to be able to generate each positive example, but if a given rule set also generates a negative example, it must be discarded. This particular approach can be characterized as "hypothesis testing" and bears some similarity to Mitchel's version space algorithm. The Harvtxt|Duda|Hart|Stork|2001 text provide a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious. An evolutionary approach, such as that described below, is likely to yield much better results because Fact|date=January 2008.

Grammatical inference by genetic algorithms

Grammatical Induction using evolutionary algorithms is the process of evolving a representation of the grammar of a target language through some evolutionary process. Formal grammars can easily be represented as a tree structure of production rules that can be subjected to evolutionary operators. Algorithms of this sort stem from the genetic programming paradigm pioneered by John Koza.Fact|date=August 2007 Other early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach.

Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the lisp code. Similar analogues between the tree structured lisp representation and the representation of grammars as trees, made the application of genetic programming techniques possible for grammar induction.

In the case of Grammar Induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol (e.g. a noun or verb or some other part of speech) of a production rule corresponds to a leaf node of the tree. Its parent nodes corresponds to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal.

Applications

The principle of grammar induction has been applied to other aspects of natural language processing, and have been applied (among many other problems) to morpheme analysis, and even place name derivations.

ee also

* Artificial grammar learning
* Syntactic pattern recognition
* Inductive inference

References

*

*Harvard reference | Surname1=Fu| Given1=King Sun|Title=Syntactic Pattern Recognition and Applications| Publisher=Prentice-Hall| Place=Englewood Cliffs, NJ | Year=1982| Edition=| URL=

*Harvard reference | Surname1=Fu| Given1=King Sun|Title=Syntactic Pattern Recognition, Applications| Publisher=Springer-Verlag| Place=Berlin | Year=1977| Edition=| URL=

*Harvard reference | Surname1=Horning| Given1=James Jay|Title=A Study of Grammatical Inference| Publisher=Stanford University Computer Science Department| Place=Stanford | Year=1969| Edition=Ph.D. Thesis| URL=http://proquest.umi.com/pqdlink?Ver=1&Exp=05-16-2013&FMT=7&DID=757518381&RQT=309&attempt=1&cfc=1

*Harvard reference | Surname1=Gold| Given1=E Mark|Title=Language Identification in the Limit| Publisher=Information and Control| Place=| Year=1967| Edition=| URL=


Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

  • induction — Synonyms and related words: Baconian method, a fortiori reasoning, a posteriori reasoning, a priori reasoning, accedence, acceptance, accession, admission, admittance, alphabet, analysis, apostolic orders, appointment, baptism, basics, call, call …   Moby Thesaurus

  • grammar — Synonyms and related words: abecedarium, abecedary, alphabet, alphabet book, basics, battledore, bowwow theory, casebook, choice of words, comparative linguistics, composition, derivation, descriptive linguistics, dialect, dialectology, diction,… …   Moby Thesaurus

  • Newcastle Grammar School — Latin: Spectemur Agendo Let us be judged by our actions Location Newcastle, New South …   Wikipedia

  • Artificial grammar learning — is a paradigm of study within cognitive psychology. Its goal is to investigate the processes that underlie human language learning, by testing subjects ability to learn a made up language in a laboratory setting. The area of interest is typically …   Wikipedia

  • Southwood Boys' Grammar School — Infobox Aust school private name = Southwood Boys Grammar School motto = Factis Non Verbis (Latin: By deeds not words ) established = 1999 type = Independent, Single sex, Day denomination = Anglican slogan = key people = Mrs. Jenny Collins… …   Wikipedia

  • Computational linguistics — This article is about the scientific field. For the journal, see Computational Linguistics (journal). Linguistics …   Wikipedia

  • Minimum message length — (MML) is a formal information theory restatement of Occam s Razor: even when models are not equal in goodness of fit accuracy to the observed data, the one generating the shortest overall message is more likely to be correct (where the message… …   Wikipedia

  • Kolmogorov complexity — In algorithmic information theory (a subfield of computer science), the Kolmogorov complexity of an object, such as a piece of text, is a measure of the computational resources needed to specify the object. It is named after Soviet Russian… …   Wikipedia

  • Lossless data compression — is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. The term lossless is in contrast to lossy data compression, which only allows an approximation of the original data to be… …   Wikipedia

  • Automatic distillation of structure — (ADIOS) is an algorithm that can analyse source material such as text and come up with meaningful information about the generative structures that gave rise to the source. One application of the algorithm is grammar induction: ADIOS can read a… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”