Named-entity recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

and producing an annotated block of text, such as this one:

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.

In this example, the annotations have been done using so-called ENAMEX tags that were developed for the Message Understanding Conference in the 1990s.
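As a minimal sketch of how such inline annotations can be read back by a program, the Python snippet below extracts the tagged spans from the example sentence and recovers the unannotated text. The names TAG_PATTERN, extract_entities, and strip_tags are illustrative assumptions, not part of any MUC tooling, and the pattern only handles the simple, non-nested tags shown above.

```python
import re

# Matches MUC-style inline tags such as <ENAMEX TYPE="PERSON">Jim</ENAMEX>.
# The backreference \1 requires the closing tag name to match the opening one.
# Assumption: tags are not nested and TYPE values are uppercase letters only.
TAG_PATTERN = re.compile(r'<(ENAMEX|NUMEX|TIMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(annotated):
    """Return (tag, type, text) triples for every tagged span."""
    return [m.groups() for m in TAG_PATTERN.finditer(annotated)]


def strip_tags(annotated):
    """Recover the unannotated text by keeping only each span's content."""
    return TAG_PATTERN.sub(lambda m: m.group(3), annotated)


annotated = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought '
    '<NUMEX TYPE="QUANTITY">300</NUMEX> shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)

for tag, entity_type, text in extract_entities(annotated):
    print(f"{tag:7} {entity_type:13} {text}")
print(strip_tags(annotated))
```

Running it lists the four entities with their tag and type, followed by the original sentence with the markup removed.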

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.^[1]^[2] Taking error as 100% minus F-measure, the system's error rate (6.61%) was roughly two to three times that of the human annotators (2.40% and 3.05%).
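The arithmetic behind these figures can be sketched in a few lines of Python. The snippet uses the standard definition of the F-measure as a weighted harmonic mean of precision and recall, and treats error rate as 100% minus the F-measure, which is how the percentages above relate to one another. The function names and the precision and recall values in the first print statement are hypothetical and chosen only for illustration; the three F-measure scores are the ones quoted above.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def error_rate(f_score_percent):
    """Error rate in percent, taken here as 100 minus the F-measure."""
    return 100.0 - f_score_percent


# Hypothetical precision and recall, for illustration only.
print(f"F1 for P=0.95, R=0.92: {f_measure(0.95, 0.92):.4f}")

# F-measure scores quoted in the text for MUC-7.
system = 93.39
humans = [97.60, 96.95]

# Ratio of the system's error rate to each human annotator's error rate.
ratios = [error_rate(system) / error_rate(h) for h in humans]
for human, ratio in zip(humans, ratios):
    print(f"system error / human error (F = {human}): {ratio:.2f}")
```

The two ratios come out between 2 and 3, which is why a small gap in F-measure can still correspond to a much larger relative gap in errors.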

1 Approaches
2 Problem domains
3 Named entity types
4 NER evaluation forums
5 References
6 See also
7 External links

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data.
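To make the grammar-based approach concrete, here is a deliberately tiny sketch of a hand-written rule set in Python. The three patterns, the RULES table, and the recognize function are hypothetical examples written for this illustration; real hand-crafted systems use far richer grammars and lexicons, and this sketch does not resolve overlapping matches.

```python
import re

# Hypothetical hand-written rules: (entity type, compiled pattern).
RULES = [
    # One or more capitalized words followed by a company suffix.
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Corp|Inc|Ltd)\.")),
    # A four-digit year in the 1900s or 2000s.
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),
    # A title followed by a capitalized word.
    ("PERSON", re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+")),
]


def recognize(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for entity_type, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), entity_type, m.group()))
    return sorted(found)


sentence = "Jim bought 300 shares of Acme Corp. in 2006."
for start, end, entity_type, surface in recognize(sentence):
    print(f"{start:3}-{end:<3} {entity_type:13} {surface}")
```

On the example sentence these rules find Acme Corp. and 2006 but miss Jim, because no rule covers a bare first name. Narrow rules of this kind rarely fire wrongly, yet they leave gaps, which is the precision-for-recall trade-off described above.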

Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.^[3] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work on NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to the processing of military dispatches and reports. Later stages of the Automatic Content Extraction (ACE) evaluation also included several types of informal text, such as weblogs and transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been the names of genes and gene products.

Named entity types

In the expression "named entity", the word "named" restricts the task to those entities for which one or more rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms, such as biological species and substances.

There is general agreement to include temporal expressions and some numerical expressions (e.g., money and percentages) as instances of named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), many are not (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.^[4]

At least two hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes.^[5] Sekine's extended hierarchy, also proposed in 2002, is made up of 200 subtypes.^[6]

NER evaluation forums

Evaluation of NER systems is critical to scientific progress of this field.

Most evaluation of these systems has been performed at conferences or contests put on by government organizations, sometimes acting in concert with contractors or academics.

Conference	Acronym	Language(s)	Year(s)	Sponsor	Archive Site
Message Understanding Conference	MUC	English	1987–1999	DARPA	[1]
Multilingual Entity Task Conference	MET	Chinese and Japanese	1998	US	[2]
Automatic Content Extraction Program	ACE	English	2000–2008	NIST	[3]
Conference on Computational Natural Language Learning	CoNLL	Spanish and Dutch / German and English	2002–2003		[4]
Evaluation contest for named entity recognizers in Portuguese	HAREM	Portuguese	2004–2008	Linguateca	[5]
Information Retrieval and Extraction Exercise	IREX	Japanese	1998–1999		[6]
ACL Special Interest Group in Chinese	SIGHan	Chinese	2006		[7]
TAC Knowledge Base Population Evaluation	TAC/KBP	English	2009–	NIST	[8]

References

^[1] Elaine Marsh, Dennis Perzanowski, "MUC-7 Evaluation of IE Technology: Overview of Results", 29 April 1998.
^[2] MUC-07 Proceedings (Named Entity Tasks).
^[3] Poibeau, Thierry and Kosseim, L. (2001) Proper Name Extraction from Non-Journalistic Texts. Proc. Computational Linguistics in the Netherlands.
^[4] http://www.webknox.com/blog/2010/09/named-entity-definition/
^[5] http://www.ldc.upenn.edu/Catalog/docs/LDC2005T33/BBN-Types-Subtypes.html
^[6] http://nlp.cs.nyu.edu/ene/

External links

Named entity recognition for Arabic – Issues and challenges in morphologically rich languages such as Arabic

Wikimedia Foundation. 2010.
