Sentence extraction

Sentence extraction is a technique used for automatic summarization. In this shallow approach, statistical heuristics are used to identify the most salient sentences of a text. Sentence extraction is a low-cost approach compared to more knowledge-intensive, deeper approaches, which require additional knowledge bases such as ontologies or linguistic knowledge. In short, sentence extraction works as a filter that allows only important sentences to pass.

The major downside of applying sentence-extraction techniques to summarization is the loss of coherence in the resulting summary. Nevertheless, extracted summaries can give valuable clues to the main points of a document and are often sufficiently intelligible to human readers.

Procedure

Usually, a combination of heuristics is used to determine the most important sentences within the document. Each heuristic assigns a (positive or negative) score to each sentence, and the individual heuristics are weighted according to their importance. After all heuristics have been applied, the x sentences with the highest combined scores are included in the summary.
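The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the three heuristics (sentence position, overlap with title words, and a penalty for very short sentences), their weights, the naive sentence splitter and all function names are assumptions chosen for the example.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def position_score(index, sentence, context):
    """Favour early sentences: 1.0 for the first one, decreasing afterwards."""
    return 1.0 / (index + 1)


def title_score(index, sentence, context):
    """Fraction of the title words that also occur in the sentence."""
    title_words = context["title_words"]
    if not title_words:
        return 0.0
    words = set(re.findall(r"\w+", sentence.lower()))
    return len(words & title_words) / len(title_words)


def length_penalty(index, sentence, context):
    """Negative score for very short sentences (fewer than 4 words)."""
    return -1.0 if len(sentence.split()) < 4 else 0.0


# Hypothetical (heuristic, weight) pairs; real systems tune these weights.
HEURISTICS = [(position_score, 1.0), (title_score, 2.0), (length_penalty, 1.0)]


def extract(text, title, x=2):
    """Return the x highest-scoring sentences, in their original order."""
    context = {"title_words": set(re.findall(r"\w+", title.lower()))}
    sentences = split_sentences(text)
    scored = []
    for i, sentence in enumerate(sentences):
        total = sum(w * h(i, sentence, context) for h, w in HEURISTICS)
        scored.append((total, i))
    # Highest score first; ties are broken in favour of earlier sentences.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:x]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Calling extract(text, title, x=2) returns a list with the two highest-scoring sentences. Emitting them in document order rather than score order is a design choice that tends to limit the loss of coherence somewhat.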

Early approaches and some sample heuristics

Seminal papers which laid the foundations for many techniques used today were published by H. P. Luhn in 1958 [H. P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal, April 1958, pp. 159–165, http://www.research.ibm.com/journal/rd/022/luhn.pdf] and H. P. Edmundson in 1969 [H. P. Edmundson, "New Methods in Automatic Extracting", Journal of the ACM, 16 (2), 1969, pp. 264–285, doi:10.1145/321510.321519, http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf].

Luhn proposed to assign more weight to sentences at the beginning of the document or a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop-lists in order to filter out uninformative words of low semantic content (e.g. most grammatical words such as "of", "the", "a"). He also distinguished between "bonus words" and "stigma words", i.e. words that tend to occur together with important information (e.g. the word form "significant") or with unimportant information, respectively.

The idea of using key words, i.e. words which occur with significant frequency in the document, is still one of the core heuristics of today's summarizers. With the large linguistic corpora available today, the TF/IDF value, which originated in information retrieval, can be applied to identify the key words of a text. TF stands for "term frequency", i.e. how often a word occurs in the text to be summarized; IDF stands for "inverse document frequency", which is low for words that occur in many documents of the corpus. If, for example, the word "cat" occurs significantly more often in the text to be summarized than in the corpus as a whole, then "cat" is likely to be an important word of the text; the text may in fact be about cats.
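As a rough illustration of the key-word heuristic, the following Python sketch ranks the words of a text by a TF/IDF value computed against a small reference corpus. The tiny stop-list, the tokenizer, the smoothed IDF formula and the function names are all assumptions made for this example; practical systems use much larger corpora and stop-lists and may weight the terms differently.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-list (assumption); real stop-lists are much longer.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "on", "is", "are", "to"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf(text, corpus):
    """Map each word of `text` to its TF/IDF value relative to `corpus`.

    `corpus` is a list of reference documents (plain strings).
    """
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in counts.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for words in doc_sets if term in words)
        # Smoothed IDF (one common variant), defined even when df == 0.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(tokens)) * idf
    return scores


def key_words(text, corpus, k=3):
    """Return the k words with the highest TF/IDF value."""
    scores = tf_idf(text, corpus)
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

A sentence-scoring heuristic could then, for instance, add up the TF/IDF values of the key words that a sentence contains.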


Wikimedia Foundation. 2010.
