Text segmentation

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to the mental processes humans use when reading text and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial: some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, but such signals are sometimes ambiguous and are not present in all written languages.

Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.

Segmentation problems

Word segmentation

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages written in some form of the Latin alphabet, the space is a good approximation of a word delimiter. (The space character alone is not always sufficient; for example, the contraction can't stands for the two words can not.)
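The space heuristic can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the CONTRACTIONS table is a hypothetical, deliberately tiny example, and the punctuation handling is an assumption made for brevity.

```python
# Hypothetical, deliberately tiny contraction table (an assumption for illustration).
CONTRACTIONS = {"can't": ["can", "not"], "won't": ["will", "not"]}

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, expand known contractions."""
    tokens = []
    for chunk in text.split():
        # Remove punctuation attached to the word; apostrophes are kept.
        word = chunk.strip(".,;:!?\"()").lower()
        if not word:
            continue
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens

print(tokenize("I can't go."))
```

Contractions missing from the table pass through as a single token, which is the main weakness of a purely table-driven approach.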

However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages without a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.

In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a Standard Annex on Text Segmentation (UAX #29), exploring the issues of segmentation in multiscript texts.

Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
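One way to approach word splitting is dictionary lookup combined with dynamic programming. The sketch below is only one possible technique and makes several assumptions: the function name split_words is hypothetical, the vocabulary is supplied by the caller, and ties between candidate splits are broken by preferring the fewest words.

```python
def split_words(text, vocabulary):
    """Return a split of text into vocabulary words using the fewest words, or None."""
    n = len(text)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

vocab = {"the", "cat", "is", "here", "he", "re"}
print(split_words("thecatishere", vocab))
```

Preferring fewer words is a crude stand-in for the statistical scoring that practical systems tend to use; it cannot resolve splits that are equally short but differ in meaning.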

Word splitting may also refer to the process of hyphenation.

Sentence segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, punctuation, particularly the full stop character, is a reasonable approximation of a sentence boundary. However, even in English this problem is not trivial, because the full stop is also used in abbreviations, which may or may not also terminate a sentence. For example, Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
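The abbreviation-table idea can be illustrated with a short Python sketch. The ABBREVIATIONS set here is a hypothetical, incomplete example, and the function assumes that tokens are separated by whitespace.

```python
# Hypothetical abbreviation table (an assumption for illustration, far from complete).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after tokens ending in . ! or ?, unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
```

A table alone never splits after an abbreviation, so a sentence that genuinely ends with one is merged with the sentence that follows; handling that case requires additional evidence such as capitalization of the next word.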

As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.

Other segmentation problems

Text may also need to be divided into units other than words and sentences, including morphemes (a task usually called morphological analysis), paragraphs, topics or discourse turns.

A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases one needs to use techniques similar to those used in document classification. Many different approaches have been tried.[1][2]
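As a rough illustration of automatic topic segmentation, the following Python sketch proposes a boundary wherever adjacent sentences share little vocabulary. It is a simplified lexical-cohesion heuristic and is not the method of either cited work; the function names and the threshold value are assumptions chosen for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences, threshold=0.1):
    """Return each index i where a topic boundary is proposed before sentences[i]."""
    bags = [Counter(s.lower().split()) for s in sentences]
    # The threshold is an arbitrary, hypothetical cut-off.
    return [i for i in range(1, len(bags)) if cosine(bags[i - 1], bags[i]) < threshold]

sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply",
    "stock markets fell again",
]
print(topic_boundaries(sentences))
```

Comparing single sentences is noisy; comparing larger blocks of text and smoothing the similarity scores would likely be more robust, at the cost of coarser boundaries.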

Automatic segmentation approaches

Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.

When punctuation and similar clues are not consistently available, the segmentation task often requires non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem from processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

  • Analyze the text manually and write custom software
  • Annotate the sample corpus with boundary information and use machine learning
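For the annotation-based approach, the annotated corpus has to be turned into training examples. One common framing, assumed here for illustration, treats segmentation as a per-character decision: does a word boundary follow this character or not? The helper name boundary_labels and the use of a space as the annotation separator are hypothetical choices.

```python
def boundary_labels(segmented, separator=" "):
    """Turn text annotated with separators into (characters, labels).

    labels[i] is 1 if a word boundary follows characters[i], else 0.
    """
    words = [w for w in segmented.split(separator) if w]
    characters, labels = [], []
    for word in words:
        for index, char in enumerate(word):
            characters.append(char)
            # Only the last character of each word is followed by a boundary.
            labels.append(1 if index == len(word) - 1 else 0)
    return "".join(characters), labels

print(boundary_labels("the cat"))
```

The resulting label sequence can then be paired with features of each character's context and given to whichever learning algorithm is chosen.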

Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
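As an illustration of using markup as boundary evidence, the sketch below treats each HTML p element as one paragraph, using Python's standard html.parser module. The class and function names are hypothetical, and real pages need far more care (nested blocks, headings, line breaks, scripts).

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an earlier unclosed <p> ends here
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def flush(self):
        """Store the pending paragraph, normalizing whitespace."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.paragraphs.append(text)
        self._parts = []
        self._inside = False

def html_paragraphs(markup):
    parser = ParagraphExtractor()
    parser.feed(markup)
    parser.close()
    parser.flush()  # keep a final paragraph whose <p> was never closed
    return parser.paragraphs

print(html_paragraphs("<p>First <b>para</b>.</p><p>Second one.</p>"))
```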

References

  1. Freddy Y. Y. Choi (2000). "Advances in domain independent linear text segmentation" (PDF). Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00). pp. 26–33. http://acl.ldc.upenn.edu/A/A00/A00-2004.pdf.
  2. Jeffrey C. Reynar (1998). "Topic Segmentation: Algorithms and Applications" (PDF). IRCS-98-21. University of Pennsylvania. http://repository.upenn.edu/cgi/viewcontent.cgi?article=1068&context=ircs_reports. Retrieved 2007-11-08.

External links

  • Word Split: an open source software tool designed to split conjoined words into human-readable text.

Wikimedia Foundation. 2010.
