Text normalization

Text normalization is the process of transforming text into a single consistent form that it might not have had before. It is often performed before the text is processed further, for example before generating synthesized speech, automated language translation, storage in a database, or comparison.

Examples of text normalization:

  • Unicode normalization
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
  • removing stopwords or "too common" words
  • stemming
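Several of the steps listed above can be chained into one small pipeline. The following is a minimal Python sketch, not a definitive implementation: it uses only the standard library, the `STOPWORDS` set is a tiny made-up list for illustration, the function names are hypothetical, only ASCII punctuation is removed, and stemming and abbreviation expansion are left out.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "and"}

# Translation table that deletes ASCII punctuation characters only.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str) -> list:
    """Return the normalized tokens of `text`."""
    text = unicodedata.normalize("NFC", text)  # Unicode normalization
    text = text.casefold()                     # case normalization
    text = strip_diacritics(text)              # remove accents and other diacritics
    text = text.translate(_DROP_PUNCTUATION)   # remove punctuation
    return [word for word in text.split() if word not in STOPWORDS]


print(normalize("The Café, and the RÉSUMÉ!"))  # ['cafe', 'resume']
```

Because the text is brought to a single Unicode form first, a precomposed "é" and an "e" followed by a combining acute accent yield the same tokens.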

While normalization can be done manually, and usually is for ad hoc and personal documents, many programming languages provide mechanisms that support it.

Text normalization is useful, for example, for comparing two sequences of characters that mean the same thing but are represented differently. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "can't" vs "cannot".
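A comparison like this can be sketched with a small lookup table. The Python example below is illustrative only: the `CONTRACTIONS` table holds just the three pairs mentioned above, the function name is hypothetical, and lowercasing everything is a simplifying assumption.

```python
import re

# Illustrative table containing only the pairs mentioned in the text.
CONTRACTIONS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
}

# One alternation that matches any known contraction as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text: str) -> str:
    """Lowercase `text` and replace known contractions with their long forms."""
    # Treat the typographic apostrophe (U+2019) like the ASCII one.
    text = text.lower().replace("\u2019", "'")
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1)], text)


print(expand_contractions("I'm sure you don't"))  # i am sure you do not
```

After expansion, "Don't" and "do not" compare equal as plain strings. A real table would need many more entries and would have to handle ambiguous forms, which this sketch ignores.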

Similarly, "1" and "one" mean the same thing, as do "1st" and "first", and so on. Through normalization, such strings can be treated as the same instead of as different.
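One simple way to treat such strings as the same is to map each variant to a single canonical token before comparing. The Python sketch below assumes a hand-written table, `NUMBER_FORMS`, covering only a few values; the table and both function names are hypothetical.

```python
# Hypothetical mapping from digit forms to word forms (a few values only).
NUMBER_FORMS = {
    "1": "one", "1st": "first",
    "2": "two", "2nd": "second",
    "3": "three", "3rd": "third",
}


def canonicalize_tokens(text: str) -> list:
    """Lowercase, split on whitespace, and map known number forms to words."""
    return [NUMBER_FORMS.get(token, token) for token in text.lower().split()]


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonicalizing their tokens."""
    return canonicalize_tokens(a) == canonicalize_tokens(b)


print(same_text("1st place", "First place"))  # True
print(same_text("1", "first"))                # False
```

Cardinal and ordinal forms stay distinct here, so "1" matches "one" but not "first".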


Wikimedia Foundation. 2010.


