Text normalization

Text normalization is the process of transforming text into a consistent form that it may not have had before. It is often performed before the text is processed further, for example before speech synthesis, automated language translation, storage in a database, or comparison.

Examples of text normalization:

Unicode normalization

converting all letters to lower or upper case

removing punctuation

removing accent marks and other diacritics from letters

expanding abbreviations

removing stopwords or "too common" words

stemming

Normalization can be done by hand, and usually is for ad hoc and personal documents, but many programming languages provide mechanisms that support it.
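As a rough illustration of such mechanisms, the sketch below chains four of the steps listed above (Unicode normalization, case conversion, diacritic removal, and punctuation removal) using only Python's standard `unicodedata` module. The function names `strip_diacritics`, `remove_punctuation`, and `normalize` are made up for this example, and the order of the steps is one reasonable choice rather than a prescribed one.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD) so accents become separate combining marks,
    # drop those marks, then recompose whatever is left (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def remove_punctuation(text: str) -> str:
    # Drop characters whose Unicode general category starts with "P"
    # (punctuation). Symbols such as "$" or "+" are category "S" and are kept.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def normalize(text: str) -> str:
    # Illustrative pipeline; the step order is an assumption of this sketch.
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = text.casefold()                      # aggressive lowercasing
    text = strip_diacritics(text)
    text = remove_punctuation(text)
    return " ".join(text.split())               # collapse whitespace runs
```

With this pipeline, `normalize("Café, CAFÉ!")` yields `"cafe cafe"`, and the precomposed `"Caf\u00e9"` and the decomposed `"Cafe\u0301"` normalize to the same string. Whether removing diacritics or punctuation is appropriate depends on the language and the application.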

Text normalization is useful, for example, for comparing two sequences of characters that mean the same thing but are represented differently. Examples of this kind of normalization include, but are not limited to, "don't" vs. "do not", "I'm" vs. "I am", and "Can't" vs. "Cannot".
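The comparison described above can be sketched with a small lookup table. The table `CONTRACTIONS` and the helpers `expand_contractions` and `same_after_expansion` are hypothetical names for this example, and the table holds only the three pairs mentioned in the text; a real system would need a far larger list and would have to deal with ambiguous forms.

```python
import re

# Hypothetical, deliberately tiny table: only the pairs named in the text.
CONTRACTIONS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
}

# Match any known contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    # Treat the typographic apostrophe (U+2019) like the ASCII one.
    text = text.replace("\u2019", "'")
    # Replacements come from the table in lowercase; original case is lost.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def same_after_expansion(a: str, b: str) -> bool:
    # Compare two strings after expansion and case folding.
    return expand_contractions(a).casefold() == expand_contractions(b).casefold()
```

Under these assumptions, `same_after_expansion("Can't", "Cannot")` and `same_after_expansion("I'm", "I am")` both return `True`, while unrelated strings still compare as different.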

Similarly, "1" and "one" denote the same thing, as do "1st" and "first", and so on. Rather than treating such strings as different, text processing can treat them as the same.
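A minimal sketch of this idea maps digit tokens to their spelled-out forms before comparison. The table `NUMBER_WORDS` and the function `spell_out_numbers` are assumptions of this example and cover only a handful of values; tokens that are not in the table are left unchanged.

```python
import re

# Hypothetical lookup table covering only a few cardinals and ordinals.
NUMBER_WORDS = {
    "1": "one", "2": "two", "3": "three",
    "1st": "first", "2nd": "second", "3rd": "third",
}

# A run of digits, optionally followed by an ordinal suffix, as a whole token.
_TOKEN = re.compile(r"\b\d+(?:st|nd|rd|th)?\b", flags=re.IGNORECASE)


def spell_out_numbers(text: str) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(0).lower()
        # Fall back to the original token when the table has no entry.
        return NUMBER_WORDS.get(token, match.group(0))

    return _TOKEN.sub(replace, text)
```

For instance, `spell_out_numbers("1st place")` gives `"first place"`, so it compares equal to the spelled-out form, whereas `"42 apples"` passes through untouched because 42 is not in the table.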



