Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for names with the same pronunciation to be encoded to the same representation so that they can be matched despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

History

Soundex was developed by Robert Russell and Margaret Odell and patented in 1918 [U.S. Patent 1,261,167, 1918-04-02, inventor R. C. Russell] and 1922 [U.S. Patent 1,435,663, 1922-11-14, inventor R. C. Russell]. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and the Journal of the Association for Computing Machinery (CACM and JACM), and especially when it was described in Donald Knuth's magnum opus, "The Art of Computer Programming".

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government ["The Soundex Indexing System", National Archives and Records Administration, 2007-05-30, http://www.archives.gov/genealogy/census/soundex.html, accessed 2007-06-07]. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

Rules

The Soundex code for a name consists of a letter followed by three digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit; for example, the labials B, F, P, and V are each encoded as 1. Vowels can affect the coding, but are not coded themselves except as the first letter.

The correct value can be found as follows:
# Replace consonants with digits as follows (but do not change the first letter):
#* b, f, p, v => 1
#* c, g, j, k, q, s, x, z => 2
#* d, t => 3
#* l => 4
#* m, n => 5
#* r => 6
# Collapse "adjacent" identical digits into a single digit of that value.
# Remove all non-digits after the first letter.
# Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150".
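
The four steps above can be sketched in Python as follows. This is a minimal illustration that follows the listed steps literally; it is not the NARA rule set, and the function name `soundex` and the lookup table `CODES` are names chosen here for the example. Handling of empty input (returning an empty string) and the skipping of non-letter characters are also assumptions, since the rules above do not cover them.

```python
# Digit assigned to each consonant, taken from the table in step 1.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Encode a name by applying the four listed steps in order."""
    # Assumption: characters that are not letters are ignored.
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""  # Assumption: empty input yields an empty code.
    first, rest = letters[0].upper(), letters[1:]

    # Step 1: replace consonants with digits; the first letter is untouched
    # and uncoded letters stay in place for now.
    replaced = [CODES.get(ch, ch) for ch in rest]

    # Step 2: collapse runs of adjacent identical digits into one digit.
    collapsed = []
    for item in replaced:
        if item.isdigit() and collapsed and collapsed[-1] == item:
            continue
        collapsed.append(item)

    # Step 3: remove everything that is not a digit.
    digits = [item for item in collapsed if item.isdigit()]

    # Step 4: first letter plus three digits, padded with zeroes if needed.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"), soundex("Rubin"))  # R163 R163 R150
```

Because uncoded letters are only removed in step 3, a vowel between two consonants with the same digit keeps them from collapsing, which is how vowels affect the coding without being coded themselves. Fuller rule sets add cases this sketch does not handle, such as comparing the first letter's own code with the digit that follows it.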

Soundex variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

The Celko Improved Soundex algorithm was introduced by Joe Celko in his book "SQL For Smarties: Advanced SQL Programming".

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

Daitch-Mokotoff Soundex (D-M Soundex) was developed by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D-M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex" [Gary Mokotoff, "Soundexing and Genealogy", http://www.avotaynu.com/soundex.html, 2007-09-08, accessed 2008-01-27], although the authors discourage the use of these nicknames. The D-M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D-M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.

See also

* Metaphone
* New York State Identification and Intelligence System

External links

* [http://www.archives.gov/publications/general-info-leaflets/55.html The Soundex Indexing System] (U.S. National Archives and Records Administration)
* [http://search.cpan.org/perldoc?Text::Soundex Text::Soundex] Perl module from CPAN
* [http://php.net/soundex/ PHP soundex function]
* [http://sourceforge.net/projects/simmetrics/ SimMetrics an open source (sourceforge) library of similarity metrics including a number of soundex variants]
* [http://snippets.dzone.com/posts/show/844 Soundex in JavaScript]
* [http://snippets.dzone.com/posts/show/4530 Soundex in Ruby]
* [http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213 Soundex in Python]


Wikimedia Foundation. 2010.
