Metaphone
Metaphone is a phonetic algorithm, published in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation, producing a more accurate encoding that better matches words and names that sound similar. As with Soundex, similar-sounding words should share the same keys.
Metaphone was developed by Lawrence Philips in response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation. Metaphone is available as a built-in function in a number of systems, including later versions of PHP.
The original author later produced a new version of the algorithm, which he named Double Metaphone. Unlike the original algorithm, whose application is limited to English, this version takes into account spelling peculiarities of a number of other languages. In 2009 Lawrence Philips released a third version, called Metaphone 3. It achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States. It was developed according to modern engineering standards against a test harness of prepared correct encodings.
Procedure
Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY.[1] The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.[2]
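The shape of a code described above can be checked mechanically. The following is a minimal sketch, not part of any published implementation; the function name `is_well_formed_code` is hypothetical, and the check only tests the symbol set, not whether a code is the correct encoding of any word.

```python
import re

# The 16 consonant symbols listed above; '0' is the ASCII stand-in for "th".
CONSONANT_SYMBOLS = "0BFHJKLMNPRSTWXY"
VOWELS = "AEIOU"

# A vowel is allowed only in the first position of a code.
_CODE_PATTERN = re.compile(
    f"[{VOWELS}{CONSONANT_SYMBOLS}][{CONSONANT_SYMBOLS}]*"
)


def is_well_formed_code(code: str) -> bool:
    """Return True if `code` uses only the symbols described above.

    Hypothetical helper: it validates the alphabet, nothing more.
    """
    return _CODE_PATTERN.fullmatch(code) is not None


print(is_well_formed_code("0MS"))  # True: consonant symbols only
print(is_well_formed_code("ESP"))  # True: vowel in first position
print(is_well_formed_code("TOM"))  # False: vowel after the first position
```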
- Drop duplicate adjacent letters, except for C.
- If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.
- Drop 'B' if it follows 'M' at the end of the word.
- 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
- 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
- Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' at the end of the word.
- 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
- Drop 'H' if after a vowel and not before a vowel.
- 'CK' transforms to 'K'.
- 'PH' transforms to 'F'.
- 'Q' transforms to 'K'.
- 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
- 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
- 'V' transforms to 'F'.
- 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
- 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
- Drop 'Y' if not followed by a vowel.
- 'Z' transforms to 'S'.
- Drop all vowels except a vowel at the beginning of the word.
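A subset of the rules above can be sketched in a few lines. The sketch below is deliberately partial and is not the reference implementation: it covers only the three whole-word rules at the top of the list, the unconditional substitutions (CK, PH, Q, V, Z), and the final vowel rule. The context-dependent rules for C, D, G, H, S, T, W, X, and Y are omitted and those letters pass through unchanged, so the result is a complete code only for words in which none of the omitted rules would apply. All function names are hypothetical.

```python
VOWELS = set("AEIOU")
SILENT_INITIAL_PAIRS = ("KN", "GN", "PN", "AE", "WR")


def preprocess(word: str) -> str:
    """Apply the three whole-word rules at the top of the list."""
    w = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    # Drop duplicate adjacent letters, except for C.
    deduped = []
    for ch in w:
        if deduped and ch == deduped[-1] and ch != "C":
            continue
        deduped.append(ch)
    w = "".join(deduped)
    # Drop the first letter of an initial KN, GN, PN, AE, or WR.
    if w.startswith(SILENT_INITIAL_PAIRS):
        w = w[1:]
    # Drop B after M at the end of the word.
    if w.endswith("MB"):
        w = w[:-1]
    return w


def substitute_simple(w: str) -> str:
    """Apply only the unconditional rules: CK, PH, Q, V, Z."""
    w = w.replace("CK", "K").replace("PH", "F")
    return w.translate(str.maketrans("QVZ", "KFS"))


def drop_non_initial_vowels(w: str) -> str:
    """Keep a vowel only if it is the first letter."""
    return w[:1] + "".join(ch for ch in w[1:] if ch not in VOWELS)


def partial_metaphone(word: str) -> str:
    """Partial encoder: context-dependent rules are NOT implemented."""
    return drop_non_initial_vowels(substitute_simple(preprocess(word)))


print(partial_metaphone("Knock"))  # NK
print(partial_metaphone("Lamb"))   # LM
print(partial_metaphone("Aesop"))  # ESP
print(partial_metaphone("Phone"))  # FN
```

A complete encoder would replace `substitute_simple` with a left-to-right scan that looks at the neighbouring letters, since most of the remaining rules depend on what precedes or follows the current letter.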
Double Metaphone
The Double Metaphone search algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal.
It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT; both have XMT in common.
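One way to use the two codes for matching is to treat two names as similar when their code pairs share at least one key. The sketch below illustrates only that comparison; it does not compute Double Metaphone codes, the code pairs are the ones quoted above, and the function names `shared_keys` and `sounds_alike` are hypothetical.

```python
from typing import Set, Tuple

CodePair = Tuple[str, str]

# (primary, secondary) codes quoted in the text above; not computed here.
SMITH: CodePair = ("SM0", "XMT")
SCHMIDT: CodePair = ("XMT", "SMT")


def shared_keys(a: CodePair, b: CodePair) -> Set[str]:
    """Return the codes that appear in both pairs."""
    return set(a) & set(b)


def sounds_alike(a: CodePair, b: CodePair) -> bool:
    """Assumed matching policy: any shared key counts as a match."""
    return bool(shared_keys(a, b))


print(shared_keys(SMITH, SCHMIDT))   # {'XMT'}
print(sounds_alike(SMITH, SCHMIDT))  # True
```

A stricter policy, such as requiring the primary codes to be equal, would trade recall for precision; which policy is appropriate depends on the application.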
Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.
Metaphone 3
Developed by the same author, this algorithm aims at further improving the accuracy of phonetic encoding of words in the English language. The ability to encode Metaphone keys taking non-initial vowels into account, as well as encoding voiced and unvoiced consonants differently, has been added. This allows the result set to be more closely focused if desired. Development for other language versions has been announced. Metaphone 3 is sold as source code in C++, Java and C# for 40 USD each.
External links
- Open Source Spell Checker
- Page for PHP implementation of Metaphone
- Project Dedupe
- "The Double Metaphone Search Algorithm", C/C++ Users Journal, June 2000 (full-text access requires registration)
- The Double Metaphone Search Algorithm, by Lawrence Philips, June 01, 2000, Dr. Dobb's, original article
- Code project article on double metaphone: http://www.codeproject.com/string/dmetaphone1.asp
Metaphone Implementations
- Metaphone implementation in T-SQL
- Soundex, Metaphone, and Double Metaphone implementation in Java
- Soundex, Metaphone, Caverphone implementation in Python
- Text::Metaphone Perl module from CPAN
- Text::DoubleMetaphone Perl module from CPAN
- OCaml implementation of Double Metaphone
- PHP implementation by Stephen Woodbridge
- Ruby implementation included in http://english.rubyforge.org
- Ruby implementation included in http://rubyforge.org/projects/text/
- 4GL implementation by Robert Minter
- CodeProject's article about double metaphone implementations
- FileMaker Pro custom function, requiring FileMaker Pro Advanced to implement
- Spanish Metaphone in PHP (First post), from a comment in the PHP Metaphone Manual Page
- Metaphone for Brazilian Portuguese, in C with PHP and PostgreSQL ports
- natural, a JavaScript (Node.js) natural language toolkit
Double Metaphone Implementations
- C++ see: http://web.archive.org/web/20080101012741/http://www.cuj.com/documents/s=8038/cuj0006philips/
- C# see: http://www.codeproject.com/KB/recipes/dmetaphone5.aspx
- Perl see: http://search.cpan.org/dist/Text-DoubleMetaphone/
- PHP see: http://swoodbridge.com/DoubleMetaPhone/ and native, in C: http://pecl.php.net/package/doublemetaphone
- Java see: http://commons.apache.org/codec/userguide.html
- Ruby see: http://english.rubyforge.org/ and http://rubyforge.org/projects/text/
- SQL:
  - MySQL see: http://www.atomodo.com/code/double-metaphone
  - PostgreSQL see: http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
  - Transact-SQL see: http://www.sqlmag.com/Articles/ArticleID/26094/pg/1/1.html (full-text access requires subscription)
- Python see: http://www.atomodo.com/code/double-metaphone
- Smalltalk, Squeak, also with SoundEx, see: http://www.squeaksource.com/SoundsLike.html
- Visual Basic see: http://www.snakelegs.org/2008/01/18/double-metaphone-visual-basic-implementation/
References
Wikimedia Foundation. 2010.