Moby Project

The Moby Project is a collection of public-domain lexical resources created by Grady Ward. The resources are now mirrored at Project Gutenberg. As of 2007, the project contained the largest free phonetic database, with 177,267 words and their corresponding pronunciations.

Contents

Hyphenator

The Moby Hyphenator II contains 187,175 hyphenated words, of which 9,752 are marked as entries that should not be hyphenated. Each hyphenation point is marked by the character with value 165 (hex A5).
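As a minimal sketch of how such an entry could be read, the following Python splits a word at each character 165. It assumes the file has been decoded with a single-byte encoding such as Latin-1, so that byte A5 becomes the character U+00A5. The function names and the sample entry are hypothetical and are not taken from the distribution.

```python
HYPHEN_MARK = chr(165)  # character value 165 (hex A5), the hyphenation marker


def split_hyphenation(entry: str) -> list[str]:
    """Split one entry into its pieces at each hyphenation mark."""
    return entry.strip().split(HYPHEN_MARK)


def plain_word(entry: str) -> str:
    """Return the entry with all hyphenation marks removed."""
    return "".join(split_hyphenation(entry))


# Hypothetical sample entry, built by hand for illustration only.
sample = "hy" + HYPHEN_MARK + "phen" + HYPHEN_MARK + "ate"
print(split_hyphenation(sample))
print(plain_word(sample))
```

An entry that contains no marker comes back as a single piece, which is one way to recognise words that should not be hyphenated.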

Language

Moby Language II contains wordlists for five languages (French, German, Italian, Japanese, and Spanish):

Language Words Size (in bytes)
French 138,257 1,524,757
German 159,809 2,055,986
Italian 60,453 561,981
Japanese 115,523 934,783
Spanish 86,059 850,523
Total 560,101 5,928,030

However, some of the lists are contaminated. For example, the Japanese list contains English words such as abnormal and non-words such as abcdefgh and m,./.

Part-of-Speech

Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. Each line of the file has the format word\parts-of-speech, and the following parts of speech are identified:

Part-of-speech Code
Noun N
Plural p
Noun phrase h
Verb (usually participle) V
Transitive verb t
Intransitive verb i
Adjective A
Adverb v
Conjunction C
Preposition P
Interjection !
Pronoun r
Definite article D
Indefinite article I
Nominative o
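The word\parts-of-speech format can be decoded with a small lookup table. The sketch below assumes that each code is a single character, as the table above suggests, and that a backslash separates the word from its codes. The function name and the sample line are hypothetical.

```python
POS_CODES = {
    "N": "Noun",
    "p": "Plural",
    "h": "Noun phrase",
    "V": "Verb (usually participle)",
    "t": "Transitive verb",
    "i": "Intransitive verb",
    "A": "Adjective",
    "v": "Adverb",
    "C": "Conjunction",
    "P": "Preposition",
    "!": "Interjection",
    "r": "Pronoun",
    "D": "Definite article",
    "I": "Indefinite article",
    "o": "Nominative",
}


def parse_pos_line(line: str) -> tuple[str, list[str]]:
    """Split 'word\\codes' and expand each one-character code, in order."""
    word, _, codes = line.rstrip("\r\n").partition("\\")
    # Codes are kept in file order, which is the priority order.
    return word, [POS_CODES.get(c, f"Unknown ({c})") for c in codes]


# Hypothetical sample line, written by hand for illustration only.
print(parse_pos_line("run\\Vti"))
```

Unrecognised codes are reported rather than dropped, so an unexpected character in the data stays visible.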

Pronunciator

The Moby Pronunciator II contains 177,267 words with corresponding pronunciations. The Project Gutenberg distribution also contains a copy of cmudict v0.3. Each line follows the format word[/part-of-speech] pronunciation. The part-of-speech field disambiguates 770 words whose pronunciation depends on their part of speech. For example, for the word spelled close, the verb is pronounced /ˈkloʊz/, whereas the adjective is pronounced /ˈkloʊs/. The parts of speech are assigned the following codes:

Part-of-speech Code
Noun n
Verb v
Adjective aj
Adverb av
Interjection interj

Following this is the pronunciation. Several special symbols are present:

Symbol Meaning
/ Used to separate phonemes
_ Used to separate words
' Primary stress on the following syllable
, Secondary stress on the following syllable
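The line format and separators described above can be taken apart as in the following sketch. It assumes that a single space separates the word field from the pronunciation and that stress marks stay attached to the phoneme that follows them. The function name and the sample line are hypothetical.

```python
from typing import Optional

POS_TAGS = {
    "n": "Noun",
    "v": "Verb",
    "aj": "Adjective",
    "av": "Adverb",
    "interj": "Interjection",
}


def parse_entry(line: str) -> tuple[str, Optional[str], list[list[str]]]:
    """Parse 'word[/part-of-speech] pronunciation' into its components."""
    head, _, pron = line.strip().partition(" ")
    word, _, tag = head.partition("/")
    # "_" separates words; "/" separates phonemes within a word.
    words = [[p for p in w.split("/") if p] for w in pron.split("_")]
    return word, POS_TAGS.get(tag), words


# Hypothetical sample line, written by hand for illustration only.
print(parse_entry("close/v 'k/l/oU/z"))
```

Entries without a part-of-speech tag yield None in the second position, which keeps the 770 disambiguated words easy to pick out.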

The rest of the symbols are used to represent IPA characters, according to the following table:

Symbol IPA
& æ
- ə
@ ʌ, ə
@r ɜr, ər
A ɑː
aI aɪ
Ar ɑr
AU aʊ
b b
d d
D ð
dZ dʒ
E ɛ
eI eɪ
f f
g ɡ
h h
hw hw
i iː
I ɪ
j j
k k
l l
m m
n n
N ŋ
O ɔː
Oi ɔɪ
oU oʊ
p p
r r
s s
S ʃ
t t
T θ
tS tʃ
u uː
U ʊ
v v
w w
z z
Z ʒ
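Converting a pronunciation string to IPA is then a table lookup, as in the sketch below. Several choices here are assumptions: where the table lists two IPA values (@ and @r) the first is used, the digraphs and long vowels (aI, AU, dZ, eI, i, oU, tS, u) use their conventional IPA equivalents and should be checked against the distribution's own documentation, and two-character symbols are matched before single characters. The function names are hypothetical.

```python
MOBY_TO_IPA = {
    "&": "æ", "-": "ə", "@": "ʌ", "@r": "ɜr", "A": "ɑː", "aI": "aɪ",
    "Ar": "ɑr", "AU": "aʊ", "b": "b", "d": "d", "D": "ð", "dZ": "dʒ",
    "E": "ɛ", "eI": "eɪ", "f": "f", "g": "ɡ", "h": "h", "hw": "hw",
    "i": "iː", "I": "ɪ", "j": "j", "k": "k", "l": "l", "m": "m",
    "n": "n", "N": "ŋ", "O": "ɔː", "Oi": "ɔɪ", "oU": "oʊ", "p": "p",
    "r": "r", "s": "s", "S": "ʃ", "t": "t", "T": "θ", "tS": "tʃ",
    "u": "uː", "U": "ʊ", "v": "v", "w": "w", "z": "z", "Z": "ʒ",
}
STRESS = {"'": "ˈ", ",": "ˌ"}  # primary and secondary stress marks


def convert_token(token: str) -> str:
    """Convert one '/'-delimited token, trying two-character symbols first."""
    out = []
    i = 0
    while i < len(token):
        pair = token[i:i + 2]
        if token[i] in STRESS:
            out.append(STRESS[token[i]])
            i += 1
        elif len(pair) == 2 and pair in MOBY_TO_IPA:
            out.append(MOBY_TO_IPA[pair])
            i += 2
        else:
            # Unknown characters are passed through unchanged.
            out.append(MOBY_TO_IPA.get(token[i], token[i]))
            i += 1
    return "".join(out)


def to_ipa(pron: str) -> str:
    """Convert a whole pronunciation; '_' separates words, '/' phonemes."""
    return " ".join(
        "".join(convert_token(t) for t in word.split("/"))
        for word in pron.split("_")
    )


# Hand-built inputs for the two pronunciations of "close" given earlier.
print(to_ipa("'k/l/oU/z"))
print(to_ipa("'k/l/oU/s"))
```

With these hand-built inputs the output matches the /ˈkloʊz/ and /ˈkloʊs/ forms quoted for the verb and adjective close.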

Shakespeare

Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg.

Thesaurus

The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms, an average of 83.3 per root word. Each line is a list of comma-separated values: the first term is the root word, and all following terms are related to it.
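A minimal sketch of reading one such line, and of checking the stated average, might look as follows. The function name and the sample line are hypothetical, and the code assumes that no term itself contains a comma.

```python
def parse_thesaurus_line(line: str) -> tuple[str, list[str]]:
    """Return (root word, related terms) from one comma-separated line."""
    terms = [t.strip() for t in line.strip().split(",")]
    root = terms[0]
    related = [t for t in terms[1:] if t]  # skip empty fields
    return root, related


# Hypothetical sample line, written by hand for illustration only.
root, related = parse_thesaurus_line("happy,cheerful,glad,content\n")
print(root, related)

# The average quoted above: related terms divided by root words.
average = round(2_520_264 / 30_260, 1)
print(average)
```

Collecting the parsed lines into a dictionary keyed by root word gives a simple lookup structure for the whole file.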

Grady Ward placed this thesaurus in the public domain in 1996. It is also available as a Debian package.

Words

Moby Words II is the largest wordlist in the world.[1] The distribution consists of the following 16 files:

Filename Words Description
ACRONYMS.TXT 6,213 Common acronyms and abbreviations
COMMON.TXT 74,550 Common words present in two or more published dictionaries
COMPOUND.TXT 256,772 Phrases, proper nouns, and acronyms not included in the common words file
CROSSWD.TXT 113,809 Words included in the first edition of the Official Scrabble Players Dictionary
CRSWD-D.TXT 4,160 Additions to the Official Scrabble Players Dictionary in the second edition
FICTION.TXT 467 A list of the most commonly occurring substrings in the book The Joy Luck Club
FREQ.TXT 1,000 Most frequently occurring words in the English language, listed in descending order
FREQ-INT.TXT 1,000 Most frequently occurring words on Usenet in 1992, listed with corresponding percentage in decreasing order
KJVFREQ.TXT 1,185 Most frequently occurring substrings in the King James Version of the Bible, listed in descending order
NAMES.TXT 21,986 Most common names used in the USA and Great Britain
NAMES-F.TXT 4,946 Common English female names
NAMES-M.TXT 3,897 Common English male names
OFTENMIS.TXT 366 Most common misspelled English words
PLACES.TXT 10,196 Place names in the USA
SINGLE.TXT 354,984 Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings
USACONST.TXT 7,618 United States Constitution including all amendments current to 1993
Total 863,149


Wikimedia Foundation. 2010.
