Brown Corpus

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled by Henry Kucera and W. Nelson Francis at Brown University, Providence, RI as a general corpus (text collection) in the field of corpus linguistics.

In 1967, Kucera and Francis published their classic work "Computational Analysis of Present-Day American English", which provided basic statistics on what is known today simply as the Brown Corpus. The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. The corpus has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, and "to" and "of" more than another 3% each, while about half of the total vocabulary of about 50,000 words are hapax legomena: words that occur only once in the corpus. This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his "The Psychobiology of Language"), and is known as Zipf's law.
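The rank-versus-frequency relationship can be checked on any tokenized text. The following is a minimal sketch, not the procedure Kucera and Francis used: the function names `rank_frequency` and `hapax_legomena` and the toy sentence are illustrative assumptions, and tokens are simply lower-cased before counting.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) rows, most frequent word first."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically so output is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, n, n / total)
            for rank, (word, n) in enumerate(ordered, start=1)]


def hapax_legomena(tokens):
    """Return the words that occur exactly once, in alphabetical order."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(w for w, n in counts.items() if n == 1)


# Toy input (an assumption for illustration, not Brown Corpus text).
toy = "the cat saw the dog and the dog saw a bird".split()
rows = rank_frequency(toy)
for rank, word, n, share in rows[:3]:
    # Under Zipf's law, rank * count stays roughly constant for large samples.
    print(rank, word, n, round(share, 3), "rank*count =", rank * n)
print(hapax_legomena(toy))
```

On a sample this small the product of rank and count will not be constant; the pattern only emerges on corpora of substantial size.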

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton-Mifflin approached Kucera to supply a million-word, three-line citation base for its new "American Heritage Dictionary". This ground-breaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years, part-of-speech tags were applied. The Greene and Rubin tagging program (see under part-of-speech tagging) helped considerably in this, but its high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo/Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis, much of it carried out by graduate student Andrew Mackie. Some of the analysis appears in "Frequency Analysis of English Usage: Lexicon and Grammar", by Winthrop Nelson Francis and Henry Kucera, Houghton Mifflin (January, 1983) ISBN 0-395-32250-2.

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined, they were first published in that year and were written by native speakers of American English.

Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.
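The sampling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the compilers' actual procedure: the source text is assumed to be already split into sentences (each a list of words), the name `draw_sample` is hypothetical, and the sketch simply stops early if the text runs out before the target is reached.

```python
import random


def draw_sample(sentences, target_words=2000, rng=None):
    """Start at a random sentence boundary and stop at the first
    sentence boundary at or after `target_words` words."""
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))  # random sentence boundary
    sample, n_words = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        n_words += len(sentence)
        if n_words >= target_words:
            break  # first boundary after the target word count
    return start, sample


# Toy text: 50 sentences of 4 words each, with a small target for illustration.
text = [["w"] * 4 for _ in range(50)]
start, sample = draw_sample(text, target_words=10, rng=random.Random(0))
total = sum(len(s) for s in sample)
print(start, len(sample), total)
```

Because a sample always ends on a sentence boundary, its length is normally slightly above the target rather than exactly equal to it.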

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.
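As an illustration of the asterisk convention, the sketch below converts upper-case-only text back to mixed case. It assumes that an asterisk capitalizes only the single letter that follows it and that every other letter is lower case; the special codes for formulae and other items are not handled, and the function names are hypothetical.

```python
import re


def decode_capitals(keypunched):
    """Lower-case everything, then capitalize each letter that follows '*'."""
    lowered = keypunched.lower()
    return re.sub(r"\*([a-z])", lambda m: m.group(1).upper(), lowered)


def encode_capitals(text):
    """Inverse sketch: mark each capital with '*', then upper-case the text."""
    return re.sub(r"[A-Z]", lambda m: "*" + m.group(0), text).upper()


print(decode_capitals("*THE *BROWN CORPUS"))
print(encode_capitals("The Brown corpus"))
```

A literal asterisk in the source text would be ambiguous under this simplified scheme, which is one reason a real encoding needs additional special codes.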

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:
* A. PRESS: Reportage ("44 texts")
** Political
** Sports
** Society
** Spot News
** Financial
** Cultural
* B. PRESS: Editorial ("27 texts")
** Institutional Daily
** Personal
** Letters to the Editor
* C. PRESS: Reviews ("17 texts")
** "theatre"
** "books"
** "music"
** "dance"
* D. RELIGION ("17 texts")
** Books
** Periodicals
** Tracts
* E. SKILLS AND HOBBIES ("36 texts")
** Books
** Periodicals
* F. POPULAR LORE ("48 texts")
** Books
** Periodicals
* G. BELLES-LETTRES - Biography, Memoirs, etc. ("75 texts")
** Books
** Periodicals
* H. MISCELLANEOUS: US Government & House Organs ("30 texts")
** Government Documents
** Foundation Reports
** Industry Reports
** College Catalog
** Industry House organ
* J. LEARNED ("80 texts")
** Natural Sciences
** Medicine
** Mathematics
** Social and Behavioral Sciences
** Political Science, Law, Education
** Humanities
** Technology and Engineering
* K. FICTION: General ("29 texts")
** Novels
** Short Stories
* L. FICTION: Mystery and Detective Fiction ("24 texts")
** Novels
** Short Stories
* M. FICTION: Science ("6 texts")
** Novels
** Short Stories
* N. FICTION: Adventure and Western ("29 texts")
** Novels
** Short Stories
* P. FICTION: Romance and Love Story ("29 texts")
** Novels
** Short Stories
* R. HUMOR ("9 texts")
** Novels
** Essays, etc.
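The per-category counts listed above can be checked against the stated total of 500 samples. The sketch below only restates the numbers from the list; the dictionary name `GENRES` and the shortened category labels are illustrative choices.

```python
# Category code -> (short label, number of texts), copied from the list above.
GENRES = {
    "A": ("Press: Reportage", 44),
    "B": ("Press: Editorial", 27),
    "C": ("Press: Reviews", 17),
    "D": ("Religion", 17),
    "E": ("Skills and Hobbies", 36),
    "F": ("Popular Lore", 48),
    "G": ("Belles-Lettres", 75),
    "H": ("Miscellaneous", 30),
    "J": ("Learned", 80),
    "K": ("Fiction: General", 29),
    "L": ("Fiction: Mystery and Detective", 24),
    "M": ("Fiction: Science", 6),
    "N": ("Fiction: Adventure and Western", 29),
    "P": ("Fiction: Romance and Love Story", 29),
    "R": ("Humor", 9),
}

total = sum(n for _, n in GENRES.values())
shares = {code: n / total for code, (_, n) in GENRES.items()}
largest = max(GENRES, key=lambda code: GENRES[code][1])

print(total)                                   # number of samples overall
print(largest, GENRES[largest][0], shares[largest])
```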

Part-of-speech tags used

*. sentence closer (. ; ? *)
*( left paren
*) right paren
* * not, n't
*-- dash
*, comma
*: colon
*ABL pre-qualifier (quite, rather)
*ABN pre-quantifier (half, all)
*ABX pre-quantifier (both)
*AP post-determiner (many, several, next)
*AT article (a, the, no)
*BE be
*BED were
*BEDZ was
*BEG being
*BEM am
*BEN been
*BER are, art
*BEZ is
*CC coordinating conjunction (and, or)
*CD cardinal numeral (one, two, 2, etc.)
*CS subordinating conjunction (if, although)
*DO do
*DOD did
*DOZ does
*DT singular determiner/quantifier (this, that)
*DTI singular or plural determiner/quantifier (some, any)
*DTS plural determiner (these, those)
*DTX determiner/double conjunction (either)
*EX existential there
*FW foreign word (hyphenated before regular tag)
*HV have
*HVD had (past tense)
*HVG having
*HVN had (past participle)
*IN preposition
*JJ adjective
*JJR comparative adjective
*JJS semantically superlative adjective (chief, top)
*JJT morphologically superlative adjective (biggest)
*MD modal auxiliary (can, should, will)
*NC cited word (hyphenated after regular tag)
*NN singular or mass noun
*NN$ possessive singular noun
*NNS plural noun
*NNS$ possessive plural noun
*NP proper noun or part of name phrase
*NP$ possessive proper noun
*NPS plural proper noun
*NPS$ possessive plural proper noun
*NR adverbial noun (home, today, west)
*OD ordinal numeral (first, 2nd)
*PN nominal pronoun (everybody, nothing)
*PN$ possessive nominal pronoun
*PP$ possessive personal pronoun (my, our)
*PP$$ second (nominal) possessive pronoun (mine, ours)
*PPL singular reflexive/intensive personal pronoun (myself)
*PPLS plural reflexive/intensive personal pronoun (ourselves)
*PPO objective personal pronoun (me, him, it, them)
*PPS 3rd. singular nominative pronoun (he, she, it, one)
*PPSS other nominative personal pronoun (I, we, they, you)
*QL qualifier (very, fairly)
*QLP post-qualifier (enough, indeed)
*RB adverb
*RBR comparative adverb
*RBT superlative adverb
*RN nominal adverb (here, then, indoors)
*RP adverb/particle (about, off, up)
*TO infinitive marker to
*UH interjection, exclamation
*VB verb, base form
*VBD verb, past tense
*VBG verb, present participle/gerund
*VBN verb, past participle
*VBZ verb, 3rd. singular present
*WDT wh- determiner (what, which)
*WP$ possessive wh- pronoun (whose)
*WPO objective wh- pronoun (whom, which, that)
*WPS nominative wh- pronoun (who, which, that)
*WQL wh- qualifier (how)
*WRB wh- adverb (how, where, when)

Note that some versions of the tagged Brown Corpus contain combined tags. For instance, the word "wanna" is tagged VB+TO, since it is a contracted form of the two words want/VB and to/TO. Some tags may also be negated: "aren't" would be tagged "BER*", where * signifies the negation. Additionally, tags may carry hyphenated affixes: -HL is appended to the regular tags of words in headlines, and -TL to the regular tags of words in titles. The suffix -NC signifies an "emphasized" word. A FW- prefix marks a foreign word.
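The tag conventions just described can be unpacked mechanically. The following is a minimal sketch under stated assumptions: it uses the word/TAG notation seen in the examples above, the function names `split_token` and `parse_tag` are hypothetical, and it does not reproduce any particular distribution's file format. The -NC flag is labeled "cited" here, following the tag list.

```python
# Hyphenated suffixes and the flag each one sets.
SUFFIXES = {"-HL": "headline", "-TL": "title", "-NC": "cited"}


def split_token(token):
    """Split 'word/TAG' on the last slash, so words containing '/' survive."""
    word, _, tag = token.rpartition("/")
    return word, tag


def parse_tag(tag):
    """Break a tag into base parts plus foreign/headline/title/cited flags."""
    info = {"foreign": False, "headline": False, "title": False, "cited": False}
    if tag.startswith("FW-"):
        info["foreign"] = True
        tag = tag[3:]
    stripped = True
    while stripped:  # suffixes may stack, so strip until none remains
        stripped = False
        for suffix, flag in SUFFIXES.items():
            if tag.endswith(suffix) and len(tag) > len(suffix):
                info[flag] = True
                tag = tag[:-len(suffix)]
                stripped = True
    parts = []
    for part in tag.split("+"):  # combined tags such as VB+TO
        # A trailing '*' negates a tag; a bare '*' is the tag for not/n't.
        negated = len(part) > 1 and part.endswith("*")
        parts.append((part[:-1] if negated else part, negated))
    info["parts"] = parts
    return info


print(split_token("wanna/VB+TO"))
print(parse_tag("VB+TO")["parts"])
print(parse_tag("BER*")["parts"])
print(parse_tag("FW-AT-TL"))
```

Checking the length before stripping a suffix or a trailing asterisk keeps punctuation tags such as `--` and `*` intact.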

External links

* [http://khnt.aksis.uib.no/icame/manuals/brown/ Brown Corpus Manual]

* [http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html More details on the Brown Corpus tagset]


Wikimedia Foundation. 2010.
