Corpus of Contemporary American English

The freely searchable, 425-million-word Corpus of Contemporary American English (COCA) is the largest corpus of American English currently available, and the only publicly available corpus of American English to contain a wide array of texts from a number of genres.

It was created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University.[1]

Content

The corpus is composed of 425 million words from more than 160,000 texts, including 20 million words for each year from 1990 to 2011. The most recent update was made in March 2011. The corpus is used by approximately 40,000 people each month, which may make it the most widely used "structured" corpus currently available.

For each year, the corpus is evenly divided among five genres: spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

  • Spoken (85 million words): Transcripts of unscripted conversation from nearly 150 different TV and radio programs.
  • Fiction (81 million words): Short stories and plays, first chapters of books from 1990 to the present, and movie scripts.
  • Popular magazines (86 million words): Nearly 100 different magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.
  • Newspapers (81 million words): Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.
  • Academic journals (81 million words): Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system.
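
The per-genre figures in the list above can be checked against the "evenly divided" description with a little arithmetic. The following is a minimal sketch, not part of the corpus or its interface; the names GENRE_MILLIONS and genre_shares are illustrative assumptions, and the numbers are simply the ones listed above.

```python
# Genre sizes in millions of words, copied from the list above.
# The dictionary and function names are hypothetical, for illustration only.
GENRE_MILLIONS = {
    "spoken": 85,
    "fiction": 81,
    "popular magazines": 86,
    "newspapers": 81,
    "academic journals": 81,
}


def genre_shares(sizes):
    """Return each genre's share of the summed total, as a percentage."""
    total = sum(sizes.values())
    return {genre: 100 * words / total for genre, words in sizes.items()}


total_millions = sum(GENRE_MILLIONS.values())
shares = genre_shares(GENRE_MILLIONS)

for genre, share in shares.items():
    print(f"{genre}: {share:.1f}%")
print(f"sum of listed genres: {total_millions} million words")
```

Each genre comes out at roughly one fifth of the summed figures. The listed figures add up to 414 million words, somewhat below the 425 million quoted for the corpus as a whole.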

Queries

  • The interface is the same as the BYU-BNC interface for the 100 million word British National Corpus, the 100 million word TIME Magazine corpus, and the 400 million word Corpus of Historical American English (COHA), 1810s–2000s
  • Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)
  • The corpus is tagged by CLAWS, the same tagger that was used for the BNC and the TIME corpus
  • Chart listings (totals for all matching forms in each genre or year, 1990–present, as well as for sub-genres) and table listings (frequency for each matching form in each genre or year)
  • Full collocates searching (up to ten words left and right of node word)
  • Re-sortable concordances, showing the most common words/strings to the left and right of the searched word
  • Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common in 2005–2010 than previously)
  • One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small' and 'little', or 'Democrats' and 'Republicans', or 'men' and 'women', or 'rob' vs 'steal')
  • Users can include semantic information from a 60,000-entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))
  • Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
  • Note that the corpus is only available through the web interface, due to copyright restrictions.
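
The collocate searches and one-step comparisons listed above can be illustrated on a toy example. The sketch below is an assumption-laden illustration of the general technique (counting words within a fixed window around a node word), not the corpus's actual implementation, which is only reachable through the web interface. The function names collocates and compare_collocates, the MAX_SPAN constant, and the sample sentence are all hypothetical; only the ten-word limit comes from the list above.

```python
from collections import Counter

# Upper limit taken from the feature list (ten words left and right).
MAX_SPAN = 10


def collocates(tokens, node, span=4):
    """Count words that occur within `span` tokens to the left or right of `node`."""
    if not 1 <= span <= MAX_SPAN:
        raise ValueError(f"span must be between 1 and {MAX_SPAN}")
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        counts.update(tokens[max(0, i - span):i])   # left context
        counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts


def compare_collocates(tokens, word_a, word_b, span=4):
    """Split collocates into those seen only with word_a, only with word_b, or with both."""
    a = collocates(tokens, word_a, span)
    b = collocates(tokens, word_b, span)
    return {
        "only_a": set(a) - set(b),
        "only_b": set(b) - set(a),
        "shared": set(a) & set(b),
    }


# Toy text standing in for corpus data (not taken from COCA).
text = ("the small dog chased the little cat "
        "while the small boy watched the little bird")
tokens = text.split()

print(collocates(tokens, "small", span=1))
print(compare_collocates(tokens, "small", "little", span=1))
```

A real query would run over tagged corpus text and rank collocates by frequency or an association measure; this sketch only shows the windowing and set comparison.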

Bibliography

  1. Davies, Mark (2010). "The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English". Literary and Linguistic Computing 25 (4): 447–465. doi:10.1093/llc/fqq018.
  2. Bennett, Gena R. (2010). Using Corpora in the Language Learning Classroom: Corpus Linguistics for Teachers. Ann Arbor, Michigan: University of Michigan. pp. 144. ISBN 9780472033850.
  3. Davies, Mark (2010). "More than a peephole: Using large and diverse online corpora". International Journal of Corpus Linguistics 15 (3): 405–411. doi:10.1075/ijcl.15.3.13dav.
  4. Anderson, Wendy; Corbett, John (2009). Exploring English with Online Corpora. Palgrave Macmillan. pp. 205. ISBN 9780230551404.
  5. Davies, Mark (2009). "The 385+ Million Word Corpus of Contemporary American English (1990–present)". International Journal of Corpus Linguistics (John Benjamins Publishing Company) 14 (2): 159–190. doi:10.1075/ijcl.14.2.02dav.
  6. Lindquist, Hans (2009). Corpus Linguistics and the Description of English. Edinburgh University Press. ISBN 9780748626151.
  7. Davies, Mark (2005). "The advantage of using relational databases for large corpora: Speed, advanced queries, and unlimited annotation". International Journal of Corpus Linguistics (John Benjamins Publishing Company) 10 (3): 307–334. doi:10.1075/ijcl.10.3.02dav.

Wikimedia Foundation. 2010.
