Heaps' law

In linguistics, Heaps' law is an empirical law that describes the portion of a vocabulary represented by an instance document (or set of instance documents) consisting of words chosen from that vocabulary. It can be formulated as

: V_R(n) = K n^\beta

where "V_R" is the size of the subset of the vocabulary "V" represented by the instance text of size "n", and "K" and β are free parameters determined empirically.

With English text corpora, typically "K" is between 10 and 100, and β is between 0.4 and 0.6.
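As a minimal sketch of how the formula is used, the Python below evaluates the predicted vocabulary size and recovers "K" and β from (text size, vocabulary size) pairs by a least-squares line fit in log-log space. The function names and the default values K = 44 and β = 0.5 are illustrative assumptions chosen from within the ranges above, not figures from any particular corpus.

```python
import math


def heaps_vocabulary(n, k=44.0, beta=0.5):
    """Predicted number of distinct words in a text of n tokens: K * n**beta.

    The defaults k=44 and beta=0.5 are illustrative assumptions only.
    """
    return k * n ** beta


def fit_heaps(sizes, vocab_sizes):
    """Estimate (K, beta) by ordinary least squares on
    log V = log K + beta * log n."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                        # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)    # intercept, mapped back from logs
    return k, beta
```

In practice the pairs would come from counting distinct words in growing prefixes of a real corpus; the fit is only as good as the power-law assumption for that corpus.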



"A typical Heaps-law plot. The x-axis represents the text size, and the y-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes."

Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
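The diminishing returns can be read off the formula directly. Treating "n" as a continuous variable (an approximation, since text size is an integer), the rate at which new vocabulary appears is the derivative:

```latex
\frac{dV_R}{dn} = K \beta \, n^{\beta - 1},
\qquad 0 < \beta < 1 \;\Rightarrow\; \frac{dV_R}{dn} \to 0 \ \text{as} \ n \to \infty
```

For example, assuming the illustrative values K = 44 and β = 0.5, the rate is about 0.22 new words per token at n = 10^4 and about 0.022 at n = 10^6.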

Heaps' law also applies in the general case where the "vocabulary" is just some set of distinct types which are attributes of some collection of objects. For example, the objects could be people, and the types could be each person's country of origin. If persons are selected randomly (that is, without regard to country of origin), then Heaps' law says we will quickly have representatives from most countries (in proportion to their population), but it will become increasingly difficult to cover the entire set of countries by continuing this method of sampling.
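The sampling argument can be illustrated with a small simulation. The sketch below draws "persons" at random from a hypothetical set of 200 "countries" whose populations are assumed to be proportional to 1/rank, and records how many distinct countries have been seen after each draw. The population model, the function name and the sample size are assumptions made for illustration. Because the set of types here is finite, the curve eventually saturates, so it follows a power law only approximately.

```python
import random


def distinct_types_seen(weights, n_samples, seed=0):
    """Draw n_samples types at random, with probability proportional to
    weights, and return the running count of distinct types seen."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    seen = set()
    curve = []
    for t in draws:
        seen.add(t)
        curve.append(len(seen))
    return curve


# Hypothetical populations: 200 "countries" with sizes proportional to 1/rank.
weights = [1.0 / rank for rank in range(1, 201)]
curve = distinct_types_seen(weights, 5000)

# Growth in the first half of the sample versus the second half.
first_half_gain = curve[2499] - curve[0]
second_half_gain = curve[4999] - curve[2499]
```

Comparing `first_half_gain` with `second_half_gain` shows the effect described above: most types are found early, and the remaining rare ones take disproportionately many further draws.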

See also

* Zipf's law

References

* H. S. Heaps. "Information Retrieval - Computational and Theoretical Aspects". Academic Press, 1978.
* Baeza-Yates and Ribeiro-Neto, "Modern Information Retrieval", ACM Press, 1999.

----


Wikimedia Foundation. 2010.
