Semantic relatedness

Computational measures of semantic relatedness are [http://cwl-projects.cogsci.rpi.edu/msr/ publicly available] means of approximating how closely related in meaning two words or documents are. They have been used for essay grading by the Educational Testing Service, in search engine technology, and for predicting which links people are likely to click on, among other applications.

* LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (-) non-incremental vocabulary, long pre-processing times
* PMI (Pointwise Mutual Information) (+) large vocab, because it uses any search engine (like Google); (-) cannot measure relatedness between whole sentences or documents
* GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (-) non-incremental vocabulary, long pre-processing times
* ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (-) cannot measure relatedness between multi-word terms, long pre-processing times
* NGD (Normalized Google Distance; see below) (+) large vocab, because it uses any search engine (like Google); (-) cannot measure relatedness between whole sentences or documents
* WordNet (+) humanly constructed; (-) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary
* [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf ESA (Explicit Semantic Analysis)] based on Wikipedia and the ODP
* [http://cwl-projects.cogsci.rpi.edu/papers/VGG08.pdf VGEM] (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms (-) performance depends on choosing specific dimensions
* [http://www.cogsci.rpi.edu/cogworks/publications/270_BLOSSOM_final.pdf BLOSSOM] (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another (-) highly experimental, requires nontrivial SOM calculation
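Several of the measures above (LSA, GLSA, VGEM) are described as vector-based and as adding vectors to handle multi-word terms. The following is a minimal sketch of that idea, not the implementation of any of those systems: the three-dimensional `toy_space`, the helper names, and the choice of cosine similarity as the comparison function are all assumptions made for illustration.

```python
import math

def add_vectors(vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def relatedness(term_a, term_b, space):
    """Relatedness of two (possibly multi-word) terms.

    Each term is split on whitespace, its word vectors are added,
    and the two summed vectors are compared with cosine similarity.
    """
    vec_a = add_vectors([space[word] for word in term_a.split()])
    vec_b = add_vectors([space[word] for word in term_b.split()])
    return cosine(vec_a, vec_b)

# Hypothetical toy vectors; a real space would be learned from a corpus.
toy_space = {
    "hot": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "sausage": [1.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

print(relatedness("hot dog", "sausage", toy_space))
print(relatedness("hot dog", "car", toy_space))
```

In this toy space the summed vector for "hot dog" points in the same direction as "sausage" and is orthogonal to "car", so the first score is close to 1 and the second is 0. Because every word must already have a vector, the sketch also shows why a fixed (non-incremental) vocabulary is a limitation for such measures.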

Semantic similarity measures

SimRank

Google distance

Google distance is a measure of semantic interrelatedness derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the "normalized Google distance" between two search terms "x" and "y" is

:<math>\operatorname{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}</math>

where "M" is the total number of web pages searched by Google; "f"("x") and "f"("y") are the number of hits for search terms "x" and "y", respectively; and "f"("x", "y") is the number of web pages on which both "x" and "y" occur.

If the two search terms "x" and "y" never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
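The definition and its two limiting cases can be checked with a small function. This is a sketch under stated assumptions: the function name `ngd` and its parameter names are hypothetical, the hit counts are passed in directly rather than fetched from a search engine, and both terms are assumed to have at least one hit and fewer hits than the total page count.

```python
import math

def ngd(fx, fy, fxy, m):
    """Normalized Google distance from raw hit counts.

    fx, fy -- number of hits for terms x and y (assumed > 0 and < m)
    fxy    -- number of pages on which both x and y occur
    m      -- total number of pages searched
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must have at least one hit")
    if m <= min(fx, fy):
        raise ValueError("m must exceed the smaller hit count")
    if fxy == 0:
        # log f(x, y) tends to minus infinity, so the distance is infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))

# Illustrative counts only, not real search results.
print(ngd(1000, 1000, 1000, 1_000_000))  # terms always co-occur
print(ngd(1000, 500, 0, 1_000_000))      # terms never co-occur
print(ngd(1000, 100, 50, 1_000_000))     # partial overlap
```

With these made-up counts, the first call gives zero and the second gives infinity, matching the two cases described above; the third reduces to log 20 / log 10000, a value between the two extremes.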

See also

* Semantic differential

References

* Cilibrasi, R. & Vitanyi, P.M.B. (2006). Similarity of objects and the meaning of words. Proc. 3rd Conf. Theory and Applications of Models of Computation (TAMC), J.-Y. Cai, S. B. Cooper, and A. Li (Eds.), Lecture Notes in Computer Science, Vol. 3959, Springer-Verlag, Berlin.
* Dumais, S. (2003). Data-driven approaches to information access. Cognitive Science, 27(3), 491-524.
* Gabrilovich, E. and Markovitch, S. (2007). "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis", Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007. [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf]
* Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078-1083). Austin, Tx: The Cognitive Science Society, Inc.
* Kaur, I. & Hornof, A.J. (2005). A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior. Proceedings of the Conference on Human Factors in Computing, CHI 2005 (pp. 51-60).
* Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
* Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
* Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254-1259). Austin, Tx: The Cognitive Science Society, Inc.
* Lemaire, B., & Denhière, G. (2004). Incremental construction of an associative network from a corpus. In K. D. Forbus & D. Gentner & T. Regier (Eds.), 26th Annual Meeting of the Cognitive Science Society, CogSci2004. Hillsdale, NJ: Lawrence Erlbaum Publisher.
* Lindsey, R., Veksler, V.D., Grintsvayg, A., Gray, W.D. (2007). The Effects of Corpus Selection on Measuring Semantic Relatedness. Proceedings of the 8th International Conference on Cognitive Modeling, Ann Arbor, MI.
* Pirolli, P. (2005). Rational analyses of information foraging on the Web. Cognitive Science, 29(3), 343-373.
* Pirolli, P., & Fu, W.-T. (2003). SNIF-ACT: A model of information foraging on the World Wide Web. Lecture Notes in Computer Science, 2702, 45-54.
* Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491-502). Freiburg, Germany.
* Veksler, V.D. & Gray, W.D. (2006). Test Case Selection for Evaluating Measures of Semantic Distance. Proceedings of the 28th Annual Meeting of the Cognitive Science Society, CogSci2006.

Google distance references

* Rudi Cilibrasi and Paul Vitanyi (2004), [http://arxiv.org/abs/cs.CL/0412098 The Google Similarity Distance], ArXiv.org; also published as [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/trans/tk/&toc=comp/trans/tk/2007/03/k3toc.xml&DOI=10.1109/TKDE.2007.48 The Google Similarity Distance], IEEE Trans. Knowledge and Data Engineering, 19:3 (2007), 370-383.
* [http://www.newscientist.com/article.ns?id=dn6924 Google's search for meaning] at Newscientist.com.
* Jan Poland and Thomas Zeugmann (2006), [http://www-alg.ist.hokudai.ac.jp/~thomas/publications/dag_c2c_pz.pdf Clustering the Google Distance with Eigenvectors and Semidefinite Programming]
* Aarti Gupta and Tim Oates (2007), [http://www.ijcai.org/papers07/Papers/IJCAI07-261.pdf Using Ontologies and the Web to Learn Lexical Semantics] (includes a comparison of NGD to other algorithms)
* Wilson Wong, Wei Liu and Mohammed Bennamoun (2007), [http://www.springerlink.com/content/u331u5122822046m/ Tree-Traversing Ant Algorithm for term clustering based on featureless similarities], Journal of Data Mining and Knowledge Discovery (uses NGD for term clustering)

External links

* [http://cwl-projects.cogsci.rpi.edu/msr/ Measures of Semantic Relatedness]
* [http://wn-similarity.sourceforge.net WordNet-Similarity], an open source package for computing the similarity and relatedness of concepts found in WordNet


Wikimedia Foundation. 2010.
