DiShIn
DiShIn (Disjunctive Shared Information) is a method that exploits multiple inheritance when calculating the shared information content of two ontology concepts compared by a node-based semantic similarity measure. DiShIn redefines the shared information content of two concepts as the average information content of all their disjunctive common ancestors. A common ancestor is disjunctive if the difference between the number of distinct paths from each of the two concepts to it differs from that of every more informative common ancestor. In other words, a disjunctive ancestor is the most informative ancestor representing a given set of parallel interpretations. DiShIn improves on GraSM in computational efficiency and in its handling of parallel interpretations.
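The definition above can be written compactly as follows. The notation is an assumption introduced here for illustration: CA(c1, c2) denotes the common ancestors of the two concepts, paths(c, a) the number of distinct paths from concept c to ancestor a, PD the path difference, DCA the disjunctive common ancestors, and IC the information content.

```latex
\[
PD(c_1, c_2, a) = \bigl|\, \mathrm{paths}(c_1, a) - \mathrm{paths}(c_2, a) \,\bigr|
\]
\[
DCA(c_1, c_2) = \bigl\{\, a \in CA(c_1, c_2) \;:\;
  \forall a' \in CA(c_1, c_2),\;
  IC(a') > IC(a) \Rightarrow PD(c_1, c_2, a') \neq PD(c_1, c_2, a) \,\bigr\}
\]
\[
IC_{\mathrm{shared}}(c_1, c_2) = \frac{1}{|DCA(c_1, c_2)|} \sum_{a \in DCA(c_1, c_2)} IC(a)
\]
```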
Example
For example, palladium, platinum, silver and gold are considered precious metals, and silver, gold and copper are considered coinage metals. Thus, we have:
                   metal
                 /       \
        precious           coinage
       /     |   \ \    / /    \
     /       |    \ gold /      \
palladium platinum silver     copper
When calculating the semantic similarity between platinum and gold, DiShIn starts by computing, for each of their common ancestors, the difference between the number of paths from each concept:
gold -> coinage -> metal
gold -> precious -> metal
platinum -> precious -> metal

gold -> precious
platinum -> precious
For metal we have two paths from gold and one from platinum, so we have a path difference of one. For precious we have one path from each concept, so we have a path difference of zero.
Since their path differences are distinct, both common ancestors, metal and precious, are considered disjunctive common ancestors.
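The path counting in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENTS` dictionary and the function names are hypothetical, and a concept is not treated as its own ancestor.

```python
from functools import lru_cache

# Example ontology as child -> list of parents (hypothetical encoding).
PARENTS = {
    "metal": [],
    "precious": ["metal"],
    "coinage": ["metal"],
    "palladium": ["precious"],
    "platinum": ["precious"],
    "silver": ["precious", "coinage"],
    "gold": ["precious", "coinage"],
    "copper": ["coinage"],
}


@lru_cache(maxsize=None)
def count_paths(concept, ancestor):
    """Number of distinct upward paths from concept to ancestor."""
    if concept == ancestor:
        return 1
    return sum(count_paths(parent, ancestor) for parent in PARENTS[concept])


def ancestors(concept):
    """All strict ancestors of concept (the concept itself is excluded)."""
    found = set()
    stack = list(PARENTS[concept])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def path_differences(c1, c2):
    """Map each common ancestor to its path difference for the pair."""
    common = sorted(ancestors(c1) & ancestors(c2))
    return {a: abs(count_paths(c1, a) - count_paths(c2, a)) for a in common}


print(path_differences("platinum", "gold"))  # {'metal': 1, 'precious': 0}
```

Memoizing `count_paths` keeps the recursion from revisiting the same (concept, ancestor) pair, which matters in ontologies with many shared ancestors.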
When calculating the semantic similarity between platinum and palladium, DiShIn starts by computing, for each of their common ancestors, the difference between the number of paths from each concept:
palladium -> precious -> metal
platinum -> precious -> metal

palladium -> precious
platinum -> precious
For both metal and precious, we have only one path from each concept, so we have a path difference of zero for both common ancestors. Thus, only the common ancestor precious (the most informative) is considered to be a disjunctive common ancestor.
Node-based semantic similarity measures are proportional to the average information content of the disjunctive common ancestors: metal and precious in the case of platinum and gold, and only precious in the case of platinum and palladium. Since precious is more informative than metal, DiShIn considers palladium and platinum to be more similar than platinum and gold.
When calculating the semantic similarity between silver and gold, DiShIn starts by computing, for each of their common ancestors, the difference between the number of paths from each concept:
gold -> coinage -> metal
gold -> precious -> metal
silver -> coinage -> metal
silver -> precious -> metal

gold -> precious
silver -> precious

gold -> coinage
silver -> coinage
As in the case of platinum and palladium, all common ancestors here have a path difference of zero, since silver and gold share the same relationships and therefore have parallel interpretations. Thus, only the most informative common ancestor, precious or coinage, is considered a disjunctive common ancestor. This means that for DiShIn the similarity between silver and gold is greater than or equal to the similarity between any other pair of leaf concepts. Thus, DiShIn does not penalize parallel interpretations as GraSM did.
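The three comparisons above can be reproduced with a minimal Python sketch. It is not the authors' implementation: the `PARENTS` encoding, the function names and, in particular, the information content values in `IC` are hypothetical and chosen only so that metal is less informative than precious and coinage. Ancestors that tie on both path difference and information content are all kept, which is also an assumption.

```python
from collections import defaultdict

# Example ontology as child -> list of parents (hypothetical encoding).
PARENTS = {
    "metal": [],
    "precious": ["metal"],
    "coinage": ["metal"],
    "palladium": ["precious"],
    "platinum": ["precious"],
    "silver": ["precious", "coinage"],
    "gold": ["precious", "coinage"],
    "copper": ["coinage"],
}

# Hypothetical information content values, for illustration only.
IC = {"metal": 0.5, "precious": 1.0, "coinage": 1.2}


def path_counts(concept):
    """Map each strict ancestor to the number of distinct paths reaching it."""
    counts = defaultdict(int)
    stack = [concept]  # each stack entry stands for one distinct path
    while stack:
        node = stack.pop()
        for parent in PARENTS[node]:
            counts[parent] += 1
            stack.append(parent)
    return counts


def disjunctive_ancestors(c1, c2):
    """Most informative common ancestor(s) for each distinct path difference."""
    p1, p2 = path_counts(c1), path_counts(c2)
    diff = {a: abs(p1[a] - p2[a]) for a in p1.keys() & p2.keys()}
    top = {}  # path difference -> highest IC seen for that difference
    for a, d in diff.items():
        top[d] = max(top.get(d, IC[a]), IC[a])
    return {a for a, d in diff.items() if IC[a] == top[d]}


def shared_ic(c1, c2):
    """Average IC of the disjunctive common ancestors (0.0 if there are none)."""
    dca = disjunctive_ancestors(c1, c2)
    return sum(IC[a] for a in dca) / len(dca) if dca else 0.0


for pair in [("platinum", "gold"), ("platinum", "palladium"), ("silver", "gold")]:
    print(pair, sorted(disjunctive_ancestors(*pair)), shared_ic(*pair))
```

With these assumed values, platinum and gold share less information than platinum and palladium, and silver and gold share the most, because coinage happens to be the most informative ancestor in the hypothetical `IC` table.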
Implementation
After estimating the information content of each concept and the number of distinct paths from one concept to another, DiShIn can be implemented as a single SQL query, described in the authors' publication in the Journal of Biomedical Semantics.
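As an illustration of the first prerequisite, information content is commonly estimated as the negative logarithm of a concept's relative frequency. The sketch below assumes that estimate and uses hypothetical usage counts in `FREQ`, where each count includes the concept's descendants. It is not the SQL query from the publication.

```python
import math

# Hypothetical usage counts; each count includes the concept's descendants.
FREQ = {"metal": 100, "precious": 40, "coinage": 25}


def information_content(concept, root="metal"):
    """IC(c) = -log(p(c)), with p(c) the frequency of c relative to the root."""
    return -math.log(FREQ[concept] / FREQ[root])


for concept in FREQ:
    print(concept, round(information_content(concept), 3))
```

Under this estimate the root always has an information content of zero, and rarer concepts are more informative.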
References
- Couto, F., & Silva, M. (2011). Disjunctive Shared Information between Ontology Concepts: application to Gene Ontology. Journal of Biomedical Semantics, 2:5
- Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152
- Couto, F., Silva, M., & Coutinho, P. (2005). Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM)
Wikimedia Foundation. 2010.