SimRank

SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model.SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects.Effectively, SimRank is a measure that says "two objects are similar if they are related to similar objects."

Introduction

Many applications require a measure of "similarity" between objects.One obvious example is the "find-similar-document" query,on traditional text corpora or the World-Wide Web.More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.

Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain.In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences.SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest.On the Web, for example, two pages are related if there are hyperlinks between them.A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information.In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item.Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.

The intuition behind SimRank algorithm is that, in many domains, similar objects are related to similar objects.More precisely, objects $a$ and $b$ are similar if they are related to objects $c$ and $d$ , respectively, and $c$ and $d$ are themselves similar.The base case is that objects are similar to themselvesG. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD'02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538-543. ACM Press, 2002. [http://www-cs-students.stanford.edu/~glenj/simrank.pdf] ] .

It is important to note that SimRank is a general algorithm that determines only the similarity of structural context.SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships.Obviously, similarity of other domain-specific aspects are important as well; these can — and should be combined with relational structural-context similarity for an overall similarity measure.For example, for Web pages SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora.For recommender systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.), as well as similarities between users (e.g., same gender, same spending level).Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.

Basic SimRank equation

For a node $v$ in a graph, we denote by $I(v)$ and $O(v)$ the set of in-neighbors and out-neighbors of $v$ , respectively.Individual in-neighbours are denoted as $I_i(v)$ , for $1 le i le left|I(v)
ight|$ , and individualout-neighbors are denoted as $O_i(v)$ , for $1 le i le left|O(v)
ight|$ .

Let us denote the similarity between objects $a$ and $b$ by $s(a, b) in [0, 1]$ . Following the earlier motivation, a recursive equation is written for $s(a, b)$ .If $a = b$ then $s(a, b)$ is defined to be $1$ .Otherwise,: $s(a, b) = frac{C}{left|I(a)
ight| left|I(b)
ight sum_{i=1}^{left|I(a)
ightsum_{j=1}^{left|I(b)
ight s(I_i(a), I_j(b))$ where $C$ is a constant between $0$ and $1$ .A slight technicality here is that either $a$ or $b$ may not have any in-neighbors.Since there is no way to infer any similarity between $a$ and $b$ in this case, similarity is set to $s(a, b) = 0$ , so the summation in the above equation is defined to be $0$ when $I(a) = emptyset$ or $I(b) = emptyset$ .

Computing SimRank

A solution to the SimRank equations for a graph $G$ can be reached by iteration to a fixed-point.Let $n$ be the number of nodes in $G$ .For each iteration $k$ , we can keep $n^2$ entries $R_k(*, *)$ of length $n^2$ , where $R_k(a, b)$ gives the score between $a$ and $b$ on iteration $k$ .We successively compute $R_{k+1}(*, *)$ based on $R_k(*, *)$ .We start with $R_0(*, *)$ where each $R_0(a, b)$ is a lower bound on the actual SimRank score $s(a, b)$ :: $R_0(a, b) = egin{cases} 1 mbox{ } , mbox{ } mbox{if } a = b mbox{ } , \ 0 mbox{ } , mbox{ } mbox{if } a
eq b mbox{ } . end{cases}$

To compute $R_{k+1}(a, b)$ from $R_k(*, *)$ , we use the basic SimRank equation to get:: $R_{k + 1}(a, b) = frac{C}{left|I(a)
ight| left|I(b)
ight sum_{i=1}^{left|I(a)
ightsum_{j=1}^{left|I(b)
ight R_k(I_i(a), I_j(b))$ for $a
e b$ , and $R_{k+1}(a, b) = 1$ for $a = b$ .That is, on each iteration $k + 1$ , we update the similarity of $(a, b)$ using the similarity scores of the neighbours of $(a, b)$ from the previous iteration $k$ according to the basic SimRank equation.The values $R_k(*, *)$ are nondecreasing as $k$ increases.It was shown in that the values converge to limits satisfying the basic SimRank equation, the SimRank scores $s(*, *)$ , i.e., for all $a, b in V$ , $lim_{k o infty} R_k(a, b) = s(a, b)$ .

Further research on SimRank

* Fogaras and Racz D. Fogaras and B. Racz. Scaling link-based similarity search. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 641--650, New York, NY, USA, 2005. ACM. [http://www2005.org/cdrom/docs/p641.pdf] ] suggested speeding up SimRank computation through probabilistic computation using Monte Carlo method.

See also

* PageRank

Citations

Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

PageRank — is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set. The algorithm may be applied to… … Wikipedia
Semantic similarity — or semantic relatedness is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning / semantic content. Concretely, this can be achieved for instance by defining a topological… … Wikipedia
Semantic relatedness — Computational Measures of Semantic Relatedness are [http://cwl projects.cogsci.rpi.edu/msr/ publically available] means for approximating the relative meaning of words/documents. These have been used for essay grading by the Educational Testing… … Wikipedia
Cosine similarity — is a measure of similarity between two vectors by measuring the cosine of the angle between them. The cosine of 0 is 1, and less than 1 for any other angle. The cosine of the angle between two vectors thus determines whether two vectors are… … Wikipedia

Academic Dictionaries and Encyclopedias

SimRank

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

SimRank

Look at other dictionaries:

Share the article and excerpts

Direct link