- Cluster labeling
-
Cluster labeling is closely related to the concept of text clustering. This process tries to select descriptive labels for the clusters obtained through a clustering algorithm such as Flat Clustering and Hierarchical Clustering. For example, a cluster of documents that talks about various internet protocols can be best labeled as "Internet Protocols". Typically, the labels are obtained by examining the contents of the documents in a cluster. A good label not only summarizes the central concept of a cluster but also uniquely differentiates it from other clusters in the collection.
Contents
Differential Cluster Labeling
Differential cluster labeling labels a cluster by comparing the terms in one cluster with the terms occurring in other clusters. The techniques used for feature selection in information retrieval, such as mutual information and chi-squared feature selection, can also be applied to differential cluster labeling. Terms having very low frequency are not the best in representing the whole cluster and can be omitted in labeling a cluster. By omitting those rare terms and using a differential test, one can achieve the best results with differential cluster labeling[1].
Mutual Information
Main article: Mutual informationIn the fields of probability theory and information theory, mutual information measures the degree of dependence of two random variables. The mutual information of two variables X and Y is defined as:
where p(x, y) is the joint probability distribution of the two variables, p1(x) is the probability distribution of X, and p2(y) is the probability distribution of Y.
In the case of cluster labeling, the variable X is associated with membership in a cluster, and the variable Y is associated with the presence of a term[2]. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:
In this case, p(C = 1) represents the probability that a randomly-selected document is a member of a particular cluster, and p(C = 0) represents the probability that it isn't. Similarly, p(T = 1) represents the probability that a randomly-selected document contains a given term, and p(T = 0) represents the probability that it doesn't. The joint probability distribution function p(C, T) represents the probability that two events occur simultaneously. For example, p(0, 0) is the probability that a document isn't a member of cluster c and doesn't contain term t; p(0, 1) is the probability that a document isn't a member of cluster c and does contain term t; and so on.
Example
The following example calculates the mutual information between the cluster "commerce" and the term "tariff":
Documents in "commerce" Documents not in "commerce" Documents containing "tariff" 60 10,000 Documents not containing "tariff" 200 500,000 Total number of documents = (60 + 200 + 10,000 + 500,000) = 510,260
P (C = 1, T = 1) = 60/510,260
P (C = 1, T = 0) = 200/510,260
P (C = 0, T = 1) = 10,000/510,260
P (C = 0, T = 0) = 500,000/510,260
P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260
P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260
P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260
P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260
Plugging these probabilities into the above equation gives us the following MI value:
= 0.000417322 - 0.000137100 - 0.000154725 + 0.000155158
= 0.000280655
Therefore, the mutual information between the cluster "commerce" and the term "tariff" is 0.000280655. We can create a label for the cluster "commerce" by calculating the mutual information of each term in the cluster, and selected the k terms with the highest MI value.
Chi-Squared Selection
Main article: Pearson's chi-squared testThe Pearson's chi-squared test can be used to calculate the probability that the occurrence of an event matches the initial expectations. In particular, it can be used to determine whether two events, A and B, are statistically independent. The value of the chi-squared statistic is:
where Oa,b is the observed frequency of a and b co-occurring, and Ea,b is the expected frequency of co-occurrence.
In the case of cluster labeling, the variable A is associated with membership in a cluster, and the variable B is associated with the presence of a term. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:
For example, O1,0 is the observed number of documents that are in a particular cluster but don't contain a certain term, and E1,0 is the expected number of documents that are in a particular cluster but don't contain a certain term. Our initial assumption is that the two events are independent, so the expected probabilities of co-occurrence can be calculated by multiplying individual probabilities[3]:
E1,0 = N * P(C = 1) * P(T = 0)
where N is the total number of documents in the collection.
Example
Using the same data for the mutual information example, we can calculate the expected probabilities and plug them into the equation to calculate the chi-squared statistic:
Documents in "commerce" Documents not in "commerce" Documents containing "tariff" 60 10,000 Documents not containing "tariff" 200 500,000 P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260
P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260
P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260
P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260
= 499,945
= 10,055
= 254.9
= 5.13
= 587
Since each variable can have two values, the number of degrees of freedom is (2 - 1)(2 - 1) = 1. The chi-squared distribution for one degree of freedom states that the probability of observing a value greater than 10.83 is less than 0.001. Therefore, we can reject the null hypothesis, which states that the two events are independent. Since the term "tariff" and the cluster "commerce" are dependent, we can assume that the term is a good label for the cluster.
Cluster-Internal Labeling
Cluster-internal labeling selects labels that only depend on the contents of the cluster of interest. No comparison is made with the other clusters. Cluster-internal labeling can use a variety of methods, such as finding terms that occur frequently in the centroid or finding the document that lies closest to the centroid.
Centroid Labels
Main article: Vector space modelA frequently-used model in the field of information retrieval is the vector space model, which represents documents as vectors. The entries in the vector correspond to terms in the vocabulary. Binary vectors have a value of 1 if the term is present within a particular document and 0 if it is absent. Many vectors make use of weights that reflect the importance of a term in a document, and/or the importance of the term in a document collection. For a particular cluster of documents, we can calculate the centroid by finding the arithmetic mean of all the document vectors. If an entry in the centroid vector has a high value, then the corresponding term occurs frequently within the cluster. These terms can be used as a label for the cluster. One downside to using centroid labeling is that it can pick up words like "place" and "word" that have a high frequency in written text, but have little relevance to the contents of the particular cluster.
Title Labels
An alternative to centroid labeling is title labeling. Here, we find the document within the cluster that has the smallest Euclidean distance to the centroid, and use its title as a label for the cluster. One advantage to using document titles is that they provide additional information that would not be present in a list of terms. However, they also have the potential to mislead the user, since one document might not be representative of the entire cluster.
External knowledge labels
Cluster labeling can be done indirectly using external knowledge such as pre-categorized knowledge such as the one of Wikipedia.[4] In such methods, a set of important cluster text features are first extracted from the cluster documents. These features then can be used to retrieve the (weighted) K-nearest categorized documents from which candidates for cluster labels can be extracted. The final step involves the ranking of such candidates. Suitable methods are such that are based on a voting or a fusion process which is determined using the set of categorized documents and the original cluster features.
External links
References
- ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Cluster Labeling. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/cluster-labeling-1.html>.
- ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Mutual Information. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html>.
- ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Chi2 Feature Selection. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html>.
- ^ David Carmel, Haggai Roitman, Naama Zwerdling. Enhancing cluster labeling using wikipedia. SIGIR 2009: 139-146
Categories:
Wikimedia Foundation. 2010.