Davies–Bouldin index

The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms^[1]. It is an internal evaluation scheme: how well the clustering has been done is validated using quantities and features inherent to the dataset. The drawback is that a good value reported by this method does not imply the best information retrieval.

1 Preliminaries
2 Definition
3 Explanation
4 External Links
5 Notes and references

Preliminaries

Let C_i be a cluster of vectors. Let X_j be an N-dimensional feature vector assigned to cluster C_i.

$S_i = \frac{1}{T_i} \displaystyle\sum_{j=1}^{T_i}\left\|X_j-A_i\right\|_q$

Here $A_i$ is the centroid of C_i and T_i is the size of cluster i. S_i is a measure of scatter within the cluster. Usually the value of q is 2, which makes this the Euclidean distance between the centroid of the cluster and the individual feature vectors. Other distance metrics can be used for manifolds and high-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters. For meaningful results, this distance metric has to match the metric used in the clustering scheme itself.
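As an illustration, the scatter can be computed directly from the formula above. The following is a minimal sketch in plain Python, assuming a cluster is given as a list of equal-length tuples; the helper names `centroid`, `minkowski` and `scatter` are hypothetical and chosen only for this example.

```python
import math

def centroid(points):
    """Component-wise mean A_i of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def minkowski(u, v, q=2):
    """q-norm of the difference u - v (q = 2 gives the Euclidean distance)."""
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)

def scatter(points, q=2):
    """S_i: mean q-norm distance from each point of the cluster to its centroid."""
    a = centroid(points)
    return sum(minkowski(x, a, q) for x in points) / len(points)

# Four corners of a 2 x 2 square: the centroid is (1, 1) and every
# corner lies at Euclidean distance sqrt(2) from it.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))
print(scatter(cluster), math.sqrt(2))
```

Passing a different `q` switches the norm, which is one way to keep the scatter consistent with the metric used by the clustering algorithm.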

$M_{i,j} = \sqrt[p]{\displaystyle\sum_{k=1}^{N}\left|a_{k,i}-a_{k,j}\right|^p }$

$M_{i,j}$ is a measure of separation between cluster $C_i$ and cluster $C_j$. $a_{k,i}$ is the kth element of $A_i$, and there are N such elements in $A_i$, for it is an N-dimensional centroid.

Here k indexes the features of the data, and this is essentially the Euclidean distance between the centers of clusters i and j when p equals 2.
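The separation measure can be sketched in a few lines of plain Python. The function name `separation` and the sample centroids are assumptions made for this example only.

```python
def separation(a_i, a_j, p=2):
    """M_ij: Minkowski distance of order p between two centroids."""
    return sum(abs(x - y) ** p for x, y in zip(a_i, a_j)) ** (1.0 / p)

a_1 = (0.0, 0.0)
a_2 = (3.0, 4.0)
print(separation(a_1, a_2, p=2))  # Euclidean distance of a 3-4-5 triangle
print(separation(a_1, a_2, p=1))  # city-block distance, 3 + 4
```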

Definition

Let R_i,j be a measure of how good the clustering scheme is. By definition, this measure has to account for M_i,j, the separation between the i^th and the j^th cluster, which ideally is as large as possible, and S_i, the within-cluster scatter of cluster i, which ideally is as low as possible. Hence the Davies–Bouldin index is built on a ratio of within-cluster scatter to between-cluster separation, R_i,j, chosen so that the following properties hold:

$R_{i,j} \geqslant 0$.
$R_{i,j} = R_{j,i}$.
If $S_j > S_k$ and $M_{i,j} = M_{i,k}$ then $R_{i,j} > R_{i,k}$.
If $S_j = S_k$ and $M_{i,j} < M_{i,k}$ then $R_{i,j} > R_{i,k}$.

$R_{i,j} = \frac{S_i + S_j}{M_{i,j}}$

This ratio is non-negative and symmetric in i and j, as required above. With this formulation, the lower the value, the better the separation between the clusters and the 'tightness' within them.
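A small numerical check can make the behaviour of this ratio concrete. The sketch below uses made-up scatter and separation values, and `similarity` is a hypothetical helper name; it assumes distinct centroids so that the separation is positive.

```python
def similarity(s_i, s_j, m_ij):
    """R_ij = (S_i + S_j) / M_ij for two clusters with distinct centroids."""
    if m_ij <= 0:
        raise ValueError("centroids must be distinct (M_ij > 0)")
    return (s_i + s_j) / m_ij

r_12 = similarity(1.0, 2.0, 6.0)                 # (1 + 2) / 6
r_21 = similarity(2.0, 1.0, 6.0)                 # same value: symmetric in i and j
r_more_scatter = similarity(1.0, 5.0, 6.0)       # larger scatter, same separation
r_more_separation = similarity(1.0, 2.0, 12.0)   # same scatter, larger separation

print(r_12, r_21, r_more_scatter, r_more_separation)
```

Increasing the scatter raises the ratio, while increasing the separation lowers it.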

$D_i \equiv \max_{j : i \neq j} R_{i,j}$

${DB} \equiv \frac{1}{N}\displaystyle\sum_{i=1}^N D_i$

DB is called the Davies–Bouldin index; in this sum N is the number of clusters, not the feature dimension used in the Preliminaries. The index depends on both the data and the algorithm. D_i picks the worst-case scenario: it equals R_i,j for the cluster j most similar to cluster i. Many variations of this formulation are possible, such as taking the average or a weighted average of the cluster similarities instead of the maximum.
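Putting the pieces together, the index can be computed for a given partition as sketched below. This is an illustrative plain-Python version, not a reference implementation: the function names are hypothetical, the partition is passed as a list of clusters (each a list of points), and at least two non-empty clusters with pairwise distinct centroids are assumed.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def minkowski(u, v, order=2):
    """Minkowski distance of the given order between two vectors."""
    return sum(abs(a - b) ** order for a, b in zip(u, v)) ** (1.0 / order)

def davies_bouldin(clusters, p=2, q=2):
    """Average over clusters of the worst-case ratio D_i = max_j R_ij."""
    cents = [centroid(c) for c in clusters]
    scat = [sum(minkowski(x, a, q) for x in c) / len(c)
            for c, a in zip(clusters, cents)]
    n = len(clusters)
    worst = []
    for i in range(n):
        worst.append(max((scat[i] + scat[j]) / minkowski(cents[i], cents[j], p)
                         for j in range(n) if j != i))
    return sum(worst) / n

# Toy data: three pairs of points, each pair with scatter 1.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
c = [(0.0, 10.0), (0.0, 12.0)]

print(davies_bouldin([a, b]))      # one pair of clusters: R = (1 + 1) / 4
print(davies_bouldin([a, b, c]))   # D_a = D_b = 0.5, D_c = 0.2, mean about 0.4
```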

Explanation

These conditions constrain the index so defined to be symmetric and non-negative. Because it is a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means a better clustering. The index is the similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity is R_i,j as defined above. This affirms the idea that no cluster should be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index.

Since the index is an average over all the clusters, a good way to decide how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over. The number of clusters for which the value is lowest is a good estimate of the number of clusters the data could ideally be classified into. This has applications in choosing k in the k-means algorithm, where the value of k is not known a priori. The SOM toolbox contains a MATLAB implementation ^[2].
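The model-selection idea can be illustrated on toy data. In the sketch below, three well-separated groups of points are partitioned by hand into 2, 3 and 4 clusters; these hand-made partitions are an assumption standing in for the output of a clustering algorithm such as k-means, and all function names are hypothetical.

```python
def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def davies_bouldin(clusters):
    """DB index with Euclidean scatter and separation (p = q = 2)."""
    cents = [centroid(c) for c in clusters]
    scat = [sum(euclidean(x, a) for x in c) / len(c)
            for c, a in zip(clusters, cents)]
    n = len(clusters)
    return sum(max((scat[i] + scat[j]) / euclidean(cents[i], cents[j])
                   for j in range(n) if j != i)
               for i in range(n)) / n

def square(x0, y0):
    """Four corners of a unit square with lower-left corner (x0, y0)."""
    return [(x0, y0), (x0, y0 + 1.0), (x0 + 1.0, y0), (x0 + 1.0, y0 + 1.0)]

g1, g2, g3 = square(0.0, 0.0), square(10.0, 0.0), square(0.0, 10.0)

# Candidate partitions for k = 2, 3, 4 (hand-made, not produced by k-means).
partitions = {
    2: [g1 + g2, g3],              # two true groups merged
    3: [g1, g2, g3],               # the natural grouping
    4: [g1[:2], g1[2:], g2, g3],   # one true group split in half
}

scores = {k: davies_bouldin(p) for k, p in partitions.items()}
best_k = min(scores, key=scores.get)

for k in sorted(scores):
    print(k, round(scores[k], 4))
print("lowest index at k =", best_k)
```

On this data the natural three-cluster partition gives the lowest index; merging two groups inflates the scatter, and splitting one group places two centroids close together.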

External Links

Notes and references

^ Davies, D. L.; Bouldin, D. W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence (2): 224. doi:10.1109/TPAMI.1979.4766909.
^ "Matlab implementation". http://www.cis.hut.fi/somtoolbox/package/docs2/db_index.html. Retrieved 12 November 2011.

Wikimedia Foundation. 2010.
