- The iDistance Technique
The iDistance is an indexing and query processing technique for k nearest neighbor (kNN) queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data, especially when the dimensionality of the data is high. The iDistance is designed to process kNN queries in high-dimensional spaces efficiently and it is especially good for skewed data distributions, which usually occur in real-life data sets.
Indexing
Building the iDistance index has two steps:
1. A number of reference points in the data space are chosen. There are variousways of choosing reference points. Using cluster centers asreference points is the most efficient way.
2. The distance between a data point and its closest reference point iscalculated. This distance plus a scaling value is called thepoint's "iDistance". By this means, points in amulti-dimensional space are mapped to one-dimensional values, andthen a B+-tree can be adopted to index the points using theiDistance as the key.
The figure on the right shows an example where three reference points (O1, O2, O3) are chosen. The data points are then mapped to a one-dimensional space and indexed in a B+-tree.
Query Processing
To process a kNN query, the query is mapped to a number ofone-dimensional range queries, which can be processed efficientlyon a B+-tree. In the above figure, the query "Q" is mapped to a value in the B+-tree while the kNN search ``sphere" is mapped to a range in the B+-tree. The search sphere expands gradually until the k NNs are found. This corresponds to gradually expanding range searches in the B+-tree.
The iDistance technique can be viewed as a way ofaccelerating the sequential scan. Instead of scanning records fromthe beginning to the end of the data file, the iDistance startsthe scan from spots where the nearest neighbors can be obtainedearly with a very high probability.
Applications
The iDistance has been used in many applications including
* Image Retrieval [Junqi Zhang, Xiangdong Zhou, Wei Wang, Baile Shi, Jian Pei,Using High Dimensional Indexes to Support Relevance Feedback Based Interactive Images Retrival Proceedings of the 32st international conference on Very large data bases, Seoul, Korea, 1211-1214, 2006.]
* Video indexing [Heng Tao Shen, Beng Chin Ooi, Xiaofang Zhou,Towards Effective Indexing for Very Large Video Sequence Database Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, 730-741, 2005. ]
* Similarity search in P2P systems [Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis,Peer-to-Peer Similarity Search in Metric Spaces Proceedings of the 33rd international conference on Very large data bases, Vienna, Austria, 986-997, 2007. ]
* Mobile computing [Sergio Ilarri, Eduardo Mena, Arantza Illarramendi IEEE Transactions on Mobile Computing, Volume 5, Issue 8, Aug. 2006 Page(s): 1029 - 1043. ]Historical Background
The iDistance was first proposed by Cui Yu, Beng Chin Ooi,Kian-Lee Tan and H. V. Jagadish in 2001 [Cui Yu, Beng Chin Ooi, Kian-Lee Tan and H. V. Jagadish [http://www.comp.nus.edu.sg/~ooibc/papers/Rcuiyu.pdf Indexing the distance: an efficient method to KNN processing] , Proceedings of the 27st international conference on Very large data bases, Roma, Italy, 421-430, 2001.] . Later, together with Rui Zhang, they improved the technique and performeda more comprehensive study on it in 2005 [H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu and Rui Zhang [http://www.comp.nus.edu.sg/~ooibc/todsidistance.pdf iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search] , ACM Transactions on Data Base Systems (ACM TODS), 30, 2, 364-397, June 2005.] .
oftware/source code
*The source code of iDistance implemented in C++ by Rui Zhang can be downloaded from [http://www.csse.unimelb.edu.au/~rui/code.htm] .
Notes
Wikimedia Foundation. 2010.