K-nearest neighbor algorithm

In pattern recognition, the "k"-nearest neighbor algorithm ("k"-NN) is a method for classifying objects based on the closest training examples in the feature space. "k"-NN is a type of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until classification. It can also be used for regression.

Overview

The "k"-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its "k" nearest neighbors. "k" is a positive integer, typically small. If "k" = 1, then the object is simply assigned to the class of its nearest neighbor. In binary (two class) classification problems, it is helpful to choose "k" to be an odd number as this avoids tied votes.

The same method can be used for regression, by assigning the object the average of the property values of its "k" nearest neighbors. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones.
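
For regression, the same neighbors are averaged rather than voted on. The sketch below (again illustrative Python, redefining the same Euclidean helper so it runs on its own) optionally weights each neighbor by the inverse of its distance:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress(query, training_points, training_values, k=3, weighted=True):
    # Keep the k training points closest to the query.
    neighbors = sorted(zip(training_points, training_values),
                       key=lambda pv: euclidean(pv[0], query))[:k]
    if not weighted:
        # Plain average of the k nearest property values.
        return sum(v for _, v in neighbors) / k
    # Inverse-distance weighting: nearer neighbors contribute more to the average.
    weights = [1.0 / (euclidean(p, query) + 1e-12) for p, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)
```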

The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. In order to identify neighbors, the objects are represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The "k"-nearest neighbor algorithm is sensitive to the local structure of the data.
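
As a small illustration of the choice of distance measure (hypothetical helper names; either function could be dropped into the classification sketch above in place of its distance call):

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```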

Algorithm

The training examples are vectors in a multidimensional feature space. The space is partitioned into regions by the locations and labels of the training samples. A point in the space is assigned to class "c" if "c" is the most frequent class label among the "k" training samples nearest to that point. Usually the Euclidean distance is used.

The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, the test sample (whose class is not known) is represented as a vector in the feature space, the distances from it to all stored vectors are computed, and the "k" closest samples are selected. The most commonly used technique is then to assign the test sample to the most frequent class among these "k" nearest neighbors. A major drawback of this basic majority vote is that classes with more frequent training examples tend to dominate the prediction, simply because they are likely to turn up among the "k" nearest neighbors of any test sample. One way to overcome this problem is to take into account the distance from the test sample to each of its "k" nearest neighbors, so that nearer neighbors carry more weight in the prediction.
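
A sketch of such distance-weighted voting (illustrative Python, self-contained) gives each of the "k" neighbors a vote proportional to the inverse of its distance, so that sheer class frequency matters less than closeness:

```python
from collections import defaultdict
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_weighted(query, training_points, training_labels, k=3):
    # Keep the k training points closest to the query.
    neighbors = sorted(zip(training_points, training_labels),
                       key=lambda pl: euclidean(pl[0], query))[:k]
    # Accumulate inverse-distance weights per class instead of raw vote counts,
    # so one very close neighbor can outweigh several distant ones.
    scores = defaultdict(float)
    for point, label in neighbors:
        scores[label] += 1.0 / (euclidean(point, query) + 1e-12)
    return max(scores, key=scores.get)
```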

Parameter selection

The best choice of "k" depends upon the data; generally, larger values of "k" reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good "k" can be selected by various heuristic techniques, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when "k" = 1) is called the nearest neighbor algorithm.
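
One way to carry out such a search for "k", sketched here under the assumption that scikit-learn is available (the dataset and parameter values are arbitrary), is to compare cross-validated accuracy over a range of candidate values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of odd k values and keep the one with the best mean CV accuracy.
scores = {}
for k in range(1, 20, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```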

The accuracy of the "k"-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling.[citation needed] Another popular approach is to scale features by the mutual information of the training data with the training classes.[citation needed]
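
As a simple illustration of why feature scaling matters (this uses plain standardization rather than the evolutionary or mutual-information approaches mentioned above, and assumes scikit-learn is available):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The wine features have very different scales, so standardizing them
# typically improves k-NN accuracy noticeably.
print(cross_val_score(raw, X, y, cv=5).mean())
print(cross_val_score(scaled, X, y, cv=5).mean())
```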

Properties

The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the training set grows large. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Some optimizations involve partitioning the feature space and only computing distances within specific nearby volumes. Common approaches to nearest neighbor search include the following (a usage sketch follows the list):
* Linear scan
* k-d trees
* Ball trees
* Metric trees
* Locality-sensitive hashing (LSH)
* Agglomerative-Nearest-Neighbor
* Redundant Bit Vectors (RBV)
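
As one example of avoiding a full linear scan, a k-d tree can be built once over the training points and then queried for nearest neighbors. The sketch below assumes NumPy and SciPy are available; all data here is synthetic:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
training_points = rng.random((10_000, 3))   # 10,000 points in a 3-D feature space
tree = KDTree(training_points)              # built once, during "training"

query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5) # 5 nearest neighbors without scanning every point
print(indices, distances)
```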

The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). "k"-nearest neighbor is guaranteed to approach the Bayes error rate, for some value of "k" (where "k" increases as a function of the number of data points).
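
In symbols, this asymptotic (Cover–Hart) bound for the 1-nearest-neighbor rule can be sketched as

R^{*} \le R_{\mathrm{NN}} \le R^{*}\left(2 - \tfrac{M}{M-1}R^{*}\right),

where R^{*} is the Bayes error rate, R_{\mathrm{NN}} is the asymptotic nearest-neighbor error rate, and M is the number of classes; for two classes this reduces to R_{\mathrm{NN}} \le 2R^{*}(1 - R^{*}).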

The "k"-NN algorithm can also be adapted for use in estimating continuous variables. One such implementation uses an inverse distance weighted average of the "k"-nearest multivariate neighbors. This algorithm functions as follows:
# Compute the Euclidean or Mahalanobis distance from the target point to each of the sampled points.
# Order the samples by increasing distance.
# Find a heuristically optimal number "k" of nearest neighbors, for example by minimizing the RMSE obtained from cross-validation.
# Calculate an inverse-distance-weighted average of the values of these "k" nearest multivariate neighbors.
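
A sketch of this estimator, assuming NumPy is available (function names such as knn_estimate are illustrative only):

```python
import numpy as np

def knn_estimate(target, samples, values, k):
    # Step 1: Mahalanobis distance from the target to every sampled point.
    cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))
    diffs = samples - target
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))
    # Step 2: order the samples by distance and keep the k nearest.
    nearest = np.argsort(dists)[:k]
    # Step 4: inverse-distance-weighted average of their values.
    weights = 1.0 / (dists[nearest] + 1e-12)
    return np.sum(weights * values[nearest]) / np.sum(weights)

def choose_k_by_rmse(samples, values, candidate_ks):
    # Step 3: pick k heuristically by leave-one-out cross-validated RMSE.
    best_k, best_rmse = None, np.inf
    for k in candidate_ks:
        errors = []
        for i in range(len(samples)):
            rest = np.delete(samples, i, axis=0)
            rest_values = np.delete(values, i)
            errors.append(values[i] - knn_estimate(samples[i], rest, rest_values, k))
        rmse = np.sqrt(np.mean(np.square(errors)))
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k
```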

See also

* Nearest neighbor search
* Artificial intelligence
* Data mining
* Parzen window
* Machine learning
* Pattern recognition
* Predictive analytics
* Statistics
* Support vector machine

References

* Belur V. Dasarathy, editor (1991). "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques". ISBN 0-8186-8930-7.
* Shakhnarovish, Darrell, and Indyk, editors (2005). "Nearest-Neighbor Methods in Learning and Vision". The MIT Press. ISBN 0-262-19547-X.
* Helena Mäkelä and Anssi Pekkarinen (2004). "Estimation of forest stand volumes by Landsat TM imagery and stand-level field-inventory data". Forest Ecology and Management, Volume 196, Issues 2-3, 26 July 2004, Pages 245-255.
* Hector Franco-Lopez, Alan R. Ek and Marvin E. Bauer (2001). "Estimation and mapping of forest stand density, volume, and cover type using the "k"-nearest neighbors method". Remote Sensing of Environment, Volume 77, Issue 3, September 2001, Pages 251-274.

External links

* [http://paul.luminos.nl/documents/show_document.php?d=197 "k"-nearest neighbor algorithm in Visual Basic and Java] (includes executable and source code)
* [http://people.revoledu.com/kardi/tutorial/KNN/index.html "k"-nearest neighbor tutorial using MS Excel]
* [http://www.radwin.org/michael/projects/learning/ Agglomerative-Nearest-Neighbor Algorithm, by Michael J. Radwin]
* [http://www.dsic.upv.es/~rparedes/english/research/CPW/index.html Weighted distance learning for nearest neighbor error minimization by R. Paredes and E. Vidal]

