Robust real-time object detection

The Viola and Jones object detection framework was the first object detection framework to provide competitive detection rates in real time. [Viola, Jones: Robust Real-time Object Detection, IJCV 2004, http://research.microsoft.com/~viola/Pubs/Detect/violaJones_IJCV.pdf. See pages 1, 3.] Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. This article introduces the contributions that made this advance possible.

Components of the Framework

Feature Types and Evaluation

The features employed by the detection framework all involve sums of image pixels within rectangular areas. As such, they bear some resemblance to Haar basis functions, which have been used previously in image-based object detection. [C. Papageorgiou, M. Oren and T. Poggio. A General Framework for Object Detection. "International Conference on Computer Vision", 1998] However, since the features used by Viola and Jones all rely on more than one rectangular area, they are generally more complex. The figure at right illustrates the three different types of features used in the framework: two-rectangle, three-rectangle, and four-rectangle features. The value of any given feature is always the sum of the pixels within the clear rectangles subtracted from the sum of the pixels within the shaded rectangles.

As is to be expected, rectangular features of this sort are rather primitive when compared to alternatives such as steerable filters. Although they are sensitive to vertical and horizontal features, their feedback is considerably coarser. However, with the use of an image representation called the integral image, rectangular features can be evaluated in constant time, which gives them a considerable speed advantage over their more sophisticated relatives.

With an integral image, the sum over any single rectangle requires four array references, one per corner. Because each rectangular area in a feature is always adjacent to at least one other rectangle, adjacent rectangles share corners, so any two-rectangle feature can be computed in six array references, any three-rectangle feature in eight, and any four-rectangle feature in just nine.
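The constant-time evaluation described above can be illustrated with a minimal sketch. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the choice of which half is shaded are assumptions made for illustration, not part of the original framework's code. For simplicity the feature below performs eight lookups; reusing the two corners shared by the adjacent rectangles would bring this down to the six mentioned above.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) table where ii[y][x] is the sum of img rows < y, cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y): four array references."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: shaded (right) sum minus clear (left) sum.

    Which half is shaded is an assumption made for this example.
    """
    clear = rect_sum(ii, x, y, w, h)
    shaded = rect_sum(ii, x + w, y, w, h)
    return shaded - clear


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
ii = integral_image(image)
print(rect_sum(ii, 1, 0, 2, 2))          # 2 + 3 + 6 + 7 = 18
print(two_rect_feature(ii, 0, 0, 2, 2))  # (3 + 4 + 7 + 8) - (1 + 2 + 5 + 6) = 8
```

The integral image is built once per image in a single pass; after that, the cost of a feature does not depend on the size of its rectangles.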

Learning Algorithm

The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of 45,396 possible features, and it would be prohibitively expensive to evaluate them all. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost both to select the best features and to train classifiers that use them.
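As a rough illustration of how boosting can double as feature selection, the sketch below runs a single boosting round over precomputed feature values: each feature gets the best threshold classifier (a decision stump), the feature with the lowest weighted error is selected, and correctly classified examples are down-weighted. The names `best_stump` and `adaboost_round`, the exhaustive threshold search, and the data layout are assumptions for this example; a practical trainer would be considerably more optimized.

```python
import math


def best_stump(values, labels, weights):
    """Return (error, threshold, polarity) of the best stump for one feature.

    The stump predicts 1 (object) when polarity * value < polarity * threshold.
    """
    best = (float("inf"), None, None)
    for threshold in sorted(set(values)):
        for polarity in (1, -1):
            err = sum(
                w for v, y, w in zip(values, labels, weights)
                if (1 if polarity * v < polarity * threshold else 0) != y
            )
            if err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost_round(feature_values, labels, weights):
    """Select the single best feature and reweight the training examples.

    feature_values[j][i] is the value of feature j on example i.
    """
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to a distribution
    candidates = []
    for j, vals in enumerate(feature_values):
        err, thr, pol = best_stump(vals, labels, weights)
        candidates.append((err, j, thr, pol))
    err, j, thr, pol = min(candidates, key=lambda c: c[0])
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    beta = err / (1 - err)
    new_weights = []
    for v, y, w in zip(feature_values[j], labels, weights):
        predicted = 1 if pol * v < pol * thr else 0
        # Correctly classified examples shrink; mistakes keep their weight.
        new_weights.append(w * beta if predicted == y else w)
    stump = {"feature": j, "threshold": thr, "polarity": pol,
             "alpha": math.log(1 / beta)}
    return stump, new_weights


# Toy data: two hypothetical features evaluated on four examples.
feature_values = [[1, 2, 3, 4], [1, 5, 2, 6]]
labels = [0, 0, 1, 1]
stump, new_weights = adaboost_round(feature_values, labels, [0.25] * 4)
print(stump["feature"], stump["threshold"], stump["polarity"])
```

Repeating such rounds yields a strong classifier that is a weighted vote of a small number of selected features, which is why only a fraction of the possible features ever needs to be evaluated at detection time.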

Cascade Architecture

The strong classifiers generated by the learning process can be evaluated quickly, but not quickly enough to run in real time. For this reason, the strong classifiers are arranged in a cascade in order of complexity, where each successive classifier is trained only on those examples which pass through the preceding classifiers. If at any point in the cascade a classifier rejects the sub-window under inspection, no further processing is performed and the search moves on to the next sub-window (see figure at right). The cascade therefore has the form of a degenerate decision tree. In the case of faces, the first classifier in the cascade, called the attentional operator, uses only two features to achieve a false negative rate of approximately 0% and a false positive rate of 40%. [Viola, Jones: Robust Real-time Object Detection, IJCV 2004, http://research.microsoft.com/~viola/Pubs/Detect/violaJones_IJCV.pdf. See page 11.] The effect of this single classifier is to reduce by roughly half the number of times the entire cascade is evaluated.
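The early-rejection behavior described above can be sketched as follows. The stage representation (a list of weighted weak classifiers plus a stage threshold), the function names, and the feature names in the example are all hypothetical and chosen only for illustration.

```python
def stage_passes(stage, features):
    """A stage is a weighted vote of weak classifiers compared with a stage threshold."""
    score = sum(alpha for alpha, weak in stage["weak"] if weak(features))
    return score >= stage["threshold"]


def cascade_classify(stages, features):
    """Return (accepted, number_of_stages_evaluated), stopping at the first rejection."""
    for index, stage in enumerate(stages, start=1):
        if not stage_passes(stage, features):
            return False, index  # rejected: later stages are never evaluated
    return True, len(stages)


# Hypothetical two-stage cascade over a dict of precomputed feature values.
stages = [
    {"threshold": 1.0, "weak": [(1.0, lambda f: f["eyes_vs_cheeks"] > 10)]},
    {"threshold": 1.5, "weak": [(1.0, lambda f: f["nose_bridge"] > 5),
                                 (1.0, lambda f: f["eyes_vs_cheeks"] > 20)]},
]
print(cascade_classify(stages, {"eyes_vs_cheeks": 3, "nose_bridge": 9}))
print(cascade_classify(stages, {"eyes_vs_cheeks": 25, "nose_bridge": 9}))
```

Because most sub-windows in a typical image contain no object, most of them exit after the first, cheapest stage, and only promising sub-windows pay for the more complex stages.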

The cascade architecture has interesting implications for the performance of the individual classifiers. Because each classifier is evaluated only on sub-windows accepted by its predecessors, the false positive rate F of an entire cascade of K classifiers, where f_i is the false positive rate of the i-th classifier, is:

F = \prod_{i=1}^{K} f_i

Similarly, with d_i denoting the detection rate of the i-th classifier, the detection rate D of the cascade is:

D = \prod_{i=1}^{K} d_i

Thus, to match the false positive rates typically achieved by other detectors, each classifier can get away with surprisingly poor performance. For example, for a 32-stage cascade to achieve a false positive rate of 10^{-6}, each classifier need only achieve a false positive rate of about 65%, assuming all stages perform equally. At the same time, however, each classifier needs to be exceptionally capable if the cascade is to achieve an adequate detection rate. For example, to achieve a detection rate of about 90%, each classifier in the aforementioned cascade needs to achieve a detection rate of approximately 99.7%.
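The figures above can be checked with a few lines of arithmetic. The sketch assumes every stage has the same rate, so that the product over K stages reduces to a power; the helper name `per_stage_rate` is introduced here only for illustration.

```python
def per_stage_rate(overall_rate, num_stages):
    """Per-stage rate r such that r ** num_stages equals overall_rate (equal stages assumed)."""
    return overall_rate ** (1.0 / num_stages)


K = 32
f = per_stage_rate(1e-6, K)   # per-stage false positive rate, roughly 0.65
D = 0.997 ** K                # overall detection rate, roughly 0.91

print(f"per-stage false positive rate: {f:.3f}")
print(f"overall detection rate: {D:.3f}")
```

The asymmetry follows from the product form: a modest per-stage false positive rate compounds into a tiny overall rate, while even a small per-stage loss in detection rate compounds into a noticeable overall loss.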

External links

* [http://www.cmucam.org/wiki/viola-jones Demo Implementation]
* [http://www.slideshare.net/wolf/avihu-efrats-viola-and-jones-face-detection-slides/ Slides Presenting the Framework]
* [http://mathworld.wolfram.com/HaarFunction.html Information Regarding Haar Basis Functions]


Wikimedia Foundation. 2010.
