We had a look at the C++ code that helped us get 'Face Detection' working. Now that you are curious about how the Viola-Jones algorithm works, let's dive into it!!
Introduction:
In 2001, Viola and Jones proposed the first real-time object detection framework. This framework, able to operate in real time on 2001 hardware, was partly devoted to human face detection. And we are all well aware of its results - face detection is now a default feature on almost every digital camera and cell phone in the market. Even if these devices do not use their method directly, the now ubiquitous availability of face-detecting devices has certainly been influenced by their work.
The Algorithm:
The Viola-Jones algorithm can be thought of as being composed of four stages:
1) Haar-like features
2) Integral Image
3) AdaBoost Algorithm
4) Cascade of classifiers
Now, let's get into the details of each one. I hope you are feeling the same excitement as I currently do!
Haar-like features:
Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity to Haar wavelets and were used in the first real-time face detector. A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums (typically between a darker region and a lighter region). This difference is then used to categorise subsections of an image. Say we have an image database with human faces. We know that the region of the eyes is darker than the region of the cheeks. Therefore a common Haar feature for face detection is a set of two adjacent rectangles that lie over the eye and the cheek regions. The position of these rectangles is defined relative to a detection window that acts like a bounding box around the target object (the face, in our case).
In the detection phase of the Viola-Jones object detection framework, a window of the target size is moved over the input image, and for each subsection of the image the Haar-like feature is calculated. This difference is then compared to a learned threshold that separates non-objects from objects. Because such a Haar-like feature is only a weak learner or classifier (its detection quality is only slightly better than random guessing), a large number of Haar-like features are necessary to describe an object with sufficient accuracy. For face detection there are roughly 160,000 such features that can be computed in a detection window. We take thousands of sample images that contain the object to be detected (faces, in our case) and thousands of images that contain everything but the object of interest; this is called the training stage. The features relevant to faces are then extracted from the face images and used to detect faces in new images that were not used for training.
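To make this concrete, here is a minimal C++ sketch of one such two-rectangle feature acting as a weak classifier. The GrayImage type, the rectangle positions and the threshold are all illustrative assumptions, not part of any real implementation; in practice the threshold (and the feature itself) is learned during the training stage.

```cpp
#include <cstdint>
#include <vector>

// A grayscale image stored row-major as 8-bit intensities (illustrative type).
struct GrayImage {
    int width, height;
    std::vector<uint8_t> pixels;                  // size = width * height
    int at(int x, int y) const { return pixels[y * width + x]; }
};

// Sum of pixel intensities inside the rectangle [x, x+w) x [y, y+h) -- brute force.
long rectSum(const GrayImage& img, int x, int y, int w, int h) {
    long sum = 0;
    for (int j = y; j < y + h; ++j)
        for (int i = x; i < x + w; ++i)
            sum += img.at(i, j);
    return sum;
}

// A two-rectangle "eyes vs. cheeks" style feature inside a detection window:
// value = (sum over darker rectangle) - (sum over lighter rectangle).
// Returns true ("face-like") when the value exceeds a learned threshold.
bool twoRectFeature(const GrayImage& window) {
    long dark  = rectSum(window, 4,  6, 16, 4);   // band over the eyes (made-up position)
    long light = rectSum(window, 4, 10, 16, 4);   // band over the cheeks (made-up position)
    const long threshold = 2000;                  // hypothetical learned threshold
    return (dark - light) > threshold;            // weak decision: only slightly better than chance
}
```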
Integral Image:
The integral image is a quick and effective way of calculating the sum of values (pixel values) over a rectangular subset of a grid, which in image processing is the given image. It can also be used for calculating the average intensity within a given region. The value at any point is simply the sum of the pixels in the rectangle from the upper-left corner of the image to that point, and it can be calculated for every point in the original image in a single pass over the image. The word 'integral' carries the same meaning as in 'integrating' - finding the area under a curve by adding together small rectangular areas. If one wants to use the integral image, it is normally wise to convert the image to greyscale first.
Using this concept of the integral image, the sum over the shaded region can be calculated as
Sum = Value(C) - Value(B) - Value(D) + Value(A)
Don't worry if you haven't understood it yet - the concept will be made clear in the next few lines!
Here, the first matrix represents the original image and the second represents the integral image. The value of every pixel in the integral image is calculated as the sum of all the pixels that lie to its left and above it. For the highlighted pixel, we can see that there are six such pixels lying above it and to its left; adding them up gives the resulting value of 20. Computing this for every pixel gives us the resulting matrix. Now, wasn't that simple to grasp and get the hang of!
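As a rough sketch (reusing the illustrative GrayImage type from the earlier snippet), the integral image can be built in a single pass and any rectangle sum then recovered from four lookups using the Sum = Value(C) - Value(B) - Value(D) + Value(A) rule. The layout with a one-pixel zero border is just one convenient convention.

```cpp
// Integral image with a one-pixel zero border, so at(x, y) holds the sum of all
// original pixels strictly above and to the left of (x, y).
struct IntegralImage {
    int width, height;                           // original image dimensions
    std::vector<long> table;                     // (width + 1) * (height + 1) entries

    explicit IntegralImage(const GrayImage& img)
        : width(img.width), height(img.height),
          table((img.width + 1) * (img.height + 1), 0) {
        // Single pass: ii(x, y) = pixel + left + above - above-left.
        for (int y = 1; y <= height; ++y)
            for (int x = 1; x <= width; ++x)
                at(x, y) = img.at(x - 1, y - 1)
                         + at(x - 1, y) + at(x, y - 1) - at(x - 1, y - 1);
    }

    long& at(int x, int y)       { return table[y * (width + 1) + x]; }
    long  at(int x, int y) const { return table[y * (width + 1) + x]; }

    // Sum over the rectangle [x, x+w) x [y, y+h): four lookups, however large it is.
    long rectSum(int x, int y, int w, int h) const {
        long A = at(x,     y);                   // top-left corner value
        long B = at(x + w, y);                   // top-right
        long D = at(x,     y + h);               // bottom-left
        long C = at(x + w, y + h);               // bottom-right
        return C - B - D + A;                    // Sum = C - B - D + A
    }
};
```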
But how does that matter and why do you need to study this simple concept here?
Though the concept is quite simple to understand, the complex problems it is capable of solving are what make it really special. Imagine you were asked to find the area of the shaded region (the sum of its pixels, in our case) in the original image on the right. You would need to access six memory locations to compute this. Computing the same area using the integral image requires accessing only four memory locations!
Phewww... Wouldn't it be easier and more cost-effective to calculate the area directly rather than first computing the integral image and then calculating the shaded area?
No, it wouldn't be! Imagine calculating the area of a 40-pixel region: you would have to access all 40 pixels and then sum them up. Now imagine calculating that with the integral image - all you need is access to the four corner values and you are done! Wooaaohhh.. that's a lot of saving there...
Okay, we save a lot computationally, but why do we need this in our algorithm?
Calculating Haar features in an image is all about calculating the difference between the sum of all pixels in the darker region and the sum in the lighter region. This is the step where we exploit the concept of the integral image.
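Putting the two sketches together, the brute-force rectSum loop from the Haar-feature snippet can be swapped for the four-lookup version, so a feature of any size costs the same handful of operations. The rectangle positions and threshold below are again made-up illustration values.

```cpp
// Two-rectangle feature evaluated on the integral image: constant time per feature.
bool twoRectFeatureFast(const IntegralImage& ii) {
    long dark  = ii.rectSum(4,  6, 16, 4);   // darker band (e.g. eyes), 4 lookups
    long light = ii.rectSum(4, 10, 16, 4);   // lighter band (e.g. cheeks), 4 lookups
    return (dark - light) > 2000;            // hypothetical learned threshold
}
```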
AdaBoost Algorithm:
As we saw earlier, there are roughly 160,000 features that can be computed for a detection window. Calculating all of these features for an image, which can be thought of as being made up of many small sub-windows, clearly makes the computation very complex and time-consuming. This is where the machine learning algorithm called AdaBoost comes into the picture. It helps us select, out of this huge number of features, only those that are really necessary for face detection.
Here's how it works! Initially, every feature is given a small weight. The training images - those with faces and those without - are then used to test these features. Depending on how well a particular feature helps an image containing a face to be classified as a face, its associated weight is updated: if it turns out to be a relevant feature, its weight is increased; otherwise its weight is decreased. This procedure is repeated a number of times, and in the end we are left with a set of features carrying large weights. These features on their own are not capable of recognizing a face; it is the weighted linear combination of them that can. Thus, AdaBoost helps us select only those features that are relevant. Since the individual features cannot identify faces by themselves, they are called weak classifiers - and the weighted linear combination of these weak classifiers gives us a strong classifier!
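A heavily simplified sketch of the resulting classifier is shown below: each weak classifier is a single thresholded Haar-like feature, and the strong classifier is a weighted vote over the selected features. The struct layout and the half-of-total-weight acceptance rule are one common way to write this down; the thresholds, polarities and alpha weights would come out of the AdaBoost training just described, so nothing here is a real trained model.

```cpp
#include <functional>
#include <vector>

// A weak classifier: one thresholded Haar-like feature plus the weight (alpha)
// that AdaBoost assigned to it during training.
struct WeakClassifier {
    std::function<long(const IntegralImage&)> feature;  // returns dark-minus-light value
    long   threshold;                                    // learned per-feature threshold
    int    polarity;                                     // +1 or -1, direction of the inequality
    double alpha;                                        // importance learned by AdaBoost

    // 1 when the feature value falls on the "face" side of the threshold, 0 otherwise.
    int classify(const IntegralImage& ii) const {
        long value = feature(ii);
        return (polarity * value < polarity * threshold) ? 1 : 0;
    }
};

// Strong classifier: weighted vote of the selected weak classifiers.
// Accepts the window when the weighted sum of votes reaches half the total weight.
bool strongClassify(const std::vector<WeakClassifier>& weak, const IntegralImage& ii) {
    double sum = 0.0, alphaSum = 0.0;
    for (const WeakClassifier& h : weak) {
        sum      += h.alpha * h.classify(ii);
        alphaSum += h.alpha;
    }
    return sum >= 0.5 * alphaSum;
}
```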
Cascade of Classifiers:
This stage constructs a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates. In other words, simple classifiers are applied to the sub-windows first. Sub-windows that do not fulfil their conditions are rejected straight away without being passed to the next stage. Only those sub-windows that satisfy the conditions of the simple features are passed on to the next, more complex stage.
The image is self-explanatory. Various small parts of the image are selected and tested for facial features. If they do not fulfil the conditions, they are rejected straight away. If they do, they are subjected to more complex tests, and so on. If a particular region is found to fulfil all the conditions, it is marked as a face.
In short, this step looks for regions which surely do not belong to faces and rejects them. The non-rejected regions are passed on to the next stage of more complex classifiers. This process continues until the entire image has been evaluated, which reduces the computation time by a huge factor! A region that satisfies the conditions of all the classifiers is detected as a face.
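In code, the behaviour of the cascade for a single sub-window can be sketched as an early-exit loop over stages, reusing the WeakClassifier and strongClassify sketches from the previous section. The stage contents are placeholders; a real cascade would be trained so that early stages are tiny and later stages are larger and stricter.

```cpp
// One stage of the cascade: a boosted strong classifier with its own set of
// weak classifiers. Early stages are small and cheap, later stages larger and stricter.
struct CascadeStage {
    std::vector<WeakClassifier> weak;
};

// A sub-window is reported as a face only if it survives every stage.
// Most sub-windows contain no face and are thrown out by the first cheap stages,
// which is where the large computational saving comes from.
bool detectFace(const std::vector<CascadeStage>& cascade, const IntegralImage& window) {
    for (const CascadeStage& stage : cascade) {
        if (!strongClassify(stage.weak, window))
            return false;                       // rejected early: stop evaluating this window
    }
    return true;                                // passed all stages: mark the window as a face
}
```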
This is how the Viola-Jones algorithm works. If you still do not follow a particular step or anything else, do leave a comment. You can find the implemented version of face detection here. Feedback is most welcome... Cheers!!