We had a look at the C++ code that helped us get 'Face Detection' working. Now that you are curious about how the Viola-Jones algorithm works, let's dive into it!!
Introduction:
In 2001, Viola and Jones proposed the first real-time object detection framework. This framework, able to operate in real time on 2001 hardware, was partly devoted to human face detection. And we are all well aware of its results - face detection is now a default feature on almost every digital camera and cell phone in the market. Even if these devices do not use their method directly, the now ubiquitous availability of face-detecting devices has certainly been influenced by their work.
The Algorithm:
The Viola-Jones algorithm can be thought of as being composed of four stages:
1) Haar-like features
2) Integral Image
3) AdaBoost Algorithm
4) Cascade of classifiers
Now, let's get into the details of each one. I hope you are feeling the same excitement as I currently do!
Haar-like features:
Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity to Haar wavelets and were used in the first real-time face detector. A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums (typically between a darker region and a lighter region). This difference is then used to categorise subsections of an image. Say we have an image database with human faces. We know that the region of the eyes is darker than the region of the cheeks. Therefore a common Haar feature for face detection is a set of two adjacent rectangles that lie over the eye and the cheek regions. The position of these rectangles is defined relative to a detection window that acts like a bounding box around the target object (the face, in our case).
In the detection phase of the Viola-Jones object detection framework, a window of the target size is moved over the input image, and for each subsection of the image the Haar-like feature is calculated. This difference is then compared to a learned threshold that separates non-objects from objects. Because such a Haar-like feature is only a weak learner or classifier (its detection quality is only slightly better than random guessing), a large number of Haar-like features are necessary to describe an object with sufficient accuracy. For face detection there are roughly 160,000 such features that can be computed in a detection window. We take thousands of sample images that contain the object to be detected (faces, in our case) and thousands of images that contain everything but the object of interest; this is called the training stage. The features relevant to faces are then extracted from the face images and used to detect faces in new images that were not used for training.
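To make this concrete, here is a minimal C++ sketch of one such two-rectangle feature acting as a weak classifier. The GrayImage type, the rectangle positions and the threshold are all illustrative assumptions, not part of any real implementation; in practice the threshold (and the feature itself) is learned during the training stage.

```cpp
#include <cstdint>
#include <vector>

// A grayscale image stored row-major as 8-bit intensities (illustrative type).
struct GrayImage {
    int width, height;
    std::vector<uint8_t> pixels;                  // size = width * height
    int at(int x, int y) const { return pixels[y * width + x]; }
};

// Sum of pixel intensities inside the rectangle [x, x+w) x [y, y+h) -- brute force.
long rectSum(const GrayImage& img, int x, int y, int w, int h) {
    long sum = 0;
    for (int j = y; j < y + h; ++j)
        for (int i = x; i < x + w; ++i)
            sum += img.at(i, j);
    return sum;
}

// A two-rectangle "eyes vs. cheeks" style feature inside a detection window:
// value = (sum over darker rectangle) - (sum over lighter rectangle).
// Returns true ("face-like") when the value exceeds a learned threshold.
bool twoRectFeature(const GrayImage& window) {
    long dark  = rectSum(window, 4,  6, 16, 4);   // band over the eyes (made-up position)
    long light = rectSum(window, 4, 10, 16, 4);   // band over the cheeks (made-up position)
    const long threshold = 2000;                  // hypothetical learned threshold
    return (dark - light) > threshold;            // weak decision: only slightly better than chance
}
```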
Integral Image:
The integral image is a quick and effective way of calculating the sum of values (pixel values) over a rectangular subset of a grid, which in image processing is the given image. It can also be used for calculating the average intensity within a given region. The value at any point is simply the sum of the pixels in the rectangle from the upper-left corner of the image to that point, and it can be calculated for every point in the original image in a single pass over the image. The word 'integral' carries the same meaning as in 'integrating' - finding the area under a curve by adding together small rectangular areas. If one wants to use the integral image, it is normally wise to convert the image to greyscale first.
Using this concept of the integral image, the sum over the shaded region can be calculated as
Sum = Value(C) - Value(B) - Value(D) + Value(A)
Don't worry if you haven't understood it yet - the concept will be made clear in the next few lines!
Here, the first matrix represents the original image and the second represents the integral image. The value of every pixel in the integral image is calculated as the sum of all the pixels that lie to its left and above it. For the highlighted pixel, we can see that there are six such pixels lying above it and to its left; adding them up gives the resulting value of 20. Computing this for every pixel gives us the resulting matrix. Now, wasn't that simple to grasp and get the hang of!
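As a rough sketch (reusing the illustrative GrayImage type from the earlier snippet), the integral image can be built in a single pass and any rectangle sum then recovered from four lookups using the Sum = Value(C) - Value(B) - Value(D) + Value(A) rule. The layout with a one-pixel zero border is just one convenient convention.

```cpp
// Integral image with a one-pixel zero border, so at(x, y) holds the sum of all
// original pixels strictly above and to the left of (x, y).
struct IntegralImage {
    int width, height;                           // original image dimensions
    std::vector<long> table;                     // (width + 1) * (height + 1) entries

    explicit IntegralImage(const GrayImage& img)
        : width(img.width), height(img.height),
          table((img.width + 1) * (img.height + 1), 0) {
        // Single pass: ii(x, y) = pixel + left + above - above-left.
        for (int y = 1; y <= height; ++y)
            for (int x = 1; x <= width; ++x)
                at(x, y) = img.at(x - 1, y - 1)
                         + at(x - 1, y) + at(x, y - 1) - at(x - 1, y - 1);
    }

    long& at(int x, int y)       { return table[y * (width + 1) + x]; }
    long  at(int x, int y) const { return table[y * (width + 1) + x]; }

    // Sum over the rectangle [x, x+w) x [y, y+h): four lookups, however large it is.
    long rectSum(int x, int y, int w, int h) const {
        long A = at(x,     y);                   // top-left corner value
        long B = at(x + w, y);                   // top-right
        long D = at(x,     y + h);               // bottom-left
        long C = at(x + w, y + h);               // bottom-right
        return C - B - D + A;                    // Sum = C - B - D + A
    }
};
```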
But how does that matter and why do you need to study this simple concept here?
Though the concept is quite simple to understand, the complex problems it is capable of solving are what make it really special. Imagine you were asked to find the area of the shaded region (the sum of its pixels, in our case) in the original image on the right. You would need to access six memory locations to compute this. Computing the same area using the integral image requires accessing only four memory locations!
Phewww... Wouldn't it be easier and more cost-effective to calculate the area directly rather than first computing the integral image and then calculating the shaded area?
No, it wouldn't be! Imagine calculating the area of a 40-pixel region: you would have to access all 40 pixels and then sum them up. Now imagine calculating that with the integral image - all you need is access to the four corner values and you are done! Wooaaohhh.. that's a lot of saving there...
Okay, we save a lot computationally, but why do we need this in our algorithm?
Calculating Haar features in an image is all about calculating the difference between the sum of all pixels in the darker region and the sum in the lighter region. This is the step where we exploit the concept of the integral image.
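Putting the two sketches together, the brute-force rectSum loop from the Haar-feature snippet can be swapped for the four-lookup version, so a feature of any size costs the same handful of operations. The rectangle positions and threshold below are again made-up illustration values.

```cpp
// Two-rectangle feature evaluated on the integral image: constant time per feature.
bool twoRectFeatureFast(const IntegralImage& ii) {
    long dark  = ii.rectSum(4,  6, 16, 4);   // darker band (e.g. eyes), 4 lookups
    long light = ii.rectSum(4, 10, 16, 4);   // lighter band (e.g. cheeks), 4 lookups
    return (dark - light) > 2000;            // hypothetical learned threshold
}
```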
AdaBoost Algorithm:
As we saw earlier, there are roughly 160,000 features that can be computed for a detection window. Calculating all of these features for an image, which can be thought of as being made up of many small sub-windows, clearly makes the computation very complex and time-consuming. This is where the machine learning algorithm called AdaBoost comes into the picture. It helps us select, out of this huge number of features, only those that are really necessary for face detection.
Here's how it works! Initially, every feature is given a small weight. The training images - those with faces and those without - are then used to test these features. Depending on how well a particular feature helps an image containing a face to be classified as a face, its associated weight is updated: if it turns out to be a relevant feature, its weight is increased; otherwise its weight is decreased. This procedure is repeated a number of times, and in the end we are left with a set of features carrying large weights. These features on their own are not capable of recognizing a face; it is the weighted linear combination of them that can. Thus, AdaBoost helps us select only those features that are relevant. Since the individual features cannot identify faces by themselves, they are called weak classifiers - and the weighted linear combination of these weak classifiers gives us a strong classifier!
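A heavily simplified sketch of the resulting classifier is shown below: each weak classifier is a single thresholded Haar-like feature, and the strong classifier is a weighted vote over the selected features. The struct layout and the half-of-total-weight acceptance rule are one common way to write this down; the thresholds, polarities and alpha weights would come out of the AdaBoost training just described, so nothing here is a real trained model.

```cpp
#include <functional>
#include <vector>

// A weak classifier: one thresholded Haar-like feature plus the weight (alpha)
// that AdaBoost assigned to it during training.
struct WeakClassifier {
    std::function<long(const IntegralImage&)> feature;  // returns dark-minus-light value
    long   threshold;                                    // learned per-feature threshold
    int    polarity;                                     // +1 or -1, direction of the inequality
    double alpha;                                        // importance learned by AdaBoost

    // 1 when the feature value falls on the "face" side of the threshold, 0 otherwise.
    int classify(const IntegralImage& ii) const {
        long value = feature(ii);
        return (polarity * value < polarity * threshold) ? 1 : 0;
    }
};

// Strong classifier: weighted vote of the selected weak classifiers.
// Accepts the window when the weighted sum of votes reaches half the total weight.
bool strongClassify(const std::vector<WeakClassifier>& weak, const IntegralImage& ii) {
    double sum = 0.0, alphaSum = 0.0;
    for (const WeakClassifier& h : weak) {
        sum      += h.alpha * h.classify(ii);
        alphaSum += h.alpha;
    }
    return sum >= 0.5 * alphaSum;
}
```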
Cascade of Classifiers:
This stage constructs a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates. In other words, simple classifiers are applied to the sub-windows first. Sub-windows that do not fulfil their conditions are rejected straight away without being passed to the next stage. Only those sub-windows that satisfy the conditions of the simple features are passed on to the next, more complex stage.
The image is self-explanatory. Various small parts of the image are selected and tested for facial features. If they do not fulfil the conditions, they are rejected straight away. If they do, they are subjected to more complex tests, and so on. If a particular region is found to fulfil all the conditions, it is marked as a face.
In short, this step looks for regions which surely do not belong to faces and rejects them. The non-rejected regions are passed on to the next stage of more complex classifiers. This process continues until the entire image has been evaluated, which reduces the computation time by a huge factor! A region that satisfies the conditions of all the classifiers is detected as a face.
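In code, the behaviour of the cascade for a single sub-window can be sketched as an early-exit loop over stages, reusing the WeakClassifier and strongClassify sketches from the previous section. The stage contents are placeholders; a real cascade would be trained so that early stages are tiny and later stages are larger and stricter.

```cpp
// One stage of the cascade: a boosted strong classifier with its own set of
// weak classifiers. Early stages are small and cheap, later stages larger and stricter.
struct CascadeStage {
    std::vector<WeakClassifier> weak;
};

// A sub-window is reported as a face only if it survives every stage.
// Most sub-windows contain no face and are thrown out by the first cheap stages,
// which is where the large computational saving comes from.
bool detectFace(const std::vector<CascadeStage>& cascade, const IntegralImage& window) {
    for (const CascadeStage& stage : cascade) {
        if (!strongClassify(stage.weak, window))
            return false;                       // rejected early: stop evaluating this window
    }
    return true;                                // passed all stages: mark the window as a face
}
```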
This is how the Viola-Jones algorithm works. If you still do not follow a particular step or anything else, do leave a comment. You can find the implemented version of face detection here. Feedback is most welcome... Cheers!!