Figure: Example face detection results from this project. The neutral poses are intended to be easy for a face detector, but there are still three false negatives.
Project 4: Face detection with a sliding window
CS 143: Introduction to Computer Vision
Brief
- Stencil code: /course/cs143/asgn/proj4/stencil/
- Data: /course/cs143/data/proj4/
- Handin: cs143_handin proj4
- Required files: README, code/, html/, html/index.html
- Due date: November 13th, 11:59pm
Overview
The sliding window model is conceptually simple: independently classify all image patches as being object or non-object. Sliding window classification is the dominant paradigm in object detection and for one object category in particular -- faces -- it is one of the most noticeable successes of computer vision. For example, modern cameras and photo organization tools have prominent face detection capabilities.
For this project you will be implementing a sliding window face detector. You will be incorporating some of the concepts from two high-impact papers in object detection. It is recommended that you look over the following papers, as well as the earlier Rowley et al. 1998; these papers are all very influential and relatively easy to read.
- Inspired by Dalal-Triggs 2005, you will implement a linear classifier trained on strong, SIFT-like features (such as the HoG descriptor proposed by Dalal and Triggs). A linear classifier is compact, fast to train, and fast to execute. A linear SVM can also be trained on large amounts of data, including mined hard negatives. However, the simplicity of the linear classifier means it is not expressive enough to reach very high accuracies.
- Inspired by Viola-Jones 2001, you will also utilize a more powerful non-linear classifier. In particular, the stencil code provides a non-linear SVM that is more accurate than the linear SVM but much slower to execute. To help alleviate the complexity of the classifier, you can implement a Viola-Jones-esque cascade architecture for extra credit. The key idea of such a cascade is to build a more complex non-linear classifier than could be represented as a monolithic classifier and to significantly improve speed by fast-rejecting "easy" non-faces. The cascade architecture is also an elegant way to mine hard negatives.
Not surprisingly, the pipelines are complementary: using strong features and a strong classifier together will result in better performance. Common to all three of the referenced papers is the concept of "mining" hard negatives to improve detection accuracy. You will implement a hard negative mining strategy and assess its impact.
Requirements
An object detection system is more complex than the previous projects and thus the stencil code is more complete. Your requirements will focus on the three key elements of object detection systems: (1) representation, (2) strategies for utilizing training data, and (3) classification methods. In particular you are required to:
- (1) Utilize a strong feature such as SIFT or HoG to dramatically improve detection accuracy over the baseline raw image patches.
- (2) Implement a strategy for iteratively re-training a classifier with mined hard negatives and compare this to a baseline with random negatives.
- (3) Utilize and compare linear and non-linear classifiers. Linear classifiers can be trained with large amounts of data, but they may not be expressive enough to actually benefit from that training data. Non-linear classifiers can represent complex decision boundaries, but are limited in the amount of training data they can use at once.
Details and Starter Code
The following is an outline of the stencil code:
- 1. load_crops.m. Load cropped positive training examples (faces).
- 2. get_random_negatives.m and get_hard_negatives.m (you code this). Sample negative examples from non-face scenes. If an initial classifier is available, sample only examples where the initial classifier is wrong (false positives).
- 3. crops2features.m (you code this). Convert crops to some better image feature. Normalized patches or patch gradients will work better than raw image patches, but the most dramatic increase in performance will come from features such as SIFT or HoG.
- 4. primal_svm.m. Train a linear or non-linear classifier from the positive and negative examples.
- 5. If building a cascade, adjust the learned classifier to reach the desired training false positive rate.
- 6. (you code this) Go to step 2 unless a stopping criterion is met. Stopping criteria might be based on the amount of negative training data, the number of iterations, the depth of the cascade, or the false positive rate. (A rough sketch of this training loop appears after the list.)
- 7. run_detector.m. Run the final classifier on the test set.
- 8. non_max_supr.m. Perform non-maximum suppression to remove redundant detections.
- 9. evaluate_all_detections.m. Compute the precision-recall curve and average precision.
- 10. visualize_detections.m. Visualize detections.
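To make the control flow concrete, here is a rough sketch of the training loop (steps 1 through 6). The stencil's actual function signatures may differ; train_classifier is a hypothetical wrapper around primal_svm.m, and features are assumed to be stored one per column.

    % Rough sketch of the training loop (steps 1-6); function signatures
    % are assumptions, and train_classifier is a hypothetical wrapper
    % around primal_svm.m.
    pos_crops = load_crops();                    % step 1
    pos_feats = crops2features(pos_crops);       % step 3, positives

    neg_crops = get_random_negatives();          % step 2, first pass
    neg_feats = crops2features(neg_crops);

    max_iters = 3;                               % illustrative stopping criterion
    for iter = 1:max_iters
        classifier = train_classifier(pos_feats, neg_feats);  % step 4
        hard_crops = get_hard_negatives(classifier);          % step 2, later passes
        if isempty(hard_crops)                                % step 6
            break;                               % no more hard negatives to mine
        end
        neg_feats = [neg_feats, crops2features(hard_crops)];  % one feature per column
    end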
Step 2 is where hard negatives are mined, except on the first pass, where there is no initial classifier and thus random negatives are returned. Mining hard negatives is conceptually simple -- run the classifier on scenes that contain no faces; every detection is a false positive. You will often find more false positives than your classifier can use. In that case, randomly subsample the hard negatives, as in the sketch below.
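A minimal sketch, assuming a hypothetical helper detect_in_nonface_scene that runs the current classifier over one face-free image and returns one feature column per above-threshold window (all false positives by construction):

    % Mine hard negatives from face-free scenes, then randomly subsample
    % if more were found than the classifier can use.
    hard_feats = [];
    for i = 1:length(scene_paths)
        im = imread(scene_paths{i});
        hard_feats = [hard_feats, detect_in_nonface_scene(im, classifier)];
    end

    max_negs  = 10000;                   % illustrative cap
    num_found = size(hard_feats, 2);
    if num_found > max_negs
        idx = randperm(num_found);
        hard_feats = hard_feats(:, idx(1:max_negs));
    end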
Step 3 is where you build a feature representation for arbitrary image patches.
For a quick improvement to the stencil code detector, try normalizing the image patches in crops2features.m. As with any feature change, you may then need to adjust the learning parameters.
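For instance, a zero-mean, unit-length normalization might look like the sketch below, assuming crops arrive as an h x w x num_crops array (the stencil's actual layout may differ):

    % Zero-mean, unit-length patch normalization in crops2features.m.
    [h, w, n] = size(crops);
    features  = reshape(double(crops), h * w, n);            % one column per patch
    features  = bsxfun(@minus, features, mean(features, 1)); % zero mean
    norms     = sqrt(sum(features .^ 2, 1));
    norms(norms == 0) = 1;                                   % avoid divide-by-zero
    features  = bsxfun(@rdivide, features, norms);           % unit length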
To get full credit and high accuracy you must use a feature such as SIFT or HoG. You can implement such features yourself for extra credit, but you can also use off-the-shelf implementations such as those in vl_feat.
You do not necessarily need to use crops as an intermediate representation before building your strong features. For example, you can use vl_dsift to get all SIFT descriptors for your detector, as long as you make sure the SIFT parameters match those used to build features for the cropped positive examples.
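As a concrete starting point, per-crop dense SIFT with VLFeat might look like the following sketch. The crop layout is an assumption, the bin size of 10 comes from the advice at the end of this handout, and the oversized step is only there to force vl_dsift to return a single descriptor per crop:

    % Convert each 36x36 crop to one dense-SIFT descriptor with vl_dsift.
    bin_size  = 10;                        % see Final Advice
    num_crops = size(crops, 3);            % assumes 36 x 36 x num_crops
    features  = zeros(128, num_crops, 'single');
    for i = 1:num_crops
        im = single(crops(:, :, i));       % vl_dsift expects single precision
        [~, d] = vl_dsift(im, 'size', bin_size, 'step', 36);
        features(:, i) = single(d(:, 1));  % exactly one descriptor fits
    end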
For step 4, code is provided to train linear and non-linear classifiers. A linear SVM can be trained with huge amounts of positive and negative examples (hundreds of thousands), but the classifier it learns is simple -- effectively a hyperplane in high-dimensional space. A non-linear SVM can be trained on only a few thousand positive and negative examples (memory consumption is quadratic in the number of training examples because of the need to construct a kernel matrix). However, the decision boundary learned by the non-linear SVM can be arbitrarily complex, and the classifier will perform much better than a linear classifier if tuned correctly.
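To illustrate the comparison outside the stencil (the provided primal_svm.m has its own interface), MATLAB's fitcsvm from the Statistics and Machine Learning Toolbox could be used as follows; X is N x D with one feature vector per row, and y is N x 1 with +1 for faces and -1 for non-faces:

    % Illustrative linear vs. RBF-kernel SVM comparison with fitcsvm.
    % This is NOT the interface of the provided primal_svm.m.
    linear_model = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 1);
    rbf_model    = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                           'KernelScale', 10, 'BoxConstraint', 1);

    % predict returns per-class scores; column 2 corresponds to the +1
    % (face) class, so higher means more face-like.
    [~, s_lin] = predict(linear_model, X_test);
    [~, s_rbf] = predict(rbf_model, X_test);
    conf_lin = s_lin(:, 2);
    conf_rbf = s_rbf(:, 2);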
Step 6 will depend on your particular strategy for mining hard negatives or building a classifier cascade. You are free to experiment with any stopping criteria. For instance, Dalal-Triggs only mines hard negatives once. Viola-Jones iterates many more times, adding cascade stages until no more hard negatives can be found.
For step 7, the starter code provides a multi-scale detector. The detector breaks an image into patches and runs the trained classifier on each patch. There are parameters for step size (or stride) and the ratio between scales. You can modify this function to mine hard negatives.
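Conceptually, the scale and stride loop looks like the sketch below; im is assumed grayscale, score_patch is a hypothetical helper that featurizes one patch and applies the trained classifier, and the real run_detector.m is vectorized and differs in detail:

    % Conceptual multi-scale sliding-window loop.
    patch_sz    = 36;
    stride      = 4;                     % step size in pixels
    scale_ratio = 1.25;                  % ratio between successive scales
    threshold   = 0;                     % illustrative detection threshold

    [h, w] = size(im);
    bboxes = []; confs = [];
    scale  = 1;
    while round(min(h, w) * scale) >= patch_sz
        im_s = imresize(im, scale);
        for r = 1:stride:(size(im_s, 1) - patch_sz + 1)
            for c = 1:stride:(size(im_s, 2) - patch_sz + 1)
                patch = im_s(r:r + patch_sz - 1, c:c + patch_sz - 1);
                conf  = score_patch(patch, classifier);
                if conf > threshold      % map box back to original coordinates
                    bboxes = [bboxes; [c, r, c + patch_sz - 1, r + patch_sz - 1] / scale];
                    confs  = [confs; conf];
                end
            end
        end
        scale = scale / scale_ratio;     % shrink image to find larger faces
    end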
Steps 8, 9, and 10 are provided for you and you are unlikely to need to modify them.
The stencil code also contains a script, detect_class_photos.m, to run a classifier on the class photos.
Data
The choice of training data is critical for this task. While an object detection system would ideally be trained and tested on a single database (as in the Pascal VOC challenge), face detection papers have traditionally trained on heterogeneous, even proprietary, datasets. As with most of the literature, we will use three databases: (1) positive training crops, (2) non-face scenes to mine for negative training data, and (3) test scenes with ground truth face locations.
You are provided with a positive training database of 6,713 cropped 36x36 faces from the Caltech Web Faces project. We arrived at this subset by filtering away faces which were not high enough resolution, upright, or front facing. There are many additional databases available. For example, see Figure 3 in Huang et al. and the LFW database described in that paper. You are free to experiment with additional or alternative training data for extra credit.
Non-face scenes are easy to collect. We provide a small database of such scenes from Wu et al. and the SUN scene database. You can add more non-face training scenes, although you are unlikely to need more negative training data unless you are building a cascade architecture.
The most common benchmark for face detection is the CMU+MIT test set. This test set contains 130 images with 511 faces. The test set is challenging because the images are highly compressed and quantized. Some of the faces are illustrated faces, not human faces. For this project, we have converted the test set's ground truth landmark points into Pascal VOC style bounding boxes. We have inflated these bounding boxes to cover most of the head, as the provided training data does. For this reason, you are arguably training a "head detector", not a "face detector", for this project.
Copies of these data sets are available in /course/cs143/data/proj4/. You may want to make a local copy of these to speed up training and testing, but please do not include them in your handin.
Write up
For this project, and all other projects, you must do a project report in HTML. In the report you will describe your algorithm and any decisions you made to implement it a particular way. Then you will show and discuss the results of your algorithm. Discuss any extra credit you did, and clearly show what effect it had on the results (e.g. performance with and without each extra credit component).
It would be interesting to see how your detector performs on additional images beyond those in the test set such as the class photo, the CS department faculty page, etc.
For this project you should include the precision-recall curve of your final classifier and any interesting variants of your algorithm.
Extra Credit
For all extra credit, be sure to analyze on your web page whether your extra credit has improved classification accuracy. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit.
Some ideas:
- up to 10 pts: Implement a HoG descriptor yourself. Examine performance as you vary parameters such as the number of bins.
- up to 10 pts: Implement a cascade architecture as in Viola-Jones. Show the effect that this has on accuracy and run speed. Describe your cascade building process in detail in your handout. The stencil code already has support for a cascade (for both linear and non-linear classifiers) if you build it in the specified format.
- up to 10 pts: With a cascade architecture, use an asymmetric classifier such as that described by Wu et al. 2008. In a cascade architecture, false negatives are far worse than false positives, and typical classifiers are not aware of this asymmetry. In Viola-Jones and later works, the false negative rate is reduced and the false positive rate increased after-the-fact simply by adjusting classifier thresholds, but making the classifier training process aware of the asymmetry will improve accuracy.
- up to 10 pts: Detect additional object categories. You'll need to get your own training and testing data. One suggestion is to train and run your detector on the Pascal VOC data sets, possibly with the help of their support code. The bounding boxes returned by the stencil code are already in VOC format.
- up to 5 pts: Interesting features and combinations of features. Be creative!
- up to 5 pts: Find and utilize alternative positive training data. You can either augment or replace the provided training data.
- up to 5 pts: Use additional classification schemes (e.g. full decision trees, neural nets, or nearest neighbor methods).
- up to 10 pts: Add contextual reasoning to your classifier. For example, one might learn likely locations of faces given scene statistics, in the spirit of Contextual priming for object detection, Torralba. You could try to use typical arrangements of groups of faces, as in Understanding Images of Groups of People and Finding Rows of People in Group Images by Gallagher and Chen.
- up to 10 pts: Use deformable models instead of fixed templates as in the work of Felzenszwalb et al.
Finally, there will be extra credit and recognition for the students who achieve the highest average precision. You aren't allowed to modify evaluate_all_detections.m, which measures your accuracy.
Graduate Credit
To get graduate credit on this project you must do 10 points worth of extra credit. Those 10 points will not be added to your grade, but additional extra credit will be.
Web-Publishing Results
All the results for each project will be put on the course website so that the students can see each other's results. In class we will highlight the best projects as determined by the professor and TAs. If you do not want your results published to the web, you can choose to opt out. If you want to opt out, email cs143tas[at]cs.brown.edu saying so.
Handing in
This is very important as you will lose points if you do not follow instructions. Every time after the first that you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:
- README - text file containing anything about the project that you want to tell the TAs
- code/ - directory containing all your code for this assignment
- html/ - directory containing all your html report for this assignment, including images (any images not under this directory won't be published)
- html/index.html - home page for your results
Then run: cs143_handin proj4
If it is not in your path, you can run it directly: /course/cs143/bin/cs143_handin proj4
Rubric
- +40 pts: Represent image patches with strong features such as SIFT or HoG.
- +15 pts: Train linear and non-linear classifiers and compare their accuracy.
- +25 pts: Iteratively retrain each type of classifier after mining hard negatives and compare this to training only with random negatives.
- +20 pts: Writeup with design decisions and evaluation.
- +15 pts: Extra credit (up to fifteen points)
- -5*n pts: Lose 5 points for every time (after the first) you do not follow the instructions for the hand in format
Final Advice
- You can use vl_dsift to convert a single image crop to a single SIFT feature in crops2features.m. You will need to do this to convert the positive training data to SIFT features. But to speed up detection and the mining of hard negatives, you can write a wrapper function for vl_dsift which directly converts an entire image to SIFT descriptors (see the sketch after this list). You need to make sure the SIFT parameters are identical. For a 36x36 patch size, a vl_dsift bin size of 10 seems best. At this bin size, get_img_crops.m and vl_dsift will sample the same points if the step size is the same.
- You probably don't want to run non-max suppression while mining hard negatives.
- While the idea of mining for hard negatives is ubiquitous in the object detection literature, it may only modestly increase your performance when compared to a similar number of random negatives.
- The parameters of the learning algorithms are important. The regularization parameter lambda is important for linear and non-linear SVMs. It controls the amount of bias in the model, and thus the degree of underfitting or overfitting to the training data. Experiment to find its best value. For non-linear SVMs, your kernel may need additional parameters (e.g. sigma for an RBF kernel). These parameters can be very sensitive. Again, find them experimentally.
- Your classifiers, especially if they are trained with large amounts of negative data, may "underdetect" because of an overly conservative threshold. You can lower the thresholds on your classifiers to improve your average precision. The precision-recall metric does not penalize a detector for producing false positives, as long as those false positives have lower confidence than true positives. For example, an otherwise accurate detector might only achieve 50% recall on the test set with 1000 detections. If you lower the threshold for a positive detection to achieve 70% recall with 5000 detections, your average precision will increase, even though you are returning mostly false positives.
- The stencil code detector run_detector.m is configured with a fairly large step size and a fairly large ratio between search scales. In addition, it doesn't search the full-resolution image, to speed up evaluation. You can leave these parameters unchanged while developing your object detector, but for your final evaluation your performance can increase significantly if you lower all of these parameters. This may make the run time dramatically slower, or even cause out-of-memory errors in the extreme (if a single scale has too many crops to hold in memory).
- Likewise, your accuracy is likely to increase as you use more of the training data, but this will slow down your training (and testing, in the case of non-linear classifiers). You can debug your system with smaller amounts of training data (e.g. 1000 positives and negatives).
- If implementing a cascade architecture for a non-linear classifier, one does not actually need to store all of the training points, but just the support vectors -- training points with non-zero weight. However, this is not likely to be a large memory or speed savings because most training points will be support vectors, depending on your kernel parameters.
- The stencil code classifier achieves a paltry 0.05 average precision. You will dramatically improve this by using better features, more and smarter training data, and better classifiers. If you implement the project well, you can train and test a classifier with average precision of 0.85 in about 10 minutes. It is alright if your training and testing is significantly slower, though.
- The Viola-Jones algorithm achieves an average precision of 0.895* on the CMU+MIT test set based on the numbers in Table 3 of the paper (This number may be slightly off because Table 3 doesn't fully specify the precision-recall curve, because the overlap criteria for VJ might not match our overlap criteria, and because the test sets might be slightly different -- VJ says the test set contains 507 faces, whereas we count 511 faces). You can beat this number, although you may need to run the detector at very small step sizes and scales.
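As a concrete version of the vl_dsift wrapper suggested in the first piece of advice above, something like the following (saved as its own file, e.g. a hypothetical image2features.m) could featurize an entire image in one call:

    % Whole-image dense-SIFT wrapper for vl_dsift. The bin size and step
    % must match those used on the cropped positives in crops2features.m,
    % or train-time and test-time features will not be comparable.
    function [features, frames] = image2features(im, step)
    bin_size = 10;                       % same as crops2features.m
    [frames, descrs] = vl_dsift(single(im), 'size', bin_size, 'step', step);
    features = single(descrs);           % 128 x num_windows
    end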
Credits
Project description and code by James Hays. Figures in this handout are from Dalal and Triggs and from Wu et al. Thanks to Jianxin Wu and Jim Rehg for suggestions.
Figure: Students effectively demonstrate how not to be seen by a robot.