CSCI1430: Project 3: Scene recognition with bag of words
The goal of this assignment is to apply the bag-of-words model to an image classification task on a data set of 15 hand-labeled scene categories, with 200 images in each category. The whole system can be decomposed into several independent stages: visual vocabulary creation, feature encoding, model training, inference, and accuracy evaluation.
Algorithm
- Create Visual Vocabulary
We built the visual vocabulary from a sample of dense SIFT descriptors computed over the 1500 training images. For each image, we first generate a random permutation of all its SIFT descriptors and sample 15% of them into a descriptor pool. We then run a k-means clustering algorithm over the pooled descriptors; the resulting cluster centers are taken as our visual vocabulary.
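Our actual pipeline runs in MATLAB with VLFeat; purely as an illustration of the step above, the following is a minimal NumPy sketch of vocabulary building with Lloyd's k-means. The function name `build_vocabulary` and the toy descriptor pool are hypothetical, not the project code.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster pooled SIFT descriptors with k-means (Lloyd's algorithm);
    the k cluster centers form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct sampled descriptors.
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned descriptors.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Toy stand-in for the 15%-sampled pool of 128-D SIFT descriptors.
rng = np.random.default_rng(0)
pool = rng.random((500, 128))
vocab = build_vocabulary(pool, k=10)
assert vocab.shape == (10, 128)
```

In the real system the pool is far larger and k is the vocabulary size (e.g. 200), but the structure of the computation is the same.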
- Compute features
In this step we represent each training and test image as a histogram over the visual vocabulary. For each image, we first compute dense SIFT descriptors, then bin each descriptor into its nearest visual word from the vocabulary built in the previous step. We use the spatial pyramid encoding described in Lazebnik et al. 2006, and normalize the final feature vector so that its entries sum to 1.
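Setting the pyramid aside for the moment (it is detailed under Extra Credit), the basic per-image encoding can be sketched in NumPy as below. `encode_image` is an illustrative name, not a function from our code.

```python
import numpy as np

def encode_image(descriptors, vocab):
    """Assign each dense SIFT descriptor to its nearest visual word and
    build an L1-normalized histogram over the vocabulary."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# Toy vocabulary and descriptors standing in for real SIFT data.
rng = np.random.default_rng(0)
vocab = rng.random((10, 128))
descs = rng.random((300, 128))
h = encode_image(descs, vocab)
assert h.shape == (10,)
assert np.isclose(h.sum(), 1.0)
```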
- Training
We trained a kernel SVM on our data, using the histogram intersection function as the kernel, the same as eq. 1 in Lazebnik et al. 2006. We experimented with both Newton's method and gradient descent to estimate the model parameters; both gave similar results.
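The histogram intersection kernel itself is simple to compute; the sketch below builds the training Gram matrix in NumPy (the `intersection_kernel` name and toy histograms are illustrative). Such a precomputed Gram matrix can then be handed to any kernel-SVM solver, e.g. scikit-learn's `SVC(kernel='precomputed')`.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Toy L1-normalized histograms standing in for bag-of-words features.
rng = np.random.default_rng(0)
H = rng.random((5, 8))
H /= H.sum(axis=1, keepdims=True)

K = intersection_kernel(H, H)
# For L1-normalized histograms, K(x, x) = sum_k x_k = 1.
assert np.allclose(np.diag(K), 1.0)
assert np.allclose(K, K.T)  # the Gram matrix is symmetric
```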
- Inference
We use the MAP estimator to assign each test image the label with the highest probability. Because we used a kernel SVM in the training stage, we also need to compute a kernel matrix for the test images: for each test image, we use the same kernel function (histogram intersection) to evaluate its similarity to all training images.
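The test-time kernel is a rectangular matrix of test-vs-train similarities. The sketch below computes it and applies a generic SVM dual decision function; the dual coefficients `coef` and bias `b` are hypothetical placeholders, since in practice they come from the trained model.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

rng = np.random.default_rng(1)
train = rng.random((6, 8)); train /= train.sum(axis=1, keepdims=True)
test = rng.random((3, 8));  test /= test.sum(axis=1, keepdims=True)

# Similarity of every test image to every training image.
K_test = intersection_kernel(test, train)   # shape (n_test, n_train)

# A trained kernel SVM scores with f(x) = sum_i coef_i * K(x_i, x) + b;
# these coefficients are made up here purely for illustration.
coef = rng.standard_normal(6)
b = 0.1
scores = K_test @ coef + b

assert K_test.shape == (3, 6)
# Intersections of L1-normalized histograms always lie in (0, 1].
assert (K_test > 0).all() and (K_test <= 1).all()
```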
- Accuracy Evaluation
We used cross validation to evaluate model performance. For each of the 15 classes, we first pool all the training and test images together, then randomly pick half as training data and the other half as test data. We report summary statistics of the resulting accuracies, including mean, median, mode, and standard deviation.
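The evaluation loop above can be sketched as follows. To keep the sketch runnable, a trivial nearest-centroid classifier stands in for our kernel SVM; the function names and toy data are illustrative only.

```python
import numpy as np

def nearest_centroid_eval(Xtr, ytr, Xte, yte):
    # Stand-in classifier (nearest class centroid), replacing the
    # kernel SVM so the evaluation loop below can run on its own.
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xte[:, None, :] - cents[None, :, :], axis=2)
    return (classes[d.argmin(axis=1)] == yte).mean()

def random_split_stats(X, y, runs=5, seed=0):
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        perm = rng.permutation(len(y))        # mix all images together
        half = len(y) // 2
        tr, te = perm[:half], perm[half:]     # random half/half split
        accs.append(nearest_centroid_eval(X[tr], y[tr], X[te], y[te]))
    accs = np.array(accs)
    return accs.mean(), np.median(accs), accs.std(ddof=1)  # ddof=1 -> N-1

# Two well-separated toy classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
mean_acc, median_acc, sd_acc = random_split_stats(X, y)
assert mean_acc > 0.9
```

Note `ddof=1` in the standard deviation, matching the N-1 normalization used in our results table.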
Extra Credit
- Spatial Pyramid
We built a three-level spatial pyramid representation of our images (L0, L1, and L2), using the histogram intersection function as the kernel and the same weighting strategy as described in Lazebnik et al. 2006. Each image is recursively divided into grids (1, 4, and 16 cells in our experiment), and a local histogram is computed for each cell to capture spatial information. All histograms are concatenated, weighted, and normalized to sum to 1.
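As an illustration of the pyramid construction, here is a NumPy sketch with the standard Lazebnik weights for a three-level pyramid (1/4, 1/4, 1/2); the function name and toy keypoints are hypothetical.

```python
import numpy as np

def spatial_pyramid(words, xs, ys, width, height, vocab_size):
    """Three-level spatial pyramid: visual-word histograms over 1, 4, and
    16 grid cells, weighted 1/4, 1/4, 1/2, concatenated and normalized."""
    feats = []
    for level, weight in [(0, 0.25), (1, 0.25), (2, 0.5)]:
        cells = 2 ** level
        for i in range(cells):
            for j in range(cells):
                # Visual words whose keypoints fall inside cell (i, j).
                in_cell = ((xs * cells // width == i) &
                           (ys * cells // height == j))
                h = np.bincount(words[in_cell], minlength=vocab_size)
                feats.append(weight * h)
    f = np.concatenate(feats).astype(float)
    return f / f.sum()

# Toy image: 400 keypoints, a 10-word vocabulary, a 64x64 frame.
rng = np.random.default_rng(0)
n, V, W, H = 400, 10, 64, 64
words = rng.integers(0, V, n)
xs, ys = rng.integers(0, W, n), rng.integers(0, H, n)
f = spatial_pyramid(words, xs, ys, W, H, V)
assert f.shape == ((1 + 4 + 16) * V,)   # 21 cells x vocabulary size
assert np.isclose(f.sum(), 1.0)
```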
- Test with different vocabulary size
We tried different vocabulary sizes, which are believed to relate to feature strength: a small vocabulary gives weak features while a large vocabulary gives strong features. We found that when the vocabulary is small, the standard deviation of the accuracies grows. A possible explanation is that a small vocabulary is too limited to represent the visual complexity of our data set, so performance varies considerably depending on which vocabulary happens to be chosen.
- Cross Validation
In traditional cross validation, the data is split into several folds, and train/test partitions are formed over these folds in a round-robin fashion. Our experiment keeps the same idea but uses a different strategy: each time, we randomly permute the whole data set and select half as training data and the other half as test data.
- Test with different features
We tried both vl_dsift() and vl_phow() to compute our features. For vl_phow() we used the scale sizes [4 8] as described in the SUN database paper; however, this did not bring any performance improvement. Surprisingly, we found that the vl_dsift() representation gives slightly more accurate estimations.
Results
Comparison of accuracies under different system setups

| Setup | Avg. accuracy |
| --- | --- |
| Baseline (vocab = 200 + linear SVM) | .622 |
| vocab = 200 + kernel SVM (histogram intersection kernel) | .713 |
| vocab = 200 + kernel SVM + phow descriptor | .722 |
| vocab = 200 + kernel SVM + dsift descriptor | .800 |
| vocab = 200 + kernel SVM + phow descriptor | .790 |
Confusion matrix of our best run (accuracy = .818)
| Class | Correct rate |
| --- | --- |
| Suburb | 1.00 |
| Forest | .88 |
| InsideCity | .93 |
| OpenCountry | .83 |
| TallBuilding | .81 |
| Bedroom | .83 |
| Kitchen | .71 |
| Store | .93 |
| Coast | .92 |
| Highway | .99 |
| Mountain | .72 |
| Street | .63 |
| Office | .79 |
| Industrial | .60 |
| Livingroom | .69 |

Diagonal values of the confusion matrix above, with the highest (Suburb, 1.00) and lowest (Industrial, .60) correct classification rates.
Statistics of accuracies over different vocabulary sizes, cross validated with 5 random initializations
The plot shows that increasing the vocabulary size does not always improve performance, which is consistent with the experimental results in Lazebnik et al. 2006.
The following table shows summary statistics of our cross validation results, with 5 runs for each vocabulary setting.
| Vocabulary size | 10 | 20 | 50 | 100 | 200 | 400 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | .637 | .724 | .756 | .782 | .783 | .812 | .802 |
| Median | .635 | .727 | .751 | .783 | .781 | .814 | .805 |
| Mode | .650 | .733 | .770 | .792 | .798 | .818 | .809 |
| Std. dev. (normalized by N-1) | .0106 | .0079 | .0125 | .0078 | .0102 | .0057 | .0085 |
Samples of failure cases; notice that some of the confusions are quite reasonable.
Some categories, like 'open country', may be visually diverse, so more training data may be needed for them. Other categories may be semantically or visually similar, such as 'inside city' vs. 'street', 'living room' vs. 'kitchen', and 'open country' vs. 'highway'.
- 'bedroom' classified as 'industrial'
- 'coast' as 'open country'
- 'industrial' as 'living room'
- 'industrial' as 'store'
- 'inside city' as 'street'
- 'living room' as 'kitchen'
- 'open country' as 'coast'
- 'open country' as 'highway'
- 'open country' as 'store'
- 'store' as 'highway'