The goal of this project was to build a 15-category scene recognition pipeline following Lazebnik et al. (2006). A tiny-image pipeline was also developed as a starting point.

A 400-word vocabulary was built from the training images by extracting dense SIFT features (using vl_dsift with the 'fast' option and a step size of 20 or 30) and clustering them with k-means. A bag of SIFTs was then assembled (again using 'fast' vl_dsift, with a step size of 10) and binned into a histogram over the k-means clusters for each test image. A kNN classifier and a linear SVM classifier (trained on 100 test images) were developed to assign test images to categories. This pipeline classified images with about 64% accuracy.

Several extensions of this basic pipeline were explored. Soft binning was attempted, but it decreased accuracy to 50% when 3 nearest neighbors were considered and to 40% when 15 nearest neighbors were considered. This may well be the result of a flawed implementation, but the approach was abandoned. Running vl_dsift without the 'fast' parameter was also tried; it gave no appreciable improvement, and in fact a slight decrease, so it too was abandoned.

Next, a 512-dimensional GIST vector was computed for each image (using LMgist by Aude Oliva and Antonio Torralba) and appended to the 400-dimensional histogram, which substantially increased accuracy. Finally, the bag of SIFTs was binned spatially, giving a feature vector of 400*bins + 512 dimensions in total, which increased accuracy further. These gains are not very elegant, however, and come at the cost of increased computation time and memory use.
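
The feature construction described above (nearest-word histograms, spatial binning, and an appended GIST vector) can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's MATLAB/VLFeat code. The function names, the square `grid x grid` layout of the spatial bins, and the per-cell normalization are all assumptions made for the example, and the descriptors and GIST vector are toy stand-ins for real SIFT and LMgist output.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_word(descriptor, vocab):
    """Index of the vocabulary centroid closest to one descriptor."""
    return min(range(len(vocab)), key=lambda k: dist(descriptor, vocab[k]))


def spatial_bow(descriptors, positions, vocab, image_size, grid):
    """Concatenate one word histogram per cell of a grid x grid layout.

    descriptors: list of feature vectors (stand-ins for SIFT descriptors)
    positions:   list of (x, y) pixel locations, one per descriptor
    image_size:  (width, height) of the image in pixels
    """
    width, height = image_size
    hists = [[0.0] * len(vocab) for _ in range(grid * grid)]
    for desc, (x, y) in zip(descriptors, positions):
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        hists[row * grid + col][nearest_word(desc, vocab)] += 1.0
    feature = []
    for h in hists:
        total = sum(h) or 1.0  # assumption: normalize per cell; empty cells stay zero
        feature.extend(v / total for v in h)
    return feature


def image_feature(descriptors, positions, vocab, image_size, grid, gist):
    """Spatially binned histogram with a precomputed GIST vector appended."""
    return spatial_bow(descriptors, positions, vocab, image_size, grid) + list(gist)


# Toy data: a 2-word vocabulary, 3 descriptors, a 2x2 grid, a 1-value "GIST"
vocab = [[0.0, 0.0], [1.0, 1.0]]
descs = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
pos = [(10, 10), (90, 10), (90, 90)]
feat = image_feature(descs, pos, vocab, (100, 100), grid=2, gist=[0.5])
print(len(feat))  # 2 words * 4 cells + 1 GIST value = 9
```

With a 400-word vocabulary, a 4x4 grid, and a 512-dimensional GIST vector, the same construction yields the 400*bins + 512 layout described in the text.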

| GIST features | Spatial binning | Number of spatial bins | Accuracy |
|---|---|---|---|
| No | No | 1 | 0.64 |
| Yes | No | 1 | 0.726 |
| Yes | Yes | 4 | 0.761 |
| Yes | Yes | 16 | 0.777 |
| No | Yes | 16 | 0.724 |

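
To make the memory cost of each configuration concrete, the feature length implied by each row of the table can be worked out from the 400*bins + 512 formula in the text. The sketch below only does that arithmetic; the helper name `feature_length` is illustrative.

```python
VOCAB_SIZE = 400  # visual words, from the text
GIST_DIMS = 512   # length of the GIST vector, from the text


def feature_length(spatial_bins, use_gist):
    """Per-image feature length: 400 * bins, plus 512 if GIST is appended."""
    return VOCAB_SIZE * spatial_bins + (GIST_DIMS if use_gist else 0)


# (use_gist, spatial_bins) for each row of the table, in order
rows = [(False, 1), (True, 1), (True, 4), (True, 16), (False, 16)]
lengths = [feature_length(bins, gist) for gist, bins in rows]
print(lengths)  # [400, 912, 2112, 6912, 6400]
```

The best configuration therefore uses a feature roughly 17 times longer than the 400-dimensional baseline.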
The highest performance, 78%, was seen when binning each image into 16 spatial bins and including the GIST features. The confusion matrix and example classifications are below.
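
For readers unfamiliar with the layout, a confusion matrix and the per-category accuracies derived from it can be computed as in the sketch below. The labels are invented toy data for illustration only, not results from this project, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row i, column j counts images of category i predicted as category j."""
    index = {name: i for i, name in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_category_accuracy(matrix):
    """Diagonal count divided by the row total, for each category."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels, invented for illustration only
categories = ["Kitchen", "Bedroom", "Coast"]
truth = ["Kitchen", "Kitchen", "Bedroom", "Bedroom", "Coast", "Coast"]
preds = ["Kitchen", "Bedroom", "Bedroom", "Bedroom", "Coast", "Kitchen"]
cm = confusion_matrix(truth, preds, categories)
print(cm)                         # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_category_accuracy(cm))  # [0.5, 1.0, 0.5]
```

Off-diagonal entries in a row are that category's false negatives; off-diagonal entries in a column are its false positives.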

| Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label) |
|---|---|---|---|
| Kitchen | 0.620 | Bedroom, LivingRoom | Bedroom, Bedroom |
| Store | 0.640 | TallBuilding, InsideCity | TallBuilding, InsideCity |
| Bedroom | 0.610 | LivingRoom, LivingRoom | LivingRoom, LivingRoom |
| LivingRoom | 0.590 | Industrial, Suburb | Office, Store |
| Office | 0.950 | LivingRoom, Store | Bedroom, Store |
| Industrial | 0.690 | Store, TallBuilding | Store, TallBuilding |
| Suburb | 0.990 | Industrial, Industrial | LivingRoom |
| InsideCity | 0.770 | Highway, Industrial | Kitchen, Kitchen |
| TallBuilding | 0.830 | InsideCity, InsideCity | Store, Store |
| Street | 0.910 | OpenCountry, InsideCity | Highway, InsideCity |
| Highway | 0.870 | OpenCountry, Kitchen | Street, Coast |
| OpenCountry | 0.680 | Coast, Coast | Highway, Coast |
| Coast | 0.750 | OpenCountry, OpenCountry | OpenCountry, OpenCountry |
| Mountain | 0.830 | OpenCountry, OpenCountry | Forest, Street |
| Forest | 0.930 | OpenCountry, OpenCountry | Mountain, Mountain |
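
As a consistency check, the per-category accuracies in the table can be averaged and compared with the overall figure reported for the best configuration. The sketch below assumes every category has the same number of test images, in which case the unweighted mean equals the overall accuracy; the values are copied from the table.

```python
# Per-category accuracies copied from the table above
per_category = {
    "Kitchen": 0.620, "Store": 0.640, "Bedroom": 0.610, "LivingRoom": 0.590,
    "Office": 0.950, "Industrial": 0.690, "Suburb": 0.990, "InsideCity": 0.770,
    "TallBuilding": 0.830, "Street": 0.910, "Highway": 0.870,
    "OpenCountry": 0.680, "Coast": 0.750, "Mountain": 0.830, "Forest": 0.930,
}

# Unweighted mean over the 15 categories
mean_accuracy = sum(per_category.values()) / len(per_category)
print(round(mean_accuracy, 3))  # 0.777

# Categories sorted from lowest to highest accuracy
ranked = sorted(per_category, key=per_category.get)
print(ranked[:4])  # ['LivingRoom', 'Bedroom', 'Kitchen', 'Store']
```

The mean matches the reported 0.777, and the four weakest categories are all indoor scenes, which are mostly confused with one another.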