Project 3: Scene recognition with bag of words

Name: Chen Xu
login: chenx

Algorithm Design

The basic flow is as follows:

  1. Densely sample SIFT features with a bin size of 4 and a step of 8.
  2. Build a visual vocabulary by running kmeans on the features extracted in the first step. The vocabulary size is set to 200.
  3. Build histogram representations of the training images according to the visual vocabulary, then train a linear SVM on the histograms and their corresponding labels. The number of training images is fixed at 100 per class.
  4. Build histogram representations of the testing images, classify them with the SVM, and compute the confusion matrix.
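
The histogram step (steps 3 and 4) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the reported results: the function name `bow_histogram`, the toy 2-D descriptors, and the L1 normalization are assumptions made for the example.

```python
import numpy as np

def bow_histogram(descriptors, vocab):
    """Quantize descriptors against a vocabulary and return a histogram.

    descriptors: (n, d) array of local features (e.g. 128-D SIFT).
    vocab:       (k, d) array of cluster centers from kmeans.
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest word per descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # L1 normalization is an assumption; the report does not state it.
    return hist / max(hist.sum(), 1.0)

# Toy example: 3 visual words and 4 descriptors in 2-D.
vocab = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 1.0], [0.0, 9.0], [0.5, 0.5]])
hist = bow_histogram(descriptors, vocab)
```

With a real vocabulary of 200 words, each image becomes a 200-dimensional vector that is passed to the SVM.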

The result of the basic part is shown in Table 1. The accuracy is 0.6353, which is acceptable for a baseline with no improvements applied.

Accuracy Confusion Matrix
0.6353

Table 1: Accuracy and confusion matrix of the basic part.

Discussion and Extra Points

Improvements have been made in the following aspects:

  1. Single level
  2. The single-level method uses only the finest level of the matching pyramid to build the histograms, which are fed to the SVM. The resulting histogram length is vocab_size * 4 ^ L, corresponding to 4 ^ L bins and vocab_size channels. The SVM is a kernelized non-linear SVM that uses the Histogram Intersection Kernel. Table 2 shows that the best accuracy, 0.7593, is achieved when L = 1.
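
A minimal sketch of the single-level histogram, assuming each feature comes with a visual-word index and an (x, y) position in the image. The function name `level_histogram` and its arguments are hypothetical and chosen only for this illustration.

```python
import numpy as np

def level_histogram(words, xs, ys, width, height, vocab_size, level):
    """Histogram of visual words over a 2^level x 2^level grid.

    Returns a vector of length vocab_size * 4**level.
    """
    cells = 2 ** level
    # Grid cell of each feature; clip so points on the far edge stay in range.
    cx = np.minimum((np.asarray(xs) * cells // width).astype(int), cells - 1)
    cy = np.minimum((np.asarray(ys) * cells // height).astype(int), cells - 1)
    # Flatten (cell row, cell column, word) into a single bin index.
    index = (cy * cells + cx) * vocab_size + np.asarray(words)
    return np.bincount(index, minlength=vocab_size * cells * cells).astype(float)

# Toy example: 4 features, 3 visual words, a 256 x 256 image, L = 1 (2 x 2 grid).
words = [0, 1, 1, 2]
xs = [10, 200, 210, 50]
ys = [10, 20, 30, 220]
h1 = level_histogram(words, xs, ys, 256, 256, vocab_size=3, level=1)
```

Here `h1` has 3 * 4 ^ 1 = 12 entries; with vocab_size = 200 and L = 1 the histogram would have 800 entries.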

  3. Using spatial pyramid and pyramid matching kernel.
  4. The histogram used by the spatial pyramid is a single long vector that concatenates the weighted histograms of all channels at all levels, so the Histogram Intersection Kernel can still be used by the SVM. The length of the histogram is vocab_size * (1 / 3) * (4 ^ (L + 1) - 1). The weight of each histogram at each level follows equation (3) in Lazebnik et al. 2006. Table 2 shows that the best result, an accuracy of 0.7907, is obtained when L = 2. Table 3 shows the confusion matrix and kernel matrix of the pyramid matching method at each level.
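
The level weights, the concatenated length, and the intersection kernel can be sketched as below. The weights follow equation (3) of Lazebnik et al. 2006 (1 / 2 ^ L for level 0 and 1 / 2 ^ (L - l + 1) for levels l >= 1); the function names are hypothetical and this is an illustration, not the code behind the reported numbers.

```python
import numpy as np

def pyramid_weights(L):
    """Per-level weights: 1/2^L for level 0, 1/2^(L-l+1) for l = 1..L."""
    return [1.0 / 2 ** L] + [1.0 / 2 ** (L - l + 1) for l in range(1, L + 1)]

def pyramid_length(vocab_size, L):
    """Length of the concatenated histogram: vocab_size * (4^(L+1) - 1) / 3."""
    return vocab_size * (4 ** (L + 1) - 1) // 3

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    K = np.empty((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):  # one row at a time keeps memory use low
        K[i] = np.minimum(a, B).sum(axis=1)
    return K

weights = pyramid_weights(2)          # [0.25, 0.25, 0.5]
length = pyramid_length(200, 2)       # 200 * 21 = 4200
A = np.array([[0.5, 0.5], [1.0, 0.0]])
K = intersection_kernel(A, A)
```

For 1500 training images the kernel matrix is 1500 X 1500, which matches the kernel size listed in Table 3.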


    Strong features (M = 200)
    L           Single level   Pyramid
    0 (1 X 1)   0.712
    1 (2 X 2)   0.7593         0.7740
    2 (4 X 4)   0.7527         0.7907
    3 (8 X 8)   0.7327         0.778

    Table 2: Both the single-level and the pyramid results are much better than the basic result (accuracy = 0.6353). As expected, the pyramid results are better than the single-level results at each L. No improvement is observed when L > 2. All of the above experiments use a fixed 100 training images and 100 testing images per class, with dense SIFT features extracted at binsize = 4 and step = 8.


    L Confusion matrix Kernel matrix Kernel size
    1 (2 X 2) 1500 X 1500
    2 (4 X 4) 1500 X 1500
    3 (8 X 8) 1500 X 1500

    Table 3: Confusion matrix and kernel matrix of the pyramid matching method at each level L. The confusion matrix shows that most confusion occurs among the indoor classes (kitchen, bedroom, living room). The kernel matrix shows that confusion is much stronger at the bottom-right corner, where the brighter colors indicate much lower confidence.

  5. Cross-validation measurements.
  6. Cross-validation is done at three scales, corresponding to three different sizes of the training and testing sets. Table 2 shows that the best recognition accuracy is achieved by the pyramid matching method at L = 2, so I run cross-validation for that configuration. The three scales are: (1) randomly select 100 images for training and another 100 different images for testing for every class, repeated 5 times; (2) randomly select 30 images for training and another 30 images for testing for every class, repeated 5 times; (3) randomly select 10 images for training and 10 for testing for every class, repeated 10 times. Table 4 shows the cross-validation results of L = 2 pyramid matching at the different data scales. Table 5 shows the confusion matrix of the cross-validated spatial pyramid matching performance (L = 2), as well as the confidence of every class along the matrix diagonal.
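
The repeated random-split procedure can be sketched as follows. This is a simplified illustration: `random_split`, `repeated_evaluation`, and the `evaluate` callback (which would train and test the classifier on the given indices) are hypothetical names, and every class is assumed to have the same number of available images.

```python
import numpy as np

def random_split(n_images, n_train, n_test, rng):
    """Pick disjoint train/test index sets from one class's images."""
    order = rng.permutation(n_images)
    return order[:n_train], order[n_train:n_train + n_test]

def repeated_evaluation(evaluate, n_images, n_train, n_test, n_iter, seed=0):
    """Run `evaluate` on n_iter random splits; return mean and std of accuracy."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_iter):
        train_idx, test_idx = random_split(n_images, n_train, n_test, rng)
        scores.append(evaluate(train_idx, test_idx))
    return float(np.mean(scores)), float(np.std(scores))

# Placeholder evaluation that always reports the same accuracy.
mean_acc, std_acc = repeated_evaluation(
    lambda train_idx, test_idx: 0.5, n_images=200, n_train=100, n_test=100, n_iter=5
)
```

In the real experiment, `evaluate` would rebuild the histograms, train the SVM on the training indices, and return the accuracy on the testing indices.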


    100 per class, 5 iterations   30 per class, 5 iterations   10 per class, 10 iterations
    mean      std                 mean      std                mean      std
    0.7851    0.0101              0.7236    0.0258             0.6067    0.0357
    L = 2                         L = 2                        L = 3

    Table 4: Cross-validation measurements of the pyramid matching method at L = 2. Comparing the different training data scales also shows that the more training data is used, the better the recognition accuracy.



    Spatial Pyramid Match + Cross-validation (L = 2)
    Mean accuracy: 0.7851, std: 0.0101

    Class          Diagonal (%)
    Suburb         99.2
    Forest         81.0
    InsideCity     94.6
    OpenCountry    87.4
    TallBuilding   87.6
    Bedroom        86.4
    Kitchen        70.0
    Store          93.4
    Coast          92.6
    Highway        91.0
    Mountain       58.8
    Street         60.8
    Office         64.0
    Industrial     46.4
    Livingroom     64.4

    Column labels: Sub For IC OC TB Bed Kit Sto Cst HW Mnt St Off Ind LR

    Table 5: The most confused classes are those at the bottom-right corner (Livingroom, Office, Industrial). The lowest confidence is for Industrial, at less than 0.5, and the highest is for Suburb, at nearly 1.


  7. Learning parameter tuning.
  8. I tuned the SVM training parameter lambda over three values (lambda = 0.1, 0.5, 1) with L = 3. The recognition accuracies are shown in Table 6, and I decided to use lambda = 0.1.


    lambda accuracy
    0.1 0.778
    0.5 0.7733
    1 0.7667

    Table 6: Effects of different lambda.

Final Results

The final results are:

Mean accuracy = 0.7851

Standard deviation = 0.0101

Method: Spatial pyramid matching & Cross-validation