Project 3: Scene recognition with bag of words

Name: Chen Xu
login: chenx

Algorithm Design

The basic flow is as follows:

1. Densely sample SIFT features with a bin size of 4 and a step of 8.
2. Build a visual vocabulary by running k-means on the features extracted in the first step. The vocabulary size is set to 200.
3. Build histogram representations of the training images according to the visual vocabulary, and train a linear SVM using the histograms and their corresponding labels. The number of training images is fixed at 100 per class.
4. Build histogram representations of the testing images, classify them using the SVM, and compute the confusion matrix.
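The histogram step can be illustrated with a minimal sketch. It assumes each descriptor is assigned to its nearest vocabulary centre by squared Euclidean distance and that the counts are L1-normalised; the report does not state the normalisation, so that part is an assumption. The function names and toy data are hypothetical, and a real implementation would vectorise the distance computation.

```python
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary centre closest to one descriptor."""
    def sq_dist(centre):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(vocabulary[k]))

def bow_histogram(descriptors, vocabulary):
    """Count visual words over an image and L1-normalise the counts (assumed)."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in range(len(vocabulary))]

# Toy data: three 2-D "visual words" and four descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = bow_histogram(descriptors, vocabulary)
```

In the report's setting the vocabulary would hold 200 centres of 128-dimensional SIFT descriptors, so each image yields a 200-bin histogram.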

The result of the basic part is shown in Table 1. The accuracy is 0.6353, which is acceptable for a baseline with no improvements applied.

Accuracy	Confusion Matrix
0.6353

Table 1: Accuracy and Confusion Matrix of basic part.

Discussion and Extra Points

Improvements have been made in the following aspects:

Single level

The single-level method builds histograms from only the highest level of the matching pyramid and feeds them to the SVM. The resulting histogram length is vocab_size * 4 ^ L, corresponding to 4 ^ L spatial bins with vocab_size channels each. The SVM is a kernelized non-linear SVM that uses the histogram intersection kernel. Table 2 shows that the best accuracy, 0.7593, is achieved at L = 1.
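As a rough sketch of the single-level representation, the code below bins quantised features into a 2^L x 2^L grid and compares two images with the histogram intersection kernel. The function names, the cell-major layout of the histogram, and the toy coordinates are assumptions made for illustration, not the layout used in the project code.

```python
def single_level_histogram(points, words, width, height, level, vocab_size):
    """Histogram of visual words over a 2^level x 2^level grid of cells.

    points: (x, y) position of each feature; words: its visual-word index.
    The result has vocab_size * 4**level entries, one block per cell.
    """
    cells = 2 ** level
    hist = [0.0] * (vocab_size * cells * cells)
    for (x, y), w in zip(points, words):
        cx = min(int(x * cells / width), cells - 1)   # clamp points on the right edge
        cy = min(int(y * cells / height), cells - 1)  # clamp points on the bottom edge
        hist[(cy * cells + cx) * vocab_size + w] += 1.0
    return hist

def intersection_kernel(h1, h2):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy 8 x 8 images with a 3-word vocabulary at level L = 1 (2 x 2 cells).
points_a, words_a = [(1, 1), (6, 1), (6, 6), (7, 7)], [0, 2, 1, 1]
points_b, words_b = [(2, 2), (5, 5), (6, 6)], [0, 1, 2]
ha = single_level_histogram(points_a, words_a, 8, 8, 1, 3)
hb = single_level_histogram(points_b, words_b, 8, 8, 1, 3)
k_ab = intersection_kernel(ha, hb)
```

Evaluating the kernel for every pair of training images gives the kernel matrix passed to the SVM; with 100 training images for each of the 15 classes, that matrix is 1500 x 1500.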

Using spatial pyramid and pyramid matching kernel.

The spatial pyramid uses a concatenated histogram, formed by concatenating the weighted histograms of all channels at all levels, so the SVM can still use the histogram intersection kernel. The length of the histogram is vocab_size * (1 / 3) * (4 ^ (L + 1) - 1). The weight of each level's histogram follows equation (3) in Lazebnik et al. 2006. Table 2 shows that the best result, an accuracy of 0.7907, is obtained at L = 2. Table 3 shows the confusion matrix and kernel matrix of the pyramid matching method at each level.
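The weighting and the histogram length can be checked with a short sketch. It assumes the weights of equation (3) in Lazebnik et al. 2006, namely 1 / 2^L for level 0 and 1 / 2^(L - l + 1) for levels l >= 1; the function names are hypothetical.

```python
def level_weight(level, max_level):
    """Weight of a pyramid level: 1/2^L at level 0, 1/2^(L-l+1) otherwise."""
    if level == 0:
        return 1.0 / 2 ** max_level
    return 1.0 / 2 ** (max_level - level + 1)

def pyramid_length(vocab_size, max_level):
    """Length of the concatenated histogram over levels 0..max_level."""
    return vocab_size * sum(4 ** l for l in range(max_level + 1))

def pyramid_histogram(level_histograms):
    """Concatenate per-level histograms (list index = level) after weighting."""
    max_level = len(level_histograms) - 1
    out = []
    for level, hist in enumerate(level_histograms):
        w = level_weight(level, max_level)
        out.extend(w * v for v in hist)
    return out

weights = [level_weight(l, 2) for l in range(3)]  # L = 2
length = pyramid_length(200, 2)                   # vocab_size = 200, L = 2
toy = pyramid_histogram([[2.0], [1.0, 1.0, 1.0, 1.0]])  # L = 1, one word
```

Because min(w * a, w * b) = w * min(a, b) for a positive weight w, the intersection of two concatenated histograms equals the weighted sum of the per-level intersections, which is why a single histogram intersection kernel suffices. For vocab_size = 200 and L = 2 the length is 200 * (1 + 4 + 16) = 4200, in agreement with the closed form above.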

	Strong features(M = 200)
L	Single level	Pyramid
0 (1 X 1)	0.712
1 (2 X 2)	0.7593	0.7740
2 (4 X 4)	0.7527	0.7907
3 (8 X 8)	0.7327	0.778

Table 2: Both the single-level and pyramid results are much better than the basic result (accuracy = 0.6353). As expected, the pyramid outperforms the single level at every L. No improvement is observed for L > 2. All of the above experiments use fixed training and testing sets of 100 images per class, with dense SIFT features extracted at binsize = 4 and step = 8.

L	Confusion matrix	Kernel matrix	Kernel size
1 (2 X 2)			1500 X 1500
2 (4 X 4)			1500 X 1500
3 (8 X 8)			1500 X 1500

Table 3: Confusion matrix and kernel matrix of the pyramid matching method at each level L. The confusion matrices show confusion among the indoor classes (kitchen, bedroom, living room). The kernel matrices indicate that confusion is much stronger at the bottom-right corner, where the brighter colors correspond to much lower confidence.

Cross-validation measurements.

Cross-validation is done at three scales, corresponding to three different sizes of the training and testing sets. Table 2 shows that the best recognition accuracy is achieved at L = 2 with the pyramid matching method, so I cross-validate that configuration. The three scales are: (1) randomly select 100 images for training and another 100 different images for testing for every class, iterating 5 times; (2) randomly select 30 images for training and another 30 images for testing for every class, iterating 5 times; (3) randomly select 10 images for training and 10 for testing for every class, iterating 10 times. Table 4 shows the cross-validation results of L = 2 pyramid matching at the different data scales. Table 5 shows the confusion matrix of the cross-validated spatial pyramid matching performance (L = 2), as well as the confidence of every class along the matrix diagonal.
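The procedure described here (sometimes called repeated random sub-sampling validation) could be sketched as follows. The `evaluate` callback is a hypothetical stand-in for the whole pipeline of building histograms, training the SVM and returning the accuracy, and the use of the sample standard deviation is an assumption, since the report does not say which estimator was used.

```python
import random
import statistics

def random_split(items, n_train, n_test, rng):
    """Pick disjoint training and testing subsets from one class's images."""
    chosen = rng.sample(items, n_train + n_test)
    return chosen[:n_train], chosen[n_train:]

def repeated_evaluation(images_by_class, n_train, n_test, iterations, evaluate, seed=0):
    """Run `evaluate` on fresh random splits and return (mean, std) of accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(iterations):
        train, test = {}, {}
        for label, items in images_by_class.items():
            train[label], test[label] = random_split(items, n_train, n_test, rng)
        accuracies.append(evaluate(train, test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Dummy evaluation: returns 1.0 when no image is shared between the splits.
def dummy_evaluate(train, test):
    overlap = sum(len(set(train[c]) & set(test[c])) for c in train)
    return 1.0 if overlap == 0 else 0.0

images = {"Kitchen": list(range(40)), "Forest": list(range(40))}
mean_acc, std_acc = repeated_evaluation(images, 10, 10, 5, dummy_evaluate)
```

Drawing both subsets from one sample call guarantees that the training and testing images of a class never overlap, which matches the "another 100 different images" requirement.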

100 per class & 5 iterations		30 per class & 5 iterations		10 per class & 10 iterations
mean	std	mean	std	mean	std
0.7851	0.0101	0.7236	0.0258	0.6067	0.0357
L = 2		L = 2		L = 3

Table 4: Cross-validation measurements of the pyramid matching method at L = 2. Comparing the training data scales shows that the more training data is used, the better the recognition accuracy.

Spatial Pyramid Match + Cross-validation (L = 2)
Mean accuracy: 0.7851, std: 0.0101

Class	Confidence (%)
Suburb	99.2
Forest	81.0
InsideCity	94.6
OpenCountry	87.4
TallBuilding	87.6
Bedroom	86.4
Kitchen	70.0
Store	93.4
Coast	92.6
Highway	91.0
Mountain	58.8
Street	60.8
Office	64.0
Industrial	46.4
Livingroom	64.4

[Confusion matrix figure; visible axis labels: Sub, For, Bed, Kit, Sto, Cst, Mnt, Off, Ind]

Table 5: The most confused classes are those at the bottom-right corner (Livingroom, Office, Industrial). The lowest confidence, less than 0.5, comes from Industrial, and the highest, nearly 1, comes from Suburb.
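For reference, a minimal sketch of how a row-normalised confusion matrix and its diagonal (the per-class confidence) can be computed from true and predicted labels. The function names and the toy labels are hypothetical; the values in Table 5 come from the actual experiments, not from this code.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix: rows are true classes, columns predictions."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix

def per_class_accuracy(matrix, classes):
    """Diagonal of the confusion matrix, keyed by class name."""
    return {c: matrix[i][i] for i, c in enumerate(classes)}

classes = ["Kitchen", "Office", "Suburb"]
truth = ["Kitchen", "Kitchen", "Office", "Office", "Suburb", "Suburb"]
preds = ["Kitchen", "Office", "Office", "Kitchen", "Suburb", "Suburb"]
cm = confusion_matrix(truth, preds, classes)
diag = per_class_accuracy(cm, classes)
```

In this toy case Kitchen and Office are confused with each other half of the time, so their diagonal entries are 0.5, while Suburb is always classified correctly.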

Learning parameter tuning.

I tuned the SVM training parameter lambda over three values (lambda = 0.1, 0.5, 1) with L = 3. The recognition accuracies are shown in Table 6, and I decided to use lambda = 0.1.

lambda	accuracy
0.1	0.778
0.5	0.7733
1	0.7667

Table 6: Effects of different lambda.
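The selection in Table 6 amounts to a small parameter sweep, sketched below. The `sweep` and `best_parameter` helpers are hypothetical, and the evaluation function is replaced by a lookup of the accuracies reported in Table 6 rather than an actual SVM training run.

```python
def sweep(values, evaluate):
    """Evaluate every candidate value and return {value: accuracy}."""
    return {v: evaluate(v) for v in values}

def best_parameter(results):
    """Return the (value, accuracy) pair with the highest accuracy."""
    return max(results.items(), key=lambda kv: kv[1])

# Stand-in for "train the SVM with this lambda and measure accuracy":
# it simply looks up the accuracies reported in Table 6 (L = 3).
table6 = {0.1: 0.778, 0.5: 0.7733, 1: 0.7667}
results = sweep([0.1, 0.5, 1], table6.get)
best_lambda, best_accuracy = best_parameter(results)
```

With only three candidates the differences are small (about one percentage point between the best and worst), so a finer or wider grid might be worth trying before fixing the value.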

Final Results

The final results are:

Mean accuracy = 0.7851

Standard deviation = 0.0101

Method: Spatial pyramid matching & Cross-validation