Name: Chen Xu
login: chenx
The basic flow is as follows:
The result of the basic part is shown in Table 1. The accuracy is 0.6353, which is acceptable given that no improvements have been applied yet.
Accuracy | Confusion Matrix
---|---
0.6353 | (confusion matrix image)

Table 1: Accuracy and confusion matrix of the basic part.
Improvements have been made in the following aspects:
The single-level method uses only the highest level of the matching pyramid to build histograms, which are fed to the SVM. The resulting histogram has length vocab_size * 4 ^ L, corresponding to 4 ^ L spatial bins with vocab_size channels each. The SVM is a kernelized non-linear SVM using the Histogram Intersection Kernel. Table 2 shows that the best accuracy, 0.7593, is achieved at L = 1.
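The Histogram Intersection Kernel simply sums the element-wise minimum of two histograms. A minimal numpy sketch of the Gram-matrix computation (the function name and toy data here are mine, not from the actual implementation):

```python
import numpy as np

def intersection_kernel(X, Y):
    """Histogram Intersection Kernel: K(x, y) = sum_i min(x_i, y_i).

    X: (n, d) array of n histograms; Y: (m, d) array of m histograms.
    Returns the (n, m) Gram matrix.
    """
    # Broadcast to (n, m, d), take the element-wise minimum, sum over bins.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# Toy L1-normalized histograms: identical histograms give K = 1.
a = np.array([[0.5, 0.5], [1.0, 0.0]])
K = intersection_kernel(a, a)  # [[1.0, 0.5], [0.5, 1.0]]
```

Such a precomputed Gram matrix can be handed directly to any SVM solver that accepts custom kernels.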
The spatial pyramid method instead concatenates the appropriately weighted histograms of all channels at all levels, so the Histogram Intersection Kernel can still be used by the SVM. The length of the concatenated histogram is vocab_size * (1 / 3) * (4 ^ (L + 1) - 1). The weight of each level's histogram follows equation (3) in Lazebnik et al. 2006. Table 2 shows that the best result, accuracy 0.7907, is obtained at L = 2. Table 3 shows the confusion matrix and kernel matrix of the pyramid matching method at each level.
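The weighted concatenation can be sketched as follows, assuming the weights of Lazebnik et al. 2006, eq. (3): w_0 = 1/2^L and w_l = 1/2^(L - l + 1) for l >= 1 (the function and variable names are illustrative, not from the actual code):

```python
import numpy as np

def pyramid_histogram(level_hists, L):
    """Concatenate per-level histograms with the level weights of
    Lazebnik et al. 2006, eq. (3).

    level_hists[l]: flattened histogram of level l, i.e. the
    vocab_size * 4^l bin counts of its 2^l x 2^l spatial grid.
    """
    parts = []
    for l in range(L + 1):
        w = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        parts.append(w * np.asarray(level_hists[l], dtype=float))
    return np.concatenate(parts)

# Toy check with vocab_size = 2, L = 1: total length is
# vocab_size * (1/3) * (4^(L+1) - 1) = 2 * 5 = 10.
h0 = np.ones(2)   # level 0: 1 x 1 grid
h1 = np.ones(8)   # level 1: 2 x 2 grid
h = pyramid_histogram([h0, h1], L=1)
```

Because the weights are folded into the histogram itself, the weighted pyramid match kernel reduces to a plain intersection kernel on the concatenated vectors.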
Strong features (M = 200)

L | Single level | Pyramid
---|---|---
0 (1 × 1) | 0.712 |
1 (2 × 2) | 0.7593 | 0.7740
2 (4 × 4) | 0.7527 | 0.7907
3 (8 × 8) | 0.7327 | 0.778
Table 2: Both the single-level and pyramid results are much better than the basic result (accuracy = 0.6353), and, unsurprisingly, the pyramid results are better than the single-level results at every L. No further improvement is observed for L > 2. In all of the above experiments, the training and testing data are fixed to 100 images per class, and dense SIFT features are extracted with binsize = 4 and step = 8.
L | Confusion matrix | Kernel matrix | Kernel size
---|---|---|---
1 (2 × 2) | (image) | (image) | 1500 × 1500
2 (4 × 4) | (image) | (image) | 1500 × 1500
3 (8 × 8) | (image) | (image) | 1500 × 1500
Table 3: Confusion matrix and kernel matrix of the pyramid matching method at each level L. As the confusion matrices indicate, most confusion occurs among the indoor classes (kitchen, bedroom, living room). The kernel matrices show that confusion is strongest in the bottom-right corner, where the brighter colors indicate lower confidence.
Cross-validation is performed at three scales, corresponding to three different sizes of training and testing sets. Table 2 shows that the best recognition accuracy is achieved at L = 2 with pyramid matching, so cross-validation is run for that configuration. The three scales are: (1) for every class, randomly select 100 images for training and another 100 distinct images for testing, repeated 5 times; (2) randomly select 30 images for training and another 30 for testing, repeated 5 times; (3) randomly select 10 images for training and 10 for testing, repeated 10 times. Table 4 lists the cross-validation results of L = 2 pyramid matching at each data scale. Table 5 shows the confusion matrix of the cross-validated spatial pyramid matching (L = 2), along with the per-class confidence along the matrix diagonal.
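One round of this random-split protocol can be sketched as below; the function and the toy data are hypothetical stand-ins for the actual per-class image lists:

```python
import random

def random_split(images_by_class, n_train, n_test, seed=0):
    """One cross-validation round: for every class, sample n_train
    training images and n_test disjoint test images.

    images_by_class: dict mapping class name -> list of image ids.
    Returns (train, test) as lists of (class, image id) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls, imgs in images_by_class.items():
        picked = rng.sample(imgs, n_train + n_test)  # no overlap
        train += [(cls, i) for i in picked[:n_train]]
        test += [(cls, i) for i in picked[n_train:]]
    return train, test

# Toy usage with 2 classes and 5 images each; e.g. scale (3) above
# would call this with n_train = n_test = 10 and 10 different seeds.
data = {"kitchen": list(range(5)), "suburb": list(range(5))}
tr, te = random_split(data, n_train=2, n_test=2)
```

Repeating the split with a different seed per iteration and averaging the per-iteration accuracies yields the mean and standard deviation reported in Table 4.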
Training/testing scale | mean | std | L
---|---|---|---
100 per class, 5 iterations | 0.7851 | 0.0101 | L = 2
30 per class, 5 iterations | 0.7236 | 0.0258 | L = 2
10 per class, 10 iterations | 0.6067 | 0.0357 | L = 3
Table 4: Cross-validation results of the pyramid matching method at L = 2. Comparing across training data scales, more training data yields better recognition accuracy.
(confusion matrix image)

Table 5: The most confused classes are those at the bottom-right corner (Livingroom, Office, Industry). The lowest per-class confidence comes from Industry (below 0.5) and the highest from Suburb (nearly 1).
I tuned the SVM regularization parameter lambda at three values (lambda = 0.1, 0.5, 1) with L = 3. The resulting recognition accuracies are shown in Table 6; I chose lambda = 0.1.
lambda | accuracy |
---|---|
0.1 | 0.778 |
0.5 | 0.7733 |
1 | 0.7667 |
Table 6: Effects of different lambda.
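The selection itself is a one-line grid search; a sketch, where `train_and_eval` stands in for the actual training routine (here it just replays the Table 6 numbers, it is not the real trainer):

```python
def pick_lambda(train_and_eval, lambdas=(0.1, 0.5, 1.0)):
    """Evaluate each candidate lambda and return the best one
    together with all measured accuracies."""
    accs = {lam: train_and_eval(lam) for lam in lambdas}
    best = max(accs, key=accs.get)
    return best, accs

# Stand-in evaluator reproducing Table 6.
table6 = {0.1: 0.778, 0.5: 0.7733, 1.0: 0.7667}
best, accs = pick_lambda(table6.get)  # best == 0.1
```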
The final results are:
Mean accuracy = 0.7851
Standard deviation = 0.0101
Method: spatial pyramid matching with cross-validation