CSCI1430: Project 3: Scene recognition with bag of words
The goal of this assignment is to apply the bag-of-words model to an image classification task on a data set of 15 hand-labeled scene categories, with 200 images in each category. The whole system can be decomposed into several independent stages: visual vocabulary creation, feature encoding, model training, inference, and accuracy evaluation.
Algorithm
- Create Visual Vocabulary
We built the visual vocabulary from a sample of dense SIFT descriptors computed over the 1500 training images. For each image, we first generate a random permutation of all its SIFT descriptors and sample 15% of them into a descriptor pool. We then run a k-means clustering algorithm over the pooled descriptors; the resulting cluster centers are taken as our visual vocabulary.
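Our actual pipeline runs in MATLAB with VLFeat; purely as an illustration of the step above, the following is a minimal NumPy sketch of vocabulary building with Lloyd's k-means. The function name `build_vocabulary` and the toy descriptor pool are hypothetical, not the project code.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster pooled SIFT descriptors with k-means (Lloyd's algorithm);
    the k cluster centers form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct sampled descriptors.
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned descriptors.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Toy stand-in for the 15%-sampled pool of 128-D SIFT descriptors.
rng = np.random.default_rng(0)
pool = rng.random((500, 128))
vocab = build_vocabulary(pool, k=10)
assert vocab.shape == (10, 128)
```

In the real system the pool is far larger and k is the vocabulary size (e.g. 200), but the structure of the computation is the same.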
- Compute features
In this step we represent each training and test image as a histogram over the visual vocabulary. For each image, we first compute dense SIFT descriptors, then bin each descriptor into its nearest visual word from the vocabulary built in the previous step. We use the spatial pyramid encoding described in Lazebnik et al. 2006, and normalize the final feature vector so that its entries sum to 1.
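Setting the pyramid aside for the moment (it is detailed under Extra Credit), the basic per-image encoding can be sketched in NumPy as below. `encode_image` is an illustrative name, not a function from our code.

```python
import numpy as np

def encode_image(descriptors, vocab):
    """Assign each dense SIFT descriptor to its nearest visual word and
    build an L1-normalized histogram over the vocabulary."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# Toy vocabulary and descriptors standing in for real SIFT data.
rng = np.random.default_rng(0)
vocab = rng.random((10, 128))
descs = rng.random((300, 128))
h = encode_image(descs, vocab)
assert h.shape == (10,)
assert np.isclose(h.sum(), 1.0)
```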
- Training
We trained a kernel SVM on our data, using the histogram intersection function as the kernel, the same as eq. 1 in Lazebnik et al. 2006. We experimented with both Newton's method and gradient descent to estimate the model parameters; both gave similar results.
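The histogram intersection kernel itself is simple to compute; the sketch below builds the training Gram matrix in NumPy (the `intersection_kernel` name and toy histograms are illustrative). Such a precomputed Gram matrix can then be handed to any kernel-SVM solver, e.g. scikit-learn's `SVC(kernel='precomputed')`.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Toy L1-normalized histograms standing in for bag-of-words features.
rng = np.random.default_rng(0)
H = rng.random((5, 8))
H /= H.sum(axis=1, keepdims=True)

K = intersection_kernel(H, H)
# For L1-normalized histograms, K(x, x) = sum_k x_k = 1.
assert np.allclose(np.diag(K), 1.0)
assert np.allclose(K, K.T)  # the Gram matrix is symmetric
```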
- Inference
We use the MAP estimator to assign each test image the label with the highest probability. Because we used a kernel SVM in the training stage, we also need to compute a kernel matrix for the test images: for each test image, we use the same kernel function (histogram intersection) to evaluate its similarity to all training images.
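The test-time kernel is a rectangular matrix of test-vs-train similarities. The sketch below computes it and applies a generic SVM dual decision function; the dual coefficients `coef` and bias `b` are hypothetical placeholders, since in practice they come from the trained model.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

rng = np.random.default_rng(1)
train = rng.random((6, 8)); train /= train.sum(axis=1, keepdims=True)
test = rng.random((3, 8));  test /= test.sum(axis=1, keepdims=True)

# Similarity of every test image to every training image.
K_test = intersection_kernel(test, train)   # shape (n_test, n_train)

# A trained kernel SVM scores with f(x) = sum_i coef_i * K(x_i, x) + b;
# these coefficients are made up here purely for illustration.
coef = rng.standard_normal(6)
b = 0.1
scores = K_test @ coef + b

assert K_test.shape == (3, 6)
# Intersections of L1-normalized histograms always lie in (0, 1].
assert (K_test > 0).all() and (K_test <= 1).all()
```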
- Accuracy Evaluation
We used cross validation to evaluate model performance. For each of the 15 classes, we first pool all the training and test images together, then randomly pick half as training data and the other half as test data. We report summary statistics of the resulting accuracies, including mean, median, mode, and standard deviation.
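The evaluation loop above can be sketched as follows. To keep the sketch runnable, a trivial nearest-centroid classifier stands in for our kernel SVM; the function names and toy data are illustrative only.

```python
import numpy as np

def nearest_centroid_eval(Xtr, ytr, Xte, yte):
    # Stand-in classifier (nearest class centroid), replacing the
    # kernel SVM so the evaluation loop below can run on its own.
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xte[:, None, :] - cents[None, :, :], axis=2)
    return (classes[d.argmin(axis=1)] == yte).mean()

def random_split_stats(X, y, runs=5, seed=0):
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        perm = rng.permutation(len(y))        # mix all images together
        half = len(y) // 2
        tr, te = perm[:half], perm[half:]     # random half/half split
        accs.append(nearest_centroid_eval(X[tr], y[tr], X[te], y[te]))
    accs = np.array(accs)
    return accs.mean(), np.median(accs), accs.std(ddof=1)  # ddof=1 -> N-1

# Two well-separated toy classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
mean_acc, median_acc, sd_acc = random_split_stats(X, y)
assert mean_acc > 0.9
```

Note `ddof=1` in the standard deviation, matching the N-1 normalization used in our results table.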
Extra Credit
- Spatial Pyramid
We built a three-level spatial pyramid representation of our images (L0, L1, and L2), using the histogram intersection function as the kernel and the same weighting strategy as described in Lazebnik et al. 2006. Each image is recursively divided into grids (1, 4, and 16 cells in our experiment), and a local histogram is computed for each cell to capture spatial information. All histograms are concatenated, weighted, and normalized to sum to 1.
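As an illustration of the pyramid construction, here is a NumPy sketch with the standard Lazebnik weights for a three-level pyramid (1/4, 1/4, 1/2); the function name and toy keypoints are hypothetical.

```python
import numpy as np

def spatial_pyramid(words, xs, ys, width, height, vocab_size):
    """Three-level spatial pyramid: visual-word histograms over 1, 4, and
    16 grid cells, weighted 1/4, 1/4, 1/2, concatenated and normalized."""
    feats = []
    for level, weight in [(0, 0.25), (1, 0.25), (2, 0.5)]:
        cells = 2 ** level
        for i in range(cells):
            for j in range(cells):
                # Visual words whose keypoints fall inside cell (i, j).
                in_cell = ((xs * cells // width == i) &
                           (ys * cells // height == j))
                h = np.bincount(words[in_cell], minlength=vocab_size)
                feats.append(weight * h)
    f = np.concatenate(feats).astype(float)
    return f / f.sum()

# Toy image: 400 keypoints, a 10-word vocabulary, a 64x64 frame.
rng = np.random.default_rng(0)
n, V, W, H = 400, 10, 64, 64
words = rng.integers(0, V, n)
xs, ys = rng.integers(0, W, n), rng.integers(0, H, n)
f = spatial_pyramid(words, xs, ys, W, H, V)
assert f.shape == ((1 + 4 + 16) * V,)   # 21 cells x vocabulary size
assert np.isclose(f.sum(), 1.0)
```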
- Test with different vocabulary size
We tried different vocabulary sizes, which are believed to relate to feature strength: a small vocabulary gives weak features while a large vocabulary gives strong features. We found that when the vocabulary is small, the standard deviation of the accuracies grows. A possible explanation is that a small vocabulary is too limited to represent the visual complexity of our data set, so performance varies considerably depending on which vocabulary happens to be chosen.
- Cross Validation
In traditional cross validation, the data is split into several folds, and train/test partitions are formed over these folds in a round-robin fashion. Our experiment keeps the same idea but uses a different strategy: each time, we randomly permute the whole data set and select half as training data and the other half as test data.
- Test with different features
We tried both vl_dsift() and vl_phow() to compute our features. For vl_phow() we used the scale sizes [4 8] as described in the SUN database paper; however, this did not bring any performance improvement. Surprisingly, we found that the vl_dsift() representation gives slightly more accurate estimations.
Results
Comparison of accuracies under different system setups

| Setup | Avg. accuracy |
| --- | --- |
| Baseline (vocab = 200 + linear SVM) | .622 |
| vocab = 200 + kernel SVM (histogram intersection kernel) | .713 |
| vocab = 200 + kernel SVM + phow descriptor | .722 |
| vocab = 200 + kernel SVM + dsift descriptor | .800 |
| vocab = 200 + kernel SVM + phow descriptor | .790 |
Confusion matrix of our best run (accuracy = .818)
| Class | Correct rate |
| --- | --- |
| Suburb | 1.00 |
| Forest | .88 |
| InsideCity | .93 |
| OpenCountry | .83 |
| TallBuilding | .81 |
| Bedroom | .83 |
| Kitchen | .71 |
| Store | .93 |
| Coast | .92 |
| Highway | .99 |
| Mountain | .72 |
| Street | .63 |
| Office | .79 |
| Industrial | .60 |
| Livingroom | .69 |

Diagonal values of the confusion matrix above, with the highest (Suburb, 1.00) and lowest (Industrial, .60) correct classification rates.
Statistics of accuracies over different vocabulary sizes, cross validated with 5 random initializations
The plot shows that increasing the vocabulary size does not always improve performance, which is consistent with the experimental results in Lazebnik et al. 2006.
The following table shows summary statistics of our cross validation results, with 5 runs for each vocabulary setting.
| Vocabulary size | 10 | 20 | 50 | 100 | 200 | 400 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | .637 | .724 | .756 | .782 | .783 | .812 | .802 |
| Median | .635 | .727 | .751 | .783 | .781 | .814 | .805 |
| Mode | .650 | .733 | .770 | .792 | .798 | .818 | .809 |
| Std. dev. (normalized by N-1) | .0106 | .0079 | .0125 | .0078 | .0102 | .0057 | .0085 |
Samples of failure cases; notice that some of the confusions are quite reasonable.
Some categories, like 'open country', may be visually diverse, so more training data may be needed for them. Other categories may be semantically or visually similar, such as 'inside city' vs. 'street', 'living room' vs. 'kitchen', and 'open country' vs. 'highway'.
- 'bedroom' classified as 'industrial'
- 'coast' as 'open country'
- 'industrial' as 'living room'
- 'industrial' as 'store'
- 'inside city' as 'street'
- 'living room' as 'kitchen'
- 'open country' as 'coast'
- 'open country' as 'highway'
- 'open country' as 'store'
- 'store' as 'highway'