I wanted to play around with Bag of Words for visual classification, so I coded a MATLAB implementation that uses VLFeat for the features and clustering. I tested it on classifying Mac/Windows desktop screenshots.
For a small test data set (about 50 images per category), the best vocabulary size was about 80. It scored 97% accuracy on the training set and 85% accuracy on the cross-validation set, so there is still some over-fitting left to reduce.
- Collect a data set of examples. I used a python script to download images from Google.
- Partition the data set into a training set, and a cross validation set (80% - 20%).
- Find key points in each image, using SIFT.
- Take a patch around each key point, and calculate its Histogram of Oriented Gradients (HoG). Gather all these features.
- Build a visual vocabulary by finding representatives of the gathered features (quantization). This is done by k-means clustering.
- Find the distribution of the vocabulary in each image in the training set. This is done by a histogram with a bin for each vocabulary word. The histogram values can be either hard or soft. Hard values mean that for each key point patch descriptor in an image, we add 1 to the bin of the vocabulary word closest to it in squared Euclidean distance. Soft values mean that each patch votes for all histogram bins, but gives a higher weight to the bins of words that are more similar to that patch. Take a look here.
- Train an SVM on the resulting histograms (each histogram is a feature vector, with a label).
- Test the classifier on the cross validation set.
- If the results are not satisfactory, repeat from step 5 (building the vocabulary) with a different vocabulary size and different SVM parameters.
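
The vocabulary and histogram steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the MATLAB/VLFeat code from this post: the function names, the farthest-first initialisation of k-means, the Gaussian weighting (with its `sigma` parameter) used for the soft votes, and the normalisation of each histogram to sum to 1 are all assumptions made for the example. Toy 2-D points stand in for real descriptors.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's k-means; returns k centroids (the visual words)."""
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def hard_histogram(descriptors, vocab):
    """Each descriptor adds 1 to the bin of its nearest word."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        nearest = min(range(len(vocab)), key=lambda j: sq_dist(d, vocab[j]))
        hist[nearest] += 1.0
    total = sum(hist) or 1.0  # avoid dividing by zero for an empty image
    return [h / total for h in hist]  # normalisation is an assumption


def soft_histogram(descriptors, vocab, sigma=1.0):
    """Each descriptor votes for every bin; closer words get more weight."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        dists = [sq_dist(d, w) for w in vocab]
        closest = min(dists)
        # Gaussian weights (an assumption); subtracting the smallest distance
        # keeps the largest weight at 1 and avoids underflow to all zeros.
        weights = [math.exp(-(dist - closest) / (2 * sigma ** 2)) for dist in dists]
        norm = sum(weights)
        for j, w in enumerate(weights):
            hist[j] += w / norm
    total = sum(hist) or 1.0
    return [h / total for h in hist]


# Toy 2-D "descriptors" standing in for real key point descriptors.
rng = random.Random(1)
blob_a = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
blob_b = [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(30)]

vocab = kmeans(blob_a + blob_b, k=2)   # vocabulary of two visual words
image = blob_a[:3] + blob_b[:1]        # one "image": 3 patches near A, 1 near B
hard = hard_histogram(image, vocab)
soft = soft_histogram(image, vocab, sigma=2.0)
print("hard:", hard)
print("soft:", soft)
```

The resulting histogram (hard or soft) is the feature vector that would be passed to the SVM together with the image label. In the soft variant a larger `sigma` spreads each vote more evenly across the bins, while a very small one approaches the hard assignment.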
Visualization of the vocabulary learned by the clustering