Final Results


Benchmark #1 - Raw Performance

Take the mean of the diagonal of 10 confusion matrices.
Then compute the mean and the deviation of the mean.

                        Ntrain =
Different Methods
5 10 20 50
Image Correlation  4.58 ± 0.26%  5.06 ± 0.26%  6.60 ± 0.26%  8.85 ± 0.26%
Spatial Pyramid Matching
(algorithm of Lazebnik et. al.)
18.74 ± 0.48% 25.01 ± 0.50% 31.31 ± 0.74% 39.01 ± 0.45%
Combining Multiple Kernels
in an SVM Framework
(submitted by Manik Varma)
43.99% 53.34% 59.84% 66.44%


Benchmark #2

Originality

There were not enough entries to measure originality with the metric we had intended. The original idea was to assign greater weight to categories that only a few contestants classified well. In place of this metric, the below scatter plot gives some idea of how two different methods performed over the entire set of categories (in this case, for Ntrain=20).

On the top left we plot the error rates for each algorithm and each category, with SPM error rates on the x-axis and the Manik Varma's error rates on the y-axis. There are 256 dots corresponding to 256 categories. Except for a few outliers, the Varma algorithm does better on almost every category. To be specific, it performs better on 92% of the categories.

The fact that this scatter plot is so one-sided suggests that the vanilla SPM algorithm does not add much, if anything, to Varma's method based on a linear combination of kernels. In other words, it has very little originality compared to Varma. This is perhaps not surprising, since SPM is one of the kernels that Varma uses in his linear ensemble.

Consistency

On the lower right is a histogram of the scatter plot densities. This is marginalized over either the y or x axis to give the SPM and Varma histograms, respectively. The SPM histogram is skewed towards high error rates, while the Varma histogram is relatively flat - which means more consistent performance across categories.



Benchmark #3 - Similiarity

Getting a skunk confused with a cat is not quite as bad as getting a skunk confused with a radio telescope. The third benchmark attempts to quantify this fuzzy notion using taxonomic trees. Two of us in the lab (Merrielle Spain, Greg Griffin) taxonomies by hand, without looking at the other person's tree. The results may be seen to the right (Merrielle) and below (Greg).


We tried to organize the categories visually rather than functionally. There are, of course, many different ways to hierarchically organize these 256 categories. By combining two independent trees, we at least hoped to mitigate our individual biases.

Define the distance between any two categories as the mean distance to the nearest common parent. For example, on the tree to the left, the distance between duck and goose is 1, since they are each 1 level away from the node animate=>animal=>air. The distance between boom-box and breadmaker is 2 since this is the mean distance to their nearest common ancestor inanimate=>electronics.

For any two categories, compute their distance on each of the above hierarchies. Take the minimim of these two distances and apply the following formula to arrive at the cost of misclassification:

Distance Non-Hierarchical
Cost
Hierarchical
Cost
Categories are...
0 0.0 0.0 correct
1 1.0 0.5 reasonable mistake
2 1.0 1.0 mistake
≥3 1.0 2.0 bad mistake

In the non-hierarchical case, we do not have a tree which lets us discriminate between similar and different categories. Thus every misclassification carries the same unit cost. Using the above relations we derive misclassification costs matrices C which looks like this:



Finally we define the similarity to be:
Notice that, for the non-hierarchical cost matrix, S just measures ordinary performance. For the hierarchical cost matrix S measures weighted performance, where bad mistakes count more than reasonable mistakes.

Similarity Measured Using : Ordinary Cost Matrix
(0 to 100%)
Hierarchical Cost Matrix
(-100 to 100%)
Ntrain :
5 10 20 50 5 10 20 50
Image Correlation 4.6% 5.1% 6.6% 8.8% -58.9% -57.3% -54.5% -50.4%
Spatial Pyramid Matching 18.7% 25.0% 31.3% 39.0% -23.2% -12.5% -2.2% 10.1%
Combining Multiple Kernels
(Manik Varma)
44.0% 53.3 59.8 66.4 13.4% 28.4 38.6 48.7

To conclude, similarity as we have definied it above is just the ordinary perferformance metric where we have weighted different kinds of mistakes according to how similar the categories are. The high scores for Varma suggest that

  • His algorithm not only performs the best but,
  • When it does make mistakes, the mistakes are less severe

Greg Griffin
Last modified: Mon Oct 15 10:07:58 PDT 2007