Data Pruning


Can a training example be detrimental to learning? Contrary to the common belief that more training data always yields better generalization, we show that the quality of the examples also matters, and that a learning algorithm may be better off when some training examples are discarded. The question is which examples to eliminate in order to improve generalization performance. We propose a general approach, called 'data pruning', that automatically identifies and eliminates examples that are troublesome for learning with a given model. We apply it to a challenging face dataset and achieve significant improvements in performance, especially for very noisy data.

Previous work

RANSAC has been the most common method for identifying outliers. It is easily applied to relatively simple models (ones that require estimating only a small number of parameters). The RANSAC philosophy is the following: run a large number of trials, each drawing a small set of examples (just enough to estimate the model), in the hope that at least one sample will be clean and produce a good model. There are theoretical estimates of the minimum number of trials needed for such a clean sample to appear with a given confidence; the estimate depends on the noise level, the model complexity and the desired confidence. Unfortunately, for complex models, such as the ones needed for recognizing a face class (e.g. Neural Networks or Support Vector Machines), the number of necessary trials is prohibitive.
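
To make the growth in the number of trials concrete, the estimate is commonly written as N = log(1 - p) / log(1 - (1 - e)^s), where p is the desired confidence, e the fraction of noisy examples and s the number of examples drawn per trial. The sketch below evaluates this standard formula; the function and parameter names are illustrative assumptions and do not come from the original work.

```python
import math


def ransac_min_trials(confidence, outlier_fraction, sample_size):
    """Smallest N such that, with probability >= confidence, at least one
    of N random samples of `sample_size` examples contains no outliers.

    Assumes examples are drawn independently, so a sample is clean with
    probability (1 - outlier_fraction) ** sample_size.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if not 0.0 <= outlier_fraction < 1.0:
        raise ValueError("outlier_fraction must lie in [0, 1)")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")

    p_clean = (1.0 - outlier_fraction) ** sample_size
    if p_clean >= 1.0:
        return 1          # no outliers: a single trial is enough
    if p_clean <= 0.0:
        return math.inf   # underflow: a clean sample is practically impossible
    # log1p keeps precision when p_clean is tiny (large sample sizes).
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_clean))
```

With 99% confidence and half of the examples noisy, this gives 17 trials for samples of 2 examples, 72 for samples of 4 and 1177 for samples of 8. The count grows exponentially with the sample size, which is why the approach becomes impractical for models that need many examples per trial.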

Main idea. How to identify the noisy examples?

When training on noisy data, incorrect models will be learned because of the 'bad' examples. Judged by those models, the noisy examples cannot, in general, be identified as outliers. If we knew the 'correct' learning model, we could identify the bad examples, and if we knew the bad examples, we would be able to train 'good' learning models. This is a 'chicken-and-egg' dilemma.

We propose to learn multiple semi-independent models which are influenced differently by noisy examples. The idea is that most (partial) models would agree on 'good' examples, but disagree on noisy examples. A Bayesian decision is then applied to determine which examples are to be eliminated. Our experiments showed that removing noisy examples before training is beneficial. In the alternative we propose, the partial models can be significantly less powerful than the original model.
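
The agreement idea can be illustrated with a small sketch. It is not the method from the publications listed on this page: the partial models here are nearest-class-mean classifiers trained on random subsets of the data, and a fixed threshold on the fraction of models that reject an example's label stands in for the Bayesian decision. All function names, parameters and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_centroid_model(X, y):
    """Deliberately weak partial model: the mean of each of the two classes."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)


def predict_centroid_model(model, X):
    """Assign each example to the class whose mean is closer."""
    mean0, mean1 = model
    d0 = np.linalg.norm(X - mean0, axis=1)
    d1 = np.linalg.norm(X - mean1, axis=1)
    return (d1 < d0).astype(int)


def disagreement_scores(X, y, n_models=25, subset_fraction=0.5, seed=0):
    """Fraction of partial models whose prediction contradicts each label."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subset_size = max(2, int(subset_fraction * n))
    wrong = np.zeros(n)
    used = 0
    for _ in range(n_models):
        idx = rng.choice(n, size=subset_size, replace=False)
        if len(np.unique(y[idx])) < 2:
            continue  # a subset needs both classes to fit this model
        model = fit_centroid_model(X[idx], y[idx])
        wrong += predict_centroid_model(model, X) != y
        used += 1
    if used == 0:
        raise ValueError("no subset contained both classes")
    return wrong / used


def prune_mask(X, y, threshold=0.5, **kwargs):
    """True for examples to keep. The fixed threshold is an assumption that
    replaces the Bayesian decision described in the text."""
    return disagreement_scores(X, y, **kwargs) <= threshold


def make_noisy_blobs(n_per_class=100, flip_fraction=0.1, seed=0):
    """Synthetic two-class data with a known set of flipped labels."""
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal(-4.0, 1.0, (n_per_class, 2)),
                   rng.normal(4.0, 1.0, (n_per_class, 2))])
    y_clean = np.repeat([0, 1], n_per_class)
    n = 2 * n_per_class
    flipped = np.zeros(n, dtype=bool)
    flipped[rng.choice(n, size=int(flip_fraction * n), replace=False)] = True
    return X, np.where(flipped, 1 - y_clean, y_clean), flipped
```

A possible usage is `X, y, flipped = make_noisy_blobs()` followed by `keep = prune_mask(X, y)`, after which the final model would be trained on `X[keep]` and `y[keep]`. On this well-separated toy data the discarded examples should coincide almost entirely with the flipped labels; real data, such as face images, is far less clear-cut.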


Angelova, A., Abu-Mostafa, Y., Perona, P., Pruning Training Sets for Learning of Object Categories, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2005

Angelova, A., Data Pruning, Master's Thesis, CS Dept., California Institute of Technology, 2004

Contact info:

Anelia Angelova

anelia [at] vision [.] caltech [.] edu

CS Department, California Institute of Technology

1200 E. California Blvd. MC 136-93, Pasadena, CA, 91125, USA

Vision Group