OVERCOMING THE PROBLEM OF ULTRA-SMALL TRAINING SETS

 

WHY WE NEED A LARGE TRAINING SET

The training set of classified data is assumed to be an unbiased representation of the data we will encounter in use. How many examples we need of targets and non-targets depends on many things, such as the confidence and accuracy we require and the variability and difficulty of the discrimination problem, as discussed below.

 

We know of no tight lower bounds on the numbers of targets and non-targets we should have, but we do know of a provable upper bound. This is given by PAC (Probably Approximately Correct) learning theory (1). That theory is very complex, but we can offer a simplified summary here to show the problem.

 

Suppose we want to test a learned discrimination against actual new data. We might do 100 runs of 100 random examples each and count the number of errors in each run. If we claim 95% confidence of 98% accuracy, we mean that we expect at least 95 of the runs to have two or fewer errors. In PAC theory, we write

      Confidence = 1 - d

and

      Accuracy = 1 - e.

So, in this example d = 0.05 and e = 0.02. The number of examples relates to their inverses (20 and 50) and a measure of the set variability and problem difficulty called the Vapnik-Chervonenkis (VC) dimension. Roughly, we would require somewhat more than the product of the reciprocals, or 1000 samples.
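
To make the sample-size arithmetic concrete, the short Python sketch below works through the example above. The product-of-reciprocals figure comes straight from the text; the VC-style estimate at the end is an illustration only, using a hypothetical VC dimension of 10 and one commonly cited form of the bound, since the exact constants depend on which PAC result is applied.

    import math

    # Worked arithmetic for the example in the text:
    # 95% confidence and 98% accuracy.
    confidence = 0.95
    accuracy = 0.98

    delta = 1 - confidence      # 0.05, so 1/delta = 20
    epsilon = 1 - accuracy      # 0.02, so 1/epsilon = 50

    # "Somewhat more than the product of the reciprocals":
    rough_samples = (1 / epsilon) * (1 / delta)
    print(rough_samples)        # 1000.0, the figure quoted above

    # A commonly cited VC-style estimate grows roughly as
    # (1/epsilon) * (d * ln(1/epsilon) + ln(1/delta)).
    d = 10                      # hypothetical VC dimension, for illustration only
    vc_estimate = (1 / epsilon) * (d * math.log(1 / epsilon) + math.log(1 / delta))
    print(round(vc_estimate))   # on the order of 2000 samples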

 

In practice, we do not have that many training samples of real targets. Nor do we know that the assumption that this is an unbiased sample holds. By classical standards, we are in a lot of trouble.

 

GENERATING NEW DATA FROM OLD

 

Several approaches to generating more samples can be considered. Computer models can produce an unlimited number of samples, but experience suggests that systems trained on such samples do not fare well against real targets. The number criterion is met, but the unbiased-sample criterion is not.

 

Perhaps more promising is work the PI has done on extracting enough 3D information from a 2D scene to allow a view to be displayed from a slightly different perspective (2). This would add reality-based data to the real data, expanding the data set.
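
A rough sense of how such perspective-shifted samples could expand a small data set is sketched below. This is not the method of reference (2): instead of recovering 3D structure, it simply applies a small random planar homography (via OpenCV) to each image chip, which only approximates a slight change of viewpoint.

    import numpy as np
    import cv2  # OpenCV, used here only for the homography warp

    def jitter_viewpoint(image, max_shift=0.03, rng=None):
        """Warp a 2D image with a small random homography to mimic a
        slightly different viewpoint. A planar stand-in for the fuller
        2D-to-3D approach of reference (2), not that method itself."""
        rng = rng if rng is not None else np.random.default_rng()
        h, w = image.shape[:2]
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        # Move each corner by up to a few percent of the image size.
        jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
        jitter *= np.float32([w, h])
        H = cv2.getPerspectiveTransform(corners, corners + jitter)
        return cv2.warpPerspective(image, H, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Each real target chip could then contribute several warped variants:
    # augmented = [jitter_viewpoint(chip) for _ in range(5)]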

 

WORKING WITH THE LIMITED DATA WE HAVE

 

The primary approach here is to allow each sample to represent a much larger part of the decision space than it normally does. That brings in the need to discuss Vapnik's better-known contribution (3), the Support Vector Machine (SVM). Our work borrows several concepts from the SVM and generalizes and extends them.
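
To show the property we build on, the sketch below fits a standard linear SVM (via scikit-learn) on a toy two-class set; it illustrates ordinary SVM behavior only, not our generalization of it. The decision boundary is determined entirely by the few support vectors, so each of those samples effectively speaks for a large region of the decision space.

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class set: each row is a feature vector, label 1 = target.
    X = np.array([[0.2, 0.1], [0.4, 0.3], [0.3, 0.5],    # non-targets
                  [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]])   # targets
    y = np.array([0, 0, 0, 1, 1, 1])

    # Maximum-margin classifier: only the support vectors define the boundary.
    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    print(clf.support_vectors_)                    # the samples that carry the boundary
    print(clf.predict([[0.5, 0.4], [0.85, 0.8]]))  # classify two new points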