TRAINING SET DATA

 

Sometimes we have little choice in the quality and quantity of the labeled examples of the objects of interest, and of the other objects that can cause confusion, that appear in the training set. But when we do have a choice, the two main concerns are the fairness and the number of samples. It is important to remember what we are trying to do. From a given set of examples and counterexamples, we are seeking to build a kind of data compression system. Its job is to compress the data into only two categories, A and not-A in the simplest case. How many examples of each type we need depends on two things: how much within-class variation there is and how hard the classes are to separate. If my task is to recognize the letter A as produced by “Times New Roman,” there is so little variation that I may not need many samples to do the job. I need more cat examples to tell cats from dogs than I do to tell cats from catamarans.
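A practical way to judge whether the sample count is adequate is an empirical learning curve: train on increasing subsets and watch where the validation score stops improving. The sketch below is only an illustration, assuming scikit-learn is available; the synthetic data from make_classification stands in for a real A / not-A problem.

# Empirical learning curve: train on growing subsets of the data and see
# where the validation score flattens out. An easily separated pair of
# classes flattens much sooner than one with large within-class variation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real labeled training set.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> validation accuracy {score:.3f}")
# Once the curve is flat, adding more samples of the same kind buys little.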

 

Fair samples must represent the class of interest well. Complex sets like “cats” are harder to represent fairly than a subclass such as “Siamese cats.” All poses that are likely to be encountered must be represented. If cats are likely to be partially obscured by rocks, sofas, or dog mouths, then samples of partially obscured cats must be included as well. If you simply “hand pick” the samples, you introduce biases. The best approach is to gather as many examples as possible under realistic conditions.
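One way to avoid hand-picking bias, once a large pool has been gathered under realistic conditions, is to tag each example by condition (pose, occlusion, and so on) and draw the training set by stratified random sampling. The sketch below uses a hypothetical pool of file names; in practice it would list the real collection.

# Draw the training set at random from a tagged pool instead of hand-picking,
# so every condition (e.g., clear vs. partially occluded) stays represented.
import random

pool = [
    {"file": "cat_0001.png", "condition": "clear"},
    {"file": "cat_0002.png", "condition": "occluded"},
    {"file": "cat_0003.png", "condition": "clear"},
    {"file": "cat_0004.png", "condition": "occluded"},
    # ... thousands more in a real collection
]

def stratified_sample(pool, per_condition, seed=0):
    """Draw the same number of examples from each condition at random."""
    rng = random.Random(seed)
    by_condition = {}
    for item in pool:
        by_condition.setdefault(item["condition"], []).append(item)
    chosen = []
    for items in by_condition.values():
        rng.shuffle(items)
        chosen.extend(items[:per_condition])
    return chosen

training_set = stratified_sample(pool, per_condition=2)
print(training_set)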

 

The number of samples is a critical issue. In the case of military ATR (Automatic Target Recognition), large numbers of real images of the kind military systems will see in the field are simply not available. Data can be “manufactured” in the computer by simulation or by morphing of actual data. Both approaches are very dangerous. How many examples do we need? There are very complex theories that seek to answer that question.
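To make the “morphing of actual data” idea concrete, the sketch below perturbs a single image array with rotations, mirroring, and noise. The 8x8 array is a stand-in for a real sensor image; the danger noted above is that every variant inherits the biases of the original, so anything trained on such data still has to be validated against real field imagery.

# "Manufacture" extra samples by morphing one real example: small geometric
# and noise perturbations of an existing image array.
import numpy as np

rng = np.random.default_rng(0)
real_image = rng.random((8, 8))        # placeholder for one real example

def morph_variants(image, n_variants=4, noise_scale=0.05):
    """Generate perturbed copies: rotations, mirroring, additive noise."""
    variants = []
    for i in range(n_variants):
        v = np.rot90(image, k=i % 4)          # rotate by 0/90/180/270 degrees
        if i % 2 == 1:
            v = np.fliplr(v)                  # mirror every other variant
        v = v + rng.normal(0.0, noise_scale, v.shape)  # mild sensor-like noise
        variants.append(np.clip(v, 0.0, 1.0))
    return variants

augmented = morph_variants(real_image)
print(f"made {len(augmented)} synthetic variants from 1 real image")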

 

The most famous is PAC (Probably Approximately Correct) learning (see, for example, the text at http://www.amazon.com/exec/obidos/ASIN/0471030031/).
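PAC learning answers “how many examples do we need?” in terms of an error tolerance and a confidence level. As one illustration, a standard bound for a finite hypothesis class with a learner that fits the training data exactly (the realizable case) says that m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice for error at most epsilon with probability at least 1 - delta. The numbers in the sketch below (a million candidate classifiers, 5% error, 95% confidence) are made up for the example.

# Sample-size bound for a consistent learner over a finite hypothesis class.
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for error <= epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Example: 1 million candidate classifiers, 5% error, 95% confidence.
print(pac_sample_bound(hypothesis_count=10**6, epsilon=0.05, delta=0.05))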