Terminology

 

 

What are the design set, validation set, and test set?

 

Here are some helpful remarks from:

http://www.faqs.org/faqs/ai-faq/neural-nets/part1/section-14.html

 

There seems to be no term in the NN literature for the set of all cases that
you want to be able to generalize to. Statisticians call this set the
"population". Tsypkin (1971) called it the "grand truth distribution," but
this term has never caught on. Neither is there a consistent term in the NN
literature for the set of cases that are available for training and
evaluating an NN. Statisticians call this set the "sample". The sample is
usually a subset of the population. 
 
(Neurobiologists mean something entirely different by "population,"
apparently some collection of neurons, but I have never found out the exact
meaning. I am going to continue to use "population" in the statistical sense
until NN researchers reach a consensus on some other terms for "population"
and "sample"; I suspect this will never happen.) 
 
In NN methodology, the sample is often subdivided into "training",
"validation", and "test" sets. The distinctions among these subsets are
crucial, but the terms "validation" and "test" sets are often confused.
There is no book in the NN literature more authoritative than Ripley (1996),
from which the following definitions are taken (p.354): 
 
Training set: 
   A set of examples used for learning, that is to fit the parameters [i.e.,
   weights] of the classifier. 
Validation set: 
   A set of examples used to tune the parameters [i.e., architecture, not
   weights] of a classifier, for example to choose the number of hidden
   units in a neural network. 
Test set: 
   A set of examples used only to assess the performance [generalization] of
   a fully-specified classifier. 
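
As a rough illustration of these three definitions, the sample can be shuffled
once and cut into three disjoint subsets. This is only a sketch: the function
name split_sample, the 60/20/20 fractions, and the fixed seed are assumptions
made for the example, not part of Ripley's definitions.

```python
import random

def split_sample(cases, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle `cases` and cut them into training, validation, and test sets.

    The default 60/20/20 fractions are an illustrative assumption.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    training = shuffled[:n_train]                     # fit the weights
    validation = shuffled[n_train:n_train + n_valid]  # tune the architecture
    test = shuffled[n_train + n_valid:]               # assess the final classifier only
    return training, validation, test

training, validation, test = split_sample(range(100))
print(len(training), len(validation), len(test))
```

The three subsets are disjoint and together cover the whole sample, so no case
used to fit or tune the classifier can leak into the test set.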
 
Bishop (1995), another indispensable reference on neural networks, provides
the following explanation (p. 372): 
 
   Since our goal is to find the network having the best performance on
   new data, the simplest approach to the comparison of different
   networks is to evaluate the error function using data which is
   independent of that used for training. Various networks are trained
   by minimization of an appropriate error function defined with respect
   to a training data set. The performance of the networks is then
   compared by evaluating the error function using an independent 
   validation set, and the network having the smallest error with
   respect to the validation set is selected. This approach is called
   the hold out method. Since this procedure can itself lead to some
   overfitting to the validation set, the performance of the selected
   network should be confirmed by measuring its performance on a third
   independent set of data called a test set. 
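
The hold out method that Bishop describes can be sketched in a few lines. To
keep the example short and dependency-free, the "networks" here are replaced by
k-nearest-neighbour regressors with different values of k, and the data are
noisy samples of sin(x); both are assumptions made for illustration, as are all
function names.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of sin(x); a toy stand-in for real data."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

def knn_predict(train, k, x):
    """Average the targets of the k training cases nearest to x."""
    nearest = sorted(train, key=lambda case: abs(case[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, k, data):
    """Error function evaluated on `data` for the model built from `train`."""
    return sum((knn_predict(train, k, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
training = make_data(200, rng)
validation = make_data(100, rng)
test = make_data(100, rng)

candidates = [1, 5, 20, 100]  # one hypothetical "network" per value of k

# Compare the candidates on the validation set and select the smallest error.
validation_error = {k: mean_squared_error(training, k, validation) for k in candidates}
best_k = min(validation_error, key=validation_error.get)

# Confirm the selected candidate on the third, independent set.
test_error = mean_squared_error(training, best_k, test)
print(best_k, validation_error[best_k], test_error)
```

The test set is touched exactly once, after the selection has been made.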
 
The literature on machine learning often reverses the meaning of
"validation" and "test" sets. This is the most blatant example of the
terminological confusion that pervades artificial intelligence research. 
 
The crucial point is that a test set, by the standard definition in the NN
literature, is never used to choose among two or more networks, so that the
error on the test set provides an unbiased estimate of the generalization
error
(assuming that the test set is representative of the population, etc.). Any
data set that is used to choose the best of two or more networks is, by
definition, a validation set, and the error of the chosen network on the
validation set is optimistically biased. 
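
A small simulation can make the optimistic bias visible. The sketch below
assumes, purely for illustration, 10 networks that all have the same true error
rate of 0.3 and whose errors on different cases are independent, so any
difference among them on the validation set is noise. All names and numbers are
hypothetical.

```python
import random

def error_rate(n_cases, true_error, rng):
    """Observed error rate of a classifier whose true error rate is `true_error`."""
    return sum(rng.random() < true_error for _ in range(n_cases)) / n_cases

rng = random.Random(0)
n_networks, n_cases, true_error, n_trials = 10, 50, 0.3, 2000

chosen_validation, chosen_test = [], []
for _ in range(n_trials):
    validation_errors = [error_rate(n_cases, true_error, rng) for _ in range(n_networks)]
    # Validation error of the network that was chosen for having the smallest one.
    chosen_validation.append(min(validation_errors))
    # Error of that same network on fresh test cases (its true rate is unchanged).
    chosen_test.append(error_rate(n_cases, true_error, rng))

mean_validation = sum(chosen_validation) / n_trials
mean_test = sum(chosen_test) / n_trials
print(mean_validation, mean_test)
```

Under these assumptions the average validation error of the chosen network
falls well below 0.3, while its average test error stays close to 0.3.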
 
There is a problem with the usual distinction between training and
validation sets. Some training approaches, such as early stopping, require a
validation set, so in a sense, the validation set is used for training.
Other approaches, such as maximum likelihood, do not inherently require a
validation set. So the "training" set for maximum likelihood might encompass
both the "training" and "validation" sets for early stopping. Greg Heath has
suggested that the term "design" set be used for cases that serve solely to
adjust the weights in a network, and that "training" set be used to encompass
both design and validation sets. There is considerable merit to this
suggestion, but it has not yet been widely adopted. 
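
The division of labour in early stopping can be sketched as follows. A degree-9
polynomial fitted by gradient descent stands in for a network, and the
"patience" stopping rule (stop once the validation error has not improved for a
fixed number of epochs) is one common choice among several; both are
assumptions made for this example, as are all names and constants.

```python
import math
import random

def features(x, degree=9):
    return [x ** j for j in range(degree + 1)]

def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def train_early_stopping(design, validation, lr=0.05, max_epochs=2000, patience=50):
    weights = [0.0] * 10
    best_weights, best_error, best_epoch = list(weights), mse(weights, validation), 0
    for epoch in range(1, max_epochs + 1):
        # Gradient step: only the design set adjusts the weights.
        grad = [0.0] * len(weights)
        for x, y in design:
            residual = predict(weights, x) - y
            for j, f in enumerate(features(x)):
                grad[j] += 2.0 * residual * f / len(design)
        weights = [w - lr * g for w, g in zip(weights, grad)]
        # Stopping rule: only the validation set decides when to stop.
        error = mse(weights, validation)
        if error < best_error:
            best_weights, best_error, best_epoch = list(weights), error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_weights, best_error, best_epoch

rng = random.Random(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1]; a toy stand-in for real data."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, math.sin(math.pi * x) + rng.gauss(0.0, 0.3)) for x in xs]

design, validation = make_data(20), make_data(20)
weights, stop_error, stop_epoch = train_early_stopping(design, validation)
print(stop_epoch, stop_error)
```

In Heath's terms, `design` and `validation` together make up the training set
here: neither can be reused as a test set, because both influenced the weights
that are returned.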
 
But things can get more complicated. Suppose you want to train nets with 5,
10, and 20 hidden units using maximum likelihood, and you want to train
nets with 20 and 50 hidden units using early stopping. You also want to use
a validation set to choose the best of these various networks. Should you
use the same validation set for early stopping that you use for the final
network choice, or should you use two separate validation sets? That is, you
could divide the sample into 3 subsets, say A, B, C and proceed as follows: 
 
 o Do maximum likelihood using A. 
 o Do early stopping with A to adjust the weights and B to decide when to
   stop (this makes B a validation set). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on B (the
   validation set). 
 o Estimate the generalization error of the chosen network using C (the test
   set). 
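
The three-subset procedure above can be written as a short skeleton. The
callables fit_ml, fit_es, and error are hypothetical placeholders for whatever
training and evaluation routines you actually use; the toy stand-ins below only
record which subsets each step sees and return a made-up error.

```python
def three_subset_protocol(A, B, C, fit_ml, fit_es, error,
                          ml_sizes=(5, 10, 20), es_sizes=(20, 50)):
    """B serves both early stopping and the final choice; C is the test set.

    fit_ml(data, hidden), fit_es(design, stop, hidden) and error(net, data)
    are hypothetical callables supplied by the caller.
    """
    nets = [fit_ml(A, h) for h in ml_sizes]            # maximum likelihood on A
    nets += [fit_es(A, B, h) for h in es_sizes]        # A adjusts weights, B stops
    chosen = min(nets, key=lambda net: error(net, B))  # choose among all 5 on B
    return chosen, error(chosen, C)                    # C is used once, at the end

# Toy stand-ins so the skeleton runs; they do no real training.
seen = []

def fit_ml(data, hidden):
    seen.append(("ml", data))
    return ("ml", hidden)

def fit_es(design, stop, hidden):
    seen.append(("es", design, stop))
    return ("es", hidden)

def error(net, data):
    return 1.0 / net[1]  # made-up error: more hidden units, lower error

chosen, test_error = three_subset_protocol("A", "B", "C", fit_ml, fit_es, error)
print(chosen, test_error)
```

The point of the skeleton is the data flow: subset C never reaches a fitting
routine or the selection step.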
 
Or you could divide the sample into 4 subsets, say A, B, C, and D and
proceed as follows: 
 
 o Do maximum likelihood using A and B combined. 
 o Do early stopping with A to adjust the weights and B to decide when to
   stop (this makes B a validation set with respect to early stopping). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (this makes C
   a second validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 
 
Or, with the same 4 subsets, you could take a third approach: 
 
 o Do maximum likelihood using A. 
 o Choose among the 3 nets trained by maximum likelihood based on the error
   computed on B (the first validation set). 
 o Do early stopping with A to adjust the weights and B (the first
   validation set) to decide when to stop. 
 o Choose among the best net trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (the second
   validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 
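
The third approach differs from the others in that selection happens in two
stages, on two different validation sets. A skeleton under the same caveats
(fit_ml, fit_es, and error are hypothetical callables, and the error table is
invented so that the example runs) might look like this:

```python
def two_stage_protocol(A, B, C, D, fit_ml, fit_es, error,
                       ml_sizes=(5, 10, 20), es_sizes=(20, 50)):
    """Two-stage choice: B picks the best ML net, C makes the final choice."""
    ml_nets = [fit_ml(A, h) for h in ml_sizes]               # maximum likelihood on A
    best_ml = min(ml_nets, key=lambda net: error(net, B))    # first choice, on B
    es_nets = [fit_es(A, B, h) for h in es_sizes]            # A adjusts weights, B stops
    finalists = [best_ml] + es_nets
    chosen = min(finalists, key=lambda net: error(net, C))   # second choice, on C
    return chosen, error(chosen, D)                          # D is the test set

# Toy stand-ins: nets are just labels, and errors come from an invented table.
def fit_ml(data, hidden):
    return f"ml{hidden}"

def fit_es(design, stop, hidden):
    return f"es{hidden}"

made_up_errors = {
    "B": {"ml5": 0.30, "ml10": 0.25, "ml20": 0.28},
    "C": {"ml10": 0.27, "es20": 0.26, "es50": 0.29},
    "D": {"es20": 0.31},
}

def error(net, data):
    return made_up_errors[data][net]

chosen, test_error = two_stage_protocol("A", "B", "C", "D", fit_ml, fit_es, error)
print(chosen, test_error)
```

Only three nets are ever scored on C, and only the final choice is scored on D;
the table has no other entries, so any extra lookup would raise a KeyError.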
 
You could argue that the first approach is biased towards choosing a net
trained by early stopping. Early stopping involves a choice among a
potentially large number of networks, and therefore provides more
opportunity for overfitting the validation set than does the choice among
only 3 networks trained by maximum likelihood. Hence if you make the final
choice of networks using the same validation set (B) that was used for early
stopping, you give an unfair advantage to early stopping. If you are writing
an article to compare various training methods, this bias could be a serious
flaw. But if you are using NNs for some practical application, this bias
might not matter at all, since you obtain an honest estimate of
generalization error using C. 
 
You could also argue that the second and third approaches are too wasteful
in their use of data. This objection could be important if your sample
contains 100 cases, but will probably be of little concern if your sample
contains 100,000,000 cases. For small samples, there are other methods that
make more efficient use of data; see "What are cross-validation and
bootstrapping?" 
 
References: 
 
   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 
 
   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 
 
   Tsypkin, Y. (1971), Adaptation and Learning in Automatic Systems, NY:
   Academic Press.