SPECIALIZED ALGORITHMS FOR STATISTICAL PATTERN RECOGNITION (EXAMPLES)

The following list of methods is far from exhaustive, but it does include most of the widely used methods.

NEURAL NETWORKS

There is no one thing called neural network pattern recognition. There are literally hundreds of them: some good, some easy, some neither. They have only a few things in common.

The steps in using a neural network system are

The number of user choices alerts us to the breadth of this approach.

 

It is often used for one or more of the following reasons

Exceptions exist to every general remark made here.

 

Some neural networks are said to be “self-organizing.” That is, you throw data at them and they automatically organize the data into categories that can then be used to classify future data as well. These are not magic, however. They are simply neural approaches to something called “clustering.” A human sets out the rules for how disparate the data may be and how to measure that disparity. All the computer does is work out the implications of those arbitrary rules on a new data set.
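Underneath the neural packaging, this kind of clustering can be illustrated in a few lines of code. The following is a minimal sketch, assuming Python with numpy; the choice of k-means, of Euclidean distance as the disparity measure, and of the number of clusters k are exactly the “arbitrary rules” a human must supply, not something prescribed by the text above.

    import numpy as np

    def kmeans(data, k, iterations=100, seed=0):
        # data: (n, d) numpy array; the human-chosen rules are k categories
        # and Euclidean distance as the measure of disparity.
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(iterations):
            # Assign every point to its nearest center.
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each center to the mean of the points assigned to it.
            new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers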

 

NEAREST NEIGHBOR

 

This is one of the simplest and most powerful algorithms, and it makes quite explicit what is happening. By contrast, it is hard to explain in words what a well-trained neural network is doing; we can only say whether or not it performs well.

 

In the nearest neighbor method, we define a measure of the distance between two data vectors. For simplicity, the Euclidean distance is usually chosen.

 

In the simplest nearest neighbor classifier, each new data point receives the label of its nearest neighbor in the training set.
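As a minimal sketch (Python with numpy; the array names and shapes are illustrative assumptions, not taken from the text), the whole classifier is a few lines, using the Euclidean distance mentioned above:

    import numpy as np

    def nearest_neighbor_label(x, train_points, train_labels):
        # Euclidean distance from x to every training point; return the closest point's label.
        distances = np.linalg.norm(train_points - x, axis=1)
        return train_labels[np.argmin(distances)]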

 

A more sophisticated version is usually called the “N-nearest-neighbor” (or k-nearest-neighbor) rule. We pick an odd number N. Then we find the N training points nearest to the new data point and take a majority vote on which label to give it. This approach tends to discount wild, isolated points in the training data. Of course, the vote becomes problematic if we have more than two classes of data, because choosing N odd no longer rules out ties.
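The majority-vote version differs from the sketch above only in keeping the N closest points. A hedged sketch along the same lines (again Python with numpy; the ties noted above can still occur when there are more than two classes):

    from collections import Counter
    import numpy as np

    def n_nearest_label(x, train_points, train_labels, n=3):
        # Indices of the n training points closest to x (n odd to avoid ties with two classes).
        nearest = np.argsort(np.linalg.norm(train_points - x, axis=1))[:n]
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]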

 

We have developed a much more sophisticated extension of nearest neighbor methods that handles multiple classes as easily as single classes and minimizes errors in a way not described elsewhere.

 

It would obviously be better if we replaced an arbitrary distance with a distance related to the known statistical properties of the underlying variabilities. Sometimes we do not know or cannot easily estimate those statistics, but when we can, we use the quantity r in

               r² = (x − mx)ᵀ Cx⁻¹ (x − mx),

which is called the Mahalanobis distance from the feature vector x to the mean vector mx, where Cx is the covariance matrix for x. It can be shown that the surfaces on which r is constant are ellipsoids centered about the mean mx. In the special case where the features are uncorrelated and the variances in all directions are the same, these surfaces are spheres, and the Mahalanobis distance reduces (up to a scale factor) to the Euclidean distance. The same point will also have a Mahalanobis distance from the points of class Y, and so on.
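A small sketch of the computation (Python with numpy; estimating mx and Cx from samples of one class is an assumption about how they would be obtained in practice):

    import numpy as np

    def mahalanobis(x, class_samples):
        # Mean vector mx and covariance matrix Cx estimated from samples of one class.
        mx = class_samples.mean(axis=0)
        Cx = np.cov(class_samples, rowvar=False)
        diff = x - mx
        # r as defined above: r^2 = (x - mx)^T Cx^-1 (x - mx)
        return float(np.sqrt(diff @ np.linalg.inv(Cx) @ diff))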

 

 

COMPUTATION OF DISCRIMINANTS

 

We calculate numbers of the form

               N = a + Σᵢ bᵢdᵢ + Σᵢⱼ cᵢⱼdᵢdⱼ + …

Here, dᵢ is the ith component of a data vector, and the b’s, c’s, etc. are weights to be learned. If we set all but the first two terms to zero, we have a linear discriminant. These are very easy to train and apply. Unfortunately, most problems are not linearly discriminable. Two types of “fixes” are used. First, we can train a sequence of linear discriminants. The prototypical means of doing this is Principal Component Analysis (http://www.amazon.com/exec/obidos/ASIN/047140540X/qid=1019226973/sr=2-1/ref=sr_2_1/103-7785010-9361431). It is widely used but provably inadequate for really hard problems. We have developed an extremely powerful way of doing this (H. John Caulfield, “Independent task Fourier filters,” Opt. Eng. 40, 2414-2418 (2001)). Second, we can train nonlinear discriminants. This is much harder to do but results in much better discrimination.
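For concreteness, a minimal sketch of evaluating such a discriminant (Python with numpy; the threshold-at-zero decision rule is an illustrative assumption, not part of the text above):

    import numpy as np

    def linear_discriminant(d, a, b):
        # Keeps only the first two terms: N = a + sum_i b_i d_i
        return a + np.dot(b, d)

    def quadratic_discriminant(d, a, b, C):
        # Adds the second-order term sum_ij c_ij d_i d_j
        return a + np.dot(b, d) + d @ C @ d

    def classify(N, threshold=0.0):
        # Assign class 1 if the discriminant exceeds the threshold, else class 0.
        return int(N > threshold)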

               

 

FOURIER FILTERING

 

A Fourier transform is an invertible linear transform with a wonderful property. The Fourier transform of an input pattern has a shape and location in the Fourier-transform domain that are independent of the object’s location in the input plane. The location of the object in the input is encoded as phase information that does not affect the shape or location of the Fourier-transform pattern. This means we can place a filter in the Fourier plane that favors one class of objects while discriminating against others. Inverting the filtered Fourier transform carries us back to the input domain, but the signals from target objects (wherever they occur in the input) are now stronger than the signals from other objects. This parallel, shift-invariant searching for patterns is invaluable.
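A hedged sketch of that shift-invariant search (Python with numpy, using a classical matched filter; the digital FFT here stands in for the coherent-optics implementation mentioned below, and the image and template arrays are placeholders):

    import numpy as np

    def matched_filter_response(image, template):
        # Correlate the image with the template via the Fourier plane.
        # Peaks in the output mark likely target locations, wherever the target sits.
        F_image = np.fft.fft2(image)
        # The conjugate of the template's transform is the classical matched filter.
        F_filter = np.conj(np.fft.fft2(template, s=image.shape))
        response = np.fft.ifft2(F_image * F_filter)
        return np.abs(response)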

 

Another useful aspect of Fourier filtering is that it is easy to do with coherent optics.

 

The main problem is that the Fourier filter simply produces, in its output, linear combinations of the Fourier-transform-plane features. It is a linear discriminant and hence inadequate for most hard problems, as these are seldom linearly discriminable. For a “fix” that handles nonlinearly discriminable patterns very well while preserving the shift-invariant properties we like so much, see the discussion of discriminants above.

 

HYPERSPACE SEPARATION SURFACES

 

We imagine that the data vector is really the list of components of a point in some hyperspace. Each measurement generates a new point. The objective is to set up decision surfaces within that hyperspace that reliably classify new measurements into their proper classes. The most powerful way to do this (until recently) was the SVM (Support Vector Machine) invented by Vladimir Vapnik (http://www.amazon.com/exec/obidos/ASIN/0521780195/qid=1019229982/sr=1-1/ref=sr_1_1/103-7785010-9361431).

 

The data points (vectors) for all members of the training set are plotted. Then the shape of the separation surface (flat, elliptical, etc.) is chosen. Most data points do not (or should not) affect the details of the decision surface, because they are not the “borderline,” or “support,” vectors. Given the support vectors, the SVM computes (in a very laborious computation) the best parameters of the decision surface. “Best” means that it gives the optimum “margin” between classes, that is, the best separation of the support vectors. The larger the margin, the better the “generalization,” meaning the ability to classify well new data that were not in the training set. Our improved approach (see the discussion of discriminants) sets those margins at whatever value we choose. There is a price to be paid in the number of computations needed to make a decision, but it allows excellent discrimination.
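As a hedged illustration (using scikit-learn, which is an assumed dependency and not something the text prescribes, on a toy training set invented purely for illustration), a linear SVM makes the support vectors and the margin concrete:

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class training set (invented for illustration).
    X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                  [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear", C=1.0)   # flat (hyperplane) separation surface
    clf.fit(X, y)

    print(clf.support_vectors_)         # the borderline points that fix the surface
    print(clf.predict([[1.5, 1.5]]))    # classify a new, un-trained-on point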