Light Constructions - New neural network enables machines to anticipate what they see
Researchers in London, UK, have developed a new kind of neural network for vision and other machine-understanding tasks. The technique has the advantage of first separating the problem into perception and recognition tasks, which means the neural network can be trained with less outside intervention. In addition, the perception network developed is both biologically valid and much more efficient than its predecessors. The approach, called the Product of Experts, was developed by a team led by Geoffrey Hinton at the Gatsby Computational Neuroscience Unit, University College London (UK). If the work is successful, the network models produced may help us not only to build machines that can see, but also to understand our own vision systems better.
Neural networks are systems where the processing and storage of information are performed together. In engineering terms, conventional networks can simply be thought of as complex filters that take incoming signals A, B, and C (which could, for instance, be images of faces) and give out the right answers X, Y, and Z (which could be names) respectively. They have major advantages in real world applications such as face recognition because A, B, and C cannot be known exactly in advance: faces change with age, lighting, etc. and no person will present the exact same image to a camera twice. Unlike more conventional algorithmic systems, neural networks reconfigure themselves based on incoming data: they learn that various different images map to X and find their similarities, while at the same time determining how they are different from those images that map to Y. This learning is what makes them so powerful.
Structurally, neural networks are both simple and complicated. They consist of a number of neurons, or processing elements, that sum incoming data and then apply some function (such as the sigmoid function) to the result. In the most common type of backpropagation network, the neurons in the first layer take their information from the outside world: in an image-processing application, this means that each neuron looks at the signal coming in from a single pixel. The next layer, the hidden layer, can have any number of neurons, and each of these is connected to all of the pixels in the first layer. These neurons are then connected to the output. During training, after the image signals have propagated through the network, being processed by the neurons and attenuated or amplified by the interconnection weights, the "answer" is compared with the label already assigned to the data. The network learns by changing the weights, or strengths, of the various neural connections to minimize the difference between the two, and it is how well this process works that makes one neural network architecture or configuration better than another.
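The training loop described above can be sketched in a few lines. This is a minimal illustration, not the system discussed in the article: the two-pixel XOR "images", layer sizes, learning rate, and random seed are all arbitrary choices made for the example.

```python
# Minimal backpropagation sketch: one hidden layer of sigmoid neurons,
# weights adjusted to shrink the gap between the network's "answer"
# and the label attached to each training example.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy "images": four two-pixel inputs with XOR labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4))   # pixel -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
lr = 1.0

losses = []
for _ in range(5000):
    h = sigmoid(X @ W1)          # hidden-layer activations
    out = sigmoid(h @ W2)        # the network's "answer"
    err = out - y                # compare answer with the label
    losses.append(float(np.mean(err ** 2)))
    # Backpropagate: nudge every weight to reduce the error.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h

print(losses[0], losses[-1])     # error falls as the weights adapt
```

The key point is the last two weight updates: learning is nothing more than repeatedly adjusting connection strengths in the direction that shrinks the answer-versus-label difference.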
The problem with this basic approach is that supervised learning (teaching) is required: the neural network is shown many examples of objects that have been labeled in advance, and so eventually begins to associate the input (a face) with the label (a name). It is known as response learning because it directly links the inputs with the outputs. From a practical point of view, it is inefficient: not only is a lot of training data necessary to fully represent the "fuzziness" of the problem at hand, but all of that training data has to be labeled somehow -- presumably by a human.
Hinton and his colleagues [1] have chosen to concentrate on another approach: perceptual learning. Instead of learning to recognize images, the network's job is simply to learn to be good at perceiving a given type of data without (at this point) assigning any meaning to it. An example of this kind of learning in humans is the fact that people brought up to speak different languages are better at distinguishing between different sets of sounds, even outside the context of a meaningful word or sentence. All the perceptual neural net does is decompose the data into a particular combination of features: the better adapted the feature set, which is stored in the hidden-unit interconnection weights, the more efficient and accurate the network will be for that problem.
By using this perceptual learning as a "front end" to a pattern recognition system, the second part -- the response learning -- becomes a more tractable problem. If the perceptual network has done its job well, a class of objects should now be represented by a relatively small number of feature combinations (compared with the number of images that went into defining the features). Essentially, because the fuzziness of the original images has been encapsulated in the hidden units, much less labeled training data should be necessary for the second stage.

Finding a cause
One way of determining whether a network is good at perceiving incoming data is to look and see what kind of data it would generate. So-called generative models, where the neural network is essentially run backwards and the hidden units (features) are stimulated to produce their own "input" have the advantage that they can show what a network "believes in."
Unfortunately, the more powerful nonlinear generative models have traditionally been difficult to work with. They fall into two classes, both of which have their disadvantages. The first, known as causal models, can be compared to computer graphics. Though it is easy to generate pictures from, say, a 3D model, it is not easy to reconstruct the 3D model from those pictures. In fact, this is why machine vision is so difficult in the first place. There are all sorts of ambiguities: what goes with what (image segmentation); the size of objects (as opposed to how far away they are); where occlusion is taking place. With such models, it is easy to generate images, yet difficult to extract meaning from those images afterwards.
Hinton's approach, called the Product of Experts (POE), had always been thought to have the opposite problem. Here, in order to generate a dream image from scratch, every hidden unit must agree on every pixel. This is ideal for inferring meaning later, because only certain features could possibly have created the resulting image: there is no ambiguity. To achieve this, each hidden unit is either active or inactive based on its own statistics and, if "on," it then "votes" on whether each pixel should be on or off based on the relevant interconnection weights. If the vote is not unanimous, the dice are effectively rolled again to select which hidden units should be active, until a set is found that can agree on what image should be produced.

Being practical
This is less impossible to achieve than it sounds, because not all of the hidden units care about every pixel in the image: they specialize in detecting certain things and ignore others. Therefore, some features will be complementary to others and will be able to "co-exist." On the other hand, it is still very difficult to find those winning combinations in the first place, which is why such models were thought to be impractical. Here, the analogy is that of trying to prove something mathematically: figuring out how to get to an answer from scratch is hard.
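The search for a set of features that can agree on an image can be sketched as alternating sampling in a small binary network. Everything here is an illustrative assumption -- the random weights, the layer sizes, and the fixed number of back-and-forth rounds stand in for a trained model:

```python
# Sketch of the "voting" idea: each active hidden unit contributes its
# weights to every pixel, the pixel's on-probability is the sigmoid of
# the summed votes, and alternating between picking pixels and re-picking
# features searches for a combination the features can all live with.
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_hidden = 16, 4
W = rng.normal(size=(n_pixels, n_hidden))   # pixel-feature weights (untrained)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = (rng.random(n_hidden) < 0.5).astype(float)        # random initial features
for _ in range(20):                                   # re-roll until settled
    p_v = sigmoid(W @ h)                              # summed votes per pixel
    v = (rng.random(n_pixels) < p_v).astype(float)    # sample a "dream" image
    p_h = sigmoid(W.T @ v)                            # re-pick features given it
    h = (rng.random(n_hidden) < p_h).astype(float)

print(v)   # one fantasy image the surviving feature set jointly produced
```

With random weights the result is noise; the point of the learning procedure described next is to shape the weights so that the images this loop settles on look like the training data.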
What Hinton realized was that there is no need to start from scratch. Again, as with a mathematical proof, it is relatively easy to figure out how to get to the answer if you have the answer in front of you (this, in fact, is why such systems are good at inferring meaning from data even though they are not good at generating it). Hinton was able to exploit this in the learning algorithm for a POE system he calls a Restricted Boltzmann Machine (RBM). Data comes in through the input, which excites various units (features) in the hidden layer. These are then used to generate an image, which is compared with the original data to produce an error signal, which in turn is used to update the interconnection weights between the input pixels and hidden units.
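The data-in, reconstruct, compare, update loop can be sketched as follows. This is a simplified illustration of the published contrastive-divergence idea, not Hinton's actual code: the toy pixel patterns, the absence of bias terms, the use of probabilities instead of stochastic samples, and the learning rate are all assumptions made to keep the example small and deterministic.

```python
# Contrastive-divergence-style sketch for a tiny RBM: data excites
# features, the features regenerate an image, and the weights are nudged
# toward the data's statistics and away from the reconstruction's.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy training "images": 4-pixel patterns with two obvious regularities.
data = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(2)
W = 0.1 * rng.normal(size=(4, 3))   # 4 pixels, 3 hidden features
lr = 0.5

def recon_error(W):
    recon = sigmoid(sigmoid(data @ W) @ W.T)
    return float(np.mean((data - recon) ** 2))

err_before = recon_error(W)
for _ in range(500):
    p_h = sigmoid(data @ W)           # positive phase: data excites features
    recon = sigmoid(p_h @ W.T)        # negative phase: features generate image
    p_h2 = sigmoid(recon @ W)         # features excited by the reconstruction
    # Strengthen pixel-feature correlations seen in the data,
    # weaken those the model produced on its own.
    W += lr * (data.T @ p_h - recon.T @ p_h2) / len(data)
err_after = recon_error(W)

print(err_before, err_after)   # reconstruction improves as features adapt
```

The essential trick is visible in the weight update: the network never has to dream up an image from nothing during learning -- it always starts from the "answer" the data provides.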
This way, with each new piece of data, the feature set is refined to better generate images like those in the training set. Effectively, the neural network is accepting the excited hidden units, caused by the data, as the "answer" to the question "which features should I use to generate this image," and then trying to optimize the features that have been turned on to create a better image next time. Over time, the system efficiently creates a set of optimized features that can accurately regenerate the training data, and that allows easy inference between generated image and feature combination.

Biological validity
In experiments where the technique was applied to both handwriting recognition (Figure 1) and face recognition, detectors similar to those found in our own early vision systems -- such as those that look for lines or edges at different orientations -- emerged naturally as hidden units. The fact that many of the feature detectors look alike despite being specialized (which is also true of human feature detectors) is particular to the POE approach. In causal models, the features or hidden units are independent and compete to affect the image: like features are therefore less apt to co-exist. In Hinton's networks, because the features work co-operatively, it is natural for many to evolve that affect the image in a similar way.
Other ways in which Hinton's approach is biologically credible include its speed and the fact that it doesn't require synapses (neural interconnections) to work backwards (as is necessary for backpropagation). In the POE system, the data, hidden units, reconstruction, and weight modification operate in a loop. Hinton has patented the technique and is continuing its development.
This biological basis is important in the context of the Gatsby Computational Neuroscience Unit, which was set up two years ago -- with funding to last a decade -- to study neural-computation theories of perception and action, with an emphasis on learning. The unit is funded by the Gatsby Charitable Foundation (named after the novel by F. Scott Fitzgerald), established by Lord Sainsbury (who is both a Science Minister in the British Government and a supermarket magnate).
[1] Geoffrey E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Technical Report GCNU TR2000-004, Gatsby Computational Neuroscience Unit, University College London, 2000.