Object detection: from optical correlator to intelligent recognition surveillance system
Increasingly, unmanned aerial vehicles (UAVs) are being used in intelligence, surveillance, and reconnaissance missions. They can survey large areas of terrain, but it is time-consuming and cumbersome for human operators to assess all the recorded images. In particular, when a specific object, such as a car, is being sought, an intelligent surveillance system could direct the operator's attention to the most likely areas.
The first attempt to automate pattern recognition, in 1964, was the matched spatial filter.1 Information about a reference pattern is stored optically: a hologram records the interference fringes between the Fourier transform of the reference pattern and a reference beam of light. An optical processor assesses the similarity of a test pattern to the reference pattern by computing the cross-correlation of the two patterns. A lens optically forms the Fourier transform of the test pattern and superimposes it on the hologram; a second lens then performs the inverse Fourier transform of the superposition, which yields the cross-correlation. All these operations are done at the speed of light and in parallel.
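The correlation theorem behind the matched spatial filter can be sketched numerically. The following snippet is a digital analogue of the optical processor, not the optical implementation itself: it locates a small patch in a noisy scene by multiplying Fourier transforms and inverting the result. The scene, patch, and embedding position are illustrative assumptions.

```python
import numpy as np

def cross_correlate(test, reference):
    """Cross-correlation via the Fourier correlation theorem:
    corr = IFFT( FFT(test) * conj(FFT(reference)) )."""
    T = np.fft.fft2(test)
    R = np.fft.fft2(reference, s=test.shape)  # zero-pad reference to scene size
    return np.real(np.fft.ifft2(T * np.conj(R)))

# Synthetic example: a 3x3 bright patch hidden in a 32x32 noisy scene.
rng = np.random.default_rng(0)
scene = rng.normal(0, 0.1, (32, 32))
patch = np.ones((3, 3))
scene[10:13, 20:23] += patch          # embed the target at (10, 20)

corr = cross_correlate(scene, patch)
peak = np.unravel_index(np.argmax(corr), corr.shape)
print(peak)  # the correlation peak marks the target's top-left corner
```

In the optical system, the two Fourier transforms and the product are performed by lenses and the hologram, so the whole computation happens in a single pass of light.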
In 1969, Caulfield and Maloney built on this by processing linear combinations of the cross-correlations.2 This improved the ability to distinguish between similar patterns (discrimination) and to recognize a distorted pattern (tolerance), for instance, when the test pattern is rotated relative to the hologram. In this seminal work, Caulfield and Maloney were also the first to introduce the concept of training the hologram, that is, of learning from feedback on performance. Although this early work laid the foundations of later efforts, the variety of pattern recognition tasks in the open environment means that discrimination and tolerance to distortion remain fundamental issues today. More recently, these have been addressed by neural networks, which during training build linear combinations of the cross-correlations and apply nonlinear processing to improve performance.
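A composite filter of this kind can be sketched as a sum of the conjugate Fourier transforms of several distorted references, so that a single correlation responds to any of them. This is a minimal numerical toy, assuming an L-shaped target and its 90-degree rotations as the reference set; it is not the filter design of reference 2.

```python
import numpy as np

def composite_filter(references, shape):
    """Sum the conjugate Fourier transforms of several references:
    one correlation with this composite responds to any of them."""
    return sum(np.conj(np.fft.fft2(r, s=shape)) for r in references)

def correlate(scene, filt):
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * filt))

# An 'L'-shaped target and its 90-degree rotations as references.
L = np.zeros((4, 4))
L[:, 0] = 1
L[3, :] = 1
refs = [np.rot90(L, k) for k in range(4)]

rng = np.random.default_rng(3)
scene = rng.normal(0, 0.1, (32, 32))
scene[5:9, 20:24] += np.rot90(L)      # a rotated target embedded at (5, 20)

corr = correlate(scene, composite_filter(refs, scene.shape))
peak = np.unravel_index(np.argmax(corr), corr.shape)
print(peak)  # the composite responds to the rotated target
```

The matched rotation dominates the summed response, so the correlation peak still falls on the target even though the scene contains a distorted (rotated) version of the original pattern.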
A more fundamental problem in pattern recognition is data classification. A widely used learning algorithm is the support vector machine (SVM), which classifies data by solving an optimization problem. However, a system can only know what it has already learned, and there is no guarantee a trained classifier will generalize its knowledge to recognize a unique pattern and reject all others. By contrast, optical pattern recognition attempts to detect a unique target among visual clutter. In an uncontrolled environment, target recognition needs more real data, multisensor fusion, and salient regions on which to focus attention. We have used preprocessing to determine regions of interest for UAVs trying to detect cars.
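The idea of classification by optimization can be illustrated with a minimal linear SVM trained by stochastic subgradient descent on the hinge loss (a Pegasos-style sketch, not the solver used in this work). The two toy clusters stand in for vehicle and non-vehicle feature vectors.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Train a linear SVM with hinge loss by stochastic subgradient
    descent (Pegasos-style). Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            if y[i] * (X[i] @ w + b) < 1:          # inside the margin
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:
                w = (1 - eta * lam) * w            # regularization only
    return w, b

# Toy 2D data: two separable clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())  # training accuracy
```

The optimization trades a wide separating margin against misclassification, which is the property that makes the SVM attractive for the noisy feature vectors described below.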
Optical composite filters, which compute linear combinations of cross-correlations, are designed for invariant pattern recognition. Mathematically derived invariant image features are mostly suited to describing 2D patterns,3 but we have used the scale-invariant feature transform (SIFT),4 which has been shown to be efficient for outdoor images. We considered a feature-based approach to detecting cars, but because detection based on the appearance of a whole object is sensitive to distortion, we have also used a component-based approach.5, 6 In this, we first detect object components and then combine them according to their spatial relations. This approach aims to recognize a class of objects rather than a specific one and is less sensitive to distortion and occlusion (targets covered by clutter).
In our example aerial image, with a resolution of 11.2cm/pixel, only the large parts of cars, such as bodies, windshields, and shadows, can be seen: see Figure 1. We first fuse the image pixels into superpixels, based on a combination of regional (color) and edge (spatial distance) information, using simple linear iterative clustering (SLIC).7 In very large aerial images we then determine the salient regions.8 In the background, the superpixels cluster with uniform size and regular shape. Where human-made objects are present, the superpixel boundaries adhere tightly to the structure and become irregular in size and shape. We used this irregularity as our measure of saliency.
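The saliency cue, superpixels that become irregular in size and shape near human-made structure, can be sketched as a compactness statistic over a segmentation label map. The sketch below assumes the segmentation has already been computed (in practice by SLIC); the synthetic label maps are illustrative.

```python
import numpy as np

def region_stats(labels):
    """Per-region area and boundary length from a 2D integer label map."""
    areas = np.bincount(labels.ravel())
    # Boundary pixels: any pixel whose right or lower 4-neighbour
    # carries a different label.
    diff = np.zeros(labels.shape, dtype=bool)
    diff[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    diff[:-1, :] |= labels[:-1, :] != labels[1:, :]
    perims = np.bincount(labels[diff], minlength=len(areas))
    return areas, perims

def irregularity(labels):
    """Perimeter^2 / area per region: high values flag irregular,
    potentially salient superpixels; low values indicate background."""
    areas, perims = region_stats(labels)
    valid = areas > 0
    return (perims[valid] ** 2) / areas[valid]

# A regular 4x4 grid of superpixels vs. the same map with one
# thin, elongated region standing in for a human-made structure.
regular = np.repeat(np.repeat(np.arange(16).reshape(4, 4), 8, 0), 8, 1)
irregular = regular.copy()
irregular[0, :] = 99
print(irregularity(regular).max(), irregularity(irregular).max())
```

The elongated region scores far higher than any block of the regular grid, which is the behaviour the saliency measure exploits.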
In the component-based approach, we merged the superpixels into regions with a statistical region-merging algorithm and described the size and shape of each region by numerical features. We detected and removed cast shadows by their color. We then detected body parts and windshields or doors among thousands of regions using hierarchical SVM classification. Finally, we combined car components according to their unique spatial relations. This regrouping process is robust against distortion, provides redundancy in detection, and can even recover car parts missed by the SVM. The approach is not sensitive to the orientation of the cars or to occlusion, as shown in Figure 1. (In this image, a sport utility vehicle was not recognized as a car because it belongs to a different class.)
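The regrouping step can be sketched as pairing detected component centroids under a spatial constraint. This toy assumes body-part and windshield centroids and a hypothetical distance threshold `max_dist`; the actual spatial relations used in the article are richer than a single distance.

```python
import numpy as np

def group_car_components(bodies, windshields, max_dist=5.0):
    """Pair each body-part centroid with its nearest windshield
    centroid; a pair closer than max_dist becomes one car hypothesis."""
    cars = []
    for b in bodies:
        d = np.linalg.norm(windshields - b, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_dist:
            cars.append((tuple(b), tuple(windshields[j])))
    return cars

# Three body detections, but the third car's windshield was missed,
# e.g. by occlusion or an SVM false negative.
bodies = np.array([[10.0, 10.0], [40.0, 12.0], [80.0, 50.0]])
windshields = np.array([[12.0, 11.0], [41.0, 14.0]])

print(group_car_components(bodies, windshields))  # two confirmed cars
```

Because each car is supported by several components, a missed part weakens but does not necessarily destroy the hypothesis, which is the redundancy the article refers to.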
In the feature-based approach, we computed the SIFT features of each superpixel at its centroid, plus texture features from statistical analysis with an improved local binary pattern algorithm. Furthermore, we computed the SIFT features in two sizes of region to capture information about both the cars and their surroundings. We input tens of thousands of 148-dimensional feature vectors from vehicle and non-vehicle training images, with 10-fold cross-validation, to train the SVM. In the subregions that the SVM classified as belonging to a vehicle, we used a conventional validation process to finalize the detection. Both approaches were successful at detecting cars in the aerial imagery, and the feature-based approach proved more effective than the component-based one at detecting dark-colored cars.
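The k-fold protocol itself is simple to sketch: split the data into k folds, train on k-1 and score on the held-out fold, then average. For brevity this toy uses 8-dimensional vectors instead of 148 and a nearest-centroid classifier as a lightweight stand-in for the SVM.

```python
import numpy as np

def kfold_accuracy(X, y, k=10, seed=0):
    """k-fold cross-validation of a nearest-centroid classifier
    (a stand-in for the SVM; labels y are 0 or 1)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        # Class centroids are estimated from the training folds only.
        c_pos = X[train][y[train] == 1].mean(axis=0)
        c_neg = X[train][y[train] == 0].mean(axis=0)
        d_pos = np.linalg.norm(X[test] - c_pos, axis=1)
        d_neg = np.linalg.norm(X[test] - c_neg, axis=1)
        pred = (d_pos < d_neg).astype(int)
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

# Toy vehicle / non-vehicle feature vectors.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)
print(kfold_accuracy(X, y))  # mean held-out accuracy
```

Averaging over held-out folds gives an estimate of generalization that is far less optimistic than training accuracy, which matters when tens of thousands of vectors feed a single classifier.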
In summary, over the last five decades there has been tremendous progress in pattern recognition, with contributions from optics and other research communities. We have successfully used two approaches, one component-based and one feature-based, to recognize cars in aerial imagery. Both involve very large computation tasks: they are slower and use more power than optical processors, which rapidly and automatically assess the features of an observed image and classify the results. As hardware, optical correlators lack the flexibility of electronic computers, although research on software for optical correlators remains valuable. In the future, emerging quantum optical computers may outperform optical correlators. We plan to incorporate more information from the scene to enhance detection in a much larger database.
Yunlong Sheng joined the Center for Optics, Photonics, and Lasers at Laval University, Canada, in 1985 and is now a full professor. His research interests include optical pattern recognition and image processing, nano-optics, optical trapping, and diffractive optics. He is a Fellow of SPIE and OSA.