Bio-inspired algorithm for online visual tracking
Tracking of targets within aerial footage is becoming increasingly important as the applications of unmanned aerial vehicles (UAVs) continue to expand (UAVs are now used in film production, mining, news media, and agriculture, for example). Moreover, the world's security agencies are gathering enormous amounts of UAV video data, in which they look for events of interest, e.g., suspicious vehicles. When such a vehicle is found in the data, users often want a UAV to follow it over time, so that videos of the events can later be studied by a human analyst. Online visual tracking is therefore required for these endeavors. Tracking ground vehicles in UAV video streams is especially difficult, however, because there are usually relatively few pixels on the target (compared with other tracking problems) and because the targets can change drastically in appearance (due to changes in lighting conditions, UAV altitude, and perspective).
Many tracking algorithms define the problem as a binary classification task, i.e., the object of interest must be differentiated from the rest of the video frame's content. These algorithms train classifiers built on top of visual features, which encode only the target's appearance. Other information, such as motion, can also be used to improve tracking, and primates exploit such cues naturally. 'Smooth pursuit' is a type of eye movement used by primates when they are tracking small moving objects. This continuous eye movement counteracts the object's motion, keeping the object continually in the center of the field of view. Although saccadic eye movements (i.e., rapid movements of the eye between fixation points) can be elicited in a wide variety of situations, smooth pursuit is only possible when an object is in motion.1
Motivated by the neural circuits that underlie smooth pursuit, we have thus created the smooth pursuit tracking (SPT) algorithm2 for tracking problems in aerial video data. In this method, we combine the object appearance with motion and predicted location information to improve tracking. Although primates using smooth pursuit are limited to the tracking of one object at a time, with our SPT algorithm we can easily track multiple objects simultaneously (with little computational overhead).
In our SPT algorithm, we first generate a top-down appearance saliency map by using an online brain-inspired object recognition algorithm (known as a gnostic field).3, 4 A gnostic field consists of competing gnostic sets, where each set has a population of template-matching units for a particular class. The input we use for the gnostic field is the output of the convolutional feature maps produced by a convolutional neural network, which has been pre-trained on a massive collection of natural images.5 Our resulting appearance saliency map has high values in regions that correspond to areas of the image that resemble the target. We also create a motion saliency map by performing background subtraction. To do this, we use an average model of a subset of the previous frames, after aligning them to the current frame. In addition, for our location saliency map, we use a Kalman filter to predict the location of the target in the next frame. Finally, we create the smooth pursuit map by multiplicatively combining the appearance, motion, and location maps (see Figure 1).
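The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the Gaussian form of the location map, and the simple frame-averaging background model are our assumptions, and the appearance map (produced by the gnostic field in the actual algorithm) is taken here as a given input.

```python
import numpy as np

def motion_saliency(recent_frames, current):
    """Background subtraction: difference from the mean of recent
    (already aligned) frames stands in for target motion."""
    background = np.mean(recent_frames, axis=0)
    return np.abs(current - background)

def location_saliency(shape, predicted_xy, sigma=10.0):
    """Gaussian bump centered on the Kalman-predicted target location
    (a common way to turn a point prediction into a saliency map)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    x0, y0 = predicted_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

def smooth_pursuit_map(appearance, motion, location):
    """Multiplicative fusion of the three saliency maps: a region must
    look like the target, be moving, AND be near the predicted location."""
    fused = appearance * motion * location
    return fused / (fused.max() + 1e-12)  # renormalize to [0, 1]
```

Because the combination is multiplicative, a region scores highly only if all three cues agree, which suppresses distractors that match the target in appearance alone.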
One of the main strengths of our method is its ability to handle long-term occlusions. In primates, smooth pursuit only works when the object being tracked is both moving and visible; other mechanisms are therefore required to recover a track that has been lost to occlusion. In this situation, primates use saccades to recover the object's location. We mimic this behavior by running the selective search algorithm to generate bounding box hypotheses whenever an occlusion is detected. For each candidate box, the SPT algorithm measures the confidence that the box contains the target, and the box with the highest confidence is selected. Both the number of box hypotheses and the search radius grow with the length of the occlusion, which means that SPT searches for boxes farther away the longer the target has been occluded.
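The recovery logic can be sketched as follows, under stated simplifications: random box proposals stand in for selective search, the mean saliency inside a box stands in for the tracker's confidence score, and the radius-growth schedule is an assumption of ours, not the published one.

```python
import numpy as np

def recover_after_occlusion(spm, last_xy, occlusion_frames,
                            n_boxes=50, box_size=16, base_radius=20, seed=0):
    """Saccade-like recovery: sample candidate boxes around the last known
    location, widening the search radius as the occlusion grows longer, and
    return the box with the highest mean saliency (a confidence stand-in)."""
    h, w = spm.shape
    # Search farther away the longer the target has been occluded.
    radius = base_radius * (1 + occlusion_frames // 10)
    rng = np.random.default_rng(seed)
    best_box, best_score = None, -np.inf
    for _ in range(n_boxes):
        cx = int(np.clip(last_xy[0] + rng.uniform(-radius, radius), 0, w - 1))
        cy = int(np.clip(last_xy[1] + rng.uniform(-radius, radius), 0, h - 1))
        x0, y0 = max(cx - box_size // 2, 0), max(cy - box_size // 2, 0)
        x1, y1 = min(x0 + box_size, w), min(y0 + box_size, h)
        score = spm[y0:y1, x0:x1].mean()
        if score > best_score:
            best_box, best_score = (x0, y0, x1, y1), score
    return best_box, best_score
```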
We have also evaluated the ability of SPT to track vehicles in aerial footage from the Video Verification of Identity (VIVID) data set. In particular, we compared the performance of SPT with that of seven recently developed trackers, i.e., the Color Visual Tracking,6 Adaptive Structural Local Sparse Appearance,7 L1 (referring to the L1 norm),8 Multiple Instance Learning,9 Kernelized Correlation Filters,10 Online AdaBoost,11 and Structure Preserving Object Tracking12 algorithms. We used standard metrics, such as precision plots, success plots, and center location error (see Table 1), to make these comparisons. Our results indicate that the SPT algorithm can successfully track vehicles for significantly longer periods than the other state-of-the-art algorithms.
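For readers unfamiliar with these metrics, the two center-based ones are simple to state. Center location error is the per-frame Euclidean distance between the predicted and ground-truth target centers, and each point on a precision plot is the fraction of frames whose error falls within a pixel threshold. A minimal sketch (the function names are ours):

```python
import numpy as np

def center_location_error(pred_centers, gt_centers):
    """Per-frame Euclidean distance between predicted and
    ground-truth target centers."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    return np.linalg.norm(pred - gt, axis=1)

def precision_at(errors, threshold=20.0):
    """Fraction of frames whose center error is within `threshold`
    pixels: one point on a precision plot."""
    errors = np.asarray(errors)
    return float((errors <= threshold).mean())
```

Sweeping the threshold over a range of pixel values traces out the full precision curve; success plots are built analogously from bounding-box overlap rather than center distance.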
In summary, we have introduced a bio-inspired online tracking algorithm that we developed specifically for tracking small targets in aerial imagery. We find that our smooth pursuit tracking method outperforms a number of state-of-the-art algorithms, by a large margin. We are currently investigating how well our SPT algorithm works with larger objects and non-vehicle targets. In our future work we also plan to study how we can use modular neural networks to train SPT from end to end.
The research reported in this article was supported in part by the US Naval Air Systems Command, under contract N68335-14-C-033. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsoring agencies.
Rochester Institute of Technology
Mohammed Yousefhussien is a Fulbright scholar, working toward a PhD in imaging science. His research interests are centered around computer vision applications using machine learning algorithms.
Christopher Kanan is an assistant professor. He conducts basic and applied research in computer vision and machine learning. He received a PhD in computer science from the University of California, San Diego.
Andrew Browning is a principal research scientist. He received a PhD in cognitive and neural systems from Boston University.