Video surveillance from remote airborne sensing platforms has wide-ranging applications in defense, homeland security, and traffic control. However, tracking ground vehicles with this technology often produces images characterized by limited contrast, rapid changes in sensor gain, low numbers of pixels on target, and target appearance obscured by camouflage and clutter. Consequently, motion-based cues take on a role of primary importance, particularly when prior knowledge of object appearance and shape is unavailable.
We have introduced an approach for moving object detection and localization based on combining forward and backward motion history image (MHI) filters.1 The forward filter integrates results of pixel-level change detection and a decay term within a sliding temporal window to produce an energy image in which moving objects are shown with gradually fading trails behind them (see Figure 1). In contrast, the backward filter yields results with trails preceding each object. Combining the two, we achieve an algorithm that localizes each object in the central frame of the temporal window. For change detection, this approach is faster than optical flow and does not require construction of a background model. It also yields more accurate shape information than techniques based on frame differencing.
Figure 1. To the thermal video (a) we apply forward (b) and backward (c) motion history image (MHI) filters to detect and localize low-contrast moving objects in airborne video, with fused detection shown in (d).
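The forward/backward MHI combination described above can be sketched in a few lines. This is a minimal illustration, not our production implementation: the decay rate, window length, and the pointwise-minimum fusion rule shown here are simplified assumptions.

```python
import numpy as np

def mhi(change_masks, decay=1.0, tau=None):
    """Motion history image over a sliding window of binary change masks.

    change_masks: sequence of HxW arrays (nonzero where change is detected).
    Returns an energy image in which recent motion is bright and older
    motion fades, producing the characteristic trails.
    """
    tau = float(len(change_masks)) if tau is None else tau
    H = np.zeros_like(np.asarray(change_masks[0], dtype=float))
    for D in change_masks:
        # Reset history where change occurs; decay everywhere else.
        H = np.where(np.asarray(D) > 0, tau, np.maximum(H - decay, 0.0))
    return H

def fused_detection(change_masks):
    """Combine forward and backward MHI filters.

    The forward pass leaves a trail behind each object, the backward pass
    (same filter run on the time-reversed window) leaves a trail ahead of
    it; their pointwise minimum suppresses both trails and peaks near the
    object's position in the central frame of the window.
    """
    fwd = mhi(change_masks)
    bwd = mhi(list(change_masks)[::-1])
    return np.minimum(fwd, bwd)
```

For a point target sweeping left to right across the window, the fused energy image peaks at the target's central-frame position, which is what makes the combined filter useful for localization.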
A more rigorous theoretical approach to the motion detection problem treats each pixel in the video sequence as a node in a 3D spatiotemporal Markov random field (MRF) graph model,2 in which the pixel's hidden state represents the likelihood that it belongs to a moving object. This likelihood is inferred from input change detection data as well as from compatibility constraints between the pixel and each of its four spatial and two temporal neighbors. (These are indicated as red and green links, respectively, in Figure 2.) All available data and constraints are combined using the belief propagation (BP) message-passing algorithm in the 6-connected graph. This approach deals effectively with difficult detection problems, such as objects camouflaged by their resemblance to the background or uniformly colored objects that frame-differencing methods detect only partially.
Figure 2. From input video (top left), moving object detection (bottom right) is addressed by belief propagation message passing within a 3D spatiotemporal Markov random field (MRF) graph model.
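To make the message-passing step concrete, the following is a minimal sum-product loopy BP sketch on a binary-state, 6-connected 3D grid. It is a toy version under simplifying assumptions (a Potts-style smoothness potential with a single strength parameter, synchronous message updates) rather than the model of reference 2.

```python
import numpy as np
from itertools import product

def loopy_bp(unary, beta=1.0, iters=5):
    """Sum-product loopy belief propagation on a (t, y, x) grid.

    unary: array of shape (T, H, W, 2) giving per-pixel evidence, e.g.
    from change detection, for states {background, moving}.
    beta: smoothness strength; each node is linked to its 4 spatial and
    2 temporal neighbors, and agreeing neighbors are rewarded.
    Returns normalized per-pixel beliefs of shape (T, H, W, 2).
    """
    T, H, W, _ = unary.shape
    # Potts-style pairwise compatibility: exp(beta) if states agree.
    psi = np.array([[np.exp(beta), 1.0], [1.0, np.exp(beta)]])

    def neighbors(p):
        t, y, x = p
        for dt, dy, dx in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
            q = (t + dt, y + dy, x + dx)
            if 0 <= q[0] < T and 0 <= q[1] < H and 0 <= q[2] < W:
                yield q

    nodes = list(product(range(T), range(H), range(W)))
    msgs = {(p, q): np.ones(2) for p in nodes for q in neighbors(p)}
    for _ in range(iters):
        new = {}
        for (p, q) in msgs:
            # Product of unary evidence and incoming messages, excluding q.
            prod = unary[p].copy()
            for r in neighbors(p):
                if r != q:
                    prod *= msgs[(r, p)]
            m = psi.T @ prod          # marginalize over p's state
            new[(p, q)] = m / m.sum()
        msgs = new
    belief = unary.copy()
    for p in nodes:
        for r in neighbors(p):
            belief[p] *= msgs[(r, p)]
    return belief / belief.sum(axis=-1, keepdims=True)
```

The sketch shows the key behavior exploited for camouflaged objects: a pixel with ambiguous change-detection evidence is pulled toward the label of its spatiotemporal neighborhood.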
After detecting moving objects, we want to persistently track them in subsequent frames despite changes in lighting, background, and appearance. We formulate this as a two-class classification problem between each object and its surrounding background. A diverse set of features — including intensity, texture, motion, saliency, and template matching — is used to generate a set of maps indicating the likelihood of each pixel belonging to the foreground or background. These maps are then combined linearly, with each feature weighted by how well it discriminated foreground from background pixels in the previous frame.3 The resulting fused map provides a soft segmentation suitable for mean-shift tracking. Figure 3 illustrates the overall framework, applied to tracking ground vehicles from thermal airborne video.
Figure 3. A feature-fusion approach combines many component feature likelihood maps based on confidence scores (weights) that are dynamically updated to favor discriminative foreground and background information.
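A sketch of the fusion step follows. The discriminability score here is a variance-ratio criterion in the spirit of online discriminative feature selection; the bin count, smoothing constant, and normalization are illustrative assumptions, not the exact weighting of reference 3.

```python
import numpy as np

def discriminability(fg_vals, bg_vals, bins=32, value_range=(0.0, 1.0), eps=1e-6):
    """Score how well one feature separates foreground from background.

    Builds foreground/background histograms from the previous frame's
    pixels, then computes a variance ratio of the per-bin log-likelihood
    ratio: high when the two classes are tightly separated.
    """
    p, _ = np.histogram(fg_vals, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(bg_vals, bins=bins, range=value_range, density=True)
    p, q = p + eps, q + eps
    L = np.log(p / q)  # per-bin log-likelihood ratio
    var = lambda w: np.average((L - np.average(L, weights=w)) ** 2, weights=w)
    # Between-class spread divided by within-class spread.
    return var((p + q) / 2) / (var(p) + var(q) + eps)

def fuse(likelihood_maps, weights):
    """Weighted linear combination of per-feature likelihood maps."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(likelihood_maps), axes=1)
```

In use, each feature's weight is recomputed every frame from the previous frame's foreground/background samples, so features that stop being discriminative (say, intensity after a gain change) are automatically down-weighted.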
Accurately segmenting object shape has proved to be just as important as location for tracker initialization, drift-free object model adaptation, and object classification. To acquire object shape masks from airborne video, we use a figure-ground segmentation approach enhanced by edge pixel classification.4 Application of a 3D conditional random field model combines segmentation features while maintaining temporal coherence between neighboring nodes. Weighting factors for different data potential functions are updated online to adapt to changing, complex scenes. To obtain accurate boundary information between foreground and background, salient edge pixels are classified as belonging to one of three categories (see Figure 4): within foreground, within background, and between foreground and background. The foreground and boundary edge pixels then guide segmentation of an accurate foreground object shape.
Figure 4. Input image (a); edge pixel classification (b); foreground edge pixels (green), boundary edge pixels (red), and background edge pixels (cyan) (c); object mask overlaid in blue hue over the image frame (d).
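The three-way edge classification can be illustrated with a simple rule: sample a soft foreground-probability map a few pixels to either side of each salient edge pixel, along the local gradient direction. This is only a sketch of the idea; the sampling step, threshold, and use of the probability-map gradient are assumptions, not the classifier of reference 4.

```python
import numpy as np

def classify_edges(fg_prob, edge_mask, step=2, thresh=0.5):
    """Label salient edge pixels from a soft foreground-probability map.

    Returns an int map: 0 = not an edge, 1 = within foreground,
    2 = within background, 3 = boundary between foreground and background.
    """
    gy, gx = np.gradient(fg_prob)
    H, W = fg_prob.shape
    labels = np.zeros((H, W), dtype=int)
    for y, x in zip(*np.nonzero(edge_mask)):
        n = np.hypot(gy[y, x], gx[y, x])
        # Sample along the gradient; fall back to horizontal if flat.
        dy, dx = (gy[y, x] / n, gx[y, x] / n) if n > 1e-9 else (0.0, 1.0)
        y1 = int(np.clip(round(y + step * dy), 0, H - 1))
        x1 = int(np.clip(round(x + step * dx), 0, W - 1))
        y2 = int(np.clip(round(y - step * dy), 0, H - 1))
        x2 = int(np.clip(round(x - step * dx), 0, W - 1))
        a = fg_prob[y1, x1] > thresh
        b = fg_prob[y2, x2] > thresh
        labels[y, x] = 1 if (a and b) else (2 if not (a or b) else 3)
    return labels
```

Edge pixels labeled foreground or boundary are the ones that then guide the final object-shape segmentation.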
In the future, we will explore use of top-down object models to boost performance of the current pixel-level detection, tracking and segmentation algorithms. Shape-constrained object segmentation methods, for example, could be used to simultaneously segment and classify different types of ground vehicles.
We acknowledge support from National Science Foundation grant IIS-0535324 on persistent tracking.
Zhaozheng Yin, Robert Collins
Computer Science and Engineering Department
Pennsylvania State University (PSU)
University Park, PA
Zhaozheng Yin is currently a PhD candidate in Robert Collins's group at PSU. His research interests include object segmentation, motion detection, tracking, and feature selection and fusion. He received his BS degree from Tsinghua University, China, and his MS degree from the University of Wisconsin at Madison.
Robert T. Collins is an associate professor in the Computer Science and Engineering Department of Penn State University. He earned his PhD from the University of Massachusetts at Amherst. He co-directs the Laboratory for Perception, Action, and Cognition at PSU, where his research interests include video scene understanding, human body segmentation, activity recognition, and multitarget tracking.