Behavior subtraction, a new tool for video analytics

Studying binary motion patterns may enable simple, real-time, autonomous detection of abnormal events in surveillance video.
02 April 2009
Pierre-Marc Jodoin, Janusz Konrad, and Venkatesh Saligrama

The rapid proliferation of surveillance cameras at airports, along highways, and in other public areas makes it increasingly impractical to rely on human operators for monitoring. Reliable methods are needed for autonomous video analysis, known as ‘video analytics.’ One of the most important—and difficult—goals of video analytics is to detect abnormalities or events that differ from what is considered usual, such as an abandoned package, a car traveling against traffic, or a fallen elderly person. While it is already possible to identify a simple abnormality using motion detection, such as an intruder in a restricted area, that technology does not work in more complex scenarios, such as a car traveling against dense traffic.

One approach to detecting unusual behavior in such cluttered settings is to learn normal activity from a training video and then identify abnormal patterns based on object dynamics, shape, or color. Most methods of this kind detect moving objects, compute their paths, and then classify the paths to distinguish normal from abnormal objects.1 This approach is difficult because each stage must be completed successfully: any errors propagate from one stage to the next like a domino effect.

To address these limitations, we recently developed a simple approach that does not require estimation and classification of object paths.2 Our method's computational complexity is independent of the number of moving objects, and is still general enough to monitor humans, cars, animals, or other moving entities. We have also made two important observations. First, to characterize behavior in a camera's field of view, it is possible to consider only dynamics and forgo luminance/color. Second, to characterize the dynamics, it is adequate to detect activity at a given pixel rather than estimate the motion path through it.

With a static camera, as is common in surveillance applications, our method first applies background subtraction to compute a binary label at each pixel of each frame, with ‘1’ denoting motion and ‘0’ no change. This can be accomplished by fast thresholding (image segmentation) of the absolute difference between each frame and a background computed as the temporal median of the previous 50–100 frames (see Figure 1), or by more reliable methods based on statistical background models.3,4 Over time, the sequence of labels at each pixel forms a binary signature of the activity at that location. Some of the signatures we have observed experimentally include random impulses (a tree shaken by wind), regular impulses (highway traffic), bursts of impulses (city traffic regulated by traffic lights), and very wide impulses (abandoned objects). We use these signatures to characterize behavior pixel by pixel, and subsequently to detect abnormalities.
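As an illustration, the labeling step might look like the following Python sketch, assuming grayscale frames stored as NumPy arrays; the MotionLabeler class name, the 100-frame median window, and the threshold of 25 are illustrative placeholders rather than the exact settings used in the reported experiments.

```python
# Illustrative sketch of background subtraction with a temporal-median
# background model. Frames are assumed to be grayscale NumPy arrays of
# equal shape; the window length and threshold are placeholders.
from collections import deque
import numpy as np

class MotionLabeler:
    def __init__(self, window=100, threshold=25):
        self.frames = deque(maxlen=window)  # recent frames for the median background
        self.threshold = threshold          # absolute-difference threshold

    def update(self, frame):
        """Return a binary label map for one frame: 1 = motion, 0 = no change."""
        frame = frame.astype(np.float32)
        self.frames.append(frame)
        background = np.median(np.stack(self.frames), axis=0)  # temporal median
        return (np.abs(frame - background) > self.threshold).astype(np.uint8)
```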


Figure 1. Example of behavior in a static camera, with the video frame (top left), corresponding motion-label frame obtained using background subtraction (bottom left), and the binary sequence of motion labels for a given pixel (images at right). On the vertical axis, 0 is static and 1 is moving; the horizontal axis denotes the frame number. High-density areas indicate that many moving objects pass through the pixel.

However, since streaming video produces signatures of unbounded length, it is not obvious how to use them to detect abnormalities. We have developed a ‘behavior-image’ concept, in which the signatures of all pixels over the last N frames are aggregated to form a single 2D array. This dramatically reduces memory requirements and permits real-time implementation. First, the ‘background-behavior image’ is computed from a training video containing normal behavior. Next, observed-behavior images are computed from streaming video using a sliding-window approach (a behavior image is computed for each frame). Finally, each observed-behavior image is compared to the background-behavior image to find abnormalities, in a process called ‘behavior subtraction.’
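A minimal sketch of this idea follows, assuming that per-pixel activity is aggregated by summing the motion labels over the last N frames (the aggregation function is application-dependent, as discussed below); the class and function names are hypothetical.

```python
# Hedged sketch of a behavior image: per-pixel motion labels aggregated
# over a window of N frames into one 2D activity map. Aggregation by
# summation is an assumption; other aggregation functions are possible.
import numpy as np

def behavior_image(label_frames):
    """label_frames: iterable of HxW binary arrays -> HxW activity counts."""
    acc = None
    for labels in label_frames:
        acc = labels.astype(np.int32) if acc is None else acc + labels.astype(np.int32)
    return acc

class SlidingBehaviorImage:
    """Observed-behavior image over the last N frames of a streaming video."""
    def __init__(self, N):
        self.N = N
        self.window = []   # last N label frames
        self.acc = None    # running per-pixel activity sum

    def update(self, labels):
        labels = labels.astype(np.int32)
        self.acc = labels if self.acc is None else self.acc + labels
        self.window.append(labels)
        if len(self.window) > self.N:
            self.acc = self.acc - self.window.pop(0)  # drop the oldest frame's labels
        return self.acc    # behavior image associated with the current frame
```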

The label aggregation and behavior-image comparison depend on the application. For example, if an unusually high level of motion is to be considered abnormal, the background-behavior image needs to be computed by measuring the maximum activity.2 Then, pixels whose values in an observed-behavior image exceed those in the background-behavior image are declared abnormal. However, if abnormal behavior means a departure from average motion activity, then the background-behavior image needs to be computed by measuring the average activity.5,6 In that case, pixels whose values in an observed-behavior image differ from those in the background-behavior image are declared abnormal.
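The two comparison rules might be sketched as follows; the mode argument and the tolerance threshold are hypothetical parameters introduced for illustration.

```python
# Sketch of behavior subtraction under the two criteria discussed above.
# 'max' mode flags pixels more active than anything seen in training;
# 'avg' mode flags pixels that depart from the training-time average.
import numpy as np

def train_background_behavior(training_behavior_images, mode="max"):
    """Aggregate behavior images from the training video into one background."""
    stack = np.stack(list(training_behavior_images))
    return stack.max(axis=0) if mode == "max" else stack.mean(axis=0)

def behavior_subtraction(observed, background, mode="max", tolerance=5):
    """Return a binary abnormality map (1 = abnormal pixel)."""
    if mode == "max":
        return (observed > background).astype(np.uint8)
    return (np.abs(observed - background) > tolerance).astype(np.uint8)
```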


Figure 2. Observed video frame and detected motion labels in urban traffic (top row), background- and observed-behavior images with white denoting high activity (middle row), and final abnormality map (bottom).

Figure 3. Abnormality detection in a wide range of scenarios: (a) lingering pedestrian, (b) unexpected train, (c) abandoned object, and (d) and (e) canoe and speedboat on shimmering water. Full videos are available on our Web site.

Our simple, memory-light approach has led to some surprising results. Figure 2 shows urban traffic, where the tram is detected as an abnormality under the maximum-activity assumption because the training video sequence did not include trams. The background-behavior image (middle row, left) is bright in motor-traffic areas but darker on the tram tracks. The only area where the observed-behavior image is brighter than the background-behavior image is on the tracks, and thus the tram is detected as an unusual pattern.

Further examples in Figure 3 show the removal of regular background activity caused by highway traffic, pedestrians, or shimmering water. Only objects with outlying signatures are detected, regardless of their size (from a tiny pedestrian to a large canoe) or nature (human or car). Complete videos of these results showing their dynamic nature are available on our Web page.7

We are currently exploring implementations of behavior subtraction on the embedded architectures used in Internet-protocol surveillance cameras. This would permit edge-based processing that reduces data flow in the network by communicating only frames with unusual content. We are also working on extending our method to multicamera configurations.


Pierre-Marc Jodoin
University of Sherbrooke
Sherbrooke, Canada

Pierre-Marc Jodoin is an assistant professor. He received his PhD degree in computer vision from the University of Montreal in 2007, and held a teaching position there in the Computer Science Department from September 2003 to August 2006. His interests are in video analytics and computer vision.

Janusz Konrad, Venkatesh Saligrama
Boston University
Boston, MA

Janusz Konrad has been a full professor since 2000. He holds a PhD from McGill University in Montreal (Canada). He is an IEEE fellow, an associate technical editor for the IEEE Communications Magazine, and associate editor for the EURASIP International Journal on Image and Video Processing. His interests are in image and video processing and compression, visual-sensor networks, stereoscopic and 3D visual communications, and multidimensional, digital-signal processing.

Venkatesh Saligrama is an associate professor in the Department of Electrical and Computer Engineering. He received his PhD from the Massachusetts Institute of Technology. He has numerous awards including the Presidential Early Career Award, Office of Naval Research Young Investigator Award, National Science Foundation Career Award, and the Outstanding Achievement Award from United Technologies. His research interests include information and control theory, and network signal processing with applications to sensor networks.

