The use of wide-area persistent surveillance by the US military—employing airborne video-camera systems to observe large expanses over extended periods of time (see Figure 1)—is gaining ground as a highly valuable tool for intelligence, surveillance, and reconnaissance. Motion imaging from Iraq and Afghanistan has been used in post-event forensics to help with the early detection of thousands of roadside bombs and terrorist ambushes. Unlike static, single-frame images, wide-area persistent video can track and detect activities and interactions over time, identify patterns that are hard to disguise, and provide signatures of threat-related activities in urban environments. However, current efforts to exploit video data are mostly manual and require hours, or even days, of painstaking analysis to produce results.
Figure 1. Wide-area persistent surveillance. EO: Electro-optical. FOV: Field of view. (Photo courtesy of the Defense Advanced Research Projects Agency.)
Like security cameras, wide-area persistent surveillance captures streams of images, at a rate of one to two frames per second, while covering wide geographic areas of up to 40 square miles and generating huge volumes of video data (up to terabytes in a single mission). To detect suspicious behavior and identify potential threats (such as a building used for fabrication of improvised explosive devices or activities indicating a potential ambush), thousands of human analysts would need to assess such volumes of data. For wide-area airborne persistent surveillance to provide commanders and troops with real-time intelligence and situational awareness, algorithms and automated processes that reduce time-to-intelligence and analyst workloads are required.
Can algorithms be developed to analyze millions of video frames for patterns that indicate potential threats, that is, to distinguish abnormal from normal activities, such as the unusual driving behavior that may precede the detonation of a vehicle used for a suicide mission? To answer this question, our research¹ focuses on whether behavior- or pattern-recognition algorithms can be applied to wide-area surveillance video for high-accuracy recognition of a variety of behavior classes.
We focused on three types of identification in particular: moving objects (such as individuals or cars), complex interactions among multiple objects (such as person to person and person to vehicle), and the functions of static objects, including buildings, structures, and areas (such as apartments, gas stations, and parking lots). We set out to determine whether the intent or function of an object (either an individual or a vehicle) can be inferred from its actions, the activities of surrounding objects, and their mutual interactions.
We based our experiment on the Columbus Large Image Format (CLIF) data (see Figure 2), consisting of real-world video imagery collected over the Ohio State University (OSU) campus in 2006 and 2007. The unclassified CLIF data, which contains a varied set of areas, targets, events, and activities, had high enough resolution for us to manually recognize activities of moving objects and context information, thus providing accurate ground truth to test our algorithms.
Figure 2. Example of Columbus Large Image Format (CLIF) data.
We intentionally used data collected from a ‘normal’ environment because we wanted to distinguish among behavior types in a rich and ambiguous environment, rather than to separate ‘good’ from ‘adversarial’ behavior. ‘Terrorist’ and ‘normal’ behaviors are all ultimately composed of activity patterns. The purpose of the algorithms is to distinguish these patterns.
To overcome video-quality and tracking-inconsistency issues, we manually labeled video frames with objects and motion events corresponding to their behavior (see Figure 3). In the vignette shown, a person exits one of the buildings, enters a car, drives to the other side of the parking lot, parks the car, and goes to another building. We used simple types of motion events, which could be detected automatically if reliable tracking of people and vehicles were available.
Figure 3. Manual event extraction. Activity: ‘Repark.’
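To make the idea of labeled motion events concrete, the ‘Repark’ vignette can be written down as a time-ordered list of events and checked against an activity template. The following is only a minimal sketch under stated assumptions: the event keywords, track identifiers, frame numbers, and the ordered-subsequence matching rule are all hypothetical illustrations, not the labeling scheme or the recognition algorithm used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionEvent:
    frame: int  # frame index at which the event was labeled (made-up values)
    actor: str  # hypothetical track identifier, e.g. "person-1"
    kind: str   # hypothetical event keyword, e.g. "exit-building"


# Hand-labeled events for a vignette like the one in Figure 3.
# All identifiers and frame numbers are invented for illustration.
REPARK_VIGNETTE = [
    MotionEvent(10, "person-1", "exit-building"),
    MotionEvent(25, "person-1", "enter-vehicle"),
    MotionEvent(30, "car-1", "start-moving"),
    MotionEvent(55, "car-1", "stop"),
    MotionEvent(60, "person-1", "exit-vehicle"),
    MotionEvent(80, "person-1", "enter-building"),
]

# Hypothetical template: event kinds that must occur in this order.
REPARK_TEMPLATE = [
    "exit-building",
    "enter-vehicle",
    "start-moving",
    "stop",
    "exit-vehicle",
    "enter-building",
]


def matches_template(events, template):
    """Return True if `template` occurs as an ordered subsequence of the events."""
    ordered = sorted(events, key=lambda e: e.frame)
    kinds = iter(e.kind for e in ordered)
    # `step in kinds` advances the iterator, so order is enforced.
    return all(step in kinds for step in template)


is_repark = matches_template(REPARK_VIGNETTE, REPARK_TEMPLATE)
```

Because the template is matched as a subsequence, unrelated events interleaved in the same track do not prevent a match. A real system would also need to handle missed or spurious events.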
We focused on developing models for activity and function recognition. Activities consist of a sequence of events for a single entity or interacting entities. Spanning minutes to hours, they form a pattern, such as a car decelerating and stopping or one car following another. An entity can be in motion or stationary at any given point in time. Examples of activity keywords included in our models are person-shopping, car-unloading, drop-off, load-carrying, vehicle-following, and meeting. Functions define a composition of several activities performed by multiple entities, possibly at different locations, spanning hours to months. Examples in our models include post-office, taxi-driver, delivery-vehicle, material-storage, crowd-leader, and residential-area.
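The relationship between activities and functions can be illustrated with a toy example in which the function of a location is guessed from the mix of activities observed there. This is a sketch under stated assumptions, not the model used in the study: the activity and function keywords come from the examples above, but the profile weights and the cosine-similarity scoring are hypothetical.

```python
import math
from collections import Counter

# Hypothetical profiles: relative weight of each activity keyword per function.
# The weights are invented for illustration only.
FUNCTION_PROFILES = {
    "post-office": {"drop-off": 3, "load-carrying": 2, "car-unloading": 1},
    "material-storage": {"car-unloading": 3, "load-carrying": 3},
    "residential-area": {"drop-off": 1, "meeting": 2},
}


def cosine(a, b):
    """Cosine similarity between two sparse keyword->weight mappings."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def infer_function(observed_activities, profiles=FUNCTION_PROFILES):
    """Return the function whose profile best matches the observed activities."""
    counts = Counter(observed_activities)
    return max(profiles, key=lambda name: cosine(counts, profiles[name]))


# Activities recognized at one location over some period (made-up data).
observed = ["car-unloading", "load-carrying", "car-unloading", "load-carrying"]
guess = infer_function(observed)
```

Treating a function as a bag of activities ignores timing and the identity of the entities involved. It is used here only to show how recognition at the function level builds on recognition at the activity level.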
We paid special attention to discovering complex activity patterns, i.e., behaviors over longer periods that involve multiple interacting entities. Such behaviors are often associated with numerous events occurring in parallel and cannot be recognized by observing and reasoning about individual people, cars, and places. To model complex activities, we used networks to describe profiles of roles and their relations (see Figure 4). We then searched for these patterns within the area covered by the surveillance video.
Figure 4. Example of modeling a complex activity, ‘play football.’ E: End. G: Guard. QB: Quarterback. S: Safety. T: Tackle. WR: Wide receiver.
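One way to picture searching for a network of roles and relations is as a small graph-matching problem: assign each role in the pattern to an observed entity so that every required relation is present in the data. The sketch below uses brute-force enumeration and is an illustration only. The role names, entity types, relation labels, and observations are hypothetical, and the study's actual matching algorithm is not described here.

```python
from itertools import permutations

# Hypothetical pattern: role name -> required entity type.
PATTERN_ROLES = {"driver": "person", "vehicle": "car", "recipient": "person"}

# Hypothetical pattern relations: (role_a, relation, role_b).
PATTERN_RELATIONS = [
    ("driver", "exits", "vehicle"),
    ("driver", "meets", "recipient"),
]


def find_matches(roles, relations, entities, observed):
    """Yield role->entity assignments consistent with types and relations.

    entities: dict mapping entity id -> entity type
    observed: set of (entity_a, relation, entity_b) tuples
    """
    role_names = list(roles)
    # Try every way of assigning distinct entities to the roles.
    for combo in permutations(entities, len(role_names)):
        assignment = dict(zip(role_names, combo))
        if any(entities[assignment[r]] != roles[r] for r in role_names):
            continue  # an entity has the wrong type for its role
        if all((assignment[a], rel, assignment[b]) in observed
               for a, rel, b in relations):
            yield assignment


# Made-up observations extracted from a scene.
entities = {"p1": "person", "p2": "person", "p3": "person", "c1": "car"}
observed = {("p1", "exits", "c1"), ("p1", "meets", "p2"), ("p3", "meets", "p2")}

matches = list(find_matches(PATTERN_ROLES, PATTERN_RELATIONS, entities, observed))
```

Exhaustive enumeration grows factorially with the number of entities, so it is practical only for tiny scenes. Searching a full surveillance area would call for pruning or approximate matching.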
In testing the performance of several algorithms on the surveillance-video data set, we achieved high recognition accuracy for a wide range of activity and function types (see Figure 5). Our best algorithm achieved 70% accuracy in recognizing activities of people, 97% for groups, 67% for cars, 96% in recognizing complex activities, and 90% in determining the functions of areas and facilities. These results show that accurate, automated activity and function recognition in complex urban terrain is feasible.
Note that we did not thoroughly examine the tracking of people on foot (‘dismounts’) because of the low resolution of the imagery. Without better motion tracking of individuals, vehicle-only data is ambiguous and yields lower recognition accuracy. In contrast, complex multi-entity activities yield higher accuracy because the dependencies among entities can be exploited to achieve better identification.
Figure 5. Accuracy of recognition for best algorithm.
Much of the behavior in the OSU imagery is individualistic (students engaged in their individual tasks of going to class, the library, shopping, etc.). Yet, in a conflict environment, evidence shows that complex activities are the more prevalent behaviors of interest to military intelligence. Adversarial actions, such as ambushes, are intentionally planned and coordinated amongst numerous entities. This suggests that within the intelligence-analysis and decision-support domain, algorithms to identify suspicious multi-entity activities would enable better threat recognition.
The full promise of exploiting wide-area persistent surveillance to anticipate and interdict threats will be realized as the algorithms and technologies that automate the discovery of activities mature, an effort to which we continue to contribute. Rather than analysts poring over millions of video images, we envision refined algorithms that allow them to query events, activities, and relationships to identify particular threats. Our ultimate goal is to enable analysts to exploit video in the same way that large text repositories can currently be searched using simple and complex queries.