Automated object detection and tracking play an increasingly important role in modern surveillance systems because the large number of cameras typically installed at a given site can no longer be monitored by human operators. However, vision research still faces many challenges in robustly tracking objects, especially pedestrians, in realistic scenarios. In addition to classical security and monitoring purposes, detailed information about pedestrian flow over large spatial areas may be beneficial in many situations. For instance, it could be used to assess how often customers visit certain shops, to set the intervals of signal changes at pedestrian crossings (see Figure 1), or to support semantic analysis of motion patterns and behavior in public-transportation systems, which would enable operators to optimize their services.
Figure 1. Typical application of a smart pedestrian sensor. By detecting waiting people at a signalized crossing, the green phase can be timed optimally to increase traffic flow, reduce the number of jaywalkers, and simultaneously enhance the crossing's safety.
Recent advances in computer vision have led to impressive results in pedestrian detection and tracking. Nevertheless, a generally usable and robust system that can handle complex situations is not yet available. The scenarios for surveillance cameras range from indoor locations under constant illumination to crowded outdoor areas. Reliable and robust pedestrian detection in densely crowded settings remains challenging for even the best vision algorithms available today. The reasons include the complexity of real-world situations, with large numbers of people in environments that can only be observed at shallow camera angles and in poor lighting; the requirement for reliable detection (i.e., few false positives and high detection rates); and occlusions in the field of view that make object detection very difficult.
Our detection philosophy differs fundamentally from those underlying most current surveillance systems. Instead of generating a sophisticated background frame and using pixel differences between images, we detect pedestrians solely on the basis of their appearance, by learning what people look like. Our algorithms are taught the general shape and appearance of pedestrians offline, using thousands of manually created training samples. At runtime, the system maintains a workable balance between generalization (every pedestrian must be detected) and specificity (pedestrians must not be confused with objects of another class) by slightly tuning the decision boundary as required. This approach has several, mostly positive, implications. The vision sensor's performance becomes essentially independent of illumination and is unaffected by panning and zooming. Inclement-weather conditions such as rainfall or light fog do not significantly decrease performance. We can even track pedestrians in severe snowfall, although with reduced accuracy.
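How the decision boundary is tuned is not described here. As a minimal sketch, assuming the detector assigns each window a real-valued score (higher meaning more pedestrian-like, as with a linear classifier margin), the trade-off between detecting every pedestrian and not accepting other objects can be illustrated by picking the score threshold that still accepts a target fraction of labeled pedestrian samples and then measuring how many background samples slip through. The function names and the sample scores are hypothetical.

```python
import math


def threshold_for_detection_rate(pos_scores, target_rate):
    """Return the highest threshold that still accepts at least
    `target_rate` of the pedestrian (positive) samples (score >= threshold)."""
    if not pos_scores or not 0.0 < target_rate <= 1.0:
        raise ValueError("need positive scores and 0 < target_rate <= 1")
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must be accepted; the small epsilon guards
    # against floating-point noise such as 4.000000000000001.
    k = max(1, math.ceil(target_rate * len(ranked) - 1e-9))
    return ranked[k - 1]


def rates_at(threshold, pos_scores, neg_scores):
    """Detection rate and false-positive rate at a given threshold."""
    detection_rate = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    false_positive_rate = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return detection_rate, false_positive_rate


if __name__ == "__main__":
    pos = [2.0, 1.5, 1.0, 0.5, -0.2]   # hypothetical pedestrian scores
    neg = [-1.0, -0.5, 0.0, 0.7]       # hypothetical background scores
    t = threshold_for_detection_rate(pos, 0.8)
    print(t, rates_at(t, pos, neg))    # 0.5 (0.8, 0.25)
```

Raising the target detection rate to 1.0 lowers the threshold to -0.2 in this toy data, which doubles the false-positive rate to 0.5: the same tension between generalization and specificity that runtime tuning has to balance.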
Our challenges were numerous. For example, even state-of-the-art computer vision cannot yet handle every situation, and crowded scenes are notoriously difficult. As an additional complication, real-time operation of this type of system requires a large amount of processing power, and customers cannot be expected to install dedicated tracking server farms at their sites. Optimization at every level of the system, from algorithms to implementation, was necessary to reach the quality required for use in realistic settings. Ultimately, we aim to develop an integrated system (a smart camera) for applications that detect, track, and count pedestrians in transportation hubs to collect statistics of pedestrian flow; control and influence the timing of signalized pedestrian crossings (to reduce the number of jaywalkers and improve overall traffic flow); warn drivers of approaching pedestrians; record frequencies of pedestrians and motorized traffic for urban-planning purposes; provide real-world counting and traffic data for simulation models; and monitor customer activity for retail purposes.
As an example, Figure 2 shows an outdoor scene at a public-transportation hub in Graz (Austria) where people gather and wait for trams. Two pedestrians are being tracked, and older, previously detected paths are shown in light gray. Tracking results for an entire day are shown in Figure 3. Using these trajectories, we can count passengers crossing virtual tripwires and collect pedestrian-flow statistics over intervals in space and time. Accuracies of 95% or better can be expected,1 depending on the imaging situation, geometry, and average pedestrian density.
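The tripwire counting step is not spelled out in the text. A minimal sketch, assuming each trajectory is a list of image-plane (x, y) points and a tripwire is a line segment: a step between consecutive points counts as a crossing when the two segments properly intersect, which the standard 2D cross-product orientation test detects, and the sign of the starting side gives the direction. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def side(a: Point, b: Point, p: Point) -> float:
    """2D cross product (b - a) x (p - a); its sign tells on which side
    of the line a->b the point p lies (0 means exactly on the line)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def count_crossings(trajectories: Sequence[Sequence[Point]],
                    wire_start: Point, wire_end: Point) -> Tuple[int, int]:
    """Count tripwire crossings as (negative->positive, positive->negative).

    A step between consecutive trajectory points counts only if it strictly
    crosses the wire segment; points lying exactly on the wire are ignored.
    """
    to_positive = to_negative = 0
    for track in trajectories:
        for p, q in zip(track, track[1:]):
            sp = side(wire_start, wire_end, p)
            sq = side(wire_start, wire_end, q)
            # The step's endpoints lie on opposite sides of the wire's line...
            straddles_wire_line = sp * sq < 0
            # ...and the wire's endpoints lie on opposite sides of the step's line.
            straddles_step_line = side(p, q, wire_start) * side(p, q, wire_end) < 0
            if straddles_wire_line and straddles_step_line:
                if sp < 0:
                    to_positive += 1
                else:
                    to_negative += 1
    return to_positive, to_negative


if __name__ == "__main__":
    tracks = [
        [(-5, 2), (-1, 3), (3, 4)],   # crosses the wire once
        [(4, 5), (-2, 5)],            # crosses in the opposite direction
        [(-1, 20), (1, 20)],          # crosses the line but misses the segment
        [(-3, 1), (2, 1)],            # crosses the wire once
    ]
    print(count_crossings(tracks, (0, 0), (0, 10)))  # (1, 2)
```

Binning the crossings by timestamp would then yield frequencies per time interval; that bookkeeping is omitted here.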
Figure 2. Outdoor scene at a public-transportation hub in Graz (Austria) where people gather and wait for trams. Two pedestrians are tracked (blue boxes) and all of the previously detected trajectories are shown in light gray.
Figure 3. Trajectories of pedestrians for an entire day: more than 1000 people can be followed. The trajectories coincident with the tram tracks are false detections of patterns on the trams.
Our tracker continuously detects pedestrians in every incoming video frame and combines these single-frame detections into coherent trajectories using motion analysis. The underlying detection algorithm processes video at resolutions of up to 1024×768 pixels in real time. It is based on the ‘histograms of oriented gradients’ algorithm,2 which computes local statistics of edge orientations to form compact descriptions that are used to distinguish between pedestrians and background. A full image is scanned by sliding a detection window over all possible locations and repeating the process for a range of expected pedestrian sizes.
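A minimal sketch of the exhaustive multi-scale scan described above, assuming a hypothetical `classify(x, y, w, h)` callback that stands in for computing HOG features of a window and scoring them with the trained classifier. The 64×128 base window matches the detection window of the original HOG work; the stride, the scale set, and all function names are illustrative assumptions, not parameters of the authors' system.

```python
from typing import Callable, Iterator, List, Tuple

Detection = Tuple[int, int, int, int, float]  # x, y, width, height, score


def window_positions(img_w: int, img_h: int, win_w: int, win_h: int,
                     stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left corner of every window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y


def scan_image(img_w: int, img_h: int,
               classify: Callable[[int, int, int, int], float],
               base_window: Tuple[int, int] = (64, 128),
               scales: Tuple[float, ...] = (1.0, 1.25, 1.5625),
               stride: int = 8,
               threshold: float = 0.0) -> List[Detection]:
    """Slide a window over all positions and sizes; keep windows scoring above threshold."""
    detections: List[Detection] = []
    for s in scales:
        # One window size per expected pedestrian size.
        win_w = round(base_window[0] * s)
        win_h = round(base_window[1] * s)
        step = max(1, round(stride * s))  # keep the stride proportional to the window
        for x, y in window_positions(img_w, img_h, win_w, win_h, step):
            score = classify(x, y, win_w, win_h)
            if score > threshold:
                detections.append((x, y, win_w, win_h, score))
    return detections
```

Practical HOG detectors usually rescale the image (an image pyramid) instead of the window, so that gradient cells keep a fixed pixel size, and they merge overlapping hits with non-maximum suppression; both steps are left out of this sketch.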
Temporally continuous detection ensures that all pedestrians entering the camera's field of view are located as soon as possible. We then use these detections to update existing trajectories, so that detection and tracking support each other. We apply optical-flow-based motion estimation3 to predict each pedestrian's location between frames. Possible matches between extrapolated and actual detections are verified using texture-analysis methods. If a pedestrian is not detected at its predicted location or matching fails, its track nevertheless remains ‘active’ for a few frames but is marked as tentative. In this way, occlusions or gaps in detection lasting a few frames can be bridged. Once the pedestrian is detected again, the state reverts to tracked. When a pedestrian leaves the camera's field of view or is occluded for too long, the trajectory is closed and analyzed to determine, based on the quality of detection during its lifetime, whether it should be stored or rejected as invalid.
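The track life cycle described above can be sketched as a small state machine. This is a hedged illustration rather than the authors' implementation: optical-flow prediction and texture-based verification are abstracted away, so the caller passes either the verified detection for the track in the current frame or `None`. The miss limit (`max_misses`) and the closing-time quality criterion (fraction of frames with a verified detection, `min_hit_ratio`) are hypothetical parameters chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    TRACKED = "tracked"
    TENTATIVE = "tentative"
    CLOSED = "closed"


@dataclass
class Track:
    track_id: int
    positions: List[Tuple[float, float]] = field(default_factory=list)
    state: State = State.TRACKED
    misses: int = 0   # consecutive frames without a verified match
    hits: int = 0     # frames with a verified match
    frames: int = 0   # frames processed since the track was opened

    def update(self, match: Optional[Tuple[float, float]], max_misses: int = 5) -> None:
        """Advance one frame; `match` is the verified detection or None."""
        if self.state is State.CLOSED:
            return
        self.frames += 1
        if match is not None:
            self.positions.append(match)
            self.hits += 1
            self.misses = 0
            self.state = State.TRACKED      # detected again: back to tracked
        else:
            self.misses += 1
            # Bridge short gaps as tentative; close after too many misses.
            self.state = State.CLOSED if self.misses > max_misses else State.TENTATIVE

    def close(self) -> None:
        """Close explicitly, e.g. when the pedestrian leaves the field of view."""
        self.state = State.CLOSED

    def is_valid(self, min_hit_ratio: float = 0.6) -> bool:
        """Hypothetical quality check applied when deciding whether to store the trajectory."""
        return self.frames > 0 and self.hits / self.frames >= min_hit_ratio
```

For example, feeding the detections `(0, 0)`, `(1, 0)`, `None`, `None`, `(4, 0)` with `max_misses=2` leaves the track in the tracked state with 3 hits in 5 frames; three further misses then close it.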
Advances in computer-vision science enable us to build tracking systems that can operate in light- to medium-dense crowds of people. Our pedestrian tracker can operate in indoor and outdoor scenarios independent of perspective, illumination conditions, or object velocity. It is practically configuration-free and represents a new type of vision-based technology that is smarter than what has thus far been deployed.
We are currently working to integrate the software into a smart camera that can autonomously collect information about pedestrian flow over days or weeks, or act as a sensor to control and optimize traffic flow at signalized crossings.
Oliver Sidla founded SLR Engineering in 2008. He has more than 15 years' experience in machine vision and computer science, designing systems ranging from stereo-vision applications for lunar-landing simulations to machine-vision inspection and sorting systems for a variety of industrial-production settings (glass, steel, paper). In recent years, he has been concentrating on the development of object detection and tracking software for visual surveillance and monitoring. His company specializes in pedestrian detection and tracking, vehicle and object detection, and license-plate reading. Oliver Sidla received his master's degree in computer science from the Technical University of Graz (Austria) in 1991.