
Electronic Imaging & Signal Processing

Collaborative method improves visual behavior recognition

A general framework that combines attribute extraction and image-sequence identification also enhances the results for both tasks.
1 June 2009, SPIE Newsroom. DOI: 10.1117/2.1200905.1613

Visual behavior recognition is a highly active research area, both because it is inherently interesting to scientists and because it has compelling applications such as automated surveillance, human-computer interaction, and medical diagnosis. It involves assigning one of several behavior classes to a sequence of one or more images. There are many different approaches to this task,1 but the general trend is to separate it into two sequential phases: extraction of relevant attributes (features) from the input image sequence, and recognition, in which the extracted feature sequence is assigned to a class of actions. In such a two-pass procedure, recognition success depends heavily on the attribute extraction phase without being able to influence it.

Figure 1. Tests on a 15-word vocabulary, each finger-spelled 10 times by a hearing-impaired person. Rows 1–2: Correct segmentation and behavior recognition using our collaborative framework are shown on a test sequence representing the word ‘Albania.’ The segmentation can handle the cluttered background because of the knowledge infused during the identification process. Rows 3–4: In the traditional sequential approach, the cluttered background impairs the segmentation, and recognition fails because of incorrect attributes, so the recognized word is ‘Algeria.’

We focus on recognizing single-object behavior from monocular image sequences. To date, researchers have addressed the shortcomings of attribute extraction by adding various ad hoc processing steps that tailor the features to the subsequent recognition. We propose instead a unified mathematical framework for joint resolution of attribute extraction and behavior recognition.2, 3 An added benefit of this approach is that collaboration and sharing of existing knowledge improve the results of each task.

In our model, behavior comprises a sequence of simple actions, each with a different likelihood of observing a particular set of image attributes at a given time. We couple a statistical hidden Markov model (HMM) with a probabilistic feature extraction model. In this way, we infer the most likely behavior, associated action sequence, and extracted attribute sequence.
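The structure described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: each behavior is modeled as an HMM whose hidden states are simple actions, each action emits an image-attribute vector under an assumed isotropic Gaussian likelihood, and a sequence is classified by the behavior HMM that assigns it the highest likelihood (forward algorithm). All behavior names, parameters, and numbers are invented for illustration.

```python
import numpy as np

def forward_loglik(obs, pi, A, mus, sigma=1.0):
    """Log-likelihood of an attribute sequence under one behavior HMM,
    computed with the forward algorithm in log space."""
    obs = np.atleast_2d(obs).astype(float)
    T, S = obs.shape[0], len(pi)
    # Log emission likelihood of each observation under each action
    # (isotropic Gaussian around that action's mean attribute vector;
    # the constant normalization term cancels in the comparison).
    logB = np.stack([-0.5 * np.sum((obs - mu) ** 2, axis=1) / sigma ** 2
                     for mu in mus], axis=1)  # shape (T, S)
    log_alpha = np.log(pi) + logB[0]
    for t in range(1, T):
        log_alpha = logB[t] + np.logaddexp.reduce(
            log_alpha[:, None] + np.log(A), axis=0)
    return np.logaddexp.reduce(log_alpha)

# Two toy behaviors, each a two-action HMM with different emission means.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
behaviors = {"wave":  [np.array([0.0]), np.array([1.0])],
             "point": [np.array([5.0]), np.array([6.0])]}
seq = np.array([[0.1], [0.9], [1.1], [0.2]])
best = max(behaviors, key=lambda b: forward_loglik(seq, pi, A, behaviors[b]))
print(best)  # "wave": the attributes match that behavior's action means
```

In the authors' framework the emission term is not a fixed Gaussian but is coupled to the probabilistic feature extraction, so the attribute sequence itself is inferred jointly with the behavior.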

Inference is performed by adapting the Viterbi decoding algorithm to our model to find the most likely sequence. We translate the probabilistic attribute extraction into a variational segmentation model4 that allows us to pull out features as functions of the image and object contour. The model combines them with prior information about the most likely recognition attribute.
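A standard Viterbi decoder over the action states can be sketched as follows. This is a hedged, generic version: in the authors' adaptation the per-frame emission term additionally involves the variational segmentation energy, which is omitted here, and all numbers below are illustrative.

```python
import numpy as np

def viterbi(logB, pi, A):
    """Most likely action sequence given per-frame log emission
    likelihoods logB (T x S), initial probabilities pi, and the
    action transition matrix A."""
    T, S = logB.shape
    logA = np.log(A)
    delta = np.log(pi) + logB[0]        # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + logA  # (S, S): previous -> current state
        back[t] = np.argmax(scores, axis=0)
        delta = logB[t] + np.max(scores, axis=0)
    # Backtrack from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Example: two actions; emissions favor action 0 early, action 1 late.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
logB = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
print(viterbi(logB, pi, A))  # [0, 0, 1, 1]
```

Running the decoder once per behavior HMM and keeping the highest-scoring result yields both the recognized behavior and its associated action sequence.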

Our framework is defined in general terms with free parameters that can be chosen depending on the application. A finger-spelling identification application shows the robustness of our model in a challenging scenario with a cluttered background. Our model also outperforms the traditional sequential approach, in which attributes are first extracted by variational image segmentation and then used for behavior recognition via HMMs (see Figure 1). The recognition rate for a 15-word vocabulary is 85.3%, with rates higher than 80% for 12 out of 15 words.

Previous behavior recognition approaches used a sequential, two-phase procedure that can cause recognition errors when the attribute extraction phase fails under difficult imaging or other conditions. Rather than applying an ad hoc solution, we propose a mathematical framework that unifies feature extraction and recognition. This method improves attribute extraction thanks to the knowledge contributed by the ongoing recognition, which in turn results in better identification. Future work will focus on extending the current finger-spelling application to multi-user scenarios, applying the framework to other behavior recognition tasks, and extending it to handle more complex actions.

Laura Gui 
Signal Processing Laboratory (LTS5)
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland

Laura Gui received her BSc in computer science from the Polytechnic University of Timisoara, Romania, in 2003 and her PhD from EPFL in 2008. Her research includes image segmentation, active contours, and probabilistic methods for sequence classification.

Nikos Paragios
Laboratoire MAS
École Centrale de Paris (ECP)
Chatenay-Malabry, France
GALEN group
INRIA Saclay, Île-de-France 
Orsay, France

Nikos Paragios holds BSc, MSc, PhD, and DSc degrees. He is a professor of applied mathematics at the ECP and an affiliated researcher directing the GALEN group at INRIA Saclay. His research includes computer vision and medical-image analysis.