
Electronic Imaging & Signal Processing

Audio to the Rescue

The addition of audio data improves the accuracy of video face recognition.

From oemagazine January 2004
31 January 2004, SPIE Newsroom. DOI: 10.1117/2.5200401.0002

The needs of law enforcement and security personnel, along with automated video-indexing applications for video archives, are driving the development of new systems and techniques for automatic identification of human faces. These systems typically use a biometric key to identify a person within a population.

Traditionally, we define biometrics as the science and technology of measuring and statistically analyzing biological data. Until recently, biometrics was the science of statistically evaluating various aspects of life expectancy. In today's security-conscious world, however, a biometric 'key' is a digital version or mathematical formula that describes or represents unique characteristics of an individual, whether it be the shape of the face, eyes, hands, voice, or any other unique attribute. In most cases, the choice of a particular biometric depends strongly on the final application. For instance, retinal scans have demonstrated high recognition accuracy, but they require cooperative individuals, which is not always possible.

Although researchers have obtained relatively high recognition rates using a single biometric, systems based on a single biometric can still suffer limitations: for example, changes in illumination or pose for a face-based biometric, or changes in ambient noise and channel distortion for a voice-based biometric. A combined approach using complementary biometrics can improve system performance because degradations for each modality usually are uncorrelated. A good example of a system that combines multiple information sources is the human brain and sensory system. Studies show that a speaker is more intelligible to a listener who can both see and hear them than to one who can only see or only hear them.

Although much attention has been paid to biometric development for security and physical access applications, the recognition of people in video sequences for video indexing applications is an immediate need with significant commercial opportunity. For example, people currently have to manually review videotape of sports and news events to create video archives that are searchable by person, time, event, place, and so on. An automated method would simplify this process.

With video indexing, the user needs to locate video clips in which a particular person m appears. Examples include taped footage of news anchors, head and shoulder sequences of people being interviewed, and so on. In our approach, if person m is being sought, then, for each clip in the video sequence, the identity m will be proposed and the recognition system will verify or deny this identity claim. Although our initial work focused on a system that used a face biometric only, we have improved recognition results by including voice information. Although we focus on video indexing, the techniques we present here are general enough to be applied to many other face recognition applications.

I Know That Face

Face recognition has been an active research topic for more than a decade. Initially, face recognition systems focused on still images. In recent years, face recognition in image sequences has gained significant attention. Image sequences offer the advantage of allowing an automated system to select individual frames that offer the best chance for a biometric match with the stored video footage.

Face recognition approaches on still images can be broadly grouped into geometric and template-matching techniques. Geometric techniques are based on the distances between different facial features. Although this approach has been widely used, its effectiveness has been limited.

Template matching involves the comparison of a 2-D array of pixel intensity values (an image of a face) with several stored templates representing the whole face. More successful template-matching approaches use principal-components analysis (PCA) or linear-discriminant analysis (LDA). These approaches provide dimensionality reduction, which is a way to represent the data with a reduced number of features. This step is very important to obtain a system with good generalization, that is, the ability to work well on unseen test data.

Other template-matching methods use neural-network classification and deformable templates, such as elastic graph matching (EGM). Recently, researchers have proposed a set of approaches that use different techniques to correct perspective distortion. These techniques are sometimes referred to as view-tolerant; one example is based on pseudo-2-D hidden Markov models (HMMs).1 A comparison between different face recognition techniques can be found in a recent survey paper, which concludes that although all of the algorithms have been successfully used for face recognition, each offers its own advantages and disadvantages. The technique to be used is therefore chosen based on the final application.2 EGM techniques require large face resolutions, for example. Methods such as LDA are better suited for identification applications, in which only one face example is generally available for each person. In other cases, the difficulty of training the face model limits the use of some algorithms, such as HMM algorithms.

A recent comparison of 12 different face recognition algorithms by the U.S. National Institute of Standards and Technology determined that EGM, LDA, and PCA are the three most successful approaches, with each method showing different levels of performance on different subsets of images.3

Principal Components Analysis

Our group is focusing on face recognition using a variant of the well-known PCA technique4 known as the self-eigenfaces technique.5 This technique is well suited for video indexing applications when many images of a specific face viewed from a similar perspective are available for training purposes. As a general conclusion, the self-eigenface approach works well as long as the image under test is similar to the ensemble of images used in the calculation of the self-eigenfaces. For instance, if the system has been trained using frontal faces, it won't be able to recognize profile views. This happens because the reconstruction error of the profile face will be high, irrespective of the identity of the test face.

We can extend this conclusion to the general PCA approach. The self-eigenface approach takes advantage of this by performing a separate PCA for each person Pi to be recognized. The PCA process can be summarized as follows. Let X = {x1, x2,..., xm} be a training data set, where each xi is the vector representation of a face image obtained by concatenating all the columns of the image. We can compute the mean xµ and covariance Σ of the data as

$$x_\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - x_\mu)(x_i - x_\mu)^T$$

A non-zero vector νk for which

$$\Sigma \nu_k = \lambda_k \nu_k$$

is an eigenvector of the covariance matrix, while λk is the corresponding eigenvalue. Since the eigenvectors of the covariance matrix look like faces, they are called self-eigenfaces to emphasize that they are built using different views of the person Pi (see figure 1).

Figure 1. A small sample of training faces and the corresponding mean face and first self-eigenfaces can be used to model a particular person.

Let

$$V = [\nu_1, \nu_2, \ldots, \nu_k]$$

be a matrix built with the eigenvectors that correspond to the k largest eigenvalues. The subspace spanned by the columns of V is usually called the principal subspace. Using the principal subspace, a face pattern x can be linearly transformed into a k-dimensional vector y (usually k is much smaller than the number of pixels of x) by

$$y = V^T (x - x_\mu)$$

Conversely, we can approximate the original vector x from its transformed vector y as

$$\hat{x} = V y + x_\mu$$

The test phase involves projecting and reconstructing each test face using a particular set of self-eigenfaces. The reconstruction error

$$\varepsilon = \lVert x - \hat{x} \rVert^2$$

provides a confidence measurement that the test face corresponds to Pi. The basic idea behind this method is that, given a test face, we can achieve a low reconstruction error (good fit) when we use the self-eigenface set of the corresponding identity.
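The projection and reconstruction steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `fit_self_eigenfaces` and `reconstruction_error` are hypothetical, random synthetic vectors stand in for real face images, and the squared Euclidean norm is assumed as the error measure.

```python
import numpy as np

def fit_self_eigenfaces(X, k):
    """Fit one person's model. X is (m, d): one flattened image per row."""
    mean = X.mean(axis=0)
    centered = X - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of its covariance matrix, ordered by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    V = vt[:k].T  # (d, k): the k leading eigenvectors as columns
    return mean, V

def reconstruction_error(x, mean, V):
    """Project x onto the principal subspace, reconstruct, return squared error."""
    y = V.T @ (x - mean)   # k-dimensional representation
    x_hat = V @ y + mean   # approximation of x from y
    return float(np.sum((x - x_hat) ** 2))

# Synthetic stand-in data (assumption): the training "images" of one person
# lie close to a 3-dimensional affine subspace of a 64-dimensional space.
rng = np.random.default_rng(0)
d, m, k = 64, 40, 3
basis = rng.normal(size=(d, k))
offset = rng.normal(size=d)
train = offset + rng.normal(size=(m, k)) @ basis.T + 0.01 * rng.normal(size=(m, d))

mean, V = fit_self_eigenfaces(train, k)

same_person = offset + rng.normal(size=k) @ basis.T  # new view, same subspace
other_person = 3.0 * rng.normal(size=d)              # unrelated vector

err_same = reconstruction_error(same_person, mean, V)
err_other = reconstruction_error(other_person, mean, V)
print(err_same < err_other)
```

A vector drawn from the same subspace as the training set reconstructs with a much smaller error than an unrelated one, which is the property the identity check relies on.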

The self-eigenface technique can be easily extended to video sequences by applying the face recognition to every frame and then computing a global confidence value that Pi appears in the sequence. A practical way to obtain the global confidence measurement FC(Pi) is to use the median value:

$$FC(P_i) = \underset{k}{\operatorname{median}} \, \{E_k(P_i)\}$$

where Ek(Pi) is the face confidence for frame k. The median value assures good recognition for half of the frames.

Taking the median of a set of numbers requires first sorting the numbers, then taking the one in the middle of the sorted list. This means that if we have a low median value, at least half of the values in the set are at or below it. Another advantage of the median is that it is more robust to outliers than the mean. Imagine that the system fails to recognize a face in one frame, which yields a high reconstruction error. The mean would be strongly affected by that value; the median, however, is not greatly affected as long as this does not happen in more than half of the frames.
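A small numeric example makes the robustness argument concrete. The per-frame values below are invented for illustration, and `sequence_confidence` is a hypothetical helper name.

```python
from statistics import mean, median

def sequence_confidence(frame_confidences):
    """Global confidence for a clip: the median of the per-frame values."""
    return median(frame_confidences)

# Hypothetical per-frame reconstruction errors for a five-frame clip.
# The last frame is a failure (for example, an occluded face).
frame_errors = [0.11, 0.09, 0.12, 0.10, 5.00]

fc_median = sequence_confidence(frame_errors)
fc_mean = mean(frame_errors)

print(fc_median)       # the middle value of the sorted list
print(fc_mean > 1.0)   # the single outlier dominates the mean
```

Here one bad frame out of five pulls the mean above 1.0, while the median stays at the level of the four good frames.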

Face and Voice Fusion

As we mentioned earlier, combining different biometrics can result in improved automated recognition. Techniques for combining different information sources can be broadly grouped into pre-mapping and post-mapping fusion techniques.6 The first group combines information before any classifier or expert is applied. A classifier makes a hard decision ("this is a man's voice," for example), while an expert expresses a confidence value for each possible decision ("there is a 70% probability that this is a man's voice," for example). Pre-mapping fusion has been widely used in lip reading, for instance, in which visual and speech features combine to increase intelligibility.

Post-mapping techniques combine information after mapping from the feature space to the opinion/decision space using either a classifier or an expert. Pre-mapping techniques are more appropriate when the information sources are closely synchronized, as is the case for a standard videotape, in which the voice track and video track typically are synchronized. In the absence of synchronization, pre-mapping techniques do not work as well, especially if the number of features under consideration is high. Unfortunately, we can't offer a firm definition of what constitutes a high number of features. The answer depends heavily on the application. In general, more features are better, but only if they are good features that help to discriminate one image set from another.

Post-mapping fusion has the advantage of being able to combine opinions from different experts even if their outputs are not commensurate, meaning that the expert values are of a different type or fall in different ranges. For these reasons, we have used post-mapping fusion to combine the outputs from face and voice biometrics.

In general, we refer to person-recognition techniques that use the voice as a biometric as speaker recognition. Note that the objective here is not to know what is being said (speech recognition) but who says it. Speaker recognition techniques usually formulate the problem as a basic hypothesis test, in which, given a speech segment S, a decision has to be made whether or not it was spoken by person Pi.7 The optimum test is given by the log-likelihood ratio:

$$\Lambda(S) = \log p(S \mid P_i) - \log p(S \mid BM)$$

where p(S | Pi) and p(S | BM) are conditional probability-density functions based on models of person Pi and of the background noise, respectively. These models typically are created using Gaussian mixture models.8 After the speaker recognition process, a confidence value AC(Pi) is available, which can be used together with the face confidence value FC(Pi) to increase the recognition performance.
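The hypothesis test can be illustrated with a toy example. The sketch below assumes one-dimensional features and hand-picked mixture parameters so that it stays self-contained; real systems use multi-dimensional acoustic features and models trained from data. The names `gmm_log_likelihood` and `log_likelihood_ratio`, and all numeric values, are illustrative assumptions.

```python
import math

def gmm_log_likelihood(samples, weights, means, variances):
    """Sum of log densities of the samples under a 1-D Gaussian mixture."""
    total = 0.0
    for s in samples:
        density = sum(
            w * math.exp(-0.5 * (s - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
            for w, mu, var in zip(weights, means, variances)
        )
        total += math.log(density)
    return total

def log_likelihood_ratio(samples, speaker_model, background_model):
    """log p(S | speaker) - log p(S | background) for a segment S."""
    return (gmm_log_likelihood(samples, *speaker_model)
            - gmm_log_likelihood(samples, *background_model))

# Hypothetical models as (weights, means, variances).
speaker_model = ([0.5, 0.5], [-1.0, 1.0], [0.2, 0.2])  # two narrow components
background_model = ([1.0], [0.0], [4.0])               # one broad component

segment_from_speaker = [-1.1, 0.9, 1.05, -0.95]
segment_from_other = [3.5, -4.0, 2.8, 3.9]

llr_speaker = log_likelihood_ratio(segment_from_speaker, speaker_model, background_model)
llr_other = log_likelihood_ratio(segment_from_other, speaker_model, background_model)

print(llr_speaker > 0.0)  # the speaker model explains this segment better
print(llr_other < 0.0)    # the background model explains this segment better
```

A positive ratio supports the identity claim and a negative one argues against it; the value itself can serve as the audio confidence passed on to the fusion stage.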

Figure 2. Scatter plot shows face and voice confidences in the likelihood space, demonstrating that true and false candidates are better classified in the 2-D space. Speaker confidence and face confidence are expert confidence values.

A scatter plot of the 2-D opinion vectors [AC(Pi), FC(Pi)] shows that true and false candidates are better separated in the 2-D space than by thresholding a single confidence value (see figure 2). This separation can be exploited using a post-classifier that takes the expert opinions as features in the likelihood space. The post-classifier does not need to be very sophisticated; in fact, we have found that a simple mean-squared-error linear classifier is a good compromise between accuracy and generalization.9
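As a rough sketch of such a post-classifier, the example below fits a linear decision function by least squares to a handful of invented opinion vectors. It assumes that both confidences are scaled so that higher means more confident, that true candidates are labeled +1 and false candidates -1, and that the decision threshold is zero. The names `fit_mse_linear` and `accept` are hypothetical.

```python
import numpy as np

def fit_mse_linear(opinions, labels):
    """Least-squares fit of w so that [opinion, 1] @ w approximates the label."""
    A = np.hstack([opinions, np.ones((len(opinions), 1))])  # append bias column
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return w

def accept(opinion, w):
    """Accept the identity claim when the linear score is positive."""
    return float(np.append(opinion, 1.0) @ w) > 0.0

# Invented [audio confidence, face confidence] pairs for illustration.
true_opinions = np.array([[2.0, 0.2], [0.3, 2.1], [1.5, 1.4], [1.0, 1.2]])
false_opinions = np.array([[0.5, 0.4], [-0.5, 1.0], [1.0, -0.6], [0.0, 0.1]])

opinions = np.vstack([true_opinions, false_opinions])
labels = np.array([1.0] * 4 + [-1.0] * 4)

w = fit_mse_linear(opinions, labels)

# Neither confidence separates the two groups on its own in this toy set...
audio_overlap = true_opinions[:, 0].min() < false_opinions[:, 0].max()
face_overlap = true_opinions[:, 1].min() < false_opinions[:, 1].max()
# ...but the fitted line in the 2-D opinion space does.
all_true_accepted = all(accept(o, w) for o in true_opinions)
all_false_rejected = not any(accept(o, w) for o in false_opinions)
print(audio_overlap, face_overlap, all_true_accepted, all_false_rejected)
```

In this toy set a threshold on either confidence alone misclassifies at least one candidate, whereas the line found in the 2-D space classifies all eight correctly.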

Figure 3. Person recognition using only a single biometric (face data) can fail, whereas including multiple biometrics (audio and face data) allows the system to correctly accept and reject the examples. Courtesy WLFI-TV

Our experiments show that the audio-visual approach to person recognition raises the correct-classification rate to 97%, compared with 93% obtained using only the image information (see figure 3).

Face recognition techniques are growing more sophisticated all the time. Adding speech information to the mix improves face-recognition performance, confirming the combination of audio and visual information as a very promising trend in face recognition.


1. S. Eickeler, S. Müller, et al., Image and Vision Computing, 18[4], p. 279 (2000).

2. W. Zhao, R. Chellappa, et al., Face recognition: A literature survey, Technical Report CAR-TR-948, University of Maryland (2002).

3. P. Phillips, H. Moon, et al., The FERET verification testing protocol for face recognition algorithms, Technical report NISTIR 6281, National Institute of Standards and Technology (1998).

4. M. Turk and A. Pentland, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, paper #XXX, p. 586-591 (1991).

5. E. Acosta, L. Torres, et al., International Conference on Acoustics, Speech and Signal Processing, Orlando, FL (2002).

6. C. Neti and A. Senior, Audio-visual speaker recognition for broadcast news, DARPA HUB4 Workshop, Washington DC (1999).

7. S. Furui, Automatic speech and speaker recognition, p. 31, Kluwer Academic Publishers, Boston, MA (1996).

8. D. Reynolds, T. Quatieri, et al., Digital Signal Processing, A review journal 10[1-3], p. 19 (2000).

9. R. Duda, P. Hart, et al., Pattern Classification, 2nd edition, Wiley-Interscience (2001).

Alberto Albiol
Alberto Albiol is a professor at the School of Telecommunications Engineering, Department of Engineering, Technical University of Valencia, Valencia, Spain.
Luis Torres
Luis Torres is a professor of telecommunications at Technical University of Catalonia, Barcelona, Spain.
Edward Delp
Edward Delp is a professor at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN.