3D human gesture recognition using integral imaging
Optical image sensing and visualization technologies in 3D have been researched extensively in fields as diverse as TV broadcasting, entertainment, medical sciences, and robotics.1–4 One promising technology, integral imaging, is an autostereoscopic 3D imaging method that offers a passive and relatively inexpensive way to capture 3D information and visualize it optically or computationally.5–7 Integral imaging belongs to the broader class of multi-view imaging techniques that allow depth analysis from three points of view: stereo, time-of-flight, and structured-light strategies.8 Integral imaging has also been used for classification tasks.9 However, we are the first to apply it to action recognition.10,11
Integral imaging provides the 3D profile and range of the objects in a scene using an array of high-resolution imaging sensors or in a synthetic aperture mode (see Figure 1). When a single sensor captures multiple 2D images, it is possible to obtain larger field-of-view (FOV) 2D images. In the synthetic aperture integral imaging mode that we used, a series of sensors is distributed in a grid, or a single sensor is moved to the positions in the grid. The horizontal or vertical distance between two adjacent positions is called the pitch (p). The 3D image reconstruction can be achieved by computationally simulating the optical back-projection of the elemental images. In Figure 1, cx and cy are the horizontal and vertical sizes of the sensor and f is its focal length. We used a computer-synthesized, virtual pinhole array to inversely map the elemental images into the object space (see Figure 1). Superimposing the properly shifted elemental images created the 3D reconstructed images.
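The superposition of shifted elemental images described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes grayscale elemental images on a regular grid, a pinhole model, integer-pixel shifts, and a pixel size equal to the sensor size divided by the number of pixels. The function name `reconstruct_plane`, its parameters, the numeric values, and the sign convention of the shift are all hypothetical.

```python
import numpy as np

def reconstruct_plane(elemental, pitch_mm, focal_mm, pixel_mm, z_mm):
    """Shift-and-sum reconstruction of the plane at depth z_mm.

    elemental: array (K, L, H, W) of grayscale elemental images,
               indexed by camera row k and camera column l.
    """
    K, L, H, W = elemental.shape
    # Assumed pinhole model: neighbouring cameras see a point at depth z
    # displaced by p * f / (pixel size * z) pixels.
    shift = pitch_mm * focal_mm / (pixel_mm * z_mm)
    acc = np.zeros((H, W))
    hits = np.zeros((H, W))  # number of images overlapping each pixel
    for k in range(K):
        for l in range(L):
            dy = int(round((k - K // 2) * shift))
            dx = int(round((l - L // 2) * shift))
            if abs(dy) >= H or abs(dx) >= W:
                continue  # image shifted entirely out of the frame
            # Slices that translate the image by (dy, dx) without wrap-around.
            dst_y = slice(max(0, dy), min(H, H + dy))
            src_y = slice(max(0, -dy), min(H, H - dy))
            dst_x = slice(max(0, dx), min(W, W + dx))
            src_x = slice(max(0, -dx), min(W, W - dx))
            acc[dst_y, dst_x] += elemental[k, l][src_y, src_x]
            hits[dst_y, dst_x] += 1
    return acc / np.maximum(hits, 1)

# Synthetic check: one bright point at 2500 mm seen by a 3x3 array.
K = L = 3
H = W = 21
pitch_mm, focal_mm, pixel_mm = 50.0, 10.0, 0.1  # illustrative values only
elemental = np.zeros((K, L, H, W))
for k in range(K):
    for l in range(L):
        # With these values, z = 2500 mm gives a 2-pixel disparity per camera step.
        elemental[k, l, 10 - 2 * (k - 1), 10 - 2 * (l - 1)] = 1.0

in_focus = reconstruct_plane(elemental, pitch_mm, focal_mm, pixel_mm, 2500.0)
out_of_focus = reconstruct_plane(elemental, pitch_mm, focal_mm, pixel_mm, 1250.0)
print(in_focus[10, 10], out_of_focus.max())
```

At the correct depth the nine copies of the point coincide, while at a wrong depth they land on different pixels and the averaged intensity drops. This is the effect that lets a reconstructed plane isolate objects at a given distance.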
Our methodology is based on acquiring 3D videos of hand gestures using an integral imaging system formed by an array of 3×3 cameras. We analyzed the potential of gesture recognition using 3D integral imaging and compared the performance to 2D single-camera videos. We processed sectional reconstructed representations of the objects in the scene using gesture recognition strategies. We believe the experiments provide evidence of the feasibility of gesture recognition with integral imaging.
Our setup included a 3×3 array of Stingray F080B/C cameras. Using an IEEE 1394 high-speed serial bus, we captured nine synchronized videos at 15 fps and a resolution of 1024×768 pixels, and subsequently rectified them.12 We acquired two performances of three different gestures from 10 people. All three gestures were made with the right arm extended: left, deny, and opening and closing the hand (see Figure 2).10,11 Each camera lens was focused at a plane about 2 m away. The depth of field allowed all objects and people from 0.5 to 3.5 m away to be in focus. The 10 people were about 2.5 m in front of the camera array. We acquired their gestures in a laboratory with no other movements. In total, we recorded 60 videos (three gestures, each performed twice by 10 people). For each video, we reconstructed the 3D volume for the first frame to infer the distance at which the hand was in focus. This distance was then used to reconstruct the volume for the remaining frames. We made the reconstruction from 1 to 3.5 m in 10 mm steps. Figure 2 shows the reconstructed plane where the action is in focus for the three different gestures we considered.
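The depth sweep and the selection of an in-focus plane can be illustrated with a short sketch. The article does not state how the in-focus distance was inferred, so the gradient-energy focus measure used here is an assumption chosen purely for illustration, as are the function names and the toy data.

```python
import numpy as np

def depth_planes_mm(start_mm=1000, stop_mm=3500, step_mm=10):
    """Depths from 1 m to 3.5 m in 10 mm steps, end point included."""
    return np.arange(start_mm, stop_mm + step_mm, step_mm)

def gradient_energy(plane):
    """Assumed focus measure: sum of squared finite differences."""
    gy = np.diff(plane, axis=0)
    gx = np.diff(plane, axis=1)
    return float((gy ** 2).sum() + (gx ** 2).sum())

def sharpest_depth(stack, depths):
    """Return the depth (in mm) whose reconstructed plane scores highest."""
    scores = [gradient_energy(plane) for plane in stack]
    return int(depths[int(np.argmax(scores))])

depths = depth_planes_mm()
# Toy stack: flat (defocused) planes except a textured one at index 150.
stack = np.full((len(depths), 8, 8), 0.5)
stack[150] = np.indices((8, 8)).sum(axis=0) % 2
best_mm = sharpest_depth(stack, depths)
print(len(depths), best_mm)
```

In a real sequence the measure would presumably be evaluated only on the image region containing the hand, since other objects in the scene come into focus at other depths.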
The method for characterizing and recognizing gestures can be summarized as follows:13 generating and characterizing spatiotemporal interest points (STIPs) for each video (see Figure 3);10,11 quantizing the resulting descriptors into a number of visual words (also called the codebook); creating a bag-of-words (BoW) representation for each video using its STIPs and the resulting codebook; and classifying unseen videos from their BoW representations.
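The codebook and BoW steps listed above can be sketched in a few lines. This is an illustrative outline only: random vectors stand in for the STIP descriptors, a plain Lloyd-style k-means replaces whatever clustering implementation was actually used, and the names `build_codebook`, `assign_words`, and `bow_histogram`, the codebook size, and the descriptor length are hypothetical.

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Index of the nearest visual word for each descriptor."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means: returns k cluster centres (the visual words)."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    codebook = descriptors[picks].astype(float)
    for _ in range(iters):
        labels = assign_words(descriptors, codebook)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                codebook[j] = members.mean(axis=0)
    return codebook

def bow_histogram(descriptors, codebook):
    """Normalized histogram of visual words for one video."""
    labels = assign_words(descriptors, codebook)
    hist = np.bincount(labels, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(200, 16))  # stand-in for HOG/HOF vectors
codebook = build_codebook(train_descriptors, k=8)
video_descriptors = rng.normal(size=(30, 16))   # descriptors of one video
hist = bow_histogram(video_descriptors, codebook)
print(hist.shape, hist.sum())
```

Each video, whatever its number of interest points, ends up as a fixed-length vector, which is what makes it usable as input to a standard classifier.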
We generated STIPs and, from them, histograms of oriented gradients (HOG) and of optic flow (HOF). We quantized these histograms into visual words through k-means clustering. We represented each video by creating a histogram of codewords.14 To estimate the gesture recognition performance, we followed a ‘leave-one-subject-out’ protocol.15,16 We chose support vector machines as the pattern recognition method for classifying the gestures.17 Our results showed10 that integral imaging outperformed acquisition with the central camera of the 3×3 array when comparing the best descriptor in each case (HOF for monocular and HOF+HOG for integral imaging).
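The leave-one-subject-out protocol can be made concrete with a small sketch of the data split. Only the fold construction is shown. The classifier is left out, and the gesture labels, variable names, and the helper `leave_one_subject_out` are hypothetical.

```python
def leave_one_subject_out(subjects):
    """Yield (held_out, train_idx, test_idx) for each distinct subject."""
    for held_out in sorted(set(subjects)):
        train_idx = [i for i, s in enumerate(subjects) if s != held_out]
        test_idx = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train_idx, test_idx

# Index the 60 videos: 10 people x 3 gestures x 2 performances.
gestures = ["left", "deny", "open_close"]  # illustrative label names
videos = [(person, gesture, rep)
          for person in range(10)
          for gesture in gestures
          for rep in range(2)]
subjects = [person for person, _, _ in videos]

folds = list(leave_one_subject_out(subjects))
for held_out, train_idx, test_idx in folds:
    # A classifier (an SVM in the article) would be fit on train_idx and
    # scored on test_idx here; that step is omitted in this sketch.
    assert held_out not in {subjects[i] for i in train_idx}
print(len(folds), len(folds[0][1]), len(folds[0][2]))
```

Because every video from the held-out person is excluded from training, the reported accuracy reflects how well the classifier generalizes to people it has never seen.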
In summary, 3D information shows potential in improving the accuracy of human gesture recognition. Integral imaging allows us to reconstruct a 3D scene for only the planes where the gesture preferentially appears. This opens the door to the application of recognition strategies that were not previously possible and eventually to substantially increased recognition capability. Our next step will be to computationally parallelize the entire process so that it can be applied nearly in real time and to use other gesture recognition descriptors that exploit the focusing capabilities of integral imaging.
Universitat Jaume I
Castellón de la Plana, Spain
Pedro Latorre-Carmona is a postdoctoral researcher whose interests are 3D integral imaging, pattern recognition, photon-starved visualization, and multispectral image processing.
University of Connecticut
Bahram Javidi received his BS from George Washington University and MS and PhD from the Pennsylvania State University, all in electrical engineering. He is the Board of Trustees Distinguished Professor at the University of Connecticut. He has more than 900 publications, including over 400 peer-reviewed journal articles and over 440 conference proceedings, among them some 120 plenary addresses, keynote addresses, and invited conference papers.