Using auditory depth cues to enhance stereoscopic visual media

Auditory cues influence depth perception in 3DTV and could potentially extend the range of depth for comfortably viewing a 3D game or film.
11 December 2013
Jonathan Berry and Nick Holliman

The presentable range of depth for any given stereoscopic (3D) display is limited to a defined ‘depth budget’ (see Figure 1). This restricts the creative freedom of content producers, often requiring them to use computational methods to compress an image's desired range of depth to ensure that it does not exceed the intended display's depth budget.1, 2

Figure 1. The depth budget (shaded yellow and blue) for a 3D (i.e., stereoscopic) display exists due to the limitations of technology and the human visual system.8

A potential strategy for improving the depth budget is to exploit auditory-visual (or ‘cross-modal’) interactions. Several published studies now demonstrate a connection between the human auditory and visual systems.3 As such, replacing a 3D display's visual depth budget with an extended cross-modal one could potentially offer an improved depth experience for users. In our study, we investigated the possibility of using auditory cues to influence the perceived depth of a visual stimulus in a 3D display, thereby extending the perceived depth experienced in a 3D game or film.4

We began by characterizing the minimum audible depth difference (MAD) for our experimental setup. This is a prerequisite to our study because in order to investigate cross-modal perception, we must first understand the limitations of the individual auditory and visual modes. While the minimum visible stereoscopic difference is well understood,5, 6 only a handful of papers address the MAD.7 These papers reference a theoretical model—called the pressure discrimination hypothesis—which reduces auditory depth perception to a loudness discrimination task. The model predicts an MAD equivalent to 5% of the listening distance. Empirical studies, however, have struggled to match the values predicted by the theoretical model, with MAD values varying substantially with the nature of the environment and the technology used. This suggests that other cues, possibly arising from the experimental setup, confound the task.
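As a rough illustration of why the hypothesis amounts to a loudness task, the Python sketch below converts a depth difference into a sound level difference. It assumes free-field propagation, in which sound pressure falls off in inverse proportion to distance. That assumption, the function names, and the 1.25m example distance are ours for illustration and are not taken from the cited papers.

```python
import math


def level_difference_db(near_m: float, far_m: float) -> float:
    """Drop in sound pressure level (dB) between a near and a far source.

    Assumes an idealized free field where pressure is proportional to
    1/distance; real rooms add reflections that this ignores.
    """
    if near_m <= 0 or far_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(far_m / near_m)


def predicted_mad_m(listening_distance_m: float, fraction: float = 0.05) -> float:
    """MAD predicted by the model: a fixed fraction (5%) of the listening distance."""
    return fraction * listening_distance_m


listening_distance_m = 1.25  # example distance, chosen for illustration
mad_m = predicted_mad_m(listening_distance_m)
delta_db = level_difference_db(listening_distance_m, listening_distance_m + mad_m)

print(f"Predicted MAD: {mad_m * 100:.2f} cm")
print(f"Equivalent level difference: {delta_db:.2f} dB")
```

Under these assumptions, a 5% change in distance corresponds to a level change of well under 1dB, which gives a sense of how fine a loudness judgment the model requires of the listener.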

Given the importance of measuring the MAD for our own experimental setup, we tested how well listeners could distinguish between two sound sources set at variable distances apart from each other, and at variable distances in front of the listener. Our results revealed that performance was significantly different from chance for every distance tested (see Figure 2). For the 1.25m listening distance, we placed an upper bound of 25cm on the MAD. This distance is 20% of the listening distance, as opposed to the 5% predicted by the theoretical model.
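The comparison between our measured bound and the model's prediction is simple arithmetic, worked through in the short Python sketch below. The variable names are illustrative; the numbers are the ones quoted above.

```python
listening_distance_m = 1.25    # listening distance used in the experiment
measured_upper_bound_m = 0.25  # upper bound placed on the MAD (25cm)
model_fraction = 0.05          # fraction predicted by the theoretical model

# Measured bound expressed as a fraction of the listening distance.
measured_fraction = measured_upper_bound_m / listening_distance_m

# What the model would predict at the same listening distance.
model_mad_m = model_fraction * listening_distance_m

# How many times larger the measured bound is than the prediction.
ratio = measured_fraction / model_fraction

print(f"Measured bound: {measured_fraction:.0%} of listening distance")
print(f"Model prediction: {model_mad_m * 100:.2f} cm")
print(f"Measured bound is {ratio:.1f} times the predicted MAD")
```

Because the measured value is only an upper bound, the true MAD for this setup may lie anywhere between the model's prediction and 25cm.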

Figure 2. Changes in mean subject score as subjects attempt to differentiate between two speakers at distances 0.25–2.5m apart from each other (the depth difference). At all depth differences, the mean score is significantly different from chance, although performance drops off toward the lower depth differences. From these results we placed an upper bound on the MAD of 25cm at 1.25m.4 MAD: Minimum audible depth difference.

Having established a working value for the MAD, we next investigated whether auditory depth influences the perception of visual depth in 3DTV, i.e., whether a cross-modal influence exists. Our experimental setup consisted of a visual stimulus (mobile telephone) with a fixed stereo depth, accompanied by an auditory stimulus (telephone ring) at one of two possible depths: either congruent with the visual stimulus, or incongruent at 25cm in front of the visual stimulus (see Figure 3). Our subjects were randomly shown pairs of these images and asked to determine which image shows the stimulus to be nearer. In the majority of tests, we found that subjects judged the image's stimulus to be nearer if the accompanying sound was nearer. If the subjects had failed to identify any depth difference between the images, we would have expected a random distribution of responses with respect to the auditory condition. Our observations suggest that incongruent, cross-modal depth presentations can yield fused stimuli at a depth different to the visual component alone. This effect may allow us to extend the depth budget of 3DTV in the future.
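The chance-level reasoning above can be made concrete with an exact binomial test, sketched here in Python. The trial counts are hypothetical and chosen only to show the calculation; they are not our experimental data, and this is not necessarily the analysis used in the study.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis of success probability p.
    """
    probs = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    # Small relative tolerance guards against floating-point ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL counts: trials in which the image paired with the nearer
# sound was judged nearer, out of all trials. Not the study's data.
nearer_sound_chosen = 30
total_trials = 40

p_value = binomial_two_sided_p(nearer_sound_chosen, total_trials)
print(f"Two-sided p-value against chance: {p_value:.4f}")
```

If responses were unrelated to the auditory condition, the count would sit near half the trials and the p-value would be large; a small p-value indicates that the sound position shifted the depth judgment.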

Figure 3. The arrangement of the visual and audio stimuli for the cross-modal experiment. In the majority of cases, subjects judged the visual stimulus to be nearer if it was accompanied by sound from the speaker labeled ‘N’ instead of the speaker labeled ‘F.’ From this observation, we propose that it is possible to influence the perceived location of a cross-modal stimulus by varying the audio component alone.

Our experiments showed that subjects can hear an auditory depth difference as small as 0.25m at a listening distance of 1.25m. Furthermore, this auditory depth difference influenced a subject's perception of depth in a stereoscopic image. We are currently working to improve the quality and reliability of the results described in this article. We are first seeking to re-measure our value for the MAD by using a new experimental design and a set of carefully matched loudspeakers. From this we intend to formulate a reliable screening test for auditory depth perception. We will then use this screening test and the newly calibrated apparatus to perform the cross-modal experiment again. Open questions for future research include the size of the cross-modal bias, and which variables influence it.

The authors thank the Engineering and Physical Sciences Research Council and Durham University for funding Jonathan Berry's PhD. We also acknowledge the invaluable contribution of Amy Turner (now at Microsoft UK) toward early experimentation in this study. We would especially like to thank Tommaso Selvetti of K-Array, Italy, for finding the matched pair of their loudspeakers that we are using in our recent experiments.

Jonathan Berry
Durham University
Durham, United Kingdom

Jonathan Berry is a PhD student in the School of Engineering and Computing Sciences.

Nick Holliman
The University of York
York, United Kingdom

Nick Holliman researches the science and engineering of interactive digital media with a specialty in binocular 3D, including the theory, human factors, and applications of 3D displays.

1. N. Holliman, Mapping perceived depth to regions of interest in stereoscopic images, Proc. SPIE 5291, p. 117-128, 2004. doi:10.1117/12.525853
2. N. Holliman, Smoothing region boundaries in variable depth mapping for real time stereoscopic images, Proc. SPIE 5664, p. 281-292, 2005. doi:10.1117/12.586712
3. L. Shams, R. Kim, Crossmodal influences on visual perception, Phys. Life Rev. 7, p. 269-284, 2010.
4. A. Turner, J. Berry, N. Holliman, Can the perception of depth in stereoscopic images be influenced by 3D sound?, Proc. SPIE 7863, p. 786307, 2011. doi:10.1117/12.871960
5. H. J. Howard, A test for the judgment of distance, Trans. Am. Ophthalmol. Soc. 17, p. 195, 1919.
6. V. L. Fu, E. E. Birch, J. M. Holmes, Assessment of a new Distance Randot stereoacuity test, J. Am. Assoc. Ped. Ophthalmol. Strabismus 10(5), p. 419-423, 2006.
7. D. H. Ashmead, D. LeRoy, R. D. Odom, Perceptions of the relative distances of nearby sound sources, Percept. Psychophys. 47(4), p. 326-331, 1990.
8. G. R. Jones, D. Lee, N. S. Holliman, D. Ezra, Controlling perceived depth in stereoscopic images, Proc. SPIE 4297, p. 42-53, 2001. doi:10.1117/12.430855