Using auditory depth cues to enhance stereoscopic visual media
The range of depth that any given stereoscopic (3D) display can present is limited to a defined ‘depth budget’ (see Figure 1). This restricts the creative freedom of content producers, often requiring them to use computational methods to compress an image's desired range of depth so that it does not exceed the intended display's depth budget.1, 2
A potential strategy for extending the depth budget is to exploit auditory-visual (or ‘cross-modal’) interactions. A number of published studies now demonstrate a connection between the human auditory and visual systems.3 As such, replacing a 3D display's visual depth budget with an extended cross-modal one could offer users an improved depth experience. In our study, we investigated whether auditory cues can influence the perceived depth of a visual stimulus in a 3D display, thereby extending the perceived depth experienced in a 3D game or film.4
We began by characterizing the minimum audible depth difference (MAD) for our experimental setup. This is a prerequisite because, to investigate cross-modal perception, we must first understand the limits of the individual auditory and visual modes. While the minimum visible stereoscopic difference is well understood,5, 6 only a handful of papers address the MAD.7 These papers reference a theoretical model, called the pressure discrimination hypothesis, which reduces auditory depth perception to a loudness discrimination task. The model predicts an MAD equivalent to 5% of the listening distance. Empirical studies, however, have struggled to match this prediction, with measured MAD values varying substantially with the nature of the environment and the technology used. This suggests that other cues, possibly arising from the experimental setup, confound the task.
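To make the 5% prediction concrete, the following is a minimal sketch of what the pressure discrimination hypothesis implies numerically. It assumes a free-field source whose sound pressure falls in inverse proportion to distance, which is our simplification and not a description of the experimental room. The function names and the `pressure_jnd` parameter are hypothetical and chosen for illustration.

```python
import math


def predicted_mad(listening_distance_m, pressure_jnd=0.05):
    """MAD predicted if depth is judged from sound pressure alone.

    Assumption: pressure falls as 1/distance, so the smallest detectable
    fractional change in pressure maps approximately onto the same
    fractional change in distance (5% by default).
    """
    return pressure_jnd * listening_distance_m


def level_difference_db(near_m, far_m):
    """Level drop in dB between two source distances under the 1/distance law."""
    return 20.0 * math.log10(far_m / near_m)


if __name__ == "__main__":
    distance = 1.25  # listening distance in metres
    mad = predicted_mad(distance)
    print(f"predicted MAD: {mad:.4f} m")
    print(f"level change: {level_difference_db(distance, distance + mad):.3f} dB")
```

Under these assumptions the predicted cue is a level change of well under 1 dB, which may help explain why other cues in a real room can dominate the task.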
Given the importance of measuring the MAD for our own experimental setup, we tested how well listeners could distinguish between two sound sources separated in depth by a variable amount and placed at variable distances in front of the listener. Our results revealed that performance was significantly different from chance for every distance tested (see Figure 2). For the 1.25 m listening distance, we placed an upper bound of 25 cm on the MAD. This is 20% of the listening distance, four times the 5% predicted by the theoretical model.
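As an illustration of what "significantly different from chance" can mean in a task like this, here is a minimal sketch of an exact one-sided binomial test, together with the ratio calculation behind the 20% figure. It assumes a two-alternative task in which a guessing listener is correct half the time. The article does not state which statistical test was used, and the counts below are made up for illustration.

```python
from math import comb


def binomial_tail(successes, trials, chance=0.5):
    """Probability of at least `successes` correct answers out of `trials`
    if the listener were guessing with success probability `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def mad_fraction(depth_difference_m, listening_distance_m):
    """Depth difference expressed as a fraction of the listening distance."""
    return depth_difference_m / listening_distance_m


if __name__ == "__main__":
    # Made-up counts, for illustration only; not data from the study.
    p_value = binomial_tail(successes=15, trials=20)
    print(f"one-sided p-value under guessing: {p_value:.4f}")

    # 25 cm at a 1.25 m listening distance, as reported in the text.
    print(f"MAD upper bound as a fraction: {mad_fraction(0.25, 1.25):.2f}")
```

A small tail probability indicates that the observed number of correct answers would be unlikely if the listener could not hear the depth difference at all.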
Having established a working value for the MAD, we next investigated whether auditory depth influences the perception of visual depth in 3DTV, i.e., whether a cross-modal influence exists. Our experimental setup consisted of a visual stimulus (mobile telephone) with a fixed stereo depth, accompanied by an auditory stimulus (telephone ring) at one of two possible depths: either congruent with the visual stimulus, or incongruent at 25 cm in front of the visual stimulus (see Figure 3). Subjects were shown pairs of these images in random order and asked to judge which image showed the stimulus nearer. In the majority of tests, subjects judged the stimulus to be nearer in the image whose accompanying sound was nearer. If the subjects had failed to identify any depth difference between the images, we would have expected a random distribution of responses with respect to the auditory condition. Our observations suggest that incongruent, cross-modal depth presentations can yield fused stimuli at a depth different from that of the visual component alone. This effect may allow us to extend the depth budget of 3DTV in the future.
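The comparison described above can be sketched as a simple tally: for each pair of images, record which one carried the nearer sound and which one the subject chose as nearer. This sketch assumes each pair contains one congruent and one incongruent presentation, which the article does not spell out. The function name, the trial encoding, and the responses are all hypothetical.

```python
def sound_consistent_fraction(trials):
    """Fraction of trials in which the image chosen as nearer was the one
    accompanied by the nearer sound.

    Each trial is a (nearer_sound_image, chosen_image) pair of labels,
    for example ("A", "B"). Pairs in which both images had the same sound
    depth should be left out before calling this function.
    """
    if not trials:
        raise ValueError("no trials to summarise")
    matches = sum(1 for nearer_sound, chosen in trials if nearer_sound == chosen)
    return matches / len(trials)


if __name__ == "__main__":
    # Made-up responses, for illustration only; not data from the study.
    example_trials = [("A", "A"), ("B", "B"), ("A", "B"), ("B", "B")]
    print(sound_consistent_fraction(example_trials))  # prints 0.75
```

If sound had no influence on the visual judgement, this fraction would be expected to sit near 0.5. A value consistently above 0.5 is the pattern reported in the text.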
Our experiments showed that subjects can hear an auditory depth difference of 0.25 m or less at a listening distance of 1.25 m. Furthermore, this auditory depth difference influenced subjects' perception of depth in a stereoscopic image. We are currently working to improve the quality and reliability of the results described in this article. We are first seeking to re-measure our value for the MAD using a new experimental design and a set of carefully matched loudspeakers. From this we intend to formulate a reliable screening test for auditory depth perception. We will then use this screening test and the newly calibrated apparatus to repeat the cross-modal experiment. Open questions for future research include the size of the cross-modal bias and which variables influence it.
The authors thank the Engineering and Physical Sciences Research Council and Durham University for funding Jonathan Berry's PhD. We also acknowledge the invaluable contribution of Amy Turner (now at Microsoft UK) toward early experimentation in this study. We would especially like to thank Tommaso Selvetti of K-Array, Italy, for finding the matched pair of their loudspeakers that we are using in our recent experiments.
Jonathan Berry is a PhD student in the School of Engineering and Computing Sciences.
Nick Holliman researches the science and engineering of interactive digital media with a specialty in binocular 3D, including the theory, human factors, and applications of 3D displays.