
Optoelectronics & Communications

Optical techniques for sound processing

Optical and visual microphones offer novel possibilities for sound capture and storage beyond traditional mechanical sensing systems.
20 September 2016, SPIE Newsroom. DOI: 10.1117/2.1201608.006681

From an engineering perspective, optics and sound are related in several ways. Musical notation, for example, offers a graphical representation of sound as quantized frequencies along a quantized time axis. In other words, optical characters are used to capture sound and to deliver a symbolic recording of a melody. This can be regarded as the first optical technique for recording and storing sound. With the invention of photoelectric cells and laser-based optics, however, more sophisticated optical techniques for sound recording, storing, and processing have also been, and continue to be, developed. Such developments have led to sound being optically recorded on film, compact discs, and digital video discs. More recently, however, these optical media for storing and distributing sound and video have largely been replaced by digital distribution based on dedicated audio and video coding standards. In addition, the capture of sound with optical and visual microphone techniques offers a range of new possibilities for specific sound-sensing scenarios.

In the past, the recording of sound with microphones has mainly been achieved with condenser or moving-coil electronic components that transform a mechanical sound wave into an electrical signal. In contrast, in an optical microphone1—see Figure 1(a)—the vibration of a reflective membrane modulates the intensity of light falling on a photodetector, and that intensity variation is measured directly. Optical microphones have thus attracted a great deal of interest and have found a number of applications (ranging from pattern recognition to the generation of sound effects). For example, optical audio reconstruction2 and laser microphones3,4 have been demonstrated. Another recent development is the visual microphone5—see Figure 1(b)—a camera-based sound-capturing system. In this setup, sound events in a room (e.g., those coming from loudspeakers) cause an object to vibrate slightly, and a high-speed camera is used to record the object. Frame-by-frame processing of the video then allows the vibration of the object to be analyzed and the sound waveform to be recovered. For this kind of application, the video frame rate should be comparable to typical audio sampling rates (i.e., from 8–44.1kHz). This type of analysis of temporal variations in videos6 can thus serve as the basis for new research directions.

Figure 1. Schematic diagrams of (a) an optical microphone and (b) a visual microphone. In the optical microphone, the vibration of the reflective membrane modulates the light intensity measured at the photodetector. In the visual microphone system, a high-speed camera is used to record the sound-induced vibrations of an object. The sound waveform is then recovered via frame-by-frame processing of the recorded vibrations.
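To make the frame-by-frame idea concrete, the following toy sketch (assuming a `frames` array of shape `(n_frames, height, width)`) recovers a signal from the mean pixel intensity of each frame. Note that the actual visual microphone5 relies on far more sensitive multi-scale phase analysis; raw intensity is used here only for illustration.

```python
import numpy as np

def recover_waveform(frames):
    """Toy visual-microphone sketch: track the mean intensity of the
    recorded patch over time and treat its variation as the sound
    signal. `frames` has shape (n_frames, height, width)."""
    signal = frames.reshape(len(frames), -1).mean(axis=1)
    signal = signal - signal.mean()          # remove the DC offset
    peak = np.abs(signal).max()
    return signal / peak if peak > 0 else signal
```

With a 2.2kHz frame rate, for instance, 100 frames would correspond to about 45ms of recovered audio.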

In our work,7,8 we provide an overview of optical techniques that can be used for sound processing and of current trends in this research field. We also outline various basic optical sound-processing systems and their performance parameters. In addition, we introduce our novel sound-synthesis technique, which operates in the time-frequency domain.

Until recently, sound processing has generally been performed with the use of time-domain algorithms. With the newer generations of digital signal processors, however, we have been able to develop new algorithms9 and to implement them in real time on mobile devices. In our audio processing chain, we include a sound-analysis step in which Fourier transforms of consecutive overlapping frames of input samples are performed to produce a spectrogram. This spectrogram can be interpreted as an image in which every pixel corresponds to a quantized time slot and frequency band (time is depicted along the x-axis and frequency is shown along the y-axis). In addition, we encode the signal power of each time-frequency point with a grayscale or hue value. In the time-frequency processing step, we use optical pattern detection and recognition techniques to analyze and process the spectrogram. The processed spectrogram can then be transferred back to the time domain for sound reproduction (e.g., in loudspeakers or headphones).
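The analysis step described above can be sketched as follows. This is a minimal Python example with illustrative frame length and hop size, not the authors' implementation: each overlapping frame is windowed and Fourier-transformed, and the magnitudes form the spectrogram image.

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=128):
    """Analysis step: Fourier transforms of consecutive overlapping,
    windowed frames. Rows are frequency bins (y-axis), columns are
    time slots (x-axis), matching the image interpretation above."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)   # one spectrum per frame
    return np.abs(spectra).T                # magnitude image
```

For grayscale display, the magnitudes would typically be mapped to a logarithmic (dB) scale before quantization.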

We perform the spectrogram re-synthesis step with the use of a phase vocoder or a short-time Fourier transform framework (in which the columns of the image/spectrogram are treated as individual spectra).9 After obtaining an inverse Fourier transform for all of the spectra, we sum the resulting time-domain blocks (with some overlap) to yield the synthesized output signal. In this process, it is important to note that the usually non-symmetric and real-valued image data have to be extended with a conjugate-symmetric mirror spectrum so that real-valued time-domain signals can be obtained. The missing phase values can be chosen as either random values or zero. Furthermore, the application of more complex phase-estimation techniques should also be possible.
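A minimal sketch of this re-synthesis, assuming zero phase and a magnitude spectrogram that holds only the non-negative frequency bins (so that `numpy.fft.irfft` supplies the conjugate-symmetric mirror spectrum implicitly):

```python
import numpy as np

def resynthesize(mag, hop=128):
    """Spectrogram re-synthesis sketch: treat each column as a
    spectrum, attach zero phase (irfft enforces the conjugate-
    symmetric mirror, so the output is real-valued), inverse-
    transform, and overlap-add the time-domain blocks."""
    n_bins, n_cols = mag.shape
    frame_len = 2 * (n_bins - 1)
    out = np.zeros(hop * (n_cols - 1) + frame_len)
    for k in range(n_cols):
        block = np.fft.irfft(mag[:, k])     # real-valued block
        out[k * hop : k * hop + frame_len] += block
    return out
```

Replacing the zero phase with random values, or with a proper phase-vocoder estimate, only changes how `mag[:, k]` is turned into a complex spectrum before the inverse transform.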

The complete processing chain for an image-based synthesizer is shown in Figure 2. First, the image captured by a camera is pre-processed (through thresholding, edge-detection, smoothing, and other methods) to yield a spectrogram. This stage is an important part of shaping the resulting sound. Once the time-domain signal synthesis has been performed with a phase vocoder framework, the signal can be played back from a loudspeaker. The commercially available Photosounder program10 is based on these principles and offers versatile controls for influencing the sound synthesis process (and for modifying the source image).
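The pre-processing stage of this chain might be sketched as follows. This is a hypothetical example with illustrative threshold and smoothing settings, not the actual pipeline of Photosounder:

```python
import numpy as np

def preprocess_image(img, threshold=0.5, smooth=3):
    """Sketch of the pre-processing stage in Figure 2: thresholding
    followed by a simple moving-average smoothing along the time
    axis. The resulting array is then used as a spectrogram."""
    spec = np.where(img >= threshold, img, 0.0)       # thresholding
    kernel = np.ones(smooth) / smooth
    spec = np.apply_along_axis(                       # smoothing
        lambda row: np.convolve(row, kernel, mode='same'), 1, spec)
    return spec
```

Because every pre-processing choice directly reshapes the spectrogram, these parameters act as sound-design controls rather than mere image-cleanup settings.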

Figure 2. Illustration of an image-based sound synthesis processing chain. First, an image is captured with a camera. This image then undergoes pre-processing (e.g., via thresholding, edge-detection, and smoothing) to produce a spectrogram. A phase vocoder framework is then used to perform the time-domain signal synthesis. In the final step, the recovered signal can be played from a loudspeaker.

In summary, we have provided an introduction and overview of optical techniques for sound processing. By approaching sound processing from both an optical and acoustic viewpoint, a wide range of improvements can be made. For example, new approaches for capturing and processing combined audio/video signals from real-life scenes are the focus of our ongoing research.

Sebastian Kraft, Udo Zölzer
Helmut Schmidt University
Hamburg, Germany

1. W. Niehoff, V. Gorelik, M. Hibbing, Optisches Mikrofon, German Patent DE19835947 A1, 1998.
2. B. Li, J. B. L. Smith, I. Fujinaga, Optical audio reconstruction for stereo phonograph records using white light interferometry, 10th Int'l Soc. Music Info. Retrieval Conf., p. 627-632, 2009.
3. J. T. Vehgdan, Laser microphone, US Patent 6147787 A, 2000.
4. C.-C. Wang, S. Trivedi, F. Jin, V. Swaminathan, P. Rodriguez, N. S. Prasad, High sensitivity pulsed laser vibrometer and its application as a laser microphone, Appl. Phys. Lett. 94, p. 051112, 2009. doi:10.1063/1.3078520
5. A. Davis, M. Rubinstein, N. Wadhwa, G. Mysore, F. Durand, W. T. Freeman, The visual microphone: passive recovery of sound from video, ACM Trans. Graphics Proc. ACM SIGGRAPH 33, p. 79, 2014.
6. M. Rubinstein, Analysis and Visualization of Temporal Variations in Video, PhD thesis, Massachusetts Institute of Technology, 2014.
7. S. Kraft, U. Zölzer, Optical techniques for sound processing, Proc. SPIE 9948, p. 9948-33, 2016. (In press.)
8. F. Eichas, U. Zölzer, Modeling of an optocoupler-based audio dynamic range control circuit, Proc. SPIE 9948, p. 9948-31, 2016. (In press.)
9. U. Zölzer, DAFX: Digital Audio Effects, Wiley, 2011.
10. http://photosounder.com/ Webpage of the commercial Photosounder software. Accessed 16 August 2016.