Optical techniques for sound processing
From an engineering perspective, there are several aspects to the relationship between optics and sound. For example, the notation of sound events (i.e., through the musical notation system) offers a graphical representation of sound in terms of quantized frequency along a quantized time axis. In other words, optical characters are used to capture sound and to deliver a symbolic recording of a melody. This can be considered the first optical technique for recording and storing sound. With the invention of photoelectric cells and laser-based optics, however, more sophisticated optical techniques for sound recording, storage, and processing have been, and continue to be, developed. To date, such developments have led to sound being optically recorded on film, compact discs, and digital video discs. More recently, however, these physical media for storing and distributing sound and video have largely been replaced by digital formats based on dedicated audio and video coding standards. In addition, the capture of sound with optical and visual microphone techniques offers a range of new possibilities for specific sound-sensing scenarios.
In the past, the recording of sound with microphones has mainly relied on condenser or moving-coil transducers that transform a mechanical sound wave into an electric signal. In contrast, with optical microphones1—see Figure 1(a)—the vibration of a reflective membrane is measured directly via the variation of the light intensity at a photodetector. Optical microphones have thus gained a great deal of interest and have found a number of applications (ranging from pattern recognition to the generation of sound effects). For example, optical audio reconstruction2 and laser microphones3, 4 have been demonstrated. Another recent development is the visual microphone5—see Figure 1(b)—which is a camera-based sound-capturing system. In this setup, sound events in a room (e.g., those coming from speakers) cause an object to vibrate slightly, and a high-speed camera is used to record the object. By subsequently conducting frame-by-frame processing, the vibration of the object can be analyzed to recover the sound waveform. For this kind of application, the video frame rate should be comparable to typical audio sampling rates (i.e., 8–44.1kHz). This type of analysis of temporal variations in videos6 can thus serve as the basis for new research directions.
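The frame-by-frame recovery idea can be illustrated with a toy one-dimensional model: a bright spot in each synthetic "video frame" is displaced slightly by a driving sound wave, and tracking the spot's sub-pixel position across frames recovers the waveform. This is only a conceptual sketch; the published visual-microphone method5 relies on more sophisticated phase-based motion analysis, and all values here (frame rate, spot width, displacement amplitude) are illustrative assumptions.

```python
import numpy as np

# Toy model: a 100 Hz "sound" makes a Gaussian-shaped bright spot vibrate
# by a fraction of a pixel in each frame of a high-speed 1D video.
fps, n_frames, width = 2000, 400, 64
t = np.arange(n_frames) / fps
sound = np.sin(2 * np.pi * 100 * t)
x = np.arange(width)
frames = np.exp(-0.5 * ((x[None, :] - 32 - 0.3 * sound[:, None]) / 3.0) ** 2)

# Frame-by-frame analysis: the intensity centroid gives the spot's
# sub-pixel position, which is proportional to the driving waveform.
centroid = (frames * x).sum(axis=1) / frames.sum(axis=1)
recovered = centroid - centroid.mean()
```

Because the spot moves by only 0.3 pixels, no single frame reveals the sound; it is the temporal sequence of tiny displacements that carries the waveform.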
In our work,7,8 we provide an overview of optical techniques that can be used for sound processing, and of current trends in this research field. We also outline various basic optical sound-processing systems and their performance parameters. In addition, we introduce our novel sound-synthesis technique, which operates in the time-frequency domain.
Until recently, sound processing has generally been performed with time-domain algorithms. With newer generations of digital signal processors, however, we have been able to develop new algorithms9 and to implement them in real time on mobile devices. In our audio processing chain, we include a sound-analysis step in which Fourier transforms of consecutive overlapping frames of input samples are computed to produce a spectrogram. This spectrogram can be interpreted as an image in which every pixel corresponds to a quantized time slot and frequency band (time is depicted along the x-axis and frequency along the y-axis). In addition, we encode the signal power of each time-frequency point as a grayscale or hue value. In the time-frequency processing step, we use optical pattern detection and recognition techniques to analyze and process the spectrogram. The processed spectrogram can then be transferred back to the time domain for sound reproduction (e.g., through loudspeakers or headphones).
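The analysis step described above can be sketched in a few lines of numpy: overlapping frames are windowed, Fourier-transformed, and the resulting power values form the spectrogram image. The frame length, hop size, and window choice below are illustrative assumptions, not the parameters of our actual processing chain.

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=128):
    """Power spectrogram from consecutive overlapping, windowed frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[k * hop : k * hop + frame_len] * window
                       for k in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)          # one spectrum per frame
    power_db = 20 * np.log10(np.abs(spectra) + 1e-12)
    return power_db.T                              # rows: frequency, columns: time

# Analyze one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
S = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_hz = np.argmax(S.mean(axis=1)) * fs / 512     # strongest frequency row, near 440 Hz
```

Scaling `power_db` to the range of gray or hue values then yields exactly the kind of image that optical pattern-detection techniques can operate on.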
We perform the spectrogram re-synthesis step with the use of a phase vocoder or a short-time Fourier transform (STFT) framework (where the columns of the image/spectrogram are treated as individual spectra).9 After taking the inverse Fourier transform of each spectrum, we sum the resulting time-domain blocks (with some overlap) to yield the synthesized output signal. With this process, it is important to note that the image data, which is real-valued and usually not conjugate symmetric, has to be extended with a conjugate-symmetric mirror spectrum so that real-valued time-domain signals can be obtained. The missing phase values can be chosen as either random values or zero. Furthermore, the application of more complex phase-estimation techniques should also be possible.
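A minimal sketch of this overlap-add re-synthesis, assuming numpy, might look as follows. Note that `np.fft.irfft` constructs the conjugate-symmetric mirror spectrum implicitly from the given half-spectrum, so the output blocks are guaranteed to be real-valued; the hop size and the random-phase choice are illustrative assumptions.

```python
import numpy as np

def resynthesize(mag, hop=128):
    """Overlap-add re-synthesis of a magnitude spectrogram.

    mag: array of shape (n_bins, n_frames) with non-negative magnitudes,
    where n_bins = frame_len // 2 + 1 (columns are individual spectra)."""
    n_bins, n_frames = mag.shape
    frame_len = 2 * (n_bins - 1)
    rng = np.random.default_rng(0)
    phase = rng.uniform(0, 2 * np.pi, mag.shape)   # missing phases: random (or zero)
    spectra = mag * np.exp(1j * phase)
    out = np.zeros(frame_len + hop * (n_frames - 1))
    for k in range(n_frames):
        # irfft appends the conjugate-symmetric half, so each block is real
        block = np.fft.irfft(spectra[:, k], n=frame_len)
        out[k * hop : k * hop + frame_len] += block  # sum overlapping blocks
    return out

# A single active frequency bin yields a (phase-scrambled) tonal output.
mag = np.zeros((257, 40))
mag[28, :] = 1.0
y = resynthesize(mag)
```

With zero phase instead of random phase, every block becomes a symmetric pulse, which gives the output a more percussive, buzzy character; this is one of the audible design choices in spectrogram-based synthesis.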
The complete processing chain for an image-based synthesizer is shown in Figure 2. First, the image captured by a camera is pre-processed (through thresholding, edge-detection, smoothing, and other methods) to yield a spectrogram. This stage is an important part of shaping the resulting sound. Once the time-domain signal synthesis has been performed with a phase vocoder framework, the signal can be played back from a loudspeaker. The commercially available Photosounder program10 is based on these principles and offers versatile controls for influencing the sound synthesis process (and for modifying the source image).
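The end-to-end principle of such an image-based synthesizer can be condensed into a short sketch: pre-process a grayscale image, read its columns as magnitude spectra, and overlap-add the inverse transforms. This is a hypothetical illustration of the idea, not Photosounder's implementation; the threshold value, zero-phase choice, and frequency orientation are all assumptions.

```python
import numpy as np

def image_to_sound(img, hop=128, threshold=0.5):
    """Turn a grayscale image (values in [0, 1]) into audio by reading it
    as a spectrogram: columns are spectra, brightness is magnitude."""
    mag = np.flipud(img).copy()          # image row 0 is the top, i.e. highest frequency
    mag[mag < threshold] = 0.0           # pre-processing: keep only bright pixels
    n_bins, n_frames = mag.shape
    frame_len = 2 * (n_bins - 1)
    out = np.zeros(frame_len + hop * (n_frames - 1))
    for k in range(n_frames):
        # zero phase is assumed; irfft supplies the conjugate-symmetric half
        out[k * hop : k * hop + frame_len] += np.fft.irfft(mag[:, k], n=frame_len)
    return out / (np.max(np.abs(out)) + 1e-12)      # normalize to [-1, 1]

img = np.random.default_rng(1).uniform(size=(129, 32))  # stand-in for a camera image
audio = image_to_sound(img)
```

In this framing, the pre-processing stage (thresholding, edge detection, smoothing) directly sculpts the spectrogram and therefore the timbre of the resulting sound.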
In summary, we have provided an introduction and overview of optical techniques for sound processing. By approaching sound processing from both an optical and acoustic viewpoint, a wide range of improvements can be made. For example, new approaches for capturing and processing combined audio/video signals from real-life scenes are the focus of our ongoing research.