At present, humans cannot interact spontaneously with computers. Current state-of-the-art speech recognizers operate with an optimistic word-accuracy rate of ~99%. In spontaneous speech, users typically utter 150–200 words per minute. At a 1% word-error rate, this amounts to roughly 1.5 to 2 errors per minute, so the recognizer will make an error in less than a minute. Thus, true ‘hands-free’ computer interaction is not yet possible. Detecting when a user is talking to a machine, as opposed to someone else, is also an unsolved problem.
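The arithmetic behind this estimate can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, and errors are treated as an average rate spread evenly over the words spoken.

```python
def errors_per_minute(words_per_minute: float, word_accuracy: float) -> float:
    """Expected number of word errors per minute of speech."""
    return words_per_minute * (1.0 - word_accuracy)


def seconds_per_error(words_per_minute: float, word_accuracy: float) -> float:
    """Average time in seconds between two word errors."""
    return 60.0 / errors_per_minute(words_per_minute, word_accuracy)


if __name__ == "__main__":
    # Speaking rates and accuracy taken from the text above.
    for wpm in (150, 200):
        rate = errors_per_minute(wpm, 0.99)
        gap = seconds_per_error(wpm, 0.99)
        print(f"{wpm} wpm: {rate:.1f} errors/min, one error every {gap:.0f} s")
```

At 150 words per minute this gives about one error every 40 seconds, and at 200 words per minute about one every 30 seconds.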
We propose the use of a special trigger, a ‘wake-up word’ (WUW), to indicate when a user is addressing a machine, much as people use proper names to address one another. For this approach to succeed, we need to solve the problem of false acceptance. Present recognizers assume that all spoken interaction is in-vocabulary (INV, as opposed to out-of-vocabulary or OOV) speech. A machine that responds to a WUW, however, must listen all the time, and most of what it hears is not addressed to it. Hence, we have to solve two related problems: correctly recognizing the WUW and correctly rejecting everything else.
Figure 1. Details of signal processing at the front end of our speech-recognition setup, which contains a common spectrogram-computation module, feature-based voice-activity detection (VAD), and modules that compute three features for each analysis frame, including mel-frequency cepstral coefficients (MFCCs), linear predictive coefficient (LPC) MFCCs, and enhanced (ENH) MFCCs. The VAD classifier determines the state (speech or no speech) of each frame or segment. This information is used by the recognizer's back end. FFT: Fast Fourier transform.
We solved these problems by developing a WUW speech-recognition (SR) system.1,2 Our WUW SR is a highly efficient and accurate recognizer,3 specializing in detection of a single word or phrase when spoken in the alerting (or WUW) context of requesting attention,4,5 while rejecting all other words, phrases, sounds, noises, and other acoustic events with virtually 100% accuracy, including the same word or phrase uttered in nonalerting (referential) context.6,7
Figure 2. Overall wake-up-word speech-recognition (WUW SR) architecture. The signal-processing module accepts raw audio samples and produces spectral representations of short time (t) signals. The feature-extraction module generates features from this spectral representation, which are decoded with the corresponding hidden Markov models (HMMs). The individual feature scores are classified using support vector machines. INV, OOV: in-, out-of-vocabulary speech.
The WUW SR task is similar to keyword spotting. However, it differs in one important respect: the system must respond to the specific word or phrase only when it is used in an alerting context. For example, the sentence “Computer, begin PowerPoint presentation” uses the word ‘computer’ in an alerting context. In contrast, in “My computer has dual Intel 64-bit processors, each with quad cores,” the word ‘computer’ is used in a referential (nonalerting) context.
We developed WUW SR using only acoustic features, without relying on language modeling. These features are the triplet computed by the system's front end (see Figure 1). Figure 2 shows the system's overall architecture. Figure 3 shows results for one feature used for classification.
Figure 3. 3D discriminating surface using the triple-scoring technique deployed by our WUW recognition approach. The blue region indicates OOV scores, while the red points (enclosed in the green region) represent INV data.
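To make the idea of a discriminating surface in a three-dimensional score space more concrete, the following is a minimal sketch of a linear decision function applied to a triple of scores. The weights, bias, and example scores are hypothetical placeholders chosen for illustration. The system described here trains support vector machines on real HMM scores, and its actual surface and parameters are not given in this article.

```python
from typing import Sequence, Tuple

# Hypothetical hyperplane parameters (illustrative only, not from the article).
WEIGHTS: Tuple[float, float, float] = (1.0, 1.0, 1.0)
BIAS: float = -1.5


def decision_value(scores: Sequence[float]) -> float:
    """Return w.x + b for a triple of scores; the sign selects the class."""
    if len(scores) != 3:
        raise ValueError("expected exactly three scores")
    return sum(w * s for w, s in zip(WEIGHTS, scores)) + BIAS


def classify(scores: Sequence[float]) -> str:
    """Label a score triple as INV (accept) or OOV (reject)."""
    return "INV" if decision_value(scores) >= 0.0 else "OOV"


if __name__ == "__main__":
    # Made-up score triples on either side of the hypothetical plane.
    print(classify((0.9, 0.8, 0.7)))
    print(classify((0.2, 0.3, 0.1)))
```

With these placeholder values, the first triple falls on the INV side of the plane and the second on the OOV side. A trained support vector machine plays the same role, but learns the surface from labeled INV and OOV scores instead of using fixed weights.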
Extensive testing has demonstrated accuracy improvements of several orders of magnitude over both the best-known academic SR system, the Hidden Markov Model Toolkit (HTK), and a leading commercial system, Microsoft SAPI 5.1. Specifically, our WUW SR system correctly detects the WUW with 99.3% accuracy and correctly rejects non-WUW occurrences with 99.97% accuracy. In continuous, spontaneous free speech, it makes 0.45 false-acceptance errors per hour. Its WUW detection performance is 2514% (26×) better than that of HTK for the same training and test data, and 2271% (24×) better than that of Microsoft SAPI 5.1. Its non-WUW rejection performance is over 62,233% (653×) better than that of HTK and 5900 to 42,900% (60 to 430×) better than that of Microsoft SAPI 5.1.
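The relationship between the percentage and multiplicative figures, and the meaning of the false-acceptance rate, can be checked with a short sketch. It assumes that ‘N% better’ means (ratio − 1) × 100, an interpretation inferred from the 60× and 430× pairs quoted above and not stated in the text. The function names are hypothetical.

```python
def factor_to_percent_better(factor: float) -> float:
    """Convert an improvement ratio (e.g., 60x) to 'percent better'."""
    return (factor - 1.0) * 100.0


def percent_better_to_factor(percent: float) -> float:
    """Convert 'percent better' back to an improvement ratio."""
    return percent / 100.0 + 1.0


def hours_between_false_acceptances(per_hour: float) -> float:
    """Average number of hours between two false acceptances."""
    return 1.0 / per_hour


if __name__ == "__main__":
    print(factor_to_percent_better(60))            # ratio quoted for SAPI 5.1
    print(round(percent_better_to_factor(2514)))   # detection versus HTK
    print(round(hours_between_false_acceptances(0.45), 2))
```

Under this assumption, 60× corresponds to 5900% and 2514% to a ratio of about 26, and a rate of 0.45 false acceptances per hour means one false acceptance roughly every 2.2 hours of continuous speech.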
We are currently exploring ways to further improve discrimination between a WUW used in an alerting context and the same word used in referential settings by adding prosodic features. Intuitive and empirical evidence suggests that alerting contexts tend to be characterized by additional emphasis (needed to get the desired attention). This investigation is already leading to the development of prosody-based features that capture this increased emphasis.
Although our system performs admirably, there is room for improvement in several areas. Making support-vector-machine classification independent of OOV data is one of our primary tasks. Such a solution will enable deployment of various adaptation techniques in or near real time. The most important aspect of our solution is its generic nature: it can be applied to modeling any basic speech unit, such as phonemes (context-dependent or context-independent, e.g., triphones), which is necessary for large-vocabulary continuous SR applications. If the accuracy achieved here carries over to those models, as expected, these methods have the potential to revolutionize SR technology.
Florida Institute of Technology
Cocoa Beach, FL
Veton Këpuska has been working on SR technology for over two decades. He has worked for Voice Processing, BBN Technologies, and Nuance. He joined the Florida Institute of Technology to promote his WUW technology. He is also vice president of start-up company AdelaVoice.