Improving Wake-Up-Word and General Speech Recognition Systems
Abstract
Automatic Speech Recognition (ASR), a technology that allows a machine to recognize the
utterances a person speaks into a microphone and convert them to text, is commonly
used in many types of applications, such as command and control systems, personal
assistant systems, medical systems, assistive systems for people with disabilities, dictation systems, telephony
systems, and embedded applications. Due to its extensive use, interest in ASR technology
has surged among inventors and researchers alike. They have worked diligently to improve
the performance of ASR systems by developing techniques that target different
aspects, such as enhancing features, training acoustic models, enhancing language
modeling, and improving decoding. Many techniques have focused
on improving the accuracy of speech recognition in General Automatic Speech Recognition
(General ASR) systems, which are better known as Large Vocabulary Continuous Speech
Recognition (LVCSR). Some other approaches have focused on Wake-Up-Word ASR
systems (WUW ASR), which are similar to keyword spotting. One important aspect of
WUW ASR systems is the ability to recognize the specific word or phrase only when it is used in an
alerting context and not in others, such as referential contexts. For example, when a user
says "Car, show me the camera," the word "Car" is used
in an alerting context. In the
sentence "Every car should have a speaker," the same word is used in a referential context. It is difficult to determine, in real time, whether the
user is speaking to the car or about the car. In other words, the WUW ASR system should
be able to determine whether the user is addressing the recognizer. Most companies that
produce ASR systems have focused on improving speech recognition accuracy in General
ASR systems without improving it in WUW ASR
systems. Recently, WUW ASR systems have come to the forefront of speech recognition
with the advent of voice-assistant technologies such as Microsoft Cortana, Amazon Alexa,
Apple Siri, and Google Assistant. All of these companies have started to focus on WUW
ASR systems to improve the WUWs that activate their devices and applications for
interaction with users.
This dissertation focuses on the design and implementation of a complete ASR system that can
perform both WUW and General ASR with high accuracy. The new ASR system
is intended to solve one of the biggest problems that speech recognition technology faces:
how to discriminate between the use of a word or phrase in an alerting context and its use in a
referential context while still performing General ASR with high accuracy. With this
paradigm, the recognition accuracy of commands used to interact with machines, whether a single
word or an entire sentence, is expected to improve. Moreover, given
the increasing number of different speech commands, this model can reduce
the number of false alarms in the devices and applications that use speech commands.
Our study proposes an innovative, higher-accuracy ASR system that is capable of performing
both WUW and General ASR. To develop the new ASR system with high
accuracy, the following steps were carried out: (1) modifying the structure of the General ASR
system, (2) selecting the best platform to test the proposed ASR system, (3) simulating the
WUW and General Acoustic Models (AMs), and (4) designing a decision support mechanism.
Moreover, ASR system performance is significantly affected
by acoustic environmental conditions such as noise types, noise levels, speaker accents, and
microphone variability. To address these issues and test the proposed ASR system, the new ASR
system was trained and tested under different acoustic environmental conditions, such as
different background noise levels, noise types, speaker distances to the microphone,
and speakers. The results of our experiments showed that all stages of the proposed
ASR system performed well and that the new system was able to make a final
decision on whether a recognized word or phrase is a WUW or General speech, in each case
with 100% accuracy (Confidence Word 100%).