Improving Wake-Up-Word and General Speech Recognition Systems
Bohouta, Gamal Mohamed
Automatic Speech Recognition (ASR), a technology that allows a machine to recognize utterances spoken into a microphone and convert them to text, is commonly used in many types of applications, such as command and control systems, personal assistant systems, medical systems, systems for people with disabilities, dictation systems, telephony systems, and embedded applications. Due to its extensive use, interest in ASR technology has surged among inventors and researchers alike. They have worked to improve the performance of ASR systems by developing techniques in several areas, such as enhancing features, training acoustic models, enhancing language modeling, and improving decoding. Many techniques have focused on improving the accuracy of General Automatic Speech Recognition (General ASR) systems, better known as Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Other approaches have focused on Wake-Up-Word ASR (WUW ASR) systems, which are similar to keyword spotting. One important aspect of WUW ASR systems is the ability to recognize a specific word or phrase only when it is used in an alerting context and not in others, such as referential contexts. For example, in the sentence "Car, show me the camera", the word "Car" is used in an alerting context. In the sentence "Every car should have a speaker", the word "car" is used in a referential context. It is difficult to determine, in real time, whether the user is speaking to the car or about the car. In other words, the WUW ASR system should be able to determine whether or not the user is speaking to the recognizer. Most companies that produce ASR systems have focused on improving recognition accuracy in General ASR systems without improving it in WUW ASR systems.
Recently, WUW ASR systems have come to the forefront of speech recognition with the advent of voice-assistant technologies such as Microsoft Cortana, Amazon Alexa, Apple Siri, and Google Assistant. All of these companies have started to focus on WUW ASR systems to improve the WUWs that activate their devices and applications for interaction with users. This dissertation focuses on the design and implementation of a complete ASR system that can perform both WUW and General ASR with high accuracy. The new ASR system addresses one of the biggest problems that speech recognition technology faces: how to discriminate between the use of a word or phrase in an alerting context versus a referential context while still performing General ASR with high accuracy. With this paradigm, the recognition accuracy of the commands used to interact with machines, whether a single word or an entire sentence, will improve. Moreover, given the increasing number of different speech commands, this model will be able to reduce the number of false alarms in devices and applications that use speech commands. Our study proposes an innovative, higher-accuracy ASR system that is capable of working as both a WUW and a General ASR system. To develop the new ASR system, the following steps were carried out: (1) modifying the structure of the General ASR system, (2) selecting the best platform to test the proposed ASR system, (3) simulating the WUW and General Acoustic Models (AMs), and (4) designing a decision support component. Moreover, ASR system accuracy is significantly affected by acoustic environmental conditions such as noise types, noise levels, speaker accents, and microphone variability.
To address these issues and test the proposed ASR system, the system was trained and tested under different acoustic environmental conditions, such as different background noise levels, noise types, speaker distances to the microphone, and speakers. The results of our experiment showed that all stages of the proposed ASR system performed well, and that the new system was able to make a final decision on whether the recognized word or phrase is a WUW with 100% accuracy (Confidence Word 100%) or General with 100% accuracy (Confidence Word 100%).