Automatic Speech Recognition Adaptation for Various Noise Levels
Abdulaziz, Azhar Sabah
Automatic speech recognition (ASR) is a set of complex algorithms that convert a spoken utterance into text. Acoustic features extracted from the speech signal are matched against a trained network of linguistic and acoustic models. ASR performance degrades significantly when the ambient noise differs from that of the training data. Many approaches have been introduced to address this problem, with varying degrees of complexity and improvement. They generally fall into three categories: strengthening the features, training a general acoustic model, and transforming the models to match the noisy features. Acoustic noise is added to the training speech data after collection for two reasons: first, the data are usually recorded in one specific environment, and second, adding noise afterwards allows the environments to be controlled during the training and testing phases. The speech and noise signals are usually combined in the electrical domain by straightforward linear addition. Although this procedure is commonly used, it is investigated in depth in this research. It is shown that linear addition is only an approximation of the real acoustic combination, and that it is valid if the speech and noise are non-coherent signals. The adaptive model switching (AMS) solution is proposed, in which the ASR measures the noise level and then picks the model expected to produce the fewest errors. This solution is a trade-off between model generalization and model transformation, so that both the error and speed costs are kept as low as possible. The short time of silence (STS), a signal-to-noise ratio (SNR) level detector, was designed specifically for the proposed system. The proposed AMS approach is a general recipe that could be applied to other ASR systems, although it was tested on a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) recognizer.
The AMS ASR outperformed both model generalization and multiple-decoder maximum-score voting in accuracy and decoding speed. The average error rate reduction was around 34.11%, with a relative decoding speed improvement of about 37.79%, both compared to the baseline ASR.
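The pipeline described above (linear noise addition at a controlled SNR, SNR detection from a short silence, then model switching) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not specify how the STS detector or the model selection rule works, so the sketch assumes that the first samples of an utterance contain noise only, that speech and noise are non-coherent (so their powers add), and that the model trained at the nearest SNR is chosen. All function names and the list of training SNRs are hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain(speech, noise, snr_db):
    """Gain that scales `noise` so that speech/noise power equals snr_db."""
    return math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))


def mix_with_leading_silence(speech, noise, snr_db, silence_len):
    """Linear addition: noise-only lead-in, then speech plus scaled noise."""
    gain = noise_gain(speech, noise, snr_db)
    padded = [0.0] * silence_len + list(speech)
    return [s + gain * n for s, n in zip(padded, noise)]


def estimate_snr_db(signal, silence_len):
    """Assumed detector: noise power from the leading silence segment.

    Subtracting powers is valid only if speech and noise are non-coherent.
    """
    noise_power = power(signal[:silence_len])
    speech_power = max(power(signal[silence_len:]) - noise_power, 1e-12)
    return 10 * math.log10(speech_power / noise_power)


def pick_model(snr_db, model_snrs):
    """Assumed switching rule: model trained at the SNR nearest to snr_db."""
    return min(model_snrs, key=lambda m: abs(m - snr_db))


# Hypothetical demo: a 440 Hz tone stands in for speech, Gaussian noise
# for the ambient noise, and models are assumed to exist at 0..20 dB.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
ambient = [rng.gauss(0.0, 1.0) for _ in range(20000)]
utterance = mix_with_leading_silence(tone, ambient, snr_db=10, silence_len=4000)

measured = estimate_snr_db(utterance, silence_len=4000)
chosen = pick_model(measured, [0, 5, 10, 15, 20])
print(f"measured SNR: {measured:.1f} dB, chosen model: {chosen} dB")
```

Selecting a single model before decoding is what distinguishes this scheme from running several decoders in parallel and voting on the best score; only one decoding pass is needed per utterance.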