Additive Noise Subtraction for Environmental Noise in Speech Recognition
Alrouqi, Noof Nour
Nowadays, technology encourages people to communicate by speech, for example through quick voice notes in cell phone applications. Real-world stationary white noise from the environment degrades speech intelligibility, impairs voice production, makes speech recordings unclear, and yields a disordered voice whose problematic characteristics stem from that noise. This thesis addresses noise reduction and its major challenges, techniques, and evaluation methods, and it investigates the related literature through a systematic classification and analysis of selected papers. On that basis it proposes a framework based on additive noise subtraction for environmental-noise speech recognition, called the ANSESR framework, which is supported by an automated tool for estimating and reducing environmental noise and producing high-quality speech signals. The framework contains four main sequential stages. First, the speech signal preprocessing stage uses the Hamming window technique. Second, the speech enhancement stage uses the spectral subtraction (SS) technique and an additive white Gaussian noise (AWGN) channel. Third, the feature extraction stage uses the Mel frequency cepstral coefficient (MFCC) technique. Fourth, the template matching stage uses the dynamic time warping (DTW) technique. In the experiment, the noisy environmental input signal is processed under several assumed levels of the signal-to-noise ratio (SNR). The input signal is segmented into short frames to cope with very short sounds, producing denoised signals, AWGN signals at different SNR levels, and enhanced signals. In the feature extraction stage, the short-term speech analysis is carried out in the discrete cosine transform (DCT) domain. The framework is validated with a single mean template based on the DTW technique and achieves an average recognition accuracy of 98.5179%.
Thus, the results show optimal matching paths, with mean scores that correspond to high recognition rates for utterances of the disordered speech.
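
As a rough illustration of three of the stages named in the abstract (AWGN at a chosen SNR, Hamming-windowed spectral subtraction, and DTW matching), the following is a minimal sketch in Python with NumPy. It is not the ANSESR tool itself: the function names, the frame length, hop size, number of noise-only frames, and the spectral floor are all assumptions chosen for illustration, and the MFCC stage is omitted for brevity. The sketch also assumes that the first few frames of the recording contain noise only.

```python
import numpy as np

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def spectral_subtraction(noisy, frame_len=256, hop=128, noise_frames=5, floor=0.01):
    """Basic magnitude spectral subtraction on Hamming-windowed frames.

    Assumption: the first `noise_frames` frames contain noise only.
    All default parameter values are illustrative, not taken from the thesis.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude spectrum from the leading noise-only frames.
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Subtract the estimate, keeping a small spectral floor to avoid negatives.
    clean_mag = np.maximum(magnitude - noise_mag, floor * magnitude)
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)

    # Weighted overlap-add resynthesis, normalised by the summed squared window.
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(n_frames):
        start = i * hop
        out[start:start + frame_len] += clean_frames[i] * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def dtw_distance(a, b):
    """Classic DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean local cost
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 8000  # hypothetical sampling rate
    t = np.arange(int(0.8 * fs)) / fs
    # Synthetic test signal: 0.2 s of silence followed by a 440 Hz tone.
    clean = np.concatenate([np.zeros(int(0.2 * fs)), np.sin(2 * np.pi * 440 * t)])
    noisy = add_awgn(clean, snr_db=5.0, rng=rng)
    enhanced = spectral_subtraction(noisy)
    print("MSE noisy:   ", np.mean((noisy - clean) ** 2))
    print("MSE enhanced:", np.mean((enhanced - clean) ** 2))
```

In a full pipeline, `dtw_distance` would compare the MFCC sequence of an enhanced utterance against a stored template, and the template with the smallest cost would be taken as the recognized word; here it accepts any pair of frame-wise feature arrays.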