A Robust Speaker Identification Algorithm Based on Atomic Decomposition and Sparse Redundant Dictionary Learning
Bryan Jr., Thomas James
Speaker identification is performed in environments with high levels of additive white Gaussian noise by processing sounds as images in the time-frequency plane. The technique creates audio streams that remove background noise from speech while preserving speech quality. Atomic decomposition is implemented by the matching pursuit algorithm using Gabor or gammatone atomic dictionaries. Gabor atoms were originally proposed as fundamental building blocks of speech, whereas gammatone atoms closely resemble the human cochlear impulse response. Atomic decomposition creates a sparsely populated time-frequency vector referred to as weight space. The populated positions in weight space have amplitude weights that are proportional to localized energy in time and frequency. The weight space vector, combined with the atomic dictionary, represents a concise, denoised, compressed version of the original signal. Custom atomic dictionaries, called basis vectors, are learned from envelope samples and have superior denoising and compression characteristics compared with the Gabor or gammatone atomic dictionaries. Unsupervised feature learning by a sparse autoencoder learns basis vectors from Gaussian or gamma envelope samples. Envelope sampling generates audio patches from speech that have either Gaussian or gamma time windows. Speaker identification is performed in weight space by taking histograms of the energy distribution for each basis vector over the span of a single sentence. Delta time differences are computed for the two highest energy basis vectors. Normalized time difference histograms are created for the two highest energy basis vectors during training and testing. Speaker identification is performed by finding the smallest Euclidean distance between the training and testing normalized time difference histograms. The method represents a simple pitch tracking heuristic that is shown to be robust in the presence of additive white Gaussian noise.
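As an illustration of the atomic decomposition step, the following is a minimal sketch of matching pursuit over a small Gabor dictionary. It is not the implementation used in this work: the function names (`gabor_atom`, `build_dictionary`, `matching_pursuit`), the atom parameterization (a Gaussian window multiplying a cosine), and the grid of centers, frequencies, and window width are all assumptions chosen for readability, and the code uses only the Python standard library.

```python
import math


def gabor_atom(n, center, freq, width):
    """Return a unit-norm real Gabor atom of length n (hypothetical parameterization).

    center: sample index of the Gaussian window peak
    freq:   frequency in cycles per sample
    width:  standard deviation of the Gaussian window, in samples
    """
    atom = [
        math.exp(-0.5 * ((t - center) / width) ** 2)
        * math.cos(2.0 * math.pi * freq * (t - center))
        for t in range(n)
    ]
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]


def build_dictionary(n, centers, freqs, width):
    """Build a dictionary with one atom per (center, freq) pair, center-major order."""
    return [gabor_atom(n, c, f, width) for c in centers for f in freqs]


def matching_pursuit(signal, dictionary, n_iter):
    """Greedy matching pursuit.

    Returns (weights, residual), where weights[k] is the accumulated
    coefficient of dictionary[k]. This sparse vector plays the role of
    the "weight space" described in the text.
    """
    residual = list(signal)
    weights = [0.0] * len(dictionary)
    for _ in range(n_iter):
        # Correlate the current residual with every atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        # Select the atom that explains the most residual energy.
        k = max(range(len(corr)), key=lambda i: abs(corr[i]))
        weights[k] += corr[k]
        # Subtract that atom's contribution from the residual.
        residual = [r - corr[k] * a for r, a in zip(residual, dictionary[k])]
    return weights, residual


# Toy example: a signal built from two known atoms of the dictionary.
N = 256
dictionary = build_dictionary(N, centers=[64, 128, 192], freqs=[0.05, 0.1, 0.2], width=10.0)
signal = [2.0 * a + 0.5 * b for a, b in zip(dictionary[1], dictionary[8])]
weights, residual = matching_pursuit(signal, dictionary, n_iter=5)
active = [(k, round(w, 3)) for k, w in enumerate(weights) if abs(w) > 1e-6]
print("active atoms (index, weight):", active)
```

Because the atoms have unit norm, each iteration removes the squared correlation of the selected atom from the residual energy, so the residual energy never increases. White noise correlates only weakly with any single atom, which is one intuition for why keeping only the largest weights acts as a denoiser.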
The algorithm is targeted at low-power embedded devices such as hearing aids and is designed to perform speaker identification in high noise settings by learning basis vectors from pairs of speakers. The algorithm uses three passes of matching pursuit during training and two passes during testing. The first pass in both training and testing performs audio time segmentation using Gabor atoms to generate audio snippets. The second training pass extracts Gaussian envelope samples that are used by a sparse autoencoder to learn basis vectors from pairs of speakers. The third training pass and the second testing pass decompose the data with the basis vectors. Training is done with two sentences from the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus, which results in a total training sample size of 2-4 seconds. Testing is done on individual TIMIT sentences that average 1-2 seconds in duration. Sixteen TIMIT sentences are used to test each pair of speakers. Using pairs of speakers from the TIMIT database, the classifier achieves a sentence-level speaker identification accuracy of 93% that does not degrade at signal-to-noise ratios of 30 dB, 20 dB, 10 dB, 5 dB, and 0 dB.
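The noise conditions listed above can be reproduced in a generic way by scaling white Gaussian noise to a target signal-to-noise ratio. The sketch below shows one common way to do this, not necessarily the procedure used in this work. The helper names `add_awgn` and `measured_snr_db` are hypothetical, the 200 Hz test tone and 16 kHz sample rate are placeholder assumptions standing in for a speech recording, and the SNR is defined over the whole signal rather than over speech-active segments only.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Return signal plus white Gaussian noise at the requested SNR in dB.

    The noise variance is set from the mean power of the whole signal
    (an assumption; other definitions use speech-active frames only).
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean signal and the noisy copy."""
    signal_energy = sum(x * x for x in clean)
    noise_energy = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)


# Placeholder "speech": one second of a 200 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
clean = [math.sin(2.0 * math.pi * 200.0 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

rng = random.Random(0)  # fixed seed so the run is repeatable
for target in (30, 20, 10, 5, 0):
    noisy = add_awgn(clean, target, rng)
    print(f"target {target:>2} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

At 0 dB the noise carries as much power as the signal itself, which gives a sense of how demanding the lowest test condition is.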