Speech Emotion Recognition using Connectionist Models in a Tandem System
This article presents a tandem Speech Emotion Recognition (SER) system that differentiates eight archetypal emotions based on two types of acoustic features fed as inputs to Artificial Neural Network (ANN) models. The two feature types capture the degree of excitement and the pleasantness of speech. We incorporate the temporal characteristics of speech into the feature extraction method by tracking the trend of local features over time. Thus, two global features are proposed, derived from Teager Energy Operator (TEO)-based features (TEOg) and from spectral features, Mel-Frequency Cepstral Coefficients (MFCCg). We establish a tandem system of two hierarchies that follows a cognitive model: the first separates emotions by the amount of stress in the voice, using Teager Energy Operator-Critical Band-Autocorrelation-Envelope (TEO-CB-Auto-Env) features, and the second by the pleasantness of the emotion, using MFCC features. We also propose a baseline built on the same feature vectors and compare it against the tandem system to demonstrate the superiority of the proposed tandem system over non-hierarchical systems. Moreover, we compare our results with the recognition rates reported in several of the cited articles. Additionally, inspired by the cognitive model, we define a hybrid tandem system in which the first hierarchy receives TEOg as input to its classifier, while the two models in the second hierarchy receive the MFCCg features at their input layers. This hybrid system is compared with a tandem system that uses only MFCCg feature vectors in both hierarchies, in terms of effectiveness and efficiency. Our experiments show that the hybrid system is more efficient, whereas the MFCCg-only tandem system achieves a higher recognition rate.
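The global features above summarize how frame-level (local) features evolve over an utterance. The abstract does not specify the exact statistics, so the following is a minimal sketch under an assumed choice of summaries (per-coefficient mean, standard deviation, and linear trend), applied to a matrix of frame-level coefficients such as MFCCs:

```python
import numpy as np

def global_features(frame_features):
    """Collapse a (frames x coeffs) matrix of local features into one
    global vector by summarizing each coefficient's trajectory over time.

    The statistics used here (mean, std, least-squares slope) are an
    assumption; the slope captures the temporal trend of each coefficient.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    t = np.arange(frame_features.shape[0])
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    # Degree-1 polyfit per coefficient: row 0 holds the slopes (trends).
    slope = np.polyfit(t, frame_features, 1)[0]
    return np.concatenate([mean, std, slope])

# e.g. 120 frames of 13 MFCCs -> one 39-dimensional global vector
mfcc_frames = np.random.default_rng(0).normal(size=(120, 13))
vec = global_features(mfcc_frames)
```

The same summarization could be applied to frame-level TEO-based features to obtain a TEOg vector of fixed length, independent of utterance duration.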
In our system, we use a binary-class Multi-Layer Perceptron (MLP) for the first hierarchy and two multi-class MLPs for the second. Considering only the audio modality, classification is performed on three emotion-based datasets: Surrey Audio-Visual Expressed Emotion (SAVEE), Berlin Database of Emotional Speech (Emo-DB), and the eNTERFACE Audio-Visual Emotion Database (eNTERFACE). The systems are speaker- and gender-independent, and we evaluate them using Unweighted Accuracy (UA). At its best, our tandem system given only MFCCg achieves recognition rates of 77.26%, 71.42%, and 66.49% on the Emo-DB, SAVEE, and eNTERFACE datasets, respectively, whereas with the hybrid features (second best) the corresponding rates are 75.067%, 67.596%, and 65.197%.
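The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, label encodings, and MLP hyperparameters are assumptions, and synthetic data stands in for the real corpora. Stage 1 is a binary MLP that splits utterances by vocal stress from TEOg-style input; stage 2 routes the utterance to one of two multi-class MLPs operating on MFCCg-style input:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two feature types; all dimensions are assumed.
n = 200
teo_g = rng.normal(size=(n, 20))     # TEOg vectors (stage-1 input)
mfcc_g = rng.normal(size=(n, 39))    # MFCCg vectors (stage-2 input)
stress = rng.integers(0, 2, size=n)  # binary stress label (hierarchy 1)
emotion = rng.integers(0, 4, size=n) # emotion label within a branch

# Stage 1: binary MLP separates utterances by the amount of vocal stress.
stage1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
stage1.fit(teo_g, stress)

# Stage 2: one multi-class MLP per branch refines the final emotion.
branches = {}
for s in (0, 1):
    idx = stress == s
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(mfcc_g[idx], emotion[idx])
    branches[s] = clf

def predict(teo_vec, mfcc_vec):
    """Route through stage 1, then classify in the selected stage-2 branch."""
    s = int(stage1.predict(teo_vec.reshape(1, -1))[0])
    return int(branches[s].predict(mfcc_vec.reshape(1, -1))[0])

pred = predict(teo_g[0], mfcc_g[0])
```

An MFCCg-only variant of this sketch would simply feed `mfcc_g` to stage 1 as well; the hybrid routing shown here trades some accuracy for a cheaper first stage, matching the efficiency/accuracy trade-off reported above.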