Dual-Regularized 1D-CNN with MFCC Frame-Mean Features for Modular Speech Emotion Recognition
DOI: https://doi.org/10.54097/5wkzhc71

Keywords: Emotion Recognition, CNN, MFCC, Signal Preprocessing

Abstract
As human-computer interaction (HCI) becomes increasingly intelligent, speech emotion recognition, a core task in natural language processing (NLP), plays a crucial role in improving user experience. This study proposes and implements a speech emotion recognition system based on a convolutional neural network (CNN), addressing the low accuracy and weak generalization of existing techniques when recognizing emotional cues in everyday natural speech. The system is built on a dataset of 1,200 speech samples covering six typical emotions (anger, fear, happiness, neutrality, sadness, and surprise) simulated by four actors (two male, two female), with 200 samples per emotion. Developed in Python, the system integrates speech signal processing and deep learning to cover the full pipeline from speech data preprocessing to emotion prediction. The workflow comprises: batch reading of speech files and annotation of emotion labels via recursive directory traversal; suppression of noise interference through preprocessing steps such as pre-emphasis, framing, and windowing; extraction of Mel-frequency cepstral coefficients (MFCC) as the core carrier of emotional features; construction and optimization of the CNN model, with L2 regularization and Dropout introduced to suppress overfitting; and tuning of training parameters (e.g., number of epochs, learning rate) over multiple rounds of experiments. Experimental results show that the optimized system achieves an emotion prediction accuracy of 98.50% on the training set and 96.90% on the test set, with stable classification performance across emotion categories. It effectively meets the practical requirements of speech emotion recognition and provides reliable technical support for scenarios such as intelligent customer service and in-vehicle interaction.
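The abstract does not include the implementation, but the pipeline it describes maps naturally onto standard Python tooling. Below is a minimal sketch, assuming librosa for pre-emphasis and MFCC extraction and Keras for the model; the sampling rate, frame parameters, filter counts, and regularization strengths are illustrative assumptions, not the authors' reported configuration.

```python
# Sketch of the described pipeline: pre-emphasis -> framing/windowing ->
# MFCC frame-mean features -> dual-regularized (L2 + Dropout) 1D-CNN.
# All hyperparameter values here are assumptions for illustration.
import numpy as np
import librosa
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def extract_mfcc_frame_mean(path, sr=16000, n_mfcc=40):
    """Load a clip, apply pre-emphasis, and average MFCCs over frames."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y)  # pre-emphasis filter (default coef 0.97)
    # librosa performs framing and windowing internally (Hann window by default)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)
    return mfcc.mean(axis=1)            # frame-mean feature vector, shape (n_mfcc,)

def build_model(n_features=40, n_classes=6, l2=1e-3, dropout=0.3):
    """1D-CNN with the dual regularization (L2 + Dropout) named in the abstract."""
    return keras.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, 5, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.MaxPooling1D(2),
        layers.Dropout(dropout),
        layers.Conv1D(128, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

To train, the frame-mean vectors would be stacked and given a trailing channel axis, e.g. `X = np.stack(features)[..., np.newaxis]`, so each sample matches the `(n_features, 1)` input shape expected by `Conv1D`.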
Copyright (c) 2026 Journal of Computing and Electronic Information Management. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.