Band-Focused EdgeNeXt: A Lightweight Architecture for Tibetan Dialect Classification via Spectral Attention and Dual-Pooling Fusion
DOI:
https://doi.org/10.54097/dmmswy82

Keywords:
Tibetan dialect classification, SE Block, Dual-pooling fusion, Low-resource speech recognition, Edge computing

Abstract
This study improves Tibetan dialect classification in low-resource settings by introducing a frequency band-focused SE Block and GAP+GMP dual-pooling fusion. The SE Block dynamically weights spectrally critical features (e.g., the Ü-Tsang F2 formant) while suppressing noise. Dual-pooling fusion mitigates the feature smoothing of average pooling and the information loss of max pooling, and progressive stochastic depth improves generalization. On a 26,762-spectrogram dataset covering Ü-Tsang, Amdo, and Khams, the 5.8M-parameter model achieves 99.4% accuracy, surpassing EdgeNeXt, RepViT, and DilateFormer by 0.6%, 4.0%, and 0.5% respectively, and roughly halving misclassification rates. Tests at a signal-to-noise ratio of 5 dB confirm robustness for edge-computing deployment.
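The two mechanisms the abstract names can be illustrated concretely. Below is a minimal NumPy sketch of channel-wise squeeze-and-excitation followed by GAP+GMP dual-pooling over a spectrogram feature map; the tensor shapes, reduction ratio, and weight initialization are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """SE-style channel reweighting of a (C, F, T) spectrogram feature map.

    w1: (C, C//r) and w2: (C//r, C) are the excitation MLP weights
    (biases omitted for brevity); r is the channel reduction ratio.
    """
    squeeze = feat.mean(axis=(1, 2))                     # (C,) global average "squeeze"
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)   # (C,) channel gates in (0, 1)
    return feat * excite[:, None, None]                  # reweight each channel map

def dual_pool(feat):
    """GAP+GMP fusion: concatenate mean and max over the frequency/time axes."""
    gap = feat.mean(axis=(1, 2))   # smooth global statistics
    gmp = feat.max(axis=(1, 2))    # salient peak responses
    return np.concatenate([gap, gmp])  # (2C,) fused descriptor

rng = np.random.default_rng(0)
C, r = 8, 4
feat = rng.standard_normal((C, 16, 32))      # hypothetical backbone output
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out = dual_pool(se_block(feat, w1, w2))
print(out.shape)  # (16,)
```

The fused (2C,) descriptor would then feed the classification head, preserving both the averaged band statistics and the peak activations that distinguish dialect-specific formants.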
References
[1] Lim K S. The tonal and intonational phonology of Lhasa Tibetan[D]. Université d'Ottawa/University of Ottawa, 2018.
[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[3] Maaz M, Shaker A, Cholakkal H, et al. EdgeNeXt: efficiently amalgamated CNN-Transformer architecture for mobile vision applications[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 3-20.
[4] Howard A G. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[5] Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention[C]//International conference on machine learning. PMLR, 2021: 10347-10357.
[6] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1449-1457.
[7] Nirthika R, Manivannan S, Ramanan A, et al. Pooling in convolutional neural networks for medical image analysis: a survey and an empirical study[J]. Neural Computing and Applications, 2022, 34(7): 5321-5347.
[8] Pham H, Le Q. Autodropout: Learning dropout patterns to regularize deep networks[C]//Proceedings of the AAAI conference on artificial intelligence. 2021, 35(11): 9351-9359.
[9] Huang G, Sun Y, Liu Z, et al. Deep networks with stochastic depth[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016: 646-661.
[10] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141.
[11] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in neural information processing systems, 2020, 33: 12449-12460.
[12] Wang A, Chen H, Lin Z, et al. RepViT: Revisiting mobile CNN from ViT perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15909-15920.
[13] Jiao J, Tang Y M, Lin K Y, et al. DilateFormer: Multi-scale dilated transformer for visual recognition[J]. IEEE Transactions on Multimedia, 2023, 25: 8906-8919.
[14] Maaz M, Shaker A, Cholakkal H, et al. EdgeNeXt: efficiently amalgamated CNN-Transformer architecture for mobile vision applications[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 3-20.
License
Copyright (c) 2026 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.