Band-Focused EdgeNeXt: A Lightweight Architecture for Tibetan Dialect Classification via Spectral Attention and Dual-Pooling Fusion
DOI:
https://doi.org/10.54097/dmmswy82

Keywords:
Tibetan dialect classification, SE Block, Dual-pooling fusion, Low-resource speech recognition, Edge computing

Abstract
This study improves Tibetan dialect classification in low-resource settings by introducing a frequency band-focused SE Block and GAP+GMP dual-pooling fusion. The SE Block dynamically weights spectrally critical features (e.g., the Ü-Tsang F2 formant) while suppressing noise. Dual-pooling fusion mitigates the feature smoothing of average pooling and the information loss of max pooling, and progressive stochastic depth improves generalization. On a 26,762-spectrogram dataset covering Ü-Tsang, Amdo, and Khams, the 5.8M-parameter model achieves 99.4% accuracy, surpassing EdgeNeXt, RepViT, and DilateFormer by 0.6%, 4.0%, and 0.5% respectively, and roughly halving misclassification rates. Tests at a signal-to-noise ratio of 5 dB confirm robustness for edge-computing deployment.
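The two mechanisms the abstract names can be illustrated concretely. Below is a minimal NumPy sketch of channel-wise squeeze-and-excitation followed by GAP+GMP dual-pooling over a spectrogram feature map; the tensor shapes, reduction ratio, and weight initialization are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """SE-style channel reweighting of a (C, F, T) spectrogram feature map.

    w1: (C, C//r) and w2: (C//r, C) are the excitation MLP weights
    (biases omitted for brevity); r is the channel reduction ratio.
    """
    squeeze = feat.mean(axis=(1, 2))                     # (C,) global average "squeeze"
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)   # (C,) channel gates in (0, 1)
    return feat * excite[:, None, None]                  # reweight each channel map

def dual_pool(feat):
    """GAP+GMP fusion: concatenate mean and max over the frequency/time axes."""
    gap = feat.mean(axis=(1, 2))   # smooth global statistics
    gmp = feat.max(axis=(1, 2))    # salient peak responses
    return np.concatenate([gap, gmp])  # (2C,) fused descriptor

rng = np.random.default_rng(0)
C, r = 8, 4
feat = rng.standard_normal((C, 16, 32))      # hypothetical backbone output
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out = dual_pool(se_block(feat, w1, w2))
print(out.shape)  # (16,)
```

The fused (2C,) descriptor would then feed the classification head, preserving both the averaged band statistics and the peak activations that distinguish dialect-specific formants.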
References
[1] Lim K S. The tonal and intonational phonology of Lhasa Tibetan[D]. Université d'Ottawa/University of Ottawa, 2018.
[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[3] Maaz M, Shaker A, Cholakkal H, et al. EdgeNeXt: efficiently amalgamated CNN-Transformer architecture for mobile vision applications[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 3-20.
[4] Howard A G. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[5] Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention[C]//International conference on machine learning. PMLR, 2021: 10347-10357.
[6] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1449-1457.
[7] Nirthika R, Manivannan S, Ramanan A, et al. Pooling in convolutional neural networks for medical image analysis: a survey and an empirical study[J]. Neural Computing and Applications, 2022, 34(7): 5321-5347.
[8] Pham H, Le Q. Autodropout: Learning dropout patterns to regularize deep networks[C]//Proceedings of the AAAI conference on artificial intelligence. 2021, 35(11): 9351-9359.
[9] Huang G, Sun Y, Liu Z, et al. Deep networks with stochastic depth[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016: 646-661.
[10] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141.
[11] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in neural information processing systems, 2020, 33: 12449-12460.
[12] Wang A, Chen H, Lin Z, et al. RepViT: Revisiting mobile CNN from ViT perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15909-15920.
[13] Jiao J, Tang Y M, Lin K Y, et al. DilateFormer: Multi-scale dilated transformer for visual recognition[J]. IEEE Transactions on Multimedia, 2023, 25: 8906-8919.
[14] Maaz M, Shaker A, Cholakkal H, et al. EdgeNeXt: efficiently amalgamated CNN-Transformer architecture for mobile vision applications[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 3-20.
License
Copyright (c) 2026 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.