Co-optimized Vision Transformer Deployment on Edge Devices: Algorithm-Hardware-Compiler 3D Evolution

Authors

  • Yifan Wu

DOI:

https://doi.org/10.54097/b7d7w798

Keywords:

ViT compression, MambaVision, PH-Reg, Collaborative optimization, Edge deployment

Abstract

Vision Transformer (ViT) models achieve strong performance on visual tasks through their attention mechanism, but their high computational complexity and memory requirements (e.g., ViT-Base requires 17.6 GFLOPs and more than 2 GB of FP32 inference memory at a 224 × 224 input) limit their deployment on resource-constrained edge devices. In this paper, we propose a collaborative optimization framework that combines algorithmic compression, hardware-aware acceleration, and compiler optimization, with a special focus on two prospective breakthrough technologies of 2025: the MambaVision hybrid architecture and PH-Reg dynamic robustness enhancement. Through these optimization methods, the framework reduces PackQViT latency to 12.3 ms, achieves 62 img/s throughput with DynamicViT, and maintains or improves on the 84.6% accuracy of ViT-Base (e.g., PackQViT reaches 85.2%). In addition, open challenges such as ultra-low-precision quantization generalization, dynamic architecture stability, cross-device collaboration, and the balance between privacy and energy efficiency are discussed.
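For context, the 17.6 GFLOPs figure quoted in the abstract can be reproduced from the published ViT-Base/16 hyperparameters (12 layers, hidden size 768, 4× MLP expansion, 16 × 16 patches, 197 tokens at a 224 × 224 input). The Python sketch below is a back-of-the-envelope estimate rather than the paper's measurement methodology: it counts one multiply-accumulate as one FLOP, as is conventional in the ViT literature, and ignores minor terms such as LayerNorm, softmax, biases, and the classification head.

    # Rough FLOPs estimate for ViT-Base/16 at 224 x 224 input.
    # Convention: one multiply-accumulate = one FLOP (standard in ViT papers).

    def vit_base_flops(image=224, patch=16, depth=12, dim=768, mlp_ratio=4):
        n = (image // patch) ** 2 + 1                       # 196 patch tokens + [CLS] = 197
        patch_embed = (n - 1) * dim * (3 * patch * patch)   # patch projection
        per_layer = (
            3 * n * dim * dim                               # Q, K, V projections
            + n * n * dim                                   # attention scores Q @ K^T
            + n * n * dim                                   # weighted sum A @ V
            + n * dim * dim                                 # attention output projection
            + 2 * n * dim * (mlp_ratio * dim)               # two-layer MLP block
        )
        return patch_embed + depth * per_layer

    print(f"{vit_base_flops() / 1e9:.1f} GFLOPs")           # prints: 17.6 GFLOPs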

References

[1] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[2] Rao, Y., Zhao, W., Liu, B., et al. (2021). DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34, 13937-13949.

[3] Capra, M., Bussolino, B., Marchisio, A., et al. (2020). Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead. IEEE Access, 8, 225134-225180.

[4] Liu, X., Peng, H., Zheng, N., et al. (2023). EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14420-14430.

[5] Dong, P., Lu, L., Wu, C., et al. (2023). PackQViT: Faster sub-8-bit vision transformers via full and packed quantization on the mobile. Advances in Neural Information Processing Systems, 36, 9015-9028.

[6] Hatamizadeh, A., & Kautz, J. (2025). MambaVision: A hybrid Mamba-Transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, 25261-25270.

[7] Chen, Y., Yan, Z., Zhou, C., et al. (2025). Vision transformers with self-distilled registers. arXiv preprint arXiv:2505.21501.

[8] Chen, T., Moreau, T., Jiang, Z., et al. (2018). TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578-594.

[9] Li, Z., Lu, A., Xie, Y., et al. (2024). Quasar-ViT: Hardware-oriented quantization-aware architecture search for vision transformers. In Proceedings of the 38th ACM International Conference on Supercomputing, 324-337.

[10] Bhalgat, Y., Lee, J., Nagel, M., et al. (2020). LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 696-697.

[11] Li, Z., Yang, T., Wang, P., et al. (2022). Q-ViT: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703.

[12] Mehta, S., & Rastegari, M. (2021). MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178.

[13] Jiang, H., Huang, S., Li, W., et al. (2022). ENNA: An efficient neural network accelerator design based on ADC-free compute-in-memory subarrays. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(1), 353-363.

[14] Li, Z., & Gu, Q. (2023). I-ViT: Integer-only quantization for efficient vision transformer inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 17065-17075.

Published

29-08-2025

Issue

Vol. 18 No. 1 (2025)

Section

Articles

How to Cite

Wu, Y. (2025). Co-optimized Vision Transformer Deployment on Edge Devices: Algorithm-Hardware-Compiler 3D Evolution. Journal of Computing and Electronic Information Management, 18(1), 36-39. https://doi.org/10.54097/b7d7w798