Vision Transformers (ViTs): A New Era in Computer Vision – A Review
DOI: https://doi.org/10.54097/41t0vx90

Keywords: Vision Transformer (ViT), Computer Vision, Image Classification, Model Optimization, Multimodal Learning

Abstract
Vision Transformers (ViTs) have emerged as a strong alternative to Convolutional Neural Networks (CNNs) in computer vision, offering a new way to learn global dependencies through self-attention. This survey provides an in-depth analysis of the development, applications, optimization, and deployment challenges of ViT models. We begin by reviewing the evolution of ViTs from the base architecture to more recent variants, including hybrids with CNNs and models with multi-scale attention. We then examine applications of ViTs such as image classification, object detection, segmentation, depth estimation, medical image analysis, and industrial visual inspection. Methods for improving ViT efficiency, such as model pruning and quantization, hybridization with CNNs, and dynamic adaptation, are discussed in depth. ViTs nevertheless face significant limitations, including computational complexity and scaling and data challenges. Potential solutions and future directions are addressed, such as deployment on edge devices and integration into multimodal learning systems. Synthesizing knowledge from recent literature, this paper provides a comprehensive overview of emerging trends and the six paradigms that currently exist for ViTs in computer vision.
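To make the patch-embedding-plus-self-attention pipeline described above concrete, the following is a minimal PyTorch sketch of a ViT-style classifier. It is illustrative only: the class name TinyViT and all hyperparameters (patch size, depth, width) are our own choices, not the configuration of any model surveyed here.

# Minimal ViT-style classifier sketch (PyTorch). Names and
# hyperparameters are illustrative, not from any surveyed model.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, dim=64,
                 depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch embedding: a strided conv splits the
        # image into patches and projects each one to a `dim`-d token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Standard transformer encoder: self-attention lets every patch
        # token attend to every other, capturing global dependencies.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                 # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])               # classify from [CLS]

logits = TinyViT()(torch.randn(2, 3, 32, 32))        # -> shape (2, 10)
print(logits.shape)

Efficiency techniques from the abstract can be applied to such a model after training; for instance, PyTorch's built-in torch.ao.quantization.quantize_dynamic can quantize the linear layers to int8, one simple form of the quantization approaches the survey discusses.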
License
Copyright (c) 2025 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.