SDAFormer: A Semantic-Guided and Detail-Aware Transformer for Apple Counting in Complex Orchards

Authors

  • Chenyu Zhu

DOI:

https://doi.org/10.54097/new4re42

Keywords:

Apple counting, Density map estimation, Semantic guidance, Transformer

Abstract

Accurate apple counting is crucial for orchard yield estimation and automated management. In complex natural orchard settings, however, scale variation, fruit occlusion, and background interference pose significant challenges to existing counting methods: mainstream models often struggle to balance global contextual information with local fine-grained features, leading to inaccurate counts in such regions and difficulty in distinguishing fruits from cluttered backgrounds. To address the vulnerability of shallow detail features to interference, and the insufficient coordination between high-level semantics and local structure that apple targets exhibit under varying scales and occlusion in real orchard scenes, this paper proposes a semantic-guided and detail-aware Transformer-based apple counting method, named SDAFormer. The method uses a Semantic-Aware Detail Refinement Module (SADRM) to explicitly inject deep semantic information into shallow edge, texture, and local structural features, enhancing the feature completeness and discriminative power of occluded apple regions. A Coordinate-Aware Multi-scale Module (CAMM) then strengthens position awareness and multi-scale context modeling in the density map regression stage, improving counting stability under scale variation and partial occlusion. Experimental results demonstrate that the method achieves superior counting performance on a self-built apple dataset, with a Mean Absolute Error (MAE) of 3.61 and a Mean Squared Error (MSE) of 4.76.
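In density-map-based counting (the paradigm SDAFormer follows), the predicted count is the integral of the regressed density map, and models are compared by MAE and MSE over the test images. As a minimal sketch of that evaluation protocol — the function names are illustrative, not from the paper, and note that counting papers sometimes report the root of the mean squared error under the name "MSE" — the computation looks like:

```python
import numpy as np

def count_from_density(density_map):
    """A density map integrates to the object count:
    summing all pixel values gives the predicted number of apples."""
    return float(np.sum(density_map))

def counting_errors(pred_counts, gt_counts):
    """MAE and MSE between per-image predicted and ground-truth counts.
    The rooted variant is also returned, since some counting papers
    report RMSE under the name 'MSE'."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = float(np.mean(np.abs(pred - gt)))
    mse = float(np.mean((pred - gt) ** 2))
    rmse = float(np.sqrt(mse))
    return mae, mse, rmse

# Example: a uniform 4x4 density map with value 0.5 integrates to 8 apples.
pred = count_from_density(np.full((4, 4), 0.5))  # -> 8.0
```

This is why density regression degrades gracefully under occlusion compared with detection-based counting: a partially visible apple still contributes its full unit mass to the integral, rather than needing to survive a detection threshold.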


References

[1] Villacrés J, Viscaino M, Delpiano J, et al. Apple orchard production estimation using deep learning strategies: a comparison of tracking-by-detection algorithms[J]. Computers and Electronics in Agriculture, 2023, 204: 107513. DOI:10.1016/j.compag.2022.107513.

[2] He L, Fang W, Zhao G, et al. Fruit yield prediction and estimation in orchards: a state-of-the-art comprehensive review for both direct and indirect methods[J]. Computers and Electronics in Agriculture, 2022, 195: 106812. DOI:10.1016/j.compag.2022.106812.

[3] Schmitz C, Zimmermann L, Schiffers K, et al. ProbApple: a probabilistic model to forecast apple yield and quality[J]. Agricultural Systems, 2025, 208: 104298. DOI:10.1016/j.agsy.2025.104298.

[4] Chen S, Zhang S, Li H, et al. Optimizing irrigation and nitrogen management enhances apple yield and quality through improving soil quality on the Loess Plateau[J]. Plant and Soil, 2025, 489(1): 255-271. DOI:10.1007/s11104-025-07712-z.

[5] Ahmed D, Sapkota R, Churuvija M, et al. Machine vision-based crop-load estimation using YOLOv8[EB/OL]. arXiv preprint: arXiv:2304.13282, 2023. DOI:10.48550/arXiv.2304.13282.

[6] Bhusal S, Bhattarai U, Karkee M. Trellis wire detection for obstacle avoidance in apple orchards[J]. IFAC-PapersOnLine, 2022, 55(32): 72-77. DOI:10.1016/j.ifacol.2022.11.117.

[7] Rong J, Zhang H, Zhou F, et al. Tomato cluster detection and counting using improved YOLOv5 based on RGB-D fusion[J]. Computers and Electronics in Agriculture, 2023, 207: 107741. DOI:10.1016/j.compag.2023.107741.

[8] Yu X, Wang Y, An D, et al. Counting method for cultured fishes based on multi-modules and attention mechanism[J]. Aquacultural Engineering, 2022, 96: 102215. DOI:10.1016/j.aquaeng.2021.102215.

[9] Wu Z, Sun X, Jiang H, et al. NDMFCS: an automatic fruit counting system in modern apple orchard using abatement of abnormal fruit detection[J]. Computers and Electronics in Agriculture, 2023, 211: 108036. DOI:10.1016/j.compag.2023.108036.

[10] Yan Z, Wu Y, Zhao W, et al. Research on an apple recognition and yield estimation model based on the fusion of improved YOLOv11 and DeepSORT[J]. Agriculture, 2025, 15(7): 765. DOI:10.3390/agriculture15070765.

[11] Sapkota R, Meng Z, Churuvija M, et al. Comprehensive performance evaluation of YOLOv12, YOLO11, YOLOv10, YOLOv9 and YOLOv8 on detecting and counting fruitlet in complex orchard environments[EB/OL]. arXiv preprint: arXiv:2407.12040, 2024. DOI:10.48550/arXiv.2407.12040.

[12] Cao D, Luo W, Tang R, et al. Research on apple detection and tracking count in complex scenes based on the improved YOLOv7-Tiny-PDE[J]. Agriculture, 2025, 15(5): 483. DOI:10.3390/agriculture15050483.

[13] Häni N, Roy P, Isler V. Apple counting using convolutional neural networks[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. 2018: 2559-2565. DOI:10.1109/IROS.2018.8594304.

[14] Wang Q, Nuske S, Bergerman M, et al. Detection and localization of overlapped fruits: application in an apple harvesting robot[J]. Electronics, 2020, 9(6): 1023.

[15] Zhang S, Wu X, You Z, et al. A method of apple image segmentation based on color-texture fusion feature and machine learning[J]. Agronomy, 2020, 10(7): 972.

[16] Fan P, Lang G, Yan B, et al. A method of segmenting apples based on gray-centered RGB color space[J]. Remote Sensing, 2021, 13(6): 1211.

[17] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.

[18] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//International Conference on Learning Representations. 2015.

[19] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.

[20] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.

[21] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.

[22] Gao F, Wu Z, Suo R, et al. Apple detection and counting using real-time video based on deep learning and object tracking[J]. Transactions of the Chinese Society of Agricultural Engineering, 2021, 37(21): 217-224.

[23] Zhao J, et al. Research on apple recognition algorithm in complex orchard environment based on deep learning[J]. Sensors, 2023, 23(12): 5425.

[24] Hu Y, et al. Fruit detection and counting in apple orchards based on improved Yolov7 and multi-object tracking methods[J]. Sensors, 2023, 23(13): 5903.

[25] Abeyrathna R M R D, Nakaguchi V M, Minn A, et al. Recognition and counting of apples in a dynamic state using a 3D camera and deep learning algorithms for robotic harvesting systems[J]. Sensors, 2023, 23(8): 3810.

[26] Yang X, et al. Automatic apple detection and counting with AD-YOLO and MR-SORT[J]. Sensors, 2024, 24(21): 7012.

[27] Matos R, de Belen R A J, Perez T, et al. Tracking and counting apples in orchards under intermittent occlusions and low frame rates[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2024: 4681-4690.

[28] Jin T, Han X, Wang P, et al. Enhanced deep learning model for apple detection, localization, and counting in complex orchards for robotic arm-based harvesting[J]. Smart Agricultural Technology, 2025, 10: 100784.

[29] Cao D, Luo W, Tang R, et al. Research on apple detection and tracking count in complex scenes based on the improved YOLOv7-Tiny-PDE[J]. Agriculture, 2025, 15(5): 483.

[30] Wang X, Tang J, Whitty M A. DeepPhenology: Estimation of apple flower phenology distributions based on deep learning[J]. Computers and Electronics in Agriculture, 2021, 184: 106123.

[31] Bhattarai U, Karkee M. A weakly-supervised approach for flower/fruit counting in apple orchards[J]. Computers in Industry, 2022, 138: 103635.

[32] Lempitsky V, Zisserman A. Learning to count objects in images[C]//Advances in Neural Information Processing Systems 23. 2010: 1324-1332.

[33] Zhang Y, Zhou D, Chen S, et al. Single-image crowd counting via multi-column convolutional neural network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597.

[34] Li Y, Zhang X, Chen D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1091-1100.

[35] Ma Z, Wei X, Hong X, et al. Bayesian loss for crowd count estimation with point supervision[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6142-6151.

[36] Wang B, Liu H, Samaras D, et al. Distribution matching for crowd counting[C]//Advances in Neural Information Processing Systems 33. 2020.

[37] Gao J, Wang Q, Li X. PCC Net: perspective crowd counting via spatial convolutional network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(10): 3486-3498.

[38] Tian Y, Chu X, Wang H. CCTrans: simplifying and improving crowd counting with transformer[EB/OL]. arXiv preprint: arXiv:2109.14483, 2021.

[39] Dai M, Huang Z, Gao J, et al. Cross-head supervision for crowd counting with noisy annotations[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2023: 1-5.


Published

22-04-2026

Issue

Section

Articles

How to Cite

Zhu, C. (2026). SDAFormer: A Semantic-Guided and Detail-Aware Transformer for Apple Counting in Complex Orchards. Journal of Computing and Electronic Information Management, 21(1), 27-37. https://doi.org/10.54097/new4re42