MSDA-SparseInst: Real-Time Instance Segmentation via Multi-Scale Feature Fusion and Dual-Branch Attention

Yinghao Chen

doi:10.54097/e5s94861

Keywords:

Real-time instance segmentation, Urban road scenes, SparseInst, Multi-scale feature fusion, Attention mechanism

Abstract

Real-time instance segmentation is crucial for urban road-scene understanding, which requires accurate pixel-level perception under complex backgrounds, occlusion, and large scale variation. However, existing efficient methods often struggle to balance segmentation accuracy and inference speed, especially for small, distant objects and densely distributed instances. To address this issue, this paper proposes MSDA-SparseInst, a real-time instance segmentation framework built on SparseInst. It makes three changes: an improved backbone enhances feature extraction; a Multi-scale Dilated Feature Aggregation (MDFA) module strengthens cross-scale contextual modeling; and a lightweight dual-branch attention strategy, composed of GCSA and GGCA, refines the decoder features. On the Cityscapes validation set, the proposed method achieves 21.8 AP, 43.4 AP50, and 18.1 AP75 at 30.7 FPS, a 3.0 AP improvement over the SparseInst baseline while maintaining real-time performance. These results indicate that MSDA-SparseInst offers a better trade-off between segmentation accuracy and efficiency for urban road-scene instance segmentation.
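
To put the reported figures in perspective, the short sketch below derives three quantities from the numbers quoted in the abstract: the baseline AP implied by the 3.0 AP gain, the corresponding relative improvement, and the per-frame latency at 30.7 FPS. The function names are illustrative only. The derived baseline assumes that the 3.0 AP gain refers to the same mask AP metric as the 21.8 AP figure and that both are rounded to one decimal place, so it should be read as an approximation rather than a value reported by the paper.

```python
# Figures quoted in the abstract (Cityscapes validation set).
AP_PROPOSED = 21.8  # AP of MSDA-SparseInst
AP_GAIN = 3.0       # reported improvement over the SparseInst baseline
FPS = 30.7          # reported inference speed


def implied_baseline_ap(ap: float, gain: float) -> float:
    """Baseline AP implied by the reported absolute gain (an assumption)."""
    return ap - gain


def relative_gain_percent(ap: float, gain: float) -> float:
    """Gain expressed as a percentage of the implied baseline AP."""
    return 100.0 * gain / implied_baseline_ap(ap, gain)


def latency_ms(fps: float) -> float:
    """Average per-frame processing time in milliseconds."""
    return 1000.0 / fps


baseline_ap = implied_baseline_ap(AP_PROPOSED, AP_GAIN)
print(f"implied baseline AP: {baseline_ap:.1f}")
print(f"relative AP gain: {relative_gain_percent(AP_PROPOSED, AP_GAIN):.1f}%")
print(f"per-frame latency: {latency_ms(FPS):.1f} ms")
```

Under these assumptions, the gain corresponds to roughly a 16% relative improvement over a baseline of about 18.8 AP, at about 32.6 ms per frame.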
