Structure Enhancement and Cross-Modal Alignment for Open-Vocabulary Semantic Segmentation

Authors

  • Jiawei Bai

DOI:

https://doi.org/10.54097/6kbbz575

Keywords:

Open-Vocabulary Semantic Segmentation, Cross-modal Alignment, Structure Enhancement

Abstract

This paper proposes a structure-enhanced cross-modal alignment method for open-vocabulary semantic segmentation. Existing methods mostly rely on CLIP’s image-level vision-language alignment, but CLIP visual features are insufficient for modeling fine-grained spatial information such as boundaries, textures, and region structures. Moreover, relying solely on semantic alignment between whole images and text categories makes it difficult to model fine-grained correspondences between visual and textual information. To address these issues, we design a DINO Structure Enhancement Module and a Cross-Modal Alignment Module (CCMA). The DINO Structure Enhancement Module introduces a parameter-frozen DINO model to extract structural priors and adaptively enhance CLIP visual features, producing structure-aware representations. CCMA jointly models global visual semantics, local region features, and text semantic prototypes to mine fine-grained vision-language consistency at the region level, strengthening the correspondence between image regions and textual semantics. Experimental results demonstrate that the proposed method effectively improves open-vocabulary semantic segmentation performance.
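The abstract does not specify the internals of the two modules, but the described pipeline (frozen DINO priors gating CLIP patch features, then region-to-text-prototype matching) can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the sigmoid-gated fusion form, the function names, and the temperature value are all assumptions introduced here for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def structure_enhance(clip_feats, dino_feats, W_gate):
    """Hypothetical gated fusion: let DINO structural priors adaptively
    modulate CLIP patch features. clip_feats, dino_feats: (N, D); W_gate: (2D, D)."""
    concat = np.concatenate([clip_feats, dino_feats], axis=-1)   # (N, 2D)
    gate = 1.0 / (1.0 + np.exp(-(concat @ W_gate)))              # sigmoid gate, (N, D)
    return clip_feats + gate * dino_feats                        # structure-aware features

def region_text_alignment(region_feats, text_protos, tau=0.07):
    """Cosine similarity between region features and text prototypes,
    softmax over categories -> per-region class probabilities."""
    sim = l2_normalize(region_feats) @ l2_normalize(text_protos).T  # (N, C)
    logits = sim / tau
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: 6 patch/region features of dim 8, 3 text categories.
rng = np.random.default_rng(0)
N, D, C = 6, 8, 3
clip_feats = rng.normal(size=(N, D))
dino_feats = rng.normal(size=(N, D))
W_gate = rng.normal(size=(2 * D, D)) * 0.1
fused = structure_enhance(clip_feats, dino_feats, W_gate)
probs = region_text_alignment(fused, rng.normal(size=(C, D)))
```

In this sketch the DINO branch only supplies additive structural detail (through the learned gate), so the fused features stay in CLIP's embedding space and remain comparable to the text prototypes for open-vocabulary classification.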



Published

29-04-2026

How to Cite

Bai, J. (2026). Structure Enhancement and Cross-Modal Alignment for Open-Vocabulary Semantic Segmentation. Journal of Computing and Electronic Information Management, 21(1), 111-117. https://doi.org/10.54097/6kbbz575