Structure Enhancement and Cross-Modal Alignment for Open-Vocabulary Semantic Segmentation
DOI: https://doi.org/10.54097/6kbbz575

Keywords: Open-Vocabulary Semantic Segmentation, Cross-modal Alignment, Structure Enhancement

Abstract
This paper proposes a structure-enhanced cross-modal alignment method for open-vocabulary semantic segmentation. Existing methods rely largely on CLIP's image-level vision-language alignment, but CLIP visual features remain insufficient for modeling fine-grained spatial information such as boundaries, textures, and region structures. Moreover, relying solely on semantic alignment between whole images and text categories makes it difficult to capture fine-grained correspondences between visual and textual information. To address these issues, we design a DINO Structure Enhancement Module and a Cross-Modal Alignment Module (CCMA). The DINO Structure Enhancement Module introduces a parameter-frozen DINO model to extract structural priors and adaptively enhance CLIP visual features, producing structure-aware visual representations. CCMA jointly models global visual semantics, local region features, and text semantic prototypes to mine region-level vision-language consistency, thereby strengthening the correspondence between image regions and textual semantics. Experimental results demonstrate that the proposed method effectively improves open-vocabulary semantic segmentation performance.
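The two modules described above can be illustrated with a minimal sketch. The abstract does not give the modules' exact formulations, so everything below is an assumption for illustration: `structure_enhance` fuses CLIP patch features with frozen-DINO features through a learned sigmoid gate (a common adaptive-fusion form), and `region_text_logits` scores region features against text prototypes by temperature-scaled cosine similarity. The names `W_g` and `tau` are hypothetical, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, eps=1e-8):
    """L2-normalise feature vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def structure_enhance(clip_feats, dino_feats, W_g):
    """Adaptively inject frozen-DINO structural priors into CLIP features.

    clip_feats: (N, D) CLIP patch features
    dino_feats: (N, D) patch features from a parameter-frozen DINO model
    W_g:        (2D, D) learnable gate projection (hypothetical form)
    """
    z = np.concatenate([clip_feats, dino_feats], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(z @ W_g)))   # per-channel gate in (0, 1)
    return clip_feats + g * dino_feats     # structure-aware features

def region_text_logits(region_feats, text_protos, tau=0.07):
    """Region-level alignment: cosine similarity to text semantic prototypes."""
    return (l2norm(region_feats) @ l2norm(text_protos).T) / tau

# Toy shapes: 16 regions, 32-dim features, 5 open-vocabulary classes.
N, D, C = 16, 32, 5
clip_feats = rng.standard_normal((N, D))
dino_feats = rng.standard_normal((N, D))
W_g = rng.standard_normal((2 * D, D)) * 0.1
text_protos = rng.standard_normal((C, D))

feats = structure_enhance(clip_feats, dino_feats, W_g)
logits = region_text_logits(feats, text_protos)
print(logits.shape)  # one alignment score per (region, class) pair: (16, 5)
```

In this sketch the gate lets each channel decide how much DINO structure to absorb, while the cosine head mirrors CLIP-style open-vocabulary classification over an arbitrary prototype set; the actual modules in the paper may differ.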
License
Copyright (c) 2026 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.