A Multimodal Video Anomaly Detection Method Based on Vision-Language Alignment
DOI:
https://doi.org/10.54097/d1qryr89
Keywords:
Weakly supervised learning, Vision-language alignment, Text prompts, Fine-grained semantics, Multi-task loss
Abstract
Existing weakly supervised video anomaly detection methods suffer from insufficient semantic alignment of features and lack fine-grained localisation capability. Moreover, purely visual models have a limited understanding of semantic information, which makes it difficult to integrate visual features with semantics and leads to poor discrimination among multiple anomaly classes. To address these issues, this paper proposes a vision-language aligned video anomaly detection model. First, a vision-language alignment branch uses a pre-trained CLIP model to achieve fine-grained semantic alignment between video features and category text prompts, overcoming the limited semantic understanding of purely visual models. Second, an adaptive gated fusion mechanism dynamically fuses the global anomaly scores from the original visual branch with the semantically guided scores from the alignment branch, combining the complementary strengths of visual pattern recognition and semantic understanding to better distinguish between anomaly categories. Third, a multi-task loss function jointly optimises temporal localisation and fine-grained classification, making full use of video-level weak supervision signals and cross-modal alignment information. Experimental results on the UCF-Crime and XD-Violence datasets show that the method effectively improves fine-grained anomaly localisation and classification performance.
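The abstract does not give the exact formulation of the alignment branch or the gated fusion, so the following is only a minimal sketch of how such a pipeline could look, not the authors' implementation. It assumes that snippet features and category text embeddings already live in a shared space (for example, produced by CLIP encoders), that the semantic anomaly score is one minus the probability of a "normal" prompt, and that the gate is a sigmoid over a linear combination of the two scores. The function names, the gate parameters `w_vis`, `w_sem` and `bias`, and the temperature value are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_scores(snippet_feats, text_embeds, normal_index=0, temperature=0.07):
    """Align each snippet with every category prompt embedding.

    Returns per-snippet class probabilities and a semantic anomaly score,
    taken here (an assumption) as 1 - P(normal prompt).
    """
    probs, scores = [], []
    for feat in snippet_feats:
        logits = [cosine(feat, t) / temperature for t in text_embeds]
        p = softmax(logits)
        probs.append(p)
        scores.append(1.0 - p[normal_index])
    return probs, scores

def gated_fusion(visual_scores, sem_scores, w_vis=1.0, w_sem=1.0, bias=0.0):
    """Fuse both branches with a per-snippet gate g in (0, 1).

    The gate parameters are hypothetical; in a trained model they would be
    learned. The output is a convex combination of the two input scores.
    """
    fused = []
    for sv, ss in zip(visual_scores, sem_scores):
        g = sigmoid(w_vis * sv + w_sem * ss + bias)
        fused.append(g * sv + (1.0 - g) * ss)
    return fused

# Toy usage: two prompts ("normal", "fighting") and two video snippets.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
snippet_feats = [[0.9, 0.1], [0.1, 0.9]]
visual_scores = [0.2, 0.7]  # stand-in for the visual branch output

class_probs, sem = semantic_scores(snippet_feats, text_embeds)
fused = gated_fusion(visual_scores, sem)
```

Because the gate yields a convex combination, each fused score stays between the visual and the semantic score for that snippet, so neither branch can push the result outside the range the two branches agree on.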
License
Copyright (c) 2026 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.








