A Multimodal Video Anomaly Detection Method Based on Vision-Language Alignment

Yueai Zhao; Yan Zhang; Shihao Wang; Lipei Kong

doi:10.54097/d1qryr89

Keywords:

Weakly supervised learning, Vision-language alignment, Text prompts, Fine-grained semantics, Multi-task loss

Abstract

Existing weakly supervised video anomaly detection methods suffer from insufficient semantic alignment of features and lack fine-grained localisation capability. Furthermore, purely visual models are limited in their understanding of semantic information, which makes it difficult to integrate visual features with semantics and results in poor discrimination between multiple anomaly classes. To address these issues, this paper proposes a vision-language aligned video anomaly detection model. First, the model introduces a vision-language alignment branch that uses a pre-trained CLIP model to achieve fine-grained semantic alignment between video features and category text prompts, overcoming the semantic limitations of purely visual models. Second, it designs an adaptive gated fusion mechanism that dynamically fuses the global anomaly scores from the original visual branch with the semantically guided scores from the alignment branch, combining the complementary strengths of visual pattern recognition and semantic understanding to better distinguish between anomaly categories. Third, it constructs a multi-task loss function that jointly optimises temporal localisation and fine-grained classification, making full use of video-level weak supervision signals and cross-modal alignment information. Experimental results on the UCF-Crime and XD-Violence datasets demonstrate that the method effectively improves fine-grained anomaly localisation and classification performance, showing significant advantages.
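The three components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the sigmoid gate, the use of prompt index 0 as a "normal" class, the top-k pooling, the temperature of 0.07 and the loss weight `lam` are assumptions chosen only to make the idea concrete. Random arrays stand in for CLIP video and text features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_logits(snippet_feats, text_feats, temperature=0.07):
    # (T, D) snippet features vs. (C, D) prompt features -> (T, C) similarities.
    return l2_normalize(snippet_feats) @ l2_normalize(text_feats).T / temperature

def gated_fusion(visual_scores, semantic_scores, gate_logits):
    # Assumed gate: a per-snippet weight in (0, 1) mixing the two branches.
    g = sigmoid(gate_logits)
    return g * visual_scores + (1.0 - g) * semantic_scores

def multitask_loss(fused_scores, logits, video_label, class_index, k=3, lam=1.0):
    # Localisation term: binary cross-entropy on the mean of the top-k fused scores,
    # so only a video-level label is needed (weak supervision).
    p = np.clip(np.sort(fused_scores)[-k:].mean(), 1e-7, 1.0 - 1e-7)
    loc = -(video_label * np.log(p) + (1.0 - video_label) * np.log(1.0 - p))
    # Classification term: cross-entropy on per-class top-k pooled alignment logits.
    video_logits = np.sort(logits, axis=0)[-k:].mean(axis=0)
    cls = -np.log(softmax(video_logits)[class_index] + 1e-12)
    return loc + lam * cls

# Toy inputs (random stand-ins, not real CLIP features).
rng = np.random.default_rng(0)
T, D, C = 8, 16, 4                       # snippets, feature size, prompt classes
snippet_feats = rng.normal(size=(T, D))
text_feats = rng.normal(size=(C, D))     # row 0 is assumed to be the "normal" prompt
visual_scores = sigmoid(rng.normal(size=T))
gate_logits = rng.normal(size=T)         # stand-in for a learned gate's output

logits = alignment_logits(snippet_feats, text_feats)
semantic_scores = 1.0 - softmax(logits)[:, 0]   # probability of "not normal"
fused = gated_fusion(visual_scores, semantic_scores, gate_logits)
loss = multitask_loss(fused, logits, video_label=1.0, class_index=2)
```

Because the gate forms a convex combination, each fused score lies between the visual and semantic scores for that snippet; in a trained model the gate would be produced by a learned layer rather than sampled at random.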


