Research on Real-scene Video Face Restoration Methods Based on Time Consistency and Multimodal Fusion
DOI: https://doi.org/10.54097/2sb9jw88
Keywords: Video Face Restoration, Audio-guided Learning, Multimodal Fusion, Temporal Consistency
Abstract
This paper proposes a simplified audio-guided method for video face restoration that aims to recover high-quality, temporally consistent face videos. We design a multi-stage framework that integrates the audio and visual modalities through simple yet effective components. Specifically, we extract low-level HOG features from the video frames and MFCC features from the audio, then use a simplified 3D convolutional network, guided by both modalities, to predict dictionary indices. A pre-trained TS-VQGAN decoder reconstructs high-quality frames from the predicted indices. Simplified spatio-temporal fidelity modules and optical-flow smoothing are applied together to improve spatio-temporal consistency. Experiments on the VoxCeleb2 dataset show that our method outperforms single-modal methods such as BasicVSR++ and VQF on the PSNR, SSIM, and LPIPS metrics. This indicates that cross-modal fusion can deliver consistent performance improvements in practical video restoration tasks even with a simplified architecture.
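Of the three metrics named in the abstract, PSNR has the simplest closed form, and a minimal sketch of it may help readers who want to reproduce the comparison. The code below assumes 8-bit frames stored as NumPy arrays and averages per-frame PSNR over a clip. The function names `psnr` and `video_psnr`, the `(T, H, W, C)` layout, and the per-frame averaging are illustrative assumptions, not the evaluation protocol used in the paper.

```python
import numpy as np


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally shaped frames.

    Uses the standard definition 10 * log10(max_value**2 / MSE).
    max_value=255 assumes 8-bit pixel data (an assumption of this sketch).
    """
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    if reference.shape != restored.shape:
        raise ValueError("inputs must have the same shape")
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(reference_frames, restored_frames, max_value=255.0):
    """Mean of per-frame PSNR over a clip shaped (T, H, W, C).

    Averaging per-frame scores is one common convention; it is assumed
    here and is not taken from the paper.
    """
    scores = [
        psnr(ref, out, max_value)
        for ref, out in zip(reference_frames, restored_frames)
    ]
    return float(np.mean(scores))
```

If any frame pair is identical, its score is infinite and so is the clip mean, so a real evaluation script may want to handle that case explicitly. SSIM and LPIPS require a windowed statistic and a learned network respectively, and are better taken from an existing library than reimplemented.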
License
Copyright (c) 2025 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.








