Research on Real-Scene Video Face Restoration Methods Based on Temporal Consistency and Multimodal Fusion
DOI: https://doi.org/10.54097/2sb9jw88

Keywords: Video Face Restoration, Audio-guided Learning, Multimodal Fusion, Temporal Consistency

Abstract
This paper proposes a simplified audio-guided video face restoration method whose goal is to recover high-quality, temporally consistent face videos. We design a multi-stage framework that fuses the audio and visual modalities through simple yet effective components. Specifically, we extract low-level HOG features from video frames and MFCC features from the accompanying audio, and use a simplified 3D convolutional network, guided by both modalities, to predict dictionary indices from which a pre-trained TS-VQGAN decoder reconstructs high-quality frames. Simplified spatio-temporal fidelity modules and optical-flow smoothing are applied to further enhance spatio-temporal consistency. Experimental results on the VoxCeleb2 dataset show that our method outperforms single-modal methods such as BasicVSR++ and VQFR in terms of PSNR, SSIM, and LPIPS. This indicates that, even with a simplified architecture, cross-modal fusion can deliver consistent performance gains in practical video restoration tasks.
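As a concrete illustration of the feature-extraction stage, the sketch below computes per-frame HOG descriptors [12] with scikit-image and per-clip MFCCs [14] with librosa. The hyperparameters (9 orientations, 8x8 cells, 13 MFCC coefficients, 16 kHz audio) are illustrative assumptions, not the paper's reported settings.

```python
# Minimal sketch of the low-level feature extraction stage described above.
# Hyperparameter choices here are illustrative assumptions.
import numpy as np
import librosa                      # audio loading and MFCC features [14]
from skimage.feature import hog     # histogram of oriented gradients [12]
from skimage.color import rgb2gray

def frame_hog_features(frames):
    """Compute one HOG descriptor per frame (frames: N x H x W x 3, RGB in [0, 1])."""
    return np.stack([
        hog(rgb2gray(f), orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), feature_vector=True)
        for f in frames
    ])

def clip_mfcc_features(audio_path, n_mfcc=13, sr=16000):
    """Compute MFCC features for the accompanying audio track."""
    y, sr = librosa.load(audio_path, sr=sr)
    # librosa returns shape (n_mfcc, T); transpose so time leads.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T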
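The dictionary-index prediction stage can be sketched as a small 3D convolutional network [13] that fuses the two modalities channel-wise and emits one codebook index per spatio-temporal token for the pre-trained TS-VQGAN decoder. All layer widths and the codebook size below are assumed for illustration; the paper's exact architecture may differ.

```python
# Assumed sketch of audio-guided dictionary-index prediction with a 3D CNN [13].
import torch
import torch.nn as nn

class IndexPredictor3D(nn.Module):
    def __init__(self, vis_ch=3, aud_ch=1, codebook_size=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(vis_ch + aud_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(128, codebook_size, kernel_size=1),  # per-token logits
        )

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, C_v, T, H, W); audio_feats: (B, C_a, T).
        # Broadcast audio features over the spatial grid, then fuse channel-wise.
        audio = audio_feats[..., None, None].expand(
            -1, -1, -1, visual_feats.shape[-2], visual_feats.shape[-1])
        logits = self.net(torch.cat([visual_feats, audio], dim=1))  # (B, K, T, H, W)
        return logits.argmax(dim=1)  # predicted dictionary indices per token
```

Channel-wise concatenation is the simplest fusion consistent with the "simple yet effective components" described in the abstract; an attention-based fusion would be a natural alternative.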
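Finally, the optical-flow smoothing step can be approximated with Farnebäck dense flow [15]: warp the previously restored frame toward the current one and blend the two. The blend weight `alpha` below is an assumed value, not taken from the paper.

```python
# Illustrative optical-flow temporal smoothing with Farneback flow [15].
# Backward warping via cv2.remap avoids the holes forward splatting would create.
import cv2
import numpy as np

def smooth_with_flow(prev_restored, cur_restored, alpha=0.5):
    """Blend the flow-warped previous frame into the current restored frame."""
    prev_gray = cv2.cvtColor(prev_restored, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_restored, cv2.COLOR_BGR2GRAY)
    # Backward flow: for each pixel of the current frame, where it came from.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    h, w = cur_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_restored, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped_prev, alpha, cur_restored, 1 - alpha, 0)
```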
References
[1] Wang, X., Li, Y., Zhang, H., & Shan, Y. (2021). Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9168-9178).
[2] Wang, Z., Zhang, J., Wang, X., Chen, T., Shan, Y., Wang, W., & Luo, P. (2024). Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos. IEEE Transactions on Image Processing.
[3] Xu, K., Xu, L., He, G., Yu, W., & Li, Y. (2024). Beyond alignment: Blind video face restoration via parsing-guided temporal-coherent transformer. arXiv preprint arXiv:2404.13640.
[4] Xu, Y., Song, Z., & Lu, J. (2025, January). Universal Video Face Restoration Method Based on Vision-Language Model. In The 16th Asian Conference on Machine Learning (Conference Track).
[5] Cheng, H., Guo, Y., Yin, J., Chen, H., Wang, J., & Nie, L. (2024). Audio-driven talking video frame restoration. IEEE Transactions on Multimedia, 26, 4110-4122.
[6] Wang, Y., Teng, J., Cao, J., Li, Y., Ma, C., Xu, H., & Luo, D. (2025). Efficient video face enhancement with enhanced spatial-temporal consistency. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 2183-2193).
[7] Tan, J., Park, H., Zhang, Y., Wang, T., Zhang, K., Kong, X., ... & Luo, W. (2024, October). Blind face video restoration with temporal consistent generative prior and degradation-aware prompt. In Proceedings of the 32nd ACM International Conference on Multimedia (pp. 1417-1426).
[8] Feng, R., Li, C., & Loy, C. C. (2024, September). Kalman-inspired feature propagation for video face super-resolution. In European Conference on Computer Vision (pp. 202-218). Cham: Springer Nature Switzerland.
[9] Chen, Z., He, J., Lin, X., Qiao, Y., & Dong, C. (2024). Towards real-world video face restoration: A new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5929-5939).
[10] Xie, L., Wang, X., Zhang, H., Dong, C., & Shan, Y. (2022). VFHQ: A high-quality dataset and benchmark for video face super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 657-666).
[11] Zhang, X., & Wu, X. (2022). Multi-modality deep restoration of extremely compressed face videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 2024-2037.
[12] Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[13] Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.
[14] McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in python. SciPy, 2015, 18-24.
[15] Farnebäck, G. (2003, June). Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis (pp. 363-370). Berlin, Heidelberg: Springer Berlin Heidelberg.
[16] Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600-612.
[17] Chung, J. S., Nagrani, A., & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
[18] Gu, Y., Wang, X., Xie, L., Dong, C., Li, G., Shan, Y., & Cheng, M. M. (2022, October). VQFR: Blind face restoration with vector-quantized dictionary and parallel decoder. In European Conference on Computer Vision (pp. 126-143). Cham: Springer Nature Switzerland.