Movie Rating Prediction on the MovieLens 32M Dataset Using Random Forests

Gaorui Zhang; Huiqin Sun; Sai Li; Juan Li

doi:10.54097/qsyt8v31

Keywords:

Movie Rating Prediction, Random Forest, Data Mining, Data Preprocessing

Abstract

This study addresses two problems: the difficulty users face in deciding what to watch in the global film industry, and the insufficient accuracy of existing rating prediction models. It explores movie rating prediction on the MovieLens 32M dataset. First, stratified sampling was performed on the dataset using Python scripts. The data were then merged, missing values were handled with a differentiated approach, outliers were removed, and temporal features were transformed. The resulting dataset was split into training and test sets at an 8:2 ratio while keeping the rating distributions consistent across the two sets. A Random Forest regression model was then constructed, with GridSearchCV used to optimise hyperparameters such as the number of trees and the maximum depth. On the test set, the final model achieved a coefficient of determination (R²) of 0.8776, a mean squared error (MSE) of 0.1399, and a mean absolute error (MAE) of 0.1697. This is a substantial improvement over benchmark models such as linear regression and support vector machines, and it indicates that the model effectively captures the nonlinear relationships in the rating data. Feature importance analysis showed that the user's average historical rating (importance score 0.7921) and the movie's average historical rating (0.0680) are the core factors influencing rating predictions, while the rating standard deviation and user ID have weaker impacts. These findings provide quantitative evidence for optimising film producers' scheduling strategies, enhancing personalised recommendation systems, and evaluating film value on content platforms.
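
The abstract describes an 8:2 split that preserves the rating distribution and reports R², MSE, and MAE, but the authors' scripts are not shown here. The following is a minimal standard-library sketch of those two steps only, under stated assumptions: the function names (`stratified_split`, `mse`, `mae`, `r2`) and the synthetic `(user_id, movie_id, rating)` tuples are hypothetical, and the constant-mean predictor is a placeholder baseline, not the Random Forest model or the GridSearchCV tuning used in the study.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), drawing test_fraction of each label group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        # Take the same share from every rating level so both sets
        # keep (approximately) the same rating distribution.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic toy data (an assumption, not MovieLens): (user_id, movie_id, rating).
toy_rng = random.Random(42)
ratings = [
    (user, movie, toy_rng.choice([1.0, 2.0, 3.0, 4.0, 5.0]))
    for user in range(50)
    for movie in range(20)
]

# 8:2 split stratified on the rating value.
train, test = stratified_split(ratings, label_of=lambda row: row[2])

# Placeholder baseline: predict the training-set mean rating for every test row.
baseline = sum(row[2] for row in train) / len(train)
y_true = [row[2] for row in test]
y_pred = [baseline] * len(test)

print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2: ", r2(y_true, y_pred))
```

A constant predictor can never score above zero on R², so this baseline only gives a floor against which a fitted model's reported R² can be read.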
