Published December 15, 2023 | Version v1
Journal article Open

Effective Use of Content as a Feature in IMDB Dataset Analysis

Description

This study uses data from IMDb and TMDB to build two machine learning models: one predicts movie ratings and the other classifies movie genres. For rating prediction, we build a structured dataset from a selected subset of IMDb data that includes titles, genres, ratings, and crew roles. We focus on Gradient Boosting Decision Tree (GBDT) methods, including XGBoost and CatBoost, and evaluate them with Mean Squared Error (MSE) and Mean Absolute Error (MAE). Of the models tested, the Gradient Boosting Regressor gives the best results, balancing speed and accuracy. For genre classification, we collect plot summaries from TMDB and encode them with Sentence Transformers, producing embeddings that capture the relationships between genres. These embeddings are fed into a convolutional neural network (CNN) built from Conv2D layers with a 1x2 kernel and MaxPooling2D layers with a 1x2 pool size. The results suggest that treating content as a feature may improve both rating prediction and genre classification.
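The two evaluation metrics and the 1x2 convolution and pooling steps named in the description can be illustrated with a small, dependency-free sketch. This is not the authors' code: the function names, the kernel weights, the toy ratings, and the six-element vector standing in for a sentence embedding are all hypothetical. The sketch also assumes stride 1 with no padding for the convolution and a stride equal to the pool size for pooling (the Keras defaults), and it omits any activation function because the description does not name one.

```python
from typing import List, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between true and predicted ratings."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the absolute differences between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def conv_1x2(row: Sequence[float], kernel: Sequence[float], bias: float = 0.0) -> List[float]:
    """Slide a 1x2 kernel along one row with stride 1 and no padding (assumed)."""
    k0, k1 = kernel
    return [k0 * row[i] + k1 * row[i + 1] + bias for i in range(len(row) - 1)]


def max_pool_1x2(row: Sequence[float]) -> List[float]:
    """Max over non-overlapping windows of width 2; a trailing odd element is dropped."""
    return [max(row[i], row[i + 1]) for i in range(0, len(row) - 1, 2)]


# Hypothetical ratings, chosen only to make the arithmetic easy to follow.
true_ratings = [7.0, 6.5, 8.0, 5.5]
predicted_ratings = [6.5, 7.0, 8.0, 4.5]
mse = mean_squared_error(true_ratings, predicted_ratings)
mae = mean_absolute_error(true_ratings, predicted_ratings)

# Hypothetical 6-dimensional "embedding"; real sentence embeddings are much longer.
embedding = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
feature_map = conv_1x2(embedding, kernel=(1.0, -1.0))  # length 6 - 1 = 5
pooled = max_pool_1x2(feature_map)                     # length 5 // 2 = 2

print("MSE:", mse, "MAE:", mae)
print("conv output:", feature_map)
print("pooled output:", pooled)
```

With a kernel of width 2 and no padding, a row of length D shrinks to D - 1 after the convolution and to (D - 1) // 2 after pooling, which is why a stack of such layers reduces the embedding width quickly.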

Files

2023_JAIwA_1293353movie.pdf (652.5 kB, md5:990c575b9efe29c5ccf91eaa46dbe23b)
