Movie genre preference prediction using machine learning for customer-based information

This work introduces a movie genre preference prediction model for small and medium-sized enterprises (SMEs) that need a data-based, analytical approach to stocking suitable movies for local audiences and retaining more customers. We used classification models to extract features from the demographic, behavioral and social information of one thousand customers and to predict their movie genre preferences. In the implementation, a Gaussian kernel support vector machine (SVM) classification model and a logistic regression model were trained on the sample data, and their error-in-sample values were compared. The error-out-sample was also compared under different Vapnik–Chervonenkis (VC) dimensions to detect and prevent overfitting. The Gaussian kernel SVM prediction model correctly predicted movie genre preferences in 85% of positive cases. The accuracy of the algorithm increased to 93% with a smaller VC dimension and less overfitting. These findings advance our understanding of how to use a machine learning approach to predict customers' preferences from a small data set and how to design prediction tools for these enterprises.


I. INTRODUCTION
The movie industry has long relied upon various forms of data to answer questions regarding customer interest and suggested target market segments. Because informed decisions about which types of movies to produce and which genres specific demographic groups will favor depend on such data, data generation and analysis are rooted in all types of movie-making and streaming decisions.
A famous example of the power of data analytics in the movie industry is the set of algorithms Netflix uses to predict the types of movies, and the specific movies, that its individual customers are most likely to want to watch next. Netflix's online streaming service relies upon excellent customer service to maintain its customer base. This requires compiling relevant demographic and user preference information to accurately predict movie preferences. To that end, Netflix utilizes Cinematch to collect and analyze customer demographic and preference data, gaining a strong competitive edge over competitors like Blockbuster [1].
This type of data analytics is not only pertinent to online streaming services, however. The same type of data is needed to determine which targeted customer segments are most likely to buy or rent specific genres of movies from physical movie retail chains. Many small and medium-sized enterprises (SMEs) have little or no analytical framework at hand to decide which types of movies to market or to whom they should be marketing. The goal of this paper is to provide a predictive model usable by any SME in the movie rental industry in need of a data-based, analytical approach to answering 1) which customers are more likely to be interested in specific movie genres, and subsequently, 2) which market segments these companies ought to target in their marketing campaigns based on the data.
A company's approach to predicting which groups of people will be interested in certain movie genres plays a major role in the viability of SMEs in the movie-rental market. Over the past 10 years, the number of movie-rental retail chains has fallen drastically. In 2016, only 5% of the total TV movie-rental market revenue came from physical rental stores [2]. SMEs have immense difficulty competing with established, well-known nationwide brands and need every advantage available. Simply put, SME rental businesses cannot decide which movies to promote or store without basing the decisions on predictive analysis.
An accurate predictive analysis model ought to be adopted by SME movie-rental businesses that do not currently employ one, in order to reduce or prevent a loss in customer base. National consumer spending on movie rentals from physical stores fell from $1.22 billion in 2012 to a mere $0.49 billion in 2016 [3]. Ensuring the proper movies are stocked for the right audiences becomes even more critical when people are spending less on rentals every year.
Predictive analysis can help ensure that the right types of customers make it into the store, and it can help determine the correct genres of movies to stock. Not only would an accurate predictive analysis retain more customers, it would also reduce the amount of inventory within the store that goes unused for long periods of time. Inventory turnover and the size of the customer base are two factors that correlate directly with how well a business is doing. The higher the inventory turnover, the quicker movies are leaving the store to be bought or rented. A positive effect on one or both factors would result in more revenue for the movie rental company. Netflix is an excellent example of using predictive analysis to retain existing customers and attract new ones through targeted advertising of the right movies. The movie rental company Redbox utilizes big data to analyze trends in consumer preference. Redbox has used non-linear regression models to help propel its growth and uses predictive analysis to determine the inventory of specific movies in different geographic areas [4]. The company has used analytics to aid its expansion from 5,000 kiosk locations in 2008 to 35,000 kiosk locations in 2014 [4].
However, relying solely upon the results of data analysis is not a sound course of action from a strategic standpoint. Netflix often makes strategic decisions that contradict the trends that data analysis has shown. For instance, data analysis showed that Netflix ought to make the subscription cancellation process more difficult for customers. Going against the data, Netflix executives decided not to place what they considered unfair burdens on consumers who wish to cancel their subscriptions [5].
Simply put, our data cannot answer questions regarding the current nationwide mega-trends in which demographics and personal attributes contribute to an individual liking certain movie genres. The proposed model is, however, designed to predict genre preferences within a city or a state. As noted in [6], recommender-system users who live in South America often dislike Hollywood drama movies. Hence, users in different states have different movie genre preferences, and recommendations for users in other parts of the world have to be refined. We did not use the 10 million dataset from MovieLens because it does not include demographic information: each user is represented by an ID and no other personal information is provided [6].
Our dataset consists of one sample with 1100 observations covering preferences for 11 types of movies, 22 personal attribute variables and five demographic variables. The question our data can answer is how a small company can utilize our predictive analysis model to determine which of its customers will be drawn to a specific type of movie. Using survey data, we show the relationships between customer preferences and demographic information that exist within current company data, and we show trends by age, gender and other factors in personal movie preferences that can be used to accurately market specific movies to customers. In general, user information such as gender, location, or preference is used effectively in movie recommendation systems [7]-[10]. In this paper, we examine the characteristics of survey respondents who like comedy movies.
Our survey recorded, for each individual, a value for the propensity to like certain types of movies or for how strongly the person feels a certain characteristic applies to himself or herself. The values range from one to five, one being the lowest and five being the highest. The median age for men and women who enjoy comedy movies is 20 years old. Out of the 1008 responses for comedy movies, 507 are considered positive responses (a value of five) and 501 are considered negative responses (a value of one to four). Breaking down the survey respondents by demographic, women who have obtained their bachelor's degrees tend to like comedy movies the most, with an average score of 4.5. This average score holds for both of our living categories (village and city). Among men who live in cities, those who graduated from high school (but have not graduated from college) have an average comedy movie score of 4.58. Among men who live in villages, the group with the highest average comedy score is college graduates, with an average of 4.21, which is noticeably lower than the average response from women in the same demographic categories.
In this paper, we propose an improved machine learning approach to predict movie genre preferences based on demographic information. We investigate the prediction accuracy of two machine learning algorithms: logistic regression and Gaussian kernel SVM. The paper is structured as follows. In Section II, we describe the related methods used in movie recommendation systems. In Section III, we present the proposed predictive method. In Section IV, we evaluate experimental results and discuss prediction performance. In Section V, we leave the reader with concluding thoughts and future work.

II. RELATED WORK
Various recommender systems have been developed to guide customers to items that might be of interest to them, and recommendation performance has recently been improved by different approaches [11], [12]. Researchers usually categorize recommender systems into collaborative filtering and content-based filtering systems [11], [13], [14]. This section provides a brief review of both filtering methods and the known issues associated with them.

A. Collaborative Filtering Approach
The collaborative filtering approach recommends items of interest to a particular user based on similarity to past ratings. First, based on customers' preferences, a collaborative filter calculates the correlation coefficient between the customer being served and other customers. This is the Pearson correlation coefficient [9], [15], [16]. A coefficient near +1 means that the two customers have similar preferences. Second, the approach selects as neighbors those customers who have a high coefficient with the customer being served. Finally, the collaborative filtering method predicts the customer's preference for a specific movie genre based on the neighbors' ratings.
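The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the cited systems: the rating dictionaries, the similarity threshold `min_sim`, and the similarity-weighted average used for the final prediction are all assumptions made for the example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two users over the genres both have rated."""
    common = [k for k in a if k in b]
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[k] for k in common) / n
    mean_b = sum(b[k] for k in common) / n
    cov = sum((a[k] - mean_a) * (b[k] - mean_b) for k in common)
    var_a = sum((a[k] - mean_a) ** 2 for k in common)
    var_b = sum((b[k] - mean_b) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict(target, others, genre, min_sim=0.5):
    """Similarity-weighted average of the neighbours' ratings for `genre`.

    Neighbours are the users whose correlation with `target` is at least
    `min_sim` (an assumed threshold) and who have rated `genre`.
    """
    num = den = 0.0
    for other in others:
        sim = pearson(target, other)
        if sim >= min_sim and genre in other:
            num += sim * other[genre]
            den += sim
    return num / den if den else None

# Hypothetical 1-5 ratings keyed by genre.
alice = {"action": 5, "horror": 1, "romance": 4}
bob = {"action": 4, "horror": 2, "romance": 5, "comedy": 5}
carol = {"action": 1, "horror": 5, "romance": 2, "comedy": 2}

print(pearson(alice, bob), pearson(alice, carol))
print(predict(alice, [bob, carol], "comedy"))
```

In this toy example only bob qualifies as a neighbor of alice, so her predicted comedy rating follows his.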

B. Known Problems of Collaborative Filtering Approach
The two major problems are sparsity and cold-start. The sparsity problem occurs when there is not enough customer information or there are not enough ratings available. If we collect survey results from only a small number of users, the accuracy of the trained recommendation system will be lower than the accuracy obtained from a large number of samples [9], [10], [17]-[19]. In fact, we found that the more features the machine learning algorithm used, the more samples were needed to prevent overfitting. Cold-start is the other problem of collaborative filtering. It arises when new customers or movies do not yet have enough information or ratings in the recommendation system [20]-[22]. Although existing systems can become unreliable because of the cold-start problem, a recommendation model designed for SMEs is less affected by it than models designed for individual customers. The model for SMEs predicts genre preferences for a group of customers rather than for one customer, so a new customer who has no information recorded will not affect the prediction of the general genre preferences of the targeted customer base. However, the model for SMEs in quickly growing cities should be updated continually over time to prevent degradation as the number of customers increases significantly.

C. Content-Based Filtering Approach
This algorithm uses descriptions of items and user preferences to recommend products to customers. The approach compares a user's preferences with the representations of new items and matches user preferences with item attributes. Some machine learning techniques, such as naïve Bayes, have been applied to the content-based filtering approach [23], [24].

D. Drawback of Content-Based Approach
The content-based filtering algorithm has some drawbacks. First, it relies on appropriate information for categorization: it cannot provide reliable recommendations if the analyzed content does not contain sufficient information for classifying items. Sometimes, domain knowledge and an ontology are required to determine the attributes used in recommendation [25]. In addition, when enormous amounts of attribute information are processed by matrix-based approaches, scalability and sparsity problems may occur [26].
This work can be viewed as a confluence and continuation of the above-mentioned works. We draw some key points from a customer-oriented recommender system and present new contributions. In particular, we extend the previous application by helping SMEs utilize data analysis to answer questions regarding customer preferences and market segments in the movie rental industry. The main contributions of this work are the following:
• A novel machine learning based collaborative filtering recommender system is presented for SMEs. The proposed approach employed two classification models to implement accurate movie genre prediction in quantitative and qualitative aspects.
• An analysis is performed to study how to process the dataset and select representative features to prevent the overfitting issue and improve prediction performance.
• Some suggestions are provided on how to choose machine learning classification algorithms based on samples and features.

III. METHODS
In this section, we describe the methodology behind experiments that were performed.

A. Database
The sample shown in Table I was composed of 38 hypothetical decision-making dimensions (11 for preferences for different movie genres; 27 for demographic information), each rated on an ordinal scale from 1 to 5. The scale was set as follows: 1-strongly bored, 2-bored, 3-whatever, 4-acceptable, 5-strongly interested. For the sake of simplicity, we regard only a rating of 5 as a positive recommendation and all other ratings as negative ones. A sample size of 1100 respondents with quota characteristics enables the research study to be generalized to a young population.
Before analyzing the sample of 1100 respondents, we first handled missing values in the personal questionnaire by discarding incomplete observations. Data were considered sparse when expected values in the dataset were missing. Observations with missing information, either recorded as "N/A" or left completely blank, were deleted from the data set to avoid skewing the results. This initial round of cleaning provided 1008 complete responses.
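The cleaning step and the binarization of the 1-5 scale can be sketched as follows. This is a minimal sketch on made-up rows; the column names ("Comedy", "Age", "Gender") and the set of missing-value markers are assumptions for illustration, not the survey's actual schema.

```python
# Markers treated as "missing" (assumed: blank cells, None, or the string "N/A").
MISSING = {None, "", "N/A"}

def clean(rows):
    """Keep only observations in which every field is present."""
    return [r for r in rows if not any(v in MISSING for v in r.values())]

def to_label(rating):
    """Binary target: 1 for a rating of 5, 0 for ratings 1 to 4."""
    return 1 if int(rating) == 5 else 0

# Hypothetical survey rows.
rows = [
    {"Comedy": 5, "Age": 20, "Gender": "female"},
    {"Comedy": 4, "Age": "N/A", "Gender": "male"},    # dropped: missing age
    {"Comedy": 3, "Age": 22, "Gender": "male"},
    {"Comedy": None, "Age": 19, "Gender": "female"},  # dropped: missing rating
]

complete = clean(rows)
labels = [to_label(r["Comedy"]) for r in complete]
print(len(complete), labels)
```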

B. Investigate and Select Features
We identified predictors that separated classes well by plotting different pairs of predictors on scatter plots. The plots helped us investigate class separation and decide which predictors to include or exclude. For example, the features "PC" and "Finance" did not separate movie genre preferences into two classes, and thus both were excluded from the useful features. Based on the scatter plots, the predictors listed in Table II were excluded.

C. 5-Fold Cross Validation
Preference for comedy was used as the response variable; 31 predictors were used as independent variables. One way to prevent overfitting was to not use the entire data set when training the classifier. Part of the data was removed before training began. When the training process was finished, the removed data were used to test the performance of the learned model on "unknown" data. The training samples were randomly partitioned into five equal parts, the so-called 5-fold cross validation. Each time, one of the five subsets was used as the validation set and the other four subsets were put together to form a training set. Then the average error across all five trials was computed. The advantage of 5-fold cross validation was that the result depended less on how the data were divided: each observation was used in a validation set once and in a training set four times, which reduced the variance of the estimated error. The average of the validation errors is called the cross validation error. We used the cross validation error in the model selection process.
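The procedure can be sketched as below. This is an illustrative sketch only: the `fit` and `predict` callables are placeholders for any classifier, the majority-class model stands in for the real ones, and the sketch assumes there are at least as many samples as folds.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Randomly partition the indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(X, y, fit, predict, k=5, seed=0):
    """Average misclassification rate over the k validation folds."""
    folds = k_fold_indices(len(X), k, seed)
    errors = []
    for i, val in enumerate(folds):
        # Training set: every fold except the i-th one.
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        wrong = sum(predict(model, X[j]) != y[j] for j in val)
        errors.append(wrong / len(val))
    return sum(errors) / k

# Placeholder model for the demo: always predict the majority training class.
def fit_majority(X, y):
    return int(2 * sum(y) >= len(y))

def predict_majority(model, x):
    return model

X_demo = [[i] for i in range(10)]
y_demo = [1] * 10
print(cross_val_error(X_demo, y_demo, fit_majority, predict_majority))
```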

D. Least Absolute Shrinkage and Selection Operator
The least absolute shrinkage and selection operator (LASSO) was used to select and regularize variables in the training.
LASSO solves the problem:

$$\min_{\beta_0,\,\beta}\left(\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{T}\beta\right)^{2}+\lambda\sum_{j=1}^{p}\left|\beta_j\right|\right) \qquad (1)$$

where $N$ is the number of observations; $y_i$ is the response at observation $i$; $x_i$ is the data, a vector of $p$ values at observation $i$; $\lambda$ is a nonnegative regularization parameter; the parameters $\beta_0$ and $\beta$ are a scalar and a vector of length $p$, respectively. The algorithm that LASSO used is based on the Alternating Direction Method of Multipliers (ADMM) [27]. Due to space limitations, no further details of the algorithm are presented here. The interested reader is referred to [28], [29] for a more complete description.
The LASSO algorithm was given as input an 800-by-27 matrix of predictors that included redundant variables. The algorithm returned zero coefficients for "Happiness in life" and "Internet Usage". This meant that LASSO identified these two predictors as redundant in the samples, and so we removed them from the training. Experimental evaluations have shown that using LASSO improves the prediction accuracy of preferences [30].
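How LASSO drives the coefficient of a redundant predictor exactly to zero can be illustrated with a small solver. Note that this sketch uses cyclic coordinate descent with soft-thresholding, not the ADMM solver cited above; it minimizes the usual objective (1/(2N)) Σ (y_i − β0 − x_i·β)² + λ Σ |β_j|, and the toy data and the value of `lam` are assumptions for the example.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the LASSO objective with an intercept."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Intercept update: mean of the residuals computed without beta0.
        beta0 = sum(
            y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)
        ) / n
        for j in range(p):
            rho = z = 0.0
            for i in range(n):
                # Prediction that leaves out feature j.
                pred_wo_j = beta0 + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta0, beta

# Toy data: y depends only on the first column; the second is redundant.
X_toy = [[-2, 1], [-1, -2], [0, 2], [1, -2], [2, 1]]
y_toy = [-5, -2, 1, 4, 7]  # y = 3 * x1 + 1

b0, b = lasso_cd(X_toy, y_toy, lam=0.5)
print(b0, b)
```

The second coefficient is returned as exactly zero, which is the signal used in the text to drop a predictor, while the first coefficient is shrunk slightly below its unregularized value of 3.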

E. Logistic Regression Model
A logistic regression classifier was trained to classify movie preference (very interested or not) using 800 observations from a training data set. As a statistical method, logistic regression analyzed a dataset in which 25 independent variables determined an outcome. The outcome in logistic regression is a dichotomous dependent variable with only two possible values: 1 (TRUE, interested, etc.) or 0 (FALSE, not interested, etc.).
The objective of the logistic regression algorithm was to find the best fitting model to describe the relationship between the dependent variable (movie genre preference) and a set of independent predictor variables (customers' information). Logistic regression generated the coefficients to predict a logit transformation of the probability that the preference for comedy is present. Due to space constraints, no further details of logistic regression are presented here. The interested reader is referred to [31], [32] for a more complete description.
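A minimal sketch of such a classifier is given below. It fits the coefficients by plain batch gradient descent on the log-loss, which is an assumption made for brevity and not necessarily the fitting procedure used in the experiments; the learning rate, iteration count and one-feature toy data are likewise illustrative.

```python
from math import exp

def sigmoid(z):
    """Inverse of the logit transform, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Batch gradient descent on the average log-loss; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(model, x):
    """Estimated probability that the outcome is 1 (e.g. "interested")."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy data: one predictor, outcome 1 for positive values.
X_lr = [[-2.0], [-1.0], [1.0], [2.0]]
y_lr = [0, 0, 1, 1]
model_lr = fit_logistic(X_lr, y_lr)
print(predict_proba(model_lr, [2.0]), predict_proba(model_lr, [-2.0]))
```

Thresholding the returned probability at 0.5 gives the dichotomous outcome described above.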

F. SVM Model with Gaussian Kernels
The SVM classifier obtained by solving the convex Lagrange dual of the primal max-margin SVM formulation is as follows:

$$f(x)=\sum_{n=1}^{N}\alpha_n y_n K(x_n,x)+b \qquad (2)$$

where $N$ is the number of support vectors, $x_n$ is the $n$th support vector with class label $y_n$, $\alpha_n$ is its coefficient, $K$ is the kernel function and $b$ is the bias term. Instead of imagining the original features of each data point, we considered a transformation from the original input space to a new feature space.
The data point had 25 features, one for each support vector. The value of the nth feature was equal to the value of the kernel between the nth support vector and the data point being classified. In this space, the original SVM classifier was just like any other linear discriminant.
Note that after the transformation, the original features of the data point were irrelevant. It was represented only in terms of its dot products with support vectors (which are basically special data points chosen by the SVM optimization algorithm). The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, is given as:

$$K(x,x')=\exp\left(-\gamma\lVert x-x'\rVert^{2}\right) \qquad (3)$$

where $\gamma>0$ sets the width of the Gaussian. The resulting hypothesis is a linear combination of Gaussians centered at the support vectors:

$$g(x)=\operatorname{sign}\left(\sum_{n=1}^{N}\alpha_n y_n \exp\left(-\gamma\lVert x-x_n\rVert^{2}\right)+b\right) \qquad (4)$$

So, the final hypothesis from the Gaussian SVM algorithm used the Gaussian kernel and found the coefficients in the linear combination, together with the Gaussian functions on those support vectors. The bias term $b$ was a feasible intercept based on the constraint conditions.
The SVM classifier with the Gaussian kernel is simply a weighted linear combination of the kernel function computed between a particular data point $x$ and each of the support vectors $x_n$. The role of a support vector in the classification of a data point is tempered by its coefficient $\alpha_n$ and by the kernel value $K(x_n,x)$.
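The weighted combination described above can be made concrete with a short sketch. It assumes the standard kernel SVM decision function f(x) = Σ α_n y_n K(x_n, x) + b with the Gaussian kernel K(x, z) = exp(−γ‖x − z‖²); the support vectors, coefficients, bias and `gamma` below are hypothetical values chosen by hand, not the result of training.

```python
from math import exp

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, alphas, labels, b, gamma):
    """Weighted sum of kernel values between x and each support vector."""
    return b + sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, a, y in zip(support_vectors, alphas, labels)
    )

def classify(x, support_vectors, alphas, labels, b, gamma):
    """Sign of the decision value, returned as +1 or -1."""
    value = decision_value(x, support_vectors, alphas, labels, b, gamma)
    return 1 if value >= 0 else -1

# Hypothetical trained quantities: two support vectors with labels -1 and +1.
svs = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [-1, 1]
bias = 0.0
gamma = 0.5

print(classify([1.8, 2.1], svs, alphas, labels, bias, gamma))
print(classify([0.2, -0.1], svs, alphas, labels, bias, gamma))
```

A point close to a support vector receives a kernel value near 1 from it and near 0 from distant ones, so the nearby support vector dominates the vote.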

G. Training Models
To programmatically train a classifier, we followed the flow chart in Fig. 1 to automate the training of the logistic regression model and the Gaussian kernel SVM model with the training data. We input a table containing the predictors and the response and obtained the trained classifier and its accuracy as output.

IV. EVALUATION
In this section, we describe the results of our experiments with the machine learning models. The goal of our evaluation was to determine whether the system was able to predict the genres that customers like most. Throughout the paper, accuracy is expressed using Receiver Operating Characteristic (ROC) curves and the area under the ROC curve (AUC). AUC is widely used in social sciences. An AUC of 0.5 corresponds to a classifier that does no better than random guessing, and an AUC of 1 corresponds to a classifier that is always correct [32].
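One way to compute the AUC without drawing the curve is the pairwise (rank-based) formulation: the AUC equals the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses that formulation; the labels and scores are made-up values for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four customers.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```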

A. Test Resources
In order to demonstrate the robustness of the predictions obtained using the two proposed classifiers, 208 test samples from the same survey were used to validate our classification algorithms and to provide a benchmark for comparing prediction accuracy between the two classifiers.

B. Test Results and Discussion
We classify each response into one of four classes as below:
• True Positive (TP): the system suggests that customers like comedy very much, and the customers rated comedy with a five.
• True Negative (TN): the system suggests that customers will dislike comedy, and the customers rated comedy with a four or lower value.
• False Positive (FP): the system suggests that customers like comedy very much, and the customers rated comedy with a four or lower value.
• False Negative (FN): the system suggests that customers will dislike comedy, and the customers rated comedy with a five.
In the movie genre recommendation scenario, precision is computed as the number of cases in which the system correctly predicts comedy as a genre that the customer will like to watch (TP) divided by the total number of positive comedy recommendations. The general formula for Precision is:

$$\mathrm{Precision}=\frac{TP}{TP+FP} \qquad (6)$$

The Recall is defined as the number of retrieved relevant resources divided by the number of relevant resources. In the genre recommendation system, it can be computed by dividing the number of genres correctly recommended by the total number of genres that are worth recommending. The formula for Recall is as follows:

$$\mathrm{Recall}=\frac{TP}{TP+FN} \qquad (7)$$

Accuracy stands for the fraction of resources predicted as positive or negative for which the prediction was correct [6]. We use the following formula for accuracy:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \qquad (8)$$

Different applications give different priorities to precision and recall. In our application, recall was regarded as more important than precision, as it was acceptable to increase the number of false positives (FP). Some customers may also like to rent or buy comedy movies for entertainment, even though their responses of 4 (acceptable) on the scale were regarded as negative in our algorithm.
The test performance summary without movie genre preference predictors is listed in Table III. Comparing the results in Fig. 2 obtained from the Gaussian SVM and logistic regression classifiers, the Gaussian SVM clearly performed better than the logistic regression classifier on the positive class, in terms of both error-in-sample and error-out-sample. In the 800 training experiments, maximum positive prediction accuracies of 88% and 84% were obtained using the Gaussian kernel SVM classifier and logistic regression, respectively.
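The three measures follow directly from the four counts defined above. The sketch below computes them for binary labels, where 1 stands for a rating of five and 0 for a rating of four or lower; the label vectors are made-up values for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical true labels and predictions for six customers.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
```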
In a comparison of Tables IV and V, the movie genre preference data did not improve the accuracy as we expected but led to a surprising drop in prediction accuracy. After we discarded the movie preference data in training and testing, the data dimension decreased. In 800 new training experiments without these preference data, maximum positive prediction accuracies of 83% and 66% were obtained using the Gaussian kernel SVM classifier and logistic regression, respectively. In 208 new tests, maximum positive prediction accuracies of 93% and 90% were obtained using the Gaussian kernel SVM classifier and logistic regression, respectively. Thus, when the data dimension and the VC-dimension became larger, the error-in-sample was lower but the error-out-sample was higher; in other words, the model generalized poorly and overfitting occurred. Moreover, random noise in the collected data also contributed to overfitting. Besides the noise in the training data, the complexity of the target function also acted like noise; thus, a larger VC-dimension needs more training data to avoid overfitting.
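The link between VC-dimension, sample size and the gap between the two errors can be made explicit with the standard VC generalization bound, quoted here from the general learning-theory literature rather than derived in this paper. With probability at least $1-\delta$ over a training set of size $N$, for a hypothesis set with VC-dimension $d_{VC}$:

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g)
  + \sqrt{\frac{8}{N}\,\ln\!\frac{4\left((2N)^{d_{VC}}+1\right)}{\delta}}
```

For a fixed $N$, the penalty term grows with $d_{VC}$, so a richer model can reach a lower error-in-sample while the guarantee on error-out-sample becomes weaker; increasing $N$ shrinks the penalty again. The bound is loose in practice and is given only to indicate the direction of the effect.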
To prevent overfitting when machine learning used movie genre preference information to find the target function in the hypothesis set, more data or observations and less noise in the training data were necessary.Otherwise, discarding genre information in the training was a better choice.
From the studies that we performed, we recommend using a Gaussian kernel SVM rather than logistic regression when the number of features is small (for example, fewer than 1000) and the number of training samples is intermediate (for example, between 100 and 10000).

V. CONCLUSION
From the studies that were performed, we can conclude that our application of machine learning techniques to movie genre prediction is quite successful. The geographic factors contain more information about movie preference than we can obtain from the usual classification by sex. The experiments showed that the listed geographic information can be used to accurately predict customer movie genre preferences, for example, for comedy movies. As the paper shows, the features chosen for machine learning had an impact on the prediction accuracy; however, some features were redundant predictors in the samples and need to be found and removed by algorithms. The experiments also compared the prediction results between small and large VC-dimensions and showed that the accuracy was negatively correlated with the VC-dimension when the amount of sample data was not sufficient. The lower accuracy of the high VC-dimension classifier confirmed that more predictors can result in overfitting and negatively influence prediction accuracy when the amount of training data is not sufficient. The findings showed that the Gaussian kernel SVM algorithm with a proper number of predictors significantly enhanced the power to predict user preference for a movie genre. Therefore, the proposed machine learning algorithm allows us to overcome the shortcomings of a traditional recommender system based on massive data; it could be enhanced by more local customer samples and be used particularly by local SMEs in the movie rental industry.
We foresee our future research on genre recommendation in two main directions. First, we would like to experiment with classifying preference according to a five-star rating rather than a simple binary classification scheme. Second, we would like to use additional text features. The attributes used in our recommendations might be insufficient for recommending movie genres. Additional features related to review comments could be incorporated to build a more accurate recommendation model.

Fig. 2 (a) Medium Gaussian SVM Positive class prediction results; (b) logistic regression Positive class prediction results

TABLE I. RESEARCH DATA SAMPLING METHODOLOGY

TABLE II. PREDICTORS THAT WERE EXCLUDED BECAUSE THEY CANNOT SEPARATE THE COMEDY GENRE PREFERENCE INTO TWO CLASSES

TABLE IV. POSITIVE PREDICTION ERROR IN TRAINING AND TESTS WHEN DATA INCLUDED MOVIE GENRE PREFERENCE SAMPLING

TABLE V. POSITIVE PREDICTION ERROR IN TRAINING AND TESTS WHEN DATA DID NOT INCLUDE MOVIE GENRE PREFERENCE SAMPLING