Supervised Learning Model for Kickstarter Campaigns With R Mining

Web-mediated crowdfunding is a promising paradigm used by project launchers to solicit funds from backers in order to realize projects. Kickstarter is one of the largest such funding platforms for creative projects. However, not all Kickstarter campaigns reach their funding goal. It is therefore important to estimate a campaign's chances of success. As a broad goal, the authors set out to extract hidden knowledge from the Kickstarter campaign database and to classify projects based on their dependency parameters. To this end, the authors designed a classification model for the analysis of Kickstarter campaigns using direct information retrieved from Kickstarter URLs. The model helps to identify a campaign's likelihood of success.


INTRODUCTION
Data mining offers a variety of tools and algorithms for analyzing data patterns. The authors explored the effectiveness of machine learning algorithms for building classifiers that determine the success of a project launch. This paper describes a data mining process for investigating the relationship between the result of a project launch (success, partial, failed) and a set of attributes describing the project, using the R environment and selected R packages for data analysis [10]. The reported research focuses on the design of a classification model for Kickstarter project launches by assessing different classifiers experimentally.
The purpose of this project was to develop a system that applies machine learning techniques to a dataset of Kickstarter campaigns in order to classify projects [6,8,14]. To do this, the authors trained different classifiers on project data. This approach required training data constructed by referring to Kickstarter projects. The data included characteristic features of each Kickstarter campaign retrieved from project URLs (https://www.kickstarter.com). The study indicates that project properties play a vital role in predicting success.

PRIOR ART: EMERGING RESEARCH DIRECTIONS
This section surveys the most relevant studies carried out in this field to date and explains them with respect to the present research. The review draws on about 25 research papers; a selection is discussed here to give a broad overview.
Chen et al. developed a system to predict the success or failure of a Kickstarter project before its completion. For this purpose they trained a support vector machine (SVM) on campaign data [4]. The dataset includes data retrieved from Kickstarter projects as well as from social media sources. The final classifier of this model predicts a campaign's final outcome with 90% accuracy. This research finds that project properties are important features in determining the success of a project. Etter et al. aimed at developing a method for predicting the success of Kickstarter projects using direct information and social media [7]. They classified campaigns as probable successes or failures based on the time series of money pledged and on information retrieved from tweets and Kickstarter's projects graph. The authors showed the importance of social features in predicting the success of projects.
Researchers from the Georgia Institute of Technology, Atlanta, explored the features that lead to successfully funded Kickstarter projects [13]. This study revealed that the language used in a campaign has a surprising predictive power of 58.56%. The authors explained how predictive phrases, along with the control variables, can help backers and project creators make the best use of their time and money. Aleyasen developed a model to predict the success of a project based on project information. To build the classifier, the researcher used a dataset of Kickstarter campaigns [1]. This model predicts the success or failure of a project with 73% accuracy based on the description text. The work includes a web interface that allows a creator to enter campaign details and provides feedback based on the classification result.
Researchers have designed a tool that gives project creators feedback about their campaigns [12]. To accomplish this, they applied various machine learning classification algorithms to crowdfunding projects at the time of launch. The tool predicts whether a campaign will be successful with 68% accuracy. Its outcome is a prediction engine that can be used to guide project creators. Another study analyzed the factors affecting campaign results [19]. That work targets the project page content and the usage patterns of project updates.
Semantic analysis was applied, and it revealed discrepancies between the intent of project updates and their use in practice. The analysis shows that the impact of updates, rather than project details, had the stronger association with campaign success. Another paper, by Rakesh et al., explains the features determining a project's success [17]. They expanded project features into temporal behaviour, personal behaviour, geo-location behaviour, and social network behaviour. Using a comprehensive dataset, the researchers provided insights into these features and their effects on the success of campaigns. A further study examined the dynamics of Kickstarter and the impact of social networks on it [11].
The literature review reveals that the development of classification models for Kickstarter campaigns has been an emerging area of research in the current decade [2]. In this context, the authors aimed at retrieving hidden knowledge from Kickstarter campaigns and classifying these projects based on their features. To accomplish this, the authors designed classifiers for the analysis of Kickstarter campaigns using direct information available online.

STRUCTURAL DESIGN OF RESEARCH FRAMEWORK
The structure of the paper follows the framework of a data mining process, as shown in Figure 1. The reported research applied a variety of machine learning classifiers to learn the concept of an online crowdfunding project. The research framework follows a five-step procedure:
1. Problem definition: analysis of the relative importance of campaign details for success in reaching the funding goal.
2. Kickstarter project pages are parsed directly to obtain many of the project properties, and the preprocessing required for mining is performed [18].
3. The dataset is preprocessed as required by the machine learning algorithms.
4. In the supervised learning phase, classifier models are developed that associate the class variable with the explanatory variables, using a training set randomly selected from the dataset.
5. The performance of each classifier is evaluated and the "best" one is selected. This step checks the trained classifiers against a testing set, evaluating their predictive accuracy on data not used in the training step.
The present research designed the data mining model in R, a free software environment [3]. R is a simple but very powerful tool for data mining and statistical data processing in research [16]. The mining code presented here was developed and tested using R version 3.1.2, and the corresponding scripts are given here. The steps carried out in this data mining process and the R packages used are summarized in Table 1.

DATA MINING PROCESS
This section elaborates on the data mining process carried out in this research. The implementation of the different classification models and their evaluation with experimental results are explained here.

DATA COLLECTION AND DATASET DESCRIPTION
The authors designed a dataset consisting of project details retrieved directly from Kickstarter project URLs (www.kickstarter.com; www.quandl.com/data). The dataset provides information on over 120 project pages. The structure of a Kickstarter page includes a video, a goal, a project description, a reward structure, and links to social media platforms, among other elements. The values of each campaign's main characteristics are collected and stored in a database [15]. The authors explored a number of features of Kickstarter projects and their related data in order to perform supervised learning. Specifically, we looked at the attributes given in Table 2.

DATA PREPROCESSING
Data preprocessing is applied to identify missing values, noisy data, and irrelevant or redundant information in the dataset. As the classification algorithms used here work on numeric values, categorical variables are converted into numerical form. Kickstarter project categories and the corresponding numerical values assigned are given in Table 3. The presence of a video is denoted as 1 and its absence as 0. The percentage of funding is calculated from the attributes "funding goal" and "amount pledged" [9]. It is classified into five classes, as explained in Table 4. For the purpose of supervised learning, the class variable "result" is treated as the dependent variable. The project density of the different classes is shown in Figure 2.
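The encoding steps described above can be illustrated with a minimal sketch. It is written in Python for illustration only; the authors' own implementation is in R. The category codes and the class boundaries below are hypothetical placeholders, since the actual values are defined in Table 3 and Table 4, and the field names of the project record are likewise assumptions.

```python
# Hypothetical category codes; the paper's actual mapping is given in Table 3.
CATEGORY_CODES = {"Art": 1, "Film & Video": 2, "Technology": 3}

# Hypothetical upper bounds (percent funded) for classes 1..4; the paper's
# actual five classes are defined in Table 4.
CLASS_BOUNDS = [(25.0, 1), (50.0, 2), (75.0, 3), (100.0, 4)]


def percent_funded(goal, pledged):
    """Percentage of the funding goal that was pledged."""
    if goal <= 0:
        raise ValueError("funding goal must be positive")
    return 100.0 * pledged / goal


def result_class(pct):
    """Map a funding percentage to one of five ordinal classes (1..5)."""
    for upper, label in CLASS_BOUNDS:
        if pct < upper:
            return label
    return 5  # goal reached or exceeded


def encode_project(project):
    """Turn one raw project record into a numeric feature row."""
    pct = percent_funded(project["goal"], project["pledged"])
    return {
        "category": CATEGORY_CODES[project["category"]],
        "has_video": 1 if project["has_video"] else 0,
        "goal": float(project["goal"]),
        "result": result_class(pct),  # dependent (class) variable
    }


# Illustrative record, not taken from the paper's dataset.
raw = {"category": "Technology", "goal": 5000, "pledged": 3200, "has_video": True}
row = encode_project(raw)
```

Keeping the class boundaries in a single table-like constant makes it straightforward to substitute the real thresholds from Table 4.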

DATA EXTRACTION AND EXPLORATION - DATASET STATISTICS
Data exploration with R starts with inspecting the dimensionality, structure, and data of an R object. The dataset contains five classes of projects based on the percentage of funding. A data frame is created by reading the dataset. The results of basic statistical computations and the number of observations of each type are reported by the summary function, as given in Figure 3.

BUILDING CLASSIFIERS
The data mining model is developed by building classification rules for the target variable "result". The authors ran the dataset through a variety of classification algorithms. Four classification methods were chosen for the learning step to represent a wide range of approaches in statistics and data analysis; they are explained in this section.
In order to train the classifiers, select their parameters, and evaluate their performance, the dataset is randomly separated into two parts: 70% of the campaigns are selected as the training set and 30% as the validation set, which is used to evaluate the "out-of-sample" performance of the classifiers. The dataset separation is done in R.
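The random 70/30 separation can be sketched as follows. This is an illustrative Python version, not the authors' R code; the function name, the fixed seed, and the stand-in list of 120 campaign identifiers are assumptions made for the example.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training and a validation part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the campaign records: 120 hypothetical campaign identifiers.
campaigns = list(range(120))
train, validation = train_validation_split(campaigns)
```

Every campaign ends up in exactly one of the two parts, so the validation set contains only data the classifiers have not seen during training.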

NAÏVE BAYES CLASSIFIER
The Naïve Bayes classifier is based on a probabilistic model that incorporates strong independence assumptions. It can handle an arbitrary number of independent variables, whether continuous or categorical. The R package e1071 is installed to run the Naïve Bayes classifier. Figure 4 shows the R implementation of the Naïve Bayes classifier.
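To make the independence assumption concrete, the following is a minimal sketch of a Naïve Bayes classifier for categorical features, written in Python for illustration. It is not the e1071 implementation: it handles categorical values only and uses add-one (Laplace) smoothing, which is an assumption of this sketch. The toy rows (video flag, category code) and labels are invented for the example and are not from the paper's dataset.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, row):
    """Return the class with the highest posterior log-probability."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Each feature contributes independently given the class.
            # Add-one smoothing avoids zero probabilities for unseen values.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: (has_video, category_code) -> outcome. Purely illustrative.
rows = [(1, 1), (1, 2), (1, 1), (0, 2), (0, 1), (0, 2)]
labels = ["success", "success", "success", "failed", "failed", "failed"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, (1, 1))
```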

NEURAL NETWORK
The R package nnet provides methods for using feed-forward neural networks with a single hidden layer. For the purpose of machine learning, the dataset is separated into training and validation sets. This allows the artificial neural network (ANN) to be validated on data that it was never trained with. The neural network requires that categorical records be encoded using one-of-n encoding. Input values are normalized to the range 0 to 1, as required by the ANN. The model is trained on the training data with the "nnet" function. The generated model is tested on the test data with the "predict" function. The R implementation of the neural network is given in Figure 5.
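The two input transformations mentioned above can be sketched as follows, in Python for illustration. Min-max scaling is assumed here as the way of normalizing values to the range 0 to 1, since the text does not name a specific method; the function names and sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale numeric values linearly so that they lie between 0 and 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_of_n(value, categories):
    """Encode a categorical value as a vector with a single 1 (one-of-n)."""
    return [1 if value == c else 0 for c in categories]


# Illustrative funding goals and category codes, not from the paper's dataset.
goals = [1000, 5000, 21000]
scaled_goals = min_max_scale(goals)
encoded_category = one_of_n(2, [1, 2, 3])
```

One-of-n encoding avoids implying an order among category codes, which a single numeric column would otherwise suggest to the network.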

RANDOM FOREST
The randomForest package is used for classification by random forest classifiers. For classification, the corresponding method implements Breiman's random forest algorithm. It can also be employed for assessing proximities among data points in unsupervised mode. Figure 6 shows the R implementation of the random forest classifier.
The rpart package is used for classification by decision trees. Recursive partitioning is a fundamental tool in data mining. It explores the structure of a dataset while developing easy-to-visualize decision rules for predicting a categorical value. The resulting models can be represented as binary trees. The R implementation is shown in Figure 7 and the corresponding tree is shown in Figure 8.
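A single step of recursive partitioning can be sketched as choosing the threshold on one numeric feature that best separates the classes. The Python sketch below assumes Gini impurity as the split criterion; it is an illustration of the idea, not the rpart implementation, and the feature values and labels are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Find the binary split on one feature with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy funding goals and outcomes. Purely illustrative.
goals = [1000, 2000, 3000, 8000, 9000, 10000]
outcomes = ["success"] * 3 + ["failed"] * 3
split = best_threshold(goals, outcomes)
```

Applying the same search recursively to the left and right subsets yields a binary tree of decision rules; a random forest builds many such trees on resampled data and random feature subsets.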

MODEL EVALUATION
After the classifiers are developed, the criteria for evaluating them are defined. The performance of the four classifiers for the target variable "result" can be described by the "confusion matrix", a square contingency table. The corresponding evaluation of the classification models is given in Table 5. Classification models are evaluated on the basis of their accuracy, i.e. the percentage of observations correctly classified. It is calculated as the ratio between the sum of the diagonal elements of the confusion matrix and the sample size. The evaluation reveals that the neural network is the most suitable model in this case, and it is therefore adopted as the classifier for Kickstarter campaigns. Figure 9 shows, using a scatter plot, the distribution of the performances of each of the four classifiers. This graph confirms the conclusion drawn from Table 5.
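The accuracy computation described above can be sketched in Python as follows. The actual and predicted labels are invented for illustration and do not reproduce the results in Table 5; the function names are assumptions of this sketch.

```python
def confusion_matrix(actual, predicted, classes):
    """Square table: rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal elements divided by the sample size."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Illustrative labels for a three-class problem.
actual = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 2, 2, 2, 3, 3, 1, 1]
matrix = confusion_matrix(actual, predicted, classes=[1, 2, 3])
acc = accuracy(matrix)
```

The off-diagonal cells show which classes are confused with each other, which accuracy alone does not reveal.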

SIGNIFICANCE AND CONCLUSION
The present research implemented data mining techniques using different R packages. The machine learning algorithms described here are driven by a scraped dataset. We were interested in the design of a classification model for Kickstarter campaigns, and we ran different classifiers on the project dataset. The neural network performed best in this comparison. The results we achieved with the basic set of variables described in Table 1 are encouraging: we are able to apply supervised learning to classify projects accurately.
The main results can be briefly summarized as follows:
1. From our analysis, we determined that project properties play a vital role in predicting success.
2. Model evaluation reveals that the neural network is the most suitable classifier for Kickstarter campaigns.
3. Kickstarter claims that projects with a video report higher success rates than those without. Our results support this claim.