Fake News Detection Using Machine Learning

The emergence of the World Wide Web and the rapid uptake of social media platforms (such as Facebook, Instagram, and WhatsApp) have made it possible for information to be disseminated at a scale never before seen in human history. Online hiring has changed recruitment patterns: posting job adverts on corporate websites and career portals opens access to a sizable pool of qualified candidates worldwide. Unfortunately, these channels have become yet another forum for scammers, which threatens applicants' privacy and harms the reputation of companies. This case study addresses the detection of recruitment fraud and scams. Three machine learning models are used to construct an effective recruitment fraud detection model that incorporates a number of significant organizational, job-description, and remuneration features. The proposed system uses three machine learning approaches, Support Vector Machine, Random Forest, and the Naive Bayes classifier, to determine whether a job posting is genuine or fake. Features were extracted from the data using two techniques: Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW). An ensemble model is created by training the three independent models on different segments of samples and taking a simple majority vote of their outputs to determine the final predictions. All three models achieve decent results, and an accuracy of more than 98.18% has been achieved with the Random Forest model.


INTRODUCTION
A decade or so ago, people relied on sources like newspapers and television for news; Doordarshan and All India Radio were used by almost everyone. News came from reliable sources and was delivered only at fixed times. With globalization and digitization came the era of smart devices. As technology advanced, people could afford devices through which they could send and receive digitized data, and dependence on digitized data grew. Innovative ideas emerged, and social media became a medium for easier communication. Gaps between people and nations diminished, and people were connected to each other globally through social networks. These networks became platforms for networking and business: finding customers and reaching out to the right people became possible through the new communication media, and new business ideas and advertising techniques followed. People once walked miles to spend quality time in libraries and subscribed to weekly and monthly magazines; now they read blogs and articles online and subscribe to email newsletters, and even established newspapers maintain sites where they upload daily e-papers.
Along with this enormous advancement in communication technology, there were also weeds in the farm. Here, weeds denote the unrealistic data or information that circulates and deceives common people. There are many reasons why fake news is spreading among people day by day. One is to put an enterprise or a public figure down so that they lose their image and suffer losses. In such cases, the fake news is crafted by detractors to be so appealing to the common reader that anyone would believe the misinterpretation being spread intentionally. Individuals in media companies, such as journalists, face a great challenge, as their high-quality vocabulary and work are called into question because of fake news. Some hoaxes provoke unnecessary public debate, causing fear and riots. The trustworthiness of data on the web has therefore risen as a paramount issue of present-day society, and the challenge is to identify such deceptive news. Since the topic is not new, many researchers believed that this problem could be tackled with the help of machine learning [1]. The use of artificial intelligence in classification problems has increased due to the availability of required resources such as affordable hardware components and larger datasets [1]. Platforms like WhatsApp have evolved to flag forwarded information to the user, indicating that it cannot necessarily be trusted.
The main aim here is to develop a system that takes in a dataset and shows users which classifier among Support Vector Machine, Logistic Regression, and the Naive Bayes classifier most accurately classifies the data as fake or true. The analysis presents the confusion matrices with respect to the TF and TF-IDF features for both the Liar-Liar and ISOT datasets for each algorithm. Precision, recall, and F1-scores for the corresponding confusion matrices are also displayed.

METHODOLOGY
The end result of the project is to identify the most accurate algorithm among the SVM, Logistic Regression, and Naïve Bayes classifiers, reporting each one's accuracy and confusion matrix. These algorithms do not cope well with noisy, incomplete, and inconsistent data, so the content must first be transformed into a form the classifiers can handle. The first step the system undergoes is therefore pre-processing. Several practices add value to a classifier's performance: having more data for training, using longer news articles, removing stop words, and including stemming in the pre-processing stage [1]. In the pre-processing step, stop-word removal, punctuation removal, and conversion of text to lowercase are performed, leaving the useless content behind. Once this is done, feature extraction is initiated to make classification with respect to the authenticity of news easier for the algorithms. Many features could be extracted for the detection of fabricated news, among which three serve best: TF-IDF using bi-gram frequency, probabilistic context-free grammars (PCFGs) or syntactic features, and the TF feature. Each feature used independently yields a different accuracy, and TF-IDF outperforms the others [4]. The features extracted in this project are TF and TF-IDF [8]. Term frequency identifies the words that occur the greatest number of times in a sentence; in plain TF, the weight is distributed equally across words. To learn which words carry more weight in a sentence, TF-IDF is used. These two features play the main role in the detection of deceptive news.

Datasets
The ISOT Fake News Dataset and the Liar-Liar dataset are used throughout the analysis. Two datasets of differing size were used to compare the outcomes of the machine learning models. The first, the 'Liar-Liar Dataset' [17], contains about 12.8K news articles. The second, the 'ISOT Fake News Dataset' [18][19], contains about 31K articles. Both datasets have statement and label columns.

Naive Bayes Classifier
Naive Bayes is a reliable and straightforward classification approach, well suited to natural language processing. It is based on the concept of probability and Bayes' theorem and is used to predict the unknown class, with the features assumed to be independent of each other. This assumption is represented mathematically as

P(A|B) = P(B|A) · P(A) / P(B)

where the prior probabilities P(A) and P(B) are the probability of the hypothesis A being true and the probability of the data respectively, and P(A|B) and P(B|A) are the posterior probability of hypothesis A given the data B and the probability of data B given that hypothesis A is true. The prior probabilities of the classes and then of the features are calculated, followed by the posterior probability. Finally, the article is assigned to the class with the highest posterior probability.
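The posterior computation above can be illustrated with a toy example. All numbers here (class priors and per-word likelihoods) are made-up values for illustration only; under the independence assumption, the posterior score for a class is proportional to its prior times the product of the word likelihoods:

```python
# Toy Naive Bayes posterior computation for two classes (fake / real).
# P(class | words) ∝ P(class) * Π P(word | class) under the
# conditional-independence assumption described above.

priors = {"fake": 0.5, "real": 0.5}           # P(A): assumed class priors
likelihoods = {                                # P(word | class): assumed values
    "fake": {"shocking": 0.08, "cure": 0.05},
    "real": {"shocking": 0.01, "cure": 0.02},
}

def posterior_scores(words):
    scores = {}
    for cls in priors:
        score = priors[cls]
        for w in words:
            score *= likelihoods[cls].get(w, 1e-6)  # tiny floor for unseen words
        scores[cls] = score
    return scores

article = ["shocking", "cure"]
scores = posterior_scores(article)
prediction = max(scores, key=scores.get)   # class with highest posterior score
print(prediction)  # -> "fake"
```

In practice P(B) can be ignored, as it is the same for every class and does not change which posterior is largest.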

Logistic Regression
Logistic regression is a simple model for supervised binary classification in machine learning. The model tries to create a connection between the features (independent variables) and the variable to be predicted (dependent variable). It takes a weighted combination of the input features and passes it through a sigmoid function, which performs a probabilistic estimation of the outcome by smoothly mapping any real number to a value between 0 and 1. The value obtained between 0 and 1 as the final outcome is then converted into exactly 0 or 1.
p = 1 / (1 + e^(-z))    (1)

where p is the probability of success; this is the sigmoid function. If p is the probability of success, 1 - p is the probability of failure, which can be written as

q = 1 - p = e^(-z) / (1 + e^(-z))    (2)

where q is the probability of failure. Dividing (1) by (2) gives

p / q = e^z

and after taking the log on both sides we get

log(p / (1 - p)) = z    (3)

Equation (3) is the logistic regression equation.
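A minimal sketch of equations (1)-(3) in code: the sigmoid maps the weighted feature combination z to a probability, which is then thresholded at 0.5 to produce a hard 0/1 label. The weights used here are illustrative values, not coefficients learned from the paper's datasets:

```python
import math

def sigmoid(z):
    """Equation (1): map any real number z smoothly into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias=0.0):
    """Weighted combination of the features passed through the sigmoid,
    then thresholded at 0.5 to give a hard 0/1 label."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return 1 if p >= 0.5 else 0

# Illustrative (assumed) weights over two features.
weights = [1.5, -2.0]
label = predict([2.0, 0.5], weights)   # z = 1.5*2.0 - 2.0*0.5 = 2.0
print(label)  # -> 1
```

Note that log(p / (1 - p)) recovers z exactly, which is equation (3) in code form.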

Support Vector Machine
The Support Vector Machine is a supervised classification algorithm. Its advantage is that it works well in feature spaces with high dimensionality, including cases where the number of dimensions exceeds the number of samples. The algorithm works by finding a separator (hyperplane) between the data of different classes; using this hyperplane and the features of new data, the category of that data can be determined. The extracted features are fed to each of the previously mentioned classifiers separately; the output of each can be inspected and the best result used for the development of the fake news detector.
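The hyperplane decision rule described above can be sketched in a few lines: a point is classified by the sign of w·x + b, its signed distance from the separator. The weight vector w and bias b below are assumed illustrative values, not parameters trained by an SVM:

```python
# Classifying with a separating hyperplane w·x + b = 0, the boundary an
# SVM learns; w and b here are illustrative, not fitted to any dataset.

def svm_predict(x, w, b):
    """The sign of the signed distance to the hyperplane decides the class."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "fake" if score >= 0 else "genuine"

w = [0.8, -0.6]   # assumed hyperplane normal (e.g. over two TF-IDF features)
b = -0.1
label = svm_predict([1.0, 0.2], w, b)   # score = 0.8 - 0.12 - 0.1 = 0.58
print(label)  # -> "fake"
```

In a real pipeline the training step (finding the maximum-margin w and b) would be delegated to a library implementation rather than hand-set as here.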

RESULTS
In this project, the performance of the classification models is compared by means of precision, recall, and F1 score on two different datasets for two different features, TF and TF-IDF. Precision is the ratio of true positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true positives and false negatives in the confusion matrix. The F1 score, also called the F-measure or F-score, gives a combined, balanced score between precision and recall: it is their harmonic mean. In the analysis of the Liar-Liar dataset, Naïve Bayes gave better accuracy for the TF feature, whereas Logistic Regression gave better accuracy for the TF-IDF feature. The second dataset, ISOT, gave better results than the Liar-Liar dataset for the same features, because the ISOT dataset is almost three times the size of the Liar-Liar dataset.
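The three metrics defined above follow directly from the confusion-matrix counts. The counts below are illustrative values, not results from the paper:

```python
# Precision, recall, and F1 from confusion-matrix counts (assumed values).
tp, fp, fn = 90, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)                            # TP / (TP + FP)
recall = tp / (tp + fn)                               # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
# -> 0.9 0.818 0.857
```

The harmonic mean penalizes imbalance: F1 is high only when precision and recall are both high, which is why it is preferred over a simple average when comparing classifiers.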

Figure 1. Distribution of data in Liar-Liar Dataset

Figure 6. Confusion matrix of Logistic Regression

Table II. Classification report for ISOT Dataset

Table II depicts the precision, recall, and F1-score of the three different classifiers for the TF and TF-IDF features.