Fake News Detection using Machine Learning

Advancements in technology and easy access to the internet have increased the use of online platforms and social media as sources of information. Online resources offer quick access, but the authenticity of their information cannot be guaranteed. Fake news detection involves identifying reports that are intentionally false, or hoaxes disseminated through mainstream news sources (print and broadcast) or internet social platforms; automatic fake news identification means determining the authenticity of news reports. The detection of fake news is a crucial but complicated task in natural language processing. Our aim is to design and train a model using machine learning techniques for the automatic detection of fake news. The suggested model achieves encouraging results.


I. INTRODUCTION
Fake news and fraud existed even before the modern era of technology and the internet. The most common definition of fake news is "false information that is fabricated to mislead readers with intent." Fabricated information is spread through social media and websites to increase popularity, readership, or psychological well-being [1]. In general, the main purpose is to profit from clickbait: such practices increase ad clicks and, eventually, ad income.
Advancements in technology enhance the use of online platforms as a source of information, as people can easily use them to access news. Fake news affects people's views and choices, as happened in the 2016 US presidential election [2]. Individuals who create false links to mislead other users still exist, and users who use social media sites in good faith continue to suffer from false or misleading news. Eventually, surfing social media sites begins to seem like a waste of time. Readers are still coping with websites whose content impacts their minds badly and draws them into engaging with false news [3].
Most readers read false information and start believing it without verification. Sometimes even well-known and popular websites post flashy but false headlines to increase their web traffic and ad income. Therefore, the detection of fake news or misleading information on social media platforms is a huge challenge [4], [5]. It becomes very difficult to handle the consequences of fake news posted anywhere on the internet if it is not detected at the earliest. Ignorant users share such fake news in their circles without any verification, resulting in a very bad impact on individuals, families, societies, organizations, and governments (e.g., during the COVID-19 pandemic). Researchers are exploring how machine learning approaches can deal with the challenge of fake news detection. Reference [6] finds that fake news is becoming more prevalent over time. Machine learning algorithms, once trained, find misleading information automatically with high probability, and automated false news detection can reduce human effort and time compared with manual detection.
This research work facilitates users by suggesting a way of filtering or detecting website links containing fake or inaccurate information [3]. This study makes a careful selection of features, such as title and content, and examines the results of various models. The accuracy of existing works on real fake news data is not up to par; an additional focus is therefore to provide a bidirectional learning approach that extracts features more effectively than previous models.
Research Question: Can we provide a better approach for detecting fake news on online platforms? The major aim is to detect fake news from online sources using deep learning.
The rest of the paper is structured as follows: Section II discusses existing related studies from the past two to three years. Section III describes the data set and pre-processing techniques. Section IV defines the proposed solution along with the significance of the multiple algorithms used. Section V presents results and discussion. The final section concludes the work.

II. LITERATURE REVIEW
Keya et al. [5] implement deep learning and natural language processing techniques for low-resource languages like Bangla using fake news and headline material. They use F1-score, accuracy, recall, and precision as evaluation measures. They combine pre-trained GloVe embeddings with (i) a convolutional neural network and a gated recurrent unit, and (ii) a long short-term memory (LSTM) network and a convolutional neural network (CNN). They observe that the former achieves a better test accuracy of 98.71%, compared with 93% for the latter. GloVe embeddings are non-contextual; better results could be achieved by using contextual embeddings.
In [7], Ahmed et al. collect real news data from an online news article website and use data available on Kaggle for fake news. They report that nouns and adjectives are used more in the original news dataset and less in the fake news. The problem of storing context is not addressed in this paper. Reference [8] develops a solution to classify fake news on an online social media platform, i.e., Twitter. Initially, logistic regression is used for fake news classification, with an accuracy of 90.37%. Then, Bi-LSTM improves the accuracy to 93.47% due to its ability to store context. Bidirectional Encoder Representations from Transformers (BERT) is even better at storing context and obtains good accuracy compared with Bi-LSTM.
Reference [9] reports a high accuracy of 99% using a decision tree classifier on the ISOT fake news dataset. They use only a fraction of their dataset rather than the complete data, but they handle each complete news item at once by extracting important features instead of dealing with segments. The accuracy achieved is the highest and better than other state-of-the-art models, but this work does not store context.
In [10], Parita et al. collect data from different online media sources. The main goal is to classify whether the collected news is real or fake. For this purpose, the authors use multiple classification techniques such as SVM, Naïve Bayes, and LSTM, depending on the nature of the dataset. LSTM achieves the best accuracy, at 94%. This accuracy could be further improved by using embeddings.
The research work [11] creates a large dataset by collecting manually labelled statements on different subjects from the website "PolitiFact.com", managed by the Poynter Institute. It provides a complete analysis and links to source documents. This is the largest dataset containing fake news on different topics, and it also offers a fact-checking mechanism. The authors present deep learning techniques to extract features and a hybrid CNN to test the data. They observe that hybrid deep learning approaches provide better results on complex data.
Reference [12] presents a review of previous studies on fake news detection and finds that multimodal methods are effective for this purpose. Fake news is available not only as text but also as images, voice, and video on social media; therefore, various content-based approaches are used for better results. Both supervised and unsupervised approaches can be used to detect whether a particular news item is fake.
In [13], Pritika et al. investigate COVID-19-related fake news data in the English language. They use an ensemble technique combining RoBERTa and BERT and achieve an accuracy of 98%. RoBERTa is a pre-trained, optimized form of bidirectional transformers. They observe that the latest models work well for textual data problems; plain transformers, however, are not as good at storing context as BERT.
Reference [14] develops an ensemble classifier to detect fake news using the LIAR dataset. The ensemble techniques used for classification are Random Forest (RF) and Support Vector Machine (SVM). SVM alone achieves 62% accuracy using TF-IDF. The suggested hybrid model (i.e., SVM and RF) outperforms the state-of-the-art model by 2.3% in accuracy. The accuracy score is still quite low, and the problem of storing context is not addressed.
In reference [narayan2021fake], the authors address the problem of fake news detection using a hybrid deep neural network and stacked LSTM. After studying the existing work, they implement GloVe 300d word embeddings with stacked LSTM. They implement the models on two different datasets available on Kaggle. Accuracy is used as the evaluation measure and turns out to be 97% on the training dataset, which is better than existing models.
Table I provides a summary of the fake news detection methodologies proposed in the existing studies.

III. DATA SET AND PREPROCESSING
A. Data Set
The "Fake News" data set used in this work contains the following fields:
• Title: the title of the news article
• Author: the names of the authors of the article
• News: the text of the news
• Label: 1 is used for fake news and 0 for authentic news

B. Data Preprocessing
Data preprocessing is the process of making the data clean for experimentation, and it impacts the performance of our models. As shown in Fig. 1, data pre-processing consists of the following steps:

Fig. 1: Preprocessing Steps
• First, the data is visualized using different measures to check its size and dimensions, and it is stored in a pandas data frame.
• Secondly, redundant and missing values are checked, and rows with missing values are removed.
• Next, everything except alphabetic character strings is removed; this step cleans the data of unnecessary links, numbers, and special characters.
• All stop words are removed because of their high frequencies: they add little to the real data and affect the results.
• All data is converted to lowercase, since text processing is case-sensitive and mixed case would introduce ambiguity.
• The whole data set is tokenized, and stemming is applied to convert derived words to their root words.
• A corpus of unique words is created for encoding. To create a feature matrix, we convert the data to one-hot encoded vectors whose size equals the vocabulary size.
• N-grams, term frequency–inverse document frequency (TF-IDF), word2vec, and a few more recent techniques are the most commonly used word embeddings for vectorizing the news content features. Our work uses unidirectional representations such as n-grams, TF-IDF, and word2vec.
Fig. 2 displays the most frequent words present in the dataset.
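A minimal Python sketch of this preprocessing pipeline is shown below; the file path, column names, vocabulary size, and sequence length are illustrative assumptions, not the exact values used in our experiments.

```python
import re
import pandas as pd
from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 5000                            # assumed vocabulary size
MAX_LEN = 25                                 # assumed padded sequence length
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Keep alphabetic strings only, lowercase, drop stop words, then stem."""
    text = re.sub("[^a-zA-Z]", " ", str(text)).lower()
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

df = pd.read_csv("train.csv")                # hypothetical path to the data set
df = df.drop_duplicates().dropna()           # remove redundant and missing rows
df["clean"] = df["title"].apply(clean_text)  # assumes a 'title' text column

# One-hot encode each document against the vocabulary and pad to equal length
encoded = [one_hot(doc, VOCAB_SIZE) for doc in df["clean"]]
X = pad_sequences(encoded, maxlen=MAX_LEN, padding="pre")
y = df["label"].values                       # 1 = fake, 0 = authentic
```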

IV. CLASSIFICATION MODELS
After preprocessing, multiple machine learning models such as Naive Bayes, logistic regression, and decision trees are trained and tested on the test data. Since news texts are long and classical machine learning algorithms do not resolve context dependency, some recent work has improved results using sequential models such as Recurrent Neural Networks (RNN) and LSTM. Sequential models use contextual data instead of focusing on a single word. Deep learning has also introduced bidirectional models that involve context from both sides of a word, making them more effective for NLP. RNNs are far better than classical machine learning models but still have issues with long-term dependencies, which LSTM resolves. We use a bidirectional, sequential model to address long-term dependency.

Logistic regression is important in fake news detection because it is a widely used and effective algorithm for binary classification tasks, and it is often used to classify whether a piece of information is real or fake. Since this is fundamentally a binary classification problem, logistic regression is a natural first choice. It is simple and interpretable: the algorithm produces a coefficient for each input feature representing the strength and direction of its relationship with the output variable. These coefficients can be easily interpreted and used to identify the most important features for predicting whether a piece of information is real or fake.

Naive Bayes works by calculating the probability of a given piece of text belonging to a particular class (in this case, real or fake news) based on the occurrence of each word in the text. It assumes that each feature (i.e., each word) is independent of all other features, which allows it to make fast and accurate predictions even with a large number of features. In the context of fake news detection, Naive Bayes can estimate the probability of a piece of text being fake based on the occurrence of certain words or phrases that are common in fake news articles. This is particularly useful for identifying patterns and characteristics common among fake news articles.

We use decision trees because they are an important tool in fake news detection owing to their interpretability and versatility. They can help researchers and practitioners better understand the factors that contribute to the spread of fake news and develop effective strategies for identifying and combating it. Another advantage of decision trees is their ability to handle both categorical and numerical data, which makes them well suited to analyzing the textual and contextual features of news articles and social media posts. This is particularly important in fake news detection, where a wide variety of features can distinguish real from fake news, including linguistic features such as sentiment, word frequency, and grammar, as well as contextual features such as the source of the information and the time and place of publication.
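As a concrete illustration, the sketch below trains these classical models on TF-IDF features with scikit-learn. It assumes the cleaned text and labels produced by the preprocessing sketch above; the vocabulary size, split ratio, and other hyperparameters are illustrative rather than the exact settings of our experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# df["clean"] and df["label"] are assumed to come from the preprocessing step.
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(df["clean"])
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["label"], test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)                    # train on the TF-IDF features
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")
```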
LSTMs are a type of neural network designed to model the temporal dynamics of sequential data. They are able to capture long-term dependencies, making them well suited to analyzing text data, which often has complex structure and long-range dependencies. We use LSTM to analyze the text of articles and social media posts to identify patterns indicative of fake news, such as common linguistic cues: the use of emotive language, misleading headlines, and a lack of evidence to support claims.
The main focus of this work is to use machine learning and a few deep learning models, compare their results, and identify which model is the better classifier for this problem and data set. We classify news as either authentic or misleading. LSTM with pre-trained word embeddings outperforms the other models and achieves the best accuracy of 99%. Table II shows the results of our implemented models.

V. RESULTS AND DISCUSSIONS
After preprocessing, multiple machine learning models (Naive Bayes, logistic regression, decision tree, and random forest) are trained and then evaluated on the test data. We also implement a bidirectional, sequential model, LSTM with BERT embeddings, to resolve long-term dependency. To implement the models and obtain the results, NumPy version 1.19.5 and TensorFlow are used. We use the confusion matrix, F1-score, and accuracy as evaluation metrics in this paper.
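For reference, these evaluation metrics can be computed with scikit-learn as in the following sketch, where model, X_test, and y_test are assumed to come from the classical-models sketch above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = model.predict(X_test)   # 'model' is any fitted classifier from above
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```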

A. Logistic Regression
Logistic regression is a classification model, and fake news detection is purely a classification problem. The main concern is to discriminate between actual and fake news, which is why logistic regression is used first. It can be used to classify whether a news item or statement is real or fake. It uses "label" as the dependent variable and the other fields, such as title, text, and author, as independent variables. If the label is predicted as '1', the news is fake; otherwise it is real. Logistic regression provides 72% accuracy.

B. Naive Bayes
The Naive Bayes classifier calculates the probability of each class given the observed evidence and compares them. The probability of every event is computed, and then the overall probability of the news with respect to the data set is computed. Since logistic regression alone was not enough to achieve the best accuracy, we use a Naïve Bayes classifier to classify whether a news item or statement is real or fake. We use the multinomial variant here and analyze the results with a confusion matrix. The classifier uses Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B). We get an accuracy score of 67%, F1-score of 64%, precision of 64%, and recall of 58% on the testing data set.
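As a toy numeric illustration of Bayes' theorem applied to a single word feature, consider the sketch below; all probabilities are invented for demonstration, not estimated from our data set.

```python
# Toy application of Bayes' theorem to one word feature:
# P(fake | word) = P(word | fake) * P(fake) / P(word)
p_fake = 0.5                 # assumed prior: half the training articles are fake
p_word_given_fake = 0.08     # assumed: the word appears in 8% of fake articles
p_word_given_real = 0.01     # assumed: the word appears in 1% of real articles

# Total probability of seeing the word at all (law of total probability)
p_word = p_word_given_fake * p_fake + p_word_given_real * (1 - p_fake)
posterior = p_word_given_fake * p_fake / p_word
print(f"P(fake | word) = {posterior:.3f}")   # prints 0.889
```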

C. Decision Trees
Decision trees make the analysis of data easier and more understandable. A tree is built from smaller parts of the data, and it works for both numerical and categorical data. It is a tree-shaped classifier that is not the strongest in terms of accuracy, but it exposes all possible outcomes of a decision along each path to a conclusion, and unlike Naïve Bayes it provides automatic feature selection. Using decision trees, the accuracy score on the test dataset is 91%, with 91% F1-score, 91% precision, and 91% recall. The confusion matrix is shown in Fig. 3.

D. Random Forest
The random forest classifier creates many trees by utilizing subsets of the features. It is an extension of bagging and simply merges the output of multiple decision trees. It can be used for classification and regression problems as well as unsupervised learning. It achieves an accuracy score of 92%, F1-score of 92%, precision of 98%, and recall of 86% on the testing data set. The confusion matrix is shown in Fig. 4.

E. LSTM with Pretrained BERT Model
There can be a significant difference in performance between using Bidirectional Encoder Representations from Transformers (BERT) and Long Short-Term Memory (LSTM) for fake news detection. BERT is a powerful pre-trained language model that outperforms traditional models like LSTM in many natural language processing tasks. One of its key advantages is the ability to capture contextual information in language, which is important for tasks like fake news detection where understanding the context and nuance of the text is crucial. BERT uses a transformer-based architecture that allows it to learn complex relationships between words and phrases in a sentence, which can be difficult for traditional models like LSTM to capture. In contrast, LSTM is a type of recurrent neural network commonly used for language modeling; while it can perform well on many natural language processing tasks, it may struggle with more complex language tasks that require an understanding of context and nuance.

In this model, the embedding feature vector size is 40, which is the target dimension of the embedding layer. A single LSTM layer with 100 nodes is used, followed by a dense layer with one neuron and a sigmoid activation function, since this is a binary classification problem. The dropout technique is used to avoid overfitting, and the Adam optimizer is used to optimize the loss function. By combining pre-trained BERT embeddings with LSTM, we obtain the best results; the motivation behind this design is that a sequential model stores the previous information. The training parameters for LSTM are shown in Fig. 5.

Using a dropout rate of 0.3, we get an accuracy score of 99%, F1-score of 99%, precision of 100%, and recall of 99% on the testing data set. This is the highest accuracy score achieved for fake news detection in our experiments. The confusion matrix for LSTM is shown in Fig. 6, and Fig. 7 presents the accuracy scores of all models for easy comparison.
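For concreteness, the following is a minimal Keras sketch of this architecture: an embedding layer with 40-dimensional target vectors, a single LSTM layer with 100 nodes, dropout of 0.3, a one-neuron sigmoid output, and the Adam optimizer with binary cross-entropy loss. The vocabulary size, sequence length, and training settings are assumptions, and the trainable embedding layer here is a stand-in for the pre-trained BERT embeddings used in the reported experiments.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

VOCAB_SIZE = 5000   # assumed; must match the one-hot encoding vocabulary
MAX_LEN = 25        # assumed padded sequence length
EMBEDDING_DIM = 40  # target feature-vector size for the embedding layer

model = Sequential([
    # Stand-in trainable embedding; in the reported setup this layer is
    # initialized from pre-trained BERT embeddings rather than learned here.
    Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LEN),
    LSTM(100),                        # single LSTM layer with 100 nodes
    Dropout(0.3),                     # dropout rate of 0.3 against overfitting
    Dense(1, activation="sigmoid"),   # binary real-vs-fake output
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# X and y are assumed to be the padded one-hot sequences and labels from
# the preprocessing sketch; epochs and batch size are illustrative.
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=64)
```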

VI. CONCLUSION
Fake news detection is essential for identifying whether or not a piece of news is authentic. To identify false news, machine learning models including logistic regression, Naive Bayes, decision trees, and random forest are investigated, and an ensemble model combining LSTM and BERT is suggested. Pre-trained BERT embeddings are utilised instead of the more conventional bag-of-words technique, since they considerably enhance the training process: each word is given a vector representation, together with information on how it relates to and differs from other words in the vocabulary. Our suggested model trained on the "Fake News" data set provides an accuracy of 99%. To verify the results, various evaluation measures were used, including F1-score, precision, and recall. The data set utilised for training is unbalanced; we intend to compile and include more false news in model training. In the future, we aim to take lexical features into consideration.