Fake News Detection with Big Data: Binary Text Classification Using Apache Spark and PySpark
Description
The proliferation of online misinformation requires scalable, automated systems for fact-checking. This project develops a distributed machine learning model to classify news articles as Reliable (0) or Fake News (1). Using Apache Spark via PySpark, we process a dataset of approximately 45,000 articles (Fake.csv and True.csv) with a near-balanced distribution (52% fake, 48% real). The pipeline includes data loading, text cleaning (removing URLs, special characters, digits, and extra spaces), feature engineering with TF-IDF, unsupervised K-Means clustering (K=5) for thematic exploration, and supervised classification with Logistic Regression and Naive Bayes. Evaluated on a stratified 80/20 train-test split, Logistic Regression outperforms Naive Bayes, reaching 99.21% accuracy, precision, recall, and F1-score, with an AUC-ROC of 0.9994. This demonstrates the effectiveness of big data techniques for high-accuracy fake news detection. The upload includes the Jupyter notebook (code implementation) and a PDF presentation (project summary).
Methods
Project Overview
This repository contains the code and presentation for a big data project focused on fake news detection through binary text classification. The goal is to build a scalable system using Apache Spark and PySpark to handle large volumes of textual data efficiently. The project was developed as part of a practical exam (TP) in big data processing.
Problem Statement
The spread of disinformation online demands scalable, automated systems for fact verification.
Objective
Develop a distributed machine learning model capable of classifying press articles as Reliable (0) or Fake News (1).
Dataset
- Total Articles: 45,000 (from Fake.csv + True.csv)
- Distribution: 52% Fake (23,481 articles) / 48% Real (21,417 articles)
Big Data Architecture
- Technological Choice: Apache Spark via PySpark for distributed and rapid processing of large textual data volumes.
- Spark Configuration:
- Version: Spark 3.5.4
- Mode: local[*]
- Application: FakeNewsDetection
- Driver Memory: 4 GB
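The configuration above can be expressed as a SparkSession builder. A minimal sketch: the application name, master mode, and driver memory mirror the documented values, but this is not necessarily the notebook's exact code.

```python
from pyspark.sql import SparkSession

# Session configured with the documented settings:
# app name "FakeNewsDetection", local[*] mode, 4 GB driver memory
spark = (
    SparkSession.builder
    .appName("FakeNewsDetection")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```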
Step 1: Preparation and Cleaning
- Loading and Merging: Load Fake.csv and True.csv, add a label column (1 for Fake, 0 for True), and union into a combined DataFrame.
- Text Cleaning (UDF): Use a PySpark User Defined Function (UDF) applied in parallel:
- Remove URLs
- Remove special characters and digits
- Convert to lowercase
- Remove multiple spaces
- Class Distribution: Balanced dataset with 52% Fake and 48% Real articles.
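The cleaning steps above can be sketched as a plain Python function, which in the project would be wrapped as a PySpark UDF. The regex patterns here are illustrative assumptions, not the notebook's exact expressions.

```python
import re

def clean_text(text):
    """Apply the documented cleaning steps in order."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # remove special characters and digits
    text = text.lower()                                 # convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()            # collapse multiple spaces
    return text
```

In PySpark this function would be registered with `udf(clean_text, StringType())` and applied to the text column with `withColumn`, letting Spark run the cleaning in parallel across partitions.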
ML Pipeline: Text Transformation to Vectors
- Tokenizer: Separate cleaned text into individual words (tokens).
- StopWordsRemover: Remove common words (e.g., "the", "a", "in") to improve relevance.
- HashingTF: Convert tokens into frequency vectors (1,000 dimensions).
- IDF: Weight frequencies by rarity of terms in the corpus.
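The four stages above chain naturally into a PySpark `Pipeline`. A minimal sketch, where the input/output column names (`clean_text`, `tokens`, etc.) are assumptions; only the stage order and the 1,000-feature setting come from the description.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

tokenizer = Tokenizer(inputCol="clean_text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
hashing_tf = HashingTF(inputCol="filtered", outputCol="raw_features",
                       numFeatures=1000)  # 1,000-dimension frequency vectors
idf = IDF(inputCol="raw_features", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
# model = pipeline.fit(train_df)
# features_df = model.transform(train_df)
```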
Unsupervised Analysis
- K-Means Clustering (K=5): Explores natural data structure to validate topic separation and identify dominant themes.
- Cluster Results:
- Cluster 0: Trump, Republican, Hillary, Clinton, Political
- Cluster 1: Said, Government, People, State, Federal
- Cluster 2: Police, Shooting, US, CIA, Intelligence
- Cluster 3: Trump, Group, America, Organization, Political
- Cluster 4: HUD, Lobbying, Housing, Agencies, Federal
- Conclusion: Clusters reveal distinct political and social themes (politics, government, security, organizations, lobbying), confirming that the corpus has coherent thematic structure relevant to the classification task.
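The exploration above can be sketched with PySpark's KMeans estimator. K=5 matches the documented setting; the seed and feature column name are assumptions.

```python
from pyspark.ml.clustering import KMeans

# K=5 as documented; seed and featuresCol are illustrative assumptions
kmeans = KMeans(k=5, featuresCol="features", seed=42)
# clustered_df = kmeans.fit(tfidf_df).transform(tfidf_df)
# adds a "prediction" column with the cluster index 0-4
```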
Data Split
- Strategy: Stratified split to maintain balanced class distribution.
- Ratio: 80% Training (~35,500 articles) / 20% Test (~8,700 articles)
- Reproducibility: Random seed = 42 for consistent results.
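In PySpark a stratified split is typically done with `DataFrame.sampleBy` or a per-class `randomSplit`; the underlying idea can be illustrated in plain Python (a sketch, not the notebook's code):

```python
import random

def stratified_split(rows, label_of, train_frac=0.8, seed=42):
    """Split rows train/test while preserving per-class proportions."""
    random.seed(seed)  # fixed seed for reproducibility, as documented
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for items in by_class.values():
        random.shuffle(items)
        cut = int(len(items) * train_frac)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```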
Classification Models
- Logistic Regression: Simple linear model effective for binary classification. Produces calibrated probabilities.
- Parameters: maxIter=100, regParam=0.01
- Naive Bayes: Based on Bayes' theorem, performs well on text classification assuming feature independence.
- Parameters: Type=Multinomial, smoothing=1.0
- Feature Engineering: TF-IDF vectorization is fitted on the 80% training split and then applied to the held-out test split.
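The two estimators with the documented hyperparameters can be sketched as follows; the `features` and `label` column names are assumptions about the pipeline output.

```python
from pyspark.ml.classification import LogisticRegression, NaiveBayes

# Logistic Regression with the documented parameters
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=100, regParam=0.01)

# Multinomial Naive Bayes with Laplace smoothing = 1.0
nb = NaiveBayes(featuresCol="features", labelCol="label",
                modelType="multinomial", smoothing=1.0)

# lr_model = lr.fit(train_df); predictions = lr_model.transform(test_df)
```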
Evaluation Metrics
Five key metrics for binary classification performance:
- Accuracy: Proportion of correct predictions. (TP + TN) / (TP + TN + FP + FN)
- Precision: Among predicted "Fake", how many are truly fake. TP / (TP + FP)
- Recall: Among truly fake articles, how many were correctly identified. TP / (TP + FN)
- F1-Score: Harmonic mean of Precision and Recall. 2 × (Precision × Recall) / (Precision + Recall)
- AUC-ROC: Measures ability to distinguish classes across decision thresholds (0 to 1, where 1 is perfect).
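The first four formulas can be checked with a small helper; the confusion-matrix counts below are illustrative numbers, not the project's actual results.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # of predicted "Fake", how many truly fake
    recall = tp / (tp + fn)             # of truly fake, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

In PySpark these metrics are typically obtained with `MulticlassClassificationEvaluator` (accuracy, weighted precision/recall/F1) and `BinaryClassificationEvaluator` with `metricName="areaUnderROC"` for the AUC-ROC.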
Files (6.1 MB)
- Détection_de_Fausses_Nouvelles_avec_Big_Data (1).pdf
- md5:c5d945c7dcf312148b34d500f06062d5 (5.2 MB)
- md5:a569a1019d63daf37dfe5f5a27a9a950 (930.8 kB)
Additional details
Dates
- Available: 2025-12-26