Fake News Detection with Big Data: Binary Text Classification Using Apache Spark and PySpark
Description
The proliferation of online misinformation requires scalable, automated systems for fact-checking. This project develops a distributed machine learning model to classify news articles as Reliable (0) or Fake News (1). Using Apache Spark via PySpark, we process a dataset of approximately 45,000 articles (Fake.csv and True.csv) with a near-balanced distribution (52% fake, 48% real). The pipeline includes data loading, text cleaning (removing URLs, special characters, digits, and extra spaces), feature engineering with TF-IDF, unsupervised K-Means clustering (K=5) for thematic exploration, and supervised classification with Logistic Regression and Naive Bayes. Evaluated on a stratified 80/20 train-test split, Logistic Regression outperforms Naive Bayes, reaching 99.21% accuracy, precision, recall, and F1-score, with an AUC-ROC of 0.9994. This demonstrates the effectiveness of big data techniques for high-accuracy fake news detection. The upload includes the Jupyter notebook (code implementation) and a PDF presentation (project summary).
Methods
Project Overview
This repository contains the code and presentation for a big data project focused on fake news detection through binary text classification. The goal is to build a scalable system using Apache Spark and PySpark to handle large volumes of textual data efficiently. The project was developed as part of a practical exam (TP) in big data processing.
Problem Statement
The spread of disinformation online demands scalable, automated systems for fact verification.
Objective
Develop a distributed machine learning model capable of classifying press articles as Reliable (0) or Fake News (1).
Dataset
- Total Articles: 45,000 (from Fake.csv + True.csv)
- Distribution: 52% Fake (23,481 articles) / 48% Real (21,417 articles)
Big Data Architecture
- Technological Choice: Apache Spark via PySpark for distributed and rapid processing of large textual data volumes.
- Spark Configuration:
- Version: Spark 3.5.4
- Mode: local[*]
- Application: FakeNewsDetection
- Driver Memory: 4 GB
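The configuration above can be expressed as a SparkSession builder. A minimal sketch: the application name, master mode, and driver memory mirror the documented values, but this is not necessarily the notebook's exact code.

```python
from pyspark.sql import SparkSession

# Session configured with the documented settings:
# app name "FakeNewsDetection", local[*] mode, 4 GB driver memory
spark = (
    SparkSession.builder
    .appName("FakeNewsDetection")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```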
Step 1: Preparation and Cleaning
- Loading and Merging: Load Fake.csv and True.csv, add a label column (1 for Fake, 0 for True), and union into a combined DataFrame.
- Text Cleaning (UDF): Use a PySpark User Defined Function (UDF) applied in parallel:
- Remove URLs
- Remove special characters and digits
- Convert to lowercase
- Remove multiple spaces
- Class Distribution: Balanced dataset with 52% Fake and 48% Real articles.
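The cleaning steps above can be sketched as a plain Python function, which in the project would be wrapped as a PySpark UDF. The regex patterns here are illustrative assumptions, not the notebook's exact expressions.

```python
import re

def clean_text(text):
    """Apply the documented cleaning steps in order."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # remove special characters and digits
    text = text.lower()                                 # convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()            # collapse multiple spaces
    return text
```

In PySpark this function would be registered with `udf(clean_text, StringType())` and applied to the text column with `withColumn`, letting Spark run the cleaning in parallel across partitions.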
ML Pipeline: Text Transformation to Vectors
- Tokenizer: Separate cleaned text into individual words (tokens).
- StopWordsRemover: Remove common words (e.g., "the", "a", "in") to improve relevance.
- HashingTF: Convert tokens into frequency vectors (1,000 dimensions).
- IDF: Weight frequencies by rarity of terms in the corpus.
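The four stages above chain naturally into a PySpark `Pipeline`. A minimal sketch, where the input/output column names (`clean_text`, `tokens`, etc.) are assumptions; only the stage order and the 1,000-feature setting come from the description.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

tokenizer = Tokenizer(inputCol="clean_text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
hashing_tf = HashingTF(inputCol="filtered", outputCol="raw_features",
                       numFeatures=1000)  # 1,000-dimension frequency vectors
idf = IDF(inputCol="raw_features", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
# model = pipeline.fit(train_df)
# features_df = model.transform(train_df)
```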
Unsupervised Analysis
- K-Means Clustering (K=5): Explores natural data structure to validate topic separation and identify dominant themes.
- Cluster Results:
- Cluster 0: Trump, Republican, Hillary, Clinton, Political
- Cluster 1: Said, Government, People, State, Federal
- Cluster 2: Police, Shooting, US, CIA, Intelligence
- Cluster 3: Trump, Group, America, Organization, Political
- Cluster 4: HUD, Lobbying, Housing, Agencies, Federal
- Conclusion: Clusters reveal distinct political and social themes (politics, government, security, organizations, lobbying), confirming that the corpus has coherent thematic structure relevant to the classification task.
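The exploration above can be sketched with PySpark's KMeans estimator. K=5 matches the documented setting; the seed and feature column name are assumptions.

```python
from pyspark.ml.clustering import KMeans

# K=5 as documented; seed and featuresCol are illustrative assumptions
kmeans = KMeans(k=5, featuresCol="features", seed=42)
# clustered_df = kmeans.fit(tfidf_df).transform(tfidf_df)
# adds a "prediction" column with the cluster index 0-4
```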
Data Split
- Strategy: Stratified split to maintain balanced class distribution.
- Ratio: 80% Training (~35,500 articles) / 20% Test (~8,700 articles)
- Reproducibility: Random seed = 42 for consistent results.
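In PySpark a stratified split is typically done with `DataFrame.sampleBy` or a per-class `randomSplit`; the underlying idea can be illustrated in plain Python (a sketch, not the notebook's code):

```python
import random

def stratified_split(rows, label_of, train_frac=0.8, seed=42):
    """Split rows train/test while preserving per-class proportions."""
    random.seed(seed)  # fixed seed for reproducibility, as documented
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for items in by_class.values():
        random.shuffle(items)
        cut = int(len(items) * train_frac)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```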
Classification Models
- Logistic Regression: Simple linear model effective for binary classification. Produces calibrated probabilities.
- Parameters: maxIter=100, regParam=0.01
- Naive Bayes: Based on Bayes' theorem, performs well on text classification assuming feature independence.
- Parameters: Type=Multinomial, smoothing=1.0
- Feature Engineering: TF-IDF vectorization is fitted on the 80% training split and then applied to the held-out test split.
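The two estimators with the documented hyperparameters can be sketched as follows; the `features` and `label` column names are assumptions about the pipeline output.

```python
from pyspark.ml.classification import LogisticRegression, NaiveBayes

# Logistic Regression with the documented parameters
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=100, regParam=0.01)

# Multinomial Naive Bayes with Laplace smoothing = 1.0
nb = NaiveBayes(featuresCol="features", labelCol="label",
                modelType="multinomial", smoothing=1.0)

# lr_model = lr.fit(train_df); predictions = lr_model.transform(test_df)
```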
Evaluation Metrics
Five key metrics for binary classification performance:
- Accuracy: Proportion of correct predictions. (TP + TN) / (TP + TN + FP + FN)
- Precision: Among predicted "Fake", how many are truly fake. TP / (TP + FP)
- Recall: Among truly fake articles, how many were correctly identified. TP / (TP + FN)
- F1-Score: Harmonic mean of Precision and Recall. 2 × (Precision × Recall) / (Precision + Recall)
- AUC-ROC: Measures ability to distinguish classes across decision thresholds (0 to 1, where 1 is perfect).
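The first four formulas can be checked with a small helper; the confusion-matrix counts below are illustrative numbers, not the project's actual results.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # of predicted "Fake", how many truly fake
    recall = tp / (tp + fn)             # of truly fake, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

In PySpark these metrics are typically obtained with `MulticlassClassificationEvaluator` (accuracy, weighted precision/recall/F1) and `BinaryClassificationEvaluator` with `metricName="areaUnderROC"` for the AUC-ROC.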
Files (6.1 MB)
- Détection_de_Fausses_Nouvelles_avec_Big_Data (1).pdf
- md5:c5d945c7dcf312148b34d500f06062d5 (5.2 MB)
- md5:a569a1019d63daf37dfe5f5a27a9a950 (930.8 kB)
Additional details
Dates
- Available: 2025-12-26