Published April 12, 2026 | Version v1
Software Open

AI-Based Pharmacovigilance Risk Detection Using FDA FAERS Data

  • 1. Saint Louis University

Description

ABSTRACT

This project presents an AI-driven pharmacovigilance workflow built on data from the FDA Adverse Event Reporting System (FAERS). The objective is to identify and analyze drug–adverse event relationships and to classify their risk levels using data-driven techniques.

The dataset was preprocessed by merging drug, reaction, and demographic tables, followed by filtering for primary suspect drugs and removing non-clinical reporting terms. Drug–reaction pairs were aggregated and assigned risk levels using percentile-based classification to address class imbalance.

A machine learning model (Random Forest) was trained to predict risk levels based on drug, reaction, and reporting frequency. The model was evaluated using train-test split validation and classification metrics.

An interactive Streamlit dashboard was developed to visualize top drugs, adverse reactions, and high-risk signals, and to enable real-time risk prediction.

This project demonstrates a practical application of healthcare data science, pharmacovigilance analytics, and machine learning to drug safety monitoring.

OBJECTIVES

• To analyze FAERS data and identify drug–adverse event patterns  
• To classify risk levels using data-driven percentile methods  
• To build a machine learning model for risk prediction  
• To develop an interactive dashboard for data exploration  
• To demonstrate pharmacovigilance analytics using real-world data  

METHODOLOGY

The workflow consists of multiple stages. First, FAERS datasets (DEMO, DRUG, REAC) were merged using primary identifiers. A sample dataset was created for efficient processing.
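The merge step can be sketched with pandas. This is a minimal illustration, not the project's actual code: it assumes the three tables share a `primaryid` report key, that DEMO carries `age` and `sex` columns, and that sampling is done with a fixed random seed. The function name `merge_faers` and the toy values are hypothetical.

```python
import pandas as pd

def merge_faers(demo: pd.DataFrame, drug: pd.DataFrame, reac: pd.DataFrame) -> pd.DataFrame:
    """Join DRUG and REAC rows to their DEMO report on the shared report key."""
    # Keep only the columns used downstream to limit memory use (assumed names).
    demo = demo[["primaryid", "age", "sex"]]
    drug = drug[["primaryid", "drugname", "role_cod"]]
    reac = reac[["primaryid", "pt"]]
    # A report can list several drugs and several reactions, so the result
    # has one row per (report, drug, reaction) combination.
    merged = drug.merge(reac, on="primaryid", how="inner")
    return merged.merge(demo, on="primaryid", how="left")

# Tiny made-up tables for illustration only (not real FAERS records).
demo = pd.DataFrame({"primaryid": [1, 2], "age": [54, 61], "sex": ["F", "M"]})
drug = pd.DataFrame({
    "primaryid": [1, 1, 2],
    "drugname": ["DRUG_A", "DRUG_B", "DRUG_A"],
    "role_cod": ["PS", "C", "PS"],
})
reac = pd.DataFrame({"primaryid": [1, 2, 2], "pt": ["Nausea", "Headache", "Nausea"]})

merged = merge_faers(demo, drug, reac)
# Optional down-sampling for faster experimentation; the fraction is an assumption.
sample = merged.sample(frac=0.5, random_state=42)
```

Because the join multiplies drugs by reactions within each report, row counts in the merged table are not report counts; deduplicating on the report key is needed before counting reports.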

Primary suspect drugs were filtered using the role_cod field. Non-clinical reporting terms such as “off label use” were removed to ensure meaningful analysis.
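A possible form of this filter is shown below. It assumes that primary suspect drugs carry the `role_cod` value "PS" and that the exclusion list is a simple set of lowercase terms; only "off label use" is named in the text, so any other entries would be additions of your own. The name `filter_reports` is hypothetical.

```python
import pandas as pd

# Assumed exclusion list; only "off label use" comes from the description above.
NON_CLINICAL_TERMS = {"off label use"}

def filter_reports(merged: pd.DataFrame, excluded_terms=NON_CLINICAL_TERMS) -> pd.DataFrame:
    """Keep primary suspect drugs and drop non-clinical reaction terms."""
    # Normalize case and whitespace so "ps " and "PS" are treated alike.
    is_primary = merged["role_cod"].str.upper().str.strip() == "PS"
    is_clinical = ~merged["pt"].str.lower().str.strip().isin(excluded_terms)
    return merged[is_primary & is_clinical].reset_index(drop=True)

# Made-up rows for illustration only.
merged = pd.DataFrame({
    "drugname": ["DRUG_A", "DRUG_B", "DRUG_A", "DRUG_C"],
    "role_cod": ["PS", "C", "PS", "PS"],
    "pt": ["Nausea", "Nausea", "Off label use", "Headache"],
})
filtered = filter_reports(merged)
```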

Drug–reaction pairs were grouped and frequency counts were computed. Risk levels were assigned using percentile-based thresholds to ensure balanced classification across LOW, MEDIUM, and HIGH categories.
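The aggregation and labeling step might look like the following sketch. The cut points (the 1/3 and 2/3 quantiles of the pair counts) are an assumption, since the description does not state which percentiles were used, and `assign_risk_levels` is a hypothetical name.

```python
import pandas as pd

def assign_risk_levels(filtered: pd.DataFrame, low_q: float = 1 / 3, high_q: float = 2 / 3) -> pd.DataFrame:
    """Count reports per drug-reaction pair and label each pair by percentile."""
    pairs = filtered.groupby(["drugname", "pt"]).size().reset_index(name="count")
    # Thresholds are quantiles of the pair counts (assumed cut points).
    low_thr = pairs["count"].quantile(low_q)
    high_thr = pairs["count"].quantile(high_q)
    pairs["risk_level"] = "MEDIUM"
    pairs.loc[pairs["count"] <= low_thr, "risk_level"] = "LOW"
    pairs.loc[pairs["count"] > high_thr, "risk_level"] = "HIGH"
    return pairs

# Made-up input: DRUG_i appears i times with the same reaction, so counts are 1..6.
rows = []
for i in range(1, 7):
    rows += [{"drugname": f"DRUG_{i}", "pt": "Nausea"}] * i
filtered = pd.DataFrame(rows)

pairs = assign_risk_levels(filtered)
```

With many tied counts (for example, a large share of pairs reported only once), fixed quantile thresholds can still yield uneven classes, so it is worth checking `pairs["risk_level"].value_counts()` on real data.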

Categorical variables such as drug names and reactions were encoded using Label Encoding. A Random Forest classifier was trained using drug, reaction, and frequency as features.
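A rough sketch of the encoding and training step with scikit-learn follows. The hyperparameters (`n_estimators=100`, a fixed `random_state`) and the feature column names `drug_code` and `pt_code` are assumptions, as is the helper name `encode_and_train`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def encode_and_train(pairs: pd.DataFrame, random_state: int = 42):
    """Label-encode drug and reaction names, then fit a Random Forest."""
    drug_enc = LabelEncoder()
    pt_enc = LabelEncoder()
    X = pd.DataFrame({
        "drug_code": drug_enc.fit_transform(pairs["drugname"]),
        "pt_code": pt_enc.fit_transform(pairs["pt"]),
        "count": pairs["count"].to_numpy(),
    })
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, pairs["risk_level"])
    # The encoders are returned because prediction needs the same mapping.
    return model, drug_enc, pt_enc, X

# Made-up pair table for illustration only.
pairs = pd.DataFrame({
    "drugname": ["DRUG_A", "DRUG_A", "DRUG_B", "DRUG_B", "DRUG_C", "DRUG_C"],
    "pt": ["Nausea", "Headache", "Nausea", "Rash", "Rash", "Headache"],
    "count": [1, 2, 10, 12, 50, 60],
    "risk_level": ["LOW", "LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH"],
})
model, drug_enc, pt_enc, X = encode_and_train(pairs)

# Predict for a known drug and reaction with a new count.
new = pd.DataFrame({
    "drug_code": drug_enc.transform(["DRUG_B"]),
    "pt_code": pt_enc.transform(["Rash"]),
    "count": [11],
})
pred = model.predict(new)[0]
```

`LabelEncoder.transform` raises a `ValueError` for names it did not see during fitting, so a prediction interface built this way has to restrict inputs to known drugs and reactions or handle that error.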

The model was evaluated using train-test split and standard classification metrics. Finally, a Streamlit dashboard was built to visualize insights and enable real-time prediction.

DATA SOURCE

Data Source: FDA FAERS (Adverse Event Reporting System)

Files used:
• DEMO – patient demographic data  
• DRUG – drug information and role  
• REAC – reported adverse reactions  

The dataset contains real-world adverse event reports used for pharmacovigilance and drug safety monitoring.

Key fields in the processed dataset:
• drugname – name of the drug  
• pt – preferred term (adverse reaction)  
• count – frequency of reports  
• risk_level – derived classification (LOW, MEDIUM, HIGH)  

MACHINE LEARNING MODEL

Model: Random Forest Classifier

Features:
• Encoded drug name  
• Encoded reaction (pt)  
• Count (frequency)

Target:
• Risk level (LOW, MEDIUM, HIGH)

Evaluation:
• Train-test split (80/20)  
• Accuracy, classification report, confusion matrix  
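The evaluation setup listed above could be reproduced roughly as follows. The stratified split, the random seed, and the synthetic feature table are assumptions made for illustration; the description only states an 80/20 split and the three metrics. `evaluate` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

LABELS = ["LOW", "MEDIUM", "HIGH"]

def evaluate(X: pd.DataFrame, y, random_state: int = 42) -> dict:
    """Fit on 80% of the pairs and score on the held-out 20%."""
    # Stratifying keeps the class proportions equal in both parts (assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "report": classification_report(y_test, y_pred, labels=LABELS, zero_division=0),
        # Rows are true classes, columns are predicted classes, in LABELS order.
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=LABELS),
    }

# Synthetic stand-in for the encoded pair table (random values, not FAERS data).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "drug_code": rng.integers(0, 20, size=n),
    "pt_code": rng.integers(0, 30, size=n),
    "count": rng.integers(1, 100, size=n),
})
low_thr, high_thr = np.quantile(X["count"], [1 / 3, 2 / 3])
y = np.where(X["count"] <= low_thr, "LOW", np.where(X["count"] > high_thr, "HIGH", "MEDIUM"))

result = evaluate(X, y)
print(result["report"])
```

Since `risk_level` is derived from `count` and `count` is also a feature, high scores in this setup largely reflect the model recovering the percentile thresholds.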

Note: The initial model reported misleadingly high accuracy because of class imbalance. This was addressed by switching to percentile-based risk classification.

DASHBOARD DESCRIPTION

An interactive Streamlit dashboard was developed to visualize pharmacovigilance insights.

Key features:
• Top drugs visualization  
• Top adverse reactions visualization  
• Risk-based filtering  
• Drug search functionality  
• High-risk drug–reaction pairs  
• CSV download option  
• AI-based risk prediction module  

The dashboard enables users to explore safety signals and perform real-time risk prediction.
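Several of the listed features (top-N charts, risk filtering, drug search, CSV download) can be sketched as below. This is an illustrative layout, not the project's dashboard: the helper names, widget labels, and output file name are hypothetical, the input is assumed to have the `drugname`, `pt`, `count`, and `risk_level` columns, and the prediction module is left out.

```python
import pandas as pd

def filter_pairs(pairs: pd.DataFrame, risk_levels=None, drug_query: str = "") -> pd.DataFrame:
    """Apply the risk filter and a case-insensitive drug name search."""
    out = pairs
    if risk_levels:
        out = out[out["risk_level"].isin(risk_levels)]
    if drug_query:
        out = out[out["drugname"].str.contains(drug_query, case=False, regex=False)]
    return out.sort_values("count", ascending=False)

def top_counts(pairs: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Total report count per drug or reaction, largest first."""
    return pairs.groupby(column)["count"].sum().nlargest(n)

def render_dashboard(pairs: pd.DataFrame) -> None:
    # Imported lazily so the helpers above can be used without Streamlit.
    import streamlit as st

    st.title("FAERS risk explorer")
    st.bar_chart(top_counts(pairs, "drugname"))
    st.bar_chart(top_counts(pairs, "pt"))
    levels = st.multiselect("Risk level", ["LOW", "MEDIUM", "HIGH"], default=["HIGH"])
    query = st.text_input("Search drug name")
    view = filter_pairs(pairs, levels, query)
    st.dataframe(view)
    st.download_button(
        "Download CSV",
        data=view.to_csv(index=False),
        file_name="filtered_pairs.csv",
        mime="text/csv",
    )

# Made-up pair table for illustration only.
pairs = pd.DataFrame({
    "drugname": ["DRUG_A", "DRUG_A", "DRUG_B"],
    "pt": ["Nausea", "Headache", "Nausea"],
    "count": [40, 3, 25],
    "risk_level": ["HIGH", "LOW", "HIGH"],
})
high_only = filter_pairs(pairs, ["HIGH"])
drug_b = filter_pairs(pairs, drug_query="drug_b")
```

To try it, call `render_dashboard(...)` on a loaded pair table at the bottom of a script and start it with `streamlit run`. Keeping the filtering logic in plain functions makes it testable outside the Streamlit runtime.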

LIMITATIONS

• FAERS data is based on voluntary reporting and may contain bias  
• Risk classification is based on frequency, not clinical severity  
• Model performance depends on data distribution  
• External validation was not performed

FUTURE WORK 

• Incorporate patient-level features (age, gender)  
• Use temporal trends for risk prediction  
• Apply deep learning models  
• Integrate real-time FAERS updates  
• Deploy dashboard as a web application  

Files

faers_ml_dataset.csv

Files (36.4 MB)

• md5:476cb8c10e96864b544af47e16d7073d – 6.5 kB
• md5:31daa19506d141bf41a239c06633d2ae – 1.2 MB
• md5:7e5060a5a18e36878b4b4bd3f66fbf8f – 908.0 kB
• md5:857ea84e40acd911ec9301ffb60389fb – 33.0 MB
• md5:3d4b96054ab067f48fb62469a39a2b50 – 33.0 kB
• md5:a9c331956f6b20c2aaafc5bffcb81961 – 304.6 kB
• md5:b17632cefdde8ca9452451f502dc6b93 – 297.9 kB
• md5:7b34abd5fab868ca87ffb4b93e6d51ed – 234.9 kB
• md5:ec3291c6ff8f6a2332689484b1cfc571 – 179.0 kB
• md5:96ea7114d9d369fd88a833762a0936c3 – 193.3 kB

Additional details

Software

Programming language
Python