README: Adverse Drug Reaction (ADR) Clustering Task

Overview

This repository contains code and data for unsupervised clustering of adverse drug reaction (ADR) narratives using a Sentence-BERT (S-BERT) embedding pipeline and a modified SS-DBSCAN clustering algorithm. The focus is on high-dimensional text data, with sample sizes ranging from 1,000 records to the full dataset (5,000+ records).

The dataset was extracted from MIMIC-III clinical notes, focusing on the text features most relevant to drug responses and patient histories. Clustering is performed on the unlabeled data to identify potential adverse drug reactions.

---

Dataset

File: adr_filtered.csv
Description: A CSV file with a single column, text, containing unstructured clinical text. No labels are included.

Sample Format:
text
"the patient experienced an adverse reaction during infusion of rituxan..."
"the treatment went well and the patient was responding positively to treatment, family members were there to take care of the patient"
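
A minimal sketch of loading and sanity-checking the file with pandas is shown below. The in-memory CSV is an abbreviated stand-in for adr_filtered.csv so the snippet runs on its own; the checks themselves are assumptions about what is worth verifying, not steps taken from the notebook.

```python
import io

import pandas as pd

# Abbreviated stand-in for adr_filtered.csv (illustrative rows, not real data).
# The second row contains a comma, which is why the text field is quoted.
sample_csv = io.StringIO(
    "text\n"
    '"the patient experienced an adverse reaction during infusion of rituxan..."\n'
    '"the treatment went well, family members were there to take care of the patient"\n'
)

# For the real file, pass its path instead: pd.read_csv("adr_filtered.csv")
df_unlabeled = pd.read_csv(sample_csv)

# The file is expected to have exactly one column named "text".
if list(df_unlabeled.columns) != ["text"]:
    raise ValueError(f"unexpected columns: {list(df_unlabeled.columns)}")

# Drop empty rows so later steps only see real narratives.
df_unlabeled = df_unlabeled.dropna(subset=["text"]).reset_index(drop=True)
print(len(df_unlabeled), "records loaded")
```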

---

Instructions to Reproduce Results

Requirements
Install the following dependencies (via pip or conda):

pip install sentence-transformers pandas scikit-learn matplotlib

Install any other libraries the notebook imports in the same way.

Step-by-Step Execution (from Notebook)

1. Open the Jupyter notebook: mimic-5k_PCA_tSNE_clustering.ipynb
2. In the first code block, update or confirm the path to adr_filtered.csv.
3. Run the full notebook to:
   - Preprocess text (remove punctuation, clean sentences)
   - Generate S-BERT embeddings using all-mpnet-base-v2
   - Reduce dimensionality with PCA/t-SNE
   - Apply SS-DBSCAN for clustering
   - Visualize clusters
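
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: scikit-learn's standard DBSCAN stands in for the modified SS-DBSCAN, the helper names (clean_text, embed, reduce_and_cluster) are hypothetical, and every parameter value (n_pca, eps, min_samples, perplexity) is a placeholder to tune. The demo runs on random 768-dimensional vectors (the output size of all-mpnet-base-v2) so it works without downloading the model.

```python
import re
import string

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def clean_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def embed(texts):
    """Encode texts with all-mpnet-base-v2 (downloads the model on first use)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(list(texts), show_progress_bar=False)


def reduce_and_cluster(embeddings, n_pca=50, eps=3.0, min_samples=5, seed=42):
    """PCA, then 2D t-SNE, then DBSCAN. All defaults are illustrative."""
    # PCA cannot keep more components than min(n_samples, n_features).
    n_pca = min(n_pca, *embeddings.shape)
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(embeddings)

    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(30.0, len(embeddings) - 1)
    coords = TSNE(
        n_components=2, perplexity=perplexity, random_state=seed
    ).fit_transform(reduced)

    # Stand-in for SS-DBSCAN; label -1 marks noise points.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels


# Demo on synthetic "embeddings": two well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack(
    [rng.normal(0.0, 1.0, (30, 768)), rng.normal(5.0, 1.0, (30, 768))]
)
coords, labels = reduce_and_cluster(fake_embeddings)
print(coords.shape, sorted(set(labels.tolist())))
```

With real data, the call would look like reduce_and_cluster(embed(df_unlabeled["text"].map(clean_text))). Clustering on t-SNE coordinates is one possible reading of the step order above; clustering on the PCA output and using t-SNE only for plotting is an equally plausible variant.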

---

Reproducing with Different Data Sizes

To test scalability and evaluate cluster quality over various dataset sizes, modify the following line early in the notebook:

df_unlabeled = df_unlabeled.sample(n=1000, random_state=42)

Change n=1000 to:
- 2000
- 3000
- 4000
- or full dataset (remove .sample() to use all records)

Each sample size yields a different cluster structure, which the notebook visualizes as 2D plots. Because random_state=42 is fixed, repeating a run with the same n draws the same sample.
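
A small helper can make the size sweep less error-prone. The sketch below is an assumption-laden illustration: subsample is a hypothetical function, not part of the notebook, and the DataFrame is synthetic. It also guards against requesting more rows than exist, since DataFrame.sample raises a ValueError in that case when sampling without replacement.

```python
import pandas as pd


def subsample(df, n=None, seed=42):
    """Return n randomly sampled rows, or every row when n is None or too large."""
    if n is None or n >= len(df):
        return df.reset_index(drop=True)
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Synthetic stand-in for the full dataset (5,000 placeholder notes).
df_all = pd.DataFrame({"text": [f"note {i}" for i in range(5000)]})

# Sweep the sizes listed above; None means "use the full dataset".
for n in (1000, 2000, 3000, 4000, None):
    df_unlabeled = subsample(df_all, n)
    print(n, len(df_unlabeled))
```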

---

Outputs

- features_array_unlabeled.npy – S-BERT embedding matrix
- plots/ – cluster visualizations
- silhouette_scores.csv – clustering quality metrics (if calculated)
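
As a rough illustration of how these outputs might be produced and checked, the sketch below saves a feature matrix to .npy, reloads it, and computes a silhouette score. Everything here is an assumption: the data is synthetic, silhouette_without_noise is a hypothetical helper, and excluding DBSCAN noise points (label -1) before scoring is one common convention, not necessarily what the notebook does.

```python
import os
import tempfile

import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_without_noise(features, labels):
    """Silhouette score over non-noise points, or None if it is undefined."""
    mask = labels != -1
    n_clusters = len(set(labels[mask].tolist()))
    # silhouette_score needs at least 2 clusters and fewer clusters than points.
    if n_clusters < 2 or n_clusters >= int(mask.sum()):
        return None
    return float(silhouette_score(features[mask], labels[mask]))


# Synthetic 2D features: two separated blobs plus one point labeled as noise.
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2)), [[5.0, 5.0]]]
)
labels = np.array([0] * 50 + [1] * 50 + [-1])

# Round-trip through a .npy file, as with features_array_unlabeled.npy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features_array_unlabeled.npy")
    np.save(path, features)
    reloaded = np.load(path)

score = silhouette_without_noise(reloaded, labels)
print("silhouette:", score)
```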

---

Citation

If you use this resource in your research, please cite the Zenodo DOI:
https://doi.org/10.5281/zenodo.13889331

---

Contact

For any questions, please contact:
Gloriana J. Monko
Email: gmonko24@gmail.com