Artifacts for "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study"
Description
This repository contains data and analysis code for the paper "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study", presented at the USENIX Security Symposium 2025. Our dataset consists of structured JSON records summarizing ransomware incidents reported before December 2024. The accompanying code demonstrates various statistical analyses enabled by this dataset, as described in the paper. A summary of the provided data and code is included below.
Incident data (incidents.zip)
This archive contains the final output of our pipeline: JSON records with detailed attributes of individual ransomware incidents. Each record includes the victim and attacker identities, the corresponding incident date, and details across seven incident aspects: attack vector, attacker actions, victim actions, impact, affected data, indirect victims, and incident timeline. Each record also provides the URLs and full text of the articles from which attributes were extracted. For each attribute, evidence is included in the form of source URLs and the exact text spans from which the attribute was extracted. A detailed JSON schema for our incident records is provided in schema.json.
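As a minimal sketch of how the records might be read, the snippet below walks a zip archive and yields every JSON record it finds. The internal layout of incidents.zip is an assumption here (either one record per .json file or files holding a list of records), and `iter_incident_records` is a hypothetical helper name, not part of the released code.

```python
import io
import json
import zipfile


def iter_incident_records(archive):
    """Yield every JSON record found in a zip archive.

    `archive` is a path or a binary file-like object. Both assumed layouts
    are handled: one record per file, or a file holding a list of records.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue  # skip anything that is not a JSON file
            with zf.open(name) as fh:
                payload = json.load(fh)
            if isinstance(payload, list):
                yield from payload
            else:
                yield payload
```

Calling `list(iter_incident_records("incidents.zip"))` would load all records into memory. Each record could then be checked against schema.json with a JSON Schema validator, such as the third-party jsonschema package.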
We also provide incidents.csv, a reduced CSV version of the JSON records to facilitate analysis.
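The CSV version lends itself to quick aggregate counts. The sketch below tallies incidents per calendar year using only the standard library. The column name `incident_date` and its ISO-style `YYYY-MM-DD` format are assumptions made for illustration; check the header of incidents.csv for the actual column names.

```python
import csv
import io
from collections import Counter


def incidents_per_year(csv_file, date_column="incident_date"):
    """Count rows per calendar year, skipping missing or malformed dates.

    `date_column` is a hypothetical column name; adjust it to the real header.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(date_column) or "").strip()
        year = value[:4]
        if len(year) == 4 and year.isdigit():
            counts[int(year)] += 1
    return dict(sorted(counts.items()))
```

With the real file, this could be called as `incidents_per_year(open("incidents.csv", newline="", encoding="utf-8"))`.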
Additional data (data.zip)
Pipeline steps
We provide raw and intermediate data from different stages of our data extraction and annotation pipeline:
- Articles: Full set of input articles, each potentially reporting one or more ransomware incidents. These are sourced from the Common Crawl News dataset and additional auxiliary sources.
- De-duplicated articles: A filtered version of the full article set, removing near-identical entries.
- Annotations: Text segments, extracted using an AI chatbot, that potentially describe specific aspects of a ransomware incident.
- De-duplicated annotations: Annotations with canonicalized victim names to group reports referring to the same entity.
- Attributes: Structured fields extracted by an AI chatbot from the above annotations (e.g., ransom demands, payments, and operational impacts).
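To illustrate what removing near-identical articles can look like, here is a small sketch based on word shingles and Jaccard similarity. This is a generic technique chosen for illustration, not necessarily the method used in the pipeline; the function names, the shingle size, and the 0.8 threshold are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(articles, threshold=0.8, k=3):
    """Keep an article only if it is not too similar to any article kept so far."""
    kept = []  # list of (text, shingle set) pairs
    for text in articles:
        sig = shingles(text, k)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((text, sig))
    return [text for text, _ in kept]
```

This pairwise comparison is quadratic in the number of articles, so a corpus-scale run would likely need an approximate scheme such as MinHash with locality-sensitive hashing.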
Ransomware groups and variants
We provide canonicalized names of ransomware groups and variants from three independent sources, along with a consolidated list used in our analysis.
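One way to picture the consolidation of name lists is the sketch below, which normalizes spellings and merges known aliases into one canonical key. It is an illustrative approach only, not the procedure used to build the released list; `normalize`, `consolidate`, and the alias table passed in are hypothetical.

```python
import re


def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def consolidate(sources, aliases=None):
    """Merge several name lists into {canonical key: sorted spellings seen}.

    `aliases` maps an alternative name to the name it should be merged into.
    """
    aliases = {normalize(k): normalize(v) for k, v in (aliases or {}).items()}
    merged = {}
    for names in sources:
        for name in names:
            key = normalize(name)
            key = aliases.get(key, key)  # fold aliases into their target key
            merged.setdefault(key, set()).add(name)
    return {key: sorted(spellings) for key, spellings in merged.items()}
```

For example, passing `aliases={"Sodinokibi": "REvil"}` groups both names under the key `revil`, while spelling variants such as "LockBit" and "Lock Bit" collapse through normalization alone.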
Code (code.zip)
Analysis
The notebook analysis.ipynb reproduces the analyses in Section 4 of the paper. To run the notebook, install the requirements and launch JupyterLab using:
pip install jupyterlab -r requirements.txt
jupyter lab --notebook-dir=code
You can then open and execute the notebook in JupyterLab.
Validation tool
We include a Streamlit-based web UI for validating the attributes in each incident record. To run the tool, install the requirements and launch the app using:
pip install streamlit -r requirements.txt
streamlit run code/validation-app.py
You can then load either a single JSON incident record or a file containing a list of records for review. Each attribute can be marked as correct or incorrect by a human annotator, and the results can be exported as a CSV file to evaluate the pipeline's accuracy.
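Once reviews are exported, per-attribute accuracy can be computed with a few lines of Python. The sketch below assumes the exported CSV has an `attribute` column and a `verdict` column holding `correct` or `incorrect`; these column names and values are hypothetical, so adapt them to the tool's actual export format.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_attribute(csv_file, attribute_column="attribute", verdict_column="verdict"):
    """Return {attribute: fraction marked correct}, ignoring unreviewed rows.

    Column names and verdict labels are assumptions about the export format.
    """
    tallies = defaultdict(lambda: [0, 0])  # attribute -> [correct, total]
    for row in csv.DictReader(csv_file):
        verdict = (row.get(verdict_column) or "").strip().lower()
        if verdict not in {"correct", "incorrect"}:
            continue  # row has not been reviewed
        tally = tallies[row[attribute_column]]
        tally[0] += verdict == "correct"
        tally[1] += 1
    return {attr: correct / total for attr, (correct, total) in tallies.items()}
```

Rows without a verdict are skipped rather than counted as errors, so the result reflects only attributes a human annotator actually reviewed.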
Reference
Armin Sarabi, Ziyuan Huang, Chenlan Wang, Tai Karir, and Mingyan Liu. "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study". In USENIX Security Symposium. 2025.
Files
Total size: 4.8 GB
| MD5 checksum | Size |
|---|---|
| 9f17f5e74948ce7470a3c18c374f3646 | 555.1 kB |
| 2d8f4230ad265a93fe81e7d2f4e4ab15 | 4.8 GB |
| 0937ce926fae9da1b901ebb8f8b27c3c | 4.8 MB |
| de2420cf705c8d98de5c8b30c3b60998 | 25.7 MB |
| 4d85b5bf4cab1cc4c1025e3bb947c4fd | 21.4 kB |