Artifacts for "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study"

Sarabi, Armin; Huang, Ziyuan; Wang, Chenlan; Karir, Tai; Liu, Mingyan

doi:10.5281/zenodo.15571866

Published June 6, 2025 | Version 1.0.0

Dataset Open

1. University of Michigan

This repository contains data and analysis code for the paper "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study", presented at the USENIX Security Symposium 2025. Our dataset consists of structured JSON records summarizing ransomware incidents reported before December 2024. The accompanying code demonstrates various statistical analyses enabled by this dataset, as described in the paper. A summary of the provided data and code is included below.

Incident data (`incidents.zip`)

This archive contains the final output of our pipeline: JSON records with detailed attributes of individual ransomware incidents. Each record includes the victim and attacker identities, the corresponding incident date, and details across seven incident aspects: attack vector, attacker actions, victim actions, impact, affected data, indirect victims, and incident timeline. Each record also provides the URLs and full text of the articles from which attributes were extracted. For each attribute, evidence is included in the form of source URLs and the exact text spans from which the attribute was extracted. A detailed JSON schema for our incident records is provided in schema.json.
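As a starting point for working with the records, the sketch below reads JSON records straight from the archive without unpacking it. The internal layout of `incidents.zip` is not described here, so the code assumes that records are stored as `.json` members, each holding either a single record or a list of records. The function name `iter_incident_records` is hypothetical; consult schema.json for the actual field names.

```python
import json
import zipfile
from typing import Iterator


def iter_incident_records(zip_path: str) -> Iterator[dict]:
    """Yield every JSON object found in the .json members of a zip archive.

    Assumption: each .json member holds either one record (a JSON object)
    or a list of records. Other members are ignored.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                payload = json.load(handle)
            if isinstance(payload, dict):
                yield payload
            elif isinstance(payload, list):
                # Keep only list items that are JSON objects.
                yield from (item for item in payload if isinstance(item, dict))
```

For example, `sum(1 for _ in iter_incident_records("incidents.zip"))` would count the records found under these assumptions.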

We also provide incidents.csv, a reduced CSV version of the JSON records to facilitate analysis.
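For quick tabulations, the CSV can be read with the Python standard library alone. The sketch below counts the values in one column. The column names of incidents.csv are not listed here, so the column is passed in as a parameter, and the helper name `count_values` is hypothetical.

```python
import csv
from collections import Counter


def count_values(csv_path: str, column: str) -> Counter:
    """Count how often each non-empty value appears in one CSV column."""
    counts: Counter = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
        for row in reader:
            value = (row[column] or "").strip()  # short rows yield None
            if value:
                counts[value] += 1
    return counts
```

Calling `count_values("incidents.csv", name).most_common(10)` with a column name taken from the file's header row would list the ten most frequent values in that column.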

Additional data (`data.zip`)

Pipeline steps

We provide raw and intermediate data from different stages of our data extraction and annotation pipeline:

Articles: Full set of input articles, each potentially reporting one or more ransomware incidents. These are sourced from the Common Crawl News dataset and additional auxiliary sources.
Deduplicated articles: The full article set with near-identical entries removed.
Annotations: Text segments, extracted using an AI chatbot, that potentially describe specific aspects of a ransomware incident.
Deduplicated annotations: Annotations with canonicalized victim names to group reports referring to the same entity.
Attributes: Structured fields extracted by an AI chatbot from the above annotations (e.g., ransom demands, payments, and operational impacts).
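To illustrate what removing near-identical articles can look like, the sketch below filters a list of texts by pairwise string similarity using `difflib.SequenceMatcher` from the Python standard library. This is a generic illustration, not the method used in our pipeline. The function name and the 0.9 threshold are assumptions chosen for the example, and the pairwise comparison is quadratic in the number of texts, so it only suits small samples.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def drop_near_duplicates(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each text and drop later near-copies.

    Two texts count as near-identical when their similarity ratio
    (after normalization) is at least `threshold` (an assumed value).
    """
    kept: list[str] = []
    kept_normalized: list[str] = []
    for text in texts:
        candidate = _normalize(text)
        is_duplicate = any(
            SequenceMatcher(None, candidate, seen, autojunk=False).ratio() >= threshold
            for seen in kept_normalized
        )
        if not is_duplicate:
            kept.append(text)
            kept_normalized.append(candidate)
    return kept
```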

Ransomware groups and variants

We provide canonicalized names of ransomware groups and variants from three independent sources, along with a consolidated list used in our analysis.

Code (`code.zip`)

Analysis

The notebook analysis.ipynb reproduces the analyses in Section 4 of the paper. To run the notebook, install the requirements and launch JupyterLab using:

pip install jupyterlab -r requirements.txt
jupyter lab --notebook-dir=code

You can then open and execute the notebook in JupyterLab.

Validation tool

We include a Streamlit-based web UI for validating the attributes in each incident record. To run the tool, install the requirements and launch the app using:

pip install streamlit -r requirements.txt
streamlit run code/validation-app.py

You can then load either a single JSON incident record or a file containing a list of records for review. Each attribute can be marked as correct or incorrect by a human annotator, and the results can be exported as a CSV file to evaluate the pipeline's accuracy.
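Once annotations are exported, per-attribute accuracy can be computed from the CSV. The sketch below assumes the export has one row per reviewed attribute, with a column naming the attribute and a column holding the verdict. The column names `attribute` and `verdict` and the label `correct` are hypothetical placeholders and should be replaced with the headers and labels found in the actual export.

```python
import csv
from collections import defaultdict


def accuracy_by_attribute(
    csv_path: str,
    attribute_col: str = "attribute",  # hypothetical column name
    verdict_col: str = "verdict",      # hypothetical column name
    correct_label: str = "correct",    # hypothetical verdict label
) -> dict[str, float]:
    """Return the fraction of reviewed rows marked correct, per attribute."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            verdict = (row.get(verdict_col) or "").strip().lower()
            if not verdict:
                continue  # skip rows that were not reviewed
            name = (row.get(attribute_col) or "").strip()
            totals[name] += 1
            if verdict == correct_label:
                correct[name] += 1
    return {name: correct[name] / totals[name] for name in totals}
```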

Reference

Armin Sarabi, Ziyuan Huang, Chenlan Wang, Tai Karir, and Mingyan Liu. "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study". In USENIX Security Symposium. 2025.

Files

Files (4.8 GB)

Name	MD5	Size
code.zip	9f17f5e74948ce7470a3c18c374f3646	555.1 kB
data.zip	2d8f4230ad265a93fe81e7d2f4e4ab15	4.8 GB
incidents.csv	0937ce926fae9da1b901ebb8f8b27c3c	4.8 MB
incidents.zip	de2420cf705c8d98de5c8b30c3b60998	25.7 MB
schema.json	4d85b5bf4cab1cc4c1025e3bb947c4fd	21.4 kB
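After downloading, the MD5 checksums listed above can be used to confirm that each file arrived intact. The sketch below hashes a file in chunks so that the 4.8 GB `data.zip` does not need to fit in memory. The helper names `md5_of_file` and `verify_file` are hypothetical, and the files are assumed to sit in the current working directory under their listed names.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "code.zip": "9f17f5e74948ce7470a3c18c374f3646",
    "data.zip": "2d8f4230ad265a93fe81e7d2f4e4ab15",
    "incidents.csv": "0937ce926fae9da1b901ebb8f8b27c3c",
    "incidents.zip": "de2420cf705c8d98de5c8b30c3b60998",
    "schema.json": "4d85b5bf4cab1cc4c1025e3bb947c4fd",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_md5: str) -> bool:
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_file("schema.json", EXPECTED_MD5["schema.json"])` returns `True` only when the local copy matches the published checksum.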
