Published June 6, 2025 | Version 1.0.0
Dataset Open

Artifacts for "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study"

  • 1. University of Michigan

Description

This repository contains data and analysis code for the paper "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study", presented at the USENIX Security Symposium 2025. Our dataset consists of structured JSON records summarizing ransomware incidents reported before December 2024. The accompanying code demonstrates various statistical analyses enabled by this dataset, as described in the paper. A summary of the provided data and code is included below.

Incident data (incidents.zip)

This archive contains the final output of our pipeline: JSON records with detailed attributes of individual ransomware incidents. Each record includes the victim and attacker identities, the corresponding incident date, and details across seven incident aspects: attack vector, attacker actions, victim actions, impact, affected data, indirect victims, and incident timeline. Each record also provides the URLs and full text of the articles from which attributes were extracted. For each attribute, evidence is included in the form of source URLs and the exact text spans from which the attribute was extracted. A detailed JSON schema for our incident records is provided in schema.json.
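As a starting point for exploring the archive, the following is a minimal sketch that walks every .json member of a zip file and collects the top-level keys it finds. The internal layout of incidents.zip and the field names in the demo are assumptions made for illustration; schema.json remains the authoritative description of the record structure.

```python
import io
import json
import zipfile


def iter_incident_records(archive):
    """Yield (member_name, parsed_json) for every .json file in a zip archive.

    `archive` may be a path such as "incidents.zip" or a binary file object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith(".json"):
                with zf.open(name) as fh:
                    yield name, json.load(fh)


def top_level_keys(records):
    """Collect the set of top-level keys seen across dict-shaped records."""
    keys = set()
    for _, record in records:
        if isinstance(record, dict):
            keys.update(record)
    return keys


if __name__ == "__main__":
    # Tiny in-memory stand-in archive; the field names here are hypothetical
    # and are not taken from schema.json.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.json", json.dumps({"victim": "Example Corp", "impact": {}}))
        zf.writestr("notes.txt", "not a record")
    buf.seek(0)
    print(sorted(top_level_keys(iter_incident_records(buf))))
```

With the real archive, passing the path to incidents.zip instead of the in-memory buffer should work the same way, provided the records are stored as individual .json members.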

We also provide incidents.csv, a reduced CSV version of the JSON records to facilitate analysis.
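For quick tabulations on the CSV, a sketch along the following lines may be enough. It uses only the standard library and counts rows per distinct value of one column. The column names in the demo ("victim", "year") are hypothetical; check the header of incidents.csv for the actual names.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column`, skipping empty cells."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return Counter(row[column] for row in reader if row[column])


if __name__ == "__main__":
    # Hypothetical sample rows standing in for incidents.csv.
    sample = io.StringIO("victim,year\nA,2020\nB,2020\nC,\nD,2021\n")
    for value, n in sorted(count_by_column(sample, "year").items()):
        print(value, n)
```

To run it on the real file, open it with `open("incidents.csv", newline="", encoding="utf-8")` and pass the resulting file object together with a column name that exists in the header.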

Additional data (data.zip)

Pipeline steps

We provide raw and intermediate data from different stages of our data extraction and annotation pipeline:

  1. Articles: Full set of input articles, each potentially reporting one or more ransomware incidents. These are sourced from the Common Crawl News dataset and additional auxiliary sources.
  2. Deduplicated articles: A filtered version of the full article set, with near-identical entries removed.
  3. Annotations: Text segments, extracted using an AI chatbot, that potentially describe specific aspects of a ransomware incident.
  4. Deduplicated annotations: Annotations with canonicalized victim names to group reports referring to the same entity.
  5. Attributes: Structured fields extracted by an AI chatbot from the above annotations (e.g., ransom demands, payments, and operational impacts).
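To make the two deduplication steps above more concrete, here is an illustrative sketch of one common approach: word-shingle Jaccard similarity for dropping near-identical articles, and simple string normalization for grouping victim names. This is not the method used in the paper. The threshold, the shingle size, and the suffix list are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(articles, threshold=0.8):
    """Keep the first article of each near-identical group.

    Compares every article against all kept ones, so it is quadratic in the
    number of articles and only suitable for small collections.
    """
    kept, kept_shingles = [], []
    for text in articles:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


def canonical_name(name):
    """Normalize an organization name: lowercase, drop punctuation and
    trailing corporate suffixes (the suffix list is an assumption)."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    suffixes = {"inc", "llc", "ltd", "corp", "corporation", "co"}
    while tokens and tokens[-1] in suffixes:
        tokens.pop()
    return " ".join(tokens)
```

For a corpus the size of a news crawl, a pairwise comparison like this would be too slow; an approximate technique such as MinHash with locality-sensitive hashing is the usual substitute.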

Ransomware groups and variants

We provide canonicalized names of ransomware groups and variants from three independent sources, along with a consolidated list used in our analysis.

Code (code.zip)

Analysis

The notebook analysis.ipynb reproduces the analyses in Section 4 of the paper. To run the notebook, install the requirements and launch JupyterLab using:

pip install jupyterlab -r requirements.txt
jupyter lab --notebook-dir=code

You can then open and execute the notebook in JupyterLab.

Validation tool

We include a Streamlit-based web UI for validating the attributes in each incident record. To run the tool, install the requirements and launch the app using:

pip install streamlit -r requirements.txt
streamlit run code/validation-app.py

You can then load either a single JSON incident record or a file containing a list of records for review. Each attribute can be marked as correct or incorrect by a human annotator, and the results can be exported as a CSV file to evaluate the pipeline's accuracy.
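Once annotations have been exported, per-attribute accuracy can be computed with a few lines of Python. The sketch below assumes the exported CSV has one row per reviewed attribute, with a column naming the attribute and a column holding the verdict as "correct" or "incorrect". Those column names and values are hypothetical; adjust them to match the file the tool actually produces.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_attribute(csv_file, attribute_col="attribute", verdict_col="verdict"):
    """Return {attribute: fraction marked correct} for reviewed rows.

    Column names and verdict values are assumptions, not the tool's
    documented export format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in csv.DictReader(csv_file):
        verdict = (row[verdict_col] or "").strip().lower()
        if verdict not in {"correct", "incorrect"}:
            continue  # skip rows that were not reviewed
        totals[row[attribute_col]] += 1
        correct[row[attribute_col]] += verdict == "correct"
    return {name: correct[name] / totals[name] for name in totals}


if __name__ == "__main__":
    # Hypothetical export with three reviewed rows and one unreviewed row.
    export = io.StringIO(
        "attribute,verdict\n"
        "ransom_demand,correct\n"
        "ransom_demand,incorrect\n"
        "attack_vector,correct\n"
        "attack_vector,\n"
    )
    print(accuracy_by_attribute(export))
```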

Reference

Armin Sarabi, Ziyuan Huang, Chenlan Wang, Tai Karir, and Mingyan Liu. "The Ransomware Decade: The Creation of a Fine-Grained Dataset and a Longitudinal Study". In USENIX Security Symposium. 2025.

Files

Files (4.8 GB)

Size      Checksum
555.1 kB  md5:9f17f5e74948ce7470a3c18c374f3646
4.8 GB    md5:2d8f4230ad265a93fe81e7d2f4e4ab15
4.8 MB    md5:0937ce926fae9da1b901ebb8f8b27c3c
25.7 MB   md5:de2420cf705c8d98de5c8b30c3b60998
21.4 kB   md5:4d85b5bf4cab1cc4c1025e3bb947c4fd
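The MD5 checksums listed under Files can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard hashlib module. It reads the file in chunks so that the multi-gigabyte archive does not need to fit in memory. The helper names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:9f17...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("data.zip", "md5:...")` returns True when the local file matches the checksum copied from the listing for that file.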