Published August 2025 | Version 1.1
Conference proceeding

SoK: Automated TTP Extraction from CTI Reports – Are We There Yet?

Description

This repository contains the code for the paper "SoK: Automated TTP Extraction from CTI Reports – Are We There Yet?"


Abstract: Cyber Threat Intelligence (CTI) plays a critical role in sharing knowledge about new and evolving threats. With the increased prevalence and sophistication of threat actors, intelligence has expanded from simple indicators of compromise to extensive CTI reports describing high-level attack steps known as Tactics, Techniques and Procedures (TTPs). Such TTPs, often classified into the ontology of the MITRE ATT&CK framework, make CTI significantly more valuable, but also harder to interpret and automatically process. Natural Language Processing (NLP) makes it possible to automate large parts of the knowledge extraction from CTI reports; over 40 papers discuss approaches, ranging from named entity recognition through embedder models to generative large language models. Unfortunately, existing solutions are largely incomparable as they consider decidedly different and constrained settings, rely on custom TTP ontologies, and use a multitude of custom, inaccessible CTI datasets. We take stock, systematize the knowledge in the field, and empirically evaluate existing approaches in a unified setting for fair comparisons. We gain several fundamental insights, including (1) the finding of a kind of performance limit that existing approaches seemingly cannot overcome as of yet, (2) that traditional NLP approaches (possibly counterintuitively) outperform modern embedder-based and generative approaches in realistic settings, and (3) that further research on understanding inherent ambiguities in TTP ontologies and on the creation of qualitative datasets is key to take a leap in the field.


The repository is structured as follows:
* NER: Contains the code related to the NER approaches (Sections 2.2, 4).
* classification: Contains the code related to the Classification approaches (Sections 2.3, 5).
* gLLM: Contains the code related to the Generation approaches (Sections 2.4, 6).
* scraping: Contains the code for collecting papers from DBLP and Google Scholar, following the description in Section 2.1.
* ext_tools: Contains the code for the comparison experiment presented in Appendix A.1, with implementations of three state-of-the-art approaches.
* datasets: Contains the datasets used: MITRE TRAM2 with our proposed split, the Bosch AnnoCTR dataset, the Augmented TRAM2 dataset, and the corresponding instruction datasets used to train the generative LLM in natural language.


Setup

Each folder (NER, classification, gLLM) corresponds to a sub-project and therefore requires its own setup. To use the code, navigate into the corresponding folder and follow the instructions in its README.md.

Notes
Throughout this repository, the Bosch AnnoCTR dataset is often referred to as "bosch".


Files (33.2 GB)

README.md

* md5:dfd05b777e8bfcbc5b94231c9758ae35 – 25.3 MB
* md5:40fdf8fbd24cec8f2f2a442c2196f623 – 30.0 MB
* md5:486126c399a3420f44a33f189ba588e9 – 17.6 MB
* md5:c1fe895bfdab7454e87a0cdaa32008a4 – 85.9 MB
* md5:86011964e16f4a29958d4c808f374b41 – 1.4 MB
* md5:00c54427233c5dbfd2d3ce714b1b4b20 – 4.4 kB
* md5:649679f24e7e06b63b86662388e9e3fd – 33.0 GB
* md5:32ac905a28c73fe2d5404ba66b24ab90 – 4.6 kB

Additional details

Dates

Accepted
2025-06-06

Software

Programming language
Python
Development Status
Concept