LCC-LLM: Leveraging Code-Centric Dataset for Large Language Models Malware Family Attribution

Hassan Jalil, Hadi

doi:10.25781/KAUST-Z05OK

Published June 7, 2026 | Version v1

Dataset Open

Hassan Jalil, Hadi (Project leader)¹

1. King Abdullah University of Science and Technology

The Large-scale Code-Centric Dataset (LCCD) is a malware analysis dataset containing approximately 34,700 binary samples, together with deep static analysis, AI-generated analysis, decompiled code, control flow graphs, threat intelligence data, and pre-built training data for machine learning. LCC-LLM is a comprehensive code-centric dataset designed to support malware family attribution based on Large Language Models (LLMs). The dataset includes decompiled C code, assembly instructions, function call graphs (FCGs), hex dumps, and rich metadata for both malware and benign executables, enabling advanced research in malware understanding, cyber threat intelligence, and AI-driven cybersecurity.

Files

Name	Size	MD5
LCCD_Dataset_lvl19.tar	22.0 GB	b9f3fd8f087806755d009c4e47cf5a77
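Because the archive is large, it may be worth checking the download against the MD5 digest listed above before unpacking it. The following is a minimal sketch, not an official tool shipped with the dataset: the helper names `md5_of_file` and `verify_archive` are hypothetical, and the script assumes the archive was saved under its original name in the current working directory.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing above.
EXPECTED_MD5 = "b9f3fd8f087806755d009c4e47cf5a77"
ARCHIVE_NAME = "LCCD_Dataset_lvl19.tar"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path(ARCHIVE_NAME)  # assumed location: current working directory
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

An MD5 match only guards against a corrupted or truncated transfer; it is not a security guarantee. Since the archive contains malware samples, it should be unpacked only in an isolated analysis environment.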

Additional details

DOI: 10.48550/arXiv.2605.05807

Available: 2026

@article{pohlenz2026lcc,
  title={LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution},
  author={Pohlenz, Christopher G Pedraza and Hadi, Hassan Jalil and Hassan, Ali and Shoker, Ali},
  journal={arXiv preprint arXiv:2605.05807},
  year={2026}
}
