Published June 7, 2026 | Version v1
Dataset
Open
LCC-LLM: Leveraging Code-Centric Dataset for Large Language Models Malware Family Attribution
Authors/Creators
- 1. King Abdullah University of Science and Technology
Description
The Large-scale Code-Centric Dataset (LCCD) is a malware analysis dataset of approximately 34,700 binary samples. The samples come with deep static analysis, AI-generated analysis, decompiled code, control flow graphs, threat intelligence data, and pre-built training data for machine learning. LCC-LLM is a code-centric dataset designed to support Large Language Model (LLM)-based malware family attribution. It includes decompiled C code, assembly instructions, function call graphs (FCGs), hex dumps, and metadata for both malware and benign executables, supporting research in malware understanding, cyber threat intelligence, and AI-driven cybersecurity.
Files
| Name | Size |
|---|---|
| md5:b9f3fd8f087806755d009c4e47cf5a77 | 22.0 GB |
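Because the archive is large (22.0 GB), it can be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch of that check, not part of the dataset's own tooling. The function names and the archive path in the usage note are hypothetical, since the record does not show the file name.

```python
import hashlib

# Checksum listed for the archive in the Files table of this record.
EXPECTED_MD5 = "b9f3fd8f087806755d009c4e47cf5a77"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum.
    The comparison is case-insensitive."""
    return md5_of_file(path) == expected.lower()
```

A call such as `verify_download("lcc_llm_dataset.zip")` (placeholder path) returns `True` only when the local file matches the published checksum. A mismatch usually indicates an incomplete or corrupted download.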
Additional details
Dates
- Available: 2026
References
- @article{pohlenz2026lcc, title={LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution}, author={Pohlenz, Christopher G Pedraza and Hadi, Hassan Jalil and Hassan, Ali and Shoker, Ali}, journal={arXiv preprint arXiv:2605.05807}, year={2026} }