Published June 7, 2026 | Version v1
Dataset Open

LCC-LLM: Leveraging Code-Centric Dataset for Large Language Models Malware Family Attribution

  • 1. King Abdullah University of Science and Technology

Description

The Large-scale Code-Centric Dataset (LCCD) is a malware analysis dataset of approximately 34,700 binary samples with deep static analysis, AI-generated analysis, decompiled code, control flow graphs, threat intelligence data, and pre-built training data for machine learning. LCC-LLM is a code-centric dataset designed to support Large Language Model (LLM)-based malware family attribution. It includes decompiled C code, assembly instructions, function call graphs (FCGs), hex dumps, and rich metadata for both malware and benign executables, supporting research in malware understanding, cyber threat intelligence, and AI-driven cybersecurity.

Files (22.0 GB)

md5:b9f3fd8f087806755d009c4e47cf5a77 (22.0 GB)
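
Because the archive is large, it may be worth checking a finished download against the MD5 digest listed above before unpacking it. The following is a minimal sketch of such a check, not a procedure provided by the dataset authors. The archive file name `lcc_llm_dataset.zip` is hypothetical (the page does not show the file name), and the helper names `md5_of_file` and `verify_download` are introduced here for illustration only.

```python
import hashlib
from pathlib import Path

# MD5 digest as published on the dataset page.
EXPECTED_MD5 = "b9f3fd8f087806755d009c4e47cf5a77"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical file name: the archive name is not shown on the page.
    archive = Path("lcc_llm_dataset.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A matching digest only indicates that the download is intact. Since the archive contains malware samples, it should still be handled in an isolated analysis environment.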

Additional details

Dates

Available
2026

References

  • @article{pohlenz2026lcc, title={LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution}, author={Pohlenz, Christopher G Pedraza and Hadi, Hassan Jalil and Hassan, Ali and Shoker, Ali}, journal={arXiv preprint arXiv:2605.05807}, year={2026} }