Published July 1, 2026 | Version 1.0.0

CTU-HONEY-LLM-2: Two Datasets of Shell Interactions for Fine-Tuning LLM-Based SSH Honeypots

  • 1. Czech Technical University in Prague
  • 1. Universidad Nacional de Cuyo
  • 2. Czech Technical University in Prague

Description

Datasets used to fine-tune open large language models (LLMs) as interactive SSH honeypots that emulate a Linux shell. Each conversation is a multi-turn exchange between a user (attacker input) and an assistant (Linux-terminal output), teaching the model to respond to shell commands as a real terminal would.

Two datasets are released:

  • D_orig (112 conversations): the original dataset, used to fine-tune GPT-3.5 and, via QLoRA, Llama 3.1 8B. Covers basic commands such as `ls`, `cd`, `cat`, `touch`, `echo`, `who`, `sudo`, `ssh`, and `cp`. Most conversations have 1–3 turns. Coverage of stateful interactions, command history, and permission errors is limited.
  • D_new (284 conversations): an expanded dataset built from real SSH honeypot logs. Broader command coverage and multi-turn interactions, including file creation/deletion, directory changes, and other stateful behaviors. Targets the specific failures that let attackers identify fake shells.

The test sets for D_orig and D_new are distributed as files separate from the respective full files. Use the full files for training. Use the test files only for evaluation and keep them out of training.
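The record does not state whether any test conversation also appears in a full file. As a precaution, a minimal sketch of a hold-out check is given here. It assumes each conversation has already been loaded as a list of message dictionaries, and it only catches exact duplicates. The function names `conversation_key` and `drop_held_out` are hypothetical and not part of the released code.

```python
import json


def conversation_key(messages):
    """Return a canonical string for a conversation so two copies compare equal."""
    return json.dumps(messages, sort_keys=True, ensure_ascii=False)


def drop_held_out(full, test):
    """Return the conversations in `full` that do not appear verbatim in `test`."""
    held_out = {conversation_key(messages) for messages in test}
    return [messages for messages in full if conversation_key(messages) not in held_out]
```

If the two files are already disjoint, `drop_held_out` returns the full list unchanged, so running it before training costs little.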

The files are formatted as JSON-lines (`.jsonl`) following the OpenAI chat schema with one conversation per line:

```json
{"messages": [
  {"role": "system",    "content": "You are a Linux OS terminal. ..."},
  {"role": "user",      "content": "ls -la"},
  {"role": "assistant", "content": "total 20\ndrwxr-xr-x ..."}
]}
```
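A minimal sketch of reading this format with the Python standard library follows. It assumes only what is described above: one JSON object per line with a `messages` list whose entries carry `role` and `content`. The helper names `parse_conversations` and `count_turns` are hypothetical, and counting one turn per user message is an assumption about how turns are defined.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def parse_conversations(lines):
    """Parse JSON-lines text into a list of message lists, one per conversation."""
    conversations = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        messages = json.loads(line)["messages"]
        for msg in messages:
            if msg["role"] not in VALID_ROLES:
                raise ValueError(f"line {line_no}: unexpected role {msg['role']!r}")
            if not isinstance(msg["content"], str):
                raise ValueError(f"line {line_no}: content must be a string")
        conversations.append(messages)
    return conversations


def count_turns(messages):
    """Count turns in a conversation, taking one turn to be one user message."""
    return sum(1 for msg in messages if msg["role"] == "user")
```

Because an open file iterates line by line, a file handle can be passed directly, for example `parse_conversations(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded `.jsonl` files.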

Files

Files (6.4 MB)

Size and checksum of each file:

  • 5.5 MB, md5:2d2921575b1042130222f357917b0bb3
  • 641.7 kB, md5:d59396caec19bbe03aad3d058be5231f
  • 158.3 kB, md5:ac94fa1906f9e2d578558baa90693530
  • 35.3 kB, md5:1585ac5634bcf8f9691aa359720cc1f2
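The listed MD5 values can be used to confirm that a download is intact. The sketch below uses Python's `hashlib` for this. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the path passed in is whatever local name the downloaded file has.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum, with or without the `md5:` prefix."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A `False` result suggests a truncated or corrupted download, or that the file was compared against the checksum of a different file.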

Additional details

Related works

Continues
Conference paper: 10.1109/EuroSPW67616.2025.00082 (DOI)

Software

Repository URL
https://github.com/stratosphereips/shelLM
Programming language
Python
Development Status
Active