Published March 27, 2026 | Version v1 | Dataset | Open
Artifact for Paper: Compressing Code Context for LLM-based Issue Resolution
# Compressing Code Context for LLM-based Issue Resolution
This artifact accompanies the paper **"Compressing Code Context for LLM-based Issue Resolution"**. It contains the implementation of two components:
1. **Oracle-Guided Context Distillation (OCD)**: an offline, search-based pipeline that finds the minimal sufficient code context for resolving a bug. It combines hierarchical delta debugging (HDD) with genetic algorithm (GA) search and evaluates candidates in Docker.
2. **SWEzze Reranker (SWEzze)**: a fine-tuned sequence classification model that predicts which code segments to retain at inference time, without requiring the expensive search.
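
To make the idea of searching for a "minimal sufficient context" concrete, here is a minimal sketch of classic flat delta debugging (`ddmin`), the family of reduction algorithms that HDD belongs to. It is not the artifact's implementation: the hierarchical traversal and the GA stage are omitted, and `is_sufficient` is a hypothetical oracle standing in for the Docker-based evaluation.

```python
from typing import Callable, List, Sequence


def ddmin(
    segments: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Shrink `segments` to a subset from which no single segment can be dropped.

    `is_sufficient` is a hypothetical oracle: it returns True when the kept
    segments still allow the bug to be resolved.
    """
    if is_sufficient([]):
        return []  # no context is needed at all
    current = list(segments)
    if not is_sufficient(current):
        raise ValueError("the full context must satisfy the oracle")

    granularity = 2
    while len(current) >= 2:
        chunk = max(1, len(current) // granularity)
        chunks = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i in range(len(chunks)):
            # Drop one chunk and keep everything else (the complement).
            complement = [s for j, c in enumerate(chunks) if j != i for s in c]
            if is_sufficient(complement):
                current = complement
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(current):
                break  # every single-segment removal was already tried
            granularity = min(len(current), granularity * 2)
    return current
```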
---
## Directory Structure
```
artifact/
├── data/
│   ├── gpt-5.2/                  # Included artifact outputs for GPT-5.2
│   ├── qwen3-coder-next/         # Included artifact outputs for Qwen3-Coder-Next
│   └── deepseek-v3.2/            # Included artifact outputs for DeepSeek-V3.2
├── OCD/
│   ├── compress/                 # OCD pipeline: CLI, HDD/GA search, reward function
│   │   ├── cli/compress.py       # Main entrypoint for context compression
│   │   └── core/                 # HDD, GA, reward function implementations
│   ├── docker/                   # Docker container management (build, run, test)
│   ├── git/                      # Git operations (clone, checkout, apply patch)
│   ├── services/                 # Repository workspace management
│   └── shared/                   # Cross-cutting utilities
│       ├── model.py              # LLM backend (API / HF / vLLM)
│       ├── prompt.py             # Prompt templates (Agentless-style)
│       ├── editing.py            # Patch post-processing
│       ├── get_repo_structure.py # Repository structure extraction
│       ├── constants.py          # SWE-bench/SWE-smith specs
│       └── ...
└── SWEzze/                       # SWEzze reranker (inference + training)
    ├── reranker.py               # PatchAwareRerankerCompressor
    ├── base.py                   # BaseCompressor interface
    ├── data/                     # Reranker training data preparation
    └── training/
        └── train_reranker.py     # Reranker training (pointwise / pairwise / support-aware)
```
---
## Included Data
The artifact includes precomputed outputs under `artifact/data/` for three
models:
- `gpt-5.2`
- `qwen3-coder-next`
- `deepseek-v3.2`
Within each model directory, results are grouped by compression method:
- `swezze`
- `swepruner`
- `llmlingua`
- `longcodezip`
- `no_compression`
- `no_context`
Each `artifact/data/<model>/<method>/` directory contains:
- `compressed.jsonl` — the compressed-context outputs for that model/method
setting
- `patches.jsonl` — the corresponding generated patches for the same instances
These files are included as ready-to-inspect artifact outputs for comparison,
analysis, and case studies; they are not required to run the OCD pipeline or
train the SWEzze reranker from scratch.
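
For inspecting these outputs, a small loader along the following lines may be convenient. It is a sketch that only relies on the directory layout described above; the assumption that every record in both files carries an `instance_id` key is not confirmed by this README, and `load_setting` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_setting(data_root: Path, model: str, method: str) -> Dict[str, dict]:
    """Pair compressed contexts with patches for one <model>/<method> directory."""
    base = data_root / model / method
    merged: Dict[str, dict] = {}
    # Assumption: records in both files are keyed by "instance_id".
    for rec in read_jsonl(base / "compressed.jsonl"):
        merged[rec["instance_id"]] = {"compressed": rec, "patch": None}
    for rec in read_jsonl(base / "patches.jsonl"):
        entry = merged.setdefault(
            rec["instance_id"], {"compressed": None, "patch": None}
        )
        entry["patch"] = rec
    return merged
```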
---
## Requirements
### System Requirements
- Python 3.9+
- Docker daemon (required for OCD pipeline evaluation)
- Git
- CUDA-capable GPU (required for vLLM backend and model training; optional for API backend)
### Python Dependencies
```bash
pip install docker gitpython datasets tqdm transformers torch openai python-dotenv \
jsonlines libcst peft trl scikit-learn
```
For vLLM inference backend:
```bash
pip install vllm
```
### Agentless
```bash
git clone https://github.com/OpenAutoCoder/Agentless.git
cd Agentless
pip install -r requirements.txt
```
### SWE-bench
```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```
SWE-bench Docker images are downloaded automatically on first run. Ensure sufficient disk space (several GB per project family).
### SWE-smith
```bash
git clone https://github.com/SWE-bench/SWE-smith.git
cd SWE-smith
pip install -e .
```
Set `--dataset swesmith` when running the compression pipeline against SWE-smith instances.
---
## Environment Variables
Create a `.env` file in the project root (or export these variables):
```bash
# Required for API backend (OpenAI-compatible endpoint)
API_KEY=your_api_key_here
BASE_URL=https://api.openai.com/v1 # or your custom endpoint
# Required for cloning GitHub repositories (OCD pipeline)
GITHUB_ACCESS_TOKEN=your_github_token_here
# Optional: override HuggingFace mirror
HF_ENDPOINT=https://huggingface.co
# Optional: vLLM server configuration
VLLM_BASE_URL=http://localhost:8005/v1
VLLM_API_KEY=EMPTY
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_DATA_PARALLEL_SIZE=1
```
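
Before a long run it can help to check that the variables above are actually set. The sketch below uses only the standard library; `missing_env_vars` is a hypothetical helper, and which variables count as required per backend is read off the comments in the `.env` example rather than from the pipeline's code. With `python-dotenv` installed, calling `load_dotenv()` first should populate `os.environ` from the `.env` file.

```python
import os
from typing import List, Mapping, Optional


def missing_env_vars(
    backend: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    required: List[str] = []
    if backend == "api":
        # Per the .env example: needed for the OpenAI-compatible API backend.
        required += ["API_KEY", "BASE_URL"]
    # Per the .env example: needed for cloning GitHub repositories.
    required.append("GITHUB_ACCESS_TOKEN")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars("api")
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete for the API backend.")
```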
---
## Usage
### Part 1: Oracle-Guided Context Distillation (OCD)
The OCD pipeline takes Agentless output (fault localization + repair samples) and finds the minimal sufficient context for each instance.
#### Input Format
The input JSONL file must contain one record per instance:
```json
{
"instance_id": "repo__owner.issue_number",
"samples": [
{
"prompt": "<Agentless-format prompt containing issue, file, and code context>",
"patches": ["<patch diff 1>", "<patch diff 2>", ...],
"found_files": ["path/to/file.py"],
"found_edit_locs": {"path/to/file.py": ["function_name"]}
}
]
}
```
Additionally, an `--auxiliary_data_path` directory is required containing per-instance subdirectories with:
- `coverage.json` — code coverage data for the instance
- `patch.diff` — the gold patch diff
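
A quick pre-flight check of the inputs could look like the sketch below. It verifies the keys shown in the example record and the two auxiliary files. The assumption that each per-instance subdirectory is named after the `instance_id` is not stated in this README, and `validate_record` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

SAMPLE_KEYS = ("prompt", "patches", "found_files", "found_edit_locs")
AUXILIARY_FILES = ("coverage.json", "patch.diff")


def validate_record(record: dict, auxiliary_root: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    instance_id = record.get("instance_id")
    if not isinstance(instance_id, str) or not instance_id:
        return ["missing or empty instance_id"]

    problems: List[str] = []
    samples = record.get("samples")
    if not isinstance(samples, list) or not samples:
        problems.append("samples must be a non-empty list")
    else:
        for idx, sample in enumerate(samples):
            for key in SAMPLE_KEYS:
                if key not in sample:
                    problems.append(f"samples[{idx}] lacks '{key}'")

    # Assumption: the subdirectory name equals the instance_id.
    for name in AUXILIARY_FILES:
        if not (auxiliary_root / instance_id / name).is_file():
            problems.append(f"auxiliary file missing: {instance_id}/{name}")
    return problems
```

Running this over every line of the input JSONL before starting the pipeline surfaces missing coverage data or gold patches early, instead of partway through a multi-hour run.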
#### Running the Compression Pipeline
```bash
python -m OCD.compress.cli.compress \
--data_path /path/to/agentless_output.jsonl \
--auxiliary_data_path /path/to/auxiliary_data \
--model <model_name> \
--backend api \
--threads 4 \
--majority_voting 5 \
--dataset swebenchlite \
--playground ./playground
```
**Key arguments:**
| Argument | Default | Description |
|----------|---------|-------------|
| `--data_path` | (required) | Path to input JSONL with Agentless repair samples |
| `--auxiliary_data_path` | `./auxiliary_data` | Directory with coverage data and gold patches |
| `--model` | `gpt-3.5-turbo` | LLM model name (API model or local model path) |
| `--backend` | `auto` | `api` (OpenAI-compatible), `vllm`, `hf`, or `auto` |
| `--threads` | `4` | Number of parallel compression threads |
| `--majority_voting` | `5` | Patch candidates for majority-vote evaluation |
| `--dataset` | `swesmith` | `swebenchlite` or `swesmith` |
| `--playground` | `./playground` | Directory for cloned repositories |
| `--instance_id` | None | Process a single instance (for debugging) |
#### Output Format
Results are written to `<data_path>_compressed.jsonl`:
```json
{
"instance_id": "repo__owner.issue_number",
"issue_description": "...",
"buggy_file": "path/to/file.py",
"samples": [
{
"compression_method": "HDD",
"initial_context": "<full code context>",
"compressed_context": "<minimal sufficient context>",
"compression_ratio": 0.35
}
]
}
```
Compression methods: `HDD` (passes original patches), `GA` (full genetic algorithm), `HEURISTIC+HDD`, `GA+HDD`, `EMPTY` (no context needed).
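
To get an overview of a finished run, the output records can be aggregated per compression method. The sketch below uses only the fields shown in the example output; `ratio_by_method` is a hypothetical helper, and it makes no assumption about whether `compression_ratio` denotes the kept or the removed fraction.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def ratio_by_method(records: Iterable[dict]) -> Dict[str, float]:
    """Average `compression_ratio` over all samples, grouped by method."""
    ratios: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        for sample in record.get("samples", []):
            ratios[sample["compression_method"]].append(
                sample["compression_ratio"]
            )
    return {method: mean(values) for method, values in ratios.items()}


if __name__ == "__main__":
    # Illustrative records only; real ones come from <data_path>_compressed.jsonl.
    demo = [
        {"samples": [{"compression_method": "HDD", "compression_ratio": 0.25}]},
        {"samples": [{"compression_method": "HDD", "compression_ratio": 0.75}]},
    ]
    print(ratio_by_method(demo))
```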
---
### Part 2: SWEzze Reranker
The `SWEzze` module provides a reranker-based compressor that scores and selects code segments without running the search pipeline.
#### Inference
```python
from SWEzze import PatchAwareRerankerCompressor
compressor = PatchAwareRerankerCompressor(
model_name_or_path="Qwen/Qwen3-Reranker-0.6B", # or path to fine-tuned checkpoint
# adapter_path="./outputs/SWEzze_reranker/lora_adapter", # optional LoRA adapter
device="cuda",
budget_tokens=4096,
)
compressed_context = compressor.compress(
issue=issue_description,
found_files=["path/to/file.py"],
initial_context=full_code_context,
)
```
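
The role of `budget_tokens` can be illustrated with a generic score-then-select routine. This is a sketch of one plausible selection strategy, not the logic inside `PatchAwareRerankerCompressor`: the function name `select_under_budget` is hypothetical, scores are supplied by the caller, and token counts are approximated by whitespace splitting instead of a real tokenizer.

```python
from typing import List, Sequence


def select_under_budget(
    segments: Sequence[str],
    scores: Sequence[float],
    budget_tokens: int,
) -> List[str]:
    """Greedily keep the highest-scoring segments that fit the token budget.

    Kept segments are returned in their original order so that the
    compressed context still reads top to bottom.
    """
    if len(segments) != len(scores):
        raise ValueError("segments and scores must have the same length")

    by_score = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = set()
    used = 0
    for i in by_score:
        cost = len(segments[i].split())  # crude stand-in for a tokenizer
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    return [segments[i] for i in sorted(kept)]
```

A greedy pass like this never exceeds the budget but is not guaranteed to be optimal; it skips a segment that does not fit and continues with lower-scoring ones that might.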
#### Training Data Preparation
Convert OCD output to reranker training format:
```bash
python -m SWEzze.data.prepare_reranker_data \
--input /path/to/compressed.jsonl \
--output ./data/reranker_train.json \
--mode pointwise \
--split
```
#### Training the Reranker
```bash
python -m SWEzze.training.train_reranker \
--model_name Qwen/Qwen3-Reranker-0.6B \
--train_data ./data/reranker_train.json \
--val_data ./data/reranker_val.json \
--output_dir ./outputs/SWEzze_reranker \
--mode pointwise \
--lora_r 64 \
--lora_alpha 128 \
--per_device_train_batch_size 8 \
--num_train_epochs 3 \
--learning_rate 2e-4
```
The training script supports base and support-aware training. The `--mode` argument accepts the following values:
| Mode | Description |
|------|-------------|
| `pointwise` | Binary relevance labels per (query, passage) pair |
| `pairwise` | Contrastive loss over (query, positive, negative) triplets |
| `auto` | Automatically detect format from data |
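
The difference between the data formats, and what automatic detection might amount to, can be sketched as follows. The key names (`query`, `passage`, `label`, `positive`, `negative`) are assumptions derived from the wording of the table, not the schema actually written by `prepare_reranker_data`, and `detect_mode` is a hypothetical helper.

```python
POINTWISE_KEYS = {"query", "passage", "label"}      # assumed field names
PAIRWISE_KEYS = {"query", "positive", "negative"}   # assumed field names


def detect_mode(record: dict) -> str:
    """Guess the training mode from the keys present in one record."""
    keys = set(record)
    if PAIRWISE_KEYS <= keys:
        return "pairwise"
    if POINTWISE_KEYS <= keys:
        return "pointwise"
    raise ValueError(f"unrecognized record format: {sorted(keys)}")


if __name__ == "__main__":
    pointwise_example = {
        "query": "TypeError when calling foo() with None",
        "passage": "def foo(x):\n    return x.strip()",
        "label": 1,  # 1 = retain this segment, 0 = drop it
    }
    pairwise_example = {
        "query": "TypeError when calling foo() with None",
        "positive": "def foo(x):\n    return x.strip()",
        "negative": "def unrelated():\n    pass",
    }
    print(detect_mode(pointwise_example), detect_mode(pairwise_example))
```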
---
## Notes
- **Docker daemon** must be running for the OCD pipeline (it launches Docker containers for each SWE-bench instance).
- The OCD pipeline downloads SWE-bench Docker images on first run. Ensure sufficient disk space (several GB per project).
- Repository cloning requires a valid `GITHUB_ACCESS_TOKEN`.
- The `--playground` directory stores cloned repositories between runs to avoid redundant cloning.
- Output files support incremental resumption: already-processed instances are skipped on restart.
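
The incremental resumption mentioned in the last note can be pictured with the following sketch. It shows a common pattern for append-only JSONL outputs and is not taken from the artifact's code; `processed_ids`, `run_resumable`, and the `process` callback are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Set


def processed_ids(output_path: Path) -> Set[str]:
    """Collect instance IDs already present in the output JSONL."""
    if not output_path.exists():
        return set()
    ids: Set[str] = set()
    with output_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                ids.add(json.loads(line)["instance_id"])
            except (json.JSONDecodeError, KeyError):
                continue  # skip lines that are not valid records
    return ids


def run_resumable(
    instances: Iterable[dict],
    output_path: Path,
    process: Callable[[dict], dict],
) -> int:
    """Process only unseen instances, appending one JSON line per result."""
    done = processed_ids(output_path)
    written = 0
    with output_path.open("a", encoding="utf-8") as fh:
        for instance in instances:
            if instance["instance_id"] in done:
                continue
            fh.write(json.dumps(process(instance)) + "\n")
            fh.flush()  # keep finished results on disk if the run is interrupted
            written += 1
    return written
```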
Files
| Name | Size | Checksum |
|---|---|---|
| artifact.zip | 103.8 MB | md5:b5ec44b5f58624dbf51fe77dd94a2265 |
Additional details
Dates
- Submitted: 2026-03-27