Artifact for Paper: Compressing Code Context for LLM-based Issue Resolution

Anonymous

doi:10.5281/zenodo.19248411

Published March 27, 2026 | Version v1

Dataset Open

# Compressing Code Context for LLM-based Issue Resolution

This artifact accompanies the paper **"Compressing Code Context for LLM-based Issue Resolution"**. It contains the implementation of two components:

1. **Oracle-Guided Context Distillation (OCD)**: an offline, search-based pipeline that finds the minimal sufficient code context for resolving a bug. It combines hierarchical delta debugging (HDD) with genetic algorithm (GA) search and evaluates candidates in Docker.

2. **SWEzze Reranker (SWEzze)**: a fine-tuned sequence classification model that predicts which code segments to retain at inference time, without requiring the expensive search.

---

## Directory Structure

```
artifact/
├── data/
│   ├── gpt-5.2/                  # Included artifact outputs for GPT-5.2
│   ├── qwen3-coder-next/         # Included artifact outputs for Qwen3-Coder-Next
│   └── deepseek-v3.2/            # Included artifact outputs for DeepSeek-V3.2
├── OCD/
│   ├── compress/                 # OCD pipeline: CLI, HDD/GA search, reward function
│   │   ├── cli/compress.py       # Main entrypoint for context compression
│   │   └── core/                 # HDD, GA, reward function implementations
│   ├── docker/                   # Docker container management (build, run, test)
│   ├── git/                      # Git operations (clone, checkout, apply patch)
│   ├── services/                 # Repository workspace management
│   └── shared/                   # Cross-cutting utilities
│       ├── model.py              # LLM backend (API / HF / vLLM)
│       ├── prompt.py             # Prompt templates (Agentless-style)
│       ├── editing.py            # Patch post-processing
│       ├── get_repo_structure.py # Repository structure extraction
│       ├── constants.py          # SWE-bench/SWE-smith specs
│       └── ...
└── SWEzze/                       # SWEzze reranker (inference + training)
    ├── reranker.py               # PatchAwareRerankerCompressor
    ├── base.py                   # BaseCompressor interface
    ├── data/                     # Reranker training data preparation
    └── training/
        └── train_reranker.py     # Reranker training (pointwise / pairwise / support-aware)
```

---

## Included Data

The artifact includes precomputed outputs under `artifact/data/` for three models:

- `gpt-5.2`

- `qwen3-coder-next`

- `deepseek-v3.2`

Within each model directory, results are grouped by compression method:

- `swezze`

- `swepruner`

- `llmlingua`

- `longcodezip`

- `no_compression`

- `no_context`

Each `artifact/data/<model>/<method>/` directory contains:

- `compressed.jsonl`: the compressed-context outputs for that model/method setting

- `patches.jsonl`: the corresponding generated patches for the same instances

These files are included as ready-to-inspect artifact outputs for comparison, analysis, and case studies; they are not required to run the OCD pipeline or train the SWEzze reranker from scratch.
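
To get a quick overview of the included outputs, a small script along the following lines can count the records in each `compressed.jsonl` and `patches.jsonl`. This is only a sketch: the helper names `read_jsonl` and `summarize` are hypothetical, and the code makes no assumption about the fields inside each record, since the README does not document the schema of these files.

```python
import json
from pathlib import Path

# Model and method directory names as listed in this README.
MODELS = ["gpt-5.2", "qwen3-coder-next", "deepseek-v3.2"]
METHODS = ["swezze", "swepruner", "llmlingua", "longcodezip",
           "no_compression", "no_context"]


def read_jsonl(path):
    """Return the records of a JSONL file as a list of parsed JSON values."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


def summarize(data_root):
    """Map (model, method) to per-file record counts for files that exist."""
    summary = {}
    for model in MODELS:
        for method in METHODS:
            folder = Path(data_root) / model / method
            counts = {}
            for name in ("compressed.jsonl", "patches.jsonl"):
                path = folder / name
                if path.is_file():
                    counts[name] = len(read_jsonl(path))
            if counts:
                summary[(model, method)] = counts
    return summary
```

Calling `summarize("artifact/data")` from the directory that contains the unpacked archive returns one entry per model/method pair that is present on disk.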

---

## Requirements

### System Requirements

- Python 3.9+

- Docker daemon (required for OCD pipeline evaluation)

- Git

- CUDA-capable GPU (required for vLLM backend and model training; optional for API backend)

### Python Dependencies

```bash
pip install docker gitpython datasets tqdm transformers torch openai python-dotenv \
    jsonlines libcst peft trl scikit-learn
```

For vLLM inference backend:

```bash
pip install vllm
```

### Agentless

```bash
git clone https://github.com/OpenAutoCoder/Agentless.git
cd Agentless
pip install -r requirements.txt
```

### SWE-bench

```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

SWE-bench Docker images are downloaded automatically on first run. Ensure sufficient disk space (several GB per project family).

### SWE-smith

```bash
git clone https://github.com/SWE-bench/SWE-smith.git
cd SWE-smith
pip install -e .
```

Set `--dataset swesmith` when running the compression pipeline against SWE-smith instances.

---

## Environment Variables

Create a `.env` file in the project root (or export these variables):

```bash
# Required for API backend (OpenAI-compatible endpoint)
API_KEY=your_api_key_here
BASE_URL=https://api.openai.com/v1 # or your custom endpoint

# Required for cloning GitHub repositories (OCD pipeline)
GITHUB_ACCESS_TOKEN=your_github_token_here

# Optional: override HuggingFace mirror
HF_ENDPOINT=https://huggingface.co

# Optional: vLLM server configuration
VLLM_BASE_URL=http://localhost:8005/v1
VLLM_API_KEY=EMPTY
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_DATA_PARALLEL_SIZE=1
```
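
Before a long run it can help to confirm that the required variables are actually set. The check below is a minimal sketch using only the variable names from the listing above; `missing_variables` is a hypothetical helper and is not part of the artifact.

```python
import os


def missing_variables(env, use_api_backend=True, clone_repos=True):
    """Return the names of required variables that are unset or empty.

    `env` is any mapping of variable names to values, e.g. os.environ.
    """
    required = []
    if use_api_backend:
        # Listed above as required for the API backend.
        required += ["API_KEY", "BASE_URL"]
    if clone_repos:
        # Listed above as required for cloning GitHub repositories.
        required.append("GITHUB_ACCESS_TOKEN")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

If the values live in a `.env` file, calling `load_dotenv()` from `python-dotenv` (already in the dependency list) before the check copies them into `os.environ`.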

---

## Usage

### Part 1: Oracle-Guided Context Distillation (OCD)

The OCD pipeline takes Agentless output (fault localization + repair samples) and finds the minimal sufficient context for each instance.
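
To illustrate what searching for a minimal sufficient context means, here is a sketch of classic flat delta debugging (the complement-only variant of `ddmin`). It is not the artifact's implementation, which uses hierarchical delta debugging and a genetic algorithm with Docker-based evaluation. The `is_sufficient` callback is a hypothetical stand-in for that evaluation and is assumed to be deterministic.

```python
def ddmin(items, is_sufficient):
    """Shrink `items` to a 1-minimal sublist that still satisfies the oracle.

    1-minimal means that removing any single remaining item makes
    `is_sufficient` return False. Original order is preserved.
    """
    items = list(items)
    if is_sufficient([]):
        return []  # no context needed at all
    n = 2  # current number of chunks
    while len(items) >= 2:
        size = -(-len(items) // n)  # ceiling division
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        reduced = False
        for index in range(len(chunks)):
            # Try dropping one chunk and keeping everything else.
            complement = [x for j, chunk in enumerate(chunks)
                          if j != index for x in chunk]
            if is_sufficient(complement):
                items = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(items):
                break  # single-item removals all failed: 1-minimal
            n = min(n * 2, len(items))  # retry at finer granularity
    return items


if __name__ == "__main__":
    # Toy oracle: the "bug" is resolvable only if segments 3 and 7 are kept.
    print(ddmin(range(10), lambda kept: {3, 7} <= set(kept)))
```

Each oracle call in the real pipeline is expensive, which is why the README describes OCD as an offline step and SWEzze as the inference-time replacement.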

#### Input Format

The input JSONL file must contain one record per instance:

```json
{
  "instance_id": "repo__owner.issue_number",
  "samples": [
    {
      "prompt": "<Agentless-format prompt containing issue, file, and code context>",
      "patches": ["<patch diff 1>", "<patch diff 2>", ...],
      "found_files": ["path/to/file.py"],
      "found_edit_locs": {"path/to/file.py": ["function_name"]}
    }
  ]
}
```

Additionally, an `--auxiliary_data_path` directory is required containing per-instance subdirectories with:

- `coverage.json` — code coverage data for the instance

- `patch.diff` — the gold patch diff
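
A pre-flight check along these lines can catch malformed inputs before any container is started. It is a sketch, not part of the artifact: `record_problems` and `auxiliary_problems` are hypothetical helpers, and the assumption that each auxiliary subdirectory is named after its `instance_id` is mine, since the README only says the subdirectories are per-instance.

```python
from pathlib import Path

# Keys shown for each sample in the input format above.
SAMPLE_KEYS = ("prompt", "patches", "found_files", "found_edit_locs")


def record_problems(record):
    """Return a list of human-readable problems found in one input record."""
    problems = []
    if not isinstance(record.get("instance_id"), str):
        problems.append("instance_id missing or not a string")
    samples = record.get("samples")
    if not isinstance(samples, list) or not samples:
        problems.append("samples missing or empty")
        return problems
    for index, sample in enumerate(samples):
        for key in SAMPLE_KEYS:
            if key not in sample:
                problems.append(f"samples[{index}] lacks {key}")
    return problems


def auxiliary_problems(auxiliary_root, instance_id):
    """Report which of the two expected auxiliary files are absent.

    Assumes the per-instance directory is <auxiliary_root>/<instance_id>.
    """
    folder = Path(auxiliary_root) / instance_id
    return [f"{name} not found in {folder}"
            for name in ("coverage.json", "patch.diff")
            if not (folder / name).is_file()]
```

Running both checks over every line of the input JSONL and printing the non-empty results gives a short report of what needs fixing.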

#### Running the Compression Pipeline

```bash
python -m OCD.compress.cli.compress \
    --data_path /path/to/agentless_output.jsonl \
    --auxiliary_data_path /path/to/auxiliary_data \
    --model <model_name> \
    --backend api \
    --threads 4 \
    --majority_voting 5 \
    --dataset swebenchlite \
    --playground ./playground
```

**Key arguments:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--data_path` | (required) | Path to input JSONL with Agentless repair samples |
| `--auxiliary_data_path` | `./auxiliary_data` | Directory with coverage data and gold patches |
| `--model` | `gpt-3.5-turbo` | LLM model name (API model or local model path) |
| `--backend` | `auto` | `api` (OpenAI-compatible), `vllm`, `hf`, or `auto` |
| `--threads` | `4` | Number of parallel compression threads |
| `--majority_voting` | `5` | Patch candidates for majority-vote evaluation |
| `--dataset` | `swesmith` | `swebenchlite` or `swesmith` |
| `--playground` | `./playground` | Directory for cloned repositories |
| `--instance_id` | None | Process a single instance (for debugging) |

#### Output Format

Results are written to `<data_path>_compressed.jsonl`:

```json
{
  "instance_id": "repo__owner.issue_number",
  "issue_description": "...",
  "buggy_file": "path/to/file.py",
  "samples": [
    {
      "compression_method": "HDD",
      "initial_context": "<full code context>",
      "compressed_context": "<minimal sufficient context>",
      "compression_ratio": 0.35
    }
  ]
}
```

Compression methods: `HDD` (passes original patches), `GA` (full genetic algorithm), `HEURISTIC+HDD`, `GA+HDD`, `EMPTY` (no context needed).
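
For a first look at an OCD run, the output can be grouped by `compression_method`. The following sketch relies only on the fields shown in the output format above; `ratios_by_method` is a hypothetical helper, and it reports `compression_ratio` as stored without interpreting its direction.

```python
import json
from collections import defaultdict


def ratios_by_method(lines):
    """Average `compression_ratio` per `compression_method`.

    `lines` is any iterable of JSONL lines, for example an open file handle.
    """
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for sample in record.get("samples", []):
            grouped[sample["compression_method"]].append(
                sample["compression_ratio"])
    return {method: sum(values) / len(values)
            for method, values in grouped.items()}
```

Passing an open handle to the output file, as in `with open(path) as f: ratios_by_method(f)`, yields one average per method that occurs in the run.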

---

### Part 2: SWEzze Reranker

The `SWEzze` module provides a reranker-based compressor that scores and selects code segments without running the search pipeline.

#### Inference

```python
from SWEzze import PatchAwareRerankerCompressor

compressor = PatchAwareRerankerCompressor(
    model_name_or_path="Qwen/Qwen3-Reranker-0.6B",  # or path to fine-tuned checkpoint
    # adapter_path="./outputs/SWEzze_reranker/lora_adapter",  # optional LoRA adapter
    device="cuda",
    budget_tokens=4096,
)

compressed_context = compressor.compress(
    issue=issue_description,
    found_files=["path/to/file.py"],
    initial_context=full_code_context,
)
```
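
Conceptually, a reranker-based compressor scores segments and keeps the best ones that fit `budget_tokens`. The sketch below shows one simple way to do that: a greedy selection by score that restores the original segment order. It is an illustration under my own assumptions, not the logic of `PatchAwareRerankerCompressor`; the function name, the whitespace token counter, and the greedy policy are all hypothetical.

```python
def select_within_budget(segments, scores, budget_tokens,
                         count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring segments that fit the token budget.

    `segments` and `scores` are parallel lists. The default `count_tokens`
    splits on whitespace; a real tokenizer would be used in practice.
    """
    ranked = sorted(range(len(segments)),
                    key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(segments[i])
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    # Emit the survivors in their original source order.
    return [segments[i] for i in sorted(kept)]


if __name__ == "__main__":
    chunks = ["def a(): pass", "def b(): return 1", "x = 2"]
    print(select_within_budget(chunks, [0.2, 0.9, 0.5], budget_tokens=7))
```

Keeping the original order matters for code, since a shuffled context would separate definitions from their uses.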

#### Training Data Preparation

Convert OCD output to reranker training format:

```bash
python -m SWEzze.data.prepare_reranker_data \
    --input /path/to/compressed.jsonl \
    --output ./data/reranker_train.json \
    --mode pointwise \
    --split
```

#### Training the Reranker

```bash
python -m SWEzze.training.train_reranker \
    --model_name Qwen/Qwen3-Reranker-0.6B \
    --train_data ./data/reranker_train.json \
    --val_data ./data/reranker_val.json \
    --output_dir ./outputs/SWEzze_reranker \
    --mode pointwise \
    --lora_r 64 \
    --lora_alpha 128 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3 \
    --learning_rate 2e-4
```

The training script supports both base and support-aware training. The `--mode` flag selects the data format and loss:

| Mode | Description |
|------|-------------|
| `pointwise` | Binary relevance labels per (query, passage) pair |
| `pairwise` | Contrastive loss over (query, positive, negative) triplets |
| `auto` | Automatically detect format from data |
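
The difference between the two objectives can be sketched with scalar scores. The losses below are common textbook choices, binary cross-entropy on a logit for `pointwise` and a logistic ranking loss for `pairwise`. They are assumptions made for illustration and may differ from what `train_reranker.py` computes.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_loss(score, label):
    """Binary cross-entropy on one (query, passage) logit; label is 0 or 1."""
    return softplus(score) - label * score


def pairwise_loss(positive_score, negative_score):
    """Logistic ranking loss: small when the positive outranks the negative."""
    return softplus(-(positive_score - negative_score))


if __name__ == "__main__":
    print(pointwise_loss(2.0, 1), pointwise_loss(2.0, 0))
    print(pairwise_loss(2.0, -1.0), pairwise_loss(-1.0, 2.0))
```

The pointwise form judges each passage in isolation, while the pairwise form depends only on the score gap between a retained segment and a discarded one.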

---

## Notes

- **Docker daemon** must be running for the OCD pipeline (it launches Docker containers for each SWE-bench instance).

- The OCD pipeline downloads SWE-bench Docker images on first run. Ensure sufficient disk space (several GB per project).

- Repository cloning requires a valid `GITHUB_ACCESS_TOKEN`.

- The `--playground` directory stores cloned repositories between runs to avoid redundant cloning.

- Output files support incremental resumption: already-processed instances are skipped on restart.
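
The resumption behaviour described in the last note can be pictured as follows. This is a sketch of the general pattern only: `processed_ids`, `run_with_resume`, and the `process` callback are hypothetical names, and the artifact's own bookkeeping may differ.

```python
import json
import os


def processed_ids(output_path):
    """Collect the instance_id of every complete record already written."""
    done = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["instance_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip lines that are not complete records
    return done


def run_with_resume(instances, output_path, process):
    """Process only unseen instances, appending one JSON line per result."""
    done = processed_ids(output_path)
    with open(output_path, "a", encoding="utf-8") as handle:
        for instance in instances:
            if instance["instance_id"] in done:
                continue
            result = process(instance)
            handle.write(json.dumps(result) + "\n")
            handle.flush()  # keep finished work on disk if the run is killed
            done.add(instance["instance_id"])
```

Appending and flushing after every instance means an interrupted run loses at most the instance that was in progress.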

Files

`artifact.zip` (103.8 MB), md5:b5ec44b5f58624dbf51fe77dd94a2265

Additional details

Submitted: 2026-03-27
