# GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents


Visualizing the changes made by a code agent when optimizing notebooks built on the [Open Bandit Pipeline (OBP)](https://github.com/st-tech/zr-obp) and [Scope-RL](https://github.com/hakuhodo-technologies/scope-rl).

## Quick Info
`notebooks` contains the Python notebooks that are used as input.

`runs` contains the file versions the agent generated in each session it ran.

`scripts` contains various scripts for the project

`images` contains the source images for this README

The following illustrates how the main script `start.py` functions:
![flow diagram](images/flow_diagram.png)

## Usage

### Setup

The `obp` library needed to run the evaluation notebooks has issues with more recent Python versions but works well with Python 3.8. However, CrewAI requires Python 3.10+. To resolve this, we use miniconda to manage the different Python versions in this project.

#### Step 1: Create OBP Environment (Python 3.8)
First, download [miniconda](https://www.anaconda.com/download) if you don't have it installed.

```bash
# Create conda environment
conda create -n env38 python=3.8 -y
conda activate env38

# Install dependencies via conda (avoids PyYAML build issues)
conda install -c conda-forge pyyaml=6.0.0 nbformat jupyter -y

# Install OBP library
pip install obp==0.5.7
```
#### Get Python path for configuration

After activating your environment, run the following command to obtain the Python interpreter path:

```bash
which python    # macOS/Linux
where python    # Windows
# Copy the output path
```

**Common Issues:**

- **PyYAML build failure**:  
  Error: `Building wheel for PyYAML failed` or `Microsoft Visual C++ 14.0 is required`  
  *Solution*: Use `conda install` as shown above.

- **Missing nbformat**:  
  Error: `ModuleNotFoundError: No module named 'nbformat'`  
  *Solution*: Install `jupyter` as shown above.

#### Step 2: Create Scope-RL Environment (Python 3.9+)
This project also supports `Scope-RL` notebooks, which require a separate conda environment.

```bash
# Deactivate the conda environment if it is still active
conda deactivate

# Create and activate the env39_scope conda environment
conda create -n env39_scope python=3.9 -y
conda activate env39_scope

# Install Scope-RL dependencies
pip install -r requirements_scope.txt
```

Now, get the Python interpreter path for this new environment.

```bash
which python  # macOS/Linux
where python   # Windows
# Copy the output path
```

#### Step 3: Create Agent Environment (Python 3.10+)
Finally, create the main environment for running the agent framework itself.

```bash
# Deactivate the previous conda environment if it is still active
conda deactivate

# Create the main virtual environment
python -m venv venv

# Activate the virtual environment
source venv/bin/activate     # macOS/Linux
venv\Scripts\activate        # Windows

# Install agent dependencies
pip install -r requirements.txt
```

#### Step 4: Configure Python Interpreters and API Keys
Now, configure the paths for the OBP and Scope-RL Python interpreters in the `scripts/config.yml` file. The tool uses this file to automatically select the correct environment for each notebook.

Edit the `interpreter_map` section within `scripts/config.yml` with the full paths you copied from Step 1 and Step 2:
```yaml
settings:
  # ... other settings
  interpreter_map:
    # Framework-specific interpreters
    scope_rl: "/path/to/env39_scope/bin/python" # Path from Step 2
    obp: "/path/to/env38/bin/python"          # Path from Step 1
    # ... other mappings
```
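
Before saving `scripts/config.yml`, it can help to confirm that each copied path points at the Python version you expect. The following is a minimal sketch and not part of the project's scripts: `interpreter_version` is a hypothetical helper, and the map below uses the current interpreter as a stand-in for your real paths.

```python
import subprocess
import sys


def interpreter_version(python_path):
    """Return the (major, minor) version reported by the interpreter at python_path."""
    completed = subprocess.run(
        [python_path, "-c", "import sys; print(sys.version_info[0], sys.version_info[1])"],
        capture_output=True,
        text=True,
        check=True,  # raise if the path is wrong or the interpreter fails to start
    )
    major, minor = completed.stdout.split()
    return int(major), int(minor)


# Assumption: replace this stand-in with the paths you put in interpreter_map,
# e.g. {"obp": "/path/to/env38/bin/python", "scope_rl": "/path/to/env39_scope/bin/python"}.
interpreter_map = {"current": sys.executable}

for name, path in interpreter_map.items():
    print(name, interpreter_version(path))
```

For the setup above, the `obp` entry should report `(3, 8)` and the `scope_rl` entry `(3, 9)` or later.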

Next, set up your API keys.

```bash
# Copy the template and edit it with your API keys
cp scripts/API_KEYS_TEMPLATE scripts/API_KEYS
# Edit the scripts/API_KEYS file with your actual API keys
```

**You're now ready to run the tool!**

### Environment Usage Guide
This project requires **THREE separate environments**:

- **`env38` (conda)**: Python 3.8, used exclusively for running OBP notebooks.
- **`env39_scope` (conda)**: Python 3.9+, used exclusively for running Scope-RL notebooks.
- **`venv` (venv)**: Python 3.10+, the main environment for running the optimization tool and agents (`start.py`).

**Important**: For normal operation, always ensure the main `venv` environment is activated before running `python start.py`. The tool will automatically use the correct interpreter for the notebook you are running based on the configuration file.

**Important things to note:**
- The notebooks must save the results of their metrics to an external file each time they are run. We assume that these results will be available in a `results.csv` file in the `notebooks` folder. *This needs to be adjusted as it isn't the case*
- The `results.csv` file uses the following format:
    ```
    iteration,metric,estimator,result
    1,relative_ee_for_ipw_lr,ipw,0.00042397946391084545
    1,relative_ee_for_ipw_lr,dm,0.16357882127169976
    1,relative_ee_for_ipw_lr,dr,0.00036068380851626754
    ...
    ```
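
As an illustration of how a notebook could save its metrics in this format, here is a minimal sketch using only the standard library. The helper name `append_results` and its arguments are assumptions made for this example; the project's notebooks may write the file differently.

```python
import csv
from pathlib import Path

# Column order taken from the results.csv format shown above.
FIELDNAMES = ["iteration", "metric", "estimator", "result"]


def append_results(csv_path, iteration, metric, results):
    """Append one row per estimator to csv_path, writing the header if the file is new.

    `results` maps an estimator name (e.g. "ipw") to its metric value.
    """
    csv_path = Path(csv_path)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        for estimator, value in results.items():
            writer.writerow(
                {
                    "iteration": iteration,
                    "metric": metric,
                    "estimator": estimator,
                    "result": value,
                }
            )
```

A notebook cell could then call, for example, `append_results("notebooks/results.csv", 1, "relative_ee_for_ipw_lr", {"ipw": 0.0004, "dm": 0.16})` at the end of each run.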

### Output Option Reliability
Note that `manual_patch` and `agent_applies` options may encounter execution failures more frequently than `whole_code`. For maximum reliability, use `-opt whole_code`.

### Running

The script can be run using the following command:
```
python start.py path/to/notebook.ipynb model_name -fw framework_name -n iterations -opt option
```
- `-fw` (framework): Optional name of the agent framework (default: no agent).
- `-opt` (option): The output option. `whole_code` generates the complete notebook, `manual_patch` provides a patch to be applied manually, and `agent_applies` has the agent apply the changes directly. Defaults to `whole_code`.
- `-n` (iterations): Number of iterations to run the agent for (default: 1).

**Model Names**: `gemini-1.5-flash`, `gpt-4o`, `gpt-4o-mini`, `mistral-large-latest`, etc.

Available choices for the agent framework (`-fw`):
- None
- `AutoGen`
- `CrewAI`
- `two_agent`
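
To make the documented defaults concrete, the sketch below mirrors the command-line interface described above with `argparse`. It is a hypothetical reconstruction for illustration only, not the actual parser in `start.py`; the `dest` names are assumptions, and the `-a` flag is left out.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented start.py arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="start.py")
    parser.add_argument("notebook", help="path to the notebook to optimize")
    parser.add_argument("model", help="model name, e.g. gpt-4o")
    # Documented default: no agent framework.
    parser.add_argument("-fw", dest="framework", default=None)
    # Documented default: a single iteration.
    parser.add_argument("-n", dest="iterations", type=int, default=1)
    # Documented default: whole_code.
    parser.add_argument(
        "-opt",
        dest="option",
        choices=["whole_code", "manual_patch", "agent_applies"],
        default="whole_code",
    )
    return parser


args = build_parser().parse_args(["notebooks/obd.ipynb", "gemini-1.5-flash"])
print(args.framework, args.iterations, args.option)
```

With only the two positional arguments given, the sketch falls back to no framework, one iteration, and the `whole_code` output option.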

### Performance Notes
**Warning**: Advanced notebooks and multi-iteration runs can be time-intensive:
- Single iterations: 5-15 minutes typical
- 3+ iterations: 30+ minutes, potentially **over an hour**
- Advanced notebooks (`*_advanced.ipynb`): 3+ iterations can exceed 1-2 hours total
- Scope-RL notebooks generally take longer than OBP notebooks

**Recommendation**: Start with single iterations (`-n 1`) for testing before committing to longer optimization runs.

---

**Example Commands:**

**1. OBP Notebook - Default (whole_code)**
This command runs the agent on the OBP notebook `obd.ipynb` and generates the entire modified notebook as output.
```bash
python start.py notebooks/obd.ipynb gemini-1.5-flash -n 1
```

**2. OBP Notebook - Explicit whole_code**
This command runs the agent on the `multiclass.ipynb` notebook and passes `-opt whole_code` explicitly, which matches the default output option.
```bash
python start.py notebooks/multiclass.ipynb gemini-1.5-flash -opt whole_code -n 1
```

**3. Scope-RL Basic Notebook - Default (whole_code)**
This example runs the agent on a basic discrete action `Scope-RL` notebook.
```bash
python start.py notebooks/basic/basic_synthetic_discrete_zoo.ipynb gemini-1.5-flash -n 1
```

**4. Scope-RL Recommender System Notebook (CrewAI)**
Here, the agent uses the CrewAI framework to optimize a `Scope-RL` notebook, applying the changes directly over two iterations.
```bash
python start.py notebooks/rec/rec_synthetic_discrete_basic.ipynb gemini-1.5-flash -fw crewai -opt agent_applies -n 2
```

**5. Scope-RL Real-Time Bidding Notebook (AutoGen)**
This command uses the AutoGen framework to target a continuous action space notebook, generating a patch file.
```bash
python start.py notebooks/rtb/rtb_synthetic_continuous_zoo.ipynb gemini-1.5-flash -fw autogen -opt manual_patch -n 1
```

**6. Most Reliable Framework Option**
```bash
python start.py notebooks/multiclass.ipynb gemini-1.5-flash -fw two_agent -n 1
```

The `two_agent` framework creates comprehensive analysis documentation and detailed iteration tracking. Results are stored in `runs/organized_results/[framework]/[notebook]/two_agent/two_agent_results/` which includes:
- **Analysis documents**: `R-doc/Instructions*.md` and `analysis_*.md` files containing detailed parameter optimization analysis
- **Generated code**: `R-doc/newcode*.py` files for each iteration
- **Performance tracking**: `Results/iteration_*_summary.md` files with metrics and status
- **Final summary**: `Results/final_summary.md` with best iteration selection and performance analysis
- **Code differences**: `Results/diff_iteration_*.txt` showing changes between iterations
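
To browse these results programmatically, a short sketch like the following can list the per-iteration summaries in order. It assumes the `*` in `iteration_*_summary.md` is the iteration number; `iteration_summaries` is a hypothetical helper, and `results_root` is whichever `two_agent_results` directory your run produced.

```python
import re
from pathlib import Path

# Assumption: the wildcard in iteration_*_summary.md is an integer iteration number.
SUMMARY_PATTERN = re.compile(r"iteration_(\d+)_summary\.md$")


def iteration_summaries(results_root):
    """Return the iteration summary files under results_root/Results, sorted numerically."""
    found = []
    for path in Path(results_root, "Results").glob("iteration_*_summary.md"):
        match = SUMMARY_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort by iteration number so that iteration 10 comes after iteration 2.
    return [path for _, path in sorted(found)]
```

Sorting on the parsed number avoids the lexicographic ordering that a plain `sorted(glob(...))` would give.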

### Example output:
![example output 1](images/example_output_1.png)
![example output 2](images/example_output_2.png)

### Learning from past iterations
Both the LLMs and the agent frameworks can learn from their past iterations: after each run, the results and a summary of the changes are passed back to the agent/LLM, which is then prompted to optimize the notebook further.

**Note**: The specialized learning scripts (`learner_from_prev.py`, `crew_learner/crew.py`) are available but standard multi-iteration runs using `start.py` are more commonly used.
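
The feedback loop described above can be pictured with a small sketch. The function name, arguments, and prompt wording here are all assumptions made for illustration; the actual prompts used by the learner scripts may differ.

```python
def build_followup_prompt(notebook_code, results_csv_text, change_summary):
    """Assemble a follow-up prompt from the previous run's results and change summary.

    All three arguments are plain strings; the wording is illustrative only.
    """
    return (
        "You previously modified this notebook to improve its off-policy "
        "evaluation results.\n\n"
        f"Summary of the changes made so far:\n{change_summary}\n\n"
        "Results from the latest run (iteration,metric,estimator,result):\n"
        f"{results_csv_text}\n\n"
        "Further optimize the notebook below and return the complete updated code.\n\n"
        f"{notebook_code}"
    )


prompt = build_followup_prompt(
    notebook_code="print('notebook code goes here')",
    results_csv_text="1,relative_ee_for_ipw_lr,ipw,0.00042397946391084545",
    change_summary="Placeholder summary of the previous iteration's changes.",
)
print(prompt)
```

Each iteration would rebuild this prompt with the newest results, so the model always sees the effect of its last set of changes.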

**Learning with LLMs**:

To run the learner LLMs, the following command can be used:
```
python scripts/learn_from_previous_iterations/learner_from_prev.py synthetic.ipynb gemini-1.5-flash  
```

**Learning with Agent Frameworks**:

The agent frameworks can also learn from previous runs to improve their performance over multiple iterations.

**CrewAI**
```
python scripts/learn_from_previous_iterations/crew_learner/crew.py synthetic.ipynb gemini-1.5-flash -n 3
```

**AutoGen**
```
python scripts/learn_from_previous_iterations/autogen_learner.py synthetic.ipynb gemini-1.5-flash -n 2
```

### Running all models and agents:
To run each supported model (`gpt-4o`, `mistral-large`, etc.) with each supported agent (`None`, `CrewAI`, `AutoGen`), you can run:
```
python start.py path/to/notebook_folder model -a
```

## Viewing output: `runs` directory
The runs directory contains copies of the notebook each time the agent modified it: if the agent ran for 3 iterations, it will contain 4 files (including the original, untouched file).

This directory also contains `diff.txt` which displays the diff between each file to help see what parts of the file were changed by the agent in each iteration.
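
To compare any two saved versions yourself, a unified diff can be produced with Python's standard `difflib` module. This is a minimal sketch and not necessarily how the tool generates `diff.txt`; the function name and the `iteration_0`/`iteration_1` labels are hypothetical.

```python
import difflib


def notebook_diff(old_source, new_source, old_name="iteration_0", new_name="iteration_1"):
    """Return a unified diff between two versions of a file's source as one string."""
    diff_lines = difflib.unified_diff(
        old_source.splitlines(keepends=True),
        new_source.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff_lines)


before = "n_rounds = 1000\nrandom_state = 0\n"
after = "n_rounds = 5000\nrandom_state = 0\n"
print(notebook_diff(before, after))
```

Lines starting with `-` come from the older version and lines starting with `+` from the newer one, which makes it easy to see what the agent changed in a given iteration.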

Results for running all agents and frameworks are in directories ending with `-all`.

## Agent Guide
While constructing an agent to modify files, simply giving it a prompt such as

"edit this file to improve its error estimation"

proved challenging, because the agent struggles to retain the entire piece of code provided to it. This results in unnecessary code deletion, syntax errors, and in some cases no output at all.

To counteract this, we have adopted the following architecture, based on AbanteAI's [MentatBot](https://mentat.ai/blog/mentatbot-sota-coding-agent):
<p align="center">
  <img src="images/agent_flow_diagram.png" />
</p>

By dividing this optimization task into two parts, it is easier to retain necessary code and to prevent changes that are too harmful.

## Reference
Please consider citing this paper if you find this useful:

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, and Fatemeh Fard. 2025. GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents. *arXiv preprint arXiv:XXXX.XXXXX (2025)*.
```
@article{growthhacker2025ope,
  title={GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents},
  author={Wu, Jie JW and Herlihy, Ayanda Patrick and Mirza, Ahmad Saleem and Afoud, Ali and Fard, Fatemeh},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```
