# ContraBin: Pre-Training Representations of Binary Code Using Contrastive Learning

This repository contains the implementation of **ContraBin**, a framework for pre-training representations of binary code. ContraBin addresses the semantic challenges of binary code analysis by combining binary code, source code, and comments in a single contrastive learning framework. It relies on two techniques, simplex interpolation and intermediate representation learning, and is evaluated on four downstream binary code comprehension tasks.

---

## Key Features

- **Unified Framework**: ContraBin integrates binary code, source code, and comments to bridge the semantic gap and improve representation learning.
- **Simplex Interpolation**: A novel method inspired by human learning processes to refine contrastive learning objectives.
- **Intermediate Contrastive Learning**: Leverages semantic and contextual information from source code and comments to enhance binary code embeddings.
- **Comprehensive Evaluation**: Tested on four downstream tasks:
  - Algorithmic Functionality Classification
  - Function Name Recovery
  - Code Summarization
  - Reverse Engineering
- **Extensible Design**: Modular structure for seamless integration and experimentation.
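
To make the simplex interpolation and contrastive learning features above more concrete, here is a minimal, standard-library sketch. It is not the code in `models/`. The function names, the use of normalised Gamma draws to sample interpolation weights, and the choice of an InfoNCE-style loss are all assumptions made for illustration, since this README does not specify them.

```python
import math
import random


def sample_simplex_weights(k, rng=random):
    """Draw k non-negative weights that sum to 1 (a point on the simplex)."""
    # Assumption: normalised Gamma(1, 1) draws, i.e. a flat Dirichlet sample.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def simplex_interpolate(embeddings, weights):
    """Convex combination of equally sized embedding vectors."""
    dim = len(embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE-style loss: cross-entropy over scaled cosine similarities.

    Hypothetical stand-in for the contrastive objective; lower is better.
    """
    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[positive_index]
```

In this sketch, the embeddings of a source snippet, its binary, and its comment would be mixed with `simplex_interpolate(...)` using weights from `sample_simplex_weights(3)`, and the mixed vector would serve as the positive candidate for the binary embedding in `info_nce`.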

---

## Repository Structure

```plaintext
project_root/
├── README.md
├── requirements.txt
├── main.py
├── config/
│   ├── __init__.py
│   └── configs.py
├── models/
│   ├── __init__.py
│   ├── encoders.py
│   ├── heads.py
│   └── model.py
├── data/
│   ├── __init__.py
│   ├── data_processing.py
│   └── dataset_utils.py
└── utils/
    ├── __init__.py
    ├── metrics.py
    └── visualization.py
```

---

## Installation

1. Download the repository from Zenodo.

2. Extract the files and navigate to the project root directory.

3. Install the dependencies:  
   ```bash
   pip install -r requirements.txt
   ```

---

## Usage

### Training
To train the model, execute the following command:
```bash
python main.py --mode train --config config/configs.py
```

This starts the training pipeline with the settings in `config/configs.py`. Before running it, make sure the dataset is preprocessed and placed in the `data/` directory.

### Evaluation
To evaluate the model on downstream tasks, use:
```bash
python main.py --mode eval --config config/configs.py
```

This command loads the trained model and computes performance metrics for the specified evaluation tasks.
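
For readers who want to see how an entry point could handle the `--mode` and `--config` flags used in the commands above, here is a small sketch. It is not the contents of `main.py`. The helper names (`parse_args`, `load_config_module`, `dispatch`), the default config path, and the idea that the config file is imported as a Python module are assumptions for illustration only.

```python
import argparse
import importlib.util


def parse_args(argv=None):
    """Parse the two flags shown in the training and evaluation commands."""
    parser = argparse.ArgumentParser(description="ContraBin entry point (sketch)")
    parser.add_argument("--mode", choices=["train", "eval"], required=True)
    # Assumption: default to the config path used in the README commands.
    parser.add_argument("--config", default="config/configs.py")
    return parser.parse_args(argv)


def load_config_module(path):
    """Import a Python file by path so its top-level names act as settings."""
    spec = importlib.util.spec_from_file_location("run_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def dispatch(args, handlers):
    """Call the handler registered for args.mode with the loaded config."""
    config = load_config_module(args.config)
    return handlers[args.mode](config)
```

A caller would pass a mapping such as `{"train": run_training, "eval": run_evaluation}` to `dispatch`, where both function names are hypothetical placeholders for the real pipelines.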


### Visualization
To analyze the dataset or visualize the model's performance, run:
```bash
python utils/visualization.py
```

This script summarizes the dataset distribution and the model's performance. The generated plots and statistics are saved in the `outputs/` directory. To customize the visualization settings, edit `utils/visualization.py`.
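
As a rough illustration of the statistics side of this step, the sketch below counts class labels and writes the distribution to a JSON file under `outputs/`. It uses only the standard library and does not reproduce `utils/visualization.py`. The function name, the output file name, and the JSON format are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def save_label_distribution(labels, output_dir="outputs"):
    """Count string labels and write counts and relative frequencies as JSON."""
    counts = Counter(labels)
    total = sum(counts.values())
    stats = {
        label: {"count": n, "fraction": n / total}
        for label, n in sorted(counts.items())
    }
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    out_path = out_dir / "label_distribution.json"  # hypothetical file name
    out_path.write_text(json.dumps(stats, indent=2))
    return out_path
```

Calling `save_label_distribution(["sort", "search", "sort"])` would create `outputs/label_distribution.json` with one entry per label.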


