Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems

Idialu, Oseremen Joy; Matthews, Noble Saji; Maipradit, Rungroj; Nagappan, Mei; Atlee, Joanne

doi:10.5281/zenodo.18834715

Published July 2, 2024 | Version v3

Conference paper Open

1. University of Waterloo

Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT, built on large language models like GPT-4, are revolutionizing how programming tasks are performed, raising questions about whether code is authored by generative AI models. Such questions are of particular interest to educators, who worry that these tools enable a new form of academic dishonesty, in which students submit AI-generated code as their own work. Our research explores the viability of using code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code. Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4. Our classifier outperforms baselines, with an F1-score and AUC-ROC score of 0.91. A variant of our classifier that excludes gameable features (e.g., empty lines, whitespace) still performs well with an F1-score and AUC-ROC score of 0.89. We also evaluated our classifier with respect to the difficulty of the programming problem and found that there was almost no difference between easier and intermediate problems, and the classifier performed only slightly worse on harder problems. Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.

# Whodunit: CodeChef AI & Human Solutions Dataset - Replication Package

This repository contains the data for the [CodeChef](https://www.codechef.com/) problems, the code used to collect that data and extract code-style and code-complexity features from it, and the code used to build and evaluate the classifiers for the paper `Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems`. It also contains the modified baseline code.

## Data Collection

The data corresponds to 399 problems filtered from the initial set of `1100` problems. The code used for collection and filtering can be found in the `_02_data_collection/` directory.

### Data Format

The JSON data (`final_dataset.json` and `final_successful_dataset.json`) are stored as a nested dictionary. The top-level keys are the **11 difficulty levels** of CodeChef. Each difficulty level maps to a dictionary whose keys are the `problem_code_id` values on CodeChef and whose values hold the data for each problem. Each problem is a dictionary with the following keys:

- **`constraints`**: A string describing any constraints related to the problem.

- **`subtasks`**: A string detailing the subtasks associated with the problem.

- **`sample_test_cases`**: An array of dictionaries, each representing a public test case. Each test case includes:

  - `input`: The input given to the problem.

  - `output`: The expected output for the given input.

  - `explanation`: A detailed explanation of why the output is as expected.

- **`problem_statement`**: A string describing the problem, its background, and requirements.

- **`input_format`**: A string describing the format in which input is provided.

- **`output_format`**: A string describing the format in which output is expected.

- **`problem_name`**: The name of the problem.

- **`user_tags`**: An array of strings representing user-defined tags for the problem.

- **`computed_tags`**: An array of strings representing system-generated tags for the problem.

- **`problem_code_id`**: A string representing the unique code ID of the problem.

- **`difficulty_level`**: A string or number indicating the difficulty level of the problem.

- **`ai_solutions`**: An array of strings, each representing a GPT-4 (v0613) generated solution to the problem.

- **`human_solutions`**: An array of dictionaries, each containing details about a solution submitted by a user, which includes:

  - `id`: A unique identifier for the solution.

  - `submission_date`: The date of submission.

  - `language`: The programming language used.

  - `username`: The username of the submitter.

  - `user_rating_star`: The user's rating.

  - `contest_code`: The code of the contest in which the solution was submitted.

  - `tooltip`: Status of the solution (e.g., accepted, rejected).

  - `score`: The score achieved by the solution.

  - `points`: The points achieved by the solution.

  - `icon`: A link to an icon representing the status of the solution.

  - `time`: The execution time of the solution.

  - `memory`: The memory used by the solution.

  - `solution`: A unique identifier for the solution.

  - `code`: The actual code of the solution.

**Note:** The `input_format`, `output_format` and `constraints` fields are not available for older problems on CodeChef. In such cases, the information is present in the `problem_statement` field.
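
The nested layout described above can be flattened into one record per solution. The following is a minimal sketch, not code from the replication package: the helper name `iter_solutions`, the `author` label, and the sample values (including the `"easy"` key and `"PROB1"` ID) are hypothetical, and only the `ai_solutions`, `human_solutions`, and `code` keys come from the format description.

```python
def iter_solutions(dataset):
    """Yield one flat record per solution from the nested dataset dictionary.

    `dataset` is assumed to have the shape described above:
    difficulty level -> problem_code_id -> problem dictionary.
    """
    for difficulty, problems in dataset.items():
        for problem_code_id, problem in problems.items():
            # AI solutions are stored as plain strings.
            for code in problem.get("ai_solutions", []):
                yield {
                    "difficulty": difficulty,
                    "problem_code_id": problem_code_id,
                    "author": "ai",
                    "code": code,
                }
            # Human solutions are dictionaries; the source text is under "code".
            for submission in problem.get("human_solutions", []):
                yield {
                    "difficulty": difficulty,
                    "problem_code_id": problem_code_id,
                    "author": "human",
                    "code": submission["code"],
                }


# Tiny hand-made sample with hypothetical values, only to show the shape.
sample = {
    "easy": {
        "PROB1": {
            "ai_solutions": ["print(1)"],
            "human_solutions": [{"id": "1", "code": "print(1)"}],
        }
    }
}
rows = list(iter_solutions(sample))
```

With the real data, `dataset` would come from `json.load` on `final_dataset.json`; keeping `problem_code_id` on every record makes it possible to group solutions by problem later.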

## Feature Extraction

Before extracting features, remove comments and multi-line strings using:
- **`remove_all_comments.ipynb`**: Accepts three paths: `source_directory` (the directory containing the source files), `destination_directory` (the directory in which to store the files with comments removed), and `output_file_path` (a file that logs information about each file and the removal process).
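
To illustrate what this preprocessing step does, here is a rough sketch that assumes the solutions are Python source files. It is not the notebook's implementation: the function name is hypothetical, and it uses the standard `tokenize` module to drop `#` comments and multi-line string literals that stand alone as statements while leaving the remaining layout untouched.

```python
import io
import tokenize


def strip_comments_and_multiline_strings(source):
    """Remove comments and standalone multi-line strings from Python source."""
    lines = source.splitlines(keepends=True)
    spans = []  # ((start_row, start_col), (end_row, end_col)) to delete
    at_statement_start = True
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            spans.append((tok.start, tok.end))
            continue
        if tok.type == tokenize.NL:
            continue  # blank or continuation line: statement state unchanged
        is_multiline = tok.start[0] != tok.end[0]
        if tok.type == tokenize.STRING and at_statement_start and is_multiline:
            spans.append((tok.start, tok.end))
        at_statement_start = tok.type in (
            tokenize.NEWLINE,
            tokenize.INDENT,
            tokenize.DEDENT,
        )
    # Delete from the end so earlier row numbers stay valid.
    for (srow, scol), (erow, ecol) in reversed(spans):
        head = lines[srow - 1][:scol]
        tail = lines[erow - 1][ecol:]
        lines[srow - 1:erow] = [head + tail]
    return "".join(lines)


example = (
    "# header\n"
    "def f(x):\n"
    '    """Doc\n'
    '    string."""\n'
    "    return x + 1  # inc\n"
)
cleaned = strip_comments_and_multiline_strings(example)
```

A token-based approach is used here because it changes only the removed spans; re-printing the code from a syntax tree would normalize whitespace, which matters when layout features are extracted afterwards. A function whose only body is a docstring would be left without a body by this sketch.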

Three feature extraction notebooks generate the different feature sets:
- **`extract_main_features.ipynb`**: Generates `rq1_main_features.csv`, `rq3_correct_solutions_features.csv`, `rq3_sampled_solutions_features.csv`, `rq4_easy_problems_features.csv`, `rq4_medium_problems_features.csv`, and `rq4_hard_problems_features.csv`
- **`extract_with_halstead_features.ipynb`**: Generates `rq1_with_halstead_features.csv`
- **`extract_non_gameable_features.ipynb`**: Generates `rq2_non_gameable_features.csv`
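
As an illustration of the kind of layout feature these notebooks produce, the sketch below computes two features whose names appear in this package, `whiteSpaceRatio` and `emptyLinesDensity`. The formulas are assumptions chosen for illustration (whitespace characters over all characters, blank lines over all lines); the notebooks may define them differently.

```python
def whitespace_ratio(code):
    """Assumed definition: whitespace characters divided by all characters."""
    if not code:
        return 0.0
    return sum(ch.isspace() for ch in code) / len(code)


def empty_lines_density(code):
    """Assumed definition: blank lines divided by all lines."""
    lines = code.splitlines()
    if not lines:
        return 0.0
    return sum(1 for line in lines if not line.strip()) / len(lines)


snippet = "a = 1\n\nb = 2\n"
features = {
    "whiteSpaceRatio": whitespace_ratio(snippet),
    "emptyLinesDensity": empty_lines_density(snippet),
}
```

Both values depend only on formatting, so an author can change them without changing program behavior, which is why such features are treated as gameable.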

### Features

- `rq1_main_features.csv`: Contains the main classifier's features (`RQ1`).

- `rq1_with_halstead_features.csv`: Contains the Halstead features (`RQ1`).

- `rq2_non_gameable_features.csv`: Contains the non-gameable features (`RQ2`).

- `rq3_correct_solutions_features.csv`: Contains the features for solutions that passed the public test cases (`RQ3`).

- `rq3_sampled_solutions_features.csv`: Contains the features for solutions sampled from the unverified set (`RQ3`).

- `rq4_easy_problems_features.csv`: Contains the features for solutions to the easy problems (`RQ4`).

- `rq4_medium_problems_features.csv`: Contains the features for solutions to the intermediate problems (`RQ4`).

- `rq4_hard_problems_features.csv`: Contains the features for solutions to the hard problems (`RQ4`).

## Classification

Eight classification notebooks correspond to the research questions:

### RQ1: How well can code-stylometry features distinguish human-authored code from GPT-4 generated code?

- **`rq1_main_classification.ipynb`**: Uses main feature set

- **`rq1_with_halstead_classification.ipynb`**: Uses main features + Halstead metrics

### RQ2: How influential are non-gameable features in differentiating human-authored vs. GPT-4 generated code?

- **`rq2_non_gameable_classification.ipynb`**: Uses only non-gameable features (excludes whiteSpaceRatio and emptyLinesDensity)

### RQ3: How well does the classifier perform when trained and evaluated on only correct solutions?

- **`rq3_correct_solutions_classification.ipynb`**: Trained on verified correct solutions

- **`rq3_sampled_solutions_classification.ipynb`**: Trained on sampled solutions matching verified distribution

### RQ4: How well does the classifier perform when trained and evaluated across varying levels of problem difficulty?

- **`rq4_easy_problems_classification.ipynb`**: Trained on easy difficulty problems

- **`rq4_medium_problems_classification.ipynb`**: Trained on medium difficulty problems

- **`rq4_hard_problems_classification.ipynb`**: Trained on hard difficulty problems

**Each classification notebook includes:**

- Feature loading and preprocessing

- GroupKFold cross-validation (prevents data leakage by problem ID)

- XGBoost classifier training

- Performance metrics (accuracy, precision, recall, F1, AUC-ROC)

- SHAP analysis for feature interpretability
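
The idea behind grouping folds by problem ID can be shown without any dependencies. The sketch below is a simplified stand-in, not scikit-learn's `GroupKFold` and not the notebooks' code: the function name and the sample problem IDs are hypothetical. It assigns whole groups to folds so that solutions to the same problem never appear in both the training and the test split.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_indices, test_indices) with no group on both sides."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    if n_splits > len(by_group):
        raise ValueError("n_splits cannot exceed the number of groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for group in ordered:
        min(folds, key=len).extend(by_group[group])

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Hypothetical problem IDs, one entry per solution.
problem_ids = ["P1", "P1", "P2", "P2", "P3", "P3", "P4"]
splits = list(group_k_fold(problem_ids, n_splits=2))
```

Because several solutions share one problem, a plain random split could place near-identical solutions on both sides and inflate the reported scores; splitting by group avoids that.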

**Note:** Each notebook was created to run independently, hence the duplicate code in the different notebooks.

For more information, please refer to the `README.md` file.

Files (28.1 MB)

| Name | MD5 | Size |
| --- | --- | --- |
| `_01_datasets.zip` | `9d48858d99d63b4f47e1fec103377ae8` | 27.2 MB |
| `_02_data_collection.zip` | `34d13fe2129ad1c8d5e153b8a3040dca` | 14.8 kB |
| `_03_feature_extraction.zip` | `ae4cae0460b58863304048432261109e` | 18.0 kB |
| `_04_classification.zip` | `a49f7d675923930f90a4c0da8d4c79ae` | 704.4 kB |
| `_05_baseline_evaluation.zip` | `cfdd2b5a30ee70858c22dca72edd71eb` | 21.3 kB |
| `_06_pan_evaluation.zip` | `35e9950abc889e4ec44344aac7a53643` | 15.3 kB |
| `Online_Appendix.pdf` | `07acdf01b64a50a1f22e37baaec2a776` | 118.0 kB |
| `README.md` | `9cae58de01545478b6fe80dcffdc724b` | 19.1 kB |
| `requirements.txt` | `ba96a176d3f904853c30234b95ca0b4f` | 242 Bytes |
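
The MD5 values listed for each file can be used to check a download. This is a small sketch using Python's standard `hashlib`; the helper names are hypothetical and the local file path is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the listed value, ignoring case."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `matches_checksum("README.md", "9cae58de01545478b6fe80dcffdc724b")` should return `True` for an intact copy of the listed `README.md`.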
