Published July 2, 2024 | Version v2
Conference paper · Open Access

Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems

Description

Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT, built on large language models like GPT-4, are revolutionizing how programming tasks are performed, raising questions about whether code is authored by generative AI models. Such questions are of particular interest to educators, who worry that these tools enable a new form of academic dishonesty, in which students submit AI-generated code as their own work. Our research explores the viability of using code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code. Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4. Our classifier outperforms baselines, with an F1-score and AUC-ROC score of 0.91. A variant of our classifier that excludes gameable features (e.g., empty lines, whitespace) still performs well, with an F1-score and AUC-ROC score of 0.89. We also evaluated our classifier with respect to the difficulty of the programming problem and found almost no difference between easier and intermediate problems; the classifier performed only slightly worse on harder problems. Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
 

# Whodunit: CodeChef AI & Human Solutions Dataset

This repository contains the data for the [CodeChef](https://www.codechef.com/) problems, the code used to collect it and to extract code-style and code-complexity features from it, and the code to build and evaluate the classifiers for the paper `Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems`. It also contains the modified baseline code.
 

## Data Collection

The data corresponds to `399` problems filtered from the initial set of `1100` problems. The code used for collection and filtering can be found in the `scripts` directory. The final data is stored as described below.

 

### Files 

- `final_dataset.json`: Contains the data for all the problems on CodeChef.

- `final_successful_dataset.json`: Contains the same data as `final_dataset.json` but the `ai_solutions` field only contains the solutions that successfully passed the public test cases.

The `final_files` directory contains zipped sets of Python files (binned into `easy`, `medium`, and `hard` difficulties) that were used for the experiments in the paper.

- `unverified.zip`: Contains 2 human and 2 AI solutions for each problem.

- `verified.zip`: Contains solutions that are verified to pass the public test cases.

- `sampled.zip`: Contains a random set of problems sourced from the `unverified` set, sampled to match the distribution of the `verified` set.
 

### Data Format

The JSON data is stored as a nested dictionary. The top-level keys are the `11 difficulty levels` of CodeChef. Each difficulty level is a dictionary mapping a problem's `problem_code_id` on CodeChef to that problem's data. The data for each problem is structured as a dictionary with the following keys:

- `constraints`: A string describing any constraints related to the problem.

- `subtasks`: A string detailing the subtasks associated with the problem.

- `sample_test_cases`: An array of dictionaries, each representing a public test case. Each test case includes:

  - `input`: The input given to the problem.

  - `output`: The expected output for the given input.

  - `explanation`: A detailed explanation of why the output is as expected.

- `problem_statement`: A string describing the problem, its background, and requirements.

- `input_format`: A string describing the format in which input is provided.

- `output_format`: A string describing the format in which output is expected.

- `problem_name`: The name of the problem.

- `user_tags`: An array of strings representing user-defined tags for the problem.

- `computed_tags`: An array of strings representing system-generated tags for the problem.

- `problem_code_id`: A string representing the unique code ID of the problem.

- `difficulty_level`: A string or number indicating the difficulty level of the problem.

- `ai_solutions`: An array of strings, each representing a GPT-4 (v0613) generated solution to the problem.

- `human_solutions`: An array of dictionaries, each containing details about a solution submitted by a user, which includes:

  - `id`: A unique identifier for the solution.

  - `submission_date`: The date of submission.

  - `language`: The programming language used.

  - `username`: The username of the submitter.

  - `user_rating_star`: The user's rating.

  - `contest_code`: The code of the contest in which the solution was submitted.

  - `tooltip`: Status of the solution (e.g., accepted, rejected).

  - `score`: The score achieved by the solution.

  - `points`: The points achieved by the solution.

  - `icon`: A link to an icon representing the status of the solution.

  - `time`: The execution time of the solution.

  - `memory`: The memory used by the solution.

  - `solution`: A unique identifier for the solution.

  - `code`: The actual code of the solution.

**Note:** The `input_format`, `output_format` and `constraints` fields are not available for older problems on CodeChef. In such cases, the information is present in the `problem_statement` field.
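The nested layout described above can be traversed with a few lines of Python. The helper below is illustrative only (it is not part of the repository) and assumes the field names listed above:

```python
import json


def count_solutions(path):
    """Tally human and AI solutions per problem in a dataset file.

    Returns {(difficulty_level, problem_code_id): (n_human, n_ai)}.
    """
    with open(path) as f:
        dataset = json.load(f)

    counts = {}
    # Top level: difficulty level -> {problem_code_id: problem data}.
    for difficulty, problems in dataset.items():
        for problem_code_id, problem in problems.items():
            counts[(difficulty, problem_code_id)] = (
                len(problem["human_solutions"]),
                len(problem["ai_solutions"]),
            )
    return counts
```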

 

## Feature Extraction

Before extracting the features, comments and multi-line strings must be removed by running the `scripts/remove_all_comments.py` script. This script accepts three arguments: `source_directory` (the directory containing the files), `destination_directory` (where the comment-stripped files are written), and `output_file_path` (a file that logs information about each file and the removal process).
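The repository's script is the authoritative implementation; as a rough sketch of the idea, `#` comments can be stripped from Python source with the standard-library `tokenize` module (the real script additionally removes multi-line strings):

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove # comments from Python source (sketch only).

    Reconstructs the source token by token, skipping COMMENT tokens
    and re-inserting the whitespace implied by token positions.
    """
    out = []
    last_lineno, last_col = -1, 0
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok_type, tok_str, (srow, scol), (erow, ecol), _ in tokens:
        if tok_type == tokenize.COMMENT:
            continue  # drop the comment token entirely
        if srow > last_lineno:
            last_col = 0  # new line: reset the column tracker
        if scol > last_col:
            out.append(" " * (scol - last_col))  # restore spacing
        out.append(tok_str)
        last_lineno, last_col = erow, ecol
    return "".join(out)
```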

The code for extracting the code-style and code complexity features from the dataset can be found in the `scripts/feature_extraction` directory. This directory has 3 files: `extract_main_features.py`, `extract_non_gameable_features.ipynb`, and `extract_with_halstead_features.ipynb`, corresponding to the 3 different feature sets used in the paper. These scripts read the files from the provided input path and write the extracted features to the provided output path. The extracted features are stored in the `data/features` directory, which contains the 8 generated feature files. The feature files are named as follows:

- `main_features.csv`: Contains the main classifier's features (`RQ1`).

- `with_halstead_features.csv`: Contains the Halstead features (`RQ1`).

- `non_gameable_features.csv`: Contains the non-gameable features (`RQ2`).

- `correct_solutions_features.csv`: Contains the features for solutions that passed the public test cases (`RQ3`).

- `sampled_solutions_features.csv`: Contains the features for solutions sampled from the unverified set (`RQ3`).

- `easy_problems_features.csv`: Contains the features for solutions to the easy problems (`RQ4`).

- `medium_problems_features.csv`: Contains the features for solutions to the intermediate problems (`RQ4`).

- `hard_problems_features.csv`: Contains the features for solutions to the hard problems (`RQ4`).

**Note:** The extraction scripts require the path to the respective dataset as input and an output path for the results. In this package the output path is the `data/features` directory.
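A quick sanity check over the generated feature files can be written with the standard library alone. `summarize_features` below is a hypothetical helper (not part of the repository) that reports row and column counts for each CSV in a directory; it makes no assumption about the column names the extraction scripts produce:

```python
import csv
from pathlib import Path


def summarize_features(features_dir):
    """Return {file name: (data rows, columns)} for each CSV in a directory."""
    summary = {}
    for path in sorted(Path(features_dir).glob("*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])       # first row is the header
            rows = sum(1 for _ in reader)   # remaining rows are data
        summary[path.name] = (rows, len(header))
    return summary
```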

 

## Classification

There are 8 scripts, corresponding to the 8 different classifiers used in the paper and the 8 extracted feature files. The scripts are found in the `scripts/classification` directory and are named as follows:

- `main_classification.py`: Contains the code for the main classifier (`RQ1`).

- `with_halstead_classification.py`: Contains the code for the Halstead classifier (`RQ1`).

- `non_gameable_classification.py`: Contains the code for the non-gameable classifier (`RQ2`).

- `correct_solutions_classification.py`: Contains the code for the classifier trained on solutions that passed the public test cases (`RQ3`).

- `sampled_solutions_classification.py`: Contains the code for the classifier trained on solutions sampled from the unverified set (`RQ3`).

- `easy_problems_classification.py`: Contains the code for the classifier trained on solutions to the easy problems (`RQ4`).

- `medium_problems_classification.py`: Contains the code for the classifier trained on solutions to the intermediate problems (`RQ4`).

- `hard_problems_classification.py`: Contains the code for the classifier trained on solutions to the hard problems (`RQ4`).

Each of these scripts accepts a `.csv` file containing the features as input and prints the results. The scripts also display a SHAP plot of the top 10 features.
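As a minimal sketch of this kind of pipeline (not the paper's actual scripts, which define their own model, feature set, and evaluation protocol), a features table can be split, fitted, and scored with scikit-learn; the random-forest model here is an assumption, and the SHAP plotting step is omitted:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(X, y, random_state=0):
    """Train a binary classifier on a feature matrix and report F1 / AUC-ROC.

    X: 2-D array-like of numeric features (e.g., loaded from a features CSV).
    y: binary labels (1 = AI-generated, 0 = human-authored, by convention here).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    return f1_score(y_test, pred), roc_auc_score(y_test, prob)
```

For the SHAP plot, the usual pattern is to fit a `shap` explainer on the trained model and call its summary-plot routine limited to the top features.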

## For more information, please refer to the `README.md` file
