Published July 2, 2024 | Version v2
Conference paper · Open Access

Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems

Description

Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT, built on large language models like GPT-4, are revolutionizing how programming tasks are performed, raising questions about whether code is authored by generative AI models. Such questions are of particular interest to educators, who worry that these tools enable a new form of academic dishonesty, in which students submit AI-generated code as their own work. Our research explores the viability of using code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code. Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4. Our classifier outperforms baselines, with an F1-score and AUC-ROC score of 0.91. A variant of our classifier that excludes gameable features (e.g., empty lines, whitespace) still performs well, with an F1-score and AUC-ROC score of 0.89. We also evaluated our classifier with respect to the difficulty of the programming problem and found almost no difference between easier and intermediate problems; the classifier performed only slightly worse on harder problems. Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
 

# Whodunit: CodeChef AI & Human Solutions Dataset

This repository contains the data for the [CodeChef](https://www.codechef.com/) problems, the code used to collect it and to extract code-style and code-complexity features from it, and the code to build and evaluate the classifiers for the paper `Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems`. It also contains the modified baseline code.
 

## Data Collection

The data corresponds to `399` problems filtered from the initial set of `1100` problems. The code used for collection and filtering can be found in the `scripts` directory. The final data is stored as described below.

 

### Files 

- `final_dataset.json`: Contains the data for all the problems on CodeChef.

- `final_successful_dataset.json`: Contains the same data as `final_dataset.json` but the `ai_solutions` field only contains the solutions that successfully passed the public test cases.

The `final_files` directory contains zipped sets of Python files (binned into `easy`, `medium`, and `hard` difficulties) that were used for the experiments in the paper.

- `unverified.zip`: Contains 2 human and 2 AI solutions for each problem.

- `verified.zip`: Contains solutions that are verified to pass the public test cases.

- `sampled.zip`: Contains a random set of problems sourced from the `unverified` set, sampled to match the distribution of the `verified` set.
 

### Data Format

The JSON data is stored as a nested dictionary. The top-level keys are the `11 difficulty levels` of CodeChef. Each difficulty level is a dictionary mapping a problem's `problem_code_id` on CodeChef to that problem's data. The data for each problem is structured as a dictionary with the following keys:

- `constraints`: A string describing any constraints related to the problem.

- `subtasks`: A string detailing the subtasks associated with the problem.

- `sample_test_cases`: An array of dictionaries, each representing a public test case. Each test case includes:

  - `input`: The input given to the problem.

  - `output`: The expected output for the given input.

  - `explanation`: A detailed explanation of why the output is as expected.

- `problem_statement`: A string describing the problem, its background, and requirements.

- `input_format`: A string describing the format in which input is provided.

- `output_format`: A string describing the format in which output is expected.

- `problem_name`: The name of the problem.

- `user_tags`: An array of strings representing user-defined tags for the problem.

- `computed_tags`: An array of strings representing system-generated tags for the problem.

- `problem_code_id`: A string representing the unique code ID of the problem.

- `difficulty_level`: A string or number indicating the difficulty level of the problem.

- `ai_solutions`: An array of strings, each representing a GPT-4 (v0613) generated solution to the problem.

- `human_solutions`: An array of dictionaries, each containing details about a solution submitted by a user, which includes:

  - `id`: A unique identifier for the solution.

  - `submission_date`: The date of submission.

  - `language`: The programming language used.

  - `username`: The username of the submitter.

  - `user_rating_star`: The user's rating.

  - `contest_code`: The code of the contest in which the solution was submitted.

  - `tooltip`: Status of the solution (e.g., accepted, rejected).

  - `score`: The score achieved by the solution.

  - `points`: The points achieved by the solution.

  - `icon`: A link to an icon representing the status of the solution.

  - `time`: The execution time of the solution.

  - `memory`: The memory used by the solution.

  - `solution`: A unique identifier for the solution.

  - `code`: The actual code of the solution.

**Note:** The `input_format`, `output_format` and `constraints` fields are not available for older problems on CodeChef. In such cases, the information is present in the `problem_statement` field.
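The nested layout described above can be traversed with a few lines of Python. The helper below is illustrative only (it is not part of the repository) and assumes the field names listed above:

```python
import json


def count_solutions(path):
    """Tally human and AI solutions per problem in a dataset file.

    Returns {(difficulty_level, problem_code_id): (n_human, n_ai)}.
    """
    with open(path) as f:
        dataset = json.load(f)

    counts = {}
    # Top level: difficulty level -> {problem_code_id: problem data}.
    for difficulty, problems in dataset.items():
        for problem_code_id, problem in problems.items():
            counts[(difficulty, problem_code_id)] = (
                len(problem["human_solutions"]),
                len(problem["ai_solutions"]),
            )
    return counts
```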

 

## Feature Extraction

Before extracting the features, comments and multi-line strings must be removed by running the `scripts/remove_all_comments.py` script. This script accepts three arguments: `source_directory` (the directory containing the files), `destination_directory` (where the comment-stripped files are written), and `output_file_path` (a file that logs information about each file and the removal process).
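The repository's script is the authoritative implementation; as a rough sketch of the idea, `#` comments can be stripped from Python source with the standard-library `tokenize` module (the real script additionally removes multi-line strings):

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove # comments from Python source (sketch only).

    Reconstructs the source token by token, skipping COMMENT tokens
    and re-inserting the whitespace implied by token positions.
    """
    out = []
    last_lineno, last_col = -1, 0
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok_type, tok_str, (srow, scol), (erow, ecol), _ in tokens:
        if tok_type == tokenize.COMMENT:
            continue  # drop the comment token entirely
        if srow > last_lineno:
            last_col = 0  # new line: reset the column tracker
        if scol > last_col:
            out.append(" " * (scol - last_col))  # restore spacing
        out.append(tok_str)
        last_lineno, last_col = erow, ecol
    return "".join(out)
```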

The code for extracting the code-style and code complexity features from the dataset can be found in the `scripts/feature_extraction` directory. This directory has 3 files: `extract_main_features.py`, `extract_non_gameable_features.ipynb`, and `extract_with_halstead_features.ipynb`, corresponding to the 3 different feature sets used in the paper. These scripts read the files from the provided input path and write the extracted features to the provided output path. The extracted features are stored in the `data/features` directory, which contains the 8 generated feature files. The feature files are named as follows:

- `main_features.csv`: Contains the main classifier's features (`RQ1`).

- `with_halstead_features.csv`: Contains the Halstead features (`RQ1`).

- `non_gameable_features.csv`: Contains the non-gameable features (`RQ2`).

- `correct_solutions_features.csv`: Contains the features for solutions that passed the public test cases (`RQ3`).

- `sampled_solutions_features.csv`: Contains the features for solutions sampled from the unverified set (`RQ3`).

- `easy_problems_features.csv`: Contains the features for solutions to the easy problems (`RQ4`).

- `medium_problems_features.csv`: Contains the features for solutions to the intermediate problems (`RQ4`).

- `hard_problems_features.csv`: Contains the features for solutions to the hard problems (`RQ4`).

**Note:** The extraction scripts require the path to the respective dataset as input and an output path for the results. In this package the output path is the `data/features` directory.
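A quick sanity check over the generated feature files can be written with the standard library alone. `summarize_features` below is a hypothetical helper (not part of the repository) that reports row and column counts for each CSV in a directory; it makes no assumption about the column names the extraction scripts produce:

```python
import csv
from pathlib import Path


def summarize_features(features_dir):
    """Return {file name: (data rows, columns)} for each CSV in a directory."""
    summary = {}
    for path in sorted(Path(features_dir).glob("*.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])       # first row is the header
            rows = sum(1 for _ in reader)   # remaining rows are data
        summary[path.name] = (rows, len(header))
    return summary
```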

 

## Classification

There are 8 scripts, corresponding to the 8 different classifiers used in the paper and the 8 extracted feature files. The scripts are found in the `scripts/classification` directory and are named as follows:

- `main_classification.py`: Contains the code for the main classifier (`RQ1`).

- `with_halstead_classification.py`: Contains the code for the Halstead classifier (`RQ1`).

- `non_gameable_classification.py`: Contains the code for the non-gameable classifier (`RQ2`).

- `correct_solutions_classification.py`: Contains the code for the classifier trained on solutions that passed the public test cases (`RQ3`).

- `sampled_solutions_classification.py`: Contains the code for the classifier trained on solutions sampled from the unverified set (`RQ3`).

- `easy_problems_classification.py`: Contains the code for the classifier trained on solutions to the easy problems (`RQ4`).

- `medium_problems_classification.py`: Contains the code for the classifier trained on solutions to the intermediate problems (`RQ4`).

- `hard_problems_classification.py`: Contains the code for the classifier trained on solutions to the hard problems (`RQ4`).

Each of these scripts accepts a `.csv` file containing the features as input and prints the results. The scripts also display a SHAP plot of the top 10 features.
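As a minimal sketch of this kind of pipeline (not the paper's actual scripts, which define their own model, feature set, and evaluation protocol), a features table can be split, fitted, and scored with scikit-learn; the random-forest model here is an assumption, and the SHAP plotting step is omitted:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(X, y, random_state=0):
    """Train a binary classifier on a feature matrix and report F1 / AUC-ROC.

    X: 2-D array-like of numeric features (e.g., loaded from a features CSV).
    y: binary labels (1 = AI-generated, 0 = human-authored, by convention here).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    return f1_score(y_test, pred), roc_auc_score(y_test, prob)
```

For the SHAP plot, the usual pattern is to fit a `shap` explainer on the trained model and call its summary-plot routine limited to the top features.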

## For more information, please refer to the `README.md` file
