Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems
Description
Artificial intelligence (AI) assistants such as GitHub Copilot and ChatGPT, built on large language models like GPT-4, are revolutionizing how programming tasks are performed, raising questions about whether code is authored by generative AI models. Such questions are of particular interest to educators, who worry that these tools enable a new form of academic dishonesty, in which students submit AI-generated code as their own work. Our research explores the viability of using code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code. Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4. Our classifier outperforms baselines, with an F1-score and AUC-ROC score of 0.91. A variant of our classifier that excludes gameable features (e.g., empty lines, whitespace) still performs well, with an F1-score and AUC-ROC score of 0.89. We also evaluated our classifier with respect to the difficulty of the programming problem and found almost no difference between easier and intermediate problems; the classifier performed only slightly worse on harder problems. Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
# Whodunit: CodeChef AI & Human Solutions Dataset
This repository contains the data for the [CodeChef](https://www.codechef.com/) problems, the code used to collect it and to extract code-style and code-complexity features from it, and the code to build and evaluate the classifiers for the paper `Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems`. It also contains the modified baseline code.
## Data Collection
The data corresponds to `399` problems filtered from the initial set of `1100` problems. The code used for collection and filtering can be found in the scripts directory. The final data is stored as described below.
### Files
- `final_dataset.json`: Contains the data for all the problems on CodeChef.
- `final_successful_dataset.json`: Contains the same data as `final_dataset.json` but the `ai_solutions` field only contains the solutions that successfully passed the public test cases.
The `final_files` directory contains zipped sets of Python files (binned into `easy`, `medium`, and `hard` difficulties) that were used for the experiments in the paper.
- `unverified.zip`: Contains 2 Human and 2 AI solutions for each problem.
- `verified.zip`: Contains solutions that are verified to pass the public test cases.
- `sampled.zip`: Contains a random set of problems sourced from the `unverified` set such that it matches the distribution of the `verified` set.
### Data Format
The JSON data is stored as a nested dictionary. The top level keys are the `11 difficulty levels` of CodeChef. Each difficulty level is a dictionary with the key being the `problem_code_id` on CodeChef and the value being the data for the problem. The data for each problem is structured in a dictionary with the following keys:
- `constraints`: A string describing any constraints related to the problem.
- `subtasks`: A string detailing the subtasks associated with the problem.
- `sample_test_cases`: An array of dictionaries, each representing a public test case. Each test case includes:
- `input`: The input given to the problem.
- `output`: The expected output for the given input.
- `explanation`: A detailed explanation of why the output is as expected.
- `problem_statement`: A string describing the problem, its background, and requirements.
- `input_format`: A string describing the format in which input is provided.
- `output_format`: A string describing the format in which output is expected.
- `problem_name`: The name of the problem.
- `user_tags`: An array of strings representing user-defined tags for the problem.
- `computed_tags`: An array of strings representing system-generated tags for the problem.
- `problem_code_id`: A string representing the unique code ID of the problem.
- `difficulty_level`: A string or number indicating the difficulty level of the problem.
- `ai_solutions`: An array of strings, each representing a GPT-4 (v0613) generated solution to the problem.
- `human_solutions`: An array of dictionaries, each containing details about a solution submitted by a user, which includes:
- `id`: A unique identifier for the solution.
- `submission_date`: The date of submission.
- `language`: The programming language used.
- `username`: The username of the submitter.
- `user_rating_star`: The user's rating.
- `contest_code`: The code of the contest in which the solution was submitted.
- `tooltip`: Status of the solution (e.g., accepted, rejected).
- `score`: The score achieved by the solution.
- `points`: The points achieved by the solution.
- `icon`: A link to an icon representing the status of the solution.
- `time`: The execution time of the solution.
- `memory`: The memory used by the solution.
- `solution`: A unique identifier for the solution.
- `code`: The actual code of the solution.
**Note:** The `input_format`, `output_format` and `constraints` fields are not available for older problems on CodeChef. In such cases, the information is present in the `problem_statement` field.
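The structure above can be walked with a few lines of Python. The record below is a fabricated miniature used only for illustration (problem code, name, and solutions are made up); in practice, load `final_dataset.json` or `final_successful_dataset.json` in place of the inline string:

```python
import json

# Hand-made miniature of the dataset layout; the field values are
# illustrative, not taken from the real files.
sample = """
{
  "1": {
    "ABC123": {
      "problem_name": "Sum Two Numbers",
      "problem_code_id": "ABC123",
      "difficulty_level": "1",
      "ai_solutions": ["a, b = map(int, input().split())\\nprint(a + b)"],
      "human_solutions": [
        {"id": "1", "language": "PYTH 3.6",
         "code": "print(sum(map(int, input().split())))"}
      ]
    }
  }
}
"""
dataset = json.loads(sample)

# Top-level keys are difficulty levels; each maps a problem_code_id to
# that problem's record.
for difficulty, problems in dataset.items():
    for code_id, problem in problems.items():
        print(difficulty, code_id,
              len(problem["ai_solutions"]), len(problem["human_solutions"]))
```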
## Feature Extraction
Before extracting the features, comments and multi-line strings must be removed by running the `scripts/remove_all_comments.py` script. The script accepts a `source_directory` (the directory containing the files), a `destination_directory` (where the comment-stripped files are written), and an `output_file_path` (a file that logs information about each file and the removal process).
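The script's internals are not reproduced here, but comment removal for Python source can be sketched with the standard-library `tokenize` module (an illustrative sketch only; the repository script may take a different approach, and it additionally removes multi-line strings):

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    # Drop COMMENT tokens and let untokenize rebuild the remaining source,
    # preserving line numbers (gaps are padded with spaces).
    tokens = [
        tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type != tokenize.COMMENT
    ]
    return tokenize.untokenize(tokens)

cleaned = strip_comments("x = 1  # set x\nprint(x)  # show it\n")
```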
The code for extracting the code-style and code-complexity features from the dataset can be found in the `scripts/feature_extraction` directory. This directory has 3 files: `extract_main_features.py`, `extract_non_gameable_features.ipynb`, and `extract_with_halstead_features.ipynb`, corresponding to the 3 different feature sets used in the paper. These scripts read the files from the provided input path and write the extracted features to the provided output path. The extracted features are stored in the `data/features` directory, in which 8 feature files are generated. The feature files are named as follows:
- `main_features.csv`: Contains the main classifier's features (`RQ1`).
- `with_halstead_features.csv`: Contains the Halstead features (`RQ1`).
- `non_gameable_features.csv`: Contains the non-gameable features (`RQ2`).
- `correct_solutions_features.csv`: Contains the features for solutions that passed the public test cases (`RQ3`).
- `sampled_solutions_features.csv`: Contains the features for solutions sampled from the unverified set (`RQ3`).
- `easy_problems_features.csv`: Contains the features for solutions to the easy problems (`RQ4`).
- `medium_problems_features.csv`: Contains the features for solutions to the intermediate problems (`RQ4`).
- `hard_problems_features.csv`: Contains the features for solutions to the hard problems (`RQ4`).
**Note:** Each extraction script takes the directory of the respective dataset as input and a path for its output; in this package, the output path is the `data/features` directory.
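The exact feature definitions live in the extraction scripts; as a rough illustration, the "gameable" style features mentioned in the paper (empty lines, whitespace) are simple per-file counts. The function below is a hypothetical sketch, not the repository's code:

```python
def style_features(code: str) -> dict:
    # Illustrative code-style features of the gameable kind: counts and
    # ratios a submitter could deliberately alter.
    lines = code.splitlines()
    n = len(lines) or 1
    return {
        "num_lines": len(lines),
        "empty_line_ratio": sum(1 for l in lines if not l.strip()) / n,
        "avg_line_length": sum(len(l) for l in lines) / n,
        "whitespace_ratio": sum(c.isspace() for c in code) / (len(code) or 1),
    }

feats = style_features("a = 1\n\nb = 2\n")
```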
## Classification
There are 8 scripts, corresponding to the 8 different classifiers used in the paper and the 8 extracted feature files. The scripts are found in the `scripts/classification` directory and are named as follows:
- `main_classification.py`: Contains the code for the main classifier (`RQ1`).
- `with_halstead_classification.py`: Contains the code for the Halstead classifier (`RQ1`).
- `non_gameable_classification.py`: Contains the code for the non-gameable classifier (`RQ2`).
- `correct_solutions_classification.py`: Contains the code for the classifier trained on solutions that passed the public test cases (`RQ3`).
- `sampled_solutions_classification.py`: Contains the code for the classifier trained on solutions sampled from the unverified set (`RQ3`).
- `easy_problems_classification.py`: Contains the code for the classifier trained on solutions to the easy problems (`RQ4`).
- `medium_problems_classification.py`: Contains the code for the classifier trained on solutions to the intermediate problems (`RQ4`).
- `hard_problems_classification.py`: Contains the code for the classifier trained on solutions to the hard problems (`RQ4`).
Each of these scripts accepts a `.csv` file containing the features as input and prints the results. The scripts also display the SHAP plot for the top 10 features.
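The printed results include metrics such as the F1-score reported in the paper. As a reference, it can be computed from predictions as follows (a pure-Python sketch; the actual scripts may rely on a library such as scikit-learn):

```python
def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall; here label 1
    # denotes the positive class (e.g., "AI-generated").
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])  # one TP, one FP, one FN
```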
For more information, please refer to the `README.md` file.