Published December 28, 2024 | Version v2
Dataset Open

Dataset of the Paper: Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study

Description

This dataset, collected from GitHub, was used to conduct an empirical study on security issues in GitHub Copilot-generated code. A brief description of each folder and file follows:

1. source-data folder
contains all the code files from the Code and Repository labels that we collected from GitHub and used in our study. The code snippets are included in these code files.

2. scan-result folder
contains the commands used to run the security scans and all the results produced by the static analysis tools.

3. filtered-result folder
contains the results we kept after filtering the scan results in Step 5.

4. fix-result folder
contains code snippets before and after fixes in RQ3 and the results of fixes for security issues.

5. project-url.xlsx
lists the projects from the Repository label and the source files from the Code label on GitHub that contain Copilot-generated code.
--SOURCE gives the URL of the project from GitHub.
--FILE gives the path to the source code file in the project (only for the source files from the Code label).
--NOTE gives the statement describing the project as generated by Copilot.
--FUNCTION gives a functional description of the project.
--DOMAIN gives the application domain of the project that contains the Copilot-generated code.

6. corresponding_cwe.xlsx
provides the warning messages from the static analysis tools and their corresponding CWEs.

7. files_with_security_issues.xlsx
provides information about code snippets with security issues.

8. cwe-result.xlsx
provides the types and quantities of CWEs identified in the scan results.
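
After downloading the archive listed under Files, it can be useful to confirm that the eight folders and files described above are all present. The following is a minimal sketch using only the Python standard library. The function name find_missing_entries is hypothetical, and the internal layout of the archive is an assumption: the check looks at every path component, so an extra top-level directory is tolerated.

```python
import zipfile
from pathlib import PurePosixPath

# Folder and file names taken from the description above.
EXPECTED_ENTRIES = [
    "source-data",
    "scan-result",
    "filtered-result",
    "fix-result",
    "project-url.xlsx",
    "corresponding_cwe.xlsx",
    "files_with_security_issues.xlsx",
    "cwe-result.xlsx",
]


def find_missing_entries(archive_path, expected=EXPECTED_ENTRIES):
    """Return the expected names that appear nowhere in the archive's paths."""
    seen = set()
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Collect every path component, since the archive may wrap its
            # contents in a top-level directory (assumption, not documented).
            seen.update(PurePosixPath(name).parts)
    return [entry for entry in expected if entry not in seen]
```

Calling find_missing_entries("dataset.zip") on a complete copy should return an empty list; any names it returns are the described entries that were not found.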

Files

dataset.zip (3.3 MB)
md5:3393e23ca42d6a861450abdfc5d71297
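
A downloaded copy can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's hashlib; the function names md5_of_file and matches_published_checksum are hypothetical, and the local file name dataset.zip in the usage note is an assumption about where the download was saved.

```python
import hashlib

# Checksum published for dataset.zip on this page.
EXPECTED_MD5 = "3393e23ca42d6a861450abdfc5d71297"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_published_checksum("dataset.zip") returns True only if the local file has the published digest. MD5 detects accidental corruption during download; it is not a defense against deliberate tampering.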