On Inter-dataset Code Duplication and Data Leakage in Large Language Models

Hernández López, José Antonio; Chen, Boqi; Saad, Mootez; Sharma, Tushar; Varró, Dániel

doi:10.5281/zenodo.10446176

There is a newer version of the record available.

Published December 31, 2023 | Version v1

Dataset Open

1. Linköping University
2. McGill University
3. Dalhousie University
4. Dalhousie University

This dataset contains the sparse graph described in the publication "On Inter-dataset Code Duplication and Data Leakage in Large Language Models."

This record is a snapshot of the original repository. The graph is stored in the interduplication.db database, whose schema is simple and is documented in the original repository. Each code snippet carries an identifier (id_within_dataset) that matches its identifier in the dataset from which it was extracted. The complete datasets are stored as .jsonl files in their respective folders (e.g., python-150/data.jsonl, codetrans/data.jsonl).
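A first look at the extracted archive could follow the sketch below. It rests on assumptions that the record does not confirm: that interduplication.db is a SQLite file (only the extension suggests this), that the archive extracts into a folder named code-inter-dataset-duplication, and that each line of a data.jsonl file is one JSON object. The helper names list_tables and iter_jsonl are hypothetical. The code prints whatever schema it finds rather than assuming any table or column names.

```python
import json
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: CREATE statement} for a SQLite file (assumed format)."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return dict(rows)


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed extraction folder; adjust to wherever the archive was unpacked.
    root = Path("code-inter-dataset-duplication")

    db = root / "interduplication.db"
    if db.exists():
        # Print the stored schema instead of guessing table names.
        for name, ddl in list_tables(str(db)).items():
            print(name, ddl, sep="\n", end="\n\n")

    data = root / "python-150" / "data.jsonl"
    if data.exists():
        first = next(iter_jsonl(data), None)
        if isinstance(first, dict):
            # Show which fields a record carries, e.g. to locate the snippet id.
            print(sorted(first))
```

Reading the .jsonl files line by line keeps memory use flat, which matters given the 5.7 GB archive size.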

Files (5.7 GB)

Name	MD5	Size
code-inter-dataset-duplication.tar.gz	e93e5f5060b4fb0392e15408e5bde420	5.7 GB
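Given the size of the archive, it may be worth checking the download against the MD5 value listed for it. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and verify are hypothetical, and the local file path is assumed to be the archive name in the current directory.

```python
import hashlib
import os

# MD5 value as listed for code-inter-dataset-duplication.tar.gz.
EXPECTED_MD5 = "e93e5f5060b4fb0392e15408e5bde420"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = "code-inter-dataset-duplication.tar.gz"  # assumed local path
    if os.path.exists(archive):
        print("checksum OK" if verify(archive) else "checksum MISMATCH")
```

Chunked reading avoids loading the whole archive into memory. MD5 here only detects a corrupted or truncated download; it is not a security guarantee.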

Resource type: Dataset
Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows redistribution and reuse of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: January 3, 2024
Modified: July 22, 2024
