Technical note Open Access

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris

GitHub projects can be easily replicated through the site's fork
process or through a Git clone-push sequence.  This is a problem for
empirical software engineering, because it can lead to skewed results
or mistrained machine learning models.  We provide a dataset of 10.6
million GitHub projects that are copies of others, and link each record
with the project's ultimate parent.  The ultimate parents were derived
from a ranking along six metrics.  The related projects were calculated
as the connected components of an 18.2 million node and 12 million
edge denoised graph created by directing edges to ultimate parents.
The graph was created by filtering out more than 30 hand-picked and 2.3
million pattern-matched clumping projects.  Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects.  Our dataset identified
30 thousand duplicate projects in an existing popular reference dataset
of 1.8 million projects.  An evaluation of our dataset against another
created independently with different methods found a significant overlap,
but also differences attributed to the operational definition of what
projects are considered as related.


Files (1.8 MB)
Name Size
1.8 MB Download
All versions This version
Views 653653
Downloads 244244
Data volume 448.1 MB448.1 MB
Unique views 618618
Unique downloads 225225


Cite as