Technical note Open Access

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris

GitHub projects can be easily replicated through the site's fork
process or through a Git clone-push sequence.  This is a problem for
empirical software engineering, because it can lead to skewed results
or mistrained machine learning models.  We provide a dataset of 10.6
million GitHub projects that are copies of others, and link each record
with the project's ultimate parent.  The ultimate parents were derived
from a ranking along six metrics.  Related projects were identified as
the connected components of a denoised graph of 18.2 million nodes and
12 million edges, created by directing edges to ultimate parents.
The graph was created by filtering out more than 30 hand-picked and 2.3
million pattern-matched clumping projects.  Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects.  Our dataset identified
30 thousand duplicate projects in an existing popular reference dataset
of 1.8 million projects.  An evaluation of our dataset against another
created independently with different methods found a significant overlap,
but also differences attributed to the operational definition of which
projects are considered related.
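
The grouping step described above can be illustrated with a minimal
sketch.  It assumes a single hypothetical numeric score per project as a
stand-in for the six-metric ranking (the abstract does not name the
metrics), uses invented project names, and finds connected components
with a union-find structure, which is one common choice and not
necessarily the method used to build the dataset.

```python
from collections import defaultdict


def find(parent, node):
    """Return the representative of node's set, halving paths on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def deduplicate(edges, score, clumping=frozenset()):
    """Map each copied project to the top-scoring project of its component.

    edges    -- iterable of (project, project) pairs linking related projects
    score    -- dict of hypothetical per-project scores (missing means 0)
    clumping -- projects whose edges are dropped before grouping
    """
    parent = {}
    for a, b in edges:
        # Denoising: skip any edge that touches a clumping project.
        if a in clumping or b in clumping:
            continue
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected component.
    components = defaultdict(list)
    for node in parent:
        components[find(parent, node)].append(node)

    # Link every non-parent member to its component's ultimate parent.
    # Ties on score are broken by project name to keep the result stable.
    result = {}
    for members in components.values():
        ultimate = max(members, key=lambda p: (score.get(p, 0), p))
        for member in members:
            if member != ultimate:
                result[member] = ultimate
    return result


# Invented example: "hub/dotfiles" wrongly bridges two unrelated groups.
edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool"),
    ("dave/lib", "erin/lib"),
    ("carol/tool", "hub/dotfiles"),
    ("hub/dotfiles", "dave/lib"),
]
score = {"alice/tool": 50, "bob/tool": 3, "carol/tool": 1,
         "dave/lib": 7, "erin/lib": 9}

print(deduplicate(edges, score, clumping={"hub/dotfiles"}))
```

With the clumping project filtered out, the sketch yields two separate
groups, one led by alice/tool and one by erin/lib.  Without the filter,
all six projects collapse into a single component, which is the kind of
unwanted clumping the abstract describes.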

 

Files (1.8 MB)

Name: forkproj-extended.pdf
Size: 1.8 MB
md5: c3791f1707a97d5e3a25fe8bd662c165