Published April 21, 2020 | Version v1
Technical note | Open

A Dataset for GitHub Repository Deduplication: Extended Description

  • 1. Athens University of Economics and Business
  • 2. University of Tennessee


GitHub projects can be easily replicated through the site's fork
process or through a Git clone-push sequence.  This is a problem for
empirical software engineering, because it can lead to skewed results
or mistrained machine learning models.  We provide a dataset of 10.6
million GitHub projects that are copies of others, and link each record
with the project's ultimate parent.  The ultimate parents were derived
from a ranking along six metrics.  Related projects were computed as
the connected components of a denoised graph with 18.2 million nodes
and 12 million edges, created by directing edges to ultimate parents.
The graph was denoised by filtering out more than 30 hand-picked and 2.3
million pattern-matched clumping projects.  Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects.  Our dataset identified
30 thousand duplicate projects in an existing popular reference dataset
of 1.8 million projects.  An evaluation of our dataset against another,
created independently with different methods, found a significant overlap,
but also differences attributed to the operational definition of which
projects are considered related.
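
The grouping step described above can be illustrated with a minimal sketch.
It is not the authors' pipeline: it assumes an in-memory list of
(copy, original) edges, uses a union-find structure to obtain connected
components, and replaces the ranking along six metrics (not listed here)
with a single hypothetical numeric score per project.  The names
ultimate_parents, edges, and score are illustrative only.

```python
from collections import defaultdict


def find(parent, x):
    # Follow parent links to the component root, halving paths on the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def ultimate_parents(edges, score):
    """Map each copied project to the top-scored project of its component.

    edges: iterable of (project, related_project) pairs (assumed input).
    score: dict of project -> number; a stand-in for the real ranking.
    """
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    components = defaultdict(list)
    for node in parent:
        components[find(parent, node)].append(node)

    result = {}
    for members in components.values():
        # Ties are broken by project name so the choice is deterministic.
        best = max(members, key=lambda p: (score.get(p, 0), p))
        for member in members:
            if member != best:
                result[member] = best
    return result


# Hypothetical example data: two groups of related projects.
edges = [("a/x", "b/x"), ("b/x", "c/x"), ("d/y", "e/y")]
score = {"a/x": 5, "b/x": 10, "c/x": 1, "d/y": 2, "e/y": 3}
mapping = ultimate_parents(edges, score)
```

Each key of mapping is a project treated as a copy, and its value is the
project chosen as the ultimate parent of the same component.  Projects that
are their own ultimate parent do not appear as keys.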





Additional details

Related works

Software: 10.5281/zenodo.3653924 (DOI)
Dataset: 10.5281/zenodo.3653920 (DOI)
Is cited by
Conference paper: 10.1145/3379597.3387496 (DOI)
Is supplement to
Conference paper: 10.1145/3379597.3387496 (DOI)


Funding

European Commission: FASTEN – Fine-Grained Analysis of Software Ecosystems as Networks (grant 825328)