Technical note Open Access

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3740595", 
  "language": "eng", 
  "title": "A Dataset for GitHub Repository Deduplication: Extended Description", 
  "issued": {
    "date-parts": [
  "abstract": "GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.", 
  "author": [
    {
      "family": "Spinellis, Diomidis"
    }, 
    {
      "family": "Kotti, Zoe"
    }, 
    {
      "family": "Mockus, Audris"
    }
  ], 
  "type": "article", 
  "id": "3740595"
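
The abstract describes related projects as the connected components of a graph whose edges point from each copy to its ultimate parent. As a rough illustration of that one step only, the sketch below groups projects with a union-find structure. It is not the authors' implementation: the function name, the edge format, and the project names are hypothetical, and the ranking and denoising steps mentioned in the abstract are not modelled.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes joined by (copy, ultimate_parent) edges.

    Edge direction is ignored here: two projects fall in the same
    component if any chain of edges links them.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for copy, ultimate_parent in edges:
        root_a, root_b = find(copy), find(ultimate_parent)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical project names, used only for illustration.
edges = [
    ("alice/linux", "torvalds/linux"),
    ("bob/linux", "torvalds/linux"),
    ("carol/vim", "vim/vim"),
]
components = connected_components(edges)
```

With these made-up edges, the two Linux copies and their parent end up in one group and the Vim pair in another. At the scale quoted in the abstract, a dedicated graph library would likely be a better fit than this dictionary-based sketch.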
                   All versions   This version
Views              351            351
Downloads          179            179
Data volume        328.7 MB       328.7 MB
Unique views       318            318
Unique downloads   164            164

