Technical note Open Access

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>GitHub projects can be easily replicated through the site&#39;s fork<br>\nprocess or through a Git clone-push sequence.&nbsp; This is a problem for<br>\nempirical software engineering, because it can lead to skewed results<br>\nor mistrained machine learning models.&nbsp; We provide a dataset of 10.6<br>\nmillion GitHub projects that are copies of others, and link each record<br>\nwith the project&#39;s ultimate parent.&nbsp; The ultimate parents were derived<br>\nfrom a ranking along six metrics.&nbsp; The related projects were calculated<br>\nas the connected components of an 18.2 million node and 12 million<br>\nedge denoised graph created by directing edges to ultimate parents.<br>\nThe graph was created by filtering out more than 30 hand-picked and 2.3<br>\nmillion pattern-matched clumping projects.&nbsp; Projects that introduced<br>\nunwanted clumping were identified by repeatedly visualizing shortest path<br>\ndistances between unrelated important projects.&nbsp; Our dataset identified<br>\n30 thousand duplicate projects in an existing popular reference dataset<br>\nof 1.8 million projects.&nbsp; An evaluation of our dataset against another<br>\ncreated independently with different methods found a significant overlap,<br>\nbut also differences attributed to the operational definition of what<br>\nprojects are considered as related.</p>\n\n<p>&nbsp;</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Athens University of Economics and Business", 
      "@id": "https://orcid.org/0000-0003-4231-1897", 
      "@type": "Person", 
      "name": "Spinellis, Diomidis"
    }, 
    {
      "affiliation": "Athens University of Economics and Business", 
      "@id": "https://orcid.org/0000-0003-3816-9162", 
      "@type": "Person", 
      "name": "Kotti, Zoe"
    }, 
    {
      "affiliation": "University of Tennessee", 
      "@type": "Person", 
      "name": "Mockus, Audris"
    }
  ], 
  "headline": "A Dataset for GitHub Repository Deduplication: Extended Description", 
  "image": "https://zenodo.org/static/img/logos/zenodo-gradient-round.svg", 
  "datePublished": "2020-04-21", 
  "url": "https://zenodo.org/record/3740595", 
  "keywords": [
    "deduplication", 
    "fork", 
    "project clone", 
    "GitHub", 
    "dataset"
  ], 
  "@context": "https://schema.org/", 
  "identifier": "https://doi.org/10.5281/zenodo.3740595", 
  "@id": "https://doi.org/10.5281/zenodo.3740595", 
  "@type": "ScholarlyArticle", 
  "name": "A Dataset for GitHub Repository Deduplication: Extended Description"
}
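
The description above mentions grouping related projects as connected components of a graph and linking each copy to an ultimate parent chosen by ranking. The following is a minimal sketch of that idea, not the authors' implementation: the record does not name the six ranking metrics, so a single hypothetical `score` value stands in for them, the project names and edges are invented for illustration, and the denoising step that filters out clumping projects is omitted.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's set, halving paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def ultimate_parents(edges, score):
    """Map every non-top project to the best-scored project of its component.

    edges: iterable of (copy, source) pairs; direction is ignored when
           forming connected components.
    score: hypothetical stand-in for the ranking along six metrics.
    """
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    # Collect the members of each connected component.
    components = defaultdict(list)
    for node in parent:
        components[find(parent, node)].append(node)

    # Pick the highest-scored member as ultimate parent (name breaks ties).
    mapping = {}
    for members in components.values():
        best = max(members, key=lambda p: (score.get(p, 0), p))
        for member in members:
            if member != best:
                mapping[member] = best
    return mapping


# Invented example data: two unrelated groups of copied projects.
edges = [
    ("alice/tool", "orig/tool"),
    ("bob/tool", "alice/tool"),
    ("carol/lib", "orig/lib"),
]
score = {"orig/tool": 950, "alice/tool": 12, "bob/tool": 0,
         "orig/lib": 40, "carol/lib": 3}

mapping = ultimate_parents(edges, score)
for copy, top in sorted(mapping.items()):
    print(copy, "->", top)
```

A union-find structure is used here because it scales to graphs with millions of nodes without recursion; any connected-components routine would serve equally well for the sketch.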
