Technical note Open Access

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Spinellis, Diomidis</dc:creator>
  <dc:creator>Kotti, Zoe</dc:creator>
  <dc:creator>Mockus, Audris</dc:creator>
  <dc:date>2020-04-21</dc:date>
  <dc:description>GitHub projects can be easily replicated through the site's fork
process or through a Git clone-push sequence.  This is a problem for
empirical software engineering, because it can lead to skewed results
or mistrained machine learning models.  We provide a dataset of 10.6
million GitHub projects that are copies of others, and link each record
with the project's ultimate parent.  The ultimate parents were derived
from a ranking along six metrics.  The related projects were calculated
as the connected components of an 18.2 million node and 12 million
edge denoised graph created by directing edges to ultimate parents.
The graph was created by filtering out more than 30 hand-picked and 2.3
million pattern-matched clumping projects.  Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects.  Our dataset identified
30 thousand duplicate projects in an existing popular reference dataset
of 1.8 million projects.  An evaluation of our dataset against another
created independently with different methods found a significant overlap,
but also differences attributed to the operational definition of which
projects are considered related.</dc:description>
  <dc:identifier>https://zenodo.org/record/3740595</dc:identifier>
  <dc:identifier>10.5281/zenodo.3740595</dc:identifier>
  <dc:identifier>oai:zenodo.org:3740595</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:relation>info:eu-repo/grantAgreement/EC/H2020/825328/</dc:relation>
  <dc:relation>doi:10.1145/3379597.3387496</dc:relation>
  <dc:relation>doi:10.5281/zenodo.3653924</dc:relation>
  <dc:relation>doi:10.5281/zenodo.3653920</dc:relation>
  <dc:relation>doi:10.5281/zenodo.3740594</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>deduplication</dc:subject>
  <dc:subject>fork</dc:subject>
  <dc:subject>project clone</dc:subject>
  <dc:subject>GitHub</dc:subject>
  <dc:subject>dataset</dc:subject>
  <dc:title>A Dataset for GitHub Repository Deduplication: Extended Description</dc:title>
  <dc:type>info:eu-repo/semantics/technicalDocumentation</dc:type>
  <dc:type>publication-technicalnote</dc:type>
</oai_dc:dc>
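
The description above mentions two steps that a small example may make more concrete: grouping related projects as connected components of a graph, and linking each project to an ultimate parent chosen by ranking. The following is only a minimal sketch of that general idea, not the authors' pipeline. The function names, the toy project names, and the single `score` value (standing in for the six-metric ranking, whose details are not given here) are all hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Walk up to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def map_to_ultimate_parent(edges, score):
    """Map every non-parent project to the top-scoring project of its component.

    `score` is a hypothetical single number per project; the dataset itself
    ranks along six metrics that are not reproduced here.
    """
    mapping = {}
    for component in connected_components(edges):
        # Ties are broken by project name so the result is deterministic.
        best = max(component, key=lambda p: (score.get(p, 0), p))
        for project in component:
            if project != best:
                mapping[project] = best
    return mapping


# Hypothetical toy data: two groups of related projects.
edges = [
    ("alice/app", "bob/app"),
    ("bob/app", "carol/app"),
    ("dave/lib", "erin/lib"),
]
score = {"alice/app": 120, "bob/app": 5, "carol/app": 1,
         "dave/lib": 3, "erin/lib": 40}

result = map_to_ultimate_parent(edges, score)
for project in sorted(result):
    print(project, "->", result[project])
```

In this sketch the edges are treated as undirected when forming components, and the noise filtering of clumping projects described in the abstract is left out entirely.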