Published June 23, 2021 | Version 3.0.0
Dataset Open

GitSED: GitHub Socially Enhanced Dataset

  • 1. Universidade Federal de Minas Gerais
  • 2. Instituto Federal de Minas Gerais

Description

Software Engineering has evolved as a field that studies not only the many ways software is created but also how it evolves, becomes successful, meets its objectives effectively and efficiently, and satisfies its quality attributes, among other concerns. Nonetheless, many issues remain open across its conception, development, and maintenance phases. In particular, understanding how developers collaborate may help in all of these phases, but it is also challenging. A novel angle on this challenge is now available: studying the social aspects of software development through social networks.

As GitHub has become the main representative of online tools for collaborative software development, existing approaches assess its follow-network, stargazer-network, and contributors-network. Moreover, networks built from real software projects support relevant applications, such as detecting key developers, recommending collaborations among developers, detecting developer communities, and analyzing collaboration patterns in agile development.
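To make the idea of a contributors-network concrete, the following is a minimal sketch in Python. It assumes the input is a list of (developer, repository) pairs. Both the input layout and the edge weight used here (the number of shared repositories) are illustrative assumptions; they are not the GitSED schema, nor necessarily one of its collaboration metrics.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_network(contributions):
    """Build a weighted, undirected developer network.

    `contributions` is an iterable of (developer, repository) pairs
    (a hypothetical layout, not the actual GitSED file format).
    Returns a dict mapping a sorted developer pair to the number of
    repositories both developers contributed to.
    """
    # Group developers by the repository they contributed to.
    repo_to_devs = defaultdict(set)
    for developer, repository in contributions:
        repo_to_devs[repository].add(developer)

    # Every pair of developers in the same repository shares one edge;
    # the weight counts how many repositories the pair has in common.
    edges = defaultdict(int)
    for developers in repo_to_devs.values():
        for dev_a, dev_b in combinations(sorted(developers), 2):
            edges[(dev_a, dev_b)] += 1
    return dict(edges)


# Toy data, invented for illustration only.
contributions = [
    ("alice", "repo1"), ("bob", "repo1"),
    ("alice", "repo2"), ("bob", "repo2"), ("carol", "repo2"),
]
network = build_collaboration_network(contributions)
```

In this toy example, the pair ("alice", "bob") shares two repositories, so its edge weight is 2, while the other two pairs have weight 1. For large repositories, enumerating all developer pairs grows quadratically, so a real analysis would likely filter or sample first.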

GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers’ interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data up to June 2019. It comprises:

  • 8,556,778 repositories
  • 32,411,674 developers
  • 6 programming languages (Assembly, JavaScript, Pascal, Python, Ruby, Visual Basic)
  • 13 collaboration metrics

There are two previous versions of GitSED, which were originally built for the following conference papers:

v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.

v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.

Files

Files (1.6 GB)

md5:7bf1864bc46fa2c35592e1d5e12c902e (1.6 GB)
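
The MD5 checksum listed for the archive can be used to check that a download completed without corruption. The sketch below uses Python's standard `hashlib` module. The file path passed to `verify_download` is a placeholder to be supplied by the user, since the archive's file name is not given here.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "7bf1864bc46fa2c35592e1d5e12c902e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a 1.6 GB archive does not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download` with the local path of the downloaded archive returns `True` only when the digests match. A mismatch usually indicates an incomplete or corrupted download.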