Published May 23, 2017 | Version v1
Technical note Open

Modular Queries and Unit Testing

  • 1. Delft University of Technology
  • 2. Athens University of Economics and Business


Handouts for the following technical briefing:

Georgios Gousios and Diomidis Spinellis. Mining software engineering data from GitHub. In Proceedings of the 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 501–502, Piscataway, NJ, USA, May 2017. IEEE Press. Technical Briefing. DOI:10.1109/ICSE-C.2017.164

GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research while avoiding common pitfalls. We draw our examples from the GHTorrent dataset, a queryable offline mirror of the GitHub API data, and present strategies for avoiding these pitfalls.




Additional details

Related works

Software: (URL)
Software: (URL)
Is supplement to
Conference paper: 10.1109/ICSE-C.2017.164 (DOI)


Funding

European Commission: CROSSMINER – Developer-Centric Knowledge Mining from Large Open-Source Software Repositories (grant 732223)