Dataset Open Access
This publication consists of a dataset of ~7K references, manually identified in 450 pull request (PR) discussion threads sampled from GitHub, in CSV format. It also contains the R code files written to analyze the dataset statistically. The dataset accompanies the paper "@alex, this fixes #9": Analysis of Referencing Patterns in Pull Request Discussions, accepted for publication at the CSCW 2021 conference.
Pull requests (PRs) are a frequently used method for proposing changes to source code repositories. When discussing proposed changes in a PR discussion, stakeholders often reference a wide variety of information objects to establish shared awareness and common ground. Previous work has not considered how referential behavior impacts collaborative software development via PRs. This knowledge gap is the major barrier to evaluating the current support for referencing in PRs and improving it. We conducted an exploratory analysis of ~7K references, collected from 450 public PRs on GitHub, and constructed taxonomies of referent types and expressions. Using our annotated dataset, we identified several patterns in the use of references. Referencing source code elements was prevalent, but the authoring interface lacks support for it. Three classes of contextual factors influence referencing behaviors: referent type, discussion thread, and project attributes. Referencing patterns may indicate PR outcomes (e.g., merged PRs frequently reference issues, users, and tests). We conclude with design implications to support more effective referencing in PR discussion interfaces.