GitHub Pull Request Analysis: Sentiment Data and Developer Survey Responses
Description
This record describes three datasets.
PRBatch Dataset (file names: prfeatures_train.csv and prfeatures_test.csv)
PRFeatures uses an extensive dataset from Xunhui Zhang et al. (2021), originating from their 2020 work and the GHTorrent data dump dated June 1, 2019 (https://github.com/ghtorrent/ghtorrent.org/). This dataset was selected for its diversity in project activity, language, and size, offering a more generalizable and holistic view of Pull-Request (PR) dynamics across various software development scenarios.
We performed the necessary pre-processing, mainly the handling of missing values: negative and missing values were replaced with Not a Number (NaN), and factors with more than 30% missing values were omitted. For feature engineering, redundant factors were removed and related factors such as files-added and files-deleted were consolidated into files-changed. Rather than narrowing the data down to key variables, our study aims to showcase the adaptability of RL algorithms in handling extensive feature sets; we therefore retain a large number of features in the dataset. Correct data types were set for each factor, and the categorical values of the language factor were label-encoded.
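The cleaning steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the column names used in the usage note, and the choice of plain Python over a dataframe library are all assumptions made for illustration. It covers the NaN replacement, the 30% missing-value cutoff, and label encoding; the consolidation into files-changed is left out because the source does not say how it was computed.

```python
import math


def clean_rows(rows, numeric_cols, max_missing=0.30):
    """Replace negative or missing numeric values with NaN, then drop
    every numeric column whose share of NaN values exceeds max_missing.

    rows is a list of dicts (for example from csv.DictReader).
    Returns (cleaned_rows, dropped_columns).
    """
    if not rows:
        return [], []
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in numeric_cols:
            try:
                value = float(row.get(col))
            except (TypeError, ValueError):
                value = math.nan  # empty or unparsable entry -> NaN
            if value < 0:
                value = math.nan  # negative entry -> NaN
            new_row[col] = value
        cleaned.append(new_row)

    total = len(cleaned)
    dropped = [
        col for col in numeric_cols
        if sum(math.isnan(r[col]) for r in cleaned) / total > max_missing
    ]
    for r in cleaned:
        for col in dropped:
            del r[col]
    return cleaned, dropped


def label_encode(rows, col):
    """Replace the categorical values in col with integer codes, in place.

    Codes follow the sorted order of the distinct values.
    Returns the value-to-code mapping.
    """
    classes = sorted({r[col] for r in rows})
    mapping = {name: code for code, name in enumerate(classes)}
    for r in rows:
        r[col] = mapping[r[col]]
    return mapping
```

For example, calling `clean_rows(rows, ["churn", "ci_latency"])` on rows read from a CSV file would return the cleaned rows together with the names of any columns dropped for exceeding the cutoff; `churn` and `ci_latency` are placeholder column names here.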
We used an 80/20 split to create the training and testing datasets uploaded here. The dataset contains a little over 1.3 million PRs and 72 PR-related features.
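An 80/20 split of this kind can be sketched as follows. The source does not state whether the split was random, stratified, or seeded, so the shuffling, the seed value, and the function name below are assumptions for illustration only.

```python
import random


def train_test_split_rows(rows, test_fraction=0.20, seed=42):
    """Shuffle a copy of rows and split it into (train, test).

    The seed makes the split reproducible; the input list is not modified.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed means the same rows land in the same partition on every run, which keeps results comparable across experiments.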
PRChat Dataset (file name: pr_comments_dataset_publish.csv)
The second dataset, the PRChat Dataset, was curated specifically for a specialized Reinforcement Learning formalization of Pull-Request (PR) outcome prediction on GitHub using only developer discussions. It contains 588,097 in-line code comments from 66,281 PRs and a total of 15 features. The raw comments and their respective commit_ids were extracted from the work published by Akshay Sinha (see the references). The data spans January 2015 to December 2020. All other features were added using the GitHub REST API.
The dataset contains a little under 0.6 million comments associated with around 66,000 PRs. To view individual PRs (and consequently their comments), group the rows by owner_name, repo_name, and pull_no.
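The grouping described above can be sketched in a few lines. The function name is hypothetical, and lower-casing the owner and repository names is an assumption that follows from the note that these two fields are not case sensitive.

```python
from collections import defaultdict


def group_comments_by_pr(rows):
    """Group comment rows by (owner_name, repo_name, pull_no).

    Owner and repository names are lower-cased so that differently
    cased spellings of the same repository fall into one group.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (
            row["owner_name"].lower(),
            row["repo_name"].lower(),
            int(row["pull_no"]),
        )
        groups[key].append(row)
    return dict(groups)
```

The rows could come from `csv.DictReader` applied to the comments file; each value in the returned dictionary is then the list of comments belonging to one PR.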
Feature extraction added the following features:
- has_code_element: whether the comment makes a code suggestion or not
- word_count: number of words in the comment (British and American English words only, based on the Hunspell library)
- stopw_ratio: ratio of the number of stop words to the total word count of the comment
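As a rough illustration of how word_count and stopw_ratio relate, the sketch below computes both for a single comment. It is not the extraction code used for the dataset: it does not perform the Hunspell dictionary check, and the tokenizer, the tiny stop-word list, and the function name are assumptions.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and", "in", "it",
    "this", "that",
}


def comment_features(text):
    """Return word_count and stopw_ratio for one comment.

    Words are runs of letters and apostrophes; no dictionary check
    is applied, unlike the Hunspell-based count in the dataset.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    word_count = len(words)
    stop_count = sum(word in STOP_WORDS for word in words)
    stopw_ratio = stop_count / word_count if word_count else 0.0
    return {"word_count": word_count, "stopw_ratio": stopw_ratio}
```

A comment with no recognizable words is given a ratio of 0.0 here to avoid division by zero; how the dataset handles that case is not stated in the source.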
Sentiment analysis was conducted using VADER, which added:
- neg_vr: negative polarity score
- neu_vr: neutral polarity score
- pos_vr: positive polarity score
- compound: overall polarity score of the comment
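One possible way to use these scores is sketched below: label each comment from its compound score and average the compound score per PR. The cutoffs of 0.05 and -0.05 are the conventional thresholds suggested in the VADER documentation, not necessarily ones applied by the dataset authors, and both function names are hypothetical.

```python
from collections import defaultdict


def sentiment_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def mean_compound_per_pr(rows):
    """Average the compound score over all comments of each PR.

    PRs are keyed by (owner_name, repo_name, pull_no), with the two
    names lower-cased because they are not case sensitive.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        key = (
            row["owner_name"].lower(),
            row["repo_name"].lower(),
            int(row["pull_no"]),
        )
        totals[key][0] += float(row["compound"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

The per-PR mean is only one of several reasonable aggregations; a minimum, a final-comment score, or a count of negative comments could serve equally well depending on the analysis.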
Other PR and project related features include:
- owner_name: the account owner of the repo (not case sensitive)
- repo_name: the name of the repo without the .git extension (not case sensitive)
- pull_no: the number to identify the PR
- merged_or_not: whether the PR has been merged or not
- timestamp: the timestamp of each comment
Survey Data (file name: survey_responses_raw.csv)
The third dataset is the collection of responses of an online exploratory survey targeting software developers and engineers. The underpinning objective was to delve deep into the developers' perspectives regarding the PR review processes and the quality of these reviews. We received a total of 22 responses.
We designed a survey protocol following Carleton University's guidelines for online research, adhering to the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2) in Canada (https://tcps2core.ca/welcome). After careful evaluation by Carleton University's Research Ethics Boards, in alignment with TCPS 2, we received approval on May 2, 2023 (Ethics Clearance ID # 119296), effective until May 31, 2023.
The survey was structured into three sections. The first section covered the participant's demographic and professional background, with six primary questions and an optional seventh question. To protect participant confidentiality, the survey was designed to preserve anonymity. The second section focused on PR factors and review practices. It presented participants with two multiple-choice questions and two Likert-scale questions, enabling structured feedback.
Concluding the survey, the third section was crafted to prompt more detailed insights from the participants. It comprised two open-ended questions, providing an avenue for respondents to further describe their PR review experiences and techniques.
Cite Original Paper:
R. Joshi and N. Kahani, "Comparative Study of Reinforcement Learning in GitHub Pull Request Outcome Predictions," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 489-500, doi: 10.1109/SANER60148.2024.00057.
Files
pr_comments_dataset_publish.csv
References
- Sinha, Akshay. (2021). Pull Request Review Comments Dataset (2.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5015062
- Hutto, C. and Gilbert, E., 2014, May. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).
- X. Zhang, Y. Yu, G. Gousios, and A. Rastogi, "Pull request decisions explained: An empirical overview," IEEE Transactions on Software Engineering, pp. 1–1, 2022.
- X. Zhang, A. Rastogi, and Y. Yu, "On the shoulders of giants: A new dataset for pull-based development research," in Proceedings of the 17th International Conference on Mining Software Repositories, ser. MSR '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 543–547.
- R. Joshi and N. Kahani, "Comparative Study of Reinforcement Learning in GitHub Pull Request Outcome Predictions," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 489-500, doi: 10.1109/SANER60148.2024.00057.