GitHub Pull Request Analysis: Sentiment Data and Developer Survey Responses

Joshi, Rinkesh; Kahani, Nafiseh

doi:10.5281/zenodo.10049493

Published August 2023 | Version v2

Dataset Open

1. Carleton University

This record contains three datasets.

PRBatch Dataset (file names: prfeatures_train_data.csv and prfeatures_test_data.csv)

PRFeatures uses an extensive dataset from Xunhui Zhang et al. (2021), originating from their 2020 work and the GHTorrent data dump dated June 1, 2019 (https://github.com/ghtorrent/ghtorrent.org/). This dataset was selected for its diversity in project activity, language, and size, offering a more generalizable and holistic view of Pull-Request (PR) dynamics across various software development scenarios.

We performed the necessary pre-processing steps, mainly handling missing values: negative and missing values were replaced with Not a Number (NaN), and factors with over 30% missing values were omitted. In terms of feature engineering, redundant factors were removed and related factors such as files-added and files-deleted were consolidated into files-changed. Rather than narrowing down key variables, our study aims to showcase the adaptability of RL algorithms in handling extensive feature sets; therefore, we retain a large number of features in the dataset. Correct data types were set for each factor, and categorical values in the language factor were label-encoded.
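The cleaning steps described above can be sketched on a toy table as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline. The column names (`files_added`, `files_deleted`, `files_changed`, `ci_latency`, `language`) and the helper names are hypothetical, and treating "consolidated" as a simple sum is an assumption.

```python
import math

MISSING_THRESHOLD = 0.30  # drop factors with more than 30% missing values


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess(rows, numeric_cols, category_col):
    """Apply the described cleaning steps to a list of dict rows (toy sketch)."""
    rows = [dict(r) for r in rows]  # work on copies

    # 1. Replace missing and negative numeric values with NaN.
    for row in rows:
        for col in numeric_cols:
            value = row.get(col)
            if is_missing(value) or value < 0:
                row[col] = math.nan

    # 2. Drop numeric columns whose share of missing values exceeds the threshold.
    kept = [
        col for col in numeric_cols
        if sum(is_missing(r[col]) for r in rows) / len(rows) <= MISSING_THRESHOLD
    ]
    for row in rows:
        for col in set(numeric_cols) - set(kept):
            del row[col]

    # 3. Consolidate related factors (assumed here to be a simple sum).
    if "files_added" in kept and "files_deleted" in kept:
        for row in rows:
            row["files_changed"] = row.pop("files_added") + row.pop("files_deleted")

    # 4. Label-encode the categorical column with stable integer codes.
    codes = {label: i for i, label in enumerate(sorted({r[category_col] for r in rows}))}
    for row in rows:
        row[category_col] = codes[row[category_col]]

    return rows, codes


# Made-up rows; -1 and None both stand for "missing" in this toy input.
toy_rows = [
    {"files_added": 2, "files_deleted": 1, "ci_latency": -1, "language": "python"},
    {"files_added": 0, "files_deleted": 3, "ci_latency": None, "language": "java"},
    {"files_added": 5, "files_deleted": 0, "ci_latency": 7, "language": "python"},
]
cleaned, codes = preprocess(
    toy_rows, ["files_added", "files_deleted", "ci_latency"], "language"
)
```

In this toy input, `ci_latency` is missing in two of three rows, so it is dropped, while the two file counts are merged into `files_changed`.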

We used an 80/20 split to create the training and testing datasets uploaded here. The dataset contains a little over 1.3 million PRs and 72 PR-related features.
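The description does not state whether the split was random, stratified, or seeded. The sketch below assumes a shuffled split with a fixed seed, purely as an illustration of how an 80/20 partition can be reproduced; the function name and seed value are hypothetical.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = split_rows(range(10))
```

Every input row ends up in exactly one of the two lists, so the partitions can be concatenated to recover the full dataset.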

PRChat Dataset (file name: pr_comments_dataset_publish.csv)

The second dataset, PRChat, was curated specifically for a specialized Reinforcement Learning formalization of Pull-Request (PR) outcome prediction on GitHub using only developer discussions. It contains 588,097 in-line code comments from 66,281 PRs and a total of 15 features. The raw comments and the respective commit_ids were extracted from the dataset published by Akshay Sinha (see the references). The data spans January 2015 to December 2020. All other features were added using the GitHub REST API.

The dataset contains a little under 0.6 million comments associated with around 66,000 PRs. To view the PRs (consequently the related comments), group by using: owner_name, repo_name, pull_no.
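A minimal sketch of this grouping with the standard `csv` module is shown below. The toy input stands in for pr_comments_dataset_publish.csv and its values are made up; the function name is hypothetical. Owner and repository names are lower-cased because they are described as not case sensitive.

```python
import csv
import io
from collections import defaultdict


def group_comments_by_pr(csv_file):
    """Map (owner_name, repo_name, pull_no) -> list of comment rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Normalise the case-insensitive parts of the key before grouping.
        key = (row["owner_name"].lower(), row["repo_name"].lower(), row["pull_no"])
        groups[key].append(row)
    return groups


# Toy stand-in for the real file (made-up values, only four of the columns).
sample = io.StringIO(
    "owner_name,repo_name,pull_no,compound\n"
    "Octo,Demo,1,0.5\n"
    "octo,demo,1,-0.2\n"
    "octo,demo,2,0.0\n"
)
prs = group_comments_by_pr(sample)
```

For the real file, pass an open file handle instead of the `StringIO` object, for example `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.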

Feature extraction added the following features:

has_code_element: whether the comment makes a code suggestion or not
word_count: number of words in the comment (British and American English words only, based on the Hunspell library)
stopw_ratio: ratio of no. of stop words to total word count in the comment
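To make these three definitions concrete, here is a rough sketch of how such features could be computed for one comment. It is not the extraction code used for the dataset: the stop-word list is a tiny illustrative one, the tokenizer is a plain regular expression rather than a Hunspell dictionary check, and detecting code via Markdown backticks is an assumed heuristic.

```python
import re

# Tiny illustrative stop-word list; the list used for the dataset is not specified.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "this", "it"}

# Simplified tokenizer: runs of letters and apostrophes (no dictionary lookup).
WORD_RE = re.compile(r"[A-Za-z']+")

# Assumed heuristic: Markdown inline code or a code fence signals a code element.
CODE_RE = re.compile(r"`[^`]+`|```")


def comment_features(comment):
    """Compute has_code_element, word_count and stopw_ratio for one comment."""
    words = [w.lower() for w in WORD_RE.findall(comment)]
    stop_count = sum(w in STOP_WORDS for w in words)
    return {
        "has_code_element": bool(CODE_RE.search(comment)),
        "word_count": len(words),
        # Guard against empty comments to avoid division by zero.
        "stopw_ratio": stop_count / len(words) if words else 0.0,
    }


features = comment_features("Rename this to `max_retries` and it is fine")
```

Note that this simplified tokenizer also counts identifiers inside code spans as words, which a dictionary-based count would not.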

Sentiment analysis was conducted using VADER, adding the following features:

neg_vr: negative polarity score
neu_vr: neutral polarity score
pos_vr: positive polarity score
compound: overall polarity score of the comment
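One way to use the compound column is to bucket comments into coarse sentiment labels and to aggregate scores per PR. The sketch below assumes the thresholds of ±0.05 that the VADER documentation commonly suggests; the dataset description itself does not prescribe any thresholds, and the scores shown are made up.

```python
from statistics import mean

# Thresholds commonly suggested in the VADER documentation (an assumption here).
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def sentiment_label(compound):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


# Made-up compound scores for three comments belonging to one PR.
compounds = [0.6369, -0.4404, 0.0]
labels = [sentiment_label(c) for c in compounds]
pr_mean = mean(compounds)
```

Averaging is only one possible per-PR aggregate; minimum, maximum, or the score of the last comment may suit other analyses better.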

Other PR and project related features include:

owner_name: the account owner of the repo (not case sensitive)
repo_name: the name of the repo without the .git extension (not case sensitive)
pull_no: the number to identify the PR
merged_or_not: whether PR has been merged or not
timestamp: for each comment

Survey Data (file name: survey_responses_raw.csv)

The third dataset is a collection of responses from an online exploratory survey targeting software developers and engineers. The underlying objective was to explore developers' perspectives on PR review processes and the quality of these reviews. We received a total of 22 responses.

We designed a survey protocol following Carleton University's guidelines for online research, adhering to the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2) in Canada (https://tcps2core.ca/welcome). After careful evaluation by Carleton University's Research Ethics Boards, in alignment with TCPS 2, we received approval on May 2, 2023 (Ethics Clearance ID # 119296), effective until May 31, 2023.

The survey was carefully structured into three distinct sections. The initial section delved into the participant's demographic and professional background, featuring six primary questions, along with an optional seventh question. Prioritizing participant confidentiality, the survey was designed to safeguard anonymity. The subsequent section transitioned to a set of questions focused on PR factors and review practices. This section presented participants with two multiple-choice queries and a pair of questions grounded in the Likert-scale, enabling a structured feedback mechanism.

Concluding the survey, the third section was crafted to prompt more detailed insights from the participants. It comprised two open-ended questions, providing an avenue for respondents to further describe their PR review experiences and techniques.

Cite Original Paper:

R. Joshi and N. Kahani, "Comparative Study of Reinforcement Learning in GitHub Pull Request Outcome Predictions," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 489-500, doi: 10.1109/SANER60148.2024.00057.

Files

Files (660.7 MB)

Name	MD5	Size
pr_comments_dataset_publish.csv	a5e1acff3bed96d5c1217e1be8ae3960	218.8 MB
prfeatures_test_data.csv	0cd6e8ae09d0af5e86b87f745d1ebb8a	87.9 MB
prfeatures_train_data.csv	ce145058c3c5436f8673f05a37e41be0	354.0 MB
survey_responses_raw.csv	54fc70ee3801714693cc14589eefd746	22.9 kB
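The MD5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch with Python's `hashlib` follows; the function names are hypothetical, and the local paths passed to `verify` depend on where the files were saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "pr_comments_dataset_publish.csv": "a5e1acff3bed96d5c1217e1be8ae3960",
    "prfeatures_test_data.csv": "0cd6e8ae09d0af5e86b87f745d1ebb8a",
    "prfeatures_train_data.csv": "ce145058c3c5436f8673f05a37e41be0",
    "survey_responses_raw.csv": "54fc70ee3801714693cc14589eefd746",
}


def verify(path, name):
    """Return True if the file at `path` matches the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("survey_responses_raw.csv", "survey_responses_raw.csv")` should return `True` for an uncorrupted copy saved in the current directory.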

Additional details

DOI: 10.1109/SANER60148.2024.00057

Sinha, Akshay. (2021). Pull Request Review Comments Dataset (2.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5015062
Hutto, C. and Gilbert, E., 2014, May. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).
X. Zhang, Y. Yu, G. Gousios, and A. Rastogi, "Pull request decisions explained: An empirical overview," IEEE Transactions on Software Engineering, pp. 1–1, 2022.
X. Zhang, A. Rastogi, and Y. Yu, "On the shoulders of giants: A new dataset for pull-based development research," in Proceedings of the 17th International Conference on Mining Software Repositories, ser. MSR '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 543–547.
R. Joshi and N. Kahani, "Comparative Study of Reinforcement Learning in GitHub Pull Request Outcome Predictions," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 489-500, doi: 10.1109/SANER60148.2024.00057.

	All versions	This version
Views	1,474	655
Downloads	671	488
Data volume	153.9 GB	104.2 GB
