Published July 15, 2022 | Version 0.0.3
Dataset Open

The Good First Issue Recommendation Dataset from "GFI-Bot: Automated Good First Issue Recommendation on GitHub"

  • 1. Peking University, China

Description

This is a good first issue (GFI) recommendation dataset created from the GFI-Bot project (https://github.com/osslab-pku/gfi-bot). For more information about the GFI recommendation problem and GFI-Bot, please check our publications:

  • Wenxin Xiao, Hao He, Weiwei Xu, Xin Tan, Jinhao Dong, and Minghui Zhou. 2022. Recommending Good First Issues in GitHub OSS Projects. In Proceedings of the 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 21–29, 2022. ACM. https://hehao98.github.io/files/2022-recgfi.pdf
  • Hao He, Haonan Su, Wenxin Xiao, Runzhi He, and Minghui Zhou. 2022. GFI-Bot: Automated Good First Issue Recommendation on GitHub. In Proceedings of the 2022 ACM 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, November 14–16, 2022. ACM. https://hehao98.github.io/files/2022-gfibot.pdf

The dataset is a MongoDB dump and needs to be restored to a MongoDB instance before use. This can be done via the official mongorestore tool by running a command like this in the dataset/ folder:

mongorestore --uri={{ your mongodb url }} --gzip

In the gfibot.dataset collection, each document describes the state of an issue at a certain time (either when the issue was created or when it was resolved). The resolver_commit_num field is the ground truth label: the number of commits the issue resolver had made in the repository before resolving the issue, excluding the commits that resolve the issue itself. A value of resolver_commit_num = 0 means the resolver is completely new to the repository. The remaining fields can be used as features or further analyzed to derive new features.
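As an illustration of how the label might be consumed, the following minimal sketch turns resolver_commit_num into a binary "good first issue" label. The function name label_documents, the max_commits threshold, and the sample documents are assumptions made for this example and are not part of the dataset; only the resolver_commit_num field name comes from the description above. The function accepts any iterable of dict-like documents, so documents read from the restored gfibot.dataset collection could be passed in the same way.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def label_documents(
    docs: Iterable[Mapping[str, Any]],
    max_commits: int = 0,
) -> List[Tuple[Mapping[str, Any], bool]]:
    """Pair each document with a binary label.

    A document is labeled True when its resolver had made at most
    `max_commits` prior commits. `max_commits` is a hypothetical
    threshold chosen for this sketch; 0 means "completely new".
    """
    labeled = []
    for doc in docs:
        commits = doc["resolver_commit_num"]  # ground truth field from the dataset
        labeled.append((doc, commits <= max_commits))
    return labeled


# Made-up documents for illustration; real documents carry many more fields.
sample_docs = [
    {"resolver_commit_num": 0},
    {"resolver_commit_num": 3},
    {"resolver_commit_num": 1},
]

labels = [is_gfi for _, is_gfi in label_documents(sample_docs)]
relaxed_labels = [is_gfi for _, is_gfi in label_documents(sample_docs, max_commits=1)]
```

Raising max_commits loosens the definition of a newcomer, which changes the class balance of the resulting labels.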

The gfibot.resolved_issue collection additionally records which GitHub user resolved each issue and in which commit or pull request. This information can be used to study problems such as personalized good first issue recommendation or newcomer retention mechanisms.
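A retention-style analysis over this collection could start from a per-user count of resolved issues. The sketch below is only an assumption about how such a count might be written: the description above does not name the collection's fields, so the caller supplies the field that holds the resolver's identity, and the "resolver" key in the sample data is made up. The helper names count_resolved_by_user and returning_resolvers are likewise hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping


def count_resolved_by_user(
    docs: Iterable[Mapping[str, Any]],
    resolver_field: str,
) -> Counter:
    """Count how many issues each user resolved.

    `resolver_field` is whatever field stores the resolver in the
    restored collection; it is passed in because this sketch does not
    assume a particular schema.
    """
    counts: Counter = Counter()
    for doc in docs:
        user = doc.get(resolver_field)
        if user is not None:  # skip documents without a recorded resolver
            counts[user] += 1
    return counts


def returning_resolvers(counts: Counter, min_issues: int = 2) -> List[str]:
    """Return users who resolved at least `min_issues` issues, sorted by name."""
    return sorted(user for user, n in counts.items() if n >= min_issues)


# Made-up documents; "resolver" is a hypothetical field name.
sample_docs = [
    {"resolver": "alice"},
    {"resolver": "bob"},
    {"resolver": "alice"},
]

counts = count_resolved_by_user(sample_docs, resolver_field="resolver")
returning = returning_resolvers(counts)
```

Users who appear in returning resolved more than one issue in the sample, which is one simple proxy for a newcomer who stayed.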

This dataset can be used to evaluate new GFI recommendation approaches. We hope it will be helpful in advancing GFI recommendation research and other future studies on open-source software onboarding.

Files

dump.zip (2.5 GB)
md5:b39aada99ab926252549fb5d6cbd4e69