A Dataset of Bot and Human Activities in GitHub
Description
This repository provides a dataset of GitHub contributor activities and accompanies a paper accepted at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). The dataset contains 834K high-level activities made by 385 bots and 616 human contributors on GitHub between 25 November 2022 and 9 March 2023. The activities were generated from 1M+ low-level events obtained from the GitHub Events API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could support the development of better bot identification tools and empirical studies of the role that bots play in collaborative software development.
Files description
The following files are provided as part of the archive:
- bot_activities.json - A JSON file containing 649,755 activities made by 385 bot contributors;
- human_activities.json - A JSON file containing 184,056 activities made by 616 human contributors (anonymized);
- JsonSchema.json - A JSON schema that validates the above datasets.
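As a minimal sketch of how the two activity files might be loaded and summarised, the Python snippet below assumes that each file holds a single JSON array of activity objects. That layout is an assumption, not something stated here; JsonSchema.json in the archive is the authoritative description. The helper names `load_activities` and `count_by_type` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_activities(path):
    """Load one activity file; assumes the top-level JSON value is a list."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data


def count_by_type(activities):
    """Count how many records of each activity type appear in a list."""
    return Counter(record["activity"] for record in activities)


if __name__ == "__main__":
    # File names are taken from the archive listing above.
    for name in ("bot_activities.json", "human_activities.json"):
        if Path(name).exists():
            counts = count_by_type(load_activities(name))
            print(name, counts.most_common(3))
```

Loading a whole file at once keeps the sketch short; for the larger bot file a streaming parser may be preferable.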
Example
Below is an example of a Closing pull request activity:
{
  "date": "2022-11-25T18:49:09+00:00",
  "activity": "Closing pull request",
  "contributor": "typescript-bot",
  "repository": "DefinitelyTyped/DefinitelyTyped",
  "comment": {
    "length": 249,
    "GH_node": "IC_kwDOAFz6BM5PJG7l"
  },
  "pull_request": {
    "id": 62328,
    "title": "[qunit] Add `test.each()`",
    "created_at": "2022-09-19T17:34:28+00:00",
    "status": "closed",
    "closed_at": "2022-11-25T18:49:08+00:00",
    "merged": false,
    "GH_node": "PR_kwDOAFz6BM4_N5ib"
  },
  "conversation": {
    "comments": 19
  },
  "payload": {
    "pr_commits": 1,
    "pr_changed_files": 5
  }
}
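To illustrate how such a record could be consumed, the following Python sketch computes how long the pull request in the example stayed open, using its created_at and closed_at timestamps. The function name `pull_request_lifetime` is hypothetical, and the record is a trimmed copy of the example above.

```python
from datetime import datetime

# Trimmed copy of the example activity shown above.
record = {
    "date": "2022-11-25T18:49:09+00:00",
    "activity": "Closing pull request",
    "pull_request": {
        "id": 62328,
        "created_at": "2022-09-19T17:34:28+00:00",
        "closed_at": "2022-11-25T18:49:08+00:00",
        "merged": False,
    },
}


def pull_request_lifetime(activity):
    """Return closed_at minus created_at, or None if the PR is still open."""
    pr = activity["pull_request"]
    if pr["closed_at"] is None:
        return None
    created = datetime.fromisoformat(pr["created_at"])
    closed = datetime.fromisoformat(pr["closed_at"])
    return closed - created


lifetime = pull_request_lifetime(record)
print(lifetime.days)  # 67 full days between creation and closing
```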
List of activity types
In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting branch, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.
List of fields
Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.
For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe the fields that are provided in the JSON file for each activity type. Note that we also provide the corresponding JSON schema alongside the datasets.
Properties
- date
  - Date on which the activity is performed
  - Type: string
  - e.g., "2022-11-25T09:55:19+00:00"
  - String format must be a "date-time"
- activity
  - The activity performed by the contributor
  - Type: string
  - e.g., "Commenting pull request"
- contributor
  - The login name of the contributor who performed this activity
  - Type: string
  - e.g., "analysis-bot", "anonymised" in the case of a human contributor
- repository
  - The repository in which the activity is performed
  - Type: string
  - e.g., "apache/spark", "anonymised" in the case of a human contributor
- issue
  - Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
  - Type: object
  - Properties
    - id
      - Issue number
      - Type: integer
      - e.g., 35471
    - title
      - Issue title
      - Type: string
      - e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
    - created_at
      - The date on which this issue is created
      - Type: string
      - e.g., "2022-11-10T13:07:23+00:00"
      - String format must be a "date-time"
    - status
      - Current state of the issue
      - Type: string
      - "open" or "closed"
    - closed_at
      - The date on which this issue is closed. "null" will be provided if the issue is open
      - Type: string or null
      - e.g., "2022-11-25T10:42:39+00:00"
      - String format must be a "date-time"
    - resolved
      - Whether the issue is resolved (true) or not_planned/still open (false)
      - Type: boolean
      - true or false
    - GH_node
      - The GitHub node of this issue
      - Type: string
      - e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
- pull_request
  - Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
  - Type: object
  - Properties
    - id
      - Pull request number
      - Type: integer
      - e.g., 35471
    - title
      - Pull request title
      - Type: string
      - e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
    - created_at
      - The date on which this pull request is created
      - Type: string
      - e.g., "2022-11-10T13:07:23+00:00"
      - String format must be a "date-time"
    - status
      - Current state of the pull request
      - Type: string
      - "open" or "closed"
    - closed_at
      - The date on which this pull request is closed. "null" will be provided if the pull request is open
      - Type: string or null
      - e.g., "2022-11-25T10:42:39+00:00"
      - String format must be a "date-time"
    - merged
      - Whether the pull request is merged (true) or rejected/still open (false)
      - Type: boolean
      - true or false
    - GH_node
      - The GitHub node of this pull request
      - Type: string
      - e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
- review
  - Pull request review information - provided for Reviewing code
  - Type: object
  - Properties
    - status
      - Status of the review
      - Type: string
      - "changes_requested" or "approved" or "dismissed"
    - GH_node
      - The GitHub node of this review
      - Type: string
      - e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
- conversation
  - Comments information in issue or pull request - provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
  - Type: object
  - Properties
    - comments
      - Number of comments present in the corresponding issue or pull request
      - Type: integer
      - e.g., 5
- comment
  - Comment information - provided for all the activities for which the field issue or pull_request is reported and additionally for commit comment
  - Type: object
  - Properties
    - length
      - Length of the comment text or description text (if comment is not expected)
      - Type: integer
      - e.g., 25
    - GH_node
      - The GitHub node of this comment or description. "null" will be provided if there is no comment expected
      - Type: string or null
      - e.g., "IC_kwDOEj6V8c5PHT78", "anonymised" in the case of a human contributor
- gitref
  - Tag information - provided for Creating branch, Creating tag, Deleting branch, Deleting tag, Editing wiki page and Publishing a release
  - Type: object
  - Properties
    - type
      - Type of the gitref
      - Type: string
      - "tag" or "branch" or "commit"
    - name
      - Name of the gitref
      - Type: string
      - e.g., "cherry-pick-11-to-release-4.10"
    - description_length
      - Length of the description text provided while creating the gitref. "null" will be provided if the type is "branch" or "commit" as they do not have any description
      - Type: integer or null
      - e.g., 23
- release
  - Release information - provided for Publishing a release
  - Type: object
  - Properties
    - name
      - The name of the release that is created. "null" will be provided if the name is not provided
      - Type: string or null
      - e.g., "v0.65.9"
    - description_length
      - Length of the description of the release that is created
      - Type: integer
      - e.g., 888
    - created_at
      - The date at which the release is created (activity date is the release published date)
      - Type: string
      - e.g., "2022-11-25T11:34:48+00:00"
      - String format must be a "date-time"
    - prerelease
      - Whether the release that is created is a prerelease
      - Type: boolean
      - true or false
    - new_tag
      - Whether a new tag is created for this release (true) or an existing tag is re-used (false)
      - Type: boolean
      - true or false
    - GH_node
      - The corresponding release node ID
      - Type: string
      - e.g., "RE_kwDOCm6M2s4FBGxT", "anonymised" in the case of a human contributor
- page
  - Page information - provided for Editing wiki page
  - Type: object
  - Properties
    - name
      - Name of the page
      - Type: string
      - e.g., "Workflow-status"
    - title
      - Title of the page
      - Type: string
      - e.g., "Workflow status"
    - new
      - Whether the page is newly created (true) or an existing page is edited (false)
      - Type: boolean
      - true or false
- payload
  - Other additional details - provided for Opening pull request, Closing pull request, Reopening pull request and Pushing commits
  - Type: object
  - Properties
    - pr_commits
      - The number of commits in this pull request
      - Type: integer
      - e.g., 3
    - pr_changed_files
      - The number of files that are changed in this pull request
      - Type: integer
      - e.g., 2
    - pushed_commits
      - The number of commits present in this push
      - Type: integer
      - e.g., 4
    - distinct_pushed_commits
      - The distinct number of commits present in this push
      - Type: integer
      - e.g., 1
    - github_push_id
      - The corresponding GitHub push ID
      - Type: integer
      - e.g., 11790446870, "anonymised" in the case of a human contributor
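The field descriptions above can be turned into a lightweight sanity check. The Python sketch below covers only the four common fields and a subset of the object fields (issue, pull_request, review, release, page), using the activity types listed for each. Treating these fields as mandatory is an assumption made for illustration; the JSON schema shipped with the dataset remains the reference. The names `COMMON_FIELDS`, `OBJECT_FIELDS` and `missing_fields` are hypothetical.

```python
# Fields documented as provided for all activity types.
COMMON_FIELDS = ("date", "activity", "contributor", "repository")

# Subset of object fields and the activity types they are documented for.
OBJECT_FIELDS = {
    "issue": {
        "Opening issue", "Closing issue", "Reopening issue",
        "Transferring issue", "Commenting issue",
    },
    "pull_request": {
        "Opening pull request", "Closing pull request",
        "Reopening pull request", "Commenting pull request changes",
        "Reviewing code",
    },
    "review": {"Reviewing code"},
    "release": {"Publishing a release"},
    "page": {"Editing wiki page"},
}


def missing_fields(record):
    """List documented fields that are absent from one activity record."""
    missing = [name for name in COMMON_FIELDS if name not in record]
    activity = record.get("activity")
    for name, activities in OBJECT_FIELDS.items():
        if activity in activities and name not in record:
            missing.append(name)
    return missing


incomplete = {
    "date": "2022-11-25T09:55:19+00:00",
    "activity": "Reviewing code",
    "contributor": "analysis-bot",
    "repository": "apache/spark",
}
print(missing_fields(incomplete))  # ['pull_request', 'review']
```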
Mapping between activities and events
For many activity types, the corresponding activity can be observed by the occurrence of a single event type. For example, the activity types Forking repository and Starring repository each correspond to a single event type, as given below.
| Activity type | Event type | Payload |
|---|---|---|
| Forking repository | ForkEvent | - |
| Starring repository | WatchEvent | action = "started" |
However, in some cases, the same event type yields different activity types depending on the value present in the payload. For example, three different activity types can be generated from the same low-level event type CreateEvent, depending on the value of its ref_type (either "repository", "branch", or "tag") present in the payload.
| Activity type | Event type | Payload |
|---|---|---|
| Creating repository | CreateEvent | ref_type = "repository" |
| Creating branch | CreateEvent | ref_type = "branch" |
| Creating tag | CreateEvent | ref_type = "tag" |
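The single-event mappings described so far can be sketched as a small lookup function. The Python snippet below assumes each event is a dictionary with a "type" key and a "payload" dictionary, mirroring the shape of events returned by the GitHub Events API; the function name `activity_from_event` is hypothetical, and only the five mappings shown above are covered.

```python
# Activity type produced by a CreateEvent, keyed by the payload's ref_type.
CREATE_ACTIVITIES = {
    "repository": "Creating repository",
    "branch": "Creating branch",
    "tag": "Creating tag",
}


def activity_from_event(event):
    """Map one low-level event to an activity type, or None if not covered."""
    event_type = event.get("type")
    payload = event.get("payload", {})
    if event_type == "ForkEvent":
        return "Forking repository"
    if event_type == "WatchEvent" and payload.get("action") == "started":
        return "Starring repository"
    if event_type == "CreateEvent":
        return CREATE_ACTIVITIES.get(payload.get("ref_type"))
    return None


print(activity_from_event({"type": "CreateEvent", "payload": {"ref_type": "tag"}}))
```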
In some cases, there is no one-to-one mapping between events and activities. This is because some actions on GitHub may generate more than a single event and lead to a sequence of one mandatory event and a second optional event (marked with ?). For example, for the activity type Publishing a release, event type ReleaseEvent is mandatory with payload's action value = "published", while event type CreateEvent is optional as it is required only when a new tag is created along with the published release.
| Activity type | Event type | Payload |
|---|---|---|
| Publishing a release | ReleaseEvent | action = "published" |
| | ? CreateEvent | ref_type = "tag" |
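The mandatory/optional pattern for Publishing a release can be sketched as follows. The Python snippet assumes that the events stemming from one action have already been grouped together; how events are actually grouped (for example by repository, contributor or timestamp) is not described here, so that step is left out. The function name `release_activity` is hypothetical, and the returned dictionary only fills the activity name and the new_tag flag.

```python
def _matches(event, event_type, key, value):
    """True if the event has the given type and payload key/value pair."""
    return (
        event.get("type") == event_type
        and event.get("payload", {}).get(key) == value
    )


def release_activity(events):
    """Build a 'Publishing a release' activity from one group of events.

    Returns None when the mandatory ReleaseEvent is absent. The optional
    CreateEvent for a tag only sets the new_tag flag.
    """
    published = any(
        _matches(e, "ReleaseEvent", "action", "published") for e in events
    )
    if not published:
        return None
    new_tag = any(_matches(e, "CreateEvent", "ref_type", "tag") for e in events)
    return {"activity": "Publishing a release", "release": {"new_tag": new_tag}}


group = [
    {"type": "CreateEvent", "payload": {"ref_type": "tag"}},
    {"type": "ReleaseEvent", "payload": {"action": "published"}},
]
print(release_activity(group))
```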
All the identified activities, along with their event type(s) and payload information, are given in the following table.
| Activity type | Event type | Payload |
|---|---|---|
| Creating repository | CreateEvent | ref_type = "repository" |
| Creating branch | CreateEvent | ref_type = "branch" |
| Creating tag | CreateEvent | ref_type = "tag" |
| Deleting tag | DeleteEvent | ref_type = "tag" |
| Deleting branch | DeleteEvent | ref_type = "branch" |
| Publishing a release | ReleaseEvent | action = "published" |
| | ? CreateEvent | ref_type = "tag" |
| Making repository public | PublicEvent | - |
| Adding collaborator to repository | MemberEvent | action = "added" |
| Forking repository | ForkEvent | - |
| Starring repository | WatchEvent | action = "started" |
| Editing wiki page | GollumEvent | pages-->action = "created" or "edited" |
| Opening issue | IssuesEvent | action = "opened" |
| Closing issue | IssuesEvent | action = "closed" |
| | ? IssueCommentEvent | action = "created" |
| Reopening issue | IssuesEvent | action = "reopened" |
| | ? IssueCommentEvent | action = "created" |
| Transferring issue | IssuesEvent | action = "opened" |
| Commenting issue | IssueCommentEvent | action = "created" |
| Opening pull request | PullRequestEvent | action = "opened" |
| Closing pull request | PullRequestEvent | action = "closed" |
| | ? IssueCommentEvent | action = "created" |
| Reopening pull request | PullRequestEvent | action = "reopened" |
| | ? IssueCommentEvent | action = "created" |
| Commenting pull request | IssueCommentEvent | action = "created" |
| Commenting pull request changes | PullRequestReviewCommentEvent | action = "created" |
| | ? PullRequestReviewEvent | action = "created" |
| Reviewing code | PullRequestReviewEvent | action = "created" |
| Commenting commits | CommitCommentEvent | action = "created" |
| Pushing commits | PushEvent | - |