A Dataset of Contributor Activities in the NumFocus Open-Source Community
Description
The NumFocus dataset provides a comprehensive representation of contributor activity across 58 open-source projects supported by the NumFocus organization. Spanning a three-year observation period (January 2022 to December 2024), this dataset captures the dynamics of open-source collaboration within a defined community of scientific and data-driven software projects.
To address the challenges of interpreting raw GitHub event logs, the dataset introduces two structured levels of abstraction: actions and activities. Actions offer a detailed view of individual operations, such as creating branches or pushing commits, while activities aggregate related actions into high-level tasks, such as merging pull requests or resolving issues. This hierarchy bridges the gap between granular operations and contributors’ broader intentions.
The primary dataset focuses on activities, providing a high-level overview of contributor behavior. For users requiring more granular analysis, a complementary dataset of actions is also included.
The dataset is accompanied by a Python-based command-line tool that automates the transformation of raw GitHub event logs into structured actions and activities. The tool, along with its configurable mapping files and scripts, is publicly available at ghmap.
The dataset is distributed across the following files:
- NumFocus_Jan22-Dec24_GH_Actions.zip: Contains 2,716,910 actions in JSON Lines format, capturing individual contributor operations.
- NumFocus_Jan22-Dec24_GH_Activities.zip: Contains 2,278,299 activities in JSON Lines format, representing high-level tasks derived from grouped actions.
- action_schema.json: A validation schema in JSON format for checking records in the actions dataset.
- activity_schema.json: A validation schema in JSON format for checking records in the activities dataset.
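The archives can be read without unpacking them to disk. The sketch below assumes that each archive holds one or more UTF-8 JSON Lines files; the member names inside the archives are not documented here, so the code iterates over every non-directory member. `iter_records` is a hypothetical helper name, not part of ghmap.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_records(zip_path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of every file inside the archive.

    Assumption: every member of the archive is a UTF-8 JSON Lines file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            with archive.open(member) as raw:
                # Decode lazily so the whole file is never held in memory.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Under these assumptions, `for record in iter_records("NumFocus_Jan22-Dec24_GH_Activities.zip"): ...` would stream the activities one record at a time.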
Actions: Low-Level Operations
Each action record captures a single contributor operation and includes the following attributes:
- action: Specifies the type of operation (e.g., PushCommits, OpenPullRequest, or CreateBranch).
- event_id: A unique identifier linking the action to its originating GitHub event.
- date: The timestamp of the action, recorded in ISO 8601 format.
- actor: Contains details about the contributor performing the action, including a persistent id and their GitHub login.
- repository: Provides information about the repository where the action occurred, including its id, name, and associated organisation.
- details: Stores additional attributes specific to the action type, extracted from the payload of the corresponding GitHub event. For a PushCommits action, the details include the branch reference and the number of commits; for an OpenPullRequest action, they include the pull request’s title, labels, state, and creation and update dates.
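As a quick sanity check before analysis, the top-level attributes listed above can be verified with the standard library alone. This is a minimal sketch and not a substitute for action_schema.json, whose contents are not reproduced here; `ACTION_KEYS`, `parse_timestamp`, and `missing_action_keys` are hypothetical names.

```python
from datetime import datetime

# Top-level keys listed above for an action record.
ACTION_KEYS = {"action", "event_id", "date", "actor", "repository", "details"}


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2023-01-01T20:19:58Z'."""
    # fromisoformat() rejects a trailing 'Z' before Python 3.11, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def missing_action_keys(record: dict) -> set:
    """Return the documented top-level keys that are absent from the record."""
    return ACTION_KEYS - set(record)
```

A record for which `missing_action_keys` returns an empty set has all six documented attributes; the check says nothing about the types or the contents of `details`.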
The dataset encompasses 24 distinct action types, each derived from specific GitHub events and representing a well-defined contributor operation:
- AddMember: Tracks the addition of a new collaborator to a repository.
- CloseIssue: Indicates that an issue has been marked as closed by a contributor.
- ClosePullRequest: Represents the closure of a pull request without merging its changes.
- CommentCommit: Captures comments made directly on specific commits within a repository.
- CreateBranch: Logs the creation of a new branch within a repository.
- CreateIssueComment: Tracks comments added to existing issues.
- CreatePullRequestComment: Records comments made on pull requests, including discussions on the changes proposed.
- CreatePullRequestReview: Represents the submission of a review for a pull request.
- CreatePullRequestReviewComment: Captures inline comments added during a pull request review process.
- CreateRepository: Represents the creation of a new GitHub repository.
- CreateTag: Logs the creation of a tag, often associated with versioning or releases.
- DeleteBranch: Indicates that an existing branch has been deleted from a repository.
- DeleteTag: Tracks the deletion of a tag within a repository.
- ForkRepository: Captures the action of forking a repository to create a copy under a different account.
- MakeRepositoryPublic: Represents the change of a private repository’s visibility to public.
- ManageWikiPage: Logs edits or updates made to a repository’s wiki pages.
- MergePullRequest: Indicates that a pull request has been merged, integrating its changes into the base branch.
- OpenIssue: Captures the creation of a new issue within a repository.
- OpenPullRequest: Represents the initiation of a new pull request to propose changes.
- PublishRelease: Tracks the publication of a release, often tied to specific tags and associated metadata.
- PushCommits: Records push events, detailing branches and commits included in the operation.
- ReopenIssue: Indicates that a previously closed issue has been reopened for further action.
- ReopenPullRequest: Captures the reopening of a previously closed pull request.
- StarRepository: Tracks when a user stars a repository to bookmark it or show support.
Example of an action record:
{
  "action": "CloseIssue",
  "event_id": "26170139709",
  "date": "2023-01-01T20:19:58Z",
  "actor": {
    "id": 1282691,
    "login": "KristofferC"
  },
  "repository": {
    "id": 1644196,
    "name": "JuliaLang/julia",
    "organisation": "JuliaLang",
    "organisation_id": 743164
  },
  "details": {
    "issue": {
      "id": 1515182791,
      "number": 48062,
      "title": "Bad default number of BLAS threads on 1.9?",
      "state": "closed",
      "author": {
        "id": 1282691,
        "login": "KristofferC"
      },
      "labels": [
        {
          "name": "linear algebra",
          "description": "Linear algebra"
        }
      ],
      "created_date": "2022-12-31T18:49:47Z",
      "updated_date": "2023-01-01T20:19:58Z",
      "closed_date": "2023-01-01T20:19:57Z"
    }
  }
}
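The nested `details` object can be used directly for simple measurements. As an illustration, the sketch below computes how long the issue in the example record stayed open, using only the `created_date` and `closed_date` fields shown above; `issue_time_to_close` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def issue_time_to_close(action: dict) -> timedelta:
    """Return closed_date minus created_date for a CloseIssue action record."""
    issue = action["details"]["issue"]
    # Replace the trailing 'Z' so that fromisoformat() works before Python 3.11.
    created = datetime.fromisoformat(issue["created_date"].replace("Z", "+00:00"))
    closed = datetime.fromisoformat(issue["closed_date"].replace("Z", "+00:00"))
    return closed - created


# Only the fields needed here are copied from the example record.
example = {
    "action": "CloseIssue",
    "details": {
        "issue": {
            "number": 48062,
            "created_date": "2022-12-31T18:49:47Z",
            "closed_date": "2023-01-01T20:19:57Z",
        }
    },
}

print(issue_time_to_close(example))  # 1 day, 1:30:10
```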
Activities: High-Level Intent Representation
To provide a more meaningful abstraction, actions are grouped into activities. Activities represent cohesive, high-level tasks performed by contributors, such as merging a pull request, publishing a release, or resolving an issue. This higher-level grouping removes noise from low-level event logs and aligns more closely with the contributor’s intent.
Activities are constructed based on logical and temporal criteria. For example, merging a pull request may involve several distinct actions: closing the pull request, pushing the merged changes, and deleting the source branch. Aggregating these actions into a single activity records the merge as one task rather than as a series of unrelated operations.
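The temporal side of this grouping can be illustrated with a deliberately simplified rule: consecutive actions by the same actor in the same repository are merged when they are separated by at most a fixed gap. This is a sketch under that assumption only; it is not the mapping logic of the accompanying tool, it does not assign activity types, and `group_by_time_gap` and its `max_gap` default are hypothetical.

```python
from datetime import datetime, timedelta


def _ts(value: str) -> datetime:
    """Parse a timestamp such as '2023-01-01T20:19:57Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def group_by_time_gap(actions: list, max_gap: timedelta = timedelta(seconds=10)) -> list:
    """Group actions of the same actor and repository separated by at most max_gap.

    Illustrative only: the logical criteria that depend on the activity type
    are not reproduced here.
    """
    groups = []
    latest = {}  # (actor id, repository id) -> most recent group for that pair
    for action in sorted(actions, key=lambda a: _ts(a["date"])):
        key = (action["actor"]["id"], action["repository"]["id"])
        current = latest.get(key)
        if current and _ts(action["date"]) - _ts(current[-1]["date"]) <= max_gap:
            current.append(action)
        else:
            current = [action]
            groups.append(current)
            latest[key] = current
    return groups
```

A purely time-based rule like this one would also merge unrelated actions that happen to occur close together, which is why logical criteria are needed in addition to temporal ones.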
Each activity record includes the following attributes:
- activity: Specifies the type of activity (e.g., MergePullRequest, ReviewPullRequest, or PushCommits).
- start_date: Indicates when the activity began, recorded in ISO 8601 format.
- end_date: Indicates when the activity concluded, recorded in ISO 8601 format.
- actor: Contains details about the contributor performing the activity, including a persistent id and their GitHub login.
- repository: Provides details about the repository where the activity occurred, including its id, name, and associated organisation.
- actions: A list of the actions that constitute the activity, retaining their original metadata for traceability.
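Because each activity keeps its constituent actions, it can be expanded back into one row per action, for instance to load the data into a table. The sketch below assumes, as in the example record in this document, that `actor` and `repository` are stored once on the activity and are not repeated inside each nested action. `flatten_activity` and the output column names are hypothetical.

```python
def flatten_activity(activity: dict) -> list:
    """Return one flat row per constituent action, tagged with its parent activity."""
    rows = []
    for action in activity["actions"]:
        rows.append({
            "activity": activity["activity"],
            "action": action["action"],
            "event_id": action["event_id"],
            "date": action["date"],
            # Taken from the parent activity, since the nested actions in the
            # example record carry no actor or repository of their own.
            "actor_login": activity["actor"]["login"],
            "repository_name": activity["repository"]["name"],
        })
    return rows
```

The `event_id` column keeps each row traceable to its originating GitHub event.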
The dataset includes 21 distinct activity types, which aggregate related actions based on logical and temporal criteria to represent contributors’ high-level intent:
- AddContributors: Tracks the addition of one or more contributors to a repository within a short timeframe.
- CloseIssue: Represents the resolution of an issue, optionally accompanied by a comment clarifying the closure.
- ClosePullRequest: Indicates the closure of a pull request without merging its changes, optionally documented with a comment.
- CommentCommits: Logs comments made directly on specific commits, often as part of discussions or reviews.
- CommentIssue: Captures multiple comments on a specific issue.
- CommentPullRequest: Records multiple inline or general comments on a pull request.
- CreateRepository: Represents the creation of a new repository, optionally including the initialization of its main branch.
- ForkRepository: Captures the action of creating a fork of an existing repository.
- MakeRepositoryPublic: Tracks the transition of a private repository to public visibility.
- ManageBranches: Logs the creation or deletion of branches within a repository.
- ManageTags: Tracks the creation or deletion of tags, often linked to versioning or releases.
- ManageWikiPages: Represents updates or edits to wiki pages associated with a repository.
- MergePullRequest: Indicates the successful merging of a pull request, potentially accompanied by actions such as pushing changes, deleting branches, or closing linked issues.
- OpenIssue: Logs the creation of a new issue within a repository to report bugs, request features, or raise concerns.
- OpenPullRequest: Tracks the initiation of a pull request proposing changes to the repository.
- PublishRelease: Represents the publication of a release, optionally involving the creation of a corresponding tag.
- PushCommits: Logs a sequence of commits pushed to a branch within a repository.
- ReopenIssue: Captures the reopening of a previously closed issue for further action, optionally accompanied by a clarifying comment.
- ReopenPullRequest: Represents the reopening of a previously closed pull request for additional review or discussion, optionally with a comment.
- ReviewPullRequest: Tracks the review process of a pull request, including general or inline comments and formal reviews.
- StarRepository: Logs when a user stars a repository, signaling interest or support.
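A first look at the activities dataset is often a frequency table of these types. The sketch below counts records per activity type and reports any type name that is not in the list above; `ACTIVITY_TYPES` and `count_activity_types` are hypothetical names, and the input is any iterable of activity records.

```python
from collections import Counter

# The 21 activity types listed above.
ACTIVITY_TYPES = frozenset({
    "AddContributors", "CloseIssue", "ClosePullRequest", "CommentCommits",
    "CommentIssue", "CommentPullRequest", "CreateRepository", "ForkRepository",
    "MakeRepositoryPublic", "ManageBranches", "ManageTags", "ManageWikiPages",
    "MergePullRequest", "OpenIssue", "OpenPullRequest", "PublishRelease",
    "PushCommits", "ReopenIssue", "ReopenPullRequest", "ReviewPullRequest",
    "StarRepository",
})


def count_activity_types(activities):
    """Return (counts per activity type, set of type names not listed above)."""
    counts = Counter(record["activity"] for record in activities)
    unknown = set(counts) - ACTIVITY_TYPES
    return counts, unknown
```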
Example of an activity record:
{
  "activity": "MergePullRequest",
  "start_date": "2023-01-01T20:19:57Z",
  "end_date": "2023-01-01T20:20:05Z",
  "actor": {
    "id": 1282691,
    "login": "KristofferC"
  },
  "repository": {
    "id": 1644196,
    "name": "JuliaLang/julia",
    "organisation": "JuliaLang",
    "organisation_id": 743164
  },
  "actions": [
    {
      "action": "MergePullRequest",
      "event_id": "26170139644",
      "date": "2023-01-01T20:19:57Z",
      "details": {
        "pull_request": {
          "id": 1181521272,
          "number": 48064,
          "title": "use the correct env variable name to set default openblas num threads",
          "state": "closed",
          "author": {
            "id": 1282691,
            "login": "KristofferC"
          },
          "labels": [
            {
              "name": "backport 1.8",
              "description": "Change should be backported to release-1.8"
            },
            {
              "name": "backport 1.9",
              "description": "Change should be backported to release-1.9"
            }
          ],
          "created_date": "2022-12-31T19:59:00Z",
          "updated_date": "2023-01-01T20:19:57Z",
          "closed_date": "2023-01-01T20:19:56Z",
          "merged": true
        }
      }
    },
    {
      "action": "CloseIssue",
      "event_id": "26170139709",
      "date": "2023-01-01T20:19:58Z",
      "details": {
        "issue": {
          "id": 1515182791,
          "number": 48062,
          "title": "Bad default number of BLAS threads on 1.9?",
          "state": "closed",
          "author": {
            "id": 1282691,
            "login": "KristofferC"
          },
          "labels": [
            {
              "name": "linear algebra",
              "description": "Linear algebra"
            }
          ],
          "created_date": "2022-12-31T18:49:47Z",
          "updated_date": "2023-01-01T20:19:58Z",
          "closed_date": "2023-01-01T20:19:57Z"
        }
      }
    },
    {
      "action": "DeleteBranch",
      "event_id": "26170140410",
      "date": "2023-01-01T20:20:04Z",
      "details": {
        "branch_name": "kc/openblas_threads"
      }
    },
    {
      "action": "PushCommits",
      "event_id": "26170140428",
      "date": "2023-01-01T20:20:05Z",
      "details": {
        "push": {
          "id": 12151296179,
          "ref": "refs/heads/master",
          "commits": 1
        }
      }
    }
  ]
}
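An activity record can be reduced to a compact summary for exploratory analysis. The sketch below derives the duration from `start_date` and `end_date` and lists the constituent action types in order; `summarize_activity` is a hypothetical helper, and the sample reproduces only the fields of the example record that the function reads.

```python
from datetime import datetime


def summarize_activity(activity: dict) -> dict:
    """Return the activity type, its duration in seconds, and its action types."""
    start = datetime.fromisoformat(activity["start_date"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(activity["end_date"].replace("Z", "+00:00"))
    return {
        "activity": activity["activity"],
        "duration_seconds": (end - start).total_seconds(),
        "action_types": [action["action"] for action in activity["actions"]],
    }


# Abbreviated copy of the example record: details and ids are omitted.
sample = {
    "activity": "MergePullRequest",
    "start_date": "2023-01-01T20:19:57Z",
    "end_date": "2023-01-01T20:20:05Z",
    "actions": [
        {"action": "MergePullRequest", "date": "2023-01-01T20:19:57Z"},
        {"action": "CloseIssue", "date": "2023-01-01T20:19:58Z"},
        {"action": "DeleteBranch", "date": "2023-01-01T20:20:04Z"},
        {"action": "PushCommits", "date": "2023-01-01T20:20:05Z"},
    ],
}

summary = summarize_activity(sample)
print(summary["duration_seconds"])  # 8.0
```

In this example the activity's `start_date` and `end_date` coincide with the dates of its first and last actions; whether that holds for every record is not stated here and would need to be checked against the data.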