There is a newer version of this record available.

Dataset Open Access

(No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset

Baltes, Sebastian; Knack, Jascha

This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent dataset and considered projects that:

  1. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
  2. were active for at least one year (365 days) before the first build with Travis CI (before_ci),
  3. used Travis CI at least for one year (during_ci),
  4. had commit or merge activity on the default branch in both of these phases, and
  5. used the default branch to trigger builds.
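
Read as a predicate over per-project dates, the five criteria above might look like the following sketch. The `Project` fields are hypothetical, and the way "active for at least one year" and "used Travis CI at least for one year" are measured here (first commit to first build, first build to last build) is an assumption made for illustration; the description does not spell out either.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Project:
    # All field names are hypothetical and chosen for this sketch only.
    created: date                   # project creation date (per GHTorrent)
    first_commit: date              # date of the first commit
    first_build: date               # date of the first Travis CI build
    last_build: date                # date of the last known Travis CI build
    active_before_ci: bool          # commits/merges on default branch before CI
    active_during_ci: bool          # commits/merges on default branch during CI
    builds_on_default_branch: bool  # default branch triggers builds


def is_selected(p: Project) -> bool:
    """Apply criteria 1-5 under the assumptions stated above."""
    year = timedelta(days=365)
    # 1. First commit not more than seven days before project creation.
    from_beginning = p.created - p.first_commit <= timedelta(days=7)
    # 2. At least 365 days of history before the first build (assumed measure).
    year_before_ci = p.first_build - p.first_commit >= year
    # 3. At least 365 days between first and last build (assumed measure).
    year_during_ci = p.last_build - p.first_build >= year
    # 4. Activity in both phases, and 5. builds triggered by the default branch.
    return (from_beginning and year_before_ci and year_during_ci
            and p.active_before_ci and p.active_during_ci
            and p.builds_on_default_branch)
```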

To derive the time frames, we employed the GHTorrent BigQuery dataset. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and one year after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
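
The file extension filter can be illustrated with a short sketch. The extension set below is an assumption (the exact list used is not given here), and the function names are hypothetical; this is not the code from git-log-parser.

```python
import os

# Assumed extension set; the exact filter used for the dataset is not listed here.
SOURCE_EXTENSIONS = {".java", ".rb"}


def source_files(changed_paths):
    """Return the changed paths whose extension marks Java or Ruby source code."""
    return [path for path in changed_paths
            if os.path.splitext(path)[1].lower() in SOURCE_EXTENSIONS]


def is_relevant_commit(changed_paths):
    """A commit is kept if it touches at least one source code file."""
    return bool(source_files(changed_paths))
```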

We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

  1. have Java or Ruby as their project language
  2. used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)
  3. have commit activity for at least two years (730 days)
  4. are engineered software projects (at least 10 watchers)
  5. were not in the TravisTorrent dataset

In total, 8,046 projects satisfied those constraints. We drew a random sample of 100 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample.
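
Drawing a simple random sample without replacement from such a sampling frame could be sketched as follows. The seed, the function name, and the placeholder project names are assumptions for illustration; the description does not say how the sample was actually drawn.

```python
import random


def draw_sample(sampling_frame, k=100, seed=0):
    """Draw k distinct items; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, k)


# Placeholder names standing in for the 8,046 projects in the sampling frame.
frame = [f"project_{i}" for i in range(8046)]
sample = draw_sample(frame)
```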

This dataset contains the following files:

tr_projects_sample_filtered_2.csv
A CSV file with information about the 113 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv

Two CSV files with information about all commits to the default branch, one covering the phase before the first CI build (before_ci) and one covering the phase after it (during_ci). Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Both files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit message (in characters).
file_count: Number of files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
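
As an example of working with these columns, the sketch below sums added and deleted lines per project. It assumes the files are comma-separated with a header row carrying the column names above, which the description does not state; the function name is hypothetical and the demo rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def churn_per_project(csv_file):
    """Sum lines_added + lines_deleted per project from an open CSV file."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["project"]] += int(row["lines_added"]) + int(row["lines_deleted"])
    return totals


# Made-up rows for illustration only; they are not taken from the dataset.
demo = io.StringIO(
    "project,lines_added,lines_deleted\n"
    "owner_repo,10,2\n"
    "owner_repo,3,0\n"
    "other_lib,5,5\n"
)
totals = churn_per_project(demo)
```

For the real data, the same function could be called with `open("tr_sample_commits_default_branch_before_ci.csv", newline="")`, provided the delimiter assumption holds.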

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv

Two CSV files with information about all merges into the default branch, one covering the phase before the first CI build (before_ci) and one covering the phase after it (during_ci). Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Both files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit message (in characters).
file_count: Number of files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch: Source branch of the pull request (extracted from log message).
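
The last three columns are derived from the merge commit's log message. GitHub's default message for a merged pull request starts with "Merge pull request #<id> from <user>/<branch>", so one way to extract them is the regular expression below. This is a sketch under that assumption, not the parser used to build the dataset, and the function name is hypothetical.

```python
import re

# Matches GitHub's default merge message, e.g.
# "Merge pull request #42 from alice/feature-x".
MERGE_RE = re.compile(r"^Merge pull request #(\d+) from ([^/\s]+)/(\S+)")


def parse_merge_message(message):
    """Return the pull request fields, or None for other merge messages."""
    match = MERGE_RE.match(message)
    if match is None:
        return None
    return {
        "pull_request_id": int(match.group(1)),
        "source_user": match.group(2),
        # Branch names may themselves contain slashes.
        "source_branch": match.group(3),
    }
```

Merges created with a custom message or on the command line do not follow this pattern, so these columns can be expected to be empty for such rows.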

comparison_project_sample_100.csv
A CSV file with information about the 100 projects in the comparison sample.

commits_default_branch_before_mid.csv
commits_default_branch_after_mid.csv

Two CSV files with information about all commits to the default branch, one covering the period before and one covering the period after the middle date (mid) of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Both files have the same columns as the commit tables described above.

merges_default_branch_before_mid.csv
merges_default_branch_after_mid.csv

Two CSV files with information about all merges into the default branch, one covering the period before and one covering the period after the middle date (mid) of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Both files have the same columns as the merge tables described above.

Files (11.4 MB)

Name                                            Size      MD5
commits_default_branch_after_mid.csv            909.0 kB  43fe5f690b61249a345a266559bbeef8
commits_default_branch_before_mid.csv           1.1 MB    74e0cf737e9ddddcf971ff9b6332c197
comparison_project_sample_100.csv               14.3 kB   6547c547896591505eeced85e874cca4
merges_default_branch_after_mid.csv             190.4 kB  9c306225676b3cbf5634eb99109dfd0f
merges_default_branch_before_mid.csv            148.6 kB  4d2812019df1edf652c793e7ab10669a
tr_projects_sample_filtered_2.csv               16.5 kB   c5d2a6d443dcbf364e8f344c1cc686ee
tr_sample_commits_default_branch_before_ci.csv  3.7 MB    1fa5c6df50f1524c187368ab50c02ee6
tr_sample_commits_default_branch_during_ci.csv  3.7 MB    db8da4fbf872a9323611e30c4eb2b5fa
tr_sample_merges_default_branch_before_ci.csv   628.2 kB  3940f66a4ab3afbf2d451ca0940bf3bb
tr_sample_merges_default_branch_during_ci.csv   944.2 kB  7e1c6a570dd36b0af54c573912a0c8a6
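
The MD5 checksums in the file listing can be used to verify a download. A minimal sketch with Python's standard `hashlib` module follows; the helper names are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib


def file_chunks(path, chunk_size=1 << 20):
    """Yield a file's contents in chunks so large files need not fit in memory."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


def md5_hex(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Compare a downloaded file against the checksum from the file listing."""
    return md5_hex(file_chunks(path)) == expected_md5.lower()
```

For example, `matches_listing("tr_projects_sample_filtered_2.csv", "c5d2a6d443dcbf364e8f344c1cc686ee")` should return `True` for an intact copy of that file.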