Replication Package for the paper: "Continuous Integration Practices in Machine Learning Projects: The Practitioners' Perspective"
Replication Package Description
Overview
This replication package provides the scripts, datasets, and documentation needed to reproduce the analysis performed in the study "Continuous Integration Practices in Machine Learning Projects: The Practitioners’ Perspective". It includes scripts for data pre-processing, thematic analysis summaries, network visualization, and survey response analysis.
Folder Structure
The package is organized into the following directories:
1. r_scripts/ - R Scripts for Data Processing and Analysis
This folder contains all R scripts used for pre-processing, analysis, and visualization. The scripts are categorized based on their function:
- Pre-processing Scripts: Used to filter and structure datasets before analysis.
  - pre-processing-01-select-projects-to-survey.R - Selects ML repositories for the survey based on build duration.
  - pre-processing-02-select-integrators-to-send-form.R - Identifies integrators to contact.
  - pre-processing-03-fetch-integrators-email-name.R - Fetches integrators' names and emails using GitHub APIs.
  - pre-processing-04-select-contributors-to-send-form.R - Identifies contributors to contact.
  - pre-processing-05-fetch-contributors-email-name.R - Fetches contributors' names and emails.
  - pre-processing-06-update-ml-repos-dataset-to-include-integrators-count.R - Updates repository dataset with integrators count.
- Thematic Analysis Scripts: Perform code counting and theme analysis.
  - RQ1-2-thematic-analysis-theme-code-counting.R
  - RQ2-2-thematic-analysis-theme-code-counting.R
  - RQ3-1-thematic-analysis-theme-code-counting.R
- Network Visualization Scripts: Generate network plots from the thematic analysis.
  - RQ1-3-neovis-network-plot.R
  - RQ2-3-neovis-network-plot.R
  - RQ3-2-neovis-network-plot.R
- Survey Response and CI Perception Analysis Scripts:
  - RQ1-1-participants-perception-on-ci-practices-differences-in-ml.R
  - RQ2-1-participants-perspectives-on-build-duration-in-ml.R
  - RQ3-3-analysis-of-acceptable-test-coverage-in-ml-projects.R
- Additional Analysis Scripts:
  - 00_demographic_analysis.R - Performs demographic analysis.
  - 01_neovis_example.R - Example script for network visualization.
2. datasets/ - Data Files
Contains raw and processed datasets used in the study.
Subdirectories:
- bernardo_et_al_2024_data/ - Raw datasets from our prior study on the differences in CI adoption between ML and non-ML projects.[1]
- survey_responses/ - Contains responses from the survey.
- axial_analysis/ - Processed datasets used for axial coding analysis.
Key Dataset Files:
- 1_ml_repos_with_shorter_and_longer_build_durations.csv - Repository-level dataset categorizing projects based on build duration.
- 2_ml_repos_with_shorter_and_longer_build_durations_survey_form_link_integrator.xlsx - Survey form links for integrators.
- 2_ml_repos_with_shorter_and_longer_build_durations_survey_form_link_contributors.xlsx - Survey form links for contributors.
- 3_integrators_with_closed_prs_unduplicated.csv - List of integrators with unique PR closures.
- 4_integrators_with_closed_prs_unduplicated_name_email_fetched.csv - Same as above, with names and emails included.
- 5_integrators_with_closed_prs_unduplicated_name_email_fetched__email_available.csv - Integrators with valid emails retrieved.
- 6_contributors_with_prs_unduplicated_filtered.csv - List of contributors with unique PR submissions.
- 7_contributors_with_prs_unduplicated_filtered_name_email_fetched.csv - Same as above, with names and emails included.
- 8_contributors_with_prs_unduplicated_filtered_name_email_fetched__email_available.csv - Contributors with valid emails retrieved.
3. plots/ - Visualizations
This directory contains the plots generated by the R scripts for the analysis reported in the paper, as well as the plots used in the survey forms we created for the participants of each investigated project.
4. google_apps_scripts/ - Google Sheets Automation
Scripts for handling survey form responses and linking them to datasets.
- repos_with_form_links.xlsx - Links repositories to survey forms (Google Forms).
- combine-form-responses.gs - Google Apps Script for merging survey responses.
- readme.txt - Explanation of the Google Apps Scripts.
5. appendix_tesseract-ocr_tesseract-form.pdf - Survey Example
This file contains an example of the survey form used in the study. It provides full visibility into:
- The questions asked to ML practitioners.
- The format of the survey.
- How responses were collected and structured.
How to Reproduce the Analysis
1. Set Up Your Environment
- Install the required R packages.
- Navigate to the working directory (e.g., ./r_scripts/).
- Ensure the necessary API tokens (e.g., GitHub) are configured securely.
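The setup steps above can be sketched in R as follows. This is a minimal sketch under stated assumptions: the package names in `required_packages` are placeholders rather than the package's actual dependency list (check the `library()` calls at the top of each script), and the environment variable name `GITHUB_PAT` is an assumption about how the token is supplied.

```r
# Hypothetical package list: replace with the packages the scripts actually load.
required_packages <- c("dplyr", "ggplot2")

# Return the packages from `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

# Read an API token from an environment variable and fail early if it is unset.
# Keeping the token out of the scripts avoids committing it by accident.
# The default variable name is an assumption, not taken from the package.
get_api_token <- function(var = "GITHUB_PAT") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  token
}

# Usage (not run here):
# to_install <- missing_packages(required_packages)
# if (length(to_install) > 0) install.packages(to_install)
# token <- get_api_token()
```

Setting the variable in a user-level `.Renviron` file is one common way to make it available to R sessions without writing it into any script.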
2. Run Pre-processing Scripts
Execute the pre-processing scripts sequentially to filter and prepare the data.
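One way to run the scripts in order is to rely on their two-digit prefixes (`pre-processing-01-...` through `pre-processing-06-...`). The helper names below are hypothetical, and the sketch assumes that each script reads its inputs from files rather than from objects left in memory by the previous script, and that relative paths inside the scripts resolve from the current working directory.

```r
# List the pre-processing scripts in `dir`, ordered by their two-digit prefix.
list_preprocessing_scripts <- function(dir = "r_scripts") {
  files <- list.files(dir,
                      pattern = "^pre-processing-[0-9]{2}-.*\\.R$",
                      full.names = TRUE)
  files[order(basename(files))]
}

# Source each script in order. An error in one script stops the loop,
# so later steps never run on incomplete inputs.
run_preprocessing <- function(dir = "r_scripts") {
  for (script in list_preprocessing_scripts(dir)) {
    message("Running ", basename(script))
    # A fresh environment per script keeps objects from leaking between steps.
    source(script, local = new.env())
  }
  invisible(TRUE)
}

# Usage (not run here):
# run_preprocessing("r_scripts")
```

If the scripts do share objects in memory, sourcing them one by one in the same session, in the same numeric order, is the safer choice.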
3. Run Thematic Analysis
Run the thematic analysis scripts to count codes and themes, then generate the network visualizations.
Important Note:
The thematic analysis (e.g., code generation, refinement, merging into themes) was manually performed by the authors. The scripts in this package do not automate this process but serve to summarize and visualize the results by:
- Counting codes and themes
- Summarizing thematic distributions
- Generating network visualizations
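To illustrate what counting codes and themes involves, the sketch below tallies manually assigned codes in a toy data frame using base R. The data, the column names (`respondent`, `theme`, `code`), and the function names are hypothetical and do not reflect the actual layout of the files in datasets/axial_analysis/.

```r
# Toy coded responses; the values and column names are invented for illustration.
coded <- data.frame(
  respondent = c("p1", "p1", "p2", "p3", "p3"),
  theme = c("Testing", "Build duration", "Testing", "Testing", "Build duration"),
  code = c("data tests", "long training", "data tests", "flaky tests", "long training"),
  stringsAsFactors = FALSE
)

# Count how often each code occurs within each theme,
# dropping theme/code combinations that never occur.
count_codes <- function(df) {
  counts <- as.data.frame(table(theme = df$theme, code = df$code),
                          stringsAsFactors = FALSE, responseName = "n")
  counts <- counts[counts$n > 0, ]
  counts[order(counts$theme, -counts$n), ]
}

# Count the number of coded segments per theme.
count_themes <- function(df) {
  as.data.frame(table(theme = df$theme),
                stringsAsFactors = FALSE, responseName = "n")
}

code_counts <- count_codes(coded)
theme_counts <- count_themes(coded)
```

A theme-to-code table of this shape could also serve as an edge list for a network plot, although how the package's own scripts build their networks is not shown here.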
4. Run Survey Response Analysis
Analyze specific survey results. For instance:
source("r_scripts/RQ3-3-analysis-of-acceptable-test-coverage-in-ml-projects.R")
Contact and Citation
If you use this package, please cite the associated paper:
Bernardo, João Helis, et al. "Continuous Integration Practices in Machine Learning Projects: The Practitioners’ Perspective". Under review at Empirical Software Engineering, 2025.
For questions or issues, contact João Helis at joaohelis.bernardo@gmail.com.
This replication package is intended to support the transparency and reproducibility of the study by providing the data and scripts needed for independent verification and further research.
[1] Bernardo, João Helis, et al. "How do machine learning projects use continuous integration practices? An empirical study on GitHub Actions." Proceedings of the 21st International Conference on Mining Software Repositories. 2024.
Files
replication-package-ci4ml-survey.zip (14.5 MB)
md5:58db96509a9210229f0c81191a534538
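After downloading the archive, its integrity can be checked against the published md5 checksum (58db96509a9210229f0c81191a534538). The sketch below uses `tools::md5sum` from base R; the helper name `verify_md5` is hypothetical, and the file path in the usage comment assumes the archive sits in the current working directory.

```r
# Checksum published alongside the archive.
expected_md5 <- "58db96509a9210229f0c81191a534538"

# Return TRUE if the md5 checksum of the file at `path` matches `expected`.
verify_md5 <- function(path, expected) {
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Usage (not run here):
# verify_md5("replication-package-ci4ml-survey.zip", expected_md5)
```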
Additional details
Dates
- Available: 2025-02-20