Replication Package for "Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects"
Description
This replication package contains the data, scripts, and notebooks for the following paper:
Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2026. Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects. In 23rd International Conference on Mining Software Repositories (MSR ’26), April 13–14, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3793302.3793349
Organization
- The `scripts/` folder contains all the scripts we used to collect data, run propensity score matching, and collect SonarQube metrics. These scripts are volatile in nature: rerunning them will likely not reproduce the exact dataset we obtained for our paper (GitHub repositories grow and are taken down, the GitHub API changes, SonarQube changes, etc.), but they may be a valuable reference.
- The `data/` folder contains the exact dataset we used for our paper:
  - `repos.csv`: Metadata for treatment group repositories (Cursor-adopting projects)
  - `cursor_commits.csv`: Commit-level data showing commits modifying Cursor configuration files
  - `cursor_files.csv`: File-level data for Cursor configuration files
  - `repo_events.csv` / `repo_events_control.csv`: GitHub event data for treatment/control repositories
  - `matching.csv`: Propensity score matching results (treatment-control pairs)
  - `panel_event_monthly.csv`: Main panel dataset used for DiD analysis (monthly aggregated)
  - `ts_repos_monthly.csv` / `ts_repos_control_monthly.csv`: Monthly time series for treatment/control groups
  - `repo_metrics.csv`: Additional repository-level metrics (stars, contributors, etc.)
  - `sonarqube_warnings.csv`: SonarQube static analysis warnings per repository (as available)
  - `sonarqube_warning_definitions.csv`: Definitions and categories of SonarQube warning types
  - `control_repo_candidates_*.csv`: Monthly snapshots of candidate control repositories
- The `notebooks/` folder contains R notebooks that read from `data/` and reproduce the results in the paper:
  - `DataCollection.md`: Dataset description
  - `PropensityScoreMatching.md`: Matching results
  - `DiffInDiffBorusyak.md`: Main DiD results
  - `DynamicPanel.md`: Main panel GMM results
  - `DiffInDiffAll.md`: Comparison of alternative DiD estimators
  - `AnalyzeSonarQubeWarnings.md`: Appendix analysis of SonarQube warnings
  - `NonCausalMethods.md`: Some interesting results replicating the same RQs with only descriptive and correlational methods
  - `DiffInDiffCallaway.md`, `DiffInDiffPosterFigures.md`, `DiffInDiffTWFE.md`: Older notebooks that have been deprecated
- The `plots/` folder contains all the plots in the paper.
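To illustrate how the matched pairs relate to the monthly panel, here is a minimal sketch in Python. The column names (`treatment_repo`, `control_repo`, `repo`, `month`, `commits`) and values below are illustrative stand-ins, not the actual schema of `matching.csv` or `panel_event_monthly.csv`:

```python
import io
import pandas as pd

# Synthetic stand-ins for data/matching.csv and data/panel_event_monthly.csv;
# the real files use their own column names and many more rows.
matching_csv = io.StringIO(
    "treatment_repo,control_repo\n"
    "org/alpha,org/beta\n"
)
panel_csv = io.StringIO(
    "repo,month,commits\n"
    "org/alpha,2024-01,42\n"
    "org/beta,2024-01,37\n"
)

matching = pd.read_csv(matching_csv)
panel = pd.read_csv(panel_csv)

# Flag each panel row as treatment or control based on the matched pairs.
panel["treated"] = panel["repo"].isin(matching["treatment_repo"])
print(panel)
```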
Development Environment
All results were obtained using R 4.3.3 for statistical analysis and visualization (DiD estimation, panel models, plotting), and Python 3.11.4 for data collection scripts (GitHub API, BigQuery, SonarQube). We list the packages and their versions below for future replication. Because the R ecosystem evolves constantly, different R and package versions may lead to slightly different results.
R Packages
| Package | Version | Description |
|---|---|---|
| tidyverse | 2.0.0 | Data manipulation and visualization |
| ggplot2 | 3.5.1 | Grammar of graphics plotting |
| data.table | 1.15.4 | Fast data manipulation |
| dplyr | 1.1.4 | Data manipulation verbs |
| tidyr | 1.3.1 | Data tidying |
| tibble | 3.2.1 | Modern data frames |
| lubridate | 1.9.3 | Date/time manipulation |
| scales | 1.3.0 | Scale functions for visualization |
| fixest | 0.12.1 | Fixed effects estimation |
| did | 2.1.2 | Callaway & Sant'Anna DiD estimator |
| didimputation | 0.3.0 | Borusyak et al. imputation estimator |
| plm | 2.6-4 | Panel data models (GMM) |
| bacondecomp | 0.1.1 | Bacon decomposition for TWFE |
| modelsummary | 2.1.1 | Model summary tables |
| knitr | 1.48 | Dynamic report generation |
| kableExtra | 1.4.0 | Enhanced table formatting |
| gridExtra | 2.3 | Arrange multiple plots |
| cowplot | 1.1.3 | Plot composition |
| corrplot | 0.94 | Correlation visualization |
| RColorBrewer | 1.1-3 | Color palettes |
| Cairo | 1.6-2 | High-quality graphics device |
| showtext | 0.9-7 | Custom fonts in plots |
| ggfx | 1.0.1 | Graphics effects for ggplot2 |
Python Packages
| Package | Version | Description |
|---|---|---|
| pandas | 2.2.0 | Data manipulation |
| numpy | 1.26.4 | Numerical computing |
| requests | 2.31.0 | HTTP requests |
| python-dotenv | 1.0.1 | Environment variable management |
| GitPython | 3.1.43 | Git repository interaction |
| PyGithub | 2.3.0 | GitHub API client |
| google-cloud-bigquery | 3.25.0 | BigQuery client for GH Archive |
| scikit-learn | 1.5.0 | Machine learning (propensity scores) |
| semver | 3.0.2 | Semantic versioning |
| node-semver | 0.9.0 | Node.js semver parsing |
| gql | 3.5.0 | GraphQL client |
| aiohttp | 3.9.5 | Async HTTP for GraphQL |
Replication Instructions
1. Obtain the Data
Clone this repository and download the full dataset from Zenodo:
```shell
git clone https://github.com/hehao98/CursorStudy.git
cd CursorStudy
# (Optional) Download data files from Zenodo and place them in the data/ folder
```
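After downloading `CursorStudy.zip` from Zenodo, you can verify its integrity against the MD5 checksum listed on the record. A minimal sketch (the file path is an assumption about where you saved the archive):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, streaming in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum on the Zenodo record, e.g.:
# assert md5sum("CursorStudy.zip") == "2fbe42436c8e692c4ad356c7cc1df22a"
```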
2. Set Up R Environment
Install R 4.3.3 and the required packages:
```r
install.packages(c(
  "tidyverse", "ggplot2", "data.table", "dplyr", "tidyr", "tibble",
  "lubridate", "scales", "fixest", "did", "didimputation", "plm",
  "bacondecomp", "modelsummary", "knitr", "kableExtra", "gridExtra",
  "cowplot", "corrplot", "RColorBrewer", "Cairo", "showtext", "ggfx"
))
```
3. Reproduce Results
Knit the notebooks in the following order (using RStudio or the R extension for VS Code):

- `notebooks/DataCollection.Rmd` — Overview of the dataset (Table 1, Figure 2)
- `notebooks/PropensityScoreMatching.Rmd` — Matching diagnostics (Appendix)
- `notebooks/DiffInDiffBorusyak.Rmd` — Main DiD results (Table 2, Figure 3, Figure 4)
- `notebooks/DynamicPanel.Rmd` — Panel GMM results (Table 3)
- `notebooks/DiffInDiffAll.Rmd` — Comparison of DiD estimators (Appendix)
- `notebooks/AnalyzeSonarQubeWarnings.Rmd` — SonarQube analysis (Appendix)
Each notebook reads from data/ and outputs tables/figures to plots/.
4. (Optional) Rerun Data Collection
To rerun the data collection scripts (not recommended for exact replication):
```shell
# Set up Python environment
python -m venv venv
source venv/bin/activate
pip install pandas numpy requests python-dotenv GitPython PyGithub google-cloud-bigquery scikit-learn semver node-semver gql aiohttp

# Configure API keys in a .env file
cp .env.example .env
# Edit .env with your GitHub token and BigQuery credentials

# Run scripts (see scripts/ for individual script usage)
```
Note: Data collection scripts interact with live APIs and will produce different results due to repository changes, API updates, or rate limiting.
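The propensity score matching step in `scripts/` uses scikit-learn; the exact covariates and code are in the repository, but the core idea can be sketched on synthetic data. Everything below (covariates, seed, greedy one-to-one matching) is illustrative, not the authors' exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic repository covariates (e.g., stars, contributors, age).
n = 200
X = rng.normal(size=(n, 3))
treated = rng.integers(0, 2, size=n)  # 1 = Cursor-adopting, 0 = candidate control

# 1. Fit a propensity model: P(treated | covariates).
model = LogisticRegression().fit(X, treated)
scores = model.predict_proba(X)[:, 1]

# 2. Greedy nearest-neighbor matching on the propensity score,
#    without replacement.
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
available = set(c_idx.tolist())
pairs = []
for i in t_idx:
    if not available:
        break
    # Pick the unmatched control with the closest propensity score.
    j = min(available, key=lambda k: abs(scores[k] - scores[i]))
    pairs.append((i, j))
    available.remove(j)

print(f"matched {len(pairs)} treatment-control pairs")
```

The real scripts also enforce eligibility criteria when selecting candidate controls (see `control_repo_candidates_*.csv`); this sketch only shows the score-and-match core.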
Files

- `CursorStudy.zip` (663.1 MB, md5: 2fbe42436c8e692c4ad356c7cc1df22a)

Additional details

Software

- Repository URL: https://github.com/hehao98/CursorStudy
- Programming languages: Python, R