Stack Overflow Meets Replication: Security Research Amid Evolving Code Snippets
Description
This artifact contains code and data needed to reproduce and replicate the major claims discussed in the paper:
"Stack Overflow Meets Replication: Security Research Amid Evolving Code Snippets".
This artifact was evaluated in the USENIX Security 2025 Artifact Evaluation and was awarded all three badges: Artifacts Available, Artifacts Functional, and Results Reproduced.
Below we outline the steps required to reproduce the major claims in the paper.
Prerequisites
Virtual Environment
We exported the dependencies for this artifact using pip:
pip list --format=freeze > requirements.txt
To reproduce the major findings in our study, create a Python virtual environment based on the provided requirements.txt file: create a new environment, activate it, and install the required packages from requirements.txt.
Example: create a new conda environment named usenix-ae and install the required packages:
conda create -n usenix-ae python=3.12
conda activate usenix-ae
pip install -r requirements.txt
This prerequisite step is required to run the Jupyter notebooks and Python scripts provided in this artifact and is thus a core requirement for the replication of the claims in the paper.
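For example, a minimal sketch for launching the notebooks from the activated environment (this assumes Jupyter is pulled in via requirements.txt; if it is not, install it with pip install jupyter first):
conda activate usenix-ae
jupyter notebook
# or run a single notebook headlessly, e.g., the PCS.ipynb notebook used in Experiment E2 below:
# jupyter nbconvert --to notebook --execute PCS.ipynb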
MySQL Server Docker Container
This sets up a Docker container with a MySQL server that holds the database (and all tables) needed to replicate the claims presented in the paper.
This prerequisite is not required to run all the scripts in this artifact, but it is necessary to reproduce the claims that require querying the database.
Those claims are C1, C7, C13, C14, C15, and C16.
- Build the Docker image:
docker build -t paper403 .
- Run the container (in detached mode):
docker run --name mysql-server -d -p 3306:3306 paper403
- Access the container and log in with the credentials (username: sotorrent, password: sotorrent):
docker exec -it mysql-server mysql -u sotorrent -p
- List the databases to confirm that the sotorrent22 database is created:
show databases;
If you see the following output, then the DB server setup is successful:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| performance_schema |
| sotorrent22        |
+--------------------+
3 rows in set (0.01 sec)
- Import the DB files into the sotorrent22 database. The files are available in the mysql-docker/dump directory. Execute the import_tables.sh script to import the tables required to verify claims C1, C7, C13, C14, C15, and C16:
sh import_tables.sh
The script above should import eight tables into the sotorrent22 database: Posts, PostBlockVersion, Users, CppCheckWeakness, CrawledPapers, DicosResults, Violations, and PostReferenceGH.
The import_tables.sh script does not import the PostHistory table, since this table is not required to verify the claims made in the paper. However, if you wish to replicate the findings in our paper (i.e., run the scripts in the sources/6_replication_case_studies/case_study_{1,2,3,4}/replication packages), then the PostHistory.sql dump file should be imported into the sotorrent22 database. You can do this by uncommenting the following line in the script:
# Optional if reproducing the claims in the paper, required if replicating our study.
#docker exec -i mysql-server mysql -usotorrent -psotorrent sotorrent22 < dump/PostHistory.sql
After the import is complete, verify that all tables have been successfully imported by running the following SQL command:
use sotorrent22;
show tables;
The output of the show tables MySQL command should appear as follows:
+-----------------------+
| Tables_in_sotorrent22 |
+-----------------------+
| Posts                 |
| PostBlockVersion      |
| PostHistory           |
| Users                 |
| CppCheckWeakness      |
| CrawledPapers         |
| DicosResults          |
| Violations            |
| PostReferenceGH       |
+-----------------------+
This completes the containerization setup needed to reproduce the claims in the paper.
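As an optional additional sanity check, you can query row counts directly through the container. A minimal sketch using the default credentials from above (expected counts are not listed in this README, so treat the output only as confirmation that the table is populated):
docker exec -it mysql-server mysql -usotorrent -psotorrent -e "SELECT COUNT(*) FROM sotorrent22.Posts;"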
NOTE: The Posts.sql, PostHistory.sql, and PostBlockVersion.sql dump files are extremely large, about 71 GB, 148 GB, and 123 GB respectively, and may take a while to import. The Users.sql dump file is 3.3 GB, while the other dump files are small and should import quickly.
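Given these sizes, you may want to persist the database outside the container so that removing or re-creating the container does not require re-importing the dumps. A sketch, under the assumption that the paper403 image follows the official MySQL image convention of storing its data under /var/lib/mysql:
docker run --name mysql-server -d -p 3306:3306 -v sotorrent-data:/var/lib/mysql paper403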
The SQL tables were exported using the following command (example using the Posts table):
export packet_size="512M"
export sotorrent_db=sotorrent22
export sotorrent_password=sotorrent
export table_name=Posts
export output_file=Posts.sql
mysqldump --max_allowed_packet=$packet_size -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db $table_name -r $output_file
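Conversely, if you only need a subset of the tables, a single dump can be imported manually using the same pattern as the (commented-out) PostHistory line in import_tables.sh, e.g., for the Users table (file name assumed to match the dumps shipped in mysql-docker/dump):
docker exec -i mysql-server mysql -usotorrent -psotorrent sotorrent22 < dump/Users.sql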
Suggestion By Reviewer C on Slow Import
A reviewer at USENIX'25 AEC suggested the following configuration changes for faster import of the large files.
"I have added/modified the following configurations in the MySQL server and import became much faster (still waiting on bigger files though):"
innodb_buffer_pool_size = 12G
innodb_log_file_size = 1G
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2 # this seems to be removing the bottleneck
innodb_doublewrite = 0
innodb_autoinc_lock_mode = 2
"With innodb_flush_log_at_trx_commit = 2
that seems to be the key here (should be reverted in practice after import due to its implications for data loss, etc.)"
NOTE: Use the suggestion above at your own discretion. The default configurations are used in this artifact.
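If you do want to try these settings, one possible way (not part of the artifact's default setup) is to pass them to mysqld when starting the container. This sketch assumes the paper403 image is based on the official MySQL image, whose entrypoint forwards extra arguments to mysqld, and that the variable names are valid for your MySQL version:
# remove the existing container first, if one is running (this discards its data unless a volume is used)
docker rm -f mysql-server
docker run --name mysql-server -d -p 3306:3306 paper403 \
  --innodb-buffer-pool-size=12G \
  --innodb-log-file-size=1G \
  --innodb-flush-method=O_DIRECT \
  --innodb-flush-log-at-trx-commit=2 \
  --innodb-doublewrite=0 \
  --innodb-autoinc-lock-mode=2
As noted in the reviewer's comment, innodb_flush_log_at_trx_commit (and innodb_doublewrite) should be reverted to their defaults after the import.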
DB Schema
(Figure) Entity-relationship diagram illustrating the relationships between key tables from the Stack Exchange dataset (shown in blue) and the SOTorrent dataset (shown in red). Note that the diagram contains additional tables that are not imported above, since they are not required for reproducing the claims in the paper.
Replication
Below we describe the major claims and how to replicate them using the provided scripts.
This artifact went through the artifact evaluation at the 34th USENIX Security Symposium (USENIX Sec'25) and was awarded all three badges (i.e., Artifacts Available, Artifacts Functional, and Results Reproduced) by all four reviewers.
However, the reviewers noted slight mismatches in the values observed for the following claims:
- Claim C9: The DICOS.ipynb notebook returned a slightly different number of insecure posts (expected: 30,343 vs. actual: 30,359).
- Claim C18: The main.ipynb notebook reports a different count for no-dupe-keys under Parse Error violations (observed: 874 vs. expected: 2,070).
These mismatches in the results were corrected in the extended version of the paper on arXiv.
Major Claims
- C1: Our literature search yielded 42 relevant studies. This is proven by the experiment (E1) described in Section 3.2 whose results are reported in Table 1.
- C2: Certain programming languages have exhibited distinct trends regarding the overall number of added snippets and the ratio of security-relevant edits, with languages like C/C++ being relatively stable, while others like JavaScript and Python have fluctuated over time. This claim is demonstrated by the monthly trends in the addition of new code snippets shown in Figure 3 and the analysis of post edits in Figure 4 of Section 5.
- C3: The ratio of security-relevant comments on Stack Overflow has steadily increased, suggesting an ongoing community effort to improve the quality of shared content. This claim is backed by the data shown in Figure 5, highlighting the increasing percentage of security-relevant commits and comments over time.
- C4 (page 8): Code snippets with weaknesses (i.e., Code_w) increased (11,748 → 30,254), indicating a growth rate of 157.5%.
- C5 (page 8): The authors identified 12,998 CWE instances in the latest versions of 7,481 answers, whereas we found 7,679 instances in 5,721 answers.
- C6 (page 8): We noticed a shift in the ranking of CWE types: CWE-758 climbed from 6th to 2nd; CWE-401 dropped from 2nd to 3rd; CWE-775 fell from 3rd to 7th.
- C7 (page 8): As revisions increased from 1 to 3 or more, the proportion of improved Code_w only rose from 3.1% to 7.4%. This claim is shown on the right-hand side of Table 7 in Appendix A.
- C8 (page 8 and page 9): We found that the fraction of users with vulnerable C/C++ snippets more than doubled compared to the original findings. Moreover, the number of users that contributed just one vulnerable snippet also increased. Further, the authors reported that users who contributed multiple vulnerable snippet versions repeatedly contributed the same CWE type. We found that these users contribute different types of CWE with the same likelihood.
- C9 (page 10): Using the same approach on our dataset of 1,046,052 posts, we discovered 30,359 (2.9%) insecure posts, i.e., an increase of 52%. Among these, 4,887 (16.1%) insecure posts contained all three features, i.e., an increase of 155%, while the remaining 25,472 posts contained two features.
- C10 (page 10): We found that Dicos had an 11% precision (compared to the authors' 91%), 32% accuracy (versus 89%), and an 87% recall (versus 93%). These results are shown in Table 4. This low precision and accuracy indicate that the ability of Dicos to detect insecure code snippets has decreased significantly.
- C11 (page 11): Using the newer version of the SOTorrent dataset, we observed a higher ratio of insecure posts between accepted (7.72%) and non-accepted (6.61%) answers. Figure 10 in Appendix B compares the original and replication results.
- C12 (page 11): The types of security weaknesses detected in C/C++ code snippets have changed over time, with memory leaks becoming the most prevalent, replacing undefined behavior as the dominant issue. This claim is supported by the comparison of the original and replication results for RQ3, as seen in Section 6.2.3. Figure 11 in Appendix B compares the original and replication results.
- C13 (page 12): After applying the authors' filtering criteria, we obtained 12,095 questions containing 72,202 answers, of which 10,140 were accepted answers. This means the number of code snippets matching the authors' filtering criteria has dropped since 2018: 529,054 → 239,575.
- C14 (page 12): Our findings regarding the number of questions with at least one insecure answer differ significantly: 18.1% (out of 10,861) dropped to 4.9% (out of 12,095). Similarly, the percentage of accepted answers containing at least one insecure snippet also decreased: 9.8% (out of 7,444) → 2.2% (out of 10,139).
- C15 (page 12): Although the vulnerability rankings from the original study remain unchanged, we observed a shift in the number of affected snippets: code injection increased (2,319 → 5,734), while insecure cipher (564 → 356), insecure connection (624 → 276), and data serialization (153 → 140) all dropped. Like the original study, no snippets were impacted by XSS vulnerabilities.
- C16 (page 12): Like the original study, we also found no significant difference between the two user groups. However, we observed different p-value (0.9 → 6.2) and Cliff's delta (0.01 → 0.03) values.
- C17 (page 13): The number of violations in JavaScript code snippets increased from 5,587,357 to 7,385,044, with the average violations per snippet rising from 11.94 to 28.8. This claim is supported by the results in Section 6.4.3 (Revisiting RQ1 Findings).
- C18 (page 13): Due to Stack Overflow’s evolution, the number of violations in each category has risen. Further, the ranking of violations within certain categories has changed. This claim is supported by the results in Section 6.4.3 (Revisiting RQ2 Findings).
Experiments
The following experiments are provided to reproduce the claims described above:
- E1 (Table 1, Section 3.2): The goal of this experiment is to reproduce C1, which pertains to the number of research studies identified through our systematic literature review process outlined in Figure 2. The relevant studies discovered are presented in Table 1, and the aim of this experiment is to replicate the studies listed in Table 1.
Preparation: Ensure that the MySQL Server Docker container is running and that you can connect to the server as outlined in the prerequisites above. You will also need access to an OpenAI API key. The script requires the API key and the DB_HOST to be set in the environment where it is executed. For example, when running the script from the terminal, the OPENAI_API_KEY should be configured in the environment (see the sketch after this list). If the default database user and password created during the creation of the Docker container are not changed, the script uses those credentials to connect to the DB server on localhost. Otherwise, configure the DB_HOST environment variable with the server's IP address.
Execution: Run python relevant_studies.py
Results: The script will display all the relevant studies described in Table 1.
- E2 (Section 5): The goal of this experiment is to reproduce claims C2 and C3, which pertain to statistics about the evolution of code and its surrounding context (i.e., comments and commit messages) on Stack Overflow. The claims and the statistics supporting them are described in Section 5.
Preparation: Ensure that the virtual environment is activated and contains all the packages listed in the requirements.txt file. Also, ensure that the data directory is in its proper location and contains all necessary files.
Execution: Run all the code cells in the PCS.ipynb and Evolution.ipynb Jupyter notebooks, which are found in the 5_evolution_of_stackoverflow folder. The notebooks include direct quotes from the paper in markdown cells, with the corresponding code to reproduce each quote provided right after it.
Results: Each code cell in the notebooks, once executed, will reproduce claims C2 and C3 described in Section 5 of the paper.
- E3 (Section 6.1): The goal of this experiment is to demonstrate how claims C4, C5, C6, C7, and C8 described in Section 6.1 can be reproduced.
Preparation: Ensure that the virtual environment is activated and contains all the packages listed in the requirements.txt file. The table_7.py script connects to and queries the sotorrent22 database running in the MySQL Server Docker container.
Execution: Execute the following scripts to reproduce the findings for each claim.
- C4: Execute the main() function inside the table_2.ipynb Jupyter notebook.
- C5, C6: Execute all code cells inside the RQ1_RQ2.ipynb Jupyter notebook.
- C7: Run python table_7.py
- C8: Study and execute all code cells of the Zhang.ipynb and Zhang_InstanceAdded.ipynb Jupyter notebooks. The notebooks include direct quotes from the paper, with the corresponding code to reproduce each quote provided right after it.
Results: Executing the scripts described above will reproduce claims C4, C5, C6, C7, and C8.
- E4 (Section 6.2): The goal of this experiment is to demonstrate how claims C9, C10, C11, and C12 described in Section 6.2 can be reproduced.
Preparation: Ensure that the virtual environment is activated and contains all the packages listed in the requirements.txt file. The results for claim C10 were obtained through manual labeling conducted by two researchers. The data/accuracy_measurements.csv file includes multiple tabs, each containing the coding label results for one task; the Discovery Accuracy tab includes the results for the precision, accuracy, and recall calculations.
Execution:
- C9, C11, C12: Run all the code cells in the main.ipynb Jupyter notebook, which is found in the case_study_2/analysis directory. The notebook includes direct quotes from the paper in markdown cells, with the corresponding code to reproduce each quote provided right after it.
- C10: There is no Python script to run to verify this claim. Instead, the results of this experiment can be found in the Discovery Accuracy tab of the data/accuracy_measurements.csv file.
Results: Running the notebook reproduces the findings for claims C9, C11, and C12; the C10 results are documented in the CSV file described above.
- E5 (Section 6.3): The goal of this experiment is to demonstrate how claims C13, C14, C15, and C16, described on page 12 in Section 6.3, can be reproduced.
Preparation: This experiment uses scripts that require DB server credentials to connect to and query the sotorrent22 database. The scripts expect the environment variables DB_USER and DB_PASSWORD to be set, containing the database username and password (see the sketch after this list). These variables are used by the scripts to connect to the MySQL Server Docker container, with the default host set to localhost (127.0.0.1). All related scripts for this experiment are located in the directory sources/6_replication_case_studies/case_study_3/analysis.
Execution: Execute the following scripts to reproduce the findings for each claim.
- C13: python general_stats.py
- C14: python revisiting_RQ1_findings.py insecure_post_stats
- C15: python revisiting_RQ1_findings.py rankings
- C16: python revisiting_RQ2_findings.py
Results: Each script, once executed, will reproduce the findings for the corresponding claim.
- E6 (Section 6.4): The goal of this experiment is to reproduce claims C17 and C18, described on page 13 in Section 6.4.
Preparation: Ensure that the virtual environment is activated and contains all the packages listed in the requirements.txt file. This experiment uses the violations.feather and rule_categories.feather files located in the data/feather_files directory.
Execution: Run all the code cells in the main.ipynb Jupyter notebook, which is found in the case_study_4/analysis directory. The notebook includes direct quotes from the paper in markdown cells, with the corresponding code to reproduce each quote provided right after it.
Results: Each code cell in the notebook, once executed, will reproduce claims C17 and C18.
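As referenced in the preparation steps of E1 and E5, the scripts read their database and API configuration from environment variables. Below is a minimal shell sketch of setting these before running the scripts; the variable names come from the preparation notes above, the API key is a placeholder you must replace with your own, and the credentials are the defaults created by the Docker setup (adjust them if you changed them):
export OPENAI_API_KEY="<your-openai-api-key>"   # required by relevant_studies.py (E1)
export DB_HOST="127.0.0.1"                      # only needed if the MySQL server is not on localhost (E1)
export DB_USER="sotorrent"                      # default credentials from the Docker setup (E5)
export DB_PASSWORD="sotorrent"
python relevant_studies.py                      # E1
python general_stats.py                         # E5, claim C13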
Code Structure
The sources contained in this artifact are structured as follows:
|- 3_1_literature_search
|--- analysis
|--- replication
|- 5_evolution_of_stackoverflow
|- 6_replication_case_studies
|--- case_study_1
|----- analysis
|----- replication
|--- case_study_2
|----- analysis
|----- data
|----- replication
|--- case_study_3
|----- analysis
|--- case_study_4
|----- analysis
|----- replication
- 3_1_literature_search: This directory contains everything related to Experiment E1, which involves reproducing Claim C1. It is focused on the literature search, specifically on the research studies identified through a systematic literature review.
  - analysis: Contains the code and scripts used to analyze the relevant studies identified in the literature review.
  - replication: Includes the scripts and files necessary to populate the CrawledPapers table needed to reproduce Claim C1.
- 5_evolution_of_stackoverflow: This directory is related to Experiment E2, which involves reproducing Claims C2 and C3. These claims describe the evolution of code and its surrounding context (comments, commit messages) on Stack Overflow over time. The directory contains Jupyter notebooks that reproduce the trends of code snippets and security-relevant content over time, as discussed in Section 5.
- 6_replication_case_studies: This directory houses the various case studies (Experiments E3 to E6) that focus on reproducing several claims (C4 to C18). It is divided into subdirectories for each case study.
  - case_study_1:
    - analysis: Contains the code and scripts used for analyzing data related to Claims C4, C5, C6, C7, and C8.
    - replication: Contains the scripts used to produce the data needed to reproduce Claims C4, C5, C6, C7, and C8.
  - case_study_2:
    - analysis: Contains the scripts used for analyzing data related to Claims C9, C11, and C12.
    - data: Contains the data files used in the analysis of Claims C9, C11, and C12.
  - case_study_3:
    - analysis: Holds the analysis code related to Claims C13, C14, C15, and C16.
  - case_study_4:
    - analysis: Contains the code related to the analysis of Claims C17 and C18.
    - replication: Holds the code needed to generate the data to reproduce Claims C17 and C18.
Summary of Structure:
- 3_1_literature_search: Handles Experiment E1 and Claim C1, related to the literature review.
- 5_evolution_of_stackoverflow: Handles Experiment E2 and the reproduction of Claims C2 and C3, related to Stack Overflow trends.
- 6_replication_case_studies: The main directory for all subsequent case studies, each corresponding to a set of claims. The case studies are broken into subdirectories for the analysis and replication of claims ranging from C4 to C18.
Source Code Repository
The source code for this artifact is publicly available on the CISPA GitLab server at the following link: https://projects.cispa.saarland/alfusainey.jallow/usenix2025-artifact.
If you encounter any issues while using the artifact, please feel free to create an issue ticket directly in the repository. We will review and address any concerns as quickly as we can. Your feedback is invaluable for improving the artifact, and we appreciate your assistance in making sure everything works smoothly.
Citation
If you use this artifact in your research, please cite the following paper:
@inproceedings{jallow:usenix:25,
author = {Alfusainey Jallow and Sven Bugiel},
booktitle = {34th USENIX Security Symposium (USENIX Sec'25)},
publisher = {USENIX Association},
title = {Stack Overflow Meets Replication: Security Research Amid Evolving Code Snippets},
year = {2025}
}
Files
The artifact is distributed as a single download of 101.8 GB (md5:745762e29f321f89bf047ff4797fa6b2).
Additional details
Software
- Repository URL: https://projects.cispa.saarland/alfusainey.jallow/usenix2025-artifact.git
- Programming language: Python
- Development Status: Active