Published June 7, 2025
| Version v1
Model
Open
Security issue reporting in NPM packages (USENIX Sec'25)
Description
"I wasn’t sure if this is indeed a security risk": Data-driven Understanding of Security Issue Reporting in GitHub Repositories of Open Source npm Packages
Code, models, and data for the USENIX Security 2025 (Cycle 2) paper (#1361)
1. This repository contains the "SecurityIssueClassification" and "UserInteractionThemesClassification" folders. The "SecurityIssueClassification" folder contains the inference code and models used to classify GitHub issues as "security-related" or "non security-related". The "UserInteractionThemesClassification" folder contains the inference code and models used to identify themes in user-maintainer interaction (for resolution of issues).
2. This repository contains (in the "Dataset" folder) instructions for procuring the full dataset for academic research purposes. The full dataset has two parts: (i) issues collected from GitHub repositories of ~45K+ npm packages and (ii) comments, events, and other metadata created in the course of discussion and resolution of security-related issues.
Our dataset was collected from publicly available GitHub repositories. While this data is technically public, aggregating and sharing it more widely, especially on a permanent platform, might have privacy implications. This is particularly true if the dataset were accessed or used for purposes other than academic research. We therefore provide the dataset only after the recipient confirms that they will use the data only for academic research purposes, will not reshare the data, and have received approval from their institution to study the data.
For more details on how to access the data, please see the "Dataset" folder.
Dataset (i), the issues collected from GitHub repositories of ~45K+ npm packages, contains the following fields:
- url: The URL of the issue reported in GitHub
- issue title & body: The title and body of the issue reported in GitHub
- tags: The set of tags associated with the issue (e.g., "security", "bug")
- predicted label by our ML model: The label predicted by our ML model, indicating whether the issue is security-related or not
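As a minimal sketch of how a record from dataset (i) might be represented once obtained, the Python below models the four fields listed above. The attribute names, the example URLs, and the in-memory list are assumptions made for illustration; the actual file format and column names are defined by the materials in the "Dataset" folder and may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IssueRecord:
    """One issue from dataset (i). Attribute names are hypothetical."""
    url: str                      # URL of the issue on GitHub
    title: str                    # issue title
    body: str                     # issue body
    tags: List[str] = field(default_factory=list)  # e.g. "security", "bug"
    predicted_label: str = ""     # "security-related" or "non security-related"


def security_related(records: List[IssueRecord]) -> List[IssueRecord]:
    """Keep only the issues the classifier labeled as security-related."""
    return [r for r in records if r.predicted_label == "security-related"]


# Hypothetical records, for illustration only (not taken from the dataset).
records = [
    IssueRecord(
        url="https://github.com/example/pkg-a/issues/1",
        title="Possible prototype pollution in merge()",
        body="I wasn't sure if this is indeed a security risk...",
        tags=["security"],
        predicted_label="security-related",
    ),
    IssueRecord(
        url="https://github.com/example/pkg-b/issues/2",
        title="Typo in README",
        body="Small spelling fix.",
        tags=["bug"],
        predicted_label="non security-related",
    ),
]

for record in security_related(records):
    print(record.url, record.tags)
```

Filtering on the predicted label in this way is one plausible first step toward selecting the issues whose discussion threads make up dataset (ii).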
Dataset (ii), the comments, events, and other metadata created in the course of discussion and resolution of security-related issues, contains the following fields:
- url: The URL of the issue reported in GitHub
- issue title & body: The title and body of the issue reported in GitHub
- comments body: Set of all comments made by users and maintainers in the course of discussion and resolution of the issue, across the entire timeline of the issue
- events captured in the timeline: Set of events captured in the timeline of the issue (e.g., issue merged, issue closed, etc.)
- predicted label by our ML model: The label predicted by our ML model, indicating the theme of user-maintainer interaction (e.g., DISCUSSION; Acknowledged - Spoke against the issue)
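The fields of dataset (ii) can be sketched in the same way, here with a small helper that tallies how often each interaction theme occurs. The attribute names, event strings, URLs, and the placeholder theme are assumptions for illustration; only the first theme string is copied verbatim from the example in the field list above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

# Theme string copied verbatim from the example in the description.
THEME_EXAMPLE = "DISCUSSION; Acknowledged - Spoke against the issue"


@dataclass
class InteractionRecord:
    """One security-related issue from dataset (ii). Names are hypothetical."""
    url: str
    title: str
    body: str
    comments: List[str] = field(default_factory=list)  # all comment bodies
    events: List[str] = field(default_factory=list)    # timeline events
    predicted_theme: str = ""                          # theme from the model


def theme_counts(records: List[InteractionRecord]) -> Counter:
    """Count how many issues fall under each predicted interaction theme."""
    return Counter(r.predicted_theme for r in records)


# Hypothetical records, for illustration only (not taken from the dataset).
sample = [
    InteractionRecord("https://github.com/example/a/issues/1", "t1", "b1",
                      comments=["Not exploitable in practice."],
                      events=["closed"],
                      predicted_theme=THEME_EXAMPLE),
    InteractionRecord("https://github.com/example/b/issues/2", "t2", "b2",
                      comments=["This is expected behaviour."],
                      events=["labeled", "closed"],
                      predicted_theme=THEME_EXAMPLE),
    InteractionRecord("https://github.com/example/c/issues/3", "t3", "b3",
                      predicted_theme="PLACEHOLDER_THEME"),  # made-up label
]

for theme, count in theme_counts(sample).most_common():
    print(count, theme)
```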
------------------------------------------
Folder Description
------------------------------------------
./Dataset -> Contains instructions for procuring the full dataset for academic research purposes. The dataset includes issues collected from GitHub across ~45K+ npm repositories and the comments, events, etc. involved in user-maintainer interaction on security-related issues.
./SecurityIssueClassification -> Contains model weights and inference code for the "Roberta-base" model used for classifying issues as "security-related" or "non security-related"
./UserInteractionThemesClassification -> Contains model weights and inference code for the "Roberta-base" model used for identifying themes in user-maintainer interactions (for issue resolution)
Release date: 7th June, 2025
Files (864.5 MB)
| Name | Size | md5 |
|---|---|---|
| Usenix'25_Cycle2_Paper#1361_Artifact.zip | 864.5 MB | f636c055bd4b80e92259b602ec705f23 |
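After downloading, the archive can be checked against the md5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the local path in the usage comment are assumptions, and the file is read in chunks so the 864.5 MB archive is not loaded into memory at once.

```python
import hashlib

# Checksum as listed for the archive on this record.
EXPECTED_MD5 = "f636c055bd4b80e92259b602ec705f23"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's md5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (the path is hypothetical; adjust to where the archive was saved):
# print(verify("Usenix'25_Cycle2_Paper#1361_Artifact.zip"))
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the archive again is the simplest remedy.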