Workflow Analysis of Data Science Code in Public GitHub Repositories
Authors/Creators
- 1. University of Zurich
Description
This record contains the supplementary files for a scientific article. It includes a dataset (f_DASWOW.pkl) of data science code snippets extracted from Jupyter notebook cells, each annotated with the data science step it performs. The repository also includes a set of analyses aimed at understanding the data science implementation life cycle.
The DASWOW dataset file (f_DASWOW.pkl) is located in the `data-science-code-analysis/features/` directory. The exact filenames of the notebooks used are included in the dataset under the `filename` column.
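As a minimal sketch of how the dataset file might be opened, the snippet below unpickles it and counts annotated snippets per notebook via the `filename` column. The record does not state what type of object the pickle holds; the sketch assumes it supports `data["filename"]` lookup, as a pandas DataFrame would (in that case pandas must be installed, and `pandas.read_pickle` is an equivalent alternative). The helper names `load_daswow` and `snippets_per_notebook` are hypothetical and not part of the repository.

```python
import pickle
from collections import Counter
from pathlib import Path

# Path taken from the description above, relative to the extracted archive.
# Adjust it to wherever the zip file was unpacked.
DASWOW_PATH = Path("data-science-code-analysis/features/f_DASWOW.pkl")


def load_daswow(path=DASWOW_PATH):
    """Unpickle the dataset file (hypothetical helper).

    Unpickling can execute arbitrary code, so only load files
    obtained from a source you trust.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def snippets_per_notebook(filenames):
    """Count how many annotated snippets come from each notebook.

    `filenames` is any iterable of notebook filenames, for example
    the `filename` column of the loaded dataset.
    """
    return Counter(filenames)
```

Assuming the loaded object behaves like a DataFrame, `snippets_per_notebook(load_daswow()["filename"])` would give the number of snippets drawn from each of the notebooks.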
The 470 Jupyter notebooks used in the DASWOW dataset (curated from the original collection of ~1M notebooks by Rule et al.) can be accessed at https://doi.org/10.5281/zenodo.17638924 (DASWOW Jupyter Notebooks Subset).
*Version v.0.2.0 contains additional analyses, such as the anti-pattern analysis.
Files
| Name | Size |
|---|---|
| data-science-code-analysis.zip (md5:dcfb2f3ad5b2c6a8a2e0649f0a9a286f) | 8.7 MB |
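The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_record` are hypothetical, and the local path of the downloaded zip file is whatever the reader chose when saving it.

```python
import hashlib

# Checksum listed for data-science-code-analysis.zip in this record.
EXPECTED_MD5 = "dcfb2f3ad5b2c6a8a2e0649f0a9a286f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_record("data-science-code-analysis.zip")` should return `True` for an uncorrupted copy of the archive saved under that name in the current directory.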
Additional details
Funding
- Swiss National Science Foundation, grant PP00P2_170529: Data-driven Contemporary Code Review
- Swiss National Science Foundation, grant 200020_184994: CrowdAlytics: Large-Scale Human-Machine Systems for Data Science