Published November 1, 2021 | Version v0.2.0
Journal article Open

Workflow Analysis of Data Science Code in Public GitHub Repositories

  • 1. University of Zurich

Description

This record contains the supplementary files for a scientific article. It includes a dataset (f_DASWOW.pkl) of data science code snippets extracted from Jupyter notebook cells, each annotated with the data science step it performs. The repository also includes a set of analyses aimed at understanding the data science implementation life cycle.
The DASWOW dataset file (f_DASWOW.pkl) is located in the `data-science-code-analysis/features/` directory. The exact filenames of the notebooks used are included in the dataset under the `filename` column. 
The 470 Jupyter notebooks used in the DASWOW dataset (curated from the original collection of ~1M notebooks by Rule et al.) can be accessed at https://doi.org/10.5281/zenodo.17638924 (DASWOW Jupyter Notebooks Subset).
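
As a minimal sketch of how the dataset file might be opened, the snippet below loads f_DASWOW.pkl from the directory named above and lists the notebook names stored in the `filename` column. It assumes the pickle holds a pandas DataFrame, which the record does not state explicitly; the helper names `load_daswow` and `notebook_filenames` are hypothetical and not part of the archive.

```python
from pathlib import Path

import pandas as pd

# Location given in the record description, relative to the unzipped archive.
DASWOW_PATH = Path("data-science-code-analysis") / "features" / "f_DASWOW.pkl"


def load_daswow(path=DASWOW_PATH):
    """Load the DASWOW pickle (assumption: it contains a pandas DataFrame)."""
    # Unpickling can execute arbitrary code, so only load files you trust.
    df = pd.read_pickle(path)
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"expected a DataFrame, got {type(df).__name__}")
    return df


def notebook_filenames(df):
    """Return the sorted, de-duplicated notebook names from the `filename` column."""
    return sorted(df["filename"].dropna().unique())


if __name__ == "__main__":
    # Runs only if the archive has been unzipped into the working directory.
    if DASWOW_PATH.exists():
        daswow = load_daswow()
        names = notebook_filenames(daswow)
        print(f"{len(daswow)} annotated snippets from {len(names)} notebooks")
```

The `filename` values can then be matched against the files in the DASWOW Jupyter Notebooks Subset to locate the notebook each snippet came from.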

*Version v0.2.0 adds further analyses, such as the anti-pattern analysis.

Files

data-science-code-analysis.zip

Files (8.7 MB)

Name: data-science-code-analysis.zip
Size: 8.7 MB
md5:dcfb2f3ad5b2c6a8a2e0649f0a9a286f

Additional details

Funding

Swiss National Science Foundation
Data-driven Contemporary Code Review PP00P2_170529
Swiss National Science Foundation
CrowdAlytics: Large-Scale Human-Machine Systems for Data Science 200020_184994