Replication Package of "Eliciting Best Practices for Collaboration with Computational Notebooks"
Description
This is the replication package of the paper: "Eliciting Best Practices for Collaboration with Computational Notebooks"
In the following, we describe the contents of each file archived in this repository.
- `dataset.tar.bz2` contains:
  - the dataset of Jupyter notebooks we retrieved from Kaggle (`/cscw2021_dataset.tar.bz2`);
  - the notebook we used to filter the available Kaggle kernels based on our research criteria (`Meta_Kaggle_filtering/notebooks/Meta_Kaggle_filtering.ipynb`);
  - the specific version of the Meta Kaggle dataset (October 27, 2020) that we used to perform the filtering (`Meta_Kaggle_filtering/data/MetaKaggle 27-10-2020 (KT version)`).
- `notebook_analysis.tar.bz2` contains:
  - `cscw2021_db_dump.sql.tar.bz2`, a PostgreSQL dump of the database with all the data we extracted from the notebooks;
  - `notebook_analysis_scripts/`, the scripts by Pimentel et al. that we extended to analyze our dataset of notebooks (see the dedicated section in this README).
- `notebook_linting.tar.bz2` contains the Python modules we developed to check code quality in Jupyter notebooks via `pylint`.
- `Best Practices in The Most Upvoted Notebooks.pdf` contains a table that summarizes and compares the results of our quantitative analysis for each of the studied notebook samples.
Notebooks Analysis
The scripts in the `notebook_analysis/notebook_analysis_scripts` folder were developed by Pimentel et al. and shared on Zenodo [1] as the replication package of their article "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" [2]. To perform our analysis, we had to make minor extensions to the original code, mainly because it was designed to automatically retrieve notebooks from GitHub, whereas we needed to analyze a dataset of notebooks stored in a local folder on our machine.
To avoid a major refactoring of the original scripts, we resorted to an expedient solution: we put each Jupyter notebook from our dataset in a distinct directory and initialized that directory as a git repository. The result was a folder comprising 1386 git repositories, one per notebook.
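This preparation step can be sketched as follows (a minimal illustration, not the code we actually used: the folder layout and the commit message are assumptions):

```python
import subprocess
from pathlib import Path

DATASET_DIR = Path("dataset/notebooks")   # assumed: flat folder of .ipynb files
REPOS_DIR = Path("dataset/repositories")  # assumed: output folder, one git repository per notebook

REPOS_DIR.mkdir(parents=True, exist_ok=True)

for notebook in DATASET_DIR.glob("*.ipynb"):
    # Create a dedicated directory named after the notebook and copy it there.
    repo_dir = REPOS_DIR / notebook.stem
    repo_dir.mkdir(exist_ok=True)
    (repo_dir / notebook.name).write_bytes(notebook.read_bytes())
    # Turn the directory into a git repository with the notebook committed,
    # so that the crawler can treat it like a cloned GitHub repository.
    subprocess.run(["git", "init"], cwd=repo_dir, check=True)
    subprocess.run(["git", "add", notebook.name], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Add notebook"], cwd=repo_dir, check=True)
```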
Afterward, we made the following additions to the original scripts:
- `main_with_crawler_custom.py`, to be called instead of `main_with_crawler.py`. It sequentially invokes the scripts for notebook analysis and saves the results to a PostgreSQL database.
- `s0_local_crawler.py`, which is invoked by `main_with_crawler_custom.py` in place of the original `s0_repository_crawler.py` to crawl repositories from the local folder that we built rather than from GitHub.
N.B.: the path to the local folder containing the git repositories must be specified in an environment variable called `JUP_LOCALHOST_REPO_DIR`. To do this, you can use the following command:

```
export JUP_LOCALHOST_REPO_DIR="path/to/local/directory/containing/repositories"
```
N.B.: since studying the reproducibility of Jupyter notebooks was outside the scope of our work, our main script `main_with_crawler_custom.py` skips the execution of `s7_execute_repositories.py`, the original script responsible for the re-execution of notebooks.
To replicate our experiment, you can still set up your execution environment by following the original guide provided by Pimentel et al. on their Zenodo repository: https://zenodo.org/record/2592524
- If you follow the section "Reproducing the Analysis" of that guide, make sure to create the PostgreSQL database by extracting and restoring our dump, `cscw2021_db_dump.sql.tar.bz2`.
- If, instead, you follow the section "Reproducing or Expanding the Collection", make sure to replace the last instruction of the guide:

  ```
  python main_with_crawler.py
  ```

  with

  ```
  python main_with_crawler_custom.py
  ```

  to invoke our main script instead of the original.
N.B.: in `notebook_analysis/notebook_analysis_scripts/.env` you can find the complete list of environment variables that must be declared in order to execute the scripts. Customize the variable values and source the `.env` file to have all variables properly set up in your bash session.
Notebooks Linting
The folder `notebook_linting` contains the Python modules that we developed to check code quality in Jupyter notebooks via `pylint`.
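Conceptually, linting a notebook means converting its code cells to a plain Python script and running `pylint` on the result. A minimal sketch of this idea (illustrative only, not our actual modules: it assumes `nbconvert`'s `ScriptExporter` and `pylint`'s Python entry point as the conversion and linting mechanisms):

```python
import tempfile
from pathlib import Path

from nbconvert import ScriptExporter
from pylint.lint import Run

def lint_notebook(notebook_path: str) -> None:
    # Convert the notebook's code cells into a single Python script.
    source, _ = ScriptExporter().from_filename(notebook_path)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(source)
    # Run pylint on the generated script; exit=False keeps the interpreter alive.
    Run([tmp.name], exit=False)
    Path(tmp.name).unlink()

lint_notebook("example.ipynb")  # hypothetical notebook name
```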
To reproduce the analysis, a preliminary step is required: notebooks have to be grouped into separate folders by the Python version in which they are written, since each Python version requires a dedicated version of `pylint`. To perform this operation, you can use the script `discern_notebooks.py`: it takes notebooks from the dataset folder, inspects their Python version, and assigns them to the correct output folder.
N.B.: before you run `discern_notebooks.py`, make sure to customize it by editing the values of the global variables `DATASET_PATH` (the path to the folder containing the dataset of Jupyter notebooks) and `OUTPUT_FOLDERS_PATH` (the path to the folder that will contain the directories of notebooks grouped by language version, e.g., `py27/` for notebooks written in Python 2.7, `py36/` for Python 3.6, etc.).
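The grouping can be sketched roughly as follows (an illustration only, not the actual `discern_notebooks.py`: it assumes the version is recorded under `metadata.language_info.version` in each notebook's JSON, which is the standard location):

```python
import json
import shutil
from pathlib import Path

DATASET_PATH = Path("dataset/notebooks")  # customize: folder containing the .ipynb files
OUTPUT_FOLDERS_PATH = Path("grouped")     # customize: destination of the per-version folders

for notebook in DATASET_PATH.glob("*.ipynb"):
    # Read the Python version from the notebook's JSON metadata, e.g. "3.6.4".
    with open(notebook, encoding="utf-8") as f:
        metadata = json.load(f).get("metadata", {})
    version = metadata.get("language_info", {}).get("version", "")
    major_minor = "".join(version.split(".")[:2])  # "3.6.4" -> "36"
    if not major_minor:
        continue  # skip notebooks without version metadata
    # Copy the notebook into the folder for its version, e.g. grouped/py36/.
    target = OUTPUT_FOLDERS_PATH / f"py{major_minor}"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(notebook, target / notebook.name)
```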
Once notebooks are grouped by Python version, we can set up the environment for the execution of the main script:
- In `config.py`, set the global variable `DATASETS_BASE_PATH` to the path of the folder containing the grouped notebooks (i.e., give `DATASETS_BASE_PATH` the same value that you assigned to `OUTPUT_FOLDERS_PATH` in `discern_notebooks.py`).
- Specify the path to your `anaconda` or `conda` installation in an environment variable called `JUP_ANACONDA_PATH`. If you have already replicated the first part of our analysis (see the previous section, "Notebooks Analysis") following the instructions provided by Pimentel et al., this environment variable should already be set. Otherwise, you can use the following command:

  ```
  export JUP_ANACONDA_PATH="path/to/your/local/anaconda/installation"
  ```

- Lastly, install the same `conda` environments that are required to execute the first part of our analysis (see the previous section, "Notebooks Analysis"), following the instructions provided by Pimentel et al. on the Zenodo repository dedicated to their project: https://zenodo.org/record/2592524. In particular, refer to the section "Reproducing or Expanding the Collection".
N.B.: we did not find any Kaggle notebook written in Python 2.7 or Python 3.8. Thus, if you are replicating our analysis on the same dataset, you can skip the installation of the corresponding conda environments.
Now you can run the main script by issuing the following commands in your shell:
```
conda activate py36
python main.py
```
The script `main.py` returns linting results in `.csv` format (one `.csv` file per group of notebooks). The results can then be summarized using the Jupyter notebook `notebook_linting/Results analysis.ipynb`.
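If you prefer to inspect the raw results outside the notebook, the per-group files can be concatenated, for instance with pandas (a sketch only: the output folder and file naming scheme are assumptions, and a shared column layout across the files is not guaranteed by `main.py`):

```python
from pathlib import Path

import pandas as pd

# Collect the per-group linting results, e.g. py36.csv, py37.csv, ...
# (the folder name and file names are assumptions, not taken from main.py).
frames = []
for csv_file in Path("linting_results").glob("*.csv"):
    df = pd.read_csv(csv_file)
    df["group"] = csv_file.stem  # keep track of the Python-version group
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined.groupby("group").size())  # notebooks linted per group
```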
References
[1] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire. (2019). Dataset of "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" [Data set]. Zenodo. http://doi.org/10.5281/zenodo.2592524
[2] J. F. Pimentel, L. Murta, V. Braganholo and J. Freire, "A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks," 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 2019, pp. 507-517, doi: 10.1109/MSR.2019.00077.