Published January 14, 2021 | Version v3
Software | Open Access

Replication Package of "Eliciting Best Practices for Collaboration with Computational Notebooks"

Authors/Creators

  • Anonymous

Description

This is the replication package of the paper: "Eliciting Best Practices for Collaboration with Computational Notebooks"

In the following, we describe the contents of each file archived in this repository.

  • dataset.tar.bz2 contains:

    • the dataset of Jupyter notebooks we retrieved from Kaggle (/cscw2021_dataset.tar.bz2);

    • the notebook we used to filter the available Kaggle kernels based on our research criteria (Meta_Kaggle_filtering/notebooks/Meta_Kaggle_filtering.ipynb);

    • the specific version of the Meta Kaggle dataset (October 27, 2020) that we used to perform the filtering (Meta_Kaggle_filtering/data/MetaKaggle 27-10-2020 (KT version)).

  • notebook_analysis.tar.bz2 contains:

    • cscw2021_db_dump.sql.tar.bz2, a PostgreSQL dump of the database with all the data we extracted from the notebooks;

    • notebook_analysis_scripts/, the scripts by Pimentel et al. that we extended to analyze our dataset of notebooks (see the dedicated section in this README).

  • notebook_linting.tar.bz2 contains the Python modules we developed to check code quality in Jupyter notebooks via pylint.

  • Best Practices in The Most Upvoted Notebooks.pdf contains a table that summarizes and compares the results of our quantitative analysis for each of the studied notebook samples.


Notebooks Analysis

The scripts in the notebook_analysis/notebook_analysis_scripts folder were developed by Pimentel et al. and shared on Zenodo [1] as the replication package of their article "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" [2]. To perform our analysis, we made minor extensions to the original code, mainly because it was designed to retrieve notebooks automatically from GitHub, whereas we needed to analyze a dataset of notebooks stored in a local folder on our machine.

To avoid a major refactoring of the original scripts, we resorted to a simple workaround: we placed each Jupyter notebook from our dataset in a distinct directory and initialized that directory as a git repository. The result was a folder comprising 1386 git repositories, one per notebook.
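For reference, this preparation step can be reproduced with a few lines of Python. The following is a minimal sketch, not part of the original package; both paths below are placeholders:

import subprocess
from pathlib import Path

SOURCE_DIR = Path("path/to/flat/folder/of/notebooks")  # placeholder
REPOS_DIR = Path("path/to/local/directory/containing/repositories")  # placeholder

for notebook in SOURCE_DIR.glob("*.ipynb"):
    # One directory, and hence one git repository, per notebook.
    repo_dir = REPOS_DIR / notebook.stem
    repo_dir.mkdir(parents=True, exist_ok=True)
    (repo_dir / notebook.name).write_bytes(notebook.read_bytes())
    subprocess.run(["git", "init"], cwd=repo_dir, check=True)
    subprocess.run(["git", "add", notebook.name], cwd=repo_dir, check=True)
    # The commit uses an inline identity so no global git config is required.
    subprocess.run(["git", "-c", "user.name=replicator",
                    "-c", "user.email=replicator@example.com",
                    "commit", "-m", "Add notebook"], cwd=repo_dir, check=True)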

Afterward, we made the following additions to the original scripts:

  • main_with_crawler_custom.py, to be called instead of main_with_crawler.py. It sequentially invokes the scripts for notebook analysis and saves the results to a PostgreSQL database.

  • s0_local_crawler.py. This script is invoked by main_with_crawler_custom.py in place of the original s0_repository_crawler.py to crawl repositories from the local folder that we built rather than from GitHub (a conceptual sketch of this idea follows the note below).

N.B.: the path to the local folder containing the git repositories must be specified as an environment variable called JUP_LOCALHOST_REPO_DIR. To do this, you can use the following command:

export JUP_LOCALHOST_REPO_DIR="path/to/local/directory/containing/repositories"
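Conceptually, all the local crawler needs to do is enumerate the per-notebook repositories under this folder instead of querying GitHub. The following is a rough sketch of that idea, not the actual code of s0_local_crawler.py:

import os
from pathlib import Path

repos_root = Path(os.environ["JUP_LOCALHOST_REPO_DIR"])
# Each immediate subdirectory is one git repository holding one notebook.
local_repos = sorted(p for p in repos_root.iterdir() if (p / ".git").is_dir())
print(f"Found {len(local_repos)} local repositories to analyze.")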

N.B.: since studying the reproducibility of Jupyter notebooks was outside the scope of our work, our main script main_with_crawler_custom.py skips the execution of s7_execute_repositories.py, the original script responsible for re-executing notebooks.

To replicate our experiment, you can set up your execution environment by following the original guide provided by Pimentel et al. in their Zenodo repository: https://zenodo.org/record/2592524

  • if you follow the section "Reproducing the Analysis" of that guide, make sure to create a PostgreSQL database by extracting and restoring our dump: cscw2021_db_dump.sql.tar.bz2.

  • if, instead, you follow the section "Reproducing or Expanding the Collection", make sure to replace the last instruction of the guide:

    python main_with_crawler.py

    with

    python main_with_crawler_custom.py

    to invoke our main script instead of the original.

N.B.: notebook_analysis/notebook_analysis_scripts/.env contains a complete list of the environment variables that must be declared to execute the scripts. Customize the variable values and source the .env file to have all variables properly set up in your bash session.
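As a quick sanity check (not part of the original package), you can verify from Python that the variables are visible to your session. The sketch below checks only the two variables named explicitly in this README; the complete list lives in the .env file:

import os

# The two variables named explicitly in this README;
# see notebook_analysis_scripts/.env for the complete list.
required = ["JUP_LOCALHOST_REPO_DIR", "JUP_ANACONDA_PATH"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit("Missing environment variables: " + ", ".join(missing))
print("Environment variables are set.")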


Notebooks Linting

The folder notebooks_linting contains the Python modules that we developed to check code quality in Jupyter notebooks via pylint.

To reproduce the analysis, a preliminary step is required: notebooks have to be grouped into separate folders according to the Python version in which they are written, since each Python version requires a dedicated pylint installation. To perform this operation, you can use the script discern_notebooks.py: it takes each notebook from the dataset folder, inspects its Python version, and assigns it to the correct output folder.

N.B.: Before you run discern_notebooks.py, please make sure to customize it by editing the values of the global variables DATASET_PATH -- the path to the folder containing the dataset of Jupyter notebooks -- and OUTPUT_FOLDERS_PATH -- the path to the folder that will contain the directories of notebooks grouped by language version (e.g., py27/ for notebooks written in Python 2.7, py36/ for notebooks written in Python 3.6, and so on).
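For context, a notebook's Python version can typically be read from its JSON metadata. The following is a minimal sketch of that idea, not the actual logic of discern_notebooks.py, which may differ:

import json
from pathlib import Path

def notebook_python_version(path):
    """Return the major.minor Python version recorded in a notebook, or 'unknown'."""
    metadata = json.loads(Path(path).read_text(encoding="utf-8")).get("metadata", {})
    # Python notebooks usually record the interpreter under language_info.version.
    version = metadata.get("language_info", {}).get("version", "")
    return ".".join(version.split(".")[:2]) if version else "unknown"

print(notebook_python_version("example.ipynb"))  # e.g., '3.6'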

Once notebooks are grouped by Python version, we can set up the environment for the execution of the main script:

  • in config.py, set the global variable DATASETS_BASE_PATH to the path of the folder containing the grouped notebooks (give DATASETS_BASE_PATH the same value that you assigned to OUTPUT_FOLDERS_PATH in discern_notebooks.py);

  • specify the path to your anaconda or conda installation in an environment variable called JUP_ANACONDA_PATH; if you have already replicated the first part of our analysis (see the previous section, "Notebooks Analysis") by following the instructions provided by Pimentel et al., this environment variable should already be set. Otherwise, you can use the following command:

    export JUP_ANACONDA_PATH="path/to/your/local/anaconda/installation"

  • lastly, you should install the same conda environments that are required to execute the first part of our analysis (see the previous section, "Notebooks Analysis"); follow the instructions provided by Pimentel et al. in the Zenodo repository dedicated to their project: https://zenodo.org/record/2592524. In particular, refer to the section "Reproducing or Expanding the Collection".

N.B.: we did not find any Kaggle notebook written in Python 2.7 or Python 3.8. Thus, if you are replicating our analysis on the same dataset, you can skip the installation of the corresponding conda environments.

Now you can run the main script by issuing the following commands in your shell:

conda activate py36
python main.py
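For intuition, linting a notebook with pylint essentially amounts to extracting its code cells into a .py file and linting that file. Here is a minimal, self-contained sketch of the idea; it is not the package's actual pipeline, which may differ:

import json
import subprocess
from pathlib import Path

def lint_notebook(notebook_path):
    """Extract a notebook's code cells into a .py file and run pylint on it."""
    cells = json.loads(Path(notebook_path).read_text(encoding="utf-8"))["cells"]
    sources = [c["source"] if isinstance(c["source"], str) else "".join(c["source"])
               for c in cells if c["cell_type"] == "code"]
    script = Path(notebook_path).with_suffix(".py")
    script.write_text("\n".join(sources), encoding="utf-8")
    # pylint's exit code encodes the categories of messages found,
    # so a non-zero status is not treated as an error here.
    return subprocess.run(["pylint", str(script)], capture_output=True, text=True)

print(lint_notebook("example.ipynb").stdout)  # plain-text pylint report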

The script main.py writes the linting results in .csv format (one .csv file per group of notebooks). The results can then be summarized using the Jupyter notebook notebook_linting/Results analysis.ipynb.
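If you want a quick overview without opening the notebook, the per-group .csv files can also be concatenated programmatically. This is a minimal sketch, assuming only that the .csv files sit in a common results folder; the path is a placeholder and no particular column layout is assumed:

import pandas as pd
from pathlib import Path

RESULTS_DIR = Path("path/to/linting/results")  # placeholder

# Concatenate the per-group results, remembering each row's source file.
frames = [pd.read_csv(csv_file).assign(group=csv_file.stem)
          for csv_file in sorted(RESULTS_DIR.glob("*.csv"))]
all_results = pd.concat(frames, ignore_index=True)
print(all_results.groupby("group").size())  # rows per notebook group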


References

[1] Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.2592524

[2] J. F. Pimentel, L. Murta, V. Braganholo and J. Freire, "A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks," 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 2019, pp. 507-517, doi: 10.1109/MSR.2019.00077.
