Code Duplication and Reuse in Jupyter Notebooks

Koenzen, Andreas P.; Ernst, Neil A.; Storey, Margaret-Anne D.

doi:10.5281/zenodo.3836691

Published May 29, 2020 | Version 3.0

Conference paper Open

1. University of Victoria

This is a replication package for the paper: "Code Duplication and Reuse in Jupyter Notebooks", which was accepted as a full paper at the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2020.

The contents of this package are as follows:

code folder: Contains all necessary code to reproduce the first study presented in the paper.
data folder: Contains all data pertaining to the first study presented in the paper.
- clones_1582405629.json.gz file: JSON database with all detected clones and their metadata for the used dataset.
- commit_data_1589997765.pkl.gz file: Pandas pickle file containing the table "commit_data" (See database.sql file).
- commits_1589997765.pkl.gz file: Pandas pickle file containing the table "commit" (See database.sql file).
- counter_1582422799.json.gz file: JSON database with statistics about all repositories in the used dataset.
- notebooks_1589997765.pkl.gz file: Pandas pickle file containing the table "notebooks" (See database.sql file).
- parameter_tunning folder: Folder with the results of the parameter tuning phase. Each TXT file corresponds to a different threshold.
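
The data files listed above could be read along the following lines. This is a minimal sketch, not code from the package: the helper names load_json_gz and load_table are hypothetical, the data directory location is assumed, and each .json.gz file is assumed to hold a single JSON document. pandas.read_pickle infers gzip compression from the ".gz" suffix; unpickling may depend on the pandas version pinned in requirements.txt, and pickle files should only be loaded from a trusted source.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Read a gzip-compressed JSON file and return the decoded object.

    Assumes the file contains one JSON document (an assumption, not
    something the package description states).
    """
    with gzip.open(Path(path), "rt", encoding="utf-8") as handle:
        return json.load(handle)


def load_table(path):
    """Read a gzip-compressed pandas pickle such as notebooks_1589997765.pkl.gz.

    pandas is imported lazily so the JSON helper works without it.
    Compression is inferred by pandas from the '.gz' file extension.
    """
    import pandas as pd

    return pd.read_pickle(Path(path))


# Example usage (paths assume the ZIP was extracted into the working directory):
#   clones = load_json_gz("data/clones_1582405629.json.gz")
#   notebooks = load_table("data/notebooks_1589997765.pkl.gz")
```

The pickle loader is kept separate from the JSON loader so that the clone and counter databases can be inspected even in an environment where pandas is not installed.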

To fully reproduce the study, a working Python 3.7 environment is needed. The requirements are listed in the requirements.txt file. If the start scripts are to be used, Python 3.7.7 must be installed via pyenv. This is NOT necessary to run the notebooks, however: the JupyterLab environment can be launched manually by issuing the command: "jupyter lab notebooks"

Commands:

To install Python dependencies via pip: "pip install -r requirements.txt"
To launch Jupyter: "source start-jupyter.sh"

Optional:

To access environment variables from Jupyter, the file env_variables.py can be edited to add new variables or modify current ones.

SHA1SUM of ZIP file: c9b5d7e2dbe0574b73f2d2b67adb9e18fdcfb513
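
The downloaded archive can be checked against the SHA-1 value above before extracting it. The following is a minimal sketch using only the Python standard library; the function names hex_digest_of_file and matches_checksum are hypothetical, and the archive path in the usage comment assumes the ZIP sits in the current directory.

```python
import hashlib

# SHA-1 value published for the ZIP file in this record.
EXPECTED_SHA1 = "c9b5d7e2dbe0574b73f2d2b67adb9e18fdcfb513"


def hex_digest_of_file(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so a multi-gigabyte archive is never fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_SHA1, algorithm="sha1"):
    """Return True when the file's digest equals the expected hex string."""
    return hex_digest_of_file(path, algorithm) == expected.strip().lower()


# Example usage (path is an assumption):
#   matches_checksum("VLHCC_2020_Paper_Reproducibility_Pkg.zip")
```

Passing algorithm="md5" together with an MD5 hex string performs the same check against an MD5 checksum.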

Files (2.5 GB)

Name	Size
VLHCC_2020_Paper_Reproducibility_Pkg.zip (md5:18cccb23601930f522ae345b83fe91bc)	2.5 GB
