Code Duplication and Reuse in Jupyter Notebooks
This is a replication package for the paper: "Code Duplication and Reuse in Jupyter Notebooks", which was accepted as a full paper at the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2020.
The contents of this package are as follows:
- code folder: Contains all necessary code to reproduce the first study presented in the paper.
- data folder: Contains all data pertaining to the first study presented in the paper.
- clones_1582405629.json.gz file: JSON database with all detected clones and their metadata for the dataset used.
- commit_data_1589997765.pkl.gz file: Pandas pickle file containing the table "commit_data" (See database.sql file).
- commits_1589997765.pkl.gz file: Pandas pickle file containing the table "commit" (See database.sql file).
- counter_1582422799.json.gz file: JSON database with statistics about all repositories in the dataset used.
- notebooks_1589997765.pkl.gz file: Pandas pickle file containing the table "notebooks" (See database.sql file).
- parameter_tunning folder: Folder with the results of the parameter tuning phase. Each TXT file corresponds to a different threshold.
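The data files listed above are gzip-compressed, so they can be read without unpacking them first. The following is a minimal sketch, not part of the package itself: the helper names load_json_gz and load_pickle_gz are hypothetical, and it assumes that the JSON files are UTF-8 encoded and that the pickles were written with a pandas version compatible with the one in requirements.txt.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Read a gzip-compressed JSON file and return the decoded object.

    Assumes the file is UTF-8 encoded (an assumption, not stated in the package).
    """
    with gzip.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def load_pickle_gz(path):
    """Read a gzip-compressed pandas pickle and return the stored object.

    pandas is imported lazily so the JSON helper works without it.
    Only unpickle files from a source you trust.
    """
    import pandas as pd

    return pd.read_pickle(Path(path), compression="gzip")
```

For example, load_json_gz("clones_1582405629.json.gz") would return the clone database as Python objects, and load_pickle_gz("notebooks_1589997765.pkl.gz") would return the stored "notebooks" table. Adjust the paths to wherever the files sit after extracting the ZIP file.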
To fully reproduce the study, a fully functional Python 3.7 environment is needed. The requirements are listed in the requirements.txt file. The starting scripts require Python 3.7.7 installed via pyenv. This is NOT necessary to run the notebooks: the JupyterLab environment can also be launched manually by issuing the command: "jupyter lab notebooks"
- To install Python dependencies via Pip: "pip install -r requirements.txt"
- To launch Jupyter: "source start-jupyter.sh"
- To make environment variables available in Jupyter: edit the file env_variables.py to add new variables or modify existing ones.
SHA1SUM of ZIP file: c9b5d7e2dbe0574b73f2d2b67adb9e18fdcfb513
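The checksum above can be compared against a downloaded copy of the archive before extracting it. The following is a minimal sketch using only the Python standard library; the function names sha1_of_file and verify_archive are hypothetical, and the archive path is whatever name the ZIP file has on your machine.

```python
import hashlib

# Checksum published in this README.
EXPECTED_SHA1 = "c9b5d7e2dbe0574b73f2d2b67adb9e18fdcfb513"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_SHA1):
    """Return True if the file's SHA-1 digest matches the expected value."""
    return sha1_of_file(path) == expected.lower()
```

Calling verify_archive with the path of the downloaded ZIP file returns True only when the digest matches the value listed above.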