# Instructions on how to run the code

This document describes how to run the tools. A short troubleshooting section is at the end.

## Requirements
* Python 3, and openpyxl
* Rust (v1.59.0 or later)
* Java 11 or later - on the path or via a `JAVA_HOME` environment variable
* at least 32GB of RAM - if you have less, make sure you do not run the code in parallel; see the Rust tool documentation below
* at least 100GB of free HDD space
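
The list above can be checked before starting a long run. The following is a minimal sketch, not part of the provided scripts: the function name `check_requirements` and the result keys are made up for illustration, and it assumes that `cargo` on the path means Rust is installed. It does not check RAM, because the Python standard library has no portable way to do that.

```python
import os
import shutil
import sys


def check_requirements(path=".", min_free_gb=100):
    """Return a dict mapping each requirement to True/False (hypothetical helper)."""
    # Free space on the drive that holds `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    try:
        import openpyxl  # noqa: F401  (only testing that the import works)
        has_openpyxl = True
    except ImportError:
        has_openpyxl = False
    return {
        "python3": sys.version_info.major >= 3,
        "openpyxl": has_openpyxl,
        # Assumption: cargo on the path means a Rust toolchain is present.
        "cargo": shutil.which("cargo") is not None,
        # Java counts as present if it is on the path or JAVA_HOME is set.
        "java": shutil.which("java") is not None or bool(os.environ.get("JAVA_HOME")),
        "disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The sketch does not verify the Rust or Java version numbers; it only reports whether the commands can be found.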

## Folder descriptions
* analyses - contains the Rust code folder and the `r-scripts` folder for running the R analysis. The dataset is in the `r-scripts` folder.
* projects - the folder where the Git projects and email archives are downloaded by the scripts and the tool.
* scripts - contains the Python script for downloading the Git repositories from GitHub, and another script for post-processing the classification of commit messages.
* root folder - contains the folders above plus the XML metadata file that lists the Apache Software Foundation Incubator projects.

## How to run - Main steps
This section describes how to run the tool and the R scripts. Following the steps in order should be sufficient to collect the data and replicate the study. The main Rust tool ships as two binaries: `miner.exe` for Windows and `miner-linux` for Linux (tested on Ubuntu 18.04). Both target x64 architectures.

### Download Git repositories

The data is collected from local Git repositories. To download the Git repositories, run the following commands:

* `cd scripts`
* `pip install openpyxl` - or skip if you already have this library
* `python download_git_repos.py projects-info-from-podlings-xml-extra-metadata.xlsx`

These Git repositories will be cloned in the `projects/git` folder. The script only downloads `graduated` or `retired` projects. 

### Download Emails

The next step is to download the email archives for the projects. The main Rust tool does this part. It downloads mailing archives only for the incubation period of a project, that is, the `dev` mailing list archive for each month that the project was in the incubator. Each archive is saved as an `.mbox` file in the `projects/emails` folder. An example output file is `projects/emails/abdera-dev-200606.mbox`, which is the archive of the `dev` mailing list of the `Abdera` project for year `2006`, month `06`.
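
To make the naming scheme concrete, here is a small sketch that lists the file names expected for one project. It is not the Rust tool's code: the function `monthly_archive_names` is hypothetical, the lower-casing of the project name is an assumption based on the `abdera` example, and the dates in the usage line are made up.

```python
from datetime import date


def monthly_archive_names(project, start, end, mailing_list="dev"):
    """List one .mbox file name per month from start to end (inclusive)."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Assumed pattern: <project>-<list>-<YYYYMM>.mbox
        names.append(f"{project.lower()}-{mailing_list}-{year}{month:02d}.mbox")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Hypothetical incubation window, for illustration only.
print(monthly_archive_names("Abdera", date(2006, 6, 1), date(2006, 8, 31)))
```

Comparing such a list against the contents of `projects/emails` is one way to spot gaps by hand; the tool's own `--missing-emails` flag is the supported check.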

!!! Due to the way I initially wrote the code, the emails for a project will not be downloaded if it has no Git repository in the `projects/git` folder, so download the Git repositories first. I should refactor this, but for the moment I will leave it as it is.

To download the emails, run the Rust code by using the binaries provided. 

* `cd analyses/rust-code`
* `./bin/miner-linux --download-emails` - if you're on Linux (x64)
* `./bin/miner.exe --download-emails` - if you are on Windows (x64)

For other operating systems or architectures, you need to install Rust (v1.59.0 or later) and build the source code. Install Rust by following the official instructions, and then run:

* `cargo build`
* `./target/debug/miner --download-emails`

### Rust tool - collecting data

#### Tool description
To collect the code, process, and code quality metrics, you need to run the main Rust tool. Be warned: depending on the number of repositories, this might take 5-10 days to complete, and it requires at least 32GB of available RAM. By default, the tool tries to use 4 cores/threads, but this can be overridden with the option `--threads=X`, where X is the number of cores you want to use. If you have less RAM, avoid running in parallel, for example with `--threads=1`.
The Rust tool also provides a help menu that describes its command-line arguments:

```
USAGE:
    miner.exe [FLAGS] [OPTIONS]

FLAGS:
        --commit-messages    skip sokrates analysis
        --download-emails    download all projects' emails
    -h, --help               Prints help information
        --list-projects      only show projects
        --missing-emails     check for any missing email archives
        --skip-sokrates      skip sokrates analysis
    -V, --version            Prints version information

OPTIONS:
        --project <project>    only parse given project
        --threads <threads>    number of threads
```

#### Collecting the data

Use the provided binaries to start collecting the data. The possible options are listed above. If you only want to collect all the data, run the tool without arguments after downloading the Git repositories and the emails:

* `./bin/miner` - where `miner` stands for the binary for your platform (`miner-linux` or `miner.exe`)

The tool writes its logs to the `miner.log` file and its output data to the `data` folder. Each project has its data in an individual CSV file.


#### Aggregating the project's individual CSV files into one
The tool outputs a CSV file per project. To have a single CSV file with all the data, you can use Linux tools to aggregate all the data in one:

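```
cd analyses/rust-code
# remove any previous aggregate; -f avoids an error if the file does not exist
rm -f data.csv
# copy the header row once
head -n 1 data/Abdera.csv > data.csv
# append the rows of every file, skipping each header row (-q hides file names)
tail -n +2 -q data/*.csv >> data.csv
```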

This takes the header row from the `Abdera.csv` file and puts it into the `data.csv` file, and then appends the rows of every CSV file in the `data` folder, excluding the first row (the header row) of each file.
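
On systems without `head` and `tail` (for example, plain Windows), the same aggregation can be done in Python. This is a minimal sketch rather than one of the provided scripts: the function name `merge_csv_files` is hypothetical, and it assumes that all per-project files are UTF-8 and share the same header row, which it takes from the first file in sorted order instead of from `Abdera.csv`.

```python
import glob
import os


def merge_csv_files(data_dir="data", output="data.csv"):
    """Concatenate per-project CSV files, keeping only the first header row."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    with open(output, "w", encoding="utf-8", newline="") as out:
        for index, path in enumerate(paths):
            with open(path, encoding="utf-8", newline="") as src:
                header = src.readline()
                if index == 0:
                    # Write the header once, from the first file only.
                    out.write(header if header.endswith("\n") else header + "\n")
                for line in src:
                    # Guard against a file whose last row has no trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(paths)


if __name__ == "__main__":
    # Run from analyses/rust-code so that data/ and data.csv match the text above.
    print(f"merged {merge_csv_files()} files into data.csv")
```

Keep the output file outside the `data` folder, as in the shell version, so that it is not picked up as an input on the next run.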

#### Extract commit messages
The tool can also extract every commit message in the incubation period, so that a commit classifier (fasttext) can be used later. Run the following command to extract all the commit messages for the incubation period of each project:

* `./bin/miner --commit-messages`

This will create a file named `commits-messages.csv` in the `analyses/rust-code/data/` folder.

#### Classify commits using fasttext
Requirements: follow the instructions to install `fasttext` from here: https://github.com/gesteves91/fasttext-commit-classification

* `pip install fasttext`

Then open the `scripts/fasttext-commit-classification/notebooks/classification.ipynb` Jupyter Notebook and run the code. Make sure you replace the `commits-messages.csv` file in the `scripts/fasttext-commit-classification/notebooks` folder with the one that was generated at the previous step.

As this file and the main data analysis files are separate, you have to combine the two in R with a left join on `project`, `status`, and `incubation_month`, which merges the data analysis output with the commit classification data.
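
The study itself performs this join in R. For readers who want to see what the left join does, here is a sketch of the same logic in plain Python. Only the three key columns come from the text above; the function `left_join`, the other column names (`commits`, `corrective`), and the sample values are hypothetical. It assumes at most one classification row per key.

```python
KEYS = ("project", "status", "incubation_month")


def left_join(left_rows, right_rows, keys=KEYS):
    """Left-join two lists of dicts on the given key columns."""
    # Index the right-hand rows by their key tuple (assumes unique keys).
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    first = right_rows[0] if right_rows else {}
    right_columns = [c for c in first if c not in keys]
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        merged = dict(row)
        for column in right_columns:
            # Unmatched left rows are kept, with empty right-hand values.
            merged[column] = match.get(column, "")
        joined.append(merged)
    return joined


# Made-up rows, for illustration only.
metrics = [
    {"project": "example", "status": "graduated", "incubation_month": "1", "commits": "12"},
    {"project": "example", "status": "graduated", "incubation_month": "2", "commits": "7"},
]
labels = [
    {"project": "example", "status": "graduated", "incubation_month": "1", "corrective": "3"},
]
for row in left_join(metrics, labels):
    print(row)
```

Every row of the main data is kept, and months without classified commits get empty values. With real files, the rows can be read with `csv.DictReader`.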

## Running the R analysis

## Troubleshooting

* If you get errors from the Rust tool, mainly complaints that sokrates failed to do X, check the following:
Do you have at least Java 11 installed? Is it accessible via the `java` command? Is `JAVA_HOME` set? If yes, make sure it is correct and that `java` can be run from there. On Linux, the tool expects a `JAVA_HOME` environment variable, and it works only if that path ends with `/bin/`. In my Linux tests I had the following in my `.bashrc`:
```
export JAVA_HOME=/home/user/jdk-11/bin/
export PATH=$PATH:/home/user/jdk-11/bin/
export PATH=$JAVA_HOME/bin:$PATH
``` 

* On Windows, the `JAVA_HOME` variable looks like this: `C:\Program Files\Microsoft\jdk-17.0.2.8-hotspot`. There the tool adds the `/bin/` folder itself to find the executable. Make sure you have this environment variable set up.
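
To check quickly which of the two layouts your `JAVA_HOME` matches, a small diagnostic like the following can help. It is only a sketch and does not reproduce the tool's actual lookup: the functions `java_candidates` and `find_java` are hypothetical, and the script simply tries both the "path already ends in `bin`" and the "path is the JDK root" interpretations.

```python
import os


def java_candidates(java_home, exe_name="java"):
    """Return the paths where a java executable might sit for a JAVA_HOME value."""
    home = java_home.rstrip("/\\")
    return [
        os.path.join(home, exe_name),         # JAVA_HOME already points at .../bin
        os.path.join(home, "bin", exe_name),  # JAVA_HOME points at the JDK root
    ]


def find_java(java_home):
    """Return the first candidate that exists and is executable, else None."""
    names = ["java.exe", "java"] if os.name == "nt" else ["java"]
    for name in names:
        for path in java_candidates(java_home, name):
            if os.path.isfile(path) and os.access(path, os.X_OK):
                return path
    return None


if __name__ == "__main__":
    home = os.environ.get("JAVA_HOME")
    if not home:
        print("JAVA_HOME is not set")
    else:
        print(find_java(home) or f"no java executable found under {home}")
```

If the script prints a path but the tool still fails, compare whether your `JAVA_HOME` ends with `/bin/` (expected on Linux) or points at the JDK root (expected on Windows).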