Snakemake is… TODO

The Workflow

The Snakefile contains rules that define the output files we want and how to make them. Snakemake automatically works out how the rules depend on one another and the order in which to run them.
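
As an illustration of that ordering step, the sketch below feeds "prerequisite dependent" pairs to the standard tsort utility, which prints an order in which every step comes after the steps it depends on. The rule names and the pairs are hypothetical, chosen only for illustration. Snakemake itself derives such relationships by matching each rule's input files to other rules' output files rather than from an explicit list.

```sh
# Each argument is "prerequisite dependent". Rule names are hypothetical.
order=$(printf '%s\n' \
    "preprocess_data run_ml" \
    "run_ml combine_results" \
    "combine_results plot_performance" \
    "plot_performance render_report" | tsort)

# Print the rules in an order that respects every dependency.
printf '%s\n' "$order"
```

Because these pairs form a single chain, only one valid order exists. With branching dependencies, several orders can be valid.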

This Snakemake workflow preprocesses the dataset (data/otu_large.csv) with mikropml::preprocess_data(), calls mikropml::run_ml() for each seed and ML method set in config/config.yml, combines the results files, plots performance results, and renders a simple R Markdown report as a GitHub-flavored markdown file.
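
To make the "each seed and ML method" fan-out concrete, here is a small sketch that lists one run_ml() job per combination. The method names, the seed count, and the starting seed are assumed example values, not taken from config/config.yml, and how the workflow actually numbers its seeds is not specified here.

```sh
ml_methods="glmnet rf"   # hypothetical method list
nseeds=3                 # hypothetical number of seeds
start_seed=100           # hypothetical first seed

njobs=0
for method in $ml_methods; do
    seed=$start_seed
    while [ "$seed" -lt $((start_seed + nseeds)) ]; do
        # One run_ml() call per (method, seed) pair.
        echo "run_ml: method=$method seed=$seed"
        njobs=$((njobs + 1))
        seed=$((seed + 1))
    done
done
echo "total run_ml jobs: $njobs"
```

The number of run_ml() jobs is the number of methods multiplied by nseeds, so adding a method or a seed increases the total run time accordingly.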

Setup

  1. Clone or download the mikropml-snakemake-workflow repo.

  2. Install snakemake.

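    We recommend using conda (see the miniconda installation instructions) to install Snakemake and its dependencies. Create a conda environment with conda env create -f config/environment.yml, then activate it with conda activate followed by the environment name defined in that file.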

    Alternatively, you can install snakemake and the other dependencies listed in config/environment.yml however you like.

  3. Install the mikropml R package: see the mikropml install instructions.

    e.g. from an R session: install.packages("mikropml")

  4. Edit the configuration file config/config.yml.

    • ml_methods: list of machine learning methods to use. Must be supported by mikropml.
    • ncores: the number of cores to use for preprocessing and for each mikropml::run_ml() call. Do not exceed the number of cores you have available.
    • nseeds: the number of different random seeds to use for training models with mikropml::run_ml().
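
As a quick sanity check on ncores, the sketch below compares a chosen value against the number of cores currently online. The value 4 is a hypothetical stand-in for whatever config/config.yml contains, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on typical Linux and macOS systems).

```sh
ncores=4                                  # hypothetical: copy from config/config.yml
available=$(getconf _NPROCESSORS_ONLN)    # cores currently online on this machine

if [ "$ncores" -le "$available" ]; then
    echo "ok: ncores=$ncores fits within $available available cores"
else
    echo "warning: ncores=$ncores exceeds $available available cores" >&2
fi
```

Run the check on the machine that will execute the jobs. On an HPC, the login node may report a different core count than the compute nodes.
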
  5. Do a dry run with snakemake -n to make sure the Snakemake workflow is valid.

  6. Run the workflow.

    Run it locally with:
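
A minimal sketch of a local run, assuming the conda environment is active and the command is issued from the repository root. The --cores option sets the total number of cores Snakemake may use across all jobs, and recent Snakemake releases require it. Defaulting to every online core is an assumption made for illustration, and the CORES override variable is hypothetical.

```sh
# Use the caller's CORES value if set, otherwise every core currently online.
cores=${CORES:-$(getconf _NPROCESSORS_ONLN)}

if command -v snakemake >/dev/null 2>&1; then
    # Run the whole workflow with at most $cores cores in use at once.
    snakemake --cores "$cores"
else
    echo "snakemake not found on PATH; activate the conda environment first" >&2
fi
```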

    To run the workflow on an HPC with SLURM:

    1. Edit your email (YOUR_EMAIL_HERE) and SLURM account (YOUR_ACCOUNT_HERE) in:
    2. Submit the snakemake workflow with:

           sbatch code/submit_slurm.sh

       The main job will then submit other snakemake jobs.
  7. View the results in report.md.