# Fisheries and Climate Toolkit Data Preparation and Modeling Pipelines

The [Fisheries and Climate Toolkit](https://fisheriesclimatetoolkit.sdsu.edu/) (FaCeT) is an ongoing NASA Biodiversity and Ecological Forecasting project that seeks to support climate-ready and sustainable fisheries practices targeted in Sustainable Development Goal 14. FaCeT uses the fundamental principles of [dynamic ocean management](https://fisheriesclimatetoolkit.sdsu.edu/dynamic-ocean-management/) to develop innovative, transformative, and actionable science. As such, the FaCeT team is developing online products and applications to support climate-resilient fisheries. These tools and applications will:
- Visualize dynamic species distributions of fisheries and conservation interest
- Identify species and vessel responses to climate anomalies and directional climatic changes 
- Project fishery interactions and hotspots in response to changing ocean conditions
- Effectively quantify and communicate uncertainty of model results to users with a focus on incorporating uncertainty into decision processes

These goals are naturally interconnected as a set of "pipelines" or workflows. For example, effective visualization of dynamic species distributions is a core piece of the FaCeT toolkit that relies on adequately capturing, quality-controlling, enhancing, and modeling data on species observations. This is essentially the workflow common to production-ready machine learning projects, and it requires reproducibility, extensive version control, and (some) workflow automation. Many tools can help accomplish these tasks. Here, we've chosen a hybrid approach in which users work locally with [Data Version Control](https://dvc.org/) (DVC) and access cloud storage and computing resources via [Dagster](https://dagster.io/). Leveraging these tools allows us to conduct more reproducible, transparent science and build more "deployment-ready" tools.

## Tasks

This project has three main families of tasks that result in our desired products.

- Data Preparation
- Modeling
- Utility

Those three larger families can be broken down further into a few more defined and decoupled tasks.

![Pipeline overview](./pipeline_overview.drawio.svg)

If each of these types of tasks has a clear interface, then we can easily swap between different types of models and data.

- Biological (point) data
- Environmental (gridded) data
- Biological data enhancement with environmental data (point)
- Model fitting (including data reshaping)
- Model prediction
- Model sharing

## A Primer on DVC

DVC is an open source version control system for machine learning projects. It works like "Git for data", picking up where Git leaves off, for example in handling large data sets. For our purposes, DVC keeps track of all aspects of a machine learning pipeline, from its data to its code and many of the artifacts associated with running a given workflow. These can be linked together into pipelines of data and models that depend on each other and fully describe how an entire workflow is constructed, allowing robust version control across iterations of the different pieces of the pipeline.

> Here's a [helpful blog post](https://dvc.org/blog/r-code-and-reproducible-model-development-with-dvc) showing the process of running an R model fitting and training in DVC.

There are also many great tutorial style videos explaining how DVC works and showing how to interact with it.

## A Primer on Dagster

Dagster is a data orchestration tool focused on production-ready machine learning workflows.

@abkfenris will fill in the rest of this

### But, what do these pipeline tools do for us?

If we can break up our modeling processes into smaller discrete scripts that can work with subsets of data, these tools then can take care of scheduling and distributing the work, and tracking the changes and provenance of the data.

If upstream data changes, then all affected downstream scripts are rerun. Their previous outputs will be accessible as prior commits to the repo.
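
The rerun rule above can be illustrated with a toy sketch. This is not how DVC or Dagster implement it; it only shows the idea of comparing content hashes and propagating changes downstream. The function names, the stage tuples, and the file names are all hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a content hash; any change to the bytes changes the hash."""
    return hashlib.sha256(data).hexdigest()


def stale_stages(stages, current_hashes, recorded_hashes):
    """Return the names of stages that must be rerun, in pipeline order.

    stages: list of (name, deps, outs) tuples, listed in execution order.
    current_hashes / recorded_hashes: dicts mapping a file name to its hash
    now and at the time of the last run.
    """
    changed = {
        name for name, digest in current_hashes.items()
        if recorded_hashes.get(name) != digest
    }
    rerun = []
    for name, deps, outs in stages:
        if changed.intersection(deps):
            rerun.append(name)
            # Outputs of a rerun stage count as changed for later stages.
            changed.update(outs)
    return rerun


# Hypothetical three-stage pipeline: two chained stages and one independent one.
stages = [
    ("download", ["species.txt"], ["raw.xls"]),
    ("qc", ["raw.xls"], ["qc.csv"]),
    ("sst", ["2020-01-01"], ["sst.nc"]),
]
recorded = {"species.txt": fingerprint(b"A"), "2020-01-01": fingerprint(b"")}
current = {"species.txt": fingerprint(b"A\nB"), "2020-01-01": fingerprint(b"")}
print(stale_stages(stages, current, recorded))
```

Here only `species.txt` changed, so `download` and its dependent `qc` are flagged while the independent `sst` stage is left alone.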

## The Ins and Outs
The FaCeT toolkit relies on a number of environmental and biological datasets. These all require various levels of processing before, ultimately, models can be fit. Fitted models are only useful once validated and will then be used for making predictions for future (climate) scenarios. 

### "Biological" observations
In general, dynamic ocean management tools rely on a suite of biological datasets that all represent, in some way, a species observation. In our case, we've expanded "species" distribution modeling efforts to include modeling fishing effort as vessel distribution models (VDMs). Our observation data thus includes:

- observations of highly migratory species occurrence. These can include conventional marker tags, electronic tags, fishery capture, etc.
- observations of fishing effort or vessel distribution. These can include fishery information (such as from fishery logbooks) or "independent" observation such as from AIS-based Global Fishing Watch.

### Environmental observations
Here we rely on a number of different static and dynamic environmental metrics as our best effort to represent the real ocean. To do so, we integrate across a number of remotely-sensed and modeled metrics describing ocean dynamics. We specifically focus on:

- synoptic remote sensing capabilities such as sea surface temperature and chlorophyll
- data-assimilating oceanographic models (e.g. HYCOM) that provide a best estimate of the 3d ocean

Various metrics from these different sources are used to build quantitative relationships between observations and the environment the observations occur within.

### Models

To create these relationships, we use a number of different modeling methods. Most of these approaches require observations of both where an animal (or vessel) was and was not. Thus, in cases where we only have observed "presence" information, we need a way to re-create potential absences, called pseudoabsences. Ultimately, both presence and absence information is then enhanced with environmental information and this aggregate data is used for model fitting. Once robust models have been fitted and validated, we ultimately seek to use them to make inference about the natural world. In the case of FaCeT, we use model predictions at a number of historical and future timescales to understand how fishery dynamics (including target and incidental, or bycatch, species) have historically changed and will continue to change under expected climate scenarios.

## Data Preparation Pipelines

Before we can create models, we need to collect and prepare data for the models to work with.

Most models use a combination of biological data and environmental data for fitting,
and environmental data for prediction.

- Biological Data Preparation
  - Download (or manually upload to a repo)
  - Convert and Quality Control
  - Combine into a single "ready to use" CSV (if needed)
  - Generate pseudo-absences (optional, depending on data type) to accompany presence data
- Environmental Data Preparation
  - Download environmental data (in this case, HYCOM)
  - 2D Environmental Data Calculation (typically SD)
  - 3D Environmental Data Calculation (for example, isothermal layer depth)
- Biological Data Enhancement
  - Split Biological Data By Day
  - Extract Environmental Data Per Location By Day (including enhancing bio data with bathymetry and other static variables)
  - Combine Extracted Data ("model ready")

![Data preparation pipelines](./data_prep_pipelines.drawio.svg)

The following is an example showing how the general pipeline structure above has been implemented for ICCAT marker tag data.

### Biological Data Preparation

Since models may use several types of biological data, the processing stages may differ between data sources, but in the end each should output a CSV with latitude, longitude, and date so that it can be combined with environmental data.

  - Datetimes should be in ISO format.
  - Longitudes are -180 to 180.
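
As a minimal sketch of these two conventions, the helpers below wrap longitudes into the -180 to 180 range and rewrite a timestamp in ISO format. The function names and the input date format are assumptions for illustration, not part of the existing scripts.

```python
from datetime import datetime


def wrap_longitude(lon: float) -> float:
    """Convert a longitude in degrees (e.g. on a 0 to 360 scale) to -180 to 180.

    Note that exactly 180 maps to -180, which is the same meridian.
    """
    return ((lon + 180.0) % 360.0) - 180.0


def to_iso_datetime(text: str, fmt: str) -> str:
    """Parse `text` using the strptime format `fmt` and return an ISO 8601 string."""
    return datetime.strptime(text, fmt).isoformat()


print(wrap_longitude(190.0))
print(to_iso_datetime("14/03/2019 06:30", "%d/%m/%Y %H:%M"))
```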

#### Download from ICCAT

Our first pipeline deals with acquiring the latest data from ICCAT.
ICCAT separates out the tagging data by species, and accordingly we can treat their download tasks separately.

We are defining which species we want to download with a set of `.txt` files that specify the ICCAT SpeciesCode which is used to download data for the species of interest (see the `iccat_download` stage of our `dvc.yaml`).

![iccat_download](./iccat/images/iccat_download.drawio.svg)

#### Convert and QC

As each ICCAT species is downloaded as `.xls` (thanks ICCAT) we need to convert them to `.csv` so that we can use them.
At the same time (though it could be its own pipeline) we are going to run quality control. 

![iccat_qc](./iccat/images/iccat_qc.drawio.svg)

#### Combine Biological Data

Now that we have our biological data downloaded, converted, and quality controlled,
we should consolidate it into a single CSV before it may be used in other ways.

![iccat_combine](./iccat/images/iccat_combine.drawio.svg)

#### Generate Pseudo-absence Data

Despite the number of fish carrying around cool tech, our models need to also know where they are not. Since we can't strap the same tracking or marker tag on non-existent fish that only go where the fish we care about don't go (too many negatives don't make a positive) we have to generate that data ourselves.

This functionality was written in `R` for generalized use. Thus, the base code is included in the `../tools/` directory. That code is then accessed and applied to a specific use case in this pipeline, generating pseudoabsence data specific to ICCAT marker tags. 

![iccat_pseudoabs](./iccat/images/iccat_pseudoabs.drawio.svg)
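
The project's pseudoabsence code is written in `R` and its sampling method is not described here. Purely to illustrate the concept, the Python sketch below draws random "background" points inside the bounding box of the presences and gives each one the date of a presence record. The function name, the record layout, and the sampling scheme are all assumptions.

```python
import random


def background_pseudoabsences(presences, n_per_presence=1, seed=0):
    """Draw random pseudoabsence points within the presences' bounding box.

    presences: list of dicts with 'lat', 'lon' and 'date' keys.
    Each pseudoabsence reuses the date of the presence it accompanies.
    """
    rng = random.Random(seed)  # seeded so that reruns are reproducible
    lats = [p["lat"] for p in presences]
    lons = [p["lon"] for p in presences]
    lat_min, lat_max = min(lats), max(lats)
    lon_min, lon_max = min(lons), max(lons)
    absences = []
    for p in presences:
        for _ in range(n_per_presence):
            absences.append({
                "lat": rng.uniform(lat_min, lat_max),
                "lon": rng.uniform(lon_min, lon_max),
                "date": p["date"],
                "presence": 0,
            })
    return absences
```

Sampling latitude uniformly is a simplification: it is not uniform by area on a sphere, and it ignores land and the movement constraints that a real pseudoabsence method would likely account for.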

### Environmental Data Preparation

In most cases environmental data should be stored in repos by variable and each repo should contain NetCDFs nested in directories by date (`YYYY-MM-DD`).

Environmental data that will be used directly for model forecasting or
prediction will be kept, but data that is only used to
generate derived environmental data (say, salinity and temperature for
Isothermal Layer Depth) will be acquired as needed.

> #### Daily Cron
>
> A little bit of a detour, as this is more of a utility pipeline, but we have a generic `daily-cron` pipeline in the `../tools` directory.
>
> This uses a [cron](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/pipeline/cron/) pipeline to create a YYYY-MM-DD file once a day.
>
> We have a pipeline dedicated to this job, rather than having individual pipelines use a `cron` input,
> so that we have a standard format and trackable provenance for each day's data.
> It also provides a common spot to mock out times (for testing a selection of days)
> and lets us backdate the whole pipeline to acquire past data.


#### Download Daily HYCOM Variables

The pipeline will use the YYYY-MM-DD tick files from the `daily-cron` pipeline in order to know which data to download.
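
A minimal sketch of the tick-file idea, assuming that a tick is an empty file named `YYYY-MM-DD` and that a day counts as downloaded once a directory with the same name exists in the data location. Both helper names and the "pending" rule are hypothetical.

```python
from datetime import date
from pathlib import Path


def write_tick(tick_dir: Path, day: date) -> Path:
    """Create an empty YYYY-MM-DD tick file for `day` and return its path."""
    tick_dir.mkdir(parents=True, exist_ok=True)
    tick = tick_dir / day.isoformat()
    tick.touch()
    return tick


def pending_days(tick_dir: Path, data_dir: Path) -> list:
    """Return the sorted days that have a tick file but no data directory yet."""
    days = []
    for tick in tick_dir.iterdir():
        try:
            day = date.fromisoformat(tick.name)
        except ValueError:
            continue  # ignore anything that is not named like a date
        if not (data_dir / tick.name).exists():
            days.append(day)
    return sorted(days)
```

Backdating then amounts to writing ticks for past days, and testing a selection of days amounts to pointing `tick_dir` at a directory holding only those ticks.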

The HYCOM variables that will be downloaded and stored are:

- `water_u` @ depth 0 (surface) 
- `water_v` @ depth 0 (surface) 
- `surf_el`
- `temp` @ depth 0 (surface) as `sst`

With HYCOM, a single [`script`](./hycom/Python/scripts/get_hycom/py) can be used for all 4 pipelines.

Since environmental data is downloaded daily and stored split up by days,
processing can be distributed with an independent task per day. The example below is for sea surface temperature:

![hycom_sst](./hycom/hycom_sst.drawio.svg)

#### Environmental Data Derivation

Derived environmental data calculations will be triggered either by the
daily cron pipeline or, when they use environmental data that we keep
anyway, by that stored data.

##### 2D Environmental Data Calculation

For both the Sea Surface Temperature and the Sea Surface Height, the standard deviation will be calculated.

This is based on a single `R` script, [`calc_env_sd`](./hycom/R/calc_env_sd.r) and is implemented like:

![calc_sst_sd](./hycom/calc_sst_sd.drawio.svg)
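
The text does not say over which window the standard deviation is taken, and the actual calculation lives in the `R` script. As an illustration only, the Python sketch below assumes a spatial moving window (3 x 3 cells by default) over a single day's 2D grid, skipping missing values. The function name and window choice are assumptions.

```python
import math
import statistics


def focal_sd(grid, radius=1):
    """Sample standard deviation in a (2*radius+1)-cell square moving window.

    grid: list of equal-length rows of floats; NaN marks missing cells.
    Cells whose window holds fewer than two valid values are returned as NaN.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[math.nan] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # Collect valid neighbours, clipping the window at the grid edges.
            window = [
                grid[r][c]
                for r in range(max(0, i - radius), min(n_rows, i + radius + 1))
                for c in range(max(0, j - radius), min(n_cols, j + radius + 1))
                if not math.isnan(grid[r][c])
            ]
            if len(window) >= 2:
                out[i][j] = statistics.stdev(window)
    return out
```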


##### 3D Environmental Data Calculation

Isothermal Layer Depth and Buoyancy Frequency are both calculated per day.

For each variable, each day's task downloads the full day's lat x lon x depth
water temperature and salinity to create the output, but that data is discarded rather than stored.

This is housed in a single [`pipeline`](./hycom/calc_hycom_3d.pipeline.json) and script in order to take advantage of the shared, temporary HYCOM data that both require.

![calc_hycom_3d](./hycom/calc_hycom_3d.drawio.svg)
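
To make the 3D calculation more concrete, here is a sketch of isothermal layer depth for a single temperature profile. It assumes one common definition: the depth at which temperature first drops a fixed amount (0.5 °C here) below the shallowest value, found by linear interpolation between model levels. The threshold, the reference level, and the function name are assumptions and may differ from what the project's script does.

```python
def isothermal_layer_depth(depths, temps, delta=0.5):
    """Depth (m) where temperature first falls `delta` degC below temps[0].

    depths: increasing depths in metres, positive downward.
    temps: temperatures in degC at those depths.
    Returns None if the profile never cools by `delta`.
    """
    target = temps[0] - delta
    for k in range(1, len(depths)):
        if temps[k] <= target:
            # Interpolate linearly between the bracketing levels.
            z0, z1 = depths[k - 1], depths[k]
            t0, t1 = temps[k - 1], temps[k]
            return z0 + (target - t0) * (z1 - z0) / (t1 - t0)
    return None


print(isothermal_layer_depth([0, 10, 20, 30], [20.0, 20.0, 19.0, 18.0]))
```

Applying this function to every lat x lon column of the day's temperature field would give the 2D output that is stored, after which the 3D input can be discarded.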


### Biological Data Enhancement

Once the Pachyderm pipelines have prepared both the biological and environmental data, the biological data needs to be enhanced with the corresponding environmental data.

#### Split Biological Data By Day

First, biological data needs to be split up by day to match the structure (`YYYY-MM-DD` directories) of the environmental data.

This is done with a Python-based "tool" script and implemented on a per-pipeline basis.

![iccat_by_date](./iccat-hycom/iccat_by_date.drawio.svg)
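
A minimal sketch of the split, assuming the input CSV has an ISO-formatted column named `date` and that each day's rows are written to `<out_root>/YYYY-MM-DD/<original file name>`. The column name, the output layout, and the function name are assumptions, not the interface of the existing tool script.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_csv_by_day(src: Path, out_root: Path, date_column: str = "date") -> list:
    """Write one CSV per day under out_root/YYYY-MM-DD/ and return the paths."""
    by_day = defaultdict(list)
    with src.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        for row in reader:
            # The first 10 characters of an ISO datetime are YYYY-MM-DD.
            by_day[row[date_column][:10]].append(row)

    written = []
    for day, rows in sorted(by_day.items()):
        day_dir = out_root / day
        day_dir.mkdir(parents=True, exist_ok=True)
        target = day_dir / src.name
        with target.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(target)
    return written
```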

#### Extract Environmental Data Per Location By Day

Daily biological data is joined with daily environmental data.

Since both sources of data have now been split up and standardized, this can
use a common script and be distributed with an independent task per day.

![extract_iccat_hycom](./iccat-hycom/extract_iccat_hycom.drawio.svg)
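
The join can be sketched as a nearest-grid-cell lookup. The example assumes the day's environmental variable is a 2D grid indexed as `grid[lat_index][lon_index]` with ascending latitude and longitude axes. Reading the NetCDF is left out, and the function names are hypothetical; the project's script may interpolate rather than take the nearest cell.

```python
import bisect


def nearest_index(axis, value):
    """Index of the entry in the ascending list `axis` closest to `value`."""
    pos = bisect.bisect_left(axis, value)
    if pos == 0:
        return 0
    if pos == len(axis):
        return len(axis) - 1
    before, after = axis[pos - 1], axis[pos]
    return pos if (after - value) < (value - before) else pos - 1


def extract_at_points(points, lats, lons, grid, name):
    """Return copies of `points` with the nearest grid value added under `name`.

    points: list of dicts with 'lat' and 'lon' keys (one day's observations).
    """
    enhanced = []
    for point in points:
        i = nearest_index(lats, point["lat"])
        j = nearest_index(lons, point["lon"])
        enhanced.append({**point, name: grid[i][j]})
    return enhanced
```

Because each call only needs one day's observations and one day's grid, days can be processed independently of one another.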

#### Enhance with Bathymetry

Finally, biological data is enhanced with static variables, in this case, bathymetry.

![enhance_iccat_bathy](./iccat-hycom/enhance_iccat_bathy.drawio.svg)


> **A note on confidential biological data:**
>
> For confidential data sets (observer/logbook data), the end of this
> pipeline would be the first point where the data could be made anonymous.
>
> Until this point, the lat/lon/date needs to stay attached to the data so
> that the matching environmental data can be extracted.
> During this step confidential columns can be removed, so that clean
> non-confidential data can be passed to the model.
>
> There are three main ways that I see making this happen:
>
> - Pachyderm Enterprise includes access controls, so we can restrict certain
>   data to specific users.
> - Limit direct access to Pachyderm.
> - Having an additional Pachyderm cluster for confidential data that submits
>   the non-confidential biological + extracted environmental data to the
>   regular Pachyderm cluster.
>
> Even if we start with Pachyderm Enterprise, we can always fall back to one
> of the other methods later.


#### Combine Extracted Data

Now that there are daily CSVs that have been enhanced with environmental data, they can be combined back to a single CSV.

![combine_iccat_hycom](./iccat-hycom/combine_iccat_hycom.drawio.svg)
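
A minimal sketch of the combine step, assuming the daily files sit at `<in_root>/YYYY-MM-DD/*.csv`, each has a header row, and all share the columns of the first file. The layout and function name are assumptions, and the output file should live outside the daily directories so it is not picked up as an input.

```python
import csv
from pathlib import Path


def combine_daily_csvs(in_root: Path, out_path: Path) -> int:
    """Concatenate in_root/YYYY-MM-DD/*.csv into out_path; return the row count."""
    # Sorting the paths puts the YYYY-MM-DD directories in chronological order.
    files = sorted(in_root.glob("*/*.csv"))
    n_rows = 0
    writer = None
    with out_path.open("w", newline="") as out_handle:
        for path in files:
            with path.open(newline="") as handle:
                reader = csv.DictReader(handle)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first file's header defines the output columns.
                    writer = csv.DictWriter(out_handle, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

With the default `csv.DictWriter` settings, a later file containing a column that the first file lacks raises a `ValueError`, which surfaces schema drift between days instead of silently dropping data.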


## Modeling Pipelines

Once we have environmental data, it's time to build and run models with it, and share the results.

![Modeling pipelines](./modeling_pipelines.drawio.svg)

### Fitting BRT models

For now our model fitting pipeline is quite simple: it takes in the biological data that has been "enhanced" with all the environmental data (e.g. HYCOM, bathymetry, etc), splits this data up by species (eventually we should change this to an index variable argument instead), and fits BRTs.

#### Split by species

First, biological data needs to be split up by species for species-specific modeling.

This is done with a Python-based "tool" script and implemented on a per-pipeline basis.

![model_split_by_species](./model-brt/model_split_by_species.drawio.svg)

#### Model fitting

With CSVs for each species ready for model fitting, all we need is to configure the model fitting process. This is done with a set of species-specific text files in `model_config_brt` (see [`./pipelines/model-brt/`](./pipelines/model-brt/) for details).

![model_fit_brt](./model-brt/model_fit_brt.drawio.svg)

#### Model Prediction

We have not worked on this pipeline yet.

#### Model Sharing

We have not worked on this pipeline yet.

## Utility Tasks

- [Data cleanup](./9-Data-cleanup)
- [Interactive data exploration](./10-Interactive_exploration)
- Daily cron pipeline

## How do we make this happen?

We can test the effectiveness of Pachyderm in several phases.

### Pachyderm Hub

Initially, as we get scripts developed, we can use the [Pachyderm Hub](https://www.pachyderm.com/platform/#hub) hosted platform to test our scripts and the pipelines connecting them.

Pachyderm Hub should give enterprise-level controls (such as the dashboard and access control) in a
time-limited (4 hours) test environment.
This should give us enough time to test with a subset of the time span that we want to work over, and
to make sure that our pipelines are reasonable before we deploy our own cluster.
We can also do some initial testing of how the access controls will work with confidential data.

### Test AWS Deployment

Once we are feeling comfortable with how our pipelines work on Pachyderm Hub, we can deploy our own
cluster in AWS.

For this we will want to [deploy](https://docs.pachyderm.com/latest/deploy-manage/deploy/amazon_web_services/) an EKS cluster.
On this cluster we can start running our longer term pipelines, like daily forecasting (though we can
still test these on Pachyderm Hub).

> Probably should figure out how to use Terraform rather than `eksctl` directly for repeatability

We can get a trial of Pachyderm Enterprise for a few weeks to test features like access controls and to decide whether we need the dashboard.
At this stage we can also explore how we want to set up exploratory compute (whether that's a single EC2 instance or JupyterHub on the same EKS cluster).

### Production AWS Deployment

From the results of our test deployment, we can determine whether we need Pachyderm Enterprise
and how we are going to do exploratory compute (local, EC2, or EKS) before we spin up a
compute cluster for good. This will also depend on funding availability.
