# ICCAT Marker Tag Data

This workflow deals with marker tag data from the International Commission for the Conservation of Atlantic Tunas (ICCAT). These data consist of locations where tagged animals were released and, in some cases, where the same individuals were recaptured. Here, we treat these data as "presence" data that can subsequently be used to build species distribution models for quantifying habitat suitability for a particular species.

The data generally contain latitude, longitude, date, species, and other metrics such as fish size and vessel characteristics. Sample sizes vary from hundreds for less common species (e.g. skipjack tuna) to many thousands for common species that are often released (e.g. blue sharks). The data depend entirely on fishing effort and on a fish being undesirable for harvest (for whatever reason) and therefore tagged and released. They come from several fishery-related sources, such as commercial pelagic longline vessels and recreational fishers. Thus, as you might imagine, there are many potential caveats associated with their use for determining habitat suitability.

Regardless, the data are an excellent source of information on where these species occur and can thus be used, with caution, for quantifying habitat suitability and ultimately predicting habitats.

## Pipeline overview

![iccat](./images/iccat.drawio.svg)

The pipeline is defined in [`dvc.yaml`](../../dvc.yaml) for running locally with DVC, and in [`./py/iccat.py`](./py/iccat.py) for running in Dagster.

In both cases, the pipeline uses a [Docker container](./Dockerfile) built by the `iccat` service as defined in [`docker-compose.yaml`](../../docker-compose.yaml), with `data/` mapped as `/data/`.

## Elements of Pipeline

To get and prepare this data for use in our modeling framework, this ICCAT-specific workflow contains four main processing steps, as outlined above:

- Per species
    - `iccat_download` - Data is sourced from the ICCAT website as rar-compressed nasty `.xlsx` files.
    - `iccat_qc` - These are then read and quality controlled before use. The QC process primarily formats the data to something more friendly/usable and does some limited filtering of spurious locations (e.g. on land).
- `iccat_combine` - Next, the quality-controlled data for each species is combined into a master ICCAT dataset for subsequent use.
- `iccat_pseudoabs` - Finally, pseudoabsences are generated for the cleaned and combined ICCAT data as this is a necessary step prior to building species distribution models (SDMs) for this data.

> ### ICCAT Species
>
> ICCAT identifies each species it has tracking data on with a three-letter code.
>
> - Albacore - `ALB`
> - Bigeye Tuna - `BET`
> - Atlantic Bluefin Tuna - `BFT`
> - Blue Shark - `BSH`
> - Atlantic Blue Marlin - `BUM`
> - Porbeagle - `POR`
> - Atlantic Sailfish - `SAI`
> - Skipjack Tuna - `SKJ`
> - Shortfin Mako - `SMA`
> - Swordfish - `SWO`
> - Atlantic White Marlin - `WHM`
> - Yellowfin Tuna - `YFT`

### ICCAT Download <img src="./images/iccat_download.drawio.svg" align="right">

This step downloads marker tag data from the ICCAT website and extracts the `.xlsx` file contained in the downloaded archive.

#### Script

See the `./iccat/R/` directory for instructions on `iccat_download.r`, which is the script this pipeline calls to do the work.

#### Step

This step runs for each individual species.
`iccat_download.r` gets a file containing the species code, which identifies the species it should download data for.
The script also receives an output `.xlsx` path that it should save the data to.

##### DVC

For DVC, the species codes are in `data/iccat/species/` and the outputs go to `data/iccat/download/`. The script is called via Docker Compose (`docker-compose run iccat ./R/iccat_download.r /data/iccat/species/ALB.txt /data/iccat/download/_tagALB.xlsx`), where `/data/` is mapped to the `data/` directory at the top level of the repo. The DVC `foreach` input to the `iccat_download` stage is used to iterate through the desired species codes.
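
The per-species invocation above can be sketched in Python. This is only an illustration of how the species file, output path, and command line fit together, not the repo's actual DVC stage definition. The helper name `download_command` and the three-species subset are hypothetical; the paths follow the example command in the text.

```python
from typing import List

# A subset of the ICCAT species codes listed earlier (illustrative only).
SPECIES_CODES = ["ALB", "BET", "BFT"]


def download_command(code: str) -> List[str]:
    """Build the docker-compose argv for one species (hypothetical helper)."""
    species_file = f"/data/iccat/species/{code}.txt"
    output_file = f"/data/iccat/download/_tag{code}.xlsx"
    return [
        "docker-compose", "run", "iccat",
        "./R/iccat_download.r", species_file, output_file,
    ]


if __name__ == "__main__":
    # One command per species, mirroring what a DVC `foreach` would iterate over.
    for code in SPECIES_CODES:
        print(" ".join(download_command(code)))
```

Passing the returned list to `subprocess.run` would execute the step for that species, assuming Docker Compose and the `iccat` service are available.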

##### Dagster

Dagster calls `iccat_download.r` with temporary source and output paths and then persists the output `.xlsx` to S3 (though it currently goes to local storage at `data/dagster/iccat/download/` instead).
For each species `.xlsx`, an asset is created as `iccat > download > SPECIES_CODE`.

### ICCAT QC <img src="./images/iccat_qc.drawio.svg" align="right">

This step performs quality control on the marker tag data previously downloaded from the ICCAT website.

#### Script

See the `./iccat/R/` directory for instructions on `iccat_qc.r`, which is the script this pipeline calls to do the work.

#### Step

This step runs for each individual species.
`iccat_qc.r` gets a file with downloaded data for a given species and performs very simple cleaning, organizing, and quality control of the data. The script receives an output `.csv` path that it should save data to and a path to bathymetric data that is used to remove erroneous positions on land. The bathymetry data can be downloaded from our tracked version on DigitalOcean using `dvc import-url https://nasa-facets-testing.nyc3.digitaloceanspaces.com/global_bathy_0.01.nc data/bathy/global_bathy_0.01.nc`. This path can also be found in the `.dvc` file for this asset, which is stored in the same directory as the bathymetry data: `data/bathy/global_bathy_0.01.nc.dvc`.
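
To illustrate the idea of removing positions on land with a bathymetry grid, here is a minimal Python sketch. It is not the logic in `iccat_qc.r`, which is written in R and may differ. It assumes a regular grid held as nested lists, that negative values mean below sea level, and that records carry `lat` and `lon` keys; the function name, column names, and grid layout are all hypothetical.

```python
from typing import Dict, List, Sequence


def drop_land_points(
    records: List[Dict[str, float]],
    grid: Sequence[Sequence[float]],
    lat0: float,
    lon0: float,
    res: float,
) -> List[Dict[str, float]]:
    """Keep records whose nearest grid cell is below sea level.

    grid[0][0] is assumed to be centred at (lat0, lon0), with rows
    increasing in latitude and columns in longitude, `res` degrees apart.
    """
    kept = []
    for rec in records:
        row = round((rec["lat"] - lat0) / res)
        col = round((rec["lon"] - lon0) / res)
        inside = 0 <= row < len(grid) and 0 <= col < len(grid[0])
        # Assumption: negative elevation means ocean; points outside the grid are dropped.
        if inside and grid[row][col] < 0:
            kept.append(rec)
    return kept
```

A real implementation would read the NetCDF file rather than a nested list, but the keep-or-drop decision per record would follow the same pattern.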

### ICCAT Combine <img src="./images/iccat_combine.drawio.svg" align="right">

This step combines the species-specific marker tag data into one aggregated `.csv` using the marker tag data that was previously quality-controlled.

#### Script

See `./tools/combine.py` for instructions on the generic Python-based combine script, `combine.py`, which this pipeline calls to do the work.

#### Step

This step takes the quality-controlled output for each individual species and produces a single output that aggregates across species.
`combine.py` takes input paths for each species-specific data file and combines them into a master `.csv`. The script also receives an output `.csv` path that it should save data to.
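
As a rough illustration of what combining the per-species files involves, the sketch below concatenates CSVs that share a header using only the standard library. It is not the contents of `combine.py`; the function name `combine_csvs` and the requirement that all headers match exactly are assumptions made for this example.

```python
import csv
from typing import Iterable


def combine_csvs(input_paths: Iterable[str], output_path: str) -> int:
    """Concatenate CSVs that share a header; return the number of data rows written."""
    rows_written = 0
    fieldnames = None
    writer = None
    with open(output_path, "w", newline="") as out_file:
        for path in input_paths:
            with open(path, newline="") as in_file:
                reader = csv.DictReader(in_file)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first non-empty file defines the output header.
                    fieldnames = reader.fieldnames
                    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
                    writer.writeheader()
                elif reader.fieldnames != fieldnames:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing loudly on mismatched headers is a deliberate choice in this sketch, since silently aligning columns could hide a QC step that produced a different schema for one species.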

### ICCAT Pseudoabs <img src="./images/iccat_pseudoabs.drawio.svg" align="right">

This step generates pseudoabsences for ICCAT marker tag data using a generic pseudoabsence generation utility for `R`. See `./tools/pseudoabs/` for details.

#### Script

See the `./tools/generate_pseudoabs/` directory for instructions on the tool implemented for this use case, which calls `generate_pseudoabs.r` to do the work.

#### Step

This step takes in the combined data output from the previous `iccat_combine` step. It splits the combined data by an index (provided as an optional argument, `--index_var`), in this case `SpeciesCode`, and then generates a number of random "draws" from within the spatiotemporal limits of the input data, resulting in the desired ratio of pseudoabsence to presence data (optional `--abs_ratio` argument). For further details, see `./tools/generate_pseudoabs/` and the core function that does the work, `./R/sp_random.r`. This function also requires the same input bathymetry file used previously in `iccat_qc`, which is used to keep pseudoabsences off land. Finally, an output path must be specified for the resulting `.csv` file, which contains both presence and pseudoabsence locations for all ICCAT species.
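
The general idea of drawing pseudoabsences within the spatiotemporal limits of each group can be sketched as follows. This is a simplified Python illustration, not a port of `sp_random.r`, whose sampling details are not described here. The record keys `lat`, `lon`, and `day`, the `presence` flag, the `is_ocean` callback, and the retry cap are all assumptions made for the example; only `index_var` and `abs_ratio` echo the arguments named in the text.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def generate_pseudoabs(
    presences: List[Dict],
    index_var: str = "SpeciesCode",
    abs_ratio: int = 1,
    is_ocean: Optional[Callable[[float, float], bool]] = None,
    seed: Optional[int] = None,
) -> List[Dict]:
    """Draw abs_ratio random points per presence within each group's bounding box."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in presences:
        groups[rec[index_var]].append(rec)

    pseudoabs = []
    for key, recs in groups.items():
        lats = [r["lat"] for r in recs]
        lons = [r["lon"] for r in recs]
        days = [r["day"] for r in recs]
        target = abs_ratio * len(recs)
        drawn, tries = 0, 0
        # Cap the attempts so a mostly-land bounding box cannot loop forever.
        while drawn < target and tries < 100 * target:
            tries += 1
            lat = rng.uniform(min(lats), max(lats))
            lon = rng.uniform(min(lons), max(lons))
            if is_ocean is not None and not is_ocean(lat, lon):
                continue  # reject draws that fall on land
            day = rng.randint(min(days), max(days))
            pseudoabs.append(
                {index_var: key, "lat": lat, "lon": lon, "day": day, "presence": 0}
            )
            drawn += 1
    return pseudoabs
```

This sketch returns only the pseudoabsence records; appending them to the presence records would give the combined presence and pseudoabsence table that the step writes out.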

This is the end of the ICCAT-specific workflow. The output from `iccat_pseudoabs` is used as input to the `../iccat-hycom/iccat_by_date` repo.