# Enhance

![](https://media1.giphy.com/media/3ohc14lCEdXHSpnnSU/giphy.gif?cid=790b761176a2f235b5c2c921c7878f38e6f000c216a46859&rid=giphy.gif&ct=g)

Tools for combining point data with gridded data.

## Enhancing steps

For most source datasets, we first split the point data by day, enhance each day with the various environmental data sources, then recombine the days. (I believe the enhancing methods can now handle input that mixes days, but splitting first is probably more reliable.)

This pipeline contains the tooling for each of these steps.

**So far this process has only been run with ICCAT, bathymetry, and HYCOM data, so it may still contain source-specific assumptions despite the generalization that has been done.**

### Splitting by date

`split_by_date.py`, unsurprisingly, does the lifting of splitting our source CSV into individually dated CSVs.

During testing with DVC it may be called as `./split_by_date.py --limit=5 /data/iccat/with_pseudoabs.csv /data/enhance/iccat/split/`

This reads in the CSV at `/data/iccat/with_pseudoabs.csv` and outputs dated CSVs to `/data/enhance/iccat/split/`, so `/data/enhance/iccat/split/1992-10-03.csv`, `/data/enhance/iccat/split/1992-10-04.csv`, and so forth.
With `--limit=5` it outputs only the first 5 days' worth of data, as we only need a subset when testing locally.

A `--start_date` flag can be added to grab only a specified subset of the data, though the script already limits output to the range covered by HYCOM. The date is specified as `YYYY-MM-DD`, for example `--start_date=2013-01-27`.
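As a rough illustration of the splitting step, here is a minimal standard-library sketch. It is not the code in `split_by_date.py`: the function name `split_rows_by_date`, the `date` column name, and the sample rows are all hypothetical, and it assumes `--start_date` means "keep days on or after this date".

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the real source CSV has different columns.
SAMPLE = (
    "date,lat,lon\n"
    "2013-01-26,10.0,-40.0\n"
    "2013-01-27,11.0,-41.0\n"
    "2013-01-27,12.0,-42.0\n"
    "2013-01-28,13.0,-43.0\n"
)


def split_rows_by_date(rows, date_field="date", start_date=None, limit=None):
    """Group rows by day, optionally dropping early days and capping the day count."""
    by_day = defaultdict(list)
    for row in rows:
        day = row[date_field][:10]  # assumes ISO dates, so strings sort chronologically
        if start_date is None or day >= start_date:
            by_day[day].append(row)
    days = sorted(by_day)
    if limit is not None:
        days = days[:limit]  # keep only the first `limit` days
    return {day: by_day[day] for day in days}


parts = split_rows_by_date(
    csv.DictReader(io.StringIO(SAMPLE)), start_date="2013-01-27", limit=5
)
for day, rows in parts.items():
    # A real script would write each group to "<output_dir>/<day>.csv" here.
    print(f"{day}.csv: {len(rows)} rows")
```

Grouping on the first ten characters of the date keeps the sketch independent of any time-of-day component, but that only holds if the column is ISO formatted.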

### Enhancing

The enhancing process currently happens in `enhance_iccat.py`.
(This will probably get renamed once it is fully generalized and tested with multiple sources of point data.)

To help standardize the enhancing process, two classes in `methods.py` are configured to manage the extraction of gridded data and match it up with the points.
Having both go through these classes helps present a similar interface to users.

A quick way to decide which class to use is to figure out how 'expensive' the gridded data is to access.
For 'expensive' sources, the `methods.EnhanceGroupByApply` class is used, and for 'cheap' ones, `methods.EnhancePointwise`.
`methods.EnhanceGroupByApply` is also used if, say, a window calculation needs to be computed for each point.

The classes present the same interface for enhancement: an `.enhance(df: pd.DataFrame) -> pd.DataFrame` method.
They account for the differences between their source gridded data via the configuration of their init methods.
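To make the shared interface concrete, here is a small sketch of two classes that differ only in how they access data but both expose `.enhance(df)`. These are not the classes in `methods.py`: the names `PointwiseSketch` and `GroupByApplySketch`, their constructor arguments, the `lat`/`lon`/`date` columns, and the lookup functions are all assumptions made for illustration.

```python
import pandas as pd


class PointwiseSketch:
    """Hypothetical 'cheap' case: one lookup call per point."""

    def __init__(self, lookup, column):
        self.lookup = lookup  # assumed signature: (lat, lon) -> value
        self.column = column

    def enhance(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out[self.column] = [
            self.lookup(lat, lon) for lat, lon in zip(df["lat"], df["lon"])
        ]
        return out


class GroupByApplySketch:
    """Hypothetical 'expensive' case: load the grid once per group, reuse it per point."""

    def __init__(self, load_grid, group_column, column):
        self.load_grid = load_grid  # assumed signature: group key -> {(lat, lon): value}
        self.group_column = group_column
        self.column = column

    def enhance(self, df: pd.DataFrame) -> pd.DataFrame:
        parts = []  # this sketch assumes df has at least one row
        for key, group in df.groupby(self.group_column):
            grid = self.load_grid(key)  # the single expensive access for this group
            part = group.copy()
            part[self.column] = [
                grid[(lat, lon)] for lat, lon in zip(part["lat"], part["lon"])
            ]
            parts.append(part)
        return pd.concat(parts).sort_index()


# Made-up stand-ins for real gridded sources.
loads = []


def load_daily_grid(day):
    loads.append(day)  # record how often the "expensive" source is opened
    offset = 0.0 if day == "2013-01-27" else 1.0
    return {(10.0, -40.0): 20.0 + offset, (11.0, -41.0): 21.0 + offset}


def depth_lookup(lat, lon):
    return -1000.0 - lat


points = pd.DataFrame({
    "date": ["2013-01-27", "2013-01-27", "2013-01-28"],
    "lat": [10.0, 11.0, 10.0],
    "lon": [-40.0, -41.0, -40.0],
})

enhanced = points
for enhancer in (
    PointwiseSketch(depth_lookup, "depth"),
    GroupByApplySketch(load_daily_grid, "date", "grid_value"),
):
    enhanced = enhancer.enhance(enhanced)  # same call regardless of strategy
print(enhanced)
```

Because both classes take and return a DataFrame, the calling code can chain them in a loop without caring which access strategy each one uses; in this sketch the grid loader runs once per day rather than once per point.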

Even with these helper methods, it's easiest to extract the enhancement steps for each data source into a separate function, as it helps keep the `main()` function clear and easy to understand.

The enhance process is called with something along the lines of `./enhance_iccat.py /data/enhance/iccat/split/${item}.csv /data/enhance/iccat/enhanced/${item}.csv /data/enhance/bathy_rugosity.nc`, with a NetCDF specified for the source of the bathymetry and rugosity data.

### Combining

`combine.py` does the lifting of taking a source directory of CSVs and combining them into a single file.

`./combine.py /data/enhance/iccat/enhanced/ /data/enhance/iccat/iccat-enhanced.csv`
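For reference, the combining step can be sketched with the standard library as below. This is not the code in `combine.py`: the function name `combine_csvs` is hypothetical, and the sketch assumes every input CSV shares the same header and that the output file lives outside the source directory.

```python
import csv
import tempfile
from pathlib import Path


def combine_csvs(source_dir, output_path):
    """Concatenate every CSV in source_dir into one file; return the row count."""
    total = 0
    with open(output_path, "w", newline="") as out_file:
        writer = None
        # Sorting by filename keeps dated files (YYYY-MM-DD.csv) in date order.
        for path in sorted(Path(source_dir).glob("*.csv")):
            with open(path, newline="") as in_file:
                reader = csv.DictReader(in_file)
                if writer is None:
                    # Write the header once, taken from the first file.
                    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    total += 1
    return total


# Demonstration with two made-up daily files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    enhanced_dir = Path(tmp) / "enhanced"
    enhanced_dir.mkdir()
    (enhanced_dir / "2013-01-28.csv").write_text("date,lat\n2013-01-28,13.0\n")
    (enhanced_dir / "2013-01-27.csv").write_text(
        "date,lat\n2013-01-27,11.0\n2013-01-27,12.0\n"
    )
    combined_path = Path(tmp) / "combined.csv"
    row_count = combine_csvs(enhanced_dir, combined_path)
    combined_lines = combined_path.read_text().splitlines()

print(row_count)
print(combined_lines)
```

Streaming row by row avoids holding every day in memory at once, which may matter if the enhanced files are large.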
