Published August 18, 2025 | Version v3
Software Open

From Reanalysis to Climatology: Deep Learning Reconstruction of Tropical Cyclogenesis in the Western North Pacific

  • 1. Hanoi University of Science and Technology
  • 1. Vietnam National University, Hanoi
  • 2. Hanoi University of Science and Technology
  • 3. Indiana University Bloomington

Description

1. Introduction

This code repository provides (i) a workflow to generate meteorological observation samples from different climatological datasets; and (ii) a deep learning framework, based on the ResNet-18 architecture, to exploit the generated data.

For the former, three climatological datasets are used:

  1. MERRA-2 by the National Aeronautics and Space Administration (NASA) (Gelaro et al., 2017)
  2. FNL by the National Centers for Environmental Prediction (NCEP) (NCEP, 2000), and
  3. Tropical cyclone information from the International Best Track Archive for Climate Stewardship (IBTrACS) (Knapp et al., 2010).

The standardized datasets will be used for subsequent ML tasks in the latter framework.

2. Input data

The NCEP-FNL dataset includes 27 meteorological data fields. These data are structured on a three-dimensional grid: latitude ranges from -90 to +90 degrees with 1-degree spacing; longitude from 0 to +360 degrees, also spaced at 1-degree intervals; and altitude is represented by atmospheric pressure levels, with 26 levels ranging from 10 millibars to 1000 millibars. The dataset covers the period from July 1, 1999, to December 31, 2022, with data sampled every 6 hours. Before 2010, NCEP-FNL was stored in GRIB1 format and later switched to GRIB2. Each data file contains a single forecast instance and is between 20–40 MB in size, with the full dataset totaling approximately 400 GB.

The NASA-MERRA2 dataset contains 13 meteorological variables. The data are mapped onto a two-dimensional grid: longitude spans from -180 to +180 degrees with a resolution of 0.625 degrees; latitude ranges from -90 to +90 degrees with a resolution of 0.5 degrees. Altitude is defined by 42 pressure levels, ranging from 1000 hPa to 0.1 hPa. The data were collected from January 1, 1980, to December 31, 2022, with a sampling frequency of every 3 hours. Each file contains data for 4 forecast times per day, stored in NetCDF format, and is approximately 2.2–2.3 GB in size. The complete dataset occupies around 18 TB.

The IBTrACS dataset provides information on tropical cyclone positions. It is structured on a two-dimensional grid with a longitude range of 0 to +360 degrees (0.001-degree resolution) and a latitude range from -90 to +90 degrees (also 0.001-degree resolution). The data span from the 1980s to December 31, 2022, and are sampled every 3 hours. The IBTrACS dataset is stored in a single CSV file with a total size of approximately 300 MB.
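The datasets described above do not share a longitude convention: MERRA-2 is given on -180 to +180 degrees, while FNL and IBTrACS are given on 0 to +360 degrees. The following is a minimal sketch of converting between the two conventions before matching cyclone positions to grid cells. The function names are illustrative assumptions and are not taken from the repository's preprocessing scripts.

```python
import numpy as np


def to_0_360(lon):
    """Map longitudes from the [-180, 180) convention to [0, 360)."""
    return np.mod(np.asarray(lon, dtype=float), 360.0)


def to_pm180(lon):
    """Map longitudes from the [0, 360) convention to [-180, 180)."""
    return np.mod(np.asarray(lon, dtype=float) + 180.0, 360.0) - 180.0


# Example values only; 135.625 is a multiple of the 0.625-degree MERRA-2 spacing.
merra2_lon = np.array([-180.0, -90.0, 0.0, 135.625])
ibtracs_lon = np.array([0.0, 180.0, 270.0, 359.0])

print(to_0_360(merra2_lon))
print(to_pm180(ibtracs_lon))
```

Both helpers are element-wise, so they can be applied to a whole coordinate axis or to a single cyclone position.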

3. Preprocessing and Sampling

3.1. Prerequisites

The following command is used to install the required libraries (only needs to be executed once):

conda create -n hurricane-ml python=3.10 ipykernel autopep8 numpy xesmf pandas xarray pynio netcdf4 matplotlib cfgrib pytables h5netcdf tqdm alive-progress -c conda-forge -y

3.2. Data Preprocessing

Run on a local machine or use interactive shells: 

# Pre-processing
python3 Preprocess_Ibtracs_Fnl.py
python3 Preprocess_Ibtracs_Merra2.py

Using Slurm Jobs

In addition to running locally, the above source code can also be executed on a server by submitting Slurm jobs. Sample Slurm job scripts are provided in the ./scripts/slurm_job directory.

3.3. Data Sampling

Positive Sampling

Each positive sample is a data file containing information within a 33×33 grid centered around the eye of the targeted tropical cyclone at the forecast timestamp when it forms.
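As an illustration of the 33×33 window, the sketch below cuts a patch centered on the grid cell nearest to a given cyclone position. It assumes a regular latitude/longitude grid with the MERRA-2 spacing quoted in Section 2; the function name, the random field, and the example position are hypothetical, and the repository's extraction scripts may handle grid edges differently.

```python
import numpy as np


def extract_window(field, lat, lon, center_lat, center_lon, size=33):
    """Return a size x size patch centered on the grid cell nearest the center."""
    half = size // 2
    i = int(np.abs(lat - center_lat).argmin())  # nearest latitude index
    j = int(np.abs(lon - center_lon).argmin())  # nearest longitude index
    if i - half < 0 or i + half >= lat.size or j - half < 0 or j + half >= lon.size:
        raise ValueError("window crosses the edge of the grid")
    # The ellipsis keeps any leading dimensions (e.g. variables, pressure levels).
    return field[..., i - half:i + half + 1, j - half:j + half + 1]


# Assumed grid: 0.5-degree latitude and 0.625-degree longitude spacing.
lat = np.arange(-90.0, 90.5, 0.5)
lon = np.arange(-180.0, 180.0, 0.625)
rng = np.random.default_rng(0)
field = rng.random((lat.size, lon.size))  # stand-in for one meteorological field

patch = extract_window(field, lat, lon, center_lat=15.0, center_lon=135.0)
print(patch.shape)  # (33, 33)
```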

Negative Sampling Using the Past Domain Strategy (PastDomain):

The PastDomain strategy generates negative samples by using the location of a positive sample and going back in time by n ≥ 1 timestamps. This method results in a more balanced number of negative samples compared to positive ones. Negative sample filenames under this strategy follow the convention: {SID}_{n}.nc4, where n ∈ [1, N] represents the number of time steps prior to the positive sample timestamp.
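The naming convention can be sketched as follows. The example pairs each negative-sample filename with the timestamp it would correspond to, assuming a 3-hour step (the MERRA-2 sampling interval quoted in Section 2). The function name, the storm identifier, and the genesis time are hypothetical.

```python
from datetime import datetime, timedelta


def past_domain_samples(sid, genesis_time, n_steps, step_hours=3):
    """List (filename, timestamp) pairs for n = 1..n_steps before genesis."""
    samples = []
    for n in range(1, n_steps + 1):
        filename = f"{sid}_{n}.nc4"  # convention from the text: {SID}_{n}.nc4
        timestamp = genesis_time - timedelta(hours=n * step_hours)
        samples.append((filename, timestamp))
    return samples


# Hypothetical storm ID and genesis time, for illustration only.
for name, when in past_domain_samples("EXAMPLE_SID", datetime(2020, 8, 1, 12), 3):
    print(name, when.isoformat())
```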

Negative Sampling Using the Dynamic Domain Strategy (DynamicDomain):

The DynamicDomain strategy selects negative samples from regions adjacent to the positive sample location. For each positive sample, eight negative samples are generated, each corresponding to one of the neighboring grid blocks in the 2D space: Northwest (nw), North (n), Northeast (ne), West (w), East (e), Southwest (sw), South (s), and Southeast (se) (as illustrated in Figure 1). Additionally, samples are taken from n ≥ 0 time steps in the past for each of the eight neighboring regions.
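The eight neighboring blocks can be enumerated with a small offset table, as in the sketch below. The block width in degrees is an assumption (16.5 degrees would correspond to 33 cells at 0.5-degree spacing), and the function and variable names are illustrative rather than the ones used in Extract_DynamicDomain.py.

```python
# (latitude multiplier, longitude multiplier) for each neighboring block.
NEIGHBOR_OFFSETS = {
    "nw": (1, -1), "n": (1, 0), "ne": (1, 1),
    "w": (0, -1), "e": (0, 1),
    "sw": (-1, -1), "s": (-1, 0), "se": (-1, 1),
}


def neighbor_centers(center_lat, center_lon, block_deg):
    """Return the center (lat, lon) of each of the eight adjacent blocks.

    Longitudes are wrapped to the 0-360 convention.
    """
    return {
        name: (center_lat + dlat * block_deg, (center_lon + dlon * block_deg) % 360.0)
        for name, (dlat, dlon) in NEIGHBOR_OFFSETS.items()
    }


# Hypothetical positive-sample location and block width.
centers = neighbor_centers(15.0, 135.0, block_deg=16.5)
for name, (lat_c, lon_c) in centers.items():
    print(name, lat_c, lon_c)
```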

The following commands are used to generate samples using the above-mentioned strategies:

# Data extraction
python3 Extract_FixedDomain.py
python3 Extract_DynamicDomain.py
python3 Extract_PastDomain.py

3.4. Data Postprocessing

The following commands are used to analyze the NaN and noise statistics of the preprocessed data:

# Post-processing (NaN and noise statistics)
python3 Analyze_NaNStat.py
python3 Analyze_NoiseFinder.py
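For intuition, a NaN statistic of this kind can be computed as the fraction of missing values per variable. The sketch below is an assumption about what such a statistic might look like, not the logic of Analyze_NaNStat.py; the variable names and array shapes are made up for the example.

```python
import numpy as np


def nan_fraction(sample):
    """Return the fraction of NaN entries for each variable in a sample."""
    return {name: float(np.isnan(values).mean()) for name, values in sample.items()}


# Hypothetical sample with two variables on a 33x33 window.
sample = {
    "var_a": np.ones((33, 33)),
    "var_b": np.ones((33, 33)),
}
sample["var_b"][:, :11] = np.nan  # mark the first 11 columns as missing

print(nan_fraction(sample))
```

Variables whose NaN fraction exceeds a chosen threshold could then be flagged for inspection or excluded from training.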

5. Evaluation using ResNet-18

In our work "From Reanalysis to Climatology: Deep Learning Reconstruction of Tropical Cyclogenesis in the Western North Pacific," we adapted the original ResNet-18 architecture for our tropical cyclogenesis (TCG) applications. The modified network consists of eight residual blocks, preceded by an initial convolutional layer for input embedding, and followed by a fully connected layer with a softmax activation to predict the probability of storm occurrence, forming a total of 18 layers.
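The count of 18 layers follows from the standard ResNet-18 layout, in which each basic residual block contains two convolutional layers. The short sketch below spells out that arithmetic; the function name is illustrative, and the two-convolutions-per-block figure is taken from the standard architecture rather than stated in this description.

```python
def count_weighted_layers(n_residual_blocks=8, convs_per_block=2):
    """Count the weighted layers: stem convolution + block convolutions + classifier."""
    stem_conv = 1       # initial convolutional layer for input embedding
    classifier_fc = 1   # final fully connected layer
    return stem_conv + n_residual_blocks * convs_per_block + classifier_fc


print(count_weighted_layers())  # 18
```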

5.1. Evaluation using Past Domain Sampling Strategy

For detailed instructions, see past_domain/README.md

Libraries installation

To install all required libraries, run the following commands:

pip install torch torchvision
pip install numpy pandas matplotlib seaborn scikit-learn tqdm
pip install pytorch-metric-learning
pip install xarray

Running the Script

To run the scripts above, you need the following inputs:

  • A directory path containing MERRA2 files with the format: `merra2_19800101_00_00.nc`
  • An `IBTRACS.csv` file containing tropical cyclone information (placed in the `csv` directory)

Execute the `prepare_csv.py` script to generate the necessary input CSV files for training the model:

python prepare_csv.py

The output will include the following files (located in the `csv` directory):

  • `data_statistics.xlsx` – statistical summaries of MERRA2 features, used for data normalization
  • `FIRST_MERRA2_IBTRACS.csv` – information on the first occurrence time of each storm
  • `merra_full_new.csv` – list of paths to MERRA2 files, each with a label:
      • `-1` if the file corresponds to a time during a storm (but not the first occurrence)
      • blank if no storm is occurring at that time
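The statistics in `data_statistics.xlsx` are described as being used for data normalization. One common choice is per-feature z-score normalization, sketched below under the assumption that a mean and a standard deviation are available for each feature. The function name, array layout (features, latitude, longitude), and synthetic data are illustrative; the normalization actually implemented in main.py may differ.

```python
import numpy as np


def zscore(x, mean, std, eps=1e-8):
    """Normalize an array of shape (features, lat, lon) feature by feature."""
    mean = np.asarray(mean, dtype=float).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=float).reshape(-1, 1, 1)
    return (x - mean) / (std + eps)  # eps guards against zero variance


# Synthetic sample: 3 features on a 33x33 window.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(3, 33, 33))
feature_mean = x.mean(axis=(1, 2))
feature_std = x.std(axis=(1, 2))

z = zscore(x, feature_mean, feature_std)
print(z.mean(axis=(1, 2)), z.std(axis=(1, 2)))
```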

After this step, proceed to train and evaluate the model.

To train and evaluate the model, use the following command:

python main.py --time t2_rus4_cw3_fe --norm_type new --lr 1e-7 --pos_ind 2 --under_sample --rus 4 --class_weight 3 --small_set

Command-Line Arguments:

See past_domain/README.md for details.

Example Usage

To train with undersampling and specific class weights:
This run uses the `new` normalization method, a learning rate of 1e-7, positive samples taken at time step t-2, undersampling with a ratio of 1:10, and a class weight of 1 (balanced).

python main.py --time experiment_1 --norm_type new --lr 1e-7 --pos_ind 2 --under_sample --rus 10 --class_weight 1
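To make the undersampling ratio concrete, the sketch below keeps every positive sample and draws a random subset of negatives so that there are at most `ratio` negatives per positive. This is a generic illustration of random undersampling, assuming labels of 1 for positive and 0 for negative and that 1:10 means positives to negatives; it is not the code behind the `--under_sample` and `--rus` options.

```python
import numpy as np


def random_undersample(labels, ratio, seed=0):
    """Return sorted indices of all positives plus at most ratio * n_pos negatives."""
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n_keep = min(neg_idx.size, ratio * pos_idx.size)
    rng = np.random.default_rng(seed)
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))


# Synthetic labels: 5 positives and 200 negatives.
labels = np.array([1] * 5 + [0] * 200)
selected = random_undersample(labels, ratio=10)
print(selected.size, int(labels[selected].sum()))  # 55 5
```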

For a quick test using a small dataset:
This run uses the `new` normalization method, a learning rate of 1e-7, positive samples taken at time step t-2, no undersampling, only a small subset of the data, and a class weight of 1 (balanced).

python main.py --time test_run --norm_type new --lr 1e-7 --pos_ind 2 --small_set --class_weight 1

Evaluating a Trained Model on Full Months (eval_fullyear.py)

Besides the standard training and testing routine, this project includes a dedicated script for full-year predictions. Run eval_fullyear.py as follows:

python eval_fullyear.py --timestep t2_rus4_cw1 --strict --fullmonth --model_path result/model/model.pth

Command-Line Arguments

See past_domain/README.md for details.

Output 

Results for --fullmonth enabled: ./result_fullmap/all_months

Results for --fullmonth disabled: ./result_fullmap/storm_months

5.2. Dynamic Domain Sampling Strategy

For detailed instructions, see dynamic_domain/README.md

Libraries installation

To install all required libraries, run the following command, either in a native Python environment or in a virtual environment:

pip install pandas tqdm numpy torch scikit-learn opencv-python xarray pandarallel matplotlib basemap

Running script

Prepare data

  • Data should be in NetCDF format and indexed by a CSV file called "data_path", which contains the path of each sample and its metadata (position, step, ...).
  • Run this command to generate the train, validation, and test data in CSV format for a single prediction step:

python Prepare_data.py --path $path --step $step --ratio $ratio --dst $dst

where:

  • $path: the csv file containing the path(s) of every sample
  • $step: the forecast step; one step corresponds to a 3-hour lead time
  • $ratio: the undersampling ratio applied to the dataset
  • $dst: output directory
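Since one step stands for 3 hours, a forecast step converts to a lead time by simple multiplication. The sketch below shows the conversion; the function and constant names are illustrative and do not appear in Prepare_data.py.

```python
from datetime import timedelta

HOURS_PER_STEP = 3  # from the text: one step corresponds to 3 hours


def step_to_lead_time(step):
    """Convert a forecast step into a lead time."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return timedelta(hours=step * HOURS_PER_STEP)


for step in (2, 8, 18):
    hours = step_to_lead_time(step).total_seconds() / 3600
    print(step, hours)
```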

Modeling

Run this command to train the model from scratch:

python Train.py --inp_dir $inp_dir --out_dir $out_dir --weight $weight --map_path $map_path

where:

  • $inp_dir: the input dataset directory, containing a training set, a validation set, and a test set.
  • $out_dir: the output directory, where model checkpoints and evaluation results are saved.
  • $weight: the class weight assigned to positive samples; if set to 0, balanced class weights are computed from the data.
  • $map_path: the CSV file containing the path(s) of every sample used for map evaluation.
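A common way to compute balanced class weights is the heuristic w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the number of samples in class c. The sketch below applies that heuristic when the weight is 0 and otherwise uses the given weight for the positive class. Whether Train.py uses exactly this formula is an assumption, and the function names are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: N / (K * n_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def resolve_class_weights(labels, weight):
    """Use balanced weights when weight == 0, else a fixed positive-class weight."""
    if weight == 0:
        return balanced_class_weights(labels)
    return {0: 1.0, 1: float(weight)}


# Synthetic labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
print(resolve_class_weights(labels, 0))
print(resolve_class_weights(labels, 3))
```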

Evaluating

Run this command to evaluate model performance on the selected area:

python Map_eval.py --temp $temp --out $out

where:

  • $temp: the CSV path template of the map predictions for every forecast step from 2 to 18. This should be a CSV file path in which the forecast step is left as a {} placeholder.
  • $out: the output directory where the scoreboard will be exported
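The {} placeholder can be filled with Python's `str.format`, as in the sketch below, which expands one template into a path per forecast step. The template path is hypothetical, and the step range is assumed to include both 2 and 18.

```python
def expand_template(template, first_step=2, last_step=18):
    """Map each forecast step to the path obtained by filling the {} placeholder."""
    return {step: template.format(step) for step in range(first_step, last_step + 1)}


# Hypothetical template path, for illustration only.
paths = expand_template("results/map_step_{}.csv")
print(len(paths))  # 17
print(paths[2])    # results/map_step_2.csv
```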

Run this command to visualize the spatial distribution of the mean score in the selected area:

python Spatial_map.py --temp $temp --out $out

where:

  • $temp: the CSV path template of the map predictions for every forecast step from 2 to 18. This should be a CSV file path in which the forecast step is left as a {} placeholder.
  • $out: the output directory where the figure will be exported

Cite this Work

  1. Duc-Trong Le, Tran-Binh Dang, Anh-Duc Hoang Gia, Duc-Hai Nguyen, Minh-Hoa Tien, Quang-Trung Luu, Quang-Lap Luu, Tai-Hung Nguyen, Thanh T. N. Nguyen, and Chanh Kieu, "From Reanalysis to Climatology: Deep Learning Reconstruction of Tropical Cyclogenesis in the Western North Pacific," submitted to Geoscientific Model Development, 2025.

Files

tcgnet_full_v3.zip

Size: 1.0 MB
md5:4b21b59ef3a6774b91b4c6634a39f892

Additional details

Dates

Updated
2025-08-18

Software

Programming language
Python
Development Status
Active

References

  • Gelaro, R., McCarty, W., Suárez, M. J., Todling, R., Molod, A., Takacs, L., Randles, C. A., Darmenov, A., Bosilovich, M. G., Reichle, R., Wargan, K., Coy, L., Cullather, R., Draper, C., Akella, S., Buchard, V., Conaty, A., da Silva, A. M., Gu, W., … Zhao, B. (2017). The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Journal of Climate, 30(14), 5419–5454. https://doi.org/10.1175/JCLI-D-16-0758.1
  • Knapp, K. R., Kruk, M. C., Levinson, D. H., Diamond, H. J., & Neumann, C. J. (2010). The International Best Track Archive for Climate Stewardship (IBTrACS): Unifying Tropical Cyclone Data. Bulletin of the American Meteorological Society, 91(3), 363–376. https://doi.org/10.1175/2009BAMS2755.1
  • NCEP. (2000). NCEP FNL Operational Model Global Tropospheric Analyses, continuing from July 1999. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory. https://doi.org/10.5065/D6M043C6