CONUS near real-time crop type mapper model and training data
Description
This collection contains the trained model (.h5) and the training and testing data (.csv or .npy) for the near real-time crop type mapper for the conterminous United States (CONUS), which uses the harmonized Landsat and Sentinel-2 (HLS) dataset. The code is available at https://github.com/hankui/Real-time-crop-type-mapper
The collection accompanies the paper
Zhang, H. K., Shen, Y., Zhang, X., Che, X., Yang, Z., et al. (2025), A near real-time crop type mapper for the conterminous United States, In review.
1, The trained model file
v1_70.layer4.METHOD2.BATCH64.LR0.0002.EPOCH30.L20.1.i0.model.h5
For the structure of the model, please refer to Zhang et al. (2025).
2, The Training_and_evaluation.zip file contains the training data used to generate the above model and the evaluation data used to produce the paper results. The training and testing (evaluation) samples were split as described in Zhang et al. (2025) and come from different pixel locations.
The training or testing input x (i.e., the HLS reflectance) is stored as a 3D matrix with dimensions n × (176+176+176+176) × 13.
- The first dimension, n, represents the number of training or testing samples.
- The second dimension comprises four segments of 176 values each:
  - The first 176 represents the first-year Landsat data, with a maximum of 176 dates.
  - The second 176 represents the first-year Sentinel-2 data, with a maximum of 176 dates.
  - The third 176 represents the second-year Landsat data, with a maximum of 176 dates.
  - The last 176 represents the second-year Sentinel-2 data, with a maximum of 176 dates.
  - Time series with fewer than 176 observations were padded with -9999.
- The third dimension corresponds to spectral information, including year, normalized day of year (DOY), and normalized reflectance. Although the year is not used in training or testing, it is included to identify the sample's time. The reflectance bands are ordered as follows: four visible bands, one near-infrared (NIR) band, two shortwave infrared (SWIR) bands, three red-edge bands (Sentinel-2 only), and one broad NIR band (Sentinel-2 only). For Landsat data, the last four bands are filled with -9999. The file with the mean and standard deviation values used for normalization is included in https://github.com/hankui/Real-time-crop-type-mapper
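The layout of x described above can be sketched as follows. This is a minimal illustration on a synthetic array, not the authors' loading code. It assumes that the third dimension is ordered exactly as listed (index 0 = year, index 1 = normalized DOY) and that a padded date has -9999 in the DOY slot; the function and segment names are hypothetical.

```python
import numpy as np

PAD = -9999        # fill value stated in the description
MAX_DATES = 176    # maximum number of dates per sensor per year
SEGMENTS = ["landsat_year1", "sentinel2_year1", "landsat_year2", "sentinel2_year2"]

def split_segments(x):
    """Split an (n, 704, 13) array into four (n, 176, 13) sensor-year segments."""
    if x.ndim != 3 or x.shape[1] != 4 * MAX_DATES:
        raise ValueError("expected shape (n, 704, n_features)")
    return {name: x[:, i * MAX_DATES:(i + 1) * MAX_DATES, :]
            for i, name in enumerate(SEGMENTS)}

def count_observations(segment):
    """Count non-padded dates per sample (assumes DOY is stored at index 1)."""
    return (segment[:, :, 1] != PAD).sum(axis=1)

# Synthetic stand-in for the real training array (two samples).
x = np.full((2, 4 * MAX_DATES, 13), PAD, dtype=np.float32)
x[0, :3, :9] = 0.1                       # 3 Landsat dates in year 1; last 4 bands stay -9999
x[1, MAX_DATES:MAX_DATES + 5, :] = 0.2   # 5 Sentinel-2 dates in year 1

segments = split_segments(x)
counts = {name: count_observations(seg) for name, seg in segments.items()}
```

Summing the four count arrays gives the number of HLS observations per sample over the two years.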
The training or testing output y includes 50 classes (Table 2 in Zhang et al., 2025), with values ranging from 0 to 49. The array in the file ‘inverse_mapping.npy’ can map these values back to the original label values as defined in the CDL keys at https://support.regrid.com/parcel-data/cdl-keys
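Mapping the class indices back to CDL codes might look like the sketch below. It assumes that the array in inverse_mapping.npy is one-dimensional and indexed by class value (position i holds the CDL code of class i); this is not confirmed by the description. The four-entry array is a hypothetical stand-in for the real 50-entry file, and the helper name is illustrative.

```python
import numpy as np

def to_cdl_labels(y, inverse_mapping):
    """Map class indices (0..len-1) to original CDL codes by array lookup."""
    y = np.asarray(y, dtype=np.int64)
    inverse_mapping = np.asarray(inverse_mapping)
    if y.size and (y.min() < 0 or y.max() >= len(inverse_mapping)):
        raise ValueError("class index outside the range of inverse_mapping")
    return inverse_mapping[y]

# Hypothetical stand-in; with the real file one would use
# inverse_mapping = np.load("inverse_mapping.npy")
inverse_mapping = np.array([1, 5, 24, 176])
y = np.array([0, 2, 2, 3])
cdl_codes = to_cdl_labels(y, inverse_mapping)
```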
The training and evaluation files were generated by processing AlignedCONUS_scale60_all_tiles_v1_6.csv and AlignedCONUS_scale60_all_tiles_v1_4.subcol_add_cdls.csv through several steps: combining data into two years, filtering out non-homogeneous pixels, excluding 2013 and 2014 data, applying normalization, discarding records with fewer than four HLS observations over two years, and redefining the labels. The resulting dataset includes 50 classes, comprising 37 crop classes and 13 non-crop classes (Tables 1 and 2 in Zhang et al., 2025).
3, AlignedCONUS_scale60_all_tiles_v1_6.csv contains the original HLS data with day of year, surface reflectance and quality assessment layer. The data was obtained by sampling every 60th 30-meter pixel across 96 systematically distributed tiles covering the CONUS (Fig. 1 in Zhang et al., 2025).
The data was derived for the period 2013 to 2023; however, only the data from 2015 to 2023 was used in Zhang et al. (2025), so that the period is covered by Sentinel-2 data. Only cloud-free observations were stored. Cloud-free observations were defined as those not labelled as snow/ice, cloud, cloud shadow, or adjacent to cloud/shadow in the HLS quality assessment layer.
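The cloud-free rule above can be expressed as a bit test on the QA value. The sketch below assumes the QA layer follows the HLS v2.0 Fmask bit layout (bit 1 cloud, bit 2 adjacent to cloud/shadow, bit 3 cloud shadow, bit 4 snow/ice); whether the QA column in the CSV uses exactly this encoding is an assumption, and the function name is hypothetical.

```python
# Bit positions assumed from the HLS v2.0 Fmask layer.
CLOUD = 1 << 1
ADJACENT_TO_CLOUD_OR_SHADOW = 1 << 2
CLOUD_SHADOW = 1 << 3
SNOW_ICE = 1 << 4

REJECT_MASK = CLOUD | ADJACENT_TO_CLOUD_OR_SHADOW | CLOUD_SHADOW | SNOW_ICE

def is_cloud_free(qa):
    """Return True if none of the four rejected QA bits is set."""
    return (int(qa) & REJECT_MASK) == 0
```

Under this assumption, water (bit 5) and the aerosol-level bits (6 and 7) do not affect the result.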
Each record contained data corresponding to a single pixel location for a specific year.
There are 4233 columns, with 9 columns storing the pixel specific and year information ('tile', 'col', 'row', 'lat', 'lon', 'year', 'total_n', 'tile_id', 'lid').
It contains 176×11 columns for the Landsat time series in a given year, where 176 represents the maximum number of cloud-free observations in a year, and 11 corresponds to the 11 Landsat 8/9 layers (day of year, QA, the seven solar reflective bands: four visible, one near-infrared (NIR), and two shortwave infrared (SWIR) bands, and the two thermal bands). Note that the two thermal bands are not used in Zhang et al. (2025). If a record contained fewer than 176 observations, missing values were filled with -9999.
It contains 176×13 columns for the Sentinel-2 time series in a given year, where 176 represents the maximum number of cloud-free observations in a year, and 13 corresponds to the 13 Sentinel-2 layers (day of year, QA, and the 11 solar reflective bands: four visible, two NIR, three red-edge and two SWIR bands). If a record contained fewer than 176 observations, missing values were filled with -9999.
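The column count follows from the per-sensor layer counts listed above (11 Landsat layers and 13 Sentinel-2 layers, each including day of year and QA). The short check below only reproduces that arithmetic; the constant and function names are hypothetical, and it makes no assumption about the order of the columns in the file.

```python
N_META = 9              # pixel-specific and year columns
MAX_OBS = 176           # maximum cloud-free observations per year
LANDSAT_LAYERS = 11     # DOY, QA, 7 reflective bands, 2 thermal bands
SENTINEL2_LAYERS = 13   # DOY, QA, 11 reflective bands

landsat_cols = MAX_OBS * LANDSAT_LAYERS
sentinel2_cols = MAX_OBS * SENTINEL2_LAYERS
total_cols = N_META + landsat_cols + sentinel2_cols

def has_expected_width(n_columns):
    """Check a loaded table's column count against the documented layout."""
    return n_columns == total_cols
```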
4, AlignedCONUS_scale60_all_tiles_v1_4.subcol_add_cdls.csv contains the original Cropland Data Layer (CDL) labels.
It has 10 columns ['tile', 'col', 'row', 'lat', 'lon', 'year', 'tile_id', 'lid', 'cdl', 'cdl_homo'], with the first eight columns specifying the pixel-specific and year information that can be linked to the AlignedCONUS_scale60_all_tiles_v1_6.csv file.
The variable ‘cdl’ represents the CDL label; refer to this link https://support.regrid.com/parcel-data/cdl-keys for label definitions. A value of 0 or NaN may indicate that the pixel is located in the ocean or outside the United States.
The ‘cdl_homo’ column indicates whether the label is consistent with all eight neighboring pixels (1 for consistent, 0 for not consistent).
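Linking the label file to the HLS file and keeping only homogeneous pixels could be sketched with pandas as below. This is an illustration on tiny synthetic tables, not the authors' processing script. It assumes that 'tile', 'col', 'row' and 'year' together identify a record uniquely, and it additionally drops labels of 0 or NaN; both choices are assumptions, and the helper name is hypothetical.

```python
import pandas as pd

KEYS = ["tile", "col", "row", "year"]  # assumed to identify a pixel-year record

def attach_homogeneous_labels(hls, cdl):
    """Inner-join CDL labels onto HLS records, keeping homogeneous, valid labels."""
    labels = cdl.loc[cdl["cdl_homo"] == 1, KEYS + ["cdl"]]
    labels = labels[labels["cdl"].notna() & (labels["cdl"] != 0)]
    return hls.merge(labels, on=KEYS, how="inner", validate="one_to_one")

# Synthetic stand-ins for the two CSV files (values are made up).
hls = pd.DataFrame({
    "tile": ["T1", "T1", "T2"],
    "col": [0, 60, 0],
    "row": [0, 0, 60],
    "year": [2019, 2019, 2020],
    "total_n": [12, 8, 20],
})
cdl = pd.DataFrame({
    "tile": ["T1", "T1", "T2"],
    "col": [0, 60, 0],
    "row": [0, 0, 60],
    "year": [2019, 2019, 2020],
    "cdl": [1, 5, 0],
    "cdl_homo": [1, 0, 1],
})

merged = attach_homogeneous_labels(hls, cdl)
```

In this toy example the second record is dropped because its label is not homogeneous, and the third because its label is 0.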
Files
CONUS_HLS_CDL_time_series.zip
Additional details
Dates
- Available: 2013-01-01 (HLS start date)
- Available: 2023-12-31 (HLS end date)