The MCM is an irregularly shaped region 20,686 km² in area, and we used it as the prediction area for which we generated and validated our Ta predictions. Its longitude ranges from 99.9 to 97.8° W and its latitude from 18.6 to 20.2° N (see Figure 1). To produce the study area for our Ta model, we expanded the spatial extent 50 km outward from each side of the bounding box of the MCM and rounded the limits to the nearest tenth of a degree of longitude or latitude. This allowed us to include additional stations that we used only as training data. The result was a rectangular region in the plate carrée projection ranging from 100.4 to 97.4° W and 18.1 to 20.6° N. This region was then reprojected to match the MODIS sinusoidal projection and divided into 101,892 cells of 1 km × 1 km, and this output served as our master grid for modeling.

Figure 1. Study area showing all available ground stations (n = 126) used for our daily Ta predictions in the Megalopolis of Central Mexico (MCM, shown as indigo-colored regions) from 2003–2019.

Data sources

Meteorological ground stations

We utilized weather station records from 2003 through 2019 inclusive, based on data availability from the different sources of information. By request, we obtained historical records of Ta and wind speed from the Servicio Meteorológico Nacional de México (SMN), which integrates three monitoring networks: the Estaciones Meteorológicas Automáticas network (EMAs; 30 stations), the Estaciones Sinópticas Meteorológicas network (ESIMEs; 10 stations), and a network of observatories (10 stations).
We also incorporated two networks whose data are publicly available on the Internet: the Sistema de Monitoreo Atmosférico de la Ciudad de México (SIMAT; 35 stations; http://www.aire.cdmx.gob.mx) and the Programa de Estaciones Meteorológicas del Bachillerato Universitario at the Universidad Nacional Autónoma de México (PEMBU-UNAM; 13 stations; https://www.ruoa.unam.mx/pembu). Finally, we included observations from personal weather stations available from Weather Underground (76 stations; http://wunderground.com). Records for all networks are available in local time, except those from the EMAs and ESIMEs networks, which are reported in UTC and were converted to local time. Given the heterogeneous sources of information in our study, which included crowdsourced data, we adapted data quality assessment methods developed elsewhere to our study region (Meier et al., 2017; Napoly et al., 2018; Dirksen et al., 2020). We discarded data that did not pass several checks:

1. If a station had fewer than 20 observations in a given year, we dropped that station for that year.

2. We dropped station-days with values that were impossible (e.g., a maximum Ta less than the minimum Ta or a negative mean wind speed) or implausible (i.e., colder than −30 °C, hotter than 53 °C, or having a mean wind speed greater than 114 m/s).

3. We dropped runs of station-days (ignoring unobserved days) in which one of the three daily Ta values (minimum, mean, and maximum), rounded to the nearest 0.01 K, was repeatedly equal. We allowed runs of up to 2 equal values for the mean Ta and 3 equal values for the maximum or minimum Ta. These limits were chosen based on the observation that longer runs are rare except in Weather Underground data.

4.
We compared observed Ta to inverse-distance-weighted interpolations of Ta from other stations on the same day that were no more than 30 km away and no more than 500 m different in elevation, as a "buddy check" for spatial consistency. We term the squared differences between observations and these interpolations "deviations", and for each daily Ta variable we computed the 99th percentile of the deviations, excluding Weather Underground. We dropped station-days for which a deviation exceeded this percentile. This step removed 3.3% of Weather Underground observations, less than 2% of observations in the three SMN networks, and less than 1% of UNAM and SIMAT observations. After these removals, 419,120 observations from 174 stations remained; the per-network station counts above are after data cleaning. Of these, 301,645 observations from 126 stations were in the prediction area (i.e., the area of the MCM).

5. We checked for possible duplicate stations (stations with different identifiers but similar locations and a substantial number of identical observations) but did not find any.

A summary of the geographic location of stations from each network and the number of records used in our analyses is shown in Table S1. The geographic locations of stations from the UNAM and REDMET networks can be found on their websites (UNAM: https://www.ruoa.unam.mx/pembu/index.php?page=map# and REDMET: http://www.aire.cdmx.gob.mx/opendata/catalogos/cat_estacion.csv). A complete list with geographic coordinates for the EMAs and ESIMEs networks of the SMN was downloaded from their website (http://smn1.conagua.gob.mx/emas/catalogoa.html). The same information for all the networks of the SMN was received in our data request.
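The buddy check described above can be sketched as follows. This is an illustrative Python reimplementation under simplifying assumptions (one day of data, planar station coordinates in km, and a threshold computed from the stations shown); the function and variable names are ours, not those of the original R code.

```python
import numpy as np

def idw_buddy_check(coords, elev, ta, max_dist_km=30.0, max_delev_m=500.0, pct=99.0):
    """Flag observations whose squared deviation from an inverse-distance-weighted
    interpolation of neighboring stations exceeds the given percentile.

    coords : (n, 2) array of station positions, in km
    elev   : (n,) station elevations, in m
    ta     : (n,) observed Ta for one day
    """
    n = len(ta)
    deviations = np.full(n, np.nan)
    for i in range(n):
        d = np.hypot(*(coords - coords[i]).T)  # distances to all stations, km
        ok = (d > 0) & (d <= max_dist_km) & (np.abs(elev - elev[i]) <= max_delev_m)
        if not ok.any():
            continue
        w = 1.0 / d[ok]  # inverse-distance weights
        interp = np.sum(w * ta[ok]) / np.sum(w)
        deviations[i] = (ta[i] - interp) ** 2
    threshold = np.nanpercentile(deviations, pct)
    return deviations, deviations > threshold
```

In the study itself the percentile threshold was computed excluding Weather Underground stations and the check was run per day; this sketch shows only the core interpolate-and-compare step.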
In Supplement S1, we include general characteristics about mounting, location, and exposure of instruments, and contact information for each network, based on metadata guidance by Aguilar et al. (2003), when such information was available.

Land surface temperature

LST records were extracted from the daily daytime and nighttime MODIS LST products MOD11A1 (Terra satellite) and MYD11A1 (Aqua satellite) using the most recent reprocessing version (Collection 6). LST retrievals from both satellites were available by the first half of 2002, but we selected 2003 as the start year because it is the earliest year with complete LST data from both sensors (Tatem, Goetz and Hay, 2004). The spatial resolution was 1 × 1 km, and data were available for the entire study period. Local overpass times were around 22:30 for Terra and 1:30 for Aqua at night, and around 10:30 for Terra and 13:30 for Aqua during the day. MODIS LST products are derived from channels 31 (10.78–11.28 µm) and 32 (11.77–12.27 µm) in the thermal infrared band, and they are already corrected for emissivity and atmospheric effects using the split-window algorithm. These products have been used in similar studies because of their high spatiotemporal resolution and free availability (Benali et al., 2012; Kloog et al., 2012, 2014; Shi, Liu, Kloog, et al., 2016; Rosenfeld et al., 2017).

Land use terms

Vegetation - To calculate a measure of vegetation density, the monthly Normalized Difference Vegetation Index (NDVI) from the Terra and Aqua MODIS instruments (Collection 6 MOD13A3 and MYD13A3) was averaged at a spatial resolution of 1 × 1 km.

Elevation - Elevation from the Shuttle Radar Topography Mission at 30 m spatial resolution was aggregated by applying a Gaussian filter (150 m SD) and extracting data at the centroids of the MODIS 1 × 1 km grid cells.

Statistical methods

Aqua and Terra readings were combined to form a daily LST variable for daytime and nighttime.
When one satellite's reading was missing for a given time and place, the reading from the other satellite was used. When both readings were available, the average was calculated. To impute cases where both satellite readings were missing, we used a linear-interpolation algorithm as follows. For a given day t and grid cell g, we found the closest days before and after t in the same year as t, t0 and t1, with non-missing values y0 and y1 of the appropriate variable (daytime or nighttime LST) at g, and we computed the imputed value as

y = (y0 (t1 − t) + y1 (t − t0)) / (t1 − t0)

where y is the imputed LST for day t at cell g. When no such t0 exists in the given year, y1 is used unaltered as the imputed value, and likewise y0 substitutes when no t1 exists.

We also imputed missing wind speed, using the wind speed from the closest station with a value for the same day (or the previous day if no other station was available that day, or the day before that if necessary, and so on).

We calibrated daily Ta (minimum, mean, and maximum) against LST as follows: each Ta station was assigned the closest LST observation on each day (within each 1 × 1 km grid cell), using grid cells for which both Ta measurements and LST values (observed or imputed) were available. On each day we estimated a separate slope for the relationship between Ta and LST to capture the temporal variability in their relationship. The calibrations, fit separately per year and per daily Ta outcome (minimum, mean, or maximum), were mixed-effects regression models implemented with the R package lme4 (Bates et al., 2015). After calibration, we used the coefficients of the mixed-effects model to predict Ta in grid cells without Ta information but with LST values.
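The within-year gap-filling rule above amounts to linear interpolation in time, with the nearest observed value carried into leading or trailing gaps. A minimal sketch (in Python rather than the R used in the analysis; the function name is ours):

```python
import numpy as np

def fill_lst_gaps(series):
    """Linearly interpolate missing values in one grid cell's within-year
    daily LST series; leading/trailing gaps take the nearest observed value."""
    series = np.asarray(series, dtype=float)
    days = np.arange(len(series))
    obs = ~np.isnan(series)
    if not obs.any():
        return series
    # np.interp holds the endpoint values flat outside the observed range,
    # matching the fallback to y0/y1 when no bracketing day exists
    return np.interp(days, days[obs], series[obs])
```

For example, a cell observed only on the second and fifth days of a six-day window receives interpolated values in between and the nearest observed value at the unobserved ends.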
A generalization of the equations for the three models (for minimum, mean, and maximum Ta) is:

Ta_ij = (α + υ_j) + β1 dayLST_ij + β2 nightLST_ij + β3 imputed dayLST_ij + β4 imputed nightLST_ij + β5 NDVI_ij + β6 sin(2π time_j) + β7 cos(2π time_j) + β8 elevation_i + β9 mean wind speed_ij + β10 season_j + β11 season_j × dayLST_ij + β12 season_j × nightLST_ij + ε_ij

where Ta_ij is near-ground (2 m) air temperature (minimum, mean, or maximum) at location i on day j; α and υ_j are the fixed and random (per-day) intercepts; dayLST_ij and nightLST_ij are the satellite day and night LST; imputed dayLST_ij and imputed nightLST_ij are indicators of whether the day and night LST values were imputed; NDVI_ij is the mean NDVI of Aqua and Terra for grid cell i in the month containing day j; time_j is the time of year, calculated as (day of the year − 1)/(total days in the year − 1); elevation_i is the mean elevation at site i; and mean wind speed_ij is the daily mean wind speed from the nearest station. Season was defined as a categorical variable with three levels: cold-dry for November through February, warm-dry for March and April, and rainy for May through October. All continuous variables were centered and scaled before fitting. All analyses were conducted in R 4.0.2 (R Core Team, 2020).

Assessment of model performance

Within each year, all available stations in the prediction area were randomly split into 10 cross-validation folds. Inside the cross-validation loop, models were trained on all observations (in the year of interest) from stations in the training folds plus all stations outside the prediction area but within the study area (i.e., calibration folds with 90% of the data), and used to predict the observations for the stations in the test fold (i.e., validation folds with 10% of the data). Mean wind speed for test stations was imputed as if all test stations were missing mean wind speed at all times.
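The time-of-year scaling, annual harmonics, and three-level season coding used as model covariates can be sketched as follows (an illustrative Python version; the analysis itself used R, and these function names are ours):

```python
import math

def time_of_year(day_of_year, days_in_year):
    """Scale day of year to [0, 1]: (day of the year - 1) / (total days - 1)."""
    return (day_of_year - 1) / (days_in_year - 1)

def harmonic_terms(day_of_year, days_in_year):
    """sin/cos annual harmonics entering the calibration model."""
    t = time_of_year(day_of_year, days_in_year)
    return math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)

def season(month):
    """Three-level season covariate used in the model."""
    if month in (11, 12, 1, 2):
        return "cold-dry"
    if month in (3, 4):
        return "warm-dry"
    return "rainy"  # May through October
```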
The primary measure of model performance was the root mean square error (RMSE) of cross-validated predictions. Tabular summaries of model results also report the standard deviation (SD) and the improvement (SD − RMSE) for each outcome (minimum, mean, or maximum Ta) in a particular year, season, or subregion. To evaluate prediction performance more evenly throughout the study region, our summary tables include spatially weighted versions of the SD and RMSE, in which each day and each sixteenth of a longitude-latitude grid cell (splitting each degree of longitude or latitude into four equal intervals) with at least one observation is given a total weight of 1. These metrics help to evaluate performance when we consider areas with few stations (e.g., the northern parts of the MCM) to be as important as areas with many stations (e.g., Mexico City proper). In addition, we estimated the degree to which our predictions capture spatial and temporal patterns by computing, for each Ta outcome y and prediction p, given per-station annual means My of the outcome and Mp of the predictions:

- R2, the proportion of variance accounted for: 1 − Mean((y − p)²)/Var(y)
- R2_spatial, the squared correlation between My and Mp
- R2_temporal, the squared correlation between (y − My) and (p − Mp)

For summaries of the per-station annual means (R2_spatial), the sample size is the number of stations instead of station-days (Kloog et al., 2014).

To verify that including Weather Underground data did not impair prediction, we conducted variations of the cross-validation procedure in which we excluded all Weather Underground stations from testing. We computed the spatial RMSE from a cross-validation that both trains and tests on non-Weather Underground stations, RMSE_NWU. Then Weather Underground stations were allowed into training, and RMSE_WU was computed on non-Weather Underground stations and subtracted from RMSE_NWU.
Thus, a positive difference RMSE_NWU − RMSE_WU indicates an improvement in RMSE when Weather Underground was included in training.

We constructed a learning curve to illustrate how our model's predictive accuracy is influenced by the size of its training data. This analysis was conducted for mean daily temperature in 2018. We selected two folds to hold out for testing, while using various subsets of the remaining eight folds for training. The test folds were chosen so that their unweighted RMSE under cross-validation was closest to the overall RMSE for this year and dependent variable; they comprised 16 stations and 4,813 observations. There were 7 rounds of analysis and 100 simulation replicates per round. In each replicate of round 1, 10 stations were randomly selected, 2,500 observations were randomly selected from these 10 stations, and the model was trained on these 2,500 observations and tested on the test folds. Round 2 used 20 stations and 5,000 observations, round 3 used 30 stations and 7,500 observations, and so on up to round 7 with 70 stations and 17,500 observations.

Estimation of at-risk population

We obtained population density information at the AGEB level (equivalent to US census tracts) from the 2010 Mexican Population Census carried out by the Instituto Nacional de Estadística y Geografía de México (Instituto Nacional de Estadística y Geografía (INEGI), 2016). AGEB data are available as a polygon layer with population data for all metropolitan areas in the MCM. From these records we calculated the number of at-risk days (above 30 °C and below 5 °C) experienced by people ≥65 years old in the MCM during the year 2010.

Results

Approximately one-quarter to one-third of LST data were missing from both Aqua and Terra satellites and thus had to be imputed for our temperature model.
The proportions of missing LST data for the two satellites were similar over time (23% to 34% missing daytime LST and 27% to 37% missing nighttime LST across all years), except for a higher proportion of missing data in 2004 (52% daytime, 50% nighttime). For all years and dependent variables (i.e., minimum, mean, and maximum Ta), we observed substantially lower RMSE compared with the SD, indicating that our model was effective in predicting temperature. Table 1 presents our model's performance under 10-fold cross-validation for mean Ta. Tables S.2 and S.3 in the Supplement show cross-validation results for minimum and maximum Ta, respectively. Across all results, RMSEs ranged from 0.92 to 1.98 K, with a mean of 1.47 K, whereas SDs ranged from 3.68 to 5.35 K. Out-of-sample R2 values ranged from 0.77 to 0.95, indicating good predictive ability in our models, particularly for maximum Ta (R2 0.79 to 0.92) and mean Ta (R2 0.88 to 0.95), compared to minimum Ta (R2 0.77 to 0.87). The spatial and temporal R2 also showed good performance, with average spatial R2 values of 0.86, 0.94, and 0.89 and average temporal R2 values of 0.80, 0.88, and 0.85 for minimum, mean, and maximum Ta, respectively. The average RMSEs for minimum, mean, and maximum Ta were 1.67 K, 1.16 K, and 1.59 K, respectively; the spatially weighted RMSEs, in which all monitored areas are equally important, were generally worse than the unweighted RMSEs (by 0.34 K on average) but still reasonably small.
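The overall, spatial, and temporal R2 values summarized above follow the definitions given in the methods. As an illustrative sketch (in Python rather than the R used in the analysis; the function name is ours), they can be computed as:

```python
import numpy as np

def r2_metrics(y, p, station):
    """Overall, spatial, and temporal R2 for cross-validated predictions.

    y, p    : observed and predicted Ta over station-days
    station : station identifier for each observation
    """
    y, p, station = map(np.asarray, (y, p, station))
    # overall: proportion of variance accounted for
    r2 = 1 - np.mean((y - p) ** 2) / np.var(y)
    # per-station annual means My and Mp
    stations = np.unique(station)
    my = np.array([y[station == s].mean() for s in stations])
    mp = np.array([p[station == s].mean() for s in stations])
    r2_spatial = np.corrcoef(my, mp)[0, 1] ** 2
    # deviations from each station's own mean
    idx = np.searchsorted(stations, station)
    r2_temporal = np.corrcoef(y - my[idx], p - mp[idx])[0, 1] ** 2
    return r2, r2_spatial, r2_temporal
```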
Table 1. Prediction accuracy for the Megalopolis of Central Mexico (MCM): ten-fold cross-validated (CV) results for daily mean Ta predictions for 2003–2019.

Figure 2 provides an example of observations and predictions at two different monitoring stations in June of 2010 and 2018, showing improvement in the 2018 Ta predictions compared to 2010 for the southern region of Morelos as the number of stations in training increased between these years.

Figure 2. Observed and predicted Ta from CV for station 8 (in Mexico City proper) and station 24 (in the southern region of the study area, in the state of Morelos) in two different years.

Figure 3 shows that the distribution of prediction errors in 2018 was similar across seasons in the MCM.

Figure 3. Density plots of the CV-predicted Ta minus observed Ta in K for 2018, aggregated across stations but stratified by season.

We examined model performance by metropolitan area in order to compare how our model performed in areas with fewer stations relative to those with more stations. Table 2 summarizes cross-validated model performance within each metropolitan area of the MCM with ≥5 stations for 2018. By comparing the RMSEs to the SDs of the Ta responses, we can see how much more of the Ta variation is explained by our model. For results for all metropolitan areas, see Table S.4. There were large differences in both SD and RMSE across metropolitan areas; still, the improvement was at least 1 K in most cases. On average, our models yield better predictions than the naïve estimate of the mean for all Ta outcomes (i.e., the RMSE of our predictions is smaller than the SD of the observations for every Ta outcome).

Table 2.
Prediction accuracy by metropolitan area in the MCM: ten-fold cross-validation (CV) results for Ta predictions in 2018.

We evaluated model performance for each ground monitoring network from which we obtained temperature records. Table 3 shows the accuracy of cross-validated Ta predictions by network. The lowest RMSE values were obtained for UNAM records for minimum, mean, and maximum Ta predictions. The highest RMSE values came from Weather Underground for minimum Ta, ESIMEs for mean Ta, and EMAs for maximum Ta. One reason why UNAM had the lowest RMSEs for all Ta outcomes, despite having fewer stations in 2018 than most networks, may be that UNAM stations were evenly distributed across a smaller and more climatically homogeneous area, and potentially received more maintenance than stations in the other networks.

Table 3. Prediction accuracy by ground monitoring network in the MCM in 2018.

Model performance was also compared by season: cold-dry, warm-dry, and rainy. Table 4 presents the average cross-validated prediction accuracy for each Ta outcome by season for the entire MCM from 2003–2019. The lowest average RMSE values were for mean Ta in all seasons. The highest RMSE values were observed for minimum and maximum Ta during the cold-dry and rainy seasons, respectively, and for mean Ta in the cold-dry season. The largest improvements were observed during the cold-dry season for minimum Ta, and during the rainy and cold-dry seasons for mean and maximum Ta.

Table 4. Prediction accuracy by season in the MCM: average SD, RMSE, and SD − RMSE for minimum, mean, and maximum Ta predictions from 2003–2019.
Results from our test for impaired prediction when allowing Weather Underground stations in training (RMSE_WU) compared to excluding them from training (RMSE_NWU) showed that while the mean difference of −0.04 K in RMSE_NWU − RMSE_WU favors training without Weather Underground stations, there was considerable variation in RMSE_NWU − RMSE_WU by dependent variable and year. For instance, the inclusion of Weather Underground stations improved the RMSE for minimum Ta in seven years. Table S.5 shows RMSE_NWU − RMSE_WU for all years and dependent variables in the MCM.

Figure 4 shows the RMSE in two held-out folds that can be achieved when training our model on various subsets of the remaining data. The mean RMSE across simulation replicates decreases rapidly as the training set grows from 2,500 to 7,500 observations, then levels off around 1.33 K. This suggests that our sample is more than large enough to achieve the best accuracy possible with this model in this region.

Figure 4. Model performance learning curve for minimum Ta in 2018 as a function of the size of its training data.

Model predictions were made for each 1 × 1 km grid cell in the entire prediction area for summarization, and Figure 5 shows the 95th percentile of the minimum and maximum temperatures across days for each grid cell in 2018. In general, the inhabited areas were hotter than the mountainous borders between municipalities.
The southernmost region of the MCM, comprising the metropolitan areas of Cuernavaca and Cuautla, was substantially hotter than the rest of the MCM.

Figure 5. Spatial pattern of the 95th percentiles of minimum (a) and maximum (b) temperature across days for each 1 km² grid cell in the Megalopolis of Central Mexico for 2018.

Finally, we used population data to quantify human exposure to extreme ambient temperatures for people aged ≥65 years in the urban AGEBs of the MCM in 2010, Figure 6 (a). There were over 50 million and 17 million person-days of exposure to extreme low and high temperatures, respectively, in 2010. The highest numbers of person-days of exposure to daily minimum Ta ≤5 °C were concentrated in the metropolitan areas of Toluca, MCMA, Puebla-Tlaxcala, and Pachuca, Figure 6 (b). As for exposure to daily maximum Ta ≥30 °C, Figure 6 (c) shows that the metropolitan areas with the highest numbers of person-days above this threshold were Cuernavaca, Cuautla, MCMA, and Puebla-Tlaxcala.

Figure 6. Spatial distribution of population density (a) and person-days of exposure to ≤5 °C (b) and ≥30 °C (c) for each 1 km² grid cell in the Megalopolis of Central Mexico for people aged 65 years and over in 2010.

Raw data, processed data, predictions, and code are archived in the Zenodo open-access digital repository (doi:10.5281/zenodo.3362523).
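The person-days computation described above (population aged ≥65 per cell multiplied by the number of days beyond each temperature threshold, summed over cells) can be sketched as follows. This is an illustrative Python version with our own function and variable names, assuming the ≤5 °C and ≥30 °C thresholds stated above; the original analysis was carried out in R on the AGEB polygon data.

```python
import numpy as np

def person_days(pop_65plus, tmin, tmax, cold=5.0, hot=30.0):
    """Person-days of exposure to extreme cold and heat.

    pop_65plus : (n_cells,) population aged >= 65 in each grid cell
    tmin, tmax : (n_days, n_cells) daily minimum and maximum Ta, deg C
    """
    cold_days = (np.asarray(tmin) <= cold).sum(axis=0)  # days per cell at or below 5 C
    hot_days = (np.asarray(tmax) >= hot).sum(axis=0)    # days per cell at or above 30 C
    pop = np.asarray(pop_65plus)
    return float((pop * cold_days).sum()), float((pop * hot_days).sum())
```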