Predicting Aedes Aegypti Eggs Count Using Remote Sensing Data and a Generalized Linear Model

Here, we present a method for temporal modeling of the oviposition activity of Ae. aegypti mosquitoes based on a weighted generalised linear model (GLM) with explanatory environmental effects extracted from freely available remotely sensed (satellite) images. Experimental results are provided using field-collected Ae. aegypti egg count data from Córdoba, Argentina. The results show potential for operational applications.


INTRODUCTION
Remote sensing (RS) data provide information about the Earth's surface at an unprecedented scale and, as a result, have been applied in many domains. One such domain is landscape epidemiology, which uses remotely sensed data to understand and model the dynamics of environment-dependent disease risk [1,2,3]. Traditionally, health-related risk proxies in urban areas were modeled through post-hoc outbreak evaluations. This approach, however, is only effective a posteriori. For prevention purposes, vector dynamics and disease spread should be monitored ahead of outbreak events [1]. Accordingly, there have been multiple attempts in landscape epidemiology to implement remote sensing-based systems that predict or forecast, for instance, the oviposition activity and adult population of mosquito species that are known vectors of widespread diseases [3,4,5].
Dengue, Zika and Chikungunya, which are widespread in over 100 countries, are transmitted by the Aedes aegypti mosquito [6]. This mosquito is fully adapted to urban areas and breeds in artificial water containers. Moreover, its oviposition activity, adult development, and disease transmission capabilities are influenced by environmental variables including temperature, humidity, precipitation and vegetation condition [6,7]. Since these environmental variables have RS-based proxy estimates, it is possible to model the population dynamics of Aedes aegypti at scale with RS data, given the right data and modeling technique.
In this study, we apply the weighted GLM technique initially proposed in [8] to model and extrapolate Ae. aegypti egg counts (not the adult population) based on covariates estimated from RS data, using data from Córdoba, Argentina. As in [8], we compare the results with those obtained by machine learning (ML) methods. In addition, we use the coefficients of the RS variables in the best resulting model to show how they can be interpreted to support the Argentinean operational risk system.

MODELING
In this work, the rounded-up mean Ae. aegypti egg count ($Y$) is modeled as a function of environmental variables ($X$) derived from RS data. Because the target is a count, a Poisson GLM with a logarithm link function is used; it has been shown [9] that this model is suitable for non-negative, discrete, non-Gaussian variables. To improve model quality, we further fit a weighted GLM, labeled GLM-W. The weights $w$ used to fit GLM-W are obtained as $w_i = |y_i - \hat{y}_i|^{-1}$, where $|y_i - \hat{y}_i|$ is the absolute residual of the initial unweighted GLM in the $i$-th week. We then apply step-wise regression [10] based on the Akaike Information Criterion (AIC) [11] to drop uninformative environmental variables from GLM-W. The resulting model is labeled GLM-W*.
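The weighting and selection steps described above can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names, the `eps` guard against zero residuals, and the choice of backward elimination (the text only specifies step-wise regression) are assumptions, and `aic_of` stands in for whatever routine refits the weighted model on a subset of covariates and returns its AIC.

```python
def residual_weights(y, y_hat, eps=1e-6):
    """w_i = 1 / |y_i - y_hat_i|; eps (an assumption) avoids division by zero."""
    return [1.0 / max(abs(obs - fit), eps) for obs, fit in zip(y, y_hat)]


def backward_stepwise(features, aic_of):
    """Drop one covariate at a time for as long as doing so lowers the AIC.

    aic_of(subset) is a hypothetical callback that refits the model on
    `subset` (a tuple of covariate names) and returns its AIC.
    """
    current = tuple(features)
    best_aic = aic_of(current)
    while len(current) > 1:
        # Every model obtained by removing exactly one covariate
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        aic, subset = min((aic_of(c), c) for c in candidates)
        if aic >= best_aic:
            break  # no single removal improves the criterion
        best_aic, current = aic, subset
    return current, best_aic


# Toy usage with a made-up AIC in which only "ndwi" and "evi" carry signal.
def toy_aic(subset):
    missing_signal = sum(name not in subset for name in ("ndwi", "evi"))
    return 2 * len(subset) + 50 * missing_signal


weights = residual_weights([3, 10, 7], [2.5, 12.0, 7.0])
selected, selected_aic = backward_stepwise(["ndwi", "evi", "lst_day", "rain"], toy_aic)
```

Weeks that the unweighted model already fits well receive large weights under this scheme, so some cap such as `eps` is needed in practice; the value used here is arbitrary.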
To model the egg counts $Y \in \mathbb{N}$ as a function of $X \in \chi \subset \mathbb{R}^p$, we estimate a function $m$ such that $m(x) = E(Y \mid X = x)$, i.e.
$$\log(\hat{y}_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} = x_i^T \beta,$$
where the superscript $T$ denotes transposition; $\hat{y}_i = m(x_i)$ is the predicted egg count in the $i$-th epidemiological week; $k$ is the number of RS-based environmental variables used to fit the model; $x_{ij}$ is the value of the $j$-th environmental variable in the $i$-th week, with $x_i = (1, x_{i1}, \dots, x_{ik})^T$ and $\beta = (\beta_0, \beta_1, \dots, \beta_k)^T$; $\beta_0$ is the value of $\log(\hat{y})$ when all the environmental variables are equal to zero; and $\beta_j$ is the gradient of $\log(\hat{y})$ with respect to the $j$-th environmental variable. Our observed egg count $y_i$ is a realisation of the random variable $Y_i$, whose mean is estimated as $\hat{y}_i$ by our model. The random component of the model is the Poisson distribution of $Y_i$. Under this model, the observations are outcomes of a probability process of the form
$$P(Y_i = y_i \mid x_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!},$$
where $\mu_i = E(y_i \mid x_i) \geq 0$ is both the mean and the variance of the distribution. The relationship between $\mu_i$ and the linear predictor is given by a link function $g$ such that $g(\mu_i) = x_i^T \beta$; cf. [12,9]. As mentioned, a logarithm link function is applied, so that $\mu_i = \exp(x_i^T \beta)$.
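As an illustration of how such a model can be fitted, the following sketch estimates the coefficient vector of a log-link Poisson GLM by iteratively reweighted least squares (IRLS), with optional prior weights such as the residual-based weights of GLM-W. It is a sketch under stated assumptions rather than the implementation used in this study: the function name, the synthetic data, and the use of NumPy are illustrative, and the software actually used is not stated here.

```python
import numpy as np


def fit_poisson_glm(X, y, weights=None, n_iter=50, tol=1e-8):
    """Fit a log-link Poisson GLM by IRLS; returns (beta_0, beta_1, ..., beta_k)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-8)  # start from the intercept-only fit
    for _ in range(n_iter):
        eta = X @ beta                 # linear predictor x_i^T beta
        mu = np.exp(eta)               # inverse of the log link
        z = eta + (y - mu) / mu        # working response
        XtW = X.T * (w * mu)           # prior weight times Poisson variance
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Synthetic check: simulate counts from known coefficients and refit.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.0, 0.5, -0.3])
y = rng.poisson(np.exp(true_beta[0] + X @ true_beta[1:]))
beta_hat = fit_poisson_glm(X, y)
```

Passing `weights=None` gives the plain GLM, while passing a weight vector gives a weighted fit in the spirit of GLM-W.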
For comparison purposes, two ML techniques have been used: Random Forest (RF) and Support Vector Machines (SVM). We assess model quality using the AIC (GLMs only) and the Root Mean Squared Error (RMSE) [13] (all models).
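The two quality measures can be written out as below. This is a minimal sketch assuming the standard definitions of RMSE and of AIC for independent Poisson counts; the function names are illustrative, and how the weights enter the AIC of the weighted models is not specified here, so the unweighted Poisson log-likelihood is used.

```python
import math


def rmse(y, y_hat):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(y, y_hat)) / len(y))


def poisson_aic(y, mu, n_params):
    """AIC = 2p - 2 log L for independent Poisson counts y with means mu."""
    loglik = sum(o * math.log(m) - m - math.lgamma(o + 1) for o, m in zip(y, mu))
    return 2 * n_params - 2 * loglik


example_rmse = rmse([1, 2, 3], [1, 2, 5])
example_aic = poisson_aic([0], [1.0], n_params=1)
```

Lower values are better for both measures, but AIC values are only comparable between models fitted to the same response data.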

MATERIALS
The study area is the city of Córdoba, the second largest city in Argentina. The entomological data come from 300 ovitraps distributed among 150 houses across the city (Fig. 1). The ovitraps were placed in the front yard of each house, usually in shaded places and below or close to bushes or potted plants. All ovitraps were replaced weekly for two years, for a total of 105 weeks from week 40 of 2017 to week 40 of 2019. The average egg count (rounded up) across all 150 traps is obtained for each week as the target variable of this study.
For the purposes of this study, the selected environmental variables (temperature, humidity, precipitation and vegetation condition) are represented by proxy features estimated from freely accessible RS products. Specifically, the Enhanced Vegetation Index (EVI), the Normalised Difference Water Index (NDWI), and the daytime and nighttime land surface temperature layers of the Moderate Resolution Imaging Spectroradiometer (MODIS) were used, as in other recent studies [3,6], as proxies for vegetation condition, humidity and temperature. Precipitation information was instead taken from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) product [14].
To obtain the explanatory variables, each RS environmental variable is averaged across all ovitrap locations (latitude and longitude) for each week. In this way, we work around the differences in spatial resolution among the considered RS products. Variables available at other time intervals were interpolated to weekly values using third-order spline interpolation. As in [1], time-delayed observations with weekly steps of up to two weeks are included to account for non-synchronous environmental effects.
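The construction of time-delayed covariates can be sketched as follows. The variable names and the synthetic values are hypothetical, and dropping the first two weeks (for which no two-week-old observation exists inside the series) is an assumption, since the handling of the initial weeks is not described here.

```python
def add_lags(series_by_name, max_lag=2):
    """Return lag-0..max_lag columns for each weekly series.

    Row t of the "<name>_lag<l>" column holds the value observed l weeks
    before the week that row t represents.
    """
    n = len(next(iter(series_by_name.values())))
    columns = {}
    for name, values in series_by_name.items():
        for lag in range(max_lag + 1):
            # Rows cover weeks max_lag .. n-1, shifted back by `lag` weeks
            columns[f"{name}_lag{lag}"] = values[max_lag - lag : n - lag]
    return columns


# Synthetic weekly series for five hypothetical RS variables over 105 weeks.
names = ["evi", "ndwi", "lst_day", "lst_night", "rain"]
weekly = {name: [float(week + offset) for week in range(105)]
          for offset, name in enumerate(names)}
design = add_lags(weekly, max_lag=2)
```

With five variables and lags of zero, one and two weeks, this yields 15 columns.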
To obtain the modeling target variable, the (rounded-up) mean eggs count is obtained for each observed week. 80% of the data (the first 84 weeks) are used to train the models, and the remaining 21 weeks of data are used for model validation.
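The target construction and the chronological split can be sketched as below. The function names and the toy trap counts are illustrative assumptions; only the rounding-up of the weekly mean and the 84/21 split come from the text.

```python
import math


def weekly_target(trap_counts):
    """Rounded-up mean egg count over the traps collected in one week."""
    return math.ceil(sum(trap_counts) / len(trap_counts))


def chronological_split(rows, n_train=84):
    """First n_train weeks for training, the remaining weeks for validation."""
    return rows[:n_train], rows[n_train:]


# Toy data: three made-up trap counts per week for 105 weeks.
targets = [weekly_target([week % 7, (week * 3) % 11, 5]) for week in range(105)]
train, valid = chronological_split(targets)
```

Splitting by time rather than at random keeps the validation weeks strictly after the training weeks, which matches the extrapolation setting studied here.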

EXPERIMENTAL RESULTS AND DISCUSSION
Including up to two weeks of lag for each of the RS variables gives a total of 15 environmental covariates. Table 1 shows quality measures for all fitted models. First, we consider GLM, GLM-W, and GLM-W*. In terms of AIC, GLM-W improves on GLM; this is not the case in terms of RMSE. We rely on AIC in comparing these two models because it measures goodness of fit penalised by the number of parameters in the model. The table also shows that, even with fewer variables, GLM-W* (see Table 2 for the selected variables) obtains a slight improvement in both AIC and RMSE. Fig. 2 presents line plots comparing the observed values with those fitted by GLM, GLM-W, and GLM-W*. In Fig. 2(a), the values predicted by GLM and their 95% confidence intervals (CI) do not reach the highest observed values, which occur around weeks 75-85 (corresponding to the first week of March to the second week of May 2019). However, as shown in Fig. 2(b), GLM-W slightly improves the predictions in this period of high egg counts. Moreover, the 95% CI of GLM-W covers all the points during observed weeks 75-85, but grossly overestimates the observed values around the 79th observed week. GLM-W* (cf. Fig. 2(c)) reduces this overestimation while still capturing the observed values within its CI.
Results for the selected benchmark ML models (RF and SVM) are also presented in Table 1. In terms of RMSE, RF and SVM predict better on the training data, while the GLM models generalise better on the validation data (less overfitting). Overall, GLM-W* produces the lowest RMSE on the validation data. Furthermore, the line plots in Fig. 3(a) reveal that, while RF produces flat predictions from around week 90 to week 105, the predictions of GLM-W* follow the observed data during that period.
A further advantage of GLM-W* over RF and SVM is that it provides a model equation through which the effects of the contributing covariates can be intuitively explained. Operational requirements in eco-epidemiology can benefit from model explanation, namely from identifying the key local biotic and abiotic environmental effects on female mosquito activity. Table 2 summarises GLM-W* in this regard. It shows the selected variables and their coefficients. All the NDWI variables in GLM-W* show a high positive influence on the egg count. This result is consistent with the study in [15], which shows that higher humidity rates are associated with higher dengue virus propagation. Plausibly, higher egg counts correlate positively with disease spread. In contrast, EVI shows the highest negative influence on the egg count. Based on our model, the EVI and NDWI variables have the largest influence among all the RS environmental variables considered.
Table 3: Summary of all the observed and fitted models on validation data
In Table 3, we present a summary of the observed and fitted data: mean, median, minimum (Min), maximum (Max), first (Q1) and third (Q3) quartiles. Here, it can be seen that GLM-W* produces Max, median and mean values closer to the observed ones than the chosen ML baselines.
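Regarding the interpretation of the coefficients in Table 2: with a logarithm link, a coefficient acts multiplicatively on the expected egg count, so raising a covariate by a given amount multiplies the expected count by the exponential of the coefficient times that amount. The sketch below illustrates this reading. The coefficient values and covariate names are hypothetical and are NOT the values reported in Table 2.

```python
import math


def rate_ratio(beta_j, delta=1.0):
    """Multiplicative change in the expected count when x_j rises by delta."""
    return math.exp(beta_j * delta)


# Hypothetical coefficients for illustration only (not taken from Table 2).
example_coefs = {"ndwi_lag1": 0.8, "evi_lag0": -1.2}
effects = {name: rate_ratio(b, delta=0.1) for name, b in example_coefs.items()}
```

A ratio above 1 indicates that the expected egg count rises with the covariate and a ratio below 1 that it falls, which gives the signs of the coefficients a direct operational reading.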

CONCLUSIONS
In this work, a Poisson GLM has been used to model and extrapolate Ae. aegypti eggs count based on relevant environmental variables (precipitation, temperature, humidity and vegetation condition) extracted from RS data.
The experiments show that the proposed weighted model performs better than the standard GLM. It also proved to be robust when qualitatively compared with more complex ML algorithms, such as RF and SVM. However, the proposed Poisson model ignores potential temporal autocorrelation in the egg count data; future studies can consider ways to include autoregressive components in the model.