Towards an accurate Ground-Level Ozone Prediction

Received Jun 7, 2017; Revised Dec 27, 2017; Accepted Jan 5, 2018

The aim of this paper is to find the most accurate technique for predicting ground-level ozone at Al Jahra station, Kuwait. Data on five meteorological variables (air temperature, relative humidity, solar radiation, wind direction, and wind speed) and the concentrations of seven environmental pollutants (SO2, NO2, NO, CO2, CO, NMHC, and CH4) were used to forecast the ozone concentration in the atmosphere. Three methods were used to predict ground-level ozone: multiple least-squares regression, partial least squares (PLS) regression, and support vector machines (SVM). We used fifteen parameters to evaluate the performance of the methods. Multiple least-squares regression, PLS regression, and SVM with linear and radial kernels were the best performers, with mean absolute errors (MAE) of $9.17 \times 10^{-3}$, $9.72 \times 10^{-3}$, $9.64 \times 10^{-3}$, and $9.12 \times 10^{-3}$, respectively. SVM with a polynomial kernel had an MAE of $5.46 \times 10^{-2}$. These results show that these methods can be used to predict ground-level ozone concentrations at Al Jahra station in Kuwait.


INTRODUCTION
This paper aims to find the most accurate technique for forecasting ground-level ozone. Kuwait has a number of urban areas where air pollution has grown dramatically because of industrialization, technical development, and overpopulation [1]. These factors contaminate the air and intensify pollution in densely populated areas, whose residents may face severe health problems in the near future because of the pollutants in their vicinity. To begin with, the ozone layer has been severely damaged by pollution, and this has become a matter of concern because of the increasing quantity of pollutants originating from developed and industrialized nations. Air pollution can have more drastic effects on populations living close to the industrial estates that cause it. It endangers the health of all living things; in humans it is associated with chronic respiratory infection, aggravated asthma, lung inflammation, damaged lung defense mechanisms, and decreased immunity [2].
Disturbance of the atmospheric balance has become the major cause of damage to the ozonosphere. This damage can lead to severe foliar injury and to reductions in biomass production and agricultural yield. A considerable shift in the competitive advantage among plant species in mixed populations has also been observed [3]. Damage to the ozone layer is steadily increasing, which is dangerous for human health and the environment [4]. According to scientists, ozone (O3) is formed through the interaction of ultraviolet radiation with oxides of nitrogen (NOx) and organic species [5].
These primary pollutants generally originate from industrial activity and from other sources such as urban development. Lack of protection from ozone can create health problems, and because ozone is highly reactive it also damages chemicals, rubber materials, and fabrics. It severely affects some crops as well. The concentration of ozone in the lower atmosphere has therefore received considerable attention because of its role in oxidizing photochemical smog. As a key indicator of photochemical smog, ozone is considered a major pollutant affecting air quality and the environment [6], [7].
Hydrocarbons (HC) and nitrogen oxides (NOx) have been recognized as the two main chemical precursors of ozone and other photochemical oxidants [6]. Petrochemical processes, fuel and oil combustion, and transportation are the major sources of atmospheric HC and NOx. Most of these pollutants are quantified through emission rates estimated from the activities taking place in urban and industrial areas. Relating these primary pollutants to meteorological factors can help determine ozone levels. A model that represents both the chemical processes and atmospheric transport would be an appropriate strategy, but the chemistry of organic species is difficult to characterize accurately, and the formation of ozone is complicated and hard to control. The stratospheric ozone layer protects the earth from ultraviolet rays that harm both humans and crops. Ozone in the lower atmosphere, however, is the main concern because of its effects on health, materials, and vegetation [8]. Inhaling ozone can trigger numerous health problems, such as respiratory tract irritation, eye problems, coughing, chest tightness, and wheezing. Children who are often outdoors during the daytime in summer are likely to be at risk when ozone concentrations remain high. Ozone has also become a cause of losses in agricultural production.
In the recent past, air pollution in the urban areas of Kuwait has increased heavily [9]. This is due to overpopulation, technical growth, and rapid industrialization. Ozone levels observed in the residential district of Salmiya exceeded the ambient air quality standards at certain times of the year. There is therefore a strong need for accurate forecasting of surface ozone, since forecasts would support the successful implementation of public warning strategies, especially on episode days.
Statistical methods for studying ozone concentration in a meteorological context fall into three broad areas, each considerably different from the others: regression-based methods, extreme value methods, and space-time approaches. The decrease in ozone variability is not yet understood and is commonly examined with the help of meteorological adjustment. A change in climate or in policy ultimately changes the process. Such changes can be small and hard to identify, so effort is needed to separate them from the effects of weather and climate [10]. Regression-based and extreme value methods both concentrate on forecasting, estimation, and revealing the underlying mechanisms.
Similarly, earlier studies of ozone levels focused on comparing them with international standard limits. These studies covered seasonal trends in ozone levels, the diurnal behavior of ozone, and the health effects of ozone pollution [11], [12].
Few studies have developed a robust forecasting system that can be used for public warning; most forecasting systems were developed to predict ambient ozone concentration in Kuwait from precursor concentrations and meteorological data [11], [12].
In [1], stepwise multiple regression modeling was used to predict daytime ozone levels from meteorological conditions and precursor concentrations at the Shuaiba Industrial Area (SIA) of Kuwait.
Artificial neural networks and principal component regression have also been applied to predict the ozone concentration in the lower atmosphere of Kuwait. The predictions used five meteorological variables (air temperature, relative humidity, solar radiation, wind direction, and wind speed) and the concentrations of seven environmental pollutants (SO2, NO2, NO, CO2, CO, NMHC, and CH4).
Because linear regression is a well-known method, a number of studies, such as [1], [13][14][15][16], have used multiple linear regression to relate ozone measurements to contemporaneous meteorological measurements. These models do not account for autocorrelation and cross-correlation, and so they do not meet the basic requirements of scientific merit. Time series regression is a more complex alternative that combines the static relationship with a correlation structure, such as a simple AR(1), for the residuals [17]. This is sound only when the fits are checked with appropriate diagnostics. Although the technique is robust, these methods may be inadequate for capturing interactions and the nonlinear response of ozone concentration.
In [18], the performance of various forecasting systems is compared across different locations in Kuwait, using fuzzy modeling and time series analysis. Both forecasting models showed a significant improvement over the model currently used for forecasting air pollution, the pure persistence forecast. A large proportion of ozone variation is explained by the daily maximum temperature. According to [19], linear statistical models have difficulty capturing the multifaceted association between ozone and meteorological variables. To develop a parametric nonlinear model, the authors used data from around forty-five monitoring stations in the Chicago region, covering 1981 to 1991 and held in the AIRS database. They modeled the daily median, across sites, of the maximum one-hour average ozone values using nonlinear least squares. Exploratory graphical analysis and nonparametric modeling revealed several relationships, involving contemporaneous ozone, relative humidity, surface temperature, and 700 hPa wind speed, that suggested parametric forms. A Fourier series is used to model the seasonal cycle. In calculating the standard errors of the coefficients, the authors confirmed serial autocorrelation in the model residuals and adjusted for it using Galant's methods.
In [20], stepwise multiple linear regression was applied to obtain the best-fit equation relating the daytime maximum ozone concentration to meteorological conditions along a twenty-four-hour upwind air parcel trajectory.
The equation involved four variables: the maximum upwind ozone on the previous day, the maximum upwind temperature on the previous day, and the average upwind wind speed in the upper layer and in the lower layer. Upwind emission rates of hydrocarbons and nitrogen oxides, which form ozone in the lower layer, were also examined but were found not to improve the multiple correlation coefficient.
Many evaluations of the variables used in meteorological adjustment are stepwise rather than based on all possible subsets, because examining all subsets is computationally very demanding when the number of variables is large. Stepwise selection lacks a global view of model selection: a variable that is eliminated early may have important interactions with others, and such variables are dropped from the model after being masked. A different strategy for linear models was applied in [21] to assess health effects related to particulate matter and air pollution. That approach assigns prior probabilities to the inclusion of each variable and then quantifies model uncertainty through posterior probabilities over a very large number of models.
It is difficult to predict ozone levels using theoretical methods (for example, a detailed atmospheric diffusion model), so developing a forecasting system requires empirical analysis. A well-evaluated ozone forecasting model would raise the chances of a successful control strategy. In addition, daily forecasts of maximum ozone concentration would help reduce and avoid ozone-related damage and injury. This research is important in that respect because it compares different statistical methods for predicting ground-level ozone at Al Jahra station, Kuwait.
Because of its major significance to atmospheric chemistry, ozone has been widely studied for many years, both theoretically and experimentally [21]. Although ozone is just a triatomic molecule, describing its kinetics, spectroscopy, and dynamics has been highly challenging for theory. There is considerable divergence between the predicted and the measured low-temperature rates of O + O2 isotope exchange [23]. At low temperature, experimental rates are three to five times greater than predictions and show a negative temperature dependence that has proved hard to reproduce theoretically. In the troposphere, the level of ozone is of great significance because of its negative impact on vegetation, materials, and human health.
According to [24], ground-level ozone is mainly produced from its precursors, NOx and volatile organic compounds (VOC), through complicated photochemical reactions in sunlight. The accumulation of ozone at ground level is affected by physical and chemical processes as well as by meteorological conditions. Atmospheric pollution levels show complicated variability over many temporal and spatial scales, with harmful impacts on the environment [25].
The current state of progress in measurement studies and in the modeling of ozone precursors, transport processes, and photochemical behavior has recently been evaluated [26]. Although ozone chemistry has been widely examined in chamber experiments and in photochemical modeling analyses [27], it remains difficult to forecast ambient ozone levels and their spatial distribution, behavior, and related patterns accurately. Tracking and predicting ozone requires an understanding not only of ozone itself but also of the conditions that combine to form it. It is essential to use models that describe and help explain the complicated associations between ozone levels and the variables that cause or hinder ozone production.
Photochemical models are sometimes used to predict how ozone varies over time, and they help in developing cost-effective ways of reducing ambient ozone, particularly by controlling emissions of NOx and other precursors [28]. Ozone formation varies by hour, day, and season because the complicated series of reactions involved is driven by sunlight and temperature. A survey of the role of meteorology in surface ozone levels has been presented in which linear regression methods were used to predict ozone levels [29]. In the United States and other industrialized countries, ground-level ozone pollution is regarded as a severe health issue in several cities.
For ozone-sensitive individuals, particularly people who suffer from respiratory diseases such as asthma, high summertime ozone levels may cause distress [30]. In most cities, public forecasts or announcements of potentially unhealthy ozone air quality can be of great benefit to those at risk of respiratory discomfort [31]. Moreover, "ozone action" programs have been introduced in several cities to control episodic emissions because of the health effects of ozone. These actions rely on forecasts of ground-level ozone.

THE PROPOSED METHOD
2.1. Datasets
The datasets used in this report were described in the work on public warning systems for forecasting ambient ozone pollution in Kuwait. Data from 2006 to 2008 were used to train the methods, and data from 2009 and 2010 were used to test them. Before analysis, the datasets were pre-processed in Microsoft Excel using information from [32].
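The chronological split described above can be sketched as follows. This is a minimal Python illustration rather than the authors' preprocessing, which was done in Microsoft Excel; the record layout, the field names `date` and `o3`, and all values are hypothetical.

```python
from datetime import date

def split_by_year(records, train_years, test_years):
    """Partition dated records into training and test sets by calendar year."""
    train = [r for r in records if r["date"].year in train_years]
    test = [r for r in records if r["date"].year in test_years]
    return train, test

# Hypothetical records; field names and values are illustrative only.
records = [
    {"date": date(2006, 7, 1), "o3": 0.031},
    {"date": date(2008, 1, 15), "o3": 0.012},
    {"date": date(2009, 8, 3), "o3": 0.035},
    {"date": date(2010, 2, 9), "o3": 0.010},
]

# Training years and test years as stated in the text.
train, test = split_by_year(records, {2006, 2007, 2008}, {2009, 2010})
```

Splitting by year keeps every test observation later in time than every training observation, which avoids evaluating a forecast on data that precede the training period.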

Multiple Least-squares Regression
Multiple least-squares regression (MLR) models the relationship between the dependent variables (Y) and the independent variables (X) as shown in Equation (1):
$$Y = XB + E \tag{1}$$
where B is the matrix of regression coefficients and E contains the residuals.
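A least-squares fit of this linear model can be sketched in a few lines. The code below is an illustrative pure-Python version for a single response, solving the normal equations by Gaussian elimination; it is not the authors' R code. The added intercept column, the function names, and the toy data are assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix a is square and non-singular.
    """
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_mlr(X, y):
    """Return least-squares coefficients, intercept first."""
    Xd = [[1.0] + list(row) for row in X]  # assumed intercept column
    p = len(Xd[0])
    xtx = [[sum(r[i] * r[j] for r in Xd) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(Xd, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict_mlr(coef, X):
    """Apply fitted coefficients to new rows of predictors."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]

# Toy data generated from y = 1 + 2*a - 0.5*b (illustrative only).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [1.0 + 2.0 * a - 0.5 * b for a, b in X]
coef = fit_mlr(X, y)
```

Solving the normal equations directly is adequate for a small, well-conditioned illustration; statistical software typically uses a QR decomposition instead, which is numerically more stable.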

PLS Regression
PLS regression models the relationship between a set of predictor variables, X, and a set of dependent variables, Y. In the first stage of PLS regression, weight vectors w and q are derived from X and Y, respectively, giving the corresponding scores $t = Xw$ and $u = Yq$. Next, the inner relation r is determined by least-squares regression of u on t. Finally, rank-one reductions of X and Y are performed such that $X_{j+1} = X_j - t_j p_j^T$ and $Y_{j+1} = Y_j - r_j t_j q_j^T$, where $p_j = X_j^T t_j / (t_j^T t_j)$ and j indexes the latent variable being calculated. These stages are repeated until the desired number of latent variables (K) has been extracted. The general model is given by Equation (2) [33].
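For a single response variable, the stages above reduce to a short loop. The following pure-Python sketch uses the NIPALS form of PLS with one response (often called PLS1), where q is trivially 1 and u is simply the response; it is an illustration under those assumptions, not the implementation in the R package pls that the paper used. All function names and the toy data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_pls1(X, y, n_components):
    """NIPALS PLS for one response; returns what predict_pls1 needs."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xj = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centered X
    yj = [v - y_mean for v in y]                              # centered y
    components = []
    for _ in range(n_components):
        # Weight vector w: direction in X with maximal covariance with y.
        w = [dot(col, yj) for col in zip(*Xj)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xj]             # scores t = X w
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*Xj)]  # loadings p = X^T t / (t^T t)
        r = dot(yj, t) / tt                         # inner relation: regress y on t
        # Rank-one reductions (deflation) of X and y.
        Xj = [[v - ti * pk for v, pk in zip(row, p)] for row, ti in zip(Xj, t)]
        yj = [v - r * ti for v, ti in zip(yj, t)]
        components.append((w, p, r))
    return x_mean, y_mean, components

def predict_pls1(model, X):
    """Predict by replaying the scoring and deflation steps on new rows."""
    x_mean, y_mean, components = model
    preds = []
    for row in X:
        x = [v - m for v, m in zip(row, x_mean)]
        yhat = y_mean
        for w, p, r in components:
            t = dot(x, w)
            yhat += r * t
            x = [v - t * pk for v, pk in zip(x, p)]
        preds.append(yhat)
    return preds

# Toy data from an exact linear relation (illustrative only).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [1.0 + 2.0 * a - 0.5 * b for a, b in X]
model = fit_pls1(X, y, n_components=2)
preds = predict_pls1(model, X)
```

When the number of components equals the rank of X, as in this toy case, the PLS fit coincides with ordinary least squares; using fewer components trades some fit for robustness to noise and collinearity.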

SVM
SVMs are widely used because they represent data through kernel functions. The kernel functions available for SVM include polynomial, linear, sigmoid, and radial basis function (RBF) kernels [34]. In this paper, linear, radial, and polynomial kernels were used for training and prediction.
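The three kernels used here can be written out explicitly. The sketch below follows the parameterization used by LIBSVM-based tools (gamma, coef0, degree); the default values shown are assumptions for illustration, not the settings used in the paper.

```python
import math

def linear_kernel(x, z):
    """k(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    """k(x, z) = (gamma * x . z + coef0) ** degree"""
    return (gamma * linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

The linear kernel has no gamma parameter in this formulation, whereas the radial and polynomial kernels both depend on it; the cost parameter belongs to the SVM optimization problem rather than to the kernel.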

Statistical Analysis
R 3.1.2 was used to perform the statistical analyses, with the packages pls and e1071. The performance of the methods was assessed using the mean absolute error (MAE).
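The mean absolute error is the average magnitude of the differences between observed and predicted values. A minimal Python sketch follows; the function name and the example values are illustrative and are not taken from the paper's data.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values in ppm, not taken from the paper.
example_mae = mean_absolute_error([0.02, 0.01, 0.03], [0.01, 0.01, 0.05])
```

Because MAE is expressed in the same units as the response, it can be read directly against the observed concentration range.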

Experimental Setting
2.4.1. Multiple Least-square Regression
MLR was applied to the normalized data using Equation (1). The regression coefficients and p-values obtained from training the MLR model are given in Table 1.

PLS Regression
With PLS regression, the question is how many components to retain so that the model is not fitted to noise during training. In this report, the root mean square error of prediction (RMSEP) was used to select the number of PLS regression components (Figure 1). Five PLS regression components were selected and used to train the method. The resulting model was then used to predict the ground-level ozone concentrations in the test dataset (2009 and 2010). The mean absolute error of the model was $9.72 \times 10^{-3}$ (Table 2).
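Choosing the number of components from the RMSEP can be sketched as follows. The paper does not state its exact selection rule, so taking the component count with the lowest RMSEP is an assumption made here; the function names and the validation predictions are hypothetical and are not the values behind Figure 1.

```python
import math

def rmsep(actual, predicted):
    """Root mean square error of prediction."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def select_n_components(actual, predictions_by_k):
    """Return the component count with the lowest RMSEP, plus all scores."""
    scores = {k: rmsep(actual, preds) for k, preds in predictions_by_k.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Hypothetical validation predictions for models with 1, 2, and 3 components.
actual = [1.0, 2.0, 3.0]
predictions_by_k = {
    1: [1.5, 2.5, 3.5],
    2: [1.1, 2.1, 2.9],
    3: [1.2, 1.8, 3.3],
}
best_k, scores = select_n_components(actual, predictions_by_k)
```

In practice the RMSEP curve often flattens rather than showing a sharp minimum, so a common alternative is to pick the smallest number of components beyond which the error no longer decreases appreciably.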

SVM
SVM was trained using three kernels (linear, radial, and polynomial). A grid search was used to select the best parameters (gamma and cost) for each kernel. The training dataset was split into two sets, the first for training and the second for validation. Table 3 shows the results of the grid search for the linear, radial, and polynomial kernels. The best gamma values for the linear, radial, and polynomial kernels were $1 \times 10^{-2}$, $1.0 \times 10^{-4}$, and $1.2 \times 10^{-3}$, respectively, and the best cost values were 1, $1 \times 10^{5}$, and 100, respectively.
The best parameters obtained from tuning were used to train the methods. The mean absolute errors of SVM trained with the three kernels were $9.64 \times 10^{-3}$, $9.12 \times 10^{-3}$, and $5.46 \times 10^{-2}$ for the linear, radial, and polynomial kernels, respectively (Table 2). Table 2 and Figure 2 show the performance of all three methods (MLR, PLS regression, and SVM) on our test dataset.
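The grid search over gamma and cost can be sketched generically. In the Python illustration below, `train_and_score` is a hypothetical callable standing in for "fit an SVM on the training part and return the validation error"; the toy error surface and the grid values are invented for the example and are not derived from the paper's data or from Table 3.

```python
import math
from itertools import product

def grid_search(train_and_score, gammas, costs):
    """Try every (gamma, cost) pair; return (best_score, gamma, cost).

    Lower scores are better (for example, validation MAE).
    """
    best = None
    for gamma, cost in product(gammas, costs):
        score = train_and_score(gamma, cost)
        if best is None or score < best[0]:
            best = (score, gamma, cost)
    return best

def toy_validation_error(gamma, cost):
    """Invented error surface with its minimum at gamma=1e-2, cost=10."""
    return abs(math.log10(gamma) + 2) + abs(math.log10(cost) - 1)

best = grid_search(
    toy_validation_error,
    gammas=[1e-4, 1e-3, 1e-2, 1e-1],
    costs=[1, 10, 100],
)
```

The number of model fits grows with the product of the grid sizes, which is why tuning is the most expensive step; logarithmically spaced grids are a common way to cover several orders of magnitude with few candidates.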

RESULTS AND DISCUSSIONS
MLR, PLS, and SVM with linear and radial kernels performed better than SVM with polynomial kernel in predicting the test dataset as indicated by low MAE (mean absolute error). Additionally, our results show a strong linear relationship between the ground-level ozone concentrations and the predictor variables.
In January 2009, the monthly value of the actual data was 0.02 parts per million (ppm). The SVM polynomial prediction was below the actual value, lying between the multiple least-squares regression prediction and the actual data; the SVM radial prediction was at the same position. The multiple least-squares regression prediction was at 0.01 ppm.
In February the actual data show a decreasing trend. While multiple least-squares regression shows a rapidly increasing trend, the SVM polynomial prediction decreased to 0.01 ppm, the same point as SVM radial. The PLS regression prediction in February was also at 0.01 ppm.
[Figure: x-axis labeled by month, Jan_09 to Sep_10]
In March the most rapid increase was shown by multiple least-squares regression and PLS. The actual data were still decreasing, but at a slow rate. The SVM polynomial prediction also rose relative to February, and SVM radial showed almost the same level of ground-level ozone as SVM polynomial. SVM linear decreased in March.
At the start of April the actual data were again at their January level of 0.02 ppm of ground-level ozone. Multiple least-squares regression also decreased. PLS regression fell to 0.01 ppm, well below its level of more than 0.02 ppm in the previous month. SVM linear increased slightly, whereas SVM radial decreased very slightly from the previous month. SVM polynomial reached 0.01 ppm in this month.
From this point on, all three methods and the actual data show an increasing trend until August. The actual data also reach their highest value of the study period after August. From August the trend in every series is decreasing, especially in the actual data. Meanwhile, only SVM polynomial holds steady, with very little fluctuation. After September the values fall very rapidly, especially for SVM polynomial. The ground-level ozone concentration decreases in every series until November. In that month multiple least-squares regression is the highest of the methods, at 0.02 ppm.
In November the lowest value is that of SVM linear. The decrease does not stop here but continues, with some fluctuations among the methods, until January. January shows the lowest ground-level ozone concentration, which places it outside the high-ozone season. SVM radial reaches the lowest value of all, 0 ppm, in this month. SVM polynomial is in much the same condition.
SVM polynomial differs only slightly from SVM radial, lying near 0 ppm. From January the PLS prediction falls rapidly and reaches 0 ppm in February. Only SVM polynomial rises more steeply than the others, reaching 0.018 ppm in that month. With the start of March the ground-level ozone begins to increase. Only SVM radial remains steady, with very little change in its readings over the previous three to four months. This marks the start of the season of higher ground-level ozone concentration. From this point on, every method shows an increasing trend with slight fluctuations, which is sustained until mid-May. After May, SVM polynomial has a fluctuating trend among others while SVM polynomial and SVM radial are rather stable. This situation continues until September.
Studies have compared SVM prediction performance across the branches of atmospheric science, such as meteorology, atmospheric physics, and atmospheric chemistry, in addition to weather forecasting. These studies support our findings, in which SVM proved to be a robust prediction tool; examples can be found in [35][36][37][38][39].

CONCLUSIONS AND SUGGESTIONS
The adverse effects of ozone can spread over very long distances and wide areas, depending on wind direction and speed, and can harm the health of inhabitants in various ways.
In this report, three methods (MLR, PLS regression, and SVM) were effective in predicting ground-level ozone concentrations at Al Jahra station in Kuwait. MLR, PLS regression, and SVM with linear and radial kernels performed better than SVM with a polynomial kernel. SVM tuning is computationally expensive, especially for the radial and polynomial kernels.
The analysis shows that the concentration of ground-level ozone (O3) varies with season, which reflects the impact of the predictor variables on the concentration. The concentration of O3 in air is usually low in the first four months of the year. It increases from April and continues to rise through July and August, after which it decreases until January and February. In this study the ozone concentration was predicted by multiple least-squares regression, PLS regression, and SVM.
The data for this research covered twenty-one consecutive months, from January 2009 to September 2010. The statistical methods used to forecast ground-level ozone show that the concentration varies from month to month. Ozone concentration responds directly to minor changes in rainfall, humidity, or temperature. The study also shows that the different methods give different predicted values, and some vary considerably in their results. The present research concerns a specific prediction site, Al Jahra station in Kuwait, and the work will also be helpful for future forecasts of ozone concentration.