
1 Introduction

Urban atmospheric pollutant levels are increasing day by day and are considered one of the main causes of the rising incidence of respiratory illness among city dwellers. It is now irrefutable that air pollution is driven largely by Total Suspended Particulates (TSP) and by respirable particulate matter less than 10 µm in aerodynamic diameter (PM10), which has numerous undesirable consequences for human health [1]. In an area with good airflow, pollutants quickly mix with the air and disperse; when trapped, however, their concentration can rise rapidly, ultimately degrading air quality. The Air Quality Index (AQI) measures how polluted the air is, while the properties of the air are described by its qualities [2]. These pollutants also affect the ozone layer, Earth's protective layer: a belt of naturally occurring ozone gas that sits 9.3–18.6 miles above Earth and shields it from the harmful ultraviolet-B radiation emitted by the sun. Steps such as the Montreal Protocol, which curbs emissions of ozone-depleting substances (ODS), have been taken and are expected to result in a near-complete recovery of the ozone layer near the middle of the 21st century. By 2012, the total combined abundance of anthropogenic ODS in the troposphere had decreased by nearly 10% from its 1994 peak [3, 5].

In the present era there is widespread concern about ozone layer depletion caused by pollution. Particulate matter causes several kinds of respiratory and cardiovascular disease and also contributes to ozone depletion, which draws ever more attention to air quality information. This shows the need to integrate different information systems, much as Birgersson et al. [6] did for data integration using machine learning. Prediction of air quality has thus become a necessity for safeguarding the future. Machine learning has been applied in various fields [14, 15, 17], and medical and other domains have been addressed by various classification techniques [18, 19, 22,23,24,25]. Just as Roy et al. (2013) applied rough set techniques for data investigation, in this paper the Random Forest, Multivariate Adaptive Regression Splines and Classification and Regression Tree techniques are applied to predict the concentration of ozone [13, 20, 21]. Chuanting Zhang and Dongfeng Yuan (2015) worked on fine-grained Air Quality Index level prediction using a Random Forest algorithm on Spark cluster computing [4]: previously existing methods could not meet the demand of real-time analysis, so a distributed random forest algorithm was implemented on Spark on the basis of resilient distributed datasets and shared variables, with a parallelized random forest also used as the prediction model. Estimation of benzene by on-field calibration of an electronic nose was carried out by De Vito et al. (2008), in which gas multisensors played an important role in raising the density of the monitoring network. Their concentration estimation capabilities, however, are seriously limited by the known stability and selectivity issues of the solid-state sensors they often rely on. The sensor fusion algorithms used in regression need to be properly tuned via supervised learning, but that training proved unsuccessful [7, 12].
Forecasting has become an essential part of planning for the future. Roy et al. (2015) worked on stock market forecasting using a Lasso linear regression model [16]. De Vito et al. (2009) worked on CO, NO2 and NOx urban pollution monitoring; some authors have used gas multisensor devices as a tool for densifying the urban pollution monitoring mesh because of their significantly low cost per unit [8], but the drawback is that these sensors are unreliable over the long term and have selectivity issues. In this paper we concentrate on the regression technique Multivariate Adaptive Regression Splines (MARS) for the Air Quality dataset. Hui et al. (2013) used this regression model to predict CO2 emissions in ASEAN countries: a comparative study of multiple regression (MR) and MARS was carried out for statistical modelling of CO2 over the period 1980–2007 [9], and the MARS model was found to be more feasible, with better predictive ability. This paper compares the regression techniques Random Forest, Multivariate Adaptive Regression Splines and Classification and Regression Tree on the Air Quality data, with prediction performed using the Salford Predictive Modeller.

This paper is organised as follows. Section 2 overviews proposed techniques of Random Forest, Multivariate Adaptive Regression Splines and Classification and Regression Tree. Section 3 gives the experimental setup and the steps involved in performing the regression techniques on the given dataset. Section 4 displays the results and discussion. Section 5 concludes the paper.

2 Proposed Techniques

To work with the Salford Modeller it is necessary to understand the regression techniques that will be used. All are machine learning methods: a computer program is said to learn from experience ‘E’ with respect to some class of tasks ‘T’ and performance measure ‘P’ if its performance at tasks in ‘T’, as measured by ‘P’, improves with experience ‘E’. All three techniques are used here to predict ozone concentration by extracting knowledge from the dataset.

A. Random Forest algorithm

It is a tree-based ensemble learning method that combines several models to solve a single prediction problem. The first algorithm for random decision forests was created by Tin Kam Ho using the random subspace method. A random forest can also be described as a collection of many CART trees that do not influence each other as they are constructed [4]; it works as a large collection of decorrelated decision trees. It is a bagging technique: noisy but unbiased models are averaged to create a model with low variance. The trees are grown on random selections of both the data and the variables. After all the trees are built, the data are fed down the trees and proximities are calculated for each pair of cases: if two cases occupy the same terminal node, their proximity is incremented by one. Finally, the proximities are normalized by dividing by the number of trees. Proximities can be used for replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data, which makes them one of the most useful tools in random forests. The proximities originally form an N × N matrix; after each tree is built, both training and test data are passed down the tree. Since a large dataset cannot fit an N × N matrix into fast memory, a modification reduces the required memory to N × T, where T is the number of trees in the forest. To speed up the computation-intensive scaling and iterative missing-value replacement, the user is given the option of retaining only the nrnn largest proximities for each case. When a test set is presented, the proximity of each case in the test set to each case in the training set can also be computed, at a moderate additional computing cost.
The dataset contains thousands of records from which the concentration of ozone is predicted. Random Forest is therefore useful here, as it handles thousands of input variables without variable deletion and assigns an importance score to each variable involved.
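As a sketch of how such proximities can be computed outside the Salford tools, the following uses scikit-learn's RandomForestRegressor on synthetic data standing in for the 12 predictors (an illustrative assumption; the paper itself uses the Salford Predictive Modeller):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the air-quality predictors (hypothetical data).
X, y = make_regression(n_samples=100, n_features=12, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Terminal-node index of every case in every tree: shape (n_samples, n_trees).
leaves = forest.apply(X)

# Proximity of a pair of cases: add one for every tree in which they share a
# terminal node, then normalize by dividing by the number of trees.
n_samples, n_trees = leaves.shape
proximity = np.zeros((n_samples, n_samples))
for t in range(n_trees):
    proximity += leaves[:, t][:, None] == leaves[:, t][None, :]
proximity /= n_trees
print(proximity.shape)  # the N x N proximity matrix
```

The resulting matrix is symmetric with ones on the diagonal, since every case always shares its own terminal node.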

B. Classification and Regression Tree algorithm

The Classification and Regression Tree was introduced by Breiman et al. (1984) for classification and regression predictive modelling problems. It is often referred to as a ‘decision tree’, but is named CART in modern software. It provides the foundation for important algorithms such as bagged decision trees, random forest and boosted decision trees. It is a binary tree that repeatedly splits a node into two child nodes, beginning with the root node that contains the whole learning sample. For a nominal categorical predictor x with I categories, there are \( 2^{I-1} - 1 \) possible splits; if X is an ordinal categorical or continuous variable with K distinct values, there are K − 1 different splits on X. At any node t, the best split s is chosen to maximize a splitting criterion \( \Delta {\text{i}}\left( {\text{s,t}} \right) \) [11]. Three splitting criteria are available.
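These candidate-split counts can be checked with a small sketch (the helper names are illustrative, not from any library):

```python
def nominal_splits(i_categories: int) -> int:
    """Candidate binary splits for a nominal predictor with I categories:
    2**(I - 1) - 1."""
    return 2 ** (i_categories - 1) - 1

def ordered_splits(k_values: int) -> int:
    """Candidate splits for an ordinal or continuous predictor with K
    distinct values: K - 1."""
    return k_values - 1

print(nominal_splits(4), ordered_splits(10))  # 7 and 9 candidate splits
```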

2.1 Gini Criterion

The impurity measure at a node t is defined as

$$ {\text{i}}\left( {\text{t}} \right) = \sum\nolimits_{i,j} {c\left( {i |j} \right)p\left( {i |t} \right)p(j|t)} $$
(1)

The decrease in impurity from split s is then given by

$$ \Delta {\text{i}}\left( {{\text{s}},{\text{t}}} \right) = {\text{i}}\left( {\text{t}} \right) - {\text{p}}_{\text{L}} {\text{i}}\left( {{\text{t}}_{\text{L}} } \right) - {\text{p}}_{\text{R}} {\text{i}}\left( {{\text{t}}_{\text{R}} } \right) $$
(2)

where \( p_{L} \) and \( p_{R} \) are the probabilities of sending a case to the left and right child nodes respectively:

$$ p_{L} = p\left( {t_{L} } \right)/p\left( t \right) $$
(3)

and

$$ p_{R} = p\left( {t_{R} } \right)/p\left( t \right) $$
(4)
2.2 Twoing Criterion

The twoing criterion selects the split s that maximizes

$$ \Delta i\left( {s,t} \right) = p_{L} p_{R} \left[ {\sum\nolimits_{j} {\left| {p\left( {j|t_{L} } \right) - p\left( {j|t_{R} } \right)} \right|} } \right]^{2} $$
(5)

CART does not require any special data preparation other than a good representation of the problem.
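The split search described above can be sketched in Python. This is an illustrative NumPy implementation assuming unit misclassification costs, under which the impurity of Eq. (1) reduces to \( 1 - \sum_j p(j|t)^2 \); it is not the Salford implementation:

```python
import numpy as np

def gini(labels):
    """Impurity i(t); with unit misclassification costs Eq. (1) reduces
    to 1 - sum_j p(j|t)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(x, y):
    """Pick the threshold s maximizing the impurity decrease of Eq. (2),
    with pL and pR as in Eqs. (3) and (4)."""
    parent = gini(y)
    best_gain, best_thr = -1.0, None
    for thr in np.unique(x)[:-1]:                 # K - 1 candidate splits
        left, right = y[x <= thr], y[x > thr]
        p_l = len(left) / len(y)                  # Eq. (3)
        p_r = 1.0 - p_l                           # Eq. (4)
        gain = parent - p_l * gini(left) - p_r * gini(right)  # Eq. (2)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # finds the clean split between 3.0 and 10.0
```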

C. Multivariate Adaptive Regression Splines algorithm

MARS is a form of regression analysis developed by Friedman in 1991 [10] with the aim of predicting a dependent variable from a set of independent variables. It is simpler than models such as random forests and neural networks, and can be seen as an extension of linear models that automatically captures nonlinearity and interactions between variables. MARS is not strongly affected by outliers and produces a model that can be written as an equation. It handles both classification and regression tasks, accepts a large number of predictors and chooses the important predictor variables. The MARS model can be used extensively for prediction, as is done for the concentration of ozone in this paper; it has been used for prediction and classification problems elsewhere (Islam et al. 2015; [4]), and its details can be examined through the Salford Systems website. The regression is influenced by the recursive partitioning method, for which any criterion can be chosen for the selection of the basis functions of the multivariate spline. MARS forms its model using two-sided truncated functions of the predictor x of the form below.

$$ \left( {x - t} \right)_{ + } = \left\{ {\begin{array}{*{20}c} {x - t,} & {x > t} \\ {0,} & {\text{otherwise}} \\ \end{array} } \right. $$
(6)

Equation (6) serves as a basis function for linear and non-linear functions and can be used to approximate any function f(x). In MARS, let the dependent variable (output) be y and the number of terms be M. The output can then be represented as follows:

$$ y = f\left( x \right) = \beta_{0} + \sum\limits_{m = 1}^{M} {\beta_{m} H_{km} \left( {x_{v\left( {k,m} \right)} } \right)} $$
(7)

In Eq. (7) MARS sums over the M terms; \( \beta_{0} \) and \( \beta_{m} \) are the parameters. The hinge function H can be written as

$$ H_{km} \left( {x_{v\left( {k,m} \right)} } \right) = \prod\limits_{k = 1}^{K} {h_{km} } $$
(8)

In Eq. (8), \( x_{v\left( {k,m} \right)} \) denotes the predictor in the kth component of the mth term.

The values K = 1 and K = 2 give additive and pairwise interactions respectively; for this work the chosen value of K is 2.
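The hinge basis of Eqs. (6)–(7) can be illustrated with a minimal sketch: hand-picked knots and a least-squares fit for the coefficients β. A real MARS run instead selects knots adaptively through its forward and backward passes:

```python
import numpy as np

def hinge(x, t):
    """Two-sided truncated basis (x - t)+ from Eq. (6)."""
    return np.maximum(x - t, 0.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = np.where(x < 4.0, 2.0 * x, 8.0) + rng.normal(0.0, 0.1, 200)  # kink at x = 4

# MARS-style model y = beta0 + sum_m beta_m * H_m(x) as in Eq. (7), with
# hand-picked knots; knots include the true kink at 4.
knots = [2.0, 4.0, 6.0]
H = np.column_stack(
    [np.ones_like(x)]
    + [hinge(x, t) for t in knots]          # right hinges (x - t)+
    + [hinge(-x, -t) for t in knots]        # mirrored hinges (t - x)+
)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
pred = H @ beta
print(np.mean((pred - y) ** 2))  # residual MSE near the noise variance
```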

3 Experiment

The experiment to predict the concentration of ozone is carried out using ‘Salford Predictive Modeller 7.0’, software from Salford Systems (founded in 1983) that supports the three techniques Random Forest, Multivariate Adaptive Regression Splines and Classification and Regression Tree.

A. Dataset

The dataset, given by Saverio De Vito in 2006 and available from the UCI repository, contains the responses of a multisensor device deployed on the field in an Italian city. It has 9358 instances with 15 attributes, recorded from March 2004 to February 2005. Ground-truth hourly averaged concentrations for CO, non-methanic hydrocarbons, benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO2) were provided by a co-located certified reference analyser. The attributes are: Date; Time; true hourly averaged CO concentration in mg/m³; PT08.S1 (tin oxide) hourly averaged sensor response; true hourly averaged overall non-methanic hydrocarbons concentration in µg/m³; true hourly averaged benzene concentration in µg/m³; PT08.S2 (titania) hourly averaged sensor response; true hourly averaged NOx concentration in ppb; PT08.S3 (tungsten oxide) hourly averaged sensor response; true hourly averaged NO2 concentration in µg/m³; PT08.S4 (tungsten oxide) hourly averaged sensor response; PT08.S5 (indium oxide) hourly averaged sensor response; temperature in °C; relative humidity (%); and absolute humidity (AH) [7].
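A sketch of loading this dataset in Python with pandas, assuming the CSV export distributed by UCI, which to our knowledge uses semicolon separators, comma decimal marks and −200 as the missing-value code; the sample rows below are illustrative, not copied from the file:

```python
import io
import pandas as pd

# Sample rows in the UCI AirQuality.csv layout: semicolon-separated fields,
# comma decimal marks, -200 coding a missing reading. Values are illustrative.
raw = (
    "Date;Time;CO(GT);PT08.S1(CO);NMHC(GT);PT08.S5(O3)\n"
    "10/03/2004;18.00.00;2,6;1360;150;1268\n"
    "10/03/2004;19.00.00;-200;1292;112;972\n"
)

df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
df = df.replace(-200.0, float("nan"))  # recode the missing-value sentinel
print(df["PT08.S5(O3)"].mean())        # target column used later in the paper
```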

B. Air Quality Prediction Steps

This describes the various steps taken for prediction by Salford Modeller.

Step 1: The dataset is opened in the software, which supports common data file formats.

Step 2: The model is designed by selecting the predictors. A total of 12 predictors are selected for MARS/Random Forest/CART for this dataset; Date and Time are not chosen, as they have no effect. PT08_S5_O3_ is set as the target in all cases.

Step 3: The analysis method is selected as MARS/Random Forest/CART with analysis type being ‘regression’ in all three cases.

Step 4: The dataset is then separated into a learning set and a test set by specifying the fraction of cases selected at random for testing. Here the test fraction is set to 0.30.

Step 5: The model is then run, and the resulting graph pops up showing the information required for future prediction of the target variable; a summary of all other details is discussed in the next section.
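Since the Salford Predictive Modeller is proprietary, the same workflow can be approximated with scikit-learn. The following is a sketch on synthetic stand-in data (MARS itself is not part of scikit-learn, though third-party implementations such as py-earth exist):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 12 selected predictors and the PT08.S5(O3) target.
X, y = make_regression(n_samples=1000, n_features=12, noise=5.0, random_state=0)

# Step 4: learn/test split with the 0.30 test fraction used in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    # 500 trees, 3 predictors tried per node, mirroring the settings of Sect. 4.
    "Random Forest": RandomForestRegressor(n_estimators=500, max_features=3,
                                           random_state=0),
    "CART": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```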

4 Result and Discussion

In this section we compare the results for the target variable PT08_S5_O3_ obtained from Random Forest, Multivariate Adaptive Regression Splines and Classification and Regression Tree in the Salford Predictive Modeller. Out of 15 attributes, 12 are used as predictors and one, PT08.S5 (O3), is the target variable. 30% of the 9358 instances are selected as the test set, while the rest form the learning set. This paper presents the graphical representation of the learn and test values, a summary of the important terms, and the list of variable importances for each of the three models. Applying Multivariate Adaptive Regression Splines produces Fig. 1, which shows the learn and test curves: the Y-axis represents MSE with an interval of 50,000 and the X-axis the basis functions, of which 15 were taken initially. The graph suggests minimal error, as the learn and test curves coincide; the MSE starts at about 200,000, drops to about 5,000 and then gradually becomes constant.

Fig. 1.
figure 1

MSE vs. basis functions

Several important parameters give the model error measure; they are listed in Table 1 with their values for both learn and test. The measures include RMSE, MSE, GCV, MAD, MRAD, SSY, SSE, R², normalized R² and GCV R-Sq. Out of the 15 attributes, 12 were set as predictors, but after the regression model was built it was deduced that only 8 variables were important for predicting PT08.S5 (O3), the most important being PT08_S2_NMHC. The scores of all variables, in decreasing order of their importance, are given in Table 2. The number of basis functions was set to 15 initially. The model introduces special variables to build an equation that covers all points of nonlinearity; these are termed basis variables, and the model is a weighted sum of basis functions. Each basis function takes one of the following forms:

Table 1. MARS result for learn and test
Table 2. Variable importance in MARS
  (1) Constant: only one term, i.e. the intercept.

  (2) A hinge function.

  (3) A product of two or more hinge functions.

Table 3 lists all the basis functions and their combination into the final equation for Y.

Table 3. MARS basis functions

Random Forest was run with the number of trees set to 500 and the number of predictors tried at each node set to 3; the report frequency was set to 10 and the parent minimum case count to 2. The elapsed time for building the trees was about 45 s. Separate graphs compare the train (Fig. 2) and test (Fig. 3) cases for the 500 trees, with a maximum of 3293 terminal nodes. A clear difference can be seen between the two curves: both start at an MSE of about 40,000, but the train curve is less steep than the test curve. The train set shows a turn at the 16th tree (18,502.625), while the test set turns at the 8th tree (16,911.168).

Fig. 2.
figure 2

Train set plot in random forest

Fig. 3.
figure 3

Test set plot in random forest

Several important parameters give the model error measure; they are listed in Table 4 with their values for both learn and test. The measures include RMSE, MSE, MAD, MRAD, SSY, SSE, R² and normalized R². Unlike MARS, in random forest all 12 variables carry their own importance and contribute to building the trees; the importance of each variable is shown in Table 5.

Table 4. Random forest result for learn and test
Table 5. Variable importance in random forest
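The error measures tabulated here follow standard definitions; assuming those definitions (the Salford formulas are not reproduced in this paper), they can be computed from predictions as in the following sketch:

```python
import numpy as np

def error_report(y_true, y_pred):
    """A few of the error measures reported by the modeller, under the
    standard textbook definitions (an assumption)."""
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_res = float(np.sum(resid ** 2))                     # SSE
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # SSY about the mean
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAD": float(np.mean(np.abs(resid))),
        "R^2": 1.0 - ss_res / ss_tot,
    }

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(error_report(y_true, y_pred))
```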

The Classification and Regression Tree model likewise builds trees. It produces a graph in which the Y-axis shows the relative error and the X-axis the number of nodes; the relative error at the 150th node is 0.083. By examining the graph directly we can thus read off the relative error of the test set from the train set, as shown in Fig. 4.

Fig. 4.
figure 4

Relative error through CART

Several important parameters give the model error measure; they are listed in Table 6 with their values for both learn and test. The measures include RMSE, MSE, MAD, MRAD, SSY, SSE, R², normalized R², AIC, AICc, BIC and relative error. Like Random Forest, CART requires all 12 predictor variables. Table 7 lists the variables according to their importance.

Table 6. CART result for learn and test
Table 7. Variable importance in CART

5 Conclusion

In this paper we have presented the prediction of ozone concentration using three regression models. Keeping the train and test sets in the ratio 7:3, we compared the results from all three cases. Evaluation of the prediction models indicates that the Multivariate Adaptive Regression Splines model describes the dataset best and achieves significantly better prediction accuracy than Random Forest and Classification and Regression Tree. Multivariate Adaptive Regression Splines also considers fewer variables than the other two, evaluating on the basis of 8 variables while the other two require all 12. Moreover, Random Forest takes somewhat longer to build its trees, with an elapsed time of about 45 s in this case. PT08_S2_NMHC_ is the most important variable according to Multivariate Adaptive Regression Splines, while PT08_S1_CO_ is the most important according to Random Forest and Classification and Regression Tree. Across the graphs, Multivariate Adaptive Regression Splines gives the closest agreement between the train and test curves. It can be concluded that multivariate adaptive regression splines can be a valuable tool for predicting ozone in the future.