1 Introduction

Air pollution is responsible for the death of millions of people every year. Long-term exposure to particulate matter (PM) and other air pollutants raises the risk of mortality from respiratory infections, heart disease, lung cancer, asthma and chronic obstructive pulmonary disease [1, 34]. The situation is extremely grim in Asian cities, where rapid urbanization, diesel-fueled transport, thermal power plants, trash burning, rapid industrialization and household fuels such as coal and biomass release harmful pollutants into the atmosphere. In this work, the air quality of Delhi, the capital of India, and the surrounding national capital region (NCR) is taken as a case study. According to the United Nations World Urbanization Prospects 2018 [30], Delhi is the fifth most populous city in the world, and the first if the entire NCR is included. Air quality in Delhi and the encircling NCR has deteriorated due to a rapid rise in particulate matter (PM\(_{2.5}\) and PM\(_{10}\)) from vehicles, crustal dust, dust from paved roads and construction sites, emissions from diesel generators, garbage and bio-fuel burning, industrial activities and thermal power plants [4, 6]. There are 10,000–30,000 deaths annually due to air pollution in Delhi, and it was designated the most polluted city in the world in 2014 [29]. Particulate matter has been associated with respiratory diseases, cardiovascular diseases, cancer and premature deaths. Delhi is ranked 3rd in the list of cities with the most deaths due to air pollution, Shanghai being the first [10, 13].

The life expectancy of the inhabitants of Delhi has dropped by 9 years due to air pollution (reported by the Energy Policy Institute, University of Chicago) [17]. People living in urban areas get information about air quality from fixed air quality monitoring stations, which are costly to build and maintain, sparsely located, and limited in number. The concentration of air pollutants is highly context dependent: depending upon the nature of traffic in a certain location, or an unplanned industrial activity, the concentration of particulate matter can change quickly in real time. An air quality sensing system mounted in an automobile that commutes through the city is able to provide large-scale, real-time, fine-grained details on pollutant levels and the quality of air in different areas of the city, throughout its route. Using this information with suitable statistical or stochastic techniques, it becomes possible to predict the air quality in that area. This data, when made available, can be used by commuters to plan their routes. It can also be used in various ways by smart city planners to improve urban air quality.

1.1 Contribution of this Work

The air quality in Delhi and the adjoining national capital region has been taken as a case study in this work. As air quality monitoring stations are few and far apart, this work focuses on the development of a mobile, location-aware internet of things (IoT) system for monitoring, modeling and analysis of air quality. Using the IoT system, air quality data is sensed for several months, and this data is then used to model and forecast air quality. The work applies various statistical and stochastic models to the air quality time-series data, and evaluates their performance to determine the efficacy of these models.

2 Related Work

The sensing, analysis and modelling of outdoor air quality is described in some recent works. Zhang et al. [33] have used 6-day meteorological and air quality data to develop a multi-model fusion model that forecasts PM\(_{2.5}\) in Beijing. An early warning system for air quality is developed in [11]. It uses hybrid Elman neural networks to generate forecasts. Wang and Chen [27] have developed a data collection and estimation system based on vehicular sensor networks (VSNs) to monitor urban air quality. The goal is to strike a balance between communication overhead and monitoring accuracy. The data gathering algorithm also uses Delaunay triangulation to estimate air quality indices of the locations without any sensed data. In [35], the authors have developed a system to determine city-wide air quality using sensed data from a few air quality monitoring stations that are spaced far apart. Granger causality is used to analyze the causality relations responsible for the generation of air pollution. The authors in [8] have proposed an on-road, real-time air quality monitoring and prediction approach for different road segments. The approach uses data from air quality monitoring stations and e-participatory pollution sensors, together with road network data and contextual data. Dijkstra's algorithm and an artificial neural network model are used, respectively, for finding the least polluted route in the road network and for the prediction of air quality. Data processing is done using HBase and MapReduce.

The authors in [7] have used a multivariate long short-term memory technique to learn the spatial correlations and temporal dependencies of traffic patterns at different base stations, to accurately forecast future traffic. A weighted graph of the base stations is built afterwards, according to their traffic patterns, and an algorithm to find the optimal base station clustering scheme is proposed. The performance of the framework has been evaluated using real-time data for 2 months. In [24], the authors have discussed the losses in energy generation by solar panels due to the increase in air pollution. Kiruthika et al. [15] have created an IoT system for monitoring temperature, humidity, soil moisture and air pollution. The recorded data is reported to users and alerts are generated if safety thresholds are breached.

A hardware and software system for monitoring air quality is developed by authors in [19]. This system has been developed for cities with high vehicular traffic. The system collects nitrogen dioxide (NO\(_2\)) emission data at a single location in Ternopil city, and builds a mathematical model to predict NO\(_2\) emission at that point, based on the vehicle emissions. The drawback with this work is that it considers vehicular pollution at a single location to predict the spread of a single pollutant. In [23], the authors have built a low cost, IoT based pollution monitoring system, and the gathered data is displayed on a web-server. Mrigank et al. [16] have developed a long short term memory based prediction model to predict hourly CO, PM\(_{2.5}\), O\(_{3}\) and NO\(_x\) concentrations in Delhi.

The authors in [26] have proposed a dynamic route determination algorithm, so as to avoid routes with intense traffic and poor air quality, while choosing routes with less traffic and better air quality. However, the authors did not elaborate how cost for different route metrics will be determined. In [3] the authors have deployed sensor nodes in the city of Ottawa, Ontario to collect and transmit IoT data. The information about air quality and noise levels in different parts of the city helps citizens decide which areas of the city are more suitable for residential purposes.

To the best of our knowledge, there is no single recent work that has involved end to end process of building IoT systems to sense, analyze and model air quality data in real time. This is our motivation to take up this comprehensive work of developing an IoT system to sense, visualize, model and analyze outdoor air quality data.

Fig. 1 Our IoT sensing: sensor setup present in the vehicle transmits sensed data to IoT cloud for modelling and analysis

3 IoT Sensing System

The air-pollutant concentration in a city is highly location dependent, and is heavily influenced by the vehicular traffic, industrial emissions, construction activities, vegetation and population in that area. Fixed air quality monitoring stations are spaced far apart and do not provide the actual level of air quality in areas not adjoining them. Hence, a sensing and data collection setup is developed for this study that provides real-time, fine-grained details on the concentration of pollutants in different areas of the city. As shown in Fig. 1, the sensing system captures the variations in air pollutant concentrations in the vehicle with windows open. It comprises an array of sensors that are connected to an Arduino Uno micro-controller. The Arduino board wirelessly transmits the sensed data to an Android application with the help of a Bluetooth module. The sensed data is received by the smartphone application and uploaded to the IoT cloud for visualization and analysis. The details of all sensors are given in Table 1. The concentration of CO is measured in ppm while that of PM is in \(\upmu \)g/m\(^3\). The hardware and software setup of our sensing system is discussed in detail in the following paragraphs.

Table 1 Hardware sensing setup

3.1 Hardware Setup

The sensors used in the sensing system are (1) gas sensors such as MQ135 (carbon dioxide), MQ2 (carbon monoxide) and MQ4 (methane); (2) Sharp optical sensor GP2Y1010AU0F (particulate matter); (3) sound sensor (noise); (4) DHT22 Sensor (temperature and humidity); and (5) Adafruit Ultimate GPS (latitude, longitude). Arduino Uno, which is based on ATmega328P 8-bit micro-controller, is used for collection and transmission of sensed data. Bluetooth module (HC-05) is used to transmit the sensed data to smartphone application. The sensing system with Arduino and the connected sensors are shown in Fig. 2.

Fig. 2 Sensing system with Arduino and connected sensors

In our setup, MQ2, MQ4 and MQ135 are gas sensors. A gas sensor contains a steel skeleton which houses the sensing element. Tin-oxide (SnO\(_2\)) is often used as the sensing element.

Fig. 3 Working principle of an electro-chemical gas sensor [5]

The working of an electro-chemical gas sensor is explained in Fig. 3. Current flows along the grain boundary (conjunction parts) within the sensor. At the grain boundary, the adsorbed oxygen creates a potential barrier that prohibits the carriers from moving freely. In the presence of a reducing gas such as CO, the surface density of the adsorbed oxygen decreases. This in turn reduces the barrier height, decreasing the sensor resistance. The sensor resistance is obtained from the measured output voltage through the relation in Eq. (2). When a high gas concentration is detected by the sensor, the resistance value in the grain boundary is reduced, and the output voltage changes accordingly. The equations to convert the measured voltage into gas concentrations are [12]:

$$\begin{aligned} V_{eq}= & {} V_{ao} \times \left( \frac{5}{1024}\right) \end{aligned}$$
(1)
$$\begin{aligned} R_s= & {} 20{,}000 \times \left( \frac{5.0 \, {-} \, V_{eq}}{V_{eq}}\right) \end{aligned}$$
(2)
$$\begin{aligned} Gas \, concentration \,(in \, \mu g/m^3)= & {} 37{,}143 \times \left( \frac{R_s}{R_o}\right) ^{-3.178} \end{aligned}$$
(3)

where \(V_{ao}\) is the analog output of the gas sensor to Arduino Uno, \(V_{eq}\) represents the equivalent voltage from the analog sensor output (in V) and, \(R_s\) denotes the sensor resistance at different concentrations of the gas, at different humidities and temperatures.
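Equations (1) to (3) can be chained into a single conversion routine. The following is a minimal Python sketch for illustration, not the embedded C program that runs on the Arduino. The function names are hypothetical, and the reference resistance \(R_o\) is assumed to come from a separate calibration step that is not described here.

```python
ADC_STEPS = 1024      # resolution used in Eq. (1)
V_SUPPLY = 5.0        # supply voltage in volts, as in Eqs. (1) and (2)
R_LOAD = 20_000.0     # constant of Eq. (2), in ohms


def adc_to_voltage(v_ao: int) -> float:
    """Eq. (1): convert the raw analog reading to an equivalent voltage."""
    return v_ao * (V_SUPPLY / ADC_STEPS)


def sensor_resistance(v_eq: float) -> float:
    """Eq. (2): sensor resistance R_s from the equivalent voltage."""
    if v_eq <= 0:
        raise ValueError("equivalent voltage must be positive")
    return R_LOAD * (V_SUPPLY - v_eq) / v_eq


def gas_concentration(r_s: float, r_o: float) -> float:
    """Eq. (3): gas concentration from the ratio R_s / R_o."""
    return 37_143.0 * (r_s / r_o) ** -3.178


def reading_to_concentration(v_ao: int, r_o: float) -> float:
    """Chain Eqs. (1)-(3). r_o is an assumed calibration value."""
    return gas_concentration(sensor_resistance(adc_to_voltage(v_ao)), r_o)
```

For example, a mid-scale reading of 512 gives \(V_{eq}\) = 2.5 V and \(R_s\) = 20,000, so with an assumed \(R_o\) of 20,000 the ratio is 1 and Eq. (3) returns its leading constant.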

Fig. 4 Schematic diagram of GP2Y1010AU0F optical sensor for measuring particulate matter

Particulate matter (PM\(_{2.5}\) and PM\(_{10}\)) is measured using the small, low-power GP2Y1010AU0F optical sensor. This sensor is chosen over other sensors for measuring PM\(_{2.5}\) and PM\(_{10}\) concentrations because of its accuracy and sensitivity, as it has the ability to measure PM density up to 0.5 mg/m\(^3\) [14]. Figure 4 explains the working of the GP2Y1010AU0F optical sensor. An infrared emitting diode (IRED) and a photo-diode are arranged diagonally to track the light reflected by the pollutant particles within the orifice. The presence of PM in the orifice changes the output voltage of the photo-diode. The relationship between output voltage and PM concentration in air for an appropriate range of PM concentrations (0.5 \(\upmu \)g/m\(^3\)–1.2 mg/m\(^3\)) is linear, with a sensitivity of 0.5 V/100 \(\upmu \)g/m\(^3\). The PM concentration (in \(\upmu \)g/m\(^3\), the unit used for PM throughout this work) is obtained with the help of the equation [14]:

$$\begin{aligned} \rho = (0.172 \times V_{op} \, \, {-} \, \, 0.0999) \times 1000 \end{aligned}$$
(4)

where,

  • \(\rho \) = PM density (in \(\upmu \)g/m\(^3\)), and

  • \(V_{op}\) = output voltage of the optical sensor (in V).
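Equation (4) is a single linear map from voltage to PM density. A minimal Python sketch is given below; the function name is hypothetical, and clamping negative results to zero is an assumption made here for very low voltages, not something stated in the text.

```python
def pm_density(v_op: float) -> float:
    """Eq. (4): PM density from the optical sensor output voltage (in V)."""
    density = (0.172 * v_op - 0.0999) * 1000
    # Assumption: a negative value has no physical meaning, so report zero.
    return max(density, 0.0)
```

With an output voltage of 1 V, the expression evaluates to (0.172 − 0.0999) × 1000 = 72.1.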

3.2 Software System Setup

The data collected by the sensors in the sensing system is transmitted to the Arduino Uno, which uses the Bluetooth module (HC-05) to transfer it further to the smartphone application. The software setup of our IoT system consists of: (a) an embedded C program which is uploaded into the memory of the Arduino micro-controller and is used for collection, calibration and transmission of the sensor data to the Android application installed on a smartphone; (b) an Android application that receives the sensed data and stores it in the phone memory as a comma-separated values (.csv) file; and (c) a cloud service that receives the data from the Android application. In our IoT system, the cloud platform is developed using IBM Cloud. The sensed data is transmitted from the Android application to the IBM Cloud, where it is stored. The stored data can be accessed by users with the help of a smartphone application or a web browser.

The sensed parameters are collected by the sensing setup, which transmits the sensed data to the Android application on smartphone with the help of Bluetooth module.

4 Data Collection Setup

The sensing setup is mounted in a vehicle and is used for the collection of air quality parameters. The smartphone application collects the sensed data and transmits it to the IoT cloud. Figure 5 shows the sensing setup in the vehicle, where it collects the air quality parameters and transmits them to the IoT cloud. The windows of the vehicle are kept open, allowing the outside air to flow freely inside, so that the pollutant concentration inside the vehicle matches that of the surroundings.

Fig. 5 Data collection setup: Smartphone application receiving air quality data from the sensor setup

These observations have been made during the winter months of November, December 2018 and January, February 2019. Air quality parameters were collected using the sensing system along the highway from Dadri to Anand Vihar (Delhi-NCR).

Fig. 6 Colour coding scheme according to the AQI values (National Air Quality Index, India, 2014)

The sensed air quality parameters are used in the calculation of the air quality index (AQI), the index used by the government to indicate the state of air pollution at a location. The National AQI of India takes into consideration 8 pollutants (PM\(_{2.5}\), NO\(_2\), PM\(_{10}\), O\(_3\), SO\(_2\), NH\(_{3}\), CO and Pb). A sub-index is calculated for each pollutant as a linear function of the actual atmospheric concentration of the pollutant between two break points. The sub-index with the largest value is the overall AQI. The sub-index for a pollutant ‘x’ can be calculated with the help of the following AQI equation [21]:

$$\begin{aligned} I_{x} = \left( C_{x} - {Bx}_{L}\right) \, \times \, \left( \frac{I_{H} - I_{L}}{{Bx}_{H} - {Bx}_{L}}\right) + I_{L} \end{aligned}$$
(5)

where \(I_{x}\) is the sub-index of the pollutant x,

  • \(C_{x}\) is the actual concentration of pollutant x,

  • \({Bx}_{H}\) denotes the break point which is larger than or equal to \(C_{x}\),

  • \({Bx}_{L}\) represents the break point which is lesser than or equal to \(C_{x}\),

  • \(I_{H}\) is the AQI value which corresponds to \({Bx}_{H}\), and

  • \(I_{L}\) is the AQI value which corresponds to \({Bx}_{L}\).

The sub-indices for the pollutants under observation are compared with each-other and the largest sub-index value is the final AQI.
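The sub-index computation and the final maximum can be sketched in Python as follows. The sketch uses the usual linear interpolation between break points, in which the result is offset by \(I_L\) so that \(C_x = {Bx}_L\) maps to \(I_L\). The break point table below is an illustrative placeholder, not the official table; the names are hypothetical.

```python
from typing import Dict, List, Tuple

# Rows of (Bx_L, Bx_H, I_L, I_H). PLACEHOLDER values for illustration only;
# the official National AQI break points should be used in practice.
Breakpoints = List[Tuple[float, float, float, float]]

EXAMPLE_BREAKPOINTS: Breakpoints = [
    (0, 30, 0, 50),
    (30, 60, 50, 100),
    (60, 90, 100, 200),
    (90, 120, 200, 300),
    (120, 250, 300, 400),
    (250, 500, 400, 500),
]


def sub_index(c_x: float, table: Breakpoints) -> float:
    """Linear interpolation of the sub-index between two break points."""
    for b_lo, b_hi, i_lo, i_hi in table:
        if b_lo <= c_x <= b_hi:
            return i_lo + (c_x - b_lo) * (i_hi - i_lo) / (b_hi - b_lo)
    raise ValueError("concentration outside the break point table")


def overall_aqi(concentrations: Dict[str, float],
                tables: Dict[str, Breakpoints]) -> float:
    """The overall AQI is the largest sub-index among the pollutants."""
    return max(sub_index(c, tables[name]) for name, c in concentrations.items())
```

With the placeholder table, a concentration of 45 falls in the second row and gives a sub-index of 50 + 15 × 50/30 = 75.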

Fig. 7 Variation of AQI during the winter months of November, 2018 to February, 2019 at Anand Vihar and Dadri respectively

For a particular location, the AQI is calculated only if the pollution concentration data for at least 3 pollutants is available, of which at least one must be PM\(_{2.5}\) or PM\(_{10}\). Otherwise, the data is considered insufficient for AQI calculation. In our work, the pollutants that have been taken into consideration are CO, PM\(_{2.5}\) and PM\(_{10}\). The colour coding scheme for AQI as per the National Air Quality Index of India, and the level of health concern indicated by each colour are shown in Fig. 6. Figure 7 shows the variation of daily average AQI over time for a period of 4 months (November, 2018 to February, 2019) at Anand Vihar and Dadri respectively.

From Fig. 7, it is clear that for most of the days, AQI levels are significantly higher than the permissible limit of 100. Hence, we calculate the likelihood of being exposed to certain levels of AQI in these two areas, and compute the exceedance probability of AQI. The exceedance probability for a certain threshold value is computed as:

$$\begin{aligned} P\{ (AQI) > (AQI_{TH}) \} = \frac{days\, on\, which\, AQI\, exceeds}{total\, no.\, of \, days} \end{aligned}$$
(6)
Table 2 Exceedance probabilities at Anand Vihar and Dadri
Fig. 8 The exceedance probability in percent is plotted against AQI

For this data, we calculate the exceedance probability for AQI levels of 50, 100, 200, 300 and 400. Table 2 and Fig. 8 detail the exceedance probabilities at Anand Vihar and Dadri. From Fig. 8, it is clear that air quality is poor (> 200) in Anand Vihar and Dadri with very high exceedance probabilities of 89% and 82%, respectively. The probabilities that air quality levels are hazardous (> 400) in these areas are close to 39% and 35%, respectively. To estimate how often such levels of AQI occur in these two places, we then calculate the return period. The return period of an AQI level in our time-series data represents the average number of days between occurrences of AQI levels of that magnitude or higher. It is found by taking the inverse of the exceedance probability computed above. Hence, we calculate the return period as:

$$\begin{aligned} Return \, Period = \frac{1}{P\{ (AQI) > (AQI_{TH}) \}} \end{aligned}$$
(7)

It implies that the AQI value of 400 will be exceeded about once every 2.85 \(\approx \) 3 days at Dadri. Figure 9 is the plot of return period against AQI at Anand Vihar and our university at Dadri respectively. The return periods for all the AQI levels are shown in Table 3. We can see that the hazardous level of AQI is likely to return, on average, about every 2.53 days at Anand Vihar, while the interval is slightly longer (2.85 days) at Dadri.
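Equations (6) and (7) amount to a count and a reciprocal. A minimal Python sketch is shown below; the function names are hypothetical, and returning infinity when the threshold is never exceeded is a convention assumed here.

```python
import math
from typing import Sequence


def exceedance_probability(aqi_values: Sequence[float], threshold: float) -> float:
    """Eq. (6): fraction of days on which the AQI exceeds the threshold."""
    if not aqi_values:
        raise ValueError("at least one daily AQI value is required")
    days_exceeding = sum(1 for value in aqi_values if value > threshold)
    return days_exceeding / len(aqi_values)


def return_period(aqi_values: Sequence[float], threshold: float) -> float:
    """Eq. (7): average number of days between exceedances."""
    p = exceedance_probability(aqi_values, threshold)
    # Assumption: a threshold that is never exceeded has an infinite return period.
    return math.inf if p == 0 else 1.0 / p
```

As a check against the reported figures, an exceedance probability close to 0.35 gives a return period of roughly 1/0.35 ≈ 2.86 days, in line with the 2.85 days reported for Dadri.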

Fig. 9 The return period at Dadri and Anand Vihar in days

Table 3 Return period (days) at Anand Vihar and Dadri

5 Forecasting Air Quality

Forecasting air quality is critical as it can be used to provide warnings and alerts for various applications. These include generating health alerts for people who have health conditions like asthma and allergies, and who are prone to skin, eye and lung irritation on short-term exposure, and to blood and liver issues on long-term exposure. Forecast models can be used to issue warnings for re-routing vehicular traffic in high density metropolitan areas, so that people can drive through alternate, less polluted routes. These models can also be used to strengthen government air pollution control programs. Stochastic, statistical, machine learning based and hybrid techniques can be used for developing forecast models [35].

Linear, multiple linear, ridge and quantile regression techniques may be employed in the development of statistical models. Auto regressive moving average (ARMA), autoregressive integrated moving average (ARIMA) and exponential smoothing are some techniques that are used in building stochastic forecast models. Machine learning based forecast is performed using algorithms such as support vector regression, fuzzy logic and artificial neural networks (ANN). Hybrid methods involve combining two or more of these techniques for the improvement in forecast accuracy [2]. In this section, we have presented our statistical and stochastic models for forecasting air quality. The forecasted values generated using these models have been plotted. In Sect. 6, we have done extensive performance evaluation of these models by (a) comparing forecasting performance, (b) measuring quality of these forecasting models, (c) comparing accuracy of these forecasting models, and (d) finding the size of forecasting error in all these models.

5.1 Our Quantile Regression based Forecasting Model

Regression analysis is a statistical method that is used to estimate the variation in our dependent variable (AQI) when one or more independent variables are varied. However, widely used regression methods such as ordinary least squares (OLS) or linear regression are highly sensitive to outliers. As shown in Fig. 7, sharp peaks or outliers are observed in AQI on days 10, 13, 20, 30, 36, 50, 53, 60, 70, 80 and 100, indicating that quantile regression would serve better than linear regression to neutralize the effect of such outliers. Quantile regression is an analysis method frequently used in econometrics and statistics, which produces estimates that are more robust against outliers [28, 32]. To define our quantile regression model, a sample of ‘n’ AQI observations \(y_i\) is considered, where the predicted AQI value \(z_i\) is bounded within a specified interval between \(z_{min}\) and \(z_{max}\). Our quantile regression model for AQI can be expressed as

$$\begin{aligned} Z_i = Y_i\cdot \beta _{p} + \epsilon _{i} \end{aligned}$$
(8)

where \(Z_i\) is the AQI value to be predicted, \(Y_i\) is the previous AQI and \(\beta _p\) = {\(\beta _{p1}\), \(\beta _{p2}\), ..., \(\beta _{pn}\)} denotes the unknown regression parameters. We assume that P(\(\epsilon _i\) \(\le \) 0 \(\vert \) \(y_i\)) = p, where p is a number between 0 and 1, or equivalently

$$\begin{aligned} P(z_i \le y_i \cdot \beta _{p} \vert y_i) = p \end{aligned}$$
(9)

The quantile ‘p’ of the probability distribution of the predicted AQI values \(z_i\), given previous values of AQI \(y_i\) is described as

$$\begin{aligned} Q_z(p) = y_i \cdot \beta _p \end{aligned}$$
(10)

The value p \(=0.5\) is the conditional median which divides the conditional distribution into two equal parts. There exists a specific set of parameters \(\beta _{p}\) for each quantile ‘p’ and a non-decreasing function ‘h’ is defined in the interval (\(z_{min}\), \({z_{max}}\)) such that:

$$\begin{aligned} h\{Q_z(p)\} = y_i \cdot \beta _p \end{aligned}$$
(11)

The logistic transformation of the above equation is defined as

$$\begin{aligned} h(z_i) = \log \left( \frac{z_i - z_{min}}{z_{max} - z_i}\right) = \mathrm{logit}(z_i) \end{aligned}$$
(12)

By combining Eqs. (11) and (12), the inverse function is defined as

$$\begin{aligned} Q_z(p) = \frac{exp(y_i \cdot \beta _p)\cdot z_{max} + z_{min}}{1 + exp(y_i\cdot \beta _p) } \end{aligned}$$
(13)

The inference on \(Q_z\)(p) is obtained by the inverse transform of Eq. (12) when the estimates for quantile regression coefficients ‘\(\beta _{p}\)’ are obtained:

$$\begin{aligned} Q_{h(z_i)}(p) = Q_{\mathrm{logit}(z_i)}(p) = y_i \cdot \beta _p \end{aligned}$$
(14)
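The transformation of Eq. (12) and its inverse in Eq. (13) can be checked numerically with a short Python sketch. The function names are hypothetical, and the bounds used in the example that follows (0 and 500) are assumed values for \(z_{min}\) and \(z_{max}\), which are not specified in the text.

```python
import math


def logit_transform(z: float, z_min: float, z_max: float) -> float:
    """Eq. (12): map a bounded AQI value onto the real line."""
    if not z_min < z < z_max:
        raise ValueError("z must lie strictly between z_min and z_max")
    return math.log((z - z_min) / (z_max - z))


def inverse_logit_transform(h: float, z_min: float, z_max: float) -> float:
    """Eq. (13): map a linear predictor back into (z_min, z_max)."""
    e = math.exp(h)
    return (e * z_max + z_min) / (1.0 + e)
```

Because Eq. (13) is the inverse of Eq. (12), transforming a value and mapping it back should return the original value; for instance, the midpoint 250 of the assumed interval maps to 0 and back to 250.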

Using quantile regression, we obtain our regression equation as

$$\begin{aligned} Z \,= \,-97.11\, + \,36.44\,T \end{aligned}$$
(15)

where ‘Z’ is the predicted AQI and ‘T’ is the day on which the AQI is to be predicted. The predicted AQI values (the quantile regression line) at Dadri and Anand Vihar are shown along with the actual AQI values in Figs. 10 and 11, respectively.
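To illustrate why a quantile regression line resists outliers, the sketch below fits a line by minimizing the pinball (check) loss. It is a brute-force illustration and not the estimation procedure used in this work: it relies on the property that, for a straight line, at least one optimal fit passes through two of the observations, and it ignores the logistic transformation. All names and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple


def pinball_loss(residual: float, p: float) -> float:
    """Check loss for quantile p; p = 0.5 gives half the absolute error."""
    return p * residual if residual >= 0 else (p - 1) * residual


def fit_quantile_line(xs: Sequence[float], ys: Sequence[float],
                      p: float = 0.5) -> Tuple[float, float]:
    """Return (intercept, slope) minimizing the total pinball loss.

    Brute force over all lines through two observations; only meant
    for small illustrative samples.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(pinball_loss(y - (intercept + slope * x), p)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two observations with distinct x")
    return best[1], best[2]
```

On made-up data that follow y = 1 + 2x except for a single spike, the median fit (p = 0.5) recovers the underlying line, whereas a least-squares fit would be pulled towards the spike.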

Fig. 10 Projected values of AQI at Dadri using Quantile Regression in comparison to the actual AQI

Fig. 11 Projected values of AQI at Anand Vihar using Quantile Regression in comparison to the actual AQI

5.2 Our ARMA/ARIMA based Forecasting Model

Our AQI time series (Figs. 10, 11) doesn’t maintain a linear relationship over time. The statistical properties of the time-series such as correlations among the daily AQI values are not constant over time. Such time-series are best analyzed using Box Jenkins or the autoregressive moving average models (ARMA) [25]. The ARMA (r,s) model is obtained by combining the AR(r) and MA(s) models. In our AR model, present air quality value (\(Z_t\)) depends on the values of AQI for the previous days. Our AR model is represented as:

$$\begin{aligned} Z_t = f(Z_{t-1}, Z_{t-2}, Z_{t-3}, Z_{t-4}, \ldots , Z_{t-k}, \theta _t) \end{aligned}$$
(16)

Here, \(Z_t\) is the predicted AQI value, \(Z_{t-1}, Z_{t-2}, Z_{t-3}, Z_{t-4}, \ldots , Z_{t-k}\) are the ‘k’ values of AQI for previous days, and the error term is \(\theta _{t}\). In the case of the AR(r) model, the predicted AQI value \(Z_t\) depends on ‘r’ AQI values of previous days. Our AR model is described by the equation:

$$\begin{aligned} Z_t = \alpha _0 + \alpha _1 \cdot Z_{t-1} + \alpha _2 \cdot Z_{t-2} + \alpha _3 \cdot Z_{t-3} +\cdots + \alpha _r \cdot Z_{t-r} + \theta _t \end{aligned}$$
(17)

(where \(\alpha _0, \alpha _1, \alpha _2, \alpha _3, \ldots , \alpha _r\) are the coefficients to be calculated and \(\theta _t\) is the error term).

The MA model is composed of the random variations of our AQI time series. It is a linear combination of the random error terms in the time-series and is represented by the equation:

$$\begin{aligned} Z_t = f(\theta _{t-1}, \theta _{t-2}, \theta _{t-3}, \theta _{t-4}, \ldots , \theta _{t-s}) \end{aligned}$$
(18)

Here, \(Z_t\) is the predicted air quality value and \(\theta _{t-1}\), \(\theta _{t-2}\), \(\theta _{t-3}\), \(\theta _{t-4}, \ldots , \theta _{t-s}\) are the ‘s’ previous error terms of air quality.

Our MA model, which depends on the ‘s’ error terms of previous days (MA(s) model), is described using the equation:

$$\begin{aligned} Z_t = \sigma _0 + \sigma _1 \cdot \theta _{t-1} + \sigma _2 \cdot \theta _{t-2} + \sigma _3 \cdot \theta _{t-3} +\cdots + \sigma _s\cdot \theta _{t-s} \end{aligned}$$
(19)

(here \(\sigma _0\), \(\sigma _1\), \(\sigma _2\), \(\sigma _3, \cdots , \sigma _s\) are the regression coefficients to be computed).

We represent our time-series of AQI as a combination of both AR and MA models, referred to as the ARMA(r,s) model. The ARMA(r, s) model depends on ‘r’ of its past AQI values and ‘s’ past values of random error terms, using the equation:

$$\begin{aligned} \begin{aligned}&Z_t = \alpha _0 + \alpha _1 \cdot Z_{t-1} + \alpha _2 \cdot Z_{t-2} + \alpha _3 \cdot Z_{t-3} +\cdots + \alpha _r\cdot Z_{t-r} \\&\quad +\sigma _0 + \sigma _1 \cdot \theta _{t-1} + \sigma _2 \cdot \theta _{t-2} + \sigma _3 \cdot \theta _{t-3} +\cdots + \sigma _s \cdot \theta _{t-s} \end{aligned} \end{aligned}$$
(20)
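Given fitted coefficients, Eq. (20) yields a one-step-ahead value directly from the most recent AQI values and error terms. The Python sketch below shows that arithmetic only; the function name is hypothetical, and the coefficients in the example that follows are invented for illustration rather than estimated from our data.

```python
from typing import Sequence


def arma_one_step(history: Sequence[float], errors: Sequence[float],
                  alpha: Sequence[float], sigma: Sequence[float]) -> float:
    """Evaluate Eq. (20) for the next time step.

    history: past AQI values, oldest first (at least r of them)
    errors:  past error terms, oldest first (at least s of them)
    alpha:   [alpha_0, alpha_1, ..., alpha_r]
    sigma:   [sigma_0, sigma_1, ..., sigma_s]
    """
    r, s = len(alpha) - 1, len(sigma) - 1
    if len(history) < r or len(errors) < s:
        raise ValueError("not enough past values for the requested orders")
    # history[-k] is Z_{t-k} and errors[-k] is theta_{t-k}
    ar_part = alpha[0] + sum(alpha[k] * history[-k] for k in range(1, r + 1))
    ma_part = sigma[0] + sum(sigma[k] * errors[-k] for k in range(1, s + 1))
    return ar_part + ma_part
```

For instance, with invented coefficients alpha = [10, 0.5, 0.3] and sigma = [0, 0.4], past AQI values 200 and 220, and a last error of 5, the result is 10 + 0.5 × 220 + 0.3 × 200 + 0.4 × 5 = 182.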

ARMA model can be applied only if the time-series under observation is stationary. A time-series is said to be stationary if it has constant mean and variance with time. Hence, our time-series of AQI values is stationary if:

$$\begin{aligned} \mu [Z_1] = \mu [Z_2] = \mu [Z_3] = \cdots = \mu [Z_k] = \mathrm{constant} \end{aligned}$$
(21)

and,

$$\begin{aligned} Var[Z_1] = Var[Z_2] = \cdots = Var[Z_k] = \mathrm{constant} \end{aligned}$$
(22)

Here, \(\mu \) is the mean, Var is the variance and \(Z_{1}\), \(Z_{2},\ldots , Z_{k}\) are AQI values at different lags.

If the time-series under study is not stationary, it must be made stationary by differencing it with respect to time. We performed 2 stationarity tests on our AQI time series at Dadri: (1) Augmented Dickey-Fuller (ADF) test, and (2) KPSS test.

5.2.1 Augmented Dickey–Fuller (ADF) test

This test is used to determine if a unit root is present in our AQI time series or not [9]. The null and alternate hypothesis for the ADF test are described as:

  • Null Hypothesis Our AQI time-series possesses a unit root,

  • Alternate Hypothesis Our AQI time-series doesn’t have a unit root.

If the null hypothesis is accepted, it indicates that our time-series has a unit root, i.e., it is non-stationary but can be made stationary by differencing (difference-stationary). But if the null hypothesis is rejected, it implies that our AQI time-series is stationary.

5.2.2 Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test

We use the KPSS test to find out if our AQI time-series is trend-stationary or if it contains a unit root [18]. The null and alternate hypotheses for the KPSS test are described as:

  • Null Hypothesis Our AQI time-series is trend stationary,

  • Alternate Hypothesis Our AQI time-series has a unit root.

If null hypothesis is accepted, it indicates that our time-series is trend-stationary. The trend, which is a function of time, must be removed to make the time-series stationary. But, if the null hypothesis is rejected, it implies that our AQI time-series has a unit root i.e., it can be made stationary by differencing.

The significance level, \(\alpha \), for each of the above tests is taken to be 0.05. If the calculated p-value for a particular test is equal to or more than 0.05, the null hypothesis is accepted. But, if the p-value is less than the significance level \(\alpha \), the null hypothesis is rejected and the alternate hypothesis is accepted. Table 4 shows the results of the stationarity tests for our daily average AQI data for the period from November, 2018 to February, 2019.
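The decision rule above, combined with the hypotheses of the two tests, can be written as a short Python sketch. The function names are hypothetical and the p-values passed in are placeholders supplied by the caller; the sketch does not compute the test statistics themselves.

```python
def null_accepted(p_value: float, alpha: float = 0.05) -> bool:
    """The null hypothesis is accepted when the p-value is at least alpha."""
    return p_value >= alpha


def unit_root_indicated(adf_p: float, kpss_p: float, alpha: float = 0.05) -> bool:
    """True when both tests point to a unit root.

    ADF:  the null hypothesis is a unit root, so accepting it indicates one.
    KPSS: the alternate hypothesis is a unit root, so rejecting the null
          indicates one.
    """
    adf_says_unit_root = null_accepted(adf_p, alpha)
    kpss_says_unit_root = not null_accepted(kpss_p, alpha)
    return adf_says_unit_root and kpss_says_unit_root
```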

Table 4 Results of stationarity test (significance level, \(\alpha =0.05\)) for AQI observations at Anand Vihar and Dadri respectively

The null hypothesis is accepted in case of ADF test, and rejected in case of KPSS test, indicating that the time-series has a unit root. Hence, it is concluded that our air-pollutant time-series contains a unit root and can be made stationary by differencing.
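Differencing replaces each value by its change from the previous day. A minimal Python sketch is given below, with hypothetical function names; the second function reverses a first-order differencing, which is needed to express forecasts of the differenced series as AQI values again.

```python
from typing import List, Sequence


def difference(series: Sequence[float], order: int = 1) -> List[float]:
    """Apply first-order differencing 'order' times."""
    result = list(series)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


def undifference(first_value: float, diffs: Sequence[float]) -> List[float]:
    """Rebuild the original series from its first value and its differences."""
    rebuilt = [first_value]
    for d in diffs:
        rebuilt.append(rebuilt[-1] + d)
    return rebuilt
```

Each pass shortens the series by one observation, so a series of K values differenced once has K − 1 values.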

A time-series which becomes stationary when it is differenced ‘n’ times is said to have an order of integration ‘n’ and is represented as I(n). We differenced our AQI time-series once to make it stationary, so it has an integration order of 1 and is denoted by I(1). The stationary time-series is forecasted using the Box–Jenkins method in the following steps: (i) Identification, (ii) Estimation, and (iii) Diagnostic checking.

(i) Identification Identification of our time series is done by using the Auto-correlation function (ACF). ACF describes the way the daily AQI observations in our time-series at Dadri are associated with each other. It is measured by determining the correlation between current AQI observation (\(Z_t\)) and the AQI observation ‘n’ days prior to the current one (\(Z_{t-n}\)). Hence,

$$\begin{aligned} Cor(Z_t, Z_{t-n}) = \frac{Cov(Z_t, Z_{t-n})}{\sqrt{var(Z_{t})} \cdot \sqrt{var(Z_{t-n})}} \end{aligned}$$
(23)

Here, \(Cor(Z_t, Z_{t-n})\) is the correlation between \(Z_t\) and \(Z_{t-n}\), \(Cov(Z_t, Z_{t-n})\) is the covariance between \(Z_t\) and \(Z_{t-n}\), and \(var(Z_{t})\) and \(var(Z_{t-n})\) are the variances of \(Z_t\) and \(Z_{t-n}\), respectively.

ACF is helpful in finding out the level of association between the current AQI value and the previous AQI observations, and the number of previous AQI values or lags to be considered for building the forecast model. The correlogram plot (sample ACF and partial ACF against lag values) of AQI time-series, displayed in Fig. 12 is used for the selection of the most appropriate ARIMA (r, n, s) model for forecasting. As seen in Fig. 12, the peaks are observed at lag values of 0 and 1 for ACF and 1 and 2 for PACF.
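A sample version of the auto-correlation in Eq. (23) can be sketched in Python as follows. The sketch uses the common sample estimator, which assumes a single overall mean and variance for the whole series; this is a simplification of Eq. (23), and the function name is hypothetical.

```python
from typing import List, Sequence


def sample_acf(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample auto-correlation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    # Sum of squared deviations, used as the common denominator
    c0 = sum((z - mean) ** 2 for z in series)
    if c0 == 0:
        raise ValueError("a constant series has no defined auto-correlation")
    acf = []
    for lag in range(max_lag + 1):
        cov = sum((series[t] - mean) * (series[t - lag] - mean)
                  for t in range(lag, n))
        acf.append(cov / c0)
    return acf
```

The value at lag 0 is always 1, which is why a peak at lag 0 appears in every correlogram.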

Fig. 12 ACF and PACF values for the AQI time-series at Dadri with lags

(ii) Estimation Fitting the most accurate ARIMA model to our time-series requires finding the optimum values for the orders (r and s) of our AR and MA process. This is done by observing the significant auto-correlation coefficients and finding the pattern. We select the model with the lowest Akaike’s Information Criterion (AIC) and Bayesian information criterion (BIC) values, i.e. ARIMA(1,1,1) model. This model is used to predict future AQI values at Dadri.
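The selection by information criteria can be sketched with the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters, n the number of observations and L the maximized likelihood. The Python sketch below uses hypothetical function names; ranking by AIC first and breaking ties by BIC is an assumption made here.

```python
import math
from typing import Dict, Tuple

Order = Tuple[int, int, int]


def aic(log_likelihood: float, k: int) -> float:
    """Akaike's Information Criterion."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_likelihood


def select_order(candidates: Dict[Order, Tuple[float, int]], n: int) -> Order:
    """Pick the candidate order with the lowest AIC, ties broken by BIC.

    candidates maps an order (r, d, s) to (log_likelihood, k).
    """
    return min(candidates,
               key=lambda o: (aic(*candidates[o]), bic(*candidates[o], n)))
```

Both criteria reward a higher likelihood and penalize additional parameters, so a larger model is preferred only when it improves the fit enough to offset the penalty.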

Fig. 13 Auto-correlation coefficients of the residuals at Dadri

(iii) Diagnostic checking Once the appropriate ARIMA model (ARIMA(1,1,1) in our case) has been fitted to our AQI time-series, the ACFs of the residuals are plotted to assess the goodness-of-fit of the fitted model. Figure 13 shows the ACF and partial-ACF coefficients of the residuals of our ARIMA(1,1,1) model plotted against the lag values. All the auto-correlation coefficients of the residuals of the fitted model lie within the range \(\big (\frac{-1.96}{\sqrt{K}}, \frac{+1.96}{\sqrt{K}}\big )\) (indicated by the dotted blue lines in Fig. 13), where K is the number of observations. This indicates that there is no significant auto-correlation among the model residuals and that the model fit is suitable. Using the ARIMA(1,1,1) model, the future AQI values are predicted at Dadri, as shown in Fig. 14.
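The residual check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure used to produce Fig. 13: the function names are hypothetical, and the sample ACF here uses the overall mean and variance of the residual series, which is one common convention.

```python
import math


def sample_acf(x, max_lag):
    """Sample ACF at lags 1..max_lag, using the overall mean and variance."""
    k = len(x)
    mean = sum(x) / k
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, k)) / denom
        for lag in range(1, max_lag + 1)
    ]


def residuals_look_uncorrelated(residuals, max_lag=10):
    """Check whether every residual ACF coefficient lies within +/-1.96/sqrt(K)."""
    bound = 1.96 / math.sqrt(len(residuals))
    acf = sample_acf(residuals, max_lag)
    return all(abs(r) < bound for r in acf), bound, acf


# Hypothetical residuals with an obvious trend, for illustration only:
# a good fit should NOT leave this kind of structure behind.
trending = [float(i) for i in range(100)]
ok, bound, acf = residuals_look_uncorrelated(trending, max_lag=5)
```

For the trending series above the check fails, because the lag-1 coefficient is far outside the bounds; residuals from a well-fitted model would be expected to pass.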

Fig. 14 Projected values of AQI at Dadri using ARIMA in comparison to the actual AQI

Similarly, we use the AQI time-series at Anand Vihar and obtain ARIMA(1,1,0) as the best-fit forecast model for that location. The predicted AQI values at Anand Vihar are plotted in Fig. 15. Due to space constraints, the ACF, PACF and residual plots for Anand Vihar are not shown.

Fig. 15 Projected values of AQI at Anand Vihar using ARIMA in comparison to the actual AQI

6 Comparison of Forecasting Models

The AQI forecasts at Dadri and Anand Vihar obtained with the different prediction models are plotted in Figs. 16 and 17, respectively. The difference between the actual AQI value and the forecasted AQI value on a particular day of our time series is known as the forecast error. It is calculated by comparing the AQI outcome at a single instant of time with the corresponding AQI forecast for that instant, and a summary of errors is then created over a range of such instants [22]. The forecast error is defined as:

$$\begin{aligned} Err(t) = Z(t) - {\hat{Z}}(t {\vert } t-1 ) \end{aligned}$$
(24)

where \(Z(t)\) is the AQI observation at time ‘t’, and \({\hat{Z}}(t {\vert } t-1 )\) denotes the AQI forecast of \(Z(t)\) based upon all previous AQI observations.

Fig. 16 Prediction performance of forecasting models at Dadri

In this section, we compare the performance of our forecasting models at Dadri and Anand Vihar (AV) using several metrics: mean absolute deviation (MAD), mean percentage error (MPE), mean squared error (MSE), root mean squared error (RMSE) and mean absolute percentage error (MAPE) [20, 31]. These metrics and their significance are discussed in the following paragraphs.

Fig. 17 Performance of forecasting models at Anand Vihar

6.1 Mean Absolute Deviation (MAD)

The prediction performance of the two models at Dadri and Anand Vihar is first compared using the mean absolute deviation (MAD). MAD expresses the size of the errors in AQI units rather than as a percentage. It is the average of the unsigned errors, and is defined as

$$\begin{aligned} MAD = \frac{1}{K} \sum _{i=1}^{K} {\vert }{z_i - f(y_i)}{\vert } \end{aligned}$$
(25)

where K is the number of AQI observations, \(z_i\) is the AQI value at time instant i, \(y_i\) is the input vector (time), and f is the forecast model.
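As a minimal sketch, Eqs. (24) and (25) can be computed as follows. The function names and the actual/forecast values are hypothetical and are used only to illustrate the arithmetic; they are not taken from Table 5.

```python
def forecast_errors(actual, forecast):
    """Signed forecast errors Err(t) = Z(t) - Z_hat(t | t-1), as in Eq. (24)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return [z - f for z, f in zip(actual, forecast)]


def mean_absolute_deviation(actual, forecast):
    """MAD: average of the unsigned forecast errors, as in Eq. (25)."""
    errors = forecast_errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical AQI values, for illustration only.
actual = [300, 320, 310, 350]
forecast = [290, 330, 315, 340]
errors = forecast_errors(actual, forecast)      # [10, -10, -5, 10]
mad = mean_absolute_deviation(actual, forecast)  # 8.75 AQI units
```

Here the signed errors partly cancel if simply averaged, while MAD reports the typical size of a miss in AQI units.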

The mean absolute deviation (MAD) for our forecasting models is displayed in Table 5.

Table 5 Prediction performance using MAD

MAD gives the average absolute difference between the actual and the forecasted AQI values, in AQI units. The results in Table 5 indicate that the mean error between the actual and the predicted AQI is lower for the ARIMA model than for the quantile regression model.

6.2 Mean Squared Error (MSE)

Mean squared error (MSE) is a measure of the quality of a forecasting model. It is always non-negative, and values closer to zero indicate a better model. MSE is the second moment of the error about the origin, and therefore incorporates both the variance and the bias of the estimator.

$$\begin{aligned} MSE = \frac{1}{K}\cdot \sum _{i=1}^{K}(z_i - f(y_i))^2 \end{aligned}$$
(26)

MSE for our forecasting techniques is displayed in Table 6.

Table 6 Prediction performance using MSE

The results in Table 6 indicate that, by this measure, the quantile regression model performs better than the ARIMA model, because the spread (deviation) of its predicted values is smaller.

6.3 Root Mean Squared Error (RMSE)

RMSE is a measure of accuracy that is used to compare the forecasting accuracy of different models. It is always non-negative, and values closer to zero indicate a better model. The effect of each error on RMSE is proportional to the size of the squared error, so RMSE is sensitive to outliers: large errors have a disproportionately large effect. RMSE for our forecasting techniques is displayed in Table 7.

$$\begin{aligned} RMSE = \sqrt{\frac{1}{K}\cdot \sum _{i=1}^{K}(z_i - f(y_i))^2} \end{aligned}$$
(27)
Table 7 Prediction performance using RMSE

Because the errors are squared before they are averaged, RMSE is most useful when large errors are especially undesirable. The results in Table 7 indicate that, in comparison to the ARIMA model, the quantile regression model makes fewer large errors.
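The sensitivity of MSE and RMSE to large errors can be illustrated with a small Python sketch of Eqs. (26) and (27). The function names and the two forecast series below are hypothetical, constructed so that both have the same mean absolute deviation (10 AQI units) but differ in how the error is distributed.

```python
import math


def mse(actual, forecast):
    """Mean squared error, as in Eq. (26)."""
    return sum((z - f) ** 2 for z, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error, as in Eq. (27)."""
    return math.sqrt(mse(actual, forecast))


# Hypothetical values, for illustration only.
actual = [300, 300, 300, 300]
steady = [310, 290, 310, 290]  # four errors of size 10
spiky = [300, 300, 300, 340]   # a single error of size 40

steady_scores = (mse(actual, steady), rmse(actual, steady))
spiky_scores = (mse(actual, spiky), rmse(actual, spiky))
```

Although both forecasts miss by 10 units on average, the single large miss doubles the RMSE (20 versus 10) and quadruples the MSE (400 versus 100). This is why two models can rank differently under MAD and under MSE or RMSE.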

6.4 Mean Percentage Error (MPE)

Mean percentage error (MPE) expresses the size of the forecast error in percentage terms. MPE is the average of the signed percentage errors, and is defined as

$$\begin{aligned} MPE = \frac{1}{K} \cdot \sum _{i=1}^{K} {\frac{z_i - f(y_i)}{z_i}} \times 100 \end{aligned}$$
(28)
Table 8 Prediction performance using MPE

The mean percentage error (MPE) for the forecasting models is displayed in Table 8. Because positive and negative forecast errors may offset each other in MPE, the results in Table 8 indicate the bias of the forecasting models. The results show that while both models have a positive bias, the bias of the quantile regression model is smaller.

6.5 Mean Absolute Percentage Error (MAPE)

The disadvantage of MPE is that the forecast errors used in the formula may be both positive and negative, and may therefore offset each other. So, we use another measure of forecast error, the mean absolute percentage error (MAPE), which employs the absolute values of the forecast errors:

$$\begin{aligned} MAPE = \frac{1}{K}\cdot \sum _{i=1}^{K} \left\vert {\frac{z_i - f(y_i)}{z_i}}\right\vert \times 100 \end{aligned}$$
(29)
Table 9 Prediction performance using MAPE

The mean absolute percentage error for the forecasting techniques is displayed in Table 9. The MAPE for the ARIMA model is lower than that of the quantile regression model. By this measure, the ARIMA forecasts are more accurate than those of the quantile regression model.
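The offsetting effect that distinguishes MPE from MAPE can be seen in a short Python sketch of Eqs. (28) and (29). The function names and values are hypothetical and chosen for illustration only; the sketch assumes that all actual values are non-zero, which holds for AQI readings in practice.

```python
def mpe(actual, forecast):
    """Mean percentage error: average of signed percentage errors, Eq. (28)."""
    return 100 * sum((z - f) / z for z, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error: average of unsigned percentage errors, Eq. (29)."""
    return 100 * sum(abs((z - f) / z) for z, f in zip(actual, forecast)) / len(actual)


# Hypothetical values: every forecast is off by 10%,
# half of them too high and half of them too low.
actual = [200, 200, 400, 400]
forecast = [220, 180, 360, 440]

bias = mpe(actual, forecast)
accuracy = mape(actual, forecast)
```

In this example the signed errors cancel, so the MPE is 0 and suggests an unbiased forecast, while the MAPE of 10 shows that each individual forecast is still off by 10%. MPE is therefore best read as a measure of bias and MAPE as a measure of accuracy.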

7 Conclusions

In this study, an Internet of Things system has been developed to collect, visualize, analyze and model urban air quality. Hourly air pollutant data has been collected at two locations in the Delhi-NCR region, i.e. Anand Vihar and Dadri, for the winter months of November 2018 to February 2019. The daily average AQI for the four winter months has been used to perform predictive modeling using quantile regression and ARIMA models. The prediction performance of these forecast models has been discussed and compared with the help of mean absolute deviation, mean percentage error, mean squared error, root mean squared error and mean absolute percentage error. The ARIMA model is found to be more accurate than the quantile regression model, with lower values of MAD, MPE and MAPE. The MSE and RMSE values are lower for the quantile regression model than for the ARIMA model at both locations, indicating that the quantile regression model makes fewer large errors than the ARIMA model. The performance of the two forecast models is found to be consistent at both locations, indicating that there is little ambiguity in the evaluation process.