1 Introduction

Industrial activity, population density, thermal power units, agricultural practices, the automotive industry and transportation significantly affect the air quality index of a region (Ravindra, 2019) (Ravindra et al., 2020). In addition to degrading atmospheric conditions, poor air quality harms human health, causing lung infections, breathing problems, premature death, heart failure, lung cancer and skin rashes (Manisalidis et al., 2020). The air quality index is strongly affected by greenhouse gas emissions, meteorological factors and pollutants such as particulate matter, carbon monoxide, sulphur dioxide, ammonia, nitrogen oxide and ozone (Bao & Zhang, 2020)-(Balakrishnan et al., 2019). These pollutants also cause other harmful atmospheric effects such as global warming, climate change, reduced visibility, acid rain and the formation of smog and aerosols (Malhi et al., 2021).

Out of the 15 most polluted cities in Central and South Asia in 2022, 11 are in India. Air pollution has a massive impact on human health in India: it is the second biggest risk factor, and it imposes an economic cost of 150 billion dollars annually (https://www.greenpeace.org/static/planet4-india-stateless/2023/03/2fe33d7a-2022-world-air-quality-report.pdf. n.d.). According to the WHO air quality report published in 2014, Varanasi was ranked as the third most polluted city in the world. The major contributor to the degraded air quality in Varanasi is vehicular exhaust, followed by construction and road dust, industries and thermal power plants. This shows the severity of air pollution in Indian cities and its likely impact on human health in the coming years (https://www.ndtv.com/india-news/varanasis-air-quality-deteriorating-delhi-victim-of-negligence-report-2020928. n.d.) (https://climatetrends.in/wp-content/uploads/2019/11/Political-Leaders-Position-Action-on-Air-Quality-Indian-MPs-report-card-2014-19-April-2019-.pdf. n.d.).

Machine learning models can predict and forecast the air quality index more effectively than statistical and rigid models. The availability of historical data strongly supports precise and accurate prediction of the AQI. The air quality categories and their health impact for the different AQI ranges are shown in Table 1.

Table 1 Category of AQI and its health impact

The major contributions of the proposed work are as follows:

  1. a.

    The data regarding the air quality is collected from government websites such as Central Pollution Control Board (CPCB) (Board, 2019).

  2. b.

    The characteristics of the dataset are analyzed, processed and represented so that the dataset can be used to develop effective machine learning models for prediction.

  3. c.

    Different methods, such as outlier detection and removal, filling missing values with the mean value of the parameter and scaling the values of the dataset, are employed to pre-process the data.

  4. d.

    The overall strategy used in the proposed approach is shown in the Fig. 1.

  5. e.

    Different machine learning models, namely Linear Regression, Decision Tree, Random Forest with hyperparameter tuning, K Nearest Neighbor and SVM with Linear, Polynomial and Radial Basis Function kernels, are used for the prediction and classification of the air quality index of Varanasi city.

  6. f.

    The performance of the different approaches is evaluated and compared on the basis of performance metrics such as Mean Absolute Error, Root Mean Square Error, Precision, Recall, Accuracy and Confusion Matrix. The comparison results show that the Random Forest achieves the highest accuracy for the considered dataset.

Fig. 1
figure 1

Flow chart of the proposed method

The paper is organized as follows: Section 2 discusses related work, Section 3 gives an overview of the methodology, Section 4 presents the results and discussion, and Section 5 concludes the paper and outlines the future scope.

2 Related Work

Air pollution monitoring is crucial for mapping air pollution levels to different cities. To predict the air quality index as a time series, two methods have been proposed in Tuan-Vinh (2021): a hybrid model combining ARIMA with PCR, and a hybrid model combining ARIMA with gene expression programming. In this work, the correlation between urban nature and urban traffic was used to determine the PM2.5 levels. A sensor- and IoT-based system was developed to determine the AQI values. The collected data were used to predict the air quality level for the next day using a Linear Regression model. The performance of the proposed model was evaluated in terms of Mean Absolute Error and Root Mean Square Error (Kumar et al., 2020). Five regression models, namely partial least squares, principal component, partial component with one out, cross validation and multiple regression, were used for the prediction of air quality. The performance of the models was evaluated on data collected from different cities (Londhe May, 2021). The daily temperature was also considered as a contributing factor in determining the air quality index. Deep learning models such as LSTM were used for the prediction of the air quality in Dhaka city (Chowdhury et al., 2020). An improved neural network based on a nonlinear autoregressive neural network was developed for precise prediction of the air quality index using weather station and atmospheric monitoring data (Zhou et al., 2018). Different pollutant levels were considered for the prediction of the air quality. Regression models such as random forest regression, stochastic gradient descent regression and linear regression were implemented and evaluated against performance metrics such as MAE, MSE and R score (Srivastava et al., 2018).
SVM and ANN based models were used for the prediction of the air quality for the Delhi region in Raturi and Prasad (2018). A survey was conducted to understand the effect of different pollutants on the air quality. The ANN proved to be the best model for the prediction task (Mahalingam et al., 2019). The supervised learning models SVM and ANN were used for the prediction of the air quality in the Delhi region (Sethi & Mittal, 2019). Supervised machine learning models such as Decision Tree, SVR and stacking ensemble methods were found to perform best for air quality prediction. The emission of different hazardous gases and their impact on the environment and human health have been investigated in (Wen et al., 2024; Zhang et al., 2021; Luo et al., 2024; IoT-Based Air Quality Monitoring in Hair Salons, 2023; Samuel et al., 2023; Yin et al., 2023; Shang et al., 2023; Liu et al., 2023)-(Blessy et al., 2023).

Based on the literature survey, we conclude that open access datasets are not well suited to predicting the air quality because of their large number of incorrect and missing values. There is therefore a need to incorporate machine learning models that can enhance the efficiency of air quality prediction.

3 Proposed Methodology

Varanasi city was selected because its air quality index has reached 490 according to the report in (https://climatetrends.in/wp-content/uploads/2019/11/Political-Leaders-Position-Action-on-Air-Quality-Indian-MPs-report-card-2014-19-April-2019-.pdf. n.d.). The report concludes that the major construction work going on in the city is the root cause of the degraded air quality over the years. To address the alarming degradation in the air quality index of Varanasi city, several AQI stations have been installed at prominent places for continuous monitoring and control of the AQI. The AQI stations in Varanasi city are listed in Table 2.

Table 2 List of AQI stations in Varanasi city

3.1 System Design

The proposed system consists of six stages: i. dataset collection, ii. dataset selection, iii. data pre-processing, iv. splitting the dataset into training and testing datasets, v. implementation of different machine learning models and vi. performance evaluation and comparison of the different models. The flow chart of the proposed approach is shown in Fig. 1.

The detailed description of the task performed in each step is as follows:

  1. 1.

    The first step of the proposed system is the collection of data from the CPCB website. To collect the data, we selected the station IESD Banaras Hindu University, Varanasi UPPCB, with the parameter option set to Select All. The data format is selected as Tabular, the criteria as 1 Hr., and the duration of data collection as 1st January 2019 to 22nd October 2023. The collected data are stored in a .csv file which consists of 22 parameters.

  2. 2.

    The second step of the proposed approach is data selection. It has been established that the AQI depends mainly on 7 parameters: PM2.5, PM10, NO2, CO, SO2, NH3 and Ozone. So, out of the 22 parameters, we retained these 7 and removed the remaining columns from the dataset. After pruning, the size of the dataset is (2543,9).

The next step of the proposed model is data understanding. For better analysis of the dataset, the correlation matrix of the parameters is computed, which establishes the relationship between the different parameters and the AQI. The correlation matrix is visualized in Fig. 2.

Fig. 2
figure 2

Correlation matrix for the considered parameters

After the correlation analysis, the pair plot is plotted to visualize the distribution of air quality index in different classes. The pair plot for each of the parameters is as shown in the Fig. 3.

  1. 3.

    After the data visualization step, various data pre-processing methods are applied. A box plot of all the considered parameters is plotted to identify the outliers in the dataset. Outliers are values which lie beyond the normal range of values for a particular parameter. The box plot is shown in Fig. 4.

Fig. 3
figure 3

Pair plot for the different parameters

Fig. 4
figure 4

Box plot

From the box plot it is clear that the parameter PM2.5 has the highest number of outlier values, 30. The interquartile range is used to determine the maximum and minimum values for each parameter. The interquartile range is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile). Values lower than the minimum value or greater than the maximum value are considered outliers. To remove the outliers, the outlier values are replaced with NA values. These values are then filled with the scaled value obtained using the standard scaler method. The scaled value for a parameter is calculated using Eq. 1.

$$y=\frac{(x-\mu )}{s}$$
(1)

where y is scaled value of the parameter x, µ is the mean of the parameter and s is the standard deviation of the parameter.

Similarly, the missing/NULL values are identified and filled with the mean value for that parameter.

In this step, the irrelevant columns From Date and To Date are also removed, as they make no contribution to the determination of the air quality index. Hence, after pre-processing the dataset size is (2543,7).
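
The pre-processing described above can be illustrated for a single column as follows. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the fence multiplier of 1.5 (the usual box-plot convention), the use of the population standard deviation for s, the order of the steps and the sample readings are all assumptions, and the function names are hypothetical.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier fences from the interquartile range.

    k=1.5 is the common box-plot convention and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def preprocess_column(values):
    """Mask outliers as missing, impute with the mean, then standardize."""
    present = [v for v in values if v is not None]
    lower, upper = iqr_fences(present)
    # Outliers are treated as missing values (the NA replacement in the text).
    masked = [v if v is not None and lower <= v <= upper else None
              for v in values]
    mean_kept = statistics.fmean(v for v in masked if v is not None)
    imputed = [mean_kept if v is None else v for v in masked]
    # Eq. 1: y = (x - mu) / s, with s as the population standard deviation.
    mu = statistics.fmean(imputed)
    s = statistics.pstdev(imputed)
    if s == 0:
        return [0.0 for _ in imputed]
    return [(x - mu) / s for x in imputed]


# Made-up PM2.5 readings with one outlier and one missing value.
pm25 = [10.0, 12.0, 11.0, 13.0, 12.0, 100.0, None]
scaled = preprocess_column(pm25)
```

After this step the column has zero mean and unit standard deviation, and the outlier and the missing value both map to the same imputed, scaled value.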

After pre processing the heat map is plotted to determine if there is any missing value in the dataset as shown in the Fig. 5.

Fig. 5
figure 5

Heat map of the different parameters

It can be inferred from the heat map that there are no missing values in the dataset.

After pre-processing, the next step is to calculate the air quality index and the air quality index category using the standard range allowed for each parameter. Two functions are defined for this purpose. The index value of each parameter is determined using Eq. 2.

$${I}_{p}=\frac{{I}_{hi}-{I}_{lo}}{{Bp}_{hi}-{Bp}_{lo}}\left({C}_{p}-{Bp}_{lo}\right)+{I}_{lo}$$
(2)

where Ip is the index for Pollutant P, Cp is the rounded concentration of Pollutant P, Bphi is the break point which is greater than or equal to Cp, Bplo is the break point which is lesser than or equal to Cp, Ihi is the AQI corresponding to Bphi and Ilo is the AQI corresponding to Bplo.
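
The linear interpolation in Eq. 2 can be sketched in a few lines. The breakpoint values below are illustrative placeholders assumed for this example only, and the names `EXAMPLE_BREAKPOINTS` and `sub_index` are hypothetical; the concentration ranges actually used are those of Table 3.

```python
# Illustrative breakpoints only (assumed for this sketch).
# Each tuple is (Bp_lo, Bp_hi, I_lo, I_hi).
EXAMPLE_BREAKPOINTS = [
    (0.0, 30.0, 0.0, 50.0),
    (30.0, 60.0, 50.0, 100.0),
    (60.0, 90.0, 100.0, 200.0),
]


def sub_index(concentration, breakpoints):
    """Interpolate the index I_p of one pollutant from its concentration C_p."""
    for bp_lo, bp_hi, i_lo, i_hi in breakpoints:
        if bp_lo <= concentration <= bp_hi:
            slope = (i_hi - i_lo) / (bp_hi - bp_lo)
            return slope * (concentration - bp_lo) + i_lo
    raise ValueError("concentration is outside the supplied breakpoints")
```

With these placeholder breakpoints, a concentration of 45 falls halfway through the second range and therefore maps to an index halfway between 50 and 100.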

The air quality index and air quality index category are calculated using the parameter index value as shown in the Table 3.

Table 3 AQI calculation using the different parameter concentration levels

After appending AQI category the size of the dataset is (2543,8).

  1. 4.

    In the next step, the considered dataset is divided into training and testing dataset. The test size is considered as 30%, hence the size of testing dataset is (751,8) and the size of training dataset is (1752,8).

  2. 5.

    Different machine learning models are implemented for the prediction of the air quality.

  3. 6.

    The performance evaluation of machine learning model is done on the basis of different performance metrics such as accuracy, confusion matrix, R2 score, precision, recall, Mean Absolute Error and Mean Square Error. The comparative evaluation of the models is also done.

3.2 Implementation Details

To predict the air quality for the considered dataset, machine learning models such as Linear Regression, Decision Tree, Random Forest, Neural Network, K Nearest Neighbor and SVM with different kernel functions are utilized. All the models have been implemented in the Python programming language on a Windows based operating system. This section provides a brief description of the machine learning models used to forecast the air quality.

3.3 Machine Learning Models

Extra tree regressor stands for extremely randomized trees. It is an ensemble based supervised machine learning method that aggregates the results from de-correlated decision trees to improve efficiency and performance (Geurts et al., 2006). Extra tree regression models are useful when accuracy is more important than the construction of a generalized model. The method differs from the random forest method in that it does not use bootstrapping; instead, it relies on randomized splits. The extra tree regressor can also be used to determine the importance of the features in the dataset.

3.3.1 Linear Regression Model

The linear regression model is a supervised machine learning algorithm which tries to find the best fitting line relating the explanatory variables to the dependent variable (Rogers & Girolami, 2016). The algorithm finds the line that best fits the data under consideration using the concept of residuals. A residual is the difference between the value predicted by the line and the actual value. The process of finding the best fitting line is carried out over several iterations. Eq. 3 is the equation of the regression line:

$$\alpha ={\gamma }_{0}+{\gamma }_{1}{\beta }_{1}+{\gamma }_{2}{\beta }_{2}+\dots +{\gamma }_{n}{\beta }_{n}$$
(3)

where α is the dependent variable, β1, β2, …, βn are the independent variables and γ0, γ1, …, γn are the coefficients of the regression model.
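
For a single independent variable, the coefficients of Eq. 3 can be obtained directly by minimizing the sum of squared residuals. The sketch below uses the closed-form least-squares solution instead of an iterative procedure, which is a substitution made for brevity; the function name and the data are assumptions made for illustration.

```python
def fit_simple_regression(beta, alpha):
    """Return (gamma0, gamma1) for the line alpha = gamma0 + gamma1 * beta.

    Uses the closed-form least-squares solution for one independent variable.
    """
    n = len(beta)
    mean_beta = sum(beta) / n
    mean_alpha = sum(alpha) / n
    covariance = sum((b - mean_beta) * (a - mean_alpha)
                     for b, a in zip(beta, alpha))
    variance = sum((b - mean_beta) ** 2 for b in beta)
    gamma1 = covariance / variance
    gamma0 = mean_alpha - gamma1 * mean_beta
    return gamma0, gamma1


# Made-up data lying exactly on the line alpha = 1 + 2 * beta.
beta_values = [1.0, 2.0, 3.0, 4.0]
alpha_values = [3.0, 5.0, 7.0, 9.0]
gamma0, gamma1 = fit_simple_regression(beta_values, alpha_values)
# Residuals: actual value minus the value predicted by the fitted line.
residuals = [a - (gamma0 + gamma1 * b)
             for b, a in zip(beta_values, alpha_values)]
```

Because the sample points lie exactly on a line, the fitted residuals are all zero up to rounding error.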

3.3.2 Decision Tree Model

Decision tree is a supervised prediction and classification method based on Boolean conditions. A decision tree is a collection of nodes and links (Quinlan, 2014). The nodes in the tree are the different parameters and a link is the connection between parameters. The best attribute is chosen as the root node, and the course of action is then determined by the conditions. Since a single attribute cannot determine the class labels accurately, measures such as Information Gain and the Gini Index are used to select attributes. These measures are termed impurity measures. Eq. 4 is used to calculate the Gini impurity:

$$Gini\left(Dataset\right)=1- {\sum }_{i=1}^{c}{P}_{i}^{2}$$
(4)

where Gini (Dataset) represents the Gini impurity for the dataset, c represents the number of classes and Pi represents the probability of the instance belonging to the class i.
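
As a small worked example of the Gini impurity, the sketch below estimates each class probability from label counts. The function name and the category labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Made-up AQI category labels at two candidate tree nodes.
mixed_node = ["Good", "Good", "Poor", "Poor"]
pure_node = ["Good", "Good", "Good"]
```

A node whose instances all share one class has impurity 0, while an even two-class split gives 0.5, so a split that produces purer child nodes is preferred.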

3.3.3 Random Forest

The random forest is a supervised machine learning algorithm which predicts the class label by taking the majority vote for a particular class (Liaw & Wiener, 2002). The class label is determined using a forest of decision trees. The algorithm was designed to overcome the decision tree's drawback of overfitting the dataset. The random forest uses bagging to aggregate the different trees.

3.3.4 Neural Network Model

The neural network-based machine learning model consists of an input layer, hidden layers and an output layer. The input values are provided through the input layer, and the hidden layers process the input data. The output of each layer is determined using the activation function. The neural network has been found to perform well for specific tasks, as it supports parallel processing.

3.3.5 KNN Model

KNN stands for K Nearest Neighbor. It is used for prediction tasks and works by comparing a new data point with the data points in the dataset using a similarity value. Different similarity metrics, such as Manhattan distance, Minkowski distance and Cosine Similarity, can be used. The k nearest neighbors are identified and the class of the new data point is determined by majority vote among them. By varying the value of k, the model can be fine-tuned to achieve the optimal accuracy.
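
The voting procedure can be sketched as follows, using the Manhattan distance. This is an illustrative sketch only: the function names, the default k = 3 and the two-feature sample points are assumptions, not the configuration used in the experiments.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: manhattan(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up two-feature points with made-up category labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["Good", "Good", "Good", "Poor", "Poor", "Poor"]
```

A query near the first cluster collects three "Good" votes, and a query near the second cluster collects three "Poor" votes.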

3.3.6 Support Vector Machine Model

The support vector machine is a supervised machine learning algorithm which tries to determine a hyperplane in an N dimensional space that distinctly classifies the data points. To determine a unique hyperplane, the SVM algorithm maximizes the margin, i.e. the distance between the hyperplane and the nearest data points of the two classes. The SVM algorithm can be applied to linearly separable as well as non-linearly separable classes. For non-linearly separable classes, the maximal margin hyperplane is determined using a kernel function. The objective of the kernel function is to transform the input data into a higher dimensional space. Three kinds of kernel functions are considered in the context of SVM: linear, polynomial and radial basis function kernels (Dun et al., 2020).
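
The three kernel functions can be written out explicitly. The sketch below uses common textbook forms; the parameter names `gamma`, `degree` and `coef0` and their default values are assumptions made for illustration and are not the settings used in the experiments.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the plain dot product of x and z."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

Each kernel returns a similarity between two points; the radial basis function kernel equals 1 for identical points and decays toward 0 as the points move apart.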

3.3.7 Xgboost

Xgboost stands for Extreme Gradient Boosting. It is a supervised machine learning algorithm used for classification and regression tasks, and it is suitable for large and complex datasets. The model starts by predicting an initial value of the dependent variable. The residual is computed as the difference between the predicted and actual values. The Xgboost method works on reducing the residual, and it continues until a terminating condition is reached. The terminating condition can be either a maximum number of iterations or a threshold on the residual value (Tianqi & Carlos, 2016).
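
The residual-fitting loop described above can be illustrated with plain gradient boosting on a single feature. This sketch is a simplification and not the Xgboost algorithm itself: it omits the regularization and second-order terms, uses one-split regression stumps as weak learners, and the function names, the learning rate, the tolerance and the data are all assumptions.

```python
def fit_stump(x, residuals):
    """Find the one-threshold split of x that best fits the residuals.

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(x, y, rounds=50, learning_rate=0.3, tolerance=1e-6):
    """Start from the mean of y and repeatedly fit stumps to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(rounds):  # terminating condition 1: maximum iterations
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        if max(abs(r) for r in residuals) < tolerance:
            break  # terminating condition 2: residual threshold
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
    return base, stumps, predictions


# Made-up data with a single step in the target value.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_values = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
base, stumps, predictions = boost(x_values, y_values)
```

Each round removes a fixed fraction of the remaining residual, so the predictions approach the targets and the loop stops once the residual threshold or the round limit is reached.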

3.4 Performance Metrics

The performance of the considered machine learning model is evaluated in terms of following performance metrics.

3.4.1 Mean Absolute Error

Mean absolute error represents the average absolute error over paired observations, i.e. pairs of actual and predicted values (Sammut, 2010). A larger MAE value indicates a larger error in the model. The formula for the MAE is shown in Eq. 5:

$$MAE= \frac{{\sum }_{i=1}^{n}\left|{y}{\prime}-y\right|}{n}$$
(5)

where, n represents the number of observations, y’ is the actual value and y is the predicted value.

3.4.2 Mean Square Error

The performance metric used to measure how well the regression line fits the data values is known as MSE (Nevitt & Hancock, 2000). It is the mean of the squared residuals. Eq. 6 is used to calculate the MSE:

$$MSE=\frac{{\sum }_{i=1}^{n}{\left({y}{\prime}-y\right)}^{2}}{n}$$
(6)
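
The two error metrics, together with the Root Mean Square Error obtained as the square root of the MSE, can be computed as in the sketch below. The function names and the sample values are assumptions made for illustration.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Made-up actual and predicted values.
actual_values = [3.0, 5.0, 7.0]
predicted_values = [2.0, 5.0, 9.0]
```

For these values the absolute errors are 1, 0 and 2, so the MAE is 1, while the squared errors 1, 0 and 4 give an MSE of 5/3.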

3.4.3 Confusion Matrix

It is a tabular representation used to assess the performance of a machine learning model, as shown in Table 4.

Table 4 Example of a confusion matrix

It consists of four terms which are defined as follows:

  1. a.

    True Positive (TP)- This metric represents the number of positive data points classified correctly.

  2. b.

    True Negative (TN)- This metric represents the number of negative data points classified correctly.

  3. c.

    False Positive (FP)- This metric represents the number of negative data points incorrectly classified as positive.

  4. d.

    False Negative (FN)- This metric represents the number of positive data points incorrectly classified as negative.

Accuracy

Accuracy represents the ratio of number of instances correctly classified to the total number of observations. The formula to calculate the accuracy of a machine learning model is as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(7)

where TP + TN represents the number of instances correctly classified and TP + TN + FP + FN represents the total number of observations.

Precision

Precision represents the fraction of instances labeled as positive by the model that are actually positive.

$$Precision=\frac{TP}{TP+FP}$$
(8)

Recall

Recall is the performance metric which represents the ability of the model to identify the positive outcomes, i.e. the fraction of actual positive instances that are correctly classified.

$$Recall=\frac{TP}{TP+FN}$$
(9)

F1 Score

The harmonic mean of Precision and Recall is defined as the F1 score of the model. It is defined as:

$$F1\,Score=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(10)
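
The four classification metrics follow directly from the confusion matrix counts. The sketch below covers the two-class case; the function name, the zero-division handling and the counts are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 score from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (an assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for a two-class problem.
metrics = classification_metrics(tp=8, tn=7, fp=2, fn=3)
```

With these counts, 15 of the 20 observations are classified correctly, so the accuracy is 0.75, the precision is 8/10 and the recall is 8/11.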

4 Results and Discussion

The implementation results are discussed in two subsections: i. performance of the different machine learning models and ii. comparison of the considered models.

4.1 Performance of the Different Machine Learning Models

4.1.1 Extratree Feature Importance

Extra tree regressor is used to determine the importance of different features on the air quality index. The results of the feature importance are as shown in the Fig. 6. It can be inferred from the figure that the Ozone has the highest impact on air quality index.

Fig. 6
figure 6

Feature importance plot using Extra tree Regressor

4.1.2 Linear Regression

The result of linear regression model for the prediction of air quality is as shown in the Table 5.

Table 5 Performance evaluation of Linear Regression Model

The distribution of training dataset and testing dataset prediction across different AQI ranges is as shown in the Fig. 7 and Fig. 8 respectively.

Fig. 7
figure 7

Distribution plot for testing dataset

Fig. 8
figure 8

Distribution plot for the training dataset

The scatter plot in Fig. 9 shows the distribution of the test data set and prediction values across the different AQI ranges.

Fig. 9
figure 9

Scatter plot for the testing dataset and prediction

4.1.3 Random Forest

The random forest is applied to the considered dataset. Its accuracy, 99%, is the best among all the models considered. To further improve the performance of the proposed model, randomized search cross validation is applied for hyperparameter tuning. The depth values are sampled randomly in the range 1 to 20, and the number of estimators in the range 50 to 500. Five-fold cross validation is used, and the number of iterations in each fold is 5. The model is then trained using the best parameters. After cross validation, the best hyperparameters are found to be a max depth of 14 and 286 estimators. The confusion matrix for the best model is shown in Fig. 10.

Fig. 10
figure 10

Confusion matrix for the best Random Forest model obtained after hyper parameter tuning

The feature importance determined using the best random forest model is shown in Fig. 11. The results show that ozone is the most prominent feature for the air quality prediction.

Fig. 11
figure 11

Feature importance plot using best Random Forest Model

4.1.4 KNN Model

The KNN based model is used for the prediction of air quality. The accuracy of the model is found to be 98% for k = 7. The effect of varying the value of k on the accuracy is shown in Fig. 12.

Fig. 12
figure 12

Effect of varying the value of k on KNN accuracy

The other performance metrics of the KNN model are shown in Fig. 13.

Fig. 13
figure 13

Performance of KNN Model

4.1.5 Confusion Matrix MLP

In this section, we have implemented the neural network model for the prediction of the air quality. The accuracy of the model is found to be 98.5%. The confusion matrix for the model is shown in Fig. 14.

Fig. 14
figure 14

Confusion matrix for Neural Network Model

Fig. 15
figure 15

Feature importance plot using Xgboost Model

4.1.6 Support Vector Machine Model

Support vector machine-based model is implemented for the classification of the considered dataset in the different air quality ranges. The different kernel functions are used for the evaluation of the performance of the model. The performance results in terms of precision, recall and F1 score for the different kernel functions are as shown in the Table 6.

Table 6 Performance of different kernel functions in SVM

The results show that the polynomial and linear SVM models have the highest accuracy as compared to RBF SVM.

4.1.7 Xgboost Feature Importance

In this section, we describe the feature importance of the different parameters using the Xgboost based method, as shown in Fig. 15. The results show that SO2 and NH3 have the highest impact on the prediction of air quality.

The mean scores for the training and testing datasets using the Xgboost based model are shown in Fig. 16.

Fig. 16
figure 16

Performance of the Xgboost Model for Training and Testing dataset

It can be inferred from the figure that the error value is significantly low for both the training and testing datasets.

4.2 Comparison of considered models

In this section, we compare the performance of the different models in terms of their accuracy for the prediction of air quality, as shown in Fig. 17. The results show that the random forest and decision tree have the highest accuracy, 100%, and linear regression has the lowest accuracy, 79%.

Fig. 17
figure 17

Performance comparison of Different Models

5 Conclusion and Future Scope

The paper presented a machine learning based approach for the effective prediction and classification of the air quality index for Varanasi city, using data collected from the CPCB website. Various data pre-processing techniques, such as outlier detection, missing value imputation and data scaling, are employed to improve the data representation. In the proposed approach, the Extra Tree Regressor method is employed to determine the importance of the different features. The results show that the concentration level of Ozone has a significant impact on the air quality index. Six different machine learning models, namely Linear Regression, Decision Tree, Random Forest, K Nearest Neighbor, Neural Network and SVM with different kernels, along with hyperparameter tuning, have been implemented to determine the most efficient machine learning model. The results show that the random forest and decision tree models have the highest accuracy in the prediction of the air quality, whereas SVM with RBF kernel is most efficient for the classification task on the basis of several performance metrics such as accuracy, precision and recall.

The proposed models can be implemented for city or state level air quality prediction. The models can also be investigated for real time air quality monitoring and prediction.