Introduction

Air pollution represents the condition of air pollutants in the atmosphere at high enough concentrations within serious or above normal ambient levels, and it often has a dominant effect on the quality of life [1]. Normal and above normal levels are mainly measured with air pollutant index (API) which is defined in terms of the effects of pollutants on human health [2, 3]. Generally, modeling the receptor to identify and characterize emission sources is typified by difficulties like incomplete information of the sources and the complexities of modeling boundary layer processes. Therefore, identifying effective modeling tools for receptor data record is essential in detecting local, distant, and diverse emission sources in ambient air pollution management studies [4, 5].

The methods used to identify, quantify, characterize and predict pollutants’ emission sources in literature can be grouped into two main classes-receptor models (RMs) [6, 7] and dispersion models (DMs) [8]. RMs are statistical approaches used for identification and quantification of pollutants’ sources at receptor site. DMs are mathematical simulation approaches used to predict pollutants’ dispersion in the atmosphere. In a comprehensive review of RM models, Pokorná et al. [9] identified the following prevalent methods: chemical mass balance, multivariate methods (including target transformation factor analysis, unmix, positive matrix factorization, and constrained models), other complex models (e.g., multiple sample type data and time synchronization model), multiway data (e.g., spatially distributed data and size–composition–time data), ensemble methods, methods using local wind data (e.g., conditional probability function, nonparametric regression, nonparametric wind regression (NWR) and sustained wind incidence method), methods incorporating back trajectories (e.g., potential source contribution function), simplified quantitative trajectory bias analysis and Future directions.

Holmes and Morawska [10] presented an overview of DMs applied to the dispersion of particulate matters (PM). These models were separated into four main classes: box models, Gaussian models, Lagrangian models and computational fluid dynamic models. Generally, the large number of pollutants around the monitoring stations makes the identification process more difficult [1]. Moreover, the distant locations of major resources reduce the detection sensitivity while the minor elements that are locally available close to monitoring stations also impede clear detection. To overcome these limitations, some studies have developed combined models in order to detect the pollutants’ emission sources with high precision. Qin and Oduyemi [11] combined the receptor model (positive matrix factorization) and dispersion model (Gaussian plume dispersion model) to identify aerosol sources and estimate source contributions to air pollution in Dundee, UK. The dispersion model identified sources that were undetected by the receptor model. Kim and Hopke [12] combined positive matrix factorization (PMF) and conditional probability function (CPF) to identify sources of PM in Spokene, an arid city in the Northwestern part of U.S. This approach enhanced the identification of emission sources.

The models which incorporate meteorological data (e.g., clear wind patterns) produce the most reliable results [10], particularly when integrated with CPF for apportionment of pollutants’ emission sources. CPF is an effective statistical technique for providing directional information for pollutants’ sources as well as isolate specific source types using wind direction, wind speed and concentration of specific pollutants [13, 14]. CPF analysis with wind speed, direction and time of day can help isolate specific source types for further analysis [13]. Using CPF, Henry et al. [15] concluded that large sources of pollution might have a minor contribution to concentrations of SO2. Kim and Hopke [12] investigated the advantages of using CPF approach over the non-parametric regression approach, while Bae et al. [16] applied the CPF method to identify the direction of major sources of pollutants in New York State. Furthermore, optimizing data collected by few monitoring stations in complex study areas is crucial. In this context, [17] developed a new approach which combines bivariate polar plots with a CPF to detect and characterize source contributions. This approach, called the conditional bivariate probability function (CBPF), is an extension of the commonly used CPF. The CBPF offers more information on the type of sources being identified by highlighting important dispersion characteristics [17].

Other studies have identified pollutants’ sources from the time series of PM concentrations using RMs or artificial neural network (ANN) modeling methods. Marmur et al. [18] used RMs to determine pollutants’ sources emission from the time series of PM concentrations in the southeastern United States. The study concluded that pollutants’ sources have significant temporal variation with the likelihood of shortcomings with respect to spatial representation. Elangasinghe et al. [19] developed a new approach based on ANN, bivariate concentration-wind speed-wind direction relationships and k-means clustering for identifying source performance signals from the time series of ambient PM concentrations in a coastal area in New Zealand. They emphasized that the identification of periodic and aperiodic source performance signals is a useful technique to enhance the prediction accuracy of hourly average concentrations.

Most recently, Jeričević et al. [4] used a combination of different source apportionment methods to determine the major pollutant sources of PM, Hydrogen sulfide (H2S), NO2 and SO2 in a complex urban area in Croatia (Slavonski Brod). The methods included chemical PM speciation, time series of pollutant concentrations, PMF and CBPF. The study confirmed the importance of the CBPF method for the identification of pollutants’ sources and noted a larger uncertainty of CBPF method, which is related to the representation of peak concentrations transported with wind speeds higher than 8 m/s. Similarly, determination of the extent of influence of the pollutants’ emission in ambient atmosphere and its impact on the urban areas are important for environmental management planning.

In addition to these approaches, Geographic Information Systems (GIS) modeling techniques have the capability to support pollutants’ source detection [20, 21]. Wang et al. [22] proved that GIS is useful for identifying and mapping the potential sources of pollutants. Although a number of researches have integrated these techniques for optimal performance, many of the studies focus on developed cities, mostly in America and Europe. Attempts to implement such advanced statistical and programming algorithms in developing cities, particularly in Asia and Africa, are limited. Mukherjee and Agrawal [23] combined CBPF, land use regression (LUR) and trajectory statistical models (TSM) to detect the sources of PM and their relationships with gaseous pollutants (NO2, SO2, O3, and CO) and meteorological parameters (wind direction, wind speed, relative humidity, and temperature) in middle Indo-Gangetic plain, India. The study concluded that meteorological parameters aided increase in PM concentration than associated gaseous pollutants. Further, the traffic in the northwestern part of India was identified as the most probable source of PM. A combination of CBPF and statistical models was also used to investigate the effect of meteorological parameters on PM concentration in four cities in the Yangtze River Delta, China [24]. Rana and Khan [25] used CBPF and Concentration Weighted Trajectory (CWT) methods to assess the impact of congenial meteorology on PM concentrations in six urban areas in Bangladesh while Kang et al. [26] confirmed a negative relationship between PM concentrations and meteorological parameters in Nanjing city, China. Also, Jain et al. [27] indicated that the traffic emission and biomass burning contribute greatly to PM concentration in Delhi, India, especially during winter. Despite these studies, implementation of the integrated spatial- CBPF technique in Asia remains relatively limited. Therefore, it is imperative to extend these techniques to cities in the high-risk developing region (UN News, 2012), in order to ascertain their suitability for adoption therein.

Based on the foregoing, this current research aims to determine the source direction of the major pollutants in the study area and investigate possible relationships between these pollutants. We also aim to forecast potential emission of pollutants in the near future. Therefore, we present an integrated approach based on CBPF, sensitivity analysis, time series signature and GIS to identify, characterize, and model pollutants’ emission sources in a complex region in Malaysia (Kuala Lumpur city and surroundings areas). The selected pollutants are Sulfur dioxide (SO2), Nitrogen oxides (NO2), Carbon monoxide (CO), and Particulate Matter (PM10). Also, the correlation between PM10 and other pollutants was explored using generalized linear model (GLM).

Study area

Kuala Lumpur metropolis is the federal and economic capital of Malaysia and is surrounded by the state of Selangor. It has six strategic zones with a total area of 242.8 km2 [28]. Air pollution due to airborne particulate matter (PM10) is an environmental issue in the Southeast Asia (SEA) region, particularly Indonesia, Singapore, Brunei, and Malaysia. Because of its rapid urbanization, industrialization and increased vehicular traffic, Kuala Lumpur has witnessed unprecedented infrastructural development, causing alteration of its landscape and contamination of the environment [29]. Accelerated urbanization, continuous industrial emission, vehicle emission and re-suspension of soil dust have increased the volume of PM10 mix in Malaysia’s atmosphere and exacerbated toxicity [30]. Smoke-haze episodes are regular occurrences in Kuala Lumpur, which cause frequent emissions of hazardous particles and gases into the surrounding atmosphere [31]. Particle pollution from urban activities and transboundary inputs are major sources of PM10 pollution in the city [32]. In its study of risks posed by major air pollutants in Kuala Lumpur, Tajudin [33] noted hospitalizations due to cardiovascular diseases in relation to exposure to NO2 and Ozone (O3). This sustained emission of pollutants from multiple sources make Kuala Lumpur an ideal study area as shown in Fig. 1. The figure also shows the land use and relative locations of the air quality stations in the study area.

Fig. 1: Study area.
figure 1

The study area showing industrial land use locations, density, and locations of the air quality monitoring stations.

Methodology

Data used

Four stations were used to collect the data from 2007 to 2016 (Fig. 1). These include station Ca0025 located at Taman Tun Dr Ismail Jaya, Shah Alam. The area is surrounded by a complicated industrial land use, the Sepang Airport and heavy industries in the southern zone. Station Ca0054, which is the source of 2009 to 2017 hourly data, is located at Seri Permaisuri, KL and is surrounded by a number of industries. It is near the city center which experiences heavy traffic. Station Ca0016 is located at Petaling Jaya, western part of KL with industries situated on the southern and western parts. Station Ca0058 is in the northern part of KL, specifically at Batu Muda, and surrounded by some industries on the north and north eastern part. The stations are surrounded by complex industrial areas and were selected for this study because of ease of detection by distant resources due to surrounding flat terrain. Thus, wind flows are not affected by issues related to terrain.

We collected data from approximately 40 sites using land use map provided by the Federal Department of Town and Country Planning (PLANMalaysia), and digitized locations using Google earth service. The selected data comprise of industrial zones only as it is the most likely to emit the biggest share of pollutants. Roads and other landuse were ignored due to data limitation. Figure 1 shows distribution of the air gauges. CA0058 and CA0025 are surrounded by multiple sources of pollutants, while CA0054 and CA0016 are situated in neighborhoods with limited emission of pollutants. Sources of pollution in the study area include food and beverage companies, automobile maintenance and services, chemical industry, and a complex road network. ESRI’s ArcGIS software was used to map the study area and relative locations of the roads, main emission sources and industrial areas to the monitoring site.

The maximum values of 1-hour concentration of SO2, NO2, CO, and PM10 are found in stations CA0016 and CA0025 (Table 1). This can be attributed to the location of these stations within the city center and in close proximity to highly dense industrial zones. All pollutants were measured in volumetric units’ parts per million by volume (ppm), except PM10 which was measured in Gravimetric unit (μg/m3). Wind speed was measured in meter/second (m/sec), with the Max. value (26.5) registered at CA0054 and Min. value (0.7) registered at CA0016 and CA0025. The duration of all the data measurements collected from the site and analyzed in this study was from 1st January 2006 to 31st December 2016, except data at station CA0054, which covered the period from 2007 to 2016.

Table 1 Minimum and maximum concentration of pollutants at the four stations.

Missing data imputation

The acquired data records were examined to detect missing data that need to be imputed before proceeding as recommended by Shah et al. [34]. Using 4 stations encircling the study area, we collected daily record covering a period of 10 years. The largest amount of missing data, represented by wind direction, reached a maximum of 28% in CA0025, and the maximum missing wind speed data was 20% in CA0054 as shown in Fig. 2. The other variables (pollutants) vary but do not exceed 10%. Generally, the missing data is not very high, and the data imputation technique is applicable to cover this source of uncertainty. Figure 3 shows the data summary after data imputation using predictive mean matching (PMM) in MICE package of R programming environment.

Fig. 2: Missing data.
figure 2

Percentage of missing data for all variables in the 4 stations.

Fig. 3: Summary plot.
figure 3

Summary plot and statistical summary of pollutants’ concentration variables at the 4 stations.

It presents the mean concentration values of the pollutants combined with wind direction, plotted on radial axis, and indicates the most critical directions of the dominant pollutant sources around the stations. The meteorological data and air condition are also summarized in the summary plot. Since the land use data does not contain detailed information on the industrial land use, it is difficult to compare the research outcomes with acceptable or dominant range of pollution concentration for each industry (Table 2).

Table 2 Major pollutants’ concentration in each station (Bolded values refer to maximum value for each station).

Conditional bivariate probability function

For a given wind sector, CPF estimates the probability that the measured concentration goes beyond the fixed threshold [14]. CPF was originally used to show the wind directions that dominate a specified high pollutant concentration, indicating the probability of such concentrations occurring due to wind direction [14].

$$CPF_{{\Delta}\theta }{\mathrm{ = }}\frac{{m_{{\Delta}\theta \left| {C \ge x} \right.}}}{{n_{{\Delta}\theta }}}$$
(1)

Where mΔθ = number of samples in the wind sector θ, C = pollutants concentration, x = threshold value of high percentile of concentration e.g., 95th, nΔθ = total number of samples from wind sector Δθ

Conditional bivariate probability function (CBPF) combines CPF with the wind speed, which is considered the third variable in the equation. It assigns the pollutants’ concentration to the cells that are defined by the wind direction and wind speed, having a higher reliability than the conventional methods that consider only wind direction [14]. CBPF is appropriate in areas characterized by high source complexity, with the potential of identifying more pollutants’ sources in comparison with currently used techniques such as the CPF [17].

$${\mathrm{Mathematically}},\,CBPF_{{\Delta}\theta ,{\Delta}u} = \frac{{m_{{\Delta}\theta ,{\Delta}u\left| {C \ge x} \right.}}}{{n_{{\Delta}\theta ,{\Delta}u}}}$$
(2)

Where mΔθu = number of samples in the wind sector Δθ, Δu = wind speed interval; C = pollutants concentration, x = threshold value of high percentile of concentration e.g., 95th, nΔθu = total number of samples in that wind direction-speed interval [17]

The R programing package “OpenAir”, which shows durability and efficiency in terms of data size and processing time, was used to implement the CBPF technique.

Generalized linear model (GLM)

The generalized linear model (GLM) was generated for determining the correlation between PM10 with other pollutant sources. Linear regression serves as a workhorse of statistics but cannot handle some types of complex data. A GLM expands upon linear regression to include non-normal distributions including binomial and count data.

Logistic regression is used to predict a class i.e., a probability, and it can predict a binary outcome accurately. Imagine you want to predict whether a loan is denied/accepted based on many attributes. The logistic regression is of the form 0/1. y = 0 if a loan is rejected, y = 1 if accepted.

A logistic regression model differs from linear regression model in two ways.

  • First, the logistic regression accepts only dichotomous (binary) input as a dependent variable (i.e., a vector of 0 and 1).

  • Second, the outcome is measured by the following probabilistic link function called sigmoid due to its S-shape.:

$$\sigma \left( t \right) = \frac{1}{{1 + \exp \left( { - t} \right)}}$$

The output of the function is always between 0 and 1

The sigmoid function returns values from 0 to 1. For the classification task, we need a discrete output of 0 or 1.

To convert a continuous flow into discrete value, we can set a decision bound at 0.5. All values above this threshold are classified as 1

$$\widehat y = \left\{ {\begin{array}{*{20}{c}} 0 & {if} & {\widehat p {\,}{\,}< {\,}{\,}.5} \\ 1 & {if} & {\widehat p \ge .5} \end{array}} \right.$$

timetk: a tool kit for working with time series in R

For the future prediction of PM10, testing data is used in time series to analyze the residual and forecast PM10 occurrence for 3 months. Residual plot for the stations is prepared taking data for a year (1-1-2016 to 31-12-2016).

The time series signature is a collection of useful features that describe the time series index of a time-based data set. It entails a wealth of features that can be used to forecast time series containing patterns. In this vignette, the user implements advanced statistical analysis to predict future outcomes in a time-based data set. The timetk package, comprising tools to get the time series index, signature, and summary from time series objects and time-based tables, enables a user to work with time series objects more easily in R. The robust platform also supports inspecting and manipulating the time-based index, expanding the time features for data mining and machine learning, and converting time-based objects to and from the many time series classes.

Results and discussion

Pollutants’ concentration and sources

Figure 4 shows box plots of median concentration of PM10 (monthly average) while Fig. 5 presents Whisker plots of PM10 API concentration for the period of study. Box and whisker plots are effective tools for summarizing data over different time scales, including day of the week, day of the year, and month of the year. These show variations in pollutant concentrations over different time scales, providing valuable clues on pollutants’ sources and their respective levels of significance. From Fig. 4, it is observed that the concentration of PM10 at each of the station’s peaks in March, June, May, September, and October. This trend occurs at all the four stations. However, the highest monthly concentrations are observed in August in station CA0025, presumably due to its proximity to Subang airport, which has significant chemical pollution episodes. In its investigation of the environmental and health impacts of proximity to airport infrastructure in Malaysia, Sahrir et al. [35] reported that aviation services, particularly airplane emissions, produce numerous hazardous materials such as CO, O3, SO2, NO2, and PM, confirming the likelihood of high emissions in areas close to the Subang airport.

Fig. 4: Box plots.
figure 4

Box plots of median concentration of PM10 (monthly average).

Fig. 5: Whisker plots.
figure 5

Whisker plots of PM10 API concentration for the period Jan.2007-July.2016, except ID CA0058 that started two years later (Jan.2009).

Analysis of Fig. 5 with respect to the Department of Environment’s API limits (Low pollution (0–50), Moderate (51–100), Unhealthy (101–200), Very unhealthy (201–300), and Hazardous and risky (301–500) [36]), shows that the value of PM10 did not exceeded 150 API (unhealthy), except in 2013 and 2015, and with maximum individual hour API values of 426 in Oct 2015, 416 in Oct 2015, 390 in Oct.2015, and 322 in June 2013 at stations ca0025, ca0058, ca0016 and ca0054 respectively.

Station CA0016 falls in the middle of a highly polluted area. This renders the conventional mean plot ineffective to observe the pollutants’ sources, thus, underscoring the limitations of existing techniques in such instances. A similar scenario exists in station CA0058 due to the presence of some industrial areas in the north and north east directions. Also, station CA0025 falls in the middle of highly polluted area and an airport, which complicates the detection of pollutants’ sources. Therefore, to overcome this limitation, CBPF is required to divide the mean values into 10 quantiles for further analysis. Figure 6 depicts the CBPF plot of mean values of main pollutants measured at the four stations. However, the absence of dominant pollutants around CA0054 enables a clear rendition of the effects of KL on the north and north west areas.

Fig. 6: CBPF plot.
figure 6figure 6

CBPF plot of mean values of main pollutants measured in four stations in volumetric units (ppm), except PM10 measured in Gravimetric unit (µg/m3).

Although every pollutant has a specific emission range, this is usually ‘washed out’ in the normal conventional polar plot [37]. Therefore, considering the quantiles intervals using the CBPF’s interval classification can reveal more information on pollutants’ sources.

The CBPF was calculated for all the stations (Fig. 7) by taking the 10 quantiles of each pollutant for better visualization of different emission sources. This is particularly pertinent for stations CA0016 and CA0025 that did not reveal clear sources and directions for the pollutants as shown in Fig. 6. The CBPF plots reveal significant hidden information about different pollution sources which was not discernable in the ordinary plot, especially for the stations that are located in the middle of high-density industrial areas and pollutants. Subsequent analysis focused on the main pollutant in the two stations, PM10, since it always has the highest value in comparison to other pollutants, thus, determining the Air Pollution Index (API).

Fig. 7: Polar plots.
figure 7figure 7

Polar plots of concentrations at 4 stations based on the CBPF function for a range of percentile intervals from 0–10, 10–20, …, 90–100.

Nearer land use at station CA0016 have the highest percentile values while the distant ones recorded low percentiles. Analyzing Fig. 7 in tandem with Fig. 1, land uses in the NW direction were detected by two percentiles,7th (47–52) and 8th (52–58), while the third zone of land use at the southern direction of the station is represented by SW 9th (58–70) and SE 10th (70–390) percentiles. Despite their low API values, CPBF successfully detected distant pollutants at the NE direction, in the 1st (12–29) and 2nd (29–33) percentiles.

The presence of many land use surrounding station CA0025 presents unique challenges in identifying sources of diverse pollutants, but the CBPF percentile plot presents some interesting findings. In the N and the NW direction, we observed the Subang airport and a huge industrial area, respectively. These were measured by stations on 2nd (30–35) and 3rd (35–39) percentiles, respectively. In S and SW directions, the effect of these two regions was higher than the airport’s location, and detected by 9th (64–76) and 10th (76-426) percentiles respectively, in addition to the 5th percentile (43–47) that detected the pollution originating from the center of the study area (eastern direction of the station). Other stations detected many sources that were not observed in close region to the station, but with low effect. The major effect originated from the close land use as shown in Fig. 7 for the two stations (ca0058 and ca0054).

Further insights can be gained by considering how percentile values vary in relation to other factors i.e., conditioning. For example, the plot shows the seasonal variation and whether it is nighttime or daytime. Also, PM10 concentration results in Fig. 7 indicate that identification of sources varies according to percentiles (high to low), while Fig. 8 highlights the impact of seasonal variation on identification of pollutants’ sources. At station CA0016, PM10 is higher in September, October, and November, which refers to the SE direction that was indicated in Fig. 7. It is important to note that there are no significant differences between day and night concentrations. This suggests that PM10 is not affected by day or night temperature, but rather by the pollutants’ active hours and seasons.

Fig. 8: Rose plot.
figure 8

A percentile Rose plot of PM10 concentration plotted for 3 months average (season) and relation to daylight and nighttime at two stations at the study area.

Correlation between PM10 and other pollutants

Generalized linear model

The three major pollutants, SO2, NO2, and CO, show positive correlation with PM10 in all stations based on outputs of generalized linear model using percentile value. CO has the highest correlation value in station CA0058, followed by CA0025.

A similar trend is observed in other cities, particularly in the Asian region. Table 3 documents selected studies on major pollutants and their concentration in different Asian cities. The data in the table shows the average concentration of the major pollutants, SO2, NO2, CO, and PM10. A correlation between PM10 and the other pollutants is established in all the cities, which aligns with the findings of our present study. Vehicle emission and proximity to industrial areas is the major source of pollutants in these cities and our study area too.

Table 3 Characterization of pollutants in Asian cities.

Sensitivity analysis

Sensitivity analysis was conducted at all stations to observe the relation between PM10 and three major pollutants based on the correlation observed from outputs of generalized linear model using percentile value. CO, NO2, and SO2 are taken as explanatory variables whose values are kept constant at their minimum 20th percentile, 40th percentile, 60th percentile, 80th percentile and maximum percentile. At station CA0058, NO2 and CO clearly displayed non-linear relation with PM10, showing a logarithmic growth and tend to level off with increasing value of the pollutants. S02 had almost negligible response. At station CA0054, the sensitivity model for SO2 and NO2 showed exponential growth trend in the first four quantiles and descended at maximum quantile. At station CA0025, the pattern of response by PM10 varied with respect to quantiles. In the first four quantiles in CO and NO2, the graph shows slow logarithmic growth at the beginning, followed by a sharp increase. In contrast, for the maximum concentration quantile values, PM10 exhibited an exponential growth with respect to both CO and NO2. However, SO2 shows a linear relationship with the occurrence of PM10. At station CA0016, CO and PM10 displayed a unimodal relation. The response pattern of PM10 is non-linear with different patterns in different quantiles. PM10 tends to show linear response with respect to SO2.

The sensitivity analysis at station CA0025 (Fig. 9) indicates that the current pollutants are not driven by daylight and night hours. This is evident in the concentration pattern from September to November, which increases especially towards the NE sources (Airport and city center). Between June and august, some pollutants started to emit PM10 with lower values. However, these sources fall outside the data coverage, necessitating further investigation to understand the variance in seasonal concentration especially in PM10.

Fig. 9: Sensitivity analysis.
figure 9

Sensitivity analysis at the study area showing the effect of variable values’ splits with response.

Time series signature and forecast

By using tk functions in R, we obtain residual and forecasting plots. Figures S1a, S2a, S3a, and S4a (see Supplementary Material) show the mean residual of testing data for the period 1 Jan 2016-31 Dec 2016 for the stations CA0058, CA0054, CA0025, and CA0016, respectively. In these figures, the blue line shows the residual values and gray points represent tested data. Generally, the residual plots show an irregular pattern at all stations. Figures S1b, S2b, S3b, and S4b (see Supplementary Material) illustrate the 3-month forecast of PM10 for the stations CA0058, CA0054, CA0025, and CA0016, respectively. The red points represent the modeled values of PM10 while the white line represents the mean values of predicted concentration of PM10. The forecast plots reveal a cyclical pattern at all stations, indicating a seasonal behavior. It is also observed that the forecast concentration of PM10 is very near and represent a continuous distribution to tested data.

Conclusion

In the current article, we proposed an enhanced methodology to reliably identify the sources of PM10, SO2, CO, and NO2 in high density urban areas of Kuala Lumpur, Malaysia, and other surrounding areas. We also investigated the correlation between PM10 and the other pollutants and forecast the future occurrence and sources of PM10 to aid air pollution mitigation and management strategies. To achieve the study’s objectives, we selected 10 years’ data from four randomly distributed air pollution stations- CA0016, CA0025, CA0058 and CA0054. The study leveraged recent advances in statistical programming algorithms such as CPBF; polar plot; percentile seasonal and daily plot; and whisker plot embedded in OpenAir package within the R programming environment. These advanced algorithms have the capability to optimize the visualization mechanism of long time series air pollution data of large urban regions that are naturally correlated in time and spatially complicated to analyze, thereby elaborating different sources of pollution that were hitherto undetectable using ordinary plot models. Further, Generalized Linear model (GLM) and sensitivity analysis was applied to assess the relationship between Air Pollution Index (API) of PM10 and other significant pollutants of GML outputs based on quantile values. The whisker time series plot showed that the value of PM10 was within the city’s healthy API range, with exceptions in 2013 and 2015. Results from the CBPF plot indicated that CA0058 and CA0054 enable easier detection of pollutants’ sources in comparison to CA0016 and CA0025 due to their location near industrial areas. The highest monthly concentrations were observed in August at station CA0025, presumably due to its proximity to Subang airport, which has significant chemical pollution episodes. The CBPF plots revealed significant hidden information about different pollution sources in the study area, which was not discernable in ordinary plot, particularly for the stations that are located around high-density industrial areas. The sensitivity analysis at station CA0025 revealed that the pollutants are not affected by daylight and night hours and temperature. Rather, PM10 concentration is influenced by seasonal variations. Analysis of the Time series signature and forecast revealed that the residual plots generally indicate an irregular pattern at all stations while the forecasting plots reveal a cyclical pattern at all stations, which is indicative of a seasonal behavior. Based on these findings, the major conclusion from this study are summarized below:

  • Despite missing data attributes and scarcity of air quality stations, the use of long-term data with advanced statistical and spatial analysis enhances the understanding of the mobility of pollutants along the study area, spatially and temporally.

  • Proper distribution of air quality monitoring stations provides deep insights on temporal scenarios of pollution distribution, aided by detailed land use maps and Google earth services.

  • CBPF plot can identify the sources of pollution in cases where the stations are located at the edge of pollutants’ sources or there is a great variance in pollution concentration. Otherwise, CBPF’s results will not be valid.

  • Polar plot of concentrations based on the CBPF function for a range of percentile intervals from 0 to 100 successfully elaborated the different sources of pollution around the stations, which was not visible in the ordinary plot.

  • Closer pollutants with high intensity have major impact on the stations’ concentration reading, although small concentrations emanating from distant pollutants can still be detected.

  • There are no significant differences between day and night concentrations in the study area. This offers interesting insights on the nature of PM10, which is not affected by temperature of daylights and night. However, further investigation of the impact of vehicular traffic is necessary.

The visual interpretation of results and analysis has impacts on comprehending the urban air pollution dynamics using advanced statistical models to convey vital information to the local community, decision makers and all stakeholders.

For future work, it is imperative to get a detailed land use data within the industrial land use for the purpose of comparing the outcome with acceptable or dominant range of pollution concentration for each industry.