Introduction

Seven million people die every year from the effects of air pollution. More than 90% of such deaths are in developing countries (WHO 2019). Across southern Asia, levels of fine particulate matter (PM2.5) and surface ozone (O3) exceed the World Health Organization (WHO) limits for much of the year (Kumar et al. 2018). Macao is located in Southern China, in the Pearl River Delta (PRD) region. The levels of nitrogen dioxide (NO2), particulate matter (PM) with an average aerodynamic diameter below 10 μm and 2.5 μm (PM10 and PM2.5, respectively), and ozone (O3) in Macao are high and often exceed the limit values recommended by WHO’s air quality guidelines (AQG). Since 2010, the worst air quality index classes in Macao have been due to PM10 and PM2.5 (SMG 2019). Macao was listed as the most densely populated region in the world (Sheng and Tang 2013), with a population density of about 20,000 inhabitants/km2. A significant proportion of Macao's urban population is exposed to air pollutant concentrations above the limit or target values.

Exposure to air pollutants such as NO2, PM, and O3 increases the risk of hospital admission for cardiovascular and respiratory disease, and of mortality, worldwide (Liu and Peng 2018; WHO 2018). Ground-level O3 is associated with numerous harmful effects on respiratory health at levels commonly found in urban areas throughout the world, contributing to morbidity and hospital admissions related to respiratory disease, even at low ambient levels (Entwistle et al. 2019). Regarding particulate matter, small particles (PM2.5) are particularly dangerous for human health, as they can penetrate deeply into the lungs and be transported directly into the bloodstream (Wiśniewska et al. 2019). Furthermore, mixtures of NO2, PM2.5, and O3 exist in ambient environments, and combinations of these pollutants are more harmful to human health: a mixture with relatively low levels of some pollutants combined with relatively high levels of others was found to be equally or more harmful than a mixture with high levels of all pollutants (Liu and Peng 2018). In Macao, traffic-related pollution is high, primarily due to high vehicle emissions and urban canyon topology (He et al. 2000).

In this context, it is relevant to develop a reliable methodology to forecast the concentration of air pollutants. Such forecasts can provide advance warning of health hazards, so that the population can take precautionary actions to avoid exposure.

Recent studies in the PRD region have assessed the meteorological influence on air quality (Tong et al. 2018a, b; Xie et al. 2019) and addressed air quality forecasting (Lee et al. 2017; Deng et al. 2018). The current paper focuses on the development of air quality forecast models, using statistical methods, for the most critical air pollutants in Macao.

The methods for predicting air pollutant concentrations can be roughly divided into two types: deterministic and stochastic. The statistical approach learns from historical data and predicts the future behavior of the air pollutants. Meteorological conditions significantly affect the levels of air pollution in the urban atmosphere, due to their important role in the transport and dilution of pollutants. It has also been concluded that there is a close relationship between the concentration of air pollutants and meteorological variables (Zhang and Ding 2017). Thus, multiple linear regression (MR) models are trained on existing measurements and are used to predict future concentrations of air pollutants from the corresponding meteorological variables.
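As a minimal sketch of what training an MR model on existing measurements involves, the following fits an ordinary least-squares equation by solving the normal equations. It is illustrative only: the function names, the two predictors, and the synthetic data are assumptions, not the models or data of this study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mr(predictors, target):
    """Least-squares fit of target = b0 + b1*x1 + ...; returns [b0, b1, ...]."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mr(coefs, predictor_row):
    """Apply the fitted equation to one new set of predictor values."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], predictor_row))


# Hypothetical history: (previous-day PM10, mean relative humidity) -> next-day PM10.
history_x = [(40.0, 70.0), (55.0, 60.0), (30.0, 85.0), (65.0, 55.0), (50.0, 75.0)]
# Synthetic target generated from a known equation so the fit can be checked.
history_y = [0.8 * pm - 0.2 * rh + 20.0 for pm, rh in history_x]

coefs = fit_mr(history_x, history_y)
forecast = predict_mr(coefs, (45.0, 65.0))
```

Because the synthetic target is an exact linear function of the predictors, the fit recovers the generating coefficients; with real measurements the residuals would not vanish.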

The Greater Bay Area (GBA) of China consists of nine cities of Guangdong province and the Special Administrative Regions of Hong Kong and Macao. The synoptic situations of Macao and the other cities of the GBA are closely related due to their geographic proximity. The GBA experiences complex temporal and spatial climatic conditions due to topographic variations, urban morphology, and land-water contrasts. Located along the southeast coast of Mainland China, Macao is surrounded by the sea on three sides, with a subtropical oceanic monsoon climate that is characterized by high temperatures, high rates of evaporation, high levels of atmospheric moisture, and abundant rainfall (SMG 2014). In winter, Macao is influenced by the northeast monsoon, and the climate is cold and dry, with the predominant wind from the northern quadrant. In summer, the northeast monsoon is replaced by the strong southwest monsoon with heavy rains. Spring and autumn are transition periods.

Recent studies (Tong et al. 2018a, b) showed a rise in surface temperature and a drop in surface absolute humidity and wind speed in the GBA due to the decline of vegetation and irrigated cropland. The landscape of the GBA is characterized by a large flatland surrounded by the Nanling Mountains, which can prevent air pollution from the central part of China from reaching the GBA. Nevertheless, the northeast monsoon present during the winter may transport pollutants from northern and eastern China, along the coastline, to the GBA (Tong et al. 2018a, b). PM levels are usually higher during the winter season, from December to February, because northerly winds bring air pollutants to the region, the mixing height is lower, and rainfall is less frequent and less abundant. During the summer season, from June to August, PM levels are usually lower because of southerly winds from the South China Sea, a higher mixing height, and more frequent and abundant rainfall, which allow for better dispersion and deposition of air pollution (Lopes et al. 2016).

The air pollution of the GBA is normally associated with emission sources at spatial scales ranging from local to regional and transboundary (Tong et al. 2018a), under certain synoptic conditions. Estimates show that, in this region, for nitrogen oxides (NOx), mobile sources account for the majority of emissions (50%). For PM, the industrial sector is the main emitter, followed by mobile sources (Zheng et al. 2009). O3 is not emitted directly to the atmosphere but is formed in reactions between NOx and volatile organic compounds (VOC), which are driven by absorbed solar radiation (Reid et al. 2008).

Materials and methods

The statistical methods selected for this paper were multiple linear regression analysis (MR) and classification and regression tree (CART) analysis. Both are useful and straightforward tools in air quality studies (Choi et al. 2013; Martinez et al. 2018; Cassmassi 1987; Clapp and Jenkin 2001). One advantage of CART analysis is its effectiveness in explaining the variations in pollutant levels solely by a combination of meteorological conditions: regression trees can identify specific meteorological conditions that lead to low or elevated pollutant concentrations (Choi et al. 2013). The basic concept of the CART approach is to make a hierarchy of binary decisions, each of which splits the distribution of a target variable into two mutually exclusive branches (groups), based on the explanatory variable and threshold value that give the largest reduction in the variation of the target variable after the split (Choi et al. 2013).
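The binary-split idea can be illustrated with a short sketch that searches every predictor and candidate threshold for the split giving the largest reduction in the sum of squared deviations of the target. This is a simplified stand-in for a single CART step, assuming squared deviation as the measure of variation; the variable choice and the data are hypothetical.

```python
def sum_squared_deviation(values):
    """Total squared deviation of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(rows, target):
    """Return (variable index, threshold, reduction) for the best single split.

    rows: list of predictor tuples; target: list of concentrations.
    A case goes to the left branch when its predictor value is <= threshold.
    """
    parent = sum_squared_deviation(target)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        levels = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, target) if r[j] <= threshold]
            right = [y for r, y in zip(rows, target) if r[j] > threshold]
            reduction = (parent - sum_squared_deviation(left)
                         - sum_squared_deviation(right))
            if reduction > best[2]:
                best = (j, threshold, reduction)
    return best


# Hypothetical cases: (air temperature in deg C, wind speed in m/s) -> O3 maximum.
cases = [(18.0, 3.0), (20.0, 5.0), (22.0, 2.0), (29.0, 4.0), (31.0, 2.5), (33.0, 5.5)]
o3_max = [40.0, 45.0, 42.0, 110.0, 120.0, 115.0]

variable, threshold, reduction = best_binary_split(cases, o3_max)
```

In this toy data set the split falls on temperature, between the cool low-ozone days and the warm high-ozone days. A full tree would repeat the search inside each branch until a stopping rule is met.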

Following previous work (Cassmassi 1987; US EPA 2003; Durão et al. 2016; Oduro et al. 2016), the statistical models were initially created using MR analysis. CART analysis was then applied to improve the results, mainly the prediction of high pollutant levels.

Statistical models, based on MR and CART, were applied to forecast the daily average concentration of NO2, PM10, and PM2.5, and the maximum hourly average concentration of O3, for the next day, for each station of the air quality monitoring network in Macao. The network comprises six air quality monitoring stations, operated by the Macao Meteorological and Geophysical Bureau (SMG): two are classified as roadside (Macao Roadside, Ká-Hó Roadside), two as high density residential (Macao Residential, Taipa Residential), and two as ambient background (Taipa Ambient, Coloane Ambient). Figure 1 shows the location of the air quality monitoring stations within the 30 km2 of the Macao region.

Fig. 1 Air quality monitoring network spatial location in Macao

Daily observations from a 4-year series, from 2013 to 2016, were used to develop the forecast models, and each model was evaluated using 2017 data.

The first step of the study was to gather a set of meteorological and air quality data, namely (i) meteorological surface observations: hourly observations from automatic weather stations, such as temperature, relative humidity, and dew point temperature, collected at the Taipa Grande Meteorological Station; (ii) upper-air observations, such as geopotential heights, temperature, relative humidity, and dew point temperature at various altitudes, collected at the King’s Park station in Hong Kong; (iii) surface air quality measurements of NO2, PM10, PM2.5, and O3 from SMG’s network. Other variables were added to the analysis, such as a weekday/weekend flag and the daily duration of sunlight. These variables are presented in Table 1.

Table 1 Variables used as predictors in the MR and CART models

The next step was to assess data efficiency levels, for each parameter and year, in order to reject years with low efficiency. Statistical models for the Ká-Hó Roadside station were not feasible, due to the lack of sufficient air quality data. Outliers were identified and excluded from the data series. A complementary analysis was conducted to observe air pollution trends; monthly, weekly, and hourly patterns; and pollution roses.

A preliminary exploratory data analysis was performed to identify variables with similar behaviors. It covered basic statistics, such as average, mode, histogram, and distribution type, as well as the correlation between different variables and principal component analysis. This analysis guided the choice of steps needed to obtain the best model outcome.

A significance level of 0.05 was used in the linear MR analysis. Some variables initially selected were rejected from the forecast models due to collinearity. The final objective was to obtain prediction models with the lowest possible number of variables but with the maximum explained variance, as measured by R2. The more variables a model uses, the higher the risk of compromising the operational forecast, since the forecast cannot be produced if one or more variables are missing or inaccessible. SPSS version 25 was used to perform linear MR (stepwise method) and CART analysis.
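To illustrate how collinear predictors might be screened out, the sketch below drops any candidate whose absolute Pearson correlation with an already retained predictor exceeds a threshold. This is not the SPSS stepwise procedure used in the study; it is a simpler stand-in, and the 0.9 threshold, the variable name T_AIR, and all data values are assumptions made for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.9):
    """Keep predictors in insertion order, skipping any that is strongly
    correlated (|r| > threshold) with a predictor already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical daily series; dew point is constructed to track temperature closely.
candidates = {
    "T_AIR": [18.0, 21.0, 25.0, 28.0, 30.0, 24.0],
    "TD_MD": [14.1, 17.0, 20.9, 24.2, 25.9, 20.0],
    "WIND": [3.0, 5.5, 2.0, 4.5, 2.5, 6.0],
}
selected = drop_collinear(candidates)
```

The result depends on the order in which candidates are listed, so in practice the more physically meaningful or more reliably available variable would be placed first.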

Model performance was assessed using the following parameters: coefficient of determination (R2) (1), root mean square error (RMSE) (2), mean absolute error (MAE) (3), and Bias (4).

$$ {R}^2=\frac{{\left[{\sum}_{i=1}^n\left({f}_i-\overline{f}\right)\left({o}_i-\overline{o}\right)\right]}^2}{\left[{\sum}_{i=1}^n{\left({f}_i-\overline{f}\right)}^2\right]\left[{\sum}_{i=1}^n{\left({o}_i-\overline{o}\right)}^2\right]} $$
(1)
$$ \mathrm{RMSE}=\sqrt{\ \frac{1}{n}\ {\sum}_{i=1}^n{\left({f}_i-{o}_i\right)}^2} $$
(2)
$$ \mathrm{MAE}=\frac{1}{n}\ {\sum}_{i=1}^n\mid {f}_i-{o}_i\mid $$
(3)
$$ \mathrm{Bias}=\frac{1}{n}\ {\sum}_{i=1}^n\left({f}_i-{o}_i\right) $$
(4)

where \( f_i \) is the forecast and \( o_i \) is the observation for case i, \( \overline{f} \) is the forecast average, \( \overline{o} \) is the observation average, and n is the number of cases.
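The four indicators in Eqs. (1) to (4) can be computed as in the following sketch, which reads Eq. (1) as the squared Pearson correlation between forecasts and observations. The function names and the sample values are assumptions for illustration only.

```python
from math import sqrt


def r_squared(forecast, observed):
    """Eq. (1): squared correlation between forecasts and observations."""
    n = len(forecast)
    f_mean = sum(forecast) / n
    o_mean = sum(observed) / n
    cov = sum((f - f_mean) * (o - o_mean) for f, o in zip(forecast, observed))
    f_var = sum((f - f_mean) ** 2 for f in forecast)
    o_var = sum((o - o_mean) ** 2 for o in observed)
    return cov ** 2 / (f_var * o_var)


def rmse(forecast, observed):
    """Eq. (2): root mean square error."""
    return sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))


def mae(forecast, observed):
    """Eq. (3): mean absolute error."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)


def bias(forecast, observed):
    """Eq. (4): mean signed error; positive values mean over-prediction."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


# Hypothetical daily concentrations in ug/m3.
predicted = [42.0, 55.0, 61.0, 38.0]
measured = [40.0, 58.0, 60.0, 35.0]

scores = {
    "R2": r_squared(predicted, measured),
    "RMSE": rmse(predicted, measured),
    "MAE": mae(predicted, measured),
    "Bias": bias(predicted, measured),
}
```

RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported together; Bias shows whether the errors are systematically positive or negative.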

Results and discussion

The statistical models based on MR and CART analysis were developed to forecast NO2, PM10, PM2.5, and O3 concentrations. The final objective is to perform a daily forecast for the next day, in operational mode, by running the prediction models after 16:00 (the time at which the daily air quality data become available).

CART analysis was tested mainly to better predict the high concentration levels. For NO2 and PM, CART analysis did not improve the quality of the overall predictions. Therefore, for these pollutants, the prediction at each station was based on a single MR model. In the case of the O3 forecast, for three stations (Taipa Ambient, Taipa Residential, and Coloane Ambient), CART analysis made it possible to identify split nodes, and an O3 prediction equation was then determined by MR for each node. Figure 2 shows an example of the CART trees obtained, in this case for O3 MAX prediction at Taipa Ambient station.
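The two-stage scheme described for O3, in which a tree assigns each case to a node and a node-specific MR equation produces the forecast, could be applied operationally along the lines of the sketch below. The split variable, thresholds, coefficients, and the names O3_16D1 and T_MAX are placeholders invented for the example; they are not the values obtained in this study.

```python
def forecast_o3_max(case, nodes):
    """Pick the first node whose rule accepts the case, then apply its equation.

    nodes: list of (rule, intercept, slopes), where rule is a function of the
    case and slopes maps predictor names to regression coefficients.
    """
    for rule, intercept, slopes in nodes:
        if rule(case):
            return intercept + sum(b * case[name] for name, b in slopes.items())
    raise ValueError("no node matched the case")


# Hypothetical three-node tree split on the previous 24-h O3 level ("O3_16D1").
# Thresholds and coefficients are placeholders, not values from the paper.
nodes = [
    (lambda c: c["O3_16D1"] <= 60.0, 10.0, {"O3_16D1": 0.9, "T_MAX": 0.5}),
    (lambda c: c["O3_16D1"] <= 120.0, 25.0, {"O3_16D1": 0.8, "T_MAX": 1.0}),
    (lambda c: True, 40.0, {"O3_16D1": 0.7, "T_MAX": 1.5}),
]

low_day = forecast_o3_max({"O3_16D1": 50.0, "T_MAX": 20.0}, nodes)
high_day = forecast_o3_max({"O3_16D1": 150.0, "T_MAX": 30.0}, nodes)
```

Fitting a separate equation per node lets the slope of each predictor differ between low and high ozone regimes, which a single linear equation cannot do.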

Fig. 2 CART tree obtained for O3 MAX prediction at Taipa Ambient station

The meteorological and air quality variables selected, and the equations obtained with MR (or CART and MR, in the O3 MAX case), are listed in Table 2.

Table 2 Variables and model equations for each pollutant per air quality monitoring station

The models were validated with data collected in 2017. The results show a good agreement between modelled and observed concentrations, statistically significant at the 95% confidence level. The selected models capture the relationship between meteorological and air quality variables well when performing an air quality forecast under different situations. Table 3 contains the model performance indicators obtained: R2, RMSE, MAE, and Bias.

Table 3 Model performance indicators

The highest R2 values were obtained for PM (between 0.86 and 0.93 and, in all cases, greater for PM10 than for PM2.5), followed by NO2 (between 0.84 and 0.90), with the lowest explained variance obtained for O3 (between 0.78 and 0.87). The models did not show a defined trend by type of station, with similar R2 values for roadside, residential, and ambient stations. The monitored and forecasted concentrations in 2017 for the models with the highest and lowest R2 are depicted in Figs. 3 and 4: PM10 at Coloane Ambient and O3 MAX at Coloane Ambient, respectively. The poorest result at Coloane Ambient is related to it having the fewest cases available to build the model (N = 546).

Fig. 3 Observed and predicted PM10 concentrations for Coloane Ambient in 2017

Fig. 4 Observed and predicted O3 MAX concentrations for Coloane Ambient in 2017

Regarding the RMSE, all models followed the same trend observed for R2: the RMSE was lowest for PM (between 4.9 and 9.2 μg/m3), followed by NO2 (between 6.1 and 7.9 μg/m3), and highest for O3 (between 21.1 and 27.4 μg/m3). In the case of O3, the high RMSE values were due to abrupt variations on consecutive days, since statistical models are sensitive to this kind of fluctuation.

Regarding CART analysis for O3 prediction, three equation nodes were used. The number of cases considered in each node (N), the coefficient of determination (R2), the correlation coefficient (r), and the standard error of the estimate are presented in Table 4. The standard error of the estimate, which is a measure of the prediction’s accuracy, was higher for the higher concentration categories. The highest standard error of the estimate was 17.2 μg/m3 for node 1, at Coloane Ambient station, and 28.8 μg/m3 for node 2 and 43.6 μg/m3 for node 3, both at Taipa Residential station. This reflects the difficulty the model has in predicting the highest O3 concentration ranges. Traffic-related pollutants, such as PM and NO2, depend on meteorological conditions as well as emission rates. Because O3 is produced in the atmosphere through photochemical processes, the major meteorological factors affecting ozone concentrations are different from those for traffic-related primary pollutants (Choi et al. 2013).

Table 4 CART model performance indicators

In all cases, the variable that represents the pollutant concentration over the last 24 h (16D1) is the most prevalent, being selected in all the forecast equations (Table 2). The geopotential height at 850 hPa (H_850), an indicator of the synoptic-scale weather pattern, is also frequently present in the forecasts of NO2 and PM. Specifically, in the case of PM10, the relevant variables are H_850 and the mean relative humidity (HRMD), while for PM2.5, at both residential stations, the average dew point temperature (TD_MD) and the air temperature at 925 hPa (TAR_925, a measure of the strength and height of the subsidence inversion) appear in the final equations. Atmospheric stability at 925 hPa and at 850 hPa (STB_925 and STB_850, respectively) appears in the final equations for NO2 and O3 MAX at Taipa Ambient. These temperature differences between layers provide information about atmospheric stability.
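A lagged predictor such as 16D1 could be derived from hourly measurements as sketched below. The exact definition used in the study is not given in the text, so this sketch assumes the mean of the 24 hourly values ending at 16:00 on the day the forecast is run; the function name, the minimum number of valid hours, and the data are likewise assumptions.

```python
def mean_last_24h(hourly, end_index, min_valid=18):
    """Average the 24 hourly values ending at end_index (inclusive).

    Missing hours are given as None and skipped. If fewer than min_valid hours
    are available, None is returned (assumed completeness rule, not from the paper).
    """
    if end_index < 23 or end_index >= len(hourly):
        raise ValueError("need 24 hourly values ending at end_index")
    window = hourly[end_index - 23:end_index + 1]
    valid = [v for v in window if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical two days of hourly NO2 values; index 0 is 00:00 on the first day.
hourly_no2 = [float(h % 24) for h in range(48)]
run_hour = 24 + 16  # 16:00 on the second day, when the forecast is assumed to run

lag_16d1 = mean_last_24h(hourly_no2, run_hour)
```

Returning None when too many hours are missing makes data gaps explicit, which matters operationally because a forecast equation cannot be evaluated without its lagged term.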

The statistical methods used depend on past data series. If the historical data are insufficient, the forecasts will be less reliable. In particular, if emission sources change considerably, or if meteorological variables change due to new weather patterns possibly driven by climate change, the past data series will no longer represent the current situation, and the models will need to be recalculated with more recent data.

Conclusion

Statistical models to forecast the daily average concentration of NO2, PM10, and PM2.5, and the maximum hourly average concentration of O3, for the next day in the Macao region were successfully developed for five locations using MR analysis. In the case of O3 predictions, CART analysis showed better results, especially for high concentration levels, allowing a more accurate prediction of critical pollution episodes.

The best results were obtained for PM10, followed by PM2.5 and NO2. The most challenging forecast was the maximum hourly concentration of O3, which scored the lowest R2 (0.78). This is due to its nature as a secondary pollutant, formed in atmospheric reactions that depend on the concentrations of other compounds and on key meteorological conditions, such as sunlight and temperature.

The variables that explained most of the variability, for all pollutants, were the concentration levels measured in the 24 h before the operational forecast. For PM and NO2, the indicator of the synoptic-scale weather pattern (geopotential height at 850 hPa) was also a relevant variable.

This work shows that in areas such as Macao, where some data (such as spatially resolved emissions and traffic-related data) may not be easily obtained with a high level of confidence, this kind of statistical approach offers an opportunity to obtain a reliable forecast together with a clearer understanding of the main factors that affect air quality.