Abstract
The aim of this study is to develop a Machine Learning calibration procedure that improves the performance of low-cost air quality sensors, and to investigate how well the resulting calibration function generalizes over a specific area, as a step towards air quality data fusion.
1 Introduction
Poor air quality (AQ) has a negative impact on people’s quality of life. The small number of monitoring stations used for official AQ monitoring and the operationally available air pollution modelling tools still leave room for improving local AQ knowledge. The KASTOM project (www.air4me.eu) is developing a versatile and flexible air quality monitoring and forecasting system by deploying an IoT-oriented network of low-cost AQ sensor nodes (LCAQSN), while in parallel developing a state-of-the-art emission modeling module combined with state-of-the-art three-dimensional AQ models. LCAQSN can cover larger areas thanks to their low cost, but they lack the necessary accuracy.
2 Materials
The Greater Thessaloniki Area (GTA) is the second largest urban agglomeration in Greece, home to more than 1 million inhabitants. The KASTOM project has installed 33 LCAQSN in the GTA, including: (a) particle sensors (PM10 and PM2.5: PMS5003, Beijing Plantower Co., Ltd.), (b) sensors for gaseous pollutants (NO2, O3 and CO: Alphasense Ltd., U.K.), and (c) meteorological sensors (air temperature, relative humidity and air pressure: BME280, Bosch Sensortec, Germany).
In this study, we collocated six nodes with two reference stations (Fig. 1) in the Agias Sofias (AGSOF) and Kordelio (KORD) areas, classified by the European Environment Agency as an urban traffic station and an urban industrial station, respectively.
The initial dataset (NodeSet, hereafter Nset) consists of the measurements from the six nodes (Node1–3 located in AGSOF and Node4–6 located in KORD) for the period 21/12/2019–10/03/2020, together with the reference station measurements for PM10, O3 and NO2 (NO2 measurements in KORD were omitted because of missing values). The additional dataset (FSet, hereafter Fset) also included meteorological modeling (WRF) and free traffic flow data (Salanova et al., 2018). All variables are presented in Table 1.
3 Methods
The first step of the computational procedure aimed at generating a set of features that captures the maximum amount of information. We therefore applied time lags (from 1 to 12 h) and rolling aggregation statistics (6 and 12 h windows) to all the variables, leading to 161 features for the Nset and 401 features for the Fset. To reduce the noise introduced by uninformative features, a feature reduction procedure was followed, employing the Random Forest Feature Importance (RFFI) method. We then employed a Machine Learning (ML)-oriented modeling approach, using the reference station measurements (PM10, O3 and NO2) as targets to calibrate the KASTOM nodes and upgrade their performance. Models were trained on each of the two datasets, for each sensor and location. A Gradient Boosting algorithm was used (Friedman, 2001), which combines the outputs of individual regression trees sequentially, so that each new tree helps to correct the errors made by the previously trained trees.
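The feature-generation step can be illustrated with a minimal sketch. It uses only the Python standard library and assumes hourly, gap-free series; the column name `pm10_raw`, the helper names, and the choice of the mean as the rolling statistic (computed over the current and preceding hours) are assumptions for illustration, not details taken from the study. In practice, `shift` and `rolling` from pandas would be the more usual tools.

```python
from statistics import mean


def lag_feature(values, lag):
    """Return `values` delayed by `lag` steps; unknown leading entries are None."""
    values = list(values)
    return ([None] * lag + values)[: len(values)]


def rolling_mean(values, window):
    """Mean over the current and the previous `window - 1` values (None until full)."""
    values = list(values)
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


def build_features(columns, lags=range(1, 13), windows=(6, 12)):
    """Expand each hourly column into lagged (1 to 12 h) and rolling-mean (6, 12 h) features."""
    features = {}
    for name, values in columns.items():
        features[name] = list(values)
        for lag in lags:
            features[f"{name}_lag{lag}"] = lag_feature(values, lag)
        for window in windows:
            features[f"{name}_mean{window}"] = rolling_mean(values, window)
    return features


# Hypothetical hourly raw PM10 readings from one node (illustrative values only).
hourly = {"pm10_raw": [float(v) for v in range(24)]}
features = build_features(hourly)
```

Each input column here yields 15 columns (the original, 12 lags and 2 rolling means). Rows that still contain `None` would have to be dropped or imputed before model training.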
To evaluate the initial performance of the LCAQSN, the Pearson Correlation Coefficient (r) and the Coefficient of Divergence (CoD) were calculated. The ML models were evaluated using fivefold time-forward cross-validation on a rolling basis, using the Coefficient of Determination (R2) and the Relative Expanded Uncertainty (REU), following the methodology described in the Guide to the Demonstration of Equivalence of Ambient Air Monitoring Methods (EUD, 2008). According to the European Air Quality Directive, the uncertainty limits for a “class 1 sensor” (indicative measurements) are 50%, 25% and 30%, and for a “class 2 sensor” (objective measurements) 100%, 75% and 75%, for PM10, NO2 and O3 respectively.
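A time-forward evaluation scheme of this kind can be sketched as follows. This is one possible reading only: it assumes an expanding training window in which every test block lies strictly after its training data (the same layout used by scikit-learn's `TimeSeriesSplit`), whereas the study may have used a different rolling scheme. The function names are hypothetical.

```python
def time_forward_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with each test block after its training data."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train = list(range(0, test_start))  # everything observed so far
        test = list(range(test_start, test_start + test_size))
        yield train, test


def coefficient_of_determination(observed, predicted):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of the observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Example: 12 hourly samples split into five forward-in-time folds.
splits = list(time_forward_splits(12, n_splits=5))
```

In each fold a model would be fitted on the `train` indices and scored on the `test` indices, so that no future observation leaks into training.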
4 Results
Field calibration of an LCAQSN network requires the individual nodes to perform identically to each other; this is the first condition for applying the same calibration function across nodes. This was checked by plotting the CoD against the Pearson coefficient (Fig. 2). All PM10 sensors scored a very high Pearson coefficient and a very low CoD, and thus behaved identically. The gas sensors, and especially the O3 sensors at AGSOF, displayed more diverse behavior, suggesting that generalizing the calibration functions could be more challenging in their case.
The RFFI selected the most relevant features, mostly those derived from the KASTOM nodes’ measurements, but also meteorological factors derived from modeling (Fig. 3). Traffic-related features, on the other hand, were chosen only at the AGSOF location (an urban traffic station). Traffic features also seem to influence NO2 and PM10 more than O3.
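For reference, the two intercomparison metrics can be computed as in the sketch below. It uses the common definition of the Coefficient of Divergence, CoD = sqrt((1/n) Σ ((x_i − y_i)/(x_i + y_i))²), and assumes paired, strictly positive readings; the function names and the sample values are illustrative, not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def coefficient_of_divergence(x, y):
    """CoD: 0 for identical series, approaching 1 for strongly diverging ones."""
    n = len(x)
    return sqrt(sum(((a - b) / (a + b)) ** 2 for a, b in zip(x, y)) / n)


# Hypothetical paired readings from two collocated nodes.
node_a = [10.0, 20.0, 30.0, 40.0]
node_b = [30.0, 60.0, 90.0, 120.0]  # perfectly correlated, but three times larger
r = pearson_r(node_a, node_b)
cod = coefficient_of_divergence(node_a, node_b)
```

The example shows why both metrics are needed: the two series are perfectly correlated (r close to 1) yet far from identical (CoD of 0.5), so a high Pearson coefficient alone would not justify sharing a calibration function.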
While the raw measurements display extremely poor scores against the reference measurements, the computational procedure with XGBoost shows promising results (Table 2). In most cases the use of the Fset leads to better output than the use of the Nset, though by a small margin.
In terms of REU, the calibrated PM10 sensors can be considered “class 1 sensors” in both locations, while the calibrated O3 sensors remain above the desired threshold but have still improved their performance and can be considered “class 2 sensors” (Fig. 4).
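The class assignment amounts to comparing each REU value with the limits quoted in the Methods section. A minimal sketch follows, assuming that the limits are inclusive and that values above the class 2 limit are simply left unclassified; the function name and the label strings are hypothetical.

```python
# Uncertainty limits in percent, as quoted in the Methods section.
REU_LIMITS = {
    "class 1": {"PM10": 50.0, "NO2": 25.0, "O3": 30.0},
    "class 2": {"PM10": 100.0, "NO2": 75.0, "O3": 75.0},
}


def classify_reu(pollutant, reu_percent):
    """Return the strictest class whose limit the REU value meets (assumed inclusive)."""
    for label in ("class 1", "class 2"):
        if reu_percent <= REU_LIMITS[label][pollutant]:
            return label
    return "unclassified"


# Illustrative REU values only, not results from the study.
example = classify_reu("O3", 60.0)
```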
5 Conclusions
The intercomparison of LCAQSN over a short time period shows that PM10 sensors behave similarly at the same location, and that the proposed computational calibration procedure can upgrade their performance to the level of indicative measurements for regulatory purposes; it may therefore be possible to apply the same approach to the rest of the network. For NO2 and O3, although the calibration functions improve the sensors’ response, the desired REU levels could not be reached. In every case data fusion improved the results, so more data sources and additional effort towards better fusion should be considered.
References
EUD. (2008). Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe. Official Journal of the European Union L152.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232. https://doi.org/10.1214/aos/1013203451
Salanova Grau, J. M., Mitsakis, E., Tzenos, P., Stamos, I., & Aifadopoulou, G. (2018). Multisource data framework for road traffic state estimation. Journal of Advanced Transportation, 1–9. https://doi.org/10.1155/2018/9078547
Acknowledgements
This research has been co‐financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH—CREATE—INNOVATE. Project code Τ1ΕDΚ-01697; project name Innovative system for air quality monitoring and forecasting (KASTOM, www.air4me.eu).
Questions and Answers
- QUESTIONER: Zhaoyue Chen
- QUESTION: Thanks. How did you determine the lag length (in hours) when enriching the feature space?
- ANSWER: The lag length was determined through trial-and-error experiments. In addition, previous computational exercises by our group have shown that lags longer than 24 hours are not important for the calibration of low-cost sensor nodes.
- QUESTIONER: Bas Mijling
- QUESTION: The low-cost sensors are calibrated at two different sites. What would happen if the sensor locations were swapped? Is the calibration obtained at site 1 applicable at site 2?
- ANSWER: This is a very interesting question that can be answered thoroughly only with further research. From our knowledge of the field and from ongoing experiments, applying a calibration function from Agias Sofias to Kordelio, and vice versa, yields good results in terms of uncertainty and R2 for PM2.5 and PM10, and acceptable metrics for O3. However, the question of the spatial generalizability of the calibration function cannot be answered with only two reference stations collocated with the low-cost sensors. Data from a third collocated reference station, not included in this study, show more ambiguous behavior. Options for calibrating the whole network of 33 devices therefore include applying functions by proximity or by type of station (urban, suburban, traffic, background, etc.), or applying one generalized calibration function trained on all available locations.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Kassandros, T., Bagkis, E., Karatzas, K. (2022). Data Fusion for the Improvement of Low-Cost Air Quality Sensors. In: Mensink, C., Jorba, O. (eds) Air Pollution Modeling and its Application XXVIII. ITM 2021. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-031-12786-1_24
DOI: https://doi.org/10.1007/978-3-031-12786-1_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12785-4
Online ISBN: 978-3-031-12786-1
eBook Packages: Physics and Astronomy, Physics and Astronomy (R0)