Introduction

Air pollution causes 7 million deaths worldwide every year (WHO (World Health Organization) n.d.). The 2018 Lancet Commission on pollution pointed out that pollution accounts for 16% of all deaths worldwide, with 4.2 million deaths due to ambient air PM2.5 (Landrigan et al. 2018). Clear evidence of the negative health effects of air pollution, such as chronic obstructive pulmonary disease (COPD), lung cancer, and cardiovascular disease, has been widely reported in the literature (Costa et al. 2017; Ierodiakonou et al. 2016; Kim et al. 2017; Nyhan et al. 2013). According to Romero et al. (2019), the transportation sector is one of the main pollutant sources in the Lima Metropolitan Area (LMA); however, emissions must be disaggregated spatially and temporally before transport and environmental regulations can be updated. Air quality station data are needed to evaluate the air quality and health regulations that have been implemented. In Peru, the National Meteorological and Hydrological Service (SENAMHI) operates only a few air quality stations. Therefore, low-cost air quality sensors could represent an important resource to assess air quality at high spatial and temporal resolutions, particularly for the evaluation of vehicle emissions (Velásquez et al. 2020).

Low-cost sensors are characterized by their small size and weight, relatively low power requirements, short response time, and real-time (or near real-time) network adaptability (Snyder et al. 2013; Sheats et al. 2010). These sensors offer several benefits: they are reported to deliver highly precise real-time data and to be easy to use, with fast sampling (Idrees and Zheng 2020). In addition, a large number of sensors can be deployed for pollution mapping (Li et al. 2020), and, with a suitable correlation methodology, their measurements can be calibrated simultaneously against a reference meteorological station, especially for particulate matter concentrations (Romero et al. 2020). Despite these benefits, challenges remain regarding their usage because of the sensor data quality (Snyder et al. 2013). Low-cost air quality sensors need calibration to validate the reliability of their data, and such calibrations are commonly performed through simple and multiple linear regressions. Simple linear regression models can be used to correct slopes and offsets, whereas multiple linear regression models, which can include other pollutant measurements and meteorological parameters, show a better performance (Munir et al. 2019). In the multiple linear regression model (MLRM), the meteorological parameters usually include temperature and relative humidity (Badura et al. 2018; Munir et al. 2019; Snyder et al. 2013) because they affect the functioning of the electronics (Badura et al. 2018). Therefore, by including these parameters in the MLRM, a higher coefficient of determination and a lower root mean square error (RMSE) can be obtained (Panem et al. 2020). For each sensor, a calibration function is specified assuming that the sensor responses are linear with the reference measurement for each contaminant. Ordinary linear regression is performed by minimizing the squared residuals of the sensor responses versus the reference measurements. The calibration functions are of the type Rs = a · X + b, where Rs represents the sensor responses and X is the corresponding reference measurement of the air pollutant.
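As a minimal sketch of a calibration function of the type Rs = a · X + b, the ordinary least-squares slope and offset can be computed in closed form. The function names (`fit_calibration`, `correct`, `rmse`) and the sample values are hypothetical illustrations and are not data or code from this study.

```python
from math import sqrt

def fit_calibration(reference, sensor):
    """Least-squares fit of sensor = a * reference + b; returns (a, b)."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_r = sum(sensor) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(reference, sensor))
    a = sxr / sxx
    b = mean_r - a * mean_x
    return a, b

def correct(sensor_value, a, b):
    """Invert Rs = a * X + b to estimate the reference-equivalent value X."""
    return (sensor_value - b) / a

def rmse(estimates, reference):
    """Root mean square error between estimates and reference values."""
    n = len(reference)
    return sqrt(sum((e - x) ** 2 for e, x in zip(estimates, reference)) / n)

# Hypothetical, noise-free values (ug/m3) built so that sensor = 1.5 * reference + 2
reference = [10.0, 20.0, 30.0, 40.0]
sensor = [17.0, 32.0, 47.0, 62.0]

a, b = fit_calibration(reference, sensor)
corrected = [correct(r, a, b) for r in sensor]
error = rmse(corrected, reference)
```

With real co-located data the residual error would not vanish, and the RMSE of the corrected values gives one simple measure of how well the linear assumption holds.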

Multiple regression has been applied successfully in several calibration processes, for example, in a reduced-space Gaussian approach (Arias Vel et al. 2020) and in the interpolation of nitrogen dioxide concentrations for temporal-spatial analysis (Weissert et al. 2020), and it has also been compared with multilinear analysis and artificial neural network approaches for low-cost sensor calibration. Regression-based calibration models are therefore capable of modeling very complex functions, which makes them suitable for the calibration of a group of sensors. In the study conducted by Spinelle et al. (2015), two types of artificial neural network (ANN) architecture were considered: radial basis functions and the multilayer perceptron (MLP); ANNs have also been recommended for diagnosis analysis together with linear regression analysis (Arias Velásquez and Mejia Lara 2018). The MLP showed a better performance and is the network architecture most commonly used today; it originates from the work of Rumelhart and McClelland (Rumelhart 1986). This architecture consists of artificial units that receive a number of inputs and typically includes a hidden layer of hidden units. Each unit applies an activation function to its inputs to produce its output, and the network is trained with an iterative technique called “back propagation.”
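To make the MLP description concrete, the following is a minimal sketch of the forward pass of a network with one hidden layer. The `tanh` activation, the function name `mlp_forward`, and all weights and inputs are hypothetical choices for illustration; they are not taken from Spinelle et al. (2015), and the back-propagation training step is omitted.

```python
from math import tanh

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer perceptron with a single linear output."""
    # Each hidden unit forms a weighted sum of the inputs and applies the activation.
    hidden = [
        tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output unit combines the hidden activations linearly.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical scaled inputs: sensor PM2.5, temperature, relative humidity
inputs = [0.4, 0.2, 0.9]
hidden_weights = [[0.5, -0.3, 0.1], [0.2, 0.7, -0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.8]
output_bias = 0.05

prediction = mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias)
```

Training would adjust the weights and biases iteratively to reduce the error between `prediction` and the reference measurement.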

In this research, we developed a multiple linear regression to analyze the performance of a particular technology called PurpleAir sensors (PurpleAir 2019). We considered low-cost sensors and their calibration because of government requirements placed on several universities, and research is focusing on developing procedures that allow remote sensor calibrations (Weissert et al. 2020). Therefore, we have progressively introduced this technology to the case scenario of Lima, Peru, in association with the main avenues and meteorological stations, for traffic assessment (Arias Vel et al. 2020). In this research, we propose the development of more advanced techniques that take into account the existing reference monitors, meteorological parameters, and other pollutants to ensure the accuracy of the PurpleAir sensor data relative to reference measurements and to evaluate the sensor performance for measuring PM2.5 and PM10 (Table 1).

Table 1 Statistical metrics summary for air quality reference and available measurements of PM2.5 and PM10 traffic-related pollutants. PM10a and PM2.5a come from the “A” channel, and PM10b and PM2.5b from the “B” channel of the PurpleAir sensors. Concentration data are in μg/m3

Material and methods

Characterization of the study site

To assess the performance of low-cost sensor technology via comparisons with a reference monitor, a PurpleAir sensor was installed next to the Campo de Marte Air Quality Station (AQSMarte), which is currently under the supervision of SENAMHI. The AQSMarte is located in the middle of Lima city, in the “Jesus María” district, one of the city's 43 districts, which has a population of 75,359 (INEI 2018), industrial facilities (mainly plastics and electric furnaces), and several avenues that cross Lima city. As shown in Fig. 1, this location allows the evaluation of the main traffic effects associated with the “Paseo de la República,” “Salaverry,” and “Brasil” avenues, the roads with the greatest traffic problems in the city of Lima, together with a calibrated online meteorological station (Arias Vel et al. 2020).

Fig. 1 a Reference air quality station, b low-cost air quality sensor installation (at approximately 2 to 3 m above the ground), and c location of the reference and PurpleAir sensor

The evaluation was carried out over a period of 3 weeks, from October 3rd to October 24th, 2019. This study was developed by only one institution; however, a follow-up step in collaboration with the Ministry of Environment of Peru will include various private companies and academic institutions to assess the different low-cost air quality sensor technologies currently on the Peruvian market.

Technical specifications of the reference monitors

Ambient PM2.5 and PM10 mass concentrations for AQSMarte were measured with the leading Automated Measurement System (AMS), GRIMM EDM 180 technology. This system offers outstanding features, such as simultaneous PM measurements in 31 particle size channels, 0.1 μg/m3 resolution, dust mass detection of 0–100 μg/m3, and an isothermal inlet with an integrated Nafion dryer. This sensor can measure values in a range of 0.25 to 32 μm with regard to particle size and has selectable storage intervals between 6 s and 60 min. This sensor works in a temperature range of − 20 °C to 50 °C and consumes approximately 160 W of power with a maximum current of 1.4 A. The measurement principle of this technology is based on light scattering of single particles, detection volume with an aerodynamic focus, and no border zone error.

In addition, a meteorological station operating next to the reference monitor station provided temperature, relative humidity, wind speed, and wind direction data.

Technical specifications of the low-cost sensor—PurpleAir Technology

In this study, a PurpleAir sensor (Fig. 2) was chosen. These sensors are low-cost, small, and portable. Their portability allows for the development of a dense air quality sensor network, and the sensor provides internal data storage to record data with or without a WiFi connection and a free platform to visualize real-time data for all parameters (particulate matter, temperature, and humidity), with high consistency as a function of temperature. PurpleAir is a local community organization that operates an air quality monitoring network based on PMS sensors from PlanTower. Currently, it provides one indoor sensor, i.e., PurpleAir PA-I-Indoor (PMS 1003), and two sensors, i.e., PurpleAir PA-II and PA-II-SD (PMS 5003), for outdoor and/or indoor evaluations. The last two sensor models combine the PMS 5003 (particle counter) and BME 280 (temperature and humidity) sensors to measure PM1, PM2.5, PM10, temperature, and humidity in near real time. Of the two, only the PurpleAir PA-II-SD model is capable of storing data on an SD card, with or without a WiFi connection (PurpleAir PA-II-SD n.d.). The PMS 5003 is a digital particle concentration sensor based on the laser scattering principle: it measures 90° light scattering with a photodiode detector that converts the scattered light into a voltage pulse. The sensor counts the number of particles of a given diameter (0.3 to 10 μm) in the air by counting the pulses from the scattering signal and outputs the counts through a digital interface (PurpleAir PA-II-SD n.d.).

Fig. 2 PurpleAir PA-II-SD sensor and an overview of the PMS5003 working principle

The measurement relies on dynamic light scattering: the PMS 5003 sensor uses laser radiation at a wavelength of 680 ± 30 nm as the light source for scattering intensity measurements, and the radiation interacts with the different particles suspended in the air. The scattered light is collected at a certain angle in a light scattering measuring cavity to obtain the curve of the scattered light intensity as a function of time. A microprocessor calculates the equivalent particle diameter and the number of particles with different diameters per unit volume based on Mie theory, and the results have a maximum error per module of 10% over a temperature range of 10 to 40 °C, as shown in Fig. 8, according to the manual of the PMS5003, a laser counter manufactured and calibrated by PlanTower (2016). According to the manufacturer, the PMS5003 response time is less than 10 s, from which we can infer that the sensor has some limitations in rapidly changing environments. The stated concentration uncertainties apply over the range of 100 to 500 μg/m3 for PM2.5 and the range of 0 to 100 μg/m3 for PM10 (PlanTower 2016).

The BME 280 is a high-performance digital pressure, temperature, and humidity sensor whose small size and low power consumption are key design advantages for mobile applications. The sensor excels in its fast response time for temperature and humidity, which is advantageous for fast context-awareness applications, and in its high overall accuracy over a wide temperature and humidity range.

The PurpleAir sensor has two “channels,” channel A and channel B, which are identical laser counters that measure data. As shown on the PurpleAir web page (PurpleAir PA-II-SD n.d.), both channels contain primary and secondary type data. Marked differences between the two channels may indicate that an error is occurring in one of them. Among the different “fields” of data, “ATM” or “Atmospheric” type data are intended for outdoor applications, whereas CF = 1 data are meant for indoor or controlled-environment applications.
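The idea that a disagreement between the two channels can reveal a faulty counter can be sketched as a simple consistency check. The function name `channels_disagree` and both tolerances (5 μg/m3 absolute and 25% relative) are hypothetical assumptions chosen for illustration; they are not thresholds published by PurpleAir or used in this study.

```python
def channels_disagree(pm_a, pm_b, abs_tol=5.0, rel_tol=0.25):
    """Return True when the A and B readings differ by more than both tolerances."""
    diff = abs(pm_a - pm_b)
    mean = (pm_a + pm_b) / 2
    # Requiring both conditions avoids flagging small differences at low concentrations.
    return diff > abs_tol and diff > rel_tol * mean

# Hypothetical paired readings (channel A, channel B) in ug/m3
readings = [(12.0, 13.0), (30.0, 31.5), (25.0, 60.0)]
flags = [channels_disagree(a, b) for a, b in readings]
```

Flagged records could then be inspected or excluded before any comparison with the reference monitor.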

In Peru, the government entrusts the monitoring of air quality to SENAMHI, which continuously monitors air quality and was the main reference source for testing the accuracy of the mobile sensor units deployed for this study.

Model establishment and validation

A multiple regression model was developed to analyze the relationship between the sensor data and the reference measurements (PM2.5 and PM10) and meteorological parameters, such as temperature and relative humidity. Two regression models, model 1 and model 2, were developed (see Eq. 1), and they are summarized in Table 2.

$$ {Y}_i={\beta}_{1i}\times {X}_i+{\beta}_{2i}\times \mathrm{tem}+{\beta}_{3i}\times \mathrm{hum}+{\beta}_{0i} \tag{1} $$

Table 2 Parameters from the multiple regression Model 1 (PM2.5) and Model 2 (PM10) developed for the study

In Eq. 1, Yi is the reference PM measurement for particles with a diameter of up to i μm (in μg/m3); Xi is the corresponding measurement from Channel “A” of the PurpleAir sensor (in μg/m3); tem is the temperature; and hum is the relative humidity recorded at AQSMarte.
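A minimal sketch of how the coefficients of a model with the form of Eq. 1 can be estimated by least squares is given below, using only the Python standard library. The helper names (`solve`, `fit_mlr`), the synthetic noise-free data, and the coefficients used to generate them are hypothetical; they are not the values in Table 2, and the sketch assumes that the predictors are not collinear.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_mlr(sensor, tem, hum, reference):
    """Return [beta1, beta2, beta3, beta0] for Y = beta1*X + beta2*tem + beta3*hum + beta0."""
    rows = [[x, t, h, 1.0] for x, t, h in zip(sensor, tem, hum)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, reference)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic, noise-free example generated from assumed coefficients
sensor = [20.0, 20.0, 20.0, 20.0, 40.0, 40.0, 40.0, 40.0]  # ug/m3
tem = [15.0, 15.0, 20.0, 20.0, 15.0, 15.0, 20.0, 20.0]     # degrees C
hum = [70.0, 90.0, 70.0, 90.0, 70.0, 90.0, 70.0, 90.0]     # percent
reference = [0.5 * x + 2.0 * t - 0.25 * h + 3.0 for x, t, h in zip(sensor, tem, hum)]

beta1, beta2, beta3, beta0 = fit_mlr(sensor, tem, hum, reference)
```

In practice, a statistics package would also report standard errors and p-values for each coefficient, which this sketch does not compute.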

According to the results shown in Table 2, the p-values of Model 1 and Model 2 are both < 2.2e−16, which is highly significant and means that at least one of the predictor variables (Xi, tem, and hum) is significantly related to the outcome variable (Yi). For both models, the Xi and tem predictor variables are more significant than hum for the Yi variable.
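For context, the overall significance of a multiple regression is commonly assessed with an F statistic. The standard textbook form is given here as an assumption about how such a p-value is typically obtained; the source does not state the test that was used. Here R2 is the coefficient of determination, n is the number of observations, and p = 3 is the number of predictors (Xi, tem, and hum):

$$ F=\frac{R^2/p}{\left(1-{R}^2\right)/\left(n-p-1\right)} $$

Under the null hypothesis that all slope coefficients are zero, F follows an F distribution with p and n − p − 1 degrees of freedom, and a very small p-value indicates that at least one predictor is related to Yi. The value 2.2e−16 is close to double-precision machine epsilon, which some statistical packages print as a lower bound for very small p-values.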

Results and discussion

Results obtained by reference monitors

Figure 3 shows the hourly and daily average PM2.5 and PM10 concentrations during the study period (October 3rd to October 24th). The daily average concentration of PM2.5 during this period was 24.5 μg/m3 and ranged from 14.4 to 36.4 μg/m3. According to the environmental legislation, the air quality regulations for PM2.5 were not violated during this period because the daily average did not exceed 50 μg/m3 on any day (MINAM 2017); this is consistent with recent research in Peru based on meteorological stations (Arias Vel et al. 2020) and with a temporal and spatial analysis of traffic over the last 5 years (Romero et al. 2020). Moreover, PM10 showed a daily average concentration of 34.6 μg/m3 and a range of 19.7–48.7 μg/m3 during our 3-week study period, so the maximum daily concentration did not exceed the regulatory limit of 100 μg/m3.
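The comparison of daily averages with the regulatory limits can be sketched as follows. The function names (`daily_means`, `exceedance_days`) and the hourly values are hypothetical; the limits are the 50 μg/m3 and 100 μg/m3 values quoted in the text. The sample data deliberately include one exceedance to show how a day would be flagged, whereas no exceedance occurred during the study period.

```python
from collections import defaultdict

PM25_LIMIT = 50.0   # ug/m3, daily limit quoted in the text (MINAM 2017)
PM10_LIMIT = 100.0  # ug/m3, daily limit quoted in the text

def daily_means(hourly):
    """Aggregate (date, concentration) pairs into a {date: daily mean} mapping."""
    buckets = defaultdict(list)
    for day, value in hourly:
        buckets[day].append(value)
    return {day: sum(values) / len(values) for day, values in buckets.items()}

def exceedance_days(daily, limit):
    """Return the sorted dates whose daily mean is above the limit."""
    return sorted(day for day, mean in daily.items() if mean > limit)

# Hypothetical hourly PM2.5 values (ug/m3); a real day would have up to 24 records
hourly_pm25 = [
    ("2019-10-03", 20.0), ("2019-10-03", 30.0),
    ("2019-10-04", 55.0), ("2019-10-04", 65.0),
]

daily = daily_means(hourly_pm25)
violations = exceedance_days(daily, PM25_LIMIT)
```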

Fig. 3 Temporal distribution of the hourly and daily average PM2.5 and PM10 concentrations for the reference monitor during the study period. In a and b, the left column refers to hourly averages and the right column refers to daily averages

During our study period, we measured relatively low temperatures between 13.7 and 21.1 °C (average 16 °C), high relative humidity between 64 and 97% (average 86%), wind directions between 3 and 357.0° (average 216.7°), and wind speeds ranging from 0 to 5.5 m/s (average 2.4 m/s). The variations in the meteorological parameters (temperature, relative humidity, wind speed, and wind direction) from the reference monitors are shown in Fig. 4.

Fig. 4 Temporal distribution of the hourly average of temperature, relative humidity, wind speed, and wind direction variations during the study period for the meteorological monitoring sensor

Results obtained by PurpleAir sensors

The complete data provided by the PurpleAir sensor must be considered. As briefly mentioned in the “Technical specifications of the low-cost sensor—PurpleAir Technology” section, the two channels “A” and “B” correspond to identical laser counters, so their readings must be comparable. In this study, the ATM data from Primary Channels “A” and “B” were chosen for analysis because the sensor was operated outdoors. PM2.5a and PM10a correspond to Channel “A,” and PM2.5b and PM10b correspond to Channel “B.”

This performance evaluation was carried out using data observed by the sensor platforms and data obtained from the reference monitor, which were statistically analyzed and compared.

PMs—particle matter with diameters less than 2.5 μm and 10 μm

Figure 5 shows the trends of the air quality data from the PurpleAir sensor versus the reference measurements. The results show positive correlations between the PM2.5 and PM10 data (for both channels of the PurpleAir sensor) and the reference measurements; the “A” channel presents the highest correlations, with values of 0.8 and 0.4 for PM2.5a and PM10a, respectively. Therefore, we recommend using the “A” channel for outdoor PM2.5 and PM10 studies.
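A correlation between sensor and reference series can be computed as sketched below. This assumes that the reported values are Pearson correlation coefficients, which the text does not state explicitly; the function name `pearson_r` and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired hourly values (ug/m3)
sensor_pm25 = [1.0, 2.0, 3.0, 4.0]
reference_pm25 = [1.0, 3.0, 2.0, 4.0]

r = pearson_r(sensor_pm25, reference_pm25)
r_squared = r ** 2
```

Reporting whether a value is r or its square matters when comparing studies, since the same data give different numbers for each.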

Fig. 5 Sensor air quality data compared with reference measurements. PM2.5a and PM10a are from the “A” channel, and PM2.5b and PM10b are from the “B” channel

Based on the multiple regression model developed in the “Model establishment and validation” section and the summary of the results in Table 2, the following figures can be analyzed. Figures 6 and 7 show the multiple regression model diagnostics. In both cases, the developed models perform very well according to the normal quantile–quantile (Q-Q) analysis, which compares the distribution of the data with the expected normal distribution. Based on the standardized residuals analysis, a linear regression approach is appropriate for the PM2.5 and PM10 data.

Fig. 6 Diagnostic of the regression model developed for PM2.5a sensor measurements (Model 1)

Fig. 7 Diagnostic of the regression model developed for PM10a sensor measurements (Model 2)

The residuals versus fitted graphs show that, for Model 1 and Model 2, the differences between the observed values of the dependent variable (Yi) and the values predicted from the predictor variables (Xi, tem, and hum) were small (residuals near 0), which suggests that there are no outliers in the linear predictor and residual analysis. The normal Q-Q graphs show that the residuals are approximately normally distributed because they follow a straight line for Model 1 and Model 2. In both cases, the result is that all the predictor variables have a considerable influence on the developed regression models (Figure 8).
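The quantities behind these diagnostic plots can be sketched with the standard library. The function names and sample values are hypothetical. The standardization used here is a plain z-score of the residuals, a simplification of the leverage-adjusted standardized residuals that regression packages usually plot, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, pstdev

def standardized_residuals(observed, fitted):
    """Residuals (observed - fitted) rescaled to zero mean and unit spread."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    center = mean(residuals)
    spread = pstdev(residuals)
    return [(r - center) / spread for r in residuals]

def qq_points(values):
    """Pair standard-normal theoretical quantiles with the sorted sample values."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))

# Hypothetical reference values and model predictions (ug/m3)
observed = [10.0, 12.0, 15.0, 11.0]
fitted = [9.0, 13.0, 14.0, 12.0]

z = standardized_residuals(observed, fitted)
points = qq_points(z)
```

If the points lie close to a straight line when plotted, the residuals are consistent with a normal distribution.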

Fig. 8 PMS5003 laser counters manufactured and calibrated by PlanTower (PlanTower 2016)

Conclusions

Developing low-cost sensors represents a potential alternative that can complement reference air quality monitor stations worldwide because of their low cost and minimal maintenance requirements during operation. This study shows that different low-cost technologies have already been introduced for informative and research purposes in Lima, Peru; however, previous studies have rarely compared their performance with that of reference monitors, such as those from SENAMHI, which manages the reference air quality monitoring stations. It is important to point out that air quality data must be collected at high spatial and temporal resolution to develop better assessments and provide information to the private and public sectors because of the effects of air quality on health.

The results show that a high positive correlation occurs between sensor data and reference measurements for PM2.5 and PM10, especially for PM2.5 (r2 = 0.8) for the “A” and “B” channels. Two multiple regression models (Model 1 and Model 2) were developed by taking into account the sensor data (PM2.5 and PM10 concentration) and meteorological parameters, such as temperature and relative humidity. In both cases, a good performance was identified for the models, and temperature was shown to be a more influential meteorological parameter compared with relative humidity. The PurpleAir sensor shows a good performance for the measurement of PM2.5 and PM10 but not for temperature and relative humidity.

Additional studies must be performed to ensure the accuracy of sensor data relative to reference measurements. A follow-up study will compare the performance of the multiple regression model with that of machine learning techniques.