Introduction

Air pollution is the introduction into the atmosphere of chemicals, particulates, or biological materials that cause discomfort, disease, or death to humans, damage other living organisms such as food crops, or damage the natural environment or built environment (García Nieto 2001; García Nieto 2006; Lutgens and Tarbuck 2012). The atmosphere is a complex dynamic natural gaseous system that is essential to support life on planet Earth. For instance, the stratospheric ozone depletion due to air pollution has long been recognized as a threat to human health as well as to the Earth’s ecosystems. Additionally, the urban air quality is listed as one of the world’s worst toxic pollution problems in the 2008 Blacksmith Institute World’s Worst Polluted Places report (Phalen 2011; Domike and Zacaroli 2013). Indeed, air pollution is one of the important environmental problems in metropolitan and industrial cities (García Nieto 2001; García Nieto 2006; Seinfeld and Pandis 2006; Lutgens and Tarbuck 2012) such as Oviedo (Principality of Asturias, Spain). The World Health Organization states that 2.4 million people die each year from causes directly attributable to air pollution, with 1.5 million of these deaths attributable to indoor air pollution. The health effects caused by air pollution may include difficulty in breathing, wheezing, coughing, and aggravation of existing respiratory and cardiac conditions. These effects can result in increased medication use, increased doctor or emergency room visits, more hospital admissions, and premature death (Wark et al. 1997; Wang et al. 2004; Lutgens and Tarbuck 2012). The human health effects of poor air quality are far reaching but principally affect the body’s respiratory system and the cardiovascular system (Anderson 2009; García Nieto 2001). Individual reactions to air pollutants depend on the type of pollutant a person is exposed to, the degree of exposure, the individual’s health status, and genetics (Anderson 2009). 
A new economic study of the health impacts and associated costs of air pollution in the Los Angeles Basin and San Joaquin Valley of Southern California shows that more than 3800 people die prematurely (approximately 14 years earlier than normal) each year because air pollution levels violate federal standards (Lutgens and Tarbuck 2012). The number of annual premature deaths is considerably higher than the fatalities related to auto collisions in the same area, which average fewer than 2000 per year (Anderson 2009; Brimblecombe 2011). Diesel exhaust (DE) is a major contributor to combustion-derived particulate matter air pollution (Lucking et al. 2008). In several human experimental studies (Törnqvist et al. 2007), using a well-validated exposure chamber setup, DE has been linked to acute vascular dysfunction and increased thrombus formation. This serves as a plausible mechanistic link between the previously described association between particulate matter air pollution and increased cardiovascular morbidity and mortality (García Nieto 2001; Karaca et al. 2005; García Nieto 2006; Lutgens and Tarbuck 2012).

Oviedo is the capital city of the Principality of Asturias in northern Spain. It is also the name of the municipality that contains the city. Oviedo, which is the administrative and commercial center of the region, also hosts the annual Prince of Asturias Awards. This prestigious event, held in the city’s Campoamor Theatre, recognizes international achievement in eight categories. Oviedo University’s international campus attracts many foreign scholars from all over the globe. The city of Oviedo has a population of 225,973 inhabitants. It covers a land area of 186.65 km2, lies at an altitude of 232 m above sea level, and has a density of 1210.68 inhabitants/km2. The climate of Oviedo, as with the rest of northwest Spain, is more varied than that of southern parts of Spain. Summers are generally humid and warm, with considerable sunshine, but also some rain. Winters are cold and very rainy, with some very cold snaps. The cold is especially felt in the mountains surrounding the city of Oviedo, where snow is present from October till May. Both rain and snow are regular weather features of Oviedo’s winters. The economy of Oviedo is based on coal (e.g., Hunosa Ltd.), limestone and dolomite quarries located on Mount Naranco and in the Olloniego area, livestock rearing, a strong tertiary sector, etc. (Karaca et al. 2005; Lutgens and Tarbuck 2012; García Nieto et al. 2013). In addition, the Soto de Ribera coal-fired power plant, which provides most of the electrical energy used in the city, is located 7 km south of Oviedo in the district of Ribera de Arriba, at an altitude of 126.50 m above sea level (see Fig. 1a, b). Figure 1a shows the geographical location of the three air quality monitoring stations and of the power plant.

Fig. 1

a The geographical location of the air quality stations (yellow squares) in the city of Oviedo (northern Spain) and Soto de Ribera’s coal-fired power plant (red square) and b road map

To fix ideas, there are many air pollution indicators affecting human health (Comrie and Diem 1999; Elbir et al. 2000; Akkoyunku and Ertürk 2003; Godish 2004; Suárez Sánchez et al. 2011). The automatic measurements of meteorological pollution, such as CO, NO, NO2, SO2, O3, and particulate matter (PM10), are more and more important due to their harmful effects on human health (Wark et al. 1997; García Nieto 2001; Wang et al. 2004; García Nieto 2006; Lutgens and Tarbuck 2012). EU and many national environmental agencies have set standards and air quality guidelines for allowable levels of these pollutants in the air (Cooper and Alley 2002; Wang et al. 2004; Suárez Sánchez et al. 2011). The aim of this research work is to construct a model for the averaged next-month pollution that would be applicable for use by the authority responsible for air pollution regulation in the appropriate region of the country. The use of the artificial neural networks of multilayer perceptron (MLP) type as the model of pollution was exploited frequently in the last years (Boznar et al. 1993; Haykin 1999; Hooyberghs et al. 2005; Suárez Sánchez et al. 2011). In this way, an artificial neural network (ANN), usually called neural network (NN), is a mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation (Kukkonen et al. 2003). In most cases, an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are nonlinear statistical data modeling tools (Gardner and Dorling 1999; Chaloulakou et al. 2003; Karaca et al. 2006). They are usually used to model complex relationships between inputs and outputs or to find patterns in data. 
A MLP is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate output. A MLP consists of multiple layers of nodes in a directed graph, which is fully connected from one layer to the next. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training the network. MLP is a modification of the standard linear perceptron, which can distinguish data that is not linearly separable (Haykin 1999; Bishop 2006; Suárez Sánchez et al. 2011).

In this innovative research work, a model based on the multivariate adaptive regression splines (MARS) is proposed (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; Vapnik 1999; Hastie et al. 2003; Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009) for the study of considered pollutants: CO, NO, NO2, SO2, O3, and particulate matter (PM10). The data taking part in learning and testing have been collected within 3 years: from 2006 to 2008. The results of numerical experiments based on the application of the MARS technique have confirmed good accuracy of daily modeling for all considered pollutants. These detailed results will be presented and discussed in this paper.

Indeed, the MARS technique is a form of regression analysis introduced by Jerome Friedman in 1991 (Friedman 1991). The MARS technique is a nonparametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions. This technique (García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012) has been applied greatly in recent years to many fields of science and engineering with success. For instance, the MARS technique is used in a variety of fields, including biomedicine and bioinformatics and other engineering fields (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; Vapnik 1999; Hastie et al. 2003; Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012). MARS models are more flexible than linear regression models and they are simple to understand and interpret. The MARS technique can handle both numeric and categorical data, and it tends to be better than recursive partitioning for numeric data because hinges are more appropriate for numeric variables than the piecewise constant segmentation used by recursive partitioning (Vapnik 1999; Bishop 2006). To fix ideas, building MARS models often requires little or no data preparation. The hinge functions automatically partition the input data, so the effect of outliers is contained. In this respect, the MARS technique is similar to recursive partitioning which also partitions the data into disjoint regions, although using a different method. MARS models tend to have a good bias variance trade-off, and they are flexible enough to model nonlinearity and variable interactions (de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012).

In this research work, the MARS technique (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995) was used as an automated learning tool to build three MARS models for nitrogen dioxide (NO2), sulfur dioxide (SO2), and aerosol particles less than 10 μm (PM10) as a function of the other relevant measured air quality pollutants: nitric oxide (NO), carbon monoxide (CO), and ozone (O3). The aim was to make accurate concentration estimates of the three abovementioned pollutants (NO2, SO2, and PM10) (Schnelle and Brown 2001; Colbeck 2008; Hewitt and Jackson 2009; Suárez Sánchez et al. 2011). MARS models were used as an alternative to the traditional regression approaches. The three MARS models were found to be better suited to nonlinear regression problems such as those associated with air quality and studied in this research work.

This study is structured as follows: firstly, the materials and methods used to carry out this study are described; next, the obtained results are presented and discussed; and finally, the main conclusions drawn from the results are described.

Materials and methods

Sources and types of air pollution

A substance in the air that can be harmful to humans and the environment is known as an air pollutant. Pollutants can be in the form of solid particles, liquid droplets, or gases. In addition, they may be natural or man made. Pollutants can be classified as primary or secondary. Usually, primary pollutants are directly emitted from a process, such as ash from a volcanic eruption, the carbon monoxide gas from a motor vehicle exhaust, or sulfur dioxide released from factories. Secondary pollutants are not emitted directly. Rather, they form in the air when primary pollutants react or interact. An important example of a secondary pollutant is ground-level ozone, one of the many secondary pollutants that make up photochemical smog (Wark et al. 1997; Schnelle and Brown 2001; Monteiro et al. 2005; García Nieto et al. 2013). Some pollutants may be both primary and secondary: that is, they are both emitted directly and formed from other primary pollutants.

Major primary pollutants produced by human activity include the following (Wark et al. 1997; Friedlander 2000; García Nieto 2001; Cooper and Alley 2002; Wang et al. 2004; Karaca et al. 2005; García Nieto 2006; Seinfeld and Pandis 2006; Vincent 2007; Colbeck 2008; Hewitt and Jackson 2009; Suárez Sánchez et al. 2011; Lutgens and Tarbuck 2012; García Nieto et al. 2013):

  • Particulate matter (PM10), alternatively referred to as atmospheric particulate matter or fine particles, are tiny particles of solid or liquid suspended in a gas. PM10 are particles with a diameter of 10 μm or less. Sources of particulates can be man made or natural. Increased levels of fine particles in the air are linked to health hazards such as heart disease, altered lung function, and lung cancer.

  • Sulfur oxides (SOx), especially sulfur dioxide, a chemical compound with the formula SO2. Since coal and petroleum often contain sulfur compounds, their combustion generates sulfur dioxide. Further oxidation of SO2, usually in the presence of a catalyst such as NO2, forms H2SO4 and thus acid rain (Wark et al. 1997; Wang et al. 2004; Lutgens and Tarbuck 2012).

  • Nitrogen oxides (NOx), especially nitrogen dioxide, are emitted from high-temperature combustion and are also produced naturally during thunderstorms by electric discharge. Nitrogen dioxide is a chemical compound with the formula NO2. This reddish-brown toxic gas has a characteristic sharp, biting odor. The initial product formed is nitric oxide (NO). When NO oxidizes further in the atmosphere, nitrogen dioxide (NO2) forms. Commonly, the general term NOx is used to describe these gases.

  • Carbon monoxide (CO) is a colorless, odorless, nonirritating but very poisonous gas. It is a product of the incomplete combustion of fuels such as natural gas, coal, or wood. Vehicular exhaust is a major source of carbon monoxide.

  • Volatile organic compounds (VOCs) are an important outdoor air pollutant. In this field, they are often divided into the separate categories of methane (CH4) and nonmethane (NMVOCs). Specifically, this pollutant is not considered in this study.

Secondary pollutants considered here are as follows (Wark et al. 1997; Anderson et al. 2001; Cooper and Alley 2002; Godish 2004; Wang et al. 2004; Seinfeld and Pandis 2006; Vincent 2007; Weinhold 2008; Jerrett et al. 2009; Lutgens and Tarbuck 2012; García Nieto et al. 2013):

  • Particulate matter formed from gaseous primary pollutants and compounds in photochemical smog. Smog is a kind of air pollution. The word smog is a portmanteau of smoke and fog.

  • Ground-level ozone (O3), which is formed from NOx and VOCs. Ozone (O3) is a key constituent of the troposphere (it is also an important constituent of certain regions of the stratosphere commonly known as the ozone layer). Photochemical and chemical reactions involving it drive many of the chemical processes that occur in the atmosphere by day and by night. The negative effects of ozone are well documented. Short-term exposure to elevated levels of ozone causes eye and lung irritations (Jerrett et al. 2009).

With respect to the trends in air quality, the Clean Air Act of 1970 mandated the setting of standards for four of the primary pollutants (aerosols, sulfur dioxide, carbon monoxide, and nitrogen oxides) as well as the secondary pollutant ozone. At the time, these five pollutants were recognized as being the most widespread and objectionable. Today, with the addition of lead, they are known as the criteria pollutants and are covered by the US National Ambient Air Quality Standards (see Table 1) (Godish 2004; Lutgens and Tarbuck 2012; García Nieto et al. 2013). The primary standard for each pollutant shown in Table 1 is based on the highest level that can be tolerated by humans without noticeable ill effects, minus a 10–50 % margin for safety reasons.

Table 1 National Ambient Air Quality Standards by US Environmental Protection Agency (USEPA) (Godish 2004; Lutgens and Tarbuck 2012; García Nieto et al. 2013)

Experimental data set

The Section of Industry and Energy of the government of Asturias operates three automatic air quality monitoring stations distributed throughout the city of Oviedo (see Fig. 1). A measuring station is a sampling point, regardless of the number of parameters monitored and the analysis techniques applied. It consists of a group of systems and procedures for evaluating and assessing the presence of pollution agents in the atmosphere. Every 15 min, these three stations measure the following primary and secondary pollutants: sulfur dioxide (SO2), nitrogen oxides (NO and NO2), carbon monoxide (CO), particulate matter less than 10 μm (PM10), and ozone (O3). The data are collected, processed, and delivered every day as an average for the entire city. Therefore, we have daily data for the pollutants listed above from January 2006 to December 2008. Additionally, this data set is averaged monthly and shown in Table 2.
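The aggregation chain described above (15-min readings, then daily values, then monthly values) can be illustrated with a minimal sketch. The function name `monthly_means` and the choice to average the daily means within each month are assumptions made for illustration; the exact processing applied by the monitoring network is not specified here.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_means(readings):
    """Aggregate (timestamp, concentration) pairs into monthly means.

    Assumption: each daily value is the mean of that day's readings, and
    each monthly value is the mean of the daily values in that month.
    """
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp.date()].append(value)

    by_month = defaultdict(list)
    for day, values in by_day.items():
        by_month[(day.year, day.month)].append(mean(values))

    return {month: mean(daily) for month, daily in sorted(by_month.items())}


if __name__ == "__main__":
    # Hypothetical readings, not measured data.
    sample = [
        (datetime(2006, 1, 1, 0, 0), 10.0),
        (datetime(2006, 1, 1, 0, 15), 20.0),
        (datetime(2006, 1, 2, 0, 0), 30.0),
        (datetime(2006, 2, 1, 0, 0), 40.0),
    ]
    print(monthly_means(sample))
```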

Table 2 Monthly average air pollution concentrations in the Oviedo urban area from January 2006 to December 2008

The automatic analyzers of the monitoring stations use the physical and chemical properties of the gaseous pollutants and particulate matter to determine their concentrations. The methods currently used by automatic analyzers of the above gaseous pollutants and particulate matter are as follows (Seinfeld and Pandis 2006; Singal 2012):

  • The sulfur dioxide analyzers use the principle of the pulsing fluorescence based on the fact that the molecules of SO2 absorb ultraviolet radiation (UV) at a wavelength in the range of 210–410 nm, entering in an instantaneous state of excitation. Subsequently, this gas decays to a lower energy state, emitting a pulse of fluorescent light of a greater wavelength in the range from 240 to 410 nm. The intensity of the emitted fluorescent light is proportional to the concentration of SO2.

  • Chemiluminescence is an analytical technique based on the measurement of the amount of light generated by a chemical reaction. The analyzers of nitrogen oxides (NOx) use this principle to determine the concentrations of nitric oxide (NO) and nitrogen dioxide (NO2), taking into account their chemical reaction with ozone (O3).

  • The analyzers of carbon monoxide (CO) are based on the ability of this gas to absorb energy at specific wavelengths. Specifically, the absorption of infrared light is measured in the region of maximum absorption of this pollutant.

  • The operating principle for measuring the particulate matter is the attenuation of the beta radiation. Beta radiation is passed through the deposited particles. The layer of particles, which is increasing, reduces the intensity of beta radiation’s beam, which is measured by an ionization chamber. The electrical output signal is proportional to the actual mass sampled.

  • The operating principle of ozone analyzers is known as the ultraviolet photometry method and consists of measuring the amount of ultraviolet light at a wavelength of 254 nm absorbed by the ozone in a sample. The operating principle is based on the Beer-Lambert law.
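As an illustration of the Beer-Lambert law mentioned for the ozone analyzers, the relation between the transmitted ultraviolet intensity and the ozone concentration can be written as follows. The symbols are assumptions introduced only for this sketch: I_0 is the intensity measured through an ozone-free reference, I is the intensity measured through the sample, σ is the absorption cross section of ozone at 254 nm, ℓ is the optical path length, and c is the ozone number concentration.

$$ I={I}_0\exp \left(-\sigma\, c\,\ell \right)\kern1em \Longrightarrow \kern1em c=\frac{1}{\sigma\, \ell}\ln \left(\frac{I_0}{I}\right) $$

In this form, the concentration follows directly from the ratio of the two measured intensities once σ and ℓ are known for the instrument.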

Multivariate adaptive regression splines method

Multivariate adaptive regression splines (MARS) is a multivariate nonparametric classification/regression technique introduced by Friedman (Friedman 1991; Friedman and Roosen 1995). The regression models were implemented in R, the free programming language and software environment for statistical computing (James et al. 2013; Lantz 2013). The theoretical model explained below has already been presented by the authors in previous research (García Nieto et al. 2011; García Nieto et al. 2012); it is summarized here so that the reader can follow the present work in full. The main purpose of MARS is to predict the values of a continuous dependent variable, y(n×1), from a set of independent explanatory variables, X(n×p). The MARS model can be represented as

$$ \mathbf{y}=f\left(\mathbf{X}\right)+\mathbf{e} $$
(1)

where f is a weighted sum of basis functions that depend on X and e is an error vector of dimension (n×1).

The MARS model does not require any a priori assumptions about the underlying functional relationship between dependent and independent variables. Instead, this relation is uncovered from a set of coefficients and piecewise polynomials of degree q (basis functions) that are entirely “driven” from the regression data (X,y). The MARS regression model is constructed by fitting basis functions to distinct intervals of the independent variables. Generally, piecewise polynomials, also called splines, have pieces smoothly connected together. In MARS terminology, the joining points of the polynomials are called knots, nodes, or breakdown points. These will be denoted by the small letter t. For a spline of degree q, each segment is a polynomial function. MARS uses two-sided truncated power functions as spline basis functions, described by the following equations (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; Vapnik 1999; Hastie et al. 2003; Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012):

$$ {\left[-\left(x-t\right)\right]}_{+}^q=\left\{\begin{array}{cc}\hfill {\left(t-x\right)}^q\hfill & \hfill \mathrm{if}\kern0.24em x<t\hfill \\ {}\hfill 0\hfill & \hfill \mathrm{otherwise}\hfill \end{array}\right. $$
(2)
$$ {\left[+\left(x-t\right)\right]}_{+}^q=\left\{\begin{array}{cc}\hfill {\left(x-t\right)}^q\hfill & \hfill \mathrm{if}\kern0.24em x\ge t\hfill \\ {}\hfill 0\hfill & \hfill \mathrm{otherwise}\hfill \end{array}\right. $$
(3)

where q(≥0) is the power to which the splines are raised and which determines the degree of smoothness of the resultant function estimate. When q=1, which is the case in this study, only simple linear splines are considered.
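The pair of truncated power functions can be sketched in a few lines. This is only an illustration, not the implementation used in the study; the function name `hinge_pair` is hypothetical, and the right-hand function is taken as (x − t)^q for x ≥ t, following Friedman's standard definition.

```python
def hinge_pair(x, t, q=1):
    """Return the pair ([-(x - t)]_+^q, [+(x - t)]_+^q) for scalar x and knot t.

    With q = 1 (the case used in this study) these are the two mirrored
    linear hinge functions max(0, t - x) and max(0, x - t).
    """
    left = (t - x) ** q if x < t else 0.0    # nonzero only below the knot
    right = (x - t) ** q if x >= t else 0.0  # nonzero only above the knot
    return left, right


if __name__ == "__main__":
    # Hypothetical knot at t = 5.
    for x in (3.0, 5.0, 7.0):
        print(x, hinge_pair(x, 5.0))
```

At most one member of the pair is nonzero for any given x, which is how the knot splits the range of a variable into two intervals.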

The MARS model of a dependent variable \( \overrightarrow{y} \) with M basis functions (terms) can be written as (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009)

$$ \widehat{\mathbf{y}}={\widehat{f}}_M\left(\mathbf{x}\right)={c}_0+{\displaystyle \sum_{m=1}^M{c}_m{B}_m\left(\mathbf{x}\right)} $$
(4)

where ŷ is the dependent variable predicted by the MARS model; c_0 is a constant; B_m(x) is the mth basis function, which may be a single spline basis function or a product of two or more such functions; and c_m is the coefficient of the mth basis function.
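To make Eq. (4) concrete, the following sketch evaluates a MARS prediction as a constant plus a weighted sum of basis functions, each being a product of linear hinge functions (q = 1). The data layout (a list of coefficient and factor tuples) and the names `hinge` and `mars_predict` are assumptions chosen for illustration; the coefficients and knots in the example are invented and are not fitted values from this study.

```python
def hinge(u):
    """Linear hinge function: u if u > 0, otherwise 0."""
    return max(0.0, u)


def mars_predict(x, c0, terms):
    """Evaluate c0 + sum_m c_m * B_m(x).

    x     : sequence of predictor values
    c0    : constant term
    terms : list of (c_m, factors), where factors is a list of
            (variable_index, knot, sign) and sign is +1 for [+(x - t)]_+
            or -1 for [-(x - t)]_+
    """
    y_hat = c0
    for coefficient, factors in terms:
        basis = 1.0
        for j, knot, sign in factors:
            basis *= hinge(sign * (x[j] - knot))
        y_hat += coefficient * basis
    return y_hat


if __name__ == "__main__":
    # Hypothetical model with one single-hinge term and one interaction term.
    terms = [
        (2.0, [(0, 1.0, +1)]),
        (-0.5, [(0, 5.0, -1), (1, 8.0, +1)]),
    ]
    print(mars_predict([3.0, 10.0], 1.0, terms))
```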

Both the variables to be introduced into the model and the knot positions for each individual variable have to be optimized. For a data set X containing n objects and p explanatory variables, there are N=n×p pairs of spline basis functions, given by Eqs. (2) and (3), with knot locations x_ij (i=1,2,…,n; j=1,2,…,p).

A two-step procedure is followed to construct the final model. First, in order to select the consecutive pairs of basis functions of the model, a two-at-a-time forward stepwise procedure is implemented (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; García Nieto et al. 2012). This forward stepwise selection of basis function leads to a very complex and overfitted model. Such a model, although it fits the data well, has poor predictive abilities for new objects. To improve the prediction, the redundant basis functions are removed one at a time using a backward stepwise procedure. To determine which basis functions should be included in the model, MARS utilizes the generalized cross-validation (GCV) (Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009). In this way, the GCV is the mean squared residual error divided by a penalty dependent on the model complexity. The GCV criterion is defined in the following way (Friedman 1991; Sekulic and Kowalski 1992; Friedman and Roosen 1995; Vapnik 1999; Hastie et al. 2003; Chou et al. 2004; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012):

$$ \mathrm{G}\mathrm{C}\mathrm{V}(M)=\frac{\frac{1}{n}{\displaystyle \sum_{i=1}^n{\left({y}_i-{\widehat{f}}_M\left({\mathbf{x}}_i\right)\right)}^2}}{{\left(1-C(M)/n\right)}^2} $$
(5)

where C(M) is a complexity penalty that increases with the number of basis functions in the model and which is defined as (Friedman and Roosen 1995; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012)

$$ C(M)=\left(M+1\right)+d\;M $$
(6)

where M is the number of basis functions in Eq. (4) and the parameter d is a penalty for each basis function included into the model. It can be also regarded as a smoothing parameter. Large values of d lead to fewer basis functions and therefore smoother function estimates. For more details about the selection of the d parameter, see the references Friedman and Roosen (1995), de Cos Juez et al. (2009), García Nieto et al. (2011), Vidoli (2011), and García Nieto et al. (2012). In our studies, the parameter d equals 2, and the maximum interaction level of the spline basis functions is restricted to 3.
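Equations (5) and (6) translate directly into code. The sketch below is a minimal illustration with d = 2 as the default, matching the value quoted above; the function names and the guard against C(M) ≥ n are assumptions added for this example.

```python
def complexity(M, d=2.0):
    """Complexity penalty C(M) = (M + 1) + d * M from Eq. (6)."""
    return (M + 1) + d * M


def gcv(y, y_hat, M, d=2.0):
    """Generalized cross-validation criterion from Eq. (5).

    y, y_hat : observed and predicted values (same length n)
    M        : number of basis functions in the model
    d        : penalty per basis function
    """
    n = len(y)
    mean_squared_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / n
    c = complexity(M, d)
    if c >= n:
        # Assumed guard: the criterion is not meaningful when C(M) >= n.
        raise ValueError("C(M) must be smaller than the number of observations")
    return mean_squared_residual / (1.0 - c / n) ** 2


if __name__ == "__main__":
    # Hypothetical data: 20 observations, every prediction off by exactly 1.
    y = [float(i) for i in range(20)]
    y_hat = [yi + 1.0 for yi in y]
    print(gcv(y, y_hat, M=2))
```

In the backward stepwise phase, a criterion of this kind would be evaluated for each candidate submodel, and the submodel with the lowest value would be retained.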

The importance of the variables in the MARS model

Once the MARS model is constructed, it is possible to evaluate the importance of the explanatory variables used to construct the basis functions. Establishing predictor importance is in general a complex problem that requires the use of more than one criterion. In order to obtain reliable results, it is convenient to use the GCV parameter explained before together with the Nsubsets criterion (which counts the number of model subsets in which each variable is included) and the residual sum of squares (RSS) (Friedman and Roosen 1995; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012).
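The Nsubsets criterion can be sketched as a simple count. The example assumes that the model subsets are available as sets of variable names, one per submodel considered; how those subsets are generated is not shown, and the subsets below are invented for illustration rather than taken from the fitted models.

```python
from collections import Counter


def nsubsets(subsets):
    """Count, for each variable, the number of model subsets that include it."""
    counts = Counter()
    for variables in subsets:
        counts.update(set(variables))  # each variable counted once per subset
    return counts


if __name__ == "__main__":
    # Hypothetical subsets of increasing size.
    example = [{"NO"}, {"NO", "CO"}, {"NO", "CO", "O3"}]
    for variable, count in nsubsets(example).most_common():
        print(variable, count)
```

Variables that appear in more subsets are ranked as more important under this criterion.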

Analysis of results and discussion

It is possible to study the trend in concentrations of the preceding pollutants in the years 2006, 2007, and 2008 (Wark et al. 1997; Schnelle and Brown 2001; Godish 2004; Wang et al. 2004; García Nieto et al. 2013). In the first place, Fig. 2 shows the sulfur dioxide concentrations each month during the years 2006, 2007, and 2008. The emission peaks occur during late autumn and early winter, namely from November to February each year, with the maximum of 51 μg/m3 reached around Christmas 2006. Similarly, minimum concentrations (13 μg/m3 in August 2007) occur during the summer months because there is lower power consumption and less traffic in the city. This trend holds throughout the years studied and is what one would expect. Following the US Environmental Protection Agency (USEPA) Air Quality Standards (see Table 1), the maximum allowable concentration of SO2 expressed as annual arithmetic mean is 80 μg/m3. The annual arithmetic means for this gas during the years 2006, 2007, and 2008 were 24, 23, and 24 μg/m3, respectively. Therefore, the concentrations of this gas were below the maximum permitted and met air quality standards for a healthy person during these 3 years, including emission peaks.

Fig. 2

Monthly trend of sulfur dioxide (SO2), nitrogen dioxide (NO2), and carbon monoxide (CO) concentrations during the years 2006, 2007, and 2008 in the Oviedo urban area

Secondly, Fig. 3 shows the nitric oxide concentrations each month during the years 2006, 2007, and 2008. Again, the emission peaks occur during late autumn and early winter, reaching the maximum emission in December 2006: 74 μg/m3. Similarly, minimum concentrations (13 μg/m3 in July 2007) occur during the summer months. Although the initial product of the combustion is nitric oxide (NO), this gas is rapidly oxidized and converted to NO2. Its residence time in the atmosphere is very short, and the USEPA Air Quality Standards do not consider it.

Fig. 3

Monthly trend of nitric oxide (NO), particulate matter (PM10), and ozone (O3) concentrations during the years 2006, 2007, and 2008 in the Oviedo urban area

Thirdly, the nitrogen dioxide (NO2) concentrations each month during the years 2006, 2007, and 2008 are also shown in Fig. 2. Note that the emission peak occurs from November to February each year, reaching the maximum emission in December 2006: 53 μg/m3. Similarly, minimum concentrations occur in August 2006 and October 2007 with values of 26 and 23 μg/m3, respectively. Following the USEPA Air Quality Standards (see Table 1), the maximum allowable concentration of NO2 expressed as annual arithmetic mean is 100 μg/m3. The annual arithmetic means for this gas during the years 2006, 2007, and 2008 were 34, 37, and 36 μg/m3, respectively. Thus, the nitrogen dioxide (NO2) concentrations are also below the maximum permitted and meet air quality standards for a healthy person during these 3 years, including emission peaks. It is also important to highlight a certain flattening of the concentrations of this gas over time.

Fourthly, Fig. 2 also shows the carbon monoxide (CO) concentrations each month during the years 2006, 2007, and 2008. The main emission peak occurs in December 2006 and January 2007 with a value of 0.85 mg/m3. Similarly, minimum concentrations occur during the summer months: 0.32 mg/m3 in July 2006, 0.31 mg/m3 in August 2007, and 0.35 mg/m3 in July 2008, respectively. This trend is a quasi-sinusoidal signal with peaks and valleys, although the height of the peaks and valleys appears to be decreasing in time (García Nieto 2001; Godish 2004; García Nieto 2006). Similarly, following the USEPA Air Quality Standards (see Table 1), the maximum allowable concentration of CO expressed as annual arithmetic mean is 3.33 mg/m3. The annual arithmetic means for this gas during the years 2006, 2007, and 2008 were 0.52, 0.50, and 0.44 mg/m3, respectively. Hence, the concentrations of CO were below the highest level that can be tolerated by humans according to USEPA Air Quality Standards during these 3 years, including emission peaks.

Fifthly, Fig. 3 shows the particulate matter PM10 concentrations each month during the years 2006, 2007, and 2008. The emission peaks occur in December 2006, January 2007, March 2008, and August 2008 with values of 71, 72, 75, and 69 μg/m3, respectively. However, the quasi-sinusoidal trend is generally lost in the case of the particulate matter. Aerosol concentrations remain in a band between approximately 50 and 70 μg/m3, regardless of the month and the season. From the standpoint of air quality standards, following the USEPA Air Quality Standards (see Table 1), the maximum allowable concentration of PM10 expressed as annual arithmetic mean is 50 μg/m3. The annual arithmetic means for this pollutant during the years 2006, 2007, and 2008 were 60, 59, and 56 μg/m3, respectively. Therefore, the aerosol concentrations are greater than the allowable maximum for a healthy person during these 3 years, including emission peaks. This behavior can give rise to serious health problems for the population, such as chronic diseases and deaths.

Figure 3 also shows the monthly ozone concentrations during the years 2006, 2007, and 2008. The ozone concentration varies over time in the form of an oscillating sawtooth, reaching maximum values during the spring and early summer months: 46 μg/m3 in June 2006, 54 μg/m3 in May 2007, and 59 μg/m3 in April 2008. This trend holds throughout the years studied, since ozone is associated with photochemical reactions, which require strong sunlight. The Clean Air Act directs the USEPA to establish National Ambient Air Quality Standards for several pollutants, including ground-level ozone, and cities out of compliance with these standards are required to take steps to reduce their levels. In May 2008, the USEPA lowered its ozone standard from 80 to 75 μg/m3. This proved controversial, since the agency's own scientists and advisory board had recommended lowering the standard to 60 μg/m3, and the World Health Organization recommends 51 μg/m3. Many public health and environmental groups also supported the 60-μg/m3 standard. The annual arithmetic means for this gas in the city of Oviedo during the years 2006, 2007, and 2008 were 26, 39, and 41 μg/m3, respectively. Thus, the concentrations of this gas were below the maximum permitted and met air quality standards during these 3 years, including emission peaks. In April 2008, however, the peak came very close to the allowable limit, which posed a danger to the health of the population of Oviedo. Furthermore, there is a great deal of evidence showing that high concentrations of ozone, created by high concentrations of pollution and daylight UV rays at the Earth's surface, can harm lung function and irritate the respiratory system. Exposure to ozone and the pollutants that produce it has been linked to premature death, asthma, bronchitis, heart attack, and other cardiopulmonary problems. According to scientists of the USEPA, susceptible people can be adversely affected by ozone levels as low as 40 μg/m3.

Finally, in order to explain the peaks of SO2, NO2, NO, and CO concentrations during the period November 2006–January 2007, it is necessary to take into account the role of meteorology. To this end, Fig. 4 shows the average monthly temperature as well as the average monthly precipitation in the city of Oviedo from January 2006 to December 2008. Note that the temperature during this period (November 2006–January 2007) is lower than in the corresponding months of the remaining years. Consequently, electricity and heating consumption was greater, and the concentrations of these pollutants therefore increased in this period.

Fig. 4

Meteorological data: average monthly temperature and average monthly precipitation in the Oviedo urban area from January 2006 to December 2008

To fix ideas, the list of input variables taken into account in this research work is shown in Table 3. Three dependent variables were used to build the MARS models: nitrogen dioxide (NO2), sulfur dioxide (SO2), and particulate matter less than 10 μm (PM10). That is, we have built three different MARS models, taking NO2, SO2, and PM10 in turn as the dependent variable and the other variables listed in Table 3 as independent input variables (predictor variables).

Table 3 Set of input variables used in this study

In this work, three second-order MARS models have been used, so that the basis functions of the model consist of linear and second-order splines; the maximum number of terms was not limited (no pruning). The results of the MARS models computed using all the available data observations are shown in Tables 4, 5, and 6, which list the 25, 43, and 33 main basis functions of the three MARS models, respectively, together with their coefficients. Please note that the function h(x) equals x if x is greater than 0 (the bisector of the first quadrant) and is 0 if x is less than or equal to 0, that is, \( h(x)=\max(0,x) \). The MARS model is thus a form of nonparametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions as a weighted sum of basis functions called hinge functions (Friedman and Roosen 1995; Xu et al. 2004; de Cos Juez et al. 2009; García Nieto et al. 2011; Vidoli 2011; García Nieto et al. 2012). The predicted response for NO2, SO2, and PM10 is a better fit to the original values than a purely linear model would give, since the MARS model automatically produces kinks in the predicted dependent variable to take nonlinearities into account.
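To make the role of the hinge function concrete, the sketch below evaluates a toy second-order MARS-type model in Python. The knots and coefficients are hypothetical values chosen only for illustration; they are not the fitted values of Tables 4, 5, or 6, and the function name `toy_mars` is likewise an assumption.

```python
import numpy as np


def hinge(x):
    """Hinge function h(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, np.asarray(x, dtype=float))


def toy_mars(so2, no):
    """Toy second-order MARS-type model with two predictors.

    All knots (10.0, 30.0) and coefficients below are hypothetical.
    """
    b1 = hinge(so2 - 10.0)                      # first-order term in SO2
    b2 = hinge(30.0 - no)                       # mirrored first-order term in NO
    b3 = hinge(so2 - 10.0) * hinge(no - 30.0)   # second-order interaction term
    # Weighted sum of basis functions plus an intercept.
    return 20.0 + 0.5 * b1 - 0.2 * b2 + 0.01 * b3


print(toy_mars(14.0, 40.0))  # SO2 above its knot, NO above its knot
print(toy_mars(5.0, 20.0))   # SO2 below its knot, NO below its knot
```

Each basis function is zero on one side of its knot, so a term only contributes in the region where its hinge is active; this is what produces the kinks in the predicted response.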

Table 4 List of basis functions of the MARS model for the NO2 pollutant and their coefficients \( c_i \)
Table 5 List of basis functions of the MARS model for the SO2 pollutant and their coefficients \( c_i \)
Table 6 List of basis functions of the MARS model for the PM10 pollutant and their coefficients \( c_i \)

According to the results shown in Table 7, the most important variables for the prediction of NO2 (output variable) are, in decreasing order of importance: (1) sulfur dioxide (SO2), (2) nitric oxide (NO) and carbon monoxide (CO) alike, (3) particulate matter less than or equal to 10 μm (PM10), and finally (4) ozone (O3).

Table 7 Evaluation of the importance of the variables that form the model for the NO2 pollutant according to criteria Nsubsets, GCV, and RSS

Similarly, from the results shown in Table 8, it is possible to observe that the most important variables for the prediction of SO2 (output variable) are in the hierarchical order: (1) NO2, (2) NO and O3 alike, (3) PM10, and finally (4) CO.

Table 8 Evaluation of the importance of the variables that form the model for the SO2 pollutant according to criteria Nsubsets, GCV, and RSS

Additionally, the results shown in Table 9 indicate that the most important variables for the prediction of PM10 (output variable) are mainly in the hierarchical order: (1) NO2, (2) NO, (3) CO, (4) SO2, and finally (5) O3.

Table 9 Evaluation of the importance of the variables that form the model for the PM10 pollutant according to criteria Nsubsets, GCV, and RSS
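The GCV criterion named in Tables 7, 8, and 9 can be illustrated with a short sketch. It assumes the usual generalized cross-validation form used for MARS, in which the mean residual sum of squares is inflated by a penalty that depends on the effective number of parameters of the model. How that effective number is counted varies between implementations, so it is passed in here as an argument rather than computed; the function name `gcv` and the sample numbers are illustrative only.

```python
def gcv(rss, n, effective_params):
    """Generalized cross-validation score (assumed standard form):

        GCV = (RSS / n) / (1 - C / n)**2

    where RSS is the residual sum of squares, n the number of
    observations, and C the effective number of parameters.
    """
    if not 0 <= effective_params < n:
        raise ValueError("effective_params must satisfy 0 <= C < n")
    return (rss / n) / (1.0 - effective_params / n) ** 2


# Illustrative numbers only: the same RSS is penalized more heavily
# when the model uses more effective parameters.
print(gcv(rss=100.0, n=100, effective_params=10.0))
print(gcv(rss=100.0, n=100, effective_params=40.0))
```

Under this form, a lower GCV indicates a better trade-off between fit and model complexity, which is why the decrease in GCV attributable to a variable can serve as a measure of its importance.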

Furthermore, a graphical representation of the terms that constitute the three MARS models can be seen in Figs. 5, 6, and 7, respectively.

Fig. 5

Graphical representation of the terms that constitute the MARS model for the NO2 pollutant. a First-order term of the predictor variable SO2. b First-order term of the predictor variable NO. c First-order term of the predictor variable CO. d First-order term of the predictor variable PM10. e First-order term of the predictor variable O3. f Second-order term of the predictor variables SO2 and NO values. g Second-order term of the predictor variables SO2 and CO values. h Second-order term of the predictor variables SO2 and PM10 values. i Second-order term of the predictor variables SO2 and O3 values. j Second-order term of the predictor variables NO and CO values. k Second-order term of the predictor variables NO and PM10 values. l Second-order term of the predictor variables NO and O3 values. m Second-order term of the predictor variables CO and PM10 values. n Second-order term of the predictor variables CO and O3 values. o Second-order term of the predictor variables PM10 and O3 values

Fig. 6

Graphical representation of the terms that constitute the MARS model for the SO2 pollutant. a First-order term of the predictor variable NO. b First-order term of the predictor variable NO2. c First-order term of the predictor variable CO. d First-order term of the predictor variable PM10. e First-order term of the predictor variable O3. f Second-order term of the predictor variables NO and NO2 values. g Second-order term of the predictor variables NO and CO values. h Second-order term of the predictor variables NO and PM10 values. i Second-order term of the predictor variables NO and O3 values. j Second-order term of the predictor variables NO2 and CO values. k Second-order term of the predictor variables NO2 and PM10 values. l Second-order term of the predictor variables NO2 and O3 values. m Second-order term of the predictor variables CO and PM10 values. n Second-order term of the predictor variables CO and O3 values. o Second-order term of the predictor variables PM10 and O3 values

Fig. 7

Graphical representation of the terms that constitute the MARS model for the PM10 pollutant. a First-order term of the predictor variable SO2. b First-order term of the predictor variable NO. c First-order term of the predictor variable NO2. d First-order term of the predictor variable CO. e First-order term of the predictor variable O3. f Second-order term of the predictor variables SO2 and NO values. g Second-order term of the predictor variables SO2 and NO2 values. h Second-order term of the predictor variables SO2 and CO values. i Second-order term of the predictor variables SO2 and O3 values. j Second-order term of the predictor variables NO and NO2 values. k Second-order term of the predictor variables NO and CO values. l Second-order term of the predictor variables NO and O3 values. m Second-order term of the predictor variables NO2 and CO values. n Second-order term of the predictor variables NO2 and O3 values. o Second-order term of the predictor variables CO and O3 values

It is very important to select the model that best fits the experimental data. The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. In this research, the goodness-of-fit criterion adopted for the regression analysis was the coefficient of determination \( R^2 \) (Freedman et al. 2007; de Cos Juez et al. 2009; García Nieto et al. 2011; García Nieto et al. 2012). This ratio indicates the proportion of total variation in the dependent variables (NO2, SO2, and PM10 in our case) explained by the MARS models. A data set takes values \( t_i \), each of which has an associated modeled value \( y_i \). The former are called the observed values and the latter are often referred to as the predicted values. Variability in the data set is measured through different sums of squares:

  • \( {\mathrm{SS}}_{\mathrm{tot}}={\displaystyle \sum_{i=1}^n{\left({t}_i-\overline{t}\right)}^2} \): the total sum of squares, proportional to the sample variance

  • \( {\mathrm{SS}}_{\mathrm{reg}}={\displaystyle \sum_{i=1}^n{\left({y}_i-\overline{t}\right)}^2} \): the regression sum of squares, also called the explained sum of squares

  • \( {\mathrm{SS}}_{\mathrm{err}}={\displaystyle \sum_{i=1}^n{\left({t}_i-{y}_i\right)}^2} \): the residual sum of squares

In the previous sums, \( \overline{t} \) is the mean of the n observed data:

$$ \overline{t}=\frac{1}{n}{\displaystyle \sum_{i=1}^n{t}_i} $$
(7)

Bearing in mind the above sums, the general definition of the coefficient of determination is

$$ {R}^2\equiv 1-\frac{{\mathrm{SS}}_{\mathrm{err}}}{{\mathrm{SS}}_{\mathrm{tot}}} $$
(8)
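Equations (7) and (8) translate directly into code. The following minimal sketch uses NumPy; the function name `r_squared` and the sample numbers are illustrative only and are not data from this study.

```python
import numpy as np


def r_squared(t, y):
    """Coefficient of determination R^2 = 1 - SS_err / SS_tot.

    t: observed values, y: predicted (modeled) values.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    t_mean = t.mean()                      # Eq. (7)
    ss_tot = np.sum((t - t_mean) ** 2)     # total sum of squares
    ss_err = np.sum((t - y) ** 2)          # residual sum of squares
    return 1.0 - ss_err / ss_tot           # Eq. (8)


observed = [1.0, 2.0, 3.0, 4.0]     # illustrative values
predicted = [1.1, 1.9, 3.2, 3.8]    # illustrative values
print(round(r_squared(observed, predicted), 2))
```

For these illustrative values, the total sum of squares is 5 and the residual sum of squares is 0.1, so the result is approximately 0.98.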

A coefficient of determination value of 1.0 indicates that the regression curve fits the data perfectly. In this research work, the three fitted MARS models for NO2, SO2, and PM10 have coefficients of determination equal to 0.85, 0.82, and 0.75, respectively. Additionally, their correlation coefficients were 0.92, 0.91, and 0.87, respectively. These results indicate a very high goodness of fit for the three MARS models analyzed.

In order to guarantee the predictive ability of the three MARS models, cross-validation (Picard and Cook 1984; Efron and Tibshirani 1997) was the standard technique used in this research work. In this technique, the data set is randomly divided into a number of disjoint subsets of equal size, and each subset is used once as a validation set, whereas the remaining subsets are put together to form a training set. In the simplest case, the average accuracy over all the validation sets is used as an estimator of the accuracy of the method. Here, tenfold cross-validation was used; that is to say, to calculate the error criterion, the models were built using 90 % of the sample and tested with the remaining 10 %, thus simulating as closely as possible the real conditions under which the model would be built and later applied to new observation data unrelated to its construction.

Finally, this research work successfully estimated the concentrations of NO2 from 2006 to 2008 in agreement with the experimental concentrations observed (see Fig. 8). Similarly, Figs. 9 and 10 show good agreement between the experimental concentrations of SO2 and PM10, respectively, and the concentrations predicted by the MARS models from 2006 to 2008.
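The tenfold cross-validation procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: an ordinary least-squares fit (`fit_ols`, `predict_ols`) stands in for the MARS fitting step, the fold-wise score is the coefficient of determination, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def k_fold_indices(n, k=10, seed=0):
    """Randomly split the indices 0..n-1 into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)


def cross_validate(X, t, fit, predict, k=10, seed=0):
    """Average validation R^2 over k folds.

    Each fold is used once for validation while the remaining
    k - 1 folds together form the training set.
    """
    folds = k_fold_indices(len(t), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], t[train_idx])
        y_val = predict(model, X[val_idx])
        ss_err = np.sum((t[val_idx] - y_val) ** 2)
        ss_tot = np.sum((t[val_idx] - t[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_err / ss_tot)
    return float(np.mean(scores))


# Stand-in model (assumption): ordinary least squares, NOT a MARS fit.
def fit_ols(X, t):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 50.0, size=(200, 2))
t = 5.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0.0, 1.0, size=200)
print(cross_validate(X, t, fit_ols, predict_ols, k=10))
```

With k = 10, each model is trained on 90 % of the sample and tested on the remaining 10 %. A MARS implementation could be substituted by passing its own fitting and prediction functions in place of the least-squares stand-in.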

Fig. 8

Comparison between the concentrations of NO2 observed and predicted by the MARS model from 2006 to 2008

Fig. 9

Comparison between the concentrations of SO2 observed and predicted by the MARS model from 2006 to 2008

Fig. 10

Comparison between the concentrations of PM10 observed and predicted by the MARS model from 2006 to 2008

Conclusions

In the first place, this research described the steps for the construction of three MARS models to estimate, quickly and with a high degree of accuracy, the concentrations of NO2, SO2, and PM10 from 2006 to 2008. We have provided examples of real applications and simple explanations of two commonly used statistics for the selection of the best-fitting models: the coefficient of determination and the correlation coefficient.

Secondly, the MARS models are potentially useful for predicting pollutant concentrations in the atmosphere. In other words, the methodology developed here could be applied to other industrial cities with similar or different sources of pollutants, but it would be necessary to take into account the specific nature of each location.

Finally, the results of this research on the development of models of local pollutant concentrations are a valuable tool for acid rain mitigation projects and for research into the effect of particulate matter on human health. Furthermore, there is an increasing interest in using mathematical models with good physical properties to understand the behavior of pollutants in the atmosphere in order to improve air quality and to reduce the number of deaths. In this way, these models can be embedded in other, more general models of the atmosphere. Additionally, one of the main findings of this study was to set the order of priority (hierarchy) of the predictor variables involved in the estimation of the dependent variables: NO2, SO2, and PM10. Furthermore, this paper presents examples of real applications and simple explanations of the statistical calculations for the selection of the best-fit models.