Introduction

The composition of the atmosphere has been changing gradually over millions of years. Rapid urbanization and industrialization have added further elements and compounds to the air and have thus increased pollution. In addition to the criteria pollutants, there are many potent short-lived climate pollutants, such as black carbon, tropospheric ozone, methane, and hydrofluorocarbons, whose emissions need to be regulated. These pollutants remain in the atmosphere for shorter periods of time yet have much larger global warming potentials than CO2. As one of the most significant sources of greenhouse gas (GHG) and criteria pollutant emissions, the transportation system represents one of the greatest needs for emission reductions and one of the greatest opportunities to build an economy that aligns stable economic growth with ever-improving public health and environmental protection. Reducing transportation emissions, including those from heavy-duty diesel engines, will have dramatic air quality and public health benefits. Black carbon (a component of PM2.5) and ozone are air pollutants with harmful health effects, and reducing their emissions can offer significant improvements in air quality and public health. In addition to short-lived, local ozone precursors such as NOX and SOX, methane is a global source of tropospheric ozone. Black carbon is the most strongly light-absorbing component of particulate matter (PM) emitted from burning fuels such as coal, diesel, and biomass. Diesel PM is a toxic air contaminant that can be inhaled (PM10 and PM2.5). Black carbon contributes to climate change directly by absorbing sunlight and indirectly by depositing on snow and by interacting with clouds and affecting cloud formation. Reducing black carbon emissions globally can have immediate economic, climate, and public health benefits (Brown et al. 2006).

While we must continue taking steps to rapidly reduce CO2 emissions, additional efforts to cut emissions of short-lived climate pollutants can yield immediate climate benefits. In addition, fast and sustained action to reduce these emissions can deliver further benefits through avoided impacts on agriculture, water availability, ecosystems, and human health. Reducing methane would lower background tropospheric ozone concentrations, which would support progress towards healthy air quality and avoid crop yield losses and forest damage caused by the direct action of ozone on plant growth. Black carbon affects cloud formation and precipitation, and black carbon deposits on glaciers and snowpack accelerate melting. Reducing black carbon and methane emissions will help reduce the risk of premature deaths, air pollution-related hospitalizations, and the associated medical expenses each year. With this backdrop, the need to enact pollution control laws was recognized by the US-EPA (Brown et al. 2006).

To prevent, control, and abate air pollution, the Clean Air Act was enacted in the USA in 1970, and the US-EPA introduced the Air Quality Index (AQI) for describing air quality. The AQI formulation uses probability theory, treats the uncertainty associated with criteria pollutants as purely probabilistic, and is devoid of the concept of partial belief associated with human thinking.

The modeling formalism embraces two-valued-logic-based probability theory, in which the random variable forms the basis of probability computations. Standard probability theory is not designed to deal with imprecise probabilities or Z-probabilities, which pervade real-world uncertainties. Fuzzy set theory is a way of modeling uncertainty due to imprecision, fuzziness, and ambiguity, wherein human perception plays a pivotal role.

The limitations of the conventional AQI call for a fuzzy-logic-based formalism, known as the Zadeh–Deshpande (ZD) approach (Yadav et al. 2011, 2013, 2014), wherein air quality/water quality is described directly in linguistic terms with a linguistic degree of certainty attached to each description.

The paper focuses on the linguistic description of air quality using genetic algorithm–fuzzy modeling in combination with the ZD approach, illustrated through case studies. The system flowchart in Fig. 1 shows the approaches used in the study.

Fig. 1 System flowchart

The remainder of the paper is organized as follows. Section "Classification of polluted cities of Maharashtra State in India using Fuzzy C mean" describes the classification of 15 cities in Maharashtra, India, based on their pollution potential using Fuzzy C mean (FCM). Section "Pollution forecast and application of GA-ZD method" presents pollution forecasts obtained with four mathematical techniques and applies genetic algorithm–fuzzy modeling in combination with the ZD approach to assess and compare the air quality status of New York and Mumbai. Section "Software support" gives details of the software developed for the linguistic description of air quality. Concluding remarks and the future scope for research are presented in the "Conclusion and future scope for research" section.

Fig. 2 FCM classification: pollution status of 15 cities in Maharashtra (India)

Classification of polluted cities of Maharashtra State in India using Fuzzy C mean

Clustering of numerical data forms the basis of many classification and system modeling algorithms. The purpose of clustering is to identify natural groupings of data from a large data set to produce a concise representation of a system’s behavior. Clustering refers to identifying the number of subclasses of c clusters in a data universe X comprised of n data samples and partitioning X into c clusters (2 ≤ c < n). Note that c = 1 denotes rejection of the hypothesis that there are clusters in the data, whereas c = n constitutes the trivial case where each sample is in a “cluster” by itself. There are two kinds of c partitions of data: hard (or crisp) and soft (or fuzzy). For numerical data, one assumes that the members of each cluster bear more mathematical similarity to each other than to members of other clusters. Two important issues to consider in this regard are how to measure the similarity between pairs of observations and how to evaluate the partitions once they are formed (Ross 2009).

One of the simplest similarity measures is the distance between pairs of feature vectors in the feature space. If one can determine a suitable distance measure and compute the distance between all pairs of observations, then one may expect the distance between points in the same cluster to be considerably smaller than the distance between points in different clusters. Several circumstances, however, limit the general utility of this approach, such as the combination of values of incompatible features, as would be the case, for example, when different features have significantly different scales. The clustering method described here defines “optimum” partitions through a global criterion function that measures the extent to which candidate partitions optimize a weighted sum of squared errors between data points and cluster centers in feature space. Hard C means is employed to classify data in a crisp sense; by this, we mean that each data point is assigned to one, and only one, cluster. A problem with hard C means lies in assigning a point that falls on the line of symmetry between two clusters: to which class should this point belong? Whichever class the algorithm assigns it to, there will be a good argument that it should be a member of the other class. Alternatively, the argument may revolve around the fact that the choice of two classes is a poor one for this problem; three classes might be the best choice, but the physics underlying the data might be binary, leaving two classes as the only option (Ross 2009).

The solution to the above problem is FCM. This technique was originally introduced by Jim Bezdek in 1981 as an improvement on earlier clustering methods. FCM is a data clustering technique in which a dataset is grouped into c clusters, with every data point in the dataset belonging to every cluster to a certain degree. For example, a data point that lies close to the center of a cluster will have a high degree of belonging (membership) to that cluster, whereas a data point that lies far away from the center of a cluster will have a low degree of belonging to that cluster. Thus each data point belongs to every cluster to some degree specified by a membership grade (Bezdek et al. 1984; Saksena et al. 2003).

The case study relates to clustering of 15 cities of Maharashtra (India), based on their pollution potential, into five different clusters. The classification is based on the pollution levels of Ambient Air Quality Monitoring (AAQM) stations.

Let X = {x1, x2, x3, ..., xn} be the set of n data samples that we wish to classify. Each data sample xi is defined by m features, i.e., xi = {xi1, xi2, xi3, ..., xim}, so each xi in the universe X is an m-dimensional feature vector. There are 15 data points (cities), and each data point is described by two features, viz. NOX and PM10. The 15 data points are to be clustered into five clusters, viz. not polluted, moderately polluted, poorly polluted (alert level), very poorly polluted (warning level), and severely polluted (emergency level). The initial partition matrix U^0 is arranged as follows:

$$ U^0=\left[\begin{array}{ccccccccccccccc} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{array}\right] $$

Vi is the ith cluster center, described by m features and arranged in vector form as Vi = {vi1, vi2, vi3, ..., vim}. The cluster-center coordinates for each class are calculated using Eq. (1):

$$ v_{ij}=\frac{\sum_{k=1}^{n}\mu_{ik}^{m'}\,x_{kj}}{\sum_{k=1}^{n}\mu_{ik}^{m'}} $$
(1)

where j is a variable over the feature space, i.e., j = 1, 2, 3, ..., m, and μik is the membership of the kth data point in the ith cluster. The distance measure dik is the Euclidean distance between the ith cluster center and the kth data point, computed using Eq. (2). m′ ∈ [1, ∞) is the weighting parameter that controls the amount of fuzziness in the clustering process. In the present case study, m′ = 2.2.

$$ d_{ik}=\sqrt{\sum_{j=1}^{m}\left(x_{kj}-v_{ij}\right)^{2}} $$
(2)

With the distance measures computed, the partition matrix U is updated using Eq. (3):

$$ \mu_{ik}^{(r+1)}=\left[\sum_{j=1}^{c}\left(\frac{d_{ik}^{(r)}}{d_{jk}^{(r)}}\right)^{2/(m'-1)}\right]^{-1} $$
(3)

Thus the updated fuzzy partition obtained is:

$$ U^1=\left[\begin{array}{ccccccccccccccc} 0.39 & 0.38 & 0.42 & 0.23 & 0.37 & 0.22 & 0.41 & 0.40 & 0.06 & 0.04 & 0.38 & 0.16 & 0.66 & 0.25 & 0.40\\ 0.50 & 0.61 & 0.56 & 0.31 & 0.47 & 0.29 & 0.55 & 0.52 & 0.12 & 0.06 & 0.49 & 0.24 & 0.33 & 0.34 & 0.51\\ 0.04 & 0.01 & 0.01 & 0.15 & 0.05 & 0.17 & 0.01 & 0.03 & 0.15 & 0.24 & 0.05 & 0.15 & 0.00 & 0.13 & 0.03\\ 0.05 & 0.01 & 0.01 & 0.19 & 0.07 & 0.21 & 0.02 & 0.03 & 0.25 & 0.40 & 0.06 & 0.21 & 0.00 & 0.17 & 0.04\\ 0.04 & 0.01 & 0.01 & 0.18 & 0.07 & 0.18 & 0.02 & 0.03 & 0.48 & 0.27 & 0.06 & 0.32 & 0.00 & 0.17 & 0.04 \end{array}\right] $$

To determine whether convergence has been achieved, we compare U^0 and U^1 using a matrix norm, namely the maximum absolute difference between corresponding memberships, as in Eq. (4).

$$ \max_{i,k}\left|\mu_{ik}^{(1)}-\mu_{ik}^{(0)}\right| $$
(4)
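
Putting Eqs. (1)–(4) together, the FCM iteration can be summarized in a short sketch (a minimal illustration in Python/NumPy, not the code used in the study; the random initialization, stopping tolerance, and hypothetical data file name are assumptions, while c = 5 and m′ = 2.2 follow the case study):

```python
import numpy as np

def fcm(X, c=5, m_prime=2.2, tol=0.01, max_iter=100, seed=0):
    """Fuzzy C mean on an (n, m) feature matrix X; returns the partition U and centers V."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                    # each column of U sums to 1
    for _ in range(max_iter):
        W = U ** m_prime
        V = (W @ X) / W.sum(axis=1, keepdims=True)        # Eq. (1): cluster centers
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # Eq. (2): distances
        D = np.fmax(D, 1e-12)                             # guard against zero distances
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (m_prime - 1))).sum(axis=1)  # Eq. (3)
        if np.max(np.abs(U_new - U)) < tol:               # Eq. (4): convergence test
            U = U_new
            break
        U = U_new
    return U, V

# Example: 15 cities described by two features (NOX, PM10), grouped into 5 clusters
# X = np.loadtxt("maharashtra_2009.csv", delimiter=",")   # hypothetical 15 x 2 data file
# U, V = fcm(X)
# labels = U.argmax(axis=0)                               # hardened assignment for reporting
```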

Finally, the FCM converges after four iterations, and the following clusters are obtained (Fig. 2)

Fig. 3 Vehicle population in Ratnagiri, India

1. Cluster 1 = {Akola, Kolhapur, Ratnagiri, Solapur} → not polluted
2. Cluster 2 = {Amravati, Aurangabad, Jalgaon, Nashik, Sangli} → slightly polluted
3. Cluster 3 = {Pune} → moderately polluted
4. Cluster 4 = {Chandrapur, Jalna, Latur, Nagpur} → highly polluted
5. Cluster 5 = {Mumbai} → very highly polluted

Table 1 presents the air quality description of the clusters by inference. Based on the available criteria pollutant data for the year 2009, Ratnagiri is classified as not polluted. It lies at the heart of the Konkan region, a charming stretch of land on the west coast of India endowed with a beautiful seashore and picturesque mountains. The total number of vehicles registered in Ratnagiri was 153,491. With the increase in the number of vehicles to 189,619 in the year 2013 (Fig. 3), Ratnagiri is now classified as slightly polluted, which raises an alarm for quick pollution abatement measures. No city in Maharashtra remains unpolluted as per the 2013 pollutant data. The clusters obtained using the 2013 air quality data are as follows:

1. Cluster 1 = {} → not polluted
2. Cluster 2 = {Akola, Amravati, Ratnagiri} → slightly polluted
3. Cluster 3 = {Aurangabad, Jalgaon, Kolhapur, Sangli} → moderately polluted
4. Cluster 4 = {Jalna, Latur, Nagpur, Nashik, Solapur} → highly polluted
5. Cluster 5 = {Chandrapur, Mumbai, Pune} → very highly polluted

Table 1 Air quality description of the clusters by inference
Fig. 4 Ratnagiri pollution load forecast

Mumbai and Pune fall into the very highly polluted cluster owing to very high concentrations of NOX and PM10. The increase in PM10 is attributed to the alarming growth in the number of vehicles, which is the main cause of the deteriorating air quality. Figure 4 forecasts the pollution load of Ratnagiri city, which was in the not polluted cluster in 2009 but has now moved into the slightly polluted cluster. The forecast graph shows that the condition will worsen further if pollution abatement measures are not implemented.

Fig. 5 a Mumbai city pollution load forecast. b New York City pollution load forecast

Pollution forecast and application of GA-ZD method

Since the majority of cities in India are polluted, it is necessary to visualize the future air pollution scenario in a few typical cities. There has been a rapid increase in the vehicular population, resulting in increased pollution, especially in cities. The pollution forecast for the selected cities is presented using four methods, viz. fuzzy time series, arithmetic increase, incremental increase, and geometric increase.

Pollution load forecast

Figure 5a, b depicts the pollution load forecast for Mumbai and New York using four forecasting methods (http://scetcivil.weebly.com/uploads/5/3/9/5/5395830/m5_l5-population_forecasting.pdf). Chen's method of forecasting makes use of fuzzy time series (Chen 2002; Chen and Hsu 2004). The vehicular pollution forecast for Mumbai shows that the pollution load is increasing linearly, has already reached an alarming level, and warrants strict pollution abatement measures. New York State has already adopted strict pollution norms for all types of vehicles, and the graph accordingly shows that the pollution is steady and well under control. The arithmetic increase forecast shows a linear rise in pollution, whereas the geometric increase forecast shows a sudden rise in the pollution load for Mumbai city.
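
As a rough illustration of the three increase-based methods (the standard forecasting formulas from the cited lecture notes, applied here to pollution load; Chen's fuzzy time-series method is more involved and is not reproduced), a sketch in Python:

```python
import numpy as np

def forecast_increase(history, n_steps):
    """Forecast future values from past observations using the arithmetic,
    geometric, and incremental increase methods."""
    p = np.asarray(history, dtype=float)
    inc = np.diff(p)                        # successive increments
    mean_inc = inc.mean()                   # average arithmetic increment
    mean_rate = (inc / p[:-1]).mean()       # average growth rate per period
    mean_inc_of_inc = np.diff(inc).mean()   # average incremental increase
    n = np.arange(1, n_steps + 1)
    return {
        "arithmetic":  p[-1] + n * mean_inc,
        "geometric":   p[-1] * (1 + mean_rate) ** n,
        "incremental": p[-1] + n * mean_inc + n * (n + 1) / 2 * mean_inc_of_inc,
    }

# Example with made-up annual pollution-load values:
# forecast_increase([100, 112, 126, 142], n_steps=3)
```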

Fig. 6 Flowchart for genetic algorithm

GA-ZD approach

Zadeh–Deshpande approach

The genetic algorithm and Zadeh–Deshpande (GA-ZD) method is the application of fuzzy logic and genetic algorithms to air quality classification. The ZD approach describes air quality directly in linguistic terms with a linguistic degree of certainty attached to each description. The identification of the air quality parameters, the number of sampling stations, and the time and frequency of observations is crucial and is invariably based on the experts' knowledge base. The approach is summarized below.

The first step is the collection of pollution parametric data from the sampling locations. After fitting an appropriate probability distribution to the data, the mean and variance can be used in further analysis. When the data are inadequate for fitting a statistical distribution, the well-documented bootstrap method can be employed. Experience has shown that most air quality parametric data, including data obtained by bootstrapping, follow a normal distribution; the bootstrap mean and standard deviation are used when plotting the distribution. Because we intend to match the probability distribution with the fuzzy sets drawn for the selected parameters (i.e., possibility distributions), it is necessary to transform the probability distribution into a possibility distribution for all the criteria pollutants using the concept of the Convex Normalized Fuzzy Number (CNFN).

If xj is the point on the parametric domain (say, of the normal distribution) at which p(xj) is maximum, then define the function μA(xi) as:

$$ \mu_A\left(x_i\right)=p\left(x_i\right)/p\left(x_j\right) $$
(5)

This transforms the random variable into a CNFN with membership grade function μA(x), thereby characterizing the dynamic behavior of the air quality parameter as a possibility distribution.
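
A minimal sketch of this probability-to-possibility transformation for a normally distributed pollutant (an illustration only; the bootstrap mean and standard deviation shown are hypothetical, and SciPy's norm is used for the density):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical bootstrap estimates for a pollutant concentration (ug/m3)
mu, sigma = 150.0, 25.0

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 401)   # discretized parameter domain
p = norm.pdf(x, loc=mu, scale=sigma)                   # fitted probability density

# Eq. (5): dividing by the maximum density gives the peak a membership of 1,
# turning the probability distribution into a convex normalized fuzzy number
membership = p / p.max()
```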

Fuzzy sets are generated using the GA for the selected parameters as per the computational framework explained in the "Computing membership functions using genetic algorithm" section. A computational scheme of degree of match (DM) is then used to estimate the matching between the fuzzy sets and the antecedent part of each rule, in order to describe air quality fuzzily with a certain degree of certainty. A set of fuzzy rules is constructed for classifying air quality as very good, good, fair, poor, or very poor in order to aggregate the set of attributes. The degree of match of each classification rule indicates the certainty value of the classification: the greater the degree of match, the greater the possibility that the air quality belongs to that class. The rules are processed using conjunction and disjunction operators. The optimal acceptance strategy is usually the one for which the degree of assertion is maximum (Yadav et al. 2011, 2013, 2014).
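
One common way to quantify the degree of match between an assertion fuzzy set (built from the monitored data) and a rule antecedent fuzzy set is the height of their intersection (sup–min) over a shared discretized domain; the sketch below uses that convention and is only an assumed illustration of the DM scheme detailed in the cited papers:

```python
import numpy as np

def degree_of_match(mu_data, mu_antecedent):
    """Height of the fuzzy intersection (sup-min) of two membership vectors
    defined over the same discretized parameter domain."""
    return float(np.max(np.minimum(mu_data, mu_antecedent)))

def classify_air_quality(assertions, rules):
    """assertions: {pollutant: membership vector derived from the data (CNFN)}
    rules: {label: {pollutant: antecedent membership vector}}
    Rule strength is taken as the conjunction (min) of the per-pollutant DMs."""
    strength = {
        label: min(degree_of_match(assertions[p], antecedent[p]) for p in antecedent)
        for label, antecedent in rules.items()
    }
    best = max(strength, key=strength.get)   # description with the highest certainty
    return best, strength
```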

Genetic algorithm

Genetic algorithms were invented by John Holland in 1975 to mimic some of the processes observed in natural selection. The basic concepts of GAs come from Darwin’s theory of evolution, which describes the law of competition and natural selection. By performing the basic operations reproduction, crossover, and mutation, a GA can select better chromosomes from the current population for a specific problem. After the evolution process, a better next generation can be obtained. Repeating the above process, finally, we can obtain a near-optimal chromosome to deal with a specific problem. In a GA, we encode the parameters of a problem into a numerical string, where the numerical string is called a chromosome. Each element in a chromosome is called a gene. The fitness function is used to evaluate the degree of fitness of a chromosome. The larger the fitness value of a chromosome, the higher the probability of the chromosome to contribute one or more offspring in the next generation. The data type of the genes in a chromosome may be real, integer, or binary values depending on the specific problem (Goldberg 1989; Man et al. 1996; Mitchell 1998).

The initial population is formed by a set of randomly generated members. Each generation consists of members whose constituents are the individual design variables that characterize a design and these are embedded in a binary string. Each member is evaluated using the objective function and is assigned a fitness value, which is an indication of the performance of the member relative to the other members in the population. A biased selection depending on the fitness value decides which members are to be used for producing the next generation. The selected strings are the parents for the next generation, which evolves from the use of two genetic operators namely crossover and mutation.

These operators give a random displacement to the parent population and generate a new population of designs. The crossover operator takes two parent strings, splits them at a random location, and swaps the sub-strings formed; a probability of crossover determines whether a crossover is performed. The mutation operator inverts a bit in the string depending on the probability of mutation. The new strings formed are evaluated, and the iteration continues until a maximum number of generations has been reached or a user-defined termination criterion has been met. Figure 6 shows the sequence of steps in a basic genetic algorithm. The control parameters that have to be specified initially are the population size, the crossover and mutation probabilities, the maximum number of generations, and the termination criterion. Chen and Chen (2010) proposed a method to generate fuzzy rules from training data using genetic algorithms: the training data are first divided into several clusters by the weighted-distance clustering method and a fuzzy rule is generated for each cluster, after which genetic algorithms are used to tune the membership functions of the generated rules. Their method attains a higher average classification accuracy rate than existing methods. Kim et al. (2001) used genetic algorithms to automatically determine near-optimal rules and membership functions for fuzzy traffic controllers.

Fig. 7 a Mumbai (India). b New York (USA)

Reproduction (selection) operator

Chromosomes with larger fitness values have a greater chance of producing a better next generation. In the reproduction process, a GA selects chromosomes with larger fitness values into a mating pool. The fitness function quantifies the optimality of a solution so that a particular solution may be ranked against all other solutions; it reflects how close the given solution is to the desired result. The degree of fitness of each chromosome in a population is computed using the fitness function. Two commonly used reproduction operations are roulette wheel selection and tournament selection.
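
A minimal sketch of roulette wheel selection (an assumed illustration; fitness values are taken to be non-negative):

```python
import numpy as np

def roulette_wheel_select(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    fitness = np.asarray(fitness, dtype=float)
    return population[rng.choice(len(population), p=fitness / fitness.sum())]
```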

Crossover (recombination) operator

Crossover is a genetic operator that combines (mates) two chromosomes (parents) to produce a new chromosome (offspring). The idea behind crossover is that the new chromosome may be better than both parents if it takes the best characteristics from each. Crossover selects genes from the parent chromosomes and creates offspring. There are three common crossover operations: one-point, two-point, and uniform crossover. The genes of a chromosome are independent and have the same chance of taking part in the crossover operation.
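
A sketch of one-point crossover on two equal-length bit strings (an assumed illustration):

```python
def one_point_crossover(parent1, parent2, rng):
    """Split both bit strings at the same random point and swap the tails."""
    point = int(rng.integers(1, len(parent1)))   # crossover location
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]
```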

Mutation operator

After crossover is performed, mutation takes place. Mutation is an operator used to maintain genetic diversity from one generation of a population of chromosomes to the next. Mutation occurs during evolution according to a user-definable mutation probability (Pm), usually set to a fairly low value; 0.01 is a good first choice. Mutation alters one or more gene values in a chromosome from their initial state, which can result in entirely new gene values being added to the gene pool. With these new gene values, the genetic algorithm may be able to arrive at better solutions than were previously possible. Mutation is an important part of genetic search that helps prevent the population from stagnating at local optima; it is intended to keep the search from falling into a local optimum of the state space. The flip-bit mutation operator simply inverts the value of the chosen gene (Goldberg 1989; Man et al. 1996; Mitchell 1998).
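
A sketch of flip-bit mutation applied independently to each gene (an assumed illustration):

```python
def flip_bit_mutation(chromosome, p_m, rng):
    """Invert each bit of the string independently with probability p_m (e.g., 0.01)."""
    return "".join(("1" if bit == "0" else "0") if rng.random() < p_m else bit
                   for bit in chromosome)
```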

Computing membership functions using genetic algorithm

Genetic algorithms can be used to compute membership functions (Karr and Gentry 1993). Given some functional mapping for a system, some membership functions and their shapes are assumed for the various fuzzy variables defined for a problem. These membership functions are then coded as bit strings that are then concatenated. An evaluation (fitness) function is used to evaluate the fitness of each set of membership functions (parameters that define the functional mapping).

In a genetic algorithm, the parameter set of the problem is coded as a finite string of bits. The bit strings are combinations of zeros and ones, which represent the value of a number in binary form. An n-bit string can accommodate all integers up to the value 2^n − 1. For example, the number 7 requires a 3-bit string, i.e., 2^3 − 1 = 7, and the bit string would look like "111", where the first digit is in the 2^2 place (=4), the second in the 2^1 place (=2), and the last in the 2^0 place (=1); hence, 4 + 2 + 1 = 7. The number 10 would look like "1010" from a 4-bit string, i.e., 2^3 + 2^1 = 10. This bit string may be mapped to the value of a parameter, say Ci, i = 1, 2, by the mapping:

$$ C_i=C_{\min_i}+\frac{b}{2^L-1}\left(C_{\max_i}-C_{\min_i}\right) $$
(6)

where b is the number in decimal form that is being represented in binary form (e.g., 152 may be represented in binary form as 10011000), L is the length of the bit string (i.e., the number of bits in each string), and Cmax and Cmin are user-defined constants between which C1 and C2 vary linearly. The parameters C1 and C2 depend on the problem.
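
A sketch of this decoding step (the bounds in the example are hypothetical, not values from the study):

```python
def decode(bits, c_min, c_max):
    """Eq. (6): map an L-bit string linearly onto the interval [c_min, c_max]."""
    b = int(bits, 2)                 # decimal value of the binary string
    L = len(bits)
    return c_min + b / (2 ** L - 1) * (c_max - c_min)

# "10011000" has decimal value 152, so this 8-bit string lands 152/255 of the
# way between the bounds: decode("10011000", 0.0, 100.0) -> about 59.6
```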

First, an initial population of n strings (for n parameters) of length L is created. The strings are created in a random fashion, i.e., the values of the parameters that are coded in the strings are random values (created by randomly placing the zeros and ones in the strings). Each of the strings is decoded into a set of parameters that it represents. This set of parameters is passed through a numerical model of the problem space. The numerical model gives out a solution based on the input set of parameters. On the basis of the quality of this solution, the string is assigned a fitness value. The fitness values are determined for each string in the entire population of strings. With these fitness values, the three genetic operators are used to create a new generation of strings, which is expected to perform better than the previous generations (better fitness values). The new set of strings is again decoded and evaluated, and a new generation is created using the three basic genetic operators. This process is continued until convergence is achieved within a population.

Among the three genetic operators, reproduction is the process by which strings with better fitness values receive correspondingly more copies in the new generation, i.e., we try to ensure that better solutions persist and contribute to better offspring (new strings) during successive generations. This is a way of ensuring the “survival of the fittest” strings. Because the total number of strings in each generation is kept constant (for computational economy and efficiency), strings with lower fitness values are eliminated.

The second operator, crossover, is the process in which the strings are able to mix and match their desirable qualities in a random fashion. After reproduction, crossover proceeds in three simple steps. First, two new strings are selected at random. Second, a random location in both strings is selected. Third, the portions of the strings to the right of the randomly selected location in the two strings are exchanged. In this way, information is exchanged between strings, and portions of high-quality solutions are exchanged and combined. Reproduction and crossover together give genetic algorithms most of their searching power. The third genetic operator, mutation, helps to increase the searching power (Ross 2009).
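
Putting the pieces together, a compact sketch of a GA run for tuning membership-function parameters (a minimal illustration that reuses the decode, roulette_wheel_select, one_point_crossover, and flip_bit_mutation sketches above; the bounds, GA settings, and toy fitness function are placeholders, not the values used in the study):

```python
import numpy as np

# Assumes decode, roulette_wheel_select, one_point_crossover, and
# flip_bit_mutation from the earlier sketches are already defined.
rng = np.random.default_rng(0)

L_BITS = 8                                   # bits per encoded parameter
BOUNDS = [(0.0, 100.0), (100.0, 300.0)]      # hypothetical parameter ranges
POP_SIZE, GENERATIONS, P_C, P_M = 20, 50, 0.8, 0.01

def decode_chromosome(chrom):
    return [decode(chrom[i * L_BITS:(i + 1) * L_BITS], lo, hi)
            for i, (lo, hi) in enumerate(BOUNDS)]

def fitness(chrom):
    # Toy objective: in the study, fitness would measure how well the fuzzy
    # sets built from these parameters fit the parametric/expert data.
    a, b = decode_chromosome(chrom)
    return 1.0 / (1.0 + abs((b - a) - 120.0))

population = ["".join(rng.choice(list("01"), len(BOUNDS) * L_BITS))
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scores = [fitness(chrom) for chrom in population]
    offspring = []
    while len(offspring) < POP_SIZE:
        p1 = roulette_wheel_select(population, scores, rng)
        p2 = roulette_wheel_select(population, scores, rng)
        if rng.random() < P_C:
            p1, p2 = one_point_crossover(p1, p2, rng)
        offspring += [flip_bit_mutation(p1, P_M, rng),
                      flip_bit_mutation(p2, P_M, rng)]
    population = offspring[:POP_SIZE]

print(decode_chromosome(max(population, key=fitness)))   # best decoded parameters
```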

Describing air quality

The case study relates to describing air quality in linguistic terms using parametric data for Mumbai city in India and New York City in the USA (Fig. 7a, b). The Maharashtra Pollution Control Board (MPCB) monitors three pollutants, viz. particulate matter (PM10), oxides of nitrogen (NOX), and oxides of sulfur (SOX) (http://www.mpcb.gov.in/, retrieved on 1/10/2013), which were considered for the linguistic description of air quality in Mumbai at two locations, viz. Sion and Bandra. New York City monitors carbon monoxide (CO), particulate matter (PM2.5), and ozone (O3) (http://www.dec.ny.gov/airmon/, retrieved on 10/10/2013), which were considered to describe air quality in New York at four locations, viz. City College New York (CCNY), Division Street, Public School (PS)-19, and International School (IS)-143. The parametric data were taken for the winter month of November 2013 for both cities. A computational scheme of degree of match (DM) is used to estimate the matching between the assertion and the antecedent part of the fuzzy rules, in order to describe air quality fuzzily with a linguistic degree of certainty.

The reliability of the monthly mother sample data of the pollutants and the identification of the worst winter month were ensured using bootstrapping: one thousand bootstrap samples were drawn, and the mean and standard error of the original and bootstrap samples were compared. In the present study, the bootstrap mean for the month of November 2013 for all the pollutants monitored in Mumbai and New York is used for further analysis. CNFNs were constructed for the parametric data sets, and a linguistic description of air quality was obtained from the experts. The GA was applied to construct fuzzy sets describing the interval of confidence for the terms very good, good, fair, poor, and very poor.

Tables 2 and 3 give the DM between the parametric data of the pollutants computed at the monitoring stations in New York and Mumbai and the GA-generated fuzzy sets in linguistic hedges for these parameters. For example, at the Sion station, the highest DMs for SOX, NOX, and PM10 are 1, 1, and 0.96 for the linguistic descriptions very good, very poor, and very poor, respectively, while for the Bandra station the highest DM is 1 for SOX, NOX, and PM10 for the linguistic descriptions very good, poor, and very poor, respectively. In the case of New York, the highest DM at station CCNY is 1 for all the pollutants (O3, PM2.5, and CO) for the linguistic description very good.

Table 2 Degree of match—New York
Table 3 Degree of match—Mumbai

Based on the parametric data and the fuzzy sets generated by the GA (Table 4), it can be concluded that the air quality in Mumbai is generally described as very poor. Table 4 describes air quality linguistically, with a linguistic degree of certainty, using the GA-generated fuzzy sets. The air quality at all the monitoring stations in New York is very good with a very high degree of certainty. The air quality at Sion in Mumbai can be directly described as very poor with a very high (0.95) degree of certainty. The next highest value, 0.78 for poor, indicates that the air quality at Sion is definitely deteriorating, as this DC exceeds the 0.46 obtained for the linguistic hedge fair. In the present scenario, the air quality descriptions of both cities using the expert fuzzy sets and the GA-generated fuzzy sets are the same; there is no change in the final description of air quality. Thus, a genetic algorithm may be the better choice only where domain experts on air quality are unavailable.

Table 4 Linguistic description of air quality and degree of certainty

The lower part of Table 4 describes air quality linguistically, with a linguistic degree of certainty, using fuzzy sets based on the perceptions of the domain experts. Using the expert fuzzy sets, the air quality in New York can be described linguistically as very good with a very high degree of certainty (1). The air quality at Sion in Mumbai can be directly described as very poor with a very high (0.96) degree of certainty. The next highest value, 0.46 for poor, indicates that the air quality at Sion is definitely deteriorating, as this DC exceeds the 0.23 obtained for the linguistic hedge fair. It is pertinent to mention that the fuzzy logic approach attaches certainty values to the linguistic terms while describing air quality, whereas the conventional method first computes a numeric air quality index and only then describes it in linguistic terms, which is a departure from human thinking.

The high concentration of PM10 in Mumbai's air can be attributed to heavy vehicular traffic, construction activities, cement roads, untarred roads, etc. Low temperatures and weak vertical dispersion during the winter months increase the concentration of PM10. Limited horizontal air movement in summer also leads to localization of PM10: with a lack of convection winds, particulate matter remains concentrated in the area where it is generated and is not dispersed. The increase in the number of vehicles has led to alarming emission levels as vehicles are caught at intersections and in the usual snarls during peak rush hours, and the flyovers constructed for fast-moving vehicles have only added auto exhaust emissions. Industrial units located in the cities add oxides of nitrogen, SOX, and particulate matter. The authors believe that the situation will deteriorate further if the present trend of plying ever more polluting vehicles and of construction activity in Mumbai continues. From Table 5, it can be inferred that the pollution in New York City is within limits. The average values of NO2 and PM10 for Sion are 177.78 and 151.26 μg/m3, respectively, which are higher than the permissible 24-h average values of 80 and 100 μg/m3 given in Table 6. Tables 6 and 7 present the national ambient air quality standards applicable to Mumbai and New York, respectively. In New York, the air quality can be described as very good with a very high degree of certainty.

Table 5 New York and Mumbai: air quality criteria pollutants: basic statistics
Table 6 Air quality standards established by CPCB
Table 7 Air quality standards established by US-EPA

Recognizing that unintentional carbon monoxide poisoning is a serious but preventable environmental health threat, New York City enacted a law in 2004 requiring CO alarms in residential and many public buildings and updated the Health Code to make CO poisoning immediately reportable by telephone to the Department of Health and Mental Hygiene/NYC Poison Control Center.

New York City has also initiated strict pollution control norms. Its air quality has improved over time as federal, state, and local air quality standards have become more stringent over the last two decades. Federal and state regulatory efforts to reduce emissions from the transportation, off-road, and stationary source sectors have driven continued national improvements in air quality. As required by the federal Clean Air Act, the US Environmental Protection Agency (US-EPA) sets standards for particulate matter (PM2.5), nitrogen dioxide (NO2), and sulfur dioxide (SO2) emissions from large fossil-fuel combustion sources and also designates areas that fail to meet the health-based air quality standards established by the agency. Most recently, between 2006 and 2010, the US-EPA phased in stringent emissions standards for heavy-duty diesel vehicles such as trucks and for gasoline passenger vehicles, while also reducing the sulfur content of diesel fuel and gasoline. These new emissions standards, once fully implemented, are expected to significantly reduce air pollution from key transportation sectors and their associated public health impacts, thus contributing to improved air quality in New York City.

From the air quality status estimated by the combined GA-ZD approach, it can be concluded that the air quality in Mumbai is very poor with a very high degree of certainty, whereas that of New York is very good with a very high degree of certainty. The air pollution authorities in Mumbai, India, will have to enforce stringent pollution control norms, as the vehicle population there is growing at an alarming rate. Petrol/diesel-driven vehicles should be discouraged, and the policy of plying nonpolluting vehicles should be implemented.

Software support

A system for the linguistic description of air quality, including web-based air quality description software, has been developed. The website takes as input the concentrations of three pollutants, viz. NOX, SOX, and PM10, in μg/m3 and displays as output a facial expression describing the status of the air. The software is based on the fuzzy-logic formalism of the ZD approach, which models the aleatory and epistemic uncertainty that the conventional air quality index method does not take care of.

The air quality computations are performed by web-based software developed on the Windows operating system platform. The web server used is Apache or Nginx. MySQL is used as the relational database management system in a client–server architecture; the database server (MySQL) and the client (application program) communicate with each other for querying data, saving changes, etc. The technologies used for development are PHP5 (Hypertext Preprocessor), a server-side scripting language designed for web development; HyperText Markup Language (HTML), the main markup language for creating web pages and information to be displayed in the web browser; Cascading Style Sheets (CSS), a style sheet language for describing the presentation semantics of a document written in HTML; jQuery, a multi-browser JavaScript library designed to simplify client-side scripting of HTML; and JavaScript (JS), an interpreted programming language implemented as part of the web browser so that client-side scripts can interact with the user, control the browser, communicate asynchronously, and alter the document content. The code comprises 22 PHP files with approximately 1400 lines of code, 75 JS files with 3500 lines of code, and 822 CSS files with 1000 lines of code. Approximately 2 min is needed only for data entry, which involves inputting the three pollutant values; the final output, which describes air quality linguistically with a linguistic degree of certainty, is obtained at the click of a button. The site address for the software is http://www.jyotiyadav.in/. No tall claims are made, and the website can be modified to include more air pollutants and expert fuzzy sets for those pollutants.

Conclusion and future scope for research

The case studies presented demonstrate the strength of the fuzzy set theoretic approach, in combination with a genetic algorithm, in directly describing air quality in linguistic terms with a linguistic degree of certainty attached to each term. A close look at this limited study shows that the fuzzy sets drawn on the basis of the expert knowledge base are not much different from those obtained using the genetic algorithm.

The classification of air quality monitoring stations using the Fuzzy C mean clustering algorithm in association with the Zadeh–Deshpande approach could be a better method for decision making when initiating pollution abatement action plans. The authors have demonstrated the concept of a reference group and computed the weight factor value needed in Fuzzy C mean clustering. Having said this, there is a future need to apply other soft computing techniques to air quality modeling, pollution assessment, pollution risk, and the suggestion of pollution mitigation measures. There is a long way to go.