Fuzzy Similarity-Based Hierarchical Clustering for Atmospheric Pollutants Prediction

Camastra, F.; Ciaramella, A.; Son, L. H.; Riccio, A.; Staiano, A.

doi:10.1007/978-3-030-12544-8_10


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11291)

Included in the following conference series:

International Workshop on Fuzzy Logic and Applications


Abstract

This work focuses on model selection in a multi-model air quality ensemble system. The models are operational long-range transport and dispersion models used for the real-time simulation of pollutant dispersion or the accidental release of radioactive nuclides in the atmosphere. In this context, a methodology based on temporal hierarchical agglomeration is introduced. It uses fuzzy similarity relations combined by a transitive consensus matrix. The methodology is adopted to identify a subset of models that best characterizes the predicted atmospheric pollutants from the ETEX-1 experiment and to discard redundant information.


1 Introduction

The real-time simulation of pollutant dispersion or the accidental release of radioactive substances in the atmosphere is a challenging task for many national services and agencies. In particular, releases of harmful radionuclides (e.g. Fukushima, Chernobyl) could be simulated and monitored [1, 10, 13, 20]. In this work we consider atmospheric compounds from the ENSEMBLE system [6,7,8]. ENSEMBLE is a web-based system that assists the analysis of multi-model data provided by many national meteorological services and environmental protection agencies worldwide. It is worth noting that, in a multi-model ensemble for atmospheric dispersion, the models are to some degree dependent on one another through several intrinsic mechanisms (e.g., they often share features, initial/boundary data, numerical methods, parameterizations and emissions). For this reason, results obtained by ensemble analysis may lead to erroneous interpretations, and in a multi-model approach the effective number of models may be lower than the total number, since models could be linearly (or nonlinearly) dependent on each other.

To solve this problem, a number of techniques have been proposed in the literature. In [15, 17, 18] the authors present a statistical analysis (i.e., Bayesian Model Averaging) for combining predictive distributions from different sources of a multi-model ensemble, and in [16] some basic properties of multi-model ensemble systems are investigated. Moreover, cluster-based approaches have also been proposed [2,3,4]. In this paper, we introduce a methodology that improves the forecasting by considering observations that may become available during the course of the event. The methodology is based on fuzzy similarity relations that make it possible to combine multiple hierarchical agglomerations, one for each forecast lead time. From the overall temporal agglomeration obtained through a consensus matrix, it is possible to select a subset of models and discard redundant information.

The remainder of the paper is organized as follows. In Sect. 2 the proposed methodology is detailed. In particular, some fundamental concepts on t-norms and fuzzy similarity relations are given in Sect. 2.2, and the agglomerative approach is described in Sect. 2.3. In Sect. 3 some experimental results, obtained by applying this methodology to an ensemble of prediction models, are described. Conclusions and future remarks are given in Sect. 4.

2 Fuzzy Similarity and Agglomerative Clustering

In general, when one deals with clustering tasks, fuzzy logic makes it possible to obtain a soft clustering of the data instead of a hard (crisp or non-fuzzy) one. Hierarchical clustering is a methodology for cluster analysis that seeks to build a hierarchy of clusters, and it can be agglomerative or divisive. In this work we consider an agglomerative clustering approach. One of the main aspects of this methodology is the use of a measure of dissimilarity between sets of observations, based on an appropriate metric. A dendrogram, in turn, is a tree diagram used to illustrate the results produced by hierarchical clustering. In the following, we show that a dendrogram can be associated with a fuzzy equivalence relation based on Łukasiewicz valued fuzzy similarities. Subsequently, a consensus matrix, which carries the representative information of all dendrograms, is obtained by combining multiple temporal hierarchical agglomerations of dispersion models. The main steps of the proposed approach are

1. Membership function characterization;
2. Fuzzy Similarity Matrix calculation (or dendrogram) for all the models at a fixed time;
3. Consensus matrix construction for temporal hierarchical agglomerations.

2.1 Membership Functions

A key feature of fuzzy logic is the transformation of linguistic variables into fuzzy sets. Fuzzification is the process of changing a real scalar value into a fuzzy value, and it is achieved by using different types of membership functions. The membership function represents the degree of truth to which a given input belongs to a fuzzy set. In the proposed approach, fuzzy sets are described by the following membership functions [21]

$$\begin{aligned} \mu (\mathbf{x}_i) = \frac{\mathbf{x}_i - \min (\mathbf{x}_i)}{\max (\mathbf{x}_i) - \min (\mathbf{x}_i)}, \end{aligned}$$

(1)

where $\mathbf{x}_i = [x_1^i, x_2^i, \ldots , x_L^i]$ is the i-th observation vector of the L considered models.
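As a rough illustration of Eq. 1, the following sketch rescales one observation vector to membership values in [0, 1]. The function name `fuzzify`, the sample concentrations, and the handling of the degenerate case where all models agree (Eq. 1 is undefined there) are assumptions made for this example only.

```python
def fuzzify(x):
    """Min-max membership values (Eq. 1) for one observation vector x,
    i.e. the values given by the L models for the same observation."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: Eq. 1 is undefined when all models agree; return zeros.
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]


# Hypothetical concentrations predicted by four models at one location/time.
concentrations = [2.0, 4.0, 10.0, 6.0]
mu = fuzzify(concentrations)
print(mu)  # [0.0, 0.25, 1.0, 0.5]
```

The smallest prediction maps to 0 and the largest to 1, so only the relative position of each model within the ensemble is retained.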

2.2 Fuzzy Similarity

We observe that fuzzy sets can be combined via conjunction and disjunction operations, for which continuous triangular norms and co-norms are adopted, respectively. A triangular norm (t-norm for short) is a binary operation t on the unit interval [0, 1], i.e., a function $t:[0,1]^2 \rightarrow [0,1]$ that satisfies the following four axioms for all $x,y,z \in [0,1]$ [11]

$$\begin{aligned} \begin{array}{lccrr} t(x,y) &{}=&{} t(y,x) &{} &{}{ ({ commutativity})}\\ \\ t(x,t(y,z)) &{}=&{} t(t(x,y),z) &{} &{} {({ associativity})} \\ \\ t(x,y) &{} \le &{} t(x,z) &{} \text {whenever } y \le z &{}{({ monotonicity})}\\ \\ t(x,1) &{}=&{} x &{} &{}{({\textit{boundary condition}})} \end{array} \end{aligned}$$

(2)

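In practical situations the following four basic t-norms are considered [11]: the minimum $t_\mathbf{M}$, the product $t_\mathbf{P}$, the Łukasiewicz t-norm $t_\mathbf{L}$ and the drastic product $t_\mathbf{D}$

$$\begin{aligned} \begin{array}{lcl} t_\mathbf{M}(x,y) &{}=&{} \min (x,y) \\ t_\mathbf{P}(x,y) &{}=&{} x \cdot y \\ t_\mathbf{L}(x,y) &{}=&{} \max (x + y - 1,0) \\ t_\mathbf{D}(x,y) &{}=&{} \left\{ \begin{array}{ll} 0 &{} \text {if } (x,y) \in [0,1)^2 \\ \min (x,y) &{} \text {otherwise} \end{array} \right. \end{array} \end{aligned}$$

(3)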

In recent years, however, several parametric and non-parametric t-norms have been introduced [11] and generalized versions have also been studied [5]. In the following, we focus on the properties of the Łukasiewicz t-norm ($t_\mathbf{L}$). One of the main operators adopted in fuzzy-based systems (e.g., fuzzy inference systems) is the residuum $\rightarrow _t$

$$\begin{aligned} x \rightarrow _t y = \bigvee \left\{ z | t(z,x) \le y \right\} \end{aligned}$$

(4)

where $\bigvee $ denotes the supremum (join). For the left-continuous basic t-norm $t_\mathbf{L}$, the residuum is given by

$$\begin{aligned} x \rightarrow _\mathbf{L} y = \min (1 - x + y, 1) \end{aligned}$$

(5)

Moreover, we also note that letting p be a fixed natural number in a generalized Łukasiewicz structure, we obtain

$$\begin{aligned} \begin{array}{lcc} t_\mathbf{L}(x,y) &{}=&{} \root p \of {\max (x^p + y^p - 1,0)} \\ x \rightarrow _\mathbf{L} y &{}=&{} \min (\root p \of {1 - x^p + y^p},1) \\ \end{array} \end{aligned}$$

(6)
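The two operations of Eq. 6 can be sketched in a few lines. The function names `t_luk` and `res_luk` and the test grid are illustrative assumptions; the check at the end only verifies, for $p=1$ on a small grid, the adjointness property that Eq. 4 encodes, namely $t(z,x) \le y$ exactly when $z \le (x \rightarrow y)$.

```python
def t_luk(x, y, p=1):
    """Generalized Łukasiewicz t-norm (Eq. 6); p = 1 gives max(x + y - 1, 0)."""
    return max(x**p + y**p - 1.0, 0.0) ** (1.0 / p)


def res_luk(x, y, p=1):
    """Generalized Łukasiewicz residuum (Eq. 6); p = 1 gives min(1 - x + y, 1)."""
    return min((1.0 - x**p + y**p) ** (1.0 / p), 1.0)


print(t_luk(0.75, 0.5))    # 0.25
print(res_luk(0.75, 0.5))  # 0.75

# Adjointness check for p = 1 on a grid of exactly representable values.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
adjoint = all((t_luk(z, x) <= y) == (z <= res_luk(x, y))
              for x in grid for y in grid for z in grid)
print(adjoint)
```

The grid uses multiples of 0.25 so that the comparison is not affected by floating-point rounding; for arbitrary inputs a small tolerance would be advisable.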

Another fundamental operation on a residuated lattice is the bi-residuum that will be used for our construction of the fuzzy similarities. It is defined as

$$\begin{aligned} x \leftrightarrow _t y = (x \rightarrow _t y) \wedge (y \rightarrow _t x), \end{aligned}$$

(7)

where $\wedge $ is the meet. In the case of the left-continuous basic t-norm $t_\mathbf{L}$, we obtain the following bi-residuum

$$\begin{aligned} x \leftrightarrow _\mathbf{L} y = 1 - \max (x,y) + \min (x,y) \end{aligned}$$

(8)
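For completeness, Eq. 8 can be recovered by substituting the Łukasiewicz residuum (Eq. 6 with $p=1$, that is $x \rightarrow _\mathbf{L} y = \min (1 - x + y, 1)$) into Eq. 7. The short derivation below is a worked check, not part of the original presentation.

$$\begin{aligned} x \leftrightarrow _\mathbf{L} y &= \min (1 - x + y, 1) \wedge \min (1 - y + x, 1) \\ &= \min (1 - x + y,\; 1 - y + x,\; 1) \\ &= 1 - |x - y| \\ &= 1 - \max (x,y) + \min (x,y) \end{aligned}$$

The third line holds because one of $1 - x + y$ and $1 - y + x$ is at least 1, while the other equals $1 - |x - y| \le 1$.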

On the other hand, a binary fuzzy relation R is defined on $U \times V$ as a fuzzy set on $U \times V$ ($R \subseteq U \times V$). A similarity matrix is a fuzzy relation $S \subseteq U \times U$ such that, for each $u,v,w \in U$, the following properties are satisfied

$$\begin{aligned} \begin{array}{lccr} S \langle u,u \rangle &{}=&{} 1 &{} {({\textit{everything is similar to itself}})}\\ \\ S \langle u,v \rangle &{}=&{} S \langle v,u \rangle &{} {({ symmetric})} \\ \\ t(S \langle u,v \rangle , S \langle v,w \rangle )&{} \le &{} S \langle u,w \rangle &{} {({\textit{weakly transitive}})}\\ \end{array} \end{aligned}$$

(9)

It is essential to observe that from fuzzy sets with membership functions $\mu : X \rightarrow [0,1]$, a fuzzy similarity matrix S can be generated as

$$\begin{aligned} S \langle a,b \rangle = \mu (a) \leftrightarrow _t \mu (b) \end{aligned}$$

(10)

for all $a,b \in X$.

Moreover, the fuzzy similarity matrix is built by means of the following result [19, 21]

Proposition 1

Consider n Łukasiewicz valued fuzzy similarities $S_i$, $i=1,\ldots ,n$ on a set X. Then

$$\begin{aligned} S \langle x,y \rangle = \frac{1}{n} \sum _{i=1}^{n} S_i\langle x,y \rangle \end{aligned}$$

(11)

is a Łukasiewicz valued fuzzy similarity on X.

In this work, we consider for Eq. 11

$$\begin{aligned} S_i\langle x,y \rangle = x \leftrightarrow _\mathbf{L} y. \end{aligned}$$

(12)
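A minimal sketch of Eqs. 11 and 12 follows: the similarity between two models is the average, over the observations, of the Łukasiewicz bi-residuum (Eq. 8) of their membership values. The data layout (`memberships[i][a]` is the membership of model `a` for observation `i`), the function names and the sample values are assumptions for illustration.

```python
def biresiduum(x, y):
    """Łukasiewicz bi-residuum (Eq. 8), equal to 1 - |x - y|."""
    return 1.0 - max(x, y) + min(x, y)


def similarity_matrix(memberships):
    """Average of the per-observation bi-residua (Eqs. 11 and 12).

    memberships[i][a] is the membership value of model a for observation i.
    """
    n = len(memberships)
    n_models = len(memberships[0])
    return [[sum(biresiduum(row[a], row[b]) for row in memberships) / n
             for b in range(n_models)]
            for a in range(n_models)]


# Hypothetical membership values: 2 observations, 3 models.
mu = [[0.0, 0.25, 1.0],
      [0.5, 0.5, 1.0]]
S = similarity_matrix(mu)
for row in S:
    print(row)
# [1.0, 0.875, 0.25]
# [0.875, 1.0, 0.375]
# [0.25, 0.375, 1.0]
```

In this toy case the matrix is reflexive, symmetric and Łukasiewicz-transitive, e.g. $\max (0.875 + 0.375 - 1, 0) = 0.25 \le S \langle 0,2 \rangle$, in line with Proposition 1.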

Finally, with $t_\mathbf{L}$ the Łukasiewicz t-norm, it is worth noting that S is a fuzzy equivalence relation on X with respect to $t_\mathbf{L}$ if and only if $1 - S$ is a pseudo-metric on X.

2.3 Dendrogram and Consensus Matrix

We also have to observe that if a similarity relation is min-transitive ($t=\min $ in (9)) then it is a fuzzy-equivalence relation that can be graphically described by a dendrogram [12]. In other words, transitivity implies the existence of the dendrogram.

The min-transitive closure $R^T$ of R can be obtained as follows [14]

$$\begin{aligned} R^T = \bigcup _{i=1}^{n-1} R^i \end{aligned}$$

(13)

where $R^{i+1}$ is defined as

$$\begin{aligned} R^{i+1} = R^{i} \circ R, \end{aligned}$$

(14)

and n is the dimension of a relation matrix.

Considering two fuzzy relations R and S, we observe that the composition $R \circ S$ is a fuzzy relation defined by

$$\begin{aligned} R \circ S \langle x,y \rangle = \text {Sup}_{z \in X} \{R \langle x,z \rangle \odot S \langle z,y \rangle \} \end{aligned}$$

(15)

$\forall x,y \in X$, where $\odot $ stands for a t-norm (e.g., min operator) [14]. Then we can conclude that the min-transitive closure $R^T$ of a matrix R can be easily computed and the overall process is described in Algorithm 1.
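Equations 13 to 15 translate almost directly into code. The sketch below is an illustration under stated assumptions, not a transcription of Algorithm 1: it uses the min operator as the t-norm in Eq. 15, represents relations as square lists of lists, and takes the union in Eq. 13 as an element-wise maximum. The function names are hypothetical.

```python
def maxmin_compose(R, S):
    """Max-min composition of two fuzzy relations (Eq. 15 with the min t-norm)."""
    n = len(R)
    return [[max(min(R[x][z], S[z][y]) for z in range(n))
             for y in range(n)]
            for x in range(n)]


def min_transitive_closure(R):
    """Element-wise max (union) of the powers R^1 ... R^(n-1) (Eqs. 13 and 14)."""
    n = len(R)
    closure = [row[:] for row in R]
    power = [row[:] for row in R]
    for _ in range(n - 2):  # builds R^2, ..., R^(n-1)
        power = maxmin_compose(power, R)
        closure = [[max(c, p) for c, p in zip(c_row, p_row)]
                   for c_row, p_row in zip(closure, power)]
    return closure


# Hypothetical similarity matrix for three models.
R = [[1.0, 0.875, 0.25],
     [0.875, 1.0, 0.375],
     [0.25, 0.375, 1.0]]
T = min_transitive_closure(R)
for row in T:
    print(row)
# [1.0, 0.875, 0.375]
# [0.875, 1.0, 0.375]
# [0.375, 0.375, 1.0]
```

The entry for the pair (0, 2) rises from 0.25 to 0.375 because the two models are connected through model 1 at level $\min (0.875, 0.375)$. Composing the result with itself leaves it unchanged, which is what min-transitivity requires.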

We also observe that to accomplish an agglomerative clustering a dissimilarity relation is needed. Here we considered the following result [14].

Lemma 1

Letting R be a similarity relation with the elements $R \langle x,y \rangle \in [0,1]$ and letting D be a dissimilarity relation, which is obtained from R by

$$\begin{aligned} D(x,y) = 1 - R \langle x,y \rangle \end{aligned}$$

(16)

then D is ultrametric if and only if R is min-transitive.

In other words, we have a one-to-one correspondence between min-transitive similarity matrices and dendrograms and between ultrametric dissimilarity matrices and dendrograms.
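Lemma 1 can be checked numerically on a small case. The sketch below converts a similarity matrix into the dissimilarity of Eq. 16, tests the ultrametric inequality $D(x,z) \le \max (D(x,y), D(y,z))$, and reads clusters off a min-transitive matrix by thresholding it, which corresponds to cutting the dendrogram at a given level. The function names, the threshold values and the two matrices are assumptions for illustration.

```python
def dissimilarity(R):
    """Dissimilarity D = 1 - R (Eq. 16)."""
    return [[1.0 - r for r in row] for row in R]


def is_ultrametric(D):
    """Check D(x, z) <= max(D(x, y), D(y, z)) for all triples."""
    n = len(D)
    return all(D[x][z] <= max(D[x][y], D[y][z])
               for x in range(n) for y in range(n) for z in range(n))


def alpha_cut_clusters(R, alpha):
    """Clusters at similarity level alpha.

    Assumes R is min-transitive, so that 'R >= alpha' is an equivalence relation.
    """
    clusters, assigned = [], set()
    for x in range(len(R)):
        if x in assigned:
            continue
        members = [y for y in range(len(R)) if R[x][y] >= alpha]
        clusters.append(members)
        assigned.update(members)
    return clusters


# Hypothetical matrices: S is not min-transitive, T is its min-transitive closure.
S = [[1.0, 0.875, 0.25], [0.875, 1.0, 0.375], [0.25, 0.375, 1.0]]
T = [[1.0, 0.875, 0.375], [0.875, 1.0, 0.375], [0.375, 0.375, 1.0]]

print(is_ultrametric(dissimilarity(S)))  # False
print(is_ultrametric(dissimilarity(T)))  # True
print(alpha_cut_clusters(T, 0.5))        # [[0, 1], [2]]
print(alpha_cut_clusters(T, 0.3))        # [[0, 1, 2]]
```

Lowering the threshold merges clusters, which is the agglomerative reading of the dendrogram.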

Finally, after the dendrograms have been obtained at each time, a consensus matrix, that is the representative information of all temporal dendrograms, is obtained by combining the transitive closures by using Eq. 15 (i.e., max-min) [14]. The overall approach is described in Algorithm 2.
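Since the details of Algorithm 2 are not reproduced here, the following is only one plausible reading of the sentence above: the per-time min-transitive closures are folded together with the max-min composition of Eq. 15, and the min-transitive closure is applied once more so that the result again corresponds to a dendrogram. The function names, the folding order and the two input matrices are assumptions.

```python
from functools import reduce


def maxmin_compose(R, S):
    """Max-min composition (Eq. 15 with the min t-norm)."""
    n = len(R)
    return [[max(min(R[x][z], S[z][y]) for z in range(n))
             for y in range(n)]
            for x in range(n)]


def min_transitive_closure(R):
    """Element-wise max of the powers R^1 ... R^(n-1) (Eqs. 13 and 14)."""
    closure, power = [row[:] for row in R], [row[:] for row in R]
    for _ in range(len(R) - 2):
        power = maxmin_compose(power, R)
        closure = [[max(c, p) for c, p in zip(c_row, p_row)]
                   for c_row, p_row in zip(closure, power)]
    return closure


def consensus_matrix(closures):
    """Assumed combination rule: compose all closures, then close again."""
    return min_transitive_closure(reduce(maxmin_compose, closures))


# Hypothetical min-transitive similarity matrices at two lead times.
T_a = [[1.0, 0.875, 0.375], [0.875, 1.0, 0.375], [0.375, 0.375, 1.0]]
T_b = [[1.0, 0.25, 0.25], [0.25, 1.0, 0.75], [0.25, 0.75, 1.0]]
C = consensus_matrix([T_a, T_b])
for row in C:
    print(row)
# [1.0, 0.875, 0.75]
# [0.875, 1.0, 0.75]
# [0.75, 0.75, 1.0]
```

The intermediate composition is not symmetric in general, but for reflexive inputs the final closure is: it coincides with the min-transitive closure of the element-wise maximum of the inputs, so the outcome does not depend on the order in which the time steps are folded.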

3 Experimental Results

This section illustrates some results obtained with the proposed approach. In particular, we consider the multi-model ensemble simulated distributions of the ETEX-1 experiment [9]. The ETEX-1 experiment concerned the release of pseudo-radioactive material on 23 October 1994 at 16:00 UTC from Monterfil, southeast of Rennes (France). Briefly, a steady westerly flow of unstable air masses was present over central Europe. These conditions persisted for the 90 h that followed the release, with frequent precipitation events over the advection area and a slow movement toward the North Sea region. As an example, Fig. 1 shows the integrated concentration 78 h after the release. In the experiment, the main objective of the independent groups worldwide (25 members) was to forecast the observations with different atmospheric dispersion models. Each simulation was based on weather fields generated (most of the time) by different Global Circulation Models (GCM), and all the simulations relate to the same release conditions. For further information on the groups involved and the models adopted, the reader can refer to [8] and [9].

We now apply the proposed approach to the data of the ETEX-1 experiment. The preliminary step is fuzzification: Eq. 1 is applied to the concentrations estimated by the models at each time level. Next, for the concentrations at each time, a dendrogram (similarity matrix) is produced (Eq. 11 with the Łukasiewicz norm and $p=1$). Finally, the consensus matrix that describes the representative dendrogram is estimated by using the approach described in Algorithm 2. Fig. 2 shows a detail of the representative dendrogram obtained after 78 h. We observe that different clusters of similar models are obtained.

To highlight the clustering outcomes, Fig. 3 shows some representative distributions of the clustered models. For example, as confirmed by the dendrogram, the distributions of models 22 and 24 are very close (compare Figs. 3a and b). Model 21, instead, has a very diffusive distribution, as highlighted by the dendrogram; this distribution is shown in Fig. 3c. At this point, we can identify models that have similar behavior by analyzing the different clusters. In order to identify the group of models that most appropriately describes the observations, we compare the distributions of the models by using the Kullback-Leibler divergence.

The Kullback-Leibler (KL) divergence between two discrete n-dimensional probability distributions ${\mathbf p} = [p_1, \ldots , p_n]$ and ${\mathbf q} = [q_1, \ldots , q_n]$ is defined as

$$\begin{aligned} KL({\mathbf p}||{\mathbf q}) = \sum _{i=1}^n p_i \log \left( \frac{p_i}{q_i}\right) . \end{aligned}$$

(17)

This is known as the relative entropy. It satisfies Gibbs' inequality

$$\begin{aligned} KL({\mathbf p}||{\mathbf q}) \ge 0 \end{aligned}$$

(18)

where equality holds if and only if ${\mathbf p} \equiv {\mathbf q}$. In general $KL({\mathbf p}||{\mathbf q}) \ne KL({\mathbf q}||{\mathbf p})$. In our experiments we use the symmetric version [2], defined as

$$\begin{aligned} KL = \frac{KL({\mathbf p}||{\mathbf q}) + KL({\mathbf q}||{\mathbf p})}{2}. \end{aligned}$$

(19)
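Equations 17 and 19 can be written compactly as follows. This is a sketch under the assumptions that both inputs are already normalized to sum to one and that $q_i > 0$ wherever $p_i > 0$ (and vice versa); terms with $p_i = 0$ are taken to contribute zero. The function names and sample distributions are illustrative.

```python
import math


def kl(p, q):
    """KL(p || q) as in Eq. 17; terms with p_i = 0 are taken to contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetric version of the divergence (Eq. 19)."""
    return 0.5 * (kl(p, q) + kl(q, p))


# Hypothetical discrete distributions.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl(p, q), kl(q, p))   # the two directions differ
print(symmetric_kl(p, q))   # same value if p and q are swapped
print(symmetric_kl(p, p))   # 0.0
```

The asymmetry of Eq. 17 is visible in the first printed line, which is the reason for averaging the two directions in Eq. 19.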

First, we compute the KL divergence between each model and the median value of the overall cluster. Then, for each cluster, the model with the minimum KL is selected. The median of the selected models is compared with the real observations by means of the KL divergence. Fig. 4 shows the KL obtained by varying the number of clusters.
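The selection step just described could be sketched as below. Several choices here are assumptions rather than details given in the text: each field is normalized to sum to one before the divergence is computed, a small `eps` is added to avoid taking the logarithm of zero, the median is taken point by point across models, and all names and sample values are hypothetical.

```python
import math
from statistics import median


def sym_kl(p, q, eps=1e-12):
    """Symmetric KL (Eq. 19) between two non-negative fields.

    Assumption: fields are shifted by eps and normalized to sum to one.
    Uses the identity KL(p||q) + KL(q||p) = sum((p_i - q_i) * log(p_i / q_i)).
    """
    p = [v + eps for v in p]
    q = [v + eps for v in q]
    sp, sq = sum(p), sum(q)
    return 0.5 * sum((a / sp - b / sq) * math.log((a / sp) / (b / sq))
                     for a, b in zip(p, q))


def pointwise_median(fields):
    """Median across models at each grid point."""
    return [median(values) for values in zip(*fields)]


def select_representatives(models, clusters):
    """Pick, in each cluster, the model closest (in KL) to the cluster median."""
    chosen = []
    for names in clusters:
        centre = pointwise_median([models[m] for m in names])
        chosen.append(min(names, key=lambda m: sym_kl(models[m], centre)))
    return chosen


# Hypothetical concentration fields on a 4-point grid.
models = {"m1": [1, 2, 3, 4], "m2": [1, 2, 3, 5],
          "m3": [1, 2, 4, 4], "m4": [4, 3, 2, 1]}
clusters = [["m1", "m2", "m3"], ["m4"]]
observed = [1, 2, 3, 4]

chosen = select_representatives(models, clusters)
score = sym_kl(pointwise_median([models[m] for m in chosen]), observed)
print(chosen)  # ['m1', 'm4']
print(score)
```

Repeating this for different cuts of the dendrogram, that is for different numbers of clusters, gives one score per cut, which is the kind of curve the text attributes to Fig. 4.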

We observe that, by varying the number of clusters, this procedure makes it possible to select the models that best approximate the real observations (see [17] and [4] for more details). From our analysis, we conclude that the best approximation is obtained by using 6 clusters. Moreover, we stress that a lower KL does not necessarily correspond to the use of a large number of models. This suggests an approach for the systematic reduction of ensemble data complexity, and the use of the consensus matrix makes it possible to obtain a more robust and realistic temporal analysis.

4 Conclusions

In this work we focused on model comparison in a multi-model air quality ensemble system. A methodology based on temporal hierarchical agglomeration was introduced for the real-time simulation of pollutant dispersion or the accidental release of radioactive nuclides in the atmosphere. The proposed methodology is able to combine multiple temporal hierarchical agglomerations of dispersion models, and it is based on fuzzy similarity relations combined by a transitive consensus matrix. The methodology was adopted to identify the models that characterize the predicted atmospheric pollutants from the ETEX-1 experiment. The results show that this methodology is able to discard redundant temporal information, reducing the data complexity. In the near future, further experiments will be devoted to real pollutant dispersions (e.g., Fukushima) and to different similarity relations, also using ordinal sums.

References

1. Ascione, I., Giunta, G., Mariani, P., Montella, R., Riccio, A.: A grid computing based virtual laboratory for environmental simulations. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1085–1094. Springer, Heidelberg (2006). https://doi.org/10.1007/11823285_114
2. Ciaramella, A., et al.: Interactive data analysis and clustering of genomic data. Neural Netw. 21(2–3), 368–378 (2008)
3. Napolitano, F., Raiconi, G., Tagliaferri, R., Ciaramella, A., Staiano, A., Miele, G.: Clustering and visualization approaches for human cell cycle gene expression data analysis. Int. J. Approximate Reasoning 47(1), 70–84 (2008)
4. Ciaramella, A., Giunta, G., Riccio, A., Galmarini, S.: Independent data model selection for ensemble dispersion forecasting. In: Okun, O., Valentini, G. (eds.) Applications of Supervised and Unsupervised Ensemble Methods. SCI, vol. 245, pp. 213–231. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03999-7_12
5. Ciaramella, A., Pedrycz, W., Tagliaferri, R.: The genetic development of ordinal sums. Fuzzy Sets Syst. 151, 303–325 (2005)
6. Galmarini, S., Bianconi, R., Bellasio, R., Graziani, G.: Forecasting consequences of accidental releases from ensemble dispersion modelling. J. Environ. Radioactiv. 57, 203–219 (2001)
7. Galmarini, S., et al.: Ensemble dispersion forecasting, part I: concept, approach and indicators. Atmos. Environ. 38, 4607–4617 (2004)
8. Galmarini, S., et al.: Ensemble dispersion forecasting, part II: application and evaluation. Atmos. Environ. 38, 4619–4632 (2004)
9. Girardi, F., et al.: The ETEX project. EUR Report 181–43 EN, 108 pp. Office for official publications of the European Communities, Luxembourg (1998)
10. Giunta, G., Montella, R., Mariani, P., Riccio, A.: Modeling and computational issues for air/water quality problems: a grid computing approach. Nuovo Cimento C Geophys. Space Phys. 28, 215–224 (2005)
11. Klement, E.P., Mesiar, R., Pap, E.: Triangular Norms. Kluwer Academic Publishers, Dordrecht (2001)
12. Meyer, H.D., Naessens, H., Baets, B.D.: Algorithms for computing the min-transitive closure and associated partition tree of a symmetric fuzzy relation. Eur. J. Oper. Res. 155(1), 226–238 (2004)
13. Montella, R., Giunta, G., Riccio, A.: Using grid computing based components in on demand environmental data delivery. In: Proceedings of the Second Workshop on Use of P2P, GRID and Agents for the Development of Content Networks, UPGRADE-CN 2007, pp. 81–86 (2007)
14. Mirzaei, A., Rahmati, M.: A novel hierarchical-clustering-combination scheme based on fuzzy-similarity relations. IEEE Trans. Fuzzy Syst. 18(1), 27–39 (2010)
15. Potempski, S., Galmarini, S., Riccio, A., Giunta, G.: Bayesian model averaging for emergency response atmospheric dispersion multimodel ensembles: is it really better? How many data are needed? Are the weights portable? J. Geophys. Res. 115 (2010). https://doi.org/10.1029/2010JD014210
16. Potempski, S., Galmarini, S.: Est modus in rebus: analytical properties of multi-model ensembles. Atmos. Chem. Phys. 9(24), 9471–9489 (2009)
17. Riccio, A., Giunta, G., Galmarini, S.: Seeking for the rational basis of the median model: the optimal combination of multi-model ensemble results. Atmos. Chem. Phys. 7, 6085–6098 (2007)
18. Riccio, A., Ciaramella, A., Giunta, G., Galmarini, S., Solazzo, E., Potempski, S.: On the systematic reduction of data complexity in multimodel atmospheric dispersion ensemble modeling. J. Geophys. Res. 117(D5), D05314 (2012)
19. Sessa, S., Tagliaferri, R., Longo, G., Ciaramella, A., Staiano, A.: Fuzzy similarities in stars/galaxies classification. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 494–4962 (2003)
20. Solazzo, E., Riccio, A., Van Dingenen, R., Valentini, L., Galmarini, S.: Evaluation and uncertainty estimation of the impact of air quality modelling on crop yields and premature deaths using a multi-model ensemble. Sci. Total Environ. 633, 1437–1452 (2018)
21. Turunen, E.: Mathematics Behind Fuzzy Logic. Advances in Soft Computing. Springer, Heidelberg (1999)


Acknowledgments

This work was partially funded by the University of Naples Parthenope (Sostegno alla ricerca individuale per il triennio 2016–2018 project).

Author information

Authors and Affiliations

Department of Science and Technology, University of Naples “Parthenope”, Isola C4, Centro Direzionale, 80143, Naples (NA), Italy
F. Camastra, A. Ciaramella, A. Riccio & A. Staiano
Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Hanoi, Vietnam
L. H. Son


Corresponding author

Correspondence to A. Ciaramella.

Editor information

Editors and Affiliations

Széchenyi István University, Győr, Hungary
Robert Fullér
Ca’ Foscari University of Venice, Venice, Italy
Silvio Giove
University of Genoa, Genoa, Italy
Francesco Masulli


About this paper

Cite this paper

Camastra, F., Ciaramella, A., Son, L.H., Riccio, A., Staiano, A. (2019). Fuzzy Similarity-Based Hierarchical Clustering for Atmospheric Pollutants Prediction. In: Fullér, R., Giove, S., Masulli, F. (eds) Fuzzy Logic and Applications. WILF 2018. Lecture Notes in Computer Science(), vol 11291. Springer, Cham. https://doi.org/10.1007/978-3-030-12544-8_10


DOI: https://doi.org/10.1007/978-3-030-12544-8_10
Published: 23 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12543-1
Online ISBN: 978-3-030-12544-8
eBook Packages: Computer Science, Computer Science (R0)
