
1 Introduction

In Data Science the aim is to extract new knowledge from standard, big and complex data. Another characteristic of Data Science is that its methods and tools are not developed to be applied to one specific domain only, but to any domain providing data. Industrial data are often unstructured, with variables defined on different units. They can also come from multiple sources (mixtures of numerical and textual data, images and networks). In order to reduce the size and the complexity of the models associated with such data, and to improve their efficiency, a key solution is to use classes of row statistical units, which are considered as new statistical units. "Classes" are, as usual, subsets of any set of statistical units, for example: teams of football players, regions of inhabitants, levels of consumption in health insurance. There are at least three advantages of considering classes instead of standard units. First, classes can be the units that interest the users most: for example, "regions" instead of their "inhabitants", "species" instead of "specimens", "teams" instead of their players, or "documents" instead of their "words". Second, classes induce local models, often (but not always!) more efficient than global ones. Third, classes give a concise and structured view of the data, as they can be organized by a partition, a hierarchical clustering, a pyramid (for overlapping clusters) or a Galois lattice.

In clustering, classes (called "clusters") are not known in advance and can be obtained by a clustering process such as the "Dynamical Clustering Method" (DCM) (see [11]) or, for fuzzy classes, by EM (see [18]), which iteratively improve the fit between each obtained cluster and its associated local model. In the case of unsupervised data, clusters can be modeled, for example, by means (as in the "k-means" method), by distributions (as in mixture decomposition) or by factorial axes (which leads to local factorial analysis). In the case of supervised data, clusters can be modeled by regressions (or more generally by canonical analysis), neural networks, SVMs, etc. In both cases the obtained classes can be described, in order to express their within-class variability, by vectors of intervals, probability distributions, weighted sequences, functions and the like, called "symbolic data". Hence, we obtain a symbolic data table that we can study in order to obtain explanatory information on the given classes or obtained clusters. We can also use these symbolic data in order to measure the explanatory power of each class or cluster. More generally, a "symbolic data table" is a table where classes of individuals are described by at least one symbolic variable. Standard variables can also describe classes, by considering the set of classes as a new set of units of higher level.

Figure 1 is an example of a symbolic data table. The statistical units of the ground population are the players of French Cup teams, and the classes of players are the teams, called Paris, Lyon, Marseille and Bordeaux. The variability of the players inside each team is expressed by the following symbolic variables: "Weight", whose value is the interval [min, max] of the weights of the players of the associated team; "National Country", whose value is the list of their nationalities; "Age bar chart", giving the frequencies of the players' ages falling in the intervals [less than 20], [20, 25], [25, 30], [more than 30], respectively denoted (0), (1), (2), (3) in Fig. 1. The symbolic variable "age" is called a "bar chart variable", as the intervals of age on which it is defined are the same for all the classes and can therefore be considered as categories. The last variable is numerical, as its value for a team is the frequency of the French players of this team among all the French players of all the teams. Hence, this variable produces a vertical bar chart, in contrast with the symbolic variable "age", whose values form horizontal bar charts in Fig. 1. By adding, beside the French, the same kind of column for the other nationalities, we obtain a new symbolic variable whose values are lists of numbers, where each number is the frequency of the players of a given nationality in a team among all the players of this nationality over all the teams. A team can also be described by standard numerical or categorical variables, for example its expenses or its number of goals in a season.

Fig. 1. An example of a symbolic data table in which teams of the French Cup are described by three symbolic variables (interval-valued, sequence of categories, "horizontal" bar chart) and by a numerical variable inducing a "vertical" bar chart.

More generally, the first characteristic of the so-called "symbolic variables" is that they are defined on classes. Their second characteristic is that their values take into account the variability between the individuals inside these classes, by "symbols" representing more than one single category or number. Hence, the standard operators on numbers cannot be applied to the values of these kinds of variables, so these values are not numerical: that is why they are called "symbolic" and represented by "symbols" such as intervals, bar charts and the like.
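As a minimal illustration of how such symbolic descriptions can be built from individual-level data, the following sketch (in Python, with hypothetical player data loosely inspired by Fig. 1) aggregates raw units into an interval-valued variable and a bar chart variable per team:

```python
import numpy as np
import pandas as pd

# Hypothetical individual-level data: players described by team, weight and age.
players = pd.DataFrame({
    "team":   ["Paris", "Paris", "Lyon", "Lyon", "Lyon", "Marseille"],
    "weight": [73, 85, 68, 79, 90, 77],
    "age":    [19, 24, 27, 31, 22, 28],
})

# Interval-valued symbolic variable: [min, max] of the weights within each team.
weight_interval = players.groupby("team")["weight"].agg(["min", "max"])

# Bar chart symbolic variable: frequency of the ages falling in fixed intervals,
# identical for every class, so that the intervals act as categories.
bins = [0, 20, 25, 30, np.inf]
labels = ["<20", "20-25", "25-30", ">30"]
age_cat = pd.cut(players["age"], bins=bins, labels=labels, right=False)
age_barchart = pd.crosstab(players["team"], age_cat, normalize="index")

# Each row of the resulting table is the symbolic description of one class (team).
symbolic_table = weight_interval.join(age_barchart)
print(symbolic_table)
```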

The first aim of the so-called "Symbolic Data Analysis" (SDA) is to describe classes by vectors of symbolic data in an explanatory way. Its second aim is to extend Data Mining and Statistics to new kinds of complex data coming from the industrial domain. We cannot say that SDA gives better results than standard data analysis; we can just say that SDA can give good complementary results when we need to work on units which have a higher level of generality. For example, if we wish to know what makes a good player, the data certainly concern individual units; but if we wish to know what makes a good team, then the units are the teams, and these are classes of individuals.

Complex data constitute an important source of symbolic data. We consider as "complex data" any data set which cannot be represented as a single "standard statistical units x standard variables" data table. This is the case when the data are defined by several data tables with different statistical units and different variables, coming from multiple sources, sometimes at multiple levels. In this case, one of the advantages of "symbolic data" is that unstructured data with unpaired samples at the level of the row units become structured and paired at the level of the classes. By definition, a "class of complex data" is a vector of standard classes defined on different statistical spaces of units. For example, in Official Statistics a region can be considered as a class of complex data denoted CR = (Ch, Cs, Ci), where Ch is the class of hospitals, Cs the class of schools and Ci the class of inhabitants of this region.

Example of complex data, classes and symbolic variables in Official Statistics:

National Statistical Institutes (NSI) organize censuses in their regions on different kinds of populations: hospitals, schools, inhabitants, etc. Each of these populations is associated with its own characteristic variables. For hospitals: number of beds, doctors, patients, etc.; for schools: number of pupils, teachers, etc.; for inhabitants: gender, age, socio-professional category, etc. The regions are the classes of units, described by the variables available for all these populations of different sizes. If we have n regions and N populations (of hospitals, schools, etc.), then after aggregation we obtain a symbolic data table with n rows and p1 + … + pN columns, associated with the N sets of symbolic-valued variables characteristic of each of the N populations. Of course, other variables (standard or symbolic) can be added in order to describe other aspects of the regions.

Symbolic Data Analysis (SDA) is an extension of standard data analysis and data mining to symbolic data. SDA has several advantages. As the number of classes is lower than the number of individuals, SDA facilitates the interpretation of results in decision trees, factorial analysis, etc. SDA reduces simple, complex and/or big data. It also reduces missing data and helps with confidentiality (when individuals are confidential but classes are not). It allows new variables to be added at the right level of generality.

The theory and practice of SDA have been developed in several books [2, 3, 15], many papers (see overviews in [1, 9]) and several international workshops. Special issues related to SDA have been published, for example in the RNTI journal, edited by Guan et al. [20], on 'Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis'; in the ADAC journal on SDA, edited by Brito et al. [4]; and in IEEE Trans Cybern [25]. We indicate, among many others, four examples of applications: in nuclear power plants [6], in epidemiology [21], in cancerology [23] and in face recognition [24].

The paper presents three sections after this introduction. The first is devoted to the building of symbolic data from given classes or obtained clusters. The next section shows that the explanatory power of the symbolic data describing a class can be measured by different criteria, which provide a measure of the explanatory power of this class. In the last section we show that any machine learning tool can be transformed, by a clustering process, into local tools, often better adapted than global ones. Then, based on the explanatory criteria characterizing individuals, classes and variables, we show how to improve the explanatory power of any machine learning tool by filtering explanatory sub-populations.

2 Building Symbolic Data from Given Classes or Clusters

The aim is to study the symbolic data table provided by the description of given classes (as in supervised learning) or of clusters (obtained from a clustering process), in order to get complementary knowledge enhancing the usual standard interpretation (by means, variances, etc.). For example, in mixture decomposition clustering, the description of each class is just given by the analytical expression of the joint probability density fi associated with it. Hence, in the Gaussian case, the joint density is described by a large correlation matrix that is hard to interpret when there are many variables. Building a symbolic data table where the units are the given classes or the obtained clusters can be done in three ways: directly, if the obtained clusters define a partition; from the marginals induced by the joint distribution associated with each cluster provided by EM or DCM; or from the membership weights of the individuals if we have fuzzy clusters, as in EM mixture decomposition.

If \( L_k \) is the representative of the class \( P_k \), then the weight \( t_k(x_i) \) of an individual \( x_i \) in the class \( P_k \) (where \( x_i^j \) denotes the value taken by \( x_i \) for the variable \( j \)) is given by \( t_k(x_i) = d(x_i, L_k) \), where d is the dissimilarity used by the clustering method which produced the classes. In the case of fuzzy clusters (as in EM), \( t_k(x_i) \) is the fuzzy membership weight of \( x_i \) in the kth class. Then, the histogram for the kth class and the jth variable is given by:

$$ H_{kj} = \left( \frac{\sum_{i=1}^{N} t_{k}(x_{i})\,\delta_{I_{1}}(x_{i}^{j})}{\sum_{v=1}^{V} \sum_{i=1}^{N} t_{k}(x_{i})\,\delta_{I_{v}}(x_{i}^{j})},\; \ldots ,\; \frac{\sum_{i=1}^{N} t_{k}(x_{i})\,\delta_{I_{V}}(x_{i}^{j})}{\sum_{v=1}^{V} \sum_{i=1}^{N} t_{k}(x_{i})\,\delta_{I_{v}}(x_{i}^{j})} \right) $$
(1)

where \( \delta(x_{i}^{j}) = (\delta_{I_{1}}(x_{i}^{j}), \ldots, \delta_{I_{V}}(x_{i}^{j})) \) is a vector of Dirac masses defined on the V intervals \( (I_{1}, \ldots, I_{V}) \) partitioning the domain \( D_j \) of the numerical variable \( X^j \), such that \( \delta_{I_{v}}(x_{i}^{j}) \) takes the value 1 if \( x_{i}^{j} \in I_{v} \) and 0 otherwise. When the \( I_{v} \) are categorical values instead of intervals, we obtain a bar chart, and \( \delta_{I_{v}}(x_{i}^{j}) \) takes the value 1 if \( x_{i}^{j} \) is the category \( I_{v} \) and the value 0 otherwise.

When, instead of a fuzzy partition \( (P_{1}, \ldots, P_{K}) \) such as the one given by EM, we have an exact partition \( (P_{1}^{'}, \ldots, P_{K}^{'}) \), such as the one induced by \( P_{k}^{'} = \{ x_{i} : f(x_{i}, a_{k}) \ge f(x_{i}, a_{m}) \text{ for all } m \} \) or obtained directly by DCM, we can build in the same way a histogram or a bar chart by setting \( t_{k}(x_{i}) = 1 \) if \( x_{i} \in P_{k}^{'} \) and 0 otherwise, for \( k = 1, \ldots, K \) and \( i = 1, \ldots, N \).

In SDA, in order to increase the explanatory power of the obtained symbolic data table, first the number of intervals \( I_{v} \) is preferably kept small (about 5, though it can be increased if needed); second, the size and position of these intervals can be chosen optimally so as to maximize the distance between the symbolic descriptions of the classes (see [16]). After an EM mixture decomposition, the joint densities fi associated with each class Ci are described by their marginals fij. These marginals are then described by several kinds of symbolic data, such as histograms, interquartile intervals, or any kind of summary: mean, mean square, percentiles, correlations between some characteristic variables and the like. More generally, from any clustering method we obtain a symbolic data table on which SDA can be applied.
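As a minimal sketch of how formula (1) can be computed in practice, the code below builds the histogram of one class for one numerical variable from (possibly fuzzy) membership weights \( t_k(x_i) \); the interval edges and toy values are hypothetical:

```python
import numpy as np

def class_histogram(x_j, t_k, edges):
    """Histogram description of one class for one numerical variable.

    x_j   : values x_i^j of the variable for the N individuals
    t_k   : membership weights t_k(x_i) of the individuals in class k
            (all equal to 1 for a crisp partition)
    edges : V+1 boundaries of the intervals I_1, ..., I_V
    """
    x_j, t_k = np.asarray(x_j, float), np.asarray(t_k, float)
    # delta_{I_v}(x_i^j): index of the interval I_v containing x_i^j
    v_index = np.clip(np.digitize(x_j, edges[1:-1]), 0, len(edges) - 2)
    weighted = np.zeros(len(edges) - 1)
    np.add.at(weighted, v_index, t_k)      # sum_i t_k(x_i) * delta_{I_v}(x_i^j)
    return weighted / weighted.sum()       # normalized over the V intervals

# Toy example: one fuzzy class over 6 individuals, 4 intervals of age.
ages = [19, 24, 27, 31, 22, 28]
weights = [0.9, 0.7, 0.2, 0.1, 0.8, 0.3]
print(class_histogram(ages, weights, edges=[0, 20, 25, 30, 120]))
```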

3 Explanatory Power of Classes or Clusters from Their Associated Symbolic Data Table

Our aim in SDA is to get a meaningful symbolic data table, maximizing the discrimination power of the symbolic data associated with each variable for each class. A discrimination degree can be calculated by a normalized sum (to be maximized) of the pairwise dissimilarities between the symbolic descriptions. Such dissimilarities can be found in [3, 7, 8, 15]. In the case of histogram-valued variables, an example of a discriminating tool is given in [DID 2013], obtained by optimizing the length of the histogram intervals. There are at least three ways of measuring discrimination: distances between rows in each column, to be maximized; entropy in each cell, to be minimized; correlations between columns, to be maximized. More details are given in [9].

Other kinds of explanatory power of a symbolic data table can be defined. First, we can define a theoretical framework for SDA in the case of categorical variables (see [17, 19]). Let C, X and S be three random variables defined on the ground population Ω in the following way. C is a class variable: Ω → P such that C(w) = c, where c is a class of a given partition P. X is a categorical variable: Ω → M such that X(w) = x is a category among the set of categories M of this variable. From C and X we can build a third random variable S: Ω → [0, 1] such that S(w) = s(X(w), C(w)) = s(x, c), the proportion of the category x inside the class c. In other words, s(x, c) can be considered as the probability of the category x knowing the class c: s(x, c) = Pr(X = x/C = c). If we denote by f c(x) the value of the bar chart induced by the class c for the category x, we have f c(x) = s(x, c).

Characterization of a class by an event: We say, basically, that a category is "characteristic" of a class if it is frequent in the class and rare in the other classes. In order to give insight into what we develop in this section, we start with a simple example.

Example: Suppose fc(x) is higher than 0.9 (i.e. fc(x) belongs to the event E(x, c) = [0.9, 1]) and that, for most of the classes c' Є P different from c, fc'(x) belongs to the event E(x, c') = [0, 0.9[. In this case, we can say that the category x is characteristic of the class c versus the other classes of the partition P for the event [0.9, 1], as its frequency takes a value in this event for the class c, which is rare for the other classes of the partition P.

A characterization criterion W, varying between 0 and 1, of a category x and a class c can be measured by:

W(x, c, E) = fc(x)/(1 + gx, E(x, c)(c)), where E(x, c) is an event defined by an interval included in [0, 1] containing fc(x), and Pr(Sx Є E(x, c)) = |{w Є Ω : x = X(w), Sx(w) = s(x, C(w)) Є E(x, c)}| / |Ω| defines g:

gx, E(x, c)(c) = Pr(Sx Є E(x, c)). Hence, gx, E(x, c) associates to a class c and a category x the frequency of the individuals w of Ω whose category is x = X(w) and whose class C(w) satisfies s(x, C(w)) Є E(x, c). If the ground population is infinite, we suppose that Ω is a sample. Hence, given an event E, the criterion W expresses how much a category x is characteristic of a class c versus the other classes c' of the given partition P. This criterion means that a category x is all the more characteristic of a given class c, for an event E, as its frequency in the class c is large and the proportion of individuals w taking the category x in any class c' (including c) and satisfying the event E(x, c) is low in the ground population Ω. Given x and c, several choices of E can be interesting.

Four examples of events E:

For a characterization of x and c in the neighborhood of s(x, c):

E1(x, c) = [s(x, c) – ε, s(x, c) + ε] for ε > 0 and s(x, c) Є [ε, 1 – ε], where ε can be a percentile.

For a characterization of the values higher than s(x, c): E2(x, c) = [s(x, c), 1].

For a characterization of the values lower than s(x, c): E3(x, c) = [0, s(x, c)].

In order to characterize the existence of the category x in any class: E4(x, c) = ]0, 1].

Hence, a category x is characteristic of a class c when it is frequent in the class c and rare in a neighborhood of s(x, c) if E = E1, rare above (resp. below) s(x, c) if E = E2 (resp. E = E3), and rare to appear outside of c in the other classes c' if E = E4.

In fact there are four cases to consider, depending on whether, in a class c, a category x is frequent or not, and whether, among the set of classes, it is frequent or not in E(x, c). Hence, we have four cases, called FF, FR, RF and RR. The cases FF and RR do not give any specific value to W(x, c, E), but the case FR (resp. RF), where the category is frequent (resp. rare) in c and rare (resp. frequent) in the other classes, leads to a high (resp. low) value of W(x, c, E). Therefore, we can say that x is a specific category of c iff W(x, c, E)·log W(x, c, E) is close to 0. Other kinds of characterization criteria can be used. The popular 'test value', developed in [22], may also be used to measure the characterization of a category in a bar chart contained in a cell. The p-value is the level of marginal significance within a statistical hypothesis test, representing the probability of the occurrence of a given event. A simple alternative is the ratio between the frequency of a category in a class and the mean of the frequencies of the same category over all the classes of the given partition.
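As a minimal sketch (on hypothetical data) of how W(x, c, E) can be computed from a ground population of (class, category) pairs, using the definitions of fc(x) and g above:

```python
# Hypothetical ground population: (class, category) pair for each individual w.
population = [("c1", "a"), ("c1", "a"), ("c1", "b"),
              ("c2", "a"), ("c2", "b"), ("c2", "b"),
              ("c3", "b"), ("c3", "b"), ("c3", "b")]

def f(x, c):
    """f_c(x) = s(x, c): proportion of the category x inside the class c."""
    members = [cat for (cl, cat) in population if cl == c]
    return members.count(x) / len(members)

def W(x, c, E):
    """Characterization criterion W(x, c, E) = f_c(x) / (1 + g_{x,E}(c)).

    g_{x,E}(c) is the proportion of individuals w of the whole population
    whose category is x and whose class c' satisfies s(x, c') in E.
    """
    g = sum(1 for (cl, cat) in population
            if cat == x and E[0] <= f(x, cl) <= E[1]) / len(population)
    return f(x, c) / (1.0 + g)

# E4 = ]0, 1]: how characteristic is category "a" of class "c1"
# versus simply occurring in the other classes?
print(W("a", "c1", E=(1e-9, 1.0)))
```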

Characterization of classes and symbolic bar chart variables

A symbolic data table of bar chart variables can be transformed into a data table where each column is associated with a category. By summing the characterizations of all the cells of each row (resp. column), we obtain a characterization of each class (resp. variable). In the same way, by summing the characterizations of all the cells, we can obtain a characterization of the whole symbolic data table. We can thus find the most typical or atypical class, bar chart variable or symbolic data table. In the following we focus on characterization, but of course the singular or specific cases could be considered in the same way.

It can be shown that the standard Tf-Idf (very popular in text mining) is a case of the W criterion and a parametric version of this criterion can be defined (see [17]).

4 Improving Explanatory Power of Machine Learning by Using a Filter

We show in three steps that any machine learning process can be improved both in efficiency and in the explanatory power of the rules it provides. In the first step, by a dynamical clustering process optimizing a first objective function, we obtain local learning models, defined by couples of clusters and local associated predictive models (regression, neural network, SVM, Bayesian classifier, decision tree, etc.) in the case of supervised data, or couples of clusters and means, distributions or factorial axes in the case of unsupervised data. In the second step, the obtained clusters are described by symbolic data (induced by only the explanatory variables in the case of supervised data, or by all the variables in the case of unsupervised data), which leads to the explanatory power of each cluster, measured by a second objective function of characterization. In the third step, we provide an allocation rule for any new unit (known only by its explanatory values, i.e. without knowing its predictive values, in the case of supervised data), which is applied if it improves simultaneously the first and the second objective function (i.e. at least improves one without degrading the other). Several kinds of allocation rules are proposed, including Latent Dirichlet models (see Diday [19]).

Hence, in the first step, we use the "Dynamical Clustering Method" (DCM): starting from a given partition P = (P1, …, PK) of a population, this method is based on the alternating use of a representation function g, which associates a representation L to a class C, and an allocation function f, which associates a class C to a point x of the population: f(x) = C, in order to improve a given criterion at each step until convergence.

Proof.

Starting from a partition P = (P1, …, PK) of the initial population, the representation function applied to the classes Pi produces a vector of representations L = (L1, …, LK), among a given set of possible representations, where g(Pi) = Li. A quality criterion can be defined in the following way: \( W(P, L) = \sum_{i=1}^{K} f(P_i, L_i) \), where f measures the fit between each class Pi and its representation Li; it decreases when this fit increases.

Starting from a partition P(n), the value of the sequence un = W(P(n), L(n)) decreases at each step n of the algorithm. Indeed, during the allocation step, an individual x belonging to a class \( P_i^{(n)} \) is moved to a new class \( P_j^{(n+1)} \) iff W(P(n+1), L(n)) ≤ W(P(n), L(n)) = un. Then, starting from the new partition \( P^{(n+1)} \), we can always define a new representation vector \( L^{(n+1)} = (L_1^{(n+1)}, \ldots, L_K^{(n+1)}) \) where, for any i = 1 to K, \( L_i^{(n+1)} = g(P_i^{(n+1)}) \) fits \( P_i^{(n+1)} \) better than \( L_i^{(n)} \) or remains unchanged (i.e. \( L^{(n+1)} = L^{(n)} \)). This means: \( f(P_i^{(n+1)}, L_i^{(n+1)}) \le f(P_i^{(n+1)}, L_i^{(n)}) \) for i = 1 to K.

Hence, at this step, we have un+1 = W(P(n+1), L(n+1)) ≤ W(P(n+1), L(n)) ≤ W(P(n), L(n)) = un. As this inequality is true for any n, this positive sequence decreases and converges.

Moreover, notice that in the case where \( W(P_i, L_i) = \sum_{w \in P_i} f(w, L_i) \), the allocation step consists in moving w from one class to another when f(w, Lj) < f(w, Li). Notice also that a simple condition of convergence is that, for any Ci taken among all the possible subsets of the given population and any Li taken among the given set of possible representations: f(Ci, g(Ci)) ≤ f(Ci, Li).

In the case of unsupervised data, the classical k-means method is the case where Li is the mean of the class Ci. When the Lk are probability densities, we have a mixture decomposition method which improves the fit (in terms of likelihood) between each class (of the partition) and its associated density function. More precisely, in this case each individual is allocated by the allocation function to the class whose density function has the highest value for this individual. There are many other possibilities: the representation of a class can be a distance, a functional curve, a set of points of the population, a factorial axis, etc. For an overview see [12, 13].
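As a minimal sketch of DCM with pluggable representation and allocation functions (k-means being the special case where the representation is the class mean), assuming a Euclidean fit and toy data:

```python
import numpy as np

def dynamical_clustering(X, K, represent, fit, n_iter=50, seed=0):
    """Generic DCM: alternate representation and allocation steps.

    represent(points) -> representation L_i of a class (e.g. its mean)
    fit(x, L)         -> dissimilarity between an individual and L
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))          # initial partition P^(0)
    for _ in range(n_iter):
        # representation step: L_i^(n+1) = g(P_i^(n))
        reps = [represent(X[labels == i]) for i in range(K)]
        # allocation step: move x to the class whose representation fits it best
        new_labels = np.array([np.argmin([fit(x, L) for L in reps]) for x in X])
        if np.array_equal(new_labels, labels):
            break                                  # the criterion can no longer decrease
        labels = new_labels
    return labels, reps

# k-means is the case where the representation is the class mean.
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (20, 2)) for m in (0, 3)])
labels, means = dynamical_clustering(
    X, K=2,
    represent=lambda pts: pts.mean(axis=0) if len(pts) else np.zeros(X.shape[1]),
    fit=lambda x, L: np.linalg.norm(x - L),
)
print(labels)
```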

In the case of supervised data we can set, for example, \( W(P_i, L_i) = \sum_{w \in P_i} f(w, L_i) \) and f(w, Li) = ||Y(w) – Mi(w)||, where Y(w) is the predictive value given by the supervised data sample and Mi(w) is the value obtained by applying to w the model Mi associated with the class Pi. The convergence of the method is then obtained if, for any Ci and any Li taken among a given family of models, f(Ci, g(Ci)) ≤ f(Ci, Li), where g(Ci) = Mi is the model of the given family which best fits the class Ci.

For example, in the representation by regressions, each individual is allocated to the class Ci whose regression Li it fits best among all the local regressions (see [5]); more generally, for the case of representation by canonical axes, see [14]. Notice that, in the case of unsupervised data, this method contains local PCA (Principal Component Analysis) (see Fig. 2) and local correspondence analysis. In the case of supervised data, it contains local regression (see Fig. 2) and local discriminant analysis (see Fig. 2).

Fig. 2. Local PCA: find simultaneously the classes and the first axes of the local PCAs which fit them best. Local Discriminant Analysis: find simultaneously the classes and the first axes of the local factorial discriminant analyses which fit them best. Local Regression: find simultaneously the classes and the local regressions which fit them best.
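In the supervised case, a minimal sketch of such a local (clusterwise) regression alternates the fit of one linear model per cluster with the re-allocation of each individual to the model with the smallest residual; the toy data and the use of ordinary least squares are assumptions of the sketch:

```python
import numpy as np

def clusterwise_regression(X, y, K, n_iter=50, seed=0):
    """DCM with local linear regressions as class representatives."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))
    Xb = np.column_stack([X, np.ones(len(X))])        # add an intercept column
    for _ in range(n_iter):
        # representation step: fit one regression per class
        coefs = [np.linalg.lstsq(Xb[labels == k], y[labels == k], rcond=None)[0]
                 if np.any(labels == k) else np.zeros(Xb.shape[1])
                 for k in range(K)]
        # allocation step: assign w to the regression minimizing |Y(w) - M_k(w)|
        residuals = np.abs(y[:, None] - Xb @ np.column_stack(coefs))
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

# Toy data drawn from two different linear models.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = np.where(rng.random(100) < 0.5, 2 * x + 1, -x + 4) + rng.normal(0, 0.05, 100)
labels, coefs = clusterwise_regression(x[:, None], y, K=2)
print(coefs)
```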

Notice that we can extend this fuzzy partitioning method in order to get fuzzy local models by considering the \( f_{i} \left( {x,a_{i} } \right) \) to be the fit between x and a model Mi with parameters ai.

In the second step, we enhance the explanatory power of the clustering by a characterization measure. Consider an individual w with description x = (x1,…, xp), where the jth variable Xj has the categories xj = (xj1,…, xjmj), Xj(w) = xjm and C(w) = ck. The characterization measure of w for the jth variable, the kth class and the event E is defined by: W(xjm, ck, E) = fck(xjm)/(1 + gxjm, E(xjm, ck)(ck)). Therefore, we can define a characterization measure of the individual by CI(w) = ∑j = 1,…,p W(xjm, ck, E).

We can define a characterization measure of a symbolic variable Xj by:

CV(Xj) = ∑k = 1,…,K maxm = 1,…,mj W(xjm, ck, E). We can also define a characterization measure of a class c by:

CC(c) = ∑j = 1,…,p maxm = 1,…,mj W(xjm, c, E).

We can then order, from the least to the most characteristic, the individuals w, the symbolic variables Xj and the classes ck, by using respectively the CI, CV and CC characterization measures. All these criteria can then enhance the explanatory power of the local machine learning tool used. These orders are respectively denoted OCI, OCV and OCC.
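As a minimal sketch of the CC and CV measures and of the corresponding orders, assuming the values W(xjm, ck, E) have already been computed and stored in a (classes x variables x categories) array (hypothetical random values here):

```python
import numpy as np

# Hypothetical array of characterization values W(x_jm, c_k, E):
# axis 0 = classes c_k, axis 1 = variables X_j, axis 2 = categories m of X_j
# (padded with zeros when a variable has fewer categories).
W = np.random.default_rng(3).random((4, 3, 5))

CC = W.max(axis=2).sum(axis=1)     # class characterization: sum over j of max over m
CV = W.max(axis=2).sum(axis=0)     # variable characterization: sum over k of max over m

# Order classes and variables from the least to the most characteristic.
OCC = np.argsort(CC)
OCV = np.argsort(CV)
print(OCC, OCV)
```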

In the third step, we suppose that we have already obtained a clustering from a basic sample, for which the predictive values are given in the case of supervised data. The aim is then to allocate new individuals to their best cluster. We have to consider two cases, depending on whether the data are supervised or not.

In the case of unsupervised data, we have to allocate each new individual to the best fitting representative associated with a cluster. For example, in the case of the k-means method, we associate any new individual to the cluster with the closest mean. If the representative is a distribution, as in mixture decomposition, any new individual is allocated to the cluster whose associated density function maximizes the likelihood of this individual. For any individual, and in any case, we can obtain an order of preference over the clusters, from the representative which fits this individual best to the one which fits it least. Hence, in this way, an individual induces an order on the clusters denoted O1.

In the case of supervised data, the aim is first to allocate a new individual (whose predictive value is not given) to the best cluster, and then to obtain its predicted value from the local model associated with this cluster. For example, if we allocate a new individual to a cluster modeled by a local regression, we can then obtain its predicted value by using this regression. The same can be done if, instead of a local regression, we have a local decision tree, a local SVM, a local neural network, etc. In order to find the best allocation of a new individual, we can only use the given data without the predicted value variable since, for the new individuals, this value is of course not given. Coming back to the basic sample, where now the predicted value associated with each individual is its cluster, we can use on these data a supervised machine learning tool for which any individual gets an order of preference over the clusters, from the best allocation to the worst. Hence, in this way, an individual induces an order on the clusters denoted O2.

We can also associate to any new individual its fit to the symbolic description associated with each obtained cluster. For example, in the numerical case, if the symbolic descriptions are density functions fj, we can use the likelihood product of the fj(xj) for j = 1, …, p, where xj is the value taken by this individual for the jth initial variable. We can then order the clusters from the best to the worst fit to this individual. We can also replace fj(xj) by W(xj, c, E) in the categorical case. Hence, in this way, an individual induces an order on the clusters denoted OE. Finally, given a new unit, we can order the obtained clusters in two ways: OE or Oi (i = 1 or 2), where OE is an explanatory order. Several strategies are then possible. Having chosen one of them, we can continue the machine learning process: we allocate the new individual to a cluster, add it to this cluster, then find the best fitting representative, and so on, until the convergence of DCM towards a new partition and its local models.
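As a minimal sketch of the explanatory order OE in the numerical case, ranking clusters by the likelihood product of per-variable density descriptions (hypothetical Gaussian marginals per cluster):

```python
import numpy as np

# Hypothetical symbolic descriptions of 3 clusters: per-variable Gaussian
# marginals (mean, std) estimated from the individuals of each cluster.
clusters = {
    "C1": [(0.0, 1.0), (5.0, 0.5)],
    "C2": [(2.0, 0.8), (3.0, 1.0)],
    "C3": [(4.0, 1.2), (1.0, 0.7)],
}

def gauss_pdf(x, m, s):
    """Density of a Gaussian marginal f_j at point x."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def explanatory_order(x):
    """Order O_E: clusters ranked by the likelihood product prod_j f_j(x_j)."""
    scores = {name: np.prod([gauss_pdf(xj, m, s) for xj, (m, s) in zip(x, marg)])
              for name, marg in clusters.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(explanatory_order(x=[1.8, 3.2]))   # best-fitting cluster first
```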

Machine learning filtering strategies:

The idea is to add (i.e. to filter) a new individual to a cluster and to its symbolic description only if it improves simultaneously, as far as possible, the fit between the cluster and its representative (i.e. its associated model in the case of supervised learning) and the explanatory power of its associated symbolic description.

The first kind of filtering strategy is to continue the learning process with only the individuals for which the same cluster occupies the best position in the orders OE and Oi. Another kind of strategy is to continue the machine learning process with only the individuals whose best-positioned clusters are not farther apart than a given rank k. The individual is then allocated to the cluster of best rank following OE or Oi, alternately, or depending on whether one wishes more explanatory power or a better decision. Other strategies are also possible by adding OCI and/or OCC to OE and Oi. It is also possible to reduce the number of variables by choosing the first ones in the OCV order. In any filtering strategy, the learning process progresses with the individuals which improve the explanatory power of the machine learning as much as possible without degrading, or degrading only slightly, the efficiency of the obtained rules. When a sub-population is obtained, the process can continue with the remaining population and lead to other sub-populations.
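As a minimal, hypothetical sketch of the first two filtering strategies, each new individual being represented only by the cluster orders it induces (the "rank k" test below is one possible reading of the second strategy):

```python
def agree_at_best(order_e, order_i):
    """First strategy: keep the individual only if the same cluster is at the
    best position in the explanatory order O_E and in the fit order O_i."""
    return order_e[0] == order_i[0]

def within_rank_k(order_e, order_i, k=2):
    """One possible reading of the second strategy: keep the individual if the
    best cluster of each order appears within the first k clusters of the other."""
    return order_i[0] in order_e[:k] and order_e[0] in order_i[:k]

# Hypothetical orders induced by one new individual.
O_E = ["C2", "C1", "C3"]     # explanatory order (symbolic descriptions)
O_1 = ["C2", "C3", "C1"]     # fit order (cluster representatives)
print(agree_at_best(O_E, O_1), within_rank_k(O_E, O_1))
```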

5 Conclusion

We have first introduced Symbolic Data Analysis, which can give useful complementary knowledge to any standard data analysis. We have then recalled the local data analysis obtained by Dynamical Clustering, which can give more accurate results for any kind of data analysis. We have defined several kinds of characterization criteria which allow individuals, clusters and variables to be ordered according to their explanatory power. We finally gave several strategies for filtering the individuals which give the best explanatory power to the machine learning process, by alternately improving the rules and the explanatory power. Much remains to be done in order to compare and improve the different criteria and strategies, and to test the results with different black-box machine learning methods (neural networks, SVM, deep learning, etc.) on different kinds of data.