1 Introduction

With the increasing demand to store and analyze huge data sets, it is essential to be able to summarize them while retaining as much of the knowledge inherent to the entire data set as possible. One direct consequence is that the data may no longer be formatted as single values, as is the case for classical data, but may instead be represented by lists, intervals, distributions, and the like. These summarized data are examples of symbolic data types. Table 1 shows part of a symbolic data set described by intervals.

Table 1 Breast temperature interval data set

The breast temperature interval data set was previously considered in [1]. In order to evaluate breast temperature abnormalities (malignant, benign and cyst) and detect breast cancer, the authors proposed a three-stage feature extraction approach in which breast interval data are extracted from a breast thermography data set, transformed into continuous features and then used as input data for a classification task. The data set is composed of 50 breast thermograms of patients aged over 35 years with a suspected mass, whose diagnoses were confirmed by clinical examination followed by ultrasound, mammographic and biopsy exams. Here, the data set is split into two classes of different sizes: 14 elements in the malignant masses class and 31 elements in the non-malignant masses class (composed of elements belonging to benign masses and elements belonging to cyst masses). Each patient is described by four interval variables that represent the temperature intervals obtained from the left breast (\(X_1\)), the right breast (\(X_2\)), the join between left and right breasts (\(X_3\)) and an interval obtained from a morphological processing of both breasts (\(X_4\)), as described in [1].

Symbolic data types were defined in symbolic data analysis (SDA) [2]. SDA aims to provide a set of suitable methods (clustering, factorial techniques, decision trees, etc.) for managing aggregated data described by many types of variables whose values can be sets of categories, intervals or probability distributions in the cells of a data table [2]. A symbolic variable is defined according to the type of its domain; for instance, an interval variable takes, for each object, an interval of \(\mathfrak {R}\) (the set of real numbers). A modal variable takes, for each object, a nonnegative measure (a frequency, a probability distribution or a system of weights). If this measure is specified in terms of a histogram, the modal variable is called a histogram variable.

Several supervised classification tools have been extended to handle interval data: Ichino et al. [3] introduced a symbolic classifier as a region-oriented approach for multi-valued data. In this approach, the classes of examples are described by a region (or set of regions) obtained through the use of an approximation of a mutual neighborhood graph (MNG) and a symbolic join operator. Souza et al. [4] proposed a MNG approximation to reduce the complexity of the learning step without losing the classifier performance in terms of prediction accuracy. D’Oliveira et al. [5] presented a region-oriented approach in which each region is defined by the convex hull of the objects belonging to a class.

Ciampi et al. [6] introduced a generalization of binary decision trees to predict the class membership of symbolic data. Rossi and Conan-Guez [7] generalized multilayer perceptrons to work with interval data. Mali and Mitra [8] extended the fuzzy radial basis function network to the domain of symbolic data. Appice et al. [9] introduced a lazy-learning approach (labeled Symbolic Objects Nearest Neighbor) that extends a traditional distance-weighted k-nearest neighbor classification algorithm to interval and modal data. Silva and Brito [10] proposed three approaches to the multivariate analysis of interval data, focusing on linear discriminant analysis, and Souza et al. [11] introduced four pattern classifiers based on the logistic regression methodology that differ in the way each interval variable is represented.

However, these classification methods for symbolic data were not developed to solve nonlinearly separable problems, that is, problems where elements belonging to one class cannot be separated from elements belonging to another class by a hyperplane; thus, another approach is needed to solve this family of problems when data are interval-valued. Generalized discriminant analysis (GDA) [12] is a generalization of classical linear discriminant analysis (LDA) that obtains nonlinear discriminants through kernel functions. The problem is formulated as an eigenvalue problem, and kernel functions are applied to find a feature space in which the input data become linearly separable, similarly to the theory underlying Support Vector Machines.

This work addresses a way in which GDA is generalized for interval data. It replaces the inner product used in the core matrix of GDA with an inner product for interval data and then introduces a kernelized inner product, allowing the interval-valued data to be kept as intervals while still performing the nonlinear mapping into a feature space. In addition, the proposed approach is applied to a breast temperature abnormality classification problem regarding malignant versus non-malignant classes. Section 2 describes the proposed kernelized discriminant approach for interval data. Section 3 describes the synthetic data sets considered in this work. Section 4 presents the experimental evaluation regarding the synthetic data sets and the Brazilian breast thermography database displayed in Table 1. Section 5 gives the conclusions.

2 Proposed model

In this section, we present an extension of GDA [12] to treat interval data, called here Interval Kernel Discriminant Analysis (IKDA). The main idea is to obtain a classifier for interval data that is able to solve classification problems with nonlinearly separable classes.

Following the GDA classifier, the IKDA one consists mainly of obtaining a kernel matrix whose elements are the inner products between elements of each class against each other, incorporating this matrix into classical linear discriminant analysis formulated as an eigenvector problem, and then projecting the data into a space in which each test data point can be allocated.

Let \(X = \{\mathbf{x }_{i},y_{i}\},\) \(i = 1, \ldots , N\) be a set of training symbolic objects. Each object i of the object set \(\Omega\) is described by a set of p symbolic interval variables and a categorical discrete variable. A symbolic interval variable [2] is a correspondence \(\Omega \rightarrow {\mathfrak {I}}\) such that each pattern i is represented by an interval \([a, b] \in {\mathfrak {I}}\), where \({\mathfrak {I}}= \{ [a, b] : a, b \in {\mathfrak {R}}, a \le b \}\) is the set of closed intervals of \({\mathfrak {R}}\). Here, the N training symbolic patterns \(({\mathbf {x}} _{i},y_{i})\) have \({\mathbf {x}}_{i} = (x_{i1}=[a_{i1},b_{i1}], \ldots , x_{ip}=[a_{ip},b_{ip}])\) as a vector of interval covariates and \(y_{i}\) as a response variable taking one of C class labels.

Let \(\mathbf K\) be a \(C \times C\) symmetric block matrix defined over the classes of the training set, whose elements are defined as being matrices themselves:

$$\begin{aligned}&{\mathbf{k}}_{gh} = (k_{gh})_{lm} \nonumber \\&g,h \in \{1, \ldots , C\}, \nonumber \\&l \in \{1,\ldots ,n_{g}\}, m \in \{1,\ldots ,n_{h}\} \end{aligned}$$
(1)

in which \(n_{g}\) is the number of elements in class g and \(n_{h}\) is the number of elements in class h. In order to propose a kernel matrix for the p-dimensional interval data space, each pattern is split into p parts and p kernel functions are defined for these parts. Suppose that any point w in the interval \([a_j,b_j]\) for dimension j can be mapped from the input data space to a high-dimensional feature space F through a nonlinear function \(\phi (w)\):

$$\begin{aligned} \phi : X&\rightarrow F \nonumber \\ [a,b]&\rightarrow \phi ([a,b]) \end{aligned}$$
(2)

Consider \(\phi\) as a monotonic nonlinear function defined on the real numbers that compose the interval \([a_j,b_j]\). For all \(w, r \in [a_j,b_j]\) such that \(w \le r\), \(\phi\) preserves or reverses the order (\(\phi (w) \le \phi (r)\) or \(\phi (w) \ge \phi (r)\), respectively); thus, we do not need to apply \(\phi\) to all real numbers inside the interval, only to its boundaries. Here, \(a_j \le b_j\), so \(\phi (a_j) \le \phi (b_{j})\) or \(\phi (a_j) \ge \phi (b_j)\).

The main way in which symbolic interval data arise is the aggregation of large data sets. For example, in a breast temperature matrix, the main interest is to evaluate temperature abnormalities for each breast: all temperature values for each breast are aggregated and their characteristics combined into a single object. In this way, all points inside the interval \([a_j,b_j]\) can be mapped using the \(\phi\) function. As \(\phi\) is monotonic, the interval structure is preserved, and it suffices to apply this function to the lower and upper bounds of the interval. An interval in the feature space can thus be defined as:

$$\begin{aligned} [a_{j},b_{j}]_{\phi }&=[\phi (a_j),\phi (b_j)] \quad {\text {if }}\;\phi \;{\text { is monotonically nondecreasing}}\\ [a_{j},b_{j}]_{\phi }&=[\phi (b_j),\phi (a_j)] \quad {\text {if }}\;\phi \;{\text { is monotonically nonincreasing}}\\ \end{aligned}$$
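As a small illustration of this boundary mapping, the sketch below (a minimal example; the function names and the particular choice of \(\phi\) are ours, not part of the formulation) maps an interval's bounds through a monotonic \(\phi\) and reorders them so the result is again a valid interval:

```python
import numpy as np

def interval_image(a, b, phi=np.exp):
    """Image of the interval [a, b] under a monotonic function phi.

    Because phi is monotonic, only the bounds need to be mapped; the
    result is sorted so it stays a valid interval [lower, upper].
    """
    lo, hi = phi(a), phi(b)
    return (min(lo, hi), max(lo, hi))

# A monotonically nonincreasing phi reverses the bounds:
print(interval_image(2.0, 5.0, phi=lambda w: -w))  # (-5.0, -2.0)
```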

2.1 Kernelized inner product for interval data

For point data, this nonlinear mapping is often replaced by an inner-product kernel to obtain the corresponding points in the transformed space. Here, for interval data, we kernelize the interval inner product in order to achieve a result similar to that of the original GDA.

According to [13], given any interval-valued variables \(x_{r} = ([a_{r1},b_{r1}],\ldots ,\) \([a_{rp},b_{rp}])\) and \(x_{s} = ([a_{s1},\) \(b_{s1}],\ldots ,[a_{sp},b_{sp}])\), the inner product for interval data is given by:

$$\begin{aligned} \langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle = \left\{ \begin{array}{lr} \frac{1}{4}\sum \nolimits _{j=1}^{p}(a_{rj}+b_{rj})(a_{sj}+b_{sj}){,} &\quad {\text {if }} \quad {\mathbf {x}} _{r} \ne {\mathbf {x}} _{s}\\ \\ \frac{1}{3}\sum \nolimits _{j=1}^{p}(a_{rj}^{2}+a_{rj}b_{rj}+b_{rj}^{2}){,} &\quad {\text {if}}\quad {\mathbf {x}} _{r} = {\mathbf {x}} _{s} \end{array} \right. \end{aligned}$$
(3)

Using Eq. (3), the kernelized inner product can be defined as

$$\begin{aligned} \langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle _{\phi } = \left\{ \begin{array}{l} \frac{1}{4}\sum \nolimits _{j=1}^{p} \{ \phi (a_{rj}) \cdot \phi (a_{sj}) + \phi (a_{rj}) \cdot \phi (b_{sj}) \\ \quad + \phi (b_{rj}) \cdot \phi (a_{sj}) + \phi (b_{rj}) \cdot \phi (b_{sj}) \} \\ \\ \frac{1}{3}\sum \nolimits _{j=1}^{p} \{ \phi (a_{rj}) \cdot \phi (a_{rj}) + \phi (a_{rj}) \cdot \phi (b_{rj}) \\ \quad + \phi (b_{rj}) \cdot \phi (b_{rj}) \} \end{array} \right. \end{aligned}$$
(4)

under the same restrictions as Eq. (3), that is: if \({\mathbf {x}} _{r} \ne {\mathbf {x}} _{s}\) and \({\mathbf {x}} _{r} = {\mathbf {x}} _{s}\), respectively.

Given the property that the sum of kernel functions defined over the same input space is itself a kernel function [14], \(\langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle _{\phi }\) is a valid kernel.

If \(a_{rj}=b_{rj}\) and \(a_{sj}=b_{sj}\), we obtain the particular case

$$\begin{aligned} \langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle _{\phi } = \left\{ \begin{array}{ll} \sum \nolimits _{j=1}^{p}\phi (a_{rj}) \cdot \phi (a_{sj}), \quad {\text {if }}\;{\mathbf {x}} _{r} \ne {\mathbf {x}} _{s}\\ \\ \sum \nolimits _{j=1}^{p}\phi (a_{rj}) \cdot \phi (a_{rj}), \quad {\text {if }}\; {\mathbf {x}} _{r} = {\mathbf {x}} _{s} \end{array} \right. \end{aligned}$$
(5)

The kernel model \(\langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle _{\phi }\) in Eq. (5) is the sum of univariate kernels, i.e., a combined kernel for point data. Thus, the kernel model \(\langle {\mathbf {x}} _{r} , {\mathbf {x}} _{s}\rangle _{\phi }\) in Eq. (4) generalizes the traditional kernel to treat interval data:

$$\begin{aligned} {\mathbf {k}} _{gh} = (k_{gh})_{lm} = \langle {\mathbf {x}} _{l}(g) , {\mathbf {x}} _{m}(h)\rangle _{\phi } \nonumber \\ \quad g,h \in \{1, \ldots , C\}, \nonumber \\ l \in \{1,\ldots ,n_g\}, m \in \{1,\ldots ,n_h\}. \end{aligned}$$
(6)
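To make Eqs. (4) and (6) concrete, the following Python sketch implements the kernelized interval inner product and assembles the block kernel matrix. It is a minimal illustration under our own conventions: each pattern is stored as a (p, 2) array of lower/upper bounds, and the elementwise \(\phi\) (the exponential here) is chosen purely for illustration.

```python
import numpy as np

def interval_inner_product_phi(x_r, x_s, phi=np.exp):
    """Kernelized inner product of Eq. (4).

    x_r, x_s: arrays of shape (p, 2), each row holding [a_j, b_j].
    """
    a_r, b_r = phi(x_r[:, 0]), phi(x_r[:, 1])
    a_s, b_s = phi(x_s[:, 0]), phi(x_s[:, 1])
    if np.array_equal(x_r, x_s):
        # case x_r == x_s: three terms, factor 1/3
        return np.sum(a_r * a_r + a_r * b_r + b_r * b_r) / 3.0
    # case x_r != x_s: four cross terms, factor 1/4
    return np.sum(a_r * a_s + a_r * b_s + b_r * a_s + b_r * b_s) / 4.0

def block_kernel_matrix(classes, phi=np.exp):
    """Block matrix K of Eq. (6).

    classes: list of arrays, one per class, each of shape (n_g, p, 2);
    patterns are stacked class by class so that K keeps the block layout.
    """
    X = np.concatenate(classes, axis=0)
    N = X.shape[0]
    K = np.empty((N, N))
    for l in range(N):
        for m in range(N):
            K[l, m] = interval_inner_product_phi(X[l], X[m], phi)
    return K
```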

2.2 Optimization problem

The kernel operator \(\mathbf K\) allows the construction of a nonlinear separating function in the input space that is equivalent to a linear separating function in the feature space F. The construction of this function is formulated by maximizing the inter-class inertia and minimizing the intra-class inertia.

According to [12], this optimization problem amounts to finding the eigenvalues \(\lambda\) and eigenvectors \(\mathbf v\) that are solutions of the equation:

$$\begin{aligned} \lambda V{\mathbf {v}} = B {\mathbf {v}} \end{aligned}$$
(7)

The largest eigenvalue of the previous equation gives the maximum of the following quotient of inertia:

$$\begin{aligned} \lambda = \frac{{\mathbf{{v}} }^{t}{\mathbf{{B}} }{\mathbf{{v}} }}{{\mathbf{{v}} }^{t}{\mathbf{{V}} }{\mathbf{{v}} }} \end{aligned}$$
(8)

where \(\mathbf V\) and \(\mathbf B\) represent the total and inter-classes inertia matrices, respectively, in the feature space F.

Because the eigenvectors are linear combinations of elements in F, there exist coefficients \(\alpha _{gq}\) \((g = 1, \ldots , C;\; q = 1, \ldots , n_{g})\) such that:

$$\begin{aligned} {\mathbf {v}} = \sum _{g=1}^{C}\sum _{q=1}^{n_{g}}\alpha _{gq}\,\phi (x_{g(q)}) \end{aligned}$$
(9)

The general coefficient vector \(\varvec{\alpha } = (\alpha _{gq})\) can be written as \(\varvec{\alpha }\) = (\(\varvec{\alpha }_{g})_{g \in \{1,\ldots ,C\}}\) where \(\varvec{\alpha }_{g} = (\alpha _{gq})_{q=1,\ldots ,n_{g}}\); \(\varvec{\alpha }_{g}\) is the coefficient vector of the class g into \(\mathbf v\).

From Appendix B of [12], Eq. (8) is equivalent to

$$\begin{aligned} \lambda = \frac{\varvec{\alpha }^{t} {\mathbf {K}} {\mathbf {W}} {\mathbf {K}} \varvec{\alpha }}{\varvec{\alpha }^{t} {\mathbf {K}} {\mathbf {K}} \varvec{\alpha }} \end{aligned}$$
(10)

in which \(\mathbf W\) is a block diagonal matrix where each of its blocks \({\mathbf {W}} _{g}\) is a square \(n_{g} \times n_{g}\) matrix with all terms equal to \(1/n_{g}\) \((g \in \{1,\ldots ,C\})\).

The elements of the matrix \(\mathbf K\) are centered in the feature space according to [12], and the solution of the system in Eq. (10) is obtained using the eigenvector decomposition of the matrix \(\mathbf K\):

$${\mathbf {K}} = {\mathbf {U}} {\varvec{\varGamma }} {\mathbf {U}}^{t}$$
(11)

where \(\varvec{\varGamma }\) is the diagonal matrix of nonzero eigenvalues and \(\mathbf U\) the matrix of normalized eigenvectors associated to \(\varvec{\varGamma }\).

Substituting \(\mathbf K\) in Eq. (10)

$$\begin{aligned} \lambda = \frac{\varvec{\alpha }^{t} {\mathbf {U}} \varvec{\varGamma } {\mathbf {U}} ^{t} {\mathbf {W}} {\mathbf {U}} \varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha }}{\varvec{\alpha }^{t} {\mathbf {U}} \varvec{\varGamma } {\mathbf {U}} ^{t} {\mathbf {U}} \varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha }} = \frac{ (\varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha })^{t} \, {\mathbf {U}} ^{t} {\mathbf {W}} {\mathbf {U}} \, (\varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha })}{(\varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha })^{t} \, {\mathbf {U}} ^{t} {\mathbf {U}} \, (\varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha })} \end{aligned}$$
(12)

Consider \(\varvec{\beta }=\varvec{\varGamma } {\mathbf {U}} ^{t} \varvec{\alpha }\). Then, Eq. (12) can be rewritten as

$$\begin{aligned} \lambda \varvec{\beta } ={\mathbf{U}} ^{t} {\mathbf{W}} {\mathbf{U}} {\varvec{{\beta}}} \end{aligned}$$
(13)

For a given \(\varvec{\beta }\), there is at least one \(\varvec{\alpha }\) satisfying \(\varvec{\beta }=\varvec{\varGamma }{} {\mathbf{U}} ^{t}{\varvec{{\alpha}}}\) in the form:

$$\begin{aligned} \varvec{\alpha } ={\mathbf {U}} ({\varvec{{\varGamma }}})^{-1}{\varvec{{\beta }}}. \end{aligned}$$

The coefficients \(\varvec{\alpha }\) are normalized by requiring that the corresponding vectors \(\mathbf v\) be normalized \({\mathbf{v}} ^{t}{} {\mathbf {v}} = 1\) in F. So,

$$\begin{aligned} \varvec{\alpha }= \frac{\varvec{\alpha }}{\sqrt{\varvec{\alpha }^{t} {\mathbf {K}} \varvec{\alpha }}} \end{aligned}$$
(14)

Given the normalized eigenvectors \(\mathbf v\), we can obtain the projection vector of an element represented by \({\mathbf {x}}\) on \(\mathbf v\) as

$$\begin{aligned} {\mathbf {z}} ({\mathbf {x}} ) = {\mathbf {v}} ^{t} \phi ({\mathbf {x}} ) = \sum _{g=1}^{C}\sum _{l=1}^{n_{g}}\alpha _{gl} \, \langle {\mathbf {x}} _{l}(g) , {\mathbf {x}} \rangle _{\phi }. \end{aligned}$$
(15)

2.2.1 The algorithm

The IKDA algorithm is summarized as follows:

(The IKDA algorithm is presented as a figure in the original article.)
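Since the algorithm figure cannot be reproduced here, the sketch below outlines the training and projection steps as we read them from Sect. 2 (Eqs. 6, 11, 13, 14 and 15). It is a simplified reading, not the authors' exact code: the centering of K prescribed in [12] is assumed to have been applied beforehand, and allocation by nearest projected class centroid under the Euclidean distance is our interpretation of the allocation step described in Sect. 4.

```python
import numpy as np

def ikda_fit(K, class_sizes, n_components=1):
    """IKDA training on a (centered) block kernel matrix K (Eq. 6).

    class_sizes = [n_1, ..., n_C], in the order used to build K.
    Returns the normalized coefficient vectors alpha (Eq. 14).
    """
    N = K.shape[0]
    # Block diagonal W with n_g x n_g blocks of 1/n_g (Sect. 2.2)
    W = np.zeros((N, N))
    start = 0
    for n_g in class_sizes:
        W[start:start + n_g, start:start + n_g] = 1.0 / n_g
        start += n_g
    # Eigendecomposition of K (Eq. 11), keeping nonzero eigenvalues
    gamma, U = np.linalg.eigh(K)
    keep = gamma > 1e-10
    gamma, U = gamma[keep], U[:, keep]
    # Solve lambda * beta = U^t W U beta (Eq. 13)
    evals, betas = np.linalg.eigh(U.T @ W @ U)
    order = np.argsort(evals)[::-1][:n_components]
    betas = betas[:, order]
    # Recover alpha = U Gamma^{-1} beta and normalize it (Eq. 14)
    alphas = U @ np.diag(1.0 / gamma) @ betas
    for c in range(alphas.shape[1]):
        a = alphas[:, c]
        alphas[:, c] = a / np.sqrt(a @ K @ a)
    return alphas

def ikda_project(alphas, k_x):
    """Projection of a pattern x (Eq. 15); k_x is the vector of kernelized
    inner products between every training pattern and x."""
    return k_x @ alphas

def ikda_allocate(z_x, class_centroids):
    """Allocation (our assumption): nearest projected class centroid."""
    d = [np.linalg.norm(z_x - c) for c in class_centroids]
    return int(np.argmin(d))
```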

3 Three synthetic interval data sets

In this section, three different data sets are presented: two synthetic interval data sets with synthetic seeds and one synthetic data set with real data seeds.

3.1 Two synthetic interval data sets with synthetic seeds

The procedure to generate synthetic interval data sets based on synthetic seeds consists of two steps:

  • To obtain a seed data set with classical variables.

  • To consider variability for seed data in order to generate a synthetic interval data set.

To obtain these synthetic interval data sets, two standard synthetic quantitative data sets are generated and used as seeds. Both are generated in \(\mathfrak {R}^2\); therefore, they have two standard continuous quantitative variables.

The first data set has 100 points scattered between two classes. Each class is defined as the upper or lower half of a circumference generated from the same uniform distribution plus Gaussian noise; the upper class was then shifted to increase the proximity between the classes. The second data set has 150 points distributed into two classes of unequal sizes: the first class has 100 points and the second has 50. Both classes were designed as circumferences with the same origin, but each class has a different radius and is generated from an independent uniform distribution with Gaussian noise.

The quantitative data set 1 is generated by the following parameters:

  • Class 1: \(X_{1} \sim U(5,25)\); \(X_{2} = \sqrt{100 - (X_{1} - 15)^{2}} + 20\); noise \(\sim N(0,1)\); \(S_{X_{1}} = 10\); \(S_{X_{2}} = -3\)

  • Class 2: \(X_{1} \sim U(5,25)\); \(X_{2} = \sqrt{100 - (X_{1} - 15)^{2}} + 20\); noise \(\sim N(0,1)\)

The quantitative data set 2 is generated by the following parameters:

  • Class 1: \(X_{1} \sim U(0,40)\); \(X_{2} = \sqrt{400 - (X_{1} - 20)^{2}} + 20\); noise \(\sim N(0,1)\)

  • Class 2: \(X_{1} \sim U(15,25)\); \(X_{2} = \sqrt{25 - (X_{1} - 20)^{2}} + 20\); noise \(\sim N(0,1)\)

\(X_1\) is the first coordinate, \(X_{2}\) is given by the circle equation, \(S_{X_{1}}\) and \(S_{X_{2}}\) are the values added to each coordinate to force class 1 closer to class 2, and noise is a value added to the \(X_{2}\) coordinate. To generate symbolic data sets from these two standard quantitative data sets, each variable is expanded to form an interval.

Each data point (\(x_{1}\), \(x_{2}\)) of each one of these synthetic quantitative data sets is a seed for a vector of intervals (rectangle) through the following procedure:

$$\begin{aligned} ([x_1 - \gamma _{1}/2, x_1 + \gamma _{1}/2], [x_2 - \gamma _{2}/2, x_2 + \gamma _{2}/2]) \end{aligned}$$

where the parameters \(\gamma _{1}\) and \(\gamma _{2}\) are randomly selected from a predefined interval: \(\left[ 1,5 \right]\), \(\left[ 1,10 \right]\) or \(\left[ 1,15 \right]\).
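As an illustration, a minimal Python sketch of this seed-and-expand procedure for quantitative data set 1 follows (the per-class sample size of 50 and all function names are our own assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def seed_class_ds1(n, shift_x1=0.0, shift_x2=0.0):
    """Half-circumference seed class of quantitative data set 1."""
    x1 = rng.uniform(5, 25, n)
    x2 = np.sqrt(100 - (x1 - 15) ** 2) + 20 + rng.normal(0, 1, n)
    return np.column_stack([x1 + shift_x1, x2 + shift_x2])

def expand_to_intervals(points, gamma_range=(1, 5)):
    """Expand each seed point (x1, x2) into a rectangle of intervals."""
    g1 = rng.uniform(*gamma_range, size=len(points))
    g2 = rng.uniform(*gamma_range, size=len(points))
    lows = points - np.column_stack([g1, g2]) / 2
    highs = points + np.column_stack([g1, g2]) / 2
    return np.stack([lows, highs], axis=2)  # shape (n, 2 variables, 2 bounds)

class1 = expand_to_intervals(seed_class_ds1(50, shift_x1=10, shift_x2=-3))
class2 = expand_to_intervals(seed_class_ds1(50))
```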

Fig. 1 Quantitative data sets 1 and 2 and their corresponding symbolic data sets

Therefore, from each element in the standard data set, we generate an interval element of the synthetic interval data set. Figure 1 presents an example of the generation of the symbolic data sets described in this section: on the left, synthetic data set 1 and its symbolic counterpart; on the right, synthetic data set 2 and its symbolic counterpart. These examples were generated by choosing both \(\gamma _{1}\) and \(\gamma _{2}\) from the \(\left[ 1,5\right]\) interval.

3.2 A synthetic interval data set with real seeds: interval Iris data

As a different study case for our method, we analyze Fisher's Iris flower data set, a typical test case used by the machine learning community. This classical data set consists of 3 classes described by 4 continuous variables that correspond to the sepal and petal length and width of each element.

Given the different nature of the data our classifier is supposed to address, we subject the database to the same procedure used to generate synthetic interval-valued data in the previous subsection.

That is, the original Iris data set is subjected to the same procedure as synthetic data sets 1 and 2 to generate a symbolic Iris data set whose variables are interval variables. The generation parameters \(\gamma _{1}\) and \(\gamma _{2}\) were chosen from the same intervals used for the synthetic data sets. Table 2 shows part of the resulting data set.

Table 2 Interval Iris data set from original variables

4 Experimental evaluation

In this section, the experimental evaluation is presented. The proposed classifier (IKDA) is evaluated and compared against three other classifiers:

  • Logistic Regression classifier (LOGIT), where two regressions are adjusted for each class, one for the lower bounds of the interval variables and another for the upper bounds; allocation is given by the average of the responses obtained by the two regressions.

  • Linear Discriminant Analysis for Interval Data (ILDA), using the distributional approach with either definition A or definition B from [10] and the Hausdorff distance for interval data (ILDA-A refers to ILDA using definition A and ILDA-B to ILDA using definition B).

In our experiments with the proposed IKDA method, the following elements were considered:

  • Polynomial kernel with degree \(d = 1, 2, 3, 4, 5\) and Gaussian kernel with width \(\sigma = 0.5, 1, 3, 5, 7\).

  • Euclidean distance in the allocation step.

Prediction accuracy is measured by the classification error rate, which is estimated by a Monte Carlo simulation with 500 replications for the synthetic data sets with synthetic seeds, by tenfold cross-validation for the synthetic data set with real seeds and by the leave-one-out method for the real data set. In the framework of the Monte Carlo simulation, test and learning sets are randomly selected from each synthetic interval data set: the learning set corresponds to 75% of the original data, and the test set corresponds to 25%.
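For clarity, a hedged sketch of this Monte Carlo protocol is shown below; `fit_and_predict` is a placeholder for any of the classifiers compared here and is not part of the paper:

```python
import numpy as np

def monte_carlo_error(X, y, fit_and_predict, n_rep=500, train_frac=0.75, seed=0):
    """Average test error rate over n_rep random 75%/25% splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        train, test = idx[:cut], idx[cut:]
        y_pred = fit_and_predict(X[train], y[train], X[test])
        errors.append(np.mean(y_pred != y[test]))
    return np.mean(errors), np.std(errors)
```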

4.1 Synthetic data sets with synthetic seeds

Tables 3 and 4 present the average and standard deviation (in parenthesis) of the error rate for IKDA method and interval data set 1 and Tables 5 and 6 for IKDA method and interval data set 2. Table 7 shows error rate averages for LOGIT and ILDA methods. From the results in these tables, some remarks are listed.

  • For interval data set 1, it can be seen that a small increase in the value of the parameter did not cause an increase in performance when the polynomial kernel was used; the opposite was observed when the Gaussian kernel was used.

  • For interval data set 1, the best result with the polynomial kernel is obtained with \(d=1\) and the best result with the Gaussian kernel is obtained with \(\sigma =7\). Under these conditions, the Gaussian kernel is slightly superior to the polynomial kernel. This is expected, since interval data set 1 has weak nonlinear separation when compared to interval data set 2.

  • For interval data set 2, which shows a greater degree of nonlinear separation than interval data set 1, the Gaussian kernel is superior to the polynomial kernel for any value of \(\sigma \in \{0.5, 1, 3, 5, 7\}\).

  • The linear classifiers obtained poor performance, overall comparable only with the worst results from the nonlinear classifiers.

Table 3 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic data set 1 and polynomial kernel
Table 4 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic data set 1 and Gaussian kernel
Table 5 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic data set 2 and polynomial kernel
Table 6 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic data set 2 and Gaussian kernel
Table 7 Average (in %) and standard deviation of the error rate for LOGIT and ILDA classifiers and synthetic data sets 1 and 2

4.2 Synthetic data sets with real seeds

Tables 8 and 9 present the average and standard deviation of the error rate for the IKDA classifier regarding the synthetic interval data set with real seeds using for \(\gamma\) the intervals [1, 5], [1, 10] and [1, 15]. Table 10 shows the average and standard deviation of the error rate for LOGIT, ILDA-A and ILDA-B classifiers.

The results in these tables show that the IKDA method with the polynomial kernel and parameter \(d = 1\) had better performance than with the Gaussian kernel for each interval of uncertainty introduced in the data set; however, despite the polynomial kernel obtaining the single best results, overall the Gaussian kernel was more successful. This effect may be due to the fact that the original Iris data set has linearly separable classes and the polynomial kernel with the chosen parameters behaves similarly to a linear model. The linear classifiers had overall lower accuracy than both kernel-based configurations.

Table 8 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic interval data set with real seeds and polynomial kernel
Table 9 Average (in %) and standard deviation of the error rate for IKDA approach, synthetic interval data set with real seeds and Gaussian kernel
Table 10 Average (in %) and standard deviation of the error rate for LOGIT and ILDA classifiers and synthetic interval data set with real seeds

4.3 Real breast temperature interval data set

As stated in [15], "Most work on the analysis of breast thermal images provide classification results using the accuracy, specificity and sensitivity measures or/and also present the corresponding ROC curves of their methods." This is mostly due to a type I error concern; that is, most works are more interested in correctly classifying the malignant abnormalities class than the other classes (also reflected in our representation of this problem as a binary problem). Global misclassification/accuracy alone analyzes the overall correctness of the classification but cannot identify whether the class of interest has a good detection rate, which justifies computing other measures and presenting them together with the accuracy/misclassification values. Researchers in the medical field value sensitivity [16–18] because wrongly classifying patients that should be allocated to the malignant abnormalities class may lead directly to their death.

Therefore, in our analysis we prioritize sensitivity followed by the global misclassification rate, in this order. Table 11 presents confusion matrices for the proposed IKDA method using the polynomial kernel with \(d=1\), \(d=2\), \(d=3\), \(d=4\) and \(d=5\), and Table 12 presents confusion matrices for the proposed IKDA method using the Gaussian kernel with \(\sigma =0.5\), \(\sigma =1\), \(\sigma =3\), \(\sigma =5\) and \(\sigma =7\), respectively. The best performances of the IKDA method are achieved with \(d=1\) for the polynomial kernel and with \(\sigma =5\) and \(\sigma =7\) for the Gaussian kernel.

Table 11 Confusion matrix for the IKDA classifier with polynomial kernel
Table 12 Confusion matrix for the IKDA classifier with Gaussian kernel

Table 13 displays the confusion matrices for the LOGIT, ILDA-A and ILDA-B classifiers. The LOGIT method is inferior to ILDA-A and ILDA-B in terms of correctly predicted classifications of the malignant abnormalities class, but superior in terms of overall correct predictions.

Table 13 Confusion matrix for the LOGIT, ILDA-A and ILDA-B classifiers

The global misclassification rate and the sensitivity index are computed from the previous tables. The sensitivity index represents the proportion of actual positive samples that are correctly identified as such and plays an important role in the medical field, as it relates the true positives to the false negatives. The sensitivity can be calculated as

$$\begin{aligned} {\text{Sen}}(i) = \frac{{\text{TP}}_i}{{\text{TP}}_i+{\text{FN}}_i} \end{aligned}$$
(16)

where \({\text{TP}}_i\) denotes the true positives for class i and \({\text{FN}}_i\) the false negatives for class i.
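A small sketch computing Eq. (16) and the global misclassification rate from a confusion matrix follows (the row = true class / column = predicted class layout and the example numbers are our own, not taken from the paper's tables):

```python
import numpy as np

def sensitivity_and_error(conf, positive=0):
    """conf[i, j] = number of class-i samples predicted as class j."""
    tp = conf[positive, positive]
    fn = conf[positive].sum() - tp           # Eq. (16): Sen = TP / (TP + FN)
    sensitivity = tp / (tp + fn)
    error_rate = 1.0 - np.trace(conf) / conf.sum()
    return sensitivity, error_rate

# Hypothetical 2x2 example (malignant = class 0)
print(sensitivity_and_error(np.array([[12, 2], [5, 26]])))
```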

The overall misclassification rate and sensitivity index for the malignant class obtained by the IKDA, LOGIT, ILDA-A and ILDA-B methods are presented in Table 14. The results show that the best value of the sensitivity index, which is extremely important for medical studies, is achieved by the IKDA method using the Gaussian kernel (\(\sigma =5\) and \(\sigma =7\)) and by the ILDA-A and ILDA-B models. Among these three methods, IKDA had the best overall misclassification rate.

Table 14 Misclassification rate and sensitivity index for malignant class and IKDA, ILDA-A, ILDA-B and LOGIT methods

5 Conclusions

This work introduced a kernelized classifier for interval-valued data. It is based on generalized discriminant analysis (GDA) for its ability to solve nonlinearly separable problems. Here, the inner product for interval data is kernelized as the sum of multiple identical kernel functions applied to the different bounds of each interval-valued variable. The proposed method is a generalization of GDA to treat symbolic interval data with nonlinearly separable classes.

Two types of kernel functions were used to evaluate the behavior of the proposed classifier. Its performance was assessed by the global error rate under different configurations of synthetic interval data sets. An application to a Brazilian breast thermography database was also considered, and the performance was assessed by the sensitivity index, which is extremely important for medical studies, together with the global misclassification rate. The performance analysis confirmed the usefulness of the proposed method for interval data in nonlinearly separable class problems when compared with other classifiers from the symbolic data analysis literature.