1 Introduction

Plastic recycling is a promising alternative to landfill for dealing with the fastest growing waste stream in the world [8]. However, for physicochemical reasons related to the non-miscibility of plastics, most plastics must be recycled separately. Plastic category identification is therefore a major challenge in the recycling process. With the emergence of hyperspectral imaging, some industrial firms have designed sorting devices able to discriminate between several categories of plastics based on their absorption or transmittance spectra. The sorting process is generally performed using supervised classification, which has developed considerably with the emergence of computer science and data science [18, 22, 38]. Classification performance may be affected by several issues such as noise or overlapping regions in the feature space [21, 34]. The latter problem occurs when samples from different classes share very similar characteristics. We face these problems in particular when attempting to classify industrially acquired spectra. Indeed, in an industrial context, the acquisition process is subject to technical and financial constraints to ensure throughput and financial competitiveness. For this reason one cannot expect the same quality of data as for equivalent laboratory measurements. Several issues introduce imprecision and uncertainty into the acquired spectra: (i) the available spectral range might be insufficient; (ii) the plastic categories to be recycled are chemically close; (iii) atmospheric perturbations may cause noise; (iv) plastic ageing and plastic additives are known to change spectral information; (v) impurities such as dust deposits or remains of tags also produce spectral noise.

As in many other decision-making problems, classification errors may have serious consequences, e.g., in medical diagnosis applications. Regarding plastic sorting, identification errors cause serious recycling difficulties and a significant degradation of the performance of the secondary raw material, and thus a quality degradation of the recycled products. Usually, the problem of plastic identification is treated using standard classification algorithms that are designed to produce point predictions, i.e., a single plastic category. In the case of imperfect data, standard classifiers become confused and inevitably commit errors. This brings us to consider alternative representations of the information that take imprecision and uncertainty into account in order to achieve more accurate classification. Modern theories of uncertainty such as fuzzy subsets [35], possibility theory [14], imprecise probabilities [33] or belief functions [26, 30] offer better representations of imperfect information. Several classification algorithms have been proposed in these frameworks. Most of them are extensions of standard algorithms. We can cite the fuzzy version of the well-known k-means algorithm [15], fuzzy and evidential versions of k-Nearest Neighbour (k-NN) [10, 19], or fuzzy and evidential revisions of neural network algorithms [4, 11].

In this paper we consider the case where the original imperfections come from the data features only. The available training example labels are precise and considered trustworthy, e.g., based on laboratory measurements and expertise. In order to better represent all the available information, we think that labels should conform with the feature imprecision. If an object of class \(\theta _1\) has its feature vector x in the overlapping region between \(\theta _1\) and \(\theta _2\), then the example should be relabelled with the set \(\lbrace \theta _1,\theta _2 \rbrace \). To achieve such a representation we propose to relabel each training example in accordance with its discriminability. The new labels are therefore subsets of the original set of classes. This imprecise relabelling better represents the learning data by mapping overlaps in the feature space. The resulting imprecise labels can be naturally treated within belief functions theory. Indeed, belief functions theory [26] is an interesting framework for representing imprecise and uncertain data, as it allows the allocation of a probability mass to imprecise data. Thus, imprecision and ignorance are better captured in this framework than in the probability framework, where equiprobability and imprecision are confused. The recent growing interest in this theory has allowed techniques to be developed for resolving a diverse range of problems such as estimation [12, 17], standard classification [10, 32], or even hierarchical classification [1, 23].

Our proposed approach, called Evidential CLAssification of incomplete data via Imprecise Relabelling (ECLAIR), is based on a relabelling procedure for the training examples that better represents the missing information about some data features. A classifier is then trained on the relabelled data, producing a posterior mass function. With imprecise relabelling we try to quantify, using a mass function, the extent to which a subset of classes is a reliable and relevant output for a new datum. In other words, we look for the set of classes such that any more precise output would inevitably lead to an error. The resulting classification algorithm can enhance classification accuracy and cope with difficult examples by allowing less precise but more reliable outputs, which optimizes the recycling process.

The remainder of this paper is organized as follows: Sect. 2 sets down the main notations and recalls supervised classification and elements of belief functions theory; Sect. 3 presents the proposed approach; Sect. 4 briefly describes related work; Sect. 5 presents experimental results on the problem of sorting four plastics.

2 Theoretical Background

Classification is a technique for assigning objects to categories based on the observation of several of their characteristics. A classifier is a function that maps an object, represented by the values of its characteristics on a finite set of variables, to a category represented by a value of a categorical variable. More precisely, let us consider a set of n categories represented by a set \(\varTheta = \lbrace \theta _1,\theta _2, \ldots , \theta _n \rbrace \), also referred to as a set of labels or classes. In the framework of belief functions, \(\varTheta \) is called a frame of discernment. Each \( \theta _j , \; j \in \lbrace 1,...,n \rbrace \), denotes a singleton which represents the lowest level of discernible information in \(\varTheta \). Let us denote by \(X_1, X_2, \ldots , X_p\) the p variables whose values represent the characteristics, also called attributes or features, of the objects to be classified. In the rest of the paper we refer to \(\varTheta \) as a set of classes and to \((X_1, X_2, \ldots , X_p)\) as a vector of features where \(\forall i \in \{1,\ldots ,p\}\), \(X_i\) refers both to the name of the feature and to the space of values taken by the feature, i.e., \(X_i \subseteq \mathbb {R}\). For an object x belonging to \(\mathcal {X} =\prod \limits _{i=1}^{p} X_i \subseteq \mathbb {R}^p\), let \( \theta (x) \in \varTheta \) denote the unknown label that should be associated with x.

In this article, we focus on a supervised classification problem. The specificity of the considered data, referred to as incomplete data, is that some features of some examples are missing due to technological constraints, so only part of the data of these examples is available. The proposed classification approach, qualified as imprecise, integrates the incompleteness of the data into its process in order to predict subsets of classes containing the true class in cases where a standard counterpart classifier would have predicted a wrong class. To this aim we divert standard probabilistic classifiers from their natural use so as to compute probabilities on sets of classes. The resulting uncertain information is then captured by belief functions. The following subsections briefly recall the notions discussed.

2.1 Supervised Classification

To determine \( \theta (x)\) in a supervised manner, a standard classifier \(\delta _{\varTheta } : \mathcal {X} \rightarrow \varTheta \) is trained on a set of examples \((x_i, \theta _i)_{1\le i \le N}\) such that for all \( 1\le i\le N \), \( x_i \) belongs to \( \mathcal {X}\) and \( \theta _i\) to \( \varTheta \). By standard classifier we mean a classifier that assigns to x a single class \( \theta (x) =\theta _j\), \(j \in \{1,\ldots ,n\}\). In some cases, when the input data is too voluminous or redundant, it may be appropriate to perform feature extraction before training \(\delta _{\varTheta }\). By reducing the dimension of \(\mathcal {X} \), and thus working with a reduced feature space \(\mathcal {X}' \subseteq \mathbb {R}^{p'}\) with \( p' < p\), extraction methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) or Independent Component Analysis (ICA) facilitate learning and may enhance classification performance. When feature extraction is designed taking the labels of the training examples into account, it is termed supervised feature extraction. For instance LDA, also known as Fisher discriminant analysis, reduces the number of features to \(n-1\) by looking for linear combinations of the variables maximizing the between-groups variance and minimizing the within-groups variance.
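To make the extraction step concrete, here is a minimal Python sketch of supervised feature extraction with LDA (the paper's own implementation uses R packages; the data here are synthetic and hypothetical):

```python
# A minimal sketch of supervised feature extraction with LDA, assuming
# scikit-learn and synthetic data (the paper's implementation uses R).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 256))       # N=300 spectra, p=256 wavelengths
y = rng.integers(0, 4, size=300)      # n=4 plastic categories (A, B, C, D)

lda = LinearDiscriminantAnalysis(n_components=3)  # reduce to n-1 = 3 features
X_reduced = lda.fit_transform(X, y)   # supervised: the labels y guide the projection
print(X_reduced.shape)                # (300, 3)
```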

2.2 Probabilistic Classifier and Decision Rule

When \(\delta _{\varTheta }\) can also provide for x a posterior probability distribution \(p(.|x) : \varTheta \rightarrow [0,1]\), it is called a probabilistic classifier. Many classification algorithms base their decision only on p(.|x) as follows: \(\theta (x)= \arg \max \limits _{j = 1 ,\ldots ,n} p(\theta _j|x)\). For more sophisticated decisions, one can use the decision rule technique classically used in decision theory. Let \(\mathcal {A}=\{a_1, a_2,\ldots ,a_m\}\) be a finite set of actions that can be taken. In the case of a standard classifier, an action \(a \in \mathcal {A}\) corresponds to assigning a class \(\theta \in \varTheta \) to an object x. In that case, we simplify by setting \( \mathcal {A} = \varTheta \). In order to compare decisions in \(\mathcal {A}\), or to compare the classifier \(\delta _{\varTheta }\) to another decision rule, two functions are introduced: a loss function and a risk function. A loss function \( L : \mathcal {A} \times \varTheta \rightarrow \mathbb {R}\) quantifies the loss \(L(a,\theta )\) incurred when choosing the action \(a \in \mathcal {A}\) while the true state of nature is \( \theta \in \varTheta \). A risk function \(r_{\delta _{\varTheta }}: \mathcal {A} \rightarrow \mathbb {R}\) is defined as the expectation \(r_{\delta _{\varTheta }}(a)=E_{p(.|x)}(L(a,\theta ))\). For a discrete and finite probability distribution, we have \(r_{\delta _{\varTheta }}(\theta _j)=\sum \limits _{k=1}^{n} L(\theta _j, \theta _k) \,\, p(\theta _k|x)\), \(j \in \{1,\ldots ,n\}\). Thus, considering the decision rule \(\delta _{\varTheta }\), the class \(\theta _j\) minimizing the risk \(r_{\delta _{\varTheta }}(\theta _j)\) over \(\varTheta \) should be chosen.
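As a worked illustration, the following sketch (hypothetical posterior and loss values; only NumPy is assumed) computes the risk \(r_{\delta _{\varTheta }}(\theta _j)\) of each class and picks the minimizer, which may differ from the argmax of p(.|x):

```python
import numpy as np

# Posterior p(theta_k | x) over n = 3 classes (hypothetical values).
p = np.array([0.5, 0.3, 0.2])

# Loss matrix: L[j, k] = loss of predicting theta_j when theta_k is true.
# Confusing theta_1 with theta_3 is assumed more costly than with theta_2.
L = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])

r = L @ p               # r[j] = sum_k L(theta_j, theta_k) p(theta_k | x)
print(r)                # [0.9, 0.7, 1.8]
print(int(r.argmin()))  # 1: differs from argmax p(.|x), which would pick 0
```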

2.3 Elements of Belief Functions Theory

Due to the additivity constraint inherent in the definition of a probability distribution, one cannot build a probability distribution when measures, observations, etc. are imprecise. Belief functions theory, as an extension of probability theory, allows masses to be assigned to imprecise data. Two levels are considered when introducing belief functions: the credal and pignistic levels. At the credal level, beliefs are captured and quantified by belief functions, while at the pignistic level, or decision level, beliefs are quantified using probability distributions.

Credal Level. A mass function, also called a basic belief assignment (bba), is a set function \(m : 2^{\varTheta } \rightarrow \left[ 0,1 \right] \) satisfying \(\sum \limits _{A \subseteq \varTheta } m(A) = 1\). For a set \( A \subseteq \varTheta \), the quantity m(A) is interpreted as a measure of the evidence committed exactly to the set A and not to any more specific subset of A. The elements \(A \in 2^{\varTheta }\) such that \(m(A)>0\) are called focal elements, and they form a set denoted \(\mathbb {F}\); \((m, \mathbb {F}) \) is called a body of evidence. The total belief committed to A is measured by the sum of the masses of all subsets of A. This is expressed by the belief function \(Bel : 2^{\varTheta } \rightarrow \left[ 0,1 \right] \), \(Bel(A) = \sum \limits _{B \subseteq \varTheta , B \subseteq A} m(B)\). Furthermore, the plausibility of A, \(Pl : 2^{\varTheta } \rightarrow \left[ 0,1 \right] \), quantifies the maximum amount of support that could be allocated to A: \(Pl(A) = \sum \limits _{B \subseteq \varTheta , B \cap A \ne \emptyset } m(B)\).

Pignistic Level. In the transferable belief model [29], the decision is made at the pignistic level. The evidential information is transferred into a probabilistic framework by means of the pignistic probability distribution \( betP_m\): for \(\theta \in \varTheta \), \(betP_m(\theta ) = \sum \limits _{A \subseteq \varTheta , A \ni \theta }m(A)/| A|\), where |A| denotes the number of elements of A.
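The following small sketch (plain Python; subsets of \(\varTheta \) are frozensets, mass values are hypothetical) illustrates the credal-level quantities Bel and Pl together with the pignistic transformation:

```python
theta = frozenset({"A", "B", "C"})
# A body of evidence (m, F): masses on subsets of theta (hypothetical values).
m = {frozenset({"A"}): 0.5, frozenset({"A", "B"}): 0.3, theta: 0.2}

def bel(A):   # total belief: sum of the masses of the subsets of A
    return sum(v for B, v in m.items() if B <= A)

def pl(A):    # plausibility: sum of the masses of the sets intersecting A
    return sum(v for B, v in m.items() if B & A)

def betP(t):  # pignistic probability of a singleton t
    return sum(v / len(B) for B, v in m.items() if t in B)

A = frozenset({"A", "B"})
print(bel(A), pl(A))                                   # 0.8 1.0
print({t: round(betP(t), 4) for t in sorted(theta)})   # sums to 1
```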

Decision Rule. The risk associated with a decision rule can be adapted to the evidential framework [9, 13, 27]. In the case of imprecise data, the set of actions \(\mathcal {A}\) is \( 2^{\varTheta } \setminus \lbrace \emptyset \rbrace \). In order to decide between the elements of \( \mathcal {A}\) according to the chosen loss function L, different strategies may be adopted. Two are proposed in the literature: the optimistic strategy, minimizing \(\underline{r}_{\delta _{\varTheta }} \), and the pessimistic strategy, minimizing \(\overline{r}_{\delta _{\varTheta }}\), defined as follows:

$$\begin{aligned} \underline{r} (A) = \sum _{B \subseteq \varTheta } m(B) \min _{\theta \in B} L(A,\theta ), \quad \overline{r}(A)= \sum _{B \subseteq \varTheta } m(B) \max _{\theta \in B} L(A,\theta ). \end{aligned}$$
(1)
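A minimal sketch of these two strategies, reusing a hypothetical body of evidence and a simple 0-1 loss on sets (no loss when the true class belongs to the chosen set A), could look as follows:

```python
# Optimistic and pessimistic risks of Eq. (1); masses are hypothetical.
m = {frozenset({"A"}): 0.5, frozenset({"A", "B"}): 0.3,
     frozenset({"A", "B", "C"}): 0.2}

def lower_risk(A, L):   # optimistic: best case inside each focal set B
    return sum(v * min(L(A, t) for t in B) for B, v in m.items())

def upper_risk(A, L):   # pessimistic: worst case inside each focal set B
    return sum(v * max(L(A, t) for t in B) for B, v in m.items())

L01 = lambda A, t: 0.0 if t in A else 1.0   # 0-1 loss on sets

A = frozenset({"A"})
print(lower_risk(A, L01), upper_risk(A, L01))   # 0.0 0.5
```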

3 Problem Statement and Proposed Approach

3.1 Imprecise Supervised Classification

For a new example x, the output of an imprecise classifier is a set of classes: all its elements are candidates for the true class \(\theta \), and the missing information prevents a more precise output. In this case a possible output of the classifier is the information: “\(\theta \in A\)”, \(A \subseteq \varTheta \). To perform an imprecise classification, two cases need to be distinguished regarding the training examples: (case 1) the learning examples are precisely labelled, i.e., a single class is assigned to each example; (case 2) one or more classes are assigned to each training example. In the first case, described in Subsect. 2.1, standard classifiers give a single class as the prediction for a new object x, but some recent classifiers [6, 7, 36] give a set of classes as the prediction for x. Some of these recent classifiers base their algorithm on the posterior probability provided by standard classifiers. More precisely, if we denote by \(\mathbb {P}(.|x)\) the probability measure associated with the posterior probability distribution p(.|x), then \(\mathbb {P}(A|x)=\sum \limits _{\theta \in A} p(\theta |x)\), \(A \subseteq \varTheta \), is used to determine the relevant subset of classes to be assigned to x. In the second case, where the imprecision or doubt is explicitly expressed by the labels [2, 5, 37], a classifier \(\delta _{2^\varTheta } : \mathcal {X} \rightarrow 2^\varTheta \setminus \lbrace \emptyset \rbrace \) is trained on a set of examples \((x_i, A_i)_{1\le i \le N}\) such that for all \( 1\le i\le N \), \( x_i \) belongs to \( \mathcal {X}\) and \( \emptyset \ne A_i \subseteq \varTheta \). This case is referred to in our paper as imprecise supervised classification.
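In case 1, deriving \(\mathbb {P}(A|x)\) from a standard posterior is immediate; a tiny sketch with hypothetical values:

```python
# P(A|x) as the sum of singleton posteriors (hypothetical values).
p = {"A": 0.45, "B": 0.40, "C": 0.15}

def prob_of_set(A):
    return sum(p[t] for t in A)

print(prob_of_set({"A", "B"}))   # 0.85: a less precise but safer output than "A"
```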

3.2 Problem Statement

Let us consider the supervised classification problem where the available training examples are precisely labelled (case 1), \((x_i, \theta _i)_{1\le i \le N }\), \( x_i \in \mathcal {X}\) and \( \theta _i \in \varTheta \), and are such that (i) the labels \(\theta _{i = 1,...,N}\) are trusted; they may derive from expertise on other features \( x^*_{i = 1,...,N} \) which contain more complete information than \( x_{i = 1,...,N} \); (ii) this loss of information induces overlap between some examples. In other words, \(\exists i,j \in \lbrace 1,..., N \rbrace \) such that the characteristics of \( x_i\) are very close to those of \(x_j\) but \( \theta _i \ne \theta _j \). A standard classifier trained on such data will inevitably commit errors. The problem that we handle in this paper is how to improve the learning step so as to better account for this type of data and obtain better performance and more reliable predictions.

3.3 The Imprecise Classification Approach

The proposed imprecise classification approach consists of three steps: (i) the relabelling step, which analyses the training examples in order to augment the class initially associated with an example with the classes of other examples having very close characteristics; a new set of examples is thus built, \((x_i, A_i)_{1\le i \le N}\), such that for all \( 1\le i\le N \), \( x_i \) belongs to \( \mathcal {X}\) and \( \emptyset \ne A_i \subseteq \varTheta \); (ii) the training step, which consists in training a probabilistic classifier \( \delta _{2^{\varTheta }} : \mathcal {X} \rightarrow 2^{\varTheta } \setminus \lbrace \emptyset \rbrace \). The classifier \(\delta _{2^{\varTheta }}\) provides for a new object \(x \in \mathcal {X}\) a posterior probability distribution on \(2^{\varTheta }\), which is also a mass function, denoted m(.|x). The trained classifier ignores the existence of inclusion or intersection relations between subsets of classes. This unawareness of relations between the labels may seem counter-intuitive, but it is compatible with the purpose of finding a potentially imprecise label for a new incoming example; (iii) the decision step, which proposes a loss function adapted to imprecise classification and computes the prediction minimizing the risk function associated with the classifier \(\delta _{2^{\varTheta }}\). Figure 1 illustrates the global process; the relabelling, classification and decision steps are presented in detail below.

Fig. 1. Steps of evidential classification of incomplete data

Relabelling Procedure. First we perform LDA extraction on the training examples (cf. Fig. 1) in order to reduce complexity. The resulting features are \(x'_i \in \mathbb {R}^{n-1}, i = 1,...,N\), where \(n=|\varTheta |\). We then consider a set of C standard classifiers \(\delta _{\varTheta }^{1}, ..., \delta _{\varTheta }^{C}\) and, for each classifier \(\delta _{\varTheta }^{c} : \mathbb {R}^{n-1} \rightarrow \varTheta , \; c \in \lbrace 1,...,C \rbrace \), we compute leave-one-out (LOO) cross-validation predictions for the training data \( (x'_i, \theta _i)_{i = 1,\ldots ,N}\).

The relabelling of the example \((x'_i, \theta _i)\) is based on a vote over the LOO predictions of the C classifiers. The vote procedure is the following: when a majority (more than \( 50 \%\)) of the classifiers predicts a class \(\theta _{maj_i}\), the example is relabelled with the union \(A_i =\lbrace \theta _i , \theta _{maj_i} \rbrace \). Note that when \(\theta _{maj_i} = \theta _i\) the original label remains, i.e., \(A_i =\lbrace \theta _i \rbrace \). If none of the classes predicted by the C classifiers obtains the majority, then ignorance is expressed for this example by relabelling it as \(A_i =\varTheta \). Note that the new labels are consistent with the original classes, which were trusted. Using several (C) classifiers to express the imprecision gives a more objective picture of the real imprecision of the features, i.e., the example is difficult not only for a single classifier. We denote by \(\mathbb {A} \subseteq 2^{\varTheta }\) the set of the new training labels \( A_i, i=1,...,N\).

Note that we limit the new labels \(A_i\) to have at most two elements, except when expressing ignorance with \(A_i=\varTheta \). This is done to avoid overly unbalanced training sets, but more general relabellings could be considered. Once all the training examples are relabelled, a classifier \(\delta _{2^{\varTheta }} \) can be trained; a sketch of the vote procedure is given below.
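The sketch below illustrates the relabelling vote, assuming scikit-learn (the authors' experiments use R; the classifier choices, data and names here are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def relabel(X_prime, y, classifiers, theta):
    """Imprecise labels A_i from the LOO votes of C point classifiers."""
    C = len(classifiers)
    preds = [cross_val_predict(clf, X_prime, y, cv=LeaveOneOut())
             for clf in classifiers]            # C vectors of LOO predictions
    new_labels = []
    for i, theta_i in enumerate(y):
        maj, votes = Counter(p[i] for p in preds).most_common(1)[0]
        if votes > C / 2:                       # strict majority class
            new_labels.append(frozenset({theta_i, maj}))  # singleton if maj == theta_i
        else:
            new_labels.append(frozenset(theta)) # no majority: total ignorance
    return new_labels

# Usage on the LDA-reduced features x'_i (synthetic data for illustration):
rng = np.random.default_rng(0)
X_prime, y = rng.normal(size=(60, 3)), rng.integers(0, 4, size=60)
A = relabel(X_prime, y, [GaussianNB(), KNeighborsClassifier(n_neighbors=3),
                         DecisionTreeClassifier()], theta=range(4))
```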

Learning \(\delta _{2^{\varTheta }}\). As indicated throughout this paper, \( \delta _{2^{\varTheta }} \) is learnt using the new labels, ignoring the relations that might exist between the elements of \(\mathbb {A}\). Reinforcing this idea of independent treatment of the classes, LDA is applied to the relabelled training set \((x_i,A_i)_{i=1,\ldots ,N}\). This reduces the space dimension from p to \(|\mathbb {A}|-1\), which better expresses the repartition of the relabelled training examples. For each training example \(i\in \lbrace 1, ..., N \rbrace \), let \(x''_i \in \mathbb {R}^{|\mathbb {A}|-1}\) be the new projection of \(x_i \) onto this \((|\mathbb {A}|-1)\)-dimensional space. The classifier \( \delta _{2^{\varTheta }} \) is finally trained on \((x''_i, A_i)_{i = 1,...,N} \).
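A possible realisation of this training step, again assuming scikit-learn and synthetic data: each distinct imprecise label is encoded as an atomic class (the string encoding is our own illustrative choice) so that predict_proba directly yields the mass function m(.|x):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                       # raw features, p = 10
A_pool = [frozenset({"A"}), frozenset({"A", "B"}),
          frozenset({"A", "B", "C", "D"})]           # |A| = 3 distinct labels
A_i = [A_pool[k] for k in rng.integers(0, 3, size=200)]

# Encode each imprecise label as one atomic class id.
y_atomic = np.array(["|".join(sorted(a)) for a in A_i])

# LDA reduces to |A| - 1 = 2 dimensions, then a probabilistic classifier is fit.
delta = make_pipeline(LinearDiscriminantAnalysis(), GaussianNB())
delta.fit(X, y_atomic)

# predict_proba over the atomic classes is read as the mass function m(.|x).
mass = dict(zip(delta.classes_, delta.predict_proba(X[:1])[0]))
print(mass)
```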

Decision Problem. As recalled in Subsects. 2.2 and 2.3, the decision to assign a new object x to a single class or a set of classes usually relies on the minimisation of the risk function associated with a loss function \( L : 2^{\varTheta }\setminus \lbrace \emptyset \rbrace \times \varTheta \rightarrow \mathbb {R}\). As mentioned in the introduction, our application concerns situations where errors may have serious consequences. It is then legitimate to consider the pessimistic strategy by minimizing \(\overline{r}_{\delta _{\varTheta }}\). Furthermore, in the definition of \(\overline{r}_{\delta _{\varTheta }}\), Eq. (1), the quantity \(\max \limits _{\theta \in B} L(A,\theta )\) is the loss incurred by choosing \( A \subseteq \varTheta \) when the true class lies in \( B \subseteq \varTheta \). On this basis, we propose a new definition of the loss function, L(A, B), \( A, B \subseteq \varTheta \), which directly takes into account the relations between A and B. It is a generalisation of the definition proposed in [7], based on the F-measure, recall and precision for imprecise classification. Let us consider \(A, B \in 2^{\varTheta } \setminus \lbrace \emptyset \rbrace \), where \(A = \theta (x)\) is the prediction for the object x and B is its state of nature. Recall is defined as the proportion of relevant classes included in the prediction \(\theta (x)\). We define the recall of A and B as:

$$\begin{aligned} R(A,B)=\frac{|A \cap B|}{|B|}. \end{aligned}$$
(2)

Precision is defined as the proportion of classes in the prediction that are relevant. We define the precision of A and B as:

$$\begin{aligned} P(A,B)=\frac{|A \cap B|}{|A|}. \end{aligned}$$
(3)

Considering these two definitions, the F-measure can be defined as follows:

$$\begin{aligned} F_{\beta }(A,B)=\frac{(1+\beta ^2) PR}{\beta ^2P+R}=\frac{(1+\beta ^2) |A \cap B|}{\beta ^2|B|+|A|}. \end{aligned}$$
(4)

Note that \(\beta = 0\) yields \(F_{\beta }(A,B) = P(A,B)\), whereas \(F_{\beta }(A,B) \underset{\beta \rightarrow \infty }{\rightarrow }~R(A,B)\). Let us comment on some situations according to the “true set” B and the predicted set A. The worst prediction scenario is when there is no intersection between A and B; this is always sanctioned by \(F_{\beta }(A,B) = 0\). On the contrary, when \(A = B\), \(F_{\beta }(A,B) = 1\) for every \( \beta \). Between these extreme cases, errors of generalisation, i.e., \(B \subset A \), are controlled by the precision, while errors of specialisation, i.e., \(A \subset B \), are controlled by the recall. Finally, the loss function \( L_{\beta } : 2^{\varTheta }\setminus \lbrace \emptyset \rbrace \times 2^{\varTheta }\setminus \lbrace \emptyset \rbrace \rightarrow \mathbb {R}\) is obtained as:

$$\begin{aligned} L_{\beta }(A,B)=1-F_{\beta }(A,B). \end{aligned}$$
(5)

For an example x to be classified, whose mass function m(.|x) has been calculated by \(\delta _{2^{\varTheta }} \), we predict the set A minimizing the following risk function:

$$\begin{aligned} \text {Risk}_{\beta }(A) = \sum _{B \subseteq \varTheta } m(B) L_{\beta }(A, B). \end{aligned}$$
(6)
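The decision step then reduces to an exhaustive search over the candidate subsets of \(\varTheta \); a self-contained Python sketch with hypothetical mass values:

```python
from itertools import combinations

def f_beta(A, B, beta):                    # Eq. (4)
    return (1 + beta**2) * len(A & B) / (beta**2 * len(B) + len(A))

def decide(m, theta, beta):
    """Pick A minimizing Risk_beta(A) = sum_B m(B) (1 - F_beta(A, B))."""
    candidates = [frozenset(s) for r in range(1, len(theta) + 1)
                  for s in combinations(sorted(theta), r)]
    risk = lambda A: sum(v * (1 - f_beta(A, B, beta)) for B, v in m.items())
    return min(candidates, key=risk)

theta = {"A", "B", "C"}
m = {frozenset({"A"}): 0.5, frozenset({"A", "B"}): 0.3,
     frozenset(theta): 0.2}                # posterior mass m(.|x), hypothetical

print(decide(m, theta, beta=0.5))          # small beta: the precise {'A'}
print(decide(m, theta, beta=4.0))          # large beta: a more cautious, larger set
```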

4 Related Works

Regarding relabelling procedures, much research has been carried out to identify suspect examples with the intention of removing them or relabelling them with a more appropriate competing class [16, 20]. This is generally done to enhance performance. Other approaches consist in relabelling with imprecise classes; this has been done to test an evidential classification approach on imprecisely labelled data in [37]. But, as already stated, our relabelling serves a different purpose: better mapping overlaps in the feature space. Concerning imprecise classification, several works have been dedicated to this problem. Instead of the term “imprecise classification” adopted in our article, authors use terms like “nondeterministic classification” [7], “reliable classification” [24], “indeterminate classification” [6, 36], “set-valued classification” [28, 31] or “conformal prediction” [3] (see [24] for a short state of the art). In [36], the Naive Credal Classifier (NCC) is proposed as the extension of the Naive Bayes Classifier (NBC) to sets of probability distributions. In [24] the authors propose an approach that starts from the outputs of a binary classification [25] using classifiers trained to distinguish aleatoric and epistemic uncertainty; the outputs are an epistemic uncertainty, an aleatoric uncertainty and two preference degrees in favour of the two competing classes. [24] generalizes this approach to the multi-class setting, providing a set of classes as output. Closer to our approach are those of [5] and [7]. The approach in [7] is based on the posterior probability distribution provided by a probabilistic classifier. The advantage of such an approach, shared by ours, is that any standard probabilistic classifier may be used to perform an imprecise classification. Our approach distinguishes itself by the relabelling step and by the way probabilities are allowed on sets of classes. To the best of our knowledge, existing algorithms do not train a probabilistic classifier on partially labelled data to quantify the body of evidence. Although we have insisted on the use of a standard probabilistic classifier \(\delta _{2^{\varTheta }} \) unaware of relations between the sets, it is possible to run our procedure with an evidential classifier such as the evidential k-NN [5].

5 Illustration

5.1 Settings

We performed experiments on the classification problem of four plastic categories, designated plastics A, B, C and D, on the basis of industrially acquired spectra. The 11540 available examples are summarized in Table 1. Each plastic example was identified by experts on the basis of laboratory measurements of attenuated total reflectance (ATR) spectra, which is considered a reliable source of information for determining a plastic's category. As a consequence, the original training classes are trusted and were not questioned. However, the data provided by the industrial devices may be challenged. These data consist of spectra composed of reflectance intensities at 256 different wavelengths. Therefore, and for the reasons enumerated in Sect. 1, the features are subject to ambiguity. Prior to the experiments, all the feature vectors, i.e., spectra, were corrected by the standard normal variate technique to mitigate light scattering and spectral noise effects. We implemented our approach and compared it to the approaches in [5] and [7]. The implementation was made in R, using existing packages for the following 8 classifiers: naive Bayes (nbayes), k-Nearest Neighbour (k-NN), decision tree (tree), random forest (rf), linear discriminant analysis (lda), partial least squares discriminant analysis (pls-da), support vector machine (svm) and neural networks (nnet).

Table 1. Number of spectra of each original class in learning and testing bases.

5.2 Results

In order to apply our procedure, we must first choose a set of classifiers to perform the relabelling. These classifiers need not be probabilistic, but they must produce point predictions. Thus, for every experiment, our algorithm ECLAIR was run with an ensemble relabelling using 7 classifiers: nbayes, k-NN, tree, rf, lda, svm and nnet. We are then able to build the ECLAIR imprecise version of a selected probabilistic classifier. Figure 2 shows the recall and precision scores of the probabilistic classifier nbayes, illustrating the role of \(\beta \). We observe the same influence of \(\beta \) as mentioned in [7]. Indeed (cf. Subsect. 3.3), small values of \( \beta \) give good precision, reflecting the relevance of the prediction, i.e., the size of the predicted set is reasonable, while high values of \( \beta \) give good recall, meaning reliability, i.e., a better chance that the true class is included in the prediction. The choice of \( \beta \) should then result from a compromise between relevance and reliability requirements.

Fig. 2. Recall and precision of ECLAIR using nbayes, i.e., \(\delta _{2^{\varTheta }} \) is nbayes, against \( \beta \).

Table 2. Precision P of ECLAIR compared with the nondeterministic classifier, with \(\beta \)s chosen such that the recalls equal 0.90.

In order to evaluate the performance of ECLAIR, we compared our results to those of the classifier proposed in [7], referred to here as the nondeterministic classifier. As both the nondeterministic classifier and ECLAIR depend on a parameter \(\beta \), we set the \( \beta \)s such that the global recalls equal 0.90 and compared global precisions on a fair basis. For even more neutrality regarding the features used in the two approaches, we supplied the nondeterministic classifier with the same reduced features \( x''_i, i = 1, ...,N\), as those used by ECLAIR in the training phase (see Fig. 1). The first 7 columns of Table 2 show the precisions thus obtained for 7 classifiers. These results show the competitiveness of our approach for most of the classifiers, especially nbayes, k-NN, rf and pls-da.

However, these results are only partial, since they do not show the general trend over different \( \beta \)s, which is generally in favour of our approach. We therefore present more complete results for nbayes and svm in Fig. 3, plotting the precision score against the recall score for several values of \( \beta \) varying in [0, 6]. On the same figure, we also present the results of the nondeterministic classifier with different input features (in black): raw features, i.e., \( x_i \in \mathbb {R}^p\), LDA-reduced features, i.e., \(x'_i \in \mathbb {R}^{n-1}\), and the same features as those used for ECLAIR, i.e., \(x''_i \in \mathbb {R}^{|\mathbb {A}|-1}\) (see Fig. 1 for more details). In doing so, we show that the good performance of ECLAIR is not attributable to the extraction phase alone. To read the results plotted in Fig. 3, note that the best performances are those illustrated by points at the top right of the plots, i.e., higher precision and recall scores. We observe that ECLAIR generally achieves a better compromise between the recall and precision scores for the classifiers used.

Regarding the special case where ECLAIR is run with an evidential classifier trained on imprecisely labelled examples (see Sect. 4), the comparison is less straightforward. We considered the evidential k-NN [10] for imprecise labels, minimizing the error suggested in [39]. Using this evidential k-NN as the classifier \( \delta _{2^{\varTheta }}\) in the ECLAIR procedure is straightforward. For the nondeterministic classifier, we kept the same parameters and turned the classifier into a probabilistic one by applying the pignistic transformation to the mass output of the k-NN classifier (see Table 2). ECLAIR obtains slightly better results.

Fig. 3. Precision vs recall of Nondeterministic (ND) and ECLAIR

6 Conclusion

In this article, a method of evidential classification of incomplete data via imprecise relabelling was proposed. For any probabilistic classifier, our approach provides an adaptation yielding more cautious outputs. The benefit of our approach was illustrated on the problem of sorting plastics, where it showed competitive performance. Our algorithm is generic: it can be applied in any other context where incomplete data on the features are present. In future work we plan to exploit our procedure to provide cautious decision-making for the problem of plastic sorting. This application requires highly reliable decisions in order to preserve the physicochemical properties of the recycled product. At the same time, the decision must ensure reasonable relevance to guarantee financial interest; indeed, the more finely a plastic category is sorted, the greater the benefit for the industrial operator. We also plan to strengthen the evaluation of our approach by confronting it with other state-of-the-art imprecise classifiers and by performing experiments on several datasets from machine learning repositories.