1 Introduction

Classification is a well-studied machine learning task that involves the assignment of instances to a predefined set of outcome classes. Cost-sensitive classification methods take into account asymmetric costs related to incorrectly classifying instances across various classes (Elkan 2001; Verbeke et al. 2020). Such misclassification costs may either be class-dependent, i.e., equal for all instances of a class, or instance-dependent, i.e., vary across instances.

Classification methods are adopted to support or automate business decision-making, e.g., for credit scoring (Petrides et al. 2022) or customer churn prediction (Lessmann et al. 2021). Note that in both applications, misclassified instances involve variable costs. For instance, the cost of a misclassified churner equals the future customer lifetime value, whereas a misclassified non-churner typically involves a much smaller cost, i.e., the cost of targeting the customer with a retention campaign. Either or both costs may be instance-dependent or class-dependent, depending on the characteristics of the particular application setting.

A broad variety of cost-sensitive (CS) and instance-dependent cost-sensitive (IDCS) classification methods have been proposed in the literature as reviewed and experimentally evaluated by Petrides and Verbeke (2022) and Vanderschueren et al. (2022). A prominent approach that is adopted by both CS and IDCS methods for taking misclassification costs into account is to weigh instances proportionally with the misclassification cost involved when learning a classification model.

In this article, we raise the question of whether IDCS classification methods are sensitive to outliers and noise in the data. No prior work seems to have addressed this question, which nonetheless is of significant practical importance given the broad adoption and potential monetary impact of using biased classification models for decision-making.

To address this question, we present the results of a series of experiments that evaluate the robustness of IDCS classification methods with respect to outlying costs in the data and that highlight the potential bias and vulnerability of these methods. We propose a robust approach to IDCS classification by extending the existing cslogit approach (Höppner et al. 2022). An important benefit is the automatic and reliable detection of outliers in the data. These outliers may not only spoil the resulting analysis (as illustrated in this article) but can also contain valuable information. A robust analysis can thus provide better insight into the structure of the data.

The following section outlines related work on IDCS learning and discusses both cslogit and robustness. Next, in Sect. 3, a series of simulations on synthetic data is presented that motivates the need for robust IDCS learning, which we develop in Sect. 4. Section 5 presents the results of a series of experiments that illustrate the excellent performance of the proposed robust IDCS learning method, denoted r-cslogit, in comparison with both logit and cslogit. We conclude and present directions for future research in Sect. 6.

2 Related work

Elkan (2001) introduces a learning paradigm where different misclassification errors incur different penalties depending on the predicted and actual class, with applications to, for example, detecting transaction fraud and credit scoring. The benefits and costs of different predictions can be summarized in a two-dimensional instance-dependent cost matrix with one dimension for the predicted value and another dimension for the ground truth. Given these benefits and costs, each new instance should be assigned to the class that leads to the lowest expected cost, which is calculated by means of conditional probabilities.

2.1 IDCS learning, cslogit, and robustness

For certain applications, benefits and costs depend not only on the class but also on the instance itself. Instance-dependent cost-sensitive learning therefore considers costs at a finer level of granularity than class-dependent approaches. For these applications, using instance-dependent costs instead of class-dependent costs leads to a lower total misclassification cost (Brefeld et al. 2003; Vanderschueren et al. 2022).

Several instance-dependent cost-sensitive methodologies have been proposed in the literature, with recent overviews given by Petrides and Verbeke (2022) and Vanderschueren et al. (2022). Especially relevant to our work are methodologies that adjust the learning algorithm to incorporate instance-dependent costs. Instance-dependent cost-sensitive variants have been proposed for several common machine learning classifiers, such as boosting (Fan et al. 1999; Zelenkov 2019; Höppner et al. 2022), support vector machines (Brefeld et al. 2003), decision trees (Sahin et al. 2013; Bahnsen et al. 2015), and logistic regression (Bahnsen et al. 2014; Höppner et al. 2022).

In this work, we build upon an instance-dependent cost-sensitive version of logistic regression. Following Höppner et al. (2022), we refer to this method as cslogit. Logistic regression is a widely used method for binary classification tasks. To extend logistic regression to its IDCS counterpart, Bahnsen et al. (2016) and Höppner et al. (2022) propose an objective function that combines cost-sensitivity with instance-dependent learning, so that instance-dependent costs are incorporated directly into the optimization. Applying this objective function yields significant improvements in terms of higher savings compared to cost-insensitive or class-dependent cost-sensitive models in the context of, for example, credit scoring and transaction fraud detection.

Classical nonrobust methods for regression, such as least squares or maximum likelihood techniques, try to fit the model optimally to all the data. As a result, these methods are heavily influenced by outliers in the data. Outliers may bias the parameter estimates and confidence intervals, and hypothesis tests may consequently become unreliable and/or uninformative. In contrast, robust methods can resist the effect of outliers and thereby avoid distorted results and false conclusions. As an important benefit, they allow the automatic detection of outliers as observations that deviate substantially from the robust fit. It is important to note that the detected outliers are not necessarily errors in the data. The presence of outliers may reveal that the data are more heterogeneous than has been assumed and than can be handled by the original statistical model. Outliers can be isolated or may come in clusters, indicating that there are subgroups in the population that behave differently. Many different approaches to robust regression have been proposed; good overviews can be found in reference works such as Huber and Ronchetti (2009), Maronna et al. (2019) and Rousseeuw and Leroy (1987). In the context of generalized linear models (GLMs), various robust alternatives have been presented, for example by Cantoni and Ronchetti (2001), Bergesio and Yohai (2011), Valdora and Yohai (2014), Ghosh and Basu (2016) and Štefelová et al. (2021). Robust logistic regression has been studied by Künsch et al. (1989), Morgenthaler (1992), Carroll and Pederson (1993), Bianco and Yohai (1996), Croux and Haesbroeck (2003), Bondell (2005, 2008), Hosseinian and Morgenthaler (2011) and Monti and Filzmoser (2021).

2.2 Preliminaries

The dataset \({\mathcal {D}}\) consists of N observed predictor-response pairs \(\left\{ \left( {\varvec{x}}_{i}, y_{i}\right) \right\} _{i=1}^{N}\) and is used to train a binary classification model s(.). The costs \(C_i\) correspond to the cost matrix defined in Table 1.

This binary classification model predicts a probability score \(s_i \in [0, 1]\) for each instance i based on the features \({\varvec{x}}_{i}\). Depending on the classification threshold \(t_{i}^{*}\), \(s_i\) is converted to a predicted class \({\widehat{y}}_i \in \left\{ 0,1 \right\} \).

For models trained with AEC (Eq. 3), savings remain relatively stable across different thresholding strategies (Vanderschueren et al. 2022). Therefore, we use a default threshold of 0.5.

A binary logistic regression predicts the probability score that an observation belongs to the positive class. This probability score is calculated by Eq. (1), where \(\beta _{0}\) is the bias term, \(\beta _{1},\ldots ,\beta _{d}\) are the learned weights, and \({\varvec{x}}_i\) are the features of a particular observation i:

$$\begin{aligned} s_i = s_{(\beta _0,\varvec{\beta })}({\varvec{x}}_i)=\frac{1}{1+e^{-z}} \quad \text {where} \quad z=\beta _{0}+\beta _{1} x_{i1}+\beta _{2} x_{i2}+\ldots +\beta _{d} x_{id} . \end{aligned}$$
(1)

This probability score is then compared to a threshold to assign each observation to a class. The objective function of logistic regression is the likelihood, which is maximized, or equivalently the cross-entropy loss, which is minimized. For a single sample with true label \(y_i \in \left\{ 0,1 \right\} \) and probability score \(s_i = P(Y=1)\), the cross-entropy loss is given by Eq. (2):

$$\begin{aligned} L_{\log }(y_i, s_i)=-\Big (y_i \log (s_i)+(1-y_i) \log (1-s_i)\Big ). \end{aligned}$$
(2)
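To make the notation concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2); the function and variable names are illustrative and not taken from any particular implementation.

```python
import numpy as np

def logistic_score(X, beta0, beta):
    """Eq. (1): s_i = 1 / (1 + exp(-z_i)) with z_i = beta0 + x_i' beta."""
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, s, eps=1e-12):
    """Eq. (2), averaged over all instances."""
    s = np.clip(s, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
```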

Note that this equation does not take any costs into account. Because this objective function assigns equal weight to each misclassification, it does not necessarily correspond to the underlying business problem where costs are to be minimized. The reason is twofold: misclassification costs differ per class and per instance. The real business objective is to minimize the average expected total cost of the binary classifier.

We build upon the instance-dependent cost-sensitive logistic regression (cslogit) model as proposed by Bahnsen et al. (2016) and Höppner et al. (2022). Cslogit minimizes an instance-dependent cost-sensitive objective function corresponding to the real business objective of minimizing costs in domains such as customer churn prediction, credit scoring, and direct marketing (Thai-Nghe et al. 2010; Sammut 2017). Depending on the business objective, other cost matrices can also be considered. For example, Höppner et al. (2022) propose the cost matrix \(C_i(0 \mid 0)=0\), \(C_i(0 \mid 1)=A_i\), \(C_i(1 \mid 0)=c_f\), and \(C_i(1 \mid 1)=c_f\) for the detection of transfer fraud, where \(c_f\) is a fixed administrative fee. Alternatively, Bahnsen et al. (2014) propose the cost matrix \(C_i(0 \mid 0)=0\), \(C_i(0 \mid 1)=L_{g d}\), \(C_i(1 \mid 0)=r_i+C_{F P}^a\), and \(C_i(1 \mid 1)=0\) for credit scoring, where \(L_{g d}\) is the loss given default, \(r_i\) is the loss in profit from rejecting what could have been a good customer, and \(C_{F P}^a\) is the cost related to the assumption that the financial institution will not keep the amount of the declined applicant unused. However, the aim of this work is to address the need for robustness and to propose a solution to this potential issue in a generic way, regardless of the application. Therefore, to present an application-agnostic methodology with the simplest possible cost matrix, this work uses a symmetric cost matrix.

Equation (3) shows the average expected cost (AEC), the cost-sensitive objective function that is used by cslogit, given a symmetric cost matrix, as shown in Table 1:

$$\begin{aligned} AEC(s({\mathcal {D}})) &= \frac{1}{N} E[{\text {Cost}}(s({\mathcal {D}})) \mid {\varvec{X}}] \\ &= \frac{1}{N} \sum _{i=1}^{N}\Big (y_{i}[s_{i} C_{i}(1 \mid 1)+(1-s_{i}) C_{i}(0 \mid 1)] +(1-y_{i})[s_{i} C_{i}(1 \mid 0)+(1-s_{i}) C_{i}(0 \mid 0)]\Big ) \\ &= \frac{1}{N} \sum _{i=1}^{N} \Big (A_{i}\big (y_{i}(1-s_{i})+(1-y_{i})s_{i}\big )\Big ). \end{aligned}$$
(3)
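The last line of Eq. (3) follows from the symmetric cost matrix of Table 1, in which correct classifications cost nothing and both types of misclassification cost \(A_i\). As a minimal sketch (assuming NumPy arrays y, s and A of equal length), the AEC objective then reduces to a cost-weighted average of the predicted error probabilities:

```python
import numpy as np

def aec(y, s, A):
    """Eq. (3): average expected cost under the symmetric cost matrix
    C_i(0|0) = C_i(1|1) = 0 and C_i(0|1) = C_i(1|0) = A_i."""
    # y_i * (1 - s_i): expected cost contribution of positives scored too low
    # (1 - y_i) * s_i: expected cost contribution of negatives scored too high
    return np.mean(A * (y * (1.0 - s) + (1.0 - y) * s))
```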

In Eq. (3), each observation i is a pair of d features \({\varvec{x}}_{i} = (x_{i1}, . . . , x_{id})\) and a binary response label \(y_i \in \left\{ 0,1 \right\} \).

Table 1 Symmetric cost matrix for cslogit

The total cost as a metric is not unambiguously interpretable across models, as datasets with high instance-dependent costs might yield a higher total misclassification cost even for a model with a better relative performance. Building on the idea of normalizing the total classification costs of a model, presented in Whitrow et al. (2009), Bahnsen et al. (2014) introduce a more interpretable metric: Savings. This metric represents the relative improvement in cost of a newly proposed model, \(Cost(s({\mathcal {D}}))\), compared to the cost of an empty model that assigns all instances to a single class, \(Cost_{empty}({\mathcal {D}})\). \(Cost_{empty}({\mathcal {D}})\) is calculated by taking the minimum of the costs incurred when classifying all instances as either negative or positive:

$$\begin{aligned} Cost_{empty}({\mathcal {D}})=\min \left\{ Cost\left( s_{0}({\mathcal {D}})\right) , Cost\left( s_{1}({\mathcal {D}})\right) \right\} . \end{aligned}$$
(4)

Using the \(Cost_{empty}({\mathcal {D}})\) of an empty model as a factor to normalize total costs, Savings of the model \(s({\mathcal {D}})\) are calculated by Eq. (5):

$$\begin{aligned} \textrm{Savings}\, (s({\mathcal {D}}))= 1- \frac{Cost(s({\mathcal {D}}))}{Cost_{empty}({\mathcal {D}})} . \end{aligned}$$
(5)
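A minimal sketch of Eqs. (4) and (5) under the symmetric cost matrix, assuming hard class predictions y_pred obtained with the default threshold of 0.5; the names are illustrative:

```python
import numpy as np

def savings(y, y_pred, A):
    """Eqs. (4)-(5): relative cost reduction versus the best empty model."""
    cost_model = np.sum(A * (y * (1 - y_pred) + (1 - y) * y_pred))
    cost_all_negative = np.sum(A * y)        # s_0: every instance labeled 0
    cost_all_positive = np.sum(A * (1 - y))  # s_1: every instance labeled 1
    cost_empty = min(cost_all_negative, cost_all_positive)
    return 1.0 - cost_model / cost_empty
```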

3 Sensitivity analysis

Data can contain outliers in terms of misclassification costs for various reasons, such as missing data, invalid observations, or typos. Because instance-dependent costs are incorporated in the learning algorithm, outliers in these misclassification costs could potentially have a large impact on instance-dependent cost-sensitive learning methodologies such as cslogit. Therefore, we test the sensitivity of cslogit to these outliers and examine to what extent this is a shortcoming of the method.

3.1 Simulation setup

We analyze the sensitivity to outlying costs through a series of simulations on synthetic data. The different synthetic datasets all share the following properties. Each observation is visualized by a dot, with the size of the dot corresponding to its misclassification cost. The positive class is shown in red and the negative class in blue. Each observation has, besides its misclassification cost and label, two features: \(X_1\) and \(X_2\). \(X_1\) is the feature that determines the misclassification cost A. For the positive class, this cost is positively related to \(X_1\); for the negative class, the relation between \(X_1\) and the cost is negative. The underlying function is given by Eq. (6):

$$\begin{aligned} A_i=\left\{ \begin{array}{ll} 20+2x_{1i} &{} \text { for the positive class }, \\ 20-2x_{1i} &{} \text { for the negative class. } \end{array}\right. \end{aligned}$$
(6)
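Under these assumptions, the costs of Eq. (6) can be generated as follows (a sketch; the variable names are illustrative):

```python
import numpy as np

def base_cost(x1, y):
    """Eq. (6): cost increases with X1 for the positive class (y = 1)
    and decreases with X1 for the negative class (y = 0)."""
    return np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)
```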

Panel (b) and (c) in Fig. 1 visualize this equation.

Fig. 1

The setup for synthetic data. Panel a displays the distribution of the negative and positive class, dependent on \(X_2\). Panels b and c represent the misclassification costs of observations from the negative and positive classes as a linear function of \(X_1\), generated by Eq. (6). The three panels all show a sample of size 50 per class

\(X_2\) is the feature that determines the distributions of classes 0 and 1. The two class distributions are 2-dimensional Gaussians sharing the same standard deviations. Observations from the negative class are sampled from \(N(\mu _{0} ,\sigma _{0} ^{2},\nu _{0} ,\tau _{0} ^{2},\rho )\) and observations from the positive class from \(N(\mu _1 ,\sigma _1 ^{2},\nu _1 ,\tau _1 ^{2},\rho )\).

\(\mu _0\) and \(\mu _{1}\) are both equal to 0, while \(\nu _0 = -5\) and \(\nu _{1}=5\). The variances \(\sigma _0 ^{2},\tau _0 ^{2},\sigma _{1} ^{2}\) and \(\tau _{1} ^{2}\) are all equal to 4. As there is no correlation between the two dimensions \(X_1\) and \(X_2\), \(\rho \) is equal to 0. Cases of the positive class therefore have, on average, a higher \(X_2\) value than cases of the negative class. Panel (a) in Fig. 1 displays the class distributions from which the data are generated.

To generate outliers, both in the synthetic setup (Sects. 3 and 5.1) and in the sensitivity analysis on real data (Sect. 5.2), the multivariate distribution of \({\textbf{X}}\) remains unchanged, since we only focus on outliers in the observed costs of instances. We use the Tukey-Huber contamination model to generate the cost distribution with outliers, where a Dirac (point-mass) distribution is used to generate costs of arbitrary size (Maronna et al. 2019).
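One possible way to generate such a contaminated sample, assuming the class-conditional Gaussians of Sect. 3.1 and a point mass at an arbitrarily large cost; the contamination fraction eps and the outlying cost are illustrative choices, not values prescribed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, y, mean_x2, sd=2.0):
    """Draw n observations of class y: X1 ~ N(0, 4), X2 ~ N(mean_x2, 4)."""
    x1 = rng.normal(0.0, sd, n)
    x2 = rng.normal(mean_x2, sd, n)
    A = np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)  # Eq. (6)
    return np.column_stack([x1, x2]), np.full(n, y), A

def contaminate(A, eps=0.05, outlier_cost=400.0):
    """Tukey-Huber contamination: with probability eps, the observed cost is
    drawn from a Dirac point mass at outlier_cost instead of the clean model."""
    mask = rng.random(A.size) < eps
    return np.where(mask, outlier_cost, A), mask

X_neg, y_neg, A_neg = sample_class(50, 0, mean_x2=-5.0)   # negative class
X_pos, y_pos, A_pos = sample_class(50, 1, mean_x2=+5.0)   # positive class
A_pos_contaminated, _ = contaminate(A_pos)
```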

Given these settings for instance-dependent costs and class distribution, observations of the negative class with a high associated cost are expected to be located in the third quadrant and observations of the positive class with a high associated cost in the first quadrant. The symmetric cost matrix used for the examples on synthetic data is presented in Table 1 as introduced in Sect. 2.2.

3.2 Results

Within each setting, two classifiers are compared: logit and cslogit. Both are linear classifiers, yet they can produce distinctly different decision boundaries on the same training data. Since the data are only two-dimensional, these decision boundaries can be visually represented by lines. The boundaries proposed by the logit and cslogit models are coloured red and blue, respectively. The normal behavior of both models in the default setting of the examples on synthetic data is visualized in Panel a of Fig. 2.

Fig. 2

The instability of cslogit’s decision boundary. This figure motivates the need for a robust version of cslogit. Three examples of synthetic data where logit and cslogit are tested are shown. Panel a shows the normal behavior of cslogit and logit in the default case. Panel b displays the case where a large outlier is added. The blue decision boundary of cslogit shifts, while the red decision boundary of logit remains stable. Note that the effect of the outlier can be subtle, i.e. it pulls on the decision boundary, resulting in a slight rotation, without the outlier actually being classified correctly. Panel c displays the case where random noise is added to the misclassification costs. The decision boundary of cslogit shifts even further, resulting in an almost perpendicular boundary in comparison with Panel a. We further elaborate on the exact setting of the examples on synthetic data in Sect. 5.1 (colour figure online)

4 Robust IDCS

To overcome the sensitivity of instance-dependent cost-sensitive classifiers to outlying costs, we introduce a three-step framework that makes IDCS methods robust by detecting outliers and adapting their cost matrix. Hence, the final model is trained using a less volatile and more reliable set of costs. The resulting robust classification model also yields automatic outlier detection. The concrete implementation of this framework is given by Algorithm 1.

Algorithm 1

To estimate the misclassification costs of observations in a robust manner in step 1, a regression with the Huber loss is applied. Concretely, we estimate the cost \(A_i\) of observation i as a function of its features \({\textbf{x}}_i\) and label \(y_i\) using a linear regression with the Huber loss function: \({\hat{A}}_i=f({\varvec{x}}_i,y_i)\). The formalization of robustness in statistics started with the work of Huber (1964); his ground-breaking results and well-known loss function are still widely used today in statistics and machine learning. The Huber loss function is defined by Eq. (7). It results in a regression that is less sensitive to outliers than traditional regression methods, which typically use a squared error loss.

$$\begin{aligned} L_{\delta }(a)=\left\{ \begin{array}{ll} \frac{1}{2} a^{2} &{} \text { for }|a|\le \delta , \\ \delta \left( |a|-\frac{1}{2} \delta \right) &{} \text { otherwise. } \end{array}\right. \end{aligned}$$
(7)
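As an illustration, the Huber loss of Eq. (7) can be written as follows; the value of \(\delta \) shown is a common default, not a choice made in this article:

```python
import numpy as np

def huber_loss(a, delta=1.345):
    """Eq. (7): quadratic for |a| <= delta, linear beyond that."""
    abs_a = np.abs(a)
    return np.where(abs_a <= delta,
                    0.5 * a ** 2,
                    delta * (abs_a - 0.5 * delta))
```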

Next, to detect outliers, we compare the absolute value of the standardized residuals with a cutoff value based on the normal distribution (Rousseeuw and Hubert 2011). If this value exceeds 2.5, we consider the corresponding cost an outlier and add the observation to the initially empty set \({\mathcal {S}}_{outlier}\).

Fig. 3

Cost estimation as a function of \(X_1\)

The observed cost \(A_i\) of observation i is operationally defined as an outlier if, for an estimator \({\hat{A}}=f(X,Y)\), the absolute value of the standardized residual \(\epsilon _i\) is larger than 2.5.

Figure 3 further clarifies the concept of a conditional outlier in the setting of synthetic data where noise is added to the observed costs, as displayed in Fig. 2c. In this figure, the black line displays the estimated costs as predicted by the linear regression model with Huber loss (Algorithm 1, line 2). The red dots represent the observed costs as a function of \(X_1\). Consider the two observations A and B. To check whether their observed costs are outliers, we look at the standardized residuals, represented by the grey vertical lines for those two observations. Both observations A and B have an observed cost of 50. The standardized residual of A exceeds 2.5, whereas for B it is smaller than 2.5. Consequently, although both observations have the same cost, the cost of A is considered an outlier, whereas the cost of B is not.

In this way, the costs A that are outliers conditional on the features \({\textbf{X}}\) and label Y are detected. In step 2, the observed outlying costs A of all observations in \({\mathcal {S}}_{outlier}\) are imputed with their estimated counterparts \({\hat{A}}\). This results in a robust cost matrix (Table 2).
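A possible implementation of steps 1 and 2 is sketched below, assuming scikit-learn's HuberRegressor for the robust cost regression and a MAD-based standardization of the residuals; the exact estimators and standardization used in practice may differ:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

def robustify_costs(X, y, A, cutoff=2.5):
    """Steps 1-2 of the robust IDCS framework: estimate costs robustly,
    flag conditional outliers, and impute their observed costs."""
    # Step 1: robust regression of the observed cost on features and label
    Z = np.column_stack([X, y])
    A_hat = HuberRegressor().fit(Z, A).predict(Z)
    # Standardize the residuals with a robust scale estimate (MAD)
    resid = A - A_hat
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    is_outlier = np.abs(resid / scale) > cutoff
    # Step 2: impute outlying observed costs by their robust estimates
    A_robust = np.where(is_outlier, A_hat, A)
    return A_robust, is_outlier
```

Step 3 then simply trains the IDCS learner, here cslogit, on the adjusted costs A_robust.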

Table 2 Symmetric cost matrix for r-cslogit

Equation (8) restates the cost-sensitive objective function AEC, given by Eq. (3), but adapts it to the new robust cost matrix given by Table 2. The indicator function \(\mathbbm {1}_{o}\) takes value 1 if the cost \(A_i\) of observation i is classified as an outlier and 0 otherwise.

$$\begin{aligned} {\text {AEC}}(s({\mathcal {D}})) &= \frac{1}{N} E[{\text {Cost}}(s({\mathcal {D}})) \mid {\varvec{X}}] \\ &= \frac{1}{N} \sum _{i=1}^N \Big [ \mathbbm {1}_{o} {\hat{A}}_i\big ( y_i( 1-s_i) +( 1-y_i) s_i\big ) +(1-\mathbbm {1}_{o}) A_i\big ( y_i( 1-s_i) +( 1-y_i) s_i\big ) \Big ]. \end{aligned}$$
(8)
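In code, Eq. (8) amounts to evaluating the AEC of Eq. (3) on the imputed cost vector; reusing the hypothetical helpers sketched above, with X_train, y_train, A_train and the scores s_train as placeholders for the training data and model outputs:

```python
# Detect and impute outlying costs (steps 1-2), then evaluate the AEC of Eq. (8)
A_robust, is_outlier = robustify_costs(X_train, y_train, A_train)
loss = aec(y_train, s_train, A_robust)
```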

5 Results

This section discusses the performance of logit, cslogit and the novel r-cslogit on synthetic data and tests their sensitivity on real data with additional outliers. In the reported experiments, symmetric cost matrices are used. However, using the alternative cost matrices presented in Sect. 2.2 yields similar results concerning robustness.

The performance of binary classification algorithms is typically measured by labeling one class as positive and the other as negative and constructing a confusion matrix. The positive label is typically used for the minority class and the negative label for the majority class. From the confusion matrix, we obtain the following counts. True Negatives (TN) is the number of correctly classified negative cases. False Positives (FP) is the number of negative cases incorrectly classified as positive. False Negatives (FN) is the number of positive cases incorrectly classified as negative. True Positives (TP) is the number of correctly classified positive cases. With these counts, we define Sensitivity (or Recall), Specificity, and Precision by Eqs. 9, 10, and 11:

$$\begin{aligned} Sensitivity=Recall=\frac{TP}{TP+FN} \end{aligned}$$
(9)
$$\begin{aligned} Specificity=\frac{TN}{TN+FP} \end{aligned}$$
(10)
$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(11)

Equation 12 defines the F1-measure, which is the harmonic mean of precision and recall.

$$\begin{aligned} F_1=\frac{2 \cdot precision \cdot recall}{precision+recall} \end{aligned}$$
(12)

The area under the receiver operating characteristic curve (AUC) of a classifier can be interpreted as the probability that a randomly chosen minority case receives a higher score than a randomly chosen majority case. Therefore, a higher AUC indicates better classification performance (Höppner et al. 2022). Class distributions and misclassification costs are not taken into account when calculating the AUC.

Equation 13 defines the Brier score, where \(s_i\) is the predicted probability and \(y_i\) is the observed outcome. This metric measures the mean squared difference between the predicted probability and the actual outcome and is used to assess whether the model’s predictions are calibrated probabilities. A lower score is better.

$$\begin{aligned} Brier=\frac{1}{N} \sum _{i=1}^N\left( s_i-y_i\right) ^2 \end{aligned}$$
(13)
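For completeness, the following sketch shows how these metrics could be computed for one model on one test set, using scikit-learn and the savings helper sketched in Sect. 2.2; the argument names are placeholders:

```python
from sklearn.metrics import f1_score, roc_auc_score, recall_score, brier_score_loss

def evaluate(y_true, s, A, threshold=0.5):
    """Metrics reported in Sect. 5 for a single model on a single test set."""
    y_pred = (s >= threshold).astype(int)
    return {
        "Savings":     savings(y_true, y_pred, A),                  # Eq. (5)
        "Sensitivity": recall_score(y_true, y_pred),                # Eq. (9)
        "Specificity": recall_score(y_true, y_pred, pos_label=0),   # Eq. (10)
        "F1":          f1_score(y_true, y_pred),                    # Eq. (12)
        "AUC":         roc_auc_score(y_true, s),
        "Brier":       brier_score_loss(y_true, s),                 # Eq. (13)
    }
```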

5.1 Synthetic data

In this section, we reuse the examples on synthetic data introduced in Sect. 3.1 to demonstrate how the possible shortcomings of cslogit can be countered by deploying the more robust r-cslogit. Figure 4 displays the decision boundaries of logit, cslogit and r-cslogit in red, blue and green, respectively.

5.1.1 Synthetic data: three settings

The basic setting of the examples on synthetic data in Panel a is the same as explained in Sect. 3.1. In Panel b, an additional outlier of the positive class is added in the third quadrant with a cost equal to 400. The robust method first estimates its cost with a linear Huber regression to be 13.75 and flags it as an outlier. Next, the observed cost of 400 is replaced by the estimated cost of 13.75. In Panel c, we add noise to the costs. The misclassification costs are then generated by Eq. (14), where the noise \(\epsilon _i\) is sampled from a lognormal distribution with parameters \(\mu = 2\) and \(\sigma = 1.5\).

$$\begin{aligned} A_i=\left\{ \begin{array}{ll} 20+2x_{1i}+\epsilon _i &{} \text { for the positive class }, \\ 20-2x_{1i}+\epsilon _i &{} \text { for the negative class. } \end{array}\right. \end{aligned}$$
(14)
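Under the same assumptions as before, the noisy costs of Eq. (14) can be generated as follows (a sketch; rng is a NumPy random generator):

```python
import numpy as np

def noisy_cost(x1, y, rng):
    """Eq. (14): the costs of Eq. (6) plus lognormal noise (mu = 2, sigma = 1.5)."""
    base = np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)
    return base + rng.lognormal(mean=2.0, sigma=1.5, size=x1.size)
```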

5.1.2 Description of results

Figure 4 visualizes the decision boundaries of the three models. In Panel a, the decision boundaries of cslogit and r-cslogit overlap, as the regression with Huber loss can perfectly recover the underlying function relating misclassification costs to \(X_1\) in the absence of noise or outliers. Panel b displays the case where one outlier is added. The decision boundary of the logit model is not affected by the size of misclassification costs. Hence, it is not influenced by the outlier and remains unchanged, demonstrating the normal behavior as defined before. The blue decision boundary of the cslogit model is strongly influenced by the outlier: the objective function takes into account the full misclassification costs of the observations in the training set, including the excessive outlying cost. As a consequence, the behavior of the cslogit model is completely disrupted. This is strongly in conflict with its normal behavior, as the decision boundary is tilted by almost a quarter turn. The tilted decision boundary results in poor predictive classification power, making the cslogit model of inferior quality. The green decision boundary of r-cslogit remains largely unchanged, as it is robust against the single added outlier.

Performance metrics are summarized in Table 3. We consider the cost-sensitive metric Savings introduced in Sect. 2.2 and cost-independent metrics Sensitivity, Specificity, F1, AUC, and Brier score.

r-cslogit outperforms logit and cslogit in terms of Savings when we add an outlier and noise. Moreover, the performance in terms of Savings remains unchanged after adding an outlier. In the default case of setting one, r-cslogit and cslogit are equivalent, as they make the exact same predictions. When considering cost-insensitive metrics, logit performs best. A full analysis on synthetic data where we experiment with different settings of class imbalance and outlier size can be found in Appendix A.

Fig. 4

Superiority of r-cslogit. The red decision boundary of logit remains unchanged, as it is not cost-sensitive. The blue decision boundary of cslogit differs strongly per example, as the model is prone to outliers and noise. The green decision boundary of r-cslogit is stable against outliers and handles noise in misclassification costs quite well. In Panel a, the blue and green decision boundaries coincide

Table 3 Results on synthetic data (i) in the default setting, (ii) with an outlier, and (iii) with additional noise added to the amounts

5.2 Sensitivity analysis on real data

In this section, we analyze the sensitivity of the three methods in an experiment with real data where we add an additional outlier, gradually increasing in size. To add outliers, we randomly select an observation and change its class label and instance-dependent misclassification cost.

This setup is similar to the second setup with synthetic data presented in the previous section. The performance is measured by the cost-sensitive metric Savings as described before, as well as by the cost-independent metrics Sensitivity, Specificity, F1, AUC, and Brier score. Performance is measured using five-fold cross-validation with splits stratified on the class distribution, repeated twice with different random initializations.
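This evaluation protocol can be reproduced, for example, with scikit-learn's repeated stratified splitter; the model-fitting code inside the loop is omitted and the names are placeholders:

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Two repetitions of stratified five-fold cross-validation: 2 x 5 = 10 test sets
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # fit logit, cslogit and r-cslogit on (X[train_idx], y[train_idx], A[train_idx]),
    # then evaluate Savings and the cost-insensitive metrics on the test fold
    pass
```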

5.2.1 Description of the dataset

The dataset on which the three methods are tested is the Kaggle Credit Card Fraud Detection dataset (ULB 2018). The dataset dates from September 2013 and contains transactions made by European credit cardholders. A total of 492 out of 284,807 transactions are fraudulent, resulting in a high class imbalance. The numerical input features \(V1, V2, \dots , V28\) are the results of a PCA transformation to anonymize the dataset. Time and Amount have not been transformed. The feature Time is not taken into consideration in this experiment and is therefore dropped in the preprocessing phase. The feature Amount is the transaction amount, which is of high importance in cost-sensitive instance-dependent learning and translates into our setting as the instance-dependent misclassification cost. The feature \(Class \in \{0,1\}\) indicates whether a transaction is fraudulent or not.

Table 4 Sensitivity analysis on real data resulting from a two times five-fold cross validation procedure on the Kaggle Credit Card Fraud Detection dataset
Fig. 5

Sensitivity analysis on real data

5.2.2 Results

Table 4 contains the results of a \(2\times 5\)-fold cross validation procedure for the Kaggle Credit Card Fraud Detection dataset. We measure each classifier’s performance averaged over the ten (\(2\times 5\)) test sets with the metrics Savings, F1, AUC, Sensitivity, Specificity, and Brier score where instance-independent thresholds are applied.

In terms of Savings, logit is always outperformed by cslogit and r-cslogit. When adding an outlier, r-cslogit outperforms cslogit. Note that the performance of r-cslogit remains stable for all considered metrics when increasing the size of the outlier. In terms of the cost-insensitive metrics AUC, Specificity, and Brier score, logit performs best. In terms of F1 and Sensitivity, logit is outperformed by either cslogit or r-cslogit. This could be due to the effect of class imbalance and is in line with previous findings of Höppner et al. (2022). The results in terms of Savings are visualized in Fig. 5. Since the logit model is not cost-sensitive, its performance remains constant after adding an outlier. The performance of cslogit is strongly disrupted once the cost of the outlier is set to 1 million or larger. This corresponds with the shift of the two-dimensional linear decision boundary shown in the examples on synthetic data. Even though the dataset contains over 280,000 instances, a single outlier, albeit a large one, can unhinge the cslogit method. When increasing the misclassification cost of a single outlier, the performance of r-cslogit remains stable. It is clearly more robust to this additional noise than its nonrobust counterpart, as the individual outlier is detected and its cost is imputed with an estimated, expected cost. The shaded areas in Fig. 5 represent the variability of performance over the different folds in cross-validation. In contrast to the variability of cslogit, which increases drastically, the variability in the performance of r-cslogit remains stable.

6 Conclusion

Instance-dependent cost-sensitive (IDCS) learning methods take into account variable misclassification costs across instances in the training data in learning a classification model. This allows for optimizing the performance of the resulting classification model in terms of the misclassification costs rather than the classification accuracy.

In this article, we present the results of a series of experiments on synthetic data that demonstrate the sensitivity of IDCS methods to outliers and noise in the data. We show that the resulting classification model may be highly sensitive to outlying instance-dependent costs when learning an instance-dependent cost-sensitive classification model. Consequently, using existing cost-sensitive models in the presence of noise or outliers can result in large misclassification costs.

To address this potential vulnerability, we propose a generic, IDCS-method-independent, three-step framework to develop robust IDCS methods with respect to the effects of random variability and noise. In the first step, instances with outlying misclassification costs are detected. In the second step, outlying costs are corrected in a data-driven way. In the third step, an IDCS learning method is applied using the adjusted instance-dependent cost information.

This generic framework is subsequently applied in combination with cslogit, which is a logistic regression-based IDCS method, to obtain its robust version named r-cslogit. The robustness of this approach is introduced in the first two steps of the generic framework by making use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data. The results show that the proposed method is superior in terms of cost savings when compared to its non-robust counterpart for variable levels of noise and outliers.