1 Introduction

Classification is a well-studied machine learning task that involves the assignment of instances to a predefined set of outcome classes. Cost-sensitive classification methods take into account asymmetric costs related to incorrectly classifying instances across various classes (Elkan 2001; Verbeke et al. 2020). Such misclassification costs may either be class-dependent, i.e., equal for all instances of a class, or instance-dependent, i.e., vary across instances.

Classification methods are adopted to support or automate business decision-making, e.g., for credit scoring (Petrides et al. 2022) or customer churn prediction (Lessmann et al. 2021). Note that in both applications, misclassified instances involve variable costs. For instance, the cost of a misclassified churner equals the future customer lifetime value, whereas a misclassified non-churner typically involves a much smaller cost, i.e., the cost of targeting the customer with a retention campaign. Either or both costs may be instance-dependent or class-dependent, depending on the characteristics of the particular application setting.

A broad variety of cost-sensitive (CS) and instance-dependent cost-sensitive (IDCS) classification methods have been proposed in the literature as reviewed and experimentally evaluated by Petrides and Verbeke (2022) and Vanderschueren et al. (2022). A prominent approach that is adopted by both CS and IDCS methods for taking misclassification costs into account is to weigh instances proportionally with the misclassification cost involved when learning a classification model.

In this article, we raise the question of whether IDCS classification methods are sensitive to outliers and noise in the data. No prior work seems to have addressed this question, which nonetheless is of significant practical importance given the broad adoption and potential monetary impact of using biased classification models for decision-making.

To address this question, we present the results of a series of experiments that evaluate the robustness of IDCS classification methods with respect to outlying costs in the data and that highlight the potential bias and vulnerability of these methods. We propose a robust approach to IDCS classification by extending the existing cslogit approach (Höppner et al. 2022). An important benefit is the automatic and reliable detection of outliers in the data. These outliers may not only spoil the resulting analysis (as illustrated in this article) but can also contain valuable information. A robust analysis can thus provide better insight into the structure of the data.

The following section outlines related work on IDCS learning and discusses both cslogit and robustness. Next, in Sect. 3, a series of simulations on synthetic data is presented that motivates the need for robust IDCS learning, which we develop in Sect. 4. Section 5 presents the results of a series of experiments that illustrate the excellent performance of the proposed robust IDCS learning method, denoted r-cslogit, in comparison with both logit and cslogit. We conclude and present directions for future research in Sect. 6.

2 Related work

Elkan (2001) introduces a learning paradigm where different misclassification errors incur different penalties depending on the predicted and actual class, with applications to, for example, detecting transaction fraud and credit scoring. The benefits and costs of different predictions can be summarized in a two-dimensional instance-dependent cost matrix with one dimension for the predicted value and another dimension for the ground truth. Given these benefits and costs, each new instance should be assigned to the class that leads to the lowest expected cost, which is calculated by means of conditional probabilities.

2.1 IDCS learning, cslogit, and robustness

For certain applications, benefits and costs depend not only on the class but also on the instance itself. Instance-dependent cost-sensitive learning therefore considers costs at a finer level of granularity than class-dependent approaches. For these applications, using instance-dependent costs instead of class-dependent costs leads to a lower total misclassification cost (Brefeld et al. 2003; Vanderschueren et al. 2022).

Several instance-dependent cost-sensitive methodologies have been proposed in the literature, with recent overviews given by Petrides and Verbeke (2022) and Vanderschueren et al. (2022). Especially relevant to our work are methodologies that adjust the learning algorithm to incorporate instance-dependent costs. Instance-dependent cost-sensitive variants have been proposed for several common machine learning classifiers, such as boosting (Fan et al. 1999; Zelenkov 2019; Höppner et al. 2022), support vector machines (Brefeld et al. 2003), decision trees (Sahin et al. 2013; Bahnsen et al. 2015), and logistic regression (Bahnsen et al. 2014; Höppner et al. 2022).

In this work, we build upon an instance-dependent cost-sensitive version of logistic regression. Following Höppner et al. (2022), we refer to this method as cslogit. Logistic regression is a widely used method for binary classification tasks. To extend logistic regression to its IDCS counterpart, Bahnsen et al. (2016) and Höppner et al. (2022) propose an objective function that combines cost-sensitivity with instance-dependent learning, so that instance-dependent costs are incorporated directly into the optimization. Applying this objective function yields significant improvements in terms of higher savings compared to cost-insensitive or class-dependent cost-sensitive models in the context of, for example, credit scoring and transaction fraud detection.

Classical nonrobust methods for regression, such as least squares or maximum likelihood techniques, try to fit the model optimally to all the data. As a result, these methods are heavily influenced by outliers in the data. Outliers may bias the parameter estimates and confidence intervals, and hypothesis tests may consequently become unreliable and/or uninformative. In contrast, robust methods can resist the effect of outliers and thereby avoid distorted results and false conclusions. As an important benefit, they allow the automatic detection of outliers as observations that deviate substantially from the robust fit. It is important to note that the detected outliers are not necessarily errors in the data. The presence of outliers may reveal that the data are more heterogeneous than has been assumed and than can be handled by the original statistical model. Outliers can be isolated or may come in clusters, indicating that there are subgroups in the population that behave differently. Many different approaches to robust regression have been proposed; good overviews can be found in reference works such as Huber and Ronchetti (2009), Maronna et al. (2019) and Rousseeuw and Leroy (1987). In the context of generalized linear models (GLMs), various robust alternatives have been presented, for example by Cantoni and Ronchetti (2001), Bergesio and Yohai (2011), Valdora and Yohai (2014), Ghosh and Basu (2016) and Štefelová et al. (2021). Robust logistic regression has been studied by Künsch et al. (1989), Morgenthaler (1992), Carroll and Pederson (1993), Bianco and Yohai (1996), Croux and Haesbroeck (2003), Bondell (2005, 2008), Hosseinian and Morgenthaler (2011) and Monti and Filzmoser (2021).

2.2 Preliminaries

The dataset \({\mathcal {D}}\) consists of N observed predictor-response pairs \(\left\{ \left( {\varvec{x}}_{i}, y_{i}\right) \right\} _{i=1}^{N}\) and is used to train a binary classification model s(.). The costs \(C_i\) correspond to the cost matrix defined in Table 1.

This binary classification model predicts a probability score \(s_i \in [0, 1]\) for each instance i based on the features \({\varvec{x}}_{i}\). Depending on the classification threshold \(t_{i}^{*}\), \(s_i\) is converted to a predicted class \({\widehat{y}}_i \in \left\{ 0,1 \right\} \).

For models trained with AEC (Eq. 3), savings remain relatively stable across different thresholding strategies (Vanderschueren et al. 2022). Therefore, we use a default threshold of 0.5.

A binary logistic regression predicts the probability score that an observation belongs to the positive class. This probability score is calculated by Eq. (1), where \(\beta _{0}\) is the bias term, \(\beta _{1},\ldots ,\beta _{d}\) are the learned weights, and \({\varvec{x}}_i\) are the features of a particular observation i:

$$\begin{aligned} s_i = s_{(\beta _0,\varvec{\beta })}({\varvec{x}}_i)=\frac{1}{1+e^{-z}} \quad \text {where} \quad z=\beta _{0}+\beta _{1} x_{i1}+\beta _{2} x_{i2}+\ldots +\beta _{d} x_{id} . \end{aligned}$$
(1)

This probability score is then compared to a threshold to assign each observation to a class. The objective function of logistic regression is the likelihood, which is maximized, or equivalently the cross-entropy loss, which is minimized. For a single sample with true label \(y_i \in \left\{ 0,1 \right\} \) and probability score \(s_i = P(Y=1)\), the cross-entropy loss is given by Eq. (2):

$$\begin{aligned} L_{\log }(y_i, s_i)=-\Big (y_i \log (s_i)+(1-y_i) \log (1-s_i)\Big ). \end{aligned}$$
(2)
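To make the notation concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2); the function and variable names are illustrative and not taken from any particular implementation.

```python
import numpy as np

def logistic_score(X, beta0, beta):
    """Eq. (1): s_i = 1 / (1 + exp(-z_i)) with z_i = beta0 + x_i' beta."""
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, s, eps=1e-12):
    """Eq. (2), averaged over all instances."""
    s = np.clip(s, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
```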

Note that this equation does not take any costs into account. Because this objective function assigns equal weight to each misclassification, it does not necessarily correspond to the underlying business problem where costs are to be minimized. The reason is twofold: misclassification costs differ per class and per instance. The real business objective is to minimize the average expected total cost of the binary classifier.

We build upon the instance-dependent cost-sensitive logistic regression (cslogit) model as proposed by Bahnsen et al. (2016) and Höppner et al. (2022). Cslogit minimizes an instance-dependent cost-sensitive objective function corresponding to the real business objective of minimizing costs in domains such as customer churn prediction, credit scoring, and direct marketing (Thai-Nghe et al. 2010; Sammut 2017). Depending on the business objective, other cost matrices can also be considered. For example, Höppner et al. (2022) propose the cost matrix \(C_i(0 \mid 0)=0\), \(C_i(0 \mid 1)=A_i\), \(C_i(1 \mid 0)=c_f\), and \(C_i(1 \mid 1)=c_f\) for the detection of transfer fraud, where \(c_f\) is a fixed administrative fee. Alternatively, Bahnsen et al. (2014) propose the cost matrix \(C_i(0 \mid 0)=0\), \(C_i(0 \mid 1)=L_{g d}\), \(C_i(1 \mid 0)=r_i+C_{F P}^a\), and \(C_i(1 \mid 1)=0\) for credit scoring, where \(L_{g d}\) is the loss given default, \(r_i\) is the loss in profit from rejecting what could have been a good customer, and \(C_{F P}^a\) is the cost related to the assumption that the financial institution will not keep the amount of the declined applicant unused. However, the aim of this work is to address the need for robustness and to propose a solution to this potential issue in a generic way, regardless of the application. Therefore, to present an application-agnostic methodology with the simplest possible cost matrix, this work uses a symmetric cost matrix.

Equation (3) shows the average expected cost (AEC), the cost-sensitive objective function that is used by cslogit, given a symmetric cost matrix, as shown in Table 1:

$$\begin{aligned} AEC(s({\mathcal {D}})) &= \frac{1}{N} E[{\text {Cost}}(s({\mathcal {D}})) \mid {\varvec{X}}] \\ &= \frac{1}{N} \sum _{i=1}^{N}\Big (y_{i}[s_{i} C_{i}(1 \mid 1)+(1-s_{i}) C_{i}(0 \mid 1)] +(1-y_{i})[s_{i} C_{i}(1 \mid 0)+(1-s_{i}) C_{i}(0 \mid 0)]\Big ) \\ &= \frac{1}{N} \sum _{i=1}^{N} \Big (A_{i}\big (y_{i}(1-s_{i})+(1-y_{i})s_{i}\big )\Big ). \end{aligned}$$
(3)
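The last line of Eq. (3) follows from the symmetric cost matrix of Table 1, in which correct classifications cost nothing and both types of misclassification cost \(A_i\). As a minimal sketch (assuming NumPy arrays y, s and A of equal length), the AEC objective then reduces to a cost-weighted average of the predicted error probabilities:

```python
import numpy as np

def aec(y, s, A):
    """Eq. (3): average expected cost under the symmetric cost matrix
    C_i(0|0) = C_i(1|1) = 0 and C_i(0|1) = C_i(1|0) = A_i."""
    # y_i * (1 - s_i): expected cost contribution of positives scored too low
    # (1 - y_i) * s_i: expected cost contribution of negatives scored too high
    return np.mean(A * (y * (1.0 - s) + (1.0 - y) * s))
```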

In Eq. (3), each observation i is a pair of d features \({\varvec{x}}_{i} = (x_{i1}, . . . , x_{id})\) and a binary response label \(y_i \in \left\{ 0,1 \right\} \).

Table 1 Symmetric cost matrix for cslogit

The total cost as a metric is not unambiguously interpretable across models, as datasets with high instance-dependent costs might yield a higher total misclassification cost even for a model with a better relative performance. Building on the idea of normalizing the total classification costs of a model, presented in Whitrow et al. (2009), Bahnsen et al. (2014) introduce a more interpretable metric: Savings. This metric represents the relative improvement in cost of a newly proposed model, \(Cost(s({\mathcal {D}}))\), compared to the cost of an empty model that assigns all instances to a single class, \(Cost_{empty}({\mathcal {D}})\). \(Cost_{empty}({\mathcal {D}})\) is calculated by taking the minimum of the costs incurred when classifying all instances as either negative or positive:

$$\begin{aligned} Cost_{empty}({\mathcal {D}})=\min \left\{ Cost\left( s_{0}({\mathcal {D}})\right) , Cost\left( s_{1}({\mathcal {D}})\right) \right\} . \end{aligned}$$
(4)

Using the \(Cost_{empty}({\mathcal {D}})\) of an empty model as a factor to normalize total costs, Savings of the model \(s({\mathcal {D}})\) are calculated by Eq. (5):

$$\begin{aligned} \textrm{Savings}\, (s({\mathcal {D}}))= 1- \frac{Cost(s({\mathcal {D}}))}{Cost_{empty}({\mathcal {D}})} . \end{aligned}$$
(5)
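A minimal sketch of Eqs. (4) and (5) under the symmetric cost matrix, assuming hard class predictions y_pred obtained with the default threshold of 0.5; the names are illustrative:

```python
import numpy as np

def savings(y, y_pred, A):
    """Eqs. (4)-(5): relative cost reduction versus the best empty model."""
    cost_model = np.sum(A * (y * (1 - y_pred) + (1 - y) * y_pred))
    cost_all_negative = np.sum(A * y)        # s_0: every instance labeled 0
    cost_all_positive = np.sum(A * (1 - y))  # s_1: every instance labeled 1
    cost_empty = min(cost_all_negative, cost_all_positive)
    return 1.0 - cost_model / cost_empty
```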

3 Sensitivity analysis

Data can contain outliers in terms of misclassification costs for various reasons, such as missing data, invalid observations, or typos. Because instance-dependent costs are incorporated in the learning algorithm, outliers in these misclassification costs could potentially have a large impact on instance-dependent cost-sensitive learning methodologies such as cslogit. Therefore, we test the sensitivity of cslogit to these outliers and examine to what extent this is a shortcoming of the method.

3.1 Simulation setup

We analyze the sensitivity to outlying costs through a series of simulations on synthetic data. The different synthetic datasets all share the following properties. Each observation is visualized by a dot, with the size of the dot corresponding to its misclassification cost. The positive class is shown in red and the negative class in blue. Each observation has, besides its misclassification cost and label, two features: \(X_1\) and \(X_2\). \(X_1\) is the feature that determines the misclassification cost A. For the positive class, this cost is positively related to \(X_1\); for the negative class, the relation between \(X_1\) and the cost is negative. The underlying function is given by Eq. (6):

$$\begin{aligned} A_i=\left\{ \begin{array}{ll} 20+2x_{1i} &{} \text { for the positive class }, \\ 20-2x_{1i} &{} \text { for the negative class. } \end{array}\right. \end{aligned}$$
(6)
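Under these assumptions, the costs of Eq. (6) can be generated as follows (a sketch; the variable names are illustrative):

```python
import numpy as np

def base_cost(x1, y):
    """Eq. (6): cost increases with X1 for the positive class (y = 1)
    and decreases with X1 for the negative class (y = 0)."""
    return np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)
```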

Panel (b) and (c) in Fig. 1 visualize this equation.

Fig. 1

The setup for synthetic data. Panel a displays the distribution of the negative and positive class, dependent on \(X_2\). Panels b and c represent the misclassification costs of observations from the negative and positive classes as a linear function of \(X_1\), generated by Eq. (6). The three panels all show a sample of size 50 per class

\(X_2\) is the feature that determines the distributions of classes 0 and 1. The two class distributions are 2-dimensional Gaussians sharing the same standard deviations. Observations from the negative class are sampled from \(N(\mu _{0} ,\sigma _{0} ^{2},\nu _{0} ,\tau _{0} ^{2},\rho )\) and observations from the positive class from \(N(\mu _1 ,\sigma _1 ^{2},\nu _1 ,\tau _1 ^{2},\rho )\).

\(\mu _0\) and \(\mu _{1}\) are both equal to 0, while \(\nu _0 = -5\) and \(\nu _{1}=5\). The variances \(\sigma _0 ^{2},\tau _0 ^{2},\sigma _{1} ^{2}\) and \(\tau _{1} ^{2}\) are all equal to 4. As there is no correlation between the two dimensions \(X_1\) and \(X_2\), \(\rho \) is equal to 0. Cases of the positive class therefore have, on average, a higher \(X_2\) value than cases of the negative class. Panel (a) in Fig. 1 displays the class distributions from which the data are generated.

To generate outliers, both in the synthetic setup (Sects. 3 and 5.1) and in the sensitivity analysis on real data (Sect. 5.2), the multivariate distribution of \({\textbf{X}}\) remains unchanged, since we only focus on outliers in the observed costs of instances. We use the Tukey-Huber contamination model to generate the cost distribution with outliers, where a Dirac (point-mass) distribution is used to generate costs of arbitrary size (Maronna et al. 2019).
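One possible way to generate such a contaminated sample, assuming the class-conditional Gaussians of Sect. 3.1 and a point mass at an arbitrarily large cost; the contamination fraction eps and the outlying cost are illustrative choices, not values prescribed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, y, mean_x2, sd=2.0):
    """Draw n observations of class y: X1 ~ N(0, 4), X2 ~ N(mean_x2, 4)."""
    x1 = rng.normal(0.0, sd, n)
    x2 = rng.normal(mean_x2, sd, n)
    A = np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)  # Eq. (6)
    return np.column_stack([x1, x2]), np.full(n, y), A

def contaminate(A, eps=0.05, outlier_cost=400.0):
    """Tukey-Huber contamination: with probability eps, the observed cost is
    drawn from a Dirac point mass at outlier_cost instead of the clean model."""
    mask = rng.random(A.size) < eps
    return np.where(mask, outlier_cost, A), mask

X_neg, y_neg, A_neg = sample_class(50, 0, mean_x2=-5.0)   # negative class
X_pos, y_pos, A_pos = sample_class(50, 1, mean_x2=+5.0)   # positive class
A_pos_contaminated, _ = contaminate(A_pos)
```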

Given these settings for instance-dependent costs and class distribution, observations of the negative class with a high associated cost are expected to be located in the third quadrant and observations of the positive class with a high associated cost in the first quadrant. The symmetric cost matrix used for the examples on synthetic data is presented in Table 1 as introduced in Sect. 2.2.

3.2 Results

Within each setting, two classifiers are compared: logit and cslogit. Both are linear classifiers, yet they can produce distinctly different decision boundaries on the same training data. Since the data are only two-dimensional, these decision boundaries can be visually represented by lines. The boundaries proposed by the logit and cslogit models are coloured red and blue, respectively. The normal behavior of both models in the default setting of the examples on synthetic data is visualized in Panel a of Fig. 2.

Fig. 2

The instability of cslogit’s decision boundary. This figure motivates the need for a robust version of cslogit. Three examples of synthetic data where logit and cslogit are tested are shown. Panel a shows the normal behavior of cslogit and logit in the default case. Panel b displays the case where a large outlier is added. The blue decision boundary of cslogit shifts, while the red decision boundary of logit remains stable. Note that the effect of the outlier can be subtle, i.e. it pulls on the decision boundary, resulting in a slight rotation, without the outlier actually being classified correctly. Panel c displays the case where random noise is added to the misclassification costs. The decision boundary of cslogit shifts even further, resulting in an almost perpendicular boundary in comparison with Panel a. We further elaborate on the exact setting of the examples on synthetic data in Sect. 5.1 (colour figure online)

4 Robust IDCS

To overcome the sensitivity of instance-dependent cost-sensitive classifiers to outlying costs, we introduce a three-step framework that makes IDCS methods robust by detecting outliers and adapting their cost matrix. Hence, the final model is trained using a less volatile and more reliable set of costs. The resulting robust classification model also yields automatic outlier detection. The concrete implementation of this framework is given by Algorithm 1.

Algorithm 1

To estimate the misclassification costs of observations in a robust manner in step 1, a regression with the Huber loss is applied. Concretely, we estimate the cost \(A_i\) of observation i as a function of its features \({\textbf{x}}_i\) and label \(y_i\) using a linear regression with the Huber loss function: \({\hat{A}}_i=f({\varvec{x}}_i,y_i)\). The formalization of robustness in statistics started with the work of Huber (1964); his ground-breaking results and well-known loss function are still widely used today in statistics and machine learning. The Huber loss function is defined by Eq. (7). It results in a regression that is less sensitive to outliers than traditional regression methods, which typically use a squared error loss.

$$\begin{aligned} L_{\delta }(a)=\left\{ \begin{array}{ll} \frac{1}{2} a^{2} &{} \text { for }|a|\le \delta , \\ \delta \left( |a|-\frac{1}{2} \delta \right) &{} \text { otherwise. } \end{array}\right. \end{aligned}$$
(7)
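As an illustration, the Huber loss of Eq. (7) can be written as follows; the value of \(\delta \) shown is a common default, not a choice made in this article:

```python
import numpy as np

def huber_loss(a, delta=1.345):
    """Eq. (7): quadratic for |a| <= delta, linear beyond that."""
    abs_a = np.abs(a)
    return np.where(abs_a <= delta,
                    0.5 * a ** 2,
                    delta * (abs_a - 0.5 * delta))
```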

Next, to detect outliers, we compare the absolute value of the standardized residuals with a cutoff value based on the normal distribution (Rousseeuw and Hubert 2011). If this value exceeds 2.5, we consider the corresponding cost an outlier and add the observation to the initially empty set \({\mathcal {S}}_{outlier}\).

Fig. 3

Cost estimation as a function of \(X_1\)

The observed cost \(A_i\) of observation i is operationally defined as an outlier if, for an estimator \({\hat{A}}=f(X,Y)\), the absolute value of the standardized residual \(\epsilon _i\) is larger than 2.5.

Figure 3 further clarifies the concept of a conditional outlier in the setting of synthetic data where noise is added to the observed costs, as displayed in Fig. 2c. In this figure, the black line displays the estimated costs as predicted by the linear regression model with Huber loss (Algorithm 1, line 2). The red dots represent the observed costs as a function of \(X_1\). Consider the two observations A and B. To check whether their observed costs are outliers, we look at the standardized residuals, represented by the grey vertical lines for those two observations. Both observations A and B have an observed cost of 50. The standardized residual of A exceeds 2.5, whereas for B it is smaller than 2.5. Consequently, although both observations have the same cost, the cost of A is considered an outlier, whereas the cost of B is not.

In this way, the costs A that are outliers conditional on the features \({\textbf{X}}\) and label Y are detected. In step 2, the observed outlying costs A of all observations in \({\mathcal {S}}_{outlier}\) are imputed with their estimated counterparts \({\hat{A}}\). This results in a robust cost matrix (Table 2).
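A possible implementation of steps 1 and 2 is sketched below, assuming scikit-learn's HuberRegressor for the robust cost regression and a MAD-based standardization of the residuals; the exact estimators and standardization used in practice may differ:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

def robustify_costs(X, y, A, cutoff=2.5):
    """Steps 1-2 of the robust IDCS framework: estimate costs robustly,
    flag conditional outliers, and impute their observed costs."""
    # Step 1: robust regression of the observed cost on features and label
    Z = np.column_stack([X, y])
    A_hat = HuberRegressor().fit(Z, A).predict(Z)
    # Standardize the residuals with a robust scale estimate (MAD)
    resid = A - A_hat
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    is_outlier = np.abs(resid / scale) > cutoff
    # Step 2: impute outlying observed costs by their robust estimates
    A_robust = np.where(is_outlier, A_hat, A)
    return A_robust, is_outlier
```

Step 3 then simply trains the IDCS learner, here cslogit, on the adjusted costs A_robust.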

Table 2 Symmetric cost matrix for r-cslogit

Equation (8) restates the cost-sensitive objective function AEC, given by Eq. (3), but adapts it to the new robust cost matrix given by Table 2. The indicator function \(\mathbbm {1}_{o}\) takes value 1 if the cost \(A_i\) of observation i is classified as an outlier and 0 otherwise.

$$\begin{aligned} {\text {AEC}}(s({\mathcal {D}})) &= \frac{1}{N} E[{\text {Cost}}(s({\mathcal {D}})) \mid {\varvec{X}}] \\ &= \frac{1}{N} \sum _{i=1}^N \Big [ \mathbbm {1}_{o} {\hat{A}}_i\big ( y_i( 1-s_i) +( 1-y_i) s_i\big ) +(1-\mathbbm {1}_{o}) A_i\big ( y_i( 1-s_i) +( 1-y_i) s_i\big ) \Big ]. \end{aligned}$$
(8)
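In code, Eq. (8) amounts to evaluating the AEC of Eq. (3) on the imputed cost vector; reusing the hypothetical helpers sketched above, with X_train, y_train, A_train and the scores s_train as placeholders for the training data and model outputs:

```python
# Detect and impute outlying costs (steps 1-2), then evaluate the AEC of Eq. (8)
A_robust, is_outlier = robustify_costs(X_train, y_train, A_train)
loss = aec(y_train, s_train, A_robust)
```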

5 Results

This section discusses the performance of logit, cslogit and the novel r-cslogit on synthetic data and tests their sensitivity on real data with additional outliers. In the reported experiments, symmetric cost matrices are used. However, using the alternative cost matrices presented in Sect. 2.2 yields similar results concerning robustness.

The performance of binary classification algorithms is typically measured by labeling one class as positive and the other as negative and constructing a confusion matrix. The positive label is typically used for the minority class and the negative label for the majority class. From the confusion matrix, we obtain the following counts. True Negatives (TN) is the number of correctly classified negative cases. False Positives (FP) is the number of negative cases incorrectly classified as positive. False Negatives (FN) is the number of positive cases incorrectly classified as negative. True Positives (TP) is the number of correctly classified positive cases. With these counts, we define Sensitivity (or Recall), Specificity, and Precision by Eqs. 9, 10, and 11:

$$\begin{aligned} Sensitivity=Recall=\frac{TP}{TP+FN} \end{aligned}$$
(9)
$$\begin{aligned} Specificity=\frac{TN}{TN+FP} \end{aligned}$$
(10)
$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(11)

Equation 12 defines the F1-measure, which is the harmonic mean of precision and recall.

$$\begin{aligned} F_1=\frac{2 \cdot precision \cdot recall}{precision+recall} \end{aligned}$$
(12)

The area under the receiver operating characteristic curve (AUC) of a classifier can be interpreted as the probability that a randomly chosen minority case receives a higher score than a randomly chosen majority case. Therefore, a higher AUC indicates better classification performance (Höppner et al. 2022). Class distributions and misclassification costs are not taken into account when calculating the AUC.

Equation 13 defines the Brier score, where \(s_i\) is the predicted probability and \(y_i\) is the observed outcome. This metric measures the mean squared difference between the predicted probability and the actual outcome and is used to assess whether the model’s predictions are calibrated probabilities. A lower score is better.

$$\begin{aligned} Brier=\frac{1}{N} \sum _{i=1}^N\left( s_i-y_i\right) ^2 \end{aligned}$$
(13)
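For completeness, the following sketch shows how these metrics could be computed for one model on one test set, using scikit-learn and the savings helper sketched in Sect. 2.2; the argument names are placeholders:

```python
from sklearn.metrics import f1_score, roc_auc_score, recall_score, brier_score_loss

def evaluate(y_true, s, A, threshold=0.5):
    """Metrics reported in Sect. 5 for a single model on a single test set."""
    y_pred = (s >= threshold).astype(int)
    return {
        "Savings":     savings(y_true, y_pred, A),                  # Eq. (5)
        "Sensitivity": recall_score(y_true, y_pred),                # Eq. (9)
        "Specificity": recall_score(y_true, y_pred, pos_label=0),   # Eq. (10)
        "F1":          f1_score(y_true, y_pred),                    # Eq. (12)
        "AUC":         roc_auc_score(y_true, s),
        "Brier":       brier_score_loss(y_true, s),                 # Eq. (13)
    }
```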

5.1 Synthetic data

In this section, we reuse the examples on synthetic data introduced in Sect. 3.1 to demonstrate how the possible shortcomings of cslogit can be countered by deploying the more robust r-cslogit. Figure 4 displays the decision boundaries of logit, cslogit and r-cslogit in red, blue and green, respectively.

5.1.1 Synthetic data: three settings

The basic setting of the examples on synthetic data in Panel a is the same as explained in Sect. 3.1. In Panel b, an additional outlier of the positive class is added in the third quadrant with a cost equal to 400. The robust method first estimates its cost with a linear Huber regression to be 13.75 and flags it as an outlier. Next, the observed cost of 400 is replaced by the estimated cost of 13.75. In Panel c, we add noise to the costs. The misclassification costs are then generated by Eq. (14), where the noise \(\epsilon _i\) is sampled from a lognormal distribution with parameters \(\mu = 2\) and \(\sigma = 1.5\).

$$\begin{aligned} A_i=\left\{ \begin{array}{ll} 20+2x_{1i}+\epsilon _i &{} \text { for the positive class }, \\ 20-2x_{1i}+\epsilon _i &{} \text { for the negative class. } \end{array}\right. \end{aligned}$$
(14)
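Under the same assumptions as before, the noisy costs of Eq. (14) can be generated as follows (a sketch; rng is a NumPy random generator):

```python
import numpy as np

def noisy_cost(x1, y, rng):
    """Eq. (14): the costs of Eq. (6) plus lognormal noise (mu = 2, sigma = 1.5)."""
    base = np.where(y == 1, 20 + 2 * x1, 20 - 2 * x1)
    return base + rng.lognormal(mean=2.0, sigma=1.5, size=x1.size)
```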

5.1.2 Description of results

Figure 4 visualizes the decision boundaries of the three models. In Panel a, the decision boundaries of cslogit and r-cslogit overlap, as the regression with Huber loss can perfectly recover the underlying function relating misclassification costs to \(X_1\) in the absence of noise or outliers. Panel b displays the case where one outlier is added. The decision boundary of the logit model is not affected by the size of misclassification costs. Hence, it is not influenced by the outlier and remains unchanged, demonstrating the normal behavior as defined before. The blue decision boundary of the cslogit model is strongly influenced by the outlier: the objective function takes into account the full misclassification costs of the observations in the training set, including the excessive outlying cost. As a consequence, the behavior of the cslogit model is completely disrupted. This is strongly in conflict with its normal behavior, as the decision boundary is tilted by almost a quarter turn. The tilted decision boundary results in poor predictive classification power, making the cslogit model of inferior quality. The green decision boundary of r-cslogit remains largely unchanged, as it is robust against the single added outlier.

Performance metrics are summarized in Table 3. We consider the cost-sensitive metric Savings introduced in Sect. 2.2 and cost-independent metrics Sensitivity, Specificity, F1, AUC, and Brier score.

r-cslogit outperforms logit and cslogit in terms of Savings when we add an outlier and noise. Moreover, the performance in terms of Savings remains unchanged after adding an outlier. In the default case of setting one, r-cslogit and cslogit are equivalent, as they make the exact same predictions. When considering cost-insensitive metrics, logit performs best. A full analysis on synthetic data where we experiment with different settings of class imbalance and outlier size can be found in Appendix A.

Fig. 4

Superiority of r-cslogit. The red decision boundary of logit remains unchanged, as it is not cost-sensitive. The blue decision boundary of cslogit differs strongly per example, as the model is prone to outliers and noise. The green decision boundary of r-cslogit is stable against outliers and handles noise in misclassification costs quite well. In Panel a, the blue and green decision boundaries coincide

Table 3 Results on synthetic data (i) in the default setting, (ii) with an outlier, and (iii) with additional noise added to the amounts

5.2 Sensitivity analysis on real data

In this section, we analyze the sensitivity of the three methods in an experiment with real data where we add an additional outlier, gradually increasing in size. To add outliers, we randomly select an observation and change its class label and instance-dependent misclassification cost.

This setup is similar to the second setup with synthetic data presented in the previous section. The performance is measured by the cost-sensitive metric Savings as described before, as well as by the cost-independent metrics Sensitivity, Specificity, F1, AUC, and Brier score. Performance is measured using five-fold cross-validation with splits stratified on the class distribution, repeated twice with different random initializations.
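This evaluation protocol can be reproduced, for example, with scikit-learn's repeated stratified splitter; the model-fitting code inside the loop is omitted and the names are placeholders:

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Two repetitions of stratified five-fold cross-validation: 2 x 5 = 10 test sets
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # fit logit, cslogit and r-cslogit on (X[train_idx], y[train_idx], A[train_idx]),
    # then evaluate Savings and the cost-insensitive metrics on the test fold
    pass
```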

5.2.1 Description of the dataset

The dataset on which the three methods are tested is the Kaggle Credit Card Fraud Detection dataset (ULB 2018). The dataset dates from September 2013 and contains transactions made by European credit cardholders. A total of 492 out of 284,807 transactions are fraudulent, resulting in a high class imbalance. The numerical input features \(V1, V2, \dots , V28\) are the results of a PCA transformation to anonymize the dataset. Time and Amount have not been transformed. The feature Time is not taken into consideration in this experiment and is therefore dropped in the preprocessing phase. The feature Amount is the transaction amount, which is of high importance in cost-sensitive instance-dependent learning and translates into our setting as the instance-dependent misclassification cost. The feature \(Class \in \{0,1\}\) indicates whether a transaction is fraudulent or not.

Table 4 Sensitivity analysis on real data resulting from a two times five-fold cross validation procedure on the Kaggle Credit Card Fraud Detection dataset
Fig. 5

Sensitivity analysis on real data

5.2.2 Results

Table 4 contains the results of a \(2\times 5\)-fold cross validation procedure for the Kaggle Credit Card Fraud Detection dataset. We measure each classifier’s performance averaged over the ten (\(2\times 5\)) test sets with the metrics Savings, F1, AUC, Sensitivity, Specificity, and Brier score where instance-independent thresholds are applied.

In terms of Savings, logit is always outperformed by cslogit and r-cslogit. When adding an outlier, r-cslogit outperforms cslogit. Note that the performance of r-cslogit remains stable for all considered metrics when increasing the size of the outlier. In terms of the cost-insensitive metrics AUC, Specificity, and Brier score, logit performs best. In terms of F1 and Sensitivity, logit is outperformed by either cslogit or r-cslogit. This could be due to the effect of class imbalance and is in line with previous findings of Höppner et al. (2022). The results in terms of Savings are visualized in Fig. 5. Since the logit model is not cost-sensitive, its performance remains constant after adding an outlier. The performance of cslogit is strongly disrupted once the cost of the outlier is set to 1 million or larger. This corresponds with the shift of the two-dimensional linear decision boundary shown in the examples on synthetic data. Even though the dataset contains over 280,000 instances, a single outlier, albeit a large one, can unhinge the cslogit method. When increasing the misclassification cost of a single outlier, the performance of r-cslogit remains stable. It is clearly more robust to this additional noise than its nonrobust counterpart, as the individual outlier is detected and its cost is imputed with an estimated, expected cost. The shaded areas in Fig. 5 represent the variability of performance over the different folds in cross-validation. In contrast to the variability of cslogit, which increases drastically, the variability in the performance of r-cslogit remains stable.

6 Conclusion

Instance-dependent cost-sensitive (IDCS) learning methods take into account variable misclassification costs across instances in the training data in learning a classification model. This allows for optimizing the performance of the resulting classification model in terms of the misclassification costs rather than the classification accuracy.

In this article, we present the results of a series of experiments on synthetic data that demonstrate the sensitivity of IDCS methods to outliers and noise in the data. We show that the resulting classification model may be highly sensitive to outlying instance-dependent costs when learning an instance-dependent cost-sensitive classification model. Consequently, using existing cost-sensitive models in the presence of noise or outliers can result in large misclassification costs.

To address this potential vulnerability, we propose a generic, IDCS-method-independent, three-step framework to develop robust IDCS methods with respect to the effects of random variability and noise. In the first step, instances with outlying misclassification costs are detected. In the second step, outlying costs are corrected in a data-driven way. In the third step, an IDCS learning method is applied using the adjusted instance-dependent cost information.

This generic framework is subsequently applied in combination with cslogit, which is a logistic regression-based IDCS method, to obtain its robust version named r-cslogit. The robustness of this approach is introduced in the first two steps of the generic framework by making use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data. The results show that the proposed method is superior in terms of cost savings when compared to its non-robust counterpart for variable levels of noise and outliers.