
1 Introduction

The global economic developments of recent decades have put corporate failures and their consequences for economic well-being under the spotlight, to the extent that predicting bankruptcy or business failure has become a crucial task in finance. This, in turn, has emphasized that financial institutions need effective prediction mechanisms in order to make appropriate lending decisions.

In general, the objective of corporate failure prediction is to forecast whether a firm will survive or fail with the minimum possible classification error. That is why corporate failure research is framed as binary classification (Séverin & Veganzones, 2021; Ouenniche & Tone, 2017). From the binary classification point of view, the model’s output is a dichotomous variable that takes the value of 1 when the firm enters a bankruptcy procedure and 0 when the firm survives. The explanatory variables used to design corporate failure prediction models are often financial ratios, which measure the relationship between two items on the financial statements.

Since the pioneering studies of Beaver (1966) and Altman (1968), who documented the predictive power of ratio analysis, many techniques have been employed to develop corporate failure prediction models, including statistical and artificial intelligence methods (Veganzones & Severin, 2020; Kumar & Ravi, 2007; Moula et al., 2017). On the one hand, researchers still employ well-known statistical methods, notably linear discriminant analysis and logistic regression, due to their simplicity and interpretability, even though they are clearly outperformed by machine learning techniques. On the other hand, artificial intelligence techniques (e.g., support vector machines, decision trees, neural networks, fuzzy set theory, self-organizing maps) have become indispensable tools in the field of corporate failure prediction, especially in this era of advanced informatics and computing technology (Abedin et al., 2021). Their superiority lies in the fact that they learn directly from the data, which makes it possible to model complex data using nonlinear approaches, and therefore their predictions are more reliable. Nonetheless, these methods are not free of drawbacks: low learning rates, slow computation, convergence to local minima, etc. (Yu et al., 2014; Abedin et al., 2018), which can make corporate failure prediction time consuming and arduous.

To overcome these drawbacks, we consider a novel prediction method, the Extreme Learning Machine (ELM) (Huang et al., 2006a), to predict corporate failure. There are several reasons for choosing ELM as the classifier for the prediction of corporate failures. Firstly, despite the many existing methodologies for predicting corporate failure, researchers and practitioners should continually explore new methods. Secondly, the main concept behind ELM is the random initialization of a Single-Layer Feed-forward Neural Network (SLFN), which replaces the computationally costly procedure of training the hidden layer performed by other artificial intelligence techniques. Unlike those techniques, it does not need to calibrate parameters such as the learning rate. For this reason, ELM combines good performance with an extremely fast learning speed (Akusok et al., 2015), and it is proven to be a universal approximator given enough hidden neurons (Huang et al., 2006b).

However, like other techniques, ELM has a main drawback: the random initialization that makes ELM extremely fast also makes it a highly unstable classifier. Even when ELM is trained on the same training sample several times, it performs differently due to the random initialization of the biases and weights between the input and hidden nodes. Although reliance on a single ELM may be misguided, an ensemble of predictions might improve the generalization performance of the ELM. Indeed, ensemble methods are commonly used to improve the accuracy of a learning algorithm by constructing and combining a set of weak classifiers (Kim & Kang, 2010; Abedin et al., 2022). This rationale motivates our study of the performance of ensemble extreme learning machines for predicting corporate failure.

Consequently, the aim of the current work is to examine which ensemble procedure best improves the performance of ELM for corporate failure prediction. This question matters because the diversity generation method is key in the process of creating an ensemble of classifiers. According to Rokach (2010), diversity can be created in several ways: by manipulating the training sample, by manipulating the inducer, by varying the representation of the target attribute, and by changing the search space. Of all possible ensemble techniques, we selected four based on their popularity in the literature (Verikas et al., 2010): multiple classifiers, Bagging, Boosting, and Random Subspace. The fact that the chosen techniques rely on different ensemble procedures might provide further insight into which general characteristics of ensemble techniques are influenced by the base classifier. In turn, a rigorous study of such methods would assist in designing a corporate failure model based on ensemble ELM. Furthermore, the best-performing ensemble ELM model can serve as a baseline prediction model for future research.

The rest of the paper is organized as follows. Section 2 presents the research methodology. Sections 3 and 4 describe the experimental design and results, respectively. Finally, in Sect. 5, the conclusions are summarized.

2 Research Methodology

In this section, we present the method employed in this study. In particular, we describe the extreme learning machine classifier as well as the ensemble modeling techniques.

2.1 Extreme Learning Machine

The Extreme Learning Machine (ELM) classifier was proposed by Huang et al. (2006a). ELM is a fast way of creating a Single-Layer Feed-forward Neural Network (SLFN) by randomly initializing the internal biases and weights. The hidden layer does not need to be iteratively tuned, which bypasses the time-consuming calibration performed by other artificial intelligence algorithms. As a result, ELM achieves an extremely fast learning speed while remaining a simple method. The ELM algorithm can be described as follows:

Consider a set of N observations with features \( {\mathbf{x}}_i\in {\mathbb{R}}^d \) and the corresponding output labels \( \boldsymbol{Y}\in {\left\{-1,1\right\}}^{N\times c} \). An SLFN with m neurons in the hidden layer is written as the following sum:

$$ \sum_{j=1}^{m}{\beta}_{jk}\,\phi \left({\mathbf{w}}_j\cdot {\mathbf{x}}_i+{b}_j\right)={Y}_{ik},\quad i=1,\dots, N,\ k=1,\dots, c, $$
(1)

where βj are the output weights, ϕ is the activation function, wj are the input weights, and bj are the biases. Equation (1) can be expressed in matrix form as Hβ = Y, where

$$ \boldsymbol{H}=\begin{pmatrix}\phi \left({\mathbf{w}}_1\cdot {\mathbf{x}}_1+{b}_1\right)& \cdots & \phi \left({\mathbf{w}}_m\cdot {\mathbf{x}}_1+{b}_m\right)\\ \vdots & \ddots & \vdots \\ \phi \left({\mathbf{w}}_1\cdot {\mathbf{x}}_N+{b}_1\right)& \cdots & \phi \left({\mathbf{w}}_m\cdot {\mathbf{x}}_N+{b}_m\right)\end{pmatrix}. $$
(2)
$$ \boldsymbol{\beta} ={\left({\beta}_1\ \cdots\ {\beta}_m\right)}^{T},\qquad \boldsymbol{Y}={\left({Y}_1\ \cdots\ {Y}_N\right)}^{T}. $$

Then, the output weights β can be calculated by the ordinary least squares method using the Moore–Penrose pseudoinverse of H (Rao & Mitra, 1971):

$$ \boldsymbol{\beta} ={\mathbf{H}}^{\dagger}\mathbf{Y}. $$
(3)
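To make the procedure concrete, the following minimal numpy sketch implements Eqs. (1)–(3) for the single-output case, assuming a tanh activation; the helper names elm_train and elm_predict are ours, not from Huang et al. (2006a).

import numpy as np

def elm_train(X, y, m, rng):
    """Fit an ELM with m hidden neurons; returns the random layer and beta."""
    d = X.shape[1]
    W = rng.normal(size=(d, m))      # random input weights, never iteratively tuned
    b = rng.normal(size=m)           # random biases
    H = np.tanh(X @ W + b)           # hidden-layer output matrix H (Eq. 2)
    beta = np.linalg.pinv(H) @ y     # Moore-Penrose least-squares solution (Eq. 3)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Raw network output; np.sign() of it gives the class in {-1, +1}."""
    return np.tanh(X @ W + b) @ beta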

2.2 Ensemble Techniques

2.2.1 Multiple Classifiers Technique

The multiple classifier technique relies on the simple idea that combining multiple classifiers leads to higher classification accuracy and efficiency than a single classifier. This approach is equivalent to the wisdom of crowds: the combined opinion of diverse and independent experts usually outperforms the opinion of a single individual. According to Kittler et al. (1998), the multiple classifier technique achieves higher efficiency when the learners generalize in different ways, i.e., when diversity is generated within the ensemble. As ELM is based on the random initialization of internal biases and weights, each learner will be different, so there is diversity in the ensemble. Therefore, the forecasts of several ELMs are combined using majority voting to produce the final decision rule. Figure 1 shows the general architecture of the multiple classifier.

The classifiers C1(X), …, CM(X) are built from the data set {(x1, y1), (x2, y2), …, (xn, yn)}. Each classifier provides an output \( {\hat{y}}_m \) that is combined with the others into the final output \( \hat{y} \).

Fig. 1 Architecture of the multiple classifier. The input X is fed to M parallel learners ELM1, …, ELMM; their outputs ŷ1, …, ŷM are combined into the final output ŷ
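As an illustration, a multiple-ELM ensemble can be sketched as follows, reusing the elm_train and elm_predict helpers from Sect. 2.1; the ensemble size M and the number of hidden neurons m are illustrative choices, not values prescribed by the method.

import numpy as np

def multiple_elm_predict(X_train, y_train, X_test, M=50, m=100, seed=0):
    """Majority vote over M independently initialized ELMs (Fig. 1)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(M):
        # Diversity comes only from the random initialization inside elm_train.
        W, b, beta = elm_train(X_train, y_train, m, rng)
        votes += np.sign(elm_predict(X_test, W, b, beta))
    return np.sign(votes)  # labels in {-1, +1}; a 0 means an exact tie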

2.2.2 Bagging

Bagging (short for bootstrap aggregating) is one of the earliest ensemble techniques (Breiman, 1996). Its popularity lies in the fact that it is intuitive and simple to implement, with notably good performance. Bagging generates the diversity essential to the ensemble process by manipulating the training set. In this regard, the training set is randomly resampled in order to generate several different bags of samples, each bag representing a set of training samples. Finally, the base classifier is applied to each bag, and the output classification is made by a majority vote over all the base classifier results.

The bagging technique improves generalization performance due to the reduction in variance while keeping the bias steady or only slightly increasing it, in particular when it is applied to weak classifiers (Grandvalet, 2004). The bagging algorithm can be expressed as follows:

Given a data set {(x1, y1), (x2, y2), …, (xn, yn)}:

1. Repeat for i = 1, 2, …, I:

   (a) Build a bootstrap sample \( \left\{\left({\boldsymbol{x}}_1^{\ast },{y}_1^{\ast}\right),\left({\boldsymbol{x}}_2^{\ast },{y}_2^{\ast}\right),\dots, \left({\boldsymbol{x}}_n^{\ast },{y}_n^{\ast}\right)\right\} \) by randomly selecting n times with replacement from the data {(x1, y1), (x2, y2), …, (xn, yn)}.

   (b) Fit the classifier Ci on the corresponding bootstrap sample.

2. Calculate the output of the final classifier:

$$ \boldsymbol{C}\left(\boldsymbol{x}\right)={I}^{-1}\sum_{i=1}^{I}{C}_i(x). $$
(4)
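A sketch of this procedure with ELM base learners, again reusing the helpers from Sect. 2.1 (the parameters I and m are illustrative assumptions):

import numpy as np

def bagging_elm_predict(X_train, y_train, X_test, I=50, m=100, seed=0):
    """Bagging: train I ELMs on bootstrap resamples and combine them (Eq. 4)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    scores = np.zeros(len(X_test))
    for _ in range(I):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement (one "bag")
        W, b, beta = elm_train(X_train[idx], y_train[idx], m, rng)
        scores += np.sign(elm_predict(X_test, W, b, beta))
    return np.sign(scores / I)  # sign of the average vote, i.e., majority voting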

2.2.3 Boosting

Unlike the bagging technique, the boosting technique combines inaccurate and relatively weak rules to produce highly accurate predictions. That is, it progressively gives more weight to observations that have been misclassified by previously generated classifiers in order to generate new classifiers, and then combines the classifiers of the different iterations with weighted voting to make the final predictions. Since numerous boosting algorithms have been proposed, we use AdaBoost (Freund & Schapire, 1996), one of the most popular boosting techniques applied to pattern recognition (Verikas et al., 2010). The AdaBoost algorithm can be described as follows:

Given a data set {(x1, y1), (x2, y2), …, (xn, yn)}:

1. Initialize the weight vector of the training set:

$$ {W}_1(i)=\frac{1}{N}\quad \mathrm{for}\ i=1,\dots, N. $$
(5)

2. For t = 1, …, T:

   (a) Train the weak classifier Ct on the weighted training samples.

   (b) Calculate the sum of the weights of the observations misclassified by Ct:

$$ {\varepsilon}_t=\sum_{i=1}^{N}{W}_i^t\,\mathbb{1}\left[{Y}_i\ne {C}_t\left({X}_i\right)\right]. $$
(6)

   (c) Choose

$$ {\alpha}_t=\frac{1}{2}\ln \left(\frac{1-{\varepsilon}_t}{\varepsilon_t}\right). $$
(7)

   (d) Update the weights:

$$ {W}_i^{t+1}=\frac{W_i^t\exp \left(-{\alpha}_t{Y}_i{C}_t\left({X}_i\right)\right)}{Z_t}, $$
(8)

where Zt is a normalization factor.

3. Output:

$$ f(x)=\operatorname{sign}\left(\sum_{t=1}^{T}{\alpha}_t{C}_t(x)\right). $$
(9)
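A sketch of AdaBoost with ELM base learners follows. Since the basic ELM solves an unweighted least-squares problem, we approximate training on weighted samples by weighted resampling; this is an implementation assumption, not part of the original algorithm.

import numpy as np

def adaboost_elm_predict(X_train, y_train, X_test, T=50, m=100, seed=0):
    """AdaBoost (Eqs. 5-9) with ELM base learners; labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    w = np.full(n, 1.0 / n)                        # Eq. (5)
    f_test = np.zeros(len(X_test))
    for _ in range(T):
        idx = rng.choice(n, size=n, p=w)           # weighted resampling stands in
        W, b, beta = elm_train(X_train[idx], y_train[idx], m, rng)  # for weighted training
        pred = np.sign(elm_predict(X_train, W, b, beta))
        eps = w[pred != y_train].sum()             # Eq. (6)
        if eps == 0 or eps >= 0.5:                 # stop on a perfect or useless learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # Eq. (7)
        w *= np.exp(-alpha * y_train * pred)       # Eq. (8), numerator
        w /= w.sum()                               # normalization factor Z_t
        f_test += alpha * np.sign(elm_predict(X_test, W, b, beta))
    return np.sign(f_test)                         # Eq. (9)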

2.2.4 Random Subspace

The random subspace method (Ho, 1998) bases its ensemble process on the modification of the feature space. That is, it creates different bags of training samples by randomly selecting features drawn from the initial feature set that characterizes each sample. Each training sample Xi (i = 1, …, n) in the training set X = (X1, X2, …, Xn) is a p-dimensional vector Xi = (xi1, xi2, …, xip), where p is the number of features. In the random subspace method, a k-dimensional subspace is randomly selected from the original p-dimensional feature space, k < p. The new learning samples \( {\boldsymbol{X}}^b=\left({\boldsymbol{X}}_1^{b},{\boldsymbol{X}}_2^{b},\dots, {\boldsymbol{X}}_n^{b}\right) \) in the k-dimensional subspace, with \( {\boldsymbol{X}}_i^{b}=\left({x}_{i1}^{b},{x}_{i2}^{b},\dots, {x}_{ik}^{b}\right) \) and \( {x}_{ij}^{b}\ \left(j=1,\dots, k\right) \), are built, and then the classifiers trained on the random subspaces Xb are combined using majority voting to create the final decision rule. Thus, the random subspace method can be organized as follows:

1. Repeat for b = 1, 2, …, B:

   (a) Randomly select a k-dimensional subspace Xb from the initial p-dimensional feature space X.

   (b) Design a classifier Cb(x) using the sample Xb.

2. Combine the forecasts of the Cb(x) classifiers using majority voting into a final decision rule:

$$ \mathrm{Prev}(x)=\underset{y\in \left\{-1,1\right\}}{\operatorname{argmax}}\sum_{b=1}^{B}{\delta}_{\operatorname{sgn}\left({C}^{b}\left(x\right)\right),\,y}. $$
(10)
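The following sketch illustrates the procedure with ELM base learners; the subspace size k and ensemble size B are illustrative assumptions.

import numpy as np

def random_subspace_elm_predict(X_train, y_train, X_test, B=50, k=10, m=100, seed=0):
    """Random subspace: each ELM sees only k of the p features (Eq. 10)."""
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    votes = np.zeros(len(X_test))
    for _ in range(B):
        feats = rng.choice(p, size=k, replace=False)  # a random k-dimensional subspace
        W, b, beta = elm_train(X_train[:, feats], y_train, m, rng)
        votes += np.sign(elm_predict(X_test[:, feats], W, b, beta))
    return np.sign(votes)  # majority vote over the B subspace classifiers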

3 Experimental Design

3.1 Data

Our empirical study uses non-listed French firms taken from the Diane database created by Bureau Van Dijk. Under French law, French companies must submit annual reports to the French Commercial Court, and these accounting and income statements are collected by Bureau Van Dijk. We drew firms from all sectors of activity (excluding financial companies) for the years 2016–2018, allowing us to examine the models’ capacity to create good prediction rules in a real-world scenario.

The Diane database indicates whether firms have failed or remain healthy; in the case of failure, it also provides the date. A firm is considered failed if it entered liquidation or reorganization proceedings, and non-failed firms are those that continued their activity for at least a year after the period studied. We decided to be conservative in the selection of non-failed firms in order to avoid including apparently healthy companies that may suddenly fail and to ensure a reliable sample of surviving firms. Moreover, firms that presented missing values in their financial statements, as well as outliers, were excluded to ensure the stability of the prediction models. Consequently, the collected dataset is composed of 3000 failed and 3000 non-failed firms.Footnote 1

To minimize the bias and sample variability that might influence the prediction performance of the models, we carried out tenfold cross-validation, in which the dataset is split into ten distinct training and test sets in order to train and evaluate the models. This procedure was repeated ten times to ensure the reliability of our results. Therefore, the final prediction performance is calculated as the average of 100 testing results.
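In Python, this evaluation loop could be sketched as follows, assuming the sample is held in arrays X and y with labels in {-1, +1} and reusing the ELM helpers from Sect. 2.1; the stratification and the random seeds are our assumptions, not details reported above.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
accuracies = []
for train_idx, test_idx in cv.split(X, y):
    rng = np.random.default_rng(0)
    W, b, beta = elm_train(X[train_idx], y[train_idx], m=100, rng=rng)
    pred = np.sign(elm_predict(X[test_idx], W, b, beta))
    accuracies.append(np.mean(pred == y[test_idx]))
print(np.mean(accuracies))  # average of the 100 testing results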

3.2 Variables

Financial dimensions characterize the main explanatory factors of corporate failure. Therefore, the balance sheets and income statements of the collected firms were used to calculate 30 financial ratios to use as explanatory variables. This representation layer is important because it ensures that the variables we use actually represent all aspects of the phenomenon.

The initial set of financial ratios that we compute includes at least four indicators for each of six categories: liquidity, solvency, profitability, financial structure, turnover, and activity. These variables are presented in Table 1.

Table 1 Initial set of variables

However, using all the financial ratios may result in a very high-dimensional feature space, which may reduce the models’ predictive capability. Therefore, a variable selection process was performed in order to choose a subset of the most relevant financial ratios. Following the study by Kainulainen et al. (2011), a forward variable selection process was performed to retain the information necessary for prediction.
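A generic greedy forward selection loop can be sketched as below; evaluate is a user-supplied scoring function (e.g., cross-validated ELM accuracy), and the stopping rule is an illustrative assumption rather than the exact procedure of Kainulainen et al. (2011).

import numpy as np

def forward_selection(X, y, evaluate, max_vars=10):
    """Greedily add the financial ratio that most improves the score."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_vars:
        score, j = max((evaluate(X[:, selected + [j]], y), j) for j in remaining)
        if score <= best_score:        # stop when no remaining ratio helps
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected                    # column indices of the retained ratios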

3.3 Evaluation Metrics

The evaluation criteria of our experiments are adopted from standard measures established in the field of prediction (Shahriare et al., 2021). These measures include average accuracy, type-I error, and type-II error. The formulas of these measures, provided below, can be explained with respect to the confusion matrix shown in Table 2.

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}}, $$
(11)
$$ \mathrm{Type}-\mathrm{I}\ \mathrm{error}=\frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}}, $$
(12)
$$ \mathrm{Type}-\mathrm{II}\ \mathrm{error}=\frac{\mathrm{FP}}{\mathrm{TN}+\mathrm{FP}}. $$
(13)
Table 2 Confusion matrix for the prediction of corporate failure

In addition to these evaluation metrics, we also use the area under the receiver operating characteristic curve (AUC) to estimate model performance. The ROC curve is a graphical plot that represents model performance as the cutoff value changes: the false positive rate is plotted on the x-axis and the true positive rate on the y-axis. AUC has become a widely used evaluation metric in corporate failure prediction because it is insensitive to the matrix of misclassification costsFootnote 2 when assessing the discrimination ability of a model. In summary, two classifiers can easily be compared according to differences in their ROC curves: a classifier’s curve should get as close to the top left corner as possible, where its AUC will be close to 1.
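For concreteness, the four measures can be computed from predicted labels and continuous scores as sketched below, treating failed firms (coded +1) as the positive class; the helper name is ours.

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluation_metrics(y_true, y_pred, y_score):
    """Accuracy, type-I/type-II errors (Eqs. 11-13) and AUC; failed = +1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    type_i = fn / (tp + fn)    # failed firm classified as healthy
    type_ii = fp / (tn + fp)   # healthy firm classified as failed
    return accuracy, type_i, type_ii, roc_auc_score(y_true, y_score)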

With the data set mentioned above, a cross-validation loop (tenfold cross-validation repeated ten times) was performed to estimate the average evaluation measures. To compare classifier performance, Demšar (2006) recommends the Wilcoxon signed-ranks non-parametric test because it only assumes limited commensurability and can be applied to prediction accuracy, misclassification errors, or any other evaluation metric. It is expressed as follows:

Let R+ be the sum of ranks for the comparisons in which the second classifier outperforms the first one and R− the sum of ranks for the opposite, with the ranks of di = 0 split evenly among the two sums:

$$ {R}^{+}=\sum \limits_{d_i>0}\operatorname{rank}\left({d}_i\right)+\frac{1}{2}\sum \limits_{d_i=0}\operatorname{rank}\left({d}_i\right), $$
(14)
$$ {R}^{-}=\sum \limits_{d_i<0}\operatorname{rank}\left({d}_i\right)+\frac{1}{2}\sum \limits_{d_i=0}\operatorname{rank}\left({d}_i\right). $$
(15)

Let T be the smaller of the two sums, T = min(R+, R−); then the normal approximation can be used, and the following statistic gives a z-statistic with a corresponding p-value:

$$ z=\frac{T-\frac{n\left(n+1\right)}{4}}{\sqrt{\frac{n\left(n+1\right)\left(2n+1\right)}{24}}}. $$
(16)
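In practice, this test is available in SciPy; note that its default zero_method='wilcox' discards zero differences instead of splitting their ranks as in Eqs. (14) and (15), a minor deviation from the formulation above.

from scipy.stats import wilcoxon

# acc_a, acc_b: paired per-fold accuracies (100 values each) of two classifiers
stat, p_value = wilcoxon(acc_a, acc_b)  # two-sided test on the paired differences
print(f"T = {stat}, p = {p_value:.4f}")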

However, García and Herrera (2008) caution that conducting several repeated pairwise comparison tests between algorithms may lead to a loss of control over the family-wise error rate.

4 Results

The experimental analysis is designed to compare the prediction ability of different ensemble methods based on the extreme learning machine classifier. Table 3 reports the evaluation metrics used to assess the performance of the methods. This table is complemented by Table 4, which indicates whether the differences between the methods are statistically significant.Footnote 3

We first analyze the overall performance of the methods. Boosting ELM and Bagging ELM achieve the best mean accuracy values, 82.2% and 82.6%, respectively, while Random Subspace ELM attains a mean accuracy of 81.7% and Multiple ELM one of 81.4%. All ensemble methods are more accurate than the single ELM (mean accuracy of 80.4%). This confirms that ensemble ELM methods produce greater predictive power than a single ELM classifier. The fact that the Bagging and Boosting ensembles lead to the best reduction in the generalization error is not entirely surprising, as their robustness to overfitting is well documented (Xiao et al., 2013; González et al., 2020). In contrast, varying the parameters of the classifiers, as the Multiple ensemble and Random Subspace do, can generate greater diversity (Bi, 2012). Nonetheless, the information captured by this varying diversity does not provide consistent enough guidance for the ensemble classifier to generalize well. On the whole, the key to Boosting and Bagging is that they build a set of diverse classifiers while benefiting from the balance between diversity and accuracy, which is an important determinant of the performance of ensemble classifiers.

Secondly, we find no uniform improvement across the ensemble methods. If the misclassification errors are analyzed, Boosting ELM and Bagging ELM, here as well, lead to lower misclassification errors for failed firms, 18.8% and 18.2%, respectively, significant at the 1% level in comparison with the single ELM. In contrast, we do not observe any significant differences in the misclassification error for non-failed firms across the ensemble methods; the mean type-II error ranges from 16.5% with Bagging ELM and Random Subspace ELM to 18.8% with Boosting ELM.

Finally, the Bagging and Boosting ELM-based methods lead to higher AUC values than the other ensemble methods, which is in line with the previous results. In particular, Bagging ELM seems to be the optimal ensemble method for corporate failure prediction, as its results are significantly better than those achieved with the other ensemble methods, except with respect to Boosting ELM.

Table 3 Performance of different ELM-based ensemble methods

In sum, the better overall prediction of the Bagging and Boosting methods over the other ensemble methods, as previously observed, is due to their capacity to better identify failed firms. The superiority of Bagging ELM is based on the creation of a unique training set for each ensemble member, because the perturbation generated in the learning set causes a significant change in the constructed predictor. As the model’s predictions are order-correct for most of the replicated observations, the bagging-based ELM can become a nearly optimal predictor, in particular for failed firms. Furthermore, one of the major reasons why boosted ELM better identifies failed firms may be that each newly generated classifier gives more relevance to misclassified observations, mostly failed firms. That is, the weight of instances that have been misclassified by the previously generated classifiers increases, and the set of classifiers grows progressively more diverse. This trend explains why this method provides higher accuracy for the minority class without jeopardizing the accuracy of the majority class.

Table 4 Significance levels of a test of differences by method and evaluation metric

4.1 Further Validation

In order to further evaluate the effectiveness of the ensemble extreme learning machine for the corporate failure prediction task, a new data set was collected. In general, there is no universally accepted definition of corporate failure; bankruptcy, the most severe form of failure, is commonly used. The popularity of bankruptcy as the definition of failure is based on two points: on the one hand, it provides an objective criterion to distinguish failed from non-failed firms, and, on the other hand, the moment of failure can be dated to when a firm files for bankruptcy. Therefore, the notion of bankruptcy offers a discrimination criterion for obtaining a well-defined dichotomy, or at least a representation of corporate failure, that can be applied methodologically. Nonetheless, numerous studies (Sun et al., 2014; Brédart et al., 2021) consider that corporate failure begins when a firm experiences financial distress, that is, when a firm encounters financial difficulties or struggles to fulfill its obligations. Accordingly, we collected a data set using financial distress as the definition of corporate failure. We adopt the criterion provided by Balcaen et al. (2011), who define a financially distressed firm as one with negative recurring profit after taxes over two consecutive years. Consequently, the collected dataset is composed of 2500 failed and 2500 non-failed firms.Footnote 4

The results presented in Tables 5 and 6 are consistent with the previous ones. Boosting ELM and Bagging ELM achieve the highest accuracy values, in particular due to their effectiveness in reducing the type-I error in comparison with the single ELM.Footnote 5 Moreover, it is important to mention that the prediction performance of the methods on this data set is inferior to that on the previous one. Thus, it is more arduous to differentiate failed firms from healthy ones in the initial stages of failure, when firms are just experiencing financial distress. The literature documents that some firms show resilience for a long time, even though their financial situation resembles that of a bankrupt firm (Iftikhar et al., 2021). In contrast, firms that seem completely sound may suddenly fail. Therefore, the inability to know whether the echoes of financial distress will result in corporate failure makes it difficult to capture the distinguishing factors that might reinforce model accuracy. That is why the performance of the models is lower when corporate failure is represented as financial distress than when it is defined as bankruptcy.

Table 5 Performance of different prediction methods

5 Conclusion

In this study, we evaluate several ensemble methods applied to corporate failure prediction in order to improve the classification performance of ELM. An ensemble strategy that combines the predictions of individual models performs better than relying on the prediction capacity of a single model. Our results, based on two real financial datasets, confirm that Extreme Learning Machine-based ensembles are more accurate and robust than the “individual best” ELM model. In particular, the ensemble methods used in this study increase, on average, the classification accuracy of the single ELM by 1.6 and 2.1 percentage points for the bankruptcy data and the financial distress data, respectively. An increase in prediction performance of this magnitude may seem modest, but financial institutions and banks can save a huge amount of their limited financial resources with a decision technology that increases prediction power by 2 percentage points.

Table 6 Significance levels of a test of differences by method and evaluation metric

As Bagging ELM and Boosting ELM give similar results – there is some evidence that the bagging strategy is more effective for the prediction of corporate failure using ELM – it is arduous to recommend one method as optimal. However, we notice that both methods, which operate by taking a base learner and invoking it multiple times on different training sets, are the most effective ensemble ELM prediction methods. We also notice that bagged ELM is more computationally efficient, as it requires 40–50 ensemble members, while 60–70 members are necessary for the boosting ensemble.