1 Introduction

From 2005 onwards, credit risk forecasting and bankruptcy prediction have been among the most important and interesting topics in the modern economic and financial field. Quantitative methods, however, have long been applied to predicting the bankruptcy event. Beaver [5] in 1966 first applied discriminant analysis; Altman [1] in 1968 then developed the well-known Z-score. Later, Ohlson [28] in 1980 used logistic regression, which has since become the most widely applied model in the credit scoring field. Subsequently, in 1992 Narain [27] approached the problem via survival analysis, examining the timing of failure instead of simply considering whether or not an event occurred within a fixed interval of time; since then, Cox's semi-parametric proportional hazards model and its extensions have been extensively proposed and adopted in the economic, banking and financial fields [4, 6, 9, 20, 30, 38, 39].

However, whichever model is applied, one major challenge in constructing predictive failure models, as widely stated in the literature [2, 3, 7, 8, 15, 16, 17, 18, 19], is the effective selection of the most relevant variables from among those collected because of their perceived importance or widespread use.

Besides the problem of correlations between variables, which may affect the discriminant ability of a risk model [24], a crucial point remains the procedure chosen for making the selection [13, 45]. Beyond traditional methods such as backward, forward and stepwise selection, and criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), newer approaches known as penalty-driven methods (the Least Absolute Shrinkage and Selection Operator (LASSO), the Smoothly Clipped Absolute Deviation (SCAD) and the bridge estimator) [21, 41, 42, 43, 44] and machine learning techniques (decision trees and neural networks) [11, 23, 25, 40] have become prominent. Moreover, the increased availability of high-dimensional data, whose processing time may otherwise be limiting, has led to the development of new high-performance procedures employing tools that can take advantage of parallel processing [37].

In the present paper, based on an application to economic data, we try to answer the following research questions: (1) do different variable selection methods (standard, modern, and those taking advantage of parallel processing) lead to the same choice of variables; and (2) which method is better for predicting the future state of a firm?

The paper is structured in the following way: Sect. 2 presents the methodology that will be applied; Sect. 3 gives a brief description of the data; results of the analysis are shown in Sect. 4; and Sect. 5 presents the conclusions of the investigation.

2 Methodology and Study Design

The primary purpose of this paper is to apply different techniques to select significant variables for predictive purposes, using binary logistic regression as the quantitative method. While acknowledging that different causes may lead to the end of a firm's life, that alternative variables may influence these various events, and that the same variables may even have opposite effects (see [10] and [31]), a single adverse event, bankruptcy, was studied. The bias in the intercept coefficient of the logistic model caused by the relative lack of data on rare events [22] was overcome by applying one of the available solutions, which we have previously applied in statistical analysis [32]: a balanced data set was built by randomly selecting, for each bankrupt firm, four controls (firms that did not fail). Training and holdout samples were built to develop and test the models, respectively. The variables selected as relevant by each method were used as explanatory variables in a logistic model, with the Wald test applied to decide whether a candidate variable should be included and the p-value cutoff set at 0.05. Each model's adequacy and predictive capability were assessed on the holdout sample by measuring the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC).
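As an illustration, the sampling design and the refit-and-test step just described might be sketched in SAS as follows. The data set name firms, the 0/1 event flag bankrupt, the seed and the index names used in the refit are hypothetical placeholders rather than the authors' actual choices.

  /* Draw four active controls for each bankrupt firm. */
  proc sql noprint;
    select 4*sum(bankrupt) into :n_ctrl from firms;
  quit;

  proc surveyselect data=firms(where=(bankrupt=0)) out=controls
                    method=srs sampsize=&n_ctrl seed=2020;
  run;

  data balanced;
    set firms(where=(bankrupt=1)) controls;
  run;

  /* Split the balanced set into training (80%) and holdout (20%) samples. */
  proc surveyselect data=balanced out=split samprate=0.8 outall seed=2020;
  run;

  data train holdout;
    set split;
    if Selected then output train;
    else output holdout;
  run;

  /* Refit the variables retained by a given method (Wald tests are part of
     the default output) and measure the AUC on the holdout sample. */
  proc logistic data=train;
    model bankrupt(event='1') = ind042 ind021 ind085;
    score data=holdout out=scored fitstat;
  run;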

Three parametric methods (forward stepwise selection, the LASSO, and Maximum Data Variance (MDV) explained [36]) and two non-parametric methods (a single decision tree and a forest of trees) were applied and compared, taking into account the number of selected variables and the AUC value in the holdout sample.

Focusing attention on SAS® software, which provides both standard and high-performance (HP) procedures running in either single-machine or distributed mode, the following procedures were called upon: LOGISTIC [33], to apply forward stepwise selection and to run and test all the logistic models; GLMSELECT [34], specifying the logit link, to perform LASSO selection following the Efron et al. implementation [14]; HPREDUCE [37], to identify variables that jointly explain the maximum amount of data variance; and HPSPLIT [35] and HPFOREST [37], to build a single tree and a forest of trees, respectively.
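Minimal call sketches for the forward stepwise and MDV selections follow; variable names are again hypothetical, and the option values shown (entry/stay levels, the cap on selected effects) are illustrative assumptions rather than the authors' exact settings.

  /* Forward stepwise selection with the Wald-based 0.05 cutoffs. */
  proc logistic data=train;
    model bankrupt(event='1') = age ind001-ind037
          / selection=stepwise slentry=0.05 slstay=0.05;
  run;

  /* Variance-based reduction: select the effects that jointly explain
     the largest share of the data variance. */
  proc hpreduce data=train;
    reduce unsupervised age ind001-ind037 / maxeffects=13;
  run;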

3 Data Description

The data used in this study were extracted from Orbis [29], a global company database compiled by Bureau Van Dijk, one of the major publishers of business information. Orbis combines private company data with software for searching and analysing over 400 million companies.

The sample employed in the present analyses consists of 37,875 Italian firms operating in the manufacturing sector from 2000 to 2018. For each firm, the financial data for the last available year, its legal form, current legal status and geographical location were extracted. Following the classification of company status available in the Orbis database, three main categories of inactivity were identified: closure, liquidation and bankruptcy (Table 1). As indicated in the Introduction, only one of the adverse events, bankruptcy, was taken into account and, owing to its rarity (8.74%), a balanced data set was built by randomly choosing four controls (active firms) for each event (bankrupt firm). The resulting data (16,560 observations) were then split at random into a training sample (80% of the total, 13,095 observations) and a holdout sample (20% of the total, 3465 observations) in order to develop and test the models on independent samples.

Table 1 Firms’ distribution by status

The distribution of firms in the training data set by geographical area (Table 2) shows an increasing percentage of defaulting firms going from the North (18%) to the South (28%). Moreover, private limited companies (21%) seem to be more prone to the adverse event (Table 3).

Table 2 Distribution of firms in the training set, by geographical area
Table 3 Distribution of firms in the training set, by legal form (LC = limited company)

For each firm, indexes or ratios representative of its economic and financial situation were constructed, taking into account both their perceived importance and widespread use in the literature [1, 5, 12, 26] and the availability of the information required for their calculation. Correlation problems were addressed by retaining only one ratio from any group with pairwise correlations higher than 0.70. Finally, besides the firm's age, geographical area and legal form, 37 indexes were used (Table 4), including liquidity, solvency, profitability and operating efficiency ratios.
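A correlation screen of this kind might be sketched as follows (variable names hypothetical): PROC CORR writes the correlation matrix to a data set, the pairs with |r| > 0.70 are listed, and one member of each pair is then dropped.

  proc corr data=firms outp=corrmat noprint;
    var ind001-ind037;
  run;

  /* List the variable pairs whose correlation exceeds 0.70 in absolute value. */
  data highcorr;
    set corrmat(where=(_type_='CORR'));
    array r{*} ind001-ind037;
    do j = 1 to dim(r);
      if _name_ < vname(r{j}) and abs(r{j}) > 0.70 then do;
        var1 = _name_;
        var2 = vname(r{j});
        corr = r{j};
        output;
      end;
    end;
    keep var1 var2 corr;
  run;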

Table 4 Indexes evaluated as potential predictors of the bankruptcy event

4 Results

4.1 Stepwise, LASSO and Maximum Data Variance Selection Methods

The variable selection comparison between the stepwise, LASSO and maximum data variance (MDV) explained techniques shows good performance from all three methods. Although the best performance in the holdout sample was given by the LASSO (AUC = 0.8921), the AUC values under the other methods were extremely close (Table 5). The MDV method selected the smallest number of indexes (13), all of which were also identified by the other two techniques. As shown in Table 5, the three approaches agreed on more than 60% of the selected variables.

Table 5 Variable selection comparison among stepwise, LASSO and maximum data variance explained methods

The LASSO output from the GLMSELECT procedure includes detailed graphs as an aid to interpretation. Figure 1 shows the coefficient progression for the response variable: the names of the most important indexes affecting bankruptcy appear on the right-hand side, with those above the zero line increasing the probability of the event under study as their value increases and those below the zero line decreasing it. Coefficients corresponding to effects that are not in the selected model at a given step are zero and hence not visible. Figure 2, complementary to the previous graph, shows the progression of the criterion (SBC) used to choose among the examined models. The initial model includes only one index (ind042), then a second one (ind085) is added, and so on (Fig. 2). The procedure stops at the 20th step.
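A sketch of the corresponding GLMSELECT call follows, with hypothetical variable names and the binary flag treated as a numeric 0/1 response for the selection step; PLOTS= requests coefficient-progression and criterion panels of the kind shown in Figs. 1 and 2, and the retained variables are then refit in PROC LOGISTIC as described in Sect. 2.

  proc glmselect data=train plots=(coefficients criteria);
    model bankrupt = age ind001-ind037
          / selection=lasso(choose=sbc stop=none);
  run;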

Fig. 1
Coefficient progression for the response variable: output from the GLMSELECT procedure. Standardized coefficients are plotted against the selection step; ind084, ind021 and ind031 trend upwards, while ind001, ind085, ind058, ind083 and ind042 trend downwards.

Fig. 2
Effect sequence: output from the GLMSELECT procedure. The SBC criterion is plotted against the effect sequence, falling from about -10,000 for the intercept-only model, through about -13,000 at step 2 (+ind042) and -16,000 at step 9 (+ind079), to about -18,000 at step 20 (+ind011).

4.2 Single and Forest of Trees Methods

The two non-parametric approaches showed very similar results. The single tree and the forest of trees had 12 indexes in common, corresponding to 75% and 80%, respectively, of the variables each selected. Their performances in the holdout sample were virtually identical (Table 6). HPSPLIT plots provide a tool for selecting the parameters that result in the smallest estimated error (Fig. 3), together with a classification tree (Fig. 4) that uses colours to show where the higher percentages of firms in each state are found: blue for bankruptcy and pink for active.
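The two tree-based selections can be sketched as follows; data set and variable names are hypothetical, and the cross-validation folds, seed and number of trees are illustrative assumptions.

  /* Single classification tree, with the cost-complexity pruning
     parameter chosen by 10-fold cross-validation. */
  proc hpsplit data=train seed=2020 cvmethod=random(10);
    class bankrupt area legalform;
    model bankrupt = area legalform age ind001-ind037;
    prune costcomplexity;
  run;

  /* Forest of trees; variable importance is part of the output. */
  proc hpforest data=train maxtrees=100 seed=2020;
    target bankrupt / level=binary;
    input age ind001-ind037 / level=interval;
    input area legalform / level=nominal;
  run;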

Fig. 3
Cost-complexity analysis using cross-validation: PROC HPSPLIT output. The average misclassification rate decreases gradually from 0.20 with one leaf to a minimum of 0.128 at 44 leaves (cost-complexity parameter 0.0003), then rises to about 0.14 at 291 leaves and remains roughly constant.

Fig. 4
Classification tree: PROC HPSPLIT output. Each node splits into two child nodes, coloured by the predominant class (bankrupt or active).

Table 6 Variable selection comparison between the two non-parametric approaches

In Fig. 5, the subtree starting at node 0 shows important details of the indexes' values, namely the cut-offs at which observations are separated into new leaves.

Fig. 5
Subtree starting at node 0: PROC HPSPLIT output. Each node splits into two child nodes.

4.3 Comparison Between the Best Method of Each Group

Even though all the methods applied in this context led to very similar results, the best of each group (the LASSO and the single tree) was selected with the aim of making a more detailed comparison between a parametric and a non-parametric technique (Table 7). On the basis of the AUC value, the comparison showed a slight predominance of the former; however, the difference was extremely small (0.8921 against 0.8892). LASSO selected a slightly greater number of variables as predictors, most of which (14, or 73.68%) were in common with the single tree method. Table 8 shows the ratios they had in common.

Table 7 Variable selection comparison between the best method in each group
Table 8 Predictive variables in common between LASSO and single tree methods, in addition to Age and Legal Form. Increased values of variables above and below the horizontal line raise and reduce, respectively, the probability of bankruptcy

5 Discussion

Variable selection techniques were evaluated within two main groups of methods, and then the best of each group was compared further. The first group comprised the standard and widely used forward stepwise selection method, the LASSO technique, and a procedure that conducts a variance analysis and reduces dimensionality by selecting the variables that contribute the most to the overall variance of the data. Among these, the models refitted and tested through logistic regression showed very stable results. The AUC values in the holdout sample were very close, with differences only in the third decimal place. The most parsimonious selection came from the third method, which discarded variables included by both the stepwise and LASSO methods (Table 5) while achieving a comparable AUC value.

The non-parametric approaches showed very slight differences between the single tree and the forest methods. Again the differences lay in the third decimal place of the AUC (in the holdout sample), and the number of selected variables was almost the same, with most of them in common.

The final comparison, between the LASSO and single tree selection methods, highlighted that these different techniques led to models with high and stable predictive performance in the holdout sample, with a preference for the first method owing to its slightly higher AUC value (0.8921 against 0.8892) and its computational performance in terms of processing time (0.91 vs. 25.16 seconds). Moreover, the LASSO and single tree approaches selected almost the same predictive variables, with a smaller number selected by the second. In particular, both gave special relevance to the variable ind042, the ratio of Shareholders' Funds to Total Assets: both LASSO and the single tree selected it first, on the basis of the average square error and variable importance, respectively. This confirms the protection from bankruptcy provided by a strong corporate capital structure, while the credit situation (ind021) and debt exposure (ind060) may play an opposite role [31].

The SAS software procedures used (GLMSELECT and HPSPLIT) both provide very intuitive graphs, although the LASSO ones are perhaps easier to interpret for a wider, non-technical audience. On the other hand, HPSPLIT is a high-performance procedure that runs in either single-machine or distributed mode and can therefore take advantage of parallel processing.
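For reference, HP procedures accept a PERFORMANCE statement that controls this behaviour; the settings below are illustrative only, not the configuration used in this study.

  proc hpsplit data=train;
    /* NTHREADS= sets the number of threads on a single machine;
       on a connected grid, NODES= would distribute the computation. */
    performance nthreads=8 details;
    class bankrupt;
    model bankrupt = age ind001-ind037;
  run;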

Uniformity in the predictive capability of these selection methods may have been affected by data dimensionality; therefore, in the future the same procedures will be applied to a smaller data set. Future developments will also include the extension to multinomial logistic analysis.