Introduction

With the development of the global economy, a sound financial condition has become increasingly important for an enterprise that wants to survive and grow under fierce market competition. A company in better financial condition usually has more business opportunities. However, when a company runs into financial distress, it may suffer from problems such as poor profitability, high liabilities, and insufficient cash flow, which may disrupt its business operations and even lead to bankruptcy. Constructing an effective financial distress prediction (FDP) model therefore remains a widely studied topic, because predicting financial distress in advance and taking corresponding measures in time can help a company and its investors avoid great losses.

On the one hand, financial distress concept drift makes stationary FDP models unable to adapt to new sample data streams. In other words, stationary FDP models are not suitable for the dynamic operational environment of enterprises. Moreover, since financially distressed companies are usually far fewer than financially non-distressed companies, the data stream for dynamic FDP is class imbalanced rather than class balanced. On the other hand, the financial distress of enterprises in different industries may show different characteristics, so it is necessary to define the concept of financial distress from the view of a certain industry. Therefore, this study explores the dynamic prediction of relative financial distress based on an imbalanced data stream from the view of one industry. The processes of financial condition evaluation and relative FDP are dynamically integrated based on the financial data stream of a certain industry. Hence, this study not only provides an important tool for enterprises to make dynamic predictions of relative financial distress from the view of the industry, but also supplements the theoretical system of FDP.

Literature review

Concept of financial distress

Traditionally, financial distress refers to a certain kind of financial difficulty faced by an enterprise. Foster (1986) defines financial distress as a serious liquidity problem that cannot be resolved without large-scale restructuring of the operation or structure of the economic entity. Doumpos and Zopounidis (1999) consider that financial distress also includes the situation of negative net asset value. In the twenty-first century, some studies define the financial distress of listed companies according to stock exchange regulations. In the study of Rafiei et al. (2011), an Iranian company whose retained losses exceed 50% of its capital is labeled as financially distressed according to the commercial law of 141 Act of Tehran Stock Exchange. In Ding et al. (2008), Sun and Li (2012), and Geng et al. (2015), financial distress is defined by the criteria of the special treatment mechanism of China Stock Exchange. These concepts are absolute definitions of financial distress with fixed criteria: a company that satisfies the criteria is labeled as financially distressed, and otherwise it is labeled as financially healthy.

Sun et al. (2011) propose the definition of an enterprise’s relative financial distress, which is the relatively bad financial situation of a certain enterprise within the most recent time span. As an enterprise moves through its life cycle, the most recent time span moves forward, and the relative financial situation of the same time point may differ between time spans; this is called the longitudinal concept drift of enterprise financial distress. This definition of relative financial distress is also adopted by Sun et al. (2016).

Prediction of financial distress

Most research on FDP is based on absolute financial distress. Beaver (1966) applies univariate analysis based on financial ratios to bankruptcy prediction. Altman (1968) uses multiple discriminant analysis to propose the well-known Z-score model for bankruptcy prediction. Later, the logistic regression model is used for bankruptcy prediction (Ohlson 1980; Huang et al. 2012). More recently, Serrano-Cinca and Gutiérrez-Nieto (2013) apply partial least squares discriminant analysis to the prediction of American bank bankruptcy, an approach that is not restricted by multicollinearity among the independent variables.

The emergence of artificial intelligence and data mining techniques has promoted the development of FDP, and various single classifier algorithms have been applied to FDP. Frydman et al. (1985) use decision tree (DT) for bankruptcy prediction. In the 1990s, neural networks (NNs) were among the most widely used artificial intelligence methods for FDP, and many studies concluded that FDP based on NNs is more accurate than traditional statistical methods (Odom and Sharda 1990; Fletcher and Goss 1993; Carlos 1996; Zhang et al. 1999; Yang et al. 1999; Pendharkar 2005; Tseng and Hu 2010; Khashman 2011). After the support vector machine (SVM) was proposed (Vapnik 1998), it also came to be widely applied to FDP and demonstrated good generalization performance (Shin et al. 2005; Min and Lee 2005; Ding et al. 2008; Xie et al. 2011; Sun and Li 2012). In addition, other artificial intelligence approaches such as the genetic algorithm (Kim and Han 2003), case-based reasoning (Li and Sun 2009), and the rough set (McKee 2000; Bose 2006) have also been applied in FDP research.

In the recent decade, more and more studies have focused on classifier ensemble approaches for FDP, which combine the predictions of multiple base classifiers instead of relying on a single classifier (Zhang et al. 2011). Sun and Li (2008) put forward a classifier ensemble for FDP based on the weighted majority voting combination of different classifiers, and it outperforms the base classifiers. However, Tsai and Wu (2008) find that the NNs ensemble does not outperform the single best NNs classifier in many cases, possibly because the training dataset is too small. Alfaro et al. (2008) construct an NNs ensemble for FDP using the AdaBoost ensemble algorithm, and it has lower generalization error than the single NNs classifiers. Bagging and Boosting are the two most popular classifier ensemble algorithms, and Kim and Kang (2010) indicate that both can improve the performance of FDP based on NNs. Li and Sun (2011) propose the principal component case-based reasoning ensemble method for FDP and validate that it outperforms the best base model. Sun and Li (2012) train base SVM classifiers using different kernel functions and different feature selection methods to construct an SVM ensemble model for FDP. Kim and Upneja (2014) compare the AdaBoosted DT ensemble with the single DT classifier for predicting restaurant financial distress and find that the former outperforms the latter. Wang and Wu (2017) propose a business failure prediction model based on a two-stage selective ensemble with a manifold learning algorithm and a kernel-based fuzzy self-organizing map. Wang et al. (2018) incorporate sentiment and textual information into the ensemble random subspace method for FDP.

The above studies on FDP do not consider the concept drift of financial distress, that is, the variation of the financial distress concept in a changing environment as time goes on (Sun and Li 2011). When there is financial distress concept drift, FDP models trained on old sample data may become unsuitable for current FDP. Several studies have been carried out to handle financial distress concept drift. Sun and Li (2011) put forward a dynamic FDP model based on instance selection and time window. Sun et al. (2013) propose the adaptive and dynamic ensemble of SVM based on data batch combination. Sun et al. (2017) integrate sample time weighting with AdaBoost SVM ensemble, which can dynamically update the AdaBoost SVM ensemble FDP model. Li et al. (2017) propose a time-varying Malmquist DEA method. Liu and Wu (2017) put forward the incremental bagging based on selective ensemble and employ genetic algorithm to optimize the base classifier combination. These studies are based on time series of panel data batches. In contrast, some studies on dynamic FDP are based on the longitudinal financial data stream of a single company. For example, Sun et al. (2011) combine the ex-post evaluation of financial condition based on principal component analysis (PCA) and the ex-ante prediction of financial distress based on NNs optimized by genetic algorithm for dynamic FDP, and Sun et al. (2016) propose another approach for dynamic evaluation and prediction of financial distress based on entropy-based weighting, SVM, and an enterprise’s vertical sliding time window.

Theoretical concepts

Relative financial distress from the view of one industry

The relative financial distress in Sun et al. (2011, 2016) is the relative financial condition deterioration from the viewpoint of one enterprise. Such relative financial distress is not labeled by some concrete criteria. Instead, it is the result of comparing the financial conditions of different time points for an enterprise. The financial distress concept adopted in this study also belongs to relative financial distress. However, it is the result of comparing the financial conditions of different enterprises that belong to the same class of industry. Therefore, the relative financial distress concept in this study is defined from the view of one industry. Its definition is as follows: with the development of an enterprise and its industry, the comprehensive evaluation score of solvency, profitability, operating capacity, and growth ability for the enterprise becomes relatively worse in the industry.

Financial distress concept drift

Concept drift is generally known as changes in the target concept, and these changes are induced by changes in the hidden context (Schlimmer and Granger 1986). For a classification problem based on data stream, concept drift leads to changes of mapping relationship. In other words, the concept drift of data stream causes changes of the mapping relationship between the features and the class labels, which is hidden in the data. Usually, the data of a certain time point only reflect the concept of this time, and the concept hidden in the data drifts with time passing on. Sun and Li (2011) first validate the existence of financial distress concept drift. That is, the target concept of financial distress changes with the changing environment, or the underlying data distribution gradually changes with the inflow of new sample data although the target concept of financial distress does not change. Thus, the stationary FDP model, which is constructed on the sample data of past stationary time span, cannot adapt to the requirement of future FDP in the changing environment, or may become inaccurate for future FDP.

Due to the existence of financial distress concept drift, the FDP model should be constructed based on financial data stream with a dynamic model updating mechanism, instead of a stationary dataset. Namely, when time moves forward for a period, the new informative sample data should be added into the training dataset and the too-old sample data should be eliminated from the training dataset, so as to preserve the model’s predictive ability for future financial distress.

Imbalanced FDP

In a class-imbalanced dataset, the samples of one class greatly outnumber the samples of the other class. The class with more samples is called the negative class or majority class, and the class with fewer samples is called the positive class or minority class. Class imbalance exists widely in domains such as fraud detection, bank credit scoring, text classification, and FDP. In most cases, decision makers care more about the minority class than the majority class, because the occurrence of the minority class usually brings them great losses. For example, falling into financial distress may disrupt the business activity of a company and even finally lead to bankruptcy.

Class imbalance is obvious in FDP, since most enterprises are financially healthy and only a few are considered financially distressed. For instance, there were 2831 listed companies in China Stock Market in 2015, and only 47 of them were marked as ST or ST* because of financial distress or other irregularities. For a classification modeling problem like FDP, a model trained on a class-imbalanced training dataset usually shows unsatisfactory performance in recognizing the minority class, because most classification algorithms assume class balance. When such classification algorithms are applied to a class-imbalanced dataset, the information of the minority class is overwhelmed by that of the majority class, and the trained classification model tends to be biased toward the majority class. This finally reduces the recognition accuracy for financially distressed companies.
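
The bias toward the majority class can be illustrated with the 2015 counts quoted above. The following is a minimal sketch, not part of the original study: the function name `majority_baseline` is hypothetical, and it treats all 47 ST or ST* companies as the minority class even though some were marked for other irregularities.

```python
def majority_baseline(n_total, n_distressed):
    """Metrics of a trivial model that predicts 'non-distressed' for every company."""
    n_healthy = n_total - n_distressed
    accuracy = n_healthy / n_total            # all healthy companies are classified correctly
    recall_distressed = 0 / n_distressed      # no distressed company is ever detected
    imbalance_ratio = n_healthy / n_distressed
    return accuracy, recall_distressed, imbalance_ratio


# Counts quoted in the text: 2831 listed companies, 47 marked as ST or ST*.
accuracy, recall, ratio = majority_baseline(2831, 47)
print(f"accuracy={accuracy:.4f}, minority recall={recall:.1f}, ratio={ratio:.1f}:1")
```

Such a model reaches an overall accuracy above 98% while recognizing none of the minority samples, which is why overall accuracy alone is a poor criterion on imbalanced data.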

Dynamic model for industry’s relative FDP

Overall design of the dynamic prediction model

Taking the fiscal year as the time moving unit of the financial data stream, the width of the sample time window for training the FDP model is set to n years. That is, the financial ratio data of the most recent n years before the current modeling year are used as the training dataset. Suppose T represents the FDP year. To train a (T − 1) FDP model, namely a model that forecasts 1 year in advance, financial data must be matched with the financial condition labels of the following year when the training dataset is constructed, just as the financial data of the year (T − 1) correspond to the financial condition labels of the year T. The recent n years before the year T are denoted as shown in formula (1).

$$ t = T - (n - k) - 1\quad (k = 1,\;2, \ldots ,n). $$
(1)

In the above formula, k represents the sequence number of the year. k = 1 corresponds to t = T − n, the starting year of the time window, and k = n corresponds to t = T − 1, the last year of the time window. For example, suppose the width of the time window for training the model is 10 years and FDP is needed for the year 2015; then t = 2005, 2006,…,2014. To train a (T − 1) FDP model, the financial data of 2005–2013 are matched, respectively, with the financial condition labels of 2006–2014 to construct the training dataset. Namely, the financial data of 2005 are matched with the financial condition labels of 2006, and so on. After the FDP model is trained, the financial data of 2014 are input into the model to output the predicted financial condition labels of 2015.
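
The indexing of formula (1) and the pairing of data years with label years can be checked with a short script. This is a minimal sketch; the function names `window_years` and `training_pairs` are hypothetical and introduced only for illustration.

```python
def window_years(T, n):
    """Years t = T - (n - k) - 1 for k = 1..n, i.e. formula (1)."""
    return [T - (n - k) - 1 for k in range(1, n + 1)]


def training_pairs(T, n):
    """(data_year, label_year) pairs inside the window for a one-year-ahead model."""
    years = window_years(T, n)
    return list(zip(years[:-1], years[1:]))


years = window_years(2015, 10)
pairs = training_pairs(2015, 10)
print(years[0], years[-1])    # 2005 2014
print(pairs[0], pairs[-1])    # (2005, 2006) (2013, 2014)
```

The data of the last window year (2014 in this example) are not part of any training pair; they are the model input for predicting the year T.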

The framework of the dynamic model for an industry’s relative FDP is designed as depicted in Fig. 1. In detail, the model is composed of three modules: financial feature selection module, financial condition evaluation module, and FDP module.

Fig. 1 The framework of the dynamic model for an industry’s relative FDP

Suppose the number of enterprises in a certain industry in the kth year of the time window is denoted as Nk (k = 1, 2,…,n), and the number of the initial financial features is denoted as m. The initial financial dataset of the kth year is represented as Fk (k = 1, 2,…,n), and they together constitute the initial total dataset F.

$$ F_{1} = \left\{ {x_{ij}^{1} } \right\}\quad (i = 1,\;2, \ldots ,N_{1} ;\;j = 1,\;2, \ldots ,m) = \left[ {\begin{array}{cccc} {x_{11}^{1} } &\quad {x_{12}^{1} } &\quad \cdots &\quad {x_{1m}^{1} } \\ {x_{21}^{1} } &\quad {x_{22}^{1} } &\quad \cdots &\quad {x_{2m}^{1} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{1} 1}}^{1} } &\quad {x_{{N_{1} 2}}^{1} } &\quad \cdots &\quad {x_{{N_{1} m}}^{1} } \\ \end{array} } \right], $$
(2)
$$ F_{n} = \left\{ {x_{ij}^{n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j = 1,\;2, \ldots ,m) = \left[ {\begin{array}{cccc} {x_{11}^{n} } &\quad {x_{12}^{n} } &\quad \cdots &\quad {x_{1m}^{n} } \\ {x_{21}^{n} } &\quad {x_{22}^{n} } &\quad \cdots &\quad {x_{2m}^{n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{n} } &\quad {x_{{N_{n} 2}}^{n} } &\quad \cdots &\quad {x_{{N_{n} m}}^{n} } \\ \end{array} } \right], $$
(3)
$$ F = F_{1} \cup F_{2} \cup \cdots \cup F_{n} . $$
(4)

In the financial feature selection module, a feature selection method is applied to the initial total dataset F, and the irrelevant financial features are deleted. In this study, the plus-L-minus-R feature selection approach is adopted. To construct a dynamic FDP model, feature selection should be carried out again each time the time window moves forward to the next year. Suppose the number of financial features selected is denoted as m′. The financial data of the m′ selected financial features for the kth year of the time window are denoted as \( F^{\prime}_{k} \) (k = 1, 2,…,n), and the total dataset of the selected features is denoted as F′.

$$ F^{\prime}_{1} = \left\{ {x_{{ij^{\prime}}}^{\prime 1} } \right\}\quad (i = 1,\;2, \ldots ,N_{1} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime 1} } &\quad {x_{12}^{\prime 1} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime 1} } \\ {x_{21}^{\prime 1} } &\quad {x_{22}^{\prime 1} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime 1} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{1} 1}}^{\prime 1} } &\quad {x_{{N_{1} 2}}^{\prime 1} } &\quad \cdots &\quad {x_{{N_{1} m^{\prime}}}^{\prime 1} } \\ \end{array} } \right], $$
(5)
$$ F^{\prime}_{n} = \left\{ {x_{{ij^{\prime}}}^{\prime n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime n} } &\quad {x_{12}^{\prime n} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime n} } \\ {x_{21}^{\prime n} } &\quad {x_{22}^{\prime n} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{\prime n} } &\quad {x_{{N_{n} 2}}^{\prime n} } &\quad \cdots &\quad {x_{{N_{n} m^{\prime}}}^{\prime n} } \\ \end{array} } \right], $$
(6)
$$ F^{\prime} = F^{\prime}_{1} \, \cup F^{\prime}_{2} \, \cup \cdots \cup F^{\prime}_{n} . $$
(7)

The financial condition evaluation module uses the PCA method to evaluate the relative financial condition of all the enterprises of the industry year by year. For the kth year of the time window, each enterprise in the industry is labeled as distressed or non-distressed, and the financial condition labels of all the enterprises in the industry compose the financial condition label vector of that year, which is denoted as Yk (k = 1, 2,…,n).

$$ Y_{1} = \left[ {y_{1}^{1} ,\;y_{2}^{1} , \ldots ,y_{N_{1}}^{1} } \right], $$
(8)
$$ Y_{n} = \left[ {y_{1}^{n} ,\;y_{2}^{n} , \ldots ,y_{N_{n}}^{n} } \right]. $$
(9)

As mentioned above, each enterprise’s financial data of the year t − 1 are matched with its financial condition label of the year t to construct a training sample for FDP modeling. However, the number of enterprises in the industry may vary from year to year. When new enterprises are set up in the year t, they have financial condition labels of the year t but no financial data of the year t − 1, so they should be omitted from the training samples of the year t. Similarly, when existing enterprises drop out of the industry in the year t due to bankruptcy or merger and acquisition, they have financial data of the year t − 1 but no financial condition labels of the year t, so they should also be omitted from the training samples of the year t. For the year t, namely the kth year of the time window, suppose that Kk represents the number of training samples that have both financial data of the year t − 1 and financial condition labels of the year t. These enterprises’ financial data of the year t − 1 are combined with their financial condition labels of the year t to construct the training dataset of the year t. The training datasets of all the years in the training time window are then combined into the total training dataset for the current modeling year, which is denoted as TD. If the number of training samples in TD is denoted as K, then K = K1 + K2 +···+Kn.

$$ \begin{aligned} {\text{TD}} & = \left\{ {x_{{i^{\prime}j^{\prime}}}^{\prime} ,\;y_{{i^{\prime}}}^{\prime} } \right\}\quad (i^{\prime} = 1,\;2, \ldots ,K;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) \\ & = \left[ {\begin{array}{ccccc} {x^{\prime}_{11} } &\quad {x^{\prime}_{12} } &\quad \cdots &\quad {x^{\prime}_{{1m^{\prime}}} } &\quad {y^{\prime}_{1} } \\ {x^{\prime}_{21} } &\quad {x^{\prime}_{22} } &\quad \cdots &\quad {x^{\prime}_{{2m^{\prime}}} } &\quad {y^{\prime}_{2} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots &\quad \vdots \\ {x^{\prime}_{K1} } &\quad {x^{\prime}_{K2} } &\quad \cdots &\quad {x^{\prime}_{{Km^{\prime}}} } &\quad {y^{\prime}_{K} } \\ \end{array} } \right]. \\ \end{aligned} $$
(10)
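
The construction of TD amounts to joining each year’s labels with the previous year’s features and dropping enterprises that appear in only one of the two years. The sketch below illustrates this under assumed data structures: the nested dictionaries, the firm identifiers, and the function name `build_training_set` are hypothetical and not taken from the original study.

```python
def build_training_set(features, labels, label_years):
    """Pair each firm's features of year t-1 with its label of year t.

    features: {year: {firm_id: [x_1, ..., x_m']}}
    labels:   {year: {firm_id: 0 or 1}}   (1 = relatively distressed)
    """
    rows = []
    for t in label_years:
        prev = features.get(t - 1, {})
        curr = labels.get(t, {})
        # Keep only firms present in both years: new entrants lack t-1 data,
        # and firms that left the industry lack a label for year t.
        for firm in sorted(prev.keys() & curr.keys()):
            rows.append(prev[firm] + [curr[firm]])
    return rows


# Toy data with hypothetical firm identifiers and two selected features.
features = {
    2012: {"A": [0.1, 0.5], "B": [-0.3, 0.2]},
    2013: {"A": [0.2, 0.4], "B": [-0.4, 0.1], "C": [0.6, 0.9]},
}
labels = {
    2013: {"A": 0, "B": 1, "C": 0},   # C is new in 2013: no 2012 data
    2014: {"A": 0, "C": 0},           # B left the industry before 2014
}
TD = build_training_set(features, labels, label_years=[2013, 2014])
for row in TD:
    print(row)
```

Each row holds the m′ feature values followed by the label, mirroring the layout of the matrix in formula (10); here K = 2 + 2 = 4.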

In the FDP module, the training dataset TD is used to train the FDP model, which can be used to make relative FDP for the enterprises of the industry for the year T. The target enterprises’ financial data of the year T − 1 should be input into the FDP model, and the predicted relative financial condition labels denoted as the vector Z′ will be output. When the year T really comes, the financial condition evaluation module will be applied again for these enterprises based on their financial data of the year T, to produce the real relative financial condition labels, denoted as the vector Z. Finally, we can obtain the accuracy of industry’s relative FDP by comparing the vector Z′ and the vector Z.

This process of model construction and application for an industry’s relative FDP is dynamic, because the modeling time window rolls forward as time moves on. Since the width of the modeling time window is fixed, the newest year’s financial data flow in and the oldest year’s financial data are eliminated year by year. For each new current year, the feature selection module is run again to select the most relevant financial features, and the financial condition evaluation module is applied again based on the new financial features to produce the new vector of relative financial condition labels. Then the new training dataset is constructed, and a new industry’s relative FDP model is trained to make predictions for the next year. Year by year, the ex-post evaluation of the industry’s relative financial condition and the ex-ante industry’s relative FDP are dynamically integrated. Therefore, the dynamic model designed for an industry’s imbalanced relative FDP adapts well to the developing environment of the industry and can improve the efficiency of financial risk management for the enterprises of the industry.

Feature selection based on plus-L-minus-R approach

Feature selection, also known as variable selection or attribute selection, is the process of selecting a subset of relevant features (variables, predictors) for model construction, which reduces the dimensionality of the dataset while retaining almost all the useful information. Too many features usually make the model too complex and reduce its generalization ability. Feature selection reduces the number of features by eliminating the irrelevant ones, which helps save model training time, improves the accuracy of the model, and makes the model simpler and easier to understand. There are mainly three types of searching algorithms for feature selection, namely complete searching, heuristic searching, and random searching. Among them, complete searching and random searching are usually time consuming when the initial dimensionality is high, because the former considers all feature subsets and the latter relies on a complex algorithm. Heuristic searching is popular for its good efficiency and performance. The heuristic searching algorithms mainly include the sequential forward selection algorithm, the sequential backward selection algorithm, the plus-L-minus-R selection algorithm, and so on. The sequential forward selection algorithm and the sequential backward selection algorithm are greedy searching approaches and are prone to being trapped in local optima. The plus-L-minus-R algorithm usually has better feature selection performance, because it integrates the ideas of those two algorithms and avoids their shortcomings. Therefore, this study makes feature selection based on the plus-L-minus-R approach.

For the plus-L-minus-R approach, there are two forms. The first starts from the empty set, and in each iteration it first adds L features and then removes R features (L > R), so that the evaluation function finally reaches its optimal value. The second starts from the complete set, and in each iteration it first removes R features and then adds L features (L < R). Therefore, the parameters L and R should be set for the plus-L-minus-R approach. For the dynamic modeling of an industry’s relative FDP, feature selection is needed each time the model is updated and reconstructed.

Firstly, the normalization processing should be carried out for the initial dataset of financial features, to avoid the effect of different units of different dimensions. This study applies the following formula (11) for the purpose of data normalization.

$$ x_{\text{norm}} = - 1 + \frac{x - \min (x)}{\max (x) - \min (x)} \times (1 - ( - 1)). $$
(11)

In the above formula, min(x) and max(x) denote, respectively, the minimum value and the maximum value of a certain feature.
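
Formula (11) maps each feature linearly onto the interval [−1, 1]. A minimal sketch for a single feature column follows; the function name `normalize_feature`, the sample values, and the handling of a constant feature are assumptions made for illustration.

```python
def normalize_feature(values):
    """Scale one feature column to [-1, 1] following formula (11)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [-1 + (x - lo) / (hi - lo) * (1 - (-1)) for x in values]


current_ratio = [0.8, 1.2, 2.0, 3.2]   # hypothetical values of one financial ratio
print(normalize_feature(current_ratio))
```

The smallest value maps to −1, the largest to 1, and the midpoint of the range (2.0 here) maps to approximately 0.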

Secondly, feature selection is carried out on the normalized dataset of financial features. Suppose the initial financial feature set contains m financial features, and the financial feature set after feature selection contains m′ financial features and is denoted as A. Before feature selection starts, the initial parameters should be set, for example, A0 = Ø, L = 4, R = 3, d_end = 10. This means: (1) Feature selection starts from the empty set. (2) In each iteration, it first adds 4 features and then removes 3 features. (3) Feature selection ends when the number of features selected reaches 10. Suppose the evaluation function for feature selection is J(·); the plus-L-minus-R algorithm follows formula (12) to add features and formula (13) to delete features.

$$ \left\{ \begin{array}{l} a^{+} = \arg\max_{a \notin A_{d}} J(A_{d} + a), \\ A_{d + 1} = A_{d} + a^{+}, \\ d = d + 1, \end{array} \right. $$
(12)
$$ \left\{ \begin{array}{l} a^{-} = \arg\max_{a \in A_{d}} J(A_{d} - a), \\ A_{d - 1} = A_{d} - a^{-}, \\ d = d - 1. \end{array} \right. $$
(13)

The iteration of adding and deleting features ends when the number of features selected reaches d_end, at which point the final financial feature set for FDP is obtained.
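
The forward form of the search can be sketched as follows. This is an illustrative sketch rather than the implementation used in the study: the evaluation function J is supplied by the caller (the additive toy score below is a stand-in, not the criterion used by the authors), the feature names are hypothetical, and the stopping rule is assumed to be checked after every single added feature.

```python
def plus_l_minus_r(features, J, L=4, R=3, d_end=10):
    """Plus-L-minus-R search starting from the empty set (requires L > R).

    features: list of candidate feature names
    J:        evaluation function mapping a frozenset of features to a score
    """
    if L <= R:
        raise ValueError("the forward form needs L > R")
    target = min(d_end, len(features))
    selected = set()
    while len(selected) < target:
        # Plus L: add, one at a time, the feature whose inclusion maximizes J.
        for _ in range(L):
            candidates = [a for a in features if a not in selected]
            best = max(candidates, key=lambda a: J(frozenset(selected | {a})))
            selected.add(best)
            if len(selected) == target:
                return selected
        # Minus R: drop, one at a time, the feature whose removal leaves J highest.
        for _ in range(R):
            members = [a for a in features if a in selected]
            worst = max(members, key=lambda a: J(frozenset(selected - {a})))
            selected.remove(worst)
    return selected


# Toy evaluation function: sum of hypothetical per-feature relevance scores.
relevance = {"roa": 0.9, "current_ratio": 0.7, "debt_ratio": 0.6,
             "asset_turnover": 0.4, "sales_growth": 0.3, "noise": 0.0}


def toy_J(subset):
    return sum(relevance[a] for a in subset)


chosen = plus_l_minus_r(list(relevance), toy_J, L=2, R=1, d_end=3)
print(sorted(chosen))
```

With an additive score like this one the search behaves the same as plain forward selection; the backtracking step only matters when features interact, for example when J is the validation accuracy of a classifier.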

Financial condition evaluation module based on PCA

It is worth mentioning that the industry’s relative FDP model in this study predicts an enterprise’s relative financial distress from the view of one industry, which means that the financial condition labels of the training samples are the enterprises’ relative financial performances in a certain industry rather than absolute criteria such as bankruptcy or the special treatment of Chinese listed companies. PCA, which has been used for the synthetic evaluation of financial performance (Sun et al. 2011), city economic performance (Zhu 1998), bank risk (Fang et al. 2018), etc., is used to evaluate the relative financial condition of all the enterprises of the industry. Given the normalized financial dataset after feature selection at the current time point, namely \( F^{\prime}_{n} \) in formula (14), the relative financial condition evaluation process based on PCA is as follows:

$$ F^{\prime}_{n} = \left\{ {x_{{ij^{\prime}}}^{\prime n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime n} } &\quad {x_{12}^{\prime n} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime n} } \\ {x_{21}^{\prime n} } &\quad {x_{22}^{\prime n} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{\prime n} } &\quad {x_{{N_{n} 2}}^{\prime n} } &\quad \cdots &\quad {x_{{N_{n} m^{\prime}}}^{\prime n} } \\ \end{array} } \right]. $$
(14)
  1. Calculate the contribution rate and the cumulative contribution rate

    Suppose the correlation matrix for the m′ financial features is denoted as R in formula (15).

    $$ R = \left[ {\begin{array}{cccc} {r_{11} } &\quad {r_{12} } &\quad \cdots &\quad {r_{{1m^{\prime}}} } \\ {r_{21} } &\quad {r_{22} } &\quad \cdots &\quad {r_{{2m^{\prime}}} } \\ \vdots &\quad \vdots &\quad \vdots &\quad \vdots \\ {r_{{m^{\prime}1}} } &\quad {r_{{m^{\prime}2}} } &\quad \cdots &\quad {r_{{m^{\prime}m^{\prime}}} } \\ \end{array} } \right]. $$
    (15)

    Then the eigenvalues \( \lambda_{{j^{\prime}}} \) (j′ = 1, 2,…,m′) and eigenvectors \( e_{{j^{\prime}}} \) (j′ = 1, 2,…,m′) are calculated by the Jacobi method, and the eigenvalues are sorted in descending order, namely \( \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{{m^{\prime}}} \ge 0. \) The contribution rate and the cumulative contribution rate of each principal component are denoted as \( \beta_{{j^{\prime}}} \) and P(j′), respectively.

    $$ \beta_{{j^{\prime}}} = \frac{{\lambda_{{j^{\prime}}} }}{{\sum\nolimits_{{j_{1} = 1}}^{{m^{\prime}}} {\lambda_{{j_{1} }} } }}\quad (j^{\prime} = 1,\;2, \ldots ,m^{\prime}), $$
    (16)
    $$ P(j^{\prime}) = \frac{{\sum\nolimits_{{j_{1} = 1}}^{{j^{\prime}}} {\lambda_{{j_{1} }} } }}{{\sum\nolimits_{{j_{1} = 1}}^{{m^{\prime}}} {\lambda_{{j_{1} }} } }}\quad (j^{\prime} = 1,\;2, \ldots ,m^{\prime}). $$
    (17)
  2. Calculate the financial condition evaluation score

    Suppose λ1, λ2,…,λl are the l (l ≤ m′) largest eigenvalues, where l is the smallest number of principal components whose cumulative contribution rate exceeds 90%. Calculate the loadings of these principal components as follows:

    $$ g_{bj^{\prime}} = \sqrt{\lambda_{b}}\; e_{bj^{\prime}} \quad (b = 1,\;2, \ldots ,l;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}). $$
    (18)

    Therefore, the l principal components of the dataset \( F^{\prime}_{n} \) can be expressed as follows:

    $$ \left\{ \begin{array}{l} s_{i1}^{n} = g_{11} x_{i1}^{\prime n} + g_{12} x_{i2}^{\prime n} + \cdots + g_{1m^{\prime}} x_{im^{\prime}}^{\prime n} \\ s_{i2}^{n} = g_{21} x_{i1}^{\prime n} + g_{22} x_{i2}^{\prime n} + \cdots + g_{2m^{\prime}} x_{im^{\prime}}^{\prime n} \\ \cdots \\ s_{il}^{n} = g_{l1} x_{i1}^{\prime n} + g_{l2} x_{i2}^{\prime n} + \cdots + g_{lm^{\prime}} x_{im^{\prime}}^{\prime n} \end{array} \right.\quad (i = 1,\;2, \ldots ,N_{n} ). $$
    (19)

    Thus, the dataset \( F^{\prime}_{n} \) with m′ features can be transformed into Sn:

    $$ S_{n} = \left[ {\begin{array}{cccc} {s_{11}^{n} } &\quad {s_{12}^{n} } &\quad \cdots &\quad {s_{1l}^{n} } \\ {s_{21}^{n} } &\quad {s_{22}^{n} } &\quad \cdots &\quad {s_{2l}^{n} } \\ \vdots &\quad \vdots &\quad \vdots &\quad \vdots \\ {s_{{N_{n} 1}}^{n} } &\quad {s_{{N_{n} 2}}^{n} } &\quad \cdots &\quad {s_{{N_{n} l}}^{n} } \\ \end{array} } \right]. $$
    (20)

    Finally, the comprehensive financial condition scores of the Nn companies can be calculated as follows:

    $$ Z_{i}^{n} = \sum\limits_{b = 1}^{l} {\beta_{b} s_{ib}^{n} } \quad (i = 1,\;2, \ldots ,N_{n} ). $$
    (21)
  3. (3)

    Evaluate the financial condition

    A company with a higher comprehensive score is in a relatively better financial situation within the industry at the current time point and is less likely to fall into financial distress. Conversely, a company with a lower comprehensive score is in a relatively worse financial situation within the industry at the current time point and is more likely to fall into financial distress. With the comprehensive scores sorted in descending order, the companies ranked at or before the p percent position are regarded as financially safe, and those ranked after the p percent position are regarded as financially distressed. In this study, p percent equals 70%. That is, the companies ranked at or before the 70% position are labeled as negative (financial safety), and those ranked in the last 30% are labeled as positive (financial distress).
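
The scoring and labeling steps of formulas (16)–(21) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes the eigenvalues (sorted in descending order) and the matching eigenvectors have already been obtained elsewhere, and the function names, the `p_percent` parameter, and the round-down rule at the cutoff are assumptions made for the example.

```python
import math

def contribution_rates(eigenvalues):
    """Formula (16): share of each eigenvalue in the total of all eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def n_components(eigenvalues, threshold=0.90):
    """Smallest l whose cumulative contribution rate (formula (17)) exceeds the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for count, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total > threshold:
            return count
    return len(eigenvalues)

def comprehensive_scores(X, eigenvalues, eigenvectors, threshold=0.90):
    """Formulas (18)-(21). X holds one row of standardized ratios per company;
    eigenvectors[b][j] is the j-th element of the b-th eigenvector."""
    beta = contribution_rates(eigenvalues)
    l = n_components(eigenvalues, threshold)
    # Formula (18): loadings g_bj = sqrt(lambda_b) * e_bj
    loadings = [[math.sqrt(eigenvalues[b]) * e for e in eigenvectors[b]]
                for b in range(l)]
    scores = []
    for row in X:
        # Formula (19): principal component values of this company
        s = [sum(g * x for g, x in zip(loadings[b], row)) for b in range(l)]
        # Formula (21): weighted sum of the l component values
        scores.append(sum(beta[b] * s[b] for b in range(l)))
    return scores

def label_by_rank(scores, p_percent=70):
    """Top p percent by descending score -> 'safety', the rest -> 'distress'.
    Assumption: the number of safe companies is rounded down."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_safe = (p_percent * len(scores)) // 100
    labels = [None] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = "safety" if rank < n_safe else "distress"
    return labels
```

With ten companies and `p_percent=70`, the seven highest-scoring companies receive the safety label and the remaining three the distress label.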

Ensemble FDP module based on SMOTE–AdaBoost

This study focuses on one industry’s relative financial distress. The number of sample companies for a given industry may be very limited, and the financial distress samples are fewer still than the financial safety samples. Therefore, SMOTE and AdaBoost are combined to construct the FDP model.

Zhou (2013) finds that SMOTE is appropriate for oversampling the positive financial distress samples when they are much fewer than the negative financial safety samples. SMOTE was proposed by Chawla et al. (2002), and it generates new positive financial distress samples that are similar to, but different from, the original ones. For each minority positive sample \( x^{+} \), it finds the k nearest neighbors (KNNs) among the positive samples and randomly selects n_smote samples from these KNNs, denoted as \( x^{+}\_{\text{near}}_{1} \), \( x^{+}\_{\text{near}}_{2} \), …, \( x^{+}\_{\text{near}}_{n\_{\text{smote}}} \). Then n_smote new positive samples are generated between \( x^{+} \) and \( x^{+}\_{\text{near}}_{i} \) (i = 1, 2,…,n_smote) according to formula (22), in which rand(0,1) produces a random number between 0 and 1.

$$ x^{ + } {\text{\_smote}}_{i} = x^{ + } + {\text{rand(}}0,1) \times \left( {x^{ + } \_{\text{near}}_{i} - x^{ + } } \right). $$
(22)
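
As an illustration of formula (22), the following Python sketch generates synthetic minority samples by interpolating between each positive sample and randomly chosen nearest neighbors. It is a minimal sketch under stated assumptions: the function name `smote`, the use of Euclidean distance, and the optional `rng` argument are choices made for this example and are not taken from the paper.

```python
import random

def smote(minority, k=5, n_smote=1, rng=None):
    """Generate n_smote synthetic samples per minority sample, following formula (22).
    minority: list of feature vectors (lists of floats) of the positive class."""
    rng = rng or random.Random()
    synthetic = []
    for idx, x in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance to x
        others = [y for j, y in enumerate(minority) if j != idx]
        others.sort(key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y)))
        neighbours = others[:k]
        # Randomly pick n_smote of the k nearest neighbours
        chosen = rng.sample(neighbours, min(n_smote, len(neighbours)))
        for near in chosen:
            gap = rng.random()  # rand(0, 1)
            # x_smote = x + rand(0, 1) * (x_near - x)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, near)])
    return synthetic
```

Each synthetic sample lies on the line segment between an original positive sample and one of its neighbors, so it stays inside the region already occupied by the minority class.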

AdaBoost is an ensemble learning algorithm proposed by Freund and Schapire (1997). It iteratively trains multiple base classifiers on one training set, making each newly trained base classifier focus more on the training samples misclassified by the previous one. This is realized by adjusting the weights of the training samples in each iteration. In the initial iteration, all training samples have the same weight, and the weights sum to 1. After a base classifier is trained, the weights of the samples it classifies correctly are reduced, while the weights of the samples it misclassifies are increased. After R rounds of iteration, the R trained base classifiers are combined into a stronger ensemble classifier.
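
The reweighting idea can be sketched as follows. The sketch uses the textbook discrete AdaBoost update with class labels coded as +1 and −1, which is an assumption here; the exact variant used in the paper is given in its Table 1 and may differ in detail. The function names are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting round of textbook discrete AdaBoost (labels are +1 / -1).
    Returns the classifier weight alpha and the normalized sample weights."""
    # Weighted error of the base classifier just trained
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - error) / error)
    # Correct samples (t * p = +1) shrink; misclassified ones (t * p = -1) grow
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

def ensemble_predict(alphas, predictions):
    """Weighted vote of the R base classifiers for one sample."""
    vote = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if vote >= 0 else -1
```

Starting from four equally weighted samples with one misclassification, a single round raises the weight of the misclassified sample to one half, so the next base classifier is pushed to classify it correctly.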

The SMOTE–AdaBoost (SMOTEBoost) ensemble model for predicting one industry’s relative financial distress is shown in Fig. 2. First, SMOTE is used to balance the number of positive financial distress samples and the number of negative financial safety samples by artificially generating new positive samples. Then the AdaBoost ensemble classifier is constructed on a base classification algorithm such as SVM, DT, KNN, or Logistic regression. The algorithm is described in Table 1.

Fig. 2 The SMOTEBoost ensemble model

Table 1 The algorithm of AdaBoost ensemble classifier

Empirical experiment

Data collection

Since this study focuses on the dynamic relative FDP of one industry over time, it needs a dataset for an industry that has existed for a relatively long time. The iron and steel industry is a traditional industry in China, and the information disclosure of listed iron and steel companies is comprehensive and available from as early as 2000. Hence, the passage of time can be well simulated in the empirical experiment based on the iron and steel industry of China. From 2000 to 2015, there were, respectively, 28, 31, 34, 36, 38, 38, 42, 43, 43, 46, 53, 54, 55, 55, 57, and 59 iron and steel companies listed on the China Stock Exchange.

For the purpose of financial condition analysis, 33 candidate financial ratios are selected and listed in Table 2. For each year from 2000 to 2015, the candidate financial ratio data of the iron and steel listed companies are collected, and they constitute the initial dataset for the empirical experiment.

Table 2 Candidate financial ratios for FDP

Experimental design

Dynamic prediction of relative financial distress for a certain industry based on an imbalanced data stream means predicting the future relative financial condition label of a company, compared with other companies of the same industry, while taking the passage of time and financial distress concept drift into account. Therefore, the experiment simulates the movement of a fixed-width time window, and reconstructs and retests the model year by year. The experimental design is shown in Fig. 3.

Fig. 3 The framework of experimental design

Based on the financial data of the iron and steel industry from 2000 to 2015, we set the width of the time window to 10 years. Each time the window moves forward by one year to a new current time point, the financial ratios with more information content are selected from the 33 candidates by the plus-L-minus-R approach. Then the companies’ relative financial condition labels are obtained from the PCA evaluation scores, and they act as the testing sample labels for the current model and as the new training sample labels for the next model reconstruction.

Suppose 2009 is the first modeling year of the iron and steel industry, and the width of time window is 10 years. Then the available financial data belong to the years 2000–2009, and feature selection can be carried out based on them by the plus-L-minus-R approach. Input the financial data batch of the years 2001–2010, respectively, into the financial condition evaluation module based on PCA, and all the companies in the industry are labeled as financial distress or financial safety for each year. Year by year and company by company, the financial condition labels of the years 2001–2009 are matched with the financial data of the years 2000–2008, and they together constitute the training dataset for FDP modeling. After the industry’s relative FDP model is built, we can input the financial data of 2009 into the FDP model and output the predicted financial condition labels of 2010, which are compared with the financial condition labels of 2010 output by the financial condition evaluation module for the purpose of FDP performance testing.

Suppose the time point for constructing the relative FDP model moves from 2009 to 2014, with one fiscal year as the time unit. In each new fiscal year, the initial financial data batches for the experiment also roll forward with the fixed time window width of 10 years. Namely, in each round of time window rolling, the financial data batch of the earliest year in the previous time window is discarded and the financial data batch of the new fiscal year is added, which forms a dynamic financial data stream for the dynamic FDP modeling. In the experiment, five rounds of time rolling are simulated in total, and the FDP performance is evaluated on 6 years of testing results.
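
The rolling schedule described above can be written out explicitly. The following Python sketch only enumerates which years play which role in each stage; the function name `stage_schedule` and the dictionary keys are hypothetical names introduced for illustration.

```python
def stage_schedule(first_model_year=2009, last_model_year=2014, width=10):
    """List, for every modeling year, the years used for training and testing."""
    stages = []
    model_years = range(first_model_year, last_model_year + 1)
    for stage, model_year in enumerate(model_years, start=1):
        # Fixed-width window ending at the modeling year
        window = list(range(model_year - width + 1, model_year + 1))
        stages.append({
            "stage": stage,
            "window_years": window,              # data available when modeling
            "train_feature_years": window[:-1],  # financial ratios of year t
            "train_label_years": window[1:],     # PCA-based labels of year t + 1
            "test_feature_year": model_year,     # input for prediction
            "test_label_year": model_year + 1,   # label used for testing
        })
    return stages

for s in stage_schedule():
    print(s["stage"], s["window_years"][0], s["window_years"][-1], s["test_label_year"])
```

For stage 1 this reproduces the setting in the text: ratios of 2000–2008 are paired with labels of 2001–2009 for training, and the 2009 ratios are used to predict the 2010 labels.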

To comprehensively evaluate the approach proposed in this research, the four basic classification algorithms of SVM, DT, KNN, and Logistic regression are used to construct four kinds of SMOTEBoost ensemble FDP models in each stage of experiment. Their experimental results are comprehensively compared and analyzed for model evaluation.

Experimental results and analysis

Experimental results of dynamical models

In the stage 1 experiment, the candidate financial ratio data of the iron and steel companies for the fiscal years 2000–2009 are used for feature selection by the plus-L-minus-R approach, and the financial ratios with better discriminant ability are selected as the variables for the industry’s relative financial condition evaluation and distress prediction, as listed in Table 3.

Table 3 The financial ratios selected based on datasets of 2000–2009

Take the stage 1 experiment as an example. For each fiscal year from 2001 to 2009, the dataset of the selected financial ratios for that year is extracted, and PCA is used to calculate and sort the comprehensive evaluation scores of the iron and steel companies, so as to obtain their industry’s relative financial condition labels for that year. Suppose the industry’s relative financial condition is divided into the two classes of safety and distress in the proportion of 7:3, and they are, respectively, denoted as 1 and 0. Then these class labels of 2001–2009 are combined year by year with the corresponding companies’ financial data of 2000–2008, and the training dataset of stage 1 is constructed. It is used to train, respectively, the SMOTEBoost-SVM, SMOTEBoost-DT, SMOTEBoost-KNN, and SMOTEBoost-Logistic ensemble FDP models. The 2009 financial dataset can then be input into these FDP models, and the predicted financial condition labels of 2010 are output. Finally, the 2010 financial data of the same companies are input into the PCA module to generate their evaluated financial condition labels, which are compared with the predicted labels for model performance evaluation. Similarly to stage 1, the experiments of stages 2–6 are carried out with the simulated prediction year moving successively to 2011, 2012, 2013, 2014, and 2015. The financial ratios dynamically selected by the plus-L-minus-R approach in stages 2–6 are listed in Table 4. This dynamic feature selection process makes the selected financial ratios more appropriate for the current financial distress concept, and helps improve the discriminant ability of the FDP models for the industry.

Table 4 The financial ratios selected in stages 2–6

From stages 1 to 6, the numbers of training samples and testing samples for the industry’s relative FDP modeling and testing are shown in Table 5. The total number of testing samples from stages 1 to 6 is 320, which includes 99 distress samples and 221 safety samples. To test whether the selected financial ratios differ significantly between the two classes of the industry’s relative financial condition, the nonparametric Wilcoxon test for independent samples is used, because the Shapiro–Wilk test indicates that the ratios do not follow the normal distribution. The results show that all the selected financial ratios except those of growth capacity (V23, V24, and V26) have significantly different numerical distributions between the two classes. However, the growth capacity ratios are retained for a more comprehensive reflection of the iron and steel companies’ financial performance.

Table 5 The numbers of training samples and testing samples for industry’s relative FDP modeling and testing

By comparing the predicted labels with the target labels, we can calculate the distress class accuracies, the safety class accuracies and the overall accuracies of industry’s relative FDP, which are listed in Tables 6, 7, 8, 9, 10 and 11, respectively, for 2010–2015. Considering all the testing results from stages 1 to 6, the mean accuracies for 2010–2015 are listed in Table 12 and illustrated in Fig. 4.
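
The three accuracy measures can be computed as in the short Python sketch below. The function name `class_accuracies` and the string labels are assumptions made for illustration; the tiny label lists in the usage lines are made-up placeholders, not data from the experiment.

```python
def class_accuracies(y_true, y_pred, distress="distress", safety="safety"):
    """Distress class accuracy, safety class accuracy, and overall accuracy."""
    def accuracy(pairs):
        pairs = list(pairs)
        if not pairs:
            return float("nan")  # class absent from the test set
        return sum(t == p for t, p in pairs) / len(pairs)

    pairs = list(zip(y_true, y_pred))
    return {
        # Share of truly distressed companies predicted as distressed
        "distress": accuracy((t, p) for t, p in pairs if t == distress),
        # Share of truly safe companies predicted as safe
        "safety": accuracy((t, p) for t, p in pairs if t == safety),
        # Share of all companies predicted correctly
        "overall": accuracy(pairs),
    }

# Placeholder labels, for illustration only
target = ["distress", "distress", "safety", "safety", "safety"]
predicted = ["distress", "safety", "safety", "safety", "distress"]
print(class_accuracies(target, predicted))
```

Reporting the two class accuracies separately matters here because, with roughly 30% distress samples, a high overall accuracy alone could hide poor recognition of the distress class.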

Table 6 The accuracies of industry’s relative FDP for 2010 in the stage 1 experiment
Table 7 The accuracies of industry’s relative FDP for 2011 in the stage 2 experiment
Table 8 The accuracies of industry’s relative FDP for 2012 in the stage 3 experiment
Table 9 The accuracies of industry’s relative FDP for 2013 in the stage 4 experiment
Table 10 The accuracies of industry’s relative FDP for 2014 in the stage 5 experiment
Table 11 The accuracies of industry’s relative FDP for 2015 in the stage 6 experiment
Table 12 The mean accuracies for industry’s relative FDP of 2010–2015
Fig. 4 Column graph of mean accuracies for industry’s relative FDP of 2010–2015

The results show that the industry’s relative FDP models trained by different classifiers on the class-imbalanced data stream of a given industry obtain stable prediction performance. Their distress class accuracies, safety class accuracies, and overall accuracies are all over 80%. Therefore, the approach proposed in this study for the dynamic evaluation and prediction of an industry’s relative financial distress is feasible. There is no evident difference among the mean overall accuracies of SMOTEBoost-SVM, SMOTEBoost-DT, SMOTEBoost-KNN, and SMOTEBoost-Logistic. However, SMOTEBoost-SVM and SMOTEBoost-Logistic have higher distress class accuracies than SMOTEBoost-DT and SMOTEBoost-KNN, although the safety class accuracies of the former are lower than those of the latter. Hence, given that the cost of misclassifying distress samples is usually higher than that of misclassifying safety samples, SMOTEBoost-SVM and SMOTEBoost-Logistic are preferable to SMOTEBoost-DT and SMOTEBoost-KNN for the dynamic prediction of an industry’s relative financial distress.

Experimental comparison between dynamical and stationary models

In the industry’s relative FDP approach proposed in this study, the PCA process for the industry’s relative financial condition evaluation is dynamically integrated with the reconstruction of the FDP model. That is, the dynamic process of PCA evaluation contains the hidden process of the industry’s relative financial distress concept drift, and makes the distribution of the training data stream for FDP vary as time goes on. The dynamic industry’s relative FDP modeling based on the dynamic training data stream can track such concept drift well and keep the current FDP models suitable for the current concept of the industry’s relative financial distress. To test the existence of this concept drift, the stationary industry’s relative FDP models constructed in 2009 on the financial ratio data of 2000–2009 are also used directly to predict the industry’s relative financial distress of 2010–2015, and the resulting accuracies are listed in Tables 13, 14, 15, 16, 17 and 18. Considering all the testing results of the stationary models from stages 1 to 6, their mean accuracies for 2010–2015 are listed in Table 19.

Table 13 The accuracies of 2009 stationary models for industry’s relative FDP of 2010
Table 14 The accuracies of 2009 stationary models for industry’s relative FDP of 2011
Table 15 The accuracies of 2009 stationary models for industry’s relative FDP of 2012
Table 16 The accuracies of 2009 stationary models for industry’s relative FDP of 2013
Table 17 The accuracies of 2009 stationary models for industry’s relative FDP of 2014
Table 18 The accuracies of 2009 stationary models for industry’s relative FDP of 2015
Table 19 The mean accuracies of 2009 stationary models for industry’s relative FDP of 2010–2015

For a clear comparison of prediction performance, the overall accuracies of the dynamic models from 2010 to 2015 are plotted as curves in Fig. 5, and the overall accuracies of the stationary models from 2010 to 2015 are plotted as curves in Fig. 6. The accuracy curves of the dynamic models remain stable at a relatively high level of over 80% from 2010 to 2015, while the accuracy curves of the stationary models decline evidently from 2010 to 2013 and then remain stable between 60 and 70%. To compare the mean accuracies of the dynamic and stationary models, they are plotted together as curves in Fig. 7. As shown, the mean accuracies of the dynamic models are around 85%, but the mean accuracies of the stationary models are around 70%. Each line of the dynamic models lies above the corresponding line of the stationary models. These results indicate that dynamically updating the prediction models for an industry’s relative financial distress under concept drift achieves markedly better performance than the stationary models.

Fig. 5 Curves of overall accuracies of the dynamic models from 2010 to 2015

Fig. 6 Curves of overall accuracies of the stationary models from 2010 to 2015

Fig. 7 Curves of mean accuracies of dynamic and stationary models

Financial feature analysis

To further analyze the financial features of relative financial distress in the Chinese iron and steel industry, the selection frequencies of the financial ratios chosen as feature attributes in each modeling year are counted, so as to find which financial ratios have more sensitive early warning ability. The financial ratios selected for the FDP and their frequencies in the six stages of modeling are listed in Table 20, and the corresponding radar chart is shown in Fig. 8. As shown, 14 of the initial 33 candidate financial ratios are selected as input variables in the 6 stages of FDP models. Even among the selected financial ratios, the frequencies in the six stages of FDP models differ. Some appear in only one model, and some exist in all six models. The financial ratios with high frequency in the FDP models are considered important financial features that play an effective role in the early warning of relative financial distress for the Chinese iron and steel industry. As illustrated in Table 20 and Fig. 8, the financial ratios with a frequency of at least four cover solvency, operating capacity, growth capacity, and structural ratio, among which operating capacity and solvency are particularly significant and account for 40% and 30%, respectively. Therefore, it is critical for Chinese iron and steel companies to take action to improve operating capacity and solvency, so as to gain relative competitive power within the industry.
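
The frequency count itself is simple to reproduce. The Python sketch below tallies how many stages select each ratio and keeps those chosen at least four times; the function names are hypothetical, and the ratio codes in the usage lines are placeholders rather than the selections reported in Table 20.

```python
from collections import Counter

def selection_frequencies(selected_by_stage):
    """Count in how many stages each financial ratio is selected."""
    counts = Counter()
    for ratios in selected_by_stage:
        counts.update(set(ratios))  # a ratio counts at most once per stage
    return counts

def frequent_ratios(counts, min_count=4):
    """Ratios selected in at least min_count stages, in sorted order."""
    return sorted(r for r, c in counts.items() if c >= min_count)

# Placeholder codes "A", "B", "C" for six stages, for illustration only
stages = [["A", "B"], ["A", "C"], ["A", "B"], ["A"], ["A", "B"], ["B", "C"]]
counts = selection_frequencies(stages)
print(counts["A"], counts["B"], counts["C"])
print(frequent_ratios(counts))
```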

Table 20 The financial ratios selected for the FDP and their frequencies in the six stages of modeling
Fig. 8 The radar chart of financial ratios’ frequencies in FDP modeling

Conclusion

Based on the definition of financial distress and concept drift, this paper designs a dynamic prediction model for an industry’s relative financial distress, which considers both the concept drift of relative financial distress as an industry develops and the class imbalance between distressed and non-distressed companies. The whole model is divided into three modules, i.e. the financial feature selection module, the financial condition evaluation module, and the FDP modeling module, which are dynamically updated with the rolling of a fixed-width time window. Namely, the industry’s relative FDP model is rebuilt each time the time window slides forward by one period. The three modules are interrelated. The financial feature selection module selects the financial ratios that are input into the financial condition evaluation module for synthetic financial evaluation by PCA. At the same time, the financial condition labels output by the financial condition evaluation module constitute an input variable for the FDP modeling module. For the empirical study, the financial ratio data of the Chinese iron and steel industry from 2000 to 2015 are collected, and an empirical experiment is designed to test the feasibility of the proposed model. The following empirical conclusions are drawn: (1) When the four hybrid classifiers SMOTEBoost-SVM, SMOTEBoost-DT, SMOTEBoost-KNN, and SMOTEBoost-Logistic are applied as the classification algorithms, the proposed model achieves satisfactory accuracies in all cases, which indicates that the dynamic industry’s relative FDP model proposed in this paper is effective for corporate financial risk management. (2) The industry’s relative financial condition labels of the same company output by the PCA module may vary at different stages, indicating the existence of industry’s relative financial distress concept drift. This is also the reason why, in the empirical results, the dynamic industry’s relative FDP model markedly outperforms the stationary model. (3) Operating capacity and solvency are the most important factors that influence the financial competitive power of Chinese iron and steel companies. The Chinese iron and steel industry should take more measures to accelerate the turnover of inventory, accounts receivable, and fixed assets, and to optimize solvency ratios such as the current ratio, the quick ratio, and owner’s equity to liabilities.