Dynamic prediction of relative financial distress based on imbalanced data stream: from the view of one industry

Sun, Jie; Zhou, Mengjie; Ai, Wenguo; Li, Hui

doi:10.1057/s41283-018-0047-y

Original Article
Published: 30 October 2018

Volume 21, pages 215–242, (2019)

Risk Management

Abstract

Early studies on financial distress prediction (FDP) seldom consider concept drift in an industry’s relative financial distress, and they neglect how to predict that distress dynamically. This paper proposes a novel model for dynamic prediction of relative financial distress based on the imbalanced data stream of a certain industry. The model is divided into three submodules: a financial feature selection module based on the plus-L-minus-R approach, a financial condition evaluation module based on principal component analysis, and an FDP modeling module based on SMOTEBoost-SVM/DT/KNN/Logistic. After feature selection, the results of industry financial condition evaluation are used as class labels for the industry’s relative FDP modeling, and the model keeps updating as the time window slides forward. The empirical experiment is carried out on the financial ratio data of Chinese iron and steel companies listed on the Shanghai and Shenzhen Stock Exchanges, and the results indicate the effectiveness of the dynamic model for an industry’s relative FDP.

Introduction

With the development of the global economy, a good financial condition becomes more and more important for an enterprise that wants to survive and develop under fierce market competition. A company in better financial condition usually has more business opportunities. However, when a company runs into financial distress, it may suffer from problems such as poor profitability, high liabilities, and insufficient cash flow, which may affect its business operation and even lead to bankruptcy. Therefore, how to construct an effective financial distress prediction (FDP) model remains a hot topic, because predicting financial distress in advance and taking corresponding measures in time can help a company and its investors avoid great losses.

On the one hand, financial distress concept drift makes stationary FDP models unable to adapt to the new sample data stream. In other words, stationary FDP models are not suitable for the dynamic operational environment of enterprises. Since the number of financially distressed companies is often smaller than that of financially non-distressed companies, the data stream for dynamic FDP is class imbalanced rather than class balanced. On the other hand, the financial distress of enterprises in different industries may show different characteristics, so it is necessary to define the concept of financial distress from the view of a certain industry. Therefore, this study explores dynamic prediction of relative financial distress based on imbalanced data stream from the view of one industry. The process of financial condition evaluation and relative FDP is dynamically integrated based on the financial data stream of a certain industry. Hence, this study not only provides an important tool for enterprises to make dynamic predictions of relative financial distress from the view of the industry, but also supplements the theoretical system of FDP.

Literature review

Concept of financial distress

Traditionally, financial distress refers to a certain kind of financial difficulty faced by an enterprise. Foster (1986) defines financial distress as a serious liquidity problem that cannot be resolved without large-scale restructuring of the operation or structure of the economic entity. Doumpos and Zopounidis (1999) consider that financial distress also includes the situation of negative net asset value. In the twenty-first century, some studies define the financial distress of listed companies according to the regulations of the Stock Exchange. In the study of Rafiei et al. (2011), an Iranian company whose retained losses are more than 50% of its capital is labeled as being in financial distress according to the commercial law of 141 Act of Tehran Stock Exchange. In Ding et al. (2008), Sun and Li (2012), and Geng et al. (2015), financial distress is defined by the criteria of the special treatment mechanism of China Stock Exchange. The above concept is an absolute definition of financial distress with certain criteria. A company satisfying the criteria is labeled as financially distressed; otherwise it is labeled as financially healthy.

Sun et al. (2011) propose the definition of an enterprise’s relative financial distress, which is the relatively bad financial situation of a certain enterprise in the recent time span. As an enterprise’s life cycle progresses, the most recent time span moves on, and the relative financial situation of the same time point in different time spans may change, which is called the longitudinal concept drift of enterprise financial distress. This definition of relative financial distress is also adopted in the research of Sun et al. (2016).

Prediction of financial distress

Most research on FDP is based on absolute financial distress. Beaver (1966) applies univariate analysis based on financial ratios for bankruptcy prediction. Altman (1968) uses the statistical approach of multiple discriminant analysis to propose the famous Z-score model for bankruptcy prediction. Then, the statistical approach of the Logistic regression model is used for bankruptcy prediction (Ohlson 1980; Huang et al. 2012). More recently, Serrano-Cinca and Gutiérrez-Nieto (2013) apply partial least squares discriminant analysis to the prediction of American bank bankruptcy, an approach that is not restricted by multicollinearity of the independent variables.

The emergence of artificial intelligence and data mining techniques has promoted the development of FDP, and various single classifier algorithms have been applied to it. Frydman et al. (1985) use decision tree (DT) for bankruptcy prediction. In the 1990s, neural networks (NNs) were among the most widely used artificial intelligence methods for FDP, and many studies concluded that FDP based on NNs is more accurate than traditional statistical methods (Odom and Sharda 1990; Fletcher and Goss 1993; Carlos 1996; Zhang et al. 1999; Yang et al. 1999; Pendharkar 2005; Tseng and Hu 2010; Khashman 2011). After the support vector machine (SVM) was proposed (Vapnik 1998), it also came to be widely applied to FDP and showed good generalization performance (Shin et al. 2005; Min and Lee 2005; Ding et al. 2008; Xie et al. 2011; Sun and Li 2012). In addition, other artificial intelligence approaches such as the genetic algorithm (Kim and Han 2003), case-based reasoning (Li and Sun 2009), and rough sets (McKee 2000; Bose 2006) have also been applied in FDP research.

In the recent decade, more and more studies have focused on classifier ensemble approaches for FDP, which combine the predictions of multiple base classifiers instead of relying on a single classifier (Zhang et al. 2011). Sun and Li (2008) put forward a classifier ensemble for FDP based on a weighted majority voting combination of different classifiers, and it outperforms the base classifiers. However, Tsai and Wu (2008) find that the NNs ensemble does not perform better than the single best NNs classifier in many cases, possibly because the training dataset is too small. Alfaro et al. (2008) construct the NNs ensemble for FDP using the AdaBoost ensemble algorithm, and it has lower generalization error than the single NNs classifiers. Bagging and Boosting are the two most popular classifier ensemble algorithms, and Kim and Kang (2010) indicate that both can improve the performance of FDP based on NNs. Li and Sun (2011) propose the principal component case-based reasoning ensemble method for FDP, and validate that it outperforms the best base model. Sun and Li (2012) train the base SVM classifiers using different kernel functions and different feature selection methods to construct an SVM ensemble model for FDP. Kim and Upneja (2014) compare the AdaBoosted DT ensemble and the single DT classifier for predicting restaurant financial distress, and find that the former outperforms the latter. Wang and Wu (2017) propose a business failure prediction model based on two-stage selective ensemble with manifold learning algorithm and kernel-based fuzzy self-organizing map. Wang et al. (2018) incorporate sentiment and textual information into the ensemble random subspace method for FDP.

The above studies on FDP do not consider the concept drift of financial distress, which is the variation of the financial distress concept in a changing environment as time goes on (Sun and Li 2011). When there is financial distress concept drift, FDP models trained on old sample data may become unsuitable for current FDP. Some research has been carried out to deal with financial distress concept drift. Sun and Li (2011) put forward a dynamic FDP model based on instance selection and time window. Sun et al. (2013) propose the adaptive and dynamic ensemble of SVM based on data batch combination. Sun et al. (2017) integrate sample time weighting with AdaBoost SVM ensemble, which can dynamically update the AdaBoost SVM ensemble FDP model. Li et al. (2017) propose a time-varying Malmquist DEA method. Liu and Wu (2017) put forward the incremental bagging based on selective ensemble and employ genetic algorithm to optimize the base classifier combination. These studies are based on the time series of panel data batches. In contrast, some studies on dynamic FDP are based on the longitudinal financial data stream of a certain company. For example, Sun et al. (2011) combine the ex-post evaluation of financial condition based on principal component analysis (PCA) and the ex-ante prediction of financial distress based on NNs optimized by genetic algorithm for dynamic FDP, and Sun et al. (2016) propose another approach for dynamic evaluation and prediction of financial distress based on entropy-based weighting, SVM, and an enterprise’s vertical sliding time window.

Theoretical concepts

Relative financial distress from the view of one industry

The relative financial distress in Sun et al. (2011, 2016) is the relative financial condition deterioration from the viewpoint of one enterprise. Such relative financial distress is not labeled by some concrete criteria. Instead, it is the result of comparing the financial conditions of different time points for an enterprise. The financial distress concept adopted in this study also belongs to relative financial distress. However, it is the result of comparing the financial conditions of different enterprises that belong to the same class of industry. Therefore, the relative financial distress concept in this study is defined from the view of one industry. Its definition is as follows: with the development of an enterprise and its industry, the comprehensive evaluation score of solvency, profitability, operating capacity, and growth ability for the enterprise becomes relatively worse in the industry.

Financial distress concept drift

Concept drift is generally known as changes in the target concept, and these changes are induced by changes in the hidden context (Schlimmer and Granger 1986). For a classification problem based on data stream, concept drift leads to changes of mapping relationship. In other words, the concept drift of data stream causes changes of the mapping relationship between the features and the class labels, which is hidden in the data. Usually, the data of a certain time point only reflect the concept of this time, and the concept hidden in the data drifts with time passing on. Sun and Li (2011) first validate the existence of financial distress concept drift. That is, the target concept of financial distress changes with the changing environment, or the underlying data distribution gradually changes with the inflow of new sample data although the target concept of financial distress does not change. Thus, the stationary FDP model, which is constructed on the sample data of past stationary time span, cannot adapt to the requirement of future FDP in the changing environment, or may become inaccurate for future FDP.

Due to the existence of financial distress concept drift, the FDP model should be constructed based on financial data stream with a dynamic model updating mechanism, instead of a stationary dataset. Namely, when time moves forward for a period, the new informative sample data should be added into the training dataset and the too-old sample data should be eliminated from the training dataset, so as to preserve the model’s predictive ability for future financial distress.

Imbalanced FDP

In a class-imbalanced dataset, the samples of one class greatly outnumber the samples of the other class. The class with more samples is called the negative class or majority class, and the class with fewer samples is called the positive class or minority class. Class imbalance is widespread in the domains of fraud detection, bank credit scoring, text classification, as well as FDP. In most cases, decision makers care more about the minority class than the majority class, because the emergence of the minority class usually brings great losses to them. For example, falling into financial distress may disrupt the business activity of a company and may even end in bankruptcy.

Class imbalance is obvious for FDP, since most enterprises are financially healthy and only a few are considered financially distressed. For instance, there were 2831 listed companies in the China Stock Market in 2015, and only 47 of them were marked as ST or ST* because of financial distress or other irregularities. For a classification modeling problem like FDP, a model trained on a class-imbalanced training dataset usually shows unsatisfactory performance in recognizing the minority class, because most classification algorithms are based on the assumption of class balance. When such algorithms are applied to a class-imbalanced dataset, the information of the minority class is overwhelmed by the information of the majority class, and the trained classification model tends to be biased toward the majority class. This ultimately reduces the recognition accuracy for financially distressed companies.
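To make the bias toward the majority class concrete, the following minimal sketch takes the 2015 counts quoted above (2831 listed companies, 47 of them marked) and scores a hypothetical classifier that labels every company as non-distressed. The function name and the dictionary keys are illustrative assumptions, not part of the original study.

```python
def majority_baseline_metrics(n_total: int, n_minority: int) -> dict:
    """Score a hypothetical classifier that predicts 'non-distressed' for everyone."""
    true_negatives = n_total - n_minority   # every healthy company is labeled correctly
    false_negatives = n_minority            # every distressed company is missed
    return {
        "accuracy": true_negatives / n_total,
        "minority_recall": 0 / n_minority,  # no distressed company is recognized
        "missed_distressed": false_negatives,
        "majority_to_minority_ratio": true_negatives / n_minority,
    }


# Counts taken from the paragraph above.
metrics = majority_baseline_metrics(n_total=2831, n_minority=47)
print(metrics)
```

Under these counts the overall accuracy is above 98% although not a single distressed company is recognized, which is why overall accuracy alone is a poor guide for imbalanced FDP.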

Dynamic model for industry’s relative FDP

Overall design of the dynamic prediction model

Taking the fiscal year as the moving unit of the financial data stream, the width of the sample time window for training the FDP model is set as n years. That is, the financial ratio data of the recent n years before the current modeling year are used as the training dataset. Suppose T represents the FDP year. To train a (T − 1) FDP model, that is, a model that forecasts 1 year in advance, the financial data of the year (T − 1) should be paired with the financial condition labels of the year T to construct the training dataset. The recent n years before the year T are denoted as in formula (1).

$$ t = T - (n - k) - 1\quad (k = 1,\;2, \ldots ,n). $$

(1)

In the above formula, k represents the sequence number of the year. k = 1 corresponds to t = T − n, which is the starting year of the time window. k = n corresponds to t = T − 1, which is the last year of the time window. For example, suppose the width of the time window for training the model is 10 years and FDP is needed for the year 2015; then t = 2005, 2006,…,2014. To train a (T − 1) FDP model, the financial data of 2005–2013 should be paired, respectively, with the financial condition labels of 2006–2014 to construct the training dataset. Namely, the financial data of 2005 are paired with the financial condition labels of 2006, and so on. After training the FDP model, the financial data of 2014 are input into the model to output the predicted financial condition labels of 2015.
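The year bookkeeping in formula (1) and the 2015 example can be checked with a short sketch. The function names below are hypothetical helpers introduced only for illustration.

```python
def window_years(T: int, n: int) -> list[int]:
    """Years t = T - (n - k) - 1 for k = 1, ..., n, following formula (1)."""
    return [T - (n - k) - 1 for k in range(1, n + 1)]


def training_pairs(T: int, n: int) -> list[tuple[int, int]]:
    """(feature year, label year) pairs for a one-year-ahead model inside the window."""
    years = window_years(T, n)
    # The last window year has no label inside the window; it is the prediction input.
    return [(t, t + 1) for t in years[:-1]]


years = window_years(T=2015, n=10)
pairs = training_pairs(T=2015, n=10)
prediction_input_year = years[-1]

print(years)                  # window of modeling years
print(pairs[0], pairs[-1])    # first and last (feature year, label year) pair
print(prediction_input_year)  # year whose data are fed to the trained model
```

For T = 2015 and n = 10 this reproduces the example in the text: features of 2005–2013 are paired with labels of 2006–2014, and the data of 2014 are used to predict 2015.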

The framework of the dynamic model for an industry’s relative FDP is designed as depicted in Fig. 1. In detail, the model is composed of three modules: financial feature selection module, financial condition evaluation module, and FDP module.

Suppose the number of enterprises in a certain industry in the kth year of the time window is denoted as N_k (k = 1, 2,…,n), and the number of the initial financial features is denoted as m. The initial financial dataset of the kth year is represented as F_k (k = 1, 2,…,n), and they together constitute the initial total dataset F.

$$ F_{1} = \left\{ {x_{ij}^{1} } \right\}\quad (i = 1,\;2, \ldots ,N_{1} ;\;j = 1,\;2, \ldots ,m) = \left[ {\begin{array}{cccc} {x_{11}^{1} } &\quad {x_{12}^{1} } &\quad \cdots &\quad {x_{1m}^{1} } \\ {x_{21}^{1} } &\quad {x_{22}^{1} } &\quad \cdots &\quad {x_{2m}^{1} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{1} 1}}^{1} } &\quad {x_{{N_{1} 2}}^{1} } &\quad \cdots &\quad {x_{{N_{1} m}}^{1} } \\ \end{array} } \right], $$

(2)

$$ F_{n} = \left\{ {x_{ij}^{n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j = 1,\;2, \ldots ,m) = \left[ {\begin{array}{cccc} {x_{11}^{n} } &\quad {x_{12}^{n} } &\quad \cdots &\quad {x_{1m}^{n} } \\ {x_{21}^{n} } &\quad {x_{22}^{n} } &\quad \cdots &\quad {x_{2m}^{n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{n} } &\quad {x_{{N_{n} 2}}^{n} } &\quad \cdots &\quad {x_{{N_{n} m}}^{n} } \\ \end{array} } \right], $$

(3)

$$ F = F_{1} \cup F_{2} \cup \cdots \cup F_{n} . $$

(4)

In the financial feature selection module, a feature selection method is applied to the initial total dataset F, and the irrelevant financial features are deleted. In this study, the plus-L-minus-R feature selection approach is adopted. To construct the dynamic FDP model, feature selection should be carried out again each time the window moves forward to the next year. Suppose the number of selected financial features is denoted as m′. The financial data of the m′ selected features for the kth year of the time window can be denoted as $ F^{\prime}_{k} $ (k = 1, 2,…,n), and the total dataset of the selected features can be denoted as F′.

$$ F^{\prime}_{1} = \left\{ {x_{ij^{\prime}}^{\prime 1} } \right\}\quad (i = 1,\;2, \ldots ,N_{1} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime 1} } &\quad {x_{12}^{\prime 1} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime 1} } \\ {x_{21}^{\prime 1} } &\quad {x_{22}^{\prime 1} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime 1} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{1} 1}}^{\prime 1} } &\quad {x_{{N_{1} 2}}^{\prime 1} } &\quad \cdots &\quad {x_{{N_{1} m^{\prime}}}^{\prime 1} } \\ \end{array} } \right], $$

(5)

$$ F^{\prime}_{n} = \left\{ {x_{ij^{\prime}}^{\prime n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime n} } &\quad {x_{12}^{\prime n} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime n} } \\ {x_{21}^{\prime n} } &\quad {x_{22}^{\prime n} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{\prime n} } &\quad {x_{{N_{n} 2}}^{\prime n} } &\quad \cdots &\quad {x_{{N_{n} m^{\prime}}}^{\prime n} } \\ \end{array} } \right], $$

(6)

$$ F^{\prime} = F^{\prime}_{1} \, \cup F^{\prime}_{2} \, \cup \cdots \cup F^{\prime}_{n} . $$

(7)

The financial condition evaluation module uses the PCA method to evaluate the relative financial condition of all the enterprises in the industry year by year. For the kth year of the time window, each enterprise in the industry is labeled distressed or non-distressed, and the financial condition labels of all the enterprises in the industry compose the financial condition label vector of that year, which is denoted as Y_k (k = 1, 2,…,n).

$$ Y_{1} = \left[ {y_{1}^{1} ,\;y_{2}^{1} , \ldots ,y_{N_{1}}^{1} } \right], $$

(8)

$$ Y_{n} = \left[ {y_{1}^{n} ,\;y_{2}^{n} , \ldots ,y_{N_{n}}^{n} } \right]. $$

(9)

As mentioned above, each enterprise’s financial data of the year t − 1 should be paired with its financial condition label of the year t to construct a training sample for FDP modeling. However, the number of enterprises in the industry may vary across years. When new enterprises are set up in the year t, they have financial condition labels of the year t but no financial data of the year t − 1, so they should be omitted from the training samples of the year t. Similarly, when old enterprises drop out of the industry in the year t due to bankruptcy or merger and acquisition, they have financial data of the year t − 1 but no financial condition labels of the year t, so they should also be omitted from the training samples of the year t. For the year t, namely the kth year of the time window, suppose that K_k represents the number of enterprises that have both financial data of the year t − 1 and financial condition labels of the year t. These enterprises’ financial data of the year t − 1 are combined with their respective financial condition labels of the year t to construct the training dataset of the year t. The training datasets of all the years in the training time window are then combined to form the total training dataset for the current modeling year, which is denoted as TD. If the number of training samples in TD is denoted as K, then K = K₁ + K₂ +···+K_n.

$$ \begin{aligned} {\text{TD}} & = \left\{ {x_{{i^{\prime}j^{\prime}}}^{\prime} ,\;y_{{i^{\prime}}}^{\prime} } \right\}\quad (i^{\prime} = 1,\;2, \ldots ,K;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) \\ & = \left[ {\begin{array}{ccccc} {x^{\prime}_{11} } &\quad {x^{\prime}_{12} } &\quad \cdots &\quad {x^{\prime}_{{1m^{\prime}}} } &\quad {y^{\prime}_{1} } \\ {x^{\prime}_{21} } &\quad {x^{\prime}_{22} } &\quad \cdots &\quad {x^{\prime}_{{2m^{\prime}}} } &\quad {y^{\prime}_{2} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots &\quad \vdots \\ {x^{\prime}_{K1} } &\quad {x^{\prime}_{K2} } &\quad \cdots &\quad {x^{\prime}_{{Km^{\prime}}} } &\quad {y^{\prime}_{K} } \\ \end{array} } \right]. \\ \end{aligned} $$

(10)
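The construction of TD can be sketched as follows. This is a minimal illustration that assumes each year's features and labels are stored in dictionaries keyed by a company identifier; the identifiers, feature values, and labels are made-up toy data, and the function names are hypothetical.

```python
def build_year_samples(features_prev: dict, labels_curr: dict) -> list:
    """Pair year t-1 features with year t labels for enterprises present in both years."""
    common = sorted(features_prev.keys() & labels_curr.keys())
    return [(features_prev[e], labels_curr[e]) for e in common]


def build_training_dataset(features_by_year: dict, labels_by_year: dict,
                           label_years: list) -> list:
    """Concatenate the yearly samples into the total training dataset TD."""
    td = []
    for t in label_years:
        td.extend(build_year_samples(features_by_year[t - 1], labels_by_year[t]))
    return td


# Toy data (made up): company C leaves after 2013, company D enters in 2014.
features_by_year = {
    2013: {"A": [0.1, 0.5], "B": [-0.3, 0.2], "C": [0.7, -0.1]},
    2014: {"A": [0.2, 0.4], "B": [-0.4, 0.1], "D": [0.0, 0.9]},
}
labels_by_year = {
    2014: {"A": 0, "B": 1, "D": 0},  # D has a 2014 label but no 2013 data
    2015: {"A": 0, "B": 1, "D": 0},
}

td = build_training_dataset(features_by_year, labels_by_year, [2014, 2015])
print(len(td))  # K = K_1 + K_2 for the two label years
```

Here C is dropped because it has no label in 2014, and D contributes only to the 2015 samples, so the yearly sample counts are 2 and 3.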

In the FDP module, the training dataset TD is used to train the FDP model, which can be used to make relative FDP for the enterprises of the industry for the year T. The target enterprises’ financial data of the year T − 1 should be input into the FDP model, and the predicted relative financial condition labels denoted as the vector Z′ will be output. When the year T really comes, the financial condition evaluation module will be applied again for these enterprises based on their financial data of the year T, to produce the real relative financial condition labels, denoted as the vector Z. Finally, we can obtain the accuracy of industry’s relative FDP by comparing the vector Z′ and the vector Z.

This process of model construction and application for an industry’s relative FDP is dynamic, because the modeling time window rolls forward as time moves on. Since the width of the modeling time window is fixed, the newest year’s financial data flow in and the oldest year’s financial data are eliminated year by year. For each new current year, the feature selection module is run again to select the most relevant financial features, and the financial condition evaluation module is applied again to the new financial features to produce the new vector of relative financial condition labels. Then the new training dataset is constructed, and the new industry’s relative FDP model is trained to make predictions for the next year. Year by year, the ex-post evaluation of the industry’s relative financial condition and the ex-ante industry’s relative FDP are dynamically integrated. Therefore, the dynamic model designed for an industry’s imbalanced relative FDP adapts to the developing environment of the industry and can improve the efficiency of financial risk management for the enterprises of the industry.
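The fixed-width rolling window can be illustrated with a bounded queue. This is only a sketch of the year bookkeeping; in the full model, feature selection, PCA-based labeling, and classifier training would be rerun at each step, and the function name is an assumption.

```python
from collections import deque


def roll_windows(first_T: int, last_T: int, n: int) -> list:
    """Return (prediction year, window years) for each modeling step."""
    # Initial window: the n years before the first prediction year.
    window = deque(range(first_T - n, first_T), maxlen=n)
    steps = []
    for T in range(first_T, last_T + 1):
        steps.append((T, list(window)))
        # The newest year flows in; maxlen makes the oldest year drop out.
        window.append(T)
    return steps


steps = roll_windows(first_T=2015, last_T=2017, n=10)
for T, years in steps:
    print(T, years[0], years[-1])
```

With n = 10 the window for 2015 covers 2005–2014, the window for 2016 covers 2006–2015, and so on.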

Feature selection based on plus-L-minus-R approach

Feature selection, also known as variable selection or attribute selection, is the process of selecting a subset of relevant features (variables, predictors) for model construction, which reduces the dimensionality of the dataset while retaining almost all the useful information. Too many features usually make the model too complex and reduce its generalization ability. Feature selection reduces the number of features by eliminating the irrelevant ones, which helps save model training time, improves the accuracy of the model, and makes the model easier to understand. There are mainly three types of searching algorithms for feature selection, namely complete searching, heuristic searching, and random searching. Among them, complete searching and random searching are usually time consuming when the initial dimensionality is high, because the former considers all feature subsets and the latter is algorithmically complex. Heuristic searching is popular for its good efficiency and performance. The heuristic searching algorithms mainly include the sequence forward selection algorithm, the sequence backward selection algorithm, the plus-L-minus-R selection algorithm, and so on. The sequence forward selection algorithm and the sequence backward selection algorithm are greedy searching approaches and are prone to getting stuck in local optima. The plus-L-minus-R algorithm usually has better feature selection performance, because it integrates the ideas of those two algorithms and avoids their shortcomings. Therefore, this study makes feature selection based on the plus-L-minus-R approach.

The plus-L-minus-R approach has two forms. The first starts from the empty set, and in each iteration it first adds L features and then removes R features (L > R), so that the evaluation function finally reaches its optimal value. The second starts from the complete set, and in each iteration it first removes R features and then adds L features (L < R). Therefore, the parameters L and R must be set for the plus-L-minus-R approach. For the dynamic modeling of an industry’s relative FDP, feature selection is needed each time the model is updated and reconstructed.

Firstly, the normalization processing should be carried out for the initial dataset of financial features, to avoid the effect of different units of different dimensions. This study applies the following formula (11) for the purpose of data normalization.

$$ x_{\text{norm}} = - 1 + \frac{x - \min (x)}{\max (x) - \min (x)} \times (1 - ( - 1)). $$

(11)

In the above formula, min(x) and max(x), respectively, denote the minimum value and the maximum value of a given feature, so each feature is rescaled to the interval [−1, 1].
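A minimal sketch of formula (11) for a single feature column is given below. The function name is hypothetical, and the handling of a constant feature (where the formula is undefined) is an assumption made here, not something stated in the text.

```python
def normalize_feature(values: list[float]) -> list[float]:
    """Rescale one feature to [-1, 1] following formula (11)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: formula (11) is undefined for a constant feature; map it to 0.
        return [0.0 for _ in values]
    return [-1 + (v - lo) / (hi - lo) * (1 - (-1)) for v in values]


normalized = normalize_feature([10.0, 20.0, 30.0])
print(normalized)  # minimum maps to -1, maximum maps to 1
```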

Secondly, feature selection is carried out on the normalized dataset of financial features. Suppose the initial financial feature set contains m financial features, and the financial feature set after feature selection contains m′ financial features and is denoted as A. Before feature selection starts, the initial parameters should be set, for example, A₀ = Ø, L = 4, R = 3, d_end = 10. This means that (1) feature selection starts from the empty set; (2) in each iteration, 4 features are first added and then 3 features are removed; and (3) feature selection ends when the number of selected features reaches 10. Suppose the evaluation function for feature selection is J(·); the plus-L-minus-R algorithm follows formula (12) to add features and formula (13) to delete features.

$$ \left\{ {\begin{array}{l} {a^{ + } = \arg \max_{a \notin A_{d}} \;J(A_{d} + a)}, \\ {A_{d + 1} = A_{d} + a^{ + } }, \\ {d = d + 1}, \\ \end{array} } \right. $$

(12)

$$ \left\{ {\begin{array}{l} {a^{ - } = \arg \max_{a \in A_{d}} \;J(A_{d} - a)}, \\ {A_{d - 1} = A_{d} - a^{ - } }, \\ {d = d - 1}. \\ \end{array} } \right. $$

(13)

The iteration of adding and deleting features ends when the number of selected features reaches d_end, at which point the final financial feature set for FDP is obtained.
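The first form of the procedure (start from the empty set, L > R) can be sketched as below. This is an illustrative implementation under stated assumptions: the evaluation function J is passed in by the caller (the text does not fix it here), the stopping check is applied as soon as the set reaches d_end during the plus phase, and the additive toy J with made-up weights exists only to make the example runnable.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List


def plus_l_minus_r(features: Iterable[Hashable],
                   J: Callable[[FrozenSet], float],
                   L: int = 4, R: int = 3, d_end: int = 10) -> List[Hashable]:
    """Plus-L-minus-R search starting from the empty set (requires L > R)."""
    all_features = list(features)
    if not L > R >= 0:
        raise ValueError("this form requires L > R >= 0")
    if d_end > len(all_features):
        raise ValueError("d_end cannot exceed the number of features")
    selected: List[Hashable] = []
    while len(selected) < d_end:
        # Plus phase: add, one at a time, the feature that maximizes J (formula 12).
        for _ in range(L):
            if len(selected) == d_end:
                break
            candidates = [a for a in all_features if a not in selected]
            best = max(candidates, key=lambda a: J(frozenset(selected + [a])))
            selected.append(best)
        if len(selected) == d_end:
            break
        # Minus phase: remove, one at a time, the feature whose removal
        # leaves the highest J (formula 13).
        for _ in range(R):
            drop = max(selected,
                       key=lambda a: J(frozenset(x for x in selected if x != a)))
            selected.remove(drop)
    return selected


# Toy evaluation function (assumption): sum of made-up per-feature weights.
weights = {"f0": 6.0, "f1": 5.0, "f2": 4.0, "f3": 3.0, "f4": 2.0, "f5": 1.0}
result = plus_l_minus_r(weights, lambda s: sum(weights[a] for a in s),
                        L=2, R=1, d_end=3)
print(result)
```

With an additive J the search simply recovers the highest-weighted features; the plus and minus phases matter when J captures interactions between features, for example the validation accuracy of a classifier trained on the subset.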

Financial condition evaluation module based on PCA

It is worth mentioning that the industry’s relative FDP model in this study predicts an enterprise’s relative financial distress from the view of one industry, which means that the financial condition labels of the training samples are the enterprises’ relative financial performances in a certain industry rather than absolute criteria such as bankruptcy or the special treatment of Chinese listed companies. PCA, which has been used for the synthetic evaluation of financial performance (Sun et al. 2011), city economic performance (Zhu 1998), bank risk (Fang et al. 2018), etc., is used to evaluate the relative financial condition of all the enterprises in the industry. Given the normalized financial dataset after feature selection at the current time point, namely $ F^{\prime}_{n} $ in formula (14), the relative financial condition evaluation process based on PCA is as follows:

$$ F^{\prime}_{n} = \left\{ {x_{ij^{\prime}}^{\prime n} } \right\}\quad (i = 1,\;2, \ldots ,N_{n} ;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}) = \left[ {\begin{array}{cccc} {x_{11}^{\prime n} } &\quad {x_{12}^{\prime n} } &\quad \cdots &\quad {x_{{1m^{\prime}}}^{\prime n} } \\ {x_{21}^{\prime n} } &\quad {x_{22}^{\prime n} } &\quad \cdots &\quad {x_{{2m^{\prime}}}^{\prime n} } \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ {x_{{N_{n} 1}}^{\prime n} } &\quad {x_{{N_{n} 2}}^{\prime n} } &\quad \cdots &\quad {x_{{N_{n} m^{\prime}}}^{\prime n} } \\ \end{array} } \right]. $$

(14)

(1) Calculate the contribution rate and the cumulative contribution rate
Suppose the correlation matrix for the m′ financial features is denoted as R in formula (15).
$$ R = \left[ {\begin{array}{cccc} {r_{11} } &\quad {r_{12} } &\quad \cdots &\quad {r_{{1m^{\prime}}} } \\ {r_{21} } &\quad {r_{22} } &\quad \cdots &\quad {r_{{2m^{\prime}}} } \\ \vdots &\quad \vdots &\quad \vdots &\quad \vdots \\ {r_{{m^{\prime}1}} } &\quad {r_{{m^{\prime}2}} } &\quad \cdots &\quad {r_{{m^{\prime}m^{\prime}}} } \\ \end{array} } \right]. $$
(15)
Then the eigenvalues $ \lambda_{j^{\prime}} $ (j′ = 1, 2,…,m′) and eigenvectors $ e_{j^{\prime}} $ (j′ = 1, 2,…,m′) are calculated by the Jacobi method, and the eigenvalues are sorted in descending order, namely $ \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{m^{\prime}} \ge 0. $ The contribution rate and the cumulative contribution rate of each principal component are, respectively, denoted as $ \beta_{j^{\prime}} $ and P(j′).
$$ \beta_{{j^{\prime}}} = \frac{{\lambda_{{j^{\prime}}} }}{{\sum\nolimits_{{j_{1} = 1}}^{{m^{\prime}}} {\lambda_{{j_{1} }} } }}\quad (j^{\prime} = 1,\;2, \ldots ,m^{\prime}), $$
(16)
$$ P(j^{\prime}) = \frac{{\sum\nolimits_{{j_{1} = 1}}^{{j^{\prime}}} {\lambda_{{j_{1} }} } }}{{\sum\nolimits_{{j_{1} = 1}}^{{m^{\prime}}} {\lambda_{{j_{1} }} } }}\quad (j^{\prime} = 1,\;2, \ldots ,m^{\prime}). $$
(17)
(2) Calculate the financial condition evaluation score
Suppose λ₁, λ₂,…,λ_l are the leading eigenvalues whose cumulative contribution rate P(l) exceeds 90%, and they correspond to l (l ≤ m′) principal components. Calculate the loadings of these principal components as follows:
$$ g_{{bj^{\prime}}} = \sqrt{\lambda_{b}}\, e_{{bj^{\prime}}} \quad (b = 1,\;2, \ldots ,l;\;j^{\prime} = 1,\;2, \ldots ,m^{\prime}). $$
(18)
Therefore, the l principal components of the dataset $ F^{\prime}_{n} $ can be expressed as follows:
$$ \left\{ \begin{array}{l} s_{i1}^{n} = g_{11} x_{i1}^{\prime n} + g_{12} x_{i2}^{\prime n} + \cdots + g_{{1m^{\prime}}} x_{{im^{\prime}}}^{\prime n} \\ s_{i2}^{n} = g_{21} x_{i1}^{\prime n} + g_{22} x_{i2}^{\prime n} + \cdots + g_{{2m^{\prime}}} x_{{im^{\prime}}}^{\prime n} \\ \cdots \\ s_{il}^{n} = g_{l1} x_{i1}^{\prime n} + g_{l2} x_{i2}^{\prime n} + \cdots + g_{{lm^{\prime}}} x_{{im^{\prime}}}^{\prime n} \\ \end{array} \right.\quad (i = 1,\;2, \ldots ,N_{n} ). $$
(19)
Thus, the dataset $ F^{\prime}_{n} $ with m′ features can be transformed into S_n with l principal components:
$$ S_{n} = \left[ {\begin{array}{cccc} {s_{11}^{n} } &\quad {s_{12}^{n} } &\quad \cdots &\quad {s_{1l}^{n} } \\ {s_{21}^{n} } &\quad {s_{22}^{n} } &\quad \cdots &\quad {s_{2l}^{n} } \\ \vdots &\quad \vdots &\quad \vdots &\quad \vdots \\ {s_{{N_{n} 1}}^{n} } &\quad {s_{{N_{n} 2}}^{n} } &\quad \cdots &\quad {s_{{N_{n} l}}^{n} } \\ \end{array} } \right]. $$
(20)
Finally, the comprehensive financial condition scores of the N_n companies can be calculated as follows:
$$ Z_{i}^{n} = \sum\limits_{b = 1}^{l} {\beta_{b} s_{ib}^{n} } \quad (i = 1,\;2, \ldots ,N_{n} ). $$
(21)
(3) Evaluate the financial condition
A company with a higher comprehensive score is in a relatively better financial situation within the industry at the current time point and is less likely to fall into financial distress. Conversely, a company with a lower comprehensive score is in a relatively worse financial situation within the industry at the current time point and is more likely to fall into financial distress. With the comprehensive scores sorted in descending order, the companies ranked at or before the p percent position are regarded as financially safe, and those ranked after it are regarded as financially distressed. In this study, p equals 70%. That is, the companies ranked in the top 70% are labeled as negative (financial safety), and those ranked in the last 30% are labeled as positive (financial distress).
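The three steps above can be illustrated with a minimal Python sketch based on NumPy. It is not the authors' implementation. The function names `pca_financial_scores` and `label_relative_distress` are hypothetical, `np.linalg.eigh` is used as a stand-in for the Jacobi method, and rounding the cut-off position to the nearest integer is an assumption, since the text does not say how a fractional position is handled. Note also that the sign of each eigenvector is arbitrary, so in practice the orientation of the components would need to be checked before higher scores are read as better financial condition.

```python
import numpy as np


def pca_financial_scores(X, cum_threshold=0.90):
    """Comprehensive scores Z_i^n for a normalized dataset F'_n of shape (N_n, m')."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)             # correlation matrix, formula (15)
    # Assumption: a symmetric eigen-solver replaces the Jacobi method of the text.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in descending order
    eigvals = np.clip(eigvals[order], 0.0, None) # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]

    beta = eigvals / eigvals.sum()               # contribution rates, formula (16)
    P = np.cumsum(beta)                          # cumulative rates, formula (17)
    # First l components whose cumulative contribution rate exceeds the threshold.
    l = min(int(np.searchsorted(P, cum_threshold, side="right")) + 1, len(beta))

    # Loadings g_{bj'} = sqrt(lambda_b) * e_{bj'}, formula (18); shape (l, m').
    G = (np.sqrt(eigvals[:l]) * eigvecs[:, :l]).T
    S = X @ G.T                                  # principal components, formulas (19)-(20)
    Z = S @ beta[:l]                             # comprehensive scores, formula (21)
    return Z, beta, l


def label_relative_distress(Z, p=0.70):
    """Label the top p share of scores as 0 (safe) and the rest as 1 (distress)."""
    Z = np.asarray(Z, dtype=float)
    n_safe = int(round(p * len(Z)))              # assumption: round to nearest position
    order = np.argsort(-Z, kind="stable")        # indices sorted by descending score
    labels = np.ones(len(Z), dtype=int)
    labels[order[:n_safe]] = 0
    return labels
```

With ten companies and p = 70%, `label_relative_distress` marks the seven highest-scoring companies as negative and the remaining three as positive, which matches the 70/30 split used in this study.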

Ensemble FDP module based on SMOTE–AdaBoost

This study focuses on one industry’s relative financial distress. The number of sample companies in a given industry may be very limited, and the financial distress samples are fewer still than the financial safety samples. Therefore, SMOTE and AdaBoost are combined to construct the FDP model.

Zhou (2013) finds that SMOTE is appropriate for oversampling the positive financial distress samples when they are much fewer than the negative financial safety samples. SMOTE, proposed by Chawla et al. (2002), generates new positive financial distress samples that are similar to but different from the original ones. For each minority positive sample x⁺, it finds the k nearest neighbors (KNNs) among the positive samples and randomly selects n_smote samples from them, denoted as x⁺_near₁, x⁺_near₂,…,x⁺_near_{n_smote}. Then n_smote new positive samples are generated between x⁺ and x⁺_near_i (i = 1, 2,…,n_smote) according to formula (22), in which rand(0,1) produces a random number between 0 and 1.

$$ x^{ + } {\text{\_smote}}_{i} = x^{ + } + {\text{rand(}}0,1) \times \left( {x^{ + } \_{\text{near}}_{i} - x^{ + } } \right). $$

(22)
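Formula (22) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the study. The function name `smote`, the `seed` argument, the use of Euclidean distance, and the fallback of sampling neighbors with replacement when fewer than n_smote neighbors exist are all assumptions.

```python
import math
import random


def smote(minority, k=5, n_smote=2, seed=None):
    """Generate n_smote synthetic samples per minority sample, following formula (22)."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for idx, x in enumerate(minority):
        # k nearest minority neighbors of x, excluding x itself (Euclidean distance).
        others = [y for j, y in enumerate(minority) if j != idx]
        others.sort(key=lambda y: math.dist(x, y))
        neighbors = others[:k]
        # Pick n_smote neighbors; sample with replacement if too few are available.
        if len(neighbors) >= n_smote:
            chosen = rng.sample(neighbors, n_smote)
        else:
            chosen = [rng.choice(neighbors) for _ in range(n_smote)]
        for near in chosen:
            gap = rng.random()  # rand(0,1) in formula (22)
            # New sample lies on the segment between x and its chosen neighbor.
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, near)])
    return synthetic
```

Because each synthetic sample is a convex combination of two existing positive samples, it always lies on the line segment between them, which is why the new samples are similar to but different from the originals.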

AdaBoost is an ensemble learning algorithm proposed by Freund and Schapire (1997). It iteratively trains multiple base classifiers on one training set and makes each newly trained base classifier focus more on the training samples misclassified by the previous one. This is realized by adjusting the weights of the training samples in each iteration. Namely, in the initial iteration, all training samples have the same weight and the weights sum to 1. After a base classifier is trained, the weights of the samples it classifies correctly are reduced, while the weights of the samples it misclassifies are increased. After R rounds of iteration, R base classifiers have been trained and are combined into a stronger ensemble classifier.
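The weight adjustment described above can be sketched as a single reweighting round. The sketch uses one common textbook formulation of discrete AdaBoost, with classifier weight α = ½ ln((1 − ε)/ε), and is not necessarily the exact variant listed in Table 1. The function name `adaboost_reweight` is hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the renormalized sample weights and alpha."""
    # Weighted error of the current base classifier.
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5) for this update")
    alpha = 0.5 * math.log((1.0 - eps) / eps)  # weight of the base classifier
    # Increase the weights of misclassified samples, decrease the others.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha
```

For example, with four equally weighted samples of which one is misclassified, the misclassified sample's weight rises from 0.25 to 0.5 and each correctly classified sample's weight falls to 1/6, so the next base classifier concentrates on the earlier mistake.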

The SMOTE–AdaBoost (SMOTEBoost) ensemble model for prediction of one industry’s relative financial distress is shown in Fig. 2. Firstly, SMOTE is used to balance the number of positive financial distress samples and the number of negative financial safety samples by artificially generating new positive samples. Then the AdaBoost ensemble classifier is constructed on a chosen base classification algorithm, such as SVM, DT, KNN, or Logistic regression. The algorithm is described in Table 1.

Table 1 The algorithm of AdaBoost ensemble classifier
