
1 Introduction

Financial distress, in its simplest definition, is a specific type of financial difficulty that a company faces due to internal or external reasons and tries to overcome. Financial difficulties are the obstacles a company faces in meeting its obligations, such as a lack of liquidity, a lack of owner’s equity, failure to pay debts, and a lack of capital (Sun et al., 2014a). Companies face legally binding bankruptcy if they cannot overcome these obstacles for a long time. Given all this, financial distress can be defined as a long and difficult process that starts with a firm’s inability to meet its obligations and extends to bankruptcy.

The classical literature limited financial failure to the event of bankruptcy alone. However, as some authors have pointed out, financial failure may not always result in bankruptcy. A company can avoid bankruptcy even during a troubled period by accelerating cash flows through selling assets, downsizing, or closing loss-making operations (Hashi, 1997). On the other hand, a company can unexpectedly come to the brink of bankruptcy due to unpredictable external shocks such as natural disasters, legal cases that end badly, or global economic and financial crises, even if it did not face financial difficulties previously (Meyer, 1982). Therefore, a single measure such as bankruptcy cannot capture financial distress on its own. It is more realistic to assess financial distress as a process rather than a specific incident, even though this makes it more complex to define and classify exactly. As a process, financial distress corresponds to steps that follow one another sequentially, rather than a single event (Agostini, 2013).

Financial distress can occur at different scales in companies, while its results can affect an entire economy through a domino effect. The distress of companies can leave states and all stakeholders connected to the firm in the financial and public sectors in a difficult situation. Therefore, predicting distress by developing a good prediction model gives the company, its stakeholders, and creditor institutions an opportunity to decrease the costs that would arise in case of distress and to manage and monitor the process (Zhou et al., 2015).

Financial distress prediction (FDP), which is an important research topic in the fields of finance, economics, accounting, and engineering, is also called bankruptcy prediction or corporate distress prediction. In general, FDP is the prediction of whether a firm will fail based on its current financial data, using mathematical, statistical, and artificial intelligence techniques. It is accepted that financial distress often remains under the surface, whereas bankruptcy becomes open and obvious to all upon its declaration; therefore, financial distress requires in-depth analysis (Pindado and Rodrigues, 2005; Doğan, 2020: 13).

In recent years, academic and industrial interest in this topic has increased because of the growing number of firm bankruptcies triggered by economic crises. In the early years, researchers used classical statistical techniques despite some disadvantages, while in recent years they have focused on developing early warning models suitable for FDP with machine learning methods. This study uses the Support Vector Machine (SVM), a powerful machine learning method, and there are many successful FDP studies performed with SVM. This study aims to contribute to the literature through the feature selection and parameter optimization phases, whose importance for SVM has recently been revealed.

2 Theoretical Background

The concept of financial distress is very important in financial research. Many different approaches to this subject exist, from univariate ratio analysis to multivariate prediction methods, from traditional statistical methods to artificial intelligence-based machine learning methods, and from single-classifier methods to hybrid methods designed to combine different classifiers (Sun et al., 2014b; Kumar and Ravi, 2007; Lin et al., 2012). Financial distress prediction (FDP) through statistical models dates back to the 1960s. The first of those studies was Beaver’s (1966), which proposed a univariate model that tried to capture the financial distress of an enterprise by dealing with financial ratios individually and thus obtaining a general idea of the financial risk of the enterprise. The study is considered a pioneering study in the finance literature, but in the following years it was criticized because financial distress or business performance cannot be measured by a single financial ratio and the prediction capacity of such a model is very low. Following those criticisms, Altman (1968) used multivariate statistical methods for the first time through the Z-score model he developed. According to the results of the study, more reliable and consistent findings were obtained by evaluating different financial ratios together with their weights. After Altman’s success, methods such as the multiple-regression analysis introduced by Meyer and Pifer (1970), the logistic regression analysis (LRA) introduced by Ohlson (1980), and the probit model introduced by Zmijewski (1984) were applied in the field. However, assumptions of traditional methods such as linearity, normality, independence of the predictor variables, and a pre-specified functional form between the dependent and independent variables can rarely be satisfied in real-life problems. Today, there are alternative methods, developed on the basis of artificial intelligence techniques, that are less sensitive to the above-mentioned assumptions.

Decision trees (DT) are frequently used in artificial intelligence-based studies on FDP because they are easy to understand and interpret. Gepp and Kumar (2008), Gepp et al. (2010), and Li et al. (2010) proposed DT, classification and regression trees (CART), and C5.0 algorithms for FDP, and showed that they yield better results than multiple discriminant analysis (MDA). Chen (2011) used the C5.0, CART, CHAID, and LRA methods in his FDP study on businesses listed on the Taiwanese stock exchange. The findings show that the predictive power of decision trees increases as the year of financial distress approaches. Genetic programming (GP), one of the meta-heuristic methods, was used by Etemadi et al. (2009) for bankruptcy prediction and was shown to perform better than MDA.

The artificial neural network (ANN) is a highly powerful instrument for pattern recognition and classification problems due to its non-linear, non-parametric, adaptive learning properties. ANN can very effectively represent and capture non-linear relationships in a data set. ANN was first applied to bankruptcy prediction by Odom and Sharda (1990). They also applied multiple-variable discriminant analysis (MVDA) to their sample of 129 enterprises, 65 of which went bankrupt. As a result, the correct classification rate was 74.28% for MVDA, whereas it reached 81.81% for ANN. Many similar studies have emphasized that ANN performs better than statistical methods (Tam, 1991; Tam and Kiang, 1992; Fletcher and Goss, 1993; Zhang et al., 1999; Liang and Wu, 2005).

SVM, developed by Vapnik (1995), has also attracted many researchers since it provides considerable results. The most fundamental difference between SVM and ANN is that SVM is based on structural risk minimization, whereas ANN aims to minimize the empirical risk, that is, the training set error. SVM adopts the principle of structural risk minimization, which has been shown to yield better performance than empirical risk minimization, and uses quadratic programming to find a single optimal separating plane in the hidden feature space (Min et al., 2006; Zhongsheng et al., 2007). Fan and Palaniswami (2000) applied SVM for the first time on three different datasets using the financial ratios suggested by the three best-known models in the literature (Altman, 1993; Lincoln, 1984; Ohlson, 1980). In addition, they tested the success of SVM against financial failure prediction models developed with MDA, a multi-layer perceptron (MLP), and learning vector quantization (LVQ). Min and Lee (2005) also applied SVM to bankruptcy prediction problems. The results of the study show that, compared to ANN, SVM both gives better results and can be trained with a smaller training set. To validate the high classification rate, SVM is compared with backpropagation ANN, multiple-variable discriminant analysis, and Logit models, and according to the empirical results, SVM provided better results than all the other methods. Shin et al. (2005) compared SVM to ANN to show the effectiveness of SVM, and SVM yielded better empirical results. The study also emphasizes two important points: first, SVM reaches a better generalization capacity with fewer training samples since it tries to capture the geometric structure of the feature space without reproducing the weights of the training samples; second, this makes SVM more useful than ANN, as ANN has certain limitations regarding classification problems. Similarly, Shin et al. (2005) made financial distress predictions for Chinese firms, compared SVM to the other methods used in the above-mentioned study, and reached the same conclusion.

Wu et al. (2007) presented a very comprehensive study on financial failure prediction using MDA, the logit model, the probit model, ANN, and SVM. The study aimed to enhance the predictive performance of SVM; for this purpose, the researchers optimized the SVM parameters using a Genetic Algorithm (GA). Liang et al. (2016) presented a comprehensive study in which the main classification method was SVM, applied to 239 successful and 239 failed companies traded on the Taiwan Stock Exchange from 1999 to 2009, and the SVM inputs were investigated. The success of SVM was tested against four machine learning methods that are well established in the literature: k-NN, Naïve Bayes (NB), CART, and MLP. According to the experimental results, SVM was found to be the best prediction model. The reasons why SVM is preferred over other data mining techniques in the present study are that SVM yields equivalent or better results, can work with fewer training samples, and has fewer parameters to adjust. For this reason, the main prediction method of the study is SVM. The contribution of the study to the literature is that new ways of increasing the predictive accuracy rate of SVM are explored. Different processes affect the predictive accuracy of SVM. One of these is the determination of the optimal feature set (or variables) that provides quality information to the classifier. The learner may encounter redundant, irrelevant, or interrelated data while learning the geometric structure of the classifier’s feature space. When too much unnecessary information is given to the model as input, considerable time and cost are spent and the model’s accuracy rate may even decrease slightly (Piramuthu, 2004; Huang and Wang, 2006). However, it is not easy to interpret or exclude unnecessary information. For this reason, filtering large amounts of data and condensing it so that it provides more information is an important issue, especially in financial failure prediction (Tsai, 2008). In most current studies, the informative financial ratios are chosen from the financial ratios produced by earlier prediction models. The classification ability of these models therefore largely depends on the studies from which the selected financial ratios are taken (Wu et al., 2006).

In the first studies in the FDP literature (Beaver, 1966; Altman, 1968), feature selection was generally carried out using a qualitative approach, such as the popularity of features (financial ratios), good results in past studies, or expert opinion. This approach has been replaced by quantitative selection techniques over time. Jo et al. (1997), Atiya (2001), Park and Han (2002), Shin and Lee (2002), Min and Lee (2005), Ding et al. (2008), Chen (2011), and Li and Sun (2012) selected features using statistical methods such as stepwise regression, the t-test, the correlation matrix, factor analysis, and principal component analysis, which fall into the filter category. Min et al. (2006) and Wu et al. (2007) preferred GAs, which fall into the wrapper category of feature selection methods. These studies emphasized that the power of the prediction model depends on the selected prediction method and feature set. However, another important factor that increases prediction performance is the search for the optimal parameter set. SVM has two important parameters called “C” and “gamma”. Many studies have emphasized that parameter optimization improves the performance of SVM (Wu et al., 2007; Shin et al., 2005), but only a limited number of studies investigate both the optimal parameter pair and the optimal feature set. In this study, both parameter optimization and feature selection are used for SVM. The feature selection method preferred in the study is LRA. Despite some limitations, LRA is a multivariate statistical method that is frequently preferred in studies of financial failure prediction. For this reason, it is also used as an alternative method to test the success of SVM. For parameter optimization, the Grid search technique, one of the easy and effective methods, is preferred; this technique is presented in Sect. 3. In the next section, the empirical results are summarized. The final section presents a general summary of the study.

3 Proposed Methods for the Prediction Model

This section presents the working principle of SVM for a typical two-class classification problem and explains LRA, a multiple-variable statistical technique. For detailed explanations of SVM, please refer to Gunn (1998), Smola and Schölkopf (1997), and Cristianini and Shawe-Taylor (2000).

3.1 SVM Classifier

Let the training set S consist of the sample–class label pairs \((x_{i} ,\,\,y_{i} ),\,\,i = 1,\,\,2,\,\,...,\,\,m\), where \(x_{i} \in \Re^{n}\) is a vector of n features (attributes) and \(y_{i} \in \left\{ { + 1, - 1} \right\}\) is the class label. The linear hyperplane that separates the training set S according to the classes of its samples is formulated as follows:

$$\left\langle {w,x} \right\rangle + b = 0$$
(1)

There can be many linear planes that separate the problem linearly. This can be seen in Fig. 1:

Fig. 1 Linear classification

However, the aim is to find the most suitable separating hyperplane. This hyperplane maximizes the distance between the support vectors of different classes, which is called the margin. The distance between the separating plane \(\left\langle {w,x} \right\rangle + b = 0\) and a newly observed pattern \(x^{\prime}\) is given by \(\left| {\left\langle {w,x^{\prime}} \right\rangle + b} \right|/\left\| w \right\|\). Each training pattern is at least \(\Delta \,\) distant from the decision boundary, and the distance of each training sample from the hyperplane for \(y_{i} \in \left\{ { + 1, - 1} \right\}\) satisfies

$$ \frac{{y_{i} \left[ {\left\langle {w,x_{i} } \right\rangle + b} \right]}}{\left\| w \right\|} \ge \Delta ,\,\,\,\,\,i = 1,...,n $$
(2)

where equality holds at the limit value, as given in Eq. (3):

$$ \frac{1}{\left\| w \right\|}\left( {\min_{x:y = + 1} \left| {\left\langle {w,x_{i} } \right\rangle + b} \right| + \min_{x:y = - 1} \left| {\left\langle {w,x_{i} } \right\rangle + b} \right|} \right) = \frac{2}{\left\| w \right\|} $$
(3)

The hyperplane that best separates the training samples is the one that minimizes \(\eta (w) = \frac{1}{2}\left\| w \right\|^{2}\). Finding the optimum hyperplane for separable data is a quadratic optimization problem with linear constraints. The problem is modeled as follows:

$$ \begin{gathered} \mathop {Min}\limits_{w,b} \frac{1}{2}w^{T} w \hfill \\ subject\,to:y_{i} (\left\langle {w \cdot x_{i} } \right\rangle + b) - 1 \ge 0 \hfill \\ \end{gathered} $$
(4)

If the problem has a very large data space, it is not practical to seek a solution through the primal model. Therefore, it is beneficial to construct the dual of the problem. For that purpose, the Kuhn–Tucker theorem is used (Strang, 1986: 538–540), and the construction consists of two steps. In the first step, an unconstrained optimization problem is formed using the Lagrange function:

$$ L_{D} (w,b,\alpha ) = \frac{1}{2}w^{T} w - \sum\limits_{i = 1}^{m} {\alpha_{i} \left[ {y_{i} \left( {\left\langle {w \cdot x_{i} } \right\rangle + b} \right) - 1} \right]} $$
(5)

In the above equation, the \(\alpha_{i}\) are the dual Lagrange multipliers, and the function should be maximized over them subject to the condition \(\alpha_{i} \ge 0\). On the other hand, the Lagrange function should be minimized with respect to w and b. Therefore, a saddle point of the Lagrange function is required. To satisfy the Karush–Kuhn–Tucker (KKT) conditions, the derivatives of the function with respect to w and b are taken and the function is expressed only in terms of the \(\alpha_{i}\) parameters; the constrained optimization problem is then rewritten. That is the second step of forming the dual model: the Lagrange function is rearranged using the KKT conditions. Thus, the dual problem is formulated as:

$$ \begin{gathered} \mathop {Max}\limits_{\alpha } \,\,L_{D} (\alpha ) = \sum\limits_{i = 1}^{m} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} \left\langle {x_{i} ,x_{j} } \right\rangle } } \hfill \\ subject\,\,to:\,\,\sum\limits_{i = 1}^{m} {\alpha_{i} } y_{i} = 0\,\,\,\,,\,\,\,\,\alpha_{i} \ge 0,\,\,\,\,i = 1,...,m \hfill \\ \end{gathered} $$
(6)

The Lagrange function should be maximized with respect to the non-negative variables \(\alpha_{i}\) in order to find the optimal separating hyperplane. In the dual optimization problem, the \(\alpha_{i}\) determine the hyperplane parameters \(w^{*}\) and \(b^{*}\). Thus, the optimal separating decision function \(f(x) = {\text{sgn}} (\left\langle {w^{*} \cdot x} \right\rangle + b^{*} )\) is rewritten as:

$$ f(x) = y = {\text{sgn}} \left( {\sum\limits_{i = 1}^{m} {y_{i} \alpha_{i}^{*} \left\langle {x_{i} ,x} \right\rangle + b^{*} } } \right) $$
(7)

In a typical classification problem, only a small subset of the Lagrange multipliers \(\alpha_{i}\) tends to be larger than zero. The training vectors corresponding to these positive multipliers are geometrically very close to the optimal separating plane. These vectors are termed support vectors, and the optimal separating hyperplane is defined only by these support vectors.

If the problem is complex and non-linear, the margin could take a negative value and the feasible region of the problem is empty. To overcome this situation, which makes the solution impossible, either the strict inequalities are relaxed, which is called “soft margin optimization”, or the problem is made linearly separable using the kernel trick. Soft margin optimization can be applied by making a small change in the solution explained above for linearly inseparable data.

In Fig. 2 below, (a) is an example of data that is linearly separated by the maximal margin, and (b) is an example of data that cannot be separated linearly.

Fig. 2 Linearly separable data (a), and linearly inseparable data (b)

In the second situation, the data can be linearly separated by allowing a specific error for misclassified samples. In this case, the problem aims to find the hyperplane that minimizes the training errors by means of slack variables:

$$ \begin{gathered} \mathop {Min}\limits_{w,b,\xi } \,\,\,\frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{m} {\xi_{i} } \hfill \\ subject\,to:y_{i} (\left\langle {w \cdot x_{i} } \right\rangle + b) \ge 1 - \xi_{i} \,\,and\,\,\xi_{i} \ge 0,\,\,\,\,i = 1,...,m \hfill \\ \end{gathered} $$
(8)

In the above model, the penalty parameter on training errors is represented by C, and the non-negative slack variables are represented by \(\xi_{i}\). This optimization problem can also be solved via the Lagrange multipliers technique, and the solution proceeds in almost the same way as in the linear learning case. The dual model is given below:

$$ \begin{gathered} \mathop {Max}\limits_{\alpha } \,\,L_{d} (\alpha ) = \sum\limits_{i = 1}^{m} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {y_{i} y_{j} \alpha_{i} \alpha_{j} \left\langle {x_{i} ,x_{j} } \right\rangle } } \hfill \\ subject\,to:\,\,\,\sum\limits_{i = 1}^{m} {y_{i} \alpha_{i} } = 0\,\,\,,\,\,0 \le \alpha_{i} \le C,\,\,\,i = 1,...,m \hfill \\ \end{gathered} $$
(9)

In model (9), the upper bound of the Lagrange multipliers is the penalty parameter C, which is predetermined by the user. Besides, the optimal separating hyperplane function is the same as in Eq. (7). In the non-linear SVM, a mapping function \(\varphi\) is applied to the training samples. By using an appropriate kernel function that defines the dot product (inner product) in the feature space, the classifier can separate non-linear data. The kernel function given in Eq. (10) replaces the inner product used in the objective function of the dual model (9):

$$ K(x_{i} ,x_{j} ) = \varphi (x_{i} )^{T} \varphi (x_{j} ) $$
(10)
$$ \begin{gathered} \mathop {Max}\limits_{\alpha } \,\,L_{d} (\alpha ) = \sum\limits_{i = 1}^{m} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {y_{i} y_{j} \alpha_{i} \alpha_{j} K(x_{i} ,x_{j} )} } \hfill \\ subject\,to:\,\,\,\sum\limits_{i = 1}^{m} {y_{i} \alpha_{i} } = 0\,\,\,,\,\,0 \le \alpha_{i} \le C,\,\,\,i = 1,...,m \hfill \\ \end{gathered} $$
(11)

Following the solution steps of the linearly separable case, the decision function is derived as \(f(x) = y = {\text{sgn}} \left( {\sum\limits_{i = 1}^{m} {\alpha_{i}^{*} y_{i} K(x_{i} ,x) + b^{*} } } \right)\). Besides, it should be noted that there are many kernel functions that enable SVM to reach the optimal result. The most commonly used of these are the polynomial (12), radial basis (13), and sigmoid (14) kernels (Burges, 1998; Liao et al., 2004).

$$ K(x_{i} ,x_{j} ) = (1 + x_{i} \cdot x_{j} )^{d} $$
(12)
$$ K(x_{i} ,x_{j} ) = \exp \left( { - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} } \right) $$
(13)
$$ K(x_{i} ,x_{j} ) = \tanh \left( {\kappa \left( {x_{i} \cdot x_{j} } \right) - \delta } \right) $$
(14)
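
To make the formulation above concrete, the following minimal sketch (not part of the original study, which used MATLAB and LIBSVM) shows how a two-class SVM with an RBF kernel of the form in Eq. (13) can be trained and queried; the toy data and the C and gamma values are hypothetical.

```python
# Minimal sketch: two-class SVM with an RBF kernel (Eq. 13) on hypothetical data.
# scikit-learn's SVC wraps the same LIBSVM solver used in the study.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical training samples x_i in R^2 with labels y_i in {+1, -1}
X_pos = rng.normal(loc=+1.0, scale=0.5, size=(30, 2))
X_neg = rng.normal(loc=-1.0, scale=0.5, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 30 + [-1] * 30)

# C is the penalty parameter of Eq. (8); gamma is the RBF parameter of Eq. (13)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

# The support vectors are the training samples with non-zero Lagrange multipliers
print("support vectors per class:", clf.n_support_)
# The sign of the decision function corresponds to Eq. (7) with the kernel substituted
print("predicted classes:", clf.predict([[0.8, 0.9], [-1.2, -0.7]]))
```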

3.2 Logistic Regression Analysis

Logistic regression is a regression analysis used to predict a dependent variable with two categories. The categories of the dependent variable here are formed by using a coding scheme as zero or one to signify that an event has occurred or has not occurred. LRA aims to find the most appropriate model to determine the relationship between a two-category dependent variable and a number of independent variables (Caesarendra, Widodo and Yang, 2010). In this manner, the logistic function with p number of independent variables is expressed as in (15):

$$ P(Y = 1) = \frac{1}{{1 + e^{{ - (\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + ... + \beta_{p} x_{p} )}} }} = \frac{{e^{{(\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + ... + \beta_{p} x_{p} )}} }}{{1 + e^{{(\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + ... + \beta_{p} x_{p} )}} }} $$
(15)

where \(P(Y = 1)\) represents the probability that the relevant event of the dependent variable occurs, and \(\beta_{0} ,\,\,\beta_{1} ,\,\,...,\,\,\beta_{p}\) are the regression coefficients. Since the dependent variable represents the probability of the relevant event, the output values are responses restricted to the interval between 0 and 1. Logistic regression also provides a linear model: the natural logarithm of the ratio of \(P(Y = 1)\) to \(1 - P(Y = 1)\) is

$$ g(x) = \ln \left( {\frac{P(Y = 1)}{{(1 - P(Y = 1))}}} \right) = \beta_{0} + \beta_{1} X_{1} + ... + \beta_{p} X_{p} $$
(16)

\(g(x)\) in Eq. (16) has several features desired in a linear regression model. The independent variables here can be integrated into the model as a combination of continuous and categorical variables. In the analysis, to estimate the \(\beta_{0} ,\,\,\beta_{1} ,\,\,...,\,\,\beta_{p}\) parameters, maximum likelihood estimation is applied after transforming the dependent variable into the logit variable (Dreiseitl and Ohno-Machado, 2002; Kurt, Ture and Kurum, 2008; Yilmaz, 2009).
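
As a brief illustration of Eqs. (15) and (16), the hedged sketch below fits a binary logistic regression on hypothetical data and reports the coefficients and p-values estimated by maximum likelihood; the variables and data are illustrative only and do not come from the study.

```python
# Illustrative sketch of Eqs. (15)-(16): binary logistic regression fitted by
# maximum likelihood on hypothetical data (not the study's data set).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                        # three hypothetical predictors
g_true = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]      # a true linear predictor g(x)
p = 1.0 / (1.0 + np.exp(-g_true))                  # Eq. (15)
y = rng.binomial(1, p)                             # 1 = event occurred, 0 = not

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.params)    # beta_0, beta_1, ..., beta_p of Eq. (16)
print(model.pvalues)   # significance of each coefficient
```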

4 Experimental Study

In the SVM literature, many different model proposals have been made to test and strengthen the success of the method. One of these models is LRA, a multivariate statistical technique. The results of this analysis, which we call the logit model, have been compared to the results obtained by SVM. In another model, the logit model is used as a feature selection technique, and a further SVM analysis was carried out with the variables found to be significant and expected to increase its prediction performance. The results obtained from the proposed models have been discussed, and the comparisons are visualized through graphs. In this study, the developed SVM model has been implemented in MATLAB 9.4 (R2018a) together with the LIBSVM software library (Chang and Lin, 2011). Besides, the IBM SPSS Statistics 21 package program has been used for LRA.

4.1 Datasets

The firms used for financial distress prediction operate in the manufacturing industry and its sub-sectors, and their shares are traded on Borsa Istanbul (BIST or Stock Market). Within this scope, 172 firms constitute the dataset of the research. Considering that firms subject to the Capital Market Law (CML) and traded on BIST have prepared their financial statements in accordance with international financial reporting standards since 2007, the period between 2010 and 2017 has been determined as the research period. Besides, 24 financial ratios in 6 groups were used in the research. These ratios have been obtained from the firms’ annual balance sheets as updated through footnotes. Using financial ratios makes it possible to control for potential problems that might arise from firm size and sector differences, and to minimize the impact of those factors. Therefore, financial ratios that are frequently used and considered important for firm distress prediction in the literature, and that are statistically effective predictors, have been preferred. The financial ratios are given in Table 1. The balance sheets and income statements of the firms whose shares were traded on the Stock Market during the whole or part of the research period have been obtained using the Finnet Analysis Program.

Table 1 Financial ratios

The “success” or “distress” situations of the firms were used as the classifying variable in this research. The financial distress criteria have been determined based on the definitions of financial distress in the literature reviewed within the framework of the study. According to Beaver (1966), Deakin (1972), Aktaş (1993), Altman, Zhang and Yen (2007), and Özdemir (2011), these criteria are as follows:

1. That the enterprise has filed for bankruptcy or has gone bankrupt,
2. That the enterprise has made a loss in the last 3 years,
3. That the enterprise has been delisted from the stock exchange,
4. That the enterprise has negative equity,
5. That the enterprise has been on the watchlist firms market for over a year,
6. That the enterprise has lost 10% of its total assets, and
7. That the enterprise has restructured its debts.

The enterprises that comply with at least one of the above criteria have been considered “distressed”, and all of those that do not as “non-distressed”. The distressed or non-distressed status of all 172 firms in the data set has been identified. Some firms were distressed throughout the sampling period, while others suffered financial distress for only one year and were non-distressed for the rest of the years; the opposite pattern also occurs. Many FDP researchers have used a balanced sample in which the class frequencies are distributed 50–50% (Altman, 1968; Park and Han, 2002; Shin et al., 2005; Sun and Li, 2011). However, most real-life problems have an unbalanced class distribution (Liu et al., 2009). According to Zmijewski (1984), if the proportions of the distressed and non-distressed classes differ clearly from their real-world proportions, the prediction ability of the model is distorted. Therefore, to avoid any selection bias, the sample covers the whole spectrum: firms have been randomly selected with their financial ratios for the years in question and added to the sample. In the entire data set, 71 firms are classified as distressed and 101 as successful. In line with the common practice in the literature, the data set has been randomly split into a training set and a testing set (80%–20%).
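
The random 80%–20% split described above can be sketched as follows; the feature matrix and labels are placeholders standing in for the 24 ratios and the distress labels of the 172 firms.

```python
# Hedged sketch of the random 80/20 train-test split; X and y are placeholders
# for the 24 financial ratios and the distressed (1) / successful (0) labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(172, 24))
y = np.array([1] * 71 + [0] * 101)      # 71 distressed, 101 successful firms

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape)      # (137, 24) (35, 24)
```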

4.2 Study Design and Experiments

The outline of the process that has been proposed for the application part of the study is presented in Fig. 3. The detailed explanations are as follows:

Fig. 3 The proposed analysis process for financial distress prediction

4.2.1 Kernel Function

Different kernel functions help SVM find the optimal result, and it is also possible for the user to write a custom kernel function suited to the structure of the problem. The polynomial, radial basis, and sigmoid kernels are the most commonly used kernel functions (Liao et al., 2004). The radial basis function (RBF) is the most widely used kernel because it can handle multidimensional data; compared to the polynomial kernel, it also has fewer parameters. In several studies in which RBF has been compared to other kernel functions, no significant difference was observed.

In this study, the radial basis kernel function is used, since RBF has been accepted as an effective choice for SVM in finding the most suitable result.

There are two significant parameters used in SVM, called C and gamma. The selection of the value of C, which is called the penalty parameter, affects the classification output. If we assign a very high value to C, the classification accuracy rate during training will be very high; however, the resulting model will most probably have a very low accuracy rate on the test data. If we select C to be very small, the classification accuracy rate will not be satisfactory, and the model is therefore impractical. The gamma parameter, on the other hand, has a greater impact on the classification output than C, because the value of gamma affects the separation in the feature space. Assigning very high values to gamma leads to over-fitting, and very low values to under-fitting (Pardo and Sberveglieri, 2005).

4.2.2 Parameter Optimization

The easiest way to adjust the C and gamma parameters is the Grid search technique (Hsu et al., 2003). In this technique, the appropriate parameters that ensure a high classification accuracy rate are identified by trying all combinations between the lower and upper limits determined for gamma and C. As can be seen in Fig. 4, the limits for C range from \(2^{-5}\) to \(2^{15}\), while the limits for gamma range from \(2^{-15}\) to \(2^{3}\). Here, 110 different combinations are tried and the cross-validation rate for each parameter pair is calculated. Then the SVM training process is initiated with the parameter pair that yields the best cross-validation rate.

Fig. 4 Grid search

Since grid search is a local search technique, the interval determined for the parameter values should be chosen carefully (Lin et al., 2008). A very wide interval wastes computation time, while a narrow interval may leave satisfactory results outside the search space; in other words, good solutions are sacrificed. Determining appropriate parameters for SVM is a research area in its own right and is still being developed.
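
The grid search described above can be sketched as follows with 10-fold cross-validation over the exponential ranges given in the text; the data are placeholders, and scikit-learn's GridSearchCV is used here as a stand-in for the study's MATLAB/LIBSVM scripts.

```python
# Sketch of the grid search over C and gamma with 10-fold cross-validation,
# following the exponential ranges described in the text (Hsu et al., 2003).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 24))          # placeholder for standardized ratios
y = rng.integers(0, 2, size=150)        # placeholder distress labels

param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],       # 2^-5 ... 2^15 -> 11 values
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # 2^-15 ... 2^3 -> 10 values
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("best 10-fold CV accuracy:", search.best_score_)
```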

4.2.3 Feature Selection

The accuracy rate of SVM is not only affected by the C and gamma parameters; the quality of the data set also affects this rate. For instance, a high correlation between features influences the solution results. Excluding an important feature from the model may reduce the accuracy rate. Conversely, some features included in the data set may not affect the results or may contain noise.

Feature selection methods are analyzed under three categories: filter and wrapper (Liu and Motoda, 1998), and embedded (Saeys et al., 2007). As filter methods, factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), and discriminant analysis (DA) are mostly used. As wrapper methods, meta-heuristic techniques that search for the optimal subset are mostly used. In embedded techniques, random forests, the vector weights of SVM, and logistic model weights are used. Filter methods are fast but do not guarantee the optimal subset; wrapper methods work slowly but give a good approximation of the optimal solution. Embedded methods require more complicated calculations than wrapper methods since they work interactively with the classifier. While the outputs of the filter and wrapper methods are estimators, in embedded methods the output is both an estimator and a feature subset. Based on Min and Lee (2005), LRA was used in the feature selection phase in the present study.
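
The sketch below illustrates, under stated assumptions, how LRA-based feature selection can feed the SVM in the spirit of the Logit + Grid SVM model: ratios whose logistic regression coefficients are significant at p < 0.05 are retained and passed to the classifier. The data, the 0.05 threshold, and the direct reuse of the reported (C, gamma) pair are illustrative choices, not the study's exact pipeline.

```python
# Hedged sketch of LRA-based feature selection followed by SVM training
# (in the spirit of the Logit + Grid SVM model); data and the 0.05 threshold
# are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(172, 24))           # placeholder: 24 financial ratios
y = rng.integers(0, 2, size=172)         # placeholder: distress labels

# Step 1: fit a logistic regression and keep ratios significant at p < 0.05
logit_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
pvalues = np.asarray(logit_fit.pvalues)[1:]   # drop the intercept's p-value
selected = np.where(pvalues < 0.05)[0]
print("selected feature indices:", selected)

# Step 2: train the SVM only on the selected feature subset
if selected.size > 0:
    svm = SVC(kernel="rbf", C=256, gamma=0.002)   # (C, gamma) reported for Logit + Grid SVM
    svm.fit(X[:, selected], y)
```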

4.2.4 Data Pre-processing

Data pre-processing is applied to avoid numerical difficulties during the calculations and to ensure that variables with large values do not dominate those with small values. Moreover, pre-processing is a requirement for many machine learning techniques. The raw data is transformed using the formula given in Eq. (17).

$$ Z_{score} = \frac{{X_{i} - X_{mean} }}{S} $$
(17)

where Xi is the raw value of each variable, Xmean is the mean of the variable values, and S is the standard deviation. Thus, the raw financial ratios are standardized to have zero mean and unit standard deviation across samples.
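
A short sketch of the standardization in Eq. (17) follows; the array of raw ratios is a placeholder.

```python
# Sketch of the z-score transformation in Eq. (17): each financial ratio is
# centered on its mean and scaled by its standard deviation (placeholder data).
import numpy as np

raw = np.array([[0.10, 1.5, 35.0],
                [0.25, 0.9, 12.0],
                [0.05, 2.1, 48.0]])              # hypothetical raw ratio values

z = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # Eq. (17), applied column-wise
print(z.mean(axis=0))   # approximately zero for each ratio
print(z.std(axis=0))    # one for each ratio
```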

4.2.5 Cross Validation (k-fold)

In order to make sure that the developed model will assign newly added samples to the correct class, the model must have an acceptable accuracy rate on a test data set that was kept independently out of the analysis. The most reliable way to do so is to divide the data into k parts, keep one part aside as the test set each time, and train the model on the remaining k−1 parts. This method is called cross-validation. The advantage of cross-validation is that the test data set kept aside each time is independent, which increases the reliability of the results (Huang and Wang, 2006). Salzberg (1997) recommended k-fold cross-validation with k = 10 for comparing classifiers.

The parameters of the method used in the application stage are optimized by the Grid search technique. The parameter pairs, and therefore the accuracy rates, change in each iteration. For that reason, the k-fold (k = 10) cross-validation rate is taken into consideration in the evaluation of the prediction results.

4.2.6 Performance Evaluation

The confusion matrix is used to compare the predictions of the model with the actual results. The 2 \(\times \) 2 confusion matrix used for a two-class problem is presented in Table 2. The left column of the table contains the predicted class values of the samples kept aside as the test data set, and the upper row contains the actual class values.

Table 2 Confusion Matrix

A sample in the positive class may be classified as positive in the prediction, which is called a true positive (TP); on the other hand, a sample in the positive class may be predicted to be in the negative class (false negative, FN), which is called a Type 2 error. In the opposite case, a sample in the negative class may be predicted to be in the negative class (true negative, TN) or in the positive class (false positive, FP), the latter indicating a Type 1 error. Sensitivity, also called the true positive rate, and specificity, also called the true negative rate, provide significant information about how well the classifier separates the positive and negative classes. To evaluate the performance of the models, several performance criteria from the related literature are used. The formulas of these performance criteria (accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient (MCC)) are as follows:

$$ Accuracy = \frac{TP + TN}{{TP + FP + FN + TN}} $$
(18)
$$ Sensitivity = \frac{TP}{{TP + FN}} $$
(19)
$$ Specificity = \frac{TN}{{TN + FP}} $$
(20)
$$ Precision = \frac{TP}{{TP + FP}} $$
(21)
$$ MCC = \frac{(TP \times TN) - (FP \times FN)}{{\sqrt {(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)} }} $$
(22)
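
For concreteness, the sketch below computes the criteria in Eqs. (18)–(22) from a hypothetical confusion matrix; the counts are invented for illustration.

```python
# Sketch computing Eqs. (18)-(22) from a hypothetical 2x2 confusion matrix.
import math

TP, FN, FP, TN = 17, 1, 1, 15            # hypothetical counts

accuracy    = (TP + TN) / (TP + FP + FN + TN)                   # Eq. (18)
sensitivity = TP / (TP + FN)                                    # Eq. (19)
specificity = TN / (TN + FP)                                    # Eq. (20)
precision   = TP / (TP + FP)                                    # Eq. (21)
mcc = ((TP * TN) - (FP * FN)) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))              # Eq. (22)

print(accuracy, sensitivity, specificity, precision, round(mcc, 4))
```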

4.2.7 Model Propositions

In order to obtain a powerful and useful prediction model, three different models have been proposed. Explanations about the models are presented under the headings below; the results and interpretations are discussed in Sect. 4.3 “Empirical Results and Discussion”.

Model 1: The Analysis by the Support Vector Machine. In Model 1, all variables (Table 1) are used. These variables are the financial ratios that are most commonly encountered in the literature and that have provided significant information for explaining financial distress in many studies. In Model 1, only SVM with optimized parameters has been applied to predict the dependent variable for the sample of 172 firms. This model has been named Grid SVM.

Model 2: The Analysis by Logistic Regression. In Model 2, all variables are used in the LRA. This model, which we have called Logit, has been used as a benchmark for the performance of SVM.

Model 3: The Analysis with Feature Selection. In Model 3, LRA is used as the feature selection technique. Through this analysis, the subset of features that provides useful information was determined and then used in the SVM model. This model has been named Logit + Grid SVM.

4.3 Empirical Results and Discussion

Empirical results are analyzed under three main headings: (1) Logistic Regression Model Outputs, (2) SVM Models Output, and (3) Performances of the Proposed Models.

4.3.1 Logistic Regression Model Outputs

LRA takes the cumulative logistic function as its basis. Given the financial characteristics of a firm, this function gives the probability that the firm belongs to the distressed or the non-distressed class. The empirical results of this model are presented in Table 3.

Table 3 Logistic model outputs

\(x_{1} :\) asset growth, \(x_{19} :\) real operating profit margin, \(x_{17} :\) net profit margin, \(x_{21} :\) gross real operating profit margin, \(x_{23} :\) current ratio, and \(x_{22} :\) quick ratio in the model have been found to be significant at the 95% confidence level. The B value in the table indicates the coefficients of the logit model. The obtained logit model according to these results can be written as follows:

$$ L_{i} = - 1.031 - 1.761x_{1} - 1.947x_{19} - 1.750x_{17} + 0.746x_{21} + 1.465x_{23} - 2.728x_{22} $$

According to the model statistics (−2 Log Likelihood = 86.949; \(\chi^{2}\) = 12.493; degrees of freedom (d.f.) = 8; p value = 0.131), the prediction model is meaningful as a whole. From the statistical results of the coefficients (\(\chi^{2}\) = 100.654; degrees of freedom (d.f.) = 6; p value = 0.000), it is concluded that the coefficients are significant. For the obtained Logit model, the independent variables explain 69.5% of the variability in the financial situations of the firms (Nagelkerke R-Square = 0.695).

To calculate the probability that a firm is financially non-distressed, the relevant financial ratios of the firm are substituted into the \(L_i\) function. The corresponding probability value is calculated using the equation \(P(L_{i} ) = \frac{1}{{1 + e^{{ - L_{i} }} }}\). When this value is higher than 0.5, the firm is classified as non-distressed; otherwise, it is classified as distressed.
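
To illustrate the classification rule above, the following snippet evaluates the fitted \(L_i\) function and the corresponding probability for one firm; the ratio values are hypothetical and serve only to show the 0.5 decision threshold.

```python
# Evaluating the fitted logit model for one firm; the ratio values below are
# hypothetical and only illustrate the 0.5 decision rule.
import math

x1, x19, x17, x21, x23, x22 = 0.12, 0.08, 0.05, 0.10, 1.8, 1.1   # hypothetical ratios

L_i = (-1.031 - 1.761 * x1 - 1.947 * x19 - 1.750 * x17
       + 0.746 * x21 + 1.465 * x23 - 2.728 * x22)
p = 1.0 / (1.0 + math.exp(-L_i))       # P(L_i)

print("L_i =", round(L_i, 3), "P =", round(p, 3))
print("non-distressed" if p > 0.5 else "distressed")
```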

4.3.2 SVM Models Output

Under this heading, the classification performances of the Grid SVM model and the Logit + Grid SVM model, in which LRA is applied as the feature selection technique, are compared. In addition to the optimization of the SVM parameters, it is concluded that the selection of the optimal feature subset affects the classification success of SVM. The analysis outputs presented in Fig. 5 show how strongly the parameters of SVM can affect the results. As mentioned in previous sections, setting the C and gamma values too high causes over-fitting, and the classification success of the method on the test data decreases. When C is fixed at the constant value \(2^{5}\) and the cross-validation rate is examined for all values in the interval determined for gamma, the cross-validation rate decreases to around 60%, the values shown on the blue line, at very high or very low gamma values. This applies to both model propositions.

Fig. 5 Grid SVM (a) and Logit + Grid SVM (b) models cross-validation rate graphs

When we look at Fig. 5a, which shows the Grid SVM results, the highest accuracy rate is 87.21%. This accuracy rate is achieved when C is set to 2048 and gamma to 1.2207e−04. Fig. 5b shows the impact of C and gamma on classification success in the Logit + Grid SVM model. Here, the highest cross-validation rate is 90.06%, obtained with C = 256 and gamma = 0.002. Another noteworthy point is that adding the feature selection stage to the analysis increased the maximum cross-validation rate from 87.21 to 90.06%. Table 4 gives a brief assessment of the effect of feature selection on the SVM results. The values in the table were obtained by running both models 100 times on the test data set. As the results indicate, the accuracy rate of SVM after feature selection increased from 83.28 to 85.44%. The cross-validation rate, which is a more reliable measure, increased from 70.39 to 74.80%.

Table 4 Empirical results regarding models

4.3.3 Performances of the Proposed Models

Several different performance criteria are used to compare the classification performances of the proposed models. Table 5 presents the results for the selected performance criteria. The accuracy rates of Logit + Grid SVM for the training and test sets are 94.24% and 93.75%, respectively. This model also has a remarkably high sensitivity for both the training and the test set, at 93.75% and 94.44%, respectively. The highest specificity rate, which indicates how accurately the classifier identifies the negative class, was obtained by the Logit model. Precision gives information about how many of the predicted financial distress cases are real. The highest precision value, too, has been obtained by Logit + Grid SVM. The MCC value, which we have preferred for situations in which the values in the confusion matrix are not evenly distributed, also provides information about the quality of the classifier. The highest MCC value again belongs to Logit + Grid SVM. It can be said that all three models are useful and produce classifiers with considerably high performance. As for the generalization capacity of the models, the relatively large difference between the accuracy rates of the Logit model on the training and test data sets indicates that its generalization performance is low.

Table 5 Performance of the proposed models

In terms of precision, Grid SVM yielded the lowest rate for the test set; this value is lower than that of the Logit model. Although it is shown in this study that logistic regression provides significant information for the selection of the new feature subset, it is also seen that the performance of the SVM operated with this new feature subset has increased.

5 Conclusion and Future Work

Since the financial distress of firms does not only affect the firm itself but also has an impact on the whole economy, financial distress prediction is a critically important and frequently studied subject. In recent years, SVM has been commonly used in financial distress prediction studies. When the financial distress model built with SVM has been compared to other machine learning methods, it has been shown to yield good results. In the present study, the aim is to make distress predictions with SVM. The C and gamma parameters, which are considered the two significant parameters of SVM, are optimized using the grid search technique. It is shown to what extent the results are affected by defining the relevant parameter pair correctly. Besides, it was seen that feature selection for SVM is another factor that significantly affects the results. To understand how feature selection affects classification performance, logistic regression analysis was applied. There are two reasons why this method was chosen: first, LRA does not require strict assumptions as other multivariate statistical techniques do and it can be used as a feature selection technique; second, we wanted to compare the results of the logistic regression analysis to those of SVM.

Financial distress prediction is made based on a real data set of 172 firms traded in the BIST share market between 2010 and 2017, and the proposed models are compared on this data set. When the results of these proposed models are compared, it is concluded that the SVM model that incorporates parameter optimization and feature selection is more successful. As a consequence, a useful early warning model for the financial distress prediction problem based on SVM is presented in the study.