1 Introduction

The banking system failure, integral to the financial crisis of 2009, had serious negative impacts on the shipping sector and led to the withdrawal of commercial banks from major risk-intensive private capital investments. One important regulatory consequence was the imposition of increasingly stringent capital adequacy rules by the Basel Framework (BIS 2019), which forced banks to offload riskier assets from their balance sheets whilst also placing further restrictions on new investments. Furthermore, the crisis highlighted the financial system’s weakness in financial risk management and distress prediction. This led directly to the adoption of the Basel internal ratings-based (IRB) approach to risk management, which demanded the development of effective probability of default models geared to the specific characteristics of companies while taking due account of the macroeconomic environment.

Consequently, it is becoming increasingly important to model the probability of distress of shipping companies more accurately than before. Our research approaches this challenge through the development of a new methodology consisting of three core elements: (i) the application of modern machine learning (ML) classification algorithms; (ii) supplementing financial statement data with macroeconomic and market predictors; and (iii) the application of modern multiple imputation techniques for the analysis and treatment of missing accounting information. Each of these elements has been previously addressed in various fields of endeavour including, to a limited degree, the shipping sector. However, to the authors’ knowledge, no prior study has adopted a holistic methodology combining all three. The rationale behind this approach is to examine whether, by simultaneously accounting for the non-linear relationship between default risk and the independent variables, exogenous macro effects, and the effects of incomplete data, our model can improve on the predictive power of traditional corporate distress modelling.

We first evaluate the performance of ML classification models in the prediction of financial distress (FD), which may reduce the need to model unobservable temporal effects. We appraise two relatively recently established models, random forests (RF) and extreme gradient boosting (XGB), alongside an extended linear classifier, the generalised additive model (GAM). The objective is not simply to compare model performance (e.g. accuracy) but also to assess their individual confidence intervals (CIs), thereby measuring a model’s true capacity to generalise on out-of-sample data. A rapidly increasing focus in the literature is the application of ML modelling (complex models exhibiting non-linear dependency structures between the covariates and the resulting outcome) in corporate failure prediction (Christoffersen et al. 2018; Jones et al. 2015a; Hernandez Tinoco and Wilson 2013). Much of the earlier work has focused on benchmarking the performance of ML models against generalised linear models such as logistic regression (LR). However, it is now widely accepted that generalised linear models produce excessively narrow CIs for aggregated FD predictions owing to their underlying assumption of conditional independence.

The fact that none of the models fully captures correlation in FD solely through the application of accounting data suggests the existence of unobserved macro effects creating correlation in distress. Shipping, being a high-risk sector, will always be highly sensitive to global macroeconomic shifts and stochastic market events. As such, a clearer understanding of those events which accurately represent the risk profile of shipping companies is essential. We develop distress prediction models employing not only firm-level data but also macroeconomic and market indicators to detect early stages of distress.

A major aspect of our methodology is the treatment of missing accounting data. The global nature of the shipping industry, and diverse national accounting practices and laws, render the identification and collection of complete and consistent financial statements a major challenge. Therefore, the problem of missing accounting values and how to treat them is a central focus of this research. Our methodology is trained and tested on a case study comprising raw data compiled from detailed financial statements, covering the period 2000–2018, of worldwide dry bulk carrier owner/operator companies, both listed and non-listed. Financial institutions and ultimate owner (parent) companies are excluded due to the potential bias introduced through group-level accounting practices.

Finally, the early detection of FD provides investors with ways of avoiding some of the costs associated with a bankruptcy filing and recovery. However, models must be transparent and open to scrutiny by all stakeholders, investors and particularly regulatory bodies if they are to be accepted as practical tools.

The rest of the paper is structured as follows. In Sect. 2 we review the relevant literature; Sect. 3 outlines the methodological background; Sect. 4 details the data and the bulk shipping case study; and the results are presented and discussed in Sect. 5. Section 6 presents the conclusions and recommendations for further research.

2 Literature review

This literature review comprises five threads. The definition of FD is addressed first, as a clear and accepted definition is core to this research. The second thread reviews research into shipping company FD prediction, followed by a review of the literature on the application of machine learning tools to corporate FD prediction. The section concludes with reviews of the literature relating to incomplete financial accounting data and to the selection of core independent variables.

2.1 Definition of financial distress

Much of the literature defines FD as centred upon the final legal consequence of an organisation’s liquidation or bankruptcy. This legal event is represented by the dependent variable in a binary classification model (Balcaen and Ooghe 2006). This definition, however, represents only the worst-case scenario of FD and therefore presents challenges for FD prediction. The process of insolvency is, in many cases, significantly lagged (Hernandez Tinoco and Wilson 2013). The literature estimates a time gap of up to 3 years or more between the point at which a company experiences FD and the date of a legal declaration of insolvency (Theodossiou 1993; Hernandez Tinoco and Wilson 2013). For example, U.S. Chapter 11 legislation has changed the way organisations can be given time to reorganise their business, assets and debts in the event of impending insolvency. There are a number of stages a company can encounter before closure; for example, Wruck (1990) cites FD, insolvency, filing of bankruptcy and administrative receivership.

Prior to the triggering of the terminal state, the literature generally follows two approaches. The first is an accounting features approach, utilising cross-sectional annual data, and is widely covered in the default prediction literature (Altman 1968; Ohlson 1980). This utilises historical financial statements which are benchmarked against historical default rates and generally modelled to produce a probability of bankruptcy.

The second approach is a mixed accounting/market approach which estimates a company’s probability of default based on its distance to default (DD; Black and Scholes 1973; Merton 1974). The DD model utilises both the expected return on assets and the volatility of those returns to assess the probability of asset values declining below the value of the company’s debt (as a function of the time to maturity of the company’s outstanding debt). Accepting this as a foundation, we also include DD as a feature in our modelling.
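For reference, under the standard Merton assumptions (asset value following a geometric Brownian motion), DD and the implied default probability take the familiar form

$$\mathrm{DD} = \frac{\ln\left(V_{0}/D\right) + \left(\mu - \tfrac{1}{2}\sigma_{V}^{2}\right)T}{\sigma_{V}\sqrt{T}}, \qquad \mathrm{PD} = \Phi\left(-\mathrm{DD}\right),$$

where $V_{0}$ is the market value of assets, $D$ the face value of debt maturing at horizon $T$, $\mu$ the expected asset return, $\sigma_{V}$ the asset volatility and $\Phi$ the standard normal distribution function.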

2.2 Shipping company financial distress prediction

Research into shipping FD prediction has been relatively limited to date. Earlier works have focused largely on financial performance predictor/feature selection, relying on the more conventional binary logistic regression techniques (Antoniou et al. 1998; Grammenos et al. 2008; Kavussanos and Tsouknidis 2016; Mitroussi et al. 2016; Lozinskaia et al. 2017). Moreover, research interest has centred on either shipping bond markets or bank shipping debt.

The financing of the shipping industry has traditionally relied heavily on bank loans. A critical priority for bank credit risk departments is the provision of an optimal framework for assessing the credit rating of borrowers as well as loan quality. This includes defining specific quantitative and qualitative criteria mirroring the borrowers’ ability to comply with the loan contract terms. Traditionally, this has been founded on the five core Cs of credit, namely the borrower’s ‘character’, ‘capacity’, ‘capital’, ‘collateral’ and ‘conditions’ (Antoniou et al. 1998; Grammenos 2010), applied to shipping credit scenarios. Credit risk assessment has often been performed through the construction of ‘standardized’ models, as noted by Dimitras et al. (2003). These authors contended that models which combine criteria and provide relative weightings to assist the decision-making process of a bank’s Credit Committee are limited in number. Their paper applied the monotone regression method UTADIS to the analysis of credit allocations and to the evaluation of the criteria used in selecting loan applications in the shipping industry. Gavalas and Syriopoulos (2016) proposed an integrated credit rating model founded on a series of critical qualitative and quantitative criteria for bank loan portfolios. The model was applied to, and tested on, bank financing decisions in the shipping sector as a case study. Again, the authors used a UTADIS approach to assess the relative impact of the selected risk factors on efficient credit rating scoring and loan quality assessment.

Finally, all these studies had limited access to longitudinal financial data, which would have allowed a more thorough assessment of the predictive capabilities of the available tools. Moreover, reliance on linear methodologies limits the capacity of many earlier models to accurately predict FD on out-of-sample data.

2.3 Machine learning application to financial distress prediction

Since the works of Altman (1968) and Ohlson (1980), research relating to the modelling of corporate FD and bankruptcy has been extensive (Altman 1977; Shumway 2001; Duffie and Singleton 2003; Hensher and Jones 2007). However, until recently, much of this work relied heavily on more traditional classifiers, e.g. logit, probit or linear discriminant analysis models, commonly referred to as generalised linear models (GLMs). The financial crisis of 2009, however, demonstrated that more research effort was required to develop models with enhanced predictive accuracy, not only for predicting ultimate failure events but also for detecting the early stages of FD. Post-crisis research has highlighted failures in conventional corporate FD prediction models (Duffie et al. 2009; Barboza et al. 2017; Christoffersen et al. 2018). The academic consensus is that conventional statistical techniques carry restrictive assumptions (linearity, normality, homoscedasticity) and limitations (sensitivity to multicollinearity, autocorrelation and outliers) which do not sufficiently capture the complex relationships between covariates and FD. These limitations, coupled with the need to account for frailty and unobserved heterogeneity, have resulted in a switch of focus by industry and academics alike to the application of more complex methods (Lessmann et al. 2015; Zhang et al. 2017). ML methods applied to FD prediction are now well established in the literature (Jones et al. 2015b; Ziȩba et al. 2016; Xia et al. 2017; Barboza et al. 2017) and the general conclusion is that ‘new age’ classifiers outperform traditional GLMs in out-of-sample generalisation.

The application of ML modelling (complex models exhibiting non-linear dependency structures between the covariates and the resulting outcomes) in corporate failure prediction is increasingly prevalent in the literature (Hernandez Tinoco and Wilson 2013; Jones et al. 2015b; Moraes Barboza et al. 2015; Christoffersen et al. 2018). Much of the previous work has attempted to benchmark the performance of ML models against GLMs. However, despite research demonstrating the enhanced generalisation performance of ML classifiers compared with GLMs, care must be taken, as transparency is paramount in finance (being demanded by investors and regulators) and, as the literature notes, ML models raise issues of transparency.

2.4 Missing accounting values

The problem of missing data is pervasive in financial modelling (Kofman 2003; Burger et al. 2018) and is a common feature of shipping company accounts (Sharife 2010; Merk 2020). This is also true of the raw panel dataset compiled for the case study used in this research. The issue has, to date, not been addressed in the shipping finance literature. There are various reasons for incomplete financial accounts, and here we cite two examples which are common features of shipping company accounts. Firstly, open registries or “flags of convenience” (FOC) offer favourable tax environments to shipping companies (Merk 2020) and have hence become a part of shipping company tax planning. Shipping companies often exploit variations in domestic tax law and international taxation standards (Kim and Kim 2018; Merk 2020). This provides them with opportunities to eliminate or significantly reduce taxation; accordingly, many multinational corporations use base erosion and profit shifting (BEPS) to reduce the corporate tax base (OECD 2013).

A second reason is the application of international accounting standards. The period 2000–2019 saw the gradual global uptake of International Financial Reporting Standards (IFRS) for both public companies and SMEs. This gradual uptake, coupled with multiple changes to the IFRS by the International Accounting Standards Board (IASB), has contributed to inconsistencies which have resulted in certain accounting information being either incomplete or simply not reported. One prime example is the reporting of leased assets on company balance sheets prior to the coming into effect of IFRS 16 (IFRS Foundation 2016). The standard followed a 2005 finding by the US Securities and Exchange Commission (SEC) that US public companies had approximately US$1.25 trillion of off-balance-sheet leases. The IASB therefore deemed that an entity (lessee) which leases vessels should recognise and report the assets and liabilities arising from those leases. According to Tahtah and Roelofsen (2016), IFRS 16 would produce a median debt increase of 24% and a median EBITDA increase of 20% for the transport and infrastructure industry.

There are three accepted approaches to the problem of missing data in statistical analysis. The first is the “complete case” (Nguyen et al. 2017) or list-wise deletion approach, which discards incomplete individual observations (company accounting years), leaving a residual dataset containing only complete, observed data. The second is the “omitted variable” approach, which simply removes those covariates with missing values from the dataset (Honaker and King 2010). The third is data imputation, part of a growing field of research addressing the challenge of missing values in data. In this research we focus on the multiple imputation (MI) methodology (Rubin 1987). MI has become one of the most widely used methods for handling missing data and is receiving increasing attention in financial research (DiCesare 2006; Amel-zadeh et al. 2020).

Finally, the primary objective of this research is the accuracy of predictions rather than making valid subject-related or sector-informed inferences.

2.5 Predictor selection

Much of the earlier work on FD prediction relied solely on publicly available historical accounting data or on securities market information. However, more recent research has recognised that accounting data alone are not enough to explain the relationship between the covariates and FD. According to Balcaen and Ooghe (2006), if too much emphasis is placed on financial ratios for failure prediction, it is implicitly assumed that all FD indicators are contained within financial statements. There are many examples in the literature which examine combined approaches using accounting, macroeconomic/market and qualitative data in order to provide an enhanced model of FD prediction (Das et al. 2007; Duffie et al. 2009; Koopman et al. 2011). Furthermore, Bonfim (2009) postulates that considering macroeconomic indicators leads to an improvement in model results. The consensus in the literature is that macroeconomic dynamics make an independent contribution to FD prediction. As regards shipping, this is an issue recognised by Lyridis et al. (2014), for example. Furthermore, recent literature has highlighted the failure of such traditional approaches to encapsulate temporal (annual) fluctuations in FD. Numerous publications (Duffie et al. 2009; Nickerson and Griffin 2017; Kwon and Lee 2018; Azizpour et al. 2018) suggest that simply modelling relationships between observable covariates and FD does not adequately account for latency (unobserved variables), and so advocate approaches which include frailty or time-varying effects.

3 Methodological background

This section briefly outlines the analytical framework encompassing the main principles applied in our methodology, namely missing value treatment, data pre-processing, feature selection, classification algorithms and model evaluation metrics (a more detailed discussion can be found in “Appendix 1”).

With respect to the treatment of missing values, we provide both a complete case and a multiple imputation analysis of the raw data. For the complete case dataset, we simply select those records from the raw data which contain non-null values for all independent variables. For data imputation, we begin by assuming that our missing accounting information is missing at random (MAR). This assumption holds that the reasons for missing data can be explained by the observed data, i.e., the probability of a value being missing depends upon the observed data rather than upon the unobserved (missing) values themselves.
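As an illustration of the imputation step, the following is a minimal sketch, not the paper’s exact pipeline, of MAR-based multiple imputation using scikit-learn’s IterativeImputer (a chained-equations-style imputer) with a random forest estimator; the DataFrame name `ratios` and all parameter values are assumptions made for illustration.

```python
# Minimal sketch: multiple imputation of accounting ratios under MAR.
# `ratios` is an assumed pandas DataFrame with NaNs marking missing values.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def multiply_impute(ratios: pd.DataFrame, m: int = 5) -> list:
    """Return m completed datasets; seed variation approximates imputation uncertainty."""
    completed = []
    for seed in range(m):
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
            max_iter=10,
            random_state=seed,
        )
        completed.append(pd.DataFrame(imputer.fit_transform(ratios),
                                      columns=ratios.columns, index=ratios.index))
    return completed

# Complete-case alternative for comparison: keep only fully observed rows.
complete_case = ratios.dropna()
```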

Once the issue of missing data has been addressed, we examine and pre-process the resulting dataset for skewness, kurtosis and outliers (Barnes 1987). Pre-processing is performed using a variation (Yeo and Johnson 2000) of the Box–Cox transformation (Box and Cox 1964), and outlier treatment is applied through the spatial sign transformation (Serneels et al. 2006). As company default is a relatively rare event (Kim et al. 2015), the dataset is imbalanced, with the “distressed” class in the (significant) minority. To account for this, we tested several sampling methods, with down-sampling (reducing the instances of the majority class) producing the most effective results (out-of-sample generalisation) on our test dataset. Once transformed and sampled, the task of feature analysis and selection is undertaken, as sketched below. At this stage, the data are examined for multicollinearity, and each independent variable’s contribution to the dependent variable is assessed. Random forest modelling (Breiman 2001) was selected for this task (Zhou et al. 2016; Lakshmipadmaja and Vishnuvardhan 2018).
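The sketch below illustrates this pre-processing chain (Yeo–Johnson transformation, spatial sign outlier treatment, down-sampling and an RF importance screen); the variable names `X` and `y` and all hyperparameters are illustrative assumptions, not the study’s settings.

```python
# Minimal sketch of the pre-processing chain; X is an assumed feature
# DataFrame and y a binary Series (1 = distressed).
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.ensemble import RandomForestClassifier

def preprocess(X: pd.DataFrame) -> pd.DataFrame:
    Xt = PowerTransformer(method="yeo-johnson").fit_transform(X)  # tame skew/kurtosis
    Xs = StandardScaler().fit_transform(Xt)
    norms = np.linalg.norm(Xs, axis=1, keepdims=True)
    Xsp = np.where(norms > 0, Xs / norms, 0.0)  # spatial sign: project rows onto unit sphere
    return pd.DataFrame(Xsp, columns=X.columns, index=X.index)

def downsample(X: pd.DataFrame, y: pd.Series, seed: int = 0):
    """Randomly shrink the majority class to the size of the minority class."""
    rng = np.random.default_rng(seed)
    minority, majority = y[y == 1].index, y[y == 0].index
    keep = pd.Index(rng.choice(majority, size=len(minority), replace=False))
    idx = minority.union(keep)
    return X.loc[idx], y.loc[idx]

Xp = preprocess(X)
Xb, yb = downsample(Xp, y)

# RF screen of each feature's contribution to the dependent variable:
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(Xb, yb)
importance = pd.Series(rf.feature_importances_, index=Xb.columns).sort_values()
```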

The final dataset is partitioned into training and test sets in a ratio of 70:30 before classification modelling is applied to the training data. As the literature suggests, there are many examples of research into the effectiveness of machine learning algorithms in the financial and economics domain. The general consensus (Jones et al. 2015a; Son et al. 2019) is that tree-based algorithms consistently outperform models such as artificial neural networks (ANNs) or support vector machines (SVMs). Our research corroborates these conclusions; however, in this paper we only report results from the best performing tree-based classifier, extreme gradient boosting (XGB). As a benchmark, we also report the results generated by one of the best performing linear-based classifiers, the GAM. The inclusion of an extended generalised linear model provides a balanced comparison with the complex model and serves model transparency: following the principle of Ockham’s razor, if two models demonstrate similar predictive power then the more transparent model is preferable.
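A minimal sketch of the split-and-fit step follows, reusing `Xb`/`yb` from the sketch above; xgboost’s `XGBClassifier` and pygam’s `LogisticGAM` serve as stand-ins for the paper’s implementations, and the hyperparameters are placeholders.

```python
# Minimal sketch: 70:30 partition, then the two classifiers.
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from pygam import LogisticGAM, s

X_train, X_test, y_train, y_test = train_test_split(
    Xb.values, yb.values, test_size=0.30, stratify=yb.values, random_state=42
)

xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                    eval_metric="logloss").fit(X_train, y_train)

# One smooth term per covariate; the GAM stays additive, hence transparent.
terms = s(0)
for i in range(1, X_train.shape[1]):
    terms += s(i)
gam = LogisticGAM(terms).fit(X_train, y_train)

p_xgb = xgb.predict_proba(X_test)[:, 1]  # P(distressed)
p_gam = gam.predict_proba(X_test)
```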

Finally, the results are compared using a variety of metrics necessary for the accurate assessment of classification performance. Specifically, receiver operating characteristic (ROC) curves, the H Measure (Hand 2009) and log loss (Bickel 2007) metrics are used, with a focus on the ability of the models to accurately predict the minority “distressed” class (“Appendix 1”).
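A sketch of this evaluation step is shown below, using the scikit-learn implementations of ROC AUC and log loss and deriving sensitivity and the type II error from the confusion matrix; the H Measure is omitted, as its reference implementation is the R `hmeasure` package.

```python
# Minimal sketch: out-of-sample metrics for both classifiers, with emphasis
# on the minority "distressed" class; reuses p_xgb / p_gam / y_test above.
from sklearn.metrics import roc_auc_score, log_loss, confusion_matrix

for name, p in [("XGB", p_xgb), ("GAM", p_gam)]:
    tn, fp, fn, tp = confusion_matrix(y_test, (p >= 0.5).astype(int)).ravel()
    sensitivity = tp / (tp + fn)   # distressed firms correctly flagged
    type2_error = fn / (fn + tp)   # distressed firms missed
    print(f"{name}: AUC={roc_auc_score(y_test, p):.3f}  "
          f"log loss={log_loss(y_test, p):.3f}  "
          f"sensitivity={sensitivity:.3f}  type II={type2_error:.3f}")
```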

4 Data: bulk shipping case study

The bulk carrier fleet is an essential part of the global economy. An Equasis report (2019) notes that, as of 2018, the world fleet totalled 116,857 ships (1,361,920 thousand GT), with dry bulk carriers totalling 11,929 vessels (457,648 thousand GT) and accounting for 33.6% of the global fleet in GT terms. Furthermore, UNCTAD (2019) reports that in 2018 the bulk fleet took delivery of 26.7% of total newly built GT, more than any other vessel type, followed by oil tankers (25%), containerships (23.5%) and gas carriers (13%). The dry bulk market is highly diversified and volatile, with bulk shipping comprising three major commodity sectors, iron ore, coal and grain, as well as other minor commodities, e.g. steel, forest products and minerals. According to UNCTAD (2019), the major dry bulk commodities represented more than 40% of total dry cargo tons shipped in 2018, with containerized cargo contributing 24% and minor bulks 25.8% (the remaining volumes consisted of dry cargo including break-bulk).

The dry bulk shipping market is characterised by a large number of small-scale shipowners, few market barriers and transparent transactions (Wu et al. 2018). Furthermore, dry bulk freight rates are determined predominantly by market dynamics, with no individual shipowner or charterer having a significant effect on rates. In short, the dry bulk market can be viewed as a perfectly competitive market (Yin et al. 2019) and is, therefore, a viable test case for this study.

4.1 Empirical context

The study develops four main ex-ante models for estimating FD likelihood, to test the predictive power of three sets of independent variables (Table 1): financial statement ratios, macroeconomic indicators and bulk shipping market predictors. In Model 1, the independent variables are selected solely from financial statement ratios. Model 2 adds bulk shipping market indicators to the company-level financial data. Model 3 combines the financials with macroeconomic covariates. Finally, Model 4 comprises all three sets of covariates, as illustrated below. Missing company-level financial data are subjected to both case-wise deletion and data imputation in order to examine corresponding model performance.
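The composition of the four models can be summarised as follows; the feature names are hypothetical placeholders standing in for the Table 1 predictors.

```python
# Illustrative composition of the four data models (placeholder names).
FEATURE_SETS = {
    "financial": ["current_ratio", "gearing_ratio", "solvency_ratio",
                  "profit_margin", "asset_turnover"],
    "market":    ["freight_rate", "timecharter_rate", "secondhand_price"],
    "macro":     ["gdp_growth", "long_term_interest_rate", "inflation"],
}
MODELS = {
    "Model 1": FEATURE_SETS["financial"],
    "Model 2": FEATURE_SETS["financial"] + FEATURE_SETS["market"],
    "Model 3": FEATURE_SETS["financial"] + FEATURE_SETS["macro"],
    "Model 4": FEATURE_SETS["financial"] + FEATURE_SETS["market"]
               + FEATURE_SETS["macro"],
}
```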

Table 1 Predictive features for the FD model

4.2 Data sampling

The raw dataset used for company-level financials is sourced from unconsolidated statements in the Orbis company database (Bureau-Van-Dijk 2019) and consists of over 5000 global dry bulk shipping company yearly statements for the period 2000–2018. The shipping specifics are primarily drawn from Clarkson’s Shipping Intelligence Network (Clarksons 2019), whilst macroeconomic data are drawn from two sources, the OECD (2019) and the World Bank (2019). At company level, we filter the raw data to exclude financial companies; such entities differ from other corporates particularly as regards their asset base, accounting standards and regulatory status. Furthermore, to avoid modelling distortions, holding companies are also filtered out unless their prime business driver is demonstrably bulk shipping. There is no filtering on company size, as we need to account for interactions between size and other variables in the models, thereby allowing companies of different sizes to be modelled.

4.3 Dependent variable: outcome and hypotheses

The dependent variable is a binary variable, FD, representing the state (distressed or not distressed) of the company in any discrete accounting period. Our definition of FD follows Pindado et al. (2008), and we outline the following primary conditions for identifying company financial distress. We hypothesise that a company is distressed when any of the following events occurs: (i) the company’s EBITDA falls short of its financial expenses for two consecutive years; (ii) the company suffers negative growth for two consecutive years; (iii) a formal default event has been triggered (Hernandez Tinoco and Wilson 2013); or (iv) the company fails to publish accounts for the following year (Christoffersen et al. 2018). This definition also implies that companies experiencing FD in a single period can recover; we therefore implicitly model recurrent events.
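This rule translates into a simple labelling function over the company-year panel; a minimal sketch follows, assuming a DataFrame with the hypothetical columns named in the code (the `default_event` and `accounts_missing_next_year` flags are taken as pre-computed).

```python
# Minimal sketch: binary FD label per company-year, conditions (i)-(iv).
import pandas as pd

def label_distress(panel: pd.DataFrame) -> pd.Series:
    df = panel.sort_values(["company", "year"])

    def two_consecutive(cond: pd.Series) -> pd.Series:
        # condition holds in both the current and the preceding year
        return cond & cond.groupby(df["company"]).shift(1, fill_value=False)

    fd = (two_consecutive(df["ebitda"] < df["financial_expenses"])  # (i)
          | two_consecutive(df["growth"] < 0)                       # (ii)
          | df["default_event"]                                     # (iii)
          | df["accounts_missing_next_year"])                       # (iv)
    return fd.astype(int).reindex(panel.index)
```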

4.4 Independent variable selection

The independent variable selection in this study was primarily driven by the specific nature of the dry bulk shipping sub-sector. The information was quantified through the inclusion of company-level features as well as market and macroeconomic indicators (Table 1). The sector’s risk framework is largely described by financial features relating to the capital-intensive and cash-flow-dependent nature of the industry, and by market and macroeconomic features reflecting a highly cyclical sub-sector with a high sensitivity to global and regional economic growth, fuel prices, and the balance of supply and demand.

5 Results and discussion

In this section, we present and discuss the results from the application of our methodology to the bulk shipping case study. We first present the results from the missing value analysis and treatments, followed by the results produced through the application of our two classifiers, GAM and XGB, to the four data models.

5.1 Missing values

5.1.1 Missingness analysis

The first objective was to analyse the financial statement data to ascertain the extent of the missingness. Table 2 shows that, of the 5368 company financial statements collected, only 1483 were complete, with approximately 72% being only partially complete. At the individual financial ratio level, however, the missingness level is 17.6%, with 10,405 out of 59,048 accounting ratio values not recorded in the dataset. A breakdown of the missing values at the individual ratio level is given in Table 3.

Table 2 Missing financial statement value analysis
Table 3 Missing value level per accounting ratio

The results demonstrate a relatively high level of observed accounting values, 82.4%. This indicates that, if the MAR assumption holds, there is sufficient information present in the observed values for multiple imputation to yield beneficial results (in terms of reduced bias and improved efficiency) compared with complete case treatment. A complete case treatment would leave only 27% of the financial statement observations available for analysis, discarding the 32,300 observed financial ratio values present in the 3885 incomplete financial statements; these contain significant levels of potentially exploitable information. A matrix plot of the missing data distribution is presented in Fig. 1, with grey denoting missing data.

Fig. 1 Matrix plot of missing accounting data

A plot of the pairwise point-biserial correlation coefficients between covariate pairs is shown in Fig. 2. Each variable is assigned TRUE or FALSE according to its missing-data status, and these Boolean vectors are correlated with the native (observed) variables.
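A minimal sketch of this diagnostic: since the point-biserial coefficient is simply Pearson’s r with one binary argument, converting each missingness indicator to 0/1 and calling `.corr()` suffices (the `ratios` DataFrame is the assumption carried over from the earlier sketches).

```python
# Minimal sketch: correlate each variable's missingness indicator with the
# observed variables; large entries argue against data missing completely
# at random and support a MAR-based imputation model.
import pandas as pd

indicators = ratios.isna().astype(float).add_suffix("_NA")  # TRUE/FALSE -> 1/0
combined = pd.concat([ratios, indicators], axis=1)
miss_corr = combined.corr().loc[indicators.columns, ratios.columns]
print(miss_corr.round(2))
```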

Fig. 2 Raw accounting data—observed vs missing (NA) correlation coefficients

5.1.2 Post imputation evaluation

The RF algorithm was evaluated on its ability to obtain statistically valid inferences from the incomplete financial data set and the results indicate a limited loss of information from the imputed data (see “Appendix 2”).

A distribution histogram overlay of imputed and observed values is shown in Fig. 3, depicting the distributions of original and imputed accounting ratio values. Although the goal is for the two distributions to be similar, differences do not necessarily signal problematic imputation; the empirical density plots act as flags for potential problems with the imputed estimates. At this stage, no data pre-processing was performed on the pre-imputation dataset, as any bias would have been “locked into” the data prior to training and validation.

Fig. 3 Overlaid histograms of imputed and original values

Figure 4 shows a plot of the bootstrapped correlation coefficients from the original and imputed datasets. This was generated by applying 20 iterations in the diagnostics function to obtain bootstrapped correlation coefficients with 95% confidence intervals. The correlation coefficients are represented by the dots and the red line. The blue line (intercept 0, slope 1) and the red correlation line should ideally be aligned.
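A minimal sketch of this bootstrap diagnostic, under assumed DataFrames `observed` (complete cases) and `imputed`:

```python
# Minimal sketch: bootstrap the pairwise correlation coefficients of a
# dataset and compare observed vs imputed estimates (Fig. 4 diagnostic).
import numpy as np

def boot_corrs(df, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        sample = df.iloc[rng.integers(0, len(df), len(df))]
        c = sample.corr().values
        draws.append(c[np.triu_indices_from(c, k=1)])  # one value per pair
    draws = np.asarray(draws)
    return draws.mean(axis=0), np.percentile(draws, [2.5, 97.5], axis=0)

obs_mean, obs_ci = boot_corrs(observed)
imp_mean, imp_ci = boot_corrs(imputed)
# Plotting imp_mean against obs_mean should hug the line with intercept 0
# and slope 1 if imputation has preserved the dependence structure.
```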

Fig. 4 Correlation coefficient scatter plot

The final stage of missing value analysis was to examine the out-of-sample classification results when modelling with both the imputed and complete case datasets. The results of imputation (see “Appendix 2”) indicate that, as discussed previously, the removal of circa 72% of records (those containing incomplete data) introduced bias. Indeed, further examination employing RF feature set analysis (Fig. 5) highlights how the contributions of the independent variables to the dependent variable differ depending upon the constitution of the individual datasets (see Table 4).

Fig. 5 Missing value treatment—data set feature importance comparison

Table 4 Missing value treatments—RF classification

For example, in this case we observe a greater contribution by the current, gearing and solvency ratios in the RF MI dataset than in the complete case (CC) data, where profitability plays a greater role in establishing the distressed state. The RF importance analysis shows that the CC dataset results in over-weighted importance being given to financial ratios which are not generally accepted as the most suitable for forecasting corporate financial health (e.g. see Son et al. 2019). Furthermore, the removal of so much data in order to distil the CC dataset removes over 50% of the minority (distressed) class, which further increases the risk of bias.

5.2 Prediction model evaluations

We evaluated the prediction power of the four data models utilising the GAM and XGB classification algorithms. We first examined variable correlations and cross-referenced the data models with an analysis of feature importance based on the permutation results from an RF evaluation of the dataset. The accounting feature distributions shown in Fig. 6 confirm the non-normal, skewed and heavy-tailed nature of the dataset, consistent with corporate panel data. This suggested that a non-parametric correlation test (Spearman 1904) was most suitable for assessing correlations between the features. The correlation matrices produced from the Spearman tests on each of the four datasets can be seen in Fig. 7.

Fig. 6 Accounting feature distributions

Fig. 7 Feature correlation matrices

The results show correlations within both the accounting (e.g. solvency and gearing ratios) and bulk market (e.g. freight and time charter rates) feature sets, some significant enough to warrant closer examination of the data. This was performed using the RF feature importance methodology: given the bias in mean decrease in impurity measurements when predictor variables are highly correlated (Strobl et al. 2008), we used both unconditional and conditional permutation analysis, as sketched below. The results shown in Fig. 8 were then used to identify the best performing variable permutations for our models. This information informed feature selection testing for each of the three feature sets.
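A minimal sketch of the correlation and importance checks follows, using scipy and scikit-learn; conditional permutation importance (Strobl et al. 2008) has no standard scikit-learn implementation (it is available in R’s `party`/`permimp` packages), so only the unconditional variant is shown. The fitted forest `rf` and held-out `X_val`/`y_val` are assumptions.

```python
# Minimal sketch: Spearman correlation matrix plus unconditional
# permutation importance on held-out data.
from scipy.stats import spearmanr
from sklearn.inspection import permutation_importance

rho, _ = spearmanr(X_val)  # rank-based correlation matrix (columns = features)

result = permutation_importance(rf, X_val, y_val, n_repeats=30,
                                scoring="roc_auc", random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```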

Fig. 8 Unconditional and conditional permutation tables for each feature set

The classification performance results presented in Table 5 show that, in Model 1 (the accounting ratio set), both the GAM and XGB classifiers detect contributions from the full feature set, as indicated by the sensitivity, type II error, log loss and H Measure results. This contrasts with the unconditional RF feature importance results (Fig. 8), which show strong contributions from the asset turnover, liquidity, current and solvency ratios, as well as the gearing ratios, indicating that the remaining ratios have limited predictive contributions. Furthermore, the conditional permutation analysis identified only the profit margin as providing a strong input to the distress rate.

Table 5 Classifier/feature set—performance summary

For Model 2, the results following the introduction of bulk market indicators to the company financials concur with the feature importance analysis, in that adding freight rate information as the only market predictor produces slightly better sensitivity, type II error and log loss results for GAM. However, the figures for XGB indicate that this algorithm performs optimally when the complete market feature set is combined with the accounting information. The feature set analysis indicates that, for Model 3, long-term interest rates and inflation play strong roles in the predictive power of the model. This is borne out in the results for both GAM and XGB, with the strongest metrics produced when the macroeconomic indicators are limited to these features. Finally, combining all the feature sets into Model 4 demonstrates that both GAM and XGB perform best when the company financials are combined with freight rate and interest rate information. This is consistent with the results of the Model 2 and Model 3 tests.

The results shown in Table 5 also indicate that the FD prediction power of XGB is improved over that of the GAM classifier, albeit only marginally. As reported in Christoffersen et al. (2018), the difference between the results achieved by the complex model and the GAM model is not as pronounced as in previous studies, e.g. Jones et al. (2015a). A comparison of overall classifier performance with optimal H Measure cost settings for Model 4 (complete dataset) is shown in Fig. 9. Under this methodology, the optimal cost setting for the H Measure would be a decision for the Credit Committee.

Fig. 9 Classifier performance overview—Model 4

Finally, we illustrate the use of the methodology to predict the number of companies entering distress in Fig. 10. The figure compares the realised percentage of firms in distress with the values predicted by both classifiers. The models are estimated on an expanding window of data with a two-year lag (t−2) to the forecasted dataset, e.g. the forecast of 2010 distress rates is estimated using 2003–2008 data, as sketched below.
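The backtest can be sketched as an expanding-window loop; the function below is an assumed reconstruction of the procedure just described, with `panel`, `features` and `fit_model` as placeholder names (e.g. `fit_model=lambda: XGBClassifier()`).

```python
# Minimal sketch: expanding-window forecasts of the aggregate distress
# rate with a two-year training lag (forecast year t uses data up to t-2).
def expanding_window_forecast(panel, features, fit_model, start=2005, end=2018):
    forecasts = {}
    for year in range(start, end + 1):
        train = panel[panel["year"] <= year - 2]
        test = panel[panel["year"] == year]
        model = fit_model().fit(train[features], train["fd"])
        p = model.predict_proba(test[features])[:, 1]
        forecasts[year] = {"predicted_rate": p.mean(),   # expected share distressed
                           "realised_rate": test["fd"].mean()}
    return forecasts
```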

Fig. 10 Aggregated distress predictions—Model 4

5.3 Portfolio application

In the previous section we illustrated how our methodology can be used to predict individual company distress. Here we expand its use by examining its capacity to assist with the assessment of shipping portfolio risks. A comparison is made between the models’ 95% Value-at-Risk (VaR) estimates and the realised portfolio distress rates; this can help banks with their expected shortfall (ES) assessments. The individual company banking information provided a foundation for the construction of bank portfolio data for this study. Five banks were selected based on the diversification of their bulk fleet exposure over the period 2005–2015 (11 portfolio datasets). The GAM and XGB algorithms were used to estimate the 95% VaR of the FD rates for each portfolio. The results are summarised in Fig. 11: solid vertical lines represent VaR estimates which reach or exceed the realised figures (dots), while broken lines represent estimates that fall below the realised rates. The results show that GAM produced 37 VaR violations and XGB 34, with neither performing well over the 2008–2010 period covering the financial crisis.
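One way to obtain such a VaR figure from per-company predicted distress probabilities is Monte Carlo simulation; the sketch below assumes conditional independence across companies for illustration (a simplification that can understate tail risk), with `pds` and `realised_rate` as placeholder inputs for one portfolio-year.

```python
# Minimal sketch: 95% VaR of a portfolio's distress rate from model PDs.
import numpy as np

def distress_rate_var(pds, alpha=0.95, n_sims=100_000, seed=0):
    rng = np.random.default_rng(seed)
    defaults = rng.random((n_sims, len(pds))) < np.asarray(pds)  # Bernoulli draws
    rates = defaults.mean(axis=1)          # simulated portfolio distress rates
    return np.quantile(rates, alpha)

var95 = distress_rate_var(pds)
violation = realised_rate > var95          # VaR violation for this portfolio-year
```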

Fig. 11 VaR—model estimations vs actuals

6 Conclusions

This study has introduced a novel methodology for the prediction of financial distress in dry bulk shipping, focusing on the noisy and incomplete nature of shipping financial statement information. The methodology comprises a unique combination of features, starting with a flexible definition of FD for shipping company distress. In addition, it includes tools for the analysis and treatment of missing accounting information and incorporates modern machine learning tools.

The main conclusions of the study are, firstly, that multiple imputation can help shed light on the nature and structure of missing accounting data as well as providing effective ML-based tools for its treatment. Secondly, we determined that the ML classification technique XGB showed an improvement over GAM modelling for FD prediction. However, our results indicate that the GAM algorithm has a predictive power comparable to that of more complex ML algorithms. Furthermore, the transparent nature of the GAM algorithm, compared with more complex “black-box” algorithms, could help facilitate the acceptance of this methodology by regulators.

The bulk shipping market case study also demonstrated that the introduction of non-corporate-level macroeconomic and market predictors does not perceptibly improve the predictive power of modern ML tools. The methodology revealed that, whilst macro and market predictors do contribute to FD predictive modelling power, XGB can generalise as effectively by utilising a parsimonious accounting feature set alone. The results indicate that, if sufficiently extensive longitudinal accounting data are available, adequate macroeconomic and market information is captured therein, enabling advanced ML algorithms to generalise effectively on out-of-sample data without the inclusion of additional non-company-level features.

The results, however, have indicated some limitations of the methodology which require further investigation. Given the noisy and incomplete nature of the available data, even complex models do not, at present, achieve an accuracy level sufficient for them to be relied upon as anything other than an early-warning system for FD. Furthermore, the methodology’s predictive power was compromised during the financial crisis of 2008, in that the model completely failed to handle systemic shock at both company and portfolio levels. In short, further research is required to examine techniques for addressing the problem of tail-end events and to improve the treatment of missing accounting information.

In summary, this study demonstrated how the methodology could be used by banking credit risk departments to detect early signs of distress at both individual company and investment portfolio levels, providing for the dynamic monitoring of individual shipping company loans as well as portfolio risk.