1 Introduction

The PCAOB recently introduced a new standard for auditing accounting estimates, stating:

Accounting estimates are an essential part of financial statements. Most companies’ financial statements reflect accounts or amounts in disclosures that require estimation. Accounting estimates are pervasive in financial statements, often substantially affecting a company’s financial position and results of operations… The evolution of financial reporting frameworks toward greater use of estimates includes expanded use of fair value measurements that need to be estimated (PCAOB 2018, p. 3).

Indeed, most financial statement items are based on subjective managerial estimates: fixed assets are presented net of depreciation―an estimate―and accounts receivable, net of estimated bad debts. Liabilities, like pensions and post-retirement benefits, are estimates, and revenues from long-term projects or from contracts with future deliverables include estimates. Many expenses, such as stock option or warranty expenses, also require estimates. Some items, like the pension expense, are based on multiple estimates, some of which, such as the expected gain on pension assets, are essentially guesses. Generally effective audit procedures, such as obtaining third-party confirmations of assets and liabilities, are inapplicable to estimates, which are opinions rather than facts. By and large, accounting estimates are very difficult to audit. There is therefore an urgent need to provide both managers and auditors with an alternative or complementary generator of estimates.

Machine learning, quickly spreading into diverse areas of managerial practice, has the potential to provide such an independent generator of estimates. Used as a predictive tool, machine learning has applications in many domains: researchers and practitioners have exploited its ability to learn patterns in data and have applied it in a variety of contexts. As an antecedent to our study, a growing literature in accounting applies machine learning tools to predict the quality of accounting numbers. The early studies by Perols (2011) and Perols et al. (2017) were among the first in accounting to use machine learning to predict accounting fraud. Two recent studies, by Bao et al. (2020) and Bertomeu et al. (2020), used various accounting variables to improve the detection of ongoing irregularities. Another strand of research investigates the prediction of corporate bankruptcies or defaults using machine learning techniques. For example, Barboza et al. (2017) compared several machine learning models with traditional models and found that boosting, bagging, and random forest algorithms provide better prediction performance. The promising findings in this area encourage the development of new methods to enhance the performance of machine learning tools.

Overall, these studies largely complement ours: while they capture excessive managerial discretion or credit risk anomalies through analysis of financial statement numbers, our approach shows how machine learning can directly improve the estimate of an account balance, thus revealing the mechanisms through which machine learning may alleviate both intentional and unintentional errors. In particular, we demonstrate, using insurance companies’ data on estimates and realizations of loss reserves (estimates of future claims related to current policies), that loss estimates derived from machine learning are, with a few exceptions, superior to the actual managerial loss estimates underlying financial reports. We thus establish, for the first time, the potential of machine learning to independently assess the reliability of estimates underlying financial reports. Machine learning therefore has the potential to substantially improve both the quality of accounting estimates and auditors’ ability to evaluate them, thereby enhancing the usefulness of financial information to investors.

The paper proceeds as follows. Section 2 provides background on the insurance claims loss estimation process. Section 3 discusses the machine learning algorithms used in this study, their application to generating insurance companies’ loss estimates, our sample, and the empirical results comparing the machine learning estimates with managers’ estimates. Section 4 presents additional analyses of estimation errors. Section 5 concludes.

2 Insurance claims loss estimation

Insurance companies provide protection to policyholders from certain risks that occur within a predefined period. While insurers receive the policy premium payments before or early during the period of coverage, the full costs of their activities―the total losses or claims by policyholders―usually remain unknown for several years after the coverage period ends. Insurance regulations require insurers to provide “management’s best estimate” for these future claims in financial reports and to disclose the gradual settlement of loss claims in the following years. This disclosure allows the payoffs to be matched directly to the year in which the initial estimate was made and the related insurance premium revenue was recognized.

The unpaid component of the estimated future losses (claims) is the insurance loss reserve. The loss reserve is often the most significant component of property and casualty insurance firms’ liabilities: on average, loss and loss adjustment expense reserves make up approximately two-thirds of an insurer’s liabilities (Zhang and Browne 2013). Managers’ loss estimation process is necessarily subjective and requires considerable judgment, because not all claims for accidents that occur during a year are filed and settled by year-end. A substantial portion of incurred losses may be “incurred but not reported” claims: policyholders have not reported these losses to the insurance firm by the end of the current year and file the claims in later years. In addition, after the claims are filed, the final cash settlement may take years to complete. For example, injuries in a car accident may lead to several years of treatment and result in extended payments. Thus insurance firms must estimate the costs of claims filed during the year as well as claims that relate to the current year but will be filed in subsequent years.

Given the material impact of loss estimation on insurance firms’ financial results and condition, auditors, investors, and regulators are naturally concerned with the quality of the estimates reported by managers. Studies have established that managers may manipulate loss reserves to achieve various goals (e.g., Grace 1990; Petroni 1992; Weiss 1985; Beaver and McNichols 1998; Gaver and Paterson 2004; Browne et al. 2009).Footnote 1 Anderson (1971) analyzed the insurance industry from 1955 through 1964 and documented that insurers heavily over-reserved losses early in that period but gradually shifted toward slight under-reserving. In contrast, Bierens and Bradford (2005) found that insurance firms from 1983 to 1993 tended to over-reserve. Grace and Leverty (2012) used a more recent sample (1989 to 2002) and found that firms generally overestimated losses but that there was considerable variation in insurers’ practices.

3 Machine learning techniques

3.1 Machine learning algorithms

We compared four popular machine learning algorithms to predict insurance losses for five business lines and selected the algorithm with the best accuracy among those examined. The model-generated predictions were then compared to managers’ estimates in financial reports. The four algorithms we used are linear regression, random forest, gradient boosting machine, and artificial neural network. We briefly discuss each machine learning model used.

3.1.1 Linear regression

Within the language of machine learning, linear regression is a supervised learning method that makes predictions based on the linear relationship between a numeric output and numeric input attributes (Friedman 2001; Bishop 2006). The learning process estimates the coefficients of the input attributes and aims to produce a prediction model that minimizes the mean squared error between the prediction and the true value. To select data attributes for linear regression, we used the M5 method: in each step, the attribute with the smallest standardized coefficient is removed and the regression is re-estimated (Witten et al. 2011). If removing the attribute improves the model’s predictive accuracy in terms of the Akaike information criterion, the attribute is eliminated. This process is repeated until no further improvement is observed.
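
For concreteness, the following is a minimal sketch of this backward-elimination procedure, assuming scikit-learn and NumPy; the helper names (`aic`, `m5_backward_elimination`) and the Gaussian AIC formula are our own illustration, not the authors’ implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aic(model, X, y):
    """AIC for a Gaussian linear model: n*ln(RSS/n) + 2k (illustrative)."""
    n = len(y)
    rss = np.sum((y - model.predict(X)) ** 2)
    k = X.shape[1] + 1  # coefficients plus intercept
    return n * np.log(rss / n) + 2 * k

def m5_backward_elimination(X, y, names):
    """Repeatedly drop the attribute with the smallest standardized
    coefficient as long as doing so improves (lowers) the AIC."""
    keep = list(range(X.shape[1]))
    model = LinearRegression().fit(X[:, keep], y)
    best_aic = aic(model, X[:, keep], y)
    improved = True
    while improved and len(keep) > 1:
        # standardized coefficient = |b_j| * std(x_j) / std(y)
        std_coefs = np.abs(model.coef_) * X[:, keep].std(axis=0) / y.std()
        candidate = keep[int(np.argmin(std_coefs))]
        trial = [j for j in keep if j != candidate]
        trial_model = LinearRegression().fit(X[:, trial], y)
        trial_aic = aic(trial_model, X[:, trial], y)
        if trial_aic < best_aic:   # removal helps: keep the attribute out
            keep, model, best_aic = trial, trial_model, trial_aic
        else:                      # no improvement: stop
            improved = False
    return [names[j] for j in keep], model
```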

Machine learning algorithms learn the hidden patterns of data in a way governed by a specific combination of hyper-parameters.Footnote 2 Determining the combination of hyper-parameters that produces the most accurate model relies on trial and error. We used a Cartesian grid search to configure the hyper-parameters for linear regression as well as for the other three algorithms in this research. A Cartesian grid search methodically builds and evaluates a model for each possible combination of a specified subset of hyper-parameter values. For linear regression, the hyper-parameters are the learning rate and the number of iterations. The learning is an iterative process of continuously updating the model weights (parameters): the entire training data is passed through the algorithm multiple times, and the weights are updated on each pass. One complete pass of the entire training data through the model is called a training epoch. Because the entire dataset cannot always be passed to the algorithm at once, the data is divided into batches, and the number of iterations is the number of batches needed to complete one epoch. The learning rate is the extent to which the weights are adjusted during learning: lower learning rates require more training epochs, because smaller adjustments are made to the model weights in each update, whereas higher learning rates result in rapid changes and require fewer training epochs. The grids of hyper-parameter values we tried are listed in Table 1.
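
As an illustration of a Cartesian grid search, the sketch below uses scikit-learn’s `GridSearchCV` with a gradient-descent linear regressor (`SGDRegressor`); the library choice, the synthetic data, and the grid values are placeholders rather than the configuration reported in Table 1.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data stand in for the insurance sample.
X_train, y_train = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Hypothetical grid over learning rate and number of training passes;
# every combination is fitted and evaluated, and the best one is retained.
param_grid = {
    "eta0": [0.001, 0.01, 0.1],      # learning rate
    "max_iter": [100, 500, 1000],    # number of training iterations
}
search = GridSearchCV(
    SGDRegressor(learning_rate="constant", random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```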

Table 1 Tuning details for machine learning algorithms

3.1.2 Random forest

The random forest algorithm is built from decision trees, a machine learning technique that extracts information from data and displays it in a tree-like structure. A decision tree consists of three components: nodes, branches, and leaves. Each internal node of the tree tests an input attribute; the tree splits into branches based on the input attributes, with each branch representing a decision. The end of a branch is called a leaf, and each leaf node yields a prediction of the target value. Decision trees can be applied to either regression or classification problems; we employed regression trees because the target variable (actual losses) is continuous. A single decision tree has limited capacity to learn the data, whereas random forest improves the accuracy of decision trees with an ensemble technique (Breiman 2001, 2002). Specifically, the random forest algorithm first generates a “forest” of decision trees, each tree using a subset of randomly selected attributes. It then aggregates over the trees to yield the final predictor; for a regression problem, the output is the mean prediction of the individual trees. In general, the larger the number of individual trees, the better the predictive performance of the random forest.Footnote 3 However, adding trees slows the training considerably, and after a certain point the benefit in prediction performance from additional trees no longer justifies their computation cost (Ramon 2013).

The hyper-parameters for the grid search are the number of trees, the maximum depth of each tree, and the minimum leaf size. The maximum tree depth limits how deep each tree in the random forest can grow; a deeper tree has more levels of splits between the root node and its leaves and captures more information about the data. However, as the tree depth increases, the model may suffer from overfitting because it captures too many details (Brownlee 2016). The third hyper-parameter is the minimum number of observations in a leaf. A smaller leaf size allows the model to capture finer patterns but also makes it more prone to fitting noise in the training data, which may result in overfitting. On the other hand, if the leaf size is too large, the tree stops growing after a few splits, resulting in poor predictive performance. Table 1 provides the grids of values tried for the random forest.
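
A minimal sketch of tuning these three hyper-parameters, assuming scikit-learn’s `RandomForestRegressor`; the synthetic data and the grid values are illustrative, not those of Table 1.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],   # number of trees in the forest
    "max_depth": [5, 10, 20],          # maximum depth of each tree
    "min_samples_leaf": [1, 5, 10],    # minimum number of observations per leaf
}
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
rf_search.fit(X, y)
print(rf_search.best_params_)
```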

3.1.3 Gradient boosting machine

The gradient boosting machine uses an ensemble technique termed boosting, which trains new prediction models on the errors of the previous models, converting weak prediction models into stronger ones (Schapire 1990). The objective of boosting is to minimize model errors by sequentially adding weak learners (i.e., regression trees). Each newly added tree corrects the errors made by the previous trees, reducing the residuals in a gradient descent manner (Friedman 2001; Mason et al. 2000). After the number of trees reaches a certain point, adding more trees no longer improves the prediction performance (Brownlee 2016). The grid search approach for the gradient boosting machine was the same as that for the random forest, except that we started with five as the maximum tree depth to ensure that the learners were weak but could still be constructed in the gradient boosting machine. The grids of values tried are reported in Table 1.
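
The same tuning pattern applies to boosting; the sketch below, assuming scikit-learn’s `GradientBoostingRegressor`, caps the tree depth at five so that each learner stays weak. The data and grid values are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# Shallow trees keep each learner "weak"; boosting adds them sequentially,
# with every new tree fitted to the residual errors of the current ensemble.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.1],
}
gbm_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
gbm_search.fit(X, y)
print(gbm_search.best_params_)
```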

3.1.4 Artificial neural networks

An artificial neural network consists of multiple layers of interconnected nodes between the input and output data. Each layer transforms its input into a more abstract representation, which then serves as the input to the next layer. An artificial neural network has three types of layers: input, hidden, and output layers. The input layer receives the raw data of the explanatory variables, and the number of nodes in the input layer equals the number of explanatory variables. The input layer is connected to hidden layers, which apply complex transformations to the incoming data and transmit the output to the next hidden layer. The output is transmitted only if it exceeds a certain threshold determined by an activation function.Footnote 4 The data processing is performed through a system of weighted connections: the values entering a hidden node are multiplied by the corresponding connection weights, and the weighted inputs are then summed to produce a single output. There may be one or more hidden layers; a neural network is called deep when more than two hidden layers exist. The output of the final layer (the output layer) represents the high-level information extracted from the raw data (Sun and Vasarhelyi 2017). We normalized all input variables to improve model performance and used several different values of hyper-parameters for the grid search. The hyper-parameters include the activation function, the number of hidden layers, the number of nodes, the number of epochs, and the learning rate. The details are provided in Table 1.
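
A minimal sketch of this setup, assuming scikit-learn’s `MLPRegressor` with standardized inputs; the hidden-layer sizes, activation functions, and other grid values are placeholders, not those in Table 1.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=15, noise=10.0, random_state=0)

# Inputs are normalized before entering the network, as described in the text.
pipe = make_pipeline(StandardScaler(), MLPRegressor(random_state=0))
param_grid = {
    "mlpregressor__activation": ["relu", "tanh"],            # activation function
    "mlpregressor__hidden_layer_sizes": [(32,), (64, 32)],   # hidden layers and nodes
    "mlpregressor__learning_rate_init": [0.001, 0.01],       # learning rate
    "mlpregressor__max_iter": [200, 500],                    # training epochs
}
nn_search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
nn_search.fit(X, y)
print(nn_search.best_params_)
```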

3.2 Data splitting and performance validation

We employed a training, validation, and testing approach in this study, with the last year in the sample serving as the holdout set. For each algorithm, we developed machine learning models using fivefold cross-validation and used the holdout set to evaluate the practical usefulness of the models. Fivefold cross-validation is a widely used resampling procedure in machine learning for estimating a model’s performance on a limited data sample. Specifically, the observations in the cross-validation sample are shuffled and randomly split into five equal groups. Each group is used, in turn, as a validation set, and the remaining four groups are combined into a training set. A model is developed on the training set and evaluated on the validation set using various evaluation metrics, and the results are averaged to produce a single assessment of model performance. Thus all observations are used for both training and validation, with each observation used exactly once for validation. We then applied the models developed in the cross-validation process to produce loss estimates for the holdout period that follows the training and validation period. Figure 1 illustrates the splits. Observations in the holdout set were not used in the model development process; therefore the holdout set was used only for evaluation rather than for model selection.
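
A minimal sketch of this splitting scheme, assuming a pandas DataFrame `df` with a `year` column, a target column `ActualLosses`, and a list of predictor columns `features`; these names and the use of scikit-learn are our assumptions, as the paper does not show code.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

def cv_and_holdout(df, features, target="ActualLosses", holdout_year=2008):
    """Fivefold cross-validation on all years before the holdout year,
    then a single evaluation on the holdout year."""
    train_val = df[df["year"] < holdout_year]
    holdout = df[df["year"] == holdout_year]

    model = RandomForestRegressor(random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    cv_mae = -cross_val_score(
        model, train_val[features], train_val[target],
        scoring="neg_mean_absolute_error", cv=cv,
    ).mean()

    # Refit on the full training/validation sample; evaluate once on the holdout year.
    model.fit(train_val[features], train_val[target])
    holdout_mae = mean_absolute_error(holdout[target], model.predict(holdout[features]))
    return cv_mae, holdout_mae
```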

Fig. 1 Illustration of the training/validation/testing approach

This design has two advantages in our setting. First, cross-validation generally results in a less biased and more robust estimate of the model accuracy than a simple training and testing split method (Brownlee 2018). Second, the training, validation, and testing approach alleviates the potential problems introduced by the inherent time-series order of the data.

3.3 Application to insurance loss estimation

We now explore the use of machine learning techniques in generating loss estimates for insurance companies. We conducted two sets of tests for each line of insurance (private passenger auto liability, commercial auto liability, etc.). The first set of tests did not include managers’ loss estimates as an input attribute, keeping only variables that are not directly affected by managerial judgments. In other words, the variables used in these tests are based on verifiable facts, such as the number of outstanding claims, the number of loss claims closed with and without payment, etc. In the second set of tests, we added managers’ initial estimates to the machine learning models. This design enables us to evaluate the performance of machine learning techniques on a standalone basis as well as when managers’ inputs are incorporated into the algorithms. Moreover, because firms may experience different loss claim and payment patterns during the financial crisis period, we constructed three test periods: models developed from training samples of 1996–2005, 1996–2006, and 1996–2007 were applied to the holdout sets in 2006, 2007, and 2008, respectively.Footnote 5
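
This test design can be sketched as follows, again assuming a pandas DataFrame with `year`, `ActualLosses`, and `ManagerEstimate` columns; the function is illustrative rather than the authors’ code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def rolling_holdout_tests(df, base_features, target="ActualLosses"):
    """For each holdout year, train on 1996 up to the prior year, once without
    and once with managers' initial estimates as an input attribute."""
    results = []
    for holdout_year in (2006, 2007, 2008):
        train = df[(df["year"] >= 1996) & (df["year"] < holdout_year)]
        test = df[df["year"] == holdout_year]
        for features in (base_features, base_features + ["ManagerEstimate"]):
            model = RandomForestRegressor(random_state=0)
            model.fit(train[features], train[target])
            mae = mean_absolute_error(test[target], model.predict(test[features]))
            results.append({"holdout_year": holdout_year,
                            "with_manager_estimate": "ManagerEstimate" in features,
                            "mae": mae})
    return pd.DataFrame(results)
```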

3.4 Insurance company data

U.S. insurance companies follow statutory accounting principles (SAP)Footnote 6 in preparing their statutory financial statements. Managers are required to provide an initial estimate of all losses (payments to the insured) incurred in the current year (both paid and expected to be paid in the future) as well as to re-estimate the losses incurred in each of the previous nine years. In the statutory filings, insurance firms report the estimated total losses as “incurred losses” and the actual cumulative payments in each of the past 10 years (including the current year). The difference between the reported incurred losses and the cumulative paid losses for a given year is the reserve for future loss payments.

An advantage of the insurance industry data is that the extensive reporting requirements under SAP make it possible to match the actual payoffs to the insured directly to the year in which the initial estimate was made. Table 2 provides an example of this disclosure. Panel A shows the development of incurred losses for the National Lloyds Insurance Company. By the end of 2008, the estimates for the most recent 10 years (including the current year) are disclosed in Column 10. For example, in 2008, the current estimates for the losses incurred in the years 2007 and 2008 were $24,334,000 and $36,893,000, respectively. This compares with the initial estimate for the losses incurred in 2007 of $24,226,000 (Column 9), meaning that Lloyds revised the estimated losses incurred in 2007 upward by $108,000 ($24,334,000−$24,226,000). Panel B reports the cumulative paid losses for each accident year, from 1999 to 2008, by the end of 2008. For example, by the end of 2008, the company had paid $32,324,000 for the losses incurred during the year 2008 and $23,585,000 for the losses incurred during 2007 (Column 10). In this example, the cumulative paid losses and the incurred losses for the year 1999 converge in 2006 and remain unchanged at $4,618,000 (Row 2), indicating that Lloyds likely paid off all the claims for accidents that occurred in 1999 by the end of 2006.

Table 2 Illustration of incurred losses and cumulative paid net losses reported in 2008 for National Lloyds Insurance Company

3.5 Dependent variable

Our dependent variable is ActualLosses, the actual ultimate cost of the insured events occurring in the year of coverage. It is the sum of the losses already paid in the coverage year and the losses to be paid in the future. We chose total actual losses as our dependent variable because they can be directly compared to the managers’ “incurred losses” reported in annual reports. Studies that have examined insurance companies’ loss reserve errors measured the “actual losses” as the cumulative losses paid during several subsequent years (Weiss 1985; Grace 1990; Petroni and Beasley 1996; Browne et al. 2009; Grace and Leverty 2012). We measured ActualLosses for an accident year t as the 10-year cumulative payment of losses incurred in year t. This variable was extracted from the financial report in year t + 9, which is also the last time managers disclose the loss payments for the initial accident year t.Footnote 7 For losses incurred in each business line during a given accident year, we generated only one estimate, based on the information available at the end of that year, and the model estimates were compared to managers’ initial predictions made in year t.
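
Assuming the Schedule P data are arranged in a long format with hypothetical columns `accident_year`, `report_year`, and `cumulative_paid`, the target could be constructed roughly as follows.

```python
import pandas as pd

def actual_losses(schedule_p: pd.DataFrame) -> pd.Series:
    """ActualLosses for accident year t = cumulative losses paid through year t+9,
    i.e., the cumulative paid amount reported in the year t+9 filing."""
    final = schedule_p[schedule_p["report_year"] == schedule_p["accident_year"] + 9]
    return final.set_index("accident_year")["cumulative_paid"]
```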

3.6 Independent variables

Our independent variables (predictors) consist of information already known at the time of estimation, that is, in year t (no look-ahead bias). We included three sets of independent variables. First, operational variables (e.g., claims outstanding, premiums written, or premiums ceded to reinsurers) for each business line were obtained from Schedule P of the statutory filings. The second set of variables comprises company characteristics (e.g., total assets or state of operation) for the accident year. Finally, we added exogenous environmental variables (e.g., inflation or GDP growth) that reflect macroeconomic factors that may influence the payments for the accident year. We used the lagged values of the macroeconomic data because some information may be released only after the insurers’ financial statements are prepared. Definitions of all the independent variables are provided in Appendix Table 10.

3.7 Evaluation metrics

We used two metrics to compare estimates with actuals: the mean absolute error (MAE) and the root mean square error (RMSE). MAE is the average of the absolute differences between predictions and actual observations, and RMSE is the square root of the average of squared differences between predictions and actual observations. A smaller value of MAE or RMSE indicates higher predictive accuracy. They are calculated as follows.

$$ MAE=\frac{1}{n}\sum_{j=1}^{n}\left| TrueValue_j - ModelEstimate_j \right| $$
(1)
$$ RMSE=\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left( TrueValue_j - ModelEstimate_j \right)^2} $$
(2)
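
These metrics can be computed directly; a short example with scikit-learn and NumPy follows (the numbers are made up for illustration).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

true_losses = np.array([100.0, 250.0, 80.0])      # actual ultimate losses (illustrative)
model_estimates = np.array([110.0, 240.0, 95.0])  # model-generated estimates (illustrative)

mae = mean_absolute_error(true_losses, model_estimates)
rmse = np.sqrt(mean_squared_error(true_losses, model_estimates))
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```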

We compared the machine learning loss predictions with managers’ estimates to evaluate the machines’ performance, using MAE and RMSE.

3.8 Our sample

We used the annual reports of US-based property and casualty insurance companies filed with the National Association of Insurance Commissioners (NAIC). The data were extracted from the SNL FIG website (S&P Global Market Intelligence platform), covering the period 1996 to 2017.

Property and casualty insurers offer a wide variety of insurance products that cover many different business lines, with each line having unique operating characteristics. The following procedures were separately performed on each business line to obtain the test samples.

a) For each business line, we first identified the insurance companies that had conducted business in this line. To be included in the sample, the firm must have started this line’s business before 2008 and remained active until 2017, so that we can extract the ultimate payment (over 10 years) for at least one accident year.

b) Firm-years with missing or zero values for all operational variables were deleted from the sample. If total assets, total liabilities, net premiums written, or the direct premiums written were zero or negative, the firm-year was also excluded from the sample.

c) For each business line, we only kept observations that had positive total premiums and cumulative paid losses.

We focused on five business lines: (1) private passenger auto liability, (2) commercial auto liability, (3) workers’ compensation, (4) commercial multi-peril, and (5) homeowner/farmowner. The five lines were selected from the 20 business lines identified by NAIC primarily because these lines had a sufficiently large number of observations remaining after the sample selection process (more than 400 insurers), indicating significant operations in the business lines. We excluded minor lines, such as special liability, products liability, and international, together making up less than 5% of industry loss reserves (A. M. Best 1994). In the final sample, we have a total of 32,939 line-firm-year observations for all five business lines combined, with each line’s number of observations provided in Table 4.

Table 3 reports the payment patterns for each business line. The “tail” is an insurance expression describing the time between the occurrence of the insured event (accident) and the final settlement of the claim. A longer tail usually implies higher uncertainty in the estimation of ultimate losses. As indicated in Table 3, for the private passenger auto liability line, 40.64% of the ultimate losses are paid off during the initial accident year, and 99.82% of the total payments throughout the 10 years are made during the first five years. The homeowner/farmowner business line (bottom line) has a payment pattern that differs from the private passenger auto liability line (top line): for these policies, insurers pay off 72.62% of the ultimate losses by the end of the first year, and 93.50% of the loss payments are made by the end of the second year. This suggests that the homeowner/farmowner business line has a relatively short tail compared with the other four lines, consistent with prior studies that investigated the tail characteristics of insurance business lines (Nelson 2000). Overall, for all five business lines, the majority of payments are made within the first five years.

Table 3 Cumulative payment percentage in the first five years for each business line

Table 4 reports summary statistics for firms that operate in each business line. The private passenger auto liability line is the largest business line in dollar terms, with total premiums written of $142 million on average, followed by the homeowner/farmowner line and the workers’ compensation line. The total premiums written in the private passenger auto liability line during 2015 amounted to $199.37 billion, making up 34% of all property and casualty insurance business.Footnote 8 Commercial auto liability and workers’ compensation insurance are usually provided by larger insurance companies, with average assets of $1.467 billion and $1.437 billion, respectively. In general, managers overestimate the ultimate losses when they report the future loss projections for the first time, except in the commercial auto liability line.

Table 4 Summary Statistics for the five insurance business lines

3.9 Empirical results

In this section, we first report the fivefold cross-validation machine learning results for each business line and then present the holdout test results. For each machine learning algorithm, we developed models with and without managers’ loss estimates as an input attribute and reported the results separately. Because our objective is to evaluate whether machine learning models outperform managers in terms of predictive accuracy, we compared the model-generated estimates to managers’ predictions in financial reports. For example, the Sentry Select Insurance Company initially estimated the total losses incurred during 2006 in the private passenger auto liability line to be $49,127,000, while the actual losses were $41,787,000.Footnote 9 In the cross-validation process, the random forest algorithm without managers’ loss estimates generated an estimate of $43,657,000, and the prediction was $41,990,000 with managers’ estimates included in the model. Thus both machine models generated estimates that were substantially closer to the actual losses than managers’ prediction. Moreover, in the holdout tests, where we used the model developed from 1996 to 2006 to predict the losses incurred in 2007, the random forest models with and without managers’ estimates predicted $38,953,000 and $40,996,000, respectively. The managers’ estimate was $43,650,000 for the same year, while the actual losses turned out to be $39,791,000, suggesting that the machine learning predictions were again more accurate than the estimate provided by the firm.

We used the MAE and RMSE metrics to evaluate model performance.Footnote 10 In the cross-validation process, we found that the random forest algorithm produced good predictions for four out of five lines and the linear regression model performed well for the fifth—homeowner/farmowner line. Thus we report the linear regression prediction results for the homeowner/farmowner line and present the random forest prediction results for the other four lines.Footnote 11 We also report the model accuracy edge, relative to manager estimates. The accuracy edge of a machine learning model is computed as managers’ estimation MAE (RMSE) minus model estimation MAE (RMSE), divided by managers’ estimation MAE (RMSE).
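
Expressed as a formula, a positive edge means the model’s errors are smaller than managers’:

$$ \mathrm{Edge}_{MAE}=\frac{MAE_{manager}-MAE_{model}}{MAE_{manager}}, \qquad \mathrm{Edge}_{RMSE}=\frac{RMSE_{manager}-RMSE_{model}}{RMSE_{manager}} $$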

3.10 Fivefold cross-validation

As described in Section 3.3, we first used three samples to train and validate the machine learning models: the 1996–2005, 1996–2006, and 1996–2007 samples. We selected the models based on the cross-validation results and report their performance in Table 5. For each of the five business lines examined, we report the MAE and RMSE of managers’ and the models’ estimates as well as the corresponding accuracy edges. For example, managers’ estimation MAE (RMSE) for the private passenger auto liability line (line No. 1, first row) during the period 1996–2005 was 9461 (37,494). The random forest algorithm without managers’ estimates yielded an MAE (RMSE) of 8213 (34,687), an accuracy edge of 13% (7%) over managers’ estimates. In the 1996–2006 and 1996–2007 samples, the random forest had smaller MAE and RMSE than managers, exhibiting higher predictive accuracy.

Table 5 Cross-validation results

The results for the commercial auto liability line (line No. 2), the workers’ compensation line (line No. 3), and the commercial multi-peril line (line No. 4) also suggest that the random forest algorithm achieves an accuracy edge over managers’ estimates. Specifically, in line No. 2, the random forest estimates were, on average, 16% (26%) more accurate than managers’ predictions in terms of MAE (RMSE). When predicting losses for line No. 3, the random forest model exhibited a considerable accuracy edge: on average, its MAE (RMSE) was 40% (35%) lower than managers’. Turning to line No. 4, the average accuracy edge of the random forest model measured by MAE (RMSE) was 14% (19%), relative to managers. After including ManagerEstimate as an additional attribute, the performance of the random forest algorithm improved further. The results are reported on the right-hand side of Table 5. In line No. 1, the average accuracy edge of the random forest increased to 24% (15%) in MAE (RMSE). The MAE (RMSE) comparisons in lines No. 2–4 also indicate an enhancement of model accuracy after incorporating managers’ estimates. The accuracy edge in line No. 3 was the largest: 43% (39%) based on MAE (RMSE), on average.

In the homeowner/farmowner line (No. 5), however, managers outperformed the machine learning models. After consulting industry experts about this exception, we believe a possible explanation is that the homeowner/farmowner line contains unique types of losses, such as catastrophes, property damage, and bodily injury, and these loss categories are not differentiated in firms’ financial reports. Mixing different loss types is problematic because each category has a unique payment pattern, which is known to managers but not available for the development of the machine learning prediction models. In addition, the homeowner/farmowner line has a relatively short tail, implying that the majority of losses (claims) have been reported and paid off during the first year (see Table 3). In this case, the total losses for most homeowner/farmowner accidents are already known to managers by year-end, making it challenging for machine learning models to outperform managers.

Collectively, our results suggest that machine learning models generate more accurate loss predictions than managers in most circumstances. Furthermore, we found that, in general, models incorporating ManagerEstimate have higher predictive accuracy, compared to the models without it.

To provide insights into the machine learning procedure, we present in Table 6 the 15 most influential variables identified by the random forest algorithm for each business line.Footnote 12 For example, the premiums written in the accident year (LinePremiums) was the most powerful predictor for the random forest algorithm to estimate losses in line No. 1. Overall, we observed that several predictors, such as LinePremiums, LinePayment, and ManagerEstimate (when added to the model), consistently play an important role in generating model predictions.
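
The paper does not specify how variable influence was measured; one common way to produce such a ranking from a fitted random forest is to use its impurity-based feature importances, as in this sketch.

```python
import pandas as pd

def top_influential_variables(fitted_forest, feature_names, k=15):
    """Rank attributes by the forest's impurity-based importances and return
    the k most influential ones (a Table 6-style ranking; the paper's exact
    procedure is not specified)."""
    importances = pd.Series(fitted_forest.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)
```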

Table 6 List of influential variables in machine learning algorithms

3.11 Holdout tests

In this section, we apply the models developed from the cross-validation sample to predict losses for the holdout period. Specifically, the model established from the 1996–2005 sample was used to predict losses in 2006, and the model from the 1996–2006 (1996–2007) sample was employed to predict the losses in 2007 (2008).Footnote 13

The holdout tests examined the predictive accuracy of machine learning models on holdout sets—a more demanding prediction test. Table 7 Panel A reports the holdout results. Overall, the findings are consistent with the cross-validation results, indicating that machine learning models have superior predictive accuracy. When ManagerEstimate was not included in the model, the random forest algorithm generated more accurate estimates than managers in most cases for lines No. 1–4.Footnote 14 After we added ManagerEstimate to the model, its performance further improved. The analyses of line No. 5 (homeowner/farmowner line) indicate that managers outperformed linear regression when ManagerEstimate was not included in the model. After we added ManagerEstimate as an independent variable, the model’s performance was enhanced: its prediction error measured by RMSE was slightly smaller than managers’ in all three holdout samples.

Table 7 Holdout test results

We used the bootstrap technique to examine the statistical significance of the difference in prediction performance between the machine learning models and managers.Footnote 15 The bootstrap method uses an existing sample to create a large number of simulated samples, which can be used to estimate the distribution of the performance difference. Specifically, we used the bootstrap to simulate 10,000 samples for each holdout sample and computed the difference between managers’ and the models’ absolute prediction errors in each bootstrap sample. These differences vary across the simulated samples and form a distribution. Table 7 Panel B reports the bootstrap means of the prediction error differences and the standard errors, along with tests of whether the differences are significantly different from zero. Overall, the bootstrap analyses support the conclusions drawn from the holdout tests.
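
A minimal sketch of this paired bootstrap, assuming NumPy arrays of managers’ estimates, model estimates, and actual losses for a holdout sample; the p-value construction is one common convention, not necessarily the authors’ exact procedure.

```python
import numpy as np

def bootstrap_error_difference(manager_est, model_est, actual, n_boot=10_000, seed=0):
    """Paired bootstrap of the mean difference between managers' and the model's
    absolute prediction errors on a holdout sample."""
    rng = np.random.default_rng(seed)
    diff = np.abs(manager_est - actual) - np.abs(model_est - actual)
    n = len(diff)
    boot_means = np.array([
        diff[rng.integers(0, n, n)].mean() for _ in range(n_boot)
    ])
    mean_diff = boot_means.mean()
    std_err = boot_means.std(ddof=1)
    # Two-sided p-value: share of bootstrap means on the other side of zero.
    p_value = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return mean_diff, std_err, p_value
```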

Taken together, our results indicate the usefulness of machine learning models in estimating insurance losses, particularly for long-tail lines of insurance business. In addition, the random forest algorithm consistently shows superior predictive accuracy for long-tail business lines, and the linear regression performs better when the claims tail is short. Thus it is essential to understand the economics of a business line before applying a model to predict its losses. Furthermore, leveraging the information in managers’ estimates enhances the prediction performance of machine learning models.

4 Additional analyses: Estimation errors

We now provide more detailed analyses of managers’ and the machine learning models’ estimation errors. Although the machine learning process is often viewed as a black box and the models are challenging to interpret, in this section we shed some light on an important question: what drives the advantage of machine learning models over managers?

Consistent with prior research, we defined managers’ estimation error (ManagerError) as the reported loss estimate minus the actual loss, scaled by total assets. We focused on signed estimation errors, rather than the absolute errors used in the previous section, because signed errors give more insight into managers’ reporting bias and are easier to interpret, whereas in the previous section our main objective was to evaluate prediction accuracy. Similarly, the machine learning model estimation error (ModelError) is the model estimate minus the actual loss, scaled by total assets. Because the random forest algorithm performed well in most business lines, we used the holdout predictions generated by the random forest models for the years 2006, 2007, and 2008 to calculate the model estimation errors. Table 8 compares managers’ estimates to the models’ estimates. On average, the model estimates were more accurate than the manager estimates: the average absolute estimation error of the machine learning models with (without) manager estimates as an input attribute was 0.0106 (0.0110), while the average absolute manager estimation error was 0.0120. The difference is significant at the 1% (5%) level. In addition, managers’ signed estimation errors were larger than the model errors on average, suggesting that managers tended to overstate insurance losses during our sample period. The results on managers’ estimation errors are generally consistent with previous studies of insurance loss estimation errors (e.g., Grace and Leverty 2012). To better understand the aggregate effect of estimation errors, we added the five lines’ losses for the current accident year and compared the total to the corresponding true aggregate reserves.Footnote 16 The results in Panel B of Table 8 suggest that, at the aggregate level, managers’ estimation errors averaged around 2.9% of total assets, while the machine learning models had an error percentage of 2.5%.
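
In symbols, for firm-line i and accident year t,

$$ ManagerError_{i,t}=\frac{ManagerEstimate_{i,t}-ActualLosses_{i,t}}{TotalAssets_{i,t}}, \qquad ModelError_{i,t}=\frac{ModelEstimate_{i,t}-ActualLosses_{i,t}}{TotalAssets_{i,t}} $$

so positive values indicate overstated losses (over-reserving) and negative values understated losses.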

Table 8 Additional Analyses on Estimation Errors

Overall, the results suggest that machine learning algorithms can provide more accurate insurance loss estimates than those reported by managers. Broadly speaking, three reasons may explain the models’ edge. First, managers may be using low-quality information or may fail to consider relevant information in their estimations. This is unlikely, however, as all the input variables we included in the models were available by the end of the initial accident year, and the majority of them were extracted from the insurers’ own financial statements. The macroeconomic variables (e.g., GDPLevel and Inflation) from external sources had trivial influence in the model estimation process (see Table 6). Thus the larger errors in managers’ estimates are not likely to be caused by inferior information quality. The second possible explanation is that insurance firms may apply erroneous estimation models. If so, we would expect managers’ estimates to be of little value when incorporated into the model estimation. However, we found that, in general, managers’ estimates were among the top four most influential variables in predicting ultimate losses (see Table 6), and incorporating managers’ estimates into the machine learning algorithms increased predictive accuracy (see Tables 5 and 7). This finding suggests that managers’ estimation procedures were, overall, effective in producing loss estimates. Third, various incentives may motivate managers to intentionally report biased estimates. It is well documented in the literature that managerial incentives can lead to reporting bias, and reserving practice in the insurance industry provides managers substantial flexibility and latitude to manipulate the numbers they report: management may increase or decrease reported income by selecting a particular reserve level, which by nature is a subjective estimate of future cash payments. Moreover, the literature has found that auditors (Petroni and Beasley 1996) and regulators (Gaver and Paterson 2004) do not seem to detect insurers’ earnings management effectively. Managers may also adopt a conservative practice of over-reserving to maintain a positive reserve margin.Footnote 17 However, as long as other managerial incentives are present in the reporting decision and information users cannot anticipate those biases perfectly, manipulation is likely to occur, reducing the value of financial reports (Fischer and Verrecchia 2000; Samuels et al. 2018). We therefore started by examining the incentives identified by prior studies and then investigated whether the machine learning models’ estimation errors were affected by the incentives that cause managers’ biases.

Because loss estimates enter the determination of taxable income, insurers can reduce current tax liabilities by classifying more income as reserves; thus insurers tend to overstate loss reserves for tax purposes (Grace 1990). The decision variable here is taxable income before reserves are determined; a higher value motivates managers to overstate reserves. Following this logic, research has captured insurers’ tax incentive by adding the estimated reserves back to the reported level of taxable income to derive TaxShield, which takes a larger value if the insurer has a stronger tax reduction incentive (e.g., Grace 1990). It is calculated as follows.Footnote 18

$$ TaxShield_t=\frac{Net\ Income_t + Loss\ Reserve_t}{Total\ Assets_t} $$
(3)

Second, we evaluated the impact of another well-recognized managerial incentive to distort reported reserves: the income smoothing incentive (Weiss 1985; Grace 1990; Beaver et al. 2003). Firms are generally reluctant to report volatile earnings, which may indicate higher risk and discourage potential investors or bondholders from investing in the firm (e.g., Lambert 1984; Trueman and Titman 1988). Regulators are also concerned with a firm’s income stability and include the change in the surplus ratio in their solvency test (Grace 1990).Footnote 19 As mentioned above, the unique nature of insurance loss reserving gives managers ample opportunities to smooth income. We followed prior research and used an insurer’s average return on assets (ROA) during the past three years as our proxy for the income smoothing incentive (Smooth). The intuition is that, following three good years, insurers tend to underestimate loss reserves in the current year to inflate earnings and continue the positive trend (Grace 1990). In addition to the Smooth variable, we used an indicator variable, SmallProfit. Beaver et al. (2003) found that firms with small positive earnings are likely to have boosted reported income by understating loss reserves. Similar to Grace and Leverty (2012), we identified the firm-years in the bottom 5% of the positive earnings distribution and expected these firms to have understated insurance loss reserves.

Financial distress also drives insurance firms to manage reserve estimates (Petroni 1992; Gaver and Paterson 2004). Regulators use the Insurance Regulatory Information System (IRIS) ratios to assess insurance firms’ solvency. The NAIC provides a “usual range” for each ratio, and a ratio violation occurs when the ratio falls outside that range. Regulatory intervention is triggered when the number of ratio violations exceeds an acceptable threshold. Thus financially weak firms tend to under-reserve to appear adequately capitalized and avoid regulatory scrutiny (Petroni 1992; Gaver and Paterson 2004). We set the variable Violation equal to 1 if the firm has at least one IRIS ratio violation and 0 otherwise. Because managers may manipulate financial reports to avoid ratio violations (e.g., Petroni 1992; Gaver and Paterson 2004; Guttman and Marinovic 2018), we expected firms with IRIS ratio violations to be more likely to understate losses, as they are closer to triggering regulatory scrutiny than firms without violations. In addition, insurers are required to maintain sufficient capital as measured by the risk-based capital ratio, defined as total adjusted capital divided by the authorized control-level risk-based capital (a hypothetical minimum capital level determined by the risk-based capital formula); regulators may take action against a firm if this ratio falls below two. We therefore included another indicator variable, Insolvency, which equals 1 if the ratio is smaller than two and 0 otherwise.

We also included a series of firm and business line characteristics as control variables. Detailed control variable definitions are provided in Appendix Table 10. The following regression model was used to examine the incentives related to manager estimation errors.

$$ ManagerError={\alpha}_0+{\alpha}_1 TaxShield+{\alpha}_2 Smooth+{\alpha}_3 SmallProfit+{\alpha}_4 Insolvency+{\alpha}_5 Violation+{\alpha}_6 Liab+{\alpha}_7 Crisis+{\alpha}_8 Size+{\alpha}_9 SmallLoss+{\alpha}_{10} Profit+{\alpha}_{11} Loss+{\alpha}_{12} LineSize+{\alpha}_{13} Reinsurance+{\alpha}_{14} Public+{\alpha}_{15} Mutual+{\alpha}_{16} Group+ LineFixedEffects+\epsilon $$
(4)
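
For illustration, eq. (4) could be estimated with the statsmodels formula API as sketched below; the DataFrame and the `Line` column used for the fixed effects are assumptions, and the variable names follow the text and Appendix Table 10.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_eq4(df: pd.DataFrame, dependent: str = "ManagerError"):
    """Estimate eq. (4); pass dependent="ModelError" to re-run the regression
    with the model estimation errors as the outcome, as in Table 9 column (2)."""
    formula = (
        f"{dependent} ~ TaxShield + Smooth + SmallProfit + Insolvency + Violation"
        " + Liab + Crisis + Size + SmallLoss + Profit + Loss + LineSize"
        " + Reinsurance + Public + Mutual + Group + C(Line)"
    )
    return smf.ols(formula, data=df).fit()
```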

The regression results are reported in column (1) of Table 9. Consistent with prior research, we found that firms with a stronger tax reduction incentive were more likely to over-reserve, as indicated by the significant and positive coefficient of TaxShield (Coeff. = 0.055, p < 0.01). The significantly negative values for the coefficients of Violation (Coeff. = −0.003, p < 0.01) and SmallProfit (Coeff. = −0.005, p < 0.05) suggest that financially weak firms and firms with income smoothing incentives tended to report underestimated insurance losses. The findings support the argument that managerial incentives, including tax reduction, income smoothing, and financial strength concerns, affect insurance firms’ reserve levels. We also found that the coefficient for Crisis is significantly negative, indicating that, during the financial crisis period, managers were more likely to underreport loss estimates, presumably as a response to the negative macroeconomic shock.

Table 9 The association between estimation errors and managerial incentives

We next examined whether our model estimates were less affected by managers’ incentives, as one would expect. We therefore re-estimated eq. (4) with ModelError as the dependent variable. If the model estimates were less affected by incentive biases than the manager estimates, we would expect the coefficients of the incentive variables in the model regressions to be insignificant. Column (2) in Table 9 Panel A reports the regression results for the models without managers’ estimates as an input variable. The results indeed indicate that the incentives that drive managerial estimation biases did not affect the model estimates, as none of the incentive variables was statistically significant. However, when we used the models that include manager estimates, the coefficient of SmallProfit became marginally significant at the 10% level, suggesting that incorporating managers’ estimates might also bring their biases into the models. The regression results for the aggregated estimation errors are reported in Panel B of Table 9. We found that the aggregated manager estimation errors were significantly influenced by various managerial incentives, which did not appear to affect the aggregated model estimates: the coefficients of the incentive variables are mostly insignificant, except for SmallProfit when we incorporated managers’ estimates into the models.

Overall, the results indicate that the influence of managerial incentives is hardly present in the model estimation, which explains, in part, the models’ superior performance.Footnote 20 Because reserving practice gives managers ample discretion to manipulate at relatively low cost, information users find it difficult to evaluate the relevance of the reports efficiently, owing to the noise in reporting (Fischer and Verrecchia 2000). The machine learning techniques discussed in this study provide a potential way to improve the quality of financial statements by supplying estimates that are less affected by managerial biases.

5 Conclusion

Managerial subjective estimates are endemic to financial information, and their frequency and impact are constantly increasing, mainly because of standard setters’ expansion of fair value accounting. The adverse impact of managerial estimation errors, both intentional and unintentional, on the quality of financial information is largely unknown and unresearched, but it is likely substantial. Undoubtedly, improvement in the quality and reliability of accounting estimates would substantially enhance the relevance and usefulness of financial information.

Accounting estimates generated by machine learning are potentially superior to managerial estimates because the algorithms may use the archival (training) data more consistently and systematically than managers do. On the other hand, managers may incorporate into their estimates (forecasts) forward-looking information (e.g., on expected inflation or the state of the economy) that machines trained only on historical data ignore. Whether machines outperform humans in generating accounting estimates is therefore an empirical question, which we assess in this study.

Our results, based on a large set of insurance companies’ loss (future claim payment) estimates, revisions, and realizations, indicate that, with one exception (homeowner/farmowner insurance), loss estimates generated by machine learning are more accurate than the managers’ actual estimates underlying financial reports. This is a surprising and very encouraging finding, given the urgency of improving accounting estimates. At this early stage of applying machine learning to estimating accounting numbers, we do not know how generalizable our findings are. More research is needed to establish and generalize the use of machine learning for other types of accounting estimates, such as bad debts and warranty reserves.

Accounting estimates generated by machine learning may have multiple uses in practice. They can be used by managers and auditors as benchmarks against which managers’ estimates will be compared, with large deviations suggesting a reexamination of managers’ estimates. Alternatively, machine learning could be used to generate managers’ estimates in the first place, enhancing the reliability (no manipulation) and consistency of accounting estimates. In any case, the potential of machine learning, whose use is fast expanding in many other fields, to improve financial information should be further researched.