1 Introduction

Insurance loss modelling and diagnostics have long been challenging actuarial problems essential for general insurance ratemaking and reserving. Insurance losses often exhibit peculiar distributional characteristics, including multimodality and outlier contamination (see, e.g., Blostein and Miljkovic 2019; Punzo et al. 2018b; Tomarchio and Punzo 2020). These losses are also influenced by policyholder attributes, potentially in a complicated way. The levels of peculiarity and complexity also vary significantly across different insurance datasets. For example, while a simple log-normal model already fits the Secura Re loss data well (Blostein and Miljkovic 2019), the Greek automobile insurance data (Fung et al. 2023) requires a drastically more complicated model. Furthermore, insurance claim losses are also subject to policy coverage modifications, including deductibles (claims below a certain threshold are not reported) and policy limits (claims are reported only up to a certain capped amount). As a result, losses are often randomly censored and truncated.

Among all statistical models, the Generalized Linear Model (GLM, Nelder and Wedderburn 1972) stands out as a benchmark for insurance loss modeling. Its appeal lies in its analytical tractability and interpretability. Nevertheless, the GLM has faced criticism for producing Maximum Likelihood Estimators (MLE) that exhibit high sensitivity to outliers. Furthermore, the GLM's flexibility is limited in capturing salient data characteristics, such as non-linear covariate effects, tail-heaviness, and distributional multimodality, which may or may not be present in an insurance loss dataset. Hence, it becomes imperative to explore robust methods for estimating GLM parameters and to determine whether the GLM is fundamentally suitable for modelling a specific insurance dataset. Motivated by the research problems above, in this paper, we make the following contributions.

Firstly, we introduce a score-based weighted likelihood estimation (SWLE), adapted and extended from the maximum weighted likelihood estimation (MWLE) proposed by Fung (2022), designed for the robust estimation of regression models. The SWLE employs a weight function to reduce the influence of extreme observations on GLM parameter estimation. This adjustment makes the estimated GLM parameters less sensitive to outliers, resulting in more reliable and robust estimates in the presence of model contamination. We show that the SWLE is consistent and asymptotically normal. Furthermore, we meticulously investigate the analytical properties of the SWLE under the GLM framework. Our findings reveal that many statistical quantities (e.g., score functions, information matrices) derived from the SWLE are analytically tractable, provided that the weight functions are thoughtfully selected. This analytical tractability of the SWLE is of great significance as it allows for the straightforward estimation of parameters and the quantification of estimation uncertainties with minimal computational burden. Building upon the tractability, we have developed efficient and easily implementable estimation algorithms tailored for the SWLE under the GLM framework. These algorithms only require running standard R functions and transforming the estimated parameters.

Secondly, we not only treat the proposed SWLE as a robust estimation tool but also explore how to leverage the SWLE as a model diagnostic tool. To this end, we develop a novel Wald-based test statistic based on the sensitivity of the SWLE weight functions to the estimated GLM parameters. This test provides a quantitative assessment of the overall suitability of using the GLM to model the dataset. Higher sensitivities suggest a rejection of the GLM. If the GLM is deemed inappropriate, we can further analyze the source of model misspecification, offering guidance on how to enhance the GLM model.

Thirdly, we extend the SWLE to randomly censored and truncated data, a relevant extension due to the presence of coverage modifications. We derive several statistical inference quantities under the incomplete data scenario, including the score function, information matrix, and a Wald-type diagnostic test statistic. The development of the SWLE for incomplete data is motivated by the inherent challenges in extending the MWLE proposed by Fung (2022) to incomplete data. This issue can be elegantly addressed by working with the weighted score function instead of the weighted likelihood function, as explained in detail in Sect. 3.3.

Addressing the robustness issues inherent in the MLE approach is itself not a new research problem. Several research works have proposed alternative estimation approaches to mitigate sensitivity to outlier contamination. In the actuarial loss modeling literature, related papers include, for example, Brazauskas and Serfling (2000, 2003), Poudyal (2021a, 2021b), Serfling (2002) and Zhao et al. (2018), which investigate robust methods under simple distributional models. However, extending these methods to models with numerous parameters presents significant challenges. This issue has recently been addressed by Fung (2022), who explores the MWLE for robust tail estimation within the finite mixture model. In contrast to our paper, Fung (2022) outlines the general MWLE framework without considering covariates or delving into the analytical tractability of related quantities. As mentioned earlier, extending the MWLE proposed by Fung (2022) to scenarios involving censored and truncated loss data is challenging, setting our contributions apart from Fung (2022). Furthermore, while many papers have considered robust methods within the GLM regression framework, as seen in Valdora and Yohai (2014), Aeberhard et al. (2014), Wong et al. (2014) and Ghosh and Basu (2016), these works primarily focus on complete data settings and do not address data censoring and truncation.

In the context of model diagnostics, actuaries often employ ad-hoc tools, such as Q–Q plots, Anderson–Darling (AD) tests, Kolmogorov–Smirnov (KS) tests, or deviance residual analysis plots. These standard diagnostic tools were originally designed for complete data, which renders them unsuitable for direct application to censored and truncated data. For instance, the traditional literature on AD tests (Anderson and Darling 1952) necessitates that observations be fully observed and identically distributed, making it clear that the presence of random censoring and truncation in the data violates this assumption. Meanwhile, the proposed Wald-based test built upon the SWLE is a more versatile statistical diagnostic tool capable of accommodating the policy modifications of insurance loss data, offering broader applicability compared to standard techniques.

The remainder of this article proceeds as follows. The class of generalized linear model (GLM) is first revisited in Sect. 2 with relevant notations. Then, Sect. 3 reviews some likelihood-based inference techniques, which motivate us to introduce a novel score-based weighted likelihood estimation (SWLE) approach for robust estimations of insurance loss models. In Sect. 4, we formally construct the SWLE framework for the GLM, supplemented by estimation algorithms and several theoretical properties to justify the computational tractability, consistency, and robustness of the proposed SWLE. Section 5 introduces an alternative use of the SWLE as a model diagnostic tool to quantitatively detect model misspecifications. The SWLE is further extended in Sect. 6 to cater for censored and truncated data. The performance and practical applicability of the proposed SWLE are analyzed through three simulation studies in Sect. 7 and two real insurance datasets in Sect. 8. Section 9 concludes.

2 Generalized linear model (GLM)

In this section, we briefly review the class of Generalized Linear Model (GLM) and define the relevant notations used throughout the paper. Suppose that there are n independent (transformed) loss severities \(\varvec{Y}=(Y_1,\ldots ,Y_n)\) with realizations \(\varvec{y}=(y_1,\ldots ,y_n)\). Each loss \(y_i\in {\mathcal {Y}}\) is accompanied by p covariates (policyholder attributes) denoted as \(\varvec{x}_i=(x_{i1},\ldots ,x_{ip})^T\in {\mathcal {X}}\) for \(i=1,\ldots ,n\). Also, define \(\varvec{X}=(\varvec{x}_1,\ldots ,\varvec{x}_n)^T\) as an \((n\times p)\)-design matrix consisting of the attributes of all policyholders. We model \(Y_i|\varvec{x}_i\) through GLM with density function given by

$$\begin{aligned} f(y_i;\varvec{x}_i,\varvec{\Psi })=\exp \left\{ \frac{\theta _iy_i-A(\theta _i)}{\phi }+C(y_i,\phi )\right\} ,\quad y_i\in {\mathcal {Y}}, \varvec{x}_i\in {\mathcal {X}} \end{aligned}$$
(2.1)

for \(i=1,\ldots ,n\). Here, \(\theta _i:=\theta (\varvec{x}_i,\varvec{\beta })\) is a canonical parameter which depends on the covariates \(\varvec{x}_i\) and regression coefficients \(\varvec{\beta }:=(\beta _1,\ldots ,\beta _p)^T\). The scale parameter is \(\phi \) and the set of all model parameters is \(\varvec{\Psi }=(\varvec{\beta },\phi )\). The mean and variance of \(Y_i|\varvec{x}_i\) are \(E[Y_i|\varvec{x}_i,\varvec{\Psi }]:=\mu _i=A^{\prime }(\theta _i)\) and \(\text {Var}[Y_i|\varvec{x}_i,\varvec{\Psi }]:=\sigma _i^2=\phi A^{\prime \prime }(\theta _i)\) respectively. For linear regression, it is assumed that \(\varvec{x}_i^T\varvec{\beta }=\eta (\mu _i)=\eta (A^{\prime }(\theta _i))\), where \(\eta (\cdot )\) is called the link function. We also define \(\xi (\cdot )=(\eta \circ A^{\prime })^{-1}(\cdot )\) such that \(\theta _i=\xi (\varvec{x}_i^T\varvec{\beta })\). In the special case where \(\xi (\cdot )\) or \(\eta (A^{\prime }(\cdot ))\) is an identity function such that \(\theta _i=\varvec{x}_i^T\varvec{\beta }\), we call the resulting \(\eta (\cdot )\) a canonical link. Throughout this paper, we further particularize the class of GLM to be considered, by assuming that the function \(C(y_i,\phi )\) in Eq. (2.1) can be decomposed as

$$\begin{aligned} C(y_i,\phi )=\left( \frac{1}{\phi }-c\right) g(y_i)+a(y_i)+b(\phi ) \end{aligned}$$
(2.2)

for a constant c and some functions \(g(\cdot )\), \(a(\cdot )\) and \(b(\cdot )\). This assumption is not restrictive in actuarial practice as most widely adopted GLMs (e.g., Gamma and inverse Gaussian) satisfy this assumption.
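As an illustration (our own sketch, not part of the model definition), the decomposition can be verified numerically for the Gamma GLM, whose components \(A(\theta )=-\log (-\theta )\), \(c=1\), \(g(y)=\log y\), \(a(y)=0\) and \(b(\phi )=\phi ^{-1}\log (\phi ^{-1})-\log \Gamma (\phi ^{-1})\) appear later in Example 1:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def gamma_edm_pdf(y, theta, phi):
    # EDM form (2.1) with the decomposition (2.2) for the Gamma case:
    # A(theta) = -log(-theta), c = 1, g(y) = log y, a(y) = 0,
    # b(phi) = (1/phi) log(1/phi) - log Gamma(1/phi)
    A = -np.log(-theta)
    b = np.log(1.0 / phi) / phi - gammaln(1.0 / phi)
    C = (1.0 / phi - 1.0) * np.log(y) + b   # (1/phi - c) g(y) + a(y) + b(phi)
    return np.exp((theta * y - A) / phi + C)

theta, phi = -0.8, 0.5          # canonical and dispersion parameters
mu = -1.0 / theta               # mean: A'(theta) = -1/theta
y = np.linspace(0.1, 6.0, 50)
# The same density in the usual parameterization: shape 1/phi, scale mu*phi
assert np.allclose(gamma_edm_pdf(y, theta, phi),
                   stats.gamma.pdf(y, a=1.0 / phi, scale=mu * phi))
```

The check confirms that the exponential-family exponent, with this choice of \(C(y_i,\phi )\), reproduces the usual shape-scale Gamma density with mean \(A^{\prime }(\theta )\) and variance \(\phi A^{\prime \prime }(\theta )\).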

3 Likelihood-based inference techniques

This section briefly reviews several likelihood-based inference tools and proposes a score-based weighted likelihood estimation (SWLE) approach for robust parameter estimations of regression models.

3.1 Maximum likelihood estimation (MLE)

In actuarial practice, statistical inference for loss regression models is dominated almost exclusively by the maximum likelihood estimation (MLE) approach, which maximizes the log-likelihood function

$$\begin{aligned} {\mathcal {L}}^{\text {MLE}}_n(\varvec{\Psi };\varvec{y},\varvec{X})=\sum _{i=1}^{n}\log f(y_i;\varvec{x}_i,\varvec{\Psi }) \end{aligned}$$
(3.1)

with respect to the parameters \(\varvec{\Psi }\). The MLE is not robust to model contamination, i.e., a few outliers may significantly distort the estimated MLE parameters. It is, therefore, worthwhile to explore robust estimation methods as alternatives to the MLE to obtain more stable and reliable parameter estimates.

3.2 Maximum weighted likelihood estimation (MWLE)

The maximum weighted likelihood estimation (MWLE) was developed by Fung (2022) for robust estimation of loss distributions. The idea is to impose observation-dependent weights on each observation so that observations deemed harmful to model robustness have less impact on parameter estimation. Slightly extending the MWLE proposed by Fung (2022) to the regression setting, the following weighted log-likelihood function is maximized:

$$\begin{aligned} {\mathcal {L}}^{\text {MWLE}}_n(\varvec{\Psi };\varvec{y},\varvec{X})&=\sum _{i=1}^{n}W(y_i,\varvec{x}_i)\log \frac{f(y_i;\varvec{x}_i,\varvec{\Psi })W(y_i,\varvec{x}_i)}{\int _{{\mathcal {Y}}} f(u;\varvec{x}_i,\varvec{\Psi })W(u,\varvec{x}_i)du}\nonumber \\&:=\sum _{i=1}^{n}W(y_i,\varvec{x}_i)\log f^*(y_i;\varvec{x}_i,\varvec{\Psi }), \end{aligned}$$
(3.2)

where \(W(\cdot )\) is the weight function, and \(f^*(y_i;\varvec{x}_i,\varvec{\Psi })\) is a transformed density function given by

$$\begin{aligned} f^*(y_i;\varvec{x}_i,\varvec{\Psi })=\frac{f(y_i;\varvec{x}_i,\varvec{\Psi }) W(y_i,\varvec{x}_i)}{\int _{{\mathcal {Y}}}f(u;\varvec{x}_i,\varvec{\Psi })W(u,\varvec{x}_i)du}. \end{aligned}$$
(3.3)

Fung (2022) assumes that \(0\le W(\cdot )\le 1\) and that \(W(\cdot )\) is a non-decreasing function to address the tail robustness issue. In this paper, we focus on robustness against outliers, and hence we do not impose such an assumption on \(W(\cdot )\). The appropriate choice of \(W(\cdot )\) not only needs to address the modelling need (model robustness) but also has to result in an analytically tractable transformed density function \(f^*(y_i;\varvec{x}_i,\varvec{\Psi })\) so that the computational burden is minimized. Details will be covered in Sect. 4.1. An adjustment term \(\int _{{\mathcal {Y}}}f(u;\varvec{x}_i,\varvec{\Psi })W(u,\varvec{x}_i)du\) is incorporated into Eq. (3.2) to correct for the estimation biases introduced by weighting the log-likelihood function. As a result, the MWLE is consistent and asymptotically normal under several mild regularity conditions.

3.3 Score-based weighted likelihood estimation (SWLE)

The main shortcoming of the above MWLE approach is that it is difficult to extend to incomplete data, where the true value of an insurance loss severity \(y_i\) may not be observed exactly due to censoring and truncation. The difficulty arises from the presence of a term \(W(y_i,\varvec{x}_i)\) in Eq. (3.2) that explicitly depends on the exact observed loss value \(y_i\). In situations where data are censored, the precise value of \(y_i\) may not be available, as we only possess information indicating that \(y_i\) falls within an interval. Furthermore, it remains unclear how to determine an appropriate weight to be applied to an inexactly observed loss. From the insurance loss modelling perspective, loss severities are expected to be censored and truncated due to various forms of coverage modifications, such as deductibles (which lead to left truncation) and policy limits (which lead to right censoring), applied to insurance policies.

In view of this, we propose a novel score-based weighted likelihood estimation (SWLE) for robust estimation of regression models that retains its extensibility to the aforementioned incomplete insurance loss data. In the complete data scenario, the SWLE is obtained by solving the following set of score equations w.r.t. \(\varvec{\Psi }\):

$$\begin{aligned} {\mathcal {S}}_n(\varvec{\Psi };\varvec{y},\varvec{X})&:={\mathcal {S}}^{\text {SWLE}}_n (\varvec{\Psi };\varvec{y},\varvec{X})\nonumber \\&=\sum _{i=1}^{n}\left( \int _{{\mathcal {Y}}}f(u;\varvec{x}_i,\varvec{\Psi }) W(u,\varvec{x}_i)du\right) \frac{f^*(y_i;\varvec{x}_i,\varvec{\Psi })}{f(y_i;\varvec{x}_i,\varvec{\Psi })}\nonumber \\&\quad \times \frac{\partial }{\partial \varvec{\Psi }}\log f^*(y_i;\varvec{x}_i,\varvec{\Psi })=\varvec{0}. \end{aligned}$$
(3.4)

It is easy to show that the SWLE score function above is simply the derivative of the weighted log-likelihood function (Eq. (3.2)) w.r.t. \(\varvec{\Psi }\), and hence the two approaches are equivalent when the observed data are complete. However, unlike the MWLE, the SWLE avoids an explicit appearance of the weight function \(W(y_i,\varvec{x}_i)\), evaluated at the exact observed loss, in the score function, addressing the aforementioned extensibility problem. Note that the above SWLE score function is for complete data and has not yet been extended to the case where the insurance losses are censored and/or truncated. We defer the censoring-truncation mechanism for insurance losses and the corresponding extension of Eq. (3.4) to Sect. 6.
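As a small numerical illustration of this equivalence (our own sketch, using a one-parameter normal location model with a normal-density weight), the score (3.4), evaluated via the closed form of \(f^*\) derived later in Lemma 1, matches a finite-difference derivative of the weighted log-likelihood (3.2):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

phi, phi_tilde = 1.0, 2.0                      # model variance; weight hyperparameter
W = lambda y: np.exp(-y**2 / (2 * phi_tilde))  # normal-density weight centred at 0

def weighted_loglik(psi, ys):
    # Eq. (3.2): the bias-adjustment integral is computed numerically
    lam, _ = quad(lambda u: norm.pdf(u, psi, np.sqrt(phi)) * W(u), -np.inf, np.inf)
    return sum(W(y) * (norm.logpdf(y, psi, np.sqrt(phi)) + np.log(W(y)) - np.log(lam))
               for y in ys)

def swle_score(psi, ys):
    # Eq. (3.4): here f* is N(psi*, phi*) with phi* = (1/phi + 1/phi_tilde)^(-1)
    # and psi* = psi * phi* / phi, so d log f*/d psi = (y - psi*)/phi
    phi_star = 1.0 / (1.0 / phi + 1.0 / phi_tilde)
    psi_star = psi * phi_star / phi
    return sum(W(y) * (y - psi_star) / phi for y in ys)

ys = [-1.2, 0.3, 0.8, 2.5, 4.0]
psi, h = 0.7, 1e-3
fd = (weighted_loglik(psi + h, ys) - weighted_loglik(psi - h, ys)) / (2 * h)
assert abs(fd - swle_score(psi, ys)) < 1e-6
```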

4 SWLE for GLM

This section examines the theoretical properties of the proposed SWLE under the GLM modelling framework for insurance loss regression analysis. We first construct an appropriate class of weight functions such that: (i) the contributions of extreme observations or outliers are effectively down-weighted to ensure robust model estimations; (ii) the resulting statistical quantities, including the score function, are analytically tractable, to ensure computational desirability. Then, we present the estimation algorithm for SWLE and discuss its connection to the MLE estimation algorithm. We will prove that the SWLE is consistent and asymptotically normal under mild regularity conditions with an analytically tractable information matrix. We will also show that the proposed SWLE results in a bounded influence function (IF), ensuring robustness against outliers.

4.1 Construction

We consider the weight function with the following form

$$\begin{aligned} W(y_i,\varvec{x}_i)\propto \exp \left\{ \frac{{\tilde{\theta }}_iy_i}{{\tilde{\phi }}}+\left( \frac{1}{{\tilde{\phi }}}-c\right) g(y_i)\right\} , \end{aligned}$$
(4.1)

where \({\tilde{\theta }}_i=\xi (\varvec{x}_i^T\tilde{\varvec{\beta }})\), and \(\tilde{\varvec{\Psi }}:=(\tilde{\varvec{\beta }},{\tilde{\phi }})\) are the hyperparameters of the weight function, which control the extent to which extreme observations are down-weighted and govern the trade-off between robust modelling and estimation efficiency. We start with the following lemma, which suggests that the score function under the proposed SWLE is analytically tractable. The proof is deferred to Section D.1 of the supplementary material.

Lemma 1

Suppose that the weight function \(W(y_i,\varvec{x}_i)\) and density function \(f(y_i;\varvec{x}_i,\varvec{\Psi })\) are given by Eqs. (4.1) and (2.1) respectively. Then we have:

  1.

    The bias adjustment term is

    $$\begin{aligned} \lambda _i^*(\varvec{\Psi };\varvec{x}_i)&:=\int _{{\mathcal {Y}}} f(u;\varvec{x}_i,\varvec{\Psi })W(u,\varvec{x}_i)du\nonumber \\&\propto \exp \left\{ \frac{A(\theta ^*_i)}{\phi ^*}-b(\phi ^*) -\frac{A(\theta _i)}{\phi }+b(\phi )\right\} ; \end{aligned}$$
    (4.2)
  2.

    The transformed density function is

    $$\begin{aligned} f^*(y_i;\varvec{x}_i,\varvec{\Psi })=\exp \left\{ \frac{\theta _i^*y_i -A(\theta _i^*)}{\phi ^*}+C(y_i,\phi ^*)\right\} , \end{aligned}$$
    (4.3)

    where \(\phi ^*=(\phi ^{-1}+{\tilde{\phi }}^{-1}-c)^{-1}\) and \(\theta _i^*=(\theta _i/\phi +{\tilde{\theta }}_i/{\tilde{\phi }})\phi ^*\).
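As a quick numerical sanity check of Lemma 1 (our own sketch, using the normal model of Example 1, for which \(A(\theta )=\theta ^2/2\), \(c=0\) and \(g(y)=-y^2/2\)), the normalized product \(f\cdot W\) indeed coincides with the claimed transformed density:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Normal model (Example 1): A(theta) = theta^2/2, c = 0, g(y) = -y^2/2
theta, phi = 0.4, 0.5                  # canonical parameter (= mean) and variance
theta_t, phi_t = 0.0, 2.0              # weight hyperparameters

# Weight function (4.1): W(y) proportional to exp{theta_t y/phi_t + (1/phi_t - c) g(y)}
W = lambda y: np.exp(theta_t * y / phi_t - y**2 / (2 * phi_t))
f = lambda y: norm.pdf(y, loc=theta, scale=np.sqrt(phi))

# Bias-adjustment term (4.2), computed numerically
lam, _ = quad(lambda u: f(u) * W(u), -np.inf, np.inf)

# Lemma 1: phi* = (1/phi + 1/phi_t - c)^(-1), theta* = (theta/phi + theta_t/phi_t) phi*
phi_star = 1.0 / (1.0 / phi + 1.0 / phi_t)
theta_star = (theta / phi + theta_t / phi_t) * phi_star

ys = np.linspace(-3.0, 3.0, 61)
f_star = f(ys) * W(ys) / lam           # transformed density (3.3)
assert np.allclose(f_star, norm.pdf(ys, loc=theta_star, scale=np.sqrt(phi_star)))
```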

Corollary 1

The following results based on Lemma 1 hold under the following two special cases:

  1.

    If \({\tilde{\theta }}_i:={\tilde{\theta }}\) is independent of \(\varvec{x}_i\) (e.g. choose \(\tilde{\varvec{\beta }}\) such that only the intercept term is non-zero), then we have \(\theta _i^*=\xi ^{*}(\varvec{x}_i^T\varvec{\beta })\) with the transformed mapping function \(\xi ^{*}(z)=(\xi (z)/\phi +{\tilde{\theta }}/{\tilde{\phi }})\phi ^*\).

  2.

    If a canonical link is selected for GLM such that \(\xi (z)=z\), \(\theta _i=\varvec{x}_i^T\varvec{\beta }\) and \({\tilde{\theta }}_i=\varvec{x}_i^T\tilde{\varvec{\beta }}\), then the transformed density function can be written as \(f^*(y_i;\varvec{x}_i,\varvec{\Psi })=f(y_i;\varvec{x}_i,\varvec{\Psi }^*)\), where \(\varvec{\Psi }^*=(\varvec{\beta }^*,\phi ^*)\) with \(\varvec{\beta }^*=(\varvec{\beta }/\phi +\tilde{\varvec{\beta }}/{\tilde{\phi }})\phi ^*\).

From the above results, the transformed density can still be written as an exponential dispersion model (EDM) with shifted parameter values. If we choose a canonical link, the resulting transformed density can still be expressed as a GLM with transformed parameters. With the above analytically tractable formulas, the SWLE score function can be written as

$$\begin{aligned} {\mathcal {S}}_n(\varvec{\Psi };\varvec{y},\varvec{X})&:=\sum _{i=1}^{n}{\mathcal {S}}(\varvec{\Psi };y_i,\varvec{x}_i)\nonumber \\&=\sum _{i=1}^{n}W(y_i,\varvec{x}_i)\frac{\partial }{\partial \varvec{\Psi }} \left[ \frac{\theta _i^*y_i-A(\theta _i^*)}{\phi ^*}+C(y_i,\phi ^*)\right] . \end{aligned}$$
(4.4)

Taking derivatives via the chain rule, the SWLE for the GLM requires solving the following two sets of equations simultaneously for \(\varvec{\Psi }\):

$$\begin{aligned} {\mathcal {S}}_{n,\theta }(\varvec{\Psi };\varvec{y},\varvec{X})&:=\sum _{i=1}^{n}{\mathcal {S}}_{\theta }(\varvec{\Psi };y_i,\varvec{x}_i)\nonumber \\&:=\frac{1}{\phi }\sum _{i=1}^{n}W(y_i,\varvec{x}_i)\left( y_i-A^{\prime }(\theta _i^*) \right) \xi ^{\prime }(\varvec{x}_i^T\varvec{\beta })\varvec{x}_i=\varvec{0}, \end{aligned}$$
(4.5)
$$\begin{aligned} {\mathcal {S}}_{n,\phi }(\varvec{\Psi };\varvec{y},\varvec{X})&:=\sum _{i=1}^{n}{\mathcal {S}}_{\phi }(\varvec{\Psi };y_i,\varvec{x}_i)\nonumber \\&:=\frac{1}{\phi ^2}\sum _{i=1}^{n}W(y_i,\varvec{x}_i) \left\{ \left( y_i-A^{\prime }(\theta _i^*)\right) \phi ^*\left( \left( c-\frac{1}{{\tilde{\phi }}}\right) \theta _i+\frac{{\tilde{\theta }}_i}{{\tilde{\phi }}}\right) \right. \nonumber \\&\quad -\left[ \theta _i^*y_i-A(\theta _i^*)+g(y_i)\right] +\phi ^{*2}b^{\prime }(\phi ^*)\bigg \}=0, \end{aligned}$$
(4.6)

which are both analytically tractable.

Example 1

We hereby discuss the application of the SWLE to three example GLM classes, which are commonly adopted for actuarial loss modelling and ratemaking purposes.

  1.

    (Gamma GLM) Its density function is given by Eqs. (2.1) and (2.2) with \(A(\theta )=-\log (-\theta )\), \(c=1\), \(g(y)=\log y\), \(a(y)=0\) and \(b(\phi )=\phi ^{-1}\log (\phi ^{-1})-\log \Gamma (\phi ^{-1})\). To apply the proposed SWLE to the Gamma GLM, a plausible weight function, according to Eq. (4.1), is the Gamma density function itself (because \(a(y)=0\)). For simplicity, one may particularize \({\tilde{\phi }}=1\) and \({\tilde{\theta }}_i={\tilde{\theta }}\) such that \(W(y_i,\varvec{x}_i):=W(y_i)=\exp \{{\tilde{\theta }}y_i\}\) is proportional to an exponential density, with \({\tilde{\theta }}<0\) being the only adjustable hyperparameter. In this case, \(W(y_i)\) is a decreasing function of \(y_i\), down-weighting large losses. Larger losses are down-weighted more significantly as \({\tilde{\theta }}\) becomes more negative, while \(W(y_i)\) becomes flat (the SWLE reduces to the MLE) as \({\tilde{\theta }}\rightarrow 0\). Under the SWLE, the transformed density function \(f^{*}(y_i;\varvec{x}_i,\varvec{\Psi })\) still follows a Gamma GLM with an identical transformed dispersion parameter \(\phi ^{*}=\phi \) and a shifted canonical parameter \(\theta _i^*=\theta _i+{\tilde{\theta }}\phi \). The left panel of Fig. 1 (using \(\theta =-1\), \(\phi =0.5\) and \({\tilde{\theta }}=-0.5\)) shows that the transformed density function peaks more in the body part and eventually becomes lighter tailed. This is a natural consequence of incorporating weight functions to reduce the influence of large losses.

  2.

    (Linear model) Actuaries often model log-normal losses via an exponential transformation of the linear model. We have \(A(\theta )=\theta ^2/2\), \(c=0\), \(g(y)=-y^2/2\), \(a(y)=0\) and \(b(\phi )=-(1/2)\log (2\pi \phi )\). Since \(a(y)=0\), one can choose the linear model itself as the weight function. One may particularize \({\tilde{\theta }}_i:={\tilde{\theta }}\), fixed as, e.g., the sample mean, and treat \({\tilde{\phi }}>0\) as the only adjustable weight function hyperparameter. The weight function \(W(y_i,\varvec{x}_i):=W(y_i)\) is then a normal density with fixed mean \({\tilde{\theta }}\) and adjustable variance \({\tilde{\phi }}\), down-weighting the observations from both tails. A smaller \({\tilde{\phi }}\) means that outliers are down-weighted to a larger extent. The middle panel of Fig. 1 (using \(\theta ={\tilde{\theta }}=0\), \(\phi =0.5\) and \({\tilde{\phi }}=2\)) shows that the transformed density under the SWLE is still normally distributed with a sharper peak, mirroring the Gamma case.

  3.

    (Inverse-Gaussian GLM) We have \(A(\theta )=-(-2\theta )^{1/2}\), \(c=0\), \(g(y)=-1/(2y)\), \(a(y)=-\ln (2\pi y^3)/2\) and \(b(\phi )=\ln (1/\phi )/2\). As \(a(y)\ne 0\), the weight function cannot be the Inverse-Gaussian density itself. Instead, we refer to Eq. (4.1) and choose \(W(y_i,\varvec{x}_i)=\exp \{{\tilde{\theta }}_iy_i/{\tilde{\phi }}-(2y_i{\tilde{\phi }})^{-1}\}\) as the weight function. One may particularize the weight function \(W(y_i,\varvec{x}_i):=W(y_i)\) such that \({\tilde{\theta }}_i:={\tilde{\theta }}<0\) is fixed. The adjustable hyperparameter \({\tilde{\phi }}>0\) governs how strongly the outliers are down-weighted. Similar to the previous two cases, the transformed density under the SWLE is still Inverse-Gaussian distributed with a sharper mode in the body, as demonstrated by the right panel of Fig. 1 (using \(\theta =-0.5\), \(\phi =1\), \({\tilde{\theta }}=-2\) and \({\tilde{\phi }}=1\)).
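The closed-form transformations in Example 1 are easy to verify numerically. For instance (our own sketch), under the Gamma case with the Fig. 1 left-panel parameters, the normalized weighted density coincides with the shifted Gamma GLM and its mean drops from 1 to 0.8:

```python
import numpy as np
from scipy.integrate import quad
from scipy import stats

theta, phi, theta_t = -1.0, 0.5, -0.5      # Fig. 1 (left panel) parameters; phi_t = 1
W = lambda y: np.exp(theta_t * y)          # exponential-type Gamma weight

# Original Gamma GLM density: shape 1/phi, mean -1/theta, scale mean*phi
f = lambda y: stats.gamma.pdf(y, a=1.0 / phi, scale=(-1.0 / theta) * phi)
lam, _ = quad(lambda u: f(u) * W(u), 0.0, np.inf)

# Example 1: f* is again a Gamma GLM with phi* = phi and theta* = theta + theta_t*phi
theta_star = theta + theta_t * phi         # = -1.25
f_star = lambda y: stats.gamma.pdf(y, a=1.0 / phi, scale=(-1.0 / theta_star) * phi)

ys = np.linspace(0.05, 5.0, 40)
assert np.allclose(f(ys) * W(ys) / lam, f_star(ys))
# Down-weighting large losses lowers the mean: A'(theta*) = -1/theta* = 0.8 < 1
mean_star, _ = quad(lambda u: u * f(u) * W(u) / lam, 0.0, np.inf)
assert abs(mean_star - 0.8) < 1e-6
```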

Fig. 1

Weight function \(W(y_i;\varvec{x}_i)\) (dotted curves), original density function \(f(y_i;\varvec{x}_i,\varvec{\Psi })\) (solid curves with shaded areas underneath) and transformed density function \(f^{*}(y_i;\varvec{x}_i,\varvec{\Psi })\) (solid curves only), under Gamma (left panel), Normal (middle panel) and Inverse-Gaussian (right panel) distributions

4.2 Estimation

With a carefully chosen weight function, the SWLE score functions (Eqs. (4.5) and (4.6)) are analytically tractable. In particular, Eq. (4.5) looks almost the same as the standard system of GLM score functions for MLE regression parameter estimation. Therefore, it is still possible to employ a standard IRLS-type approach to estimate the GLM parameters under the proposed SWLE approach. Making use of this desirable property, we propose an SWLE model calibration algorithm, which allows borrowing and adapting existing software packages (such as the glm function in R) to estimate the SWLE parameters efficiently. The steps of the algorithm are outlined as follows:

  1.

    Set the initial parameters as \(\varvec{\Psi }^{[0]}:=(\varvec{\beta }^{[0]},\phi ^{[0]})\).

  2.

    For each iteration \(r=1,2,\ldots \), do:

    • Update the regression coefficients \(\varvec{\beta }^{[r]}\) by solving Eq. (4.5) with the dispersion parameter \(\phi =\phi ^{[r-1]}\) fixed. This can be done by the IRLS procedure directly using the glm function in R, setting the prior weights argument weights to \(W(y_i,\varvec{x}_i)\) and the custom link function argument link to \(\eta ^*(\cdot )=(A'\circ \xi ^*)^{-1}(\cdot )\), where \(\xi ^*(\cdot )\) is the transformed mapping function (Corollary 1).

    • Update the dispersion parameter \(\phi ^{[r]}\) by solving Eq. (4.6) with the regression coefficients fixed as \(\varvec{\beta }=\varvec{\beta }^{[r-1]}\). This can be done by a one-dimensional root search using the uniroot function in R.

  3.

    Continue iterating step 2 until the absolute change of the iterated parameter values is smaller than a particular threshold (e.g. \(|\varvec{\Psi }^{[r]}-\varvec{\Psi }^{[r-1]}|<10^{-6}\)).
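For concreteness, the alternating scheme above can be sketched in Python (the paper works with R's glm and uniroot; our sketch instead uses the linear model of Example 1, for which the \(\varvec{\beta }\)-update in Eq. (4.5) reduces to a weighted least-squares solve and the \(\phi \)-update in Eq. (4.6) is a one-dimensional root search; the simulated data, hyperparameters and bracketing interval are our own choices):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
n, beta_true, phi_true = 50_000, np.array([1.0, 2.0]), 0.5
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ beta_true + rng.normal(0.0, np.sqrt(phi_true), n)

theta_t, phi_t = y.mean(), 2.0                          # weight hyperparameters
w = np.exp(theta_t * y / phi_t - y**2 / (2 * phi_t))    # normal weight (c = 0)

def phi_star(phi):
    return 1.0 / (1.0 / phi + 1.0 / phi_t)

def update_beta(phi):
    # Eq. (4.5), linear model: sum_i w_i (y_i - theta*_i) x_i = 0 is linear in beta,
    # since theta*_i = (x_i' beta / phi) phi* + (theta_t / phi_t) phi*
    k, m = phi_star(phi) / phi, phi_star(phi) * theta_t / phi_t
    return np.linalg.solve(k * X.T @ (w[:, None] * X), X.T @ (w * (y - m)))

def score_phi(phi, beta):
    # Eq. (4.6), linear model: A(t) = t^2/2, g(y) = -y^2/2, b'(p) = -1/(2p), c = 0
    ps = phi_star(phi)
    r = y - (X @ beta / phi + theta_t / phi_t) * ps
    return np.sum(w * (r * (ps / phi_t) * (theta_t - X @ beta) + r**2 / 2 - ps / 2))

beta, phi = np.zeros(2), 1.0
for _ in range(100):                                    # step 2: alternate both updates
    beta = update_beta(phi)
    phi_new = brentq(score_phi, 0.05, 5.0, args=(beta,))
    converged = abs(phi_new - phi) < 1e-8               # step 3: stopping rule
    phi = phi_new
    if converged:
        break
```

On uncontaminated simulated data the iterates settle near the true \((\varvec{\beta },\phi )\), consistent with Theorem 1.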

If the canonical link function is chosen, from Corollary 1, the resulting transformed density function \(f^{*}(y_i;\varvec{x}_i,\varvec{\Psi }^{*})\) is still a GLM with the same link function but transformed parameters \(\varvec{\Psi }^{*}=(\varvec{\beta }^{*},\phi ^{*})\). Further, the score function in Eq. (4.5) can be simplified as

$$\begin{aligned} \sum _{i=1}^{n}W(y_i,\varvec{x}_i)\left( y_i-A^{\prime }(\theta _i^*)\right) \varvec{x}_i=\varvec{0}, \end{aligned}$$
(4.7)

which is a set of standard weighted GLM score functions depending only on the transformed regression coefficients \(\varvec{\beta }^{*}\). As a result, it is even more computationally appealing to estimate \(\varvec{\Psi }^{*}\) first and then transform it back to \(\varvec{\Psi }\). The proposed algorithm consists of the following steps:

  1.

    Obtain an IRLS estimate of the transformed regression parameters \(\hat{\varvec{\beta }}^{*}\) from Eq. (4.7), using the glm function in R with weights set to \(W(y_i,\varvec{x}_i)\) and link set to the canonical link function.

  2.

    Obtain an estimated transformed dispersion parameter \({\hat{\phi }}^{*}\) by solving Eq. (4.6) with the transformed regression coefficients fixed as the estimated values in the previous step. This can be done using the uniroot function in R.

  3.

    Revert the transformed parameters \(\hat{\varvec{\Psi }}^{*}\) to obtain the estimated parameters \(\hat{\varvec{\Psi }}\):

    $$\begin{aligned} {\hat{\phi }}=\left( \frac{1}{{\hat{\phi }}^{*}}-\frac{1}{{\tilde{\phi }}}+c\right) ^{-1} ,\quad \hat{\varvec{\beta }}=\left( \frac{\hat{\varvec{\beta }}^{*}}{{\hat{\phi }}^{*}}-\frac{\tilde{\varvec{\beta }}}{{\tilde{\phi }}}\right) {\hat{\phi }}. \end{aligned}$$
    (4.8)

No further iterations are needed in steps 1 and 2 above. In this case, the computational burden under SWLE is almost the same as that under MLE.
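The canonical-link shortcut can likewise be sketched in Python for the linear model (identity link). For brevity, our sketch replaces the exact Eq. (4.6) solve in step 2 with a simple weighted-moment estimate of \(\phi ^{*}\), which is consistent because the transformed density has variance \(\phi ^{*}\); the simulated data and hyperparameters are our own choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta_true, phi_true = 200_000, np.array([1.0, 2.0]), 0.5
beta_t, phi_t = np.array([0.0, 0.0]), 2.0      # weight hyperparameters (theta_t = 0)
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ beta_true + rng.normal(0.0, np.sqrt(phi_true), n)

# Normal weight (4.1) with c = 0 and g(y) = -y^2/2
w = np.exp((X @ beta_t) * y / phi_t - y**2 / (2 * phi_t))

# Step 1: one-shot weighted least squares gives the transformed coefficients beta*
beta_star = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Step 2 (simplified): weighted-moment estimate of phi*, since Var under f* is phi*
phi_star = np.sum(w * (y - X @ beta_star)**2) / np.sum(w)

# Step 3: revert the transformed parameters via Eq. (4.8) with c = 0
phi_hat = 1.0 / (1.0 / phi_star - 1.0 / phi_t)
beta_hat = (beta_star / phi_star - beta_t / phi_t) * phi_hat
```

With no iteration at all, the back-transformed estimates recover the data-generating parameters up to Monte Carlo error, illustrating why the canonical-link case carries essentially the same computational burden as the MLE.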

4.3 Asymptotic properties

The following theorem shows that the proposed SWLE converges to the true model parameters as \(n\rightarrow \infty \) and quantifies the asymptotic parameter uncertainties. The proof is presented in Section D.5 of the supplementary material.

Theorem 1

Suppose that \(Y_i|\varvec{x}_i\) follows GLM with density function in the form of Eq. (2.1) and true model parameters \(\varvec{\Psi }_0:=(\varvec{\beta }_0,\phi _0)\). Assume that the mild regularity conditions outlined in Section A.1 of the supplementary material are satisfied. Then, there exists a solution \(\hat{\varvec{\Psi }}_n:=(\hat{\varvec{\beta }}_n,{\hat{\phi }}_n)\) of the SWLE score equations \({\mathcal {S}}_{n,\theta }(\varvec{\Psi };\varvec{y},\varvec{X})\) and \({\mathcal {S}}_{n,\phi }(\varvec{\Psi };\varvec{y},\varvec{X})\) (Eqs. (4.5) and (4.6)) such that

$$\begin{aligned} \sqrt{n}(\hat{\varvec{\Psi }}_n-\varvec{\Psi }_0)\overset{d}{\rightarrow }{\mathcal {N}}(\varvec{0},\varvec{\Sigma }), \end{aligned}$$
(4.9)

where \(\varvec{\Sigma }:=\varvec{\Sigma }(\varvec{\Psi }_0)=(\Gamma ^{-1})\Lambda (\Gamma ^{-1})^T\), with \(\Gamma \) and \(\Lambda \) being \((p+1)\times (p+1)\) matrices given by

$$\begin{aligned} \Gamma :=\Gamma (\varvec{\Psi }_0)= \begin{pmatrix} \Gamma _{\theta \theta }(\varvec{\Psi }_0) &{}\quad \Gamma _{\theta \phi }(\varvec{\Psi }_0) \\ \Gamma _{\phi \theta }(\varvec{\Psi }_0) &{}\quad \Gamma _{\phi \phi }(\varvec{\Psi }_0) \end{pmatrix};\quad \Lambda :=\Lambda (\varvec{\Psi }_0)= \begin{pmatrix} \Lambda _{\theta \theta }(\varvec{\Psi }_0) &{}\quad \Lambda _{\theta \phi }(\varvec{\Psi }_0) \\ \Lambda _{\phi \theta }(\varvec{\Psi }_0) &{}\quad \Lambda _{\phi \phi }(\varvec{\Psi }_0) \end{pmatrix}, \end{aligned}$$
(4.10)

where the elements of the matrices are expressed as

$$\begin{aligned} \Gamma _{\theta \theta }(\varvec{\Psi })&=E_{\varvec{x}}\left[ W_{\theta \theta }(\varvec{\Psi }, \varvec{x})\varvec{x}\varvec{x}^T\right] ,\; \Gamma _{\theta \phi }(\varvec{\Psi })=\Gamma _{\phi \theta } (\varvec{\Psi })^T=E_{\varvec{x}}\left[ W_{\theta \phi }(\varvec{\Psi },\varvec{x}) \varvec{x}\right] ,\\ \Gamma _{\phi \phi }(\varvec{\Psi })&=E_{\varvec{x}}\left[ W_{\phi \phi }(\varvec{\Psi },\varvec{x})\right] ,\\ \Lambda _{\theta \theta }(\varvec{\Psi })&=E_{\varvec{x}}\left[ V_{\theta \theta } (\varvec{\Psi },\varvec{x})\varvec{x}\varvec{x}^T\right] ,\;\Lambda _{\theta \phi }(\varvec{\Psi }) =\Lambda _{\phi \theta }(\varvec{\Psi })^T=E_{\varvec{x}}\left[ V_{\theta \phi } (\varvec{\Psi },\varvec{x})\varvec{x}\right] ,\\ \Lambda _{\phi \phi }(\varvec{\Psi })&=E_{\varvec{x}}\left[ V_{\phi \phi }(\varvec{\Psi },\varvec{x})\right] . \end{aligned}$$

Here, \(E_{\varvec{x}}[\cdot ]\) is the expectation taken on \(\varvec{x}\). The analytical expressions of \(W_{\theta \theta }(\varvec{\Psi },\varvec{x})\), \(W_{\theta \phi }(\varvec{\Psi },\varvec{x})\), \(W_{\phi \phi }(\varvec{\Psi },\varvec{x})\), \(V_{\theta \theta }(\varvec{\Psi },\varvec{x})\), \(V_{\theta \phi }(\varvec{\Psi },\varvec{x})\) and \(V_{\phi \phi }(\varvec{\Psi },\varvec{x})\) are listed in Equations (B.6)–(B.8) and (B.13)–(B.18) of the supplementary material. If a canonical regression link is selected (i.e., \(\xi (\cdot )\) is an identity function), the solution \(\hat{\varvec{\Psi }}_n\) will be unique.

In practice, we are unable to obtain the exact covariance matrix \(\varvec{\Sigma }\) above because the true model parameters \(\varvec{\Psi }_0\) are unobserved, and the randomness of \(\varvec{x}\) is not modelled explicitly to compute the expectation \(E_{\varvec{x}}[\cdot ]\). We, therefore, estimate the uncertainties of fitted model parameters \(\hat{\varvec{\Psi }}_n\) as follows:

$$\begin{aligned} \text {Var}(\hat{\varvec{\Psi }}_n)\approx \frac{1}{n}({\tilde{\Gamma }} ^{-1}){\tilde{\Lambda }}({\tilde{\Gamma }}^{-1})^T, \end{aligned}$$
(4.11)

where

$$\begin{aligned} {\tilde{\Gamma }}&:= \begin{pmatrix} {\tilde{\Gamma }}_{\theta \theta }(\hat{\varvec{\Psi }}_n) &{}\quad {\tilde{\Gamma }}_{\theta \phi }(\hat{\varvec{\Psi }}_n) \\ {\tilde{\Gamma }}_{\phi \theta }(\hat{\varvec{\Psi }}_n) &{}\quad {\tilde{\Gamma }}_{\phi \phi }(\hat{\varvec{\Psi }}_n) \end{pmatrix} =\frac{1}{n} \begin{pmatrix} \varvec{X}^T{\tilde{W}}_{\theta \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})\varvec{X} &{}\quad \varvec{X}^T{\tilde{W}}_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{X}) \\ {\tilde{W}}_{\phi \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})\varvec{X} &{}\quad {\tilde{W}}_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{X}) \end{pmatrix}, \end{aligned}$$
(4.12)
$$\begin{aligned} {\tilde{\Lambda }}&:= \begin{pmatrix} {\tilde{\Lambda }}_{\theta \theta }(\hat{\varvec{\Psi }}_n) &{}\quad {\tilde{\Lambda }}_{\theta \phi }(\hat{\varvec{\Psi }}_n) \\ {\tilde{\Lambda }}_{\phi \theta }(\hat{\varvec{\Psi }}_n) &{}\quad {\tilde{\Lambda }}_{\phi \phi }(\hat{\varvec{\Psi }}_n) \end{pmatrix} =\frac{1}{n} \begin{pmatrix} \varvec{X}^T{\tilde{V}}_{\theta \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})\varvec{X} &{}\quad \varvec{X}^T{\tilde{V}}_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{X}) \\ {\tilde{V}}_{\phi \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})\varvec{X} &{}\quad {\tilde{V}}_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{X}) \end{pmatrix} \end{aligned}$$
(4.13)

with

$$\begin{aligned} {\tilde{W}}_{\theta \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})&=\text {diag}(\{W_{\theta \theta }(\hat{\varvec{\Psi }}_n,\varvec{x}_i)\}_ {i=1,\ldots ,n}),\\ {\tilde{W}}_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{X})&={\tilde{W}}_{\phi \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})^T =(\{W_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{x}_i)\}_{i=1,\ldots ,n}), \\ {\tilde{W}}_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{X})&=\sum _{i=1}^{n}W_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{x}_i),\; {\tilde{V}}_{\theta \theta }(\hat{\varvec{\Psi }}_n,\varvec{X}) =\text {diag}(\{V_{\theta \theta }(\hat{\varvec{\Psi }}_n, \varvec{x}_i)\}_{i=1,\ldots ,n}),\\ {\tilde{V}}_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{X})&={\tilde{V}}_{\phi \theta }(\hat{\varvec{\Psi }}_n,\varvec{X})^T =(\{V_{\theta \phi }(\hat{\varvec{\Psi }}_n,\varvec{x}_i)\}_{i =1,\ldots ,n}),\\ {\tilde{V}}_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{X})&=\sum _{i=1}^{n}V_{\phi \phi }(\hat{\varvec{\Psi }}_n,\varvec{x}_i). \end{aligned}$$
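The plug-in covariance in Eqs. (4.11)–(4.13) is straightforward to assemble once the per-observation weights are available. A minimal numpy sketch follows; the function name and the precomputed weight arrays are illustrative assumptions, with the weights themselves coming from the analytical expressions (B.6)–(B.8) and (B.13)–(B.18) of the supplementary material evaluated at the fitted parameters.

```python
import numpy as np

def sandwich_cov(X, W_tt, W_tp, W_pp, V_tt, V_tp, V_pp):
    """Plug-in sandwich covariance of Eq. (4.11):
    Var(Psi_hat) ~ (1/n) * G^{-1} L (G^{-1})^T.

    X          : (n, p+1) design matrix
    W_tt, V_tt : (n,) per-observation weights for the theta-theta blocks
    W_tp, V_tp : (n,) per-observation weights for the theta-phi blocks
    W_pp, V_pp : scalars (already summed over i) for the phi-phi blocks
    """
    n, d = X.shape

    def blocks(w_tt, w_tp, w_pp):
        # Assemble the (d+1) x (d+1) partitioned matrix of Eq. (4.12)/(4.13)
        A = X.T @ (w_tt[:, None] * X)   # theta-theta block: X^T diag(w) X
        b = X.T @ w_tp                  # theta-phi block
        M = np.zeros((d + 1, d + 1))
        M[:d, :d] = A
        M[:d, d] = b
        M[d, :d] = b
        M[d, d] = w_pp
        return M / n

    G = blocks(W_tt, W_tp, W_pp)        # Gamma_tilde
    L = blocks(V_tt, V_tp, V_pp)        # Lambda_tilde
    Ginv = np.linalg.inv(G)
    return Ginv @ L @ Ginv.T / n
```

The returned matrix is symmetric whenever the Lambda blocks are, matching the asymptotic covariance structure of Theorem 1.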

4.4 Robustness analysis

The robustness of the proposed SWLE can also be justified by showing that it has bounded sensitivity to model perturbations. Assume that \((Y_i,\varvec{x}_i)\) are generated by the contamination model

$$\begin{aligned} {\tilde{F}}(y_i,\varvec{x}_i;\epsilon ,\Delta ,\varvec{\Psi }_0)=(1-\epsilon ) F(y_i,\varvec{x}_i;\varvec{\Psi }_0)+\epsilon \Delta (y_i,\varvec{x}_i), \end{aligned}$$
(4.14)

where \(F(y_i,\varvec{x}_i;\varvec{\Psi }_0)=F(y_i;\varvec{x}_i,\varvec{\Psi }_0)H(\varvec{x}_i)\) is a joint distribution on \((Y_i,\varvec{x}_i)\) with \(F(y_i;\varvec{x}_i,\varvec{\Psi }_0)\) being the cdf of \(Y_i|\varvec{x}_i\) under the GLM in Eq. (2.1), whereas \(\Delta (y_i,\varvec{x}_i)\) is a contamination distribution function on \((Y_i,\varvec{x}_i)\). The sensitivity of the estimated parameters by model contaminations can be evaluated using the influence function (IF, Hampel 1974):

$$\begin{aligned} \text {IF}(\varvec{\Psi }_0;F,\Delta )=\lim _{\epsilon \rightarrow 0}\frac{\tilde{\varvec{\Psi }}^{\epsilon ,\Delta }-\varvec{\Psi }_0}{\epsilon }, \end{aligned}$$
(4.15)

where \(\tilde{\varvec{\Psi }}^{\epsilon ,\Delta }\) are the asymptotic estimated parameters when the data generating model is \({\tilde{F}}(y_i,\varvec{x}_i;\epsilon ,\Delta ,\varvec{\Psi }_0)\) in Eq. (4.14). The influence function can clearly be unbounded under the MLE approach when the contamination distribution function \(\Delta (y_i,\varvec{x}_i)\) assigns a probability mass to an arbitrarily (extremely) large \(y_i\). In contrast, the following theorem ensures that the influence function is bounded under the SWLE. The proof is provided in Section D.2 of the supplementary material.

Theorem 2

Suppose that the following assumptions are satisfied:

  1. (i)

    There exists a compact space \(\bar{{\mathcal {X}}}\) such that \(\int _{{\mathcal {Y}}\times \bar{{\mathcal {X}}}}d\Delta (y_i,\varvec{x}_i)=1\), i.e., the covariates generated by the perturbed distribution are bounded by \(\bar{{\mathcal {X}}}\) almost surely.

  2. (ii)

    There exists a finite function \({\mathcal {P}}(\varvec{x}_i)\) such that \(|W(y_i,\varvec{x}_i)|\le {\mathcal {P}}(\varvec{x}_i)\), \(|W(y_i,\varvec{x}_i)y_i|\le {\mathcal {P}}(\varvec{x}_i)\) and \(|W(y_i,\varvec{x}_i)g(y_i)|\le {\mathcal {P}}(\varvec{x}_i)\) for every \(y_i\in {\mathcal {Y}}\).

  3. (iii)

    The mild regularity conditions outlined in Section A.1 of the supplementary material are satisfied.

Denote \(\varvec{\Upsilon }\) as a collection of all distribution functions on \((y_i,\varvec{x}_i)\). Under the SWLE, we have

$$\begin{aligned} \sup _{\Delta \in \varvec{\Upsilon }}|\text {IF}(\varvec{\Psi }_0;F,\Delta )|<\infty . \end{aligned}$$
(4.16)

Remark 1

The rationale for stating Assumption (i) in Theorem 2 is that the proposed SWLE primarily aims to ensure the robustness of the estimated parameters against outliers in \(y_i\) rather than in \(\varvec{x}_i\). In insurance practice, it often makes sense to consider only a bounded covariate space: for the European automobile dataset in Sect. 8.1, variables are either categorical (e.g., car fuel type) or discrete (e.g., policyholder age) with practical upper limits (e.g., no policyholders are older than 120 years).

Finally, we verify Assumption (ii) of Theorem 2 in some example cases:

Corollary 2

Suppose that \(F(y_i;\varvec{x}_i,\varvec{\Psi }_0)\) is the cdf of the Gamma GLM, linear model or inverse-Gaussian GLM specified in Example 1, and the weight function is of the form of Eq. (4.1), with the hyperparameters selected in accordance with Example 1. Then, Assumption (ii) of Theorem 2 holds.

5 Model diagnostic with SWLE

The previous sections introduce the SWLE as a statistical inference tool that is less sensitive to outliers. In insurance practice, the modelling challenge may arise not just from model contamination but also from a more fundamental problem: the true (data generating) model may systematically deviate from the GLM. Such deviations include tail-heaviness, distributional multimodality, non-linear regression links, and dispersion heterogeneity. In this case, the GLM may produce misleading pricing recommendations, and hence actuaries must reconsider alternative modelling frameworks. The research question is whether there are quantitative measures that indicate when it is worthwhile to try alternative models. In other words, we want to test the null hypothesis that

$$\begin{aligned} H_0:\text {The data is generated by a specified class of GLMs} \end{aligned}$$

against the alternative hypothesis \(H_1\) that \(H_0\) is false. In this section, we propose a novel Wald-test statistic (Van der Vaart 2000) based on the SWLE to quantitatively assess this hypothesis. Under \(H_0\), \(Y_i|\varvec{x}_i\) follows the GLM with density function given by Eq. (2.1) and true (unknown) model parameters \(\varvec{\Psi }_0:=(\varvec{\beta }_0,\phi _0)\). For \(k=1,\ldots ,K\), denote \(\hat{\varvec{\Psi }}^{(k)}_n\) as the estimated parameters under the SWLE where the weight function is given by Eq. (4.1) with hyperparameters \(\tilde{\varvec{\Psi }}^{(k)}:=(\tilde{\varvec{\beta }}^{(k)},{\tilde{\phi }}^{(k)})\) and \({\tilde{\theta }}_i=\xi (\varvec{x}_i^T\tilde{\varvec{\beta }}^{(k)})\). Here, K represents the total number of different sets of weight function hyperparameters we consider.

Under \(H_0\), we expect that the choice of weight function hyperparameters does not significantly impact the estimated parameters because the SWLE is asymptotically consistent (Theorem 1) regardless of the hyperparameters chosen. In other words, a large absolute difference of estimated parameters \(|\hat{\varvec{\Psi }}^{(k)}_n-\hat{\varvec{\Psi }}^{(k^{\prime })}_n|\) (for some \(k\ne k^{\prime }\)) is evidence to reject \(H_0\). In this case, the data generating model may not be within the specified GLM class, and further exploration of alternative models is recommended. This motivates the following two theorems, which provide a foundation for the construction of Wald statistics based on \((\hat{\varvec{\Psi }}^{(2)}_n-\hat{\varvec{\Psi }}^{(1)}_n,\hat{\varvec{\Psi }}^{(3)} _n-\hat{\varvec{\Psi }}^{(2)}_n,\ldots ,\hat{\varvec{\Psi }}^{(K)}_n-\hat{\varvec{\Psi }}^{(K-1)}_n)\). We first denote \(\hat{\varvec{\Psi }}_n^{\text {meta}}=(\hat{\varvec{\Psi }}_n^{(1)},\ldots ,\hat{\varvec{\Psi }}_n^{(K)})\) as a \(1\times (p+1)K\) row vector containing all estimated parameters under the various weight function hyperparameters. Also define \(\varvec{\Psi }_0^{\text {meta}}=(\varvec{\Psi }_0,\ldots ,\varvec{\Psi }_0)\) as a \(1\times (p+1)K\) row vector containing K copies of the true model parameters. The following results hold; the proofs are deferred to Section D.7 of the supplementary material.

Theorem 3

Under \(H_0\) with true model parameters \(\varvec{\Psi }_0\), and given that the mild regularity conditions outlined in Section A.2 of the supplementary material hold, we have

$$\begin{aligned} \sqrt{n}\left( \hat{\varvec{\Psi }}_n^{\text {meta}} -\varvec{\Psi }_0^{\text {meta}}\right) \overset{d}{\rightarrow }{\mathcal {N}}(\varvec{0},\varvec{\Sigma }^{\text {meta}}), \end{aligned}$$
(5.1)

where \(\varvec{\Sigma }^{\text {meta}}\) is a \((p+1)K\times (p+1)K\) matrix given by

$$\begin{aligned} \varvec{\Sigma }^{\text {meta}} = \begin{pmatrix} \left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) \Lambda ^{(1,1)} \left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) ^T &{} \left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) \Lambda ^{(1,2)}\left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) ^T &{} \dots &{} \left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) \Lambda ^{(1,K)}\left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) ^T \\ \left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) \Lambda ^{(2,1)}\left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) ^T &{} \left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) \Lambda ^{(2,2)}\left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) ^T &{} \dots &{} \left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) \Lambda ^{(2,K)}\left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) ^T \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) \Lambda ^{(K,1)}\left( \left[ \Gamma ^{(1)}\right] ^{-1}\right) ^T &{} \left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) \Lambda ^{(K,2)}\left( \left[ \Gamma ^{(2)}\right] ^{-1}\right) ^T &{} \dots &{} \left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) \Lambda ^{(K,K)}\left( \left[ \Gamma ^{(K)}\right] ^{-1}\right) ^T \end{pmatrix}, \end{aligned}$$
(5.2)

with the matrix elements analytically expressed in Section C of the supplementary material.
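Given the \(\Gamma ^{(k)}\) and \(\Lambda ^{(k,l)}\) blocks, \(\varvec{\Sigma }^{\text {meta}}\) in Eq. (5.2) is assembled block by block. The following numpy sketch (the function name is illustrative) assumes the blocks have already been computed from the analytical expressions in Section C of the supplementary material:

```python
import numpy as np

def meta_cov(Gammas, Lambdas):
    """Assemble the (p+1)K x (p+1)K matrix Sigma_meta of Eq. (5.2).

    Gammas  : list of K (d, d) matrices Gamma^(k), d = p + 1
    Lambdas : K x K nested list with Lambdas[k][l] = Lambda^(k,l), each (d, d)
    """
    K = len(Gammas)
    d = Gammas[0].shape[0]
    Ginv = [np.linalg.inv(G) for G in Gammas]
    S = np.zeros((d * K, d * K))
    for k in range(K):
        for l in range(K):
            # (k, l) block: [Gamma^(k)]^{-1} Lambda^(k,l) ([Gamma^(l)]^{-1})^T
            S[k*d:(k+1)*d, l*d:(l+1)*d] = Ginv[k] @ Lambdas[k][l] @ Ginv[l].T
    return S
```

Symmetry of the result follows from \(\Lambda ^{(l,k)}=[\Lambda ^{(k,l)}]^T\).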

Theorem 4

Denote \(\varvec{J}\) as a \(Q\times (p+1)K\) design matrix of full row rank. Then, we have

$$\begin{aligned} n\left[ \varvec{J}\left( \hat{\varvec{\Psi }}_n^{\text {meta}}-\varvec{\Psi }_0^{\text {meta}}\right) ^T\right] ^T\left( \varvec{J}\varvec{\Sigma }^{\text {meta}}\varvec{J}^T\right) ^{-1} \left[ \varvec{J}\left( \hat{\varvec{\Psi }}_n^{\text {meta}}-\varvec{\Psi }_0^{\text {meta}}\right) ^T\right] \overset{d}{\rightarrow }\chi ^2_{Q}, \end{aligned}$$
(5.3)

where \(\chi ^2_{Q}\) is a chi-square distribution with Q degrees of freedom.

Based on the above theorem, one can develop various versions of Wald-type statistics as follows:

  1. 1.

    Meta Wald statistic: We aggregate all the estimated parameters differences \(\Delta \hat{\varvec{\Psi }}^{\text {meta}}_n:=(\hat{\varvec{\Psi }}^{(2)}_n -\hat{\varvec{\Psi }}^{(1)}_n,\hat{\varvec{\Psi }}^{(3)}_n -\hat{\varvec{\Psi }}^{(2)}_n,\ldots ,\hat{\varvec{\Psi }}^{(K)}_n-\hat{\varvec{\Psi }}^{(K-1)}_n)\) to create a combined (meta) test statistic. The \((p+1)(K-1)\times (p+1)K\) design matrix \(\varvec{J}\) in the above theorem is chosen as

    $$\begin{aligned} \varvec{J}^{\text {meta}}= \begin{pmatrix} \varvec{I} &{}\quad -\varvec{I} &{}\quad \varvec{0} &{}\quad \dots &{}\quad \varvec{0} &{}\quad \varvec{0}\\ \varvec{0} &{}\quad \varvec{I} &{}\quad -\varvec{I} &{}\quad \dots &{}\quad \varvec{0} &{}\quad \varvec{0}\\ \vdots &{}\quad &{}\quad \ddots &{}\quad &{}\quad &{}\quad \vdots \\ \varvec{0} &{}\quad \varvec{0} &{}\quad \varvec{0} &{}\quad \dots &{}\quad \varvec{I} &{}\quad -\varvec{I} \end{pmatrix}, \end{aligned}$$
    (5.4)

    where \(\varvec{I}\) is the \((p+1)\times (p+1)\) identity matrix. The Wald-type meta statistic is given by

    $$\begin{aligned} Z^{\text {meta}}_n=n\left( \Delta \hat{\varvec{\Psi }}^{\text {meta}}_n\right) \left[ \left( \varvec{J}^{\text {meta}}\right) \left( \hat{\varvec{\Sigma }}^{\text {meta}}\right) \left( \varvec{J}^{\text {meta}}\right) ^T\right] ^{-1}\left( \Delta \hat{\varvec{\Psi }}^ {\text {meta}}_n\right) ^T\overset{\cdot }{\sim }\chi ^2_{(p+1)(K-1)}, \end{aligned}$$
    (5.5)

    where \(\hat{\varvec{\Sigma }}^{\text {meta}}\) is the estimated covariance matrix \(\varvec{\Sigma }^{\text {meta}}\) in Theorem 3 evaluated at the fitted MLE parameters. The meta Wald statistic provides a single value to quantitatively assess the overall adequateness of using GLM to fit the data.

  2. 2.

    Individual Wald statistic: We perform pairwise comparisons of estimated parameters between two specific sets of weight function hyperparameters (say, \(\tilde{\varvec{\Psi }}^{(k)}\) and \(\tilde{\varvec{\Psi }}^{(k^{\prime })}\)). In this case, the design matrix \(\varvec{J}^{\text {ind}}=(\varvec{0},\ldots ,\varvec{0},\varvec{I},\varvec{0}, \cdots ,\varvec{0},-\varvec{I},\varvec{0},\ldots ,\varvec{0})\) has a dimension of \((p+1)\times (p+1)K\), with only the k-th and \(k'\)-th blocks being non-zero. The individual Wald statistic is

    $$\begin{aligned} Z_n^{\text {ind}}=n\left( \hat{\varvec{\Psi }}_n^{(k)}-\hat{\varvec{\Psi }}_n^{(k^{\prime })} \right) \left[ \hat{\varvec{\Sigma }}^{(k,k^{\prime })}\right] ^{-1} \left( \hat{\varvec{\Psi }}_n^{(k)}-\hat{\varvec{\Psi }}_n^{(k^{\prime })}\right) ^ T\overset{\cdot }{\sim }\chi ^2_{(p+1)}, \end{aligned}$$
    (5.6)

    where we have

    $$\begin{aligned} \varvec{\Sigma }^{(k,k^{\prime })}&=\left( \left[ \Gamma ^{(k)}\right] ^{-1}\right) \Lambda ^{(k,k)}\left( \left[ \Gamma ^{(k)}\right] ^{-1}\right) ^T +\left( \left[ \Gamma ^{(k^{\prime })}\right] ^{-1}\right) \Lambda ^{(k^{\prime },k^{\prime })} \left( \left[ \Gamma ^{(k^{\prime })}\right] ^{-1}\right) ^T\nonumber \\&\quad -\left( \left[ \Gamma ^{(k^{\prime })}\right] ^{-1}\right) \Lambda ^{(k^{\prime },k)} \left( \left[ \Gamma ^{(k)}\right] ^{-1}\right) ^T\nonumber \\&\quad -\left( \left[ \Gamma ^{(k)}\right] ^{-1}\right) \Lambda ^{(k,k^{\prime })}\left( \left[ \Gamma ^{(k')}\right] ^{-1}\right) ^T, \end{aligned}$$
    (5.7)

    and \(\hat{\varvec{\Sigma }}^{(k,k^{\prime })}\) is the estimated covariance \(\varvec{\Sigma }^{(k,k^{\prime })}\) evaluated at fitted MLE parameters.

  3. 3.

    Parameter-specific meta Wald statistic: We focus on a certain parameter of interest. From an insurance ratemaking perspective, actuaries are more interested in the regression coefficients to differentiate policyholders into various risk categories. From a risk management perspective, actuaries may be more interested in the dispersion parameter \(\phi \), which governs the distribution’s extreme losses. Suppose that the j-th parameter \(\Psi _j\) is the parameter of interest (\(j=1,\ldots ,p+1\)). The design matrix will then have a dimension of \((K-1)\times (p+1)K\), given by

    $$\begin{aligned} \varvec{J}_j^{\text {meta}}= \begin{pmatrix} \varvec{e}_j &{}\quad -\varvec{e}_j &{}\quad \varvec{0} &{}\quad \dots &{}\quad \varvec{0} &{}\quad \varvec{0}\\ \varvec{0} &{}\quad \varvec{e}_j &{}\quad -\varvec{e}_j &{}\quad \dots &{}\quad \varvec{0} &{}\quad \varvec{0}\\ \vdots &{}\quad &{}\quad \ddots &{}\quad &{}\quad &{}\quad \vdots \\ \varvec{0} &{}\quad \varvec{0} &{}\quad \varvec{0} &{}\quad \dots &{}\quad \varvec{e}_j &{}\quad -\varvec{e}_j \end{pmatrix}, \end{aligned}$$
    (5.8)

    where \(\varvec{e}_j\) is a \(1\times (p+1)\) row vector with the j-th element equal to one and all other elements zero. Denote \(\Delta {\hat{\Psi }}^{\text {meta}}_{j,n}=({\hat{\Psi }}^{(2)}_{j,n} -{\hat{\Psi }}^{(1)}_{j,n},\ldots ,{\hat{\Psi }}^{(K)}_{j,n}-{\hat{\Psi }}^{(K-1)}_{j,n})\) as the vector of estimated parameter differences corresponding to the j-th parameter. The Wald-type statistic is

    $$\begin{aligned} Z^{\text {meta}}_{j,n}=n\left( \Delta {\hat{\Psi }}^{\text {meta}}_{j,n}\right) \left[ \left( \varvec{J}^{\text {meta}}_j\right) \left( \hat{\varvec{\Sigma }}^ {\text {meta}}\right) \left( \varvec{J}^{\text {meta}}_j\right) ^T\right] ^ {-1}\left( \Delta {\hat{\Psi }}^{\text {meta}}_{j,n}\right) ^T\overset{\cdot }{\sim }\chi ^2_{(K-1)}. \end{aligned}$$
    (5.9)
  4. 4.

    Individual parameter-specific Wald statistic: We do pairwise comparisons of a single parameter between two sets of weight function hyperparameters (i.e., comparing \({\hat{\Psi }}_{j,n}^{(k)}\) to \({\hat{\Psi }}_{j,n}^{(k^{\prime })}\)). The Wald statistic is

    $$\begin{aligned} Z^{\text {ind}}_{j,n}=n({\hat{\Psi }}_{j,n}^{(k)}-{\hat{\Psi }}_{j,n}^{(k^{\prime })}) ^2\left[ \hat{\varvec{\Sigma }}^{(k,k^{\prime })}\right] _{j,j}^{-1}\overset{\cdot }{\sim }\chi ^2_{(1)}, \end{aligned}$$
    (5.10)

    where \(\left[ \hat{\varvec{\Sigma }}^{(k,k^{\prime })}\right] _{j,j}\) is the \((j,j)\)-th element of \(\hat{\varvec{\Sigma }}^{(k,k')}\).
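All four Wald statistics above share one computational pattern: choose a contrast matrix \(\varvec{J}\), then evaluate the quadratic form of Theorem 4. A minimal numpy sketch of that pattern follows; the function and variable names are illustrative, and the full contrast matrices are built via Kronecker products of a first-difference matrix with either the identity (Eq. (5.4)) or a unit vector \(\varvec{e}_j\) (Eq. (5.8)):

```python
import numpy as np
from scipy.stats import chi2

def contrast_matrix(K, d, e=None):
    """Difference contrast of Eq. (5.4) (e=None) or Eq. (5.8) (e = e_j)."""
    D = np.zeros((K - 1, K))
    idx = np.arange(K - 1)
    D[idx, idx], D[idx, idx + 1] = 1.0, -1.0   # successive differences
    block = np.eye(d) if e is None else np.atleast_2d(e)
    return np.kron(D, block)

def wald_statistic(Psi_hats, Sigma_meta, n, J):
    """Quadratic form of Theorem 4; Z ~ chi2 with rank(J) df under H0.

    Psi_hats   : (K, d) array of SWLE estimates, d = p + 1
    Sigma_meta : (d*K, d*K) estimated meta covariance (Theorem 3)
    n          : sample size
    """
    diff = J @ Psi_hats.reshape(-1)             # stacked parameter differences
    Z = float(n * diff @ np.linalg.solve(J @ Sigma_meta @ J.T, diff))
    return Z, chi2.sf(Z, df=J.shape[0])
```

Passing `contrast_matrix(K, d)` reproduces the meta statistic \(Z^{\text {meta}}_n\); passing a unit vector for `e` reproduces the parameter-specific version \(Z^{\text {meta}}_{j,n}\). The sign convention of the differences does not affect the quadratic form.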

6 Extending SWLE to censored and truncated data

In general insurance practice, the actual values of (transformed) losses \(\varvec{y}=(y_1,\ldots ,y_n)\) may not be observed exactly (censoring) or may not be observed at all (truncation) due to coverage modifications of insurance policies, including deductibles and policy limits. As a result, it is vital to extend the above SWLE framework to randomly censored and truncated regression data.

We formulate the censoring and truncation mechanisms by borrowing the notation from Fung et al. (2022). Denote \({\mathcal {T}}_i\subseteq {\mathcal {Y}}\) as a random truncation interval of observation i, meaning that a loss \(Y_i\) is observed conditioned on \(Y_i\in {\mathcal {T}}_i\). Further, define \({\mathcal {U}}_i\subseteq {\mathcal {T}}_i\) and \({\mathcal {C}}_i={\mathcal {T}}_i\backslash {\mathcal {U}}_i\) as the random uncensoring and censoring regions respectively. Denote \(\{{\mathcal {I}}_{i1},\ldots ,{\mathcal {I}}_{iM_i}\}\) as \(M_i\) disjoint random censoring intervals of observation i with \(\cup _{m=1}^{M_i}{\mathcal {I}}_{im}={\mathcal {C}}_i\). Then, \({\mathcal {R}}_i:=({\mathcal {U}}_i,{\mathcal {C}}_i,M_i, \{{\mathcal {I}}_{i1},\ldots ,{\mathcal {I}}_{iM_i}\})\) is called the censoring mechanism of observation i. Under this censoring framework, the loss \(Y_i\) is observed exactly if \(Y_i\in {\mathcal {U}}_i\), while we only know which censoring interval the loss belongs to (i.e. \(1\{Y_i\in {\mathcal {I}}_{i1}\},\ldots ,1\{Y_i\in {\mathcal {I}}_{iM_i}\}\)) if \(Y_i\in {\mathcal {C}}_i\). As a result, the observed (incomplete) information for loss i is given by \({\mathcal {D}}_i:=({\mathcal {R}}_i,{\mathcal {T}}_i,y_i1\{y_i\in {\mathcal {U}}_i\},\{1\{y_i\in {\mathcal {I}}_{im}\}\}_{m=1,\ldots ,M_i})\). Denote \({\mathcal {D}}:=\{{\mathcal {D}}_i\}_{i=1,\ldots ,n}\) as the observed information of all losses.

Note that the censoring and truncation mechanisms \(({\mathcal {R}}_i,{\mathcal {T}}_i)\) are observed but may differ across i, so the data is censored and truncated at random. This makes sense from an insurance perspective because different policyholders may choose different deductibles or policy limits, which affect the censoring and truncation points. The above formalism also represents a general framework that includes left, right, and interval censoring and truncation. See Fung et al. (2022) for more details.
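As a concrete illustration, a policy with deductible d and policy limit u corresponds to \({\mathcal {T}}_i=(d,\infty )\), \({\mathcal {U}}_i=(d,u)\) and a single censoring interval \({\mathcal {I}}_{i1}=[u,\infty )\). A hypothetical Python encoding of this special case of the \(({\mathcal {R}}_i,{\mathcal {T}}_i)\) formalism (the class and function names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObservedLoss:
    """Observed information D_i for one policy with deductible d and limit u.

    Truncation interval T_i = (d, inf): losses at or below d never appear.
    Uncensoring region  U_i = (d, u); censoring interval I_i1 = [u, inf).
    """
    deductible: float
    limit: float
    y: Optional[float]   # exact loss if uncensored, else None
    censored: bool       # True means we only know y >= limit

def observe(loss, deductible, limit):
    """Map a ground-up loss to its observed record, or None if truncated."""
    if loss <= deductible:
        return None                                        # truncated away
    if loss >= limit:
        return ObservedLoss(deductible, limit, None, True)  # right-censored
    return ObservedLoss(deductible, limit, loss, False)     # observed exactly
```

Because `deductible` and `limit` are stored per record, the mechanism naturally varies across policyholders, matching the random censoring and truncation described above.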

Our objective is to develop a robust method for censored and truncated data. To this end, we propose extending the SWLE score function in Eq. (3.4) rather than directly working with the weighted log-likelihood function in Eq. (3.2); this choice stems from the inherent difficulty of constructing the weighted log-likelihood function for censored and truncated data, as discussed in Sect. 3.3. We construct the extended SWLE score function as

$$\begin{aligned} {\mathcal {S}}_n(\varvec{\Psi };{\mathcal {D}},\varvec{X})&:=\sum _{i=1}^n{\mathcal {S}}(\varvec{\Psi };{\mathcal {D}}_i,\varvec{x}_i)\nonumber \\&=\sum _{i=1}^{n}\left( \int _{{\mathcal {Y}}}f_{{\mathcal {T}}_i}(u;\varvec{x}_i, \varvec{\Psi })W(u,\varvec{x}_i)du\right) \frac{f^*_{{\mathcal {T}}_i}(y_i;\varvec{x}_i, \varvec{\Psi })}{f_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })}\nonumber \\&\quad \times \frac{\partial }{\partial \varvec{\Psi }}\log f^*_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })1\{y_i\in {\mathcal {U}}_i\}\nonumber \\&\quad + \sum _{i=1}^{n}\left( \int _{{\mathcal {Y}}}f_{{\mathcal {T}}_i} (u;\varvec{x}_i,\varvec{\Psi })W(u,\varvec{x}_i)du\right) \sum _{m=1}^{M_i} \frac{F^*_{{\mathcal {T}}_i}({\mathcal {I}}_{im};\varvec{x}_i,\varvec{\Psi })}{F_{{\mathcal {T}}_i}({\mathcal {I}}_{im};\varvec{x}_i,\varvec{\Psi })}\nonumber \\&\quad \times \frac{\partial }{\partial \varvec{\Psi }}\log F^*_{{\mathcal {T}}_i} ({\mathcal {I}}_{im};\varvec{x}_i,\varvec{\Psi })1\{y_i\in {\mathcal {I}}_{im}\} \end{aligned}$$
(6.1)
$$\begin{aligned}&=\sum _{i=1}^nW(y_i,\varvec{x}_i)\frac{\partial }{\partial \varvec{\Psi }} \log f^*_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })1\{y_i\in {\mathcal {U}}_i\}\nonumber \\&\quad +\sum _{i=1}^n\sum _{m=1}^{M_i}\lambda ^{*}(\varvec{\Psi };\varvec{x}_i) \frac{F^*({\mathcal {I}}_{im};\varvec{x}_i,\varvec{\Psi })}{F({\mathcal {I}}_{im}; \varvec{x}_i,\varvec{\Psi })}\nonumber \\&\quad \times \frac{\partial }{\partial \varvec{\Psi }}\log F^*_{{\mathcal {T}}_i} ({\mathcal {I}}_{im};\varvec{x}_i,\varvec{\Psi })1\{y_i\in {\mathcal {I}}_{im}\}, \end{aligned}$$
(6.2)

where \(f_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })\) and \(f^*_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })\) are the truncated density functions given by

$$\begin{aligned} f_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })&=\frac{f(y_i;\varvec{x}_i,\varvec{\Psi })}{F({\mathcal {T}}_i;\varvec{x}_i,\varvec{\Psi })}1\{y_i\in {\mathcal {T}}_i\},\nonumber \\ f^*_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })&=\frac{f^*(y_i;\varvec{x}_i,\varvec{\Psi })}{F^*({\mathcal {T}}_i;\varvec{x}_i,\varvec{\Psi })}1\{y_i\in {\mathcal {T}}_i\}, \end{aligned}$$
(6.3)

\(F(\cdot ;\varvec{x}_i,\varvec{\Psi })\), \(F^*(\cdot ;\varvec{x}_i,\varvec{\Psi })\), \(F_{{\mathcal {T}}_i}(\cdot ;\varvec{x}_i,\varvec{\Psi })\) and \(F^*_{{\mathcal {T}}_i}(\cdot ;\varvec{x}_i,\varvec{\Psi })\) are the distribution functions of \(f(\cdot ;\varvec{x}_i,\varvec{\Psi })\), \(f^*(\cdot ;\varvec{x}_i,\varvec{\Psi })\), \(f_{{\mathcal {T}}_i}(\cdot ;\varvec{x}_i,\varvec{\Psi })\) and \(f^*_{{\mathcal {T}}_i}(\cdot ;\varvec{x}_i,\varvec{\Psi })\) respectively, i.e., \(F({\mathcal {A}};\varvec{x}_i,\varvec{\Psi })=\int _{{\mathcal {A}}}f(u;\varvec{x}_i,\varvec{\Psi })du\) for any \({\mathcal {A}}\subseteq {\mathbb {R}}\).
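For instance, under left truncation at a deductible (\({\mathcal {T}}_i=(d,\infty )\)), the truncated density in Eq. (6.3) reduces to \(f(y)/(1-F(d))\) on \(y>d\). A scipy sketch for a Gamma loss (the function name is illustrative, and the Gamma case stands in for the GLM density):

```python
import numpy as np
from scipy.stats import gamma

def truncated_gamma_pdf(y, shape, scale, lower):
    """Truncated density f_T of Eq. (6.3) for a Gamma loss left-truncated
    at a deductible `lower`: f(y) / P(Y > lower) on y > lower, else 0."""
    y = np.asarray(y, dtype=float)
    mass = gamma.sf(lower, a=shape, scale=scale)   # F(T_i) = P(Y > lower)
    out = gamma.pdf(y, a=shape, scale=scale) / mass
    return np.where(y > lower, out, 0.0)           # indicator 1{y in T_i}
```

The same renormalisation applies to \(f^*_{{\mathcal {T}}_i}\), with \(f^*\) and \(F^*\) in place of \(f\) and \(F\).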

The extended SWLE makes two major modifications compared to the original score function in Eq. (3.4) for complete data. Firstly, the density functions (e.g., \(f(y_i;\varvec{x}_i,\varvec{\Psi })\)) are changed to truncated density functions (e.g., \(f_{{\mathcal {T}}_i}(y_i;\varvec{x}_i,\varvec{\Psi })\)), reflecting that \(Y_i\) is observed conditioned on \(Y_i\in {\mathcal {T}}_i\). Secondly, Eq. (6.1) is split into two terms. The first term closely resembles the original score function, reflecting that the full information \(y_i\) is used to evaluate the score function when the observation is uncensored. The second term represents the modified score function when the observation falls into the censoring region, where the only information known is which censoring interval \({\mathcal {I}}_{im}\) the observation belongs to. In this case, the density functions in the score function are changed to distribution functions evaluated at the censoring interval \({\mathcal {I}}_{im}\).

Remark 2

From a practical standpoint, although small losses below the deductible are eliminated and very large losses exceeding the policy limit are “controlled”, the development of robust methodologies remains imperative for censored and truncated data. This necessity arises because the deductible (truncation point) and policy limit (censoring point) can exhibit significant variability among policyholders in many real insurance loss datasets. As a result, for a substantial portion of policyholders, losses are not effectively eliminated or controlled by these censoring and truncation points. As evidenced by the analysis of the two real insurance datasets in Sect. 8, both extremely large and small losses persist, despite the presence of deductibles and/or policy limits. This highlights the importance of extending the proposed SWLE framework to address incomplete data.

The following result theoretically justifies the consistency and asymptotic normality of the extended SWLE under the GLM framework, i.e., Eq. (6.1) is a proper extension from Eq. (3.4).

Theorem 5

Suppose that \({\tilde{Y}}_i|\varvec{x}_i\) (the loss random variable before truncation) independently follows the GLM with density function given by Eq. (2.1) and true model parameters \(\varvec{\Psi }_0:=(\varvec{\beta }_0,\phi _0)\) for \(i=1,\ldots ,n\). Each loss i is also equipped with the censoring and truncation mechanisms \(({\mathcal {R}}_i,{\mathcal {T}}_i)\) described above. Assume that \({\tilde{Y}}_i\) is independent of \(({\mathcal {R}}_i,{\mathcal {T}}_i)\) conditioned on \(\varvec{x}_i\) for every \(i=1,\ldots ,n\). Also denote \(Y_i={\tilde{Y}}_i|{\tilde{Y}}_i\in {\mathcal {T}}_i\) as the truncated loss random variable. Suppose that the observed information is \({\mathcal {D}}:=\{{\mathcal {D}}_i\}_{i=1,\ldots ,n}\) described above. If the mild regularity conditions outlined in Section A.3 of the supplementary material are satisfied, then there exists a solution \(\hat{\varvec{\Psi }}_n:=(\hat{\varvec{\beta }}_n,{\hat{\phi }}_n)\) satisfying the extended SWLE score equations \({\mathcal {S}}_{n}(\hat{\varvec{\Psi }}_n;{\mathcal {D}},\varvec{X})=\varvec{0}\) (Eq. (6.1)) such that

$$\begin{aligned} \sqrt{n}(\hat{\varvec{\Psi }}_n-\varvec{\Psi }_0)\overset{d}{\rightarrow }{\mathcal {N}}(\varvec{0},\varvec{\Sigma }), \end{aligned}$$
(6.4)

where \(\varvec{\Sigma }:=\varvec{\Sigma }(\varvec{\Psi }_0)=(\Gamma ^{-1})\Lambda (\Gamma ^{-1})^T\), with \(\Gamma \) and \(\Lambda \) being \((p+1)\times (p+1)\) matrices given by Section D.4 of the supplementary material.

Remark 3

From Theorem 5 above and Section D.4 of the supplementary material, the covariance matrix \(\varvec{\Sigma }\) depends on the first and second derivatives of the cdf when the observations are censored and truncated. These terms can still be expressed analytically for the linear model and inverse-Gaussian GLM. For the Gamma GLM, these terms can be expressed in terms of incomplete di-gamma and tri-gamma functions, which can be computed using the pgamma.deriv function in the heavy package in R.
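Outside R, the same derivatives can be approximated numerically. A central finite-difference sketch in Python, intended as a numerical stand-in for pgamma.deriv rather than an exact incomplete di-gamma evaluation (the function name is illustrative):

```python
import numpy as np
from scipy.stats import gamma

def dgamma_cdf_dshape(x, shape, scale=1.0, h=1e-6):
    """First derivative of the Gamma cdf with respect to the shape
    parameter, via central finite differences; accuracy is O(h^2)."""
    up = gamma.cdf(x, a=shape + h, scale=scale)
    lo = gamma.cdf(x, a=shape - h, scale=scale)
    return (up - lo) / (2.0 * h)
```

A second difference of the same form yields the second derivative needed for the tri-gamma terms, at the cost of some numerical precision.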

Analogous to Sect. 5, one can construct a Wald-based test statistic to assess whether a specified class of GLMs is appropriate for a dataset with censoring and truncation mechanisms. We here denote \(\hat{\varvec{\Psi }}^{(k)}_n\) as the solution satisfying the extended SWLE equations \({\mathcal {S}}_{n}(\hat{\varvec{\Psi }}_n^{(k)};{\mathcal {D}},\varvec{X})=\varvec{0}\) in Eq. (6.1) with weight function hyperparameters chosen as \(\tilde{\varvec{\Psi }}^{(k)}\) for \(k=1,\ldots ,K\). Also recall \(\hat{\varvec{\Psi }}_n^{\text {meta}}=(\hat{\varvec{\Psi }}_n^{(1)},\ldots ,\hat{\varvec{\Psi }}_n^{(K)})\) and \(\varvec{\Psi }_0^{\text {meta}}=(\varvec{\Psi }_0,\ldots ,\varvec{\Psi }_0)\) as defined in Sect. 5. The following theorem holds:

Theorem 6

Under \(H_0\) with true model parameters \(\varvec{\Psi }_0\), and given that the mild regularity conditions outlined in Section A.4 of the supplementary material hold, we have

$$\begin{aligned} \sqrt{n}\left( \hat{\varvec{\Psi }}_n^{\text {meta}}-\varvec{\Psi }_0^{\text {meta}}\right) \overset{d}{\rightarrow }{\mathcal {N}}(\varvec{0},\varvec{\Sigma }^{\text {meta}}), \end{aligned}$$
(6.5)

where \(\varvec{\Sigma }^{\text {meta}}\) is a \((p+1)K\times (p+1)K\) matrix given by Section D.6 of the supplementary material, and hence

$$\begin{aligned} n\left[ \varvec{J}\left( \hat{\varvec{\Psi }}_n^{\text {meta}}-\varvec{\Psi }_0^{\text {meta}} \right) ^T\right] ^T\left( \varvec{J}\varvec{\Sigma }^{\text {meta}}\varvec{J}^T\right) ^{-1} \left[ \varvec{J}\left( \hat{\varvec{\Psi }}_n^{\text {meta}}-\varvec{\Psi }_0^{\text {meta}}\right) ^T\right] \overset{d}{\rightarrow }\chi ^2_{Q} \end{aligned}$$
(6.6)

for a \(Q\times (p+1)K\) design matrix \(\varvec{J}\).

With the above theorem, the Wald-type diagnostic test statistic for censored and truncated data can be constructed as described in Sect. 5.

7 Simulation studies

7.1 Simulation 1: various GLMs

This study aims to empirically verify the asymptotic properties of the proposed SWLE and evaluate its finite-sample performance. In each simulation, we generate n observations \(\{(y_i,\varvec{x}_i)\}_{i=1,\ldots ,n}\) with \(p=2\) (for simplicity) so that \(\varvec{x}_i=(x_{i1},x_{i2})\). We set \(x_{i1}=1\) as an intercept term and generate \(x_{i2}\) iid from N(0, 1). The simulation design is as follows:

  1. 1.

    Data generating model for \(Y_i|\varvec{x}_i\): We consider Gamma GLM, linear model and inverse-Gaussian GLM with the following link function and parameter settings:

    • Gamma GLM: A log-link \(\mu _i=-1/\theta _i=\exp \{\varvec{x}_i^T\varvec{\beta }\}\) is selected. The parameters are specified as \(\varvec{\beta }=(1,0.5)^T\) and \(\phi =0.5\).

    • Linear model: A linear link \(\mu _i=\theta _i=\varvec{x}_i^T\varvec{\beta }\) is selected. The parameters are specified as \(\varvec{\beta }=(1,0.5)^T\) and \(\phi =0.25\).

    • Inverse-Gaussian GLM: A log-link \(\mu _i=(-2\theta _i)^{-1/2}=\exp \{\varvec{x}_i^T\varvec{\beta }\}\) is selected. The parameters are specified as \(\varvec{\beta }=(1,0.5)^T\) and \(\phi =0.1\).

  2. 2.

    Sample size: \(n=250,1000,2500,10{,}000,25{,}000\), aligning with the range of sample sizes for insurance loss data, from a few hundred data points (e.g., the Secura Re data in the ReIns package in R) to more than 10,000 data points (e.g., the French automobile insurance data in the CASdatasets package in R).

  3. 3.

    Weight functions \(W(y_i,\varvec{x}_i)\): Selected in accordance with Example 1.

  4. 4.

    Weight function hyperparameters \(\tilde{\varvec{\Psi }}\) and the number of hyperparameter sets K considered for evaluating meta Wald statistic: We propose determining the hyperparameter \({\tilde{\theta }}\) or \({\tilde{\phi }}\) by solving

    $$\begin{aligned} \frac{E_{Y,\varvec{x}}[W(Y,\varvec{x})|Y>q_{\alpha }]}{E_{Y,\varvec{x}}[W(Y,\varvec{x})]}=\delta \end{aligned}$$
    (7.1)

    for some fixed \(0<\alpha <1\) and \(0<\delta \le 1\), where \(q_{\alpha }\) is the \(\alpha \)-percentile of Y. An interpretation of the above equation is as follows: we want to choose the hyperparameter such that the average weight assigned to the extreme observations (larger than \(q_{\alpha }\)) is only \(\delta \le 1\) times the overall average weight. A smaller \(\delta \) means extreme observations are down-weighted more. Note that the above equation is easy to compute (analytically or through simulation) since the true model is known. For example, when \(\alpha =0.99\) and \(\delta =0.1\) for the Gamma GLM, solving Eq. (7.1) yields \(({\tilde{\theta }},{\tilde{\phi }})=(6.53,1)\). Note that if \(\delta =1\), the weight function is flat, and hence the proposed SWLE is equivalent to the MLE. We select \(\alpha =0.99\) and consider the following choices of K for model diagnostic purposes:

    • \(K=2\): the two sets of hyperparameters are constructed based on \(\delta =1,0.001\).

    • \(K=3\): the three sets of hyperparameters are constructed based on \(\delta =1,0.1,0.001\).

    • \(K=5\): the five sets of hyperparameters are constructed based on \(\delta =1,0.5,0.1,0.01,0.001\).

  5. Censoring and truncation: Not considered in this experiment for simplicity.
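
For concreteness, the Gamma GLM case of the data-generating step above can be sketched as follows. This is a minimal Python illustration (not the authors' code; the paper's implementation relies on standard R functions), where the Gamma distribution is parameterized with shape \(1/\phi \) and scale \(\phi \mu _i\) so that \(E[Y_i|\varvec{x}_i]=\mu _i\) and \(\text {Var}[Y_i|\varvec{x}_i]=\phi \mu _i^2\):

```python
import math
import random

random.seed(2024)

def simulate_gamma_glm(n, beta=(1.0, 0.5), phi=0.5):
    """Simulate {(y_i, x_i)} from the Gamma GLM with log link:
    x_i = (1, x_i2) with x_i2 ~ N(0, 1), mu_i = exp(x_i^T beta),
    Y_i | x_i ~ Gamma(shape = 1/phi, scale = phi * mu_i)."""
    sample = []
    for _ in range(n):
        x2 = random.gauss(0.0, 1.0)
        mu = math.exp(beta[0] + beta[1] * x2)
        y = random.gammavariate(1.0 / phi, phi * mu)
        sample.append((y, (1.0, x2)))
    return sample

data = simulate_gamma_glm(20000)
# Sanity check: the ratios y_i / mu_i should average to about 1
ratios = [y / math.exp(1.0 + 0.5 * x[1]) for y, x in data]
```

The same template covers the linear and inverse-Gaussian cases by swapping the response draw and the link.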

Each combination of data-generating model, n and K selected above results in a simulated dataset with a sample size of n. We replicate each combination \(B=500\) times to ensure a thorough investigation of the adequacy of the SWLE. We then fit each simulated dataset to the Gamma GLM, the linear model, and the inverse-Gaussian GLM, respectively. If the fitted model class matches the data-generating model, we are fitting a correct model class, and we should expect the rejection rate of the Wald-type diagnostic test presented in Sect. 5 to be low. Otherwise, the fitted model class is misspecified, and a high rejection rate is expected.

We first examine the case of correct model specification to verify the consistency of the SWLE fitted parameters. Table 1 compares the true model parameters \((\beta _1,\beta _2,\phi )\) with the fitted SWLE parameters \(({\hat{\beta }}_1,{\hat{\beta }}_2,{\hat{\phi }})\) (averaged across the \(B=500\) replications), with weight function hyperparameters selected based on Eq. (7.1) with \(\delta =1\) (which simply reduces to the MLE), \(\delta =0.1\) and \(\delta =0.01\), respectively. The SWLE fitted parameters are very close to the true values in all settings, even when the sample size is relatively small (\(n=250\)), empirically justifying the adequacy of the proposed SWLE in recovering the true model parameters.

We then analyze the performance of the meta Wald-type diagnostic tool presented in Eq. (5.5) of Sect. 5 based on the SWLE, considering both correctly specified and misspecified fitted models. Table 2 presents the rejection probabilities at the 5% significance level using the meta Wald statistic across different true (data-generating) models, fitted models, n and K. As expected, the rejection rates are almost equal to 1 in most cases when the fitted model class is misspecified. The exceptions occur when the sample size is small (\(n=250\)) enough to hinder the power of the proposed diagnostic test. Further, the rejection probabilities are mostly close to the desired level of 5% when the model is correctly specified. The exceptions occur when \(K=5\) with a small sample size (\(n<2500\)), where the rejection probabilities are inflated. An interpretation is that as K grows, the meta Wald-type statistic becomes more mathematically complicated, and hence a larger sample size is needed for convergence to the asymptotic results.

Table 1 [Simulation 1] True versus SWLE fitted parameters with correct model specifications
Table 2 [Simulation 1] Meta Wald statistic rejection probabilities at 5% significance level

7.2 Simulation 2: heavy-tail contaminated linear model

This study reveals how model contamination leads to unstable MLE estimates and how the proposed SWLE approach detects and addresses the robustness issues. In each simulation, we generate \(n=5000\) observations \(\{(y_i,\varvec{x}_i)\}_{i=1,\ldots ,n}\) with \(p=2\). \(\varvec{x}_i\) is generated from the same distribution as in the previous study. \(Y_i|\varvec{x}_i\) is simulated from a contaminated regression model with the following density function:

$$\begin{aligned} f^c(y_i;\varvec{x}_i)=(1-\epsilon )f(y_i;\varvec{x}_i,\varvec{\Psi })+\epsilon \gamma (y_i;\varvec{x}_i), \end{aligned}$$
(7.2)

where \(f(y_i;\varvec{x}_i,\varvec{\Psi })\) is chosen as a linear model with parameters \(\varvec{\Psi }=(\varvec{\beta },\phi )=(1,0.5,0.25)\), \(\epsilon \) is the contamination probability, and \(\gamma (y_i;\varvec{x}_i)\) is a contamination density function. We choose \(\epsilon =0.1\) to represent a moderate perturbation to the linear model. To ensure a comprehensive simulation study, we conducted the same analysis in two additional scenarios: one with \(\epsilon =0.02\) representing a small perturbation, and another with \(\epsilon =0.2\) representing a large perturbation. For conciseness, the detailed analysis for these additional scenarios is provided in Section F of the supplementary materials. The contamination density \(\gamma (y_i;\varvec{x}_i)\) is chosen as a scaled and translated Student’s t-distribution with 2.5 degrees of freedom, scaled and translated such that the mean is \(\mu _i=1+0.5x_{i2}\) and the variance is \(\sigma ^2=0.25\) (aligning with the linear model \(f(y_i;\varvec{x}_i,\varvec{\Psi })\)). Outliers are more prevalent in the simulated data with such heavy-tailed contamination. We ignore the censoring and truncation effects.
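
A minimal Python sketch of one draw from the contaminated density in Eq. (7.2). Two construction details are assumptions for illustration: the standardized Student's t variate is built as \(Z/\sqrt{V/\nu }\) with \(V\sim \chi ^2_{\nu }\), and the scale \(\sqrt{\sigma ^2(\nu -2)/\nu }\) makes the contamination variance equal \(\sigma ^2=0.25\):

```python
import math
import random

random.seed(7)

EPS, NU, SIGMA2 = 0.1, 2.5, 0.25   # contamination prob., t d.f., variance

def draw_contaminated(x2, beta=(1.0, 0.5)):
    """One draw of Y_i | x_i from the contaminated model in Eq. (7.2)."""
    mu = beta[0] + beta[1] * x2
    if random.random() < EPS:
        # Heavy-tailed contamination: Student's t with nu = 2.5 d.f.;
        # Var(t_nu) = nu / (nu - 2), so rescale to variance SIGMA2.
        z = random.gauss(0.0, 1.0)
        v = random.gammavariate(NU / 2.0, 2.0)   # chi-squared with nu d.f.
        t = z / math.sqrt(v / NU)
        return mu + math.sqrt(SIGMA2 * (NU - 2.0) / NU) * t
    # Main component: linear model with dispersion phi = SIGMA2.
    return random.gauss(mu, math.sqrt(SIGMA2))

pairs = []
for _ in range(50000):
    x2 = random.gauss(0.0, 1.0)
    pairs.append((x2, draw_contaminated(x2)))
# Both components share the conditional mean, so residuals centre at 0.
resid = [y - (1.0 + 0.5 * x2) for x2, y in pairs]
```

Because both mixture components share the mean \(\mu _i\) and variance \(\sigma ^2\), the contamination is visible only through the heavier tails.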

Similar to the previous simulation study, the simulation is replicated by \(B=500\) times. For each replication, the resulting simulated dataset is fitted to the linear model, using the SWLE approach and considering \(K=5\) sets of weight function hyperparameters constructed based on Eq. (7.1) with \(\delta =1,0.5,0.1,0.01,0.001\).

Table 3 presents the estimated parameters (averaged across \(B=500\) replications) and the corresponding standard errors (SE) for each of the five weight function hyperparameter sets considered. In the table, k is the hyperparameter set index. For example, when \(k=1\), the hyperparameters are selected based on \(\delta =1\), leading to a standard MLE approach. \(\delta \) is reduced as k increases, resulting in more substantial down-weighting of the outliers. The MLE approach results in a non-robust estimate of the dispersion parameter \({\hat{\phi }}\), as evidenced by an abnormally large SE. The outliers severely distort the estimated parameters under the MLE approach. This issue can be effectively mitigated by the proposed SWLE approach: at \(k=2\), the SE of \({\hat{\phi }}\) is reduced significantly from 0.023 to 0.005. On the other hand, the standard errors of the estimated parameters generally increase with k. This is natural, as more substantial down-weighting (i.e., larger k) implies that more data points are effectively discarded for model estimation purposes, leading to a higher SE. Overall, Table 3 reveals a trade-off between estimation robustness and efficiency when the data-generating model is contaminated. In this case, the choice of \(k=2\) or \(k=3\) may result in the best fitted model, as the stability of the estimated parameters is achieved without substantially inflating the SE.
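
The robustness mechanism behind this trade-off can be illustrated with a deliberately simplified, hypothetical example: a Gaussian-kernel weighted location estimate (a caricature, not the paper's full SWLE for the GLM) applied to a contaminated sample. Down-weighting tames the outliers' pull on the estimate, at the cost of effectively discarding data:

```python
import math
import random

random.seed(11)

def weighted_location(ys, c, iters=50):
    """Iteratively reweighted location estimate with Gaussian-kernel
    weights w_i = exp(-(y_i - m)^2 / (2c)); smaller c means stronger
    down-weighting of points far from the current estimate m."""
    m = sorted(ys)[len(ys) // 2]          # start from the median
    for _ in range(iters):
        w = [math.exp(-((y - m) ** 2) / (2.0 * c)) for y in ys]
        m = sum(wi * y for wi, y in zip(w, ys)) / sum(w)
    return m

# 95% clean N(0, 1) observations contaminated by 5% gross outliers near 20
ys = ([random.gauss(0.0, 1.0) for _ in range(950)]
      + [random.gauss(20.0, 1.0) for _ in range(50)])

plain = sum(ys) / len(ys)               # unweighted mean, pulled toward ~1
robust = weighted_location(ys, c=4.0)   # outliers get weight ~ exp(-50)
```

Shrinking `c` further would keep the estimate stable but raise its variance, mirroring the SE inflation observed for larger k in Table 3.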

We then perform extensive diagnostic tests on the fitted SWLE models. The meta Wald test in Eq. (5.5) rejects the linear model in 448 out of the 500 replications (89.6%), suggesting that our proposed SWLE-based diagnostic tool is quite powerful in detecting model contaminations. We further perform individual Wald tests (Eq. (5.6) for each pair of weight function hyperparameter sets \((k,k')\)) to carefully examine how the simulated dataset deviates from the fitted linear model. The left panel of Table 4 showcases the rejection rates of the individual Wald tests for each pair of hyperparameter sets \((k,k^{\prime })\). While the rejection rate is very high (0.908) when \((k,k^{\prime })=(1,2)\), it gradually decreases as k and \(k'\) increase. Since \(k=1\) represents the MLE approach, we may conclude that after reducing the influence of extreme observations (by choosing \(k\ge 2\)), the simulated data deviates less significantly from the linear model. As an example, we report the individual Wald statistics and the p-values for one representative simulation replication in the right panel of Table 4. In this case, we observe that the individual Wald tests fail to reject the linear model, provided that the outliers are already sufficiently down-weighted (with \(k,k^{\prime }\ge 2\)). As a result, there is no evidence that the simulated data systematically deviates from the linear model. Instead, the deviation is solely caused by the few outliers arising from the model perturbation.

Table 3 [Simulation 2] Mean estimated parameters and the corresponding standard errors under SWLE across five different sets of weight function hyperparameters
Table 4 [Simulation 2] Left panel: rejection rates of the individual Wald test in Eq. (5.6) for each pair of weight function hyperparameter sets. Right panel: individual Wald statistics (bottom left triangle) and the corresponding p-values (top right triangle) under a representative simulation replication

7.3 Simulation 3: GLM with varying dispersion

We also analyze a case in which the observations are generated from a GLM with varying dispersion, systematically deviating from the standard GLM. This study takes into account the effects of left truncation and right censoring. Our findings can be summarized as follows. Firstly, the proposed SWLE-based individual Wald tests strongly reject the standard GLM, demonstrating the high power of our diagnostic tests. Secondly, we can gain insights into how the standard GLM model is misspecified by carefully analyzing how the estimated model parameters change as the weight function hyperparameters under the proposed SWLE vary. This makes the proposed SWLE a valuable analytical tool for providing guidance on specifying alternative model classes that can enhance model accuracy and precision. For the sake of conciseness, all the details related to model specifications, simulation settings, fitting results, and analyses have been provided in Section E of the supplementary materials.

8 Real insurance data analysis

This section showcases the applications of the proposed SWLE fitting and diagnostic methods to two real insurance datasets: US indemnity losses and a European automobile insurance dataset. The data-generating model is unknown for a real dataset, and the data are also censored and truncated. Hence, it is challenging to determine the weight function hyperparameters by computing and solving Eq. (7.1) directly. We propose approximating Eq. (7.1) semi-analytically as follows. To begin with, defining \({\mathcal {Q}}_{\alpha }=(q_{\alpha },\infty )\cap {\mathcal {Y}}\), the numerator and denominator on the left-hand side of the equation can respectively be approximated and analytically expressed as (we refer readers to Section D.8 of the supplementary material for the derivations):

$$\begin{aligned} E_{{\mathcal {D}},\varvec{x}}[W(Y,\varvec{x})|Y>q_{\alpha }]&=E_{{\mathcal {D}},\varvec{x}}\left[ \lambda ^{*}(\varvec{\Psi };\varvec{x}) \frac{F^*({\mathcal {Q}}_{\alpha }\cap {\mathcal {T}};\varvec{x},\varvec{\Psi })}{F({\mathcal {Q}}_{\alpha }\cap {\mathcal {T}};\varvec{x},\varvec{\Psi })}\right] , \end{aligned}$$
(8.1)
$$\begin{aligned} E_{{\mathcal {D}},\varvec{x}}[W(Y,\varvec{x})]&=E_{{\mathcal {D}},\varvec{x}}\left[ \lambda ^{*}(\varvec{\Psi };\varvec{x}) \frac{F^*({\mathcal {T}};\varvec{x},\varvec{\Psi })}{F({\mathcal {T}};\varvec{x},\varvec{\Psi })}\right] . \end{aligned}$$
(8.2)

After obtaining an MLE of parameters \(\hat{\varvec{\Psi }}\), Eqs. (8.1) and (8.2) are approximated by

$$\begin{aligned} \widehat{E_{{\mathcal {D}},\varvec{x}}}[W(Y,\varvec{x})|Y>q_{\alpha }]&=\frac{1}{n}\sum _{i=1}^{n}\lambda ^{*} (\hat{\varvec{\Psi }};\varvec{x}_i)\frac{F^*({\mathcal {Q}}_{\alpha } \cap {\mathcal {T}}_i;\varvec{x}_i,\hat{\varvec{\Psi }})}{F({\mathcal {Q}}_{\alpha }\cap {\mathcal {T}}_i;\varvec{x}_i,\hat{\varvec{\Psi }})}, \end{aligned}$$
(8.3)
$$\begin{aligned} \widehat{E_{{\mathcal {D}},\varvec{x}}}[W(Y,\varvec{x})]&=\frac{1}{n}\sum _{i=1}^{n}\lambda ^{*}(\hat{\varvec{\Psi }};\varvec{x}_i) \frac{F^*({\mathcal {T}}_i;\varvec{x}_i,\hat{\varvec{\Psi }})}{F({\mathcal {T}}_i;\varvec{x}_i,\hat{\varvec{\Psi }})}. \end{aligned}$$
(8.4)

Finally, we solve the following equation, which serves as a semi-analytical approximation to Eq. (7.1), to determine the appropriate weight function hyperparameters:

$$\begin{aligned} \frac{\widehat{E_{{\mathcal {D}},\varvec{x}}}[W(Y,\varvec{x})|Y>q_{\alpha }]}{\widehat{E_{{\mathcal {D}},\varvec{x}}}[W(Y,\varvec{x})]}=\delta . \end{aligned}$$
(8.5)

8.1 European automobile insurance data

Consider a European automobile insurance dataset with \(n=10{,}032\) car damage claim losses during 2016. This dataset is also analyzed by Fung et al. (2022). The empirical (log-transformed) loss distribution is depicted in Fig. 2. Each claim is supplemented by policyholder information denoted by \(x_{i2}\) to \(x_{i11}\) and described in Table F of the supplementary material (Section G), a policy limit (right censoring point) varying greatly from 900 to 183,610 Euros, and a deductible (left truncation point) ranging from 0 to 1,000 Euros. Note that 78% of the policies have a deductible level of zero, indicating that the losses from these policyholders are never eliminated by a deductible. Also, extremely large losses are still prevalent in the dataset even though the losses are right censored, with the largest observed loss reaching 107,044 Euros (the average and median losses are just 967 and 521 Euros, respectively). As a preliminary analysis, we fit the loss amounts to the Gamma, log-normal, and inverse Gaussian distributions without incorporating covariates, deductibles, and policy limits. The goodness-of-fit is assessed by the three Q-Q plots in Figure A of the supplementary material. Both the Gamma and inverse Gaussian distributions fit the loss data poorly. The log-normal distribution fits the body of the empirical distribution decently, but the right tail seems slightly under-extrapolated. Note that including the covariates’ effects and accounting for data incompleteness may improve the goodness-of-fit for the right tail. Hence, it is reasonable to consider a log-normal GLM as a baseline benchmark model and to use the proposed SWLE approach to assess the appropriateness of the log-normal model and recommend suitable model improvements. The MLE parameters and the corresponding SE are listed in the second and third columns of Table 5.

Fig. 2 [European automobile claims] Empirical distribution of the log-transformed European car damage claim amount and weight function densities with standard (left panel) and alternative (right panel) hyperparameter settings

Considering the log-normal GLM, we first define \(Y_i\) as the log-transformed car damage loss and \(\varvec{x}_i\) as the covariate vector of length \(p=11\). Setting \(C_i\) and \(T_i\) respectively as the log-transformed policy limit and deductible, we have \(M_i=1\), \({\mathcal {U}}_i=(-\infty ,C_i]\), \({\mathcal {C}}_i={\mathcal {I}}_{iM_i}=(C_i,\infty )\), and \({\mathcal {T}}_i=(T_i,\infty )\) according to Sect. 6. As in the previous studies, we first consider \(K=5\) “standard” sets of weight function hyperparameters chosen in accordance with Example 1 by solving Eq. (8.5) with \(\alpha =0.99\) and \(\delta =1,0.5,0.1,0.01,0.001\). The five resulting weight functions are plotted against the log-transformed loss in the left panel of Fig. 2.

Our primary objective is to assess the goodness-of-fit of the lognormal GLM through employing the proposed SWLE method. If the GLM is determined to be a suitable fit for the loss data, our secondary goal is to investigate how the SWLE enhances the robustness of parameter estimation. Conversely, if the GLM is found to be unsuitable, our secondary objective is to explore how to utilize the SWLE to offer guidance for improving the model beyond the constraints of the GLM.

To do so, we first thoroughly examine the impacts of the SWLE weight functions on the estimated parameters by introducing the standardized parameter deviance residual, defined as

$$\begin{aligned} {\hat{r}}_j^{(k)}=\frac{{\hat{\Psi }}_j^{(k)} -{\hat{\Psi }}_j^{(k_0)}}{SE({\hat{\Psi }}_j^{(k_0)})},\quad k=1,\ldots ,K\;\text {and}\;j=1,\ldots ,p+1, \end{aligned}$$
(8.6)

where \({\hat{\Psi }}_j^{(k)}\) is the estimated j-th parameter of \(\varvec{\Psi }\) using the SWLE approach with the k-th set of weight function hyperparameters. \(k_0\in \{1,\ldots ,K\}\) is the weight function index selected as the benchmark, and \(SE({\hat{\Psi }}_j^{(k_0)})\) is the SE of \({\hat{\Psi }}_j^{(k_0)}\). In this case, we choose \(k_0=1\) because it represents the MLE approach. The proposed residual statistic reflects the sensitivity of the estimated parameters to the choice of weight function hyperparameters. If the loss data were indeed generated from the (lognormal) GLM, we would anticipate that the estimated parameters remain relatively stable regardless of variations in k. In this scenario, \({\widehat{r}}_j^{(k)}\) should be close to zero for any value of k. Conversely, if \({\widehat{r}}_j^{(k)}\) exhibits a systematic trend (either increasing or decreasing) as k varies, it becomes apparent that the lognormal GLM does not accurately represent the real loss data. Therefore, the analysis of \({\widehat{r}}_j^{(k)}\) serves as a valuable tool to diagnose the lognormal GLM. If the GLM is deemed inappropriate for the loss data, this analysis provides insights into the sources of model misspecification, yielding recommendations for enhancing the model.
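
Eq. (8.6) is straightforward to compute once the K sets of SWLE estimates and the benchmark standard errors are available. A minimal Python sketch with hypothetical numbers (the arrays are illustrative, not the paper's fitted values):

```python
def standardized_deviance_residuals(est, se, k0):
    """Compute r_j^(k) = (Psi_j^(k) - Psi_j^(k0)) / SE(Psi_j^(k0)),
    following Eq. (8.6).

    est[k][j]: estimate of the j-th parameter under the k-th
    hyperparameter set; se[k0][j]: its SE at the benchmark k0."""
    base, base_se = est[k0], se[k0]
    return [
        [(est[k][j] - base[j]) / base_se[j] for j in range(len(base))]
        for k in range(len(est))
    ]

# Hypothetical example with K = 3 hyperparameter sets and 2 parameters:
est = [[1.00, 0.50], [0.98, 0.54], [0.95, 0.60]]
se = [[0.02, 0.04], [0.02, 0.05], [0.03, 0.06]]
res = standardized_deviance_residuals(est, se, k0=0)
# res[0] is identically zero (the benchmark);
# res[2][1] = (0.60 - 0.50) / 0.04 = 2.5, a systematic upward drift.
```

A monotone trend in `res[k][j]` across k is exactly the diagnostic signal described above.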

Figure 3 plots \({\hat{r}}_j^{(k)}\) against k for each parameter \(j=1,\ldots ,12\) under the “standard” hyperparameter setting. The 95% confidence intervals of \({\hat{r}}_j^{(k)}\) are also constructed, appearing as the grey shaded regions in the figure. Below are some of the observations and recommendations for model improvements:

  • The residuals for the dispersion parameter \(\phi \) significantly decrease as k increases from 1 to 5, as shown in the bottom right panel of Fig. 3. This implies that when the tail observations, whether they are very large or small losses, receive reduced weight in the estimation process (by selecting a larger k), a smaller estimated \(\phi \) is obtained compared to \(k=1\) (the MLE approach). One explanation for this phenomenon is that the loss data originates from a distribution with heavier tails than the lognormal GLM assumes. Consequently, without choosing a larger k such that the weights corresponding to the extreme observations are diminished, the estimated \(\phi \) under the lognormal GLM is distorted (inflated) by the extreme observations. Hence, the residuals decrease as k increases. In other words, the log-normal model still under-extrapolates the tail-heaviness of the empirical distribution even after incorporating the effects of covariates, censoring, and truncation, so one should fit a heavier-tailed model.

  • While the MLE suggests that the car age negatively impacts the loss amounts (\({\hat{\beta }}_3<0\)), Fig. 3 reveals that the residuals associated with \(\beta _3\) (the regression coefficient for car age) exhibit a positive trend when \(k>1\). This suggests that the estimated coefficient becomes less negative as the weights concentrate more toward the central region of the loss distribution with the selection of a larger k. This implies that the influence of car age on the distribution is less negative in the body region than in the tail region. Conversely, the residuals for \(\beta _4\), \(\beta _6\) and \(\beta _8\) are significantly negative when \(k>1\), meaning that the influences of these variables are more negative (or less positive) on the body part of the loss distribution than on the tail part. As a result, one should consider modeling the heterogeneity of covariate influence across various parts of the loss distribution.

Fig. 3 [European automobile claims] Standardized parameter deviance residuals (vertical axis) versus weight function hyperparameter index k (horizontal axis) with standard hyperparameter setting

The above “standard” weight function setting weights the observations symmetrically, i.e., the losses from both tails are down-weighted. Therefore, the study above does not reveal whether the model misfit comes from the left or the right tail. Hence, one may also consider an “alternative” weight function setting, which allocates asymmetric weights to the observations from the left and right tails. Again considering \(K=5\) sets of hyperparameters, we set the weight function hyperparameters as \({\tilde{\beta }}_1^{(k)}={\hat{\mu }}_Y+0.5(k-3){\hat{\sigma }}_Y\), \({\tilde{\beta }}_j^{(k)}=0\) for \(j=2,3,\ldots ,p\) and \({\tilde{\phi }}^{(k)}={\hat{\sigma }}_Y^2\), where \({\hat{\mu }}_Y\) and \({\hat{\sigma }}_Y\) are respectively the empirical mean and standard deviation of the log-transformed losses. The resulting weight functions are plotted against the log-transformed loss in the right panel of Fig. 2. As k increases, more weight is assigned to larger losses. We choose \(k_0=3\) in Eq. (8.6), under which the maximum weight is assigned to an average loss. Figure 4 plots \({\hat{r}}_j^{(k)}\) against k for each parameter using the “alternative” weight function hyperparameter setting. The observations are as follows:

  • The residual associated with the dispersion parameter \(\phi \) is significantly positive when \(k>3\) (i.e., larger weights on larger losses) and negative when \(k<3\). Applying reasoning similar to that used in the analysis of \({\hat{r}}_j^{(k)}\) in Fig. 3, this observation suggests that the log-normal model underestimates the heaviness of the right tail while simultaneously over-extrapolating the left tail.

  • The residuals for \(\beta _4\) to \(\beta _8\) are significantly positive when \(k>3\) and negative when \(k<3\). This reflects that the influences of these variables are more positive (or less negative) on the upper quartiles of the loss distribution and vice versa for the lower quartiles. This also echoes the results obtained in Simulation study 3, where the fitted regression parameters show a clear trend as more and more weight is assigned to larger losses. Therefore, a possible model improvement is to allow for a varying dispersion in the regression model.
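
The “alternative” hyperparameter construction described above is a simple transform of the empirical moments of the log-losses. A minimal Python sketch (only \({\tilde{\beta }}_1^{(k)}\) and \({\tilde{\phi }}^{(k)}\) are returned, since the remaining \({\tilde{\beta }}_j^{(k)}\) are zero; the input log-losses are hypothetical):

```python
import statistics

def alternative_hyperparameters(log_losses, K=5):
    """Alternative setting: beta1_tilde^(k) = mu_Y + 0.5 (k - 3) sigma_Y
    and phi_tilde^(k) = sigma_Y^2, sliding the weight's centre from below
    (k < 3) to above (k > 3) the average log-loss."""
    mu = statistics.fmean(log_losses)
    sigma = statistics.stdev(log_losses)
    return [(mu + 0.5 * (k - 3) * sigma, sigma ** 2) for k in range(1, K + 1)]

# Toy illustration with hypothetical log-losses:
hp = alternative_hyperparameters([5.0, 6.0, 7.0, 8.0, 9.0])
# At k = 3 the weight's centre equals the empirical mean of the log-losses.
```

Sliding the centre in half-standard-deviation steps is what makes the weighting asymmetric across the two tails.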

For completeness, we conduct the parameter-specific meta Wald tests (Eq. (5.9)) under both “standard” and “alternative” hyperparameter settings. The resulting Wald statistics and p-values for each parameter are displayed in the four rightmost columns of Table 5. Not surprisingly, the p-values are very small for many parameters.

To sum up, it is clear that the lognormal GLM is misspecified, as evidenced by the consistent monotonic increase or decrease observed in many residuals in Figs. 3 and 4, as well as the strong rejection of the lognormal GLM by the Wald test (Table 5). This highlights the importance of considering the proposed model enhancements for the log-normal model.

Fig. 4 [European automobile claims] Standardized parameter deviance residuals (vertical axis) versus weight function hyperparameter index k (horizontal axis) with alternative hyperparameter setting

Table 5 [European automobile claims] Second and third columns: the MLE parameters and the corresponding SE

8.2 US indemnity losses

We also analyze a publicly available dataset of \(n=1500\) indemnity losses in the US, extensively studied in previous works such as Poudyal (2021a) and Punzo et al. (2018a). In this dataset, the maximum benefit (right censoring point) varies greatly among policyholders, resulting in the presence of extremely large losses even after right censoring and motivating the use of a robustification method. We employ the proposed SWLE both as a diagnostic tool to assess the appropriateness of the GLM and as a robust method to improve the accuracy and stability of the estimated GLM parameters. Our findings are summarized as follows. Firstly, the SWLE-based Wald test indicates that the log-normal distribution is, overall, a suitable model for fitting the indemnity loss data after controlling for the contamination by a few outliers. Secondly, as a robust method, the proposed SWLE enhances estimation accuracy and stability compared to the standard MLE approach after sufficiently de-emphasizing the influence of outliers. These findings closely align with the results of simulation study 2 (Sect. 7.2). For conciseness, the detailed analysis is deferred to Section H of the supplementary materials.

9 Concluding remarks

This paper introduces a score-based weighted likelihood estimation (SWLE), which incorporates weights to reduce the impact of outliers, to estimate the parameters of the GLM robustly. With a specially designed weight function, closed-form expressions are obtained for both the score function and the asymptotic covariance matrix, making it computationally appealing to estimate parameters and determine parameter uncertainties. The robustness of the SWLE is also theoretically justified by a bounded influence function (IF). Apart from robust estimation, the SWLE also serves as a diagnostic tool to quantitatively assess the overall appropriateness of fitting the GLM. We further extend the SWLE to cater to randomly censored and truncated regression data, which are prevalent in insurance losses subject to coverage modifications. The proposed tool is exemplified in three simulation studies and two real insurance datasets, revealing the usefulness of the SWLE in the following three aspects:

  • If the data-generating model is the GLM (Simulation study 1), the SWLE will provide consistent estimations;

  • If the data-generating model is the GLM contaminated by a few outliers (Simulation study 2 and US indemnity loss data), the SWLE will provide a more reliable estimate of parameters as compared to the MLE;

  • If the data-generating model deviates systematically from the GLM (Simulation study 3 and European automobile insurance data), the SWLE Wald test will detect the model misspecifications with very high power and suggest model improvements.

The SWLE is applicable not only to the GLM but also to other more complex model classes, including generalized additive models (GAM) for non-linear regression links and finite mixture models for distributional multimodality. Therefore, it is worthwhile to explore alternative weight functions such that the SWLE is computationally appealing for broader model classes. Another potential research direction is to extend the SWLE to cater to multivariate and longitudinal data. This is useful in insurance practice because insurance companies often operate multiple business lines (multivariate losses), and policyholders often have multiple years of claim history (longitudinal data). Finally, as discussed in Remark 1, the current study does not consider the effects of covariate outliers. To do so, Assumption (i) of Theorem 2 needs to be removed, and hence extra conditions on the weight function \(W(y_i,\varvec{x}_i)\) will be needed. We leave this study as a future research direction.