1 Introduction

In recent years, advancements in big data, machine learning, and artificial intelligence (AI) have profoundly reshaped the insurance industry, ushering in a new era for insurance economics. Technological advances transform various aspects of insurance, from risk assessment to customer service. For instance, the increased availability of detailed data, coupled with efficient data collection and analysis tools, enhances risk evaluation and cost estimation. This, in turn, benefits policyholders by mitigating issues like adverse selection and moral hazard. Additionally, AI contributes to improved service quality by providing better insurance services and streamlining claims management.

This paper contributes to the existing literature on big data, risk classification, and privacy considerations in insurance markets by providing a comprehensive review of the relevant research and presenting an application with respect to risk classification accounting for privacy costs. We discuss the impact of big data, machine learning, and artificial intelligence on risk classification in insurance by providing an overview of the literature on changes in the risk landscape of insurers and the implications for insurance market dynamics. Starting with seminal contributions from insurance economics, such as the work of Einav and Levin (2014), we broaden our analysis by incorporating research in other disciplines, including ethics (Steinberg 2022), law (Siegelman 2014), and medicine (Ho et al. 2020). These diverse perspectives help to provide a holistic understanding of the multifaceted implications of big data, risk classification, and privacy in insurance markets. The paper also identifies potential areas for future research, highlighting the importance of interdisciplinary collaboration between economics and fields such as law, ethics, and medicine.

Traditionally, the information advantage in insurance markets resided with insured individuals, leading to the phenomenon of adverse selection. However, recent research by Brunnermeier et al. (2022) suggests that the use of advanced data analytics allows insurers to infer statistical information, effectively reversing the information advantage and the dynamics of adverse selection. Motivated by this insight, we provide an application that focuses on risk classification from the perspective of insurance companies, rather than adopting a general equilibrium model. An insurer’s risk classification methodology and its accuracy can be improved through innovation in insurance pricing (Cather 2018). However, this process often requires large amounts of policyholder data and the permission to make use of such data. In addition to transaction costs that may arise from price innovation techniques (e.g., for data collection, storage and processing), insurers should consider that individuals may have different privacy preferences and potential policyholders may require some form of compensation for providing and allowing the use of their personal data (Regner and Riener 2017; Benndorf and Normann 2018; Gemmo et al. 2020). Increased privacy awareness and stricter regulation in many countries allow individuals to demand such compensation, and the application of innovations in insurance pricing can lead to changes in the customer base faced by insurers (Altman et al. 1998; Cather 2018; Lai et al. 2021).

We investigate the conditions under which insurers are willing to use policyholders' private data to classify risks more accurately. We develop a model that allows insurers to assess the maximum shift in demand for which they have a profit incentive to innovate in risk classification using private data. In doing so, we consider the “cost of privacy,” which plays a central role in modern insurance markets (for a review, see Hoy 2006, and Gemmo et al. 2019). We analyze how the choice of a more accurate classification method affects the decision on the optimal number of risk classes, given heterogeneous risks in the population, and provide examples for term life insurance contracts.

The remainder of this article is structured as follows. We begin by presenting our comprehensive review of the literature in Sect. 2. In Sect. 3, we continue with an application of the use of big data in insurers' risk classification. We examine how the choice of screening technology interacts with the choice of the optimal number of risk classes using examples from term life insurance. Section 4 discusses main findings and provides an outlook for future research.

2 Literature review

We conduct a comprehensive literature review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA, see Page et al. 2021) protocol to identify and categorize academic research on the use of big data in the insurance sector. The review strategy and data collection are described in Appendix 1. Based on this process, a database of 104 papers is created and key findings are extracted. The intersection of economics, business, law, ethics, and medicine has produced a rich body of literature exploring various aspects of insurance and risk management. We group all papers into four distinct areas (Table 1)Footnote 1 and shed light on the evolving landscape of insurance in the digital age, with a focus on economics and its intersections with other disciplines. To show for which parts of the research empirical results are available, Table 1 also indicates which papers are theoretical, empirical, and experimental.

Table 1 Mapping of the literature

2.1 New risks and new products

The introduction of new technologies in insurance markets has a significant impact on the frequency and severity of losses, resulting in a shift from low-severity–high-frequency to high-severity–low-frequency risks; an example is the potential tampering of self-driving cars (Eling and Lehmann 2018).Footnote 2 This transformation is driven by advancements such as automation, artificial intelligence, and interconnected systems. The increasing connectivity and interdependence of systems, especially in supply chains, coupled with the collaboration of policyholders facilitated by social networks (Albrecher et al. 2019), have introduced new risks, including cyber risk. All these developments highlight the evolving nature of risks in the digital era (Eling and Lehmann 2018; Lanfranchi and Grassi 2022). The utilization of new technologies also enables insurers to offer more personalized coverage and thereby extend the insurability of risks, particularly in the case of on-demand insurance (Braun et al. 2023). For instance, the use of big data in index insurance has the potential to facilitate the development of more effective and sustainable agricultural risk management plans (Castillo et al. 2016). Similarly, big data can be used in weather index insurance (Cesarini et al. 2021) and insurance against natural disasters (Timms et al. 2022; Charpentier et al. 2022). New statistical methods can also produce different valuations of financial data according to different characteristics of investors (Farboodi et al. 2022). Other studies discuss the application of emerging technologies, such as blockchain, to improve the insurability of liability insurance in the context of 3D printing (Faure and Li 2020). The application of new technologies can also reduce barriers for consumers to enter the insurance market (Garven 2002), thereby accelerating social inclusion (Nayak et al. 2019a).
Infantino (2022) provides an assessment of big data analytics from a European perspective, highlighting in particular legal and regulatory aspects.

An emerging concern in financial and insurance markets is the growing importance of reputational risk. This comprises concerns about unconscious discrimination, price discrimination, and the potential for negative public backlash, often referred to as “shit storms” (Fuster et al. 2019, 2022). The impact of machine learning algorithms on credit markets and mortgage lending underscores the need for careful management of reputational risk. There have been research efforts to eliminate potential discrimination in insurance pricing (Lindholm et al. 2022). While a significant amount of research studies the use of insurance for cyber security risk mitigation (Biener et al. 2015; Bodin et al. 2018; Xie et al. 2019; Doss and Narasimhan 2021), there is limited exploration of how the insurance sector responds to cyber risks and the ensuing reputational risk. Bednarz and Manwaring (2022) have highlighted that the datafication of insurers' processes can contribute to excessive data collection in the context of insurance contracts, with significant risks of consumer harm, particularly in terms of discrimination, exclusion, and unaffordability of insurance. Unconscious discrimination can potentially disrupt traditional characteristics of distribution and solidarity, for example, in health insurance (McFall 2019). Sociological research also examines the pricing of risks and the politics of classification in insurance and credit markets, highlighting the importance of reputation management within the insurance industry (Krippner and Hirschman 2022).

2.2 Better/more information on policyholder behavior

The use of big data and technology in insurance markets has a significant impact on the information landscape. The application of data analytics and data mining in various domains has improved the ability of insurance companies to accurately price policies (Bohnert et al. 2019; Hassani et al. 2020). This improved accuracy in risk classification is due to an increased number of observations and the inclusion of new variables in the analysis (Che et al. 2022). For example, in car insurance, the use of telematic boxes makes it possible to measure acceleration and braking behavior, which can be correlated with the likelihood of accidents. AI and machine learning techniques can reveal hidden patterns and relationships within large data sets, identifying new variables relevant to risk classification (Brunnermeier et al. 2022). Technology-based data collection can be used to uncover risk determinants and make self-protection more effective (Li and Peter 2021). Baecke and Bocca (2017) claim that including telematic variables significantly improves the accuracy of policyholders’ risk assessment. While Geyer et al. (2020) find private information to more strongly affect the bonus-malus division, they find no evidence of it affecting the policyholders’ ex ante contract choice. Brunnermeier et al. (2022) also discuss the dangers of market concentration posed by the emergence of big data (including the rise of data brokers), emphasizing the importance of consumer activism and regulatory tolerance. As noted by McFall et al. (2020), the adoption of big data analytics in insurance is transforming how risk is governed, managed, and priced within the industry. Eling and Kraft (2020) provide an extended review of the literature on the use of telematics in insurance and discuss its impact on insurability.Footnote 3

In life, health, and long-term care insurance, the information that could be used to categorize risk includes medical tests, medical history, etc., which are considered especially sensitive. It is held that insurance discourages (prospective) policyholders from taking diagnostic tests as these tests might reveal information that leads to uninsurability (Doherty and Posey 1998). Doherty and Posey (1998) show that when linked to a treatment option, testing is encouraged when both test results and information status are restricted. In this context, changes in risk classification give rise to discussions about ethical and legal limits on the use of data, such as the debates on genetic testing or the unisex debate (Hoy and Polborn 2000; Thiery and van Schoubroeck 2006; Liukko 2010; Rothstein 2015; Bélisle-Pipon et al. 2019; Nill et al. 2019; Posey and Thistle 2021). For instance, Hoy and Ruse (2005) emphasize that the debate over whether insurance companies should be allowed to use genetic test results for underwriting purposes must be seen in the broader context of the genetic testing debate.

A sizable part of the literature on risk classification focuses on the effects of risk categorization on welfare. Hoy (1982) presents the implications of incorrectly categorizing risk on welfare. Hoy (1984) shows that categorization might lead to an increase in wealth inequality, while it reduces (on average) the unfavorable price discrimination against low risks. In addition, Hoy (2006) discusses the effect of a regulatory framework that restricts the use of certain information by insurers in rate making. He derives conditions under which regulation is explicitly welfare-enhancing or welfare-detrimental. Filipova (2006, 2007) and Filipova-Neumann and Welzel (2010) study the welfare effect of introducing insurance contracts that involve the possibility of some form of tracking data access and argue that some degree of monitoring could increase welfare. Rothschild (2011) and Dionne and Rothschild (2014) emphasize that bans on using certain information to categorize risk are sub-optimal and that alternative insurance contracts should be considered. Crocker and Zhu (2021) and Pram (2021) find that utilizing a voluntary imperfectly informative test to classify risks is more efficient than not utilizing the test or making it compulsory. This result is based on the assumption that (prospective) policyholders do not know the outcome of the test ex ante. Jin and Vasserman (2021) present empirical evidence of both self-selection into monitoring and behavioral change in car insurance. They argue that monitoring generates large profits and welfare gains, but that demand frictions and policies restricting firms' ownership of collected data erode these gains.Footnote 4

Technological advancements also offer the possibility of addressing moral hazard by implementing almost perfect screening mechanisms (Jin and Vasserman 2021; Holzapfel et al. 2023), such as telematics-based systems (see Paefgen et al. 2013; Keller and Transchel 2016; Balasubramanian et al. 2018). The integration of behavior-based personalized insurance can serve as an incentive for policyholders to engage in “low-risk behavior” (Meyers and van Hoyweghen 2018). Furthermore, the application of AI and machine learning algorithms can significantly enhance the detection of insurance fraud (Bologa et al. 2013; Saldamli et al. 2020). These advancements underscore the transformative potential of technology in mitigating moral hazard and improving the efficiency and effectiveness of the insurance industry. Einav et al. (2016) highlight the economic content of risk scores, providing insights into the implications of risk assessment models on insurance markets. They find that risk scores confound underlying health and endogenous expenditure responses to insurance; even when individuals have different behavioral responses to contracts, strategic motivations for cream-skimming can persist in situations with “perfect” risk scoring within a given contract.

2.3 Better risk (type) information

Another strand of literature in the field of risk classification studies its implications for information asymmetry and adverse selection. On the one hand, Bond and Crocker (1991) argue that using endogenous categorization (classifying risks based on voluntary consumption of products that are related to the underlying loss) leads to a more efficient allocation by partly mitigating information asymmetries. Crocker and Snow (2000) add to the topic by highlighting the costs of classification risk, which depend on whether insurance markets with symmetric or asymmetric information are considered. On the other hand, Thomas (2007) emphasizes the negative effects of risk classification and argues for a socially optimal level of adverse selection. Cather (2018) shows that innovation in risk classification methods leads to cream-skimming and pushes other insurers in the market to adopt them at a very fast pace. Browne and Kamiya (2012) study the demand for underwriting and how the cost and accuracy of categorizing tests affect it.

The utilization of big data and advanced technology in the insurance market has not only impacted risk classification, but also changed the concept of adverse selection, resulting in reverse selection dynamics (Filipova-Neumann and Welzel 2010; Cather 2018; Eling et al. 2022). As early as 1976, Rothschild and Stiglitz showed that one way to deal with adverse selection is to distinguish high-risk and low-risk individuals, thereby establishing a separating equilibrium. Brunnermeier et al. (2022) point out that insurance companies transfer information advantages from the insured to the insurance company by inferring statistical information, that is, the reversal of adverse selection.Footnote 5 Braun et al. (2023) show that the heterogeneity of policyholders in terms of claim amounts and claim frequency can be better exploited through on-demand contracts, which allow for better screening of the policyholder type. Furthermore, telematics can be beneficial for high-risk individuals as a condition of insurability, effectively mitigating selection problems (Guillen et al. 2019; Eling and Kraft 2020; Fang et al. 2020; She et al. 2022). Jeanningros and McFall (2020) investigate the value of sharing data in a life and health insurance company, also highlighting the role of branding and behavior in insurance markets. These transformations in insurance markets have given rise to ethical and legal considerations regarding data usage, contributing to the ongoing discourse on the topic (Palmer 2006; Kiviat 2019; Steinberg 2022; Krippner 2023; Południak-Gierz and Tereszkiewicz 2023). For instance, concerns have been raised about the potential overuse of medical data by insurance companies and its potential impact on medical advancements (Blasimme et al. 2019). Some studies argue that insurance companies collect customer data through wearable devices and other means, resulting in consumers relinquishing power and control over the data generated from their activities (Gidaris 2019). 
The ethical implications of data-driven business models are also examined by Breidbach and Maglio (2020), who analyze accountable algorithms and the ethical considerations associated with their use. They highlight the need for transparency and fairness in algorithmic decision-making. Similarly, Ciborra (2006) explores the ethical dimensions of risk and digital technologies, highlighting that digital tools are both the infrastructure of the risk industry and the source of new, often unpredictable, risks. In particular, Liu (2023) found that AI-generated demand information reduces sales agents’ own information acquisition and increases adverse selection; agents using AI attract riskier consumers and do not match them to more expensive products to achieve stronger incentive compatibility.

2.4 Privacy, ethical concerns, and legal challenges

The utilization of new technologies has instigated shifts in risk perception and raised concerns regarding privacy and transparency within insurance markets (Gemmo et al. 2019). Several authors have tried to identify the characteristics that influence consumers' willingness to share personal data, be it the type of information, the characteristics of the company collecting and using the data, the purpose of use, or the consumers' own characteristics (Phelps et al. 2000; Rohm and Milne 2004; Milne et al. 2004; Pew Research Center 2014; Acquisti et al. 2016; Benndorf and Normann 2018). Farrell (2012) proposes a model that regards privacy as a final good whose optimal level can be chosen efficiently, whereas Kehr et al. (2015) suggest that behavioral biases affect the privacy valuation. The literature empirically observes differences between the willingness to sell private data and the willingness to buy privacy protection (Phelps et al. 2000; Milne et al. 2004). Additional studies explore issues that may arise from the application of big data in the insurance industry, such as the transformation of fairness connotations (Barry 2020), the dynamics between individuals and groups (McFall and Moor 2018), and the enhancement of privacy protection laws for consumers (Soyer 2022), among others. Related to this, Strohmenger and Wambach (2000) and Hoy and Ruse (2005) provide a discussion of the arguments for and against the use of genetic testing in insurance rating. Some studies argue that behavior-based insurance can exploit the insured (Tanninen 2020) and potentially compromise the autonomy of policyholders (Tanninen et al. 2022).

Some papers study the willingness to share data with insurance firms specifically.Footnote 6 Wiegard and Breitner (2019) consider wearable technologies and suggest that privacy concerns are the main hurdle for pay-as-you-live insurance contract adaptation. Blakesley and Yallop (2019) conduct a similar study on the UK insurance market. They emphasize that insurance firms should establish ethical standards above the legal requirements for data-driven insurance contracts to achieve wider consumer adoption against a ‘fair’ incentive. Gemmo et al. (2019) consider an insurance market framework with asymmetric information, and show how the existence of policyholders’ privacy concerns can affect market equilibria and social welfare. The authors find that information disclosure can lead to a Pareto improvement of social welfare (even in the presence of privacy costs), although it can also decrease or eliminate cross-subsidies. Striking a balance between the desire for privacy and the necessity to mitigate risks presents challenges for individuals and the industry alike (Biener et al. 2020). The evolving risk landscape and the increased availability of data can impact the market structure (Gemmo et al. 2019), while also giving rise to both ethical and regulatory considerations (Blakesley and Yallop 2020; Loi et al. 2022). This also raises the question of how to reconcile consumers' perceived privacy risks with their own welfare, which is influenced by various factors (Wiegard and Breitner 2019; Lünich and Starke 2021).

All these developments in the risk landscape underscore the profound impact of technology on the insurance industry, requiring careful consideration of risk management strategies and the establishment of regulatory frameworks to effectively address the challenges and ensure the existence of fair and sustainable insurance markets. The inverse selection dynamics, that is, the transfer of information advantages from the insured to the insurer, open a broad area of future research that revisits results from (standard and non-standard) models with asymmetric information, in which the informational advantage has been on the side of the policyholder. In the subsequent section, we provide an example of a framework that considers the firm's perspective in deciding whether and to what extent to implement new data-driven technologies. Our application analyzes the decision of an insurance company to choose the risk classification system that maximizes its expected profit. We consider the privacy implications of using private data in the insurer's decision-making process by relating the willingness of (potential) policyholders to share private information to their willingness to pay for insurance. With this we connect two fundamental parts of the literature review: more accurate risk classification (third part) and reduction in demand from privacy concerns (fourth part).

3 Application: the optimal risk classification system from the insurer’s perspective

In Gatzert et al. (2012), different forms of substandard annuities are presented and the challenges of the underwriting process in insurance practice are identified. In a theoretical model, a risk classification system for substandard annuities is derived assuming that the insurer wants to maximize its expected underwriting profits and that risk classification is costly. In addition, the model includes the cost of an inappropriate risk assessment (causing underwriting risk) that occurs when policyholders are assigned to inappropriate risk classes. Specifically, such inadequate risk assessment is modeled by assuming error probabilities for misclassifying policyholders into a lower risk class, thereby understating expected indemnity payments.

We aim to contribute to the existing body of knowledge by linking the willingness of (prospective) policyholders to provide private information for risk classification purposes to their willingness to pay for insurance. We combine the analysis of an insurer's optimal risk classification strategy with considerations of policyholders' privacy preferences. Given the developments toward better risk predictability—although the debate on the welfare effect and the best regulatory framework is still ongoing—we find it interesting to examine the decision to implement new technologies for risk classification purposes from the firm's perspective. We add to this area of research by analyzing an insurer's underwriting decision process with respect to offering policies that require policyholders to share private data in exchange for some compensation. The additional data could allow insurers to classify risks more accurately.

In the absence of classification costs and under full information, it would be optimal for the insurer to classify each subpopulation of policyholders with equal risk into a separate group (Gatzert et al. 2012). In practice, the classification process involves transaction costs, and the information available to insurers to identify which risk subpopulation a (prospective) policyholder belongs to is not perfect. Therefore, a decision must be made regarding the optimal classification system.

We analyze the choices an insurer faces when implementing new screening techniques that can improve risk classification. To that end, we take the position of an insurance company that seeks to maximize its expected underwriting profit. We provide a framework that an insurer could use to decide whether and to what extent they should invest in pricing innovation using new technologies and big data analyses. Moreover, we revisit the problem of choosing the optimal risk classification system presented in Gatzert et al. (2012) and extend it to incorporate the conditions under which innovation in risk classification methods is profitable and how the optimal classification system changes with it.

We build an application for term life insurance business. Typically, the insurer could group policyholders into different risk classes based on estimates of their mortality risk with a certain classification error. Based on the heterogeneity of the underlying population and their price-demand characteristics, the insurer would choose to offer a profit-maximizing number of classes (Gatzert et al. 2012). Studies suggest that biological ageFootnote 7 serves as a good predictor of age-related diseases and mortality risk (Horvath 2013; Putin et al. 2016; Huang et al. 2017; Milevsky 2020a; Wu et al. 2021). Therefore, requesting policyholders’ data necessary to calculate their biological age can lead to improved accuracy in the classification into risk classes. However, the requirement to share personal data is expected to affect the price–demand characteristics, because the updated policy embeds both term life coverage and trading personal data. This way, the decision of the insurer on whether to use risk class indicators, such as biological age, affects their decision on the optimal number of risk classes to offer. In the following sections, we set up a framework for profit-maximizing insurers to navigate through these decisions.

3.1 General procedure of selecting a classification system to maximize the expected profit

We lay out a framework to analyze the insurer’s problem of selecting the classification system that maximizes its expected underwriting profit. The proposed framework focuses on two choice variables, namely, the number of risk classes offered, and the probability of misclassification. To focus on the interaction of these two choice variables, we make some simplifying assumptions:

  (i) We assume that risks are purely unsystematic and hence, the owners of the insurance company can (fully) diversify them. Therefore, our choice of the “optimal” classification system refers to the classification system that maximizes the insurer’s expected profit.Footnote 8

  (ii) We assume no other risk source beside the policyholders’ claim distributions.

  (iii) We set the riskless rate of return in our two-points-in-time-model to zero.

  (iv) We assume that the insurer cannot go into default within the timeframe of our model.

  (v) We assume that the insurer faces a downward-sloping, linear demand function in each of the subpopulations.

  (vi) We consider no general administrative or agency costs.

Since potential policyholders within a group are homogenous in terms of risk, the actuarially fair premium (per contract) based on the expected claims is identical. This is reflected in a constant (parallel) line of marginal costs. Moreover, the insurer faces a risk of misclassification because the policyholder’s risk type is not fully transparent. The insurer does not have full information regarding the risk group to which the potential policyholder belongs; it can only infer potential policyholders’ risk groups based on the information it is allowed to gather from them and its internal risk evaluation models. Misclassification can generate a loss (or at least a deviation from the maximal profit attainable) to the insurer if, for example, a high-risk individual is categorized as low risk.Footnote 9 Furthermore, potential policyholders are also not fully aware of their risk group. Potential policyholders could have access to their data when performing given tests, but they would not have access to the data-extensive risk evaluation models that the insurer uses. Therefore, while the potential policyholder may have an indication of the risk group to which she belongs, that self-assessment will not always be correct.
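As a minimal worked sketch of this setting, consider a single risk subpopulation s under assumptions (v) and (vi) with perfect classification. Write \({N}_{s}\) for the number of policyholders with a positive willingness to pay, \({P}_{s}^{{\text{R}}}\) for the maximal reservation price, and \({P}_{s}^{{\text{A}}}\) for the actuarially fair premium. The specific linear form of the inverse demand \({f}_{s}(q)\) and the starred quantities are notation assumed here for illustration only; they are not taken from the original model and ignore classification costs and misclassification.

\[ {f}_{s}(q) = {P}_{s}^{{\text{R}}}\left(1 - \frac{q}{{N}_{s}}\right), \qquad {\Pi }_{s}(q) = \left({f}_{s}(q) - {P}_{s}^{{\text{A}}}\right) q \]

\[ \frac{d{\Pi }_{s}}{dq} = {P}_{s}^{{\text{R}}} - {P}_{s}^{{\text{A}}} - \frac{2{P}_{s}^{{\text{R}}}}{{N}_{s}}\, q = 0 \;\;\Rightarrow\;\; {q}_{s}^{*} = \frac{{N}_{s}\left({P}_{s}^{{\text{R}}} - {P}_{s}^{{\text{A}}}\right)}{2{P}_{s}^{{\text{R}}}}, \qquad {P}_{s}^{*} = \frac{{P}_{s}^{{\text{R}}} + {P}_{s}^{{\text{A}}}}{2} \]

\[ {\Pi }_{s}^{*} = \left({P}_{s}^{*} - {P}_{s}^{{\text{A}}}\right) {q}_{s}^{*} = \frac{{N}_{s}\left({P}_{s}^{{\text{R}}} - {P}_{s}^{{\text{A}}}\right)^{2}}{4{P}_{s}^{{\text{R}}}} \]

Under these assumptions, the attainable profit in a group grows with the square of the gap between the maximal reservation price and the actuarially fair premium, which is why a reduction in willingness to pay can weigh heavily against gains in classification accuracy.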

It is assumed that introducing new screening techniques that require policyholders to provide additional personal data will, on the one hand, result in a lower or equal reservation price for the new policyFootnote 10 from the potential policyholder’s perspective (Regner and Riener 2017; Benndorf and Normann 2018; Gemmo et al. 2020). On the other hand, new screening techniques are expected to improve the accuracy of the classification system, giving the insurer the possibility to identify more accurately the risk group to which the potential policyholders belong (Baecke and Bocca 2017; Verbelen et al. 2018; Geyer et al. 2020). In addition, an insurer faces the possibility of losing or acquiring policyholders to/from competitors who do not offer the product with the same accuracy in risk classification.

We work through the problem of analyzing the trade-off between the effects of the new classification technique by first constraining the expected impact of the private data requirement on the demand curve. Second, we describe the general decision algorithm of an insurer faced with an underlying population composed of n risk groups and choosing the classification system that maximizes its expected profit. In the decision algorithm, we analyze how the choice of screening technology interacts with the choice of the number of risk classes to offer. In addition, an illustrative numerical application using data from the German term life insurance market is provided.

The choice of the classification system refers to the simultaneous choice of the number of risk classes and the classification method, where the latter may or may not require the use of private data. The choice of whether to require the use of private data determines the accuracy of the classification system. The insurer is constrained in this choice to the extent that, with full accuracy, a given (estimated) downward shift in the demand curve is expected to occur. The demand curve for the insurance policy that requires the use of private data and guarantees full accuracy in classification differs from the demand for the initial policy for two main reasons. First, policyholders require compensation for the additional personal information that they need to provide. Thereby, their willingness to pay for the new policy changes. To determine how this change is reflected in a new demand curve, we need to consider the relationship between willingness to pay for insurance and privacy concerns (translated into a required deduction). While there is, to the best of our knowledge, no conclusive research on this relationship, the fact that they are both influenced by a very similar set of consumer characteristics suggests that the two might not be independent (Bansal et al. 2010). We assume the willingness to pay for insurance to be negatively related to the willingness to share private data and to grant permission for their use in the risk classification process.Footnote 11 In this case, the demand curve will shift downwards by more for those who have a higher willingness to pay for insurance. One way to interpret this would be to consider willingness to pay for insurance as driven by the degree of risk aversionFootnote 12 and to think of more risk-averse policyholders as more prone to assessing private data as sensitive.
This would lead policyholders with a higher degree of risk aversion to request a higher deduction to compensate for the disutilityFootnote 13 caused by sharing the required private data.Footnote 14

Second, the change in the risk classification methodology would affect the insurer’s competitiveness in the market, provided that the insurer has proprietary rights over the new methodology and the additional data collected. This will result in the loss of some potential clients and the acquisition of others. The magnitude of this effect depends on the timing of implementation of the new risk classification technology relative to competitors. If the insurer is among the first movers, potential policyholders who assess themselves as belonging to the lower risk groups have an incentive to switch from competitors to the insurer applying the new classification system, whereas potential policyholders who assess themselves as belonging to the higher risk groups have an incentive to switch to other “traditional” providers. Literature on adverse retention suggests that low-risk policyholders are more likely to switch providers (Altman et al. 1998; Lai et al. 2021). However, there is also evidence suggesting that the first-mover advantage is minor (Reimers and Shiller 2018), and therefore it is fair to assume that either type of shift is limited. If, by contrast, the more accurate risk classification methodology has already been implemented by many competitors, the expected effect lies mainly in retaining policyholders from lower risk groups.

We assume that within a risk group, there is no interrelation between the willingness to pay for the new insurance policy and the predisposition to change providers. To the best of our knowledge, there is no research indicating otherwise. To pin down the shape of the demand shift, we assume that policyholders with a willingness to pay for insurance equal to zero require no compensation for sharing private data. Therefore, the willingness to pay for the new product does not become negative. Figure 1 depicts, in a given risk group, the demand for the new insurance product that uses private data to achieve fully accurate risk classification, versus the demand for the insurance product that uses a conventional risk classification methodology that does not require private data. We denote by \({N}_{s}\) the total number of policyholders with a positive willingness to pay for insurance in risk subpopulation s, by \({P}_{s}^{{\text{R}}}\) the maximal reservation price of policyholders in risk subpopulation s, and by \({P}_{s}^{{\text{A}}}\) the actuarially fair premium for policyholders in risk subpopulation s. We denote by \(1- {\alpha }_{1}\) the percentage discount that the policyholder with the highest willingness to pay for insurance requires, leading to a new highest willingness to pay of \({\alpha }_{1}{P}_{s}^{{\text{R}}}\). As the willingness to pay for insurance and the discount required for giving up private data are positively related, the new demand curve is obtained by multiplying the initial one by a coefficient \(0<{\alpha }_{1}<1\). The dark blue line in Fig. 1 depicts the shift in demand driven solely by the change in willingness to pay for insurance of the initial policyholder base, that is, the shift that would occur if no client migration were expected. The parallel shifts in demand driven by the effect on market competitiveness are described by the coefficients \({\beta }_{{\text{s}}}\). Note that the parallel shift depicted by the dashed line is only illustrative, and the actual shift could be an expansion or a contraction.
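The two demand effects can be sketched in a few lines of Python. This is only an illustration: it assumes the linear price–demand function used in the numerical applications, the function names are hypothetical, and treating \({\beta }_{{\text{s}}}\) as a rescaling of the number of prospective policyholders is an assumption of the sketch (Fig. 1 draws the migration effect as a parallel shift).

```python
def wtp_default(n, n_s, p_r):
    """Willingness to pay of the n-th policyholder under linear demand.

    n_s: number of policyholders with positive willingness to pay (N_s)
    p_r: maximal reservation price in the subpopulation (P_s^R)
    """
    return max(0.0, p_r * (1.0 - n / n_s))


def wtp_private_data(n, n_s, p_r, alpha_1, beta_s=1.0):
    """Demand for the policy that requires private data.

    alpha_1 scales every reservation price, so policyholders with a higher
    willingness to pay give up more in absolute terms, and a policyholder
    with zero willingness to pay requires no compensation.
    Assumption of this sketch: beta_s rescales the client base (migration).
    """
    return max(0.0, alpha_1 * p_r * (1.0 - n / (beta_s * n_s)))


# Required deduction at several points of the demand curve (illustrative values).
for n in (0, 50, 100):
    before = wtp_default(n, 100, 3500)
    after = wtp_private_data(n, 100, 3500, alpha_1=0.95)
    print(n, before, after, before - after)
```

The loop shows the deduction shrinking as the willingness to pay falls, which is the multiplicative shift described above.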

Fig. 1

Demand curve shift in a given risk group when using private data in risk classification. This figure illustrates the cumulative shift in demand in a given risk group, accounting for the change in demand driven directly by the data requirement within the client base of the insurer as well as the additional shift caused by the exchange of potential policyholders in a risk group among insurers

We build on the model set forth by Gatzert et al. (2012) by attaching a binary choice of misclassification probability to the classification system. The insurer can either rely on given information/data constraints and classify risks with a certain misclassification probability or opt for an innovative classification method that attains full classification accuracy. The latter is possible only when requesting prospective policyholders to provide certain private information and to consent to its use. As discussed, linking the insurance policy to the request for use of private data alters demand, both through the compensation for privacy concerns associated with sharing private data and through effects on competitiveness. We lay out an optimization procedure for the insurer to choose the classification system that maximizes the expected profit and analyze how the given variables affect this decision. We examine the incentives of insurers, in terms of increased expected profit, to innovate in the risk classification space.

3.2 Risk classification framework

We consider S heterogeneous subpopulations that contain \({N}_{s}\) policyholders, with \(s\in \left\{1,\dots ,S\right\}\), who are homogeneous with respect to the expected claim payment for a certain type of insurance policy.Footnote 15 For example, in the case of term life insurance, we can imagine the overall population of prospective policyholders formed by subpopulations with the same life expectancy.Footnote 16 Subpopulations are characterized by their cost function as well as their price–demand function. Since we consider policyholders within a subpopulation homogeneous in terms of expected claim payment, the respective average cost (and marginal cost) function will be constant and equal to the expected claim payment per policyholder in the subpopulation, denoted by \({P}_{s}^{{\text{A}}}\).

A classification system m is any grouping of all subpopulations into \({I}_{m}\) risk classes. In this setting, when ranking risk subpopulations in decreasing order, only adjacent subpopulations can be grouped into the same risk class. Moreover, we assume that the number of subpopulations per risk class is equal among risk classes when possible; otherwise, higher risk classes include one more subpopulation than lower risk classes. The problem then consists of finding the optimal number of risk classes, between 1 (grouping all subpopulations together) and S (putting each subpopulation in a separate class). Risk classes will also be characterized by their cost function and price–demand function. In the cases in which a risk class contains more than one subpopulation, its cost and price–demand functions will be aggregated functions of the cost and price–demand functions of the contained subpopulations.Footnote 17 In the case of S risk classes, the cost and price–demand functions of each risk class will be the same as those of the corresponding subpopulation.
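The grouping rule can be made concrete with a short Python sketch. The function name is hypothetical, subpopulations are assumed to be indexed from 0 (highest risk) to S − 1 (lowest risk), and giving the extra subpopulations to the highest-risk classes first is one reading of the rule stated above.

```python
def partition(S, I_m):
    """Group S ranked subpopulations into I_m classes of adjacent members.

    Class sizes are equal when S is divisible by I_m; otherwise the
    highest-risk classes (listed first) contain one more subpopulation.
    """
    if not 1 <= I_m <= S:
        raise ValueError("need 1 <= I_m <= S")
    base, extra = divmod(S, I_m)
    classes, start = [], 0
    for l in range(I_m):
        size = base + (1 if l < extra else 0)
        classes.append(list(range(start, start + size)))
        start += size
    return classes


print(partition(5, 2))                         # two classes for five subpopulations
print([len(c) for c in partition(70, 23)])     # class sizes for S = 70, I_m = 23
```

Because the sizes are fixed by the rule, a classification system is determined by the number of classes alone, which is why the search below runs over \({I}_{m}\) only.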

A classification system \(m\) with \({I}_{m}\) risk classes will also be characterized by classification costs, denoted by C, and the probability of misclassification, denoted by \({\text{ur}}\). We assume classification costs to be proportional to the number of additional risk classes and model them as \(C=c({I}_{m}-1)\), where \(c\in {\mathbb{R}}_{0}^{+}\). For simplicity, we assume that policyholders are only misclassified to adjacent risk classes and that the probability of misclassification is equal in either direction. This implies that the highest and lowest-risk classes will have a lower total probability of misclassification, as they only have one adjacent risk class. Furthermore, in this setting, the misclassification probability can only take two possible values: \(r>0\), when using the default classification methodology, or 0, when using an innovative classification methodology that employs private data. The innovative classification methodology is associated with a shift in the price–demand curve in each risk subpopulation. Hence, a classification system can be fully defined by the number of risk classes, the accuracy of classification (reflected in the potential use of private data), and classification costs. The classification system can be denoted by \(m\left\{{I}_{m},{\text{ur}}\left({\alpha }_{1},{\beta }^{S}\right),C({I}_{m})\right\},\)Footnote 18 where \({\beta }^{S}\) is a vector of length \(S\) containing the coefficient of expansion or contraction of the client base in each risk subpopulation. For simplicity, we will refer to this only as a classification system m.

Within a classification system, risk classes are characterized by their cost function and price–demand function. We will denote by \({{\text{MC}}}^{l}\left(n\right)\) the marginal cost function of risk class \(l\), where \(l\in \left\{1,\dots ,{I}_{m}\right\}\), and by \({{\text{WTP}}}^{l}\left(n\right)\) its price–demand function, that is, its willingness-to-pay function. These are functions of the number of risks in the risk class and, where the risk class contains more than one subpopulation, are obtained by aggregating the corresponding functions of the subpopulations. In what follows, we will omit the \(l\) superscript as well as the n and, for simplicity, write \(l\left\{{\text{MC}}, {\text{WTP}}\right\}\) to refer to a risk class. Having a defined demand and cost function, each risk class will also have a profit function. Apart from the WTP and MC functions, this profit function depends, if the probability of misclassification is positive, on the probability of misclassification of policyholders from that risk class into adjacent risk classes and on the price the misclassified policyholders are offered in the “incorrect” (adjacent) risk class.
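As a worked special case, consider a risk class that contains a single subpopulation s and has no misclassification. The closed forms below assume the linear price–demand function used in the numerical applications, \({\text{WTP}}(n)={P}_{s}^{{\text{R}}}\left(1-n/{N}_{s}\right)\), and constant marginal cost \({P}_{s}^{{\text{A}}}\); they are a sketch under these assumptions, not a general result.

\[
\pi (n)=n\left({P}_{s}^{{\text{R}}}\left(1-\frac{n}{{N}_{s}}\right)-{P}_{s}^{{\text{A}}}\right),\qquad
\pi {\prime}(n)={P}_{s}^{{\text{R}}}-\frac{2{P}_{s}^{{\text{R}}}n}{{N}_{s}}-{P}_{s}^{{\text{A}}}=0
\]

\[
{n}^{*}=\frac{{N}_{s}\left({P}_{s}^{{\text{R}}}-{P}_{s}^{{\text{A}}}\right)}{2{P}_{s}^{{\text{R}}}},\qquad
{p}^{*}=\frac{{P}_{s}^{{\text{R}}}+{P}_{s}^{{\text{A}}}}{2},\qquad
{\pi }^{*}={n}^{*}\left({p}^{*}-{P}_{s}^{{\text{A}}}\right)=\frac{{N}_{s}{\left({P}_{s}^{{\text{R}}}-{P}_{s}^{{\text{A}}}\right)}^{2}}{4{P}_{s}^{{\text{R}}}}
\]

For example, with \({N}_{s}=100\), \({P}_{s}^{{\text{R}}}=3500\), and \({P}_{s}^{{\text{A}}}=2800\), this gives \({p}^{*}=3150\), \({n}^{*}=10\), and \({\pi }^{*}=3500\). If the use of private data scales reservation prices by \({\alpha }_{1}\) and (as a further assumption) the client base by \({\beta }_{{\text{s}}}\), the same steps give \({\pi }^{*}={\beta }_{{\text{s}}}{N}_{s}{\left({\alpha }_{1}{P}_{s}^{{\text{R}}}-{P}_{s}^{{\text{A}}}\right)}^{2}/\left(4{\alpha }_{1}{P}_{s}^{{\text{R}}}\right)\), provided that \({\alpha }_{1}{P}_{s}^{{\text{R}}}>{P}_{s}^{{\text{A}}}\).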

In this setup, the problem of finding the optimal, that is, the profit-maximizing classification system from the insurer’s perspective can be broken down into several steps:

  1. Given the classification system \(m\left\{{I}_{m},{\text{ur}}\left({\alpha }_{1},{\beta }^{S}\right),C({I}_{m})\right\}\) with \({I}_{m}\) risk classes, classification cost \(C\), and a classification methodology that either makes use of private data or not and translates into a combination of misclassification probability and demand shift, calculate the overall profit \({\pi }_{m}\) based on one of the procedures below, as appropriate:

    (i) In the case of the default classification methodology that results in a misclassification probability \({\text{ur}}=r>0\) (and \({\alpha }_{1}=1\), \({\beta }^{S}={1}^{S}\)):

      (1) Rank subpopulations by riskiness (highest to lowest) and separate all subpopulations into risk classes \(l\), where \(l=\left\{1,\dots ,{I}_{m}\right\}\) (\(l\)= 1 highest risk, \(l={I}_{m}\) lowest risk).

      (2) For each risk class \(l\), calculate the price–demand and cost functions by aggregating the corresponding functions of the subpopulations that it contains, and find the profit-maximizing price–demand combination \({p}_{l}^{{\text{d}}*}\) and \({n}_{l}^{{\text{d}}*}\). From here on, the superscript d refers to variables under the default classification methodology.Footnote 19

      (3) For each risk class \(l\), calculate the additional demand created by incorrectly offering the policyholders belonging to it the optimal prices of the adjacent risk classes, \({p}_{l-1}^{{\text{d}}*}\) and/or \({p}_{l+1}^{{\text{d}}*}\); we denote these demands by \({n}_{l-1,l}\) and \({n}_{l,l+1}\), respectively.Footnote 20

      (4) Calculate the maximal expected profit for each risk class and then adjust for misclassification as \({\widetilde{\pi }}_{l}^{{\text{d}}}=r\left({n}_{l-1,l}\left({p}_{l-1}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)+\left(1-2r\right)\left({n}_{l}^{{\text{d}}*}\left({p}_{l}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)+r\left({n}_{l,l+1}\left({p}_{l+1}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)\) when the risk class has two adjacent risk classes; \({\widetilde{\pi }}_{l}^{{\text{d}}}=r\left({n}_{l,l+1}\left({p}_{l+1}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)+\left(1-r\right)\left({n}_{l}^{{\text{d}}*}\left({p}_{l}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)\) or \({\widetilde{\pi }}_{l}^{{\text{d}}}=r\left({n}_{l-1,l}\left({p}_{l-1}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)+\left(1-r\right)\left({n}_{l}^{{\text{d}}*}\left({p}_{l}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\right)\) for the highest and lowest-risk class, respectively; or \({\widetilde{\pi }}_{l}^{{\text{d}}}={\pi }_{l}^{{\text{d}}}={n}_{l}^{{\text{d}}*}\left({p}_{l}^{{\text{d}}*}-{p}_{l}^{{\text{A}}}\right)\) in the case of only one risk class.

      (5) Calculate the total expected profit under this classification system as the sum of the expected profits in each risk class after deducting the classification costs: \({\pi }_{m}^{{\text{d}}}={\sum }_{l=1}^{{I}_{m}}{\widetilde{\pi }}_{l}^{{\text{d}}}-c({I}_{m}-1)\).

    (ii) In the case of the innovative classification methodology that uses private data and yields a fully accurate classification \({\text{ur}}=0\):

      (1) Estimate the shift in demand that the incorporation of data requirements in the insurance policy would cause, both through the change in the willingness to pay of the current client base, of magnitude \(0<{\alpha }_{1}<1\), and through client migration from/to competitors, \({\beta }^{S}\).

      (2) Rank subpopulations by riskiness (highest to lowest) and separate all subpopulations into risk classes \(l\), where \(l=1,\dots ,{I}_{m}\) (\(l\)= 1 highest risk, \(l={I}_{m}\) lowest risk).

      (3) For each risk class l, calculate the price–demand and cost functions by aggregating the corresponding (shifted) functions of the subpopulations that it contains and find the profit-maximizing price–demand combination \({p}_{l}^{{\text{n}}*}\) and \({n}_{l}^{{\text{n}}*}\). From here on, the superscript n refers to variables under the innovative classification methodology.

      (4) For each risk class l, calculate the maximum expected profit \({\pi }_{l}^{{\text{n}}}={n}_{l}^{{\text{n}}*}\left({p}_{l}^{{\text{n}}*}-{p}_{l}^{{\text{A}}}\right)\).

      (5) Calculate the total expected profit under this classification system as the sum of the expected profits in each risk class after deducting the classification costs: \({\pi }_{m}^{{\text{n}}}={\sum }_{l=1}^{{I}_{m}}{\pi }_{l}^{{\text{n}}}-c({I}_{m}-1)\).

  2. Repeat this procedure for all \(2S\) possible classification systems \({m}^{{\text{d}}}\in \left\{{1}^{d}, \dots , {S}^{d}\right\}\) and \({m}^{{\text{n}}}\in \left\{{1}^{n}, \dots , {S}^{n}\right\}\) and choose the one that yields the highest total profit, \({m}^{*}\).
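The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the reported results: demand in each subpopulation is linear, subpopulations are already ranked from highest to lowest risk, \({\beta }^{S}\) rescales the number of prospective policyholders, and class profit is computed from subpopulation-level expected claims rather than a class-level \({p}_{l}^{{\text{A}}}\). All function names are hypothetical.

```python
def partition(S, I):
    """Adjacent classes; the highest-risk classes absorb any remainder."""
    base, extra = divmod(S, I)
    sizes = [base + (l < extra) for l in range(I)]
    bounds = [sum(sizes[:l]) for l in range(I + 1)]
    return [range(bounds[l], bounds[l + 1]) for l in range(I)]


def class_profit(price, members, N, R, A):
    """Profit from offering `price` to the subpopulations in `members`.

    Linear demand: subpopulation s buys N[s] * (1 - price / R[s]) policies
    (zero if price >= R[s]), each costing A[s] in expected claims.
    """
    return sum(N[s] * max(0.0, 1.0 - price / R[s]) * (price - A[s])
               for s in members)


def best_price(members, N, R, A):
    """Profit-maximizing single price for a class.

    Profit is concave and quadratic between consecutive reservation prices,
    so the clamped stationary point of each segment and the kinks suffice.
    """
    kinks = sorted({R[s] for s in members})
    candidates, lo = list(kinks), 0.0
    for hi in kinks:
        active = [s for s in members if R[s] >= hi]  # buyers on (lo, hi)
        num = sum(N[s] * (1.0 + A[s] / R[s]) for s in active)
        den = 2.0 * sum(N[s] / R[s] for s in active)
        candidates.append(min(max(num / den, lo), hi))
        lo = hi
    return max(candidates, key=lambda p: class_profit(p, members, N, R, A))


def total_profit(I, N, R, A, c, r=0.0, alpha_1=1.0, beta=None):
    """Expected profit with I classes, misclassification r, cost c per extra class.

    Default method: r > 0, alpha_1 = 1, beta = 1. Innovative method: r = 0.
    """
    S = len(N)
    beta = beta or [1.0] * S
    N = [b * n for b, n in zip(beta, N)]      # assumption: beta rescales N_s
    R = [alpha_1 * x for x in R]              # demand shift from privacy costs
    classes = partition(S, I)
    prices = [best_price(m, N, R, A) for m in classes]
    total = 0.0
    for l, m in enumerate(classes):
        neighbours = [k for k in (l - 1, l + 1) if 0 <= k < I]
        total += (1.0 - r * len(neighbours)) * class_profit(prices[l], m, N, R, A)
        total += sum(r * class_profit(prices[k], m, N, R, A) for k in neighbours)
    return total - c * (I - 1)


def best_system(N, R, A, c, r, alpha_1, beta=None):
    """Search all 2S systems; returns (method, number of classes, profit)."""
    S = len(N)
    options = [("default", I, total_profit(I, N, R, A, c, r=r))
               for I in range(1, S + 1)]
    options += [("innovative", I,
                 total_profit(I, N, R, A, c, alpha_1=alpha_1, beta=beta))
                for I in range(1, S + 1)]
    return max(options, key=lambda o: o[2])


# Illustrative two-subpopulation input (values chosen for the example only).
print(best_system([100, 100], [3500, 50], [2800, 40], c=1000, r=0.2, alpha_1=0.95))
```

An exhaustive search is feasible here because the grouping rule leaves only \(2S\) candidate systems; a different grouping rule would enlarge the search space considerably.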

Following this procedure, the insurer can simultaneously decide on whether it is optimal to innovate the classification technique using private data and choose the optimal number of risk classes. Figure 2 illustrates the discussed algorithm for selecting the profit-maximizing classification system with respect to the number of risk classes and methodology employed.

Fig. 2

Algorithm of selection of the profit-maximizing classification system

The proposed procedure has its limitations, partly stemming from the simplifying assumptions we made such as linearity of the demand curve of the insurer, no risk of default of the insurer, symmetric misclassification, composition of the underlying population, etc. While these assumptions might need to be relaxed/adapted in practical applications, they allow us to assess the average effects of the usage of new technologies on the profit-maximizing number of risk classes without considering specifically the steepness of the demand curve at the initial and new profit-maximizing combination in each risk class or the policyholders’ willingness to change providers based on their differences in default risk.

Figure 2 illustrates the algorithm for selecting the classification system that maximizes expected profit by integrating the selection of the classification methodology, employing additional data or not, and the number of risk classes offered.

3.3 Numerical applications

We provide two illustrations of our proposed procedure for selecting the optimal risk classification system. These examples allow us to discuss more concretely the incentives for insurers to innovate in risk classification and how this affects the incentives to offer different granularities of risk classes.Footnote 21

3.3.1 Population of five homogeneous subpopulations in the term life insurance market

The application of our proposed decision-making process requires an estimation of market data specific to the firm, the line of business under consideration, and client characteristics in that line of business. However, to illustrate our proposed procedure, we will apply our setup to the term life insurance market, with estimates taken from empirical data collected by Braun et al. (2016).

An important risk factor used to classify policyholders when it comes to term life insurance is age. Age is used to estimate mortality risk. However, another risk factor, biological age, has caught insurance researchers’ interest (Hochschild 1988). Recent research shows that biological age is a better predictor of mortality than chronological age (that is, the time that has passed since a person was born) (Huang et al. 2017; Mamoshina et al. 2018; Milevsky 2020a; Wu et al. 2021). Based on this, we could assume that insurers can opt either for classification into risk classes based on (chronological) age or request policyholders to provide information and test data necessary to estimate their biological age. The former would lead to a misclassification probability in terms of predicting mortality, while the latter allows the insurer to classify policyholders more accurately into mortality brackets. For simplicity, we assume that the second method is fully accurate.

We refer to the data that Braun et al. (2016) collected on the willingness to pay for the “classic product” divided into five groups based on age. We concentrate only on policyholders who smoke, to make sure that the subpopulations are approximately homogeneous. To align with our setup, we derive the linear approximation of the willingness-to-pay curve for term life insurance in each subpopulation and take the highest reservation price and the number of policyholders per group from the approximation. The marginal cost in each subpopulation is taken as constant, equal to the average variable costs presented by Braun et al. (2016). Note that in the data, ranking the groups by marginal cost does not yield the same result as ranking them by the highest reservation price. To aggregate the demand curves correctly, we rank the groups by their maximal reservation price. We assume that classifying based on age leads to a 20% misclassification probability in either direction. In addition, the classification cost increases by \(c=100\) units with each additional risk class. Moreover, we assume that the introduction of the innovative classification method is not expected to lead to an exchange of clients with competitors in any subpopulation, that is, \({\beta }^{5}={1}^{5}\), and that willingness to pay for insurance is negatively correlated with the willingness to share data. Under these assumptions, the maximal profit generated in each classification system is presented in Table 5 in the Appendix. Note that in the absence of classification costs, it is optimal for the insurer to classify each subpopulation into a separate risk class, regardless of classification accuracy. Once classification costs are introduced, classification based on age yields the highest profit with a division into four risk classes. The insurer could accept a fall in demand down to \({\alpha }_{1}=0.946\) to implement a classification based on biological age (under the assumption that classifying based on biological age eliminates misclassification) and still find it at least as profitable as classification based on chronological age. In this case, the optimal number of risk classes would be two. Note that the cost of misclassification, that is, the difference between the total (optimal) profit without misclassification and the total profit with misclassification, decreases with the number of risk classes. Therefore, the increased profit from a more accurate classification is higher in classification systems with fewer risk classes.

3.3.2 A numerical example of a population with seventy homogenous subpopulations

In practice, larger variation among policyholders is common, and assuming that the population has only five subpopulations is not very realistic. Therefore, we relax the assumption imposed by the availability of willingness-to-pay data and construct an example that illustrates the problem of selecting the profit-maximizing classification system when allowing for more diversity among (prospective) policyholders in a population.

We consider a population made of 70 subpopulations (S = 70). This assumption regarding the diversity within the population of potential policyholders is realistic, for instance, in the case of term life insurance, where mortality risk differs with every year of biological age, ceteris paribus, but can be considered constant within a subpopulation of the same biological age. In the absence of empirical data regarding willingness to pay at this level of granularity, we construct a numerical example. We assume a vector of maximum reservation prices per class decreasing from 3500 in the highest-risk subpopulation to 50 in the lowest-risk subpopulation, an equal number of policyholders in each subpopulation, \({N}_{s}=100\), and expected claim payments decreasing from 2800 in the highest-risk subpopulation to 40 in the lowest-risk subpopulation.Footnote 22 For a given range of initial probabilities of misclassification, \({\text{ur}}\), and a possible range of classification cost per additional risk class, \(c\), Table 6 in the Appendix shows the number of risk classes that would yield the highest total profit in each case. Furthermore, Table 6 presents the maximum shift in demand (measured by the coefficient \({\alpha }_{1}\)) that the insurer is willing to accept to obtain policyholder data if the insurer estimates no effect on client migration to/from competitors in any subpopulation, that is, \({\beta }^{70}={1}^{70}\). Lastly, column 5 in Table 6 presents the number of risk classes that yield the highest profit given a shift in demand due to the acquisition of the policyholders’ data necessary to implement a fully accurate classification. In term life insurance, we can think of an initial misclassification probability \({\text{ur}}\) when using age to proxy mortality risk. Then suppose that by acquiring from policyholders the data necessary to calculate biological age, the insurer would be able to eliminate this probability of misclassification.
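The inputs of this example are easy to reproduce. The Python sketch below builds the two vectors, assuming linear spacing between the stated endpoints, and computes one benchmark: the total profit with one risk class per subpopulation and no misclassification, before classification costs. The benchmark relies on the closed-form monopoly profit under linear demand, which is an assumption of the sketch, and the names are hypothetical.

```python
S, N_S = 70, 100

# Linearly spaced from the highest-risk to the lowest-risk subpopulation.
RESERVATION = [3500 + (50 - 3500) * s / (S - 1) for s in range(S)]
CLAIMS = [2800 + (40 - 2800) * s / (S - 1) for s in range(S)]


def separate_class_profit(alpha_1=1.0):
    """Profit with S classes and full accuracy, before classification costs.

    Uses N (alpha_1 * P_R - P_A)^2 / (4 * alpha_1 * P_R) per subpopulation,
    the maximum of n * (alpha_1 * P_R * (1 - n / N) - P_A) over n.
    """
    return sum(N_S * (alpha_1 * p_r - p_a) ** 2 / (4 * alpha_1 * p_r)
               for p_r, p_a in zip(RESERVATION, CLAIMS))


print(separate_class_profit())        # no demand shift
print(separate_class_profit(0.986))   # demand shifted by alpha_1 = 0.986
```

Expected claims equal 80% of the reservation price in every subpopulation here, so with \({N}_{s}=100\) the unshifted benchmark reduces to the sum of the reservation prices, and the shifted demand remains profitable only for \({\alpha }_{1}>0.8\).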

Results show that in the absence of classification costs, that is, the cost related to the setup and maintenance of an additional risk class, the insurer would opt for maximal granularity in classification, that is, offering 70 risk classes, regardless of the classification method used or the level of initial misclassification. The same holds when using the default classification method and assuming a positive initial misclassification probability for any classification cost up to the threshold \(c\le 30\). This means that the insurer has the incentive to treat each subpopulation as a separate risk class even if it does not have full information regarding the risk group to which policyholders belong. However, if we assume classification costs \(c=1000\) and an initial probability of misclassification \({\text{ur}}=15\%\), the maximal total profit is achieved by offering only 23 different risk classes. Panel (ii) in Fig. 3 shows the profit-maximizing number of risk classes depending on the assumed probability of misclassification. The cost of misclassification, that is, the difference between the theoretical maximal profit without misclassification and the maximal profit with misclassification, increases quickly with the number of risk classes, reaches its peak, and then decreases. This would suggest that, at first, the effect of increasing the total share of policyholders allocated to the incorrect risk class dominates: more (adjacent) risk classes lead to an overall higher share of total policyholders misclassified. However, as risk classes get more granular, the “missed” profit from each incorrectly classified risk becomes lower.

Fig. 3

Illustration of the profit-maximizing classification system in a heterogeneous population. This figure illustrates the selection of the profit-maximizing classification system in the case of a heterogeneous population. The underlying population is assumed to be composed of 70 subpopulations, with 100 (prospective) policyholders belonging to each subpopulation. The cost of maintaining an additional risk class is assumed to be 1000. The vector of maximal reservation prices per class decreases (linearly) from 3500 in the highest-risk subpopulation to 50 in the lowest-risk subpopulation, whereas the expected claim payment decreases (linearly) from 2800 in the highest-risk subpopulation to 40 in the lowest-risk subpopulation. Panel (i) displays the difference in (maximal) profit between using a classification system that yields a certain misclassification rate and using a classification system that eliminates the initial misclassification. The latter is enabled by using private data such that the demand for the new policy shifts with a certain expected \({\alpha }_{1}\) and no change in the client base is expected, \({\beta }^{70}={1}^{70}\). The insurer would switch to the new classification methodology only for non-negative values of the differences in profit. Panel (ii) shows the number of risk classes that would maximize profits depending on different values of misclassification faced

Could the insurer increase its profit by introducing a new classification methodology, using biological age, for instance, instead of age as a risk factor? To calculate biological age, the insurer needs to use policyholders’ private data, and that, as discussed, will affect its demand function. If we assume that the insurer estimates no effect of using private data on client migration to/from competitors in any subpopulation, that is, \({\beta }^{70}={1}^{70}\), then the insurer would have an incentive to innovate using private data only if the compensation that its client base requires corresponds to a demand shift no larger than that implied by \({\alpha }_{1}=0.986\).Footnote 23 In this case, it would be optimal for the insurer to offer only 17 risk classes instead of 23. If potential policyholders are thought to be more concerned on average about the use of their (required) private data and are therefore expected to allow its use only against a higher compensation, a profit-maximizing insurer would not innovate in risk classification using private data. Panel (i) in Fig. 3 shows the difference in (maximal) profits between the two classification methodologies for different combinations of the initial probability of misclassification (which can be corrected) and expected demand shift as measured by \({\alpha }_{1}\).

Focusing on the effect that the use of policyholders’ private data has on risk classification, our results do not fully validate the concern expressed in the literature that having more information on individual risk might lead to smaller risk pools (Eling and Lehmann 2018; Cevolini and Esposito 2020). In fact, for a given cost of setting up and maintaining an additional risk class, c, and a given (even small) fall in expected demand due to requiring private data, the optimal number of risk classes from the perspective of a profit-maximizing insurer in a fully accurate classification system is lower than or equal to the optimal number of risk classes in a system with a positive probability of misclassification. A lower (or equal) number of risk classes is equivalent to the pooling of more (or as many) risk groups together in a risk class.

4 Outlook

While a large body of literature on big data, risk classification, and privacy in insurance markets has emerged over recent years, we have identified a few avenues for future research that have not yet been (sufficiently) explored. To systematically identify those areas, we reviewed all outlook or future research sections in the literature referenced in Sect. 2. After excluding articles without future research sections and those published before 2011, we analyzed a total of 41 articles that discuss future research. The review demonstrated that the topics of big data, risk classification, and privacy in insurance markets reach far beyond the field of economics. Therefore, we classify possible future research into four categories: economics, law, medicine, and ethics. Most of the questions raised below lie at the intersection of economics with other fields, emphasizing the need for cross-disciplinary research.

From the perspective of economic research, the current literature emphasizes the importance of privacy economics, but also points out that privacy protection is rapidly becoming a pressing public policy issue (Acquisti et al. 2016; Biener et al. 2020; Blakesley and Yallop 2020; Gemmo et al. 2020; Hoy and Durnin 2012; Loi et al. 2022; Steinberg 2022). Future research should improve our understanding of the economics of privacy and its interaction with insurance economics. While privacy preferences have been incorporated into several models, empirical research on the determinants of privacy preferences is still relatively scarce. Furthermore, as mentioned above, both the willingness to share private data and risk aversion are found to be influenced by similar consumer characteristics. This observation suggests that they are not independent of each other. The relationship between risk aversion and privacy concerns is particularly relevant to the study of insurance contracts, which require access to a wider range of private data. To our knowledge, this relationship has not been studied empirically in the insurance context.

In data and technology applications, future research could explore a broad range of topics. On the one hand, research should expand the scope of data sets, focusing specifically on driving behavior data to better understand customers' behavioral habits and improve the risk selection process (Baecke and Bocca 2017; Cather 2018; Biener et al. 2015; Brunnermeier et al. 2022). This can help insurance companies price insurance products more accurately and provide policies that better meet customer needs. On the other hand, the value and effect of sophisticated data mining techniques in risk selection should be further studied (Baecke and Bocca 2017; Holzapfel et al. 2023; Liu 2022, 2023). Additionally, the extensive use of big data highlights the importance of cyber insurance and requires for further research encompassing data testing and modeling, strategies to address information asymmetry in cyber risks, and the interplay between information asymmetry and network effects. The public good attributes of cybersecurity and the potential ramifications of government intervention also warrant further exploration (Biener et al. 2015; Eling and Lehmann 2018; Hassani et al. 2020; Xie et al. 2019). Among climate risk and global pandemics, Koijen and Yogo (2023) identify cyber risk as one of the new risks, for which the opportunities and challenges presented to the insurance industry offer interesting topics for future research. Big data and the related privacy considerations require an adequate data security protocol. With the ongoing process of digitalization and technological advancements, both the vehicles to protect sensitive data as well as those to breach this protection have become more developed. One challenge posed to insurance companies is that there are insufficient data available to accurately model the loss distribution for cyber risks (Koijen and Yogo 2023). 
While a substantial body of research currently focuses on utilizing insurance to mitigate cyber security risks (Biener et al. 2015; Bodin et al. 2018; Xie et al. 2019; Doss and Narasimhan 2021), there is limited exploration of how the insurance sector responds to reputational risks and the ensuing cyber risk. A high level of uncertainty with respect to the loss distribution may increase premium loadings, increase deductibles, or decrease overall insurance supply, a result relevant for both cyber risk and reputational risk insurance. These circumstances may not only affect the demand for insurance, but the resulting lack of insurance coverage may also alter household and firm behavior in activities that expose them to a high level of risk (Koijen and Yogo 2023).

For risks other than those that have emerged recently, new and more efficient ways to collect data allow for more accurate risk categorization. The aforementioned inverse selection dynamics, that is, the transfer of information advantages from the insured to the insurer (Villeneuve 2000, 2005; Brunnermeier et al. 2022), open up a broad area for future research that revisits results from (standard and non-standard) asymmetric information models in which the information advantage has been on the side of the policyholder. In this paper, we propose a framework for an insurance company to navigate the decision of whether and to what extent to implement new data-driven technologies. Our application analyzes an insurer's choice of the risk classification system that maximizes its expected profit. We consider the improvements in risk classification accuracy that can be achieved using big data, while taking into account the associated privacy implications. We do so by relating (potential) policyholders' willingness to provide additional private information to their willingness to pay for insurance. Our results suggest that improved risk classification accuracy, when achieved through the use of private data, does not necessarily lead to more granular risk classes. However, reducing the cost of establishing and maintaining a separate risk class could lead to more granularity in risk classification.
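
The trade-off between classification granularity, privacy costs, and the cost of maintaining a risk class can be illustrated with a stylized numerical sketch. The following Python code is not the model of our application. It is a minimal toy example in which every number and functional form is a hypothetical assumption: loss probabilities lie on an even grid, each risk class is priced at its mean expected loss plus a proportional loading, willingness to pay is a fixed markup over the individual's expected loss, and a privacy disutility reduces willingness to pay whenever more than one class (and hence private data) is used.

```python
# Toy illustration only: all numbers and functional forms are assumptions,
# not the model or calibration used in the paper.
LOSS = 10_000.0              # hypothetical loss size
P_MIN, P_MAX = 0.05, 0.35    # hypothetical range of loss probabilities
N_PEOPLE = 1_000             # hypothetical number of potential policyholders
LOADING = 0.10               # proportional premium loading
RISK_PREMIUM = 0.25          # willingness-to-pay markup over expected loss


def expected_profit(n_classes, class_cost, privacy_cost=0.0):
    """Expected profit when the risk range is split into equal-width classes."""
    span = P_MAX - P_MIN
    width = span / n_classes
    # Assumption: a single class needs no private data, so no privacy disutility.
    disutility = privacy_cost if n_classes > 1 else 0.0
    profit = 0.0
    for i in range(N_PEOPLE):
        p = P_MIN + span * (i + 0.5) / N_PEOPLE          # individual loss probability
        cls = min(int((p - P_MIN) / width), n_classes - 1)
        class_mean = P_MIN + (cls + 0.5) * width          # midpoint risk of the class
        premium = (1 + LOADING) * class_mean * LOSS
        willingness_to_pay = (1 + RISK_PREMIUM) * p * LOSS - disutility
        if willingness_to_pay >= premium:                 # individual buys coverage
            profit += premium - p * LOSS
    # Fixed cost of establishing and maintaining each risk class.
    return profit - class_cost * n_classes


def best_number_of_classes(class_cost, privacy_cost=0.0, max_classes=10):
    """Number of classes (1..max_classes) with the highest expected profit."""
    return max(
        range(1, max_classes + 1),
        key=lambda k: expected_profit(k, class_cost, privacy_cost),
    )


if __name__ == "__main__":
    for cost in (0.0, 20_000.0, 1e9):
        k = best_number_of_classes(cost, privacy_cost=100.0)
        print(f"class cost {cost:>14,.0f}: profit-maximizing classes = {k}")
```

In this sketch, the profit-maximizing number of classes cannot rise when the per-class cost rises, which mirrors the qualitative statement that cheaper risk classes may lead to more granularity. The effect of the privacy parameter depends entirely on the assumed functional forms and should not be read as a result.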

Even when considering an informational advantage on the insurer’s side, modeling policyholder behavior as well as insurance demand requires insurers to consider what level of information is available to households and to consider households’ beliefs and preferences. For instance, a consumer’s knowledge and beliefs about the loss distribution may diverge from the information the insurer matches to the respective consumer. In life and health insurance, consumers are likely to base their insurance demand on beliefs about their own health and longevity and they may or may not be able to estimate and consider their own biological age (Huang et al. 2017; Milevsky 2020a; Wu et al. 2021). The trust that consumers have in the insurance industry or individual firms (Courbage and Nicolas 2021; Gennaioli et al. 2022), their knowledge about their existing coverage, e.g., social insurance (Parente et al. 2005), as well as reliance on other safety nets (Kotlikoff and Spivak 1981; Brown et al. 2012) may also affect their willingness to pay for insurance. Empirical analyses of these mostly unobservable determinants of insurance demand, an exploration of correlations with observable characteristics, as well as theoretical models that incorporate such characteristics, could greatly help insurance companies in predicting insurance demand.

Another under-researched area is the potential change in industrial organization that accompanies the increasing use of big data. Additionally, the environmental costs of digital technologies and big data have been little explored in the literature (see Lucivero 2020; Samuel et al. 2022). Quantifying these costs for the insurance industry would give a clearer understanding of the trade-offs involved in utilizing big data.

There are numerous directions for future research at the intersection of insurance economics and law. First, the application of big data will inevitably change insurability, which will raise new legal issues (Eling and Kraft 2020). New possibilities of discrimination based on lifestyle tracking also create new legal and regulatory challenges (McFall 2019). Second, the legal and economic implications of self-insurance in the context of information asymmetry can be studied. This includes the impact of incomplete cure of disease on self-insurance, as well as the potential application of self-insurance in disease prevention. Some preventive measures may not be observable by insurance companies, which raises legal questions and ethical issues that require more in-depth research (Crainich 2017). This aspect is relevant not only in the health domain, but also more broadly in IT security. Further research is also needed to better understand the legal and economic challenges of risk scoring, paying particular attention to the legal issues that multidimensional heterogeneity poses for the credit and insurance fields. As technology evolves, risk scoring models are likely to become more complex, requiring legal frameworks to accommodate new ways of using data (Einav et al. 2016; Eling and Lehmann 2018; Fuster et al. 2019; Hoy and Durnin 2012; Loi et al. 2022; Steinberg 2022). Also, the use of big data in fraud detection might require interdisciplinary research on the legal barriers and economic implications of data usage. Overall, further economic and legal research is required to inform decision-makers developing regulatory frameworks.

At the intersection with medical research, future work may further probe the application and impact of big data and genetic information in health insurance. This entails, for example, investigating the potential influence of genetic information on adverse selection and whether individuals with differing genetic test results adopt different health insurance strategies (Crainich 2017; Filipova-Neumann and Hoy 2014; Hoy and Durnin 2012; Nill et al. 2019; Posey and Thistle 2021). Additionally, future research should delve more deeply into the utilization of various technologies in healthcare, encompassing the monitoring of medical applications of health data. These technologies offer the promise of enhanced disease prediction and prevention but also raise legal and ethical questions that necessitate further research and regulation (Filipova-Neumann and Hoy 2014; Hoy and Durnin 2012; Nayak et al. 2019a, 2019b).

Lastly, future research must attend to the ethical dimensions of insurance, particularly given the continuous integration of new technologies. The insurance industry faces the challenge of balancing individual privacy and risk management needs (Biener et al. 2020; Meyers and van Hoyweghen 2020; Nill et al. 2019). Research should focus on examining the ethical repercussions of information asymmetry on data sharing and the ethical dilemmas arising from such asymmetry (McFall 2019). These areas of investigation will offer guidance for the future development of the insurance industry, ensuring that data and technology applications align with legal and ethical standards while meeting the expectations of customers and society (Aburto Barrera and Wagner 2023; Eling and Kraft 2020; Kiviat 2019; Tanninen et al. 2022; Wiegard and Breitner 2019).