1 Introduction

Personal data has been described as “the new oil of the Internet and the new currency for the digital world” [1] and is created, stored, shared, and used by almost all companies. Market studies expect the overall global data monetization market to increase from USD 2.3bn in 2020 to USD 6.1bn in 2025 [2]. A sharp increase in data creation and the low cost of storage are the main drivers. Companies such as Facebook and Google have created massive data environments and developed their business models around the voluntary sharing of personal data by individuals in exchange for using their services [3, 4].

The OECD defines personal data as “any information relating to an identified or identifiable individual (data subject)” [5]. Personal data shows traits of a public good rather than a commercial good: it is difficult to effectively exclude other parties from using it, and usage by one party does not prevent other parties from using it as well [3, 5, 6]. Data owners, on the other hand, have a need for privacy [7]. With the collection of personal data happening continuously and through a multitude of interconnected devices and modalities, there is an abundance of data and it is challenging to introduce effective control mechanisms on who transfers and uses the data [3]. Additionally, at least in Europe, privacy and the right for information to remain private are highly valued and are protected by complex laws and regulations such as the General Data Protection Regulation [8] or new regulations aiming to increase data sovereignty [9, 10]. The current business assumption is that users are willing to share their data in exchange for the (free) service they receive. However, studies such as Schwartz [11] have shown that users are often not fully aware that they are paying for the service with their data and that their data may be collected and sold to data brokers, who in turn sell data bundles to different data users along a data value chain. Other studies such as Sindermann et al. [12] investigate whether this influences customers’ willingness to pay for social media and show that only a minority supports a monetary payment model. As consumers learn more about how their data is used, this duality will eventually have consequences for companies’ business models [3, 12]. Thus, it will be necessary for companies to attach a monetary value to the personal data shared with them, to allow data producers/owners to actively consent to the use of their personal data in return for monetary or non-monetary compensation, and to develop a data market [6].
This can be facilitated through a data broker who transfers data from data owner to data user and monetary or non-monetary compensation from data user to data owner. The price of data depends highly on a diverse range of factors, some inherent in the data and some based on data context and the data subject. Combined with the need to account for the value attributed to privacy by data subjects, this adds an extra layer of complexity to pricing models, as the value the data owner places on privacy may exceed the price companies are willing to pay. The question of how to price personal data remains open despite multiple calls for research concerning this topic [6, 13]. Literature reviews in this area focus on pricing models [14, 15] or on the value of privacy [7] and were published some five years ago. A more current review was undertaken by Wdowin and Deepeeven [16] as part of a research project and describes some factors related to the value of data. However, it gives no detailed description of the search process or of the criteria used to select the papers included in the review. There is currently no consensus in the literature on which factors influence the value of personal data. Given the omnipresence of personal data across divergent research fields and the regulatory efforts focusing on data sovereignty, a narrative synthesis of current findings is needed to provide a sound basis for research, model development and decision makers, and to answer the research question: Which factors influence the pricing of personal data?

We contribute to the discussion on pricing personal data in the following respects: (1) we identify relevant influence factors for pricing personal data and (2) provide a structure and categories for further qualitative and quantitative analysis of the subject.

2 Research Approach and Sample

2.1 Research Approach

We perform a narrative review [17] to conceptually integrate different fields of research on influence factors of the price of personal data. We use the following keywords: pricing of data, data markets, value of data, data valuation, data monetization, economics of personal data, pricing personal data and worth of privacy. We search EBSCO Business Source Complete as well as eLib, a specific directory which encompasses Web of Science, Scopus, Tema, Springer Link, Science Direct and other open access directories. We search for the keywords in TITLE/ABSTRACT of publications between 2001 and 2021. We find 1,535 papers in total. We screen the title and abstract based on (a) the topic of pricing privacy, pricing personal data or personal data markets and (b) peer review. We exclude papers that are out of scope, e.g. those focusing on company data, on pricing goods or services using personal data, or on using personal data for price discrimination. We also exclude journal editorials and short summaries of conference papers. Overall, we exclude 1,383 papers and initially include 152 papers in our sample. We perform forward/backward referencing and include 8 additional in-scope papers or grey literature reports. Following the PRISMA [18] recommendations, we then perform a full-text screening and exclude an additional 107 papers (1 duplicate, 106 papers out of scope based on the above-mentioned criteria) from our final sample. Our overall sample consists of 54 papers. We extract relevant influence factors using an inductive content analysis [19] to identify the major themes in the literature. Two researchers perform the initial open coding independently and then perform multiple rounds of clustering and categorization to determine the main influence factors on the price of personal data. We derive a description for each factor and analyze the frequency of occurrence within our sample. To determine the frequency, we count each factor at most once per paper, regardless of how often it is mentioned within that paper.
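The once-per-paper counting rule described above can be illustrated with a minimal Python sketch. The paper identifiers, the coded factor lists and the function name `factor_frequencies` are hypothetical placeholders introduced for illustration only; they are not taken from the actual coding of the sample.

```python
from collections import Counter

# Hypothetical coding results: paper ID -> factors coded in that paper.
# A factor may be coded several times within a single paper.
coded_papers = {
    "paper_A": ["sensitivity of data", "trust in data market", "sensitivity of data"],
    "paper_B": ["sensitivity of data", "data volume and inferability"],
    "paper_C": ["trust in data market"],
}


def factor_frequencies(coded):
    """Return, for each factor, the number of papers in which it occurs."""
    counts = Counter()
    for factors in coded.values():
        # set() removes repeated mentions, so each paper counts once per factor
        counts.update(set(factors))
    return counts


frequencies = factor_frequencies(coded_papers)
for factor, n_papers in frequencies.most_common():
    print(f"{factor}: {n_papers}")
```

Counting per paper rather than per mention keeps a single paper that discusses one factor at length from dominating the ranking.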

2.2 Sample Description

The papers in the sample are published between 2005 and 2021 (see Fig. 1). The topic has become more frequently analyzed in the literature since 2014, highlighting its importance for current political and public discussion. There is a large spread of publications amongst different scientific outlets. About 60% of the analyzed papers were published in different scientific journals, while 30% were published in conference proceedings and the remaining 9% were published as grey literature reports. The topic of pricing personal data lies at an interface between business, economics, and information technology research. This is reflected in the publication outlets. The journals differed widely within the sample, with Electronic Markets being the most used outlet (3 publications), followed by Computer Law & Security Review (2 publications) and the IEEE Internet of Things Journal (2 publications). The remaining journals published one paper on the subject.

Fig. 1. Papers according to their year of publication

3 Results

3.1 General Results

Overall, the analyzed papers portray a diverse stream of research. We classified them into eight categories: case study; commentary; data market model; data market model, technical; data pricing model; experiment; literature review; and report. The results are provided in Fig. 2. The appendix provides an overview of the core topics covered.

Fig. 2. Papers according to their content

The two most prevalent categories are theoretical data market models (14 papers, plus 5 papers on algorithm-based data markets) and experiments (14 papers). Data market models focus on the development of different theoretical scenarios for establishing a personal data market. Most of the papers are theoretical in nature and deal with specific game-theoretical [20, 21] or auction-based [22] approaches for market creation. Some underline their findings with simulations based on real-life data sets [23, 24]. Algorithm-based models show different ways to technologically facilitate data market settings and highlight the difficulties of incorporating real-life influence factors such as anonymization and noise as well as profit-maximization calculations into an algorithm [25, 26, 27]. The experiments focus mainly on (a) eliciting participants’ willingness to pay (WTP) for e.g. keeping personal data such as social media data [3, 28, 29] or preferences private [30] and (b) their willingness to accept (WTA) money for e.g. social media data [31] or location data [32, 33]. A notable finding is a gap between WTP and WTA values, with the average WTA value being significantly higher than the WTP to protect one’s privacy [35, 36, 37]. We further find several papers on what we call data pricing models, which focus on different pricing methods and ways to elicit prices for personal data. Literature reviews in this area focus on pricing models [14, 15] or on the value of privacy [7] and were published some five years ago. A more current review was undertaken by Wdowin and Deepeeven [16] as part of a research project and describes some factors related to the value of data. However, it gives no detailed description of the search process or of the criteria used to select the papers included in the review. The few case studies found in our research focus on the complexity of choice and value estimation for personal data.
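The gap between willingness to pay (WTP) and willingness to accept (WTA) noted above can be made concrete with a small Python sketch. All values below are hypothetical and purely illustrative; they are not results from any of the reviewed experiments, and the function name `wta_wtp_gap` is our own.

```python
from statistics import mean

# Hypothetical elicited amounts in EUR, for illustration only.
wtp_values = [1.0, 2.0, 0.0, 3.0]   # amounts participants would pay to keep data private
wta_values = [5.0, 10.0, 8.0, 9.0]  # amounts participants would accept to disclose data


def wta_wtp_gap(wtp, wta):
    """Return mean WTP, mean WTA, their difference and their ratio."""
    mean_wtp = mean(wtp)
    mean_wta = mean(wta)
    # The ratio is undefined when nobody is willing to pay anything
    ratio = mean_wta / mean_wtp if mean_wtp > 0 else None
    return mean_wtp, mean_wta, mean_wta - mean_wtp, ratio


mean_wtp, mean_wta, gap, ratio = wta_wtp_gap(wtp_values, wta_values)
print(f"mean WTP = {mean_wtp:.2f}, mean WTA = {mean_wta:.2f}, gap = {gap:.2f}")
```

A positive difference, or a ratio above one, corresponds to the pattern described in the text: participants demand more to give up their data than they would pay to protect it.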

3.2 Influence Factors on Pricing Personal Data

We find a multitude of influence factors on the price of personal data. We categorize them into four overarching categories: (1) data properties, (2) data context, (3) perceptions of data owner and (4) perceptions of data user. A detailed overview of the subsumed influence factors, their description and frequency of occurrence across the sample can be found in Tables 1, 2, 3 and 4, respectively. “Data properties” refer to the inherent properties of a personal data dataset. This category represents general properties of data that are not limited to personal data, but apply to all types of data. We included this category as it lists important factors for pricing data. Within this category, sensitivity of data was found most frequently in the literature, followed by data content, volume and coverage of data, data quality and level of data aggregation. The remaining factors were mentioned less frequently.

Table 1. Influence factors within the category “Data properties” and frequency of occurrence

“Data context” refers to the environment and background of the personal data dataset. Within this category, data volume and inferability were found most often in the literature, often referring to arbitrage situations. Further, cost and length of data storage and cost of data gathering were discussed in the literature, while the remaining factors occurred less often.

Table 2. Influence factors within the category “Data context” and frequency of occurrence

The category “perceptions of data owner” focuses on the preferences and views of the data owner and the related willingness to consider selling personal data in general. Individual privacy and risk preferences and informational self-determination were found most frequently in the literature and are the most frequent influence factors mentioned overall.

Table 3. Influence factors within the category “Perceptions of data owner” and frequency of occurrence

The category “perceptions of data user” focuses on the preferences and views of the data user. Within this category, the trust in the data market and the individual utility of the purchased data for the data user were most frequently mentioned.

Table 4. Influence factors within the category “Perceptions of data user” and frequency of occurrence

Overall, individual preferences on risk and privacy, informational self-determination, sensitivity of data and data volume and inferability were identified as the most mentioned influence factors in the literature.

4 Synthesis of Findings

Our main finding is that while there already is a body of literature concerned with pricing personal data, the research base is heterogeneous. Research is spread out amongst fields of science, journals, and research streams. There does not yet seem to be a consensus on key questions, and overarching frameworks on the subject are lacking. This diverse research base, however, highlights the importance of the topic.

We have identified two main research aspects: (1) theoretical and what we call technical data market models and (2) experiments to elicit the willingness to accept money in exchange for data or the willingness to pay for privacy. Market models focus on theoretical aspects of pricing personal data and use game-theoretical and profit-maximization approaches to develop narrow-scope data market models. Technical data market models provide algorithms for different pricing mechanisms such as query-based pricing. While most papers try to validate their models with real-life data sets, no paper provides insights from a real-life application of its model. This leads us to assume that data access is limited and that companies already operating data markets are unwilling to disclose their models, as these are presumably core to their respective business models. Experiments focus on very narrow experimental settings, often with students as their subjects, and mainly concern social media data (likes, shares, and general personal information), with a few using location data. To our knowledge, there are no experiments focusing on pricing more sensitive personal data such as electronic health records, and very few studies with demographically diverse participants. Particularly when considering more sensitive personal data, it would be interesting to gather information from a broad demographic including diverse age groups, educational backgrounds and levels of digital aptitude, as experiments show (1) irrationality concerning one’s data (e.g. [35]) and (2) for some of the participants a plain refusal to partake in data pricing (e.g. [38]). Additionally, there are some studies on data pricing models and very few case studies and literature reviews on the subject.
While each of the streams has created a significant insight into the topic, it would be most useful to combine them to merge the aspects of (irrational) decision making of data subjects with more economical and algorithm-based thinking into a more practical data market model/algorithm, as has been attempted by e.g. Biswas et al. [39].

Looking at the factors that influence the price of personal data, we develop four categories of influence factors: (1) data properties, (2) data context, (3) perceptions of data owner and (4) perceptions of data user. All categories include several factors, which we also rank by occurrence in the papers of the literature review. The most prevalent factors are individual privacy and risk preferences and informational self-determination. This is not surprising, as those factors are what differentiates personal data from e.g. company data and should thus have a strong impact on the price of personal data. Factors relating to trust, while still important, occur much less frequently, despite seemingly being a factor in increasing market participation by data owners. Market models thus may omit an important factor in their setups if they exclude trust-creating mechanisms. Sensitivity, data content, and data volume and inferability are further factors that frequently appear in the literature. Those factors are rather generic and certainly apply not only to the price of personal data, but also to data in general. Pricing also seems to depend on the cost and length of data storage and the cost of data gathering, which is not surprising since the utility derived from the data needs to outweigh the cost and since most data consumers operate within a restricted budget. Other factors such as culture, ownership level or origin of data appear infrequently and seem to be of less importance. With regard to culture, this is interesting since there are significant differences in how cultures approach the topic of privacy [40]. The most frequently used pricing method is market pricing. Other pricing concepts, such as option, per-query or auction models, appear much less frequently. This may be due to their perceived complexity in setup and operation.

5 Conclusion and Limitations

We conduct a narrative literature review and show a diverse array of research streams and questions that are related to the topic of pricing personal data and emphasize its importance. Due to the broad frame, the results can only be aggregated to a limited extent. We believe that the resulting influence factors of our work are a valuable contribution to the scientific discussion and model development in future research. Our qualitative analysis of influence factors based on the underlying literature shows that the influence factors of the price of personal data can be classified into four categories. We provide a first description and ranking of the factors based on our literature review. These factors are only a starting point for researching the influence factors for pricing personal data and need further verification through empirical analyses such as a quantitative study or structural equation modelling. We aim to validate these factors empirically in future projects and to develop an operational model to quantify a price for personal data.