
1 Introduction

Since 2000, data generation by various sources, such as Internet usage, mobile devices and industrial sensors in manufacturing, has been growing enormously [18]. As of 2011, these sources were responsible for a 1.4-fold annual data growth [24]. Furthermore, storing and processing data have become easier and less expensive due to technological developments, such as distributed and in-memory databases that run on commodity hardware, and decreasing hardware prices [1]. The resulting massive influx of data has inspired various notions about the future of information science, the most popular being Big Data. With the rise of Big Data in practice, it has also become a topic of interest for scientists in different research disciplines, particularly those in the field of information and decision support systems and technologies. The analysis of publications in the field of Big Data by [16] reveals that recent research mainly focuses on aspects of data storage and data analysis. Less attention has so far been paid to the aspects of data selection and the operationalization of the results.

As the number of data sources as well as the total number of data points available inside and outside of companies has increased, coordinated data selection in advance of decision making with respect to a specific economic goal has become more relevant [21]. The lack of a detailed and goal-oriented data selection process may lead to inefficient decision support (DS) because (i) questions regarding which data sources are generally available for specific analytic purposes and (ii) questions about which data sources and respective results should be integrated into the decision making process remain unanswered.

To identify relevant, available data, we propose that both a process model for identifying specific optimization problems and a data landscape that provides a structured overview of the available data inside and outside the company as well as its characteristics are mandatory. To the best of our knowledge, neither such a process model nor a data landscape for DS currently exists [16].

We test our model in the field of online advertising, as the process of data selection and data evaluation is particularly relevant for companies engaged in online advertising. The field offers multiple possible data sources within and outside the advertising company at different levels of aggregation (e.g., specific user-level data vs. aggregated data) and different levels of temporal availability (e.g., frequent vs. sporadic). New cookie-tracking technologies offer companies the potential to “follow” individual users across multiple types of online advertising. These clickstream data include highly detailed user-level information such as user- and time-specific touch points with different advertising channels and different types of interactions (e.g., view and click).

Online advertising has become increasingly important for companies in their attempts to increase consumer awareness of products, services, and brands. With a share of nearly 50 % of total online advertising spending, paid search advertising has become the favored online advertising tool for companies. In addition to paid search advertising, companies can combine several forms of display advertising, such as banner or affiliate advertising, on multiple platforms (i.e., information sites, forums, or social network sites) to enhance consumer awareness [2]. These increased opportunities to advertise online add complexity to managerial decisions about how to optimally allocate online advertising spending, as consumers are often exposed to numerous types of online advertising during their browsing routines or their search-to-buy processes [32].

The goal of this paper thus is twofold: (i) the development of a process model for the generation of a data landscape and (ii) its empirical application.

The paper is structured as follows: after describing the current state of research on data selection and its weaknesses in the field of DS applications, a process model for the development of a data landscape is developed. This section is followed by the testing of the proposed model in the field of online advertising. Finally, based on the identified data, we apply the model of [30] to enhance DS in the field of display advertising. After outlining our findings and discussing our results, we conclude this study by highlighting its limitations and providing suggestions for future research.

2 Data Landscape and Decision Support

2.1 Current Research

An initial literature review revealed that no process models specific to the development of data landscapes have been published in the field of online advertising or decision support, although [7] claim that “what data to gather and how to conceptually model the data and manage its storage” is a fundamental issue.

The fields of data warehouse (DW) and information system (IS) development represent a preliminary stage in developing data landscapes in terms of information requirement analysis, which includes the identification of the data and information necessary to support the decision maker [5]. Winter and Strauch [2003] distinguish between the two types of systems, citing the underlying IT infrastructure, the number of interfaces and connections, the degree of specification, and the number of involved organizational units as distinguishing factors. These different characteristics lead to a disparity in the information requirement analysis: IS requirements target necessary and desirable system properties elicited from prospective users, whereas the information required for a data warehouse system can usually not be gathered correctly due to the uniqueness of many decision/knowledge processes. Consequently, how extensively these models can be applied to data landscape development must be tested.

The existing identification approaches for DW can be categorized as data/supply-, requirement/goal/demand-, or process-driven [36]. Data-driven approaches focus on the available data, which can be found in the operational systems (e.g., ERP or CRM systems) [26]. This approach can help identify the sum of the overall available data but fails to incorporate the actual and future requirements of users and decision makers. Requirement-driven approaches focus on the requirements of the system user, assuming that a user can best evaluate his or her information need, which is simultaneously a limiting factor because most users are not aware of all available data sources [14]. Furthermore, in an early study, [12] describes human biasing behaviors that negatively influence data selection in the initial phases of a data warehouse development. He describes strategies to determine the information requirements, including asking, deriving them from an existing information system, synthesizing them from characteristics of the utilizing system, and discovering them through experimentation with an evolving information system. He also emphasizes the relevance of data characteristics, claiming that the format of the data is the window by which users of the data see things and events, and that format is thus constrained by the structure.

As a special form of the requirement-driven approach, the process-driven approach focuses on data from existing business processes and therefore avoids the subjectivity of the requirement-driven approach and the constraints of the data-driven approach [22]. Depending on the coverage of business processes by IT systems, this approach can produce results similar to those of the data-driven approach; the more process steps are covered, the more comparable the results of the two approaches become. One challenge for the use of the process-driven approach in data landscape development can be the identification of the relevant decision process.

Using a method engineering approach, the information requirement analysis by [37] introduces an information map that describes which source systems provide which data in which quality but does not elaborate on the development of this data landscape. [15] present a mixed demand/supply-driven, goal-oriented approach, incorporating the graphical representation of data sources and attributes depending on the particular analytic goal. The graphical representation contains aspects of a data landscape but does not contain a characterization/evaluation of the attributes and focuses on existing, internal data sources. [25] also propose a goal-oriented approach, introducing a hierarchy among strategic, decisional and informational goals. Based on the information goals, measures and the dependencies among them are identified.

Less research has been published regarding information requirement analysis for IS/decision support systems. [5] categorize existing approaches into observation techniques (prototyping), unstructured elicitation techniques (e.g., brainstorming and open interviews), mapping techniques (e.g., variance analysis), formal analytic techniques (repertory grid), and structured elicitation techniques (e.g., structured interviews and critical success factors), which can be used to identify requirements based on existing information systems. [12] presents four strategies for generic requirement identification at the organization or application level: (i) asking, (ii) deriving requirements from an existing information system, (iii) synthesizing them from characteristics of the utilizing system, and (iv) discovering them through experimentation with an evolving information system. In their literature review, [35] compare and evaluate methods for analyzing information requirements for analytical information systems based on the requirement engineering framework of [20]. Their analysis reveals that most publications address elicitation, but that this issue needs to be pursued further. The same applies to research on documenting information requirements, which lacks a level of detail that is coherent for both business and IT.

The presented models cannot be utilized for information requirement analysis in the context of decision support as they focus on internal company data and hence do not consider potentially valuable external data for DS purposes. First, therefore, an external perspective has to be incorporated. Second, to cope with the multitude of data sources, a structure must be provided that supports focusing only on decision-relevant data, which can so far only be found in the work by [15, 25]. Consequently, we propose a process model for decision support that enhances the process of identifying and evaluating potential data sources.

2.2 Development of the Process Model for the Data Landscape

The proposed process model for data landscape development combines and extends the goal-oriented approaches by [15, 25] and the data model-oriented level-approach by [19]. The initial goal-oriented approach helps identify relevant analysis tasks, whose results support the overall decision making process.

The starting point can be the pursuit of a strategic goal or a specific analytic question. In the first case, the decision and information goals are derived from the strategic goal using a top-down approach. For example, in the field of online advertising, a strategic goal can be the improvement of the overall company reputation or an increase in sales. These goals can focus on the department level or the company level.

In the next step, the strategic goal is itemized into decision goals, which, when completed, contribute to the achievement of the overall strategic goal.

In the third step, the decision goals are specified by developing information goals as the lowest hierarchical step. Information goals are concrete goals that contain distinctive analytic questions. These form the basis for the subsequent identification of relevant data sources in an information requirement analysis.

Fig. 1. Process steps.

The goal hierarchy supports the identification of analytic questions as a first step to frame the requirements based on the necessary decision support, incorporating the uniqueness of each decision making process [36]. Furthermore, it fosters the definition of analytic goals independent of the perceived limitations of employees’ knowledge of available data sources. Due to their granularity, information goals can be used to derive concrete, testable hypotheses. In case a concrete analytic goal already exists, the technique can be used as a bottom-up approach to identify further information goals: the related decisional and strategic goals are defined first, and based on the decision goal, further information goals are derived (Fig. 1).

In the next step, the related business process is defined for each analytic goal. For example, in the field of online advertising, for the possible information goal “analyze online customer conversions under the influence of online advertising”, the related generic business process is as follows: a potential customer interacts with an advertisement (i.e., by being exposed to a banner advertisement or clicking on a paid search advertisement), visits the online shop, and purchases a product.

In the next step, the related data sources, e.g., ERP/CRM systems, and the attributes for each process step are identified. Up to this point, this approach for a high- or mid-level data analysis is similar to the one proposed by [19]. We extend this approach to cope with the requirements of DS in the emerging Big Data context regarding the dimensions volume, variety, velocity and veracity. Considering the numerous data sources within and outside the company that can contain business-process- and decision-relevant data, we extend the approach by distinguishing internal and external data sources [34]. For example, the data sources regarding a purchased product are not limited to product master data and sales data on the product level. They can be enriched by customer reviews from external product platforms regarding customer satisfaction or product weaknesses and can therefore foster decision support, e.g., with regard to companies’ spending on product development, product quality management, or reputation management.

Table 1. Data characteristics and possible features for each attribute.

The available data sources, spots and attributes in the field of online advertising and decision support are heterogeneous. We understand data spots to be the next lower level below data sources, e.g., the customer or product master data, which in turn contain attributes such as name and address. Consequently, to decide which data sources and spots should be integrated into the analysis, information about their potential information content and the amount of data-processing work resulting from their characteristics is needed. For example, the characteristics in Table 1 must be defined for each attribute.

Therefore, the second extension is the introduction of a low-level attribute characterization that contains the determination of data characteristics for each attribute in addition to the type and source system of the data, which are already known from database development-related approaches. Furthermore, attributes that do not contain further insight, independent of the decision in focus (e.g., the customer telephone number), are eliminated in the following data cleaning step. This removal step aims to simplify the subsequent model building process. Previous approaches to information requirement analysis do not consider data characteristics beyond physical attributes such as the data type (e.g., varchar). With the increase in the number and points of origin of potentially available data sources, a cost estimation in the early stages of heterogeneous source utilization is crucial.

The determination of characteristics fosters the evaluation of attributes regarding the costs and effort of an integration into the DS. Using Twitter as an example, although data collection is simplified by the available API, the process of cleaning the noisy data is time consuming. Conversely, the (pre-)processing of clickstream data is less time consuming due to the higher degree of structure. To incorporate these characteristics, the degree of structure and the distinction between machine- and human-generated data are introduced, assuming that unstructured data generated by humans, such as reviews or blog entries, are more likely to contain noisy data, which increases the time needed for data (pre-)processing due to typos (e.g., gooood instead of good) or linguistic features (e.g., irony, sarcasm). With regard to blog entries or tweets from different countries, the text language also influences the preprocessing time, although research has revealed that machine-based translation does not necessarily impair the results [13]. The effort for data preprocessing is related to data quality, which is a major subject in the field of Big Data [23]. In addition, the available volume influences the sample size and the coverage of the analysis. The velocity influences the time intervals in which the decision model can be updated based on new data. The costs per unit refer to purchased data, e.g., advertising data or market research data. The level indicates to what extent decisions can be made at the customer level. The historical availability defines the period that can be incorporated into the analysis; this is of special interest with regard to changes in customer online behaviour. If different internal and external data sources are to be integrated into a decision support system, the data characteristics can also support the technological decisions regarding database management software. The introduced aspects of external data integration and characterization incorporate the requirements of decision support into the Big Data context. Based on the developed data landscape, a model building process that is used to answer the original question can be established.
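To make the three-level structure (data source, data spot, attribute) and the low-level characterization sketched in Table 1 more tangible, the following Python sketch shows one possible way to record a data landscape entry. All class and field names are illustrative assumptions of ours, not part of the proposed model itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Structure(Enum):
    STRUCTURED = "structured"
    POLY_STRUCTURED = "poly-structured"
    UNSTRUCTURED = "unstructured"

class Origin(Enum):
    MACHINE = "machine-generated"
    HUMAN = "human-generated"

@dataclass
class Attribute:
    name: str                      # e.g., "review text"
    data_type: str                 # physical type, e.g., "varchar"
    structure: Structure           # degree of structure
    origin: Origin                 # machine- vs. human-generated
    velocity: str                  # e.g., "streaming", "daily", "sporadic"
    volume: str                    # rough order of magnitude
    cost_per_unit: float           # 0.0 for internally available data
    level: str                     # e.g., "user-level", "aggregated"
    historical_availability: str   # e.g., "since product launch"

@dataclass
class DataSpot:
    name: str                      # e.g., "customer reviews"
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class DataSource:
    name: str                      # e.g., "company web server"
    external: bool                 # internal vs. external perspective
    spots: List[DataSpot] = field(default_factory=list)
```

Such a structure is only one way of documenting the landscape; the essential point is that every attribute carries the characteristics from Table 1 so that integration costs and effort can be compared across internal and external sources.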

2.3 Model Evaluation

Evaluation, in a general sense, is understood as a systematic process applied for the targeted and goal-oriented assessment of an object [9, p. 4]. The execution of an evaluation is not only connected with the interest of gaining knowledge; it additionally serves to document effects.

In the context of design-oriented information systems research, evaluation is understood as the assessment of material or immaterial objects under consideration of a specific aim. Although evaluation plays a subordinate role within design science research, [3] draws a connection between evaluation and validation, stating that “[...] a constructed, not yet evaluated artefact does not represent a valid research result”.

As an artefact is developed in the context of design science research, the aspect of evaluation plays a relevant role. Existing design research processes contain a step focusing explicitly on evaluation instead of validation, e.g., the approach by [17], whose individual research steps can be grouped into the steps Build, Evaluate, Theorize, and Justify.

Existing evaluation approaches in the field of design science research can be distinguished based on whether the evaluation is carried out against the research gap or against the real world problem [10]:

  1. The artefact is evaluated against the identified research gap. The focus is on evaluating the accurate construction of the artefact, based on previously defined requirements.

  2. The artefact is evaluated against (an expert of) the real world, which is carried out by applying the artefact to the real world problem in focus.

  3. The research gap is evaluated against the real world; this option is not pursued further here.

The evaluation against the identified research gap is carried out based on the design science research guidelines of [17] by comparing the developed model with the guidelines.

The evaluation against the real world is carried out by the subsequent model application. The goal is to demonstrate to what extent the developed artifact helps to solve the identified real world problem, in this case the challenges resulting from the high number of available data sources in the context of decision support. The results of the evaluation against the identified research gap can be found in the next section. The results of the evaluation against the real world can be found in the section describing the empirical application.

Although the presented paper is not solely linked to information systems research, there are extensive overlaps with the field of IT infrastructure, especially data warehousing. With regard to the limited space of this paper, the guidelines are only briefly described and then cross-checked against the presented model.

(1) Design as an Artifact demands the production of a viable artifact. This is fulfilled as an independent process model is developed, applicable as a basis for the respective information system development. The (2) Problem Relevance is given as, until today, companies are confronted with a 1.4-fold annual data growth based on numerous different company-internal and -external sources, which results in the described uncertainty about the data selection for decision support applications [24]. Guideline (3), Design Evaluation, demands an evaluation of the utility, quality, and efficacy of the designed artefact. Therefore, in the next section, the model is applied in a two-step approach in the field of online advertising, both qualitatively and empirically. Guideline (4) targets the Research Contribution. As no comparable process model for the development of a data landscape exists so far, the presented model is a distinct contribution. This aspect is in conjunction with guideline (5), Research Rigor, in terms of the application of rigorous methods. This is given as, prior to the model building, an extensive literature review was carried out, which led to the selection of the two presented publications that act as a basis for the developed model, complemented by the two-step model evaluation as described. Guideline (6) contains Design as a Search Process, demanding the utilization of available means. The transfer of this guideline cannot be executed completely: given the novelty of this approach, a run through several test cycles in order to refine the means could not be carried out so far.

3 Empirical Application in the Field of Online Advertising

3.1 Testing the Process Model

We test the model using the example of a telecommunication service provider that sells its products and services both online and in brick-and-mortar outlets. We first define a strategic goal and then develop respective information goals. This is followed by the definition of the corresponding business process and the identification of related data sources, data spots and attributes.

For online advertising, a strategic goal may be optimizing the company’s advertising spending, for example by reducing the cost per order (CPO). The CPO is the sum of the advertising costs divided by the total number of purchases. Two possible resulting decision goals are therefore reducing the advertising spending while keeping sales constant, or increasing sales while keeping the spending constant. Related information goals include measuring the effects of reduced advertising spending on sales or the targeted exposure of online advertising activities to potential consumers to reduce scattering losses. The latter information goal is the basis for the further analysis of related data sources, spots and their characteristics. Scattering losses can be analyzed and optimized for each active advertising channel, such as paid search advertising or social media advertising. In the following example, we focus on display advertising activities.
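For illustration of the CPO metric only (the figures are invented and not taken from the case), advertising costs of €10,000 that result in 200 purchases would yield

$$\begin{aligned} \text {CPO} = \frac{\text {advertising costs}}{\text {number of purchases}} = \frac{\text {€}10{,}000}{200} = \text {€}50. \end{aligned}$$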

Based on the information goal of “reducing scattering losses of display advertising activities”, we identify the related business process, which covers redirecting possible customers from third-party websites to the company’s online shop with the help of display advertisements. Because the company sells products with different technical specifications, the process begins with the customer’s browsing routines or internet-based information search regarding a product or service. During the search, an advertisement for the company is displayed to the potential customer, who either clicks on the advertisement or visits the online shop directly. The visit to the shop leads to a purchasing decision, which terminates the analyzed process.

This business process, given the information goal, serves as the basis for the following identification of related data sources and spots as described in Sect. 2.2. A description of each data spot and its attributes and characteristics would be beyond the scope of this paper. Therefore, we analyze only a selection of data sources sufficient to demonstrate the functionality of the process model:

  • The main internal data source (high level) in the information search process step is the company’s website, or more precisely the company’s web server. On the middle level, the contained data spots are primarily customer reviews and clickstream data [4]. On the low level, which contains the data characteristics, the reviews are poly-structured (i.e., text, evaluation scheme, time of creation, and user name), written in the customers’ national language, and created on a sporadic basis. Furthermore, because they are stored on the company’s own servers, the acquisition costs are low in the first step. However, due to the low degree of structure of the text and potentially noisy data, the data preprocessing is time-consuming and therefore cost-intensive. Reviews are human-generated at the individual level and have been available since the product has been sold in the online shop. As the second main data spot, the redirection to the company’s website after clicking on advertisements creates individual user journeys (clickstream data including information on which user clicked on which type of online advertisement at which point in time and finally bought a product). These data have a high degree of structure and can be accessed free of charge because the telecommunication company in focus runs its own web server. The data are machine-generated at the individual level. Therefore, less time is required for data preprocessing than for the customer reviews.

  • The data sources and spots identified so far inside the company are enriched in the next step by the external data perspective. On a high level, websites of other online shops selling the product or service, such as Amazon.com, product review websites of magazines and product-related fora are additional data sources. The contained data spots include the review texts and ratings, the time stamp and the reviewer’s profile (e.g., number of reviews written, products reviewed so far). As with the review data from the company’s website, the data are poly-structured, available since the product has been sold in the respective online shop, and generated at irregular intervals. The information value differs significantly across reviews and depends on the length of the review, the active vocabulary used and the reviewer’s intention [28]. In addition, fora may contain phony reviews by reputation management agencies that are designed to influence product sales. Therefore, the data preprocessing effort is high. The difference from internal reviews is the absence of an API to access and store the data. Therefore, the acquisition costs are higher than those for internal review data, and access is not always possible due to crawling limitations.

  • The next process step is the contact of the potential customer with a displayed advertisement (i.e., individual “view” touch point events of individual users with display advertisements). Because the company has outsourced its online advertising activities, the related data source is an external advertising server. The contained data include the cookie ID, the type of advertisement displayed (e.g., banner or pop-up; here, a banner), the timestamp, the display duration, the location (URL, on-page position, and size) and whether the advertisement has been viewed (y/n) and clicked (y/n). These data have a high degree of structure and contain little to no noisy data because they are machine-generated. On the downside, the data are cost-intensive because they must be purchased from the advertising agency.

  • The data source for the final process step, the potential conversion, is again the company’s web server, which contains the same data used in the first information-gathering step (internal clickstream data). Additional data spots include the conversion (y/n), the products in the shopping cart and the time of a potential cart abandonment.

The structured process leads to numerous potential data sources with heterogeneous characteristics whose analysis may generally be useful in reducing display advertising costs. However, each of the data sources has a different expected level of contribution to the information goal. For example, the internal data sources may include directly available information about how display advertising affected consumers’ decision and buying processes, which helps companies optimize display advertising activities [2, 30, 33], whereas the externally available data sources, such as customer reviews, only have indirect effects on the effectiveness of display advertising activities and, therefore, will not directly contribute to the information goal.

Following the principle of first considering data that are easy to generate and analyze and that are expected to contribute to the information goal, we anticipate that the internal clickstream data offer deep insight into consumer online clicking and purchasing behavior. Based on these clickstream data, which contain highly detailed user-level information, we are able to analyze user clicking and purchasing behavior. The results are intended to contribute to the information goal of reducing display advertising costs given the same output, i.e., the same number of sales.

3.2 Analyzing Clickstream Data

The telecommunication company in question runs multiple advertising campaigns. As discussed above, the company generates highly detailed user-level data that contain time-specific touch points for individual users with multiple advertising channels. Analyzing the advertising-specific attribution to the overall advertising success (e.g., sales) is an ongoing problem that is the focus of recent scientific research because the options for online advertising have become increasingly complex, leading to the necessity of making sophisticated decisions [29]. For example, because companies run multiple online advertising campaigns simultaneously, individual consumers are often exposed to more than one type of online advertising before they click or purchase. Standalone metrics, such as click-through rates, which are the ratio of clicks to impressions, or conversion rates, defined as the number of purchases in relation to the number of clicks, are not able to realistically assign these clicks and purchases to a specific type of online advertising. These metrics neither explain the development of consumer behavior over time (i.e., a consumer is first exposed to a display advertisement, later searches for the advertised product, and finally purchases it) nor account for the potential effects of interaction among multiple types of online advertising.

[30] have recently demonstrated how having and analyzing clickstream data can explain consumer online behavior and consequently optimize online advertising activities. Therefore, we follow [30] in modeling clickstream data and analyzing individual consumer purchasing behavior. That is, we interpret all interactions with advertisements as a repeated number of discrete choices [4]. For example, consumers can decide whether to buy a product after clicking on an online advertisement, which results in a conversion/non-conversion decision. Note that we model the consumer choice of buying or not buying (binary choice) by incorporating the effects of repeated interaction with multiple types of online advertising as explanatory variables. As already demonstrated by [6], it is useful to consider short-term advertising effects on consumers’ success probabilities by adding variables to the model specification that vary across time t with each advertisement interaction (\(X_{ist}\)) as well as their long-term effects by incorporating variables that only vary across sessions s (\(Y_{is}\)). To model the individual contribution of each advertising effort and its effect on the probability that a consumer i will purchase, we specify a binary logit choice model following the specification of [30]. The probability that consumer i purchases a product at time t in session s is modeled as follows:

$$\begin{aligned} Conv_{ist}= {\left\{ \begin{array}{ll} 1 &{} \text {if user}\;i\;\text {purchases at time}\;t\;\text {in session}\;s\\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(1)

with the probability

$$\begin{aligned} P(Conv_{ist}=1)&= \dfrac{exp(\alpha _i + X_{ist}\beta _i + Y_{is}\gamma _i + \epsilon _{ist} )}{1+exp(\alpha _i + X_{ist}\beta _i + Y_{is}\gamma _i + \epsilon _{ist})}, \end{aligned}$$
(2)

where \(X_{ist}\) are variables varying within (t), across sessions (s), and across consumers (i); \(Y_{is}\) are variables varying across sessions (s) and consumers (i); and \(\alpha _i\), \(\beta _i\), and \(\gamma _i\) are consumer-specific parameters to be estimated.

\(\alpha _i\) accounts for the propensity of an individual consumer to purchase a product after clicking on a respective advertisement. For example, previous research indicates that consumer responses to banner advertisements are highly dependent on individual involvement [8, 11] and exhibit strong heterogeneity [6].
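As a minimal illustration of Eq. (2), the following Python sketch evaluates the deterministic part of the choice probability (the error term \(\epsilon _{ist}\) is omitted); all variable names and the example values are our own assumptions and only illustrate the mechanics of the binary logit specification.

```python
import numpy as np

def conversion_probability(alpha_i, beta_i, gamma_i, x_ist, y_is):
    """Binary logit purchase probability for consumer i at touch point t in
    session s, cf. Eq. (2); the error term epsilon_ist is omitted here."""
    utility = alpha_i + x_ist @ beta_i + y_is @ gamma_i
    return np.exp(utility) / (1.0 + np.exp(utility))

# Illustrative call with invented parameter values and covariates:
p = conversion_probability(
    alpha_i=-5.0,
    beta_i=np.array([-0.5, 0.4]),    # effects of two within-session covariates
    gamma_i=np.array([0.1, -0.2]),   # effects of two inter-session covariates
    x_ist=np.array([1.0, 0.0]),      # e.g., one click on the first channel
    y_is=np.array([2.0, 1.0]),       # e.g., clicks in previous sessions
)
print(p)  # a small probability, as expected for a rare conversion event
```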

To account for the effects within a consumer’s current session across multiple advertising types, we follow Nottorf and Funk (2013) and define the following variables incorporated by \(X_{ist}\):

$$\begin{aligned} X_{ist}&= \lbrace x^\text {search}_{ist}, x^\text {social}_{ist}, x^\text {display}_{ist}, x^\text {affiliate}_{ist}, x^\text {newsletter}_{ist},\\&\qquad x^\text {other}_{ist}, x^\text {brand}_{ist}, x^\text {direct}_{ist}, x^\text {conv}_{is(t-1)}, \text {Conv}_{is(t-1)},\text {TLConv}_{ist}\rbrace .\nonumber \end{aligned}$$
(3)

We expect the effect of repeated clicks on advertisements to vary depending on the type of online advertising that is being clicked on. Thus, \(x^\text {search}_{ist}\), ..., \(x^\text {other}_{ist}\) refer to the cumulative number of clicks on the respective type of advertisement.Footnote 1 \(x^\text {brand}_{ist}\) accounts for the cumulative number of brand-related interactions (e.g., the search query of the consumer included the company’s name). \(x^\text {direct}_{ist}\) refers to the cumulative number of direct visits of a consumer (e.g., via direct type-in or the use of bookmarks). \(x^\text {conv}_{is(t-1)}\) is the cumulative number of conversions until the consumer’s last touch point (\(t-1\)) in the current session s. \(\text {Conv}_{is(t-1)}\) is an indicator function that assumes the value 1 if a consumer has purchased in \(t-1\). \(\text {TLConv}_{ist}\) refers to the logarithm of time since a consumer’s last purchase. If a consumer has not yet purchased, the variable remains zero.

Table 2. Descriptive statistics of the variables used in the final model specification.

The variables \(Y_{is}\) are similar to those specified as \(X_{ist}\), but now account for the long-term, inter-session effects of previous touch points of a consumer:

$$\begin{aligned} Y_{is}&= \lbrace y^\text {search}_{is}, y^\text {social}_{is}, y^\text {display}_{is}, y^\text {affiliate}_{is}, y^\text {newsletter}_{is}, \\&\qquad y^\text {other}_{is}, y^\text {brand}_{is}, y^\text {direct}_{is}, y^\text {conv}_{i(s-1)}, \text {IST}_{is}, \text {Session}_{is} \rbrace .\nonumber \end{aligned}$$
(4)

\(y^\text {search}_{is}\), ..., \(y^\text {other}_{is}\) refer to the number of clicks on respective advertisements in previous sessions. \(y^\text {brand}_{is}\), \(y^\text {direct}_{is}\), and \(y^\text {conv}_{i(s-1)}\) also account for the total number of respective interactions in previous sessions. \(\text {IST}_{is}\) is the logarithm of the intersession duration between session s and \(s-1\) and remains zero if a consumer is active in only one session. \(\text {Session}_{is}\) refers to the number of sessions during which a consumer has been activeFootnote 2.
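To indicate how the cumulative counters behind \(X_{ist}\) and \(Y_{is}\) could be built from raw touch-point logs, the following sketch counts clicks per channel within the current session and across completed sessions. The input format, the field names and the choice to count only touch points prior to the current one are our own simplifying assumptions; brand and direct interactions, the conversion counters and the time variables are omitted.

```python
from collections import defaultdict

CHANNELS = ["search", "social", "display", "affiliate", "newsletter", "other"]

def cumulative_click_features(touchpoints):
    """touchpoints: chronologically ordered list of dicts for ONE consumer with
    keys 'session' (int), 'channel' (one of CHANNELS) and 'click' (bool).
    Returns, per touch point, the cumulative clicks per channel within the
    current session (x_*) and over all completed sessions (y_*)."""
    features = []
    within = defaultdict(int)    # clicks so far in the current session
    previous = defaultdict(int)  # clicks in all completed sessions
    current_session = None
    for tp in touchpoints:
        if tp["session"] != current_session:
            for channel, clicks in within.items():
                previous[channel] += clicks   # roll counters into session history
            within.clear()
            current_session = tp["session"]
        row = {f"x_{c}": within[c] for c in CHANNELS}
        row.update({f"y_{c}": previous[c] for c in CHANNELS})
        features.append(row)
        if tp["click"]:
            within[tp["channel"]] += 1
    return features
```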

3.3 Empirical Data

The dataset analyzed consists of information on individual consumers and the point in time at which they clicked on different advertisements and made purchases. The internal clickstream data were collected within a one-month period in 2013 and consist of more than 500,000 unique users. Because no information on the number of consumer sessions and their duration is accessible, we follow [6, 29] and manually define a session as a sequence of advertising exposures with breaks that do not exceed 60 min. We report the descriptive statistics of our final set of variables in Table 2. To test the out-of-sample fit performances of the model, we split the data into a training sample (50,000 consumers) and a test group (470,906 consumers). The dataset has been sanitized, and we are unable to provide any further detailed information on the dataset for reasons of confidentiality.
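The 60-minute session rule could be implemented along the following lines; the input format and variable names are assumptions made for this sketch.

```python
from datetime import timedelta

SESSION_BREAK = timedelta(minutes=60)

def assign_sessions(exposures):
    """exposures: list of (user_id, timestamp) tuples sorted by user and time.
    A new session starts whenever the gap to the user's previous advertising
    exposure exceeds 60 minutes."""
    result = []
    last_user, last_ts, session_id = None, None, 0
    for user_id, ts in exposures:
        if user_id != last_user:
            session_id = 1                      # first session of a new user
        elif ts - last_ts > SESSION_BREAK:
            session_id += 1                     # break longer than 60 minutes
        result.append((user_id, ts, session_id))
        last_user, last_ts = user_id, ts
    return result
```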

3.4 Results and Discussion

Similar to Nottorf and Funk (2013), we use a Bayesian standard normal model approach to account for consumer heterogeneity and to determine the set of individual parameters. We apply a Markov Chain Monte Carlo (MCMC) algorithm including a hybrid Gibbs Sampler with a random walk Metropolis step for the coefficients for each consumer [31]. We perform 5,000 iterations and use every twentieth draw of the last 2,500 iterations to compute the conditional distributions.
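For readers unfamiliar with the estimation procedure, the following strongly simplified sketch shows a single random-walk Metropolis update for one consumer's coefficient vector. In the actual hybrid Gibbs sampler, the acceptance ratio additionally includes the consumer-level normal prior implied by the population draws, which we omit here; all function names are our own.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Binary logit log-likelihood for one consumer.
    X: touch points x covariates (incl. a leading 1 for the intercept),
    y: observed 0/1 conversions."""
    u = X @ theta
    return float(np.sum(y * u - np.log1p(np.exp(u))))

def rw_metropolis_step(theta, X, y, step_scale, rng):
    """One random-walk Metropolis update of the consumer-level coefficients."""
    proposal = theta + step_scale * rng.standard_normal(theta.shape)
    log_ratio = log_likelihood(proposal, X, y) - log_likelihood(theta, X, y)
    if np.log(rng.uniform()) < log_ratio:
        return proposal   # accept the proposed draw
    return theta          # reject and keep the current draw
```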

Table 3. Parameter estimates of the proposed model. We report the mean and the 95 % coverage interval; significant estimates are in boldface.

The parameter estimates for \(X_{ist}\) and \(Y_{is}\) can be found in Table 3. The mean of the intercept \(\alpha _i\), which accounts for the initial “proneness to purchase” (following [6]), is –5.85. This results in a very low initial conversion probability of 0.29 %. In contrast to the prior findings of Nottorf and Funk (2013), who modeled click probabilities, only a few significant parameter estimates exist. For example, whereas each additional click on a social media \(x^\text {social}_{ist}\) or display \(x^\text {display}_{ist}\) advertisement significantly decreases conversion probabilities within consumers’ current sessions (–6.08 and –1.14), consumers’ clicks on the remaining channels do not significantly influence conversion probabilities. However, although the parameter estimates of the remaining channels are not significant, they still affect conversion probabilities in different directions. For example, \(x^\text {search}_{ist}\) is negative, with a value of –0.58, indicating that each additional click on a paid search advertisement within a consumer’s current session decreases the conversion probability. Conversely, \(x^\text {newsletter}_{ist}\) = 0.48 is positive, so each additional click on newsletter links slightly increases the probability of a purchase.
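The reported baseline probability follows directly from the mean intercept with all covariates set to zero:

$$\begin{aligned} P(Conv_{ist}=1 \mid X_{ist}=0, Y_{is}=0) = \frac{exp(-5.85)}{1+exp(-5.85)} \approx 0.0029 = 0.29\,\%. \end{aligned}$$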

To demonstrate how the analysis of clickstream data can optimize display advertising efficiency, we propose a method for short-term decision support in real-time bidding (RTB).Footnote 3 Therefore, we first highlight the out-of-sample fit performance of our proposed model by predicting the actual outcome for the last available touch point of each consumer from the test data set (conversion/no conversion) and comparing these predictions with the actual, observed choices. Furthermore, we rank all of these consumers by their individual conversion probabilities at the last touch point, separate them into quartiles, and examine how many conversions each of the quartiles actually receives (Table 4).Footnote 4 For example, the quartile with the lowest 25 % of predicted conversion probabilities receives only a small share of the total 2,017 conversions that were observed at the last available touch point of each consumer from the test data set, whereas the 25 % of consumers with the highest conversion probabilities (75–100 %) receive nearly 50 %. Shifting bidding behavior and advertising spending toward this upper quartile bin may thus lead to improved short-term decision support and potential financial savings and, therefore, contribute to the overall strategic goal of reducing the CPO.

Table 4. Quartiles are grouped by predicted conversion probabilities for n = 470,906 consumers. In Scenario 1 (2), a CPC of €0.50 (€0.30) is assumed to calculate the CPO.

Based on the forecast for each consumer conversion-probability quartile, we can calculate the expected quartile-specific conversion rate (CVR). Let us now assume that the company in question actually engages in an RTB setting. Depending on the individual setting (i.e., the contribution margin of the advertised product), companies usually determine a specific maximum amount of money that they are willing to spend to acquire new customers (i.e., the maximum CPO the company is able to spend). In the following example, we consider two scenarios, each with a different cost per click (CPC), which results in different CPOs depending on the expected CVRs (the right side of Table 4). To be clear, let us consider an example and assume a maximum CPO of €75.00. Given that maximum,Footnote 5 we see that in Scenario 1, only the consumers within the quartile bin 75–100 % should be exposed to display advertisements because the expected CPO of the other consumers is higher than €75.00. A company that does not have the clickstream information would not have exposed any consumers to display advertisements in the first scenario because it would not have categorized consumers according to their individual conversion probabilities; with €116.28, the total expected CPO is higher than the maximum CPO. In the second scenario with a decreased CPC, such a company would expose all consumers to display advertisements, although only the consumers with the highest expected CVR have a CPO that is lower than the maximum CPO (€36.59).
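The underlying decision rule is simple: the expected CPO of a quartile equals the CPC divided by the quartile's expected CVR, and display advertisements are only bought for quartiles whose expected CPO stays below the maximum CPO. The sketch below uses placeholder CVRs, not the values from Table 4.

```python
def expected_cpo(cpc, cvr):
    """Expected cost per order when paying cpc per click and converting at rate cvr."""
    return cpc / cvr

def quartiles_to_target(cpc, quartile_cvrs, max_cpo):
    """Quartile bins whose expected CPO does not exceed the maximum acceptable CPO."""
    return [q for q, cvr in quartile_cvrs.items() if expected_cpo(cpc, cvr) <= max_cpo]

# Placeholder CVRs per predicted-probability quartile (illustrative only):
cvrs = {"0-25%": 0.001, "25-50%": 0.002, "50-75%": 0.004, "75-100%": 0.008}
print(quartiles_to_target(cpc=0.50, quartile_cvrs=cvrs, max_cpo=75.0))
# -> ['75-100%'] with these figures (0.50 / 0.008 = 62.50 <= 75.00)
```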

The procedure outlined above leads to additional profit (\(profit_{add}\)) in contrast to a company that does not analyze clickstream data and consequently does not optimize its display advertising activities. To illustrate this result for Scenario 1, we must consider the opportunity cost of a “lost” conversion (\(cost_{opp}\)), i.e., of a consumer whom we do not expose to display advertisements because we focus only on the consumers with the topmost conversion probabilities, multiplied by the number of lost conversions (\(conv_{lost}\)). Simultaneously, we save the advertising costs for those consumers (\(user_{lost}\)) whom we do not expose to display advertising due to an expected CPO that is too high:

$$\begin{aligned} profit_{add} = user_{lost} * \text {CPC} - conv_{lost}*cost_{opp} \end{aligned}$$
(5)

We assume that the cost of a lost conversion is equal to the maximum CPO (€75.00). Given that assumption, the expected profit is €26,828.85 for Scenario 2.Footnote 6 In the first scenario, a company that does not use the information derived from clickstreams would lose €13,286.75 because it misses the 25 % of consumers with the highest predicted probabilities.Footnote 7 Please note that this profit/loss is a sample calculation and may not hold true for every hour/day iteration. Nonetheless, this example demonstrates how analyzing clickstream data contributes not only to the information goal of reducing display advertising costs but also to the overall strategic goal of reducing the global CPO.
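Eq. (5) translates directly into code; the numbers below are purely hypothetical, since the exact consumer and conversion counts behind the reported figures are given only in the footnotes of the original article.

```python
def additional_profit(user_lost, cpc, conv_lost, cost_opp):
    """Eq. (5): advertising costs saved on consumers excluded from display
    advertising minus the opportunity cost of the conversions lost by excluding them."""
    return user_lost * cpc - conv_lost * cost_opp

# Hypothetical example: 100,000 excluded consumers at a CPC of EUR 0.50 and
# 300 lost conversions, each valued at the maximum CPO of EUR 75.00.
print(additional_profit(user_lost=100_000, cpc=0.50, conv_lost=300, cost_opp=75.00))
# -> 27500.0 (EUR)
```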

4 Conclusions

The increasing amount of available data with heterogeneous characteristics regarding structure, velocity and volume hinders the selection of data for decision support purposes. The existing models primarily target the information requirement analysis for data warehouse development but do not support the data evaluation process in the early stages of data analysis for decision support.

We developed a data landscape that enhances both the data selection and the decision support process. The proposed framework incorporates the derivation of specific goals whose fulfilment enhances decision support, the identification of related business processes, and the selection of relevant data for each process step.

We tested the framework to enhance decision support in online advertising, partly by using approaches for information requirement analysis from the data warehouse and information system literature. Based on the derived information goal of optimizing display advertising spending, we have found that the internally available clickstream data offer deep insights into consumer online clicking and purchasing behavior. Applying the model of [30], we successfully analyzed and predicted consumers’ individual purchasing behavior to optimize display advertising spending.

The developed artifact could be evaluated successfully against both (i) the identified research gap and (ii) the real world. For the first evaluation target, the design science research guidelines of [17] were applied. The real world evaluation demonstrated potential monetary benefits.

The utility of the process model for the development of a data landscape could be demonstrated because the model helps identify, classify, characterize and evaluate data in ways that can contribute to decision making. The characterization of data spots related to the business process fosters understanding of the data and their attributes for decision support purposes. The absence of such a model can lead to an incomplete basis for decision making. A limitation of the presented model results from the nature of processes, which have a static character and do not completely account for customer behavior, e.g., multiple runs through the process of information gathering.

The presented process model suggests different opportunities for further research. The proposed model was applied in the field of online advertising. It should also be tested in different scenarios to determine the degree of possible generalization and application-specific needs, particularly with regard to the identification of the related business process. Furthermore, the development of a graphical representation could foster the decision making process.