1 Introduction

A firm that can predict customer purchase behavior will achieve numerous benefits, including an improved customer acquisition rate, increased sales, and enhanced competitiveness. Research efforts from the marketing and customer relationship management (CRM) perspectives have made major strides in predicting customer purchase behavior. For example, market basket analysis [29, 31] examines groups of items purchased in supermarkets or other outlets to identify purchase patterns, and discrete choice models [16, 30] predict which products a specific customer is likely to select from a candidate product set. Certain studies abstract the prediction of purchase behavior into a classification task [7, 10] that requires customers’ demographic features, such as age, gender, education and occupation.

Unlike traditional businesses, firms in the e-commerce context find it difficult to obtain information about customer demographics or family background because these data are usually regarded as private [19]. Rather, it is more convenient to access customer reviews, product ratings and browsing histories. Therefore, the methods and algorithms used to predict customer purchase behavior in the traditional business context must be modified for e-commerce. To meet the challenge of predicting purchase behavior in the e-commerce context, we first examine the purchasing decision process of customers. Guo and Barnes [8] propose a three-stage purchasing decision process in the e-commerce context, which is illustrated in Fig. 1.

Fig. 1 Consumer behavior and the purchasing decision in the e-commerce context

The first stage, problem/motivation recognition, captures consumers' perceptions of how products may help them to bridge the gap between their desired and actual states. During the second stage, a customer seeks information about product performance and other criteria and evaluates product alternatives based on price, brand and other attributes.

Recommender system technology has been widely adopted by e-commerce websites. This technology not only provides appropriate recommendations to users but also yields substantial profits for the service provider. Nevertheless, we believe that traditional recommendation algorithms cannot effectively predict purchasing behavior in the e-commerce context, for three reasons:

  1. Recommender systems [12, 17, 32, 33] predict the items most likely to interest a customer either by exploiting product ratings by other customers with similar tastes [collaborative filtering (CF)] or by using the target customer's past product ratings (content-based recommendations). However, the rating predicted by a recommender system for a candidate product indicates only the impression the customer is likely to have of that product, i.e., it predicts a "like" signal. This is a far cry from the goal of predicting purchase behavior; according to our experiments, using CF alone to predict customer purchase behavior leads to poor performance. Reference [28] notes that a consumer usually makes purchasing decisions based on a product's marginal net utility: a rational consumer chooses to purchase the product that maximizes total net utility.

  2. Many works on recommender systems focus on the user–item matrix and employ tensor factorization. However, those studies consider only the associations among users and items and ignore rich product features. We believe that such side information plays an important role in the second stage of the purchasing decision process depicted in Fig. 1.

  3. Recommender systems usually do not address the first stage of the purchasing decision process shown in Fig. 1; rather, they address only a single component of the second stage. In addition, some recommender systems aim to increase product awareness [4] and thus consider freshness and novelty in product recommendations [18]. Such systems cannot be employed to perform the predictive task.

In this study, we propose a two-stage predictive framework. First, customers' motivations are investigated. However, not all motivations that affect purchases of real products can be observed in the e-commerce context; Maslow's psychological and safety needs are one such example. We exploit the associations among products to predict a customer's motivations. The second stage of the framework addresses the numerous factors that affect customer decision making, such as price [25], brand preference and economic needs [23]. We integrate side information about products into the predictive task by learning customers' preferences for particular product features. Furthermore, we leverage these preferences to select a collection of candidate products based on the customer's motivations. When a customer submits one purchased product to the predictive framework, the framework returns the top n products that the customer is most likely to purchase in the future.

The remainder of this paper is organized as follows. Section 2 reviews the related research on predicting customer purchase behavior. Section 3 introduces the proposed predictive framework. Section 4 presents the experimental results. Section 5 contains the analysis of customer characteristics in the experimental dataset. Section 6 presents the conclusions.

2 Related work

Three principal research streams are related to predicting customer purchase behavior. Researchers in the marketing and retailing fields have expended significant effort to predict customer purchase behavior, which can help firms to identify likely purchasers of their products or services and to implement cross-selling and up-selling campaigns. The association rule technique developed for market basket analysis has become a popular method for identifying customers' purchase patterns by extracting associations, or co-occurrences, from stores' transaction databases. These purchase patterns can then be exploited to predict customers' future purchase behavior. Wong et al. [29] conceive the loss rule, which is similar to the association rule, to model the cross-selling effect. Mobasher et al. [15] use frequent item mining techniques to discover sequential patterns that are used to generate recommendations. Yang et al. [31] define online shopping patterns and develop methods to perform market basket analysis across websites. Discrete choice models may be employed to analyze customer preferences, which play an important role in purchasing decisions. Yang and Allenby [30] introduce a Bayesian autoregressive discrete choice model to study preference interdependence among individual consumers. Moon and Russell [16] develop a product recommender system based on the autologistic choice model. However, we believe that using association rules to identify purchase patterns completes only the first stage of the purchasing decision process illustrated in Fig. 1.

The CRM system of a firm generally maintains a large quantity of customer data, including age, gender, income and purchase lists. Data mining techniques are often employed in analytical CRM to transform this large volume of data into valuable knowledge that may be used to support marketing decision making. Based on such data mining techniques, customers can be segmented into clusters with internally homogeneous and mutually heterogeneous characteristics [9]. Customers can also be ranked according to their probability of behaving in a certain way (e.g., buying a specific product or responding to a particular marketing campaign). With the help of segmentation and ranking schemes, a firm may approach a carefully selected group of customers, which leads to higher success rates for its marketing campaigns [26]. In general, it is difficult to acquire customer demographic information such as income, age and gender in the e-commerce context because such information is usually regarded as private [19]. Instead, it is easier to acquire web access information such as product reviews and ratings, which provide richer information than traditional CRM data. Therefore, CRM prediction techniques based on private customer information do not transfer well to the e-commerce domain.

E-commerce firms generally use recommender systems to predict customer purchase behavior. Recommender systems based on a Markov chain model utilize sequential basket data by predicting the user’s next action based on his/her last action. Zimdars et al. [33] describe a sequential recommender based on Markov chains and investigate how to extract sequential patterns to learn the next state using a standard predictor [e.g., a decision tree (DT)]. Mobasher et al. [15] use pattern mining methods to discover sequential patterns that are used to generate recommendations. Nevertheless, this category of recommender system focuses only on associations among products and is not suitable for predicting purchase behavior.

Most recommender systems recommend products based on the user's entire history. Extensive effort has been expended to develop factorization-based CF approaches. Reference [12] uses customers' purchase and click histories to build customer profiles and then incorporates these profiles into a one-class CF model; the factorization results are used to make product recommendations. Paquet and Koenigstein [17] present a novel Bayesian generative model for implicit CF. Specifically, they incorporate random graphs, which can be leveraged to predict the presence of edges in a CF model, into an inference procedure. As a result, their model is able to extract an explicit "like" probability that is largely agnostic to product popularity. Rendle et al. [22] develop a factorized personalized Markov chain (FPMC) model that subsumes both a common Markov chain and the normal matrix factorization model. Experiments show that the FPMC model outperforms both common matrix factorization and the unpersonalized Markov chain model.

Content-based filtering emphasizes a focal customer’s preferences and can recommend products that have received few or even no ratings from other customers [20]. Content-based filtering builds a preference profile for each customer, uses it to search prospective products with attributes or characteristics that are highly relevant and similar to those specified by the focal customer’s profile, and makes product recommendations accordingly [3]. Content-based filtering can be supported by traditional inductive learning techniques that construct automated classifiers based on important patterns in customer preference profiles.

However, traditional recommendation algorithms often recommend the products with the highest predicted ratings. This is a far cry from the goal of predicting purchase behavior. Reference [6] maintains that the price factor, treated as a quantitative feature, can significantly improve recommendation performance. Reference [28] notes that a rational consumer chooses to purchase the product with the maximum total net utility. Traditional recommendation algorithms ignore the significant impact of product features on customer purchasing decisions. In this study, we seek to incorporate customers' preferences for product features into predictions of purchasing behavior.

3 Predicting customer purchase behavior

This section proposes a framework to predict customers’ purchase behavior.

3.1 Motivation of the framework

As an initial step, we investigate the decision-making process in online shopping. Let \(c_k\) be a customer and \(d_i\) and \(d_j\) be two products. We use \(P(d_i)\) to denote the probability that product \(d_i\) is purchased and \(P(d_j|d_i)\) to denote the conditional probability that a user who purchases item \(d_i\) will purchase item \(d_j\) as well [11]. Suppose that customer \(c_k\) is likely to purchase \(d_i\) with probability \(P(d_i|c_k)\) [4] and that \(P(c_k)\) is a prior probability with a uniform distribution. Then \(P(c_k, d_j) = P(d_j|c_k)P(c_k)\) also denotes the probability that \(c_k\) purchases product \(d_j\). Furthermore, \(P(c_k, d_j|c_k, d_i) = P(d_j|c_k, d_i)\) denotes the probability that after \(c_k\) purchases \(d_i\) at time t, \(c_k\) will also purchase \(d_j\) later. By Bayes' rule,

$$ P\left( {d_{j} |c_{k} ,\;d_{i} } \right) = \frac{{P(d_{i} |c_{k} ,\;d_{j} )P(d_{j} ,\;c_{k} )}}{{P(d_{i} ,\;c_{k} )}}. $$

If the event of \(c_k\) purchasing a product is random, i.e., uncorrelated with the purchase of \(d_i\), the probability becomes

$$ P\left( {d_{j} |c_{k} ,\;d_{i} } \right) = \frac{{P(d_{j} |d_{i} )P(d_{j} |c_{k} )}}{{P(d_{j} )}}. $$

Let \(\omega = \{d_{1}, \ldots, d_{i-1}, d_{i+1}, \ldots, d_{m}\}\) be a collection of candidate products. We calculate \(P(d_j|c_k, d_i)\) for each product \(d_j \in \omega\) and then rank the results. Inspired by the query likelihood model [14, 21], in which the prior probability of a document \(P(doc)\) is often treated as uniform across all documents to reduce model complexity, we likewise assume that the prior probability of a product \(P(d)\) is uniform across all products. Therefore, \(P(d_j)\) can be ignored for ranking purposes. A predictive framework may then be specified as

$$ P\left( {d_{j} |c_{k} ,\;d_{i} } \right) \propto P\left( {d_{j} |d_{i} } \right)P\left( {d_{j} |c_{k} } \right). $$

Parameter \(P(d_j|c_k)\) may also be interpreted as \(c_k\)'s preference for \(d_j\). Each product has many features, and a customer may have preferences regarding these features. We make a conditional independence assumption: for each customer \(c_k\), his/her preferences for the features \(\{f_{j1}, \ldots, f_{jn}\}\) of product \(d_j\) are independent of each other. Thus, \(P(d_j|c_k)\) may be written as

$$ P\left( {d_{j} |c_{k} } \right) = P\left( {f_{j1} , \ldots ,f_{jn} |c_{k} } \right) = \prod\limits_{l = 1}^{n} {P\left( {f_{jl} |c_{k} } \right)} . $$

In other words, customer \(c_k\)'s preferences for these features determine the probability of \(c_k\) purchasing \(d_j\). We propose a predictive framework for customer purchase behavior, called the CustOmer purchase pREdiction modeL (COREL), which takes the form

$$ P\left( {d_{j} |c_{k} ,\;d_{i} } \right) = \frac{1}{Z}P\left( {d_{j} |d_{i} } \right)\prod\limits_{l = 1}^{n} {P\left( {f_{jl} |c_{k} } \right)} , $$

where Z is a normalization factor.
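To make the scoring concrete, the following minimal Python sketch ranks candidate products by the unnormalized COREL score. The function names and data layout are illustrative assumptions; the factor Z cancels in the ranking and can be skipped.

```python
import numpy as np

def corel_score(p_assoc, feature_prefs):
    """Unnormalized COREL score for one candidate d_j:
    P(d_j | d_i) * prod_l P(f_jl | c_k)."""
    return p_assoc * float(np.prod(feature_prefs))

def rank_candidates(candidates, top_n=10):
    """candidates maps product id -> (P(d_j|d_i), [P(f_j1|c_k), ...]).
    Returns the top_n product ids by COREL score."""
    ranked = sorted(candidates,
                    key=lambda d: corel_score(*candidates[d]),
                    reverse=True)
    return ranked[:top_n]
```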

Jingdong (www.jd.com) is a well-known B2C e-commerce website in the People's Republic of China. Figure 2 shows the product-related information presented on Jingdong. An analysis of these product features indicates that they may be classified into three categories.

Fig. 2 Product information presented on a Chinese e-commerce website

  • Product features that are depicted only in images, such as product appearance (e.g., color, size and style); these may easily be recognized by the user but not by an automated program.

  • Dynamic product features. In the e-commerce context, certain product information changes each time the product is purchased or rated, such as the number of reviews, the average rating and sales.

  • Product features that are static and observable by the analyst, e.g., price and brand.

It is necessary to employ different methods to learn users’ preferences for these three categories of product features. This study conceives of the following methods:

  (a) It may be extremely difficult to learn customers' preferences for dynamic product features. However, we find that many customers exhibit similar tendencies with respect to some of these features; for instance, customers generally feel inclined to buy hot-selling products with high ratings and numerous customer reviews. Therefore, we propose a model called the heat model to summarize these features and thereby calculate a popularity score for each candidate product based on the current status of its dynamic features.

  (b) Discrete choice models are widely used by firms to analyze individuals' choices among a set of products [30]; these choices reveal customer preferences for static and observable product features [2]. We develop a hierarchical Bayesian discrete choice model to learn customers' price sensitivity and brand preferences.

  (c) CF may be used to estimate a customer's rating for a particular e-commerce product by exploiting product ratings by customers with similar taste. We employ CF to learn customers' preferences for features that cannot be directly observed by the analyst.

3.2 Methodology

The predictive framework comprises a two-stage process. First, \(P(d_j|d_i)\) is used to predict the customer's motivations, i.e., to build a collection of candidate products ω. Second, customer \(c_k\)'s preferences for product features, \( \prod\nolimits_{l = 1}^{n} P(f_{jl} |c_{k} ) \), are used to determine which products in ω are most likely to be purchased. Figure 3 presents the general process of the predictive framework.

Fig. 3 The working process of COREL

Estimating the parameter \(P(d_{j}|d_{i})\), constructing the heat model and developing the hierarchical Bayesian discrete choice model are the keys to building the predictive framework.

3.2.1 Estimating \(P(d_{j}|d_{i})\)

Parameter \(P(d_j|d_i)\) represents the association between products \(d_i\) and \(d_j\): if a customer bought \(d_i\), the parameter may reveal his/her motivation to buy \(d_j\). Market basket analysis can be employed to estimate this parameter. Specifically, when two products occur in the same market basket, it is generally assumed that an association exists between them. Using maximum likelihood estimation, \(P(d_j|d_i)\) takes the form

$$ P\left( {d_{j} |d_{i} } \right) = \frac{{|d_{i} \cap d_{j} |}}{{|d_{i} |}}, $$
(1)

where \(|d_i|\) denotes the number of times product \(d_i\) has been purchased and \(|d_i \cap d_j|\) is the frequency with which products \(d_i\) and \(d_j\) co-occur in the same market basket. However, the experiment discussed in Sect. 4.3 demonstrates that the collection of product candidates developed using formula (1) is so small that COREL fails to achieve good performance.

Therefore, we propose to build associations between categories and then obtain product candidates from the categories associated with a particular product. Generally, e-commerce websites organize products into multi-level categories. For instance, Jingdong uses three category levels; the categories of the "EPSON LQ-630k Printer", in order from first level to third level, are "Computer or Office Equipment→Printing-related Office Equipment→Printer". We generate category associations at the third level using formula (2), where \(Thr(d_i)\) denotes the third-level category of product \(d_i\).

$$ P\left( {d_{j} |d_{i} } \right) = \frac{{|Thr(d_{i} ) \cap Thr(d_{j} )|}}{{|Thr(d_{i} )|}}. $$
(2)

The experiments discussed in Sect. 4.3 demonstrate that the association of categories can broaden the candidate collection and thereby lead to a better performance.
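As an illustration, the following Python sketch estimates the category-level associations of formula (2) by maximum likelihood from basket data; the data structures (`baskets`, `third_level`) are hypothetical stand-ins for the JD transaction data.

```python
from collections import Counter
from itertools import permutations

def category_associations(baskets, third_level):
    """Estimate P(Thr(d_j) | Thr(d_i)) from market baskets (formula (2)).

    baskets     -- iterable of sets of product ids bought together
    third_level -- dict mapping product id -> third-level category
    """
    cat_count = Counter()   # |Thr(d_i)|: baskets containing the category
    pair_count = Counter()  # |Thr(d_i) ∩ Thr(d_j)|: co-occurrence counts
    for basket in baskets:
        cats = {third_level[d] for d in basket}
        cat_count.update(cats)
        pair_count.update(permutations(cats, 2))
    return {(ci, cj): n / cat_count[ci]
            for (ci, cj), n in pair_count.items()}
```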

3.3 Heat model

As discussed above, it is difficult to learn customers’ preferences for dynamic product features. However, we find that many customers exhibit similar tendencies with respect to particular product features. Figure 2 illustrates four product-related features described on the Jingdong website: (1) Qr, the number of reviews; (2) Qs, the average rating; (3) Qa, the number of days since the product’s on-shelf date; and (4) Qu, the number of days since the most recent review. We seek to summarize customers’ preferences for these features by calculating a popularity score.

Prior work has shown that support vector regression (SVR) [27] is an excellent tool for predictive tasks. We develop an SVR-based model, called the heat model \(H(d_i)\), that calculates the popularity of products based on the features Qr, Qs, Qa and Qu. A training set is a necessary component for learning the heat model. To the best of our knowledge, no e-commerce website provides labeled data regarding product popularity. However, we observe that a visitor can often judge which of two online products is more popular. Inspired by this observation, we conceive the following steps to generate a training set (steps 1–3) and to train an SVR model using this training set (step 4). Figure 4 illustrates the process of building the heat model.

Fig. 4 The process of building the heat model

Step 1: Crowdsourcing is an online, distributed problem-solving and production model, and gathering data to train an algorithm is a common use of it [1, 24]. With this notion in mind, we develop a crowdsourcing system in which crowdworkers must select the more popular product from a pair of candidates displayed on a web page. The interface of this system is shown in Fig. 5.

Fig. 5 The interface of the crowdsourcing system

The system runs according to the following process: two products, A and B, are randomly selected from a product database; the webpage displays the features Qr, Qs, Qa and Qu for both products; and the participant selects the more popular candidate. If the choice is product A, the two instances listed in the following table are generated for the training set.

| Err_Qr | Err_Qs | Err_Qa | Err_Qu | Label |
|--------|--------|--------|--------|-------|
| Qr(A) − Qr(B) | Qs(A) − Qs(B) | Qa(A) − Qa(B) | Qu(A) − Qu(B) | 1 |
| Qr(B) − Qr(A) | Qs(B) − Qs(A) | Qa(B) − Qa(A) | Qu(B) − Qu(A) | −1 |

In the training set obtained with this system, one instance includes five fields: Err_Qr, Err_Qs, Err_Qa, Err_Qu and label. Qr(A) denotes the feature Qr of product A.

Step 2: A classifier must be constructed to compare the popularity of two products. Logistic regression (LR) has proven to be an excellent tool for binary classification problems, and the experiments discussed in Sect. 4.2 demonstrate that LR outperforms k-nearest neighbor (KNN) and DT models in classifying the crowdsourced data. Thus, we build an LR model f(φ), where φ is a vector whose elements Err_Qr, Err_Qs, Err_Qa and Err_Qu represent the differences between the features of the two products being compared:

$$ f(\varphi ) = \frac{\exp (\pi (\varphi ))}{1 + \exp (\pi (\varphi ))}, $$

where

$$ \pi (\varphi ) = \beta_{0} + \beta_{1} \times Err\_Qr + \beta_{2} \times Err\_Qs + \beta_{3} \times Err\_Qa + \beta_{4} \times Err\_Qu. $$

We employ the training set generated in step 1 to learn the parameters of the LR model.
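A minimal sketch of steps 1 and 2 together, assuming scikit-learn's LogisticRegression as the LR implementation (the text does not name a library); the sample judgment is invented purely for illustration.

```python
from sklearn.linear_model import LogisticRegression

def mirrored_instances(qa, qb, a_is_more_popular):
    """One crowdworker judgment on products A and B (feature dicts with
    keys 'Qr', 'Qs', 'Qa', 'Qu') yields two mirrored training instances."""
    diff = [qa[k] - qb[k] for k in ('Qr', 'Qs', 'Qa', 'Qu')]
    label = 1 if a_is_more_popular else -1
    return [(diff, label), ([-v for v in diff], -label)]

# Hypothetical judgment collected from the crowdsourcing system.
judgments = [({'Qr': 120, 'Qs': 4.5, 'Qa': 300, 'Qu': 2},
              {'Qr': 3, 'Qs': 3.0, 'Qa': 500, 'Qu': 90}, True)]
X, y = [], []
for qa, qb, choice in judgments:
    for features, label in mirrored_instances(qa, qb, choice):
        X.append(features)
        y.append(label)

lr = LogisticRegression().fit(X, y)   # learns beta_0 ... beta_4
```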

Step 3: To generate a training set for the SVR model, we collect a product set ω by randomly selecting 1000 products from Jingdong. Each product in the set must be assigned a popularity score; we propose algorithm 1 to accomplish this task.

In algorithm 1, the array Pd stores the calculated popularity of all products in ω within the range [0, 1]. V(a) denotes the feature vector of a product a.
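Because the listing of algorithm 1 is not reproduced here, the following Python sketch shows one plausible reading consistent with the text: each product's popularity in Pd is its win rate under the pairwise LR classifier from step 2, which lies in [0, 1]. The exact scoring rule of algorithm 1 may differ.

```python
import numpy as np

def popularity_scores(omega, lr_model):
    """Assign each product in omega a popularity score in [0, 1]
    (stored in the array Pd) by counting how often the pairwise LR
    classifier judges it more popular than the other products.

    omega    -- list of feature vectors V(a) = (Qr, Qs, Qa, Qu)
    lr_model -- fitted classifier from step 2 with a predict() method
    """
    n = len(omega)
    pd = np.zeros(n)
    for i in range(n):
        wins = 0
        for j in range(n):
            if i == j:
                continue
            diff = np.asarray(omega[i], float) - np.asarray(omega[j], float)
            if lr_model.predict(diff.reshape(1, -1))[0] == 1:
                wins += 1
        pd[i] = wins / (n - 1)
    return pd
```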

Employing algorithm 1 to calculate the popularity of each product in ω generates a training set for the SVR model. Two instances from the training set are shown in the following table, where score refers to the popularity of a product and ln(Qr + 1) denotes the natural log of the Qr attribute plus one.

| ln(Qr + 1) | Qs | ln(Qa + 1) | ln(Qu + 1) | Score |
|------------|----|------------|------------|-------|
| 1.0986 | 4 | 5.9054 | 5.8833 | 0.23539 |
| 0.69315 | 5 | 6.0497 | 5.9636 | 0.32821 |

Step 4: We use the training set generated in step 3 to train the SVR model, which we call the heat model. In this study, we compare ε-SVR and ν-SVR using the polynomial kernel and the radial basis function as kernel functions. Because there is little general guidance on setting the parameters of SVR, this study varies the parameters and selects the values that yield the best prediction performance. We use the LIBSVM software library [5] to build the SVR model. Experimental results show that ε-SVR with a radial basis function kernel fits our purpose, i.e., calculating popularity, well.

Given a set of data points \(\{(X_{1}, z_{1}), \ldots, (X_{l}, z_{l})\}\), where \(X_{i} \in R^{n}\) is an input and \(z_{i} \in R^{1}\) is a target, the standard form of ε-SVR [27] is

$$ \mathop {\min }\limits_{w,b,\xi ,\xi^{*}} \;\frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{l} {\xi_{i}} + C\sum\limits_{i = 1}^{l} {\xi_{i}^{*}} $$

subject to

$$ \begin{array}{ll} & w^{T} \phi \left( {X_{i}} \right) + b - z_{i} \le \varepsilon + \xi_{i}, \\ & z_{i} - w^{T} \phi \left( {X_{i}} \right) - b \le \varepsilon + \xi_{i}^{*}, \\ & \xi_{i}, \;\xi_{i}^{*} \ge 0, \quad i = 1, \ldots, l. \end{array} $$

Experimental analysis indicates that the heat model obtains the best performance in this study when the parameters of ε-SVR are C = 1 and ε = 0.3.
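For reference, a minimal sketch of step 4 using scikit-learn's SVR (which wraps LIBSVM). The two training rows are the example instances from the table in step 3; a real run would use all 1000 scored products.

```python
import numpy as np
from sklearn.svm import SVR  # scikit-learn's SVR wraps LIBSVM

# The two example instances from the table in step 3.
X_train = np.array([[1.0986, 4.0, 5.9054, 5.8833],
                    [0.69315, 5.0, 6.0497, 5.9636]])
z_train = np.array([0.23539, 0.32821])

# epsilon-SVR, RBF kernel, with the parameters reported in the text.
heat_model = SVR(kernel='rbf', C=1.0, epsilon=0.3).fit(X_train, z_train)

def heat(qr, qs, qa, qu):
    """Popularity H(d) of a product from its raw dynamic features."""
    x = [np.log(qr + 1), qs, np.log(qa + 1), np.log(qu + 1)]
    return heat_model.predict([x])[0]
```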

3.3.1 A hierarchical Bayesian discrete choice model

Economic models of choice typically assume that an individual's latent utility is a function of brand and attribute preferences [30]. We develop a hierarchical Bayesian discrete choice model to calculate the probability \(DC(c_k, d_j)\) of \(c_k\) choosing \(d_j\) based on his/her brand preference and price sensitivity.

We divide the price and brand of each product into three levels: high, medium and low price; large, moderate and small brand (Sect. 4 gives an example). Accordingly, the feature vector x of a product d contains six binary features, x = (p_hi, p_me, p_lo, b_la, b_mo, b_sm), corresponding to the three price levels and three brand levels. Exactly one of the three price features has the value 1 whereas the others are 0; for example, (p_hi = 1, p_me = 0, p_lo = 0) indicates that the price of a product is at the high level. The brand features follow the same rule; for instance, (b_la = 0, b_mo = 0, b_sm = 1) means that a product belongs to a small brand. The hierarchical Bayesian discrete choice model takes the form

$$ DC\left( {c_{k} ,\;d_{j} } \right) = P\left( {y_{j} = 1} \right) = \frac{1}{{1 + \exp ( - V(d_{j} ,\;c_{k} ))}}. $$
(3)

The utility function is

$$ U\left( {d_{j} ,\;c_{k} } \right) = V\left( {d_{j} ,\;c_{k} } \right) + e_{jk} , $$
$$ V\left( {d_{j} ,\;c_{k} } \right) = \beta_{1} \times p\_hi + \beta_{2} \times p\_me + \beta_{3} \times p\_lo + \beta_{4} \times b\_la + \beta_{5} \times b\_mo + \beta_{6} \times b\_sm . $$

\(P(y_j = 1)\) denotes the probability of a customer selecting product \(d_j\). Every customer may have particular preferences regarding price and brand; for example, one customer may prefer large-brand products, whereas another may not care about a product's brand as long as its price is low. In the hierarchical Bayesian model, the coefficients of the utility function are determined by customer features, which are described in Sect. 5.

Let B denote the coefficient vector \((\beta_{1}, \ldots, \beta_{6})\). Then

$$ B = Z\Updelta + U;\quad u_{i} \sim N\left( {0,\;V_{\beta } } \right). $$

Matrix Z contains customer features. The coefficient matrix Δ is normally distributed with mean vec(\( \bar{\Updelta} \)) and covariance given by the Kronecker product of \(A^{-1}\) and \(V_{\beta}\):

$$ \beta_{n} \sim N\left( {\Updelta^{\prime} Z_{n} ,\;V_{\beta } } \right), $$
$$ vec(\Updelta ) \sim N\left( {vec(\bar{\Updelta }),\;A^{ - 1} \otimes V_{\beta } } \right), $$
$$ V_{\beta } \sim IW(v,\;V). $$

The vec operator creates a column vector from a matrix by stacking the matrix's column vectors [13]. Hyperparameter \(V_{\beta}\) has an inverted Wishart prior. We set the noninformative priors to v = m + 3, V = v·I, \(\bar{\Updelta} = 0\) and A = 0.01, where m is the number of coefficients in the utility function.

We employ the Metropolis–Hastings MCMC algorithm with a normal proposal distribution to estimate the parameters of the hierarchical Bayesian model. The log-likelihood function is

$$ L(X,\;Y,\;B) = \sum\limits_{i} {\log \left( {P\left( {x_{i} } \right)*y_{i} + \left( {1 - P\left( {x_{i} } \right)} \right)*\left( {1 - y_{i} } \right)} \right)}, $$
$$ P\left( {x_{i} } \right) = \frac{{\exp (x_{i} *B)}}{{1 + \exp (x_{i} *B)}}. $$

The steps for estimating the parameters of the hierarchical Bayesian model are as follows.
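The original listing is not reproduced here; the Python sketch below illustrates the customer-level move of such a sampler, assuming a random-walk normal proposal and the log-likelihood given above. The step size, draw count and data layout are illustrative assumptions, and the full hierarchical sampler would alternate these draws with updates of Δ and \(V_{\beta}\).

```python
import numpy as np

def log_lik(beta, X, y):
    """Log-likelihood from the text:
    sum_i log(P(x_i)*y_i + (1 - P(x_i))*(1 - y_i))."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(np.log(p * y + (1.0 - p) * (1.0 - y)))

def mh_beta(X, y, mu, v_beta_inv, n_draws=5000, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings draws for one customer's beta,
    with prior beta ~ N(mu, V_beta), where mu = Delta' z_n."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    log_post = lambda b: log_lik(b, X, y) - 0.5 * (b - mu) @ v_beta_inv @ (b - mu)
    current = log_post(beta)
    draws = []
    for _ in range(n_draws):
        proposal = beta + step * rng.standard_normal(beta.shape)
        candidate = log_post(proposal)
        if np.log(rng.random()) < candidate - current:  # accept/reject
            beta, current = proposal, candidate
        draws.append(beta.copy())
    return np.array(draws)  # saved draws for posterior plots (Fig. 6)
```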

Using saved draws, we can plot the posterior distribution of coefficients. As illustrated in Fig. 6, the means of the posterior distributions of the three coefficients p_hi, p_me and p_lo for one customer are approximately −2.7, 0.8 and 0.5, respectively. We can conclude based on the point estimates of these three coefficients that the customer generally rejects high-priced products and tends to prefer medium-priced products over low-priced products.

Fig. 6 The posterior distributions of the coefficients of a customer

To train a hierarchical Bayesian discrete choice model, it is necessary to know the options a customer faced within a finite alternative set. However, in the e-commerce context, we can know only which products a customer purchased; we cannot know which products were available to but declined by the customer. Because both positive and negative samples are necessary to train the model, we regard the purchased products as positive data and develop a technique to generate one negative instance from each positive instance. One instance in the training dataset is the feature vector of a purchased product combined with a label; the six features p_hi, p_me, p_lo, b_la, b_mo and b_sm represent the product's price and brand levels.

| p_hi | p_me | p_lo | b_la | b_mo | b_sm | Label |
|------|------|------|------|------|------|-------|
| 1 | 0 | 0 | 1 | 0 | 0 | 1 |

When each feature in the positive instance is inverted, we can derive a negative instance.

| p_hi | p_me | p_lo | b_la | b_mo | b_sm | Label |
|------|------|------|------|------|------|-------|
| 0 | 1 | 1 | 0 | 1 | 1 | 0 |
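A one-line sketch of this inversion (features and label flipped), matching the two tables above:

```python
def negative_instance(positive):
    """Invert every feature of a positive instance and flip its label,
    e.g. (1, 0, 0, 1, 0, 0; label 1) -> (0, 1, 1, 0, 1, 1; label 0)."""
    features, label = positive
    return [1 - f for f in features], 1 - label

assert negative_instance(([1, 0, 0, 1, 0, 0], 1)) == ([0, 1, 1, 0, 1, 1], 0)
```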

3.3.2 Collaborative filtering

CF may be used to estimate a customer's rating for a particular e-commerce product based on the product ratings of customers with similar tastes. We employ CF to learn customers' preferences for product features that cannot be observed by the analyst. We predict \(c_k\)'s rating for \(d_j\), denoted \(CF(c_k, d_j)\), using formula (4):

$$ CF\left( {c_{k} ,\;d_{j} } \right) = \frac{{\sum\nolimits_{s \in S} {Sim(c_{k} ,\;s) \times rating(s,\;d_{j} )} }}{|S|}, $$
(4)

where S denotes the set of the top 10 customers most similar to \(c_k\) and rating(s, \(d_j\)) refers to the rating that customer s gives to product \(d_j\). Rating values are defined on a numerical scale from 0 (strongly dislike) to 5 (strongly like). \(Sim(c_k, s)\) is the similarity between customers \(c_k\) and s, calculated using the cosine measure (formula (5)). A customer's feature vector is defined as the set of his/her product ratings. For example, the feature vector \(V(c_k) = (0, 4, 1, 0, 5)\) indicates that \(c_k\) did not purchase product \(d_1\) (or gave it a rating of 0), gave \(d_2\) a rating of 4, and so on.

$$ Sim\left( {c_{k} ,\;c_{l} } \right) = \frac{{V(c_{k} )V(c_{l} )}}{{|V(c_{k} )||V(c_{l} )|}}. $$
(5)
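A compact sketch of formulas (4) and (5) over a ratings matrix; the dense-matrix layout is an illustrative assumption.

```python
import numpy as np

def cf_score(ratings, k, j, top_n=10):
    """CF(c_k, d_j) per formulas (4) and (5).

    ratings -- (customers x products) array; 0 marks "not purchased/rated"
    """
    target = ratings[k]
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target)
    sims = (ratings @ target) / np.where(norms == 0, 1e-12, norms)  # (5)
    sims[k] = -np.inf                      # exclude c_k itself
    S = np.argsort(sims)[-top_n:]          # the 10 most similar customers
    return float(np.sum(sims[S] * ratings[S, j]) / len(S))          # (4)
```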

4 Experiments

We collected customer information and product reviews from Jingdong using a web crawler. The collected data contain 727,878 product items, 342,451 customers and 14,634,059 reviews from 2004 to 31 January 2013. Jingdong’s products are assigned three levels of categories. There are 19 first-level categories, 124 second-level categories and 1078 third-level categories.

It is difficult to collect customer purchase information directly from e-commerce websites because these data are generally regarded as private. Our study is therefore based on the following assumption: if a customer frequently writes reviews on an e-commerce website, his/her reviews reveal nearly all of his/her product purchases (on Jingdong, only customers who have purchased a product are authorized to review it). Accordingly, we identify customers with high reviewing frequencies and generate purchase data from their reviews. In addition, we recruited 55 participants on the crowdsourcing platform; these participants generated training data containing 1351 × 2 instances.

We use an IBM computer with a 16 × 2 GHz CPU and 64 GB of memory to handle the very large matrices generated in the experiments.

4.1 Data processing

The collected data are processed as follows.

  • Dataset division: the dataset is divided into three sections by date. Section A: before 30 June 2012; section B: from 30 June 2012 to 31 July 2012; and section C: after 31 July 2012. Figure 7 illustrates these divisions.

    Fig. 7 Dividing the dataset into three sections

  • Customer selection: we identify customers who purchased products in all three sections and who purchased more than 30 items in section A. A total of 2770 customers meet this requirement.

  • Training set: the purchase data in section A.

  • Target set: the products in section B.

  • Test set: the products in section C.

  • Setting price levels for products: let d be an item and thr be its third-level category. If the price of d is above the 75th percentile of all product prices in thr, we set the features of d to (p_hi = 1, p_me = 0 and p_lo = 0). If the price of d is below the 25th percentile, its features are set to (p_hi = 0, p_me = 0 and p_lo = 1). Otherwise, its features are set to (p_hi = 0, p_me = 1 and p_lo = 0).

  • Setting brand levels for products: we examine the distribution of the number of products per brand. If the number of a brand's products is greater than the 75th percentile of the distribution, the features of all items under that brand are set to (b_la = 1, b_mo = 0 and b_sm = 0). If a brand lies below the 25th percentile of the distribution, the features of its products are set to (b_la = 0, b_mo = 0 and b_sm = 1). Otherwise, the features are set to (b_la = 0, b_mo = 1 and b_sm = 0). A sketch of this percentile-based level assignment appears after this list.
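A minimal sketch of the percentile-based price-level assignment described above; brand levels are set analogously from the per-brand item counts.

```python
import numpy as np

def price_level(price, category_prices):
    """(p_hi, p_me, p_lo) from the 25th/75th price percentiles of the
    product's third-level category."""
    lo, hi = np.percentile(category_prices, [25, 75])
    if price > hi:
        return (1, 0, 0)
    if price < lo:
        return (0, 0, 1)
    return (0, 1, 0)
```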

This paper refers to the processed data as the JD dataset.

4.2 Exploring crowdsourcing data

In this section, we examine whether the data samples collected from the crowdsourcing system are suitable for our task. In other words, we evaluate whether participants make similar judgments regarding product popularity based on the product features presented to them. If participants exhibit similar judgments, the feature space of the collected data should be partitionable. Based on this idea, we employ LR, DT and KNN models to build classifiers on the collected data and use 10-fold cross-validation to examine the precision of the three classifiers.

$$ {\text{Precision }}=\frac{{{\text{The}}\;{\text{number}}\;{\text{of}}\;{\text{correctly}}\;{\text{classified}}\;{\text{instances}}}}{{{\text{The}}\;{\text{total}}\;{\text{number}}\;{\text{of}}\;{\text{instances}}\;{\text{in}}\;{\text{the}}\;{\text{dataset}}}}. $$

The results presented in Fig. 8 demonstrate that all three models perform well, with precision scores of 92.4, 90.6 and 84.5 % for the LR, DT and KNN models, respectively. This experiment demonstrates that the feature space of the collected data is partitionable; it also implies that crowdworkers have similar views regarding product popularity.

Fig. 8 Classification of data gathered from the crowdsourcing system

The uncertain quality of crowdsourced labels is a well-known challenge for crowdsourcing systems, and a detailed discussion of this subject is beyond the scope of this paper. In this study, a classifier with 92.4 % precision is adequate for our purpose, i.e., calculating product popularity, even though unreliable crowdworkers may exist.

4.3 Exploring parameter P(dj|di)

In this section, we investigate the impact of parameter \(P(d_j|d_i)\) on the performance of the predictive framework as follows: we choose the last product \(d_i\) purchased by customer \(c_k\) in section B; we acquire an associated product set ω for \(d_i\) based on the data in section A using formula (1) [if \(P(d_j|d_i) > 0\), we say that \(d_j\) is associated with product \(d_i\)]; and we use the products in ω as our prediction result and the products purchased by \(c_k\) in section C as the test set Φ. If any product in ω occurs in Φ, we say that customer \(c_k\) is successfully predicted. An analysis of all customers yields the results provided in Table 1.
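This success criterion amounts to a per-customer hit rate; a minimal sketch of the measurement, assuming predictions and test purchases are available as sets keyed by customer id.

```python
def prediction_precision(predicted, test_purchases):
    """Share of customers for whom at least one predicted product
    appears in their section-C purchases (the criterion above).

    predicted      -- dict: customer id -> set of predicted product ids
    test_purchases -- dict: customer id -> set of products bought in C
    """
    hits = sum(bool(omega & test_purchases.get(c, set()))
               for c, omega in predicted.items())
    return hits / len(predicted)
```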

Table 1 Precision of using associated products for predictions

When the top 10 associated products are obtained for each customer, only 9.9 % of customers are successfully predicted. When we obtain the top 100 associated products, the precision improves to 26.2 %. However, if the predictive framework were used as a recommender system, it would be impractical to recommend 100 items to an e-commerce customer at one time.

In the JD dataset, applying formula (1) to each product yields an average of 33 associated products. Yet even when the top 100 associated products are obtained for each item, the predictive performance reaches only 26.2 %. This result means that using formula (1) alone to acquire associated products harms COREL's performance.

Continuing this scenario, we build a basic model that uses associated categories for prediction. We obtain the third-level category of \(d_i\), named thr; we use formula (2) to acquire the top 1, 5 and 10 associated categories of thr; and we reduce all products in test set Φ to their third-level categories. If any associated category of \(d_i\) occurs in Φ, we say that customer \(c_k\) is correctly predicted. Table 2 presents the results.

Table 2 Exploring associated categories

Table 2 indicates that obtaining the top 10 associated categories achieves the best performance. Specifically, 75.3 % of customers will purchase products in at least 1 of the top 10 categories in the future. However, this result does not mean that continuing to increase top n will improve COREL’s performance; rather, expanding the candidate set further can introduce numerous marginally associated products.

4.4 Exploring collaborative filtering for prediction

For customer \(c_k\), we use formula (4) to acquire a collection of products ω that contains the top 10 CF-scored products not reviewed by \(c_k\) in sections A and B. The products purchased by \(c_k\) in section C are employed as test set Φ. If any product d ∈ ω occurs in test set Φ, we say that \(c_k\) is correctly predicted. An analysis of all customers achieves a precision rate of only 2.6 %. This result means that CF alone is unsuitable for predicting customer purchase behavior (Table 3).

Table 3 Precision of CF

Continuing the experiment, we build a model M2 = \(P(d_j|d_i)\) × CF. This model uses the top 10 associated categories of \(d_i\) to build candidate sets and employs formula (4) to calculate CF scores for the products in those sets. We build three subsets by obtaining the top 1, 5 and 10 products. If any product in a particular subset occurs in test set Φ, we say that \(c_k\) is correctly predicted. After exploring all customers in the JD dataset, we determine the precision of the predictive framework for each of the three subsets. These results are shown in Table 4.

Table 4 Precision of model M2

The results show that combining CF and \(P(d_j|d_i)\) dramatically improves predictive performance compared with using CF alone. From a recommender system perspective, when the model recommends 10 products to customers, 27.7 % of those customers will purchase at least 1 of the recommended products. The model that combines \(P(d_j|d_i)\) and CF (M2) also outperforms the basic model that uses only \(P(d_j|d_i)\) (Table 1).

4.5 Exploring the heat model

Let \(d_i\) be the last product purchased by customer \(c_k\) in section B and \(Thr(d_i)\) be the third-level category of \(d_i\). As described in Sect. 4.4, we obtain the top 10 associated categories of \(Thr(d_i)\) to generate a candidate set. The heat model is then used to calculate the popularity of each product in the candidate set and to form three product subsets, namely, the top 1, 5 and 10 most popular products. If a product in one of these subsets occurs in test set Φ, we say that \(c_k\) is successfully predicted. An analysis of all customers in the JD dataset is conducted to acquire precision rates for the three subsets. Table 5 presents the results.

Table 5 Performance of the heat model

This experiment shows that using the heat model alone for the prediction task yields poor performance. Accordingly, we build model M3 = M2 × \(H(d_i)\), which combines the heat model with the M2 model. We repeat the above experiment using M3 and obtain the results shown in Table 6.

Table 6 Performance of model M3

A comparison of models M3 and M2 indicates that M3 outperforms M2 by up to 5 % when the top 10 candidates are obtained. This result demonstrates that incorporating the heat model into M2 significantly improves the performance of the predictive framework.

4.6 Exploring the performance of COREL

This section investigates the performance of COREL in predicting customer purchase behavior. We compare COREL to several baseline models, which are identified in Table 7.

Table 7 Predictive models

Recommender systems based on a Markov chain model utilize sequential basket data, predicting the user's next purchase action based on his/her last purchase action. By comparison, a factorization method based on matrix or tensor decomposition learns the general tastes of the user but disregards sequential information. The FPMC model [22] subsumes both a common Markov chain and the normal matrix factorization model. In this study, we implement an FPMC-based predictive model that makes item recommendations based on the sequential basket data in the JD dataset. The parameters used to implement the FPMC algorithm are listed in Table 8.

Table 8 Parameters of implementing the FPMC model

We also implement the recommendation algorithm SVDutil [28], which utilizes marginal net utility for recommendation. SVDutil assumes that each entry \(P_{u,i}\) in a user–product matrix \(P_{M \times N}\) can be estimated as \( \bar{P}_{u,i} = q_{i}^{T} p_{u} \), where \(q_i\) and \(p_u\) are vectors giving the hidden representations of product i and user u, respectively; these vectors are estimated from all given entries in \(P_{M \times N}\). The value of \(P_{u,i}\) in the observed matrix is determined by the user's purchase history. SVDutil ranks all products by their estimated \( \bar{P}_{u,i} \) values and selects the top n to recommend. In implementing SVDutil, we set parameter θ = 0.7, whereas other parameters, such as the learning rate β and the regularization parameters λ, are the same as those in [28]. Additionally, [28] uses product titles to calculate the similarity Sim(i, j) of two products. In our experiment, we calculate the similarity as

$$ {\text{Sim(}}i,\;j) = 0.5 \times {\text{category}}\;{\text{similarity}} + 0.5 \times {\text{similarity}}\;{\text{of}}\;{\text{product}}\;{\text{title}}, $$

where the \( {\text{category}}\;{\text{similarity}} = \left\{ {\begin{array}{ll} 1 & {{\text{same}}\;{\text{in}}\;{\text{third-level}}\;{\text{category}}}, \\ 0 & {{\text{otherwise}} }. \\ \end{array} } \right. \)
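A small sketch of this combined similarity; `title_sim` is a hypothetical helper assumed to score title similarity in [0, 1].

```python
def product_similarity(i, j, third_level, title_sim):
    """Sim(i, j) = 0.5 * category similarity + 0.5 * title similarity,
    where category similarity is 1 iff the two products share the same
    third-level category. title_sim(i, j) could be, e.g., the cosine
    similarity of bag-of-words vectors of the two product titles."""
    same_category = 1.0 if third_level[i] == third_level[j] else 0.0
    return 0.5 * same_category + 0.5 * title_sim(i, j)
```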

The seven models listed in Table 7 make predictions based on the JD dataset. These models use the last product \(d_i\) purchased by customer \(c_k\) in section B to predict the items purchased by \(c_k\) in section C. The candidates are generated by obtaining the top 10 associated categories of \(d_i\). Products purchased by \(c_k\) in section C are used as test set Φ. The models calculate a prediction score for each candidate and select the top n candidates to build product subset ω. If any product in ω occurs in test set Φ, we say that customer \(c_k\) is successfully predicted. Using precision as a measure of model performance, we present the results for all customers in the JD dataset in Table 9.

Table 9 Evaluation of the predictive performance of seven models

As shown in Table 9, market basket analysis (M1) exhibits the poorest performance in predicting customer purchase behavior. The combination of CF and market basket analysis (M2) dramatically improves the model’s predictive performance. When M2 is combined with the heat model (M3), precision is further increased. Model M4, which incorporates customers’ price sensitivity, outperforms other models when n = 1; this result means that the introduction of price sensitivity into the predictive task can improve performance. However, the addition of customers’ brand preferences to the model (M5) does not significantly improve model performance and even decreases model precision when n = 3 and 1. Section 5 discusses reasons for this decrease in performance.

We also observe that model M4 significantly outperforms the FPMC model (M6) and SVDutil (M7). We believe that using only sequential basket data and user–item correlation and omitting customers’ preferences regarding product features make FPMC and SVDutil inappropriate for predicting customer purchase behavior.

5 An analysis of customers in the JD dataset

In this section, we discuss the characteristics of customers in the JD dataset by examining the estimated parameters of the hierarchical Bayesian model introduced in Sect. 3.3. Table 10 lists customer variables.

Table 10 Customer variables

To explore the relationships between customers’ features and their preferences, all variables are normalized into the range [0, 1]. Table 11 shows the estimated parameters of the hierarchical Bayesian model.

Table 11 Posterior mean of parameter Δ in the hierarchical Bayesian model

Based on the data in Table 11, we can conclude that customers with high purchase frequencies, large SDs and low monetary values tend to prefer small-brand products, whereas customers who have low purchase frequencies and high monetary values show an inclination toward large brands. It can be further inferred that small-brand products are generally less expensive than large- or moderate-brand products.

Figure 9 shows the posterior distributions of the six model parameters. Although the posterior distributions for both moderate and small brands place most of their respective mass on negative values, the posterior distribution for large brands retains the bulk of its mass around 4. We can thus conclude that, taken as a whole, the customers in the JD dataset have a greater preference for large-brand products than for small-brand products. In addition, customers prefer medium- and low-priced products to high-priced products, and brand has a larger impact on customer choice than price does. However, further analysis shows that large-brand products may have a higher probability of being purchased than small-brand products simply because brand levels are defined by the number of items under each brand. Therefore, the brand preference derived from our model is not suitable for prediction purposes; our experiments also demonstrate that it does not help improve model performance in predicting purchase behavior. This result explains why model M5 exhibits poorer performance than M4 does.

Fig. 9 Posterior distributions of parameters in model M5

6 Conclusions

Researchers in the marketing and CRM fields have made numerous significant contributions to the prediction of customer purchase behavior in the traditional business context. However, new methods and techniques must be developed to perform the predictive task in the e-commerce context.

We propose a predictive framework for customer purchase behavior called COREL. This framework comprises a two-stage process. First, the associations between products are investigated and exploited to predict customers' motivations, i.e., to build a candidate product collection. Second, customers' preferences for product features are learned to identify which candidate products are most likely to be purchased. This study investigates three categories of product features: dynamic product features, features that may be observed by the user but not by the analyst, and product features that are static and observable by the analyst. We exploit purchase data from an e-commerce website to develop methods for learning customer preferences for each of these three categories.

The results show that our approach to calculating product popularity is feasible and that customer preferences for product features have a significant impact on purchasing decisions.

Economic models of choice typically assume that an individual's latent utility is a function of brand preference. In this study, however, brand preference does not significantly improve the performance of COREL when brand levels are defined by the number of items under each brand. In the future, we will investigate approaches that improve the performance of the predictive framework by incorporating customers' brand preferences into the model.