1 Introduction

1.1 Motivation

By identifying potentially loyal customers who are more likely to revisit, merchants can save considerably on promotion costs and enhance return on investment [27]. Many studies in recent years have focused on online stores and online text reviews with the help of a data provider [18, 42]. In contrast, the analysis of revisit intention in the off-line environment has received far less attention. The main reason lies in the difficulty of collecting large-scale data that is closely related to the key attributes of revisiting, such as customer satisfaction with products, service quality, atmosphere, purchase history, and personal profiles [37, 42]. These attributes are either subjective or confidential and therefore not easily accessible. Owing to these limitations, research on customer revisits in off-line stores has been conducted through surveys. These studies help us understand the underlying hypotheses that affect customer satisfaction. However, their findings cannot be easily generalized because of small sample sizes.

With the advance of sensing technologies such as radio-frequency identification (RFID) [8, 35], Bluetooth [45], and Wi-Fi fingerprinting [36], we are capable of collecting high-frequency signal data without installing any applications on customer devices [29, 30]. These signals can be converted to fine-grained mobility data. Using such data, noninvasive monitoring of visitors has been carried out in different settings, such as museums [45] and supermarkets [40], providing empirical findings on customer behaviors. Nowadays, collecting data within a certain physical boundary is called geofencing [32], and its market size is increasing rapidly. Companies such as ZOYI, VCA, RetailNext, Euclid, ShopperTrak, and Purple install their own sensors to geotrack real-time mobility patterns of customers in their clients’ stores. Their proprietary solutions provide visitor monitoring results, such as funnel or hot-spot analyses, displayed through a dashboard. In addition, huge amounts of shopping behavior data are expected to be generated in the cashier-less stores introduced by enterprises such as Alibaba and Amazon.

1.2 Contribution

In this paper, we propose a systematic framework for predicting the revisit intention of customers using Wi-Fi signals captured by in-store sensors. Our framework covers the entire procedure for revisit prediction, from data preparation to model learning. The key challenge is how to generate the most effective set of features from the Wi-Fi signals. We systematically design the features to summarize each visit in two aspects. First, we interpret the device location at various semantic levels to understand user behaviors. Second, we utilize weak signals, usually captured outside a store, to expand each trajectory to the widest possible range. Using this information, we can track a customer’s behavior around the store even when he/she does not enter it.

We also benefit from large-scale customer mobility data captured by in-store sensors. Seven flagship stores in downtown Seoul were carefully selected for data collection to cover various shop categories. The number of unique customers collected in the seven stores reaches 3.75 million. The data is very attractive because we can capture approximately 20–30%Footnote 1 of customer mobility without any intervention. Furthermore, the data collection period is 1–2 years, which is long enough to study revisit behaviors.

Fig. 1 Revisit prediction framework architecture

Figure 1 illustrates the overall procedure of our prediction framework. When a customer comes into a store, the framework detects his/her Wi-Fi signals and, through the data preprocessing described in Sect. 2.2, transforms the signals into visits and occurrences. From the customer’s current visit and previous occurrences, extensive features are derived to describe his/her motion patterns, as discussed in Sect. 4. In this regard, our framework relies on the belief that motion patterns unconsciously reflect a customer’s satisfaction with the store [13]. Finally, we predict his/her revisit behavior using a trained model.

Our experiments demonstrate that our revisit prediction framework achieves up to 80% accuracy on the binary revisit classification over all trajectories. Additionally, it predicts the revisit of first-time visitors with up to 72% accuracy. For actual apparel stores, predicting the revisits of first-time customers is particularly useful because they account for more than 70% of all visitors. Most importantly, this 80% accuracy is achieved with features derived solely from Wi-Fi signals and minimal external information (dates of public holidays and clearance sales). Thus, we expect that the prediction power will rise significantly when private data such as personal profiles and purchasing patterns are added.

Fig. 2 Revisit statistics of store E_GN. \(E[RV_{\mathrm{bin}}(v_k)]\) denotes the average revisit rate of the group of visitors who visit k times

Figure 2 illustrates the revisit statistics observed during the data collection period in store E_GN. The black line denotes the number of observations \(|v_k|\) of kth visits (\(v_k\)), and the gray line denotes the average revisit rate \(E[RV_{\mathrm{bin}}(v_k)]\) of all \(v_k\)’s. The fact that \(|v_5|\) is 100 times smaller than \(|v_1|\) implies that it is very difficult to retain first-time visitors as regular customers. It also shows how valuable it is to raise the revisit rate of first-time visitors, who account for 70% of all customers,Footnote 2 thereby emphasizing again the importance of our work. Along with the model accuracy, we report the predictive power of each feature group and semantic level to show whether the trajectory abstraction boosts predictability. We also demonstrate the effectiveness of using customer mobility features in comparison with baseline models that consider the visit distribution and temporal information. We discuss how the collection period and the volume of data affect performance. Another important goal of this paper is to share the unexpected challenges faced when two groups of data show inherent differences in a statistical sense.

This paper extends our earlier work [12] presented at IEEE ICDM 2018, where it was selected as one of the best papers. In particular, the evaluation of our framework has been significantly improved by addressing the comments on our earlier work. In this extended version, we empirically show that mobility features are effective even with only a few records by tracking the predictive power of our model conditioned on the number of previous visits. We also test various machine learning techniques and their stacked ensemble model, in addition to the XGBoost [4] used in our earlier work.

The remainder of this paper is organized as follows. In Sect. 2, we describe the datasets used in this paper. After introducing the main concepts and formalizing the problem in Sect. 3, we describe the characteristics of the features in Sect. 4. In Sect. 5, we explain the experiment settings and present the overall prediction results; we also discuss the lessons learned and the challenges encountered in the experiments. After reviewing related work in Sect. 6, we conclude this study in Sect. 7.

2 Data description

In this section, we introduce our customer mobility data captured from off-line stores. The number of customers in our data is very large, and the collection period is long enough to obtain reliable results. Throughout this section, we share some statistics of our datasets and introduce the preprocessing necessary to extract meaningful semantics from the raw Wi-Fi signals.

2.1 Data collection stores

We collected data from seven flagship stores located in the center of Seoul. Each of them is one of the largest stores of its brand, consists of several floors, and is known to be among the busiest stores in Korea. Because of their location and size, these stores have up to 10,000 daily visitors. For example, our target store E_GN is a four-story building located on a Gangnam boulevard where two million people walk by each month. Store E_SC is located on the ground floor of a major department store in the downtown Sinchon area, which is also connected to one of the busiest subway stations used by college students. Table 1 presents the statistics of the seven datasets, and Fig. 3 illustrates the location of sensors and the categories of two stores, E_GN and E_SC.Footnote 3

Table 1 Statistics of the datasets
Fig. 3 Location of sensors and categories of two stores E_GN and E_SC. Wi-Fi icons indicate the location of the sensors, and the category names for each section are described in (c)

2.2 Preprocessing to generate trajectories

2.2.1 Signal-to-session conversion

To collect Wi-Fi signals, we utilized ZOYI Square sensors developed by WalkInsights.Footnote 4 The installed sensors enable us to collect Wi-Fi signals from any device whose Wi-Fi is turned on. A single Wi-Fi signal includes an anonymized device ID, a sensor ID, a timestamp, and its received signal strength indicator (RSSI) level, which is a measurement of the power present in a received radio signal. Signals are collected continuously from each device at fairly short intervals of less than 1 s. To understand customer mobility, we carry out a conversion process that removes redundant signals and combines them into Wi-Fi session logs. Each session includes a device ID, an area ID, and a dwell time, and it becomes an element of a semantic trajectory. Predefined RSSI thresholds, which guarantee that the device is in the vicinity of a sensor, are utilized for the signal-to-session conversion. The logic of this conversion is simple: a new session is created when a sufficiently strong RSSI is received for the first time, the session continues while the sensor keeps receiving strong signals, and it ends when the sensor no longer receives strong signals. The session also ends if another sensor receives a strong RSSI from that device.
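
The following sketch illustrates this session logic in Python; the threshold value, the timeout, the record layout, and the helper names are illustrative assumptions rather than the exact parameters used by the sensors.

```python
from dataclasses import dataclass

RSSI_THRESHOLD = -70   # assumed "strong signal" cutoff (dBm); the actual value is proprietary
SESSION_TIMEOUT = 30   # assumed gap (s) after which an open session is closed

@dataclass
class Session:
    device_id: str
    area_id: str
    t_in: float    # entering timestamp (s)
    t_out: float   # leaving timestamp (s)

def signals_to_sessions(signals):
    """Convert raw (device_id, sensor_id, timestamp, rssi) tuples,
    sorted by timestamp, into per-device session logs."""
    open_sessions = {}   # device_id -> Session currently being extended
    finished = []
    for device_id, sensor_id, ts, rssi in signals:
        if rssi < RSSI_THRESHOLD:
            continue                      # weak signal: ignored at the sensor level
        cur = open_sessions.get(device_id)
        if cur is not None:
            same_area = (cur.area_id == sensor_id)
            if same_area and ts - cur.t_out <= SESSION_TIMEOUT:
                cur.t_out = ts            # consecutive strong signal: extend the session
                continue
            finished.append(cur)          # another sensor or a long gap: close the session
        # first strong signal for this device (or right after closing): open a new session
        open_sessions[device_id] = Session(device_id, sensor_id, ts, ts)
    finished.extend(open_sessions.values())
    return finished
```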

2.2.2 Location semantics

It is also possible to detect the semantic location of a customer by taking advantage of the semantic coherency of contiguous sensors. For example, we can identify whether the customer is looking at daily cosmetics or is in a fitting room. Additionally, we can describe a customer’s location with floor-level or gender-level semantic areas. Moreover, we generate in-/out-level areas by examining whether the customer is inside the store, near the store (up to 5 m), or farther away from the store (up to 30 m). This becomes possible by controlling multiple RSSI thresholds so that detection is also activated by weaker signals. Therefore, an entity of the Wi-Fi session data encompasses a customer’s dwell time not only in the area corresponding to a sensor but also in wider semantic areas. By integrating the Wi-Fi sessions with different semantics, we construct a multilevel semantic trajectory that describes each visit, as illustrated in Fig. 4.
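
As a rough illustration, and building on the Session records from the previous sketch, a sensor-level session can be lifted to higher semantic levels with a simple lookup table derived from the floor plan; the table entries below are hypothetical.

```python
# Hypothetical floor-plan lookup: sensor area -> (category, floor, gender) semantics.
SENSOR_SEMANTICS = {
    "S01": ("daily_cosmetics", "1F", "women"),
    "S02": ("fitting_room",    "2F", "women"),
    "S03": ("checkout",        "1F", "unisex"),
}

def lift_session(session):
    """Express the same dwell interval at the category, floor, and gender levels."""
    category, floor, gender = SENSOR_SEMANTICS[session.area_id]
    return {
        "category": (category, session.t_in, session.t_out),
        "floor":    (floor,    session.t_in, session.t_out),
        "gender":   (gender,   session.t_in, session.t_out),
    }
```

Consecutive sessions that map to the same category or floor can then be merged into a single element of the corresponding level’s trajectory.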

Fig. 4 Generation of multilevel trajectories to predict a customer’s revisit intention: Using noninvasive monitoring, customer Wi-Fi signals are collected. These are then transformed into a sensor-based trajectory, and further summarized into categories, floors, genders, and surrounding areas. We extract features from these multilevel trajectories to determine the characteristics related to customer behavior

3 Problem definition

In this section, we formally define the main concepts introduced in our paper. First, we define a multilevel semantic trajectory (\(\mathbb {T}\)) that expresses a customer’s motion pattern, and define visit (v) and occurrence (o) using \(\mathbb {T}\). Next, we define the revisit interval (\(RV_{\mathrm{days}}\)) and the revisit intention (\(RV_{\mathrm{bin}}\)), which are the labels in our prediction model. Finally, we introduce the revisit prediction problem.

3.1 Key terms and concepts

Definition 1

A semantic trajectory \(\mathcal {T}\) is a structured trajectory of size n \((n\ge 1)\) in which the spatial data (the coordinates) are replaced by semantic areas [43], that is, \(\mathcal {T} = \{s_1,\ldots ,s_n\}\), where each element (= a session) is defined by \(s_i = (sp_i, t^{(sp_i)}_{in}, t^{(sp_i)}_{out})\). Here, \(sp_i\) represents the semantic area, \(t^{(sp_i)}_{in}\) is the timestamp for entering \(sp_i\), and \(t^{(sp_i)}_{out}\) is the timestamp for leaving \(sp_i\). \(\Box \)

If a session length \(t^{(sp_{i})}_{out} - t^{(sp_i)}_{in}\) is shorter than 5 s, the customer is likely to have merely passed through that area, considering walking speed and the distance between adjacent sensors, and thus we delete the element from the trajectory.

Definition 2

A multilevel semantic trajectory \(\mathbb {T} = \{\mathcal {T}_1,\ldots ,\mathcal {T}_l\}\) is a set of semantic trajectories with l \((l\ge 1)\) different semantic levels. Each semantic trajectory \(\mathcal {T}_i\) represents a customer’s trajectory using the semantic areas of level i. \(\Box \)

For our indoor environment, we utilized semantic levels inside the store, except for the highest level l, which indicates the in/out level. The total dwell time of \(\mathcal {T}_l\) is always longer than that of \(\mathcal {T}_1,\ldots ,\mathcal {T}_{l-1}\), because the in/out mobility utilizes weak signals that can be captured for a longer period than the strong signals used for indoor behavior.

Definition 3

A visit v is a unit action of entering the store. \(v_k(c, [t_s, t_e], \mathbb {T})\) is the kth visit by customer c, who is sensed from \(t_s\) to \(t_e\) and whose motion pattern is described with a multilevel semantic trajectory \(\mathbb {T}\). \(\Box \)

We consider only the visits that are long enough to represent meaningful behavior. For the sensor-level trajectory \(\mathcal {T}_1\), the total dwell time \(t_e - t_s\) should be greater than 1 min, because it takes less than 1 min to simply pass through the store. The data preprocessing thresholds are empirically configured depending on the size of a store and the number of sensors.

Definition 4

An occurrence o is a special case of a visit that represents a unit action of lingering around the store without entering it. \(o_k(c, [t_s, t_e], \mathbb {T})\) is the kth occurrence by customer c, who is sensed from \(t_s\) to \(t_e\) and whose mobility is captured only in the outdoor area, with \(\mathbb {T} = \{\emptyset ,\ldots ,\emptyset ,\mathcal {T}_l\}\). \(\Box \)

Although we did not have any personal information such as the customer’s residence, we could measure his/her accessibility to the store through occurrences. For each visit, we use the set of previous occurrences as a reference to generate the store accessibility features [SA], which are explained in Sect. 4.1.9.

3.2 Prediction objectives

If a customer revisits the store after d days, the previous visit v of the customer has a d-day revisit interval, denoted by \(RV_{\mathrm{days}}(v) = d\), and a positive revisit intention, denoted by \(RV_{\mathrm{bin}}(v) = 1\), as in Definition 5.

Definition 5

If two consecutive visits of customer \(c_i\), \(v_k = v_k(c_i, [t_{k,s}, t_{k,e}],\mathbb {T}_k)\) and \(v_{k+1} = v_{k+1}(c_i, [t_{k+1,s}, t_{k+1,e}], \mathbb {T}_{k+1})\), meet the condition \(t_{k,e} < t_{k+1,s}\), the revisit interval \(RV_{\mathrm{days}}(v_k)\) and revisit intention \(RV_{\mathrm{bin}}(v_k)\) of the former visit \(v_k\) are \(RV_{\mathrm{days}}(v_k) = \#\text {days of } (t_{k+1,s} - t_{k,e})\) and \(RV_{\mathrm{bin}}(v_k) = 1\). If a visit \(v_k\) does not have any following revisit, then \(RV_{\mathrm{days}}(v_k) = \infty \) and \(RV_{\mathrm{bin}}(v_k) = 0\). \(\Box \)
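
A minimal sketch of how these labels can be computed from a customer’s time-ordered visits is shown below; the day-counting convention and the function names are assumptions.

```python
from datetime import datetime
import math

def label_visits(visit_times):
    """Given a customer's visits as (t_s, t_e) datetime pairs sorted in ascending order,
    return (RV_days, RV_bin) for each visit. The last visit has no observed revisit,
    so it is labeled (inf, 0)."""
    labels = []
    for k in range(len(visit_times)):
        t_k_end = visit_times[k][1]
        if k + 1 < len(visit_times):
            t_next_start = visit_times[k + 1][0]
            rv_days = (t_next_start.date() - t_k_end.date()).days
            labels.append((rv_days, 1))
        else:
            labels.append((math.inf, 0))
    return labels

# Usage: two visits 10 days apart -> first visit labeled (10, 1), second (inf, 0)
visits = [(datetime(2017, 3, 1, 14), datetime(2017, 3, 1, 15)),
          (datetime(2017, 3, 11, 17), datetime(2017, 3, 11, 18))]
print(label_visits(visits))
```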

3.3 Predictive analytics

Our problem is now formally defined as follows:

Customer Revisit Prediction:    Given a set of visits \(V_{\mathrm{train}} = \{ v_1, \ldots , v_n \}\) with known revisit intentions \(RV_{\mathrm{bin}}(v_i)\) and revisit intervals \(RV_{\mathrm{days}}(v_i)\) \((v_i \in V_{\mathrm{train}})\), build a classifier C that predicts the unknown revisit intention \(RV_{\mathrm{bin}}(v_{\mathrm{new}})\) and revisit interval \(RV_{\mathrm{days}}(v_{\mathrm{new}})\) of a new visit \(v_{\mathrm{new}}\).

4 Feature engineering

To obtain a multiperspective view of customer movements, we represent each visit as a five-level semantic trajectory, \(\mathbb {T} = \{\mathcal {T}_{1}, \mathcal {T}_{2}, \mathcal {T}_{3}, \mathcal {T}_{4}\), \(\mathcal {T}_{5}\}\), where the levels correspond to sensor, category, floor, gender, and in/out, respectively. We expect that the patterns captured at multiple levels can be helpful in predicting customer revisits, so some features are created for each semantic level.

Table 2 Description of the representative features according to the data sources and feature groups

Table 2 gives a summary of the representative features in our framework. The first column describes the ten feature groups categorized by their characteristics. The first seven feature groups are generated solely from the customer mobility itself. The last three feature groups, Upcoming Events [UE], Store Accessibility [SA], and Group Movement [GM], are generated using external references: [UE] uses sales and holiday information for the near future, [SA] uses the occurrences of the customer before the current visit, and [GM] considers the other visits made at the same time.

Fig. 5 The relationship between the selected features and \(RV_{\mathrm{bin}}\) in store E_SC (\(E[RV_{\mathrm{bin}}(v)] = 0.3616\) for \(v \in V_{\mathrm{all}}\)). Each marker point represents the average revisit intention \(E[RV_{\mathrm{bin}}(v)]\) \((v \in V_{b})\) of the set \(V_{b}\) of visits obtained by equal-frequency binning of the entire data according to feature values. The indoor moving pattern features \(f_1\), \(f_7\), and \(f_9\) show at most a 40% deviation of \(E[RV_{\mathrm{bin}}(v)]\) across feature values. The store accessibility feature \(f_{17}\) shows a 325% deviation, the highest among the selected features. For \(f_9\), the group of customers who are most likely to use the back door is located on the left side of the x-axis

Fig. 6 Key factors of \(v_1\)’s revisit: discount and seasonality. Discount-sensitive: a set \(V_b\) of customers who visited between 30 and 45 days before a clearance sale showed a high \(E[RV_{\mathrm{bin}}(v)]\) \((v \in V_{b})\) compared to other customers; this difference was more apparent for first-time visitors than for all visitors. Season-sensitive: another peak of \(E[RV_{\mathrm{bin}}(v)]\) appeared for the set of customers who made a visit between 90 and 105 days before the sale. This peak reflects seasonal revisits, and it was also more noticeable for first-time visitors than for all visitors

For the seven stores, the total number of generated features varies from 220 to 866, depending on the number of areas and the number of semantic levels used. The \(\mathcal {T}_{2}\)-, \(\mathcal {T}_{3}\)-, and \(\mathcal {T}_{4}\)-level features are generated only for the two stores E_GN and E_SC, whose floor plans we continuously tracked during the data collection periods. Among all features, we introduce 20 representative features that best describe the characteristics of each feature group. On the right side of the table, the corresponding semantic level of each feature is marked.

Figures 5 and 6 display meaningful relationships between the values of features \(f_1\), \(f_7\), \(f_9\), \(f_{15}\), and \(f_{17}\) and the average revisit intention rate \(E[RV_{\mathrm{bin}}(v)]\). By dividing the total visits into 20 equal-frequency bins according to feature values, we can identify the association between feature values and revisit rates without being affected by outliers.

4.1 Feature descriptions

In this section, we introduce the details of each feature group used in our model. Along with the background for designing each feature, we show some correlations between features and customer revisits.

4.1.1 Overall statistics [OS]

[OS] features represent a high-level view of a customer’s indoor movement patterns, and therefore their predictive power is relatively strong. By considering the trajectory as a whole, we extract features such as the total dwell time \((f_1)\), the trajectory length \((f_2)\), and the average frequency of each area. We also apply skewness \((f_3)\) and kurtosis to measure the asymmetric or fat-tailed behavior of the dwell-time distribution over areas.
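
A sketch of this feature group, computed from a sensor-level trajectory represented as (area, t_in, t_out) tuples; the dictionary keys are illustrative names.

```python
from scipy.stats import skew, kurtosis

def overall_statistics(trajectory):
    """Compute a few [OS]-style features from a sensor-level trajectory,
    given as a list of (area, t_in, t_out) tuples with times in seconds."""
    dwell_times = [t_out - t_in for _, t_in, t_out in trajectory]
    return {
        "total_dwell_time": sum(dwell_times),      # f1
        "trajectory_length": len(trajectory),      # f2
        "dwell_skewness": skew(dwell_times),       # f3
        "dwell_kurtosis": kurtosis(dwell_times),
    }
```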

4.1.2 Travel distance, speed, and acceleration [TS]

[TS] features provide in-depth information that needs to be explored [25]. To approximate the physical distance \((f_4)\) traveled by the customer, we created a network based on the physical connectivity between areas. We used the transition times to obtain the shopping speed \((f_5)\), and we modeled the acceleration from the speed variation between consecutive areas. In addition to statistical analysis, a time series analysis using the Haar wavelet transform (HWT) [34] was performed to determine how the customer’s interests changed over time. We included the first 16 HWT coefficients \((f_6)\) in our feature set.
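
Below is a minimal sketch of extracting the first 16 Haar coefficients from a dwell-time series; the zero-padding to a power of two and the coefficient ordering are assumptions, as the paper does not fix these details.

```python
import numpy as np

def haar_transform(series):
    """Unnormalized Haar wavelet transform of a 1-D series (padded to a power of two).
    Returns coefficients ordered as [approximation, coarse details, ..., fine details]."""
    x = np.asarray(series, dtype=float)
    n = 1 << int(np.ceil(np.log2(max(len(x), 1))))
    x = np.pad(x, (0, n - len(x)))                 # zero-pad to a power of two
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / 2.0            # pairwise averages
        diff = (x[0::2] - x[1::2]) / 2.0           # pairwise differences (detail coefficients)
        details.append(diff)
        x = avg
    return np.concatenate([x] + details[::-1])

def hwt_features(dwell_times, k=16):
    """First-k HWT coefficients (f6); short series are zero-padded."""
    coeffs = haar_transform(dwell_times)
    coeffs = np.pad(coeffs, (0, max(0, k - len(coeffs))))
    return coeffs[:k]
```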

4.1.3 Area preference [AP]

[AP] features make it possible to distinguish a customer viewing a specific area with high concentration from one shopping lightly throughout the store. The area name, the dwell time \((f_8)\), and its proportion over the total dwell time of the top-3 areas at each level are included as basic features. The coherency of each level \((f_7)\) measures the consistency of the customer’s behavior; it is defined as the proportion of time spent in the area where the customer stayed the longest. This metric is effective for capturing regular customers who know the store’s layout and go directly to the desired area.
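
A direct reading of this coherency definition is sketched below.

```python
from collections import defaultdict

def coherency(trajectory):
    """Proportion of the total dwell time spent in the longest-stayed area (f7),
    for a trajectory given as (area, t_in, t_out) tuples."""
    dwell_per_area = defaultdict(float)
    for area, t_in, t_out in trajectory:
        dwell_per_area[area] += t_out - t_in
    total = sum(dwell_per_area.values())
    return max(dwell_per_area.values()) / total if total > 0 else 0.0
```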

4.1.4 Entrance and exit pattern [EE]

Interestingly, according to our data, customers leaving through the back door \((f_9)\) revisited 13.6% more often than customers leaving through the front door. Therefore, we estimated the customers’ entrance and exit patterns from the sensors near the front and back doors. We expected that customers familiar with the store would use the more convenient door.

4.1.5 Heuristics [HR]

To fully exploit the relation between customer trajectories and revisits, we interviewed the managers and part-time staff of the stores to get intuition on what kinds of patterns are likely to appear for customers who are willing to revisit. In general, the interviewees agreed that staying in certain areas, trying an item, and purchasing or postponing the purchase reflect a customer’s interest and purchase pattern that lead to revisits. These actions, in fact, correspond to online shopping activities, i.e., browse, add to cart, checkout, and then revisit or churn [18]. As we do not know whether a customer actually tried an item in the fitting room or purchased it, we inferred those actions by tracking the dwell times in the fitting room and at the checkout counter. Here are two representative heuristics anticipating the revisit of customers for a future purchase.

  • If a customer tries on clothes in the fitting room without a purchase (\(\le \) 1 min at the checkout counter): \(f_{11} = 1\); for all other cases: \(f_{11} = 0\).

  • If a customer stays in the store much longer (\(\ge \) 10 min) than average visitors, without a purchase: \(f = 1\); if not: \(f = 0\).

The reasoning behind these associations is as follows. If a customer tries an item or stays in the store for a long time, he/she is prone to purchase the item. However, the fact that the customer does not purchase the item right away implies that there is a possibility of purchasing it at the next visit.
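
A sketch of the two heuristics above; the dwell times are assumed to be available from the area-level statistics, and the second rule is read here as staying at least 10 min longer than the store average, which is an interpretation rather than the paper’s exact threshold.

```python
def heuristic_features(fitting_room_dwell, checkout_dwell, total_dwell, avg_store_dwell):
    """Binary [HR] features inferred from dwell times (in seconds)."""
    # Tried an item in the fitting room but spent <= 1 min at the checkout counter (f11)
    tried_without_purchase = int(fitting_room_dwell > 0 and checkout_dwell <= 60)
    # Stayed at least 10 min longer than the average visitor, without an apparent purchase
    long_stay_without_purchase = int(total_dwell >= avg_store_dwell + 600
                                     and checkout_dwell <= 60)
    return tried_without_purchase, long_stay_without_purchase
```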

4.1.6 Statistics of each area [ST]

If a certain semantic area is highly relevant to revisits, the statistics from that area have higher predictability. For every semantic area, we created six features, including the number of times it was sensed \((f_{12})\), the percentage of the total time spent in the area (which is also used for the coherency feature), and the standard deviation of the times sensed in the area \((f_{13})\). As explained before, the difference in the total number of features between stores is mainly due to the difference in the number of areas that each store has.

4.1.7 Time of visit [TV]

The temporal features include the time of visit, such as the hour of the day and the day of the week \((f_{14})\), as basic features. Because these values are ordinal, they were transformed into multiple binary features by one-hot encoding. The value of a temporal feature is determined by the entrance time.

4.1.8 Upcoming events [UE]

Customers are more likely to visit a store during a clearance sale. However, they are less likely to visit the fashion district in holiday seasons (e.g., Spring Festival, Thanksgiving week), since they leave the city center. For example, customers who visited one month before a clearance sale have a higher chance of revisiting, since they would like to get a discount during the upcoming sale. By combining this simple extrinsic information, the temporal features, particularly [UE], become the second strongest predictive feature group. It contains six features, including the number of days left until the next clearance sale \((f_{15})\) and the number of holidays in the next 30 days \((f_{16})\), as numeric features.
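
A sketch of the two numeric [UE] features; the sale and holiday calendars below are hypothetical placeholders for the external information.

```python
from datetime import date

# Hypothetical external calendars (the actual dates are store- and year-specific)
CLEARANCE_SALES = [date(2017, 6, 30), date(2017, 12, 26)]
PUBLIC_HOLIDAYS = [date(2017, 5, 3), date(2017, 5, 5), date(2017, 10, 3)]

def upcoming_event_features(visit_date):
    """[UE] features: days until the next clearance sale (f15) and
    number of public holidays within the next 30 days (f16)."""
    days_to_next_sale = min(((s - visit_date).days for s in CLEARANCE_SALES
                             if s >= visit_date), default=None)
    holidays_next_30d = sum(1 for h in PUBLIC_HOLIDAYS
                            if 0 <= (h - visit_date).days < 30)
    return days_to_next_sale, holidays_next_30d
```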

4.1.9 Store accessibility [SA]

When installing sensors inside a store, who would imagine that the weak noise collected outside the store provides the most important clue for predicting revisits? Surprisingly, the revisit predictability increased dramatically when we included the [SA] features built from weak signals, which could easily have been dismissed as mere noise. The following setting is expected to be applicable to many studies that use in-store signals without customer address information.

The features are designed to capture various aspects of the interarrival times. We utilized two additional outdoor areas near the store, a 5 m zone and a 30 m zone, to detect customer occurrences. Considering a customer’s arrival process to the 5 m zone, let us denote the time of the first occurrence by \(T_1\). For \(k>1\), let \(T_{k}\) denote the elapsed time between the \((k-1)\)th and the kth events. We call the sequence \(\{T_k, k=1,2,\ldots \}\) the sequence of interarrival times. Considering the target visit as the nth event of the arrival process, we use the following features:

  • \(n-1\): Number of occurrences before the visit;

  • \(T_n\): Number of days from the last occurrence \((f_{17})\);

  • \(\mathbb {1}_{n>1}\): Indicator of having any occurrence before the visit;

  • \(\mu = \sum _{k=2}^{n}T_{k}/(n-1)\): Average interarrival time \((f_{18})\);

  • \(\sigma = \sqrt{\sum _{k=2}^{n}(T_{k}-\mu )^2/(n-1)}\): Standard deviation of interarrival times;

In addition to these five features derived from \(\{T_k\}\), we added the average sensed time of the previous occurrences.
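
A sketch of these interarrival-time features, taking the days of previous occurrences and of the target visit as input; the day granularity and the dictionary keys are assumptions.

```python
import statistics

def store_accessibility_features(occurrence_days, visit_day):
    """[SA] features from the arrival process to the 5 m zone.
    occurrence_days: sorted day indices of previous occurrences; visit_day: day of the target visit."""
    events = occurrence_days + [visit_day]          # the target visit is the nth event
    n = len(events)
    interarrivals = [events[i] - events[i - 1] for i in range(1, n)]
    return {
        "num_prev_occurrences": n - 1,
        "days_from_last_occurrence": interarrivals[-1] if interarrivals else None,        # f17 (T_n)
        "has_prev_occurrence": int(n > 1),
        "mean_interarrival": statistics.mean(interarrivals) if interarrivals else None,   # f18
        "std_interarrival": statistics.pstdev(interarrivals) if len(interarrivals) > 1 else 0.0,
    }
```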

4.1.10 Group movement [GM]

Unlike the previous features, [GM] features are extracted by considering multiple trajectories; they represent what can only be captured by analyzing the surrounding trajectories that occur simultaneously with the main trajectory. In our feature extraction framework, we considered the presence of companions \((f_{19})\) and the number of companions \((f_{20})\). The strongest cue for judging whether visitors are companions is entering the store at the same time. Based on the information obtained through a field study, we consider two visitors to be a group when both their entrance times and their exit times are within 30 s of each other. Additional information related to this feature can be found in Sect. 5.3.2 and “Appendix D”.
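
A sketch of the 30 s grouping rule over visits with entrance and exit timestamps; the tuple layout is an assumption.

```python
def companion_features(target, other_visits, threshold=30):
    """[GM] features: presence (f19) and number (f20) of companions of a target visit.
    Each visit is a (device_id, t_enter, t_exit) tuple; times are in seconds."""
    companions = [v for v in other_visits
                  if v[0] != target[0]
                  and abs(v[1] - target[1]) <= threshold     # entered within 30 s
                  and abs(v[2] - target[2]) <= threshold]    # exited within 30 s
    return int(len(companions) > 0), len(companions)
```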

4.2 Unused features

Some potentially useful features were not included in our final model because their effect on accuracy was marginal. However, we mention them here because they could be useful in other types of predictive analytics [14, 18].

4.2.1 Sequential patterns

Sequential patterns [7, 14] were not effective for the revisit prediction task on our datasets, so we omitted them from the final framework. To briefly describe our approach, we retrieved the top-k discriminative sequential patterns by information gain and generated k features. Each feature \(f_i(v)\) denotes the number of times the trajectory of visit v contains the ith pattern. We considered diverse levels of sequential patterns, as in Table 3, but the results were not satisfactory. Although generating these features was expensive, their information gains were typically low.

Table 3 Types of sequential patterns

4.2.2 Past indoor information

We also excluded the features that average a customer’s previous indoor mobility statistics, as well as those that represent the amount of change from past statistics [18]. By nature, considering this information doubles the number of features per revisit. However, unlike [SA], they were not strong indicators of revisits and thus were removed.

4.2.3 Features that may interfere with fair evaluation

Since most customers have a small number of visits, we developed a general model that considers the mobility of the entire set of customers. Following this principle, we considered each visit separately by removing customer identifiers. In this way, we also ensure that our model is robust under general cross-validation settings. We excluded the visit date to avoid a biased evaluation that favors customers who visited in an earlier period. We also ignored the explicit visit count information.

5 Evaluation results

In our experiments, we verify that our feature set designed from customer mobility patterns is effective in predicting customer revisits, especially for newcomers. In addition, we examine the performance of individual feature groups and semantic levels. Throughout the discussion section, we provide more detailed analyses of revisit prediction. The key contents include the performance change over the length of the data collection period and the model robustness to missing customers. We conclude this section by sharing the difficulty of securing accuracy in light of the gap between the predictive power and the statistical significance of each feature.

5.1 Settings

5.1.1 Prediction tasks

We designed two prediction tasks to explore customers’ revisit behaviors. The first task is a binary classification task to predict customers’ revisit intention \(RV_{\mathrm{bin}}\). The second task is a regression task to predict the revisit interval \(RV_{\mathrm{days}}\) between two consecutive visits. For each task, we conducted experiments on two data subsets. First, we evaluate the performance of our model on the entire customer dataset. Second, we use a dataset consisting only of first-time visitors to show that our prediction framework is effective in determining the willingness of first-time visitors to revisit.

5.1.2 Scoring metrics

We used two scoring metrics: accuracy and root-mean-squared error (RMSE) for the classification and regression tasks, respectively.

  • The accuracy is the ratio of the number of correct predictions to the number of all predictions. We used it for the classification task because it is considered the most intuitive metric for store managers and practitioners. To fairly compare the model performance on seven imbalanced datasets with different revisit rates, we downsampled non-revisited customers in each dataset. In this way, we designed the task as a binary classification on balanced classes, with 50% as the random baseline. To mitigate the risk of sampling bias, we prepared ten different downsampled train/test sets with different random seeds and report the averages of the ten executions.

  • The RMSE is measured between the actual and predicted intervals. To make the RMSE values of the seven stores with different data collection periods comparable, each RMSE value was normalized by the length T of the data collection period. Because the revisit interval cannot be calculated for the last visit, we excluded each customer’s last visit from the regression task (see the sketch after this list).
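
A sketch of the two evaluation conventions described above, balanced downsampling for accuracy and period-normalized RMSE; the helper names are illustrative.

```python
import numpy as np

def downsample_balanced(X, y, seed):
    """Downsample the majority class so that revisit / non-revisit labels are balanced."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, keep, replace=False),
                          rng.choice(neg, keep, replace=False)])
    rng.shuffle(idx)
    return X[idx], y[idx]

def normalized_rmse(y_true_days, y_pred_days, collection_period_days):
    """RMSE of revisit intervals divided by the data collection period T."""
    rmse = np.sqrt(np.mean((np.asarray(y_true_days) - np.asarray(y_pred_days)) ** 2))
    return rmse / collection_period_days
```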

5.1.3 Data preparation

The training and testing data were prepared with three settings:

  • S1: Fivefold cross-validation by dividing customers, where each customer’s data is included in only a single fold.

  • S2: Fivefold cross-validation by dividing visits,Footnote 5 where each visit is handled independently.

  • S3: The first 50% of visits as the training data, and the other 50% as the testing data.

The accuracy difference between S1 and S2 was insignificant up to the fourth decimal place. In S3, there was an accuracy loss of about 2.5% on average compared with S1 and S2, owing to floor plan changes in the stores and inaccurate labels caused by truncation in time (Sect. 5.3.1). Because of the page limit, we report the main results using configuration S1.

5.1.4 Classifier

All results described in this section were obtained using the Python API of the XGBoost [4] library, which optimizes the gradient boosting tree [5] framework. XGBoost gave the best performance among logistic regression, decision trees, random forests, AdaBoost, and the gradient boosting trees implemented in the Python Scikit-learn [26] library. For this manuscript, we also compared the performance with up-to-date boosting classifiers such as LightGBM [11] and CatBoost [28]; LightGBM was 5.7 times faster than CatBoost with similar performance. To further improve performance, we also tried two-level stacking by incorporating the top-3 individual models, but the performance improvement was marginal. We include the results of the non-best models in “Appendices A and B” to avoid breaking the original flow.

We used all features for training and testing the model, since using all features gives the best performance and the boosting tree classifier is robust to potential correlations between features. The elapsed time for each fold with 200,000 visits and 660 features was no longer than 1 min on a single machine (Intel i7-6700 with 16 GB RAM, without a GPU).
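
A minimal sketch of the training loop under configuration S1 using the XGBoost Python API; the hyperparameters shown are illustrative defaults, not the tuned values used in the paper.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from sklearn.metrics import accuracy_score

def cross_validate_revisit(X, y, customer_ids, n_splits=5):
    """Fivefold CV that keeps each customer's visits within a single fold (setting S1)."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=customer_ids):
        model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))
```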

5.2 Results

5.2.1 Overall results

Table 4 shows the overall accuracy and RMSE. First, the prediction accuracy for first-time visitors is 67% averaged over the seven stores. Using only the mobility data captured by in-store sensors, two out of three first-time customers’ revisits are predictable without any historical data from the store. Second, the average prediction accuracy increases to 74% when all customers are considered. Third, stores with a long data collection period and abundant user logs generally show high performance, although this trend may not hold depending on the characteristics of the stores.

Table 4 Performance of classification and regression tasks

5.2.2 Predictive power of feature groups

Figure 7a investigates the predictive power of each group of features in store E_SC. Each bar corresponds to the prediction results using only the features of a specific group. The labels on the x-axis are the abbreviations of the feature groups categorized in Table 2. For ease of comparison, the leftmost bar represents the results when all features are used (as in Table 4). The store accessibility [SA] features have the strongest predictive power, especially for the prediction with all visitors, followed by the upcoming event [UE] features for first-time visitors.

5.2.3 Predictive power of semantic levels

Contrary to our intuition, the semantic levels inside the store did not boost the performance much. As shown in Fig. 7b, only the features generated from the category level (\(\mathcal {T}_2\)) beat the features from the sensor level (\(\mathcal {T}_1\)). The semantic trajectories generated from the floor level (\(\mathcal {T}_3\)) and the gender level (\(\mathcal {T}_4\)) were not effective for predicting customer revisits in store E_SC. We conclude that finding an effective trajectory abstraction is difficult even when hierarchical information is provided.

Fig. 7 Performance comparison on feature groups and semantic levels (store E_SC)

5.2.4 Performance improvement by analyzing trajectories

To measure the performance improvement achieved by our features, we developed two baselines for comparison. The first baseline is a theoretical lower bound of the prediction accuracy obtained from the revisit statistics shown in Fig. 2. Since all other information is ignored here, the prediction accuracy with this limited information must be lower than that obtained using full trajectories. The procedure for deriving the lower bounds is given in “Appendix C”.

The second baseline is a model to which the visit date is added. Since our task utilizes finite time series datasets with time-dependent objectives, logs collected earlier tend to have a relatively high revisit rate. Therefore, including the visit date as an additional feature naturally improves the baseline accuracy; if infinite data existed, the performance increase from this factor would disappear. The benefit of using customer mobility can thus be measured as the gap between our final model and the second baseline.

Figure 8 reports the accuracy of our modelFootnote 6 against the two baselines. Our final model outperforms the second baseline by 4.7–11.6% in terms of accuracy. For first-time visitors, the effectiveness of trajectory analysis increases, showing a performance improvement of 8.0–24.3%.

Fig. 8 Effectiveness of analyzing customer trajectories

5.2.5 Prediction accuracy according to the number of visits

For further analysis, we measured the prediction accuracy for each customer group determined by their number of visits. For this experiment, we used the model trained on all customers.

Customers who visit more than a certain number of times usually have a high chance of revisiting. Thus, we expected our model to predict their revisits with high accuracy. The results in Table 5 confirm this expectation: as customers visited more often, the prediction accuracy tended to increase in all stores. Interestingly, we found that the prediction accuracy was sometimes lowest for \(v_{2}\), since this group of customers seemed to have the most uncertain revisit behavior.

Table 6 shows the improvement of our model over the two baselines of Sect. 5.2.4 for each customer group. It indicates that our model is more effective than the baselines by over 10%, especially for \(v_1\) and \(v_2\). Thus, our feature set is effective in predicting customers’ revisits even when they are newcomers.

Table 5 Prediction accuracy (%) conditionally measured on groups of customers with the same number of visits
Table 6 Improvement of our model against the two baselines

5.3 Discussions

5.3.1 Importance of data collection period

Fig. 9 Impact of the data collection period

We wondered how much the model’s performance varies depending on the amount of data. Figure 9a shows that the overall prediction accuracy increases with the length of the data collection period. The performance improves rapidly over the first few months, and then the increments become smaller. The main reason for the poor performance in the first few months is the lack of information on revisiting customers: the labels in the training data can be inaccurate if the information is collected for an insufficient period. To confirm this conjecture, we also examined the proportion of positive revisit intentions as the data collection progressed, as shown in Fig. 9c. The proportion, \(E[RV_{\mathrm{bin}}(v)]\), indeed increased as the data collection period increased. However, the prediction accuracy for first-time visitors did not always increase. We notice that the average revisit rate also decreases in those cases, i.e., O_MD and L_MD, which implies that recently visited customers do not tend to revisit these stores. Overall, with a longer data collection period, performance improves because there are more positive cases from regular customers.

Fig. 10 Missing behaviors in noninvasively collected data. a Customers’ revisits were untraceable if they did not have Wi-Fi turned on. b The actual group movement ratio observed in the store was 56% instead of the 15.6% observed in the data. Researchers must not interpret the data as is when explaining real behavior

5.3.2 Real behavior and collected data—Are they same?

Noninvasively collected data are also limited, considering that not all users turn on Wi-Fi on their mobile devices. Since 4G LTE connections are fast and ubiquitous in Korea, the proportion of ‘always-on’ users is only 30% [24]. This limitation implies that the datasets miss some customer behaviors in the real world. Figure 10a shows revisits that are untraceable due to the conditional Wi-Fi usage of a customer, and Fig. 10b shows the gap between the actual and observed proportions of group movements caused by low Wi-Fi usage. The reason for the difference is that both companions must use Wi-Fi for the accompanying record to be observed in the data. Let \(p_x\) denote the probability of customers who turn on Wi-Fi on-site (including ‘conditionally-on’ users), and \(p_y\) denote the actual proportion of customers in a group of size two. Here we ignore groups of more than two customers, which are not common. Then the proportion \(p_{yo}\) of group customers observed in the data can be represented as Eq. (1).

$$\begin{aligned} p_{yo}&= \dfrac{\mathrm{Observed(Group)}}{\mathrm{Observed(Group)} + \mathrm{Observed(Individual)}} \\&= \dfrac{p_y(p_x)^2}{p_y(p_x)^2+2p_yp_x(1-p_x)+(1-p_y)p_x} = \dfrac{p_xp_y}{1+p_y-p_xp_y} \end{aligned}$$
(1)

Therefore, readers should recognize that the observed group movement ratio can be very different from the actual one. We leave additional details to “Appendix E”, where we also briefly introduce how to use this gap to decide the 30 s threshold for determining group movements. In the future, if customers’ behaviors become more traceable with additional sensing technologies, we believe that noninvasively collected data will better reflect actual customer behaviors.
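
As a numerical illustration, the unsimplified form of Eq. (1) can be inverted to estimate the actual group ratio \(p_y\) from the observed ratio \(p_{yo}\) and an assumed Wi-Fi-on probability \(p_x\); the example values below are assumptions, not the paper’s estimates.

```python
def observed_group_ratio(p_y, p_x):
    """Eq. (1): proportion of group movements visible in the data."""
    both_on = p_y * p_x ** 2          # group pair with both devices observed
    one_on = 2 * p_y * p_x * (1 - p_x)  # group pair with only one device observed
    single_on = (1 - p_y) * p_x         # individual customer observed
    return both_on / (both_on + one_on + single_on)

def estimate_actual_group_ratio(p_yo, p_x, tol=1e-6):
    """Invert Eq. (1) for p_y by bisection (the observed ratio increases with p_y)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if observed_group_ratio(mid, p_x) < p_yo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example with assumed values: 15.6% observed group ratio and ~37% Wi-Fi-on probability
print(estimate_actual_group_ratio(0.156, 0.37))
```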

Fig. 11 Model robustness on missing customers

5.3.3 Performance on incomplete data

Assuming that some customers’ data are completely lost, is the performance of our model still reliable? We confirmed that over 95% of our model’s performance is maintained with a very small fraction of the dataset (e.g., 0.5% for L_MD). For each store, we randomly removed the records of a set of customers and measured the model performance using the remaining data. Figure 11 shows the averages over 20 different executions. The accuracy loss of the model was within 3% when 10,000 visits were secured. This observation can also be interpreted as follows:

  • For large-scale mobility data, a comparable prediction model can be built by using small data subsets.

  • On the other hand, we can estimate the prediction performance when all customer data becomes traceable.

  • High prediction accuracy of some stores may not be due to their large number of visitors.

5.3.4 Meaningful insights but low predictability

We would like to point out that securing high prediction accuracy can be difficult even when the differences between revisitors and non-revisitors are obvious. Some feature values differ significantly by revisit status, which should help explain the difference between the two groups. However, from the perspective of a prediction task, the correlation coefficients were relatively small, and the prediction accuracy obtained using such a feature alone was not very high.

In Table 7, the relative difference diff1 of the feature values depending on the future revisit status is noticeable (2.7–104.2%). In addition, the p value (\(p<10^{-100}\)) from the Mann–Whitney U test shows that the feature values of the two groups come from different distributions. From another perspective, the relative difference diff2 in the average revisit rate between the top 5% and the bottom 5% of customers in terms of feature values also shows a clear distinction of 43.5–134.7%.

However, the correlation coefficients and the final prediction accuracy using each feature are not as impressive as diff1 and diff2. Practitioners should note that the behavioral difference between the two groups can be obvious and highly significant in terms of the p value, yet weak in terms of correlation and prediction accuracy. Likewise, a feature should not be discarded because of a low correlation coefficient: if the feature has a nonlinear tendency, its predictive power can still be strong. The statistics of \(f_b\) and \(f_c\) in Table 7 confirm this argument. We assert that our high-quality prediction comes from a combination of various kinds of features that behave differently.

Table 7 Statistics of feature values with revisit status, and their final predictability: statistics from the store O_MD
Fig. 12 Detailed relationship between four features and \(E[RV_{\mathrm{bin}}(v)]\) mentioned in Table 7

6 Related work

Predictive analytics using trajectories. Next-location prediction using trajectories is one of the most actively studied topics in the computer science community. To predict the next location, frequent trajectory patterns [7, 23], nonlinear time series analysis of arrival and residence times [31], hidden Markov models (HMMs) [22], and cluster-based prediction with semantic features [44] have been applied. The performance of many approaches was compared by Baumann et al. [1], and the data sparsity problem was handled by Xue et al. [39]. The results support that predicting the next location using partial trajectories is feasible, in line with regularity studies of human mobility [6, 19, 33]. Within this subject, the prediction of the final destination of a taxi [2, 3, 20] has also been actively studied since the 2015 ECML/PKDD competition.Footnote 7 The main difference between our study and previous studies is the prediction objective. We studied customers’ revisit intentions in off-line stores using indoor trajectories; thus, our model focuses on predicting revisits instead of locations. To the best of our knowledge, there is no prior study on predicting revisit intention using large-scale trajectories captured by in-store sensors.

Customer behavior in the store. Park et al. [25] examined the factors of route choice in three clothing outlets by tracking 484 customers. They considered the spatial characteristics of the outlets, the types of customers, and their shopping behaviors. In a grocery store, an RFID-based tracking system attached to shopping carts enabled Hui et al. [8] to find interesting causal effects, such as that consumers who spend more time in the store become more purposeful, or that after purchasing virtue categories, the presence of other shoppers attracts consumers yet reduces their tendency to purchase. Yada [40] applied a character string analysis technique, EBONSAI, originally developed in the field of molecular biology; each shopping area was converted into a character so that the algorithm could discover purchasing behaviors. Hwang and Jang [9] introduced process mining techniques to understand customer pathways. The Petri-net model learned by inductive learning algorithms provides a formal representation of customers’ shopping paths. Through collaboration between sensor providers and their clients, they showed that customers’ behavioral patterns and sales revenue changed in accordance with the process models and store layouts. This study also utilized a Kolon store dataset collected by ZOYI, the data provider of our seven stores. Although these studies did not focus on customer revisits, they were valuable resources for developing the features that describe customers’ motion patterns. Currently, Alibaba’s Hema XianshengFootnote 8 and Amazon GoFootnote 9 are the most widely known future retailers breaking with the traditional retail experience. Given the abundant in-store data from these retailers, we expect tremendous opportunities to study customer behavior patterns during shopping.

Indoor analysis in other places. Traditionally, the analysis of customers’ indoor movement and their connection to space has been conducted in the areas of architecture and interior design. Especially for museums, various movement patterns were tracked manually [16, 41] to rearrange the exhibits and enhance visitor satisfaction [10]. For example, the extent of visibility of the display was studied [17] to arrange the main display using the behavior of passive visitors [10]; the authors concluded that visitors are influenced by the continuity of the display within their view. With the help of noninvasive monitoring, visitor studies in museums have entered a new phase. Yoshimura et al. [45] installed eight beacons in the Louvre Museum and analyzed the most popular paths to mitigate micro-congestion inside the museum. By tracking visitors’ movements, the Guggenheim MuseumFootnote 10 increased visitor engagement by making smarter curatorial decisions. Both museums and stores are places where customers’ indoor mobility data are meaningful for studying satisfaction. Thus, we expect that our framework is also applicable to museum visitor studies.

7 Conclusions

Various retail analytics companies have set up sensors to monitor customer mobility in off-line stores. Although it was difficult to connect this data with other kinds of data because of legal issues, we confirmed that customer mobility indeed carries diverse meanings. Without access to customer purchase data or customer profiles, we found that the revisit intention of customers is predictable with up to 80% accuracy using only the Wi-Fi signals collected by in-store sensors. Toward this goal, we suggested guidelines for setting the collection period of indoor data for revisit prediction. We also showed that our model is robust even if a majority of the customer data is missing. Moreover, we demonstrated that significant observations may disagree with the final predictive power. We expect that our findings will help data scientists and marketers from both retail analytics companies and their clients make important decisions. In the future, we plan to discover additional aspects of revisits from inter-store mobility with an advanced model that learns the customer revisit mechanism.