1 Introduction

The rapid growth of Internet usage drives the popularity of online shopping. In particular, demand for online shopping can be boosted by advanced vehicular telematics over heterogeneous wireless networks [14], and large amounts of real-time online data can be collected using distributed information estimation technologies [15]. According to the latest eMarketer forecast, online shopping will increase by 5.4% from 2015 to 2019 and account for 12.8% of global retail sales by 2019 [7]. To attract more users and maximize revenue, e-commerce businesses (e.g., Amazon) strive to provide better services, for example by designing personalized recommendation systems that improve users’ shopping experience [11, 17, 18].

One of the fundamental problems here is to thoroughly understand users’ purchasing demands and shopping patterns [4, 10]. Compared with traditional survey questionnaires, data-driven behavior analysis can comprehensively reveal what users prefer and how they select products [3, 5, 6, 16, 19]. Previous work studied repeated consumption behavior [1, 2], but how users browse different kinds of products online, and how distinct their behaviors are, remain unknown. The work on heterogeneous systems by Qiu et al. [13] proposed a way to reduce the cost of complicated heterogeneous data and systems. Their work on online systems [9] and green cloud [12] has also provided valuable guidance on online data processing and cost reduction.

To investigate this problem, we use a large-scale dataset that contains user online shopping and browsing logs at one of the major e-commerce businesses in China. The dataset was collected from one of the major ISPs and contains 67 million browsing records of 2 million users in Shanghai from March 2 to March 31, 2015. First, we use a co-clustering method to cluster users and the categories of browsed products simultaneously. Then, we seek to understand the characteristics of shopping behaviors based on the average number of consecutive browses. We obtain two major findings, summarized as follows:

  • There are both homogeneous (e.g., users browsing one category) and heterogeneous (e.g., users browsing diverse categories) groups of users.

  • User browsing behaviors differ markedly across categories. Repetitive and targeted browsing are the two prevalent patterns.

Our findings are useful for designing online shopping systems customized for dedicated groups of users by adapting to their personal consumption behaviors. In addition, from the perspective of ISPs, the findings help characterize user profiles, which have potential commercial value.

2 Dataset

Our dataset is collected through deep packet inspection appliances at the gateways of an ISP and contains complete shopping and browsing logs of a large e-commerce platform. Each log entry consists of a user ID, a timestamp and the requested URL. To obtain detailed information, we crawl the URLs on the e-commerce website and obtain the product category corresponding to each browsing request.

In summary, we obtain more than 67 million browsing requests of 2,141,951 users who browse over 15 million products, which are classified into 28 categories (e.g., Clothing, Books, Phone & Accessories). We plot the distribution of the number of products, users and browses in the 11 major categories (Footnote 1), which account for 82.35% of browses, as shown in Fig. 1. Although Phone & Accessories attracts the most browses, the number of products belonging to this category is relatively small. In contrast, (E-)Books & CDs has a large number of products but relatively few browses. The reason is that user demands differ across product categories, and users exhibit different shopping behaviors when browsing different kinds of products.

Fig. 1 Distribution of the number of browses, users and products in the 11 major categories (Footnote 1)

3 Metric and methodology

With the goal of profiling users by their online shopping and browsing logs, we now describe our metrics and methodology. We denote \({\Omega }=\{c_{k}\}\ (1 \le k \le 28)\) as the set of product categories and \(W_{k}\) as the number of products belonging to \(c_{k}\). For a given user i, we model the user’s browsing record as a sequence of browsing events \(s_{i_{j}}\): \(S_{i}=\{s_{i_{1}}, {\Delta } t_{i_{1}}, s_{i_{2}}, {\Delta } t_{i_{2}}, ..., s_{i_{j}}, {\Delta } t_{i_{j}}, ...\}\), with \({\Delta } t_{i_{j}}\) representing the time gap between two consecutive events \(s_{i_{j}}\) and \(s_{i_{j + 1}}\). We denote \({M_{i}^{k}}={\sum }_{j = 1}^{N_{i}}I_{c_{k}}(s_{i_{j}})\) as the total number of browses in the k-th category by user i, where \(N_{i}\) is the number of browses by user i and \(I_{A}(x)\) is the indicator function.

  • Browsing Entropy: It measures how diversely users browse the products of different categories, defined as

    $$ E_{i} = \frac{-\sum\limits_{k = 1}^{28}{\frac{{M_{i}^{k}}}{{\sum}_{k = 1}^{28}{{M_{i}^{k}}}} \log_{2}{\frac{{M_{i}^{k}}}{{\sum}_{k = 1}^{28}{{M_{i}^{k}}}}}}}{\log_{2}{{\sum}_{k = 1}^{28}I_{\{>0\}}({M_{i}^{k}})}}. $$
    (1)

    Its value ranges from 0 to 1. A higher value indicates a more uniform distribution among categories. If the user browses only one category, it is 0.

  • Repetitive Ratio: It measures how frequently products of the same category are browsed by users, expressed as \(R(k)={\sum }_{i}{M_{i}^{k}}/{W_{k}}\), where a higher value indicates that users browse products of the k-th category more frequently.

  • Co-Clustering of Users and Categories: Since users and their browsing categories are associated with each other, we need to cluster both of them simultaneously. We use Phantom [8] to perform divisive hierarchical co-clustering. We calculate the normalized number of browses in each category per user, then obtain a feature matrix with users on each row and categories on each column, which is the input of the co-clustering algorithm.

    To evaluate the effectiveness of co-clustering, we define the average distance to each cluster as follows:

    $$ D(m, k)=\frac{1}{p_{m}}{\sum\limits_{i = 1}^{i=p_{m}}|(\mathbf{F}_{m}^{i}-\bar{\mathbf{F}}_{k})|}, $$
    (2)

    where \(\bar {\mathbf {F}}_{k}=\frac {1}{p_{k}}{\sum }_{i = 1}^{i=p_{k}}\mathbf {F}_{k}^{i}\), \(\mathbf {F}_{m}^{i}\) represents the array consisting of the normalized browses in each category by user i in cluster m, and \(p_{m}\) denotes the number of users in cluster m. In particular, if \(D(m,m) < D(m,k)\) for \(m \ne k\), users in cluster m are more similar to one another than to users in other clusters.

  • Category-based Browsing Behavior Analysis: To reveal how differently users browse different categories, we partition the browsing sequence \(S_{i}\) into sessions using a time threshold; a gap exceeding the threshold indicates that the user has gone offline. Then, we count the number of consecutive browses on each category in each session within \(S_{i}\). Finally, we average the number of consecutive browses on the k-th category by user i.
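The session-based analysis in the last item above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the event format (timestamp in seconds, category label), the function names, and the choice to average run lengths per category over all sessions of a user are hypothetical. The 300-second default mirrors the 5-minute threshold used in Section 4.

```python
from collections import defaultdict
from itertools import groupby


def split_sessions(events, gap_threshold=300):
    """Split time-ordered (timestamp_seconds, category) events into sessions.

    A new session starts whenever the gap to the previous event exceeds
    gap_threshold (assumed to be 300 s here).
    """
    sessions = []
    current = []
    prev_ts = None
    for ts, category in events:
        if prev_ts is not None and ts - prev_ts > gap_threshold:
            sessions.append(current)
            current = []
        current.append(category)
        prev_ts = ts
    if current:
        sessions.append(current)
    return sessions


def average_consecutive_browses(events, gap_threshold=300):
    """Return {category: mean length of consecutive runs} for one user."""
    run_lengths = defaultdict(list)
    for session in split_sessions(events, gap_threshold):
        # groupby merges adjacent equal labels, i.e. consecutive browses
        # of the same category within one session.
        for category, run in groupby(session):
            run_lengths[category].append(len(list(run)))
    return {c: sum(v) / len(v) for c, v in run_lengths.items()}


# Hypothetical browsing record of one user (not taken from the dataset).
events = [
    (0, "Books"), (20, "Books"), (45, "Clothing"),
    (1000, "Books"), (1010, "Books"), (1030, "Books"), (1040, "Books"),
]
print(split_sessions(events))
print(average_consecutive_browses(events))
```

In this toy record the 955-second gap starts a second session, so Books has runs of length 2 and 4 and Clothing has a single run of length 1.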

4 Results

In this section, we apply the above metrics to analyze online shopping behaviors based on our collected dataset, which is described in detail in Section 2.

Browsing entropy

We first examine how diversely users browse different categories according to (1). Figure 2 shows that nearly half of the users (those with an entropy of 0) concentrate on one category, and only 3.6% have a browsing entropy greater than 0.8. This indicates that most users focus on a few categories when they shop and browse online.

Fig. 2 Distribution of browsing entropy that is defined in (1)
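As a rough illustration of how the browsing entropy in (1) could be computed for a single user, consider the sketch below. The function name and the input format (one category label per browse) are assumptions made for this example. A user with a single category is assigned 0 explicitly, because the normalizing term log2(1) is zero.

```python
import math
from collections import Counter


def browsing_entropy(categories):
    """Normalized browsing entropy of one user, following Eq. (1).

    categories: iterable of category labels, one per browse (assumed format).
    """
    counts = Counter(categories)   # M_i^k for each category the user browsed
    total = sum(counts.values())   # total number of browses of the user
    distinct = len(counts)         # number of categories with M_i^k > 0
    if distinct <= 1:
        # Single-category (or empty) record: entropy is taken as 0.
        return 0.0
    entropy = -sum((m / total) * math.log2(m / total) for m in counts.values())
    return entropy / math.log2(distinct)


print(browsing_entropy(["Books"] * 5))                   # 0.0
print(browsing_entropy(["Books", "Clothing"]))           # 1.0
print(browsing_entropy(["Books"] * 3 + ["Clothing"]))
```

The normalization divides by the logarithm of the number of categories the user actually browsed, so two equally visited categories already yield the maximum value of 1.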

Repetitive ratio

Figure 3 shows the browsing repetitive ratio of the major categories. We find significant differences: House Appliances has the highest repetitive ratio, while Books & CDs has the lowest. This suggests that users have different shopping behaviors in different categories, i.e., repetitive or targeted browses.

Fig. 3 Average browsing repetitive ratio in the 11 major categories (Footnote 1)
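The repetitive ratio R(k) defined in Section 3 divides the total number of browses in a category by the number of products W_k in that category. A minimal sketch, assuming browses are available as (user ID, category) pairs and product counts as a dictionary (both hypothetical input formats, with toy values that are not taken from the dataset):

```python
from collections import Counter


def repetitive_ratio(browses, products_per_category):
    """R(k) = (total browses in category k) / (number of products W_k).

    browses: iterable of (user_id, category) pairs, one per browse.
    products_per_category: dict mapping category -> W_k.
    Categories with W_k == 0 are skipped to avoid division by zero.
    """
    total_browses = Counter(category for _, category in browses)
    return {
        k: total_browses[k] / w_k
        for k, w_k in products_per_category.items()
        if w_k > 0
    }


# Hypothetical toy data.
browses = [("u1", "Phone"), ("u1", "Phone"), ("u2", "Phone"), ("u2", "Books")]
products = {"Phone": 2, "Books": 4}
print(repetitive_ratio(browses, products))  # {'Phone': 1.5, 'Books': 0.25}
```

A category with few products but many browses, like the toy "Phone" category here, obtains a high ratio.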

Co-clustering of users and categories

To investigate prevalent patterns of users’ browsing behavior on different categories, we apply the co-clustering algorithm to identify groups of users and categories simultaneously. Guided by Figs. 4 and 5, we first select users that have a sufficient and reasonable number of browsing records, namely users that browse 100 to 3000 products, which yields 46,366 users.

Fig. 4 Distribution of the number of browses per user: 27% of users have only one browse, and 96% of users have fewer than 100

Fig. 5 Distribution of the time gaps. When users have more than 3000 browses, more than 80% of the time gaps between their browsing events are 1 s, which suggests abnormal behavior, i.e., machine-generated logs
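The user selection step and the feature matrix fed to the co-clustering algorithm could be prepared roughly as follows. This sketch rests on several assumptions: it counts browses rather than distinct products, it treats both bounds as inclusive, and it normalizes each user's row to fractions of that user's total browses. The function name and input format are hypothetical, and the co-clustering itself (Phantom [8]) is not reproduced here.

```python
from collections import Counter, defaultdict


def build_feature_matrix(browses, categories, min_browses=100, max_browses=3000):
    """Filter users by activity and build a row-normalized feature matrix.

    browses: iterable of (user_id, category) pairs, one per browse.
    categories: ordered list of category labels (one matrix column each).
    Returns (user_ids, matrix), where matrix[r][c] is the fraction of
    user r's browses that fall in categories[c].
    """
    per_user = defaultdict(Counter)
    for user_id, category in browses:
        per_user[user_id][category] += 1

    user_ids, matrix = [], []
    for user_id in sorted(per_user):
        counts = per_user[user_id]
        total = sum(counts.values())
        # Drop users that are too inactive or suspiciously active.
        if min_browses <= total <= max_browses:
            user_ids.append(user_id)
            matrix.append([counts[c] / total for c in categories])
    return user_ids, matrix


# Hypothetical toy data with lowered thresholds for illustration.
browses = [("a", "Books")] * 3 + [("a", "Clothing")] + [("b", "Books")]
ids, features = build_feature_matrix(
    browses, ["Books", "Clothing"], min_browses=2, max_browses=10
)
print(ids, features)  # ['a'] [[0.75, 0.25]]
```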

We obtain 18 clusters (the 12 major clusters are listed in Table 1) and plot the heatmap of the average distance among clusters according to (2) in Fig. 6, which verifies the effectiveness of the co-clustering results. The obtained clusters are visualized in Fig. 7, where we can observe several informative clusters, as follows:

  • Cluster 1 (Business Usage): Users tend to browse office products or book tickets for business purposes.

  • Cluster 2 (Individual Dressing): In this cluster, users always choose the products belonging to Clothing, Sports & Health and Footwear for individual dressing.

  • Cluster 4 (Household Usage): Users browse products for household usage, including Kitchenware, House Decorations and Household Appliances.

  • Cluster 8 (High-income Group): Users in this cluster browse Car Accessories products, which suggests that they may own cars, and are recognized as the high-income group. They also prefer expensive High-end Brand products.

Table 1 Co-clustering results
Fig. 6 Average distance of users in one cluster to others, where numbers 1-18 represent each cluster, and lighter color represents smaller distance
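The average distance D(m, k) in (2) can be sketched as the mean distance from each user vector in cluster m to the centroid of cluster k. The norm in (2) is not specified in the text, so Euclidean distance is assumed here; the function names and the toy clusters are likewise hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def average_distance(cluster_m, cluster_k):
    """D(m, k): mean distance from users in cluster m to the centroid of cluster k.

    Each cluster is a list of per-user feature vectors (normalized browses
    per category). Euclidean distance is an assumption of this sketch.
    """
    center_k = centroid(cluster_k)
    return sum(math.dist(row, center_k) for row in cluster_m) / len(cluster_m)


# Hypothetical clusters over two categories.
cluster_a = [[1.0, 0.0], [0.8, 0.2]]
cluster_b = [[0.0, 1.0], [0.2, 0.8]]
d_aa = average_distance(cluster_a, cluster_a)
d_ab = average_distance(cluster_a, cluster_b)
print(d_aa, d_ab)
print(d_aa < d_ab)  # True
```

For a well-separated clustering, the within-cluster value D(m, m) is smaller than D(m, k) for every other cluster k, which is the condition stated in Section 3.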

Fig. 7 Diversity of co-clustering results showing user online shopping and browsing behaviors

Category-based browsing behavior analysis

Based on our analysis of the grouping patterns between users and categories, we study the browsing behaviors of users in each cluster listed in Table 1. We average users’ consecutive browses and residence time in the corresponding category of each cluster and plot them in Fig. 8, using a threshold of 5 minutes to partition sessions. The number of consecutive browses differs distinctly among clusters. For example, Toys & Musical Instruments (Cluster 12), which is popular among children, exhibits a short residence time but more consecutive browses, which indicates more repetitive browsing, whereas (E-)Books & CDs (Cluster 10) has a long residence time, which shows that users are willing to gather more information about the products. In particular, Ticketing and PC & Office in Cluster 1 attract the fewest consecutive browses, which suggests that users choose them directly with a clear purpose of purchase.

Fig. 8 Average residence time and consecutive browses on the corresponding category in each cluster, where residence time is measured in seconds

5 Conclusion & future work

With a dataset containing users’ one-month online shopping and browsing records, we investigate the grouping characteristics between users and product categories, and uncover distinct patterns of browsing behavior in different categories. Our findings provide valuable insights for e-commerce businesses to customize their online shopping systems and enhance user experience. As for future work, we would like to study how users’ preferences change over time and how their other online activities are related to their shopping behaviors.