Profiling users by online shopping behaviors

Yan, Huan; Wang, Zifeng; Lin, Tzu-Heng; Li, Yong; Jin, Depeng

doi:10.1007/s11042-017-5365-7

Published: 11 December 2017

Volume 77, pages 21935–21945, (2018)

Multimedia Tools and Applications


Abstract

Online shopping has become prevalent in daily life. Profiling users and understanding their browsing behaviors are critical for enhancing the shopping experience and maximizing sales revenue. In this paper, based on a one-month dataset recording 67 million online shopping and browsing logs of 2 million users, we seek to understand how users browse and shop for products, and how distinct these behaviors are. We find that there exist dedicated groups of users that prefer certain product categories corresponding to similar demands. Moreover, browsing behaviors differ distinctly across categories, with repetitive and targeted browsing as the two major prevalent patterns.


1 Introduction

The rapid growth of Internet usage drives the popularity of online shopping. In particular, online shopping demand can be boosted by advanced vehicular telematics over heterogeneous wireless networks [14], and large amounts of real-time online data can be collected using distributed information estimation technologies [15]. According to eMarketer’s latest forecast, online shopping will increase by 5.4% from 2015 to 2019 and account for 12.8% of global retail sales by 2019 [7]. To attract more users and maximize revenue, e-commerce businesses (e.g., Amazon) strive to provide better services, e.g., designing personalized recommendation systems to improve users’ shopping experience [11, 17, 18].

One of the fundamental problems here is to thoroughly understand users’ purchasing demands and shopping patterns [4, 10]. Compared to traditional survey questionnaires, data-driven behavior analysis can comprehensively reveal what users prefer and how they select products [3, 5, 6, 16, 19]. Previous work studied repeated consumption behavior [1, 2], but how users browse different kinds of products online, and how distinct their behaviors are, remain unknown. The work on heterogeneous systems by Qiu et al. [13] proposed a way to reduce the cost of complicated heterogeneous data and systems. Their work on online systems [9] and green cloud [12] also provides valuable guidance on online data processing and cost reduction.

To investigate this problem, we use a large-scale dataset that contains user online shopping and browsing logs at one of the major e-commerce businesses in China. Our dataset was collected by one of the major ISPs and contains 67 million browsing records of 2 million users in Shanghai from March 2 to March 31, 2015. First, we use a co-clustering method to cluster both users and categories of browsed products simultaneously. Then, we seek to understand the characteristics of shopping behaviors based on the average number of consecutive browses. We obtain two major findings, summarized as follows:

There are both homogeneous (e.g., users browsing one category) and heterogeneous (e.g., users browsing diverse categories) groups of users.
Users’ browsing behaviors differ distinctly across categories. Repetitive and targeted browsing are the two prevalent patterns.

Our findings are useful for designing online shopping systems customized for dedicated groups of users by adapting to their personal consumption behaviors. In addition, ISPs can use them to characterize user profiles that have potential commercial value.

2 Dataset

Our dataset is collected through deep packet inspection appliances at the gateways of an ISP, and contains complete shopping and browsing logs of a large online e-commerce platform. Each entry in the logs is characterized by user ID, timestamp and requested URL. To obtain detailed information, we crawl the URLs at the e-commerce website and obtain the corresponding product category of each browsing request.

In summary, we obtain more than 67 million browsing requests from 2,141,951 users who browse over 15 million products, which are classified into 28 categories (e.g., Clothing, Books, Phone & Accessories). We plot the distribution of the number of products, users and browses in the 11 major categories (listed in the Notes), which account for 82.35% of browses, as shown in Fig. 1. Although Phone & Accessories attracts the most browses, the number of products in this category is relatively small. In contrast, (E-)Books & CDs has a large number of products but relatively few browses. The reason is that user demands differ across product categories, so users exhibit different shopping behaviors when browsing different kinds of products.

3 Metric and methodology

With the goal of profiling users by their online shopping and browsing logs, we now describe our metrics and methodology. We denote $\Omega = \{c_{k}\}$ $(1 \le k \le 28)$ as the set of product categories and $W_{k}$ as the number of products belonging to $c_{k}$. For a given user $i$, we model the browsing record as a sequence of browsing events $s_{i_{j}}$: $S_{i}=\{s_{i_{1}}, {\Delta } t_{i_{1}}, s_{i_{2}}, {\Delta } t_{i_{2}}, \ldots, s_{i_{j}}, {\Delta } t_{i_{j}}, \ldots\}$, with ${\Delta } t_{i_{j}}$ representing the time gap between two consecutive events $s_{i_{j}}$ and $s_{i_{j + 1}}$. We denote ${M_{i}^{k}}={\sum }_{j = 1}^{N_{i}}I_{c_{k}}(s_{i_{j}})$ as the total number of browses in the k-th category by user $i$, where $N_{i}$ is the number of browses by user $i$ and $I_{A}(x)$ is the indicator function.

Browsing Entropy: It measures how diversely users browse the products of different categories, defined as
$$ E_{i} = \frac{-\sum\limits_{k = 1}^{28}{\frac{{M_{i}^{k}}}{{\sum}_{k = 1}^{28}{{M_{i}^{k}}}} \log_{2}{\frac{{M_{i}^{k}}}{{\sum}_{k = 1}^{28}{{M_{i}^{k}}}}}}}{\log_{2}{{\sum}_{k = 1}^{28}I_{\{>0\}}({M_{i}^{k}})}}. $$
(1)

Its value ranges from 0 to 1. A higher value indicates a more uniform distribution of browses over the categories the user visits. If the user browses only one category, both the numerator and the denominator vanish, and the entropy is taken to be 0.
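As an illustration of Eq. (1), the following is a minimal Python sketch rather than the authors’ implementation. The function name `browsing_entropy` and the input format (a list holding one user’s per-category browse counts $M_{i}^{k}$) are assumptions made for this example.

```python
import math


def browsing_entropy(counts):
    """Normalized browsing entropy E_i of Eq. (1) for one user.

    counts: list of M_i^k, the user's number of browses in each category
    (hypothetical input format chosen for this sketch).
    """
    total = sum(counts)
    nonzero = [m for m in counts if m > 0]
    # With zero or one browsed category the denominator log2(1) is 0;
    # the text states that the entropy is 0 in the single-category case.
    if len(nonzero) <= 1:
        return 0.0
    # Shannon entropy of the user's category distribution (zero terms skipped).
    entropy = -sum((m / total) * math.log2(m / total) for m in nonzero)
    # Normalize by log2 of the number of categories the user actually browsed.
    return entropy / math.log2(len(nonzero))
```

Because the denominator counts only the categories with at least one browse, a user who splits browses evenly between two categories already reaches the maximum value of 1.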
Repetitive Ratio: This measures how frequently products of the same category are browsed by users, expressed as $R(k)={\sum }_{i}{M_{i}^{k}}/{W_{k}}$, where a higher value indicates that users browse products of the k-th category more frequently.
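The repetitive ratio can be sketched in a few lines of Python. This is an illustrative example only: the function name `repetitive_ratio` and the dictionary-based inputs are assumptions, not part of the original study.

```python
def repetitive_ratio(user_counts, products_per_category):
    """R(k) = sum_i M_i^k / W_k for every category k.

    user_counts: {user_id: {category: M_i^k}} (hypothetical input format).
    products_per_category: {category: W_k}, number of products per category.
    """
    ratios = {}
    for category, n_products in products_per_category.items():
        # Sum the browses of this category over all users.
        total_browses = sum(
            counts.get(category, 0) for counts in user_counts.values()
        )
        ratios[category] = total_browses / n_products
    return ratios
```

With toy inputs such as two users who browse a 5-product category 20 times in total, the ratio is 4.0, meaning each product is browsed four times on average.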
Co-Clustering of Users and Categories: Since users and their browsing categories are associated with each other, we need to cluster both of them simultaneously. We use Phantom [8] to perform divisive hierarchical co-clustering. We calculate the normalized number of browses in each category per user, then obtain a feature matrix with users on each row and categories on each column, which is the input of the co-clustering algorithm.
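The input matrix for co-clustering can be built as in the sketch below. The text does not spell out the normalization, so this example assumes each user’s counts are divided by that user’s total number of browses; the function name `build_feature_matrix` and the input format are likewise hypothetical. The co-clustering step itself is not reproduced here.

```python
def build_feature_matrix(user_counts, categories):
    """Build a users-by-categories matrix of normalized browse counts.

    user_counts: {user_id: {category: number of browses}} (assumed format).
    categories: ordered list of category names defining the columns.
    Assumption: each row is normalized by the user's total browses.
    """
    user_ids = sorted(user_counts)
    matrix = []
    for user_id in user_ids:
        counts = user_counts[user_id]
        total = sum(counts.get(c, 0) for c in categories)
        # Guard against users with no browses in the listed categories.
        row = [counts.get(c, 0) / total if total else 0.0 for c in categories]
        matrix.append(row)
    return user_ids, matrix
```

Each row then sums to 1 for any user with at least one browse, so heavy and light browsers become comparable before clustering.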

To evaluate the effectiveness of co-clustering, we define the average distance to each cluster as follows:
$$ D(m, k)=\frac{1}{p_{m}}{\sum\limits_{i = 1}^{i=p_{m}}|(\mathbf{F}_{m}^{i}-\bar{\mathbf{F}}_{k})|}, $$
(2)
where $\bar {\mathbf {F}}_{k}=\frac {1}{p_{k}}{\sum }_{i = 1}^{i=p_{k}}\mathbf {F}_{k}^{i}$, $\mathbf {F}_{m}^{i}$ represents the vector of normalized browses in each category by user $i$ in cluster $m$, and $p_{m}$ denotes the number of users in cluster $m$. In particular, if $D(m,m) < D(m,k)$ holds for every $k \neq m$, users in cluster $m$ are more similar to each other than to users in other clusters.
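A small Python sketch of Eq. (2) follows. The norm behind $|\cdot|$ is not specified in the text, so the example assumes the L1 norm (sum of absolute differences); the function names `centroid` and `average_distance` are introduced only for illustration.

```python
def centroid(rows):
    """Mean feature vector of a cluster (the F-bar_k of Eq. (2))."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def average_distance(cluster_m, cluster_k):
    """D(m, k): mean distance from users in cluster m to the centroid of k.

    Each cluster is a list of feature vectors, one per user.
    Assumption: |.| is taken as the L1 norm.
    """
    center_k = centroid(cluster_k)
    total = 0.0
    for row in cluster_m:
        total += sum(abs(a - b) for a, b in zip(row, center_k))
    return total / len(cluster_m)
```

A clustering passes the check described above when `average_distance(m, m)` is smaller than `average_distance(m, k)` for every other cluster `k`.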
Category-based Browsing Behavior Analysis: In order to reveal how differently users browse different categories, we partition the browsing sequence $S_{i}$ into sessions by a time threshold; a gap exceeding the threshold indicates that the user has gone offline. Then, we count the number of consecutive browses on each category in each session within $S_{i}$. Finally, we average the number of consecutive browses on the k-th category by user $i$.
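The session partitioning and run counting described above can be sketched as follows. This is a hedged illustration: the event format (time-ordered pairs of a timestamp in seconds and a category), the function names, and the choice to average over all runs of a category are assumptions. The default threshold of 300 seconds corresponds to the 5-minute value reported in the results.

```python
from collections import defaultdict


def split_sessions(events, threshold=300):
    """Split time-ordered (timestamp_seconds, category) events into sessions.

    A new session starts whenever the gap to the previous event exceeds
    the threshold (in seconds).
    """
    sessions = []
    current = []
    last_time = None
    for timestamp, category in events:
        if last_time is not None and timestamp - last_time > threshold:
            sessions.append(current)
            current = []
        current.append(category)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions


def average_consecutive_browses(events, threshold=300):
    """Average length of same-category runs, per category, for one user."""
    run_lengths = defaultdict(list)
    for session in split_sessions(events, threshold):
        run_category, run_length = session[0], 0
        for category in session:
            if category == run_category:
                run_length += 1
            else:
                # The category changed: close the current run, start a new one.
                run_lengths[run_category].append(run_length)
                run_category, run_length = category, 1
        # Close the last run of the session.
        run_lengths[run_category].append(run_length)
    return {c: sum(v) / len(v) for c, v in run_lengths.items()}
```

Runs never span a session boundary, so a user who returns to the same category after going offline starts a new run.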

4 Results

In this section, we leverage the above metrics to analyze online shopping behaviors based on our collected dataset, which is described in detail in Section 2.

Browsing entropy

We first examine how diversely users browse different categories according to (1). Figure 2 shows that nearly half of the users (those with an entropy of 0) concentrate on one category, and only 3.6% have a browsing entropy greater than 0.8. This indicates that most users focus on a few categories when they are shopping and browsing online.

Repetitive ratio

Figure 3 shows the browsing repetitive ratio of the major categories. We find significant differences: Household Appliances has the highest repetitive ratio, while Books & CDs has the lowest. This suggests that users behave differently in different categories, browsing either repetitively or in a targeted way.

Co-clustering of users and categories

To investigate prevalent patterns of users’ browsing behavior on different categories, we apply the co-clustering algorithm to identify the groups of users and categories simultaneously. We first choose users that have a sufficient and reasonable number of browsing records according to Figs. 4 and 5 by focusing on users that browse 100 to 3000 products, finally obtaining 46,366 users.

We obtain 18 clusters (12 major clusters listed in Table 1) and plot the heatmap of average distance among clusters according to (2) in Fig. 6, which verifies the effectiveness of the co-clustering results. The visualization of the obtained clusters is shown in Fig. 7, where we can intuitively observe several enlightening clusters, as follows:

Cluster 1 (Business Usage): Users tend to browse office equipment or book tickets for business purposes.
Cluster 2 (Individual Dressing): Users in this cluster mostly choose products belonging to Clothing, Sports & Health and Footwear for individual dressing.
Cluster 4 (Household Usage): Users browse products for household usage, including Kitchenware, House Decorations and Household Appliances.
Cluster 8 (High-income Group): Users in this cluster browse Car Accessories products, may own cars, and are recognized as the high-income group. They also prefer expensive High-end Brand products.

Table 1 Co-clustering results

Category-based browsing behavior analysis

Based on our analysis of the grouping patterns between users and categories, we study the browsing behaviors of users in each cluster according to Table 1. We average users’ consecutive browses and residence time in the corresponding category in each cluster and plot them in Fig. 8, using a threshold of 5 minutes to partition sessions. The number of consecutive browses differs distinctly among clusters. For example, Toys & Musical Instruments (Cluster 12), which are popular among children, exhibit short residence time but more consecutive browses, which indicates more repetitive browsing; while (E-)Books & CDs (Cluster 10) has long residence time, which shows that users are willing to gather more information about the products. In particular, Ticketing and PC & Office in Cluster 1 attract the fewest consecutive browses, which suggests that users choose them directly with a clear purchase purpose.

5 Conclusion & future work

With a dataset containing users’ one-month online shopping and browsing records, we investigate the grouping characteristics between users and product categories, and uncover distinct patterns of browsing behaviors in different categories. Our findings provide valuable insights for e-commerce businesses to customize their online shopping systems and enhance user experience. As for future work, we would like to study how users’ preferences change over time and how their other online activities are related to their shopping behaviors.

Notes

These include Phone & Accessories (MP & AC), PC & Office (PC & OF), Books & CDs (BK & CD), Clothes (CL), House Decorations (DE), Household Appliances (HA), Sports & Health (SP & HE), Gifts & Bags (GI & BA), Cosmetics (CM), Maternity & Child (MA & CH) and Digital Products (DP).

References

1. Anderson A, Kumar R, Tomkins A, Vassilvitskii S (2014) The dynamics of repeat consumption. In: International conference on world wide web, pp 419–430
2. Benson AR, Kumar R, Tomkins A (2016) Modeling user consumption sequences. In: International conference on world wide web. International world wide web conferences steering committee, pp 519–529
3. Chen M, Ma Y, Song J, Lai CF, Hu B (2016) Smart clothing: connecting human with clouds and big data for sustainable health monitoring. Mobile Networks and Applications 21(5):825–845
4. Chen M, Hao T, Hwang K, Wang L (2017) Disease prediction by machine learning over big healthcare data. IEEE Access 4:1242–1253
5. Chen M, Ma Y, Li Y, Wu D, Zhang Y, Youn (2017) Wearable 2.0: enable human-cloud integration in next generation healthcare system. IEEE Communications 55(1):54–61
6. Chen M, Yang J, Hao Y, Mao S, Hwang K (2017) A 5G cognitive system for healthcare. Big Data and Cognitive Computing 1(1)
7. eMarketer (2016) Worldwide retail ecommerce sales: emarketer’s updated estimates and forecast through 2019, pp 2–4. http://www.emarketer.com/public_media/docs/eMarketer_eTailWest2016_Worldwide_ECommerce_Report.pdf
8. Keralapura R, Nucci A, Zhang ZL, Gao L (2010) Profiling users in a 3g network using hourglass co-clustering. In: International conference on mobile computing and networking, MOBICOM 2010, Chicago, Illinois, USA, September. DBLP, vol 49, pp 341–352
9. Li J, Qiu M, Ming Z, Quan G, Qin X, Gu Z (2012) Online optimization for scheduling preemptable tasks on IaaS cloud systems. J Parallel Distrib Comput 72(5):666–677
10. Li Y, Chen M (2015) Software-defined Network function virtualization: a survey. IEEE Access 3:2542–2553
11. Liu CH, Zhang Z, Chen M (2017) Personalized multimedia recommendations by adaptive feedback control frameworks for cloud-integrated cyber physical systems. IEEE Syst J 11(1):106–117
12. Qiu M, Ming Z, Li J, Gai K, Zong Z (2015) Phase-change memory optimization for green cloud with genetic algorithm. IEEE Trans Comput 64(12):3528–3540
13. Qiu M, Sha HM (2009) Cost minimization while satisfying hard/soft timing constraints for heterogeneous embedded systems. ACM Trans Des Autom Electron Syst 14(2):1–30
14. Tian D, Zhou J, Wang Y, Lu Y (2015) A dynamic and self-adaptive network selection method for multimode communications in heterogeneous vehicular telematics. IEEE Trans Intell Transp Syst 16(6):3033–3049
15. Tian D, Zhou J, Sheng Z (2017) An adaptive fusion strategy for distributed information estimation over cooperative multi-agent networks. IEEE Trans Inf Theory 99:1–1
16. Zhang Y (2016) Grorec: a group-centric intelligent recommender system integrating social, mobile and big data technologies. IEEE Trans Serv Comput 9(5):786–795
17. Zhang Y, Zhang D, Hassan MM, Alamri A, Peng L (2015) CADRE: cloud-assisted drug REcommendation service for online pharmacies. Mobile Networks and Applications 20(3):348–355
18. Zhang Y, Chen M, Huang D, Wu D, Li Y (2016) iDoctor: personalized and professionalized medical recommendations based on hybrid matrix factorization. Futur Gener Comput Syst 66:30–35
19. Zheng K, Yang Z, Zhang K, Chatzimisios P (2016) Big data-driven optimization for mobile networks toward 5G. IEEE Netw 30(1):44–51

Author information

Authors and Affiliations

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
Huan Yan, Zifeng Wang, Tzu-Heng Lin, Yong Li & Depeng Jin

Corresponding author

Correspondence to Yong Li.

About this article

Cite this article

Yan, H., Wang, Z., Lin, TH. et al. Profiling users by online shopping behaviors. Multimed Tools Appl 77, 21935–21945 (2018). https://doi.org/10.1007/s11042-017-5365-7

Received: 26 April 2017
Revised: 30 August 2017
Accepted: 30 October 2017
Published: 11 December 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11042-017-5365-7
