Abstract
With the rapid development of information technology and the fast growth in Internet users, e-commerce now faces a complex business environment and accumulates large-volume, high-dimensional data. This poses two challenges for demand forecasting. First, e-merchants need appropriate approaches to leverage the large amount of data and extract forecast features that capture the various factors affecting demand. Second, they need to efficiently identify the most important features, both to improve forecast accuracy and to better understand the key drivers of demand changes. To address these challenges, this study conducts multi-dimensional feature engineering, constructing five feature categories (historical demand, price, page views, reviews, and competition) for item-level e-commerce demand forecasting. We then propose a two-stage random forest-based feature selection algorithm to effectively identify the important features in the high-dimensional feature set and avoid overfitting. We test the proposed algorithm on a large-scale dataset from the largest e-commerce platform in China. Numerical results from 21,111 items and 109 million sales observations show that the proposed random forest-based forecasting framework with two-stage feature selection improves forecast accuracy by 11.58%, 5.81%, and 3.68% compared with Autoregressive Integrated Moving Average (ARIMA), Random Forest, and Random Forest with one-stage feature selection, respectively, all of which are widely used in the literature and in industry. This study provides a useful tool for practitioners to forecast demand and sheds light on B2C e-commerce operations management.
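To make the idea of two-stage, random forest-based feature selection concrete, the following is a minimal sketch on synthetic data. The abstract does not spell out what the two stages are, so the split used here is an assumption: stage 1 ranks features by the impurity-based importance of a forest fitted on all features and keeps the top few, and stage 2 refits on nested prefixes of that ranking and keeps the prefix with the lowest validation error. The function name `two_stage_select`, the parameter `keep_top`, and the synthetic data are hypothetical; scikit-learn's `RandomForestRegressor` stands in for whatever implementation the authors used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def two_stage_select(X_train, y_train, X_val, y_val, keep_top=8, seed=0):
    """Hypothetical two-stage selection; not the paper's exact algorithm."""
    # Stage 1 (assumed): fit a forest on every feature, rank features by
    # impurity-based importance, and keep only the `keep_top` best.
    full = RandomForestRegressor(n_estimators=50, random_state=seed)
    full.fit(X_train, y_train)
    ranked = np.argsort(full.feature_importances_)[::-1][:keep_top]

    # Stage 2 (assumed): refit on nested prefixes of the ranking and keep
    # the prefix with the lowest mean absolute error on held-out data.
    best_mae, best_subset = float("inf"), ranked[:1]
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X_train[:, subset], y_train)
        mae = mean_absolute_error(y_val, model.predict(X_val[:, subset]))
        if mae < best_mae:
            best_mae, best_subset = mae, subset
    return [int(i) for i in best_subset], best_mae


# Synthetic stand-in for item-level data: only features 0 and 1 drive
# demand; the remaining 18 columns are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Earlier rows train the model and later rows validate it, mimicking a
# time-ordered split.
selected, val_mae = two_stage_select(X[:300], y[:300], X[300:], y[300:])
print("selected features:", selected)
print("validation MAE:", round(val_mae, 3))
```

Choosing the final subset on held-out data rather than on the training fit is one simple way to guard against the overfitting that the abstract mentions. In a real setting the columns would be the engineered demand, price, page-view, review, and competition features rather than random noise.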
Acknowledgments
This work has been supported in part by the National Natural Science Foundation of China under Grant Nos. 72172169, 71903024, and 91646125, and by the Program for Innovation Research at the Central University of Finance and Economics. The authors sincerely thank the editors and two anonymous referees for their constructive comments, which significantly improved the paper.
Author information
Hongyan Dai is currently a professor in the Business School, Central University of Finance and Economics, Beijing, China. Dr. Dai received her BSc from Beijing University of Posts and Telecommunications in 2004, her MSc from Tsinghua University in 2006, and her PhD from Hong Kong University of Science and Technology in 2011. Her current research interests include the sharing economy, O2O logistics network design, data-driven optimization, etc. She has published over 20 papers in reputable international journals, including European Journal of Operational Research, International Journal of Production Economics, International Journal of Production Research, and Annals of Operations Research. She is the Principal Investigator for several national- and provincial-level projects, including the Major Research Plan of the National Natural Science Foundation of China, and for several industry projects, such as Jingdong-to-home, State Grid, and Siemens. She serves as a research fellow of the China Society of Logistics, a review specialist for the National Natural Science Foundation of China, and a referee for over ten international journals.
Qin Xiao is a PhD candidate in the Business School, Central University of Finance and Economics, Beijing, China. She received her bachelor's degree from Central University of Finance and Economics in 2016. Her current research interests include O2O logistics network design, data-driven optimization, etc. She has published several papers in international journals, such as International Journal of Production Economics. She has participated in national- and provincial-level projects, including the Major Research Plan of the National Natural Science Foundation of China, and in industry projects, such as Jingdong-to-home.
Nina Yan is currently a full professor in the Business School, Central University of Finance and Economics, China. She received her PhD in management science from Northeastern University in March 2007. Her current research interests include supply chain finance, operations management, platform economics, etc. She has published over twenty papers in journals including Decision Sciences, Decision Support Systems, European Journal of Operational Research, International Journal of Production Economics, International Journal of Production Research, Journal of Business Research, and Omega.
Xun Xu holds a PhD in operations management from Washington State University. He is currently an associate professor in the Department of Management, Operations, and Marketing in the College of Business Administration at California State University, Stanislaus, in the United States. His research interests include service operations management, supply chain management and coordination, sustainability, e-commerce, data and text mining, and hospitality and tourism management. He has published over fifty papers in such journals as Annals of Tourism Research, Computers and Industrial Engineering, Decision Sciences, Decision Support Systems, European Journal of Operational Research, Journal of Business Research, Journal of the Operational Research Society, Journal of Travel Research, International Journal of Hospitality Management, International Journal of Contemporary Hospitality Management, International Journal of Information Management, International Journal of Production Economics, and International Journal of Production Research.
Tingting Tong is an associate professor at Dongbei University of Finance and Economics, China. She received her PhD in Economics from Georgia Institute of Technology in August 2016. Her research focuses on operations management, labor economics, and applied econometrics. Her articles have been published in journals such as China Economic Review, Decision Sciences, Decision Support Systems, International Journal of Production Economics, Journal of Business Research, and Journal of Transport Geography.
Cite this article
Dai, H., Xiao, Q., Yan, N. et al. Item-level Forecasting for E-commerce Demand with High-dimensional Data Using a Two-stage Feature Selection Algorithm. J. Syst. Sci. Syst. Eng. 31, 247–264 (2022). https://doi.org/10.1007/s11518-022-5520-1
DOI: https://doi.org/10.1007/s11518-022-5520-1