1 Introduction

In the dynamic landscape of supply chain management (SCM), the relentless pursuit of efficiency and adaptability has driven a continuous evolution in forecasting strategies and technologies. This article embarks on a systematic exploration, aiming to identify and analyze the state-of-the-art in supply chain (SC) forecasting, ultimately proposing a novel framework that integrates the power of big data analytics (BDA) into SCM. The increasing complexity and interconnectedness of global SCs have underscored the need for sophisticated forecasting strategies. Traditional approaches are being reevaluated in the wake of technological advancements, leading to a paradigm shift in how we perceive and optimize SC forecasting. The integration of BDA emerges as a transformative force, promising enhanced predictive capabilities and a holistic framework that spans problem identification, data sourcing, exploratory data analysis, machine learning (ML) model training, hyperparameter tuning, performance evaluation, and optimization.

The SC has evolved considerably over the past years, prompting the discovery of new methods and techniques for solving SCM problems. The SC can develop its configuration based on its control, coordination, and management [1]. The advent of big data (BD) brings one such change. As in other fields, BD can be utilized to improve decision-making processes and alter business models through multiple resources, tools, and applications [2]. SC and BD usage are therefore connected and reinforce one another. Although the concepts of SCM are already well-developed, further improvement is possible; recent research on enhancing efficiency through collaboration [3] and on the usage of RFID and intelligent goods [4] are two examples of innovations that improved SCM processes. Newer technologies are further enabling the discovery of innovative strategies for solving SC problems. BDA is one such disruptive innovation. Although BD has been present for a long time, the approaches to making sense of BD are comparatively new, and such systems have not been wholly integrated into other branches of knowledge [5]. We identified the limited use of data and of the relevant processes in SC as a major problem that needs to be addressed.

BD has similarly grown popular over the years. After academic and technical publications first mentioned such technological developments, it drew the attention of various audiences, including scholars, corporate leaders, and government officials [6, 7]. The most recognizable feature of BD is probably its size, or the amount of data stored. The distinctive features of greater data variety, high velocity in collection and analysis, the necessity to navigate veracity challenges, and the inherent growth in value with increasing data analysis [8] set the stage for a new era in SCM. Simply having access to BD is not helpful; data analytics is a must to create value or extract information from the enormous collection of data. When analytical methods are applied to BD, the practice is called BDA. While BDA has vast applications, its role in improving the SC process is notable.

The motivation behind this research stems from the recognition that the traditional forecasting paradigms may not be equipped to address the intricacies of modern SC dynamics. Phantom inventory, varying time horizons, and diverse SCM objectives necessitate a more adaptive and data-driven approach. By introducing a comprehensive framework, this article seeks to bridge the gap between traditional forecasting methods and the demands of contemporary SC environments.

1.1 Research Gaps

We identified potential research gaps at the intersection of the pre-process (preprocessing for ML forecasting), the control process (SC processes where BDA is helpful), and the post-process (evaluating the forecasting model). Although there is fragmented research on these particular topics (“SC forecasting model performance,” “the application of BDA on SC,” “ML forecasting techniques with BDA implementation,” or “BD driven SC performance evaluation”), the necessity to form a cyclic connection among these three processes led us to the development of this article. This paper identifies a critical gap in current SCM practices, namely the underutilization of data and relevant processes, and positions BDA as a powerful solution. The motivation behind this study lies in addressing the pivotal challenge of harnessing the full potential of BD within SCM, recognizing it not just as a technological evolution but as a strategic imperative for future competitiveness.

1.2 Research Objectives

The primary objective of this article is to shed light on potential BDA investigations in SCM studies: what significant contributions BDA has made to the efficient use of ML forecasting in SC processes, and which preprocessing and post-processing SC forecasting techniques have been robustly developed so far and are currently in use. The forecasting techniques in an SC setting have been discussed mainly from the perspective of BDA. This research aims to continually enhance the performance of a forecasting model by incorporating a sustainable circular BDA-SCM framework that can drive future research, using business intelligence and value theory as theoretical approaches. A systematic literature review (SLR) was used to perform this research; an SLR is an approach used to locate, evaluate, and interpret the relevant research on a specific issue or topic by sketching out and analyzing the current intellectual landscape [9]. The review attempts to combine the applications of BDA and ML forecasting in SCM by seeking solutions to the following research questions (RQs), which guide the study’s development toward its overall objective:

  • RQ1: What are the efficient steps to formulate an ML Forecasting model to predict the SC factors?

  • RQ2: How can the forecasting, SC decision-making, and performance measurement processes be connected, tracked, and optimized in cyclic order?

  • RQ3: How can forecasting affect SC performance, and which ML forecasting models are relevant to SC forecasting?

The scope of this research extends beyond the technical intricacies of forecasting algorithms. It delves into the broader implications of forecasting on the human workforce, inventory management, and the overall performance of the SC. By addressing the adverse effects of phantom inventory and emphasizing the dependency of managerial decisions on SC key performance indicators (KPIs), this research contributes to the overarching goal of improving operations management, transparency, and planning efficiency. The novelty of this paper lies in several key aspects:

  • The introduction of a comprehensive BDA-SCM framework that provides a holistic view of SC forecasting and highlights interconnections between processes, offering a novel perspective.

  • The integration of ML techniques within SCM for forecasting purposes presents novel approaches to enhance accuracy and effectiveness.

  • Addressing the issue of phantom inventory, providing insights and potential solutions to improve inventory management practices and forecasting precision.

  • Exploring the connection between accurate forecasting and SC performance, offering a novel perspective on leveraging forecasting models for optimization.

  • Conducting a comprehensive survey of 152 papers spanning several decades, providing a unique and valuable contribution to the field and consolidating a vast body of literature.

    Including papers from such a broad timeframe allows for identifying trends, shifts in methodologies, and key milestones in the field. This comprehensive survey sets this paper apart from other SLR review papers that have not undertaken such an extensive examination of the literature. The insights gained from this extensive survey enhance the robustness and reliability of the conclusions drawn in the paper.

The forthcoming sections will comprehensively explore the “Research procedures,” and the proposed cyclic connection embedded within the framework will be elucidated in “BDA-SCM framework.” Moving forward, the “Pre-process” section addresses critical components, including the imperative need for strategic data collection aligned with SC objectives, methodologies for data preprocessing, feature engineering (FE), exploratory data analysis, and the classification of forecasting types based on distinct time horizons. The “Control-process” section delves into optimizing preprocessing methodologies. This optimization, rooted in post-process KPIs, aims to elevate overall control processes, encompassing inventory management, workforce determination, cost optimization, and production and capacity planning. The discussion on SC KPIs and error-measurement systems for model optimization unfolds in the “Post-process” section. This section aims to provide insights into refining forecasting models for superior performance. In the “Challenges” section, attention is directed toward acknowledging and addressing technological obstacles encountered during the extensive review of pertinent articles. This section highlights and discusses the challenges inherent in navigating the technological landscape within the scope of this research. In the “Practical Implications” section, we delve into actionable insights for SC practitioners, detailing the implementation of the proposed BDA-SCM framework in real-world scenarios and outlining the substantial benefits they can expect. By incorporating these novel aspects, the paper contributes to the existing body of knowledge in the field of BDA-SCM frameworks for forecasting, offering new insights, methodologies, and recommendations for future research and practical implementation.

2 Research Procedures

2.1 Planning the Review

In this article, the BDA-SCM cyclic framework was initially developed to incorporate pre-process, control-process, and post-process phases. Each phase was illustrated utilizing the most relevant selected works of literature. For forecasting purposes, the pre-process recommendations include a step-by-step approach to forecasting and BDA best practices to facilitate comprehensive demand forecasting, considering state-of-the-art technologies and relevant research. The control process discusses how SC factors and forecasting affect workforce efficiency. The post-process portion explains how managers use KPIs and the optimization of the forecasting model to choose appropriate metrics and insights for decision-making.

2.2 Conducting the Review

2.2.1 Search Strategy

This SLR aimed to provide a comprehensive and objective evaluation of the existing research until 2023 on BDA-SCM, including an investigation and analysis of various SC forecasting problems and BDA innovations, strategies, and techniques. Major academic databases, including Google Scholar and Science Direct, were searched to minimize bias and ensure the inclusion of a broad range of relevant sources and content. Only English articles published in peer-reviewed journals in the fields of Computer Science, Business, Management and Accounting, Engineering, and Decision Sciences were included. Figure 1 shows the PRISMA flow diagram for the systematic review process, which includes the number of articles identified, screened, and included in the analysis. A combination of keywords and subject headings related to the topic of interest was used to develop the search strategy. The search strings were limited to the title, abstract, and keywords fields and included the following terms:

  • (“Data Analytics” OR “Big Data” OR “Data Analysis”) AND (“Supply Chain Management”) AND (“Forecasting”)

  • (“Data Preprocessing” OR “Data Wrangling” OR “Supply Chain Data Analysis”)

  • (“Supply Chain Forecasting” OR “Demand Forecasting”)

  • (“Warehouse” OR “Inventory”) AND (“Workforce” OR “Human”) AND (“Forecasting”)

  • (“Supply Chain Performance” OR “Supply Chain KPI” OR “Supply Chain Monitoring”)

  • (“Forecasting KPI” OR “Forecasting Error Measurement” OR “Forecasting Performance”)

  • (“Forecasting Model” OR “Time-series Forecasting”)

2.2.2 Selection Strategy

The relevance of each publication was assessed to ensure that the selected papers were empirically sound and conceptually relevant to BDA-SCM-related research advances. Articles were considered more relevant if the search terms appeared in the title, abstract, keywords, and throughout the text. The identified papers were critically analyzed, particularly regarding the relevant sections that mentioned BDA-SCM. This approach drew from relevant views on SCM-forecasting challenges and BDA techniques and helped to achieve the research review goals. The remaining articles were then assessed to verify that they provided the necessary research perspective and empirical data to meet the review’s objectives. Finally, to ensure that the selected articles aligned with the review goals, we conducted a rigorous alignment process, comparing the articles to the research review objectives. Only articles that met all of the selection criteria were included in the final review.

Fig. 1 PRISMA flow diagram illustrating the article selection process for the SLR on BDA-SCM forecasting

3 BDA-SCM Framework

The proposed BDA-SCM framework, depicted in Figs. 2, 3 and 4, establishes a cyclic connection that facilitates continuous improvement in SC forecasting. This cyclic process seamlessly integrates three essential stages: Pre-process, Control-process, and Post-process, fostering a dynamic relationship that optimizes SC operations iteratively. Figure 2 mainly consists of the use and cyclic flow of data in SC. It only includes the SC parts where BDA may be involved. Figure 3 complements Fig. 2 by mentioning the methods for cleaning, exploring, and analyzing data properly. It includes FE techniques to select only the most relevant and unique features from which ML algorithms can learn efficiently. Finally, Fig. 4 is a proposed method for data splitting, model training, hyperparameter optimization, cross-validation, testing, and evaluating errors to perfect the forecasting methods mentioned in Fig. 2.

In the Pre-process stage, the focus is on ensuring accurate and relevant data aligned with SC objectives. The cyclic nature of this stage involves a continuous feedback loop. For example, after training an initial ML forecasting model, the performance is evaluated using real-time data. Any discrepancies or deviations from expected outcomes trigger a revisit to the Pre-process stage. This might involve reassessing data collection methods, exploring new data sources, or refining the preprocessing steps to enhance the quality of input data. The Control-process stage benefits from the cyclic connection, encompassing decision-making areas like production planning, workforce determination, and inventory management. Suppose a decision made based on forecasted data results in suboptimal outcomes. In that case, this feedback loops back to the Pre-process stage. The system may reevaluate the forecasting model’s inputs, incorporating real-time data to enhance decision-making accuracy in subsequent cycles. In the Post-process stage, the cyclic connection enables continuous performance improvement. After the initial model predictions, performance metrics are analyzed, and any deviations from expected results trigger a reevaluation of the forecasting model. This feedback loop, integrated into the Post-process stage, ensures that the model evolves over time, adapting to changing SC dynamics and improving its predictive capabilities.

Consider a scenario in demand forecasting where the initial ML model predicts a surge in demand for a particular product. If the actual demand deviates from the forecast, the cyclic connection triggers a reassessment in the Pre-process stage. Analysts may explore new data sources, refine data preprocessing methods, or adjust FE techniques to capture changing demand patterns more accurately. In inventory management, the Control-process stage involves decisions on stock levels based on forecasted demand. If the actual inventory levels deviate significantly from the forecast, the cyclic connection prompts a revisit to the Pre-process stage. This may involve refining the preprocessing of inventory data, incorporating real-time data on SC disruptions, or adjusting the forecasting model to enhance inventory optimization. In production planning, the decision-making process relies on accurate product demand forecasts. If the actual production output falls short or exceeds the forecasted demand, the cyclic connection triggers a reassessment in the Pre-process stage. This may involve refining data collection methods, exploring new features relevant to production efficiency, or adjusting the forecasting model to better align with dynamic production needs.

Fig. 2 Big data analytics in supply chain processes (pre-process, control-process, post-process)

Fig. 3 Data preprocessing, feature engineering, exploratory data analysis, and data reduction

Fig. 4 Machine learning model training, hyperparameter optimization, and model evaluation

4 Pre-process

4.1 Identifying Business Problems

At the outset, the type of data that needs to be collected, stored, analyzed, and interpreted is selected based on SC strategies. Reference [10] categorized SC strategies based on risk and impact: robustness for low-impact high-risk, agility for low-risk high-impact, rigidity for low-impact low-risk, and resilience for high-impact high-risk decisions. These strategies can be adopted at varying levels of responsiveness and efficiency. Responsiveness has been a critical factor in gaining a competitive advantage, and it depends on the deviations in demand and a company’s capability to respond to such deviations. An increase in responsiveness decreases efficiency, and vice versa [11]. The responsiveness level affects product volume, order fulfillment rate, workforce, manufacturing capacity, warehouse capacity, transportation carriers, product mix, supplier’s product mix, inbound and outbound logistics, etc. [12]. Therefore, which data need to be collected should be decided based on the responsiveness level, as different data sets are required to balance SC efficiency and responsiveness when allocating resources. Furthermore, the frequency of data analyses also depends on the company’s responsiveness level. In short, the factors that may dictate the sort of forecasting required for a business include the context of forecasting, the types of data available, the required level of accuracy, the length of the forecasting period, the time available for each forecast, and the value added through the forecast [13].

4.2 Identifying Data Sources

Once the data that need to be gathered are selected, identifying the sources is essential. Determining the variables is required for timely forecasts to bring helpful information [14]. Moreover, a conclusion may not be based on a single type of data; the initial conclusion can be validated based on multiple data types. Reference [15] mentioned 56 different data sources for four main SCM levers (procurement, warehouse operations, marketing, and transportation), since leveraging various data sources allows actionable insights to be found quickly; some of the more relevant data sources are listed below:

  1. Transportation
  2. Barcode systems
  3. Demand chain
  4. CRM transaction data
  5. BOMs
  6. Customer surveys
  7. Blogs and news
  8. Demand forecasts
  9. Procurement
  10. Delivery times and terms
  11. Invoice data
  12. ERP transaction data
  13. GPS-enabled BD telematics
  14. Product reviews
  15. Competitor pricing
  16. Inventory costs
  17. Customer location and channel
  18. Traffic density
  19. Email records
  20. Crowd-based pickup and delivery
  21. Equipment or asset data
  22. Intelligent Transport Systems
  23. EDI purchase orders
  24. Warehouse operations
  25. Logistics network topology
  26. In-transit inventory
  27. SRM transaction data
  28. Transportation costs
  29. Warehouse costs
  30. Pricing and margin data
  31. RFID
  32. Origination and destination (OND)
  33. Local and global events
  34. Supplier current capacity and customers
  35. Sales history
  36. Weather data
  37. SKU level
  38. Supplier financial performance information
  39. Raw material pricing volatility
  40. On-shelf-availability
  41. P2P (Procure-to-Pay)
  42. Product traceability and monitoring system

4.3 Data Preprocessing and Feature Engineering (FE)

4.3.1 Duplicates Removal

Duplicate rows waste space and runtime. They create incoherence, and the ML model fails to learn new information from them. Because of input mistakes, changes in some feature values (e.g., the identifier value) may generate duplicate rows that will be deemed distinct by the machine. It is straightforward to drop duplicates or substitute them with relevant values using data preprocessing libraries in the Python and R languages (a minimal sketch follows the method overview below). Nevertheless, the main challenge is identifying the factors on which duplicates should be removed. Of the many methods developed to remove duplicates, we review the following:

Bayesian: The Fellegi-Sunter algorithm is the most commonly used model in probabilistic approaches because of its Bayesian nature [16, 17]. The Bayes Decision Rule is a common approach [18]. A Bayesian inference problem can be formulated when the probability densities of a unique row and of a duplicate record differ and these density functions are known. Neural network (NN) algorithms are more accurate without the Fellegi-Sunter algorithm if the data are adequately described or labeled [19].

Partitioning Methods: Clustering methods identify and drop duplicates utilizing graph partitioning approaches [20]. However, Reference [21] compared 12 clustering methods and found that the popular sophisticated algorithms provided lower accuracy, suggesting instead that Markov Clustering is a more scalable, accurate, and efficient algorithm.

Aggregate fitting: CART [22] and SVM [23] aggregate fitting results for various row features. SVM is highly memory efficient and works well in high-dimensional settings [24]. However, it does not work well with large datasets or data with overlapping classes. CART is intuitive and easy to use; the problem is that the data are classified based on the sample, and the result may not apply to larger datasets.

Others: Bootstrapping clusters [25] or hierarchical graph structures encode the features as non-matchable binary features, creating dual probability densities rather than modeling probabilistic distributions of the inspected quantities [26]. Bootstrapping clusters are used for unsupervised data. Simple techniques have long been studied, such as utilizing distance measurements to identify duplication [27]. Weighted transformations also occur in the literature [28]. Additional methods, such as ranking the weighted rows of the same type that are most comparable to a given row, are also utilized to identify the least duplicated rows [29].
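
Returning to the simple library route noted at the start of this subsection, the following is a minimal pandas sketch of exact and key-based duplicate removal. The dataset and the column names (order_id, sku, quantity, entry_ts) are illustrative assumptions, not taken from the reviewed studies.

```python
import pandas as pd

# Hypothetical order-line data; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1002, 1003, 1003],
    "sku":      ["A17", "A17", "B02", "C44", "C44"],
    "quantity": [5, 5, 12, 3, 3],
    # The last two rows describe the same order but were re-entered on
    # different dates, so they are not exact duplicates.
    "entry_ts": ["2023-01-02", "2023-01-02", "2023-01-03",
                 "2023-01-04", "2023-01-05"],
})

# Exact duplicates across all columns are safe to drop outright.
deduped = orders.drop_duplicates()

# Input mistakes may alter an identifier-like field (here entry_ts), so in
# practice duplicates are often defined on a chosen business key instead.
deduped_on_key = orders.drop_duplicates(subset=["order_id", "sku", "quantity"],
                                        keep="first")
print(deduped_on_key)
```

The hard part, as noted above, is choosing the subset of columns that defines a duplicate; the probabilistic and clustering methods reviewed here address exactly that question when no clean business key exists.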

4.3.2 Dealing with Categorical Features

Sum and Backward Difference encoders outperform one-hot, ordinal, Helmert, polynomial, and binary encoders with about 95% accuracy and are preferred for prediction jobs [30]. Reference [31] presented a generic information-based encoder that transforms mixed-type features into numeric ones while maintaining the dataset’s original dimension, with better accuracy than One-Hot and Feature-Hashing. Reference [32] demonstrated that the Ordinal encoder (straightforward and convenient to execute but imposing an order on the features) outperformed Hashing (which introduces a limited number of features and partly ignores the feature sequence); One-hot encoding generates a massive number of features and forces the use of a very simplified regression analysis. To learn residual features from categorical variables derived from time stamps, a DeepGB neural network with embedding layers may be used; embedding layers are necessary to learn multiple time series at once, encoding categorical features in a lower dimension or embedding their IDs to retrieve helpful information [33].
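
As a concrete illustration of these trade-offs, the sketch below applies one-hot and ordinal encoding to a toy shipment table with pandas and scikit-learn. The column names and the assumed category order are made up for the example; the Sum and Backward Difference encoders discussed above are available in third-party packages such as category_encoders.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative shipment data; column names are assumptions for this example.
df = pd.DataFrame({
    "carrier":  ["DHL", "FedEx", "DHL", "UPS"],
    "priority": ["low", "high", "medium", "high"],
})

# One-hot encoding: one binary column per category (feature count can explode).
onehot = pd.get_dummies(df, columns=["carrier"])

# Ordinal encoding: compact, but imposes an explicit order on the categories.
priority_order = [["low", "medium", "high"]]
df["priority_code"] = OrdinalEncoder(
    categories=priority_order
).fit_transform(df[["priority"]]).ravel()

print(onehot.join(df["priority_code"]))
```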

4.3.3 Data Scaling

Normalization is valuable when using ANNs, clustering techniques, or classification software. The learning phase may be accelerated by normalizing the data features during the training phase of backpropagation NN methods.

Min-max normalization: The scaling of the b values of a numerical feature F to a defined range represented by \([\text {new-min}_F, \text {new-max}_F]\) is termed min-max normalization. To acquire the new value, the following equation is applied to b to produce a changed value \(b'\):

$$\begin{aligned} b' = \frac{b - min_F}{max_F - min_F} \cdot (\text {new-max}_F - \text {new-min}_F) + \text {new-min}_F \end{aligned}$$
(1)

where \(max_F\) and \(min_F\) mean the maximum and minimum feature values, respectively. In normalization, \([\text {new-min}_F, \text {new-max}_F] = [0,1]\) or \([-1,1]\) are the usual intervals [34].

Datasets prepared for use with distance-based learning methods commonly use this normalization technique. Features having a significant \(max_F-min_F\) difference are prevented from dominating the distance computation by applying normalization to rescale the data to the same value ranges, so that such features cannot distort the learning process by receiving disproportionately large weight. Normalization is also known to help ANNs learn faster by allowing the weights to converge more quickly.

Z-score normalization: Min-max normalization is not practicable if the minimum and maximum values are not provided. Even when these values are known, the existence of outliers might cause the min-max normalization to be skewed by clustering the values and restricting the computational accuracy available to represent them.

$$\begin{aligned} b' = \frac{(b - \bar{x})}{s_x} \end{aligned}$$
(2)

where \(\bar{x}\) is the sample mean.

$$\begin{aligned} \bar{x} = \frac{1}{n} \sum _{i=1}^{n} b_i \end{aligned}$$
(3)

Moreover, \(s_x\) is the mean absolute deviation of x [35].

$$\begin{aligned} s_x = \frac{1}{n} \sum _{i=1}^{n}|b_i-\bar{x}| \end{aligned}$$
(4)

Decimal scaling normalization: A simple method for reducing the absolute feature values is to normalize the numerical feature values by shifting the decimal point, i.e., dividing by powers of 10, so that the highest absolute value after transformation is \(<1\).

$$\begin{aligned} b' = \frac{b}{10^k} \end{aligned}$$
(5)

where k is the smallest integer such that the maximum \(|b'| < 1\).
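
The three scaling rules above can be written directly in a few lines of NumPy. The sketch below mirrors Eqs. (1), (2), and (5) on an assumed vector of weekly demand values; following the text, \(s_x\) is computed as the mean absolute deviation rather than the standard deviation.

```python
import numpy as np

# Assumed weekly demand values for one SKU (illustrative data).
b = np.array([120.0, 135.0, 150.0, 610.0, 95.0, 140.0])

# Eq. (1): min-max normalization to [new_min, new_max] = [0, 1].
new_min, new_max = 0.0, 1.0
b_minmax = (b - b.min()) / (b.max() - b.min()) * (new_max - new_min) + new_min

# Eq. (2): z-score normalization; per the text, s_x is the mean absolute
# deviation (np.std would give the usual standard deviation instead).
x_bar = b.mean()
s_x = np.mean(np.abs(b - x_bar))
b_zscore = (b - x_bar) / s_x

# Eq. (5): decimal scaling; k is the smallest integer with max(|b'|) < 1
# (increment k by one if max(|b|) is an exact power of 10).
k = int(np.ceil(np.log10(np.max(np.abs(b)))))
b_decimal = b / (10 ** k)

print(b_minmax, b_zscore, b_decimal, sep="\n")
```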

4.3.4 Data Transformation

Data transformation can create new features, also known as changing features, where mathematical formulae derived from business models or pure mathematical formulae are used to integrate the raw input features. Linear, quadratic, polynomial, non-polynomial, rank, and Box-Cox transformations are a few of the different existing transformation techniques.

Normalization alone may not be sufficient in research experiments that require full automation to fit the data and optimize the resulting model. Combining the information embedded in several features may be advantageous in some circumstances. Linear transformation based on simple algebraic operations is a basic approach that may be utilized for this goal. A quadratic transformation occurs when a newly introduced feature is formed using expressions in quadratic form. Using the fundamental features of the dataset, quadratic modifications can assist us in uncovering information that is not directly present. When no expert assistance can tell us which transformations and features to employ, transformation approximation using polynomials can be implemented by brute-force exploration, one unit at a time. The rank transformation approach is recommended when training and test data are processed identically, or for a complete dataset, in DA and cluster analysis model development [36].

Restricting nonparametric approaches to rank transformations, as is often done when they are introduced in traditional statistics courses, unnecessarily limits how widely nonparametric techniques may be applied. Another misperception is that the nonparametric technique is chiefly for hypothesis testing, which entirely obscures the superior theoretical and conceptual flexibility of many nonparametric methods.

Reference [37] studied the small-sample behavior of parameters estimated using the Box-Cox transformation. Under the assumption of approximate normality, the technique worked well: the outputs were essentially unbiased for forecasting, and their differences were surprisingly small. Asymptotic variances and stability features of Box-Cox estimates in the linear model were examined by [38]. When the transformation parameter is unknown, linear regression models with small to intermediate error variances showed much higher asymptotic variances than when it is known. Furthermore, they observed that Box-Cox approaches perform inconsistently in models with small to intermediate residual variance.
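
For reference, SciPy exposes the Box-Cox transformation with maximum-likelihood estimation of the transformation parameter; the sketch below applies it to an assumed positively skewed demand series.

```python
import numpy as np
from scipy import stats

# Assumed positively skewed demand data; Box-Cox requires strictly positive values.
demand = np.array([12.0, 15.0, 14.0, 90.0, 13.0, 220.0, 18.0, 16.0])

# Estimate the transformation parameter lambda by maximum likelihood.
transformed, lam = stats.boxcox(demand)
print(f"estimated lambda = {lam:.3f}")

# A fixed lambda can also be supplied (lmbda=0 corresponds to a log transform).
log_transformed = stats.boxcox(demand, lmbda=0.0)
```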

4.3.5 Filtering Extreme Outliers

The approach most often recommended in the outlier-handling literature is to identify outliers and repair them via filtering.

Outlier detection: Statistical methods for detecting outliers include box plots, scatter plots, z-scores, and IQR (interquartile range) scores. For a normal distribution, the empirical rule should be followed, under which values \(<\mu -3\sigma\) or \(>\mu +3\sigma\) are outliers, where \(\sigma\) and \(\mu\) are the standard deviation and mean of the particular feature. For skewed distributions, the IQR proximity rule should be used, in which values \(<(Q1 - 1.5\times IQR)\) or \(>(Q3 + 1.5\times IQR)\) are outliers. For other distributions, a percentile-based approach should be used, in which values above the 99th percentile or below the 1st percentile are regarded as outliers.

Outlier Treatment: Various techniques can be employed to address outliers within a dataset. Trimming, the first method, involves the removal of outliers, but it is generally not recommended due to potential information loss. As the second approach, capping limits outliers at a predefined threshold, whether above or below the established limit; the number of outliers in the dataset influences the choice of this capping threshold. Alternatively, outliers may be treated similarly to missing values (MVs). Lastly, outlier removal clustering (ORC), a modification of K-Means clustering, eliminates outliers in iterative loops. ORC effectively removes outliers from clusters, and careful parameter adjustment is essential because the dataset influences model precision. Importantly, ORC ensures that the computation of centroids remains unbiased, particularly when dealing with points located far from the k cluster centroids.
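
A minimal pandas sketch of the IQR proximity rule combined with capping (rather than trimming) is shown below; the lead-time values are assumed for illustration.

```python
import pandas as pd

# Assumed lead-time observations in days, with a few extreme entries.
lead_time = pd.Series([3, 4, 5, 4, 6, 5, 48, 4, 3, 5, 60, 4])

# IQR proximity rule for skewed distributions.
q1, q3 = lead_time.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = lead_time[(lead_time < lower) | (lead_time > upper)]

# Capping keeps the rows but limits their influence, unlike trimming.
capped = lead_time.clip(lower=lower, upper=upper)
print(outliers.tolist())
print(capped.tolist())
```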

4.3.6 Dealing with Missing Values (MVs)

In SC data analysis, the preprocessing technique of imputation is adopted to overcome the drawbacks of MVs. The most straightforward approach is to drop the rows containing MVs, provided that they represent a comparatively small fraction of the observations and that the interpretation of the analysis is not substantially skewed by their removal [39]. Reference [40] showed that MVs are generally connected with three sorts of issues:

  1. Inefficiency.
  2. Difficulties in managing and interpreting data.
  3. Skewness because of discrepancies between perfect and missing data.

When it comes to MV treatment, there are generally three options [41]:

  1. Eliminate all instances that have MVs in their features; removing features with higher-than-normal MV levels also falls within this area.
  2. Employ maximum likelihood procedures to estimate the model parameters for the whole dataset, and then use the obtained model parameters for imputation via sampling.
  3. Apply MV imputation, a group of processes focused on substituting predicted values for the MVs. Most of the time, the features in a data set are interdependent; as a result, MVs may be estimated by identifying correlations among features.

Common approaches: Keeping the MVs unchanged, known as Do Not Impute (DNI), is the most straightforward approach; if the learning algorithm provides a baseline strategy for MVs, it must employ it. When many rows include MVs and using DNI would lead to an irrelevant, inaccurate, and small dataset, MVs are commonly substituted by the most frequent feature value for nominal features and the overall mean value for quantitative features [42]. Reference [43] described another process utilizing Hot Deck, which partitions the complete dataset into clusters, links each row with a cluster, and fills in the MVs using any complete row from the same cluster. Cold Deck imputation is identical to Hot Deck, except that the values are drawn from a dataset other than the current one. They also demonstrated that KNN-based MV imputation can beat the internal MV-handling techniques of C4.5 and CN2 and exceed mean or mode imputation, which is widely used to treat MVs.
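
The sketch below contrasts mean imputation with KNN-based imputation using scikit-learn; the inventory columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Assumed inventory snapshot with missing values.
df = pd.DataFrame({
    "on_hand":   [120.0, np.nan, 95.0, 110.0, np.nan],
    "lead_time": [4.0, 6.0, np.nan, 5.0, 4.0],
})

# Mean imputation: the overall-mean approach for quantitative features.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill each MV from the k most similar rows, exploiting the
# correlations among features discussed above.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
print(knn_imputed)
```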

Maximum likelihood imputation methods: Assume that for n independent rows \((i=1,\ldots ,n)\), there are k variables \((y_{i1},y_{i2},\ldots ,y_{ik})\) with no missing data. The maximum likelihood function is [44]:

$$\begin{aligned} L = \prod _{i=1}^n f_i(y_{i1}, y_{i2}, \ldots , y_{ik}; \theta ) \end{aligned}$$
(6)

Assume that \(y_1\) and \(y_2\) have MVs that fulfill the Missing at Random (MAR) assumption for a specific row i. The combined probability for that observation is the chance of witnessing the remaining features, \(y_{i3}\) through \(y_{ik}\). If \(y_1\) and \(y_2\) are two discrete features, this is the aforementioned combined probability summed over all potential values of the two features with MVs:

$$\begin{aligned} f_i^* (y_{i3},\ldots ,y_{ik};\theta ) = \sum _{y_1}\sum _{y_2}f_i(y_{i1},\ldots ,y_{ik};\theta ) \end{aligned}$$
(7)

For continuous MVs,

$$\begin{aligned} f_i^*(y_{i3}, \ldots , y_{ik}; \theta ) = \int _{y_1} \int _{y_2} f_i(y_{i1}, y_{i2}, \ldots , y_{ik}; \theta ) \, dy_2 \, dy_1 \end{aligned}$$
(8)

The overall likelihood is the product of the probabilities for all the rows. If there are q rows with full data and \(n-q\) rows with MVs on the \(y_1\) and \(y_2\) features, the maximum likelihood function becomes:

$$\begin{aligned} L = \prod _{i=1}^q f_i(y_{i1}, y_{i2}, \ldots , y_{ik}; \theta ) \prod _{i=q+1}^n f_i^*(y_{i3}, \ldots , y_{ik}; \theta ) \end{aligned}$$
(9)

Reference [45] narrowed down the following imputation options using non-parametric statistical testing:

  • Row elimination (IM) and no imputation (DNI) methods are outperformed by imputation techniques that fill in the MVs.

  • No single-size/generic imputation method works for all regressors or classifiers.

The CMC and EC methods are proposed to yield a lower Wilson’s noise ratio and to balance the mean MI difference. The proposed imputation approaches focus on classification techniques, including rule induction learning models (FKMI), black-box methods (EC), and lazy learning (LL) models (MC).

4.3.7 Binning

In this method, a continuous variable is converted into a group of intervals. Each interval can then be treated as a ‘bin,’ with the option of enforcing an order depending on the data’s subsequent processing. When smoothing, each bin’s minimum and maximum values are taken as the bin borders, and each value is then replaced by its nearest border value. Typically, the smoothing effect increases with bin width. If the bin widths are identical, binning may be used as a discretization method by substituting the bin mean or median for each value. Hierarchical concepts can be created by iterating over this procedure. Binning is unsupervised, since class labels are not used and the user specifies the number of bins.
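
A minimal pandas sketch of equal-width and equal-frequency binning, with bin-mean smoothing, is given below; the daily order counts are assumed for illustration.

```python
import pandas as pd

# Assumed daily order counts.
orders = pd.Series([12, 45, 7, 88, 23, 54, 19, 95, 31, 60])

# Equal-width binning into 3 bins; labels enforce an order on the bins.
width_bins = pd.cut(orders, bins=3, labels=["low", "medium", "high"])

# Equal-frequency (quantile) binning as an alternative discretization.
freq_bins = pd.qcut(orders, q=3, labels=["low", "medium", "high"])

# Smoothing by bin mean: replace each value with the mean of its bin.
smoothed = orders.groupby(pd.cut(orders, bins=3), observed=False).transform("mean")
print(pd.DataFrame({"orders": orders, "width_bin": width_bins,
                    "freq_bin": freq_bins, "smoothed": smoothed}))
```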

4.3.8 Deep Learning (DL) Based FE

A multi-filter NN (MFNN) end-to-end model was developed for multivariate financial time-series FE and classification-based forecasting utilizing DL techniques [46]. In terms of returns, the proposed MFNN performed 15.41% better than the best traditional ML model (logistic regression) and 22.41% better than the statistical approach (linear regression).

4.4 Exploratory Data analysis (EDA) and Data Reduction (DR)

In this process, the target (dependent) column and the independent features are identified. The DR, EDA, and clustering techniques reduce runtime and space during the deep-learning modeling phase. DR can be employed to decrease the size of a dataset while still keeping the data’s original integrity. In our framework, we suggest performing feature selection and feature extraction simultaneously after selecting the target column and finding redundant features; discretization may then be performed if necessary, after which the dataset is ready for further analysis and model training.

4.4.1 Identifying Redundant Features

Feature redundancy lengthens the modeling time of ML algorithms and leads to model overfitting. A feature is redundant when it can be derived from another feature or set of features. The following techniques may be adopted to handle redundancy:

Covariance and correlation: In statistics, covariance refers to the degree to which two features or factors change in tandem, and its value lies in the \((-\infty , +\infty )\) range. Positive covariance indicates that they move in the same direction. Negative covariance means that when one feature is above its mean, the other tends to be below its mean, and vice versa. Zero covariance means the features may be independent under a certain hypothesis. Correlation analysis, on the other hand, is a widely used dimensionless measurement ranging from \(-1\) to \(+1\) that evaluates and quantifies the intensity of the relationship and can be used to discover redundancies among numerical features. Features are positively correlated for correlation values greater than zero, independent at zero, and negatively correlated for values less than zero [34]. Covariance and correlation are directly proportional to each other. For numeric feature selection, correlation is preferable, as correlation is scaled to \([-1, 1]\) whereas the covariance range is unbounded \((-\infty , +\infty )\), making correlation easier to interpret. Changes in location or scale have no effect on correlation. However, both measures can only identify linear relationships.

\(\chi ^2\) correlation: The \(\chi ^2\) (Chi-Square) test is often used when dealing with nominal features with finite value sets. We can use the \(\chi ^2\) test to see whether there is any association between the values of two nominal features, for which a contingency table of joint events is established. If the calculated \(\chi ^2\) value exceeds the critical (table) value at significance level \(\alpha\) (equivalently, if the p-value is below \(\alpha\)), the null hypothesis is rejected, and the two features can be said to be statistically associated [34]. SC analysts must remember that the \(\chi ^2\) test does not say much about the strength of the relationship between two features. The \(\chi ^2\) test offers advantages such as resilience regarding data distribution, computational simplicity, extensive information produced from the test, applicability to investigations where parametric criteria cannot be satisfied, and scalability in processing data from two-group and multiple-group studies. The drawbacks are sample size constraints and difficulty of interpretation when there are many (\(>20\)) features.
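
Both redundancy checks take only a few lines with pandas and SciPy. The sketch below flags a near-duplicate numeric feature via correlation and tests two nominal features with a chi-square contingency test; all column names and the 0.95 threshold are assumptions for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed numeric features: a correlation filter flags near-duplicates.
num = pd.DataFrame({
    "units_sold": [10, 14, 9, 20, 16],
    "revenue":    [100, 141, 92, 199, 161],   # almost proportional to units_sold
    "lead_time":  [4, 6, 5, 3, 7],
})
corr = num.corr()
redundant = [c for c in corr.columns
             if c != "units_sold" and abs(corr.loc["units_sold", c]) > 0.95]

# Chi-square test of independence for two nominal features.
cat = pd.DataFrame({
    "region":  ["N", "N", "S", "S", "N", "S"],
    "carrier": ["DHL", "DHL", "UPS", "UPS", "DHL", "UPS"],
})
table = pd.crosstab(cat["region"], cat["carrier"])
chi2, p_value, dof, _ = chi2_contingency(table)
print("redundant with units_sold:", redundant)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```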

4.4.2 Feature Selection (FS)

The reasons for conducting FS may include removing unnecessary data, enhancing forecasting accuracy, reducing data cost and model complexity, and improving training efficiencies such as reductions in space needs and computational costs [47]. FS approaches, despite their widespread use, have several drawbacks [48, 49, 50, 51, 52]:

  • Training data size significantly impacts the subsets produced by many FS models (particularly those created using wrapper-based techniques). If the training data is limited, then the feature subsets retrieved will be limited, resulting in the loss of key variables.

  • Because the target feature is connected with many independent features, and their removal would adversely influence learning accuracy, reducing high-dimensional data to a limited range of features is not always possible.

  • When dealing with huge datasets, a backward elimination approach takes too long, since the algorithm must make judgments based on enormous amounts of data in the early stages.

  • In certain circumstances, FS results will still include so many significant features that the use of complicated training strategies is obstructed.

Leading methods: For FS techniques built by combining a feature evaluation score with a cutting criterion, Reference [53] found that evaluation functions based on information theory produce better accuracy, without suggesting any universal cutting condition; cutting criteria independent of the metric perform best, and outcomes differ across models. For each kind of model, wrapper techniques were recommended to avoid this effect.

Reference [54] investigated nine feature selectors running across 11 simulated datasets to examine the methodologies in the context of a growing number of unnecessary features, noise in the data, redundancy, correlation between attributes, and the ratio of observations to features. ReliefF proved to be the best alternative regardless of the specifics of the data, and it is a filter with a cheap computational expense. Wrapper techniques have proven to be an intriguing choice in specific disciplines if they can be used with the same classifiers and consider the greater computing costs. Extensive theoretical research has been conducted on the Relief and its variants, showing that they are resilient, noise-resistant, and can decrease their space-time complexity in parallel [55].

Since the emergence of rough sets in pattern recognition, several FS techniques have based their criteria for assessing reductions and approximations on this idea [56]. Because exhaustive searches of substantial datasets are impossible, stochastic methods based on meta-heuristics and approximate assessment criteria have also been explored. Reference [57] utilized particle swarm optimization for this job. Continuous features make it challenging to apply the rough set-based selection criteria reported in the literature. The key drawback of rough set-based FS is the constraining condition that all values be discrete; to address this issue, a fuzzy rough FS method (FRFS) was suggested [58, 59].

When data are vast, messy, mixed with categorical and numerical variables, and may have dynamic effects requiring sophisticated models, the synthesis of forecasting analytics in the form of ensembles can create a compressed sample of non-redundant features [60]. The technique suggested there has four phases: identifying relevant features, computing masking scores, removing the masked factors, and generating residuals for progressive modification. The Random Forest ensemble is considered in all four stages.

Two problems arose simultaneously with the growth of high-dimensional data: FS becomes essential in every training task, yet the accuracy and robustness of the FS algorithms may be overlooked. Reference [61] addressed the FS reduction task by introducing Quadratic Programming FS (QPFS), which utilizes the Nyström approximation for matrix diagonalization on large datasets; using Pearson’s correlation coefficient and MI, QPFS outperformed mRMR and ReliefF. A local learning-based approach may be beneficial when assessing many irrelevant attributes and complicated data ranges [62]. The impacts of high-dimensional datasets may be mitigated by pre-processing the feature ranking procedure to exclude class-dependent density-based features [63]. Scaling any method to significant data issues demands cutting-edge distributed-computing frameworks like MapReduce and the Message Passing Interface (MPI) [64].

We can use supervised FS if the data have class labels; otherwise, unsupervised FS is the best option. This approach generally maximizes clustering efficiency or performs FS based on correlation, feature dependency, and priority. The primary premise is to eliminate features that bring almost no value beyond what is already provided by the existing features in the system. Reference [65] suggested using feature dependency/similarity to reduce redundancy without needing a search procedure. An information compression metric called the maximum information compression index governs the clustering partitioning process, which uses features as the unit of similarity. Forward orthogonal search (FOS) is another unsupervised FS approach that aims to maximize the overall dependency on the data to find relevant features [66]. Without compromising clustering performance, an unsupervised FS method based on the Random Cluster Ensemble framework compressed the feature set to roughly 1/100 of its initial dimensionality [67]. Compared with well-known classifications, precision/recall analyses revealed that feature weighting was highly successful in discovering the most suitable clusters [68].
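
As a simple supervised example of the filter family discussed above (ReliefF itself is not part of scikit-learn, so mutual information is used as the scoring function here), the sketch below keeps the k features most informative about a synthetic demand target; all dataset parameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in for a demand dataset: 200 rows, 15 candidate features,
# only 5 of which are informative.
X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=0.5, random_state=0)

# Filter-style selection: score each feature by mutual information with the
# target and keep the top k, independently of any downstream model.
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_reduced = selector.fit_transform(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```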

4.4.3 Feature Extraction

Feature extraction accelerates the ML algorithm’s execution, optimizes raw data quality, boosts the algorithm’s efficiency, and simplifies the interpretation of the findings.

Principal component analysis (PCA): PCA analyzes the variance-covariance structure of a collection of features through a few linear combinations, seeking the optimal k n-dimensional orthogonal vectors for data description, where \(k \le n\). The principal components (the derived features) are produced in decreasing order of the share of variance in the original dataset for which they account, with the first principal component accounting for the largest share. Typically, only the top few principal components, containing \(\ge\) 95% of the variance, are retained. PCA is beneficial when many independent variables are correlated with one another [69]. It is quick and comprehensive and ensures that a solution is found for all datasets [70].
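
A minimal scikit-learn sketch of this 95%-variance rule on synthetic correlated features follows; the data-generating step is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features standing in for related SC measurements.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(300, 1)) for _ in range(6)])

# Standardize first, then keep enough components for >= 95% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape, pca.explained_variance_ratio_)
```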

Factor analysis: The fundamental concept underlying factor analysis is to identify a collection of influencing factors that can restore the observed features through a series of linear combinations of those factors. It is a method that determines the set of factors along with their associated loadings, given the features and the means of the features [69]. The factor models can be solved by (1) the maximum-likelihood method and (2) the principal-component method. Maximum likelihood presupposes that the data follow a normal distribution and is computationally costly. The comparative differences between PCA and factor analysis are:

  1. Factor analysis, unlike PCA, implies a basic structure that connects the factors to the empirical observations.
  2. A three-factor system is substantially different from a two-factor system in factor analysis; in PCA, however, the two initial principal components stay the same when a third component is employed.
  3. PCA is simple and quick, whereas the computations in factor analysis can be done by several methods, some of which are complex and tedious.
  4. Using a sequence of linear transformations, PCA attempts to rotate the axes of the original features, whereas factor analysis generates a new set of features to explain the observed covariances and correlations.

Multidimensional scaling (MDS): MDS may be used in SCM to estimate a map depicting transportation distances between or within inventories from the distance matrix. The result is skewed owing to the disparity between the calculated distances and the actual straight-line distances between inventories. The map is typically centered on the origin and scaled to cover considerable distances; however, the solution is determined only up to rotation.

Locally linear embedding (LLE): With LLE, local linear fits are used to restore the universal nonlinear configuration [71] (a minimal usage sketch follows the list of shortcomings below). If adequate data are available, every point is a linear weighted sum of its neighbors; this is the basic notion behind the manifold approximation algorithm, and the geometric principle is all that the LLE algorithm requires. LLE’s advantages are that its optimization does not involve local minima and that it has only two parameters; the embedded space has a universal coordinate system and preserves the local geometry of high-dimensional data. LLE also has several inherent shortcomings, which are stated as follows:

  • LLE generates folds and nonhomogeneous warps when the dataset is small or the points are irregularly measured.

  • Noise significantly affects LLE, which causes embedding derivation errors.

  • Short circuits may develop during the neighbor search, since the query typically uses Euclidean distance.

  • Ill-conditioned eigenproblems may arise.

  • If two high-dimensional space observations differ, LLE cannot assure that their corresponding low-dimensional space instances also differ.

  • LLE’s embedding findings are extremely sensitive to its two system parameters: the number of neighbors of each instance and the regularization term.

  • LLE presupposes that the complete data lie on a single unified surface and is unsupervised, which does not hold for multi-labeled classification tasks.

  • It is unclear how to embed new sample data points, because LLE does not provide a parameterized mapping between the high-dimensional space and the low-dimensional manifold.
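
The usage sketch referred to above applies scikit-learn's LLE implementation to a synthetic swiss-roll manifold standing in for high-dimensional SC data; the parameter values are assumptions, not recommendations.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic nonlinear manifold standing in for high-dimensional SC data.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# The two system parameters noted above: the number of neighbors and the
# regularization term used when computing the local reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, reg=1e-3,
                             random_state=0)
X_embedded = lle.fit_transform(X)
print(X_embedded.shape)  # (1000, 2)
```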

4.4.4 Cardinality Reduction

The merging of two or more categories of a nominal or ordinal variable into a single category is called cardinality reduction. Nominal features with a large number of groups are challenging to manage. Converting high-cardinality variables into binary variables produces many new variables, mostly zeroes. However, if such variables are utilized without conversion in models like decision trees that can accept them, there is a risk of model over-fitting. So, decreasing the number of groups should be considered [70].

4.4.5 Discretization

The discretization method turns numerical data into qualitative data, i.e., converts quantitative features into discrete or nominal features that provide a non-overlapping segmentation of a continuous domain. Discretization can reduce data because it converts an enormous range of numerical values to a much smaller sub-ensemble of discrete values. Numerical features should be discretized because real-world dataset features are generally continuous, whereas many current ML algorithms can only be trained utilizing nominal features in categorical data [72].

Discretization generally involves four steps: (1) Continuous feature values to be discretized need sorting, (2) identifying a breakpoint or nearby intervals for joining, (3) dividing or combining continuous value ranges based on specific criteria, and (4) stopping this process at definite value.

MVD and UCP are promising unsupervised approaches that are helpful for various ML problems other than classification under adverse circumstances. Reference [72] identified a subset of the top global discretizers, comprising UCPD, FUSInter, Distance, MDLP, and Chi2, based on a compromise between the number of intervals produced and accuracy. The possibility of utilizing multivariate discretization may be investigated as parallel computers become more powerful. Chi2 may delete redundant features, and Contrast or ID3 (dynamic discretization methods) may be considered to integrate discretization into the learning process [73].

4.5 Forecasting

Forecasting can be defined as predicting or estimating a future value [74]. Forecasting is vital in business: SC performance is better for suppliers that forecast than for those that do not [75]. Mainly three types of forecasting are done based on the length of the forecast: operational forecasting for short-term operational activities, ranging from hours to a few weeks; tactical forecasting for a moderate duration to support tactical planning, ranging from months to a few years; and strategic forecasting, which is aligned with long-term goals to make strategic decisions [76]. Furthermore, the frequency with which a type of forecasting is done depends on the length of the forecast. Long-term forecasts are rarely done, whereas operational forecasts may be required frequently. The different forecasts deal with different uncertainties. Long-term forecasts deal with raw material cost fluctuations, final product price changes, seasonal variations in demand, and changes in production rate in the long term. In contrast, short-term uncertainties concern variations in daily processes, order cancellations, random failures in production, etc. [77].

While forecasting is practical, forecasting correctly with more accuracy is even more helpful. Demand forecasting that allows anticipating sales in the forecasted period helps minimize overproduction and overstock [13].

4.5.1 Types

Although forecasting techniques have evolved, they may be divided into three main categories: qualitative techniques, which use qualitative data or information to forecast; time series analysis and projection, which rely on historical data and the patterns arising from them; and causal models, in which, along with the historical data, special events and their relation to system elements are also considered [13]. The qualitative technique is not closely related to BD and data analytics; the other two are. Even so, qualitative data can be used to adjust forecasting models toward higher accuracy. Reference [78] displayed one such example: qualitative data can be fed through fuzzy NNs combined with quantitative data for training the model. Nevertheless, accurate forecasts cannot be made based only on qualitative data. Time-series analysis is relatively straightforward, especially with the recent advancement of statistical tools. However, the role of such forecasts is to reduce forecast errors by minimizing the deviations at each point; therefore, they do not consider special occasions, such as promotions, where sales are higher than usual [79]. This flaw brings us to causal models, i.e., models that incorporate probabilities of forecasting accuracy, the effect of outside interventions, and the interrelation of different types of variables in the model [80].

With the evolution of knowledge, different techniques for forecasting have emerged, along with new classifications to understand them. Reference [81] classified the different techniques into two broad groups of Intuitive and Formalized methods and divided Formalized methods further into Mathematical, System-structural, Associated, and Advanced information methods.

4.5.2 Model Fit and Train

The dataset can be randomly split into training, validation, and test sets so that predictive performance is evaluated on data different from the training data, giving an unbiased evaluation. The best approach is often to split the dataset by a date feature, with the most recent samples utilized for validation and testing. The primary concept is to choose a sample subset that accurately reflects the model data.

Two factors determine the proportions of these three sets: the number of data samples and the training models. Some models require significant training data; in this scenario, the split should favor a more extensive training set. Models with fewer hyperparameters are easier to validate and tune, allowing a small validation set. If the model contains many significant hyperparameters, an extensive validation set is beneficial. If the model has no hyperparameters or they are difficult to adjust, a validation set may not be required.

When using k-fold CV, the train-test split is repeated k times, with each fold taking a turn as the hold-out set. Time-series data cannot be used with k-fold CV directly, since k-fold CV assumes there is no connection between the rows and treats them as independent instances. Because of the instances’ time horizon, they cannot be divided arbitrarily into folds; instead, the data should be segmented and the chronological sequence of instances maintained. The term backtesting is used in time-series forecasting to describe the technique of evaluating models using past data; in meteorology, this is regarded as ‘hindcasting’ rather than ‘forecasting.’
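
The sketch below shows this chronological segmentation with scikit-learn's TimeSeriesSplit, where each fold trains only on observations that precede the test window; the 24-month series and the split sizes are assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumed 24 months of demand observations in chronological order.
y = np.arange(24)

# Expanding-window backtesting: every training fold precedes its test fold,
# preserving the chronological sequence (no shuffling as in plain k-fold CV).
tscv = TimeSeriesSplit(n_splits=4, test_size=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(f"fold {fold}: train months {train_idx.min()}-{train_idx.max()}, "
          f"test months {test_idx.min()}-{test_idx.max()}")
```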

4.5.3 Hyperparameter Tuning

Optimizing performance requires tuning hyperparameters, a task increasingly automated through Automated ML (AutoML). Hyperparameters are present in most ML systems, and their adjustment has the greatest influence when optimizing, regularizing, and architecting NNs. Common benefits of automatic hyperparameter optimization (HPO) include:

  • Reduced manual effort in applying ML, especially AutoML.

  • Improved efficiency of ML algorithms (by customizing them to the task at hand), which has yielded new state-of-the-art results on important ML benchmarks [82].

  • Improved reproducibility of the ML process.

  • Fair comparison of methods that undergo the same type of tuning.

One issue with HPO is that a particular configuration does not work well for all datasets [83]. Nevertheless, the value of optimizing hyperparameters beyond the defaults supplied by standard ML packages is increasingly acknowledged.
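As a hedged illustration of moving beyond package defaults, the sketch below runs a randomized search over an assumed search space with chronologically ordered CV folds; the estimator and parameter ranges are illustrative, not prescriptive.

```python
# A simple automated-HPO baseline: randomized search over a hypothetical
# search space, scored with time-series CV folds.
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

param_distributions = {                 # assumed, illustrative ranges only
    "n_estimators": randint(100, 600),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.19),
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(),
    param_distributions=param_distributions,
    n_iter=25,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
    random_state=42,
)
# search.fit(X_train, y_train)          # X_train, y_train from the earlier split
# print(search.best_params_)
```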

To provide a lower-cost default, the authors of [84] proposed Bayesian optimization and Hyperband (BOHB) as an efficient, flexible, robust, and parallelizable HPO technique. However, if all hyperparameters are real-valued and only a few function evaluations are affordable, the Gaussian-process (Spearmint) technique is recommended [82]. For constrained optimization problems in vast configuration spaces, they suggested the random-forest-based sequential model-based algorithm configuration (SMAC) or the Tree Parzen Estimator (TPE), as well as the covariance matrix adaptation evolution strategy (CMA-ES). Genetic approaches were first used to tune the two hyperparameters of an RBF-SVM, C and \(\mu\), faster than GridSearch and with better forecasting accuracy [85]. CMA-ES was initially utilized for hyperparameter optimization to tune the SVM hyperparameters C and \(\alpha\), the kernel length scales \(l_i\) for all input dimensions, and the complete rotation and scaling matrix [86]. CMA-ES has lately proved an excellent choice for parallelized HPO, surpassing current Bayesian heuristics while optimizing 19 deep-NN hyperparameters on 30 GPUs in parallel [87]. A Gaussian-process online approach incorporating expected improvement (EI) was used to tune SVM hyperparameters, attaining speedups over GridSearch of a factor of 100 (regression, three hyperparameters) and 10 (classification, two hyperparameters) [88]. A robust, adaptable, and parallel combination of Hyperband and Bayesian optimization was introduced that significantly surpassed both plain Bayesian (black-box) optimization and Hyperband for a broad variety of problems, including SVM tuning, different types of NNs, and reinforcement learning algorithms [89]. As early as 2002, early ML toolkits offered GridSearch for hyperparameter optimization [90, 91]. PatternSearch and greedy depth-first search (GDFS) were the first adaptive optimization techniques for HPO, with GDFS outperforming GridSearch. Particle Swarm Model Selection (PSMS) handles conditional configuration spaces with a customized particle swarm optimizer; modified ensembling was later added to PSMS to prevent overfitting and to combine the better methods from many generations [92]. In addition, PSMS was modified to use a genetic optimization algorithm for the pipeline architecture while retaining particle swarm optimization for the hyperparameters of each pipeline step [93]. For the hyperparameter adjustment of deep NNs, Reference [94] utilized Bayesian optimization, outperforming both random search and manual tuning; TPE, in turn, produced better output than a Gaussian-process approach in that setting. Random-forest-based approaches, TPE, and Bayesian optimization have also succeeded in combined neural architecture search and HPO [95]. We suggest a simple manual approach that might be helpful in general cases:

  • If there are many hyperparameters, the CV score can be evaluated for the first hyperparameter, and a value should be selected that avoids overfitting without lowering accuracy. With that hyperparameter fixed, the next hyperparameter is evaluated by iterating the same process, one hyperparameter at a time (see the sketch after this list). The HPO algorithm should be chosen based on the hyperparameter type.

  • If there are two or fewer hyperparameters, the desired HPO approach can be applied to them directly.
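The sketch below illustrates the one-hyperparameter-at-a-time idea from the list above under simple assumptions: a generic scikit-learn regressor, a chronologically ordered training set, and an illustrative search grid.

```python
# Hypothetical sketch: tune hyperparameters one at a time via CV scores.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def tune_one_by_one(X, y, param_grid):
    """Fix each hyperparameter in turn at its best CV value."""
    cv = TimeSeriesSplit(n_splits=5)
    best_params = {}
    for name, candidates in param_grid.items():   # evaluation order matters
        scores = []
        for value in candidates:
            model = GradientBoostingRegressor(**best_params, **{name: value})
            score = cross_val_score(model, X, y, cv=cv,
                                    scoring="neg_mean_absolute_error").mean()
            scores.append(score)
        best_params[name] = candidates[int(np.argmax(scores))]
    return best_params

# Example usage (hypothetical search space):
# best = tune_one_by_one(X_train, y_train,
#                        {"n_estimators": [100, 300, 500],
#                         "max_depth": [2, 3, 5],
#                         "learning_rate": [0.01, 0.05, 0.1]})
```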

4.5.4 Model Evaluation

In our framework, we suggest fitting and training the model on the analysis-ready training data using default parameters and then moving to the next step of tuning the hyperparameters. If model accuracy deteriorates at this stage, the features are usually not at fault; instead, the focus should be on the HPO of the models. After HPO, the top-performing models can be chosen through elimination, keeping model overfitting in mind. The model can then be evaluated by comparing predicted sales against actual sales after at least one month of the initial operating period.

4.5.5 Top Forecasting Models

Table 1 lists the time-series demand forecasting models used in the reviewed literature, and Table 2 provides a comprehensive overview of the most recently proposed ML models in different forecasting applications together with the performance metrics evaluated in each study. Considering accuracy and precision in forecasting future time-series lags, the ARIMA model outperformed the AR (autoregressive), MA (moving average), and SES (simple exponential smoothing) models. Empirical research reported that long short-term memory (LSTM) improved forecasting by 85% compared with ARIMA, a traditional model. Furthermore, the number of epochs (training iterations) did not influence the forecasting model's performance, which exhibited genuinely random behavior [96].
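For orientation, the following minimal sketch fits a statsmodels ARIMA baseline of the kind such LSTM results are typically compared against; the monthly demand series and the (1, 1, 1) order are synthetic and purely illustrative.

```python
# Minimal ARIMA baseline with statsmodels; the demand series is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
demand = pd.Series(
    100 + np.arange(48) * 1.5 + rng.normal(0, 5, 48),      # trend + noise
    index=pd.date_range("2020-01-01", periods=48, freq="MS"),
)

train, test = demand[:-6], demand[-6:]            # hold out the last 6 months
model = ARIMA(train, order=(1, 1, 1)).fit()       # (p, d, q) chosen for illustration
forecast = model.forecast(steps=len(test))

mape = np.abs((test - forecast) / test).mean() * 100
print(f"ARIMA(1,1,1) hold-out MAPE: {mape:.1f}%")
```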

Table 1 List of time-series demand forecasting models used in literature review
Table 2 The most recent (2022–2023) ML models for forecasting applications in SCM

For the demand forecasting procedure, [97] evaluated statistical models, RBFNN (radial basis function NNs), and Winters models against SVM. They concluded that SVM outperforms the other algorithms, with mean MAPE outcomes around a 7.7% threshold. Reference [99] demonstrated a new AI-based forecasting approach evaluating a fuzzy reasoning strategy and an ANN based on an adaptive network to handle demand with inadequate knowledge; during testing, they obtained a MAPE of 18% on average for some products. For unpredictable customer demands, neural methods have provided a robust forecasting strategy in a multi-level SC framework. The greedy aggregation decomposition (GAD) approach is a generic, self-developing method for discontinuous time-series forecasting that considers two causes of variation, addressing a practical discontinuous demand-forecasting problem [100]. With a limited dataset, it outperformed SBA, Croston's method, TSB, MA-7, SES, MAPA, MA-3, ADIDA, iADIDA, and N-7 with a MAPE of 5.9%. Reference [101] offered the SHEnSVM (Selective and Heterogeneous Ensemble of SVMs) model for sales forecasting: individual SVMs were trained on samples produced by the bootstrap method, with parameters set by grid search as stated in the specified model, and the optimum combination strategy was found using a genetic algorithm. They claimed a 10% improvement over the SVM algorithm and a 64% average enhancement in MAPE, using beer data from three product variants in their tests. Reference [102] integrated a DL technique, the SVR algorithm, and the best-performing time-series analytic models into a boosting ensemble for demand forecasting systems. Their DL implementation in the new integration strategy (MAPE: 24.7%) lowers mean forecasting error in the SC, outperforming both the conventional best-performing forecasting model (MAPE: 42.4%) and the integration strategy without DL (MAPE: 25.8%). XGBoost, ARIMA, and Snaive with STL decomposition have outperformed solo and hybrid models and other modeling mixes, providing the best forecasting accuracy [105].

Reference [98] used Facebook Prophet (FB-Prophet) and ANNs to forecast lithium mineral resource prices in China. Quality and quantity of lithium data, network architecture, and activation functions significantly impacted the performance of an ANN forecasting model. Overfitting can occur when an ANN model is too closely tailored to the training dataset, and regularization and early halting strategies can enhance the model’s performance. The FB-Prophet model, which uses a decomposable time-series model, can effectively forecast data with fewer value matrices, handle missing values, and practice adjustments. Reference [111] created recurrent NN (RNN), LSTM, and gated recurrent unit (GRU) models to forecast the demand for U.S. influenza vaccinations, with data from 1980 to 2011 serving as the training set and data from 2012 to 2020 serving as the testing set. The prediction models may be scalable because there was no overfitting between the expected and actual numbers. The error comparisons demonstrated that GRU is more precise than LSTM and RNN in predicting vaccination demand. Energy generation and demand forecasting search net (EGD-SNet), a framework that can anticipate energy production, demand, and temperature across various areas, was reported in the study of [112].

The EGD-SNet framework includes the 10 most popular ML regressors, 11 dimensionality reduction techniques, and 13 alternative FS algorithms. It employs a particle swarm optimizer (PSO) to train regressors intelligently by locating the best hyperparameters. It can also create an end-to-end pipeline by selecting the right regressor, feature, and dimensionality reduction methodologies to accurately anticipate energy generation or demand for a specific geographical dataset, depending on the characteristics of the data. Reference [113] implemented several DL methods comprising data collection, de-noising or pre-processing, feature extraction, and classification stages. Feature extraction is determined by two primary DL model variants: the first used three RNN structures (LSTM, BiLSTM, and GRU), while the second used a temporal convolutional network (TCN). They used SoftMax, RT, RF, KNN, ANN, and SVM classifiers on an online dataset. TCN predicted shipping risk under COVID-19 restrictions with almost 100% accuracy.

As daily fish demand forecasting models for grocery merchants aiming to reduce food waste and enhance sustainable SCs, Reference [121] investigated LSTM, feedforward NNs, support vector regression, RF, and a Holt-Winters statistical model. The findings showed that the LSTM model provided the best outcomes in terms of root mean squared error (27.82), mean absolute error (20.63), and mean positive error (17.86). Reference [122] forecasted solar Global Horizontal Irradiance using statistical and deep learning architectures, which aids grid management and power distribution and highlights Pakistan's solar power potential in addressing global climate change. They employed SARIMAX, Prophet, LSTM, convolutional NN (CNN), and ANN approaches. Error measures such as \(R^2\), MAE, MSE, and RMSE were used to evaluate each model's performance. They concluded that SARIMAX and Prophet are ideal for long-term forecasts, whereas ANN, CNN, and LSTM are best for short-term forecasts. Reference [106] found that the optimum model for every participant in the SC across all three inventory replenishment strategies is a stacked ensemble model consisting of XGBoost, AdaBoost, and Random Forest. According to a methodology for comparing forecasting methods developed by [123], the MLP method has a slight edge over the CNN, LSTM, and CNN-LSTM approaches. Reference [124] utilized data from the last five years to estimate demand for eight dairy products from five dairy production facilities using a direct multistep prediction method. ARIMA works effectively on a narrow subset of unpredictable series, whereas LSTM excels at anticipating seasonal patterns and outperforms ARIMA on trends; monthly data decreased model training error.

For the purpose of forecasting daily energy use, Reference [103] investigated the effectiveness of three ML models (SVR, RF, and XGBoost), three deep learning models (RNN, LSTM, and GRU), and ARIMA. For both very short-term load forecasting (VSTLF) and short-term load forecasting (STLF), the suggested XGBoost models beat the competing models, while the ARIMA model performed worst. Reference [125] presented a smart platform for data-driven blood bank management that forecasts blood demand and balances blood collection and distribution based on optimal blood inventory management to avoid blood wastage and shortage. This improves blood quality and quantity, increasing blood collection by 11% and reducing blood waste by 20%. Balancing blood collection and distribution based on sound blood inventory management and arranging blood donation sessions to avoid cancellations may lower inventory levels. Reference [109] proposed a CNN-LSTM model with Swish activation to estimate a store's supply based on prior sales; Swish outperformed the rectified linear unit (ReLU), commonly regarded as the most effective activation function. They forecasted sales using a multilayer perceptron, LSTM cells, and CNNs, and the experiment showed that the CNN-LSTM model has a lower RMSE. Pharmaceutical businesses can use shallow NN and DNN demand forecasting models for eight anatomical therapeutic chemical (ATC) drug groups [114]; shallow NN models performed well for five of the eight medication categories, while the ARIMA model performed best for the other three.

Reference [115] introduced an extreme learning machine (ELM) model using the Harris Hawks optimization (HHO) method to estimate e-commerce product demand. In forecasting product demand for the next three months, the ELM-HHO model outperformed the statistical ARIMA (7,1,0) model by 62.73%, the NN-based GRU model by 40.73%, the LSTM model by 34.05%, the traditional non-optimized ELM model with 100 hidden nodes by 27.16%, and the ELM-BO model by 11.63%. Reference [107] developed a novel ML forecasting approach by merging an adaptive neuro-fuzzy inference system (ANFIS) with time-series data features to forecast real-time e-order arrivals in distribution hubs, helping third-party logistics providers better manage hourly e-order arrival rates. ELM, GB, KNN, MLP, and DT were the five ML algorithms used by [108] to forecast demand in a business based on Black Friday customer information. According to the results, MLP, ELM, GB, KNN, and DT were the top algorithms in terms of MSE, while ELM, MLP, GB, DT, and KNN performed best in terms of MAE; moreover, ELM had a higher \(R^2\) value of 0.6365, whereas DT had a lower value (0.4877). Reference [104] compared RF, XGBoost, gradient boosting, AdaBoost, and ANN algorithms against a hybrid (RF-XGBoost-LR) model for retail chain sales forecasting, analyzing a US retail company's weekly sales data with factors such as temperature and shop size; the hybrid RF-XGBoost-LR outperformed the other models on many criteria. RNN and LSTM were used by [110] to improve stock price prediction; the LSTM memory cell, a computational unit that replaces the artificial neuron, is embedded in the network. The study increased the number of epochs and batch sizes to improve precision, both of which boosted prediction accuracy, and evaluation on the test data showed that the proposed technique forecasts stock markets more accurately. The study by [126] presents QAmplifyNet, a novel hybrid quantum-classical neural network for SC backorder prediction. Achieving 90% accuracy, it outperforms traditional models on short, imbalanced datasets while offering superior interpretability and predictive capability, and its integration into real-world systems offers transformative potential for enhancing inventory control and operational efficiency through quantum-inspired techniques.

Based on the mentioned studies, we suggest considering the following recently best-performing hybrid time-series demand forecasting ML models: XGBoost-LSTM [116], FB-Prophet [117], XGBoost-LightGBM [118], M-GAN-XGBoost [119], SARIMA integrated with AttConvLSTM and FB-Prophet [127], and AUG-NN [120].

When selecting the primary top-forecasting model, it is recommended to consider the best cross-validation (CV) score, minimum runtime, and space consumption as criteria for evaluation.
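A possible way to operationalize these criteria is sketched below: candidate models are ranked by time-series CV score while their runtimes are recorded for tie-breaking; the candidate set and data names are assumptions.

```python
# Sketch of ranking candidate models by CV score and runtime, per the criteria
# above. Candidate models and data names are illustrative.
import time
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def compare_models(models, X, y):
    results = []
    for name, model in models.items():
        start = time.perf_counter()
        score = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5),
                                scoring="neg_mean_absolute_error").mean()
        results.append((name, score, time.perf_counter() - start))
    # Higher (less negative) CV score first; runtime reported for tie-breaking.
    return sorted(results, key=lambda r: r[1], reverse=True)

# Example usage (hypothetical data):
# ranking = compare_models({"ridge": Ridge(),
#                           "rf": RandomForestRegressor(n_estimators=200),
#                           "gbm": GradientBoostingRegressor()},
#                          X_train, y_train)
```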

5 Control-Process

Actual processes do not always go as predicted; performance varies with changing levels of efficiency. In ideal cases, the workforce and existing capacity can achieve the goal as planned. However, human performance levels are inconsistent [128] and vary with other factors such as WIP inventory, machine utilization, product mix, and the queueing system [129]. Such fluctuations in efficiency cannot be predicted accurately; hence, they must be recognized in time, and appropriate measures must be taken to meet requirements. Responding to the randomness of the SC process may consist of three steps: capturing or recording data simultaneously with the SC activity, comparing the recorded data with the standard, and adjusting capacity to meet the short-term goal. Data can be used to determine the optimal decision on changing suppliers, changing price levels, and competitors' competitiveness, and to monitor performance during the process [130]. Furthermore, when planned and actual output levels diverge, the root cause can be identified using BDA [131]. Reference [132] mentioned further benefits of BD for workforce scheduling, production efficiency, employee productivity, capacity utilization, flexibility, and lead time reduction.

5.1 Information Flow

Forecasting decisions affect further SC planning. As such, information flows across multiple phases of the SC process. Logistics superiority and better stock level synchronization are possible through a flow of demand information from downstream members to the upstream ones and the flow of production plan and delivery information from the upstream members to the downstream ones [133]. Like the three types of forecasting, there are three types of decisions in SC: strategic, operational, and tactical [134]. Recent studies have shown how the findings from one level can affect other decisions and limit the number of options for subsequent decisions [135].

It is possible to attain efficiency through forecasting by properly allocating resources in areas such as the workforce, capacity, and inventory management. There are need-based, supply-based, and demand-based models for forecasting and planning in these areas of SCM [136], and each requires some form of information flow. The forecasted amount can be used to estimate dependent inventory demands: the final product's demand indirectly determines the required workforce, capacity, and warehouse planning, and can lower costs through optimization.
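As a toy illustration of this information flow, the snippet below explodes a hypothetical finished-goods forecast through a bill of materials to obtain dependent component demand; all product names and quantities are invented.

```python
# Illustrative sketch: deriving dependent component demand from a finished-goods
# forecast via a bill of materials (all figures hypothetical).
forecast_units = {"bicycle": 1200}                 # forecasted finished-goods demand

bill_of_materials = {                              # components per finished unit
    "bicycle": {"frame": 1, "wheel": 2, "brake_set": 2, "chain": 1},
}

dependent_demand = {}
for product, units in forecast_units.items():
    for component, qty_per_unit in bill_of_materials[product].items():
        dependent_demand[component] = dependent_demand.get(component, 0) + units * qty_per_unit

print(dependent_demand)   # {'frame': 1200, 'wheel': 2400, 'brake_set': 2400, 'chain': 1200}
```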

5.2 Production Efficiency

Having real-time data on production boosts production efficiency. Firms can manage order processing across SCs and companies while decreasing errors and waste inside manufacturing facilities by incorporating real-time data into SC operations [2]. This efficiency is further enhanced when data from suppliers and distributors are available. Through close connections and sharing information with SC partners, data-driven SCs may also affect manufacturing and operations processes through increased efficiency in product development, product design, quality improvement, and balance between capacity and demand [137]. Additionally, data integration in the SC has been found to aid in developing production strategies and the timely delivery of products and services [138].

5.3 Employee Productivity

In general, there is either an excess or a shortage of workforce in the production process; the question is how to reduce this inefficiency. Under variable output requirements, workforce scheduling without data analysis entails investing in cross-job training so that workers can be more productive and efficient across tasks. However, this reduces performance while time is spent on upskilling or reskilling.

Data-driven decision-making, that is, forecasting required outputs to estimate the required workforce, is an excellent way to minimize the costs of hiring and laying off by scheduling the workforce adequately [132]. Through proper scheduling, workforce idle time can be shifted to workdays requiring extra hours, thereby balancing the load. Reference [139] showed that work pressure can be balanced with reduced slack time and workforce through different heuristic algorithms, each performing well in a different area of efficiency.

5.4 Inventory Management

Reducing inventory costs can cut the overall cost of the business. Different models have been created to minimize costs and maximize profits, aiding material planning mechanisms, stock-out predictions, inventory level predictions, and more [140]. Inventory costs can be lowered at the sourcing, transportation, and holding levels by optimizing inventory decisions [137].

5.5 Role of Data by Time Frame

Although BDA can make SC processes efficient, different decisions cannot rely on the same forecast. Capacity planning or storage sizing falls under long-term strategic decisions requiring long-term forecasts or aggregated short-term forecasts. Conversely, production plans may be short-term operational decisions requiring short-term predictions. Reference [141] described how predictions can be derived from the aggregation of short-term forecasts, the disaggregation of long-term forecasts, or a co-integration of both. Hence, separate forecasts can be produced on the one hand; on the other, forecasts can be derived from other forecasts to maintain relevance.
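The aggregation route described by [141] can be sketched as follows, rolling a synthetic daily operational forecast up to monthly figures for longer-horizon planning; the series and horizon are illustrative.

```python
# Sketch of deriving a longer-horizon figure by aggregating short-term forecasts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
daily_forecast = pd.Series(
    rng.normal(500, 40, 180),                       # hypothetical daily demand forecast
    index=pd.date_range("2024-01-01", periods=180, freq="D"),
)

monthly_forecast = daily_forecast.resample("MS").sum()   # aggregate for tactical/strategic use
print(monthly_forecast.round(0))
```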

6 Post-process

6.1 SC Performance

Once an SC process is completed, the performance needs to be reviewed to identify the gaps in planning models. Performance measurement is defined as quantifying actions across two fundamental dimensions: effectiveness and efficiency [142]. Performance measurement is essential to control the output; without it, no person or machine can be held liable for subpar performance, and problems will be harder to identify and solve. Performance measurement helps with information for management feedback, decision-making, monitoring performance, diagnosing problems, motivating people, identifying potentials of a decision, measuring success or failure, reviewing and adjusting business strategies, specifying company goals, and much more [143]. Reference [144] offered a comprehensive methodology considering all three SC system stages, including ERP-based SC performance. To comprehend whether network scanning and embeddedness are linked to SC performance, Bernardes and Zsidisin ([145], p. 209) studied the correlation between SCM strategy and network scanning and embeddedness concepts.

Immediately after the tasks are completed, the performance data must be recorded. For data collection, the performance metrics are first to be identified, just as those found for business evaluations [146]. References [147] mentioned plan success, source optimization, production efficiency, delivery performance, and customer support-relation and satisfaction, each having multiple performance metrics under them. The performance level found afterward can be of three types: below average, average, and above-average [148]. The actions followed after such a finding are different in each case. When a below-average performance is observed, managers can either look for anomalies in the system or review whether the goals set were too high to achieve.

Conversely, an above-average performance requires rechecking the goals so that resources can be optimized. These adjustments are consistent with the goal-setting theory stated by [149]. Operational benefits such as performance monitoring, objective setting, management, transparency, and planning functions can be improved with the assistance of BDA and the performance metrics derived from it, through the predictive KPIs, dashboards, and scorecards used by SC operational managers within the organization [130].

Besides managerial decision-making, the performance data are crucial to modifying existing forecasting models. Under a considerable deviation in performance, the data received at this level needs to be fed back to the forecasting stage to refine the forecasting model. The performance metrics can thus act as indicators of forecasting model error.

6.2 Forecasting Error Measurement

Our proposed cyclic framework is evaluated against predicted sales when actual sales data are available or when the hold-out set is used. Nevertheless, the hold-out set might not reflect real-world scenarios perfectly, so we encourage a cyclic, continuous development process driven by insights from real-sales evaluation. A few evaluation metrics can be used in this post-process evaluation to fine-tune the forecasting model in the preprocessing phase. Assume test data with m periods, \(t=1,\ldots ,m\). The difference between forecasted sales \(f_t\) and actual sales \(y_t\) at period t is the forecasting error \(e_t=y_t-f_t\).

6.2.1 Mean Absolute Error

$$\begin{aligned} MAE = \frac{1}{m} \sum _{t=1}^m |e_t | = mean(|e_t |) \end{aligned}$$
(10)

MAE is straightforward and relatively simple to explain; its main disadvantage is scale dependence.

6.2.2 Mean Absolute Percentage Error

$$\begin{aligned} MAPE = \frac{1}{m} \sum _{t=1}^m\left| \frac{e_t}{y_t}\right| \times 100 \end{aligned}$$
(11)

MAPE is perhaps the most often used error indicator for business forecasting because it is easy to interpret. However, despite the term 'Percentage,' the MAPE value may exceed 100%. Periods with actual sales equal to 0 cause problems, since the fraction's denominator becomes zero; MAPE is therefore not an appropriate metric when dealing with intermittent demand. Its major drawback is asymmetry: it penalizes over-forecasting more than under-forecasting, which can skew model selection.

6.2.3 Mean Squared Error

$$\begin{aligned} MSE = \frac{1}{m} \sum _{t=1}^m {e_t}^2 \end{aligned}$$
(12)

Compared with RMSE, MSE takes less runtime to compute and is more flexible. However, MSE cannot be interpreted on the scale of actual sales because the errors are squared.

6.2.4 Root Mean Squared Error

$$\begin{aligned} RMSE=\sqrt{\frac{1}{m} \sum _{t=1}^m{e_t}^2} =\sqrt{mean(|{e_t}^2|)} \end{aligned}$$
(13)

Squaring the errors in RMSE has two consequences: more weight is placed on larger errors, and positive and negative errors cannot cancel one another out, since all are transformed into positive values.

6.2.5 Mean Absolute Scaled Error

For non-seasonal time-series,

$$\begin{aligned} MASE=\frac{\frac{1}{J} \sum _{j}|e_j|}{\frac{1}{T-1}\sum _{t=2}^T|y_t-y_{t-1}|} \end{aligned}$$
(14)

For seasonal time-series,

$$\begin{aligned} MASE=\frac{\frac{1}{J} \sum _{j}|e_j|}{\frac{1}{T-m}\sum _{t=m+1}^T|y_t-y_{t-m}|} \end{aligned}$$
(15)

Here J is the number of forecast errors being averaged, T is the length of the historical series, and m in Eq. (15) denotes the seasonal period. MAE is robust to outliers, whereas RMSE penalizes large errors more heavily and thus guards against an evaluation that overlooks them. SC analysts should examine MAE and, if it hides a significant bias, also utilize RMSE; when the dataset contains many outliers, MAE may protect the evaluation from being skewed by them.

6.2.6 Tracking Signal

The tracking signal is a way to verify whether the current forecasting method remains correct: a tracking signal that drifts with the forecast bias reveals bias in the prediction model. It is often employed when the forecasting model's validity is questionable.

$$\begin{aligned} \text {Algebraic sum of forecast error}= & {} \sum _{t=1}^m e_t \end{aligned}$$
(16)
$$\begin{aligned} \text {Tracking Signal}= & {} \frac{\text {Algebraic sum of forecast error}}{\text {Mean Absolute Error}} \end{aligned}$$
(17)

A rule of thumb holds that the technique employed for forecasting is accurate when the tracking signal is within \(-4\) to \(+4\).
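The error measures in Eqs. (10)-(17) can be computed directly, as in the NumPy sketch below; the actual and forecasted series are synthetic, and for brevity the MASE scaling term is taken from the same short series rather than from the full training history, as it normally would be.

```python
# NumPy sketch of the error measures in Eqs. (10)-(17); arrays are synthetic.
import numpy as np

y = np.array([120.0, 95.0, 130.0, 110.0, 140.0, 125.0])   # actual sales
f = np.array([115.0, 100.0, 128.0, 118.0, 132.0, 129.0])  # forecasted sales
e = y - f                                                  # forecasting error e_t

mae  = np.mean(np.abs(e))                                  # Eq. (10)
mape = np.mean(np.abs(e / y)) * 100                        # Eq. (11)
mse  = np.mean(e ** 2)                                     # Eq. (12)
rmse = np.sqrt(mse)                                        # Eq. (13)

# MASE (non-seasonal), Eq. (14): scale by the naive one-step error; the training
# series would normally supply this denominator.
naive_scale = np.mean(np.abs(np.diff(y)))
mase = mae / naive_scale

# Tracking signal: algebraic (signed) error sum divided by MAE, Eqs. (16)-(17).
tracking_signal = e.sum() / mae

print(f"MAE={mae:.2f} MAPE={mape:.1f}% RMSE={rmse:.2f} "
      f"MASE={mase:.2f} TS={tracking_signal:.2f}")
```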

6.3 Phantom Inventory

Research has shown that imprecise perpetual inventories (PIs) are overestimated approximately 50% of the time; that is, the PI displays more stock than is actually present in the shop, a situation called phantom inventory. The most severe consequence of phantom inventory is unavailability: the system considers that it has adequate inventory and does not order a replenishment. The recognized causes of phantom inventory are [150]:

  • Stolen goods and defective products that are not reported.

  • Cashier mistakes.

  • Incomplete deliveries from the distribution center (products that should have been received but were not).

  • Returned goods that should update the system are sometimes recorded incorrectly.

To resolve the stock inconsistency, businesses may take several measures [151]:

  • Safety stock may be raised. The enhanced safety stock aims to mitigate inventory problems by keeping ‘excess’ inventory at hand; RFID may reduce the cost of storing this additional, redundant inventory.

  • The business may conduct frequent manual inventory counts. Physical inventory audits may interrupt store operations, are expensive, and vary in precision; the improved precision of RFID may be a more affordable option.

  • The business may apply a continuous inventory write-down equal to the total inventory loss it believes is taking place, to balance the phantom inventory. The issue is that the precise inventory loss is not known; the visibility provided by RFID may be more accurate than conventional stock-loss estimation techniques.

  • The business may attempt to minimize mistakes by improving inventory management, decreasing fraud, etc.

Inventory precision determines the quality of forecasting, procurement, and replenishment wherever inventory records are used as input. Inaccurate demand forecasting due to phantom inventory (overstated PI) may be improved by including RFID in the process [152].
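A minimal reconciliation sketch of the kind such RFID-supported processes enable is shown below: perpetual-inventory records are compared against physical or RFID counts, and overstated SKUs are flagged; the records are invented.

```python
# Sketch of flagging likely phantom inventory by reconciling perpetual-inventory
# (PI) records against physical or RFID counts (illustrative records only).
system_stock = {"SKU-001": 40, "SKU-002": 12, "SKU-003": 75}   # PI record
counted_stock = {"SKU-001": 40, "SKU-002": 5, "SKU-003": 68}   # cycle count / RFID read

for sku, recorded in system_stock.items():
    counted = counted_stock.get(sku, 0)
    if recorded > counted:                 # PI overstated: candidate phantom inventory
        print(f"{sku}: phantom inventory of {recorded - counted} units; "
              f"trigger a recount and correct the replenishment inputs")
```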

7 Challenges

This section aims to provide a comprehensive overview of the challenges encountered during the review of 152 articles from 1969 to 2023 in BDA-SCM for forecasting, with a specific focus on data preprocessing and ML techniques. The challenges identified herein will serve as a valuable resource for future researchers, enabling them to address and overcome these obstacles, ultimately advancing the domain and contributing to its growth and development.

7.1 Data Quality and Reliability

One of the critical challenges observed in the reviewed literature is the issue of data quality and reliability. Many studies acknowledged the presence of incomplete, inconsistent, and erroneous data within SC datasets. Future research efforts should focus on developing robust data cleansing, integration, and quality assurance techniques to enhance the reliability and accuracy of the forecasting models.

7.2 Scalability and Performance

With the exponential data growth in SCM, scalability and performance have become significant challenges. The reviewed articles often lacked details on how their proposed techniques would scale up to handle large-scale datasets or real-time processing requirements. Future researchers should explore scalable algorithms, distributed computing frameworks, and parallel processing techniques to ensure the effectiveness and efficiency of forecasting models.

7.3 Variety and Complexity of Data Sources

The diverse range of data sources, such as structured, unstructured, and semi-structured data, presents challenges in data preprocessing and feature extraction. The reviewed literature indicated limited exploration of techniques for effectively handling data variety and complexity. Future research should focus on developing innovative methods for integrating and analyzing heterogeneous data sources to extract meaningful insights for accurate forecasting.

7.4 Feature Engineering and Selection

Effective FE and FS are required to identify the most relevant features for forecasting within the SC context. Future researchers should investigate advanced FE techniques to improve forecasting accuracy, including automated FS, dimensionality reduction, and feature representation approaches.

7.5 Model Interpretability and Explainability

The black-box nature of some ML models limits their interpretability and hampers decision-making processes. The surveyed literature revealed a lack of emphasis on model interpretability, hindering the wider adoption of forecasting techniques in SCM. Future research should focus on developing transparent and interpretable models that explain their predictions, enabling practitioners to understand and trust the results.

7.6 Real-Time Data Processing and Analysis

SCM requires real-time monitoring and decision-making capabilities. However, the surveyed literature demonstrated a limited exploration of real-time data processing and analysis techniques for forecasting purposes. Future research efforts should concentrate on developing real-time forecasting frameworks that leverage stream processing, online learning, and adaptive algorithms to handle dynamic and time-sensitive SC scenarios.

7.7 Privacy and Security Concerns

Integrating big data in SCM raises concerns regarding data privacy and security. The surveyed articles paid limited attention to these challenges, and there is a lack of comprehensive approaches to ensure the privacy and security of sensitive SC data. Future researchers should focus on developing robust privacy-preserving and secure ML techniques to safeguard data while maintaining the accuracy and efficiency of forecasting models.

7.8 Integration of Domain Knowledge

SCM involves complex domain-specific knowledge, including industry-specific constraints, regulations, and contextual factors. The reviewed literature showed a limited integration of such domain knowledge into the forecasting frameworks. Future research should emphasize the incorporation of domain expertise and contextual information to enhance the relevance and accuracy of forecasting models within the SC domain.

7.9 Lack of Benchmark Datasets and Evaluation Metrics

The absence of standardized benchmark datasets and evaluation metrics hinders the comparison and reproducibility of forecasting techniques. The reviewed articles often utilized different datasets and evaluation metrics, making it challenging to assess the performance of various models. Future researchers should strive to establish benchmark datasets and evaluation protocols specific to SC forecasting, enabling fair comparisons and facilitating advancements in the field.

By overcoming these challenges through innovative techniques and methodologies, researchers can contribute to the advancement of this field, leading to more accurate, scalable, and interpretable forecasting models for SCM.

8 Practical Implications

The findings of this research offer substantial practical implications for SC practitioners, providing actionable insights that can be effectively implemented in real-world scenarios. The proposed BDA-SCM framework serves as a strategic guide, and its practical application holds the potential for significant benefits in enhancing overall SC operations.

SC practitioners can implement the BDA-SCM framework by initially aligning data collection methodologies with specific SC objectives. This involves systematically gathering data directly relevant to the SC ecosystem’s unique dynamics and challenges. By integrating the framework into their operational processes, practitioners can leverage the power of BDA at various stages, from problem identification to performance evaluation.

The implementation of the BDA-SCM framework promises several tangible benefits for SC practitioners. Firstly, the framework enhances the accuracy of forecasting models, providing practitioners with more reliable insights into demand patterns, inventory needs, and workforce requirements. This, in turn, enables optimized decision-making across various facets of SCM. Secondly, the cyclic connection within the framework ensures adaptability to dynamic SC conditions. SC practitioners can continuously refine and optimize their forecasting models based on real-time data, staying responsive to changing market dynamics and mitigating potential disruptions. Furthermore, the framework’s emphasis on KPIs and error-measurement systems enables practitioners to evaluate and improve their forecasting models’ performance systematically. This enhances operational transparency and contributes to the SC’s overall efficiency and planning effectiveness.

In practical terms, the BDA-SCM framework supports inventory management by providing accurate demand forecasts, aids in determining workforce needs, optimizes cost factors, and facilitates efficient production and capacity planning. By fostering a holistic approach to SCM, the framework equips practitioners with a systematic and data-driven strategy to address the intricacies of modern SC dynamics. In essence, the practical implementation of the BDA-SCM framework empowers SC practitioners to navigate the complexities of their operational environments with greater precision and foresight, ultimately contributing to enhanced resilience, efficiency, and competitiveness in the ever-evolving SCM landscape.

9 Conclusions

This systematic review diligently identified and compared state-of-the-art SC forecasting strategies and technologies within the defined temporal scope, conducting a comprehensive review of 152 papers from 1969 to 2023. This study has made significant strides in addressing the challenges inherent in SC forecasting, offering cutting-edge technological solutions within a comprehensive BDA-SCM framework. The key findings and contributions of this study can be summarized as follows:

  1. Pre-process: In the pre-processing stage of SC forecasting, the significance of accurate data aligned with SC objectives was emphasized. The study provided recommendations for SC analysts, including using EDA, FE, hyperparameter tuning, and recent ML model training approaches to improve forecasting accuracy. However, it is essential to note that further research is needed to explore advanced techniques for data cleansing, integration, and quality assurance to ensure reliable and high-quality input data.

  2. Control-process: The study discussed how BD could facilitate efficient managerial decision-making in various areas of SCM, such as production and capacity planning, workforce requirements, and inventory management. Leveraging insights from forecasted data allows decision-makers to optimize SC operations and resource allocation. However, future research should focus on developing real-time decision support systems that can integrate and analyze large-scale data streams to enable timely and effective decision-making.

  3. Post-process: The post-process section emphasized SC performance measurement and the role of BDA in optimizing model predictions. By analyzing performance metrics and leveraging BDA techniques, SC practitioners can identify areas for improvement and refine their forecasting models accordingly. Future research efforts should focus on developing comprehensive performance measurement frameworks specific to SC forecasting, including quantitative and qualitative metrics, to enable more accurate evaluation and comparison of forecasting models. Additionally, the study addresses the accuracy of inventory records as a crucial determinant for forecasting, procurement, and replenishment quality. Mitigating inaccuracies resulting from phantom inventory is highlighted, with the inclusion of RFID technology in inventory management processes as a viable solution. Future research should explore advanced techniques and methodologies to address phantom inventory, incorporating emerging technologies and developing comprehensive inventory management and forecasting frameworks to enhance overall SC performance.

This study has successfully addressed the research questions posed:

RQ1: The study has identified and outlined efficient steps to formulate an ML forecasting model for predicting SC factors. Recommendations for accurate data preprocessing, FE, hyperparameter tuning, and advanced ML model training approaches have been provided to enhance the accuracy of SC forecasting models.

RQ2: The study has emphasized the importance of connecting, tracking, and optimizing the forecasting, SC decision-making, and performance measurement processes in a cyclic order. The proposed BDA-SCM framework encompasses the Pre-process, Control-process, and Post-process stages, providing guidance on integrating these processes to optimize SC operations and resource allocation.

RQ3: The study has explored the impact of forecasting on SC performance and identified relevant ML forecasting models for SC forecasting. The connection between accurate forecasting and improved SC performance has been highlighted, with recommendations for performance measurement and using BDA techniques to optimize model predictions.

By successfully addressing these RQs, this study contributes to advancing the field by providing insights into efficient ML modeling steps, the integration of forecasting and SC decision-making processes, and the relevance of ML forecasting models for SC forecasting. Future research should build upon these findings to further enhance the understanding and implementation of BDA in SCM.

While this systematic literature review (SLR) followed a rigorous and objective evaluation approach, acknowledging its limitations is crucial. These limitations include the availability of relevant literature, potential publication bias, and the dynamic nature of the BDA-SCM field. Future research endeavors should aim to address these gaps by conducting further empirical studies, developing benchmark datasets, and exploring emerging technologies and methodologies to advance the understanding and implementation of BDA in SCM.

By considering and addressing the challenges and limitations outlined in this study, future researchers can build upon its findings and contribute meaningfully to the continued advancement of the BDA-SCM domain.