1 Introduction

Intermittent demand forecasting remains an important yet challenging task in many organizations. Intermittent demand items are common for firms with slow moving items such as after-sales service parts and high-valued capital goods, as well as those with a high number of stock keeping units (SKUs). The number of these infrequent demand items as well as the associated inventory cost could be significant, accounting for up to 60% or 70% of a firm’s total inventory investment in some industries (Boylan & Syntetos, 2021; Hu et al., 2018; Johnston et al., 2003). Therefore, accurate forecasting of these items is essential to a firm’s successful inventory planning and management. Boylan and Syntetos (2021) discuss the economic and environmental benefits of accurate intermittent demand forecasting, including lower inventory holding costs, higher revenues through increased service levels, and less environmental harm due to lower risk of obsolescence. The role of intermittent demand in supply chain forecasting is reviewed by Syntetos et al. (2016).

Forecasting intermittent demand is challenging because intermittent demand occurs sporadically and at random. The intermittent demand time series is characterized by zero demands in many periods interspersed with non-zero demands that are often highly variable. Unlike non-intermittent demand forecasting where only demand size is uncertain, both the timing of a positive demand occurrence and the size of the demand are uncertain in intermittent demand forecasting. Therefore, for intermittent demand forecasting, one is concerned with not only the demand size, but also when a positive demand will occur. This may be the reason that limited research attention has been given to intermittent demand forecasting (Nikolopoulos, 2021), although there has been an increasing interest in the topic in recent years.

Research on intermittent demand has largely followed Croston (1972) in which Croston developed the first systematic approach to intermittent demand forecasting. To deal with the aforementioned uncertainties inherent in intermittent demand time series, Croston estimates the demand size and the inter-demand interval separately using exponential smoothing and then the ratio of the two estimates is the demand forecast. Although Croston’s method outperforms traditional exponential smoothing method for intermittent demand forecasting, Syntetos and Boylan (2001) show that Croston’s method is positively biased. One of the most well-known modifications to Croston’s method is proposed by Syntetos and Boylan (2005) who correct the bias with an approximation method (known as the Syntetos–Boylan Approximation or SBA in the literature), which has been shown to be generally more effective than the Croston method (Babai et al., 2014; Boylan & Syntetos, 2007; Eaves & Kingsman, 2004; Teunter & Sani, 2009). Other modifications include Babai et al. (2019), Doszyn (2019), Leven and Segerstedt (2004), Prestwich et al. (2014), Shale et al. (2006), Syntetos and Boylan (2001), Teunter et al. (2011), and Yang et al. (2021) among others.

Alternative methods have also been developed for intermittent demand forecasting. These include the bootstrapping method (Hasni et al., 2019; Syntetos et al., 2015; Willemain et al., 2004; Zhou & Viswanathan, 2011), the aggregate-disaggregate approach (Boylan and Babai, 2016; Nikolopoulos et al., 2011; Petropoulos et al., 2016), hybrid models (Hua & Zhang, 2006; Hua et al., 2007; Nasiri Pour et al., 2008; Zhuang et al., 2022), and others (Nikolopoulos et al., 2016; Zhu et al., 2017). A good overview of intermittent demand forecasting methods is provided by Boylan and Syntetos (2021). An extensive review of various methods for spare parts demand forecasting is given in Pince et al. (2021).

Neural networks represent another category of promising methods due to their attractive features for time series forecasting in general (Zhang et al., 1998) and intermittent demand forecasting in particular (Kourentzes, 2013). The main advantage of neural networks is their flexible, nonlinear modeling capability, which allows them to deal with complex structures in intermittent demand data. As Pince et al. (2021) state, neural networks, as a versatile tool, “can capture nonlinear patterns in the data, such as intermittence and lumpiness, better than most time-series methods.” Although neural networks have received limited attention in the intermittent demand forecasting literature, several different models have been used for intermittent demand forecasting with varying levels of success (Babai et al., 2020; Gutierrez et al., 2008; Kourentzes, 2013; Lolli et al., 2017; Mukhopadhyay et al., 2012). The majority of these models are feedforward and recurrent types of neural networks.

In this study, we propose the use of transformer neural networks (or transformers) for intermittent demand forecasting. Transformers are newly developed neural network architectures based on the attention mechanism (Vaswani et al., 2017). The introduction of the attention mechanism has improved the success of various deep learning models (Bahdanau et al., 2015). Attention is a component of deep learning to capture important and relevant sequential information in inputs to improve output prediction accuracy. The use of the attention mechanism in neural network architectures has been increasingly common (Niu et al., 2021). Originally developed for natural language processing, transformers with the attention mechanism have gained significant interest and achieved great success in many fields including natural language processing, computer vision, and image analysis. Due to their capability of modeling long-term dependencies and interactions in sequential data, transformers hold great promise in time series modeling and forecasting. Although researchers have begun to explore the use of transformers for time series forecasting (Wen et al., 2022), to the best of our knowledge, no research has examined how effective transformers are in predicting intermittent demand.

The effectiveness of transformer models (Transformer hereafter) for intermittent demand is tested with a data set of weekly demand series for 925 SKUs from an airline spare parts provider. We compare the performance of Transformer with that of two traditional methods of Croston’s and SBA as well as several neural network architectures including feedforward neural networks, recurrent neural networks, and long short-term memory. Our results based on six different forecasting performance measures show that Transformer is the overall best method under a variety of settings.

The rest of the paper is organized as follows. In the next section, we provide a focused review of the relevant studies in intermittent demand forecasting with neural networks. Section 3 describes data and methodology. Section 4 reports the results. Finally, Section 5 concludes the paper.

2 Intermittent demand forecasting with neural networks

Although neural networks have been widely used for time series forecasting (Zhang et al., 1998), only a few studies have used neural networks for intermittent demand forecasting (Lolli et al., 2017). One of the first studies is conducted by Gutierrez et al. (2008) who compare the performance of neural networks with that of several traditional methods such as Croston’s method, single exponential smoothing, and SBA. Using 24 intermittent demand time series each with 967 daily observations, they find that the three-layer feedforward neural network architecture they proposed with two input nodes (representing the previous period demand and the number of periods separating the last two nonzero demands), three hidden nodes, and one output node (for the predicted demand) generally outperforms the traditional methods.

Nasiri Pour et al. (2008) is another early study that adopts neural networks for intermittent demand forecasting. They propose a hybrid approach with neural networks used to forecast the occurrence of non-zero demand, and traditional methods used to predict the demand size. The results using 30 spare parts demand series with 69 monthly observations show that the hybrid model performs better than SBA and several standalone neural network models including feedforward, recurrent, and generalized regression networks.

Using the same data from Gutierrez et al. (2008), Mukhopadhyay et al. (2012) conduct another comparative study involving the feedforward neural network. They find that the neural network model and the 5-day weighted moving average model perform well relative to several traditional methods including simple exponential smoothing (SES), Croston’s, and SBA. The neural network used in the study has the same basic structure as Gutierrez et al. (2008) except that one of the input nodes represents the cumulative number of successive periods with zero demand instead of the number of periods separating the last two non-zero demands. Mukhopadhyay et al. (2012) also examine the effect of different data splitting strategies, forecast lead time, and performance measures on the forecasting performance of each method.

Kourentzes (2013) proposes two feedforward neural network models inspired by Croston’s method. Both models use non-zero demands and inter-demand intervals as inputs, while the output layer differs in that one employs two nodes to represent predicted demand and inter-demand interval separately (NN-Dual), while the other employs just one node to represent predicted demand rate (NN-Rate). Based on 1000 simulated intermittent time series, the results show that these models perform worse than Croston’s method and several modified Croston’s methods in terms of forecast accuracy. However, the neural networks, especially NN-Rate, perform much better than Croston’s method and its variants with regard to inventory metrics (i.e. service levels).

All of the above-mentioned studies adopt feedforward type of neural networks. In addition to the feedforward network, Lolli et al. (2017) also consider two other types of neural networks: recurrent and time-delay, for intermittent demand forecasting. They use three input nodes in their networks: last demand size, the number of periods separating the last two nonzero demands, and the cumulative number of successive periods with zero demand, which is the combined set of inputs used by Gutierrez et al. (2008) and Mukhopadhyay et al. (2012). In an experimental study using two different learning approaches (back-propagation and extreme learning machine), two learning modes (batch and online), and three neural network architectures (feedforward, recurrent, and time-delay) as well as 24 weekly auto spare parts demand intermittent time series with varying length from 61 to 414 observations, they find that neural network models generally perform well relative to Croston’s method and SBA in terms of forecast accuracy.

A more recent study by Babai et al. (2020) makes a further modification of the neural network structure proposed by Gutierrez et al. (2008). In addition to the same two inputs (i.e., the last period demand and the number of periods separating the last two nonzero demands) used by Gutierrez et al. (2008), Babai et al.’s (2020) proposed model also adds multiple lagged demands prior to the last period and the number of periods between the forecast period and the last zero demand period. In addition, they consider varying numbers of hidden nodes such as three, five and nine, instead of three used in Gutierrez et al. (2008). Using a monthly demand dataset of 5,135 SKUs from an airline each with 123 observations and considering both forecasting and inventory performance metrics, their results are somewhat mixed depending on the performance metrics used. Generally speaking, the proposed neural network models outperform those of Gutierrez et al. (2008) and are better than or comparable to the parametric methods. Surprisingly, the results also show that SES performs very well overall in terms of both forecast accuracy and inventory efficiency, while the two bootstrapping methods examined do not perform well.

Several observations can be made following the above discussion. First, neural networks are promising for intermittent demand forecasting. Although there are mixed findings, neural networks generally outperform traditional parametric or nonparametric methods in terms of forecasting and/or inventory performances. Second, the number of intermittent time series and the number of observations in each series vary quite significantly in published studies. For example, both Gutierrez et al. (2008) and Lolli et al. (2017) use 24 time series in their studies, but Lolli et al.’s (2017) time series length varies from 61 to 414 observations while Gutierrez et al.’s (2008) time series all have 967 observations. On the other hand, Babai et al. (2020) employ over 5,000 time series, but with a relatively small number of observations (123) in each series. Third, although the focus has been on forecasting accuracy, there is no consensus on the best forecast accuracy or error measures. Thus, different error measures are used by different researchers. Finally, the neural network architectures considered in the literature are similar, with feedforward multilayer networks being the most popular. The neural network model of Gutierrez et al. (2008) appears influential because many subsequent studies either adapt their model with different inputs or use the model as a benchmark for comparison purposes.

3 Data and methods

3.1 Data

The data used for this study are from an airline service parts provider/distributor located in the Southeast region of the U.S. The company provided us with weekly demand data for a total of 5,800 SKUs from January 2018 to December 2021. Following the literature (e.g., Willemain et al., 1994), we use the squared coefficient of variation of demand (CV²) and the average demand interval (ADI) to identify noise and intermittence levels. Then, using the classification scheme devised by Syntetos et al. (2005) with their cutoff points of 0.49 for CV² and 1.32 for ADI, we divide our data into four categories: intermittent, lumpy, smooth, and erratic. Because each category has its unique characteristics, we elect to focus in this study on SKUs in the intermittent category (CV² < 0.49 and ADI > 1.32). There are 925 items in this category. Because the starting and ending dates may differ for different SKUs, the length of time series varies. The minimum number of observations is 106 and the maximum is 207.
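The classification step can be illustrated with a minimal sketch. It uses the cutoffs stated above; the function name `classify_demand`, the use of non-zero demand sizes for the squared coefficient of variation, and the approximation of ADI as periods per demand occurrence are assumptions made for illustration, not details confirmed by the study.

```python
from statistics import mean, pvariance

CV2_CUTOFF = 0.49   # cutoff for the squared coefficient of variation
ADI_CUTOFF = 1.32   # cutoff for the average demand interval


def classify_demand(series):
    """Return (cv2, adi, category) for one demand series (hypothetical helper)."""
    nonzero = [d for d in series if d > 0]
    if not nonzero:
        raise ValueError("series has no positive demand")
    # Assumption: CV^2 is computed on the non-zero demand sizes.
    cv2 = pvariance(nonzero) / mean(nonzero) ** 2
    # Assumption: ADI is approximated as periods per demand occurrence.
    adi = len(series) / len(nonzero)
    if adi > ADI_CUTOFF:
        category = "intermittent" if cv2 < CV2_CUTOFF else "lumpy"
    else:
        category = "smooth" if cv2 < CV2_CUTOFF else "erratic"
    return cv2, adi, category
```

Under these assumptions, a series with three similar-sized demands spread over twelve weeks would fall in the intermittent category, which is the category retained for the study.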

Table 1 provides a summary of demand characteristics for our sample. It shows that there is a high degree of variation in both demand size and demand intermittence. For example, the mean demand is 2.55 with the minimum of 1.07 and the maximum of 404.75. While the mean demand interval is 9.27 weeks, the minimum mean interval is 1.26 weeks and the maximum mean interval is 184 weeks. The demand per period also shows significant lumpiness with the lowest equal to 0.07 and the largest equal to 10.87.

Table 1 Demand characteristics

Because different methods may vary in performance due to different levels of demand intermittence, we group SKUs in our sample based on sparsity defined as the percentage of zero demand periods in each time series. Table 2 shows the frequency distribution of the SKUs with regards to different levels of sparsity. Since no SKUs have zero demand percentage below 0.2, we have a total of eight different levels of sparsity starting from level 1 which is for SKUs with zero demand percentages between 0.2 and 0.3, level 2 with zero percentages between 0.3 and 0.4, and so on. Interestingly, the number of SKUs increases with the sparsity level with only 3 at level 1 and 388 at level 8. Most SKUs (794 or 85.8%) are at levels 6–8, indicating an overall high degree of demand intermittence in our data.

Table 2 Sparsity frequency distribution
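As a rough illustration of the grouping, the sketch below maps a series to one of the eight sparsity levels. The function names are hypothetical, and treating each level as a half-open interval (for example, 0.2 up to but not including 0.3) is an assumption, since the handling of boundary values is not stated.

```python
def sparsity(series):
    """Fraction of periods with zero demand."""
    return sum(1 for d in series if d == 0) / len(series)


def sparsity_level(series):
    """Map a series to levels 1..8 (0.2-0.3 -> 1, ..., 0.9-1.0 -> 8)."""
    zeros = sum(1 for d in series if d == 0)
    # Integer arithmetic avoids floating-point rounding at bin edges.
    decile = zeros * 10 // len(series)
    if decile < 2:
        raise ValueError("sparsity below 0.2 is outside the eight levels")
    # A series made only of zeros (sparsity 1.0) is folded into level 8.
    return min(decile, 9) - 1
```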

All data are normalized using the Min–Max transformation so that all values are between 0 and 1. Normalizing numeric inputs generally prevents values or features of large magnitude from dominating the others: without it, model performance is likely to be driven by very high input values, and low-valued inputs are under-weighted regardless of their potential contributions. In addition, we partition the data into three parts: 60% in the training set, 20% in the validation set, and 20% in the testing set.
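A minimal sketch of these two preprocessing steps follows. The helper names are hypothetical, and two details are assumptions: that scaling is applied per series over all of its observations, and that the 60/20/20 partition is chronological.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def split_60_20_20(series):
    """Chronological split into training, validation, and testing parts."""
    n = len(series)
    # Integer arithmetic keeps the cut points exact.
    i, j = n * 60 // 100, n * 80 // 100
    return series[:i], series[i:j], series[j:]
```

Forecasts produced on the scaled data would need to be mapped back with the same minimum and maximum before being compared with demand in original units.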

3.2 Methods

3.2.1 Croston’s method

While simple exponential smoothing (SES) has been widely used for forecasting, it has a significant limitation in intermittent demand forecasting because it weights recent data more heavily than older observations. As a result, SES produces a forecast that is biased high immediately after a demand occurs and biased low just before the next demand occurs, creating unnecessarily high stock levels. Croston (1972) tries to address this bias by estimating the demand size and demand interval separately. To make the method easier to implement, he uses SES to estimate these quantities. Specifically, let \(Z_t\) (\(D_t\)) be the estimated (actual) non-zero demand for time \(t\), and \(P_t\) (\(Q_t\)) be the estimated (realized) inter-demand interval since the last non-zero demand. Croston’s updating formulae based on SES are given below:

$${Z}_{t}=\alpha {D}_{t}+\left(1-\alpha \right){Z}_{t-1}$$
(1)
$${P}_{t}=\alpha {Q}_{t}+\left(1-\alpha \right){P}_{t-1}$$
(2)

where α is a smoothing constant between 0 and 1. Note that Croston’s original method uses the same smoothing constant for updating both smoothed estimates, although different smoothing constants can be used in the above equations. Croston (1972) suggests a range between 0.1 and 0.3 for α. We use the same smoothing coefficient as indicated in the above updating equations in this study.

The forecast demand for the next time period, \(Z_{t+1}\), is then estimated as a ratio of the above quantities:

$${Z}_{t+1}={Z}_{t}/{P}_{t}$$
(3)

Note that \(P_t\) is strictly positive. If the actual demand (\(D_t\)) is zero, then \(Z_t\), \(P_t\), and \(Z_{t+1}\) remain the same, i.e., no updating occurs. On the other hand, if demand occurs in every period (\(P_t = 1\)), then \(Z_{t+1} = Z_t\), so Croston’s method produces the same forecasts as SES.
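Equations (1) to (3) can be sketched in a few lines of Python. The function name `croston` is hypothetical, and the initialization (the first non-zero demand and the number of periods elapsed up to it) is an assumption, since the starting values are not specified here.

```python
def croston(demand, alpha=0.1):
    """Return, for each period t, the forecast for period t+1 (Eq. 3).

    Entries are None until the first non-zero demand has been observed.
    """
    z = p = None   # smoothed demand size Z_t and smoothed interval P_t
    q = 1          # periods since the last non-zero demand, Q_t
    forecasts = []
    for d in demand:
        if d > 0:
            if z is None:
                # Assumption: initialize with the first observed size and interval.
                z, p = float(d), float(q)
            else:
                z = alpha * d + (1 - alpha) * z   # Eq. (1)
                p = alpha * q + (1 - alpha) * p   # Eq. (2)
            q = 1
        else:
            q += 1   # zero demand: Z_t and P_t are left unchanged
        forecasts.append(None if z is None else z / p)
    return forecasts
```

With demand in every period the smoothed interval stays at one, so the forecast reduces to the smoothed demand size, in line with the remark above.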

In order to find the best smoothing coefficient, we conduct an experiment with α varying from 0.1 to 0.5 with a step size of 0.01. The best coefficient for each time series is chosen by minimizing the mean squared error (MSE) as Lolli et al. (2017) suggested that using MSE as an accuracy measure is analogous to the neural network training method. However, unlike Lolli et al. (2017) who use the training set to determine the optimal parameters, we use the validation set as this is more in line with how the neural network model building and parameter estimation are performed.

3.2.2 Syntetos–Boylan approximation

Because of the positive bias in Croston’s method, several adjustments or modifications to Croston’s method have been proposed (Syntetos & Boylan, 2001, 2005; Teunter & Sani, 2009). Among them, the Syntetos–Boylan approximation (SBA) developed by Syntetos and Boylan (2005) has received the most attention from the literature and most studies use it as a benchmark. The SBA method adjusts the Croston forecast downward by multiplying it by a factor of \(1-\alpha/2\) as follows:

$${Z}_{t+1}={\left(1-\frac{\alpha }{2}\right)Z}_{t}/{P}_{t}$$
(4)

The optimal smoothing coefficient for the SBA method is obtained using the same approach as with Croston’s method. That is, we select the smoothing constant over the range of 0.1 to 0.5 in steps of 0.01 that minimizes MSE of the validation sample.
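The SBA adjustment and the grid search over the smoothing constant can be sketched as follows. The function names, the zero forecast returned before any demand has been observed, and the use of index bounds `val_start` and `val_end` to mark the validation sample are assumptions for illustration only.

```python
def croston_forecasts(demand, alpha, sba=False):
    """Forecast for period t+1 after each period t; SBA applies (1 - alpha/2)."""
    z = p = None
    q = 1
    factor = 1 - alpha / 2 if sba else 1.0
    out = []
    for d in demand:
        if d > 0:
            if z is None:
                z, p = float(d), float(q)   # assumed initialization
            else:
                z = alpha * d + (1 - alpha) * z
                p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
        # Assumption: forecast 0.0 until the first non-zero demand is seen.
        out.append(0.0 if z is None else factor * z / p)
    return out


def select_alpha(demand, val_start, val_end, sba=False):
    """Pick alpha in 0.10, 0.11, ..., 0.50 minimizing MSE on the validation span.

    Requires val_start >= 1 so that a prior-period forecast exists.
    """
    best_alpha, best_mse = None, float("inf")
    for step in range(10, 51):
        alpha = step / 100
        f = croston_forecasts(demand[:val_end], alpha, sba)
        # f[t - 1] is the one-step-ahead forecast for period t.
        sq = [(f[t - 1] - demand[t]) ** 2 for t in range(val_start, val_end)]
        mse = sum(sq) / len(sq)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse
```

The same routine serves both benchmarks: calling it with `sba=False` gives the Croston search and `sba=True` the SBA search.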

3.2.3 Feedforward and recurrent neural networks

We use the same feedforward neural network (FNN) and recurrent neural network (RNN) architectures as in Lolli et al. (2017). Both FNN and RNN models are fully connected with a single hidden layer. In addition, we use the following three inputs in these networks: the demand at time period t, the number of periods separating the last two non-zero demand transactions at the end of the immediately preceding period, and the cumulative number of successive periods with zero demand. See Lolli et al. (2017) for details on the structures of these models as well as how the models are trained.
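To make the three inputs concrete, the sketch below builds one input row per period from a raw demand series. The function name `build_inputs`, the zero placeholder used before two non-zero demands have been seen, and the exact timing convention for each feature are assumptions; Lolli et al. (2017) should be consulted for the definitive construction.

```python
def build_inputs(demand):
    """Return one [demand_t, gap_last_two_nonzero, zero_run] row per period t."""
    rows = []
    last_nz = prev_nz = None   # indices of the two latest non-zero demands before t
    zero_run = 0               # successive zero-demand periods up to and including t
    for t, d in enumerate(demand):
        # Gap between the last two non-zero demands as of the end of period t-1.
        gap = last_nz - prev_nz if prev_nz is not None else 0
        zero_run = 0 if d > 0 else zero_run + 1
        rows.append([d, gap, zero_run])
        if d > 0:
            prev_nz, last_nz = last_nz, t
    return rows
```

In a one-step-ahead setup, the row for period t would be paired with the demand in period t+1 as the training target.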

3.2.4 Long short-term memory model

The long short-term memory (LSTM) model was proposed to improve on the RNN, which suffers from potential information loss and vanishing/exploding gradient problems, especially over long time periods (Hochreiter and Schmidhuber, 1997; Zhang et al., 2021). LSTM is a special type of RNN that is capable of modeling long-term dependencies. Specifically, LSTM adds four gates/cells to an RNN model as shown in Fig. 1:

  • Forget gate: this gate helps decide what type of information to keep and what information to forget. This is useful for long time series because it helps keep the most useful or relevant information. At any time t, an LSTM receives input vector \({\mathrm{\rm X}}_{t}\in {R}^{n\times d}\), the previous time period hidden state \({\mathrm{\rm H}}_{t-1}\in {R}^{n\times h}\), the previous time period cell state \({C}_{t-1}\in {R}^{n\times h}\) where \(n\) represents the number of samples in a batch, \(h\) is the number of cells in the hidden layer, and \(d\) is the number of inputs. The gate with the sigmoid activation function (\(\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}\)) determines what information to be removed from \({\mathrm{\rm X}}_{t}\) and \({\mathrm{\rm H}}_{t-1}\) (Zhang et al., 2021). At the Forget gate, the model takes the input of \({\mathrm{\rm X}}_{t}\) with the hidden state information \({\mathrm{\rm H}}_{t-1}\), and outputs the following:

    $${F}_{t}= \sigma \left({\mathrm{\rm X}}_{t}{W}_{xf}+{\mathrm{H}}_{t-1}{W}_{hf}+{\mathrm{b}}_{f}\right)$$

    where \({W}_{xf}\) and \({W}_{hf}\) are weight parameters, and \({b}_{f}\) is a bias parameter. Because the range of the sigmoid function is between 0 and 1, the values of \({F}_{t}\) also lie between 0 and 1, so the gate essentially keeps part of the information for further processing.

Fig. 1 Illustration of the LSTM architecture

  • Input gate: this gate is used to quantify the importance of the information from the input. It passes the input vector \({\mathrm{\rm X}}_{t}\) and the hidden state \({\mathrm{\rm H}}_{t-1}\) through the sigmoid activation function with its own weight parameters \({W}_{xi}\) and \({W}_{hi}\) and bias parameter \({b}_{i}\), and its output becomes part of the input to the memory state:

    $${I}_{t}= \sigma \left({\mathrm{\rm X}}_{t}{W}_{xi}+{\mathrm{H}}_{t-1}{W}_{hi}+{\mathrm{b}}_{i}\right)$$
  • Memory gate: the primary purpose of this gate is to determine what information to carry over to the next stage/period. The next period memory information considers information from the previous period, the current period output from the forget gate, the hidden state, and the current period input:

    $$ C_{t} = F_{t} \odot C_{t - 1} + I_{t} \odot \widetilde{{C_{t} }} $$

    where \(\odot\) refers to the elementwise Hadamard product operator; \(\widetilde{{C}_{t}}\) is a term that represents what current information is kept for future memory updating purpose. Its values are in the range of (−1, 1) by utilizing the tanh function as shown below,

$$\widetilde{{C}_{t}}=tanh\left({\mathrm{\rm X}}_{t}{W}_{xc}+{\mathrm{H}}_{t-1}{W}_{hc}+{\mathrm{b}}_{c}\right)$$
  • Output gate: the output gate serves two purposes. The first is to generate the output values/vectors for the current period if such outputs are needed. Specifically, the output combines the input and hidden state with the sigmoid function:

    $${O}_{t}= \sigma \left({\mathrm{\rm X}}_{t}{W}_{xo}+{\mathrm{H}}_{t-1}{W}_{ho}+{\mathrm{b}}_{o}\right)$$

    The second purpose of this gate is to derive the hidden state information for the next period from the current output and the long-term memory (\({C}_{t}\)):

    $$ H_{t} = O_{t} \odot \tanh \left( {C_{t} } \right) $$

Overall, the hidden state passed to the next period is the elementwise product of the current period output and the current period memory state after the tanh activation function.

In summary, through these four different gates, LSTM is able to mitigate part of the gradient vanishing or exploding problem of RNN by transferring and keeping only some of the useful information from the past over time.Footnote 1
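The gate equations can be traced in a single forward step written with NumPy. This is only a sketch: the function name `lstm_step` and the dictionary layout of `params` are hypothetical, and each gate is given its own weight matrices and bias, as in the standard LSTM formulation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step.

    Shapes: X_t (n, d); H_prev and C_prev (n, h).
    params maps "f", "i", "c", "o" to tuples (W_x, W_h, b) with shapes
    (d, h), (h, h), and (h,).
    """
    W_xf, W_hf, b_f = params["f"]
    W_xi, W_hi, b_i = params["i"]
    W_xc, W_hc, b_c = params["c"]
    W_xo, W_ho, b_o = params["o"]

    F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)       # forget gate
    I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)       # input gate
    C_tilde = np.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)   # candidate memory
    O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)       # output gate

    C_t = F_t * C_prev + I_t * C_tilde   # elementwise (Hadamard) products
    H_t = O_t * np.tanh(C_t)
    return H_t, C_t
```

Because the forget gate multiplies the previous memory elementwise, values of \(F_t\) close to one carry the memory state forward almost unchanged, which is what helps gradients survive over long sequences.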

We use the standard three-layer LSTM model in our study. For the LSTM model training, the hyperbolic tangent (tanh) is used as the activation function. We consider both the number of hidden neurons and the batch size as parameters and the optimal numbers are chosen from a fixed set of [5, 10, 20, 30, 40, 50, 60, 70] for the neurons and from the set of [10, 20, 30, 40, 50] for the batch size. The models are trained with the Adam optimization algorithm (Kingma & Ba, 2017), an extension to stochastic gradient descent that has been increasingly used in deep learning applications. The number of epochs is set at 100 with proper early stopping.

3.2.5 Transformer

Transformer represents a relatively new deep learning architecture proposed by Vaswani et al. (2017). It relies entirely on the self-attention mechanism to compute representations of input and output without resorting to recurrent or convolutional layers. At a high level, a Transformer model is composed of two components: a position-wise feedforward network layer and a multi-head attention layer. The position-wise network is a fully connected feedforward network, which is applied to each position separately and identically. Because it operates position by position, the position of each symbol in the input sequence is preserved during the operation. Multi-head attention allows the model to focus on information from different representation subspaces at different positions by running multiple self-attention heads in parallel, much like the multiple channels of a convolutional network.

When the sequence of inputs is long, LSTM may still have difficulty in interpreting the information. Part of the reason is that the input is sequential and entered one at a time. In contrast, transformers can take all the input at one time instead of sequentially (Vaswani et al., 2017). However, one challenge for simultaneous inputs is about how to treat various inputs differently by utilizing important information while ignoring noisy types of inputs. Self-attention is an important mechanism that allows inputs to interact with each other and find out what information should be paid more attention to. It can extract importance of different period inputs, and assign weights to different inputs that will be aggregated into the final output. Figure 2 provides a high-level comparison between recurrent neural networks without and with self-attention. Note that when there is no self-attention mechanism in RNN (top part of the figure), inputs are taken into consideration sequentially, and they are treated equally. In comparison, when there is self-attention (lower part of the figure), each period takes inputs of all previous periods, and through training, weights can be assigned to each input, thus differentiating inputs by their importance.

Fig. 2 Illustration of the self-attention

Figure 3 shows a more detailed view of the self-attention mechanism regarding its major components. Each input is represented by three elements of query (Q), key (K), and value (V) vectors (Vaswani et al., 2017).Footnote 2 The core of the mechanism is to calculate the self-attention score, which is a measure of relevance between the current input and any other inputs in the previous sequence. In particular, the self-attention scores go through a softmax activation function defined as \(softmax({z}_{i})=\frac{{e}^{{z}_{i}}}{{\sum }_{j=1}^{N}{e}^{{z}_{j}}}\), which scales to be a percentage of total focus that is given to an input in the sequence. Using the softmax function, the attention score by utilizing input vectors Q, K, and V can be calculated as:

$$ {\text{Attention}}\;\left( {Q,\;K,\;V} \right) = {\text{softmax}}\left( {\frac{{QK^{T} }}{{\sqrt {d_{k} } }}} \right)V $$

Fig. 3 Illustration of different components of the self-attention mechanism

Note that self-attention scores are calculated by comparing the query of the current input with the keys of the inputs in the sequence. In particular, each score is the dot product between the query vector of the current input and the key vector of the input being scored. After the softmax, each value vector is multiplied by its corresponding softmax score, and the current input is then encoded by summing these scaled value vectors.
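The attention formula can be written out directly in NumPy. The sketch below assumes a single head and two-dimensional arrays with one row per position; the function names are illustrative.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q (m, d_k), K (n, d_k), V (n, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key dot products, scaled
    weights = softmax(scores, axis=-1)   # each row sums to one
    return weights @ V, weights          # weighted sum of the value vectors
```

Each row of `weights` gives the share of attention that one position pays to every position in the sequence, so the encoded output for that position is a weighted average of the value vectors.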

The encoded input then goes to the Transformer-decoder as shown in Fig. 4. In each decoder component, first the self-attention scores are masked, meaning that certain values are hidden. The masks hide the ground truth information and help derive the right weights in the model during training. The question of how much information to hide/mask is treated as a hyperparameter, which can be derived by the prediction error in the training process. The masked self-attention scores pass through a fully connected feedforward neural network. And finally a linear model combines all relevant decoded values to generate an output as the predicted value. Note that the Transformer adopts an encoder and decoder scheme in which the encoder is used to derive the self-attention scores as illustrated in Fig. 3 while the decoder is used for prediction as shown in Fig. 4.

Fig. 4 Illustration of the transformer-decoder architecture

We use PyTorch, a machine learning framework based on the Python programming language, to configure and train the transformer model. We ensure that the encoder layer matches the input dimension size and that the decoder layer fits the forecasting requirement for the dependent feature. The number of variables used as inputs to the model in the encoder layer is three, which is the same as the input dimension size.Footnote 3 Since we are forecasting the intermittent demand in this study, the decoder layer is implemented as a linear layer. As with LSTM, transformer models are trained for 100 epochs with the Adam optimizer and MSE loss function.Footnote 4
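A configuration of this kind might look like the PyTorch sketch below. It is not the model used in the study: the class name `DemandTransformer`, the linear projection from the three inputs to a wider model dimension, the layer sizes, the causal mask, the learning rate, and the omission of positional encoding are all assumptions made to keep the example short.

```python
import torch
from torch import nn


class DemandTransformer(nn.Module):
    """Hypothetical encoder with a linear decoder for one-step-ahead demand."""

    def __init__(self, n_inputs=3, d_model=16, n_heads=2, n_layers=1):
        super().__init__()
        # Assumption: project the three input variables to d_model dimensions.
        self.embed = nn.Linear(n_inputs, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=32,
            dropout=0.0, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(d_model, 1)   # linear decoder layer

    def forward(self, x):   # x: (batch, seq_len, n_inputs)
        seq_len = x.size(1)
        # Causal mask: -inf above the diagonal hides later periods.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        h = self.encoder(self.embed(x), mask=mask)
        return self.decoder(h[:, -1, :]).squeeze(-1)   # (batch,)


def train(model, X, y, epochs=100, lr=1e-3):
    """Full-batch training with Adam and MSE loss; returns the loss history."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history
```

In practice the input windows would be built from the normalized series and the fitted model evaluated on the held-out testing set.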

3.3 Accuracy measures

Although there is no consensus on what forecast accuracy (or error) measures are most appropriate to evaluate intermittent demand forecasting methods, several measures are commonly used in the literature. While most studies adopt two or three such measures (Babai et al., 2014; Lolli et al., 2017), we elect to use five error measures, including the Mean Error (ME), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), the Root Mean Squared Error (RMSE), and the Mean Absolute Scaled Error (MASE). We use the scaled measures as they are appropriate for intermittent time series forecasting. The formulas to calculate these measures are given below:

$$ {\text{ME}} = \frac{{\mathop \sum \nolimits_{t = 1}^{n} \left( {Z_{t} - D_{t} } \right)}}{{\mathop \sum \nolimits_{t = 1}^{n} D_{t} }},\;{\text{MAE}} = \frac{{\mathop \sum \nolimits_{t = 1}^{n} \left| {Z_{t} - D_{t} } \right|}}{{\mathop \sum \nolimits_{t = 1}^{n} |D_{t} - \frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} D_{t} |}},\;{\text{RMSE}} = \frac{{\sqrt {\frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} \left( {Z_{t} - D_{t} } \right)^{2} } }}{{\frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} D_{t} }} $$
$$ {\text{MAPE}} = \frac{{\mathop \sum \nolimits_{t = 1}^{n} \left| {Z_{t} - D_{t} } \right|}}{{\mathop \sum \nolimits_{t = 1}^{n} D_{t} }},\;{\text{MASE}} = \frac{{\frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} \left| {Z_{t} - D_{t} } \right|}}{{\frac{1}{{n_{1} - 1}}\mathop \sum \nolimits_{t = 2}^{{n_{1} }} |D_{t} - D_{t - 1} |}} $$

where $D_t$ and $Z_t$ are the actual demand and the forecast, respectively; $n$ is the number of periods in the test sample and $n_1$ is the number of periods in the training sample. In addition to these overall forecast error measures, we also use the percentage of best forecasts (PB) with a particular error measure to compare the performance of various methods. PB has been used by Gutierrez et al. (2008), Mukhopadhyay et al. (2012), and more recently Doszyn (2019).
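As an illustration, the five scaled measures above can be computed for a single SKU as in the following sketch. The function name `error_measures` and its argument names are hypothetical; the sketch assumes that the test-period demands are not all zero or all equal and that the training series is not constant, so that no denominator is zero.

```python
from math import sqrt


def error_measures(actual, forecast, train):
    """Scaled error measures for one SKU.

    actual   -- test-period demands D_t
    forecast -- test-period forecasts Z_t
    train    -- training-period demands, used only to scale MASE
    """
    n = len(actual)
    total_demand = sum(actual)
    mean_demand = total_demand / n
    errors = [z - d for z, d in zip(forecast, actual)]
    abs_error_sum = sum(abs(e) for e in errors)

    # In-sample mean absolute one-step naive error (MASE denominator).
    naive_error = sum(
        abs(train[t] - train[t - 1]) for t in range(1, len(train))
    ) / (len(train) - 1)

    return {
        "ME": sum(errors) / total_demand,
        "MAE": abs_error_sum / sum(abs(d - mean_demand) for d in actual),
        "RMSE": sqrt(sum(e * e for e in errors) / n) / mean_demand,
        "MAPE": abs_error_sum / total_demand,
        "MASE": (abs_error_sum / n) / naive_error,
    }
```

For example, with test demands `[0, 2, 0, 4]`, a flat forecast of `1` in every period, and training demands `[0, 3, 0, 0, 2]`, the sketch gives ME = -1/3, MAE = 1, MAPE = 1, and MASE = 0.75.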

4 Results

We largely follow the approach used by Doszyn (2019) to report results. We first report results for all data combined. Then we delve into how the performance of various methods varies with data sparsity. Because of potential outliers in results, we calculate both mean and median performance measures. While most prior studies on intermittent demand forecasting report only mean performance, Doszyn (2019) focuses exclusively on the median performance level.

Table 3 provides a summary of mean and median performance metrics (ME, MAPE, MAE, RMSE, and MASE) with different methods across all 925 SKUs. Several observations can be made based on the results in Table 3. First, SBA performs consistently better than Croston, judged by both overall bias (ME) and the forecast error measures, with both means and medians. Second, Transformer is the best performer on almost all error measures. One exception is the median RMSE: its overall median RMSE of 3.100 is lower than that of Croston and SBA, but higher than that of FNN, RNN, and LSTM. Although Transformer also has the lowest ME based on the mean, it does not have the lowest ME based on the median. The lowest median ME is achieved by LSTM, although FNN, RNN, and SBA are not far behind. It is interesting to note that, except for the median ME with Transformer, all methods yield negative MEs, suggesting that the methods generally overforecast demand. However, the positive bias is much higher with Croston than with all other methods based on both mean and median MEs. Because the mean and median results are similar, we focus on the mean results in the following discussion.

Table 3 Overall results

To examine how data sparsity impacts forecasting performance, we divide our data into eight different buckets based on the percentage of zero demand in each time series as discussed in Sect. 3.1. Table 4 shows the results of mean ME at different levels of sparsity, which are also reflected in Fig. 5. We find that mean MEs are negative for almost all models and sparsity levels, the exceptions being SBA at sparsity levels 2 and 3 and Transformer at sparsity levels 3 and 5, again suggesting that, overall, the models overestimate demand. However, at lower levels of sparsity (sparsity levels 1–5), MEs are close to zero for all models, indicating relatively low forecasting bias. As the sparsity level increases, especially at high levels (from 6 to 8), the bias generally increases as well. We also find that while Transformer’s overall performance in ME is among the best across all sparsity levels, Croston is clearly the worst, especially at higher sparsity levels.
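The bucketing step can be sketched as follows. The function names and the bucket boundaries passed in `upper_edges` are hypothetical; the actual eight boundaries are those defined in Sect. 3.1 and are not reproduced here. The sketch assigns each series a 1-based sparsity level from its share of zero-demand periods and then averages a per-SKU error measure within each level.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean


def zero_share(series):
    """Fraction of periods with zero demand."""
    return sum(1 for d in series if d == 0) / len(series)


def sparsity_level(series, upper_edges):
    """1-based bucket index.

    upper_edges -- ascending, inclusive upper bounds of every bucket
                   except the last (hypothetical values, see lead-in).
    """
    return bisect_left(upper_edges, zero_share(series)) + 1


def mean_by_level(series_list, values, upper_edges):
    """Average a per-SKU error measure within each sparsity level."""
    groups = defaultdict(list)
    for series, value in zip(series_list, values):
        groups[sparsity_level(series, upper_edges)].append(value)
    return {level: mean(vals) for level, vals in sorted(groups.items())}
```

With eight levels, `upper_edges` would hold seven boundaries; the mean ME per level in Table 4 corresponds to calling `mean_by_level` with the per-SKU ME values of one method.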

Table 4 Mean ME results with sparsity
Fig. 5 Mean ME at different sparsity levels

Tables 5, 6, 7 and 8 report results of mean MAPE, MAE, RMSE, and MASE, respectively, with different sparsity levels. The corresponding Figs. 6, 7, 8 and 9 show graphically how the performance of different models varies with sparsity. In general, we observe an upward trend across all models with each error measure as sparsity increases. This suggests that the higher the sparsity level is, the worse forecast accuracy is for every model. In other words, it becomes more difficult to forecast accurately as sparsity gets higher. We also observe that at low levels of sparsity (5 and lower), the overall performance of all models is similar while at high levels of sparsity (6 and higher), there are noticeable differences in performance among these methods with Croston being the worst and Transformer either the best or close to the best.

Table 5 Mean MAPE results with sparsity
Table 6 Mean MAE results with sparsity
Table 7 Mean RMSE results with sparsity
Table 8 Mean MASE results with sparsity
Fig. 6 Mean MAPE at different sparsity levels

Fig. 7 Mean MAE at different sparsity levels

Fig. 8 Mean RMSE at different sparsity levels

Fig. 9 Mean MASE at different sparsity levels

The percentage best (PB) results are provided in Table 9 for all SKUs and in Table 10 for different sparsity levels. For each overall error measure (in the case of ME, we use the absolute value), we identify the best method for each SKU as the one with the lowest value of the error measure. The percentage best for a method is then calculated as the number of times the method is the best divided by the total number of SKUs in the whole sample (Table 9) or at a specific sparsity level (Table 10). The best PB is in bold for each error measure. As Doszyn (2019) discussed, the same lowest error measure could be associated with multiple methods, in which case the sum of percentage best across all methods does not equal one. However, this does not happen often in our results.
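The PB calculation can be sketched as follows. The function name `percentage_best` and the input layout (a mapping from method name to a list of per-SKU error values) are hypothetical. Ties are credited to every method that attains the lowest value, which is why the shares can sum to more than one.

```python
def percentage_best(errors_by_method, use_abs=False):
    """Share of SKUs for which each method attains the lowest error.

    errors_by_method -- dict: method name -> list of per-SKU error values,
                        with SKUs in the same order for every method
    use_abs          -- compare absolute values (used for ME)
    """
    methods = list(errors_by_method)
    n_skus = len(errors_by_method[methods[0]])
    wins = {m: 0 for m in methods}
    for i in range(n_skus):
        values = {
            m: abs(errors_by_method[m][i]) if use_abs else errors_by_method[m][i]
            for m in methods
        }
        best = min(values.values())
        # Every method tied at the minimum is counted as best.
        for m, v in values.items():
            if v == best:
                wins[m] += 1
    return {m: wins[m] / n_skus for m in methods}
```

For instance, with two methods whose errors over four SKUs are `[1, 2, 3, 1]` and `[2, 1, 3, 0.5]`, the third SKU is a tie, so the shares are 0.5 and 0.75 and sum to 1.25.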

Table 9 Percentage best results for all SKUs
Table 10 Percentage best results with different sparsity levels

The PB results in Table 9 show that Transformer is the best method in terms of MAPE, MAE, and MASE. Based on each of these measures, Transformer performs the best for nearly 50% of SKUs, which is much higher than that for any other method considered in this study. It is also the 2nd best approach with regard to RMSE with PB = 23.1%. The overall best method based on RMSE is RNN with PB = 24.6%. The best method based on |ME| is SBA (PB = 32.1%), while Transformer is tied for 3rd best (PB = 15.4%). Table 9 also shows that Croston, FNN, and LSTM are among the worst judging from PB across different error measures.

Considering different sparsity levels, Table 10 suggests that Transformer is the overall best method especially at higher sparsity levels. For example, at sparsity levels 5 and 6, PB for Transformer is the highest with respect to all error measures except for |ME|. At sparsity levels 7 and 8, Transformer significantly outperforms other methods on three error measures with dominant PB. Surprisingly, Transformer also performs very well at several lower levels of sparsity (levels 1 and 3). On the other hand, we find that SBA is the best performer at sparsity levels of 2 and 4 based on almost all error measures. In fact, SBA performs well at other sparsity levels as its PB is often the 2nd highest among all methods considered.

Overall, we find that Transformer performs consistently well judged by both the parametric error measures and the nonparametric percentage best measure. This finding differs from that of Doszyn (2019), who finds that the best methods according to parametric measures (i.e., error measures) are different from those according to the nonparametric measure (i.e., PB).

5 Conclusions

Intermittent demand forecasting is an important and challenging problem facing many organizations. As a result, there is an increasing research effort in searching for better forecasting systems for intermittent demand items. While most existing research has focused on traditional methods such as Croston’s method and its variants, machine learning based methods have been proposed for intermittent demand forecasting in recent years.

This study proposes the use of Transformer, a newly developed deep learning method, for intermittent demand forecasting. In addition to Transformer, we also consider five other methods (Croston, SBA, feedforward neural networks, recurrent neural networks, and long short-term memory networks). Our empirical setting is a dataset of 925 intermittent demand items from an airline service parts provider. Six forecasting performance measures are employed, including the mean error, the mean absolute percentage error, the mean absolute error, the root mean squared error, the mean absolute scaled error, and the percentage best. We also consider both mean and median results which are similar, suggesting that outliers are not a major concern in our results.

The results show that overall, Transformer performs very well relative to all other methods with different performance measures. We also find that at low sparsity levels, all methods perform similarly in terms of forecast error measures (ME, MAE, MAPE, RMSE, and MASE). As sparsity increases, it becomes more difficult to make accurate forecasts, resulting in higher error measures for all methods including Transformer. However, Transformer’s performance advantages over the Croston method, SBA, and various conventional neural network models (FNN, RNN, and LSTM) increase with the sparsity level. Using the measure of percentage best, we find that Transformer is once again the overall best performer, although SBA also performs well, especially at low sparsity levels.

Our findings confirm prior research showing that no forecasting method is universally best under all circumstances. However, one method could be better overall in some forecasting settings. This research shows the promise of Transformer in forecasting intermittent demand. One of the limitations of this research is that we have only analyzed 925 time series from the same industry. In addition, the data we used are classified as “intermittent” based on the Syntetos et al. (2005) classification scheme. Future research may extend this study by applying Transformer and other deep learning methods to different intermittent demand items in different industries and with different characteristics. Finally, this study shows how sparsity impacts the model performance. While sparsity is an important characteristic of intermittent demand, capturing the average intermittence in a time series, another important dimension is the variance of intermittence, which is a measure of how zero demand periods are dispersed. Studying how the variance of intermittence impacts the model performance could be an interesting future research direction.