1 Introduction

Sales forecasting is the basis of every stage of firm management planning. Effective sales forecasting can boost firm performance in inventory management, merchandise procurement, and sales management, thereby increasing profits and decreasing costs. Thus, to improve business management performance, firms must have appropriate sales forecasting models to effectively estimate the sales of all products within a specific future period [15].

For computer retailers, the accuracy of product sales forecasts is potentially more critical than in other industries. Rapid technological development and an accelerating rate of product innovation have intensified competition in the computer market and shortened product life cycles. Poor sales forecasting may lead firms to hold insufficient inventories or to overstock; such firms may then fail to satisfy customer needs, earn lower profits, and lose competitiveness. Consequently, constructing an effective sales forecasting model for computer products is a critical problem for computer retailers [5, 6].

Numerous studies have investigated sales forecasting in diverse industries such as fashion [7, 8], clothing [3, 9], food products [1], electronics [10, 11], and automobiles [12]. However, few studies have investigated sales forecasting for information technology or computer products. Lu et al. [4] used multivariate adaptive regression splines to construct sales forecasting models for computer wholesalers. Lu [5] combined a variable selection method and support vector regression (SVR) to construct a hybrid sales forecasting model for computer products. Lu and Shao [6] integrated ensemble empirical mode decomposition and an extreme learning machine (ELM) for forecasting computer product sales. Dai et al. [13] utilized three different independent component analysis algorithms and SVR for forecasting the sales of an information technology chain store.

Most existing studies on modeling the sales forecasting of computer products have directly used all training data to construct the forecasting model without considering the extent of relevance between the training data and the data to be forecasted (test data). In such cases, forecasting accuracy may be reduced because the training data may contain excessive data irrelevant to the test data, thereby increasing training errors. To reduce computational time and obtain promising forecasting performance, several recent studies have proposed using clustering algorithms to divide the whole forecasting dataset into multiple clusters with consistent data characteristics before constructing forecasting models [14–20]. However, these clustering-based models have mostly been applied to stock price prediction.

For example, Tay and Cao [14] integrated a self-organizing map (SOM) and a support vector machine (SVM) to construct a forecasting model comprising a two-stage network architecture. In the first stage, the input variable space was divided into multiple disjoint clusters through the SOM; in the second stage, an SVM was used to construct a forecasting model for each cluster. They employed a stock market index and five actual futures contracts as empirical data and showed that the integrated forecasting system exhibited performance superior to that of a single SVM model. Similarly, Cao [15] integrated an SOM with SVR to construct an expert forecasting system for predicting stock price indices; the results showed that the expert system presented favorable forecasting performance and a high convergence speed. Lai et al. [17] adopted K-means clustering to cluster stock price indices, subsequently analyzing the data in each cluster using a fuzzy decision tree to forecast stock prices. Their empirical results indicated that the hybrid method can yield better forecasting results. By combining SOM and SVR to construct forecasting models for predicting Taiwan stock price indices, Huang and Tsai [18] showed that the integrated model exhibited higher forecasting performance. In a study constructing a stock price forecasting model for the India Nifty index, Badge and Srivastava [19] used K-means clustering to group historical stock data into different clusters and applied an autoregressive integrated moving average (ARIMA) model to the most suitable cluster to construct a forecasting model. Their results indicated that a combined model using clustered data in the model training stage can improve forecasting accuracy.

Few studies have applied clustering-based forecasting models to sales forecasting [11, 16, 20–24], and no study has investigated clustering-based sales forecasting for computer retailing products. Hadavandi et al. [11] integrated genetic fuzzy systems and the K-means algorithm to construct a sales forecasting expert system for forecasting monthly printed circuit board sales. Their empirical results showed that the clustering-based forecasting method can generate better prediction results. Thomassey and Fiordaliso [16] proposed a clustering-based sales forecasting system combining the K-means algorithm and the C4.5 decision tree algorithm for new items of textile distributors. They utilized 285 real sales items from a French textile distributor as empirical data and showed that the proposed forecasting system outperformed five competing models. Kumar and Patel [20] used Fisher's clustering method to develop a hybrid sales forecasting method for retail merchandising; the results showed that their model produced significantly better sales forecasting results than an individual forecasting model without clustering. Chang and Lai [21] combined an SOM with case-based reasoning (CBR) to forecast the sales amounts of new books, showing that the integrated model yielded more accurate forecasting results than CBR alone. Chang et al. [22] constructed a monthly sales forecasting model for printed circuit boards in Taiwan; by integrating K-means clustering with a fuzzy neural network (FNN), they developed a KFNN hybrid forecasting model that exhibited higher forecasting accuracy than four other forecasting models. Lu and Wang [23] integrated independent component analysis, a growing hierarchical self-organizing map (GHSOM), and SVR to construct a sales forecasting model for computer wholesalers. Their experimental results indicated that the integrated model accurately forecasted the sales of computer wholesalers. Murlidha et al. [24] applied a standard hierarchical agglomerative clustering algorithm with a new sales-pattern distance between two sales series to propose a clustering-based forecasting model for retailers' product demand. They demonstrated that the proposed clustering-based sales forecasting model can generate the best prediction performance.

In the present study, a clustering-based forecasting model integrating clustering and machine-learning techniques is proposed for predicting computer product sales. Because computer manufacturers periodically launch new products or remodel existing merchandise to keep pace with technological advancement, computer retailers must time their marketing campaigns accordingly to implement their annual sales plans. As a result, the sales data of computer products exhibit similar data patterns or features at different time periods, and a clustering-based forecasting model is therefore expected to be effective for computer product sales. Three clustering techniques, namely the SOM, GHSOM, and K-means algorithms, and two machine-learning techniques, namely SVR and ELM, are used in this study. SOM, GHSOM, and K-means are commonly adopted clustering methods in previous studies [25]. The SOM has demonstrated compelling performance in analyzing high-dimensional input data and visualizing data comprehensibly, whereas the GHSOM, with its dynamic topology, can reveal hierarchical relationships in the input data and thus elicit insights into the clusters underlying large high-dimensional datasets. K-means, on the other hand, is an efficient and widely used clustering algorithm in a variety of data mining applications; nevertheless, it tends to terminate its iterations prematurely and settle on a locally optimal solution.

Regarding forecasting techniques, SVR is an effective machine-learning algorithm [26, 27]. It is derived from the structural risk minimization principle, estimates a function by minimizing an upper bound of the generalization error, and has received increasing attention for solving nonlinear regression estimation problems. The ELM is a learning algorithm for single-hidden-layer feed-forward networks. It provides enhanced generalization performance with faster learning speeds and avoids many problems faced by traditional neural network algorithms, such as the stopping criterion, learning rate, number of epochs, local minima, and over-tuning [28]. These two prediction methods have been widely applied to various forecasting problems [29–35] and to sales forecasting problems [4–6, 36–38].

In the proposed scheme, a clustering technique is first employed to divide the training data into multiple small training datasets (i.e., clusters) possessing similar data features or patterns before a machine-learning technique is used to train the forecasting models. After the cluster containing data patterns most similar to those of the test data is identified using the average linkage method, the forecasting model trained on this cluster is applied for sales forecasting. We combine three clustering techniques (SOM, GHSOM, and K-means) and two machine-learning techniques (SVR and ELM) to construct six clustering-based forecasting models, namely SOM-SVR, SOM-ELM, GHSOM-SVR, GHSOM-ELM, K-SVR, and K-ELM. Empirical retail data for notebook computers (NBs), personal computers (PCs), and liquid crystal displays (LCDs) from three computer retailers are collected and used as numerical examples because of these products' dominance in the computer retailing market: they are generally the three highest-priced products and the most crucial stock keeping units of computer retailers. The NB, PC, and LCD datasets are employed to evaluate the forecasting performance of the six clustering-based forecasting models and of two single machine-learning techniques (single SVR and single ELM) that do not use a clustering algorithm to partition the training data. The forecasting accuracy of the six clustering-based schemes, single SVR, and single ELM is compared to determine whether the clustering-based forecasting models outperform the single machine-learning techniques and which of the six clustering-based models is the most appropriate scheme for computer retailing sales forecasting.

The rest of this paper is organized as follows. Section 2 briefly introduces the SOM, GHSOM, K-means, SVR, and ELM algorithms. The proposed clustering-based sales forecasting model is described in Sect. 3. Section 4 presents the experimental results for the sales data of three computer products. The paper is concluded in Sect. 5.

2 Research methodology

2.1 SOM

The SOM algorithm is a type of artificial neural network with unsupervised learning and can be described as a nonlinear, ordered, smooth mapping of high-dimensional input data onto a one- or two-dimensional display [39]. The fundamental principle of an SOM is to identify similar features, rules, or relations among unlabeled samples and to group samples with similar patterns into the same category. The SOM works through competitive learning, in which neurons of the output layer compete for activation opportunities. Unlike general competitive learning neurons, the SOM relies on the principle of "reciprocity" competition among neighboring neurons.

The SOM network comprises a set of units i deployed in a 2D grid, each with a weight vector m_i that is normally initialized randomly. A typical SOM architecture, composed of input and output layers, allows lateral interaction between neurons, which activate and inhibit one another. In each training step, the neuron whose weight vector has the minimum Euclidean distance from the input vector x is the winning neuron, denoted W, as shown in the following [39]:

$$W(t) = \arg \, \mathop {\hbox{min} }\limits_{i} \left\{ {\left\| {x(t) - m_{i} (t)} \right\|} \right\}$$
(1)

The weight vectors of the winning neuron and its nearby neurons are incrementally adapted toward the input vector by a certain fraction of the Euclidean distance, governed by a time-decreasing learning rate (α). Adaptation here means a gradual reduction in the difference between the input pattern and the model vector, as shown in Eq. (2).

$$m_{i} (t + 1) = m_{i} (t) + \alpha (t) \cdot h_{Wi} (t) \cdot [x(t) - m_{i} (t)],$$
(2)

where t represents the current training iteration and x denotes the input vector.

The amount of movement is controlled by the learning rate α. The principle for adjusting α is to make substantial adjustments in the initial learning stage of the network and to decrease α gradually as learning proceeds. The neurons near the winning neuron are weighted using the neighborhood kernel h_Wi, which depends on the distance in the output space between neuron i and the winning neuron W of the current cycle. The neighborhood kernel is limited to a scalar between zero and one, ensuring that nearby units are adjusted more strongly than remote units. A Gaussian function is commonly used, as shown in Eq. (3); it is a simple function that decreases as the distance to the winning neuron increases [39].

$$h_{Wi} = \exp \left( { - \frac{{\left\| {r_{W} - r_{i} } \right\|^{2} }}{{2 \cdot \delta (t)^{2} }}} \right)$$
(3)

where \(\left\| {r_{W} - r_{i} } \right\|^{2}\) represents the squared distance between W and i in the output space, r_i is the two-dimensional position vector of unit i in the grid, and the time-varying δ(t) is the neighborhood range. This learning procedure maps similar patterns into neighboring regions while keeping dissimilar patterns apart.
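
To make the above procedure concrete, the following is a minimal numpy sketch of the SOM competition and adaptation steps in Eqs. (1)–(3); the grid size, iteration count, and the linear decay schedules for α and δ are illustrative choices rather than settings from this study.

```python
# Minimal SOM sketch (assumed illustrative parameters, not the paper's settings).
import numpy as np

def train_som(X, rows=3, cols=3, n_iter=1000, alpha0=0.5, delta0=1.5, seed=0):
    rng = np.random.default_rng(seed)
    n_units, dim = rows * cols, X.shape[1]
    m = rng.random((n_units, dim))                     # weight vectors m_i, randomly initialized
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)  # unit positions r_i

    for t in range(n_iter):
        x = X[rng.integers(len(X))]                    # present one input vector x(t)
        W = np.argmin(np.linalg.norm(x - m, axis=1))   # winning neuron, Eq. (1)
        alpha = alpha0 * (1 - t / n_iter)              # time-decreasing learning rate
        delta = delta0 * (1 - t / n_iter) + 1e-3       # shrinking neighborhood range
        h = np.exp(-np.sum((grid[W] - grid) ** 2, axis=1) / (2 * delta ** 2))  # kernel h_Wi, Eq. (3)
        m += alpha * h[:, None] * (x - m)              # weight adaptation, Eq. (2)
    return m

# Example: map 100 illustrative lag vectors of length 6 onto the grid units
X = np.random.rand(100, 6)
weights = train_som(X)
labels = np.argmin(np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2), axis=1)
```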

2.2 GHSOM

A GHSOM is a hierarchical deployment of SOMs of various sizes that allows the size and dimensionality of its maps to grow incrementally during training to adapt to the training dataset according to the defined parameters. A GHSOM comprises multiple layers in a hierarchical architecture. In contrast to adding rows or columns within a single SOM structure, each new layer of a GHSOM inserts independent SOMs that map in greater detail the patterns represented by specific neurons of the layer above. A GHSOM thus grows in two orientations, controlled by two parameters, τ_1 and τ_2: the former determines the growth of a map, whereas the latter governs the hierarchical growth of the GHSOM [40]. The training and growing process mainly depends on the quantization error (QE) of a neuron, an index of the error occurring when data are mapped onto that neuron. Note that the larger the QE, the higher the heterogeneity of the data cluster.

In the basic GHSOM algorithm, the topmost layer (layer 0) contains a sole neuron that represents the mean of all input samples [41, 42]. The mean quantization error (MQE), referred to as a measurement of the deviation of the samples in the input space, can be obtained by Eq. (4)

$${\text{MQE}}_{0} = \frac{1}{\varOmega (X)} \cdot \sum\limits_{{x_{j} \in X}} {\left\| {m_{0} - x_{j} } \right\|}$$
(4)

where X is the set of all input samples, m_0 is the sole model vector of layer 0, and Ω(X) indicates the number of samples. Following the SOM learning algorithm, offspring layers are hierarchically created below the ancestor layer after a predetermined number of iterations, and the mean quantization error of each unit is then defined by Eq. (5)

$${\text{MQE}}_{i} = \frac{1}{{\varOmega (S_{i} )}} \cdot \sum\limits_{{x_{j} \in S_{i} }} {\left\| {m_{i} - x_{j} } \right\|}$$
(5)

where S_i is the subset of samples mapped onto unit i.

Because the MQE measures the dissimilarity between the input vectors and a specific unit, high MQE values indicate that the input space is not yet correctly clustered. The unit possessing the highest quantization error is selected as the error unit υ, as shown in Eq. (6). A new column or row is then inserted between the error unit υ and its most dissimilar neighbor d, and the learning rate and neighborhood range are reset.

$$\upsilon = \arg \mathop {\hbox{max} }\limits_{i} \left( {\sum\limits_{{x_{j} \in S_{i} }} {\left\| {m_{i} - x_{j} } \right\|} } \right)$$
(6)

The growing process continues until MQE_m (i.e., the mean of all MQE_i values) falls below the fraction τ_1 of MQE_u (i.e., the MQE of the corresponding unit u in the upper layer), as shown in Eq. (7).

$${\text{MQE}}_{m} < \tau_{1} \cdot {\rm MQE}_{u}$$
(7)

Note that the smaller τ_1 is and the longer the training time is, the larger the resulting map becomes. If the units of a completely trained map exhibit low similarity, a map is created in the next layer. The threshold parameter for the similarity between units is τ_2, and Eq. (8), in which MQE_0 is the layer-0 MQE of Eq. (4), serves as the termination criterion that halts the growing process. If unit i satisfies the condition of Eq. (8), expansion into the next layer is not required; otherwise, a new map grows in the next layer.

$${\text{MQE}}_{i} < \tau_{2} \cdot {\text{MQE}}_{0}$$
(8)

It follows that the smaller τ_2 is, the more easily the units expand into the next layer and the deeper the hierarchical architecture of the GHSOM becomes.
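
The following is a simplified sketch of the two growth checks in Eqs. (4)–(8), assuming that each trained unit already has a weight vector m_i and an assigned sample set S_i; the actual row/column insertion, child-map training, and the τ_1 and τ_2 values are omitted or illustrative.

```python
# Simplified GHSOM growth-check sketch; the full algorithm (map retraining,
# row/column insertion) is omitted for brevity, and all values are illustrative.
import numpy as np

def mqe(m_i, S_i):
    """Mean quantization error of one unit, Eq. (5)."""
    return np.mean(np.linalg.norm(S_i - m_i, axis=1))

def needs_horizontal_growth(units, mqe_u, tau1):
    """Eq. (7): keep growing the map while the mean MQE is >= tau1 * MQE of the parent unit."""
    mqe_m = np.mean([mqe(m, S) for m, S in units])
    return mqe_m >= tau1 * mqe_u

def units_needing_expansion(units, mqe_0, tau2):
    """Eq. (8): units whose MQE is still >= tau2 * MQE_0 must spawn a child map."""
    return [idx for idx, (m, S) in enumerate(units) if mqe(m, S) >= tau2 * mqe_0]

# Toy example: two units with their assigned samples; the parent unit here is the layer-0 unit.
X = np.random.rand(40, 6)
mqe_0 = np.mean(np.linalg.norm(X - X.mean(axis=0), axis=1))     # layer-0 MQE, Eq. (4)
units = [(X[:20].mean(axis=0), X[:20]), (X[20:].mean(axis=0), X[20:])]
print(needs_horizontal_growth(units, mqe_0, tau1=0.5))
print(units_needing_expansion(units, mqe_0, tau2=0.05))
```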

2.3 K-means

K-means is one of the simplest and most efficient clustering algorithms [25]. The main idea of K-means clustering is to divide a dataset into k mutually exclusive clusters and to assign each sample to the cluster whose center is nearest to it, based on minimizing the squared-error criterion function [43].

Initially, the k cluster centers are randomly designated among the input samples. Then, a series of local search steps is conducted to minimize the squared error between sample points and cluster centers, that is, to minimize E in Eq. (9)

$$E = \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{k} {\omega_{ij} \left\| {x_{i} - \theta_{j} } \right\|^{2} } }$$
(9)

where n is the number of data samples, k is the predetermined number of clusters, x_i is the ith sample point, θ_j is the center of cluster j, and ω_ij is the affiliation element specifying the cluster membership of x_i, given by

$$\omega_{ij} = \left\{ {\begin{array}{ll} {1,\quad {\text{if}}\,\left\| {x_{i} - \theta_{j} } \right\| \le \left\| {x_{i} - \theta_{m} } \right\|, \, \forall \, m \ne j} \\ {0,\quad {\text{otherwise}}} \\ \end{array} } \right.$$
(10)

\({\text{subject to}}\;\sum\nolimits_{j = 1}^{k} {\omega_{ij} = 1} ,\quad i = 1,2, \ldots ,n\quad {\text{and}}\quad \sum\nolimits_{i = 1}^{n} {\sum\nolimits_{j = 1}^{k} {\omega_{ij} = n} }\).
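
A minimal numpy sketch of this procedure (random initial centers, the hard assignments of Eq. (10), and center updates until the assignments stabilize) is given below; the data and the value of k are illustrative.

```python
# Minimal K-means sketch (Lloyd's iterations); parameters and data are illustrative.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centers theta_j
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # omega_ij = 1 for the nearest center, Eq. (10)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                             # assignments stable: stop
        labels = new_labels
        for j in range(k):                                    # recompute each center
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.random.rand(100, 6)
centers, labels = kmeans(X, k=4)
```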

2.4 SVR

Support vector regression (SVR) is a machine-learning algorithm based on the principle of structural risk minimization. The concept of SVR involves converting low-dimensional nonlinear regression problems into high-dimensional linear regression problems. The basic function of SVR can be expressed as follows:

$$f\left( x \right) = \left( {w \cdot \phi \left( x \right)} \right) + b$$
(11)

where w is the weight vector, b is the bias, and ϕ(x) is a nonlinear mapping that transforms the input into a high-dimensional feature space in which the regression becomes linear. Traditional regression obtains the coefficients by minimizing the squared error, which can be considered the empirical risk based on a loss function. Vapnik [27] introduced the so-called ε-insensitive loss function into SVR, which can be expressed as:

$$L_{\varepsilon } (f(x) - y) = \left\{ {\begin{array}{ll} {\left| {f(x) - y} \right| - \varepsilon } & {{\text{if }}\left| {f(x) - y} \right| \ge \varepsilon } \\ 0 & {\text{otherwise}} \\ \end{array} } \right.$$
(12)

where y is the target output and ε defines the ε-insensitive band; when the predicted value falls within the band, the loss is zero. Conversely, if the predicted value falls outside the band, the loss equals the amount by which the absolute prediction error exceeds ε, that is, the distance from the predicted value to the band boundary.

Considering the empirical risk and the structural risk simultaneously, the SVR model is constructed by minimizing the following program:

$$\begin{aligned} & {\text{Min}}:\;\frac{1}{2}w^{T} w + C\sum\limits_{i} {\left( {\xi_{i} + \xi_{i}^{*} } \right)} \\ & {\text{Subject to}}\;\left\{ {\begin{array}{ll} {y_{i} - w^{T} \phi (x_{i} ) - b \le \varepsilon + \xi_{i} } \\ {w^{T} \phi (x_{i} ) + b - y_{i} \le \varepsilon + \xi_{i}^{*} } \\ {\xi_{i} ,\quad \xi_{i}^{*} \ge 0} \\ \end{array} } \right. \\ \end{aligned}$$
(13)

where i = 1, 2, …, n indexes the training data; \(\sum\nolimits_{i} (\xi_{i} + \xi_{i}^{*})\) is the empirical risk; \(\frac{1}{2}w^{T} w\) is the structural risk, which prevents over-learning and a lack of generalizability; and C is the regularization coefficient representing the trade-off between the empirical risk and the structural risk. Equation (13) is a quadratic programming problem. After a proper regularization coefficient (C), band width (ε), and kernel function (K) are selected, the optimum of each parameter can be obtained through the Lagrangian function. The general form of the SVR-based regression function can be written as [27]

$$f(x,w) = f(x,\alpha ,\alpha^{*} ) = \sum\limits_{i = 1}^{N} {(\alpha_{i} - \alpha_{i}^{*} )K(x,x_{i} ) + b} ,$$
(14)

where α_i and α_i* are Lagrangian multipliers satisfying α_i · α_i* = 0, and K(x, x_i) is the kernel function. Any function that meets Mercer's condition can be used as the kernel function.

Although several choices for the kernel function are available, the most widely used kernel function is the radial basis function (RBF), defined as [44] \(K(x_{i} ,x_{j} ) = \exp \left( {\frac{{ - \left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\sigma^{2} }}} \right)\), where σ denotes the width of the RBF. Thus, the RBF is applied as the kernel function in this study.
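
As an illustration, the following sketch fits an ε-insensitive SVR with an RBF kernel using scikit-learn, which is one possible implementation of Eqs. (11)–(14); note that scikit-learn parameterizes the RBF width through gamma (here mapped as gamma = 1/(2σ²)), and the data and parameter values are illustrative.

```python
# Hedged SVR sketch using scikit-learn; data and (C, epsilon, sigma) values are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.random((80, 6))                  # 6 lagged sales values per sample (illustrative)
y_train = rng.random(80)                       # target sales (illustrative)

sigma = 2.0 ** 3
model = SVR(kernel="rbf", C=2.0 ** 5, epsilon=2.0 ** -5,
            gamma=1.0 / (2 * sigma ** 2))      # scikit-learn's gamma corresponds to 1/(2*sigma^2)
model.fit(X_train, y_train)
y_pred = model.predict(rng.random((10, 6)))    # one-step-ahead predictions for new lag vectors
```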

2.5 ELM

The extreme learning machine (ELM), proposed by Huang et al. [28], is a learning method for single-hidden-layer feed-forward neural networks (SLFNs). The ELM is a simple, fast, and efficient method in which the input weights of the SLFN are assigned randomly; that is, the parameters of the hidden-layer nodes are selected at random. Once the hidden-node parameters are chosen, the SLFN becomes a linear system whose output weights can be analytically determined through a simple generalized inverse operation on the hidden-layer output matrix.

Consider N arbitrary distinct samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, …, x_{in}]^T ∈ R^n and t_i = [t_{i1}, t_{i2}, …, t_{im}]^T ∈ R^m. An SLFN with \(\tilde{N}\) hidden neurons and activation function g(x) can approximate these N samples with zero error. This means that

$${\mathbf{H}}\beta = T$$
(15)

where

$$H(w_{1} , \ldots ,w_{{\tilde{N}}} ,b_{1} , \ldots ,b_{{\tilde{N}}} ,x_{1} , \ldots ,x_{N} ) = \left[ {\begin{array}{*{20}c} {g(w_{1} \cdot x_{1} + b_{1} )} & \cdots & {g(w_{{\tilde{N}}} \cdot x_{1} + b_{{\tilde{N}}} )} \\ \vdots & \ddots & \vdots \\ {g(w_{1} \cdot x_{N} + b_{1} )} & \cdots & {g(w_{{\tilde{N}}} \cdot x_{N} + b_{{\tilde{N}}} )} \\ \end{array} } \right]_{{N \times \tilde{N}}};$$
$$\beta_{{\tilde{N} \times m}} = (\beta_{1}^{T} , \ldots ,\beta_{{\tilde{N}}}^{T} )^{T} ;\quad T_{N \times m} = (T_{1}^{T} , \ldots ,T_{N}^{T} )^{T}$$

where w_i = [w_{i1}, w_{i2}, …, w_{in}]^T, i = 1, 2, …, \(\tilde{N}\), is the weight vector connecting the ith hidden node to the input nodes, β_i = [β_{i1}, β_{i2}, …, β_{im}]^T is the weight vector connecting the ith hidden node to the output nodes, b_i is the threshold of the ith hidden node, and w_i · x_j denotes the inner product of w_i and x_j. \({\mathbf{H}}\) is called the hidden-layer output matrix of the network; the ith column of \({\mathbf{H}}\) is the output of the ith hidden node with respect to the inputs x_1, x_2, …, x_N.

Thus, the determination of the output weights (linking the hidden layer to the output layer) is as simple as finding the least-square solution to the given linear system. The minimum norm least-square (LS) solution to the linear system (i.e., Eq. 15) is [28]

$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{\beta } = H^{\varPsi } T$$
(16)

where \({\mathbf{H}}^{\varPsi }\) is the Moore–Penrose generalized inverse of matrix \({\mathbf{H}}\). The minimum norm LS solution is unique and has the smallest norm among all the LS solutions.

The first step of the ELM algorithm is to randomly assign the input weights w_i and biases b_i; then, the hidden-layer output matrix \({\mathbf{H}}\) is calculated; finally, the output weight β is computed as \(\hat{\beta } = {\mathbf{H}}^{\varPsi } T\), where T = (t_1, …, t_N)^T. For details of the ELM algorithm, see Huang et al. [28].
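
A minimal numpy sketch of this three-step procedure (random input weights and biases, hidden-layer output matrix H, and the pseudoinverse solution of Eq. (16)) is given below; the sigmoid activation, node count, and data are illustrative assumptions.

```python
# Minimal ELM sketch; activation, node count, and data are illustrative.
import numpy as np

def elm_fit(X, T, n_hidden=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, (n_hidden, X.shape[1]))      # random input weights w_i
    b = rng.uniform(-1, 1, n_hidden)                     # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ w.T + b)))             # hidden-layer output matrix H (sigmoid)
    beta = np.linalg.pinv(H) @ T                         # output weights via Moore-Penrose inverse, Eq. (16)
    return w, b, beta

def elm_predict(X, w, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ w.T + b)))
    return H @ beta

rng = np.random.default_rng(1)
X_train, y_train = rng.random((80, 6)), rng.random(80)   # illustrative lag vectors and targets
w, b, beta = elm_fit(X_train, y_train, n_hidden=7)
y_pred = elm_predict(rng.random((10, 6)), w, b, beta)
```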

3 Proposed clustering-based sales forecasting scheme

This study combines clustering algorithms and machine-learning techniques to propose a clustering-based forecasting model for computer retailing sales forecasting. The research scheme of the proposed methodology is presented in Fig. 1; it consists of two phases: training and testing.

Fig. 1 Proposed clustering-based sales forecasting scheme

In the training phase, the purpose is to divide the overall training data, which have complex data characteristics, into multiple small training datasets (clusters) with consistent data characteristics, and to train an individual forecasting model for each cluster. The detailed procedure of the training phase can be summarized in the following steps:

  1. First, historical sales data with a time length t are collected as the training data X = [x_i], i = 1, 2, …, t. Because historical sales data are favorable forecasting variables for sales data [4, 5], an appropriate amount of historical data (i.e., a window) is used as the forecasting variables. Subsequently, a moving window method is applied to construct a forecasting variable matrix with dimensions L × q, q = t − L, on the data X under a predetermined window length L (i.e., L forecasting variables), as follows (see the sketch after this list):

    $${\mathbf{X}}_{F} = [f_{1} ,f_{2} , \ldots ,f_{q} ] = \left[ {\begin{array}{*{20}c} {x_{L} } & {x_{L + 1} } & \cdots & {x_{t - 1} } \\ \vdots & \vdots & \ddots & \vdots \\ {x_{2} } & {x_{3} } & \cdots & {x_{t - L + 1} } \\ {x_{1} } & {x_{2} } & \cdots & {x_{t - L} } \\ \end{array} } \right]$$

    The corresponding target variable Y = [y_1, y_2, …, y_q] = [x_{L+1}, x_{L+2}, …, x_t] has dimensions 1 × q. For example, when t = 100 and L = 3, then q = 100 − 3 = 97, \({\mathbf{X}}_{F} = [f_{1} ,f_{2} , \ldots ,f_{97} ] = \left[ {\begin{array}{*{20}c} {x_{3} } & {x_{4} } & \cdots & {x_{99} } \\ {x_{2} } & {x_{3} } & \cdots & {x_{98} } \\ {x_{1} } & {x_{2} } & \cdots & {x_{97} } \\ \end{array} } \right],\) and the corresponding target variable is Y = [y_1, y_2, …, y_97] = [x_4, x_5, …, x_100].

  2. Then, in order to divide the whole forecasting data \({\mathbf{X}}_{F}\) into N clusters possessing consistent data characteristics, a clustering technique is used to partition the f_i, i = 1, 2, …, q, into N clusters. Three clustering algorithms, namely K-means, SOM, and GHSOM, are considered in this study.

  3. Finally, the machine-learning technique is employed to train a forecasting model on the training data of each cluster. With N clusters, N forecasting models are trained. The machine-learning techniques considered in this phase are SVR and ELM.
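
The following is a minimal sketch of step 1 above: it builds the L × q forecasting variable matrix X_F and the corresponding target vector Y from a sales series using a moving window of length L; the series used here is illustrative.

```python
# Moving-window construction of X_F and Y (step 1 above); the sales series is illustrative.
import numpy as np

def build_windows(x, L):
    x = np.asarray(x, dtype=float)
    q = len(x) - L
    # column j holds the window x[j], ..., x[j+L-1] in reverse order (most recent lag on top)
    X_F = np.column_stack([x[j:j + L][::-1] for j in range(q)])
    Y = x[L:]                                  # target y_j = x_{j+L}
    return X_F, Y

sales = np.arange(1, 101, dtype=float)         # x_1, ..., x_100 (illustrative)
X_F, Y = build_windows(sales, L=3)             # X_F is 3 x 97, Y has length 97
print(X_F[:, 0], Y[0])                         # [3. 2. 1.] 4.0 -- matches the example in step 1
```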

The estimation accuracy of SVR and ELM may depend strongly on the choice of parameters; however, there are no general rules for setting the parameters of SVR and ELM. For modeling SVR, the grid search proposed by Lin et al. [45] is a common and straightforward method that uses exponentially growing sequences of C and ε to identify good parameters. The parameter set (C, ε, σ) that generates the minimum forecasting root mean square error (RMSE), defined in Eq. (17), is considered the best parameter set. In this study, the grid search is applied in each cluster to determine the best parameter set for training the SVR forecasting model.

$${\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {(y_{i} - e_{i} )^{2} } }}{n}}$$
(17)

where \(y_{i}\) and \(e_{i}\) represent the actual and predicted values at period i, respectively, and n is the total number of data points.

As discussed in Sect. 2.5, the most important and critical ELM parameter is the number of hidden nodes, and the ELM tends to be unstable in single-run forecasting [28]. Therefore, ELM models with the number of hidden nodes varying from 1 to 30 are constructed. For each number of nodes, the ELM model is trained 10 times and the average RMSE is calculated. The number of hidden nodes that gives the smallest average RMSE is selected as the best ELM parameter.
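
A hedged sketch of the parameter search described above is given below: it performs the exponential grid search over (C, ε, σ) for SVR using scikit-learn and RMSE on a hold-out validation split; the grid bounds and the split are illustrative, and σ is mapped to scikit-learn's gamma = 1/(2σ²). The ELM hidden-node sweep can be implemented analogously by looping over 1–30 nodes and averaging the RMSE of 10 repeated fits.

```python
# Exponential grid search over (C, epsilon, sigma) for SVR; grid bounds, data, and the
# validation split are illustrative assumptions, not the exact settings of the study.
import numpy as np
from sklearn.svm import SVR

def rmse(y, e):
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(e)) ** 2))   # Eq. (17)

def grid_search_svr(X_tr, y_tr, X_val, y_val, exponents=range(-15, 16, 2)):
    best = (None, np.inf)
    for c_exp in exponents:
        for e_exp in exponents:
            for s_exp in exponents:
                sigma = 2.0 ** s_exp
                model = SVR(kernel="rbf", C=2.0 ** c_exp, epsilon=2.0 ** e_exp,
                            gamma=1.0 / (2 * sigma ** 2))
                model.fit(X_tr, y_tr)
                err = rmse(y_val, model.predict(X_val))
                if err < best[1]:
                    best = ((2.0 ** c_exp, 2.0 ** e_exp, sigma), err)
    return best   # ((C, epsilon, sigma), validation RMSE)

rng = np.random.default_rng(0)
X = rng.random((80, 6))
y = rng.random(80)
best_params, best_rmse = grid_search_svr(X[:60], y[:60], X[60:], y[60:],
                                         exponents=range(-5, 6, 2))  # coarse grid for illustration
```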

After the forecasting models are trained on the clusters of training data, the testing phase identifies the cluster whose data patterns are most similar to those of the test data, and the forecasting model trained on that cluster is adopted to yield the sales forecast. The detailed steps of the testing phase are as follows:

  1. If the sales data at time t (y_t′) are the forecast target, the sales data from time t − 1 to t − L are used as the corresponding forecasting variable data P = [y_i′], i = 1, 2, …, L, where L is the number of forecasting variables.

  2. Then, the average linkage method based on Euclidean distance is applied to measure the similarity between the test data and every cluster. That is, the Euclidean distances d_i, i = 1, 2, …, N, between the forecasting variable data P and the center of each cluster are computed, where d_i represents the distance between the test data and cluster i.

  3. The cluster with the minimal Euclidean distance d_i has the data features or patterns most similar to those of the test data. It is called cluster B, B = arg min_i(d_i). The forecasting model trained on cluster B is the most suitable model for predicting the test data.

  4. The predicted value of the test data is obtained using the trained forecasting model corresponding to cluster B, with the best parameter set determined in the training phase. A minimal sketch of this testing procedure is given after this list.
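
The following is a minimal sketch of the testing procedure above: it computes the Euclidean distance from the test lag vector P to each cluster center, selects the nearest cluster B, and forecasts with that cluster's model; the cluster centers and per-cluster models (assumed to expose a scikit-learn-style predict method) are taken to be outputs of the training phase, and all names are illustrative.

```python
# Testing-phase sketch: nearest-cluster selection followed by one-step-ahead prediction.
# cluster_centers and cluster_models are assumed to come from the training phase.
import numpy as np

def forecast_one_step(P, cluster_centers, cluster_models):
    P = np.asarray(P, dtype=float)
    d = np.linalg.norm(cluster_centers - P, axis=1)   # Euclidean distances d_i to each cluster
    B = int(np.argmin(d))                             # B = arg min_i(d_i)
    y_hat = cluster_models[B].predict(P.reshape(1, -1))[0]   # model trained on cluster B
    return y_hat, B
```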

4 Empirical study

4.1 Empirical data and performance evaluation criteria

This study constructs six clustering-based forecasting models, namely SOM-SVR, SOM-ELM, GHSOM-SVR, GHSOM-ELM, K-SVR, and K-ELM. Because the biweekly sales amount is more practical than daily or weekly sales amounts for the sales and inventory management of computer retailers, the biweekly sales data for the PC, NB, and LCD products of three computer retailers were collected and used as illustrative examples. The research data comprise 124 biweekly sales data points from January 2005 to September 2009.

Figures 2, 3, and 4 show the sales trends for the PCs, NBs, and LCDs, respectively. From the figures, it can be observed that the sales data of each computer product exhibit similar data features at different time periods. However, the structure of the sales data differed across the three examined products. First, the PC sales data revealed stable sales performance from 2005 to 2006 (the first 52 sample points). Thereafter, because of competition with substitute products (e.g., tablets and NBs) and changes to retailers' sales strategies, PC sales fluctuated drastically, generating a sales trend distinct from that before 2006. Consequently, a low level of similarity in the data structure of PCs at different periods was observed. Unlike the PC product, the NB and LCD products exhibited an obvious periodic sales trend and similar data patterns at different time points. Furthermore, compared with LCDs, the NB products were associated with a more apparent and stable data structure because changes in their product specifications as well as their demand and sales characteristics are relatively constant.

Fig. 2 PC sales amounts

Fig. 3 NB sales amounts

Fig. 4 LCD sales amounts

The sales amounts of the previous six periods (i.e., t − 1, t − 2, …, t − 6) are used as the six forecasting variables. Moreover, the first 88 data points (71 % of the total sample) are used as the training sample, while the remaining 36 data points (29 % of the total sample) are held out and used as the testing sample for out-of-sample forecasting. The moving (or rolling) window technique is used to construct the training and testing data, and all eight forecasting schemes are used for one-step-ahead forecasting of the biweekly sales data.

Regarding the criteria for evaluating forecasting performance, the mean absolute percentage error (MAPE) and root mean square percentage error (RMSPE) are used to evaluate forecasting accuracy. A smaller value indicates that the forecast values are closer to the actual values. These criteria are defined as follows:

$${\text{MAPE}} = \frac{{\sum\nolimits_{i = 1}^{n} {\left| {\frac{{y_{i} - e_{i} }}{{y_{i} }}} \right|} }}{n}$$
$${\text{RMSPE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {\frac{{y_{i} - e_{i} }}{{y_{i} }}} \right)^{2} } }}{n}}$$

where y_i and e_i represent the actual and predicted values at period i, respectively, and n is the total number of data points.
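
For reference, a minimal sketch of the two criteria is given below; the actual and predicted series are illustrative.

```python
# MAPE and RMSPE as defined above; the series are illustrative values only.
import numpy as np

def mape(y, e):
    y, e = np.asarray(y, float), np.asarray(e, float)
    return np.mean(np.abs((y - e) / y))

def rmspe(y, e):
    y, e = np.asarray(y, float), np.asarray(e, float)
    return np.sqrt(np.mean(((y - e) / y) ** 2))

y_actual = np.array([120.0, 95.0, 130.0])
y_pred = np.array([110.0, 100.0, 125.0])
print(mape(y_actual, y_pred), rmspe(y_actual, y_pred))
```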

The SVR, ELM, GHSOM, and SOM analyses are conducted using MATLAB version 7.8.0 (R2009a) toolbox (MathWorks, Natick, MA, USA), and the K-means is performed using SPSS version 12.0 software (SPSS, Inc., Chicago, IL, USA).

4.2 Results of single SVR and single ELM models

In modeling the single SVR model, the whole training dataset is used and the grid search is applied to determine the best parameter set (C, ε, σ). The search scope of the three parameters ranges from 2^{-15} to 2^{15}. The parameter set with the minimal testing error is the optimal parameter set. The SVR testing errors of different parameter sets for the PCs are listed in Table 1. As given in Table 1, the parameter set (C = 2^{11}, ε = 2^{-13}, σ = 2^{9}) provides the minimum testing RMSE and is considered the best parameter set for the single SVR model in forecasting PC sales. When the grid search is also applied to the NBs and LCDs, the best SVR parameter sets are (C = 2^{-13}, ε = 2^{-15}, σ = 2^{9}) for the NBs and (C = 2^{11}, ε = 2^{-15}, σ = 2^{-15}) for the LCDs.

Table 1 Model selection results of the single SVR model for PCs

Regarding the single ELM model, as mentioned in Sect. 3, the number of hidden nodes is varied from 1 to 30, and the model is trained 10 times for each node count to calculate the average RMSE. Figure 5 shows the average RMSE values of the single ELM model for the PCs with different numbers of hidden nodes. As shown in Fig. 5, the single ELM model with seven hidden nodes has the lowest average RMSE and is therefore the optimal ELM model for forecasting PC sales. Following the same procedure, the appropriate numbers of hidden nodes of the single ELM model for the NBs and LCDs are 9 and 5, respectively.

Fig. 5 Average RMSE values of the single ELM model for PCs with different numbers of hidden nodes

Table 2 lists the forecasting results of the single SVR and single ELM models for the PC, NB, and LCD products. The table shows that the single SVR model was superior for NB sales, whereas the single ELM model generated lower forecasting errors for PC and LCD sales. In summary, the single ELM model exhibited more satisfactory forecasting performance than the single SVR model.

Table 2 Sales forecasting results for PCs, NBs, and LCDs using the single SVR and single ELM models

4.3 Results of the clustering-based schemes

In modeling the six clustering-based forecasting models, the number of clusters is a critical parameter. Too many clusters leave too few training data in each cluster, so satisfactory forecasting models cannot be produced. By contrast, too few clusters cause the training data in a cluster to contain features or patterns dissimilar to those of the testing data, which also leads to poor forecasting models. To obtain satisfactory forecasting results, each model is tested with two to six clusters when the clustering-based forecasting models are constructed, and the cluster number with the minimal forecasting error is taken as the optimal number of clusters. Moreover, during the construction of the six clustering-based forecasting models, the procedure for selecting the optimal parameters of the SVR and ELM follows the procedure used for the single SVR and single ELM models described previously.

Table 3 shows the forecasting results of the six clustering-based forecasting models for PC sales with different numbers of clusters. Regardless of the number of clusters used, the GHSOM-ELM model generates the best forecasting results, and it yields the lowest forecasting errors (MAPE, 7.42 %; RMSPE, 9.21 %) when five clusters are employed. Thus, based on the results in Table 3, the GHSOM-ELM model demonstrates the highest forecasting performance for PC product sales among all the clustering-based forecasting models as well as the single SVR and single ELM models.

Table 3 Forecasting results of the six clustering-based forecasting models for PC sales

Regarding NB products, the results of the six clustering-based forecasting models with different numbers of clusters are given in Table 4. The GHSOM-ELM model yields promising forecasting results for all cluster numbers except three. In addition, the MAPE (11.43 %) and RMSPE (15.24 %) values of the GHSOM-ELM model with four clusters are the lowest. Thus, with four clusters, the GHSOM-ELM model exhibits forecasting performance superior to that of the other five clustering-based forecasting models, the single SVR, and the single ELM; it is therefore the most suitable sales forecasting scheme for NB products.

Table 4 Forecasting results of the six clustering-based forecasting models for NB sales

Table 5 shows the sales forecasting results of the six clustering-based forecasting models for LCD products with different numbers of clusters. The GHSOM-ELM model again demonstrates the best forecasting performance for every number of clusters. Moreover, the lowest MAPE (9.31 %) and RMSPE (11.41 %) values are observed when five clusters are used, a forecasting result superior to that of the other seven models. Therefore, as given in Table 5, the GHSOM-ELM model with five clusters is suitable for forecasting LCD sales.

Table 5 Forecasting results of the six clustering-based forecasting models for LCD sales

Overall, as given in Tables 3, 4, and 5, the forecasting errors of the GHSOM-ELM model are lower than those of the GHSOM-SVR, K-SVR, SOM-SVR, K-ELM, and SOM-ELM models, as well as those of the single SVR and single ELM models, for the sales data of all three computer products. Across different numbers of clusters, the GHSOM-ELM model generates promising forecasting results for all three products, except for the NB product with three clusters. These results demonstrate that the GHSOM-ELM model is a robust forecasting model.

In addition, to demonstrate the effectiveness of the GHSOM-ELM model, the best forecasting results of each clustering-based model for the PC, NB, and LCD products are summarized and compared in Table 6. Note that the number in parentheses indicates the most suitable number of clusters for each clustering-based forecasting model; for example, for PC sales, GHSOM-SVR(6) indicates that six clusters generate the best forecasting results for the GHSOM-SVR model. Table 6 shows that the GHSOM-ELM model yields the best forecasting results for the sales of all three computer products. Based on these findings, it can be inferred that the GHSOM-ELM model is suitable for computer retailing sales forecasting.

Table 6 Comparison of the best forecasting results of the six clustering-based forecasting models for PC, NB, and LCD sales

4.4 Significance test

To evaluate whether the proposed GHSOM-ELM model is superior to the GHSOM-SVR, K-SVR, SOM-SVR, K-ELM, and SOM-ELM models in computer retailing sales forecasting, the Wilcoxon signed-rank test is employed. The test is a distribution-free, nonparametric technique that makes no assumption about the underlying distribution of the data and deals with the signs and ranks of the values rather than their magnitudes. It is one of the most commonly adopted tests for evaluating the predictive capabilities of two different models and for determining whether the difference between them is statistically significant [4, 46, 47]. For details of the Wilcoxon signed-rank test, please refer to Diebold and Mariano [46] and Pollock et al. [47].

Based on the forecasting results in Table 6, the test is used to evaluate the predictive performance of the six clustering-based forecasting models. Table 7 shows the Z statistics of the two-tailed Wilcoxon signed-rank test for the MAPE values of the GHSOM-ELM model versus the five competing models, where the numbers in parentheses are the corresponding p values. It can be observed from Table 7 that the MAPE values of the GHSOM-ELM model differ significantly from those of the GHSOM-SVR, K-SVR, SOM-SVR, K-ELM, and SOM-ELM models, except for the GHSOM-SVR model for the LCD product. It can be concluded that the GHSOM-ELM model significantly outperforms the other five clustering-based models for computer retailing sales forecasting.

Table 7 Wilcoxon signed-rank test results between the GHSOM-ELM and the five competing clustering-based models by different computer products
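
As an illustration, the following sketch applies the two-tailed Wilcoxon signed-rank test to the absolute percentage errors of two competing models on the same test set using scipy.stats.wilcoxon; the error series shown are illustrative, not values from this study.

```python
# Wilcoxon signed-rank test on paired absolute percentage errors; values are illustrative.
import numpy as np
from scipy.stats import wilcoxon

ape_model_a = np.array([0.06, 0.08, 0.05, 0.09, 0.07, 0.10, 0.04, 0.08])  # e.g., GHSOM-ELM
ape_model_b = np.array([0.09, 0.11, 0.07, 0.12, 0.10, 0.13, 0.06, 0.11])  # a competing model

stat, p_value = wilcoxon(ape_model_a, ape_model_b)   # two-sided test by default
print(stat, p_value)                                  # significant difference if p_value < 0.05
```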

4.5 Robustness evaluation

To evaluate the robustness of the proposed GHSOM-ELM method, the performance of the six clustering-based forecasting models, the single ELM model, and the single SVR model is computed using different ratios of training and testing sample sizes. The testing plan is based on the ratio of the training dataset size to the complete dataset size; in this section, three ratios, 60, 70, and 80 %, are considered. Table 8 presents the prediction performance of all eight forecasting models for the three products (PC, NB, and LCD) at the different ratios when MAPE is used as the indicator. Sections 4.2 and 4.3 describe the process of using the eight prediction models to predict the three products at a ratio of 70 %; the same procedure is followed to generate the prediction results for ratios of 60 and 80 %.

Table 8 Robustness evaluation of the six clustering forecasting schemes, single ELM model, and single SVR model by different training and testing sample sizes

Table 8 reveals that, when the three computer products are predicted at the three ratios and evaluated using MAPE, the GHSOM-ELM method generates the smallest prediction error among all the methods, indicating that it outperforms the other seven methods. According to Table 8, compared with the other five clustering-based prediction techniques and the two single machine-learning methods, the proposed GHSOM-ELM method shows superior performance in predicting the NB and LCD products at all three ratios. However, for the PC product, although the GHSOM-ELM method clearly outperformed the other seven methods at ratios of 70 and 80 %, its prediction errors did not differ considerably from those of the other five clustering techniques at a ratio of 60 %, and its prediction results were not evidently superior to those of the single ELM. Figure 2 suggests a possible explanation: at a ratio of 60 %, the trend of the PC product shows that the data pattern or structure of the training data (the first 74 observations) differs from that of the testing data (the last 50 observations). As a result, the clusters formed in the training phase of the proposed forecasting model may not properly capture the patterns of the testing data. Therefore, the GHSOM-ELM method performs similarly to the other five clustering-based forecasting models and obtains results similar to those of both the single SVR and single ELM models. In other words, when the sales data show dissimilar data characteristics or patterns between the training and testing datasets, the GHSOM-ELM method cannot outperform the other clustering-based prediction techniques. We reviewed the study by Choi et al. [48] to further analyze the prediction performance of the proposed clustering-based forecasting models when different sales data structures are taken into account. In future work, we will use appropriate indicators (e.g., the autocorrelation function, ACF) and time-series analysis techniques (e.g., the wavelet transform) to analyze the sales data of computer products more extensively and subsequently evaluate the applicability and validity of the proposed method.

5 Conclusion

Because of rapid technological development, computer products are frequently replaced. Consequently, to compete with numerous rivals, computer retailers rely on accurate sales forecasting as the basis for effective management of marketing and inventories. This study used three clustering techniques, namely K-means, SOM, and GHSOM, and two machine-learning techniques, namely SVR and ELM, to construct six clustering-based forecasting models for computer product sales forecasting. The actual sales amounts of the PC, NB, and LCD products of three computer retailers were used as the empirical data. The results showed that the GHSOM-ELM model exhibited the most promising performance in forecasting the sales of the three computer products compared with the other five clustering-based forecasting models, the single SVR, and the single ELM. In addition, the GHSOM-ELM model is a robust sales forecasting model that generated the lowest forecasting errors for the data of the three computer products across different numbers of clusters. Thus, the proposed GHSOM-ELM model is an effective sales forecasting model suitable for forecasting sales in a computer retail environment.