1 Introduction

Massive amounts of data are generated continuously by devices around the world, such as driverless cars, mobile terminals, humidity sensors and computer clusters. Big data has had a tremendous impact on people's lives. To meet the demand for real-time data processing, a series of data stream processing platforms has emerged. A data stream processing system [1] provides a way to deal with big data and greatly improves the real-time performance of data processing. Reducing operating cost as much as possible while ensuring stable operation of the system has become a research hotspot, and load prediction technology [2] can address both goals to a certain extent. The difficulty of load prediction is that, compared with a traditional processing system, the data in a data stream processing system are transient. The scale of the data to be processed is also complex and unpredictable, and the prediction itself must meet the real-time requirements of the stream processing system.

Prediction models for load can be divided into linear and nonlinear prediction. Linear prediction mainly includes the ARMA [3] and FARIMA [4] models. Nonlinear prediction mainly includes neural networks [5], wavelet theory [6] and support vector machines (SVM) [7]. Because the fluctuation of stream processing load follows no certain rule, nonlinear prediction has attracted more attention from researchers. Most prediction algorithms predict from existing load time series data [8] and other historical regularities. Some achievements have been made: the classic Box-Jenkins model [9], load prediction using time series of the dynamic load changes of a processor, and prediction of processor behavior using the thread execution time slice in the CPU as a parameter. Warren et al. [10] used job execution time and queue latency as a basis for prediction. Wolski [11] proposed a CPU utilization approach for time-based UNIX systems. The above prediction algorithms have high theoretical value, but they did not consider the characteristics of data stream processing systems.

In this paper, a load forecasting algorithm based on an improved growing self-organizing map (GSOM) model is proposed. The rest of this paper is organized as follows. Our proposed algorithm is detailed in Sect. 2. The comparative experiments are carried out and the experimental results are analysed in Sect. 3. Finally, Sect. 4 concludes the paper.

2 Methodology

2.1 Related Works

The self-organizing map (SOM) [12] is widely used in pattern recognition as a clustering algorithm. The research goal of this paper is to predict the load of a stream processing system. The traditional SOM consists of three processes: competition, cooperation and adaptation. In the competition process, the best-matching (winning) neuron is found by evaluating the discriminant function in Eq. (1), which minimizes the Euclidean distance between the input and the neuron weights, where n is the number of output neurons.

$$i(\hat{x}) = \arg \min_{j} \left\| {\hat{x} - \hat{w}_{j} } \right\|,\quad j = 1,2, \ldots ,n$$
(1)

In the cooperation process, adjacent neurons cooperate with each other, and the weight vectors are adjusted within the neighborhood of the winning neuron. A Gaussian neighborhood function is used, as given in Eq. (2), and its radius shrinks over time according to Eq. (3). Here \(d_{i,j}\) is the distance between neuron j and the winning neuron i on the output plane, n denotes the iteration index, σ0 is the initial neighborhood radius, which is generally set to half the size of the output plane, and \(\uptau\) is the time constant.

$$h_{{j,i(\hat{x})}} (n) = e^{{ - \frac{{d_{i,j}^{2} }}{{2\sigma^{2} (n)}}}}$$
(2)
$$\sigma (n) = \sigma_{0} e^{{ - \frac{n}{\uptau}}}$$
(3)

The adaptation process starts after the neighborhood function is determined. The weight vectors of the winning neuron and of the neurons in its topological neighborhood are updated using Eq. (4), where \(w_{j}\) is the weight vector of neuron j, X(n) is the input vector at iteration n, and η(n) is the learning rate.

$$w_{j} (n + 1) = w_{j} (n) + \eta \left( n \right)h_{{j,i(\hat{x})}} (n)\left( {X(n) - w_{j} (n)} \right),\quad n = 1,2, \ldots ,T$$
(4)

The learning rate is computed using Eq. (5), where η(0) is the initial learning rate, a constant greater than 0 and less than 1, and T is the total number of iterations.

$$\eta (n) = \eta (0)\left( {1 - \frac{n}{T}} \right),\quad n = 1,2, \ldots ,T$$
(5)

The algorithm has three basic steps after initialization: sampling, similarity matching, and updating. Repeat these three steps until the feature mapping is completed.
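The three steps can be sketched as a short training loop. This is a minimal illustration of Eqs. (1) to (5), not the authors' implementation: the function names, the default parameter values and the `grid` argument (the lattice coordinates of each neuron, used for the distance in Eq. (2)) are assumptions introduced here.

```python
import math
import random

def winner(x, weights):
    """Eq. (1): index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))

def train_som(samples, weights, grid, eta0=0.9, sigma0=1.0, tau=100.0, T=200, seed=0):
    """Repeat sampling, similarity matching and updating for T iterations."""
    rng = random.Random(seed)
    weights = [list(w) for w in weights]            # work on a copy
    for n in range(1, T + 1):
        x = rng.choice(samples)                     # sampling
        i = winner(x, weights)                      # similarity matching
        sigma = sigma0 * math.exp(-n / tau)         # Eq. (3): shrinking radius
        eta = eta0 * (1.0 - n / T)                  # Eq. (5): decaying learning rate
        for j, w in enumerate(weights):             # updating, Eq. (4)
            d2 = math.dist(grid[j], grid[i]) ** 2   # squared lattice distance
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Eq. (2): Gaussian neighborhood
            for k in range(len(w)):
                w[k] += eta * h * (x[k] - w[k])
    return weights

# Two well-separated groups of inputs and two neurons on a 1-D lattice.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
trained = train_som(samples, [(1.0, 1.0), (4.0, 4.0)], grid=[(0.0,), (1.0,)])
```

With these inputs each neuron is expected to settle near one of the two groups, because the neighborhood radius shrinks until only the winner is updated.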

2.2 Improved Growing Threshold Setting Method

Load prediction for a stream processing system places high demands on real-time response. The quality of a prediction depends not only on the output of the algorithm but also on its response speed. This paper presents an improved GSOM algorithm, with improvements in network parameter initialization, the clustering prediction mode, new node initialization and operational efficiency, which better meet the demands of load prediction in a stream processing system.

GSOM decides when to add a new neuron according to the growth threshold (GT), so the value of GT should be set reasonably. If the threshold is too small, neurons will be added frequently, which increases the training burden. If the threshold is too large, the load prediction will be inaccurate. As the network grows, the addition of neurons should become more and more prudent. Therefore, the value of the threshold should be closely related to the current network condition. Drawing on the general idea of clustering algorithms, namely that points in the same category should be as close as possible and points in different categories should be as far apart as possible, this paper presents a new method to adjust the growing threshold dynamically, as given in Eq. (6).

$$\begin{aligned} j = \arg \mathop {\hbox{min} }\limits_{j} \left\| {x_{n} - w_{j} } \right\|,\quad j = 1,2, \ldots ,m \hfill \\ GT = \hbox{min} \left\| {w_{i} - w_{j} } \right\|,\quad i = 1,2, \ldots ,m,\;i \ne j \hfill \\ \end{aligned}$$
(6)

where j is the index of the winning neuron and \(w_{j}\) is its weight. \(w_{i}\) is the weight of the nearest neighbor of neuron j, and GT is set to the distance between neuron j and its nearest neighbor i. The main idea of the method is that if a vector \(x_{n}\) belongs to the category represented by \(w_{j}\), then the distance from \(x_{n}\) to \(w_{j}\) should be less than the distance between \(w_{j}\) and its nearest neighbor. Equation (7) below shows how the growing threshold works:

$$GT < \hbox{min} \left\| {x_{n} - w_{j} } \right\|,\quad j = 1,2, \ldots ,m$$
(7)

where m is the number of competition neurons. The network considers an input x to be a new input pattern and grows only when the distance between x and its winning neuron j is larger than the growing threshold.
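The growth test of Eqs. (6) and (7) can be sketched as follows. The function name `growth_check` and its return shape are assumptions made for illustration, and the sketch requires at least two neurons so that a nearest neighbor exists.

```python
import math

def growth_check(x, weights):
    """Return (winner index j, threshold GT, whether the network should grow)."""
    # Eq. (6), first line: the winner is the neuron closest to the input.
    j = min(range(len(weights)), key=lambda k: math.dist(x, weights[k]))
    # Eq. (6), second line: GT is the distance from the winner to its nearest neighbor.
    gt = min(math.dist(weights[i], weights[j])
             for i in range(len(weights)) if i != j)
    # Eq. (7): grow only if the input is farther from the winner than GT.
    return j, gt, math.dist(x, weights[j]) > gt

weights = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
known = growth_check((0.2, 0.0), weights)    # close to neuron 0
novel = growth_check((0.0, 5.0), weights)    # far from every neuron
```

Because GT is recomputed from the current weights, the threshold follows the network as it grows, which is the dynamic behavior described above.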

2.3 Initial Parameter Optimization

  1. Neuron number initialization algorithm

Each time a new input pattern arrives, the network dynamically adds neurons and adjusts its parameters until it reaches a steady state. Therefore, if the initial number of neurons is too small, neurons will be added frequently in the training phase, which affects the response speed of the system. On the other hand, too large a number causes excessive dead neurons and brings unnecessary interference to the training process. Accordingly, setting a proper number of initial neurons can accelerate the training of the SOM network. The following algorithm borrows the idea of dichotomy and calculates the average distance \(dist_{mean}\) over the input sample set X. If the Euclidean distance \(dist_{ij}\) between two inputs \(X_{i}\) and \(X_{j}\) is smaller than \(dist_{mean}\), the two inputs are very likely to belong to the same category. By pre-processing the set of input vectors with probabilistic analysis and the dichotomy method, a rough class number m is obtained and used as the number of initial neurons. Compared with traditional methods, which rely on experience or simply choose a fixed m, this method can greatly accelerate the training process and reduce the number of iterations.

  2. Initialize neurons' weights

The weights of the neurons are initialized first and are then adjusted gradually during training to reflect the characteristics of the input data set. Traditional SOM networks initialize the neurons' weights with random numbers, and randomly generated weights do not contain any characteristic of the training data. This paper proposes a method that initializes the neurons' weights with typical input vectors. As the number of neurons is known to be m, the problem of initializing the weights of m neurons is converted to the problem of finding m typical input vectors that can represent the characteristics of their respective categories. The general criterion for a clustering problem is to make nodes in the same category as close as possible and nodes of different categories as far apart as possible. Therefore, in our work, we use a greedy algorithm to select from the input data set the m vectors that are farthest from each other, and then initialize the neurons' weights with these vectors.
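The two initialization steps can be sketched together. The text does not spell out the exact counting procedure, so `estimate_neuron_count` is only one plausible reading: a sample opens a new category when it lies at least \(dist_{mean}\) away from every representative kept so far. Starting the greedy selection from the first sample is likewise an assumption, and both function names are hypothetical.

```python
import math
from itertools import combinations

def estimate_neuron_count(samples):
    """Rough class number m based on the mean pairwise distance (assumed procedure)."""
    pairs = list(combinations(samples, 2))
    dist_mean = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    representatives = []
    for x in samples:
        # Inputs closer than dist_mean are treated as the same category.
        if all(math.dist(x, r) >= dist_mean for r in representatives):
            representatives.append(x)
    return len(representatives)

def select_initial_weights(samples, m):
    """Greedy selection of m samples that are far from each other."""
    chosen = [samples[0]]                      # assumed starting point
    while len(chosen) < m:
        # Take the sample whose nearest already-chosen vector is farthest away.
        nxt = max(samples, key=lambda x: min(math.dist(x, c) for c in chosen))
        chosen.append(nxt)
    return [list(c) for c in chosen]

samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
m = estimate_neuron_count(samples)
initial_weights = select_initial_weights(samples, m)
```

The sketch needs at least two samples and m no larger than the number of distinct samples; a production version would also have to decide how to break ties between equally distant candidates.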

2.4 Computational Performance Optimization

SOM requires repeated iteration during training, and afterwards the weights of the whole network need to be adjusted each time a new neuron is added. Load prediction, however, is time-sensitive, so the complexity of the traditional SOM algorithm is intolerable for load prediction in a stream processing system. To solve the problem, a caching-based load prediction mechanism is proposed in this section. On the one hand, the new prediction mechanism uses SOM as a classifier to predict load accurately; on the other hand, it improves the computational efficiency of load prediction. The algorithm improves computational efficiency with the following three methods.

  1. New neuron weight vector assignment strategy

The efficiency of the network learning process is greatly influenced by the initial values of the network connection weights. In order to speed up retraining after a neuron is added, the weight of the winning neuron and the input pattern itself are used to assign the weight of the new neuron. The weight initialization formula is as follows:

$$w_{new} = a * w + b * X_{i} + c*Random$$
(8)

where \(w_{new}\) is a linear combination of the winning neuron weight w and the new pattern vector \(X_{i}\), plus a random term. The random quantity is introduced to ensure that the weight is not biased too much toward the current vector \(X_{i}\). According to the experimental results, it works well when a = 1/5, b = 3/5 and c = 1/5. The initial weight of the new neuron need not be very precise, because it will be adjusted constantly in the subsequent iterations, but a reasonable initial value does help reduce the number of iterations and stabilize the network.
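Equation (8) can be sketched in a few lines. The text does not say how the random term is drawn, so sampling each component uniformly from [0, 1) is an assumption, as are the function name and the `rng` parameter.

```python
import random

def new_neuron_weight(w, x, a=0.2, b=0.6, c=0.2, rng=random):
    """Eq. (8): mix the winner weight w, the input x and a random term."""
    # rng.random() is uniform on [0, 1); the distribution is an assumption.
    return [a * wk + b * xk + c * rng.random() for wk, xk in zip(w, x)]

w_new = new_neuron_weight([1.0, 1.0], [2.0, 3.0], rng=random.Random(0))
```

With the coefficients reported above, the new weight starts closer to the input pattern than to the winner.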

  2. Predicting strategy after pattern recognition

When the input vector does not satisfy the constraint of the winning neuron, that is, when the distance between the input vector and its winning neuron is larger than the growing threshold, a new neuron needs to be added, but the new, empty neuron does not contain any known load information. To solve this problem, a prediction mechanism is proposed. When the input vector is considered to belong to a known cluster, the load is predicted from the historical data of that cluster. Otherwise, it is predicted with a linear regression algorithm based on all the historical data. After the real load arrives, the information is added to the new neuron.

The linear regression algorithm works as follows. For an input matrix X whose rows are the input vectors \(x_{i}^{T}\), the regression coefficients are stored in the vector w, and the prediction is given by \(Y = Xw\). To obtain the best prediction, the squared error is used to measure the fit:

$$Err = \sum\limits_{i = 1}^{m} {\left( {y_{i} - x_{i}^{T} w} \right)^{2} }$$
(9)

In matrix form, the error can be written as \((y - Xw)^{T}(y - Xw)\). Taking the derivative with respect to w gives \(-2X^{T}(y - Xw)\); setting it to zero and solving for w yields:

$$\hat{w} = (X^{T} X)^{ - 1} X^{T} y$$
(10)

With the weight vector \(\hat{w}\) and the input data set X, the predicted load is given by \(Y = X\hat{w}\).
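As a worked illustration of Eqs. (9) and (10), the sketch below solves the normal equations \(X^{T}X w = X^{T}y\) with plain Gaussian elimination instead of forming the inverse explicitly. The helper names are hypothetical, each row of X is one input vector, and a constant 1 is placed in the first column to model an intercept, which is an assumption not stated in the text.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]      # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def fit(X, y):
    """Eq. (10): least-squares coefficients from (X^T X) w = X^T y."""
    d = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(XtX, Xty)

def predict(X, w):
    """Predicted load for every row of X."""
    return [sum(xk * wk for xk, wk in zip(row, w)) for row in X]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_hat = fit(X, y)
next_load = predict([[1.0, 4.0]], w_hat)[0]
```

The sketch fails with a division by zero when \(X^{T}X\) is singular, for example when two input features are perfectly collinear; a numerical library would normally be used in practice.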

  3. Cold backup strategy

As stated above, adding new neurons causes the SOM network to iterate again to adjust its parameters, and the high time complexity of the iteration process cannot meet the real-time requirement of a stream processing system. To this end, an SOM cold backup strategy is proposed: the system maintains two SOM networks, one dynamic and one static. The static network is responsible for receiving input and predicting the load, and the dynamic network is responsible for adding new neurons and retraining until the network is stable. The two networks cooperate as follows:

  (1) In the initial stage, the two networks are identical.

  (2) When an input vector Xi arrives, the static SOM finds the winning neuron and compares the distance between Xi and that neuron with the threshold GT. If the input belongs to an existing cluster, the historical data of that cluster are used to predict the load for Xi.

  (3) If Xi belongs to a new cluster, the dynamic SOM network adds a neuron and retrains the network parameters. The static network remains unchanged and uses the linear regression algorithm to compute the result.

  (4) After the retraining of the dynamic SOM is completed, the dynamic network is copied to replace the static network.

The dynamic SOM network is responsible for the training iterations. This strategy prevents network updates from causing the system to miss the real-time requirement of the stream processing system.
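The cooperation of the two networks can be sketched as a small class. Everything named here (`ColdBackupSOM`, `classify`, `grow`, `swap`, and the `retrain` callable that stands in for the SOM training routine) is hypothetical, and the sketch runs synchronously, whereas a real system would presumably retrain the dynamic network in the background.

```python
import copy
import math

class ColdBackupSOM:
    def __init__(self, weights, retrain):
        self.static = [list(w) for w in weights]    # step (1): serves predictions
        self.dynamic = copy.deepcopy(self.static)   # step (1): identical copy that grows
        self.retrain = retrain                      # hypothetical training routine

    def classify(self, x):
        """Step (2): winner and growth test, using the static network only."""
        ws = self.static
        j = min(range(len(ws)), key=lambda k: math.dist(x, ws[k]))
        gt = min(math.dist(ws[i], ws[j]) for i in range(len(ws)) if i != j)
        return j, math.dist(x, ws[j]) > gt

    def grow(self, x):
        """Step (3): only the dynamic network adds a neuron and retrains."""
        self.dynamic.append(list(x))
        self.dynamic = self.retrain(self.dynamic)

    def swap(self):
        """Step (4): replace the static network with a copy of the dynamic one."""
        self.static = copy.deepcopy(self.dynamic)

net = ColdBackupSOM([(0.0, 0.0), (1.0, 0.0)], retrain=lambda ws: ws)
j, is_new = net.classify((0.0, 5.0))
if is_new:
    net.grow((0.0, 5.0))      # the static network keeps answering meanwhile
    net.swap()
```

Copying with `copy.deepcopy` keeps the two networks independent, so later retraining of the dynamic network cannot disturb predictions made from the static one.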

2.5 Implementation of the Improved GSOM Based Load Predicting Algorithm

The existing GSOM algorithm can add neurons dynamically and recognize new classes of input vectors. However, the algorithm iterates frequently, and its computing speed cannot meet the real-time requirements of a stream processing system. In order to recognize the cluster of an input task and predict its load requirement accurately and quickly, this paper proposes the LP-IGSOM (Load Predicting based on Improved Growing Self-Organizing Map) algorithm. Compared with the existing GSOM, LP-IGSOM improves the initialization of the number of neurons, the computational efficiency, and several other aspects. The specific process of the LP-IGSOM algorithm is shown in Algorithm 1:

  (1) Initialization phase:

    (a) From the known input patterns, calculate the rough class number m, which is used as the initial number of neurons.

    (b) Select the m input vectors that are farthest from each other and use them to initialize the m neuron weights.

    (c) Calculate the growing threshold according to the current network status.

  (2) Growth phase:

    (a) Add an input to the network.

    (b) Use the Euclidean distance to find the winning neuron, as in the traditional SOM algorithm.

    (c) Determine whether the distance between the input and the winning neuron is greater than the threshold GT. If not, skip to step (f).

    (d) If the winning neuron is a boundary node, add a neuron and initialize its weight using the current input pattern X, the winning neuron weight W, and a random quantity. If not, skip to step (f).

    (e) Reset the learning rate and the neighborhood radius to their initial values.

    (f) Update the weight vectors of the neurons in the neighborhood.

    (g) Repeat steps (b) to (f) until the clustering of the existing data stabilizes.

  (3) Prediction phase:

    (a) Find the winning neuron. If no new node needs to be added, take all the known loads of the winning neuron and use their average as the predicted load.

    (b) If a node needs to be added, the static network uses linear regression to predict the load for the input pattern, and the dynamic SOM network adds the node and retrains.

    (c) Obtain the real load and add it to the new neuron.
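The bookkeeping behind the prediction phase can be sketched as a per-neuron load history. The class and method names are hypothetical; the winner index `j` and the flag `is_new` are assumed to come from the growth test, and `fallback` stands in for the linear regression predictor. Falling back when a known neuron has no recorded loads yet is an additional assumption.

```python
from statistics import mean

class ClusterLoadHistory:
    def __init__(self, n_neurons):
        self.loads = [[] for _ in range(n_neurons)]   # observed loads per neuron

    def predict(self, j, is_new, x, fallback):
        """Step (a): average of the winner's known loads; step (b): fallback."""
        if not is_new and self.loads[j]:
            return mean(self.loads[j])
        return fallback(x)

    def add_neuron(self):
        """Create an empty history for a newly added neuron and return its index."""
        self.loads.append([])
        return len(self.loads) - 1

    def record(self, j, real_load):
        """Step (c): store the real load once it is known."""
        self.loads[j].append(real_load)

history = ClusterLoadHistory(2)
history.record(0, 10.0)
history.record(0, 20.0)
known_prediction = history.predict(0, False, x=(0.1, 0.2), fallback=lambda x: -1.0)
new_index = history.add_neuron()
history.record(new_index, 42.0)
```

Keeping the history separate from the weights means the static and dynamic networks can share the recorded loads while only the weights are copied between them; that design choice is not stated in the text.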

3 Experiment Result

The experimental data in this paper simulate the data set produced by a pavement sensor network. The measured data arrival rate is 5000 records per second. Due to the particularity of data streams, little research has been done on their load prediction. Here we choose the classic linear regression (LR) prediction algorithm and the classical clustering algorithm K-means for comparison; each algorithm predicts the load on the data stream sent by the sensor network, and the predictions are compared with the real load. The experimental parameters are set as follows: the maximum learning rate is 0.9, the minimum learning rate is 1E-5, the number of training iterations of the neural network is 1000, and the initial neighborhood radius is 5.

Figure 1 shows the actual change of the load over time during the operation of the stream processing system. Under a standard data source and a fixed computing topology, samples are taken every two minutes from 0 to 20, and the load predicted by GLP-SOM, linear regression and K-means clustering is recorded and compared with the real load. Figure 2 shows the actual load curve and the load curve predicted by each algorithm. Under a fixed computing topology, the input patterns are all known and no new pattern enters the system. Therefore, the linear regression, K-means and GLP-SOM algorithms are all able to predict the load fairly well: as shown in Fig. 2, the predicted curves are close to the actual load curve. In this case, the main factors affecting the prediction are the performance of the algorithm and the fluctuation of the data source. The MSE of the LR algorithm is 81, while the errors of K-means and GLP-SOM are smaller, at 32.7 and 21.5 respectively. The LR algorithm predicts relatively poorly when the data source fluctuates greatly, while the prediction of the clustering-based algorithms is relatively stable.

Fig. 1. The real load situation of the nodes

Fig. 2. The load predicting curve based on standard data source and fixed computing topology

On the basis of the existing calculation rules, new calculation topologies are continuously generated, corresponding to new calculation patterns. As shown in Fig. 3, when new calculation rules appear, the historical data contain no prior knowledge of the data source and the calculation topology. Linear regression cannot handle this situation well: its prediction error is large, reaching 322.5. Among the three algorithms, K-means with a fixed number of clusters has the worst prediction effect, with an MSE of 383.9. Because the value of K is fixed, new input patterns are forcibly classified into existing clusters, which distorts the characteristics of those clusters. As a result, K-means has a larger prediction error whether it handles simple patterns or new ones. The proposed LP-IGSOM algorithm can identify new inputs and dynamically grow neurons for them. Its overall prediction is more accurate and its error with respect to the actual load is smaller, with an MSE of 77.6. The experimental results show that the load forecasting algorithm based on the proposed GSOM model can effectively deal with new input patterns. The accuracy and speed of load prediction are superior to those of the other methods.

Fig. 3. The load predicting curve based on standard data source and customized computing topology

4 Conclusion

In this paper, we propose an effective load prediction algorithm based on an improved GSOM model for data stream processing systems. Unlike other traditional prediction algorithms, the GSOM algorithm is optimized for the complex features of the stream processing system. In our experiments, the proposed load prediction algorithm LP-IGSOM achieved higher prediction accuracy and speed than the traditional load prediction algorithms.