
1 Introduction

A Bayesian network is a complete model of a set of variables and their relationships, and it can be used to answer probabilistic queries about them. Furthermore, it can handle uncertain problems in a logical and understandable way and can draw inferences from large amounts of uncertain information [1]. With increasing computing power and the emergence of big data, Bayesian networks have become ever more powerful, and they show great promise in the big data field.

Accordingly, our team’s project is to design and develop a correlation model of the multidimensional feature factors of heating load and an online diagnosis model for the monitoring data of large heating units. For the former model, for each specific unit parameter (such as heat supply), we need to describe the interaction between that parameter and the remaining parameters. The latter model should be able to identify outliers in the data online. These are a correlation analysis problem and a time-series outlier detection problem, which means we need a model that can answer probabilistic queries about the variables and their relationships and can draw inferences from uncertain information to help us trace the outliers [2]. Thus, we build a Bayesian network model to solve the analysis problem, and we use an autoregressive model together with posterior inspection to carry out the outlier detection. In this paper, we concentrate on the first problem and its solution.

A Bayesian network consists of two components: the structure, which is a Directed Acyclic Graph (DAG), for representing the conditional dependencies among variables, and a set of parameters for representing the quantitative information of the dependency [3]. Accordingly, learning a BN from data includes structure learning and parameter learning.

This paper focuses on the structure learning of Bayesian networks. We discuss our structure learning method in Sect. 3, our method of handling the project data (incremental learning) in Sect. 4, and the experiments and testing results in Sect. 5.

2 Background

Even in the era of big data, the problem of data sparsity still exists, and such problems are difficult to solve with classical statistical methods. Under these circumstances, the Bayesian network provides an excellent answer to these complex problems.

A Bayesian network (BN) is composed of a directed acyclic graph (DAG) and a set of parameters. In a DAG G = (X, E), X is a collection of nodes or vertices, and E is a collection of edges. The nodes of the DAG represent variables in the Bayesian sense: they may be discrete, continuous, known, or unknown [3]. In our project, we deal with continuous parameters. Edges represent conditional dependencies. Each node is associated with a probability function that takes, as input, a particular set of values of the node’s parent variables and gives, as output, the probability (or probability distribution, if applicable) of the variable represented by the node.

As shown in Fig. 1, this is a DAG with p nodes \( X_{1}, X_{2}, \ldots, X_{p} \) and directed edges between some of them. In a DAG, each node represents a variable, while each edge represents the influence of one variable on another: the parent node is the “cause” and the child node is the “effect”. Establishing such relationships encodes the independence assumptions among the variables: given its set of parent nodes, a node is independent of all its non-descendant nodes. Therefore, the joint probability distribution over all nodes represented by a Bayesian network can be expressed as the product of the conditional probability of each node, namely:

Fig. 1. Example of Bayesian network

$$ P\left( X_{1}, X_{2}, \ldots, X_{n} \right) = \prod\limits_{i = 1}^{n} P\left( X_{i} \mid X_{1}, X_{2}, \ldots, X_{i - 1} \right) = \prod\limits_{i = 1}^{n} P\left( X_{i} \mid \pi\left( X_{i} \right) \right) $$
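As a small worked example (ours, not from the original text), consider a hypothetical three-node network with edges X → Y and X → Z; the factorization then reduces to

$$ P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid X), $$

since \( \pi(Y) = \pi(Z) = \{X\} \) and \( \pi(X) = \emptyset \).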

In the graphical structure of a Bayesian network, there are three possible types of adjacent triplets in a directed acyclic graph (DAG):

$$ X \to Y \to Z, \qquad X \leftarrow Y \to Z, \qquad X \to Y \leftarrow Z $$
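A standard observation about these patterns (recalled here for context, not taken from [14]) is that they carry different independence semantics: in the chain and the fork, X and Z are conditionally independent given Y, whereas in the collider X → Y ← Z, X and Z are marginally independent but become dependent once Y is observed:

$$ X \to Y \to Z,\; X \leftarrow Y \to Z: \;\; X \perp Z \mid Y; \qquad X \to Y \leftarrow Z: \;\; X \perp Z,\;\; X \not\perp Z \mid Y $$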

Structure learning is the key point of the whole Bayesian network, and it directly influences the subsequent parameter learning and inference. The scientific problem of Bayesian network structure learning is: how to find a directed acyclic graph (DAG) that corresponds to the data distribution in a reasonable time. The challenge is that the size of the structure space grows super-exponentially in the number of variables. In the face of such a space, it is not feasible to find the globally optimal solution in a reasonable time and with limited storage.

Current structure learning methods can be categorized in several ways. By the type of variables, they divide into discrete and continuous Bayesian network structure learning. By the adopted technique, they divide into constraint-based methods [4,5,6], score-and-search methods [7,8,9,10], and hybrid methods. By the way the data are processed, they divide into batch learning and incremental learning [11,12,13]. By the accuracy of the solution, they divide into exact learning and approximate learning.

The comparisons between these approaches can be found in the Tables 1, 2 and 3 below.

Table 1. Advantages and disadvantages of three types of Bayesian network structure learning methods
Table 2. Advantages and disadvantages of batch learning and incremental learning
Table 3. Advantages and disadvantages of accurate learning and approximation learning

The application background is as follows:

  1. We deal with continuous, high-dimensional and large-sample data;

  2. The degree of dependence between variables needs to be quantified;

  3. The complexity of inference is exponential in the size of the largest clique.

Considering these three points, we decided to use incremental sparse Bayesian network (SBN) learning to address all of them at the same time.

First we introduce SBN [14]. It is a method for generating a Bayesian network over continuous, high-dimensional variables. Unlike traditional discrete Bayesian networks, it ultimately produces a directed acyclic graph together with the corresponding mixture Gaussian distribution. In the concrete implementation, the core idea of the algorithm is to maintain a coefficient matrix B. This matrix stores the relation coefficients of each node (attribute feature) with respect to its parent nodes; it is continuously updated during the iterations, and the final relation coefficient matrix is then transformed into a directed acyclic graph.

The advantages of using this method to learn the Bayesian network structure are:

  1. The current mainstream continuous Bayesian network structure learning methods are the conditional Gaussian Bayesian network and the linear Gaussian Bayesian network. The former requires the conditional covariance matrix to be invertible, which limits its application. The linear Gaussian Bayesian network is not subject to this restriction and can also be combined with current sparsity techniques, such as the lasso, to reduce the complexity of the model.

  2. It can theoretically guarantee cycle detection in the structure.

  3. Owing to the particular form of the SBN algorithm, prior knowledge from domain experts can easily be added, and the method is easy to extend to incremental learning.

3 Structure Learning

BN structure learning is an NP-hard problem: selecting a probabilistic model that explains a given set of data. The main task of Bayesian network structure learning is to construct, from the original sample data, a directed graph that conforms to the dependencies among the features. It is an important basis for subsequent parameter learning and Bayesian inference.

In our project, our team needs to learn large BN structures with high accuracy and efficiency from limited samples. Therefore, a useful strategy is to impose a sparsity constraint of some kind. Many real-world networks are indeed sparse, such as gene association networks and our relationship network of heating load factors. When learning the structure of these networks, a sparsity constraint helps prevent overfitting and improves computational efficiency.

Meanwhile, common Bayesian network structure learning depends on discretizing continuous raw data. To avoid the unreliability introduced by manual discretization, as well as the resulting complexity of inference, we adopt a Bayesian network structure learning algorithm for continuous data, the Sparse Bayesian Network (SBN) [14]. The method itself is aimed at building models from small samples of continuous data. However, since our sample size is larger, our implementation of SBN additionally provides batch training and incremental training, which effectively shortens the training time.

In the SBN algorithm, the following structures are defined. First, the number of variables is p. We use a two-dimensional array to represent the DAG we learn (a \( p \times p \) matrix G): \( G_{ij} = 1 \) means there is a directed edge from node i to node j. In addition, a \( p \times p \) matrix is needed to indicate whether there is a path between nodes, namely the relationship closure: \( P_{ij} = 1 \) means there is a path from node i to node j, and \( P_{ij} = 0 \) means there is none. Finally, we need a \( \left( p - 1 \right) \times p \) matrix B to record the coefficients of every node with respect to its parents; since no node can be its own parent (assuming there is no loop), matrix B has one row fewer than the matrices above. SBN is essentially a model that iteratively learns, for each dimension, a group of relation coefficients connecting a node to its parents from the data. There is no directed edge between two nodes when the corresponding relation coefficient is zero. The core of the algorithm is to optimize the formula below:
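As an illustration, a minimal sketch of these data structures in Python/NumPy is given below. The layout and names are ours, not the original SBN implementation; for simplicity B is stored as a p × p matrix with an unused diagonal instead of the (p − 1) × p layout described above.

```python
# Sketch of the SBN data structures: DAG matrix G, relationship closure P,
# and coefficient matrix B. Names and layout are illustrative only.
import numpy as np

p = 5                                    # number of variables (nodes)
G = np.zeros((p, p), dtype=int)          # G[i, j] = 1: directed edge i -> j
B = np.zeros((p, p))                     # B[j, i]: coefficient of parent j for node i
                                         # (diagonal unused; no node is its own parent)

def closure(G):
    """Relationship closure: P[i, j] = 1 iff a directed path i -> ... -> j exists."""
    p = G.shape[0]
    P = (G > 0).astype(int)
    for k in range(p):                   # Floyd-Warshall-style transitive closure
        P |= np.minimum.outer(P[:, k], P[k, :])
    return P

# The acyclicity constraint of the optimization below: B[j, i] * P[i, j] must stay
# zero, i.e. we may not add an edge j -> i if a path i -> ... -> j already exists.
P = closure(G)
assert np.all(B.T * P == 0)
```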

$$ \hat{B} = \min_{B} \sum\limits_{i = 1}^{p} \left\{ \left( x_{i} - \beta_{i}^{T} x_{/i} \right)\left( x_{i} - \beta_{i}^{T} x_{/i} \right)^{T} / 2 + \lambda_{1} \left\| \beta_{i} \right\|_{1} \right\} \quad {\text{s.t.}}\;\; \beta_{ji} \times P_{ij} = 0, \; i, j = 1, \ldots, p, \; i \ne j $$

where \( \beta_{i} \) is the ith column of matrix B, and \( x_{/i} \) denotes the sample matrix after removing variable i. The optimization goal is to minimize the fitting error between the true values and the fitted values, plus the \( L_{1} \) regularization penalty term. Here \( \lambda_{1} \) controls the number of non-zero entries in matrix B, i.e. the sparsity of the network structure: the larger \( \lambda_{1} \) is, the fewer non-zero entries B has and the sparser the network structure is. In the iterative process, it is necessary to keep \( \beta_{ji} \times P_{ij} \) equal to zero, so as to guarantee that the network structure contains no cycle.

Unfortunately, the formula above is hard to solve directly. Therefore, in the practical implementation, we relax the constraint and optimize matrix B column by column, that is, as a cumulative process of optimizing each variable in turn. The transformed formula is:

$$ \widehat{B}_{ap} = \min_{B} \sum\limits_{i = 1}^{p} f_{i}\left( \beta_{i} \right) = \min_{B} \sum\limits_{i = 1}^{p} \left\{ \left( x_{i} - \beta_{i}^{T} x_{/i} \right)\left( x_{i} - \beta_{i}^{T} x_{/i} \right)^{T} / 2 + \lambda_{1} \left\| \beta_{i} \right\|_{1} + \lambda_{2} \sum\limits_{j \ne i} \left| \beta_{ji} \times P_{ij} \right| \right\} $$

In the formula, the inner sum over j ranges over all variables other than i. The constraint has been turned into a second penalty term: \( \lambda_{1} \) is still the \( L_{1} \) regularization penalty and controls the sparsity of the network structure, while \( \lambda_{2} \) pushes \( \left| \beta_{ji} \times P_{ij} \right| \) towards zero so as to avoid creating a loop in the graph. It can be proved that when \( \lambda_{2} \) satisfies the following condition:

$$ \lambda_{2} > \frac{{\left( {n - 1} \right)^{2} p}}{{\lambda_{1} }} - \lambda_{1} $$

it is ensured that no cycle is formed during training.

Given the values of \( \lambda_{1} \) and \( \lambda_{2} \), the problem can be solved by the BCD (block coordinate descent) algorithm. BCD is a block optimization method: for matrix B, it holds all the remaining columns fixed and updates each \( \beta_{i} \) in turn, which is equivalent to optimizing \( f_{i}\left( \beta_{i} \right) \), until a preset convergence condition is satisfied. Specifically, in our scheme we use an \( L_{2} \)-norm change of less than 0.001 as the convergence condition. For each optimization of \( f_{i}\left( \beta_{i} \right) \), a form similar to LASSO optimization can be used:

$$ f_{i}\left( \beta_{i} \right) = \frac{\left( x_{i} - \beta_{i}^{T} x_{/i} \right)\left( x_{i} - \beta_{i}^{T} x_{/i} \right)^{T}}{2} + \sum\limits_{j \ne i} \left( \lambda_{1} + \lambda_{2} \left| P_{ij} \right| \right)\left| \beta_{ji} \right| $$

In the above formula, \( x_{i} \) is the sample vector of feature i, \( \beta_{i}^{T} \) is the column of the relational matrix that corresponds to feature i, and \( P_{ij} \) is the connectivity between features i and j at the current stage. In particular, connectivity can be determined using BFS (breadth-first search). The optimization of \( f_{i}\left( \beta_{i} \right) \) can then be carried out iteratively with the shooting algorithm, repeatedly applying the following update:

$$ \beta_{ji}^{t + 1} = \left( \left| \frac{\left( x_{i} - \beta_{i/j}^{t\,T} x_{/(i,j)} \right) x_{j}^{T}}{x_{i} x_{i}^{T}} \right| - \frac{\lambda_{1} + \lambda_{2} \left| P_{ij} \right|}{x_{i} x_{i}^{T}} \right)_{+} \operatorname{sign}\left( \frac{\left( x_{i} - \beta_{i/j}^{t\,T} x_{/(i,j)} \right) x_{j}^{T}}{x_{i} x_{i}^{T}} \right) $$

The iterations continue until the convergence condition is reached. Finally, as an implementation detail, the original samples need to be normalized (zero mean and unit variance). A minimal code sketch of this BCD-plus-shooting procedure is given below.
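The following Python/NumPy sketch illustrates the BCD loop with the shooting (soft-thresholding) update and the closure-based acyclicity penalty described above. It is our own illustrative reading of the procedure, not the original SBN implementation; the function names (sbn_bcd, soft_threshold, closure) and default parameters are assumptions.

```python
# Illustrative sketch of the BCD + shooting iteration for SBN structure learning.
# Assumes X is an (n, p) sample matrix, standardized to zero mean and unit variance.
import numpy as np

def soft_threshold(z, tau):
    """Shooting/lasso update: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def closure(G):
    """P[i, j] = 1 iff a directed path i -> ... -> j exists in G."""
    P = (G > 0).astype(int)
    for k in range(G.shape[0]):
        P |= np.minimum.outer(P[:, k], P[k, :])
    return P

def sbn_bcd(X, lam1, lam2, B0=None, max_iter=50, tol=1e-3):
    n, p = X.shape
    B = np.zeros((p, p)) if B0 is None else B0.copy()   # B[j, i]: coefficient of parent j for node i
    for _ in range(max_iter):
        B_old = B.copy()
        for i in range(p):                               # BCD: optimize column beta_i, others fixed
            P = closure((np.abs(B) > 0).astype(int))     # current relationship closure
            scale = X[:, i] @ X[:, i]                    # x_i x_i^T; roughly n for standardized data
            for j in range(p):
                if j == i:
                    continue
                # residual of x_i with the contribution of parent j removed
                r = X[:, i] - X @ B[:, i] + X[:, j] * B[j, i]
                z = (r @ X[:, j]) / scale
                tau = (lam1 + lam2 * P[i, j]) / scale    # heavier penalty if the edge would close a cycle
                B[j, i] = soft_threshold(z, tau)
        if np.linalg.norm(B - B_old) < tol:              # L2-norm change as convergence test
            break
    return B
```

In practice \( \lambda_{2} \) is chosen large enough to satisfy the condition above, so coefficients that would create a cycle are driven to zero by the soft-thresholding step.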

4 Bayesian Incremental Learning

The aim of incremental learning is for the learning model to adapt to new data without forgetting its existing knowledge; the model is not retrained from scratch. Incremental algorithms are less restrictive than online algorithms: they process input examples one by one (or batch by batch) and update the decision model after receiving each example. Incremental algorithms may have random access to previous examples or to representative/selected examples; in that case they are called incremental algorithms with partial memory.

Typically, in incremental algorithms, for any new presentation of data, the update operation of the model is based on the previous one. Streaming algorithms are online algorithms for processing high-speed continuous flows of data. In streaming, instances are processed sequentially as well and can be examined in only a few passes (typically just one). These algorithms use limited memory and limited processing time per item.

Considering the large sample size (about 500,000 records of training data per year), batch training can achieve a more accurate solution, but the training process is time-consuming. Based on these two characteristics of batch training, we made a few improvements and changes.

First of all, we have to deal with a large number of data samples. Batch training achieves a global solution at the cost of training time, so we take a sliding window as the core of our method. Moreover, the data arrive as a stream: in practical applications not all samples are available at once, and the relationships between variables may change over time; learning under these conditions is called online learning over streaming data. Since the boiler data satisfy the definition of streaming data, we provide an incremental updating interface on top of SBN. We first learn from some offline data to obtain a reasonably good network structure. Then, following the idea of a sliding window, the window slides forward as time passes and the data change, and after every slide the coefficient matrix is recomputed. A threshold is set on the changes of the relationship matrix, and an entry is modified only when its change exceeds this threshold; this prevents noise from interfering with the accuracy of the model. The same threshold is used to decide the directed edges: when a value in the relationship matrix crosses the threshold, the directed graph of the network structure is updated. Because each newly arriving sample occupies only a small share of the window, the relationship matrix cannot change dramatically; when a concept drift occurs, the post-drift values accumulate window after window, so the relationship matrix gradually approaches the new concept. In this way an incremental learning method is realized.

In the actual implementation, we store the original samples in a matrix, each row representing one set of boiler data. The number of rows of this matrix is given in advance. First, we accept a certain amount of data into the matrix and learn from it, obtaining the initial matrix B. When the data we accept exceed the number of rows of the matrix, we delete the first row and append the new set of data as the last row, which is the idea of a sliding window. Because our Bayesian network structure is fitted with a linear Gaussian model each time, the structure and solution obtained for the next window start close to convergence. Furthermore, because the sample size of a single batch is much smaller than the total sample, the relational matrix can converge to a local solution in a short time. As subsequent batches arrive, the relationship matrix is constantly updated. After each update, we check the new relationship matrix and update the edges according to the threshold value, so as to control the sparsity of the structure (Tables 4 and 5). A minimal sketch of this sliding-window update follows the tables.

Table 4. Detailed description of the proposed algorithm.
Table 5. Sliding-window update algorithm.
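The sketch below illustrates the sliding-window update in Python/NumPy. It reuses the sbn_bcd() routine sketched in Sect. 3 and warm-starts it from the previous coefficient matrix; the class name, window size, slide step, and threshold are illustrative assumptions, not the project’s exact configuration.

```python
# Illustrative sliding-window incremental update around the SBN learner.
# Assumes sbn_bcd(X, lam1, lam2, B0=...) from the sketch in Sect. 3 is available.
import numpy as np
from collections import deque

class SlidingWindowSBN:
    def __init__(self, p, window=3000, step=100, lam1=0.8, lam2=1e4, edge_thresh=0.10):
        self.buffer = deque(maxlen=window)    # keeps only the newest `window` rows
        self.step = step                      # slide once for every `step` new records
        self.lam1, self.lam2 = lam1, lam2     # lam2 chosen large, per the condition in Sect. 3
        self.edge_thresh = edge_thresh        # after-pruning threshold on coefficients
        self.B = np.zeros((p, p))             # relationship (coefficient) matrix
        self.G = np.zeros((p, p), dtype=int)  # current DAG adjacency matrix
        self._pending = 0

    def add_sample(self, x):
        """Append one new boiler record; re-learn once the slide step is reached."""
        self.buffer.append(np.asarray(x, dtype=float))
        self._pending += 1
        if self._pending >= self.step and len(self.buffer) == self.buffer.maxlen:
            self._pending = 0
            self._relearn()

    def _relearn(self):
        X = np.vstack(self.buffer)
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # normalize the window
        # warm-start from the previous matrix so only a few iterations are needed
        self.B = sbn_bcd(X, self.lam1, self.lam2, B0=self.B, max_iter=10)
        # update directed edges only where coefficients cross the threshold
        self.G = (np.abs(self.B) > self.edge_thresh).astype(int)
```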

For the SBN algorithm, the time complexity is \( O(np^{2}) \), where n is the number of BCD iterations and p is the number of variables. In most cases the number of variables is small, no more than 100, but the number of iterations can be very large when the data size is huge; for example, the boiler data from our client are usually on the order of hundreds of thousands of records. In that case n is too large and the time complexity is high. Our incremental approach, however, converges within 10 iterations per window. In our implementation of the sliding window we set the window to 3,000 records, so the time complexity becomes \( O(cp^{2}) \) with \( c \approx 3{,}000 \times 10 = 30{,}000 \), which is much less than a hundred thousand.

In addition, the experiments show that if the overall distribution along one dimension is basically unchanged, the first batch of samples already converges the corresponding column of the matrix to an approximately global solution, and the subsequent iterations are greatly reduced.

Whether the data are large or small, the incremental sparse Bayesian network structure learning approach theoretically takes much less time than other Bayesian network learning methods.

5 Experiments and Testing Results

Based on the methods above, we conducted tests on the provided data. An important idea of the BCD + shooting method is to optimize one specific column while keeping the other data fixed, so the order of the original data features has a certain effect on the learned structure. According to theory and experiments, the higher a variable ranks in the ordering, the more it tends to become an ancestor node. Therefore, finding a reasonable feature order is an important part of this method. According to prior knowledge, choosing a good order of the feature variables can effectively improve the validity of the network. We conducted experiments with two variable orders and different sample sizes.

5.1 Time Efficiency Experiments

We first conducted two time-efficiency experiments on two data sets with two different variable orders; one data set is large and the other is small. Since we need to deal with continuous parameters, we compare our incremental approach with a batch learning approach; both handle continuous parameters by using the SBN algorithm.

order1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62]

The size of this group of data is 520,000, with 57 dimensions (57 variables) (Figs. 2 and 3).

Fig. 2. Structure constructed by incremental learning for order 1 (sample size is 520,000).

Fig. 3. Structure constructed by batch approach for order 1 (sample size is 520,000)

  • Time efficiency: incremental approach: 14 h; batch approach: 29 h.

  • Analysis:

This group of data has 57 variables, of which 37 converge before the fifth iteration, 14 converge in the sixth iteration, and the remaining six converge in the eighth iteration. The time cost is much lower than that of batch learning.

It can be seen from the running times above that the incremental learning method is clearly better than the batch method in time performance and converges to the final solution much faster. At the same time, the structures generated by the two methods differ: the incremental structure is slightly less accurate than the batch one, but it is still accurate. From the comparison on order1, we can conclude that our improved incremental learning method is very effective for sample sizes of 500,000 and above.

order2 = [0, 1, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62]

The size of this group of data is 3,000, with 30 dimensions (30 variables).

Therefore, the running time is very short. In the incremental case, 25 dimensions converge in the first five iterations, four dimensions converge in the sixth, and the last one converges in the seventh. Because the sample size is small, the incremental approach has no great advantage in time complexity (Figs. 4 and 5).

Fig. 4. Structure constructed by incremental learning for order 2 (sample size is 3,000)

Fig. 5. Structure constructed by batch approach for order 2 (sample size is 3,000)

  • Time efficiency: incremental approach: 0.4 h; batch approach: 0.4 h.

  • Analysis:

The time comparison above shows that our incremental learning method does not differ significantly from the batch method in time performance, and the obtained structures are basically the same. This shows that for smaller sample sizes, incremental learning and batch learning can be equally accurate and fast.

5.2 Accuracy Experiment of Structure Learning Results

In the first experiment, we showed that the incremental approach to sparse Bayesian network structure learning is more time-efficient than batch learning. In this experiment, we conduct a comparison to find the penalty value and the relation threshold value that make our method most accurate, and at the same time we show that our method can learn an accurate structure.

Before the test, we should explain the two parameters in our system: the penalty and the threshold. The penalty controls the sparsity and prevents cycles from forming during training (explained in detail in Sect. 3), and the relation threshold removes edges with weak correlation, i.e. after-pruning (explained in detail in Sect. 4).

It is worth mentioning that redundant edges harm structural correctness less than absent edges, which means we should pay more attention to reducing the number of absent edges. Moreover, the larger the threshold, the sparser the structure; and normally, the larger the penalty, the sparser the structure.

Dataset: CHILD

Number of nodes: 20; Number of arcs: 25; Number of parameters: 230; Average Markov blanket size: 3.00; Average degree: 1.25; Maximum in-degree: 2.

The influence of these two parameters can be seen in Table 6; the best setting is the one for which the numbers of redundant edges (RE) and absent edges (AE) are smallest.

Table 6. The redundant edges and absent edges with different parameters for CHILD

As we can observe in Table 6, when the penalty value is 0.8 and the relation threshold is 0.10, the numbers of absent and redundant edges are smallest. It stands to reason that there must be a most suitable setting for which the structure is closest to the ground truth, since both parameters regulate the sparsity of the structure.

Apparently, when the threshold or the penalty is too small, the structure becomes too dense and there are too many redundant edges. On the other hand, when the threshold or the penalty is too large, the structure becomes too sparse, so more edges are likely to be missed. Therefore, the balance point is at a penalty of 0.8 and a relation threshold of 0.10.

Dataset: ALARM

Number of nodes: 37; Number of arcs: 46; Number of parameters: 509; Average Markov blanket size: 3.51; Average degree: 2.49; Maximum in-degree: 4.

As we can observe in Table 7, there are two settings with the fewest redundant and absent edges: penalty 0.7 with relation threshold 0.10, and penalty 0.8 with relation threshold 0.09. For a dataset with 46 edges, a structure with 8 absent edges is considerably accurate in the machine learning field.

Table 7. The redundant edges and absent edges with different parameters for ALARM

Comparing this setting for the ALARM dataset with the setting for the CHILD dataset, we conclude that when the penalty value is close to 0.8 and the relation threshold is close to 0.10, the structure our system learns is the most suitable one.

On the other hand, we need to evaluate the performance of offline learning against online learning based on sliding windows, which we set to 10,000 records. Since our data are large, online learning is more efficient than the offline method. In order to prevent individual records from influencing the parameter update, we slide the window once for every 100 new records. Moreover, we set the penalty according to the size of the sliding window. Since there is no known correct network for our data, we treat the offline result as the ground truth. In order to reduce the influence of the data size, we fix the data size at 500,000. The results are shown in Tables 8, 9 and 10.

Table 8. The comparison of offline learning and online learning Bayesian networks for 30 nodes
Table 9. The comparison of offline learning and online learning Bayesian networks for 50 nodes
Table 10. The comparison of offline learning and online learning Bayesian networks for 62 nodes

Tables 8, 9 and 10 show the comparison of offline and online learning of Bayesian networks. The first row is the offline ground truth, the second row is the network learned from the first batch, the third row is the network learned from half of the data, and the last row is the network learned from all of the data.

From the tables above, we can see that the result of online learning is close to that of offline learning, although online learning produces more edges than offline learning, possibly because the penalty we set is not entirely suitable. The result of the first batch still has a gap from the ground truth, but as the network is updated online, the result slowly approaches the ground truth. Based on this experiment, the online learning method we propose is more efficient than offline learning while maintaining accuracy.

6 Conclusion of Experiments

In conclusion, the time-efficiency experiments show that our incremental approach to sparse Bayesian network structure learning is more efficient than the existing approaches to Bayesian structure learning. The accuracy experiments show that our method is considerably accurate among machine learning methods.

The results demonstrate the accuracy and efficiency of our approach and show that it can handle the currently intractable problem of learning structure over continuous variables. These advantages reflect the advancement of our approach.

Last but not least, the experiments show that our incremental approach to Bayesian network structure learning can be widely applied in the big data field, for example, the analysis of data from complex equipment, the origin of galaxies, pathogenic genes, and the operating mechanism of large heating units. To uncover the laws underlying these problems, we must understand the networks that generate them and tease out the intricacies of the events. Since classical statistics is ineffective on such problems, our approach appears all the more valuable.