1 Introduction

Nowadays, Bayesian networks (BNs) have become prominent probabilistic models in the field of artificial intelligence and are widely applied in areas such as computational biology, medical diagnosis, vision recognition, data mining and information retrieval (Ahmad et al. 2012; Hill 2012; Ji 2011; Li 2011; Murphy and Mian 1999; Yang et al. 2010; Zou and Conzen 2005). A BN is a graphical model represented as a directed acyclic graph (DAG) in which nodes correspond to variables and arcs denote the dependences among variables that encode the joint probability distribution. The ability to represent the uncertainty of real problems, the comprehensibility of the graphical model, the strong mathematical basis of the formalism, and the ability to perform inference over the network and compute the values of unobserved variables from observed ones are some of the main advantages of using BNs (Gámez et al. 2011; Ji 2011).

As the popularity of BNs has increased, their construction has attracted the attention of researchers. Building a BN can be divided into two tasks: structure learning and parameter learning. The former refers to learning the topology of the network from the collected data, and the latter refers to estimating the conditional probabilities of a given structure.

Structure learning can be done manually by an expert, but this is complicated and time-consuming, and such an expert may not be available. Moreover, it has been proved that structure learning is an NP-hard problem (Chickering 1996). Therefore, learning the structure from available data is increasingly becoming a vital factor in using BNs.

Generally, there are two main approaches to structure learning: dependency-based and score-based approaches (Gámez et al. 2011). The first comprises methods that estimate dependencies among variables, and the second refers to approaches that try to find the network structure that best fits the training data. The score-based approach can be cast as an optimization problem because an evaluation function assigns a score to each candidate structure; the goal is to find the structure that maximizes this measure (Khanteymoori et al. 2011). Although dependency-based approaches are relatively simple to implement, the assessment of dependencies becomes complicated and unreliable as the number of variables grows (Ji et al. 2013). Therefore, score-based methods have attracted considerable interest for structure learning.

Since this problem is NP-hard, several authors have attempted to solve it with heuristic methods, including ant colony optimization (Ji 2011), asexual reproduction optimization (ARO) (Khanteymoori et al. 2011), particle swarm optimization (Cowie et al. 2007), the artificial bee colony algorithm (Ji et al. 2013), genetic algorithms (You 2001; Larrañaga 1996; Wong et al. 1999) and others (Alonso-Barba and Puerta 2011; Gámez et al. 2011; Larrañaga 2013; Ziegler 2008).

The aim of this study is to provide a suitable method for structure learning using a hybrid GA/PSO algorithm called Breeding Swarm Particle Swarm Optimization (BSPSO). In this method, a portion of the population, determined by the breeding rate, is reproduced with the crossover operator of the genetic algorithm (Mattew and Terence 2006). The combination of these algorithms increases the chance of reproducing particles with higher scores, while the computational cost does not change significantly. Furthermore, the mutation operator preserves the diversity of particles and prevents premature convergence, so the search strategy is more robust against local maxima.

This paper is organized as follows: In Sects. 2 and 3, some preliminaries and basics concepts about BNs and structure learning are reviewed. In Sect. 4, our method is described in detail. Results are shown in Sect. 5, and our conclusions are drawn in the final section.

2 Bayesian network

BNs, or causal networks, represent the probability distribution, correlations and relations among random variables in real-world problems. A BN can be regarded as a tuple \(\hbox {BN}({S},\theta )\), in which S is the structure, a directed acyclic graph (DAG) that represents the random variables and the relations among them: random variables are nodes and relations are edges. \(\theta \) is a set of local parameters that denote the conditional probability distributions over the values of each variable according to the structure S. These values are stored in a table for each variable, called the conditional probability table (CPT).

Fig. 1
figure 1

An example of a simple Bayesian network structure

Figure 1 shows a simple BN with five binary variables and their CPTs. The set of variables contains Cancer, Smoker, Pollution, X-ray and Dyspnea.

The information about the variables and the dependencies between them can be gained easily from the topology of a BN.

Suppose a BN with a set of random variables \(X_{1}, \ldots ,X_{n}\). The immediate predecessors of variable \(X_{i}\) are referred to as its parents, with values \(\mathrm{parents}(x_{i})\). The joint probability distribution is factored as:

$$\begin{aligned}&P\left( X_{1}=x_{1},\ldots ,X_{n}=x_{n} \right) \nonumber \\&\quad =\prod \limits _{i=1}^{n} {p(X_{i}= x_{i}\vert \mathrm{parents}(x_{i}))} \end{aligned}$$
(1)
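The factorization in Eq. (1) can be sketched in a few lines of code. The two-variable network below, its CPT entries and the variable names are illustrative, not taken from Fig. 1.

```python
# Sketch of Eq. (1): the joint probability of a full assignment
# factorizes into one CPT lookup per variable.
# parents[v] lists the parents of v; cpt[v] maps
# (value_of_v, tuple_of_parent_values) -> conditional probability.
parents = {"Smoker": [], "Cancer": ["Smoker"]}
cpt = {
    "Smoker": {(1, ()): 0.3, (0, ()): 0.7},
    "Cancer": {(1, (1,)): 0.1, (0, (1,)): 0.9,
               (1, (0,)): 0.01, (0, (0,)): 0.99},
}

def joint_probability(assignment):
    """P(X1=x1,...,Xn=xn) = prod_i P(Xi=xi | parents(Xi))."""
    p = 1.0
    for var, val in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][(val, pa_vals)]
    return p
```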

As mentioned earlier, designing a BN involves two phases: first, searching for a suitable structure, and then computing the CPTs of the variables for the found structure so that the output approximates the distribution of the given set of samples. Both of these phases are important; they are called structure learning and parameter learning, respectively.

The most popular parameter learning method is the expectation maximization (EM) algorithm (Chickering et al. 1995). This paper concentrates on the structure learning and presents a new approach to find an appropriate BN structure which is matched with the given dataset.

Fig. 2
figure 2

The diagram of the proposed method

3 Structure learning

Structure learning for a BN can be stated as follows: X is a set of random variables \(X_{1},\ldots ,X_{n}\), where each variable \(X_{i}\) takes values from its own domain \(val(X_{i})\). Given a training set with M cases, \(D=\{d[1],\ldots , d[M]\}\), where d[i] is an instance over the domain variables \(val\left( X_{1},\ldots ,X_{n} \right) \), the learning goal is to find a network structure G that is a good predictor for the data.

It has been shown that the number of possible structures for a BN with n nodes is given by the following recurrence relation (Robinson 1977):

$$\begin{aligned} r\left( n \right) =\sum \limits _{i=1}^n \left( -1 \right) ^{i+1}\binom{n}{i} 2^{i\left( n-i \right) }r\left( n-i \right) \in n^{2^{O(n)}} \end{aligned}$$
(2)
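Eq. (2) is straightforward to evaluate directly; a sketch, using the convention \(r(0)=1\):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Robinson's recurrence (Eq. 2): number of DAGs on n labeled nodes."""
    if n <= 0:
        return 1  # base case r(0) = 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * num_dags(n - i)
               for i in range(1, n + 1))
```

Already for n = 5 there are 29,281 structures, which illustrates the super-exponential growth of the search space.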

The above formula clearly shows that the size of the search space is exponential, so it is impossible to use an exhaustive search to find the best structure. As already mentioned, dependency-based and score-based methods are the two main structure learning approaches, and the reasons why score-based approaches are more popular than dependency-based ones were discussed above. In this article, we are only interested in the score-based approach.

Score-based approaches define structure learning as an optimization problem: to find the structure that best fits the training data, they use a scoring criterion and a search algorithm. Several scoring criteria have been proposed, including the Bayesian information criterion (BIC), the minimum description length (MDL) score and the Bayesian Dirichlet (BDe) score (Heckerman et al. 1995). Since the search space is exponential and the search problem is NP-hard, heuristic methods are popular.

Among the heuristic methods, the genetic algorithm (GA) has received wide attention. However, one major drawback of this algorithm is its high computational complexity, especially as the number of random variables increases. Using PSO can resolve the challenge of computational complexity, but this method suffers from premature convergence when the diversity of the population is low. The proposed method exploits the advantages of GA and PSO together through a hybrid algorithm called the Breeding Swarms PSO (BSPSO) algorithm.

4 Proposed method

The input is a dataset represented as an \(m\times n\) matrix M, where m is the number of records and n is the number of variables. The output is a BN with n nodes that best fits the data. In the first step, the information in matrix M is modified by preprocessing. Next, the Breeding Swarms algorithm is used to find the best BN structure. Figure 2 shows the diagram of the proposed method; the most important steps are described in detail below.

4.1 Preprocessing

In the initial stage of preprocessing, continuous variables are discretized. For example, suppose age is a continuous variable; it can be discretized into less than 20, between 20 and 45, and more than 45. Next, if a field has missing values, the value 'U' is added to it and to the domain of its variable. After these steps, all variables have discrete domains without missing values.
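A minimal sketch of this preprocessing step for the age example; the bin labels and the 'U' marker are illustrative choices, not prescribed by the method.

```python
# Discretize the continuous variable 'age' using the cut points from
# the example above; None stands for a missing value and is mapped to
# the extra state 'U' that is added to the variable's domain.
def discretize_age(age):
    """Map a continuous age (or None for missing) to a discrete state."""
    if age is None:
        return "U"        # extra state for missing values
    if age < 20:
        return "<20"
    if age <= 45:
        return "20-45"
    return ">45"
```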

4.2 Breeding Swarms algorithm

Breeding Swarms, or GA/PSO, is a hybrid algorithm proposed by Settles (Mattew and Terence 2006). This algorithm combines the updating rules of PSO, namely particle velocity and particle position, with the operators of GA. Breeding between particles increases the probability of producing more desirable particles in the search space. Since only a portion of the population is constructed by GA, the computational complexity stays low, while the diversity of the population prevents premature convergence. \(\psi \), the only additional parameter, is called the breeding rate and determines the portion of the population that is bred; its value depends on the problem and is set experimentally. The algorithm is summarized in Fig. 3. Suppose N is the size of the population. The positions of \(N(1-\psi )\) particles are updated by the PSO rules and form \(N(1-\psi )\) particles of the next generation. The remaining particles are produced by GA from the whole current population.

Fig. 3
figure 3

The Breeding Swarm PSO algorithm

4.2.1 Particle formation

Fig. 4
figure 4

The corresponding particle of a simple BN

Suppose there are N random variables. Since each BN structure is a directed acyclic graph, it can be represented by an adjacency matrix; therefore, each particle can be considered an \(N\times N\) matrix. Figure 4 displays the particle corresponding to an example BN.
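The particle encoding can be sketched as follows; the helper name and the example arcs are illustrative.

```python
# A particle is the n x n 0/1 adjacency matrix of a DAG: entry (i, j)
# is 1 exactly when there is an arc from variable i to variable j.
def particle_from_edges(n, edges):
    """Build an adjacency-matrix particle from a list of (i, j) arcs."""
    m = [[0] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = 1
    return m
```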

4.2.2 Initial population

A population is composed of N particles and can be randomly generated. Each acyclic graph with n nodes can be a candidate structure for the BN.

4.2.3 Selection operator

Binary tournament selection is used to generate \(N\psi \) particles of the next generation. To select each parent, two particles are randomly drawn from the whole population, and the particle with the better score is selected. Under this strategy, weak particles still have a considerable chance of survival.
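Binary tournament selection can be sketched in a few lines; here `score` is a placeholder for the Bayesian scoring metric.

```python
import random

# Binary tournament selection: each parent is the better of two
# particles drawn at random from the whole population.
def tournament_select(population, score, k):
    """Pick k parents by repeated binary tournaments."""
    parents = []
    for _ in range(k):
        a, b = random.sample(population, 2)
        parents.append(a if score(a) >= score(b) else b)
    return parents
```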

4.2.4 Recombination operator (crossover)

In the proposed method, each chromosome is coded in matrix form. Single-point crossover is adopted, with probability \(P_{c}\) determining whether the operation is performed. The operator randomly selects a crossover point, and the bit strings after that point are swapped between the two parents.

This recombination may produce invalid offspring with cycles in the corresponding graphs. Therefore, checking the validity of the two produced offspring and correcting them is indispensable.
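A sketch of this crossover over matrix-coded particles, assuming each matrix is flattened row by row into a bit string before the tails are swapped; the cycle repair described next would then run on the offspring.

```python
import random

# Single-point crossover of two matrix-coded particles: flatten each
# n x n matrix to a bit string, swap the tails after a random cut
# point, and reassemble the two offspring matrices.
def crossover(m1, m2):
    """Single-point crossover of two equally sized 0/1 matrices."""
    n = len(m1)
    a = [bit for row in m1 for bit in row]
    b = [bit for row in m2 for bit in row]
    point = random.randint(1, n * n - 1)   # cut point strictly inside the string
    a[point:], b[point:] = b[point:], a[point:]
    rebuild = lambda flat: [flat[i * n:(i + 1) * n] for i in range(n)]
    return rebuild(a), rebuild(b)
```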

Fig. 5
figure 5

Types of graph cycles. (1) length one, (2) length two and (3) length three and more

Fig. 6
figure 6

An example of computing the new position of a particle

As illustrated in Fig. 5, the cycles in graphs can be classified into three groups. Repairing invalid offspring is done according to the type of cycle:

  • Cycle of length one: a cycle from a node to itself, i.e., some entries on the main diagonal of the adjacency matrix are one. Since both parents are valid and their main diagonals are zero, applying the crossover operator cannot create any cycle of length one in the new offspring.

  • Cycle of length two: as shown in Fig. 5, a bi-directed edge between two nodes, i.e., two entries with value one that are symmetric with respect to the main diagonal. In this case, to improve the efficiency of the proposed method, the score of each edge is computed individually based on the Bayesian criterion, and the edge with the minimum score is deleted.

  • Cycle of length three or more: Warshall's algorithm, of order \(O(n^{3})\), is used to find cycles of length more than two. Although this algorithm was proposed to find the shortest path between two nodes, it can also be used to find the shortest path from a node back to itself. In the proposed method, among the edges that construct the cycle, the edge with the least score is deleted.
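The cycle test underlying the third case can be sketched with Warshall's \(O(n^{3})\) transitive-closure algorithm: a node i lies on a cycle exactly when i can reach itself.

```python
# Warshall's transitive-closure algorithm over a 0/1 adjacency matrix;
# a nonzero diagonal entry of the closure reveals a cycle.
def has_cycle(adj):
    """Return True iff the directed graph given by adj contains a cycle."""
    n = len(adj)
    reach = [row[:] for row in adj]   # reachability matrix, updated in place
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return any(reach[i][i] for i in range(n))
```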

4.2.5 Mutation operator

This operator randomly inserts an edge into the structure (graph), deletes an edge, or selects an edge and reverses its direction. The insertion of a new edge may create a cycle in the mutated chromosome; in this case, the inserted edge is deleted.
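A sketch of the mutation operator. The text only specifies undoing a cycle-creating insertion; undoing a cycle-creating reversal as well is an assumption made here to keep every particle a valid DAG. The acyclicity test is passed in (e.g. Warshall's algorithm).

```python
import random

# Apply one random mutation (insert, delete, or reverse an edge) to an
# adjacency matrix in place; changes that create a cycle are undone.
def mutate(adj, is_cyclic):
    n = len(adj)
    i, j = random.sample(range(n), 2)
    op = random.choice(["insert", "delete", "reverse"])
    if op == "insert" and adj[i][j] == 0:
        adj[i][j] = 1
        if is_cyclic(adj):
            adj[i][j] = 0                   # undo cycle-creating insertion
    elif op == "delete":
        adj[i][j] = 0                       # deletion never creates a cycle
    elif op == "reverse" and adj[i][j] == 1:
        adj[i][j], adj[j][i] = 0, 1
        if is_cyclic(adj):
            adj[i][j], adj[j][i] = 1, 0     # undo cycle-creating reversal (assumption)
    return adj
```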

4.2.6 Update the position of particle

This operator computes the new position of each particle by applying its velocity (V) to its current position. Each point in the search space is a directed acyclic graph (DAG) represented in matrix form; hence, its velocity is also denoted as an \(n\times n\) matrix, where n is the number of random variables. Let V be the set of velocities \(V=\{v_{1},v_{2},\ldots ,v_{m}\}\), where each entry \(v_{ij}^{m}\in \{0,1,-1\}\) and m is the size of the population. If \(v_{ij}=1\), an edge is added from i to j; if \(v_{ij}=-1\), the edge from i to j is deleted; and \(v_{ij}=0\) means the edge between i and j is unchanged. \(v_{i}\) is computed from the current velocity, the position of the best particle, and the best position that the ith particle has seen so far.

The insertion of an edge may create a cycle in the corresponding graph of a particle. In this case, in order to decrease the computational complexity, the inserted edge is removed. Figure 6 illustrates an example of applying a velocity to the current position of a particle.
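The position update can be sketched as follows, assuming velocity entries in \(\{-1,0,1\}\) are applied entry by entry; the acyclicity test (e.g. Warshall's algorithm) is passed in.

```python
# Apply a velocity matrix v to an adjacency matrix in place:
# v[i][j] == 1 adds the arc i -> j, v[i][j] == -1 deletes it, and
# v[i][j] == 0 leaves it unchanged; a cycle-creating addition is
# simply discarded, as described above.
def apply_velocity(adj, v, is_cyclic):
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if v[i][j] == 1 and adj[i][j] == 0:
                adj[i][j] = 1
                if is_cyclic(adj):
                    adj[i][j] = 0          # discard cycle-creating addition
            elif v[i][j] == -1:
                adj[i][j] = 0              # delete the arc from i to j
    return adj
```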

Fig. 7
figure 7

The pseudo-code of the BSPSO method

The proposed method is summarized in Fig. 7.

5 Experimental results

In order to evaluate the proposed method, several real-world problems are used, and the validity of the method is demonstrated through computer simulation.

  1. (A)

    Datasets In this experiment, seven benchmarks are used, which are described as follows:

    1. (1)

      ASIA network: This network was presented by Lauritzen and Spiegelhalter (Lauritzen and Spiegelhalter 1988) and is used as a basic model for analyzing the performance of structure learning algorithms. It is composed of 8 nodes and 8 edges and is considered a small network.

    2. (2)

      Car diagnosis problem: This problem was introduced by Norsys (NorsysSoftwareCorp) and is a simple example of a belief network. The reason why a car does not start is inferred from the spark plugs, headlights, main fuse, and so on. The BN has 15 nodes and 17 edges. All nodes of the network take discrete values; some can take three discrete states, and the others two. A database of two thousand cases, generated with the Netica tool, is used to train the BN.

    3. (3)

      Child network

      This network represents a diagnosis model of heart diseases in newborn babies and is composed of 20 nodes and 25 edges (Spiegelhalter and Cowell 1992).

    4. (4)

      Mildew network

      This network is designed for determining the amount of fungicides to use against mildew of wheat (Jensen and Jensen 1996) and consists of 35 nodes and 46 edges.

    5. (5)

      Insurance network

      Insurance is a medium-sized network with 27 nodes and 52 edges which was proposed for estimating the risks of car insurance (Binder 1997).

    6. (6)

      ALARM network: A Logical Alarm Reduction Mechanism (ALARM) is a medical diagnostic system for patient monitoring. It is a complex belief network with 37 nodes and 46 edges (Beinlich 1989). As in the previous example, the database is generated with the Netica tool.

    7. (7)

      Hailfinder network

      The network was presented for forecasting stormy weather in northeastern Colorado (Abramson 1996). Hailfinder is composed of 56 nodes and 66 edges.

      It should be noted that all of the introduced benchmark networks are available at http://www.bnlearn.com/bnrepository/. The goal of structure learning for a BN is to obtain structures that are close to the desired networks.

  2. (B)

    Evaluation criteria

    Quantitative and qualitative criteria are the two main standards used to analyze the performance of the output results. The former refers to the Bayesian Dirichlet metric used in the K2 algorithm (Cheng et al. 1997; Cowell 1998), whose score lies between 0 and \(-\infty \). The latter compares the output with the desired structure and evaluates the output structure based on their differences. The structure learning factor (SLF) and topology learning factor (TLF) (Colace et al. 2010) are two popular qualitative criteria, defined as follows:

    $$\begin{aligned} \left\{ {\begin{array}{l} \text {SLF}=\frac{{\text {TC}}}{{\text {TE}}} \\ \text {TLF}=\frac{{\text {TC}}+{\text {IE}}}{{\text {TE}}}\\ \end{array}} \right. \end{aligned}$$
    (3)

    In the above formula, TC is the number of edges that are correctly determined, TE is the number of edges in the target graph and IE is the number of edges that are determined correctly but with the wrong direction.
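Eq. (3) translates directly into code; the function names are illustrative, and TC, IE and TE are the counts defined above.

```python
# Direct transcription of Eq. (3) from the counts TC, IE and TE.
def slf(tc, te):
    """Structure learning factor: fraction of target edges fully recovered."""
    return tc / te

def tlf(tc, ie, te):
    """Topology learning factor: also credits edges recovered with reversed direction."""
    return (tc + ie) / te
```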

  3. (C)

    Simulation results

    In order to set the parameters of the proposed method optimally, several experiments are done based on the seven available databases, with the default parameters described in Table 1.

Table 1 The default parameters in experiments
Fig. 8
figure 8

The diagram of breeding rate for ASIA Bayesian network

  • Breeding rate:

    In this experiment, the goal is to find the optimal breeding rate for each problem. For this purpose, all other parameters are fixed. The Bayesian Dirichlet metric mentioned previously is used to evaluate the performance of the algorithm. The reported values are averages over 10 runs for each problem.

    As shown in Fig. 8, the scores obtained when the breeding rate lies in the range [0.1, 0.5] are better than in the second part, i.e., [0.5, 0.9]. Therefore, it can be said that when GA dominates PSO, the performance of the method decreases. On the other hand, when the breeding rate is less than 0.3, the performance also decreases, which shows that the best range for the breeding rate is [0.3, 0.5]. The experimental results in Table 2 demonstrate that the algorithm behaves differently on the other databases, where performance is better when the breeding rate is greater than 0.5. The experiment reveals that the number of variables in a database has a direct impact: when the BN is small, the diversity of the population relative to its size is adequate, but as the size of the BN increases, more diversity is needed in the population so that the search space is explored more accurately. The experimental results show that the range [0.6, 0.8] is suitable for the Alarm and Car problems.

  • Size of the population

    The other parameter that should be set optimally is the size of the population. For this purpose, the breeding rate is set based on the results of the previous experiment. In this test, the size of the population is varied from 5 to 40 in steps of 5 particles. The algorithm is run 10 times and the average of the obtained scores is computed.

    The results are listed in Table 3. As shown, in the Car and Asia problems, a population size of 20 performs well, whereas in the Alarm network, the scores improve as the population grows. However, the growth of the population leads to more computational complexity. Therefore, it is appropriate to determine the size based on the available resources.

  • Performance evaluation

    For further assessment, the proposed method is run 30 times with the optimal parameters.

    The results are depicted in Figs. 9, 10 and 11 for Asia, Car and Alarm networks, respectively.

Table 2 Experimental results of breeding rate setting in three test Bayesian networks
Table 3 Experimental results of population size setting in three test Bayesian networks
Fig. 9
figure 9

Analysis of the efficiency of the proposed method for the ASIA Bayesian network

Fig. 10
figure 10

Analysis of the efficiency of the proposed method for the Car Bayesian network

Formulas (4)–(7) are used to compare the results with the target networks.

$$\begin{aligned}&\frac{\text {Total number of TC edges }}{\text {Total number of edges}} \end{aligned}$$
(4)
$$\begin{aligned}&\frac{\text {Total number of IE edges }}{\text {Total number of edges}} \end{aligned}$$
(5)
$$\begin{aligned}&\frac{\text {Total number of ME edges }}{\text {Total number of edges}} \end{aligned}$$
(6)
$$\begin{aligned}&\frac{\text {Total number of EE edges }}{\text {Total number of edges}} \end{aligned}$$
(7)

In the above relations, TC is the number of edges correctly added between the same nodes as in the target network, ME is the number of edges missed relative to the target network, IE is the number of edges that connect the correct nodes but are reversed, and EE is the number of edges that are wrongly added.
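Given the learned and target edge sets, the four counts behind formulas (4)–(7) can be sketched as follows; edges are (i, j) pairs and the function name is illustrative.

```python
# Compute TC, IE, ME and EE by comparing two sets of directed edges.
def edge_counts(learned, target):
    """Return (TC, IE, ME, EE) for the learned graph versus the target graph."""
    learned, target = set(learned), set(target)
    rev_learned = {(j, i) for (i, j) in learned}
    rev_target = {(j, i) for (i, j) in target}
    tc = len(learned & target)                 # right nodes, right direction
    ie = len(rev_learned & target)             # right nodes, reversed direction
    me = len(target - learned - rev_learned)   # target edges missed entirely
    ee = len(learned - target - rev_target)    # spurious extra edges
    return tc, ie, me, ee
```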

Fig. 11
figure 11

Analysis of the efficiency of the proposed method for the Alarm Bayesian network

Table 4 The average of correct, missed, reverse and wrong edges over 30 times executions

For a more accurate assessment, the proposed method is run 30 times for each network, and the averages of the above relations are listed in Table 4.

Fig. 12
figure 12

Comparison of the efficiency of the proposed method with the REST and CONAR methods

Fig. 13
figure 13

Comparison of the normalized graph errors of our method with the CGA and SGA methods

These tests highlight that the resulting networks are close to the target networks, so it can be concluded that the proposed method performs well.

Table 5 Comparison of SLF measure among the proposed method and other nine algorithms on seven datasets
  • Comparison with other methods

    The proposed method has been compared with thirteen well-known methods, selected to be applicable and to cover the different classes of data-based Bayesian structure learning. These methods are REST (Cowie et al. 2007), CONAR (Cowie et al. 2007), CGA (Larrañaga 1996), SGA (You 2001), FGS (Chickering 2002), FCI (Colombo and Maathuis 2014), PC (Spirtes et al. 2000), MMHC (Tsamardinos et al. 2006), TPDA (Cheng et al. 1997), SCA (Friedman et al. 1999), minimum spanning tree (MST) (BayesiaLab 6.0.2), EQ (Munteanu and Bendou 2001) and Taboo (BayesiaLab 6.0.2).

    Three evaluation metrics are used: the structure learning factor (SLF), the topology learning factor (TLF) and the graph error criterion; the first two were defined in Eq. (3).

    Furthermore, the graph error is the total number of errors, i.e., reversed, missing and extra edges.

    As mentioned above, among the compared methods, REST and CONAR are based on particle swarm optimization, while CGA and SGA use the genetic algorithm. Since the proposed method is a hybrid strategy combining GA and PSO, the comparison with these four methods is presented separately for further assessment.

    The diagram in Fig. 12 shows the results for the Asia, Car diagnosis and Alarm networks. Because of the stochastic behavior of GA and PSO, the evaluation metrics are averaged over 30 runs.

    The SLF and TLF results show that the performance of the proposed method is higher than that of the other two methods. Also, in the Alarm Bayesian network, which is a challenging problem, the results of the presented method are acceptable. This result further strengthens our hypothesis that combining GA with PSO increases the chance of reaching the optimal solution and prevents early convergence to local optima.

To compare the proposed method with the GA-based approaches (CGA and SGA), the graph error criterion is used.

Figure 13 presents the normalized graph errors of the GA-based methods and the proposed method.

The first three datasets, which relate to the ASIA network, include 2000, 3000 and 5000 records, respectively. The second group of datasets, i.e., the fourth to sixth, relate to the Alarm network and include 3000, 5000 and 10,000 records, respectively. As expected, this experiment demonstrates that the graph errors of the proposed method are lower than those of the GA-based methods.

Table 6 Comparison of TLF measure among the proposed method and other nine algorithms on seven datasets
Table 7 Comparison of graph error among the proposed method and other nine algorithms on seven datasets

Next, the other nine methods have been compared according to the previous assessment metrics, and the results are presented separately in Tables 5, 6 and 7. To give a complete report, seven datasets are generated from the introduced benchmark networks. PC, SCA, MMHC and TPDA have been run using the Causal Explorer system (Statnikov 2010). For the FCI and FGS algorithms, the TETRAD project (Scheines 1998) has been used. Finally, BayesiaLab software (BayesiaLab 6.0.2) has been used to run the Taboo, minimum spanning tree and EQ methods. It should be noted that the parameters of all the algorithms have been set to their default values. In each row of the tables, the best value for each metric is shown in bold. Finally, we report the best results of the proposed method over 30 runs.

The SLF and TLF measures in Tables 5 and 6 demonstrate that the proposed method outperforms the other approaches in most cases: it achieves the highest SLF on five datasets (asia, child, alarm, mildew and insurance) and the highest TLF on five datasets (asia, car, child, mildew and insurance).

However, there are no major differences between the BSPSO method and the other approaches in the remaining cases.

Analyzing the graph error measures in Table 7 shows that EQ and Taboo have smaller graph errors than the other approaches in most cases. The main reasons for their efficiency may be that the former switches from a greedy strategy to Taboo search to avoid local minima, and the latter reduces the search space by searching over equivalence classes of Bayesian networks. BSPSO achieves the best graph error score on four datasets (Asia, Car, Alarm and Hailfinder), and its results are competitive in the other cases. Some methods, such as PC, MMHC and SCA, have poor outcomes; setting their parameters more accurately for the more complex networks and increasing the number of samples in the datasets might improve their performance.

6 Conclusion

In this paper, the Breeding Swarm PSO (BSPSO) algorithm, an efficient swarm intelligence-based approach, has been used to solve the BN structure learning problem. BSPSO combines PSO and GA, which provides a synergy between their capabilities, helps to search the solution space more effectively, and avoids premature convergence.

In order to set parameters such as the breeding rate, the method was run several times under different conditions. To assess the performance of the proposed method, seven datasets were generated from seven real-world networks, and several experiments were carried out. Moreover, the method was compared with thirteen representative algorithms, four of which are GA- or PSO-based. The experimental results demonstrate that BSPSO clearly outperforms the four population-based approaches and also has promising performance against the other nine methods in finding the desired networks.

Future work will concentrate on more complex problems in learning BNs, e.g., hidden variables and multi-relational data.