1 Introduction

Bayesian networks (BNs) are popular within the AI probability and uncertainty community as a method of reasoning under uncertainty. From an informal perspective, BNs are directed acyclic graphs (DAGs), where nodes are random variables and arcs specify independence assumptions between these variables. After construction, a BN constitutes an efficient device for performing probabilistic inference [1].

One of the most important challenges in this field is training the BN that best reflects the dependence relations in a database of cases. Training an optimal structure for BNs is difficult because of the large number of possible DAG structures, even for a small number of nodes to connect. It has been proven that training a BN from data is an NP-hard problem [2]. Different algorithms for training BNs from data have been developed; generally, there are two main approaches: the constraint-based approach and the search-and-score approach. In algorithms following the first approach [3–7], the existence of certain conditional dependencies is estimated from data. This estimation is performed using statistical or information-theoretic measures. Constraints of conditional independence are propagated throughout the graph, and the networks that are inconsistent with them are eliminated from further consideration. Finally, only the statistically equivalent networks consistent with the conditional independence tests remain.

On the other hand, algorithms following the search-and-score approach [1], [8–15] attempt to identify the network that maximizes a score function indicating how well the network fits the data. One such score metric is the Bayesian Information Criterion, which is explained in Section 2.1. Algorithms in this category search the space of all possible structures for the one that maximizes the score using greedy, local, or other search algorithms.

In this paper, a new algorithm is proposed for structure training in BNs. The proposed algorithm, named BNC-VLA, follows the search-and-score approach and uses learning automata to search for the optimal structure among all possible structures. Learning automata are chosen due to their learning capability and negligible computation cost. BNC-VLA differs from the two prior learning automata-based algorithms proposed in [10] and [11]. Two main differences can be mentioned. First, in [10] and [11], learning automata are assigned to possible edges; therefore, the number of learning automata for a network with n random variables is \(\frac {1}{2}n\times (n-1)\) in [10] or n×(n−1) in [11], whereas in BNC-VLA learning automata are assigned to random variables, so only n learning automata are used. Second, in [10] and [11], as in many other algorithms, cycle removal is done in a separate phase, which increases the running time. In our proposed algorithm, BNC-VLA, the cycle-avoidance process runs during the BN construction phase, and no additional phase is needed. To achieve this, the action sets of the learning automata must be variable, so we use variable action-set learning automata. At each stage, a BN is constructed based on the edges selected by the learning automata, and it is evaluated using a scoring function and a training dataset. As the algorithm proceeds, it tries to find a BN structure with a higher score. The guided search by learning automata shrinks the search space, so the optimal BN structure is found in less computation time.

The remainder of this paper is organized as follows. Section 2 explains BNs and structure-training preliminaries. Learning automata are described briefly in Section 3. In Section 4, the proposed learning automata-based algorithm for training the BN structure is explained. Time complexity analysis and convergence results are presented in Section 5. Experimental results obtained by BNC-VLA are reported in Section 6. Finally, Section 7 concludes the paper.

2 Bayesian networks and structure training

A BN describes the joint probability distribution over a set of random variables by defining a series of marginal and conditional independences [16]. From an informal perspective, a BN is a directed acyclic graph (DAG), where nodes are random variables, and arcs specify the independence assumptions that must hold between the random variables. Given prior probabilities for the nodes with no parents and conditional probabilities for all other nodes given every possible combination of their parents, we can specify the probability distribution of a BN. The joint probability of any particular combination of n random variables can be written according to (1),

$$ P\left( X_{1},\mathellipsis ,X_{n} \right)={\prod}_{i=1}^{n} {P(X_{i}\vert Parents\thinspace \left( X_{i} \right))} $$
(1)

where (X 1,…, X n ) is the set of random variables. Once the network is constructed, it is an efficient device for performing probabilistic inference. Nevertheless, the problem of constructing such a network remains [1]. Construction of a BN is separated into two training subtasks: structure training, which determines the topology of the network, and parameter training, which defines the numerical parameters (conditional probabilities) for a given network topology.
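To make the factorization in (1) concrete, the following sketch computes the joint probability of a full assignment as the product of each node's local conditional probability. The three-variable network, its CPT values, and all names are hypothetical illustrations, not taken from this paper.

```python
# Hypothetical 3-node BN: Rain -> WetGrass and Sprinkler -> WetGrass.
parents = {"Rain": [], "Sprinkler": [], "WetGrass": ["Rain", "Sprinkler"]}

# CPTs: map a tuple of parent values to P(node = True | parents).
cpt = {
    "Rain":      {(): 0.2},
    "Sprinkler": {(): 0.5},
    "WetGrass":  {(True, True): 0.99, (True, False): 0.9,
                  (False, True): 0.9, (False, False): 0.0},
}

def joint_probability(assignment):
    """P(X1, ..., Xn) = prod_i P(X_i | Parents(X_i)), as in (1)."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_true = cpt[node][pa_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

p = joint_probability({"Rain": True, "Sprinkler": False, "WetGrass": True})
# 0.2 * (1 - 0.5) * 0.9 = 0.09
```

Each factor touches only a node and its parents, which is what makes inference over the factored form tractable in practice.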

In this work, we focus on structure training based on the search-and-score approach. Algorithms following this approach have two main components: a search procedure and a scoring metric. The search procedure determines an algorithm to search throughout all possible networks and select one, and the scoring metric evaluates the quality of the selected network. In the rest of this section, we briefly describe both of these components.

2.1 Scoring metric

Various metrics have been proposed for evaluating the structure of a BN, such as the Bayesian metric, Minimum Description Length, and the Bayesian Information Criterion, to mention a few. The Bayesian metric measures the quality of the BN by computing a marginal likelihood of the BN with respect to the given data and inherent uncertainties [17], [18]. Minimum Description Length is based on the assumption that the number of regularities in the data encoded by a model is somehow proportional to the amount of data compression allowed by the model [10]. Finally, the Bayesian Information Criterion (BIC), which is used as the scoring metric in this work, is a criterion for model selection among a finite set of models, based on the likelihood function. The BIC metric for a constructed graph G is given by the following equation,

$$ BIC=\log_{2}{P\left( D\thinspace\vert\thinspace {G,\hat{\theta}_{G}}\right)} -\frac{\vert n\vert }{2}\log_{2}{(M)} $$
(2)

where D is the set of training samples, \(\hat {\theta _{G}}\) is the Maximum Likelihood (ML) estimate of G’s parameters, |n| is the number of free parameters of G, and M is the number of samples in the dataset. If all variables are multinomial, then by considering r i as the number of possible values of x i , q i as the number of configurations of x i ’s parents, N i j as the number of observations of x i in which its parents’ configuration is j, and finally N i j k as the number of observations with x i = k and parents’ configuration j, (2) can be rewritten as:

$$\begin{array}{@{}rcl@{}} BIC&=&{\sum}_{i=1}^{n} {\sum}_{j=1}^{q_{i}} {\sum}_{k=1}^{r_{i}} N_{ijk}\thinspace{\log}_{2}\left( \frac{N_{ijk}}{N_{ij}} \right)\\ &&-\frac{{\log}_{2}M}{2} {\sum}_{i=1}^{n} {q_{i}(r_{i}-1)} \end{array} $$
(3)

and in this case, BIC is converted to a simple counting problem.

2.2 Search procedure

Given a scoring metric, the problem of learning the structure of a Bayesian network is NP-hard, and there is no polynomial-time algorithm for finding the best network structure with respect to most scoring metrics [2]. Usually, a simple greedy algorithm is used to build the network [12]. In each search step, the greedy algorithm adds the edge that most improves the quality of the current network, until no further improvement is possible. The initial network structure can be a graph with no edges; furthermore, the algorithm can take advantage of prior information, such as using the best tree computed by the polynomial-time maximum branching algorithm [17], [19]. Since a Bayesian network is an acyclic graph, after each search step the graph structure must be validated, and all cycles must be removed from the constructed graph. To the best of our knowledge, Genetic Algorithms [1, 8], Hill-Climbing [12], Simulated Annealing [15], A* search [13], Ant Colony Optimization [14], and Learning Automata [10], [11] have been used in the search procedure to build the network.

3 Learning automata

A learning automaton [20], [21] is an adaptive decision-making unit that improves its performance by learning how to choose the optimal action from a finite set of allowed actions through repeated interactions with a random environment. An action is chosen at random based on a probability distribution kept over the action set, and at each instant the chosen action serves as the input to the random environment. The environment responds to the taken action with a reinforcement signal. The action probability vector is then updated based on this reinforcement feedback. The objective of a learning automaton is to find the optimal action from the action set so that the average penalty received from the environment is minimized.

The environment can be described by a triple E={α, β, C}, where α={α 1, α 2 ,…, α r } represents the finite set of inputs, β={β 1, β 2 ,…, β m } denotes the set of values that can be taken by the reinforcement signal, and C={c 1, c 2 ,…,c r } denotes the set of penalty probabilities, where element c i is associated with action α i . If the penalty probabilities are constant, the random environment is said to be a stationary random environment; if they vary with time, the environment is called non-stationary. Depending on the nature of the reinforcement signal β, environments can be classified into P-model, Q-model, and S-model. Environments in which the reinforcement signal can only take the two binary values 0 and 1 are referred to as P-model environments. Another class of environments allows a finite number of values in the interval [0, 1] to be taken by the reinforcement signal; such environments are referred to as Q-model environments. In S-model environments, the reinforcement signal lies in the interval [a, b].

Learning automata can be classified into two main families [20, 23, 24]: fixed-structure learning automata and variable-structure learning automata. Variable-structure learning automata are represented by a triple 〈β, α, T〉, where \(\underline {\beta }\) is the set of inputs, \(\underline {\alpha }\) is the set of actions, and T is the learning algorithm. The learning algorithm is a recurrence relation used to modify the action probability vector. Let \(\alpha _{i}(k)\epsilon \underline {\alpha }\) and \(\underline {p}(k)\) denote the action selected by a learning automaton and the probability vector defined over the action set at instant k, respectively. At each instant k, the action probability vector \(\underline {p}(k)\) is updated by the linear learning algorithm given in (4) if the selected action α i (k) is rewarded by the random environment, and it is updated as given in (5) if the taken action is penalized.

$$ p_{j}\left( k+1 \right)=\left\{ {\begin{array}{l} p_{j}\left( k \right)+a\left[ 1-p_{j}\left( k \right) \right]\thinspace \thinspace j=i \\ \left( 1-a \right)p_{j}\left( k \right)\thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \forall j\ne i \\ \end{array}} \right. $$
(4)
$$ p_{j}\left( k+1 \right)=\left\{ {\begin{array}{l} (1-b)p_{j}\left( k \right)\thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace \thinspace j=i \\ \left( \frac{b}{r-1} \right)+\left( 1-b \right)p_{j}\left( k \right)\thinspace \thinspace \thinspace \forall j\ne i \\ \end{array}} \right. $$
(5)

where a and b denote the reward and penalty parameters, determining the amount of increase and decrease of the action probabilities, respectively, and r is the number of actions that can be taken by the learning automaton. If a = b, the recurrences (4) and (5) are called the linear reward-penalty (L RP ) algorithm; if a>b, they are called linear reward-ε-penalty (L RεP ); and finally, if b = 0, they are called linear reward-inaction (L RI ). In the latter case, the action probability vector remains unchanged when the taken action is penalized by the environment.
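The updates (4) and (5) can be sketched as follows; the reward and penalty parameters and the toy probability vector are illustrative only.

```python
def update(p, i, rewarded, a=0.1, b=0.05):
    """Apply (4) on reward or (5) on penalty to probability vector p
    after action i was taken; setting b = 0 gives the L_RI scheme."""
    r = len(p)
    if rewarded:                                                 # eq. (4)
        return [pj + a * (1 - pj) if j == i else (1 - a) * pj
                for j, pj in enumerate(p)]
    return [(1 - b) * pj if j == i                               # eq. (5)
            else b / (r - 1) + (1 - b) * pj
            for j, pj in enumerate(p)]

p = update([0.25, 0.25, 0.25, 0.25], 0, rewarded=True, a=0.1)
# p[0] = 0.25 + 0.1 * 0.75 = 0.325; each other entry shrinks to 0.225
```

Both branches preserve the probability-vector property: the entries still sum to one after every update.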

3.1 Variable action-set learning automata

A variable action-set learning automaton [20] is an automaton in which the number of actions available at each instant changes with time. Such an automaton has a finite set of r actions, \(\underline {\alpha }=\{\alpha _{1}, \alpha _{2},\textellipsis , \alpha _{r}\}\). A={A 1, A 2,…, A m } denotes the set of action subsets, and A(k)⊆α is the subset of all the actions that can be chosen by the learning automaton at instant k. The selection of a particular action subset is made at random by an external agency according to the probability distribution ψ(k)={ψ 1(k), ψ 2(k),…, ψ m (k)} defined over the possible subsets of the actions, where ψ i (k)=prob[A(k) = A i |A i 𝜖 A, 1≤i≤2 r −1].

\(\hat {p}_{i}\left (k \right )=\text {prob}[\alpha (k)=\alpha _{i}|A(k),\alpha _{i}\epsilon A(k)]\) denotes the probability of choosing action α i , conditioned on the event that the action subset A(k) has already been selected and α i 𝜖 A(k). The scaled probability \(\hat {p}_{i}\left (k \right )\) is defined as:

$$ \hat{p}_{i}\left( k \right)=\frac{p_{i}(k)}{K(k)} $$
(6)

where \(K\left (k \right )={\sum }_{\alpha _{i}\in A(k)} {p_{i}(k)} \) is the sum of the probabilities of the actions in subset A(k), and p i (k)=prob[α(k)=α i ].

The procedure of choosing an action and updating the action probabilities in a variable action-set learning automaton is described as follows. Let A(k) be the action subset selected at instant k. Before choosing an action, the probabilities of all the actions in the selected subset are scaled as in (6). Then the automaton randomly selects one of its available actions according to the scaled action probability vector \(\hat {p}\left (k \right )\). Depending on the response received from the environment, the learning automaton updates its scaled action probability vector; note that only the probabilities of the available actions are updated. Finally, the action probability vector of the chosen subset is rescaled as \(p_{i}\left (k+1 \right )=\hat {p}_{i}\left (k+1 \right )K(k)\), for all α i 𝜖 A(k).
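The scale, select, update, and rescale cycle just described can be sketched as below; the probability vector, the available subset, and the L RI reward step are illustrative assumptions.

```python
import random

def step(p, available, a=0.1):
    """One instant of a variable action-set automaton: scale the
    probabilities of the available actions as in (6), select an action,
    reward it under L_RI, and rescale back by K(k)."""
    K = sum(p[i] for i in available)                     # K(k)
    scaled = {i: p[i] / K for i in available}            # eq. (6)
    choice = random.choices(list(scaled), weights=list(scaled.values()))[0]
    for i in scaled:                                     # L_RI reward of choice
        scaled[i] = (scaled[i] + a * (1 - scaled[i]) if i == choice
                     else (1 - a) * scaled[i])
    q = p[:]                                             # p_i(k+1) = p_hat_i * K(k)
    for i in scaled:
        q[i] = scaled[i] * K
    return choice, q
```

Because only the available entries are scaled and then multiplied back by K(k), the probabilities of the disabled actions are untouched and the full vector still sums to one.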

4 Learning automata-based BN structure training algorithms

In this section, we propose a new learning automata-based structure-training algorithm for BNs, called BNC-VLA. This algorithm increases the quality of the constructed BNs and decreases the execution time for constructing them. Overall, in comparison with other search-based algorithms such as Genetic Algorithms [1, 8], Hill-Climbing [12], Simulated Annealing [15], A* search [13], Ant Colony Optimization [14], and learning automata-based algorithms [10, 11], BNC-VLA shows superior results. In our algorithm, a team of learning automata is assigned to the random variables; they choose a network from all possible networks, and then a scoring function evaluates the chosen network. A reinforcement signal is produced based on the given score, and finally the action probability vector of every automaton is updated according to the received signal. Pseudocode of BNC-VLA is given in Algorithm 1.

[Algorithm 1: Pseudocode of BNC-VLA]

Let n denote the number of random variables. By assigning a learning automaton to each random variable, we have a team of n learning automata A={A 1, A 2,…, A n } with a set of action-sets \(\underline {\alpha }=\{\underline {\alpha }_{1},\underline {\alpha }_{2},\mathellipsis ,\thinspace \underline {\alpha }_{n}\}\), in which \(\underline {\alpha }_{i}=\{\alpha _{i1},\alpha _{i2},\mathellipsis ,\alpha _{ij},\mathellipsis ,\alpha _{ir_{i}}\}\) defines the set of actions that can be taken by learning automaton A i , with r i =(n−1). In a network with n random variables, the maximum number of directed edges from a node to the other nodes is (n−1); therefore, each learning automaton has (n−1) actions in its action set, corresponding to the possible edges. Each learning automaton can be in one of two states, active or passive; all learning automata are initially set to the passive state. BNC-VLA consists of a number of stages. Stage k of BNC-VLA is briefly described in the following steps:

  1. BN construction

    Choose one of the passive automata at random, and mark it as active.

    Repeat the following until there are no more passive automata to be activated.

    a) The chosen automaton chooses one of its actions according to its action probability vector. The selected action is then marked, and the automaton cannot choose it again in the current stage.

    b) Each learning automaton changes its number of actions (i.e., scales its action probability vector) by disabling the actions that may cause a cycle during the construction of the BN (the process of disabling actions is described later).

    c) Choose one of the automata at random and, if it is passive, mark it as active. The chosen automaton may be passive or active; however, active automata that have no unmarked actions cannot be chosen.

  2. Dynamic threshold computing

    Let us assume that BN τ k is constructed at stage k. The average score of all the networks constructed up to stage k is computed as a dynamic threshold T k :

    $$ T_{k}=\frac{1}{k}{\sum}_{i=1}^{k} {BIC(\mathrm{\tau }_{\mathrm{i}}\mathrm{)}} $$
    (7)

    where BIC(τ i ), defined using (3), denotes the score of the network BN τ i constructed at stage i.

  3. Update action probability vectors

    At each stage k, if the score of the constructed network, BIC(τ k ), is greater than or equal to the dynamic threshold T k , then the actions chosen by the automata are rewarded; otherwise they are penalized. If a learning automaton has been chosen more than once, it has more than one selected action; however, only the best action participates in the rewarding and penalizing processes. To find the best selected action of an automaton, for each selected edge, which is associated with one action, the mutual information between the nodes is computed using (8) and the training dataset.

    $$ I(ij)={\sum}_{x_{i},x_{j}} {P\left( x_{i},x_{j} \right)log\frac{P(x_{i},x_{j})}{P\left( x_{i} \right)P(x_{j})}} $$
    (8)

    where x i and x j are the two corresponding nodes. The best action is the action with the highest mutual-information value. This approach tends to find simpler networks and speeds up convergence. Then each automaton updates its action probability vector using the L RI reinforcement scheme. Disabled actions are enabled again, and probability vectors are updated as described in Section 3.1 on variable action-set learning automata.

  4. Termination

    Steps 1, 2, and 3 are repeated until the stage number exceeds a pre-specified threshold Th k or the score of a constructed BN becomes greater than a certain pre-defined score BIC o p t . In the latter case, the network constructed just before the algorithm stops is the BN with the maximum expected score among all networks.

As mentioned earlier, at each stage the number of actions that can be taken by an automaton changes (or is reduced) with time in such a way that no cycle appears in the BN (see line 14 of the Algorithm). To avoid the formation of a cycle, BNC-VLA performs as follows. At each stage k, if the edge e (i,j) is chosen, the edge e (j,i) is removed and its corresponding action is temporarily disabled in the action-set of automaton A j . Let π i,j denote a path connecting random variable x i to random variable x j , and let \({p_{j}^{i}}\) and \({q_{j}^{i}}\) denote the choice probability of edge e (i,j) and path π i,j , respectively. BNC-VLA also removes each edge e (r,s), temporarily disabling its corresponding action in the action-set of automaton A r , if the edge e (i,j) is chosen and both paths π j,r and π s,i have already been constructed by the activated automata. This disabling process is repeated until the stop condition of Step 1 is met, which guarantees the cycle-freeness of the proposed BN construction algorithm. At each stage, the algorithm updates the action probability vectors twice: first when the selected actions are rewarded or penalized, and second when the disabled actions are enabled again at the end of the stage. In both cases, the action probabilities are updated as described in the previous section on variable action-set learning automata.
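The disabling rule can be restated in reachability terms: the action for edge e (r,s) must be disabled whenever a directed path from s back to r already exists in the partially constructed DAG, since adding r→s would then close a cycle. The following check is an illustrative restatement, not the paper's pseudocode.

```python
def reaches(adj, src, dst):
    """Depth-first search: is there a directed path src -> ... -> dst?"""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(adj.get(u, ()))
    return False

def enabled_actions(adj, n):
    """Edges (r, s) whose addition keeps the partial graph acyclic.
    (Edges already present are not filtered out here.)"""
    return {(r, s) for r in range(n) for s in range(n)
            if r != s and not reaches(adj, s, r)}

# Partial DAG with edges 0 -> 1 and 1 -> 2: any back-edge into the chain
# (e.g. 2 -> 0) is disabled, while 0 -> 2 stays available.
edges = enabled_actions({0: [1], 1: [2]}, 3)
```

Running the check once after every chosen edge keeps the construction cycle-free without a separate repair phase, which is the point of interleaving cycle avoidance with construction.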

4.1 Simple improvements

In this section, two improvements are proposed in order to speed up the convergence.

Method 1:

Usually, there is no prior information in learning automata, and action probability vectors are initialized uniformly. However, in the field of BNs we can obtain prior information about the variables’ dependencies, which can be used to initialize the probability vectors and thereby speed up convergence. To do this, the mutual information for each possible undirected edge is computed using (8) and the training dataset.

Let \({p_{i}^{j}}\) be the probability of choosing action (edge) \(\underline {\alpha }_{ij}\) by A i ; \({p_{i}^{j}}\) can be initialized as follows.

$$ {p_{i}^{j}}=\frac{I(i,j)}{{\sum}_{for\thinspace all\thinspace j\ne i} {I(i,j)} } $$
(9)

It is expected that the convergence speed of BNC-VLA is improved in this case.
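Method 1 can be sketched as below: empirical mutual information (8) is computed from counts, and each automaton's action probabilities are set in proportion to it as in (9). The four-sample dataset and the uniform fallback for an all-zero row are assumptions of the example.

```python
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information I(i, j) of (8), from counts."""
    M = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((n / M) * math.log2((n / M) / ((pi[x] / M) * (pj[y] / M)))
               for (x, y), n in pij.items())

def init_probabilities(data, n_vars):
    """p_i^j = I(i, j) / sum_{j != i} I(i, j), as in (9)."""
    p = {}
    for i in range(n_vars):
        mi = {j: mutual_information(data, i, j)
              for j in range(n_vars) if j != i}
        total = sum(mi.values())
        # fall back to a uniform vector when every I(i, j) is zero
        p[i] = {j: mi[j] / total if total > 0 else 1.0 / (n_vars - 1)
                for j in mi}
    return p

# Toy dataset: variables 0 and 1 are perfectly correlated, 2 is independent,
# so automaton 0 starts with all its mass on the action towards variable 1.
data = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
probs = init_probabilities(data, 3)
```

Edges with higher empirical dependence thus start with higher selection probability, which is exactly the bias Method 1 aims for.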

Method 2:

In BNC-VLA, all learning automata use the same learning rate, which remains unchanged during the execution of the algorithm. Such a learning rate gives equal importance to all possible edges that may appear in the BN. This may delay convergence to the optimal solution for some learning automata and accelerate convergence to a non-optimal solution for others. Here, we discuss a statistical method for adjusting the learning rate to speed up convergence. This method uses Maximum Likelihood (ML) estimation in BNs to adjust the learning rate during the algorithm. For each possible edge e (i,j), the probability of its being observed in the BN is computed by dividing the number of samples in which x j occurs with x i as its parent by the total number of samples. Then, the learning rate is increased for the edges that have higher ML values than the others. In this way, the convergence speed of BNC-VLA may be increased.

5 Theoretical analysis

In this section, time complexity analysis and convergence results are presented.

5.1 Time complexity analysis

Lemma 1

Let iter be the number of iterations and n the number of random variables; then the time complexity of the proposed algorithm is O(iter×n 2 ).

Proof

The inner loop of BNC-VLA, in Step 1, executes n×(n−1) times in the worst case and n times in the best case, so the upper bound on its time complexity is O(n 2 ). Besides, the time complexity of computing the BIC in Step 2 is O(Mn 2 ), where M is the number of training samples, and in Step 3 the time complexity of updating the action probability vectors for n learning automata is O(n 2 ). Finally, the outer loop runs iter times, so the time complexity of BNC-VLA is O(iter×(n 2 + Mn 2 + n 2 )), which, treating M as a constant, can be rewritten as O(iter×n 2 ). □

5.2 Convergence results

In this section, we prove the convergence of BNC-VLA to the optimal solution when each learning automaton updates its action-set by a linear reward-inaction reinforcement scheme. BNC-VLA is designed for stochastic environments, where the environmental parameters may vary over time; therefore, the method used to prove its convergence partially follows the method given in [21, 23] for analyzing the behavior of learning automata operating in non-stationary environments.

Theorem 1

Let q i (k) be the probability of constructing BN τ i at stage k. If \(\underline {q}(k)\) is updated according to BNC-VLA, then there exists a learning rate a ∗ (ε)∈(0,1) (for every ε>0) such that for all a∈(0,a ∗ ), we have Prob[lim k→∞ q i (k)=1]≥1−ε.

Proof

The steps of the convergence proof are briefly outlined as follows. First, it is proved that the penalty probability of each BN converges to the constant value of the final penalty probability if k is selected large enough; this property is shown in Lemma 2. Then, it is shown that the probability of choosing a BN with the maximum score is a sub-martingale process for large values of k, so the changes in the probability of constructing that BN are always non-negative; Lemmas 3 and 4 show this result. Finally, the convergence of BNC-VLA to a BN with the maximum expected score is proved using martingale convergence theorems. Therefore, the following lemmas need to be proved before starting the proof of Theorem 1. □

Lemma 2

If BN τ i is penalized with probability c i (k) at stage k (i.e., \(c_{i}(k)=prob\left [\overline {BIC}(\tau _{i})<T_{k}\right ]\)) and \(\lim _{k\rightarrow \infty }c_{i}\left (k \right )=c_{i}^{\ast }\), then for every ε∈(0,1) and k>K(ε) we have \(prob\left [ \left | c_{i}^{\ast }-c_{i}\left (k \right ) \right |>0 \right ]<\varepsilon \).

Proof

Let \(c_{i}^{\ast }\) denote the final value of probability c i (k) when k is large enough. Using the weak law of large numbers, we conclude that \(prob\left [ \left | c_{i}^{\ast }-c_{i}\left (k \right ) \right |>\varepsilon \right ]\to 0\).

Hence, for every ε∈(0,1), there exists an a ∗ (ε)∈(0,1) and a K(ε)<∞ such that for all a<a ∗ and k>K(ε) we have \(prob\left [ \left | c_{i}^{\ast }-c_{i}\left (k \right ) \right |>0 \right ]<\varepsilon \), and the proof of Lemma 2 is completed. □

Lemma 3

Let \(c_{j}\left (k \right )=prob[\overline {BIC}(\tau _{j}\left (k+1 \right ))<T_{k}]\) and d j (k)=1−c j (k) be the probabilities of penalizing and rewarding BN τ j at stage k, respectively. If \(\underline {q}\left (k \right )\) evolves according to BNC-VLA, then the conditional expectation of q i (k) is given by

$$\begin{array}{@{}rcl@{}} &&E[q_{i}(k + 1)|q(k)]\\ &&\quad={\sum}_{j=1}^{r} {q_{j}\left( k \right)\left[ c_{j}\left( k \right)q_{i}\left( k \right)+d_{j}(k){\prod}_{e_{\left( m,n \right)}\in \tau_{i}} {{\delta_{n}^{m}}\left( k \right)} \right]} \end{array} $$

where

$${\delta_{n}^{m}}\left( k \right)=\left\{ {\begin{array}{l} {p_{n}^{m}}\left( k+1 \right)={p_{n}^{m}}\left( k \right)+a\left( 1-{p_{n}^{m}}\left( k \right) \right); \quad e_{(m,n)}\in \tau_{j}\\ {p_{n}^{m}}\left( k+1 \right)={p_{n}^{m}}\left( k \right)\left( 1-a \right);\quad\quad\quad\quad\quad e_{(m,n)}\notin \tau_{j}\\ \end{array}} \right. $$

where r denotes the number of all possible constructed BNs.

Proof

Since the reinforcement scheme used to update the probability vectors in BNC-VLA is L RI , at each stage k the probability of choosing BN τ i (i.e., q i (k)) remains unchanged with probability c j (k) (for all j=1,2,…,r) when the selected BN τ j is penalized by the random environment. On the other hand, when the selected BN τ j is rewarded, the probability of each edge of τ i that also lies in τ j increases by the given learning rate, while that of the other edges decreases. Hence the lemma is proven. □

Lemma 4

The increment in the conditional expectation of q i (k) is always non-negative, provided that \(\underline {q}\left (k \right )\) is updated according to BNC-VLA; that is, Δq i (k)≥0.

Proof

Define Δq i (k) = E[q i (k+1)|q(k)]−q i (k)

From Lemma 3, we have

$$\begin{array}{@{}rcl@{}} {\Delta} q_{i}\left( k \right)&=&E[q_{i}(k + 1)|q(k)] -q_{i}\left( k\right)\\ &=&\sum\limits_{j=1}^{r} q_{j}\left( k \right)\left[\right.c_{j}\left( k \right)q_{i}\left( k \right)\\ && + \, d_{j}\prod\limits_{e_{\left( m,n \right)}\in \tau_{i}}{\delta_{n}^{m}(k)\left.\right]}-q_{i}\left( k \right) \end{array} $$
(10)

where

$$ {\delta_{n}^{m}}\left( k \right)=\left\{ {\begin{array}{l} {p_{n}^{m}}\left( k+1 \right)={p_{n}^{m}}\left( k \right)+a\left(1-{p_{n}^{m}}\left( k \right)\right); \quad e_{(m,n)}\in \tau_{j} \\ {p_{n}^{m}}\left( k+1 \right)={p_{n}^{m}}\left( k \right)\left( 1-a \right);\quad\quad\quad\quad\quad e_{(m,n)}\notin \tau_{j} \end{array}} \right. $$
(11)

\({p_{n}^{m}}\left (k \right )\) is the probability of choosing edge e (m,n) at stage k. The probability with which a BN is constructed, rewarded, or penalized is the product of the probabilities of choosing the edges along that BN, so we have

$$\begin{array}{@{}rcl@{}} {\Delta} q_{i}\left( k \right)&=&\sum\limits_{j=1}^{r} {\prod\limits_{e_{\left( m,n \right)}\in \tau_{j}} {p_{n}^{m}} \left( k \right)} \left[ \prod\limits_{e_{\left( m,n \right)}\in \tau_{j}} {{c_{n}^{m}}\left( k \right)} \prod\limits_{e_{\left( m,n \right)}\in \tau_{i}} {{p_{n}^{m}}\left( k \right)}\right.\\ &&+\left.\prod\limits_{e_{\left( m,n \right)}\in \tau_{j}} {{d_{n}^{m}}\left( k \right)} \prod\limits_{e_{\left( m,n \right)}\in \tau_{i}} {{\delta_{n}^{m}}\left( k \right)} \right] -{\prod}_{e_{\left( m,n \right)}\in \tau_{i}} {{p_{n}^{m}}(k)} \end{array} $$

where \({\delta _{n}^{m}}(k)\) is defined as given in (11), \({c_{n}^{m}}(k)\) is the probability of penalizing edge e (m,n) at stage k, and \({d_{n}^{m}}\left (k \right )=1-{c_{n}^{m}}(k)\). At each stage, BNC-VLA chooses the edges of a DAG, constructing one of the r possible BNs.

$$\begin{array}{@{}rcl@{}} {\Delta} q_{i}\left( k \right)&=&{\prod}_{e_{\left( m,n \right)}\in \tau_{i}} E\left[{p_{n}^{m}}\left( k+1 \right)|p^{m}(k)\right]\\ &&-{\prod}_{e_{\left( m,n \right)}\in \tau_{i}} {{p_{n}^{m}}\left( k \right)} \end{array} $$

The above equality can be bounded from below as:

$$\begin{array}{@{}rcl@{}} {\Delta} q_{i}\left( k \right)&\ge& {\prod}_{e_{\left( m,n \right)}\in \tau_{i}} \left(E\left[ {p_{n}^{m}}\left( k+1 \right)|p^{m}(k)\right] -{p_{n}^{m}}\left( k \right)\right)\\ &&={\prod}_{e_{\left( m,n \right)}\in \tau_{i}} {\Delta {p_{n}^{m}}(k)} \end{array} $$
(12)

and \({\Delta } {p_{n}^{m}}(k)=a\thinspace {p_{n}^{m}}(k){\sum }_{s\ne n}^{r_{m}} {{p_{s}^{m}}\left (k \right )({c_{s}^{m}}\left (k \right )-{c_{n}^{m}}\left (k \right ))} \).

q i (k)∈(0,1) for all \(\underline {q}\in {S_{r}^{0}}\), where \(S_{r}=\{\underline {q}(k):0\le q_{i}(k)\le 1;\thinspace \sum \nolimits _{i=1}^{r} {q_{i}(k)} =1\}\) and \({S_{r}^{0}}\) denotes the interior of S r . Hence, \({p_{n}^{m}}\left (k \right )\in (0,1)\) for all m, n. Since edge e (m,n) ∈τ i is the edge with the highest probability that can be selected by automaton A m , it can be shown that \(c_{s}^{m^{\ast }}\left (k \right )-c_{n}^{m^{\ast }}\left (k \right )>0\) for all s≠n. It follows from Lemma 2 that for large values of k, \({c_{s}^{m}}\left (k \right )-{c_{n}^{m}}\left (k \right )>0\). Therefore, we conclude that for large values of k the right-hand side of the above equation consists of non-negative quantities, so \({\prod }_{e_{\left (m,n \right )}\in \tau _{i}} {a{p_{n}^{m}}(k)} {\sum }_{s\ne n}^{r_{m}} {{p_{s}^{m}}\left (k \right )({c_{s}^{m}}\left (k \right )-{c_{n}^{m}}\left (k \right ))} \ge 0\), and from (12) we have

$$\begin{array}{@{}rcl@{}} {\Delta} q_{i}\left( k \right)&\ge& \mathrm{}{\prod}_{e_{\left( m,n \right)}\in \tau_{i}} {a{p_{n}^{m}}(k)} {\sum}_{s\ne n}^{r_{m}} {p_{s}^{m}}\left( k \right). \\&&({c_{s}^{m}}\left( k \right)-{c_{n}^{m}}\left( k \right)) \end{array} $$

This completes the proof of this lemma. □

Corollary 1

The set of unit vectors in \(S_{r}-{S_{r}^{0}}\) forms the set of all absorbing barriers of the Markov process \({\{\underline {q}\left (k\right )\}}_{k\ge 1}\) , where \({S_{r}^{0}}=\left \{ \underline {q}\left (k \right ):q_{i}\left (k \right )\in \left (0,1 \right );{\sum }_{i=1}^{r} {q_{i}\left (k \right )} =1 \right \}\).

Proof

Lemma 4 implicitly proves that \(\underline {q}\left (k \right )\) is a sub-martingale. Using martingale convergence theorems and the fact that \(\underline {q}\left (k \right )\) is non-negative and uniformly bounded, it is concluded that q i (k) converges to q∗ with probability one. Hence, from (10), it can be seen that q i (k+1)≠q i (k) with nonzero probability if and only if q i (k)∉{0,1}, and \(\underline {q}\left (k+1 \right )=\underline {q}(k)\) with probability one if and only if q i (k)∈{0,1}, where \(\lim _{k\rightarrow \infty }q_{i}(k)=q^{\ast }\); hence the proof is completed. □
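The absorption behaviour established above can be illustrated with a small simulation. The sketch below is illustrative only: the reward probabilities in `d`, the learning rate `a`, and the step count are arbitrary choices, not values from the paper. It runs a single linear reward-inaction automaton and shows its action-probability vector being driven toward the unit vector of the action with the highest reward probability:

```python
import random

def lri_simulate(d, a=0.01, steps=20000, seed=0):
    """Illustrative L_{R-I} automaton. d[i] is the reward probability of
    action i. On reward, the chosen action's probability gains a fraction
    a of the remaining mass; on penalty nothing changes (b = 0)."""
    rng = random.Random(seed)
    r = len(d)
    p = [1.0 / r] * r                            # uniform initial action probabilities
    for _ in range(steps):
        i = rng.choices(range(r), weights=p)[0]  # sample an action from p
        if rng.random() < d[i]:                  # environment rewards action i
            p = [pj * (1.0 - a) for pj in p]     # shrink all probabilities
            p[i] += a                            # p_i <- p_i + a(1 - p_i)
    return p

p = lri_simulate([0.2, 0.5, 0.9])  # action 2 has the highest reward probability
# For a small learning rate, p[2] should be absorbed near 1.
```

With a small learning rate the trajectory is absorbed near a unit vector with high probability, matching the corollary's characterization of the absorbing barriers.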

Let Γ i (q) be the probability that BNC-VLA converges to unit vector e i with initial probability vector \(\underline {q}\). \({\Gamma }_{i}\left (q \right )\) is defined as follows:

$$\begin{array}{@{}rcl@{}} {\Gamma}_{i}\left( q \right)&=&prob[q_{i}(\infty)=1|\underline{q}(0)=\underline{q}]\\ &=&prob[q^{*}=e_{i}|\underline{q}(0)=\underline{q}]. \end{array} $$

Let C(S r ) be the space of all real-valued, continuously differentiable functions with bounded derivative defined on S r , where R denotes the real line. If ψ(.)∈C(S r ), the operator U is defined as:

$$ U\psi (q)= E[\psi \left( q\left( k+1 \right) \right)\vert q\left( k \right)=q] $$
(13)

where E[.] represents the mathematical expectation.

It has been shown in [23] that the operator U is linear and that it preserves non-negative functions, since the expectation of a non-negative function remains non-negative. In other words, U ψ(q)≥0 for all q∈S r if ψ(q)≥0. If the operator U is applied repeatedly, we have

$$U^{n-1}\psi \left( q \right)=E[\psi (q(n))\vert q\left( 1 \right)=q]. $$

Function ψ(q) is called super-regular (sub-regular) if and only if ψ(q)≥U ψ(q) (respectively, ψ(q)≤U ψ(q)) for all q∈S r . It has been shown in [23] that Γ i (q) is the only continuous solution of UΓ i (q)=Γ i (q) subject to the following boundary conditions.

$${\Gamma}_{i}\left( e_{i} \right)=1 $$
$$ {\Gamma}_{i}\left( e_{j} \right)=0;\thinspace \thinspace j\ne i\thinspace $$
(14)

Define \(\phi _{i}\left [ x,q \right ]=\frac {e^{-xq_{i}/a}-1}{e^{-x/a}-1}\) where x>0. ϕ i [x, q]∈C(S r ) satisfies the boundary conditions above.
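That ϕ i [x, q] meets these boundary conditions can be checked directly by substituting the unit vectors: for q = e i we have q i = 1, and for q = e j (j≠i) we have q i = 0, so

$$\phi_{i}\left[ x,e_{i} \right]=\frac{e^{-x/a}-1}{e^{-x/a}-1}=1,\qquad \phi_{i}\left[ x,e_{j} \right]=\frac{e^{0}-1}{e^{-x/a}-1}=0. $$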

Theorem 2

Let ψ(.)∈C(S r ) be super-regular with ψ i (e i )=1 and ψ i (e j )=0 for j≠i; then

$$\psi_{i}\left( q \right)\ge {\Gamma}_{i}\left( q \right) $$

for all q∈S r . If ψ(.)∈C(S r ) is sub-regular with the same boundary conditions, then

$$ \psi_{i}\left( q \right)\le {\Gamma}_{i}\left( q \right) $$
(15)

for all q∈S r .

Proof

Theorem 2 has been proved in [21]. □

In what follows, we show that ϕ i [x, q] is a sub-regular function, so that ϕ i [x, q] qualifies as a lower bound on Γ i (q). Super- and sub-regular functions are closed under addition and under multiplication by a positive constant, and ϕ(.) is super-regular if and only if −ϕ(.) is sub-regular. Therefore, ϕ i [x, q] is sub-regular if and only if \(\theta _{i}\left [ x,q \right ]=e^{-xq_{i}/a}\) is super-regular.

We now determine the conditions under which θ i [x, q] is super-regular. From the definition of operator U given in (13), we have:

$$\begin{array}{@{}rcl@{}} U\theta_{i}\left( x,q \right)&=&E\left[\left. e^{-xq_{i}(k+1)/a}\right| q(k)=q\right]\\ &=&q_{i}d_{i}^{\ast }e^{-\frac{x}{a}{\prod}_{e_{\left( m,n \right)}\in \tau_{i}} \left( {p_{n}^{m}}+a\left( 1-{p_{n}^{m}} \right) \right)}\\ &&+{\sum}_{j\ne i} q_{j}d_{j}^{\ast }e^{-\frac{x}{a}\left[ {\prod}_{e_{\left( m,n \right)}\in \tau_{i},\, e_{\left( m,n \right)}\in \tau_{j}} \left( {p_{n}^{m}}+a\left( 1-{p_{n}^{m}} \right) \right) {\prod}_{e_{\left( m,n \right)}\in \tau_{i},\, e_{\left( m,n \right)}\notin \tau_{j}} {p_{n}^{m}}\left( 1-a \right) \right]} \end{array} $$

where \(d_{j}^{\ast }\) denotes the value to which the reward probability d j (k) converges for large k, and \(e^{-\frac {x}{a}(q_{i}+a\left (1-q_{i} \right ))}\) is the expectation of θ i (x, q) when the BN τ i is rewarded by the environment.

$$\begin{array}{@{}rcl@{}} U\theta_{i}\left( x,q \right)&=&q_{i}d_{i}^{\ast}e^{-\frac{x}{a}\left( q_{i}+a\left( 1-q_{i} \right) \right)}\\ &&+{\sum}_{j\ne i} q_{j}d_{j}^{\ast }e^{-\frac{x}{a}\, q_{i}\left( 1-a \right){\prod}_{e_{\left( m,n \right)}\in \tau_{i},\, e_{\left( m,n \right)}\in \tau_{j}} \frac{{p_{n}^{m}}+a\left( 1-{p_{n}^{m}} \right)}{{p_{n}^{m}}\left( 1-a \right)}}\\ &=&{\sum}_{j=i} {q_{j}d_{j}^{\ast }e^{-\frac{x}{a}{\rho_{j}^{i}}(q_{i}+a\left( 1-q_{i} \right))}} +{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }e^{-\frac{x}{a}{\rho_{j}^{i}}\,q_{i}(1-a)}} \end{array} $$

where \({\rho _{j}^{i}}>0\) is defined as

$${\rho_{j}^{i}}=\left\{ {\begin{array}{l} {\prod}_{\begin{array}{l} e_{\left( m,n \right)}\in \tau_{i}, \\ e_{\left( m,n \right)}\in \tau_{j} \\ \end{array}} \frac{{(p}_{n}^{m}+a\left( 1-{p_{n}^{m}} \right))}{{(p}_{n}^{m}\left( 1-a \right))} ; \quad i\ne j \\ 1; \quad\quad\quad\qquad i=j \, or \, \left( \tau_{i}\cap \tau_{j} \right)=\emptyset \\ \end{array}} \right. $$
$$\begin{array}{@{}rcl@{}} U\theta_{i}\left( x,q \right)-\theta_{i}\left( x,q \right) &=&\left[ e^{-xq_{i}{\rho_{j}^{i}}/a}{\sum}_{j=i} q_{j}d_{j}^{\ast }e^{-x(1-q_{i}){\rho_{j}^{i}}}\right.\\ && \left.+\,e^{-xq_{i}{\rho_{j}^{i}}/a}{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }e^{xq_{i}{\rho_{j}^{i}}}} \right]-e^{-xq_{i}/a} \end{array} $$

θ i (x, q) is super-regular if

$$\begin{array}{@{}rcl@{}} &&e^{-xq_{i}{\rho_{j}^{i}}/a}{\sum}_{j=i} q_{j}d_{j}^{\ast }e^{-x(1-q_{i}){\rho_{j}^{i}}}\\ &&\quad +e^{-xq_{i}{\rho_{j}^{i}}/a}{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }e^{xq_{i}{\rho_{j}^{i}}}} \le e^{-xq_{i}/a} \end{array} $$

and

$$U\theta_{i}\left( x,q \right)\le e^{-xq_{i}/a}q_{i}d_{i}^{\ast }e^{-x(1-q_{i})}+e^{-xq_{i}/a}\sum\limits_{j\ne i} {q_{j}d_{j}^{\ast }e^{xq_{i}}}, $$

if θ i (x, q) is super-regular. Therefore, we have

$$\begin{array}{@{}rcl@{}} &&U\theta_{i}\left( x,q \right)-\theta_{i}\left( x,q \right)\\ &&\quad\le \left[e^{-xq_{i}/a}q_{i}d_{i}^{\ast }e^{-x(1-q_{i})} + e^{-xq_{i}/a}{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }e^{xq_{i}}} \right]-e^{-xq_{i}/a}. \end{array} $$

After multiplying and dividing the right hand side of the inequality above by −x q i and some algebraic simplifications, we have

$$\begin{array}{@{}rcl@{}} &&U\theta_{i}\left( x,q \right)-\theta_{i}\left( x,q \right)\\ &&\le-x{q_{i}e}^{-xq_{i}/a}\left[ q_{i}d_{i}^{\ast }\frac{e^{-x(1-q_{i})}-1}{-xq_{i}}-{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }\frac{e^{xq_{i}}-1}{xq_{i}}} \right]\\ &&=-x{q_{i}e}^{-\frac{xq_{i}}{a}}\left[ d_{i}^{\ast }\frac{e^{-x\left( 1-q_{i} \right)}-1}{-x}-{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }\frac{e^{xq_{i}}-1}{xq_{i}}} \right]\\ &&=-x{q_{i}e}^{-xq_{i}/a}\left[ (1-q_{i})d_{i}^{\ast }\frac{e^{-x(1-q_{i})}-1}{-x{(1-q}_{i})}\,-\,{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }\frac{e^{xq_{i}}-1}{xq_{i}}} \right] \end{array} $$

Defining

$$V\left[ u \right]=\left\{ {\begin{array}{l} \frac{e^{u}-1}{u};u\ne 0 \\ 1; \quad \,\,\, u=0 \\ \end{array}} \right. $$
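For reference, V[u] has a removable singularity at u = 0, which is why the definition extends it by V[0] = 1. A direct numerical transcription (illustrative only) is:

```python
import math

def V(u: float) -> float:
    """V[u] = (e^u - 1)/u, extended continuously by V[0] = 1."""
    if u == 0.0:
        return 1.0
    return math.expm1(u) / u  # expm1 keeps precision for small |u|
```

V is positive and strictly increasing, which is what the bounds below on V[−x(1−q i )]/V[xq i ] rely on.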
$$\begin{array}{@{}rcl@{}} &&U\theta_{i}\left( x,q \right)-\theta_{i}\left( x,q \right)\\ &&\quad \le-x{q_{i}e}^{-\frac{xq_{i}}{a}}\left[\left( 1-q_{i} \right)d_{i}^{\ast }V[-x\left( 1-q_{i} \right)]\right.\\ &&\left.\quad -\left( {\sum}_{j\ne i} {q_{j}d_{j}^{\ast}}\right)V[xq_{i}] \right]=-xq_{i}\theta_{i}\left( x,q \right)G_{i}(x,q) \end{array} $$

where

$$\begin{array}{@{}rcl@{}} G_{i}\left( x,q \right)&=&\left( 1-q_{i} \right)d_{i}^{\ast }V[-x\left( 1-q_{i} \right)]\\ &&-\left( {\sum}_{j\ne i} {q_{j}d_{j}^{\ast}}\right)V[xq_{i}] \end{array} $$
(16)

Therefore, θ i (x, q) is super-regular if

$$ G_{i}\left( x,q \right)\ge 0. $$
(17)

From (16), it follows that θ i (x, q) is super-regular if we have

$$ f_{i}\left( x,q \right)=\frac{V\left[ -x(1-q_{i}) \right]}{V\left[ xq_{i} \right]}\ge \frac{{\sum}_{j\ne i} {q_{j}d_{j}^{\ast }} }{(1-q_{i})d_{i}^{\ast }}. $$
(18)

The right-hand side of inequality (18) consists of non-negative terms, so we have

$$\begin{array}{@{}rcl@{}} \left( {\sum}_{j\ne i} q_{j} \right){\min}_{j\ne i}\left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right)&\le& {\sum}_{j\ne i} {q_{j}\frac{d_{j}^{\ast }}{d_{i}^{\ast }}}\\ &\le& \left( {\sum}_{j\ne i} q_{j} \right){\max}_{j\ne i}\left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right). \end{array} $$

Dividing the above inequality by \({\sum }_{j\ne i} q_{j} =1-q_{i}\), we can rewrite it as:

$${\min}_{j \ne i}\left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right)\le \frac{{\sum}_{j\ne i} {q_{j}\frac{d_{j}^{\ast }}{d_{i}^{\ast }}} }{{\sum}_{j\ne i} q_{j} }\le {\max}_{j \ne i}\left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right). $$

From (18), it follows that θ i (x, q) is super-regular if we have

$$f_{i}\left( x,q \right)\ge {\max}_{j \ne i} \left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right) $$

For further simplification, let us employ logarithms. If \({\Delta}(q,x)=\ln f_{i}(x,q)\), it has been shown in [23] that

$$-{{\int}_{0}^{x}} {H^{\prime}\left( u \right)du\le {\Delta} \left( q,x \right)} \le -{\int}_{-x}^{0} {H^{\prime}\left( u \right)du} $$
$$H^{\prime}\left( u \right)=\frac{dH(u)}{du},\thinspace H\left( u \right)=\ln V(u). $$

Therefore, we have

$$\frac{1}{V[x]}\le \frac{V[-x(1-q_{i})]}{V[xq_{i}]}\le V[-x] $$

and

$$ \frac{1}{V[x]}= {\max}_{j \ne i}\left( \frac{d_{j}^{\ast }}{d_{i}^{\ast }} \right). $$
(19)

Let x∗ be the value of x for which (19) holds. It has been shown that there exists a value of x>0 satisfying (19) if \(d_{j}^{\ast}/d_{i}^{\ast}\) is smaller than 1 for all j≠i; choosing x = x∗ makes (19) hold, and consequently θ i (x, q) is a super-regular function. Therefore, \(\phi _{i}\left [ x,q \right ]=\frac {e^{-xq_{i}/a}-1}{e^{-x/a}-1}\) is a sub-regular function satisfying the boundary conditions given in (14). From Theorem 2 and inequality (15), we conclude that

$$\phi_{i}\left[ x,q \right]\le {\Gamma}_{i}(q)\le 1. $$

From the definition of ϕ i [x, q], we see that for any ε>0 there exists a positive constant a∗<1 such that 1−ε ≤ ϕ i [x, q]≤Γ i (q)≤1 for all 0 < a ≤ a∗.

Thus we conclude that the probability with which BNC-VLA constructs the BN with the maximum BIC tends to 1 as k→∞, and so Theorem 1 is proved. □

Theorem 3

Let q i (k) be the probability of constructing BN τ i at stage k, and let (1−ε) be the probability with which Algorithm 1 converges to BN τ i . If \(\underline{q}(k)\) is updated by Algorithm 1, then for every error parameter ε∈(0,1) there exists a learning rate a(ε, q)∈(0,1) so that \(\frac {xa}{e^{xa}-1}=\left (\frac {d_{j}}{d_{i}} \right )\) , where \(1-e^{-xq_{i}}=\left (1-e^{-x} \right )\left (1-\varepsilon \right )\) and \(q_{i}=q_{i}(0)\).

Proof

It has been proved in [23] that there always exists an x > 0 satisfying (19) if \(\frac {d_{j}}{d_{i}}<1\) for all j≠i. Hence, it is concluded that \(\phi _{i}\left [ x,q \right ]\le {\Gamma }_{i}\left (q \right )\le \frac {1-e^{-xq_{i}}}{1-e^{-x}}\), where q i is the initial choice probability of the optimal BN τ i. From Theorem 1, for each 0 < a < a∗ the probability that Algorithm 1 converges to the BN with the maximum expected BIC is (1−ε), where a∗∈(0,1). Therefore, it is concluded that

$$ \frac{1-e^{-xq_{i}}}{1-e^{-x}}=1-\varepsilon . $$
(20)

It is shown that for every error parameter ε∈(0,1) there exists a value of x under which (19) is satisfied, and so we have \(\frac {x^{\ast }a}{e^{x^{\ast }a}-1}={\max }_{j \ne i}\left (\frac {d_{j}^{\ast }}{d_{i}^{\ast }} \right )\).

It is concluded that for every error parameter ε∈(0,1) there exists a learning rate a(ε, q) under which the probability that Algorithm 1 converges to the BN with the maximum expected BIC is greater than (1−ε), and hence the theorem is proved. □

6 Numerical results

In order to evaluate the performance of BNC-VLA, two classes of experiments have been conducted. First, we selected two well-known BNs (structures plus conditional probabilities) and used BNC-VLA to reconstruct them from databases of cases that reflect the conditional independence relations between the variables; the constructed BNs were evaluated, and BNC-VLA was compared with other algorithms in this setting. Second, since classification is one of the important applications of BNs, the classification accuracy of the proposed algorithm was compared against other algorithms on 25 chosen datasets.

6.1 Evaluation results on well-known networks

The BNs used in these experiments are the ALARM and the ASIA networks. The ALARM network [25] is a medical diagnostic alarm message system for patient monitoring; it contains 37 nodes and 46 arcs (see Fig. 1). Researchers in this field use datasets generated from three versions of the ALARM network with the same structure but different probability distributions; we use the first 5000 cases from the database generated by Edward Herskovits [26]. The ASIA network [27] is a simple network with eight binary nodes and eight arcs (Fig. 2). It was introduced by Lauritzen and Spiegelhalter to illustrate their method of evidence propagation, and it models a small piece of fictitious qualitative medical knowledge. A dataset with 5000 cases, sampled using the Netica tool [28], is used for constructing its network.

Fig. 1
figure 1

The ALARM network

Fig. 2
figure 2

The ASIA network

6.1.1 Sensitivity analysis

In this section, we study the effect of the learning parameters, a in (4) and b in (5), on the performance of BNC-VLA. Since the linear reward-inaction (L RI ) scheme has been used, b=0, and we only examine the effect of parameter a in order to find its best value in our implementation. To do this, the database with 5000 cases is used to construct the ALARM network. Figure 3 presents the results; they indicate that the best value of parameter a lies between 0.4 and 0.5, so in the following experiments a is set to 0.4.

Fig. 3
figure 3

Analysis of the effect of parameter a on BNC-VLA

6.1.2 Performance evaluation of BNC-VLA

In these experiments, BNC-VLA is employed to construct the well-known BNs, the ALARM and the ASIA, using their training samples. The constructed network for ALARM is identical to the original, except that two arcs (21−31 and 12−32) are missing. A subsequent analysis revealed that the missing arcs are not supported by the 5000 cases, and their endpoint nodes are actually independent in the employed database. This is similar to the result of [2], which used 10000 cases and node ordering. On the other hand, the constructed BN for ASIA is exactly the same as its original network.

Then, we consider different subsets of the datasets for both the ALARM and the ASIA networks, consisting of the first 500, 1500, 3000, and 5000 cases. In order to evaluate the behavior of the algorithm, for each dataset the following metrics are considered:

  1. BIC as the scoring metric

  2. Hamming distance

  3. Computation time

BIC is measured by (3); the higher its value, the better the network fits the data. The Hamming distance directly compares the structures of the learned and the original networks. We define the Hamming distance between two DAGs as the number of the following operations required to make the DAGs match: add, remove, or reverse a directed edge. A lower Hamming distance indicates greater similarity between the constructed network and the original one. BNC-VLA is also evaluated on computation time; to this end, it is implemented in the .NET Framework on a PC with a single Intel(R) Core™2 Duo 3.33 GHz CPU and 1 GB of memory. To measure the computation time, BNC-VLA is run with no prior limit on the number of iterations until further iterations no longer increase the score.
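The Hamming distance defined above can be computed directly from the two edge sets. A minimal sketch (illustrative; DAGs are represented as lists of directed edges, i.e., ordered node pairs):

```python
def hamming_distance(dag_a, dag_b):
    """Number of add/remove/reverse operations turning dag_a into dag_b.
    A reversed edge counts as one operation, not a remove plus an add."""
    a, b = set(dag_a), set(dag_b)
    # edges present in a whose reversal is present in b
    reversed_edges = {(u, v) for (u, v) in a - b if (v, u) in b - a}
    missing = len(b - a) - len(reversed_edges)  # edges to add
    extra = len(a - b) - len(reversed_edges)    # edges to remove
    return len(reversed_edges) + missing + extra

# One reversal (0,1)->(1,0) plus one addition (2,3):
d = hamming_distance([(0, 1), (1, 2)], [(1, 0), (1, 2), (2, 3)])
# d == 2
```

Counting a reversal as a single operation matters: treating it as a removal plus an addition would inflate the distance for networks that differ only in edge orientation.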

Table 1 presents the average results over 10 independent runs with the databases of 500, 1500, 3000, and 5000 cases. In this table, μ ± σ denotes the mean and the standard deviation over the executions carried out, and the value inside parentheses is the best result found during the experiments. As the results indicate, BNC-VLA constructs satisfactory networks (networks with a low Hamming distance and a high BIC score) in acceptable time, especially with the databases of 3000 and 5000 cases. The better results for ASIA were predictable, because ASIA is simpler than ALARM.

Table 1 Average results of BNC-VLA after 10 runs

6.1.3 Comparison of BNC-VLA with other algorithms

In the following experiments, BNC-VLA is compared against other algorithms. All implemented algorithms are described below:

  • Another learning automata-based method, which uses a team of FALA [10];

  • Hill climbing-based algorithm [12];

  • A genetic algorithm which first finds the best ordering of variables and then starts the search process [3];

  • Ant colony optimization [14], referred to as ACO.

Many other BN structure learning algorithms exist; the A* search-based algorithm with a shortest-path perspective [13] is an example. However, it is neither possible nor necessary to compare the proposed algorithm with all of them, so we have selected representative algorithms that belong to the same class as ours.

Datasets with 5000 cases are considered for both the ALARM and the ASIA networks; the metrics are the same as in the previous experiments. The algorithms were run with no prior time limit until no improvement in the score was observed. Table 2 shows the results over 10 independent runs. In this table, μ ± σ denotes the mean and the standard deviation over the executions carried out, and the value inside parentheses is the best result found during the experiments. From the results, we observe that BNC-VLA is superior to the other algorithms on both the ALARM and the ASIA networks.

Table 2 Comparative experiments results after 10 runs

6.2 Evaluating the predictive ability of proposed algorithm in classification

A constructed BN is an efficient device for performing probabilistic inference; consequently, predictive ability is one of the significant issues in the field of BNs. Classification is one of the important applications of BNs, used in varied fields such as recommender systems for estimating users' ratings based on their implicit preferences, bank direct marketing for predicting clients' willingness to subscribe to a deposit, and disease diagnosis for assessing patients' breast cancer risk [29]. In this section, we first briefly explain BN classifiers in Section 6.2.1; then, using datasets from different applications, we evaluate the classification accuracy and classification time of the different algorithms in Sections 6.2.2 and 6.2.3, respectively.

6.2.1 BN classifiers

Suppose that each training sample is a vector of attributes (X 1, X 2 …X v−1, C). The goal of classification is to predict the right value of the class variable C = X v given (x 1, x 2 …x v−1). If the performance measure is the percentage of correct predictions on test samples (classification accuracy), the correct prediction for (x 1, x 2 …x v−1) is the class that maximizes P(c|x 1, x 2 …x v−1). Given a BN over (X 1, X 2 …X v−1, C), these probabilities can be computed by inference on it. After the structure of a BN is specified, estimating the parameters so that the network provides the best prediction for the value of the class variable in the test samples is important; however, this is out of the scope of this study, and we simply use Maximum Likelihood (ML) estimation for the parameter values.
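To make the prediction step concrete, the following sketch estimates ML parameters by frequency counts and classifies by maximizing P(c)·∏ j P(x j |c). For brevity it assumes a naive-Bayes-shaped structure in which every attribute's only parent is the class; the actual BNC-VLA classifier performs inference on the learned structure instead, so this illustrates ML estimation and argmax prediction, not the full algorithm:

```python
from collections import Counter, defaultdict

def fit_ml(samples):
    """samples: list of (attributes_tuple, class_label).
    ML estimates reduce to frequency ratios; returns a predict() closure."""
    class_counts = Counter(c for _, c in samples)
    cond = defaultdict(Counter)          # (attr_index, class) -> value counts
    for xs, c in samples:
        for j, x in enumerate(xs):
            cond[(j, c)][x] += 1
    n = len(samples)
    prior = {c: k / n for c, k in class_counts.items()}  # ML estimate of P(c)

    def predict(xs):
        def score(c):  # P(c) * prod_j P(x_j | c), unseen values get 0
            p = prior[c]
            for j, x in enumerate(xs):
                p *= cond[(j, c)][x] / class_counts[c]
            return p
        return max(prior, key=score)     # argmax over classes
    return predict

predict = fit_ml([((0, 1), "a"), ((0, 1), "a"), ((1, 0), "b"), ((1, 1), "b")])
# predict((0, 1)) == "a"
```

With the learned BNC-VLA structure, the per-attribute factors P(x j |c) would simply be replaced by the CPT entries P(x j | parents(X j )) of the trained network.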

6.2.2 Comparison of the classification accuracy of BNC-VLA against other classifiers

In order to evaluate and compare the classification accuracy of the proposed algorithm, BNC-VLA, 25 datasets are used, consisting of 21 datasets from UCI [30] and the others from [31]. Table 3 gives a brief description of these datasets. All other implemented classifiers are described as follows:

  • Naïve Bayes classifier;

  • TAN-based classifier;

  • FALA-based classifier [10];

  • Hill climbing-based classifier [12];

  • Genetic algorithm-based classifier [3];

  • ACO-based classifier [14].

We also compare the performance of the proposed classifier with two simple and well-known classifiers, Naïve Bayes and TAN.

Table 3 Datasets and their samples

Furthermore, in order to construct more efficient networks, another scoring function is used, which was proposed in [31] and is called the classification rate:

$$ CR=\frac{1}{\vert D\vert }{\sum}_{m\in D} {\delta (B_{D}\left( x_{1:N}^{m} \right),c^{m})} $$
(21)

where |D| is the number of training samples. The equation simply represents the rate of samples that are classified correctly by the network, and \(\delta \left (B_{D}\left (x_{1:N}^{m} \right ),c^{m} \right )=1\) if the BN classifier \(B_{D}\), trained with D, predicts the right value of the class variable \(c^{m}\) given the attributes \(x_{1:N}^{m}\).
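Equation (21) is just the fraction of samples the network classifies correctly. A direct transcription (illustrative; the classifier is passed in as a plain function standing in for B D ):

```python
def classification_rate(classifier, samples):
    """CR = (1/|D|) * sum over m in D of delta(classifier(x^m), c^m)."""
    hits = sum(1 for xs, c in samples if classifier(xs) == c)
    return hits / len(samples)

# e.g. a constant classifier on a toy two-sample set:
cr = classification_rate(lambda xs: "a", [((0,), "a"), ((1,), "b")])
# cr == 0.5
```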

Table 4 represents the average error rate of different algorithms for classification of different datasets. For each algorithm, we have carried out 10 independent runs over each dataset with a fixed execution time. The best results are highlighted.

Table 4 Average error rate for different datasets

As a reference for the goodness of the results, we can consider the average error rate over all 25 datasets, which is 0.120174 for BNC-VLA, 0.159136 for FALA, 0.161916 for ACO, 0.172716 for Genetic, 0.171472 for Hill Climbing, 0.18094 for NB, and 0.1586 for TAN. Moreover, for 20 datasets BNC-VLA shows the best results. For the five remaining datasets, Cleve, Crx, Flare, Sat image, and Vehicle, its results are approximately the same as the best. The best results for Cleve and Crx are achieved by ACO and HC, respectively; this is justifiable by the stochastic nature of the algorithms. For the other datasets TAN is superior to BNC-VLA: the volumes of data in Flare, Sat image, and Vehicle are large enough, and the numbers of classes are small relative to the number of attributes.

Therefore, since TAN constructs the network based on conditional mutual information tests, it shows good results on them. However, the differences between the best results and the results achieved by BNC-VLA are negligible.

Figure 4 presents scatter charts that directly compare the BNC-VLA classifier with the other classifiers; points above the line y = x indicate cases where BNC-VLA performs better than the compared algorithm. Figure 5 shows a bar chart comparing the average classification error of all algorithms on the different datasets.

Fig. 4
figure 4

Scatter charts comparing the BNC-VLA classifier with (a) FALA-based, (b) Genetic-based, (c) ACO-based, (d) Hill Climbing-based, (e) Naïve Bayes, and (f) TAN classifiers; points above y = x show better performance of the proposed algorithm

Fig. 5
figure 5

Average error rate for classification

All in all, BNC-VLA outperforms the simple structures NB and TAN, and it also shows better results than the FALA-based, genetic, ACO-based, and hill-climbing algorithms.

6.2.3 Comparison of the computation time for classification

Finally, the classification execution time of BNC-VLA is compared against the other algorithms. All algorithms are implemented and executed in the .NET Framework on a PC with a single Intel(R) Core™2 Duo 3.33 GHz CPU and 1 GB of memory. Table 5 shows the results over 10 independent runs on seven chosen datasets: Letter, Chess, German, Hepatitis, Breast, Glass2, and Iris. The results indicate that BNC-VLA needs less time for classification than the other algorithms.

Table 5 Classification execution time of different algorithms on different datasets

7 Conclusion

In this paper, a learning automata-based algorithm named BNC-VLA is proposed for BN structure training. Unlike other learning automata-based methods, which assign learning automata to the possible edges, in our algorithm each automaton is assigned to a random variable; therefore, the number of learning automata is reduced and convergence is sped up. Furthermore, unlike other algorithms, there is no additional phase to remove cycles from the constructed BN; cycle avoidance is simply performed during network construction by using variable-action set learning automata. Moreover, two improvements are proposed that increase the quality of the constructed network and decrease the execution time. The time complexity of the proposed algorithm is \(O(n^{2})\). The convergence of BNC-VLA is theoretically proved, and the convergence results confirm that, with a proper choice of the learning rate, the probability of choosing a BN with the maximum score converges to one. The reported experimental results show that BNC-VLA is superior to other related algorithms in terms of quality and performance measures. BNC-VLA has two main features: (1) it builds acceptable networks, and (2) it has a lower computational cost for network construction; both may be attributed to the learning capability and negligible computational cost of learning automata. In comparison with the two other learning automata-based methods, BNC-VLA locates better structures and exhibits a superior speed of convergence. Moreover, since classification is one of the important applications of BNs in varied fields, we examined the classification accuracy of the proposed algorithm by measuring the classification error of different BNs on 25 datasets. Again, the experimental results show that the BNC-VLA classifier performs better than the other classifiers.