Keywords

1 Introduction

Bayesian network is a compact representation of joined probability distribution and referred as directed acyclic graph (DAG) that consists of vertices and edges. Each vertex denotes a random variable, and edge from (A → B) represents the conditional dependency between random variables A and B. BN is a compact representation of joint probability distribution of all random variables [3].

1.1 Structure Learning

Bayesian network can be constructed manually by domain experts using prior structure knowledge especially in medical domain. The structure can then be refined manually or by automatically using techniques such as refinement algorithm and ExpertBayes [4]. However, manual construction may not be feasible for all domains. Another approach is to use available information. Structure of BN can also be estimated directly from the data. There are two approaches for the structure learning: score-based approach and constraint-based approach [5].

Constraint-Based Approach

The constraint-based case employs the independence test [6] to identify a set of edge constraints [7] for the graph and then finds the best DAG that satisfies the constraints [8].

Score-Based Approach

The score-based approach first defines a criterion to evaluate how well the Bayesian network fits the data and then searches over the space of DAGs [9] for a structure with maximal score [10].

Most of the existing techniques [1] follow the below steps to find the best structure. First, generate all possible graphs (which consumes large time complexity [2]) and then apply any scoring function for each candidate structure; the structure with the highest Bayesian score is considered as the best Bayesian network structure. However, the scoring-based approach generates exponent number of candidate graphs.

2 Proposed Algorithm

In this work, a new heuristic search technique is proposed that generates optimal number of graphs, and then, scoring function is applied on each of the graph to find the best structure.

  • Step 1: Generate optimal number of candidate structure.

  • Step 2: Calculate Bayesian score for each candidate structure using sufficient statistics of the input data.

  • Step 3: The structure with the highest Bayesian score is the best Bayesian network structure.

Algorithm 1: Proposed algorithm

Data:     Input data for covid 19 Result:     Bayesian Network structure         1. begin         2. Read Input data for covid 19;         3. Best_Score = 0;         4. Best_structure = 0;         5. Calculate Sufficient statistics (S) for the input data;         6. Call G = GenerateOptimalGraphs(n); // n is number of random variables in the Input data         7. for each graph gi in G begin         8. Calculate Bayesian Score(BS(gi))for each graph gi using Sufficient statistics (S);         9. if BS(gi) > Best_Score then     Best_Score = BS(gi);     Best_structure = gi end         10. end

3 Calculate Bayesian Score for Each Structure

Bayesian score [3] can be used as scoring function for all possible graphs [11, 12]. It is the function which takes one graph as an input and computes the Bayesian score for that structure [13, 14].

$$ P\left( \frac{G}{D} \right) = \frac{{P\left( \frac{D}{G} \right)P\left( G \right)}}{P\left( D \right)} $$
(1)

P(D | G) is the marginal likelihood of the graph structure G. P(G) is the prior probability of the graph structure and can be considered as uniform distribution G in the candidate structure space. P(D) is the probability of the data which acts as a normalizing constants the score function can be written as below; it primarily depends on the marginal likelihood P(D | G)

$$ {\text{ScoreB}}\left( {G:D} \right) = \log P\left( \frac{D}{G} \right) $$
(2)

3.1 Data Preprocessing

The data was collected from source kaggle [15] for the pandemic disease Covid-19 that consists of 35 million records; the dataset contains 27 features with missing values. These values are handled using Python package. Test data contains missing values for mixed data types (numeric or strings), and SimpleImputer from the Python package sklearn.impute can be used.

  • Ex: imp_mean = SimpleImputer (missing_values=np.nan, strategy=“most_frequent”)

When strategy is set to “most_frequent”, it replaces missing values using the most frequent value along each column. It can be used with strings or numeric data (mixed data types) (Fig. 1).

Fig. 1
An illustration of the Bayesian network by the proposed algorithm of the covid dataset. The K leads to D then further to R, Z, and w. The B and S lead to Y, C leads to T and M, X leads to F and E, J leads to U, and O lead to V. The N leads to A, Q, and Z. The G, H, I, L, and P are single datasets.

Best Bayesian network structure for covid_19 dataset created by proposed algorithm

4 Goodness Fit of the Model

Proposed heuristic algorithm can be compared with the existing structure learning algorithms such as hill-climbing learnt by bnlearn package. Bayesian network models can be compared with the following metrics.

  1. 1.

    score comparison

  2. 2.

    hold-out cross-validation

  3. 3.

    precision.

  1. 1.

    Score Comparison

The Bayesian scoring function is used to compute the score of the Bayesian network with 27 nodes learnt using bnlearn package (hill-climbing approach)

$$ {\text{score}}\left( {{\text{bnlearn}}\_{\text{BN}},26\;{\text{nodes}}} \right) = 3240.513 $$

Score of the Bayesian network with 20 nodes learnt using heuristic algorithm

$$ {\text{score}}\left( {{\text{heuristic}}\_{\text{BN}},26\;{\text{nodes}}} \right) = 3576.921 $$
  1. 2.

    Hold-Out Cross-Validation

Cross-validation is a standard method used to estimate a model’s goodness of fit. In k-fold cross-validation, data is randomly divided into k subsets. For each subset k, a model is evaluated on k having been trained on (k − 1) subsets (Fig. 2; Tables 1 and 2).

Fig. 2
A graph plots log-likelihood loss versus cluster B N and bnlearn B N. The expected loss for bnlearn is 14.80372 and the expected loss for cluster B N is 14.26. All values are approximated.

Expected loss comparison graph for bnlearn_BN and heuristic_BN

Table 1 Expected loss for the Bayesian network created by bnlearn package
Table 2 Expected loss for the Bayesian network created by heuristic algorithm
  1. 3.

    Precision Analysis

Precision: It computes the proportion of positive identifications was actually correct out of total predicted positives. 3 × 3 confusion matrix created for the node sore throat which has three levels, namely low, moderate and high, diagonal elements represent the true positives along that column, and the remaining values in the same column represent false positives. Precision values have been computed with various threshold values ranging from 0.2 to 1.0 for the below confusion matrices from precision comparison graph shown in Fig. 3 (Table 3).

Fig. 3
A line graph plots precision versus threshold levels. The line starts at 0.94, drops to 0.87, rises to 0.95, again drops to 0.74, then rises to 0.93, and again drops. All values are estimated.

Precision values computed for different threshold values (0.2–1.0)

Table 3 3 × 3 confusion matrix created for the node sore throat

5 Conclusion

The proposed algorithm generates less number of candidate structures, thereby consumes comparatively less time complexity and finds the best structure in optimal time. Bayesian score is used as a scoring function to compute the score of the each candidate structure. The goodness fit of the best Bayesian network structure is measured in terms of Bayesian score, expected loss and precision values. It is observed that score of the best Bayesian network structure has improved by 10%. At the same time, expected loss is reduced by 1.0098 percentages.

However, present work can be extended for large number of nodes in the graph, and Bayesian score can be calculated for Bayesian networks having both discrete and continuous random variables.