1 Introduction

Score-based optimization, or structural learning, of statistical models is typically performed over a finite class of models, where the topology of the search space makes it challenging to build an algorithm that can efficiently traverse the hills and valleys of a multimodal target function. In Bayesian model learning in particular, it is frequently possible to score individual models against data using analytical formulas for the log unnormalized posterior, either exactly or approximately. Algorithms for stochastic simulation, such as Markov chain Monte Carlo (MCMC) and simulated annealing, represent popular ways to identify posterior optimal models. Learning of undirected graphical models, also known as Markov networks, is a widely considered application of such tools (Corander et al. 2008; Dellaportas and Forster 1999; Giudici and Castello 2003; Giudici and Green 1999; Madigan and Raftery 1994). One attractive property of these algorithms is their consistency in the limit, i.e., they are known to identify an optimal model with certainty as the number of iterations tends to infinity. On the other hand, this guarantees nothing about their performance for a finite number of iterations.

Statistical model learning has traditionally not been considered from the perspective of Boolean propositional logic, given the apparent separation of the two research traditions. However, learning of Bayesian networks and Markov networks has recently attained considerable interest in the computational logic and machine learning communities, and subsequently the use of various constraint-based methods, such as integer programming, maximum satisfiability, and answer set programming, has been proposed (Bartlett and Cussens 2013; Berg et al. 2014; Corander et al. 2013; Cussens 2008; Parviainen et al. 2014).

Graphical models are an essential tool for modular representations of multivariate systems. Despite the versatility of Markov networks in encoding the dependence structure over a set of discrete variables, certain more subtle types of independencies cannot be expressed in such models. The dependence structure induced by Markov networks may be unnecessarily stringent in that it can only convey conditional independencies that hold in the entire outcome space of the variables involved. This has motivated the development of new classes of models. Using the theory of log-linear models for contingency tables, Markov networks have been generalized in a number of ways (Corander 2003; Eriksen 1999, 2005; Højsgaard 2003, 2004). The common basis of these models is that conditional independence is augmented by an independence that holds only in a subset of the joint state space of the variables included in a particular condition. A similar construction has been proposed for Bayesian networks, yielding context-dependent Bayesian networks (Boutilier et al. 1996; Friedman and Goldszmidt 1996; Koller and Friedman 2009; Pensar et al. 2015).

Recently, a class of Markov networks belonging to the general family of context-specific graphical models was introduced (Nyman et al. 2014). Termed stratified graphical models (alternatively, labeled Markov networks), this is the first class of undirected graphical models that allows for the graphical representation of both conditional and context-specific independencies, while simultaneously enabling fully Bayesian learning under an explicit factorization of the joint probability distribution. A non-reversible Markov chain Monte Carlo (MCMC) algorithm has been utilized for learning the maximum a posteriori (MAP) model from data by Nyman et al. (2014).

In this paper, we further develop a constraint-based learning method introduced for ordinary Markov networks (Corander et al. 2013) by significantly extending its applicability with respect to network size and by also allowing context-specific independence restrictions. The earlier method is not directly applicable to larger Markov networks or labeled networks due to the impractically large number of possible model elements that would need to be explicitly considered. This prompted us to develop several formal pruning criteria, which reduce the number of relevant elements dramatically, so that constraint-based methods become applicable. Our pruning criteria are based on statistical arguments and act consistently in model rejection as the number of samples available for model learning tends to infinity. In addition, we replace the balancing condition and maximum spanning tree considerations of Corander et al. (2013) by a perfect elimination ordering, which is much better suited for obtaining a compact representation in terms of model constraints. We demonstrate the feasibility of this approach with examples whose model cardinalities range from very large to astronomical.

The structure of the paper is as follows. We start by introducing Markov networks and labeled Markov networks in Sect. 2. Section 3 presents the principles we use for pruning the set of candidate cliques to improve efficiency. A characterization of chordal graphs in terms of perfect elimination orderings is given in Sect. 4. Section 5 provides a constraint-based representation of ordinary Markov networks and a respective extension to labeled networks. An experimental evaluation of the proposed method is given in Sect. 6, and Sect. 7 concludes the paper.

2 Markov networks and their generalization

In the following we provide a brief introduction to graphical models and summarize results derived for context-dependent (labeled) Markov networks (Nyman et al. 2014). For a more comprehensive presentation of the theory of probabilistic graphical models, see the standard references (Koller and Friedman 2009; Lauritzen 1996; Whittaker 1990). An undirected graph \(G=\langle N, E\rangle \) consists of a set N of nodes standing for random variables and a set \(E\subseteq N \times N\) of undirected edges. Two nodes \(\delta \) and \(\gamma \) are adjacent in G if \(\{\delta , \gamma \} \in E\). A path is a sequence of nodes such that every two consecutive nodes in the sequence are adjacent. Two sets A and B of nodes are separated by a third set D of nodes if every path between a node in A and a node in B contains at least one node in D. An undirected graph is chordal if every cycle \(X_1, \ldots , X_n, X_1\) with \(n \ge 4\) has a chord, i.e., an edge connecting two non-consecutive nodes on the cycle. A clique c is a set of nodes such that every pair of nodes in c is adjacent. Given the set C of maximal cliques in a chordal graph, the multiset of separators can be obtained as intersections of the cliques ordered by a spanning tree of the respective clique graph (Golumbic 1980). Some details of this construction are given in Sect. 5.
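The separation criterion can be checked mechanically by a graph search. The following is a minimal Python sketch (ours, not part of the paper), assuming an adjacency-set representation of the graph:

```python
from collections import deque

def separates(adj, A, B, D):
    """Check whether node set D separates A from B in an undirected graph.

    adj: dict mapping each node to the set of its neighbours
    (a hypothetical representation, not prescribed by the paper).
    """
    # BFS from A through nodes outside D; D blocks every path it covers.
    seen = set(A) - set(D)
    queue = deque(seen)
    while queue:
        x = queue.popleft()
        if x in B:
            return False          # found a D-avoiding path into B
        for y in adj[x] - set(D):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return True

# 4-cycle with a chord: edges 1-2, 2-3, 3-4, 4-1, 2-4
adj = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
print(separates(adj, {1}, {3}, {2, 4}))  # True: {2,4} blocks all paths
print(separates(adj, {1}, {3}, {2}))     # False: path 1-4-3 remains
```

By the global Markov property stated below, the first call certifies \(X_1 \perp X_3 \mid X_2, X_4\) in any distribution that factorizes according to this graph.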

A Markov network consists of a graph \(G = \langle N,E\rangle \) and a joint distribution \(P_{N}\) over the variables \(X_N\). The graph specifies the dependence structure of the variables and \(P_{N}\) factorizes according to G (see below for chordal graphs). Given G it is possible to ascertain whether two sets \(X_A\) and \(X_B\) of variables are conditionally independent given another set \(X_D\) of variables, due to the global Markov property

$$\begin{aligned} X_A \perp X_B \mid X_D, \text { if } D \text { separates } A \text { from }B \text { in } G. \end{aligned}$$

For a chordal graph G the probability of a joint outcome \(x_N \in \mathscr {X}_N\), where \(\mathscr {X}_N\) denotes the state space of the variables \(X_N\), factorizes as

$$\begin{aligned} P_N(x_N)=\frac{\prod _{c \in C } P_{c}(x_{c})}{\prod _{s \in S} P_{s} (x_{s}) }. \end{aligned}$$

We assume that all outcomes have strictly positive probabilities and that the prior distribution for the model parameters factorizes with respect to the graph. Then, the marginal likelihood of a dataset \(\mathbf X \) given a chordal graph (Dawid and Lauritzen 1993) can be written as

$$\begin{aligned} P(\mathbf X \mid G)= \frac{\prod _{c \in C } P_{c}(\mathbf X _{c})}{\prod _{s \in S} P_{s} (\mathbf X _{s}) }. \end{aligned}$$
(1)

By employing a Dirichlet prior distribution on the probabilities of the outcomes \(x_c \in \mathscr {X}_c\), the terms \(P_{c}(\mathbf X _{c})\) and \(P_{s} (\mathbf X _{s})\) can be calculated analytically. These properties are inherited by the class of decomposable labeled Markov networks, which allow a full factorization of the probability of all joint outcomes, including context-specific independencies within maximal cliques. In addition, context-specific independencies can be expressed in such networks through labels. The assumption of decomposability of labeled networks permits an analytical scoring function, which has customarily been used also for ordinary Markov networks in Bayesian approaches to model learning.

Definition 1

(Label) Let \((G, P_N)\) be a Markov network. For all \(\{\delta ,\gamma \} \in E\), let \(L_{\{\delta ,\gamma \}}\) denote the set of nodes adjacent to both \(\delta \) and \(\gamma \). For a non-empty \(L_{\{\delta ,\gamma \}}\), define the label of the edge \(\{\delta ,\gamma \}\) as the subset \(\mathscr {L}_{\{\delta ,\gamma \}}\) of outcomes \(x_{L_{\{\delta ,\gamma \}}} \in \mathscr {X}_{L_{\{\delta ,\gamma \}}}\) for which \(X_{\delta }\) and \(X_{\gamma }\) are independent given \(X_{L_{\{\delta ,\gamma \}}} = x_{L_{\{\delta ,\gamma \}}}\), i.e.,

$$\begin{aligned} \mathscr {L}_{\{\delta ,\gamma \}} = \left\{ x_{L_{\{\delta ,\gamma \}}} \in \mathscr {X}_{L_{\{\delta ,\gamma \}}} : X_{\delta } \perp X_{\gamma } \mid X_{L_{\{\delta ,\gamma \}}} = x_{L_{\{\delta ,\gamma \}}} \right\} . \end{aligned}$$

A label establishes a context in which a specific conditional independence holds. An edge \(\{\delta , \gamma \}\) is said to be labeled if \(\mathscr {L}_{\{\delta , \gamma \}}\) is well-defined and non-empty. Restricting a Markov network with the collection L of all labels induces a labeled Markov network.

Definition 2

(Labeled Markov network) A labeled Markov network is defined by the triple \((G, L, P_N)\), where G is the underlying graph, L equals the joint collection of all labels \(\mathscr {L}_{\{\delta ,\gamma \}}\) for the edges of G, and \(P_N\) is a joint distribution over \(X_N\) that factorizes according to the restrictions imposed by G and L.

The pair \((G, L)\) constitutes a labeled graph \(G_L\). Figure 1 shows the graphical representation of a labeled Markov network containing three conditionally dependent variables with the context-specific independence \(X_2 \perp X_3 | X_1 = 1\).

Fig. 1 Labeled graph encompassing the context-specific independence \(X_2 \perp X_3 | X_1 = 1\)
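To make the notion concrete, the following Python sketch (ours, with hypothetical probability values) constructs a toy joint distribution over three binary variables exhibiting exactly the independence of Fig. 1: the conditional distribution of \((X_2, X_3)\) factorizes in the context \(X_1 = 1\) but not in the context \(X_1 = 0\):

```python
import itertools

# Hypothetical joint distribution over binary (X1, X2, X3), built so that
# X2 and X3 are independent given X1 = 1 but dependent given X1 = 0.
p = {}
for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    if x1 == 1:   # the conditional factorizes in this context
        p[(x1, x2, x3)] = 0.5 * (0.7 if x2 else 0.3) * (0.6 if x3 else 0.4)
    else:         # coupling x2 and x3 makes them dependent in context X1 = 0
        p[(x1, x2, x3)] = 0.5 * (0.4 if x2 == x3 else 0.1)

def cond(x1, x2, x3):
    """P(X2 = x2, X3 = x3 | X1 = x1)."""
    z = sum(v for (a, _, _), v in p.items() if a == x1)
    return p[(x1, x2, x3)] / z

def marg2(x1, x2):
    return sum(cond(x1, x2, x3) for x3 in (0, 1))

def marg3(x1, x3):
    return sum(cond(x1, x2, x3) for x2 in (0, 1))

# Context X1 = 1: the conditional equals the product of its marginals.
print(all(abs(cond(1, a, b) - marg2(1, a) * marg3(1, b)) < 1e-12
          for a in (0, 1) for b in (0, 1)))                     # True
# Context X1 = 0: the factorization fails.
print(abs(cond(0, 1, 1) - marg2(0, 1) * marg3(0, 1)) > 1e-3)    # True
```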

Imposing certain restrictions on the labeled graph enables the factorization of \(P(\mathbf X \mid G_L)\) according to (1), as well as analytic expressions for \(P_{c}(\mathbf X _{c})\) and \(P_{s}(\mathbf X _{s})\).

Definition 3

(Decomposable labeled graph) Let \((G, L)\) be a labeled graph such that G is chordal. Further, let \(E_{L}\) denote the set of labeled edges, \(E_{c}\) the set of all edges in a maximal clique c, and \(E_S\) the set of all edges in the separators of G. The labeled graph is decomposable if

$$\begin{aligned} E_{L}\cap E_{S}=\emptyset \end{aligned}$$

and, for all \(c\in C\),

$$\begin{aligned} E_{L} \cap E_{c} = \emptyset \quad \text { or } \bigcap _{\{\delta ,\gamma \}\in E_{L}\cap E_{c}} \{\delta ,\gamma \} \ne \emptyset . \end{aligned}$$

A labeled graph is decomposable if G is chordal, there are no labels for edges in separators, and in each maximal clique all labeled edges have at least one node in common. A labeled Markov network with a decomposable labeled graph is a decomposable labeled Markov network. The first two restrictions will allow for \(P_N\) to be factorized according to (1). The third restriction will enable an ordering \((X_1, \ldots , X_{|c|})\) of the variables in \(X_c\), such that the variables in \(\{X_1, \ldots , X_{|c|-1}\}\) can be considered dependent on each other regardless of the context, and the variable \(X_{|c|}\) may or may not be independent of a subset of \(\{X_1, \ldots , X_{|c|-1}\}\) depending on the context.

If \(X_{|c|}\) is the variable corresponding to the node in common to all labeled edges in a maximal clique, the variables \(\{X_1, \ldots , X_{|c|-1}\}\) can be considered parents of \(X_{|c|}\), denoted by \(\Pi _{|c|}\). In a Markov network each outcome of the variables in \(\Pi _{|c|}\) would induce a specific conditional distribution for \(X_{|c|}\). However, for labeled Markov networks some outcomes of \(\Pi _{|c|}\) may be grouped together, with all outcomes in a group inducing the same conditional distribution for \(X_{|c|}\) (Boutilier et al. 1996). This creates a partition of the outcome space of \(\Pi _{|c|}\), where each cell in the partition forms a distinguishable parent combination for \(X_{|c|}\).

For decomposable labeled graphs, \(P_{c}(\mathbf X _{c})\) can be calculated (Nyman et al. 2014) using the formula

$$\begin{aligned} \prod _{j=1}^{|c|}\prod _{l=1}^{q_{j}}\frac{\Gamma \left( \sum _{i=1}^{k_{j}}\alpha _{jil}\right) }{\Gamma \left( n(\pi _{j}^{l})+ \sum _{i=1}^{k_{j}}\alpha _{jil}\right) }\prod _{i=1}^{k_{j}}\frac{\Gamma \left( n(x_{j}^{i} \mid \pi _{j}^{l})+\alpha _{jil}\right) }{\Gamma (\alpha _{jil})}, \end{aligned}$$
(2)

where \(q_{j}\) is the number of distinguishable parent combinations for variable \(X_j\) (i.e., there are \(q_j\) distinct conditional distributions for variable \(X_j\)), \(k_{j}\) is the number of possible outcomes for variable \(X_j\), \(\alpha _{jil}\) is the hyperparameter in the Dirichlet distribution corresponding to the outcome i of variable \(X_j\) given that the parent combination of \(X_j\) equals l, \(n(\pi _{j}^{l})\) is the number of observations of the combination l for the parents of variable \(X_j\), and finally, \(n(x_{j}^{i} \mid \pi _{j}^{l})\) is the number of observations where the outcome of variable \(X_j\) is i and the observed outcome of the parents of \(X_j\) equals l. Note that for \(X_{|c|}\) a parent configuration l is not necessarily comprised of a single outcome of \(\Pi _{|c|}\), but rather the set of outcomes inducing the same conditional distribution for \(X_{|c|}\).
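The product in (2) is conveniently evaluated in log-space with the log-gamma function. Below is a sketch (ours, not from the paper) of the factor contributed by a single variable \(X_j\); the full \(\log P_{c}(\mathbf X _{c})\) is the sum of such terms over \(j = 1, \ldots, |c|\). The nested-list layout of counts and hyperparameters is our own convention:

```python
import math

def variable_term(counts, alphas):
    """Log of the j-th factor of formula (2) for one variable X_j.

    counts[l][i] = n(x_j^i | pi_j^l), alphas[l][i] = alpha_{jil},
    indexed by distinguishable parent combination l and outcome i
    (hypothetical layout, ours).
    """
    log_p = 0.0
    for n_l, a_l in zip(counts, alphas):
        a_sum = sum(a_l)                   # sum_i alpha_{jil}
        n_parent = sum(n_l)                # n(pi_j^l)
        log_p += math.lgamma(a_sum) - math.lgamma(n_parent + a_sum)
        for n_i, a_i in zip(n_l, a_l):
            log_p += math.lgamma(n_i + a_i) - math.lgamma(a_i)
    return log_p

# Sanity check: one parent combination, binary X_j, Dirichlet(1, 1) prior,
# counts (2, 1) -> Dirichlet-multinomial marginal likelihood 1/12.
print(abs(variable_term([[2, 1]], [[1.0, 1.0]]) - math.log(1 / 12)) < 1e-12)
```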

The following choice of hyperparameters for the Dirichlet distribution is motivated by the desideratum that the ordering of the variables in \(\Pi _{|c|}\) should be irrelevant for statistical learning of the model structure:

$$\begin{aligned} \alpha _{ jil} = \alpha _{ jl} = \frac{|\mathscr {X}_{c}| \cdot \lambda _{ jl}}{\pi _{j} \cdot k_{ j}}, \end{aligned}$$

where \(\pi _{j}\) is the total number of possible outcomes for the parents of variable \(X_j\) and \(k_{j}\) is the number of possible outcomes for variable \(X_j\). Further, \(\lambda _{jl}\) equals the number of outcomes for the parents of variable \(X_j\) in group l with an equivalent effect on \(X_j\), if \(X_j\) is the last variable in the ordering; otherwise, \(\lambda _{jl}\) equals one. The distribution \(P_{s}(\mathbf X _{s})\) can also be calculated using (2). Note that our choice of hyperparameters need not lead to a hyper-consistent prior distribution, as defined in Dawid and Lauritzen (1993). However, research in the machine learning field has clearly demonstrated that Dirichlet priors which are not necessarily hyper-consistent do in fact allow the correct underlying dependence structure to be learned better than priors which ensure consistency. Marginal likelihood under a Dirichlet prior with constant hyperparameters has been shown to be very robust in comparison with scores based on consistent priors, such as the BDe score (Silander et al. 2007, 2008, 2010).

Given a labeled Markov network, the marginal likelihood of a dataset can be calculated by combining (1) and (2), and for practical purposes we consider only the logarithmic value \(\log P(\mathbf X |G_L)\). Introducing the notation \(v(c, L) = \log P_{c}(\mathbf X _{c})\), which for non-empty label sets is dependent on the value of L, the log marginal likelihood can be written as

$$\begin{aligned} \log P(\mathbf X |G_L) = \sum _{c \in C } v(c, L) - \sum _{s \in S} v(s, L). \end{aligned}$$
(3)

The learning problem we consider consists of finding the labeled graph \(G_L\) that maximizes the posterior distribution

$$\begin{aligned} P(G_L|\mathbf X ) = \frac{P(\mathbf X |G_L) P(G_L)}{\sum _{G_L \in \mathscr {G}} P(\mathbf X |G_L) P(G_L)}, \end{aligned}$$

which can be reduced to identifying the model that optimizes \(P(\mathbf X |G_L) P(G_L)\). Here \(\mathscr {G}\) denotes the set of all graphs under consideration and \(P(G_L)\) is the prior probability assigned to \(G_L\). Here we use a prior penalizing dense graphs

$$\begin{aligned} P(G_{L}) \propto 2^{|N| - f}, \end{aligned}$$

where |N| is the number of nodes in G and f is the number of free parameters in a distribution \(P_{N}\) obeying G. This choice of prior is motivated by the fact that adding a label to a sparse graph often induces a context-specific independence in a larger context than adding a label to a dense graph. The value \(2^{f - |N|}\) is a numerically convenient approximation of the number of unique dependence structures that can be derived by adding labels to G.

Different types of MCMC methods are often applied to model optimization problems. While offering statistical consistency for the resulting estimates, a drawback of such an approach is that, as the model space grows, the number of iterations required to sample an optimal model even once may become prohibitively large. In addition, there are no general guarantees that a finite sample estimate corresponds to a true posterior optimal model. These aspects are particularly relevant when considering labeled Markov networks, as the size of the model space is astronomical even for a moderate number of nodes. For Markov networks the model space grows according to the formula \(2^{d(d-1)/2}\), where d is the number of nodes in the graph. For decomposable labeled Markov networks over a set of binary nodes, a lower bound for the cardinality of the model space is given by

$$\begin{aligned} d \cdot 2^{2^{ d-2 } \cdot (d-1)} - d +1 -\frac{d (d-1)}{2} \cdot (2^{2^{d-2}} - 1), \end{aligned}$$

which is the number of different label combinations for a maximal clique with d nodes satisfying the restrictions of a decomposable labeled graph. Using this result, Table 1 shows the size of the model space for different numbers of nodes in Markov networks and labeled Markov networks.

Table 1 Size of model space given the number of nodes in the system
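The growth rates in question can be reproduced directly from the two formulas above; a small sketch (ours, not from the paper):

```python
def markov_space(d):
    """Number of undirected graphs on d nodes: 2^(d(d-1)/2)."""
    return 2 ** (d * (d - 1) // 2)

def labeled_lower_bound(d):
    """Lower bound on the number of decomposable labeled Markov networks
    over d binary nodes, evaluating the formula from the text."""
    return (d * 2 ** (2 ** (d - 2) * (d - 1))
            - d + 1
            - d * (d - 1) // 2 * (2 ** (2 ** (d - 2)) - 1))

for d in (3, 4, 5):
    print(d, markov_space(d), labeled_lower_bound(d))
```

Already for \(d = 5\) the lower bound for the labeled case dwarfs the \(2^{10} = 1024\) ordinary Markov networks, illustrating the explosion discussed above.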

3 Filtering clique candidates

To apply a constraint-based learning method of the type introduced in Corander et al. (2013) it is necessary to construct a database containing the scores (log unnormalized posteriors) v(cL) for all possible clique candidates c (for Markov networks) and the respective labelings L of edges (for labeled Markov networks). Due to the rapidly increasing number of alternatives, it would be infeasible to obtain a reasonably sized input for constraint solvers without pruning clique candidates.

Therefore, we present a number of principles that can be used in practice to cut down the size of the clique database. A labeling L of a chordal graph is proper iff L satisfies the conditions of Definition 3. Thus, given a decomposable labeled graph \(G_L\), the labeling L is proper by definition.

Lemma 1

Consider a chordal graph G and two different sets L and \(L^*\) of proper labelings. Let \(E_{L}^c\) and \(E_{L^*}^c\) denote the set of edges of a maximal clique c labeled by L and \(L^*\), respectively. If \(E^c_{L} \subseteq E^c_{L^*}\) and \(v(c,L) > v(c,L^*)\), then \(v(c,L^*)\) is irrelevant for model optimality.

Proof

Since L and \(L^*\) are both proper and \(E^c_{L} \subseteq E^c_{L^*}\), any restrictions to the set of labeled edges in c satisfied by \(L^*\) will automatically be satisfied by L. Further, as \(v(c,L) > v(c,L^*)\), c cannot be labeled by \(L^*\) in an optimal labeled graph and thus the labeling \(L^*\) and its score \(v(c,L^*)\) can be omitted. \(\square \)

A consequence of Lemma 1 is that for any maximal clique \(c\in C\) and any proper labeling \(L^*\) such that \(v(c,\emptyset ) > v(c,L^*)\), the clique c cannot be labeled by \(L^*\) in an optimal labeled graph and \(L^*\) need not be considered. Thus the main purpose of Lemma 1 is to prune the collection of cliques and the related information, i.e., scores and labels, before applying constraint-based search methods. In the following, we sometimes use the abbreviation \(v(c)=v(c,\emptyset )\).
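Lemma 1 translates into a simple dominance check over the labelings stored for a clique. A quadratic toy sketch (ours; a real clique database would index labelings more cleverly, and the example scores are hypothetical):

```python
def prune_labelings(scored):
    """Prune dominated labelings for a single clique c, per Lemma 1.

    scored: list of (labeled_edge_set, score) pairs, including the empty
    labeling (frozenset(), v(c, {})). A labeling L* is dropped whenever
    some L with E_L a subset of E_L* has a strictly higher score.
    """
    keep = []
    for edges_star, score_star in scored:
        dominated = any(edges <= edges_star and score > score_star
                        for edges, score in scored)
        if not dominated:
            keep.append((edges_star, score_star))
    return keep

e12, e13 = frozenset({(1, 2)}), frozenset({(1, 3)})
scored = [(frozenset(), -10.0),   # empty labeling, score v(c)
          (e12, -9.0),            # improves on v(c): kept
          (e12 | e13, -9.5)]      # dominated by labeling {e12}: pruned
print(prune_labelings(scored))
```

Note that the consequence stated above is the special case where the dominating labeling is the empty one.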

There are, however, further criteria that can be used to filter clique candidates more aggressively. The downside of applying such filtering methods is that too aggressive pruning may sacrifice optimal models. It is worth noting that Bayesian score-based pruning of models as part of the search was proposed already in Madigan and Raftery (1994). Our first pruning principle is based on Bayes factors (Kass and Raftery 1995) and can be used to identify edges that are weakly supported (or unsupported) by the data. Applying the scoring functions defined in the context of the log marginal likelihood (3), we provide the following definition. For later reference, we introduce the abbreviation bforig, standing for the original Bayes factor.

Definition 4

(bforig) Let \(\{X,Y\}\) be a clique of two random variables. The mutual dependence of X and Y is weakly supported by the data \(\mathbf X \) if the log Bayes factor

$$\begin{aligned} \mathrm {bf}(X,Y)=v(\{X,Y\})-v(\{X\})-v(\{Y\})<\log (10^{-k}), \end{aligned}$$
(4)

where \(k=0,1,2,\ldots \) is a parameter.

It is worth pointing out that (4) does not mention the labeling L, since two-element cliques cannot have labels, making the scores independent of L. The condition can be used to remove any clique candidate c such that \(\{X,Y\}\subseteq c\). Moreover, the parameter \(k=0,1,2,\ldots \) controls the depth of filtering: the higher the value of k, the fewer cliques will be pruned. However, when larger clique candidates c are considered, condition (4) does not properly take the context created by c into account. This suggests a generalization of the log Bayes factor \(\mathrm {bf}(X,Y)\) in the context of c.

Definition 5

(Bayes factor) Given a clique c and variables \(X,Y\in c\), define

$$\begin{aligned}&\mathrm {bf}(X,Y\mid c)= \nonumber \\&\quad v(c)-v(c\setminus \{X\})-v (c\setminus \{Y\})+v(c\setminus \{X,Y\}). \end{aligned}$$
(5)

Definition 5 allows us to generalize condition (4) for larger cliques while omitting labelings, which effectively amounts to assuming an empty labeling \(\emptyset \). Depending on the point of interest, we introduce the abbreviations node, part, and edge for the following three generalizations.

Definition 6

(node, part, edge) A clique c is weakly supported by the data \(\mathbf X \) if one of the following conditions holds:

  1. node: There is a variable \(X\in c\) such that

     $$\begin{aligned} \sum _{Y\in c\setminus \{X\}}\frac{\mathrm {bf}(X,Y\mid c)}{|c|-1}<\log (10^{-k}). \end{aligned}$$
     (6)

  2. part: There is a partitioning of c into \(c_1\sqcup c_2\) such that

     $$\begin{aligned} \sum _{X\in c_1,Y\in c_2}\frac{\mathrm {bf}(X,Y\mid c)}{|c_1||c_2|}<\log (10^{-k}). \end{aligned}$$
     (7)

  3. edge: There are two variables \(X,Y\in c\) such that

     $$\begin{aligned} \mathrm {bf}(X,Y\mid c)<\log (10^{-k}). \end{aligned}$$
     (8)

By setting \(c=\{X,Y\}\), \(c_1=\{X\}\), and \(c_2=\{Y\}\), the conditions (6)–(8) coincide with (4), but their intended use is different. The principle of Definition 4 applies to any clique containing a suspicious edge, whereas those of Definition 6 are clearly clique-specific criteria. Since these criteria aim at checking the consistency of scores at the edge level, the sums in (6) and (7) are averaged over the respective numbers of edges \(|c|-1\) and \(|c_1||c_2|\) involved. These denominators have no effect when \(k=0\) and \(\log (10^{-k})=0\), but they regulate the effect of pruning as \(k>0\) grows: the higher the value of k, the fewer cliques will be pruned.
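Conditions (6)–(8) operate directly on the clique scores. The following sketch (ours, not from the paper) assumes the scores v(c) are stored in a dictionary keyed by frozen sets of variables; the example scores are hypothetical:

```python
import math
from itertools import combinations

def bf(v, c, x, y):
    """Log Bayes factor of formula (5); v maps frozensets to scores v(c)."""
    c = frozenset(c)
    return v[c] - v[c - {x}] - v[c - {y}] + v[c - {x, y}]

def weakly_supported(v, c, k=0):
    """Return which of conditions node (6), part (7), edge (8) flag clique
    c as weakly supported. (Exhaustive over partitions; a sketch only.)"""
    thr = -k * math.log(10)          # log(10 ** -k)
    c = frozenset(c)
    node = any(sum(bf(v, c, x, y) for y in c - {x}) / (len(c) - 1) < thr
               for x in c)
    part = any(sum(bf(v, c, x, y) for x in c1 for y in c - c1)
               / (len(c1) * (len(c) - len(c1))) < thr
               for r in range(1, len(c))
               for c1 in map(frozenset, combinations(c, r)))
    edge = any(bf(v, c, x, y) < thr for x, y in combinations(c, 2))
    return node, part, edge

# Hypothetical scores for a 3-clique: every bf(X,Y|c) equals -0.5.
v = {frozenset(s): sc for s, sc in [
    ((), 0.0), ((1,), -1.0), ((2,), -1.0), ((3,), -1.0),
    ((1, 2), -2.0), ((1, 3), -2.0), ((2, 3), -2.0), ((1, 2, 3), -3.5)]}
print(weakly_supported(v, {1, 2, 3}, k=0))  # (True, True, True)
print(weakly_supported(v, {1, 2, 3}, k=1))  # (False, False, False)
```

The two calls illustrate both the implication chain of Proposition 1 and how increasing k weakens the pruning.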

Proposition 1

Given a clique candidate c, condition (6) implies condition (7) that implies condition (8).

Proof

The first implication is clear by setting \(c_1=\{X\}\) and \(c_2=c\setminus \{X\}\). For the second implication, suppose that (7) holds for \(c=c_1\sqcup c_2\) but, for each \(X,Y\in c\), \(\mathrm {bf}(X,Y\mid c)\ge \log (10^{-k})\). This contradicts (7), since then the sum of log Bayes factors satisfies \(\sum _{X\in c_1,Y\in c_2}\mathrm {bf}(X,Y\mid c)\ge |c_1||c_2|\log (10^{-k})\). \(\square \)

Under the assumption that the data generating distribution is faithful (Koller and Friedman 2009) to a chordal graph, the conditions defined in (4) and (6)–(8) will not prune essential cliques when the size of the data goes to infinity. Here, a distribution is said to be faithful to a graph if any marginal or conditional independence present in the distribution can be determined from the graph. In such a scenario, as the size of the data goes to infinity, we can assume that an optimal graph, which maximizes the marginal likelihood, will correctly convey all conditional and marginal independencies in the generating distribution. We can further assume that implementations (4) and (5) of the Bayes factor will perform consistently. Consider a clique c in an optimal graph and any pair \(\{X, Y\} \subseteq c\). As there is an edge between X and Y, we know that they are conditionally dependent in the generating distribution given any other set of variables, including the empty set and \(c \setminus \{X, Y\}\). This means that the functions \(\mathrm {bf}(\ldots )\), defined in (4) and (5), will tend to infinity as the size of data tends to infinity. Subsequently, the essential clique c will not be pruned.

Table 2 The effects of different filtering schemes based on the number of remaining clique candidates for given sets of conditions

In this paper, the experiments are performed on three datasets concerning heart disease (6 binary variables), economic behavior (8 binary variables), and election behavior (25 binary variables), abbreviated by heart, econ, and hs in the sequel; see, e.g., (Nyman et al. 2014; Pensar et al. 2015) for further details on these datasets. The sets heart and econ are based on complete clique databases, whereas that of hs contains all candidates up to size 8. To illustrate the effect of filtering, the column all of Table 2 gives the numbers of cliques involved in the datasets. In the labeled case, potential labelings have already been pruned using Lemma 1. The original numbers of candidates for heart-lab and econ-lab were 506 and 2,038, respectively. The mnemonics of the last four columns refer back to Definitions 4 and 6. The filtering scheme bforig is based on the deletion of cliques c containing an edge \(\{X,Y\}\) such that condition (4) holds. The filtering schemes node, part, and edge amount to the three conditions (6)–(8) for clique removal, respectively. Each scheme is implemented so that the resulting clique database remains downward closed, i.e., if a particular clique c is present in the database, so are all its proper subcliques. This is essential for scoring based on perfect elimination orderings, to be introduced in the next section. Summarizing Table 2, the edge scheme provides the greatest pruning effect, at the cost of an increased risk of sacrificing optimal models (cf. Table 7). In the labeled case, the relationships between the pruning criteria are somewhat blurred by the nonuniform distribution of different labelings over clique candidates, which are nevertheless filtered in the same way as without labels.

We consider filtering as off-line computation that is performed only once. In fact, the implementation was not designed to be particularly efficient. For econ and heart, filtering takes only fractions of a second and is thus negligible. In the case of hs, the number of cliques subject to filtering is substantially higher. When \(k=0\), the filtering times are 211, 501, 8833, and 299 seconds for the schemes bforig, node, part, and edge, respectively. The times do not vary substantially when \(k=1\) and \(k=2\). The considerably longer filtering time for the part scheme is due to the high number of cliques as well as the variety of ways in which each clique can be partitioned.

Fig. 2 Applying a perfect elimination ordering to a chordal graph

4 Perfect elimination orderings

Undirected trees have the following recursive property: if a leaf node and the respective edge are eliminated, the result is still a tree and the entire tree can be eliminated in this way. This gives rise to an elimination ordering for the nodes of the tree. Since chordal graphs generalize undirected trees, the same idea can be applied to chordal graphs (Chandran et al. 2003; Galinier et al. 1995).

Definition 7

(Perfect elimination ordering) Let \(G = \langle N,E\rangle \) be an undirected graph. Further, let \(X_1 <\cdots < X_n\) be a strict total ordering of N. Define \(N_{>}(X)\) as the set of neighbors of X that follow X in the ordering. Then, \(<\) is a perfect elimination ordering (PEO) for G if \(N_{>}(X)\) is a clique of G for every \(X\in N\).

Perfect elimination orderings enable a chordality test.

Theorem 1

(Dirac 1961; Rose 1970) A graph is chordal iff it has a perfect elimination ordering.

This connection between chordality and PEOs leads to a procedure in which the nodes of a graph are eliminated one by one. Any node in the remaining graph whose neighbors form a clique (possibly empty) is eligible for elimination. If all nodes can be eliminated, then the original graph is chordal. Note that the above definition of PEO is equivalent to the standard definition of a perfect ordering (Lauritzen 1996); however, below we utilize PEOs also for labeled graphs, which are not considered in Lauritzen (1996).

Example 1

Figure 2 illustrates the application of the above procedure to an 8-node Markov network (a chordal graph having random variables as its nodes) according to the PEO \(A<C<H<F<B<E<G<D\). Since all nodes are eliminated in the end, the graph is chordal by Theorem 1.

Many other orderings are applicable as well. It is also possible to eliminate nodes in parallel. For example A, C, H, and F could all be eliminated first, and then the clique \(\{B,D,E,G\}\) can be eliminated node by node in four steps.   \(\square \)
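The elimination procedure of Example 1 can be implemented as a greedy search for simplicial nodes, i.e., nodes whose remaining neighbors form a clique; in a chordal graph such a node always exists, so greedy elimination is complete. A Python sketch (ours, not from the paper):

```python
def chordality_peo(adj):
    """Greedy elimination per Theorem 1: return a PEO if the graph is
    chordal, None otherwise. adj maps each node to its neighbour set."""
    adj = {x: set(ys) for x, ys in adj.items()}   # work on a copy
    order = []
    while adj:
        for x, nbrs in adj.items():
            # x is simplicial if its remaining neighbours are pairwise adjacent
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                break
        else:
            return None            # no simplicial node left: not chordal
        order.append(x)            # eliminate x
        for a in adj[x]:
            adj[a].discard(x)
        del adj[x]
    return order

# The 4-cycle 1-2-3-4 is not chordal; adding the chord 1-3 makes it chordal.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(chordality_peo(cycle))                 # None
cycle[1].add(3); cycle[3].add(1)
print(chordality_peo(cycle) is not None)     # True
```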

To enable the scoring of Markov networks, the expression (3) of log marginal likelihood can be reformulated to better fit the purposes of constraint-based optimization.

Proposition 2

Let \(X_1<\ldots <X_n\) be a PEO for a labeled decomposable graph \(G_L\).

If \(s_i=N_{>}(X_i)\) and \(c_i=s_i\cup \{X_i\}\) are the cliques of \(G_L\) induced by \(X_1<\ldots <X_n\) for \(1\le i\le n\), and \(v(c_i,L)\) is defined as \(v(c_i,\emptyset )\) for non-maximal cliques \(c_i\), then the log marginal likelihood of a dataset \(\mathbf X \) given \(G_L\) is

$$\begin{aligned} \log P(\mathbf X \mid G_L)=\sum _{i=1}^{n} v(c_i,L)-\sum _{i=1}^{n} v(s_i,L). \end{aligned}$$

The score differences \(d(c_i,X_i,L)=v(c_i,L)-v(s_i,L)\) for \(1\le i\le n\) thus enable a differential calculation

$$\begin{aligned} \log P(\mathbf X \mid G_L)=\sum _{i=1}^{n}d(c_i,X_i,L). \end{aligned}$$
(9)

The preceding scheme can be further adjusted by isolating the effects of labelings L on scores. In the following, we let \( d(c,X_i)=v(c)-v(c\setminus \{X_i\}) \), where c is a clique and \(X_i\in c\) a variable. In the special case that \(c=\{X_i\}\), we have \(d(\{X_i\},X_i)=v(\{X_i\})-v(\emptyset )=v(\{X_i\})\). The effect of a labeling L on the score of c can be formalized as a reward

$$\begin{aligned} r(c,L)=v(c,L)-v(c,\emptyset ) \end{aligned}$$
(10)

that does not depend on the nodes of c and is guaranteed to be nonnegative by Lemma 1. Thus (9) can be rewritten as

$$\begin{aligned} \log P(\mathbf X \mid G_L)= \sum _{i=1}^{n}d(c_i,X_i)+\sum _{i=1}^{n}r(c_i,L). \end{aligned}$$
(11)
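As a small sanity check on the telescoping sums behind (9), the differential score can be evaluated for a toy clique score v (here simply set cardinality, not the actual marginal-likelihood score (3); all names are illustrative):

```python
def differential_score(peo, context, v):
    """Sum of d(c_i, X_i) = v(c_i) - v(c_i \\ {X_i}) along a PEO, as in (9).
    `peo` lists the variables in elimination order, `context` maps each
    variable to its elimination context c_i, and `v` scores sets of variables
    with v(empty set) == 0."""
    return sum(v(frozenset(context[x])) - v(frozenset(context[x]) - {x})
               for x in peo)

# Toy score: v(S) = |S|, so every difference d(c_i, X_i) contributes exactly 1
peo = ["A", "B", "C"]
context = {"A": {"A", "B"}, "B": {"B", "C"}, "C": {"C"}}
score = differential_score(peo, context, len)
```

With this toy v, the telescoping yields one unit per eliminated variable, i.e., a total of 3 for the three-variable chain above.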

5 Learning Markov networks by constraint optimization

Representing the structure learning problem of decomposable Markov networks is feasible in constraint satisfaction frameworks such as Boolean satisfiability (SAT) (Cook 1971) and its extension by optimization capabilities, i.e., maximum satisfiability (MAXSAT) (Johnson 1974). Also other extensions such as SAT modulo theories (SMT) (Sebastiani and Tomasi 2012) and integer programming (IP) (Gomory 1958) could be used. Answer set programming (ASP) (Brewka et al. 2011; Lifschitz 2002; Marek and Truszczyński 1999; Niemelä 1999) offers similar primitives in the form of rules rather than propositional formulas or linear equations. We begin by reviewing the proof-of-concept representation of the learning problem for Markov networks (Corander et al. 2013). These ideas form the starting point for developing an improved encoding for Markov networks and generalizing the encoding to the labeled case.

Definition 8

(Solutions (Corander et al. 2013)) Given a set N of nodes representing random variables and a scoring function \(v:{\mathbf {2}}^N\rightarrow \mathbb {R}\) based on log marginal likelihoods (3), a set \(C=\{c_1,\ldots ,c_n\}\) of cliques is a solution to the network learning problem iff

  1. every node is covered by a clique, i.e., \(\bigcup _{i=1}^n c_i = N\),

  2. cliques in C are set-inclusion maximal,

  3. the graph \(\langle N,E\rangle \) with \(E=\bigcup _{c\in C} E_c\) is chordal,

  4. the set C has a maximum weight spanning tree labeled by a set \(S=\{s_1,\ldots ,s_m\}\) of separators, and

  5. C and S maximize \(v(C,S)=\sum _{c \in C } v(c) - \sum _{s \in S} v(s)\).

The constraints on solutions given in Definition 8 can be encoded in any of the mentioned formalisms for constraint optimization. For instance, if a Boolean (0-1) variable \(\mathrm {in}_c\) represents that a clique c is a part of a candidate solution, the first constraint can be formalized by a disjunction \(\mathrm {in}_{c_1}\vee \ldots \vee \mathrm {in}_{c_k}\), where \(c_1,\ldots ,c_k\) are all cliques that contain a particular node. Using dedicated variables \(\mathrm {e}_{i,j}\) to represent the edges of the resulting graph, expressing the maximality of cliques is straightforward. In contrast, obtaining a compact representation for chordality and the existence of a maximum weight spanning tree gets far more involved. For small graphs, however, chordality could be enforced by explicitly denying chordless cycles of length four or more. For the spanning tree condition, a key idea of Corander et al. (2013) is that the maximum weight of separators can be replaced by a balancing condition: every node occurs in the chosen cliques exactly once more than in the respective separators. This enables the recognition of separators in terms of standard cardinality constraints (Sinz 2005) and hence allows for determining the overall score (3) of the resulting network.
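The balancing condition itself amounts to a multiset comparison; the following sketch (with illustrative data, not the constraint encoding itself) checks it for a given choice of cliques and separators:

```python
from collections import Counter

def balanced(cliques, separators):
    """Balancing condition of Corander et al. (2013): every node occurs in
    the chosen cliques exactly once more often than in the separators.
    (Nodes appearing only in separators would violate clique coverage and
    are not considered here.)"""
    in_c = Counter(x for c in cliques for x in c)
    in_s = Counter(x for s in separators for x in s)
    return all(in_c[x] == in_s[x] + 1 for x in in_c)

# Chain A-B-C decomposed into cliques {A,B}, {B,C} with separator {B}
ok = balanced([{"A", "B"}, {"B", "C"}], [{"B"}])
```

Dropping the separator {B} from the example breaks the balance for B, which occurs twice in the cliques.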

5.1 Encoding based on PEOs

In what follows, we present an encoding of the Markov network learning problem that improves the one summarized above in a number of ways. In particular, the idea is to exploit PEOs in the encoding of the chordality check and the results of Sect. 4 for scoring Markov networks. The goal is to avoid the determination of separators as far as possible. This is because the connection of a separator to the cliques it separates from each other gives rise to a cubic relation in the number of candidate cliques, i.e., substantial space complexity. We thus restate the requirements from Definition 8.

Definition 9

(Solutions reformulated) Given a function \(d:{\mathbf {2}}^N\times N\rightarrow \mathbb {R}\), a graph \(\langle N,E\rangle \) based on \(N=\{X_1,\ldots ,X_n\}\) is a solution to the Markov network learning problem iff

  1. every node \(X_i\) has a clique \(c_i\) as its context of elimination, where \(X_i\in c_i\) and \(E=\bigcup _{i=1}^{n}E_{c_i}\),

  2. there is a perfect elimination ordering for \(\langle N,E\rangle \), and

  3. \(\sum _{i=1}^{n}d(c_i,X_i)\) corresponding to the log marginal likelihood (9) is maximized.

The elimination of \(X_i\) in the context of \(c_i\) means that \(X_i\) and all edges of \(c_i\) incident with \(X_i\) are removed. For instance, the node A is removed in the context of \(\{A,B,D,G\}\) in Fig. 2. To select a candidate graph, we introduce a Boolean variable \(\mathrm {in}_i^c\) for each node \(X_i\) and clique candidate c with \(X_i\in c\). If \(c_1,\ldots ,c_m\) are the clique candidates containing \(X_i\), the choice of the elimination context for \(X_i\) can be expressed by \(\mathrm {in}_i^{c_1}\vee \ldots \vee \mathrm {in}_i^{c_m}\). This disjunction should be made exclusive by adding \(\lnot \mathrm {in}_i^{c_j}\vee \lnot \mathrm {in}_i^{c_k}\) for all \(1\le j<k\le m\), or by adding an equivalent cardinality constraint allowing only one of the variables \(\mathrm {in}_i^{c_1},\ldots ,\mathrm {in}_i^{c_m}\) to be true for \(X_i\). The Boolean variables utilized by the encoding are collected in Table 3 for the reader’s convenience.
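For illustration, the clauses for an exclusive context choice can be generated as follows (a sketch; the literal strings are purely illustrative):

```python
from itertools import combinations

def exactly_one_clauses(variables):
    """Exactly-one encoding in CNF: one at-least-one clause plus pairwise
    at-most-one clauses; negation is written as a '-' prefix on the literal."""
    at_least_one = [list(variables)]
    at_most_one = [[f"-{a}", f"-{b}"] for a, b in combinations(variables, 2)]
    return at_least_one + at_most_one

clauses = exactly_one_clauses(["in_1^c1", "in_1^c5", "in_1^c6", "in_1^c10"])
```

The pairwise encoding is quadratic in the number of candidate cliques, which is why a cardinality constraint, as mentioned above, may be preferable for nodes with many candidates.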

Our next objective is to enforce the chordality of the selected graph. This will be achieved by checking the existence of a PEO for the graph as formalized by Theorem 1. However, rather than insisting on a total ordering of the nodes, we allow for parallel eliminations to limit the number of steps and to achieve a more compact encoding. If parallel eliminations are considered, the worst case can be illustrated by an n-element clique \(\{X_1,X_2,\ldots ,X_n\}\) whose elimination requires n elimination steps using some arbitrary ordering as PEO (cf. the clique \(\{B,D,E,G\}\) in Fig. 2). The reason is that the potential removals of B, D, E, and G compete over the same resources (edges) and cannot be done at once in view of the differential score calculation (9).

Table 3 Boolean variables used in the PEO-based encoding

To formalize the chordality test for a variable \(X_i\), we introduce Boolean variables \(\mathrm {el}_{i,j}\) and \(\mathrm {el}_{i,j}^c\) for each elimination step \(1\le j\le n\) and clique candidate c such that \(X_i\in c\). The role of \(\mathrm {el}_{i,j}^c\) is to signal the elimination of \(X_i\) in context c at step \(j-1\) to the elements of \(c\setminus \{X_i\}\), which may then be eliminated from step j on. In turn, let \(c_1,\ldots ,c_m\) be all possible elimination contexts for variables \(X_{i_1}\in c_1,\ldots ,X_{i_m}\in c_m\) such that \(X_i\in c_1\setminus \{X_{i_1}\},\dots ,X_i\in c_m\setminus \{X_{i_m}\}\). Then, \(X_i\) can be eliminated once all \(X_{i_k}\) for \(1\le k\le m\) whose elimination context is \(c_k\), as indicated by \(\mathrm {in}_{i_k}^{c_k}\), have been eliminated. This condition is captured by

$$\begin{aligned} \mathrm {el}_{i,j}\leftrightarrow \mathrm {el}_{{i_1},j}^{c_1}\wedge \ldots \wedge \mathrm {el}_{{i_m},j}^{c_m}, \end{aligned}$$
(12)

where the idea that \(\mathrm {el}_{i_k,j}^{c_k}\) is true if \(X_{i_k}\) is eliminated at a step earlier than j or in a context other than \(c_k\) is recursively formalized by

$$\begin{aligned} \mathrm {el}_{i_k,j}^{c_k}\leftrightarrow \mathrm {el}_{i_k,j-1} \vee \lnot \mathrm {in}_{i_k}^{c_k} \end{aligned}$$
(13)

for \(j>1\) as well as \(\mathrm {el}_{i_k,1}^{c_k}\leftrightarrow \lnot \mathrm {in}_{i_k}^{c_k}\) for the base case \(j=1\). The definitions above aim at expressing chordality as

$$\begin{aligned} \mathrm {el}_{1,n}\wedge \ldots \wedge \mathrm {el}_{n,n}, \end{aligned}$$

requiring the elimination of all nodes in the end. Due to recursion over elimination steps, the overall space complexity is quadratic in n. However, depending on the target formalism, a linear representation can be feasible. This holds, e.g., for encodings in ASP that natively support recursion.
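The fixpoint defined by (12) and (13) can be simulated directly once the elimination contexts have been chosen. In the following sketch only the chosen contexts matter, since \(\lnot \mathrm {in}_{i_k}^{c_k}\) immediately satisfies (13) for every unchosen context (names and data structures are ours):

```python
def simulate_elimination(nodes, chosen):
    """Fixpoint of formulas (12)-(13): node X_i is eliminable at step j once
    every other node whose chosen elimination context contains X_i has been
    eliminated at an earlier step. `chosen` maps each node to its selected
    elimination context (a set of nodes containing it)."""
    eliminated = set()
    history = []
    for _ in range(len(nodes)):
        step = {x for x in nodes
                if all(m in eliminated
                       for m in nodes
                       if m != x and x in chosen[m])}
        history.append(step)
        eliminated = step  # persistence: the eliminated set only grows
    return history
```

On a chain with contexts \(\{X_1,X_2\}\), \(\{X_2,X_3\}\), \(\{X_3\}\), the nodes are eliminated one per step. If two nodes pick the same two-node clique as their context, they block each other and nothing is ever eliminated, mirroring the interdependency discussed below.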

Fig. 3 Illustration of the encoding of the chordality test

Example 2

To illustrate the encoding of the chordality condition, let us consider the Markov network in Fig. 3 and the following candidates:

Since \(X_1\) is mentioned by cliques \(c_1\), \(c_5\), \(c_6\), and \(c_{10}\), the choice of an elimination context is expressed by

Analogous formulas are needed for the nodes \(X_2\), \(X_3\), and \(X_4\). Let us then consider a concrete elimination scenario, where \(X_1\) and \(X_4\) are first removed in parallel, then \(X_2\) is removed, and finally \(X_3\). This scenario can be realized by setting the variables \(\mathrm {in}_1^{c_{10}}\), \(\mathrm {in}_2^{c_7}\), \(\mathrm {in}_3^{c_3}\), and \(\mathrm {in}_4^{c_{11}}\) true. The resulting order of elimination is depicted in Fig. 4a. Since \(\mathrm {in}_1^{c_{10}}\) is true, the other variables \(\mathrm {in}_1^{c_1}\), \(\mathrm {in}_1^{c_5}\), and \(\mathrm {in}_1^{c_6}\) concerning \(X_1\) are falsified by mutual exclusion. Respective \(\mathrm {in}_i^{c_j}\)-variables related with \(X_2\), \(X_3\), and \(X_4\) get similarly falsified. Let us then consider the satisfaction of formulas of forms (12) and (13). For \(X_1\) and the elimination step \(j=1\), the relevant instances are

Due to mutual exclusions explained above, the variables \(\mathrm {el}_{2,1}^{c_5}\), \(\mathrm {el}_{3,1}^{c_6}\), \(\mathrm {el}_{2,1}^{c_{10}}\), and \(\mathrm {el}_{3,1}^{c_{10}}\) must be set to true. This, in turn, makes \(\mathrm {el}_{1,1}\) true, indicating that \(X_1\) is removed at step \(j=1\). Since \(X_4\) is symmetric to \(X_1\), we may conclude that \(\mathrm {el}_{4,1}\) is also made true by analogous equivalences introduced for \(X_4\). On the other hand, the respective formulas for \(X_2\) and \(X_3\) falsify both \(\mathrm {el}_{2,1}\) and \(\mathrm {el}_{3,1}\) since both \(\mathrm {in}_1^{c_{10}}\) and \(\mathrm {in}_4^{c_{11}}\) are true. Thus \(X_2\) and \(X_3\) are not eliminated at step \(j=1\). As regards the subsequent elimination of \(X_2\) at step \(j=2\), the following instances of formulas (12) and (13) become relevant:

Fig. 4 Illustration of elimination contexts and the resulting orderings of nodes

To check the satisfaction of the conjunction in the equivalence for \(\mathrm {el}_{2,2}\), it is sufficient to note that \(\mathrm {el}_{1,1}\) and \(\mathrm {el}_{4,1}\) have been set to true before, and the truth of \(\mathrm {in}_3^{c_7}\), \(\mathrm {in}_3^{c_{10}}\), and \(\mathrm {in}_3^{c_{11}}\) is excluded by the truth of \(\mathrm {in}_3^{c_{3}}\). Thus \(\mathrm {el}_{2,2}\) must be true. The analogous formulas for \(X_1\) and \(X_4\) force \(\mathrm {el}_{1,2}\) and \(\mathrm {el}_{4,2}\) to be true. In contrast, \(\mathrm {el}_{3,2}\) is falsified since \(\mathrm {in}_{2}^{c_7}\) is true and \(\mathrm {el}_{2,1}\) is false. Our last observations concern the elimination step \(j=3\), where \(X_3\) becomes eligible for elimination in the context of \(c_3\). Since the equivalence for \(\mathrm {el}_{3,3}\) depends positively on \(\mathrm {el}_{1,2}\), \(\mathrm {el}_{2,2}\), and \(\mathrm {el}_{4,2}\) only, it is clear that \(\mathrm {el}_{3,3}\) is set to true. The same can be stated about the other nodes, i.e., \(\mathrm {el}_{1,3}\), \(\mathrm {el}_{2,3}\), and \(\mathrm {el}_{4,3}\) are made true by the respective equivalences. This illustrates the persistence of elimination: if \(X_i\) is eliminated at step j, it will remain eliminated at step \(j+1\). In particular, the formula \(\mathrm {el}_{1,3}\wedge \mathrm {el}_{2,3}\wedge \mathrm {el}_{3,3}\wedge \mathrm {el}_{4,3}\) is true as an indication of chordality in the scenario discussed so far.

In order to demonstrate further aspects of the chordality encoding, two other scenarios deserve attention. First, let us assume that \(\mathrm {in}_{1}^{c_5}\), \(\mathrm {in}_{2}^{c_8}\), \(\mathrm {in}_{3}^{c_6}\), and \(\mathrm {in}_{4}^{c_9}\) have been chosen to be true. This creates an interdependency for the selected elimination contexts, so that no node can be eliminated as illustrated in Fig. 4b. On the logical side this implies that \(\mathrm {el}_{i,j}\) will be falsified for each \(X_i\) and \(j=1,\ldots ,4\). Indeed, a graph in which the diagonal edge from Fig. 3 has been removed is not chordal, also witnessed by the falsity of \(\mathrm {el}_{1,4}\wedge \mathrm {el}_{2,4}\wedge \mathrm {el}_{3,4} \wedge \mathrm {el}_{4,4}\).

Second, let us pick \(\mathrm {in}_{1}^{c_{10}}\), \(\mathrm {in}_{2}^{c_8}\), \(\mathrm {in}_{3}^{c_9}\), and \(\mathrm {in}_{4}^{c_4}\) to be true. This represents a scenario in which \(X_1\) is removed at step \(j=1\) with \(c_{10}=\{X_1,X_2,X_3\}\) as its elimination context. This suggests that the edge \(\{X_2,X_3\}\) is present in the graph but, unfortunately, the elimination contexts of \(X_2\) and \(X_3\) do not include this edge, colored gray in Fig. 4c. The formulas of the chordality encoding will still set the variables \(\mathrm {el}_{1,1}\), \(\mathrm {el}_{2,2}\), \(\mathrm {el}_{3,2}\), and \(\mathrm {el}_{4,3}\) to true. Therefore, we observe that further constraints are necessary to ensure that the structure claimed to exist by the chosen elimination contexts of the nodes is indeed present in the graph. \(\square \)

A new Boolean variable \(\mathrm {cr}_{i}^{c}\) captures the idea that node \(X_i\) is responsible for the creation of the structure present in the clique c when elimination contexts are initially chosen. The constraint we want to impose on the elimination context c of a node \(X_i\) is essentially

$$\begin{aligned} \mathrm {in}_{i}^{c}\rightarrow \mathrm {cr}_{i_1}^{c'}\vee \ldots \vee \mathrm {cr}_{i_k}^{c'}, \end{aligned}$$
(14)

where \(c'=c\setminus \{X_i\}=\{X_{i_1},\ldots ,X_{i_k}\}\) gives the remainder of c if \(X_i\) is eliminated in this context. It remains to provide a formula defining the truth value of \(\mathrm {cr}_{i}^c\):

$$\begin{aligned} \mathrm {cr}_i^c\leftrightarrow \mathrm {in}_i^c\vee \mathrm {cr}_i^{c_1}\vee \ldots \vee \mathrm {cr}_i^{c_k}, \end{aligned}$$
(15)

where \(c_1=c\cup \{X_1\},\ldots ,c_k=c\cup \{X_k\}\) are all clique candidates extending c by one node. The effect of these formulas is illustrated next.
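When the collection of clique candidates is downward closed, the recursion (15) unfolds so that \(\mathrm {cr}_i^c\) holds exactly when the chosen elimination context of \(X_i\) contains c. Under this assumption, constraints (14) and (15) reduce to the following check (a sketch; names are ours):

```python
def creation_satisfied(chosen):
    """Unfolded constraints (14)-(15): whenever X_i is eliminated in context c,
    some node of c' = c \\ {X_i} must have a chosen elimination context
    containing c'. `chosen` maps each node to its context (a set).
    The case c' = {} (singleton contexts) is treated as vacuous."""
    for x, c in chosen.items():
        rest = set(c) - {x}
        if rest and not any(rest <= set(chosen[m]) for m in rest):
            return False
    return True
```

Applied to the scenarios of Example 2 (with clique contents as in the running example), the check accepts the scenario of Fig. 4a and rejects the scenario of Fig. 4c, where nobody creates the structure claimed by the context of \(X_1\).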

Example 3

Let us continue from the last scenario in Example 2, i.e., Fig. 4c. The relevant instance of (14) is \(\mathrm {in}_{1}^{c_{10}}\rightarrow \mathrm {cr}_{2}^{c_{7}}\vee \mathrm {cr}_{3}^{c_7}\), implying the consequent \(\mathrm {cr}_{2}^{c_{7}}\vee \mathrm {cr}_{3}^{c_7}\). Thus either \(X_2\) or \(X_3\) is responsible for creating the structure present in \(c_7\). To check this, we need to evaluate the respective instances of (15):

But since \(\mathrm {in}_2^{c_7}\), \(\mathrm {in}_3^{c_7}\), \(\mathrm {in}_2^{c_{10}}\), \(\mathrm {in}_3^{c_{10}}\), \(\mathrm {in}_2^{c_{11}}\), and \(\mathrm {in}_3^{c_{11}}\) have been set to false, the variables \(\mathrm {cr}_2^{c_7}\), \(\mathrm {cr}_3^{c_7}\), \(\mathrm {cr}_2^{c_{10}}\), \(\mathrm {cr}_3^{c_{10}}\), \(\mathrm {cr}_2^{c_{11}}\), and \(\mathrm {cr}_3^{c_{11}}\) must be falsified in turn. This rules out the possibility of satisfying \(\mathrm {cr}_{2}^{c_{7}}\vee \mathrm {cr}_{3}^{c_7}\) inferred above, i.e., the scenario under consideration is no longer possible. \(\square \)

Our last requirement for solutions to the Markov network learning problem concerns the score of a chordal graph for which a PEO can be identified. The idea is to use a differential score \(d(c,X_i)\) based on (9) that is compatible with \(X_i\) being eliminated in the context of c. If \(c_1,\ldots ,c_m\) is the collection of candidate cliques, the resulting objective function can be written as \(\sum _{j=1}^m\sum _{X_i\in c_j}d(c_j,X_i)\mathrm {in}_i^{c_j}\).

The representation of the objective function can be modularized by taking a per-variable view. To this end, let \(X_i\) be one of the variables with \(1\le i\le n\) and \(c_1,\ldots ,c_k\) the cliques containing \(X_i\). Since clique candidates are known in advance, we may assume that \(d(c_1,X_i)\ge \ldots \ge d(c_k,X_i)\) without loss of generality. Because \(X_i\) must have an elimination context, it is clear that \(X_i\) will be assigned at least the first score \(d(c_1,X_i)\) and potentially some smaller score fractions expressible as \(d(c_j,X_i)-d(c_{j-1},X_i)\) for \(1<j\le k\). To control which fractions are needed, we introduce Boolean variables \(\mathrm {sf}_{i,j}\) for each clique \(c_j\) with \(1<j\le k\). The activation of score fractions is determined by the formula

$$\begin{aligned} \mathrm {sf}_{i,j}\leftrightarrow \mathrm {in}_i^{c_j}\vee \mathrm {sf_{i,j+1}}, \end{aligned}$$
(16)

where \(1<j<k\) refers to an elimination context \(c_j\) of \(X_i\). The base case \(j=k\) is covered by the formula

$$\begin{aligned} \mathrm {sf}_{i,k}\leftrightarrow \mathrm {in}_i^{c_k}. \end{aligned}$$
(17)

The idea is that, if \(\mathrm {sf}_{i,j}\) is set to true, then all other variables \(\mathrm {sf}_{i,j'}\) with \(j'<j\) are set to true as well. In this way, the contribution of \(X_i\) to the objective function is

$$\begin{aligned} o(i)=d(c_1,X_i)+\sum _{j=2}^k(d(c_j,X_i)-d(c_{j-1},X_i))\mathrm {sf}_{i,j}. \end{aligned}$$
(18)

The sum o(i) equals \(d(c_j,X_i)\) for the chosen elimination context \(c_j\) of \(X_i\). The goal of this encoding is to allow smooth changes of scores when the elimination contexts of \(X_i\) are switched in accordance with the order \(c_1,\ldots ,c_k\).
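The telescoping effect of the score fractions (16)-(18) can be checked on hypothetical scores, sorted in the descending order assumed above:

```python
def contribution(scores, chosen):
    """o(i) from (18): `scores` lists d(c_1) >= ... >= d(c_k) and `chosen` is
    the 1-based index of the selected elimination context. By (16)-(17),
    sf_{i,j} holds exactly for 2 <= j <= chosen, so the negative fractions
    d(c_j) - d(c_{j-1}) telescope down to d(c_chosen)."""
    o = scores[0]
    for j in range(2, chosen + 1):
        o += scores[j - 1] - scores[j - 2]  # negative score fraction
    return o

# With hypothetical scores 5.0 >= 3.0 >= 2.0, choosing c_3 yields d(c_3) = 2.0
value = contribution([5.0, 3.0, 2.0], 3)
```

Switching the chosen context from \(c_1\) to \(c_3\) merely activates two fractions; no other part of the objective changes, which is the intended smoothness.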

Theorem 2

Let \(N=\{X_1,\ldots ,X_n\}\) be a set of random variables and \(d:{\mathbf {2}}^N\times N\rightarrow \mathbb {R}\) a scoring function. An undirected graph \(\langle N,E\rangle \) is a solution to the Markov network learning problem iff

  1. the elimination formulas (12) are satisfied for each \(X_i\) and step \(1\le j\le n\) together with all instances of (13),

  2. the formula \(\mathrm {el}_{1,n}\wedge \ldots \wedge \mathrm {el}_{n,n}\) is satisfied,

  3. the context creation formulas (14) and (15) are satisfied for each \(X_i\) and each clique c such that \(X_i\in c\), and

  4. the graph \(\langle N,E\rangle \) maximizes the sum \(\sum _{i=1}^n o(i)\) of objectives (18), while satisfying all formulas (16) and (17) defining score fractions.

Proof sketch

We argue that a graph \(\langle N,E\rangle \) based on N is chordal iff the requirements 1–3 are met.

Table 4 Boolean variables used in the encoding of labels

(\(\Longrightarrow \)) Let \(\langle N,E\rangle \) be a chordal graph. By Theorem 1, it has a PEO based on a strict total ordering of N. This ordering can be relaxed by recursively taking, at each step j, as many variables \(X_i\) from the ordering as possible that do not compete for edges. In this way, the context elimination formulas are satisfied. The context creation formulas are satisfied because every variable \(X_i\) subject to elimination in its context \(c_i\) is connected to a clique \(c_i\setminus \{X_i\}\).

(\(\Longleftarrow \)) Suppose that requirements 1–3 are met by \(\langle N,E\rangle \) and let I be the satisfying assignment. A variable \(X_i\) is eliminated at step \(j\ge 1\) if \(\mathrm {el}_{i,j}\) is true in I and each \(\mathrm {el}_{i,j'}\) with \(1\le j'<j\) is false in I. A PEO can be constructed by taking variables eliminated at the same step j in any order. Since I satisfies the context creation formulas, it is guaranteed that, when \(X_i\) is eliminated, it is connected to a clique c such that a context \(c_i=c\cup \{X_i\}\) of elimination is determined for \(X_i\). Thus \(\langle N,E\rangle \) is chordal by Theorem 1.

Finally, we note that objective functions used for scoring, i.e., \(\sum _{i=1}^nd(c_i,X_i)\) and \(\sum _{i=1}^no(i)\), coincide.

Theorem 2 is applicable to any downward closed collection of cliques, where d may be a partial scoring function.

5.2 Encoding labels for Markov networks

The labels representing context-dependent conditional probabilities can be incorporated by an orthogonal extension to the basic encoding. It is essential to ensure that edges subject to a labeling L are not contained in the separators of the underlying chordal graph. Because the encoding developed in Sect. 5.1 avoids the formalization of separators altogether, separator edges must be detected by other means. For the sake of space efficiency, we concentrate on identifying edges involved in separators rather than identifying the separators themselves (Table 4). Thus a Boolean variable \(\mathrm {e}_{i,j}\) is introduced for each pair \(X_i,X_j\) of nodes with \(1\le i<j\le n\), to be set to true if and only if the edge between \(X_i\) and \(X_j\) is present. If \(c_1,\ldots ,c_k\) are the cliques containing both \(X_i\) and \(X_j\), we obtain the formula

$$\begin{aligned} \mathrm {e}_{i,j}\leftrightarrow \mathrm {in}_i^{c_1}\vee \mathrm {in}_j^{c_1}\vee \ldots \vee \mathrm {in}_i^{c_k}\vee \mathrm {in}_j^{c_k}. \end{aligned}$$
(19)

When the edge between \(X_i\) and \(X_j\) is present in a graph, we are interested in other nodes \(X_k\) that are connected by an edge to both \(X_i\) and \(X_j\). To detect such nodes, we use a Boolean variable \(\mathrm {m}_{i,j,k}\) and a formula

$$\begin{aligned} \mathrm {m}_{i,j,k}\leftrightarrow \mathrm {e}'_{i,k}\wedge \mathrm {e}'_{j,k}, \end{aligned}$$
(20)

where \(\mathrm {e}'_{i,k}=\mathrm {e}_{i,k}\) (or \(\mathrm {e}'_{j,k}=\mathrm {e}_{j,k}\)) if \(i<k\) (or \(j<k\)), while \(\mathrm {e}'_{i,k}=\mathrm {e}_{k,i}\) (or \(\mathrm {e}'_{j,k}=\mathrm {e}_{k,j}\)) otherwise. The edge between \(X_i\) and \(X_j\) belongs to a separator if and only if it is connected—in the sense specified above—to two distinct nodes \(X_k\) and \(X_l\) that are not connected by an edge. A Boolean variable \(\mathrm {s}_{i,j}\) is introduced to capture this condition and its truth value is set in terms of

$$\begin{aligned} \mathrm {s}_{i,j}\leftrightarrow \bigvee _{1\le k<l\le n}\mathrm {m}_{i,j,k}\wedge \mathrm {m}_{i,j,l}\wedge \lnot \mathrm {e}_{k,l}. \end{aligned}$$
(21)

Example 4

Recall the first scenario of Example 2 in which \(\mathrm {in}_1^{c_{10}}\), \(\mathrm {in}_2^{c_7}\), \(\mathrm {in}_3^{c_3}\), and \(\mathrm {in}_4^{c_{11}}\) were set to true. The formulas of the form (19) force the variables \(\mathrm {e}_{1,2}\), \(\mathrm {e}_{1,3}\), \(\mathrm {e}_{2,3}\), \(\mathrm {e}_{2,4}\), and \(\mathrm {e}_{3,4}\) to be true, whereas other edge variables remain false.

Thus formulas (20) make variables \(\mathrm {m}_{2,3,1}\) and \(\mathrm {m}_{2,3,4}\) true. Other such variables are not relevant for our analysis.

The truth assignments reported above ensure that \(\mathrm {m}_{2,3,1}\wedge \mathrm {m}_{2,3,4}\wedge \lnot \mathrm {e}_{1,4}\) evaluates to true. This sets \(\mathrm {s}_{2,3}\) to true by the respective instance of formula (21), indicating that the edge between \(X_2\) and \(X_3\) is involved in a separator, i.e., \(s=c_{10}\cap c_{11}\). \(\square \)
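The separator-edge test expressed by (19)-(21) corresponds to the following graph computation (a sketch over an explicit edge list; the data mirrors the graph of Example 4, with nodes written as integers):

```python
from itertools import combinations

def separator_edges(nodes, edges):
    """An edge {X_i, X_j} lies on a separator iff X_i and X_j have two common
    neighbours that are themselves non-adjacent, as captured by (20)-(21)."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    result = set()
    for i, j in edges:
        common = adj[i] & adj[j]  # the m-variables of (20)
        if any(l not in adj[k] for k, l in combinations(sorted(common), 2)):
            result.add(frozenset((i, j)))
    return result

# The graph of Example 4: a 4-cycle with the diagonal edge {X2, X3}
seps = separator_edges([1, 2, 3, 4],
                       [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)])
```

Only the diagonal edge is reported as a separator edge, in line with the separator \(c_{10}\cap c_{11}\) identified above.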

It remains to decide the labeling for cliques. As worked out at the end of Sect. 4, labelings \(L_1,\ldots ,L_k\) modify the score of a clique c by rewards \(r(c,L_1),\ldots ,r(c,L_k)\) according to (10), where \(r(c,L_1)\ge \ldots \ge r(c,L_k)\) can be assumed without loss of generality and Lemma 1 implies that \(L_k\) is the empty labeling \(\emptyset \). Since we aim at maximizing the score for c, we should pick the first labeling from the sequence \(L_1,\ldots ,L_k\) that satisfies the labeling condition. It is worth noting that \(L_k=\emptyset \) satisfies the condition trivially, so that eventually some labeling can be found.

To formalize this idea, we introduce a Boolean variable \(\mathrm {xl}^c\) denoting that c is labeled by some of its supercliques, and Boolean variables \(\mathrm {xl}_i^c\) denoting that a particular labeling \(L_i\) is excluded for c. The truth value of the variable \(\mathrm {xl}^{c}\) can be determined by the formula

$$\begin{aligned} \mathrm {xl}^{c}\leftrightarrow \bigvee _{X_k\not \in c,X_j\in c\cup \{X_k\}}\mathrm {in}_j^{c\cup \{X_k\}}. \end{aligned}$$
(22)

As regards the variables \(\mathrm {xl}_i^c\), the following formula sets the truth value in the base case \(i=1\), but only if \(k>1\), i.e., if \(L_1\) is different from the empty labeling \(\emptyset \):

$$\begin{aligned} \mathrm {xl}_1^{c}\leftrightarrow \left( \bigvee _{X_j\in c}\mathrm {in}_j^{c}\right) \wedge (\mathrm {xl}^c\vee \bigvee \{\mathrm {s}_{j,k}\mid L_1\text { labels } \{X_j,X_k\}\}).\nonumber \\ \end{aligned}$$
(23)

When \(1<i<k\), a recursive formula is used:

$$\begin{aligned} \mathrm {xl}_i^{c}\leftrightarrow \mathrm {xl}_{i-1}^{c}\wedge \left( \mathrm {xl}^c\vee \bigvee \{\mathrm {s}_{j,k}\mid L_i\text { labels } \{X_j,X_k\}\}\right) . \end{aligned}$$
(24)

To implement the objective function (11) in the labeled case, we observe that rewarding an admissible labeling \(L_i\) for \(1\le i<k\) is equivalent to penalizing the exclusion of \(L_i\) (and all \(L_j\) such that \(1\le j < i\)) by a negative reward fraction \(r(c,L_{i+1})-r(c,L_i)\). Thus the required extension to the objective function is

$$\begin{aligned} r(c,L_1)+\sum _{i=1}^{k-1}(r(c,L_{i+1})-r(c,L_i))\mathrm {xl}_i^c. \end{aligned}$$
(25)

In the special case \(k=1\), the sum above reduces to \(r(c,\emptyset )=v(c,\emptyset )-v(c,\emptyset )=0\). This means that the given extension is not needed for cliques that only have one possible labeling, i.e., the empty labeling \(\emptyset \) acting as the default labeling.

Theorem 2 can be generalized for labeled Markov networks by extending the set of requirements for a graph candidate \(\langle N,E\rangle \). First, the edges involved in separators are detected by satisfying all instances of formulas (19)–(21). Once these have been identified, the best possible labelings for the cliques of \(\langle N,E\rangle \) can be orthogonally determined by satisfying the formulas (22)–(24). Finally, the resulting rewards (25) are taken into account in the objective function.

Table 5 Solver runtimes in seconds for learning context-dependent Markov networks from heart and econ datasets with different filtering schemes

6 Experimental evaluation

We have evaluated our constraint-based approach to Markov network structure learning using a number of fully automated off-the-shelf constraint solvers. To this end, we encoded the conditions and optimization measures described in Sect. 5 in the input language of ASP and used the tool LP2ACYC (Gebser et al. 2014) for automatic translation to MAXSAT as well as IP format. Our comparison includes the solvers

  • CLASP (version 3.1.1) (Gebser et al. 2012),

  • PWBO (version 2.2) (Martins et al. 2012),

  • SAT4J (version 2.3.5) (Berre and Parrain 2010), and

  • CPLEX (version 12.6.0).

As CLASP was originally devised for ASP solving, we also compare it on input in ASP format, without translation to MAXSAT; below, this variant is referred to as CLASP run as an ASP solver. Note that all solvers apply branch-and-bound techniques that successively refine an upper bound on solution quality, while unsatisfiability-based optimization approaches [cf. Manquinho et al. (2009)] turned out to be ineffective for the Markov network learning problem and are omitted here. The experiments were run sequentially on a Linux machine with 2.70 GHz Intel Xeon E5-4650 CPUs and 256 GB RAM, imposing a time limit of 86,400 seconds (one day) per run. Timeouts are indicated by entries “TO” in the tables that follow.

Table 5 provides runtimes in seconds on the heart and econ datasets, relative to the filtering schemes b(forig), n(ode), p(art), and e(dge) from Sect. 3, parametrized by \(k=0,1,2\). For both benchmarks, the upper five rows refer to the unlabeled problem variants, and the lower five rows to the case of labeled Markov networks. Despite performance gaps between solvers, the runtimes tightly correlate with the number of clique candidates in the input. As shown in Table 2, this number is regulated by the parameter k as well as the filtering scheme, where bforig and edge tend to prune cliques least and most aggressively, respectively. Notably, the bforig scheme along with \(k=2\) yields no pruning at all and thus reflects the performance on unfiltered datasets. Recalling the account of filtering times from Sect. 3, the filtering of clique candidates leads to effective savings in runtime.

Moreover, we note that the performance differences between unlabeled and labeled inputs obtained with the same filtering scheme are rather moderate, given that best labelings can be read off PEOs (cf. Sect. 5.2) to avoid an (additional) model space explosion as reported in Table 1. Comparing the solvers to each other shows that CLASP, run as an ASP solver, performs best on all inputs, leading to the shortest runtimes highlighted in boldface in each column. This advantage is due to the native support of recursion in ASP, which permits a more compact internal problem representation and resulting improvements in search efficiency (in terms of conflicts) by one to two orders of magnitude in comparison to CLASP applied to instances in MAXSAT format. However, the performance of the MAXSAT solvers CLASP, PWBO, and SAT4J as well as the IP solver CPLEX follows a similar pattern, witnessing a consistent impact of the number of clique candidates. As a reference, note that the MCMC algorithm in Nyman et al. (2014) has been reported to take tens of minutes to converge to an optimum on the heart-lab and econ-lab datasets, still without proving optimality in view of the incompleteness of the method.

Table 6 File sizes of solver inputs in different formats

Regarding the compactness of encodings, the file sizes in Table 6 give an account of the progress relative to Corander et al. (2013), where the unlabeled heart and econ datasets have also been investigated. Comparing the rows for either benchmark shows a considerable size reduction due to exploiting PEOs rather than maximal cliques and the balancing condition for scoring. In particular, the space requirements are decreased by one order of magnitude on the econ dataset, for both ASP and MAXSAT format, and the IP column further quantifies space savings in comparison to the translation from ASP to MAXSAT. Moreover, Corander et al. (2013) reported a shortest solver runtime of three days for the econ dataset, while CLASP is now able to find and verify an optimal model within 15 minutes (cf. upper rows for bforig with \(k=2\) in Table 5). Finally, we note that the size of inputs is primarily governed by conditions on PEOs and not significantly increased by adding labels.

Table 7 Quality of optimal Markov networks relative to different filtering schemes
Table 8 Solver runtimes in seconds on samples of the hs dataset with varying numbers of nodes and parameter values for the edge filtering scheme

To estimate the trade-off between filtering schemes and solution quality, Table 7 provides a summary of optimal network scores for the unlabeled and labeled heart and econ datasets, where global optima (Nyman et al. 2014) are highlighted in boldface. For both heart variants, it turns out that optimal models are preserved regardless of which scheme and parameter value are used for pruning clique candidates. This does not apply to the more complex econ dataset. In the unlabeled case, too small k-values decrease the solution quality with the bforig scheme, which does not take the contexts given by cliques into account, as well as with the aggressive edge scheme. Given that the investigated filtering schemes assume empty labelings and do not consider potential alternatives, the usage of small k-values is particularly risky for the labeled econ variant. In fact, although the schemes node and part with \(k\le 1\) as well as the edge scheme with \(k=2\) come close to the global optimum, they still deteriorate the log marginal likelihood by 0.015 %. This suggests that labels should be taken into account in order to perform more aggressive yet informed pruning in the case of labeled Markov networks.

We further assessed the scalability of our approach on samples of the hs dataset of increasing size, using the edge scheme along with \(k=0,1,2\) for filtering clique candidates. The samples in the upper five rows of Table 8, with numbers of nodes given at the top, have been obtained by maximizing the density of the subgraph induced by a selection of the 25 nodes available in total, while the density is minimized in the lower five rows, with numbers of nodes given at the bottom. For both kinds of samples, the reported runtimes include the greatest numbers of nodes for which clasp, again performing best when run as an ASP solver, was able to find and verify an optimal model within the time of one day. Similar to the econ dataset, the depth of filtering regulated by the parameter k has a significant impact on the resulting difficulty of Markov network structure learning. Moreover, samples of size 8 turn out to be much harder to solve than inputs obtained by applying the edge scheme with the same k-value to the econ dataset, where the number of nodes is also 8. As the difficulty rapidly increases with sample size, optimization succeeds on dense subgraphs with up to 13, 10, or 9 nodes, respectively, depending on the k-value used for pruning. Although this value remains relevant, samples for which the density is minimized exhibit a considerably smoother scaling behavior and can be successfully handled up to 20, 18, or 15 nodes, respectively. This observation emphasizes the impact of problem structure as well as the importance of filtering in view of an input size (number of clique candidates) that, in the worst case, grows exponentially with the number of nodes.
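Selecting node subsets that maximize (or minimize) the density of the induced subgraph can be approximated with a simple greedy heuristic, sketched below. This is an assumption-laden stand-in for illustration, not necessarily the sample-construction procedure behind Table 8:

```python
def greedy_dense_subset(adj, n):
    """Greedily pick n nodes approximately maximizing induced-subgraph
    density: start from a highest-degree node, then repeatedly add the
    node with most edges into the current selection. Replacing `max`
    with `min` in the loop instead approximates a sparsest selection."""
    chosen = {max(adj, key=lambda v: len(adj[v]))}
    while len(chosen) < n:
        best = max((v for v in adj if v not in chosen),
                   key=lambda v: len(adj[v] & chosen))
        chosen.add(best)
    return chosen

# Triangle 1-2-3 with a pendant node 4: the triangle is the densest 3-subset.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(sorted(greedy_dense_subset(adj, 3)))  # → [1, 2, 3]
```

Dense selections concentrate many overlapping clique candidates on few nodes, which matches the markedly harder instances observed in the upper rows of Table 8.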

7 Conclusion

Our experiments and the recent interest in graphical model learning using integer programming, maximum satisfiability, and answer set programming demonstrate that computational logic offers a largely untapped, valuable resource for statistical inference. The main challenge lies in finding effective translations from the statistical learning problem into logical constraints, so as to make optimization scalable to large problem instances. As the diversity of approaches adopted here and in Bartlett and Cussens (2013), Berg et al. (2014), Corander et al. (2013), Cussens (2008), and Parviainen et al. (2014) shows, there is a lot of room for creativity in this translation task, and the choices made can strongly impact the performance of solvers. In addition, adopting strong pruning methods can be critical for reducing the space of candidate solutions. For instance, the approach based on dynamic programming (Kangas et al. 2014) suffers from exponential growth of memory consumption as the number of variables increases.

We first studied theoretically which models can be automatically recognized as inferior to others, so that the model space can be reduced without eliminating globally optimal models. Then, we introduced statistical pruning criteria for more extensive filtering of candidates, guaranteeing that the best solutions are preserved asymptotically as the number of input data vectors increases. In future research it will be useful to study the effect of pruning strategies further and to develop translations of learning problems beyond graphical models.