
1 Introduction

Statistical relational learning [8] and Probabilistic Logic Programming [3, 4] have contributed to various representations and learning schemes that reason about objects and uncertain relational structures among them. Popular approaches include PRISM [10], Independent Choice Logic [13], Bayesian Logic Programs [11], Markov Logic Networks [14], Logic Programs with Annotated Disjunctions [17], CP-Logic [16] and ProbLog [7]. Many of these languages are based on variants of the distribution semantics [15]. They vary in the way they define the distribution over logic programs but are equally expressive [2]. In this paper, we use ProbLog's representation. ProbLog has probabilistic facts such as \(\mathtt {0.01::earthquake}\), stating that the probability of having an earthquake is 0.01, and has clauses such as \(\mathtt {alarm:\!\!-}\) \(\mathtt {earthquake}\), stating that the alarm goes off if there is an earthquake. In addition, ProbLog supports annotated disjunctions (ADs) such as \(\mathtt {0.01::alarm(long)}\); \(\mathtt {0.19::alarm(short)}\); \(\mathtt {0.8::alarm(none)}\), stating that the alarm takes exactly one of the three types.
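Collected into one listing, the three constructs just mentioned read as follows. This is a minimal illustrative fragment rather than a model from the paper; note that \(\mathtt {alarm/0}\) and \(\mathtt {alarm/1}\) are distinct predicates here.

    % probabilistic fact: an earthquake happens with probability 0.01
    0.01::earthquake.
    % clause: the alarm goes off if there is an earthquake
    alarm :- earthquake.
    % annotated disjunction: the alarm has exactly one of three types
    0.01::alarm(long); 0.19::alarm(short); 0.8::alarm(none).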

ProbLog's parameter learning approach, LFI-ProbLog, is designed only for probabilistic facts, not for ADs. Hence, LFI-ProbLog cannot learn the parameters of multi-head ADs. Even though LFI-ProbLog can learn single-head AD parameters, we will show that it is inefficient and, in extreme cases, produces incorrect values. Faria et al. tackled a special case of this efficiency issue for single-head ADs [6]. In contrast, we provide a more general solution that also covers multi-head ADs. Although our approach is implemented in ProbLog, it can be applied to other EM-based parameter learning algorithms, since what we exploit is the rule-based structure shared by all probabilistic logic programs.

The contribution is twofold. First, we introduce EMPLiFI, a new parameter learning approach in ProbLog. EMPLiFI correctly learns multi-head ADs and speeds up learning by exploiting the rule-based structure of logic programs. Second, we prove that EMPLiFI is correct and illustrate how it reduces EM iterations. We compare EMPLiFI with two other EM-based learners, LFI-ProbLog and EMBLEM, and show that EMPLiFI is the most accurate in learning multi-head ADs and takes the fewest EM iterations to converge.

2 Preliminaries

Probabilistic Logic Programming. A ProbLog theory (or program) \(\mathcal {T}\) consists of a finite set of probabilistic facts \(\mathcal {F}\), a finite set of clauses \(\mathcal {BK}\) and a finite set of annotated disjunctions \(\mathcal {AD}\). A probabilistic fact is an expression \(\mathtt {p::f}\) stating that the ground fact \(\mathtt {f}\) is true with probability \(\mathtt {p}\). A clause is an expression \(\mathtt {h:\!\!- b_1{,}\cdots {,}b_n}\) where \(\mathtt {h}\) is a literal and \(\mathtt {b_1{,}\cdots {,}b_n}\) is a conjunction of literals, stating that \(\mathtt {h}\) is true if \(\mathtt {b_1{,}\cdots {,}b_n}\) are all true. ProbLog defines a probability distribution over the ground facts in a Herbrand base \(L_\mathcal {T}\). The probabilistic facts define a probability distribution over possible worlds: all ground facts in a possible world W are true and all ground facts not in W are false. The probability of a possible world W is defined as \(P(W|\mathcal {T}){=} \prod _{\mathtt {f_i}\in W} \mathtt {p_i} \prod _{\mathtt {f_i}\in L_\mathcal {T}\setminus W} \mathtt {(1{-}p_i)}\). The success probability of a query q is the sum of the probabilities of the possible worlds that entail q, formally, \(P_s(q|\mathcal {T}){=}\sum _{I\subseteq L_\mathcal {T}, I \models q} P(I|\mathcal {T})\). A partial interpretation I is an incomplete possible world that contains truth values of some (but not all) atoms. If an atom \(\mathtt {a}\) (resp. \(\mathtt {\lnot a}\)) is in I, then \(\mathtt {a}\) is true (resp. false); otherwise, the truth value of \(\mathtt {a}\) is unknown. Hence, a partial interpretation I represents a number of possible worlds, and the probability of I is the success probability of the conjunction of the literals in I, i.e. \(P(I){=}P_s(\bigwedge _{l\in I} l)\).
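As a small worked instance of these definitions (illustrative only), consider the two-statement theory \(\mathcal {T}=\{\mathtt {0.01::earthquake},\ \mathtt {alarm:\!\!-earthquake}\}\) from the introduction. Its two total choices over the probabilistic facts are \(\{\mathtt {earthquake}\}\), with probability 0.01, which yields a world in which \(\mathtt {alarm}\) is true, and the empty choice, with probability 0.99, in which \(\mathtt {alarm}\) is false. Hence

$$\begin{aligned} P_s(\mathtt {alarm}|\mathcal {T}) = \sum _{W \models \mathtt {alarm}} P(W|\mathcal {T}) = P(\{\mathtt {earthquake}\}|\mathcal {T}) = 0.01 \end{aligned}$$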

Annotated Disjunctions in ProbLog. An annotated disjunction (AD) is a clause with one or more mutually exclusive heads of the form \(\mathtt {p_1::h_1; \cdots ;p_k::h_k}\) where \(\sum ^k_{i=1} \mathtt {p_i}\,{=}\,1\), stating that if the body is true, exactly one head is made true, where the choice of \(\mathtt {h_i}\) is governed by \(\mathtt {p_i}\). Although the ProbLog language and semantics allow for ADs, inference only operates on probabilistic facts; hence, ADs are first encoded as probabilistic facts by a transformation [5, 7]. For example, a three-head AD \(\mathtt {0.2::a1; 0.2::a2; 0.6::a3:\!\!- b}\) is encoded as

figure a

where \(\mathtt {h1}\), \(\mathtt {h2}\) and \(\mathtt {h3}\) are hidden facts. The last fact \(\mathtt {h3}\) can be dropped for inference, but we keep it because we will need it later for learning. This encoding computes probabilities correctly for the inference task but is insufficient for learning and results in incorrect values (see Sect. 3).
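The listing referred to as figure a is not reproduced here. As a hedged sketch, the standard sequential encoding of this AD into probabilistic facts [5] looks as follows; the hidden-fact probabilities 0.2, 0.25 and 1.0 are obtained as \(0.2\), \(0.2/(1{-}0.2)\) and \(0.6/(1{-}0.2{-}0.2)\), matching the transformed values used in Example 2. The paper's exact listing may differ in detail.

    0.2::h1.
    0.25::h2.
    1.0::h3.
    a1 :- b, h1.
    a2 :- b, \+h1, h2.
    a3 :- b, \+h1, \+h2, h3.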

3 Learning from Interpretations in ProbLog

In this section, we review LFI-ProbLog and illustrate its two issues. Later, Sect. 4 will introduce a new learning approach that resolves these issues. The parameter learning task in ProbLog is defined as follows.

Given

  • A ProbLog program \(\mathcal {T}\mathbf {(p)}{\,=\,}\mathcal {F}\,{\cup }\,\mathcal {BK}\,{\cup }\,\mathcal {AD}\) where \(\mathcal {F}\) is a set of probabilistic facts, \(\mathcal {BK}\) is a set of background knowledge and \(\mathcal {AD}\) is a set of ADs. \(\mathbf {p}=\langle p_1,\cdots , p_N\rangle \) is a set of unknown parameters where each parameter is attached to a probabilistic fact or an AD head.

  • A set \(\mathcal {I}\) of partial interpretations \(\{I_1, ..., I_M\}\).

Find the maximum likelihood parameters \(\widehat{\mathbf {p}}\) given the interpretations in \(\mathcal {I}\). Formally,

$$\begin{aligned} \widehat{\mathbf {p}} = \arg \max _\mathbf {p} P(\mathcal {I|T}(\mathbf {p})) = \arg \max _\mathbf {p} \prod ^M_{m=1} P_s(I_m|\mathcal {T}(\mathbf {p})) \end{aligned}$$

Given initial parameters \(\mathbf {p}^0 =\langle p_{1}^{0},\cdots ,p_{N}^{0} \rangle \), an Expectation Maximization (EM) algorithm computes \(\mathbf {p}^1\), and in this fashion produces a sequence of estimates \(\mathbf {p}^2, ..., \mathbf {p}^T\). The process terminates after T iterations when the log likelihood improves by less than an arbitrarily small value \(\epsilon \). LFI-ProbLog is summarized by Eq. 1 [7, 9], which takes \(\mathbf {p}^t\) and computes \(\mathbf {p}^{t+1}\). Intuitively, based on \(\mathbf {p}^t\), the new estimate \(p_n^{t+1}\) is the expected count of \(\mathtt {f_n}\) being true divided by the total count of ground instances of \(\mathtt {f_n}\), formally,

$$\begin{aligned} p_n^{t+1} = \frac{ \sum ^M_{m=1} \sum ^{K^m_n}_{k=1} P(\texttt {f}_{\texttt {n},\texttt {k}}|I_m, \mathcal {T}(\mathbf {p}^t))}{\sum ^M_{m=1} K^m_n } \end{aligned}$$
(1)

where \(K^m_n\) is the number of ground instances represented by \(\mathtt {p_n::f_n}\) in \(I_m\). We will use the following running examples throughout this paper to illustrate the two issues of LFI-ProbLog and our approach to tackling them.

Running example 1. Consider the following Smokers program with three parameters \(\mathbf {p} = \langle \texttt {p}_{\texttt {1}}, \texttt {p}_{\texttt {2}}, \texttt {p}_{\texttt {3}}\rangle \), stating that a person is a smoker with probability \(\mathtt {p_1}\), and any smoker (resp. non-smoker) has cancer with probability \(\mathtt {p_2}\) (resp. \(\mathtt {p_3}\)).

figure b
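The exact program is given in figure b. Based on the description above, a hedged sketch of such a program is the following, using ProbLog's \(\mathtt {t(\_)}\) notation for tunable parameters (corresponding to \(\mathtt {p_1}\), \(\mathtt {p_2}\) and \(\mathtt {p_3}\)); the \(\mathtt {person/1}\) facts enumerate the individuals.

    t(_)::smokes(X) :- person(X).      % parameter p1
    t(_)::cancer(X) :- smokes(X).      % parameter p2
    t(_)::cancer(X) :- \+smokes(X).    % parameter p3
    person(a). person(b). person(c).   % ... one fact per individual, 102 in total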

Consider interpretations \(I_1\,{=}\,\{\mathtt {smokes(a),cancer(a)}\}\), \(I_2\,{=}\,\{\mathtt {smokes(b),\lnot }\mathtt {cancer(b)}\}\), \(I_3{\,=\,}\{\mathtt {\lnot smokes(c){,}cancer(c)}\}\), \(I_4{\,=\,}{\cdots }{\,=\,}I_{102}{\,=\,}\{\mathtt {\lnot smokes(\cdot ){,}}\mathtt {\lnot cancer(\cdot )}\}\). As all interpretations are fully observable, we obtain \(\langle \mathtt {p_1, p_2, p_3}\rangle \) \(=\langle 2/102, 1/2, 1/100\rangle \) by simply counting.

Running example 2. Consider the following Colors program with three parameters \(\mathbf{p}=\langle \texttt {p}_{\texttt {1}}, \texttt {p}_{\texttt {2}}, \texttt {p}_{\texttt {3}}\rangle \) that jointly define a probability distribution over the color of a ball.

figure c
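The exact program is given in figure c. A hedged sketch that is consistent with the description and with the dependency set in Example 4 (which contains the atom \(\mathtt {ball}\)) is:

    ball.
    % multi-head AD; the three t(_) parameters correspond to p1, p2 and p3
    t(_)::green; t(_)::red; t(_)::blue :- ball.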

Given interpretations \(I_1{\,=\,}\{\mathtt {green}\}\), \(I_2{\,=\,}\{\mathtt {red}\}\) and \(I_3{\,=\,}\{\mathtt {blue}\}\), we obtain \(\mathbf {p}=\langle 1/3, 1/3, 1/3\rangle \) by counting.

The first issue of LFI-ProbLog is efficiency-related. When learning single-head ADs, LFI-ProbLog takes into account all interpretations, including the irrelevant ones that do not contain information about the parameter to be learned. These irrelevant interpretations introduce an undesired inertia in EM learning, as illustrated in Example 1.

Example 1

For LFI-ProbLog, the Smokers program must be transformed into the following program with hidden facts \(\mathtt {h_1, h_2}\) and \(\mathtt {h_3}\).

figure d
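The exact transformed program is given in figure d. A hedged sketch that is consistent with the computation below introduces one hidden fact per person and per rule, shown here instantiated with the initial parameters \(\mathbf {p}^0=\langle 0.1, 0.1, 0.1\rangle \) of Example 1.

    0.1::h1(X) :- person(X).           % hidden switch for smokes(X), current estimate of p1
    0.1::h2(X) :- person(X).           % hidden switch for cancer(X) given smokes(X), estimate of p2
    0.1::h3(X) :- person(X).           % hidden switch for cancer(X) given \+smokes(X), estimate of p3
    smokes(X) :- person(X), h1(X).
    cancer(X) :- smokes(X), h2(X).
    cancer(X) :- \+smokes(X), h3(X).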

Given initial parameters \(\mathbf {p}^0 = \langle 0.1, 0.1, 0.1\rangle \), by applying Eq. 1, we obtain \(\mathbf {p}^1=\langle p_1^1,p_2^1,p_3^1\rangle \) as follows.

$$\begin{aligned} p_1^1 {\,=\,} \frac{1{+}1{+}0{+}0{\times }99}{102}{\,=\,}\frac{2}{102}, p_2^1{\,=\,} \frac{1{+}0{+}.1{+}.1{\times }99}{102}{\,=\,}0.108, p_3^1 {\,=\,} \frac{.1{+}.1{+}1{+}0{\times }99}{102}{\,=\,}0.012 \end{aligned}$$

By repeatedly applying Eq. 1, we further obtain a series of estimates \(\mathbf {p}^2{\,=\,}\langle 2/102{,}0.116{,}0.01\rangle \), \(\cdots \), \(\mathbf {p}^{100}{\,=\,}\langle 2/102{,}0.445{,}0.01\rangle \). Notice that \(\mathtt {p_2}\) has not converged to 0.5 even after 100 iterations, although all interpretations are fully observable. This results from the irrelevant interpretations \(I_3,\cdots ,I_{102}\).

The second issue is that LFI-ProbLog does not correctly learn all possible multi-head ADs. The reason lies in how the AD probabilities are transformed from \(\mathbf {p}\) to \(\mathbf {\acute{p}}\) [5] (cf. Sect. 2). This transformation is inadequate for learning, where the parameters are not known in advance and must be estimated, as illustrated in Example 2.

Example 2

For LFI-ProbLog, the Colors program must be transformed into the following program with hidden facts \(\mathtt {gh}\), \(\mathtt {rh}\) and \(\mathtt {bh}\), and the given initial parameters must be transformed. Say, given \(\mathbf {p}^0=\langle 0.2, 0.2, 0.6\rangle \), the transformed probabilities are \(\mathbf {\acute{p}}^0=\langle 0.2, 0.25, 1.0\rangle \).

figure e

By applying Eq. 1, we obtain \(\mathbf {\acute{p}}^{1}\) as follows.

$$\begin{aligned} \acute{p}_1^1 = \frac{1+0+0}{3}=0.333\qquad \acute{p}_2^1= \frac{0.25+1+1}{3}=0.750\qquad \acute{p}_3^1= \frac{1+1+1}{3}=1 \end{aligned}$$

We further obtain \(\mathbf {\acute{p}}^2=\langle 0.333,0.917,1\rangle , \cdots , \mathbf {\acute{p}}^{10}=\langle 0.333,1,1\rangle \), which corresponds to the incorrect AD parameters \(\mathbf {p}^{10}=\langle 1/3, 2/3, 0\rangle \).

4 Learning with Annotated Disjunctions

We propose EMPLiFI, the parameter learning approach given in Eq. 2, as a solution to the issues discussed in Sect. 3. This section illustrates EMPLiFI, and Sect. 5 will prove EMPLiFI's correctness.

$$\begin{aligned} p_n^{t+1} = \frac{ \sum _{I_m\in I_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^t))}{\sum _{I_m\in I_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^t)) } \end{aligned}$$
(2)

where

  • \(\mathtt {h_{n,j}}\) and \(\mathtt {b_{n,j}}\) are, respectively, the j-th ground instance of the head represented by \(\mathtt {p_n::h_{n}}\) and its corresponding body in \(I_m\)

  • \(J^m_n\) is the total number of ground instances represented by \(\mathtt {p_n::h_{n}}\) in \(I_m\)

  • \(I_{\mathtt {p_n}}\) is the set of all interpretations relevant to \(\mathtt {p_n}\)

If the denominator is zero, then \(p_n^{t+1}\) is not updated. Intuitively, based on \(\mathbf {p}^t\), the new estimate \(p_n^{t+1}\) is the expected count of the head divided by the expected count of the body. Unlike LFI-ProbLog, which assumes that every parameter is attached to a fact, EMPLiFI recognizes and exploits the AD rule structure, which enables efficient EM and multi-head AD learning.

4.1 Relevant Interpretations

At this point, it is important to stress that some interpretations do not contain information about a rule \(\mathtt {p::h:\!\!-b}\). An interpretation I is called irrelevant to \(\mathtt {p}\) if conditioning on I does not change the probabilities \(P({\texttt {h}_\texttt {n}}, {\texttt {b}_\texttt {n}}\mathtt {|\cdot )}\) and \(P(\mathtt {b_n|\cdot )}\), i.e. they depend only on the current estimate \(\mathbf {p}^t\). Formally,

$$\begin{aligned} P(\texttt {h}, \texttt {b}|I, \mathcal {T}(\mathbf {p}^t)) = P(\texttt {h}, \texttt {b}|\mathcal {T}(\mathbf {p}^t)) \hbox { and } P(\texttt {b}|I, \mathcal {T}(\mathbf {p}^t)) = P(\texttt {b}|\mathcal {T}(\mathbf {p}^t)) \end{aligned}$$
(3)

As irrelevant interpretations slow down learning (see Example 1), our aim is to identify them for each parameter.

The dependency set of a ground atom \(\mathtt {a}\) in a ProbLog theory \(\mathcal {T}\), denoted by \(dep_\mathcal {T}(\mathtt {a})\), is the set of all atoms that occur in some SLD-proof of \(\mathtt {a}\) [7]. The dependency set of multiple atoms is the union of their dependency sets. A ground fact \(\mathtt {f}\in \mathcal {T}\) is relevant to an interpretation I if it is in the dependency set of I, namely \(\mathtt {f}\in dep_\mathcal {T}(I)\). Similarly, a ground clause is relevant to I if it is used in an SLD-proof of I. Then, the interpretation-restricted theory, denoted by \(\mathcal {T}_r(I)\), is the union of all relevant facts and clauses [9]. A restricted theory is a subset of the original ground program; in fact, it is usually much smaller than the original program. Using the restricted theory, we can define the interpretations that are relevant for learning a parameter.

Definition 1

(Relevant Interpretation) For a ProbLog theory \(\mathcal {T}\), an interpretation I is relevant to an atom \(\mathtt {a_n}\in \mathcal {T}\) if and only if \(\mathtt {a_n}\) is in the interpretation-restricted theory \(\mathcal {T}_r(I)\), namely \(\mathtt {a_n}\in \mathcal {T}_r(I)\).

Since a parameter \(\mathtt {p_n}\) always corresponds to a unique atom \(\mathtt {a_n}\) in \(\mathcal {T}\), we define the \(\mathtt {p_n}\)-relevant interpretations using \(\mathtt {a_n}\), formally, \(\mathcal {I}_{\mathtt {p_n}}=\{I\in \mathcal {I} | \mathtt {a_n}\in \mathcal {T}_r(I)\}\). This defines relevant interpretations for probabilistic facts and single-head ADs; multi-head ADs are handled in Sect. 4.2.

Example 3

Consider the Smokers program and \(I_2=\{\mathtt {smokes(b)}\), \(\mathtt {\lnot cancer(b)}\}\). The dependency set of \(I_2\) is \(dep_{\mathcal {T}}(I_2)\) = \(\{\mathtt {h1}\), \(\mathtt {h2}\), \(\mathtt {smokes(b)}\), \(\mathtt {cancer(b)}\}\) and the corresponding restricted theory \(\mathcal {T}_r(I_2)\) is

figure f
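The restricted theory itself is given in figure f. A hedged sketch, following the dependency set above and the transformed program sketched in Example 1, is the following; the deterministic fact \(\mathtt {person(b)}\) is added here only to keep the sketch self-contained.

    person(b).
    0.1::h1(b).
    0.1::h2(b).
    smokes(b) :- person(b), h1(b).
    cancer(b) :- smokes(b), h2(b).
    % the rule cancer(b) :- \+smokes(b), h3(b) is excluded, since h3 is not in dep(I2)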

Therefore, \(I_2\) is relevant to \(\mathtt {p_1}\) and \(\mathtt {p_2}\) according to Definition 1. We obtain the relevant interpretation sets for all three parameters as \(\mathcal {I}_{\mathtt {p_1}}\, =\, \{I_1{,}\cdots {,} \,I_{102}\}\), \(\mathcal {I}_{\mathtt {p_2}} = \{I_1{,}\,I_2\}\), and \(\mathcal {I}_{\mathtt {p_3}} = \{I_3{,}\cdots {,}\,I_{102}\}\). Given initial parameters \(\mathbf {p}^0{\,=\,}\langle 0.1, 0.1, 0.1\rangle \), applying Eq. 2 yields \(\mathbf {p}^1{\,=\,}\langle 2/102, 1/2, 1/100\rangle \) after a single EM iteration, in contrast to Example 1, where LFI-ProbLog had not converged after 100 iterations.
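For example, for \(\mathtt {p_2}\) (head \(\mathtt {cancer(X)}\), body \(\mathtt {smokes(X)}\)), instantiating Eq. 2 on the relevant interpretations \(\mathcal {I}_{\mathtt {p_2}}=\{I_1, I_2\}\) gives

$$\begin{aligned} p_2^{1} = \frac{P(\mathtt {cancer(a)}, \mathtt {smokes(a)}|I_1, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {cancer(b)}, \mathtt {smokes(b)}|I_2, \mathcal {T}(\mathbf {p}^0))}{P(\mathtt {smokes(a)}|I_1, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {smokes(b)}|I_2, \mathcal {T}(\mathbf {p}^0))} = \frac{1+0}{1+1} = \frac{1}{2} \end{aligned}$$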

4.2 Directly Learning Multi-head ADs

Recall that ProbLog's transformations result in incorrect learning of multi-head ADs (see Sect. 3). To learn correctly, mutual exclusivity must be maintained in the interpretation-restricted theory. To do so, we define the AD dependency set \(dep_{\mathcal {T}}^{AD}(I)\supseteq dep_{\mathcal {T}}(I)\), which also includes mutually exclusive atoms. Intuitively, if \(dep_{\mathcal {T}}^{AD}(I)\) contains an AD head, it must also contain all mutually exclusive heads and their dependency sets. An AD dependency set then defines an AD interpretation-restricted theory, analogously to Sect. 4.1.

Definition 2

(Relevant Interpretation with AD). For a ProbLog theory \(\mathcal {T}\), an interpretation I is relevant to an atom \(\mathtt {a_n}\in \mathcal {T}\) if and only if \(\mathtt {a_n}\) is in the AD interpretation-restricted theory \(\mathcal {T}^{AD}_r(I)\), namely \(\mathtt {a_n}\in \mathcal {T}^{AD}_r(I)\).

After defining relevant interpretations for ADs, we can now learn multi-head ADs using Eq. 2.

Example 4

Consider the Colors program and \(I_1\) = \(\{\mathtt {green}\}\). The AD dependency set \(dep_{\mathcal {T}}^{AD}(I_1)\) is \(\{\mathtt {ball, green, red, blue, gh, rh, bh}\}\) and the AD restricted theory \(\mathcal {T}^{AD}_r(I_1)\) is

figure g

\(I_1\) is relevant to \(\mathtt {p_1}\), \(\mathtt {p_2}\) and \(\mathtt {p_3}\) according to Definition 2. Similarly, \(I_2\) and \(I_3\) are also relevant to all three parameters. Given initial parameters \(\mathbf {p}^0{\,=\,}\langle 0.2, 0.2, 0.6\rangle \), we obtain \(\mathbf {p}^1=\langle 1/3, 1/3, 1/3\rangle \) after one EM iteration by applying Eq. 2.
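For instance, assuming the AD body is the atom \(\mathtt {ball}\) (as the dependency set above suggests), instantiating Eq. 2 for \(\mathtt {p_1}\) gives the following; the computations for \(\mathtt {p_2}\) and \(\mathtt {p_3}\) are symmetric. Note that \(P(\mathtt {green}, \mathtt {ball}|I_2, \cdot)=0\) only because the AD restricted theory preserves mutual exclusivity, so observing \(\mathtt {red}\) rules out \(\mathtt {green}\).

$$\begin{aligned} p_1^{1} = \frac{P(\mathtt {green}, \mathtt {ball}|I_1, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {green}, \mathtt {ball}|I_2, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {green}, \mathtt {ball}|I_3, \mathcal {T}(\mathbf {p}^0))}{P(\mathtt {ball}|I_1, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {ball}|I_2, \mathcal {T}(\mathbf {p}^0)) + P(\mathtt {ball}|I_3, \mathcal {T}(\mathbf {p}^0))} = \frac{1+0+0}{1+1+1} = \frac{1}{3} \end{aligned}$$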

5 Proofs

Section 5.1 will prove EMPLiFI's correctness and Sect. 5.2 will provide insight into how EMPLiFI improves the efficiency of EM parameter learning.

5.1 Correctness

We start from the EM algorithm for Bayesian networks [12], i.e. Eq. 4, which differs from EMPLiFI in that it learns from all interpretations. Since Eq. 4 is correct [12], we can prove the correctness of EMPLiFI, i.e. Eq. 2, by showing that the two converge to the same values (Proposition 1).

$$\begin{aligned} p_n^{t+1} = \frac{ \sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|\texttt {I}_\texttt {m}, \mathcal {T}(\mathbf {p}^{\texttt {t}}))}{\sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|\texttt {I}_\texttt {m}, \mathcal {T}(\mathbf {p}^{\texttt {t}})) } \end{aligned}$$
(4)

Proposition 1

Given a program \(\mathcal {T}\), a set of partial interpretations \(\mathcal {I}\), and initial parameters \(\mathbf {p}^0\), let \(\mathbf {p}^{t,1}\) and \(\mathbf {p}^{t,2}\) be the parameter estimates generated by Eqs. 2 and 4, respectively. Then \(\lim _{t\rightarrow \infty } p_n^{t,1} = \lim _{t\rightarrow \infty } p_n^{t,2}\) for every n.

Proof

We prove this by induction on t. When \(t{\,=\,}0\), \(\mathbf {p}^{0,1}{\,=\,}\mathbf {p}^{0,2}{\,=\,}\mathbf {p}^0\) holds. Assume that for \(t{\,=\,}k\), \(\mathbf {p}^{k,1}{\,=\,}\mathbf {p}^{k,2}\) holds. Then, for \(t{\,=\,}k{+}1\), applying Eq. 2 yields \(p_n^{k+1,1}\), which we rewrite using A and B to save space.

$$\begin{aligned} p_n^{k+1,1}&=\frac{ \sum _{I\in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P({\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}}|I, \mathcal {T}(\mathbf {p}^{k,1})) }{ \sum _{I\in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|I, \mathcal {T}(\mathbf {p}^{k,1})) } = \frac{A}{B} \end{aligned}$$
(5)

We finish the proof by showing that \(p_n^{k+1,2}\) also converges to \(\frac{A}{B}\).

$$\begin{aligned} p_n^{k+1,2} =&\frac{ \sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^{k,2})) }{ \sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^{k,2})) }\hbox { (}\because ~\hbox {Equation}~4\hbox {) }\nonumber \\ =&\frac{ \sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^{k,1})) }{ \sum _{m=1}^{M} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|I_m, \mathcal {T}(\mathbf {p}^{k,1})) }\hbox { (}\because \mathbf {p}^{k,1}{\,=\,}\mathbf {p}^{k,2}\hbox {) }\nonumber \\ =&\frac{ A + \sum _{I\not \in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|I, \mathcal {T}(\mathbf {p}^{k,1})) }{ B + \sum _{I\not \in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|I, \mathcal {T}(\mathbf {p}^{k,1})) } \hbox {(} \because \hbox {Equation\,5)}\nonumber \\ =&\frac{ A + \sum _{I\not \in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {h}_{\texttt {n},\texttt {j}}, \texttt {b}_{\texttt {n},\texttt {j}}|\mathcal {T}(\mathbf {p}^{k,1})) }{ B + \sum _{I\not \in \mathcal {I}_{\texttt {p}_\texttt {n}}} \sum ^{J^m_n}_{j=1} P(\texttt {b}_{\texttt {n},\texttt {j}}|\mathcal {T}(\mathbf {p}^{k,1})) }\hbox {(} \because \hbox {Equation\,3)} \nonumber \\ =&\frac{ A + M_2 \times P(\texttt {h}_\texttt {n}, \texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k,2})) }{ B + M_2 \times P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k,2})) } \hbox { (Let} M_2{\,=\,}\sum \limits _{I\not \in \mathcal {I}_{\texttt {p}_\texttt {n}}}J^m_n \,\hbox {and}\, \because \mathbf {p}^{k,1}{\,=\,}\mathbf {p}^{k,2}\hbox {)}\nonumber \\ =&\frac{ A {+} M_2 {\times } P(\texttt {h}_\texttt {n}, \texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k+1,2})) }{ B {+} M_2 {\times } P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k+1,2})) }\hbox { (}\because \mathbf {p}^{k,2}{\,=\,}\mathbf {p}^{k+1,2} \hbox {)} \end{aligned}$$
(6)

By definition,

$$\begin{aligned} p_n^{k+1,2}{\,=\,} \frac{ P(\texttt {h}_\texttt {n}, \texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k+1,2})) }{ P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{k+1,2})) } \end{aligned}$$
(7)

By combining Eqs. 6 and 7, we obtain \(p_n^{k+1,2} =\frac{A}{B}\).

5.2 Convergence Rate

We now prove that, by considering only relevant interpretations, EMPLiFI always updates the parameters by a larger margin, namely Proposition 2.

Proposition 2

Given a program \(\mathcal {T}\), a set of interpretations \(\mathcal {I}\), and parameter estimates \(\mathbf {p}^t\). Let \(\mathbf {p}^{t+1,1}\) and \(\mathbf {p}^{t+1,2}\) be the next parameter estimates generated by Eqs. 2 and 4, respectively. It is true that either \(p_n^{t+1,1} \le p_n^{t+1,2} \le p_n^t\) or \(p_n^{t+1,1} \ge p_n^{t+1,2} \ge p_n^t\) holds.

Proof

Following the same reasoning as in Proposition 1, we have

$$\begin{aligned} p_n^{t+1,1}&= \frac{A}{B} \hbox { and } p_n^{t+1,2}=\frac{ A + M_2 \times P(\texttt {h}_\texttt {n}, \texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t})) }{ B + M_2 \times P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t})) } \end{aligned}$$

We also have \(P(\texttt {h}_\texttt {n}, \texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t})){\,=\,}p^t_n{\times } P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t}))\) by definition. Hence,

$$\begin{aligned} p_n^{t+1,2} =\frac{ p_n^{t+1,1}{\times } B {+} M_2 {\times } p_n^t{\times } P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t})) }{ B {+} M_2 {\times } P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t})) } =\frac{ p_n^{t+1,1}{\times } B {+} p_n^t{\times } C }{ B {+} C } \end{aligned}$$
(8)

where \(C {\,=\,} M_2 {\times } P(\texttt {b}_\texttt {n}|\mathcal {T}(\mathbf {p}^{t}))\). As \(B, C\ge 0\), \(p_n^{t+1,2}\) is a convex combination of \(p_n^{t+1,1}\) and \(p_n^t\) and hence lies between them, which proves Proposition 2.

Equation 8 illustrates that irrelevant interpretations create an inertia term, namely \(p_n^t{\times }C\), where \(p_n^t\) is the old estimate and C is proportional to the number of irrelevant instances. This inertia not only slows down learning, but also causes numerical instability and results in incorrect values when \(C \gg B\), since standard EM terminates before reaching the true probabilities (cf. Example 1).

6 Experiments

There are two well-known parameter learning algorithms and implementations for probabilistic logic programming: EMBLEM [1] and LFI-ProbLog [7, 9]. We compare EMPLiFI to these two learners to answer the following questions.

Q1 How much does EMPLiFI speed up EM learning?

Q2 How well does EMPLiFI handle multi-head ADs?

Q3 How well does EMPLiFI handle missing data?

Q4 Does EMPLiFI require more computational resources?

Programs

Emergency Power Supply (EPS) [18] is propositional, acyclic and contains 24 probabilities (Footnote 1). It can be handled by all learners as it has no multi-head ADs.

figure h

Smokers  [7] contains 4 probabilities. It has no multi-head ADs but is relational and cyclic. We omit ground facts \(\mathtt {person/1}\) and \(\mathtt {friend/2}\) to save space.

figure i

Dice is an AD with 6 heads. The dice has a higher chance of throwing a six.

figure j

Colors consists of a single-head AD and two multi-head ADs. To learn this program, one must perform EM even in the fully observable case, because the AD bodies are not mutually exclusive.

figure k

Experimental Setup and Results

Experiments were run on a 2.4 GHz Intel i5 processor. Learning terminates with \(\epsilon {\,=\,}1e{-}6\). All interpretations are sampled from the above programs. Partial interpretations are generated by randomly discarding literals in the interpretation, given a missing rate \(m\in [0,1]\). When \(m=0\), interpretations are fully observable. We obtain average measurements by executing all tasks with 5 random seeds. EMPLiFI and LFI-ProbLog programs are compiled into SDDs. Tables 1, 2 and 3 list parameter errors and EM iteration counts. Table 4 lists compilation, evaluation and total times as well as circuit sizes, which refer to node counts.

Q1 How Much Does EMPLiFI Speed up EM Learning? We run all three learners on Smokers with 100 fully observable interpretations. Table 1a shows that EMPLiFI and LFI-ProbLog converge to the same values, but EMPLiFI takes fewer EM iterations. This is consistent with Propositions 1 and 2. EMBLEM is inaccurate in Table 1a. Since EMBLEM is not designed to learn all parameters at the same time [1], we split this task into four sub-tasks, in which each sub-task learns one parameter while all other parameters are set to their ground-truth values. Results are in Table 1b, where EMBLEM is still the least accurate.

Table 1. Smokers. EMPLiFI is the most accurate and takes the fewest EM cycles.

Q2 How Well Does EMPLiFI Handle Multi-head ADs? This experiment shows that EMPLiFI is more accurate than EMBLEM and LFI-ProbLog in learning multi-head ADs. We run all three learners on Dice (Table 2) and Colors (Table 3) with 1k sampled interpretations under two settings. The first setting learns from full interpretations (e.g. \(\{\mathtt {one}\),\(\lnot \mathtt {two}\), \(\cdots \), \(\lnot \mathtt {six}\}\)), and the second setting learns from only positive literals (e.g. \(\{\mathtt {one}\}\)). The second setting is still fully observable, as the truth values of all missing literals can be derived. Table 2 shows that when given all literals, all learners can learn multi-head ADs. However, when given only positive literals, LFI-ProbLog fails to learn them. Table 3 shows an example in which EMBLEM also fails at learning multi-head ADs.

Table 2. Dice. LFI-ProbLog fails when only positive literals are given.
Table 3. Colors. EMPLiFI is the most accurate.

Q3 How Well Does EMPLiFI Handle Missing Data? This experiment shows that EMPLiFI reduces EM iterations in the partially observable setting. We run EMPLiFI and LFI-ProbLog on EPS with 10k interpretations for a series of missing rates in [0, 1]. EMBLEM cannot handle this task within a 3.5 GB memory limit. Figure 1 shows that EMPLiFI converges much sooner than LFI-ProbLog given either full or partial interpretations.

Fig. 1. The average error of EPS parameters decreases over EM iterations under all settings. EMPLiFI generally takes fewer iterations to converge compared to LFI-ProbLog.

Q4 Does EMPLiFI Require More Computational Resources? We run EMPLiFI and LFI-ProbLog on EPS with 10k interpretations for a series of missing rates. Table 4 shows that EMPLiFI has larger circuits and thus longer compilation and evaluation times. This is because EMPLiFI includes information about mutually exclusive AD heads. However, EMPLiFI achieves a shorter total execution time than LFI-ProbLog as it reduces the number of EM iterations.

Table 4. EPS. The total runtime is the sum of compilation and evaluation time.

7 Related Work

We review EM-based parameter learners. PRISM [10] is one of the first EM learning algorithms; however, it imposes strong restrictions on the allowed programs. LFI-ProbLog [7, 9] performs parameter learning of probabilistic logic programs. Before learning AD parameters, it must transform the program, which introduces latent variables that slow down learning. In extreme cases, it can converge to incorrect values. Asteroidea [6, 18] tackles this issue by avoiding EM iterations for probabilistic rules; it can be seen as a special case of EMPLiFI that supports single-head ADs but not multi-head ADs. EMBLEM [1] is another EM-based parameter learner. It can naturally express and learn AD parameters as it is based on the language of Logic Programs with Annotated Disjunctions [17]. Similar to the aforementioned work, EMBLEM uses knowledge compilation techniques. However, it differs in the construction of BDDs as it focuses on learning a single target predicate. When multiple target predicates are present, EMBLEM can converge to incorrect values.

8 Conclusion

We have introduced EMPLiFI, an EM-based parameter learning algorithm for probabilistic logic programs. EMPLiFI supports multi-head ADs and improves efficiency by learning only from relevant interpretations. Theoretically, we have proven that EMPLiFI is correct. Empirically, we have shown that EMPLiFI, compared to LFI-ProbLog and EMBLEM, is more accurate in learning multi-head ADs and takes fewer iterations to converge. EMPLiFI is available in the ProbLog repository (Footnote 2).