1 Introduction

Statistical relational learning (SRL) (Getoor and Taskar 2007) studies formalisms that combine relational representations such as first-order logic with models for capturing uncertainty. The motivation underlying SRL is that real-world domains such as patient clinical histories, molecular structures, and social networks are characterized by the presence of data that are complex, highly structured and uncertain. Many real-world problems are also hybrid in that they contain both discrete and continuous variables. Examples of such domains include robotics, where a robot’s location is described by continuous variables and properties of encountered objects can be described by discrete variables; clinical histories, where a patient’s temperature and blood pressure represent continuous variables while their gender and diagnoses are discrete variables; and biology, where spatial relationships between molecules are modelled as continuous variables and atom types and amino acid types are discrete properties. Unfortunately, few formalisms can cope with structured and uncertain data that contain both continuous and discrete variables. On the one hand, hybrid Bayesian networks (Murphy 1998) model uncertainty for both continuous and discrete variables, but not relations. On the other hand, SRL approaches such as logical Bayesian networks (LBNs) (Fierens et al. 2005), probabilistic relational models (Getoor et al. 2001), and relational dependency networks (Neville and Jensen 2007) capture both structure and uncertainty in problems but are generally restricted to discrete data.

To address this shortcoming, there has recently been increased interest in designing hybrid SRL formalisms such as Hybrid Markov Logic Networks (HMLNs) (Wang and Domingos 2008), Hybrid ProbLog (HProbLog) (Gutmann et al. 2011), Continuous Bayesian Logic Programs (CBLPs) (Kersting and De Raedt 2001), Learning Modulo Theories (LMT) (Teso et al. 2013) and Hybrid Probabilistic Relational Models (HPRMs) (Narman et al. 2010). The vast majority of the work on hybrid SRL has focused on two issues. The first is building up the machinery needed to represent continuous variables within the various SRL formalisms. The second is adapting inference procedures such that they work for hybrid domains. Some formalisms provide support for learning the parameters of a handcrafted structure from data. What has received little attention to date is designing algorithms that are able to learn the structure of a hybrid SRL model (i.e., the dependencies among the variables and relations in a domain) from data.

In this paper we fill that gap by exploring structure learning within a hybrid SRL context. First, we describe hybrid relational dependency networks (HRDNs), a novel formalism that extends RDNs to handle continuous variables. HRDNs approximate a joint probability distribution with a set of conditional probability distributions (CPDs). We discuss several local conditional probability distributions that are adept at modeling continuous variables. Second, we present an algorithm that is able to learn the structure of an HRDN from data. To the best of our knowledge, this is the first attempt to perform structure learning in the hybrid SRL setting. Third, we empirically evaluate our proposed algorithm on one synthetic and one real-world data set. We find that applying our approach to the original hybrid data results in more accurate learned models than discretizing the data prior to learning.

2 Background

We will review both propositional and relational dependency networks. First, we will introduce some general definitions and notational conventions used throughout the paper.

We consider two types of variables. First, a random variable, called a randvar, is a variable that has an associated range of values it can take based on a probability distribution. Second, a logical variable, called a logvar, is a variable that has a finite domain of possible values it can take. A logvar is a placeholder for objects such as students or courses. We denote variables with uppercase letters and specific values with lowercase letters. Given a set of variables \(\mathcal {X}\), a boldface lowercase letter, such as \(\mathbf {x}\), represents an assignment of a value to each variable in the set.

2.1 Propositional dependency networks

A dependency network (DN) (Heckerman et al. 2001) is a (cyclic) directed probabilistic model that approximates a joint probability distribution over a set of random variables with a set of conditional probability distributions (CPDs). A DN is a tuple \((\mathcal {X},dep)\) where \(\mathcal {X}\) is a set of randvars and \(dep\) is a function that maps each randvar \(X\in \mathcal {X}\) to a conditional probability distribution \(p(X~|~Parents(X))\), where \(Parents(X) \subseteq \mathcal {X} \setminus \{X\}\). The CPD quantifies how \(X\) depends on the variables in \(Parents(X)\). A DN can be represented visually as a directed graph \(G = (V, E)\), containing one vertex \(V_X\) for each randvar \(X\in \mathcal {X}\) and a directed arc from vertex \(V_X\) to vertex \(V_Y\) iff \(X \in Parents(Y)\).

Learning the structure of a DN from data requires determining \(Parents(X)\) for each \(X \in \mathcal {X}\) (i.e., the dependency structure) and the parameters of the CPD for \(X\). Although the parameters of the CPDs can be estimated with a variety of regression or classification techniques, the standard method is to use probabilistic decision trees. One scoring function that is often used when learning DNs (and other probabilistic graphical models) is the pseudo-loglikelihood (PLL) (Besag 1974). Optimizing the PLL has two advantages: it decomposes into maximizing the loglikelihood of each variable independently, and calculating it does not require computing the partition function (that is, summing over all possible configurations of the randvars). The PLL of an assignment \(\mathbf {x}\) to the randvars \(\mathcal {X}\) of a DN is calculated as:

$$\begin{aligned} PLL(\mathbf {x} ) = \sum _{i=1}^{n}\log ~p(X_i=x_{i}~|~ Parents (X_{i})). \end{aligned}$$
(1)
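As a concrete illustration of Eq. (1), the following sketch computes the PLL for a toy two-variable DN. The variable names, probabilities, and the representation of CPDs as functions from parent assignments to distributions are all illustrative assumptions, not part of the formalism.

```python
import math

# Toy cyclic DN: A depends on B and B depends on A (all values illustrative).
# Each CPD maps a tuple of parent values to a distribution over the variable.
def pll(assignment, parents, cpds):
    """Pseudo-loglikelihood of Eq. (1): sum_i log p(x_i | parents(x_i))."""
    total = 0.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[p] for p in parents[var])
        total += math.log(cpds[var](parent_vals)[value])
    return total

parents = {"A": ["B"], "B": ["A"]}
cpds = {
    "A": lambda pv: {True: 0.9, False: 0.1} if pv[0] else {True: 0.2, False: 0.8},
    "B": lambda pv: {True: 0.7, False: 0.3} if pv[0] else {True: 0.4, False: 0.6},
}
score = pll({"A": True, "B": True}, parents, cpds)  # log 0.9 + log 0.7
```

Note that each term only conditions on a variable's parents, so no partition function is needed.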

Learning each CPD independently could result in an inconsistent model. That is, there may be no joint probability distribution such that it is possible to apply the rules of probability to the joint distribution in order to derive each learned CPD.

Regardless of whether a DN is consistent, applying an ordered Gibbs sampler to the DN’s CPDs results in a unique distribution, given that each variable in the DN is discrete and each CPD in the DN is positive (Heckerman et al. 2001). Ordered Gibbs sampling randomly selects the initial value for each random variable, and then in each Gibbs sweep iterates over the variables in a fixed order and resamples the value of each \(X_{i}\) from its local distribution \(p( X_{i} | Parents (X_{i}))\). If the DN is consistent, it generates the joint probability distribution. If the DN is inconsistent, this procedure is called an ordered pseudo-Gibbs sampler (Heckerman et al. 2001).
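The ordered (pseudo-)Gibbs procedure can be sketched as follows; this is a minimal illustration with hypothetical boolean randvars and tabular CPDs, not the authors' implementation.

```python
import random

def ordered_pseudo_gibbs(variables, parents, cpds, sweeps, rng):
    """Ordered (pseudo-)Gibbs sampling over a DN with discrete randvars.

    variables gives the fixed visit order; cpds[v](parent_vals) returns a
    dict mapping each value of v to its conditional probability.
    """
    # Randomly initialise every randvar, then resample in order each sweep.
    state = {v: rng.choice([True, False]) for v in variables}
    samples = []
    for _ in range(sweeps):
        for v in variables:
            dist = cpds[v](tuple(state[p] for p in parents[v]))
            r, acc = rng.random(), 0.0
            for value, prob in dist.items():
                acc += prob
                if r <= acc:
                    state[v] = value
                    break
        samples.append(dict(state))
    return samples

parents = {"A": ["B"], "B": ["A"]}
cpds = {
    "A": lambda pv: {True: 0.9, False: 0.1} if pv[0] else {True: 0.2, False: 0.8},
    "B": lambda pv: {True: 0.7, False: 0.3} if pv[0] else {True: 0.4, False: 0.6},
}
samples = ordered_pseudo_gibbs(["A", "B"], parents, cpds, sweeps=200,
                               rng=random.Random(0))
```

If the CPDs are consistent with some joint distribution, the sweeps converge to it; otherwise the chain (if it converges) defines the model's distribution.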

2.2 Relational dependency networks

Next, we review relational dependency networks (RDNs) (Neville and Jensen 2007). There are several ways to define RDNs, but we use a definition that uses first-order logic as a template language for constructing propositional dependency networks. We first briefly review the relevant concepts from first-order logic, then we define RDNs. Throughout the discussion, we will use a slightly modified version of the popular university model (Getoor et al. 2001) as a running example.

We use the datalog subset of first-order logic. The alphabet consists of three types of symbols: constants, logical variables, and predicates. A constant represents a specific object and is denoted with a lower-case letter (e.g., \(\mathtt {pete} \)). A logical variable (logvar) \(\mathtt {X} \) is a variable ranging over the objects in the domain. Logical variables may be typed in which case they represent placeholders for a specific subset of objects in the domain. Predicate symbols \(\mathtt{P}/n\), where \(n\ge 0\) is the arity of the predicate, represent properties of objects or relations among objects. We use a typed language, that is, every argument position of a predicate has a type. Each predicate \(\mathtt {P} \) has a finite range, denoted \(range(\mathtt {P} )\). In contrast to traditional logic, we do not restrict the range of a predicate to \(\{false, true\}\). For example, the range of a student’s intelligence could be \(\{\mathtt{{low} },\mathtt{{med} },\mathtt{{high} }\}\). An atom is of the form \(\mathtt {P} (\mathtt {{t_1}},\ldots ,\mathtt {{t_n}})\) where \(\mathtt{{P/n} }\) is a predicate and each \(\mathtt {{t_i}}\) is an object or a logvar. The range of an atom is the range of its predicate. A literal is an atom or its negation. An atom is ground if all its arguments are constants. A substitution, denoted \(\{\mathtt{{X} }_1/\mathtt{{t} }_1,\dots ,\mathtt{{X} }_n/\mathtt{{t} }_n\}\), maps each logvar \(\mathtt {X} _i\) to \(\mathtt {t} _i\), where \(\mathtt {t} _i\) is a logvar or a constant. A grounding substitution \(\theta \) for an expression (e.g., an atom or a set of logvars) maps each logvar occurring in that expression to a constant. The set of all grounding substitutions for an expression \(\mathtt {E} \) is denoted \(grsub (\mathtt {E} )\). The result of applying a substitution to an atom \(\mathtt {a} \) is denoted \(\mathtt {a} \theta \).

Similar to LBNs (Fierens et al. 2005), we use a set of statements to define the random variables in a domain:

$$\begin{aligned} random(\mathtt {H} ) \leftarrow {l}_1,\ldots ,{l}_n \end{aligned}$$

where \(\mathtt {H} \) is an atom, and \(\mathtt {l_1} ,\ldots ,\mathtt {l_n} \) is a conjunction of literals. Given a set of random variable declarations \(RVD\), the set of random variables \(\varPhi \) is the set of all ground atoms \(\mathtt {A} \theta \) for which there is a random variable declaration \(random(\mathtt {A} )\leftarrow l_1, \ldots , l_n\) in \(RVD\) and a substitution \(\theta \) such that \(l_1\theta \), ..., \(l_n\theta \) is true given the background knowledge (which, among other things, specifies which ground atoms of the predicates in the body of the random variable declaration are true). For example, the random variable declaration for the atom \(\mathtt {takes(S,C)} \)

$$\begin{aligned} random(\mathtt {takes(S,C)} ) \leftarrow \mathtt {student(S)} ,~\mathtt {course(C)} \end{aligned}$$
(2)

creates one randvar for each student S and course C in the domain.
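The grounding process behind a declaration like (2) can be sketched as a Cartesian product over the typed logvars' domains; the predicate name, domains, and tuple-based atom representation below are illustrative assumptions.

```python
from itertools import product

# A randvar declaration such as random(takes(S,C)) <- student(S), course(C)
# yields one ground atom per substitution of its typed logvars.
def ground_declaration(pred, logvar_domains):
    """Enumerate all ground atoms pred(c1, ..., cn) over the Cartesian
    product of the logvars' domains."""
    return [(pred, combo) for combo in product(*logvar_domains.values())]

atoms = ground_declaration("takes", {"S": ["bob", "ann"], "C": ["math", "bio"]})
# 2 students x 2 courses -> 4 ground atoms such as takes(bob, math)
```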

It must always be possible to evaluate the conjunction in the right-hand side of a random variable declaration, and we will use a closed-world assumption to guarantee this. As is common practice in many other probabilistic logical model frameworks (Fierens et al. 2005; Richardson and Domingos 2006; Getoor et al. 2001), our random variable declarations specify all random variables that are potentially of interest. For example, the random variable declaration

$$\begin{aligned} random(\mathtt {grade(S,C)} ) \leftarrow \mathtt {student(S)} , \mathtt {course(C)} \end{aligned}$$
(3)

specifies that every student gets a grade for every course, even though a precondition for obtaining a grade is that student S must take course C. In this case, grade(S,C) would have a special value \(not\_relevant\) in its domain, and we would have the background knowledge

$$\begin{aligned} \mathtt {grade(S,C)} = not\_relevant \Leftrightarrow \mathtt {takes(S,C)} = false \end{aligned}$$
(4)

We refer to these statements as relevancy conditions. Later, when learning the conditional dependency of grade(S,C) on takes(S,C) and other random variables, we can exploit such hard background knowledge to restrict the learning problem to those values of the parent random variables for which the dependent random variable is relevant.

Let \(\mathtt {h} \theta \) be a random variable. Given background knowledge, an interpretation \(I\) assigns to \(\mathtt {h} \theta \) a value from its range, or the special value \(not\_relevant\) iff there exists a relevancy condition \(\mathtt {h} =not\_relevant \Leftrightarrow \varphi \) in the background knowledge and \(\varphi \theta \) is true in \(I\). The set of all groundings of a predicate \(\mathtt {P} \) that have an assigned value \(v\ne not\_relevant\) in interpretation \(I\) is denoted as \(gr(\mathtt {P} )^{I}\). We refer to the randvars in \(gr(\mathtt {P} )^{I}\) as \(\mathtt {P} \)'s relevant randvars.

Now, we will introduce relational features. For this, we first need to define aggregation functions.

Definition 1

(Aggregation function) An aggregation function for a domain \(D\) is a function that maps every finite multiset of elements from \(D\) to a single value from a range \(R\).

For example, \(mode \) is an aggregation function that maps a multiset of values from \(D\) to the most frequently occurring value in the multiset.
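As a minimal sketch of Definition 1, \(mode \) can be implemented over a multiset represented as a list; returning the sentinel \(undefined\) on the empty multiset anticipates the convention introduced below and is our own choice of representation.

```python
from collections import Counter

# An aggregation function maps a finite multiset to a single value.
# mode is not defined on the empty multiset; following the text, we
# return the sentinel value "undefined" in that case.
def mode(multiset):
    if not multiset:
        return "undefined"
    return Counter(multiset).most_common(1)[0][0]

mode(["low", "high", "low"])  # -> "low"
```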

Definition 2

(Discrete relational feature) Let \(L\) be a set of logvars, \(C\) be a conjunction of randvar-value tests of the form \(G=v\) where \(G\) is an atom and \(v\in range(G)\), \(A\) be an atom, and \(\alpha \) be an aggregation function taking as input multisets of elements of \(range(A)\). Assume the ranges of \(A\), all atoms in \(C\) and \(\alpha \) are discrete. Then, a discrete relational feature \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\) is a function that maps any \(\theta \in grsub (L)\) and interpretation \(I\) to

$$\begin{aligned} {\mathcal {F}}_{{L} : {C},{A},{\alpha }}(\theta ,I)=\alpha \left( \{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}\right) \end{aligned}$$

where we say \(C\theta \theta '\) holds in \(I\) iff \(\forall (G=v) \in C, I(G\theta \theta ') = v\).

A feature’s range is the range of its aggregation function \(\alpha \). The length of a feature is equal to the number of randvar-value tests in \(C\) plus one (for \(A\)).

There are two cases for grounding a relational feature that warrant mention:

  (a)

    \(|\{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}|=1\), for all \(\theta \in grsub (L)\)

  (b)

    \(|\{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}|=0\), for all \(\theta \in grsub (L)\)

In the first case, we use \(value\) to denote the identity aggregation function, which simply returns \(I(A\theta \theta ')\). For example, if each student \(\mathtt {S} \) has exactly one value for intelligence, then the relational feature \({\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\emptyset },{\mathtt{{intelligence} }(\mathtt{{S} })},{value}}\) simply returns the value taken by the randvar \(\mathtt {intelligence} (\mathtt {S} )\), which represents the \(\mathtt {intelligence} \) of a student \(\mathtt {S} \), in interpretation \(I\). The second case requires applying an aggregation function to the empty multiset. Some aggregation functions (e.g., mode) are not defined on the empty multiset, and in this case \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}(\theta ,I)\) returns the value \(undefined\).

Example 1

Consider the following relational feature:

$$\begin{aligned} {\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{{grade(S,C)=low} }},{\mathtt{{difficulty(C)} }},{{mode}}} \end{aligned}$$

where \(\mathtt {C} \) is a logvar denoting courses and \(\mathtt {S} \) is a logvar denoting students. This feature calculates the mode of the difficulties for the courses where a student received a low grade. If a student has taken no courses or received no low grades, then, as discussed above, mode would return the value undefined.
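Evaluating the feature from Example 1 can be sketched as follows. The interpretation, its dictionary encoding, and the explicit course list standing in for the grounding substitutions are all hypothetical.

```python
from collections import Counter

def mode(ms):
    return Counter(ms).most_common(1)[0][0] if ms else "undefined"

# A hypothetical interpretation, mapping ground atoms to their values.
I = {
    ("grade", ("ann", "math")): "low",
    ("grade", ("ann", "bio")): "high",
    ("difficulty", ("math",)): "hard",
    ("difficulty", ("bio",)): "easy",
}

def feature_value(student, courses, interp):
    """Evaluate F_{{S} : grade(S,C)=low, difficulty(C), mode} for one
    grounding of S: aggregate difficulty(C) over the courses C for which
    the test grade(S,C)=low holds in the interpretation."""
    vals = [interp[("difficulty", (c,))]
            for c in courses
            if interp.get(("grade", (student, c))) == "low"]
    return mode(vals)

feature_value("ann", ["math", "bio"], I)   # -> "hard"
feature_value("bob", ["math", "bio"], I)   # -> "undefined" (no low grades)
```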

Definition 3

(Discrete dependency statement) A discrete dependency statement is of the form \(\mathtt {G} ~|~\)Parents(\(\mathtt {G} \)). \(\mathtt {G} \) is the target atom that has a discrete range and whose arguments are all logvars. Parents(\(\mathtt {G} \)) is a set of discrete relational features, where for each \({\mathcal {F}}_{{L} : {C},{A},{\alpha }} \in \) Parents(\(\mathtt {G} \)), \(L\) is a subset of the logvars in \(\mathtt {G} \). Each dependency statement has an associated conditional probability distribution (CPD) which quantifies how the target atom depends on its parent set.

Example 2

An example of a discrete dependency statement is:

$$\begin{aligned} {\mathtt{intelligence(S)}}~|~{\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{{takes(S,C)=true} }},{\mathtt{{grade(S,C)} }},{{mode}}} \end{aligned}$$

which states that a student’s intelligence depends on the mode of grades received across all courses the student has taken. As each student can take a varying number of courses, an aggregation function, such as mode in this example, is needed to combine the values from the varying number of parents into a single value.

We are now ready to formally define an RDN:

Definition 4

(RDN) An RDN is a tuple \((\mathcal {{P}},RVD,dep)\), where \(\mathcal {{P}}\) is a set of predicates, each with a discrete range, \(RVD\) is a set of randvar declarations, and \(dep\) is a function that maps each \(\mathtt {P} \in \mathcal {{P}}\) to a discrete dependency statement.

An RDN \((\mathcal {{P}},RVD,dep)\) is a template for constructing propositional DNs. Given the background knowledge and a set of randvar declarations \(RVD\), an induced DN has a node for each randvar \(\mathtt {G} \theta \in \varPhi \).

The parent set of a ground atom \(\mathtt {G} \theta \) in a dependency network is defined as

$$\begin{aligned} Parents(\mathtt {G} \theta ) = Parents_{A}(\mathtt {G} \theta ) \cup Parents_{C}(\mathtt {G} \theta ) \end{aligned}$$

where

$$\begin{aligned} Parents_{A}(\mathtt {G} \theta )&= \left\{ A\theta \theta ' \mid \exists {\mathcal {F}}_{{L} : {C},{A},{\alpha }}\in Parents(\mathtt {G} ) : \theta '\in grsub ((C\theta ,A\theta )) \right\} \\ Parents_{C}(\mathtt {G} \theta )&= \bigcup \left\{ C\theta \theta ' \mid \exists {\mathcal {F}}_{{L} : {C},{A},{\alpha }}\in Parents(\mathtt {G} ) : \theta '\in grsub ((C\theta ,A\theta )) \right\} \end{aligned}$$
(5)

There is an arc between two ground atoms \(\mathtt {G} \theta \) and \(\mathtt {G} '\theta \), if \(\mathtt {G} '\theta \in Parents(\mathtt {G} \theta )\). The CPDs are shared across all randvars that originate from the same predicate.

The pseudo-loglikelihood of an RDN \(M\) for an interpretation \(I\) involves only the relevant randvars and it is calculated as:

$$\begin{aligned} PLL(M;I) = \sum _{\mathtt {P} \in \mathcal {P}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))). \end{aligned}$$
(6)

Example 3

Consider the following simple RDN for a domain with the following randvar declarations:

$$\begin{aligned}&{\mathtt{random(intelligence(S)) \leftarrow student(S)}}\\&{\mathtt{random(takes(S,C)) \leftarrow student(S),course(C)}}\\&{\mathtt{random(grade(S,C)) \leftarrow student(S),course(C)}}\\&{\mathtt{random(difficulty(C)) \leftarrow course(C)}} \end{aligned}$$

where each predicate has a discrete range and the following dependency statement:

$$\begin{aligned} \mathtt {grade(S,C)} ~|~{\mathcal {F}}_{{\{\mathtt{{S} },\mathtt{{C} }\}} : {\emptyset },{\mathtt{{intelligence(S)} }},{value}},~{\mathcal {F}}_{{\{\mathtt{{S} },\mathtt{{C} }\}} : {\emptyset },{\mathtt{{difficulty(C)} }},{value}} \end{aligned}$$

The dependency states that a student’s grade in a course depends on the student’s intelligence and the difficulty of the course. Note that this statement says that all ways of instantiating the logvars \(\mathtt{S}\) and \(\mathtt{C}\) have an identical probabilistic relationship with \(\mathtt{S}\)’s intelligence and \(\mathtt{C}\)’s difficulty. Figure 1 shows an induced propositional DN for this RDN given the relevancy condition on grade/2 specified in (4), and a domain with two students \(\mathtt {bob} \) and \(\mathtt {ann} \), and two courses \(\mathtt {math} \) and \(\mathtt {bio} \) (short for biology). The dashed arrows denote the relevancy conditions for the grade/2 randvars.

Fig. 1

The DN induced by grounding the RDN specified in Example 3. The dashed arrows specify the relevancy condition on grade/2

Given that RDNs are templates for constructing DNs, they inherit the semantics of DNs (Neville and Jensen 2007). Namely, a consistent RDN specifies a joint probability distribution over the randvars of a relational data set. Similarly, a unique joint probability distribution for an RDN can be obtained by grounding out the model to obtain a DN and then running an ordered pseudo-Gibbs sampler on the DN. Again, this can be done regardless of whether the model is consistent. The distribution of an inconsistent RDN is the stationary distribution of an ordered pseudo-Gibbs sampler (if it exists) applied to the model.

Learning the structure of an RDN follows the same paradigm as in the propositional case: the CPD for each predicate is learned in turn. Normally, this is done by learning a relational probability tree for each predicate (Neville and Jensen 2007; Natarajan et al. 2012). Section 6 provides a more in-depth discussion of existing RDN structure learning algorithms.

3 Hybrid relational dependency networks

We now describe HRDNs, our proposed extension to RDNs for hybrid domains. First, we describe how to incorporate continuous variables. Second, we describe how to represent the CPDs. Third, we briefly describe how to perform inference in HRDNs.

3.1 Representation

Extending RDNs to incorporate continuous random variables is relatively natural; it requires modifying the definitions presented in Sect. 2.2.

First, to introduce continuous variables, it suffices to declare the range of a predicate to be an interval of the real numbers. Each continuous randvar associated with such a predicate can then take on any value from this interval. For example, we could define a predicate \(\mathtt {numHours/1} \) with the following random variable declaration:

$$\begin{aligned} \mathtt {random(numHours(C)}) \leftarrow \mathtt {course(C)} \end{aligned}$$

that represents the number of hours needed to study for a course \(\mathtt {C} \). The range of this predicate can be the following interval:

$$\begin{aligned} \mathtt {range(numHours(C)}) =[20.0,180.0] \end{aligned}$$

Second, we need to modify the definition of a relational feature to account for the fact that both atoms and aggregation functions can have continuous ranges.

Definition 5

(Numeric relational feature) A numeric relational feature has the same form, \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\), as a discrete relational feature. In contrast to a discrete relational feature, one or both of \(A\) and \(\alpha \) in a numeric relational feature must have a continuous range.

Example 4

Consider the following numeric relational feature:

$$\begin{aligned} {\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{takes(S,C)=true}},{\mathtt{numHours(C)}},{\mathtt{average}}} \end{aligned}$$

This feature computes the average number of hours a student spends studying for all taken classes.

Third, we need to extend the definition of a dependency statement to incorporate numeric relational features.

Definition 6

(Hybrid dependency statement) A hybrid dependency statement is of the form \(\mathtt {G} ~|~Parents(\mathtt {G} )\) where \(\mathtt {G} \)’s range may be discrete or continuous and Parents(\(\mathtt {G} \)) is a set of discrete and/or numeric relational features. Each hybrid dependency statement has an associated CPD.

Note that the type of a CPD for each hybrid dependency is determined according to \(\mathtt {G} \)’s range: for a discrete range it is a probability mass function, and for a continuous range it is a density function.

Now we are ready to formally define an HRDN:

Definition 7

(HRDN) An HRDN is a tuple \((\mathcal {{P}}, RVD, dep)\), where \(\mathcal {{P}}\) is a set of predicates, whose ranges may be discrete or continuous, \(RVD\) is a set of randvar declarations and \(dep\) is a function mapping each \(\mathtt {P} \in \mathcal {{P}}\) to a hybrid dependency statement.

Analogous to an RDN, an HRDN can be viewed as a template for constructing a hybrid dependency network in the following way. The set of predicates \(\mathcal {P}\) in an HRDN is split into the set of predicates with discrete range \(\mathcal {P}_{D}\) and the set of predicates with continuous range \(\mathcal {P}_{C}\). Given a set of random variable declarations \(RVD\) for all predicates in \(\mathcal {P}\) and a set of constants, the set of randvars is \(\varPhi =\varPhi _{D} \bigcup \varPhi _{C}\) where \(\varPhi _{D}\) denotes all randvars with discrete ranges and \(\varPhi _{C}\) denotes all randvars with continuous ranges. The induced hybrid DN will have a node for each randvar in \(\varPhi \) and the parent set of a node is determined in the same manner as described previously for discrete DNs. Each discrete randvar of a predicate \(\mathtt {P} _{d}\in \mathcal {{P_{D}}}\) will obtain its own copy of the discrete CPD associated with \(\mathtt {P} _d\) and each continuous randvar of a predicate \(\mathtt {P} _c\in \mathcal {{P_{C}}}\) will obtain its own copy of the continuous CPD associated with \(\mathtt {P} _c\).

A consistent HRDN specifies the joint distribution over the randvars in its corresponding hybrid dependency network. In parallel with the claims of Neville and Jensen (2007), there is a direct correspondence between consistent HRDNs and hybrid Markov logic networks (HMLN) in that the set of distributions that can be encoded by a consistent HRDN is equal to the set of positive distributions that can be encoded with an HMLN with the same adjacencies provided they use the same aggregate functions. If an HRDN induces a hybrid DN that does not contain cycles, then its semantics corresponds to those of a hybrid Bayesian network. Our work primarily considers inconsistent HRDNs. In this case, if there is a stationary distribution of an ordered pseudo-Gibbs sampler applied to an HRDN model, we refer to this distribution as the one represented by the model.

The pseudo-loglikelihood of an HRDN \(M\) for an interpretation \(I\) is computed as follows:

$$\begin{aligned} PLL(M;I) = \sum _{\mathtt {P} \in \mathcal {P}_{D}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))) + \sum _{\mathtt {P} \in \mathcal {P}_{C}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))) \end{aligned}$$
(7)

where the first summation goes over the predicates with a discrete range (for which \(p\) is a probability mass function), and the second goes over the predicates with a continuous range (for which \(p\) is a density function).

Example 5

To illustrate an HRDN, we could extend Example 3 with the \(\mathtt {numHours} /1\) predicate and add the following hybrid dependency statement:

$$\begin{aligned} {\mathtt{numHours(C)}}~|~{\mathcal {F}}_{{\{\mathtt{{C} }\}} : {\emptyset },{ \mathtt{{difficulty(C)} }},{\mathtt{value}}} \end{aligned}$$

which states that the number of hours spent studying for a class depends on its difficulty. Figure 2 shows the ground hybrid DN for Example 5. Squares denote randvars with a discrete range and ovals denote randvars with a continuous range.

Fig. 2

The ground HRDN specified in Example 5. Squares represent randvars with a discrete range, and ovals represent randvars with a continuous range. The dashed arrows specify the relevancy condition on grade/2

3.2 Local distributions

Each dependency statement \(\mathtt {G} ~|~\)Parents(\(\mathtt {G} \)) has an associated CPD. The type of model used for a CPD depends on both the range of the target atom \(\mathtt {G} \) and whether Parents(\(\mathtt {G} \)) contains discrete or numeric features.

In this work, we use a parametric approach to density estimation and focus only on variants of Gaussian distributions to model continuous variables. Specifically, we use the following models:

  • Multinomial If \(\mathtt {G} \) has a discrete range and its parent set is empty, the CPD is modeled by a multinomial distribution.

  • Gaussian If \(\mathtt {G} \) has a continuous range and its parent set is empty, the CPD is modeled by a Gaussian distribution.

  • Logistic Regression (LR) This CPD is used when the target atom has a discrete range as it facilitates incorporating both discrete and continuous parents (Bishop 1995). Given \(range(\mathtt {G} )=\{y_{1},y_{2},\ldots ,y_{m}\}\), the conditional distribution for the first \((m-1)\) values for a specific grounding \(\mathtt {G} \theta \) is:

    $$\begin{aligned} p(\mathtt {G} \theta =y_{k}~|~Parents(\mathtt {G} \theta ))=\frac{exp\left( w_{k,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )}w_{k,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) }{1+\sum _{j=1}^{m-1} exp\left( w_{j,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )} w_{j,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) } \end{aligned}$$

    The distribution for the \(mth\) value is:

    $$\begin{aligned} p(\mathtt {G} \theta =y_{m}~|~Parents(\mathtt {G} \theta ))=\frac{1}{1+\sum _{j=1}^{m-1} exp\left( w_{j,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )} w_{j,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) } \end{aligned}$$
    (8)

    In both equations, \(\mathcal {F}\) is a relational feature, \(w_{j,\mathcal {F}}\) are the weights associated with \(\mathcal {F}\) for value \(y_j\), and \(w_{j,0}\) is \(y_j\)’s bias term.

  • Linear Gaussian (LG) A linear Gaussian CPD is used when \(\mathtt {G} \)’s range is continuous and all the features in the parent set are numeric (Lauritzen 1992; Koller et al. 1999). An LG is a Gaussian distribution that models \(\mu \) as a linear combination of the values of the features in the parent set, but assumes a fixed variance \(\sigma ^2\). The distribution is given as:

    $$\begin{aligned} p(\mathtt {G} \theta ~|~Parents(\mathtt {G} \theta ))=N\left( w_{0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )}w_{\mathcal {F}}\cdot \mathcal {F}(\theta ),\sigma _\mathtt {G} ^2\right) \end{aligned}$$
    (9)

    where \(\mathcal {F}\) is a numeric feature and \(w_{\mathcal {F}}\) is the weight associated with \(\mathcal {F}\).

  • Conditional Linear Gaussian (CLG) A conditional linear Gaussian (CLG) is used if \(\mathtt {G} \)’s range is continuous and its parent set contains a mix of discrete and numeric features. There is a separate linear Gaussian model for every instantiation of the discrete parents. More formally, consider partitioning the parent set of a predicate into the discrete features, \(\mathcal {F}_{discrete}\), and the numeric features, \(\mathcal {F}_{continuous}\), and let \(\mathcal {D}\) be the Cartesian product of the ranges of all features in \(\mathcal {F}_{discrete}\). Then, the CPD consists of one LG model for each \(d\in \mathcal {D}\):

    $$\begin{aligned} p(\mathtt {G} \theta ~|~\mathcal {F}_{continuous},d)= N\left( w_{0,d}+\sum _{\mathcal {F} \in \mathcal {F}_{continuous}}{w_{\mathcal {F},d}\cdot \mathcal {F}(\theta )},\sigma _d^2\right) \end{aligned}$$
    (10)

    Note that because there is a separate LG for each \(d\), each one has an associated variance \(\sigma _d^2\). A conditional Gaussian is a special case of a CLG where the parent set only contains discrete features. Here, a separate Gaussian (mean and variance) is learned for each possible configuration of the parents.

As in the discrete case, it is possible that a feature does not have any groundings. If this occurs and the aggregation function of the feature is not defined on the empty set, then we again return the value undefined.
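To make the Gaussian CPDs concrete, here is a minimal sketch of evaluating Eqs. (9) and (10). The function names and the tabular representation of the per-configuration LG parameters are illustrative assumptions, not the authors' implementation.

```python
import math

def linear_gaussian_density(x, feature_vals, weights, bias, sigma):
    """Linear Gaussian CPD of Eq. (9): N(x; w0 + sum_F w_F * F(theta), sigma^2)."""
    mu = bias + sum(w * f for w, f in zip(weights, feature_vals))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def clg_density(x, cont_feature_vals, disc_config, lg_table):
    """Conditional linear Gaussian of Eq. (10): one LG per configuration of
    the discrete parents; lg_table maps a configuration to (weights, bias, sigma)."""
    weights, bias, sigma = lg_table[disc_config]
    return linear_gaussian_density(x, cont_feature_vals, weights, bias, sigma)

# One LG per value of a single discrete parent (parameters are illustrative).
lg_table = {("hard",): ([30.0], 60.0, 10.0), ("easy",): ([10.0], 30.0, 5.0)}
density = clg_density(90.0, [1.0], ("hard",), lg_table)  # mu = 60 + 30*1 = 90
```

Note how each discrete configuration carries its own variance, matching the text.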

3.3 Inference

Similar to RDNs, inference in HRDNs can be performed by using an ordered pseudo-Gibbs sampler. The difference lies in the fact that HRDNs contain both conditional density functions and probability distributions. Given an HRDN, a set of constants for each type, and possibly a set of relevance conditions, inference is performed as follows.

First, the model is grounded to create the corresponding propositional hybrid dependency network. Second, each randvar gets its own copy of the CPD associated with its predicate. Third, an ordering over the atoms is determined based on the relevance conditions, if specified. This ordering must ensure that, before sampling an atom \(\mathtt {A} \) with a relevance condition \(\mathtt {A} =not\_relevant \Leftrightarrow l\), we first sample the values of the atoms occurring in \(l\). For example, consider the relevance condition (4). In each Gibbs sweep, before we sample values for grade/2 we make sure that the values for takes/2 are sampled.

Finally, in each Gibbs sweep we visit each ground atom in order and resample its value according to its probability distribution or density function. A randvar is assigned a value from its range or obtains the value \(not\_relevant\) if there exists a relevance condition that is satisfied in the sweep. Each sweep results in an interpretation \(I\) and a sample corresponds to only the relevant randvars in \(I\).
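A single sweep of this hybrid sampler can be sketched as follows; the atom names, the sampler and relevance callables, and the dictionary-based state are all illustrative assumptions.

```python
import random

def hybrid_sweep(order, state, samplers, relevance, rng):
    """One sweep of an ordered pseudo-Gibbs sampler for a hybrid DN (a sketch).

    samplers[var](state, rng) draws a value from var's CPD, which may be a
    probability mass function (discrete range) or a density (continuous range).
    relevance[var](state) returns False when var's relevancy condition makes
    it not_relevant; the visit order must sample e.g. takes/2 before grade/2.
    """
    for var in order:
        if var in relevance and not relevance[var](state):
            state[var] = "not_relevant"
        else:
            state[var] = samplers[var](state, rng)
    return state

# Illustrative two-atom example: grade is only relevant if takes is true.
samplers = {
    "takes(bob,math)": lambda s, r: r.random() < 0.0,      # always False here
    "grade(bob,math)": lambda s, r: r.choice(["low", "high"]),
}
relevance = {"grade(bob,math)": lambda s: s["takes(bob,math)"]}
state = hybrid_sweep(["takes(bob,math)", "grade(bob,math)"],
                     {}, samplers, relevance, random.Random(0))
# state["grade(bob,math)"] == "not_relevant"
```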

4 Structure learning

In this section we present our algorithm for learning the structure of an HRDN, which requires learning a dependency statement and CPD for each predicate in the domain. Because we use a decomposable score function to evaluate candidate structures, the problem can be tackled by independently learning a locally optimal CPD for each predicate; hence we refer to our approach as the Learner of Local Models (LLM). When learning the CPD for each predicate, we define a space of candidate features and then greedily select those that improve the score.

Next, we will describe in more detail the key elements of our algorithm, which are (1) its high-level control structure, (2) how to learn a CPD for a single predicate, and (3) how to score the candidate CPDs.

4.1 High-level control structure

Algorithm 1 outlines LLM, which receives as input a set of predicates \(\mathcal {P}\), a set of training interpretations \(D\), and a set of validation interpretations \(V\). LLM assumes fully-observed data. At a high level, the algorithm is quite simple. For each predicate \(\mathtt {P} \in \mathcal {P}\), it invokes the LearnOneModel function to learn a local distribution that models \(\mathtt {P} \) using \(\mathcal {P}\). By using a decomposable score function, such as pseudo-loglikelihood, the global score can be optimized by independently finding the best local distribution for each predicate.Footnote 2 The final model \(M\) is obtained by conjoining all learned local distributions.

Note that this algorithm has the same high-level control structure as existing approaches for learning RDNs. There are two important differences with existing approaches. The first is that the data may contain continuous variables. The second is that, in order to accommodate dependencies on continuous variables, the local distributions are represented via a logistic regression or a (conditional) linear Gaussian as opposed to a relational probability tree.

Next, we describe in detail how to learn and evaluate local distributions.

Algorithm 1
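As a sketch, the high-level control structure of Algorithm 1 reduces to a loop over predicates; the function names here are illustrative, not the actual implementation:

```python
def llm(predicates, train, validation, learn_one_model):
    """Sketch of LLM's high-level loop (Algorithm 1).

    learn_one_model(P, predicates, train, validation) returns the local
    distribution for predicate P (the greedy search of Algorithm 2).
    Because the score is decomposable, each call is independent; the
    final model conjoins the learned local distributions.
    """
    return {p: learn_one_model(p, predicates, train, validation)
            for p in predicates}
```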

4.2 Learning local distributions

Each learned CPD, regardless of its form, in an HRDN is parameterized by a set of features. Learning the structure of the CPD requires determining which features should appear in the parent set. This can be posed as the problem of searching through the space of candidate features. We adopt a greedy approach that selects one feature at a time to add to the parent set until no inclusion improves the score. Thus, in each iteration, the central procedure is finding the single best feature and adding it to the parent set.

We construct candidate features in the following way. First, let \(H=\mathtt {P} (\mathtt {V} _1,\ldots ,\mathtt {V} _n)\), where each \(\mathtt {V} _i\) is a unique logvar, and let \(L= \{\mathtt{{V} }_1,\ldots ,\mathtt{{V} }_n\}\). Next, we construct all atoms \(A\) such that \(A\) is different from \(H\). Then, given a user-defined parameter \(N\), for each \(A\) all conjunctions of \(k\le N\) randvar-value tests \(C=\{(G_1=v_1),\ldots ,(G_k=v_k)\}\) are exhaustively enumerated such that (1) all atoms \(G_i\) have a discrete range, (2) no atom \(G_i\) is identical to \(H\) or \(A\), and (3) the set \(Q=\{H, G_1,\ldots , G_k, A\}\) is connected.Footnote 3 These restrictions ensure that the set of candidate features is finite. For each constructed \(C\) and \(A\), one candidate feature \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\) is generated for each aggregation function \(\alpha \) applicable to \(range(A)\). We consider the following aggregation functions:

  • If no aggregation is needed, we use \(value\),

  • If \(range(A)\) is discrete and not \(\{true, false\}\), we use \(mode\),

  • If \(range(A)\) is discrete and \(\{true, false\}\), we use \(proportion\) and \(exist\),

  • If \(range(A)\) is continuous, we use \(average\), \(maximum\), and \(minimum\).

The aggregation function \(proportion\) computes the proportion of a feature’s possible groundings that are true. The other functions take on their traditional meanings.
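A minimal sketch of these aggregation functions in Python, assuming the values of a feature's groundings have already been collected into a list:

```python
def aggregate(alpha, values):
    """Apply aggregation function `alpha` to a feature's grounding values.

    Which functions are applicable depends on range(A), as listed above.
    An empty grounding set yields "undefined" when the function is not
    defined on it.
    """
    if not values:
        return "undefined"            # aggregation on an empty set
    if alpha == "value":              # no aggregation needed: one grounding
        (v,) = values
        return v
    if alpha == "mode":               # discrete, non-Boolean range
        return max(set(values), key=values.count)
    if alpha == "proportion":         # Boolean range: fraction that is true
        return sum(1 for v in values if v) / len(values)
    if alpha == "exist":              # Boolean range: any grounding true
        return any(values)
    if alpha == "average":            # continuous range
        return sum(values) / len(values)
    if alpha == "maximum":
        return max(values)
    if alpha == "minimum":
        return min(values)
    raise ValueError(alpha)
```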

Algorithm 2 outlines our procedure for learning the dependency for a predicate \(\mathtt {P} \). As input, it receives the target predicate \(\mathtt {P} \), the full set of predicates \(\mathcal {P}\) for the domain, a training set \(D\), and a validation set \(V\). The algorithm starts by constructing the set of candidate features for \(\mathtt {P} \). It then repeatedly iterates through the set of candidate features and evaluates the utility of adding each feature to the parent set. Each feature addition is followed by learning the CPD on the training data \(D\) and then scoring it on the validation data \(V\). In each iteration, the single best feature is added to the parent set. If no feature improves the score, the procedure terminates. Note that the form of the CPD depends on both \(\mathtt {P} \) and the features in the parent set. If \(\mathtt {P} \)’s range is discrete, then the CPD is represented via logistic regression. If \(\mathtt {P} \)’s range is continuous, we use a linear Gaussian if the parent set only contains numeric features and a conditional linear Gaussian when the parent set contains both numeric and discrete features.

The following two subsections explain how we estimate the parameters of the CPDs using the training data and how we evaluate the local models.

Algorithm 2
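The greedy search of Algorithm 2 can be sketched as follows, with `fit` and `score` standing in for CPD training on \(D\) and evaluation on \(V\) (both hypothetical helpers, not the actual implementation):

```python
def greedy_parent_search(candidates, fit, score):
    """Greedy feature selection for one predicate (sketch of Algorithm 2).

    candidates -- the finite set of candidate features for the target
    fit(parents) -- trains the CPD with this parent set on the training data
    score(cpd)   -- evaluates a fitted CPD on the validation data
    """
    parents = []
    best = score(fit(parents))
    while True:
        best_gain, pick = best, None
        for f in candidates:
            if f in parents:
                continue
            s = score(fit(parents + [f]))
            if s > best_gain:        # strictly better than the current best
                best_gain, pick = s, f
        if pick is None:             # no feature improves the score: stop
            return fit(parents)
        parents.append(pick)
        best = best_gain
```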

4.3 Estimating the parameters for candidate CPDs

Next, we briefly describe how to estimate the parameters for the CPDs for the different types of dependency statements that may appear in a learned HRDN.

  • Multinomial The maximum likelihood parameters of the multinomial are learned from the data.

  • Gaussian The maximum likelihood estimates of the Gaussian’s mean and the variance are learned from the data.

  • Logistic regression Parameter estimation requires learning the weight vectors for the logistic regression model. We follow the standard approach and take the (partial) derivative of the conditional loglikelihood of the data and perform gradient ascent to estimate the weights (Mitchell 1997).

  • Linear Gaussian Parameter learning requires estimating the weight vector for the linear regression model. This can be done via standard techniques for training a linear regressor and we use ridge regression (Bishop 1995). We estimate the variance by computing the expected value of the squared difference between the actual value and the model’s predicted value.

  • Conditional linear Gaussian In CLGs, each configuration of the discrete parents has an associated LG model. The parameters for each LG model are learned as described above.
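As an illustration, a single-parent linear Gaussian (and a CLG built from it) can be estimated as follows; for simplicity this sketch uses ordinary least squares rather than the ridge regression used in our implementation:

```python
def fit_linear_gaussian(xs, ys):
    """Fit y ~ N(w0 + w1*x, sigma^2) for a single numeric parent.

    Weights via ordinary least squares (a simplification of ridge
    regression); the variance is the mean squared residual, i.e. the
    expected squared difference between actual and predicted values.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = my - w1 * mx
    var = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / n
    return w0, w1, var

def fit_clg(rows):
    """Fit a CLG: one LG per configuration of the discrete parents.

    rows -- (discrete_config, x, y) triples from the training data.
    """
    by_config = {}
    for cfg, x, y in rows:
        by_config.setdefault(cfg, ([], []))
        by_config[cfg][0].append(x)
        by_config[cfg][1].append(y)
    return {cfg: fit_linear_gaussian(xs, ys)
            for cfg, (xs, ys) in by_config.items()}
```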

4.4 Evaluating candidate models

Traditionally, a candidate model is evaluated using a score function that trades off the model’s fit to the data versus some penalty term based on the model’s complexity to avoid overfitting. For a candidate model \(M\), we use the following score function, which is based on the Minimum Description Length (MDL) (Schwarz 1978):

$$\begin{aligned} MDL(M,D)= PLL(M,D) - Penalty(M,D) \end{aligned}$$
(11)

where \(PLL(M,D)\) is computed using Eq. (6) and \(Penalty(M, D)\) is the following penalty term:

$$\begin{aligned} Penalty(M, D)=\frac{1}{2}\sum _{I \in D} \sum _\mathtt{{P} \in \mathcal {P}} log_{2} (|gr(\mathtt {P} )|^{I}) \cdot B_\mathtt {P} \cdot K \end{aligned}$$

where \(|gr(\mathtt {P} )|^{I}\) is the number of relevant randvars of predicate \(\mathtt {P} \) in interpretation \(I\), \(B_\mathtt {P} \) is the number of free parameters in \(\mathtt {P} \)’s CPD and \(K\) is the size of \(\mathtt {P} \)’s CPD.Footnote 4 Next, we will explain in more detail how \(B_\mathtt {P} \) and \(K\) are calculated.

When the CPD for \(\mathtt {P} \) is represented by a logistic regression model (see Eq. 8), the number of free parameters is:

$$\begin{aligned} B_\mathtt{{P} } = (|range(\mathtt {P} )|-1)\cdot (1+|Parents(\mathtt {P} )|) \end{aligned}$$

where \((1+|Parents(\mathtt {P} )|)\) is the number of weights that must be learned to parameterize the model (i.e., one for each feature plus the intercept). For continuous CPDs, this is slightly more involved to compute. For an LG, the number of free parameters is:

$$\begin{aligned} B_\mathtt{{P} } = 1 + (1 + |Parents(\mathtt {P} )|) \end{aligned}$$

where the first \(1\) is for the variance \(\sigma ^2\) and \((1+|Parents(\mathtt {P} )|)\) is the number of weights that must be learned to parameterize the model (i.e., one for each feature in the parent set plus the intercept). Recall that in a CLG, one LG model is learned for each possible instantiation of the discrete parents. Thus the number of free parameters for a CLG is:

$$\begin{aligned} B_\mathtt{{P} }=d \cdot (1 + (1 + |Parents_{C}(\mathtt {P} )|)) \end{aligned}$$

where \(d\) is the number of elements in the Cartesian product of the ranges of the discrete parents, \(Parents_{C}(\mathtt {P} )\) denotes only numeric features in the parent set of \(\mathtt {P} \) and \((1 + (1 + |Parents_{C}(\mathtt {P} )|))\) is the number of parameters needed to model each LG.

The size \(K\) of \(\mathtt {P} \)’s CPD is the sum of the feature lengths in the parent set:

$$\begin{aligned} K= \sum _{\mathcal {F} \in Parents(\mathtt {P} )} |{\mathcal {F}}_{{L} : {C},{A},{\alpha }}| \end{aligned}$$
(12)

where \(|{\mathcal {F}}_{{L} : {C},{A},{\alpha }}|=|C| +1\) is the length of a feature.
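The parameter counts and the penalty term above can be sketched as follows (the function names are illustrative):

```python
from math import log2

def free_parameters(range_size, n_parents, continuous=False, d=1):
    """Number of free parameters B_P of a CPD (sketch).

    continuous=False: logistic regression; (|range(P)|-1) weight vectors,
                      each with one weight per feature plus an intercept.
    continuous=True : (conditional) linear Gaussian; `d` is the number of
                      discrete-parent configurations (d=1 gives a plain LG)
                      and n_parents counts only the numeric features.
    """
    if continuous:
        return d * (1 + (1 + n_parents))       # variance + weights, per LG
    return (range_size - 1) * (1 + n_parents)  # weights per non-reference class

def penalty(grounding_counts, b, k):
    """Penalty contribution of one predicate P.

    grounding_counts -- |gr(P)|^I for each training interpretation I
    b, k             -- B_P and the CPD size K
    """
    return 0.5 * sum(log2(g) for g in grounding_counts) * b * k
```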

5 Experiments

This section empirically evaluates our HRDN structure learning algorithm LLM. Specifically, we want to answer the following questions:

  1. How does varying the amount of training data affect the quality of the learned model and the run time of the learning algorithm?

  2. Do we learn more accurate models by learning a hybrid model (i.e., explicitly modeling continuous variables) or by discretizing all continuous variables prior to learning?

  3. How does our approach compare to MLN (Richardson and Domingos 2006) structure learning?

All our code, data and models are publicly available.Footnote 5 We first describe the data sets we will use and then explain the experimental setup. Finally, we present and discuss the results.

5.1 Data sets

We use one synthetic and one real-world data set to answer these questions.

Synthetic university data We used a modified version of the well-known university model (Getoor et al. 2001) to generate synthetic data. We made the following alterations. First, we switched the range of intelligence/1 from discrete to continuous. Second, we added two predicates with continuous ranges: numHours/1, which is the estimated number of hours a student needs to study for a course, and ability/1, which is the ability of a professor. Finally, we added a Boolean predicate friend/2, which denotes whether two students are friends. Appendix 1 contains a complete description of the model.

We generate synthetic data in two ways. First, we fix the domain size of each type within an interpretation and vary the number of training interpretations. We learn models by using one, two, four, eight and \(16\) interpretations. We use one validation and one test interpretation. Second, we fix the number of training and validation interpretations to one and vary the domain size of each object. The learned models in this setup are evaluated on a test interpretation consisting of 800 students, 125 courses and 125 professors. Tables 1 and 2 show the characteristics of the domains for the first and second synthetic setup, respectively.

Table 1 Data set characteristics for the synthetic data when varying the number of interpretations used for learning
Table 2 Data set characteristics for the synthetic data when varying the domain size of each object type in the training interpretation

For each experimental condition, we repeat the following process ten times. We generate the appropriate number of interpretations, where each interpretation is constructed by performing \(2000\) iterations of the ordered pseudo-Gibbs sampling (see Sect. 3.3) using the handcrafted model and the specified number of constants.

For each generated data set, we also create a corresponding discretized version by binning each continuous randvar into a number of equal-size intervals. We used \(2\), \(4\), \(6\) and \(8\) bins.

Real-world PKDD’99 financial data set Our real-world domain is the financial data set from the PKDD’99 Discovery Challenge (Berka 1999). It consists of services one bank offers its clients such as loans, accounts, and credit cards among others. In the original data, the \(transaction\) table contains more than one million transactions. Therefore, we introduced several predicates (e.g., average of monthly withdrawals for an account) to summarize the information contained in this table. This results in \(16\) predicatesFootnote 6 about four types of objects: \(clients\), \(accounts\), \(loans\) and \(districts\). Ten predicates have a continuous range and six have a discrete range.

We consider \(account\) to be the central object type in the PKDD’99 financial data set. The original data set consists of \(4500\) accounts, but we omit ten accounts that have missing data. We then split the data associated with these accounts into ten folds. To avoid leakage of information, all information about clients, loans and districts related to one account appears in the same fold. We used six folds for training, three folds for validation and one fold for testing. Table 3 reports the characteristics of this data set.

Table 3 Characteristics of the PKDD’99 financial data set

Again, we create a discretized version of the data by binning each continuous randvar into a number of equal-size intervals, using \(2\), \(4\), \(6\) and \(8\) bins.

5.2 Methodology

We compare the following four learners on all experiments:

  • LLM-H This corresponds to learning a model using our LLM algorithm on the data containing both continuous and discrete variables.

  • LLM-D This corresponds to learning a model using our LLM algorithm on the discretized data. Thus each learned local distribution is modeled using a logistic regression CPD.

  • LSM This corresponds to learning a model using the publicly available implementation of LSM (Kok and Domingos 2010) on the discretized data. LSM is the state-of-the-art Markov logic network structure learning algorithm.

  • Independent This learner constructs a model on the hybrid data such that all randvars are independent. That is, it models the joint distribution as a product of marginal distributions.

On the experiments involving the PKDD’99 financial data set, we include an additional baseline: a handcrafted model. We built a local model to predict each predicate by a set of handcrafted non-relational features. These features are used to predict a property of an object by means of some other properties of that object. The features can be found in Appendix 4. For predicates with a discrete range, we used logistic regression. For predicates with a continuous range, we used both linear regression and MP5 (a regression tree) as implemented in Weka (Hall et al. 2009).

Experimental details LLM is implemented as a combination of Java and Prolog. Java is used for performing the learning and Prolog is used to compute the value of a feature. When generating features, we set the length of the features to be at most \(N=3\). Usually, in relational domains, only a small fraction of the Boolean atoms is true (e.g., the number of people who are friends is quite sparse compared to the number of possible friendships). Therefore, for efficiency reasons, we subsample the false Boolean atoms during learning (Natarajan et al. 2012) to achieve a 1:1 ratio of true to false groundings in all experiments.
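A sketch of this subsampling step, assuming the Boolean groundings are available as (atom, truth value) pairs (the function name and representation are illustrative):

```python
import random

def subsample_false_atoms(groundings, seed=0):
    """Subsample false Boolean groundings to a 1:1 true:false ratio.

    Keeps all true groundings and draws an equally sized random subset
    of the false ones, for efficiency during learning.
    """
    rng = random.Random(seed)
    true_atoms = [a for a, v in groundings if v]
    false_atoms = [a for a, v in groundings if not v]
    k = min(len(false_atoms), len(true_atoms))
    return true_atoms + rng.sample(false_atoms, k)
```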

For LSM, we contacted the authors to determine which parameters were the most important to tune. We then tried several parameter combinations and used the validation data to select appropriate ones for each data set.

Evaluation metrics We evaluate the quality of the learned models using several metrics. First, to measure the quality of the probability estimates, we report the weighted pseudo-loglikelihood (WPLL) (Kok and Domingos 2005). This corresponds to calculating the PLL of an interpretation as the sum of PLLs for each predicate divided by the number of groundings of that predicate in the interpretation.

Second, to measure the predictive performance, we report the area under the ROC curve (AUC-ROC) for discrete predicates and the normalized root-mean-square error (NRMSE) for continuous predicates. Because we have multi-class categorical variables in our domains, we calculate the multi-class AUC-ROC (Domingos and Provost 2000), which we denote as \(AUC_{total}\). The NRMSE for a predicate ranges from zero to one and is calculated by dividing RMSE by the predicate’s range.
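For concreteness, the NRMSE computation can be written as:

```python
def nrmse(actual, predicted, value_range):
    """Normalized root-mean-square error: RMSE divided by the
    predicate's range, so the result lies between zero and one."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return mse ** 0.5 / value_range
```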

Additionally, since we know the model structure for the synthetic data, we compare how closely the learned model reflects the handcrafted structure using the following edit distance. For each predicate, we compare the true parent set to the learned parent set. For each feature in the true parent set, we find its closest feature in the learned parent set according to the following distance metric. The distance \(\varDelta \) between two features, \({\mathcal {F}}_{{L_1} : {C_1},{A_1},{\alpha _1}}\) and \({\mathcal {F}}_{{L_2} : {C_2},{A_2},{\alpha _2}}\), is calculated as:

$$\begin{aligned} \varDelta (\mathcal {F}_1,\mathcal {F}_2)=|C_1\backslash C_2|+|C_2\backslash C_1|+ \delta _{A_1, A_2} + \delta _{\alpha _1, \alpha _2} \end{aligned}$$

where \(\delta _{A_1,A_2}\) equals zero if the two atoms \(A_1\) and \(A_2\) originate from the same predicate and their logvars are equivalent, otherwise it equals one. Similarly, \(\delta _{\alpha _1,\alpha _2}\) equals zero if \(\alpha _1\) and \(\alpha _2\) represent the same aggregation function, otherwise it equals one. When the best match is found, both the true and the learned feature are excluded from further comparisons, and the edit distance is incremented by the distance between them. Furthermore, the final distance is incremented by the length of each feature that must be added or removed from the learned dependency parent set.
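A sketch of the feature distance \(\varDelta \), representing a feature as a (C, A, alpha) triple and assuming that predicate and logvar equivalence of atoms has already been normalized into the atom representation:

```python
def feature_distance(f1, f2):
    """Distance between two features F_{L:C,A,alpha} (sketch).

    C is a set of randvar-value tests, A the aggregated atom, and
    alpha the aggregation function; the distance is the symmetric
    difference of the test sets plus indicator terms for A and alpha.
    """
    c1, a1, alpha1 = f1
    c2, a2, alpha2 = f2
    return (len(c1 - c2) + len(c2 - c1)
            + (0 if a1 == a2 else 1)
            + (0 if alpha1 == alpha2 else 1))
```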

We use a one-tailed paired t test to assess the significance of the results, obtained through ten independent runs for the synthetic experimental setup and ten folds for the real-world data set. The null hypothesis states that there is no difference between two approaches, and we reject it when \(p < 0.01\). For all metrics, we report the metric itself along with its standard deviation.

5.3 Results and discussion

We now present experimental results for the synthetic and real-world data sets.

Results on synthetic data Table 4 shows how the WPLL of each approach varies as a function of the number of training interpretations. Learning from the hybrid data results in a significantly more accurate learned model than learning from the discretized data in all cases except for one in which we have one training interpretation and six discretizing bins. When using the same number of bins for discretization, LLM-D learns more accurate models than LSM on all settings. Note that LSM ran out of memory on all runs when training on eight and 16 interpretations. Finally, all learning approaches always outperform the no-learning baseline.

Table 4 The WPLL on the synthetic data as a function of the number of training interpretations

Table 5 presents the run times for all algorithms as a function of increasing the number of training interpretations. LSM is the fastest learner, but it produces lower-quality models. For all approaches, the run time scales linearly with the number of interpretations. Learning an HRDN is always faster than learning an RDN. When discretizing the data, the run time is influenced by the number of bins used: the more bins there are, the slower the discrete learner is. This occurs because adding more bins increases the size of the search space.

Table 5 The run times in minutes on the synthetic data as a function of the number of training interpretations

Finally, Fig. 3 shows how the edit distance varies as a function of the number of training interpretations. As expected, the edit distance decreases as more training data are used.

Fig. 3 The effect of the number of training interpretations on the average edit distance between the handcrafted HRDN model and the hybrid model learned with LLM-H

Table 6 shows the WPLLs of all learners as a function of increasing the domain size for each object. To encapsulate the effect of domain size changes in a single number, we use the number of randvars in an interpretation. Again, we see that all the learners outperform the independent model. LLM-H always learns significantly more accurate models than LSM. LLM-H learns a significantly more accurate model than LLM-D except when discretizing the data into 6 or 8 bins on the data sets with 200, 400 and 800 students.

Table 6 The WPLL on the synthetic data as a function of the domain size

Table 7 shows the run time of all approaches as a function of increasing domain size. Similar to the previous setup, LSM exhibits better run times than either LLM-H or LLM-D, but it produces lower-quality models. As expected, the run times of both LLM-H and LLM-D vary quadratically with the domain size. LSM’s run time seems to vary linearly, which probably occurs due to its random-walk style search for patterns, which does not necessarily examine all the variables in the training database. When learning (H)RDNs, LLM-H is faster than LLM-D. Again, in general, increasing the number of bins increases the training time.

Table 7 The run times in minutes on the synthetic data as a function of the domain size for all the learners

Figure 4 shows that the edit distance between LLM-H’s learned model and the handcrafted model decreases as the number of randvars in the training interpretation increases. More (observed) random variables equates to more training data, and, as expected, more data allows us to learn more accurate models.

Fig. 4 The effect of increasing the domain size of each object type on the average edit distance between the handcrafted HRDN model and the hybrid model learned with LLM-H. We summarize the effect of changing the domain sizes by showing the number of randvars in the training interpretation

In both synthetic setups, we noticed that in the learned model difficulty(C) depends on nrhours(C). This dependency is not encoded explicitly in the handcrafted model. However, nrhours(C) does depend on difficulty(C) in the original model. In both cases, this contributes to the edit distance.

More detailed results for both synthetic setups can be found in Appendix 3.

Results on the PKDD’99 financial data set Figure 5 shows the WPLL for all approaches on the PKDD’99 financial data set as a function of the number of bins used for discretization. For the handcrafted models, we denote the combination of logistic regression and linear regression as LR+LinR, and the combination of logistic regression and MP5 regression trees with LR+MP5. In the figure, the lines for LLM-H, LR+LinR, LR+MP5 and the independent model are straight because these approaches operate directly on the hybrid data and hence do not perform discretization. We see a clear ranking between the approaches: LLM-H \(>\) LR+LinR \(>\) LR+MP5 \(>\) LLM-D \(>\) LSM \(>\) independent.

Fig. 5 The WPLL for each approach on the PKDD’99 financial data set as a function of the number of bins used for discretization. Note that the results for LLM-H, LR+LinR, LR+MP5 and the independent model do not depend on the number of bins used for discretization

Table 8 shows the (multi-class) AUCs and NRMSE for LLM-H and the handcrafted models. All three approaches tend to have similar results on most predicates. Note that the handcrafted features used to propositionalize the data are all features that LLM-H is able to learn automatically.

Table 8 The performance of the two variants of the handcrafted models, LR+LinR and LR+MP5, compared to LLM-H on the hybrid data for the PKDD’99 financial data set

Table 9 reports the \(AUC_{total}\) for LLM-H, LLM-D and LSM. Out of the six discrete predicates, LLM-H has a higher \(AUC_{total}\) on one predicate, the same on two and worse on three compared to LLM-D. Compared to LSM, it wins on three predicates, loses on two and draws on one.

Table 9 \(AUC_{total}\) results for LLM-H, LLM-D and LSM on the six discrete predicates in the PKDD’99 financial data set

Figure 6 shows the run times for this data set as a function of the number of bins used for discretization. LLM-H exhibits better run times than both LLM-D and LSM. LSM is faster than LLM-D except when discretizing the data into two bins.

Fig. 6 The run time of each approach on the PKDD’99 financial data set as a function of the number of bins used for discretization. The y-axis (run time) is on a log scale. Note that LLM-H’s results do not depend on the number of bins used for discretization

When we inspected the models learned on the PKDD’99 financial data set, we found a considerable number of bi-directional dependencies. This means that our algorithm succeeded in learning a model that is mostly structurally consistent. For example, it learned that the monthly payment amount for a loan depends on the loan amount, and vice versa. The same holds for the average salary and the ratio of urban inhabitants in a district, the average amount withdrawn from an account and the average amount credited to an account, the average amount withdrawn from an account and the average number of withdrawals for an account, among others.

More detailed results for the PKDD’99 financial data set can be found in Appendix 3.

Discussion Now we can revisit and answer the three experimental questions posed at the beginning of this section. To address the first question, we used the synthetic data to explore the scaling behavior of our algorithm. We found that as the amount of training data increases both the accuracy of the learned models and their faithfulness to the ground truth model slightly improve.

The second question revolves around whether it is better to learn from hybrid data or discretized data. On all experiments, we have seen that learning from the hybrid data directly consistently results in significantly more accurate learned models (according to WPLL) than discretizing the data prior to learning. Finally, we wanted to compare our proposed learning algorithm to the state-of-the-art MLN learner. The results show that on both hybrid and discrete data LLM learns more accurate models than LSM.

6 Related work

On the propositional level, researchers have considered extending formalisms such as Bayesian networks and dependency networks to model both discrete and continuous distributions. In terms of hybrid Bayesian networks, most of the work has focused on inference (Koller et al. 1999; Yuan and Druzdzel 2007; Murphy 1998; Moral et al. 2001; Lauritzen and Jensen 2001). There have also been some initial attempts for parameter learning (Murphy 1998) and structure learning (Romero et al. 2006). Cobb et al. (2007) provides a more detailed overview of work on hybrid Bayesian networks.

There has been some work on structure learning for hybrid dependency networks. Dobra (2009) proposed a bounded stochastic search for variable selection (structure learning) for sparse genetic dependency networks that contain both discrete and continuous variables. Meinshausen and Bühlmann (2006) use neighbourhood selection with the Lasso for structure learning as a computationally attractive alternative to standard covariance selection methods for multivariate normal distributions. Guo and Gu (2011) use dependency networks for multi-label classification, where each CPD represents a probabilistic or non-probabilistic binary classifier that can have both discrete and continuous predictors.

Our work represents a relational approach and builds on two lines of research: structure learning for RDNs and hybrid relational probabilistic models. There are two existing structure learning approaches for RDNs (Neville and Jensen 2007; Natarajan et al. 2012). Both perform structure learning by finding the best conditional distribution independently for each predicate. They differ slightly in how they represent the CPDs. Neville and Jensen (2007) learn a single relational probability tree (Neville et al. 2003) for each predicate. Natarajan et al. (2012) represent individual conditional distributions as a weighted sum of relational regression trees (Blockeel and De Raedt 1998), which are learned by a stage-wise optimization procedure. However, these approaches do not explicitly model continuous distributions and instead require them to be discretized. In contrast, our approach is able to directly encode dependencies between discrete and continuous random variables without discretization. Doing so necessitates representing the CPDs with a logistic regression or conditional (linear) Gaussian model as opposed to a relational probability tree.

There are several formalisms that can represent hybrid relational domains including Hybrid Markov Logic Networks (HMLNs) (Wang and Domingos 2008), Hybrid Problog (HProblog) (Gutmann et al. 2011), Continuous Bayesian Logic Programs (CBLPs) (Kersting and De Raedt 2001), Learning Modulo Theories (LMT) (Teso et al. 2013) and Hybrid Probabilistic Relational Models (HPRMs) (Narman et al. 2010). Additionally, formalisms such as Relational Continuous Models (RCMs) (Choi et al. 2010) and Gaussian Logic (Kuželka et al. 2011) can model domains that exclusively contain continuous variables. The latter formalism also provides support for structure learning. Most of these formalisms focus on representation and reasoning issues in hybrid relational domains. HMLNs, CBLPs and LMTs also provide support for learning the parameters of a given model from data. Next, we provide a more detailed comparison between our approach and HMLNs, HProblog and CBLPs.

Representationally, HMLNs, CBLPs and HRDNs all serve as template languages for constructing a different type of propositional graphical model. Hence, each formalism inherits the strengths and weaknesses of the underlying formalism. In contrast, HProblog is a probabilistic extension of Prolog. There are differences in how each formalism models continuous variables. HRDNs, HProblog and CBLPs explicitly state the form of the distribution (e.g., a Gaussian) and its parameters (e.g., the mean and variance). In contrast, HMLNs express numeric variables through a set of soft constraints with a Gaussian penalty for diverging values. One notable difference between HRDNs and CBLPs is that CBLPs do not permit a discrete variable to have a continuous parent, whereas this is possible in HRDNs.

In terms of reasoning, HMLNs and HRDNs use approximate inference. Currently, HProblog only supports an exact inference procedure, which involves partitioning the continuous probabilistic facts into admissible intervals. Scaling HProblog to large domains would require the development of a suitable approximate inference algorithm. Inference in CBLPs can be split into two parts: logical inference and probabilistic inference. The former computes the support network for a query (i.e., a Bayesian network containing all relevant variables for the query). The latter applies off-the-shelf Bayesian network inference methods to the resulting support network.

There are significant differences in the level of support for learning in each formalism. Out of the four formalisms, HRDNs are the only one that supports structure learning in hybrid domains. Like HRDNs, HMLNs and CBLPs have algorithms for parameter learning. Currently, HProblog does not support parameter learning.

7 Conclusions and future work

This paper addressed the problem of learning models from structured, relational data that contain both discrete and continuous variables. To the best of our knowledge, this is the first attempt to perform structure learning in a hybrid SRL setting. We introduced Hybrid Relational Dependency Networks (HRDNs), a novel extension of relational dependency networks that accommodate continuous variables and proposed an algorithm that automatically learns the structure of an HRDN from data. Empirically, we evaluated the benefit of incorporating continuous variables in a learned model on one synthetic and one real-world data set by considering two versions of each data set: one that contains both continuous and discrete variables, and one where each continuous variable is discretized prior to learning. We compared our proposed algorithm to two learners that work only on discrete data: a variant of our algorithm and LSM, the state-of-the-art MLN structure learner. We found that learning directly from the hybrid data resulted in more accurate learned models than learning from the discretized data.

One interesting direction for future work is to explore the suitability of modeling other continuous conditional distributions besides the Gaussians considered in this paper. In principle, other density functions can be used, provided that we can calculate the value of the function at a point and that we can sample a value for a variable given the assignment to its parents. However, it is unclear how easy this is in practice for complex distributions, and whether issues could arise when sampling inconsistent HRDNs containing relational conditional dependencies. We would also like to extend our learning algorithm so that it can cope with missing data and model latent variables. Additionally, we would like to explore other penalty terms in the objective function, such as an L1 penalty, which has been used for learning propositional DNs (Dobra 2009; Meinshausen and Bühlmann 2006). Finally, we would like to evaluate our approach on more real-world domains.