1 Introduction

Statistical relational learning (SRL) (Getoor and Taskar 2007) studies formalisms that combine relational representations such as first-order logic with models for capturing uncertainty. The motivation underlying SRL is that real-world domains such as patient clinical histories, molecular structures, and social networks are characterized by the presence of data that are complex, highly structured and uncertain. Many real-world problems are also hybrid in that they contain both discrete and continuous variables. Examples of such domains include robotics, where a robot’s location is described by continuous variables and properties of encountered objects can be described by discrete variables; clinical histories, where a patient’s temperature and blood pressure represent continuous variables while their gender and diagnoses are discrete variables; and biology, where spatial relationships between molecules are modelled as continuous variables and atom types and amino acid types are discrete properties. Unfortunately, few formalisms can cope with structured and uncertain data that contain both continuous and discrete variables. On the one hand, hybrid Bayesian networks (Murphy 1998) model uncertainty for both continuous and discrete variables, but not relations. On the other hand, SRL approaches such as logical Bayesian networks (LBNs) (Fierens et al. 2005), probabilistic relational models (Getoor et al. 2001), and relational dependency networks (Neville and Jensen 2007) capture both structure and uncertainty in problems but are generally restricted to discrete data.

To address this shortcoming, there has recently been increased interest in designing hybrid SRL formalisms such as Hybrid Markov Logic Networks (HMLNs) (Wang and Domingos 2008), Hybrid ProbLog (HProbLog) (Gutmann et al. 2011), Continuous Bayesian Logic Programs (CBLPs) (Kersting and De Raedt 2001), Learning Modulo Theories (LMT) (Teso et al. 2013) and Hybrid Probabilistic Relational Models (HPRMs) (Narman et al. 2010). The vast majority of the work on hybrid SRL has focused on two issues. The first is building up the machinery needed to represent continuous variables within the various SRL formalisms. The second is adapting inference procedures such that they work for hybrid domains. Some formalisms provide support for learning the parameters of a handcrafted structure from data. What has received little attention to date is designing algorithms that are able to learn the structure of a hybrid SRL model (i.e., the dependencies among the variables and relations in a domain) from data.

In this paper we fill that gap by exploring structure learning within a hybrid SRL context. First, we describe hybrid relational dependency networks (HRDNs), a novel formalism that extends RDNs to handle continuous variables. HRDNs approximate a joint probability distribution with a set of conditional probability distributions (CPDs). We discuss several local conditional probability distributions that are adept at modeling continuous variables. Second, we present an algorithm that is able to learn the structure of an HRDN from data. To the best of our knowledge, this is the first attempt to perform structure learning in the hybrid SRL setting. Third, we empirically evaluate our proposed algorithm on one synthetic and one real-world data set. We find that applying our approach to the original hybrid data results in more accurate learned models than discretizing the data prior to learning.

2 Background

We will review both propositional and relational dependency networks. First, we will introduce some general definitions and notational conventions used throughout the paper.

We consider two types of variables. First, a random variable, called a randvar, is a variable that has an associated range of values it can take based on a probability distribution. Second, a logical variable, called a logvar, is a variable that has a finite domain of possible values it can take. A logvar is a placeholder for objects such as students or courses. We denote variables with uppercase letters and specific values with lowercase letters. Given a set of variables \(\mathcal {X}\), a boldface lowercase letter, such as \(\mathbf {x}\), represents an assignment of a value to each variable in the set.

2.1 Propositional dependency networks

A dependency network (DN) (Heckerman et al. 2001) is a (cyclic) directed probabilistic model that approximates a joint probability distribution over a set of random variables with a set of conditional probability distributions (CPDs). A DN is a tuple \((\mathcal {X},dep)\) where \(\mathcal {X}\) is a set of randvars and \(dep\) is a function that maps each randvar \(X\in \mathcal {X}\) to a conditional probability distribution \(p(X~|~Parents(X))\), where \(Parents(X) \subseteq \mathcal {X} \setminus \{X\}\). The CPD quantifies how \(X\) depends on the variables in \(Parents(X)\). A DN can be represented visually as a directed graph \(G = (V, E)\), containing one vertex \(V_X\) for each randvar \(X\in \mathcal {X}\) and a directed arc from vertex \(V_X\) to vertex \(V_Y\) iff \(X \in Parents(Y)\).

Learning the structure of a DN from data requires determining \(Parents(X)\) for each \(X \in \mathcal {X}\) (i.e., the dependency structure) and the parameters of the CPD for \(X\). Although the parameters of the CPDs can be estimated with a variety of regression or classification techniques, the standard method is to use probabilistic decision trees. One scoring function that is often used when learning DNs (and other probabilistic graphical models) is the pseudo-loglikelihood (PLL) (Besag 1974). Optimizing the PLL has two advantages: it decomposes into maximizing the loglikelihood of each variable independently, and calculating it does not require computing the partition function (that is, summing over all possible configurations of the randvars). The PLL of an assignment \(\mathbf {x}\) to the randvars \(\mathcal {X}\) of a DN is calculated as:

$$\begin{aligned} PLL(\mathbf {x} ) = \sum _{i=1}^{n}\log ~p(X_i=x_{i}~|~ Parents (X_{i})). \end{aligned}$$
(1)
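As a concrete illustration of Eq. (1), the following sketch computes the PLL for a toy two-variable DN. The variable names, probabilities, and the representation of CPDs as functions from parent assignments to distributions are all illustrative assumptions, not part of the formalism.

```python
import math

# Toy cyclic DN: A depends on B and B depends on A (all values illustrative).
# Each CPD maps a tuple of parent values to a distribution over the variable.
def pll(assignment, parents, cpds):
    """Pseudo-loglikelihood of Eq. (1): sum_i log p(x_i | parents(x_i))."""
    total = 0.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[p] for p in parents[var])
        total += math.log(cpds[var](parent_vals)[value])
    return total

parents = {"A": ["B"], "B": ["A"]}
cpds = {
    "A": lambda pv: {True: 0.9, False: 0.1} if pv[0] else {True: 0.2, False: 0.8},
    "B": lambda pv: {True: 0.7, False: 0.3} if pv[0] else {True: 0.4, False: 0.6},
}
score = pll({"A": True, "B": True}, parents, cpds)  # log 0.9 + log 0.7
```

Note that each term only conditions on a variable's parents, so no partition function is needed.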

Learning each CPD independently could result in an inconsistent model. That is, there may be no joint probability distribution such that it is possible to apply the rules of probability to the joint distribution in order to derive each learned CPD.

Regardless of whether a DN is consistent, applying an ordered Gibbs sampler to the DN’s CPDs results in a unique distribution, given that each variable in the DN is discrete and each CPD in the DN is positive (Heckerman et al. 2001). Ordered Gibbs sampling randomly selects the initial value for each random variable, and then in each Gibbs sweep iterates over the variables in a fixed order and resamples the value of each \(X_{i}\) from its local distribution \(p( X_{i} | Parents (X_{i}))\). If the DN is consistent, it generates the joint probability distribution. If the DN is inconsistent, this procedure is called an ordered pseudo-Gibbs sampler (Heckerman et al. 2001).
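The ordered (pseudo-)Gibbs procedure can be sketched as follows; this is a minimal illustration with hypothetical boolean randvars and tabular CPDs, not the authors' implementation.

```python
import random

def ordered_pseudo_gibbs(variables, parents, cpds, sweeps, rng):
    """Ordered (pseudo-)Gibbs sampling over a DN with discrete randvars.

    variables gives the fixed visit order; cpds[v](parent_vals) returns a
    dict mapping each value of v to its conditional probability.
    """
    # Randomly initialise every randvar, then resample in order each sweep.
    state = {v: rng.choice([True, False]) for v in variables}
    samples = []
    for _ in range(sweeps):
        for v in variables:
            dist = cpds[v](tuple(state[p] for p in parents[v]))
            r, acc = rng.random(), 0.0
            for value, prob in dist.items():
                acc += prob
                if r <= acc:
                    state[v] = value
                    break
        samples.append(dict(state))
    return samples

parents = {"A": ["B"], "B": ["A"]}
cpds = {
    "A": lambda pv: {True: 0.9, False: 0.1} if pv[0] else {True: 0.2, False: 0.8},
    "B": lambda pv: {True: 0.7, False: 0.3} if pv[0] else {True: 0.4, False: 0.6},
}
samples = ordered_pseudo_gibbs(["A", "B"], parents, cpds, sweeps=200,
                               rng=random.Random(0))
```

If the CPDs are consistent with some joint distribution, the sweeps converge to it; otherwise the chain (if it converges) defines the model's distribution.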

2.2 Relational dependency networks

Next, we review relational dependency networks (RDNs) (Neville and Jensen 2007). There are several ways to define RDNs, but we use a definition that uses first-order logic as a template language for constructing propositional dependency networks. We first briefly review the relevant concepts from first-order logic, then we define RDNs. Throughout the discussion, we will use a slightly modified version of the popular university model (Getoor et al. 2001) as a running example.

We use the datalog subset of first-order logic. The alphabet consists of three types of symbols: constants, logical variables, and predicates. A constant represents a specific object and is denoted with a lower-case letter (e.g., \(\mathtt {pete} \)). A logical variable (logvar) \(\mathtt {X} \) is a variable ranging over the objects in the domain. Logical variables may be typed in which case they represent placeholders for a specific subset of objects in the domain. Predicate symbols \(\mathtt{P}/n\), where \(n\ge 0\) is the arity of the predicate, represent properties of objects or relations among objects. We use a typed language, that is, every argument position of a predicate has a type. Each predicate \(\mathtt {P} \) has a finite range, denoted \(range(\mathtt {P} )\). In contrast to traditional logic, we do not restrict the range of a predicate to \(\{false, true\}\). For example, the range of a student’s intelligence could be \(\{\mathtt{{low} },\mathtt{{med} },\mathtt{{high} }\}\). An atom is of the form \(\mathtt {P} (\mathtt {{t_1}},\ldots ,\mathtt {{t_n}})\) where \(\mathtt{{P/n} }\) is a predicate and each \(\mathtt {{t_i}}\) is an object or a logvar. The range of an atom is the range of its predicate. A literal is an atom or its negation. An atom is ground if all its arguments are constants. A substitution, denoted \(\{\mathtt{{X} }_1/\mathtt{{t} }_1,\dots ,\mathtt{{X} }_n/\mathtt{{t} }_n\}\), maps each logvar \(\mathtt {X} _i\) to \(\mathtt {t} _i\), where \(\mathtt {t} _i\) is a logvar or a constant. A grounding substitution \(\theta \) for an expression (e.g., an atom or a set of logvars) maps each logvar occurring in that expression to a constant. The set of all grounding substitutions for an expression \(\mathtt {E} \) is denoted \(grsub (\mathtt {E} )\). The result of applying a substitution to an atom \(\mathtt {a} \) is denoted \(\mathtt {a} \theta \).

Similar to LBNs (Fierens et al. 2005), we use a set of statements to define the random variables in a domain:

$$\begin{aligned} random(\mathtt {H} ) \leftarrow {l}_1,\ldots ,{l}_n \end{aligned}$$

where \(\mathtt {H} \) is an atom, and \(\mathtt {l_1} ,\ldots ,\mathtt {l_n} \) is a conjunction of literals. Given a set of random variable declarations \(RVD\), the set of random variables \(\varPhi \) is the set of all ground atoms \(\mathtt {A} \theta \) for which there is a random variable declaration \(random(\mathtt {A} )\leftarrow l_1, \ldots , l_n\) in \(RVD\) and a substitution \(\theta \) such that \(l_1\theta \), ..., \(l_n\theta \) is true given the background knowledge (which, among other things, specifies which ground atoms of the predicates in the body of the random variable declaration are true). For example, the random variable declaration for the atom \(\mathtt {takes(S,C)} \)

$$\begin{aligned} random(\mathtt {takes(S,C)} ) \leftarrow \mathtt {student(S)} ,~\mathtt {course(C)} \end{aligned}$$
(2)

creates one randvar for each student S and course C in the domain.
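The grounding process behind a declaration like (2) can be sketched as a Cartesian product over the typed logvars' domains; the predicate name, domains, and tuple-based atom representation below are illustrative assumptions.

```python
from itertools import product

# A randvar declaration such as random(takes(S,C)) <- student(S), course(C)
# yields one ground atom per substitution of its typed logvars.
def ground_declaration(pred, logvar_domains):
    """Enumerate all ground atoms pred(c1, ..., cn) over the Cartesian
    product of the logvars' domains."""
    return [(pred, combo) for combo in product(*logvar_domains.values())]

atoms = ground_declaration("takes", {"S": ["bob", "ann"], "C": ["math", "bio"]})
# 2 students x 2 courses -> 4 ground atoms such as takes(bob, math)
```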

It must always be possible to evaluate the conjunction in the right-hand side of a random variable declaration, and we will use a closed-world assumption to guarantee this. As is common practice in many other probabilistic logical model frameworks (Fierens et al. 2005; Richardson and Domingos 2006; Getoor et al. 2001), our random variable declarations specify all random variables that are potentially of interest. For example, the random variable declaration

$$\begin{aligned} random(\mathtt {grade(S,C)} ) \leftarrow \mathtt {student(S)} , \mathtt {course(C)} \end{aligned}$$
(3)

specifies that every student gets a grade for every course, even though a precondition for obtaining a grade is that student S must take course C. In this case, grade(S,C) would have a special value \(not\_relevant\) in its domain, and we would have the background knowledge

$$\begin{aligned} \mathtt {grade(S,C)} = not\_relevant \Leftrightarrow \mathtt {takes(S,C)} = false \end{aligned}$$
(4)

We refer to these statements as relevancy conditions. Later, when learning the conditional dependency of grade(S,C) on takes(S,C) and other random variables, we can exploit such hard background knowledge to restrict the learning problem to those values of the parent random variables for which the dependent random variable is relevant.

Let \(\mathtt {h} \theta \) be a random variable. Given background knowledge, an interpretation \(I\) assigns to \(\mathtt {h} \theta \) a value from its range, or the special value \(not\_relevant\) iff there exists a relevancy condition \(\mathtt {h} =not\_relevant \Leftrightarrow \varphi \) in the background knowledge and \(\varphi \theta \) is true in \(I\). The set of all groundings of a predicate \(\mathtt {P} \) that have an assigned value \(v\ne not\_relevant\) in interpretation \(I\) is denoted as \(gr(\mathtt {P} )^{I}\). We refer to the randvars in \(gr(\mathtt {P} )^{I}\) as \(\mathtt {P} \)'s relevant randvars.

Now, we will introduce relational features. For this, we first need to define aggregation functions.

Definition 1

(Aggregation function) An aggregation function for a domain \(D\) is a function that maps every finite multiset of elements from \(D\) to a single value from a range \(R\).

For example, \(mode \) is an aggregation function that maps a multiset of values from \(D\) to the most frequently occurring value in the multiset.
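As a minimal sketch of Definition 1, \(mode \) can be implemented over a multiset represented as a list; returning the sentinel \(undefined\) on the empty multiset anticipates the convention introduced below and is our own choice of representation.

```python
from collections import Counter

# An aggregation function maps a finite multiset to a single value.
# mode is not defined on the empty multiset; following the text, we
# return the sentinel value "undefined" in that case.
def mode(multiset):
    if not multiset:
        return "undefined"
    return Counter(multiset).most_common(1)[0][0]

mode(["low", "high", "low"])  # -> "low"
```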

Definition 2

(Discrete relational feature) Let \(L\) be a set of logvars, \(C\) be a conjunction of randvar-value tests of the form \(G=v\) where \(G\) is an atom and \(v\in range(G)\), \(A\) be an atom, and \(\alpha \) be an aggregation function taking as input multisets of elements of \(range(A)\). Assume the ranges of \(A\), all atoms in \(C\) and \(\alpha \) are discrete. Then, a discrete relational feature \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\) is a function that maps any \(\theta \in grsub (L)\) and interpretation \(I\) to

$$\begin{aligned} {\mathcal {F}}_{{L} : {C},{A},{\alpha }}(\theta ,I)=\alpha \left( \{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}\right) \end{aligned}$$

where we say \(C\theta \theta '\) holds in \(I\) iff \(\forall (G=v) \in C, I(G\theta \theta ') = v\).

A feature’s range is the range of its aggregation function \(\alpha \). The length of a feature is equal to the number of randvar-value tests in \(C\) plus one (for \(A\)).

There are two cases for grounding a relational feature that warrant mention:

  (a)

    \(|\{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}|=1\), for all \(\theta \in grsub (L)\)

  (b)

    \(|\{I(A\theta \theta ') \mid \theta '\in grsub (A\theta ,C\theta )~and~C\theta \theta '~holds~in~I~\}|=0\), for all \(\theta \in grsub (L)\)

In the first case, we use \(value\) to denote the identity aggregation function, which simply returns \(I(A\theta \theta ')\). For example, if each student \(\mathtt {S} \) has exactly one value for intelligence, then the relational feature \({\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\emptyset },{\mathtt{{intelligence} }(\mathtt{{S} })},{value}}\) simply returns the value taken by the randvar \(\mathtt {intelligence} (\mathtt {S} )\), which represents the \(\mathtt {intelligence} \) of a student \(\mathtt {S} \), in interpretation \(I\). The second case requires applying an aggregation function to the empty multiset. Some aggregation functions (e.g., mode) are not defined on the empty multiset, and in this case \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}(\theta ,I)\) returns the value \(undefined\).

Example 1

Consider the following relational feature:

$$\begin{aligned} {\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{{grade(S,C)=low} }},{\mathtt{{difficulty(C)} }},{{mode}}} \end{aligned}$$

where \(\mathtt {C} \) is a logvar denoting courses and \(\mathtt {S} \) is a logvar denoting students. This feature calculates the mode of the difficulties for the courses where a student received a low grade. If a student has taken no courses or received no low grades, then, as discussed above, mode would return the value undefined.
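Evaluating the feature from Example 1 can be sketched as follows. The interpretation, its dictionary encoding, and the explicit course list standing in for the grounding substitutions are all hypothetical.

```python
from collections import Counter

def mode(ms):
    return Counter(ms).most_common(1)[0][0] if ms else "undefined"

# A hypothetical interpretation, mapping ground atoms to their values.
I = {
    ("grade", ("ann", "math")): "low",
    ("grade", ("ann", "bio")): "high",
    ("difficulty", ("math",)): "hard",
    ("difficulty", ("bio",)): "easy",
}

def feature_value(student, courses, interp):
    """Evaluate F_{{S} : grade(S,C)=low, difficulty(C), mode} for one
    grounding of S: aggregate difficulty(C) over the courses C for which
    the test grade(S,C)=low holds in the interpretation."""
    vals = [interp[("difficulty", (c,))]
            for c in courses
            if interp.get(("grade", (student, c))) == "low"]
    return mode(vals)

feature_value("ann", ["math", "bio"], I)   # -> "hard"
feature_value("bob", ["math", "bio"], I)   # -> "undefined" (no low grades)
```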

Definition 3

(Discrete dependency statement) A discrete dependency statement is of the form \(\mathtt {G} ~|~\)Parents(\(\mathtt {G} \)). \(\mathtt {G} \) is the target atom that has a discrete range and whose arguments are all logvars. Parents(\(\mathtt {G} \)) is a set of discrete relational features, where for each \({\mathcal {F}}_{{L} : {C},{A},{\alpha }} \in \) Parents(\(\mathtt {G} \)), \(L\) is a subset of the logvars in \(\mathtt {G} \). Each dependency statement has an associated conditional probability distribution (CPD) which quantifies how the target atom depends on its parent set.

Example 2

An example of a discrete dependency statement is:

$$\begin{aligned} {\mathtt{intelligence(S)}}~|~{\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{{takes(S,C)=true} }},{\mathtt{{grade(S,C)} }},{{mode}}} \end{aligned}$$

which states that a student’s intelligence depends on the mode of grades received across all courses the student has taken. As each student can take a varying number of courses, an aggregation function, such as mode in this example, is needed to combine the values from the varying number of parents into a single value.

We are now ready to formally define an RDN:

Definition 4

(RDN) An RDN is a tuple \((\mathcal {{P}},RVD,dep)\), where \(\mathcal {{P}}\) is a set of predicates, each with a discrete range, \(RVD\) is a set of randvar declarations, and \(dep\) is a function that maps each \(\mathtt {P} \in \mathcal {{P}}\) to a discrete dependency statement.

An RDN \((\mathcal {{P}},RVD,dep)\) is a template for constructing propositional DNs. Given the background knowledge and a set of randvar declarations \(RVD\), an induced DN has a node for each randvar \(\mathtt {G} \theta \in \varPhi \).

The parent set of a ground atom \(\mathtt {G} \theta \) in a dependency network is defined as

$$\begin{aligned} Parents(\mathtt {G} \theta ) = Parents_{A}(\mathtt {G} \theta ) \cup Parents_{C}(\mathtt {G} \theta ) \end{aligned}$$

where

$$\begin{aligned} Parents_{A}(\mathtt {G} \theta )&= \left\{ A\theta \theta ' \mid \exists {\mathcal {F}}_{{L} : {C},{A},{\alpha }}\in Parents(\mathtt {G} ) : \theta '\in grsub ((C\theta ,A\theta )) \right\} \\ Parents_{C}(\mathtt {G} \theta )&= \bigcup \left\{ C\theta \theta ' \mid \exists {\mathcal {F}}_{{L} : {C},{A},{\alpha }}\in Parents(\mathtt {G} ) : \theta '\in grsub ((C\theta ,A\theta )) \right\} \end{aligned}$$
(5)

There is an arc between two ground atoms \(\mathtt {G} \theta \) and \(\mathtt {G} '\theta \), if \(\mathtt {G} '\theta \in Parents(\mathtt {G} \theta )\). The CPDs are shared across all randvars that originate from the same predicate.

The pseudo-loglikelihood of an RDN \(M\) for an interpretation \(I\) involves only the relevant randvars and it is calculated as:

$$\begin{aligned} PLL(M;I) = \sum _{\mathtt {P} \in \mathcal {P}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))). \end{aligned}$$
(6)

Example 3

Consider the following simple RDN for a domain with the following randvar declarations:

$$\begin{aligned}&{\mathtt{random(intelligence(S)) \leftarrow student(S)}}\\&{\mathtt{random(takes(S,C)) \leftarrow student(S),course(C)}}\\&{\mathtt{random(grade(S,C)) \leftarrow student(S),course(C)}}\\&{\mathtt{random(difficulty(C)) \leftarrow course(C)}} \end{aligned}$$

where each predicate has a discrete range and the following dependency statement:

$$\begin{aligned} \mathtt {grade(S,C)} ~|~{\mathcal {F}}_{{\{\mathtt{{S} },\mathtt{{C} }\}} : {\emptyset },{\mathtt{{intelligence(S)} }},{value}},~{\mathcal {F}}_{{\{\mathtt{{S} },\mathtt{{C} }\}} : {\emptyset },{\mathtt{{difficulty(C)} }},{value}} \end{aligned}$$

The dependency states that a student’s grade in a course depends on the student’s intelligence and the difficulty of the course. Note that this statement says that all ways of instantiating the logvars \(\mathtt{S}\) and \(\mathtt{C}\) have an identical probabilistic relationship with \(\mathtt{S}\)’s intelligence and \(\mathtt{C}\)’s difficulty. Figure 1 shows an induced propositional DN for this RDN given the relevancy condition on grade/2 specified in (4), and a domain with two students \(\mathtt {bob} \) and \(\mathtt {ann} \), and two courses \(\mathtt {math} \) and \(\mathtt {bio} \) (short for biology). The dashed arrows denote the relevancy conditions for the grade/2 randvars.

Fig. 1

The DN induced by grounding the RDN specified in Example 3. The dashed arrows specify the relevancy condition on grade/2

Given that RDNs are templates for constructing DNs, they inherit the semantics of DNs (Neville and Jensen 2007). Namely, a consistent RDN specifies a joint probability distribution over the randvars of a relational data set. Similarly, a unique joint probability distribution for an RDN can be obtained by grounding out the model to obtain a DN and then running an ordered pseudo-Gibbs sampler on the DN. Again, this can be done regardless of whether the model is consistent. The distribution of an inconsistent RDN is the stationary distribution of an ordered pseudo-Gibbs sampler (if it exists) applied to the model.

Learning the structure of an RDN follows the same paradigm as in the propositional case: the CPD for each predicate is learned in turn. Normally, this is done by learning a relational probability tree for each predicate (Neville and Jensen 2007; Natarajan et al. 2012). Section 6 provides a more in-depth discussion of existing RDN structure learning algorithms.

3 Hybrid relational dependency networks

We now describe HRDNs, our proposed extension to RDNs for hybrid domains. First, we describe how to incorporate continuous variables. Second, we describe how to represent the CPDs. Third, we briefly describe how to perform inference in HRDNs.

3.1 Representation

Extending RDNs to incorporate continuous random variables is relatively natural; it requires modifying the definitions presented in Sect. 2.2.

First, to introduce continuous variables, it suffices to declare the range of a predicate to be an interval of the real numbers. Each continuous randvar associated with such a predicate can then take on any value from this interval. For example, we could define a predicate \(\mathtt {numHours/1} \) with the following random variable declaration:

$$\begin{aligned} \mathtt {random(numHours(C)}) \leftarrow \mathtt {course(C)} \end{aligned}$$

that represents the number of hours needed to study for a course \(\mathtt {C} \). The range of this predicate can be the following interval:

$$\begin{aligned} \mathtt {range(numHours(C)}) =[20.0,180.0] \end{aligned}$$

Second, we need to modify the definition of a relational feature to account for the fact that both atoms and aggregation functions can have continuous ranges.

Definition 5

(Numeric relational feature) A numeric relational feature has the same form, \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\), as a discrete relational feature. In contrast to a discrete relational feature, one or both of \(A\) and \(\alpha \) in a numeric relational feature must have a continuous range.

Example 4

Consider the following numeric relational feature:

$$\begin{aligned} {\mathcal {F}}_{{\{\mathtt{{S} }\}} : {\mathtt{takes(S,C)=true}},{\mathtt{numHours(C)}},{\mathtt{average}}} \end{aligned}$$

This feature computes the average number of hours a student spends studying for all taken classes.

Third, we need to extend the definition of a dependency statement to incorporate numeric relational features.

Definition 6

(Hybrid dependency statement) A hybrid dependency statement is of the form \(\mathtt {G} ~|~Parents(\mathtt {G} )\) where \(\mathtt {G} \)’s range may be discrete or continuous and Parents(\(\mathtt {G} \)) is a set of discrete and/or numeric relational features. Each hybrid dependency statement has an associated CPD.

Note that the type of a CPD for each hybrid dependency is determined according to \(\mathtt {G} \)’s range: for a discrete range it is a probability mass function, and for a continuous range it is a density function.

Now we are ready to formally define an HRDN:

Definition 7

(HRDN) An HRDN is a tuple \((\mathcal {{P}}, RVD, dep)\), where \(\mathcal {{P}}\) is a set of predicates, whose ranges may be discrete or continuous, \(RVD\) is a set of randvar declarations and \(dep\) is a function mapping each \(\mathtt {P} \in \mathcal {{P}}\) to a hybrid dependency statement.

Analogous to an RDN, an HRDN can be viewed as a template for constructing a hybrid dependency network in the following way. The set of predicates \(\mathcal {P}\) in an HRDN is split into the set of predicates with discrete range \(\mathcal {P}_{D}\) and the set of predicates with continuous range \(\mathcal {P}_{C}\). Given a set of random variable declarations \(RVD\) for all predicates in \(\mathcal {P}\) and a set of constants, the set of randvars is \(\varPhi =\varPhi _{D} \bigcup \varPhi _{C}\) where \(\varPhi _{D}\) denotes all randvars with discrete ranges and \(\varPhi _{C}\) denotes all randvars with continuous ranges. The induced hybrid DN will have a node for each randvar in \(\varPhi \) and the parent set of a node is determined in the same manner as described previously for discrete DNs. Each discrete randvar of a predicate \(\mathtt {P} _{d}\in \mathcal {{P_{D}}}\) will obtain its own copy of the discrete CPD associated with \(\mathtt {P} _d\) and each continuous randvar of a predicate \(\mathtt {P} _c\in \mathcal {{P_{C}}}\) will obtain its own copy of the continuous CPD associated with \(\mathtt {P} _c\).

A consistent HRDN specifies the joint distribution over the randvars in its corresponding hybrid dependency network. In parallel with the claims of Neville and Jensen (2007), there is a direct correspondence between consistent HRDNs and hybrid Markov logic networks (HMLN) in that the set of distributions that can be encoded by a consistent HRDN is equal to the set of positive distributions that can be encoded with an HMLN with the same adjacencies provided they use the same aggregate functions. If an HRDN induces a hybrid DN that does not contain cycles, then its semantics corresponds to those of a hybrid Bayesian network. Our work primarily considers inconsistent HRDNs. In this case, if there is a stationary distribution of an ordered pseudo-Gibbs sampler applied to an HRDN model, we refer to this distribution as the one represented by the model.

The pseudo-loglikelihood of an HRDN \(M\) for an interpretation \(I\) is computed as follows:

$$\begin{aligned} PLL(M;I) = \sum _{\mathtt {P} \in \mathcal {P}_{D}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))) + \sum _{\mathtt {P} \in \mathcal {P}_{C}}\sum _{g\in gr(\mathtt {P} )^{I}}\log ~p(I(\mathtt {g} )~|~I(Parents(\mathtt {g} ))) \end{aligned}$$
(7)

where the first summation goes over the predicates with a discrete range (for which \(p\) is a probability mass function), and the second goes over the predicates with a continuous range (for which \(p\) is a density function).

Example 5

To illustrate an HRDN, we could extend Example 3 with the \(\mathtt {numHours} /1\) predicate and add the following hybrid dependency statement:

$$\begin{aligned} {\mathtt{numHours(C)}}~|~{\mathcal {F}}_{{\{\mathtt{{C} }\}} : {\emptyset },{ \mathtt{{difficulty(C)} }},{\mathtt{value}}} \end{aligned}$$

which states that the number of hours spent studying for a class depends on its difficulty. Figure 2 shows the ground hybrid DN for Example 5. Squares denote randvars with a discrete range and ovals denote randvars with a continuous range.

Fig. 2

The ground HRDN specified in Example 5. Squares represent randvars with a discrete range, and ovals represent randvars with a continuous range. The dashed arrows specify the relevancy condition on grade/2

3.2 Local distributions

Each dependency statement \(\mathtt {G} ~|~\)Parents(\(\mathtt {G} \)) has an associated CPD. The type of model used for a CPD depends on both the range of the target atom \(\mathtt {G} \) and whether Parents(\(\mathtt {G} \)) contains discrete or numeric features.

In this work, we use a parametric approach to density estimation and focus only on variants of Gaussian distributions to model continuous variables. Specifically, we use the following models:

  • Multinomial If \(\mathtt {G} \) has a discrete range and its parent set is empty, the CPD is modeled by a multinomial distribution.

  • Gaussian If \(\mathtt {G} \) has a continuous range and its parent set is empty, the CPD is modeled by a Gaussian distribution.

  • Logistic Regression (LR) This CPD is used when the target atom has a discrete range as it facilitates incorporating both discrete and continuous parents (Bishop 1995). Given \(range(\mathtt {G} )=\{y_{1},y_{2},\ldots ,y_{m}\}\), the conditional distribution for the first \((m-1)\) values for a specific grounding \(\mathtt {G} \theta \) is:

    $$\begin{aligned} p(\mathtt {G} \theta =y_{k}~|~Parents(\mathtt {G} \theta ))=\frac{exp\left( w_{k,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )}w_{k,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) }{1+\sum _{j=1}^{m-1} exp\left( w_{j,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )} w_{j,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) } \end{aligned}$$

    The distribution for the \(mth\) value is:

    $$\begin{aligned} p(\mathtt {G} \theta =y_{m}~|~Parents(\mathtt {G} \theta ))=\frac{1}{1+\sum _{j=1}^{m-1} exp\left( w_{j,0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )} w_{j,\mathcal {F}} \cdot \mathcal {F}(\theta )\right) } \end{aligned}$$
    (8)

    In both equations, \(\mathcal {F}\) is a relational feature, \(w_{j,\mathcal {F}}\) are the weights associated with \(\mathcal {F}\) for value \(y_j\), and \(w_{j,0}\) is \(y_j\)’s bias term.

  • Linear Gaussian (LG) A linear Gaussian CPD is used when \(\mathtt {G} \)’s range is continuous and all the features in the parent set are numeric (Lauritzen 1992; Koller et al. 1999). An LG is a Gaussian distribution that models \(\mu \) as a linear combination of the values of the features in the parent set, but assumes a fixed variance \(\sigma ^2\). The distribution is given as:

    $$\begin{aligned} p(\mathtt {G} \theta ~|~Parents(\mathtt {G} \theta ))=N\left( w_{0}+\sum _{\mathcal {F} \in Parents(\mathtt {G} )}w_{\mathcal {F}}\cdot \mathcal {F}(\theta ),\sigma _\mathtt {G} ^2\right) \end{aligned}$$
    (9)

    where \(\mathcal {F}\) is a numeric feature and \(w_{\mathcal {F}}\) is the weight associated with \(\mathcal {F}\).

  • Conditional Linear Gaussian (CLG) A conditional linear Gaussian (CLG) is used if \(\mathtt {G} \)’s range is continuous and its parent set contains a mix of discrete and numeric features. There is a separate linear Gaussian model for every instantiation of the discrete parents. More formally, consider partitioning the parent set of a predicate into the discrete features, \(\mathcal {F}_{discrete}\), and the numeric features, \(\mathcal {F}_{continuous}\), and let \(\mathcal {D}\) be the Cartesian product of the ranges of all features in \(\mathcal {F}_{discrete}\). Then, the CPD consists of one LG model for each \(d\in \mathcal {D}\):

    $$\begin{aligned} p(\mathtt {G} \theta ~|~\mathcal {F}_{continuous},d)= N\left( w_{0,d}+\sum _{\mathcal {F} \in \mathcal {F}_{continuous}}{w_{\mathcal {F},d}\cdot \mathcal {F}(\theta )},\sigma _d^2\right) \end{aligned}$$
    (10)

    Note that because there is a separate LG for each \(d\), each one has an associated variance \(\sigma _d^2\). A conditional Gaussian is a special case of a CLG where the parent set only contains discrete features. Here, a separate Gaussian (mean and variance) is learned for each possible configuration of the parents.

As in the discrete case, it is possible that a feature does not have any groundings. If this occurs and the aggregation function of the feature is not defined on the empty set, then we again return the value undefined.
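To make the Gaussian CPDs concrete, here is a minimal sketch of evaluating Eqs. (9) and (10). The function names and the tabular representation of the per-configuration LG parameters are illustrative assumptions, not the authors' implementation.

```python
import math

def linear_gaussian_density(x, feature_vals, weights, bias, sigma):
    """Linear Gaussian CPD of Eq. (9): N(x; w0 + sum_F w_F * F(theta), sigma^2)."""
    mu = bias + sum(w * f for w, f in zip(weights, feature_vals))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def clg_density(x, cont_feature_vals, disc_config, lg_table):
    """Conditional linear Gaussian of Eq. (10): one LG per configuration of
    the discrete parents; lg_table maps a configuration to (weights, bias, sigma)."""
    weights, bias, sigma = lg_table[disc_config]
    return linear_gaussian_density(x, cont_feature_vals, weights, bias, sigma)

# One LG per value of a single discrete parent (parameters are illustrative).
lg_table = {("hard",): ([30.0], 60.0, 10.0), ("easy",): ([10.0], 30.0, 5.0)}
density = clg_density(90.0, [1.0], ("hard",), lg_table)  # mu = 60 + 30*1 = 90
```

Note how each discrete configuration carries its own variance, matching the text.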

3.3 Inference

Similar to RDNs, inference in HRDNs can be performed by using an ordered pseudo-Gibbs sampler. The difference lies in the fact that HRDNs contain both conditional density functions and probability distributions. Given an HRDN, a set of constants for each type, and possibly a set of relevance conditions, inference is performed as follows.

First, the model is grounded to create the corresponding propositional hybrid dependency network. Second, each randvar gets its own copy of the CPD associated with its predicate. Third, an ordering over the atoms is determined based on the relevance conditions, if specified. This ordering must ensure that, before sampling an atom \(\mathtt {A} \) with a relevance condition \(\mathtt {A} =not\_relevant \Leftrightarrow l\), we first sample the values of the atoms occurring in \(l\). For example, consider the relevance condition (4). In each Gibbs sweep, before we sample values for grade/2 we make sure that the values for takes/2 are sampled.

Finally, in each Gibbs sweep we visit each ground atom in order and resample its value according to its probability distribution or density function. A randvar is assigned a value from its range or obtains the value \(not\_relevant\) if there exists a relevance condition that is satisfied in the sweep. Each sweep results in an interpretation \(I\) and a sample corresponds to only the relevant randvars in \(I\).
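A single sweep of this hybrid sampler can be sketched as follows; the atom names, the sampler and relevance callables, and the dictionary-based state are all illustrative assumptions.

```python
import random

def hybrid_sweep(order, state, samplers, relevance, rng):
    """One sweep of an ordered pseudo-Gibbs sampler for a hybrid DN (a sketch).

    samplers[var](state, rng) draws a value from var's CPD, which may be a
    probability mass function (discrete range) or a density (continuous range).
    relevance[var](state) returns False when var's relevancy condition makes
    it not_relevant; the visit order must sample e.g. takes/2 before grade/2.
    """
    for var in order:
        if var in relevance and not relevance[var](state):
            state[var] = "not_relevant"
        else:
            state[var] = samplers[var](state, rng)
    return state

# Illustrative two-atom example: grade is only relevant if takes is true.
samplers = {
    "takes(bob,math)": lambda s, r: r.random() < 0.0,      # always False here
    "grade(bob,math)": lambda s, r: r.choice(["low", "high"]),
}
relevance = {"grade(bob,math)": lambda s: s["takes(bob,math)"]}
state = hybrid_sweep(["takes(bob,math)", "grade(bob,math)"],
                     {}, samplers, relevance, random.Random(0))
# state["grade(bob,math)"] == "not_relevant"
```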

4 Structure learning

In this section we present our algorithm for learning the structure of an HRDN, which requires learning a dependency statement and CPD for each predicate in the domain. Because we use a decomposable score function to evaluate candidate structures, the problem can be tackled by independently learning a locally optimal CPD for each predicate; hence we refer to our approach as the Learner of Local Models (LLM). When learning the CPD for each predicate, we define a space of candidate features and then greedily select those that improve the score.

Next, we will describe in more detail the key elements of our algorithm, which are (1) its high-level control structure, (2) how to learn a CPD for a single predicate, and (3) how to score the candidate CPDs.

4.1 High-level control structure

Algorithm 1 outlines LLM, which receives as input a set of predicates \(\mathcal {P}\), a set of training interpretations \(D\), and a set of validation interpretations \(V\). LLM assumes fully-observed data. At a high level, the algorithm is quite simple. For each predicate \(\mathtt {P} \in \mathcal {P}\), it invokes the LearnOneModel function to learn a local distribution that models \(\mathtt {P} \) using \(\mathcal {P}\). By using a decomposable score function, such as pseudo-loglikelihood, the global score can be optimized by independently finding the best local distribution for each predicate.Footnote 2 The final model \(M\) is obtained by conjoining all learned local distributions.

Note that this algorithm has the same high-level control structure as existing approaches for learning RDNs. There are two important differences with existing approaches. The first is that the data may contain continuous variables. The second is that, in order to accommodate dependencies on continuous variables, the local distributions are represented via a logistic regression or a (conditional) linear Gaussian as opposed to a relational probability tree.

Next, we describe in detail how to learn and evaluate local distributions.

Algorithm 1
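As a sketch, the high-level control structure of Algorithm 1 reduces to a loop over predicates; the function names here are illustrative, not the actual implementation:

```python
def llm(predicates, train, validation, learn_one_model):
    """Sketch of LLM's high-level loop (Algorithm 1).

    learn_one_model(P, predicates, train, validation) returns the local
    distribution for predicate P (the greedy search of Algorithm 2).
    Because the score is decomposable, each call is independent; the
    final model conjoins the learned local distributions.
    """
    return {p: learn_one_model(p, predicates, train, validation)
            for p in predicates}
```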

4.2 Learning local distributions

Each learned CPD, regardless of its form, in an HRDN is parameterized by a set of features. Learning the structure of the CPD requires determining which features should appear in the parent set. This can be posed as the problem of searching through the space of candidate features. We adopt a greedy approach that selects one feature at a time to add to the parent set until no inclusion improves the score. Thus, in each iteration, the central procedure is finding the single best feature and adding it to the parent set.

We construct candidate features in the following way. First, let \(H=\mathtt {P} (\mathtt {V} _1,\ldots ,\mathtt {V} _n)\), where each \(\mathtt {V} _i\) is a unique logvar, and let \(L= \{\mathtt{{V} }_1,\ldots ,\mathtt{{V} }_n\}\). Next, we construct all atoms \(A\) such that \(A\) is different from \(H\). Then, given a user-defined parameter \(N\), for each \(A\) all conjunctions of \(k\le N\) randvar-value tests \(C=\{(G_1=v_1),\ldots ,(G_k=v_k)\}\) are exhaustively enumerated such that (1) all atoms \(G_i\) have a discrete range, (2) no atom \(G_i\) is identical to \(H\) or \(A\), and (3) the set \(Q=\{H, G_1,\ldots , G_k, A\}\) is connected.Footnote 3 These restrictions ensure that the set of candidate features is finite. For each constructed \(C\) and \(A\), one candidate feature \({\mathcal {F}}_{{L} : {C},{A},{\alpha }}\) is generated for each aggregation function \(\alpha \) applicable to \(range(A)\). We consider the following aggregation functions:

  • If no aggregation is needed, we use \(value\),

  • If \(range(A)\) is discrete and not \(\{true, false\}\), we use \(mode\),

  • If \(range(A)\) is discrete and \(\{true, false\}\), we use \(proportion\) and \(exist\),

  • If \(range(A)\) is continuous, we use \(average\), \(maximum\), and \(minimum\).

The aggregation function \(proportion\) computes the proportion of a feature’s possible groundings that are true. The other functions take on their traditional meanings.
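A minimal sketch of these aggregation functions in Python, assuming the values of a feature's groundings have already been collected into a list:

```python
def aggregate(alpha, values):
    """Apply aggregation function `alpha` to a feature's grounding values.

    Which functions are applicable depends on range(A), as listed above.
    An empty grounding set yields "undefined" when the function is not
    defined on it.
    """
    if not values:
        return "undefined"            # aggregation on an empty set
    if alpha == "value":              # no aggregation needed: one grounding
        (v,) = values
        return v
    if alpha == "mode":               # discrete, non-Boolean range
        return max(set(values), key=values.count)
    if alpha == "proportion":         # Boolean range: fraction that is true
        return sum(1 for v in values if v) / len(values)
    if alpha == "exist":              # Boolean range: any grounding true
        return any(values)
    if alpha == "average":            # continuous range
        return sum(values) / len(values)
    if alpha == "maximum":
        return max(values)
    if alpha == "minimum":
        return min(values)
    raise ValueError(alpha)
```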

Algorithm 2 outlines our procedure for learning the dependency for a predicate \(\mathtt {P} \). As input, it receives the target predicate \(\mathtt {P} \), the full set of predicates \(\mathcal {P}\) for the domain, a training set \(D\), and a validation set \(V\). The algorithm starts by constructing the set of candidate features for \(\mathtt {P} \). It then repeatedly iterates through the set of candidate features and evaluates the utility of adding each feature to the parent set. Each feature addition is followed by learning the CPD on the training data \(D\) and then scoring it on the validation data \(V\). In each iteration, the single best feature is added to the parent set. If no feature improves the score, the procedure terminates. Note that the form of the CPD depends on both \(\mathtt {P} \) and the features in the parent set. If \(\mathtt {P} \)’s range is discrete, then the CPD is represented via logistic regression. If \(\mathtt {P} \)’s range is continuous, we use a linear Gaussian if the parent set only contains numeric features and a conditional linear Gaussian when the parent set contains both numeric and discrete features.

The following two subsections explain how we estimate the parameters of the CPDs using the training data and how we evaluate the local models.

Algorithm 2
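The greedy search of Algorithm 2 can be sketched as follows, with `fit` and `score` standing in for CPD training on \(D\) and evaluation on \(V\) (both hypothetical helpers, not the actual implementation):

```python
def greedy_parent_search(candidates, fit, score):
    """Greedy feature selection for one predicate (sketch of Algorithm 2).

    candidates -- the finite set of candidate features for the target
    fit(parents) -- trains the CPD with this parent set on the training data
    score(cpd)   -- evaluates a fitted CPD on the validation data
    """
    parents = []
    best = score(fit(parents))
    while True:
        best_gain, pick = best, None
        for f in candidates:
            if f in parents:
                continue
            s = score(fit(parents + [f]))
            if s > best_gain:        # strictly better than the current best
                best_gain, pick = s, f
        if pick is None:             # no feature improves the score: stop
            return fit(parents)
        parents.append(pick)
        best = best_gain
```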

4.3 Estimating the parameters for candidate CPDs

Next, we briefly describe how to estimate the parameters for the CPDs for the different types of dependency statements that may appear in a learned HRDN.

  • Multinomial The maximum likelihood parameters of the multinomial are learned from the data.

  • Gaussian The maximum likelihood estimates of the Gaussian’s mean and the variance are learned from the data.

  • Logistic regression Parameter estimation requires learning the weight vectors for the logistic regression model. We follow the standard approach and take the (partial) derivative of the conditional loglikelihood of the data and perform gradient ascent to estimate the weights (Mitchell 1997).

  • Linear Gaussian Parameter learning requires estimating the weight vector for the linear regression model. This can be done via standard techniques for training a linear regressor and we use ridge regression (Bishop 1995). We estimate the variance by computing the expected value of the squared difference between the actual value and the model’s predicted value.

  • Conditional linear Gaussian In CLGs, each configuration of the discrete parents has an associated LG model. The parameters for each LG model are learned as described above.
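As an illustration, a single-parent linear Gaussian (and a CLG built from it) can be estimated as follows; for simplicity this sketch uses ordinary least squares rather than the ridge regression used in our implementation:

```python
def fit_linear_gaussian(xs, ys):
    """Fit y ~ N(w0 + w1*x, sigma^2) for a single numeric parent.

    Weights via ordinary least squares (a simplification of ridge
    regression); the variance is the mean squared residual, i.e. the
    expected squared difference between actual and predicted values.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = my - w1 * mx
    var = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / n
    return w0, w1, var

def fit_clg(rows):
    """Fit a CLG: one LG per configuration of the discrete parents.

    rows -- (discrete_config, x, y) triples from the training data.
    """
    by_config = {}
    for cfg, x, y in rows:
        by_config.setdefault(cfg, ([], []))
        by_config[cfg][0].append(x)
        by_config[cfg][1].append(y)
    return {cfg: fit_linear_gaussian(xs, ys)
            for cfg, (xs, ys) in by_config.items()}
```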

4.4 Evaluating candidate models

Traditionally, a candidate model is evaluated using a score function that trades off the model’s fit to the data versus some penalty term based on the model’s complexity to avoid overfitting. For a candidate model \(M\), we use the following score function, which is based on the Minimum Description Length (MDL) (Schwarz 1978):

$$\begin{aligned} MDL(M,D)= PLL(M,D) - Penalty(M,D) \end{aligned}$$
(11)

where \(PLL(M,D)\) is computed using Eq. (6) and \(Penalty(M, D)\) is the following penalty term:

$$\begin{aligned} Penalty(M, D)=\frac{1}{2}\sum _{I \in D} \sum _\mathtt{{P} \in \mathcal {P}} log_{2} (|gr(\mathtt {P} )|^{I}) \cdot B_\mathtt {P} \cdot K \end{aligned}$$

where \(|gr(\mathtt {P} )|^{I}\) is the number of relevant randvars of predicate \(\mathtt {P} \) in interpretation \(I\), \(B_\mathtt {P} \) is the number of free parameters in \(\mathtt {P} \)’s CPD and \(K\) is the size of \(\mathtt {P} \)’s CPD.Footnote 4 Next, we will explain in more detail how \(B_\mathtt {P} \) and \(K\) are calculated.

When the CPD for \(\mathtt {P} \) is represented by a logistic regression model (see Eq. 8), the number of free parameters is:

$$\begin{aligned} B_\mathtt{{P} } = (|range(\mathtt {P} )|-1)\cdot (1+|Parents(\mathtt {P} )|) \end{aligned}$$

where \((1+|Parents(\mathtt {P} )|)\) is the number of weights that must be learned to parameterize the model (i.e., one for each feature plus the intercept). For continuous CPDs, this is slightly more involved to compute. For an LG, the number of free parameters is:

$$\begin{aligned} B_\mathtt{{P} } = 1 + (1 + |Parents(\mathtt {P} )|) \end{aligned}$$

where the first \(1\) is for the variance \(\sigma ^2\) and \((1+|Parents(\mathtt {P} )|)\) is the number of weights that must be learned to parameterize the model (i.e., one for each feature in the parent set plus the intercept). Recall that in a CLG, one LG model is learned for each possible instantiation of the discrete parents. Thus the number of free parameters for a CLG is:

$$\begin{aligned} B_\mathtt{{P} }=d \cdot (1 + (1 + |Parents_{C}(\mathtt {P} )|)) \end{aligned}$$

where \(d\) is the number of elements in the Cartesian product of the ranges of the discrete parents, \(Parents_{C}(\mathtt {P} )\) denotes only numeric features in the parent set of \(\mathtt {P} \) and \((1 + (1 + |Parents_{C}(\mathtt {P} )|))\) is the number of parameters needed to model each LG.

The size \(K\) of \(\mathtt {P} \)’s CPD is the sum of the feature lengths in the parent set:

$$\begin{aligned} K= \sum _{\mathcal {F} \in Parents(\mathtt {P} )} |{\mathcal {F}}_{{L} : {C},{A},{\alpha }}| \end{aligned}$$
(12)

where \(|{\mathcal {F}}_{{L} : {C},{A},{\alpha }}|=|C| +1\) is the length of a feature.
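The parameter counts and the penalty term above can be sketched as follows (the function names are illustrative):

```python
from math import log2

def free_parameters(range_size, n_parents, continuous=False, d=1):
    """Number of free parameters B_P of a CPD (sketch).

    continuous=False: logistic regression; (|range(P)|-1) weight vectors,
                      each with one weight per feature plus an intercept.
    continuous=True : (conditional) linear Gaussian; `d` is the number of
                      discrete-parent configurations (d=1 gives a plain LG)
                      and n_parents counts only the numeric features.
    """
    if continuous:
        return d * (1 + (1 + n_parents))       # variance + weights, per LG
    return (range_size - 1) * (1 + n_parents)  # weights per non-reference class

def penalty(grounding_counts, b, k):
    """Penalty contribution of one predicate P.

    grounding_counts -- |gr(P)|^I for each training interpretation I
    b, k             -- B_P and the CPD size K
    """
    return 0.5 * sum(log2(g) for g in grounding_counts) * b * k
```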

5 Experiments

This section empirically evaluates our HRDN structure learning algorithm LLM. Specifically, we want to answer the following questions:

  1. How does varying the amount of training data affect the quality of the learned model and the run time of the learning algorithm?

  2. Do we learn more accurate models by learning a hybrid model (i.e., explicitly modeling continuous variables) or by discretizing all continuous variables prior to learning?

  3. How does our approach compare to MLN (Richardson and Domingos 2006) structure learning?

All our code, data and models are publicly available.Footnote 5 We first describe the data sets we will use and then explain the experimental setup. Finally, we present and discuss the results.

5.1 Data sets

We use one synthetic and one real-world data set to answer these questions.

Synthetic university data We used a modified version of the well-known university model (Getoor et al. 2001) to generate synthetic data. We made the following alterations. First, we switched the range of intelligence/1 from discrete to continuous. Second, we added two predicates with continuous ranges: numHours/1, which is the estimated number of hours a student needs to study for a course, and ability/1, which is the ability of a professor. Finally, we added a Boolean predicate friend/2, which denotes whether two students are friends. Appendix 1 contains a complete description of the model.

We generate synthetic data in two ways. First, we fix the domain size of each type within an interpretation and vary the number of training interpretations. We learn models by using one, two, four, eight and \(16\) interpretations. We use one validation and one test interpretation. Second, we fix the number of training and validation interpretations to one and vary the domain size of each object. The learned models in this setup are evaluated on a test interpretation consisting of 800 students, 125 courses and 125 professors. Tables 1 and 2 show the characteristics of the domains for the first and second synthetic setup, respectively.

Table 1 Data set characteristics for the synthetic data when varying the number of interpretations used for learning
Table 2 Data set characteristics for the synthetic data when varying the domain size of each object type in the training interpretation

For each experimental condition, we repeat the following process ten times. We generate the appropriate number of interpretations, where each interpretation is constructed by performing \(2000\) iterations of the ordered pseudo-Gibbs sampling (see Sect. 3.3) using the handcrafted model and the specified number of constants.

For each generated data set, we also create a corresponding discretized version by binning each continuous randvar into a number of equal-size intervals. We used \(2\), \(4\), \(6\) and \(8\) bins.

Real-world PKDD’99 financial data set Our real-world domain is the financial data set from the PKDD’99 Discovery Challenge (Berka 1999). It consists of services one bank offers its clients such as loans, accounts, and credit cards among others. In the original data, the \(transaction\) table contains more than one million transactions. Therefore, we introduced several predicates (e.g., average of monthly withdrawals for an account) to summarize the information contained in this table. This results in \(16\) predicatesFootnote 6 about four types of objects: \(clients\), \(accounts\), \(loans\) and \(districts\). Ten predicates have a continuous range and six have a discrete range.

We consider \(account\) to be the central object type in the PKDD’99 financial data set. The original data set consists of \(4500\) accounts, but we omit ten accounts that have missing data. We then split the data associated with these accounts into ten folds. To avoid leakage of information, all information about clients, loans and districts related to one account appears in the same fold. We used six folds for training, three folds for validation and one fold for testing. Table 3 reports the characteristics of this data set.

Table 3 Characteristics of the PKDD’99 financial data set

Again, we create a discretized version of the data by binning each continuous randvar into a number of equal-size intervals, using \(2\), \(4\), \(6\) and \(8\) bins.

5.2 Methodology

We compare the following four learners on all experiments:

  • LLM-H This corresponds to learning a model using our LLM algorithm on the data containing both continuous and discrete variables.

  • LLM-D This corresponds to learning a model using our LLM algorithm on the discretized data. Thus each learned local distribution is modeled using a logistic regression CPD.

  • LSM This corresponds to learning a model using the publicly available implementation of LSM (Kok and Domingos 2010) on the discretized data. LSM is the state-of-the-art Markov logic network structure learning algorithm.

  • Independent This learner constructs a model on the hybrid data such that all randvars are independent. That is, it models the joint distribution as a product of marginal distributions.

On the experiments involving the PKDD’99 financial data set, we include an additional baseline: a handcrafted model. We built a local model to predict each predicate by a set of handcrafted non-relational features. These features are used to predict a property of an object by means of some other properties of that object. The features can be found in Appendix 4. For predicates with a discrete range, we used logistic regression. For predicates with a continuous range, we used both linear regression and MP5 (a regression tree) as implemented in Weka (Hall et al. 2009).

Experimental details LLM is implemented as a combination of Java and Prolog. Java is used for performing the learning and Prolog is used to compute the value of a feature. When generating features, we set the length of the features to be at most \(N=3\). Usually, in relational domains, only a small fraction of the Boolean atoms is true (e.g., the number of people who are friends is quite sparse compared to the number of possible friendships). Therefore, for efficiency reasons, we subsample the false Boolean atoms during learning (Natarajan et al. 2012) to achieve a 1:1 ratio of true to false groundings in all experiments.
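A sketch of this subsampling step, assuming the Boolean groundings are available as (atom, truth value) pairs (the function name and representation are illustrative):

```python
import random

def subsample_false_atoms(groundings, seed=0):
    """Subsample false Boolean groundings to a 1:1 true:false ratio.

    Keeps all true groundings and draws an equally sized random subset
    of the false ones, for efficiency during learning.
    """
    rng = random.Random(seed)
    true_atoms = [a for a, v in groundings if v]
    false_atoms = [a for a, v in groundings if not v]
    k = min(len(false_atoms), len(true_atoms))
    return true_atoms + rng.sample(false_atoms, k)
```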

For LSM, we contacted the authors to determine which parameters were the most important to tune. We then tried several parameter combinations and used the validation data to select appropriate ones for each data set.

Evaluation metrics We evaluate the quality of the learned models using several metrics. First, to measure the quality of the probability estimates, we report the weighted pseudo-loglikelihood (WPLL) (Kok and Domingos 2005). This corresponds to calculating the PLL of an interpretation as the sum of PLLs for each predicate divided by the number of groundings of that predicate in the interpretation.

Second, to measure the predictive performance, we report the area under the ROC curve (AUC-ROC) for discrete predicates and the normalized root-mean-square error (NRMSE) for continuous predicates. Because we have multi-class categorical variables in our domains, we calculate the multi-class AUC-ROC (Domingos and Provost 2000), which we denote as \(AUC_{total}\). The NRMSE for a predicate ranges from zero to one and is calculated by dividing RMSE by the predicate’s range.
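For concreteness, the NRMSE computation can be written as:

```python
def nrmse(actual, predicted, value_range):
    """Normalized root-mean-square error: RMSE divided by the
    predicate's range, so the result lies between zero and one."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return mse ** 0.5 / value_range
```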

Additionally, since we know the model structure for the synthetic data, we compare how closely the learned model reflects the handcrafted structure using the following edit distance. For each predicate, we compare the true parent set to the learned parent set. For each feature in the true parent set, we find its closest feature in the learned parent set according to the following distance metric. The distance \(\varDelta \) between two features, \({\mathcal {F}}_{{L_1} : {C_1},{A_1},{\alpha _1}}\) and \({\mathcal {F}}_{{L_2} : {C_2},{A_2},{\alpha _2}}\), is calculated as:

$$\begin{aligned} \varDelta (\mathcal {F}_1,\mathcal {F}_2)=|C_1\backslash C_2|+|C_2\backslash C_1|+ \delta _{A_1, A_2} + \delta _{\alpha _1, \alpha _2} \end{aligned}$$

where \(\delta _{A_1,A_2}\) equals zero if the two atoms \(A_1\) and \(A_2\) originate from the same predicate and their logvars are equivalent, otherwise it equals one. Similarly, \(\delta _{\alpha _1,\alpha _2}\) equals zero if \(\alpha _1\) and \(\alpha _2\) represent the same aggregation function, otherwise it equals one. When the best match is found, both the true and the learned feature are excluded from further comparisons, and the edit distance is incremented by the distance between them. Furthermore, the final distance is incremented by the length of each feature that must be added or removed from the learned dependency parent set.
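A sketch of the feature distance \(\varDelta \), representing a feature as a (C, A, alpha) triple and assuming that predicate and logvar equivalence of atoms has already been normalized into the atom representation:

```python
def feature_distance(f1, f2):
    """Distance between two features F_{L:C,A,alpha} (sketch).

    C is a set of randvar-value tests, A the aggregated atom, and
    alpha the aggregation function; the distance is the symmetric
    difference of the test sets plus indicator terms for A and alpha.
    """
    c1, a1, alpha1 = f1
    c2, a2, alpha2 = f2
    return (len(c1 - c2) + len(c2 - c1)
            + (0 if a1 == a2 else 1)
            + (0 if alpha1 == alpha2 else 1))
```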

We use a one-tailed paired t test to assess the significance of the results, obtained through ten independent runs for the synthetic experimental setup and ten folds for the real-world data set. The null hypothesis states that there is no difference between two approaches, and we reject it when \(p < 0.01\). For all metrics, we report the metric itself along with its standard deviation.

5.3 Results and discussion

We now present experimental results for the synthetic and real-world data sets.

Results on synthetic data Table 4 shows how the WPLL of each approach varies as a function of the number of training interpretations. Learning from the hybrid data results in a significantly more accurate learned model than learning from the discretized data in all cases except for one in which we have one training interpretation and six discretizing bins. When using the same number of bins for discretization, LLM-D learns more accurate models than LSM on all settings. Note that LSM ran out of memory on all runs when training on eight and 16 interpretations. Finally, all learning approaches always outperform the no-learning baseline.

Table 4 The WPLL on the synthetic data as a function of the number of training interpretations

Table 5 presents the run times for all algorithms as a function of increasing the number of training interpretations. LSM is the fastest learner, but it produces lower-quality models. For all approaches, the run time scales linearly with the number of interpretations. Learning an HRDN is always faster than learning an RDN. When discretizing the data, the run time is influenced by the number of bins used: the more bins there are, the slower the discrete learner is. This occurs because adding more bins increases the size of the search space.

Table 5 The run times in minutes on the synthetic data as a function of the number of training interpretations

Finally, Fig. 3 shows how the edit distance varies as a function of the number of training interpretations. As expected, the edit distance decreases as more training data are used.

Fig. 3 The effect of the number of training interpretations on the average edit distance between the handcrafted HRDN model and the hybrid model learned with LLM-H

Table 6 shows the WPLLs of all learners as a function of increasing the domain size for each object. To encapsulate the effect of domain size changes in a single number, we use the number of randvars in an interpretation. Again, we see that all the learners outperform the independent model. LLM-H always learns significantly more accurate models than LSM. LLM-H learns a significantly more accurate model than LLM-D except when discretizing the data into 6 or 8 bins on the data sets with 200, 400 and 800 students.

Table 6 The WPLL on the synthetic data as a function of the domain size

Table 7 shows the run time of all approaches as a function of increasing domain size. Similar to the previous setup, LSM exhibits better run times than either LLM-H or LLM-D, but it produces lower-quality models. As expected, the run times of both LLM-H and LLM-D vary quadratically with the domain size. LSM’s run time seems to vary linearly, which probably occurs due to its random-walk style search for patterns, which does not necessarily examine all the variables in the training database. When learning (H)RDNs, LLM-H is faster than LLM-D. Again, in general, increasing the number of bins increases the training time.

Table 7 The run times in minutes on the synthetic data as a function of the domain size for all the learners

Figure 4 shows that the edit distance between LLM-H’s learned model and the handcrafted model decreases as the number of randvars in the training interpretation increases. More (observed) random variables equates to more training data, and, as expected, more data allows us to learn more accurate models.

Fig. 4 The effect of increasing the domain size of each object type on the average edit distance between the handcrafted HRDN model and the hybrid model learned with LLM-H. We summarize the effect of changing the domain sizes by showing the number of randvars in the training interpretation

In both synthetic setups, we noticed that in the learned model difficulty(C) depends on nrhours(C). This dependency is not encoded explicitly in the handcrafted model. However, nrhours(C) does depend on difficulty(C) in the original model. In both cases, this contributes to the edit distance.

More detailed results for both synthetic setups can be found in Appendix 3.

Results on the PKDD’99 financial data set Figure 5 shows the WPLL for all approaches on the PKDD’99 financial data set as a function of the number of bins used for discretization. For the handcrafted models, we denote the combination of logistic regression and linear regression as LR+LinR, and the combination of logistic regression and MP5 regression trees with LR+MP5. In the figure, the lines for LLM-H, LR+LinR, LR+MP5 and the independent model are straight because these approaches operate directly on the hybrid data and hence do not perform discretization. We see a clear ranking between the approaches: LLM-H \(>\) LR+LinR \(>\) LR+MP5 \(>\) LLM-D \(>\) LSM \(>\) independent.

Fig. 5 The WPLL for each approach on the PKDD’99 financial data set as a function of the number of bins used for discretization. Note that the results for LLM-H, LR+LinR, LR+MP5 and the independent model do not depend on the number of bins used for discretization

Table 8 shows the (multi-class) AUCs and NRMSE for LLM-H and the handcrafted models. All three approaches tend to have similar results on most predicates. Note that the handcrafted features used to propositionalize the data are all features that LLM-H is able to learn automatically.

Table 8 The performance of the two variants of the handcrafted models, LR+LinR and LR+MP5, compared to LLM-H on the hybrid data for the PKDD’99 financial data set

Table 9 reports the \(AUC_{total}\) for LLM-H, LLM-D and LSM. Out of the six discrete predicates, LLM-H has a higher \(AUC_{total}\) on one predicate, the same on two and worse on three compared to LLM-D. Compared to LSM, it wins on three predicates, loses on two and draws on one.

Table 9 \(AUC_{total}\) results for LLM-H, LLM-D and LSM on the six discrete predicates in the PKDD’99 financial data set

Figure 6 shows the run times for this data set as a function of the number of bins used for discretization. LLM-H exhibits better run times than both LLM-D and LSM. LSM is faster than LLM-D except when discretizing the data into two bins.

Fig. 6 The run time of each approach on the PKDD’99 financial data set as a function of the number of bins used for discretization. The y-axis (run time) is on a log scale. Note that LLM-H’s results do not depend on the number of bins used for discretization

When we inspected the models learned on the PKDD’99 financial data set, we found a considerable number of bi-directional dependencies. This means that our algorithm succeeded in learning a model that is mostly structurally consistent. For example, it learned that the monthly payment amount for a loan depends on the loan amount, and vice versa. The same holds for the average salary and the ratio of urban inhabitants in a district, the average amount withdrawn from an account and the average amount credited to an account, the average amount withdrawn from an account and the average number of withdrawals for an account, among others.

More detailed results for the PKDD’99 financial data set can be found in Appendix 3.

Discussion Now we can revisit and answer the three experimental questions posed at the beginning of this section. To address the first question, we used the synthetic data to explore the scaling behavior of our algorithm. We found that as the amount of training data increases both the accuracy of the learned models and their faithfulness to the ground truth model slightly improve.

The second question revolves around whether it is better to learn from hybrid data or discretized data. On all experiments, we have seen that learning from the hybrid data directly consistently results in significantly more accurate learned models (according to WPLL) than discretizing the data prior to learning. Finally, we wanted to compare our proposed learning algorithm to the state-of-the-art MLN learner. The results show that on both hybrid and discrete data LLM learns more accurate models than LSM.

6 Related work

On the propositional level, researchers have considered extending formalisms such as Bayesian networks and dependency networks to model both discrete and continuous distributions. In terms of hybrid Bayesian networks, most of the work has focused on inference (Koller et al. 1999; Yuan and Druzdzel 2007; Murphy 1998; Moral et al. 2001; Lauritzen and Jensen 2001). There have also been some initial attempts for parameter learning (Murphy 1998) and structure learning (Romero et al. 2006). Cobb et al. (2007) provides a more detailed overview of work on hybrid Bayesian networks.

There has been some work on structure learning for hybrid dependency networks. Dobra (2009) proposed a bounded stochastic search for variable selection (structure learning) for sparse genetic dependency networks that contain both discrete and continuous variables. Meinshausen and Bühlmann (2006) use neighbourhood selection with the Lasso for structure learning as a computationally attractive alternative to standard covariance selection methods for multivariate normal distributions. Guo and Gu (2011) use dependency networks for multi-label classification, where each CPD represents a probabilistic or non-probabilistic binary classifier that can have both discrete and continuous predictors.

Our work represents a relational approach and builds on two lines of research: structure learning for RDNs and hybrid relational probabilistic models. There are two existing structure learning approaches for RDNs (Neville and Jensen 2007; Natarajan et al. 2012). Both perform structure learning by finding the best conditional distribution independently for each predicate. They differ slightly in how they represent the CPDs. Neville and Jensen (2007) learn a single relational probability tree (Neville et al. 2003) for each predicate. Natarajan et al. (2012) represent individual conditional distributions as a weighted sum of relational regression trees (Blockeel and De Raedt 1998), which are learned by a stage-wise optimization procedure. However, these approaches do not explicitly model continuous distributions and instead require them to be discretized. In contrast, our approach is able to directly encode dependencies between discrete and continuous random variables without discretization. Doing so necessitates representing the CPDs with a logistic regression or conditional (linear) Gaussian model as opposed to a relational probability tree.

There are several formalisms that can represent hybrid relational domains including Hybrid Markov Logic Networks (HMLNs) (Wang and Domingos 2008), Hybrid Problog (HProblog) (Gutmann et al. 2011), Continuous Bayesian Logic Programs (CBLPs) (Kersting and De Raedt 2001), Learning Modulo Theories (LMT) (Teso et al. 2013) and Hybrid Probabilistic Relational Models (HPRMs) (Narman et al. 2010). Additionally, formalisms such as Relational Continuous Models (RCMs) (Choi et al. 2010) and Gaussian Logic (Kuželka et al. 2011) can model domains that exclusively contain continuous variables. The latter formalism also provides support for structure learning. Most of these formalisms focus on representation and reasoning issues in hybrid relational domains. HMLNs, CBLPs and LMTs also provide support for learning the parameters of a given model from data. Next, we provide a more detailed comparison between our approach and HMLNs, HProblog and CBLPs.

Representationally, HMLNs, CBLPs and HRDNs all serve as template languages for constructing a different type of propositional graphical model. Hence, each formalism inherits the strengths and weaknesses of the underlying formalism. In contrast, HProblog is a probabilistic extension of Prolog. There are differences in how each formalism models continuous variables. HRDNs, HProblog and CBLPs explicitly state the form of the distribution (e.g., a Gaussian) and its parameters (e.g., the mean and variance). In contrast, HMLNs express numeric variables through a set of soft constraints with a Gaussian penalty for diverging values. One notable difference between HRDNs and CBLPs is that CBLPs do not permit a discrete variable to have a continuous parent, whereas this is possible in HRDNs.

In terms of reasoning, HMLNs and HRDNs use approximate inference. Currently, HProblog only supports an exact inference procedure, which involves partitioning the continuous probabilistic facts into admissible intervals. Scaling HProblog to large domains would require the development of a suitable approximate inference algorithm. Inference in CBLPs can be split into two parts: logical inference and probabilistic inference. The former computes the support network for a query (i.e., a Bayesian network containing all relevant variables for the query). The latter applies off-the-shelf Bayesian network inference methods to the resulting support network.

There are significant differences in the level of support for learning in each formalism. Out of the four formalisms, HRDNs are the only one that supports structure learning in hybrid domains. Like HRDNs, HMLNs and CBLPs have algorithms for parameter learning. Currently, HProblog does not support parameter learning.

7 Conclusions and future work

This paper addressed the problem of learning models from structured, relational data that contain both discrete and continuous variables. To the best of our knowledge, this is the first attempt to perform structure learning in a hybrid SRL setting. We introduced Hybrid Relational Dependency Networks (HRDNs), a novel extension of relational dependency networks that accommodate continuous variables and proposed an algorithm that automatically learns the structure of an HRDN from data. Empirically, we evaluated the benefit of incorporating continuous variables in a learned model on one synthetic and one real-world data set by considering two versions of each data set: one that contains both continuous and discrete variables, and one where each continuous variable is discretized prior to learning. We compared our proposed algorithm to two learners that work only on discrete data: a variant of our algorithm and LSM, the state-of-the-art MLN structure learner. We found that learning directly from the hybrid data resulted in more accurate learned models than learning from the discretized data.

One interesting direction for future work is to explore the suitability of modeling other continuous conditional distributions besides the Gaussians considered in this paper. In principle, other density functions can be used, provided that we can calculate the value of the function at a point and that we can sample a value for a variable given the assignment to its parents. However, it is unclear how easy this is in practice for complex distributions, and whether issues could arise when sampling inconsistent HRDNs containing relational conditional dependencies. We would also like to extend our learning algorithm so that it can cope with missing data and model latent variables. Additionally, we would like to explore other penalty terms in the objective function, such as an L1 penalty, which has been used for learning propositional DNs (Dobra 2009; Meinshausen and Bühlmann 2006). Finally, we would like to evaluate our approach on more real-world domains.