Learning graphical models for relational data via lattice search

Schulte, Oliver; Khosravi, Hassan

doi:10.1007/s10994-012-5289-4

Learning graphical models for relational data via lattice search

Published: 30 May 2012

Volume 88, pages 331–368, (2012)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Learning graphical models for relational data via lattice search

Download PDF

Oliver Schulte¹ &
Hassan Khosravi¹

2333 Accesses
16 Citations
Explore all metrics

Abstract

Many machine learning applications that involve relational databases incorporate first-order logic and probability. Relational extensions of graphical models include Parametrized Bayes Net (Poole in IJCAI, pp. 985–991, 2003), Probabilistic Relational Models (Getoor et al. in Introduction to statistical relational learning, pp. 129–173, 2007), and Markov Logic Networks (MLNs) (Domingos and Richardson in Introduction to statistical relational learning, 2007). Many of the current state-of-the-art algorithms for learning MLNs have focused on relatively small datasets with few descriptive attributes, where predicates are mostly binary and the main task is usually prediction of links between entities. This paper addresses what is in a sense a complementary problem: learning the structure of a graphical model that models the distribution of discrete descriptive attributes given the links between entities in a relational database. Descriptive attributes are usually nonbinary and can be very informative, but they increase the search space of possible candidate clauses. We present an efficient new algorithm for learning a Parametrized Bayes Net that performs a level-wise search through the table join lattice for relational dependencies. From the Bayes net we obtain an MLN structure via a standard moralization procedure for converting directed models to undirected models. Learning MLN structure by moralization is 200–1000 times faster and scores substantially higher in predictive accuracy than benchmark MLN algorithms on five relational databases.

Statistical Relational Learning

Knowledge Discovery from Constrained Relational Data: A Tutorial on Markov Logic Networks

Bayesian Markov Logic Networks

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Many databases store data in relational format, with different types of entities and information about links between the entities. The field of statistical-relational learning (SRL) has developed a number of new statistical models for relational databases (Getoor and Tasker 2007). Markov Logic Networks (MLNs) form one of the most prominent SRL model classes; they generalize both first-order logic and Markov network models (Domingos and Richardson 2007). MLNs have achieved impressive performance on a variety of SRL tasks. Because they are based on undirected graphical models, they avoid the difficulties with cycles that arise in directed SRL models (Neville and Jensen 2007; Domingos and Richardson 2007; Taskar et al. 2002). An open-source benchmark system for MLNs is the Alchemy package (Kok et al. 2009). Essentially, an MLN is a set of weighted first-order formulas that compactly defines a Markov network comprising ground instances of logical predicates. The formulas are the structure or qualitative component of the Markov network; they represent associations between ground facts. The weights are the parameters or quantitative component; they assign a likelihood to a given relational database by using the log-linear formalism of Markov networks. This paper addresses structure learning for MLNs in relational schemas that feature a significant number of descriptive attributes, compared to the number of relationships. Previous MLN learning algorithms do not scale well with such datasets. We introduce a new moralization approach to learning MLNs: first we learn a directed Bayes net graphical model for relational data, then we convert the directed model to an undirected MLN model using the standard moralization procedure (marry spouses, omit edge directions). The main motivation for performing inference with undirected models is that they do not suffer from the problem of cyclic dependencies in relational data (Domingos and Richardson 2007; Taskar et al. 2002; Khosravi et al. 2010). Thus our approach combines the scalability and efficiency of directed model search, with the inference power and theoretical foundations of undirected relational models.

Approach

We present a new algorithm for learning a Bayes net from relational data, the learn-and-join algorithm. While our algorithm is applicable to learning directed relational models in general, we base it on the Parametrized Bayes Net formalism of Poole (2003). The learn-and-join algorithm performs a level-wise model search through the table join lattice associated with a relational database, where the results of learning on subjoins constrain learning on larger joins. The join tables in the lattice are (i) the original tables in the database, and (ii) joins of relationship tables, with information about descriptive entity attributes added by joining entity tables. A single-table Bayes net learner, which can be chosen by the user, is applied to each join table to learn a Bayes net for the dependencies represented in the table. For joins of relationship tables, this Bayes net represents dependencies among attributes conditional on the existence of the relationships represented by the relationship tables.

Single-table Bayes net learning is a well-studied problem with fast search algorithms. Moreover, the speed of single-table Bayes net learning is significantly increased by providing the Bayes net learner with constraints regarding which edges are required and which are prohibited. Key among these constraints are the join constraints: the Bayes net model for a larger table join inherits the presence or absence of edges, and their orientation, from the Bayes nets for subjoins. We present a theoretical analysis that shows that, even though the number of potential edges increases with the number of join tables, the join constraint reduces the Bayes net model search space to keep it roughly constant size throughout the join lattice. In addition to the join constraints, the relational structure leads to several others that reduce search complexity; we discuss the motivation for these constraints in detail. One of the constraints addresses recursive dependencies (relational autocorrelations of an attribute on itself) by restricting the set of nodes that can have parents (i.e., indegree greater than 0) in a Parametrized Bayes Net. A normal form theorem shows that under mild conditions, this restriction involves no loss of expressive power.

Evaluation

We evaluated the structures obtained by moralization using two synthetic datasets and five public domain datasets. In our experiments on small datasets, the run-time of the learn-and-join algorithm is 200–1000 times faster than benchmark programs in the Alchemy framework (Kok and Domingos 2009) for learning MLN structure. On medium-size datasets, such as the MovieLens database, almost none of the Alchemy systems returns a result given our system resources, whereas the learn-and-join algorithm produces an MLN with parameters within 2 hours; most of this time (98 %) is spent optimizing the parameters for the learned structure. To evaluate the predictive performance of the learned MLN structures, we used the parameter estimation routines in the Alchemy package. Using standard prediction metrics for MLNs, we found in empirical tests that the predictive accuracy of the moralized BN structures was substantially greater than that of the MLNs found by Alchemy. Our code and datasets are available for ftp download (Learn and join algorithm code).

Limitations

The main limitation of our current algorithm is that it does not find associations between links, for instance that if a professor advises a student, then they are likely to be coauthors. In the terminology of Probabilistic Relational Models (Getoor et al. 2007), our algorithm addresses attribute uncertainty, but not existence uncertainty (concerning the existence of links). The main ideas of this paper can also be applied to link prediction.

Another limitation is that we do not propose a new weight learning method, so we use standard Markov Logic Network methods for parameter learning after the structure has been learned. While these methods find good parameter settings, they are slow and constitute the main computational bottleneck for our approach.

Paper organization

We review related work, then statistical-relational models, especially Parametrized Bayes nets and Markov Logic Networks. We define the table join lattice, and present the learn-and-join algorithm for Parametrized Bayes nets. We provide detailed discussion of the relational constraints used in the learn-and-join algorithm. For evaluation, we compare the moralization approach to standard MLN structure learning methods implemented in the Alchemy system, both in terms of processing speed and in terms of model predictive accuracy.

Contributions

The main contributions may be summarized as follows.

1.
A new structure learning algorithm for Bayes nets that model the distribution of descriptive attributes given the link structure in a relational database. The algorithm is a level-wise lattice search through the space of join tables.
2.
Discussion and justification for relational constraints that speed up structure learning.
3.
A Markov logic network structure can be obtained by moralizing the Bayes net. We provide a comparison of the moralization approach with other MLN methods.

2 Additional related work

A preliminary version of the Learn-and-Join Algorithm was presented by Khosravi et al. (2010). The previous version did not use the lattice search framework. Our new version adds constraints on the model search, makes all constraints explicit, and provides rationales and discussion of each. We have also added more comparison with other Markov Logic Network learning methods (e.g., BUSL, LSM) and a lesion study that assesses the effects of using only part of the components of our main algorithm. Our approach to autocorrelations (recursive dependencies) was presented by Schulte et al. (2011). The main idea is to use a restricted form of Bayes net that we call the main functor node format. This paper examines how the main functor node format can be used in the context of the overall relational structure learning algorithm.

The syntax of other directed SRL models, such as Probabilistic Relational Models (PRMs) (Getoor et al. 2007), Bayes Logic Programs (BLPs) (Kersting and de Raedt 2007) and Logical Bayesian Networks (Fierens et al. 2005), is similar to that of Parametrized Bayes Nets (Poole 2003). Our approach applies to directed SRL models generally.

Nonrelational structure learning methods

Schmidt et al. (2008) compare and contrast structure learning algorithms in directed and undirected graphical methods for nonrelational data, and evaluate them for learning classifiers. Domke et al. provide a comparison of the two model classes in computer vision (Domke et al. 2008). Tillman et al. (2008) provide the ION algorithm for merging graph structures learned on different datasets with overlapping variables into a single partially oriented graph. It is similar to the learn-and-join algorithm in that it extends a generic single-table BN learner to produce a BN model for a set of data tables. One difference is that the ION algorithm is not tailored towards relational structures. Another is that the learn-and-join algorithm does not analyze different data tables completely independently and merge the result afterwards. Rather, it recursively constrains the BN search applied to join tables with the adjacencies found in BN search applied to the respective joined tables.

Lattice search methods

The idea of organizing model/pattern search through a partial order is widely used in data mining, for instance in the well-known Apriori algorithm (Agrawal and Srikant 1994), in statistical-relational learning (Popescul and Ungar 2007) and in Inductive Logic Programming (ILP) (Van Laer and de Raedt 2001). Search in ILP is based on the θ-subsumption or specialization lattice over clauses. Basically, a clause c specializes another clause c′ if c adds a condition or if c replaces a 1st-order variable by a constant. The main similarity to the lattice of relationship joins is that extending a chain of relationships by another relationship is a special case of clause specialization. The main differences are as follows. (1) Our algorithm uses only a lattice over chains of relationships, not over conditions that combine relationships with attributes. Statistical patterns that involve attributes are learned using Bayes net techniques, not by a lattice search. (2) ILP methods typically stop specializing a clause when local extensions do not improve classification/prediction accuracy. Our algorithm considers all points in the relationship lattice. This is feasible because there are usually only a small number of different relationship chains, due to foreign key constraints.

Since larger table joins correspond to larger relational neighborhoods, lattice search is related to iterative deepening methods for statistical-relational learning (Neville and Jensen 2007, Sect. 8.3.1; Chen et al. 2009). The main differences are as follows. (1) Current statistical-relational learning methods do not treat dependencies learned on smaller relational neighborhoods as constraining those learned on larger ones. Thus dependencies learned for smaller neighborhoods are revisited when considering larger neighborhoods. In principle, it appears that other statistical-relational learning methods could be adapted to use the relationship join lattice with inheritance constraints as in our approach. (2) To assess the relevance of information from linked entities, statistical-relational learning methods use aggregate functions (e.g., the average grade of a student in the courses they have taken), or combining rules (e.g., noisy-or) (Kersting and de Raedt 2007; Natarajan et al. 2008). In Probabilistic Relational Models, Bayes Logic Programs, and related models, the aggregate functions/combining rules add complexity to structure learning. In contrast, our statistical analysis is based on table joins rather than aggregation. Like Markov Logic Networks, our algorithm does not require aggregate functions or combining rules, although it can incorporate them if required.

MLN structure learning methods

Current methods (Kok and Domingos 2009; Mihalkova and Mooney 2007; Huynh and Mooney 2008; Biba et al. 2008) successfully learn MLN models for binary predicates (e.g., link prediction), but do not scale well to larger datasets with descriptive attributes that are numerous and/or have a large domain of values. Mihalkova and Mooney (2007) distinguish between top-down approaches, that follow a generate-and-test strategy of evaluating clauses against the data, and bottom-up approaches that use the training data and relational pathfinding to construct candidate conjunctions for clauses. In principle, the BN learning module may follow a top-down or a bottom-up approach; in practice, most BN learners use a top-down approach. The BUSL algorithm (Mihalkova and Mooney 2007) employs a single-table Markov network learner as a subroutine. The Markov network learner is applied once after a candidate set of conjunctions and a data matrix has been constructed. In contrast, we apply the single-table BN learner repeatedly, where results from earlier applications constrain results of later applications.

We briefly describe the key high-level differences between our algorithm and previous MLN structure learning methods, focusing on those that lead to highly efficient relational learning.

Search space

Previous Markov Logic Network approaches have so far followed Inductive Logic Programming techniques that search the space of clauses. Clauses define connections between atoms (e.g. intelligence=hi, gpa=low). Descriptive attributes introduce a large number of atoms, one for each combination of attribute and value, and therefore define a large search space of clauses. We utilize Bayes net learning methods that search the space of links between predicates/functions (e.g., intelligence, gpa), rather than atoms. Associations between predicates constitute a smaller model space than clauses that can be searched more efficiently. The efficiency advantages of searching the predicate space rather than the clause space are discussed by Kersting and de Raedt (2007, 10.7).

Constraints and the lattice structure

We employ a number of constraints that are motivated by the relational semantics of the data. These further reduce the search space, mainly by requiring or forbidding edges in the Bayes net model. A key type of constraint are based on the lattice of relationship chains: These state that edges learned when analyzing shorter chains are inherited by longer chains. This allows the learn-and-join algorithm to perform a local statistical analysis for a single point in the relationship chain lattice, while connecting the results of the local analyses with each other.

Data representation and lifted learning

The data format used by Markov Logic Networks is a list of ground atoms, whereas the learn-and-join algorithm analyzes data tables. This allows us to apply directly propositional Bayes net learners which take single tables as input. From a statistical point of view, the learn-and-join algorithm requires only the specification of the frequency of events in the database (the sufficient statistics in the database) (Schulte 2011). The data tables provide these statistics. In the case of join tables the statistics are frequencies conditional on the existence of a relationship (e.g., the percentage of pairs of friends who both smoke). The learn-and-join algorithm can be seen as performing lifted learning, in analogy to lifted probabilistic inference (Poole 2003). Lifted inference uses as much as possible frequency information defined at the class level in terms of 1st-order variables, rather than facts about specific individuals. Likewise, the learn-and-join algorithm uses frequency information defined in terms of 1st-order variables (namely the number of satisfying groundings of a 1st-order formula).

3 Background and notation

Our work combines concepts from relational databases, graphical models, and Markov Logic networks. As much as possible, we use standard notation in these different areas. Section 3.5 provides a set of examples illustrating the concepts.

3.1 Logic and functors

Parametrized Bayes nets are a basic statistical-relational learning model; we follow the original presentation of Poole (2003). A functor is a function symbol or a predicate symbol. Each functor has a set of values (constants) called the range of the functor. A functor whose range is {T,F} is a predicate, usually written with uppercase letters like P,R. A functor random variable is of the form f(τ ₁,…,τ _k) where f is a functor and each term τ _i is a first-order variable or a constant. We also refer to functor random variables as functor nodes, or for short fnodes.^{Footnote 1} Unless the functor structure matters, we refer to a functor node simply as a node. If functor node f(τ) contains no variable, it is ground, or a gnode. An assignment of the form f(τ)=a, where a is a constant in the range of f, is an atom; if f(τ) is ground, the assignment is a ground atom. A population is a set of individuals, corresponding to a domain or type in logic. Each first-order variable X is associated with a population $\mathcal {P}_{X}$ of size $|\mathcal {P}_{X}|$; in the context of functor nodes, we refer to population variables (Poole 2003). An instantiation or grounding γ for a set of variables X ₁,…,X _k assigns a constant γ(X _i) from the population of X _i to each variable X _i.

Getoor and Grant discuss the applications of function concepts for statistical-relational modelling in detail (Getoor and Grant 2006). The functor formalism is rich enough to represent the constraints of an entity-relationship (ER) schema (Ullman 1982) via the following translation: Entity sets correspond to populations, descriptive attributes to functions, relationship tables to predicates, and foreign key constraints to type constraints on the arguments of relationship predicates. A table join of two or more tables contains the rows in the Cartesian products of the tables whose values match on common fields.

We assume that a database instance (interpretation) assigns a unique constant value to each gnode f(a). The value of descriptive relationship attributes is well defined only for tuples that are linked by the relationship. For example, the value of grade(jack,101) is not well defined in a university database if Registered(jack,101) is false. In this case, we follow the approach of Schulte et al. (2009) and assign the descriptive attribute the special value ⊥ for “undefined”. Thus the atom grade(jack,101)=⊥ is equivalent to the atom Registered(jack,101)=F. Fierens et al. (2005) discuss other approaches to this issue. The results in this paper extend to functors built with nested functors, aggregate functions (Klug 1982), and quantifiers; for the sake of notational simplicity we do not consider more complex functors explicitly.

3.2 Bayes nets and Markov nets for relational data and Markov logic networks

We employ notation and terminology from Pearl (1988) for graphical models. Russell and Norvig provide a textbook introduction to many of the topics we review (Russell and Norvig 2010). A Bayes net structure is a directed acyclic graph (DAG) G, whose nodes comprise a set of random variables denoted by V. In this paper we consider only discrete finite random variables. When discussing a Bayes net structure, we refer interchangeably to its nodes or its variables. A family in a Bayes net graph comprises a child node and the set of its parents. A Bayes net (BN) is a pair 〈G,θ _G〉 where θ _G is a set of parameter values that specify the probability distributions of children conditional on assignments of values to their parents. The conditional probabilities are specified in a conditional probability table. For an assignment of values to all nodes in the Bayes net, the joint probability of the values is given by the product of the associated conditional probabilities. A Parametrized Bayes Net is a Bayes net whose nodes are functor nodes. In the remainder of this paper we follow (Schulte 2011) and use the term Functor Bayes Net or FBN instead of Parametrized Bayes Net, for the following reasons. (1) To emphasize the use of functor symbols. (2) To avoid confusion with the statistical meaning of “parametrized”, namely that values have been assigned to the model parameters. We usually refer to FBNs simply as Bayes nets.

A Markov net structure is an undirected graph whose nodes comprise a set of random variables. For each clique C in the graph, a clique potential function Ψ _C specifies a nonnegative real number for each possible assignment of values to the clique. For an assignment of values to all nodes in the Markov net, the joint probability of the values is given by the product of the associated clique potentials, divided by a normalization constant. A Functor Markov Net is a Markov net whose nodes are functor nodes.

Bayes nets can be converted into Markov nets through the standard moralization method: connect all spouses that share a common child, and make all edges in the resulting graph undirected. Thus each family in the Bayes net becomes a clique in the moralized structure. For each state of each family clique, we define the clique potential in the Markov net to be the conditional probability of the child given its parents. The resulting Markov net defines the same joint probability over assignments of values to the nodes as the original Bayes net.

3.3 Inference and ground models

In statistical-relational learning, the usual approach to inference for relational data is to use a ground graphical model for defining a joint distribution over the attributes and links of entities. This approach is known as knowledge-based model construction (Ngo and Haddawy 1997; Koller and Pfeffer 1997; Wellman et al. 1992). For a Functor Markov Net M, this leads to the notion of a ground Functor Markov net that is derived from M by instantiating the functor nodes in M in every possible way. Formally, there is an edge f ₁(a ₁)−f ₂(a ₂) between two gnodes if and only if there is an edge f ₁(τ ₁)−f ₂(τ ₂) in M and there is a grounding γ of τ ₁,τ ₂ such that γ(τ _i)=a _i, for i=1,2.

Each clique among gnodes inherits the clique potential from the clique potential of the 1st-order model. A given database instance specifies a value for each node in the ground graph. Thus the likelihood of the Functor Markov net for the database can be defined as the likelihood assigned by the ground Markov net to the facts in the database following the usual product of all clique potentials involving ground nodes. Viewed on a log-scale, this is the sum of the log-potentials.

In the case where the Functor Markov net is obtained by moralizing a Functor Bayes net, the resulting log-likelihood is as follows: For each possible child-parent state in the Bayes net, multiply the logarithm of the corresponding conditional probability by the number of instantiations of the child-parent states in the database. This is similar to the standard single-table Bayes net log-likelihood, where for each possible child-parent state in the Bayes net, we multiply the logarithm of the corresponding conditional probability by the number of table rows that satisfy the given child-parent state.

The fact that the grounding semantics provides a conceptually straightforward way to define probabilistic inferences for relational data has been a major competitive advantage of undirected relational models (Domingos and Richardson 2007; Taskar et al. 2002). Below, we discuss the difficulties that can arise in applying the grounding semantics with directed models, making reference to some examples.

3.4 Markov logic networks

A Markov Logic Network (MLN) is a finite set of 1st-order formulas or clauses {ϕ _i}, where each formula ϕ _i is assigned a weight. A Markov Logic Network can be viewed as a specification of a Markov network using logical syntax (Domingos and Richardson 2007). Given an MLN and a database $\mathcal {D}$, let $n_{i}(\mathcal {D})$ be the number of groundings that satisfy ϕ _i in $\mathcal {D}$. An MLN assigns a log-likelihood to a database according to the equation

$$ \ln\bigl(P(\mathcal {D})\bigr) = \sum_{i} w_{i} n_{i}(\mathcal {D}) - \ln(Z) $$

(1)

where Z is a normalization constant.

Thus the log-likelihood is a weighted sum of the number of groundings for each clause. Functor Markov Nets have a simple representation as Markov Logic Networks as follows. For each assignment of values to a clique of functor nodes, add a conjunctive formula to the MLN that specifies that assignment. The weight of this formula is the logarithm of the clique potential. For any Functor Markov net, the MLN likelihood function defined by Eq. (1) for the corresponding MLN is exactly the Markov net likelihood defined by grounding the Functor Markov net. Therefore we can use MLN inference to carry out inference for Functor Markov Nets.

3.5 Examples

We illustrate Functor Bayes Nets and Markov Logic Networks with two relational schemas.

Friendship database

Figure 1 shows a simple database instance in the ER format, following (Domingos and Richardson 2007). Figure 2 illustrates Functor Bayes net concepts. An example of a family formula with child node Smokes(Y) is

$$\mathit{Smokes}(Y) = \mathit {T},\qquad \mathit{Smokes}(X)= \mathit {T}, \qquad\mathit{Friend}(X,Y) = \mathit {T}. $$

Figure 3 shows the MLN structure obtained by moralization and the corresponding ground Markov net for the database of Fig. 1. For converting the Bayes net conditional probabilities to MLN clause weights, Domingos and Richardson suggest using the log of the conditional probabilities as the clause weight (Domingos and Richardson 2007, 12.5.3), which is the standard conversion for propositional Bayes nets. Figure 3 illustrates moralization using log-probabilities as weights. In this paper we apply moralization only to the model structure, not to the model parameters. Table 1 shows how the unnormalized log-likelihood of the sample database is computed for the ground model.

Table 1 The computation of the Markov Net log-likelihood for the database of Fig. 1. For simplicity, we used uniform probabilities as probability parameters for the nodes Friend(X,Y) and Smokes(X)

Learning graphical models for relational data via lattice search

Abstract

Similar content being viewed by others

Statistical Relational Learning

Knowledge Discovery from Constrained Relational Data: A Tutorial on Markov Logic Networks

Bayesian Markov Logic Networks

Explore related subjects

1 Introduction

Approach

Evaluation

Limitations

Paper organization

Contributions

2 Additional related work

Nonrelational structure learning methods

Lattice search methods

MLN structure learning methods

Search space

Constraints and the lattice structure

Data representation and lifted learning

3 Background and notation

3.1 Logic and functors

3.2 Bayes nets and Markov nets for relational data and Markov logic networks

3.3 Inference and ground models

3.4 Markov logic networks

3.5 Examples

Friendship database

University database

3.6 Directed models and the cyclicity problem

4 Lattice search for attribute dependencies

4.1 Overview

4.2 The multinet lattice

4.2.1 Functor nodes

Examples

4.2.2 Relationship chains

4.2.3 The join data table

Examples

4.3 Model conversion

5 The learn-and-join algorithm

5.1 Constraints used in the learn-and-join algorithm

5.1.1 Edge inheritance in the relationship lattice

Constraint 1

Example

Constraint 2

Examples

5.1.2 The main functor node format

Constraint 3

Example

5.1.3 Population variable bound

Examples

5.1.4 Link attributes

Constraint 4

Examples

5.1.5 Relationship parents

Constraint 5

Example

5.2 Examples

5.3 Pseudocode

6 Discussion: lattice constraints

Computational efficiency

Statistical motivation

7 Discussion: the main functor node format

Example

Proposition 1

8 Discussion: population variable bound

9 Evaluation: experimental design

9.1 Datasets

University database

University+ database

MovieLens database

Mutagenesis database

Hepatitis database

Mondial database

UW-CSE database

9.2 Graph structures learned by the learn-and-join algorithm

10 Moralization vs. other structure learning methods: basic comparisons

10.1 Comparison systems and performance metrics

10.2 Runtime comparison

10.3 Predictive accuracy and data fit

10.4 UW-CSE dataset