
1 Probabilistic and Information-Theoretic Methods

Information theory is closely connected to probability theory and statistics. In particular, the standard definition of the information contained in a random variable $X$ with probability density function $P(X)$ is well known to be $I(X) = -\log P(X)$, with the corresponding Shannon entropy, in differential form, given by the average information

$$H(P) = -\int P(X)\,\log P(X)\,\mathrm{d}x\,.$$
(31.1)

One of the fundamental theorems of information theory, the second Gibbs theorem, states that the normal distribution achieves maximum entropy, hence maximal average information, among all distributions with known variance. To show this in the univariate case, consider the normal distribution in the standard form

$$P(X) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(X-\mu)^2}{2\sigma^2}\right).$$

It is straightforward to show that for the natural logarithm

$$-\int P(X)\,\log P(X)\,\mathrm{d}x = \frac{1}{2}\left(1 + \log(2\pi\sigma^2)\right) = -\int G(X)\,\log P(X)\,\mathrm{d}x\,,$$

where $G(X)$ is any arbitrary density function with variance $\int G(X)\,(X-\mu)^2\,\mathrm{d}x = \sigma^2$. Therefore, the difference in average information between the two density functions necessarily satisfies

$$-\int P(X)\log P(X)\,\mathrm{d}x + \int G(X)\log G(X)\,\mathrm{d}x = -\int G(X)\log P(X)\,\mathrm{d}x + \int G(X)\log G(X)\,\mathrm{d}x = -\int G(X)\log\!\left(\frac{P(X)}{G(X)}\right)\mathrm{d}x \geq \int G(X)\left(1 - \frac{P(X)}{G(X)}\right)\mathrm{d}x = 0\,,$$

using Jensen’s inequality $\log(x) \leq x - 1$ and the normalization property $\int P(X)\,\mathrm{d}x = \int G(X)\,\mathrm{d}x = 1$. This is a particular instance of Gibbs’ inequality and proves that the asymptotic distribution of the central limit theorem also maximizes entropy.
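As a quick numerical illustration of this result, the following minimal sketch compares the differential entropy (31.1) of a Gaussian with that of a Laplace density of equal variance, both evaluated on a grid; the grid bounds, the variance, and the choice of the Laplace distribution as the competitor are arbitrary assumptions made only for illustration.

```python
import numpy as np

# Numerical differential entropy H(P) = -∫ p(x) log p(x) dx on a grid.
def diff_entropy(p, dx):
    p = np.clip(p, 1e-300, None)          # avoid log(0) in empty tail regions
    return -np.sum(p * np.log(p)) * dx

x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]
sigma = 1.5                               # common standard deviation for both densities

# Gaussian with variance sigma^2
gauss = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
# Laplace density with the same variance: var = 2*b^2  =>  b = sigma / sqrt(2)
b = sigma / np.sqrt(2.0)
laplace = np.exp(-np.abs(x) / b) / (2 * b)

print("H(Gaussian) =", diff_entropy(gauss, dx))     # ≈ 0.5*(1 + log(2*pi*sigma^2))
print("closed form =", 0.5 * (1 + np.log(2 * np.pi * sigma**2)))
print("H(Laplace)  =", diff_entropy(laplace, dx))   # strictly smaller, as Gibbs' inequality predicts
```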

This led, in probability theory, to the definition of natural measures of dissimilarity closely related to the expectation of information difference, e.g., the Kullback–Leibler (KL) divergence [1]

$$D_{\mathrm{KL}}(P \parallel Q) = \int P(X)\,\log\!\left(\frac{P(X)}{Q(X)}\right)\mathrm{d}x\,,$$
(31.2)

as generalized distances between probability distributions P and Q.
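The divergence (31.2) is easy to evaluate for simple densities. The following hedged sketch compares a grid-based evaluation of $D_{\mathrm{KL}}$ between two univariate Gaussians with the known closed-form expression; the particular means and standard deviations are arbitrary choices.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def kl_numeric(p, q, dx):
    # D_KL(P || Q) = ∫ p(x) log(p(x)/q(x)) dx, cf. (31.2)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]
mu_p, s_p = 0.0, 1.0
mu_q, s_q = 1.0, 2.0

p = gaussian(x, mu_p, s_p)
q = gaussian(x, mu_q, s_q)

# closed form for two univariate Gaussians
kl_exact = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

print("numeric  :", kl_numeric(p, q, dx))
print("closed   :", kl_exact)
print("asymmetry:", kl_numeric(q, p, dx))   # the KL divergence is not symmetric, hence 'divergence' rather than distance
```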

The KL divergence occurs frequently in machine learning, where the development of learning strategies links information theory with statistical and biologically motivated concepts. For instance, the perceptron model was established as a simple but mathematically tractable model of a biological neuron as the smallest information processing unit in brains [2]. Recognition that gradient descent provided a pragmatic but effective solution to the credit assignment problem, namely which values the hidden nodes should have, led to the multilayer perceptron as a powerful computational tool for classification and regression. Initially, maximum likelihood optimization was used for parameter estimation, following the tried and tested statistical concepts of normally distributed errors leading to a sum-of-squares loss function in regression and, for classification, the Bernoulli distribution for binary data and the so-called cross-entropy (31.2) for multinomial class assignments, the latter two likelihood functions measuring information divergence averaged over the true distribution given by the empirical class labels.

Information theoretic aspects (e. g., mutual information) were also considered in neural models in order to avoid overtraining [3], for instance in Boltzmann networks which directly mirror information principles in statistical mechanics [4]. Related approaches are used currently for deep learning models, where information principles drive the feature representations [5].

The correspondence between maximum entropy and maximum likelihood outlined above is just one aspect of the application of information-theoretic concepts in machine learning. The next section outlines further developments linked first to source identification through blind signal separation and matrix factorization methods. These concepts from signal processing identify important degrees of freedom that may be used as hidden variables in probabilistic models, discussed later in the chapter. Furthermore, the application of information-theoretic methods extends also to the automatic identification of prototypes for use in compact data representations that include dictionaries defined by methods such as vector quantization, typically with unsupervised approaches.

Supervised methods are introduced as probabilistic models, focusing first on discriminative methods. This indicates that the maximum likelihood approach is limited in its predictive power in generalization to out-of-sample data, because it allows models to be generated with very little bias but with considerable variance – for a more detailed discussion of this point refer to [6]. What this means in practice is that flexible models such as neural networks are prone to overfitting unless the complexity of the model is controlled along with the extent to which the model fits the data. The latter is described by the likelihood, but the model complexity can be controlled in a number of different ways. In probabilistic models, an efficient framework to maximize the generality of probabilistic inference models is to apply the maximum a posteriori (MAP) framework, which optimizes the posterior probability of the model parameters given the data but also given prior distributions for the parameters, typically limiting their size by assuming a zero-centred normal distribution as the prior. This is the basis of the method of automatic relevance determination, explained in Sect. 31.2.

While discriminative models are efficient approximators for nonlinear response functions, both in regression and in the estimation of class conditional density functions, they are difficult to interpret and can generally be considered as black boxes, meaning that they are not readily interpreted to give insights about the data. A topical and widely used alternative approach is to model the joint distribution of the data directly. This is ideally done by factorizing the multivariate structure of the data into subgraphs using strict conditional independence requirements, as discussed in Sect. 31.2. Inference can then proceed using Bayes’ theorem, introduced in (31.6).

An alternative approach to modeling the joint distribution of the covariates is to use the mutual correlation in the data to identify important degrees of freedom that may be hidden in the sense that they are not directly observed. This generates latent variable representations that naturally fit into the framework of probabilistic modeling. However, the introduction of additional variables also introduces complexity into the optimization process for estimating their values. This leads naturally to the introduction of expectation maximization (EM), a general approach of particular value for estimating mixture models, discussed in Sect. 31.3.

So far the modeling methodologies focus on snapshots of the data, without taking into consideration the time evolution of the covariates. To do this requires explicit parametrization, for which arguably the most widely used probabilistic approach is hidden Markov models (HMM). These models are built on the concepts of conditional independence, latent variables, and expectation maximization to model the time evolution of sequences of covariate measurements, in the last substantive Sect. 31.4.

1.1 Information-Theoretic Methods

While the statistical properties of perceptrons are widely investigated [6], the more difficult problem of establishing statistical independence is becoming increasingly important and novel algorithms have been presented during the last decade [7]. Their applicability is enormous, ranging from variable selection to blind source separation (BSS) and statistical causality. Frequently, the difficult question of statistical dependence in data is replaced by the easier task of estimating and applying data correlations in learning strategies. A recent approach tries to determine independence by generalized correlation functions [8]. In this context of decorrelation and independence, BSS and nonnegative matrix factorizations [9] of data channels are based on statistical deconvolution. Comprehensive overviews for BSS, independent component analysis (ICA), and nonnegative matrix and tensor factorization (NMF) can be found in [10, 11, 12], respectively. Different aspects can be investigated, like ICA and BSS maximizing conditional probabilities [11]. A relevant connection exists between NMF and probabilistic graphical models comprising hidden variables [13], which is briefly discussed in Sect. 31.3.4.1.

Other recent approaches in this field incorporate information theoretic principles directly: Pham [14] investigated BSS based on mutual information, whereas [15] applied β-divergences. The infomax principle for ICA was considered in depth [16], as was the problem of learning overcomplete data representations and performing overcomplete noisy blind source separation, e.g., the sparse coding neural gas (SCNG) [17]. Results involving modern divergences (generalized α-β-divergences) were recently published [18]. Obviously, information theoretic divergence measures like Rényi divergences (belonging to the family of α-divergences) capture directly the statistical information contained in the data, as expressed by the probability density function [19, 20]. This property can be used for unsupervised model estimation, for instance in vector quantization, when divergences are used as the dissimilarity measure [21].

Information optimum vector quantization by prototypes is a widely investigated topic in clustering and data compression, based on the optimization of the γ-reconstruction error

$$E_{VQ}(\gamma) = \int \lVert v - w(v)\rVert_E^{\gamma}\, P(V = v)\,\mathrm{d}v\,,$$

where $P(V=v)$ is the data density of the vector data $v$ and $\lVert v - w(v)\rVert_E$ is the Euclidean distance between the data vector and the prototype $w(v)$ representing it. One of the key results concerning information theoretic principles for vector quantization is Zador’s magnification law [22]: if the data vectors $v$ are given in $q$-dimensional Euclidean space, then the magnification law $\rho \propto P^{\alpha}$ holds. Here, $\rho(w)$ is the prototype density with the magnification factor

$$\alpha = \frac{q}{q + \gamma}\,.$$

This is the basic principle of vector quantization based on Euclidean distances. For different schemes like self-organizing maps or neural gas variants, slightly different magnification factors are obtained depending on the choice of neighborhood cooperation scheme applied during prototype adaptation [23, 24, 25]. Information optimum magnification, $\alpha = 1$, is equivalent to maximum mutual information [22]. Yet, it is possible to control the magnification for most of these algorithms by different strategies like localized or frequency sensitive competitive learning. For an overview, we refer to [23]. If the Euclidean distance is replaced by divergence measures, optimum magnification $\alpha = 1$ can also be achieved by maximum entropy learning [26], or by the utilization of correntropy [27]. Vector quantization algorithms directly derived from information theoretic principles based on Rényi entropies are intensively studied in [28], also highlighting their connection to graph clustering and Mercer kernel-based learning [29].
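To make the reconstruction-error view of vector quantization concrete, the following is a minimal sketch of an online winner-take-all quantizer that reduces the $\gamma = 2$ reconstruction error on synthetic two-dimensional data; the data, the number of prototypes, and the learning-rate schedule are arbitrary choices, and no neighborhood cooperation (as in self-organizing maps or neural gas) is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data drawn from a nonuniform density
data = np.vstack([rng.normal(0.0, 0.3, size=(800, 2)),
                  rng.normal(2.0, 1.0, size=(200, 2))])

K = 20
prototypes = data[rng.choice(len(data), K, replace=False)].copy()

n_epochs, eta0 = 30, 0.5
for epoch in range(n_epochs):
    eta = eta0 * (0.01 / eta0) ** (epoch / (n_epochs - 1))    # decaying learning rate
    for v in rng.permutation(data):
        winner = np.argmin(np.sum((prototypes - v) ** 2, axis=1))
        prototypes[winner] += eta * (v - prototypes[winner])  # move the winning prototype toward the sample

# Empirical gamma = 2 reconstruction error E_VQ
d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
print("E_VQ(2) =", d2.min(axis=1).mean())
```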

Other information theoretic vector quantizers optimize the mutual information between data and prototypes, or the respective KL divergence, instead of minimizing a reconstruction error [30]. Based on this principle, several data embedding, or dimensionality reduction, techniques have been developed as alternatives to multidimensional scaling. These approaches are frequently used to visualize data. Prominent examples are stochastic neighborhood embedding (SNE) [31] or variants thereof: for instance, t-SNE uses outlier-robust Student-t distributions for data characterization instead of Gaussians [32]. The generalization to divergences other than KL can be found in [33].

Another role for information theory in machine learning is in feature selection. Removing irrelevant or redundant features not only leads to a simplification of the model and a reduced requirement for data acquisition, but it is also central for maximizing the generality of the model when it is applied to future data. Most feature selection approaches are supervised schemes, hence using class information or expected regression values. Strategies to achieve this goal can be classical Bayesian inference schemes, of which automatic relevance determination (ARD) is a good example (described further in Sect. 31.2), or statistical approaches based on mutual correlation or covariances [34, 35]. An alternative approach to feature selection is to use the mutual information

$$I(X, Y) = D_{\mathrm{KL}}\big(J(X, Y) \,\|\, P(X)\,Q(Y)\big)$$

between random variables X and Y with probability densities P and Q, respectively, and joint density J [36]. Here, the features are treated as random variables to be compared and mutual information measures the information loss resulting from removal of variables from the model. Learning classification together with feature weighting in vector quantization is known as relevance learning [37]. Recent developments to introduce sparseness according to information theoretic constraints are discussed in [38, 39].
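As a minimal illustration of mutual-information-based feature ranking, the sketch below estimates $I(X_m, Y)$ with a simple plug-in histogram estimator for a few synthetic features and a binary class label; the binning, the synthetic data, and the use of a histogram estimator (rather than the more refined estimators cited above) are assumptions made purely for illustration.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X,Y) = D_KL(J(X,Y) || P(X)Q(Y)) for continuous x, discrete y."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, len(np.unique(y))))
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)     # marginal P(X) over the x-bins
    py = joint.sum(axis=0, keepdims=True)     # marginal Q(Y) over the class labels
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
X = np.column_stack([y + rng.normal(0, 0.5, 2000),     # informative feature
                     rng.normal(0, 1.0, 2000),         # irrelevant feature
                     y + rng.normal(0, 2.0, 2000)])    # weakly informative feature

scores = [mutual_information(X[:, m], y) for m in range(X.shape[1])]
print("MI per feature:", np.round(scores, 3))   # the ranking suggests which features to keep
```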

Information-theoretic measures, such as mutual information, can be explicitly estimated from data [40]. This is used in the context of vectorial data analysis to obtain consistent and reliable estimators with topographic maps or kernels [41]. Further applications of information theoretic learning also use the Rényi entropy

$$H_{\alpha}(P) = \frac{1}{1-\alpha}\,\log\!\left(\int \left(P(X)\right)^{\alpha}\,\mathrm{d}x\right)$$

as a cost function instead of the mean squared error, resorting, for computational efficiency, to Parzen estimators [42] or nearest neighbor entropy estimation models. For effective computation of an approximation of the mutual information $I(X, Y)$, the quadratic Rényi entropy $H_2(P)$ or the closely related information energy are common choices [43]. Parzen window-based estimators for some information theoretic cost functions have also been shown to be cost functions in a corresponding Mercer kernel space [44]. In particular, a classification rule based on an information theoretic criterion has been shown to correspond to a linear classifier in the kernel space. This leads to the formulation of the support vector machine (SVM) from information theory principles.
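The quadratic Rényi entropy is particularly convenient because the integral of a squared Parzen estimate has a closed form, the so-called information potential. The sketch below computes this estimate for one-dimensional samples and compares it with the exact $H_2$ of the generating Gaussian; the kernel width and sample size are arbitrary, and the estimator is the standard pairwise-Gaussian form rather than a specific method from the cited references.

```python
import numpy as np

def quadratic_renyi_entropy(x, sigma=0.5):
    """Parzen-window estimate of H_2(P) = -log ∫ p(x)^2 dx for 1-D samples x.
    The integral of the squared Parzen estimate has the closed form
    (1/N^2) * sum_{i,j} G(x_i - x_j; 2*sigma^2), the information potential."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    var = 2.0 * sigma**2
    pairwise = np.exp(-diff**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    info_potential = pairwise.sum() / n**2
    return -np.log(info_potential)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.5, size=2000)
print("H_2 estimate :", quadratic_renyi_entropy(samples))
print("H_2 Gaussian :", 0.5 * np.log(4 * np.pi * 1.5**2))   # exact value for N(0, 1.5^2)
```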

1.2 Probabilistic Models

Kernel models are known for having excellent discrimination performance, but they are typically not well calibrated. This is because they are designed to be efficient binary class allocation models rather than estimators of the posterior probability for membership of each class C. As an example, SVMs allocate inputs to classes on the basis of a binary-valued indicator variable that generally does not have a link function to a probability density estimate. This type of model is known as a discriminative model, a well-known variant being Fisher’s linear discriminant. As the name implies, the central model is linear in the covariates,

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{x}\,,$$

optimizing, for binary classification, a discriminant function derived from the mean $m_i$ and variance $s_i^2$ of each class ($i = 1, 2$), namely

$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}\,.$$

In general, given the two data cohorts, the covariance matrix of the data S has a strict decomposition into within- and between-class covariance matrices as S = S w + S b . For an overall data mean vector m and a total of N j data points in each class, these matrices are given by

$$\mathbf{S} = \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^{\mathrm{T}}\,,\quad \mathbf{S}_w = \sum_{j=1}^{2}\sum_{i \in \mathcal{C}_j} (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^{\mathrm{T}}\,,\quad \mathbf{S}_b = \sum_{j=1}^{2} N_j\,(\mathbf{m}_j - \mathbf{m})(\mathbf{m}_j - \mathbf{m})^{\mathrm{T}}\,.$$

The solution to the optimization of $J(\mathbf{w})$ is

$$\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_2 - \mathbf{m}_1)\,,$$

where the inverse of the within-class covariance matrix S w positions the discriminant hyperplane so as to minimize the overlap between the projections of the data points in each class onto the direction of the weight w. This illustrates the observation that, in general, this projection will not be calibrated with a probabilistic estimate such as the logit

$$\mathrm{logit}(P(C\mid X)) = \log\!\left(\frac{P(C\mid X)}{1 - P(C\mid X)}\right).$$
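A minimal sketch of Fisher’s discriminant on synthetic two-class Gaussian data follows: it builds the scatter matrices, solves for the direction $\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, and evaluates the criterion $J(\mathbf{w})$ on the projections; the data and class parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussian classes with different means and a shared, anisotropic covariance
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = np.vstack([X1, X2]).mean(axis=0)

# Within- and between-class scatter matrices (outer-product form)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sb = len(X1) * np.outer(m1 - m, m1 - m) + len(X2) * np.outer(m2 - m, m2 - m)

# Fisher direction w ∝ Sw^{-1} (m2 - m1)
w = np.linalg.solve(Sw, m2 - m1)

# Project both classes onto w and check the separation criterion J(w)
p1, p2 = X1 @ w, X2 @ w
J = (p1.mean() - p2.mean()) ** 2 / (p1.var() + p2.var())
print("Fisher direction:", w / np.linalg.norm(w))
print("J(w) =", J)
```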

The correct calibration is found in a class of generalized linear models of the form

$$y(\mathbf{x}) = f(\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0)\,,$$

where $f(\cdot)$ is known as the activation function in machine learning and its inverse is called a link function by statisticians [6]. Perhaps the best-known choice of activation is the sigmoid function, where the probabilistic model becomes logistic regression and the linear index $\mathbf{w}^{\mathrm{T}}\mathbf{x}$ represents exactly $\mathrm{logit}(P(C\mid X))$. This is very widely used and a generally well-calibrated model, even when severe class imbalance is present.

It is often quoted that generalized linear models are limited by the discriminant forms determined by the linear scores, which must therefore be hyperplanes. However, this ignores the observation that, in most practical applications, suitable attribute representations are defined using domain knowledge, typically by binning variables into discrete states. This turns the probabilistic estimators into linear-in-the-parameters models with significant discrimination potential for nonlinearly separable data. In effect, if the link function is properly tuned to the noise structure of the data, and in particular when there are larger numbers of independent covariates, well-designed generalized linear models are competitive with flexible machine learning models, the more so as the limitation of using a linear-in-the-parameters scoring index now works as a form of regularization limiting the complexity of the model. Moreover, the linear index provides a strong element of interpretability whose importance to application domain experts cannot be overestimated. Notwithstanding the power of machine learning, generalized linear models should always be used as benchmarks to set against nonlinear models.

An alternative to probabilistic linear models is the wide range of flexible direct estimators of $P(C\mid X)$, among which arguably the most widely used model remains the multilayer perceptron (MLP). Similarly to linear statistical models, however, it is important to note that the estimation of class conditional probabilities with an MLP is contingent on using a correct activation function at the output node together with a suitable choice of loss function, which must be one of the entropy functions outlined in the previous section. So, in binary classification, the log-likelihood function with a Bernoulli distribution should be used in conjunction with a sigmoid activation function. In the multinomial case, we would need an extension of the sigmoid function, the softmax activation, together with the cross-entropy as the loss function, since this is the correct measure of the divergence between the estimated and observed probability density functions. Similarly, for nonlinear regression, the activation function should be linear with the usual sum-of-squares error function, provided the inherent noise in the data can be assumed to be normally distributed with zero mean, since this is where the loss function is derived from. In the event where the noise variance, for instance, is dependent on the covariates, heteroscedastic noise models must be used to derive appropriate loss functions [6].

While the strength of neural networks is their universal approximation capability, in the sense of fitting any multivariate surface to an arbitrarily small error, this flexibility also makes them prone to overfitting, potentially resulting in data models with little bias but large variance, in direct contrast to generalized linear models. In both cases, it is necessary to control the complexity of the model and this is best done by adding a penalty term to enforce the principle of parsimony, colloquially known as Occam’s razor (lex parsimoniae). Arguably, the most commonly used and effective scheme is to apply Bayes’ theory at the level of fitting the model parameters, then to the regularization hyperparameters, and finally to model selection itself.

As we saw previously, the output of the MLP represents a direct estimate of the posterior probability of class membership $P(C\mid X)$. This approach can be generalized for the analysis of longitudinal data, where each individual subject is followed up over a period of time starting with a defined recruitment point and ending either at the end of a defined observation period or when an event of interest is observed, whichever occurs first. This is often called survival modeling and is typically used to estimate event rates in the presence of censorship, e.g., where the outcome of interest, for instance recovery from an illness, is observed in some subjects for only part of the allowed period of follow-up due to other events taking over, such as another condition setting in, which prevents the observer from ever knowing whether or not the subject would have recovered from the original illness, which is the event of interest. For discrete time, these models can be estimated using the standard MLP with an additional input node coding the time intervals. The output of the MLP again represents a conditional probability, but now the probability that the event occurs for the subject during each time interval, given that the subject survived until the start of the time interval. This defines the hazard function $h_l(\mathbf{x}_i)$, for a subject with covariate vector $\mathbf{x}_i$ and predictions over the $l$th discrete time interval, which is given by

$$h_l(\mathbf{x}_i) = P(T \leq t_l \mid T > t_{l-1}, \mathbf{x}_i)\,.$$

For a single event of interest, i. e., a single risk factor, the log-likelihood function exactly mirrors that used in binary classification, treating as independent the probability estimates for each of the N subjects and over the discrete time intervals where the subject was observed, i. e., up to the end of the follow-up period or until censorship. This leads to the following loss function

$$L_B = \prod_{i=1}^{N}\,\prod_{l=1}^{l_i} \left\{ h_l(\mathbf{x}_i)^{d_{il}}\,\left[1 - h_l(\mathbf{x}_i)\right]^{(1 - d_{il})} \right\}\,,$$
(31.3)

where the binary indicator variable $d_{il} = 0$ if the event of interest was not observed for the subject during the specific time interval, and is 1 otherwise. This loss function is known as a partial likelihood, since it is measured only over time periods where the outcomes for each subject are observed, an approach that has been extended to the multinomial case to provide a rigorous treatment of censorship with flexible models in the context known as competing risks [45].
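A hedged sketch of evaluating (31.3) in negative-log form is given below, for externally supplied hazard predictions $h_l(\mathbf{x}_i)$ and event indicators $d_{il}$; the subject data are synthetic placeholders, and in a real model the hazards would be produced by the MLP with a time-interval input as described above.

```python
import numpy as np

def neg_log_partial_likelihood(hazards, events):
    """Negative log of the partial likelihood (31.3).
    hazards[i] : predicted h_l(x_i) over the l_i intervals subject i was observed
    events[i]  : indicators d_il (1 if the event occurred in interval l, else 0)"""
    total = 0.0
    for h, d in zip(hazards, events):
        h = np.clip(np.asarray(h, dtype=float), 1e-12, 1 - 1e-12)
        d = np.asarray(d, dtype=float)
        total -= np.sum(d * np.log(h) + (1 - d) * np.log(1 - h))
    return total

# Three synthetic subjects: the first has the event in interval 3,
# the other two are censored after 2 and 4 intervals, respectively.
hazards = [[0.05, 0.10, 0.60], [0.05, 0.05], [0.02, 0.03, 0.04, 0.05]]
events  = [[0, 0, 1],          [0, 0],       [0, 0, 0, 0]]
print("-log L_B =", neg_log_partial_likelihood(hazards, events))
```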

Application of the Bayesian regularization framework consists in maximizing the posterior probability for the model parameters w, given the data set D, the regularization hyperparameters α and the choice of the model structure, e. g., selected covariates H, namely

$$P(\mathbf{w}\mid D, \alpha, H) = \frac{P(D\mid \mathbf{w}, \alpha, H)\,P(\mathbf{w}\mid \alpha, H)}{P(D\mid \alpha, H)}\,.$$
(31.4)

The first term on the right-hand side of Eq. (31.4) denotes the probability of the model fitting the data, represented by the exponential of the entropy term discussed in the introduction and defined for longitudinal data by (31.3), hence

$$P(D\mid \mathbf{w}, \alpha, H) = \mathrm{e}^{-L_B}\,.$$

The second term in (31.4) represents a prior distribution of the model parameters, typically with a quadratic loss term corresponding to independent zero-mean univariate Gaussian distributions, sometimes called weight decay terms. A particularly efficient implementation of Bayesian regularization is to assign a separate weight decay term to each covariate, indexing the covariates by $m$, of which there are $N_\alpha$, with the $N_m$ hidden nodes indexed by $n$. This allows each covariate to be separately turned on or off depending on how informative it is for fitting the observations about the outcome variable, a process known as automatic relevance determination (ARD) [4]. Expressed in full, this gives

$$P(\mathbf{w}\mid \alpha, H) = \frac{\mathrm{e}^{-G(\mathbf{w},\alpha)}}{Z_w(\alpha)}\,,\quad\text{where}\quad G(\mathbf{w},\alpha) = \frac{1}{2}\sum_{m=1}^{N_\alpha} \alpha_m \sum_{n=1}^{N_m} w_{mn}^2 \quad\text{and}\quad Z_w = \prod_{m=1}^{N_\alpha}\left(\frac{2\pi}{\alpha_m}\right)^{\frac{N_m}{2}}.$$

In principle, the best values for the regularization hyperparameters, i.e., the weight decay parameters α, are those which maximize their posterior probability

$$P(\alpha\mid D, H) = \frac{P(D\mid \alpha, H)\,P(\alpha\mid H)}{P(D\mid H)}\,.$$

However, the denominator of (31.4) cannot be obtained in closed form, so a Laplace approximation is typically applied around a stationary point of the loss function as a function of the weights. This amounts to a local Taylor expansion of

$$P(D\mid \alpha, H) = \int P(D\mid \mathbf{w}, \alpha, H)\,P(\mathbf{w}\mid \alpha, H)\,\mathrm{d}\mathbf{w} = \int \frac{\mathrm{e}^{-S(\mathbf{w},\alpha)}}{Z_w(\alpha)}\,\mathrm{d}\mathbf{w}\,,$$

where the linear term in the weights vanishes because of stationarity leading to

$$S^*(\mathbf{w},\alpha) \approx S(\mathbf{w}_{\mathrm{MP}},\alpha) + \frac{1}{2}\,(\mathbf{w} - \mathbf{w}_{\mathrm{MP}})^{\mathrm{T}}\mathbf{A}\,(\mathbf{w} - \mathbf{w}_{\mathrm{MP}})\,,$$

from which the posterior probability for the hyperparameter results

$$P(\alpha\mid D, H) \propto \frac{\mathrm{e}^{-S(\mathbf{w}_{\mathrm{MP}},\alpha)}}{Z_w(\alpha)}\,(2\pi)^{\frac{N_w}{2}}\,\det(\mathbf{A})^{-1/2}\,.$$

In practice, what this means is that the log-odds ratio, given by the activation of the output node of the MLP, can be assumed to have a univariate normal distribution whose variance is determined by the Hessian of $S$ with respect to the weights together with the gradient $\mathbf{g}$ of the activation $a$ with respect to the weights, namely

$$P(a\mid X, D) = \frac{1}{(2\pi s^2)^{1/2}}\,\mathrm{e}^{-\frac{(a - a_{\mathrm{MP}})^2}{2 s^2}}\,,$$

with $a_{\mathrm{MP}}$ denoting the most probable value of the activation function, i.e., the direct output of the MLP without marginalization, and

$$s^2(\mathbf{x}) = \mathbf{g}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{g}\,.$$

The so-called marginalized estimate of the MLP output is now the posterior distribution integrated over the activation $a$. In the above expression, $\mathbf{g}$ is the gradient of the activation with respect to the network weights and $\mathbf{A}$ is the corresponding Hessian, hence the matrix of second partial derivatives. For binary classification and single-risk modeling, this is given by a neat analytical expression

$$h(\mathbf{x}_i, l) = \int g(a)\,P(a\mid X_i = \mathbf{x}_{i,l}, D)\,\mathrm{d}a = g\!\left(\frac{a_{\mathrm{MP}}(\mathbf{x}_i, l)}{\sqrt{1 + (\pi/8)\,\mathbf{g}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{g}}}\right)$$
(31.5)

with $g(\cdot)$ denoting the sigmoid function. This adjustment to the original MLP output, i.e., $a_{\mathrm{MP}}$, shows the regularization process in operation: stationary points, where the weights are well defined, have small variance $s^2$ and therefore their value remains almost unchanged. Conversely, flat valleys in the loss function, where stationary points for the weights have broad Gaussian distributions, are penalized by reducing the value of the argument of the sigmoid function in (31.5) toward nil, reflecting an increase in uncertainty by shifting the MLP output toward the don’t know threshold.
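The moderation step of (31.5) can be illustrated in isolation. The sketch below assumes a most probable activation $a_{\mathrm{MP}}$, a gradient vector $\mathbf{g}$, and a Hessian $\mathbf{A}$ are already available (all numbers are illustrative, not taken from a trained network), and shows how a broad posterior shrinks the output toward 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def moderated_output(a_mp, g, A):
    """Marginalized output of (31.5): the sigmoid argument is shrunk toward zero
    when the activation variance s^2 = g^T A^{-1} g is large."""
    s2 = g @ np.linalg.solve(A, g)
    return sigmoid(a_mp / np.sqrt(1.0 + (np.pi / 8.0) * s2)), s2

a_mp = 2.0                                  # raw network activation (log-odds)
g = np.array([0.4, -0.2, 0.7])              # gradient of the activation w.r.t. the weights
A_sharp = np.diag([50.0, 40.0, 60.0])       # sharply peaked posterior (small variance)
A_flat  = np.diag([0.5, 0.4, 0.6])          # flat valley in the loss (large variance)

for name, A in [("sharp posterior", A_sharp), ("flat posterior", A_flat)]:
    p, s2 = moderated_output(a_mp, g, A)
    print(f"{name}: s^2 = {s2:.3f}, raw output = {sigmoid(a_mp):.3f}, moderated = {p:.3f}")
```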

A probabilistic alternative to discriminative approaches consists of generative models, where Bayes’ theorem is once again put into practice to estimate the posterior probability of class membership $P(C_k\mid X)$ from the class conditional density functions $P(X\mid C_k)$ and prior probabilities for the classes $P(C_k)$, that is

$$P(C_k\mid X) = \frac{P(X\mid C_k)\,P(C_k)}{\sum_{k'} P(X\mid C_{k'})\,P(C_{k'})}\,,$$
(31.6)

where classes are indexed by $k$ and the sum rule has been used to expand the denominator. Suitable models for the probability density functions (pdf) of the data given each class will depend on the nature of the data. However, it is straightforward to show for two classes that if the pdfs are normal distributions with equal variance, then the posterior probability will have exactly the functional form of the logistic regression model. This can be taken as an explanation in probabilistic terms of the potential limitations of this linear model, since different classes in practice tend to have distinct variances, even when the data sets for each class are approximately normally distributed. A natural extension of this approach is to use a mixture of Gaussian distributions. This is a very flexible model that can also parameterize multimodal density functions. In the interest of space, we refer the interested reader to a standard textbook [6].
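A minimal sketch of a generative classifier built from (31.6) follows: one-dimensional Gaussian class-conditional densities and empirical priors are fitted to synthetic data and combined by Bayes’ theorem; the data and the single-Gaussian class models are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fit class-conditional 1-D Gaussians P(X|C_k) and priors P(C_k), then apply (31.6)
x0 = rng.normal(0.0, 1.0, 300)     # class 0 samples
x1 = rng.normal(2.0, 1.5, 200)     # class 1 samples (different variance)

params = [(x0.mean(), x0.std()), (x1.mean(), x1.std())]
priors = np.array([len(x0), len(x1)], dtype=float)
priors /= priors.sum()

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def posterior(x):
    lik = np.array([gaussian_pdf(x, mu, s) for mu, s in params])
    joint = lik * priors                      # P(X|C_k) P(C_k)
    return joint / joint.sum()                # normalize by the sum rule, as in (31.6)

for x in (-1.0, 1.0, 3.0):
    print(f"x = {x:+.1f}  ->  P(C_k | x) = {np.round(posterior(x), 3)}")
```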

The two approaches of discriminative and generative models may be combined by using generative models to build kernels. These kernels define similarity between two covariate vectors $\mathbf{x}$ and $\mathbf{x}'$ by correlation between the respective pdfs, with the values of the kernel function given by $k(\mathbf{x}, \mathbf{x}') = P(X = \mathbf{x})\,P(X = \mathbf{x}')$ for suitable choices of the probability functions. A kernel so designed will naturally form a Gram matrix. Such kernels lead naturally to the use of latent variables

$$k(\mathbf{x}, \mathbf{x}') = \sum_i P(X = \mathbf{x}\mid Z = i)\,P(X = \mathbf{x}'\mid Z = i)\,P(Z = i)\,,$$

with weighting coefficients $P(Z)$ reflecting the strength of the latent variable $Z$ indexed by $i$. An example of this approach in practice will be seen in the HMMs later in this chapter (see Sect. 31.4.2).

2 Graphical Models

In this section, we give a basic introduction to graphical models, a general framework for dealing with uncertainty in a computationally efficient way. The probabilistic models that we treat in the next sections belong to this framework. Here, we introduce the two main classes of graphical models, Bayesian and Markov networks, discussing different methods for performing probabilistic inference. Specific instances of learning within this framework are given in the next two sections. For the sake of presentation, we limit ourselves here to discrete random variables; however, graphical models can be defined on continuous or mixed variables. The material covered in this section is based on [46, 47, 6].

A graphical model allows us to represent a family of joint probability distributions in terms of a directed or undirected graph, where nodes are associated with random variables, and edges represent some form of direct probabilistic interaction between variables. Being able to compactly represent the joint probability distribution of a set of random variables $\mathbf{X} = \{X_1, \ldots, X_n\}$ is very important: any probabilistic query involving the variables $X_1, \ldots, X_n$ can be answered by knowing their joint probability distribution $P(X_1, \ldots, X_n)$. For example, assume the variables to be discrete, and suppose we want to know the posterior probability of $X_1$ and $X_2$ given all the other variables, i.e., $P(X_1, X_2 \mid X_3, \ldots, X_n)$. We can easily answer this query by computing

$$P(X_1, X_2 \mid X_3, \ldots, X_n) = \frac{P(X_1, \ldots, X_n)}{\sum_{x_1 \in \mathrm{dom}(X_1)}\sum_{x_2 \in \mathrm{dom}(X_2)} P(X_1 = x_1, X_2 = x_2, X_3, \ldots, X_n)}\,.$$

Unfortunately, storing the joint probability values associated with all the different assignments $x_1, \ldots, x_n$ is not feasible: if $d_i$ is the size of $\mathrm{dom}(X_i)$, the number of different assignments is $\prod_{i=1}^{n} d_i$, i.e., exponential in the number of variables. This situation, however, constitutes the worst case. In fact, in many application domains, independence properties allow us to factorize the joint distribution into compact parts which can be stored efficiently. Graphical models provide the language to compactly represent these factors, enabling in many cases inference and learning over a compact parameterization of the joint distribution as graphical manipulations.

Graphical models can be characterized according to the type of probabilistic interaction between variables they model. Directed graphs (Bayesian networks) are used to express causal relationships between random variables (i.e., cause–effect relationships), while undirected graphs (Markov networks) are better suited to express probabilistic constraints among subsets of variables to which it is difficult to ascribe a directionality (graphical models containing both directed and undirected edges are possible; however, they will not be covered here). In both cases, the joint distribution is factorized according to the notion of conditional independence.

Definition 31.1 Conditional Independence

Let $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ be sets of random variables with $X_i \in \mathbf{X}$, $Y_i \in \mathbf{Y}$, $Z_i \in \mathbf{Z}$. $\mathbf{X}$ is conditionally independent of $\mathbf{Y}$ given $\mathbf{Z}$ (denoted as $\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}$) in a distribution $P$ if, for all values $x_i \in \mathrm{dom}(X_i)$, $y_i \in \mathrm{dom}(Y_i)$, $z_i \in \mathrm{dom}(Z_i)$,

$$P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y} \mid \mathbf{Z} = \mathbf{z}) = P(\mathbf{X} = \mathbf{x} \mid \mathbf{Z} = \mathbf{z}) \times P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Z} = \mathbf{z})\,,$$

where $\mathbf{X} = \mathbf{x}$ denotes $X_1 = x_1, \ldots, X_{n_X} = x_{n_X}$, $\mathbf{Y} = \mathbf{y}$ denotes $Y_1 = y_1, \ldots, Y_{n_Y} = y_{n_Y}$, $\mathbf{Z} = \mathbf{z}$ denotes $Z_1 = z_1, \ldots, Z_{n_Z} = z_{n_Z}$, and $n_X = |\mathbf{X}|$, $n_Y = |\mathbf{Y}|$, $n_Z = |\mathbf{Z}|$.

It is not difficult to see that if $\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}$, then it is also true that $P(\mathbf{X}\mid \mathbf{Y}, \mathbf{Z}) = P(\mathbf{X}\mid \mathbf{Z})$. In fact, using the product rule for probabilities, we have $P(\mathbf{X}, \mathbf{Y}\mid \mathbf{Z}) = P(\mathbf{X}\mid \mathbf{Y}, \mathbf{Z})\,P(\mathbf{Y}\mid \mathbf{Z})$.

In the following, we will discuss how conditional independence is used within Bayesian and Markov networks to factorize the joint distribution. Inference and learning will be discussed as well.

2.1 Bayesian Networks

Bayesian networks are directed acyclic graphs used to model causal relationships between random variables: an edge $X_1 \rightarrow X_2$ is used to express the fact that variable $X_1$ (cause) influences variable $X_2$ (effect). This interpretation, in conjunction with the exploitation of conditional independence where applicable, allows the efficient probabilistic modeling of many relevant application domains. In general, the product rule can be used to factorize the joint distribution of variables $X_1, X_2, X_3, \ldots, X_n$ as

$$P(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, \ldots, X_{i-1})\,.$$
(31.7)

The conditional independence relationships can be used to simplify the form of each factor in (31.7), i.e., by eliminating variables from the conditioning part, thus drastically reducing the number of probability values that need to be specified to define the factor. For example, if we assume that all the variables are Boolean, then the number of entries needed to define $P(X_n \mid X_1, X_2, \ldots, X_{n-1})$ would be $2^{n-1}$. If we consider a simple scenario in which the variable $X_n$ depends only on $X_{n-1}$, the corresponding simplified factor becomes $P(X_n \mid X_1, X_2, \ldots, X_{n-1}) = P(X_n \mid X_{n-1})$, which only requires two entries.

The naïve Bayes model used in classification tasks can be understood as a Bayesian network, where the variable associated with the class label $C$ is the cause and the variables $X_1, \ldots, X_n$ used to describe the attributes of the current input are the effects. The underlying conditional independence assumption is fairly simplistic, but allows a very parsimonious factorization of the joint distribution. By assuming that the class label does not depend on the attributes, and that the attributes are conditionally independent of each other given the class label, i.e., $\forall i, j\;\; P(X_i, X_j \mid C) = P(X_i \mid C)\,P(X_j \mid C)$, naïve Bayes factorizes the joint distribution as

$$P(C, X_1, X_2, X_3, \ldots, X_n) = P(C)\prod_{i=1}^{n} P(X_i \mid C)\,.$$

The details of this model are not discussed in this chapter, but a good didactic reference is [6].

In general, after simplification via conditional independence, factors are of the form $P(X_i \mid X_{j_1}, \ldots, X_{j_k})$, where $X_{j_1}, \ldots, X_{j_k}$ are denoted as the parents of $X_i$, and the notation $\mathrm{pa}(X_i) = \{X_{j_1}, \ldots, X_{j_k}\}$ is used. The factor associated with variable $X_i$ can thus be rewritten as $P(X_i \mid \mathrm{pa}(X_i))$ and the joint distribution as

$$P(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))\,.$$
(31.8)

The graphical representation of a Bayesian network is shown in Fig. 31.1. The graphical model includes one node for each involved variable. Moreover, a variable that is conditioned (effect) with respect to a parent one (cause) receives a directed edge from that variable. For example, in the Bayesian network represented in Fig. 31.1, we have $\mathrm{pa}(X_7) = \{X_2, X_3\}$, i.e., the set constituted by the two nodes from which $X_7$ receives an edge. This means that the factor associated with $X_7$ is $P(X_7 \mid X_2, X_3)$. In Fig. 31.1, we have reported one popular way to specify the parameters of $P(X_7 \mid X_2, X_3)$ when the involved variables are discrete, i.e., the conditional probability distribution table (CPD table). The CPD of $X_7$ in Fig. 31.1, for instance, reports the probability of $X_7 = t$ given each possible assignment of values to its parents. The CPD table associated with $X_5$ is reported as well. By using the CPD tables associated with all nodes, the joint distribution can be rewritten as

$$P(X_1, \ldots, X_7) = P(X_1)\,P(X_3)\,P(X_2\mid X_1)\,P(X_7\mid X_2, X_3)\,P(X_5\mid X_7)\,P(X_6\mid X_7)\,P(X_4\mid X_6)\,.$$
Fig. 31.1 An example of a Bayesian network. Conditional probability tables are shown only for variables X5 and X7. Different types of probabilistic influence among variables are highlighted
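The factorization above is easy to evaluate programmatically. The sketch below does so for Boolean variables with the structure of Fig. 31.1; since the numeric CPD entries of the figure are not reproduced here, the values used are hypothetical placeholders.

```python
import itertools

# Hypothetical CPD entries for the structure of Fig. 31.1 (Boolean variables);
# the numbers are illustrative placeholders, not the values of the figure.
p_x1_t, p_x3_t = 0.3, 0.6                                    # P(X1=t), P(X3=t)
p_x2_t = {True: 0.8, False: 0.1}                             # P(X2=t | X1)
p_x7_t = {(True, True): 0.9, (True, False): 0.5,
          (False, True): 0.4, (False, False): 0.05}          # P(X7=t | X2, X3)
p_x5_t = {True: 0.7, False: 0.2}                             # P(X5=t | X7)
p_x6_t = {True: 0.6, False: 0.3}                             # P(X6=t | X7)
p_x4_t = {True: 0.5, False: 0.1}                             # P(X4=t | X6)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(x1, x2, x3, x4, x5, x6, x7):
    """P(X1..X7) = P(X1)P(X3)P(X2|X1)P(X7|X2,X3)P(X5|X7)P(X6|X7)P(X4|X6)."""
    return (bern(p_x1_t, x1) * bern(p_x3_t, x3) * bern(p_x2_t[x1], x2) *
            bern(p_x7_t[(x2, x3)], x7) * bern(p_x5_t[x7], x5) *
            bern(p_x6_t[x7], x6) * bern(p_x4_t[x6], x4))

# Sanity check: the joint sums to 1, and any marginal can be read off by summation.
states = list(itertools.product([True, False], repeat=7))
total = sum(joint(*s) for s in states)
p_x7_marg = sum(joint(*s) for s in states if s[6])
print("sum of joint =", round(total, 10), " P(X7 = t) =", round(p_x7_marg, 4))
```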

Note that different distributions can be obtained by using different values for the entries of the CPD tables. Thus, a Bayesian network actually represents a family of distributions: all the distributions that are consistent with the conditional independence assumptions used to simplify the factors. In fact, up to now, we have discussed how, starting from a universal decomposition of the joint distribution via the product rule (note that such a decomposition is not unique, as it depends on the presentation order assigned to the variables), a set of conditional independence assumptions can be used to simplify the factors, leading to the corresponding graphical representation given by the Bayesian network. An important question, however, is whether the topological structure of a Bayesian network allows for the direct identification of other (conditional) independence relationships, i.e., whether there exist other (conditional) independence relationships that must hold for any joint distribution P that is compatible with the structure of a specific Bayesian network (note that additional relationships may hold only for some specific distributions, i.e., some specific assignment of values to the entries of the CPD tables). As we will see later, the answer to this question is important to devise general-purpose inference algorithms on Bayesian networks. A general procedure, called d-separation (directed separation), can answer the question. It is based on the observation that two variables are not independent if one can influence the other via one or more paths in the graph. Let us exemplify this concept on the Bayesian network reported in Fig. 31.1, where we have highlighted four different basic cases:

  1. Indirect causal effect: X1 can influence X7 via X2 if and only if X2 is not observed (a variable is said to be observed if the value assigned to that variable is known).

  2. Indirect evidential effect: X4 can influence X7 via X6 if and only if X6 is not observed.

  3. Common cause: X5 can influence X6 (and vice versa) via X7 if and only if X7 is not observed.

  4. Common effect: X2 can influence X3 (and vice versa) if and only if either X7 or one of X7’s descendants (in this case, X5, X6, X4) is observed.

The topological structure encountered in the common effect case is called a v-structure and it plays a relevant role in the d-separation procedure. In general, it is clear from the above that probabilistic influence does not follow edge direction. Thus, when considering a longer trail, e.g., the path from X1 to X4, we have to consider whether each part of the trail allows probabilistic influence to flow or not (according to the four basic cases described above).

Definition 31.2 Active Trail

Let $X_1, \ldots, X_k$ be a trail in a Bayesian network $G$, and $\mathbf{E}$ be a subset of observed variables in $G$. The trail $X_1, \ldots, X_k$ is active given $\mathbf{E}$ if:

  • Whenever a v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$ occurs, $X_i$ or one of its descendants belongs to $\mathbf{E}$;

  • No other node along the trail belongs to $\mathbf{E}$.

Of course, by definition, if $X_1 \in \mathbf{E}$ or $X_k \in \mathbf{E}$ the trail is not active. Examples of active/inactive trails from the Bayesian network represented in Fig. 31.1 are: the trail $X_1, X_2, X_7, X_6, X_4$ is active given the set $\mathbf{E} = \{X_3, X_5\}$, while it is not active whenever either $X_2$ or $X_7$ or $X_6$ belongs to $\mathbf{E}$; on the other hand, the trail $X_1, X_2, X_7, X_3$ is active if $X_2 \notin \mathbf{E}$ and either $X_7$ or $X_5$ or $X_6$ or $X_4$ belongs to $\mathbf{E}$.

The Bayesian network represented in Fig. 31.1 does not allow more than one trail between any couple of nodes. In general, however, two nodes may have several trails connecting them, and one node can influence the other as long as there exists at least one active trail between them. This intuition is captured by the definition of the concept of d-separation.

Definition 31.3 d-Separation

Let $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ be nonintersecting sets of nodes of a Bayesian network. $\mathbf{X}$ and $\mathbf{Y}$ are d-separated given $\mathbf{Z}$ if there is no active trail between any node $X \in \mathbf{X}$ and $Y \in \mathbf{Y}$ given $\mathbf{Z}$.

The d-separation test can be used to precisely characterize the independence relationships which hold for probabilistic distributions that factorize according to the given Bayesian network.

In the following, we introduce another class of graphical models, i. e., Markov networks, which are described by undirected graphs.

2.2 Markov Networks

Directed edges in Bayesian networks are suited to describe causal relationships between random variables. In many cases, however, the probabilistic interaction between two variables is not directional. In these cases, it is natural to consider undirected graphs, i.e., Markov networks. An undirected edge between variables $X$ and $Y$ represents a probabilistic constraint between the two variables. On the other hand, if $X$ and $Y$ are not connected, then we can state a conditional independence assertion involving them if and only if there are no active trails connecting them in the graph. Note that, since edges are now undirected, a trail is not active if and only if any of the variables in the trail is observed. This leads us to discuss which kind of joint distribution factorization a Markov network represents.

If we go back to the concept of active trail, it is clear that if we consider a subset $S$ of fully connected nodes in the undirected graph, i.e., nodes in $S$ are connected to each other, then any $X, Y \in S$ will be connected by so many trails involving nodes in $S \setminus \{X, Y\}$ that it is wise to consider a single factor $\phi_S$ involving all nodes in $S$. Technically, $S$ is called a clique, and we are actually interested in maximal cliques, i.e., cliques which cannot be extended in size by considering another node of the graph. For example, the maximal cliques of the Markov network given in Fig. 31.2 are

$$c_1 = \{X_1, X_3, X_5\}\,,\quad c_2 = \{X_1, X_2\}\,,\quad c_3 = \{X_2, X_4\}\,,\quad c_4 = \{X_3, X_4\}\,.$$

Note that, while $\{X_1, X_5\}$ is a clique, it is not maximal since we can add $X_3$, obtaining a larger clique.

Fig. 31.2 An example of a Markov network involving five variables. Maximal cliques and corresponding potential functions are highlighted. An example of a potential function is given for clique $\{X_2, X_4\}$, where we have assumed that $X_2$ and $X_4$ are Boolean variables

A different factor can be associated with each maximal clique $c_i$. By using a global normalization constant for the joint distribution factorization, a factor associated with a clique $c_i$ can be modeled by a potential function $\phi_{c_i}(\cdot)$, i.e., any nonnegative function (see Fig. 31.2 for an example involving Boolean variables). Thus, the factorization of the joint distribution for the example in Fig. 31.2 is

$$P(X_1, X_2, X_3, X_4, X_5) = \frac{1}{Z}\,\phi_{c_1}(X_1, X_3, X_5)\,\phi_{c_2}(X_1, X_2)\,\phi_{c_3}(X_2, X_4)\,\phi_{c_4}(X_3, X_4)\,,$$

where the normalization constant

$$Z = \sum_{x_1, \ldots, x_5} \phi_{c_1}(x_1, x_3, x_5)\,\phi_{c_2}(x_1, x_2)\,\phi_{c_3}(x_2, x_4)\,\phi_{c_4}(x_3, x_4)$$

is called the partition function. If with $\mathbf{x}$ we denote an assignment of values to the variables $X_1, \ldots, X_n$ and with $\mathbf{x}_{c_i}$ the corresponding assignment associated with the variables in the clique $c_i$, the general formulas for a Markov network are

$$P(X_1, \ldots, X_n) = \frac{1}{Z}\prod_{c_i}\phi_{c_i}(\mathbf{x}_{c_i})\,,$$

where

$$Z = \sum_{\mathbf{x}}\,\prod_{c_i}\phi_{c_i}(\mathbf{x}_{c_i})\,.$$

If the potential functions are restricted to be strictly positive, then it is possible to find a precise correspondence between factorization and conditional independence. In fact, if we consider the set of all possible distributions defined over variables of a given Markov network, then the set of such distributions that are consistent with the conditional independence statements that can be derived by using the adapted concept of active trails and d-separation coincides with the set of distributions that can be expressed as a factorization of the form given above with respect to maximal cliques of the network (Hammersley–Clifford theorem).

For practical reasons, it is convenient to express a strictly positive potential function as a Boltzmann distribution, i. e.,

$$\phi_{c_i}(\mathbf{x}_{c_i}) = \mathrm{e}^{-E(\mathbf{x}_{c_i})}\,,$$

where E ( x c i ) is called an energy function. Since the joint distribution is the product of potentials, the total energy is obtained by adding the energy functions of each of the maximal cliques. Energy functions are very useful since, in the absence of a specific probabilistic interpretation for the potential functions, assignments of values that have high probability can be given low energies, while less probable assignments will correspond to high energies.

Let us give an example of an application of Markov networks: image de-noising. The task is to remove noise from a binary image $\mathbf{Y}$ where the pixels $Y_i$ are −1 or +1. Each observed pixel $Y_i$ is obtained from a noise-free image $\mathbf{X}$ with pixels $X_i$ where, with some small probability, the sign of the pixel is flipped. Since neighboring pixels in the noise-free image are strongly correlated, as are the two variables $Y_i$ and $X_i$ due to the small flipping probability, we can use a Markov network like the one depicted in Fig. 31.3 to capture this knowledge. The total energy function encoding such prior knowledge would be

$$E(\mathbf{X}, \mathbf{Y}) = -\beta \sum_{\{i, j\}} X_i X_j - \eta \sum_{i} X_i Y_i\,,$$

where the first sum runs over pairs of neighboring pixels of the noise-free image, so that all the maximal cliques are considered and pairs of pixels with the same sign get lower energy values. Since we are interested in removing noise from the observed pixels $Y_i$, we add a bias toward pixel values that have one particular sign, by adding a term $h X_i$ to the energy function for each pixel in the noise-free image

$$E(\mathbf{X}, \mathbf{Y}) = h\sum_{i} X_i - \beta \sum_{\{i, j\}} X_i X_j - \eta \sum_{i} X_i Y_i\,.$$

Note that this operation is legal since it corresponds to multiplying the potential functions, which are arbitrary nonnegative functions, by a nonnegative function.

Fig. 31.3 A Markov network for image de-noising. $Y_i$ is the binary variable representing the state of pixel $i$ in the noisy observed image, while $X_i$ refers to the noise-free image

The factorized joint distribution over Y and X is then defined as

$$P(\mathbf{X}, \mathbf{Y}) = \frac{1}{Z}\,\mathrm{e}^{-E(\mathbf{X}, \mathbf{Y})}\,.$$

Probabilistic inference can now be performed by clamping the value of $\mathbf{Y}$ to the observed image, which implicitly defines a conditional distribution $P(\mathbf{X}\mid \mathbf{Y})$ over noise-free images, and by computing the assignment to $\mathbf{X}$ that minimizes the total energy of the Markov model, i.e., the assignment of values to the pixels of $\mathbf{X}$ with highest probability given the observed image $\mathbf{Y}$. The resulting assignment of values to $\mathbf{X}$ will return the (presumed) noise-free version of $\mathbf{Y}$.
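One simple (greedy) way to search for a low-energy assignment is iterated conditional modes, which repeatedly sets each pixel to whichever value lowers the energy given its neighbors. The sketch below applies this idea to a synthetic binary image; the coefficients $h$, $\beta$, $\eta$, the noise level, and the choice of ICM rather than a more sophisticated optimizer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic noise-free binary image (values -1/+1) and a noisy observation Y
clean = -np.ones((32, 32), dtype=int)
clean[8:24, 8:24] = 1
noise = rng.random(clean.shape) < 0.1                  # flip ~10% of the pixels
Y = np.where(noise, -clean, clean)

h, beta, eta = 0.0, 1.0, 2.1                           # energy coefficients (arbitrary choices)
X = Y.copy()                                           # initialize the reconstruction with the observation

def local_energy(X, Y, i, j, value):
    """Contribution of pixel (i, j) to E(X, Y) when X[i, j] = value."""
    neighbors = sum(X[a, b] for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                    if 0 <= a < X.shape[0] and 0 <= b < X.shape[1])
    return h * value - beta * value * neighbors - eta * value * Y[i, j]

# Iterated conditional modes: greedily pick the lower-energy value for each pixel in turn
for _ in range(10):
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            X[i, j] = 1 if local_energy(X, Y, i, j, 1) < local_energy(X, Y, i, j, -1) else -1

print("pixels wrong before:", int((Y != clean).sum()), " after:", int((X != clean).sum()))
```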

In the following, we briefly present different approaches to perform probabilistic inference in Bayesian and Markov networks.

2.3 Inference

Performing probabilistic inference in a graphical model over a set of random variables $\mathbf{X}$ means being able to answer any probabilistic query involving $\mathbf{X}$. Since a graphical model, either a Bayesian or a Markov network, describes a factorization of the joint distribution, any probabilistic query can be answered, so the problem reduces to finding efficient procedures to perform inference. In the following, we report some of the most typical forms of queries:

  • Conditional: In this case, we are interested in computing $P(\mathbf{Y}\mid \mathbf{E} = \mathbf{e})$, where $\mathbf{Y}, \mathbf{E} \subseteq \mathbf{X}$, with $\mathbf{Y} \cap \mathbf{E} = \emptyset$, where $\mathbf{Y}$ are the query variables and $\mathbf{E} = \{E_1, \ldots, E_k\}$ are the evidence variables for which specific values $\mathbf{e} = \{e_1, \ldots, e_k\}$ have been observed.

  • Most probable assignment: Given evidence $\mathbf{E} = \mathbf{e}$, we are interested in computing the most likely assignment $\mathbf{y}^*$ to $\mathbf{Y} \subseteq \mathbf{X} \setminus \mathbf{E}$. There are two main variants for this kind of query: most probable explanation (MPE) and maximum a posteriori (MAP). An MPE query must solve the problem

    $$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y}, \mathbf{E} = \mathbf{e})\,,$$

    where $\mathbf{Y} = \mathbf{X} \setminus \mathbf{E}$, while a MAP query must solve the problem

    $$\mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{\mathbf{z}} P(\mathbf{Y} = \mathbf{y}, \mathbf{Z} = \mathbf{z} \mid \mathbf{E} = \mathbf{e})\,,$$

    where $\mathbf{Z} = \mathbf{X} \setminus \mathbf{E} \setminus \mathbf{Y}$.

From the point of view of inference, both directed and undirected networks can be treated in the same way. In fact, directed networks can be converted to undirected networks. This is done by observing that factors in directed networks can be understood as factors corresponding to cliques in an undirected graph obtained by mutually connecting all the parents of each node by new undirected edges and by dropping direction from the original directed edges. This procedure is known as moralization and the resulting undirected graph is the moral graph. By this means, all the variables involved in factors of the directed graph (e. g., CPTs) will be contained in corresponding cliques of the moral graph. Thus, we can focus on undirected graphs.

From a computational point of view, in the worst case, probabilistic inference is difficult: every type of probabilistic inference in graphical models is NP-hard or harder. Specifically, the complexity of inference is related to a topological property of the graphical network called the treewidth. Approximate inference methods have been devised to deal with such computational complexity. Unfortunately, approximate inference also turns out to be hard in the worst case. Nevertheless, if the treewidth of the graphical network is not too large (e.g., in polytrees), exact inference can be performed in a reasonable amount of time. Moreover, in many practical cases, approximate inference is efficient and adequate.

There are three major approaches to performing inference: exact algorithms, sampling algorithms, and variational algorithms. The first tries to compute the exact probabilities while avoiding repeated computations. The second approach aims to efficiently approximate probabilities by sampling, in a smart way, the universe of events. Finally, the third approach allows us to treat both exact and approximate inference within the same conceptual framework. In the following, we briefly sketch the main ideas underpinning these approaches.

2.3.1 Exact Algorithms

Let us illustrate one of the basic ideas of exact algorithms, i.e., variable elimination, by using the Markov network shown in Fig. 31.2, where we assume all variables to be Boolean. Suppose we are interested in computing the marginal probability $P(X_2)$. We can get it by summing the factorized joint distribution over the remaining variables

$$P(X_2) = \sum_{x_1}\sum_{x_3}\sum_{x_4}\sum_{x_5} \frac{1}{Z}\,\phi(X_1, X_3, X_5)\,\phi(X_1, X_2)\,\phi(X_2, X_4)\,\phi(X_3, X_4)\,.$$

Naïve computation of the above equation would require $O(2^5)$ operations, since each summand involves five Boolean variables. However, we can rearrange the summands in a smarter way

$$\begin{aligned}
P(X_2) &= \frac{1}{Z}\sum_{x_1}\phi(X_1, X_2)\sum_{x_4}\phi(X_2, X_4)\sum_{x_3}\phi(X_3, X_4)\sum_{x_5}\phi(X_1, X_3, X_5)\\
&= \frac{1}{Z}\sum_{x_1}\phi(X_1, X_2)\sum_{x_4}\phi(X_2, X_4)\sum_{x_3}\phi(X_3, X_4)\,m_5(X_1, X_3)\\
&= \frac{1}{Z}\sum_{x_1}\phi(X_1, X_2)\sum_{x_4}\phi(X_2, X_4)\,m_3(X_1, X_4)\\
&= \frac{1}{Z}\sum_{x_1}\phi(X_1, X_2)\,m_4(X_1, X_2)\\
&= \frac{1}{Z}\,m_1(X_2)\,,
\end{aligned}$$

where the $m_i$ terms are the intermediate factors obtained by summation over variable $X_i$. Note that $Z$ can be computed by summing over variable $X_2$. Moreover, the total computational complexity reduces to $O(2^3)$ since no more than three variables occur together in any summand. In general, the maximal number of variables that occur in any summand is determined by the elimination order. Since many different elimination orders may be used, the lowest complexity is obtained by the order that minimizes this maximal number, which is related to the treewidth of the graph. Unfortunately, finding the optimal elimination order is NP-hard.
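The elimination order used above can be reproduced directly with array operations. The sketch below does so for the Markov network of Fig. 31.2 with Boolean variables and random potential tables (placeholders, since no numeric potentials are given), and checks the result against brute-force enumeration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Random nonnegative potentials for the maximal cliques of Fig. 31.2 (Boolean variables);
# axis order follows the variable names in the comments.
phi_135 = rng.random((2, 2, 2))   # phi(X1, X3, X5)
phi_12  = rng.random((2, 2))      # phi(X1, X2)
phi_24  = rng.random((2, 2))      # phi(X2, X4)
phi_34  = rng.random((2, 2))      # phi(X3, X4)

# Brute force: build the full joint over (X1..X5) and marginalize
joint = (phi_135[:, None, :, None, :] * phi_12[:, :, None, None, None] *
         phi_24[None, :, None, :, None] * phi_34[None, None, :, :, None])
Z = joint.sum()
p_x2_brute = joint.sum(axis=(0, 2, 3, 4)) / Z

# Variable elimination with order X5, X3, X4, X1 (cf. the derivation in the text)
m5 = phi_135.sum(axis=2)                                   # m5(X1, X3)
m3 = (phi_34[None, :, :] * m5[:, :, None]).sum(axis=1)     # m3(X1, X4)
m4 = (phi_24[None, :, :] * m3[:, None, :]).sum(axis=2)     # m4(X1, X2)
m1 = (phi_12 * m4).sum(axis=0)                             # m1(X2), unnormalized
p_x2_elim = m1 / m1.sum()

print("brute force:", np.round(p_x2_brute, 4))
print("elimination:", np.round(p_x2_elim, 4))
```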

One positive aspect of the elimination approach is that it also works for continuous variables, since it is only based on the topology of the graph. However, the elimination procedure returns only a single marginal probability, while it is often of interest to compute more than one marginal probability. Luckily, we can generalize the idea to efficiently compute all the single marginals. Here we give some hints on how to do it. Consider the sequence of intermediate factors generated in the example above. They can be indexed by the variables in their scope, i.e., $\psi_{1,3,5} = \phi(X_1, X_3, X_5)$, $\psi_{1,3,4} = \phi(X_3, X_4)\,m_5(X_1, X_3)$, $\psi_{1,2,4} = \phi(X_2, X_4)\,m_3(X_1, X_4)$, $\psi_{1,2} = \phi(X_1, X_2)\,m_4(X_1, X_2)$. Graphically, we can represent them via a cluster graph, where each node is associated with a subset of variables (i.e., the scope of intermediate factors) and the undirected edges support the flow of computation of the elimination process. In our example, the cluster graph is shown in Fig. 31.4, where we have shown the direction of the flow of computation under each edge, and the scope of the computed factor transmitted to the other node after variable elimination over each edge. The variable $X_2$ in the rightmost node is underlined to emphasize that it is the target of the flow of computation. In general, since each edge is associated with a variable elimination, it is not difficult to realize that the cluster graph is in fact a tree (called a clique tree or junction tree). This structure can also be used for computing other marginals. In order to see that, we have to observe that the scope of the rightmost node is a subset of the scope of the node at its left, so it can be merged with this last node; moreover, each initial potential must be associated with a node with consistent scope.

Now, suppose we want to compute $P(X_3)$ by eliminating all the other variables. We have to select a node which contains $X_3$ in its scope, e.g., the middle node, and the flow of computation should now converge toward that node.

Any elimination order consistent with this flow will do the job, e.g., we first consider the leftmost node and eliminate $X_5$ by transmitting the message

$$m_5(X_1, X_3) = \sum_{x_5}\phi(X_1, X_3, X_5)$$

to the middle node. Then, we do the same for the rightmost node, by eliminating $X_2$ and transmitting the message

$$m_2(X_1, X_4) = \sum_{x_2}\phi(X_1, X_2)\,\phi(X_2, X_4)\,.$$

Finally, the middle node can merge the two received messages with the local potential obtaining

$$\phi(X_3, X_4)\,m_5(X_1, X_3)\,m_2(X_1, X_4)\,,$$

which is an unnormalized version of the joint distribution $P(X_1, X_3, X_4)$. The marginal $P(X_3)$ can then be computed by summing out $X_1$ and $X_4$ and normalizing the result. Note that the same flow can be used to compute $P(X_1)$ and $P(X_4)$: in the first case, the final stage will sum out $X_3$ and $X_4$, while in the second case it will sum out $X_1$ and $X_3$.

Fig. 31.4 Example of a cluster graph, where the direction of the flow of computation is shown under each edge, while the scope of the computed factor transmitted to the other node after variable elimination is shown over each edge

In general, all the factors needed by all the nodes to compute the marginals of the variables in their scope can be computed by a sum-product message passing scheme where, having selected an arbitrary node as root, messages are transmitted from the leaves up to the root and then back from the root to the leaves. If evidence is present, restricted potentials (i.e., potentials where evidence variables are bound to the observed values) are used. MPE and MAP queries can be answered by using a max-sum algorithm, which is a variation of the sum-product algorithm exploiting a trellis over all the values the variables can take. The message passing scheme sketched above can also be implemented using division, giving rise to the Belief Update algorithm.

2.3.2 Sampling Algorithms

The strategy adopted by sampling algorithms to perform (approximate) inference is to approximate the joint distribution via estimates computed on a set of representative instantiations of all, or some of, the variables of the graphical model. Unlike exact inference, some techniques are specialized for directed networks. For example, a simple approach to estimate the joint probability in a Bayesian network is Forward Sampling. It starts by considering any topological ordering of the variables, e. g., for the network in Fig. 31.1 the order X 1 , X 3 , X 2 , X 7 , X 5 , X 6 , X 4 will do the job. Then random samples are generated by following the order and picking a value for each variable according to its distribution. Note that variables with conditional distributions are considered only when specific values for their parents have already been generated, so that the conditional probability for those variables is fully specified. Once M full samples are generated in this way, the probability of a specific event P ( E = e ) is estimated as the fraction of samples where variables in E take values e. If the query is of the form P ( Y | E = e ) , samples which are not consistent with the evidence are rejected (rejection sampling) and the remaining samples are used to estimate the conditional distribution on variables Y. With this approach, however, a large number of the generated samples is discarded.
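
As an illustration of forward and rejection sampling, the sketch below uses a small hypothetical chain-structured Bayesian network X1 → X2 → X3 with made-up CPT values; it is not the network of Fig. 31.1, only a minimal stand-in.

```python
import numpy as np

# Hypothetical Boolean chain X1 -> X2 -> X3 (CPT values are illustrative).
rng = np.random.default_rng(0)
p_x1 = 0.6                                   # P(X1 = 1)
p_x2_given_x1 = {0: 0.2, 1: 0.7}             # P(X2 = 1 | X1)
p_x3_given_x2 = {0: 0.4, 1: 0.9}             # P(X3 = 1 | X2)

def forward_sample():
    # Follow a topological order, sampling each variable given its parents.
    x1 = int(rng.random() < p_x1)
    x2 = int(rng.random() < p_x2_given_x1[x1])
    x3 = int(rng.random() < p_x3_given_x2[x2])
    return x1, x2, x3

M = 50_000
samples = [forward_sample() for _ in range(M)]

# Estimate P(X1 = 1 | X3 = 1) by rejection sampling: discard samples
# inconsistent with the evidence X3 = 1.
kept = [s for s in samples if s[2] == 1]
estimate = sum(s[0] for s in kept) / len(kept)
print(f"kept {len(kept)} of {M} samples, P(X1=1 | X3=1) ~ {estimate:.3f}")
```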

An improvement on this aspect is given by the likelihood weighting algorithm, which is based on the observation that evidence variables can be forced to assume only the observed values in a sample as long as the sample is weighted by the likelihood of the evidence. This means that a weight is associated with each sample, given by the product of the conditional probabilities of the observed values for the evidence variables, i. e.,

\[ w_{\mathrm{sample}} = \prod_{E_i \in E} P(E_i = e_i \mid \mathrm{pa}(E_i)) . \]

Estimates are then computed considering weighted samples. Likelihood weighting turns out to be a special case of a more general approach called importance sampling which aims at estimating the expectation of a function relative to some distribution.
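
The following sketch applies likelihood weighting to the same hypothetical chain used above, clamping the evidence X3 = 1 and weighting each sample by the likelihood of the evidence.

```python
import numpy as np

# Same hypothetical chain X1 -> X2 -> X3 with illustrative CPT values.
rng = np.random.default_rng(0)
p_x1 = 0.6
p_x2_given_x1 = {0: 0.2, 1: 0.7}
p_x3_given_x2 = {0: 0.4, 1: 0.9}

num, den = 0.0, 0.0
for _ in range(50_000):
    x1 = int(rng.random() < p_x1)
    x2 = int(rng.random() < p_x2_given_x1[x1])
    w = p_x3_given_x2[x2]          # weight = likelihood of the evidence X3 = 1
    num += w * x1
    den += w
print(f"P(X1=1 | X3=1) ~ {num / den:.3f}")
```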

Improved sampling methods, which can also be applied to Markov networks, are given by Markov chain Monte Carlo methods. Unlike the methods described so far, these methods generate a sequence of samples, in such a way that later samples are generated by distributions that provably approximate with increasing precision the target posterior probability (i. e., the query P ( Y | E = e ) ).

The simplest of these methods uses Gibbs sampling: an initial assignment of values for the unobserved variables is generated from an initial distribution; subsequently, in turn, each unobserved variable is sampled using its posterior probability given the current sample for all other variables. This distribution can be computed efficiently by using only the factors associated with the Markov blanket, i. e., the neighbors of the variable to be resampled in the Markov network (in Bayesian networks, the Markov blanket of a node is given by the set of its parents, its children, and the parents of its children). Using the theory of Markov chains (discussed in Sect. 31.4.1), it is possible to show that, under some assumptions, the sequence of generated distributions converges to a stationary distribution, where the fraction of time in which a specific assignment of values to variables (sample) occurs in the sequence is exactly proportional to the posterior probability of that assignment.
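
A minimal Gibbs sampler for the same hypothetical chain is sketched below: each unobserved variable is resampled from its conditional distribution given the current values of its Markov blanket, with the evidence X3 = 1 clamped.

```python
import numpy as np

# Same hypothetical chain X1 -> X2 -> X3 with illustrative CPT values.
rng = np.random.default_rng(0)
p_x1 = 0.6
p_x2_given_x1 = {0: 0.2, 1: 0.7}
p_x3_given_x2 = {0: 0.4, 1: 0.9}

def sample_from_unnormalized(p1, p0):
    # Draw a Boolean value with unnormalized probabilities (p1, p0).
    return int(rng.random() < p1 / (p0 + p1))

x1, x2 = 0, 0                      # arbitrary initial assignment
counts_x1 = 0
burn_in, iters = 1_000, 50_000
for it in range(burn_in + iters):
    # Resample X1 given its Markov blanket {X2}.
    p1 = p_x1 * (p_x2_given_x1[1] if x2 else 1 - p_x2_given_x1[1])
    p0 = (1 - p_x1) * (p_x2_given_x1[0] if x2 else 1 - p_x2_given_x1[0])
    x1 = sample_from_unnormalized(p1, p0)
    # Resample X2 given its Markov blanket {X1, X3 = 1}.
    p1 = p_x2_given_x1[x1] * p_x3_given_x2[1]
    p0 = (1 - p_x2_given_x1[x1]) * p_x3_given_x2[0]
    x2 = sample_from_unnormalized(p1, p0)
    if it >= burn_in:
        counts_x1 += x1
print(f"P(X1=1 | X3=1) ~ {counts_x1 / iters:.3f}")
```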

A drawback of Gibbs sampling is that it uses only local moves (i. e., the resampling of a single variable), leading to very slow convergence for assignments with low probability. More effective methods, based on the Metropolis–Hastings approach, allow a broader range of moves. Further, more advanced approaches allow us to consider partial assignments in conjunction with a closed-form distribution for the unassigned variables. Others use deterministic methods to explicitly search for high-probability assignments to approximate the joint distribution.

2.3.3 Variational Algorithms

Probabilistic inference can be formulated as a constrained optimization problem. This allows us both to rediscover exact inference algorithms, such as the ones we have briefly discussed above, and to design approximate inference algorithms, by simplifying the objective function to optimize and/or the admissible region for optimization. The possibility to devise theoretically founded approximation algorithms is particularly appealing in cases where the joint distribution is characterized by a factorization with an associated large treewidth. Research in this area has recently been very active, yielding several interesting results. Here we do not have the space for a proper technical treatment, so we only try to give a brief introduction to the main ideas.

Variational approaches are based on the idea of approximating an intractable probabilistic distribution with a simpler one, which allows for inference. This simpler distribution is selected from a family of tractable distributions, as the distribution that is the best approximation to the desired one. Can we define a measure of the quality of the approximation that can be used for the minimization process? A good measure is the GlossaryTerm

KL

-divergence introduced in (31.2). Let us denote a distribution that factorizes according to the graphical model G as

\[ P_G(X) = \frac{1}{Z} \prod_{i, c_i} \phi_{c_i}(x_{c_i}) \]
(31.9)

and let Q ( X ) be a member of the tractable distributions we use to approximate P G ( X ) . Then, a nice feature of GlossaryTerm

KL

-divergence is that it allows us to efficiently solve the optimization problem

\[ \arg\min_{Q(X)} D_{\mathrm{KL}}\big(Q(X) \,\|\, P_G(X)\big) \]

without requiring us to perform inference in P G ( X ) . In fact, using the factorization of P G ( X ) in (31.9), it is not difficult to show that

\[ D_{\mathrm{KL}}\big(Q(X) \,\|\, P_G(X)\big) = \log Z - \sum_{i, c_i} \mathbb{E}_Q[\log \phi_{c_i}] + \mathbb{E}_Q[\log Q(X)] , \]
(31.10)

and, since log⁡ Z does not depend on Q ( X ) , minimizing D KL ( Q ( X ) P G ( X ) ) is equivalent to maximizing the energy functional term

\[ \sum_{i, c_i} \mathbb{E}_Q[\log \phi_{c_i}] - \mathbb{E}_Q[\log Q(X)] . \]

Following from the definition in (31.1), H Q ( X ) = - E Q [ log⁡ Q ( X ) ] is the entropy of Q, while the first term of the energy functional above is referred to as the energy term.

Different variational methods correspond to different strategies for optimizing the energy functional. The name variational is used since all of them adopt the general strategy of reformulating the optimization problem by introducing new variational parameters to be used for optimization. In particular, each specific choice of values for the variational parameters expresses one member, i. e., Q ( X ) , of the family of tractable distributions we want to use. The optimization procedure searches the space of variational parameters to find the Q * ( X ) that best approximates P G ( X ) . It is important to understand that the family of tractable distributions actually corresponds to a set of constraints involving the variational parameters, which must be satisfied while maximizing the energy functional. By using Lagrange multipliers, these constraints can be merged with the energy functional, giving rise to a Lagrangian function that must be maximized. By taking the partial derivatives with respect to the variational parameters and the Lagrange multipliers, the solution to the optimization problem can be characterized by a set of fixed-point equations. These equations can then be used to straightforwardly devise an iterative solution.
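
One common instance of this strategy, not discussed in detail here, is the naive mean-field approximation, where Q(X) is fully factorized and the fixed-point equations update one single-variable marginal at a time. The sketch below implements these updates for a small pairwise Markov network with binary variables; the graph and the potentials are randomly generated and purely illustrative.

```python
import numpy as np

# Hypothetical pairwise Markov network over 4 binary variables.
rng = np.random.default_rng(0)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
log_unary = rng.normal(size=(n, 2))                       # log phi_i(x_i)
log_pair = {e: rng.normal(size=(2, 2)) for e in edges}    # log phi_ij(x_i, x_j)

q = np.full((n, 2), 0.5)            # fully factorized Q(X): initial marginals
for _ in range(100):                # fixed-point iterations
    for i in range(n):
        logits = log_unary[i].copy()
        for (a, b), lp in log_pair.items():
            if a == i:
                logits += lp @ q[b]       # E_{q_b}[log phi_ab(x_i, X_b)]
            elif b == i:
                logits += lp.T @ q[a]     # E_{q_a}[log phi_ab(X_a, x_i)]
        logits -= logits.max()            # numerical stabilization
        q[i] = np.exp(logits) / np.exp(logits).sum()

print(q)   # approximate single-variable marginals
```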

Different variational methods work with different types of approximations. There are two main sources of approximation, which can be used singly or in combination. One source is the energy functional, which can be substituted by a functional that is easy to manipulate while preserving a good degree of approximation. The other source of approximation is the set of constraints, i. e., the definition of the family of tractable distributions, which may not be fully consistent with the factorization represented by the graphical model (in this case, they are denoted as pseudo-distributions).

We do not have space here to give more details; however, it is worth mentioning that, while convergence proofs for several variational methods are available, it is not so common to find theoretical guarantees on the approximation error made by a specific method.

3 Latent Variable Models

Knowledge hidden in the complex relations between a large number of observable variables can be surfaced under the assumption that a simpler, unobservable process exists which is responsible for generating the complex behavior of the manifest data. Such an unobservable generative process can be modeled through latent variables which, as opposed to observable variables, are not directly measurable, but can be inferred from observations and can explain the relations between manifest data. Intuitively, latent variables can be understood as an attempt to model the unknown physical process generating the observations, or as an abstraction providing a simplified representation of the manifest data, e. g., clusters.

Probabilistic models that attempt to explain observations in terms of latent variables are called latent variable models. In probabilistic terms, the simplification introduced by latent variables results in conditional independence assumptions, such that (subsets of) observable variables can be considered conditionally independent when their hidden explanation, i. e., the latent variable assignment, is given. Similarly to observed variables, latent variables can be discrete or continuous: their nature, together with that of the observations, determines different types of probabilistic models. Nevertheless, parameter estimation in the different latent variable models can be achieved through a general iterative principle, known as expectation–maximization.

3.1 Latent Space Representation

To understand the intuition at the basis of latent space representations, consider a joint distribution P ( X ) = P ( X 1 , … , X N ) defined over N observed random variables X i . As discussed in Sect. 31.2.1, without any simplifying assumption, the number of free parameters of this simple model grows as O ( 2^N - 1 ) for Boolean variables, which quickly becomes unmanageable for large N. One way to control the number of free parameters of a model, without making overly simplistic assumptions (e. g., the X i being GlossaryTerm

i.i.d.

), is to introduce a collection of latent, or hidden, variables Z = { Z 1 , … , Z K } . The latent variables are unobserved, but can be used to factorize the joint distribution P ( X ) while capturing (some of) the correlations between the observed variables X = { X 1 , … , X N } . More formally, latent variables are such that

\[ P(X) = \int_{z} P(X \mid Z = z)\, P(Z = z)\, dz , \]
(31.11)

that is the general formulation for the likelihood of a latent variable model. The details of the latent variable model, and the tractability of the integral in (31.11), are determined by the form of the conditional distribution P ( X | Z ) and by the marginal probability P ( Z ) . A common approach in latent variable models is to assume that observed variables become conditionally independent given the latent variables, that is

\[ P(X) = \int_{z} \prod_{i=1}^{N} P(X_i \mid Z = z)\, P(Z = z)\, dz . \]
(31.12)

A basic assumption for this latent model to be effective is that the conditional and marginal distributions should be more tractable than the joint distribution P ( X ) . For instance, in a simple scenario with discrete observations and latent variables, this entails that K ≪ N . Not surprisingly, the same intuition is applied, in a deterministic context, for dimensionality reduction (cf. the number of projection directions in GlossaryTerm

PCA

) and clustering.

Different types of latent variable models are defined based on the nature of the latent and observed variables, as well as depending on the form of the conditional and marginal probabilities. In the following, we discuss two general classes of latent variable models with continuous and discrete hidden variables, which are factor analysis and mixture models, respectively.

3.2 Learning with Latent Variables: The Expectation–Maximization Algorithm

Learning, in a probabilistic setting, entails working with the model likelihood. In latent variable models, the likelihood in (31.11) might be difficult to treat due to the marginalization inside the logarithm, which can potentially couple all the model parameters. Despite the diversity of the models that can be designed based on the general expression in (31.11), there exists a general principle to estimate their parameters.

The expectation–maximization (GlossaryTerm

EM

) algorithm [48] is a general iterative method for the maximization of the likelihood under latent variables. The key intuition of the GlossaryTerm

EM

algorithm is to define an alternative objective function where the parameter coupling introduced by the marginalization of the hidden variables is removed. The GlossaryTerm

EM

algorithm maximizes the marginal data likelihood P ( X | θ ) , where θ are the model parameters, through a tractable lower bound defined by introducing a function of the latent variables, i. e., Q ( Z ) , into the data likelihood through marginalization. For notational simplicity, consider the case of discrete latent variables. For any nonzero distribution Q ( Z ) , it holds

\[ L(\theta) = \log P(X \mid \theta) = \log \sum_{z} P(X, Z = z \mid \theta) = \log \sum_{z} Q(z) \frac{P(X, Z = z \mid \theta)}{Q(z)} \geq \sum_{z} Q(z) \log P(X, Z = z \mid \theta) - \sum_{z} Q(z) \log Q(z) = \tilde{L}(Q, \theta) , \]
(31.13)

where the lower bound L̃ ( Q , θ ) ≤ L ( θ ) is obtained by applying Jensen's inequality to the concave log function. The joint distribution P ( X , Z | θ ) is known as the complete data likelihood, where the term complete refers to the fact that the marginal data likelihood P ( X | θ ) is completed with the observations z for the latent variables.

The Expectation–maximization algorithm defines an alternate optimization process where the bound L ̃ ( Q , θ ) is maximized with respect to Q ( ) and θ. In general, this is performed by two independent maximization steps that are repeated until convergence:

  • Expectation (E) Step: For θ fixed, find the distribution Q ( t + 1 ) ( z ) that maximizes the bound L ̃ ( Q , θ ( t ) ) ;

  • Maximization (M) Step: Given the distribution Q ( z ) ( t + 1 ) , estimate the model parameters θ ( t + 1 ) that maximize the bound L ̃ ( Q ( t + 1 ) , θ ) ;

where the superscript denotes the estimate at time t. Clearly, the optimal solution for the E-step is attained when

Q ( t + 1 ) ( z ) = P ( Z = z | X , θ ( t ) ) ,
(31.14)

that is when the lower bound in (31.13) becomes an equality. In practice, to explicitly evaluate the complete likelihood in L ̃ ( Q , θ ( t ) ) , we would need to observe the z assignments. These are unknown, since latent variables are unobservable. However, given the marginalization of z in (31.13), we can substitute the unavailable z observations with their expected values, by considering them as another random variable. To this end, it suffices that the E-step computes the expected value of the complete log-likelihood log⁡ P ( X , Z | θ ) with respect to Z. These observations provide the final form of the classical GlossaryTerm

EM

algorithm:

  • E-step: Given the current estimate of the model parameters θ ( t ) , compute

    Q ( t + 1 ) ( θ | θ ( t ) ) = E Z | X , θ ( t ) [ log⁡ P ( X , Z | θ ) ] ;
    (31.15)
  • M-step: Find the new estimate of the model parameters

    θ ( t + 1 ) = arg⁡ max⁡ θ Q ( t + 1 ) ( θ | θ ( t ) ) .
    (31.16)

In other words, the E-step estimates the value of the otherwise unobserved latent variables, while the M-step finds the parameters that maximize the current estimate of the log-likelihood. In practice, the E-step often reduces to estimating the expectation of Z under its posterior P ( Z | X , θ ( t ) ) , while the M-step uses these values as sufficient statistics to update the model parameters θ ( t + 1 ) . This alternate optimization is typically iterated until the log-likelihood does not change much between consecutive estimates, or until a maximum number of iterations is reached. Note that the two-step GlossaryTerm

EM

optimization process is prone to local optima: its convergence can be slow and, often, its solutions tend to depend on the initialization.
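
As a concrete instance of the E/M alternation above, the sketch below runs EM for a two-component univariate Gaussian mixture on synthetic data; the data, the initialization, and the number of iterations are illustrative, not taken from the chapter.

```python
import numpy as np

# Synthetic data drawn from two Gaussians (values are illustrative).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

pi = np.array([0.5, 0.5])          # mixing weights
mu = np.array([-1.0, 1.0])         # component means
var = np.array([1.0, 1.0])         # component variances

def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: posterior responsibilities Q(z | x) under the current parameters.
    resp = pi * normal_pdf(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: parameters maximizing the expected complete log-likelihood,
    # using the responsibilities as soft counts.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)
```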

The GlossaryTerm

EM

algorithm assumes that we can calculate the expected value of the complete log-likelihood. However, there are cases in which the required summation is not computationally feasible (e. g., with infinite summations or integrals that have no closed-form solution): in these cases, the approximate inference methods described in Sect. 31.2.3 can be used to define nonexact GlossaryTerm

EM

algorithms. For instance, stochastic versions of the GlossaryTerm

EM

algorithm are obtained by approximating the infeasible summation using (e. g., Gibbs) sampling from the posterior distribution P ( Z | X , θ ) . The classical GlossaryTerm

EM

algorithm is a GlossaryTerm

ML

method providing point estimates of the model parameters θ. The variational Bayes (GlossaryTerm

VB

) [6] method has been introduced to obtain a fully Bayesian solution that returns a posterior distribution of the parameters P ( θ ) , instead of their point estimate. GlossaryTerm

VB

is based on an analytical approximation of the joint posterior of the latent variables and model parameters, which yields a generalization of the GlossaryTerm

EM

alternate optimization, where the maximization at the M-step is taken over possible distributions Q ( θ ) , instead of over θ itself.

3.3 Linear Factor Analysis

Factor analysis (GlossaryTerm

FA

) is an example of a latent variable model for continuous hidden and manifest variables. In its simplest linear form, it is a classical statistical model widely used for generative dimensionality reduction. Similarly to its deterministic counterparts, e. g., GlossaryTerm

PCA

, it forms a low-dimensional embedding of a set of observations D = ( x 1 , , x n ) , where each observation x is a D-dimensional vector of reals. GlossaryTerm

FA

finds a lower dimensional probabilistic representation of D, by assuming that the features of each x are independently generated by K real-valued latent variables Z = { Z 1 , , Z K } , with K D (see the associated graphical model in Fig. 31.5).

Fig. 31.5
figure 5figure 5

Linear factor analysis: the observed D-dimensional variable X is related to the K latent variables Z = { Z 1 , , Z K } through a linear mapping

The GlossaryTerm

FA

model, assumes that observations are linked to the latent vectors through a linear model

x = F z + b + ϵ ,
(31.17)

where ϵ N ( ϵ | 0 , Ψ ) is the Gaussian distributed noise with zero mean and covariance Ψ, b is a bias vector and F is the factor loading matrix. The latent variables are the factors and are generally assumed to be distributed as Z N ( z | 0 , I K ) = P ( Z ) , where I K is the K-dimensional identity matrix. Under such Gaussian assumptions, and given the linear model in (31.17), the conditional distribution of the observations is

P ( X = x | Z = z ) = N ( x | F z + b , Ψ ) ,
(31.18)

which, inserted in (31.11), provides the distribution for the GlossaryTerm

FA

marginal likelihood

\[ P(X) = \int_{z} P(X \mid Z)\, P(Z)\, dz = \mathcal{N}(x \mid b, F F^{\mathrm{T}} + \Psi) . \]
(31.19)

The form of the noise covariance Ψ determines the type of GlossaryTerm

FA

model: in general, this is chosen as a diagonal matrix with a vector of ( ψ 1 , , ψ D ) values on the main diagonal. When the diagonal elements are all equal to a single value σ 2 R , the GlossaryTerm

FA

reduces to the special case of the Probabilistic GlossaryTerm

PCA

 [49].

Learning of the GlossaryTerm

FA

parameters θ = ( Ψ , F ) (b is usually set a priori to the mean of the data) is obtained by maximum likelihood estimation. The most popular approach to obtain such estimates is based on solving an eigen-decomposition problem. Given the nature of GlossaryTerm

FA

as a latent variable model, its θ parameters can also be estimated by applying GlossaryTerm

EM

to the maximization of the log-likelihood in (31.19). The latter approach is, however, less used in general, given its slower convergence.
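
The sketch below only illustrates the generative reading of (31.17) and (31.19): it samples from a linear FA model with illustrative parameters and checks empirically that the marginal covariance of x approaches F F^T + Ψ. It does not perform parameter estimation.

```python
import numpy as np

# Illustrative FA parameters: x = F z + b + eps, eps ~ N(0, Psi), z ~ N(0, I_K).
rng = np.random.default_rng(0)
D, K, n = 5, 2, 100_000
F = rng.normal(size=(D, K))                 # factor loading matrix
b = rng.normal(size=D)                      # bias vector
psi = np.diag(rng.uniform(0.1, 0.5, D))     # diagonal noise covariance

z = rng.normal(size=(n, K))                                # latent factors
eps = rng.multivariate_normal(np.zeros(D), psi, size=n)    # observation noise
x = z @ F.T + b + eps

# Empirical covariance of x should approach F F^T + Psi (31.19).
print(np.allclose(np.cov(x, rowvar=False), F @ F.T + psi, atol=0.05))
```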

3.4 Mixture Models

The term mixture models identifies a large family of latent variable models comprising discrete hidden variables and generic manifest variables. A mixture model assumes that each observation is generated by a weighted contribution of a number of simple distributions, selected by the hidden variables. The simplest form of mixture model assumes that an observation is independently generated by a single mixture component. Widely popular elements of this family are the Gaussian mixture model for continuous observations and the mixture of unigrams for multinomial data. In the following, we discuss an example of more articulated generative processes comprising observations with mixed component memberships.

3.4.1 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis (GlossaryTerm

pLSA

has been introduced to model mixed-membership observations, where a manifest sample is allowed to be generated by multiple latent variables. Its primary application is in document analysis, where latent variables are interpreted as topics to be identified in a collection of documents. Intuitively, in the mixture of unigrams, each document is assigned to a unique topic and, as a consequence, all the words in a document are constrained to belong to a single topic. The GlossaryTerm

pLSA

model relaxes this assumption by allowing words in a document to belong to different topics, obtaining a multitopic representation for the documents in the collection.

The typical GlossaryTerm

pLSA

setting includes a dataset of multinomial samples, which are the documents D = { d 1 , … , d N } . Each document is an L-dimensional vector of word counts, whose length equals the size of the reference dictionary. In other words, the ith observed sample is a vector d i = ( w 1 i , … , w L i ) , where w j i is the number of occurrences of the jth word of the vocabulary in the ith document. This data is typically summarized in a rectangular L × N integer matrix n, such that each column n ( ⋅ , d i ) contains the word counts for document d i . The variables identifying words and documents, i. e., W j and D i , are observed, in contrast with the set of topics Z = { Z 1 , … , Z K } , which are the latent variables. In GlossaryTerm

pLSA

, every observation n ( w j , d i ) is associated with a latent topic z k by means of the hidden variable Z k .

The fundamental probabilities associated with this model are P ( D = d i ) , that is, the document probability, P ( W = w j | Z = z k ) , that is, the probability of word w j conditioned on topic z k , and P ( Z = z k | D = d i ) , that is the conditional probability of topic z k given document d i . Given the nature of the manifest and hidden variables, all probabilities involved in GlossaryTerm

pLSA

are multinomials. The GlossaryTerm

pLSA

defines a (quasi) generative model for the word/document co-occurrences whose generative process is described by Fig. 31.6, using plate notation. This is a concise representation for graphical models involving replications: rectangular plates denote replication of their content for a number of times given by the term on the bottom right (e. g., N and L d for the outer and inner plates in Fig. 31.6, respectively); each shaded circular item denotes an observed variable, while empty circles identify latent variables.

The conditional independence relationships in Fig. 31.6 allow us to factorize the joint word–document distribution: by using the parent decomposition rule introduced in (31.8), this yields

\[ P(W_j, D_i) = P(D_i)\, P(W_j \mid D_i) = P(D_i) \sum_{k=1}^{K} P(Z_k \mid D_i)\, P(W_j \mid Z_k) , \]
(31.20)

that is the specific GlossaryTerm

pLSA

form of the general latent topic factorization in (31.12). The second equality in (31.20) is given by the marginalization of the latent topics Z k and by the conditional independence assumption of the GlossaryTerm

pLSA

model, stating that word w j and document d i can be considered independent given the state of the latent variable Z k . In other words, the word distribution of a document is modeled as a convex combination of K topic-specific distributions P ( W j | Z k ) . Such decomposition has a well-known characterization in terms of Nonnegative matrix factorization [13].

Fig. 31.6
figure 6figure 6

Graphical model for the probabilistic latent semantic analysis: indices for the random variables D , Z , and W are omitted in the plate notation. The term L d denotes replication for the L d words present in the dth document

Estimation of the GlossaryTerm

pLSA

parameters θ = { P ( W j | Z k ) , P ( Z k | D i ) } is obtained by maximization of the log-likelihood

\[ L(\theta) = \log \prod_{i=1}^{D} \prod_{j=1}^{W} P(W_j, D_i)^{n(w_j, d_i)} = \sum_{i=1}^{D} \sum_{j=1}^{W} n(w_j, d_i) \log \Big( P(D_i) \sum_{k=1}^{K} P(Z_k \mid D_i)\, P(W_j \mid Z_k) \Big) , \]
(31.21)

where P ( W j , D i ) has been expanded using the formulation in (31.20). As with other latent topic models, this maximization problem can be solved through the iterative GlossaryTerm

EM

-algorithm discussed in Sect. 31.3.2. Following (31.15), the E-step computes the expectation of the complete likelihood P ( Z , W , D ) with respect to the GlossaryTerm

pLSA

latent topics, assuming observed documents and words. It is easy to show that the resulting E-step computes

\[ P(Z_k \mid W_j, D_i) = \frac{P(Z_k \mid D_i)^{(t)}\, P(W_j \mid Z_k)^{(t)}}{\sum_{k'=1}^{K} P(Z_{k'} \mid D_i)^{(t)}\, P(W_j \mid Z_{k'})^{(t)}} , \]
(31.22)

that is, the probability of the topic Z k given word W j in document D i , estimated using the current values (at time t) of the model parameters θ ( t ) = { P ( W j | Z k ) ( t ) , P ( Z k | D i ) ( t ) } . Note that the decomposition on the right-hand side of Eq. (31.22) has been obtained by factorizing the posterior P ( Z k | W j , D i ) using Bayes' theorem.

The M-step equations (31.16) are obtained by differentiating the GlossaryTerm

pLSA

log-likelihood, extended with appropriate Lagrange multipliers for normalization, with respect to the P ( Z k | D i ) and P ( W j | Z k ) parameters. The resulting update equations are

\[ P(Z_k \mid D_i)^{(t+1)} = \frac{\sum_{j=1}^{W} n(w_j, d_i)\, P(Z_k \mid W_j, D_i)}{\sum_{j=1}^{W} n(w_j, d_i)} , \]
(31.23)
\[ P(W_j \mid Z_k)^{(t+1)} = \frac{\sum_{i=1}^{D} n(w_j, d_i)\, P(Z_k \mid W_j, D_i)}{\sum_{j'=1}^{W} \sum_{i=1}^{D} n(w_{j'}, d_i)\, P(Z_k \mid W_{j'}, D_i)} . \]
(31.24)

The two-step optimization is iterated until a likelihood convergence criterion is met: often a validation set, or a tempered version of the GlossaryTerm

EM

is used in order to avoid model overfitting [50].
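
A minimal sketch of the pLSA EM iterations (31.22)-(31.24) on a small random word-document count matrix is given below; the dimensions and counts are illustrative, and convergence checks are omitted.

```python
import numpy as np

# Hypothetical word-document count matrix n(w_j, d_i): L words x N documents.
rng = np.random.default_rng(0)
L, N, K = 20, 10, 3
n_wd = rng.integers(0, 5, size=(L, N)).astype(float)

p_w_given_z = rng.dirichlet(np.ones(L), size=K).T     # P(W_j | Z_k), shape (L, K)
p_z_given_d = rng.dirichlet(np.ones(K), size=N).T     # P(Z_k | D_i), shape (K, N)

for _ in range(100):
    # E-step (31.22): P(Z_k | W_j, D_i) for every word/document pair.
    joint = p_w_given_z[:, :, None] * p_z_given_d[None, :, :]   # (L, K, N)
    post = joint / joint.sum(axis=1, keepdims=True)
    # M-step (31.23)-(31.24): re-estimate the multinomial parameters.
    weighted = n_wd[:, None, :] * post                          # (L, K, N)
    p_z_given_d = weighted.sum(axis=0) / n_wd.sum(axis=0)       # (K, N)
    p_w_given_z = weighted.sum(axis=2)
    p_w_given_z /= p_w_given_z.sum(axis=0, keepdims=True)       # (L, K)

print(p_z_given_d.round(2))   # per-document topic proportions
```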

3.4.2 Advanced Topic Models

The GlossaryTerm

pLSA

was the first mixed membership model allowing a single observed sample to be generated by multiple latent topics at the same time. However, GlossaryTerm

pLSA

cannot be considered a fully generative model. In fact, the document-specific mixing weights for the topics are not sampled from a distribution, rather they are selected from P ( z k | d i ) based on the index of document d i . Hence, GlossaryTerm

pLSA

indexes only those documents that are in the training set D and cannot directly model the generative process of unseen test documents. In other words, the GlossaryTerm

pLSA

basically assigns null probability to all inputs that are not in the training set. The folding-in heuristic has been proposed as a pragmatic workaround for this limitation, by assigning the latent variables in the test data to their GlossaryTerm

MAP

values before computing the test-set perplexity. However, the folding-in approach has been shown to lead to overly optimistic estimates of the test-set log-likelihood [51].

The latent Dirichlet allocation (GlossaryTerm

LDA

) [52] has been proposed as a Bayesian approach to address such modeling limitation of GlossaryTerm

pLSA

. It extends GlossaryTerm

pLSA

by treating the multinomial weights P ( Z | D ) as additional latent random variables, sampled from a Dirichlet distribution, which is the conjugate prior of the multinomial distribution. Using conjugate distributions eases inference, as it ensures that the posterior distribution has the same form as the prior. The latent variable decomposition of the GlossaryTerm

LDA

likelihood is

\[ P(W = w \mid \phi, \alpha, \beta) = \int \sum_{z} P(W = w \mid Z = z, \phi)\, P(Z = z \mid \theta)\, P(\theta \mid \alpha)\, P(\phi \mid \beta)\, d\theta , \]
(31.25)

where P ( W | Z , ϕ ) is the multinomial word-topic distribution, with parameters ϕ sampled from the Dirichlet distribution P ( ϕ | β ) . The term P ( Z | θ ) is the topic distribution, whose document-specific multinomial parameter θ is sampled from the Dirichlet P ( θ | α ) .

The terms α and β are the hyperparameters of the Dirichlet distributions; see Fig. 31.7 for the model in plate notation. Direct GlossaryTerm

EM

inference is impossible for GlossaryTerm

LDA

, since the integral in (31.25) is intractable due to the couplings between the parameters within the topic marginalization. Again, approximate and stochastic Bayesian inference methods, such as those in Sect. 31.2.3, are used to fit the GlossaryTerm

LDA

parameters, including GlossaryTerm

VB

 [52], expectation propagation [53], and Gibbs sampling [54].

Fig. 31.7
figure 7figure 7

Graphical model for the latent Dirichlet allocation

The principles underlying GlossaryTerm

pLSA

and GlossaryTerm

LDA

have inspired the development of latent topic models that account for more articulated assumptions on the form of the hidden generative process. For instance, hierarchical GlossaryTerm

LDA

 [55] proposes a generative process where observations are generated by a topic tree instead of being drawn from a flat topic collection. Further, specialized latent variable models have been developed for specific applications, such as author-topic analysis in scientific literature [56] and image understanding [57].

4 Markov Models

Time series and, more generally, sequences are a form of structured data that represents a list of observations for which a complete order can be defined, e. g., time in a temporal sequence. Let a sequence of length T be y = y 1 , … , y T , where the bold notation is used to denote the fact that y is a compound object (in practice, however, this can be treated as a set of random variables). The term y t is used to denote the tth observation with respect to the total order. Position t is often referred to as time when dealing with time-series data.

Two sequences are generally the results of independent trials, hence they can be considered GlossaryTerm

i.i.d.

samples. However, the elements composing a sequence fail to meet such GlossaryTerm

i.i.d.

property. Therefore, in principle, a probabilistic model for y would be required to specify the joint distribution P ( Y 1 , … , Y T ) . For discrete-valued observations y t , the number of parameters of the joint distribution grows exponentially with the length of the sequence. Clearly, this would make the use of the probabilistic model fairly impractical due to the exponential size of the parameter space. To reduce such parameterization, Markov chains make the simplifying assumption that an observation occurring at some position t of the sequence only depends on a limited number of its predecessors with respect to the complete order. In a time series, this entails that an observation at the present time only depends on the history of a limited number of past observations. Markov chains allow us to model such history dependence and are at the heart of the hidden Markov model (GlossaryTerm

HMM

), which is the most popular approach to model the generative process of sequential data.

The GlossaryTerm

HMM

is a notable example of latent variable model: in the following, we provide an overview of the associated learning and inference problems. For simplicity, presentation focuses on sequences of finite length T and discrete time t. Sequence elements y t can be either discrete valued or defined over reals, without major impact on the model. The section also discusses how the GlossaryTerm

HMM

causation assumption can be modified to give rise to alternative approaches, with interesting applications that go beyond simple sequence modeling.

4.1 Markov Chains

A Markov chain is a simple stochastic process for sequences. It assumes that an observation y t at time (position) t only depends on a finite set of L ≥ 1 predecessors in the sequence. The number of predecessors L influencing the new observation is the order of the Markov chain.

Definition 31.4 Markov Chain

An L-order Markov chain is a sequence of random variables Y = Y 1 , … , Y T such that for every t ∈ { 1 , … , T } , it holds

\[ P(Y_t = y_t \mid Y_1, \ldots, Y_{t-1}) = P(Y_t = y_t \mid Y_{t-L}, \ldots, Y_{t-1}) . \]
(31.26)

Following from the discussions in Sect. 31.2.1, (31.26) states that the L predecessors of Y t define the set of its Bayesian parents pa ( Y t ) = { Y t - L , , Y t - 1 } . For a first-order Markov chain, i. e., L = 1, (31.26) reduces to P ( Y t = y t | Y t - 1 = y t - 1 ) . Such conditional independence assumption formally encodes the intuition that the current observation can be predicted from the sole knowledge of the preceding sample. The graphical model of a first-order Markov chain is shown in Fig. 31.8, whose joint distribution decomposes as

\[ P(Y_1, \ldots, Y_T) = P(Y_1)\, P(Y_2 \mid Y_1)\, P(Y_3 \mid Y_2) \cdots P(Y_T \mid Y_{T-1}) = P(Y_1) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1}) . \]
(31.27)

The first element Y 1 has an empty conditioning part, given that it has no predecessor. Its probability P ( Y 1 ) is referred to as the marginal or prior probability, while the term P ( Y t | Y t - 1 ) is the transition probability.

Fig. 31.8
figure 8figure 8

Graphical model for a first-order Markov chain of length T, where pa ( Y t ) = { Y t - 1 }

A Markov chain is stationary or homogeneous, if the transition probability does not depend on the time (position) t. In other words, the parameterization of the Markov chain is such that

\[ P(Y_t = y \mid Y_{t-1} = y') = f(y, y') , \]

where the transition distribution is a function f ( y , y' ) of the observations y , y' alone. An interesting case of stationary first-order Markov chain is one whose random variables take values from a finite alphabet of discrete symbols i , j ∈ { 1 , … , M } . In these chains, the transition probability

A i j = P ( Y t = i | Y t - 1 = j )
(31.28)

denotes the probability of occurrence of the ith symbol preceded by symbol j. For convenience, such probability is represented by the element A ij of the M × M transition matrix A = [ A i j ] i , j = 1 M . Similarly, the marginal distribution defines the elements

π i = P ( Y 1 = i )
(31.29)

of the M × 1 initial state vector π = [ π i ] i = 1 M . These Markov chains can be straightforwardly interpreted as state-transition systems, where each symbol i of the alphabet is a state and a state-transition arrow exists from state j to state i whenever the corresponding entry A ij of the transition matrix is nonzero.

The Markov chains described by (31.28) and (31.29), despite their simplicity, have found wide application, e. g., in the modeling of physical phenomena, economic time series, and information retrieval. Learning a Markov chain requires fitting the M^2 parameters of the transition matrix plus an M-dimensional prior, where M is the size of the observation alphabet. Efficient methods exist to fit stationary first-order Markov chains by maximum likelihood (GlossaryTerm

ML

). By using the decomposition in (31.27) and substituting the definitions in (31.28) and (31.29), the Markov chain log-likelihood for a generic sequence y reads

\[ L(\theta) = \log P(Y = y \mid \theta) = \log \prod_{i=1}^{M} \pi_i^{\delta(y_1 = i)} \prod_{t=2}^{T} \prod_{i,j=1}^{M} A_{ij}^{\delta(y_t = i,\, y_{t-1} = j)} , \]
(31.30)

where θ = ( A , π ) are the model parameters and δ ( ⋅ ) is the indicator function: δ ( y t = i , y t - 1 = j ) equals 1 if the symbol at position t is i and the symbol at position t - 1 is j, and it is 0 otherwise. Similarly, δ ( y 1 = i ) = 1 if and only if the first symbol of the sequence is i . The final expression of the log-likelihood is obtained by bringing the log inside the products and adding appropriate Lagrange multipliers for normalization. The GlossaryTerm

ML

estimate is obtained by differentiating this final expression with respect to parameters A ij and π i , yielding

\[ A_{ij} = \frac{\sum_{t=2}^{T} \delta(y_t = i,\, y_{t-1} = j)}{\sum_{t=2}^{T} \sum_{i'=1}^{M} \delta(y_t = i',\, y_{t-1} = j)} , \]
(31.31)
\[ \pi_i = \frac{\delta(y_1 = i)}{\sum_{i'=1}^{M} \delta(y_1 = i')} . \]
(31.32)

Intuitively, the GlossaryTerm

ML

estimate corresponds to counting the number of transitions from symbol j to symbol i across time (and similarly for the initial state). Generalization to a set of N sample sequences y n is straightforward: it suffices to count transitions both in time and across samples, and similarly for the initial symbols y 1 n .
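
The counting estimate is illustrated by the sketch below for a single short sequence over a three-symbol alphabet; the sequence is made up, and it is assumed that every symbol occurs at least once as a predecessor so that no column of counts is empty.

```python
import numpy as np

M = 3
y = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 1, 2]      # illustrative symbol sequence

# Count transitions j -> i, as in (31.31).
counts = np.zeros((M, M))
for t in range(1, len(y)):
    counts[y[t], y[t - 1]] += 1

A = counts / counts.sum(axis=0, keepdims=True)   # A[i, j] = P(Y_t=i | Y_{t-1}=j)
pi = np.zeros(M)
pi[y[0]] = 1.0                                   # single-sequence prior (31.32)
print(A)
```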

4.2 Hidden Markov Models

Markov chains model sequential data assuming that sequence elements are generated by a fully observable stochastic process. In the discrete-state Markov chain, this requires each state of the process to correspond to an observable element of the sequence, i. e., an event. On the other hand, most real-world systems generate observable events that are correlated, but not coincident, with the state of the generating process. More importantly, often the only available information is the outcome of the stochastic process at each time, i. e., the event y t , while the state of the system remains unobservable, i. e., hidden. The GlossaryTerm

HMM

allows modeling more general stochastic processes where the state transition dynamics is disentangled from the observable information generated by the process. The state-transition dynamics is assumed to be nonobservable and is modeled by a Markov chain of discrete and finite latent variables, i. e., the hidden states. The observable information is then generated by such hidden states similarly to how latent variables generate observations in mixture models (see Sect. 31.3.4).

The graphical model of an GlossaryTerm

HMM

is exemplified in Fig. 31.9: the hidden states are latent variables S t , while the sequence elements Y t are observed.

Fig. 31.9
figure 9figure 9

A first-order HMM with hidden states S t chosen on the discrete domain { 1 , , C } , for t = 1 T

The conditional dependence expressed by the arrow S t → Y t indicates that the observed element of the sequence at time t is generated by the corresponding hidden state S t through the emission distribution b s t ( y t ) = P ( Y t = y t | S t = s t ) . The unknown state-transition dynamics is modeled by the first-order Markov chain of discrete and finite hidden states S t . By applying the Markovian decomposition in (31.27) to the hidden-state chain, the joint distribution of the observed sequence y = y 1 , … , y T and associated hidden states s = s 1 , … , s T can be written as

\[ P(Y = y, S = s) = P(S_1)\, P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(Y_t \mid S_t) . \]
(31.33)

The actual parameterization of the probabilities in (31.33) depends on the form of the observation and hidden state variables. Following the definitions in (31.28) and (31.29), a stationary hidden-state chain is regulated by the C × C matrix of state transitions A i j = P ( S t = i | S t - 1 = j ) and by the C-dimensional vector of initial state probabilities π i = P ( S 1 = i ) , where i , j are drawn from { 1 , … , C } . For discrete sequence observations y t ∈ { 1 , … , M } , the emission distribution is an M × C emission matrix B whose elements are

b i ( k ) = B k i = P ( Y t = k | S t = i ) .
(31.34)

For continuous observations y t , the state assignment S t = i selects the ith emission distribution b i ( y t ) = P ( Y t | S t = i ) from a mixture of C candidates.

An GlossaryTerm

HMM

is a latent variable model defined by the θ = ( π , A , B ) parameters and, implicitly, by the (unknown) number of hidden states C. In [58], three notable inference problems are identified for an GlossaryTerm

HMM

.

Definition 31.5 Evaluation Problem

Given a model θ and an observed sequence y, determine the likelihood P ( Y = y | θ ) of the sequence being generated by the model.

Definition 31.6 Learning Problem

Given a dataset of N observed sequences D = { y 1 , , y N } and the number of hidden states C, find the parameters π, A and B that maximize the probability of model θ = { π , A , B } having generated the sequences in D.

Definition 31.7 Optimal States Problem

Given a model θ and an observed sequence y, find an optimal state assignment s = s 1 * , , s T * for the underlying hidden Markov chain.

These classical inference problems are addressed using efficient and numerically stable recursive algorithms that exploit message passing on the GlossaryTerm

HMM

junction tree (Sect. 31.2.3) to factorize the, otherwise hardly tractable, joint maximization problems. The underlying intuition is a recursive computation of intermediate probabilities (messages) that are passed forward and backward along the sequence (the junction tree, in practice) to accumulate evidence for solving the joint problem. A discussion of the key aspects of these solutions is provided in the following.

4.2.1 Evaluation

The evaluation problem refers to measuring how well a given GlossaryTerm

HMM

matches an observed sequence. Given the model θ = ( π , A , B ) and the observed sequence y = y 1 , … , y T , the objective is to find P ( Y = y | θ ) . To effectively compute this probability under the GlossaryTerm

HMM

assumption, we need to introduce the hidden state assignment corresponding to the observed sequence y. Following the general approach for latent variable models in Eq. (31.11), this is introduced through marginalization over the joint assignment s = s 1 , … , s T

\[ P(Y \mid \theta) = \sum_{s} P(Y, S = s \mid \theta) = \sum_{s_1, \ldots, s_T} P(S_1)\, P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(Y_t \mid S_t) , \]
(31.35)

where the joint probability P ( Y , S | θ ) has been factorized according to the GlossaryTerm

HMM

assumption in (31.33).

Direct computation of (31.35) is generally infeasible, as it would require O ( T C^T ) operations. This probability can be efficiently computed, with O ( T C^2 ) operations, through the accumulation of a recursive term that is computed by scanning the sequence from left to right. The procedure is known as the forward algorithm: let y 1 : t be the observed subsequence from position 1 to t, and define the forward probability as

α t ( i ) = P ( Y 1 : t = y 1 : t , S t = i | θ )
(31.36)

that is the probability of observing a partial sequence up to position t and the underlying hidden process being in state i at time t. A recursive formulation of the α t ( i ) term is obtained by introducing the hidden state S t - 1 by marginalization, yielding

\[ \alpha_t(i) = \sum_{j=1}^{C} P(Y_{1:t} = y_{1:t}, S_t = i, S_{t-1} = j \mid \theta) = \sum_{j=1}^{C} P(Y_t = y_t \mid S_t = i, \theta)\, P(S_t = i \mid S_{t-1} = j, \theta)\, P(Y_{1:t-1} = y_{1:t-1}, S_{t-1} = j \mid \theta) = b_i(y_t) \sum_{j=1}^{C} A_{ij}\, \alpha_{t-1}(j) , \]
(31.37)

where the second equality follows from the conditional independence assumptions of the model. Since pa ( S t ) = { S t - 1 } , the chain element S t is completely determined by the hidden state at the previous time step S t - 1 ; similarly, the emission Y t is conditionally independent of the rest, given the hidden state S t .

The forward recursion scans the observed sequence from left to right and recursively computes the α t ( i ) values at each position t = 1 , … , T using (31.37). At each observed position t, the α t ( i ) values are computed for each i ∈ { 1 , … , C } , since the hidden states are not observed. The basis of the recursion is at t = 1, where (31.37) reduces to α 1 ( i ) = b i ( y 1 ) π i , with y 1 the first element of the observed sequence. The likelihood of the full sequence y = y 1 : T is computed at the end of the forward recursion as

\[ P(Y \mid \theta) = \sum_{i=1}^{C} P(Y_{1:T}, S_T = i \mid \theta) = \sum_{i=1}^{C} \alpha_T(i) . \]
(31.38)
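
The forward recursion (31.37)-(31.38) is summarized by the following sketch for a discrete-emission HMM; the parameter values and the observed sequence are illustrative.

```python
import numpy as np

# Illustrative HMM with C = 2 hidden states and M = 3 observable symbols.
A = np.array([[0.8, 0.3],          # A[i, j] = P(S_t = i | S_{t-1} = j)
              [0.2, 0.7]])
B = np.array([[0.6, 0.1],          # B[k, i] = P(Y_t = k | S_t = i)
              [0.3, 0.3],
              [0.1, 0.6]])
pi = np.array([0.5, 0.5])
y = [0, 1, 2, 2, 0]                # observed symbol sequence

C = len(pi)
alpha = np.zeros((len(y), C))
alpha[0] = B[y[0]] * pi                        # basis: alpha_1(i) = b_i(y_1) pi_i
for t in range(1, len(y)):
    alpha[t] = B[y[t]] * (A @ alpha[t - 1])    # recursion (31.37)

print("P(Y | theta) =", alpha[-1].sum())       # likelihood (31.38)
```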

4.2.2 Learning

Learning of an GlossaryTerm

HMM

θ = ( π , A , B ) amounts to finding the values of the parameters π, A and B that are most likely to have generated a dataset of observed GlossaryTerm

i.i.d.

sequences D = { y 1 , , y N } . From the evaluation problem, we know how to measure the quality of the matching between a sequence y and a model θ using the likelihood P ( Y | θ ) . The GlossaryTerm

HMM

learning problem can be solved through GlossaryTerm

ML

estimation of θ parameters considering the hidden states as latent variables. As discussed in Sect. 31.3.2, this problem can be solved through application of the GlossaryTerm

EM

algorithm, whose GlossaryTerm

HMM

version is referred to as Baum–Welch algorithm [59], which is a form of sum-product inference algorithm introduced in Sect. 31.2.3. Marginalization of the hidden states as in (31.35), yields to the GlossaryTerm

HMM

log-likelihood on the dataset D

\[ L(\theta) = \log \prod_{n=1}^{N} P(Y^n \mid \theta) = \log \prod_{n=1}^{N} \Bigg( \sum_{s_1^n, \ldots, s_{T_n}^n} P(S_1^n)\, P(Y_1^n \mid S_1^n) \prod_{t=2}^{T_n} P(S_t^n \mid S_{t-1}^n)\, P(Y_t^n \mid S_t^n) \Bigg) , \]
(31.39)

where overscript n refers to the nth sequence y n and T n is the corresponding length. The likelihood in (31.39) is intractable due to the nonobservable state assignment that introduces the marginalization term. Following the principles of the GlossaryTerm

EM

algorithm, we assume that the unobserved state assignment is known, as in (31.30). This can be achieved by introducing indicator variables z t i n for the unknown assignment, such that z t i n = 1 if the chain is in state i at position t of the nth sequence, and it is 0 otherwise. Given this (assumed) knowledge about the hidden state assignments, it is possible to write the corresponding completed likelihood

\[ L_c(\theta) = \log \prod_{n=1}^{N} \Bigg( \prod_{i=1}^{C} \big[ P(S_1^n = i)\, P(Y_1^n \mid S_1^n = i) \big]^{z_{1i}^n} \prod_{t=2}^{T_n} \prod_{i,j=1}^{C} P(S_t^n = i \mid S_{t-1}^n = j)^{z_{ti}^n z_{(t-1)j}^n}\, P(Y_t^n \mid S_t^n = i)^{z_{ti}^n} \Bigg) = \sum_{n=1}^{N} \Bigg( \sum_{i=1}^{C} z_{1i}^n \big[ \log \pi_i + \log b_i(y_1^n) \big] + \sum_{t=2}^{T_n} \sum_{i,j=1}^{C} z_{ti}^n z_{(t-1)j}^n \log A_{ij} + z_{ti}^n \log b_i(y_t^n) \Bigg) , \]
(31.40)

where the latter equality introduces the parameters θ in place of the corresponding probabilities and brings the logarithms into the products.

The GlossaryTerm

EM

procedure is applied to the complete log-likelihood in (31.40). Following (31.15), the E-step computes the expected value of L c ( θ ) with respect to the distribution of the indicator variables Z = { z t i n } , conditional on the observed sequences D and the current estimate of the parameters θ ( k ) . Given L c ( θ ) as in (31.40), taking its conditional expectation with respect to the hidden variables Z yields the following posterior probability:

E Z | Y , θ ( k ) [ z t i ] = P ( S t = i | y ) ,
(31.41)

where superscript n is omitted for notational simplicity. The estimation of this posterior is known as the smoothing problem. In the Baum–Welch algorithm, this is efficiently solved by a double recursion that exploits the following decomposition of the joint probability

\[ P(S_t = i, y) = P(S_t = i, Y_{1:t}, Y_{t+1:T}) = P(S_t = i, Y_{1:t})\, P(Y_{t+1:T} \mid S_t = i) = \alpha_t(i)\, \beta_t(i) , \]
(31.42)

where the observed contribution from the predecessors of t (i. e., Y 1 : t ) is separated from that of its successors (i. e., Y t + 1 : T ). The factorization in (31.42) follows from the fact that S t d-separates (see the definition in Sect. 31.2.1) the elements of the two subsequences, i. e., Y 1 : t and Y t + 1 : T .

The first term in (31.42) is the α t ( i ) probability defined in (31.36), which can be computed through the forward algorithm. The β t ( i ) term can also be computed through a recursive procedure known as backward algorithm, due to the inverted direction with respect to the forward recursion. Consider the following recursive decomposition

\[ \beta_{t-1}(j) = P(Y_{t:T} \mid S_{t-1} = j) = \sum_{i=1}^{C} P(Y_{t:T}, S_t = i \mid S_{t-1} = j) = \sum_{i=1}^{C} P(Y_t \mid S_t = i)\, P(Y_{(t+1):T} \mid S_t = i)\, P(S_t = i \mid S_{t-1} = j) = \sum_{i=1}^{C} b_i(y_t)\, \beta_t(i)\, A_{ij} , \]
(31.43)

which can be computed for 2 ≤ t ≤ T by scanning the sequence backward, assuming β T ( j ) = 1 for each j ∈ { 1 , … , C } .

The final expression of the smoothed posterior in (31.41) is given by the joint α - β recursions, known as the forward–backward algorithm, that is

\[ \gamma_t(i) = P(S_t = i \mid Y) = \frac{P(S_t = i, Y)}{P(Y)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{C} \alpha_t(j)\, \beta_t(j)} . \]
(31.44)
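
The sketch below combines the forward recursion with the backward recursion (31.43) and the smoothing step (31.44), reusing the same illustrative HMM parameters as in the forward sketch; in a real implementation the recursions should be scaled or run in log-space, as discussed at the end of this section.

```python
import numpy as np

# Same illustrative HMM parameters as in the forward sketch.
A = np.array([[0.8, 0.3], [0.2, 0.7]])
B = np.array([[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])
y = [0, 1, 2, 2, 0]
T, C = len(y), len(pi)

# Forward pass (31.37).
alpha = np.zeros((T, C))
alpha[0] = B[y[0]] * pi
for t in range(1, T):
    alpha[t] = B[y[t]] * (A @ alpha[t - 1])

# Backward pass (31.43), with basis beta_T(j) = 1.
beta = np.ones((T, C))
for t in range(T - 1, 0, -1):
    beta[t - 1] = A.T @ (B[y[t]] * beta[t])

# Smoothed posteriors gamma_t(i) = P(S_t = i | Y), Eq. (31.44).
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)
```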

Note that the forward and backward recursions can be run in parallel, since the values of α and β do not depend on each other. To complete the derivation of the sufficient statistics for the M-step, it is also necessary to estimate the joint posterior

\[ \mathbb{E}_{Z \mid Y, \theta^{(k)}}[z_{ti}\, z_{(t-1)j}] = P(S_t = i, S_{t-1} = j \mid Y) , \]
(31.45)

which can be straightforwardly factorized into known probabilities along the lines of (31.42). It turns out that such joint posterior can be estimated using the α - β probabilities computed by the forward–backward algorithm, that is

\[ \gamma_{t,t-1}(i,j) = P(S_t = i, S_{t-1} = j \mid Y) = \frac{\alpha_{t-1}(j)\, A_{ij}\, b_i(y_t)\, \beta_t(i)}{\sum_{m,l=1}^{C} \alpha_{t-1}(m)\, A_{lm}\, b_l(y_t)\, \beta_t(l)} . \]
(31.46)

Parameters θ = ( π , A , B ) are re-estimated at the M-step, with update equations that follow straightforwardly from the maximization problem in (31.16). It suffices to differentiate (31.40), extended with appropriate Lagrange multipliers to account for the sum-to-one constraints. Intuitively, the update equations can be straightforwardly written from the GlossaryTerm

ML

estimates for observable Markov chains in (31.31) and (31.32). It suffices to substitute the observed state counts, obtained through the indicator function δ ( ⋅ ) , with the virtual counts γ ( ⋅ ) estimated by (31.44) and (31.46) at the E-step. For the hidden state transition and initial state distributions, this yields

\[ A_{ij} = \frac{\sum_{n=1}^{N} \sum_{t=2}^{T_n} \gamma_{t,t-1}^{n}(i,j)}{\sum_{n=1}^{N} \sum_{t=2}^{T_n} \gamma_{t-1}^{n}(j)} \quad \text{and} \quad \pi_i = \frac{1}{N} \sum_{n=1}^{N} \gamma_1^{n}(i) . \]
(31.47)

The estimate of the parameters B depends on the form of the emission distribution: if the observed sequences take values k from a finite alphabet { 1 , , M } , the corresponding multinomial emission in (31.34) is updated by

\[ B_{ki} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} \gamma_t^{n}(i)\, \delta(y_t^n = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} \gamma_t^{n}(i)} , \]
(31.48)

where δ ( ⋅ ) is the indicator function counting the occurrences of the symbol k in the observed sequences. Real-valued sequences are usually modeled through Gaussian emissions, whose parameters are fit, as usual, through maximization of the expected complete log-likelihood.

Particular care must be taken to avoid numerical problems when implementing the forward–backward algorithm. Both recursions work with products of small numbers: hence, the values of α and β can underflow for long sequences. To avoid this, it is advisable to perform the recursions in log-space or to work with scaled versions of the α and β probabilities [60]. A sequential version of the smoothing algorithm also exists [61] that directly computes the smoothed posterior γ t ( i ) = P ( S t = i | Y ) through a γ-recursion that uses the α values generated by the forward algorithm.

4.2.3 Optimal State

Once a model θ has been trained, it can be interesting to determine the most likely hidden state assignment s * that has generated an observed sequence y. This inference problem, also known as decoding, has different solutions, since several notions of optimal assignment exist, depending on the interpretation of what an optimal assignment is. For instance, the optimal hidden sequence can be the one maximizing the expected count of correct states. On the other hand, an optimal assignment might be the sequence of hidden states s * with the maximum joint probability P ( Y = y , S = s * ) .

The former optimality condition is solved by selecting, at each position t, the most likely state given the sequence, i. e.,

\[ s_t^{*} = \arg\max_{i = 1, \ldots, C} P(S_t = i \mid Y) . \]
(31.49)

Clearly, this amounts to selecting the most likely state for each position independently, using the posterior computed by the Baum–Welch algorithm. Conversely, the latter optimality condition estimates the joint hidden state assignment

s * = arg⁡ max⁡ s P ( Y , S = s ) .
(31.50)

This is a complex inference problem that can be efficiently solved through a dynamic programming approach, known as the Viterbi algorithm. Note that the two optimality definitions generally lead to different solutions. For instance, the Viterbi solution is constrained to provide only state transitions allowed by the generating distribution, while this is not the case for the Baum–Welch solution, given that hidden states are selected independently.

The Viterbi algorithm is based on a backward recursion that exploits a factorization of the maximization problem in (31.50). Consider the restricted problem of determining the hidden state of the tail element T

\[ \max_{s_T} P(Y, S_T = s_T) = \max_{s_T} \prod_{t=1}^{T} P(Y_t \mid S_t)\, P(S_t \mid S_{t-1}) = \prod_{t=1}^{T-1} P(Y_t \mid S_t)\, P(S_t \mid S_{t-1}) \max_{s_T} P(Y_T \mid S_T)\, P(S_T \mid S_{T-1}) , \]
(31.51)

where the joint probability factorizes according to the Markov chain assumption. We can isolate the maximization problem in the rightmost term

\[ \epsilon_{T-1}(s_{T-1}) = \max_{s_T} P(Y_T \mid S_T = s_T)\, P(S_T = s_T \mid S_{T-1} = s_{T-1}) , \]
(31.52)

that is a message conveying information on the maximization of the tail element to the penultimate position. Substituting the definition of ϵ T - 1 ( s T - 1 ) back in (31.51) and adding the maximization with respect to s T - 1 , suggests the recursive formulation of ϵ ( ) for a generic position t - 1 , i. e.,

\[ \epsilon_{t-1}(s_{t-1}) = \max_{s_t} P(Y_t \mid S_t = s_t)\, P(S_t = s_t \mid S_{t-1} = s_{t-1})\, \epsilon_t(s_t) , \]
(31.53)

for 2 ≤ t ≤ T , where ϵ T ( s T ) = 1 is the basis of the recursion. At each step t of the backward recursion, the Viterbi algorithm computes the ϵ-message for each possible assignment of the hidden state at position t and propagates it to the predecessor t - 1 . The recursion ends at the initial element of the sequence, where the initial optimal state is obtained as

\[ s_1^{*} = \arg\max_{s} P(Y_1 \mid S_1 = s)\, P(S_1 = s)\, \epsilon_1(s) . \]
(31.54)

The assignment of the remaining hidden states is obtained by backtracking through the forward recursion

\[ s_t^{*} = \arg\max_{s} P(Y_t \mid S_t = s)\, P(S_t = s \mid S_{t-1} = s_{t-1}^{*})\, \epsilon_t(s) . \]
(31.55)

Note that the Viterbi algorithm is a special case of a max-sum inference algorithm introduced in Sect. 31.2.3.
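
The sketch below implements the joint MAP decoding of (31.50) in the standard forward dynamic-programming formulation with backpointers (equivalent to the backward ϵ-recursion described above), in log-space and with the same illustrative parameters used earlier.

```python
import numpy as np

# Same illustrative HMM parameters as in the previous sketches.
A = np.array([[0.8, 0.3], [0.2, 0.7]])
B = np.array([[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])
y = [0, 1, 2, 2, 0]
T, C = len(y), len(pi)

logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
delta = np.zeros((T, C))            # best log-probability of a path ending in each state
back = np.zeros((T, C), dtype=int)  # backpointers for reconstructing the path

delta[0] = logpi + logB[y[0]]
for t in range(1, T):
    scores = delta[t - 1][None, :] + logA          # scores[i, j]: best path into i via j
    back[t] = scores.argmax(axis=1)
    delta[t] = scores.max(axis=1) + logB[y[t]]

# Backtrack the joint MAP assignment s*.
s = np.zeros(T, dtype=int)
s[-1] = delta[-1].argmax()
for t in range(T - 1, 0, -1):
    s[t - 1] = back[t, s[t]]
print(s)
```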

4.3 Related Models

4.3.1 Higher Order Markov Models

Hidden Markov models serve as a starting point for the design of more complex Markov generative processes, besides the obvious extension to higher order hidden chains [62]. Factorial GlossaryTerm

HMM

s [63] generalize the original model by defining super states that are collections of K discrete hidden states, each being part of an independent Markov chain (see Fig. 31.10). This factorial model results in K hidden Markov chains running in parallel: at each time step, the emission depends on the K-dimensional super state, but each state variable is decoupled from those of the other chains and evolves according to its own dynamics. By this means, it is possible to efficiently encode the state dynamics of K objects evolving independently that interact to jointly determine the observation (e. g., K cars moving in the traffic and jointly determining traffic jams).

Fig. 31.10
figure 10figure 10

Factorial HMM with K = 3 independent hidden Markov chains

4.3.2 Nonhomogenous HMMs

Relaxation of the homogeneity assumption led to the input/output hidden Markov model (GlossaryTerm

IO-HMM

) [64], which allows modeling the causal dependence of the hidden generative process on an additional input sequence x. Basically, the GlossaryTerm

IO-HMM

enables nonhomogeneous transition and emission distributions that are explicitly dependent (i. e., parameterized) on the currently observed label of the input sequence. An GlossaryTerm

IO-HMM

implements a mapping, referred to as transduction, from an observed input sequence x into an output (target) sequence y, realized by the input-conditional hidden process P ( Y | X ) . Interesting applications of GlossaryTerm

IO-HMM

are in learning transformations between modalities in multimedia data [65], exploratory analysis of financial time series [66] and gene data analysis [67].

4.3.3 HMMs for Structured Data

Hidden tree Markov models represent the generative process of more complex, tree-structured information (see Fig. 31.11). Differently from the sequential domain, the direction of the generative process leads to different representational capabilities when dealing with trees. Top-down approaches [68] model all possible paths from the root to the leaves of the tree. Bottom-up models [69] propose a generative process from the leaves to the root, where complex structures are generated by composition of simpler substructures. Recently, an extension of the GlossaryTerm

IO-HMM

has been proposed to learn transductions between trees [70].

Fig. 31.11
figure 11figure 11

A bottom-up hidden tree Markov model for a simple structure with five nodes: the generative process follows the direction of the arrows, i. e., from the leaves to the root (t = 1)

4.3.4 Bayesian and Nonparametric Extensions

GlossaryTerm

HMM

s have been extended to allow a countably infinite number of hidden states through a Bayesian approach where state distributions are modeled by Dirichlet processes [71]. Abstracting from the direction of the arrows in Fig. 31.9 leads to a discriminative probabilistic model known as linear-chain conditional random fields [72], whose capability to model long-term dependencies is widely used in natural language parsing and computer vision.

5 Conclusion and Further Reading

Graphical models have been discussed as an excellent framework for probabilistic modeling of articulated processes that can be described by a static set of random variables tied up by probabilistic relationships. Such relationships need not to be necessarily known, a-priori. Several approaches exists to infer them from data, i. e., to determine the presence of a corresponding edge in the graphical model. However, the same approaches tend to fix the structure of the graphical model, once this is determined from the data. In other words, these graphical models represent a static picture of the process, where the set of random variables and associated relationships is held fixed from a point onward. The nature of sequence data calls for the ability to model more dynamic phenomena. Processing of video information requires Markov networks that can unfold their structure across the video sequence. Even classic text analysis needs to account for novel generative dynamics, where texts are produced as dynamic streams instead of being static collections of words, e. g., consider blog posts and associated comments, or the stream of social networks status updates. Therefore, the horizon of current research is pushing graphical models to more dynamic formulations where, on the one hand, the structure is allowed to change over time and, on the other hand, the model is allowed to dynamically self-tune the number of parameters that is most adequate to represent the process at each time. Following the intuitions underlying the GlossaryTerm

HMM

approach, dynamical graphical models are being proposed that are capable of unfolding their structure across time, to better model the dynamics of complex time-varying processes. At the same time, concepts from nonparametric Bayesian statics are being used to develop models where latent variables can be dynamically adjusted to sample from a virtually infinite set of events and where the very same structure of the latent space is adapted across time, i. e., through variable addition and pruning. Such a new class of dynamic graphical models introduces novel computational challenges associated with inference and representation of dynamic knowledge. The answers to this challenges can be partly found in the chapter, in the approximated inference methods discussed for static models and in the principles underlying the unfolding of Markov chains. Finally, it is worth to note that deep learning, described in Chap. 2, is an instance of graphical model where both nonlinearity and dynamic representations play an important role.