In this chapter we consider the application of modern methods from machine learning to the analysis of biomedical datasets. There are a substantial number of machine learning methods which could be used in the context of bioinformatics, and so we are necessarily selective. We focus on the two commonest themes within machine learning, namely supervised and unsupervised learning. Many methods have been proposed for supervised learning, and so, in Sect. 12.1, we choose to concentrate on kernel-based methods, specifically support vector machines (SVMs). SVMs are a popular approach to classification and have several advantages in handling datasets from bioinformatics. In particular, biomedical data can appear in many forms from continuous-valued and discrete data to network structures and sequence data. These different types of data can be encoded into kernels which quantify the similarity of data objects. We start by introducing classification and the support vector machine for binary classification. In Sect. 12.1.1 we then extend this approach to multiclass classification, learning in the presence of noise, the association of confidence measures to class labels, and regression (i.e., using continuously valued labels). In Sect. 12.1.3 we consider simple kernels, complex kernels for graphs, strings, and sequences, and multiple kernel learning, where we build a decision function for prediction using multiple types of input data.

The second area of machine learning we consider is unsupervised learning, where we are interested in the discovery of structure in data. In Sect. 12.2.1 we start with hierarchical cluster analysis (HCA), given the common usage of this unsupervised learning approach in the biomedical community. We then point out that Bayesian approaches can have certain advantages over HCA. We consider variational approaches to Bayesian unsupervised learning and illustrate the use of this approach in finding novel disease subtypes in cancer research. We then consider Markov chain Monte Carlo (MCMC) approaches, which can be more accurate than variational methods, and illustrate their use in cancer research for finding the most probable pathway structure from a set of candidate pathway topologies.

Supervised Learning

Many bioinformatics problems involve prediction over two classes; for example, we may want to predict whether a tumor is benign or malignant, based on genetic data. An abstract learning machine will learn from training data and attempt to generalize and thus make predictions on novel input data. For the training data we have a set of input vectors, denoted x_i, with each input vector having a number of component features. These input vectors are paired with corresponding labels, which we denote y_i, and there are m such pairs (i = 1, …, m). Thus, for our cancer example, y_i = +1 may denote malignant and y_i = −1 benign. The matching x_i are input vectors encoding the genetic data derived from each patient i. Typically, we would be interested in quantifying the prediction performance before any practical usage, and so we would evaluate a test error based on a separate test set of data.

The training data can be viewed as labeled datapoints in an input space, which we depict in Fig. 12.1. For two classes of well-separated data, the learning task amounts to finding a directed hyperplane, i.e., an oriented hyperplane such that datapoints on one side are labeled y_i = +1 and those on the other side y_i = −1. The directed hyperplane found by a support vector machine is intuitive: it is the hyperplane which is maximally distant from the two classes of labeled datapoints. The closest such points on both sides have the most influence on the position of this separating hyperplane and are therefore called support vectors. The separating hyperplane is given as w ⋅ x + b = 0 (where “⋅” denotes the inner or scalar product). Here b is the bias or offset of the hyperplane from the origin in input space, and x are points lying within the hyperplane. The normal to the hyperplane, the weight vector w, determines its orientation.

Fig. 12.1

(a) The argument inside the decision function of our SVM classifier is w ⋅ x + b. The separating hyperplane corresponding to w ⋅ x + b = 0 is shown as a line on this plot. This hyperplane separates the two classes of data, with points on one side labeled y_i = +1 (w ⋅ x + b ≥ 0) and points on the other side labeled y_i = −1 (w ⋅ x + b < 0). (b) The perpendicular distance between the separating hyperplane and a hyperplane through the closest points (the support vectors) is called the margin, γ. x_1 and x_2 are examples of support vectors of opposite sign. The hyperplanes passing through the support vectors are canonical hyperplanes, and the region between the canonical hyperplanes is the margin band. The projection of the vector (x_1 − x_2) onto the normal to the separating hyperplane, w/||w||_2, is 2γ

Of course this picture is too simple for many applications. The two clusters could be highly intermeshed with many overlapping datapoints: the dataset is then not linearly separable. This situation is one motivation for introducing the concept of kernels later in this chapter. We can also see that stray datapoints could have a significant impact on the orientation of the hyperplane, and so we need a mechanism for handling anomalous datapoints and noise.

Statistical learning theory is the theoretical study of learning and generalization. From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from a theoretical upper bound on the generalization error, that is, the theoretical prediction error when applying the classifier to novel, unseen instances. This generalization error bound has two important features.

  1. The bound is minimized by maximizing the margin, γ, i.e., the minimal distance between the hyperplane separating the two classes and the closest datapoints to the hyperplane, and

  2. The bound does not depend on the dimensionality of the space.

Suppose we consider a binary classification task with datapoints x_i (i = 1, …, m) having corresponding labels y_i = ±1 and a decision function

$$ f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) , $$
(12.1)

where ⋅ is the inner product. From the decision function we see that the data are correctly learnt if y_i(w ⋅ x_i + b) > 0 ∀i, since (w ⋅ x_i + b) should be positive when y_i = +1 and negative when y_i = −1. The decision function is invariant under a positive rescaling of the argument inside the sign-function, leading to an ambiguity in defining a distance measure and therefore the margin. Thus, we implicitly define a scale for (w, b) by setting w ⋅ x + b = 1 for the closest points on one side and w ⋅ x + b = −1 for the closest points on the other side. The hyperplanes passing through w ⋅ x + b = 1 and w ⋅ x + b = −1 are called canonical hyperplanes, and the region between these canonical hyperplanes is called the margin band. Let x_1 and x_2 be two points lying on the canonical hyperplanes on either side of the separating hyperplane (Fig. 12.1b). If w ⋅ x_1 + b = 1 and w ⋅ x_2 + b = −1, we deduce that w ⋅ (x_1 − x_2) = 2. For the separating hyperplane w ⋅ x + b = 0, the unit normal vector is w/||w||_2 (where ||w||_2 is the square root of w^T w). Thus, the distance between the two canonical hyperplanes is equal to the projection of x_1 − x_2 onto this normal, which gives (x_1 − x_2) ⋅ w/||w||_2 = 2/||w||_2. As half the distance between the two canonical hyperplanes, the margin is therefore γ = 1/||w||_2. Maximizing the margin is therefore equivalent to minimizing

$$ \frac{1}{2}\, \|\mathbf{w}\|_2^2 , $$
(12.2)

subject to the constraints

$$ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \forall i . $$
(12.3)

This is a constrained optimization problem in which we minimize an objective function (12.2) subject to the constraints (12.3).

As a constrained optimization problem, the above formulation can be reduced to minimization of a Lagrange function, consisting of the sum of the objective function and the m constraints multiplied by their respective Lagrange multipliers, denoted α i . We will call this the primal formulation

$$ L(\mathbf{w}, b) = \frac{1}{2} (\mathbf{w} \cdot \mathbf{w}) - \sum_{i=1}^{m} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right] , $$
(12.4)

where α i are Lagrange multipliers, and thus α i  ≥ 0 (a necessary condition for the Lagrange multiplier). At the optimum, we can take the derivatives with respect to b and w and set these to zero

$$ \frac{\partial L}{\partial b} = - \sum_{i=1}^{m} \alpha_i y_i = 0 , $$
(12.5)
$$ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0 . $$
(12.6)

Substituting w from (12.6) back into L(w, b) we then get a dual formulation (also known as a Wolfe dual)

$$ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) , $$
(12.7)

which must be maximized with respect to the α i subject to the constraints

$$ \alpha_i \geq 0 , \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 . $$
(12.8)

The objective function in (12.7) is quadratic in the parameters α i , and thus it is a constrained quadratic programming (QP) problem. QP is a standard problem in optimization theory, so there are a number of resources available, such as QUADPROG in MATLAB, MINOS, and LOQO. In addition there are a number of packages specifically written for SVMs such as SVMlight, LIBSVM, and SimpleSVM [12.1].
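As an illustration of the optimization involved, the following sketch sets up the dual problem (12.7) subject to (12.8) for a linear kernel and hands it to a generic QP solver. The use of numpy and cvxopt, and the function name, are assumptions made here for illustration; in practice one of the dedicated SVM packages listed above would normally be used.

```python
# A minimal sketch (not from the original text): the hard-margin SVM dual
# (12.7)-(12.8) posed as a standard QP, min (1/2) a'Pa + q'a s.t. Ga <= h, Aa = b.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """X: (m, d) array of input vectors; y: (m,) array of +/-1 labels.
    Returns the optimal Lagrange multipliers alpha (assumes separable data)."""
    m = X.shape[0]
    K = (X @ X.T).astype(float)                   # linear kernel x_i . x_j
    P = matrix(np.outer(y, y) * K)                # P_ij = y_i y_j K_ij
    q = matrix(-np.ones(m))                       # maximize sum(alpha) <=> minimize -1'alpha
    G = matrix(-np.eye(m))                        # -alpha_i <= 0, i.e., alpha_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))    # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])
```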

So far we have not considered point (2), i.e., that the generalization bound does not depend on the dimensionality of the space. For the objective (12.7) we notice that the data x i only appear inside an inner product. To get an alternative representation of the data, we could therefore map datapoints into a space with different dimensionality, called feature space, through

$$ \mathbf{x}_i \cdot \mathbf{x}_j \rightarrow \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) , $$
(12.9)

where Φ(⋅) is the mapping function. Data which are not separable in input space can always be separated in a space of high enough dimensionality. A consequence of the generalization bound, given in (2), is that there is no loss of generalization performance if we map to a feature space where the data are separable and a margin can be defined.

Surprisingly, the functional form of the mapping Φ(x_i) does not need to be known in general, since it is implicitly defined by the choice of kernel K(x_i, x_j) = Φ(x_i) ⋅ Φ(x_j), i.e., the inner product in feature space. For nonseparable continuous-valued data a common choice is the Gaussian kernel (Sect. 12.1.3)

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{e}^{-(\mathbf{x}_i - \mathbf{x}_j)^2 / 2\sigma^2} . $$
(12.10)

The introduction of a kernel with its implied mapping to feature space is known as kernel substitution. The class of mathematical functions which can be used as kernels is very general. Apart from continuous-valued data we can also consider many data objects which appear in bioinformatics such as graphs (representing networks and pathways), strings, and sequences (such as genetic or protein sequence data).
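As a small illustration, the Gaussian kernel of (12.10) can be computed for a whole dataset at once; the following sketch (assuming numpy, with the function name chosen here for illustration) returns the kernel matrix between two sets of input vectors.

```python
# A sketch of computing the Gaussian kernel matrix of (12.10).
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2)) for rows of X and Z."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))
```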

For binary classification with a given choice of kernel the learning task therefore involves maximization of

$$ W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) , $$
(12.11)

subject to the constraints (12.8). The bias, b, is found separately. For a datapoint with y i  =+ 1,

$$ \min_{\{i \mid y_i = +1\}} \left[ \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \right] = \min_{\{i \mid y_i = +1\}} \left[ \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right] + b = 1 , $$
(12.12)

using (12.6), with a similar expression for datapoints labeled y i  =− 1. We thus deduce that

$$ b = -\frac{1}{2} \left\{ \max_{\{i \mid y_i = -1\}} \left[ \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right] + \min_{\{i \mid y_i = +1\}} \left[ \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right] \right\} . $$
(12.13)

Thus, to construct an SVM binary classifier, we place the data (x_i, y_i) into (12.11) and maximize W(α) subject to the constraints (12.8). From the optimal values of α_i, which we denote α_i^*, we calculate the bias, denoted b^*, from (12.13). For a novel input vector z, the predicted class is then based on the sign of

$$ \phi(\mathbf{z}) = \sum_{i=1}^{m} \alpha_i^{*} y_i K(\mathbf{x}_i, \mathbf{z}) + b^{*} , $$
(12.14)

where b^* denotes the value of the bias at optimality.
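Putting these pieces together, the binary classifier can also be obtained from an off-the-shelf implementation once a kernel matrix has been computed. The following sketch assumes scikit-learn is available and uses its precomputed-kernel interface; it is one possible realization rather than the construction prescribed by the text.

```python
# A hedged sketch: training an SVM on a precomputed kernel matrix and predicting
# novel points via the sign of the decision function, cf. (12.11)-(12.14).
import numpy as np
from sklearn.svm import SVC

def train_and_predict(K_train, y_train, K_test_train):
    """K_train: (m, m) kernel between training points;
    K_test_train: (n, m) kernel between each test point and the training points."""
    clf = SVC(kernel="precomputed", C=1e6)   # a very large C approximates the hard margin
    clf.fit(K_train, y_train)
    return clf.predict(K_test_train)
```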

Multiclass Classification and Other Extensions

Multiclass Classification

Many bioinformatics problems involve multiclass classification, and a number of schemes have been proposed. If the number of classes is small then we can use a directed acyclic graph (DAG) [12.2], with the learning task reduced to binary classification at each node. The idea is illustrated in Fig. 12.2. Suppose we consider a three-class classification problem. The first node is a classifier making the binary decision label 1 versus label 3, say. Depending on the outcome of this decision, the next steps are the decisions 1 versus 2, or 2 versus 3. We could also use a series of one-against-all classifiers [12.3]. Thus, we construct C separate SVMs, with the c-th SVM trained using data from class c as the positively labeled samples and the remaining classes as the negatively labeled samples. Associated with the c-th SVM we have f_c(z) = Σ_i y_i^c α_i^c K(z, x_i) + b^c, and the novel input z is assigned to the class c for which f_c(z) is largest. Other schemes have been suggested [12.4,5,6,7].

Fig. 12.2

A multiclass classification problem reduced to a series of binary classification tasks
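The one-against-all scheme described above can be sketched as follows; the use of scikit-learn's SVC and a linear kernel is an assumption for illustration, and the decision_function output plays the role of f_c(z).

```python
# A sketch of one-against-all multiclass classification with SVMs.
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_predict(X_train, y_train, X_test, classes):
    scores = []
    for c in classes:
        y_binary = np.where(y_train == c, 1, -1)          # class c vs the rest
        clf = SVC(kernel="linear").fit(X_train, y_binary)
        scores.append(clf.decision_function(X_test))      # f_c(z) for each test point
    scores = np.vstack(scores)                            # shape (C, n_test)
    return np.asarray(classes)[np.argmax(scores, axis=0)] # assign to the largest f_c(z)
```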

Learning with Noise: Soft Margins

Many biomedical datasets are intrinsically noisy, and a learning machine could fit to this noise, leading to poor generalization. As remarked earlier, outliers can have an undue influence on the position of the separating hyperplane used by an SVM (Fig. 12.1). Potential noise in a dataset can be handled by the introduction of a soft margin [12.8]. Two schemes are commonly used. With an L_1 error norm, the learning task is the same as in (12.11) and (12.8) except for the introduction of the box constraint

$$ 0 \leq \alpha_i \leq C . $$
(12.15)

On the other hand, for an L_2 error norm, the learning task is (12.11) and (12.8) except for the addition of a small positive constant to the leading diagonal of the kernel matrix

$$ K(\mathbf{x}_i, \mathbf{x}_i) \leftarrow K(\mathbf{x}_i, \mathbf{x}_i) + \lambda . $$
(12.16)

The appropriate values of these parameters can be found by means of a validation study. With sufficient data we would split the dataset into a training set, a validation set, and a test set. With regularly spaced values of C or λ, we train the SVM on the training data and find the best choice for this parameter based on the validation error. With more limited data, we may use cross-validation, or rotation estimation, in which the data are randomly partitioned into subsets and rotated successively as training and validation data.

With many biomedical datasets there is an imbalance between the amount of data in different classes, or the significance of the data in the two classes can be quite different; for example, for the detection of tumors on magnetic resonance imaging (MRI) scans, it may be best to allow a higher number of false positives if this improves the true-positive detection rate. The balance between the detection rates for different classes can easily be shifted by introducing asymmetric soft margin parameters [12.9]. Thus, for binary classification we use 0 ≤ α_i ≤ C_+ (for y_i = +1) and 0 ≤ α_i ≤ C_− (for y_i = −1) with an L_1 error norm, and K(x_i, x_i) ← K(x_i, x_i) + λ_+ (if y_i = +1) and K(x_i, x_i) ← K(x_i, x_i) + λ_− (if y_i = −1) with an L_2 error norm.
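In a library implementation, these soft margin and class-balance parameters typically appear as a penalty parameter and per-class weights. The snippet below is a brief sketch using scikit-learn (an assumption; the particular weight values are illustrative only), where class_weight rescales C separately for each class, giving the asymmetric penalties C_+ and C_−.

```python
# A sketch of an L1-norm soft margin with asymmetric class penalties.
from sklearn.svm import SVC

# class_weight multiplies C per class, so effectively C_+ = 10 * C and C_- = C here;
# suitable values of C (and the weights) would be chosen by a validation study.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0, -1: 1.0})
# clf.fit(X_train, y_train) with labels in {+1, -1}, then clf.predict(X_test)
```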

Introducing a Confidence Measure

Suppose we are using a support vector machine for diagnostic categorization or prediction of disease progression; then it would plainly be useful to have a confidence measure associated with the class assignment. A clinician could be expected to plan differently given a high-confidence prediction rather than a low-confidence one. An SVM has a built-in quantity that could provide a confidence measure for the class assignment, i.e., the distance of a new point from the separating hyperplane (Fig. 12.1). A new datapoint with a large distance from the separating hyperplane should be assigned a higher degree of confidence than a point which lies close to the hyperplane. Before thresholding, the output of an SVM is given by

$$ \phi(\mathbf{z}) = \sum_{i} y_i \alpha_i K(\mathbf{x}_i, \mathbf{z}) + b . $$
(12.17)

One approach is to fit a probability measure p(y|ϕ) directly [12.10]. A good choice for the mapping function is the sigmoid

$$ p(y = +1 \mid \phi) = \frac{1}{1 + \exp(A\phi + B)} , $$
(12.18)

with the parameters A and B found from the training set (y i , ϕ(x i )). Let us define t i as the target probabilities

$$ t_i = \frac{y_i + 1}{2} , $$
(12.19)

so for y i  ∈{− 1, 1} we have t i  ∈{ 0, 1}. We find A and B by performing the following minimization over the entire training set:

$$ \min_{A,B} \left\{ - \sum_{i} \left[ t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \right] \right\} , $$
(12.20)

where p_i is simply (12.18) evaluated at ϕ(x_i). This is a straightforward two-dimensional minimization problem which can be solved using a variety of optimization methods. Once the sigmoid has been found using the training set, we can use (12.18) to calculate the probability that a new test point belongs to either class. Figure 12.3 shows the margins of the training datapoints (x-axis) and the fitted sigmoid for a microarray dataset for ovarian cancer; the classification task is ovarian cancer versus normal. There are no datapoints present in the band between +1 and −1 since this corresponds to the margin band of the SVM.

Fig. 12.3

Probability of membership of one class (y-axis) versus margin (x-axis). The plot shows the training points and fitted sigmoid for an ovarian cancer dataset
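The sigmoid fit of (12.18)–(12.20) can be carried out with a generic optimizer; the following is a minimal sketch assuming numpy and scipy (the choice of the Nelder–Mead method and the starting values are assumptions, not part of the original procedure).

```python
# A sketch of fitting the sigmoid parameters A and B by minimizing (12.20).
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(phi, y):
    """phi: SVM outputs phi(x_i) on the training set; y: labels in {-1, +1}."""
    t = (y + 1) / 2.0                                  # target probabilities, cf. (12.19)
    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * phi + B))          # cf. (12.18)
        eps = 1e-12                                    # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    res = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
    return res.x                                       # fitted (A, B)
```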

Regression

Some bioinformatics applications involve regression and the construction of models with real-valued labels y i . To model the dependency between the input vectors x i and the y i we could use a linear function of the form

$$ g(\mathbf{x}_i) = \mathbf{w}^{\mathrm{T}} \mathbf{x}_i $$
(12.21)

to approximate y i . Thus, we could minimize the following function in w:

$$ L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} \left[ y_i - g(\mathbf{x}_i) \right]^2 $$
(12.22)

to get a solution y_i ≈ g(x_i). If we make a mapping to feature space, x_i → Φ(x_i), and introduce a variable ξ_i = y_i − w^T Φ(x_i), then (12.22) could be reformulated as

$$ \min_{\mathbf{w}, \xi} \left\{ L = \sum_{i=1}^{m} \xi_i^2 \right\} , $$
(12.23)

subject to the constraints

$$ y_i - \mathbf{w}^{\mathrm{T}} \Phi(\mathbf{x}_i) = \xi_i \quad \forall i , $$
(12.24)
$$ \mathbf{w}^{\mathrm{T}} \mathbf{w} \leq B^2 . $$
(12.25)

The latter constraint (12.25) is a regularization constraint on the w; that is, it is used to avoid an overcomplex solution which fits to noise in the data, leading to poor generalization. As a constrained optimization problem with objective function (12.23) and two constraint conditions (12.24) and (12.25), we derive a Lagrange function

$$ L = \sum_{i=1}^{m} \xi_i^2 + \sum_{i=1}^{m} \beta_i \left[ y_i - \mathbf{w}^{\mathrm{T}} \Phi(\mathbf{x}_i) - \xi_i \right] + \lambda \left( \mathbf{w}^{\mathrm{T}} \mathbf{w} - B^2 \right) $$
(12.26)

with Lagrange multipliers β i and λ for these two constraint conditions. If we take derivatives of L with respect to ξ i and w, we get

$$ \xi_i = \frac{1}{2} \beta_i , $$
(12.27)
$$ \mathbf{w} = \frac{1}{2\lambda} \sum_{i=1}^{m} \beta_i \Phi(\mathbf{x}_i) . $$
(12.28)

Substituting these back into L gives the dual formulation

$$ W = \sum_{i=1}^{m} \left( -\frac{1}{4} \beta_i^2 + \beta_i y_i \right) - \frac{1}{4\lambda} \sum_{i,j=1}^{m} \beta_i \beta_j K(\mathbf{x}_i, \mathbf{x}_j) - \lambda B^2 , $$
(12.29)

which with a redefined variable α i  = β i /2λ (a positive rescaling, since λ ≥ 0) gives the following restatement:

$$ \max_{\alpha_i, \lambda} \left\{ W = -\lambda^2 \sum_{i=1}^{m} \alpha_i^2 + 2\lambda \sum_{i=1}^{m} \alpha_i y_i - \lambda \sum_{i,j=1}^{m} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - \lambda B^2 \right\} . $$
(12.30)

In contrast to an SVM for binary classification, where we must solve a constrained QP problem, (12.30) gives a direct solution

$$ \boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} . $$
(12.31)

This gives a means for finding w in (12.21) via (12.28). The above is one of the simplest kernel-based approaches to regression [12.11,12]. However, for prediction on a novel datapoint we implicitly use all the datapoints via the kernel matrix K. Sample sparsity is desirable since it reduces the complexity of the model. For this reason, this is not the best approach to finding a regression function, and thus a number of approaches which involve constrained quadratic programming have been developed [12.13,14,15,16,17]. These minimize the number of support vectors, favoring sparse hypotheses and giving smooth functional approximations to the data.
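As a brief illustration of this simple kernel regression, the direct solution (12.31) and the corresponding predictor g(z) = Σ_i α_i K(x_i, z) can be written in a few lines; the sketch below assumes numpy and that suitable kernel matrices have already been computed.

```python
# A sketch of kernel ridge-style regression via the direct solution (12.31).
import numpy as np

def kernel_regression_fit(K, y, lam):
    """Solve (K + lambda I) alpha = y for alpha, cf. (12.31)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def kernel_regression_predict(K_test_train, alpha):
    """K_test_train[j, i] = K(x_i, z_j); returns g(z_j) = sum_i alpha_i K(x_i, z_j)."""
    return K_test_train @ alpha
```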

Case Study 1: Predicting Disease Progression

As an example of the use of SVMs within the context of bioinformatics, we consider an application to predicting disease progression. In this example, the objective is to predict relapse versus nonrelapse for Wilms' tumor, a cancer which affects children and young adults [12.18]. This tumor originates in the kidney, but it is a curable disease in the large majority of affected children. However, there is a recognized aggressive subtype with a high probability of relapse within a few years. It is therefore clinically important to predict the risk of relapse when the disease is first diagnosed, with an alternative treatment regime if the risk of relapse is high. In this study we used microarray data as input to the support vector machine. The microarray had 30720 probes, each measuring gene expression, i.e., roughly quantifying the level at which each gene is transcribed. A number of these readings were of poor quality, so quality filtering was used, reducing the number of features to 17790. The dataset consisted of 27 samples, of which 13 were from patients who relapsed with the disease within 2 years, with the remainder labeled as nonrelapse due to long-term survival without disease recurrence.

Fig. 12.4

The number of LOO test errors (y-axis) versus the number of top-ranked features (x-axis) remaining (with features ranked by a t-test statistic) for predicting relapse or nonrelapse for Wilms' tumor

The task is binary classification, and the large number of features means the relatively small number of datapoints are embedded in a high-dimensional space. As a consequence, the dataset is separable and we use a linear kernel K(x i , x j ) = x i  ⋅ x j . Since we do not have large training and test sets we use leave-one-out (LOO) testing; i.e., we train on 26 samples and evaluate classifier performance on the single left-out datapoint, successively rotating the test datapoint through the data. The vast majority of features are likely to be irrelevant, so we use feature selection to remove uninformative features and thus improve performance.

Using a t-test statistic to rank features [12.19], we can obtain a minimal LOO test error of 1 out of 27 (the t-test must be run separately within each LOO partitioning to avoid corrupting the test statistic). However, this tentative prediction performance is only achieved with a small set of the most informative features; if all features are included, the vast number of uninformative and noisy features is sufficient to overwhelm this predictive genetic signature.
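The evaluation protocol used here, with the t-test ranking recomputed inside every leave-one-out fold so that the left-out sample never influences feature selection, can be sketched as follows (numpy, scipy, and scikit-learn are assumed; the number of retained features is an illustrative parameter).

```python
# A sketch of leave-one-out testing with per-fold t-test feature selection.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_errors(X, y, n_features=50):
    """X: (m, p) expression matrix; y: (m,) labels in {+1, -1}."""
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # rank features by the absolute t-statistic between the two classes
        t_stat, _ = ttest_ind(X_tr[y_tr == 1], X_tr[y_tr == -1], axis=0)
        top = np.argsort(-np.abs(t_stat))[:n_features]
        clf = SVC(kernel="linear").fit(X_tr[:, top], y_tr)
        errors += int(clf.predict(X[test_idx][:, top])[0] != y[test_idx][0])
    return errors
```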

Different Types of Kernels

Biomedical data can appear in many different formats, and in this section we consider how to construct kernels representing these different types of data.

Permissible Kernels

If a proposed kernel matrix is positive semidefinite (PSD) then it is an allowable kernel. For any arbitrary set of real-valued variables a 1,… , a m , a PSD kernel satisfies

$$ \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j K(\mathbf{x}_i, \mathbf{x}_j) \geq 0 . $$
(12.32)

This type of kernel is symmetric, K(x i , x j ) = K(x j , x i ), with positive components on the diagonal, K(x, x) ≥ 0. An example is the linear kernel introduced earlier

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j . $$
(12.33)

It is symmetric, K(x, x) ≥ 0, and it satisfies (12.32) since

$$ \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j (\mathbf{x}_i \cdot \mathbf{x}_j) = \left\| \sum_{i=1}^{m} a_i \mathbf{x}_i \right\|^2 \geq 0 . $$
(12.34)

We can determine if a proposed kernel matrix is PSD by determining its spectrum of eigenvalues: if the matrix has at least one negative eigenvalue λ, with corresponding eigenvector v, say, then v^T K v = λ v^T v < 0, so it is not PSD. There are, indeed, strategies for handling non-PSD kernel matrices [12.20,21,22,23,24].
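Numerically, this eigenvalue check is straightforward; the following sketch (assuming numpy, with a small tolerance for round-off) tests whether a candidate kernel matrix is PSD in the sense of (12.32).

```python
# A sketch of checking the PSD property of a candidate kernel matrix.
import numpy as np

def is_psd(K, tol=1e-10):
    K = 0.5 * (K + K.T)                                 # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))  # all eigenvalues non-negative
```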

From permissible kernels we can construct other permissible kernels; for example,

  1. If K_1(x_i, x_j) is a kernel then so is

     $$ K(\mathbf{x}_i, \mathbf{x}_j) = c\, K_1(\mathbf{x}_i, \mathbf{x}_j) , $$
     (12.35)

     where c is a positive constant.

  2. If K_1(x_i, x_j) and K_2(x_i, x_j) are two kernels then the sum K(x_i, x_j) = K_1(x_i, x_j) + K_2(x_i, x_j) and the product K(x_i, x_j) = K_1(x_i, x_j) K_2(x_i, x_j) are both permissible kernels.

  3. If K_1(x_i, x_j) is a kernel and f(x) is any function of x, then the following is a permissible kernel:

     $$ K(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i)\, K_1(\mathbf{x}_i, \mathbf{x}_j)\, f(\mathbf{x}_j) . $$
     (12.36)

  4. If K_1(x_i, x_j) is a kernel then so is

     $$ K(\mathbf{x}_i, \mathbf{x}_j) = p\left[ K_1(\mathbf{x}_i, \mathbf{x}_j) \right] , $$
     (12.37)

     where p(⋅) is a polynomial with nonnegative coefficients. As an example, the following is a kernel:

     $$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[ K_1(\mathbf{x}_i, \mathbf{x}_j) \right] , $$
     (12.38)

     since exp(⋅) can be expanded in a Taylor series with positive coefficients.

A further manipulation we can apply to a kernel matrix is normalization. This is achieved using a modified mapping function to feature space, x → Φ(x)/||Φ(x)||_2. The normalized kernel is then

$$ \hat{K}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)}{\|\Phi(\mathbf{x}_i)\|_2 \, \|\Phi(\mathbf{x}_j)\|_2} = \frac{\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)}{\sqrt{\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_i)} \, \sqrt{\Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}_j)}} = \frac{K(\mathbf{x}_i, \mathbf{x}_j)}{\sqrt{K(\mathbf{x}_i, \mathbf{x}_i)\, K(\mathbf{x}_j, \mathbf{x}_j)}} . $$
(12.39)

Thus, consider the Gaussian kernel introduced in (12.10). Since

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{(\mathbf{x}_i - \mathbf{x}_j)^2}{2\sigma^2} \right) = \exp\left( \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\sigma^2} - \frac{\mathbf{x}_i \cdot \mathbf{x}_i}{2\sigma^2} - \frac{\mathbf{x}_j \cdot \mathbf{x}_j}{2\sigma^2} \right) = \frac{\exp\left( \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\sigma^2} \right)}{\sqrt{\exp\left( \frac{\mathbf{x}_i \cdot \mathbf{x}_i}{\sigma^2} \right) \exp\left( \frac{\mathbf{x}_j \cdot \mathbf{x}_j}{\sigma^2} \right)}} , $$
(12.40)

it is a normalized kernel. Since we know that the linear kernel K(x_i, x_j) = x_i ⋅ x_j is a permissible kernel, the validity of the Gaussian kernel follows from properties (1), (2), and (4) above.
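Applied to a kernel matrix rather than to the mapping function, the normalization (12.39) amounts to dividing each entry by the square roots of the corresponding diagonal entries, as in the following sketch (assuming numpy and a kernel matrix with a strictly positive diagonal).

```python
# A sketch of kernel matrix normalization, cf. (12.39).
import numpy as np

def normalize_kernel(K):
    d = np.sqrt(np.diag(K))        # sqrt(K(x_i, x_i)) for each i
    return K / np.outer(d, d)
```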

For the Gaussian kernel, and various other kernels, there is a kernel parameter (the σ for the Gaussian). The value for this parameter needs to be found, and there are several ways to do this. If we have enough data we can split it into a training set, a validation set, and a test set. We then pursue a validation study in which the learning machine is trained at regularly spaced choices of the kernel parameter, and we use that value which minimizes the validation error (the error on the validation data). If insufficient data are available and we are considering classification, then we can estimate the kernel parameter using generalization bounds with no recourse to using validation data [12.25,26,27,28,29].

Kernels for Strings and Sequences

Strings appear in many bioinformatics contexts; for example, we could be considering DNA genetic sequences composed of the four DNA bases A, C, G, and T, or we could be considering proteinogenic amino acid sequences composed of the 21 amino acids found in eukaryotes. Strings can be defined as ordered sets of symbols drawn from an alphabet. We can often see a degree of similarity between strings. Thus, suppose we consider the genetic sequences ACTGA, CCACTG, and CTGACT. They have the string CTG in common, and a matching algorithm should pick up this similarity irrespective of the differing prefixes and suffixes. Strings can also differ by deletions or insertions; thus ACGA differs from ACTGA by a gap consisting of a single deletion.

We can consider two distinct categories when matching ordered sets of symbols [12.30]. The first we will call string matching: in this case contiguity of the symbols is important. For the second category, sequence matching, only the order is important. Thus, for our example with ACGA and ACTGA, there are only two short contiguous strings in common: AC and GA. On the other hand, A, C, G, and A are ordered the same way in both words. When matching the same genes between two individuals, there will be extensive commonality, interrupted by occasional mutations and rare deletions and insertions. In this section we consider the p-spectrum kernel (contiguity is necessary), the subsequence kernel (contiguity is not necessary and the order of symbols is important), and a gap-weighted kernel which is influenced by both contiguity and order.

The p-Spectrum Kernel

Two strings can be compared by counting the number of contiguous substrings of length p which they have in common. A p-spectrum kernel [12.31,32,33] is based on the set of frequencies of all contiguous substrings of length p; for example, suppose we wish to compute the 2-spectrum of the string s = CTG. There are two contiguous substrings of length p = 2, namely u_1 = CT and u_2 = TG, both with a frequency of 1. As a further example, let us consider the set of strings s_1 = CTG, s_2 = ACT, s_3 = CTA, and s_4 = ATG. The 2-spectrum mapping function Φ is given in Table 12.1, where each entry is the number of occurrences of the substring u (say u = CT) in the given string (say s = CTG). The corresponding kernel is shown in Table 12.2.

Table 12.1 A mapping function Φ for the p-spectrum kernel
Table 12.2 The p-spectrum kernel matrix from the mapping function in Table 12.1

Thus, to compute the (CTG, ATG) entry in the kernel matrix, we sum the products of the corresponding row entries under each column in the mapping function (Table 12.1). Only the pair of entries in the TG substring column both have nonzero entries of 1, giving K(CTG, ATG) = 1. For an entry on the diagonal of the kernel matrix, we take the sum of the squares of the entries in the corresponding row in Table 12.1. Thus, for K(CTG, CTG), there are nonzero entries of 1 under CT and TG, and so K(CTG, CTG) = 2.
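A direct way to compute such entries is to build the substring counts and take their inner product; the short sketch below (plain Python, written here for illustration) reproduces the values discussed above, e.g., K(CTG, ATG) = 1 and K(CTG, CTG) = 2 for p = 2.

```python
# A sketch of the p-spectrum kernel between two strings.
from collections import Counter

def p_spectrum(s, p):
    """Counts of all contiguous substrings of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def p_spectrum_kernel(s, t, p=2):
    phi_s, phi_t = p_spectrum(s, p), p_spectrum(t, p)
    return sum(count * phi_t[u] for u, count in phi_s.items())

# p_spectrum_kernel("CTG", "ATG") == 1 and p_spectrum_kernel("CTG", "CTG") == 2
```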

The All-Subsequence Kernel

With this kernel the implicit mapping function is taken over all contiguous and noncontiguous ordered subsequences of a string, which includes the empty set. As an example let us consider two sequences s 1 = CTG and s 2 = ATG. Let Ω represent the empty set, then the mapping function is given in Table 12.3.

Table 12.3 A mapping function Φ for the all-subsequences kernel

The off-diagonal terms in the kernel matrix are then evaluated as the sum across all columns of the products of the two entries in each column. The diagonal terms are the sum of squares of all entries in a row (Table 12.4).

Table 12.4 The all-subsequences kernel matrix for the mapping function in Table 12.3

Finally, we can consider a gap-weighted subsequence kernel. With this kernel a penalization is used so that the length of the intervening gap or insertion decreases the score for the match. As an example, CTG is a subsequence of CATG and CAAAAATG. However, CTG differs from CATG by one deletion, but CTG differs from CAAAAATG by a gap of five symbol deletions. By appropriately weighting the penalty associated with the gap or insertion, we can interpolate between a p-spectrum kernel and the all-subsequence kernel.

Kernels for Graphs

Graphs appear in many settings in bioinformatics; for example, we can use a graph to represent a transcriptional regulatory network with the genes as the nodes and the edges representing functional connections between genes. We can consider two types of similarity. For a given graph we may be interested in the similarity of two nodes within the same graph. On the other hand, we may be interested in constructing a measure of similarity between two different graphs. We can construct kernels for both cases, and we refer to kernels on graphs [12.34,35] when constructing a within-graph kernel between nodes and kernels between graphs [12.36,37] for the latter comparison.

Multiple Kernel Learning

Many bioinformatics prediction tasks involve different types of data. In the preceding sections we saw that kernels are available for encoding a variety of different data types. Thus, we now consider prediction based on multiple types of input data. Plainly, if we build a predictor which can use all available relevant information, then it is likely to be more accurate than a predictor based on one type of data only.

As an example, for our case study 2, considered shortly, we predict protein fold class based on the use of 12 different types of data. These include not only sequence data derived from RNA sequences but also continuous-valued data from various physical measurements. Later, in case study 5, we consider network completion. Based on a training set of known links and nonlinks, we attempt to predict possible links to new nodes in an incomplete network. For the case of network completion with protein–protein interaction data there are a variety of informative data types, such as gene expression correlation, protein cellular localization, and phylogenetic profile data. Individually, these types of data may be weakly informative for the purposes of prediction. However, taken together, they may yield a stronger predictive signal.

This type of problem is called multiple kernel learning (MKL). The most common approach to MKL is to use a linear combination of candidate kernels, with these kernels representing different types of input data. Let 𝒦 be such a set of candidate kernels; then the objective of MKL is to simultaneously find the optimal weighted combination of these kernels and the best classification function. With such a linear combination of p prescribed kernels {K_ℓ : ℓ = 1, …, p}, we have a composite kernel

$$ K = \sum_{\ell=1}^{p} \lambda_\ell K_\ell , $$
(12.41)

where ∑_{ℓ=1}^{p} λ_ℓ = 1, λ_ℓ ≥ 0, and the λ_ℓ are called the kernel coefficients.

There are a number of criteria for learning the kernel. For SVM binary classification an appropriate criterion would be to maximize the margin with respect to all the kernel spaces capable of discriminating different classes of labeled data. In particular, for a given kernel K ∈ 𝒦 with K(x_i, x_j) = Φ(x_i) ⋅ Φ(x_j), we have seen that maximizing the margin involves minimizing ||w||_2 in feature space subject to

$$ y_i \left( \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \right) \geq 1 , \quad i = 1, \ldots, m . $$
(12.42)

As we have seen, maximizing the margin subject to these constraints gives the following dual problem:

$$ \omega(K) = \max_{\alpha} \left\{ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \; : \; \sum_{i=1}^{m} \alpha_i y_i = 0 , \; \alpha_i \geq 0 \right\} . $$
(12.43)

If K is now a linear combination of kernel matrices, as in (12.41), this maximum margin approach to kernel combination learning reduces to

$$ \min_{\lambda} \max_{\alpha} \left\{ \mathcal{L}(\alpha, \lambda) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left[ \sum_{\ell=1}^{p} \lambda_\ell K_\ell(\mathbf{x}_i, \mathbf{x}_j) \right] \right\} $$
(12.44)

subject to

$$ \sum_{i=1}^{m} \alpha_i y_i = 0 , \qquad 0 \leq \alpha_i \leq C , $$
(12.45)
$$ \sum_{\ell=1}^{p} \lambda_\ell = 1 , \qquad \lambda_\ell \geq 0 . $$
(12.46)

ℒ(α, λ) is concave with respect to α and convex with respect to λ. A number of approaches have been proposed for dealing with this type of optimization problem. One of the earliest [12.38] used a semidefinite programming (SDP) formulation. SDP is computationally intensive, so more efficient methods were developed subsequently [12.38,39,40,41,42,43].
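As a simple illustration of the composite kernel (12.41), the sketch below combines several precomputed kernel matrices with fixed, hand-chosen coefficients and trains an SVM on the result (scikit-learn is assumed); a full MKL method would instead optimize the coefficients jointly with the classifier, as in (12.44)–(12.46).

```python
# A sketch of classification with a fixed-weight combination of kernels, cf. (12.41).
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, weights):
    """kernels: list of (m, m) arrays; weights: non-negative coefficients summing to one."""
    return sum(w * K for w, K in zip(weights, kernels))

def fit_composite_svm(kernels, y, weights=None):
    p = len(kernels)
    weights = np.ones(p) / p if weights is None else weights   # uniform weights by default
    K = combine_kernels(kernels, weights)
    return SVC(kernel="precomputed", C=1.0).fit(K, y)
```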

Case Study 2: Protein Fold Prediction Using Multiple Kernel Learning

During folding a protein forms its final three-dimensional structure. Understanding a protein's structure gives important insights into its function; for example, its structure can lead to an understanding of protein–protein interactions or likely biological function. A knowledge of protein structure is also important in the design of small molecular inhibitors for disabling the function of target proteins. We can use machine learning methods to predict protein structure based on sequence and other types of data: indeed this is an obvious application domain for MKL techniques. In this case study we only consider a subproblem of structure prediction in which the predicted label ranges over a set of fold classes. The fold classes are a set of structural components, common across proteins, which give rise to the overall three-dimensional structure. In this study we use 27 fold classes, with 313 proteins used for training and 385 for testing.

There are a number of relevant types of data which can be used to predict fold class, and in this study we use 12 different types of data, each encoded into a kernel. These data types included sequence and physical measurements such as hydrophobicity, polarity, and van der Waals volume. In Fig. 12.5 we illustrate the performance of a MKL algorithm on this dataset. In Fig. 12.5a the vertical bars indicate the test set accuracy based on using one type of data only: for example, H is hydrophobicity, P is polarity, and V is van der Waals volume. The horizontal line indicates the performance of the MKL algorithm with all data types included. Thus, we get an improvement in performance if we use all available relevant sources of data over just using the single most informative data source.

Fig. 12.5

Performance of an MKL method [12.43] on a protein fold prediction dataset. There are 27 classes and 12 types of data. (a) Test set accuracy (TSA, in %) based on individual data types (vertical bars) and using MKL (horizontal line). (b) Kernel coefficients λ_ℓ, which indicate the relative significance of individual types of data

Figure 12.5b gives the values of the kernel coefficients λ_ℓ based on using the linear combination (12.41). The relative height of the peaks indicates the relative significance of the different types of input data. This algorithm indicates that all 12 types of data are relevant, though some types of data are more informative than others (the most informative, SW1 and SW2, are based on sequence alignments). MKL methods have been successfully demonstrated on other bioinformatics problems requiring integration of heterogeneous datasets [12.38,43,44,45,46].

Unsupervised Learning

Having considered supervised learning, we now discuss unsupervised learning and the discovery of structure in biomedical datasets. Hierarchical cluster analysis is a commonly used approach to unsupervised learning in the biomedical literature, and so we briefly review this topic in Sect. 12.2.1. One of our main objectives in this section will be to show that there are some advantages to using more contemporary methods, for example, from Bayesian statistics. In Sects. 12.2.3–12.2.5 we introduce variational methods. These are not as accurate as the Markov chain Monte Carlo (MCMC) methods discussed in Sect. 12.2.7. However, they are fast in practice and thus suited to some of the large datasets which appear in many bioinformatics applications. After introducing variational methods we consider an application to the interpretation of gene expression array datasets in cancer research. After introducing MCMC, we illustrate its use for network inference in Sect. 12.2.8 with an application to finding the most probable network topology given a set of candidate network topologies.

Hierarchical Cluster Analysis

In the biomedical literature, hierarchical cluster analysis (HCA) is a commonly used technique for unsupervised learning. This approach can be divided into agglomerative methods, which proceed through a series of successive fusions of the m samples into larger and larger clusters, and divisive methods, which systematically separate clusters into smaller and smaller groupings. HCA methods are usually represented by a dendrogram which illustrates the successive fusions or divisions produced by this approach. In this section we only consider agglomerative methods. In this case we start with m single-sample clusters, one representing each datapoint. At each stage in the clustering procedure we fuse those samples or groups of samples which are currently closest to each other. There are a number of criteria for evaluating the similarity or closeness of datapoints. Thus, if x_id denotes the d-th feature (d = 1, …, p) of the i-th sample (i = 1, …, m), then a commonly used distance measure is the squared distance

$$ D(\mathbf{x}_i, \mathbf{x}_j) = \sum_{d=1}^{p} (x_{id} - x_{jd})^2 . $$

Other criteria are used such as the correlation coefficient,

$$ C(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_d (x_{id} - \bar{x}_i)(x_{jd} - \bar{x}_j)}{\sqrt{\sum_d (x_{id} - \bar{x}_i)^2} \, \sqrt{\sum_d (x_{jd} - \bar{x}_j)^2}} , $$

where x̄_i = (∑_d x_{id})/p. If each sample vector is standardized to zero mean and unit standard deviation, then clustering based on the correlation coefficient becomes equivalent to use of a squared distance. Using the chosen distance measure, we then derive a distance matrix encoding the distances between all datapoints.

Just as there are a number of ways of quantifying similarity, there are a number of criteria for deciding which clusters to fuse at each stage. Six methods are commonly used in practice: single linkage, complete linkage, average linkage, the centroid and median methods, and Wardʼs method. We will illustrate the approach with single linkage clustering as one of the simplest of these methods. In this case the distance between groupings is defined as the shortest distance between any pair of samples. This method is best illustrated with an example. Thus, suppose we have a set of five samples with a corresponding initial distance matrix given by

$$ D_1 = \begin{pmatrix} 0 & 4 & 8 & 9 & 7 \\ 4 & 0 & 6 & 5 & 6 \\ 8 & 6 & 0 & 3 & 8 \\ 9 & 5 & 3 & 0 & 2 \\ 7 & 6 & 8 & 2 & 0 \end{pmatrix} , $$

with rows and columns indexed by the samples 1–5.

In this matrix it is evident that the smallest distance is that between samples 4 and 5. Thus, we place these two samples into a new cluster. The new distances between this cluster, which we label (45), and the other three samples are obtained from the above distance matrix as

$$ \begin{aligned} d_{1(45)} &= \min(d_{14}, d_{15}) = d_{15} = 7 , \\ d_{2(45)} &= \min(d_{24}, d_{25}) = d_{24} = 5 , \\ d_{3(45)} &= \min(d_{34}, d_{35}) = d_{34} = 3 . \end{aligned} $$

A new distance matrix can be derived based on these distances and the remaining set of pre-existing distances, thus

$$ D_2 = \begin{pmatrix} 0 & 4 & 8 & 7 \\ 4 & 0 & 6 & 5 \\ 8 & 6 & 0 & 3 \\ 7 & 5 & 3 & 0 \end{pmatrix} , $$

with rows and columns indexed by 1, 2, 3, and (45).

The smallest entry in this new distance matrix is then between sample 3 and the cluster (45), and so we form a new three-member cluster and a new set of distances

$$ \begin{aligned} d_{1(345)} &= \min(d_{13}, d_{14}, d_{15}) = d_{15} = 7 , \\ d_{2(345)} &= \min(d_{23}, d_{24}, d_{25}) = d_{24} = 5 , \end{aligned} $$

leading to a new 3 × 3 distance matrix. This process is iterated until all data points belong to one cluster. The sequence of clusterings is therefore

Step   Clusters
1      (1), (2), (3), (4), (5)
2      (1), (2), (3), (45)
3      (1), (2), (345)
4      (12), (345)
5      (12345)
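The same sequence of fusions can be reproduced with a standard implementation; the sketch below uses scipy's hierarchical clustering (an assumed tool, since the text works the example by hand) on the distance matrix D_1.

```python
# A sketch reproducing the single-linkage example above with scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D1 = np.array([[0, 4, 8, 9, 7],
               [4, 0, 6, 5, 6],
               [8, 6, 0, 3, 8],
               [9, 5, 3, 0, 2],
               [7, 6, 8, 2, 0]], dtype=float)

Z = linkage(squareform(D1), method="single")
print(Z)   # each row: the two clusters fused, their distance, and the new cluster size
# The fusion order matches the worked example: (45) at distance 2, then (345) at 3,
# then (12) at 4, and finally the full set at distance 5.
```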

Of course, using the closest pair of points between clusters is not necessarily a good fusion criterion. With complete linkage clustering the distance between two groupings is defined as that between the most distant pair of datapoints, with each pair consisting of one sample from each of the two groups. Again, this type of method can be influenced by outliers within each cluster, and so a better method is to use an average over a cluster instead. With average linkage clustering the distance between two groupings is defined as the average of the distances between all pairs of samples, with one sample in a pair drawn from each group. Thus, with our example above, after merging datapoints 4 and 5 into the first cluster, we would compute the new set of distances from this cluster to the other three datapoints through

$$ \begin{aligned} d_{1(45)} &= \tfrac{1}{2}(d_{14} + d_{15}) = 8.0 , \\ d_{2(45)} &= \tfrac{1}{2}(d_{24} + d_{25}) = 5.5 , \\ d_{3(45)} &= \tfrac{1}{2}(d_{34} + d_{35}) = 5.5 . \end{aligned} $$

With centroid clustering each grouping is represented by a mean vector, that is, a vector composed of the mean values of each feature taken over all samples within the grouping. The distance between two clusters is then the distance between the corresponding mean vectors. However, centroid clustering has the following disadvantage: if the number of members in one cluster is very different from that in the other, then, when they are fused, the centroid of the new cluster will be most heavily influenced by the larger grouping and may remain within that cluster. Clustering can be made independent of group size by treating the two groups as if they were of equal size. Median clustering is a variant of centroid clustering in which the cluster arising from the fusion is accordingly placed at the midpoint between the two merged clusters. Finally, with Ward's method, the two clusters chosen for fusion are the two which result in the least increase in the sum of the distances of each sample to the centroid of its originating cluster.

Of course, the above methods provide an approach to finding a cluster structure but they do not indicate the most likely number of clusters in the data. However, various criteria have been proposed to indicate the most probable number of clusters [12.48,49,50].

Case Study 3: An HCA Application to Colon Cancer

As an illustration, we now apply HCA to a gene expression microarray dataset from a cancer study. A DNA microarray has a series of microscopic probes, each constructed from strings of nucleotides. A microarray has tens of thousands of such probes, each of which can hybridize to a particular target strand of complementary DNA (cDNA). Probe–target hybridization can be measured by using fluorophore labels, for example, with altered levels of gene expression quantified by the level of fluorescence. In Fig. 12.6 we depict the corresponding dendrogram (using average linkage and a correlation coefficient distance measure) for a small microarray dataset consisting of 17 colon tumor and 18 normal samples. The dendrogram has been largely successful in separating normal from cancer samples.

Fig. 12.6

HCA applied to a gene expression microarray dataset [12.47] consisting of 35 samples comprising 17 colon cancer and 18 normal colon samples

Bayesian Unsupervised Learning

HCA has been the method of choice for cluster analysis in many biomedical publications. However, HCA does have some drawbacks. Firstly, there is an implicit assumption that each sample is associated with a particular cluster. This may not be realistic in many situations where a sample would be better represented as overlapping several clusters. Thus, tumors can be genetically heterogeneous; i.e., tissue in different regions of the tumor may have different genetic signatures. The models we describe below are mixed membership models with each sample represented as a combinatorial mixture over clusters. With the clinical assignment of patient samples to clusters or disease subtypes it would also be useful to associate a confidence measure with the cluster assignment. This can be achieved using the Bayesian methods outlined here. Further advantages of a Bayesian approach are the use of sound methods to objectively assess the number of sample clusters and the means to penalize overfitting (at the top levels of the dendrogram in Fig. 12.6 we are finding structure, but, toward the leaves at the base, we are starting to fit to noise in the data).

Let us suppose that M is a model and D is the data; then p(M|D) represents the probability of the model given the data. Intuitively we should seek to maximize the probability of the model given the data. Given an assumed model structure, a fit to data is achieved through the adjustment of model parameters, which we denote Θ as a set and represent as the components of a vector θ. Bayes' theorem then implies that we should maximize

$$ p(\Theta \mid D) = \frac{p(D \mid \Theta)\, p(\Theta)}{p(D)} , $$
(12.47)

where p(Θ|D) is the posterior, p(D|Θ) is the likelihood, and p(Θ) is the prior. Thus, p(Θ|D) ∝ p(D|Θ)p(Θ) states that multiplying the likelihood by our prior beliefs about the parameters, p(Θ), will give us our posterior beliefs about the parameters, having observed the data, p(Θ|D). The normalization term p(D) is called the evidence and can be found through an integration over the θ (we will refer to this as marginalizing out the θ)

$$ p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta . $$
(12.48)

Maximizing the probability of a model, given the data, would require finding the optimal set of values for the model parameters, which we denote Θ̂. We will call this a set of point estimates for these parameters.

However, we cannot make inferences which are not justified by the data. As a consequence, given some set of data, the most we can really say is that there is a spectrum of models which fit the data, and some of these models are more probable than others. We call this a posterior distribution over models. This posterior distribution carries information beyond a model based on point estimates, since a relatively flat posterior distribution means that many models fit the data well and the most probable model is not strongly distinguished from the alternatives. On the other hand, a sharply peaked posterior distribution indicates that a point estimate might be a sound solution to use. In the discussion below we will start with maximum likelihood and maximum a posteriori approaches to unsupervised learning, which use point estimates. Later we discuss approaches which give a full posterior distribution over models.

Maximum Likelihood and Maximum a Posteriori Solutions

With a maximum likelihood (ML) approach we derive a set of parameters that maximize the likelihood of the data given the model parameters, p(D|Θ). With the maximum a posteriori (MAP) approach we find the set of parameters that maximize the posterior, p(Θ|D), given the data. Thus, in terms of parameter-dependent probabilities, the MAP and ML solutions are related through Bayes' rule, p(Θ|D) ∝ p(D|Θ)p(Θ). The MAP solution therefore enables us to include any prior knowledge we may have about the distribution of the parameter values.

The ease with which we may calculate p(Θ|D) depends on the functional forms of the likelihood and the prior. Also, if Θ is high dimensional, the evidence p(D) may be difficult to evaluate. With a MAP solution only point estimates are used for the model parameters. These are denoted Θ_MAP, and they are based on the mode of the posterior distribution. Since the mode is unaffected when the distribution is multiplied by a constant, we can ignore the evidence in the denominator and only use the unnormalized distribution, which we denote p̂(Θ|D) = p(D|Θ)p(Θ). In general, though, we may need to evaluate the evidence, and this motivates our discussion of Monte Carlo methods below.

Variational Bayes

With a variational Bayes approach we can determine a posterior distribution over models. In this approach we approximate a posterior distribution p(R|D) by a parameterized distribution q(R). In this case D is the data, as before, and R is a parameter vector consisting of the model parameters Θ and any hidden variables within the model (for example, hidden labels assigning data components to clusters). Thus, we wish to make q(R) as close as possible to the posterior p(R|D). Since p(R, D) = p(R|D)p(D) and ∫ q(R) dR = 1, we can write

$$ \log[p(D)] = \log\left( \frac{p(R, D)}{p(R \mid D)} \right) $$
(12.49)

so

$$ \begin{aligned} \log[p(D)] &= \int_{-\infty}^{\infty} q(R) \log\left( \frac{q(R)\, p(R, D)}{q(R)\, p(R \mid D)} \right) \mathrm{d}R \\ &= \int_{-\infty}^{\infty} q(R) \log\left( \frac{p(R, D)}{q(R)} \right) \mathrm{d}R + \int_{-\infty}^{\infty} q(R) \log\left( \frac{q(R)}{p(R \mid D)} \right) \mathrm{d}R \\ &= F[R] + \mathrm{KL}[q \,\|\, p] . \end{aligned} $$

The second term

$$ \mathrm{KL}[q \,\|\, p] = \int q(R) \log\left( \frac{q(R)}{p(R \mid D)} \right) \mathrm{d}R $$
(12.50)

is a Kullback–Leibler (KL) divergence, which quantifies the similarity of the two distributions q(R) and p(R|D). The key observation is that log[p(D)] does not depend on q(R). Since we want to minimize the KL divergence between q(R) and p(R|D), we can achieve this by maximizing F[R], which is called the variational free energy. After making distributional and other assumptions we can derive an expectation-maximization (EM) algorithm which optimizes F[R] and gives an effective approximation to the posterior distribution p(R|D).

Though we do not give more detail here, the ML, MAP, and variational Bayes methods outlined above can be extended in many ways; for example, we could consider a marginalized variational Bayes approach in which we marginalize, or integrate out, further model parameters [12.51,52]. With fewer parameters to estimate, this approach gives higher cluster accuracy but at the cost of model interpretability. We could consider semisupervised clustering [12.53] if side information, in the form of some sample labels, is present. Then again, whereas HCA may not be amenable to data integration, Bayesian methods could provide a route to unsupervised learning using multiple types of data; for example, with a correspondence model [12.54,55], we can consider two types of data with one type believed to be dependent on the other. In a bioinformatics context an example would be gene expression array data which we believe may be partially influenced by microRNA expression [12.55].

Case Study 4: An Application to the Interpretation of Expression Array Data in Cancer Research

To illustrate an application of these methods we consider the use of ML, MAP, and variational Bayes methods to interpret gene expression array data derived in cancer research. We will start with a maximum likelihood or MAP solution giving point estimates for the model parameters. We start by specifying a model which includes making certain distributional assumptions for the data and model parameters. This leads us through to a likelihood bound which is optimized using an algorithmic procedure (an EM or expectation-maximization method).

As an example we consider latent process decomposition (LPD) [12.56]. We first make assumptions about the type of data we are considering. For expression array data, a reasonable assumption is that the gene expression measurements follow an approximate Gaussian distribution with a mean μ and standard deviation σ. Each sample in the data has a set of features labeled by an index g = 1, …, G. Then, for feature g we draw a cluster index k (k = 1, …, K) with a probability denoted β_k, which selects a Gaussian with parameters μ_gk and σ_gk. Next we make assumptions about the probability distributions involved. For the β we assume a Dirichlet distribution, which is a standard assumption in this context. This Dirichlet distribution is parameterized by a K-dimensional vector α. The expression array data D consist of a set of samples, indexed a = 1, …, A, each with a set of features labeled by g, and we thus denote the experimental measurements by e_ga. It is normal to work with the log of the likelihood so that we deal with summations rather than products. This is sound since the log-function is monotonic, so maximizing the log-likelihood is equivalent to maximizing the likelihood. The log-likelihood is then log p(D|μ, σ, α), which can be factorized over the individual samples as

$$ \log p(D \mid \mu, \sigma, \alpha) = \sum_{a=1}^{A} \log p(a \mid \mu, \sigma, \alpha) . $$
(12.51)

We can rewrite this as a marginalized integral over the β, that is,

$$ \log p(D \mid \mu, \sigma, \alpha) = \sum_{a=1}^{A} \log \int p(a \mid \mu, \sigma, \beta)\, p(\beta \mid \alpha)\, \mathrm{d}\beta , $$
(12.52)

with p(β|α) being the Dirichlet distribution. We introduce the Gaussian distributional assumption for the data

$$ p(a \mid \mu, \sigma, \beta) = \prod_{g=1}^{G} \sum_{k=1}^{K} \mathcal{N}(e_{ga} \mid k, \mu_{gk}, \sigma_{gk})\, \beta_k . $$
(12.53)

Exact inference with this model is intractable, so to increase the likelihood and therefore, implicitly, the probability of the model given the data, we follow the indirect route of deriving a lower bound on this log-likelihood via Jensenʼs inequality. We then maximize this lower bound using an iterative procedure based on an expectation-maximization algorithm. Thus we get

$$ \sum_{a=1}^{A} \log[p(a \mid \mu, \sigma, \alpha)] = \sum_{a=1}^{A} \log \int_{\beta} \left[ \prod_{g=1}^{G} \sum_{k=1}^{K} \mathcal{N}(e_{ga} \mid k, \mu_{gk}, \sigma_{gk})\, \beta_k \right] p(\beta \mid \alpha)\, \mathrm{d}\beta . $$
(12.54)

If we define E_{p(z)}[z] = ∫ z p(z) dz, then Jensen's inequality for a concave function f(x) states that

$$ f\left( E_{p(z)}[z] \right) \geq E_{p(z)}[f(z)] . $$
(12.55)

We have assumed a Dirichlet distribution for the β, so we can introduce a sample-specific (a-dependent) distribution p(β|γ_a), with expectations taken as

$$ E_{p(\beta \mid \gamma_a)}[f(\beta)] = \int_{\beta} f(\beta)\, p(\beta \mid \gamma_a)\, \mathrm{d}\beta , $$
(12.56)

giving

$$ f\left[ \int_{\beta} p(\beta \mid \gamma_a)\, \beta\, \mathrm{d}\beta \right] = f\left( E_{p(\beta \mid \gamma_a)}[\beta] \right) \geq E_{p(\beta \mid \gamma_a)}[f(\beta)] = \int_{\beta} f(\beta)\, p(\beta \mid \gamma_a)\, \mathrm{d}\beta $$
(12.57)

and so

$$ \begin{aligned} \sum_{a} \log[p(a \mid \mu, \sigma, \alpha)] &= \sum_{a} \log\left[ \int_{\beta} p(a \mid \mu, \sigma, \beta)\, p(\beta \mid \alpha)\, \mathrm{d}\beta \right] \\ &= \sum_{a} \log\left[ \int_{\beta} \left\{ \frac{p(a \mid \mu, \sigma, \beta)\, p(\beta \mid \alpha)}{p(\beta \mid \gamma_a)} \right\} p(\beta \mid \gamma_a)\, \mathrm{d}\beta \right] \\ &= \sum_{a} \log\left[ E_{p(\beta \mid \gamma_a)} \left\{ \frac{p(a \mid \mu, \sigma, \beta)\, p(\beta \mid \alpha)}{p(\beta \mid \gamma_a)} \right\} \right] \\ &\geq \sum_{a} E_{p(\beta \mid \gamma_a)} \left[ \log\left\{ \frac{p(a \mid \mu, \sigma, \beta)\, p(\beta \mid \alpha)}{p(\beta \mid \gamma_a)} \right\} \right] \\ &= \sum_{a} E_{p(\beta \mid \gamma_a)} \left[ \log\{ p(a \mid \mu, \sigma, \beta) \} \right] + \sum_{a} E_{p(\beta \mid \gamma_a)} \left[ \log\{ p(\beta \mid \alpha) \} \right] - \sum_{a} E_{p(\beta \mid \gamma_a)} \left[ \log\{ p(\beta \mid \gamma_a) \} \right] . \end{aligned} $$

It is this lower bound which we optimize using a two-step expectation-maximization algorithm [12.56]. A variational Bayes approach leads to a similar iterative procedure which maximizes the free energy term F[R] defined above in the decomposition of log[p(D)].

We applied ML, MAP, and variational Bayes methods [12.55] to the interpretation of expression array data derived from 78 primary breast cancer samples [12.57]. Using a MAP approach we obtained a solution for the model parameters using an EM algorithm. If we split the data into training and validation data, parameters in the log-likelihood can be estimated from the training set. Using this estimated log-likelihood and the validation data, we can then estimate model complexity, i.e., how many clusters are apparently present in the data – in this case how many subtypes of breast cancer are indicated. In Fig. 12.7a we show the estimated log-likelihood on left-out validation data for this breast cancer dataset. The peak indicates a minimum of five subtypes.

Fig. 12.7

(a) Estimated log-likelihood versus number of clusters using a MAP solution [12.55] for a breast cancer expression array dataset [12.57]. (b) Variational Bayes solution for the same dataset. For both methods the peak at five clusters indicates five principal subtypes of breast cancer. If more data were used, higher resolution may be achieved and we may observe more subtypes
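The train/validation procedure for choosing the number of clusters can be sketched as follows. Since LPD is not part of standard libraries, this hedged illustration uses scikit-learnʼs ordinary Gaussian mixture model as a stand-in for the mixture model, and synthetic data with three true clusters; the logic (fit on training data, score the held-out log-likelihood, take the peak over the number of clusters) mirrors Fig. 12.7a, but the model and data are not those of [12.55,57].

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic stand-in data: three true clusters in ten dimensions (illustrative only)
centers = rng.normal(0.0, 4.0, size=(3, 10))
X = np.vstack([rng.normal(c, 1.0, size=(60, 10)) for c in centers])

X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

# Scan candidate numbers of clusters and score the held-out log-likelihood;
# the peak over k indicates the supported model complexity (cf. Fig. 12.7a)
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    print(k, gmm.score(X_val))   # mean per-sample log-likelihood on validation data
```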

We also used a variational Bayes method with the same dataset [12.55]. In this case the maximum of the free energy versus number of clusters indicates the appropriate model complexity. In Fig. 12.7b the peak is also at 5, indicating that this is the most appropriate number of subtypes to consider. Interestingly, with variational Bayes, we do not need to use validation data.

Having found the most appropriate number of subtypes, we can use these methods to find those genes which function abnormally within particular subtypes. The parameters μ gk and σ gk can be used to model the data distribution for gene g in cluster k. This is illustrated in Fig. 12.8 for two genes, FOXA1 and FOXC1, which appear to be operating abnormally within one of these subtypes (cluster 5, denoted Cl5). The Gaussian distributions are determined by the (μ gk , σ gk ), and the actual distribution of data values is shown below these Gaussian distributions. Whereas FOXA1 and FOXC1 appear to function normally in the other subtypes, FOXA1 appears to be underexpressed and FOXC1 overexpressed within this subtype.

Fig. 12.8

Expression profiles for two genes within a subtype of breast cancer: FOXA1 (a) and FOXC1 (b). This subtype can be identified with the basal-like or basaloid subtype of breast cancer. These two genes show a strong reciprocal anticorrelated expression profile within this subtype (after [12.58])

As pointed out at the beginning, one further aspect in which these Bayesian unsupervised learning methods differ from HCA is that they allow for mixed membership. As a second example (Fig. 12.9), we apply a variational Bayes method to a lung cancer gene expression array dataset derived from 73 patient samples [12.59]. The peaks indicate the confidence that a particular patient belongs to a particular cluster. Many peaks are 1, but a number of samples have appreciable membership in several clusters, possibly indicating an unclear assignment of patient to subtype. We have used lung cancer as an illustration because there are a number of clinically established subtypes for this disease, such as small cell lung cancer and adenocarcinoma of the lung. The clinical assignments are indicated by the boundary markers in this plot, and there is reasonable agreement between the clinical assignment to subtype and the assignments made by the variational Bayes method.

Fig. 12.9

The peaks indicate the confidence measure that a given sample is assigned to a particular cluster (see [12.58] for details about the derivation of this measure). The fourth band down has mainly adenocarcinoma of the lung (samples 1–20), and the plot indicates some confusion of assignment between this category and other subtypes of lung cancer
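The mixed-membership idea can be made concrete with a small sketch: given sample-specific Dirichlet parameters γ a , one simple confidence proxy is the expected mixing proportion of each cluster. The γ a values below are hypothetical, not fitted values for the lung dataset, and [12.58] should be consulted for the measure actually plotted in Fig. 12.9.

```python
import numpy as np

# Hypothetical sample-specific Dirichlet parameters gamma_a for three samples over
# K = 4 clusters (illustrative values, not fitted values for the lung dataset)
gamma = np.array([
    [25.0, 1.0, 1.0, 1.0],   # confident assignment to cluster 1
    [10.0, 9.0, 1.0, 1.0],   # mixed membership across clusters 1 and 2
    [1.0, 1.0, 1.0, 22.0],   # confident assignment to cluster 4
])

# One simple confidence proxy: the expected mixing proportions under each
# sample-specific Dirichlet, E[beta_ak] = gamma_ak / sum_k gamma_ak
confidence = gamma / gamma.sum(axis=1, keepdims=True)
print(np.round(confidence, 2))
```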

Monte Carlo Methods

Monte Carlo methods are potentially more exact than variational methods, and they are commonly used in probabilistic inference. If θ is high dimensional, evaluating the evidence ∫ p(D|θ)p(θ)dθ and finding the posterior p(θ|D) is a difficult task. For the evidence, we could perform a Monte Carlo integration. For an arbitrary integrand h(θ), we can do this by writing h(θ) = f(θ)g(θ), where g(θ) is a distribution from which we can draw samples, so that

\int h(\theta) \,\mathrm{d}\theta = \int f(\theta)\, g(\theta) \,\mathrm{d}\theta = \mathbb{E}_{g(\theta)}[f(\theta)] \approx \frac{1}{M} \sum_{m=1}^{M} f\big(\theta^{(m)}\big) \, ,
(12.58)

so that the integration becomes an expectation of f(θ) over g(θ). By drawing a number of samples θ (m) (m = 1,… , M) from the distribution g(θ) and evaluating f(θ) at these samples, the original integral can be approximately evaluated. We could use this approach to evaluate the evidence, and hence the posterior distribution, by sampling from the unnormalized distribution p̂(θ|D) = p(D|θ)p(θ) and finding its normalization ∫ p̂(θ|D)dθ, which is precisely the evidence. To use this method of integration, we must be able to draw samples reliably from a distribution of choice, such as p̂(θ|D). This is not easy, and here we briefly describe Markov chain Monte Carlo (MCMC) methods [12.60,61,62], based on the Metropolis algorithm, for performing this task.
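A minimal sketch of (12.58) applied to evidence estimation for a toy one-dimensional model, taking g(θ) to be the prior and f(θ) the likelihood; the Gaussian prior, the unit-variance likelihood, and the data values are all hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Toy model (illustrative): prior theta ~ N(0, 2^2), data D i.i.d. ~ N(theta, 1)
D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])

def likelihood(theta):
    # f(theta) = p(D | theta), evaluated for an array of theta values at once
    return np.prod(norm.pdf(D[:, None], loc=theta, scale=1.0), axis=0)

# Monte Carlo integration (12.58) with g(theta) = p(theta), the prior:
# evidence = int p(D|theta) p(theta) dtheta ~ (1/M) sum_m p(D | theta^(m))
M = 100_000
theta_samples = rng.normal(0.0, 2.0, size=M)
print("Monte Carlo evidence estimate:", likelihood(theta_samples).mean())
```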

A Markov chain is a sequence of samples {θ 1, θ 2,…} such that each sample depends only on the previous sample, i.e., p(θ t |θ t−1, θ t−2,… , θ 1) = p(θ t |θ t−1), where t is the iteration index. In the MCMC approach, a proposal or jump distribution Q(θ t |θ t−1) is used to generate θ t from θ t−1. With the Metropolis algorithm we assume that this distribution is symmetric, i.e., that Q(θ t |θ t−1) = Q(θ t−1|θ t ). Our objective is to reliably draw samples from a distribution g(θ), such as p(θ|D), of which we may only know an unnormalized form ĝ(θ), such as p̂(θ|D). Thus, we start the procedure with some θ 0 such that ĝ(θ 0) > 0. After a number of iterations this procedure will tend toward the stationary distribution g(θ), so that θ t represents a random draw from g(θ). In the standard approach to MCMC, to arrive at this stationary distribution, at each step a sample is selected from Q and either accepted or rejected based on a comparison with ĝ(θ). Specifically, a candidate sample θ* is drawn from Q(θ*|θ t−1) and accepted with probability α t given by

\alpha_t = \min\!\left( \frac{\hat{g}(\theta^{\ast})\, Q(\theta_{t-1} \,|\, \theta^{\ast})}{\hat{g}(\theta_{t-1})\, Q(\theta^{\ast} \,|\, \theta_{t-1})},\; 1 \right)
(12.59)

or rejected. If θ* is accepted, then θ t is set equal to θ*; otherwise, if θ* is rejected, then θ t is set equal to θ t−1. The Metropolis–Hastings algorithm [12.63,64,65] extends the Metropolis algorithm to jump distributions which are not symmetric, and both of these methods are members of a broad class of approaches to sampling from high-dimensional distributions.
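A minimal Metropolis sampler implementing (12.59) for the same kind of toy unnormalized posterior ĝ(θ) = p(D|θ)p(θ), with a symmetric Gaussian jump distribution so that Q cancels in the acceptance ratio; the step size, chain length, and burn-in fraction are illustrative choices rather than recommendations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # toy data, as in the earlier sketch

def log_g_hat(theta):
    # Unnormalized log posterior log p(D|theta) + log p(theta) for the toy model:
    # N(theta, 1) likelihood and N(0, 2^2) prior (both illustrative choices)
    return norm.logpdf(D, loc=theta, scale=1.0).sum() + norm.logpdf(theta, loc=0.0, scale=2.0)

def metropolis(n_samples, step=0.5, theta0=0.0):
    samples = np.empty(n_samples)
    theta = theta0
    for t in range(n_samples):
        proposal = theta + rng.normal(0.0, step)   # symmetric jump distribution Q
        # Acceptance probability (12.59); Q cancels because it is symmetric
        log_alpha = min(log_g_hat(proposal) - log_g_hat(theta), 0.0)
        if rng.uniform() < np.exp(log_alpha):
            theta = proposal                       # accept the candidate
        samples[t] = theta                         # record current state (new or repeated)
    return samples

chain = metropolis(20_000)
print("posterior mean estimate:", chain[5_000:].mean())   # discard burn-in
```

In practice the step size is tuned so that a reasonable fraction of proposals are accepted, and convergence diagnostics are applied before the samples are used.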

Case Study 5: Network Inference

An understanding of pathways and networks is crucial to understanding the functional organization of genes and proteins. At an abstract level a network can be viewed as a set of nodes, together with a set of directed or undirected edges between these nodes. Biological networks of interest include, for example, transcriptional regulatory networks. In this case a gene may express a protein that functions as a transcriptional inhibitor or activator of one or more target genes, and so the genes can be viewed as the nodes of a network, with the edges representing direct regulatory connections to other genes. With signal transduction networks the proteins are viewed as the nodes and the edges are the corresponding protein–protein interactions. Further networks of interest are metabolic networks, where the metabolites are the nodes.

We could use unsupervised learning to determine the network structure, a task we could call global inference of network topology. However, for a fully connected network with n nodes we would be attempting to infer n(n − 1)/2 possible connections from limited amounts of data; for example, a network over just 1000 genes already has 499500 candidate edges. A cell may have tens of thousands of genes, whereas, in most experiments, we are only considering a few hundred samples, with measurements corrupted by noise. Thus, the inference problem is typically highly underdetermined.

To make network inference more amenable as a machine learning task, a more tractable approach is network completion [12.66]. In this case we have an established pathway of interest, and consequently we know certain links and nonlinks between pairs of nodes. The problem is then to determine whether a given node, perhaps representing a gene, has a link to this pathway or not. Since we have a training set of known links and nonlinks, we can train a classifier via supervised learning and then predict a link to the pathway, or otherwise, for the node of interest (Fig. 12.10). Various types of data are informative as to whether a functional link exists, and hence we can cast this supervised learning task as an application of the multiple kernel learning techniques of Sect. 12.1.3. As a supervised learning problem, and with a smaller set of linkages to infer, this approach can give more reliable results than global unsupervised inference of network topology.

Fig. 12.10

Network completion. Using a training set of known links (bold lines) and nonlinks for nodes A–E we use supervised learning to predict links or nonlinks to a new node X (dashed lines)
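A hedged sketch of network completion as supervised learning over combined kernels: two precomputed kernels (standing in for, say, an expression-derived and a sequence-derived similarity) are mixed with fixed weights and fed to a kernel classifier. Full multiple kernel learning would also learn the combination weights; here they are fixed, and the data, kernel choices, and weights are all synthetic and illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)

# Synthetic stand-ins: two feature representations of candidate genes (for example,
# expression profiles and a sequence-derived encoding), plus labels y = +1 for a
# known link to the pathway and y = -1 for a known nonlink
n = 120
X_expr = rng.normal(size=(n, 30))
X_seq = rng.normal(size=(n, 50))
y = np.sign(X_expr[:, 0] + 0.5 * X_seq[:, 0] + 0.3 * rng.normal(size=n))

# Precomputed kernels from each data type, combined with fixed weights
K_expr = rbf_kernel(X_expr, gamma=0.05)
K_seq = rbf_kernel(X_seq, gamma=0.02)
w1, w2 = 0.6, 0.4                       # fixed combination weights (not learned here)
K = w1 * K_expr + w2 * K_seq

# Train on nodes with known links/nonlinks, predict for held-out candidate nodes
train, test = np.arange(0, 90), np.arange(90, n)
clf = SVC(kernel="precomputed", C=1.0).fit(K[np.ix_(train, train)], y[train])
pred = clf.predict(K[np.ix_(test, train)])
print("held-out accuracy:", (pred == y[test]).mean())
```

The same pattern extends to more than two data types by adding further weighted kernels to the sum.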

We can improve accuracy by reducing the search space further. Thus, as a third approach, we can consider the use of Bayesian methods to decide the most probable network structure given a small set of candidate network topologies. In this case it is assumed that a biologist has partially determined a network but remains undecided over a set of candidate topologies. Thus, the task is to determine which of these proposed network topologies is most probable, given the data. As an example, Calderhead et al. [12.67,68] used Bayesian unsupervised learning to decide between alternative topologies proposed for the extracellular signal-regulated kinase (ERK) pathway. This signaling pathway regulates growth and is defective in many cancers. Binding of epidermal growth factor (EGF) to the epidermal growth factor receptor (EGFR) at the cell surface membrane activates ERK through a chain of proteins. Four candidate topologies were proposed. To compare an individual pair of models, representing different topologies, we use the Bayes factor, that is, the ratio of marginal likelihoods

\frac{p(D \,|\, M_i)}{p(D \,|\, M_j)}
(12.60)

stated in terms of the marginal likelihood for model M i to generate the data D,

p(D \,|\, M_i) = \int p(D \,|\, M_i, \theta_i)\, p(\theta_i) \,\mathrm{d}\theta_i \, ,
(12.61)

where θ i is the set of model parameters for M i , which are marginalized or integrated out. Each protein in the chain is modeled using an ordinary differential equation (ODE), and we use optimization methods with a least-squares parameter fitting approach to find the optimal parameter values in these ODEs. We do not describe the ODEs here but refer to the original papers [12.67,68]. Thus, a model can be written as M ={ S, θ}, where S is the system of differential equations and θ is the set of parameters. The marginal likelihood is therefore a nonlinear function based on the solution of these ODEs. It cannot be computed analytically, and so we must use MCMC-based methods. Because the posterior distribution is generated by complex dynamical systems models, it is necessary to use more sophisticated sampling methods than those mentioned in Sect. 12.2.7. Indeed, rather than static sampling distributions, we use populations of annealed (temperature-dependent) distributions in an approach commonly referred to as population MCMC. Applied to the ERK pathway models, it was possible to use population MCMC to compute marginal likelihoods and thus Bayes factors, lending support to one of the proposed topology models for the ERK pathway [12.68].
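The role of the Bayes factor (12.60) and the marginal likelihood (12.61) can be illustrated on a toy pair of models, estimating the marginal likelihood of the parameterized model by simple Monte Carlo over its prior. For the ODE-based ERK models such a naive estimator would be far too inefficient, which is one reason population MCMC is used in [12.67,68]; the sketch below only shows the quantities being compared.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Toy data and two candidate models (illustrative only):
#   M1: D ~ N(0, 1), no free parameters
#   M2: D ~ N(theta, 1), with prior theta ~ N(0, 2^2)
D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])

# Marginal likelihood of M1 (nothing to integrate out)
evidence_m1 = np.prod(norm.pdf(D, loc=0.0, scale=1.0))

# Marginal likelihood of M2 via Monte Carlo over the prior, as in (12.61)
M = 200_000
theta = rng.normal(0.0, 2.0, size=M)
evidence_m2 = np.prod(norm.pdf(D[:, None], loc=theta, scale=1.0), axis=0).mean()

# Bayes factor (12.60): evidence for M2 relative to M1
print("Bayes factor p(D|M2)/p(D|M1):", evidence_m2 / evidence_m1)
```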

Conclusions

Progress in the biomedical sciences can be furthered by improved data interpretation, in addition to the acquisition of more data. In this chapter we have seen that contemporary methods from machine learning can offer many advantages over more long-established data analysis methods. There are an increasing number of studies where multiple types of data are acquired from the same sample. An example is the Cancer Genome Atlas [12.69] project, where multiple types of data are derived from the same tumor sample. In Sect. 12.1.3 we saw that many different types of data can be encoded into corresponding kernels and prediction can be achieved using multiple kernel learning. In Sect. 12.2 we saw that contemporary Bayesian unsupervised learning methods compare favorably with hierarchical cluster analysis, providing a confidence measure for assigning datapoints to clusters (Fig. 12.9) and an effective approach to model selection (Fig. 12.7). In the last section we saw that Bayesian methodology is the most effective approach to other tasks such as determining the most probable network topology given a set of candidate network topologies. Further innovation in machine learning will improve and expand the interpretability of many datasets from the biomedical sciences.