
8.1 Linguistic Metaphor

New sequencing technologies are giving access to an ever-increasing amount of DNA, RNA or protein sequences for more and more species. One major challenge of the post-genomic era is now to decipher this set of genetic sequences composing what has been popularly named “the language of life” [1].

As witnessed by this expression, the linguistic metaphor has been used in genetics for a long time. Indeed, the discovery of the double helix structure of DNA in 1953 showed that the genetic information contained in this biological macromolecule can be represented by two (long) complementary sequences over a four-letter alphabet \(\{A,C,G,T\}\) symbolizing the nucleotides, the complementary letters (called Watson–Crick base pairs) being AT and CG. This genetic information is used to construct and operate a living organism by the transcription, when needed, of pieces of DNA sequences, named genes, into single-stranded RNA macromolecules, which can also be represented by sequences over almost the same four-letter alphabet \(\{A,C,G,U\}\), where T has been replaced by its unmethylated form U. Sequences of RNAs coding for proteins are in turn translated into protein sequences of amino acid residues, over the 20-letter amino acid alphabet \(\{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y\}\), that determine their three-dimensional conformations and functions in the cells (see for instance [2] for a more detailed introduction to the production of the RNAs and proteins encoded in DNA). Sequences are thus at the core of the storage of heredity information and of its expression into the functional units of the cells: the natural language metaphor then arises quickly. This metaphor may be convenient for popularization but can also be a source of inspiration for scientists trying to discover the functional units of the genome and how this “text” is structured.

Applying computational linguistics tools to represent, understand and handle biological sequences is a natural continuation of the linguistic metaphor. Using formal grammars, such as the ones introduced in 1957 by Noam Chomsky [3] to describe natural languages and study syntax acquisition by children, has been advocated in particular by Searls: his articles provide a good introduction to the different levels of expressiveness required to model biological macromolecules by grammatical formalisms [4–8]. Basically, copies and long-distance correlations are common in genomic sequences, quickly calling for context-sensitive grammars of Chomsky’s hierarchy to model them, which makes parsing unworkable. As in linguistics, a solution for getting polynomial-time parsing while still representing many of the non-local constraints of genomic sequences is to use mildly context-sensitive languages [9]. Along these lines, Searls introduced String Variable Grammars as an expressive formalism for describing the language of DNA, which has led to several generic practical parsers: the precursor Genlang [10] and its successors Stan [11], Patscan [12], Patsearch [13] and Logol [14]. Many specialized parsers have also been devised, for instance RNAMotif [15], RNAbob [16], Hypasearch [17, 18], Palingol [19] and Structator [20], tailored to handle RNA stem-loop secondary structures efficiently.

But one still has to design the grammar. In contrast with all the expertise available on natural languages, little is known about the syntax of DNA and the functional/semantic role of its parts. For instance, how are the equivalents of “words”, “sentences” and even “punctuation marks” defined? In some specific cases, expert knowledge can be used to build a grammar, possibly by successive trial-and-error refinements with respect to the sequences retrieved by the model. In other cases, expert knowledge is missing or insufficient.

On the other hand, a huge number of genomic sequences are available, opening the door to grammar inference from these sequences. In this chapter, we present advances made towards the big challenge of automatically learning the language of genomic sequences. The first step we consider is to discover what the genomic “words” are: this is mainly the domain of Motif Discovery, and related work is presented in Sect. 8.2. The second step is then to learn the “syntax” governing the admissible chaining of “words” in macromolecules: this is the classical goal of Grammatical Inference, and we present the first successes obtained at the intersection of this field and Bioinformatics in Sect. 8.3.

8.2 Discovering and Modeling Biological Words

“Words” can be looked for at different levels in DNA, requiring different levels of modeling. We investigate in this section how this has been classically handled in Bioinformatics, from the simplest historical first steps, introducing and illustrating some specificities of biological sequences, to the more elaborate techniques of today’s state of the art.

8.2.1 Short DNA Words

Simple Words A classical example of an identified DNA substring is \(\text {AAGCTT}\) (on the upper strand, with its complement \(\text {TTCGAA}\) on the lower strand), which is specifically recognized in the H. influenzae bacterium by one of its enzymatic proteins, named HindIII, that cleaves the double-stranded DNA of invading viruses at the sites where this substring occurs, while the occurrence sites of the substring in the bacterium’s own DNA are protected from cleavage by a prior methylation. The HindIII protein is said to be a restriction enzyme. More than 800 different restriction enzymes and more than 100 corresponding recognition sequences have been identified in bacterial species, with important applications in genetic engineering. These recognition sequences show a great variability among species, many of them being palindromic on complementary strands (meaning the sequence reads the same backwards and forwards on complementary DNA strands, as in \(\text {AAGCTT}\) and \(\text {TTCGAA}\)), reflecting that both strands of DNA have to be cut, often by a complex of two identical proteins operating on each strand. The main characteristic of these substrings is their short length (about four to eight base pairs), which makes them likely to appear frequently in any genome, making them an efficient defense against unknown invading viruses. These sequences are thus rather ubiquitous and do not carry information by themselves (they are only substrings recognized by the restriction enzymes), and it is debatable whether they are “words” in a linguistic sense.

Conserved Words Another example of a well-known short sequence is the Pribnow box, identified early in the DNA of the E. coli bacterium. It was discovered by Pribnow [21] by looking at the DNA sequences around six experimentally determined starting points of the transcription of genes into RNAs by a molecule named RNA polymerase. Would you find, in these sequences, shown hereafter and aligned on the known transcription start site formatted in bold, the protein binding site initiating the transcription by the RNA polymerase?

[Figure a: the six sequences aligned on their transcription start sites; not reproduced here]

By looking carefully at the sequences, one can find a conserved region (underlined below), located about 10 positions before the transcription start site, which may have been conserved for its function, despite mutations, through natural selection:

[Figure b: the same alignment with the conserved region underlined; not reproduced here]

Consensus Sequences and Motifs Looking at the underlined alignment of this conserved region, only two positions out of seven (the second and the sixth) are strictly conserved, and the farthest sequences share only three identical positions for four mismatches. But the consensus sequence \(\text {TATAATG}\) of the alignment, built by keeping only the most abundant letter at each position, appears with no more than two mismatches, and one may consider it as an archetypal (possibly ancestral) sequence for the region and the other sequences as its variants through functionally neutral mutations. Searching for this consensus sequence \(\text {TATAATG}\) without mismatch, we would retrieve only one of the six conserved sites and we would expect one match per \(4^7\simeq 16\text {,}000\) bp in whole DNA. Allowing one mismatch, we would retrieve three of the six sites and we would expect one match per 700 bp. Allowing two mismatches, we would retrieve all the sites but we would expect one match per 70 bp, which is likely to be too much.

Note that the nucleotides \(\text {A}\) and \(\text {G}\) are evenly distributed at the fourth underlined position, so \(\text {TATGATG}\) would also have been a good candidate consensus sequence. Actually, it would be more informative to know that the fourth position has to be a purine (an A or G base) and, as done by Pribnow, we can use the consensus motif \(\text {TAT[AG]ATG}\) (where brackets specify a set of alternative bases at the position) to designate the sequences probably engaged by RNA polymerase. This motif retrieves two sites, for one expected match per \(8\text {,}000\) bp, and five sites if one error is allowed, for one expected match per 400 bp. If we assume that the base at the fifth position is not important, we can also relax the consensus to \(\text {TAT[AG]xTG}\), where \(\text {x}\) is a wildcard for any base. This consensus retrieves four of the sites, with about one match per \(2\text {,}000\) bp, and all the sites if one error is allowed, with a match per 100 bp. Finally, choosing the full consensus \(\text {[GT]A[CT][AG][ACT]T[AG]}\) would recognize all the sites, with an expected match per 350 bp. As shown in this example, consensus models offer many ways to represent a word and its possible variants in DNA, ranging from consensus sequences allowing a limited number of errors to full consensus motifs, with all the intermediate ambiguity/sensitivity trade-offs.
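To make this mismatch-tolerant matching concrete, here is a minimal Python sketch (function names are ours, for illustration only) that matches a consensus motif such as \(\text {TAT[AG]ATG}\) against a sequence while allowing a bounded number of mismatches:

```python
def parse_motif(motif):
    """Parse a consensus motif like 'TAT[AG]ATG' into a list of
    allowed-base sets, one per position ('x' is a wildcard)."""
    positions, i = [], 0
    while i < len(motif):
        if motif[i] == '[':
            j = motif.index(']', i)
            positions.append(set(motif[i + 1:j]))
            i = j + 1
        elif motif[i] == 'x':
            positions.append(set('ACGT'))
            i += 1
        else:
            positions.append({motif[i]})
            i += 1
    return positions

def find_sites(sequence, motif, max_mismatches=0):
    """Report start positions where at most max_mismatches positions
    of the window fall outside their allowed base set."""
    pos_sets = parse_motif(motif)
    k = len(pos_sets)
    return [start for start in range(len(sequence) - k + 1)
            if sum(base not in allowed
                   for base, allowed in zip(sequence[start:start + k], pos_sets))
               <= max_mismatches]

print(find_sites("GGTATGATGCC", "TAT[AG]ATG"))  # [2]: TATGATG matches exactly
```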

“De novo” discovery of such words can be done by enumerating candidate words and returning those over-represented in a collection of genome sequences, i.e. occurring more frequently than expected by chance. This approach has been successful in Motif Discovery, particularly for the discovery of short words and rather simple motifs (which keep the enumeration practical, even if efficient data structures, and enumerating only the motifs that have sufficient support in the sequences, can help); see [22, 23] for details.
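The enumeration idea can be caricatured in a few lines. This toy sketch (names hypothetical) assumes an equiprobable base background, whereas real motif discovery tools rely on Markov background models estimated on the genome and on proper statistics:

```python
def overrepresented_words(sequences, k, min_ratio=5.0):
    """Count every k-word in the sequences and report those occurring
    at least min_ratio times more often than expected by chance
    under an equiprobable base model."""
    counts, windows = {}, 0
    for seq in sequences:
        windows += max(0, len(seq) - k + 1)
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            counts[word] = counts.get(word, 0) + 1
    expected = windows * 0.25 ** k  # expected count of one given k-word
    return {w: c / expected for w, c in counts.items()
            if c >= min_ratio * expected}
```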

Position-Specific Matrices Yet, consensus sequences or motifs are not completely satisfactory for representing and discovering biological words. Taking again the example of the full consensus motif \(\text {[TG]A[TC][GA][ACT]T[GA]}\), do we really want \(\text {GACACTA}\) to be recognized like \(\text {TATAATG}\)? Or, if a more specific consensus is chosen, such as the consensus sequence \(\text {TATAATG}\) allowing a limited number of errors, how is it possible to express that some positions can mutate more easily than others and that some base mutations are more likely than others at a given position? Moreover, while conserved on average, the binding sites involved in the initiation of transcription rarely occur as exact matches of their specific consensus sequence. On average in bacteria, only half of the positions in each site match the consensus sequence. A first explanation is that bindings have to be reversible. Different affinities with the binding proteins also enable fine tuning of the concentration of the RNAs (and eventually proteins) expressed from the genes in the cell, which it would be interesting to estimate from the motif.

Weighting the consensus motifs addresses these issues. This is usually done on the basis of a summary of the sites by their base counts at each position in a position-specific count matrix (PSCM). For the Pribnow sites example, the PSCM for the aligned conserved region would be:

        1   2   3   4   5   6   7
   A    0   6   0   3   4   0   1
   C    0   0   1   0   1   0   0
   G    1   0   0   3   0   0   5
   T    5   0   5   0   1   6   0

If we denote by \(o_i(a)\) the observed count of base a at position i of the sites, an estimate of the probability of a at position i in the site is given by:

$$\hat{p_i}(a)=\frac{o_i(a)}{\sum _{a'\in \{A,C,G,T\} }o_i(a')}.$$

Under the strong assumption that the probability of a base at a position depends only on the position, the probability of a sequence \(a_1a_2 \ldots a_k\) given a position-specific probability matrix (PSPM) \(\mathbf {P}=[\mathbf {p_1},\mathbf {p_2}\ldots ,\mathbf {p_k}]\) is \(\varPi _{i=1}^{k}p_i(a_i)\). For instance, for the example above, the probability of \(\text {TATAATG}\) would be \(\frac{5}{6}\times \frac{6}{6}\times \frac{5}{6}\times \frac{3}{6}\times \frac{4}{6}\times \frac{6}{6}\times \frac{5}{6}\simeq \) \(1.9\times 10^{-1}\) while for \(\text {GACACTA}\) it would only be \(\frac{1}{6}\times \frac{6}{6}\times \frac{1}{6}\times \frac{3}{6}\times \frac{1}{6}\times \frac{6}{6}\times \frac{1}{6}\simeq \) \(3.9\times 10^{-4}\). By way of comparison, both sequences would have a probability of \((\frac{1}{4})^{7}\simeq \) \(6.1\times 10^{-5}\) of being generated randomly by an equiprobable choice of the bases.
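This computation is easily reproduced. The following sketch (the dictionary-of-rows layout for the count matrix is our choice, for illustration) derives the PSPM from the PSCM and applies the positional independence assumption:

```python
# Position-specific count matrix of the Pribnow example (rows A, C, G, T).
PSCM = {
    'A': [0, 6, 0, 3, 4, 0, 1],
    'C': [0, 0, 1, 0, 1, 0, 0],
    'G': [1, 0, 0, 3, 0, 0, 5],
    'T': [5, 0, 5, 0, 1, 6, 0],
}

def pspm(counts):
    """Turn observed counts into per-position probability estimates."""
    k = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] for b in counts) for i in range(k)]
    return {b: [counts[b][i] / totals[i] for i in range(k)] for b in counts}

def probability(sequence, probs):
    """Sequence probability under the position-independence assumption."""
    p = 1.0
    for i, base in enumerate(sequence):
        p *= probs[base][i]
    return p

P = pspm(PSCM)
print(probability("TATAATG", P))  # ~1.9e-1
print(probability("GACACTA", P))  # ~3.9e-4
```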

In the genome of S. cerevisiae, which contains \(64\,\%\) of \(\text {A}\) and \(\text {T}\), the probabilities of \(\text {TATAATG}\) and \(\text {GACACTA}\) would respectively be \(2\times 10^{-4}\) and \(6\times 10^{-5}\), making the second word more exceptional, and thus more interesting than the first word, with respect to this background model. When positions are assumed to be independent, the odds score of the probability of a sequence \(a_1a_2 \ldots a_k\) under \([\mathbf {p_1},\mathbf {p_2}\ldots ,\mathbf {p_k}]\) with respect to its probability in a background model where each base a has a probability p(a) can directly be computed as \(\varPi _{i=1}^k\frac{p_i(a_i)}{p(a_i)}\), and the comparison with respect to the expected background probability can directly be embedded in a Position Weight Matrix (PWM) [24], also called Position-Specific Weight Matrix (PSWM) or Position-Specific Scoring Matrix (PSSM), in logarithmic form to facilitate computation (sums instead of products and better precision for rounded computation). In a PWM, the score of base a at position i is usually defined by

$$s_i(a)= \log _2 \frac{p_i(a)}{p(a)}$$

and the score of a sequence \(a_1a_2 \ldots a_k\) is given by

$$S(a_1a_2 \ldots a_k)=\sum _{i=1}^{k}s_i(a_i).$$

The PWM computed from the PSCM above, assuming that the bases are equiprobable in the background model (\(p(A)=p(C)=p(G)=p(T)\)), would be:

        1        2        3        4        5        6        7
   A    −∞       2        −∞       1        1.42     −∞       −0.58
   C    −∞       −∞       −0.58    −∞       −0.58    −∞       −∞
   G    −0.58    −∞       −∞       1        −∞       −∞       1.74
   T    1.74     −∞       1.74     −∞       −0.58    2        −∞

Bases over-represented with respect to the background probability have positive scores, while under-represented bases have negative scores. Using a sliding window of width k, a PWM can assign a score to each site of a genome reflecting its likelihood of being an occurrence of the motif. The highest score for a sequence with the PWM above is 11.64, obtained for \(\text {TATAATG}\), while the lowest score (except \(-\infty \)) is 2.68, obtained for \(\text {GACACTA}\).
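A PWM scan is a direct translation of these formulas. A minimal sketch, reusing the PSCM dictionary from the previous sketch (names hypothetical):

```python
import math

def pwm_from_pscm(pscm, background=0.25):
    """Log2-odds scores; bases unseen at a position get -inf
    (no pseudocounts yet, as discussed next)."""
    k = len(next(iter(pscm.values())))
    totals = [sum(pscm[b][i] for b in pscm) for i in range(k)]
    return {b: [math.log2((pscm[b][i] / totals[i]) / background)
                if pscm[b][i] > 0 else float('-inf') for i in range(k)]
            for b in pscm}

def scan(sequence, pwm, threshold):
    """Score every window of width k and report those above threshold."""
    k = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(sequence) - k + 1):
        score = sum(pwm[sequence[start + i]][i] for i in range(k))
        if score >= threshold:
            hits.append((start, score))
    return hits

W = pwm_from_pscm(PSCM)
# ~11.63 exactly; the text's 11.64 comes from summing the rounded entries
print(sum(W["TATAATG"[i]][i] for i in range(7)))
```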

First Pseudocounts Let us remark that a mutation from \(\text {A}\) to \(\text {G}\) at the fifth position of the best sequence \(\text {TATAATG}\) directly results in a \(-\infty \) score. Nucleotides that occur rarely at a specific position of the motif may not be seen in a small sample by chance, but they force the probability of any sequence containing one of these missing nucleotides to 0. Pseudocounts are thus usually added to compensate for small sample counts. This can be done by systematically adding 1 to the observed counts; the estimate of the probability of a at position i is then:

$$\hat{p_i}(a)=\frac{o_i(a)+1}{\sum _{a'}\left( o_i(a')+1\right) }.$$

More elaborate pseudocounts can be used; for instance, in

$$\hat{p_i}(a)=\frac{o_i(a)+A\,p(a)}{\sum _{a'}\left( o_i(a')+A\,p(a')\right) }$$

the pseudocount added is proportional to background probability p(a) and the weight A given to the prior. Choosing \(A=2\) and keeping the equiprobability of the bases, the PWM on the Pribnow example would be

        1        2        3        4        5        6        7
   A    −2.00    1.70     −2.00    0.81     1.17     −2.00    −0.42
   C    −2.00    −2.00    −0.42    −2.00    −0.42    −2.00    −2.00
   G    −0.42    −2.00    −2.00    0.81     −2.00    −2.00    1.46
   T    1.46     −2.00    1.46     −2.00    −0.42    1.70     −2.00

and we would have \(S(\text {TATAATG})=9.76\), \(S(\text {GACACTA})=2.53\) and \(S(\text {TATAGTG})=6.59\), the minimal score being \(-14\). By adding pseudocounts, all the sequences have a score strictly greater than \(-\infty \). One can still discriminate a set of sequences by choosing a cut-off value, set as a compromise between the desired recall and precision, with the advantage over consensus sequences or motifs of being better suited to the representation of similar sequences without a strict conservation per position.
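The pseudocount variant only changes the estimation step. A sketch under the same assumptions as in the text (\(A=2\), equiprobable background), reusing the earlier PSCM dictionary:

```python
import math

def pwm_with_pseudocounts(pscm, A=2.0, background=0.25):
    """Each observed count is augmented by A * p(a) before converting
    to log2-odds scores, so no score is -inf any more."""
    k = len(next(iter(pscm.values())))
    totals = [sum(pscm[b][i] for b in pscm) + A for i in range(k)]
    return {b: [math.log2(((pscm[b][i] + A * background) / totals[i])
                          / background) for i in range(k)]
            for b in pscm}

W2 = pwm_with_pseudocounts(PSCM)
print(round(sum(W2["TATAATG"[i]][i] for i in range(7)), 2))  # 9.76
print(round(sum(W2["TATAGTG"[i]][i] for i in range(7)), 2))  # 6.59
```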

Measuring Conservation The conservation of a site can be evaluated by a measure named information content [25], which quantifies the information gain on the site provided by the PSPM with respect to a uniform random choice of the bases. The information content \(IC_i\) at position i is given by the formula:

$$IC_i= 2 + \sum _{a}p_i(a)\log _2p_i(a).$$

Assuming positional independence, the information content of the complete site is simply the sum of the positional information contents:

$$IC=\sum _{i=1}^k IC_i.$$

Information content is the basis of a convenient visualization of PSPMs named sequence logos [26], which displays simultaneously the conservation of each position and its proportional base composition (see Fig. 8.1).

Fig. 8.1 Sequence logos for the Pribnow example (left: without pseudocounts; right: with pseudocounts). The height of each stack of symbols shows the information content of the position, and the relative heights of the bases indicate their probabilities at the position (logos generated with WebLogo 3.3 [27])

Information content can be generalized to account for a background model with a biased base probability distribution \(\mathbf {p}\):

$$IC_{i||\mathbf {p}}= \sum _{a}p_i(a)\log _2\frac{p_i(a)}{p(a)}$$

\(IC_{i||\mathbf {p}}\) is known as the relative entropy (a.k.a. Kullback–Leibler divergence [28]) and measures how much the \(p_i(a)\) diverge from the background distribution \(\mathbf {p}\) at the position. Let us note that when \(\forall a\in \{A,C,G,T\}, p(a)=\frac{1}{4}\), the formula reduces to \(IC_i\). The generalized information content of the site is once again the sum over the positions, \(IC_{||\mathbf {p}}=\sum _{i=1}^k IC_{i||\mathbf {p}}\): it measures how much the distribution defined by the PSPM contrasts with the distribution obtained by a Bernoulli-like process.
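Both the uniform-background and the generalized forms are short to compute. A sketch assuming the PSPM layout of the earlier sketches (terms with \(p_i(a)=0\) contribute 0, by the usual convention):

```python
import math

def information_content(probs, background=None):
    """Per-position relative entropy of a PSPM against a background
    distribution; with the uniform default this equals the
    2 + sum_a p_i(a) * log2 p_i(a) form given in the text."""
    bases = sorted(probs)
    k = len(probs[bases[0]])
    if background is None:
        background = {b: 1.0 / len(bases) for b in bases}
    per_position = [sum(probs[b][i] * math.log2(probs[b][i] / background[b])
                        for b in bases if probs[b][i] > 0)
                    for i in range(k)]
    return per_position, sum(per_position)

per_pos, total = information_content(pspm(PSCM))  # PSCM/pspm as above
```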

Information content is thus related to how exceptionally conserved the set of underlying words is with respect to such background models. It is thus a good objective function for PWM motif discovery programs that aim at identifying such sets of words in a set of sequences (for instance, to find binding sites near the transcription start sites as in the Pribnow example). In its simplest setting, the problem can be stated as looking for one word of length k per sequence such that the corresponding information content, or a related score, is maximized. Many strategies for the exploration of the search space have been proposed. These include the greedy algorithm consensus [29–31], expectation maximization algorithms like MEME [32] and several algorithms based on a Gibbs sampling strategy: Gibbs [33–35], AlignACE [36], MotifSampler [37] or BioProspector [38]. The scores used are the information content (IC) (consensus, MotifSampler), the log-likelihood ratio (LLR) (MotifSampler, Gibbs), the E-value of the log-likelihood (MEME) or the E-value of the IC (consensus).

Usage PWMs/PSSMs are widely used in popular databases such as TRANSFAC [39] and JASPAR [40] to model binding sites, identified experimentally by techniques such as SELEX or, nowadays, ChIP-Seq, with the help of motif discovery programs to refine the site localization; they are then available to scan new genomes for the prediction of putative binding sites. There is still a large number of false positives, and regulation in organisms more complex than bacteria is still incompletely understood. Whether those sites are actually bound by a protein and play a functional role in transcription, and under what conditions, must still be determined experimentally by traditional molecular techniques like promoter bashing, reporter gene assays, ChIP experiments, etc.

8.2.2 Longer Words

Binding sites, involved in the regulation of the transcription of DNA genes into RNAs and the production of proteins, are examples of short words in DNA. Genes coding for RNAs or proteins, which are the functional products of DNA in the cell, can also easily be considered as (longer) DNA words.

In the context of natural evolution, genes, as well as other DNA sequences, are subject to genomic mutations (substitutions, insertions, deletions or recombinations) under natural selection pressure. Most of these mutations are lethal or harmful, but about a third of them are either neutral or weakly beneficial. There is thus a sequence conservation of the genes transmitted among individuals or species, but with substitutions of bases and insertions or deletions of (possibly stretches of) bases. Biologists use the term “homologs” to designate sequences inherited in two species from a common ancestor. Homology is the basis of comparative genomics, allowing the annotation of sequences that can be considered as variants of the same word. But homology does not necessarily imply that function is preserved. The TIGRFAM protein database introduced the term “equivalogs” to designate homologs that are conserved with respect to function since their last common ancestor. This latter concept matches more closely the linguistic notion of a “word” (with literal or practical meaning) but is more difficult to establish, especially in silico.

Similarity of Two Proteins Homology of two proteins can be estimated by aligning their sequences so as to optimize the number of exact matches between aligned amino acids and by reporting the percentage of identity between the two aligned sequences. To better evaluate their functional kinship, it is preferable to take into account the different physico-chemical properties of the amino acids (see Fig. 8.2). For instance, if the electric charge of an amino acid is important for the function of the protein, the function is more likely to be conserved by mutations preserving this charge. In other cases, the hydrophobicity of the amino acid will be its important feature.

Fig. 8.2 Venn diagram of amino acid properties (adapted from [41])

Substitution matrices such as Blosum62 [42] score the similarity of amino acids according to their propensity to be exchanged with each other in blocks of conserved regions (Table 8.1). Such matrices reflect the mean physico-chemical similarity between amino acids under natural selection pressure, as well as some similarity or redundancy of the genetic code.

Table 8.1 BLOSUM62 substitution matrix

Substitution matrices provide a way to score the similarity of two proteins (instead of their percentage of identity) by aligning their sequences of amino acids so as to maximize the sum of the amino acid substitution scores. This can be computed in quadratic time by a dynamic programming global alignment algorithm known as the Needleman–Wunsch algorithm [43], which also copes with the insertions and deletions of subsequences that are common in genomic sequences through the addition of affine penalty scores for “gaps”.
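For illustration, a minimal global alignment sketch; it uses a linear gap penalty for brevity, whereas the affine penalties mentioned above require three dynamic programming matrices (the `score` function would be, e.g., a BLOSUM62 lookup; the gap value is an arbitrary choice):

```python
def needleman_wunsch(s, t, score, gap=-8):
    """Optimal global alignment score in O(len(s) * len(t)) time."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # substitution
                           dp[i - 1][j] + gap,                            # deletion
                           dp[i][j - 1] + gap)                            # insertion
    return dp[n][m]
```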

Global alignment enables one to compare two protein sequences over their whole length, but many proteins are composed of several domains, which are stable units of protein spatial structure able to fold autonomously. Domains may have existed, or may still exist, as independent proteins: they constitute the protein building blocks selected by evolution and recombined in different arrangements to create proteins with different functions. Comparing proteins at this level requires local rather than global alignments. The best local alignment of two sequences can be computed by the Smith–Waterman algorithm [44], a variation of the global alignment dynamic programming algorithm that does not penalize the unaligned ends of the sequences. To search an entire database for homologous (sub-)sequences of a given protein sequence in reasonable time, heuristic and approximate local alignment algorithms have been developed, such as FASTA [45] or BLAST [46], one of the most widely used bioinformatics programs.
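The local variant differs from the global sketch above only by flooring each cell at 0 (an alignment can start anywhere) and taking the maximum over the whole matrix (it can end anywhere); same simplifications as before:

```python
def smith_waterman(s, t, score, gap=-8):
    """Best local alignment score (linear gap penalty for brevity)."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(0,  # empty alignment: restart here
                           dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best
```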

Modeling Conserved Protein Sequences When more than two related protein sequences are available, switching from pairwise to multiple sequence alignment enables one to identify evolutionarily or structurally conserved regions and key positions across all the sequences. Most formulations of multiple sequence alignment lead to NP-complete problems; therefore, classical multiple sequence alignment programs rely on heuristics. Most of them perform global multiple sequence alignment, such as ClustalW [47], T-Coffee [48], Probcons [49], MUSCLE [50] or MAFFT [51]. Local multiple sequence alignments can be found by the methods cited above for building PWMs, a set of conserved k-words being a specific case of local alignment without gaps. In between global and local alignment, DIALIGN [52, 53] proposes an original approach based on significant local pairwise alignments of segments, which enables it to identify a set of multiple sequence local alignments shared by all the sequences without any gap penalty.

Fig. 8.3 PWM, pHMM, Meta-MEME and Protomata types of architecture (inspired by [54])

Profile HMM Modeling locally conserved regions identified by multiple sequence alignment can be done once again with PWMs. To handle larger regions with insertions and deletions, PWMs have been generalized to so-called profile models by the addition of insertion/deletion penalties at each position [55], and further to profile hidden Markov models (pHMMs) by also adding probabilities for entering insertion, deletion or matching mode at the next position given the current position and mode [56, 57]. Namely, pHMMs are hidden Markov models with a predefined k-position left-to-right architecture, with three (hidden) states per position (see Fig. 8.3): a match state generating amino acids according to the conserved position distribution (the equivalent of a PWM column), an insert state generating amino acids according to their distribution in gaps (by default, their background probability) and a silent delete state enabling passing a match state without emitting any amino acid.

Transitions are only allowed between states from one position to the next and are probabilized, enabling one to tune the likelihood of inserting or deleting amino acids at each position, as well as the likelihood of continuing insertions or deletions after entering one of these modes, as seen in protein sequence families.

If the topology of a pHMM is set, its probabilistic parameters can be estimated from available sequences of the family by a classical Expectation-Maximization scheme such as the Baum–Welch algorithm [58]. Nevertheless, the classical workflow in Bioinformatics is rather to start from a multiple sequence alignment of the sequences, assign a match state (and its companion insertion and deletion states) to each column of the alignment involving enough sequences (say, more than half of the family) and convert the observed counts of symbol emissions and state transitions from the alignment into probabilities.
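A sketch of the first two steps of this workflow (choosing match columns and collecting match-state emission counts); the occupancy threshold and the list-of-strings alignment layout are our assumptions for illustration:

```python
def match_columns(alignment, min_occupancy=0.5):
    """Columns of a multiple alignment ('-' = gap) occupied by more
    than min_occupancy of the sequences get a match state."""
    n_seqs, n_cols = len(alignment), len(alignment[0])
    return [j for j in range(n_cols)
            if sum(seq[j] != '-' for seq in alignment) / n_seqs > min_occupancy]

def match_emission_counts(alignment, columns):
    """Observed emission counts for each match state, to be converted
    into probabilities (typically after adding pseudocounts)."""
    counts = []
    for j in columns:
        col = {}
        for seq in alignment:
            if seq[j] != '-':
                col[seq[j]] = col.get(seq[j], 0) + 1
        counts.append(col)
    return counts
```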

Elaborate Pseudocounts Even if the topology of a pHMM is simple, the number of free parameters to estimate is still large compared to the number of sequences usually available. Much work has thus been done to avoid over-specialization and to compensate for the lack of data, or its biases, by the development of transition regularizers, sequence weighting schemes and, especially, sophisticated pseudocount schemes based on the usage and elaboration of a priori knowledge on amino acid substitutability. As a matter of fact, the alphabet of amino acids is larger than that of nucleotides, and the characterizations targeted with pHMMs tend to be longer than with PWMs: pseudocounts are thus even more important here.

The classical pseudocounts presented above for nucleotides can be used, but taking into account the substitutability preferences of amino acids arising from their shared physico-chemical properties leads to better performance. A first way is to use available substitution matrices such as BLOSUM62: if we denote by m(a|b) the probability of a mutation to a from b, derived from the corresponding score in the chosen substitution matrix (see [42]), an intuitive scheme introduced for PWMs, with many variants [59], is to make each amino acid b contribute to the pseudocounts of amino acid a in proportion to its abundance at the position and its probability m(a|b) of mutating into a. If we denote by \(m_i(a)=\sum _{b} \frac{o_i(b)}{\sum _{b'}o_i(b')}m(a|b)\) the probability of getting a by mutation of the residues at position i, an estimate for the probability of a at i can be:

$$\hat{p_i}(a)=\frac{o_i(a)+A\,m_i(a)}{\sum _{a'}\left( o_i(a')+A\,m_i(a')\right) }.$$

This pseudocount scheme has the advantage of interpolating between the score of a pairwise alignment, such as with BLAST, when a small number of sequences is available (consider for instance the case of only one sequence and \(A \gg 1\)) and the maximum likelihood approach when more sequences are available (when \(\sum _a o_i(a) \gg A\)). In practice, A has to be chosen to tune the importance of the pseudocounts with respect to the observed counts, classical proposed policies being to choose \(\min (20,\sum _ao_i(a))\) [60] or 5R [59], where R is the number of different amino acids observed in the column, a simple measure of its diversity.
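A sketch of this scheme, assuming the mutation probabilities m(a|b) are already derived from a substitution matrix and given as a dictionary of dictionaries `m[a][b]`, with at least one observed count at the position:

```python
def substitution_pseudocounts(o_i, m, A):
    """Estimate p_i(a) with pseudocounts A * m_i(a), where
    m_i(a) = sum_b (o_i(b)/n) * m(a|b) redistributes the observed
    counts through the mutation probabilities m(a|b)."""
    alphabet = list(m)
    n = sum(o_i.get(b, 0) for b in alphabet)
    m_i = {a: sum(o_i.get(b, 0) / n * m[a][b] for b in alphabet)
           for a in alphabet}
    num = {a: o_i.get(a, 0) + A * m_i[a] for a in alphabet}
    denom = sum(num.values())  # equals n + A when sum_a m(a|b) = 1 for all b
    return {a: num[a] / denom for a in alphabet}
```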

This pseudocount scheme performs well but does not take full advantage of the knowledge of the column composition. Instead of distributing pseudocounts from each amino acid count independently, one may wish to distribute them according to the whole column distribution. For instance, if the column is biased towards small hydrophobic amino acids, one would like to bias the pseudocounts towards this combination of physico-chemical properties. To this end, [62] proposed using Dirichlet mixture densities as a means of representing prior information about typical amino acid column distributions in multiple sequence alignments and derived the formulas to compute the corresponding posterior distributions given the observed counts in the Bayesian framework.

Dirichlet mixtures can be thought of as mixtures of M pseudocount vectors \(\mathbf {\alpha _{1}},\ldots ,\mathbf {\alpha _{M}}\) corresponding to M different typical distributions of amino acids having each a prior probability of \(q_{j}, 1 \le j \le M\), where each Dirichlet density \(\mathbf {\alpha _{j}}=(\alpha _j(A),\alpha _j(C), \ldots , \alpha _j(W))\) contains the appropriate amino acid pseudocounts (the equivalent of \(A\,p(a)\) or \(A\,m_i(a)\) in the pseudocount formulas above) for the typical distribution j.

An example of a Dirichlet mixture from [61] is given in Table 8.2. This Dirichlet mixture and more recent ones can be found on the site of the Bioinformatics and Computational Biology group at UCSC at http://compbio.soe.ucsc.edu/dirichlets/. These mixtures were estimated by maximum likelihood inference from the columns of available large “gold standard” datasets of protein multiple alignments that are assumed to be accurate and representative.

Table 8.2 Parameters of Blocks9, a nine components Dirichlet mixture prior [61]

According to the authors, the mixture shown here was one of their first really good Dirichlet mixtures. It is composed of nine components, each of which favors a different distribution of amino acids biased towards one or several physico-chemical properties from Fig. 8.2: for instance, Dirichlet density \(\alpha _2\) favors aromatic amino acids (YFWH) by assigning them higher pseudocounts (relative to what would be expected from their background frequency; see [61] for details), while \(\alpha _5\) favors aliphatic or large and non-polar amino acids. The last component is specific: it favors columns with few different amino acids, with a preference for P, G, W or C, by assigning tiny pseudocounts to all amino acids so that the observed counts will dominate. This component has the highest prior probability (\(q_9=0.234\)), since many positions in alignments exhibit a unique conserved amino acid, followed by the first component (\(q_1=0.183\)), which favors the small neutral amino acids that appear to be often mixed together in alignment columns, while the more specific density of the second component has the lowest prior probability of the mixture (\(q_2=0.058\)).

Basically, the Dirichlet density \(\mathbf {\alpha _j}\) of a Dirichlet mixture component embeds a prior in the form of a pseudocount that enables one to compute the posterior probability \(\hat{p_i}(a|\mathbf {\alpha _j})\) of each amino acid a from observed counts at position i with respect to this prior by:

$$\hat{p_i}(a|\alpha _j)=\frac{o_{i}(a)+\alpha _{j}(a)}{\sum _{a'}\left( o_i(a')+\alpha _{j}(a')\right) }.$$

This formula can be extended to a mixture of M Dirichlet densities \(\varTheta =(\mathbf {\alpha _{1}},\ldots ,\mathbf {\alpha _{M}},q_{1},\ldots ,q_{M})\) by distributing these probabilities proportionally to the likelihood \(p_i(j)\) of each component for the observed count distribution:

$$\hat{p_i}(a|\varTheta )=\sum _{j=1}^{M}p_i(j)\frac{o_{i}(a)+\alpha _{j}(a)}{\sum _{a'}\left( o_i(a')+\alpha _{j}(a')\right) }.$$

\(p_i(j)\) is named the posterior mixture coefficient of component j and can be estimated by application of Bayes’ rule from the prior Dirichlet mixture coefficient \(q_j\) and the likelihood of the observed counts for component j determined by density \(\mathbf {\alpha _{j}}\):

$$\hat{p_i}(j)=\frac{q_{j}\,p(\mathbf {o_{i}}|\mathbf {\alpha _{j}})}{\sum _{j'=1}^{M}q_{j'}\,p(\mathbf {o_{i}}|\mathbf {\alpha _{j'}})}$$

where \(p(\mathbf {o_{i}}|\mathbf {\alpha _{j}})\), the likelihood of the observed counts according to Dirichlet density \(\mathbf {\alpha _{j}}\), is given by the complicated-looking but simple-to-calculate formula

$$p(\mathbf {o_i}|\mathbf {\alpha _{j}})=\frac{\left( \sum _ao_i(a)\right) !}{\prod _{a}o_i(a)!}.\frac{\prod _{a}\varGamma (o_i(a)+\alpha _{j}(a))}{\varGamma (\sum _{a}o_i(a)+\alpha _{j}(a))}.\frac{\varGamma (\sum _{a}\alpha _{j}(a))}{\prod _{a}\varGamma (\alpha _{j}(a))} $$

where \(\varGamma (x)\), the gamma function, is the standard continuous generalization of the integer factorial function.
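These formulas translate directly into code via the log-gamma function. In the sketch below, the multinomial coefficient of \(p(\mathbf {o_i}|\mathbf {\alpha _j})\) is dropped because it does not depend on the component and cancels when normalizing the posterior mixture coefficients (the data layout is our assumption):

```python
from math import lgamma, exp

def log_likelihood(o, alpha):
    """log p(o | alpha), up to the count-only multinomial coefficient."""
    s_o = sum(o.values())
    s_a = sum(alpha.values())
    return (sum(lgamma(o.get(a, 0) + alpha[a]) - lgamma(alpha[a]) for a in alpha)
            + lgamma(s_a) - lgamma(s_o + s_a))

def dirichlet_mixture_estimate(o, alphas, qs):
    """Posterior mean estimate of p_i(a) under a Dirichlet mixture prior
    (alphas: list of pseudocount dicts, qs: their prior coefficients)."""
    logliks = [log_likelihood(o, alpha) for alpha in alphas]
    mx = max(logliks)  # subtract the max before exp, for numerical safety
    weights = [q * exp(ll - mx) for q, ll in zip(qs, logliks)]
    z = sum(weights)
    p = {a: 0.0 for a in alphas[0]}
    for w, alpha in zip(weights, alphas):
        denom = sum(o.get(a, 0) + alpha[a] for a in alpha)
        for a in alpha:
            p[a] += (w / z) * (o.get(a, 0) + alpha[a]) / denom
    return p
```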

These formulas, obtained by Bayesian inference, provide a powerful pseudocount scheme to estimate the distribution at a position from a small number of observed counts and priors on different typical column amino acid distributions. Where more than a hundred sequences used to be required to build a good characterization of a family of homologous sequences, fifty sequences, or even as few as ten or twenty examples, can suffice with the latest pseudocount schemes.

Usage Profile HMMs have thus become a method of choice for the classification and annotation of homologous protein sequences. Instead of using BLAST to search a database of annotated sequences for one homolog of the sequence to annotate, the idea is to first build a pHMM for each family of homologous sequences and then to predict to which family the sequence belongs by testing which pHMM recognizes it. This way, information from the whole family, rather than from only one sequence, can be used for more sensitive annotation. The most popular pHMM packages are HMMER (pronounced hammer) [54] and SAM [63]. The HMMER package is used in particular in the PFAM [64, 65] and TIGRFAM [66] databases, which gather alignments and pHMM signatures for domains and proteins that are widely used by biologists for the annotation of newly sequenced genomes. The SAM package is more directed towards the recognition of remote homologs sharing a common structural fold: it was applied to search for protein structure templates in several editions of the CASP structure prediction competition [67] and it is used by the SUPERFAMILY [68] library of profile hidden Markov models, which represents all proteins of known 3D structure.

Thanks to the work done to require fewer and fewer examples by the incorporation of a priori knowledge on the similarity of homologous sequences, the recent trend has been to build a pHMM starting from only one protein sequence, as initiated with PWMs by PSI-BLAST [69], to provide a more sensitive alternative to BLAST. Starting from a unique query sequence, the strategy is to bootstrap the search with close homologs: a pHMM is built from the query sequence and then progressively refined by iteratively searching for, and including, the most significant sequence matches in comprehensive sequence databases such as UniProt [70] or the non-redundant (nr) database from NCBI [71]. The result of this procedure is a sensitive pHMM and the homologous sequences retrieved for the query. This strategy was used by SAM-T98 and its successor SAM-T2K for the CASP competitions [72–74]. pHMM packages implementing this strategy with fast heuristic prefilters, such as the new HMMER3 [75], are now as fast as BLAST. The idea has been pushed one step further by HHSearch [76] and its filtered, sped-up version HHBlits [77], which preprocess the sequences from the databases to group them into sets of close homologs represented by pHMMs and then perform iterative pHMM–pHMM alignments, helped by sequence context-specific pseudocounts, to obtain more sensitive results in the search for remote homologs sharing the same structural fold.

Modeling Conserved RNA Sequences Profile HMMs have been especially successful for modeling protein homologs, and they are also starting to be used for modeling DNA homologs [78, 79]. However, they are not so well adapted to modeling RNAs that are not translated into proteins. These so-called non-coding RNA (ncRNA) molecules play vital roles in many cellular processes. One of the best known examples of functional ncRNAs is the family of transfer RNAs (tRNA), which is central to the synthesis of proteins. A tRNA molecule is shown in Fig. 8.4: one can see from this example that, like proteins, RNAs are single-stranded molecules that fold into a three-dimensional structure (“tertiary structure”) that determines the function, and, as in DNA, the complementarity between the bases (AU and CG) is a key determinant of RNA structure, which is typically composed of short helices packed together and is often simply represented by the base pairing on the sequence (“secondary structure”).

The contiguous paired bases that form the helices, named stems, predominantly occur in a nested fashion in RNA sequences, as complementary palindromic subsequences. These kinds of long-distance correlations in the sequence, which are crucial for RNA structure, are typically context-free and lie beyond the expressiveness of pHMMs, which are restricted to position-based characterizations.

RNA and Context-Free Grammars In Fig. 8.5, an example is given of how a context-free grammar can be designed in a straightforward way to capture the non-crossing base pairings. The idea is to have a pair matching rule \(S_i\rightarrow a S_{i+1} b\) for each base pair (ab) and a base matching rule of the form \(S_i\rightarrow a S_{i+1}\) or \(S_i\rightarrow S_{i+1} a\) for each unpaired base a. By ordering these rules with respect to the sequence order and introducing a branching rule \(B_i\rightarrow S_iS_j\) to chain successive nested structures, one gets a grammar recognizing the RNA sequence with a derivation tree mirroring its secondary structure.
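This construction can be mechanized. The following sketch emits the matching rules from a sequence and a nested secondary structure given in dot-bracket notation (a common format; '(' and ')' mark paired positions, '.' unpaired ones). The branching rule \(B_i\rightarrow S_iS_j\) for side-by-side stems is omitted to keep the sketch short:

```python
def grammar_rules(sequence, structure):
    """Emit pair-matching and base-matching rules mirroring a nested
    secondary structure (dot-bracket notation, assumed well nested)."""
    stack, partner = [], {}
    for i, c in enumerate(structure):
        if c == '(':
            stack.append(i)
        elif c == ')':
            j = stack.pop()
            partner[j], partner[i] = i, j
    rules, state = [], 0
    i, j = 0, len(sequence) - 1
    while i <= j:
        if i in partner and partner[i] == j:
            rules.append(f"S{state} -> {sequence[i]} S{state + 1} {sequence[j]}")
            i, j = i + 1, j - 1
        elif i not in partner:
            rules.append(f"S{state} -> {sequence[i]} S{state + 1}")
            i += 1
        elif j not in partner:
            rules.append(f"S{state} -> S{state + 1} {sequence[j]}")
            j -= 1
        else:
            break  # two side-by-side stems: would need a branching rule
        state += 1
    rules.append(f"S{state} -> ε")
    return rules

print(grammar_rules("GAAAC", "(...)"))
# ['S0 -> G S1 C', 'S1 -> A S2', 'S2 -> A S3', 'S3 -> A S4', 'S4 -> ε']
```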

Fig. 8.4 Tertiary (left) and secondary (right) structure of yeast tRNA-Phe

Fig. 8.5 Example of a context-free grammar and derivation tree mirroring the secondary structure of an RNA sequence

The secondary structure is often more conserved than the sequence of non-coding RNAs: mutations in one strand of a stem are often compensated for by a mutation in the complementary strand. These compensatory mutations restore base pairing at a position and contribute to the conservation of the RNA secondary structure, and therefore of its function. Let us remark here that single mutations can also occur on non-paired bases without changing the secondary structure. The grammar above can easily be generalized to cope with these kinds of structure-preserving mutations that the sequence can undergo. To do so, each pair matching rule can be complemented to match the other complementary pairs of bases, and each single base matching rule can also be complemented to match the other bases.

For instance, in the example of Fig. 8.5, \(S_4\rightarrow \mathrm {G}S_5\mathrm {C}\) would be complemented to get a rule \(S_4\rightarrow aS_5\tilde{a}\) for each pair of bases \((a,\tilde{a})\), where \(\tilde{a}\) denotes the complementary base to a, and \(S_1\rightarrow \mathrm {A}S_2\) would be complemented to get a rule \(S_1\rightarrow aS_2\) for each base a; and so on for the other matching rules.

Profile SCFG By doing so, the resulting grammar would model the secondary structure but would lose the information of the initial RNA sequence, even though this can be important for homology search or functional characterization. A trade-off between sequence and secondary structure conservation can be achieved by weighting differently each base or pair of bases matched by each rule, according to its probability of occurring at the position. At this point, the obtained grammar can be seen as a stochastic context-free counterpart of the (regular) PWMs seen above, allowing us to match a base a at one position i with weight \(w_{i,a}\), as with a PWM, by a base matching rule \(S_i\rightarrow aS_{i+1}/w_{i,a} \), but allowing us also to match paired bases \((a,\tilde{a})\) at paired positions (ij) with a weight \(w_{i,(a,\tilde{a})}\) by a pair matching rule \(S_i\rightarrow aS_{i+1}\tilde{a}/w_{i,(a,\tilde{a})}\). To obtain the context-free counterpart of pHMMs, named profile stochastic context-free grammars (pSCFG) [81] or covariance models (CM) [82], each matching rule \(S_i\) is completed with position-based deletion rules (of the form \(S_i \rightarrow S_{i+1}/w^{del}_{i}\)) and insertion rules (of the form \(I_i\rightarrow a I_i/w^{ins}_{i,a}\) or \(I_i\rightarrow I_ia/w^{ins}_{i,a}\)). For positions matching one base, this is done as for pHMMs. For positions matching paired bases, deletion and insertion rules are added in a similar way, but taking care to enable insertion or deletion on each side (left or right) of the nested sequence, which requires the equivalent of six states instead of three per position.

Fig. 8.6 Setting a pSCFG's topology from a multiple sequence alignment annotated by a secondary structure (example adapted from [80])

As with pHMMs, a pSCFG's parameters can be trained by likelihood maximization approaches from a set of aligned sequences, but this additionally requires a consensus (nested) RNA secondary structure indicating the paired and unpaired bases in order to set up the topology. This secondary structure can be known for one of the aligned sequences, be predicted by free energy minimization on a sequence, or be the common secondary structure inferred from a set of multiple homologous sequences. In Fig. 8.6, an example of three aligned RNA sequences with such a secondary structure is given, with nested ‘\({\texttt {>}}\)’ and ‘\({\texttt {<}}\)’ indicating the paired positions, ‘x’ the unpaired positions and ‘.’ the insertions with respect to the structure. From this information, one can automatically keep only the matching positions sufficiently shared among the sequences to get the paired (‘\({\texttt {>}}\)’, ‘\({\texttt {<}}\)’) and unpaired (‘x’) matching positions of the pSCFG corresponding to the template secondary structure displayed on the left of Fig. 8.6. Each matching position is systematically completed with its companion insertion/deletion rules to get the complete pSCFG topology, and the parameters can then be trained to maximize the likelihood of the alignment, possibly completed by pseudocounts.

Usage By using a context-free representation, pSCFGs and CMs extend pHMMs nicely to handle not only the base distribution at each position but also the distribution of base pairs at each (nested) paired position, capturing in this way an important structural feature of ncRNA sequences that makes them suitable for successfully retrieving RNA homologs. The Rfam database [83], an authoritative collection of non-coding RNA families, represents each family by a multiple sequence alignment, a predicted secondary structure and a CM, and is powered by Infernal [84], the kindred software package to HMMER dedicated to modeling RNA with CMs.

To get finer results on the characterization of ncRNAs, one would also need to be able to represent crossing correlations such as pseudoknots (typical RNA structures with two stems in which half of one stem is intercalated between the two halves of another stem), which, with all the computational hardness that they involve, are beyond the generative power of context-free grammars. Even if some proposals have been made to represent this kind of structure by grammatical models [85–88], learning such models will be extremely difficult. Finding good representations, with practical computational time for learning, of that kind of correlation in genomic sequences is still an open and challenging research area.

Towards Sentences So far, we have seen approaches modeling homologous proteins or RNA genes over their maximal alignable length. To find more distant homologs or to focus on functionally important parts of the sequences, other approaches prefer to target the identification and characterization of the most conserved parts shared by a set of sequences.

For instance, Meta-MEME [89] is based on an iterative search by MEME for a set of significant local alignments in a set of DNA or protein sequences [32], which are used to build a simplified profile HMM where all the delete states are removed and only the insertion states between the blocks modeling the local alignments found are kept (see Fig. 8.3).

Pratt [90] searches for even stricter conservation: instead of local alignments covering all the sequences, it searches by enumeration for interspaced, strictly conserved amino acid or nucleotide symbols occurring in a sufficiently large subset of the sequences and then heuristically refines these patterns with new matching components offering a choice between sets of symbols. The patterns potentially returned by Pratt are composed of a succession of symbols, or choices of symbols, separated by wildcards indicating the insertion of a stretch of symbols bounded by a minimal and a maximal length. To remain feasible, the search has to be constrained by many user-defined parameters limiting the size of the pattern and the number of insertions, the program then returning the best patterns in this search space with respect to an information-based or minimum description length score.

An example of a well-known pattern is the C2H2 signature of the ‘zinc finger’ in proteins: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, read as a C followed by two to four amino acids, then a C followed by three amino acids, then one of the amino acids in [LIVMFYWC] followed by eight amino acids, an H, three to five amino acids and finally an H. These patterns are among the most expressive patterns used in Bioinformatics and can be seen as the deterministic counterpart of the Meta-MEME models, with blocks arising from exact conservation rather than from local similarity. They are known as Prosite patterns, from the name of the database [91] that popularized them as exact signatures of many domains, families and functional sites of proteins. While the patterns in Prosite were initially mostly built semi-automatically from multiple sequence alignments, Pratt is now the default pattern discovery software proposed to users on Prosite's website to find patterns without the need for a sequence alignment.

These latter methods enable one to discover functional or structural conserved units shorter than genes or domains—the highly conserved blocks found by Meta-MEME in all the sequences, or the adjacent groups of conserved positions identified by Pratt in a subset of the sequences—introducing each unit as a new potential genomic word, or the succession of these units as a more complex word, interspersed in the sequence (but possibly close in space).

8.3 Learning Syntax

So far, we have seen the state-of-the-art methods actually used in practice by biologists to discover and model (conserved) words in genomic sequences. The achievements in Bioinformatics for expressive characterizations are strongly linked to multiple sequence alignments, resulting in position-specific signatures that represent a succession of independent, uncorrelated conserved positions (or pairs of positions for RNA), possibly augmented with the ability to insert symbols between these positions or to skip some of them. Learning is then based on (1) the choice by the expert of the most adequate simple topology, (2) the identification and alignment of the conserved positions among the sequences and, for stochastic models, (3) the training of the parameters to maximize the likelihood of the sample with respect to priors.

In this section, we are interested in going beyond the position-specific characterization of (conserved) words. In particular, we would like to learn models with dependencies between the symbols of the sequences. In other words, this would allow us to make progress towards the goal of learning not only the words but also the syntax (the grammar) of genomic sequences. The difficulty is that, the dependencies being unknown, one can no longer rely on predefined topologies such as those of pHMMs or pSCFGs: the structure of the grammar has to be learnt from the sample, which constitutes a complete Grammatical Inference task and a challenging application for that field.

Learning k-Testable Languages A first step towards learning grammars on genomic sequences is the early work of Yokomori et al. [92, 93] on learning automata representing locally k-testable languages, applied to the identification of hemoglobin \(\alpha \)-chains. The class of locally k-testable languages, very similar to the class of k-testable languages in the strict sense [94, 95], is linked to n-grams and, more biologically, to (persistent) splicing systems. Languages of this class have the property that it is sufficient to parse the substrings of length k to decide whether a sequence is accepted or not; dependencies are therefore limited to the length k but, in contrast to motifs, cover the whole length of the sequences. Given k, learning such a language can be done by a simple and efficient algorithm building an automaton that memorizes the subwords of length k appearing in the positive sample and the corresponding admissible one-letter transitions between them. This algorithm ensures identification in the limit of k-testable languages when k is known. In practice, however, the value of k is estimated by cross-validation and is usually small, the inference being then less subject to over-specialization. To apply this simple inference algorithm to proteins, Yokomori et al. reduce the 20-letter alphabet to a six-letter alphabet, clustering amino acids according to the main substitutability classes following Dayhoff's coding method, or drastically to a binary alphabet according to hydropathy (see Fig. 8.7). Recoding the sequences over these reduced alphabets greatly helps generalization and enables bootstrapping the inference with some biological knowledge on amino acid similarities.
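A sketch of this inference algorithm in the strict-sense setting (k-TSS), assuming \(k \ge 2\) and sample sequences of length at least \(k-1\): the states of the learned automaton are the (k−1)-grams observed in the sample, and a sequence is accepted iff its prefix, its suffix and all of its k-grams were observed.

```python
def learn_ktss(samples, k):
    """Build a k-testable (strict sense) acceptor from positive samples:
    observed prefixes/suffixes of length k-1 plus, for each observed
    k-gram w, a transition from state w[:-1] to state w[1:] on w[-1]."""
    prefixes = {s[:k - 1] for s in samples}
    suffixes = {s[-(k - 1):] for s in samples}
    transitions = {}
    for s in samples:
        for i in range(len(s) - k + 1):
            gram = s[i:i + k]
            transitions[(gram[:-1], gram[-1])] = gram[1:]
    return prefixes, suffixes, transitions

def accepts(model, sequence, k):
    prefixes, suffixes, transitions = model
    state = sequence[:k - 1]
    if state not in prefixes:
        return False
    for a in sequence[k - 1:]:
        state = transitions.get((state, a))
        if state is None:
            return False
    return state in suffixes

model = learn_ktss(["abab", "abb"], k=2)
print(accepts(model, "ababb", 2))  # True: bigrams ab, ba, bb were all observed
```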

Fig. 8.7 Dayhoff's and binary amino acid encodings used in [92, 93, 96, 97]

This first work is the root of recent studies applying similar approaches to learn grammatical models for the prediction of coiled-coil proteins [96] and of transmembrane regions in proteins [97], whose performance is close to that of dedicated tools built with human expertise. In these works, the application scope of learning a k-testable language is extended from a sequence classification task to a sequence labeling task through a preliminary sequence recoding and a final automaton-to-transducer transformation. Sequence recoding is done first by reducing the alphabet according to Dayhoff's code, as in [92, 93], but the alphabet is then augmented by combining the letters of the reduced alphabet with their labels in the labeled sequences forming the training sample for the task. For instance, using an example from [97], one protein sequence of the training set,

M R V T A P R T L L L L L W G A V A L T E T W A G S H S M R,

would be encoded first following Dayhoff’s coding into

e d e b b b d b e e e e e f b b e b e b c b f b b b d b e d

and, from its known transmembrane topology, this sequence could be labeled as follows (see [97] for alternative labelings):

e d e b b b d b e e e e e f b b e b e b c b f b b b d b e d

O O O C M M M M M M D I I I I I I I I A M M M M M B O O O O

where M labels residues in transmembrane regions, I labels residues inside the cell, O labels residues outside the cell, and A, B, C, D label the shifts from outer/inner regions to/from transmembrane regions. Then, over the augmented alphabet, composed of a symbol xL for each letter x labeled by L, the sequence encoding the labeled example would begin with the following symbols (separated by white spaces):

eO dO eO bC bM bM dM bM eM eM eD eI eI fI bI bI eI bI eI bA cM...

By encoding the sequences from the positive sample this way, one can learn a k-testable language by a classical algorithm, such as k-TSSI, designed to learn k-testable languages in the strict sense [94, 95], with the advantage that, as in the morphic generator methodology [98], identical letters can be distinguished by their labels during the inference. By transforming each transition labeled by a symbol xL in the learned automaton into a transition reading letter x and outputting label L, one gets back a labeling transducer that can then be weighted and used for the task, possibly with the help of error-correcting parsing techniques to compensate for the lack of data. These studies show that grammatical inference techniques can be applied with encouraging results to genomic sequences, even with such a limited class of languages, when helped by pertinent pre- and post-processing techniques. We will now focus on learning more expressive grammatical representations of languages, and thus more complex dependencies, on these kinds of sequences.
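For illustration, the recoding and alphabet augmentation can be reproduced with the six standard Dayhoff groups; the exact letter assignment a–f below is our assumption, chosen to match the example above:

```python
# The six Dayhoff amino acid groups (cf. Fig. 8.7), mapped to letters a-f.
DAYHOFF = {aa: code for code, group in
           zip("abcdef", ["C", "AGPST", "DENQ", "HKR", "ILMV", "FWY"])
           for aa in group}

def encode_labeled(sequence, labels):
    """Recode a protein sequence on the Dayhoff alphabet and augment
    each letter with its label, producing the 'xL' symbols used to
    learn the labeling automaton."""
    return [DAYHOFF[aa] + lab for aa, lab in zip(sequence, labels)]

print(encode_labeled("MRV", "OOO"))  # ['eO', 'dO', 'eO'], as in the example
```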

Learning Automata At the first level of Chomsky's hierarchy (regular languages), we have investigated in our team the inference of full automata to model functional or structural families of proteins directly from their complete sequences. RPNI [99, 100] and EDSM/Blue-Fringe [101] (see Chap. 4, On the Inference of Finite State Automata from Positive and Negative Data, López and García) having been shown to be successful in practice on artificial data, testing these methods on this task was appealing. Our preliminary attempts showed, however, that these methods, even when improved by taking into account the similarities of amino acids, performed very badly in leave-one-out experiments. Our analysis of these results showed that protein sequences, whose length is about 300 symbols on average over a 20-letter alphabet and whose functional parts are not necessarily at the beginning or the end of the sequences, are not well suited to these algorithms, which rely mainly on common sequence heads and tails for the inference. To avoid these pitfalls, we proposed shifting from deterministic to non-deterministic automata and adapting accordingly the idea of evidence introduced by EDSM so as to merge common (similar) substrings rather than common tails, obtaining a first successful application of the classical state-merging grammatical inference framework to learn automata on protein sequences: Protomata-Learner [102–107].

Shifting from a deterministic to a non-deterministic setting in the state-merging approach simply requires starting from the Maximal Canonical Automaton (MCA) of the sample set (the non-deterministic automaton that is the union of the canonical automata built on each sample) rather than from the Prefix Tree Acceptor (PTA), and proceeding by merging some of its states (without merging for determinization) [108] or, inspired by EDSM, by successively merging the states on paths labeled by common substrings.

Similar Substring Merging Approach To deal with amino acid similarities, the heuristic has been generalized to look at common similar substrings, on the basis of the significantly similar pairs of substrings (named diagonals) precomputed by Dialign to serve as building blocks for multiple sequence alignments [53]. A Dialign diagonal d is a pair of equal-length substrings \((d_1,d_2)\), implicitly aligned from left to right, whose similarity s(d) is computed by summing the substitution scores (given by a substitution matrix) of the aligned amino acids. Dialign also computes for each diagonal the weight w(d) of its similarity as the negative logarithm of the similarity’s p-value, namely the probability of finding a diagonal of the same length with a greater or equal similarity in random amino acid sequences. The weight, measuring how exceptional the similarity of the diagonal is relative to its length, enables us to compare diagonals of different lengths and to define similar diagonals: random diagonals ought to have a weight of 0; similar diagonals are thus those whose weight is greater than 0, or greater than a positive weight threshold parameter t if one requires more significant similarity before considering the substrings of a diagonal.
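As an illustration, here is a small self-contained sketch of diagonal scoring; the identity-based score and the Monte Carlo p-value estimation are our placeholders for Dialign’s substitution matrix and exact computation.

```python
# A minimal sketch of diagonal scoring; the toy score and the sampled p-value
# stand in for Dialign's actual substitution matrix and exact computation.
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b):
    return 2 if a == b else -1          # toy substitution score, not BLOSUM

def similarity(d):
    """s(d): sum of the substitution scores of the left-to-right aligned pairs."""
    d1, d2 = d                          # a diagonal: two equal-length substrings
    return sum(score(a, b) for a, b in zip(d1, d2))

def random_word(length):
    return "".join(random.choice(AMINO_ACIDS) for _ in range(length))

def weight(d, n_samples=10_000):
    """w(d) = -log p, p estimated on random diagonals of the same length."""
    length, s = len(d[0]), similarity(d)
    hits = sum(similarity((random_word(length), random_word(length))) >= s
               for _ in range(n_samples))
    p = (hits + 1) / (n_samples + 1)    # smoothed to avoid log(0)
    return -math.log(p)

# A diagonal is deemed similar when weight(d) > t, for a threshold t >= 0.
print(weight(("ACDEF", "ACDEY")) > 0)   # True: 4 identities out of 5 is unexpected
```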

Fig. 8.8 Merging similar substrings

The task is then to distinguish the similar diagonals that are characteristic of the family from those that are similar by chance or for another, unrelated reason. This is done in Protomata-Learner by a best-first greedy approach: at each iteration, the best similar substrings are selected by a heuristic (maximizing their support in the training set as well as their similarity) and the states aligned by these substrings are merged (see Fig. 8.8), discarding from future choices the remaining similar substrings that are incompatible with the selected ones.
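A minimal sketch of this greedy selection loop could look as follows; `support` and `weight` stand for the scoring functions of the heuristic (the actual combination used by Protomata-Learner may differ) and `compatible` is the test sketched after Fig. 8.9 below.

```python
# A minimal sketch of the best-first greedy selection of diagonals.

def select_diagonals(candidates, support, weight, compatible):
    """Keep the best diagonals, discarding those incompatible with kept ones."""
    remaining = sorted(candidates,
                       key=lambda d: (support(d), weight(d)), reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)       # best supported, then most similar
        selected.append(best)         # the states it aligns get merged
        # earlier rounds already filtered against previously selected diagonals
        remaining = [d for d in remaining if compatible(d, best)]
    return selected
```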

Fig. 8.9 Incompatible and compatible diagonals

Incompatible diagonals are those with an overlap presenting conflicting alignments inside the diagonals, forcing us to choose at most one of them (see Fig. 8.9 top). Another kind of incompatibility can be introduced to help the inference when it is assumed that the protein sequence family does not undergo shuffling mutations (which are unlikely to occur without structure and function changes): in that case, the order of the similar substrings in the sequences is preserved and crossing diagonals are incompatible (see Fig. 8.9 bottom). This greedy similar substring merging algorithm halts when no more compatible similar substrings are available for merging, thus relying on the incompatibilities and on the chosen threshold t to stop the inference. No negative sample is required: the characterization is directed towards maximizing the global unexpected similarity of substrings with respect to random sequences, adopting in this way a Minimum Description Length perspective rather than the discriminative Occam’s razor inspiration of RPNI or EDSM.
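The two incompatibility tests can be sketched as follows, assuming, for concision only, that a diagonal of a given pair of sequences is represented by its two start positions and its common length.

```python
# A minimal sketch of the incompatibility tests for diagonals of the same
# pair of sequences; the (start1, start2, length) encoding is an assumption.

def conflicting_overlap(d, e):
    """Overlapping diagonals whose implied alignments of positions disagree."""
    (a1, b1, l1), (a2, b2, l2) = d, e
    overlap_1 = a1 < a2 + l2 and a2 < a1 + l1   # overlap in the first sequence
    overlap_2 = b1 < b2 + l2 and b2 < b1 + l1   # overlap in the second sequence
    return (overlap_1 or overlap_2) and (a1 - b1) != (a2 - b2)

def crossing(d, e):
    """Diagonals appearing in opposite orders in the two sequences."""
    (a1, b1, _), (a2, b2, _) = d, e
    return (a1 < a2) != (b1 < b2)

def compatible(d, e, assume_no_shuffling=True):
    """Usable as the compatibility test of the greedy selection above."""
    if conflicting_overlap(d, e):
        return False
    return not (assume_no_shuffling and crossing(d, e))
```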

A New Kind of Alignment The similar substring merging approach of Protomata-Learner under such incompatibility constraints can be linked to the classical bioinformatics field by considering the sets of merged similar substrings as a new kind of multiple sequence alignment, named partial local multiple alignment (PLMA). A PLMA exhibits conserved regions that can be local, involving only a contiguous subset of the amino acids of the sequences as in classical local alignments, but also partial, involving contiguous amino acids from only a subset of the sequences instead of all of them. This latter property enables us to represent unrelated conserved regions among subsets of the sequences: instead of being limited to the identification of positions conserved in all the sequences, one can identify alternative conserved words in some sequences, not necessarily aligned, and their chaining, thus paving the way to modeling syntax in addition to conserved words. For the inference of automata, the aligned substrings from the conserved regions of the PLMA are merged, possibly weighting amino acid transitions thanks to efficient PWM or pHMM weighting schemes, and insertion states are added to link consecutive conserved regions (see Fig. 8.10). This enables the learning of topologies that can be seen as a generalization of pHMM or Meta-MEME architectures, going beyond these position-specific characterizations by enabling us to model alternative paths (see Fig. 8.3).

Fig. 8.10 Learning automata by partial local alignment from a set of protein sequences

Learning Context-Free Grammars Even if automata enable us to take an important step toward more expressive models, they are limited to successions of short-term dependencies, while it is well known from protein folding that residues that are far apart in the sequence may be close in space and interact together, or may simply be correlated. To represent this kind of long-distance interaction, one needs to learn more expressive grammatical representations.

From a General Template Grammar A first attempt towards this goal is the framework introduced in [109], based on a genetic algorithm training the weights of a complete stochastic context-free grammar in Chomsky normal form to maximize the likelihood of the training sample. A complete grammar is such that the rule \(A\rightarrow BC\) exists for every triple of non-terminals \(A\), \(B\), \(C\): the number of rules thus grows cubically with the chosen number of non-terminals. The framework limits the number of non-terminals by biasing the topology of the grammar towards nested dependencies and, more drastically, by an original way of coping with the size of the amino acid alphabet while introducing knowledge about physico-chemical properties: all the amino acids are generated from only three non-terminals, corresponding to three discretized levels (low, medium or high) of a chosen property of interest (for instance the van der Waals volume), the probability of generating each amino acid being fixed with respect to these levels (and thus not subject to training). A grammar thus considers amino acids only with respect to one property; if more than one property is of interest, one needs to train several grammars and combine their parsing scores for membership predictions. Experiments restricted to binding site regions of protein sequences, with nine non-terminals, show a good recognition accuracy on this task and pertinent parse trees, illustrating the interest of this kind of context-free model.
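The following sketch illustrates the size of such a complete grammar; the level classes used here are made-up placeholders (not the actual discretization of van der Waals volumes) and the exact rule template of [109] may differ from this one.

```python
# A minimal sketch of the "complete grammar" idea: every binary CNF rule over
# the non-terminals exists and only the rule weights are trained, while amino
# acids are emitted with fixed probabilities by three lexical non-terminals
# tied to discretized property levels (placeholder classes below).
from itertools import product
import random

levels = {"Low": "AGSC", "Med": "DNPTV", "High": "EFHIKLMQRWY"}
emissions = {lvl: {aa: 1 / len(aas) for aa in aas}   # fixed, not trained
             for lvl, aas in levels.items()}

nonterminals = [f"N{i}" for i in range(6)] + list(levels)  # nine in total
binary_rules = {(a, b, c): random.random()                 # weights to train
                for a, b, c in product(nonterminals, repeat=3)}
print(len(binary_rules))  # 9**3 = 729 binary rules: cubic growth in |N|
# (in practice the level non-terminals only emit amino acids)
```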

By Local Substitutability We have recently proposed a different approach [110] showing the versatility and the efficiency of distributional learning of context-free languages (see Chap. 6, Distributional Learning of Context-Free and Multiple Context-Free Grammars, Clark and Yoshinaka) by applying it to protein sequences. PLMAs are used once again, but here as a pre-processing step to deal with amino acid similarity: by using parameters that allow the identification of all short highly conserved regions under overlapping and crossing incompatibilities, the sequences are recoded according to these conservation blocks and provided as input to the actual generalization step performed by a grammatical inference algorithm. To be able to parse non-encoded protein sequences, a post-processing of the inferred grammar is performed to replace each terminal corresponding to a conserved region by a new non-terminal generating amino acids from the region (by introducing a succession of new non-terminals, one for each set of aligned amino acids of the region, each generating indifferently any amino acid from its set) and to introduce new non-terminals generating any amino acid for the non-conserved regions. Used this way, PLMAs detect and align similar amino acids but entail almost no generalization when no grammatical inference algorithm is used, as testified by leave-one-out experiments. More surprisingly, when we tried state-of-the-art grammatical inference algorithms learning substitutable [111, 112] or k,l-substitutable [113] context-free languages, based on a formalization of the substitutability idea introduced in linguistics by Zellig Harris in the 1950s [114], no additional generalization was performed.

Learning such languages is based on the identification of substrings appearing in a common context, in order to generalize the language by allowing these substrings to be substituted for each other (a contextual constraint on substitutability being added for k,l-substitutable languages): if \(xyz\) and \(xy'z\) are both in the training set, then any occurrence of \(y\) (or a subset of these occurrences for k,l-substitutable languages) can be substituted by \(y'\), and vice versa, in the language. The problem in the preliminary experiments on protein sequences was that this criterion was never met in the training samples. As a matter of fact, when the sequences are long, observing two occurrences of the common context \((x, z)\) and two occurrences of \(y\), given that at least one of these substrings has to be long, has a low likelihood in practice. Moreover, these characterizations rely on conserved heads and tails which, as already stated for the inference of automata, are not necessarily informative or conserved in protein sequences.
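The detection step underlying this criterion can be sketched as follows; on protein-sized sequences the resulting dictionary is essentially empty, which explains the observed absence of generalization.

```python
# A minimal sketch of the (global) substitutability detection: substrings are
# grouped by the full contexts (x, z) in which they occur; any two substrings
# sharing a context become mutually substitutable.
from collections import defaultdict

def substitutable_groups(sample):
    """Map each full context (x, z) to the substrings y with xyz in the sample."""
    by_context = defaultdict(set)
    for w in sample:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                by_context[(w[:i], w[j:])].add(w[i:j])
    return {ctx: ys for ctx, ys in by_context.items() if len(ys) > 1}

print(substitutable_groups(["abcd", "abed"])[("ab", "d")])  # {'c', 'e'}
# Long sequences rarely share both a full prefix x and a full suffix z.
```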

In [110], we thus proposed a variant of the substitutability generalization criterion that considers local rather than global contexts to define the substitutable substrings: the local substitutability criterion states that it is sufficient to have both \(xuyvz\) and \(x'uy'vz'\) in the training set, for a common local context \((u, v)\) of sufficient length, to allow any occurrence of \(y\) (or a subset of these occurrences for k,l-substitutable languages) to be substituted by \(y'\). At the price of two additional parameters on the required lengths of the left and right parts of the common local context (or only one parameter when left and right contexts are considered symmetrically), we have been able to get a real and pertinent generalization. Thanks to the development and implementation of a faster algorithm for learning local substitutable context-free grammars, named ReGLiS, combined with the encoding pre- and post-processing scheme, these results have been confirmed on the complete set of protein families used for testing in [109]: using the entire protein sequences rather than only the short binding site substrings, our leave-one-out experiments show a good recall and a perfect precision [115]. These preliminary results, obtained without any weights on the rules, are really encouraging and should be easily improved. They already show, together with the other works presented in this section, that grammatical inference can be successfully applied to non-trivial syntactic characterizations of protein families. More generally, learning syntax on genomic sequences is a very nice open playground for grammatical inference, enabling us to apply ideas and techniques from the field while also being a source of inspiration for challenging new practical and theoretical developments.
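For comparison with the global criterion above, here is a naive sketch of the local detection step; ReGLiS is of course far more efficient than this brute-force enumeration.

```python
# A minimal sketch of the local substitutability detection of [110]: y and y'
# become substitutable when both occur between a common local context (u, v)
# with |u| >= l and |v| >= r; l = r gives the symmetric one-parameter variant.
from collections import defaultdict

def locally_substitutable_groups(sample, l, r):
    """Map each local context (u, v) to the substrings seen between u and v."""
    by_context = defaultdict(set)
    for w in sample:
        for i in range(l, len(w) - r + 1):        # start of y, after a left context u
            for j in range(i, len(w) - r + 1):    # end of y, before a right context v
                u, y, v = w[i - l:i], w[i:j], w[j:j + r]
                by_context[(u, v)].add(y)
    return {ctx: ys for ctx, ys in by_context.items() if len(ys) > 1}

print(locally_substitutable_groups(["xxayy", "xxbyy"], 2, 2))
# {('xx', 'yy'): {'a', 'b'}}: 'a' and 'b' become substitutable for each other
```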

8.4 Conclusion

We have presented here the first successful steps towards learning the language of biological sequences. So far, the state of the art is mainly at the word level: the discovery of exceptional words, the alignment of conserved words and their modeling by the parametrization of simple adequate topologies based on biological priors. Some recent advances have also been made on learning non-trivial grammar topologies for proteins but we are only at the beginning of this exciting challenge addressed by Grammatical Inference.

To draw the lines of future research in this field, one can expect that the focus on learning topologies with (long-distance) correlations will continue. In proteins and RNA, this would allow us to capture correlations between positions that are far apart in the sequences but close in 3D space. In DNA, the problem seems more complicated, since the challenge is then to deal with palindromes and copies, requiring us to use and learn more expressive grammars. For DNA, recent advances have thus rather been made on a simpler task: discovering the hierarchical structure of DNA as an instance of the smallest grammar problem, along the lines initiated by Sequitur [116] and its successors [117–123]. These studies have not been presented in this chapter since it is still difficult to assess and compare their biological pertinence, but these approaches based on repeats may help us to better understand what the important words are and where they occur in DNA, and to decipher its word structure as a preliminary step towards learning grammars. Moreover, the repeats used in these approaches are not that far from the variables used in the current state-of-the-art DNA parsers of the first section. This is an interesting convergence when the goal is to automatically design, or help the expert design, the grammars for these parsers.

We have proposed in this chapter an overview, from a Grammatical Inference point of view, of the achievements and open challenges in this research field, as well as some keys to enter it. To investigate this area further, we propose a short list of additional reading recommendations.

Further Readings First, Wikipedia (http://www.wikipedia.org/) covers the related concepts in biology and bioinformatics fairly well, and these pages are usually well written. Good entry points to Pattern Discovery are [22, 124], while [23] offers a comprehensive algorithmic and theoretical treatment of the subject. For probabilistic models on sequences, an excellent review with a grammatical inference point of view is [125], while the reference books [126, 127] cover non-grammatical machine learning techniques. Finally, on Grammatical Inference, the other chapters of this book should be helpful, as well as the reference book [128].