Background

Since the first descriptions of protein structures by Pauling and Corey [1, 2], their repetitive secondary structures have been widely analyzed. They have been studied from two principal points of view – assignment and prediction.

Different approaches can be used for assigning secondary structures to a 3D protein structure. The most common is DSSP [3], which is based on hydrogen bonding patterns. STRIDE [4] relies on the same criteria with slightly different parameters and computes backbone dihedral angles. DEFINE [5] uses an inter-Cα distance matrix that corresponds to ideal repetitive secondary structures. PCURVE [6] is based on the helicoidal parameters of each peptide unit and generates a global peptide axis. Finally, PSEA [7] bases its assignments only on the Cα position, using distance and angle criteria. Not surprisingly, these methods do not assign the same state to all residues, especially those located at the beginning and end of repetitive structures. For instance, DSSP, DEFINE and PCURVE only assign 65% of residues to the same state [8].

Several prediction methods have been developed [9], and accuracy rates climb to 80% with neural networks and sequence homology [10]. Secondary structures do not, however, entirely describe the 3D protein structure. Coils account for more than 40% of residues. In the conventional 3-state description, they are associated with only one state, defined as non-helicoidal and non-extended. The coil state is in fact composed of really distinct local folds, such as turns [11]. Several studies have attempted to analyze loops [12, 13] and predict their conformations [14], but they still fail to take a significant portion of residues into account.

Protein structure descriptions that use a library or set of small prototypes, i.e., N states rather than the conventional three, can help improve definitions of these regions and may also improve prediction. Such a library constitutes a structural alphabet [15, 16] and is composed of structural prototypes. Because these describe all the local folds, repetitive structures as well as coils, they allow a better approximation of the entire protein structure. Thus, they can be used to reconstruct protein structures [17] or to predict the local structure [18]. In a previous study, we defined a structural alphabet composed of 16 protein fragments, each 5 residues in length, called Protein Blocks (PBs, cf. Figure 1) [19]. They have been used to describe 3D protein backbones [20–22] and to predict local structures [19, 23]. Our structural alphabet is particularly informative [24] and is thus useful for pre-processing before ab initio and new fold prediction.

Figure 1

Protein Blocks. From left to right and top to bottom, RASTER 3D [42, 43] images of the 16 Protein Blocks of the structural alphabet. Each prototype is five residues in length and corresponds to eight dihedral angles (φ,ψ). The PBs m and d can be roughly described as prototypes for the central α-helix and the central β-strand, respectively. For each PB, the N-cap extremity is on the left and the C-cap on the right.

We focus here on the study of small loops that connect two repetitive structures. We first analyze the classic secondary structure assignments with the five above-mentioned methods. Secondly, we describe the short loops with our structural alphabet and analyze the sequence-structure relationship in these local structures. Finally, we make local predictions based on the amino acid sequences.

Results

Secondary structure assignments

As noted by Woodcock et al. [25], a serious problem raised by the variety of methods for secondary structure assignment is that they often yield differing results. A consensus method has been proposed to lessen this effect [8]. Here we used an agreement rate, denoted as C 3, which is the proportion of residues associated with the same state. Table 1 summarizes the correspondence between the secondary structure assignments from the five methods. It clearly highlights three points: (i) with its default parameters, DEFINE yielded results very different from the other methods, as shown by its C 3 values, close to 62%; (ii) DSSP and STRIDE produced nearly identical assignments, with C 3 equal to 95%. Of the remaining assignments, 4% corresponded to confusion between α-helices and coils, and the remaining 1% to confusion between β-strands and coils; (iii) all the other comparisons gave a mean C 3 of 80%, with 6–7% confusion between α-helices and coils and 12–13% between β-strands and coils.

Table 1 Agreement index C 3 between secondary structure assignment methods. The index values measure the proportion of residues assigned to the same state by two methods.

In addition, DEFINE was the only method to confuse α-helices and β-strands. This confusion ranged from 2% to 5% between DEFINE and the other methods, while for all other comparisons, it was less than 0.05%. These results did not change when β-strands were described by 'E' (extended-strand participating in a β-ladder) and 'B' (residue in isolated β-bridge) [9] labels for DSSP and STRIDE rather than only 'E'.

These results show the difficulties related to defining an appropriate length for α-helices, β-strands and coils and locating their ends [26]. These inaccuracies in defining the repetitive structures have direct repercussions on the definition of loops. Figures 2 and 3 use the example of the ribosomal protein S15 from Bacillus stearothermophilus (PDB code 1A32; another example, proto-oncogene Mtcp-1, PDB code 1A1X, is given [see Additional file 1 and Additional file 2]) to show the multiple secondary structure assignments that can ensue. For this protein, 79% of the residues are assigned to the same state, rather more than for many other proteins. Despite this good agreement, however, the repetitive structure caps remain quite confusing (cf. Figure 3). For instance, the C-cap of the first helix is defined over three residues, depending on the assignment method (positions 13 to 15). The connecting zone between helices 2 and 3 is fuzzy. DSSP and STRIDE assign positions 44–48, PSEA 45–50 and PCURVE 45–47 as coils, whereas DEFINE assigns positions 47–50 as a small β-strand. In this example, we see that the 16 Protein Blocks (PBs), labeled a-p, describe every part of the protein structures specifically. This description includes the repetitive structures, their edges, and the coils that the secondary structures define only as non-helicoidal and non-extended.

Figure 2

Comparison of methods for secondary structure assignment. Example of 5 assignments for the ribosomal protein S15 from Bacillus stearothermophilus (PDB code 1A32). The figure shows the amino acid sequence (AA) and the secondary structure assignments by (DSSP), (STRIDE), (PSEA), (DEFINE) and (PCURVE), with 'H' for the α-helix, 'E' for the β-strand and 'C' for the coil. (cons.) is a simple consensus with a star (*) if the five methods agree or a dot (.) if they do not. (PB) is the Protein Block assignment with ZZ for the extremities (not assigned, i.e., the PB is centered on the central residue).

Figure 3

Representation of the secondary structure assignments. Example of the ribosomal protein S15 from Bacillus stearothermophilus (PDB code 1A32) with (a) DSSP, (b) STRIDE, (c) PSEA, (d) DEFINE, (e) PCURVE and (f) Protein Blocks. In the last case, to simplify the representation, helices are associated with PB m and strands with PB d. The visualization is done with RASTER 3D [42, 43] and MOLSCRIPT [44]. The α-helices are in red, the β-strands in green and the coils in blue.

Each PB is a fragment five residues long that corresponds to a local fold and is defined by eight dihedral angles. PBs m and d correspond roughly to the core of α-helices and β-strands, respectively. In the example in Figure 2, several series of PB m accurately describe the helix cores. Where a secondary structure assignment method assigned a β-strand (PSEA, positions 14–17; DEFINE, positions 16–18 and 47–50), the PB assignment gave PB b, or PB c and e, all close to β-strand geometry. Thus, PBs may explain the ambiguity of the assignments. In this case, PBs b, c and e can take the variability of the β-strand into account. The structural alphabet was more structurally informative (16 states instead of 3 states) and better approximated the protein backbone. It is thus a relevant alternative for describing loops.

Describing short loops in terms of Protein Blocks

Loops are defined as protein fragments that connect two series of PBs m and/or d and contain no repetitive PBs. Short loops have a length of 2 to 6 PBs. The short loop databank contains 3,319 fragments: 644 for mm/mm, 801 for dd/dd, 989 for dd/mm and 886 for mm/dd. Table 2 summarizes the properties of the PBs in the overall databank as well as in the loop and short loop databanks. We focused on the frequency of occurrence of PBs in these regions and on the main transitions between successive PBs, since previous studies observed only a limited number of transitions [19, 23]. Table 2 points out the specificities of the transitions of some PBs in the short loops (for comparison, this information on the PBs in the complete databank is given [see Additional file 3]).

Table 2 Description of Protein Blocks in short loops. The analysis is carried out in the short loops regions, i.e. 2 to 6 residues between two successive mm and/or dd. Listed for each protein block (PB; labeled from a to p), are: the frequency of occurrence (frq) in the complete databank (DB), the loops databank (Loops), and the short loops databank (Short loops), the four main PB transitions and the distribution in the secondary structures (α-helix, coil and β-strand) of the central residue, as assigned by PSEA.

We observed that PBs k, l, n, o and p were relatively more specific to short loops. Their frequencies were 1% higher in short loops than in all loops. Inversely, the frequency of PB b dropped from 9.0% in all loops to only 3.7% in short loops. Moreover, it was slightly less frequent in the short loops than in the overall databank (4.4%). The frequencies of the other PBs were the same in loops and short loops.

The transition frequencies between successive PBs varied substantially between the complete databank and the short loops. We noted three main categories. (i) The principal transitions became more pronounced for most PBs (i.e., 11). For example, the transition from PB a to PB c increased by more than 20% (50.9% versus 71.8%), c to d more than 20%, e to h more than 10%, f to k (24%) and l to m (15%). For PBs h, i, k, n, o, and p, the increase was smaller, ranging from +2 to +10%. (ii) For two PBs, the first preference transitions were inverted. The second most common transition of PB g (PB c) in the databank took over first place for short loops, and its frequency climbed from 28.0% to 39.7%. PB j was the fuzziest PB (rmsd = 0.74 Å) and had a high number of "main" transitions (6 with a transition rate greater than 10%). In the short loops, its third most common transition, PB l, became first (and its rate went from 16.1% to 25.0%). (iii) PB b changed completely. Its main transition in the databank (PB d) dropped into fourth place. In fact, the transition from PB b to PB d was found mainly at the end of long loops going to β-strands. In short loops, PBs c, f and l were preferred. Hence, the rate of the second leading transition (PB c) increased from 17.9% to 40.7% and the third (PB l) from 15.7% to 25.7%.
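The transition statistics discussed above can be illustrated with a minimal Python sketch. It assumes that each protein (or loop) is available as a string of PB letters and that a transition is simply a pair of consecutive letters; the function names are hypothetical and the toy strings in the usage note are not taken from the databank.

```python
from collections import Counter, defaultdict


def transition_frequencies(pb_sequences):
    """Return {pb: {next_pb: frequency}} from an iterable of PB strings."""
    counts = defaultdict(Counter)
    for seq in pb_sequences:
        # Count every pair of consecutive PBs (assumption: no pair is skipped).
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    freqs = {}
    for pb, successors in counts.items():
        total = sum(successors.values())
        freqs[pb] = {nxt: n / total for nxt, n in successors.items()}
    return freqs


def main_transitions(freqs, pb, k=4):
    """Return the k most frequent successors of one PB, highest first."""
    ranked = sorted(freqs.get(pb, {}).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]
```

For instance, `main_transitions(transition_frequencies(["acdd", "acd", "afk"]), "a")` ranks PB c ahead of PB f as successor of PB a. Running the same function on the complete databank and on the short loop set would allow the two rankings to be compared.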

We analyzed the distribution of the classic secondary structures in our short loop definitions. The secondary structure assignments (with PSEA [7]) changed substantially from their distribution in the entire databank. The frequency of PBs a, c and e in β-strands increased by 5%, 2% and 9%, respectively, and the frequency of PBs k, l, n and o in α-helices by more than 12%. The frequency of PB b in coils increased from 85.4% to 95.8%. Other methods of secondary structure assignment yield similar results.

Analysis of the sequence-structure relationship

Figure 4 reports the amino acid occurrence matrices, normalized into Z-scores, and their asymmetric Kullback-Leibler index (KLd [27]) for two PBs, c and l, calculated from the complete databank and from the short loop set. The PBs are five residues in length (noted from -2 to +2 and centered in 0). We showed in a previous study [19] that prediction can be improved only by enlarging the sequence window to 15 residues (noted from -7 to +7 and still centered in 0). We therefore computed the occurrence matrices for fragments of 15 residues. Positive Z-scores (respectively negative) correspond to overrepresented (respectively underrepresented) amino acids and provide information for each amino acid at each position. The KLd analyzes the contrast between the amino acid distribution observed in a given position of the occurrence matrix and the reference amino acid distribution in the protein set. Hence, it measures the sequence information content and highlights the most informative positions.

Figure 4

Analysis of PBs c and l in short loops. The left part corresponds to PB c and the right to PB l. (a), (b), (e) and (f) are the amino acid Z-scores, with (blue): Z-score < (-4.4), (green): (-4.4) < Z-score < (-1.96), (yellow): (-1.96) < Z-score < 1.96, (orange): 1.96 < Z-score < 4.4 and (red): Z-score > 4.4. For prediction purpose, a five-residue PB (numbered from -2 to +2) is encompassed in a longer fragment of 15 amino acids in length (numbered from -7 to +7). (c), (d), (g) and (h) are the asymmetric Kullback-Leibler distributions. (a) and (c) correspond to PB c, (b) and (d) to PB l in the complete databank. (e) and (g) correspond to PB c and (f) and (h) to PB l in the short loops.

We observed two types of behaviors for the 14 PBs: for most (i.e., 11) the KLd values increased at every position while for the other three, decreased values compared with the entire databank were seen at some positions. PBs c and l are representative examples of these two cases. Globally, we observed some significant contrasts in the Z-score matrices, quantified by the higher KLd measure in some positions of the sequence window. PB c showed clear specificity: proline was overrepresented and glycine underrepresented in positions 1 and 2, both in the complete databank and in the short loop set (cf. Figures 4a and 4e). For PB c in the complete databank, the maximal KLd was 0.15 in position (-2), but in the other central positions (-1 to +2), corresponding to the informative sequence zone, KLds ranged from 0.04 to 0.05 (cf. Figure 4c). KLd levels were lower in the flanking regions. All PB c positions in the short loops had markedly increased specificities (cf. Figure 4g). The value of the maximal KLd increased from 0.15 to 0.23, and doubled for the other central positions, for a KLd range of 0.09 – 0.11.

PB l behaved distinctively. Its amino acid distributions in the short loop set differed from those in the entire databank (cf. Figures 4b and 4f). The informative region was restricted to only three positions (-2, -1, 0) with KLd values of 0.23, 0.13 and 0.13 respectively (cf. Figure 4d). In the short loops, position (+2) increased significantly, to 0.11 and became equivalent to position (-1). Position (0) lost specificity (-0.01), but position (-2) remained most specific, increasing to 0.03 (cf. Figure 4h).

Table 3 summarizes the 149 amino acid over- and underrepresentations observed in the short loop set, fewer than in the overall databank. This was due mainly to the number of occurrences, by definition lower in the short loops. Nevertheless, 20% of the significant amino acids had not previously been found. Nearly all PBs had at least one amino acid over- or underrepresented. As expected, in most cases, it was glycine (9 times), although 8 other types of amino acids were involved. We note two specific examples: (i) the overrepresentation of methionine in position (+1) of PB p (the only methionine overrepresented in all the short loops), and (ii) the underrepresentation of glycine in position (-2) of PB f, although it was overrepresented in the global distribution.

Table 3 Description of Protein Blocks in short loops. For each position (indexed from -4 to + 4) of the 16 protein blocks (PBs), the highest amino acid over-representations (Z-score > 4) and under-representations (Z-score < -4) are labeled by the symbols (+) and (-), respectively. The new over- and under-representations specific to the short loops are displayed in bold and italics respectively. For analysis purpose, a five-residue PB (numbered from -2 to +2) is encompassed in a longer fragment of length 9 (numbered from -4 to +4).

Predicting with PBs in the short loops

Table 4 summarizes the predictions. A training set corresponding to 2/3 of the dataset was used to learn the sequence-structure relationship for all predictions, and a test set corresponding to the remaining 1/3 to evaluate the results. We ran three different sets of predictions: the first two used occurrence matrices computed from the complete databank, and the third, matrices computed only from the short loop regions. We computed Q 16 and Q 14 ratios to analyze the quality of the predictions. Q 16 corresponds to the total number of true predicted PBs over the total number of predicted PBs. The Q 14 value is specific for loops, i.e., PB m and d are not taken into account.
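As an illustration, the two accuracy ratios can be sketched in Python. The sketch assumes that the true and predicted PBs are given as two strings of equal length, and that Q 14 is obtained by discarding the positions whose true PB is m or d; this last point is our reading of the definition, not a detail stated in the text. The function name is hypothetical.

```python
def q_score(true_pbs, predicted_pbs, ignore=""):
    """Fraction of correctly predicted PBs, skipping true PBs listed in `ignore`."""
    if len(true_pbs) != len(predicted_pbs):
        raise ValueError("sequences must have the same length")
    # Keep only the positions whose true PB is not ignored (assumption for Q14).
    kept = [(t, p) for t, p in zip(true_pbs, predicted_pbs) if t not in ignore]
    if not kept:
        raise ValueError("no position left to evaluate")
    return sum(t == p for t, p in kept) / len(kept)


true_pbs = "mmmnopacddd"       # toy example, not taken from the databank
predicted = "mmmnopbcddd"
q16 = q_score(true_pbs, predicted)               # all 16 PBs
q14 = q_score(true_pbs, predicted, ignore="md")  # repetitive PBs excluded
```

In this toy example one position out of eleven is wrong, so Q 16 is 10/11, while Q 14 is 4/5 because only five positions carry a non-repetitive PB.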

Table 4 Prediction results. Predictions are given for each PB (noted from a to p), together with the prediction rate for the 16 PBs (Q 16) and without the repetitive PBs (Q 14) in loops (i.e., PBs m and d). The first prediction (init) considered all the sequence positions. The second (short loops 1) was the same but only for the short loop regions, i.e., 2 to 6 residues between two mm and/or dd series. The third (short loops 2) included specific learning for the short loops.

The first prediction (init) is the conventional Bayesian prediction, run with all 16 PBs. It yielded a global prediction rate Q 16 equal to 35.2%. This value is close to that in our previous study (Q 16 = 34.4% [19]) and far superior to the value of 7.5% obtained with random assignment. The Q 14 value equals 36.0% for both the short and long loops. This computation shows that the non-repetitive PBs were predicted as accurately as the PBs m (39.3%) and d (27.7%). Prediction was thus not biased in favor of the most populated blocks.

The second prediction (short loops 1) used, as previously, the occurrence matrices computed from the complete databank, but focused only on predicting the short loop regions (cf. Short loop description). Hence, only 14 PBs were considered. The prediction rate Q 14 reached 41.2%, significantly greater than the random rate (8.0%). Prediction rates increased for most PBs, especially those associated with the α-helix ends: PB n (+10.8%), PB o (+8.7%), PB p (+9.2%). The increase in the prediction rates for the PBs associated with β-strand edges was slightly lower. Prediction rates fell for five PBs – approximately 1% for four (PBs e, g, h and i) and 9.6% for one, PB j.

The last prediction (referred to as short loops 2) used specific learning with the sequence-structure relationship in the short loops to define the occurrence matrices of the 14 PBs involved. The Q 14 value increased by 1.3% and yielded a better distribution between the PBs. Hence, only four PBs had a poorer prediction rate than with the initial prediction. They were all associated with coil-assigned structures. PBs g, h and i lost 1.9%, 1.7% and 1.4%, respectively. The prediction rate for PB j decreased dramatically, from 47.3% to 25.0%. Rates for the PBs associated with repetitive PB m, i.e., PBs n and p, returned to values slightly closer to those for the complete databank, with accuracy increasing by 4.4% for PB n and 7.5% for PB p. This prediction approach also favored the protein blocks associated with PB d: the prediction rate increased by 5.8% for PB b, 16.0% for PB c and 7.0% for PB e. Moreover, the prediction rate for PB f increased from 37.1% to 44.1%. Thus, the short loops 2 method improved the prediction of most PBs, but was limited by the poor performance of PB j.

Discussion

We have observed that the secondary structure assignment methods can produce highly discordant results. In most cases, only 80% of the residues are assigned to the same state. The capping regions of repetitive secondary structures are particularly mismatched. The difficulties of clearly describing repetitive regions have often been pointed out [28–30].

PBs allow a more precise description than do the secondary structures. In addition, they overlap. Accordingly, a small modification of PB assignment has fewer consequences than changing a secondary structure assignment; for example, a PB m is relatively similar to a PB n whereas an α-helix should be highly distinct from a coil. Analysis of series of PBs proves their structural relevance [23]. All these points justify the use of our structural alphabet to describe and analyze short loops. A recent approach has shown that most short loop fragments can be approximated correctly in the Protein DataBank [31].

The behavior of PB b in short loops differs from that in all loops: it appears to be a β-strand N-cap mainly involved in long loops. This point may partly explain its poor prediction rate in the short loops. Similarly, we observe that most of the rates of leading transitions are lower in the complete databank than in the short loops. This indicates that the less frequent transitions are associated with longer loops, i.e., fragments of more than 6 PBs.

Analysis of the sequence-structure relationship shows that most of the PBs in short loops have specific amino acid distributions that differ in many cases from the reference PB distribution. Nonetheless, as noted with PB l (see Figure 4), some positions lose amino acid specificity.

Because of the limited number of short loops in our non-redundant databank, we ran three different sets of predictions so that we could carefully observe the behavior of the PBs. (i) The global prediction shows that the loops were predicted as accurately as the repetitive structures (Q 16 = 35.2% and Q 14 = 36.0%), i.e., this method did not introduce artificial bias resulting in preferential prediction of repetitive regions. (ii) The sequence-structure relationship in the short loops was strongly deterministic and thus significantly improved the prediction (Q 14 = 41.2%). The use of the global occurrence matrices, however, induced an imbalance in the prediction of certain PBs: PBs associated with the repetitive PB m were favored over the other PBs, which are mainly associated with the coil state. (iii) Accordingly, a specific approach dedicated to the short loops yielded more accurate predictions, better balanced between the different PBs (Q 14 = 42.3%), with no particular bias.

PB j is the only PB whose results really suffer with this approach. It is the least frequent PB and the most variable. Consequently, its poor prediction rate may be explained by the lack of information about it in the databank. We also noted substantial over-fitting for this PB (a difference of more than 20% between the learning set and the validation set), markedly higher than for the other blocks.

One advantage of such an approach is that it enables us to compute the most significant series of PBs and from this information propose alternative 3D candidate structures. Figure 5 shows an example of short loop prediction with the PB probabilities associated with a given sequence window and the corresponding possible 3D structures.

Figure 5

Example of prediction: scoring PB combinations. This figure presents the prediction of positions 34 to 43 of cupredoxin amicyanin (Protein DataBank code [41]: 1BXA) with short loops prediction. The true 3D representation is given in (a) with the corresponding amino acid and PB sequences. The prediction gives a probability for every PB at each position. The score (and associated probability) at each position are reported in (b). Only scores more than 1 (superior to random) are indicated. The most probable series of PBs is therefore eojac. The comparison between ehiac from cupredoxin amicyanin and eojac taken from positions 53 to 62 of cyclophilin A (PDB code: 1AWU) gives a root mean square deviation (rmsd) equal to 2.2 Å. From (b) we can compute other high scoring PB series. Two of them are given: ehiac from monoclonal 2E8 FAB antibody (PDB code: 12E8, positions 10–19) and ehjac from Apo intestinal fatty acid-binding protein (PDB code: 1A57, positions 65–74) with associated rmsd of 0.3 Å and 2.5 Å respectively. We used MOLMOL [45] for the image visualizations.

Conclusions

Loop prediction, despite the considerable work devoted to it and the numerous methods developed, remains a difficult research topic [14, 32, 33]. Prediction methods are often used in comparative modeling and propose one "complete" loop [14, 33]. Here, instead of describing entire loops, we predict locally each position of the loops. This Bayesian approach can be used to propose not just one, but many different loops. Because each PB at each position is associated with a corresponding probability score, correlated in turn with the prediction accuracy [19, 23], a loop prediction approach could be extremely useful. It can help to probe and sample the flexibility of loops. It can be useful too in ab initio loop prediction [34, 35], recently shown to be important for some docking methods for protein-protein [36] and protein-DNA interaction [37].

Methods

Data sets

The main set of proteins (PAPIA), based on the PAPIA/PDB-REPRDB database [40], comprises 717 protein chains and 180,854 residues [41]. It has been used in previous work [23] and is available at http://www.ebgm.jussieu.fr/~debrevern. The set contains no more than 30% pairwise sequence identity. The selected chains have X-ray crystallographic resolutions better than 2.0 Å and an R-factor less than 0.2. Each structure selected has an rmsd value larger than 10 Å from all representative chains. Each chain was carefully examined with geometric criteria to avoid bias from zones with missing density. An updated databank has been built with the same criteria; it is composed of 1,403 proteins and 320,005 residues.

Protein Blocks

Protein Blocks (PBs) correspond to a set of 16 local prototypes, labeled from a to p (cf. Figure 1), each 5 residues in length and based on a Φ, Ψ dihedral angle description [19]. They were obtained by an unsupervised classifier similar to Kohonen Maps [38] and Hidden Markov Models [39]. The PBs m and d can be roughly described as prototypes for central α-helices and central β-strands, respectively. PBs a through c primarily represent β-strand N-caps and PBs e and f, C-caps; PBs g through j are specific to coils, PBs k and l to α-helix N-caps, and PBs n through p to C-caps. This structural alphabet allows a reasonable approximation of local protein 3D structures [19, 23] with a root mean square deviation (rmsd) now evaluated at 0.42 Å.

Short loop description

We defined the short loops as PB series 2 to 6 PBs long. These series must be composed of non-repetitive PBs, i.e., all PBs except d and m. They must have flanking regions composed of series of PBs mm and/or dd.
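This definition can be sketched with a regular expression in Python. The sketch assumes that a protein is encoded as a string of PB letters (a to p, with Z for unassigned extremities) and reports each loop with its start index and its flanking pair; the function name and the toy sequence are hypothetical.

```python
import re

# 2 to 6 non-repetitive PBs (every letter from a to p except d and m),
# immediately preceded and followed by "mm" or "dd".
SHORT_LOOP = re.compile(r"(?<=mm|dd)[a-ce-ln-p]{2,6}(?=mm|dd)")


def find_short_loops(pb_sequence):
    """Return (start, loop, 'left/right') for every short loop in a PB string."""
    loops = []
    for match in SHORT_LOOP.finditer(pb_sequence):
        start, end = match.span()
        left = pb_sequence[start - 2:start]   # flanking pair before the loop
        right = pb_sequence[end:end + 2]      # flanking pair after the loop
        loops.append((start, match.group(), f"{left}/{right}"))
    return loops


loops = find_short_loops("ZZmmmmnopacddddfklmmmmZZ")  # toy PB sequence
```

Here `loops` contains the fragment nopac of type mm/dd and the fragment fkl of type dd/mm. Runs of more than six non-repetitive PBs, or runs interrupted by an isolated d or m, are not reported.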

Secondary structure assignments

Secondary structures were assigned with five distinct programs: DSSP [3] (CMBI version 2000), DEFINE [5] (version 2.0), PCURVE [6] (version 3.1), STRIDE [4] and PSEA [7] (version 2.0). DSSP and STRIDE use more than three states, so we reduced them: the α-helix contains 'H', 'G' and 'I', the β-strand contains 'E' and the coil everything else. Default parameters were used for each.
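The reduction to three states can be sketched as follows in Python. The sketch assumes that an assignment is available as a string with one state letter per residue; the function name is hypothetical, and the optional `strand_codes` argument is only there to reproduce the variant in which 'B' is also counted as a β-strand.

```python
def reduce_to_three_states(assignment, strand_codes="E"):
    """Map a multi-state assignment string onto H (helix), E (strand), C (coil)."""
    reduced = []
    for state in assignment:
        if state in "HGI":            # alpha-, 3-10 and pi helices
            reduced.append("H")
        elif state in strand_codes:   # extended strand (optionally bridge 'B')
            reduced.append("E")
        else:                         # everything else is treated as coil
            reduced.append("C")
    return "".join(reduced)


three_state = reduce_to_three_states("HHGGITTSEEB -")
with_bridges = reduce_to_three_states("HHGGITTSEEB -", strand_codes="EB")
```

With the default setting the isolated bridge 'B' falls into the coil state, as in the reduction described above.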

Agreement rate

To compare two distinct secondary structure assignment methods, we used an agreement rate denoted C 3 and defined as the proportion of residues associated with the same state (α-helix, β-strand and coil).
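A minimal Python sketch of this agreement rate is given below. It assumes that the two assignments are already reduced to three states and aligned residue by residue; the function names are hypothetical, and the confusion breakdown is an illustrative addition rather than a procedure taken from the text.

```python
from collections import Counter


def agreement_rate(assign_a, assign_b):
    """C3: proportion of residues given the same state by two methods."""
    if len(assign_a) != len(assign_b) or not assign_a:
        raise ValueError("assignments must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(assign_a, assign_b))
    return same / len(assign_a)


def confusion_rates(assign_a, assign_b):
    """Proportion of residues for each unordered pair of differing states."""
    pairs = Counter(
        tuple(sorted((a, b))) for a, b in zip(assign_a, assign_b) if a != b
    )
    return {pair: n / len(assign_a) for pair, n in pairs.items()}


c3 = agreement_rate("HHHHCCEEEC", "HHHCCCEECC")        # toy assignments
confusions = confusion_rates("HHHHCCEEEC", "HHHCCCEECC")
```

In this toy case eight residues out of ten agree, one disagreement is a helix/coil confusion and one is a strand/coil confusion.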

Z-score

The amino acid occurrences for each PB were normalized into a Z-score:

$$Z(n_{i,j}) = \frac{n_{i,j}^{obs} - n_{i,j}^{th}}{\sqrt{n_{i,j}^{th}}}$$

with $n_{i,j}^{obs}$ the number of times amino acid i was observed in position j for a given PB and $n_{i,j}^{th}$ the number expected. The product of the number of observations in position j and the frequency of amino acid i in the entire databank equals $n_{i,j}^{th}$. Positive Z-scores (respectively negative) correspond to amino acids that are overrepresented (respectively underrepresented); threshold values of 4.42 and 1.96 were chosen (probability less than $10^{-5}$ and $5 \times 10^{-2}$ respectively).
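The normalization can be illustrated with a short Python sketch. It assumes that the Z-score takes the usual form (observed minus expected, divided by the square root of the expected count), with the expected count computed as described above; the function names, the class labels and the numbers in the usage lines are hypothetical.

```python
import math


def z_score(n_obs, n_position_total, background_freq):
    """Z-score of one amino acid at one position of a PB occurrence matrix."""
    # Expected count: observations at this position times databank frequency.
    n_expected = n_position_total * background_freq
    return (n_obs - n_expected) / math.sqrt(n_expected)


def classify_z(z, strong=4.42, weak=1.96):
    """Label a Z-score using the two thresholds quoted in the text."""
    if z > strong:
        return "strongly over"
    if z > weak:
        return "over"
    if z < -strong:
        return "strongly under"
    if z < -weak:
        return "under"
    return "neutral"


z_high = z_score(150, 1000, 0.1)   # 150 observed where 100 are expected
z_low = z_score(80, 1000, 0.1)     # 80 observed where 100 are expected
```

With these toy counts, the first amino acid would be flagged as strongly overrepresented and the second as underrepresented at the weaker threshold.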

Asymmetric Kullback-Leibler measure

The Kullback-Leibler measure or relative entropy [27], denoted by KLd, makes it possible to compute the contrast between two amino acid distributions, i.e., that observed in a given position j and the reference distribution in the protein set (DB). The relative entropy $KLd(j|PB_x)$ in the site j for the block $PB_x$ is expressed as:

$$KLd(j|PB_x) = \sum_{i=1}^{20} P(aa_j = i|PB_x) \ln \frac{P(aa_j = i|PB_x)}{P(aa_j = i|DB)}$$

where $P(aa_j = i|PB_x)$ is the probability of observing the amino acid i in position j (j = -w, ..., 0, ..., +w) of the sequence window, given protein block $PB_x$, and $P(aa_j = i|DB)$ the probability of observing the same amino acid in the databank (named DB).

Thus, it enables us to detect the "informative" positions in terms of amino acids for a given protein block [19].
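A minimal Python sketch of this measure follows. It assumes the standard relative entropy with natural logarithms (the base is an assumption on our part) and distributions stored as dictionaries keyed by amino acid; the function name is hypothetical and the two-letter alphabet in the usage lines is a toy stand-in for the 20 amino acids.

```python
import math


def kld(p_given_pb, p_databank):
    """Relative entropy between a position-specific and a reference distribution."""
    total = 0.0
    for aa, p in p_given_pb.items():
        if p == 0.0:
            continue                  # a zero-probability term contributes nothing
        q = p_databank.get(aa, 0.0)
        if q == 0.0:
            raise ValueError(f"amino acid {aa!r} is absent from the reference")
        total += p * math.log(p / q)
    return total


reference = {"A": 0.25, "G": 0.75}    # toy background distribution
position = {"A": 0.50, "G": 0.50}     # toy distribution at one position
contrast = kld(position, reference)
```

The measure is zero when the two distributions are identical and grows as the position becomes more specific. It is asymmetric, so exchanging the two arguments generally gives a different value.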

Prediction

In a strategy of structure prediction from sequence [19, 23], we must compute for a given sequence window $S_{aa} = \{aa_{-w}, \ldots, aa_0, \ldots, aa_{+w}\}$ the probability of observing a given protein block $PB_x$, i.e., $P(PB_x | S_{aa})$. For this purpose, each PB is associated with an occurrence matrix of dimension l × 20 centered upon the PB, with l = 2w + 1 (in the study, w = 7). Using the Bayes theorem to compute this a posteriori probability $P(PB_x | S_{aa})$ from the a priori probability $P(S_{aa} | PB_x)$ deduced from the occurrence matrix allows us to define the odds score $R_x$:

$$R_x = \frac{P(PB_x | S_{aa})}{P(PB_x)} = \frac{P(S_{aa} | PB_x)}{P(S_{aa})}$$

The highest score R x corresponds to the most probable PB [19, 23]. The Q 16 value computed is the total number of true predicted PBs over the total number of predicted PBs. We also computed a Q 14 value, specific for loops, i.e., the PB m and d are not taken into account in the accuracy rate computation.
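The scoring step can be sketched in Python under explicit assumptions: the positions of the window are treated as independent, so that P(S aa | PBx) is the product of the per-position frequencies of the occurrence matrix, and P(S aa) is approximated by the product of the background frequencies. Working with logarithms, a positive score then corresponds to an odds score greater than 1. The function names, the two-letter alphabet and the three-position matrices are hypothetical toy data; real matrices would have 15 positions and 20 amino acids, and would need pseudo-counts to avoid zero frequencies.

```python
import math


def log_odds_score(window, matrix, background):
    """Sum over positions of ln(P(aa_j | PBx) / P(aa_j | DB))."""
    if len(window) != len(matrix):
        raise ValueError("window and occurrence matrix must have the same length")
    return sum(
        math.log(matrix[j][aa] / background[aa]) for j, aa in enumerate(window)
    )


def predict_pb(window, matrices, background):
    """Return the PB with the highest score and the score of every PB."""
    scores = {
        pb: log_odds_score(window, matrix, background)
        for pb, matrix in matrices.items()
    }
    return max(scores, key=scores.get), scores


background = {"A": 0.5, "G": 0.5}
matrices = {
    "m": [{"A": 0.8, "G": 0.2}] * 3,   # toy matrix favoring A at each position
    "c": [{"A": 0.3, "G": 0.7}] * 3,   # toy matrix favoring G at each position
}
best_pb, scores = predict_pb("AAG", matrices, background)
```

Because the scores of all PBs are returned, the lower ranked PBs at each position remain available for building alternative PB series.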