
1 Introduction

A main factor underlying a protein's conformation is its amino acid sequence. A single nucleotide change, also called a Single Nucleotide Polymorphism (SNP), is a missense mutation when it substitutes one amino acid for another in the encoded protein, or a nonsense mutation when it introduces a premature stop codon, yielding a truncated, non-functional protein. The degree to which a SNP affects protein function is a key question, but its prediction remains an open problem.

Next-Generation Sequencing (NGS) technologies have made it possible to detect thousands of SNPs [1], but the wet-lab studies needed to associate these SNPs with phenotypic traits are costly. To narrow down the list of candidate SNPs, several bioinformatics tools, hereafter referred to as p-tools, have been developed to predict the impact of SNPs in silico. P-tools can draw on amino acid sequences, protein structure, context, functional parameters and evolutionary information [2]. For instance, in sequence conservation analysis, conserved amino acids, known to be relevant for protein function, are identified by alignment, and SNPs at these positions are flagged as likely deleterious. Structural information is also used to infer sites with a likely impact on protein function: SNPs in ligand-binding domains or active sites typically modify protein function. On this basis, p-tools can be built using either expert knowledge or machine learning techniques. P-tools not only vary in nature; their outputs also vary in syntax and semantics, which makes comparing them difficult. To tackle this problem, most strategies normalize predictions, forcing them into two classes, in order to evaluate classical performance metrics such as accuracy, sensitivity, specificity and ROC curves [4].

In this work, consistency across p-tools is evaluated by means of two proposed indices. For two given p-tools, the indices quantify systematic disagreement over pairs of SNPs, i.e., they count pairs of predictions ordered differently on each p-tool's scale, without performing any scale normalization. An experimental study was carried out using five widely used p-tools [3], selected for the diversity of the knowledge or learning methods they are based on, as well as the possibility of running them online with standard parameters. Consistency across p-tools was tested with SNPs from the model organism D. melanogaster, a common starting point for data analysis.

2 Materials and Methods

The method to evaluate consistency across p-tools has two stages. The first determines, for each p-tool i, the relative order of two SNP effect predictions \((m_1,m_2)\): \(m_1\) can be more damaging than \(m_2\), the opposite can hold, or the two mutations can be equally damaging. This preference relation is denoted as follows:

  • \(m_1 \prec _i m_2\) if p-tool i considers \(m_1\) to be less damaging than \(m_2\);

  • \(m_1 \sim _i m_2 \iff \lnot (m_1 \prec _i m_2) \wedge \lnot (m_2 \prec _i m_1)\), if p-tool i cannot assert \(m_1\) to be less or more damaging than \(m_2\).

To value the three possible orders for a pair \(m_1, m_2\), let \(r_i(m_1,m_2)\) be defined as follows:

$$r_i(m_1,m_2) = \begin{cases} 1 & \text{if } m_1 \prec_i m_2\\ -1 & \text{if } m_2 \prec_i m_1\\ 0 & \text{if } m_1 \sim_i m_2 \end{cases}$$
(1)
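For discrete outputs, Eq. 1 reduces to comparing label ranks. A minimal Python sketch, where the rank map and function name are illustrative rather than taken from any of the tools:

```python
# Eq. 1 for discrete p-tool outputs: labels are ordered through a rank map
# (a hypothetical three-label scale) and r is the sign of the rank difference.
RANK = {"benign": 0, "possibly deleterious": 1, "probably deleterious": 2}

def r_discrete(label1, label2):
    """Return 1 if m1 is less damaging than m2, -1 if more, 0 if tied."""
    d = RANK[label2] - RANK[label1]
    return (d > 0) - (d < 0)  # sign of the difference
```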

The second stage values the degree of (dis)agreement in the relative orders, over all pairs of mutations, between two p-tools i and j, through two indices (Eqs. 2 and 3): \(K_{all}\) is the ratio of mutation pairs ordered differently by the two p-tools to all mutation pairs, counting every kind of disagreement; \(K_{strong}\) is analogous but counts only opposite orderings.

$$K_\text{all} = \frac{|\{(m_1, m_2) \mid r_i(m_1, m_2) \ne r_j(m_1, m_2)\}|}{\binom{|M|}{2}}$$
(2)
$$K_\text{strong} = \frac{|\{(m_1, m_2) \mid r_i(m_1, m_2) \ne 0 \wedge r_i(m_1, m_2) = -r_j(m_1, m_2)\}|}{\binom{|M|}{2}}$$
(3)
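Both indices can be computed directly from raw numerical outputs, with no normalization. A sketch in Python, where the function names and score dictionaries are illustrative, and \(\delta_i\), \(\delta_j\) are the per-tool equality thresholds used for numerical outputs:

```python
from itertools import combinations

def r_numeric(t, m1, m2, delta):
    """Eq. 1 for numerical outputs t (dict: mutation -> score):
    1 if m1 is less damaging than m2, -1 if more, and 0 if the
    scores differ by at most the equality threshold delta."""
    d = t[m2] - t[m1]
    return 1 if d > delta else (-1 if d < -delta else 0)

def k_indices(ti, tj, delta_i, delta_j):
    """K_all and K_strong (Eqs. 2 and 3) for p-tools i and j over all
    unordered mutation pairs; ti and tj share the same SNP set M."""
    pairs = list(combinations(sorted(ti), 2))
    k_all = k_strong = 0
    for m1, m2 in pairs:
        ri = r_numeric(ti, m1, m2, delta_i)
        rj = r_numeric(tj, m1, m2, delta_j)
        if ri != rj:                  # any kind of disagreement
            k_all += 1
        if ri != 0 and ri == -rj:     # opposite orderings only
            k_strong += 1
    n = len(pairs)                    # = C(|M|, 2)
    return k_all / n, k_strong / n
```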

For discrete outputs, predictions are ordered according to their labels, e.g. {benign, possibly deleterious, probably deleterious} with the preference relation \(benign \prec possibly~deleterious \prec probably~deleterious\). For numerical outputs \(t_i\), an equality threshold \(\delta _i\) is introduced, such that the preference relation for p-tool i is defined as follows: \(m_1 \prec _i m_2 \iff t_i(m_2) - t_i(m_1) > \delta _i, \ \delta _i \ge 0\). Hence, \(m_1 \sim _i m_2 \iff |t_i(m_2) - t_i(m_1)| \le \delta _i\).
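Since each p-tool has its own output scale, a single threshold can be made comparable across tools by expressing \(\delta_i\) as a percentage of that tool's scale, as Fig. 1 does. A sketch using the scale bounds quoted in the Fig. 1 caption (the dictionary and function name are illustrative):

```python
# Per-tool equality threshold delta_i as a percentage of that tool's
# output scale; bounds follow the ranges quoted for Fig. 1.
SCALES = {
    "align-gvgd": (5, 215),
    "provean": (-15, 3),
    "polyphen2": (0, 1),
    "strum": (-10, 11),
    "cupsat": (-23, 20),
}

def delta_for(tool, percent):
    """Absolute threshold corresponding to `percent` of the tool's scale."""
    lo, hi = SCALES[tool]
    return (hi - lo) * percent / 100.0
```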

P-Tools: Selected p-tools have the following main features.

  • PolyPhen2 uses protein sequences in a trained Naïve Bayes model to predict whether a SNP affects the protein’s structure or function.

  • Provean is based on a model that evaluates evolutionary information from the protein sequence.

  • Align-GVGD combines biophysical characteristics of amino acids with protein multiple sequence alignments in an evolutionary conservation model.

  • Strum values changes in folding stability induced by SNPs, based on gradient boosting of Gibbs free-energy changes together with different sequence and structure properties.

  • Cupsat evaluates changes in protein stability induced by SNPs, based on structural information from the wild-type and mutant proteins.

Data: SNPs were analyzed on the vermilion gene, locus Dmel_CG2155, of D. melanogaster.

3 Results and Discussion

P-tools are compared pairwise with the \(K_{all}\) and \(K_{strong}\) indices and different equality thresholds, considering all possible SNPs at each sequence position of the vermilion gene (Fig. 1). Small index values indicate similar pairwise outputs. For all pairwise comparisons, as the threshold increases, \(K_\text {all}\) first increases and then decreases as outputs become similar (\(r_i(m_1, m_2) = 0\)); the PolyPhen2-Provean pair follows this behavior only after a threshold of 40% (data not shown). \(K_\text {strong}\), on the other hand, decreases monotonically.

Two cases are possible when comparing p-tool outputs \(t_i\), \(t_j\) on a pair of SNPs \(m_1\), \(m_2\): (1) \(t_i(m_1) < t_i(m_2)\) and \(t_j(m_1) < t_j(m_2)\), or (2) \(t_i(m_2) < t_i(m_1)\) and \(t_j(m_1) < t_j(m_2)\) (\(t_i(m_1) \simeq t_i(m_2)\) or \(t_j(m_1) \simeq t_j(m_2)\) are a middle step between these cases and are also analyzed). In case (1), \(K_{all}\) counts an agreement. When the threshold grows past \(\delta _i=|t_i(m_1) - t_i(m_2)|\), then \(t_i(m_1) \simeq t_i(m_2)\) while \(t_j(m_1) < t_j(m_2)\), so \(K_{all}\) counts a disagreement. When the threshold reaches \(\delta _j=|t_j(m_1) - t_j(m_2)|\), then \(t_j(m_1) \simeq t_j(m_2)\) and the two tools agree again. Here the error increases after \(\delta _i\) and decreases after \(\delta _j\); clearly, when \(\delta \) is 100%, all pairs are considered equal. In case (2), \(K_{all}\) counts a disagreement. After \(\delta _i\) is reached, \(t_i(m_1) \simeq t_i(m_2)\) while \(t_j(m_1) < t_j(m_2)\), so the tools still disagree; only after \(\delta _j\) is reached do they agree. Here the error decreases monotonically.

An analogous analysis holds for \(K_{strong}\). In case (1), since the two p-tools never give opposite results, \(K_{strong}\) counts an agreement regardless of the threshold. In case (2), once \(\delta _i\) is reached the outputs are no longer opposite and are therefore equal according to \(K_{strong}\). In both cases the error can only decrease as the threshold increases. While \(K_{all}\) decreases monotonically with \(\delta \) only in case (2), \(K_{strong}\) does so in both cases, making it less sensitive to the equality threshold.
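The monotone decrease of \(K_{strong}\) with the threshold can be checked empirically on synthetic scores (random data on a shared scale, not the actual p-tool outputs; all names are illustrative):

```python
from itertools import combinations
import random

def r(t, m1, m2, delta):
    # Sign of the score difference, with an equality threshold (Eq. 1).
    d = t[m2] - t[m1]
    return 1 if d > delta else (-1 if d < -delta else 0)

def k_strong(ti, tj, delta):
    # Eq. 3 with a single threshold shared by both tools on a [0, 1] scale.
    pairs = list(combinations(sorted(ti), 2))
    hits = sum(1 for m1, m2 in pairs
               if r(ti, m1, m2, delta) != 0
               and r(ti, m1, m2, delta) == -r(tj, m1, m2, delta))
    return hits / len(pairs)

random.seed(0)
muts = range(30)
ti = {m: random.random() for m in muts}  # synthetic scores on [0, 1]
tj = {m: random.random() for m in muts}

# A pair counted as an opposite ordering at a larger threshold is also
# counted at any smaller one, so K_strong can never increase with delta.
ks = [k_strong(ti, tj, d) for d in (0.0, 0.1, 0.2, 0.3, 0.4)]
assert all(a >= b for a, b in zip(ks, ks[1:]))
```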

Fig. 1. Pairwise p-tool comparison: \(K_{all}\) (left) and \(K_{strong}\) (right), with \(\delta \) from 0% to 33%. Output scales: Align-GVGD [5, 215]; Provean \([-15, 3]\); PolyPhen-2 [0, 1]; Strum \([-10, 11]\); Cupsat \([-23, 20]\). D. melanogaster gene vermilion.

Note that even for \(\delta \sim 10\%\), the pairwise \(K_{strong}\) comparison ranges from 0.05% up to 20% in the worst case. The two tools that agree the most across all \(\delta \) are Strum and Cupsat, which makes sense since both work with similar knowledge, namely protein energy functions. Conversely, the two tools that disagree the most are PolyPhen2 and Align-GVGD, which also makes sense since one is based on structure and the other on evolutionary information.

4 Conclusions

Two indices were proposed to compare p-tools using their most informative output, without requiring any normalization process. The comparison on the D. melanogaster gene vermilion shows that predictions vary widely across p-tools. Still, differing outputs are not necessarily a problem, since they can be integrated to achieve a more accurate prediction of SNP effects.