Features Selection and Extraction in Statistical Analysis of Proteomics Datasets

Lualdi, Marta; Fasano, Mauro

doi:10.1007/978-1-0716-1641-3_9

Part of the book series: Methods in Molecular Biology (MIMB, volume 2361)

Abstract

“Omics” techniques (e.g., proteomics, genomics, metabolomics), which now routinely produce huge datasets, require a different way of thinking about data analysis, one that can be summarized by the idea that, when there are enough data, they can speak for themselves. Indeed, managing huge amounts of data requires replacing the classical hypothesis-driven deductive approach with a data-driven, hypothesis-generating inductive approach, so as to generate mechanistic hypotheses from the data.

Data reduction is a crucial step in proteomics data analysis because significant features are sparse in big datasets. Feature selection/extraction methods are therefore applied to obtain a set of features from which a proteomics signature with functional significance (e.g., for classification, diagnosis, or prognosis) can be drawn. Although proteomics studies generate big data almost daily, a well-established statistical workflow for proteomics data analysis is still lacking, which leaves room for misleading or incorrect data analysis and interpretation. This chapter gives an overview of the methods available for feature selection/extraction in proteomics datasets and of how to choose the most appropriate one based on the type of dataset.
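
As a minimal illustration of what filter-type feature selection can look like, the sketch below ranks proteins by the absolute Welch t-statistic between two sample groups and keeps the top k. The function names (`welch_t`, `select_top_features`), the protein identifiers, and the intensity values are hypothetical and chosen only for illustration; they are not taken from the chapter, which may recommend different methods.

```python
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def select_top_features(intensities, labels, k):
    """Rank features by |t| between the two groups in `labels`; return the top k names.

    `intensities` maps a feature name to one value per sample, in the same
    order as `labels`. This is a hypothetical helper, not a published method.
    """
    groups = sorted(set(labels))
    if len(groups) != 2:
        raise ValueError("exactly two groups are required")
    scores = {}
    for name, values in intensities.items():
        a = [v for v, g in zip(values, labels) if g == groups[0]]
        b = [v for v, g in zip(values, labels) if g == groups[1]]
        scores[name] = abs(welch_t(a, b))
    # Highest absolute t-statistic first
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy log-intensities for three hypothetical proteins across eight samples
labels = ["control"] * 4 + ["disease"] * 4
intensities = {
    "P1": [10.1, 9.9, 10.0, 10.2, 12.1, 11.9, 12.0, 12.2],
    "P2": [8.0, 8.3, 7.9, 8.1, 8.1, 8.0, 8.2, 7.9],
    "P3": [5.0, 5.4, 4.8, 5.1, 5.6, 5.2, 5.5, 5.3],
}
selected = select_top_features(intensities, labels, k=1)
print(selected)
```

In a real study, ranking thousands of proteins this way would normally be paired with multiple-testing correction and validation on held-out samples; this toy example omits both.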

Author information

Authors and Affiliations

Department of Science and High Technology, Center of Bioinformatics, University of Insubria, Busto Arsizio, Italy
Marta Lualdi & Mauro Fasano

Corresponding author

Correspondence to Mauro Fasano.

Editor information

Editors and Affiliations

Department of Biotechnology, University of Verona, Verona, Italy
Daniela Cecconi

About this protocol

Cite this protocol

Lualdi, M., Fasano, M. (2021). Features Selection and Extraction in Statistical Analysis of Proteomics Datasets. In: Cecconi, D. (eds) Proteomics Data Analysis. Methods in Molecular Biology, vol 2361. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1641-3_9

DOI: https://doi.org/10.1007/978-1-0716-1641-3_9
Published: 09 July 2021
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1640-6
Online ISBN: 978-1-0716-1641-3
eBook Packages: Springer Protocols
