Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models

Raghavan, Vijay V.; Benton, Ryan G.; Johnsten, Tom; Xie, Ying

doi:10.1007/978-3-642-41218-9_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8170)

Included in the following conference series:

International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing

Abstract

Analyzing and classifying sequence data based on structural similarities and differences is a mathematical problem of escalating relevance. Indeed, a primary challenge in designing machine learning algorithms for analyzing sequence data is the extraction and representation of significant features. This paper introduces a generalized sequence feature extraction model, referred to as the Generalized Multi-Layered Vector Spaces (GMLVS) model. Unlike most models, which represent sequence data based on subsequence frequency, the GMLVS model represents a given sequence as a collection of features, where each individual feature captures the spatial relationships between two subsequences and can be mapped into a feature vector. The utility of this approach is demonstrated via two special cases of the GMLVS model, namely, Lossless Decomposition (LD) and Multi-Layered Vector Spaces (MLVS). Experimental evaluation shows that the GMLVS-inspired models generate feature vectors that, combined with basic machine learning techniques, achieve high classification performance.
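To make the contrast in the abstract concrete, the following Python sketch compares a frequency-based representation with a simple pairwise spatial one. It is an illustration only: the abstract does not give the GMLVS, LD, or MLVS feature definitions, so the function names (`kmer_counts`, `pair_gap_features`, `to_vector`), the `max_gap` parameter, and the choice of "gap between start positions" as the spatial relationship are all assumptions made here, not the authors' method.

```python
from collections import Counter, defaultdict
from itertools import product


def kmer_counts(seq, k=1):
    """Frequency-based representation: count each length-k subsequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pair_gap_features(seq, k=1, max_gap=3):
    """Hypothetical spatial features (an assumption, not the paper's definition).

    For every ordered pair of k-mers (u, v), count how often v starts
    exactly `gap` positions after u starts, for gap = 1..max_gap.
    """
    n = len(seq) - k + 1  # number of k-mer start positions
    feats = defaultdict(lambda: [0] * max_gap)
    for i in range(n):
        u = seq[i:i + k]
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j >= n:
                break
            feats[(u, seq[j:j + k])][gap - 1] += 1
    return dict(feats)


def to_vector(feats, alphabet, k=1, max_gap=3):
    """Flatten the pair features into a fixed-length vector.

    Pairs are ordered lexicographically over the alphabet; pairs that never
    occur contribute zeros, so all sequences map to the same dimension.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    vec = []
    for u in kmers:
        for v in kmers:
            vec.extend(feats.get((u, v), [0] * max_gap))
    return vec


if __name__ == "__main__":
    a, b = "ABAB", "AABB"
    print(kmer_counts(a) == kmer_counts(b))              # same symbol counts
    print(pair_gap_features(a) == pair_gap_features(b))  # different layout
    print(to_vector(pair_gap_features(a), "AB"))
```

In this toy setting, "ABAB" and "AABB" have identical symbol counts, so a purely frequency-based representation cannot separate them, while the pairwise gap features differ. The flattened vector is the kind of fixed-length input that a standard classifier could consume.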

Author information

Authors and Affiliations

Center for Advanced Computer Studies, University of Louisiana at Lafayette, Louisiana, USA
Vijay V. Raghavan & Ryan G. Benton
School of Computing, University of South Alabama, Alabama, USA
Tom Johnsten
Department of Computer Science, Kennesaw State University, Georgia, USA
Ying Xie

Editor information

Editors and Affiliations

University of Milano-Bicocca, viale Sarca 336/14, 20126, Milano, Italy
Davide Ciucci
Osaka University, 560-8531, Toyonaka, Osaka, Japan
Masahiro Inuiguchi
University of Regina, S4S 0A2, Regina, SK, Canada
Yiyu Yao
University of Warsaw, ul. Banacha, 2, 02-097, Warsaw, Poland
Dominik Ślęzak
Chongqing Institute of Green and Intelligent Technology, CAS, 401122, Chongqing, China
Guoyin Wang

Cite this paper

Raghavan, V.V., Benton, R.G., Johnsten, T., Xie, Y. (2013). Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models. In: Ciucci, D., Inuiguchi, M., Yao, Y., Ślęzak, D., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. RSFDGrC 2013. Lecture Notes in Computer Science, vol 8170. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41218-9_3

DOI: https://doi.org/10.1007/978-3-642-41218-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41217-2
Online ISBN: 978-3-642-41218-9
eBook Packages: Computer Science, Computer Science (R0)
