Accelerating sequence searching: dimensionality reduction method

Song, Guojie; Cui, Bin; Zheng, Baihua; Xie, Kunqing; Yang, Dongqing

doi:10.1007/s10115-008-0180-0

Accelerating sequence searching: dimensionality reduction method

Regular Paper
Published: 12 December 2008

Volume 20, pages 301–322, (2009)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Knowledge and Information Systems Aims and scope Submit manuscript

Accelerating sequence searching: dimensionality reduction method

Download PDF

Guojie Song¹,
Bin Cui²,
Baihua Zheng³,
Kunqing Xie¹ &
…
Dongqing Yang²

102 Accesses
9 Citations
Explore all metrics

Abstract

Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1): 15–18
Article Google Scholar
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proc. 24th VLDB Conference (VLDB’97), pp 194–205
Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. MIT Press, Cambridge
MATH Google Scholar
Cormode G, Muthukrishnan S (2002) The string edit distance matching problem with moves. In: Proc. of the 13th annual ACM-SIAM symposium on Discrete Algorithms, pp 667–676
DanTong Yu, Zhang A (2003) ClusterTree: integration of cluster representation and nearest-neighbor search for large data sets with high dimensions. IEEE Trans Knowl Data Eng 15(5): 1316–1337
Article Google Scholar
Faloutsos C, Lin KI (1995) Fast Map: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. In: Proc. of the International Conference on Management of Data (SIGMOD’95), pp 163–174
Filho RFS, Traina AJM, Traina C, Faloutsos C (2001) Similarity search without tears: the OMNI family of all-purpose access methods. In: Proc. of the 19th International Conference on Data Engineering (ICDE’01), pp 623–630
Golub GH, Van Loan CF (1989) Matrix computations. The Johns Hopkins University Press, Baltimore
MATH Google Scholar
Gravano L, Ipeirotis PG, Jagadish HV, Tolga Bozkaya N, Ozsoyoglu M (1997) Distance-based indexing for high-dimensional metric spaces. In: Proc. of the International Conference on Management of Data (SIGMOD’97), pp 357–368
Gusfield D (1997) Algorithms on sequences, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge
Google Scholar
Jolliffe IT (1986) Principal component analysis. Springer, Berlin
Google Scholar
Kadiyala S, Shiri N (2007) A compact multi-resolution index for variable length queries in time series databases. Knowl Info Syst 15: 131–147
Article Google Scholar
Karp RM, Miller RE, Rosenberg AL (1972) Rapid identification of repeated patterns in strings, trees and arrays. In: Proc. of the 4th Symposium on Theory of Computing, pp 125–136
Keogh E, Ann RC (2005) Exact indexing of dynamic time warping. Knowl Info Syst 7(3): 358–386
Article Google Scholar
Keogh E, Chakrabarti K, Mehrotra S, Pazzani M (2001) Dimensionality reduction for fast similarity search in large databases. Knowl Info Syst 3(3): 263–286
Article MATH Google Scholar
Lee S, Chun S, Kim D (2000) Similarity search for multidimensional data sequences. In: Proc. of the 16th Int’l Conf. on Data Engineering. IEEE Computer Society, Washington, pp 599–608
Li M, Badger JH, Xin C, Kwong S, Kearney P, Ferragina HP, Grossi R (1999) The string B-tree: a new data structure for string search in external memory and its applications. JACM 46(2): 236–280
Article Google Scholar
McCreight EM (1976) A space-economical suffix tree construction algorithm. JACM 23(2): 262–272
Article MATH MathSciNet Google Scholar
Muthukrishnan S, Ahinalp SCS (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of the 32nd Symposium on Theory of Computing, pp 416–424
Navarro G (2000) A guided tour to approximate sequence matching. ACM Comput Survey 33(1): 31–88
Article Google Scholar
Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discret Algorithms 1(1): 205–239
MathSciNet Google Scholar
Pearson W, Lipman D (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85: 2444–2488
Article Google Scholar
Popivanov I, Miller RJ (2002) Similarity search over time series data using wavelets. In: Proc. of the 18th Int’l Conf. on Data Engineering, ICDE 2002. IEEE Computer Society, San Jose, pp 212–221
Sahinalp SC, Tasan M, Macker J, Ozsoyoglu ZM (2003) Distance-based indexing for string proximity search. In: Proc. of the 19th International Conference on Data Engineering (ICDE’03), pp 125–136
Song G, Cui B, Zheng B, Xie K, Yang D (2008) Squeezing long sequence data for efficient similarity search. In: Proc. Apweb, 2008, pp 438–449
Traina C, Traina AJM, Seeger B, Faloutsos C (2000) Slim-Trees: high performance metric trees minimizing overlap between nodes. In: proc. of the 7th International Conference on Extended Database Technology (EDBT’00), pp 51–65
Venkateswaran J, Lachwani D, Kahveci T, Jermaine CM (2006) Reference-based indexing of sequence databases. In: Proc. 24th VLDB Conference (VLDB’06), pp 906–917
Vieira MR, Traina C, Chino FJT, Traina AJM (2004) DBM-Tree: a dynamic metric access method sensitive to local density data. Simposio Brasileiro de Bancos de Dados (SBBD’04), pp 163–177
Wang TL, Wang X, Lin KI, Shasha D, Shapiro B, Zhang K (1999) Evaluating a class of distance-mapping algorithms for data mining and clustering. In: Proc. of the 5th ACM International Conference of Knowledge Discovery and Data Mining (SIGKDD’99), pp 307–311
Weiner P (1973) Linear pattern matching algorithms. In: IEEE Symposium on Switching and Automata Theory, pp 1–11
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp 311–321
Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203–214
Article Google Scholar
http://www-db.stanford.edu/pleiades/SUMATRA.html

Download references

Author information

Authors and Affiliations

Key Laboratory of Machine Perception, Peking University, Ministry of Education, Beijing, China
Guojie Song & Kunqing Xie
Department of Computer Science, Peking University, Beijing, China
Bin Cui & Dongqing Yang
School of Information System, Singapore Management University, Singapore, Singapore
Baihua Zheng

Authors

Guojie Song
View author publications
You can also search for this author in PubMed Google Scholar
Bin Cui
View author publications
You can also search for this author in PubMed Google Scholar
Baihua Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Kunqing Xie
View author publications
You can also search for this author in PubMed Google Scholar
Dongqing Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Guojie Song.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Song, G., Cui, B., Zheng, B. et al. Accelerating sequence searching: dimensionality reduction method. Knowl Inf Syst 20, 301–322 (2009). https://doi.org/10.1007/s10115-008-0180-0

Download citation

Received: 23 December 2007
Revised: 14 May 2008
Accepted: 27 September 2008
Published: 12 December 2008
Issue Date: September 2009
DOI: https://doi.org/10.1007/s10115-008-0180-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Accelerating sequence searching: dimensionality reduction method

Abstract

Article PDF

Similar content being viewed by others

Sequence graph transform (SGT): a feature embedding function for sequence data mining

Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models

DRESS: dimensionality reduction for efficient sequence search

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Accelerating sequence searching: dimensionality reduction method

Abstract

Article PDF

Similar content being viewed by others

Sequence graph transform (SGT): a feature embedding function for sequence data mining

Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models

DRESS: dimensionality reduction for efficient sequence search

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation