Abstract
Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1): 15–18
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proc. 24th VLDB Conference (VLDB’97), pp 194–205
Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. MIT Press, Cambridge
Cormode G, Muthukrishnan S (2002) The string edit distance matching problem with moves. In: Proc. of the 13th annual ACM-SIAM symposium on Discrete Algorithms, pp 667–676
DanTong Yu, Zhang A (2003) ClusterTree: integration of cluster representation and nearest-neighbor search for large data sets with high dimensions. IEEE Trans Knowl Data Eng 15(5): 1316–1337
Faloutsos C, Lin KI (1995) Fast Map: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. In: Proc. of the International Conference on Management of Data (SIGMOD’95), pp 163–174
Filho RFS, Traina AJM, Traina C, Faloutsos C (2001) Similarity search without tears: the OMNI family of all-purpose access methods. In: Proc. of the 19th International Conference on Data Engineering (ICDE’01), pp 623–630
Golub GH, Van Loan CF (1989) Matrix computations. The Johns Hopkins University Press, Baltimore
Gravano L, Ipeirotis PG, Jagadish HV, Tolga Bozkaya N, Ozsoyoglu M (1997) Distance-based indexing for high-dimensional metric spaces. In: Proc. of the International Conference on Management of Data (SIGMOD’97), pp 357–368
Gusfield D (1997) Algorithms on sequences, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge
Jolliffe IT (1986) Principal component analysis. Springer, Berlin
Kadiyala S, Shiri N (2007) A compact multi-resolution index for variable length queries in time series databases. Knowl Info Syst 15: 131–147
Karp RM, Miller RE, Rosenberg AL (1972) Rapid identification of repeated patterns in strings, trees and arrays. In: Proc. of the 4th Symposium on Theory of Computing, pp 125–136
Keogh E, Ann RC (2005) Exact indexing of dynamic time warping. Knowl Info Syst 7(3): 358–386
Keogh E, Chakrabarti K, Mehrotra S, Pazzani M (2001) Dimensionality reduction for fast similarity search in large databases. Knowl Info Syst 3(3): 263–286
Lee S, Chun S, Kim D (2000) Similarity search for multidimensional data sequences. In: Proc. of the 16th Int’l Conf. on Data Engineering. IEEE Computer Society, Washington, pp 599–608
Li M, Badger JH, Xin C, Kwong S, Kearney P, Ferragina HP, Grossi R (1999) The string B-tree: a new data structure for string search in external memory and its applications. JACM 46(2): 236–280
McCreight EM (1976) A space-economical suffix tree construction algorithm. JACM 23(2): 262–272
Muthukrishnan S, Ahinalp SCS (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of the 32nd Symposium on Theory of Computing, pp 416–424
Navarro G (2000) A guided tour to approximate sequence matching. ACM Comput Survey 33(1): 31–88
Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discret Algorithms 1(1): 205–239
Pearson W, Lipman D (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85: 2444–2488
Popivanov I, Miller RJ (2002) Similarity search over time series data using wavelets. In: Proc. of the 18th Int’l Conf. on Data Engineering, ICDE 2002. IEEE Computer Society, San Jose, pp 212–221
Sahinalp SC, Tasan M, Macker J, Ozsoyoglu ZM (2003) Distance-based indexing for string proximity search. In: Proc. of the 19th International Conference on Data Engineering (ICDE’03), pp 125–136
Song G, Cui B, Zheng B, Xie K, Yang D (2008) Squeezing long sequence data for efficient similarity search. In: Proc. Apweb, 2008, pp 438–449
Traina C, Traina AJM, Seeger B, Faloutsos C (2000) Slim-Trees: high performance metric trees minimizing overlap between nodes. In: proc. of the 7th International Conference on Extended Database Technology (EDBT’00), pp 51–65
Venkateswaran J, Lachwani D, Kahveci T, Jermaine CM (2006) Reference-based indexing of sequence databases. In: Proc. 24th VLDB Conference (VLDB’06), pp 906–917
Vieira MR, Traina C, Chino FJT, Traina AJM (2004) DBM-Tree: a dynamic metric access method sensitive to local density data. Simposio Brasileiro de Bancos de Dados (SBBD’04), pp 163–177
Wang TL, Wang X, Lin KI, Shasha D, Shapiro B, Zhang K (1999) Evaluating a class of distance-mapping algorithms for data mining and clustering. In: Proc. of the 5th ACM International Conference of Knowledge Discovery and Data Mining (SIGKDD’99), pp 307–311
Weiner P (1973) Linear pattern matching algorithms. In: IEEE Symposium on Switching and Automata Theory, pp 1–11
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp 311–321
Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203–214
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Song, G., Cui, B., Zheng, B. et al. Accelerating sequence searching: dimensionality reduction method. Knowl Inf Syst 20, 301–322 (2009). https://doi.org/10.1007/s10115-008-0180-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-008-0180-0