Abstract
Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Abrahamson K (1987) Generalized string matching. SIAM J Comput 16: 1039–1051
Altschul FS, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Amir A, Lewenstein M, Porat E (2000) Faster algorithms for string matching with k mismatches. In: Symposium on discrete algorithms, pp 794–803
Assent I, Krieger R, Glavic B, Seidl T (2008) Clustering multidimensional sequences in spatial and temporal databases. Knowl Inf Syst 16: 29–51
Brown P, Botstein D (2000) Exploring the new world of the genome with DNA microarrays. Nat Genet 21: 33–37
Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. Intell Inf Syst 3: 231–262
Feigenbaum J, Kannan S, Strauss M, Viswanathan M (1999) An approximate L1-difference algorithm for massive data streams. In: Proceedings of FOCS99
Fleischmann RD, Adams MD, White O, Clayton RA (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 28: 496–512
Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 31: 1441–1442
Kawahara D, Kurohashi S (2006) Case frame compilation from the web using high-performance computing. In: Proceedings of the 5th international conference on language resources and evaluation (LREC2006), pp 1344–1347
Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12: 25–53
Liu J, Goss S, Murray G (1994) Similarity comparison and analysis of sequential data. In: IEEE international conference on expert systems for development, pp 138–143
Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22: 935–948
Muthukrishnan S, Sahinalp SC (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of 32nd annual ACM symposium on theory of computing, pp 416–424
Muthukrishnan S, Sahinalp SC (2002) Simple and practical sequence nearest neighbors under block edit operations. In: Proceedings of CPM2002
Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132: 185–219
Popendorf K, Osana Y, Hachiya T, Sakakibara Y (2007) Murasaki-homology detection across multiple large-scale genomes. In: Fifth annual RECOMB satellite workshop on comparative genomics (San Diego, USA, 2007)
Shivakumar N, Garcia-Molina H (1996) Building a scalable and accurate copy detection mechanism. In: International conference on digital libraries, proceedings of the first ACM international conference on digital libraries, pp 160–168
Yamada S, Gotoh O, Yamana H (2006) Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinform 7: 524
Yamada T, Morishita S (2003) Computing highly specific and mismatch tolerant oligomers efficiently. Bioinformatics Conference 2003
Yamada T, Morishita S (2005) Accelerated off-target search algorithm for siRNA. Bioinformatics 21: 1316–1324
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Uno, T. Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowl Inf Syst 25, 229–251 (2010). https://doi.org/10.1007/s10115-009-0271-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-009-0271-6