Abstract
Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the “database”A is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N c for somec<1. The sequenceA must be overa. finite alphabet σ. More precisely, our algorithm requires 0(DNpow(ɛ) logN) expected-time where ɛ=D/P is the maximum number of differences as a percentage of query length, and pow(ɛ) is an increasing and concave function that is 0 when ɛ=0. Thus the algorithm is superior to current O(DN) algorithms when ɛ is small enough to guarantee that pow(ɛ) < 1. As seen in the paper, this is true for a wide range of ɛ, e.g., ɛ. up to 33% for DNA sequences (¦⌆¦=4) and 56% for proteins sequences (¦⌆¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman, A basic local alignment search tool,J. Molecular Biol. 215 (1990), 403–410.
Boyer, R., and J. Moore, A fast string searching algorithm,Comm. ACM 20(10) (1977), 262–272.
Chang, W. I., and E. L. Lawler, Approximate matching in sublinear expected time,Proc. 31st IEEE Symp. on Foundation of Computer Science, 1990, pp. 116–124.
Galil, Z., and K. Park, An improved algorithm for approximate string matching,SIAM J. Comput. 19(6) (1990), 989–999.
Knuth, D. E., J. H. Morris, and V. R. Pratt, Fast pattern matching in strings,SIAM J. Comput. 6(2) (1977), 323–350.
Landau, G. M., and U. Vishkin, Introducing efficient parallelism into approximate string matching and a new serial algorithm,Proc. Symp. on Theory of Computing, 1986, pp. 220–230.
Myers, E. W., Incremental alignment algorithms and their applications, Technical Report 86-22, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1986.
Myers, E. W., AnO(ND) difference algorithm and its variants,Algorithmica 1 (1986), 251–266.
Myers, E. W., and D. Mount, Computer program for the IBM personal computer that searches for approximate matches to short oligonucleotide sequences in long target DNA sequences,Nucleic Acids Res. 14(1) (1986), 1025–1041.
Sellers, P. H., The theory and computation of evolutionary distances: Pattern recognition,J. Algorithms 1 (1980), 359–373.
Ukkonen, E., Finding approximate patterns in strings,J. Algorithms 6 (1985), 132–137.
Ukkonen, E., Algorithms for approximate string matching,Inform, and Control 64 (1985), 100–118.
Wu, S., U. Manber, and E. W. Myers, A Subquadratic Algorithm for Approximate Limited Expression Matching, Technical Report TR92-36, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1992 (submitted toAlgorithmica).
Author information
Authors and Affiliations
Additional information
Communicated by Alberto Apostolico
This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics.
Rights and permissions
About this article
Cite this article
Myers, E.W. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994). https://doi.org/10.1007/BF01185432
Received:
Revised:
Issue Date:
DOI: https://doi.org/10.1007/BF01185432