A sublinear algorithm for approximate keyword searching

Myers, E. W.

doi:10.1007/BF01185432

A sublinear algorithm for approximate keyword searching

Published: November 1994

Volume 12, pages 345–374, (1994)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Algorithmica Aims and scope Submit manuscript

A sublinear algorithm for approximate keyword searching

Download PDF

E. W. Myers¹

380 Accesses
102 Citations
Explore all metrics

Abstract

Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the “database”A is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N ^c for somec<1. The sequenceA must be overa. finite alphabet σ. More precisely, our algorithm requires 0(DN^pow(ɛ) logN) expected-time where ɛ=D/P is the maximum number of differences as a percentage of query length, and pow(ɛ) is an increasing and concave function that is 0 when ɛ=0. Thus the algorithm is superior to current O(DN) algorithms when ɛ is small enough to guarantee that pow(ɛ) < 1. As seen in the paper, this is true for a wide range of ɛ, e.g., ɛ. up to 33% for DNA sequences (¦⌆¦=4) and 56% for proteins sequences (¦⌆¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.

Article PDF

libFLASM: a software library for fixed-length approximate string matching

Article Open access 10 November 2016

Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Graphs Cannot Be Indexed in Polynomial Time for Sub-quadratic Time String Matching, Unless SETH Fails

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman, A basic local alignment search tool,J. Molecular Biol. 215 (1990), 403–410.
Google Scholar
Boyer, R., and J. Moore, A fast string searching algorithm,Comm. ACM 20(10) (1977), 262–272.
Article Google Scholar
Chang, W. I., and E. L. Lawler, Approximate matching in sublinear expected time,Proc. 31st IEEE Symp. on Foundation of Computer Science, 1990, pp. 116–124.
Galil, Z., and K. Park, An improved algorithm for approximate string matching,SIAM J. Comput. 19(6) (1990), 989–999.
Article MATH MathSciNet Google Scholar
Knuth, D. E., J. H. Morris, and V. R. Pratt, Fast pattern matching in strings,SIAM J. Comput. 6(2) (1977), 323–350.
Article MATH MathSciNet Google Scholar
Landau, G. M., and U. Vishkin, Introducing efficient parallelism into approximate string matching and a new serial algorithm,Proc. Symp. on Theory of Computing, 1986, pp. 220–230.
Myers, E. W., Incremental alignment algorithms and their applications, Technical Report 86-22, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1986.
Google Scholar
Myers, E. W., AnO(ND) difference algorithm and its variants,Algorithmica 1 (1986), 251–266.
Article MATH MathSciNet Google Scholar
Myers, E. W., and D. Mount, Computer program for the IBM personal computer that searches for approximate matches to short oligonucleotide sequences in long target DNA sequences,Nucleic Acids Res. 14(1) (1986), 1025–1041.
Article Google Scholar
Sellers, P. H., The theory and computation of evolutionary distances: Pattern recognition,J. Algorithms 1 (1980), 359–373.
Article MATH MathSciNet Google Scholar
Ukkonen, E., Finding approximate patterns in strings,J. Algorithms 6 (1985), 132–137.
Article MATH MathSciNet Google Scholar
Ukkonen, E., Algorithms for approximate string matching,Inform, and Control 64 (1985), 100–118.
Article MATH MathSciNet Google Scholar
Wu, S., U. Manber, and E. W. Myers, A Subquadratic Algorithm for Approximate Limited Expression Matching, Technical Report TR92-36, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1992 (submitted toAlgorithmica).
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Arizona, 85721, TucsonAZ, USA
E. W. Myers

Authors

E. W. Myers
View author publications
You can also search for this author in PubMed Google Scholar

Additional information

Communicated by Alberto Apostolico

This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Myers, E.W. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994). https://doi.org/10.1007/BF01185432

Download citation

Received: 25 September 1991
Revised: 16 August 1992
Issue Date: November 1994
DOI: https://doi.org/10.1007/BF01185432

Key words

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A sublinear algorithm for approximate keyword searching

Abstract

Article PDF

Similar content being viewed by others

libFLASM: a software library for fixed-length approximate string matching

Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Graphs Cannot Be Indexed in Polynomial Time for Sub-quadratic Time String Matching, Unless SETH Fails

References

Author information

Authors and Affiliations

Additional information

Rights and permissions

About this article

Cite this article

Key words

Navigation

A sublinear algorithm for approximate keyword searching

Abstract

Article PDF

Similar content being viewed by others

libFLASM: a software library for fixed-length approximate string matching

Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Graphs Cannot Be Indexed in Polynomial Time for Sub-quadratic Time String Matching, Unless SETH Fails

References

Author information

Authors and Affiliations

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Key words

Search

Navigation