Abstract
We revisit the problem of indexing a string S[1..n] to support finding all substrings in S that match a given pattern P[1..m] with at most k errors. Previous solutions either require an index of size exponential in k or need Ω(m k) time for searching. Motivated by the indexing of DNA, we investigate space efficient indexes that occupy only O(n) space. For k=1, we give an index to support matching in O(m+occ+log nlog log n) time. The previously best solution achieving this time complexity requires an index of O(nlog n) space. This new index can also be used to improve existing indexes for k≥2 errors. Among others, it can support 2-error matching in O(mlog nlog log n+occ) time, and k-error matching, for any k>2, in O(m k−1log nlog log n+occ) time.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Indexing and dictionary matching with one error. In: Proceedings of Workshop on Algorithms and Data Structures, pp. 181–192 (1999)
Amir, A., Landau, G., Lewenstein, M., Sokol, D.: Dynamic pattern, static text matching. In: Proceedings of Workshop on Algorithms and Data Structures, pp. 340–352 (2003)
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Proceedings of Theoretical Informatics, 4th Latin American Symposium, pp. 88–94 (2000)
Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.R.: Range searching over tree cross products. In: Proceedings of European Symposium on Algorithms, pp. 120–131 (2000)
Chan, H.L., Lam, T.W., Sung, W.K., Tam, S.L., Wong, S.S.: A linear-size index for approximate pattern matching. In: Proceedings of Combinatorial Pattern Matching, pp. 49–59 (2006)
Chavez, E., Navarro, G.: A metric index for approximate string matching. In: Proceedings of Latin American Theoretical Informatics, pp. 181–195 (2002)
Cobbs, A.: Fast approximate matching using suffix trees. In: Proceedings of Symposium on Combinatorial Pattern Matching, pp. 41–54 (1995)
Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of Symposium on Theory of Computing, pp. 91–100 (2004)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of Symposium on Foundations of Computer Science, pp. 390–398 (2000)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proceedings of Symposium on Theory of Computing, pp. 397–406 (2000)
Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)
Huynh, T.N.D., Hon, W.K., Lam, T.W., Sung, W.K.: Approximate string matching using compressed suffix arrays. In: Proceedings of Symposium on Combinatorial Pattern Matching, pp. 434–444 (2004)
Lam, T.W., Sung, W.K., Wong, S.S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51, 298–314 (2008)
Maaß, M.G., Nowak, J.: Text indexing with errors. Technical Report TUM-10503, Fakultät für Informatik, TU München (Mar. 2005)
Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Munro, J.I.: Tables. In: Proceedings of Conference on Foundations of Software Technology and Computer Science, pp. 37–42 (1996)
Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. J. Comput. 31(3), 762–776
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of 13th Symposium on Discrete Algorithms, pp. 657–666 (2002)
Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. J. Discrete Algorithms 1(1), 205–209 (2000). Special issue on Matching Patterns
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242 (2002)
Sadakane, K.: Succinct representations of lcp information and improvements in the compressed suffix arrays. In: Proceedings of ACM-SIAM Symposium on Discrete Algorithms, pp. 225–232 (2002)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Θ(n). Inf. Process. Lett. 17(2), 81–84 (1983)
Author information
Authors and Affiliations
Corresponding author
Additional information
Result in this paper has appeared in a preliminary form in the Proceedings of the 14th Annual European Symposium (ESA 2006).
Current address of H.-L. Chan: Department of Computer Science, University of Pittsburgh, USA.
The research T.-W. Lam was supported by the Hong Kong RGC Grant HKU 7140/06E.
Rights and permissions
About this article
Cite this article
Chan, HL., Lam, TW., Sung, WK. et al. Compressed Indexes for Approximate String Matching. Algorithmica 58, 263–281 (2010). https://doi.org/10.1007/s00453-008-9263-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-008-9263-2