Abstract
We address the problem of extracting pairs of subwords (m 1,m 2) from a text string s of length n, such that, given also an integer constant d in input, m 1 and m 2 occur in tandem within a maximum distance of d symbols in s.
The main effort of this work is to eliminate the possible redundancy from the candidate set of the so found tandem motifs. To this aim, we first introduce the concept of maximality, characterized by four specific conditions, that we show to be not deducible by the corresponding notion of maximality already defined for “simple” (i.e., non tandem) motifs. Then, we further eliminate the remaining redundancy by defining the concept of irredundancy for tandem motifs.
We prove that the number of non-overlapping irredundant tandems is O(d 2 n) which, considering d as a constant, leads to a linear number of tandems in the length of the input string. This is an order of magnitude less than previously developed compact indexes for tandem extraction. As a further contribution we show an algorithm to extract this compact irredundant index.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Apostolico, A., Comin, M., Parida, L.: VARUN: Discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comp. Biol. Bioinf. 7(4), 752–762 (2010)
Apostolico, A., Parida, L.: Incremental paradigms of motif discovery. J. of Comp. Biol. 11(1), 15–25 (2004)
Apostolico, A., Pizzi, C., Satta, G.: Optimal Discovery of Subword Associations in Strings. In: Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 270–277. Springer, Heidelberg (2004)
Apostolico, A., Pizzi, C., Ukkonen, E.: Efficient algorithms for the discovery of gapped factors. Algorithms for Molecular Biology (6:5) (2011)
Apostolico, A., Satta, G.: Discovering subword associations in strings in time linear in the output size. J. Discrete Algorithms 7(2), 227–238 (2009)
Apostolico, A., Tagliacollo, C.: Incremental discovery of the irredundant motif bases for all suffixes of a string in o(n2logn) time. Theor. Comput. Sci. 408(2-3), 106–115 (2008)
Apostolico, A., Tagliacollo, C.: Optimal extraction of irredundant motif bases. Int. J. Found. Comput. Sci. 21(6), 1035–1047 (2010)
Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.-F.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biology Bioinform. 3(2), 126–140 (2006)
Fassetti, F., Greco, G., Terracina, G.: Mining loosely structured motifs from biological data. IEEE Trans. Knowl. Data Eng. 20(11), 1472–1489 (2008)
Grossi, R., Pietracaprina, A., Pisanti, N., Pucci, G., Upfal, E., Vandin, F.: MADMX: A strategy for maximal dense motif extraction. J. of Comp. Biol. 18(4), 535–545 (2011)
Grossi, R., Pisanti, N., Crochemore, M., Sagot, M.-F.: Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Trans. Comp. Biol. Bioinf. 2(3), 159–177 (2005)
Marsan, L., Sagot, M.-F.: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. of Comp. Biol. 7(3-4) (2000)
Marsan, L., Sagot, M.-F.: Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification. In: Proceedings of the fourth annual international conference on Computational Molecular Biology (RECOMB), pp. 210–219 (2000)
Monteiro, P.T., Mendes, N.D., Teixeira, M.C., others: Yeastract-discoverer: new tools to improve the analysis of transcriptional regulatory associations in saccharomyces cerevisiae. Nucleic Acids Research 36(Database-Issue), 132–136 (2008)
Mularoni, L., Guigó, R., Albà, M.M.: Mutation patterns of amino acid tandem repeats in the human proteome. Genome Biol. 7(4), R33 (2006)
Parida, L., Rigoutsos, I., Floratos, A., Platt, D.E., Gao, Y.: Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 297–308 (2000)
Pelfrêne, J., Abdeddaïm, S., Alexandre, J.: Extracting approximate patterns. J. Discrete Algorithms 3(2-4), 293–320 (2005)
Pisanti, N., Crochemore, M., Grossi, R., Sagot, M.-F.: A Basis of Tiling Motifs for Generating Repeated Patterns and Its Complexity for Higher Quorum. In: Rovan, B., Vojtáš, P. (eds.) MFCS 2003. LNCS, vol. 2747, pp. 622–631. Springer, Heidelberg (2003)
Pisanti, N., Crochemore, M., Grossi, R., Sagot, M.-F.: A comparative study of bases for motif inference. In: Iliopoulos, C., Lecroq, T. (eds.) String Algorithmics, pp. 195–226. KCL Publications (2004)
Rombo, S.E.: Extracting string motif bases for quorum higher than two. Theor. Comput. Sci (2012)
Ukkonen, E.: Maximal and minimal representations of gapped and non-gapped motifs of a string. Theor. Comput. Sci. 410(43), 4341–4349 (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Parida, L., Pizzi, C., Rombo, S.E. (2012). Characterization and Extraction of Irredundant Tandem Motifs. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_41
Download citation
DOI: https://doi.org/10.1007/978-3-642-34109-0_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34108-3
Online ISBN: 978-3-642-34109-0
eBook Packages: Computer ScienceComputer Science (R0)