Abstract
In the motif finding problem one seeks a set of mutually similar subsequences within a collection of biological sequences. This is an important and widely studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework where the goal is to find subsequences of a given length such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem, which uses the fact that distances between subsequences come from a limited set of possibilities. We show how to tighten its linear programming relaxation by adding an exponential set of constraints and give an efficient separation algorithm that can find violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice in uncovering known transcription factor binding sites.
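The combinatorial objective described above (choose subsequences of a given length so that the sum of their pairwise distances is minimized) can be illustrated with a small brute-force sketch. This is not the integer linear program from the paper; it only makes the objective concrete. It assumes one window is chosen per input sequence and uses Hamming distance as the pairwise distance, neither of which is stated in the abstract. The names `hamming`, `windows`, and `best_motif` are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq, length):
    """All contiguous subsequences of `seq` with the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def best_motif(sequences, length):
    """Exhaustively choose one window per sequence (an assumption) so that
    the sum of pairwise Hamming distances (also an assumption) is minimal.

    Returns (cost, chosen_windows). Runtime is exponential in the number
    of sequences, so this is only usable on toy inputs.
    """
    best_cost, best_choice = None, None
    for choice in product(*(windows(s, length) for s in sequences)):
        cost = sum(hamming(a, b) for a, b in combinations(choice, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


if __name__ == "__main__":
    # Toy input: the window "ACGT" occurs in every sequence.
    print(best_motif(["ACGTAC", "TTACGT", "GACGTT"], 4))
```

The exhaustive search grows exponentially with the number of sequences, which is why formulations such as the integer linear program and its relaxation are of interest.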
Keywords
- Transcription Factor Binding Site
- Integer Linear Programming
- Linear Programming Relaxation
- Separation Algorithm
- Integer Linear Programming Formulation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kingsford, C., Zaslavsky, E., Singh, M. (2006). A Compact Mathematical Programming Formulation for DNA Motif Finding. In: Lewenstein, M., Valiente, G. (eds) Combinatorial Pattern Matching. CPM 2006. Lecture Notes in Computer Science, vol 4009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780441_22
DOI: https://doi.org/10.1007/11780441_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35455-0
Online ISBN: 978-3-540-35461-1
eBook Packages: Computer Science (R0)