Abstract
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, they have been adapted to many new types of data and many different approaches have been developed to determine the parameters of the PWM. New high-throughput technologies provide a large amount of data rapidly and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also aid in determining when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, a brief history of the approaches that have been developed and the types of data that are used with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
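As a rough illustration of what scoring a potential binding site with a PWM involves, the sketch below builds a log-odds matrix from a few aligned sites and sums one matrix element per position to score each window of a sequence. It is only a minimal example under stated assumptions, not the article's own formulation: the uniform background, the pseudocount, the example sites, and the function names `pwm_from_sites`, `score_site`, and `scan` are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"


def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites.

    Assumptions (illustrative only): uniform background frequency and a
    fixed pseudocount added to every base count.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # One column per position: log2(observed frequency / background).
        pwm.append(
            {base: math.log2(counts[base] / total / background) for base in BASES}
        )
    return pwm


def score_site(pwm, site):
    """Additive score: sum the matrix element for the base at each position."""
    return sum(column[base] for column, base in zip(pwm, site))


def scan(pwm, sequence):
    """Score every window of the PWM's width along the sequence."""
    width = len(pwm)
    return [
        (start, score_site(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical aligned sites, used only to exercise the functions.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = pwm_from_sites(sites)
best_pos, best_score = max(scan(pwm, "GGCTATAATGC"), key=lambda hit: hit[1])
```

The additive form, in which each position contributes independently to the total, is the simple PWM model; the extensions mentioned in the abstract address cases where that independence assumption is inadequate.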
About this article
Cite this article
Stormo, G.D. Modeling the specificity of protein-DNA interactions. Quant Biol 1, 115–130 (2013). https://doi.org/10.1007/s40484-013-0012-4
DOI: https://doi.org/10.1007/s40484-013-0012-4