1 Introduction to Transcription Factors

Plants are sessile but remarkable organisms. They are endowed with a resilient developmental program that enables growth, reproductive success, and responses to many environmental challenges (Udvardi et al. 2007). These events are primarily modulated at the level of gene transcription through complex interactions among transcription-associated proteins (TAPs). Plant TAPs can be broadly categorized into three groups based on their roles in transcriptional regulation: transcription factors (TFs), transcriptional regulators (TRs), and putative TAPs. TFs directly activate or repress the transcription of target genes by binding cis-regulatory elements in their upstream or promoter regions (Franco-Zorrilla and Solano 2016). TRs include the general transcription initiation factors, coactivators or corepressors, and chromatin remodeling factors (Zheng et al. 2016). Putative TAPs comprise proteins of unknown function that are possibly related to transcriptional regulation (Richardt et al. 2007). TFs modulate a very diverse range of biological processes in plants, from embryogenesis, inflorescence development, maintenance of homeostasis, and plant morphology to responses to internal or external stimuli (Yamasaki et al. 2013). Plant genomes are estimated to devote about 7% of their coding sequences to TFs, indicating the intricacy of transcriptional regulation (Udvardi et al. 2007). Accordingly, great efforts have been made over the past decades to elucidate the roles of TFs in biological processes. TFs possess one or more DNA-binding domains (DBDs) that specifically bind cis-regulatory elements in the upstream regions/promoters of target genes to mediate the binding of RNA polymerase before transcription initiation (Lin et al. 2014).
For instance, ethylene response factor 1 (ERF1) from Arabidopsis possesses a single AP2 domain that directs binding to the GCC box in the promoter regions of ethylene-responsive genes, whereas AINTEGUMENTA (ANT) from Arabidopsis has two AP2 domains (Ohme-Takagi and Shinshi 1995; Nole-Wilson and Krizek 2000). TFs may also have an auxiliary domain that facilitates DNA binding. For example, auxin response factor (ARF) harbors a “B3” DNA-binding domain as well as an auxiliary “auxin/indole-3-acetic acid (Aux/IAA)” domain. ARF binds to TGTCTC-containing auxin-responsive elements through the interaction of the auxiliary Aux/IAA domain (Ulmasov et al. 1997; Guilfoyle et al. 1998). TFs are mainly classified into families based on their protein domains. In Arabidopsis, 2,296 TFs were classified into 58 TF families (Jin et al. 2013). In rice, about 2,478 putative TF genes were classified into 84 families, covering nearly 6% of protein-coding genes (Priya and Jain 2013). In maize, 2,538 putative TF genes were distributed among 64 families, representing ~6% of protein-coding genes (Lin et al. 2014). Interestingly, however, soybean possesses 5,671 putative TF genes clustered in 63 families, accounting for about 12% of protein-coding genes (Schmutz et al. 2010). The high number of TFs in soybean was partly attributed to the whole-genome duplication (WGD) events that occurred about 13 million years ago (MYA) (Schmutz et al. 2010). As noted above, TFs are estimated to comprise a notable portion (~7%) of the protein-coding genes in a plant genome (Udvardi et al. 2007). Their genome-wide identification therefore usually depends on the availability of sequenced genomes.
Herein, to gain comparative insights into TF families and their distribution, eight plant species with sequenced genomes have been investigated: two dicots (A. thaliana and Glycine max), two monocots (Oryza sativa indica and Zea mays), two tree species (Populus trichocarpa and Eucalyptus grandis), and two lower plants (Chlamydomonas reinhardtii and Physcomitrella patens) (Table 1). TF families were searched in PlantTFDB, a comprehensive specialized plant TF database that contains 320,370 TFs from 165 plant species classified into 58 families (Jin et al. 2013; also refer to Table 3). The database employs family assignment rules built on three types of domains: DNA-binding domains, auxiliary domains, and forbidden domains. Generally, a DNA-binding domain alone is sufficient to assign a TF to a family. In some cases, however, an auxiliary domain is needed to classify a TF correctly. In addition, forbidden domains are used to eliminate sequences that carry a DNA-binding domain but have no transcriptional activity. Based on these rules, 2,296 identified TFs (1,717 loci) in Arabidopsis were classified into 58 families, 6,150 TFs (3,747 loci) in soybean into 57 families, 1,891 TFs (1,891 loci) in rice into 56 families, 3,308 TFs (2,289 loci) in maize into 56 families, 4,287 TFs (2,466 loci) in poplar into 58 families, 2,163 TFs (1,731 loci) in Eucalyptus into 56 families, 230 TFs (206 loci) in Chlamydomonas into 29 families, and 3,930 TFs (1,156 loci) in Physcomitrella into 53 families. In light of the TF and locus counts in these eight plants, it can be speculated that most TF genes arose through small- and/or large-scale genomic duplications, particularly in soybean. In addition, the green alga C. reinhardtii, a lower plant, showed a clear divergence from the other plants in terms of TF families. Among TFs, the basic helix-loop-helix (bHLH) family was the most widely distributed, which also implies its functional diversity.
Taken together, specialized TF databases provide very useful information from both functional and evolutionary perspectives.
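
The domain-based family assignment described above can be illustrated with a minimal Python sketch. Only the three domain roles (DNA-binding, auxiliary, forbidden) come from the text; the rule table, the domain labels, and the `assign_family` helper are illustrative assumptions and do not reproduce the actual PlantTFDB rule set or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    """One hypothetical assignment rule: a required DNA-binding domain,
    optional auxiliary domains, and domains that exclude the family."""
    family: str
    dbd: str
    auxiliary: frozenset = frozenset()
    forbidden: frozenset = frozenset()


# Hypothetical rule table for illustration only (not the database's real rules).
RULES = [
    FamilyRule("ARF", dbd="B3", auxiliary=frozenset({"Aux/IAA"})),
    FamilyRule("B3", dbd="B3", forbidden=frozenset({"Aux/IAA", "AP2"})),
    FamilyRule("ERF", dbd="AP2", forbidden=frozenset({"B3"})),
]


def assign_family(domains):
    """Return the first family whose rule matches the detected domains,
    or None if no rule applies."""
    found = set(domains)
    for rule in RULES:
        if rule.dbd not in found:
            continue  # required DNA-binding domain is missing
        if not rule.auxiliary <= found:
            continue  # a required auxiliary domain is missing
        if rule.forbidden & found:
            continue  # a forbidden domain excludes this family
        return rule.family
    return None


print(assign_family(["B3", "Aux/IAA"]))
print(assign_family(["B3"]))
print(assign_family(["AP2", "B3"]))
```

In this toy table, a sequence carrying both an AP2 and a B3 domain matches no rule and is left unassigned, which mirrors how forbidden domains keep ambiguous sequences out of a family.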

Table 1 Distribution of potential TF families in eight plant species: A. thaliana, G. max, O. sativa, Z. mays, P. trichocarpa, E. grandis, C. reinhardtii, and P. patens

In the early days of TF research, TFs and their potential binding elements were identified mainly through conventional laboratory experiments (Wingender 1988). Today, however, the availability of sequenced genomes for many plant species allows TFs to be identified and classified systematically at genome scale (Riechmann et al. 2000; Schmutz et al. 2010; Priya and Jain 2013; Lin et al. 2014). The integration of next-generation sequencing (NGS) technologies with various bioinformatics resources has also revolutionized the field of TF research. In addition, the introduction of various bioinformatics portals has paved the way for the efficient storage, analysis, and distribution of TF data. A number of specialized TF databases have been developed for plants, including AGRIS (Yilmaz et al. 2011), PlnTFDB (Pérez-Rodríguez et al. 2009), PlantTFDB (Jin et al. 2013), GRASSIUS (Yilmaz et al. 2009), PlantTFcat (Dai et al. 2013), TreeTFDB (Mochida et al. 2013), SoyTFKB (Yu et al. 2016), RIKEN (Iida et al. 2005), SoyDB (Wang et al. 2010), RiceSRTFDB (Priya and Jain 2013), and IT3F (Bailey et al. 2008). In addition, many general databases such as GreenPhylDB (Rouard et al. 2010), Phytozome (Goodstein et al. 2012), PLAZA (Proost et al. 2015), ProFITS (Ling et al. 2010), AthaMap (Bülow et al. 2009), PLEXdb (Dash et al. 2012), and others provide various internal utilities for investigating TFs from structural and functional perspectives. However, only a few databases, such as PlnTFDB (Pérez-Rodríguez et al. 2009), PlantTFcat (Dai et al. 2013), ProFITS (Ling et al. 2010), GrassCoRegDB in GRASSIUS (Yilmaz et al. 2009), and PlanTAPDB (Richardt et al. 2007), annotate transcription coregulators (TCs; Mannervik et al. 1999), which are proteins that lack DNA-binding domains but can bind to TFs or RNA polymerase II to mediate gene regulation in plants.

2 General Bioinformatics Database Resources for Plant TF

Various general-content biological databases provide internal tools or utilities for searching TFs from different perspectives, depending on their data types and structures. For example, Phytozome is a publicly available comparative genomics portal for plants (Table 2; Goodstein et al. 2012). As of release v1, it allows access to 65 sequenced and annotated plant genomes. Individual genes have been annotated with PANTHER, KOG, PFAM, KEGG, and GO assignments. Different search options are available for TF exploration: TFs can be searched directly via “Keyword search,” or they can be retrieved through templates, which are predefined queries, using the PhytoMine interface of Phytozome. PLAZA is another plant comparative genomics portal harboring genomic data from different genome sequencing initiatives (Table 2; Proost et al. 2015). As of release PLAZA 3.0, the database holds the genome sequences of 38 species. TFs can be queried via the search option by selecting “TF family” or “Gene” from the menu. In addition, links to other specialized databases are a useful option for cross-validation. GreenPhylDB is another web resource for comparative and functional genomics, covering 37 plant species (Table 2; Rouard et al. 2010). The “Transcription factors” option under the “Gene Family lists” menu is specifically designed to display the list of TF families. The number of TFs can be displayed graphically, and TF information can be mined. ProFITS is a database that aims to facilitate studies of signal transduction systems in maize (Table 2; Ling et al. 2010). It also categorizes TF families and other transcriptional regulators. TFs can be browsed from the “Transcription factor” tab. AthaMap is a species-specific web resource dedicated to Arabidopsis that provides a genome-wide map of putative TF and small RNA binding sites (Table 2; Bülow et al. 2009).
It contains a complete list of 211 TFs derived from published TF binding specificities, available as proven single binding sites or alignment matrices. TFs can be explored using different search functions under the “Tools” menu. PLEXdb is a combined gene expression database for plants and plant pathogens (Table 2; Dash et al. 2012). It currently provides genotype-to-phenotype information from 14 different species. The expression profiles of TFs and their target genes can be analyzed using array and/or RNA-seq data from relevant experiments. ATTED-II is a plant co-expression database integrating various co-expression data sets and network analysis tools (Table 2; Aoki et al. 2015). As of release v8.0, it harbors eight microarray-derived and six RNA sequencing-derived co-expression data sets from seven dicot species (Arabidopsis, soybean, tomato, field mustard, medick, grape, and poplar) and two monocot species (maize and rice). Interacting genes tend to be co-expressed; dissecting a co-expression network of TFs can therefore provide very useful information about functional gene relationships (Aoki et al. 2015). CORNET is a systems biology portal for Arabidopsis and maize that integrates co-expression data, regulatory interactions (e.g., TFs), gene associations, protein-protein interactions (PPIs), and functional annotations (Table 2; De Bodt et al. 2012). Interactions among TFs and between TFs and their targets form intricate regulatory cascades; a holistic approach such as systems biology is thus an effective way to understand complex regulatory networks. CORNET offers multiple options for constructing networks centered on input genes or proteins.
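
As a rough illustration of the co-expression idea behind resources such as ATTED-II and CORNET, the Python sketch below scores gene pairs with the Pearson correlation coefficient and keeps pairs above a cutoff as network edges. The expression values, gene labels, and the 0.9 cutoff are invented for this example, and the databases themselves may rely on different measures and data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles across five conditions.
EXPRESSION = {
    "TF_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_1": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene_2": [5.0, 4.0, 3.0, 2.0, 1.0],
    "gene_3": [3.0, 1.0, 4.0, 1.0, 5.0],
}


def coexpression_edges(expr, cutoff=0.9):
    """Return (gene, gene, r) for every pair whose correlation is at
    least `cutoff`; only positive correlations are kept here."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r >= cutoff:
            edges.append((g1, g2, round(r, 3)))
    return edges


for edge in coexpression_edges(EXPRESSION):
    print(edge)
```

With these toy profiles only the pair TF_A and gene_1 passes the cutoff; gene_2 is strongly anti-correlated with TF_A and is excluded because the sketch keeps positive correlations only, a design choice that could be relaxed by testing the absolute value instead.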

Table 2 List of general bioinformatics database resources for plant regulatory element exploration

The general-content databases mentioned above, which enable regulatory element exploration through various internal utilities and functionalities, offer only a glimpse of the flourishing number of bioinformatics resources. The development of new, versatile resources is therefore an emerging need for further understanding transcriptional regulatory mechanisms from various perspectives, particularly at the systems level.

3 Specialized Bioinformatics Database Resources for Plant TF and Regulatory Element Search

In addition to general databases, many specialized databases have also been developed for TF and regulatory element exploration in plants, such as AGRIS (Arabidopsis Gene Regulatory Information Server), PlantTFDB (Plant Transcription Factor Database), PlnTFDB (Plant Transcription Factor Database), GRASSIUS (Grass Regulatory Information Server), PlantTFcat (Plant Transcription Factor Categorization and Analysis Tool), TreeTFDB (Tree Transcription Factor Database), PlanTAPDB (Plant Transcription Associated Protein Database), TOBFAC (Database of Tobacco Transcription Factors), ppdb (Plant Promoter Database), and PlantCARE (Plant Cis-Acting Regulatory Element).

3.1 AGRIS (Arabidopsis Gene Regulatory Information Server)

AGRIS is an Arabidopsis-specific database resource providing information on promoter sequences, TFs, and their target genes (Table 3; Yilmaz et al. 2011). AGRIS contains three distinct databases: AtcisDB, AtTFDB, and AtRegNet. AtcisDB (arabidopsis.med.ohio-state.edu/AtcisDB/) contains information on about 33,000 upstream regions of annotated Arabidopsis genes and indicates whether each cis-regulatory element is experimentally validated or predicted. It comprises different data types, such as promoter sequences, TF binding site information, and associated annotations, and the data can be searched by TAIR gene symbol or locus ID. AtTFDB (arabidopsis.med.ohio-state.edu/AtTFDB/) includes information on 1,770 TFs grouped into 50 families based on domain conservation. Users can search the database with a specific locus ID or gene name, or by browsing the TF families. AtRegNet (arabidopsis.med.ohio-state.edu/grgx/) harbors 18,772 direct interactions between TFs and target genes. For example, using AGRIS, the interaction between the BR-activated transcription factor BZR1 and phytochrome-interacting factor 4 (PIF4) was shown to integrate brassinosteroid and environmental responses (Oh et al. 2012).

Table 3 List of specialized bioinformatics database resources for plant transcription factor (TF) and cis-regulatory element exploration

3.2 PlantTFDB (Plant Transcription Factor Database)

PlantTFDB contains 320,370 TFs from 165 plant species classified into 58 families (Table 3; Jin et al. 2013). Extensive annotations are provided for each identified TF, including functional domains, binding motifs, gene and plant ontologies, 3D structures, regulation information, curated functional descriptions, interactions, expression information, references, and cross-links to various databases. The evolutionary relationships between TFs are provided through constructed phylogenies and inferred orthologous groups. The database can be searched using TF IDs and common names in the “search” tab or by submitting sequences to BLAST. In addition, new portals are available within the database: PlantRegMap (Plant Transcriptional Regulatory Map) for regulation prediction and functional enrichment, and ATRM (Arabidopsis Transcriptional Regulatory Map) for the architecture and evolutionary features of transcriptional regulatory networks. For example, an Arabidopsis transcriptional regulatory map constructed with 388 TFs from 47 families revealed architectural heterogeneity between stress response and developmental subnetworks and identified three types of new network motifs (Jin et al. 2015).

3.3 PlnTFDB (Plant Transcription Factor Database)

PlnTFDB covers 28,193 protein models and 26,184 distinct protein sequences distributed among 84 gene families from 20 plant species (Table 3; Pérez-Rodríguez et al. 2009). It is an integrative database providing information on TFs and other TRs in completely sequenced and annotated plant species. Each gene family is provided with a basic description, complemented by literature references and a domain alignment. TF/TR entries also include information on expressed sequence tags (ESTs), domain architecture, 3D structures of homologous proteins, and cross-links to various other resources. The different species are also linked to each other through orthologous genes to facilitate cross-species comparisons. The database can be searched using sequence identifiers, BLAST, or direct browsing. For example, chickpea transcripts were queried against PlnTFDB TFs using BLASTX to identify all TF families in the chickpea transcriptome (Garg et al. 2011).
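
A homology search of this kind typically ends with a table of BLAST hits that must be reduced to one family call per query. The sketch below assumes hits in the standard 12-column BLAST tabular format (`-outfmt 6`); the hit lines, the subject-to-family lookup, and the E-value cutoff are hypothetical and only illustrate the filtering step, not the procedure used by Garg et al. (2011).

```python
from collections import Counter

# Hypothetical BLASTX hits in tabular format. Columns: qseqid sseqid pident
# length mismatch gapopen qstart qend sstart send evalue bitscore
BLAST_TABLE = """\
tr1\tTF_0001\t82.5\t200\t35\t0\t1\t600\t1\t200\t1e-90\t310
tr1\tTF_0002\t60.0\t180\t72\t0\t10\t549\t5\t184\t1e-40\t150
tr2\tTF_0003\t45.0\t100\t55\t0\t1\t300\t1\t100\t0.5\t30.1
tr3\tTF_0002\t75.0\t220\t55\t0\t1\t660\t1\t220\t3e-75\t260
"""

# Hypothetical mapping from database TF identifiers to TF families.
SUBJECT_FAMILY = {"TF_0001": "bHLH", "TF_0002": "MYB", "TF_0003": "WRKY"}


def best_hits(table, max_evalue=1e-5):
    """Keep, for each query, the subject with the highest bit score
    among hits whose E-value does not exceed `max_evalue`."""
    best = {}
    for line in table.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def family_counts(hits, lookup):
    """Count how many queries were assigned to each TF family."""
    return Counter(lookup[s] for s in hits.values() if s in lookup)


hits = best_hits(BLAST_TABLE)
print(hits)
print(family_counts(hits, SUBJECT_FAMILY))
```

Taking the single best hit per query is the simplest possible policy; a real analysis would usually confirm the family call with a domain search before counting.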

3.4 GRASSIUS (Grass Regulatory Information Server)

GRASSIUS is a web portal that brings together various resources related to the control of gene expression in grass species such as maize, rice, sorghum, sugarcane, and Brachypodium (Table 3; Yilmaz et al. 2009). The database currently contains 9,044 TFs, 579 coregulators, 149,075 promoter sequences, and 2,098 TF ORF clones. It harbors four different databases: GrassTFDB, GrassCoRegDB, GrassPROMDB, and the TFome Collection. GrassTFDB (grassius.org/grasstfdb.html) contains an extensive collection of TFs from maize, rice, sorghum, sugarcane, and Brachypodium. GrassCoRegDB (grassius.org/grasscoregdb.html) includes a collection of transcriptional regulator proteins. These proteins do not bind DNA in a sequence-specific manner; they act either by interacting with TFs or as chromatin modifiers that release or restrict DNA accessibility. GrassPROMDB (grassius.org/grasspromdb.html) is a promoter sequence database for grass species that covers cis-regulatory elements. The TFome Collection (grassius.org/tfomecollection.html) provides access to the grass TF ORFeome collection.

3.5 PlantTFcat (Plant Transcription Factor Categorization and Analysis Tool)

PlantTFcat is a web-based categorization and analysis tool for transcription factors and transcriptional regulators (Table 3; Dai et al. 2013). It currently contains information from a total of 108 published TF, TR, and chromatin regulator (CR) families. The database can be searched using protein or nucleic acid sequences in FASTA format or as plain sequence (without a FASTA header). PlantTFcat has been employed to identify TFs in various plant species, including Vicia sativa (Panchal 2015), Phaseolus vulgaris (Patel et al. 2014a), Cicer arietinum (Patel et al. 2014b), Trigonella foenum-graecum (Patel et al. 2014c), Arachis hypogaea (Patel et al. 2015), Andrographis paniculata (Cherukupalli et al. 2016), Ananas comosus (Chen et al. 2016), Brassica napus (Shamloo-Dashtpagerdi et al. 2015), and others.

3.6 TreeTFDB (Tree Transcription Factor Database)

TreeTFDB contains TFs from six economically valuable tree species, namely papaya (Carica papaya), jatropha (Jatropha curcas), cassava (Manihot esculenta), poplar (Populus trichocarpa), castor bean (Ricinus communis), and grapevine (Vitis vinifera), and serves as a resource for comparative and functional genomics (Table 3; Mochida et al. 2013). The importance of specialized databases dedicated to tree species, such as TreeTFDB, has also been emphasized in a mini review (Legué et al. 2014). As of ver. 1.0, it includes 1,481 TF models of jatropha, 3,110 TF models of poplar, 2,638 TF models of cassava, 1,493 TF models of grapevine, 1,552 TF models of papaya, and 1,512 TF models of castor bean. A number of search options are available for TF exploration, including TF families, keywords, gene IDs, InterProScan results, GO terms, cis-motifs (stress responsive), cis-motifs (PLACE), cis-motifs (hormone responsive), and BLAST.

3.7 PlanTAPDB (Plant Transcription Associated Protein Database)

PlanTAPDB is a phylogeny-based web resource for transcription-associated proteins (TAPs) across a wide taxonomic range that includes algae and a moss (Table 3; Richardt et al. 2007). The database contains three categories of entries, namely transcription factors (TFs), transcription regulators (TRs), and putative TAPs (PTs), which belong to one of 119 families (138 subfamilies). TAPs can be searched by family accession number and ID, by keyword, or by BLAST.

3.8 TOBFAC (Database of Tobacco Transcription Factors)

TOBFAC is an integrative database dedicated to tobacco (Table 3; Rushton et al. 2008). It provides access to sequence, phylogeny, and various associated data for tobacco TFs. The database currently includes 65 TF families, each provided with literature information, domain architecture, a list of genomic sequences, the minimum number of genes, and other information. TOBFAC can be queried using the various search and data retrieval options on the main menu. The database is a very useful resource for tobacco studies. For example, in a custom oligo array designed for transcriptome analyses of water-deficit tobacco, probe sequences were obtained from three different sources, including TOBFAC TFs (Rabara et al. 2015). TOBFAC has also been employed in several other recent studies (Fu et al. 2013; Ogata et al. 2013; Xu et al. 2015).

3.9 ppdb (Plant Promoter Database)

ppdb is a plant promoter database providing information on core promoter structures such as TATA boxes, Y Patches, Initiators, CA and GA elements, transcription start sites (TSSs), and transcriptional regulatory elements from A. thaliana, Oryza sativa, Physcomitrella patens, and P. trichocarpa (Table 3; Hieno et al. 2013). For example, ABA-responsive promoter elements (ABREs) identified with ppdb were used to investigate the relationship between stomatal closure and the evolution of ABA signaling (Lind et al. 2015). The database can be searched using a gene name or a keyword. In addition, a “homologue gene search” option is available for comparing the promoter structures of orthologous genes in specified plants.

3.10 PlantCARE (Plant Cis-Acting Regulatory Element)

PlantCARE is a web portal providing information on plant cis-regulatory elements, enhancers, and repressors, as well as tools for the in silico analysis of promoter sequences (Table 3; Lescot et al. 2002). Regulatory elements are represented by consensus sequences, positional matrices, individual sites, and functional annotations for the queried sequences. The database can be searched by submitting a raw DNA sequence or a FASTA file. Other query options are also available for more guided searches by classification, gene, factor name, site name, and reference. PlantCARE is a very useful resource for exploring predicted or verified cis-regulatory elements in given sequences (Filiz et al. 2015; Tira-Umphon et al. 2015; Vatansever et al. 2016).
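
A consensus-based scan of this kind can be approximated in a few lines of Python. The sketch searches both strands of a sequence, supplied either as raw DNA or as FASTA text, for the TGTCTC auxin-responsive element mentioned in the Introduction and for the GCC box. The GCCGCC core used for the GCC box is an assumption (the text does not spell it out), and the motif dictionary, the toy promoter, and the function names are illustrative only; they do not reproduce PlantCARE's matrices or search engine.

```python
import re

# Motif name -> consensus. TGTCTC comes from the text; GCCGCC is the
# commonly cited GCC-box core and is an assumption in this sketch.
MOTIFS = {"AuxRE": "TGTCTC", "GCC-box": "GCCGCC"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_sequence(text):
    """Accept raw DNA or single-record FASTA text; return upper-case DNA."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "".join(line for line in lines if not line.startswith(">")).upper()


def scan(seq, motifs=MOTIFS):
    """Return (motif name, 1-based start, strand) for every occurrence.
    Reverse-strand hits are reported at their forward-strand start.
    A palindromic motif would be listed on both strands."""
    hits = []
    for name, motif in motifs.items():
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            # Lookahead so that overlapping occurrences are also found.
            for match in re.finditer(f"(?={pattern})", seq):
                hits.append((name, match.start() + 1, strand))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical promoter fragment in FASTA format.
promoter = read_sequence(">toy_promoter\nAATGTCTCAAGCCGCCTTGAGACATT")
for hit in scan(promoter):
    print(hit)
```

Exact string matching is the crudest form of such a search; positional matrices, as offered by the database, tolerate degenerate positions that a fixed consensus would miss.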

4 Conclusion and Perspectives

Biological processes are controlled at multiple levels; at the transcriptional level, they are regulated by TFs, which modulate the expression of target genes by binding cis-regulatory elements in their promoter regions. Identification of TFs is a primary step in TF research and is usually achieved by a direct BLAST search against a particular plant genome or against a gene/protein database repository. Following TF identification, the domain families of the identified TFs first need to be confirmed using a domain search utility such as Pfam (pfam.xfam.org/; Finn et al. 2016), InterProScan (ebi.ac.uk/interpro/search/sequence-search; Mitchell et al. 2014), PROSITE (prosite.expasy.org/; Sigrist et al. 2012), or NCBI Conserved Domain (ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi; Marchler-Bauer et al. 2014). TFs with typical domain structures can be readily verified, but members with atypical domains can be difficult to identify because of sequence divergence and reshuffling. In addition to DNA-binding domains, the presence of various other motif sequences in some TFs can make their identification even more difficult. A statistical model known as the hidden Markov model (HMM) is a commonly used approach for searching typical and complex protein domain families (Finn et al. 2016). An iterative approach has also been reported for identifying complex domain motifs in TFs (Wang et al. 2012, 2016). Nevertheless, there remains a need for novel bioinformatics tools or approaches that can efficiently identify highly complex domain structures.

In addition, insights into TF binding sites or cis-regulatory elements are almost indispensable for dissecting TF-based networks. For this purpose, the integration of various resources from expression, co-expression, protein-binding, and phylogenetic studies has become a powerful approach (Godoy et al. 2011; Franco-Zorrilla et al. 2014). To capture real interactions between TFs and their target sites, chromatin immunoprecipitation (ChIP), ChIP-microarray, ChIP-seq, and combined ChIP-seq-RNA-seq are the most promising technologies reported so far (Buck and Lieb 2004; Collas; Kaufmann et al. 2010; Yang et al. 2013; Heyndrickx et al. 2014; Weirauch et al. 2014). Moreover, in parallel with the ever-increasing volume of biological data, the development of new databases that enable the analysis and distribution of plant TF data has become fundamentally important in TF research. Several specialized and general-content databases have been developed for TF research, but most of them are still far from complete for the following reasons: (1) verified and hypothetical data are not clearly distinguished; (2) species-specific functional divergence is ignored, particularly when working with orthologue data; (3) data are not sufficiently linked to relevant resources for cross-validation; (4) ambiguous entries are present and data are not clearly annotated; (5) they are not regularly updated; (6) interoperability between TF resources is limited or lacking; and (7) adequate analysis tools are not included. Thus, there is a compelling demand to develop novel TF databases and to enhance existing ones so that they address the abovementioned deficiencies and offer more utilities and functionalities to advance TF research.