Introduction

Recent advances in genomic DNA sequencing, mainly driven by next-generation sequencing (NGS) techniques, have revolutionized the ways in which molecular events inside cells can be examined. For example, chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) generates molecular maps that pinpoint the genome-wide binding positions of various proteins in many different cell types (Mouse et al. 2012; Kang et al. 2013). With these maps, researchers can distinguish target and non-target genes of transcription factors. Whole transcriptome shotgun sequencing (RNA-seq) can estimate the abundance of whole transcripts including protein-coding genes, non-coding RNAs, and small RNAs (Feuermann et al. 2013; Yamaji et al. 2013). As the number of NGS-based datasets increases, many tools have been developed to help turn sequenced short DNA fragments into biologically meaningful information. According to statistics from the OMICtools website (http://omictools.com/) (Henry et al. 2014), more than 2,000 tools are currently available for the analysis of NGS-based data. In the case of ChIP-seq, several thousand datasets have been deposited in NCBI's Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) (Barrett et al. 2013).

In the era of NGS, however, there is no gold standard for analyzing a given NGS dataset, although most programs provide some statistics-based output. For instance, several studies have pointed out that the number of binding sites identified from a given ChIP-seq dataset can vary depending on the algorithms and parameters used for the analysis (Kang et al. 2013; Adomas et al. 2014). In addition, setting up a computer for the analysis can be challenging for novice users. It is therefore difficult for users to select appropriate applications and establish a computing environment for the analysis. To address this issue, we have developed AutoChIP, an automated analysis pipeline that can process a large number of ChIP-seq datasets in a single run. Through a graphical user interface (GUI), it automatically installs the required programs from their websites, generates an appropriate genome index for alignment, and then sequentially processes several types of input such as FASTQ (unmapped reads) and BAM (mapped reads) files. To produce a list of high-confidence binding sites, defined as the peaks detected by all of the algorithms (a cocktail strategy), AutoChIP uses the following popular peak calling programs: model-based analysis of ChIP-seq (MACS), hypergeometric optimization of motif enrichment (HOMER), and PeakRanger (Zhang et al. 2008; Heinz et al. 2010; Feng et al. 2011). Our evaluation demonstrated that the cocktail approach implemented in AutoChIP improves the overall performance of peak finding in terms of the ratio of motif occurrences and the average density of ChIPed reads.

Materials and Methods

Development environment

AutoChIP was developed in the Java programming language (JDK 1.7) using Eclipse (Kepler version, Java EE IDE). The GUI was built with Swing and WindowBuilder. Since most of the applications integrated into AutoChIP run only on Linux, AutoChIP itself can only run on a Linux operating system such as Ubuntu or Fedora. AutoChIP downloads and installs the following tools automatically: Samtools, twoBitToFa, and BEDTools for the manipulation of files and output (Li et al. 2009; Quinlan and Hall 2010); Bowtie2 and Subread for alignment (Langmead and Salzberg 2012; Liao et al. 2013); and MACS (version 1.4), HOMER, and PeakRanger for peak calling (Zhang et al. 2008; Heinz et al. 2010; Feng et al. 2011). To identify genome-wide binding sites, it runs all three peak calling programs and intersects their outputs.
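The intersection step can be illustrated with a minimal Python sketch. It is not taken from the AutoChIP source: the peak representation as (chromosome, start, end) tuples, the 1 bp overlap criterion, the choice of one tool's peaks as the reference set, and the function names are all assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def high_confidence_peaks(reference, *others):
    """Keep the peaks in `reference` that overlap a peak in every other set.

    Assumption: a peak counts as "detected by all algorithms" when it
    overlaps at least one peak from each of the other callers.
    """
    return [
        peak
        for peak in reference
        if all(any(overlaps(peak, q) for q in other) for other in others)
    ]


# Toy peak lists standing in for the outputs of the three callers.
macs = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
homer = [("chr1", 250, 400), ("chr2", 60, 120)]
peakranger = [("chr1", 280, 350), ("chr1", 1100, 1300), ("chr2", 100, 200)]

common = high_confidence_peaks(macs, homer, peakranger)
print(common)
```

The all-against-all comparison is quadratic and is only suitable for small examples; for genome-scale peak lists, a sorted sweep or an interval tool such as BEDTools would be the usual choice.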

ChIP-seq data sets used in the study

To assess the performance of AutoChIP, the following mouse and human ChIP-seq data were downloaded and analyzed by MACS, HOMER, PeakRanger, and AutoChIP: mouse STAT5A (GSM1005189) and STAT5B (GSM1005190) and their corresponding input (GSM1005193) in mammary gland tissues; human GATA3 (GSM1241752) and its corresponding input (GSM1241753) in the MCF7 cell line; and human GATA3 (GSM1241754) and its corresponding input (GSM1241755) in the T47D cell line (Adomas et al. 2014; Kang et al. 2014).

Motif analysis and peak annotation

To predict over-represented motifs in peaks, MEME-ChIP (http://meme.nbcr.net/meme/cgi-bin/meme-chip.cgi) was used with the default setting (Machanick and Bailey 2011). The absence or presence of the predicted motifs within a 200 bp flanking sequence was assessed by the FIMO tool (http://meme.nbcr.net/meme/cgi-bin/fimo.cgi) (Grant et al. 2011). Distributions of peaks were estimated according to gene and repetitive element annotation (mm9 for mouse and hg19 for human) using HOMER.
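A window around each peak has to be defined before the motif scan. The text does not specify whether the 200 bp refers to the total window or to each side, nor whether the window is centered on the peak summit or midpoint. The following Python sketch therefore assumes a window of 200 bp on each side of the peak midpoint; the function name and parameters are hypothetical.

```python
def flank_window(start, end, flank=200, chrom_length=None):
    """Return (left, right) of a window of `flank` bp on each side of the
    peak midpoint, clamped to the chromosome bounds.

    Assumptions: 0-based half-open coordinates; the midpoint is used
    because summit positions are not always reported.
    """
    center = (start + end) // 2
    left = max(0, center - flank)
    right = center + flank
    if chrom_length is not None:
        right = min(right, chrom_length)
    return left, right


print(flank_window(1000, 1100))                 # window around midpoint 1050
print(flank_window(50, 150, chrom_length=250))  # clamped at both ends
```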

Program availability

AutoChIP can be downloaded at https://sites.google.com/site/kangklab/.

Results and Discussion

AutoChIP workflow

AutoChIP automatically installs the required programs upon its first run. For the analysis, it accepts FASTQ (unmapped reads) and BAM (mapped reads) files as inputs and performs one of the following analyses: Indexing, Alignment, or Align-ChIP (Fig. 1). In the Indexing tab, users can generate a genome index for alignment using either Bowtie2 or Subread (Fig. 1a) (Langmead and Salzberg 2012; Liao et al. 2013). This step is necessary when input files are provided in the FASTQ format and no genome index exists. Human, mouse, and Drosophila genomes can be downloaded from the designated AutoChIP server. In the Alignment tab, FASTQ files can be aligned to an indexed genome (Fig. 1b). Multiple files can be processed sequentially. The Align-ChIP tab provides a fully automated mode for ChIP-seq analysis when FASTQ files and an indexed genome are provided (Fig. 1c). The main algorithm for detecting the binding sites of proteins of interest is a cocktail approach that generates a list of high-confidence peaks by intersecting the outputs from MACS, HOMER, and PeakRanger (Zhang et al. 2008; Heinz et al. 2010; Feng et al. 2011). We chose these peak calling tools for the following reasons: (1) MACS is one of the most widely used peak calling programs for ChIP-seq analysis; (2) HOMER is a versatile tool that can analyze different types of NGS-based data including ChIP-seq, RNA-seq, and MNase-seq; and (3) PeakRanger was used to process a large set of ChIP-seq data produced by the modENCODE consortium. Additionally, AutoChIP can annotate the identified peaks with appropriate gene and repetitive element information by means of HOMER. All functions are provided in GUI mode; therefore, users do not need to learn how to install and execute each program.
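The dependency between input type and processing steps described above can be summarized in a short Python sketch. The step names and the function are hypothetical labels chosen for illustration; they do not correspond to identifiers in the AutoChIP code.

```python
def plan_steps(input_format, has_index):
    """Return the ordered processing steps implied by the input type.

    Assumptions: BAM input is already aligned, so it goes straight to
    peak calling; FASTQ input needs alignment, and a genome index must
    be built first if none is available.
    """
    fmt = input_format.upper()
    if fmt == "BAM":
        return ["peak_calling", "annotation"]
    if fmt == "FASTQ":
        steps = [] if has_index else ["indexing"]
        return steps + ["alignment", "peak_calling", "annotation"]
    raise ValueError(f"unsupported input format: {input_format}")


print(plan_steps("FASTQ", has_index=False))
print(plan_steps("BAM", has_index=True))
```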

Fig. 1

Three main functions of AutoChIP. a If mapped files (BAM) are not provided, the input files (FASTQ) must be aligned to a reference genome before peak calling, and the genome must be indexed prior to the alignment. The Indexing tab generates an index of the given genome by using either Bowtie2 (default) or Subread. The following reference genomes can be automatically downloaded from https://sites.google.com/site/kangklab/: hg19 (Human), mm9 and mm10 (Mouse), and dm3 (Drosophila). b The Alignment tab aligns unmapped reads (FASTQ) to the indexed genome. Multiple files can be processed sequentially. If an indexed genome and mapped files (BAM) are provided, peak calling and annotation of the identified peaks can be conducted. c If an indexed genome and unmapped files (FASTQ) are provided, the alignment, peak calling, and annotation steps can be conducted sequentially in the Align-ChIP tab

Inconsistency between peak calling applications

To identify genome-wide binding sites from given ChIP-seq datasets, available peak calling programs including MACS, HOMER, and PeakRanger use different strategies, although all of them rely on statistical measures such as the false discovery rate (FDR) (Zhang et al. 2008; Heinz et al. 2010; Feng et al. 2011). Because of these differences, several studies have reported that the number of genome-wide binding sites identified can vary by up to several thousand (Malone et al. 2011; Kang et al. 2013). We confirmed this inconsistency by reanalyzing available mouse STAT5A (GSE40930, mammary gland tissues) (Kang et al. 2014) and human GATA3 (GSE51274, MCF7 and T47D breast cancer cell lines) (Adomas et al. 2014) ChIP-seq data with MACS, HOMER, and PeakRanger. Up to several thousand peaks were differentially identified between the peak calling programs (Fig. 2). For example, 33495 STAT5A binding sites were identified by all three tools, while 8987, 2891, and 1703 STAT5A peaks were uniquely detected by MACS, HOMER, and PeakRanger, respectively. Similarly, 3751, 6031, and 429 GATA3 peaks in MCF7 cells were identified only by MACS, HOMER, and PeakRanger, respectively. In total, 33495 STAT5A, 9486 STAT5B, 18296 GATA3 (MCF7), and 12513 GATA3 (T47D) peaks were identified by all three applications. These results demonstrate that current ChIP-seq analysis tools still have room for improvement.
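Counts of shared and tool-specific peaks of this kind can be obtained by merging overlapping peaks from all tools into union regions and recording which tools contributed to each region. The Python sketch below illustrates one such procedure; it is an assumption about how these counts might be derived, not a description of the method used for Fig. 2, and the toy coordinates are invented.

```python
from collections import Counter


def venn_counts(peaks_by_tool):
    """Count merged regions by the set of tools that support them.

    Peaks are (chrom, start, end) half-open intervals. Overlapping peaks
    from any tool are merged into one region; each region is counted once
    under the frozenset of tool names that contributed a peak to it.
    """
    tagged = sorted(
        (chrom, start, end, tool)
        for tool, peaks in peaks_by_tool.items()
        for chrom, start, end in peaks
    )
    counts = Counter()
    cur_chrom, cur_end, cur_tools = None, None, set()
    for chrom, start, end, tool in tagged:
        if chrom == cur_chrom and start < cur_end:
            # Peak overlaps the open region: extend it and record the tool.
            cur_end = max(cur_end, end)
            cur_tools.add(tool)
        else:
            # Close the previous region (if any) and start a new one.
            if cur_tools:
                counts[frozenset(cur_tools)] += 1
            cur_chrom, cur_end, cur_tools = chrom, end, {tool}
    if cur_tools:
        counts[frozenset(cur_tools)] += 1
    return counts


toy = {
    "MACS": [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)],
    "HOMER": [("chr1", 250, 400), ("chr2", 60, 120), ("chr3", 10, 90)],
    "PeakRanger": [("chr1", 280, 350), ("chr1", 1100, 1300), ("chr2", 100, 200)],
}
result = venn_counts(toy)
for tools, n in result.items():
    print(sorted(tools), n)
```

Because overlapping peaks are merged, the counts refer to union regions and may differ slightly from per-tool peak counts when one tool splits a region that another reports as a single peak.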

Fig. 2

Reanalysis of published STAT5 and GATA3 ChIP-seq data with MACS, HOMER, and PeakRanger. Different numbers of STAT5A, STAT5B, and GATA3 (MCF7 and T47D cell lines) binding sites were identified by MACS, HOMER, and PeakRanger. The published data were downloaded from the Gene Expression Omnibus (GEO accession numbers GSE40930 and GSE51274) (Adomas et al. 2014; Kang et al. 2014)

Performance evaluation of the cocktail approach implemented in AutoChIP

AutoChIP takes advantage of each algorithm by intersecting their outputs and provides a list of high-confidence peaks, defined as the peaks identified by all of the algorithms (the cocktail approach). To assess the cocktail approach, motif frequency (absence or presence of a given motif per peak) was calculated. Since DNA binding proteins recognize specific DNA sequences (motifs), STAT5 and GATA3 binding motifs were first predicted by MEME-ChIP using the peaks identified by AutoChIP (Machanick and Bailey 2011). The known STAT5 and GATA3 motifs were identified as significantly enriched (Fig. 3a). With the top motifs showing the lowest E-value, motif frequency (p value < 0.0001 for detecting motifs) was estimated using the FIMO tool with the identified peaks (Grant et al. 2011): 33495 (AutoChIP), 50266 (MACS), 40476 (HOMER), and 40696 (PeakRanger) STAT5A peaks; 9486 (AutoChIP), 15866 (MACS), 14825 (HOMER), and 12129 (PeakRanger) STAT5B peaks; 18296 (AutoChIP), 30222 (MACS), 32186 (HOMER), and 20151 (PeakRanger) GATA3 peaks in MCF7 cells; and 12513 (AutoChIP), 32279 (MACS), 29128 (HOMER), and 13678 (PeakRanger) GATA3 peaks in T47D cells. The results show that the cocktail approach of AutoChIP outperformed each single peak calling tool in all cases according to the percentage of peaks containing at least one top motif (Fig. 3b). This tendency was maintained when the analysis was applied to degenerate motifs defined by increasing the p value cutoff from 0.001 to 0.1 (Fig. 3c). In addition, the normalized read densities of the peaks identified by AutoChIP in the STAT5A, STAT5B, and GATA3 (MCF7 and T47D) ChIP-seq data were higher than those of the peaks identified by any single method (Fig. 3d). These results demonstrate that the cocktail approach implemented in AutoChIP identified high-quality peaks in terms of the ratio of motif occurrence and the binding strength of the given proteins at the sites.
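The motif frequency used in this evaluation, that is, the percentage of peaks containing at least one motif occurrence below a p value cutoff, can be computed as in the following Python sketch. The input layout (peak identifiers plus a list of (peak identifier, p value) motif hits, such as could be extracted from a FIMO result table) and the function name are assumptions for illustration; the toy values are invented.

```python
def motif_frequency(peak_ids, motif_hits, p_cutoff=1e-4):
    """Percentage of peaks with at least one motif hit at p < p_cutoff.

    `motif_hits` is an iterable of (peak_id, p_value) pairs. A peak is
    counted once no matter how many hits it contains (presence/absence).
    """
    peaks = set(peak_ids)
    if not peaks:
        return 0.0
    with_motif = {pid for pid, p in motif_hits if p < p_cutoff and pid in peaks}
    return 100.0 * len(with_motif) / len(peaks)


peaks = ["p1", "p2", "p3", "p4"]
hits = [("p1", 1e-5), ("p1", 5e-6), ("p2", 3e-3), ("p3", 8e-5)]

strict = motif_frequency(peaks, hits, p_cutoff=1e-4)   # only p1 and p3 pass
relaxed = motif_frequency(peaks, hits, p_cutoff=1e-2)  # p2 passes as well
print(strict, relaxed)
```

Relaxing the cutoff admits weaker motif matches, which is how the comparison with degenerate motifs can be reproduced with the same function.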

Fig. 3

Comparison of AutoChIP to MACS, HOMER, and PeakRanger. a Top motifs (the lowest E-value) predicted by MEME-ChIP in the common peaks of the given ChIP-seq data are shown (Machanick and Bailey 2011). These were used for further analysis. b The percentage of the common peaks containing at least one motif was calculated with different p value thresholds for detecting motifs in the peaks. c The percentages of the motifs in the peaks identified by HOMER, MACS, PeakRanger, and AutoChIP are shown as bar graphs. d The average normalized read density (reads per million per nucleotide, RPM) on the AutoChIP and marginal peaks is shown. Marginal peaks were defined as peaks identified by only a single program
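The normalized read density in panel d is expressed as reads per million per nucleotide. The exact computation is not given in the text; a common formulation, assumed in the Python sketch below, divides the number of reads in a peak by the peak length and scales by one million over the total number of mapped reads. The function names and toy values are hypothetical.

```python
def rpm_density(read_count, peak_length, total_mapped_reads):
    """Reads per million mapped reads per nucleotide for a single peak."""
    return read_count / peak_length * 1_000_000 / total_mapped_reads


def mean_rpm_density(peaks, total_mapped_reads):
    """Average density over an iterable of (read_count, peak_length) pairs."""
    values = [rpm_density(c, n, total_mapped_reads) for c, n in peaks]
    return sum(values) / len(values)


# Toy example: 200 reads in a 400 bp peak, 20 million mapped reads.
single = rpm_density(200, 400, 20_000_000)
average = mean_rpm_density([(200, 400), (100, 400)], 20_000_000)
print(single, average)
```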

Annotation of the identified peaks with the information of genes and repetitive elements

AutoChIP provides several advantages to users. First, all of the required programs are automatically installed, and the procedures necessary for peak calling, such as alignment, are conducted in one step. Second, additional analyses can be performed after peak calling. For example, annotation of the identified peaks can easily be executed with the Perl script (annotatePeaks.pl) provided by HOMER. Annotation of the peaks identified by AutoChIP showed that the majority of STAT5A, STAT5B, and GATA3 peaks were located in intergenic and intron regions (Fig. 4a; Table S1). This result is consistent with the known role of STAT5 and GATA3 as enhancer binding proteins (Ranganath et al. 1998; Gonsky et al. 2004). Additionally, the relationship between STAT5 (or GATA3) binding and repetitive elements was assessed by using the Perl script (analyzeRepeats.pl) from HOMER. Interestingly, 2.95 and 0.80 % of the known repetitive elements coincided with STAT5A and STAT5B binding in mouse mammary glands, respectively (Fig. 4b; Table S2). In addition, 1.44 % (MCF7) and 0.98 % (T47D) of the promoter regions of the repetitive elements were bound by GATA3. However, the importance of this binding for the activity of repetitive elements needs to be validated in the near future.

Fig. 4

Annotation of the common peaks. The distribution of the identified peaks by AutoChIP relative to genes and repetitive elements was estimated

Conclusion

Owing to the NGS technique, various molecular events can be captured and visualized by means of bioinformatic approaches. Among them, ChIP-seq has been widely used to detect genome-wide binding sites of proteins. Currently, more than 30 peak calling programs and several thousand ChIP-seq datasets have been reported. Incorporating the available data into an ongoing study can give rise to new biological insights. However, it is a daunting task for novice users to install programs and use them to detect genome-wide binding sites of proteins of interest. In addition, false positive peaks might be identified along with true positive peaks, regardless of statistical methods, due to problematic genomic regions, sequencing bias, inadequate statistical power, and insufficient sequencing depth. Through a series of analyses, we showed that the cocktail approach implemented in AutoChIP outperformed a single peak calling method. Using AutoChIP, all necessary steps including the installation of required programs, genome indexing, alignment, peak calling, and annotation of identified peaks can be done in one step. The easy-to-use GUI will help novice users to analyze their own and available ChIP-seq datasets. Understanding of genome-wide protein binding networks could be facilitated by using AutoChIP along with various other NGS-based methods such as RNA-seq and MNase-seq.