Abstract
Background
Genome-wide protein-DNA binding is popularly assessed using specific antibody pulldown in Chromatin Immunoprecipitation Sequencing (ChIP-Seq) or Cleavage Under Targets and Release Using Nuclease (CUT&RUN) sequencing experiments. These technologies generate high-throughput sequencing data that necessitate the use of multiple sophisticated, computationally intensive genomic tools to make discoveries, but these genomic tools often have a high barrier to use because of computational resource constraints.
Results
We present a comprehensive, infrastructure-independent, computational pipeline called SEAseq, which leverages field-standard, open-source tools for processing and analyzing ChIP-Seq/CUT&RUN data. SEAseq performs extensive analyses from the raw output of the experiment, including alignment, peak calling, motif analysis, promoters and metagene coverage profiling, peak annotation distribution, clustered/stitched peaks (e.g. super-enhancer) identification, and multiple relevant quality assessment metrics, as well as automatic interfacing with data in GEO/SRA. SEAseq enables rapid and cost-effective resource for analysis of both new and publicly available datasets as demonstrated in our comparative case studies.
Conclusions
The easy-to-use and versatile design of SEAseq makes it a reliable and efficient resource for ensuring high quality analysis. Its cloud implementation enables a broad suite of analyses in environments with constrained computational resources. SEAseq is platform-independent and is aimed to be usable by everyone with or without programming skills. It is available on the cloud at https://platform.stjude.cloud/workflows/seaseq and can be locally installed from the repository at https://github.com/stjude/seaseq.
Similar content being viewed by others
Background
Understanding where proteins localize on chromatin is an important component in many study designs. Selectively purifying the chromatin fragments with antibodies and sequencing the resulting material affords detection of these protein-DNA interactions. Chromatin occupancy data from ChIP-Seq or CUT&RUN-Seq experiments can support a range of biological conclusions using different analytical approaches, but each analysis usually requires intense computation and bioinformatics skills to execute. Basic processing of a chromatin binding dataset consists of several steps, including sequence read alignment to the reference genome, peak calling, annotation and visualization of peaks, coverage profiling, motif analysis, and most importantly quality control and assessment at each step. In general, the sequencing reads from a successful experiment are mapped to a reference genome, where read coverage roughly correlates with occupancy, and read-enriched regions are statistically determined. These regions or “peaks” can then be further mined to understand their functions and characteristics [1,2,3].
Multiple pieces of software exist for each analysis step, and attempts have been made to link these tools together in cohesive pipelines [4,5,6,7,8,9,10], as has been done for other analysis types [11,12,13,14,15]. However, these pipelines; summarized in Fig. 1, are limited by the types of analyses they can perform, often requiring substantial in-house computational infrastructure or expertise, which typically restricts their use to select research settings. Stemming from the vast number of existing tools and workflows, it remains challenging for new or seasoned researchers to decipher their usefulness and confidently determine relevant analyses.
To address these problems, we introduce a comprehensive and easy-to-use computational pipeline, “Single-End Antibody sequencing” (SEAseq), which takes advantage of enterprise-level workflow management tools to facilitate synonymous deployment across different computing infrastructures. SEAseq is a fully automated open-source pipeline that performs all the major analysis needed to process chromatin binding datasets ensuring high quality and useful results. It can be applied to data from any organism, and can automatically access publicly available data from the NCBI Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) repositories at the user’s request [16, 17]. The pipeline is designed to be platform-independent and scalable: it can be executed on a personal computer or in high-performance computing (HPC) environments. Additionally, to enable broader utility in resource-poor environments, SEAseq can also be accessed in a reliable and reasonably priced parallel cloud computing environment, available at https://platform.stjude.cloud/workflows/seaseq.
Implementation
SEAseq architecture
We were motivated to broaden the user base capable of performing chromatin sequencing analysis independent of computing infrastructure and expertise. To maximize the flexibility and portability of our pipeline, we adopted a workflow management system, Workflow Description Language (WDL) [18], and containerized the requisite tools, programs and SEAseq custom scripts to using Docker [19] (see Additional file 1 for the list of Docker images built for SEAseq).
The pipeline itself is not in a container platform due to its complexity; rather each step was compartmentalized into individual WDL tasks and sub-workflows to allow for independent execution of each task in an isolated environment with the use of Docker containers. Using WDL enables reorganization of individual tasks without having to redesign the entire pipeline. WDL also ensures efficient utilization and scaling of computing resources in a replicable and repeatable manner. Using Docker containers maximizes control and management of version configurations, software requirements and dependencies for improved portability and reproducibility. In addition, the choice to use WDL and Docker facilitates a standardized but customizable deployment across computing platforms, independent of the hardware infrastructure used to run SEAseq. The modular design of the pipeline allows experienced users to modify tools or steps in the pipeline, providing these modifications match the input/output schema formats used, by modifying the WDL code available on GitHub (https://github.com/stjude/seaseq).
SEAseq can be executed on a single computing node, such as a personal computer, or a parallel computing infrastructure including field-standard high-performance computing (HPC) clusters using compatible workflow execution engines such as Cromwell [20]. Complete usage instructions are outlined in the SEAseq documentation (https://github.com/stjude/seaseq/#readme). Three dependencies are required to run SEAseq: Java, Cromwell and an engine able to run Docker containers (such as Docker, Singularity [21]). Each of these packages can be installed with minimal user expertise.
SEAseq functionality
SEAseq performs the several fundamental ChIP-Seq or CUT&RUN analyses in a single execution. Figure 2 shows a schematic overview of the analyses performed by SEAseq, which are briefly described in the following paragraphs (see Additional file 2 for the expanded list of SEAseq pipeline steps and parameters).
FASTQ sequencing reads are stringently aligned to the reference genome provided using Bowtie [22]. The mapped reads are then further processed by removal of redundant reads [23] using SAMtools [24], and removal of reads in problematic regions or regions with significant background noise or artificially high signal [25] using BEDTools [26]. In addition, SEAseq characterizes the global binding preferences of the antibody using read density profiling in relevant genomic regions such as promoters and gene bodies using our custom version of BAMToGFF (https://github.com/stjude/BAM2GFF).
To compensate for the background noise intrinsic to chromatin binding assays, many analyses rely on identification of highly covered regions or peaks [3]. It is important to choose the appropriate peak-calling algorithm based on the type of protein targeted [27]. After read alignment and filtering steps are completed, SEAseq identifies enriched regions for two binding profiles: MACS [28] for factors that bind shorter regions, e.g. many sequence-specific transcription factors, and SICER [29] for broad regions of enrichment, e.g. some histone modifications. The choice of these peak callers for the identification of narrow peaks and broad peaks was based on extensive review of published benchmarking evaluations and our own previous work [30,31,32,33,34]. Normalized and unnormalized coverage files are generated for visualization on multiple genome browsers such as the UCSC genome browser [35] and IGV [36]. The pipeline also identifies stitched clusters of enriched regions and separates exceptionally signal-rich regions, e.g. super-enhancers, from typical enhancers using ROSE [37, 38]. SEAseq also performs motif discovery and enrichment analysis to characterize overrepresented sequences using tools from the MEME Suite [39], and performs genic annotation of the various peaks and quantifies regions of abundance in promoters, gene bodies, gene-centric windows, and proximal genes using BEDTools and custom Python scripts.
SEAseq quality metrics and dashboard
Though other pipelines perform coverage-based analyses, most generally lack a comprehensive means of assessing the overall quality of the experiment. SEAseq uniquely calculates an extensive set of quality metrics for detecting experimental issues, including ChIP-Seq metrics recommended by the ENCODE consortium [40]. The SEAseq quality metrics include the percentage of reads mapped, nonredundant fraction (NRF), fraction of reads in peaks (FRiP), strand correlation scores (NSC, RSC), library complexity (PCR bottleneck) and many other important metrics used to infer quality as listed in Table 1. To facilitate integration and easy interpretation of these metrics, we devised a five-scale color-rank flag system to visually inspect the performance of each metric, and provide a cross-metric averaged rank score to easily intuit the performance of the overall analysis (Fig. 3). The rank flag system is estimated based on recommended thresholds from the ENCODE consortium, literature review and criteria provided in Table 2. These quality metrics can be broadly grouped into two subsets: quality assessment on the sequencing library and assessment on the enrichment profiles observed, providing users an easy way to intuitively explore the performance or quality of their data. The results are exported in a tab-delimited file and color flagged in html format. Additional file 3 shows a typical quality statistics report produced by SEAseq.
SEAseq on the cloud
To facilitate the broadest usage of our pipeline, and to empower researchers with little to no computational skills or resources, we offer a cloud-based version of SEAseq, called SEAseq Cloud, that is hosted on the St. Jude Cloud Genomics Platform [41]. The St. Jude Cloud Genomics Platform leverages Microsoft Azure and DNAnexus (https://www.dnanexus.com) to provide a secure and privacy compliant framework for analysis, storage and distribution of genomic data [41]. The DNAnexus platform provides an easy-to-navigate graphical user interface for exploration and analysis of user data, which can be quickly and securely uploaded, downloaded, and shared with collaborators at a reasonably low cost. For more advanced operations, DNAnexus also provides a command-line client. SEAseq cloud is available from https://platform.stjude.cloud/workflows/seaseq.
The user needs a DNAnexus account to use SEAseq Cloud. SEAseq requires that the user data, such as the FASTQs and genome files, be uploaded to a project folder. Navigating the SEAseq Cloud web interface after logging in, as shown in Fig. 4, involves selecting the “Start” button for first-time users, then the “Launch This Tool” button (Fig. 4a). The user will then be navigated to the SEAseq Cloud Run Analysis Page to start an analysis (Fig. 4b). DNAnexus requires the user input data files, such as the FASTQ files and required genome files (if applicable), be uploaded to a Project before proceeding. Uploading data files can be done on a Projects page by selecting the Projects tab then the All Projects option to be directed to the Projects/Home page (Fig. 4c). The DNAnexus Projects page (also accessible from https://platform.dnanexus.com) will contain the vended SEAseq workflow in its self-titled project and St. Jude Reference Project files (Fig. 4c). The St. Jude Reference Project files contains some genome files acceptable by SEAseq for the user’s convenience. To begin analysis, select on the SEAseq Project to access the SEAseq workflow (Fig. 4d) and/or to upload user data files by clicking the “Add” button. It is recommended to use the New (also called the “Pannexin”) User Interface (UI) version, instead of the Classic/Legacy (“Membrane”) version. Further descriptions will be based on the New UI as shown in Fig. 4b, d.
To start an analysis using SEAseq, select the “SEAseq (St. Jude)” workflow, and the user will be navigated to the Run Analysis page (Fig. 4b), where one can specify their preferred “Execution Name” and “Execution Output Folder” in the “Analysis Settings” input fields. The previously uploaded user data will then be available for selection in the requested “Analysis Inputs” input fields. The required fields are the sequencing FASTQs or SRA accession numbers (SRRs), “Genome Reference” and “Gene Annotation” files. SEAseq also provides a list of genome files, “Host Genomes”, to select from (Fig. 4b). Optionally, one or more SRRs can be inputted into the “Sample SRRs” field for sample SRRs and/or the “Control SRRs” field for Input/Control SRRs. After the analysis inputs and settings has been specified to the user’s preference, the SEAseq analysis can be started by selecting “Start Analysis”. Once started, the status of the job can be monitored from the project folder under the “Monitor” tab (Fig. 4d) or SEAseq landing page by selecting the “View Results” button (Fig. 4a). When the analysis is completed, a notification message will be sent to the user’s registered email and the analysis results files can be viewed and downloaded from the initially specified “Execution Output Folder”. More information on SEAseq Inputs and Output directories and files can be found in Additional file 4.
Results and discussion
Replication of two ChIP-Seq studies using SEAseq
We re-analyzed ChIP-Seq data from two previously published studies to showcase the utility of SEAseq in performing relevant analyses in a single execution.
The first study probed functions of LIN28B in neuroblastoma and discovered it binds chromatin via interaction with the transcription factor ZNF143 (GEO accession: GSE138742) [42]. The dataset included LIN28B, ZINF143, and doxycycline-inducible engineered LIN28B ChIP-Seq data of BE2C cells. The analysis was executed using the SEAseq Cloud workflow. All required hg19 genome files and position weight matrices were first uploaded into a DNAnexus project. Each ChIP antibody SRR along with the corresponding Input DNA SRR were used as input, and the relevant genome files were selected to their required fields in SEAseq (see Additional file 5 for the accession numbers and genome files used). Analysis using SEAseq demonstrated the expected satisfactory overall scores for these samples’ read quality and quality of peaks (> 1) (Fig. 5a). Motif analysis identified the major binding consensus motif in both LIN28B and ZNF143 as reported in the original publication (Fig. 5b). In addition, given the ability of SEAseq to perform a large array of advanced downstream analysis, we also observed the expected genome-wide occupancy results, demonstrating preferential binding at promoter regions by LIN28B, ZNF143, and significantly with the doxycycline-induced LIN28B (Fig. 5e). The entire analysis of all three datasets was computed using the SEAseq Cloud platform, which successfully completed in an average of 28.45 wall clock hours costing an average of $24.14. Table 3 summarizes the total time and cost of analysis performed using SEAseq Cloud.
The second study profiled binding of the tumor suppressor p53 in normal human cells (GEO accession: GSE31558) [43]. To maximize comparability with the published results, we uploaded available hg18 reference genome data files, and the deposited study files, which include ChIP-Seq for p53 in IMR90 fibroblasts (see Additional file 5). SEAseq analysis showed an “AVERAGE” overall quality score; indicative from the sequencing depth, FRiP and the corresponding poor number of peaks identified (Fig. 5a). From this, we show that the analysis may benefit from higher sequencing coverage and depth. Nevertheless, we were able to obtain comparable results to the original publication, including the significantly enriched p53 binding motif (Fig. 5b). This analysis was successfully completed in 11 h costing $8.33 (Table 3).
These recapitulated results demonstrate the utility of SEAseq in analysis of antibody purification data.
SEAseq performance evaluation
To assess SEAseq performance and resource consumption, the aforementioned datasets using SEAseq Cloud were also analyzed on our IBM Spectrum LSF (v10.1.0.9) HPC cluster with Cromwell (v52) and Singularity (v3.8.0). Per-task analysis revealed the most time-consuming component of the pipeline is the metagene analysis step (completed in under 16 h). Most other steps were completed in under 1 h, with the exception of the read alignment and optional genome index construction steps, which each completed in 4–6 h. Figure 6 shows the time and memory consumption of the different steps. In addition, steps displaying the highest memory consumption were the metagene analysis, broad peaks identification, alignment, and genome index construction. Given the genome index construction step is computationally intensive, we recommend that users avoid this optional step by providing the bowtie genome indexes when analyzing multiple datasets. Overall, the different steps show a maximum memory usage of under 7 GB. This performance showcases the efficiency of SEAseq in appropriating tasks in a timely and memory efficient manner, and thus makes the pipeline deployable on most computing environments. Furthermore, the analysis also reveals the size of the FASTQs has a proportional effect on time and memory consumption.
Conclusion
To our knowledge, SEAseq is the first portable, all-inclusive analysis pipeline for ChIP-Seq and CUT&RUN data. SEAseq is easy to use and provides a modular architecture for quick integration and customization for efficient data analysis on multiple computing infrastructures. Having proved highly comparable results to published datasets, we believe SEAseq fulfills a critical requirement as an efficient and reliable one-stop computational pipeline for high quality analysis results.
Availability and requirements
-
Project name: SEAseq
-
Project home page: https://github.com/stjude/seaseq
-
Project cloud page: https://platform.stjude.cloud/workflows/seaseq
-
Operating system(s): Platform independent
-
Programming language: WDL, Docker
-
Other requirements: Java, Cromwell (when using the GitHub version)
-
License: Apache License 2.0
-
Any restrictions to use by non-academics: None
Availability of data and materials
The developed software is freely available on GitHub at https://github.com/stjude/seaseq. All datasets and genome files analyzed during this study are publicly available and information on how they were obtained are included in this published article (Additional file 5).
Change history
14 March 2022
The initials were placed incorrectly in the citation. The article has been updated to rectify the error.
Abbreviations
- GEO:
-
Gene Expression Omnibus
- SRA:
-
Sequence Read Archive
- WDL:
-
Workflow Description Language
- UI:
-
User Interface
- HPC:
-
High-Performance Computing
References
Nakato R, Sakata T. Methods for ChIP-seq analysis: a practical workflow and advanced applications. Methods. 2021;187:44–53. https://doi.org/10.1016/j.ymeth.2020.03.005.
Nakato R, Shirahige K. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Brief Bioinform. 2016;18:bbw023. https://doi.org/10.1093/bib/bbw023.
Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009;10:669–80. https://doi.org/10.1038/nrg2641.
Orlova NN, Bogatova OV, Orlov AV. High-performance method for identification of super enhancers from ChIP-Seq data with configurable cloud virtual machines. MethodsX. 2020. https://doi.org/10.1016/j.mex.2020.101165.
Zhu Q, Liu N, Orkin SH, Yuan G-C. CUT&RUNTools: a flexible pipeline for CUT&RUN processing and footprint analysis. Genome Biol. 2019;20:192. https://doi.org/10.1186/s13059-019-1802-4.
Han BW, Wang W, Zamore PD, Weng Z. piPipes: a set of pipelines for piRNA and transposon analysis via small RNA-seq, RNA-seq, degradome- and CAGE-seq, ChIP-seq and genomic DNA sequencing. Bioinformatics. 2015;31:593–5. https://doi.org/10.1093/BIOINFORMATICS/BTU647.
Yan H, Evans J, Kalmbach M, Moore R, Middha S, Luban S, et al. HiChIP: a high-throughput pipeline for integrative analysis of ChIP-Seq data. BMC Bioinform. 2014;15:280. https://doi.org/10.1186/1471-2105-15-280.
Liu T, Ortiz JA, Taing L, Meyer CA, Lee B, Zhang Y, et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol. 2011;12:R83. https://doi.org/10.1186/gb-2011-12-8-r83.
Qin Q, Mei S, Wu Q, Sun H, Li L, Taing L, et al. ChiLin: a comprehensive ChIP-seq and DNase-seq quality control and analysis pipeline. BMC Bioinform. 2016;17:404. https://doi.org/10.1186/s12859-016-1274-4.
Tang M. pyflow-ChIPseq: a snakemake based ChIP-seq pipeline. 2017. https://zenodo.org/record/819971.
Zhang X, Jonassen I. RASflow: an RNA-Seq analysis workflow with Snakemake. BMC Bioinform. 2020;21:1–9. https://doi.org/10.1186/S12859-020-3433-X/TABLES/2.
Garrido-Rodriguez M, Lopez-Lopez D, Ortuno FM, Peña-Chilet M, Muñoz E, Calzado MA, et al. A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways. PLoS Comput Biol. 2021;17: e1008748. https://doi.org/10.1371/JOURNAL.PCBI.1008748.
D’Antonio M, De Meo PDO, Pallocca M, Picardi E, D’Erchia AM, Calogero RA, et al. RAP: RNA-Seq analysis pipeline, a new cloud-based NGS web application. BMC Genom. 2015;16:1–11. https://doi.org/10.1186/1471-2164-16-S6-S3/FIGURES/2.
Cameron CJF, Cameron CJF, Wang XQD, Dostie J, Blanchette M. LAMPS: an analysis pipeline for sequence-specific ligation-mediated amplification reads. BMC Res Notes. 2020;13:1–4. https://doi.org/10.1186/S13104-020-05106-1/FIGURES/1.
Banerjee S, Bhandary P, Woodhouse M, Sen TZ, Wise RP, Andorf CM. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinform. 2021;22:205. https://doi.org/10.1186/s12859-021-04120-9.
Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011;39(Database):D19-21. https://doi.org/10.1093/nar/gkq1019.
Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–10. https://doi.org/10.1093/NAR/30.1.207.
OpenWDL. https://openwdl.org/.
Docker. https://www.docker.com/.
Cromwell. https://cromwell.readthedocs.io/en/stable/.
Kurtzer GM, Sochat V, Bauer MW. Singularity: scientific containers for mobility of compute. PLoS ONE. 2017;12: e0177459. https://doi.org/10.1371/JOURNAL.PONE.0177459.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Dozmorov MG, Adrianto I, Giles CB, Glass E, Glenn SB, Montgomery C, et al. Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data. BMC Bioinform. 2015;16:1–11. https://doi.org/10.1186/1471-2105-16-S13-S10.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Amemiya HM, Kundaje A, Boyle AP. The ENCODE blacklist: identification of problematic regions of the genome. Sci Rep. 2019;9:9354. https://doi.org/10.1038/s41598-019-45839-z.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. https://doi.org/10.1093/bioinformatics/btq033.
Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, et al. Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol. 2013. https://doi.org/10.1371/journal.pcbi.1003326.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008. https://doi.org/10.1186/gb-2008-9-9-r137.
Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009;25:1952–8. https://doi.org/10.1093/bioinformatics/btp340.
Steinhauser S, Kurzawa N, Eils R, Herrmann C. A comprehensive comparison of tools for differential ChIP-seq analysis. Brief Bioinform. 2016;17:953–66. https://doi.org/10.1093/BIB/BBV110.
Starmer J, Magnuson T. Detecting broad domains and narrow peaks in ChIP-seq data with hiddenDomains. BMC Bioinform. 2016;17:1–10. https://doi.org/10.1186/S12859-016-0991-Z/FIGURES/4.
Laczik M, Hendrickx J, Veillard AC, Tammoh M, Marzi S, Poncelet D. Iterative fragmentation improves the detection of ChIP-seq peaks for inactive histone marks. Bioinform Biol Insights. 2016;10:209. https://doi.org/10.4137/BBI.S40628.
Jeon H, Lee H, Kang B, Jang I, Roh TY. Comparative analysis of commonly used peak calling programs for ChIP-Seq analysis. Genom Inform. 2020;18:1–9. https://doi.org/10.5808/GI.2020.18.4.E42.
Wilbanks EG, Facciotti MT. Evaluation of algorithm performance in ChIP-Seq peak detection. PLoS ONE. 2010. https://doi.org/10.1371/JOURNAL.PONE.0011471.
Kuhn RM, Haussler D, James KW. The UCSC genome browser and associated tools. Brief Bioinform. 2013;14:144–61. https://doi.org/10.1093/bib/bbs038.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–92. https://doi.org/10.1093/bib/bbs017.
Lovén J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, et al. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell. 2013;153:320–34. https://doi.org/10.1016/j.cell.2013.03.036.
Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, et al. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013;153:307–19. https://doi.org/10.1016/j.cell.2013.03.035.
Bailey TL, Johnson J, Grant CE, Noble WS. The MEME suite. Nucleic Acids Res. 2015;43:W39-49.
Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012;22:1813–31. https://doi.org/10.1101/gr.136184.111.
McLeod C, Gout AM, Zhou X, Thrasher A, Rahbarinia D, Brady SW, et al. St. Jude cloud: a pediatric cancer genomic data-sharing ecosystem. Cancer Discov. 2021;11:1082–99. https://doi.org/10.1158/2159-8290.cd-20-1230.
Tao T, Shi H, Mariani L, Abraham BJ, Durbin AD, Zimmerman MW, et al. LIN28B regulates transcription and potentiates MYCN-induced neuroblastoma through binding to ZNF143 at target gene promotors. Proc Natl Acad Sci U S A. 2020;117:16516–26. https://doi.org/10.1073/pnas.1922692117.
Botcheva K, McCorkle SR, McCombie WR, Dunn JJ, Anderson CW. Distinct p53 genomic binding patterns in normal and cancer-derived human cells. Cell Cycle. 2011;10:4237–49. https://doi.org/10.4161/cc.10.24.18383.
Acknowledgements
We would like to thank Brian Curran, Andrew Thrasher, Andrew Frantz, and the broader St. Jude Cloud team for their generous support in making the workflow accessible on the St. Jude cloud platform and DNAnexus. We wish to thank all the users who provided critical feedback.
Funding
This work has been supported by funding from the American Lebanese Syrian Associated Charities (ALSAC) and the St. Jude Children’s Research Hospital Collaborative Research Consortium on Chromatin Regulation in Pediatric Cancer. The funding body had no role in the design of the study and collection, analysis and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
M.O.A. designed and developed SEAseq and drafted the manuscript. B.J.A. supervised the work, gave feedback, and revised the manuscript. Both authors performed extensive testing of the pipeline and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
M.O.A. has no competing interests. B.J.A. is a shareholder in Syros Pharmaceuticals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. SEAseq Docker images. List of Docker images built for SEAseq.
Additional file 2
. SEAseq pipeline steps and parameters. Detailed description of SEAseq pipeline steps and parameters.
Additional file 3
. Case study LIN28B. The complete HTML quality statistics report for LIN28B ChIP-seq analysis.
Additional file 4
. SEAseq inputs and outputs. Input files used for SEAseq analysis and detailed descriptions of the Output directories and files generated using SEAseq.
Additional file 5
. Case study datasets. Case Study Datasets and Genome Files information.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Adetunji, M.O., Abraham, B.J. SEAseq: a portable and cloud-based chromatin occupancy analysis suite. BMC Bioinformatics 23, 77 (2022). https://doi.org/10.1186/s12859-022-04588-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-022-04588-z