The GEA pipeline for characterizing Escherichia coli and Salmonella genomes

Dickey, Aaron M.; Schmidt, John W.; Bono, James L.; Guragain, Manita

doi:10.1038/s41598-024-63832-z


Article
Open access
Published: 10 June 2024

Volume 14, article number 13257, (2024)

Scientific Reports


Aaron M. Dickey¹,
John W. Schmidt¹,
James L. Bono¹ &
…
Manita Guragain²


Abstract

Salmonella enterica and Escherichia coli are major food-borne human pathogens, and their genomes are routinely sequenced for clinical surveillance. Computational pipelines designed for analyzing pathogen genomes should both utilize the most current information from annotation databases and increase the coverage of these databases over time. We report the development of the GEA pipeline to analyze large batches of E. coli and S. enterica genomes. The GEA pipeline takes paired Illumina raw read files as input, assembles them, and annotates the resulting assemblies. Alternatively, assemblies can be provided as input and annotated directly. The pipeline provides predictive genome annotations for E. coli and S. enterica with a focus on the Center for Genomic Epidemiology tools. Annotation results are provided as a tab-delimited text file. The GEA pipeline is designed for large-scale E. coli and S. enterica genome assembly and characterization using the Center for Genomic Epidemiology command-line tools and high-performance computing. Large-scale annotation is demonstrated by an analysis of more than 14,000 Salmonella genome assemblies. Testing the GEA pipeline on E. coli raw reads demonstrates reproducibility across multiple compute environments, and computational usage is optimized on high-performance computers.


Introduction

Salmonella enterica (hereafter Salmonella) are estimated to cause at least 1 million illnesses in the United States each year¹. Escherichia coli are ubiquitous in a wide variety of environments relevant to food safety, including food animal gastrointestinal systems, animal production sites, human gastrointestinal systems, meats, and manure-impacted soils. A small but clinically important subset of E. coli are pathogenic. The ubiquitous nature of E. coli contributes to their relevance beyond food safety. Due to their prominence and small genome size, Salmonella and E. coli are also two of the top organisms with available whole genome sequencing read archives (https://www.ncbi.nlm.nih.gov/sra?term=(%22public%22%5BAccess%5D)%20AND%20%22genomic%22%5BSource%5D) and assemblies (https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/). Such large datasets often rely on high-performance computing to accelerate computational tasks via increased RAM, threads, and parallelization². Large datasets can benefit from data analysis pipelines, which process many input files with an initial set of user specifications and distill the results to a small number of organized outputs for interpretation³.

Useful pipelines for epidemiologic annotation should take advantage of the most up-to-date reference information available. Actively curated reference databases meet this need by rapidly incorporating newly released genomic data. The interplay between pipeline and database can be thought of as a positive feedback loop wherein (1) running the pipeline on new strains improves the database coverage and quality by exposing knowledge gaps, and (2) the database improvement leads to more accurate search hits when running the pipeline on new strains.

Here, we introduce the GEA pipeline. GEA stands for Gammaproteobacteria Epidemiologic Annotation. Analyses central to the GEA pipeline are those using tools developed by the Center for Genomic Epidemiology (CGE)⁴. The databases searched by these tools are updated frequently, facilitating the positive feedback loop between our pipeline and these databases. Several existing pipelines utilize CGE-developed tools⁵,⁶,⁷,⁸,⁹, but we are unaware of any other published pipeline that uses FimTyper¹⁰, MLST¹¹, PlasmidFinder¹², ResFinder¹³, SerotypeFinder¹⁴, and VirulenceFinder¹⁵ in tandem.

Another important feature of the pipeline is the use of a container. Containers allow for compute mobility¹⁶ and provide an increased level of reproducibility¹⁷. The container housing the software tools for running the GEA pipeline also takes advantage of the Scientific File System (SCIF)¹⁸, providing independent mount points to different apps in the container with incompatible environmental requirements.

Prior versions of the GEA pipeline have been used in published research¹⁹,²⁰,²¹ in food safety, risk assessment, antimicrobial resistance gene transfer, and virulence research at the US Meat Animal Research Center. This demonstrates the utility of the pipeline and the benefit of the distilled annotation summary across hundreds of genomes. This paper describes the methods used for creating the pipeline and provides a kilo-scale demonstration of the pipeline on S. enterica assemblies. The pipeline is available from github.com/Phylloxera/GEA-dev.

Results

Computer resource usage

Table 1 provides the usage summary for testing. The run time varied from 5.04 to 34.56 h (3.15 to 21.6 min per E. coli library). Assembly accounted for 89–95% of pipeline run time.
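The per-library figures follow from dividing each total run time by the 96 E. coli libraries of the test dataset. A minimal sketch of that arithmetic (the function name is illustrative and not part of GEA):

```python
def minutes_per_library(total_hours: float, n_libraries: int) -> float:
    """Convert a total run time in hours to minutes per library."""
    return total_hours * 60 / n_libraries


# Shortest and longest total run times reported for the 96-library test set.
for hours in (5.04, 34.56):
    rate = minutes_per_library(hours, 96)
    print(f"{hours} h -> {rate:.2f} min per library")
```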

Table 1 Computational resources used for GEA Pipeline testing.


Output

The principal output of the GEA pipeline is the tab-delimited file, metadata.txt, which can be opened in a spreadsheet program such as Excel. The number of rows corresponds to the number of assemblies annotated and the number of columns is dataset dependent. The annotation quality is expected to improve as the CGE database coverage improves over time. The metadata from the test dataset of 96 E. coli raw read libraries contained 444 columns and the metadata from the demonstration dataset of 14,310 Salmonella assemblies had 597 columns. Table 2 summarizes the components of the two metadata files. Tables 3 and 4 provide single annotation snapshots from VirulenceFinder¹⁵ (E. coli) and ResFinder¹³ (Salmonella), respectively, while Table 5 provides a snapshot of the All Tools Summary from E. coli. Table 5 suggests a relationship between fim type and Antimicrobial Resistance Gene (ARG) content for the 96 E. coli libraries test dataset. The complete summary output file (GEA_ecoli_test_Ceres_metadata.txt) from USDA Ceres²² is provided in Supplementary Data S1.
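Because the summary is plain tab-delimited text, it can also be inspected programmatically. The sketch below is one way to do so with the Python standard library; it assumes the first line of the file is a header row, and the column names and values in the stand-in data are invented for illustration rather than taken from a real metadata.txt.

```python
import csv
import io


def load_metadata(handle):
    """Read a tab-delimited summary into a list of dicts, one per assembly."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Tiny stand-in for metadata.txt; column names and values are hypothetical.
example = io.StringIO(
    "assembly\tMLST\tfimH\n"
    "strain_A\tST10\tfimH54\n"
    "strain_B\tST131\tfimH30\n"
)
rows = load_metadata(example)
print(len(rows), "assemblies,", len(rows[0]), "columns")
```

For a file on disk, the same function can be called as `load_metadata(open("metadata.txt", newline=""))`.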

Table 2 Summary of the sections of the GEA Pipeline test data and demonstration data tabular output.


Table 3 A 7-column VirulenceFinder annotation from a three assembly subset of the 96 E. coli libraries test data (Supplementary Data S1).


Table 4 A 7-column ResFinder annotation from a 4 assembly subset of the 14,310 Salmonella assemblies demonstration data (Supplementary Data S1).


Table 5 A 3-column portion of the GEA Pipeline All Tools Summary for the 96 E. coli libraries test data (Supplementary Data S1) showing acquired antimicrobial resistance gene content by fim type.


Demonstration

The pipeline ran on the 14,310 S. enterica assemblies in ~72 h (~18 s per S. enterica assembly) on the USDA/Mississippi State University Atlas cluster²². The complete summary output file is in GEA_senterica_demo_Atlas_metadata.txt in Supplementary Data S1.
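The per-assembly rate is the total wall time divided by the number of assemblies. As a quick check using the rounded ~72 h figure:

```latex
\[
\frac{72\ \text{h} \times 3600\ \text{s/h}}{14{,}310\ \text{assemblies}}
= \frac{259{,}200\ \text{s}}{14{,}310}
\approx 18.1\ \text{s per assembly}
\]
```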

Discussion

A diverse set of bioinformatic tools has been developed for phenotypic prediction based on genomic data, especially for human pathogens. These tools often grow out of the requirements sought by a group of researchers. In the case of GEA, these included the desire to assemble and run CGE tools at the command-line on large numbers of strains with a single summary output and to have genome assemblies in a single directory ready for submission to NCBI.

Users have reported different results with the same input data, sometimes with analyses conducted many months apart. This is suspected to be caused by updates in the actively curated CGE databases, and this has been confirmed in some instances. The user therefore has the option either to update their local copy of the CGE databases and use the most up-to-date versions, or to leave the databases static for reproducibility across independent computational runs. Reproducibility has long been an aspiration of scientific analysis; however, database-dependent analyses may demonstrably benefit from non-reproducibility as database coverage increases with the passage of time.

In our testing phase, we sought reproducibility across multiple computational environments, and this was largely achieved. The only difference in the outputs across high-performance computers was due to the contig names assigned by shovill (https://github.com/tseemann/shovill). In all these cases, the length, coverage, and Pilon²³ name were identical, whereas the shovill-assigned contig integers differed by 1. Furthermore, shovill contig names included the date assembled, an additional possible source of discrepancy for runs taking place on different days. For example, "contig00205 len=509 cov=31.1 corr=0 origname=NODE_348_length_509_cov_31.119617_pilon sw=shovill-spades/1.1.0 date=20231107" on Ceres vs "contig00204 len=509 cov=31.1 corr=0 origname=NODE_348_length_509_cov_31.119617_pilon sw=shovill-spades/1.1.0 date=20231101" on Moose (the two names differ only in the contig integer and the date; example from row 15, column 128 of GEA_ecoli_test_Ceres_metadata.txt in Supplementary Data S1). The identical coverage and Pilon designation indicate that, in all cases, these were identical contigs, but that the final integer contig name assignment in shovill may not be deterministic. Additionally, the E. coli test data Assembly_bp and Ncontigs statistics were identical across computing environments. The other discrepancy was caused by insufficient memory being available to Skesa²⁴ on the Desktop computer, which caused two libraries not to assemble and resulted in missing Skesa plasmid annotations (rows 33 and 35, columns 21 and 22 of GEA_ecoli_test_Ceres_metadata.txt in Supplementary Data S1). Importantly, these assembly failures were documented by the GEA pipeline log, which alerted the Desktop user. Apart from these two discrepancy sources (shovill final contig naming and Skesa memory requirements not being met), we expect that GEA will provide reproducible results for a given version of the pipeline and databases.
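When comparing outputs from different environments, the run-specific parts of a shovill contig name can be set aside before testing for equality. The following is a minimal sketch of that idea, not part of GEA: the helper name is hypothetical, and the two names are those of the Ceres and Moose example, written in key=value form.

```python
import re


def contig_identity(header: str) -> str:
    """Drop the run-specific parts of a shovill contig name.

    Removes the leading contigNNNNN label and the date field, the only
    two fields reported to differ between compute environments.
    """
    header = re.sub(r"^>?contig\d+\s+", "", header.strip())
    header = re.sub(r"\s*date\s*=\s*\d+", "", header)
    return header


ceres = ("contig00205 len=509 cov=31.1 corr=0 "
         "origname=NODE_348_length_509_cov_31.119617_pilon "
         "sw=shovill-spades/1.1.0 date=20231107")
moose = ("contig00204 len=509 cov=31.1 corr=0 "
         "origname=NODE_348_length_509_cov_31.119617_pilon "
         "sw=shovill-spades/1.1.0 date=20231101")

print(ceres == moose)                                    # raw names differ
print(contig_identity(ceres) == contig_identity(moose))  # same contig
```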

GEA has clear advantages and limitations relative to tools with similar goals. First, long-available tools such as nullarbor (https://github.com/tseemann/nullarbor) and TORMES⁷ have the distinction of an active user base, citations, and more time under development. Bacannot²⁵ is a newer tool that is container-based like GEA. Software containerization increases reproducibility over OS-specific source-compile-install-run and cross-platform package manager methodologies¹⁷. RSYD-BASIC²⁶ is also a newer tool, and it produces a tabular output somewhat like that produced by GEA. The lack of a web server option is a limitation of GEA. However, having a batch command-line implementation of CGE tools was a central functionality driving the development of GEA. GEA also lacks phylogenetic methods, except to the extent that typing predictions are phylogenetically informative. Currently, GEA is only indicated for E. coli and S. enterica. The strongest advantages of GEA are a single dependency (Apptainer²⁷), batch processing in an HPC environment proven by hundreds of successful analyses of Illumina raw read libraries²¹, and the kilo-scale processing rate of ~18 s per S. enterica assembly demonstrated in this report.

Shovill and Skesa can both handle low levels of contamination. The pipeline has been tested with a diversity of libraries from three different Illumina sequencing platforms at the USMARC Core Lab, but exhaustive testing on the types and degrees of contamination has not been conducted. Regardless, preprocessing or quality control of libraries should be unnecessary.

The Gammaproteobacteria name derivation of the GEA pipeline implies a much broader set of pathogens than are currently included. One noticeable impact of incorporating new species is a substantially smaller set of annotations for the added species, at least initially. This is because all other species have fewer applicable CGE tools relative to E. coli. This is surmountable in instances where there is a non-CGE tool available for the task (e.g. GEA uses SeqSero2²⁸ for serotyping Salmonella since the CGE tool, SerotypeFinder¹⁴, does not serotype Salmonella). Other enhancements that could be incorporated into GEA in the future include long-read sequence inputs, identification of additional genotypes/phenotypes as new CGE tools are released, and downsampling reads to accelerate Skesa assembly as is done with shovill. GEA is now available to other users with institutional HPCs for rapid characterization of large batches of E. coli and Salmonella genomes from diverse sample sources. GEA is available for download from github.com/Phylloxera/GEA-dev.

Methods

Pipeline

The GEA pipeline implements the following steps in sequential order (Fig. 1). GEA processes the user command-line options and inputs. If the inputs are raw reads, the workflow proceeds to Assembly.

1. Assembly is first carried out by Shovill (https://github.com/tseemann/shovill) followed by SKESA²⁴. Skesa is used for identifying complete plasmids and can circularize some novel small plasmids not yet in the PlasmidFinder database.
2. Epidemiologic Prediction is conducted on shovill assemblies or user-supplied fasta contigs. Local copies of the CGE tool and 5 loci databases (https://github.com/Phylloxera/5loci) are updated by the pipeline unless otherwise specified by the user. The databases are stored locally in the user's home directory (e.g. $HOME/share/resfinder). BLAST²⁹ is used as the search method for the CGE tools and to query the custom 5 loci databases.
  a. PlasmidFinder¹² (CGE) is run.
  b. MLST¹¹ (CGE) is run. If the user does not specify the taxon, MLST is run against both databases and the species is determined by GEA.
  c. Serotyping is conducted using SerotypeFinder¹⁴ (CGE, E. coli) or SeqSero2²⁸ (Salmonella).
  d. ResFinder 4.0¹³ (CGE) is run against the resfinder_db and pointfinder_db databases.
  e. The 5 loci databases are queried with BLAST.
  f. VirulenceFinder¹⁵ (CGE, E. coli) is run.
  g. Ezclermont³⁰ (E. coli) is run.
  h. FimTyper¹⁰ (CGE, E. coli) is run.
3. The results are compiled and written to the tab-delimited output file, metadata.txt. GC and N50 summary statistics are calculated by stats.sh (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/statistics-guide/) during results compilation, and additional summary statistics are extracted from the shovill logs.
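The taxon-dependent branching in the Epidemiologic Prediction step can be summarized as a small lookup. The sketch below is illustrative only and is not taken from the GEA source: the label "senterica" matches the -t option reported for the demonstration run, "ecoli" is an assumed label, and steps that the list does not tag with a taxon are assumed to run for both organisms.

```python
# Annotation steps in the order listed above, with the taxa each applies to.
# Untagged steps are ASSUMED to apply to both taxa; "ecoli" is an assumed label.
BOTH = {"ecoli", "senterica"}
STEPS = [
    ("PlasmidFinder", BOTH),
    ("MLST", BOTH),
    ("SerotypeFinder", {"ecoli"}),
    ("SeqSero2", {"senterica"}),
    ("ResFinder", BOTH),
    ("5 loci BLAST", BOTH),
    ("VirulenceFinder", {"ecoli"}),
    ("EzClermont", {"ecoli"}),
    ("FimTyper", {"ecoli"}),
]


def annotation_plan(taxon):
    """Return the ordered list of annotation steps for one taxon."""
    if taxon not in BOTH:
        raise ValueError(f"unsupported taxon: {taxon!r}")
    return [name for name, taxa in STEPS if taxon in taxa]


print(annotation_plan("senterica"))
print(annotation_plan("ecoli"))
```

Keeping the steps in a single ordered table makes the smaller Salmonella annotation set visible at a glance.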

GEA is written in Bash. The current version has several new features relative to prior development iterations used in previous work.

1. The user can specify the taxon.
2. The user can specify whether to update their local copy of the databases.
3. The input data can be raw gzipped paired reads (fastq) or genome assemblies (fasta).
4. The following new tools have been added: Ezclermont, FimTyper, VirulenceFinder, and the 5 loci databases.
5. The container recipe utilizes SCIF¹⁸ for software environment modularity inside the software container.
6. The container and pipeline are made available via https://github.com/Phylloxera/GEA-dev.
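The two accepted input types can be told apart by file extension. The sketch below shows one way to do that; it is not GEA's own logic, which may instead rely on a user option, and the extension lists are assumptions.

```python
from pathlib import Path

# Extension lists are assumptions for illustration.
READ_SUFFIXES = (".fastq.gz", ".fq.gz")
ASSEMBLY_SUFFIXES = (".fasta", ".fa", ".fna")


def input_kind(filenames):
    """Classify a batch of input files as 'reads' or 'assemblies'."""
    names = [Path(f).name.lower() for f in filenames]
    if names and all(n.endswith(READ_SUFFIXES) for n in names):
        return "reads"
    if names and all(n.endswith(ASSEMBLY_SUFFIXES) for n in names):
        return "assemblies"
    raise ValueError("expected only gzipped fastq reads or only fasta assemblies")


print(input_kind(["s1_R1.fastq.gz", "s1_R2.fastq.gz"]))
print(input_kind(["s1.fasta", "s2.fna"]))
```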

The shovill pipeline utilizes SPAdes³¹, Velvet³², Lighter³³, FLASh³⁴, SAMtools³⁵, BWA-MEM³⁶, KMC³⁷, seqtk (https://github.com/lh3/seqtk), pigz (https://zlib.net/pigz/), Pilon²³, Trimmomatic³⁸ and samclip (https://github.com/tseemann/samclip). The container with the needed software was created using Apptainer²⁷.

Testing

GEA was tested on 3 Linux high-performance computers and a desktop computer with Hyper-V enabled on Windows 10 Professional (Table 1). The data used in testing were a single plate of 96 Illumina E. coli raw read libraries from a long-term evolutionary study. Testing utilized the -u F option to query identical versions of the CGE databases and evaluate reproducibility across the compute environments.

Demonstration

To demonstrate the pipeline at kilo-scale, GEA was run on 14,310 Salmonella enterica genome assemblies released during October 2023. The assemblies were downloaded on November 6, 2023 using datasets (https://www.ncbi.nlm.nih.gov/datasets) with download genome options: taxon 28901, --include genome, --exclude-atypical, --released-after 10/1/2023, --released-before 10/31/2023, --assembly-source GenBank, and --dehydrated. Fasta files were moved to a single folder to be used as input. GEA was run on the Atlas high-performance computer system of Mississippi State University and the US Department of Agriculture²² on November 11, 2023, with options: -t senterica, -u F, -r 336:00:00, -m 360G, and -c 48. Initial tests predicted a run time of 2–5 days.
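For readers who want to repeat the download, the options listed above can be assembled into a single NCBI datasets command line. The sketch below only builds and prints the command; the helper name is hypothetical, and actually running the command requires the datasets program to be installed.

```python
import shlex


def datasets_download_command(taxon_id, released_after, released_before):
    """Build the NCBI datasets argument list described in the text."""
    return [
        "datasets", "download", "genome", "taxon", str(taxon_id),
        "--include", "genome",
        "--exclude-atypical",
        "--released-after", released_after,
        "--released-before", released_before,
        "--assembly-source", "GenBank",
        "--dehydrated",
    ]


cmd = datasets_download_command(28901, "10/1/2023", "10/31/2023")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without shell quoting concerns.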

Data availability

Underlying data: The test dataset used during the current study can be made available from the corresponding author on reasonable request. The demonstration dataset used during the current study is publicly available from the National Center for Biotechnology Information. The assemblies can be downloaded using the publicly available software, datasets (https://www.ncbi.nlm.nih.gov/datasets), with download genome options: taxon 28901, --include genome, --exclude-atypical, --released-after 10/1/2023, --released-before 10/31/2023, --assembly-source GenBank, and --dehydrated. Extended data: USDA-NAL: Supplementary Data S1. This project contains the following extended data: GEA_ecoli_test_Ceres_metadata.txt, the GEA Pipeline summary output (metadata.txt) for 96 E. coli Illumina libraries used as test data; and GEA_senterica_demo_Atlas_metadata.txt, the GEA Pipeline summary output (metadata.txt) for 14,310 S. enterica assemblies used as demonstration data. Extended data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Code availability

The full pipeline (as an apptainer container) and files required for building the pipeline from scratch are available from: [https://github.com/Phylloxera/GEA-dev]. The version used in all analyses herein is 0.0.1-alpha.1. GEA License: [https://github.com/Phylloxera/GEA-dev/blob/main/LICENSE]. The 5 loci BLAST databases are available from: [https://github.com/Phylloxera/5loci]. 5 loci License: [https://github.com/Phylloxera/5loci/blob/main/LICENSE].

References

1. Scallan, E. et al. Foodborne illness acquired in the United States–major pathogens. Emerg. Infect. Dis. 17, 7–15. https://doi.org/10.3201/eid1701.P11101 (2011).
2. Fjukstad, B. & Bongo, L. A. A review of scalable bioinformatics pipelines. Data Sci. Eng. 2, 245–251. https://doi.org/10.1007/s41019-017-0047-z (2017).
3. Leipzig, J. A review of bioinformatic pipeline frameworks. Brief. Bioinform. 18, 530–536. https://doi.org/10.1093/bib/bbw020 (2017).
4. Center for Genomic Epidemiology Repositories [internet]. [cited 29 November 2023]. https://bitbucket.org/genomicepidemiology/workspace/repositories/
5. Chukamnerd, A. et al. BacSeq: A user-friendly automated pipeline for whole-genome sequence analysis of bacterial genomes. Microorganisms 11, 1769. https://doi.org/10.3390/microorganisms11071769 (2023).
6. Couvin, D., Stattner, E., Segretier, W., Cazenave, D. & Rastogi, N. simpiTB–a pipeline designed to extract meaningful information from whole genome sequencing data of Mycobacterium tuberculosis complex, allows to combine genomic, phylogenetic and clustering analyses in existing SITVIT databases. Infect. Genet. Evol. 113, 105466. https://doi.org/10.1016/j.meegid.2023.105466 (2023).
7. Quijada, N. M., Rodríguez-Lázaro, D., Eiros, J. M. & Hernández, M. TORMES: An automated pipeline for whole bacterial genome analysis. Bioinformatics 35, 4207–4212. https://doi.org/10.1093/bioinformatics/btz220 (2019).
8. Thomsen, M. C. F. et al. A bacterial analysis platform: An integrated system for analysing bacterial whole genome sequencing data for clinical diagnostics and surveillance. PloS One 11, e0157718. https://doi.org/10.1371/journal.pone.0157718 (2016).
9. Xavier, B. B. et al. BacPipe: A rapid, user-friendly whole-genome sequencing pipeline for clinical diagnostic bacteriology. iScience 23, 100769. https://doi.org/10.1016/j.isci.2019.100769 (2020).
10. Roer, L. et al. Development of a web tool for Escherichia coli subtyping based on fimH alleles. J. Clin. Microbiol. 55, 2538–2543. https://doi.org/10.1128/jcm.00737-17 (2017).
11. Larsen, M. V. et al. Multilocus sequence typing of total-genome-sequenced bacteria. J. Clin. Microbiol. 50, 1355–1361. https://doi.org/10.1128/jcm.06094-11 (2012).
12. Carattoli, A. & Hasman, H. PlasmidFinder and in silico pMLST: Identification and typing of plasmid replicons in whole-genome sequencing (WGS). Horiz. Gene Transf. 285–294. https://doi.org/10.1007/978-1-4939-9877-7_20 (2020).
13. Bortolaia, V. et al. ResFinder 4.0 for predictions of phenotypes from genotypes. J. Antimicrob. Chemoth. 75, 3491–3500. https://doi.org/10.1093/jac/dkaa345 (2020).
14. Joensen, K. G., Tetzschner, A. M. M., Iguchi, A., Aarestrup, F. M. & Scheutz, F. Rapid and easy in silico serotyping of Escherichia coli isolates by use of whole-genome sequencing data. J. Clin. Microbiol. 53, 2410–2426. https://doi.org/10.1128/jcm.00008-15 (2015).
15. Joensen, K. G. et al. Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J. Clin. Microbiol. 52, 1501–1510. https://doi.org/10.1128/jcm.03617-13 (2014).
16. Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: Scientific containers for mobility of compute. PloS One 12, e0177459. https://doi.org/10.1371/journal.pone.0177459 (2017).
17. Grüning, B. et al. Practical computational reproducibility in the life sciences. Cell Syst. 6, 631–635. https://doi.org/10.1016/j.cels.2018.03.014 (2018).
18. Sochat, V. The scientific filesystem. GigaScience 7, giy023. https://doi.org/10.1093/gigascience/giy023 (2018).
19. Guragain, M., Schmidt, J. W., Kalchayanand, N., Dickey, A. M. & Bosilevac, J. M. Characterization of Escherichia coli harboring colibactin genes (clb) isolated from beef production and processing systems. Sci. Rep. 12, 5305. https://doi.org/10.1038/s41598-022-09274-x (2022).
20. Guragain, M., Schmidt, J. W., Dickey, A. M. & Bosilevac, J. M. Distribution of extremely heat-resistant Escherichia coli in the beef production and processing continuum. J. Food Protect. 86, 100031. https://doi.org/10.1016/j.jfp.2022.100031 (2023).
21. Schmidt, J. W. et al. Twenty-four-month longitudinal study suggests little to no horizontal gene transfer in situ between third-generation cephalosporin-resistant Salmonella and third-generation cephalosporin-resistant Escherichia coli in a beef cattle feedyard. J. Food Protect. 85, 323–335. https://doi.org/10.4315/JFP-21-371 (2022).
22. SCINet Scientific Computing, U.S. Department of Agriculture [internet]. [cited 29 November 2023]. scinet.usda.gov
23. Walker, B. J. et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS One 9, e112963. https://doi.org/10.1371/journal.pone.0112963 (2014).
24. Souvorov, A., Agarwala, R. & Lipman, D. J. SKESA: Strategic k-mer extension for scrupulous assemblies. Genome Biol. 19, 153. https://doi.org/10.1186/s13059-018-1540-z (2018).
25. de Almeida, F. M., de Campos, T. A. & Pappas, G. J. Jr. Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation. F1000Research 12, 1205. https://doi.org/10.12688/f1000research.139488.1 (2023).
26. Steinke, K. et al. RSYD-BASIC: A bioinformatics pipeline for routine sequence analysis and data processing of bacterial isolates for clinical microbiology. Access Microbiol. https://doi.org/10.1099/acmi.0.000646.v2 (2023).
27. Dykstra, D. Apptainer without setuid. arXiv 2208, 12106 [preprint]. https://doi.org/10.48550/arXiv.2208.12106 (2022).
28. Zhang, S. et al. SeqSero2: Rapid and improved Salmonella serotype determination using whole-genome sequencing data. Appl. Environ. Microbiol. 85, e01746-19. https://doi.org/10.1128/AEM.01746-19 (2019).
29. Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinform. 10, 421. https://doi.org/10.1186/1471-2105-10-421 (2009).
30. Waters, N. R., Abram, F., Brennan, F., Holmes, A. & Pritchard, L. Easy phylotyping of Escherichia coli via the EzClermont web app and command-line tool. Access Microbiol. 2, acmi000143. https://doi.org/10.1099/acmi.0.000143 (2020).
31. Bankevich, A. et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477. https://doi.org/10.1089/cmb.2012.0021 (2012).
32. Zerbino, D. R. & Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829. https://doi.org/10.1101/gr.074492.107 (2008).
33. Song, L., Florea, L. & Langmead, B. Lighter: Fast and memory-efficient sequencing error correction without counting. Genome Biol. 15, 509. https://doi.org/10.1186/s13059-014-0509-9 (2014).
34. Magoč, T. & Salzberg, S. FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963. https://doi.org/10.1093/bioinformatics/btr507 (2011).
35. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008. https://doi.org/10.1093/gigascience/giab008 (2021).
36. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 1303, 3997 [preprint]. https://doi.org/10.48550/arXiv.1303.3997 (2013).
37. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761. https://doi.org/10.1093/bioinformatics/btx304 (2017).
38. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. https://doi.org/10.1093/bioinformatics/btu170 (2014).


Acknowledgements

We thank two anonymous reviewers who identified gaps in the manuscript, a bug in the pipeline script, and suggested a beneficial pipeline enhancement that improved the paper. Stephanie Schmidt provided secretarial support. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. The USDA is an equal opportunity provider and employer.

Funding

Funding for this research was provided by the USDA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

US Department of Agriculture, Agricultural Research Service, US Meat Animal Research Center, 844 Rd 313, PO Box 165, Clay Center, NE, 68933, USA
Aaron M. Dickey, John W. Schmidt & James L. Bono
US Department of Agriculture, Agricultural Research Service, Eastern Regional Research Center, 600 East Mermaid Lane, Wyndmoor, PA, 19038, USA
Manita Guragain

Contributions

Conceptualization (J.W.S., M.G.); Data Curation (A.M.D.); Formal Analysis (A.M.D.); Investigation (A.M.D.); Methodology (A.M.D., J.W.S., J.L.B., M.G.); Resources (A.M.D., J.W.S., J.L.B., M.G.); Software (A.M.D.); Visualization (A.M.D.); Writing (A.M.D., J.W.S., M.G.); Editing (J.W.S., J.L.B., M.G.).

Corresponding authors

Correspondence to Aaron M. Dickey or Manita Guragain.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dickey, A.M., Schmidt, J.W., Bono, J.L. et al. The GEA pipeline for characterizing Escherichia coli and Salmonella genomes. Sci Rep 14, 13257 (2024). https://doi.org/10.1038/s41598-024-63832-z

Received: 24 January 2024
Accepted: 03 June 2024
Published: 10 June 2024
DOI: https://doi.org/10.1038/s41598-024-63832-z
Springer Nature Limited
