ContFree-NGS: Removing Reads from Contaminating Organisms in Next Generation Sequencing Data

Peres, Felipe Vaz; Riaño-Pachón, Diego Mauricio

doi:10.1007/978-3-030-91814-9_6

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13063)

Included in the following conference series:

Brazilian Symposium on Bioinformatics

Abstract

We present ContFree-NGS, an open source software tool that removes reads originating from contaminant organisms from a sequencing dataset. The user provides a target taxon, and any read not assigned to this taxon or one of its descendants is labelled as a contaminant. To achieve this, ContFree-NGS exploits the results of a taxonomic assignment engine, such as Kraken2 or Kaiju.

1 Introduction

Second and third generation DNA sequencing technologies are powerful tools that are revolutionizing biology. However, results from these technologies often contain contamination, which can affect their interpretation [1, 2]. A contaminated sequence is one that does not faithfully represent the genetic information from the biological source organism because it contains one or more sequence segments of foreign origin; such sequences can cause several problems in downstream analyses. The primary consequences of contamination are time and effort wasted on meaningless analyses, erroneous conclusions about the biological significance of the sequence, misassembly of sequence contigs and false sequence clustering, delays in the release of the sequence in public databases, and pollution of public databases [3].

Recently, several tools have been made available that aim to remove sequences from contaminating organisms in next generation sequencing (NGS) datasets. DecontaMiner detects contaminating sequences in the set of reads that do not map to a reference genome [4]. Conterminator removes contaminating sequences from contigs by exploiting a taxonomic assignment file [5]. QC-Blind is an automatic tool that performs unsupervised assembly and contig binning to identify and remove putative contaminants [6]. What these tools have in common is that they either require a reference genome of the source organism or need to perform assembly prior to contaminant detection. Our goal was to develop a simpler tool that removes contaminating sequences directly from unassembled reads, without mapping, by exploiting the fast k-mer analysis implemented in taxonomic assignment engines commonly used in metagenomics. Thus, we developed ContFree-NGS, an open source and very simple filter that removes sequences from contaminating organisms in NGS datasets based on a taxonomic classification file.

2 Implementation

ContFree-NGS was implemented as a single Python v3 (>3.6) script, using the biopython module and the Python Environment for Tree Exploration (ETE). To assess contamination, ContFree-NGS exploits a taxonomic assignment file containing the read identifier and an NCBI taxonomic identifier for every sequence in the dataset. This taxonomic classification file can be generated with a taxonomic assignment engine, such as Kraken2 [7] or Kaiju [8]. ContFree-NGS requires the user to provide a target taxon; only reads assigned to this taxon or to its descendants are regarded as target sequences and retained. Sequences assigned to any other taxon are discarded, and sequences that could not be assigned to any taxon are kept in a separate file of unclassified reads. ContFree-NGS processes the NGS dataset and the taxonomic assignment file in the following way:

1. It creates an indexed database for the sequencing dataset (FastQ format) using the Bio.SeqIO.index_db function, with the index stored in a SQLite database;
2. It creates a list with the NCBI identifier of the target taxon and the identifiers of all its descendants according to the NCBI taxonomy database;
3. It iterates over the taxonomy assignment file. If a read was not assigned to any taxon, it is saved in a FastQ file for unclassified reads. If the read was assigned, ContFree-NGS checks whether its taxon is in the list created in step 2; if so, the read is saved to a FastQ file for filtered reads, otherwise the read is discarded.
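The three steps can be illustrated with a minimal, standard-library-only sketch. It is not the ContFree-NGS source code. It assumes Kraken2's default per-read output (tab-separated status, read identifier and taxon identifier, produced without --use-names), and it replaces the NCBI taxonomy with a tiny hypothetical tree, TOY_CHILDREN, whose identifiers and shortened lineages are for illustration only. The function names are also hypothetical. The sketch works on read identifiers alone; in the real tool the corresponding records would be fetched from the index built in step 1.

```python
from collections import deque

# Hypothetical toy taxonomy (parent -> children); lineages are heavily shortened
# and the ids are illustrative. A real run would query the NCBI taxonomy instead.
TOY_CHILDREN = {
    1: [2, 2759],         # root -> Bacteria, Eukaryota
    2: [470],             # Bacteria -> a bacterial contaminant
    2759: [33090, 4751],  # Eukaryota -> plants, fungi
    33090: [4547],        # plants -> a sugarcane taxon
    4751: [746128],       # fungi -> a fungal contaminant
}


def target_taxa(target, children):
    """Step 2: return the target taxon id together with all of its descendants."""
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def parse_kraken2_line(line):
    """Return (read_id, taxid) from one per-read line; unclassified reads get taxid 0."""
    status, read_id, taxid = line.rstrip("\n").split("\t")[:3]
    return read_id, (int(taxid) if status == "C" else 0)


def partition_reads(assignments, keep):
    """Step 3: split read ids into filtered, unclassified and discarded groups."""
    filtered, unclassified, discarded = [], [], []
    for read_id, taxid in assignments:
        if taxid == 0:
            unclassified.append(read_id)  # kept in a separate file
        elif taxid in keep:
            filtered.append(read_id)      # target taxon or a descendant
        else:
            discarded.append(read_id)     # treated as contaminant
    return filtered, unclassified, discarded


# Four made-up assignment lines in the assumed Kraken2 layout.
lines = [
    "C\tread1\t4547\t150\t4547:116",
    "C\tread2\t470\t150\t470:116",
    "U\tread3\t0\t150\t0:116",
    "C\tread4\t746128\t150\t746128:116",
]
keep = target_taxa(33090, TOY_CHILDREN)
filtered, unclassified, discarded = partition_reads(
    (parse_kraken2_line(line) for line in lines), keep
)
print(filtered, unclassified, discarded)
```

Holding the target taxon and its descendants in a set makes each membership check in step 3 a constant-time lookup, which matters when the assignment file has millions of lines.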

Because ContFree-NGS relies on the results of a taxonomic assignment engine, the quality of the filtering depends on the accuracy of that classification. Users must therefore choose the engine's options carefully, for instance an appropriate value for the --confidence switch in Kraken2.
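As an illustration of where the confidence threshold enters, the sketch below assembles a Kraken2 command line for paired-end reads in Python without running it. The helper name kraken2_command, the database name and the file names are hypothetical; the switches used (--db, --confidence, --threads, --paired, --output) are standard Kraken2 options, and the default of 0.05 mirrors the value used in the evaluation in this paper.

```python
import shlex


def kraken2_command(db, reads_1, reads_2, output, confidence=0.05, threads=1):
    """Build the argument list for a paired-end Kraken2 run (nothing is executed)."""
    return [
        "kraken2",
        "--db", db,
        "--confidence", str(confidence),  # minimum confidence score for a classification
        "--threads", str(threads),
        "--paired",
        "--output", output,               # per-read taxonomic assignments
        reads_1,
        reads_2,
    ]


# Hypothetical database and file names, for illustration only.
cmd = kraken2_command("custom_db", "reads_1.fastq", "reads_2.fastq", "assignments.txt")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where Kraken2 and the database are installed. Raising the confidence value generally leaves more reads unclassified, so the threshold trades sensitivity for precision.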

3 Evaluation

We evaluated ContFree-NGS on three artificially contaminated sugarcane datasets: A (50,000 paired-end reads), B (250,000 paired-end reads) and C (1,250,000 paired-end reads). In all datasets, 80% of the reads came from sugarcane (SRR1774134), 15% from Acinetobacter baumannii (SRR12763742) and 5% from Aspergillus fumigatus (DRR289670). We used Kraken2 for taxonomic assignment. For this, we built a custom Kraken2 database containing the following reference libraries: archaea, bacteria, viral, human, fungi, plant, protozoa and the NCBI non-redundant nucleotide database. Kraken2 was then run with the confidence set to 0.05, resulting in the following numbers of classified sequences: dataset A: 25,547; dataset B: 128,396; dataset C: 664,270.
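The composition and classification figures above can be turned into a short worked calculation. The sketch below derives the expected number of reads per organism from the 80/15/5 mix and the fraction of each dataset that Kraken2 classified. It assumes that the classified counts are expressed in the same unit as the dataset sizes; the variable and function names are illustrative.

```python
# Dataset sizes and classified counts as reported in the text.
DATASETS = {"A": 50_000, "B": 250_000, "C": 1_250_000}
CLASSIFIED = {"A": 25_547, "B": 128_396, "C": 664_270}

# Composition of every dataset, in percent.
MIX = {"sugarcane": 80, "Acinetobacter baumannii": 15, "Aspergillus fumigatus": 5}


def expected_counts(total):
    """Expected number of reads per organism for a dataset of the given size."""
    return {name: total * pct // 100 for name, pct in MIX.items()}


def classified_fraction(name):
    """Fraction of a dataset that received a taxonomic assignment."""
    return CLASSIFIED[name] / DATASETS[name]


for name, total in DATASETS.items():
    print(name, expected_counts(total), f"{classified_fraction(name):.1%}")
```

Under that assumption, roughly half of each dataset (about 51% to 53%) was classified at a confidence of 0.05; the remaining reads would end up in the unclassified output rather than being filtered or discarded.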

At the confidence level of 0.05 set in Kraken2, ContFree-NGS removed over 99% of the known contaminants in the set of classified reads. We ran ContFree-NGS on a high performance computing (HPC) cluster and recorded RAM usage and processing wall time for the three datasets (see Fig. 1).

4 Conclusion

ContFree-NGS is a simple and useful filter that removes sequences from contaminating organisms in an NGS dataset.

References

Park, S.J., Onizuka, S., Seki, M., et al.: A systematic sequencing-based approach for microbial contaminant detection and functional inference. BMC Biol. 17, 72 (2019). https://doi.org/10.1186/s12915-019-0690-0
Goig, G.A., Blanco, S., Garcia-Basteiro, A.L., et al.: Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability. BMC Biol. 18, 24 (2020). https://doi.org/10.1186/s12915-020-0748-z
National Center for Biotechnology Information 2016: Contamination in Sequence Databases. https://www.ncbi.nlm.nih.gov/tools/vecscreen/contam/. Accessed 6 Oct 2021
Sangiovanni, M., Granata, I., Thind, A., et al.: From trash to treasure: detecting unexpected contamination in unmapped NGS data. BMC Bioinform. 20, 168 (2019). https://doi.org/10.1186/s12859-019-2684-x
Steinegger, M., Salzberg, S.L.: Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol. 21, 115 (2020). https://doi.org/10.1186/s13059-020-02023-1
Xi, W., Gao, Y., Cheng, Z., et al.: Using QC-blind for quality control and contamination screening of bacteria DNA sequencing data without reference genome. Front. Microbiol. 10, 1560 (2019). https://doi.org/10.3389/fmicb.2019.01560
Wood, D.E., Lu, J., Langmead, B.: Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0
Menzel, P., Ng, K., Krogh, A.: Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016). https://doi.org/10.1038/ncomms11257

Funding

This work was supported by “Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)” [grant number 2019/24796-5 to F.V.P] and by “Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)” [grant number 310080/2018‐5 to D.M.R-P].

Author information

Authors and Affiliations

Federal University of São Carlos, Araras, São Paulo, Brazil
Felipe Vaz Peres
Computational, Evolutionary and Systems Biology Laboratory, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, São Paulo, Brazil
Felipe Vaz Peres & Diego Mauricio Riaño-Pachón

Corresponding author

Correspondence to Diego Mauricio Riaño-Pachón.

Editor information

Editors and Affiliations

Leipzig University, Leipzig, Sachsen, Germany
Peter F. Stadler
University of Brasília, Brasilia, Brazil
Maria Emilia M. T. Walter
CINVESTAV, Irapuato, Mexico
Maribel Hernandez-Rosales
Universidade de Brasilia, Brasilia, Brazil
Marcelo M. Brigido

Cite this paper

Peres, F.V., Riaño-Pachón, D.M. (2021). ContFree-NGS: Removing Reads from Contaminating Organisms in Next Generation Sequencing Data. In: Stadler, P.F., Walter, M.E.M.T., Hernandez-Rosales, M., Brigido, M.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2021. Lecture Notes in Computer Science (LNBI), vol 13063. Springer, Cham. https://doi.org/10.1007/978-3-030-91814-9_6

DOI: https://doi.org/10.1007/978-3-030-91814-9_6
Published: 23 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91813-2
Online ISBN: 978-3-030-91814-9
eBook Packages: Computer Science, Computer Science (R0)
