Abstract
The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that now challenges the storage and data processing capacities of modern computer systems. In this context, an important task is to reduce data complexity by collapsing redundant reads into a single cluster, which improves the run time, memory requirements, and quality of downstream steps such as assembly and error correction. Several alignment-free measures based on k-mer counts have been used to cluster reads.
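As a rough illustration of what a k-mer-count measure looks like, the sketch below computes the classical D2 statistic, i.e. the inner product of the k-mer count vectors of two reads. The function names `kmer_counts` and `d2` are chosen here for illustration and are not taken from QCluster; this is a minimal sketch, not the authors' implementation.

```python
from collections import Counter


def kmer_counts(read, k):
    """Count every overlapping k-mer (substring of length k) in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def d2(read_a, read_b, k):
    """Classical D2 statistic: inner product of the two k-mer count vectors.

    Hypothetical helper for illustration; larger values mean the reads
    share more k-mers.
    """
    counts_a = kmer_counts(read_a, k)
    counts_b = kmer_counts(read_b, k)
    # Counter returns 0 for k-mers that are absent from read_b.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGT", 2))
    print(d2("AAAA", "CCCC", 2))
```

Reads with no k-mer in common score zero, so a clustering procedure can treat a higher score as greater similarity without ever aligning the reads.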
Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%). Thus it will be fundamental to exploit quality value information within the alignment-free framework.
In this paper we present a family of alignment-free measures, called D^q-type, that incorporate quality value information and k-mer counts for the comparison of read data. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. These measures are implemented in a software tool called QCluster ( http://www.dei.unipd.it/~ciompin/main/qcluster.html ).
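One plausible way to combine quality values with k-mer counts is sketched below: each k-mer occurrence contributes the probability that all of its bases were called correctly, instead of a count of 1. The Phred relation (error probability 10^(-Q/10)) is standard; the weighting scheme, the assumption of independent base errors, and the names `weighted_kmer_counts` and `d2_quality` are illustrative assumptions and may differ from the exact definitions used in the paper and in QCluster.

```python
from collections import defaultdict


def phred_to_correct_prob(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


def weighted_kmer_counts(read, quals, k):
    """Quality-weighted k-mer counts (illustrative assumption).

    Each occurrence of a k-mer adds the product of the per-base
    correctness probabilities, assuming independent base errors.
    """
    if len(read) != len(quals):
        raise ValueError("read and quality list must have the same length")
    probs = [phred_to_correct_prob(q) for q in quals]
    counts = defaultdict(float)
    for i in range(len(read) - k + 1):
        weight = 1.0
        for p in probs[i:i + k]:
            weight *= p
        counts[read[i:i + k]] += weight
    return dict(counts)


def d2_quality(read_a, quals_a, read_b, quals_b, k):
    """Inner product of the quality-weighted k-mer count vectors."""
    counts_a = weighted_kmer_counts(read_a, quals_a, k)
    counts_b = weighted_kmer_counts(read_b, quals_b, k)
    return sum(w * counts_b.get(kmer, 0.0) for kmer, w in counts_a.items())


if __name__ == "__main__":
    # Low-quality bases (Phred 10) shrink the contribution of each k-mer.
    print(d2_quality("ACGT", [10] * 4, "ACGT", [10] * 4, 2))
    # High-quality bases (Phred 60) give a value close to the plain count.
    print(d2_quality("ACGT", [60] * 4, "ACGT", [60] * 4, 2))
```

With this weighting, k-mers that overlap unreliable bases count for less, so reads dominated by sequencing errors have less influence on the similarity score.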
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Comin, M., Leoni, A., Schimd, M. (2014). QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering. In: Brown, D., Morgenstern, B. (eds) Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science, vol 8701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44753-6_1
DOI: https://doi.org/10.1007/978-3-662-44753-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44752-9
Online ISBN: 978-3-662-44753-6
eBook Packages: Computer Science (R0)