Abstract
The development of efficient DNA data compression tools is fundamental for reducing storage requirements, given the increasing availability of DNA sequences. Such tools also matter for analysis, given the demand for new and optimized tools in anthropological and biomedical applications. In this paper, we describe the characteristics and impact of GeCo2, an improved version of the GeCo tool. In the proposed tool, we enhanced the mixture of models so that each context model or tolerant context model now has its own decay factor. Additionally, we developed specific cache-hash sizes and the ability to run a context model with inverted repeats only. A new command-line interface, twelve new pre-computed levels, and several code optimizations were also included. The results show improved compression using fewer computational resources (RAM and processing time). This new version permits more flexibility for compression and analysis purposes, namely a greater ability to address different characteristics of DNA sequences. Decompression uses symmetric computational resources (RAM and time). GeCo2 is freely available, under the GPLv3 license, at https://github.com/pratas/geco2.
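To make the idea of a per-model decay factor concrete, the following is a minimal Python sketch of a mixture of finite-context models in which each model carries its own decay factor. It is not taken from the GeCo2 source code. The class and function names (`ContextModel`, `estimate_bits`), the Laplace-style estimator parameter `alpha`, and the weight update rule (previous weight raised to the model's decay factor, multiplied by the probability that model assigned to the observed symbol, then renormalized) are assumptions chosen for illustration. Tolerant context models, cache hashes, and inverted repeats are omitted.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with its own decay factor (hypothetical sketch)."""

    def __init__(self, order, gamma, alpha=1.0):
        self.order = order      # context length k
        self.gamma = gamma      # decay factor specific to this model
        self.alpha = alpha      # estimator parameter (assumed, Laplace-style)
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so handle order 0 explicitly
        return history[-self.order:] if self.order else ""

    def probs(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts) + len(ALPHABET) * self.alpha
        return [(n + self.alpha) / total for n in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_bits(seq, models):
    """Return the number of bits an ideal arithmetic coder would need for seq."""
    max_order = max(m.order for m in models)
    weights = [1.0 / len(models)] * len(models)
    bits = 0.0
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        dists = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred
        p = sum(w * d[s] for w, d in zip(weights, dists))
        bits -= math.log2(p)
        # Per-model decay: each weight is "forgotten" at its own rate (assumed rule)
        weights = [(w ** m.gamma) * d[s] for w, m, d in zip(weights, models, dists)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return bits


if __name__ == "__main__":
    sequence = "ACGT" * 50
    mixture = [ContextModel(order=1, gamma=0.90), ContextModel(order=3, gamma=0.95)]
    print(f"{estimate_bits(sequence, mixture) / len(sequence):.3f} bits per base")
```

A decay factor close to 1 makes a model's weight depend on its long-term performance, while a smaller value lets the weight react faster to recent symbols. Giving each model its own value lets low-order and high-order models adapt at different rates.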
Acknowledgments
This work was partially funded by FEDER (Programa Operacional Factores de Competitividade - COMPETE) and by National Funds through the FCT, in the context of the projects UID/CEC/00127/2019 & PTCD/EEI-SII/6608/2014 and the grant PD/BD/113969/2015 to MH.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Pratas, D., Hosseini, M., Pinho, A.J. (2020). GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences. In: Fdez-Riverola, F., Rocha, M., Mohamad, M., Zaki, N., Castellanos-Garzón, J. (eds) Practical Applications of Computational Biology and Bioinformatics, 13th International Conference. PACBB 2019. Advances in Intelligent Systems and Computing, vol 1005. Springer, Cham. https://doi.org/10.1007/978-3-030-23873-5_17
DOI: https://doi.org/10.1007/978-3-030-23873-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23872-8
Online ISBN: 978-3-030-23873-5
eBook Packages: Intelligent Technologies and Robotics (R0)