A NoSQL Solution for Bioinformatics Data Provenance Storage

  • Conference paper
New Knowledge in Information Systems and Technologies (WorldCIST'19 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 930)

Abstract

Provenance data can support the reproducibility of experiments by providing the history of the data in a scientific workflow. Bioinformatics generates an increasing amount of data, which is often analyzed using workflows. This paper proposes a way to manage automatic executions of Bioinformatics workflows, storing their provenance and raw data in the MongoDB NoSQL database system. It uses a program that manages three data models (referenced, embedded, and hybrid) so that they can be compared. The results show general advantages and disadvantages of each data model, as well as some particularities of Bioinformatics data.
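To make the three data models concrete, the following is a minimal sketch of how one workflow activity and its raw files might be laid out under each of them. It uses plain Python dictionaries shaped like MongoDB documents. Every field name (`program`, `used_files`, `used_file_ids`, `file_id`) and the particular split chosen for the hybrid layout are assumptions made for illustration; the abstract does not describe the schema the authors actually used.

```python
"""Sketch of three MongoDB-style document layouts for workflow provenance.

All field names, and the split used for the hybrid layout, are hypothetical.
"""


def referenced_model(activity, raw_files):
    """Referenced: provenance and raw data are separate documents linked by id."""
    file_docs = [
        {"_id": f"file-{i}", "name": f["name"], "content": f["content"]}
        for i, f in enumerate(raw_files)
    ]
    activity_doc = {
        "_id": activity["id"],
        "program": activity["program"],
        # Only the ids are stored here; the content lives in the files collection.
        "used_file_ids": [d["_id"] for d in file_docs],
    }
    return {"activities": [activity_doc], "files": file_docs}


def embedded_model(activity, raw_files):
    """Embedded: raw data is nested inside the provenance document itself."""
    activity_doc = {
        "_id": activity["id"],
        "program": activity["program"],
        "used_files": [
            {"name": f["name"], "content": f["content"]} for f in raw_files
        ],
    }
    return {"activities": [activity_doc]}


def hybrid_model(activity, raw_files):
    """Hybrid (one possible reading): embed light metadata, reference heavy content."""
    file_docs = [
        {"_id": f"file-{i}", "content": f["content"]}
        for i, f in enumerate(raw_files)
    ]
    activity_doc = {
        "_id": activity["id"],
        "program": activity["program"],
        "used_files": [
            {"name": f["name"], "file_id": d["_id"]}
            for f, d in zip(raw_files, file_docs)
        ],
    }
    return {"activities": [activity_doc], "files": file_docs}


if __name__ == "__main__":
    # Hypothetical activity and input file, for demonstration only.
    demo_activity = {"id": "act-1", "program": "aligner"}
    demo_files = [{"name": "reads.fastq", "content": "ACGT"}]
    for build in (referenced_model, embedded_model, hybrid_model):
        print(build.__name__, build(demo_activity, demo_files))
```

With a driver such as PyMongo, each list in the returned dictionary could be passed to `insert_many` on a collection of the same name. The trade-off the sketch hints at is the usual one: the embedded layout reads a whole activity in one lookup but duplicates or inflates documents (MongoDB caps a single document at 16 MB), while the referenced layout keeps documents small at the cost of an extra lookup per file.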

Author information

Corresponding author

Correspondence to Ingrid Santana.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Santana, I., da Silva, W.M.C., Holanda, M. (2019). A NoSQL Solution for Bioinformatics Data Provenance Storage. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) New Knowledge in Information Systems and Technologies. WorldCIST'19 2019. Advances in Intelligent Systems and Computing, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-16181-1_50
