1 Introduction

Analysis of molecular profiling data obtained from high-throughput “-omics” studies is essential for unravelling large-scale patterns in microbial community composition, function, and interactions between microbial organisms. The development of bioinformatics tools has been pivotal for understanding the importance of the microbiome in human health (Erickson et al. 2012; Heintz-Buschart et al. 2017; Schirmer et al. 2018). Numerous tools, from command line interfaces such as Mothur (Schloss et al. 2009), DADA2 (Callahan et al. 2016b), Anvi’o (Eren et al. 2015) and the Python-based QIIME and QIIME2 (Caporaso et al. 2010; Bolyen et al. 2018) to web-based tools such as Calypso (Zakrzewski et al. 2016) and MicrobiomeAnalyst (Dhariwal et al. 2017), have been designed to serve the microbial research community. The methods in this field are developing rapidly, however, and the quality of research software can vary widely (Mangul et al. 2018). Open source code does not in itself guarantee quality or accuracy, and the research community can be slow to correct or improve implementations. Hence, suboptimal or even erroneous algorithms may remain in use over long periods of time.

Open collaboration and joint development of data analytical methods can accelerate the dissemination of, and access to, the latest research algorithms. The emergence of open data science (McKiernan et al. 2016; Lahti 2018) has revolutionized such collaborative research and is greatly facilitating the development and adoption of open methods and practices in data-intensive research. The availability of distributed version control systems (Wilson et al. 2017) has created new opportunities to transparently benchmark and criticize alternative approaches. In microbiome bioinformatics, much of this development is currently focused on two programming environments, R and Python, where researchers now routinely share research software and reproducible notebooks that summarize complete data analytical procedures from raw data to final reporting. Graphical interfaces can further support researchers by providing interactive tools for data exploration and analysis (Venables and Smith 2006).

We provide a brief overview of the current status of microbiome data science in R from a community developer perspective. While the R ecosystem is one of the main platforms for current community-driven development efforts, the key concepts apply more widely to other data science environments.

2 Microbiome data science

The route from the processing of raw data to final analysis and reporting relies on a vast number of methods and concepts in microbial ecology (figure 1), and an individual researcher is seldom able to master all relevant research areas. Hence, multi-disciplinary research can be greatly supported by well-designed workflows that implement best practices in the field and provide examples and guidance for choosing the methods, while maintaining the flexibility and opportunities to customize any part of the analysis workflow (Eren et al. 2015; McKiernan et al. 2016; Knight et al. 2018; Pollock et al. 2018; Schloss 2018). Research software can be efficiently communicated in the context of experimental benchmarking data and reproducible online tutorials that can be interactively tested and further modified by users. These so-called electronic notebooks have emerged to provide new educational resources as well as open collaboration platforms to facilitate methods criticism and development (see, e.g., Ragan-Kelley et al. 2013). Hence, key elements enabling microbiome data science include open data, open methods, and open collaboration (Lahti 2018).

Figure 1

Overview of the contemporary microbiome data science ecosystem in R. The shaded boxes highlight research areas where the demand for new analysis methods and tools is particularly pressing. A microbiome data science ecosystem binds together research data and methods, and enables new forms of collaboration.

2.1 Data

Open availability of research data can improve the overall quality and trustworthiness of research. Moreover, convenient access to benchmarking data from published case studies can be valuable for verification, meta-analysis, and methods development. Importantly, the use of standard data formats such as phyloseq (McMurdie and Holmes 2013) can greatly facilitate the development and integration of new methods and the reproducibility of experiments, and lower the barrier for using analysis tools without expert knowledge of data processing and integration details. In microbiome research, typical data sets include counts of taxonomic units, genes, or metabolites, and complementary information on taxonomic classifications, phylogenies and nucleotide sequences. The algorithmic R packages can be complemented by so-called data packages, which can be distributed through Bioconductor, for instance. Data packages can be larger than the standard algorithm packages, and they provide well-documented example data sets that facilitate the development of methods, unit tests, and educational tutorials. Whereas R data packages already have a long history in bioinformatics, such packages have only recently started to emerge in the microbiome field, providing data from recent microbiome studies at taxonomic and functional levels (see, e.g., Pasolli et al. 2017; Schiffer et al. 2019).
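The idea of a standardized data container can be illustrated with a minimal sketch. The class below is hypothetical and written in Python only to keep the example self-contained; it is not the phyloseq class or its API. It assumes three components (taxon counts, taxonomic annotations, and sample metadata) and shows the kind of consistency checks that allow downstream methods to rely on a single, validated starting point.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MicrobiomeData:
    """Hypothetical container (illustration only, not the phyloseq API)."""

    counts: Dict[str, Dict[str, int]]        # taxon -> {sample -> read count}
    taxonomy: Dict[str, Dict[str, str]]      # taxon -> {rank -> name}
    sample_data: Dict[str, Dict[str, str]]   # sample -> {variable -> value}

    def __post_init__(self) -> None:
        # Validate once at construction so that downstream methods can
        # assume that the three components describe the same taxa and samples.
        samples = set(self.sample_data)
        for taxon, row in self.counts.items():
            if taxon not in self.taxonomy:
                raise ValueError(f"taxon {taxon!r} has no taxonomy entry")
            if set(row) != samples:
                raise ValueError(f"taxon {taxon!r} does not cover all samples")
            if any(count < 0 for count in row.values()):
                raise ValueError(f"taxon {taxon!r} has a negative count")

    def sample_totals(self) -> Dict[str, int]:
        """Total read count per sample (library size)."""
        return {
            sample: sum(row[sample] for row in self.counts.values())
            for sample in self.sample_data
        }


# Toy data; the identifiers and values are invented for illustration.
example = MicrobiomeData(
    counts={"OTU1": {"S1": 10, "S2": 0}, "OTU2": {"S1": 5, "S2": 20}},
    taxonomy={"OTU1": {"Genus": "Bacteroides"}, "OTU2": {"Genus": "Prevotella"}},
    sample_data={"S1": {"group": "control"}, "S2": {"group": "case"}},
)
print(example.sample_totals())  # {'S1': 15, 'S2': 20}
```

Validating the components together at construction time is one possible design choice; it moves data integration errors to the point where the data are loaded rather than to the analysis step.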

2.2 Analysis

R is well-suited for a variety of interactive analysis tasks and data handling operations, and the contemporary R ecosystem covers dozens of packages for various aspects of microbiome data science (table 1). Most methods currently focus on 16S rRNA amplicon sequencing or assume that OTU tables are readily available from metagenomic sequencing studies. Data summarization is now facilitated by dedicated preprocessing algorithms such as DADA2 (Callahan et al. 2016b), and class structures such as phyloseq, which integrates OTU counts, taxonomic trees, and sample metadata into a single object that serves as a standardized starting point for downstream methods (McMurdie and Holmes 2013). Estimation of alpha diversity and related ecological indices, including richness, evenness, dominance, and rarity, is provided by various packages (Oksanen et al. 2011; Lahti and Shetty 2017) and complemented by phylogenetic trees (Kembel et al. 2010) or co-occurrence networks (Willis and Martin 2018); the Shiny-phyloseq package provides further tools for interactive network analysis (McMurdie and Holmes 2015). Community dissimilarity, or beta diversity, can be analysed using both phylogenetic (Chen 2012) and non-phylogenetic metrics (Beals 1984). Many methods are available for differential abundance analysis (Robinson et al. 2010; Paulson et al. 2013; Fernandes et al. 2014; Love et al. 2014), and systematic benchmarking tests have indicated wide variation in the performance of alternative methods (Weiss et al. 2017). Advanced approaches consider nested hierarchies in multiple testing (Sankaran and Holmes 2014). Community-level differences between sample groups can be tested with PERMANOVA and other methods (Oksanen et al. 2011; Anderson and Walsh 2013) and further complemented by unsupervised analyses (Sankaran and Holmes 2018a; Singh et al. 2019) such as Dirichlet Multinomial Mixtures (DMMs) (Ding and Schloss 2014; Harris et al. 2014).
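To make the alpha and beta diversity concepts above concrete, the following minimal sketch computes richness, the Shannon index, Pielou's evenness, and the Bray-Curtis dissimilarity for count vectors. It is written in plain Python for self-containment and is not taken from any of the cited packages, which provide maintained and far more complete implementations; the function names and the toy counts are assumptions made for illustration.

```python
import math
from typing import Sequence


def richness(counts: Sequence[int]) -> int:
    """Number of taxa observed (count > 0) in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts: Sequence[int]) -> float:
    """Shannon index H = -sum(p_i * ln p_i) over observed taxa."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in proportions)


def pielou_evenness(counts: Sequence[int]) -> float:
    """Pielou's evenness J = H / ln(S); defined as 0 when S <= 1."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def bray_curtis(x: Sequence[int], y: Sequence[int]) -> float:
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    if len(x) != len(y):
        raise ValueError("samples must be defined over the same taxa")
    denominator = sum(x) + sum(y)
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


# Toy samples over the same four taxa (invented values).
even_sample = [10, 10, 10, 10]      # all taxa equally abundant
dominated_sample = [40, 0, 0, 0]    # a single dominant taxon

print(richness(even_sample), richness(dominated_sample))  # 4 1
print(bray_curtis(even_sample, dominated_sample))         # 0.75
```

The two toy samples have the same sequencing depth but differ sharply in richness and evenness, which is the kind of contrast that alpha diversity indices summarize within samples and beta diversity metrics summarize between samples.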
Further tools are available for the analysis of phylogenetic trees (Paradis et al. 2004; Wright 2016; Stevens et al. 2017; Washburne et al. 2017), co-occurrence networks (Schwager et al. 2014; Kurtz et al. 2015), metabolic interactions (Cao et al. 2016), and microbiome function (Aßhauer et al. 2015). Visualization tools range from amplicon sequencing data (Andersen et al. 2018) and unsupervised ordination that incorporates phylogenetic structure (Fukuyama 2017) to network analysis (Csardi and Nepusz 2006), phylogenetic trees (Paradis et al. 2004), taxonomic diversity (Foster et al. 2017), and geospatial analysis (Charlop-Powers and Brady 2015). Many generic utilities for microbiome profiling data are also available (Chen et al. 2016; Korpela 2016; Lagkouvardos et al. 2017; Lahti and Shetty 2017). However, general-purpose packages can be more challenging to maintain and develop in the long term than packages with a more specific scope. R packages have also been created to access taxonomic information (Chamberlain et al. 2014) and to support interoperability with other systems such as the Python-based QIIME and the Biological Observation Matrix (BIOM) format (Bittinger 2014; McMurdie and Paulson 2016). The MultiAssayExperiment package provides utilities for parallel multi-omics profiling (Ramos et al. 2017), and further class structures are available for generic time series, but these opportunities have not yet been fully exploited in microbiome data science.

Table 1 Overview of contemporary online resources for microbiome data science in R. The indicated groupings are approximations, as many packages span multiple categories.

R packages in microbiome data science are mainly distributed through four channels, which have varying levels of software review. GitHub is a generic open source development platform that does not pose any formal review requirements for new R packages; CRAN has strict technical checks for package consistency, and rOpenSci (Boettiger et al. 2015) and Bioconductor (Gentleman et al. 2004) have implemented comprehensive human-curated software review procedures that can improve the overall quality of the research software, including source code, compliance with standards, and documentation.

2.3 Workflows and tutorials

Sharing of technical knowledge and best practices can be greatly facilitated by open online resources (table 1) (Callahan et al. 2016a; Schloss 2018). Open practices support community-driven development of methods and algorithms, thus promoting free and open knowledge sharing and helping to democratize microbiome data science by limiting monopolies of power. Good practices in microbiome data science workflows include routine application of automated unit tests and crowd-sourced quality control in the form of issue reports and case studies in reproducible notebooks (Ram 2013; Wilson et al. 2017). Various examples of such case studies have been made openly available through common software repositories (see, e.g., Baxter et al. 2016; Proctor et al. 2018). Availability of source code can greatly facilitate the reproducibility, verification and further use of the algorithms and has the potential to increase the overall efficiency and impact of research.
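As an illustration of the automated unit tests mentioned above, the sketch below checks a simple relative abundance transformation using Python's standard unittest module. Both the function and the test cases are hypothetical and serve only to show the pattern: each expected property of a method is stated as an executable check that can be rerun automatically whenever the code changes.

```python
import unittest
from typing import List, Sequence


def to_relative_abundance(counts: Sequence[int]) -> List[float]:
    """Convert read counts of one sample to proportions (hypothetical helper)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


class TestRelativeAbundance(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(to_relative_abundance([3, 1, 4])), 1.0)

    def test_preserves_proportions(self):
        self.assertEqual(to_relative_abundance([1, 3]), [0.25, 0.75])

    def test_empty_sample_is_rejected(self):
        with self.assertRaises(ValueError):
            to_relative_abundance([0, 0])


# Run the suite programmatically; a continuous integration service would
# typically execute the same tests on every change to the repository.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRelativeAbundance)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The third test documents how an edge case (a sample without reads) is expected to behave, which is the type of detail that issue reports from users often bring to light.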

3 Discussion

Microbiome data science facilitates collaborative development of algorithms and methods. In the collaborative development model, independent research groups contribute to the same methods base, for instance through shared version control systems. This has facilitated access to various algorithmic methods in microbial ecology. Whereas we have provided a brief overview of the current microbiome data science ecosystem in R, many complementary methods are available in Python and other environments. Recent developments towards integrating R and Python, two widely used data science programming environments, have been remarkable, and new packages such as reticulate have emerged to allow fluent exchange of information between Python and R (Allaire et al. 2018). The resulting ability to perform open data analysis in a single environment can greatly support the trustworthiness, reproducibility and reusability of research outcomes. Most currently available R packages are heavily focused on 16S rRNA gene analysis, and often contain overlapping functionality whose performance has not yet been comprehensively benchmarked. At the same time, demand is increasing specifically for methods that could facilitate the analysis and integration of deep metagenomic and multi-omics profiling data and multivariate time series, both in targeted case studies and in large population cohorts.