Microbiome samples are accumulating rapidly, representing microbial communities from every niche (biome) of our body as well as the environment (Mitchell et al. 2020; Integrative HMP (iHMP) Research Network Consortium 2019; Thompson et al. 2017; Sunagawa et al. 2015). The fast-growing number of microbiome samples, together with the diversity of the sources from which they are collected, offers an unprecedented opportunity to better understand microbial evolution and ecology (Mitchell et al. 2020; Segata et al. 2013; Integrative Human Microbiome Project 2019). While these samples embody profound biological patterns and regulation principles, understanding them depends heavily on data integration and big-data mining (Knight et al. 2018), including data-driven microbiome marker identification (Segata et al. 2011), non-linear relationship mining (Surana and Kasper 2017), dynamic pattern discovery (Halfvarson et al. 2017; Ren et al. 2017; Bashan et al. 2016; Backhed et al. 2015; Liu et al. 2019), and regulation principle discovery (Han et al. 2020).

In this chapter, we first introduce several terms used in microbiome research, followed by an introduction to microbiome big-data. We then focus on microbiome databases and mainstream microbiome data mining techniques. We present several microbiome applications to showcase the power of microbiome big-data integration and mining for knowledge discovery and clinical applications. Finally, we summarize the current status of microbiome big-data analysis, point out several bottlenecks, and outline prospects in this research area.

1 Microbial Communities, Metagenome, and Microbiome

As ubiquitous and important organisms in nature, microorganisms usually coexist in the form of a “microbial community” (Thompson et al. 2017; Sunagawa et al. 2015; Segata et al. 2013; Integrative HMP (iHMP) Research Network Consortium 2014). A microbial community usually contains dozens to thousands of different microorganisms. These species cooperate with each other to adapt to changes in the environment, and their life activities also have a long-term and profound impact on the environment (Thompson et al. 2017; Integrative HMP (iHMP) Research Network Consortium 2014). As human understanding of microorganisms deepens, basic research on microbial communities and its applications in the fields of health and environment have become increasingly important (Integrative Human Microbiome Project 2019; Biteen et al. 2016). The main research objects of microbiome studies include all the genetic materials of microbial communities, related environmental parameters and metabolites, as well as their complex relationships and dynamic changes.

In microbiome research, several terms need to be defined clearly, including microbiota, metagenome, and microbiome (Whiteside et al. 2015). A microbial community (microbiota) is a mixture of microbial species living, adapting, and evolving in a certain environment. Metagenome refers to the total genetic material in the microbial community. While a metagenome can be obtained by shotgun sequencing, many projects still rely on 16S rRNA amplicon sequencing, which quantitatively profiles only the species in the community. Microbiome refers to all genetic and non-genetic information contained in the microbial community, including the metagenome as well as all environmental factors and metabolites in the community. A brief illustration of the definitions and relationships of microbial communities, metagenome, and microbiome is provided in Fig. 10.1.

Fig. 10.1

The definitions of microbiota, metagenome, and microbiome. The same shape and color represent the same species, while different symbols represent different entities. (a) Microbiota: identification of all species in the microbial community using 16S rRNA sequencing. (b) Metagenome: all genetic materials in the microbial community. (c) Microbiome: all genetic materials, environmental factors, and metabolites in the microbial community

Microbiome research is mostly conducted with omics approaches (Mitchell et al. 2020; Segata et al. 2013). First, samples are collected from their niches and stored in tubes at −20°C before DNA extraction and amplification. High-throughput sequencing is then conducted, by means of either 16S rRNA sequencing or metagenomic sequencing, and the sequencing data are transferred for analysis (Knight et al. 2018).

1.1 The Differences Between 16S and Metagenomes

The sequencing principles: 16S rDNA contains nine hypervariable regions and ten conserved regions. In 16S sequencing, a segment of the hypervariable regions is amplified by PCR and sequenced. Metagenomic sequencing resembles conventional DNA library construction: microbial genomic DNA is randomly broken into small fragments, and adapters are then ligated to both ends of the fragments for high-throughput sequencing.

Different fields of study: 16S rRNA sequencing mainly studies species composition, the evolutionary relationships among species, and the diversity of communities. Beyond these, metagenomic sequencing can also be used for further research at the gene and functional levels.

Different degrees of species identification: Most 16S sequencing results cannot be resolved to the species level, while metagenomic sequencing can identify microbes at the species level and even at the strain level.

The advantages and disadvantages of 16S rRNA and metagenomic sequencing methods for microbial community research have been summarized by Knight et al. (2018), and we provide the key points in Table 10.1.

Table 10.1 Advantages and disadvantages of 16S rDNA and metagenomic sequencing methods for microbial community research

2 The Microbiome Research Is Heavily Dependent on Big-Data

As the number of microbiome samples easily exceeds tens of thousands in a medium-sized data collection (Mitchell et al. 2020), the efficiency and accuracy of sample comparison and search become a critical bottleneck (Knight et al. 2018), not to mention the millions of samples in public databases, where the number of biomes has rapidly diversified from fewer than a hundred to more than three hundred (Fig. 10.2). The rapidly increasing number of samples from various niches on the planet has thus created a difficult hurdle for knowledge discovery from these samples (Mitchell et al. 2020).

Fig. 10.2

The fast-increasing number of microbiome samples and the rapidly diversifying biomes from which they are collected. Results are based on an assessment of the EBI MGnify database from 2011 to 2020

Microbiome research is heavily dependent on big-data, for three main reasons: (1) As traditional microbial research strategies cannot identify all the species in a community, species identification and quantification are now mostly done by sequencing combined with data analysis. (2) As heterogeneous microbial community samples are collected from hundreds of different niches around the world, the comparison of these communities can only be performed using big-data mining techniques. (3) The mining of millions to trillions of functional genes from microbial communities is also a data-driven task nowadays.

Big-data and machine learning technologies are well suited to the organization, integration, and in-depth analysis of microbiome data (Li et al. 2019; Cheng et al. 2019; Tang et al. 2019; Microbiota meet big data 2014). First of all, microbiome data has all of the 4V characteristics of big data (Volume, Variety, Value, Velocity). Volume: the amount of data to be collected, stored, and computed is large; the starting measurement unit of big data is at least the petabyte (P, 1000 TB), exabyte (E, one million TB), or zettabyte (Z, one billion TB). Variety: the data come in various types and from various sources, including structured, semi-structured, and unstructured data, and these multiple data types place higher demands on data-processing ability. Value: information is massive, but its value density is relatively low, so extracting value is like panning for gold in sand; how to mine the value of data through powerful machine algorithms is the most important problem to be solved in the era of big data. Velocity: data are generated and must be processed rapidly, a significant feature that distinguishes big data from traditional data mining. Secondly, microbiome big data needs to be deeply mined. Data mining (DM) is an emerging interdisciplinary subject that draws on multiple disciplines. It is a non-trivial process of extracting previously unknown, implicit, and potentially valuable information from huge amounts of data (Fig. 10.3).

Fig. 10.3

The characteristics and urgent needs in multi-omics research

3 Microbiome Data Integration and Databases

The development of microbiome research has profoundly boosted both data accumulation and research output. In the past 10 years, the number of publications has grown exponentially (Fig. 10.4a), based on more than 100 TB of microbiome data accumulated per year (Fig. 10.4b).

Fig. 10.4

In the past 10 or more years, (a) the number of publications has grown exponentially, (b) based on more than 100 TB of microbiome data accumulated per year

Currently, there are already databases dedicated to microbiome research (Table 10.2), including the specialized databases MG-RAST ((Meyer et al. 2008), http://metagenomics.anl.gov/) and CAMERA ((Seshadri et al. 2007), http://camera.calit2.net/), and the general database NCBI SRA (http://www.ncbi.nlm.nih.gov/sra). Among these databases, NCBI SRA (Kodama et al. 2012), MG-RAST (Meyer et al. 2008), and CAMERA2 (Seshadri et al. 2007) each host more than 10,000 microbiome projects, representing hundreds of thousands of samples and several TB of sequencing data.

Table 10.2 Commonly used microbiome databases (Zhang et al. 2017)

However, the microbiome data in several major databases have not been well curated, whether in terms of unified and integrated data formats or of the matching environmental parameters (metadata). A key issue is that microbiome data have not been effectively classified and organized, resulting in a bottleneck for sample classification and comparison. Microbial community samples and the relevant sequencing data are organized hierarchically according to a biome ontology. For example, at the end of 2019, EBI MGnify contained hundreds of thousands of samples from 491 biomes (https://www.ebi.ac.uk/metagenomics/biomes) (Mitchell et al. 2020), in which samples from human feces have the exact biome position “root > Host-associated > Human > Digestive system > Large intestine > Fecal.” This ontology structure is very beneficial to the classification of samples. However, the hierarchical organization of the current ontology is not strictly tree-like: a single ontology term can be the direct child of multiple parent terms. For example, “Fecal” has more than five upper-level ontology terms. Therefore, the living-environment ontology of each microbiome sample is likely to be multi-label. On the one hand, the multi-label nature of microbiome data is not conducive to the simple classification of samples, resulting in the bottleneck of sample classification and comparison. On the other hand, the multi-label attribute of microbiome data fits the characteristics of big-data research, and better results are expected when such data are processed by machine learning or deep learning.
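
The multi-label nature of biome annotations can be illustrated with a minimal Python sketch. It assumes that a biome position is stored as a string whose levels are separated by " > ", as in the example above. The function names and the second lineage for "Fecal" are hypothetical and serve only to show how one sample can receive labels from more than one branch of the ontology.

```python
def parse_lineage(lineage):
    """Split a lineage string such as 'root > Host-associated > Human' into its levels."""
    return [level.strip() for level in lineage.split(">") if level.strip()]


def ancestor_labels(lineages):
    """Collect every ancestor path of a sample as a label.

    A sample annotated with several lineages receives the union of their
    ancestor paths, which is what makes the classification task multi-label.
    """
    labels = set()
    for lineage in lineages:
        levels = parse_lineage(lineage)
        for depth in range(1, len(levels) + 1):
            labels.add(" > ".join(levels[:depth]))
    return labels


# Lineage quoted in the text.
fecal = "root > Host-associated > Human > Digestive system > Large intestine > Fecal"
# Hypothetical second parent path for "Fecal", used only for illustration.
fecal_alt = "root > Host-associated > Mammals > Digestive system > Fecal"

levels = parse_lineage(fecal)
labels = ancestor_labels([fecal, fecal_alt])

for label in sorted(labels):
    print(label)
```

Encoding each ancestor path as a separate label is one possible design choice; it lets a classifier be partly correct (for example, right at the "Host-associated" level) even when the leaf biome is wrong.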

4 Mainstream Microbiome Data Mining Techniques

With regard to microbiome data mining tools, current methods can be categorized according to their purposes (Table 10.3):

  1. Identification of microbial species based on the microbiome: Based on metagenome sequencing data, the species contained in the metagenome can be assigned to different taxonomic levels, such as phylum, class, order, family, and genus. At present, metagenome-based microbial species identification can be categorized into alignment-based and alignment-free sequence classification methods, both of which are based on the assumption that similar sequences originate from similar species. Alignment-based methods identify the species corresponding to a target sequence by comparing it with existing databases. Alignment-free sequence classification methods use the characteristics of the sequences themselves, such as GC content and codon usage frequency, to classify them into the species corresponding to the most similar sequences. Typical examples of species identification tools include MEGAN (Huson et al. 2007) and QIIME2 (Bolyen et al. 2019). However, these methods are mostly limited to sequences of known taxa and functions in databases (sequences in databases are mostly from model organisms or culturable microorganisms), so the exact species of the majority of microorganisms in microbial communities remain largely unclear.

    Table 10.3 Representative analytical platforms for microbiome researches
  2. Tools for microbial community structure decoding: Tools for microbial community structure decoding and comparison include those for species composition analysis, such as Phyloshop (Shah et al. 2011), Parallel-Meta (Su et al. 2012), and MEGAN (Huson et al. 2007), and those for microbial community comparison, including UniFrac (Lozupone and Knight 2005) and Fast UniFrac (Hamady et al. 2010). However, these tools still have limitations. MEGAN (Huson et al. 2007) and STAMP (Parks and Beiko 2010) provide an approach for comparing microbial community samples based on species composition, but this approach is limited because it ignores the evolutionary relationships among species (Hamady and Knight 2009). UniFrac (Lozupone and Knight 2005) and Fast UniFrac (Hamady et al. 2010) take phylogenetic information into consideration, yet they can hardly handle thousands of samples because of their large time cost. Efficient and accurate sample comparison and search methods are still lacking, especially model-based methods.

  3. Microbial-based functional profiling and regulation model generation: Research on predicting the main functions of species is still in its infancy. Methods such as PICRUSt (Langille et al. 2013), based on 16S rRNA data, can analyze differences between samples by inferring the composition of functional genes in the samples. However, this prediction method cannot fully reflect the detailed functional composition and metabolic pathways of different species in a sample. At the level of functional genes in microbial communities, the analysis of gene groups such as biosynthetic gene clusters (BGCs) and antibiotic resistance genes (ARGs) is supported by typical analysis platforms such as antiSMASH (Medema et al. 2011) and NaPDoS (Ziemert et al. 2012), as well as by databases such as IMG-ABC (Hadjithomas et al. 2015), DoBISCUIT (Ichikawa et al. 2013), and ClusterMine360 (Conway and Boddy 2013). Functional annotation and enrichment analysis of microbiome genes can deepen the understanding of microbial community functions, key metabolic pathways, and microbiome-host metabolic regulation mechanisms. However, the microbiome contains a large number of genes, and the functions of most of them are unknown.

  4. Microbial gene mining from metagenomics data: At present, the main databases and software include the DoBISCUIT (Ichikawa et al. 2013) system (http://www.bio.nite.go.jp/pks/), which is based on manually curated data; databases designed for specific types of metabolites, such as the ClusterMine360 (Conway and Boddy 2013) database system; the NaPDoS (Ziemert et al. 2012) analysis system (http://napdos.ucsd.edu/) for secondary metabolism genes; the COBRA (Becker et al. 2007) system for metabolic modeling of the intestinal flora; and the antiSMASH (Medema et al. 2011) biosynthetic gene cluster (BGC) analysis system. These methods, however, depend heavily on reference sequences, and they perform poorly when the reference sequences for the species in a microbial community are incomplete. The genes around the “environment–microbial community–metabolism” chain are largely unannotated, leaving much room for improvement.

    Fig. 10.5

    Scheme of integrative pan-genome with metagenome studies on microbial community. (a) Using the pan-genome of a set of genomes from isolates as a reference to recruit reads from metagenomes and quantify the relative frequency of each gene sequence in the community. (b) Binning co-abundant genes obtained from de novo assembly across metagenomic samples to reconstitute metagenomic species pan-genomes. Core or accessory genes of a microbial species co-occur across samples and therefore show co-abundance. This figure was adapted from a previously published work [Integrating pan-genome with metagenome for microbial community profiling. Computational and Structural Biotechnology Journal, 2021, 19:1458–1466] with permission of the authors

  5. Microbiome data analysis platform: There are currently several analytics platforms that cover the main steps of microbiome data analysis, such as QIIME (Caporaso et al. 2010), MG-RAST (Glass et al. 2010; Keegan et al. 2016), CAMERA (Seshadri et al. 2007), and EBI Metagenomics (now known as EBI MGnify) (Mitchell et al. 2020). These platforms often combine large datasets with data-processing pipelines. At present, the biggest bottleneck is that the development of metagenomic data analysis platforms lags far behind the rapid accumulation of metagenomic data. In particular, the integrative analysis and deep mining of massive metagenomic data together with other omics data are urgently needed.
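
The alignment-free idea described in item 1 of the list above can be sketched in a few lines of Python. The sketch uses GC content and a dinucleotide (k = 2) frequency profile as sequence features and assigns a read to the reference with the closest profile. The reference sequences, taxon names, and the choice of Euclidean distance are illustrative assumptions; they are not taken from MEGAN, QIIME2, or any other tool cited in this chapter.

```python
import math
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq, k=2):
    """Relative frequency of every DNA k-mer, reported in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(read, references, k=2):
    """Assign the read to the reference whose k-mer profile is closest."""
    profile = kmer_profile(read, k)
    return min(
        references,
        key=lambda name: euclidean(profile, kmer_profile(references[name], k)),
    )


# Toy reference sequences with hypothetical taxon names.
references = {
    "AT_rich_taxon": "ATATTAATTTAAATATTATAATTTATAAAT",
    "GC_rich_taxon": "GCGGCCGCGCCGGGCCGCGGCGCCCGGCGC",
}
read = "GCCGCGGGCCGCGC"

assigned = classify(read, references)
print(assigned, gc_content(read))
```

As the text notes, such a classifier can only return taxa that are present in the reference set, which is why sequences from uncultured organisms remain hard to place.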

5 Integration of Metagenome and Pan-Genome Towards Holistic Analysis of Microbial Communities

Microbiome data is mostly analyzed by the metagenome approach (Fig. 10.5). Metagenomics has been used to study changes in community organization and microbial inhabitants, resulting in the discovery of a remarkable amount of genomic diversity and the characterization of new bacterial members (Integrative HMP (iHMP) Research Network Consortium 2014; Riesenfeld et al. 2004). A series of metagenome analysis tools, such as MEGAHIT (Li et al. 2015), MEGAN (Huson et al. 2007), and MetaPhlAn2 (Truong et al. 2015), have been proposed for metagenomic assembly, taxonomic profiling, and functional analysis. Analyses of microbiome composition and function at different sites of the human body, including the skin, oral cavity, and gut, show great differences in microbial structure (Koren et al. 2011; Costello et al. 2009). For example, the bacteria represented on human skin include Staphylococcus, Micrococcus, and Corynebacterium (Fredricks 2001; Grice et al. 2009), while the dominant microorganisms in the oral cavity are Streptococcus, Lactobacillus, and Fusobacterium (Dewhirst et al. 2010; Teng et al. 2015). In addition, the main microbial components of the human gut are Bacteroides and Prevotella (Costea et al. 2018; Wu et al. 2011). These microbes have coevolved with their human hosts, and they are also related to human health and disease (Costello et al. 2009; Clemente et al. 2012). The composition of microbes varies greatly among hosts and changes dynamically under different environmental factors (Costello et al. 2009). For example, Sonnenburg and colleagues revealed a seasonal cycle in the gut microbiota of Hadza hunter-gatherers, together with a corresponding enrichment of functions: some taxa, especially Bacteroides, vary with the season, particularly between the dry season and the wet season (Smits et al. 2017). Such studies reveal a seasonal succession of the microbial community in the human gut.
In addition, studies of microbial communities in natural environments such as soil (Daniel 2004), the deep sea (Mason et al. 2014), and wastewater (Guo et al. 2017) have uncovered hundreds of microbes, new genes, and uncharacterized metabolic processes, revealing remarkable microbial diversity and complexity.

6 Deep Learning Techniques for Microbiome Research

In recent years, more and more deep learning techniques have been developed for mining microbiome big-data (Li et al. 2019; Tang et al. 2019; Lan et al. 2018; Min et al. 2017; Wang and Gao 2019). These techniques mainly address functional gene mining, dynamic pattern discovery, and phenotype prediction problems.

  1. For sample comparison and search: In microbial community source tracking, the traditional unsupervised learning methods SourceTracker (Knights et al. 2011) and FEAST (Shenhav et al. 2019) can achieve very high accuracy when there are hundreds of samples and a handful of biomes. When the number of samples and biomes increases, however, their running time increases very rapidly, which prevents large-scale source tracking. This dilemma can be addressed by deep learning: by using model-based methods such as neural networks, both speed and accuracy can be achieved for the source tracking problem.

  2. For gene mining: An example is ARG mining, for which the traditional BLAST method can find candidate ARGs when they match those in the database. However, such an approach is limited to known ARGs, and the search time can be long when millions of candidates have to be screened. Again, the deep learning approach has led to model-based methods that can mine novel ARGs out of millions of candidates in an efficient manner.

All of these limitations call for AI techniques that can discover more knowledge from microbiome dark matter. AI techniques are advantageous in generating models from a massive number of samples, and such models are representative of the global profile of the context-dependent subjects (Kodama et al. 2012). AI techniques are therefore suitable for accurate and fast search when new samples (a community, a gene, or a pattern) are searched against the models (Paczian et al. 2019; Markowitz et al. 2008; Daniel 2004). They are thus especially suitable for microbiome dark matter mining, particularly when there is a tradeoff between accuracy and efficiency.

The solutions for eliminating the tradeoffs of current methods rely on deep learning approaches (Kodama et al. 2012; Paczian et al. 2019; Meyer et al. 2019; Markowitz et al. 2008; Gonzalez et al. 2018). First of all, model-based methods such as neural networks can be very fast for source tracking: once a sound model has been built, a new sample can be tracked quickly, with an accuracy comparable with or even better than that of existing distance-based and unsupervised methods. The same approach is suitable for the gene mining problem. For spatial-temporal dynamic pattern mining, deep learning methods can also discover intrinsic patterns in cross-sectional or longitudinal cohorts.
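
The "train once, query fast" property of model-based source tracking can be illustrated with a deliberately small Python sketch. It is not the method of SourceTracker, FEAST, or any published deep learning tool: it uses the simplest possible stand-in for a neural network, a single-layer softmax classifier, and the biome names, taxon abundances, and hyperparameters are all hypothetical toy values.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Estimated probability of each source biome for one abundance profile."""
    scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return softmax(scores)


def train(samples, labels, n_classes, rng, epochs=300, lr=0.5):
    """Fit a single-layer softmax classifier by stochastic gradient descent."""
    n_features = len(samples[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            probs = predict_proba(weights, biases, x)
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                biases[c] -= lr * grad
                for j in range(n_features):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


rng = random.Random(0)  # fixed seed so the toy example is reproducible
BIOMES = ["gut", "skin", "soil"]  # hypothetical source biomes
CENTROIDS = [  # hypothetical mean relative abundances of four taxa
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.05, 0.30, 0.60],
]


def noisy_profile(centroid):
    """Simulate one sample by perturbing a centroid and renormalizing."""
    raw = [max(1e-6, v + rng.uniform(-0.05, 0.05)) for v in centroid]
    total = sum(raw)
    return [v / total for v in raw]


samples, labels = [], []
for label, centroid in enumerate(CENTROIDS):
    for _ in range(20):
        samples.append(noisy_profile(centroid))
        labels.append(label)

# The expensive step (training) is done once ...
weights, biases = train(samples, labels, n_classes=len(BIOMES), rng=rng)

# ... after which each query costs only one pass through the model.
query = [0.65, 0.25, 0.05, 0.05]
contributions = predict_proba(weights, biases, query)
predicted = BIOMES[contributions.index(max(contributions))]
print(predicted)
```

The cost of a query here does not depend on the number of training samples, whereas distance-based approaches compare each query against the stored samples; this is the efficiency argument made in the text, shown under toy assumptions.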

7 Representative Microbiome Applications

7.1 Case Study 1: Enterotype Analysis (Costea et al. 2018)

In 2011, 16S rRNA genes in human fecal samples from three countries (Denmark, Spain, and the USA) were sequenced with three sequencing technologies (Illumina, 454, and Sanger), and three enterotypes were identified (Costea et al. 2018). The enterotypes were described as “a dense cluster of samples in a multidimensional space composed of communities” and were not affected by age, sex, cultural background, or geographical location. For each enterotype, an indicator/driver group was found at the center of the co-occurrence network of microbes that was most strongly associated with the enterotype. For example, enterotype 1, also denoted ET B, has Bacteroides as its best indicator group. Enterotype 2, also denoted ET P, is driven by Prevotella, whose abundance is usually inversely related to the abundance of Bacteroides. Enterotype 3, also denoted ET F, is distinguished by the proportion of Firmicutes, among which the main group is Ruminococcus. All of the above analyses are based on classification at the genus level, because the genus level better reflects ecological niche changes (Costea et al. 2018). Although some genera show functional heterogeneity, such as Streptococcus, which contains common symbionts, lethal pathogens, and groups that can be used for food fermentation, genus-level analysis is generally reliable.

7.2 Case Study 2: Gene Mining (Qin et al. 2010)

7.2.1 Human Intestinal Microbiome Reference Gene Set

The authors describe the assembly and characterization of 3.3 million non-redundant microbial genes from fecal samples of 124 European individuals by Illumina-based metagenomic sequencing. This gene set is 150 times larger than the human gene complement, contains the vast majority of the (more common) microbial genes in the cohort, and probably includes the majority of the human gut microbial genes. These genes are shared to a large extent between individuals in this cohort. More than 99% of the genes were bacterial, suggesting that the entire cohort contained between 1000 and 1150 prevalent bacterial species, with each individual containing at least 160 such species, which were also largely shared. The authors define and describe the minimum intestinal metagenome and the minimum intestinal bacterial genome in terms of the functions present in all individuals and in most bacteria, respectively.

Most of the microbes that live in the gut have profound effects on human physiology and nutrition and are essential to human life. The content, diversity, and function of the gut microbiome are studied in order to understand and utilize its influence on human health. Methods based on 16S ribosomal RNA (rRNA) gene sequences revealed that two bacterial phyla, the Bacteroidetes and the Firmicutes, make up more than 90% of the known phylogenetic categories and dominate the distal intestinal flora. Studies have also shown that the gut microbiome differs greatly between healthy individuals.

7.2.2 Metagenomic Sequencing of the Intestinal Microbiome

As part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project, the authors collected fecal samples from 124 healthy, overweight, and obese adult individuals and patients with inflammatory bowel disease (IBD) in Denmark and Spain. Total DNA was extracted from the fecal samples.

To generate an extensive catalogue of microbial genes from the human gut, the authors first assembled the short Illumina reads into longer contigs, which could then be analyzed and annotated using standard methods. Using SOAPdenovo, the authors performed de novo assembly of all Illumina GA sequence data. Up to 42.7% of the Illumina GA reads were assembled into a total of 6.58 million contigs, and nearly 35% of the reads from any one sample could be mapped to contigs from other samples, indicating the presence of a common sequence core.

To complete the contig set, the authors combined the unassembled reads from all 124 samples and repeated the de novo assembly process. About 400,000 additional contigs were thus generated, with a total length of 370 Mb and an N50 of 939 bp. The total length of the authors’ final contig set was therefore 10.7 Gb. Approximately 80% of the 576.7 Gb of Illumina GA sequence could be aligned to the contigs at a 90% identity threshold, which accommodates sequencing errors and strain variability in the gut. This is almost double the 42.7% of sequence that SOAPdenovo assembled into contigs, because assembly uses more stringent criteria. This indicates that the authors’ contigs represent the vast majority of the Illumina sequences.
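
The N50 statistic quoted above summarizes assembly contiguity: it is the contig length L such that contigs of length at least L together cover at least half of the total assembly length. A minimal Python sketch follows; the contig lengths are toy values, not data from the study.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L cover
    at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (total 54, so half of the assembly is 27).
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contig_lengths))  # 10 + 9 + 8 = 27 reaches half, so N50 is 8
```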

7.2.3 Gene Set of the Human Intestinal Microbiome

To establish a non-redundant human gut microbiome gene set, the authors first used the MetaGene program to predict ORFs in the contigs and found 14,048,045 ORFs longer than 100 bp. These ORFs occupied 86.7% of the contigs, comparable to the 86% found in fully sequenced genomes. Two-thirds of the ORFs appear to be incomplete, possibly due to the size of the authors’ contigs (N50 of 2.2 kb). Next, the authors removed redundant ORFs by pairwise comparison, using a very strict criterion of 95% identity over 90% of the shorter ORF length. This criterion can fuse orthologues but avoids inflation of the dataset due to possible sequencing errors.
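
The redundancy criterion above (95% identity over 90% of the shorter ORF) can be written down as a small Python sketch. It assumes that pairwise alignment results are already available as (identity, aligned length) pairs; the ORF names, lengths, and alignment values are hypothetical. The greedy longest-first filtering order is also an assumption made for illustration, since the text specifies only the pairwise threshold.

```python
def is_redundant(identity, aligned_length, len_a, len_b,
                 min_identity=0.95, min_coverage=0.90):
    """True when a pairwise ORF alignment meets both thresholds:
    identity >= 95% over >= 90% of the shorter ORF."""
    shorter = min(len_a, len_b)
    coverage = aligned_length / shorter
    return identity >= min_identity and coverage >= min_coverage


def non_redundant(orf_lengths, alignments):
    """Greedy filter: visit ORFs from longest to shortest and drop any ORF
    that is redundant with an ORF that has already been kept."""
    kept = []
    for orf in sorted(orf_lengths, key=lambda name: (-orf_lengths[name], name)):
        clash = False
        for other in kept:
            hit = alignments.get(frozenset((orf, other)))
            if hit and is_redundant(hit[0], hit[1], orf_lengths[orf], orf_lengths[other]):
                clash = True
                break
        if not clash:
            kept.append(orf)
    return kept


# Hypothetical ORF lengths (bp) and pairwise alignments (identity, aligned bp).
orf_lengths = {"orfA": 900, "orfB": 600, "orfC": 450}
alignments = {
    frozenset(("orfA", "orfB")): (0.97, 570),  # 95% of orfB covered: redundant
    frozenset(("orfA", "orfC")): (0.99, 300),  # only 67% of orfC covered: kept
    frozenset(("orfB", "orfC")): (0.80, 440),  # identity too low: kept
}

catalogue = non_redundant(orf_lengths, alignments)
print(catalogue)
```

Measuring coverage against the shorter ORF means that a fragment fully contained in a longer gene is merged into it, which matters here because two-thirds of the predicted ORFs are incomplete.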

The authors refer to the genes in the non-redundant set as “prevalent genes” because they are encoded on contigs assembled from the most abundant reads. The authors examined the number of prevalent genes found across all individuals as a function of the extent of sequencing, requiring at least two supporting reads for a gene call. The incidence-based coverage estimator (ICE) of richness, determined at 100 individuals (the maximum number that the EstimateS program can accommodate), indicates that the authors’ catalogue captured 85.3% of the prevalent genes. Although this may be an underestimate, it still suggests that the catalogue contains the vast majority of the prevalent genes in this cohort.

Each person carries 536,112 ± 12,167 of the prevalent genes, suggesting that most of the 3.3 million-gene pool must be shared. But most of the prevalent genes were found in only a few individuals: 2,375,655 were found in less than 20% of individuals, while 294,110 were found in at least 50% of individuals (the authors call these “common” genes). These values depend on the sampling depth: the sequencing of MH0006 and MH0012 revealed more catalogue genes, which were present in low abundance. Still, even at regular sampling depths, each person has 204,056 ± 3,603 common genes, suggesting that about 38% of an individual’s total gene pool is shared. Interestingly, patients with IBD carried, on average, 25% fewer genes than those without the disease, which is consistent with the observation that the former had less bacterial diversity than the latter.
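
The prevalence thresholds used above (genes found in less than 20% of individuals, and “common” genes found in at least 50%) can be made concrete with a short Python sketch. The individuals and gene identifiers are toy values; only the two thresholds and the final arithmetic check on the 38% figure come from the text.

```python
def prevalence(gene_sets):
    """Fraction of individuals carrying each gene."""
    n = len(gene_sets)
    counts = {}
    for genes in gene_sets.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: c / n for gene, c in counts.items()}


# Toy presence sets; individual and gene names are hypothetical.
individuals = {
    "ind1": {"g1", "g2", "g3"},
    "ind2": {"g1", "g2"},
    "ind3": {"g1", "g4"},
    "ind4": {"g1", "g2", "g5"},
    "ind5": {"g1"},
    "ind6": {"g1", "g3"},
}

prev = prevalence(individuals)
common = {g for g, p in prev.items() if p >= 0.5}  # at least 50% of individuals
rare = {g for g, p in prev.items() if p < 0.2}     # less than 20% of individuals

# Share of each individual's genes that belong to the common set.
shared_fraction = {
    name: len(genes & common) / len(genes) for name, genes in individuals.items()
}

# Arithmetic behind the 38% quoted in the text: common genes per person
# divided by prevalent genes per person.
text_fraction = round(204_056 / 536_112, 2)
print(sorted(common), sorted(rare), text_fraction)
```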

7.3 Case Study 3: Plasticity of Intestinal Flora (Dynamic Pattern) (Liu et al. 2019)

First of all, at the macroscopic level of research on the plasticity of the intestinal flora, the project team and the Capital Medical University monitored the dynamics of the intestinal flora of a foreign-aid medical team (the volunteer team, VT) for more than a year, aiming to study the influence of dietary changes on the structure of the human intestinal flora.

In this study, we recruited a team of 10 Chinese volunteers who set out from Beijing, stayed in Trinidad and Tobago (TAT) for 6 months, and then returned to Beijing. A high-density longitudinal sampling strategy (an average of 19 time points per VT member) was used to collect their stool samples (188 samples) and detailed dietary information. We divided the entire longitudinal study into six stages: T1 represents the pre-travel stage (20 samples); T2 (28 samples), T3 (60 samples), and T4 (21 samples) represent three time slots during the VT’s stay in TAT; and T5 (35 samples) and T6 (20 samples) represent two time slots after the VT returned to Beijing. At the same time, we also collected samples from healthy people in Beijing (BJN, 57 samples), healthy people in TAT (TTN, 28 samples), patients in TAT (TTP, six samples), and Chinese people living in TAT (TTC, eight samples) as control data sets. Finally, we sequenced the V4 hypervariable region of the microbial 16S rRNA gene in 287 stool samples from 41 individuals and analyzed the high-quality reads using QIIME (Caporaso et al. 2010).

We found that the microbial community in the intestine shows bidirectional plasticity and resilience during a long-term stay with multiple dietary changes. First, BJN and TTN show different microbial community patterns (Fig. 10.6a). However, the microbial communities of VT members changed from a BJN-like pattern to a TTN-like pattern during their stay in TAT and returned to the original pattern within 1 month after the VT returned to Beijing (Fig. 10.6b–f). In addition, although we found that location and population have a great influence on the differentiation of samples (Fig. 10.6g, h), the dynamic changes of each VT member show a specific trend (Fig. 10.6i, j), indicating that there may be an enterotype-dependent plasticity pattern among VT members. Moreover, the relative abundances of Firmicutes and Bacteroidetes showed strong adaptability over time and were negatively correlated along the time axis (Fig. 10.6k, l). Similarly, the relative abundances of Proteobacteria and Actinobacteria also showed a plasticity pattern (Fig. 10.6m, n). By tracking and comparing the common operational taxonomic units (OTUs) shared by at least 10% of VT members, we found that Firmicutes, Bacteroidetes, Proteobacteria, and Actinobacteria have distinct temporal dynamics during the long-term stay of the VT (Fig. 10.6o–r).

Fig. 10.6

Long-term human gut microbial community pattern and multiple dietary changes (Liu et al. 2019). (Reprinted with permission from authors of Liu et al. (2019))

7.4 Case Study 4: Athletes’ Gut Microbiota (Han et al. 2020)

The gut microbiomes of athletes and sedentary individuals differ in diversity and in certain taxa; however, it is unclear to what extent gut microbiome patterns differ among athletes themselves and whether athletes' potential can be effectively monitored through the microbiome.

This study collected a total of 306 fecal samples from 19 Chinese professional female rowers, who were divided into three groups according to their daily performance: adult elite athletes (AE), young elite athletes (YE), and young non-elite athletes (YN). The gut microbiomes of the groups were compared to determine the associations between the gut microbiome and diet, physical characteristics, and sports performance (Fig. 10.7).

Fig. 10.7

Gut enterotypes in elite and youth non-elite athletes. A total of 306 samples are stratified into three enterotypes. The major contributors to the three enterotypes are Prevotella, Bacteroides, and Ruminococcaceae_unclassified, respectively. (a) Relative abundances of the top genera (Prevotella, Bacteroides, and Ruminococcaceae_unclassified) in each enterotype. (b) The three enterotypes visualized by PCoA of Jensen-Shannon distance at the genus level. (c) The proportions of AE, YE, and YN samples distributed among the three enterotypes. 72.3% of AE, 61.9% of YE, and 27.27% of YN samples are found in enterotype 3. (d) Co-occurrence patterns among the dominant genera (average relative abundance >0.01%) across the samples from enterotype 3, as determined by Spearman correlation analysis. (Reprinted with permission from authors of Han et al. (2020))
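The Jensen-Shannon distance used for the PCoA in panel (b) compares two genus-level abundance profiles. The sketch below shows one common formulation, the square root of the Jensen-Shannon divergence with base-2 logarithms; this choice of formulation, the function names, and the genus counts are assumptions made for illustration and are not taken from Han et al. (2020).

```python
import math


def normalize(counts):
    """Convert raw counts (or abundances) into proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile must contain at least one positive value")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(profile_a, profile_b):
    """Square root of the Jensen-Shannon divergence between two profiles.

    With base-2 logarithms the result lies between 0 (identical profiles)
    and 1 (profiles with no taxa in common).
    """
    p, q = normalize(profile_a), normalize(profile_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Hypothetical genus-level counts for two samples, in the order
# Prevotella, Bacteroides, Ruminococcaceae_unclassified.
sample_1 = [700, 200, 100]
sample_2 = [100, 250, 650]

distance = jensen_shannon_distance(sample_1, sample_2)
print(f"Jensen-Shannon distance = {distance:.3f}")
```

Computing this distance for every pair of samples yields the distance matrix that an ordination method such as PCoA, or a clustering step for enterotype assignment, would take as input.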

Firstly, stratification of the gut microbiota showed that elite athletes and young non-elite athletes had different enterotypes. In terms of taxonomic structure and functional composition, SCFA-producing bacteria were found to be dominant in the microbial communities of elite athletes. Secondly, functional analysis showed that ATP metabolism, multiple sugar transport systems, and carbohydrate metabolism were enriched in the microbial communities of elite athletes. Furthermore, accurate classifiers built on a combination of taxonomic and functional biomarkers highlight the great potential of identifying candidate elite athletes from a group of athletes. Finally, the gut microbiota was shown to be closely related to physical characteristics, dietary factors, and exercise-related characteristics. Importantly, the variation in the athletes' gut microbiome, which may in turn influence athletic performance, is associated with dietary factors (29%) and physical characteristics (21%). These findings highlight the complex interplay of the gut microbiota, dietary factors, and athletes' physical characteristics and performance, with the gut microbiota as a key factor (Han et al. 2020).

8 Microbiome Research: Current Status, Bottlenecks, and Prospects

Today, microbiome research is, in many respects, a data-driven science. Firstly, sequencing techniques have advanced quickly, enabling the fast, batch profiling of millions of microbial community samples. Secondly, data mining techniques have also advanced quickly, enabling the batch discovery of functional genes and dynamic patterns, as well as the prediction of phenotypes with high accuracy and fidelity. Thirdly, although data-driven, many discoveries have later been verified by wet-lab experiments, such as those on several probiotics (Whiteside et al. 2015; Routy et al. 2018), which confirms the power and validity of these data-driven approaches.

However, several bottlenecks remain for microbiome big-data mining research. One of the most critical is big-data integration (Integrative Human Microbiome Project 2019), and another is the lack of AI techniques for the deep mining of important species, functional genes, and community dynamic patterns from large amounts of microbiome data (Microbiota meet big data 2014).

Despite these bottlenecks, microbiome research is rising sharply: many problems are on the verge of being solved, while many more new frontiers are on the horizon. It is foreseeable that, with several millions of samples from thousands of niches collected, sequenced, and analyzed, a much better understanding of microbial community ecology and evolution patterns will be achieved, and hundreds of clinical or environmental applications will be made possible.

8.1 Microbiome Research as Part of a Multi-Omics Exploration

Multi-omics studies will continue to grow in at least two directions: first, expanding from multi-omics for single organisms or single species to single-cell level omics studies, as well as to population and community level studies; second, moving toward the tight integration of multi-omics with data science as well as with clinical applications.

Fig. 10.8

The simplified grand view of multi-omics in the near future

In expanding the scope of multi-omics from single organisms or single species to single-cell level studies, as well as to population and community level studies, we have already seen rapid progress, largely due to advances in sequencing technology. In integrating multi-omics with data science and with clinical applications, very hard challenges still lie ahead. For example, determining the concordance of multi-omics data along the timeline remains an open problem.

9 Summary

Taken together, it has become clear that microbiome research, a rapidly growing omics research area, has already accumulated enough high-quality data and developed sufficient data mining techniques for large-scale microbiome data mining towards an in-depth understanding of microbial communities. The microbial community niches, species, functional genes, and their dynamics constitute the microbial dark matter, which has emerged as a grand challenge for microbiome research. The fast development of microbiome data mining will certainly boost the discovery of many more resources and regulation patterns from this dark matter. The integration of microbiome data with other omics data will lead to a more complete picture of the dynamic patterns and regulation principles of the microbiome world.