4.1 Introduction

Throughout the evolution of life on Earth, microbes have played an important role, and they contribute greatly to human sustenance and survival. Having adapted to a wide range of environments, they are found everywhere: on land, below the surface, in water, and in air. To understand their impact on global ecology, it is essential to understand their diversity and biology. According to estimates, about 99% of microbes cannot be grown in pure culture, which is a major obstacle to understanding microbial genetics and community ecology. Microbial communities are responsible for biological activities in all environments, including the ocean (DeLong 2005), soils, and human-associated habitats (Ravel et al. 2011). Although metagenomics is a young and emerging field, it has revealed microbial diversity that could not be accessed with the traditional, culture-based methods of microbiology. Metagenomics has emerged as a powerful and reliable technique for analyzing the genomes of an entire microbial community, removing the need to isolate and culture individual species (Arrial et al. 2009). It has wide potential for discovering novel enzymes for industrial applications, antibiotics against harmful microbes, and organisms for experimental purposes.

The major themes of metagenomics are (a) marker metagenomics, which surveys microbial community structure by targeting the highly conserved 16S rRNA gene; (b) functional metagenomics, which uses the total environmental DNA to infer the metabolic potential of the microbial community; and (c) identification of novel enzymes. Metagenomics uses two sequencing approaches: targeted metagenomics and shotgun sequencing. Targeted metagenomics is most commonly used to determine the phylogenetic diversity and relative abundance of organisms in a given sample, mainly by investigating the diversity of the small-subunit rRNA gene (16S/18S rRNA). It is often used to understand how an environmental contaminant alters microbial community structure. In a targeted study, environmental DNA is extracted from the source, the gene of interest is amplified using PCR primers, and the amplicons are sequenced using next-generation sequencing. Targeted metagenomics is useful for assessing the diversity of a single gene of interest, but it is limited by the PCR primers chosen for the analysis (Shakya et al. 2013; Parada et al. 2016; Klindworth et al. 2013; Prosser 2015).
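
The primer limitation of targeted metagenomics can be explored in silico before any sequencing is done. The following is a minimal Python sketch, not a method taken from the cited studies: it expands IUPAC degenerate bases and reports the fraction of template sequences in which a primer finds a binding site. The function names (`primer_matches`, `find_primer_sites`, `primer_coverage`) and the three toy templates are hypothetical and chosen only for illustration; the primer string is the commonly cited sequence of the 515F 16S forward primer, used here only as an example.

```python
# IUPAC degenerate nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_matches(primer, site, max_mismatches=0):
    """Return True if a (possibly degenerate) primer matches a site of equal length."""
    if len(primer) != len(site):
        return False
    mismatches = sum(1 for p, s in zip(primer, site) if s not in IUPAC[p])
    return mismatches <= max_mismatches


def find_primer_sites(primer, template, max_mismatches=0):
    """Return 0-based start positions where the primer matches the template."""
    k = len(primer)
    return [
        i
        for i in range(len(template) - k + 1)
        if primer_matches(primer, template[i:i + k], max_mismatches)
    ]


def primer_coverage(primer, templates, max_mismatches=0):
    """Fraction of templates that contain at least one binding site."""
    hits = sum(
        1 for seq in templates.values()
        if find_primer_sites(primer, seq, max_mismatches)
    )
    return hits / len(templates)


# Hypothetical toy templates (NOT real 16S sequences).
templates = {
    "taxon_A": "TTGACGTGCCAGCAGCCGCGGTAATTT",
    "taxon_B": "TTGACGTGCCAGCCGCCGCGGTAATTT",
    "taxon_C": "TTGACGAGCCTGCAGCAGCGGTTATTT",
}
primer = "GTGCCAGCMGCCGCGGTAA"  # M stands for A or C

if __name__ == "__main__":
    for name, seq in templates.items():
        print(name, find_primer_sites(primer, seq))
    print("coverage:", primer_coverage(primer, templates))
```

With these toy inputs, only two of the three templates contain a perfect match, which is the kind of primer bias that can hide a taxon from a targeted survey. A real evaluation would use curated reference sequences and would also check the reverse primer on the complementary strand.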

In shotgun metagenomic sequencing, by contrast, the whole genomic complement of an environmental community is studied. DNA is extracted from the environmental sample, fragmented to prepare sequencing libraries, and sequenced to determine the total genomic content of that sample. Shotgun sequencing is often limited by sequencing depth.
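
The depth limitation can be made concrete with a short calculation. The sketch below is an illustration under simplifying assumptions rather than a procedure from the chapter: mean depth is taken as c = N × L / G (number of reads times read length divided by genome size), and the expected fraction of a genome covered by at least one read is approximated as 1 − e^(−c), which assumes reads land uniformly at random. All function names and the example numbers are hypothetical.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Mean sequencing depth c = N * L / G."""
    return num_reads * read_length / genome_size


def fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


def taxon_coverage(total_reads, read_fraction, read_length, genome_size):
    """Mean depth for one community member.

    read_fraction is the (assumed) share of all reads that derive from
    this taxon; in a real sample it depends on abundance and genome size.
    """
    return expected_coverage(total_reads * read_fraction, read_length, genome_size)


if __name__ == "__main__":
    # Hypothetical run: 1 million 150 bp reads, 5 Mb genomes.
    dominant = taxon_coverage(1_000_000, 0.5, 150, 5_000_000)
    rare = taxon_coverage(1_000_000, 0.001, 150, 5_000_000)
    print(f"dominant taxon: depth {dominant:.2f}, covered {fraction_covered(dominant):.3f}")
    print(f"rare taxon:     depth {rare:.2f}, covered {fraction_covered(rare):.3f}")
```

Under these assumptions a taxon contributing 0.1% of the reads is sampled at a depth far below 1×, so most of its genome is never observed. This is why rare community members are the first to be lost when sequencing depth is restricted.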

Functional metagenomics has played a major role in understanding the contribution of microbial communities to microbial ecology and global geochemical cycles. It is also a unique way to identify novel enzymes from environmental samples (Uchiyama and Miyazaki 2009), and it has thereby enriched protein and nucleic acid databases with novel functional annotations. However, the major drawbacks of this technique include a low hit rate of positive clones, low throughput, and time-consuming screening (Hosokawa et al. 2015).

Metagenomics is currently a powerful technique with industrial applications in the identification of novel biocatalysts, the discovery of novel antibiotics, and bioremediation. Its applications are increasing rapidly and are described below.

4.2 Application of Metagenomics and the Impact on Environmental Biotechnology

The field of metagenomics is expected to benefit researchers in microbiology in two main ways. First, it will provide knowledge about bacteria that have not yet been cultivated (about 99% are uncultured in pure culture). Second, it will provide access to the whole microbial community residing in a variety of natural environments. Microbes are essential to sustaining life, and they play a crucial role in the industries that form the backbone of the present economy. Direct access to the genetic makeup of the whole microbial community of an ecosystem will provide a new basis for fundamental research and new tools for applications in the environment, agriculture, human health, bio-industry, etc. (Fig. 4.1).

Fig. 4.1 Various aspects of applications of metagenomics (also known as environmental and community genomics) in different fields of biological science

4.2.1 Industrial Enzymes

There is an increasing demand for novel enzymes for industrial applications, and metagenomics is playing an important role in providing these biomolecules (Lorenz et al. 2002; Schloss and Handelsman 2003), especially enzymes that are used in a wide range of applications (Kirk et al. 2002). Enzymes are required in minute amounts to synthesize large amounts of key molecules that serve as the major building blocks of active pharmaceuticals (Patel et al. 1994). Many industrial enzymes, such as cellulases, xylanases, lipases, and amylases, have very wide applications and act as the backbone of their industries.

Cellulases have attracted industrial interest because of their wide applications and the several enzyme activities they comprise, such as endoglucanases (EC 3.2.1.4), exoglycosidases, and β-glucosidases (EC 3.2.1.21). Today cellulase is the third most widely used enzyme in industry (Wilson 2009). Cellulases are mainly used in animal feed to improve digestibility. De-inking of paper is another evolving application of this enzyme (Soni et al. 2008). Metagenomics has played a vital role in obtaining cellulases from natural environments such as compost soil, soil from cold regions, and rumen samples. Cellulases have also been isolated from sugarcane soil and buffalo rumen (Alvarez et al. 2013; Duan et al. 2009).

Xylanases are key enzymes in the degradation of xylan and help break down hemicellulose, an essential component of the plant cell wall. Xylanases have a wide spectrum of industrial applications, such as clarification of juices (Sharma 2012), detergents (Kumar et al. 2004), production of pharmacologically active polysaccharides for use as antimicrobial agents (Christakopoulos et al. 2001), antioxidants (Katapodis et al. 2003), and production of surfactants (Kashyap et al. 2014). They are produced by a wide range of microbes from different sources. Xylanases have been reported in the insect gut and could be used to convert biomass into fermentable sugars for biofuel production (Brennan et al. 2004; Lee et al. 2006; Jeong et al. 2012). A xylanase was also reported in the saccharification of reed, again pointing to efficient conversion of biomass to fermentable sugars for biofuel production (Wang et al. 2012).

Lipases are mainly triacylglycerol acylhydrolases that convert triglycerides into diglycerides, monoglycerides, glycerol, and fatty acids. Because they tolerate varying environmental conditions such as temperature, pH, and organic solvents, they have great industrial prospects. They are widely found in plant and animal sources and have also been reported in microbes such as bacteria, fungi, and yeasts, with applications in the oil, pharmaceutical, and dairy industries, among others (Cardenas et al. 2001).

Amylases are starch-degrading enzymes that are abundant in plants, animals, and microbes. They are widely used for starch hydrolysis in the food, fermentation, and pharmaceutical industries. AmyI3C6, a cold-adapted alpha-amylase from metagenomic libraries of a cold and alkaline environment, may be useful because it showed potent activity in the presence of two commercial detergents. A novel amylase isolated from a soil metagenome showed 90% activity at low temperature, which demonstrated its potential for industrial exploitation (Sharma et al. 2010).

4.2.2 Bioactive Compounds and Antibiotics

Treating infections that are resistant to antibiotics is now a major worldwide health problem. Resistant microbes cause severe mortality and impose a large burden on healthcare budgets (Carlet et al. 2011). Antibiotics were originally used for treating human infections, but they became popular in agriculture, the food industry, and related sectors as well, which has ultimately had a high impact on human health (Radhouani et al. 2014). Antibiotics are currently considered pillars of modern medicine (Ball et al. 2013). Bacterial resistance to widely used antibiotics has forced researchers to search for novel antibiotics against infectious diseases.

Today metagenomics plays a vital role in the discovery of bioactive compounds and antibiotics. It is considered an alternative way of isolating antibiotics from environmental samples as well as of tracing the mechanisms of bacterial resistance. The combination of metagenomics and next-generation sequencing has paved the way for the study of antimicrobial resistance and microbial genomes (Forsberg et al. 2012; McGarvey et al. 2012). Bacterial resistance generally develops through horizontal gene transfer or spontaneous mutation in a target gene (Hassan et al. 2012). The transfer of antibiotic resistance genes involves the movement of genetic material to bacteria of the same group or of other species (Thomas and Nielsen 2005).

Metagenomics is being used to catalogue drug resistance genes in microorganisms against various classes of antibiotics. Another application is the identification of bioactive molecules with antimicrobial properties (MacNeil et al. 2001; Gillespie et al. 2002; Lim et al. 2005). Antibiotic resistance is an alarming worldwide problem and an emerging major threat (Čivljak et al. 2014), as microbes are developing resistance against many traditional antibiotics. At the same time, researchers are discovering many novel antimicrobial compounds from different environmental sources, including microorganisms, plants, and animals (Roy et al. 2013; de Souza Candido et al. 2014). Uncultivated soil microbes are reported to be a potential source of novel biomolecules that could be exploited in biotechnological applications (Wilson and Piel 2013), so these soil microbes can be regarded as an alternative source of bioactive molecules. Active biomolecules identified by metagenomic approaches include teicoplanin, friulimicin, azinomycin, rapamycin, borregomycin, etc.

4.2.3 Bioremediation

Bioremediation is the degradation and detoxification of environmental contaminants through microbe-mediated processes (Chakraborty et al. 2012). Because it removes biological and anthropogenic contaminants through natural processes, it is considered the most effective approach (Lovley 2003). Bioremediation approaches can be classified into three main classes: (a) natural attenuation, (b) biostimulation, and (c) bioaugmentation.

In natural attenuation, native organisms detoxify contaminants through natural processes. This approach is cost-effective and requires no additives. In biostimulation, the rate of bioremediation by native organisms is increased by removing environmental constraints, which usually requires the addition of nutrients. Biostimulation sometimes fails because the native organisms are unable to degrade the contaminant of concern. To overcome this problem, nonnative organisms or enzymes are added to enhance the rate of bioremediation, which is known as bioaugmentation. This approach is considered the most invasive because it introduces nonnative organisms. In some cases bioaugmentation is considered the most convenient means of remediation (Payne et al. 2011; Salanitro et al. 2000). Its major drawback is that the nonnative organisms may be unable to survive under the conditions found in the contaminated ecosystem.

Metagenomic approaches are now widely used for environmental monitoring and bioremediation. The approaches most often used for monitoring environmental microbes are targeted metagenomics and shotgun metagenomics. Targeted metagenomics is widely used to study the phylogenetic diversity and relative abundance of a particular gene in the environment, typically the diversity of rRNA sequences (16S/18S rRNA) in the sample. It is often used to study the impact of an environmental contaminant on microbial community structure. Its major advantage is that it provides information about the microbial community present in a set of samples and about the change in microbial diversity before and after a perturbation.
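
A change in diversity before and after a perturbation is usually summarized with simple indices computed from per-taxon read counts. The sketch below is a hedged illustration, not the analysis used in the cited work: it computes relative abundance, the Shannon index H' = −Σ p_i ln p_i, and the Bray-Curtis dissimilarity between two samples (on relative abundances). The function names and the `before`/`after` count tables are hypothetical.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with p_i > 0."""
    props = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in props if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity on relative abundances (0 = identical, 1 = disjoint)."""
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    taxa = set(pa) | set(pb)
    shared = sum(min(pa.get(t, 0.0), pb.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Hypothetical 16S read counts for one site (illustrative values only).
before = {"taxon_A": 25, "taxon_B": 25, "taxon_C": 25, "taxon_D": 25}
after = {"taxon_A": 85, "taxon_B": 5, "taxon_C": 5, "taxon_D": 5}

if __name__ == "__main__":
    print("H' before:", round(shannon_index(before), 3))
    print("H' after: ", round(shannon_index(after), 3))
    print("Bray-Curtis:", round(bray_curtis(before, after), 3))
```

In this toy case the community shifts from an even distribution to dominance by a single taxon, so the Shannon index falls and the Bray-Curtis dissimilarity between the two time points is large. Real studies would normally rarefy or otherwise normalize read counts before comparing samples.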

In shotgun metagenomics, the total genomic complement of the environmental community is probed by genome sequencing. Environmental DNA is extracted, fragmented to prepare genomic libraries, and sequenced to determine the total genomic content, from which the functional potential of a microbial community can be identified. More recently, metatranscriptomics and metaproteomics have been widely applied to environmental systems. In metatranscriptomics, ribonucleic acid (RNA) is extracted from the sample and converted to complementary deoxyribonucleic acid (cDNA), which is then sequenced in the same way as in metagenomics. Metaproteomics does not involve nucleic acid sequencing; instead it combines enzymatic digestion of proteins and liquid chromatography with high-resolution mass spectrometry (Hettich et al. 2013). It provides information about the proteins present in the environmental sample, including posttranslational modifications that may affect their activity.

Many industries contribute to increased levels of hydrocarbons in the environment through the incomplete combustion of fossil fuels. The release of these anthropogenic compounds results in the accumulation of large amounts of aromatic hydrocarbons, which contaminate ecosystems (Jacques et al. 2007). Microorganisms are involved in many biogeochemical cycles and have the potential to degrade hydrocarbons (Alexander 1994). Metagenomics can support the degradation of aromatic compounds by screening for and identifying suitable organisms in a metagenomic library obtained from an oil source (Sierra-García et al. 2014). Many genes and pathways for the degradation of phenol and aromatic compounds have been identified using metagenomic approaches (Silva et al. 2013). Bacterial populations capable of degrading polycyclic aromatic hydrocarbons (PAHs) were isolated from cold environments by identifying their functional targets (Marcos et al. 2006).

Oil spills have badly affected many parts of the natural marine ecosystem (National Academy of Science 2005) as a result of increased anthropogenic activity (Hazen et al. 2016; Atlas and Hazen 2011). The Deepwater Horizon oil spill is considered the worst marine oil spill in the USA and a major threat to marine ecosystems (King et al. 2015). In this context, metagenomics was first applied to understand the mechanism of oil biodegradation in the marine environment. Targeted metagenomics identified the microbial community in surface water as Cycloclasticus, Alteromonas, Halomonas, and Pseudoalteromonas (Redmond and Valentine 2012; Gutierrez et al. 2013). Deep water, in contrast, was reported to be composed primarily of psychrophilic oil-degrading microbes related to Oceanospirillales, Colwellia, and Cycloclasticus (Hazen et al. 2010). Shotgun metagenomics of samples collected during the Deepwater Horizon oil spill revealed a diverse group of genes responsible for chemotaxis and hydrocarbon degradation (Mason et al. 2012), and single amplified genomes showed genes involved in the degradation of n-alkanes and cycloalkanes. Metagenomic sequencing thus helps in understanding the mechanism of oil degradation by microbial communities in the marine environment.

4.2.4 Applications in Agriculture

Agricultural productivity is severely affected by organic and inorganic anthropogenic pollutants, which contribute significantly to abiotic stress and thereby reduce crop yield. Bioremediation is required to improve the quality of soil contaminated by such pollutants. Microorganisms of the soil metagenome can produce biosurfactants that remove anthropogenic pollutants such as hydrocarbons and heavy metals (Sun et al. 2006). Biosurfactants remove hydrocarbons and heavy metals through a combination of soil washing and cleanup technology (Pacwa-Płociniczak et al. 2010; Liu et al. 2010a, b; Partovinia et al. 2010; Gottfried et al. 2010; Coppotelli et al. 2010; Kang et al. 2010). Biosurfactants isolated from Lactobacillus pentosus reduced octane hydrocarbons in soil (Moldes et al. 2011). Biosurfactant-producing species such as Burkholderia, isolated from an oil-contaminated metagenome, may be potential candidates for the bioremediation of pesticides (Wattanaphon et al. 2008). Some studies have also shown that biosurfactants are more efficient than chemical surfactants at removing insoluble organic pollutants from soil (Cameotra and Bollag 2003; Straube et al. 2003). Soil samples from such fields should be subjected to metagenomic library preparation and subsequent analysis to identify biosurfactant-producing microbes.

Besides removing anthropogenic hydrocarbons and heavy metals, biosurfactants may also help eliminate plant pathogens because of their antimicrobial nature, thus promoting sustainable agriculture. Biosurfactants produced by rhizobacteria have antagonistic properties (Nihorimbere et al. 2011). In sustainable agriculture, biosurfactants and chemical surfactants are useful in biological control, which operates through parasitism, antibiosis, competition, induced systemic resistance, and hypovirulence (Singh et al. 2007). In fact, surfactants are applied in agriculture mainly to enhance the antagonistic activity of microbes and microbial products (Kim et al. 2004). Some studies have also shown that surfactants applied in combination with certain fungi, such as Myrothecium verrucaria, are useful for weed control (Boyette et al. 2002).

Additionally, biosurfactants are useful for inhibiting many phytopathogens. Biosurfactants isolated from Pseudomonas and Bacillus have reportedly been used to control soft rot caused by Pectobacterium and Dickeya spp., helping to protect economically valuable crops (Krzyzanowska et al. 2012). Many studies have reported that antipathogenic agents such as rhamnolipids can kill zoospores of plant pathogens that have become resistant to many commercial pesticides (Sha et al. 2011; Kim et al. 2011). Some researchers have proposed that rhamnolipids also stimulate plant immunity against various infectious agents (Vatsa et al. 2010). A lipopeptide biosurfactant of Bacillus origin was reported to inhibit the growth of phytopathogenic fungi such as Fusarium spp., Aspergillus spp., and Bipolaris sorokiniana, and such biosurfactants could be exploited as biocontrol agents (Velho et al. 2011). The surfactin isoform lipopeptide biosurfactant produced by Brevibacillus brevis strain HOB1 is reported to have potent antibacterial and antifungal properties that could be used to control phytopathogens (Haddad 2008). Pseudomonas fluorescens biosurfactants are well known for their antifungal properties (Nielsen et al. 2002). They have the potential to inhibit fungal pathogens such as Pythium ultimum (which causes damping-off and root rot of plants), Fusarium oxysporum (wilting in crop plants), and Phytophthora cryptogea (responsible for rotting of fruits and flowers) (Hultberg et al. 2008). Biosurfactants produced by Bacillus subtilis isolated from a soil metagenome are useful for controlling Colletotrichum gloeosporioides, the causative agent of anthracnose on papaya leaves (Kim et al. 2010). Pseudomonas aeruginosa, a common plant pathogen, is inhibited by biosurfactants of Staphylococcus from an oil-contaminated soil metagenome (Eddouaouda et al. 2012). This evidence supports the claim that biosurfactants produced by many microbes could be very useful for controlling various phytopathogens. Furthermore, biosurfactants are emerging as an alternative to the pesticides and insecticides currently used in agricultural practice. Metagenomics has great prospects for identifying phytopathogens, plant growth-promoting microbes, and biosurfactant-producing microbes.

4.2.5 Applications in Human Health

Human beings are always surrounded by microbes, which live not only on the surface of the body but also within it. The microbes of the human flora are not fully characterized (less than 1%). In addition, certain microbes in our environment are the causative agents of many infectious diseases. These infectious microbes are mainly tracked by laboratory-based surveillance and by syndromic surveillance, the latter relying strictly on non-laboratory data. With conventional approaches, the causative agent remains undetected in approximately 40% of gastroenteritis cases and 60% of encephalitis cases (Finkbeiner et al. 2008; Ambrose et al. 2011).

The Human Microbiome Project brought sophisticated sequencing technologies, and the association of the microbiome with human health and disease, to the attention of the scientific community (Peterson et al. 2009). Metagenomics can detect both known and novel microorganisms through culture-independent sequencing and analysis of all nucleic acids in a sample. Whole genome sequences of pathogens can be recovered using advanced bioinformatics tools, which further helps in drawing inferences about antibiotic resistance, virulence, and evolution.

Metagenomics currently plays a crucial role in investigating novel species and strains (Wan et al. 2013; Mokili et al. 2013; Xu et al. 2011), outbreaks (Loman et al. 2013; Greninger et al. 2010), and complex diseases (Wang et al. 2012; Cho and Blaser 2012). With advances in next-generation sequencing and its growing cost-effectiveness, it could become an essential approach for investigating infectious agents present at very low abundance, and it can be performed on clinical samples (Seth-Smith et al. 2013) or on single cells (McLean et al. 2013). The metagenomic approaches used to detect infectious or pathogenic agents include deep amplicon sequencing and shotgun sequencing.

Deep amplicon sequencing exploits gene families that are found in every known member species of a particular taxonomic group, and it relies on the amplification of taxonomic markers such as rRNA genes. Using next-generation sequencing, many different amplicons in a sample can be sequenced, and the resulting sequences are compared with reference sequences to identify the species or genus associated with each one. Deep amplicon sequencing is capable of identifying novel microorganisms. For bacteria, it uses primers specific to conserved genes such as 16S rRNA, chaperonin-60 (Links et al. 2012), and RNA polymerase (rpoB) (Wu et al. 2011). For protozoa and fungi, it targets only 18S rRNA gene regions (Leng et al. 2011; Sirohi et al. 2012; Iliev et al. 2012). The major advantage of deep amplicon sequencing is the enhanced sensitivity and resolution of the assay. Its major drawbacks are inaccurate estimation of microbial community composition and the need for prior knowledge of the pathogenic agent.
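
The comparison of amplicon sequences against a reference can be illustrated with a deliberately simple stand-in. The sketch below is an assumption-laden toy, not the classifier used in any cited study: each sequence is reduced to its set of k-mers, similarity is measured with the Jaccard index, and a read is assigned to the best-scoring reference only if the score passes a threshold. The names `kmers`, `jaccard`, and `classify`, the threshold, and the two reference sequences are hypothetical; production pipelines rely on alignment or trained classifiers against curated databases.

```python
def kmers(seq, k=4):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(read, references, k=4, min_similarity=0.5):
    """Assign a read to the most similar reference, or 'unclassified'.

    Returns a (label, score) tuple.
    """
    query = kmers(read, k)
    best_label, best_score = None, 0.0
    for label, ref_seq in references.items():
        score = jaccard(query, kmers(ref_seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < min_similarity:
        return "unclassified", best_score
    return best_label, best_score


# Hypothetical marker fragments (NOT real rRNA sequences).
references = {
    "genus_X": "ACGTACGGTCCATGCA",
    "genus_Y": "TTGGCCAATTGGAACC",
}

if __name__ == "__main__":
    for read in ("ACGTACGGTCCATGCA", "ACGTACGGTCCATGCT", "GGGGGGGGGGGGGGGG"):
        print(read, classify(read, references))
```

The third read shares no k-mers with either reference and is left unclassified. This mirrors the point made above: a sequence from an organism absent from the reference set cannot be named, only flagged as potentially novel.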

In shotgun metagenomics, all microbes are taken into account because all the nucleic acids extracted from a specimen are sequenced. The extracted nucleic acids are sequenced using a next-generation approach, and the results are compared with a reference database. The databases used in shotgun metagenomics are usually much larger than those used in deep amplicon sequencing, because they contain all known sequences rather than the set of sequences from a single gene family. The major advantage of shotgun metagenomics over deep amplicon sequencing is that it is less biased and generates data that better reflect the true population structure of the sample. Beyond pathogen detection, shotgun metagenomics can also generate complete or nearly complete pathogen genome assemblies from the sample (Seth-Smith et al. 2013; McLean et al. 2013). These assemblies allow microbial genotypes and phenotypes to be estimated, for example the presence or absence of antimicrobial resistance, and support the study of epidemic dynamics (Bertelli and Greub 2013).

Metagenomics has immense potential to exploit genomics-based information for identifying microbiomes that are relevant to public health. It is also of use in hospitals and healthcare facilities for identifying unknown or novel pathogens and for characterizing normal and disease-associated microbial communities. For example, a metagenomic approach identified 78 species, including representatives of a new bacterial phylum, in a biofilm from a hospital sink (McLean et al. 2013). Metagenomics has thus proved itself a powerful tool for the detection of novel microorganisms.

4.2.6 Environmental Applications

Many kinds of microbes live in our environment and are helpful in many ways. They play an important role in decomposing dead material and removing pollutants. Certain microbes can degrade oil spilled on the water surface, and many can clean groundwater. Metagenomics may play an important role in identifying the species relevant to water treatment. Oil-consuming microbes in the sea are a good example of microbial bioremediation of water. Many soil bacteria can take up heavy metals and may help reduce soil toxicity. Identifying these microbes remains a major hurdle for further research and analysis, so this is an area of intense interest for both metagenomics researchers and environmental scientists.

4.3 Methods and Tools

The steps involved in metagenomic analysis are shown in the flowchart in Fig. 4.2, and each step is explained in detail below along with the tools used. Figure 4.2 covers experimental design, sampling, and sample fractionation to obtain DNA, which is then analyzed using different computational tools to address various research problems.

Fig. 4.2 Flowchart of experimental and computational methods that are used to retrieve the genomics information which is further analyzed by different bioinformatics tools. This analysis helps in screening and identification of uncultured microbes that are directly taken from environmental samples

4.3.1 Experimental Design

Experimental design plays a major role in obtaining accurate, reliable, and high-quality data. Researchers working in metagenomics need to consider the number of replicates, the cost-effectiveness of sequencing, and the accuracy of the methods used for data analysis. To obtain accurate, high-quality results, minimum standards should be applied during experimental design. When designing an experiment, one must plan biological and technical replicates, fix the sequencing budget, identify protocols that give a high yield of good-quality DNA, and choose the sequencing platform. The site from which the sample is to be taken should be clearly defined in terms of specific parameters (Cooke et al. 2017).

4.3.2 Sampling

After the experimental design, samples are collected from different sources, e.g., soil, air, water, biopsies, and plants; this step is known as sampling. The quality of metagenomic data depends on sampling (Thomas et al. 2012). When describing biodiversity, the sample should represent the whole population (Wooley et al. 2010) as well as the habitat. When collecting samples, one should record the time of collection (day, date, and year), the number of samples, and the sample volume needed to describe the environmental conditions. The sampling strategy and the variability of the experimental methods should be clear. To collect a representative sample, it is important to know the amplitude of variation in the habitat; soil communities, for example, vary with soil type (clay, silt, and sand particles), plant matter in various stages of decomposition, and the variety of invertebrates. While sampling, one must therefore consider the scale (i.e., the size of the habitat), biological variation, experimental variability, reproducibility, repository, and singletons (The New Science of Metagenomics).

4.3.3 Sample Fractionation

Sample fractionation is the process of lysing cells to extract genomic DNA. The aim is to obtain genomic DNA from abundant as well as rare representatives of each taxonomic group, which differ in the thickness of their cell walls and cell membranes. During cell lysis, genomic DNA is exposed to different types of nucleases, so it is very important to inactivate the nucleases by adding strong denaturing agents to protect the DNA (Virgin and Todd 2011; Claesson et al. 2012; Yatsunenko et al. 2012). Cell lysis can be performed by thermal, chemical, mechanical, and enzymatic methods (Felczykowska et al. 2015).

4.3.4 DNA Extraction

DNA extraction is a crucial step in analyzing the genomes of unculturable microbes, so it is important to select an extraction method that gives a high yield of good-quality DNA (Felczykowska et al. 2015). A sample contains DNA in various forms, such as virus particles, eukaryotic DNA, prokaryotic DNA, and free DNA, which may be suspended in liquid, bound to a solid, or trapped in a biofilm or tissue. Extraction methods are therefore selected on the basis of the medium and the population of interest. There are two basic methods for DNA extraction: direct and indirect. In the direct method, cells are lysed within the sample and the DNA is then extracted (e.g., for viruses), whereas the indirect method separates the cells from noncellular material before lysis. The DNA yield of the indirect method is nearly 100 times lower than that of the direct method, but the bacterial diversity of DNA recovered by indirect means was distinctly higher (LaMontagne et al. 2002; Van et al. 1997; Ogram et al. 1987; Berry et al. 2003; Jacobsen and Rasmussen 1992).

4.3.5 DNA Sequencing

Generally, there are three types of sequencing methods, viz., amplicon sequencing, shotgun sequencing, and metagenomics sequencing. Amplicon sequencing is the most commonly used technique for characterizing the diversity of the microbiota. It targets the small subunit ribosomal RNA (16S) locus, which acts as a marker that gives information about phylogeny and taxonomy (Pace et al. 1986; Hugenholtz and Pace 1996). This sequencing method has been used to characterize a wide range of microbial diversity in the human gut (Yatsunenko et al. 2012), Arabidopsis thaliana roots (Lundberg et al. 2012), ocean thermal vents (McCliment et al. 2006), hot springs (Bowen DeLeon et al. 2013), and Antarctic volcano mineral soils (Soo et al. 2009). However, novel and highly diverged species are difficult to study using amplicon sequencing (Acinas et al. 2004). Because of such limitations, shotgun sequencing came into the picture.

Shotgun sequencing can overcome the limitations of the amplicon approach. It relies on extracting DNA from the cells in a community and fragmenting it into small pieces that are sequenced to give reads, which are then aligned against known genomes and 16S rRNA sequences. Hence, it provides the opportunity to explore two aspects of a microbial community, namely which taxa are present and what functions they may carry out (Sharpton 2014). Shotgun sequencing also has limitations: the data volumes are large, the reads may not cover the whole genome, and sometimes two reads from the same gene do not overlap (Schloss 2008; Sharpton et al. 2011). Advances in shotgun sequencing have helped to address these issues, and the approach has been used for the identification of new viruses (Yozwiak et al. 2012) as well as the characterization of uncultured bacteria (Wrighton et al. 2012). This advanced metagenomics sequencing has also been used to characterize the microbes associated with roots (Bulgarelli et al. 2013; Vorholt 2012) and to identify taxa associated with the human gut (Morgan et al. 2012).

4.3.6 Quality Control

The sequencing data obtained from NGS technology are first subjected to quality control. This is the process of screening out low-quality reads, which would otherwise affect the downstream analysis (Zhou et al. 2014). The accuracy of microbial biodiversity estimates can be improved by quality filtering (Handelsman 2004). Several tools are available for quality control, as shown in Table 4.1.
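As an illustration of what quality filtering involves, the following minimal Python sketch discards reads whose mean Phred score falls below a threshold. It is not the method of any tool in Table 4.1; the function names, the threshold of 20, and the Phred+33 encoding are assumptions made for the example.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(fastq_lines, min_mean_quality=20.0):
    """Yield (header, sequence, quality) for reads that pass the threshold.

    Assumes the simple four-line FASTQ layout: header, sequence, '+', quality.
    """
    records = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        if mean_phred(qual) >= min_mean_quality:
            yield header, seq, qual


# Toy input: two reads, one of high quality and one of low quality.
example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",  # 'I' encodes Phred 40
    "@read2", "ACGTACGT", "+", "########",  # '#' encodes Phred 2
]
kept = list(filter_reads(example))
print([header for header, _, _ in kept])
```

In practice, dedicated QC tools also trim low-quality read ends and report per-base statistics; the sketch only shows the basic idea of a per-read filter.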

Table 4.1 List of online tools that are useful for assessing the overall quality of a sequencing run and are widely used in next-generation sequencing (NGS) data production environments as an initial quality control (QC) checkpoint

4.3.7 Assembly

Assembly means reconstruction of a genome from smaller fragments of DNA, i.e., the reads obtained through sequencing (Reich et al. 1984). Basically, there are two types of assembly: de novo assembly, in which the genome is constructed from the read data alone, and comparative assembly, in which the genome is reconstructed using the genome of a closely related organism as a guide (Medvedev et al. 2007). For de novo assembly, three algorithmic strategies are used, namely greedy (Pop and Salzberg 2008), overlap layout consensus (Myers 1995), and De Bruijn graph (Zerbino and Velvet 2008; Pevzner et al. 2004). De novo assemblies can be improved with the help of a known reference genome, giving a comparative assembly, using tools such as OSLay (optimal syntenic layout of unfinished assemblies) (Richter et al. 2007), Projector 2 (Van et al. 2005), and ABACAS (algorithm-based automatic contiguation of assembled sequences) (Assefa et al. 2009).
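To make the De Bruijn graph strategy more concrete, the following toy Python sketch builds such a graph from reads and walks an unbranched path to recover a contig. It is a simplified illustration under stated assumptions (error-free reads, a single strand, no repeats), not the algorithm of any particular assembler; the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop on a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Two overlapping toy reads from the sequence ACGTA.
reads = ["ACGT", "CGTA"]
graph = build_de_bruijn(reads, k=3)
print(walk_unbranched(graph, "AC"))
```

Real assemblers additionally handle sequencing errors, reverse complements, repeats, and uneven coverage, which is where most of the memory and computation is spent.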

4.3.8 Annotation

Functional annotation of the metagenomics data obtained after assembling the reads involves gene prediction, assignment of biological function, and gene and metabolic pathway annotation. The tools used for different functional annotations are shown in Table 4.2.

Table 4.2 List of tools and servers useful for the functional annotation of metagenomics data. Some are freely available and compatible with Windows/Linux, and a few are paid. The function of each tool is shown in the first column and the corresponding name in the second column

4.4 Metagenomics Databases and Online Resources

There are many databases and online tools for analyzing and retrieving metagenomics data. Table 4.3 lists the names and links of such databases/servers. The European Bioinformatics Institute (EBI) Metagenomics service enables users to submit, analyze, visualize, and compare their data (Mitchell et al. 2015). MG-RAST is a metagenomics analysis server for the annotation of sequence fragments, their phylogenetic classification, functional classification of samples, and comparison between multiple metagenomes. It also computes an initial metabolic reconstruction for the metagenome and allows comparison of metabolic reconstructions of metagenomes and genomes (Wilke et al. 2016). MEGAN (Huson et al. 2011) is a comprehensive toolbox for analyzing microbiome data, with which one can perform taxonomic analysis, functional analysis, and other analyses. QIIME (Quantitative Insights Into Microbial Ecology) is a freely available bioinformatics tool for performing microbiome analysis from raw DNA sequencing data. It supports demultiplexing and quality filtering, OTU (operational taxonomic unit) picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations (Caporaso et al. 2010). Mothur is open-source, expandable software designed to meet the bioinformatics needs of the microbial ecology community (Schloss et al. 2009). RDP (Ribosomal Database Project) provides quality-controlled, aligned, and annotated bacterial and archaeal 16S rRNA sequences, fungal 28S rRNA sequences, and a suite of analysis tools to the scientific community.

Table 4.3 List of different tools and servers that are used for metagenomics data analysis. Some tools are freely available for download and are compatible with Windows and Linux, while the servers are freely available online

RDP is an online tool that can also be used to study the new fungal 28S rRNA sequence collection. RDP tools are now freely available in packages for users to incorporate into their local workflows (Cole et al. 2009). SILVA (from Latin silva) is a freely accessible online resource that provides quality-checked and aligned small subunit (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequence data for bacteria, archaea, and eukarya (Quast et al. 2013). Real Time Metagenomics is a freely available online tool that annotates metagenomes by relating the individual sequence reads to a database of known sequences and assigning a unique function to each read. Its authors developed a novel approach that annotates metagenomes using unique k-mer oligopeptide sequences from 7 to 12 amino acids long (Edwards et al. 2012).
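The idea of annotating a read through function-specific k-mers can be sketched as follows. This is only a toy illustration and not the implementation of Real Time Metagenomics: the signature table, the function labels, and the majority-vote rule are hypothetical, and the sketch assumes that the read has already been translated into amino acids.

```python
from collections import Counter


def annotate_read(peptide, signature_kmers, k):
    """Return the function whose signature k-mers occur most often in the peptide.

    `signature_kmers` maps an amino acid k-mer to a function label; None is
    returned when no signature k-mer is found.
    """
    hits = Counter()
    for i in range(len(peptide) - k + 1):
        function = signature_kmers.get(peptide[i:i + k])
        if function is not None:
            hits[function] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Hypothetical signature table with 7-residue k-mers (made up for the example).
toy_signatures = {
    "MKTAYIA": "hypothetical_function_A",
    "GLSDVEQ": "hypothetical_function_B",
}
print(annotate_read("AAMKTAYIAKQ", toy_signatures, k=7))
```

Because lookups in the signature table are exact matches, this kind of approach avoids a full alignment for each read, which is what makes it attractive for large read sets.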

4.5 Bioinformatics-Based Data Analysis

Bioinformatics-based data analysis can be done using short reads and assembled contigs available in the Sequence Read Archive (SRA) format (Fig. 4.3). The metagenomics SRA data are first pretreated to select high-quality reads or sequences. The pretreatment includes:

  (a) Removal of adapters and linkers

  (b) Removal of duplicate sequences (dereplication)

  (c) Quality assessment
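The first two pretreatment steps can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: the adapter is matched exactly, duplicates are exact copies, and the adapter sequence and function names are placeholders chosen for the example.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


def dereplicate(reads):
    """Keep the first copy of each distinct sequence, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


ADAPTER = "AGATCGGAAG"  # placeholder adapter sequence for the example
raw_reads = ["ACGTACGTAGATCGGAAG", "ACGTACGT", "TTGCAAGATCGGAAGCC"]

trimmed = [trim_adapter(read, ADAPTER) for read in raw_reads]
unique_reads = dereplicate(trimmed)
print(unique_reads)
```

Dedicated trimming tools additionally tolerate mismatches and partial adapters at read ends, which exact string matching does not.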

Fig. 4.3 Flowchart for analysis of data generated by different metagenomics experiments. The procedure involves use of several computational biology tools for retrieving functional information in terms of pathway, interaction network, and gene ontology hidden in the metagenomics data. GCNA (gene co-expression network analysis) and PPI (protein-protein interaction network) studies are useful for identification of interactors

Before pretreatment, the quality of the data is assessed by checking base quality, GC content, sequence duplication levels, and adapter content using FastQC. Quality control of metagenomics data is done by RSeQC (quality control of RNA-seq experiments) followed by RNA-SeQC (Wang et al. 2012; De Luca et al. 2012). Once the data are clean, they can be used for functional annotation. After pretreatment, the reads are assembled to obtain functional contigs. The size of the data generated by sequencing can be reduced through metagenome assembly using an integrated computational approach (Howe et al. 2014).
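Two of the checks mentioned above, GC content and the level of duplicated sequences, are simple enough to compute by hand. The following Python sketch shows one way to do so; it is not how FastQC computes its reports, and the function names are assumptions made for the example.

```python
def gc_content(sequence):
    """Percentage of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


def duplicated_fraction(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


toy_reads = ["ACGT", "ACGT", "GGCC", "AATT"]
print([gc_content(read) for read in toy_reads])
print(duplicated_fraction(toy_reads))
```

A GC distribution that departs strongly from the expected one, or a high duplicated fraction, is usually a reason to inspect the library before proceeding to assembly.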

Reference-based and de novo methods are used for assembling the reads. The former aligns the short reads against a related genome, while the latter is used to uncover novel genes that are not represented in similar reference genomes. Assembly requires a large amount of memory and considerable computational resources. Once assembly is done, binning is performed. Binning is the computational process of clustering contigs, or assigning them to groups, that may represent an individual genome/taxon or closely related microbes. Homology-based tools such as MetaPhlAn2, MetaPhyler, and CARMA are used to perform the binning (Segata et al. 2012; Liu et al. 2010a, b; Gerlach and Stoye 2011). As technology improves, sequencing costs continue to fall; hence researchers can access environmental metagenomes, and bioinformatics tools can be integrated with metagenome data to produce useful results and findings (Albertsen et al. 2013).

Structural and functional annotation of a microbial community can be done using assembled reads as well as unassembled reads. It is well established that unassembled short reads contain original information about functional genes, the metabolic profile, and the quantitative composition of microbial taxa (Bzhalava and Dillner 2013).

4.6 Conclusion

Metagenomics is a continuously growing and developing field. Modern tools and techniques such as bioinformatics, NGS technology, and data analysis methods are proving to be facilitators of this research field. Biological data are continuously increasing in size; hence researchers have a golden opportunity to retrieve the hidden information present in assembled or unassembled reads more efficiently using modern analytical tools. Direct DNA sequencing of environmental samples has made it possible to gather information about microorganisms that were unexplored so far. Screening of useful bacteria that survive in extreme environmental conditions, heavily polluted soil, disease-affected tissues or cells, oil-contaminated water bodies, heavy metal-contaminated fields, etc. can be done easily by combining environmental and metagenomics approaches. The data obtained from environmental sample sequencing may be of great use in the discovery of new drugs and antibiotics, new bacterial species, and plant growth promoters, in bioremediation, and in many other industrial applications. This article presents a detailed account of the applications of metagenomics, especially in the field of environmental biotechnology, with special focus on the methods and tools useful in sample collection, sequencing, and analysis of metagenomics data.