Introduction

A complex ecosystem like soil hosts a multitude of community-based operations and interactions among microorganisms. A microbe is therefore very unlikely to live in isolation in such an environment (Pace 1997). To define the soil ecosystem, one should consider its biogeochemical makeup, comprising physical (e.g., density, texture), chemical (e.g., pH, available phosphate, C:N ratio), and biological (e.g., microbial biomass) properties (Dominati et al. 2010; Pointing and Belnap 2012). Maintaining and enhancing soil properties is a primary role of the resident microbial population and involves interplay with other members of the same ecosystem (Doran and Zeiss 2000; Lavelle et al. 2006). Though the physical properties of soil have been extensively studied, its changing dynamics, biodiversity, and functionality must be studied by holistic approaches rather than by isolating individual members, an approach that has had little success under laboratory conditions (Kellner et al. 2011; Torsvik et al. 1996). Bacterial species of soil have been extensively studied, but the huge diversity of soil eukaryotes and the essential roles they play are not well established. Cultivation-based surveys give biased outcomes because soil eukaryotes are difficult to isolate in the presence of fast-growing bacteria (Bending et al. 2007; Jumpponen and Johnson 2005; O'Brien et al. 2005).

To appreciate eukaryotic functional diversity in soil and the activities eukaryotes express in situ in response to different environmental constraints, it is necessary to develop new experimental approaches adapted to studying these microorganisms (Nannipieri et al. 2003). One such approach developed in recent years is metagenomics. The metagenomic approach is based on direct isolation of nucleic acids from environmental samples and has proven to be a powerful tool for comparing and exploring soil ecology (Biddle et al. 2008). Though metagenomic studies provide valuable information on the probable existence of microbial communities, they do not give a picture of the actual activities carried out by those communities in a specific environment or under changing environmental conditions. The metatranscriptomic approach connects community structure and function in a single experiment and reaches beyond the community's genomic potential as assessed by DNA-based methods (Bailly et al. 2007). In this review, we summarise the methods used to study microbial community structure through the metagenomic and metatranscriptomic approaches, the role of metatranscriptomics in revealing the functional diversity of eukaryotes, and the identification of novel genes involved in various biological processes from environmental samples. We also discuss the problems, challenges and future prospects of metatranscriptomics.

Microbial communities in soil

PCR based methods

Various molecular methods have been developed that involve extraction of nucleic acids, directly or indirectly, from soil and that can detect species, genera, families or even higher taxonomic groups (Handelsman et al. 1998; Hirsch et al. 2010). Techniques such as DNA re-association also provide information on diversity, especially through variations in G + C content that indicate changes in the microbial community. This technique requires large amounts of DNA, and its resolution suffers because several taxonomic groups may have the same G + C content. Hence it does not provide enough information about the composition and role of the microorganisms.
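
The resolution limit of G + C profiling can be illustrated with a small calculation. The following Python sketch is purely illustrative: the function name gc_content and the two short sequences are hypothetical and are not taken from any of the cited studies. It shows that two unrelated sequences can yield the same G + C fraction.

```python
def gc_content(seq: str) -> float:
    """Return the G+C fraction (0.0 to 1.0) of a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


# Two hypothetical fragments: different sequences, identical G+C content.
fragment_a = "ATGCGCTAAT"
fragment_b = "TTAGCCGATA"

print(gc_content(fragment_a), gc_content(fragment_b))
print(fragment_a == fragment_b)
```

Because the measure discards base order, a G + C profile alone cannot separate such fragments, which is consistent with the limited taxonomic resolution noted above.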

The way diversity information is obtained changed significantly with the advent of more complex molecular tools such as denaturing gradient gel electrophoresis (DGGE) (Deng et al. 2008) and, eventually, PCR-derived clone libraries of genes encoding the small subunit ribosomal RNA (rRNA) (Lara et al. 2007). Nucleic acids obtained from soil were subjected to different fingerprinting approaches to assess changes in community structure with respect to the environment. Comparing profiles of low molecular weight RNAs (5S rRNA, tRNA) collected from the environment with those of previously obtained pure cultures provided insights into changes in structure (Liu and Stahl 2007). Amplified Ribosomal DNA Restriction Analysis (ARDRA) is a DNA fingerprinting technique that uses restriction enzyme digestion of amplified 16S rRNA genes followed by gel electrophoresis (Nüsslein and Tiedje 1998). A similar technique, terminal restriction fragment length polymorphism (T-RFLP), measures only the terminal restriction fragment of each 16S rRNA gene, thus reducing the complexity of ARDRA (Liu et al. 1997; Marsh 1999). Ribosomal intergenic spacer analysis (RISA), based on PCR amplification of the intergenic spacer between the 16S rRNA and 23S rRNA genes, has also been used to assess the diversity of bacteria in certain soil types (Ranjard et al. 2000). Various PCR-based fingerprinting techniques used to study microbial community structures in different soil environments are listed in Table 1.
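
The difference between ARDRA and T-RFLP, namely that only the labelled terminal fragment is recorded, can be sketched in silico. The Python example below is a minimal illustration under stated assumptions: the enzyme MspI (recognition site CCGG, cutting after the first C) is chosen only as an example and is not specified in the text, the label is assumed to sit at the 5' end, the amplicon sequences are invented toy data, and the function names are hypothetical.

```python
from collections import Counter


def terminal_fragment_length(amplicon: str, site: str = "CCGG", cut_offset: int = 1) -> int:
    """Length of the 5'-labelled terminal fragment after digestion.

    Assumes the enzyme cuts `cut_offset` bases into the first occurrence of
    `site`; an amplicon without the site stays uncut (full length).
    """
    pos = amplicon.upper().find(site)
    if pos == -1:
        return len(amplicon)
    return pos + cut_offset


def trf_profile(amplicons):
    """Count how many amplicons contribute to each terminal fragment length."""
    return Counter(terminal_fragment_length(a) for a in amplicons)


# Invented toy amplicons standing in for amplified 16S rRNA genes.
toy_amplicons = [
    "ATGCATCCGGTTAACCGGAT",  # two sites, only the first one matters
    "TTGCATCCGGAAAA",        # different sequence, same terminal fragment
    "ATGCCGGATTTACGATCGTA",
    "ATGCATGCATGCATGCATGC",  # no site, remains uncut
]

for length, count in sorted(trf_profile(toy_amplicons).items()):
    print(f"terminal fragment of {length} nt: {count} amplicon(s)")
```

Each amplicon contributes a single peak regardless of how many internal sites it carries, which is why the profile is simpler than an ARDRA pattern; the sketch also shows that distinct sequences can share a peak.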

Table 1 Selected nucleic acid fingerprinting techniques for microbial community DNA analysis

Though PCR-based methods are robust techniques for analyzing the metagenome, they have many drawbacks, mainly due to the bias introduced by the primers and interference from stray environmental DNA. The extracted DNA inevitably includes genetic material not only from live (active or inactive) organisms but also from dead organisms or plants (Venter et al. 2004; Gill et al. 2006). Another major disadvantage of such techniques is the presence of PCR-inhibiting substances such as humic acids, bacterial exopolysaccharides and proteins, which are usually associated with particular soil types (Gelsomino et al. 1999; Miller et al. 1999; Opel et al. 2010).

Metagenomic approach

Metagenomics-based methods rely on soil-extracted DNA being directly inserted or cloned into vector systems such as plasmids or bacterial artificial chromosomes (BACs) and then propagated in bacterial strains such as E. coli, thus creating a genomic library of all possible genetic content (Liles et al. 2003). A DNA library covering the entire soil metagenome has been estimated to require approximately 10^6 BAC clones with an insert size of 100 kb (Handelsman et al. 1998). The screening of metagenomic libraries usually depends on two broad techniques (Schloss and Handelsman 2003). The first, activity-based screening, depends on the specific enzymatic activity of the recombinant cells; identifying a new bioactivity in this way does not require information about previously known sequences. The second, sequence-driven screening (Daniel 2002; Rondon et al. 2000), uses conserved regions of already known genes or protein families to design hybridization probes or PCR primers, which are then used to screen metagenomic libraries for clones that contain sequences of interest. In this way, only novel variants of known functional classes of proteins can be identified (Knietsch et al. 2003). A significant fraction of eukaryotes, however, cannot be captured by this method of library construction.

A metagenomic DNA library contains a very low proportion of clones of eukaryotic origin because prokaryotes outnumber eukaryotic organisms in soil (Lehembre et al. 2013). Also, owing to the large size of eukaryotic genomes (e.g., 13.8 Mb for the yeast Schizosaccharomyces pombe; Wood et al. 2002), eukaryotic genes are not well represented within a functional genomic library, and many of their roles in soil therefore remain poorly understood (Yadav et al. 2014). Moreover, expression of eukaryotic protein-coding genes is prevented in most hosts (both prokaryotes and eukaryotes) by the presence of introns and the lack of conservation of transcription regulatory elements in promoter sequences (Bailly et al. 2007; Takasaki et al. 2013).

Function-based study of environmental DNA is of great importance, as more and more data sets are generated by cloning methods, yielding knowledge of putative biocatalysts that were previously out of reach. Discovery of such novel biocatalysts requires bioinformatic tools and databases against which sequence-based metagenomic data can be deciphered. A novel protein-coding gene is thus often categorized on the basis of an identified ORF sharing more than half of its sequence similarity with a known functional protein. Warnecke et al. (2007) discovered more than 600 new putative glycosyl hydrolases from the termite hindgut. The major constraint for functional metagenomics is the availability of suitable activity assays for high-throughput screens. Some enzymatic activities may require the collective action of several proteins whose genes may be dispersed across the genome. The respective activities might then not be detected when screening small-insert metagenomic clone libraries. Moreover, in the absence of suitable chaperones the expression host might fail to fold the protein, which then cannot function in the host (Warnecke and Hess 2009). A major limitation of metagenomics is that it cannot reveal dynamic properties, such as the impact of the environment on the activities of the community. The only information obtained is the potential of the microbiome to exhibit functional properties associated with the presence of certain genes; the expression pattern of those genes in a stressful environment may be overlooked. The next step beyond metagenomics, which sheds light on the active functional profile of a microbial community, is metatranscriptomics.

Metatranscriptomics approach

Metatranscriptomics deals with those sets of genes that are transcribed and exhibit activity in response to a given environment at a given time (Chao-Rong and Zhang 2011; Moran 2009). Functional metatranscriptomics is hence a powerful tool that allows characterization of genes expressed by different eukaryotic microorganisms (e.g., fungi, protists) directly in the environment. This approach has strong potential in biotechnology for discovering novel genes of interest for the bioindustry, for bioremediation and as biomarkers. It allows characterization of genes implicated in adaptation to stressful conditions or involved in organic matter degradation. Functional metatranscriptomics applied to the soil environment involves the extraction and analysis of mRNA, which provides information on the regulation and expression profiles of complex communities. It is based on extraction of environmental RNA instead of DNA and on purification, by affinity chromatography, of the eukaryote-specific polyadenylated mRNA from the total environmental RNA mixture. These poly-A mRNAs can be converted into cDNA, which can be cloned in appropriate expression vectors, such as plasmids, that allow expression of the cloned genes in eukaryotic hosts (Yadav et al. 2014). Different methodologies for studying the diversity and function of soil microorganisms, involving both culture-dependent and culture-independent techniques, are depicted in Fig. 1.

Fig. 1 Overview of methodologies for studying soil diversity and function of microorganisms which involves both culture dependent and culture-independent techniques

Metatranscriptomics of polluted soil

Functional diversity of eukaryotes

Maintaining community structure and diversity is a challenge faced by the microorganisms residing in polluted soil (Giller et al. 1998). Though community diversity is negatively correlated with increasing stress levels, the effect of ‘competitive exclusion’ would enhance the removal of dominant organisms and promote an increase in the diversity of usually less abundant organisms. This is considered resilience, as the affected soil microorganisms try to regain their size, composition and function after an initial disturbance (Allison and Martiny 2008). Studies of the heavy metals cadmium and copper have indicated the presence of ‘competitive exclusion’ (Degens et al. 2001; Zhang et al. 2009). Eukaryotic life forms have previously been studied in extreme environments along with prokaryotes, as they represent a significant fraction of the microbial biomass. They hold a repository of genes responsible for various activities in soil, but they are difficult to isolate and culture under laboratory conditions. Holistic approaches such as metagenomics and metatranscriptomics have been developed to identify these genes and archive them as soil gene libraries. The information thus stored is a treasure house of novel biocatalysts and bioactive compounds encoded by these genes, which can be obtained when the genes are suitably expressed (Daniel 2005; Tringe and Rubin 2005).

Soil fungi have been under investigation for centuries (Rossman 1998), but DNA-based analyses have shown that eukaryotes are phylogenetically as diverse as bacteria. A study by Fierer et al. (2007) grouped sequences into Operational Taxonomic Units (OTUs) and revealed that the number of archaeal or fungal OTUs equalled or exceeded the number of unique bacterial OTUs in different soil types (desert, grassland, rain forest), with minimal taxonomic overlap between the soil types. An in-depth study of spruce and beech forest soil showed dominance of fungi, metazoans and protists linked with enzyme activities for litter degradation, which maintains the fertility of terrestrial ecosystems (Damon et al. 2012).
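
The idea of grouping sequences into OTUs can be sketched as follows. This Python example is an illustration only and is not the procedure used by Fierer et al. (2007): it assumes pre-aligned sequences of equal length, uses a simple greedy centroid strategy, and adopts a 97% identity threshold as an assumed convention that the text does not specify. All sequences and function names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold: float = 0.97):
    """Assign each sequence to the first OTU whose centroid it matches.

    The first member of each OTU acts as its centroid; a sequence that
    matches no existing centroid founds a new OTU.
    """
    otus = []
    for seq in seqs:
        for members in otus:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            otus.append([seq])
    return otus


# Invented toy reads: a reference, a one-mismatch variant, and an unrelated read.
reference = "ACGT" * 10
variant = reference[:5] + "A" + reference[6:]   # 39/40 identical positions
unrelated = "TTGA" * 10

clusters = greedy_otus([reference, variant, unrelated])
print(len(clusters), [len(c) for c in clusters])
```

Counting the resulting clusters per soil type, and the clusters shared between soil types, is the kind of comparison on which statements about OTU richness and taxonomic overlap rest.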

With the advent of high-throughput or next-generation sequencing (NGS), large volumes of sequence data are generated, which enables efficient analysis of more complex expression profiles. The most common high-throughput sequencing platforms used in metatranscriptomic studies are the 454 Genome Sequencer FLX systems (Roche) and the HiSeq 2000 (Illumina, Inc.). Both platforms allow large numbers of individual sequencing reactions to be performed in parallel. Direct sequencing of DNA or cDNA, with or without a cloning step (Medini et al. 2008), increases the throughput in terms of base pairs sequenced per run and decreases the cost per sequenced base, although the new technologies initially produced short reads (the 454 pyrosequencer produced ~ 400 bp). Sequencing technologies have since improved, and read lengths are now comparable to those of traditional dye-terminator sequencing.

Sequencing technologies have played a major role in metagenomic and metatranscriptomic studies of diversity, as culture-based identification has slowly declined. This includes species identification and functional annotation of genes. With the development of NGS, culture-independent sequence identification has come to account for a clear majority of submissions over the years. We analyzed the gene sequences derived from soil eukaryotes (fungi and protists) deposited in GenBank, using different search strategies in Entrez. The results indicated that by the year 2011 almost 99.9% of the sequence data submitted to GenBank were from uncultured soil eukaryotes, and that submissions of uncultured sequence data did not fall below 89% of the total sequences submitted to GenBank even in the year 2019.

The first report of 454 pyrosequencing being used to study the metatranscriptome of a soil microbial community was published by Leininger et al. (2006), who reported a greater abundance of archaeal ammonia oxidizers than of their bacterial counterparts. An earlier study on activated sludge and samples collected from hot springs reported cDNA libraries prepared from extracted RNA that showed significant similarity to eukaryote mRNA-encoded protein sequences (Grant et al. 2006). Kit-based RNA extraction from aquatic samples was carried out by Poretsky et al. (2005), and approximately 400 clones were analyzed after the cDNA was amplified with random primers. They identified many novel genes, from diverse taxonomic groups, associated with environmentally important processes such as sulfur oxidation.

Pioneering work on the functional diversity of communities of soil eukaryotic microorganisms, which included construction and screening of a cDNA library from polyadenylated mRNA extracted from a forest soil, was carried out by researchers from University of Lyon 1, France (Bailly et al. 2007). It was observed that fungi and especially unicellular eukaryotes (protists) dominated the microbial population, with genes encoding essential proteins involved in different biochemical processes. This was assessed by complementation of auxotrophic mutant yeast with soil cDNA. Studies were extended to investigate litter degradation in forest ecosystems, the process by which degradation of soil organic matter makes nutrients available (Damon et al. 2012). Major nitrogen-assimilating enzymes and enzymes degrading cell wall polymers (cellulose, pectin, lignin etc.) were identified (Damon et al. 2011). Metatranscriptomics received great impetus for studying different soil types, especially nutrient-poor soils, mainly because it avoids bias when gauging community reaction to environmental changes. In combination with high-throughput sequencing technology, the structure and functioning of sandy soil microflora were studied, and the less frequent taxon Crenarchaeota was well represented using rRNA-tags, together with the discovery of specific enzymes for ammonia oxidation and CO2 fixation (Urich et al. 2008).

The metatranscriptomic approach has therefore played an important role in studying polluted or stressful soil systems. The first metatranscriptomic study of heavy metal contaminated soil was reported by Lehembre et al. (2013), who screened a large metatranscriptomic library containing diverse sequences, mostly from unknown taxonomic groups, for heavy metal tolerance. The isolated heavy metal tolerance genes conferred resistance to the toxicity of a variety of metals, even at high concentrations (Thakur et al. 2018, 2019). This opened up a broad platform for investigating the proteins and linking them with the different pathways involved in heavy metal tolerance.

Discovery of novel eukaryotic genes

Mining of industrially important enzymes

Enzymes from higher eukaryotic microorganisms, especially fungi, have been widely used as industrially important biocatalysts, with applications in the food, pharmaceutical and textile industries, among others (Gudynaite-Savitch and White 2016; Guerriero et al. 2016). Although a thorough discussion of fungal biocatalysts is beyond the scope of this review, certain important contributions of eukaryotic genes from the environment are emphasized here. The role of eukaryotes in improving soil fertility by litter degradation involves various enzymatic processes that degrade lignocelluloses, hemicelluloses, polyphenols etc. (Štursová et al. 2012). Litter degradation rates depend on the physicochemical properties of the litter, which in turn vary with climatic factors, the plant species that form the litter and the associated saprophytic fungal microhabitats (Goldmann et al. 2015). Besides producing useful metabolites, eukaryotes are also involved in biomass treatment in biorefineries and in biofuel production. Essential enzymes acting on complex organic matter, such as lytic polysaccharide monooxygenases, oxidatively cleave complex polysaccharides like cellulose and hemicelluloses. These eukaryotic microbes are extremely diverse; hence their function cannot be localized, nor can they be said to be ubiquitous in every ecosystem (Marmeisse et al. 2017). To tap the extensive enzyme resources of forest soil, the most convenient way is to create a metatranscriptomic library and perform activity screening, which does not introduce bias towards specific species. Industrially important enzymes such as acid phosphatase and laccase were identified by activity screening and semi-quantitative PCR, respectively, following RNA extraction and cDNA preparation (Kellner et al. 2011; Luis 2005). The only disadvantage of this process is the massive number of transformed clones that have to be screened for activity.
As many as 30,000 recombinant yeast clones were screened for a potential phosphate-solubilizing enzyme, acid phosphatase, and also for an imidazoleglycerol-phosphate dehydratase, using functional complementation and activity screening (Kellner et al. 2011). The metatranscriptomic library was ligated into the yeast secretion vector pTEF-MF-SfiI A/B, which had been modified with additional SfiI restriction sites to accommodate the library, and was used to transform the Saccharomyces cerevisiae mutant ML20C carrying a pho5– mutation (in the repressible acid phosphatase gene) and a his3− mutation. Positive clones, identified by complementation of the his3− mutation and by colony staining assays, were then sequenced. Strong similarity of the sequence data to basidiomycete acid phosphatase confirmed the important role played by fungi in nutrient mobilization in the organic-rich layer of forest soil. The availability of a large number of yeast strains mutated in a variety of metabolic pathways provides an opportunity to perform complementation with environmental gene resources and so screen for the gene of interest (Scherens and Goffeau 2004). Yeast can be genetically manipulated with plasmids carrying constitutive or regulated promoters for gene expression, so that auxotrophs are rescued by complementation.

Along with mutant yeast strains, plasmids have also been modified to successfully clone environmental cDNA and effectively express the gene in the host. The transcription of foreign genes in mutant yeast depends on a compatible promoter carefully placed in the plasmid upstream of the cloning site. Environmental eukaryotic cDNA is efficiently cloned and expressed in pFL61, which has a multiple cloning site between the phosphoglycerate kinase (PGK) promoter and its terminator, and which also bears a 2 μm yeast origin of replication for the yeast host and part of the pUC19 E. coli vector for replication in the bacterial host (Minet et al. 1992).

An efficient technique for screening cDNA synthesized from soil-extracted polyadenylated mRNAs is ‘sequence capture by hybridization’ (SCH) (Bragalini et al. 2014). This pre-selection of target genes before activity screening increases the probability of finding novel genes by reducing the amount of random environmental cDNA. It was conducted using biotinylated degenerate RNA probes. The biotinylated probes hybridized to known and unknown sequences of members of a target gene family present in a metatranscriptomic library, and the hybrids were then captured using streptavidin-coated paramagnetic beads. When the method was tested on the glycoside hydrolase 11 gene family encoding endo-xylanases, > 90% of the captured cDNA was related to this gene family. Sequencing of the cDNA revealed many phylogenetically diverse species that were not yet included in public databases. The captured cDNA was cloned and expressed in the uracil-auxotrophic yeast strain DSY-5, which also has a non-functional sugar transporter protein. Yeast cells producing endo-xylanases developed a blue halo in the presence of dye-linked xylan, confirming hydrolysis of xylan. This method used the plasmid PDR 196-SfiI-Kan, modified to contain two specific sites, SfiIA and SfiIB, for directional ligation of cDNA carrying the matching sites, with the insert placed under the control of the yeast PMA1 promoter for constitutive expression. Figure 2 gives an overview of sequence capture by degenerate probes and of the subsequent screening of yeast transformed with cDNA clones on selective medium.
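
How a degenerate probe can recognize several members of a gene family may be illustrated in silico. The Python sketch below is a simplification under stated assumptions: it works in the DNA alphabet and requires an exact match to the degenerate pattern, whereas real hybridization with RNA probes tolerates mismatches. The probe, the cDNA sequences and the function names are invented for illustration and are not taken from Bragalini et al. (2014); only the IUPAC ambiguity codes are standard.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def probe_to_regex(probe: str) -> "re.Pattern[str]":
    """Translate a degenerate probe into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in probe.upper()))


def capture(cdnas: dict, probe: str) -> list:
    """Return the names of cDNAs containing a match to the degenerate probe."""
    pattern = probe_to_regex(probe)
    return [name for name, seq in cdnas.items() if pattern.search(seq.upper())]


# Hypothetical probe and toy library (not real endo-xylanase sequences).
probe = "GGNTAYGAR"
library = {
    "cdna_1": "ATGGGATATGAGCC",   # matches via N=A, Y=T, R=G
    "cdna_2": "ATGGGCTACGAACC",   # matches via N=C, Y=C, R=A
    "cdna_3": "ATGCCCTTTAAACC",   # unrelated, not captured
}

print(capture(library, probe))
```

A single degenerate probe thus retains sequence variants that differ at the ambiguous positions, which is the property that lets pre-selection enrich both known and unknown members of a gene family before activity screening.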

Fig. 2 a Overview of sequence capture screening by degenerate probes, followed by b transformation of yeast DSY-5 with cloned cDNA, plated on a selective medium containing AZCL-xylan (a specific substrate for endo-xylanase) and lacking uracil. Positive clones degrade the substrate to release a dark blue dye. Modified from Bragalini et al. (2014)

Enzymes retrieved from eukaryotic organisms have come to dominate industrial processes, especially where they have replaced traditional chemical treatment of biomass, in the textile as well as the food industries (Azizan et al. 2016). Recent advances to improve the commercial potential of industrially important enzymes have focused on molecular techniques such as metagenomics and metatranscriptomics for accessing environmental genes (Niehaus et al. 2011). The practice of collecting industrially important genes from environmental metagenomic libraries has been discussed in detail by Ferrer et al. (2016). Sequence-based as well as functional screening of an environmental metagenome to obtain a stable, ‘industry-level’ enzyme is a time-consuming process that requires efficient screening and expression systems. However, metatranscriptomics has developed into a promising and practical technique for harvesting and screening protein-coding genes from an environmental library. Forest soil has been the primary source of important and novel enzyme-coding genes. Glycoside hydrolases that decompose cellulosic biomass were harvested from the fungal communities of forest soil (Takasaki et al. 2013). Of 9449 eukaryotic coding sequences identified by the metatranscriptomic approach, about 40% were genes involved in carbohydrate and amino acid metabolism. Functional metatranscriptomics was successfully applied to eukaryotes of forest soil, where acid phosphatase and imidazoleglycerol-phosphate dehydratase of fungal origin were isolated and characterized using yeast mutants that could proficiently express the proteins (Kellner et al. 2011). Many important eukaryotic oligopeptide transporters were discovered by functionally complementing a yeast mutant defective in di/tripeptide uptake. These transporters were able to act on a wide range of substrates and were further expressed in Xenopus oocytes (Damon et al. 2011).
A significant number of non-fungal sources of industrially important enzymes in forest soils were explored by Damon et al. (2012), who probed a metatranscriptomic library for essential enzyme-coding genes. Major enzymes degrading plant cell wall polymers (cellulose, hemicelluloses, pectin, lignin etc.) were identified, as were enzymes hydrolyzing organic matter, such as proteases, lipases and a phytase. Forest soil is a rich source of transcript sequences for carbohydrate-active enzymes (CAZymes); about 74,000 active transcripts were identified in a maple forest whose soil metatranscriptome was analyzed (Hesse et al. 2015). Metatranscriptomic screening of polyadenylated mRNAs from soil was applied to the glycoside hydrolase 11 gene family, which encodes endo-xylanases, and an efficient functional screening method based on solution hybrid selection was used to screen for enzyme-encoding genes (Bragalini et al. 2014).

Metatranscriptomics has also been applied to polluted soil to discover enzymes of industrial importance. In a study by Mukherjee et al. (2019a), an aldehyde dehydrogenase gene belonging to the phylum Ciliophora was isolated from polluted soil; when expressed in metal-sensitive yeasts, it conferred tolerance to various toxic metals. Aldehyde dehydrogenases are an important group of abiotic stress response enzymes, expressed to eliminate toxic levels of aldehydes generated by environmental stresses, including metal stress. A novel serine protease inhibitor (serpin) belonging to the phylum Tardigrada was also isolated; it showed increased expression levels in the presence of metals and was able to confer metal tolerance. Like aldehyde dehydrogenases, serpins are stress response proteins that act as a defense system against biotic and abiotic stresses (Mukherjee et al. 2019b). Heterologous expression of such characterized genes could also be beneficial for improving the stress tolerance of crop plants, as the genes are part of important antioxidant systems in organisms.

The use of pesticides and chemical fertilizers on many agricultural lands has led to the accumulation of persistent aromatic hydrocarbon pollutants in the soil. Sharma and Sharma (2018) studied soil polluted with pesticides and chemical fertilizers and investigated the taxonomic and functional aspects of its microbial communities by a metatranscriptomic approach. Transcripts related to the degradation of cypermethrin and related aromatic compounds were identified. Hence, using a single technique, both prokaryotic and eukaryotic enzymes were studied for pesticide degradation. Cypermethrin is degraded in a stepwise process that forms by-products such as carboxylate, benzoic acid and aldehyde derivatives. It can thus be inferred that the removal of toxic compounds is orchestrated by a chain of functional proteins that may originate from any organism in the soil ecosystem. A metatranscriptomic study of the wheat rhizosphere revealed various bacteria involved in the degradation of xenobiotics such as aromatic compounds, carbazoles and naphthalene (Singh et al. 2018).

Mining of genes involved in metal tolerance

Though the number of published research papers is limited, a significant amount of molecular characterization has been performed on metal resistance genes obtained from the environment, showing the metatranscriptomic approach to be a promising method for studying bioremediation of heavy metals. Metallothioneins (MTs), a family of ubiquitous cysteine-rich proteins, have long been studied for their ability to coordinate the binding of metal ions by forming metal-thiolate complexes (Blindauer 2013; Mehra and Winge 1991). Though MTs have a defined classification system based on sequence similarity analysis (Binz and Kägi 1999), they share little overall sequence conservation. This might be due to the independent emergence of MT families in the course of evolution (Capdevila and Atrian 2011). Eukaryotic MTs, however, have not been fully covered, as they have been studied only in specific groups such as vertebrates, plants and fungi. Thus, the search for new metal-binding proteins from environmental DNA or RNA should be a fruitful endeavor for the detection and bioremediation of heavy metals in soil. Soil-extracted mRNA converted to cDNA was screened on the basis of its functional role in resisting toxic amounts of heavy metals (Cervantes and Gutierrez-Corona 1994; Lehembre et al. 2013). This cDNA was cloned into suitable yeast expression vectors. Metal-sensitive yeast strains were then transformed, and the cDNAs were screened for their ability to rescue the yeast strain by functional complementation. Members of the cysteine-rich proteins (CRPs) were identified and compared with existing fungal metallothioneins (Fig. 3). In media under cadmium stress, the potential of these environmental genes to support the growth of sensitive yeast was much higher than that of existing fungal metallothioneins (Lehembre et al. 2013). Further characterization of the CRPs was carried out by Ziller et al. (2017) (Fig. 3).
The amino acid features of these CRPs were analyzed, together with their metal-binding abilities towards Cd, Zn and Cu ions. cDNAs previously isolated from soil metatranscriptomes were sub-cloned and expressed as fusion proteins. These novel proteins, or ‘environmental metallothioneins’, had the ability to chelate Cd(II), Zn(II), and Cu(I). Various yeast mutants (ycf1Δ, zrc1Δ) derived from the wild-type strain BY4741 (MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0) were used, thereby enabling efficient screening by functional complementation.
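
The kind of amino acid feature analysis mentioned above can be sketched with a few lines of Python. This is an illustration under stated assumptions, not the analysis performed by Ziller et al. (2017): it reports the cysteine fraction of a protein and counts Cys-X-Cys motifs, a pattern characteristic of metallothioneins. The peptide sequences and function names are invented.

```python
import re


def cysteine_fraction(protein: str) -> float:
    """Fraction of residues that are cysteine (one-letter code C)."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    return protein.count("C") / len(protein)


def count_cxc_motifs(protein: str) -> int:
    """Count Cys-X-Cys motifs (X is any residue except Cys), overlaps included."""
    # A lookahead makes overlapping motifs such as CACAC count twice.
    return len(re.findall(r"(?=C[^C]C)", protein.upper()))


# Invented peptides: one cysteine-rich, one without cysteine.
candidates = {
    "toy_crp": "MCKCGSCSCPGCA",
    "toy_control": "MKTAYIAKQR",
}

for name, seq in candidates.items():
    print(name, round(cysteine_fraction(seq), 2), count_cxc_motifs(seq))
```

Features of this sort only flag candidates; in the studies described here, metal-binding ability was established experimentally by functional complementation and chelation assays.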

Fig. 3 a Functional complementation by the CRPs in the cadmium-sensitive yeast phenotype ycf1Δ, compared with the fungal metallothionein gene. b Inhibition of growth (%) in liquid broth spiked with cadmium. c Expression of CRP 1–5 in the zinc-sensitive phenotype zrc1Δ in the presence and absence of zinc, showing multi-metal resistance potential. d Nine cysteine-rich proteins (CRP2) identified from soils of 12 different sites, with their amino acid alignments showing conserved cysteine motifs. Adapted from Lehembre et al. (2013) and Ziller et al. (2017)

The genes identified by functional metatranscriptomics could also be used as biomarkers whose level of expression indicates the extent of heavy metal toxicity in crop fields. A recent study on yeast-based biosensors for the detection of environmental pollutants explains why eukaryotic models are now being used for this purpose. Yeast-based systems have an edge over bacteria-based and enzyme-based biosensor systems, as bacteria have low relevance for representing the eukaryotic community and enzyme-based biosensors have lower relevance for the cell or the whole organism (Jarque et al. 2016). Existing yeast-based heavy metal detection systems are listed in Table 2. Toxic heavy metals, one of the most hazardous groups of contaminants, can be detected by both non-specific and specific biosensors. The sensing element in non-specific biosensors is based on the cAMP/PKA (cAMP-dependent protein kinase) signaling pathway, which mediates gene expression in response to generic toxic substances in yeast. These biosensors have been shown to recognize As3+, Fe2+, Pb2+ and Cd2+, but they seem to be nonselective for metals, as they also respond to other generally toxic compounds (Radhika et al. 2005). With metatranscriptomics, a highly specific metallothionein-based detection platform could be devised for biosensing heavy metal pollution in arable lands. A general yeast-based biosensing detection system is depicted in Fig. 4.

Table 2 Yeast bio-detection systems of heavy metals in environment
Fig. 4 Yeast biosensing detection principle. a Preparation of environmental samples and their exposure to the yeast-based sensing system. b The recombinant product of the environmental metal resistance gene reacts with heavy metals, inducing c the activity of a reporter gene or the generation of an electrical signal. d The signal associated with the response may be amplified and processed into quantifiable data.

Critically designed experiments that capture the microbial response reveal newer, underrepresented species performing environmental activities such as nutrient mobilization and the breakdown of organic matter and hydrocarbons. Studies by Lehembre et al. (2013) and Ziller et al. (2017) have captured different roles of eukaryotes in metal chelation and homeostasis. Such work will greatly accelerate the tapping of natural resources for industrially important biocatalysts and will also help to solve problems of soil fertility and pollution. A ubiquitin domain-containing zinc finger protein was isolated and characterized by Thakur et al. (2018); when screened in S. cerevisiae mutants, it imparted tolerance towards different toxic metals. A similar study revealed a novel von Willebrand factor type D domain of a vitellogenin protein from a polluted soil environment, which showed higher expression levels in the presence of toxic metals and conferred tolerance towards toxic metal concentrations (Thakur et al. 2019). Vitellogenin proteins are known for their antioxidant properties. From these studies it can be concluded that enzyme-coding genes isolated from perturbed soil by this technique could serve as important biomarkers and as genes suitable for the bioremediation of soil pollution. Table 3 summarizes the published work on metatranscriptomics of various soil ecosystems.

Table 3 Published metatranscriptomics studies on different soil ecosystems

Metatranscriptomics in microbiome research

The metatranscriptomic approach has been employed in microbiome research mainly to characterize the human gut microflora and to classify the active population. It also provides insight into the host gene expression profile and into the gene expression of the complex network of microbial communities in a given environment (gut, oral cavity, or respiratory tract), where culture-dependent techniques cannot assess their true functional roles. From such data, metabolic pathways in the microbial communities can be identified and associated with a particular environment, and in turn with health, disease and metabolism (Bashiardes et al. 2016).

Studies on the rumen microbiome of buffaloes led to an understanding of the role of the microbial carbohydrate-assimilating enzymes present in the buffalo gut, along with the abundance of polysaccharide-degrading bacteria (Patel et al. 2014). Moreover, bacterial colonization of the cattle rumen throughout the animal's lifetime was studied using next-generation sequencing, which yielded useful information on the digestive capabilities of the ruminant (Koringa et al. 2019). The rumen is particularly interesting to study because it holds a repertoire of carbohydrate-assimilating enzymes that can degrade plant matter. It has been reported that the breakdown of cellulose involves the action of three different types of cellulolytic enzymes of the glycosyl hydrolase family: endo-β-1,4-glucanase, cellobiohydrolase and β-glucosidase (Terry et al. 2019). Several studies have used metatranscriptomics to study the eukaryotic population of the rumen and its significant contribution to plant fiber digestion. The presence of anaerobic fungi of the phylum Neocallimastigomycota in the rumen has been associated with the degradation of plant cell wall material, with CAZyme families active in hemicellulose digestion (Gruninger et al. 2018). High-throughput RNA sequencing applied to the rumen to estimate the active population found that 21.8% of the small subunit rRNA belonged to eukaryotes (Comtet-Marre et al. 2017). The protozoan population was also found to be an active member of the gut microflora, with about 3–6% of total non-rRNA reads coming from protozoan species. Among the fungi, Piromyces and Neocallimastix accounted for the largest share of the fungal cellulases identified from the rumen (Dai et al. 2015). Söllinger et al. (2018) correlated transcript activity with rumen metabolites and found that cellulase transcripts originated from two eukaryotic groups, Neocallimastigaceae and Ciliophora.

The metatranscriptomic approach to studying the rumen suggested a predominance of exoglucanases, which degrade crystalline cellulose. Many glycoside hydrolases that had not been identified by metagenomic studies were confirmed to be active in the metatranscriptomic study (Dai et al. 2015). Major cellulose-binding modules were also found in the eukaryotic transcriptomes, along with polysaccharide utilization loci, which aid in the binding, transport and degradation of polymeric glycan structures (Seshadri et al. 2018). A detailed review by Terry et al. (2019) of metagenomic and metatranscriptomic studies further describes the rumen microflora.

Bioinformatics used in metatranscriptomics

Profiling the gene expression of a microbial population in relation to its environment is a major field of study. Metagenomic analysis combined with metatranscriptomics allows a holistic approach to studying a microbiome and its associated functions (Bashiardes et al. 2016). Several metatranscriptomic pipelines have been developed for efficient analysis of data on the functionality of the microbial community. MetaTrans is one such pipeline, designed to handle RNA-Seq analyses of both taxonomic data and gene expression; it involves quality-control assessment, sorting of RNA into mRNA and non-mRNA, mapping of the reads against functional databases and differential gene expression analysis (Martinez et al. 2016). The Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) pipeline was developed in conjunction with the Metagenomic Rapid Annotations using Subsystems Technology (MG-RAST) annotation server. The pipeline was validated with gut microbiome data and uses raw RNA-Seq data to make meaningful comparisons between different species and their functions. Following annotation and aggregation, the pipeline uses custom R scripts to generate visual plots of the data (Westreich et al. 2016). HUMAnN, a bioinformatic analysis platform applied to the Human Microbiome Project, performs high-throughput metagenomic functional reconstruction and was validated with 649 microbial communities from 7 body habitats. Gene families and their respective pathways, as well as their relative abundances, can be analyzed within a community (Abubucker et al. 2012). Another bioinformatic pipeline for the human gut microflora was developed by Leimena et al. (2013), which enabled the study of symbiotic interactions within prokaryotic ecosystems. MetaQUBIC is an integrated biclustering-based computational pipeline that can integrate metagenomic and metatranscriptomic data to study the human microbiome at multiple levels (Ma et al. 2019).
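To make the idea of differential gene expression concrete, the sketch below compares read counts for the same genes in two conditions by scaling each sample to counts per million (CPM) and taking a log2 ratio. This is a deliberately simplified illustration and not the statistical method implemented in MetaTrans, SAMSA or any other pipeline named here; the function names, the pseudocount and the example counts are assumptions.

```python
import math


def counts_per_million(counts: dict) -> dict:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(control: dict, treated: dict, pseudocount: float = 1.0) -> dict:
    """Return log2(treated / control) per gene after CPM scaling.

    The pseudocount (an assumed value) avoids division by zero for genes
    that are absent from one of the two samples.
    """
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    genes = sorted(set(cpm_control) | set(cpm_treated))
    return {
        gene: math.log2(
            (cpm_treated.get(gene, 0.0) + pseudocount)
            / (cpm_control.get(gene, 0.0) + pseudocount)
        )
        for gene in genes
    }


if __name__ == "__main__":
    # Hypothetical counts for two transcripts in unpolluted vs. metal-spiked soil.
    control = {"crp_like": 100, "housekeeping": 100}
    treated = {"crp_like": 300, "housekeeping": 100}
    print(log2_fold_change(control, treated))
```

Positive values indicate transcripts that are relatively more abundant in the treated sample. Dedicated tools additionally model replicate variance and correct for multiple testing, which this sketch does not attempt.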

Robust bioinformatic analysis pipelines have been used to understand the microbiota of extreme environments, which are resistant or tolerant to harsh environmental conditions. Metagenomic and metatranscriptomic data were successfully integrated by Hua et al. (2015), who employed a ‘divide and conquer’ technique to break huge metagenomic data sets into subsets for the discovery of uncultured and/or rare microbial taxa and their associated functions. Another multi-omic pipeline, IMP, a reproducible and modular analysis tool that can integrate metagenomic and metatranscriptomic data, has also been developed (Narayanasamy et al. 2016). The Functional Mapping and Analysis Pipeline (FMAP) incorporates alignment of reads against reference databases and operon-level and pathway analysis, and can process large data sets (Kim et al. 2016). SqueezeMeta, developed by Tamames and Puente-Sanchez (2019), is an assembly-based bioinformatic pipeline dedicated to metagenomic/metatranscriptomic analysis; it takes a ‘fail-safe’ approach to metagenome assembly with internal checks and can estimate the abundances of the genes in a metagenome. It can also retrieve rRNA sequences and integrates a taxonomic assignment function. A similar workflow, which performs differential gene expression analysis and combines RNA-Seq data with whole genome sequencing for referencing, is the Meta-Omics Software for Community Analysis (MOSCA) (Sequeira et al. 2019). Both assembly-based workflows filter the rRNA out of the reads and then assemble the transcripts before assigning functions to them, and both can estimate the relative abundance of genes (Sequeira et al. 2019). COMAN (Comprehensive Metatranscriptomics Analysis) is a metatranscriptomic pipeline that combines functional assignment with comprehensive statistics and analyzes reads by first removing non-coding RNA. It runs on a convenient web server with detailed instructions and is suitable for biologists with limited expertise in computational biology. For a more detailed analysis and comparison of the different metatranscriptomic pipelines, readers are referred to the article by Shakya et al. (2019).
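The two steps that recur in the workflows above, setting aside rRNA and estimating the relative abundance of the remaining genes, can be sketched as follows. The example uses transcripts per million (TPM), a common length-normalized abundance measure; whether a given pipeline reports TPM or another measure is not stated here, so this choice, the function name and the toy identifiers are all assumptions.

```python
def transcripts_per_million(read_counts: dict, lengths_bp: dict, exclude=()) -> dict:
    """Length-normalized relative abundance (TPM) of transcripts.

    read_counts: mapped reads per transcript.
    lengths_bp:  transcript lengths in base pairs.
    exclude:     identifiers to drop first (for example, rRNA annotations).
    """
    excluded = set(exclude)
    reads_per_kb = {}
    for gene, count in read_counts.items():
        if gene in excluded:
            continue  # rRNA and other unwanted features are set aside first
        reads_per_kb[gene] = count / (lengths_bp[gene] / 1000.0)
    total = sum(reads_per_kb.values())
    if total == 0:
        raise ValueError("no reads left after filtering")
    return {gene: value * 1e6 / total for gene, value in reads_per_kb.items()}


if __name__ == "__main__":
    # Hypothetical identifiers and numbers, for illustration only.
    counts = {"rRNA_18S": 9000, "geneA": 100, "geneB": 100}
    lengths = {"rRNA_18S": 1800, "geneA": 1000, "geneB": 2000}
    print(transcripts_per_million(counts, lengths, exclude=["rRNA_18S"]))
```

Dividing by transcript length before scaling prevents long transcripts from appearing more abundant simply because they attract more reads.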

Problems and challenges in metatranscriptomics

Every method for studying natural habitats has its own benefits and shortcomings. The first limitation of metatranscriptomic studies, which has already been touched upon, is the instability of RNA molecules. Unlike DNA, RNA is prone to degradation from various sources once it is isolated, with average half-lives of seconds to minutes (Iost and Dreyfus 1995). RNA stability can vary depending on the type of soil, the microbial species and the nutrients present (Redon et al. 2005). To preserve the RNA in a collected sample, the sample must be frozen in liquid nitrogen after collection or transferred to an RNA-protective solution (e.g., RNAlater™, Invitrogen™, USA). Isolating RNA from soil is extremely challenging because the tough soil matrix makes cell lysis inefficient. Downstream processing is difficult because complex molecules that inhibit PCR co-precipitate with the RNA. Nucleic acids are readily adsorbed onto soil particles, which makes extraction very difficult, especially with the low-pH buffers that specifically separate RNA from total nucleic acid. A detailed evaluation of the challenges faced by scientists using metatranscriptomics to understand soil ecosystems has already been presented by Warnecke and Hess (2009) and Carvalhais et al. (2012).
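To put half-lives of seconds to minutes in perspective, the short sketch below estimates the fraction of a transcript that would remain after a given delay. It assumes simple first-order (exponential) decay, which is an assumption of this illustration rather than a model taken from the cited studies, and the numbers in the example are hypothetical.

```python
def fraction_remaining(elapsed_s: float, half_life_s: float) -> float:
    """Fraction of RNA left after elapsed_s seconds, assuming first-order decay."""
    if half_life_s <= 0:
        raise ValueError("half-life must be positive")
    if elapsed_s < 0:
        raise ValueError("elapsed time cannot be negative")
    return 0.5 ** (elapsed_s / half_life_s)


if __name__ == "__main__":
    # Hypothetical case: a 2 min half-life and a 10 min delay before preservation.
    print(fraction_remaining(elapsed_s=600, half_life_s=120))
```

Under these assumptions, a delay of five half-lives leaves only about 3% of the original transcript, which is why samples are preserved immediately after collection.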

A major step towards bypassing the mRNA enrichment stage and selectively amplifying soil eukaryotic mRNA was first taken by Bailly et al. (2007), who used the Switching Mechanism At the 5′ end of RNA Template (SMART™) technique to synthesize cDNA. Traditional methods of synthesizing cDNA have many shortcomings. They produce truncated cDNA owing to premature termination of reverse transcription; as a result, most of the 5′ terminal sequence information is lost and the 5′ end is underrepresented in the cDNA library (Botero et al. 2005; Gubler and Hoffman 1983; D’Alessio and Gerard 1988). Moreover, the use of ligation adapters for cloning cDNA generates undesirable byproducts (chimeras).

SMART cDNA synthesis can start with as little as 2 ng of total RNA. A modified oligo(dT) primer containing specific restriction sites primes the first-strand synthesis reaction (Zhu et al. 2001). When the reverse transcriptase reaches the 5′ end of the mRNA, the enzyme’s terminal transferase activity adds a few additional nucleotides to the 3′ end of the cDNA. The SMARTer oligonucleotide base-pairs with this non-templated nucleotide stretch, creating an extended template. The reverse transcriptase then switches templates and continues replicating to the end of the oligonucleotide. An overview of the metatranscriptomic process for a metal-contaminated site is depicted in Fig. 5.

Fig. 5 Overview of metatranscriptomics for mining heavy metal tolerance genes from soil. a, b Soil collection and total RNA isolation. c cDNA synthesis by SMART™: reverse transcriptase (RT) converts soil mRNA to cDNA. d Double-stranded cDNA is cloned into plasmids carrying a yeast promoter and suitable marker genes for selection in yeast and E. coli. e Mutant/auxotrophic yeast transformed with recombinant plasmids survive metal stress owing to expression of the plasmid-borne cDNA; the control carrying the empty plasmid cannot survive the stress. f Surviving yeast cells are selected on minimal media containing specific metal ions.

The designed primer serves as a priming site for the exponential amplification of the synthesized cDNA, in contrast to incomplete cDNA or genomic DNA, which might otherwise contaminate the cDNA library. The only downside of this method is that truncated RNAs present in poor-quality starting material will also be amplified and will contaminate the final cDNA library (Chenchik et al. 1998).
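The template-switching step described above can be illustrated with a toy string model of first-strand synthesis. The sketch assumes that the non-templated nucleotides added by the reverse transcriptase are cytosines and that the template-switching oligonucleotide ends in a matching run of guanines; these details, together with the function names and the example sequences, are illustrative assumptions and do not reproduce any commercial kit.

```python
# Map each RNA/DNA base to its complementary DNA base.
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement_to_dna(seq: str) -> str:
    """Return the DNA reverse complement of an RNA or DNA sequence (5'->3')."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def smart_first_strand(mrna: str, tso: str, n_overhang: int = 3) -> str:
    """Toy model of SMART first-strand cDNA synthesis (all sequences 5'->3').

    mrna: polyadenylated mRNA; the oligo(dT) primer anneals to its poly(A) tail.
    tso:  hypothetical template-switching oligo whose 3' end is n_overhang G's.
    """
    tso = tso.upper()
    if not tso.endswith("G" * n_overhang):
        raise ValueError("template-switching oligo must end in the G overhang")
    # 1. Reverse transcription copies the mRNA up to its 5' end.
    cdna = reverse_complement_to_dna(mrna)
    # 2. Terminal transferase activity adds non-templated C's to the cDNA 3' end.
    cdna += "C" * n_overhang
    # 3. The oligo's G's pair with those C's; the enzyme switches template
    #    and copies the remainder of the oligo.
    cdna += reverse_complement_to_dna(tso[:-n_overhang])
    return cdna


if __name__ == "__main__":
    # Made-up sequences, far shorter than real transcripts or oligos.
    print(smart_first_strand("AUGGCCAAAA", "AAGCAGTGGG"))
```

In this model only cDNAs that reach the 5′ end of the mRNA acquire the complement of the oligonucleotide at their 3′ end, which is what allows them to be amplified selectively in the subsequent PCR.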

With advances in sequencing technologies, more and more data are being generated, not only as metagenomes but also as gene expression data in the form of RNA-seq. However, the most important drawback of sequence analysis techniques is the lack of sufficient reference genomes, which results in inaccurate taxonomic and functional characterization. The unavailability of experimental metadata with which to train the various pipelines also makes it difficult to interpret the growing complexity of the deposited metatranscriptomic data. The demand for high throughput in sequencing results in shorter reads, which are not long enough for taxonomic and functional tags to be assigned accurately. However, long-read sequencing (LRS) and de novo assembly of transcripts improve the characterization of genes that are overlooked by existing NGS technologies (Clarke et al. 2009; Mantere et al. 2019).

The choice of bioinformatic pipeline is a major limitation, as it is still unclear which pipeline is ideal for a given kind of metatranscriptomic data. Most workflows merge various tools and techniques to achieve better analysis, but there is no specific, benchmarked pipeline that addresses all the problems together. Moreover, bioinformatic pipelines are resource intensive and demand substantial computing power and storage. Thus, workflows like SqueezeMeta, which can run on small desktop or laptop computers, and easy-to-use web-based pipelines such as Galaxy and COMAN are the need of the hour (Boekel et al. 2015; Ni et al. 2016).

Concluding remarks

As discussed in this review, metatranscriptomics is a dependable and robust tool for discovering novel functions and possible roles of microorganisms in the soil ecosystem, with an emphasis on eukaryotes. Survival strategies in heavy metal polluted soil have encouraged scientists to develop clear-cut holistic approaches to identify new and underrepresented yet important microbes and functions. To understand the complex and dynamic nature of soil microbial community function and the environmental impact of metal contamination (Wasserman et al. 2008), it is essential to focus on the expressed subset of genes rather than the entire metagenome. Here, metatranscriptomics helps us to understand how microbial populations respond to environmental modification or perturbation, revealing the services they perform in nutrient mobilization, the breakdown of complex organic matter, and tolerance or resistance to heavy metals in the environment. Sequence-based discovery of the participating organisms and their functional information from RNA reveals far more than isolation or cultivation-based techniques would. Heavy metal polluted soil can hold an enormous amount of information, as stressed conditions trigger overexpression of protein-coding genes (Mitchell et al. 2011) and increased activity of a community that would otherwise be dormant under ordinary conditions. This huge repository of sequence information is expanding thanks to widely accessible high-throughput sequencing facilities. The soil biota is well represented, and more and more taxonomic annotations are being assigned to previously unknown species, especially eukaryotes. The protein-coding sequences of these environmental transcripts are validated through appropriate functional studies.

Mining and characterizing functional genes among millions of sequences is a challenge for metatranscriptomics. The problem is partly relieved by (i) sizing the mRNA into suitable fractions (Yadav et al. 2014), (ii) sequence capture by probe hybridization followed by activity screening for the desired function (Bragalini et al. 2014), and, more recently, (iii) ultra-high-throughput screening by picodroplet microfluidics (Sjostrom et al. 2014; Marmeisse et al. 2017).

The eukaryotic genes recovered from polluted soil through the metatranscriptomic approach can thus be used to study unknown taxonomic groups and their contribution to metal tolerance. This has great biotechnological importance, for example in biomarkers or organism-based biosensors. It will also help us to manipulate the level of metal resistance in plants, which would be useful for the remediation and revegetation of polluted soil (Khan 2005). Although these are attractive options, they do not come hurdle-free. Finding a suitable host other than yeast for expressing environmental genes can be a problem, but ongoing studies are carefully designed to express these genes in other living systems. For example, Damon et al. (2011) successfully introduced an environmental oligopeptide transporter into Xenopus oocytes, where it transported dipeptides. Once they can be suitably expressed in different hosts, the novel genes and their products, in the form of biocatalysts, would open up numerous profitable paths in biotechnology.