Introduction

Enzymes play an important role in our daily lives and are used in a variety of industries and sectors such as food, detergents and medicine [1]. The demand for certain enzymes, such as lipases, proteases, hydrolases and polymerases, has risen sharply. Research laboratories and industries are working extensively to find newer and better candidates, and major enzyme manufacturers regularly introduce new enzymes to the market. In the past two decades, several patents on enzymes have been filed and issued. In addition, there are ongoing efforts to replace chemical processes in industry with enzymatic processes, which are greener and more environmentally friendly alternatives. It is widely accepted that cleaner chemical synthesis should be practiced to prevent pollution and avoid the generation of toxic waste [2]. Enzymatic synthesis of chemical compounds has emerged as a simple and competitive alternative to chemical methods. High substrate specificity and a better conversion rate, with few or no by-products, also make enzymes a robust and efficient choice. Recently, Merck and Codexis developed a greener process for the synthesis of sitagliptin, a drug used in diabetes treatment [3]. In recent years, advances in recombinant DNA technology have made it possible to overexpress an enzyme in a variety of host cells, which helps to produce the biocatalyst in large amounts. Obtaining an efficient enzyme candidate requires stringent selection criteria for activity, specificity and stability. In an industrial process, the substrate, solvent and reaction conditions are all important, and the chosen enzyme should be able to withstand them. It is difficult to find a natural enzyme with all the properties desired in an industrial process.
To meet the large demand for enzymes, various approaches are used to explore different resources for new and better enzymes. Among these, in-silico bioprospecting has emerged as an efficient, cost-effective and time-saving approach to discovering new enzyme candidates. Although this approach has been practiced in various laboratories [4,5,6], it has not been reviewed or discussed.

In-Silico Bioprospecting

New enzymes can be discovered using various conventional and contemporary methods, as summarized in Fig. 1. Conventional screening explores natural sources such as industrial waste or soil, but it requires an established screening assay or selection method based on the desired properties of the enzyme. This process involves biochemical screening and isolation of the organism on selective media, which is usually time and resource consuming and may or may not yield a novel candidate. The selected organism then needs to be identified, followed by identification of the gene sequence that codes for the desired enzyme and function. One approach is to perform random mutagenesis to create an enzyme mutant and then sequence the affected DNA region. Another is to perform targeted or whole genome sequencing to identify the desired enzyme gene sequence. Alternatively, the target gene can be amplified using degenerate primers [7], although challenges in primer design affect the success rate. Amplification is followed by cloning of a PCR library and screening for prospective candidates with the desired properties, which again demands a well-established screening protocol. After the desired clone is selected, the responsible gene can be sequenced, cloned and expressed.

Fig. 1 Methods of enzyme bioprospecting

Direct screening and identification methods are preferred where molecular biology resources are inadequate. These experimental approaches are commonly used, but they are time and resource consuming and have a low success rate. In contrast, in-silico bioprospecting is a simple, straightforward and promising approach to identifying novel enzyme candidates with better enzymatic properties. Recent reports in which in-silico bioprospecting has been used to find novel enzymes are compiled in Table 1. Fast-paced, high-throughput whole genome and metagenome sequencing has greatly expanded biological databases and thus the known enzyme diversity. This diversity has in turn increased the complexity and difficulty of finding a novel candidate. The in-silico bioprospecting process can be broadly divided into two steps: (i) searching databases and (ii) using bioinformatics tools to screen, analyse and shortlist prospective candidates.

Table 1 In-silico bioprospecting approach used to find novel enzymes

Step 1: Searching Databases

Databases can be explored using search tools based on homology, conserved motifs, a consensus-guided approach, or a simple keyword search. The search results can be further screened using filters such as percentage identity, query coverage and e-value. For example, a keyword search in the NCBI protein database can be followed by filtering the results to show candidates with 30 to 80% identity and query coverage > 95%. Gupta et al. [11] used keywords such as ‘Hypothetical Protein of T. aestivum’ and ‘Hypothetical Proteins of wheat’ in the NCBI database, followed by manual screening, to obtain unique protein candidates. After redundant entries were removed, the unique candidates were subjected to physicochemical, localization, function and domain analysis. In another database search, keywords such as hydroxybutyrate, hydroxyalkanoate, hydroxyalkanoic, PHA and PHB were used as input [15]. Another common approach is to search biological databases using a known candidate enzyme sequence as the query. When choosing a potential enzyme gene sequence, it is of utmost importance to select a full-length protein sequence with conserved domains, as many incomplete sequences annotated in databases do not code for a functional protein when checked experimentally. In addition, the selected candidate should not be too similar to the known sequence, so that a novel candidate is shortlisted rather than a close homologue of a known sequence. In a similarity search, hits with > 90% identity are very closely related to the query and typically come from sources such as different species of the same family. Hits with ~ 80% identity or lower differ from the query and are not closely related, but still contain conserved sequences similar to those of known candidates.
Selecting such hits ensures that the chosen candidates are novel: they are predicted to retain the enzyme activity but differ from the search query. There have been reports in which researchers selected candidates with sequence similarity as low as 40%. Sharma et al. [10] searched for novel sources of nitrilases in microbial genomes using a homology-based approach and selected sequences that exhibited > 30% and < 80% identity. The shortlisted search results need to be checked for a complete coding sequence. For example, shortlisted nitrilase candidates were checked with the GeneMarkS tool to verify that they contained complete coding sequences [10]. Since the length of the input protein sequence is known, the search results should be restricted to lengths close to that of the input sequence. In the case of nitrilases, sequences with fewer than 100 amino acids were considered false positives and were discarded [10]. In another instance, sequences shorter than 250 amino acids were excluded in a search for novel Baeyer-Villiger monooxygenase (BVMO) enzymes [5]. For PHA synthase, sequences of ~ 120 to 260 bp were considered prospective candidates in a database search [15]. These search filters, along with others such as e-value, help to gather sequences that are likely to code for a functional enzyme of appropriate length and reduce the chance of false discoveries or random, irrelevant search results.
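The filters described above can be combined in a few lines of code. The following is a minimal sketch, assuming the search hits have already been parsed into simple records; the `Hit` class, the `shortlist` function, the length tolerance of 20% and the example accessions are all hypothetical and are not taken from any of the cited studies. The identity window and e-value cut-off mirror values mentioned in the text but should be tuned for each enzyme family.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One similarity-search hit (hypothetical record, not a real BLAST parser)."""
    accession: str
    percent_identity: float  # identity to the query, in percent
    query_coverage: float    # share of the query covered by the alignment, in percent
    evalue: float
    length: int              # length of the hit protein in amino acids


def shortlist(hits, query_length, min_identity=30.0, max_identity=80.0,
              min_coverage=95.0, max_evalue=1e-2, length_tolerance=0.2):
    """Keep hits that are related to the query but are not near-duplicates of it."""
    # Accept only hits whose length is close to the query length (assumed +/- 20%).
    shortest = query_length * (1 - length_tolerance)
    longest = query_length * (1 + length_tolerance)
    kept = [
        h for h in hits
        if min_identity <= h.percent_identity <= max_identity
        and h.query_coverage >= min_coverage
        and h.evalue <= max_evalue
        and shortest <= h.length <= longest
    ]
    # Most significant hits (lowest e-value) first.
    return sorted(kept, key=lambda h: h.evalue)


# Made-up hits for a 350-residue query.
hits = [
    Hit("cand_A", 97.0, 100.0, 1e-180, 350),  # too similar to the query
    Hit("cand_B", 62.5, 98.0, 1e-120, 340),   # passes every filter
    Hit("cand_C", 45.0, 96.0, 1e-60, 95),     # truncated fragment
    Hit("cand_D", 38.0, 70.0, 1e-20, 360),    # poor query coverage
    Hit("cand_E", 33.0, 97.0, 1e-30, 365),    # passes every filter
]
selected = shortlist(hits, query_length=350)
print([h.accession for h in selected])
```

In practice the records would come from the tabular output of a search tool, and a completeness check of the coding sequence would follow as a separate step.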

In certain cases, a motif designed from selected protein sequences can be used to search bacterial genomes [e.g. with MAST (Motif Alignment and Search Tool) from the MEME suite]. For example, a homology-based approach combined with a motif search identified 138 putative/hypothetical protein sequences with the potential to code for nitrilase [10]. Vaquero et al. [16] also adopted a homology-based strategy to screen for novel CalB-type lipases in fungal genomes, using the blastp algorithm against the JGI and NCBI databases with an e-value cut-off of 10⁻². In the same study, the conserved motif approach failed to identify putative lipase genes because of the absence of the conserved sequence motif generated by the MEME software. Therefore, different individual strategies, or combinations of them, should be tried when searching for novel putative enzymes. A consensus-guided approach using Pfam domains can also be used to search databases for members of a particular enzyme family. Shakeel et al. [9] adopted a consensus-guided approach to obtain heat-stable alkane-producing enzymes, using the ado gene from Synechococcus elongatus PCC7942 as a query to search the IMG/MER hot spring database. A consensus sequence was generated from the list of homologous sequences using bioinformatics tools and was further validated computationally and experimentally.
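To illustrate what a consensus sequence is, the sketch below derives one from a set of pre-aligned homologues by simple majority rule. This is an assumption made for illustration only: the text does not specify which tool or rule was used in the cited study, and the `consensus` function, its 50% agreement threshold and the toy alignment are hypothetical.

```python
from collections import Counter


def consensus(aligned, gap="-", min_fraction=0.5, unknown="X"):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    result = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol == gap:
            continue  # most sequences have a gap here, so drop the column
        # Keep the residue only if enough sequences agree, otherwise mark it unknown.
        result.append(symbol if count / len(column) >= min_fraction else unknown)
    return "".join(result)


# Toy alignment of four made-up homologues.
alignment = [
    "MKT-LLVA",
    "MKTALLIA",
    "MRT-LLVA",
    "MKT-LKVG",
]
print(consensus(alignment))  # prints MKTLLVA
```

A real workflow would start from a multiple sequence alignment produced by a dedicated aligner, and the resulting consensus would still need the computational and experimental validation described above.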

Specific datasets, such as metagenomes from various ecosystems, can also be searched for novel enzymes. Around 264 putative monooxygenases were obtained when a Pfam domain and blastp search for BVMOs was run against ~ 14 million protein-coding sequences in a metagenomic dataset of cold marine sediments [5]. Metagenome data from mangrove soil were explored to find polyhydroxyalkanoate (PHA) synthase genes [15]. Adam et al. [17] reported a novel activity-based approach to screen for H2-uptake enzymes from a hydrothermal metagenome. Toyama et al. [13] reported a novel β-glucosidase from the microbial metagenome of a lake in the Amazon. Tan et al. [6] reported a novel thermostable phytase that was screened from a metagenome database using a bioinformatics approach. The steps and approaches used in gene mining from metagenome data have been reviewed recently, and the reader is referred to these articles [18, 19] for details.

The steps of in-silico bioprospecting can be modified according to the desired property of the enzyme. For example, if a thermostable enzyme is desired but the known enzyme is not thermostable, similarity searches restricted to thermophiles are useful for finding putative thermostable enzymes. Thermostable enzyme sequences are commonly observed to differ from those of their mesophilic counterparts. Putative thermophilic candidates found this way should be analysed further (discussed in Step 2) to make sure that residues important for structure and function are conserved.

Step 2: Using Bioinformatics Tools to Screen, Analyse and Shortlist Prospective Candidates

Once the primary list has been generated using the database search approaches, the next step is to analyse the physicochemical, phylogenetic and functional properties of the candidates using different bioinformatics tools. The ProtParam tool on the ExPASy server is widely used to assess physicochemical properties of putative candidates, such as molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) [10, 11, 20]. The predicted values for each putative enzyme are compared with those of well-characterized enzymes, and the degree of agreement affects the confidence with which the putative enzyme is taken forward for experimental study. For example, ProtParam predicted physicochemical properties for 138 putative nitrilases that were within the range of well-characterized nitrilases [10]. All these parameters are computed from the protein sequence, i.e. the analysis is sequence dependent; therefore, a complete or nearly complete sequence is needed for accurate prediction of the physicochemical properties.
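Two of the sequence-dependent parameters listed above are simple enough to compute directly, which shows why a complete sequence matters. The sketch below is not the ProtParam implementation; it assumes the standard definitions, namely GRAVY as the mean Kyte-Doolittle hydropathy over all residues and the aliphatic index as X(Ala) + 2.9 X(Val) + 3.9 [X(Ile) + X(Leu)] with X in mole percent. The function names and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    # A non-standard residue letter raises KeyError rather than being ignored.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence):
    """Relative volume occupied by the aliphatic side chains (Ala, Val, Ile, Leu)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Made-up peptide used only to exercise the functions.
peptide = "MKTAYIAKQRQISFVKSHFSRQ"
print(round(gravy(peptide), 3), round(aliphatic_index(peptide), 2))
```

Because both values are averages over the whole chain, a truncated database entry shifts them, so fragments should be removed before candidates are compared with characterized enzymes.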

Phylogenetic analysis can be performed using tools like Molecular Evolutionary Genetics Analysis (MEGA) [11, 15, 16, 21]. For example, phylogenetic analysis of selected putative candidates belonging to the CalB family grouped the putative lipases into different clusters of known lipases according to their evolutionary closeness [16], which helped in deciding on novel and unique candidates. Structural modelling of putative candidates can be performed using the SWISS-MODEL server or MODELLER v9.15 software [21]. Vaquero et al. [16] used CalB as a template to model PlicB, which exhibits 30% sequence identity and 44% similarity to the template. Predicting structure and residue conservation is only possible if structural data for protein homologues are available as crystal structures. Hence, persistent exploration and enrichment of databases are necessary for in-silico bioprospecting of novel enzymes.
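Identity figures such as the 30% quoted for a template and its target depend on how the alignment is scored. The sketch below shows one common convention, assumed here for illustration: identical residues divided by the number of aligned columns in which neither sequence has a gap. Other tools divide by the full alignment length or by the shorter sequence, so values are not directly comparable across conventions. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore every column that contains a gap in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Two made-up aligned fragments: 7 gap-free columns, 5 of them identical.
print(round(percent_identity("MKT-LLVA", "MKTALKVG"), 1))
```

A similarity percentage would additionally count conservative substitutions, which requires a substitution matrix and is left out of this sketch.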

Other tools can predict structural features such as signal peptides (e.g. SignalP) or disulfide linkages (e.g. DiANNA). The DiANNA 1.1 web server predicted two disulfide bonds in PlicB, whereas CalB and Uml2 lack disulfide bonds [16]. Protein functional domains and families are studied by comparing the list of putative enzymes against databases and tools such as Pfam, CATH, SVM-Prot, CDART and SMART. In one study, hypothetical proteins (HPs) were explored using tools based on domain architecture and profiles [11]. Out of 124 HPs, 77 sequences were annotated with high confidence using Pfam, CATH, SVM-Prot, CDART, SMART and ProtoNet, and 16 of them were predicted to be enzymes. A functional protein network, which can be generated with the STRING database, provides information about the association of hypothetical/putative proteins with known functional proteins. In the study by Gupta et al. [11], predicted HPs such as HAV22 (Q7XAP6) and F-box protein (D0QEJ9) were found to interact with other proteins in the STRING database, such as protein 4345793 of Oryza sativa subsp. japonica.

Analysing the putative candidates with bioinformatics tools provides clarity and helps in selecting candidates that are structurally and functionally more suitable, novel and unique. Following sequence selection, candidates are validated for the desired properties by cloning and expressing them in heterologous expression systems, followed by physicochemical characterization of the enzyme [6, 13]. Apart from in-silico bioprospecting, enzymes with desired properties such as high activity [22, 23], substrate specificity [24] and stability [25, 26] can also be obtained by modifying an existing enzyme through mutagenesis, using directed evolution or rational or semi-rational approaches [27,28,29,30,31,32,33,34,35]. Random mutagenesis of a single gene can be done by chemical mutagenesis, error-prone PCR or saturation mutagenesis, or by using mutator strains. Gene recombination, on the other hand, can be applied to more than one related gene sequence, using techniques like DNA shuffling, Random Chimeragenesis on Transient Templates (RACHITT), exon shuffling, incremental truncation for the creation of hybrid enzymes (ITCHY) and Sequence Homology-Independent Protein Recombination (SHIPREC). The reader is referred to the review by Rubin-Pitel et al. [31] for details about these processes, their advantages and drawbacks. Recent developments, along with the addition of a rational component, have resulted in faster selection methods and higher-quality libraries with more relevant mutations [36]. Rational mutagenesis has been attempted in recent years to obtain desired enzyme properties; however, the phenotype of certain mutations is still beyond the current understanding of enzyme structure and function.

Conclusion

In the past few years, enzyme production and research have taken a major leap, and a vast number of enzymes are available on the market and produced at industrial scale. Reports on screening for and finding newer and better enzymes are continuously being published. However, it is generally observed that wild-type enzymes are not directly applicable to an industrial process. In the coming years, it is expected that more industrially important enzymes will be discovered or engineered to satisfy the ever-growing demand for enzymes. The availability of various expression vectors, hosts and systems has increased the possibility of expressing a gene in a host of our choice. However, protein expression can often be challenging, even in a bacterial host like E. coli [37, 38]. The diversity of enzymes present in databases indicates that the present knowledge of structure and function is vast but far from complete. The last two decades have seen tremendous growth in protein structural information, and expression systems and tools have been greatly enriched, but more information is still required to understand and use this diversity to its full potential. With the rise in molecular techniques, enzyme improvement by protein engineering has advanced considerably [35]. Drastic improvements in enzymatic properties such as activity and stability have been achieved using directed evolution or rational mutagenesis. Even so, with the current knowledge of enzyme structure and function, it remains challenging to pursue a rational approach to enzyme engineering in every case. Efforts should be focussed more on solving enzyme crystal structures and expanding our knowledge and understanding of enzyme function and properties. The pace at which structural information is generated cannot match the pace at which new genes and proteins are being discovered, but attempts can be made to improve it.
Generating and analysing diverse crystallographic data will help in understanding enzymes in greater detail and in rationally engineering them for improved properties. There is an urgent need for new tools and pipelines that can handle and analyse the exponentially growing databases and the related experimental literature with minimal manual intervention. This will help in discovering novel and better enzymes faster and with a higher success rate.