Introduction

Recent advances in high-throughput next-generation sequencing technologies have led to an exponential rise in the size of genome sequence databases. However, the value of genomic data cannot be realized until the functions of these sequences are deciphered. Toward this end, elucidation of protein three-dimensional (3D) structure is of great importance for understanding the mechanism of protein function, its evolutionary features and catalytic activity, all of which can serve as an important framework for designing further experimental studies. In view of the time-consuming nature of experimental structure determination, theoretical modeling based on homology is currently the most reliable, rapid, and cost-effective approach for deducing the structural properties of sequences and for bridging the ever-expanding gap between the number of known protein sequences and the number of solved structures [1]. Homology modeling predicts the 3D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates) [2]. Although the reliability of this method has been well established in recent years, selection of the most accurate template and correctness of the target-template alignment remain challenging areas of research. For homologous protein sequences with sequence identity greater than 40 %, the alignment is generally considered to be nearly accurate. However, as the overall sequence identity decreases, alignment becomes more difficult, which in turn reduces the quality of the final model [1, 3]. Therefore, the choice of sequence alignment strategy plays a more critical role in generating accurate protein models than the choice of the modeling program, with distinctly improved models obtained by employing the best available sequence alignment technique [4].
The widely used MODELLER program [5] for homology modeling uses standard pairwise comparison methods for template selection and target-template alignment [3]. The subsequently released graphical user interfaces (GUIs) of the MODELLER program, such as MINT (http://www.bioinf.org.uk/software/mint/), EasyModeller [6], SWIFT MODELLER [7] and PyMod [8], have also implemented pairwise comparison methods in their workflows for comparative protein structure modeling. A brief account of some essential features of these programs and their limitations is presented in Table 1.

Table 1 Comparison of various GUIs of the MODELLER program available for protein homology modeling

Although pairwise comparison methods, which employ a dynamic programming algorithm, guarantee an optimal alignment under a given scoring scheme, the generality of the underlying substitution matrices (PAM and BLOSUM) limits the reliability of such methods to cases of high sequence identity. On the other hand, alignment in the so-called twilight zone (15–30 % sequence identity) requires additional information about the protein family to which the sequence belongs [9]. Over the past several years, probabilistic inference methods based on profile hidden Markov models (profile HMMs) have emerged as an alternative to conventional pairwise alignment methods such as BLAST [10, 11] and FASTA [12], creating sequence profiles in order to detect more remote homologous templates in a database [13]. The key feature of the HMM algorithm is that it computes not just the single best-scoring alignment but a sum of probabilities over the entire local alignment ensemble; a profile HMM therefore contains more information about the sequence family than a single sequence [14, 15]. Furthermore, a number of recent studies have corroborated the principal advantage of profile-profile alignment in template identification and overall model quality [16–18]. Despite these advantages, the implementation of HMM methods in homology modeling software and tools is yet to be addressed adequately [13]. Here we describe the development and benchmarking of MaxMod, a unique Microsoft Windows-based GUI of MODELLER that integrates the HMMER3 program for template identification and the Clustal Omega program for sequence alignment. HMMER3 makes profile HMM searches as fast as BLAST, while retaining the power of probabilistic inference technology [13]. In conjunction, Clustal Omega allows fast, scalable generation of high-quality multiple sequence alignments by using the HHalign package [19].
We believe that MaxMod will make the entire process of protein homology modeling much faster and user-friendly.

Methods

MaxMod has been developed on the Visual Studio .NET platform with C# as the programming language, which offers a high degree of flexibility in developing the user interface (UI) and creating an interactive modular system. The UI is built on a multiple document interface (MDI) for effective presentation of the different user modules. Input and output (I/O) operations dominate the coding architecture, since they format the Python scripts and input files of the backend MODELLER program.
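As an illustration of the kind of input-file formatting described above, the following minimal sketch converts a raw target sequence into a PIR-style alignment entry of the form that MODELLER reads. It is written in Python rather than C# and is not taken from the MaxMod source; the function name raw_to_pir and the line width are assumptions made for this example.

```python
import textwrap


def raw_to_pir(code: str, raw_sequence: str, width: int = 75) -> str:
    """Format a raw one-letter sequence as a PIR entry for a target sequence.

    Hypothetical helper for illustration; not part of MaxMod or MODELLER.
    """
    # Keep letters only, so pasted whitespace, line breaks and digits are dropped.
    residues = "".join(ch for ch in raw_sequence.upper() if ch.isalpha())
    if not residues:
        raise ValueError("sequence contains no residues")

    header = f">P1;{code}"
    # Ten colon-separated fields; for a target sequence only the first two
    # (entry type and code) need to be filled in.
    description = f"sequence:{code}:::::::0.00: 0.00"
    # The sequence is terminated by '*' and wrapped to a fixed line width.
    body = textwrap.fill(residues + "*", width=width)
    return "\n".join([header, description, body]) + "\n"


if __name__ == "__main__":
    print(raw_to_pir("TARG", "MKALV GGSTP\nLLKQW"))
```

A template entry would use the same layout with the entry type and the residue range and chain fields filled in from the PDB file.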

The architecture of MaxMod (Fig. 1) consists of three distinct layers. (a) Presentation layer: All visual elements of MaxMod, including user I/O, job directory management and PDB sequence database update, are present in this layer. (b) Business layer: This layer contains standard programming features of the .NET framework base class library (BCL) such as collection classes, data type definitions, variables, security and I/O operations, along with some non-standard features, viz., drawing, classes for database interaction, and web support. The business layer takes input from the preceding presentation layer, processes the data (formatting Python scripts and preparing inputs for other third-party programs) and sends it to the next level. (c) Data access layer: This is a virtual layer controlling various third-party programs such as HMMER3, Clustal Omega, Jmol, and PROCHECK [20], all of which have been integrated within MaxMod. Other programs, namely MODELLER and Python, require pre-installation. The PDB database also resides in this layer for template searches. All processed data and instructions from the BCL are received by the third-party programs of the data access layer and are executed to display the output in the presentation layer. Based on the above architecture, MaxMod follows a definite workflow as illustrated in Fig. 2.

Fig. 1

Architecture of MaxMod demonstrating three different layers of data processing. (a) Presentation layer (b) Business layer (c) Data access layer

Fig. 2

Workflow of MaxMod for predicting protein 3D model from target sequence

Submission of protein sequence

The user is required to submit the target protein sequence in RAW format with a job title of a maximum of five characters. If no title is provided, the program assigns a default name (MODEL) to the submitted sequence along with date and time (format: YYYYMMDDHHMMSS) of submission. The job title also represents the working directory name, where the results are saved for accessing at a later time. At this stage the user can select one of the options viz., “search templates”, “upload templates” or “express modeling”, depending on the requirement (Fig. 3a).
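The naming rule described above can be sketched as follows. This is a minimal Python illustration, not the MaxMod implementation: the function name job_directory_name is hypothetical, and joining the default name and the timestamp without a separator is an assumption, since the text gives only the timestamp format.

```python
from datetime import datetime
from typing import Optional

MAX_TITLE_LENGTH = 5      # maximum title length stated in the text
DEFAULT_PREFIX = "MODEL"  # default name stated in the text


def job_directory_name(title: str = "", now: Optional[datetime] = None) -> str:
    """Return the working-directory name for a job (hypothetical helper)."""
    title = title.strip()
    if len(title) > MAX_TITLE_LENGTH:
        raise ValueError(f"job title must be at most {MAX_TITLE_LENGTH} characters")
    if title:
        return title
    # No title given: fall back to the default name plus a YYYYMMDDHHMMSS stamp.
    # Assumption: the two parts are concatenated directly.
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"{DEFAULT_PREFIX}{stamp}"


if __name__ == "__main__":
    print(job_directory_name("LDH1"))
    print(job_directory_name())
```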

Fig. 3

Screenshots of various windows of MaxMod. (a) Sequence input window (b) Homologous proteins obtained from PDB using HMMER3 program (c) Template upload window (d) Window showing R-factor of the selected templates (e) Window showing ligands selection from templates and the default/advanced parameters of MODELLER (f) Output of the resulting protein models (g) Ramachandran plot generated using integrated PROCHECK program (h) Model visualization through Jmol (i) Residue-wise DOPE profile plot

Search templates

The PDB sequence database and the “phmmer” program of the HMMER3 software suite are packaged together with MaxMod for template searching. On selecting the “search templates” option, HMMER3 is executed to find remote homologs of the target protein sequence in the PDB, and the output is presented in a tabular format listing the PDB code with chain name of the crystal structure, E-value, bit-score, E-value of domain hits, bit-score of domain hits and percentage of sequence identity. The user can select the desired number of templates to view more detailed information on the crystal structures available in the PDB and their alignment with the target sequence. The window is then directed to the RCSB website (www.rcsb.org) to extract the atomic coordinates of the selected structures (Fig. 3b).
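A template search of this kind can be sketched in Python as shown below. The sketch assumes that the phmmer executable is on the system path and that the query has been written to a FASTA file; the file names and helper functions are hypothetical, and this is not how MaxMod itself (a C# program) invokes HMMER3. The parser reads the tabular summary written by the --tblout option, which provides the full-sequence and best-domain E-values and bit-scores; percentage identity is not part of that table and would have to be computed from the alignments.

```python
import subprocess
from typing import Dict, List


def phmmer_command(query_fasta: str, database_fasta: str, tblout: str) -> List[str]:
    """Build a phmmer call that writes a per-target summary table."""
    return ["phmmer", "--tblout", tblout, query_fasta, database_fasta]


def parse_tblout(text: str) -> List[Dict]:
    """Parse phmmer --tblout text into hits sorted by full-sequence E-value."""
    hits = []
    for line in text.splitlines():
        # Comment lines start with '#'; blank lines carry no data.
        if not line.strip() or line.startswith("#"):
            continue
        # The last column is a free-text description, so limit the split.
        fields = line.split(None, 18)
        if len(fields) < 9:
            continue
        hits.append({
            "target": fields[0],
            "evalue": float(fields[4]),         # full-sequence E-value
            "score": float(fields[5]),          # full-sequence bit-score
            "domain_evalue": float(fields[7]),  # best-domain E-value
            "domain_score": float(fields[8]),   # best-domain bit-score
        })
    return sorted(hits, key=lambda hit: hit["evalue"])


def search_templates(query_fasta: str, database_fasta: str,
                     tblout: str = "hits.tbl") -> List[Dict]:
    """Run phmmer and return the parsed hits (requires HMMER3 to be installed)."""
    subprocess.run(phmmer_command(query_fasta, database_fasta, tblout),
                   check=True, stdout=subprocess.DEVNULL)
    with open(tblout) as handle:
        return parse_tblout(handle.read())
```

Calling search_templates("target.fasta", "pdb_seqres.fasta") would then return the candidate templates ordered from the most to the least significant hit; both file names are placeholders.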

Upload templates

If the “upload templates” option in the homepage is selected, the user will be redirected to a separate window where any number of PDB structures can be uploaded as templates and the appropriate chain can be further chosen from a drop down menu (Fig. 3c).

Compare templates

The user can select the most accurate template by clicking on the “compare templates” option, which performs comparison between the selected templates on the basis of better crystallographic resolution (R-factor) and higher overall sequence identity. MaxMod then displays a dendrogram from the generated log file with their respective R-factor (Fig. 3d).

Model construction and analysis

After successful submission of template structures through any of the options, viz., “search templates”, “upload templates” and “compare templates”, the user is redirected to the model construction window, where the ligands of each template are displayed in a tree-view topology. Required ligands may be selected to copy their atomic coordinates onto the modeled structure. Other advanced features are also available in MaxMod: “optimization and refinement”, where each model is first optimized with the variable target function method with conjugate gradients and then refined using molecular dynamics with simulated annealing; “rapid optimization”, which enables the user to obtain an approximate model very quickly; and “automatic loop refinement after model building”, which refines loop regions after the 3D protein model has been constructed (Fig. 3e). Selecting the “build model” option after indicating the number of models to be generated automatically opens a new window in which ‘file name’, ‘molpdf (molecular probability density function)’, and ‘discrete optimized protein energy (DOPE) score’ are shown in the left panel and options for ‘PROCHECK’, ‘visualization’, ‘DOPE evaluation’, and ‘download’ are available in the right panel (Fig. 3f). A lower ‘molpdf’ or ‘DOPE score’ indicates a more reliable model. PROCHECK and Jmol are used to generate the Ramachandran plot (Fig. 3g) and to visualize the 3D conformation of the protein (Fig. 3h), respectively.
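Because a lower molpdf or DOPE score indicates a more reliable model, the models listed in the output window can be ranked by either value. The following minimal Python sketch shows such a ranking on plain dictionaries; the key names ('name', 'failure', 'molpdf', 'DOPE score') are chosen to resemble the per-model summaries reported by MODELLER and should be treated as an assumption, and the helper names are hypothetical.

```python
from typing import Dict, List


def rank_models(outputs: List[Dict], key: str = "DOPE score") -> List[Dict]:
    """Return successfully built models, best (lowest score) first."""
    # Skip models whose construction failed; they carry no usable score.
    built = [model for model in outputs if model.get("failure") is None]
    return sorted(built, key=lambda model: model[key])


def best_model(outputs: List[Dict], key: str = "DOPE score") -> Dict:
    """Return the single best model according to the chosen score."""
    ranked = rank_models(outputs, key)
    if not ranked:
        raise ValueError("no successfully built models to rank")
    return ranked[0]
```

Ranking by molpdf instead only requires passing key="molpdf"; the two scores need not agree on the best model, so inspecting both is reasonable.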

Express modeling

To make the homology modeling procedure simpler and more user-friendly, especially for beginners and non-programmer biologists, an “express modeling” option is provided on the home page of MaxMod, where submission of the protein sequence in RAW format is the only requirement for building a 3D protein model.

Loop optimization

Loops that connect elements of secondary structure for proper protein folding determine the functional specificity of the protein [21]. As a consequence, the accuracy of loop modeling is a crucial component in determining the usefulness of comparative models for studying protein-ligand interactions [22]. In this context we have included a “loop optimization” utility in MaxMod where PDB structures can be uploaded or obtained directly from the job directory. The user is required to specify the loop region to be refined as well as the number of structures to be generated. The resulting optimized 3D protein models are displayed in a separate window to analyze and download.

Results and discussion

MaxMod is a rich, user-friendly standalone tool for protein homology modeling that implements the profile HMM method in the modeling framework, unlike existing GUIs such as EasyModeller, SWIFT MODELLER, and PyMod, which employ pairwise comparison methods such as the ALIGN2D or SALIGN commands for target-template alignment. The advantage of using a profile HMM over pairwise comparison in MaxMod is that it turns a multiple sequence alignment into a position-specific scoring system, which is more suitable for identifying distant homologous relationships. MaxMod can also effortlessly construct protein models using templates bearing modified residues, a feature not present in the other GUIs. Additional features include loop optimization, model validation and visualization, automated update of the PDB database, and express modeling, which enables users to build a 3D model by simply submitting the protein sequence.

On comparing MaxMod with other MODELLER-based GUIs with respect to the total time taken to construct a 3D model for the protein sequence of lactate dehydrogenase (UniProt Acc Id: O96445), it was observed that MaxMod takes around 18 s, which is approximately three times faster than PyMod and five times faster than EasyModeller and SWIFT MODELLER (Table 2). The rapid construction of the protein model by MaxMod can be attributed to improved template search and target-template alignment using the HMMER3 and Clustal Omega programs, respectively. Moreover, the four modeling programs were assessed for their ability to build 3D models from a template bearing modified residues, specifically the crystal structure of the PKR kinase domain-eIF2alpha-AMP-PNP complex (PDB Id: 2A19), which contains the modified residue phosphothreonine. Unlike the other programs, which either failed completely to construct any model or were unable to copy the atomic coordinates of ligands, MaxMod completed protein modeling without any difficulty. Furthermore, the overall performance of these programs was compared by assessing the stereochemical quality of the 3D structures generated by modeling a test set of 15 randomly selected proteins, with sequence identities ranging from as low as 27 % to as high as 84 % (Table 3). PROCHECK results indicated that all 3D models built using MaxMod were of better stereochemical quality, with roughly 99 % or more of residues in the allowed regions of the Ramachandran plot (Table 3). Furthermore, to check the compatibility of inter-residue interactions, the Verify3D tool [23, 24] was employed; the scores indicated that models generated by MaxMod have a relatively greater percentage of residues with an average score >0.2 than the models generated by the other programs. Similarly, to detect potential errors in the models, their Z-scores and total energy plots were calculated using the ProSA-web program [25].
The Z-score indicates overall model quality and measures the deviation of the total energy of the modeled structure with respect to an energy distribution derived from random conformations [26]. A score outside the range characteristic of native proteins indicates an erroneous structure. The ProSA energy plot indicated that all the 3D models generated using MaxMod fall within the range of experimentally determined structures (Supplementary Fig. 1). Thus, the overall results (Table 3) conclusively demonstrate the reliability of MaxMod for significant improvement in model accuracy.
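The Z-score described above can be illustrated with a short Python sketch: the energy of the model is compared with the mean of a reference energy distribution and expressed in units of that distribution's standard deviation. This is a generic illustration under stated assumptions, not the ProSA-web implementation; the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence


def energy_z_score(model_energy: float, reference_energies: Sequence[float]) -> float:
    """Deviation of a model's energy from a reference distribution, in sigma units."""
    sigma = pstdev(reference_energies)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("reference energies must not all be identical")
    return (model_energy - mean(reference_energies)) / sigma


if __name__ == "__main__":
    # Illustrative numbers only: a model far below the reference mean
    # receives a strongly negative Z-score.
    print(energy_z_score(-10.0, [-2.0, 0.0, 2.0]))
```

The resulting value is then judged against the range of Z-scores observed for experimentally determined structures of similar size, as described in the text.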

Table 2 Comparison of performance and features of MaxMod with other publicly available GUIs of the MODELLER program
Table 3 Comparative analysis of structure validation results obtained for the homology models built with MaxMod and other available GUIs of the MODELLER program

Conclusions

MaxMod is a rich, user-friendly GUI for the MODELLER program for prediction of protein 3D structures. Its unique strengths are: (i) the use of profile HMM-based programs, HMMER3 and Clustal Omega, for template identification and target-template alignment, respectively; (ii) effortless modeling of proteins using templates with modified residues; and (iii) other useful features such as (a) loop optimization, (b) express modeling, (c) model validation, and (d) a PDB database update facility. Additionally, the processing time required for model building and the overall model quality are significantly improved owing to the substitution of progressive alignment with the profile HMM method. The program runs on any version of Microsoft Windows, and we plan to release regular updates twice annually.