
1 Introduction

The current decade has witnessed a major shift in the technologies used across sectors ranging from social media, agriculture, and services to science and technology. New advances are being made in satellites, robotics, and micro- and nanotechnologies, alongside a revolution in computing. Science as a whole has been shaped by this revolution: every discipline is building newer technologies and approaches for scientifically rigorous experimentation. These developments are also changing our social life, health, and environment. The life sciences, one of the major streams of science, have been strongly affected and accelerated by these advances in techniques and technologies.

Technologies such as next-generation sequencing (NGS) in genomics, high-throughput assays, and supramolecular chemistry are revolutionizing the life sciences and applied areas such as human health, agriculture, and livestock [1,2,3,4]. Robotics-based automation is generating large volumes of data from experiments and characterization techniques. Next-generation biology is driven heavily by both wet-laboratory experimentation and dry-laboratory computation.

Technologies like next-generation sequencing (NGS) enable the genomes of thousands of plant and animal species to be sequenced at an extremely rapid rate [5,6,7]. Many genome sequencing centers today produce on the order of terabytes of data per week, which amounts to petabytes of sequence information per year. This figure is expected to grow exponentially, and the community will soon face the challenge of storing and analyzing exabytes of sequence data [5,6,7]. Beyond this, there is already a race to sequence the genomes of all living species on the planet, including humans, plants, animals, and microbes. This gigantic exercise is expected to yield zettabytes to yottabytes of sequence data; such large volumes will form the genomic ocean of tomorrow [7,8,9].

Similarly, structural databases of biomolecules such as proteins, nucleic acids, lipids, and membranes are also growing rapidly (shown in Fig. 1) due to methods like cryo-crystallization, high-frequency NMR, and other characterization techniques, along with computational modeling [10]. Computational modeling and simulation of biomolecules have improved drastically thanks to advances in high-performance computing (HPC) [11] and the development of enhanced sampling methods [12, 13], paving the way for mimicking long-timescale events in biological systems more efficiently. With today's computing paradigm, generating structural data is no longer the major challenge; analyzing this huge data has become one. Computer simulations help determine the mechanism of action of biomolecules in the cell, thereby suggesting their implication in various diseases and their potential use in therapeutics. The computational techniques thus generate biomolecular structural and dynamical data via very long timescale simulations, and detailed, systematic analysis of these data becomes an important part of any study, as it helps to understand the entire mechanism of biomolecular action. Advances in crystallization, NMR, and computational methods are directly influencing and accelerating the drug discovery process.

Fig. 1 Growth of structural data from 2001 onwards. Source: https://www.rcsb.org/pdb/statistics

2 Drug Discovery Process

Discovering a new drug is a complex, time-consuming, expensive, and high-risk process for R&D and pharmaceutical laboratories [14,15,16]. It is a multi-step process involving target identification, target validation, and screening of small molecules against validated targets, and these steps need to be made easier, cost-effective, and fast. Computer-aided drug discovery is one such computational approach: it involves identifying new ligand molecules for a particular target protein, an important step in drug discovery. Historically, the drug discovery process involved extracting chemical compounds from natural resources and testing them in cells for disease treatment [17]. With advances in technology and the ability to chemically synthesize small chemical moieties, various drug databases came into existence. The availability of this vast structural resource of small molecules has made high-throughput screening of these databases against target proteins far more feasible. Increasing the affinity and reducing the toxicity of already available ligand molecules also needs to be addressed in the drug discovery process.

The drug discovery process involves the following steps: (1) target identification, (2) validation of the target protein, (3) creation of a small molecule database, (4) screening of small molecules against the target protein, i.e., hit-to-lead identification, (5) lead optimization, (6) preclinical testing, and (7) clinical testing.

Almost all these steps generate huge data from wet-laboratory and computational experimentation and need better ways of handling the data with fast analytics approaches. Target identification and validation involve selecting protein molecules whose activity, when blocked or enhanced, can affect the particular disease-related cellular pathway. This requires a systems biology approach: understanding all the proteins involved in the pathway, checking whether any alternate pathway is available, establishing the role of a particular protein in a particular pathway, and identifying side effects of targeting that protein. The second most important requirement is a database of lakhs of small molecules which can be screened against the target protein. These small molecules may be microbial metabolites, of plant origin, or chemically synthesized. Various drug molecule databases such as ChemSpider [18], DrugBank [19], and ZINC [20] are already available.

Screening these lakhs of molecules against a target protein is performed using molecular docking. The screening process should be fast, which demands better computational and programming techniques. Each molecule has conformational flexibility, which makes the docking process more time-consuming, and the choice of an efficient force field and scoring methodology also plays an important role in screening. High-throughput docking methods have been developed to achieve this. However, the analysis of the resulting docked conformations to choose the best ligand becomes a big data analytics problem, as it involves evaluating various parameters and the many interactions between the target protein and the docked ligand.
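
As a small illustration of this kind of post-docking analytics, the sketch below scans a directory of docking result files and ranks ligands by their best score. The directory layout and the "Score:" line format are illustrative assumptions rather than the output format of any particular docking package.

```python
"""Rank docked ligands by score: a minimal sketch of the post-docking
analytics step described above. The results directory layout and the
"Score:" line format are illustrative assumptions, not a specific
docking package's output format."""
import glob
import heapq
import re

SCORE_RE = re.compile(r"Score:\s*(-?\d+\.\d+)")  # hypothetical score line

def best_score(result_file):
    """Return the best (lowest) docking score found in one result file."""
    scores = []
    with open(result_file) as fh:
        for line in fh:
            match = SCORE_RE.search(line)
            if match:
                scores.append(float(match.group(1)))
    return min(scores) if scores else None

def top_ligands(results_dir, n=10):
    """Collect the n best-scoring ligands across all result files."""
    ranked = []
    for path in glob.glob(f"{results_dir}/*.out"):
        score = best_score(path)
        if score is not None:
            ranked.append((score, path))
    return heapq.nsmallest(n, ranked)

if __name__ == "__main__":
    for score, ligand in top_ligands("docking_results", n=10):
        print(f"{score:8.2f}  {ligand}")
```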

Docking or screening projects a static picture of the binding of a ligand to the receptor [21]. The dynamic picture is obtained from molecular dynamics simulations, which capture the flexibility of both protein and ligand. Molecular dynamics simulation gives insight into the various intermolecular interactions and binding affinities of the protein-ligand complex, and hence into binding efficiency [16]. Molecular docking followed by simulation generates huge volumes of molecular trajectory data, so managing and rapidly analyzing this data has become the need of the hour.

The upcoming area of drug repurposing is again proving to be a large computational task, and it has the potential to deliver a drug molecule for a chosen disease [22, 23]. Various pharmaceutical and R&D laboratories are working on drug repurposing, which involves docking already FDA-approved drugs onto new target proteins. Because FDA-approved drugs have already been tested on humans for toxicity and pharmacology, rejection on toxicity grounds is largely ruled out, and the overall duration of the drug discovery process can be shortened by a few years. HPC-based molecular docking and molecular dynamics simulations have a challenging role to play in this area of drug repurposing.

To manage this rapidly increasing data and analyze it efficiently, tools need to be parallelized so that overall performance is enhanced. There is a continuous need for advanced analysis platforms and algorithms that can analyze biological data faster. Big data technologies may help provide solutions to these problems of molecular docking and simulation (Fig. 2).

Fig. 2 Role of big data analytics in drug discovery

3 Big data Technologies: Challenges and Solutions

The notion of big data depends on the problem at hand and on the existing technologies; today's big data can be tomorrow's small data as the methods for handling data become more advanced. Big data is data that cannot be handled using existing traditional methods and therefore requires specialized techniques.

Big data is characterized by its three main properties, viz. volume, velocity, and variety [24]. Volume denotes the sheer amount of data that needs to be analyzed, velocity the rate at which the data is generated, and variety the different types of data produced by various sources using different formats of data generation and exchange. Big data usually expands rapidly in unstructured form and varies to such an extent that it becomes difficult to maintain in traditional databases. In such cases, specialized techniques like NoSQL [25] can be used to handle unstructured data. Big data technologies are capable of managing huge data generated in different formats. Advancements such as cloud computing offer a unified platform to store and retrieve the data. Internet speeds have increased manyfold, and cloud technologies have effectively exploited these capabilities to offer a scalable, multi-user platform for big data analytics in bioinformatics. The use of big data in bioinformatics is an emerging field which presents new opportunities to medical researchers and paves the way toward the prediction of personalized medicines. The greatest challenge lies in designing a strategy to acquire the data and then filter it to meet the appropriate decision-making demands.

This can be achieved by bringing together experts from clinical medicine, computer science, bioinformatics, biotechnology, and statistics to address the challenges of data management and analytics for precision biology. A Hadoop [26]-based platform with MapReduce and Spark-based algorithms may be useful for optimizing such analyses and making them fast. Hadoop- and MapReduce [27]-based algorithms implemented on scalable architecture are discussed below, along with a drug repurposing big data case study for a cancer protein.

4 Big data Technology Components

  • Hadoop

Apache Hadoop is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware. Hadoop has gained popularity among parallel data processing tools because of its simplicity, efficiency, low cost, and reliability, and it can be built on commodity hardware. Hadoop has three major components: the Hadoop Distributed File System (HDFS), the YARN scheduler and resource negotiation framework, and the MapReduce [27] programming framework. A typical Hadoop test bed is shown in Fig. 3.

Fig. 3 Basic architecture diagram of Hadoop test bed

I. HDFS

The Hadoop Distributed File System (HDFS) is built to provide a high-throughput, reliable, efficient, and fault-tolerant file system. It supports streaming reads and writes of large files. The basic architecture of HDFS is shown in Fig. 4; its two main components are the namenode and the datanodes. HDFS is designed for low-cost hardware and hence can be built on a cluster of commodity machines. In HDFS, a file is divided into fixed-size blocks (chunks) of 128 MB each, except for the last block; this block size is configurable according to need. The namenode holds the metadata of all files, including which blocks are stored on which datanodes, while the datanodes store the actual blocks of data. By default each block is stored on three datanodes of the cluster, providing reliability at the cost of redundancy. Generally, two copies of a block are stored on two different datanodes of the same rack, while the third copy is stored on a datanode of a different rack of the same cluster, the racks being connected by a very high-speed network switch. This policy ensures the reliability of the HDFS file system: even if two of these nodes fail, the data can still be accessed from the datanode holding the third copy. Datanodes periodically report their state to the namenode so that the namenode is aware of the overall state of the cluster. While scheduling a MapReduce [27] job, the Hadoop framework tries, wherever possible, to run each mapper task on the datanode where the corresponding data resides. This avoids significant network overhead and improves the performance of the overall cluster.
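
As a minimal illustration of how trajectory data might be staged into HDFS with the block size and replication factor discussed above, the snippet below wraps the hdfs command-line client. The paths are placeholders, and the snippet is a sketch of the staging step rather than part of any tool described in this chapter; it assumes the hdfs client is available on the PATH.

```python
"""Copy trajectory frames into HDFS, overriding the block size and the
replication factor for this upload only. A minimal sketch with placeholder
paths; assumes the 'hdfs' client is on PATH."""
import subprocess

def put_into_hdfs(local_dir, hdfs_dir,
                  block_size=128 * 1024 * 1024, replication=3):
    # -D generic options override dfs.blocksize and dfs.replication for this command.
    cmd = [
        "hdfs", "dfs",
        "-D", f"dfs.blocksize={block_size}",
        "-D", f"dfs.replication={replication}",
        "-put", local_dir, hdfs_dir,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    put_into_hdfs("frames/", "/data/trajectories/")
```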

Fig. 4 Basic architecture diagram of Hadoop Distributed File System (HDFS)

HDFS major components:

(i) Namenode

    The namenode stores the metadata about the files and has a complete view of the distributed file system. It tracks which datanodes are active and which are not. In case of a datanode failure, it initiates the operations needed to maintain the replication factor by copying the data stored on the failed node to active datanodes. If the namenode fails, the complete HDFS file system becomes unavailable.

(ii) Datanode

    It stores the actual data and performs read and write operations once it receives commands from the namenode. It is responsible for block creation, deletion, and replication, and it periodically sends a heartbeat signal to the namenode.

II. MapReduce

Hadoop MapReduce is a programming framework and one of the major parts of the Apache Hadoop project. It provides a programming model for data-parallel applications; the basic flow of a MapReduce algorithm is shown in Fig. 5. The MapReduce programming model makes use of HDFS, which makes application performance efficient and fast: with the help of the Hadoop framework, mapper tasks are placed on the datanodes where the actual data resides, improving performance and removing the network bottleneck when processing huge amounts of data. The major phases of a MapReduce program are mapper, partitioner, combiner, shuffle and sort, and reducer.

Fig. 5 Basic flow of MapReduce algorithm execution

The mapper reads the data from HDFS and processes it. The partitioner then ensures that the processed data is sent to the desired reducer. Before reaching the reducer, the data is shuffled and sorted so that the reducer can process it easily. Finally, the reducer performs a reduction or aggregation operation on the data and writes the final output to HDFS. The combiner performs a task similar to the reducer but at the mapper level, providing local aggregation or reduction.
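
A minimal Hadoop Streaming sketch of these phases is shown below: the mapper emits (key, 1) pairs, the framework shuffles and sorts them by key, and the reducer aggregates the counts. Counting residue names in PDB ATOM records is used purely as an illustration and is not one of the tools described in this chapter; the streaming jar path and HDFS paths in the comment are placeholders.

```python
"""Minimal Hadoop Streaming sketch of the map -> shuffle/sort -> reduce flow.
A typical (abbreviated, placeholder) invocation:
  hadoop jar hadoop-streaming.jar -files streaming_count.py \
      -input /data/frames -output /data/residue_counts \
      -mapper "python3 streaming_count.py map" \
      -reducer "python3 streaming_count.py reduce"
"""
import sys

def mapper(stream):
    # Emit one (residue_name, 1) record per ATOM/HETATM line.
    for line in stream:
        if line.startswith(("ATOM", "HETATM")):
            print(f"{line[17:20].strip()}\t1")

def reducer(stream):
    # Input arrives grouped and sorted by key after the shuffle/sort phase.
    current, total = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```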

III. YARN

Apache YARN stands for Yet Another Resource Negotiator. Before Hadoop 2.x, the only framework that could run on the Hadoop platform was MapReduce, because job scheduling and resource negotiation were integrated into the MapReduce framework itself. YARN provides a separate layer for job scheduling and resource negotiation, so that other programming frameworks like Spark and Storm can also run on the Hadoop framework. The basic architecture of YARN is shown in Fig. 6.

Fig. 6 Basic architecture of YARN showing various components

YARN consists of the ResourceManager, NodeManagers, Containers, and ApplicationMasters. Each container on a datanode is allotted a configurable amount of CPU and memory. The ResourceManager runs on the namenode, and NodeManagers run on the datanodes. Whenever a job is submitted, the ResourceManager allocates one container on some datanode; this container process is called the ApplicationMaster. The ApplicationMaster is responsible for all job management and for negotiating resources with the ResourceManager; with the ResourceManager's help, it allocates containers from the NodeManagers for the MapReduce tasks. This approach reduces the load on the ResourceManager and distributes it across the per-job ApplicationMasters on the datanodes. In this way, a YARN-based Hadoop cluster can grow to about 10,000 nodes, whereas earlier benchmarks on Hadoop 1.x without YARN reached about 4000 nodes. YARN thus provides scalability to the Hadoop cluster and allows different programming platforms to be incorporated into the Hadoop framework.

5 Big data Tools Development for Drug Discovery

Various scientific groups have made efforts to use HPC and grid technologies for drug discovery. Several docking tools such as DOCK6 [28], GOLD [29], and AutoDock Vina [30] are already available in parallel mode on HPC platforms. Most of these tools are fast and robust; each has its own scoring function based on molecular mechanics force fields and other geometric descriptors, and improvements are still being made to enhance the accuracy and efficiency of these scoring functions. Docking with fully flexible ligands and proteins remains a time-consuming calculation, and docking multiple ligands to a single protein, or multiple ligands to multiple proteins, may be among the future challenges in this area. The flexibility of both proteins and ligands is handled by currently available molecular simulation packages such as AMBER [31], CHARMM [32], GROMACS [33], and NAMD [34], all of which are known to scale on HPC platforms. Although molecular simulations are time-consuming, they remain the best way to understand the allowed flexibility of proteins, ligands, active sites, and other biomolecular entities. The advent of cloud and big data technologies promises to accelerate the drug development process using MapReduce [27] and Spark methods coupled with machine learning and deep learning analytics. Tools such as DIVE [35], HiMach [36], and HTMD [37] have been developed for molecular simulation as well as trajectory visualization and analysis, and many more tools are likely being developed using these newer technologies.

The Bioinformatics group at C-DAC, Pune, has been addressing data analytics and visualization of trajectories in the structural biology domain using HPC technologies combined with big data technologies. Various analytics tools have been developed and tested on the Hadoop platform using MapReduce, as shown in Fig. 7. At this stage, the analytics tools for multiple molecular trajectories include hydrogen bond calculations and the identification of water molecules and bridged water-mediated interactions. Other big data analytics tools for RMSD, 2D-RMSD, RMSF, water density, and WHAM-based free energy calculations are under development. A few of the big data analytics tools that have already been developed have proved useful in the drug discovery process; these tools are described below.

Fig. 7 Schematic representation of role of Hadoop and MapReduce paradigm in drug discovery process

5.1 Hydrogen Bond Big data Analytics Tool (HBAT)

Molecular dynamics (MD) simulations generate large trajectories whose size ranges from gigabytes to terabytes depending on the size of the molecule and the length of the simulation. Many MD simulations use explicit solvation models in which water molecules are added explicitly to the solute to mimic the natural system. This drastically increases the number of atoms in the system, and the analysis of such systems becomes more compute-intensive, iterative, and time-consuming. Various analysis programs (ptraj, cpptraj [38], VMD [39], etc.) are available for the different MD simulation packages, with modules for analyses such as RMSD, RMSF, radius of gyration, PCA [40], distance calculations, H-bond analysis, and MMGBSA [41] free energy calculations. However, many of these programs are inefficient or very slow in calculating H-bond interactions within the solute and especially between the solute and the solvent (water molecules), and they have constraints in dealing with large datasets of, for example, 500 GB or beyond. This drawback of existing tools indicates a strong need for a water-mediated H-bond analysis tool capable of handling very large trajectories and of being executed in parallel to reduce the analysis time. The water molecules added to the system may play a crucial role in the activity or functioning of the molecule, so understanding the role and mechanism of such water molecules and their interactions with the solute (protein/RNA/DNA or drug) is very important [42, 43]. To achieve this, a big data analytics tool for hydrogen bond calculation was developed by the Bioinformatics group at C-DAC.

The MapReduce algorithm for H-bond calculation was developed and ported to and tested on a Hadoop cluster. The algorithm flow for H-bond calculation using the MapReduce approach is shown in Fig. 8a. The HDFS file system is used to store the multiple molecular trajectories. The current version of the tool can analyze trajectory data in PDB format generated by molecular dynamics packages such as AMBER [31], GROMACS [33], and CHARMM [32]. The tool is scalable and portable to any distributed computing platform and can find H-bonds between all types of residues, including water. However, it requires a significant amount of time for the preprocessing stage, in which PDB files are generated from the trajectories and copied to the distributed HDFS storage. Despite this overhead, the overall performance of the tool is better than that of existing tools such as CPPTRAJ or PTRAJ [38], especially for trajectories with a large number of water molecules. The benchmarking of the H-bond tool, carried out on up to 5.5 TB of data, shows near-linear scale-up (Fig. 8b). Additionally, the tool can easily identify water-mediated interactions such as water bridges.
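
To make the map/reduce decomposition concrete, the sketch below processes one PDB frame per map task and emits candidate donor-acceptor pairs using a simple distance-only cutoff, while the reduce step sums the counts into per-pair occupancies. This is an illustration in the spirit of the tool described above, not its actual implementation: the 3.5 Å cutoff is an assumption, and the real geometric criteria, atom typing, and angle checks are omitted.

```python
"""MapReduce-style sketch for H-bond occupancy across PDB frames.
Distance-only criterion; NOT the HBAT implementation."""
import glob
import math
from collections import Counter

CUTOFF = 3.5          # donor-acceptor heavy-atom distance cutoff in Å (assumption)
POLAR = ("N", "O")    # treat N/O heavy atoms as potential donors/acceptors

def parse_frame(pdb_file):
    """Yield (residue:atom label, x, y, z) for polar heavy atoms in one frame."""
    with open(pdb_file) as fh:
        for line in fh:
            if line.startswith(("ATOM", "HETATM")):
                name = line[12:16].strip()
                if name.startswith(POLAR):
                    residue = f"{line[17:20].strip()}{line[22:26].strip()}"
                    x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
                    yield f"{residue}:{name}", x, y, z

def map_frame(pdb_file):
    """Map step: emit ((atom_a, atom_b), 1) for inter-residue pairs within the cutoff."""
    atoms = list(parse_frame(pdb_file))
    for i, (la, xa, ya, za) in enumerate(atoms):
        for lb, xb, yb, zb in atoms[i + 1:]:
            same_residue = la.split(":")[0] == lb.split(":")[0]
            if not same_residue and math.dist((xa, ya, za), (xb, yb, zb)) <= CUTOFF:
                yield (la, lb), 1

def reduce_counts(pair_stream):
    """Reduce step: sum counts per donor-acceptor pair across all frames."""
    totals = Counter()
    for pair, count in pair_stream:
        totals[pair] += count
    return totals

if __name__ == "__main__":
    frames = glob.glob("frames/*.pdb")            # placeholder frame directory
    emitted = (pair for frame in frames for pair in map_frame(frame))
    occupancy = reduce_counts(emitted)
    for pair, count in occupancy.most_common(20):
        print(pair, f"occupancy = {count / max(len(frames), 1):.2f}")
```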

Fig. 8 a MapReduce algorithm for H-bond calculation implemented in MapReduce paradigm. b Benchmarking of HBAT tool for data up to 5.5 TB

5.2 Molecular Conformation Generation on Cloud (MOSAIC)

Drug databases usually contain millions of ligands, and each ligand can have billions of conformations [44, 45]. Such conformations need to be docked onto a target, which is generally a protein molecule. Generating and optimizing billions of ligand conformations is a huge computational problem, since it involves advanced methods such as molecular mechanics, semi-empirical, and quantum techniques [46, 47]. An embarrassingly parallel approach, accompanied by virtualized resource scaling and an efficient structure optimization tool, can handle billions of conformations with the help of cloud computing technologies.

The Bioinformatics group of C-DAC has developed a tool called MOSAIC, which stands for MOlecular Structure generator In the Cloud. MOSAIC is an OpenStack [48] cloud-based conformational search tool for exploring the potential energy surface of biomolecules of interest in parallel using a semi-empirical method. MOPAC (Molecular Orbital PACkage) is a general-purpose semi-empirical molecular orbital package for the study of molecular structures and their energies [49]. High-throughput energy calculations on a small-molecule database can be performed with MOPAC using Hadoop and cloud technologies, with multiple MOPAC instances created for the energy calculations. The tool can screen a database of millions of small drug-like molecules and characterize their energetics and electrostatic behavior, and it is useful for finding target drug ligands. The torsion-angle-driven conformational search method is useful in a range of chemical design applications [50], including drug discovery and the design of targeted chemical hosts. MOSAIC offers an easy-to-use interface for the bioinformatics community over a Software as a Service (SaaS) platform. A user-friendly Web interface has been developed for MOPAC-based energy calculation of small-molecule databases; it can be configured for any OpenStack-based cloud and manages multiple users submitting jobs on dynamically created cloud VMs. The Web interface has been developed using the LAMP (Linux, Apache, MySQL, and PHP) framework [51] and is shown in Fig. 9a, b. The application is deployed on the OpenStack Kilo release, which provides the platform for running MOPAC with virtually allocated resources in the cloud. The OpenStack cloud infrastructure provides scalable computational resources and storage capacity.

Fig. 9 a MOSAIC tool homepage. b MOSAIC tool job submission page

The details of the cloud configuration are as follows. The cloud infrastructure is installed using a multi-node architecture, and the cloud test bed is deployed with the following configuration:

  • Controller node: 1 processor, 2 GB memory, 5 GB storage, and 2 NICs.

  • Network node: 1 processor, 512 MB memory, 5 GB storage, and 3 NICs.

  • Compute node: 1 processor, 2 GB memory, 10 GB storage, and 2 NICs.

To synchronize the cluster, an NTP server needs to be set up. The controller node acts as the NTP server, and the network and compute nodes are synchronized with it. All nodes in the cluster except the controller run the MySQL client service, while the MySQL databases are installed on the controller. The controller node also hosts the messaging server for passing messages across the nodes; RabbitMQ [52] has been used for this purpose. The configuration is depicted in Fig. 10.

Fig. 10 Cloud configuration of MOSAIC tool

MOSAIC is executed on the underlying OpenStack-based cloud to distribute millions of molecules in .mop format across the cloud nodes, which can be dynamically scaled to accommodate the computing load. The drug database is in SDF format, containing millions of molecules with different conformations of each. The SDF is converted into the required input, i.e., .mop format, which is used by the code for semi-empirical optimization. The output files are parsed for the energy values, and the best optimized ligand molecules are selected on the basis of their energy profiles; these may then be scrutinized further as possible drug candidates. The tool therefore has tremendous potential for ligand optimization, i.e., finding the best pose not just for one molecule but for an entire ligand database, and it can easily be deployed on any OpenStack-based cloud platform. MOSAIC abstracts the complexity of cloud-based job submission behind an easy-to-use interface for the scientific community, provides a user-specific work area for managing secured private data and outputs, and offers a configurable orchestration mechanism for virtual hardware configuration. The result is returned as a few selected molecules favorable for the drug target. By exploiting cloud computing features such as dynamic scaling and on-demand computing, MOSAIC performs high-throughput optimization of ligand databases in parallel in a distributed cloud environment, reducing overall cost and accelerating virtual screening, docking, and hence the drug discovery process. The workflow is shown in Fig. 11.
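
As a rough illustration of the conversion and ranking steps just described, the sketch below splits an SDF database into per-conformer .mop inputs, runs MOPAC on each, and ranks the conformers by the reported heat of formation. Open Babel's obabel command is assumed for the format conversion, the MOPAC executable name may differ between installations, and the parsing targets the "FINAL HEAT OF FORMATION" line of MOPAC output; these are assumptions about the tooling, not a description of MOSAIC's internals.

```python
"""Split an SDF into per-conformer .mop inputs, optimize with MOPAC, and
rank by heat of formation. A sketch under the assumptions stated above."""
import glob
import re
import subprocess

HOF_RE = re.compile(r"FINAL HEAT OF FORMATION\s*=\s*(-?\d+\.\d+)")

def split_sdf_to_mop(sdf_file, prefix="conf"):
    # obabel -m writes one output file per molecule: conf1.mop, conf2.mop, ...
    subprocess.run(["obabel", sdf_file, "-O", f"{prefix}.mop", "-m"], check=True)
    return sorted(glob.glob(f"{prefix}*.mop"))

def optimize(mop_file):
    """Run MOPAC on one input and return the heat of formation (kcal/mol)."""
    # The executable name may vary with the MOPAC installation.
    subprocess.run(["mopac", mop_file], check=True)
    out_file = mop_file.replace(".mop", ".out")
    with open(out_file) as fh:
        match = HOF_RE.search(fh.read())
    return float(match.group(1)) if match else float("inf")

if __name__ == "__main__":
    energies = [(optimize(f), f) for f in split_sdf_to_mop("ligands.sdf")]
    for energy, conf in sorted(energies)[:5]:     # keep a few lowest-energy conformers
        print(f"{energy:10.2f} kcal/mol  {conf}")
```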

Fig. 11 MOSAIC tool workflow for cloud-based MOPAC implementation

The tool has the following features:

  • Easy to use for the bioinformatics community, abstracting the complexity of cloud-based job execution.

  • A user-friendly interface with a user-specific storage area and login time stamp features.

  • Cloud-based high-throughput optimization of ligand database in parallel using distributed environment.

  • Integrated browser-based visualization for optimized ligand molecules.

  • OpenStack-based cloud environment facilitates users with on-demand scalable virtualized resources.

  • Configurable orchestration mechanism for virtual hardware configuration.

  • Generalized configurable solution for any OpenStack-based cloud using openrc script.

5.3 Embarrassingly Parallel Molecular Docking Pipeline

Molecular docking, or high-throughput screening, has become increasingly important in the context of drug discovery [45], and may be the only way to identify correct inhibitors of a specific target. High-throughput docking is cost-effective and very fast and can be very useful for the pharmaceutical industry. An attempt has been made to develop a scalable workflow, shown in Fig. 12, for high-throughput conformational search and docking on HPC, Hadoop, or cloud-based clusters. The workflow has two sections: the first performs the conformational search, and the second performs the molecular docking. The objective of the conformational search is to find the most stable conformation of the molecule along with alternative stable conformations; a semi-empirical program such as MOPAC [49] is used for finding the stable structures, as described in the MOSAIC section above. After the stable structures of the small molecules are obtained, docking with the protein of interest is carried out in parallel in the next part of the workflow. The workflow supports docking of multiple small molecules with one protein as well as multiple molecules with multiple proteins. The workflow has been tested on a drug repurposing strategy in cancer; a test case of its usage is given in Sect. 6 below, in the cancer K-Ras drug repurposing study.
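
The embarrassingly parallel orchestration can be sketched as below: each ligand is independently optimized and then docked, so the per-ligand work is farmed out to a process pool. The optimize and dock command lines, the receptor file, and the ligand directory are placeholders for the actual MOPAC and DOCK6 invocations used by the pipeline.

```python
"""Sketch of the two-stage, per-ligand pipeline dispatched over a process
pool. Command lines and file names are placeholders, not the pipeline's
actual invocations."""
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

RECEPTOR = "kras_receptor.mol2"          # placeholder receptor file

def process_ligand(ligand_file):
    # Stage 1: conformational search / optimization (placeholder wrapper script).
    optimize_cmd = ["run_mopac.sh", ligand_file]
    # Stage 2: dock the optimized conformer against the receptor (placeholder wrapper script).
    dock_cmd = ["run_dock6.sh", RECEPTOR, ligand_file]
    for cmd in (optimize_cmd, dock_cmd):
        subprocess.run(cmd, check=True)
    return ligand_file

if __name__ == "__main__":
    ligands = glob.glob("ligands/*.mol2")
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(process_ligand, lig): lig for lig in ligands}
        for future in as_completed(futures):
            try:
                print("done:", future.result())
            except subprocess.CalledProcessError as err:
                print("failed:", futures[future], err)
```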

Fig. 12 High-throughput conformation generation and drug docking pipeline

This tool can also be deployed on any HPC, Hadoop, or cloud platform available worldwide. The current version is deployed on the computing resources of BRAF (Bioinformatics Resources & Applications Facility), C-DAC, Pune, India.

5.4 Parallel Molecular Trajectories Visualization & Analytics (DPICT)

In any computational study of biomolecular systems, analysis and visualization play a pivotal role in understanding and interpretation, and molecular dynamics (MD) simulation studies of biomolecular systems, including proteins and nucleic acids, are no exception. Recent advances in MD techniques such as REMD [53] generate multiple trajectory files whose sizes run into gigabytes (GBs). Present-day tools often find it difficult to load a trajectory of a few GB, as it tends to occupy the entire CPU memory, and the same problem arises when loading multiple trajectories simultaneously, since most codes do not support parallel architectures. Redundancy also occurs when the same set of calculations must be carried out for each trajectory individually. This often becomes a bottleneck in research work, since recoding these programs to suit one's purpose is cumbersome; one often searches for an appropriate program for analyzing and visualizing multiple MD simulation datasets and, in the absence of a good one, has to resort to writing custom codes and scripts. Loading trajectory files for visualization and analysis with present tools is also often extremely slow, since most codes are serial and do not support multiple processors. VMD [39] addresses this issue by means of multi-threading, but the process becomes unresponsive when more than one trajectory is loaded and visualized at a time. A visualization and analysis tool capable of handling terascale and petascale data, together with high-end visualization screens, would accelerate the drug discovery process. Here, an attempt has been made to develop a new visualization and analysis tool capable of reading various file formats such as AMBER [31] and GROMACS [33] and performing most of the required analyses for a simulation in a parallel environment. The flowchart of the DPICT tool is shown in Fig. 13.
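
For comparison, the snippet below sketches one simple way of analyzing several trajectories in parallel, with one worker process per trajectory. MDAnalysis is used here only as an example reader and the file names are placeholders; DPICT itself is implemented in C/C++, so this illustrates the parallel pattern rather than the tool's own code.

```python
"""Parallel multi-trajectory analysis sketch: one process per trajectory,
computing a per-frame radius of gyration. Example only; file names are
placeholders and MDAnalysis is an assumed reader, not part of DPICT."""
from concurrent.futures import ProcessPoolExecutor

import MDAnalysis as mda

def radius_of_gyration_series(args):
    topology, trajectory = args
    u = mda.Universe(topology, trajectory)
    protein = u.select_atoms("protein")
    # Atom positions update as we iterate over the trajectory frames.
    return trajectory, [protein.radius_of_gyration() for _ in u.trajectory]

if __name__ == "__main__":
    runs = [
        ("system.gro", "run1.xtc"),   # placeholder GROMACS files
        ("system.gro", "run2.xtc"),
        ("system.gro", "run3.xtc"),
    ]
    with ProcessPoolExecutor(max_workers=len(runs)) as pool:
        for name, rg in pool.map(radius_of_gyration_series, runs):
            print(f"{name}: mean Rg = {sum(rg) / len(rg):.2f} Å")
```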

Fig. 13 Flowchart of the DPICT tool

The tool has two distinct modules: one for visualization and rendering and the other for analysis of the MD simulations. It is an entirely GUI-based software tool meant to run on Unix/Linux operating systems. The entire tool is coded in C/C++, and OpenGL [ref] programming may be incorporated.

  • Features of DPICT:

    • Enables simultaneous visualization of huge molecular dynamics trajectories for better understanding of the simulation data

    • Supports visualization of nine molecules simultaneously

    • Different rendering options for biomolecules like ribbon, cartoon, ball, and stick can be viewed

    • Works in a synchronous manner, wherein nine trajectories may be handled simultaneously to perform certain operations

    • Widely used file formats of PDB, AMBER, and GROMACS are supported

    • SSH feature enables the users to handle the transfer of large files from remote to local HPC clusters and vice versa.

The current version of the DPICT tool can manage big data from multiple trajectories, as shown in Fig. 14. Future versions will target full big data visualization.

Fig. 14 DPICT tool showing simultaneous multiple trajectory visualization

The Bioinformatics group at C-DAC has used the above docking, simulation, and analytics tools for drug repurposing studies on a cancer protein, as described below.

6 Drug Repurposing Study Using Big data Analytics

Drug repositioning or repurposing is a strategy to find a new mechanism of action of an FDA-approved drug against a disease protein other than the one for which it was originally intended. A repositioned drug need not go through the complete drug development cycle of many years [54]; it can enter preclinical testing and clinical trials directly, thereby reducing risk, time, and cost. One well-known example of a repurposed drug is sildenafil citrate (Viagra), which was repositioned from a common hypertension drug to a therapy for erectile dysfunction [55, 56]. Similarly, the off-label use of FDA-approved drugs in cancer medical practice is well known and accounts for 50–75% of drug or biologic therapies for cancer in the USA [57, 58]. With a computational drug repurposing strategy, a large number of receptors can be tested against already FDA-approved drugs, which increases the chance of identifying a cure for a disease within a shortened time [59]. One crucial protein, Ras, which sits at the center of a key signaling pathway, is discussed here as a case study.

The RAt Sarcoma (RAS) protein is a crucial member of the family of proteins known as G-proteins. Ras is encoded by one of the most common oncogenes in humans. It belongs to the GTPase class of proteins, which possess an inherent GTP hydrolysis activity. Depending on its association with GDP/GTP, the protein adopts two distinct conformations: a GDP-bound inactive state and a GTP-bound active state [60,61,62]. Malfunctioning of this protein is known to play a crucial role in human cancers, especially pancreatic cancer, and in developmental disorders such as Costello syndrome and Noonan syndrome [63,64,65]. Normally functioning Ras plays a pivotal role in cell proliferation, development, differentiation, and signal transduction [63]. The most common Ras mutations are found in pancreatic cancers, and most cancer-causing mutations are reported to lie in the conserved switch regions (Sw I and Sw II) and the GEF-binding regions of the protein. As these regions are involved in protein–protein interactions and other crucial functions, such mutations directly affect the interaction of Ras with other proteins [66, 67]. Studies of the activation and deactivation pathways of Ras and comparative studies of wild type and mutants have been carried out by various groups, and a significantly lower energy barrier for the mutant counterparts of Ras is well established by experimental and computational studies. To further explore the crucial mutations and compare them with the wild-type counterpart, computational studies are required to provide more insight into their dynamics and conformational features. Furthermore, since K-Ras is inherently a less druggable molecule, current drug discovery efforts are directed toward developing inhibitors of Ras downstream effectors. Related studies suggest the need for dual-site inhibitors to effectively block oncogenic Ras signaling, and triple-site inhibitors are also gaining importance for improved cancer therapeutics. With this as a reference, simulations have been performed to explore and understand the dynamics of the activation pathway of the reported hotspot mutants of Ras [68]. Similarly, the GTP hydrolysis-mediated inactivation pathways of the mutant Ras complexes have also been explored, providing more information on the energetics of the mutant Ras complexes by calculating the energy barrier between the end states of the protein [69]. Molecular docking studies were carried out on Ras using the drug repurposing approach with the FDA-approved drug molecule database. The literature suggests three sites on Ras where ligands can be docked, shown in Fig. 15 [70]: SITE1 comprises residues 29–37, SITE2 residues 68–74 and 49–57, and SITE3 residues 58–74 and 87–91. High-throughput docking was performed using the DOCK6 software employed in the embarrassingly parallel molecular docking pipeline. A docking-based drug repurposing and simulation study is being carried out on four Ras systems, namely the wild type and the Q61L, G12V, and G12D mutants, each with 37 ligands. The multiple trajectories for these systems were visualized using the parallel trajectory visualizer tool, DPICT. To understand the properties of the ligands (drug candidates), multiple conformations (Fig. 16) were generated using the high-throughput conformation generator tool.

Moreover, an in-house developed tool was used to study the protein–ligand complexes of the simulated systems; the docked pose of one of the ligands is shown in Fig. 17. Preliminary analyses have been completed for these systems. Hydrogen bond and water density analyses have been performed using the in-house big data analytics tool HBAT, and MSM analyses are being carried out and compared with the wild-type counterpart. Further, MD simulations were carried out for the best molecule per site to check the binding of the molecule with Ras (data unpublished). Classical simulations were carried out using the GROMACS software on the Bioinformatics Resources and Applications Facility (BRAF), following the standard protocol of minimization, heating, equilibration, and production.

Fig. 15 KRas docking sites: SITE1 (red): residues 29–37, SITE2 (yellow): residues 68–74 and 49–57, SITE3 (pink): residues 58–74 and 87–91

Fig. 16 Conformations generated for docking

Fig. 17 KRas protein with ligand docked at SITE2

The various tools discussed earlier in this chapter have been used for parallel visualization and for efficient, fast analysis of the Ras docking and simulation trajectory data. The in-house computational facility BRAF, where these tools are already deployed and tested, has been used. The results should help experimentalists select better ligands for further steps of drug development.

7 Latest Development in Big data

Bioinformatics is a technology-driven science, and major technological shifts are driving this data-driven science. With ever-increasing data, the storage and analysis of huge datasets are becoming very tedious, and much of the data remains unanalyzed. For example, the sequencing of genomes of various organisms is generating petabytes to zettabytes of data, and new sequencing technologies such as nanopore sequencing produce long reads and hence huge data volumes [71]. The assembly of such genomes poses a huge challenge to big data technologies. Apache Hadoop has also been enhanced to tackle such challenges, for instance through YARN, which allows different data processing engines including graph processing, stream processing, and batch processing. The MapReduce framework provided by Apache Hadoop is good for batch processing, but for iterative processing, where the data must be read many times, MapReduce is inefficient because it relies heavily on disk input/output. Apache Spark addresses this limitation of Hadoop by providing in-memory computing and reducing disk input/output; it optimizes disk performance through lazy evaluation and caching and is therefore well suited to iterative computing.
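
A minimal PySpark sketch of this iterative, in-memory pattern is shown below: the input is parsed once, cached, and then reused across several passes instead of being re-read from disk on every iteration. The HDFS path and the per-pass threshold sweep are placeholders.

```python
"""PySpark sketch of caching for iterative computation. The HDFS path and
the threshold sweep are placeholders."""
from pyspark import SparkContext

sc = SparkContext(appName="iterative-demo")

# Parse once, cache in memory for reuse across iterations.
scores = (
    sc.textFile("hdfs:///data/docking_scores.txt")   # placeholder path
      .map(lambda line: float(line.split()[-1]))
      .cache()
)

# Each pass reuses the cached RDD; with plain MapReduce every pass
# would re-read and re-parse the input from disk.
for cutoff in (-12.0, -10.0, -8.0, -6.0):
    n = scores.filter(lambda s, c=cutoff: s <= c).count()
    print(f"ligands with score <= {cutoff}: {n}")

sc.stop()
```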

Recent progress has enabled precision analytics strategies at the single-cell level. Single-cell sequencing produces an enormous volume and complexity of information and presents an extraordinary opportunity to understand cell-level heterogeneity. These latest developments highlight the inherent opportunities and challenges in big data analytics. Recently introduced technologies, such as the erasure coding mechanism [72] in Hadoop 3.x, help resolve the difficulties posed by big data problems like single-cell transcriptome analysis in bioinformatics and present a great opportunity to develop cutting-edge technologies for future research problems. HDFS uses redundancy for high availability of data, which provides a great benefit at the cost of storage: with a replication factor of 3, HDFS uses three times the storage for data redundancy, which is very costly. The erasure coding mechanism in Hadoop 3.x provides the same storage safety with only about 50% storage overhead, which is effective when the data volume is large and its access frequency is low.

8 Conclusions

The future of medical science lies in moving toward personalized medicine for enhanced health care. High-performance computing, together with parallel and better algorithms, will generate large volumes of data from molecular docking and simulations, and advanced structural biology laboratories and techniques will generate further, different types of data. The only efficient way to manage and analyze such extremely varied data may lie in the application of big data technologies. Similarly extreme data is being generated by advanced experimentation in the life sciences: in agriculture, for better crop production and reduced disease susceptibility; in livestock, to understand their genomics and protect them from various diseases; and in microbiology, for genomics, drug discovery, vaccines, and environmental studies. The near future of biology and the life sciences appears to be data-driven hypothesis generation rather than hypothesis-driven data generation, and the newer computing paradigm of big data technologies may be very useful in this respect.