Introduction

Structure-based drug design (SBDD) has become central to the drug discovery process and helped identify several marketed drugs available today [1]. Physics-based computational approaches that characterize protein–ligand interactions have significantly evolved [2] and benefited immensely from advances in hardware and algorithm optimizations [3]. Among the wide gamut of physics-based SBDD approaches, docking methods [4] continue to be among the most popular and have been used for a range of drug discovery processes including library screening [5] and ligand optimization [6]. Although their primary appeal lies in the ability to quickly predict the binding pose of a ligand in the protein pocket, it has been shown repeatedly that incorporating conformational dynamics of protein–ligand interactions is critical for driving the ligand optimization process [7].

Molecular dynamics (MD) simulations are an important tool for understanding the dynamics of binding pockets and optimizing ligands for drug discovery [8]. They provide detailed information about the dynamic behavior of proteins and their interactions with ligands [9], reveal the stability of a complex, and identify weak points in binding that can guide ligand optimization. MD simulations have been critical for delineating the relationship between pocket dynamics and function for several classes of proteins, including transmembrane receptors such as ion channels [10] and opioid receptors [11], viral capsids [12], sirtuins [13], and RAS family proteins [14]. These studies led to the development of selective activators or inhibitors [15,16,17,18] for these protein targets.

While there have been significant advances in high-performance computing infrastructure [19,20,21,22] and optimization of MD algorithms [23,24,25] that enable MD of biological systems of increasing size [26,27,28] and complexity [29,30,31], the process of setting up, running, and analyzing MD simulations remains multi-step [32, 33] and cumbersome. This severely constrains the routine use of MD for compound prioritization in the optimization campaigns typically run in industry. Moreover, several recent works have adopted strategies that dramatically expand the chemical search space, either via generative machine learning (ML) [34] or through docking of extremely large libraries [35, 36], in the screening and optimization cycles of discovery projects [37]. These studies have applied thermodynamic methods to enrich hit rates by accounting for dynamic protein–ligand interactions, conformational heterogeneity of the protein and ligand, and the interplay with water [38, 39]. Given the limitations of thermodynamic approaches with respect to the chemical similarity of compounds in a dataset for relative free energy calculations [40] and conformational sampling [41], incorporating ‘regular’ long-time-scale MD into the assessment of large libraries from generative ML or enumeration workflows is expected to improve the accuracy of predictions and increase the enrichment of hits from these workflows.

We present an automated workflow (MDFit) that streamlines setting up, running, and analyzing Desmond [25, 42] MD simulations of protein–ligand complexes using the OPLS4 [43] force field. The workflow takes a library of pre-docked ligands and a protein structure as input, sets up and runs MD for each protein–ligand complex, and then analyzes the MD trajectories of each ligand in the input dataset. Analysis of the MD trajectories includes the flexibility of the ligand in the pocket via root mean squared deviation (RMSD) relative to the starting pose, the stability of different ligand–pocket interactions, and other metrics that help quantify the dynamics of the protein pocket and the ligand library. These metrics are combined into simulation fingerprints (SimFPs) that enable easy rank-ordering of the dataset along any of the collected metrics. In addition, we demonstrate that SimFPs can be used as features in ML models for potency prediction and mechanistic interpretation. In contrast to static encodings like protein–ligand interaction fingerprints, SimFPs capture the dynamics of protein–ligand interactions and facilitate more accurate predictions. Unlike relative free energy perturbation calculations, SimFP-based ML models are less restrictive about the need for chemical similarity within a dataset and can accommodate much more comprehensive sampling of pocket–ligand dynamics through longer-time-scale MD. There have been previous attempts at automating MD, analyzing bulk-phase behavior of ligands [44, 45], or analyzing protein–ligand interactions [33, 46,47,48]. Our workflow streamlines and integrates all these aspects to enable a potency prediction ML model that learns comprehensively about protein–ligand interactions in the presence of water from MD and explains interesting SAR trends that are otherwise missed by static structure-based methods.

We show applications of MDFit for assessing (a) cyclic peptides that target PD-L1 and (b) small molecule inhibitors targeting CDK9, both with therapeutic potential as anticancer agents [49,50,51]. Compound names (Pep-01 to Pep-60 for PD-L1 [51] and Cpd 01 [52] to Cpd 39) from the original publications are retained.

PD-L1 binds to PD-1 at an elongated β-sheet interface. Cyclic peptides with beta-strand geometry offer unique advantages for binding to this shallow and expansive orthosteric site. An overlay of Pep-01 bound to PD-L1 (PDB code 6PV9) and PD-1 bound to PD-L1 (PDB code 4ZQK) shows that Pep-01 binds to the β-sheet interface between PD-1 and PD-L1 (Fig. 1A). By mimicking the PD-1 secondary structure, Pep-01 packs against the PD-L1 surface with sufficient interaction energy to overcome the major costs of binding (Fig. 1B).

Fig. 1

A Overlay of 4ZQK and 6PV9 crystal structures showing Pep-01 binds to the β-sheet interface of PD-L1 to block PD-1 binding. B Peptide binding interface with PD-L1. All residues within 5 Å of Pep-01 are shown with critical residues determined by ML models (vide infra) shown in red (detrimental), green (beneficial), or blue (detrimental or beneficial). C Pocket interactions of Cpd 38 in CDK9 using PDB 7NWK. While docked poses were indistinguishable within the series, MDFit was useful in identifying the detrimental effect of pushing against Phe103 and Ile25 (vide infra)

Previous studies have shown a strong correlation between peptide strain and their potency through docking of extensively sampled conformations of the peptides [53]. The extremely large number of rotatable dihedrals with these cyclic peptides makes relative free-energy perturbation methods for assessing potency and pocket dynamics untenable [54]. We apply MDFit to provide insights from protein-peptide dynamics that can clearly explain potency cliffs among matched-molecular pairs (MMPs). SimFPs enable easy identification of differences in the pocket and water-mediated interactions across MMPs that help build an understanding of the structure–activity relationship (SAR). In this study, SimFP features are also used for training an ML model to predict potency outcomes and infer which features are most important for activity. For the PD-L1 dataset, the top SimFP features identified by the ML model offer additional insights about MMPs and their potency cliffs that would have otherwise been easy to miss with static information such as docked poses.

Cyclin-dependent kinases (CDKs) are Ser/Thr kinases regulated by cyclins. Several CDK inhibitors have advanced to the clinic and have shown efficacy for multiple myeloma and other tumors [52]. We ran MDFit with a series of azabenzimidazole inhibitors [51] using a previously published co-crystal structure (Fig. 1C). Akin to the PD-L1 data set, top SimFP features identified from MDFit helped explain interesting SAR trends among MMPs that were otherwise not immediately apparent from their docked poses.

Methods

The MDFit workflow (Fig. 2) automates the process described below; the scripts are available for download from GitHub (https://github.com/brueckna2020/MDFit). The workflow requires the user to provide a protein model and a library of ligands as inputs. The protein structure needs to be fully prepared, with missing side chains and loops added, protonation states of residues determined, hydrogen atoms added, and terminal residues capped. For the PD-L1 case study, the protein from PDB 6PV9 [55] was used as the starting protein conformation. For the CDK9 case study, the protein from PDB 7NWK was used as the starting protein conformation. Protonation states of protein residues were determined using PropKa [56] and the protein was prepared using the Protein Preparation Wizard module in Maestro (Schrodinger, LLC). Ligands in the input library need to have 3D conformations with reasonable poses when bound to the protein pocket. For the PD-L1 dataset, Pep-01 [57] and sixty of its analogs as described previously [57] were used. Previous studies have harnessed solution-state NMR and X-ray co-crystal structures of Pep-01 to accurately generate bound states of Pep-01 and its analogs [53, 57, 58]. Top poses from the docked conformer ensembles [53] were used as starting conformations for MDFit. For the CDK9 case study, thirty-nine azabenzimidazole inhibitors described previously [51] were used. Compounds were docked into the pocket using Glide [59] and poses similar to Compound 6 in the crystal structure were used as inputs for MDFit.

  1) Force-field parameters: The workflow begins with a call to the FFBuilder tool from Schrodinger that evaluates all dihedrals in the input library, sets up QM calculations for dihedral scans, and optimizes missing or sub-optimal dihedral parameters using these QM scans. Optimized dihedral parameters are merged into the OPLS4 [43] ‘main’ force field supplied by the user. This optimized force field is subsequently used for MD and analysis.

  2) Protein–ligand complexes: Each ligand in the input library is complexed with the protein, and the complex is put through an initial round of minimization using the MacroModel [60] module by Schrodinger. Polak-Ribière Conjugate Gradient (PRCG) minimization of the complex is run for a maximum of 500 steps with the gradient convergence threshold set to 0.3 kJ/mol.

  3) Solvation: Minimized protein–ligand complexes are then inserted into an orthorhombic box sized so that each edge of the box is 10 Å from the protein surface. The total charge of the protein and ligand is calculated, and neutralizing Na⁺ or Cl⁻ ions are placed randomly inside the box between the protein surface and the box edges. The remaining space is filled with water molecules.

  4) Relaxation, Equilibration:

    a. Protein, ligand, and ion parameters are modeled using OPLS4 [43] while SPC [61] is used to model water. All simulations are run using the Desmond [25] engine. Both case studies discussed below were run with Desmond from Schrodinger suite version 2022-2.

    b. Solvated protein–ligand systems are relaxed before the production MD simulations. Initially, the entire system is equilibrated for 100 ps using NVT Brownian dynamics at T = 10 K, with a harmonic position restraint with a force constant of 50 kcal/mol/Å² applied to all protein and ligand heavy atoms. At the same temperature and using the same restraints, the system is equilibrated for an additional 24 ps of NPT dynamics using a Berendsen [62] thermostat, with the pressure gradually dropping from 50 to 2 bar.

  5) Production: After equilibration, production MD simulations are run using NPT dynamics without positional restraints. By default, the workflow runs each solvated protein–ligand system in triplicate for a simulation time of 2 ns, saving a trajectory frame every 100 ps. Velocity seeds are randomized for each of the three MD runs. The default of 2 ns reflects disk space considerations, since MD trajectory files can be quite large; however, our calculations with the PD-L1 and CDK9 data sets show that running simulations to 100 ns helps with convergence (see section on Simulation Length) and with capturing interesting SAR trends. Therefore, for the PD-L1 and CDK9 datasets, each ligand–protein system was run in 3 replicates of 100 ns each.

  6) Analysis: Schrodinger’s Simulation Event Analysis (SEA) scripts are used for assessing the production MD trajectories. The scripts collect a wide range of metrics (Supplementary Info, Table S1) that capture meaningful information and insights about ligand and pocket flexibility.

    a. Clusters from Trajectories: RMSD-based clustering analysis provides the top N cluster representations (default of 5) of the model system, revealing common structural motifs or states. The Desmond MD clustering algorithm calculates the RMSD similarity matrix for the given trajectory frames. By default, ligand atoms are used for RMSD calculations, and the matrix is computed based on these chosen atoms. Subsequently, the workflow clusters the trajectory frames using the RMSD matrix. An affinity propagation algorithm is employed for clustering, which is well-suited for identifying distinct conformational clusters. The output CMS files include information about cluster size, frame indices, and timestamps. These diverse conformations based on ligand RMSD are used for all analyses described with the PD-L1 dataset.

    b. Parched Trajectory: A trimmed MD trajectory is generated by retaining only the protein, the ligand, and the closest N solvent molecules. By default, N is set to 100. Before parching, trajectories are aligned using the ligand atoms from the starting pose as the reference.

    c. Interactions: Protein–ligand interactions, water-mediated interactions, dihedral motions in ligands, and ion permeation are all recorded using event detection scripts that use pre-defined distance, angle, and dihedral cutoffs based on literature precedent [63,64,65]. The workflow extracts and tabulates all protein–ligand interactions and characterizes their stability as the percentage of the simulation time during which each interaction was observed. For the PD-L1 dataset, along with the protein–ligand interactions, pre-calculated strain from the docked pose [53] is also added to the SimFP output for further analysis. Although all frames of the triplicate MD production runs can be included in this analysis (the default setting in the workflow), for both the PD-L1 and CDK9 datasets the first 10 ns (100 frames) of MD with each ligand were not considered for fingerprint generation.

Fig. 2

MDFit workflow takes a library of ligands with reasonable starting poses in a protein pocket, runs MD, and generates collated SimFP for easy analysis of the stability of all ligands in the protein pocket across MD trajectories

While the automated part of the MDFit workflow stops with the generation of SimFPs, a predictive model can be readily trained to map SimFPs to experimental potency values. We emphasized the selection of simple, interpretable models that enable both the prediction of potency from SimFPs and the identification of important features. In this study, we investigated Linear, Ridge, Lasso, Random Forest, and Gradient Boosting Regression as implemented in scikit-learn [66] (see Supporting Information; Table S2, Figures S1–S5). Our workflow uses regression weights, impurity for tree-based models, and/or leave-one-feature-out cross-validation to estimate feature importance. Model prediction performance was investigated via nested leave-one-molecule-out cross-validation (LOMO-CV). SimFPs from triplicate runs were used as-is or averaged to arrive at an input feature matrix. The feature matrix was preprocessed by normalizing each feature to the unit hypercube. The target IC50 values were transformed to pIC50 values and standardized to zero mean and unit variance. For each model type, hyperparameters were selected by minimizing the mean squared error using grid search LOO-CV. Feature importance was computed in CV folds and a final model was fit to the full data set for comparison.
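The preprocessing steps described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names are hypothetical, the IC50 input is assumed to be in nM (the text does not state the unit), and the population standard deviation is assumed for standardization. In practice, scikit-learn's `MinMaxScaler` and `StandardScaler` perform the same two scaling operations.

```python
import math
from statistics import mean, pstdev


def average_replicates(replicate_fps):
    """Element-wise mean of the fingerprints from replicate runs."""
    return [mean(values) for values in zip(*replicate_fps)]


def minmax_columns(matrix):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    scaled_cols = []
    for col in zip(*matrix):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def ic50_nM_to_pic50(ic50_nM):
    """pIC50 = -log10(IC50 in mol/L); the nM input unit is an assumption."""
    return -math.log10(ic50_nM * 1e-9)


def standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy data: three ligands, two SimFP features (values are illustrative only).
features = minmax_columns([[0.2, 1.0], [0.6, 3.0], [1.0, 2.0]])
targets = standardize([ic50_nM_to_pic50(x) for x in (10.0, 100.0, 1000.0)])
```

Both scalers should be fit on the training fold only when used inside cross-validation, so that the held-out molecule does not influence the scaling.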

Results and discussion

Simulation fingerprints (SimFPs) are a collection of interactions between a ligand and the protein target observed through MD simulations. The reported values are the average interaction frequency across a simulation. For example, a SimFP of 0.5 translates into a protein–ligand interaction occurring in 50% of the MD simulation frames. A SimFP value can be greater than 1.0 in cases where a ligand interacts with a protein residue through multiple points of contact (e.g., a bidentate interaction).
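As a minimal sketch of this definition, a SimFP value can be computed as the mean number of contacts per frame for each interaction. The function name `simfp` and the interaction labels are hypothetical and chosen only for illustration; the actual values in MDFit come from Schrodinger's analysis scripts.

```python
def simfp(contact_counts_per_frame):
    """Mean number of contacts per frame for each interaction label.

    `contact_counts_per_frame` maps an interaction label to a list holding
    the number of contacts of that type observed in each trajectory frame.
    """
    return {
        label: sum(counts) / len(counts)
        for label, counts in contact_counts_per_frame.items()
    }


# Toy four-frame trajectory with placeholder labels.
counts = {
    "hbond:residue_A": [1, 0, 1, 0],   # present in 2 of 4 frames
    "hbond:residue_B": [2, 2, 1, 2],   # bidentate contact in 3 of 4 frames
}
fingerprint = simfp(counts)
print(fingerprint)
```

The second entry exceeds 1.0 because two points of contact are counted in most frames, which is the bidentate case described above.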

SimFPs can be used to rank-order or identify patterns across Matched Molecular Pairs (MMPs) for ligands with experimental readouts. Observed trends can be used to prioritize design ideas where the user gives preference to those that retain or enhance desired interactions. For larger data sets, SimFPs can be used as features to train ML models that can in turn be used to predict experimental readouts and assign feature importance. In addition, the critical SimFPs highlighted by the ML model can be used to further explain differences in observed readouts, such as potency. In this section, we discuss the utility of SimFPs in detail, focusing first on feature importance followed by handling edge cases.

SimFP feature identification

Machine learning methods can be used to identify specific peptide-protein interactions that contribute to the prediction of the desired endpoint from the full SimFP data set. For PD-L1, a Lasso regression model was built to predict the HTRF pIC50 values using SimFPs and strain energy [53] as features. While the model performance was modest (Fig. 3, right; LOMO-CV Q2 = 0.36 and RMSE = 0.78), using the SimFPs as features provides interpretability lost in more complex modern ML models.
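For readers unfamiliar with the two reported metrics, the sketch below shows one conventional way to compute them from held-out predictions. It assumes the common definition Q2 = 1 - PRESS / SS_tot with the total sum of squares taken about the mean of all observed values; the source does not state which variant was used, and the numbers are illustrative rather than taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def q2(y_true, y_cv_pred):
    """Cross-validated R2: 1 - PRESS / SS_tot, using held-out predictions."""
    y_bar = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_cv_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - press / ss_tot


# Each prediction is assumed to come from a model that never saw that molecule.
observed = [6.0, 7.0, 8.0]
held_out_predictions = [6.5, 7.0, 7.5]
print(q2(observed, held_out_predictions), rmse(observed, held_out_predictions))
```

A Q2 near zero means the model predicts held-out molecules no better than the mean potency, so it is a stricter measure than the R2 of a fit to the full data set.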

Fig. 3

Left: Top features for the PD-L1 full peptide SimFP data set. Green: Positive contribution, i.e., improving this interaction or maximizing this feature improves pIC50. Red: Negative contribution, i.e., reducing this interaction or minimizing this feature improves pIC50. Right: Lasso leave-one-molecule-out cross-validation (LOMO-CV) RMSE = 0.78 and Q2 = 0.36. The parity plot shows ½ and 1 log error bands. Normalized strain energy is the top feature with a negative contribution. In other words, reducing strain helps with improving potency. Water-mediated interaction with Asn63 is identified to have the most detrimental contribution while water-mediated interaction with Val76 has the most positive contribution to the HTRF potency of these cyclic peptides to PD-L1

The top ten features (weights with the largest absolute value) of the PD-L1 data set are reported in Fig. 3, left. We note that along with interaction stability fingerprints that come from MDFit, pre-computed strain energies [53] were included as an additional feature of SimFP. Strain energy remains the standout feature, consistent with previous studies [53], while a water-mediated interaction with Asn63 was the most detrimental (negative weight) and a water-mediated interaction with Val76 was the most beneficial (positive weight) SimFP to potency. Based on the feature importance, peptide optimization should focus heavily on minimizing peptide strain, followed by minimizing water-mediated interactions with Asn63 and maximizing water-mediated interactions with Val76. Selected matched molecular pair (MMP) cases are described below, using the selected features to explain SAR.

MMPs with strain energy differences

Mutating position 2 from NMe-Ala in Pep-01 to NMe-Val in Pep-41 results in a significant drop in potency (pIC50 = 8.1 vs 6.0, respectively). Minor variations were observed for the top SimFP features, but a major increase in strain energy for Pep-41 explains the loss in potency (Fig. 4). While seemingly minor, the addition of a bulky sidechain distorted Pep-41’s backbone conformation, increasing the strain by nearly 0.02 kcal/mol/heavy atom which is a remarkably high cost for two additional heavy atoms. In a prospective peptide design exercise around this MMP, modifications would focus on reducing the strain energy of Pep-41 while retaining the SimFPs observed with MDFit.

Fig. 4

Cluster representatives for matched molecular pairs Pep-01 and Pep-41 in the PD-L1 data set. The backbone conformation of Pep-41 is distorted compared to Pep-01, resulting in a much higher strain energy

MMPs with hydrophobic interaction differences

Pep-01 and Pep-66 differ only at position 11, where Pep-01 has NMe-Nle and Pep-66 has NMe-Ser. This mutation results in a significant loss in HTRF potency (pIC50 = 8.1 vs 6.8, respectively). Truncating the sidechain of Pep-01 results in a favorable reduction in strain energy but sacrifices a hydrophobic interaction with Tyr123 (Fig. 5). The attractive forces between Tyr123 and NMe-Nle fully liberate water in the binding interface, more fully optimizing the protein-peptide compatibility [67]. Smaller polar sidechains will not fully desolvate the binding site, compromising the binding affinity. This case exemplifies the importance of integrating SimFPs and metrics from rigid methods such as docking. Relying solely on strain energy for ligand optimization or prioritization would incorrectly rank Pep-66 higher than Pep-01. Without the high-throughput analysis of MD provided by MDFit, project teams could be misled, and optimization strategies may lead to undesired outcomes. For peptide optimization in this MMP, designs would aim to recover the hydrophobic interaction in the Pep-01 MDFit SimFP while maintaining the lower strain energy observed for Pep-66.

Fig. 5

Cluster representatives for matched molecular pairs Pep-01 and Pep-66 in the PD-L1 data set. Pep-66 loses a hydrophobic attractive interaction with Tyr123 relative to Pep-01

Kullback–Leibler divergence for matched-pairs

In some cases, differences in MD stabilities across the top features highlighted by the ML model do not fully explain the difference in potencies. Pep-52 features a beneficial water-mediated interaction with Val76 which is not observed for Pep-01 (importance = +0.64) as well as an amplified detrimental water-mediated interaction with Gln66 (importance = –0.38). All other SimFP features were remarkably similar between the two peptides. Based on only these features, one would expect Pep-52 to have equal or slightly better HTRF potency compared to Pep-01. However, Pep-52 was about fivefold less potent than Pep-01.

The Kullback–Leibler divergence (KL divergence, relative entropy [68]) between SimFPs offers an alternate quantification strategy that characterizes differences across all the features in the SimFPs into a single dimensional quantity. SimFP of Pep-01 is treated as the reference and KL divergence for all the other peptides in the series was calculated relative to Pep-01. KL divergence identified Pep-52 to have the most divergent SimFP compared to Pep-01 (29.9; Fig. 6A) prompting further investigation.
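A minimal sketch of this comparison is given below. The source does not specify how SimFPs are converted into probability distributions, so the normalization (dividing by the sum after adding a small pseudocount `eps`) and the direction of the divergence (ligand relative to reference) are assumptions, as are all names and values. Absolute divergences depend on these choices and are not expected to reproduce the reported numbers.

```python
import math


def to_distribution(fp, eps=1e-6):
    """Turn a fingerprint into a probability vector (assumed normalization)."""
    total = sum(v + eps for v in fp)
    return [(v + eps) / total for v in fp]


def kl_divergence(fp, ref_fp, eps=1e-6):
    """D_KL(P || Q), with P from the ligand SimFP and Q from the reference SimFP."""
    p = to_distribution(fp, eps)
    q = to_distribution(ref_fp, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def rank_by_divergence(fps, ref_label):
    """Sort all non-reference ligands from most to least divergent."""
    ref = fps[ref_label]
    scores = {k: kl_divergence(v, ref) for k, v in fps.items() if k != ref_label}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative three-feature fingerprints with placeholder ligand labels.
fps = {
    "reference": [0.9, 0.1, 0.5],
    "ligand_b": [0.2, 0.8, 0.5],
    "ligand_c": [0.8, 0.2, 0.5],
}
ranking = rank_by_divergence(fps, "reference")
print(ranking)
```

The pseudocount keeps the logarithm finite when an interaction is never observed for one of the two ligands, a situation that is common in SimFPs.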

Fig. 6

A The top 10 most divergent SimFPs by KL divergence relative to Pep-01. B Differences in water-mediated interactions with Gln66 across three runs of MD. Pep-01 features less water-mediated interaction with Gln66 than Pep-52, indicating tighter binding, whereas water has seeped into the pocket for Pep-52

The absolute difference between the raw SimFPs, $|\mathrm{SimFP}_{\text{Pep-52}} - \mathrm{SimFP}_{\text{Pep-01}}|$, identified the water-mediated interaction with Gln66 as the single most divergent SimFP feature across all three repetitions of Pep-01 and Pep-52. The detrimental water-mediated interaction between Pep-52 and Gln66 was observed in 80%, 61%, and 55% of frames in the individual repetitions (Fig. 6B; trajectory 3, 2, 1, respectively). In contrast, Pep-01 featured this interaction in only 42% of frames in trajectory 2 and never (0%) in trajectories 1 and 3.
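The per-feature comparison can be sketched as follows. The Gln66 values are the per-trajectory frequencies quoted above, averaged over the three repetitions; the other two features and all names are placeholders added only so that the example has something to compare against.

```python
from statistics import mean


def most_divergent_feature(fp_a, fp_b):
    """Return the feature with the largest absolute SimFP difference and its size."""
    labels = set(fp_a) | set(fp_b)
    diffs = {k: abs(fp_a.get(k, 0.0) - fp_b.get(k, 0.0)) for k in labels}
    top = max(diffs, key=diffs.get)
    return top, diffs[top]


pep01 = {
    "water:Gln66": mean([0.00, 0.42, 0.00]),   # trajectories 1, 2, 3 (from the text)
    "water:placeholder_1": 0.10,               # illustrative value
    "hbond:placeholder_2": 0.90,               # illustrative value
}
pep52 = {
    "water:Gln66": mean([0.55, 0.61, 0.80]),   # trajectories 1, 2, 3 (from the text)
    "water:placeholder_1": 0.35,               # illustrative value
    "hbond:placeholder_2": 0.88,               # illustrative value
}
label, delta = most_divergent_feature(pep52, pep01)
print(label, round(delta, 3))
```

Features missing from one fingerprint are treated as never observed (0.0), which is an assumption of this sketch.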

Visualizing the representative clusters for Pep-01 and Pep-52 revealed the backbone carbonyl of Pro4 in Pep-52 forms a water-mediated hydrogen bond with the backbone carbonyl of Gln66 (Fig. 7). For Pep-01, the same backbone carbonyl of Pro4 hydrogen bonds directly with the sidechain of Gln66. Water infiltration characterizes protein-peptide incompatibility for Pep-52, explaining the drop in HTRF potency relative to Pep-01 (pIC50 = 7.6 vs 8.1, respectively). While incompatibility may be observed tangentially in computational methods that treat proteins as rigid bodies, direct observation of water infiltration at a specific residue from dynamic models focuses the project team on an area of the ligand for further optimization. In this case, a deep dive into ML feature importance, KL divergence, and raw SimFPs helped differentiate the peptide’s behavior in the binding pocket and explain the difference in potency.

Fig. 7

Cluster representatives for matched molecular pairs Pep-01 and Pep-52 in the PD-L1 data set. Pro4 engages Gln66 through a direct hydrogen bond (Pep-01) or a water-mediated hydrogen bond (Pep-52). The water infiltration in the Pep-52 simulations provides a possible explanation for the difference in potency relative to Pep-01

Small molecule CDK9 inhibitors

Docked-pose Glide scores correlate only weakly (R² = 0.1) with enzyme pIC50 (SI Figure S6), whereas ML with SimFPs was useful in delineating interesting SAR trends, particularly those involving water-mediated interactions that were otherwise missed in docking. The top ten features (weights with the largest absolute value) of the CDK9 data set are reported in Fig. 8, left. A hydrophobic interaction with Phe103 (Fig. 1C, Fig. 8) was the most detrimental (negative weight) and a hydrophobic interaction with Ile25 was the most beneficial (positive weight) SimFP to potency. Selected matched molecular pair (MMP) cases are described below, using the selected features to explain SAR.

Fig. 8

Left: Top features for the full CDK9 SimFP data set. Green: Positive contribution, i.e., improving this interaction or maximizing this feature improves pIC50. Red: Negative contribution, i.e., reducing this interaction or minimizing this feature improves pIC50. Right: Lasso leave-one-molecule-out cross-validation (LOMO-CV) RMSE = 0.96 and Q2 = 0.27. Hydrophobic contact with Phe103 is the top feature with a negative contribution. In other words, reducing the interaction prevalence helps with improving potency. Hydrophobic interaction with Ile25 is identified to have the most positive contribution to the HTRF potency of the CDK9 inhibitors

MMPs with hydrophobic interaction differences

Compound 24 and Compound 22 differ only around the pyridinone core, where Compound 24 is an isopropylpyridine ring and Compound 22 has a methoxy-methylpyridinone. This core modification results in a significant loss in potency (pIC50 = 8.5 vs 4.5, respectively). The unsubstituted pyridinone of Compound 24 results in a favorable reduction in hydrophobic contact with Phe103 in favor of a beneficial pi-pi stacking interaction (Fig. 9). For small molecule optimization in this MMP, designs would aim to remove the hydrophobic interaction in Compound 22.

Fig. 9

Cluster representatives for matched molecular pairs Compound 24 and Compound 22 in the CDK9 data set. Pyridinone engages Phe103 through a direct pi-pi stacking interaction (Compound 24) or hydrophobic contact (Compound 22). The replacement of the pi–pi interaction with a hydrophobic interaction during the simulations provides a possible explanation for the difference in potency relative to Compound 24

MMPs with water infiltration

Compound 24 and Compound 30 differ only around the pyridinone core R-group, where Compound 24 is an isopropyl and Compound 30 has a tetrahydropyran. This solvent-exposed modification results in a significant loss in potency (pIC50 = 8.5 vs 5.4, respectively). The smaller R-group of Compound 24 results in better protein–ligand compatibility than the bulkier R-group of Compound 30. The pyridinone core of Compound 30 distorts and drifts in the binding site, allowing water infiltration, observed as a disfavored water-mediated interaction with Asp167 (Fig. 10). For small molecule optimization in this MMP, designs would aim to probe the R-group size tolerability and the effects on protein–ligand compatibility.

Fig. 10

Cluster representatives for matched molecular pairs Compound 24 and Compound 30 in the CDK9 data set. The pyridinone core engages Asp167 through a water-mediated interaction (Compound 30) or does not engage Asp167 (Compound 24). The water infiltration in the binding site during the simulations provides a possible explanation for the difference in potency relative to Compound 24

Simulation length

To enable efficient rank-ordering of peptide designs using SimFPs prospectively, it is important to also assess simulation convergence. Root Mean Square Deviation (RMSD) of the ligand conformations relative to the protein pocket is an often discussed metric to estimate convergence. However, as shown in Fig. 11, RMSD plots are not always useful in estimating how long a simulation needs to be for full convergence. Instead, the divergence of SimFPs from different time intervals relative to the full simulation trajectory (100 ns) can be used to estimate simulation convergence (Fig. 11). For the reference Pep-01 in the PD-L1 dataset and Compound 38 in the CDK9 dataset, SimFPs converge at 70 ns for all three MD trajectory repetitions. Therefore, 100 ns MD trajectories can be assumed to fully characterize relevant protein–ligand dynamics, and ML models can rank-order designs using the SimFPs.
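One way to set up such a convergence check is sketched below. It is an illustration under stated assumptions rather than the procedure used in the study: SimFPs are recomputed on cumulative prefixes of a trajectory, the KL normalization uses a small pseudocount, and the tolerance `tol` used to declare convergence is hypothetical because the source does not give a threshold.

```python
import math


def simfp(frames):
    """Per-feature mean over frames; each frame is a list of contact counts."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def kl_divergence(fp, ref_fp, eps=1e-6):
    """D_KL between pseudocount-normalized fingerprints (assumed normalization)."""
    p_total = sum(v + eps for v in fp)
    q_total = sum(v + eps for v in ref_fp)
    return sum(
        ((a + eps) / p_total) * math.log(((a + eps) / p_total) / ((b + eps) / q_total))
        for a, b in zip(fp, ref_fp)
    )


def convergence_profile(frames, step):
    """KL divergence of each cumulative-prefix SimFP relative to the full trajectory."""
    full = simfp(frames)
    return [
        (end, kl_divergence(simfp(frames[:end]), full))
        for end in range(step, len(frames) + 1, step)
    ]


def first_converged(profile, tol):
    """First prefix length after which the divergence stays below `tol`."""
    for i, (end, _) in enumerate(profile):
        if all(d < tol for _, d in profile[i:]):
            return end
    return None


# Toy eight-frame trajectory with two interaction features.
frames = [[1, 0], [1, 0], [1, 1], [1, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
profile = convergence_profile(frames, step=2)
print(first_converged(profile, tol=0.05))
```

With one frame per 100 ps, a prefix length in frames converts to simulation time by dividing by 10, so a prefix of 700 frames corresponds to 70 ns.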

Fig. 11

A, B: Heavy-atom RMSD of Pep-01 in the PD-L1 data set and Compound 38 in the CDK9 data set, respectively, relative to the protein pocket throughout the three 100 ns MD repetitions. C, D: KL divergence of Pep-01 and Compound 38 SimFPs relative to the full trajectory (100 ns) shows that simulations converge at 70 ns

Conclusions

We have presented a new high-throughput workflow for setting up, running, and analyzing molecular dynamics simulations for a library of ligands. MDFit produces compiled simulation fingerprints (SimFPs) that let users decipher critical protein–ligand interactions and rank-order ligands based on compatibility. Application of the MDFit workflow to a data set of 61 peptides bound to PD-L1 and 39 small-molecule inhibitors bound to CDK9 resulted in the discovery of several SimFPs critical for HTRF potency and binding. Matched molecular pairs were explored to highlight the utility of SimFPs when combined with ML techniques. KL divergence offers an attractive alternative for explaining potency differences that are not evident from the top ML features.

The stability of pocket interactions from MD simulations characterizes the enthalpy of binding into the protein pocket. Conformational entropy is included via the pre-calculated strain of the docked pose [53] in the SimFP. Through sufficient sampling of each ligand in the binding pocket, ML models trained on these SimFPs account for all important thermodynamic events and therefore predict binding affinity with reasonable accuracy. Unlike relative free energy perturbation [69] approaches that have limitations based on ligand size [54] and chemical similarity [70], SimFP-based ML models for potency assessment are less likely to have either of these constraints. Future releases will support other MD engines (OpenMM [71], GROMACS [72]) and force fields (OpenFF [73]), add more information to SimFPs [73], and provide additional analysis via machine learning approaches. While the current version uses Schrodinger’s native simulation interaction analysis APIs for Desmond trajectories, we will integrate ProLIF [48] into our workflow to generate SimFPs from OpenMM/GROMACS trajectories. The MDFit workflow is expected to be useful for characterizing pocket dynamics of multiple modalities, including small molecules, peptides, PROTACs, and molecular glues, to drive drug discovery projects moving forward.