Introduction

Modern drug discovery relies on the identification of small molecules capable of interacting with a biological target of interest (receptor or enzyme) to achieve a therapeutic action. In a typical ‘target-based’ approach, finding a ‘hit’ molecule for the desired target is one of the major bottlenecks in drug discovery. One of the widely used strategies is to conduct high throughput screening (HTS) [1] of a large or ultra-large molecule collection at a miniature scale using robotic tools. The obtained hits are then further validated and structurally optimized to obtain lead molecules with potent biological activity, optimum pharmacokinetics, and low toxicity potential. However, HTS requires enormous resources in terms of sophisticated equipment, time, and skilled manpower, adding to the overall cost of drug development [2]. One alternative is to screen large libraries of molecules using in silico or molecular modelling techniques rather than in wet labs. In silico or virtual screening (VS) relies on computational models to represent biomolecular targets and small molecules [3, 4]. In the widely employed structure-based VS (SBVS) approach, interactions between the protein and ligands are modelled to predict the binding affinity and pose [5,6,7,8,9]. The inherent advantage of VS is that only the molecules that appear promising in these models need to be experimentally validated. Thus, several small molecule libraries have emerged over the years for application in VS and drug discovery [10, 11]. The most notable among these are ZINC [12], ChEMBL [13, 14], DrugBank [15], and PubChem [16], which are publicly accessible to download and use. However, these databases share a common set of molecules and hence cover overlapping chemical space [17]. With the availability of various open-source computational tools, enumerating ultra-large virtual libraries has become increasingly effortless, as has their application in drug design [7, 9, 18].
However, one of the biggest challenges after a hit is obtained through VS is the availability and synthetic tractability of the molecules for experimental validation. Most commonly, the predicted hit compounds are purchased from commercial vendors who provide a variety of small molecules, including target-focused and on-demand compounds. However, the price is often exorbitantly high, typically in the range of 20–200 USD per milligram for in-stock compounds. The cost is even higher for tailor-made molecules synthesized on demand. Thus, dependence on the commercial supply of molecules is not always economically viable, especially in small academic labs. Additionally, even a small amount of impurity in the sourced samples, whether due to inadequate quality control or generated during long shelf storage, might result in false positives. Therefore, resynthesis of the hits to obtain high-purity samples is important for any VS approach. However, the chemical synthesis of such commercial compounds may not be reported and might involve several steps with challenging chemistry.

One solution is to curate virtual libraries of molecules obtainable from highly reliable reactions. One notable example is the commercially available REadily AccessibLe (REAL) database provided by Enamine, reported to have a synthesis success rate of ~80% [19]. Recently, the LibINVENT tool reported by Patronov and coworkers [20] has also been made available to generate in silico libraries based on different reactions. However, the synthesis of molecules from these libraries may still involve multiple steps and unforeseen practical problems such as the unavailability of starting materials and low yields.

Multi-component reactions (MCRs) produce complex molecules with high pot economy [21] and hence are potential candidates for enumerating virtual libraries [22,23,24]. The Ugi reaction (UR) is one of the oldest and most widely studied MCRs; it typically yields a peptidomimetic scaffold in a single-pot reaction between an aldehyde, a carboxylic acid, an amine, and an isonitrile [25, 26]. UR also possesses high atom economy since only one water molecule is produced as a by-product. In the last two decades, the application of UR in the synthesis of novel scaffolds has increased, as indicated by the number of papers appearing in PubMed (Fig. 1). A typical strategy involves the formation of UR adducts from commercially available building blocks followed by a series of post-Ugi modifications that may include intramolecular heteroatom alkylation/acylation, condensation, and rearrangement reactions [27,28,29,30,31,32]. In several cases, UR and post-Ugi modifications can be done in one pot without the need to isolate the UR product, giving facile access to diverse chemotypes for wider application [27, 33,34,35,36,37,38].

Fig. 1 PubMed search results for the term ‘Ugi reaction’ in title/abstract, showing the increase in UR literature in recent years

Additionally, with the commercial availability of a variety of starting materials for UR, it is convenient to generate analogues for structure–activity relationship (SAR) studies in a parallel fashion [39]. Despite these advantages, many interesting chemotypes recently reported from UR and post-Ugi modifications remain unexplored in medicinal chemistry (vide infra). A plausible reason is that such reports mostly appear in the organic chemistry literature, which primarily focuses on optimising synthetic methodology and characterization. To tap this unexplored chemical space, we planned to curate a UR-derived library (URDL) of small molecules reported in the literature, intending to evaluate its chemical space [40] and make it accessible for VS. The property and chemical space of URDL are also compared with those of FDA-approved oral drugs as a standard.

Results and discussion

Library curation and description

The PubMed search engine was used to collate literature discussing the synthesis of novel scaffolds/rings using either UR or post-Ugi modifications. Many of these molecules are reported to be synthesized conveniently in 1–4 synthetic steps, which in several cases may be performed in 1–2 pots without the need to isolate or purify the intermediates. In such cases, to represent only ‘real’ molecules, we included the products of all steps except those that were not isolated or not characterized. In addition, to signify the ease of synthesis, we annotated each compound with the number of pots required for its synthesis rather than the number of synthetic steps.

For instance, representative compound 1 (Fig. 2A) in URDL is synthesized in two steps: UR followed by a post-Ugi Povarov-type reaction [41]. However, both steps can be carried out in a single pot without isolating the Ugi products. Thus, only the final molecules sharing substructure 1 are included in the library and annotated as a one-pot synthesis. The Ugi-product intermediates, which were neither isolated nor characterized, are not included in the library even though they were necessarily formed during the one-pot process. Likewise, Fig. 2B represents a cascade reaction where the synthesized Ugi products are converted to molecule 2 in a single pot without isolation [42]. Thus, only the molecules belonging to structure 2 were included in the library.

Fig. 2 Representative examples of products of UR and post-Ugi modification

In certain reports, building blocks are synthesized and isolated to obtain the desired Ugi product. In such cases, the steps involved in the synthesis of a building block are added to the total pot count, but the resulting building block is not part of URDL. For example, isocyanide 3 is synthesized in 3 pots to obtain pyrrolidone derivatives 4 (Fig. 3A) [43]. Similarly, the structure of aldehyde 5, required for the synthesis of chromenepyrrole scaffold 7 (Fig. 3B) [44], was not included in the library, although the steps in its synthesis were counted towards the synthesis of molecules 6 and 7, which are part of URDL. In rare cases where the source of the building blocks is unclear, the latter are assumed to be commercially available, and pots are counted accordingly.

Fig. 3 Examples of URDL molecules where synthesis of the UR building blocks requires additional steps

Many literature reports describe structurally close analogues with minor variation in ring substituents. In such situations, to reduce manual work and maintain diversity, we omitted certain close analogues with minor structural variations (e.g. methyl vs ethyl). We believe such omissions are unlikely to affect the VS results significantly, and a medicinal chemist would be able to design and access such missing analogues during SAR exploration. Since the intended application of URDL is in drug discovery, we retained only small molecules (MW ≤ 900 Da) lacking general reactivity. This curation process resulted in the Ugi reaction-derived library (URDL), consisting of 5773 molecules obtained from 274 references. About 92% of the molecules in URDL can be synthesized in either one or two pots (Fig. 4), signifying the synthetic tractability of the library. Additionally, 85% of these molecules were reported in the last decade and hence represent recent developments in this area.

Fig. 4 Percentage of molecules vs the number of pots required for their synthesis. The bars are coloured according to the year in which the molecule was reported, showing that most of the reports are recent

Despite its small size, URDL has several advantages over commercially available libraries for VS application. For example, the URDL molecules are cherry-picked from high-impact organic chemistry literature with robust structural validation data for the synthesized molecules from spectroscopy and X-ray crystallography. The URDL molecules are synthetically tractable with high atom and pot economy and are essentially made from inexpensive, commercially available building blocks. The conditions reported for synthesising these molecules are often mild, catalytic, and facile enough to be carried out by novice chemists. In many cases, Ugi adducts are known to precipitate from the reaction mixture, thus reducing the workup and purification steps. Moreover, the biological activity of most URDL molecules is not reported despite the presence of novel structural features in these molecules (vide infra). Thus, URDL has additional value in terms of unexplored chemical space for drug discovery. The number of pots required for the synthesis of each URDL member may also serve as an important criterion when shortlisting compounds for synthesis and experimental validation.

Physicochemical profiling

Physicochemical properties of molecules such as size, shape, polarity, and lipophilicity play an important role in drug development. Indeed, molecular descriptors such as molecular weight (MW), partition coefficient (clogP), number of H-bond donors/acceptors (HBD/HBA), topological polar surface area (TPSA), number of rotatable bonds (RB), and fraction of sp3 carbons (Fsp3) are found to correlate with solubility, bioavailability, cell permeability, clinical success, and toxicity [45,46,47,48,49,50,51,52]. Thus, property-based criteria such as Lipinski’s ‘rule-of-five’ [49] and Veber’s rule [50] are widely used to estimate the ‘druglikeness’ of a molecule, albeit with known limitations [53, 54]. Certain molecular properties are also desirable for targeting a particular receptor or organ [55,56,57,58]. For instance, drugs crossing the blood–brain barrier (BBB) are primarily restricted to a property space occupied by small, uncharged, and lipophilic molecules [59, 60]. The presence of primary amines and molecular globularity are reported to play a significant role in facilitating the entry of molecules into Gram-negative bacteria [61]. We have recently shown that the sum of basic and aromatic nitrogens (SBAN) is a key descriptor, among other properties, required for potent antimalarial activity [62]. Thus, understanding the physicochemical property space of a molecular library may assist in identifying potential targets/diseases for its application.

For physicochemical characterization of URDL, we calculated key properties of the molecules and compared them with the property space of oral drugs as a reference. A library of orally used small-molecule drugs (MW < 900 Da) that we recently compiled [62] was updated with the oral drugs approved in 2021. Thus, the oral drug library consists of 1998 FDA-approved drug molecules with proven oral bioavailability. Since UR usually yields dipeptide-like molecules that may serve as inhibitors of protein–protein interactions (iPPI), we also compared URDL with the publicly available iPPI library [63] consisting of 3853 molecules. The t-test was used to determine the statistical significance of differences among the three categories.
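The text does not state which variant of the t-test was applied, so the following is only a minimal sketch of how a pairwise comparison of one property between two libraries could be set up. It assumes Welch's unequal-variance form, which is a common choice when group sizes differ; the function name `welch_t` is illustrative and not taken from the original workflow.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Assumption: Welch's unequal-variance t-test; the source only says
    that 'the t-test' was used.
    """
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


if __name__ == "__main__":
    # Toy values for illustration only, not data from the libraries
    group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(welch_t(group_a, group_b))
```

Converting the statistic into a p-value requires the cumulative distribution of Student's t, which is not part of the Python standard library; with SciPy available, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs the same test and reports the p-value directly.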

First, we compared the druglikeness of the three libraries using the widely used Lipinski’s and Veber’s rules. Not surprisingly, 91.4% of oral drugs met the criteria of Lipinski’s rule, while the percentage was lower (85.5%) for the URDL molecules (Table 1). Only 74.3% of the molecules in the iPPI library passed the druglikeness criteria based on Lipinski’s rule, which is expected since these molecules usually interact with a larger protein surface and thus tend to have higher MW. Indeed, all 835 iPPI molecules non-compliant with Lipinski’s rule possess MW equal to or greater than 500 Da. In contrast, Veber’s criteria, which propose cut-offs of TPSA ≤ 140 Å² and RB ≤ 10, predicted URDL to have the highest percentage (85.4%) of drug-like molecules, followed by oral drugs and iPPI (Table 1). Based on these two rules, it can be concluded that URDL molecules are closer to oral drugs in terms of druglikeness and are expected to have optimal oral bioavailability.
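The two rule-based filters can be illustrated with a short sketch operating on precomputed descriptors. The Veber cut-offs are those quoted above; the Lipinski thresholds (MW ≤ 500, clogP ≤ 5, HBD ≤ 5, HBA ≤ 10) are the commonly cited ones, and tolerating one violation is an assumption, since the exact settings used for Table 1 are not given in the text. The `Descriptors` container and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mw: float      # molecular weight, Da
    clogp: float   # calculated logP
    hbd: int       # H-bond donors
    hba: int       # H-bond acceptors
    tpsa: float    # topological polar surface area, square angstroms
    rb: int        # rotatable bonds


def lipinski_violations(d: Descriptors) -> int:
    """Count violations of the four commonly cited rule-of-five thresholds."""
    return sum([d.mw > 500, d.clogp > 5, d.hbd > 5, d.hba > 10])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    # Assumption: at most one violation is tolerated (a common convention).
    return lipinski_violations(d) <= max_violations


def passes_veber(d: Descriptors) -> bool:
    # Cut-offs quoted in the text: TPSA <= 140 and RB <= 10.
    return d.tpsa <= 140 and d.rb <= 10


def percent_passing(library, rule) -> float:
    """Percentage of molecules in a non-empty library that satisfy a rule."""
    return 100.0 * sum(rule(d) for d in library) / len(library)


if __name__ == "__main__":
    # Toy descriptor values for illustration only
    toy = [
        Descriptors(300.0, 2.0, 1, 4, 60.0, 3),
        Descriptors(650.0, 6.2, 2, 8, 150.0, 12),
    ]
    print(percent_passing(toy, passes_lipinski), percent_passing(toy, passes_veber))
```

Keeping the violation count separate from the pass/fail decision makes it easy to test stricter or looser settings against the same descriptor table.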

Table 1 The percentage of molecules meeting the criteria of druglikeness and quality based on different cheminformatics rules

One of the criteria for judging the quality of screening compounds is to look for reactive functional groups and structural motifs that may interfere with biochemical assay readouts and thus appear as ‘frequent hitters’. Such molecules were termed pan-assay interference compounds (PAINS) by Baell and Holloway [64], who proposed excluding them using a set of substructure filters. Similarly, Bruns and Watson from Eli Lilly also proposed a set of 275 rules to identify promiscuous compounds [65]. Thus, to obtain ‘clean’ molecules for screening, PAINS structural alerts and other rules are often used. However, these rules and alerts are not without limitations and should not be applied rigidly without context [66,67,68,69]. For example, the PAINS filter should not be used in phenotypic screening or when looking for covalent inhibitors [66]. Nonetheless, there is a broad consensus that molecules possessing PAINS and ‘nasty’ functions should be flagged early and carefully validated before being advanced in the drug discovery pipeline.

We used a recently reported open-source Konstanz Information Miner (KNIME) workflow [70, 71] to identify molecules with PAINS features in our libraries. Only 5% of URDL molecules have PAINS features, which is lower than the proportion of FDA-approved oral drugs (6.9%) that failed the test (Table 1). The iPPI library displayed a 19% failure rate, warranting cautious use and interpretation of PAINS alerts when applied to the discovery of iPPIs. We also employed the open-source Datawarrior program [72] to identify ‘nasty’ or reactive functional groups defined by the medicinal chemists at Actelion [73]. The failure rate in terms of such problematic groups is comparable (~12–14%) in all three libraries. While molecules belonging to oral drugs and iPPI possess a variety of reactive functions, URDL molecules contain only a limited number of types of such moieties (Supplementary Information Figure S1). An aromatic nitro group seems to be the most frequently occurring problematic function in all three libraries, especially in URDL, where ignoring this ‘nasty’ function reduces the failure rate by almost half (Table 1). The high occurrence of the aromatic nitro group in URDL can be explained by the fact that most of these molecules are taken from organic chemistry literature discussing newly optimized reaction conditions. In such studies, authors are expected to demonstrate broad substrate scope and hence frequently use building blocks bearing electron-donating and electron-withdrawing (such as nitro) functional groups. Thus, URDL molecules are drug-like and ‘clean’ when oral drugs are taken as the standard.

Next, we calculated and compared the average property space of the molecules in the three libraries using Datawarrior [72], an open-source cheminformatics program. Statistical details such as the mean, median, p-values, quartiles, and standard deviation for the three libraries are provided in the Supplementary Information (Table S1). Interestingly, the mean/median values of most of the studied properties of URDL molecules are closer to those of iPPIs than to those of oral drugs (Fig. 5). Among the three categories, the significantly higher values of MW, clogP, HBA/HBD, TPSA, RB, and aromatic rings for iPPIs are in line with earlier reports [74, 75]. This observation is consistent with iPPIs interacting over a large protein interface rather than in smaller, well-defined pockets. On average, URDL molecules possess higher values for MW (Fig. 5A), clogP (Fig. 5B), HBA (Fig. 5C), and RB (Fig. 5F) than oral drugs. In contrast, URDL molecules have significantly lower average values for HBD (Fig. 5D), TPSA (Fig. 5E), and Fsp3 (Fig. 5G) in comparison to oral drugs. The molecules belonging to URDL and iPPI are also structurally more complex (Fig. 5H), as evaluated by the Datawarrior program. This complexity may result from the larger number of rings (7-membered or smaller) present in these molecules (Fig. 5I). However, it must be noted that molecular complexity may be calculated in several ways [76, 77]. For instance, oral drugs possess more chiral centres than URDL and iPPI molecules (Fig. 5J), which is another proposed measure of complexity together with Fsp3 [76]. The URDL and iPPI molecules have a higher number of carboaromatic rings (ArC) (Fig. 5K), suggesting these compounds are more disc-like than oral drugs. Similarly, orally used drugs have a lower number of heteroaromatic rings in their structure than the other two libraries (Fig. 5L).

Fig. 5 Box plots displaying average properties of oral drugs (cyan), URDL (pink) and iPPI (green) molecules; A molecular weight, B calculated logP, C hydrogen bond acceptors, D hydrogen bond donors, E topological polar surface area, F rotatable bonds, G fraction of sp3 carbons, H molecular complexity, I small rings, J stereocenters, K carbo-aromatic rings, L hetero-aromatic rings. The red and black lines within the boxplots represent the mean and median, respectively

Overall, this analysis indicates that URDL and oral drug libraries are comparable in terms of druglikeness as well as compound quality. However, in terms of physicochemical properties, the URDL molecules are closer to iPPIs.

For comparison and visualization of different sets of molecules, medicinal chemists often rely on different dimension reduction approaches [78]. The multi-dimensional information coded in structural descriptors or physicochemical properties can be reduced and projected in two (2D) or three dimensions (3D) for comprehensibility [79,80,81]. Thus, the molecules bearing structure or property-based similarity are expected to be placed nearer to each other in these projections.

To compare the chemical space of URDL with oral drugs and iPPI molecules, we calculated the SkelSphere descriptors implemented in the Datawarrior program. This descriptor encodes circular spheres of atoms and bonds, together with stereochemistry and other structural details, into a hashed binary fingerprint of 1024 bits. The 1024 bits of structural information were then reduced using t-distributed stochastic neighbour embedding (t-SNE) [82, 83], a non-linear dimensionality reduction technique. Consequently, a 3D projection of chemical space was obtained in which similar molecules lie closer to each other than dissimilar ones. The oral drugs and URDL molecules seem to form separate clusters with limited overlap, suggesting that the structural features present in URDL molecules are distinct from those of the FDA-approved drugs (Fig. 6). On the other hand, iPPI molecules form several smaller clusters, a few overlapping with either oral drugs or URDL, reflecting the broader diversity of these molecules.

Fig. 6 t-SNE 3D plot for oral drugs, URDL and iPPI library molecules shown as cyan, pink and green spheres, respectively. The multi-dimensional information stored in the SkelSphere descriptor of the molecules was reduced using the t-SNE algorithm of Datawarrior to display distinct clusters of similar molecules

Overall, physicochemical profiling suggests that URDL molecules conform to a distinct chemical space in comparison to the existing oral drugs.

Scaffold and ring analysis

A recent analysis has revealed that FDA-approved drugs lack diverse rings [84]. On the other hand, 1 million ring systems (size 1–4, < 30 atoms) are possible theoretically, 98.6% of which do not exist in big databases like ZINC or ChEMBL [85]. The distinct chemical space of URDL compared to oral drugs (Fig. 6) indicates the presence of unique scaffolds and ring systems in URDL molecules, which was also observed during the library curation. Indeed, several novel heterocycles can be generated using UR [27]. For comparison, the ‘most central ring’, i.e. the ring closest to the topological centre of a given molecule, was extracted from each molecule in both URDL and oral drugs using Datawarrior. This analysis resulted in 417 and 316 ‘most central rings’ from oral drugs and URDL, respectively. Among the top ten most frequently appearing rings, benzene, piperidine, and pyrrolidine are common to both libraries (Fig. 7). One-third (103) of the rings in URDL display a SkelSphere similarity of 80% or higher to the rings extracted from oral drugs, and only 62 rings (19.6%) are structurally identical. The t-SNE plot derived from the SkelSphere descriptor of the ‘most central rings’ (Figs. 8 and 9) reveals several unique ring systems in URDL that are not represented in oral drugs. The majority of these diverse ring systems (Fig. 9) can be obtained in a two-pot reaction sequence from Ugi adducts, indicating facile access to these rings. Additionally, these rings possess varying sizes, lipophilicity, H-bonding capacity (HBA/HBD), PSA, and globularity volume, suggesting they may be exploited to design ligands against different protein binding pockets. A substructure search of the ChEMBL database using the rings displayed in Fig. 9 resulted in no hits, underscoring the structural novelty of the URDL ring systems.

Fig. 7 Top 10 ‘most central rings’ appearing in oral drugs and URDL

Fig. 8 t-SNE plot of ‘most central rings’ of oral drugs (cyan) and URDL (pink) using SkelSphere descriptor

Fig. 9 A 2D t-SNE plot of the representative set of diverse URDL rings. Each circle represents a ring in the chemical space and is colour-coded by TPSA. The background of the circles is coloured according to the globularity volume of the rings

Together, this analysis confirms that the URDL library consists of molecules based on novel ring systems that are synthetically tractable and remain unexplored in drug design.

UR in the synthesis of drugs and their analogues

Some of the URDL scaffolds and molecules also showed structural overlap with oral drugs (Figs. 6 and 8), indicating that the latter may be accessed efficiently using UR. To find examples of oral drugs or their close analogues in URDL, we performed a similarity search between URDL and oral drugs using the SkelSphere descriptor. A total of 114 URDL molecules were found to be structurally identical or similar (≥ 0.75 SkelSphere similarity index) to 46 unique oral drugs (Supplementary Information Figures S2 and S3). Among these, the two-pot gram-scale synthesis of praziquantel (8) is a well-known example displaying the efficiency of UR in drug synthesis (Fig. 10) [86]. A two-pot gram-scale enantioselective synthesis of R-lacosamide (9) has been achieved recently via Ugi3CR [87]. Similarly, a two-pot synthesis of an epimer of Tadalafil (10) is described using a UR-derived intermediate followed by cyclization [88]. In addition, close analogues of several other drugs, such as Roxatidine [89], Vorinostat [89], Racecadotril [90], and Pinazepam [91], are present in URDL (Supplementary Information Figures S2 and S3). Given the efficient synthesis of URDL molecules, it would be interesting to synthesize structurally similar analogues of FDA-approved drugs and test them against the corresponding targets. Analogously, one can conduct a similarity analysis of any other molecule of interest against URDL to find synthetically tractable close analogues. Such efforts may also result in scaffold hopping [92, 93], an important strategy in drug design that may be useful to overcome intellectual property constraints.
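The threshold-based similarity search described above can be sketched as follows. This is not Datawarrior's SkelSphere implementation: as a stand-in, each molecule is represented by the set of on-bit indices of a binary fingerprint, and the Tanimoto coefficient is used as the similarity index, which is an assumption. Only the 0.75 cut-off comes from the text; all names and fingerprints are illustrative.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two sets of on-bit indices.

    Assumption: a stand-in for the SkelSphere similarity index.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_pairs(library, references, threshold=0.75):
    """Yield (library_id, reference_id, similarity) at or above threshold."""
    for lib_id, lib_fp in library.items():
        for ref_id, ref_fp in references.items():
            score = tanimoto(lib_fp, ref_fp)
            if score >= threshold:
                yield lib_id, ref_id, score


if __name__ == "__main__":
    # Toy fingerprints for illustration only, not real descriptors
    urdl = {"u1": frozenset({1, 2, 3, 4}), "u2": frozenset({10, 11})}
    drugs = {"d1": frozenset({1, 2, 3, 4, 5}), "d2": frozenset({20})}
    for hit in similar_pairs(urdl, drugs):
        print(hit)
```

The all-against-all loop is adequate for libraries of a few thousand molecules; the same function can be reused with any query molecule in place of the oral drug set.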

Fig. 10 Drug molecules and their analogues present in URDL that can be efficiently synthesized using UR

Conclusions

In conclusion, we have curated a library of molecules derived from the Ugi reaction, cherry-picked from the recently reported literature. The synthesis of the majority of the URDL molecules involves mild reaction conditions, high pot economy (1–2 pots), and reported methodology. Thus, large samples of the molecules can be obtained cost-effectively for the experimental validation of in silico screening results. The comparison with oral drugs shows that URDL consists of drug-like molecules that occupy a distinct chemical space in terms of structural descriptors. Additionally, URDL compounds show a lower frequency of PAINS alerts and a comparable frequency of reactive functional groups relative to oral drugs. In terms of property space, URDL molecules are closer to known inhibitors of PPIs. Many of the URDL molecules contain novel ring systems absent from the currently approved drugs and from the ChEMBL database. Several oral drugs and their close analogues are also included in URDL, suggesting that UR and post-Ugi modification may be used to synthesise these molecules efficiently and may result in scaffold hopping. Thus, the URDL molecules represent synthetically tractable, unexplored chemical space fit for the purpose of VS and similarity searching. The URDL library is freely accessible as a part of the supplementary information of this manuscript.

Materials and methods

Library curation

The PubMed search was performed using the term ‘Ugi reaction’ in the Title/Abstract field to identify literature reports discussing the application of UR. The cross-references in review articles were also used to identify relevant original papers. The structures of the molecules were drawn manually, or interpreted from the IUPAC nomenclature provided in the original papers, using Osiris Datawarrior (v. 5.5.0) [72]. The synthetic schemes were carefully studied to identify the number of pots involved in synthesising the different molecules reported in each research article. The molecules with MW > 900 were filtered out. The URDL library was finally annotated with detailed references, DOIs and the number of pots required for the synthesis.
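As a rough illustration of the annotation scheme described above, the sketch below stores one record per molecule, applies the MW cut-off, and summarises the pot-count distribution. The record layout and field names (`smiles`, `mw`, `pots`, `doi`, `year`) are hypothetical and do not reflect the actual file format of the library; the curation itself was done in Datawarrior.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One annotated library record (hypothetical field names)."""
    smiles: str   # structure as a SMILES string
    mw: float     # molecular weight, Da
    pots: int     # number of pots required for the synthesis
    doi: str      # source reference
    year: int     # year of the report


def apply_mw_cutoff(entries, max_mw=900.0):
    """Keep entries at or below the MW limit stated in the text."""
    return [e for e in entries if e.mw <= max_mw]


def pot_distribution(entries):
    """Percentage of entries per pot count, ordered by pot count."""
    counts = Counter(e.pots for e in entries)
    total = sum(counts.values())
    return {pots: 100.0 * n / total for pots, n in sorted(counts.items())}


if __name__ == "__main__":
    # Toy records for illustration only; DOIs are placeholders
    toy = [
        Entry("CCO", 46.07, 1, "doi-a", 2020),
        Entry("CCN", 45.08, 2, "doi-b", 2015),
        Entry("CCC", 44.10, 1, "doi-c", 2021),
    ]
    print(pot_distribution(apply_mw_cutoff(toy)))
```

Storing the pot count as a plain integer field makes it straightforward to use as a shortlisting criterion alongside screening scores.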

Cheminformatic analysis

The URDL library (total molecules = 5773) was appended with oral drugs (total molecules = 1998) [62] and iPPI database [63] (total molecules = 3853) annotated with dataset names (Oral drugs, iPPI, URDL). For the sake of reproducibility, all processing and cheminformatics analysis was performed in open-source programmes, KNIME analytics platform 4.5.2 [70] and Datawarrior 5.5.0 [72].

The KNIME workflow developed by Bren and coworkers (https://gitlab.com/Jukic/knime_medchem_filters/) [71] was used to identify PAINS alerts in all three libraries. The compounds comprising reactive functional groups were identified using the ‘nasty functions’ [73] feature of Datawarrior 5.5.0. The SkelSphere descriptor, physicochemical descriptors, and other properties shown in Fig. 5 were calculated using Datawarrior. The boxplots, mean/median values, and p-values were obtained using the 2D plot function of Datawarrior. The rings were extracted using the ‘Most Central Ring’ feature of Datawarrior. The t-SNE plots were generated with perplexity = 40, source dimensions = 50 and iterations = 1000. The similarity comparison between the oral drugs and the URDL library was performed using a threshold of 0.75 for the SkelSphere-based similarity index. The ChEMBL database (v. 30) was searched within Datawarrior employing the ‘superstructures of’ option with various ring structures as queries.