Introduction

Virtual screening of chemical databases is a classical chemoinformatics approach to discovering compounds with desirable properties, in particular new drug molecules. The efficiency of this procedure depends on both the performance of the screening tools and the content of the screened database. At present, academic, commercial and proprietary databases together record some 10⁸ structures of existing chemical compounds. Since these collections are limited to already known chemotypes, an effort should be made to generate virtual compounds containing structural moieties that do not occur in existing structures. A larger library of virtual compounds offers a larger chance of discovering new drug-like compounds.

How large is the whole chemical space of realistic drug-like molecules? Although this question has been the focus of numerous studies [1–3], there is still no consensus on the number of potential drug-like molecules (M): depending on the estimation method, it varies from 10²³ to 10¹⁸⁰ (Table 1). Efforts have also been made to assess the size of sub-spaces covering a given type of chemical compound: alkanes, substituted heptanes and hexanes, neurological drugs. In these studies, M corresponded either to the number of all graphs containing up to N nodes (exhaustive graph enumeration [4, 5]) or to the number of graphs resulting from an intersection of several predefined subsets of graphs (combinatorial graph enumeration [6–8]). Each of these approaches has clear drawbacks. Most structures resulting from exhaustive graph enumeration are unrealistic (reactive, strained, etc.); thus, rules must be imposed to select the relatively small portion of molecules that could potentially exist [9] (below, we call this “constrained exhaustive graph enumeration”). The results of combinatorial graph enumeration depend on the preselected subsets of graphs. These are usually drawn from already existing molecules, which significantly limits the diversity of the resulting structures.

Table 1 Some popular estimations of the chemical space size

In this context, the “chemical universe database” project initiated by Reymond in 2005 [9] is of particular interest. Reymond and co-workers performed a constrained exhaustive enumeration of structures containing up to 17 C, N, S, O and halogen atoms [10], which resulted in a database of 1.66 × 10¹¹ structures (GDB-17). Many potentially reactive or strained structures were discarded using various filters. Although GDB-17 is a useful source of molecular diversity for discovering new chemotypes, its molecular size limit (17 heavy atoms) is still too small for typical drug-like molecules. Unlike previous studies, we used the information about the GDB-17 content to establish a relationship between the number of structures generated under Reymond’s constraints and the number of heavy atoms.

Below, we summarize previous estimates of the number of chemical compounds and potential drug-like molecules, and then describe our approach.

Previous studies

The first attempt to calculate the number of all possible unlabeled 4-valent tree graphs (i.e., the number of alkanes) was made by Cayley [11]. Later, numerous publications were devoted to calculating the number of molecular graphs corresponding to acyclic non-chiral hydrocarbons [2, 4, 12–18], acyclic chiral monosubstituted hydrocarbons [19], spirits [15], polyenes [20] and cyclohexanes [1, 3, 21, 22].

Several estimates of the size of the chemical universe (M) have been reported. Depending on the approach used, this number varies in the range 10²³–10¹⁸⁰ (Table 1). According to Bohacek et al. [6], the crude number of compounds consisting of thirty C, N, O and S atoms and having up to 4 cycles and 10 branch points is about 10⁶⁰. Roughly, this corresponds to the number of linear molecules with different combinations of atoms (~10²³) multiplied by the number of branchings/cyclizations for each of them (~10⁴⁰). In our opinion, this number is an overestimate, because a large part of the structures should be discarded owing to steric clashes and strain [10].

Using a set of semantic rules and stereochemistry, D. Weininger concluded that approximately 10³³ heptanes and hexanes may exist that have a molecular weight below 750 Da and are substituted by fragments consisting of H, C, N, O and F atoms [23]. To estimate the number of potential neurological drugs, Weaver and Weaver [8] assumed that these compounds should fulfill Lipinski’s rule and fit within a sphere of 7 Å radius in order to pass the blood–brain barrier effectively. The whole sphere was divided into 350 functional group volumes. All combinations of up to 5 out of 40 possible functional groups correspond to M = 10¹⁶–10²¹.

Ertl [7] estimated the number of combinations of two and three substituents attached to the same scaffold. Both scaffolds and substituents were generated from an in-house database of about 3 million organic compounds comprising C, N, O, S, P, Se, Si and halogens and containing up to 36 atoms, leading to M ≈ 10²³. He noted that more than 10¹⁰⁰ compounds (most of them unrealistic) could potentially be constructed if no restrictive filters were used. Ogata et al. [24] split the ligands extracted from 100 PDB complexes into fragments, replaced atoms by all possible combinations of C, N, S, O or Cl while taking bond orders into account, and combined the resulting fragments into new structures. Extrapolation of these results gave 10²⁶ compounds containing up to 50 atoms. Drew et al. [25] approximated the number of available compounds in ChemSpider and the NIST Chemistry WebBook by a power function of the number of carbon atoms. The resulting equation clearly underestimates the number of compounds consisting of up to 100 atoms (~10⁹). There are some other estimates of M, in the range from 10¹⁴ to 10²⁰⁰ [23, 26–29], given without any clear explanation.

GDB-based chemical space of drug-like compounds

In this study, assessing M required solving two problems: (1) establishing an equation linking M with the number of heavy atoms (N), and (2) estimating the limiting value of N for the drug-like chemical space.

At the first stage, we used the number of generated structures M as a function of N (N = 1–17) tabulated in Ref. [10]. Note that only two filters were applied when generating structures containing up to 11 heavy atoms: a “smallest atomic volume” filter discarding strained structures and a functional group filter discarding reactive non-druggable molecules. To generate structures with N = 12–17, several additional filters were applied in order to avoid combinatorial explosion [10]. Therefore, only the information about structures with N = 1–11 was used to build the relationship. According to Giménez and Noy [30], the number of connected undirected planar labeled graphs (M) is linked with the number of vertices (N) by the relationship M ~ N!, hence logM ~ N × logN. Fitting the latter to the GDB-17 compounds with N = 1–11 [10] using the R software [31] results in Eq. (1):

$$\log M = 0.584 \times N \times \log N + 0.356$$
(1)
$$R^{2} = 0.9993,\; F = 12020,\; \text{SE} = 0.066,\; n = 11$$
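
The step from M ~ N! to logM ~ N × logN can be motivated by Stirling's approximation. The lines below are only a sketch of that reasoning and are not taken from Ref. [30]; they show that N × logN is the leading term of log N!, which is the only term kept (with an empirical slope and intercept) in Eq. (1):

$$\log N! = \sum_{k=1}^{N} \log k = N \log N - N \log e + O(\log N)$$

$$\frac{\log N!}{N \log N} \rightarrow 1 \quad \text{as } N \rightarrow \infty$$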

To estimate the value of N that limits the drug-like chemical space, we used Lipinski’s classical definition of drug-likeness. According to Lipinski’s “rule of five”, orally absorbed drug-like molecules should have the following properties: (1) molecular weight MW ≤ 500 Da, (2) number of H-bond donors ≤ 5, (3) number of H-bond acceptors ≤ 10, and (4) logP ≤ 5 [32]. The last three parameters do not limit the number of structures that can potentially satisfy them, whereas molecular weight can be used as a confining parameter. We therefore propose MW ≤ 500 Da as a bound on the drug-like chemical space.

The approximate number of heavy atoms (N) corresponding to a molecular weight (MW) of 500 Da was estimated from PubChem molecules extracted from the ZINC database (accessed in 2010) [33]. A subset of 23 million compounds containing only C, N, O, S and halogen atoms (as in the GDB-17 database) was selected from the initial set of 31 million compounds. From the linear correlation found between median MW and N (Fig. 1), one obtains N ≈ 36 for MW = 500. Using this number together with Eq. (1) results in M ≈ 10³³ (Fig. 2).
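
The extrapolation can be reproduced numerically. The following minimal sketch evaluates Eq. (1) with the coefficients reported above; the function name `log10_m` is hypothetical, and the logarithm is assumed to be decimal, which is the choice that reproduces the value of about 33 quoted in the text for N = 36.

```python
import math

# Coefficients of Eq. (1) as reported in the text (fit on GDB-17, N = 1-11).
SLOPE = 0.584
INTERCEPT = 0.356


def log10_m(n_heavy_atoms: int) -> float:
    """Return log10(M) predicted by Eq. (1) for a given number of heavy atoms.

    Assumption: "log" in Eq. (1) is the decimal logarithm.
    """
    if n_heavy_atoms < 1:
        raise ValueError("the number of heavy atoms must be at least 1")
    return SLOPE * n_heavy_atoms * math.log10(n_heavy_atoms) + INTERCEPT


if __name__ == "__main__":
    # N = 36 is the heavy-atom count the text associates with MW = 500 Da.
    for n in (11, 17, 36):
        print(f"N = {n:2d}: log10(M) = {log10_m(n):.2f}")
```

For N = 36 the sketch gives log10(M) close to 33, in line with the estimate above.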

Fig. 1 Median molecular weight as a function of the number of heavy atoms for the compounds of the PubChem database. Only molecules containing C, N, O, S and halogen atoms (as in the GDB-17 database) were taken into account. MW = 500 corresponds to N ≈ 36

Fig. 2 Extrapolation of the number of compounds (M) as a function of the number of heavy atoms (N) based on data taken from GDB-17. The curve was fitted to the compounds with N = 1–11 atoms because the compounds with N ≥ 12 were generated using different selection rules

The number of 3D structures is even larger if one takes into account all stereoisomers corresponding to one planar molecular graph. According to Ref. [10], GDB-17 compounds contain, on average, 6.4 stereocenters per molecule. Assuming that the number of stereocenters increases linearly with N, one expects about 12 stereocenters per molecule for a dataset containing compounds with up to 36 atoms. This corresponds to 2¹² = 4096 stereoisomers, i.e., the number of stereoisomers is of the order of 10³. Thus, the overall number of 3D structures with MW ≤ 500 Da is about 10³⁶.
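
The arithmetic of this paragraph is sketched below. The function names are hypothetical, and the count 2ⁿ assumes that every stereocenter is independent and has exactly two configurations, which makes it an upper bound (meso forms, for example, would reduce it). The stereoisomer factor is rounded down to its order of magnitude, as in the text.

```python
import math


def stereoisomer_upper_bound(n_stereocenters: int) -> int:
    """Upper bound on stereoisomers, assuming independent binary stereocenters."""
    return 2 ** n_stereocenters


def log10_3d_structures(log10_graphs: int, n_stereocenters: int) -> int:
    """Order of magnitude of the number of 3D structures.

    The stereoisomer factor is floored to its order of magnitude,
    following the rounding used in the text (4096 -> 10^3).
    """
    stereo = stereoisomer_upper_bound(n_stereocenters)
    return log10_graphs + math.floor(math.log10(stereo))


if __name__ == "__main__":
    # Values taken from the text: M ~ 10^33 graphs, about 12 stereocenters.
    print(stereoisomer_upper_bound(12))
    print(log10_3d_structures(33, 12))
```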

The remaining three criteria of Lipinski’s “rule of five” appear to hold for most of these molecules. Indeed, Ruddigkeit et al. [10] demonstrated that the vast majority of GDB-17 compounds follow these rules. In particular, the average number of H-bond donors in GDB-17 increases slowly with N and equals 2.5 for compounds with 17 atoms. Average CLogP values remain almost constant and close to zero, independently of the number of heavy atoms in GDB-17 molecules. Thus, we believe that M ≈ 10³³ is a reasonable empirical estimate of the size of the chemical space of drug-like compounds that follow Lipinski’s “rule of five”.

It should be noted that the idea of extrapolating the number of drug-like compounds from the fully enumerated compounds in GDB-17 was recently suggested by Shoichet [34]. However, neither a mathematical equation for M nor its estimated value was reported in [34].

The estimated number of molecules is hardly accessible, at least at the current level of computing power. Indeed, simple calculations show that the world's top 500 supercomputers would be able to generate just 10¹⁴ compounds per year, which corresponds to full enumeration of compounds containing up to 18 atoms. This shows that exhaustive enumeration is not an effective way to generate “all” useful drug-like compounds. On the other hand, combinatorial generation based on molecular fragments taken from a fully enumerated library looks more promising. For instance, if GDB-17 (1.66 × 10¹¹ compounds with up to 17 heavy atoms) is used for this purpose, 1.66 × 10¹¹ × 1.66 × 10¹¹ = 2.8 × 10²² pairwise combinations could be generated. Since each pair of species can be linked in 17 × 17 = 289 different ways, pairwise linking of GDB-17 molecules results in “only” 2.8 × 10²² × 289 = 8 × 10²⁴ compounds, which is more affordable than M ≈ 10³³. Thus, combinatorial generation of drug-like compounds from fully enumerated libraries of small fragments looks more realistic than full enumeration of compound libraries. This strategy has another advantage: the structures can be generated on the fly during virtual screening, which avoids the difficulties of storing and maintaining such a huge database. The generation of “useful” compounds could always be tuned by guided enumeration, which fits the generated molecules to the target chemical space. Generally, this can be achieved using a fitness function that includes different parameters such as the target property value, ADME/Tox properties, diversity of the generated library, etc. [35, 36].
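
The pairwise-linking estimate above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; like the text, it counts ordered pairs (including a molecule paired with itself) and allows every atom of one fragment to be linked to every atom of the other, so the result should be read as an upper bound.

```python
GDB17_SIZE = 1.66e11      # number of GDB-17 structures, from the text
MAX_HEAVY_ATOMS = 17      # maximum heavy atoms per GDB-17 structure


def ordered_pairs(library_size: float) -> float:
    """Number of ordered fragment pairs, self-pairs included."""
    return library_size * library_size


def pairwise_linked_count(library_size: float, atoms_per_fragment: int) -> float:
    """Upper bound on compounds obtained by linking two fragments.

    Assumes each of the atoms in one fragment can be bonded to each
    of the atoms in the other (17 x 17 = 289 ways for GDB-17).
    """
    linkages = atoms_per_fragment * atoms_per_fragment
    return ordered_pairs(library_size) * linkages


if __name__ == "__main__":
    print(f"pairs:  {ordered_pairs(GDB17_SIZE):.1e}")
    print(f"linked: {pairwise_linked_count(GDB17_SIZE, MAX_HEAVY_ATOMS):.1e}")
```

Both values agree with the rounded figures quoted above (about 2.8 × 10²² pairs and about 8 × 10²⁴ linked compounds).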