1 Introduction

In data mining, items characterized by a large number of attributes can be conceived as points in a “data space” or “attribute space” (defined by the vector of their attributes), but cannot be directly visualized as such if the dimensionality D (the total number of attributes needed to characterize an instance) exceeds three. For example, complex instances such as molecules (or chemical reactions) in chemoinformatics typically require hundreds or thousands of specific attributes—henceforth called “molecular descriptors”, as customary in chemoinformatics—in order to conveniently capture the chemical information associated with a given compound. Understanding the neighborhood relationships between items—the analysis of the relative distances separating them in data space—is often of paramount importance to the understanding and exploitation of the knowledge provided by a set of example items, if item properties can be shown to comply with the neighborhood principle. This principle states that similar items (close to each other in data space) tend to display rather similar properties. In chemoinformatics, the “similarity principle”, postulating that similar molecules will likely display similar (physicochemical and/or biological) properties (Johnson et al. 1988; Johnson and Maggiora 1990), is a key paradigm in chemistry, guiding the design and synthesis of novel analogues with properties close to those of state-of-the-art precursor compounds.

The similarity principle may be perfectly well exploited in arbitrarily high-dimensional descriptor spaces (Papadatos et al. 2009; Patterson et al. 1996), based on therein-defined distance measures (metrics) quantitatively rendering the degree of dissimilarity (remoteness) of any two items. However, such approaches are frustratingly counterintuitive “black boxes”. The alternative—intuitive grasping of the neighborhood relationships from a 2D map of the initial space—requires a procedure to project the initial points onto a plane in a way that minimizes distortions of inter-item distance values. In the practice of chemoinformatics, projected inter-item distances need not quantitatively match those in the initial descriptor space: it is enough to ensure that (i) neighboring molecules in the descriptor space continue to show up as neighbors, and (ii) initially remote species do not artefactually become neighbors in the projection (the so-called “latent space”). The principle of a meaningful projection is illustrated in Fig. 1—item a is closest to items d, e and b, in both the initial space and the projection.

Fig. 1

The principle of dimensionality reduction

Many dimensionality reduction algorithms exist, ranging from the classical linear-algebra-based Principal Component Analysis (PCA) (Dunteman 1989) to various non-linear techniques such as Kohonen Self-Organizing Maps (SOM) (Kohonen 1984, 2001), Multidimensional Scaling (MDS) (Agrafiotis et al. 2001), Stochastic Proximity Embedding (Agrafiotis 2003), etc.

2 Generative Topographic Mapping—Principles

Generative topographic mapping (GTM) (Kireeva et al. 2012), introduced by Bishop et al. (1998a, b), is basically a fuzzy-logic-driven counterpart of Kohonen Self-Organizing Maps (SOM). A regular square grid of K nodes covering the 2D latent space is generated, where K is the square of some small integer \( \sqrt K \), the grid “width”, expressed as the number of nodes per square edge. A node k is defined by its integer 2D coordinates \( {\mathbf{x}}_{k} = (l_{x} ,l_{y} ) \), with index \( k = l_{x} \times \sqrt K + l_{y} \), where \( l_{x} ,l_{y} = 0, \ldots ,\sqrt K - 1 \) and k = 0, …, K − 1. Each node is mapped to a manifold point \( {\mathbf{y}}_{k} \) embedded in the D-dimensional space: \( {\mathbf{x}}_{k} \to {\mathbf{y}}_{k} \), using the non-linear mapping function \( y({\mathbf{x}};{\mathbf{W}}) \) that maps points from the two-dimensional latent space into the D-dimensional data space:

$$ y\left( {\varvec{x};\varvec{W}} \right) = {\mathbf{W}}\varphi (\varvec{x}) $$
$$ {\mathbf{Y}} = {\varvec{\Phi}}{\mathbf{W}}^{T} $$

where \( {\mathbf{Y}} \) is the \( K \times D \) matrix of manifold points, \( {\mathbf{W}} \) is the \( D \times M \) parameter matrix, and \( {\varvec{\Phi}} \) is the \( K \times M \) radial basis function matrix built over M RBF centers \( {\varvec{\upmu}}_{m} \):

$$ \Phi _{km} = \exp \left( { - \frac{{\left\| {{\mathbf{x}}_{k} - {\varvec{\upmu}}_{m} } \right\|^{2} }}{{2\sigma^{2} }}} \right) $$

The parameter \( \sigma^{2} \) corresponds to the average squared Euclidean distance between two RBF centers, multiplied by a tunable factor w. The parameter matrix W can be initialized so as to minimize the sum-of-squares error between initial-space and latent-space point distances, corresponding to a default, linear PCA mapping. The points \( {\mathbf{y}}_{k} \) on the manifold are the centers of normal probability distributions (NPDs) of t:
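The construction of the node grid, RBF centers and RBF matrix described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical toy sizes, not the actual GTM implementation discussed in the chapter; the width factor `w` corresponds to the tunable factor of \( \sigma^{2} \):

```python
import numpy as np

def make_grid(n):
    """Integer 2D coordinates of an n x n grid of nodes (or RBF centers)."""
    lx, ly = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.column_stack([lx.ravel(), ly.ravel()]).astype(float)

def rbf_matrix(nodes, centers, w=1.0):
    """K x M matrix Phi_km = exp(-||x_k - mu_m||^2 / (2 sigma^2)),
    where sigma^2 is the average squared distance between RBF centers,
    multiplied by the tunable width factor w (a simplifying convention)."""
    d2 = ((nodes[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # K x M
    c2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    sigma2 = w * c2.mean()
    return np.exp(-d2 / (2.0 * sigma2))

nodes = make_grid(5)               # K = 25 latent-space nodes
centers = make_grid(3) * 2.0       # M = 9 RBF centers spread over the same square
phi = rbf_matrix(nodes, centers)
print(phi.shape)                   # (25, 9)
```

Each row of `phi` holds the RBF activations of one latent node; a node coinciding with a center contributes the maximal value of 1.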

$$ p\left( {{\mathbf{t}} |{\mathbf{x}}_{k} ,{\mathbf{W}},\beta } \right) = \left( {\frac{\beta }{2\pi }} \right)^{D/2} \exp \left( { - \frac{\beta }{2}\left\| {{\mathbf{y}}_{k} - {\mathbf{t}}} \right\|^{2} } \right) $$

where \( {\mathbf{t}}_{n} \) is a data instance and \( \beta \) the common inverse variance of these distributions.

Intuitively, one may think of this abstract manifold as a “rubber sheet” inserted into the zone of descriptor space covered by the items to map, which is subsequently deformed (fitted) in order to accommodate, or at least pass closely to, each of these items. Therefore, the ensemble of N data items—here, N molecules, each represented by its molecular descriptor vector \( {\mathbf{t}}_{n} \), n = 1, …, N—defines the zone within which the manifold will be most accurately defined; it thus represents a “frame” within which the map is positioned and will be termed the “frame set” in the following. Manifold grid points are positioned in the neighborhood of frame set points. The optimization of the log likelihood function given below accounts for the purposeful distortion of the manifold in order to optimally cover, or at least approach, each of the frame set items.

Eventually (see the intuitive example in Fig. 2), the manifold will be folded back into the delimited 2D square grid of K nodes. While it may describe an infinite hypersurface in the initial space, its extrapolation far beyond the zone covered by the frame set is hardly meaningful. “Exotic” items outside the frame zone will be spuriously “folded” back within the bounds of the square grid, but should eventually be ignored, because they are outside the applicability domain of the map. Therefore, the proper choice of frame compounds—which may, but need not, coincide with the actual compound collections targeted by the GTM-based study—is a key prerequisite in GTM design.

Fig. 2

An example of the “Swiss roll” manifold fitted to match the items (points) in the initial 3D data space, then unfolded onto the 2D latent space grid

An optimal GTM corresponds to the highest log likelihood \( {\mathcal{L}} \), taken over all frame compounds n = 1, …, N, optimized by expectation-maximization (EM):

$$ {\mathcal{L}}\left( {{\mathbf{W}},\beta } \right) = \sum\limits_{n} {\ln \left\{ {\frac{1}{K}\sum\limits_{k} {p\left( {{\mathbf{t}}_{n} |{\mathbf{x}}_{k} ,{\mathbf{W}},\beta } \right)} } \right\}} $$

\( \beta \) and W are optimized during the maximization step:

$$ \frac{1}{\beta } = \frac{1}{ND}\sum\limits_{n} {\sum\limits_{k} {R_{kn} \left\| {{\mathbf{y}}_{k} - {\mathbf{t}}_{n} } \right\|^{2}} } $$
$$ \left( {{\varvec{\Phi}}^{T} {\mathbf{G}}{\varvec{\Phi}} + \frac{\lambda }{\beta }{\mathbf{I}}} \right){\mathbf{W}}^{T} = {\varvec{\Phi}}^{T} {\mathbf{RT}} $$

where I is the \( M \times M \) identity matrix and G a diagonal \( K \times K \) matrix with elements \( G_{kk} = \mathop \sum \limits_{n} R_{kn} \). GTM build-up is fully controlled by four user-defined parameters: the number of RBFs M, the number of nodes K, the RBF width multiplication factor w and the weight regularization coefficient \( \lambda \). The latter two serve to tune the stiffness of the manifold and hence avoid overfitting. Of course, the local maximum reached by the optimization of the log likelihood function will depend on the initial geometry of the manifold—in our implementation, optimization starts from a flat manifold representing the plane of the first two principal components of the descriptor space. Note that the final rendering of the map may strongly depend on the initial conditions—but the neighborhood relationships it encodes will not: compounds that are close in the descriptor space should be mapped to adjacent points in latent space. In chemoinformatics, however, GTMs serve to monitor neighborhood relationships—therefore, no systematic study of all possible log likelihood maxima achievable for any given quartet of control parameters (M, K, w, λ) has been pursued.
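For illustration, one EM iteration over the update equations above may be sketched as follows. This is a toy-sized sketch with random stand-in data (the RBF matrix `phi` is an arbitrary positive matrix here, not a fitted one); the responsibilities \( R_{kn} \) computed in the E-step are the Bayesian posterior probabilities of node k given item n:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, D, N = 16, 4, 3, 50                    # toy sizes: nodes, RBFs, descriptors, items
phi = np.abs(rng.standard_normal((K, M)))    # stand-in for the K x M RBF matrix
T = rng.standard_normal((N, D))              # frame-set descriptor matrix (N x D)
W = rng.standard_normal((D, M))              # parameter matrix
beta, lam = 1.0, 0.01

def em_step(W, beta, phi, T, lam):
    """One EM iteration over the GTM update equations."""
    Y = phi @ W.T                                          # K x D manifold points y_k
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(-1)    # K x N squared distances
    # E-step: responsibilities R_kn (posterior of node k for item n),
    # computed with the log-sum-exp trick for numerical stability
    logp = -0.5 * beta * d2
    R = np.exp(logp - logp.max(axis=0))
    R /= R.sum(axis=0)                                     # columns sum to 1
    # M-step: solve (Phi^T G Phi + (lambda/beta) I) W^T = Phi^T R T
    G = np.diag(R.sum(axis=1))
    A = phi.T @ G @ phi + (lam / beta) * np.eye(phi.shape[1])
    W_new = np.linalg.solve(A, phi.T @ R @ T).T
    # beta update: 1/beta = (1/ND) sum_kn R_kn ||y_k - t_n||^2, with the new W
    Y_new = phi @ W_new.T
    d2_new = ((Y_new[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    beta_new = T.size / (R * d2_new).sum()
    return W_new, beta_new, R

W, beta, R = em_step(W, beta, phi, T, lam)
```

In practice this step is repeated until the log likelihood converges; the sketch shows a single iteration only.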

Finally, the responsibility, or posterior probability, that a point \( {\mathbf{t}}_{n} \) in the data space is generated from the kth node is computed using Bayes’ theorem:

$$ R_{kn} = p\left( {{\mathbf{x}}_{k} |{\mathbf{t}}_{n} ,{\mathbf{W}},\beta } \right) = \frac{{p\left( {{\mathbf{t}}_{n} |{\mathbf{x}}_{k} ,{\mathbf{W}},\beta } \right)p({\mathbf{x}}_{k} )}}{{\mathop \sum \nolimits_{k '} p\left( {{\mathbf{t}}_{n} |{\mathbf{x}}_{k '} ,{\mathbf{W}},\beta } \right)p({\mathbf{x}}_{k '} )}} $$

These responsibilities are used to compute the mean (real-valued) position \( {\mathbf{xy}}({\mathbf{t}}_{n} ) \) of a molecule on the map, by averaging over all nodes with responsibilities as weighting factors:

$$ {\mathbf{xy}}({\mathbf{t}}_{n} ) = \sum\limits_{k} {{\mathbf{x}}_{k} R_{kn} } $$

Each point on the GTM thus corresponds to the averaged position of one molecule. This step completes the mapping, i.e., the reduction of the responsibility vector to a plain pair of 2D coordinates xy, defining the position of the projection of the initial D-dimensional vector on the map plane.
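The reduction of responsibilities to 2D coordinates is a simple responsibility-weighted average over node positions; a minimal sketch with toy numbers for illustration:

```python
import numpy as np

def latent_positions(R, node_xy):
    """Responsibility-weighted mean position on the map.

    R       : K x N responsibility matrix (each column sums to 1)
    node_xy : K x 2 integer node coordinates x_k
    returns : N x 2 real-valued positions xy(t_n)
    """
    return R.T @ node_xy

# toy check: a compound split 50/50 between nodes (0, 0) and (2, 0)
node_xy = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
R = np.array([[0.5], [0.5], [0.0]])
print(latent_positions(R, node_xy))   # [[1. 0.]] -- the barycenter
```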

The above optimization of \( {\mathcal{L}} \) requires access to the entire set of molecular descriptor vectors, which represents a \( N \times D \) matrix of real numbers, and may thus quickly run into memory problems, knowing that the dimensionality D of the descriptors may often reach orders of magnitude of \( 10^{3} \)–\( 10^{4} \), whilst frame sets of millions of molecules could be considered. In particular, in the context of “big data”, or in order to exhaustively map the chemical “Universe” of all commercially available or feasible compounds, the input data alone may easily scale up to tens of GB, and the temporary variables required for processing will be similar in size. As an alternative to the use of a memory-rich supercomputer, an incremental version of the GTM algorithm (iGTM) has been proposed (Gaspar et al. 2014). It divides the data into blocks and updates the model block by block until convergence of the log likelihood function.

After a map has been “built”—i.e., the manifold has been optimized, based on the provided frame set items—any other point t′ in the initial descriptor space can be projected onto the manifold and its responsibility vector can be computed. This is technically possible—but practically not advisable—even if the compound is very different from the frame set molecules, and implicitly remote from the fitted manifold. Note that mapping of external items only requires the manifold equation and t′ as input, and can thus be easily parallelized, so that arbitrarily large external compound sets can be mapped on any given GTM. However, the underlying—smaller—frame set must nevertheless be representative of these external sets (i.e., cover roughly the same descriptor space hypervolume, albeit at much lower density). This is a key issue in chemical space mapping, needed to ensure that mapping of external compounds is meaningful, and not artefact-prone.

2.1 Responsibility Patterns

GTM has, over other techniques, the key advantage of its two-stage approach to dimensionality reduction:

  1.

    from the original, D-dimensional descriptor space to the K-dimensional responsibility vector space (Responsibility Level). A responsibility vector (Fig. 3) can be intuitively visualized by colored “patches” positioned at the node(s) with significant responsibility values, where color intensity is modulated by the actual responsibility values.

    Fig. 3

    Examples of single-node residents versus fuzzy, multi-node residents. For the antiparasitic compound above, the responsibility vector is null for all nodes except the highlighted one, in which the compound is predicted to reside exclusively. The fatty acid inhibitor below is defined as a partial resident of the nodes highlighted in red, where the color intensity matches relative responsibility values. The black crosshair signs correspond to the (x, y) latent space coordinates of the compounds on the map, and are positioned at the (responsibility-weighted) barycenter of the set of significant residence nodes. The displayed map is the result of a study (Sidorov et al. 2015) aimed at the discovery of general maps of maximal pertinence for the space of drug-like compounds (“universal” map #2 of the cited publication)

  2.

    from responsibilities to 2D positions on the map (2D Level): computed latent coordinates \( {\mathbf{xy}}({\mathbf{t}}_{n} ) \) can be assigned to each compound: see the crosshairs at the (responsibility-weighted) barycenters of the sets of significant residence nodes in Fig. 3.

The latter, final level is clearly not the most interesting one. A 2D map of very high-dimensional spaces will—irrespective of the linear or non-linear mapping strategy—be inherently imprecise, to the point of not being of great use. Molecular structures cannot be robustly characterized by two real numbers only, irrespective of the strategy one may design for defining those numbers. The full advantages of GTM are apparent at the responsibility level, at intermediate dimensionality K. K is a user-tunable parameter, which should be chosen so as to avoid massive loss of chemically relevant information while filtering out the noise due to less relevant descriptor components.

It is straightforward to expect that similar molecules will be represented by similar responsibility “color patches” on the map, and the human eye is perfectly suited to detect “color patch” similarity—even beyond the trivial scenario in which “patches” include a single node and the GTM acts like a classical Kohonen map. Further reduction of the molecule object to a single point of 2D coordinates (crosshair), which is precisely the barycenter of the responsibility pattern, may represent a drastic loss of information, unless one single node accounts for the entire density distribution.

Fuzzy compound-to-node assignment may seem like a minor enhancement, but it is actually another key strength of GTM over Kohonen maps. First, at the same grid size K, the volume of chemical information that can be monitored by a GTM is much larger. A Kohonen grid of K nodes may distinguish between at best K different core structural motifs—far fewer in practice. Some of these K nodes will, indeed, each stand for “main stream” compound classes, but others will serve as “garbage collectors” of all the exotic structures that do not fit any of the former, but need to be assigned to one (and only one) given node nevertheless. On a GTM, molecules are not necessarily bound to a single node, and the total number of distinct structural motifs is defined—intuitively—by the number of distinct “color patch” patterns that may be drawn with the help of a K-sized grid, or—technically—by the phase space volume spanned by the K-dimensional responsibility vectors. Kohonen maps operate only with pure states, while GTMs, by contrast, operate with mixed states, and the latter come in virtually infinite numbers (not all of them corresponding to real compounds or common core motifs, however). As a consequence, exotic compounds that are remote from all the nodes of the manifold will often be mapped with equally weak responsibilities on all nodes, rather than assigned to the one—relatively—closer “garbage” node.

However, the non-fuzzy Kohonen maps may seem to have an advantage in terms of compound clustering. All compounds mapping to the same node are, from the Kohonen map perspective, no longer distinguishable and may therefore be unambiguously viewed as members of the same group, or cluster—which makes perfect chemical sense for all but the above-mentioned “garbage” nodes. Conceptually, things are identical for compounds residing in single nodes of GTMs, with the additional benefit that single-node residents are typically compounds found close to the manifold, well within the GTM applicability domain. Single-node residents of any given node are expected to form a chemically meaningful “cluster” of similar compounds. The cluster corresponding to the “blue” node in which the antiparasitic compound from Fig. 3 resides is shown in Fig. 4.

Fig. 4

Other single-node resident compounds from the ChEMBL database (Gaulton et al. 2011), in the node of residence of the antiparasitic compound from Fig. 3. The cluster regroups benzopyroles, benzimidazoles and other closely related heterocyclic scaffolds. In spite of the wide diversity of substituents (only 9 randomly picked examples are shown, out of its 9100 members in the ChEMBL database), there is a clearly visible common structural pattern associated to this node

Yet, the working hypothesis “compounds of a same node belong to a same cluster” may be easily generalized to fuzzily mapped items such as CHEMBL600799 from Fig. 3, by introducing the concept of Responsibility Patterns (RP). The responsibility pattern (Klimenko et al. 2016) of a compound n is defined as an integer, discretized version of the real-number responsibility vector R:

$$ RP_{kn} = \left[ {10 \times R_{kn} + 0.9} \right] $$

where “[]” stands for the truncation operator. The peculiar transformation rule above was chosen so as to ensure that even marginally responsible nodes (at \( R_{kn} = 0.01 \)) will be highlighted. Beyond this minimal threshold, every additional 10% increase of a responsibility value contributes an increment of +1 to the integer RP equivalent. For a compound n, the responsibility pattern vector \( RP_{kn} \) may be best rendered as a string enumerating—in increasing node number order—the nodes of non-zero RP values, concatenated with these values, e.g., /k1:RPk1n/k2:RPk2n/k3:RPk3n/…/. For single-node compounds, the RP string /k:10/ is simply a label of the concerned node. Herewith (see Fig. 5), compounds associated with a same RP string will be considered to belong to a same cluster.
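The discretization rule and string encoding can be sketched as follows (a minimal illustration; the formatting follows the /k:RP/ convention described above):

```python
import numpy as np

def rp_string(R_n):
    """Discretize a responsibility vector into its Responsibility Pattern string.

    RP_k = floor(10 * R_k + 0.9), so any node with R_k >= 0.01 is retained;
    nodes with non-zero binned values are listed in increasing node-number
    order as /k:RP_k/.../
    """
    rp = np.floor(10.0 * np.asarray(R_n) + 0.9).astype(int)
    parts = [f"{k}:{v}" for k, v in enumerate(rp) if v > 0]
    return "/" + "/".join(parts) + "/"

# single-node resident -> the RP string is a plain node label
print(rp_string([0.0, 1.0, 0.0]))        # /1:10/
# fuzzy resident spread over two nodes (third node falls below the 0.01 cut)
print(rp_string([0.55, 0.45, 0.004]))    # /0:6/1:5/
```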

Fig. 5

Exemplifying the definition of Responsibility Patterns as strings (labels) concentrating the information in the responsibility vectors by “binning”, and herewith regrouping molecules with identical or slightly different responsibility vectors under a common label. Integers above the lines are node numbers, and corresponding real values below are responsibility values. These are binned, and nodes returning non-zero binned values are concatenated together with their binned value, into a “RP string” shown below

The responsibility pattern approach therefore amounts to a cell-based clustering technique: the above discretization formula may be interpreted as a procedure to tessellate the vector space of responsibilities R, so that items within any cell share a same RP string. Typically, cell-based clustering fails in high-dimensional spaces, because of the sheer number of possible cells: in the K-dimensional space of responsibility vectors, there are \( 10^{K} \) possible cells, with K = 32 × 32 = 1024 in the GTM from Fig. 3 (see figure caption for more information about the map). However, with a well-fitted GTM model, only a minority of cells is actually populated—in particular, the K single-node configurations and fuzzy configurations with responsibilities shared between two and—for CHEMBL600799, which was picked for being the “fuzziest” mapper amongst all ChEMBL compounds—five participating nodes. The 1.3M ChEMBL compounds populate 23,253 distinct responsibility patterns, out of which 723 correspond to strict single-node mapping modes (concerning a total of 1.22M compounds, i.e., 95.3% of ChEMBL structures). There are 19,952 bi-nodal RPs, regrouping some 53K molecules. There are 4,217 ChEMBL compounds that are nonspecifically “smeared” over the entire map, with \( R_{kn} = 1/1024 \) for all k—these 3‰ represent mapping failures, and are beyond the applicability domain of the model.

Like in any clustering approach, the user expects to see “structurally related” compounds grouped together under a given RP label. “Structural relatedness”, however, is an intrinsically ill-defined concept—it typically refers to the scaffold-centric view cherished by medicinal chemistry, where two compounds are “analogues” if they contain a same (however defined) “scaffold”. There is no absolute truth in the above point of view—one may as well prefer the alternative pharmacophore-centric approach, where two compounds are “analogues” if they carry a same pharmacophore pattern (an analogous spatial distribution of functional groups of analogous physico-chemical nature). Note that GTM-based RPs are not specifically generated on the basis of scaffold-centric information, but may capture the presence of a scaffold by its specific “signature” in the provided molecular descriptor vector (a specific scaffold contributes a subset of specific fragments to the ISIDA fragment count vector). Alternatively, pharmacophore patterns might also be captured, if the pharmacophore-colored fragmentation schemes (Ruggiu et al. 2010) are enabled.

Thus, the nature of the common structural “motif” behind a given RP is by default open-ended. First, “garbage” RPs—the equivalent of Kohonen “garbage” nodes—may appear, for various reasons. They may regroup cases of exotic compounds that are too far from the manifold to be clearly assignable to a node and are therefore “smeared” over many putative locations. However, single-node RPs may also sometimes accommodate a set of—from a chemist’s point of view—highly diverse structures, with no obvious common “motif”. The more populous a node, the higher its chances to accommodate widely diverse compounds. Figure 6 highlights the three most populous RPs of the ChEMBL map, each corresponding to single-node RPs associated with the pinpointed “borderline” nodes. In spite of the large compound populations, in two of the three nodes it was rather easy to evidence the common underlying structural pattern “uniting” these compounds into a cluster. Finding the pattern required nothing but visual inspection of some representatives. Then, the putative common patterns thus observed were formulated as substructure search queries, and applied to the compounds matching each RP—as their numbers are too large for exhaustive visual inspection. Indeed, 95% of the members of the most populous RP of ChEMBL (highlighted node #128) are putative Michael acceptors, matching the α,β-unsaturated ketone pattern C=C–C=O. More than 66% of the residents of node #32 are oxyanions—which is a remarkable enrichment, knowing that over the entire ChEMBL set, the occurrence rate of oxyanions is 17%. However, there is no obvious common pattern within node #64, the second-largest RP in ChEMBL. Yet, the molecules do have something in common—their “fragment-like” size, being significantly smaller, and hence less complex, than typical drugs.

Fig. 6

Analysis of the three nodes corresponding to the three most populous Responsibility Patterns in ChEMBL, based on the map introduced in Fig. 3. Given the structural diversity of compounds in node #64, this may be viewed as a “garbage” node—nevertheless, it has the specificity of regrouping small, fragment-like compounds

Thus, as exemplified in this chapter, and as observed in previous works (Klimenko et al. 2016), the unifying structural “reasons” behind a given RP may be of diverse nature, and represent different “resolution” levels. They may range from extremely fuzzy size considerations, through clustering of molecules by their predominant pharmacophore feature (e.g., anionic nature), to specific shared substructures.

These substructures need not be scaffolds in order to be (bio)chemically relevant. As shown above, the herein discussed GTM “spontaneously decided” to regroup Michael acceptors, based on the specific signature of the C=C–C=O moiety. Of course, the mapping process did not rely on any knowledge of the putative specific or non-specific biological effects, “PAINS” (Baell and Walters 2014; Dahlin et al. 2015), of Michael acceptors. Also, it cannot be taken for granted that Michael acceptors would, as such, share a specific zone in the initial descriptor space—which is a prerequisite for their projection onto a common RP. The nature of the molecular descriptors on which mapping was based (Sidorov et al. 2015)—force-field-type colored ISIDA atom pair counts—was of paramount importance, because they helped to evidence the specific signature of the C=C–C=O moiety. This map was grown and selected with respect to its propensity to explain structure-activity relationships throughout diverse series of compounds associated with various targets (vide infra—Building high-quality GTMs). In the process of achieving that goal, the assignment of putatively reactive and hence unspecific Michael acceptors to a “borderline” node emerged spontaneously. Note—though not shown in Fig. 6—that the fifth-most populous node of the map is another Michael-acceptor-dominated chemical space zone, with the peculiarity that the C=C–C=O pattern is now included in a ring.

Common structural motifs may nevertheless correspond to one scaffold, or to ensembles of similar scaffolds, as already highlighted in Fig. 4. Examples from previous work show that the relevant underlying common substructure may be more stringent than the scaffold level—compounds within a given RP may share not only a common scaffold, but also very specific substituents at key scaffold positions. The relationship between RPs and the underlying structural motifs is therefore open-ended and self-adaptive: it may stretch from very fuzzy regrouping of compounds sharing a same small size, or a same negative charge, to compound clusters based on a clearly defined common substructure, which may or may not match a scaffold (in the sense of “ring system”). Different maps may highlight different structural motifs that are specific to some of their RPs. Conversely, “rediscovery” of the very same motifs associated with map-specific RPs (Klimenko et al. 2016) of different maps is also possible, even if those maps are based on different molecular descriptors. Either way, if a map is shown to be neighborhood-compliant, in the sense of supporting robust structure-activity models for a wide panel of properties (vide infra), then the RPs extracted from such a map are highly likely to correspond to some well-defined underlying structural motif of (bio)chemical significance.

2.2 GTM-Driven Classification and Regression Predictive Models

Whenever molecules, initially represented as D-dimensional objects in descriptor space, are mapped onto a 2D latent space, their properties are implicitly localized on the map. It makes sense to “transfer” the—mean—property of molecules residing in a given latent space zone to that particular latent space zone itself. If the mapping is meaningful—that is (Horvath and Barbosa 2004), similarity principle-compliant—for a given property, then mappers onto any given latent space “spot” of sufficiently small size will be similar molecules of similar property. Hence, the mean \( \bar{P} \) of these property values will display a limited standard deviation \( \sigma (P) \), and coherently represent the local, above-expectation accumulation of compounds of property value \( P \approx \bar{P} \). Coherence of mapping of compounds with known property values may thus serve to implicitly define the quality of a mapping approach. Moreover, should a new species be shown to map in the same latent space zone, the assumption that its expected property value will not be far from \( \bar{P} \) can be upheld and used for prediction.

P may stand for various properties of distinct nature—both continuous and categorical (class labels). The relevant latent space “spots” and the “mean” values \( \bar{P} \) may be defined in context-dependent ways, but the above outlines the general principle of predictive mapping. For example, in Kohonen maps, nodes are the smallest addressable latent space units for which \( \bar{P} \) values may be computed. A Kohonen map cannot make any prediction more detailed than returning the \( \bar{P} \) value associated with the node into which a compound has been classified. This means that, with continuous properties, it may return only a discrete spectrum of node-bound \( \bar{P} \) values, one value per node for all “non-garbage” nodes with \( \sigma (P) \) below some user-defined threshold. Therefore, Kohonen maps—unlike GTM—fail to support proper quantitative regression models: they would return \( \bar{P} \) as the predicted value for an entire series of analogues residing in the same node. A structural modification of a compound would not trigger any change of the predicted value \( \bar{P} \) unless this change causes relocation to a different node, associated with a different \( \bar{P} \) value. The above, however, is perfectly compatible with the expected behavior of a classification model. Both Kohonen and GTM approaches may therefore be used for compound classification, while GTM is—due to its fuzzy mapping abilities—better suited for regression models.

When defining the “mean” property value \( \bar{P}_{k} \) of a GTM node k, one must count each resident compound n proportionally to its degree of residence in that node, R kn :

$$ \bar{P}_{k} = \frac{{\sum\nolimits_{n} {R_{kn} \times w(n) \times P(n)} }}{{\sum\nolimits_{n} {R_{kn} \times w(n)} }} $$

where P(n) represents the property of compound n and w(n) represents the importance weighting factor of compound n. When the property P represents a continuous magnitude, such as a pIC50 or logP value, there is no immediate reason for any specific importance weighing scenario. Letting all compounds be equally important, w(n) = 1 ∀ n, will assign simple arithmetic means of the property to nodes. Since \( R_{kn} \) is never strictly zero, no matter how far compound n is situated from manifold node k in descriptor space, GTMs—unlike Kohonen maps—do not display genuinely empty nodes, and the above equation is applicable for all k, without fear of divisions by zero. However, if the above denominator is low, it makes little sense to expect a meaningful extrapolation of \( \bar{P}_{k} \) based only on the remote contributions of compounds having no significant degree of residence at k. Therefore, there should be some user-defined minimal threshold for the total cumulated responsibility per node, below which k should be considered as “practically empty”, and its technically obtainable but chemically senseless \( \bar{P}_{k} \) value ignored. Node density can be encoded in plots by color transparency—from completely transparent (below the defined density threshold) to full color (to be used, for example, for the top t% most dense nodes). Density (cumulated responsibility) is a major criterion of the trustworthiness of estimated \( \bar{P}_{k} \) values: the higher the density, the more robust the assigned \( \bar{P}_{k} \); the above-mentioned minimal density threshold can be considered as an applicability domain delimiter of a map (Gaspar et al. 2015, 2013; Horvath et al. 2009; Sushko et al. 2010; Tetko et al. 2008).
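The weighted node-mean computation, together with the “practically empty” density threshold, might be sketched as follows (toy responsibilities and an illustrative `min_density` value; the threshold value itself is an assumption, to be set by the user):

```python
import numpy as np

def node_property_means(R, P, w=None, min_density=0.1):
    """Responsibility-weighted mean property per node, with an applicability
    threshold: nodes whose cumulated weighted responsibility (density) falls
    below min_density are flagged as 'practically empty' (NaN).

    R : K x N responsibilities, P : length-N properties, w : length-N weights.
    """
    K, N = R.shape
    w = np.ones(N) if w is None else np.asarray(w, float)
    num = R @ (w * np.asarray(P, float))
    den = R @ w                       # node density (cumulated responsibility)
    means = num / den
    means[den < min_density] = np.nan
    return means, den

# two nodes, three compounds: the third compound resides mostly in node 1
R = np.array([[0.9, 0.8, 0.001],
              [0.1, 0.2, 0.999]])
P = np.array([5.0, 7.0, 9.0])
means, density = node_property_means(R, P)
```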

Note that the equation above may also be used for classification purposes: if, for example, we define P(n) = 1 for all inactives, by contrast to P(n) = 2 for all actives, then \( \bar{P}_{k} \) will be below 1.5 if the inactives residing in the node are predominant, above 1.5 if actives are dominant, and 1.5 if both categories are equally well represented. The node can then be assigned to either class 1 or 2, or discarded as “undecidable”. Note that this approach to classification by rounding the \( \bar{P}_{k} \) values—outlined here for the sake of intuitiveness—only works properly for two-class classification problems. Obviously, with three classes P(n) ∈ {1, 2, 3}, a node having 50% of its residents of class 1 and the other half of class 3 would be wrongly colored as “class 2”. The proper way (Gaspar et al. 2013) to deal with multi-class classification is to count cumulated responsibilities \( \sum\nolimits_{n} {R_{kn} \times w(n)|_{n\,of\,class\,P} } \) for each class, and to return the class P with the largest sum as \( \bar{P}_{k} \). Only two-class classification supports node coloring by the fuzzy \( \bar{P}_{k} \) gradually shifting from class 1 to class 2, herewith implicitly returning the coherence-based trustworthiness of the node-versus-class association—the closer \( \bar{P}_{k} \) is to the extremes 1 or 2, the more robust the prediction. Coherence-based trustworthiness can also be defined for multi-class classification problems, by checking how much larger the winning cumulated responsibility score is with respect to the second-best one. Coloring nodes by winning class only is always feasible, but unfortunately such a plot does not inform about coherence-based trustworthiness. In regression models, coherence-based trustworthiness can be inferred from the standard deviation \( \sigma_{k} \left( P \right) \) of the property at node k (vide infra).
Two-class classification is a special case, since \( \sigma_{k} (P) \) is deterministically related to \( \bar{P}_{k} \): it is zero if \( \bar{P}_{k} \) reaches its extreme values 1 or 2, and maximal (equal to 1) when \( \bar{P}_{k} \) is an undecided 1.5. Fuzzy class landscapes rendering the mean class \( \bar{P}_{k} \) of GTM nodes have the peculiarity of informing both about the winning class in each node and the coherence-based trustworthiness of that assumption. With transparence encoding local compound density, they provide a complete picture of the class landscape and its applicability domain.
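A minimal sketch of the multi-class recipe above (per-class cumulated responsibilities, winning class per node, and a margin-based trustworthiness score); the function name and the normalization of the margin by node density are illustrative assumptions:

```python
import numpy as np

def node_class_assignment(R, labels, w=None):
    """Per-node winning class via per-class cumulated responsibilities.

    R      : (K, N) responsibility matrix.
    labels : (N,) integer class labels (1, 2, ...).
    Returns the winning class per node and a coherence-based
    trustworthiness score: the margin of the winning cumulated
    responsibility over the runner-up, normalized by node density
    (an assumed convention).
    """
    R = np.asarray(R, dtype=float)
    labels = np.asarray(labels)
    w = np.ones(R.shape[1]) if w is None else np.asarray(w, dtype=float)
    classes = np.unique(labels)
    # scores[c, k] = sum over compounds n of class c of R[k, n] * w(n)
    scores = np.stack([(R[:, labels == c] * w[labels == c]).sum(axis=1)
                       for c in classes])
    order = np.argsort(scores, axis=0)
    winner = classes[order[-1]]            # class with the largest sum
    density = scores.sum(axis=0)
    best = np.take_along_axis(scores, order[-1][None, :], axis=0)[0]
    second = np.take_along_axis(scores, order[-2][None, :], axis=0)[0]
    trust = (best - second) / np.maximum(density, 1e-12)
    return winner, trust
```

For two classes this reduces to the fuzzy \( \bar{P}_{k} \) picture; for more classes it avoids the “class 2” artefact discussed above.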

2.2.1 Bayesian Weighting to Correct for Class Size Imbalance

If the classes to be discriminated on a map are of very different sizes (classically, the number of inactives when screening a random compound collection is much higher than that of confirmed actives, for example), then it may be interesting to revisit the above discussion in the light of the relative, rather than absolute, predominance of a class in a node. Practically, if the default ratio of actives versus inactives is 1:100 throughout the studied compound collection, then a node populated by one active for only five inactives is still dominated by inactives in terms of absolute “head counts”, and yet enriched in actives by a robust factor of 20. Therefore, it may nevertheless deserve to be highlighted as a node of “actives”. This can be achieved by corrected importance weights for the molecules of each class, taking their default occurrence rates as baseline. Denoting by \( f_{i} \) the total fraction of class i members in the library, the weight w(n) of a compound n belonging to class i should be set to \( w(n)|_{n \in i} = f_{i}^{ - 1} /\sum\nolimits_{j} {f_{j}^{ - 1} } \). In this way, nodes would be “undecidable” at \( \bar{P}_{k} = 1.5 \) if their relative populations of the two classes equal the default occurrences, “active” if the population of actives is higher than the default “hit rate”, or “inactive” otherwise. Figure 7 illustrates this aspect by monitoring the distribution of ChEMBL aromatic versus aliphatic compounds on the GTM nodes (same map as in Fig. 3, see references in that figure legend). Out of the 1.3M ChEMBL compounds, only 83K are completely void of aromatic moieties and were labeled “aliphatic” (class 1), whilst the vast majority—molecules containing aromatic moieties—were assigned to class 2. Plots a and b below correspond to the plain w(n) = 1 scenario and to occurrence-based importance weighting, respectively.
The five-color spectrum maps \( \bar{P}_{k} \) values from 1 (aliphatic class) in red to 2 (aromatic) in blue, with the middle color yellow marking “undecidable” nodes. It can be seen from plot a that nodes in which the purely aliphatic compounds significantly outnumber the ubiquitous aromatic derivatives are rare. If occurrence rate-based importance weights are used, nodes relatively enriched in aliphatics witness a “red shift” of their colors. Finally, plot c of the same figure is identical to plot b, but in bicolor mode, highlighting only the winning class color.
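The occurrence-corrected weights \( w(n)|_{n \in i} = f_{i}^{ - 1} /\sum\nolimits_{j} {f_{j}^{ - 1} } \) can be computed in a few lines (hypothetical function name):

```python
import numpy as np

def bayesian_class_weights(labels):
    """Occurrence-corrected importance weights w(n) = f_i^-1 / sum_j f_j^-1.

    Compounds of a rare class are up-weighted so that a node whose class
    proportions match the library-wide baseline reads as "undecidable"
    rather than as dominated by the majority class.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    f = counts / labels.size               # class occurrence rates f_i
    inv = 1.0 / f
    w_class = inv / inv.sum()              # normalized inverse rates
    class_index = {c: i for i, c in enumerate(classes)}
    return np.array([w_class[class_index[c]] for c in labels])
```

With a 99:1 class ratio, minority compounds receive weight 0.99 and majority compounds 0.01, so a node reproducing the baseline ratio averages out to the undecided \( \bar{P}_{k} = 1.5 \).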

Fig. 7
figure 7

GTM node coloring (each of the 36 × 36 nodes being a small square “tile” of the grid) by average fuzzy class value (plots a and b), where the five-color spectrum maps \( \bar{\varvec{P}}_{\varvec{k}} \) values from 1 (aliphatic class) in red to 2 (aromatic) in blue, with the middle color yellow marking “undecidable” nodes. Plot a is realized in terms of absolute compound numbers per class (out of the 1.3M monitored ChEMBL compounds), whilst plot b monitors the relative enrichment of nodes in terms of aliphatic and aromatic compounds, respectively. Plot c represents the simplified two-color “winning class” landscape of b, where the “undecidable” nodes turn either red or blue, depending on whether their \( \bar{\varvec{P}}_{\varvec{k}} \) value was slightly larger or smaller than 1.5. Node transparency is modulated by their cumulated responsibilities, i.e., the fuzzy count of resident compounds

Plot b is clearly the most informative of the three alternative renderings—it shows that aromaticity/aliphatic character, a physicochemical parameter of key relevance in drug design, defines a major “fault line” on the map, with aliphatics relatively predominant in the north-west. It also gives a clear demarcation of nodes that are robustly dominated by one or the other class, versus mixed ones—which will nevertheless be forcibly declared “aliphatic” or “aromatic” in the traditional “winning class” representation c.

Now, the plots in Fig. 7 might have represented a Kohonen map as well as a GTM, since they focus only on the information associated to the nodes. At this level, the fact that compounds may be fuzzily shared by several nodes would not significantly impact the generic aspect of such plots. On a Kohonen map, a compound is assigned to a node, so it does make sense to show nodes as tiles covering the map. On a GTM, however, a compound shared between several nodes may be imagined as “residing”—in terms of (x, y) latent space coordinates—between the nodes, as shown in Fig. 3. Therefore, logically, the mapped property landscape is also defined over the entire latent space between the nodes and may, in principle, be written as a function \( \bar{P}(x,y) \), to be interpolated—according to various strategies—from the node \( \bar{P}_{k} \) values. For example, \( \bar{P}(x,y) = \bar{P}_{k} |_{k = nearest\,node\,to\,(x,y)} \) is called the “local” extrapolation strategy. By contrast, in the “global” strategy, the \( \bar{P}\left( {x,y} \right) \) associated to a compound n located at (x, y) is not directly inferred from latent space coordinates, but falls back on the responsibilities linking that particular resident to the nodes of given \( \bar{P}_{k} \):

$$ \bar{P}(n) = \sum\limits_{k} {R_{kn} \bar{P}_{k} } $$

Above, the predicted property is a smooth function of responsibilities, so the GTM-specific global property prediction strategy is a genuine regression method for the prediction of continuous molecular properties. In Fig. 8, the landscape \( \bar{P}\left( {x,y} \right) \) is obtained by polynomial interpolation with respect to the values of the four surrounding nodes. In such a landscape, nodes would merely correspond to individual grid points but, in order to highlight their special status, small circles of homogeneous color corresponding to the actual \( \bar{P}_{k} \) values are “cut” out of the smooth, interpolated landscape. Compare the interpolated, GTM-specific fuzzy aromaticity/aliphaticity class landscape below to its “node-only”, Kohonen-map-like counterpart in Fig. 7b.
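A minimal sketch of the global prediction formula above (skipping and renormalizing over “practically empty” NaN nodes is an assumed convention, not prescribed by the text):

```python
import numpy as np

def predict_global(R_new, P_bar):
    """GTM "global" prediction: P(n) = sum_k R[k, n] * P_bar[k].

    R_new : (K, M) responsibilities of M external compounds on the
            colored manifold; P_bar : (K,) node property values.
    Nodes with undefined P_bar (NaN, i.e., "practically empty") are
    skipped and the remaining responsibilities renormalized -- one
    possible convention, assumed here.
    """
    R_new = np.asarray(R_new, dtype=float)
    P_bar = np.asarray(P_bar, dtype=float)
    ok = ~np.isnan(P_bar)
    weight = R_new[ok].sum(axis=0)          # mass on "colorable" nodes
    return (R_new[ok] * P_bar[ok, None]).sum(axis=0) / weight
```

If no node is NaN, this is exactly the responsibility-weighted sum of the equation above.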

Fig. 8
figure 8

Interpolated aromaticity class landscape—red aliphatic, blue aromatic. Compare to its “node-only” rendering in Fig. 7b

2.2.2 Density, Coherence, Applicability

In order to conclude on the key issue of trustworthiness/applicability of GTM-based property landscapes, it is interesting to emphasize that the standard deviation \( \sigma_{k} (P) \) and the mean node-based property \( \bar{P}_{k} \) values are not correlated (except for two-class problems). However, the former—“coherence”, a strong indicator of the trustworthiness of \( \bar{P}_{k} \) values—may be alternatively used, for example, as the color transparency modulation parameter on the map, to produce alternative coherence/property landscapes, which may significantly differ from the above-introduced density/property plots, and herewith provide an independent point of view for chemical space analysis. This is exemplified in Fig. 9, representing three different viewpoints on the octanol–water partition coefficient logP map of the 1.3M ChEMBL compounds. As there are no experimental logP values for the entire ChEMBL, calculated values provided by the ChemAxon tool generateMD (ChemAxon 2007) were used instead. A common property coloring spectrum is used: red for extreme hydrophilics at logP ≤ 0.0, blue for extreme hydrophobes at logP > 6.0, orange–yellow–green for the intermediate ranges. Plot a in Fig. 9 is the “classical” density-modulated representation, which conveys a first image of density-conditioned trustworthiness: empty zones (cumulating the equivalent of less than 1 compound/node, in terms of total responsibility) are obviously unable to support the lipophilicity prediction of any external compound that might be mapped therein. By contrast, plot b is coherence-modulated: all nodes in which the standard deviation \( \sigma_{k} (P) \) exceeds 2.5 logP units are no longer visible, while those with \( \sigma_{k} (P) < 1.0 \) are fully colored. In general, low-density zones are also low-coherence zones.
Therein, \( \bar{P}_{k} \) and \( \sigma_{k} (P) \) are estimated from remotely responsible compounds—basically “random picks” that happen to be the least remote—which are not really descriptive of those zones, and therefore not expected to be coherent in terms of their logP values. However, there are also significantly populated map regions that are not very selective in regrouping compounds according to their lipophilicity. Let us note, at this point, that the considered manifold was never built or selected (Sidorov et al. 2015) in order to maximize its predictive propensity for logP. This notwithstanding, the map nevertheless features many zones in which compounds of roughly similar lipophilicity cluster “spontaneously”. Finally, plot c shows how density and coherence can be combined into a composite “applicability” parameter, defined as the product of density and a coherence penalty factor, the latter reaching its maximum of 1 at σ < 1.0 and its minimum of 0 at σ > 2.5. This applicability score, basically a coherence-modulated density, was used in plot c instead of the “pure” density of plot a, all other setups being equal.
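The composite applicability score might be sketched as follows; the exact shape of the penalty between the two quoted thresholds is not specified in the text, so a linear ramp is assumed:

```python
import numpy as np

def coherence_penalty(sigma, sigma_full=1.0, sigma_cut=2.5):
    """Penalty factor: 1 for sigma <= sigma_full, 0 for sigma >= sigma_cut.

    A linear ramp between the two thresholds is an assumption here; the
    text only fixes the endpoint values.
    """
    sigma = np.asarray(sigma, dtype=float)
    return np.clip((sigma_cut - sigma) / (sigma_cut - sigma_full), 0.0, 1.0)

def applicability_score(density, sigma):
    """Composite applicability: node density modulated by the coherence penalty."""
    return np.asarray(density, dtype=float) * coherence_penalty(sigma)
```

Used in place of the raw density, this score blanks out both empty and incoherent zones of the landscape.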

Fig. 9
figure 9

Three alternative modes to represent the logP landscape of ChEMBL compounds: a density-modulated, b coherence-modulated, c applicability score-modulated. The used map has been introduced in Fig. 3

2.2.3 Building High-Quality GTMs: Properly Choosing Key GTM Parameters

Let us re-emphasize, at this point, that obtaining property landscapes like the ones shown above is a process involving two clearly distinct steps:

  1. the actual unsupervised map (manifold) construction, based on a frame set, and

  2. subsequent (supervised) learning or “coloring” of this map, based on a—potentially different—training set.

Note, furthermore, that any manifold from step 1 may be, in principle, independently used in many alternative coloring attempts, in as far as the herein used training sets are not too remote from the frame-set-based manifold, as already mentioned.

Some options/parameters concern only the unsupervised manifold fitting step 1. These include the four GTM setup parameters—node number K (required to be a perfect square integer), the number of radial basis functions (RBFs) M, the RBF width factor w and the weight regularization coefficient λ—in addition to the frame set and descriptor choices, which can be formally regarded as additional degrees of freedom, or “meta-parameters”. By contrast, the choice of possible coloring/interpolation procedures required to build the property map does not affect step 1 at all—any given manifold is in principle exploitable for both regression and classification, based on either of the above-mentioned “local” or “global” approaches.

All these (meta-)parameters have an impact on the quality of the final predictive model supported by the manifold. Model quality is the key objective criterion for validating the proposed manifold. Without it, the “beauty” of a map is the only criterion to decide whether the chosen grid size is “correct”, whether the choice of a different set of molecular descriptors would have improved the mapping, etc.

Coupling visualization with prediction is therefore a key benefit of the GTM approach. Thus, one may formulate GTM construction as a combinatorial optimization problem. Given all the possible choices of the seven already-mentioned (meta-)parameters (designation of a frame set and of a molecular descriptor type, out of the respective lists of possible choices, plus selection of the K, M, w and λ values and of the landscape interpolation strategy), which choice will produce a map optimally rendering the one or more targeted property landscape(s)? It is understood that “optimal rendering” of a property landscape means maximizing its predictive power: placing an external compound (not used in the “coloring” process) on the colored map, in order to “read” the predicted property at the given location, should return values in good quantitative agreement with experiment. Thus, map quality will be measured in terms of classical statistical validation criteria—cross-validated determination coefficients Q², for example. To design a multicompetent map able to support more than a single predictive model, the “compromise” mean Q² might be used as a global criterion (optionally including a penalty for high standard deviations of Q², in order to discourage setups with either extremely good or extremely bad results for the different monitored properties). One may alternatively consider a multiobjective optimization strategy, defining a Pareto front of locally best solutions for each of the monitored properties. The search for (near-)optimal setups in the seven-dimensional parameter space cannot be done systematically, since the calculation of map goodness criteria may be a very time-consuming undertaking.
Recall that this implies (1) fitting the manifold, given the descriptor choice, the frame set choice and the four GTM parameters, and (2) cross-validated manifold coloring/prediction cycles for each of the targeted properties, based on the property-specific training sets. Therefore, stochastic search strategies—computer cluster-deployed genetic algorithms, for example—are well suited for optimal mapping parameterization.
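As a minimal illustration of such a stochastic search—the text suggests genetic algorithms, but for brevity a plain random search over the discrete setup space is sketched here—with hypothetical choice lists and a dummy scoring stub standing in for the expensive fit-and-cross-validate step:

```python
import random

# Hypothetical discrete choice lists for the seven (meta-)parameters.
SEARCH_SPACE = {
    "frame_set":   ["frame_A", "frame_B"],            # assumed names
    "descriptors": ["fragments_I", "fragments_II"],   # assumed names
    "K":           [625, 1024, 1296],                 # perfect squares
    "M":           [100, 196, 400],
    "w":           [0.5, 1.0, 1.5],
    "lambda":      [0.1, 1.0, 10.0],
    "strategy":    ["local", "global"],
}

def evaluate(setup):
    """Stand-in for the expensive step: fit the manifold for this setup,
    run cross-validated coloring/prediction for each monitored property,
    and return the mean cross-validated Q2. A deterministic dummy score
    is returned here."""
    return random.Random(str(sorted(setup.items()))).random()

def random_search(n_trials=50, seed=42):
    rng = random.Random(seed)
    best_setup, best_q2 = None, float("-inf")
    for _ in range(n_trials):
        setup = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        q2 = evaluate(setup)
        if q2 > best_q2:
            best_setup, best_q2 = setup, q2
    return best_setup, best_q2
```

A genetic algorithm would replace the independent random draws with crossover/mutation of promising setups, but the costly `evaluate` step is identical in both cases.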

Since a manifold need not be tailor-made to specifically serve as support of a single dedicated model, one may ask whether it is possible to build manifolds that successfully serve as support not only for the property(ies) for which they were optimized (vide supra), but also for many other distinct and diverse structure-property models. The results obtained so far (Sidorov et al. 2015) in this quest for an arguably “Universal” GTM are very encouraging, having led to manifolds that proved to be valuable supports for hundreds of distinct predictive models, for properties as diverse and unrelated as target-specific activities, antiviral and antibiotic properties, and physico-chemical properties. The maps used to exemplify the various issues discussed here are all, unless otherwise stated, “Universal” maps centered on the drug-like chemical space as represented in the ChEMBL database.

3 Chemical Space Analysis Using GTMs

The following sections focus on the various ways of using GTMs for the rational and intuitive understanding of chemical space and, implicitly, for library design. This covers topics as diverse as comparing different large compound libraries, or designing libraries with any desired coverage “pattern” of chemical space—both maximal-diversity subsets and focused libraries putatively enriched in bioactives of a desired class.

3.1 GTM-Based Compound Library Comparison

This topic has been extensively covered in previous publications (Gaspar et al. 2013, 2014, 2015), and therefore only a brief reminder of the underlying principles will be given here. The key concept is the representation of any compound library by its cumulated responsibility vector—the “density” at each node. This renders any library, of arbitrary size, as a single K-dimensional vector, which is a mathematical object of the same class as \( (R_{kn} ) \), the molecular responsibility vector, i.e., the density vector of a “library” composed of one molecule, n. For a library L, the descriptor vector of cumulated responsibilities can be formulated as \( \left( {\sum\nolimits_{n \in L} {R_{kn} } } \right) \). Therefore, two libraries, L and Λ, can be straightforwardly compared by taking some distance/dissimilarity score (Euclidean, 1 − Tanimoto, etc.) of their characteristic vectors, \( \left( {\sum\nolimits_{n \in L} {R_{kn} } } \right) \) versus \( \left( {\sum\nolimits_{{n \in\Lambda }} {R_{kn} } } \right) \). This is, first of all, extremely fast compared to calculating the pairwise intermolecular dissimilarity scores of all members of L versus all members of Λ. If the distance metric is based on some covariance score which is independent of the absolute magnitudes of the two vectors, such as the cosine metric \( \vec{x}\vec{y}/\left( {\left\| {\vec{x}} \right\|\left\| {\vec{y}} \right\|} \right) \), then two libraries with identical pro-rata representations in all GTM nodes will be reported as identical, irrespective of their sizes—as, for example, a representative “core” subset of a large collection versus this parent library. Library comparison can be intentionally rendered size-insensitive, all metrics confounded, by explicit normalization of the cumulated responsibility vectors with respect to library size.
If, furthermore, nodes were assigned mean characteristic property values \( \bar{P}_{k} \), these may be used as weighting factors in library comparison metrics. In order, for example, to bias the library comparison towards the nodes which are enriched in actives for a given target—\( \bar{P}_{k} \) representing, for example, the mean pIC50 value of actives residing on node k—library comparison should use the vectors \( \left( {\sum\nolimits_{n \in L} {\bar{P}_{k} R_{kn} } } \right) \) versus \( \left( {\sum\nolimits_{{n \in\Lambda }} {\bar{P}_{k} R_{kn} } } \right) \). This would implicitly focus more on the relative populations of nodes with high mean pIC50 values. Note that the map “coloring” used to obtain the \( \bar{P}_{k} \) values need not be based on any experimental pIC50 of compounds from the actually compared L and Λ—any other independent “color” training set can be used for this purpose. As always, it is necessary to ensure that the used manifold is “competent” to accommodate the compounds of L, Λ, and the putative color set, as already discussed in the previous chapter.
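The vector-based comparison reduces to a column sum and a cosine; a sketch with illustrative names, including the optional \( \bar{P}_{k} \) node weighting:

```python
import numpy as np

def library_vector(R, member_idx, node_weights=None):
    """Cumulated responsibility vector of a library: sum of the molecular
    responsibility vectors (columns of the (K, N) matrix R) of its members.
    Optional node_weights (e.g., mean pIC50 per node) bias the comparison
    towards nodes enriched in actives."""
    v = np.asarray(R, dtype=float)[:, member_idx].sum(axis=1)
    return v if node_weights is None else v * np.asarray(node_weights)

def cosine_similarity(x, y):
    """Size-independent similarity of two library vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

As discussed above, a representative core subset and its parent library yield proportional vectors and hence a cosine similarity of 1, regardless of their sizes.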

Alternatively, library comparison can be formally treated like a classification problem. If compounds of L are, arbitrarily, considered of class 1, whilst Λ members are assigned the class label 2, then the fuzzy mean class \( \bar{P}_{k} \) associated to the nodes will intrinsically reflect the (absolute or relative, vide supra) local dominance of either of the libraries, in terms of density. A GTM image consisting of perfectly separated “patches” of red and blue means that the chemical spaces covered by the libraries do not overlap at all. A homogeneously yellow landscape, corresponding to \( \bar{P}_{k} = 1.5\, \forall k \), means that the local densities of both libraries are quasi-identical all over the space. The former scenario would correspond to a Tanimoto score of 0, whilst the latter means Tanimoto = 1 in terms of cumulated responsibility vectors, as discussed above. In practice, one expects both zones of significant overlap and zones of separation to coexist: this would correspond to some intermediate score in terms of quantitative library comparison. However, the class landscape is much more information-rich than a simple Tanimoto score, because it conveys node-by-node information, rather than the final “verdict” condensed into a single score value. The left side of Fig. 10 represents such a class landscape, comparing the 1.3M ChEMBL compounds to a roughly equally large collection of 1.4M commercial compounds of various sources, curated for High-Throughput Screening compliance (Horvath et al. 2014). The “blue” chemical space zones that are clearly overpopulated with commercial compounds are clearly visible. Furthermore, comparing this class landscape to the lipophilicity landscape on the right side (same as in Fig. 9a) immediately reveals that the “commercial” chemical space is almost always associated to moderately hydrophilic compounds.
It is, of course, straightforward to visualize representatives of either “blue” or “red” zones, as examples of collection-specific molecules.

Fig. 10
figure 10

(Left) Fuzzy mean class landscape (with Bayesian weighting) of the comparative map of 1.3M ChEMBL compounds (class 1, red) versus 1.4M curated molecules from commercial sources (class 2, blue). The used map is the one introduced in Fig. 3. (Right) The lipophilicity landscape already shown in Fig. 9a is added aside for comparative purposes: it can be seen that the chemical space dominated by commercial compounds corresponds to several zones of moderate lipophilicity

Library comparison may furthermore be easily modulated and made to focus on particular chemical space zones. For example, in Fig. 11, the comparison of ChEMBL to the above-mentioned commercial compound library has been revisited from the perspective of two different medicinal chemists—one interested in GPCR research, the second active in the field of kinase inhibition. To this purpose, compounds of interest were selected for each research domain—here, the predefined ChEMBL “SARfari” subsets for GPCRs and kinases, respectively. These were mapped, generating the white–grey–black density-modulated landscapes shown as miniatures below. The latter can be understood as problem-specific “masks” one would like to use in order to focus on chemical space zones of interest. Logically, this is the same as redefining the Applicability Domain of the map by means of the specific density of the “compounds of interest”.

Fig. 11
figure 11

Landscapes a and b represent the same “ChEMBL versus commercial” class landscape as in Fig. 10, now restricted to the compounds matching only Responsibility Patterns that were encountered at least ten times amongst a the GPCR SARfari and b the Kinase SARfari ChEMBL subsets. Associated to a and b are the density plots of the cited ChEMBL subsets, in density-modulated white–grey–black

Practically, the most straightforward way to apply such a filter is to

  • extract the Responsibility Patterns (RPs) for all SARfari “compounds of interest”.

  • establish a list of robustly reoccurring RPs, each representing at minimum 10 compounds of interest. On the herein used map, the ~115K compounds of the GPCR SARfari set cover 458 distinct RPs, while the less numerous (~51K) kinase SARfari compounds are responsible for 296 RPs.

  • discard, from both ChEMBL and commercial libraries, all the compounds having RPs other than the ones kept in the above list.

  • rebuild the fuzzy, mean class landscapes with the remaining representatives of the two libraries.
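The four steps above might be sketched as follows, assuming each compound’s RP has already been computed and is available as a hashable value (function names are illustrative):

```python
from collections import Counter

def robust_patterns(interest_rps, min_count=10):
    """Steps 1-2: keep only the RPs occurring at least `min_count` times
    among the compounds of interest."""
    return {rp for rp, c in Counter(interest_rps).items() if c >= min_count}

def filter_by_patterns(compounds, rps, allowed):
    """Step 3: discard compounds whose RP is not in the allowed set."""
    return [cmpd for cmpd, rp in zip(compounds, rps) if rp in allowed]
```

Step 4 then simply reruns the fuzzy mean class landscape calculation on the filtered libraries.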

In the example shown above, the compounds of interest were chosen to be rather large and heterogeneous sets, which clearly do not contain only actives with respect to the cited target class. However, focusing on a wanted chemical space zone is extremely flexible: any set of RPs can be used, whether they come from validated bioactives of minimal potency, from compounds predicted to be active by QSAR models, from promiscuous/specific compounds, etc.

3.2 GTM-Based Diversity Analysis

Let us consider the classical task of extracting a core subset of c% from a large library (here, ChEMBL) with maximal representativeness/diversity. GTMs are—like Kohonen maps—extremely useful both for proposing such a core subset and for a posteriori analysis of its relation to the (unselected) remainder of the parent library. Mapping is a key time-saving step in diversity analysis, because it provides an implicit “clustering” of molecules, by binding them to specific positions on the map. Molecules mapping to distinct locations—associated to different neurons on a Kohonen map, and, respectively, to distinct Responsibility Patterns (RPs) on a GTM—are implicitly considered “diverse”. Conversely, molecules which are assigned to a common location are indistinguishable, as far as mapping can tell. Here, GTM has the advantage of higher resolution: at an equal number of nodes, the GTM supports more distinct RPs than the Kohonen approach, with its binary compound-to-node assignment scheme. A rational core extraction strategy supported by GTM would therefore amount to picking controlled numbers of compounds from the clusters associated to the detected RPs. This is extremely fast—the estimation of O(N²) intermolecular dissimilarity scores is completely avoided.

The most straightforward diversity selection strategy would therefore be a pro rata draw: in order to pick a representative core of c% molecules, (randomly) pick c% of the representatives of every detected RP. First, representatives of a given RP are, as already discussed, basically expected to be rather similar, and/or share some common structural traits. Therefore, in a “generic” library subsetting exercise, when there are no specified targets for screening the core library, there is little rationale to prefer one particular compound over all the other representatives of a given RP. Note that, in principle, one may use a classical diversity algorithm in the initial descriptor space (Agrafiotis 1997; Maldonado et al. 2006; Turner et al. 1997) for the selection, ensuring that the RP-specific subset of c% avoids, as much as possible, the inclusion of “redundant” compounds such as methyl/normethyl analogues. Even so, the computational effort would remain reasonably low, since local comparison would concern the limited number of items associated to a common RP. This was not pursued in this example, for three main reasons. First, the similarity threshold (Horvath et al. 2013) at which two similar molecules may be safely considered redundant is ill-defined and, at best, problem-specific. Second, the manner in which RP representatives are picked has no impact on the chemical space coverage as perceived by the GTM. Third, in practice two similar molecules may nevertheless happen to be assigned to different RPs because of binning artefacts. Map-based diversity selections are coverage-oriented, but do not formally guarantee the absence of redundant compounds. Therefore, if non-redundancy (whatever its definition) is a key issue, the optimal strategy is to generate a slightly larger-than-needed map-driven core selection, to be further refined by elimination of redundant compounds.
This latter step will be relatively fast, since it is limited to the small core instead of the large library. Note that the design of larger-than-needed cores is standard protocol rather than the exception: in practice, logistic bottlenecks, compound purity/solubility, etc., will strongly impact the final compound selection. Therefore, diversity selection should be kept conceptually simple and fast—extensive number-crunching to come up with an ideal list of compounds that were just taken off the vendor’s shelf, or are offered at unacceptable prices, makes no sense. GTM-based selection is fast and powerful in terms of coverage control.

Mean class landscapes, denoting the core as the “blue” class 2 and the remainder of the parent library as “red” class 1—mandatorily using Bayesian weighting, as class 2 is by definition a minority—are perfect indicators of the representativeness of the core. At perfect pro rata sampling, and after compensation for subset sample sizes, a homogeneously yellow landscape, corresponding to \( \bar{P}_{k} = 1.5 \,\forall k \), should be obtained. This signals that, at any point of chemical space, the core subset molecules reflect the original compound density of the parent library, being neither oversampled (blue spots) nor undersampled (red spots). Figure 12 represents such mean class landscapes, obtained by (top) random drawing and (bottom) pro rata sampling of RPs, for cores representing—from left to right—50, 10, 1, 0.1 and 0.01% of ChEMBL.

Fig. 12
figure 12

Mean class landscapes, with Bayesian weighting, denoting the core as the “blue” class 2 and the remainder of the parent library as “red” class 1, at decreasing core size c% (numbers below), generated either by random draw of ChEMBL compounds (top row), or by pro rata draw of compounds from every detected responsibility pattern (bottom row)

Clearly, one half of the 1.3M ChEMBL compounds does indeed strongly resemble the other, and even 10% of ChEMBL is still seen to represent the remaining 90% well—even without resorting to any subsetting more sophisticated than a plain random draw. With cores of 1% or less, it becomes increasingly difficult to include representatives of every chemical space zone—hence the clear “red shift” in the upper series of landscapes. Often, randomly picked compounds may stem from a relatively thinly populated chemical space zone—within the much smaller core, their relative importance implicitly becomes very high, and they are perceived as “oversampling” their respective chemical space zones. Therefore, the above-mentioned “red shift” is accompanied by a polarization of the landscape—the emergence of a few oversampled blue “islands” in the “sea” of undersampled space.

By contrast, cores produced by pro rata draw from every RP show the characteristic “red border” effect in the lower series of landscapes. This is an implicit consequence of the existence of many sparsely represented RPs, with fewer than 100/c members, which will therefore contribute none of their members to the selection. Even at 50%, “singleton” RPs, each associated to exactly one molecule (there are roughly 15K such patterns, out of a total of 23K distinct RPs observed for ChEMBL compounds on the given map), cannot contribute to the selection. They provide the population of the low-density “border” regions, which will not make it into the core selection—hence the observed “red border” effect. By contrast, it can be seen that the selection within the zones that can be sampled at the given core size is much more homogeneous—there is clearly less polarization in the series associated to pro rata draws.

Alternatively, one may proceed to a “flat” draw of an equal number of representatives from each of the RPs exceeding a certain population level. The left-most density landscape (a) in Fig. 13 features a ChEMBL core of 23K compounds—one representative for each of its 23K distinct responsibility patterns. This is compared to a core of similar size obtained by random drawing—its density trace (b) can be seen to be relatively less homogeneous, presenting some clearly highlighted diversity holes that are covered by the “flat” core.
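Both the pro rata and the “flat” draw reduce to grouping compounds by RP and sampling within each group; a minimal sketch with illustrative names:

```python
import random
from collections import defaultdict

def group_by_rp(compounds, rps):
    """Implicit GTM "clustering": bin compounds by Responsibility Pattern."""
    groups = defaultdict(list)
    for cmpd, rp in zip(compounds, rps):
        groups[rp].append(cmpd)
    return groups

def pro_rata_core(compounds, rps, fraction, seed=0):
    """Randomly draw ~fraction of the members of every detected RP.
    Sparse RPs with fewer than 1/fraction members contribute nothing --
    the origin of the "red border" effect discussed above."""
    rng = random.Random(seed)
    core = []
    for members in group_by_rp(compounds, rps).values():
        core.extend(rng.sample(members, int(fraction * len(members))))
    return core

def flat_core(compounds, rps, per_rp=1, min_size=1, seed=0):
    """Draw an equal number of representatives from every RP whose
    population reaches `min_size` (a hybrid pro rata/flat scheme
    when min_size > 1)."""
    rng = random.Random(seed)
    core = []
    for members in group_by_rp(compounds, rps).values():
        if len(members) >= min_size:
            core.extend(rng.sample(members, min(per_rp, len(members))))
    return core
```

Note how singleton RPs are silently skipped by the pro rata draw, whereas the flat draw includes them unless `min_size` excludes them.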

Fig. 13
figure 13

a Density landscape of the “flat” ChEMBL core, featuring one randomly picked representative for each of the detected 23K distinct responsibility patterns on this map (same as described in Fig. 3). b Randomly drawn core of equivalent size (~1.8% of ChEMBL). Connectors highlight diversity holes of the latter, covered by the “flat” selection

Which of pro rata and flat diversity selection strategies are best-suited is a context-dependent problem. The key message here is that GTMs, exploiting the RP-based default “clustering” of molecules, is perfectly operational in diversity selection, irrespective of the used approach. One may, for example, perform a flat selection but only based on RPs with a minimum level of occurrence—which can be seen as a hybrid pro rata/flat approach. Such could be very useful if one wishes to maximize coverage all while ensuring that selected compounds are no singletons—i.e., close analogues thereof are available, in order to support a quick harvesting of structure-activity data after primary hit confirmation. Furthermore, diversity sampling may well be associated to already known structure-activity data or any other filters for “interesting” chemical space zones. As shown in the previous chapter, library comparison can be biased towards specific chemical space zones—or, diversity selection is just an application of library comparison. Or, a key advantage of a GTM is the ability to validate the proposed map, in terms of its propensity to discriminate actives from inactives, and to quantitatively predict molecular properties. A map shown to be a competent support for classification and regression models is therefore compliant with the molecular similarity principle and proposes a chemically meaningful “image” of chemical space. As such, diversity selections based on this map are also likely to fulfill the expectation of picking all the “iconic” distinct chemotypes or pharmacophores. By default, diversity selection is tributary to the initial choice of molecular descriptors, dissimilarity metric, etc. Whatever those choices, a diversity selection will emerge—and it will heavily depend on those choices. Or, as already discussed, it is very difficult to establish any “objective” quality criteria for a diversity selection aimed at designing a general-purpose screening library. 
Thus, the final “verdict” about the pertinence of a diversity selection can only be given a posteriori, after experimentally screening the selected library core. Instead, if one relies on a map built and shown to be similarity-principle compliant with respect to various different biological activities, the descriptor choice and the dimensionality reduction parameters (defining the manifold) have already been made and validated on the basis of quantitative statistical criteria of predictive models. If the library to be sampled is seen to fall within the Applicability Domain of such a map, the “competence” of the map in previously tackled predictive problems may be accepted as a warrant for a meaningful diversity subsetting.
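The hybrid pro rata/flat strategy evoked above—flat selection restricted to RPs above a minimum occurrence—can be sketched in a few lines. The sketch assumes each compound has already been reduced to an RP label string by the GTM mapping; the function name, the label format, and the `min_occurrence` parameter are illustrative conventions, not part of the original workflow.

```python
import random
from collections import defaultdict

def hybrid_flat_selection(rp_labels, min_occurrence=2, seed=0):
    """Flat diversity selection over Responsibility Patterns (RPs):
    pick one random representative per RP, but only from RPs matched by
    at least `min_occurrence` compounds (hybrid pro rata/flat scheme)."""
    by_rp = defaultdict(list)
    for idx, rp in enumerate(rp_labels):
        by_rp[rp].append(idx)
    rng = random.Random(seed)
    # One representative per sufficiently populated RP: maximal coverage,
    # while guaranteeing each pick has close analogues in the library.
    return [rng.choice(members)
            for rp, members in sorted(by_rp.items())
            if len(members) >= min_occurrence]

# Toy example: eight compounds with hypothetical RP labels; the singleton
# RP "99:3" (compound 5) is excluded from the selection.
labels = ["12:3", "12:3", "45:2;46:1", "45:2;46:1",
          "45:2;46:1", "99:3", "7:3", "7:3"]
picked = hybrid_flat_selection(labels, min_occurrence=2)
```

Note that no pairwise dissimilarity matrix is ever needed: the cost is a single pass over the RP labels.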

3.3 Privileged Responsibility Patterns

Consider a specific subset l of a larger compound library L, consisting of all molecules of L that have a given property—for example, all the compounds that are associated with a biological target or, alternatively, all the compounds found active against a given target. Suppose that, out of these “specific” molecules from l, there is a fraction \( f_{l} (RP) \) of compounds representing a given Responsibility Pattern RP—according to a given GTM model. Let \( f_{L} (RP) \) represent the baseline occurrence rate of this RP, i.e., the overall fraction of RP-matching molecules over the parent library L. If L is a large compound collection, representing a significant sample of the so-far synthesized and tested organic compounds, then any RP found to occur much more often in l, i.e., \( f_{l} (RP)\, \gg \,f_{L} (RP) \), can be considered as privileged within l. A privilege score

$$ \pi = f_{l} (RP)/f_{L} (RP) $$

may thus be defined. Since l is defined in terms of a specific property shared by its members, it is straightforward to link this privileged status to the property. Of course, correlation never implies causality (Horvath 2010), but it is tempting for medicinal chemists to “relate” a given pattern to a given activity. If, for example, every second active is seen to match that pattern (\( f_{l} = 0.5 \)), whereas the same pattern is encountered in only one commercial compound out of 100 (\( f_{L} = 0.01 \)), this provides a rationale to specifically design and synthesize more molecules containing the pattern. The patterns which medicinal chemists love to monitor are scaffolds—hence the “privileged scaffold” (Evans et al. 1988; Kubinyi 2006) paradigm, a very popular pedagogical method aimed at systematizing the relationships between scaffolds and therapeutic classes. Yet, it cannot be taken for granted that the best structural motif to analyze is, indeed, a single scaffold—specific non-cyclic fragments, scaffold families, or pharmacophores may also have a “privileged” status. The advantage of exploiting RPs in the quest for privileged patterns is that mapping a compound on a GTM automatically defines its RP, which can a posteriori be related to the underlying structural motif (as already discussed; see the chapter introducing the RP concept).
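Once RP labels are available for both the subset l and the parent library L, computing \( \pi = f_{l}(RP)/f_{L}(RP) \) is elementary. The sketch below is a minimal illustration; the function name and toy data are hypothetical, with the toy library constructed to reproduce the \( f_{l} = 0.5 \), \( f_{L} = 0.01 \) case discussed above.

```python
from collections import Counter

def privilege_scores(library_rps, subset_rps):
    """Privilege score pi = f_l(RP) / f_L(RP) for every RP occurring in
    the property-specific subset l, relative to the parent library L."""
    f_L = Counter(library_rps)   # RP occurrence counts in the library L
    f_l = Counter(subset_rps)    # RP occurrence counts in the subset l
    n_L, n_l = len(library_rps), len(subset_rps)
    return {rp: (count / n_l) / (f_L[rp] / n_L)
            for rp, count in f_l.items()}

# Toy example: RP "A" covers 1% of the library but 50% of the subset,
# hence pi = 0.5 / 0.01 = 50.
L_labels = ["A"] * 1 + ["B"] * 99
l_labels = ["A"] * 5 + ["B"] * 5
scores = privilege_scores(L_labels, l_labels)   # scores["A"] == 50.0
```

In practice one would also discard RPs with too few subset occurrences, since a ratio of small counts is statistically unstable.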

In a previous publication (Klimenko et al. 2016), we exemplified the detection of RPs preferentially appearing within compound sets of confirmed antiviral properties and traced these RPs back to the underlying specific structural motifs. In some cases, the underlying structural motif shared by all compounds of a given privileged RP happened to be indeed a “privileged scaffold”. More often, this was not the case—RP members could alternatively share much fuzzier common structural traits (many ATP mimics featuring an anion-linker chain-heterocycle “pharmacophore-like” pattern were, for example, regrouped under a common RP). The opposite was also observed: RPs based on a well-defined scaffold with specific substitution patterns at specific points.

In the following example of privileged RP analysis, the ChEMBL database will be used as the “baseline” library L, with respect to which the default occurrences \( f_{L} (RP) \) of the RPs from the previously introduced “Universal” GTM (Fig. 3) will be defined. Subsets of ChEMBL compounds associated with—i.e., tested on—a given human biological target T from ChEMBL were used as “property-specific” subsets l, for each target with more than 500 associated compounds. Thus, the “property” that all members of a subset l have in common is not their strong affinity for a given target T, but the fact that they were tested on target T, irrespective of the result. This may seem odd, but the shared feature providing a common identity to all members of a subset l is the fact that they were all considered—rightly or wrongly—worth testing on target T according to experts in the field. Therefore, the privileged RPs highlighted here are not the RPs privileged by the target—that is, the RPs seen to significantly enhance the chance of obtaining an active on that target—but rather the RPs privileged by the know-how of medicinal chemists, believed by medicinal chemists to relate to a given target. This analysis is therefore not a rigorous structure-activity relationship study, but rather a trend analysis of the human factor in drug design. An alternative analysis—in terms of rigorously measured activities—could be performed as well and, if the medicinal chemists’ intuition was correct, it should conclude that the patterns privileged by the target are the same as the ones privileged by the chemists. On the contrary, if a target has been subjected to “carpet bombing” by High-Throughput Screening of random libraries, no privileged RPs should emerge at all, since little or no know-how was used to associate those randomly picked screening candidates with the target.

The privilege score π has been calculated for each of the RPs, over all considered targets. Figure 14 locates on the map the five RPs (five nodes, as it turned out that all concerned RPs were single-node) with the absolute largest π scores across all targets. Each of these RPs is matched by a compound set of rather modest size (between 131 and 735 compounds), and “snapshots” of representative compounds are shown.

Fig. 14
figure 14

Location, on the ChEMBL map (see Fig. 3 legend), of the five RPs (all single-node) with top privilege scores, shown as colored nodes against the grey density plot of the entire ChEMBL. Representative samples of compounds matching each RP are shown. Next to the associated structure tables, the target or targets privileging each RP are indicated, together with the actual privilege score π of the RP with respect to the target (listed as “×π”)

In red, the node reaching the absolute highest privilege factor corresponds to a structurally homogeneous series. This series is, strictly speaking, not based on a single “privileged” scaffold defined as a single cyclic moiety, but on an expanded aryl-oxadiazole-cyclohexyl core, with heteroatoms allowed in different positions of the aryl and cyclohexyl moieties. Such compounds are encountered within the set of compounds associated with SMO, the “smoothened” frizzled GPCR, with a frequency 1757-fold higher than the default one in the entire ChEMBL database. Out of the 517 molecules associated with SMO in ChEMBL, 92 are representatives of the “red” RP, which gathers 131 compounds in its associated cluster. The other target showing a still significant, 12-fold enrichment of compounds from this class within its set of associated molecules is the ion channel hERG. Note that GPCRs and ion channels are expected to privilege the same structural patterns, as many ligands binding to macromolecules of both classes are known.

In orange, the RP privileged by trypsin, with a factor of 1080, matches a series of artificial peptidomimetics: rather linear compounds, with at least three aromatic rings connected by flexible linkers (note the high occurrence of oxadiazole rings, though not in the same context as in the SMO binders above), often seen to embed actual amino acids (proline, lysine) next to the non-peptidic moieties. Given that the chosen target is a protease, this makes perfect sense.

In yellow, compounds matching the third RP are even more strikingly peptide-like, consisting of several small (artificial or amino-acid) building blocks interconnected by amide bonds. The most often recurring amino acid is glutamate, bringing a net negative charge to the species. This RP is privileged by nucleoside- and peptide-binding GPCRs and—again—proteases. Thus, the featured GTM possesses at least two—not very remote—zones dedicated to “peptide-like” molecules, unsurprisingly associated with proteases.

Further privileged RPs cover complex patterns evoking natural product derivatives, and the analysis could be pursued for each of the significantly populated RPs (504 of them are represented by more than 100 compounds each in the ChEMBL projection on the current map). A GTM may be manually annotated with respect to the targets privileging each RP—and, as a direct consequence, compounds matching a privileged RP but not yet tested on the corresponding target are candidates of choice for further testing.

4 Conclusions

After briefly revisiting the principles of Generative Topographic Mapping as a dimensionality reduction tool in chemoinformatics, this chapter specifically focused on the applications of GTMs to the analysis of chemical space. The key feature that dramatically enhances the analysis of chemical space through the prism of a GTM is the rendering of compounds by their responsibility vectors, representing fuzzy, real-valued probabilities of residence of a compound on every GTM node. Whereas on a Kohonen map the statement “compound n resides in node k” is either correct or false, following binary logic, on a GTM the fuzzy truth value of the above statement is nothing but the responsibility value \( R_{kn} \). Therefore, at an equal number of nodes, a GTM is much more information-rich than a Kohonen map. Albeit the latter appears to be better suited as a compound clustering tool—all residents of a node belong to the same cluster—it was shown that “binning” the responsibility values straightforwardly converts this real-valued vector into a short Responsibility Pattern (RP). An RP represents the non-zero responsibility values after binning, in conjunction with the node numbers to which they pertain, under the compact form of a string, or label, and may serve as a clustering criterion just as readily as the Kohonen node number: all compounds matching a common RP label are regarded as members of the same cluster. In the—rather frequently occurring—situation of a responsibility vector dominated by a single node, the associated “single-node” responsibility pattern is formally identical to the node number identifier in the Kohonen scenario.
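As a rough illustration of the binning step recalled above, the sketch below converts a responsibility vector into an RP label string. The number of bins, the discretization rule, and the noise threshold are assumptions made for illustration only—they do not claim to reproduce the exact discretization scheme used in the chapter.

```python
import math

def responsibility_pattern(responsibilities, n_bins=3, threshold=0.01):
    """Convert a compound's responsibility vector into a compact RP label.

    Responsibilities below `threshold` are treated as zero; the remaining
    values are binned into integer levels 1..n_bins and concatenated with
    their node numbers into a string usable as a clustering criterion.
    (Binning scheme chosen for illustration only.)
    """
    parts = []
    for node, value in enumerate(responsibilities):
        if value < threshold:
            continue
        level = min(math.ceil(value * n_bins), n_bins)
        parts.append(f"{node}:{level}")
    return ";".join(parts)

# A vector dominated by one node yields a single-node RP, formally
# analogous to a Kohonen node assignment:
responsibility_pattern([0.0, 0.005, 0.0, 0.0, 0.0, 0.0, 0.0, 0.98])  # "7:3"
```

All compounds sharing the returned label would then be treated as one cluster, exactly as residents of a common node are in the Kohonen case.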

The fuzzy nature of GTMs versus the binary nature of Kohonen maps, and the therefrom emerging ability of the former to accommodate a much larger number of RPs at a given number of nodes, has a direct impact on the quality (structural coherence) of the clusters defined by RPs. It is well known that some Kohonen “garbage” nodes will “specialize” in accommodating items which do not fit into any other node—but need to be mapped somewhere nevertheless. By contrast, in GTMs, such “exotic” compounds, tending to lie far away from the manifold in the initial descriptor space, will typically be assigned, fuzzily, to many different nodes, so that single-node RPs will, in general, tend to regroup items which actually share some significant, common structural pattern. The more populous an RP, the more difficult it is, statistically, to ensure that the entire set of regrouped compounds is structurally homogeneous. It was found that, out of the three most densely populated single-node RPs in ChEMBL (all three being “borderline” nodes at the map edge), only one could be tentatively labeled as a “garbage” node—the others are preferentially populated by Michael acceptors and oxyanionic compounds, respectively. This also illustrates that the nature of the “significant, common structural pattern” assembling the compounds under the same RP label is open-ended and self-adaptive: it may be a substructure (but not necessarily a ring scaffold, as put forward by medicinal chemists), a set of related substructures, a common pharmacophore or, perhaps even less precisely defined, a size constraint. Actually, the members of the node tentatively discarded as “garbage” do have something in common: their size, closer to that of typical fragments (in Fragment-Based Drug Design) than to actual drug molecules.

Albeit property prediction with GTMs is not the main topic of this contribution, this very important issue has nevertheless been discussed. First, the fact that a map can be shown to support quantitative or class-based predictive modeling provides a rigorous quality assessment of the map, something which is not provided by its other applications, such as visualization and library comparison. Second, library comparison and diversity selection—or any other form of chemical space analysis—will benefit from the knowledge contained in “property landscapes” obtained by coloring the map with diverse structure-property data. Property prediction with GTMs also provided the occasion to discuss the matters of compound density, coherence of mapped properties and, in general, Applicability Domain-related issues with GTMs.

Next, the problem of library comparison with GTMs was revisited, on the basis of class landscapes comparing the ChEMBL collection to a roughly equally sized set of commercial compounds. It was shown how class landscapes can be used to rapidly identify “unbalanced” zones, dominated by either of the compound collections. Reading such landscapes in parallel with property landscapes allows an immediate estimate of the properties of molecules in the unbalanced chemical space zones. Finally, any third-party compound set—here, ChEMBL subsets from the GPCR and Kinase SARfari projects, respectively—can be used as a filter, specifically focusing the comparison of the two libraries onto chemical space zones deemed “of interest” for the ongoing research project.

Further on, the usage of GTMs as both a driver and a post hoc analyzer of diverse subset selection was explored. It was shown why relying on RPs to conduct diverse subset sampling is much faster than classical methods requiring the estimation of the full dissimilarity matrix between all compounds.

Finally, one simple but effective way to link chemical space to biological activities was discussed: Privileged Responsibility Patterns. Following the now classical “privileged scaffold” concept in medicinal chemistry, this approach has the merit of straightforward generation of RPs by mapping a library onto a “meaningful” GTM (as validated by previous predictive challenges). It is straightforward to check whether an RP is “privileged” with respect to a given property—in the sense that its occurrence rate within compounds having that property is much larger than its occurrence rate throughout the parent library. If so, visual inspection of the compounds matching the RP often suffices to find the underlying structural motif behind that RP. Since the RP is “privileged”, the underlying structural motif automatically inherits the “privileged” status and, as already highlighted, this motif does not have to be a privileged scaffold. The examples of the most privileged RPs led to the discovery of various privileged structural motifs, some being rather well-defined structural constraints (the aryl-oxadiazole-cyclohexyl moiety), while others are fuzzy, yet chemically meaningful motifs, such as “peptidomimetics”. It would have been impossible to guess a priori the peculiar motifs that should be tested for privileged status. With a chemically meaningful GTM, such guesswork is not necessary: RPs are naturally emerging hypotheses for regrouping compounds, and the key structural motif behind each such cluster can very often be found.

We hope this brief overview has convinced the reader of the significant strengths of GTMs in chemical space navigation and analysis.