
1 Introduction

Generating molecules with desired properties is a challenging task with important applications such as drug design. In the last few years, considerable work [1, 7, 8, 12] on deep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs), has made use of the domain-specific SMILES representation of molecules [22], a linear string notation for describing molecular structures. The main drawback of SMILES is the difficulty in capturing molecular similarity, since small changes to a string can result in drastically different structures. This shortcoming prevents some generative models from learning smooth latent variables. With recent progress in deep learning on graphs [3, 6, 13, 20, 21], deep generative models for molecular graphs are attracting surging interest, since the graph representation can overcome the limitations of SMILES [2, 10, 14, 15, 21].

In molecular graph generation, one of the key challenges lies in incorporating highly complex and non-differentiable property metrics into a generative model. The two main strategies to this end are reinforcement learning-based and Bayesian optimization-based approaches. Reinforcement learning-based methods [2, 8] use a reinforcement learning objective to provide the policy with a gradient towards the desired properties, which requires an extra reward network to predict the immediate reward. However, this extra network may give rise to convergence difficulties, and it may fail to predict the reward accurately. Bayesian optimization-based approaches [10, 12] perform Bayesian optimization to navigate to regions of the latent space that decode into molecules with particular properties. Such methods proceed in two phases: a generative model is trained first, and Bayesian optimization is then performed in the latent space. The performance of the second phase therefore depends largely on the smoothness of the latent space learned during the first.

In this work, we design a variational autoencoder for matrix representations of graphs. We then formulate a regularizer that encourages the generation of molecules with desired properties, avoiding both extra networks and the requirement of a smooth latent space. A Monte Carlo approximation of the regularizer is used during training. Since the approximation is differentiable, we can train the model with stochastic gradient optimization methods. We demonstrate the effectiveness of our framework on two benchmark molecule datasets by generating molecules with desired properties.

Fig. 1. Overview of the proposed OpVAE. The bottom flow depicts the regularization that encourages the decoder to generate molecules with desired properties, which is detailed in Sect. 4.

2 Related Work

Recently, there have been significant advances in molecule generation. Previous works [1, 7, 8, 12] have explored generative models on SMILES. Gómez-Bombarelli et al. [7] built generative models with recurrent neural networks. Kusner et al. [12] utilized parse trees from a context-free grammar to improve the validity of generated molecules. Dai et al. [1] took a further step towards validity by enforcing constraints on the generative model. In addition to these VAE-based works, Guimaraes et al. [8] used a GAN for generation. For graph representations, Simonovsky et al. [21] explored generating molecular graphs by extending the VAE. Jin et al. [10] proposed a generative model that combines a tree-structured scaffold with the original graph. Liu et al. [14] incorporated chemical constraints into generative models by specifying a generative procedure and employing masking. Ma et al. [15] reformulated the constrained objective of the VAE by constructing a Lagrangian function.

3 Model

3.1 Variational Autoencoder

We adopt the AEVB algorithm [11] to learn a generative model \(p_\theta (G|z)\) together with an encoder \(q_\phi (z|G)\), which approximates the intractable true posterior \(p_\theta (z|G)\). The objective is to maximize the evidence lower bound (ELBO) with respect to \(\theta \) and \(\phi \):

$$\begin{aligned} \mathcal {L}_{ELBO} = -D_{KL}(q_\phi (z|G)||p_\theta (z)) + \mathbb {E}_{q_\phi (z|G)} [\log p_\theta (G|z)]. \end{aligned}$$
(1)

The first term acts as a regularizer: it is the divergence of the variational posterior \(q_\phi (z|G)\) from the prior \(p_\theta (z)\), which allows for learning more general latent representations instead of simply encoding an identity mapping. The expectation term, interpreted as the negative reconstruction loss, is maximized when \(p_\theta (G|z)\) assigns a high probability to the observed G.
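For the encoder described in Sect. 3.3, which outputs a diagonal Gaussian posterior, the KL term has the usual closed form when the prior is a standard normal. A minimal sketch (variable names are illustrative, and the standard-normal prior is an assumption based on common practice):

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for one latent sample."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))

# Example with a 4-dimensional latent code.
mu = np.array([0.1, -0.3, 0.0, 0.5])
sigma = np.array([0.9, 1.1, 1.0, 0.8])
print(kl_diag_gaussian_to_standard_normal(mu, sigma))
```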

3.2 Molecules as Graphs

Each molecule can be represented by an undirected graph with a set of labeled nodes associated with the atoms and a set of labeled edges associated with bonds between atoms. We restrict the domain to a collection of molecular graphs which have at most N nodes, \(T-1\) node types and \(R-1\) edge types.

The decoder outputs a matrix \(\widetilde{X} \in R^{N \times T}\) and a tensor \(\widetilde{A} \in R^{N \times N \times R}\), which we denote by \(\widetilde{G}=(\widetilde{A}, \widetilde{X})\). The row \(\widetilde{X}(i, :)\) is a categorical distribution over the type of node i, satisfying \(\sum _{t=0}^{T-1} \widetilde{X}(i, t)=1\); \(\widetilde{X}(i, 0)\) is the probability that node i is nonexistent. Similarly, the fiber \(\widetilde{A}(i, j, :)\) is a categorical distribution over the edge type between nodes i and j, satisfying \(\sum _{r=0}^{R-1} \widetilde{A}(i, j, r)=1\); \(\widetilde{A}(i, j, 0)\) is the probability that the edge between nodes i and j is nonexistent.

We assume that node types and edge types are independent; the tuple \(\widetilde{G}\) then defines a random graph model. A one-hot graph \(G=(A, X)\) can be sampled from \(\widetilde{G}=(\widetilde{A}, \widetilde{X})\) via categorical sampling.
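For illustration, sampling a one-hot graph from \(\widetilde{G}\) amounts to one categorical draw per node and per node pair. A minimal sketch (the function and variable names are ours, not the paper's code):

```python
import numpy as np

def sample_graph(X_tilde, A_tilde, rng):
    """Sample a one-hot graph G = (A, X) from the random graph model (A~, X~).

    X_tilde: (N, T) categorical distributions over node types (index 0 = no node).
    A_tilde: (N, N, R) categorical distributions over edge types (index 0 = no edge).
    """
    N, T = X_tilde.shape
    R = A_tilde.shape[-1]
    X = np.zeros((N, T))
    A = np.zeros((N, N, R))
    for i in range(N):
        X[i, rng.choice(T, p=X_tilde[i])] = 1.0
    for i in range(N):
        for j in range(i + 1, N):  # undirected graph: draw each pair once
            r = rng.choice(R, p=A_tilde[i, j])
            A[i, j, r] = A[j, i, r] = 1.0
    return A, X

# Usage: A, X = sample_graph(X_tilde, A_tilde, np.random.default_rng(0))
```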

3.3 Encoder and Decoder

In this paper, the encoder is parameterized as a normal distribution with diagonal covariance, \(q_\phi (z|G)=\mathcal {N}(z;\mu , \psi )\) with covariance matrix \(\psi =\) diag\((\sigma ^2)\), where \(\mu \) and \(\sigma \) are outputs of the encoding graph neural networks.

Suppose that \(h_i^{(l)}\) is the hidden state of node i at layer l. We define the following layer-wise propagation rule, similar to that of [20], for the signal \(h_i^{(l+1)}\) of node i:

$$\begin{aligned} h_i^{(l+1)} = \sigma (\sum _{r=1}^{R-1} \sum _{j=1}^N \frac{\widehat{A}(i, j, r)}{c_{i, r}} f_r^{(l)} (h_j^{(l)}, x_i) ), \end{aligned}$$
(2)

where \(\widehat{A}(:, :, r) = A(:, :, r) + I\) for each edge label r, with I being the identity matrix; that is, a self-connection is added for each edge type. The messages from neighbors (including node i itself), obtained via an edge-type-specific affine function \(f_r^{(l)}\), are then accumulated. Besides the linear transformation, we use a normalization constant \(c_{i, r}\) to ensure that the accumulation of messages does not change the scale of the feature representations. In this paper, \(c_{i, r} = |\mathcal {N}_{i, r}|\), where \(\mathcal {N}_{i, r}\) is the set of neighbors of node i under edge label r (including i itself). Finally, the normalized messages are passed through an element-wise activation function \(\sigma (\cdot )\) such as the ReLU [16]. Note that R and N have the same meaning as in Sect. 3.2. It remains to define the initial hidden state at the first layer:

$$\begin{aligned} h_i^{(1)} = \sigma (\sum _{r=1}^{R-1} \sum _{j=1}^N \frac{\widehat{A}(i, j, r)}{c_{i, r}} f_r^{(0)} (x_i)), \end{aligned}$$
(3)

where \(x_i = X(i, :)\).
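The propagation rule of Eqs. 2-3 can be written out directly. The sketch below uses per-edge-type weight matrices as stand-ins for the affine functions \(f_r^{(l)}\); it illustrates the update under these stated assumptions and is not the authors' implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def propagate(A_hat, H, X, weights, biases):
    """One propagation layer following Eq. 2.

    A_hat:   (N, N, R) one-hot adjacency with self-connections added per edge type.
    H:       (N, D) hidden states h_i^{(l)}.
    X:       (N, T) one-hot node features x_i.
    weights: list of R-1 matrices of shape (D + T, D_out), the affine maps f_r^{(l)}.
    biases:  list of R-1 vectors of shape (D_out,).
    """
    N, _, R = A_hat.shape
    out = np.zeros((N, weights[0].shape[1]))
    for i in range(N):
        acc = np.zeros(weights[0].shape[1])
        for r in range(1, R):  # r = 0 means "no edge" and is skipped
            neighbors = np.nonzero(A_hat[i, :, r])[0]
            c_ir = max(len(neighbors), 1)  # normalization constant |N_{i,r}|
            for j in neighbors:
                msg = np.concatenate([H[j], X[i]]) @ weights[r - 1] + biases[r - 1]
                acc += msg / c_ir
        out[i] = relu(acc)  # element-wise activation
    return out
```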

For graph-level outputs, we consider the following aggregation method proposed by [13] after \(L-1\) layers of propagation:

$$\begin{aligned} h_G = \tanh (\sum _{i=1}^N\sigma (v(h_i^{(L)}, x_i)) \odot \tanh (u(h_i^{(L)}, x_i))), \end{aligned}$$
(4)

where v and u are neural networks that take the concatenation of \(h_i^{(L)}\) and \(x_i\) as input; their output layers are both linear. \(\sigma (v(h_i^{(L)}, x_i))\) acts as a soft attention mechanism that decides how relevant node i is to the current molecule generation task. We then perform an element-wise multiplication \(\odot \) and sum over all weighted output vectors of the nodes to obtain the graph-level representation \(h_G\). Finally, \(\mu \) and \(\sigma \) are generated by two multi-layer perceptrons (MLPs).
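The gated readout of Eq. 4 can be sketched in the same fashion. Here v and u are reduced to single linear maps, and the gating nonlinearity is taken to be a sigmoid, which is an assumption on our part consistent with the soft-attention interpretation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_readout(H_L, X, W_v, b_v, W_u, b_u):
    """Graph-level representation h_G following Eq. 4.

    H_L:      (N, D) final hidden states h_i^{(L)}.
    X:        (N, T) one-hot node features.
    W_v, W_u: (D + T, D_out) linear maps standing in for the networks v and u.
    """
    Z = np.concatenate([H_L, X], axis=1)
    gate = sigmoid(Z @ W_v + b_v)   # soft attention: how relevant is each node
    value = np.tanh(Z @ W_u + b_u)
    return np.tanh(np.sum(gate * value, axis=0))
```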

As mentioned in Sect. 3.1, the decoder draws latent variables z from the variational posterior \(q_\phi (z|G)\) or the prior \(p_\theta (z)\) and outputs a random graph model \(\widetilde{G}\), from which a graph G is sampled. In this paper, the decoder is implemented simply with MLPs.

4 Regularization

4.1 Formulation

In this section, we design an interpretable regularizer that encourages the generation of molecules with desired properties. The differentiable regularization term can provide the decoder with a gradient towards the desired metrics. More details are discussed after a review of traditional approaches to calculating the chemical properties.

  • LogP: The octanol-water partition coefficient (logP) serves as a measure of lipophilicity. The atom-based method developed by [5] is an effective approach to calculating logP: it assigns to the individual atoms in a molecule additive contributions to the molecular logP, so that the logP of a small molecule is the sum of the contributions of its atoms. Since high accuracy was achieved in [23], where 68 atomic contributions to logP were determined, this approach has become a common and standard way to calculate logP.

  • SAS: The synthetic accessibility score (SAS) [4] has been used to estimate the ease of synthesis of molecules. It is calculated from a combination of fragment contributions and a complexity penalty. Of the two constituent parts of SAS, the fragmentScore is calculated as the sum of the contributions of all fragments in the molecule divided by the number of fragments, whereas the complexityPenalty is calculated from structural features such as rings, stereo centers, etc.

The central idea of this paper is to formulate a differentiable objective from a regularization term that encourages the decoder to generate molecules with desired properties. Let S be the property to be calculated. A natural choice is to use the expectation of S with respect to the distribution \(p_\theta (G|z)\) as the regularizer. This expectation may be written formally as

$$\begin{aligned} \mathbb {E}_{p_\theta (G|z)}[S] = \sum _{G} p_\theta (G|z)S_G. \end{aligned}$$
(5)

By analyzing the calculation schemes of logP and SAS, we can unify these methods into a single common framework. A specific substructure in a molecular graph is called a pattern, and each pattern is associated with one additive contribution. The property of a molecule G is then the sum of the contributions of all the patterns that occur in the molecule. For the calculation of logP in [5], each of the 68 atomic types can be regarded as a specific pattern consisting of one central atom, its neighboring atoms, and the bonds between the central atom and its neighbors. For the calculation of SAS in [4], the patterns include fragments, rings, etc. Let the set of possible patterns be denoted as Q. Given a molecular graph G, the properties above can be calculated according to

$$\begin{aligned} S_G = \sum _{q \in Q} n_q c_q, \end{aligned}$$
(6)

where \(S_G\) is the property of molecule G, \(n_q\) is the number of occurrences of the pattern q, and \(c_q\) is the contribution for pattern q.
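Concretely, Eq. 6 is a contribution-weighted count of pattern occurrences. A toy sketch (the pattern names and contribution values below are made up purely for illustration):

```python
# Toy illustration of Eq. 6: S_G = sum over patterns q of n_q * c_q.
contributions = {"aromatic_C": 0.29, "hydroxyl_O": -0.44, "amine_N": -1.03}  # c_q (hypothetical)
occurrences = {"aromatic_C": 6, "hydroxyl_O": 1, "amine_N": 0}               # n_q for one molecule

S_G = sum(occurrences[q] * contributions[q] for q in contributions)
print(S_G)  # property estimate for this molecule
```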

By combining Eq. 5 with Eq. 6, we have

$$\begin{aligned} \mathbb {E}_{p_\theta (G|z)}[S] = \sum _{G} \sum _{q \in Q} p_\theta (G|z) n_q c_q. \end{aligned}$$
(7)

We let \(\mathbb {E}_{p_\theta (G|z)}[n_q] = \sum _{G} p_\theta (G|z) n_q\), the expectation of the count \(n_q\) with respect to the distribution \(p_\theta (G|z)\). \(\mathbb {E}_{p_\theta (G|z)}[n_q]\) can also be interpreted as the probability that pattern q occurs in a graph sampled from \(p_\theta (G|z)\). Let \(\mathbb {E}_{p_\theta (G|z)}[n_q] = p_q\) for simplicity. We obtain

$$\begin{aligned} \mathbb {E}_{p_\theta (G|z)}[S] = \sum _{q \in Q} p_q c_q. \end{aligned}$$
(8)

Equation 8 indicates that the expectation can be evaluated once the problem of computing \(p_q\) is solved. However, evaluating \(p_q\) exactly is computationally expensive. In practice, we appeal to a Monte Carlo approximation of the expectation of S, which is differentiable with respect to \(\theta \). More details are discussed in Sect. 4.2.

4.2 Training

We train our model to maximize the following objective function:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{ELBO} + \mathbb {E}_{p_\theta (G|z)}[S]. \end{aligned}$$
(9)

As in [11], the reparameterization trick and a Monte Carlo gradient estimator are employed to evaluate \(\mathcal {L}_{ELBO}\). To evaluate \(\mathbb {E}_{p_\theta (G|z)}[S]\), we also use the Monte Carlo approximation mentioned above. Latent variables are first sampled from the prior \(p_\theta (z)\). The decoder takes z as input and outputs \(\widetilde{G}\). We then sample patterns from \(\widetilde{G}\). For each \(q^{(m)} \sim \widetilde{G}\), we represent an occurrence of a pattern by a 2-tuple \(q^{(m)} = (V^{(m)}, E^{(m)})\), where \(V^{(m)}\) is the set of atoms in the occurrence and \(E^{(m)}\) is the set of bonds in the occurrence. Under the independence assumption of Sect. 3.2, the probability \(p_{q^{(m)}}\) of one specific occurrence of a pattern is given by

$$\begin{aligned} p_{q^{(m)}} = \prod _{it \in V^{(m)}} p_{it} \prod _{ijr \in E^{(m)}} p_{ijr}, \end{aligned}$$
(10)

where \(p_{it}\) is an element of \(\widetilde{X}\), i.e., \(\widetilde{X}(i, t)\), and \(p_{ijr}\) is an element of \(\widetilde{A}\), i.e., \(\widetilde{A}(i, j, r)\). The notation means that atom it corresponds to node i in \(\widetilde{X}\) with label t; the bond ijr is interpreted analogously. Because \(p_{it}\) and \(p_{ijr}\) are differentiable with respect to \(\theta \), the parameters can be updated using stochastic gradient ascent. Note that the latent variable z used to evaluate \(\mathbb {E}_{p_\theta (G|z)}[S]\) is sampled from the prior \(p_\theta (z)\), as shown in the bottom flow of Fig. 1.
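A sketch of this Monte Carlo estimate is given below: for each sampled pattern occurrence, Eq. 10 multiplies the node-type and edge-type probabilities assigned by the decoder, so a contribution-weighted sum over occurrences is differentiable with respect to the decoder output. The names and the way occurrences are combined are our assumptions, since the estimator is not spelled out explicitly:

```python
def occurrence_probability(X_tilde, A_tilde, nodes, edges):
    """Eq. 10: probability of one pattern occurrence under the decoder output.

    nodes: list of (i, t) pairs, atom i with type t in this occurrence.
    edges: list of (i, j, r) triples, bond of type r between atoms i and j.
    """
    p = 1.0
    for i, t in nodes:
        p *= X_tilde[i, t]
    for i, j, r in edges:
        p *= A_tilde[i, j, r]
    return p

def regularizer_estimate(X_tilde, A_tilde, sampled_occurrences):
    """One plausible Monte Carlo estimate of E_{p_theta(G|z)}[S].

    sampled_occurrences: list of (nodes, edges, c_q) with c_q the contribution of the pattern.
    """
    return sum(c_q * occurrence_probability(X_tilde, A_tilde, nodes, edges)
               for nodes, edges, c_q in sampled_occurrences)
```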

5 Experiments

5.1 Datasets

Two benchmark datasets are used for the experiments. QM9 [18] is a subset of the massive GDB-17 chemical database of 166 billion organic molecules [19]. It contains about 134K molecular graphs of up to 9 heavy atoms with 4 distinct atomic numbers (carbon, oxygen, nitrogen and fluorine) and 4 bond types. ZINC [9] is a curated set of about 250K drug-like molecules of up to 38 heavy atoms with 9 distinct atomic numbers and 4 bond types.

5.2 Quality of the Generated Molecules

Baselines. We compared our method OpVAE against CVAE [7], GVAE [12], GraphVAE [21] and SeVAE [15]. For evaluation, we sampled 1000 latent vectors from the prior and performed maximum-likelihood decoding for each one.

Table 1. Comparison with baselines in terms of validity and novelty on QM9 and ZINC. The rows “% Valid” and “% Novel” are the validity and novelty in percentages, respectively. The results of baselines are copied from [21] and [15].

Evaluation Measures. We use the following two statistics to evaluate the quality of the generated molecules. Validity is defined as the ratio of the number of valid molecules to the number of all molecules sampled from the prior. Novelty is defined as the ratio of the number of valid samples that do not occur in the training set to the number of all valid samples.
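Both statistics can be computed directly from the decoded molecules; a small sketch using RDKit on SMILES strings (the use of RDKit here and the helper name are assumptions, not the paper's evaluation code):

```python
from rdkit import Chem

def validity_and_novelty(generated_smiles, training_smiles):
    """Validity = valid / generated; novelty = (valid and unseen) / valid."""
    valid = [s for s in generated_smiles if Chem.MolFromSmiles(s) is not None]
    if not generated_smiles or not valid:
        return 0.0, 0.0
    training_set = set(training_smiles)
    novel = [s for s in valid if s not in training_set]
    return len(valid) / len(generated_smiles), len(novel) / len(valid)
```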

Comparison with Baselines. Table 1 reports the results obtained by the proposed method and the baselines. Our approach OpVAE shows a significant improvement over its competitors in terms of validity and novelty, even though OpVAE is not explicitly designed to boost the validity or novelty percentage. For validity, an intuitive explanation is that the regularization invests much effort in forcing the decoder to generate patterns with higher property scores, and these patterns are chemically valid, which encourages the sampled molecular graphs that implicitly contain them to be valid. In the same way, novel molecules are obtained as the decoder explores the possibilities for generating patterns with higher property scores.

5.3 Objectives Optimization

In order to demonstrate the effectiveness of the proposed regularization, following [8] and [2], we chose to optimize the LogP and SAS objectives and compare against their works. We trained our model on the full QM9 dataset for 30 epochs, similarly to the experiments in [2] but differently from those of [8], where the model is trained on a 5k subset of QM9. We normalized all scores to [0, 1] using the code of [2]. As shown in Table 2, OpVAE beats both MolGAN and ORGAN in terms of objective scores, which demonstrates that the proposed regularization is effective.

Table 2. Optimization for different objectives. The results of the baselines are taken from [2] and [8]. The Naive VAE is trained to maximize \(\mathcal {L}_{ELBO}\).

6 Conclusion

Incorporating complex and non-differentiable property metrics into deep generative models is a challenging problem. We built graph neural networks into a VAE for generating molecular graphs, and we addressed property optimization by resorting to a differentiable regularizer. We introduced several metrics to validate the quality of the proposed method, and the experimental results reflect its advantages.