Introduction

Structure–property modeling of chemical reactions represents a difficult task because of the complexity issue: any chemical reaction involves several molecular species of two types—reactants and products. The major question concerns the preparation of a descriptor vector encoding a chemical reaction which can serve as an input to a modeling software. Earlier, two different methodologies have been used for this purpose. The first one is based on the explicit consideration of a reaction center identified either manually or automatically using atom-to-atom mapping procedure [1]. This approach has been used in most of reported QSPR studies of reactions. Thus, Gasteiger et al. used some physicochemical parameters (charges, polarizabilities, steric accessibilities, parameters for inductive and resonance effects) for selected atoms and bonds to prepare the models for pK a for aliphatic carboxylic acids [2] and for kinetics of amide hydrolysis [3]. ISIDA fragment descriptors [4, 5] issued from Condensed Graph of Reaction [6, 7] have been used for the reaction data analysis [8] and for the modeling of the rate of SN2 [7, 9, 10] and E2 [11] reactions and optimal conditions for Michael reaction [12].

Another approach is based on the implicit representation of a reaction center, in which the feature vector for the reaction is calculated as the difference between descriptors of products and reactants [13,14,15,16] or by using only combined descriptors of substrates [17]. This methodology has been successfully applied in different reaction classification tasks [15, 18] and in building the regression model for prediction of optimal conditions of Michael reaction [12], SN2 rate constant prediction [17] and and SN1/SN2 reactions classification [19]. Both approaches—with and without reaction center detection—have their own drawbacks. Unless detected manually for small congeneric data set, the reaction center detection needs atom-to-atom mapping procedure which is error-prone and time-consuming [20]. Calculation of reaction vectors [21] or reaction fingerprints [15, 18] requires perfectly balanced reactions; otherwise the resulting feature vector would contain chemically meaningless terms. Since most of raw reaction data in the widely used databases like CAS REACT or Reaxys are not balanced, the data curation step is needed before using modeling methods. However, application of e-notebooks for new chemical reaction registration in synthetic laboratories might potentially be helpful to feed the databases with perfectly balanced reactions.

In this article we describe an approach which doesn’t need explicit encoding of a reaction center. A reaction is considered as an ensemble of two mixtures—a mixture of reactants and a mixture of products. Each mixture can be represented by special descriptors. Two different reaction representations were investigated: (i) concatenated feature vectors of reactants and products mixtures and (ii) a difference between these two vectors.

Earlier, we described an approach to prepare feature vectors for binary mixtures involving SiRMS descriptors [22]. Here, we extended this technique to mixtures having an arbitrary number of components. This new “mixture-like” methodology has been applied to model the rate constant of E2 reactions. For the comparison purpose, the models have also been built using either reaction fingerprints [15] issued from the implicit encoding of a reaction center or fragment ISIDA descriptors [4, 5] generated from the condensed graphs of reactions [6, 7] which explicitly label a reaction center. A rigorous cross-validation strategy has been suggested in order to provide with a realistic assessment of the models’ performance.

Computational procedure

Dataset

A dataset of 313 E2 bimolecular elimination reactions carried out in pure solvents at different temperatures has been collected from the literature [23]. An E2 reaction proceeds in a single step with a single transition state. It results in a formation of a π-bond due to synchronous trans-elimination of a leaving group (L) in the presence of a base (B ) needed to tie in the hydrogen atom (Fig. 1).

Fig. 1
figure 1

A bimolecular elimination reaction. (top) Schematic representation of the E2 reaction mechanism, where B is a base and L is a leaving group. (bottom) An example of an E2 transformation of (9H-fluoren-9-yl)methanol into 9-methylene-9H-fluoren, where CH3O (from sodium methylate) is a base and hydroxide ion is a leaving group

The dataset involves 90 distinct substrates and 60 distinct products, the most representative of them are listed in Table 1. The most representative substrates are ((1,1′-biphenyl)-4-yl)(1-bromocyclohexyl)methanone, (2-chloroethanesulfonyl)benzene and 2-fluorohexane, whereas the products are ethenesulfonylbenzene, cyclohexene and iso-butylene. Among the most representative leaving groups one can mention bromide and chloride anions occurred in 101 and 93 reactions, respectively, as well as p-tosylate and trimethylamine which occurred in 35 reactions each. The other seven leaving groups are occurred in very few reactions. Overall, 23 bases were detected, the most representative of them were methoxide occurred in 59 reactions, ethoxide (in 38 reactions), tert-bytoxide (30), thiophenyl (30), triethylamine (24), bromide (20), chloride (14) and hydroxide (14) ions and piperidine (10).

Table 1 The most frequently occurred substrates and products

Representation of chemical reactions

The structures of reactants and products were encoded in reaction feature vectors using three different approaches: (i) the extended SiRMS mixture representation approach, (ii) ISIDA fragments calculated from condensed graphs of reactions and (iii) reaction fingerprints. Dipole moment, refraction, dielectric permittivity, Catalan acidity [24], basicity [25] and polarity/polarizability [26], Kamlet-Taft alpha [27], beta [28] and π constants [29] used as solvent parameters and reaction temperature were concatenated with all reaction feature vectors.

SiRMS-based mixture representation of chemical reactions

In the framework of the SiRMS methodology, a single compound can be represented as a set of tetraatomic fragments (simplexes) of fixed composition and topology (Fig. 2). The counts of identical simplexes are used as descriptor values. Generated simplexes can also be labeled according to different atomic properties (partial atomic charges, lipophilicity, H-bond donor/acceptor, etc). Partial atomic charges seem to be a relevant parameter for the reactivity modeling. Therefore, Gasteiger charges on atoms were calculated by cxcalc tool [30]. Then, the whole range of charge values was split onto seven bins labeled from A to G: A ≤ −0.5 < B ≤ −0.1 < C ≤ −0.03 < D ≤ 0.03 < E ≤ 0.1 < F ≤ 0.5 < G. In such a way, each atom received the corresponding label further used for simplex encoding (see Fig. 2). In order to avoid a combinatorial explosion, we enumerated either fully connected fragments (similar to the first three simplexes in Fig. 2) or fragments containing two disconnected parts (similar to the 4th simplex in Fig. 2). For more details about the SiRMS approach see our earlier studies [31, 32]. Notice that in this work we considered simplexes for which the numbers of atoms in fragments varied from 2 to 6.

Fig. 2
figure 2

An example of simplex descriptor generation for individual compounds

The preparation of mixture descriptors for the mixture of three equally occurred components (here, reactants of E2 reactions) is illustrated in Fig. 3. It proceeds in three steps:

  1. I.

    simplex descriptors representing connected or disconnected molecular subgraphs of N atoms (in this study N = 2–6) are generated. For the mixture of three components A, B and C, the program generates simplexes of individual species including atoms of only A and B, as well as mixture simplexes including atoms of two (AB, BC, AC) or three (ABC) components. For molecular species containing less than 2 atoms (e.g., component C), individual simplexes are not generated. Each type of fragments is considered as an individual descriptor and its count weighted by the corresponding component occurrences is the descriptor value. In this study occurrences of all components were 1.

  2. II.

    the feature vectors of individual simplexes are summed up which results in vector DS = A + B + C. Similarly, superposition of the vectors of mixture simplexes AB, BC, AC and ABC results in DM vector.

  3. III.

    concatenation of DS and DM results in SiRMS-mix—the feature vector of the whole mixture.

Since a chemical reaction can be represented as an ensemble of two mixtures: a mixture of starting materials (reactants) and a mixture of products, the reaction feature vector can be computed as their combination. Two different ways of combining mixture feature vectors into reaction feature vector have been investigated: (i) their concatenation and (ii) by calculation of the difference between product and reactant mixture descriptors (Fig. 4).

In this study, simplexes included from 2 to 6 atoms; only pair-wise and triple-wise combinations of components were used for mixture simplex generation. The atoms were labeled either by symbols of chemical elements or by bin labels corresponding to partial atomic charges (see above).

Fig. 3
figure 3

Generation of simplex descriptors for a mixture of three components

Fig. 4
figure 4

Reaction descriptor vectors based on the concatenated product and reactant mixture descriptors (react-SiRMS-concat) and on their difference (react-SiRMS-diff)

Condensed graph of reaction

A Condensed Graph of Reaction (CGR) results from merging molecular graphs of reactants and products into one single connected or disconnected molecular graph described by conventional bonds (single, double, aromatic, etc) and dynamic bonds characterizing chemical transformations (single-to-double, double-to-single, etc) [6], see example in Fig. 5. In CGR, the changes of atomic charges in a course of a reaction can be accounted by introducing dynamic atoms (Fig. 5). A CGR can be prepared by superposing identically numbered atoms of reactants and products which needs to perform atom-to-atom mapping as a preliminary step. Since a CGR represents some sort of pseudomolecule, it can be encoded by fragment descriptors.

Here, two different types of ISIDA fragment descriptors—augmented atoms and sequences with length varying from 1 to 8 atoms—were calculated using ISIDA Fragmenter tool [6]. In order to reduce the number of generated fragments, the hydrogen suppressed graphs were used. Dynamic bond and atom labels were added to the specifications of the fragments.

Fig. 5
figure 5

An example of encoding of an E2 reaction into a Condensed Graph of Reaction. The broken and formed bonds are labeled by a crossing and a circle, respectively. Oxygen atoms changing their formal charges are denoted by symbols “c + 1” (negative-to-neutral) and “c − 1” (neutral-to-negative)

Reaction fingerprints

A reaction fingerprint is the difference between count-based fingerprints of products and reactants. In our study we used three types of reaction fingerprints developed by Schneider et al. [15] and implemented in RDKit software [33]: (i) atom pairs representing two particular atoms with the specified number of non-hydrogen neighbor atoms separated by up to three bonds [34], (ii) Morgan fingerprints identical to extended-connectivity fingerprints with radius 2 [35] and (iii) topological torsions representing four consecutively linked non-hydrogen atoms with the specified number of π-electrons and the number of non-hydrogen neighbor atoms [36].

Models building and validation

The models were built by the Random Forest approach using the randomForest R package [37]. The optimal number of variables used to select the best split of trees nodes was estimated by a grid search using caret package [38]. Number of trees was equal to 500 in all cases. All other parameters were set to their default values provided by randomForest R package. Since Random Forest proved to be able to handle many descriptors with complex relationships, no variable selection has been performed.

Two model validation strategies were applied. The first one is a “reaction-out” approach which is consisted in ten times repeated fivefold cross-validation where folds were randomly generated. However, this conventional validation procedure overestimates the model performance because the same reaction may proceed under different conditions and, hence, it might become simultaneously a part of both training and test sets. Therefore, a more rigorous “product-out” strategy has been suggested. It assumes that in a particular fold, all reactions with the same main product are placed in the test set. Since the number of reactions with the same product significantly varies (from 1 to 27 reactions) the randomly created “product-out” folds may contain substantially different number of objects. More balanced folds were prepared using Monte-Carlo optimization of the variance of reaction counts across folds and ten the most diverse sets of folds were selected. Functions (create_folds_mc, groupwise_tanimoto and select_folds) used to generate the balanced folds are available in pfpp R package (https://github.com/DrrDom/pfpp).

The prediction performance of models was measured by Q2 and root mean square error (RMSE).

$${\text{Q}}^{{\text{2}}} = {\text{1}} - \frac{{\sum\nolimits_{{\text{i}}} {{\text{(y}}_{{{\text{i,pred}}}} - {\text{y}}_{{{\text{i,obs}}}} {\text{)}}^{{\text{2}}} } }}{{\sum\nolimits_{{\text{i}}} {{\text{(y}}_{{{\text{i,pred}}}} - \overline{{\text{y}}} _{{{\text{obs}}}} {\text{)}}^{{\text{2}}} } }}$$
$${\text{RMSE}}=\sqrt {\frac{{\sum\limits_{\text{i}} {{{{\text{(}}{{\text{y}}_{{\text{i,pred}}}} - {{\text{y}}_{{\text{i,obs}}}}{\text{)}}}^{\text{2}}}} }}{{{\text{N}} - {\text{1}}}}}$$

Since the cross-validation procedure was repeated 10 times, we were able to estimate statistical significance of difference between averaged performances of the best models using paired t-test.

Applicability domain of models

In order to discard reactions dissimilar to those in the training set, the “Fragment Control” applicability domain (AD) approach has been used [4]. The “Fragment Control” AD discards any test set reaction containing fragments which don’t occur in the training set reactions. An AD was applied to the test set reactions at each fold followed by assembling the results for all folds. In such a way, statistical parameters were calculated for the entire set. Data coverage was assessed as a ratio of the number of reactions accepted by AD to the total number of reactions.

Results and discussion

Generally, the SiRMS-mix descriptors vary as a function of several parameters defining their size, complexity and labeling. The size of any simplex is defined by the minimal and maximal number of constituting atoms. Each atom was labeled either by element symbol or by partial charge category. Different mixture simplexes—pair-wise and triple-wise, etc.,—could take part of mixture descriptors. Since a huge number of SiRMS-mix descriptors corresponding to different combinations of the above parameters could be considered, we decided first to select their optimal values leading to the most performant models. These calculations were performed on concatenated reaction descriptors react-SiRMS-concat. Then, selected parameters were used in the modeling with difference reaction descriptors react-SiRMS-diff.

Selection of optimal parameters of SiRMS descriptors

Predictive performances of the models built on the concatenated reaction descriptors (react-SiRMS-concat) as a function of size, complexity and labeling of simplex descriptors is given in Fig. 6. One may see that more complex descriptors including both pair-wise and triple-wise mixture simplexes (mult = 3, Fig. 6) perform similarly to descriptors including pair-wise mixture simplexes only (mult = 2).

Variation of the number of atoms in simplexes doesn’t impact the models performance for “reaction-out” CV (solid line in Fig. 6). However, Q2 values for “product-out” CV significantly vary as a function of maximal number of atoms (N max ): the models with N max  = 6 perform better than those with N max  = 4. This could be explained by the fact that larger fragments better characterize substrates specificity. On the other hand, the occurrence of fragments in the training set decreases with their size. This explains significant reduction of data coverage due to application of “fragment control” applicability domain. Notice that models performance doesn’t significantly vary as a function of minimal number of atoms (N min ). Indeed, at N max  = 6, within the given validation strategy and atoms labeling, the models with N min  = 2 and 4 perform very similarly (see Fig. 6).

Comparison of different schemes of atoms labeling in simplexes shows that consideration of atomic charges together with element types (blue lines on Fig. 6) increases the models’ performance. This suggests particular importance of charge encoding for the reactivity modeling.

Fig. 6
figure 6

The cross-validation determination coefficient Q2 and the data coverage of the react-SiRMS-concat models as a function of size, complexity and composition of simplex descriptors. The model applicability domain was taken into account. Solid and dashed lines represent the Q2 values, respectively, for “reaction-out” and “product-out” validation strategies, whereas corresponding bars show the data coverage. The color code reflects the composition of simplex descriptors including subgraphs encoding by elements (green), by charges (red), and, by both elements and charges (blue). The labels at the horizontal axis specify the minimal and maximal number of atoms in simplexes (e.g., “atoms = 2–6”) and complexity of mixture simplexes used: mult = 2 for pair-wise only and mult = 3 for pair-wise and triple-wise combinations

Benchmarking calculations

The results of benchmarking calculations comparing performances of the models based on SiRMS-mix, ISIDA/CGR descriptors as well as on different types of fingerprints are summarized in Table 2. One can see that two strategies of preparation of the reaction feature vector—either products and reactants vectors concatenation (react-SiRMS-concat) or their subtraction (react-SiRMS-diff)—lead to models of similar performances. Reasonable statistical parameters were obtained in “reaction-out” CV (Q2 = 0.62–0.69, RMSE = 0.78–0.90), whereas “product-out” CV led to much worse statistical parameters (Q2 = 0.37–0.47, RMSE = 1.03–1.15). The use of model AD significantly improved the model performance, especially in “product-out” CV (Q2 = 0.59–0.74, RMSE = 0.75–0.86) which was close to the “reaction-out” CV performance (Q2 = 0.67–0.74, RMSE = 0.74–0.90). The observed performance improvement is linked to decrease of the data coverage which varies from 75 to 83% in the “reaction-out” CV and from 14 to 38% in “product-out” CV.

The comparison of the best SiRMS model (No. 5, Table 2) with the models involving different types of fingerprints and ISIDA/CGR descriptors is given on Fig. 7. One can see that all models in the “reaction-out” CV protocol perform similarly. However, this is not a case for the “product-out” cross-validation where the statistical parameters of the models built on Atom Pairs fingerprints and Topological Torsion fingerprints are very little predictive (Q2 DA < 0.5). The best SiRMS model performs better than the models based on ISIDA/CGR descriptors (model No. 20, Table 2; p-value = 0.0002) and Morgan fingerprints (model No. 16, Table 2; p-value = 0.0080).

Table 2 Statistical parameters of the best QSAR models based on SiRMS-mix descriptors, different types of reaction fingerprints and ISIDA/CGR descriptors
Fig. 7
figure 7

Benchmarking of the models for E2 reaction rate constants involving different descriptors. The dots connected by solid and dashed lines represent Q2 DA values calculated considering applicability domain for “reaction-out” and “product-out” validation strategies correspondingly. The bars represent coverage of the corresponding models. The labels on the x axis mean AP3 atom pairs fingerprints, TT topological torsion fingerprints, MG2 Morgan fingerprints

Although in “reaction-out” cross validations all sets of descriptors perform reasonably well this doesn’t reflect the predictive ability of models with respect to reactions leading to new products which can be assessed in “product-out” cross-validation. The Q2 and RMSE values obtained in “product-out” CV are relatively low. Fragment control applicability domain significantly improves the model performance discarding up to 85% of reactions. Such big lost in the data coverage can be explained by high structural diversity and relatively small size of the data set due to which the test set objects often contain the fragments absent in the training set.

Conclusion

The suggested here mixture-based simplex representation of chemical reactions has been applied to the modeling of rate constants of E2 reactions. This approach doesn’t need any explicit information about reaction center and, therefore, atom-to-atom mapping is not required. The latter represents a significant advantage compared to methods based on explicit consideration of the reaction center because AAM procedure is time consuming and may lead to erroneous results [20]. However, as any other method of implicit encoding of a reaction center, our approach requires complete reaction representation (all products and all reactants). The SiRMS-mix models perform better than the models built on ISIDA/CGR descriptors and Morgan reaction fingerprint and much better than those involving reaction fingerprints encoding atom pairs or topological torsions. However, SiRMS-mix models have the lowest coverage according to the chosen Fragment control applicability domain approach.

A clear advantage of SiRMS approach is a possibility to vary the size and composition of considered molecular subgraphs (simplexes) and, in such a way, to select the descriptors set which fits modeled property. Thus, addition of simplexes labeled by partial atomic charge improves predictive performance of the models, which might be explained by significant role of electrostatic interactions in the E2 reaction mechanism. The SiRMS approach explicitly encodes different combinations of fragments occurred in reactants and products. Therefore, compared to reaction fingerprints from RDKit, SiRMS includes information not only about chemical transformations but also about all chemical functions present in reactants and products.

It has been demonstrated that the Fragment control AD could significantly improve the model performance. However, at the same time this leads to the reduction of the data coverage which is explained by small size and high diversity of the studied data set.

In parallel with the classical “reaction-out” cross-validation strategy we suggested to apply the more aggressive “product-out” cross-validation protocol which reliably assesses the accuracy of predictions for the reactions leading to new products.

Software implementation

The described reaction SiRMS descriptors were implemented in the open-source software written on Python 3 which is available in the Github repository https://github.com/DrrDom/sirms/releases/tag/v1.0.1.