Introduction

Worldwide, about 15% of all human deaths are cancer related: cancer is the second leading cause of death in developed countries (after cardiovascular diseases) and the third leading cause of death in the developing world, after cardiovascular, infectious, and parasitic diseases (Stewart and Wild 2014). Cancer is caused by both intrinsic factors (such as inherited genetic mutations, hormonal imbalance, and immune conditions) and environmental factors (exposure to synthetic and natural chemicals, infectious organisms, or smoking). During the past few decades, significant progress has been made in understanding, limiting the effects of, and preventing cancers induced by exposure to carcinogenic chemicals, in particular genotoxic carcinogens (Benigni and Bossa 2011). The ability of these chemicals to cause permanent damage to the genetic material of cells is primarily evaluated by in vitro Ames mutagenicity assays (Ames et al. 1975). Since mutagenicity is tied to the multi-step process of carcinogenicity and has been demonstrated to correlate well with it (Benigni 1989), the Ames test has become a preferred early indicator of potential genotoxic carcinogenicity (Gadaleta 2016). Although invaluable in evaluating the mutagenic potential of chemicals, the Ames assay requires an appreciable quantity of a compound (~0.75–1 g), which can be problematic in early-stage drug development or when testing impurities. Hence, where experimental bacterial mutagenicity data are absent or impractical to obtain, accurate computational prediction of the mutagenic potential of chemicals is highly desirable and may speed up toxicity profiling and the evaluation of drug impurities as outlined in ICH Guideline M7 (2015).

Because of reliability concerns often associated with the use of single models, ICH Guideline M7 recommends a more stringent approach based on consensus between two complementary (Quantitative) Structure–Activity Relationship ((Q)SAR) approaches: one generated by a statistical system and another utilizing expert rules. Several statistical systems are in widespread use: Sarah Nexus from Lhasa Limited (Barber et al. 2016), Leadscope Model Applier–Statistical Models (Yang et al. 2008), CASE Ultra/MC4PC from MultiCASE Inc. (Saiakhov and Klopman 2008), Lazar (Helma 2006), and T.E.S.T. (Martin 2016). Popular expert rule-based systems include Lhasa Limited’s Derek Nexus (Marchant et al. 2008), Leadscope Model Applier–Expert Alerts (Yang et al. 2008), and some of the modules in Toxtree (Patlewicz et al. 2008a).

Because they are used as starting materials in the pharmaceutical industry, and are therefore present as impurities in many drug formulations, the primary and secondary aromatic amines (further referred to as PAA and SAA) are prime candidates for mutagenicity evaluation (McCarren and Springer 2011). Such scrutiny is further supported by a substantial amount of evidence linking the aromatic amines and their metabolites to well-known mechanisms of DNA damage, such as the formation of DNA adducts with nitrenium ions, direct DNA intercalation, and the formation of reactive oxygen species resulting in oxidative stress (Benigni and Bossa 2011). For these reasons, structural alerts warranting further investigation are often designed around the presence of amino-aromatic moieties. The overly general nature of such alerts, however, has been demonstrated to lead to a significant number of false positives for certain subclasses of chemicals (Cariello 2002). As an alternative to the use of structural alerts, numerous QSARs utilizing molecular descriptors have been proposed during the past few decades, some of which are summarized in Table 1.

Table 1 Summary of published models on the mutagenicity of aromatic amines

Although some of the published models were properly derived and validated, many suffered from: (1) absent or poor validation; (2) omission of data critical for their reproducibility; (3) lack of structural/mechanistic interpretation; and (4) undefined applicability domains. Hence, it is not surprising that none of these approaches (except those based on nitrenium ion stabilities) is routinely used in estimating the Ames mutagenicity of aromatic amines. To avoid the shortcomings outlined above and to build an OECD-compliant model, this work focused on:

  1. developing a transparent modeling procedure with well-defined training/hold-out, validation, and “blind” external test sets;

  2. utilizing a rigorous modeling procedure relying on multiple fully randomized training/hold-out set pairs processed by a robust algorithm designed to avoid overfitting;

  3. defining an optimal prediction space/applicability domain based on a similarity score calculated from the 3D-SDAR fingerprints;

  4. using 3D-SDAR mapping techniques to facilitate the identification of structural features associated either positively (enhancers) or negatively (reducers) with mutagenicity.

Data sets

Curated data sets consisting of 579 PAAs and 437 SAAs, provided to the National Center for Toxicological Research (NCTR) by Lhasa Limited, were preprocessed using the following rules: (1) all solvent molecules were discarded to obtain a single structure; (2) all counter-ions of organic salts, such as Cl⁻, Br⁻, SO₄²⁻, Na⁺, etc., were removed; and (3) in cases of mixtures, only the active component identified by its CAS number was retained. The removal of counter-ions from the PAA set resulted in 15 duplicates and 2 triplicates. In cases of concordant calls, one of the duplicates/triplicates was retained, whereas in cases of conflicting calls, both chemicals were discarded. In total, 25 PAAs were removed from the data set. Similarly, the identification of 15 SAA duplicates resulted in the removal of 18 chemicals. After the “washing” procedure described above, the PAA and SAA data sets comprised 554 and 419 chemicals, respectively; these will be referred to as the initial data sets.

To build reliable predictive models, the initial data sets of 554 PAAs and 419 SAAs were split into modeling (3/4) and validation (1/4) subsets using a strategy designed to generate validation sets representative of the modeling sets. Since the conjugated systems in the molecules of aromatic amines are essential for mutagenicity, the Toxmatch software (Patlewicz et al. 2008b) was used to calculate the “number of atoms in the largest pi system” descriptor, whose ranked values determined the assignment of each chemical to either the modeling or the validation set. The chemicals in the PAA and SAA sets were arranged in ascending order of their calculated descriptor values and labeled sequentially as “a”, “b”, “c”, and “d”; the chemicals labeled “d” were then moved to form the validation set. Thus, the initial data set of 554 PAAs was split into 416 chemicals forming the modeling set and 138 compounds comprising the validation set. Likewise, 315 of the 419 SAAs were assigned to the modeling set, whereas the remaining 104 chemicals formed the validation set. Both modeling sets were further (repeatedly) split by a bagging-like randomization algorithm into training (80% of the chemicals: 333 for PAA and 252 for SAA) and hold-out test (20% of the chemicals: 83 for PAA and 63 for SAA) subsets. This procedure is described in detail in the “Model building” section.
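The rank-based split described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `rank_split`, the handling of tied descriptor values (a stable sort), and the descriptor values themselves are assumptions made only for illustration.

```python
def rank_split(descriptor_values):
    """Split compound indices into (modeling, validation) lists.

    Compounds are ranked by ascending descriptor value and labeled
    a, b, c, d, a, b, ... in that order; every "d" goes to validation.
    Ties keep their input order (an assumption; the text does not say).
    """
    order = sorted(range(len(descriptor_values)),
                   key=lambda i: descriptor_values[i])
    modeling, validation = [], []
    for rank, idx in enumerate(order):
        if "abcd"[rank % 4] == "d":
            validation.append(idx)
        else:
            modeling.append(idx)
    return modeling, validation


# Hypothetical descriptor values, used only to check the set sizes.
paa_values = [(7 * i) % 31 for i in range(554)]
saa_values = [(5 * i) % 29 for i in range(419)]

paa_model, paa_valid = rank_split(paa_values)
saa_model, saa_valid = rank_split(saa_values)
print(len(paa_model), len(paa_valid))  # 416 138
print(len(saa_model), len(saa_valid))  # 315 104
```

Taking every fourth compound from the ranked list reproduces the set sizes reported in the text and spreads the validation compounds across the whole range of the descriptor.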

Once the modeling phase was complete and the models were reported back to Lhasa Limited, NCTR was provided with two sets of PAAs and SAAs to be used as “blind” external test sets. The Tanimoto similarity score (defined as \(T=\frac{A\cdot B}{A^{2}+B^{2}-A\cdot B}\), where A and B are the row vectors of bin occupancies for the two compounds being compared and \(A^{2}=A\cdot A\)) was used to identify and retain only those chemicals from the two external test sets that had close analogs among the PAA and SAA training set compounds. Similarity thresholds equal to the average of the maximum pairwise similarities (Tmax) calculated for all possible pairs of chemicals in the training sets were used to determine which external test set chemicals belonged to the applicability domains of the PAA (T > 0.294) and SAA (T > 0.274) models; only those chemicals were retained. The performance parameters for these two test sets were used to estimate the true predictive power of our models. In collaboration with the Swiss Federal Food Safety and Veterinary Office (FSVO), NCTR also predicted the mutagenic potential of a data set of 23 aromatic amines whose activities were not disclosed to NCTR before the predictions were made. These were tested in a two-strain Ames assay at Envigo Lab (Rossdorf, Germany) (Brüschweiler and Merlot 2017).
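A small Python sketch may make the applicability-domain filter concrete. It assumes that Tmax for a training compound is its highest Tanimoto similarity to any other training compound and that the threshold is the mean of these values; the function names and the toy occupancy vectors are hypothetical.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity T = A.B / (A.A + B.B - A.B) for two count vectors."""
    ab = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b)) - ab
    return ab / denom if denom > 0 else 0.0


def domain_threshold(train):
    """Mean over training compounds of the maximum similarity to another one."""
    n = len(train)
    t_max = [max(tanimoto(train[i], train[j]) for j in range(n) if j != i)
             for i in range(n)]
    return float(np.mean(t_max))


def in_domain(x, train, threshold):
    """True if x has at least one training analog more similar than threshold."""
    return max(tanimoto(x, row) for row in train) > threshold


# Toy bin-occupancy vectors (hypothetical, three bins only).
train = np.array([[1, 0, 2], [1, 1, 2], [0, 3, 0]])
threshold = domain_threshold(train)
print(round(threshold, 4))                                 # 0.6389
print(in_domain(np.array([1, 0, 2]), train, threshold))   # True
print(in_domain(np.array([0, 0, 5]), train, threshold))   # False
```

With real 3D-SDAR descriptor matrices the same computation would be run over thousands of bins; the text reports the resulting thresholds as 0.294 for PAA and 0.274 for SAA.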

Methodology

Endpoint transformation

The PAA and SAA sets provided by Lhasa Limited contained data expressing the mutagenic potential of chemicals as a binary overall call, with 0 (or negative) encoding a lack of evidence for mutagenicity and 1 (or positive) encoding strong evidence of mutagenic activity (see Table 2 and the SI spreadsheet). The overall call was defined as a surrogate endpoint combining Ames assay data derived from the TA98 and TA100 Salmonella typhimurium strains (with and without metabolic activation) with literature evidence from other Salmonella strains and bacteria (e.g., Escherichia coli). In addition, the PAA and SAA data sets included records detailing the outcomes of the Ames test for two of the strains most sensitive to aromatic amines, TA98 and TA100, with and without metabolic activation (TA98+S9 and TA100+S9) (Fan et al. 1998). These two strains are often considered sufficient to detect either frame-shift (TA98) or base-pair substitution (TA100) mutations (Leong et al. 2010). Although there are different variants of the Ames test, most protocols recommend the use of either four (TA97a, TA98, TA100, and TA102) or five (TA98, TA100, TA1535, TA1537, and TA1538) strains (Winder 2004). Of these, strains TA97a, TA98, TA1537, and TA1538 detect frame-shift mutations, strains TA100 and TA1535 are used to identify base-pair mutagens, and the TA102 strain detects oxidants and cross-linking agents not detected by other strains (Winder 2004).

Table 2 Average P-scores and distribution of the positive and negative samples between the test and training sets based on the binary overall call for the PAA and SAA data sets

The Ames test outcomes for TA98, TA98 + S9, TA100, and TA100 + S9 were categorized as follows: negative (no evidence of mutagenicity), conflicting (evidence for and against), equivocal (absence of strong evidence for mutagenicity), and positive (strong evidence for mutagenicity).

Due to the inherent limitations of models based on binary endpoints (i.e., their inability to identify subtle but important structural determinants of activity), a coherent approach taking advantage of all available data to construct an intermediary, synthetic endpoint (referred to as the P-score) was employed. The Ames test outcomes reported for the individual TA98, TA98+S9, TA100, and TA100+S9 assays were assigned categorical variables, such that the negative outcomes were encoded as 1, the conflicting as 2, the equivocal as 3, and the positive as 4. A value of 5 (instead of 4) was also considered for the positive outcomes, which would position the equivocal outcomes in the middle (halfway between the negative and positive outcomes). This idea was discarded because it increased the discontinuity of the transformed P-scores, producing more sparsely populated or unoccupied regions (see Fig. 1b). For the purpose of this transformation, the original overall call assignments (0/1) were encoded as 1 and 4, representing the negative and positive calls, respectively. The transformed intermediary endpoint P was calculated using the following formula:

$$P = \frac{\frac{1}{n}\sum_{i=1}^{n} A_i + C}{2},$$

in which the first term in the numerator is the average of the Ames assay outcomes A_i for TA98, TA98+S9, TA100, and TA100+S9 (n being the number of outcomes averaged) and C is the overall call. As defined, the overall call C contributes to the intermediary score P as much as the four individual assays combined. There were several benefits resulting from this decision:

  1. It helped resolve seemingly conflicting situations, such as the case in which 8 PAAs and 7 SAAs, each characterized by four negative individual Ames assay outcomes (for TA98, TA98+S9, TA100, and TA100+S9), were assigned an overall positive call due to evidence for mutagenicity from other strains.

  2. It helped to retain chemicals for which only an overall call was reported (44 PAA and 47 SAA).

  3. The resulting intermediary endpoint P suffered from fewer (and smaller) gaps in continuity, i.e., improved uniformity of the distribution (see Fig. 1a).

Fig. 1 Histogram of the distribution of the P-scores for the PAA and SAA data sets based on the: a “1, 2, 3, 4” and b “1, 2, 3, 5” categorical class assignments, encoding for the negative, conflicting, equivocal and positive outcomes

Following from the above definition, the P-score will increase gradually to its maximum value of 4 as more of its components increase to their maximum values. Since each component of the P-score represents a different mechanism (need for metabolic activation, frame-shift mutation, base-pair substitution mutation, mutations caused by oxidants and cross-linking agents, etc.), its increase would be linked to the ability of each individual compound to damage DNA via one, two, or more alternative mechanisms. For example, a chemical with a P-score of 1 would be inactive under all testing conditions (different strains and added S9 fraction), whereas a compound that needs metabolic activation to revert TA98 (but is inactive under other testing conditions) would have a P-score of 2.875. At the opposite end of the spectrum would be a compound that is active in both strains (with and without metabolic activation) and for which there might be evidence for mutagenicity from other strains (potential indicators of alternative mechanisms). Such a compound would be characterized by a P-score of 4 (its highest possible value). In general, a higher P-score would indicate that a given compound is mutagenic under more testing conditions (for example, active in TA98, TA100, or both, with and without metabolic activation, and potentially active in other strains) and by extension might damage DNA via multiple alternative mechanisms. Alternative scoring functions that might be more closely related to the number of different mechanisms are also possible; however, as defined, the P-score was preferred for its simplicity.
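The worked values quoted above can be checked with a short Python sketch of the P-score formula. The function name `p_score` and the dictionary `CODES` are hypothetical. Returning C alone for chemicals with only an overall call is an assumption: the text says such chemicals were retained but does not state how their P-score was computed.

```python
# Categorical encoding of the individual Ames outcomes, as given in the text.
CODES = {"negative": 1, "conflicting": 2, "equivocal": 3, "positive": 4}


def p_score(assay_outcomes, overall_call):
    """P = (mean of encoded assay outcomes + C) / 2.

    assay_outcomes: outcome labels for the reported assays among
                    TA98, TA98+S9, TA100 and TA100+S9 (may be empty).
    overall_call:   0 (negative) or 1 (positive), encoded as C = 1 or 4.
    """
    c = 4 if overall_call == 1 else 1
    if not assay_outcomes:
        # Assumption: with no individual assay data, P is taken as C.
        return float(c)
    mean_assay = sum(CODES[o] for o in assay_outcomes) / len(assay_outcomes)
    return (mean_assay + c) / 2


# Inactive under all conditions, negative overall call.
print(p_score(["negative"] * 4, 0))                                    # 1.0
# Active only in TA98+S9, positive overall call.
print(p_score(["negative", "positive", "negative", "negative"], 1))   # 2.875
# Active under all four conditions, positive overall call.
print(p_score(["positive"] * 4, 1))                                    # 4.0
# Four negative assays but a positive overall call from other strains.
print(p_score(["negative"] * 4, 1))                                    # 2.5
```

The last case corresponds to the 8 PAAs and 7 SAAs mentioned earlier: instead of a hard positive label, they receive an intermediate score.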

Due to the improved continuity of the P-score (as opposed to the binary overall call), the generated 3D-SDAR models should be able to pinpoint specific structural features associated either “positively” or “negatively” with mutagenicity. It is also important to note that at the final stage of the modeling procedure, the predicted P-score values were converted back to a binary class assignment (negative/positive) using cut-off values determined from each of the training sets (shown in the last column of Table 2).

3D-SDAR fingerprint generation

3D-SDAR is a grid-based approach built on fingerprints constructed from the NMR chemical shifts of pairs of atoms (determining the X and Y coordinates of the individual fingerprint elements) and their corresponding through-space atom-to-atom distances (determining the Z-coordinate). These fingerprints are further tessellated by regular grids, thus generating descriptor matrices with thousands of columns (see the SI spreadsheet), which are later processed by a bagging-like Partial Least Squares (PLS) algorithm. The following steps describe the generation of these 3D-SDAR fingerprints:

  1. The molecular geometries of all chemicals were optimized using the AM1 Hamiltonian as implemented in Hyperchem v. 8.03 (2007). The ACDLabs NMR (2011) and XNMR (2011) predictors were used to simulate the 13C and 15N NMR chemical shifts;

  2. The ranges of the 13C and 15N chemical shifts for the two modeling sets were determined. In case of PAAs, δ13C covered a range between 5.82 and 206.9 ppm, whereas the δ13C of SAAs ranged from 5.07 to 207.99 ppm. The lower bound of the PAAs δ15N range was −373.56 ppm, while the upper bound was 632.13 ppm. Respectively, the lower and upper bounds of SAAs δ15N were −360.69 ppm and 632.13 ppm;

  3. To avoid the overlap between the δ13C and δ15N regions, the chemical shifts of the nitrogen atoms were translated downfield by 1000 ppm past the upper bound of the δ13C region. This translation, however, was performed only for convenience, as to simplify the fingerprint binning procedure and the fingerprint element count algorithm. For the purpose of structural interpretation, the original values of δ15N were restored once the modeling phase was complete.

  4. The through-space distances between all C–C, C–N, and N–N atom pairs were calculated and combined with their corresponding chemical shifts to generate unique fingerprints representing each compound in the 3D-SDAR space. One such fingerprint constructed for the structure of aniline is shown in Fig. 2.

Fig. 2 Molecule of aniline (1a) and its corresponding 3D-SDAR fingerprint (1b). The colors of the dashed lines correspond to the colors of the fingerprint elements shown in 1b. As explained above, δ15N is translated downfield by 1000 ppm

  5. The above-described downfield translation of δ15N allowed division of the XY-plane into three distinct regions: carbon–carbon, carbon–nitrogen and nitrogen–nitrogen. Since the gyromagnetic ratio γ of the carbon atoms is approximately 2.5 times that of the nitrogens, the grid density was set in such a way that the bins in the C–N region were 2.5 times larger than the bins in the C–C region and 2.5 times smaller than those in the N–N region. To explore the 3D-SDAR parametric space, tessellations with grid densities ranging from 4 × 4 ppm × 0.5 Å to 10 × 10 ppm × 1 Å in the C–C region, 4 × 10 ppm × 0.5 Å to 10 × 25 ppm × 1 Å in the C–N region and 10 × 10 ppm × 0.5 Å to 25 × 25 ppm × 1 Å in the N–N region (with incremental steps of 0.5 Å on the Z-axis and 2 ppm on the chemical shifts plane XY) were generated.

  6. After tessellation of all individual fingerprints, the fingerprint elements occupying each bin were counted and stored in 3D-SDAR descriptor matrices. In these matrices, each row vector represents a single compound, whereas each column vector contains the counts of the fingerprint elements occupying a specific bin.
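The fingerprint construction and binning steps above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the function names, the choice of bin origin (zero on every axis), the reading of the translation as adding 1000 ppm to each δ15N, and the three-atom toy input are all assumptions. The default bin widths correspond to the finest grid mentioned in the text (4 ppm for carbon, 10 ppm for nitrogen, 0.5 Å for distance).

```python
import itertools
import math
from collections import Counter

N_OFFSET = 1000.0  # ppm added to every 15N shift (assumed reading of the text)


def fingerprint(atoms):
    """Build one (pair_type, shift_1, shift_2, distance) element per atom pair.

    atoms: list of (element, shift_ppm, (x, y, z)) for the C and N atoms.
    """
    translated = [(e, s + N_OFFSET if e == "N" else s, xyz)
                  for e, s, xyz in atoms]
    elements = []
    for a, b in itertools.combinations(translated, 2):
        # Smaller translated shift first, so carbon precedes nitrogen.
        (e1, s1, p1), (e2, s2, p2) = sorted((a, b), key=lambda atom: atom[1])
        elements.append((e1 + e2, s1, s2, math.dist(p1, p2)))
    return elements


def bin_index(element, c_width=4.0, n_width=10.0, d_width=0.5):
    """Map a fingerprint element to its bin in the region-specific grid."""
    pair, s1, s2, dist = element
    w1 = n_width if pair[0] == "N" else c_width
    w2 = n_width if pair[1] == "N" else c_width
    return (pair, math.floor(s1 / w1), math.floor(s2 / w2),
            math.floor(dist / d_width))


def bin_counts(atoms):
    """Count fingerprint elements per bin: one sparse descriptor row."""
    return Counter(bin_index(el) for el in fingerprint(atoms))


# Toy input with made-up shifts and coordinates (not real aniline data).
toy = [("C", 128.0, (0.0, 0.0, 0.0)),
       ("C", 130.0, (1.4, 0.0, 0.0)),
       ("N", -320.0, (0.0, 1.4, 0.0))]
counts = bin_counts(toy)
print(len(fingerprint(toy)))        # 3
print(counts[("CC", 32, 32, 2)])    # 1
```

Stacking the per-compound counts over the union of occupied bins would give a matrix with one row per compound and one column per bin, as described in the last step.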

Model building

A bagging-like PLS-SIMPLS algorithm (De Jong 1993) written in Matlab (v8.0, 2012) was used to process each of the generated 3D-SDAR descriptor matrices. Before processing, all descriptors were standardized using the “zscore” Matlab function. To avoid overfitting and to build models free of selection bias, the PAA and SAA modeling sets (3/4 of the initial sets) were split randomly into training (80% of the chemicals) and hold-out test (the remaining 20% of the chemicals) sets. One hundred such randomizations were performed. On each run, the PLS algorithm fitted the P-scores for the training set and predicted the hold-out test and the validation sets (see Fig. 3). At the end, the aggregated predictions for the individual compounds belonging to each of the above-defined sets (training, hold-out and validation) were averaged, and a threshold equal to the average of the P-scores for the training set was used to classify each chemical as either positive or negative. Due to the repeated splitting of the modeling set into training/hold-out test set pairs, each compound from the modeling set had two predicted values: one from the runs in which it was randomly assigned to the training set and another from the runs in which it belonged to the hold-out test set. Because the training set predictions merely reflect the ability of PLS to fit data (and are thus irrelevant for estimating the “true” predictive power of models), the Results and discussion section focuses only on the predictions made for the hold-out, validation and “blind” external test sets.

Fig. 3 Flowchart of the 3D-SDAR modeling process

Models with between 1 and 10 latent variables (LVs) were generated and the accuracy of prediction for the hold-out test set was used to determine their optimal number. To recreate the same training/hold-out test set sequence when processing 3D-SDAR descriptor matrices based on different tessellations, the random number generator was initialized with a pre-specified random seed. This was also done with the intention of building completely reproducible models.
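The repeated 80/20 splitting with a fixed random seed can be sketched as below. This is a Python illustration under stated assumptions, not the Matlab code used in the study: it substitutes a small NIPALS-style PLS1 routine for SIMPLS, omits the z-score standardization and the predictions for the validation set, and runs on synthetic data. The function names and the seed value are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_lv):
    """NIPALS-style PLS1 (a stand-in for SIMPLS).

    Returns (coef, x_mean, y_mean) so that y_hat = (X - x_mean) @ coef + y_mean.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w                      # scores of this latent variable
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X and y
        yr = yr - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean


def bagged_holdout_predictions(X, y, n_lv, runs=100, seed=0):
    """Average each compound's predictions over the runs where it was held out."""
    rng = np.random.default_rng(seed)   # fixed seed -> same splits every time
    n = len(y)
    n_train = round(0.8 * n)
    total, count = np.zeros(n), np.zeros(n)
    for _ in range(runs):
        perm = rng.permutation(n)
        train, hold = perm[:n_train], perm[n_train:]
        coef, xm, ym = pls1_fit(X[train], y[train], n_lv)
        total[hold] += (X[hold] - xm) @ coef + ym
        count[hold] += 1
    return total / np.maximum(count, 1), count


# Synthetic data, for illustration only.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(60, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + 0.05 * data_rng.normal(size=60)
pred, count = bagged_holdout_predictions(X, y, n_lv=3)
print(pred.shape, int(count.sum()))  # (60,) 1200
```

Reusing the same seed for every tessellation keeps the training/hold-out sequence identical across descriptor matrices, so differences in hold-out accuracy can be attributed to the grid and the number of latent variables rather than to the splits.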

Interpretation

The linear nature of PLS combined with the robustness of the bagging-like randomization algorithm and the atom-level resolution of 3D-SDAR allow for a straightforward determination of the structural factors playing a role in mutagenicity. Since 3D-SDAR relies on composite models aggregating the predictions from multiple randomized models, the PLS weights from the individual models cannot be utilized directly to rank the 3D-SDAR descriptors by importance. An alternative approach would be to extract a preset number of highly weighted (positive as well as negative) descriptors for a preset number of latent variables on each run and to further calculate the frequency of occurrence of each descriptor as a fraction of the total number of extracted descriptors. The more frequently a given descriptor appears in the series of models, the more likely it is that this descriptor is truly related to the observed biological effect. During the next stage, the descriptors with the highest frequency of occurrence are mapped on the set of chemical structures and the recurring structural patterns would then represent molecular features that if present would lead to either an increase (positive PLS weights) or a decrease (negative PLS weights) in the overall mutagenic potential.

Results and discussion

Predictive performance

For both PAA and SAA sets, grids with 4 × 4 ppm × 0.5 Å in the C–C region, 4 × 10 ppm × 0.5 Å in the C–N region and 10 × 10 ppm × 0.5 Å in the N–N region produced the models with the highest overall accuracy for the hold-out test sets. In the case of PAA, the highest accuracy of prediction was achieved using models with 7 LVs, whereas in the case of SAA 6 LVs were found to be optimal. Our earlier work dealing with conversions between continuous and categorical variables (Slavov et al. 2014, 2016; Stoyanova-Slavova et al. 2017) indicated that the effect of the continuous variable distribution bias on the class assignment could be mitigated using cut-off values equal to the average of the training set activity vectors (in this case, the average of the P-scores for the PAA and SAA training sets). In other words, to convert the averaged P-scores for each compound back to a binary positive or negative class assignment, the cut-off values shown in the last column of Table 2 were used (2.487 for PAA and 2.309 for SAA). Since the probability of correctly classifying a chemical decreases the closer its predicted P-score lies to the cut-off value, confidence bands excluding all compounds with P-scores within ±5% of the cut-off values (2.362–2.611 for PAA and 2.194–2.425 for SAA) were defined. The performance parameters of the 3D-SDAR models for the hold-out test, validation and external test sets are summarized in Table 3. The statistical parameters for the sets in Table 3 marked with an asterisk exclude the chemicals within the low confidence bands. As can be seen from the performance parameters shown in Table 3 and Fig. 4, both PAA and SAA models offer excellent performance transferability from the hold-out test set to the validation and the “blind” external test sets.
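The conversion from an averaged P-score to a class label, including the ±5% low-confidence band, can be sketched in a few lines of Python. The function name `classify`, the label "inconclusive", and the treatment of scores falling exactly on a band edge are assumptions; the cut-off values are those quoted above.

```python
PAA_CUTOFF = 2.487  # average training-set P-score for PAA (Table 2)
SAA_CUTOFF = 2.309  # average training-set P-score for SAA (Table 2)


def classify(p_pred, cutoff, band=0.05):
    """Convert a predicted P-score into a class label.

    Scores within +/- band (as a fraction of the cutoff) are flagged as
    low-confidence; band edges are treated as inside (an assumption).
    """
    low, high = cutoff * (1 - band), cutoff * (1 + band)
    if low <= p_pred <= high:
        return "inconclusive"
    return "positive" if p_pred > cutoff else "negative"


print(classify(2.90, PAA_CUTOFF))  # positive
print(classify(2.40, PAA_CUTOFF))  # inconclusive
print(classify(2.00, SAA_CUTOFF))  # negative
```

Setting `band=0` would reproduce the plain cut-off classification used for the table entries not marked with an asterisk.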

Table 3 Predictive performance of the best PAA and SAA models

Fig. 4 ROC curves for the primary (a) and secondary (b) amines prediction sets after removal of the modeling inconclusives near the cutoff

Although the SAA sets were somewhat smaller than the PAA sets, the SAA model seemed to perform slightly better when predicting the validation and the “blind” external test sets. The SAA model also needed fewer LVs to achieve this somewhat higher performance level. One possible explanation might be the slightly better (although sparser) coverage of the chemical space by the SAA chemicals, which were less similar to each other (average Tmax = 0.274) than the chemicals constituting the PAA set (average Tmax = 0.294). Hence, models based on more structurally diverse chemicals might be able to capture trends in the data that are not accessible to models based on sets with more limited structural variability and, in turn, benefit from that information when predicting external test sets. An alternative explanation could be derived from the P-score distribution shown in Fig. 1, which indicates the presence of fewer SAAs with intermediate values of P and, correspondingly, fewer “borderline” cases.

Collaboration between NCTR and FSVO allowed a comparison between the predictions generated by 3D-SDAR, Lhasa Limited’s Sarah Nexus and Derek Nexus, and a two-strain Ames test performed on a set of 23 aromatic amines studied by Brüschweiler and Merlot (2017). These compounds are a subset of the 397 aromatic amines that are potential cleavage products of the 180 azo dyes used for clothing textiles. The Ames test was conducted under the assumption that the TA98 and TA100 strains are often sufficient to detect the mutagenicity of most aromatic amines (Bentzien et al. 2010; Harding et al. 2015). The Ames assay outcomes reported by the FSVO (Brüschweiler and Merlot 2017) and the predictions based on our 3D-SDAR PAA model are listed in Table 4. It is interesting to note that although 3D-SDAR was based on a fairly structurally limited training set, it was the only technique able to correctly classify 4,4′-cyclohexane-1,1-diyldianiline as mutagenic and, consequently, achieved the highest sensitivity (see Table 5). In terms of specificity, all three approaches seemed to have generated a significant proportion of false positives. However, the appreciable level of concordance between these predictions (with Kendall τb coefficients ranging from 0.417 to 0.685) suggested that there might be a rational explanation for (at least) some of the misclassified compounds. It was hypothesized that the presence of a 1,4-diaminobenzene moiety (found in 3 of the chemicals classified as false positives) or a 2-hydroxyethylamine group (present in 2 “false positives”) might influence the mechanism by which these chemicals cause damage to DNA and that this altered mechanism may go undetected by the Salmonella typhimurium strains used (Yoshida et al. 1998). This was the case with 2-[(4-aminophenyl)(ethyl)amino]ethanol, whose closest better-profiled analog 2,2′-[(4-aminophenyl)azanediyl]di(ethan-1-ol) demonstrated activity in the WP2 strain of Escherichia coli, but not in any of the Salmonella typhimurium strains.

Table 4 Ames assay data for the experimental validation set of 23 PAA and predictions generated by 3D-SDAR. The non-mutagenic compounds are labeled with “−”, while the mutagenic chemicals are labeled with “+”

Table 5 Comparison of the performance of Lhasa Limited’s Derek Nexus, Lhasa Limited’s Sarah Nexus and 3D-SDAR to predict the two-strain Ames results

Hence, it can be concluded that the discrepancy between the experimental and predicted mutagenicities could largely be attributed to the restricted set of strains used in the two-strain Ames test, as opposed to the broader sets of strains whose evidence is incorporated in Sarah Nexus, Derek Nexus and 3D-SDAR. For example, Sarah Nexus uses data from a five-strain Ames test, whereas 3D-SDAR is based on a P-score, a component of which is the overall call.

Structural factors affecting the mutagenic potential of aromatic amines

The structural interpretation of both PAA and SAA models was based on the first two latent variables (i.e., those explaining most of the variance in mutagenicity data). On each of the 100 runs, the top 10 most positively and most negatively weighted bins for each of the two latent variables were extracted and their frequencies of occurrence (expressed as percentages) were calculated using the following formula: F = 100 × Count/(10 × NLV × Runs). In this formula, Count is the number of times a specific bin is found in the accumulated list of bins, NLV is the number of latent variables and Runs represents the number of randomization cycles. In the case of PAA, 23 positively weighted bins with a frequency of occurrence of more than one percent and a cumulative frequency of occurrence of 81.03% were selected to identify the structural features associated with an increase in the mutagenic potential. Using the same cutoff of 1% resulted in the selection of 29 negatively weighted bins with a cumulative frequency of occurrence of 77.13%. In the case of SAA, the first 12 most frequently occurring bins with positive weights had a cumulative frequency of occurrence of 95.80%, whereas the cumulative frequency of occurrence of the 26 most negatively weighted bins was 81.40%. Because fewer positively and negatively weighted individual bins were needed to achieve a comparable or larger cumulative frequency of occurrence than in the PAA set, the SAA set is characterized by more clearly defined structural trends in the data. The most frequently occurring bins were further projected on the molecular structures to identify recurring patterns associated with an effective enhancement or reduction of the mutagenic potential of PAAs or SAAs. It is important to emphasize that the abundant atom-specific information encoded in the 3D-SDAR fingerprints, augmented with the robust bagging-like PLS algorithm, allows for identification of “true” negative contributions that might be particularly difficult or often impossible to detect via alternative techniques. Such structural features are not simply lacking contribution to the overall mutagenic potential, but are known to effectively lower it when present in the structures of otherwise potential mutagens.
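The frequency-of-occurrence calculation can be illustrated with a brief Python sketch. It reads the denominator as the total number of extracted bins (top bins per latent variable × number of latent variables × number of runs), which matches the description of the frequency as a fraction of all extracted descriptors. The function name and the toy bin labels are hypothetical, and the toy example uses smaller settings than the study (which used 10 bins, 2 LVs and 100 runs).

```python
from collections import Counter


def bin_frequencies(bins_per_run, top_k=10, n_lv=2):
    """Percentage frequency of each bin among all extracted bins.

    bins_per_run: one list per run holding the top_k bins extracted for
                  each of the n_lv latent variables (same weight sign).
    """
    runs = len(bins_per_run)
    counts = Counter(b for run in bins_per_run for b in run)
    total = top_k * n_lv * runs
    return {b: 100.0 * c / total for b, c in counts.items()}


# Toy example: 3 runs, 1 latent variable, top 2 bins per run.
toy_runs = [["b1", "b2"], ["b1", "b3"], ["b1", "b2"]]
freq = bin_frequencies(toy_runs, top_k=2, n_lv=1)
print(freq["b1"])                                # 50.0
print(sorted(freq, key=freq.get, reverse=True))  # ['b1', 'b2', 'b3']
```

Bins that recur across many randomized runs accumulate a high frequency, while bins selected only by chance in a few runs stay below a cutoff such as the 1% used in the text.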

All structural features discussed below were found to be consistent with earlier (Q)SAR findings and expert knowledge derived fragments as well as with known mechanisms of mutagenicity (Ahlberg et al. 2016; Benfenati et al. 2015; Gadaleta et al. 2016).

Primary aromatic amines

The presence of several highly specific structural factors was found to enhance the mutagenic potential of PAAs. Aromatic amines containing conjugated planar π-systems such as fluorene or carbazole derivatives, including those with fused benzene rings such as naphthalene and anthracene, were frequently present (select examples are shown in Fig. 5). These conjugated systems were often found to be long and narrow, with their width usually not exceeding two side-by-side benzene rings or 9 Å (taking into account the hydrogen atoms). Due to these highly specific geometric restrictions, the conjugated planar aromatic amines are likely exerting their effect through DNA intercalation leading to frame-shift mutations, a well-known mechanism enhancing mutagenicity (Ames et al. 1972). However, since the PAA data set consists only of aromatic amines, it is possible that the amino group is only a passenger, while the true “mutagenophore” is the planar conjugated aromatic system. More rigid polycyclic aromatic amines such as biphenylamine, and especially their ortho-substituted sterically hindered derivatives, exhibit a stronger mutagenic potential than those in which the aromatic rings are separated by two or more rotatable bonds (as in 2-benzylaniline). Another strong “mutagenophore” contender was the nitrophenyl group (see Fig. 5), found among many PAAs and SAAs characterized by high P-scores. Its presence has been associated with various possible mechanisms of DNA damage, including intercalation, adduct formation and, in specific cases, incorporation of the active molecule as a base analog into DNA during DNA synthesis (Matsuda et al. 1991).

Fig. 5 Structural features associated with an increase in the mutagenic potential of PAAs

The presence of the following structural factors was found to be associated with an overall reduction of the mutagenic potential: (1) ester, carboxyl and carboxamide groups directly attached to the phenylamine ring, with the ortho position being preferable (see Fig. 6 for select examples); (2) bulky (often aliphatic) substituents on both sides of the amino group; (3) diaminopyrimidine or diaminotriazine derivatives; and (4) monoaromatic amines (except for the nitrophenylamines discussed above). In these cases, the reduction of mutagenic potential is likely caused either by nearby bulky substituents blocking free access to the nitrogen atom or by its deactivation through redistribution of the electron density in the molecule toward the nearby, more electronegative oxygen atoms.

Fig. 6 Structural features associated with a decrease in the mutagenic potential of PAAs

Secondary aromatic amines

In addition to the structural features described above that enhance the mutagenic potential of PAAs, specific patterns better expressed in the SAA data set indirectly confirmed one of the postulated mechanisms of aromatic amine activation on the path to forming DNA adducts. This is the case for the phenylhydroxylamine derivatives (some examples of which are shown in Fig. 7), which undergo activation through conversion to their corresponding hydroxylamines, further forming esters and finally producing nitrenium ions via elimination of the OR group. The nitrenium ions then form DNA adducts (Ford and Herman 1992). As with the PAAs, many SAAs containing either conjugated planar aromatic systems or somewhat more rigid (due to steric hindrance) polycyclic aromatic amine moieties were found to exhibit strong mutagenic potentials. One striking observation was that all SAAs containing nitrothiophene moieties (see Fig. 7) were characterized by multiple overlaid positively weighted bins, indicating a significant mutagenic effect. Although multiple positively weighted bins were also present in most of the cases discussed above (only the most significant of them were visualized for simplicity), the large variety and recurring consistency of the bins associated with nitrothiophene moieties suggested that they carry an exceptionally high mutagenic potential. The DNA-damaging potential of nitrothiophenes was noticed early on (Wang et al. 1975), with later studies demonstrating that their activation is largely dependent on bacterial nitroreductase and therefore does not require an S9 fraction (Hrelia et al. 1990). Nitrothiophene mutagenicity is likely caused by their reduction to diamagnetic and free radical intermediates that further form hydroxylamines (Hrelia et al. 1990).

Fig. 7 Structural features associated with an increase in the mutagenic potential of SAAs

Examination of the projection of the negatively weighted bins on the molecular structures of SAAs revealed several structural features that, if present, would effectively reduce the mutagenic potential of otherwise potentially active aromatic amines. Two such examples are the sulphonamide and acetamide residues (see Fig. 8), where the mechanism of deactivation is likely associated with the electron-withdrawing effect on the nitrogen atom, which hinders its oxidation in the conversion to a nitrenium ion. As with the PAAs, bulky substituents near the amino group, as well as ester, carboxyl and carboxamide groups, were also found to lower the mutagenic potential of SAAs.
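The idea of projecting weighted bins back onto a molecular structure can be sketched in a few lines. The example below is a hedged illustration rather than the procedure used in this work: it assumes that each bin is indexed by a pair of atomic chemical shifts and the distance between the two atoms, that a table of bin weights is available, and that an atom's score is the sum of the weights of all bins occupied by its atom pairs. The bin widths (10 ppm and 1 Å) and the names `bin_key` and `project_weights` are hypothetical.

```python
import math
from itertools import combinations


def bin_key(shift_a, shift_b, distance, shift_width=10.0, dist_width=1.0):
    """Map an atom pair to a bin index (assumed grid: ppm x ppm x angstrom)."""
    low, high = sorted((shift_a, shift_b))
    return (int(low // shift_width), int(high // shift_width),
            int(distance // dist_width))


def project_weights(atoms, bin_weights):
    """Sum, per atom, the weights of the bins occupied by its atom pairs.

    atoms: list of (chemical_shift_ppm, (x, y, z)) tuples.
    bin_weights: dict mapping bin index to weight; absent bins count as zero.
    """
    scores = [0.0] * len(atoms)
    for (i, (shift_i, pos_i)), (j, (shift_j, pos_j)) in combinations(
            enumerate(atoms), 2):
        key = bin_key(shift_i, shift_j, math.dist(pos_i, pos_j))
        weight = bin_weights.get(key, 0.0)
        scores[i] += weight
        scores[j] += weight
    return scores


atoms = [
    (128.0, (0.0, 0.0, 0.0)),
    (130.0, (1.4, 0.0, 0.0)),
    (170.0, (2.8, 0.0, 0.0)),
]
weights = {(12, 13, 1): 0.5, (13, 17, 1): -1.0}
atom_scores = project_weights(atoms, weights)
```

Atoms that accumulate a net negative score under such a scheme would be the ones highlighted as lowering the predicted mutagenic potential.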

Fig. 8 Structural features associated with a decrease in the mutagenic potential of SAAs

The structural interpretation of the 3D-SDAR models discussed in this section clearly demonstrates that the mutagenic potential of both PAAs and SAAs is driven by similar structural features. Although some are better expressed in one of the data sets and not in the other, this is likely a statistical artefact arising as a result of the differences in the coverage of the chemical space.

Conclusions

Using a carefully designed algorithm focused on validation and interpretability, 3D-SDAR was able to successfully model diverse classes of primary and secondary aromatic amines. Experimental validation using 23 aromatic amines demonstrated predictive performance comparable to that of two widely used commercial systems developed by Lhasa Limited, namely Sarah Nexus and Derek Nexus.

The aggregated positively and negatively weighted 3D-SDAR bins and their projection on the standard coordinate space allowed the determination of structural features that either enhance or suppress the mutagenic potential of aromatic amines. Unlike most alternatives, 3D-SDAR was able to capture “true” negative contributions; i.e., functional groups or moieties whose presence actively reduces the overall mutagenic potential.

In compliance with the OECD requirements, our models were used to provide insights into the mechanisms by which the aromatic amines elicit their mutagenic effect. Observed structural trends in the data seem to confirm the postulated mechanism of aromatic amine activation through conversion to their corresponding hydroxylamines.