Introduction

Cancer is currently the leading cause of death in the Western world. Its high frequency in Western countries can mainly be attributed to lifestyle and environmental factors, which are thought to enhance abnormalities in the (epi)genetic material of cells and thereby facilitate the cancer process (Kinzler and Vogelstein 2002). Genotoxic carcinogens are a class of cancer-facilitating substances that share the property of causing DNA damage and hence interfere with DNA replication, transcription of genes, or the functionality of proteins. These genotoxic effects are considered part of the tumor initiation process and increase the risk of carcinogenesis. Other chemicals that are able to induce cancer, but do not directly interact with DNA, are non-genotoxic carcinogens (Silva and Van der Laan 2000). These compounds are generally not directly involved in tumor initiation, but may induce tumor-promoting effects (Silva and Van der Laan 2000; Fielden et al. 2007; Hernandez et al. 2009).

To protect society and the environment from carcinogen exposure, chemicals are thoroughly screened before being marketed. Generally, each substance is initially subjected to several tests exploring its genotoxic potential. When a substance is considered to be genotoxic, based on the results from both in vitro and in vivo genotoxicity tests, and if human exposure risk and/or production levels are high, the substance is subjected to long-term carcinogenicity rodent bioassays (Lilienblum et al. 2008; Luijten et al. 2012). These long-term bioassays have various disadvantages: they are time-consuming and expensive and require large numbers of animals. Furthermore, the use of chronic exposures to high doses may result in a high rate of false-positive results (Manuppello and Willett 2008). Another pitfall of this testing strategy is a bias toward genotoxic carcinogen identification: the initial short-term in vitro and in vivo genotoxicity assays are designed to detect genotoxic potential, possibly leaving non-genotoxic carcinogens unidentified. This can result in a substantial risk for society and the environment (Hernandez et al. 2009).

Alternative approaches are therefore needed to identify the carcinogenic potential of substances. To circumvent the aforementioned disadvantages in carcinogenicity testing, we set out to test the potential of microRNA and mRNA expression data as a means for correct identification of (non-)genotoxic carcinogens, thereby providing a more ethical approach to animal use and welfare through reduction and refinement. Transcriptomics analyses have been shown to be a useful and informative contribution to the current carcinogenicity testing methods (Jonker et al. 2009; Ellinger-Ziegelbauer et al. 2008; Thomas et al. 2009; Guyton et al. 2009; Fielden et al. 2007, 2008, 2011; Nie et al. 2006; Uehara et al. 2011; Waters et al. 2010). These studies have indicated that discriminative mRNA signatures after short-term exposure can, to a certain extent, be indicative of carcinogenic modes of action or predictive of the tumor endpoints after chronic exposure. Most of the large-scale in vivo studies have been performed in rats and often focused on carcinogens with one target tissue, e.g., hepatocarcinogens. In the present study, we searched for molecular classifiers in expression profiles of murine liver generated upon a 7-day exposure to a genotoxic carcinogen (GTXC), a non-genotoxic carcinogen (NGTXC), or a non-carcinogen (NC). We considered direct-acting chemicals or their reactive xenobiotic metabolites as GTXC. Indirect-acting genotoxic modes of action (e.g., induction of oxidative stress) were considered as NGTX modes of action. Four GTXC, seven NGTXC, and five NC were used for classifier selection. In addition to mRNA profiles, we also examined microRNA profiles to address the question of whether microRNAs are a useful addition to such a set of classifiers. MicroRNAs can post-transcriptionally regulate up to 65 % of the transcriptome and have a clear influence on cellular processes. To date, several specific microRNAs have been found to be overrepresented in cancerous tissues or specific tumor types or to be responsive to DNA damage (Kasinski and Slack 2011; Heneghan et al. 2010; Chen 2010; Elamin et al. 2011; Malik et al. 2012). However, the potential of microRNA transcripts as classifiers for carcinogen identification has not been investigated thoroughly.

Our study generated a classifier set (set of transcripts that collectively can be used as classifier) that discriminated between GTXC, NGTXC, and NC toxicants with high accuracy upon verification in the original chemical set in a 7-day in vivo experimental setup. Validation of the classifier set in an additional chemical set demonstrated that predictive potential for GTXC remained high, but also showed that prediction of NGTXC potential requires additional (genomic) strategies. Moreover, in this short-term in vivo setup, microRNA appeared to be less discriminative than mRNA.

Materials and methods

Animals

Six-week-old male wild-type mice (C57BL/6J, n = 4 per group) were acclimated for two weeks and subsequently treated for seven days with a GTXC, NGTXC, or NC through feed, gavage, or i.p. injection. From the day of weaning, the health status of the mice was monitored daily and mice were weighed weekly starting at acclimation. Animals were kept in the same stringently controlled (specific pathogen-free, spf) environment, fed ad libitum, and kept under a normal day/night rhythm. After seven days of exposure, mice were killed at a fixed time of the day. During autopsy, several organs (including the liver) were isolated and stored according to protocol using RNAlater (Qiagen, Valencia, CA, USA).

In vivo short-term exposure studies

Details for all chemicals used in the short-term exposure studies are shown in Table 1. For some of these chemicals, appropriate doses were based on previously performed 28-day dose-range finding (DRF) and mid-term studies [2-AAF, BaP, CsA, DEHP, DES, E2, PBB, Res, WY, D-man, DMBA, MMC (van Kreijl et al. 2001; de Vries et al. 1997; Melis et al. 2013a, b)]. For new compounds that we had not tested before, we performed 28-day DRF studies prior to the toxicogenomic studies, using a setup identical to that of the previously performed studies mentioned above (see Supplemental Information 1 for DRF studies of AFB1, CPPD, BPA, DIDP, SD, TBTO, AD, CCL4, DMN, TCDD, TBA, VPA). In short, for these DRF studies, six- to nine-week-old male C57BL/6J mice (n = 10 per group) were exposed to one of the selected chemicals, using multiple doses based on the literature or expert advice. Substances were administered through the feed (continuously), by gavage (every other day), or by i.p. injection (every third day). See Table 1 for the applied route of administration for each chemical. Body weights were monitored daily for the first 10 days and semi-weekly thereafter (see Supplemental Information 1). If body weight changes were not conclusive for identifying a suitable dose, the liver was studied macroscopically to determine a suitable sub-toxic dose for the short-term 7-day exposures (data not shown). An exposure time of 7 days was selected based on previous results (Jonker et al. 2009), in which full genome responses upon 3, 7, and 14 days of exposure to several GTXC, NGTXC, and NC were examined. In that study, 7-day exposure appeared to be a suitable time point to trigger exposure-related gene expression changes.

Table 1 Overview of chemicals and their details used for short-term exposures

In the subsequent 7-day exposure studies, dietary exposure was continuous during the experiment, i.p. injections were given on days 0, 3, and 6 (autopsy on day 7), and gavage was applied on days 0, 2, 4, and 6 (autopsy on day 7) (Table 1). Body weights were recorded during this 7-day exposure period (Supplemental Information 2). Comparison of different control groups (gavage, i.p. injection, or feed) showed no significant differential effect at the transcriptional level (Luijten et al. in preparation). Hence, only feed-administered control samples were included in this study.

RNA isolation, mRNA, and microRNA expression profiling

Hepatic total RNA was isolated using the miRNeasy kit (Qiagen, Valencia, CA, USA) and the QIAcube (Qiagen, Valencia, CA, USA) according to the manufacturer’s instructions. All samples passed RNA quality control using capillary gel electrophoresis (RIN >7.6) (Bioanalyzer 2100; Agilent Technologies, Amstelveen, The Netherlands). Amplification, labeling, and hybridization were performed according to the manufacturer’s protocols, using the Affymetrix Mouse Genome 430 2.0 Array platform (Affymetrix, Santa Clara, CA, USA).

The same total RNA isolates as used for mRNA were used for isolation of microRNAs. MicroRNA profiling was performed as previously described (Pothof et al. 2009).

Transcriptomics analyses

Quality control and correction of significant hybridization and experimental blocking effects, normalization, annotation, and subsequent data analysis were performed as previously described (Jonker et al. 2013). In short, all raw data passed the quality criteria, but relevant effects of labeling batches were detected. The raw data were annotated [according to (de Leeuw et al. 2008)] and normalized using the robust multi-array average (RMA) algorithm [Affy package, version 1.22.0 (Irizarry et al. 2003), available from the Bioconductor project (http://www.bioconductor.org) for the R statistical language (http://cran.r-project.org)]. The data were corrected for labeling batch effects using a linear model with group-means parameterization and labeling batch (random). The normalized data were statistically analyzed for differential gene expression using a mixed linear model with coefficients for block (random) and each experimental group (fixed) (Smyth 2005; Wolfinger et al. 2001). False discovery rate (FDR) correction was performed globally across all contrasts [according to (Storey and Tibshirani 2003)]. Volcano plots (FDR <0.05) of exposure versus control comparisons are depicted in Supplemental Information 3. Only annotated Entrez genes were used for further analysis. Functional genomics analyses, using the top 1000 FDR-ranked genes, were performed using Metacore GeneGO pathway analyses (version 6.11 build 41105, GeneGo Inc. St. Joseph, MI, USA), to assess the biological response upon each chemical exposure (Supplemental Information 4). Results were clustered by hand into more general functionalities for representation purposes (Table 2). The raw microRNA data were normalized using quantile normalization. For the CsA-, WY-, and CPPD-exposed groups, quality control discarded one outlier per group. 
Normalized values were analyzed for differentially expressed microRNAs using a linear model [bioconductor package Limma; (Smyth 2005)] and corrected for multiple testing (Benjamini and Hochberg 1995). The transcriptomic results are deposited at the NCBI Gene Expression Omnibus: GSE43847 (microRNA) and GSE43977 (mRNA).
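
The Benjamini and Hochberg correction applied to the microRNA p values can be illustrated with a short sketch. The Python function below is a generic step-up implementation and not the limma code used in the study; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five microRNAs
p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(p)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

In this toy example only the first two transcripts stay below an adjusted threshold of 0.05, although four of the raw p values are below 0.05.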

Table 2 Clustered and categorized Metacore GeneGO pathway responses upon 7-day exposure

Classification analyses

A tiered approach was used to derive a final classifier set (Fig. 1). The software-based algorithms K-nearest neighbor (KNN), prediction analysis for microarrays (PAM-R), and random forest (RF) were applied using the mRNA and microRNA transcriptome separately as input (Fig. 1). The R implementations used for these methods can be found in the R packages ‘class,’ ‘pamr,’ and ‘randomForest,’ respectively. We used a 2-step approach to generate classifiers to discriminate between genotoxic carcinogens (GTXC), non-genotoxic carcinogens (NGTXC), and non-carcinogens (NC). In the first step, classifiers are generated to discriminate the GTXC from the other two classes, and in the second step, classifiers for identification of NGTXC are retrieved. Since the number of chemicals within each class was unbalanced and it is well known that the KNN and PAM-R algorithms tend to create a bias toward classification of unknown compounds to the larger group, we adapted the scripts for the cross-validations in such a way that the group sizes within the training set were as large as possible but balanced. This resulted in group sizes that comprised all but one of the compounds of the smaller group, and one additional compound to that number for the larger group. For example, a classifier set to identify 2-AAF (as a genotoxicant) is generated by training on the three other GTXC and four compounds from the Rest class (a combination of NGTXC and NC). To select biomarkers for KNN and PAM-R, we performed a 100-fold cross-validation, each time with such a balanced training set (Fig. 1). For RF, this was not necessary, as the difference in class probabilities can be accounted for by setting the cut-off parameter. For RF, we used a simple leave-one-compound-out fold scheme. For each fold of the cross-validation, the classifiers were ranked according to the algorithm’s feature selection (e.g., shrunken centroid distance for PAM-R, calculated importance for RF, and a p value based on a t statistic for KNN).
Different lengths of lists of ranked features were tested, and only those genes from the list that gave the lowest error on classification of the unseen compounds in the fold were selected as potential classifiers. As some folds used up to the whole array for the best result, we limited those lists to the top 100 highest ranked genes. Each algorithm therefore yielded, per fold, top 100 (or fewer) lists for the GTXC versus Rest analysis and top 100 (or fewer) lists for the NGTXC versus NC analysis. For classifier selection (Fig. 1), we first analyzed per algorithm how many times a transcript was present within those generated top 100 lists. To prevent inclusion of false positives, transcripts were only considered for further selection into the classifier if they were present in more than 10 % of the top 100 cross-validation lists. A top-ranked (TR) classifier set was then generated per algorithm, consisting of the transcripts that were yielded most often within the cross-validations (ranked from most abundant down to the minimum of >10 %) (Supplemental Information 5, Tables 1 and 2). The three generated TR-classifier sets (KNN, PAM-R, and RF) were subsequently screened for overlap. This overlapping top-ranked (OTR) classifier set (see Supplemental Information 5, Tables 3 and 4 for the complete list) was then ranked based on an OTR score: the sum of the percentages of cross-validations in which a transcript was present for each algorithm (e.g., KNN 25 %, PAM-R 50 %, and RF 15 % yield an OTR score of 90). As a final step in the classifier selection, we checked the generated OTR classifier set for usability by implementing a class average fold-change threshold of FC < −1.5 or FC > 1.5 (Fig. 1). This final classifier set was first verified using the same three algorithms (RF, KNN, and PAM-R) and the previous settings to measure predictive potential in the total training set and subsequently validated in an additional validation set of chemicals (Fig. 1). In these verification and validation steps, a chemical was assigned to a certain class when the majority of the algorithms (two out of three) predicted this class.
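
The balanced training sets used for KNN and PAM-R can be sketched as follows. This Python example is illustrative only and is not the adapted R scripts used in the study. It assumes that the smaller class always contributes one compound fewer than its full size and the larger class one compound more than that, and that the 100 folds are repeated random draws for a given held-out compound. The function name and the `rest_XX` placeholder names are hypothetical.

```python
import random


def balanced_training_set(held_out, small_class, large_class, rng):
    """Draw one balanced training set for a held-out test compound.

    Assumption: the smaller class contributes (its size - 1) compounds
    and the larger class contributes one more than that number.
    """
    small_pool = [c for c in small_class if c != held_out]
    large_pool = [c for c in large_class if c != held_out]
    n_small = len(small_class) - 1
    n_large = n_small + 1
    return rng.sample(small_pool, n_small), rng.sample(large_pool, n_large)


gtxc = ["2-AAF", "AFB1", "BaP", "CPPD"]
rest = [f"rest_{i:02d}" for i in range(1, 13)]  # hypothetical placeholders

rng = random.Random(0)  # fixed seed so the draws are reproducible
draws = [balanced_training_set("2-AAF", gtxc, rest, rng) for _ in range(100)]
train_gtxc, train_rest = draws[0]
print(sorted(train_gtxc), len(train_rest))
```

With 2-AAF held out, every draw trains on the three other GTXC and four randomly chosen Rest compounds, which matches the example given in the text.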

Fig. 1 Schematic overview of the tiered classifier selection, verification, and validation approach

Results

Short-term in vivo exposure studies

The goal of this study was to explore the potential of both microRNA and mRNA transcripts as molecular discriminators for classification of (non-)genotoxic carcinogens. Transcripts, alone or as part of a classifier set, should ideally be able to correctly discriminate between three different chemical classes (GTXC, NGTXC, and NC). Wild-type male mice were therefore exposed to one of the sixteen tested chemicals, as depicted in Table 1a (four GTXC, seven NGTXC, and five NC). Concurrently, a control (untreated) study was performed. We included various GTXC and NGTXC with different carcinogenic potencies and/or carcinogenic modes of action. To increase the chance of extracting robustly performing classifier transcripts, we also included NC which mimic a mode of action of one of the included NGTXC: DIDP and DEHP are both phthalates, BPA, E2, and DES are ER-α ligands, and TBTO and CsA are immune suppressive substances.

During the 7-day exposure period, body weights were monitored (relative body weights are shown in Supplemental Information 2). Control groups exhibited, on average, a 3 % increase in body weight (calculated for the actual exposure period from day 0 to day 7). Exposure to TBTO, CsA, and E2 resulted in a slight decrease (>1 %) in body weight relative to the start of the exposure, of 5, 4, and 3 %, respectively. The remainder of the exposures led to an increased or steady (increase or decrease <1 %) body weight during the treatment (Supplemental Information 2). No gross macroscopic lesions were found at necropsy in exposed livers, apart from all WY-exposed mice, which exhibited yellow-spotted livers. This was possibly caused by fat deposits, a common finding upon Wyeth-14,643 exposure (NTP, http://ntp.niehs.nih.gov/ntp/htdocs/ST_rpts/tox062.pdf).

Functional genomics analyses confirm modes of action of chemical exposures

From an identical patch of the liver, mRNA and microRNA profiles were generated for each of the sixteen exposed groups as well as the control group. To assess whether the transcriptional response to each exposure was comparable to the chemical modes of action and properties described in the literature, functional genomics analyses were performed using Metacore software (see “Materials and methods”). For this, the top 1,000 most significantly regulated genes (ranked on FDR, compared to the untreated samples) for each chemical were used as input. Clustered categorized functional responses for all exposures are shown in Table 2, and in more detail in Supplemental Information 4 (Metacore GeneGO overrepresentation pathway map analysis, FDR <0.05). For most substances, previously reported modes of action and biological consequences could be retrieved from these analyses. For example, exposures to the genotoxicants 2-AAF, AFB1, BaP, and CPPD all yielded numerous overrepresented pathways involved in DNA damage response or apoptosis. Notably, the non-genotoxic carcinogens PBB and Res also generated, among others, a partly genotoxic signature. Substances belonging to the NGTXC and NC classes yielded the expected variety of functional responses, ranging from a strong signature related to fatty acid oxidation and metabolism (DEHP, WY, DIDP, and TBTO; all peroxisome proliferators) to induced immune-related responses (sodium diclofenac) and a cholesterol-associated response (CsA). Functional genomics analyses generally confirmed the expected effect of the chemical exposures and justified the use of these transcriptional data as input for possible classifier identification. To obtain optimal discriminative classifier sets for GTX and NGTX carcinogens, we used a tiered approach, which is described in detail in the following sections and in the “Materials and methods” section (see also Fig. 1).

Discriminative classifier selection for GTX and NGTX carcinogens

To obtain predictive classifier sets from the combined mRNA and microRNA transcriptome, we employed different software-based classification algorithms (Fig. 1). We used three different algorithms to avoid favoring a certain feature selection: K-nearest neighbor (KNN), prediction analysis for microarrays (PAM-R), and random forest (RF). KNN is a non-parametric method for classifying objects based on the closest training examples in the feature space, whereas PAM-R performs sample classification from gene expression data using the nearest shrunken centroid method. RF selects features randomly in order to construct a collection of decision trees with controlled variation. Based on the results of previous classification studies (Jonker et al. 2009; Benjamini and Hochberg 1995), we selected a 2-step classification approach for our current study. In the first step, a classifier set is generated to separate GTXC from the other two classes (Rest = NGTXC and NC); the second step yields a classifier set to discriminate between NGTXC and NC.
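
To make the KNN principle concrete, the following minimal Python sketch classifies a query profile by majority vote among its closest training profiles. It is not the R ‘class’ implementation used in the study; the Euclidean distance, the value of k, and the two-transcript expression profiles are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train_profiles, train_labels, query, k=3):
    """Predict the class of `query` from its k nearest training profiles."""
    # Pair each training profile's distance to the query with its label
    distances = sorted(
        (math.dist(profile, query), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    # Majority vote among the k closest training examples
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical log fold changes for two classifier transcripts
profiles = [[2.0, 1.8], [1.9, 2.1], [0.1, -0.2], [0.0, 0.1], [-0.1, 0.2]]
labels = ["GTXC", "GTXC", "Rest", "Rest", "Rest"]

print(knn_predict(profiles, labels, [1.7, 1.9]))
print(knn_predict(profiles, labels, [0.05, 0.0]))
```

Because the vote is taken among the nearest neighbors only, a class with many more training compounds can dominate the vote, which is the bias that the balanced training sets are meant to reduce.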

This 2-step approach was performed for each of the three algorithms (Fig. 1) using a 100-fold cross-validation and subsequent classifier selection (see “Materials and methods” for details). Herein, each ‘fold’ yields a classifier set for a selected test compound. The cross-validation for both the GTXC versus Rest and NGTXC versus NC steps resulted in classifier lists that were subsequently ranked according to the feature selection of the particular algorithm. The top 100 transcripts were selected per list. These transcript lists were then used for further classifier selection (Fig. 1).

Within the GTXC versus Rest and the NGTXC versus NC steps, for each algorithm, we analyzed and ranked the transcripts according to how many times a transcript was present within the 100-fold generated top 100 lists. For each algorithm, top-ranked (TR) classifier sets were created, consisting of transcripts that were present most abundantly over the 100 lists (with a minimum of 10 % of the lists to avoid false-positive classifiers) (Fig. 1, Supplemental Information 5, Table 1 and Table 2). The TR-classifier sets for KNN, PAM-R, and RF were subsequently screened for overlap, yielding an overlapping top-ranked (OTR) classifier set [Fig. 1, Supplemental Information 5, Table 3 (GTXC-R) and Table 4 (NGTXC-NC)]. The OTR-classifier sets contain the most abundantly yielded transcripts for all the generated TR-classifier sets over the three algorithms and thereby include the transcripts that most strongly influence classification. We subsequently increased the robustness of the generated OTR-classifier set by implementing an additional class average fold-change threshold of FC < −1.5 or FC > 1.5 (Fig. 1). The class average fold change is the average fold change of a transcript over all chemical exposures of a certain class (GTXC, NGTXC, NC) (Columns 1–3, Fig. 2). One of the GTXC-specific classifiers meeting these requirements was Cyp1a2, which is well known to be involved in the metabolism of several groups of xenobiotics and not only GTXC. Based on this knowledge, we excluded this transcript from the final classifier set (Supplemental Information 6). The final set now includes nineteen classifiers that should be able to discriminate GTXC from the rest and an additional eight classifiers to further identify NGTXC (Figs. 1, 2).
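
The selection steps described above (frequency in the top 100 lists, overlap between algorithms, OTR score, and fold-change threshold) can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the scripts used in the study: overlap is taken to mean presence in the TR sets of all three algorithms, and the function names, transcript names, and example values are hypothetical.

```python
from collections import Counter


def top_ranked(top_lists, min_fraction=0.10):
    """Percentage of lists containing each transcript, keeping only
    transcripts present in more than `min_fraction` of the lists."""
    counts = Counter(t for lst in top_lists for t in set(lst))
    n = len(top_lists)
    return {t: 100.0 * c / n for t, c in counts.items() if c / n > min_fraction}


def otr_scores(tr_by_algorithm):
    """Sum the per-algorithm percentages for transcripts found in the
    top-ranked set of every algorithm (assumed meaning of 'overlap')."""
    shared = set.intersection(*(set(tr) for tr in tr_by_algorithm.values()))
    return {t: sum(tr[t] for tr in tr_by_algorithm.values()) for t in shared}


def passes_fold_change(class_average_fc, threshold=1.5):
    """True if the class average fold change is below -threshold or above it."""
    return class_average_fc < -threshold or class_average_fc > threshold


# Ten hypothetical cross-validation lists: transcript_X occurs in two, transcript_Y in one
lists = [["transcript_X", "transcript_Y"], ["transcript_X"]] + [[] for _ in range(8)]
tr = top_ranked(lists)

# Percentages taken from the worked example in the Methods text
scores = otr_scores({
    "KNN": {"transcript_A": 25.0, "transcript_B": 40.0},
    "PAM-R": {"transcript_A": 50.0},
    "RF": {"transcript_A": 15.0, "transcript_B": 30.0},
})
print(tr, scores, passes_fold_change(1.8))
```

Here `transcript_Y` is dropped because it appears in exactly 10 % of the lists, not more, and `transcript_B` receives no OTR score because it is missing from one algorithm's TR set.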

Fig. 2 Heatmap of fold-change values of the 27 (mRNA) transcripts of the final optimized classifier set distinguishing GTXC, NGTXC, and NC upon 7-day in vivo exposure. Column numbers are depicted below the heatmap, and row numbers at the left side. Columns 1–3 represent average fold-change values per class. Columns 4–19 represent fold-change values per chemical indicated at the top of the column. Upon classifier selection, transcripts 1–20 are considered GTXC-specific classifiers (1–13 upregulated, 14–20 downregulated) and transcripts 21–28 are NGTXC classifiers (21–24 upregulated, 25–28 downregulated)

MicroRNA as potential transcriptomic carcinogen classifiers

No microRNAs were identified as OTR-classifiers for GTXC and NGTXC when using the combined mRNA and microRNA transcriptome as input. Messenger RNA therefore proved to contain more discriminative power in this short-term in vivo approach. To determine conclusively whether microRNA can be used for classification of carcinogens in a short-term in vivo setup, we additionally applied a similar analysis strategy (Fig. 1) using only the microRNA data as input. Without application of a fold-change threshold, this approach yielded several possible classifier microRNAs. However, when applying the same threshold as before (FC < −1.5 or FC > 1.5), no distinctive classifier candidates for GTXC and NGTXC classification could be identified. Implementing a less stringent threshold of FC < −1.3 or FC > 1.3 yielded twelve microRNAs, but their discriminative potential is low or absent (Fig. 3). In contrast to the mRNA expression levels in Fig. 2, the heatmap in Fig. 3 indicated that, at this fold-change threshold, the microRNA classifiers were on average only marginally distinct for a certain class (columns 1–3). Additionally, at the level of individual exposures, this threshold was mostly not suitable to assign a chemical to its correct class (columns 4–19). Because a lower fold-change threshold had to be implemented to (only partly) discriminate the classes from each other, microRNA transcripts in this short-term in vivo setup appear to be less suitable for carcinogen discrimination. We therefore pursued validation using only the strongest (mRNA) transcripts generated in the initial analyses (Fig. 2).

Fig. 3 Heatmap of fold-change values of the best-performing microRNA transcripts generated by only using microRNA as data input. Column numbers are depicted below the heatmap, and row numbers at the left side. Columns 1–3 represent average fold-change values per class. Columns 4–19 represent fold-change values per chemical indicated at the top of the column. Upon classifier selection (using FC < −1.3 or FC > 1.3), transcripts 1–2 are considered GTXC classifiers, transcripts 3–8 NGTXC classifiers, and transcripts 9–12 NC classifiers

Verification and validation of classifier set in original and additional chemical set

The final classifier set, consisting of nineteen GTXC-specific and eight NGTXC-specific mRNA transcripts, was selected based on the combined outcome of three different software-based classification tools (Supplemental Information 6). As such, the performance of this ultimate set was yet unknown. Although the classification will tend to be overoptimistic because the total training set itself was used to determine the final classifier set, classifying the training set with the selected classifier set gives an indication of the maximal possible classification accuracy for this set of chemicals (this accuracy is validated below). We calculated the overall predictive accuracy by again applying a 2-step approach using the KNN, PAM-R, and RF algorithms and used the same cross-validation fold scheme for training and test as for the gene selection, now with the fixed classifier set as input. A chemical is assigned to a certain class when the majority of the three algorithms predict this class. Summarized results are shown in Table 3 and in more detail in Supplemental Information 7. The predictive value seemed to be very good, as concordance (94 %), sensitivity (100 %), and specificity (80 %) were all very high. We subsequently validated the possible biomarkers using an additional set of eight chemicals. Transcriptional profiles upon 7-day exposures in C57BL/6J male mice were generated for three genotoxic carcinogens [7,12-dimethylbenz(α)anthracene (DMBA), dimethylnitrosamine (DMN), mitomycin C (MMC)], two non-genotoxic carcinogens [carbon tetrachloride (CCL4), 2,3,7,8-tetrachlorodibenzodioxin (TCDD)], and three non-carcinogenic but potentially toxic chemicals [amiodarone (AD), tolbutamide (TBA), valproic acid (VPA)]. Use of this validation set revealed that the predictive value of the possible biomarkers was in fact lower. The specificity for genotoxic compounds was very high (100 %), but the specificity for NGTXC, and especially the sensitivity, was low, leaving the overall percentage of correctly classified chemicals at 50 % (see Table 3 and Supplemental Information 7). Although the validation set of chemicals was relatively small, these results indicated that correct identification of NGTXC and putatively toxic NC is more difficult and might require additional (genomic-based) test strategies.
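
The majority-vote assignment and the performance measures can be sketched in Python as follows. The example uses the standard binary definitions of concordance, sensitivity, and specificity for a single GTXC versus Rest step; these definitions are an assumption, since the exact calculation behind Table 3 is not spelled out here, and the predictions shown are hypothetical rather than study data.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by at least two algorithms, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= 2 else None


def binary_metrics(true_labels, called_labels, positive):
    """Concordance, sensitivity, and specificity for one positive class."""
    tp = fp = tn = fn = 0
    for truth, called in zip(true_labels, called_labels):
        if truth == positive:
            tp += called == positive
            fn += called != positive
        else:
            tn += called != positive
            fp += called == positive
    return {
        "concordance": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical KNN, PAM-R, and RF predictions for five chemicals
truth = ["GTXC", "GTXC", "Rest", "Rest", "Rest"]
votes = [
    ["GTXC", "GTXC", "Rest"],
    ["GTXC", "GTXC", "GTXC"],
    ["Rest", "Rest", "GTXC"],
    ["GTXC", "GTXC", "Rest"],
    ["Rest", "Rest", "Rest"],
]
called = [majority_vote(v) for v in votes]
metrics = binary_metrics(truth, called, positive="GTXC")
print(called, metrics)
```

In this toy case one Rest chemical is called GTXC by two of the three algorithms, which lowers specificity while sensitivity remains at 100 %.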

Table 3 Overview of predictive power of the selected classifier set

Discussion

In the present study, we examined the potential of a transcription-based assay that addresses the misclassification of NGTXC and that can contribute to a more ethical approach toward animal use and welfare. We used a short-term in vivo-based assay, considering the benefits of an in vivo system for correct carcinogen identification, such as fully functional metabolic, signal transduction, and endocrine processes, and the possibility to test substances via a relevant route of administration. Several other in vivo toxicogenomics studies have been performed in recent years, although most used rat as a model system (Ellinger-Ziegelbauer et al. 2008; Fielden et al. 2007, 2008, 2011; Nie et al. 2006; Uehara et al. 2011; Waters et al. 2010; Thomas et al. 2009). Even though predictive results varied, these studies provided evidence that some mRNA transcriptional signals could potentially serve as discriminators for carcinogenic potential of substances.

In the present study, we analyzed the discriminative power of both microRNA and mRNA transcripts to identify the (genotoxic) carcinogenic features of chemicals. Multiple classifier algorithms with different feature selections were used, which yielded a classifier set consisting of 27 mRNA transcripts that was able to partly discriminate between GTXC, NGTXC, and NC. No microRNAs met the applied criteria, which indicated that microRNA expression signatures have less discriminative potential for carcinogenic classes than mRNA in a short-term in vivo murine study, and possibly also in other species or in vitro assays. The fact that the number of microRNAs present in our dataset was smaller than the number of mRNA transcripts is not the reason for the underrepresentation, since any transcript with a strong discriminative signature would be selected from the analyses. MicroRNAs are considered major regulators of the genome, and their expression is therefore possibly very tightly controlled, resulting in a less pronounced or less class-specific regulation. To date, only one microRNA (miR-34a) has been associated with a genotoxic p53-dependent response in numerous cell types and exposures (He et al. 2007) and is generally considered a genotoxic microRNA biomarker. However, this microRNA was not significantly regulated in vivo upon short-term GTXC exposures in our study, even though some of the GTXC exposures in our study did exhibit a significant p53-dependent DNA damage response based on the mRNA pathway analyses (Table 2, SI4, 2-AAF, and AFB1). In line with these findings, recent publications indicated that miR-34-knockout mice and cell lines do not diverge from the wild-type situation concerning p53 response and tumor development (both spontaneous and upon genotoxic stress) (Concepcion et al. 2012; Jain and Barton 2012). This indicates that not all experimental circumstances and cellular conditions result in a default upregulation of miR-34 upon genotoxic stress. Possibly, the use of different exposure times or higher dosing might result in a more pronounced microRNA regulation.

The final classifiers in our set were not expected to undisputedly represent a well-known or anticipated class-specific biological response because of the experimental setup, i.e., using carcinogens with different potencies and modes of action, and including potentially toxicity-inducing NC. Nevertheless, a biological or functional relationship to cancer for several classifier transcripts has been reported by other studies. This is most obvious for the large majority of the GTXC classifiers, which have previously been linked to carcinogenesis [Tiam2 (Chen et al. 2012), Id2 (Coma et al. 2010; Lasorella et al. 2005), Il1b (Zhang et al. 2012), Nedd4l (Gao et al. 2012), Slc45a3 (Rickman et al. 2009), Zbtb16 (Palta et al. 2012)] or to tumor suppressive effects [Phf17 (Zhou et al. 2005), Nr4a1 (Ramirez-Herrick et al. 2011), Ihpk2 (Morrison et al. 2007)], or have been shown to be regulated upon DNA damage [Il1a (Bender et al. 1998)]. The NGTXC classifiers in our set might not represent every possible NGTXC mode of action, but are apparently at least representative of several of them, since we used NGTXC exposures with a variety of modes of action (e.g., immune suppressants, peroxisome proliferators, and hormonal carcinogens). Additionally, several of the transcripts in both classifier groups (e.g., LOC75771, 4931408D14Rik, and 9030619P08Rik) have no known function yet and might therefore be interesting candidates for further research concerning genotoxicity or carcinogenic responses. None of the included mRNA transcripts were part of any of the classifier sets generated in previously mentioned in vivo studies (Ellinger-Ziegelbauer et al. 2008; Fielden et al. 2007, 2008, 2011; Nie et al. 2006; Uehara et al. 2011), most likely because these studies used rat as a model system, performed mostly NGTXC versus NC exposures, and occasionally used different target tissues or cell types. Therefore, the current classifier set and the results of the functional pathway analyses (SI4) could shed some new light on transcriptional responses toward GTXC, NGTXC, and NC exposure in mice and, more importantly, help elucidate processes that are mostly regulated upon (certain types of) NGTXC exposure.

The final set of 27 transcripts was generated to discriminate between GTXC, NGTXC, and NC. The predictive outcome for the original set of chemicals was very high: concordance (94 %), specificity (100 %), and sensitivity (80 %). This indicated that the applied strategy for classifier selection was a valid approach. We additionally made an initial attempt to validate this classifier set using an extra set of chemical exposures. Predictive potential for GTXC remained 100 % correct when tested in the small validation set, although more chemicals need to be tested to validate the true potential of this classifier set. In contrast to GTXC, the classifier set performed less well in correctly identifying NGTXC and NC. TCDD, an NGTXC, was misclassified in the validation, possibly due to its specific mode of action through the aryl hydrocarbon receptor (no NGTXC acting through this receptor was present in the training set) and/or due to collateral DNA damage, which could potentially induce a ‘genotoxic’-like profile (Park et al. 1996; Fernandez-Salguero et al. 1996). Misclassification of NC in the validation set might also be due to their toxic nature, inducing cellular stress and indirect (oxidative) DNA damage upon exposure. Similarly, in vivo-derived classifier sets from Fielden et al. and Nie et al. showed high predictive potential based on training results, but upon extensive validation, the predictive power decreased substantially. Concordance levels dropped to 64 and 55 %, respectively (Waters et al. 2010), accentuating the need for novel genomic-based approaches. Obviously, to create a more realistic view of the potential of our (and other) classifier sets, more elaborate validation studies are needed. So far, however, our results and those of others indicate that a set of single classifier transcripts might not be sufficient to obtain high predictive power for these three classes of chemicals.
Therefore, additional genomic strategies, inclusion of multiple tissues, and also reevaluation of the chemical classes are necessary.
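
For readers less familiar with the performance measures quoted above, the following minimal Python sketch shows one common way to compute concordance, sensitivity, and specificity from true and predicted class labels. It is an illustration under stated assumptions, not the procedure used in this study: the function name `classification_metrics`, the choice to treat both carcinogen classes as "positive", and the example labels are all hypothetical and do not reproduce the study data.

```python
import math
from typing import Dict, Sequence, Set


def classification_metrics(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive_classes: Set[str],
) -> Dict[str, float]:
    """Compute concordance, sensitivity, and specificity.

    Assumptions (not taken from the study): concordance is the fraction of
    exact class matches; sensitivity and specificity are computed after
    collapsing the labels into positive (in positive_classes) versus negative.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels differ in length")
    if len(true_labels) == 0:
        raise ValueError("at least one label is required")

    exact_matches = 0
    tp = fn = tn = fp = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            exact_matches += 1
        truth_positive = truth in positive_classes
        prediction_positive = prediction in positive_classes
        if truth_positive and prediction_positive:
            tp += 1
        elif truth_positive:
            fn += 1
        elif prediction_positive:
            fp += 1
        else:
            tn += 1

    return {
        "concordance": exact_matches / len(true_labels),
        # Undefined (NaN) when no true positives or negatives are present.
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        "specificity": tn / (tn + fp) if (tn + fp) else math.nan,
    }


# Hypothetical labels for illustration only; these are not the study data.
true_classes = ["GTXC", "GTXC", "NGTXC", "NGTXC", "NC", "NC"]
predicted_classes = ["GTXC", "GTXC", "NGTXC", "NC", "NC", "NC"]
metrics = classification_metrics(
    true_classes, predicted_classes, positive_classes={"GTXC", "NGTXC"}
)
print(metrics)
```

In this hypothetical example, one NGTXC is predicted as NC, which lowers sensitivity while leaving specificity unaffected. With small validation sets, a single misclassified chemical shifts these percentages considerably, which is one reason more extensive validation is needed.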

Results of our and previous studies showed that the many possible modes of action and indirect effects of NGTXC and NC make it difficult to distinguish between these classes, and that these classes should therefore be refined into more suitable groups of chemicals to evaluate carcinogenic features. Several NGTXC and NC, for example, do induce some form of genomic instability (as indicated by mutagenicity or chromosomal aberration assays) or result in collateral (DNA) damage, but were considered NGTXC or NC due to the lack of a chronic bioassay and other supportive evidence. Regarding future prospects, it might be necessary to screen a multitude of the NGTXC-related (often tumor-promoting) processes or modes of action in order to assess whether a chemical has non-genotoxic carcinogenic potential. Additionally, non-carcinogenic, but toxic, responses should be inventoried to create an improved filter for distinguishing between toxic and carcinogenic modes of action. For this approach, however, an elaborate database of NGTXC and NC exposure data is a prerequisite. Together with previous large-scale in vivo studies focusing on NGTXC, our results contribute to mapping these cellular responses and processes.

In conclusion, our results show that microRNAs have less potential as classifiers than mRNA transcripts in a short-term in vivo setup and might require longer exposure times or higher doses for a more pronounced response. In our study, the classifier set presented above was able to predict genotoxic characteristics with very high accuracy, but our results also indicated that discrimination of non-genotoxic carcinogenic and toxic features of a chemical requires additional or different (genomic-based) strategies. We believe that our results create a realistic view of possibilities, drawbacks, and future necessities in the field of toxicogenomics and are a meaningful contribution to the development of alternative testing strategies for carcinogen identification.