Introduction

The liver is one of the largest glands in the human body. It carries out a number of complex functions, such as detoxification, glycogen storage, and secretion of digestive fluid, which are all essential for life. Since there is no replacement or compensation for liver function, liver diseases are considered a grave threat to human lives. Hepatocellular carcinoma (HCC) is one of the most common tumors in East Asia. Although etiological factors of HCC such as hepatitis virus infection, alcohol abuse, and aflatoxin B1 intoxication are well known (Thorgeirsson and Grisham 2002), the outcome of its treatment remains limited. Thus, it is imperative for biologists to understand the molecular mechanisms of HCC for better diagnosis and therapy. High-throughput microarray technologies have provided new avenues to explore expression patterns of genes related to liver dysfunctions, including their transcriptional controls from a genomic perspective. For example, by utilizing high-density oligonucleotide microarrays, Wurmbach et al. (2007) identified a set of gene markers whose expression patterns can be used to track the progression of the Hepatitis C virus (HCV)-induced liver carcinogenic process or to distinguish cancerous from normal liver cells.

In general, HCC and other cancers develop when cells proliferate at an unusually high rate. This high rate of cell proliferation can also be observed during the processes of liver development and regeneration. Therefore, comparison of gene expression profiles between these two processes and HCC progression may shed light on the molecular nature of liver cancer. We and others found that transcriptional reprogramming induced in HCC mimics that of the developing liver cells (Coulouarn et al. 2005; Li et al. 2009). The global similarity in transcriptional profiles between developmental and cancerous cells has also been observed in a variety of tissues (Naxerova et al. 2008). Since liver development and regeneration both require simultaneous proliferation and acquisition of highly specialized cellular functions (Kelley-Loughnane et al. 2002), it is intriguing to ask how much they have in common at the transcriptome level. A microarray study by Otu et al. (2007), however, revealed that liver restoration after hepatectomy and liver development differed dramatically in their gene expression profiles. The similarities and differences between liver regeneration and HCC, however, remain to be explored.

Motivated by the above considerations, we compared gene expression patterns in the stepwise liver carcinogenic process with those in liver development and regeneration. Our comparative analyses elucidated their relationships regarding functional groups and transcriptional control, and identified several genes that may be related to HCC progression.

Materials and methods

Microarray data

In our previous study, gene expression profiles of mouse liver development were generated by Affymetrix Mouse 430 2.0 chips (Li et al. 2009). They can be accessed in the GEO database under accession number GSE13149. Gene expressions were measured at 14 different stages during the C57/B6 mouse liver development: E11.5 (embryonic day 11.5), E12.5, E13.5, E14.5, E15.5, E16.5, E17.5, E18.5, Day 0 (day of birth), Day 3, Day 7, Day 14, Day 21, and normal adult liver (control sample).

Gene expression data of mouse liver regeneration (GSE6998) and human HCC (GSE6764) were downloaded from the GEO database. The mice undergoing partial hepatectomy in GSE6998 dataset were female CD-1 mice at 10 weeks of age (Otu et al. 2007). During their operation, the three anterior lobes of the liver were tied first and then resected. After the 2/3 hepatectomy, the liver was returned to the abdominal cavity and the abdomen closed in two layers. The regenerated liver tissues were harvested at 0, 1, 2, 6, 12, 18, 24, 30, 48, and 72 h after 2/3 hepatectomy. Gene expression data were generated by Affymetrix Mouse 430 2.0 chips. Normal liver tissue (0 h) was taken as reference point.

Human HCC data were produced by Wurmbach et al. (2007) using HG U133 Plus2 chips. The dataset comprised 65 disease samples from four disease groups representing a stepwise carcinogenic process from preneoplastic lesions (cirrhosis and dysplasia) to HCC: 13 samples from cirrhotic tissue, 17 dysplastic nodules, 18 early HCCs, and 17 advanced HCCs. This dataset also contained ten healthy liver tissue samples. The mean expression levels of healthy liver tissues were used as reference values.

All of the original .CEL files were annotated with CDF files generated by Dai et al. (2005) (Version 11.0.1, Entrez gene) using the software RMA (Irizarry et al. 2003). From the updated genome assembly, it was found that the definitions of many genes/transcripts on the microarrays were out of date. Dai et al. (2005) aligned the probes to different sources of genomic data to filter out problematic probes. Using their definition, this study mapped probe sets of 17,726 Entrez genes for human liver disease data, and 16,331 Entrez genes for mouse liver development and regeneration data.

Cross-platform comparison and homology mapping

After probe-set mapping, corresponding mouse orthologs of human genes were retrieved from Ensembl’s Biomart system, and only one-to-one ortholog pairs (12,480 pairs) were kept for analysis. To compare data from different platforms, we defined the expression level for each gene as its ratio relative to the mean value of control samples (Table S1). Hierarchical clustering based on Uncentered Correlation Coefficient was carried out using Cluster 3.0 (Eisen et al. 1998) and visualized in TreeView 1.60. Expression levels of each gene were adjusted by log-transforming, mean-centering, and normalizing to unit variance across different samples before clustering.
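The referencing and pre-clustering steps described above can be sketched in a few lines. This is only an illustration, not the code used in the study: the function names are hypothetical, the base-2 logarithm and the sample (n − 1) variance are assumptions because the text does not state them, and the uncentered correlation is written as the cosine similarity that Cluster 3.0 reports under that name.

```python
import math


def relative_log_profile(values, control_values):
    """Log2 ratio of each sample to the mean of the control samples.

    Base 2 is an assumption; the text only says "log-transforming".
    """
    control_mean = sum(control_values) / len(control_values)
    return [math.log2(v / control_mean) for v in values]


def standardize(profile):
    """Mean-center one gene's profile and scale it to unit variance.

    Uses the sample variance (n - 1), which is an assumption.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    sd = math.sqrt(sum(c * c for c in centered) / (n - 1))
    return [c / sd for c in centered] if sd > 0 else centered


def uncentered_correlation(x, y):
    """Uncentered correlation of two profiles (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy intensities for one gene: two disease samples, two control samples.
profile = relative_log_profile([4.0, 8.0], [1.0, 3.0])
```

Because every value is expressed relative to its own platform's control mean, a human profile and a mouse profile can be compared directly with `uncentered_correlation`, which is the point of the referencing strategy.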

Detecting differentially expressed genes

To identify the subset of human genes potentially responsible for HCC, we treated the stepwise carcinogenic process, ranging from preneoplastic lesions (cirrhosis and dysplasia) to HCC, as a time series, and then used EDGE to identify the genes differentially expressed during this time course. EDGE is a significance method for identifying differentially expressed genes in a time course study (Storey et al. 2005). We chose a q value cutoff of 1 × 10^−4, and identified 3,556 genes whose expression changed as HCC progressed. For the 3,556 genes, 2,788 one-to-one human–mouse ortholog pairs and corresponding expression profiles were obtained (Table S2).

KEGG pathway analysis

The manually curated pathway file of human (hsa_pathway.list, version 2009/01/05) and the map title file (map_title.tab, version 2009/01/04) were downloaded from the KEGG Website (http://www.genome.ad.jp/kegg/pathway.html). GeneMerge (Castillo-Davis and Hartl 2003) was used to identify pathways whose component genes occur more often in the group of our interest than what is expected by chance. GeneMerge produced a p value from hypergeometric test to rank over-represented pathways. To avoid inflated false positives as a consequence of multiple testing, the Bonferroni corrected p value of 0.01 was used as cutoff.
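The over-representation test described above can be illustrated with a short sketch. It is not GeneMerge's own code: the function and parameter names are hypothetical, and it assumes the usual one-sided hypergeometric tail (the probability of seeing at least the observed number of pathway genes in the study set) followed by a Bonferroni correction that multiplies the p value by the number of pathways tested.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_in_population, study, pathway_in_study):
    """P(X >= pathway_in_study) when `study` genes are drawn without
    replacement from `population` genes, of which `pathway_in_population`
    belong to the pathway."""
    upper = min(pathway_in_population, study)
    tail = sum(
        comb(pathway_in_population, k)
        * comb(population - pathway_in_population, study - k)
        for k in range(pathway_in_study, upper + 1)
    )
    return tail / comb(population, study)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers: 10 genes in total, 5 in the pathway, 5 in the study set,
# and all 5 study genes fall in the pathway.
p_raw = hypergeom_enrichment_p(10, 5, 5, 5)
p_adj = bonferroni(p_raw, 3)  # assuming 3 pathways were tested
```

A pathway would then be reported when its corrected p value falls below the 0.01 cutoff.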

Identifying enriched motifs

Motifclass in the CREAD package (Smith et al. 2006) was employed to search for over-represented motifs in the promoter regions of selected genes relative to the background promoter set. The background set was constructed from 1,000 genes randomly selected from the human genome. We defined promoter regions as 1,000 bp upstream to 200 bp downstream from annotated transcription start sites (TSS), and retrieved their sequences from the UCSC Genome Browser. Known sequence motifs were obtained from the vertebrate subset of TRANSFAC (Matys et al. 2003) and JASPAR (Sandelin et al. 2004), and were represented by position weight matrices. Then, all of the motifs were ranked by Motifclass based on relative error rates. In Motifclass, the relative error rate is equal to 1 − (sensitivity + specificity)/2. Here, sensitivity is the proportion of the study set whose promoters include the motif; specificity is the proportion of the background promoter set whose promoters do not include the motif. We required the relative error rates for selected motifs to be significantly lower than those for random promoters (p value < 0.01), and only reported the top ten ranked motifs.
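As a worked illustration of the relative error rate defined above, the sketch below computes it from raw promoter counts and ranks two motifs. The function name, the motif labels, and the counts are all hypothetical; this is not Motifclass code and it omits the significance test against random promoters.

```python
def relative_error_rate(study_hits, study_size, background_hits, background_size):
    """1 - (sensitivity + specificity) / 2, from promoter counts.

    sensitivity: fraction of study promoters containing the motif.
    specificity: fraction of background promoters NOT containing it.
    """
    sensitivity = study_hits / study_size
    specificity = (background_size - background_hits) / background_size
    return 1 - (sensitivity + specificity) / 2


# Hypothetical counts: (study hits, study size, background hits, background size).
counts = {
    "motif_A": (80, 100, 200, 1000),
    "motif_B": (50, 100, 500, 1000),
}
# Lower error rate means better separation of study and background promoters.
ranked = sorted(counts, key=lambda name: relative_error_rate(*counts[name]))
```

A motif that occurs equally often in both sets, such as `motif_B` here, scores 0.5, which is the value expected for a motif carrying no discriminating information.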

Results

Comparison of gene expression across liver development, liver regeneration, and HCC

Our analysis is based on a large-scale comparison of gene expression across liver development, liver regeneration, and HCC. After processing and homology mapping, expressions of 12,480 human–mouse one-to-one orthologous Entrez genes were retained. To ensure that the data from different platforms and species are comparable to each other, we applied a previously proposed referencing strategy (Kaiser et al. 2007) that defines the expression level for each gene in each sample as its ratio relative to the mean level of corresponding normal control samples. The relative expression profiles of all 12,480 genes are listed in supplementary Table S1.

To investigate the global expression patterns of different biological processes, expression profiles from different samples were grouped by hierarchical clustering. As shown in Fig. 1a, human cirrhotic and dysplasia samples were first clustered with mouse regeneration samples and then with mouse development samples, whereas human early HCC and advanced HCC samples were clustered together in another clade. It has been reported that in many tissues, tumorigenesis and developmental processes have similar expression patterns (Kaiser et al. 2007; Naxerova et al. 2008), but this similarity was not observed in Fig. 1a. However, after exclusion of late development samples, early development (E11.5 to E17.5, as defined in our previous work; Li et al. 2009) and cancer were placed under the same hierarchy (Fig. 1b), suggesting a closer relationship between early development and HCC.

Fig. 1

a Clustering dendrogram for the development, regeneration, and different stages of HCC. Here, ci stands for cirrhotic tissue, dys stands for dysplastic nodules, e-HCC stands for early HCC, a-HCC stands for advanced HCC, regene stands for regeneration, and deve stands for development. b Clustering dendrogram for the early development, regeneration, and different stages of HCC. Human HCC and mouse early development samples were placed side-by-side in contrast to a. Here, e-deve stands for early development stages which include E11.5 to E17.5

Comparing Fig. 1a and b, we find that the inclusion of late developmental stages influenced the final clustering results dramatically. In Fig. 1a, the early and late developmental stages are more similar to each other than to any other group and are therefore clustered together first. Taken as a whole, these developmental stages are not very similar to HCC. In hierarchical clustering, clades that were grouped together in a previous step are not re-evaluated, so the developmental stages as a whole are separated from HCC (Fig. 1a). After we excluded the late developmental stages, however, the similarity between early development and HCC was revealed (Fig. 1b). Besides the relationship between development and HCC, we also noticed in Fig. 1a and b that the regeneration samples show expression patterns similar to those of the cirrhotic and dysplasia samples, especially the dysplasia samples.

The similarity and difference between HCC and liver development

The above-mentioned hierarchical clusterings suggest partial similarity between development and HCC, so we asked which gene groups contribute to the similarities and which groups can explain the differences. Here, two-dimensional clustering was performed on 2,788 genes whose mRNA levels were altered during human carcinogenesis (Table S2). The resulting heatmap clearly shows four gene groups (Fig. 2): expression in liver development and HCC is similar in Groups A and C, but different in Groups B and D.

Fig. 2

Heatmap of 2,788 genes whose expression levels are significantly changed during human carcinogenesis. Columns represent samples from different processes, and rows represent genes. It is clear that four groups of genes with distinct expression patterns emerged. The definition for each label is the same as those in Fig. 1

The expression patterns of advanced HCC and early development are similar in Groups A and C. In Group A, the genes are suppressed in both advanced HCC and early development; in Group C, the genes are elevated in both processes. Functional analysis indicates that genes in Group A mainly execute normal liver functions such as “fatty acid metabolism” (Table 1). Motifclass was further used to identify enriched cis-regulatory elements in the promoter regions of genes in Group A. The top-ranked motifs correspond to the liver-enriched transcription factors HNF1 and HNF4 (Table 2). These results indicate that genes in Group A play important roles in normal liver functions. Suppressed expression of these liver function-related genes in advanced HCC and early development reflects the loss of liver-related functions in cancer and their immaturity in early development, respectively.

Table 1 Over-represented KEGG pathways (adjusted p value < 0.01) for different gene groups in Fig. 2
Table 2 Top ten enriched motifs for different gene groups in Fig. 2

Group C genes are highly expressed in both advanced HCC and early development. Functional analyses showed that genes in this group preferentially participate in pathways such as “cell cycle” and “DNA replication” (Table 1). Binding sites of the key cell cycle regulator E2F1 were ranked at the top of the enriched motifs for Group C genes. These observations suggest that liver development and HCC share similar pathways that promote cellular growth. On closer inspection, gene expression in Group C showed a slightly negative correlation between HCC and development: expression levels rise from early HCC to advanced HCC, whereas they decline across liver development. This indicates that the pathways promoting cellular growth are strengthened from early HCC to advanced HCC and weakened during liver development.

Despite similar expression profiles between HCC and development in Groups A and C, their differences are apparent in Groups B and D. Group B genes distinguish HCC from all other processes by the lower-than-average expressions during cancerous stages, while Group D exhibits the opposite pattern (Fig. 2). Functional analyses reveal that “TGF-beta signaling pathway” and “cytokine–cytokine receptor interaction” are enriched in Group B (Table 1). The tumor suppressor genes such as DCN (Salomaki et al. 2008; Seidler et al. 2006), BMPR2 (Pouliot et al. 2003), and ID4 (Noetzel et al. 2008) in TGF-beta signaling pathway were included in Group B. Furthermore, Group B also included three TGF-beta signaling pathway genes which play direct roles in early stages of HCC and HCC differentiation (ID1, ID2, and ID3; Damdinsuren et al. 2005). For the cytokine–cytokine receptor interaction pathway, we found that cancer-related genes such as NGFR (Jin et al. 2007), CXCL14 (Pelicano et al. 2009; Wente et al. 2008), and CCL2 (Lu et al. 2006) were also included in Group B. The opposite expression pattern between HCC and liver development of these cancer-related genes in Group B may contribute to the difference between HCC and development.

Different expression profiles between HCC and development can also be traced to differences in cis-regulatory elements. Motif analysis by Motifclass revealed that promoter regions of genes in Groups C and D share enriched motifs corresponding to transcription factors E2F and ETF. However, the enriched motifs of transcription factors MYC and EGR are specific to Group D (Table 2). From Fig. 2, we can see that genes in Group C are elevated in both HCC and development, while genes in Group D are elevated in HCC but inhibited in development. MYC is a well-known oncogene. In our data, it is down-regulated in the early cancerous stages (Fig. 3a). EGR is up-regulated in both regeneration and development, but repressed in cirrhosis, dysplasia, and cancer (especially in its late stages) (Fig. 3b). This expression pattern is consistent with the previous report that EGR is a cancer suppressor gene (Krones-Herzig et al. 2003, 2005). The specific enrichment of these cancer-related motifs in the promoter regions of Group D should contribute to the difference between HCC and development. Taken together, differences between HCC and development should be a result of concerted activities of cancer-related genes and their transcriptional controls.

Fig. 3

Expression profiles of MYC (a) and EGR1 (b) during different time points in cirrhotic tissue, dysplastic nodules, early HCCs, advanced HCCs, regeneration, and development. The definition for each label is the same as those in Fig. 1

The relation between liver regeneration and dysplasia

Generally, all of the above-mentioned clustering results demonstrate that HCC and liver regeneration are quite different from each other. In Figs. 1 and 2, however, regeneration and cirrhosis/dysplasia are clustered together, which reveals a similarity at the mRNA level between regeneration and cirrhosis/dysplasia, especially between regeneration and dysplasia (Fig. 1b). This similarity between regeneration and dysplasia warranted further attention. Considering all of the 12,480 genes, we tried to identify which among them contributed to the common expression patterns of regeneration and dysplasia. Genes whose expression levels changed at least twofold relative to the normal control samples in more than 50% of the regeneration samples were identified. The same filtering was applied to the dysplasia data. Finally, 143 genes and 75 genes were obtained for dysplasia and regeneration, respectively. Surprisingly, the two gene sets share only seven members, and among the seven genes, only one changes in the same direction in both processes. Although dysplasia and regeneration share a similar expression pattern as a whole, this small overlap reveals an apparent difference between the two processes.
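The filtering and overlap comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function name and gene labels are hypothetical, the inputs are assumed to be log2 ratios relative to control, and a gene's direction is taken from the majority of its changed samples, a rule the text does not specify.

```python
import math


def changed_genes(log2_ratios, fold=2.0, min_fraction=0.5):
    """Return {gene: 'up' | 'down'} for genes changed at least `fold`
    relative to control in more than `min_fraction` of the samples."""
    threshold = math.log2(fold)
    selected = {}
    for gene, values in log2_ratios.items():
        n_changed = sum(1 for v in values if abs(v) >= threshold)
        if n_changed / len(values) > min_fraction:
            n_up = sum(1 for v in values if v >= threshold)
            # Direction by majority vote among changed samples (assumption).
            selected[gene] = "up" if 2 * n_up >= n_changed else "down"
    return selected


# Toy log2 ratios for three hypothetical genes, three samples per process.
regeneration = {"g1": [1.5, 2.0, 0.1], "g2": [0.2, 0.1, 0.0], "g3": [-1.2, -1.5, -2.0]}
dysplasia = {"g1": [-1.1, -1.3, -2.0], "g2": [0.0, 0.0, 0.0], "g3": [-1.0, -1.4, 0.0]}

regen_set = changed_genes(regeneration)
dys_set = changed_genes(dysplasia)
shared = sorted(regen_set.keys() & dys_set.keys())
same_direction = [g for g in shared if regen_set[g] == dys_set[g]]
```

In the toy data, `g1` passes the filter in both processes but in opposite directions, which is the situation reported for six of the seven shared genes.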

To explain the inconsistency that dysplasia and regeneration show both global similarity and apparent difference at a finer scale, we employed principal component analysis (PCA). All expression profiles from different samples were projected onto the first two principal component axes (Fig. 4). The x and y axes explained 23 and 17% of the total variance, respectively. Since the expression level for each gene was its ratio relative to normal control samples and was log-transformed before PCA, the origin of coordinates corresponded to the normal state. The PCA result demonstrated that human and mouse samples form two clumps extending in different directions, while the cirrhotic, dysplasia, and regeneration samples gather close to each other and are all located near the normal state (Fig. 4). Thus, the similarity between regeneration and dysplasia does not result from the same group of co-regulated genes, but is caused by their close resemblance to normal tissues.

Fig. 4

Projection of all sample points onto the plane spanned by the first two principal component axes. The origin of coordinates represents the normal state. Although the mouse regeneration and development points lie apart from the human data points, extending in two different directions, regeneration, cirrhosis, and dysplasia gather close to each other near the origin of coordinates. The definition for each label is the same as those in Fig. 1

Regeneration leads to the restoration of normal tissue, whereas dysplasia heralds the progression of cancer. Given that both have expression patterns similar to the normal state, their different destinies might be subject to the activities of a number of genes that “switch on” tumor progression. We found five genes, SULT1E1, PDK4, MT1G, FOS, and S100A8, whose mRNA levels changed more than tenfold between regeneration and dysplasia and at least twofold relative to control in each process. All of them were up-regulated in regeneration, but suppressed in dysplasia (supplementary Fig. S1). Among them, FOS (part of the AP-1 transcription factor complex) and S100A8 (which dimerizes with S100A9) reportedly play roles in cancer progression. The FOS family proteins dimerize with Jun proteins to form the AP-1 transcription factor complex. They play a central role in the proliferation and differentiation of normal tissue, as well as in oncogenic transformation and tumor progression (Mahner et al. 2008). Consistent with this, the expression level of Jun changed more than fivefold between regeneration and dysplasia in the data of this study (supplementary Fig. S1). S100A8 is a Ca2+-binding protein of the S100 family. It naturally forms a stable heterodimeric complex with S100A9 (Arai et al. 2008; Leukert et al. 2006). The expression profile of S100A9 was not included in our data since it was filtered out by the one-to-one ortholog criterion. We checked manually and found that the expression profile of S100A9 is very similar to that of S100A8, and that its expression level is higher by ninefold in regeneration than in dysplasia. S100A8 and S100A9 induce inflammatory activation of endothelial cells (Roth et al. 2003); they are strong markers for various cancers and play important roles in the development and progression of carcinomas (Arai et al. 2008; Hermani et al. 2005; Moon et al. 2008; Turovskaya et al. 2008).
Based on the above analyses and reports, we hypothesized that three remaining genes, SULT1E1, PDK4, and MT1G, may also be related to cancer progression.
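The two fold-change criteria used above to single out the five genes can be written as a small filter. The sketch below is illustrative only: the function name and gene labels are hypothetical, and it assumes that each process is summarized by one mean log2 ratio per gene relative to control, which the text does not state.

```python
import math


def switch_candidates(regen_log2, dys_log2, between_fold=10.0, control_fold=2.0):
    """Genes differing by more than `between_fold` between the two
    processes and by at least `control_fold` from control in each.

    Inputs map gene -> mean log2 ratio relative to control (assumption).
    """
    between = math.log2(between_fold)
    control = math.log2(control_fold)
    hits = []
    for gene in regen_log2.keys() & dys_log2.keys():
        r, d = regen_log2[gene], dys_log2[gene]
        if abs(r - d) > between and abs(r) >= control and abs(d) >= control:
            hits.append(gene)
    return sorted(hits)


# Toy values: gA is up in regeneration and down in dysplasia; gB differs
# too little between processes; gC is too close to control in dysplasia.
regen = {"gA": 2.0, "gB": 1.0, "gC": 4.5}
dys = {"gA": -1.5, "gB": -1.0, "gC": 0.5}
candidates = switch_candidates(regen, dys)
```

Working on the log2 scale turns the tenfold criterion into a simple difference threshold of log2(10), about 3.32, between the two processes.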

Discussion

In this study, the transcriptional profiles of HCC, liver development, and liver regeneration based on high-density microarray data were systematically compared. With these data, a comprehensive analysis of gene expression patterns, functional preferences, and transcriptional regulation for different gene groups was performed.

Similarities between liver cancer and development had already been observed in previous studies. Here, we showed that the similarity is partial. Advanced HCC mimics prenatal development in terms of deprived normal liver functions and activated cellular proliferation, but the two differ in the activities of cancer-related genes and their transcriptional controls. The analyses revealed that genes with similar expression patterns in HCC and development tend to be involved in pathways such as “fatty acid metabolism” and “glycolysis/gluconeogenesis,” which are related to normal liver functions, and “cell cycle” and “DNA replication,” which are related to cellular proliferation. In contrast, genes with opposite expression patterns in HCC and development tend to be involved in the “TGF-beta signaling pathway” and “cytokine–cytokine receptor interaction” pathways. It is well known that the “TGF-beta signaling pathway” is one of the most important pathways in cancer. Many genes in the “cytokine–cytokine receptor interaction” pathway, such as NGFR, CXCL14, and CCL2, have also been reported to be cancer-related. Besides these cancer-related pathways and genes, we also found specific enrichment of cancer-related motifs in the promoter regions of the genes that have opposite expression patterns in HCC and development. Further study of these genes and their transcription factors may provide new clues to carcinogenesis.

The liver has a unique capability to regenerate. Although the molecular mechanism of regeneration is poorly understood, similarities between liver regeneration and dysplasia were found in this study. Further analysis revealed that this similarity does not result from the same set of co-regulated genes, but from their proximity to the normal state. This observation suggests that, firstly, the mechanisms of regeneration and dysplasia are significantly different from each other and, secondly, the mechanisms of liver regeneration and dysplasia may not be as complicated as that of HCC, since both states are much closer to normal liver cells. Given that the expression patterns of the pre-cancer stage and of liver regeneration are both similar to the normal state, we propose that the regulation of a relatively small number of key genes might trigger the different progressions of cancer and regeneration. Our analyses found that SULT1E1, PDK4, MT1G, FOS, and S100A8 may be candidates for such key genes. Among them, FOS and S100A8 have been reported to be important players in cancer progression. Although we have not found many reports on SULT1E1, PDK4, and MT1G, we hypothesize that these three genes may also contribute to HCC progression.