1 Introduction

Gene expression analyses require normalization across different samples, which involves standardization of data against a set of reference points in any differential expression strategy. This is usually done using ‘invariant’ housekeeping genes (also referred to as endogenous controls). An inherent property of housekeeping genes is that they maintain the basic metabolic functions of a cell at a similar level under different conditions and perturbations. This is presumably achieved by keeping their gene expression level invariant [1]. Hence, housekeeping gene expression levels are chosen as endogenous references for gene expression data normalization. However, recent technological advances have raised the question of choice and reliability of these endogenous genes as references. An increasing number of reports show that housekeeping genes may also be subject to variation in expression in different disease states and experimental conditions, as well as between subjects, tissues, model systems etc. [2–13]. Housekeeping genes may also exhibit expression variation in the same type of cancer when located in different tissues or organs. Exposure of human peripheral blood lymphocytes (HPBL) to environmental stresses, including ionizing radiation, is also known to activate signal transduction pathways, which may result in complex patterns of gene expression changes [14–17]. In a recent study, using quantitative real-time PCR (qPCR), we compared the expression of 6 different housekeeping genes in human blood cells exposed to 60Co γ-rays for their suitability as reference or normalizer genes. We found that GAPDH, either alone or in combination with the 18S rRNA gene, was best suited [18]. Others found that in colon cancer several housekeeping genes, mainly those coding for metabolic enzymes, show considerable expression changes under different conditions [19].
It is apparent that there is currently no single housekeeping gene that meets the criteria of being stably, abundantly and consistently expressed under various conditions, i.e., criteria that are required for serving as a consensus reference or normalizer gene [20].

In any gene expression study, the selection of an appropriate reference or normalizer gene(s) is critical for a reliable and accurate interpretation of the data. In the past, several bioinformatics tools have been developed for the delineation of the best reference or normalizer gene(s) for expression studies, including BestKeeper, geNorm and NormFinder. Each of these tools applies a different and highly complex mathematical algorithm to reach this goal. For example, the BestKeeper tool estimates the geometric mean of the most suitable pair of genes by correlating the average variations of candidate genes [21]. To minimize variations across samples, the geNorm tool employs multiple reference genes to derive a geometric mean [12]. The NormFinder tool applies a mathematical model to analyse sample subgroups and their intra- and inter-group expression variation, thereby preventing the selection of co-regulated genes [22]. Since the choice of a reference gene, or a set of reference genes, is inherent to these complex algorithms, the same gene expression data set may yield different results when different reference gene(s) are used [18].
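To make the geometric-mean idea concrete, the following minimal Python sketch computes a per-sample normalization factor as the geometric mean of several reference genes, together with a pairwise-variation stability score in the spirit of geNorm. It is an illustration under stated assumptions and not the implementation of any of the tools named above: the function names, the input layout (a dictionary mapping gene names to relative quantities per sample) and the example values are all hypothetical.

```python
import math
from statistics import stdev


def normalization_factors(ref_quantities):
    """Return one normalization factor per sample: the geometric mean of the
    relative quantities of all reference genes in that sample.

    ref_quantities: dict mapping gene name -> list of positive relative
    quantities, one entry per sample (hypothetical input layout).
    """
    genes = list(ref_quantities)
    n_samples = len(ref_quantities[genes[0]])
    factors = []
    for s in range(n_samples):
        # Geometric mean computed in log space for numerical stability.
        log_sum = sum(math.log(ref_quantities[g][s]) for g in genes)
        factors.append(math.exp(log_sum / len(genes)))
    return factors


def pairwise_stability(ref_quantities):
    """Return a geNorm-style stability score per gene: the average, over all
    other candidate genes, of the standard deviation across samples of the
    log2 expression ratio. Lower scores indicate more stable expression.
    Requires at least two genes and at least two samples.
    """
    genes = list(ref_quantities)
    scores = {}
    for j in genes:
        sds = []
        for k in genes:
            if k == j:
                continue
            log_ratios = [
                math.log2(a / b)
                for a, b in zip(ref_quantities[j], ref_quantities[k])
            ]
            sds.append(stdev(log_ratios))
        scores[j] = sum(sds) / len(sds)
    return scores


# Hypothetical relative quantities for three candidate genes in three samples.
example = {
    "GENE_A": [1.0, 2.0, 4.0],
    "GENE_B": [1.0, 2.0, 4.0],
    "GENE_C": [4.0, 2.0, 1.0],
}
factors = normalization_factors(example)
scores = pairwise_stability(example)
```

A target gene would then be normalized by dividing its relative quantity in each sample by the corresponding factor. In this toy data set GENE_C receives the highest (worst) score because its ratio to the other two genes changes most across samples.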

Many attempts have been made over the last few years to settle the issue by finding the most appropriate reference gene(s) for gene expression analysis purposes. A consensus will facilitate the harmonization of data emerging from different studies across the globe, and consolidate our fundamental understanding of (alterations in) gene expression, especially in cancer. Recently, the use of 3 reference genes selected by at least three stability algorithms has been recommended for reliable interpretation of gene expression data [23]. A survey of published results shows, however, that a wide range of different reference or normalizer genes are in use for normalization of gene expression data in different cancer studies. At least 50 housekeeping genes have so far been tested and/or used in studies dealing with the most prevalent human cancers (listed in Table 1). This review collates all available data published on this issue since the turn of the century to (i) highlight the wide range of reference genes employed in the normalization of qPCR data in the most prevalent human cancers and (ii) extract consensus reference genes for normalization.

Based on worldwide data available on the incidence and prevalence of human cancers (compilation up to 2012) from the World Health Organization (WHO; http://www.who.int/mediacentre/factsheets/fs297/en/) and the World Cancer Research Fund International (WCRF; http://www.wcrf.org/int/cancer-facts-figures/worldwide-data/), we chose the 13 most common human cancers, accounting for nearly 70 % of the total global cancer burden, for this study. To retrieve and collate the relevant literature for the selected cancers, a search strategy was devised wherein the PUBMED database of the National Library of Medicine, NIH Bethesda, Maryland, USA, was the primary source. The search was restricted to full text papers published from 2000 onwards. In addition, a Google web-based search was employed to retrieve published data that were not covered by the PUBMED database. Some highly relevant cross references from these papers are also included in this review. The keywords utilized for the searches were Validation, Reference Genes, Housekeeping Genes, Human Cancer, Real Time PCR, Endogenous Controls, Normalizers, and Evaluation, both alone and in combination with specific cancer types.

2 Breast cancer

Breast cancer is the most common invasive cancer in women [24]. Breast cancer-related gene expression studies by qPCR have utilized a wide range of endogenous control genes. In an evaluation study by Lyng et al. [25], the best reference gene appeared to be PUM1, or the average of 3 genes, i.e., TBP, RPLP0 and PUM1 (Table 2). In another evaluation study on breast cancer, the MRPL19 and PPIA genes (Table 2) were identified as the most stable and reliable candidate reference genes [26]. Gur-Dedeoglu et al. [27] reported ACTB and SDHA to be the most suitable reference genes, among 18 endogenous reference genes tested, for the normalization of qPCR data in breast cancer tissues using both the geNorm and NormFinder software tools (Table 2). In another study, 5 reference genes were identified, i.e., ACTB, RPS23, HUWE1, EEF1A1 and SF3A1 (Table 2), as potential normalizer genes for the experimental and clinical analyses of breast cancer samples [28]. On the other hand, the 18S rRNA gene was found to be the most suitable reference gene in the MCF-7 breast cancer-derived cell line, while the GAPDH gene was recommended for the MDA-MB-231 breast cancer-derived cell line [29]. In an earlier study, however, GAPDH was not recommended as a reference gene in breast cancer [30]. More recent studies on breast cancer have employed a wide range of reference genes, such as ACTB [31, 32], GAPDH [33, 34], APP [35], RPLP0 [36] and β-globin [37], as well as the averaged expression of the GAPDH, HPRT1 and B2M genes [38] for normalization purposes.

3 Cervical cancer

Cervical cancer is another leading cause of cancer-related death in women worldwide. In a validation study by Daud et al. aimed at identifying the best reference genes in cervical cancer, GAPDH, followed by RPLP0, was found to be the best candidate reference gene (Table 2) [39]. In clinical cervical tissue samples, EEF1A1 was recommended as a reference gene by Shen et al. [40], while the combined use of EEF1A1 and GAPDH may serve as a reliable normalization strategy (Table 2). In several other studies on cervical cancer, reference genes such as GAPDH [41] or ACTB [42, 43] were employed.

4 Colon cancer

Colon cancer, or colorectal cancer (CRC), is also one of the most common causes of cancer-related death in developed countries [44]. In different studies on CRC, the use of different housekeeping genes as reference genes has been suggested. In an evaluation study of several of these housekeeping genes, 3 of them, i.e., PMM1, ACTB and PSMB6, showed the least variation and were, therefore, considered the most reliable reference genes for the analysis of CRC samples [45]. Three other reference genes (i.e., UBC, GAPDH and TPT1) were recommended for CRC by another research group [22]. In yet another study by Jacob et al. aimed at selecting reference genes [23], HSPCB, YWHAZ and RPS13 were found to be the most stably expressed genes in at least a subset of CRC-derived cell lines (Table 2). One recent study has advocated the use of GUSB and ACTB, but not B2M, as internal reference genes for CRC gene expression studies [46]. In contrast, B2M was reported to be the best reference gene for gene expression studies in primary human CRCs by Dydensborg et al. [47]. After evaluating 13 potential candidate genes, a combined use of the PPIA and B2M genes (Table 2) was recommended as reference for human CRC samples by Kheirelseid et al. [48]. Of the 16 genes studied in metastatic and non-metastatic CRC specimens, 2 pairs of genes, i.e., HPRT1-PPIA and IPO8-PPIA, were recommended by Sorby et al. [49] as the most suitable combinations (Table 2). In other CRC studies, different endogenous reference genes such as GAPDH [50], ACTB and GUSB [51] were employed as normalizer genes.

5 Esophageal cancer

Esophageal cancer is the eighth most frequently diagnosed cancer worldwide [52] and, due to its poor prognosis, it is the sixth most common cause of cancer-related death [53, 54]. Of the 21 genes evaluated as best endogenous reference genes in primary esophageal cancer tissue specimens, the highest stability was observed for GAPDH, followed by CETN2 [43]. In several other studies on esophageal squamous cell carcinoma, GAPDH [55–59] or ACTB [60–62] were used as reference genes. As shown in Table 2, a triple normalization with 3 reference genes, i.e., PPIA, ALAS1 and ACTB, was recommended for human esophageal adenocarcinoma specimens by Slotta-Huspenina et al. [63]. In some other studies, GAPDH [64–66] and 18S rRNA [67] were employed as reference genes for esophageal adenocarcinomas.

6 Kidney cancer

A study by Jung et al. [5] aimed at identifying suitable reference genes for gene expression analyses in renal cell carcinoma, which is the most common type of kidney cancer, proposed 2 housekeeping genes, i.e., PPIA and TBP. Both genes were recommended as reference genes for data normalization, either as single genes or in combination, with a preference for the latter (Table 2). Previously, one report indicated that the 18S rRNA and cyclophilin A (CyPA) genes were the most suitable reference genes for micro-dissected kidney biopsies [68]. In other studies on renal cell carcinomas, GAPDH was used as a reference gene [69, 70], whereas another recent study revealed that PPIA and RPS13 served as the most suitable combination for the normalization of gene expression data in clear cell renal cell carcinoma (ccRCC) tissues [71].

7 Liver cancer

Liver cancer, or hepatic cancer, is one of the leading causes of cancer-related death globally [72]. In a selection study aimed at identifying optimal reference genes for expression profiling of liver diseases, 10 housekeeping genes were evaluated across 67 liver tissue samples using the geNorm software tool [73]. It was found that the HMBS and UBC gene pair served as the most accurate normalization factor in qPCR analyses (Table 2). For the normalization of microarray-based gene expression data, Lee et al. identified 3 housekeeping genes, i.e., CGI-119, CTBP1 and GOLGA, that showed stable expression across the liver cancer tissues tested [74]. One evaluation study recommended a combination of the TBP and HPRT1 genes (Table 2) for the normalization of hepatocellular carcinoma (HCC) data [75]. This recommendation was supported by another study showing that, compared to the 18S rRNA and ACTB genes, the TBP and HPRT1 genes were stably expressed and, hence, served as reliable reference genes for qPCR-based gene expression normalization in hepatitis B virus (HBV)-related HCC specimens [76]. Another recent study recommended CTBP1 as the best candidate reference gene in human male HBV infection-related HCC with cirrhosis [77]. Also, a combination of 2 other genes, SFRS4 and RPL41, has been recommended for HCC by Waxman et al. [78]. On the other hand, Cicinnati et al. [79] concluded from an evaluation study that HMBS was the single best reference gene for gene expression studies in HCC (Table 2). The latter authors also suggested a combination of the HMBS, GAPDH and UBC genes for primary liver cancer samples, and a combination of the HMBS, B2M, SDHA and GAPDH genes for liver cancer-derived cell lines.

8 Lung cancer

Lung cancer, or pulmonary carcinoma, is one of the most common cancers and is a leading cause of mortality among men worldwide [80]. A wide range of reference genes has been used for the normalization of gene expression data in lung cancer studies. In a study on non-small cell lung cancer (NSCLC), the best reference gene was found to be HPRT1 followed by GAPDH [81], even though another gene, i.e., B2M, was previously preferred by Heighway et al. [82]. In a panel of 6 reference genes investigated by another group, HPRT1 was found to be the most stably expressed gene in NSCLC (Table 2), followed by RPLP0 and ESD [83]. As also shown in Table 2, 6 housekeeping genes, i.e., RPLP0, UBC, GAPDH, MT-ATP6, CASC3 and PES1, were identified as reliable reference genes for analysis of the NSCLC-derived cell line A549 by Sharungbam et al. [84]. In addition, 18S rRNA, POLR2A, ESD and YAP1 were found to be the most stably expressed genes in primary lung cancer specimens [85]. A sequencing-based approach suggested 4 genes, i.e., NDUFA1, RPL19, RAB5C and RPS18, as suitable reference genes for normalization of data in primary lung cancer tissues [86]. In a DNA microarray-based expression profiling study of lung cells, the SPCS1 and HADHB genes were found to show a high level of expression stability [87], and in a recent qPCR-based expression study of lung squamous-cell carcinomas, ACTB, EEF1A1, FAU, RPS9, RPS11 and RPS14 were identified as ideal reference genes [88].

9 Lymphoma

In studies on lymphoma, a wide range of endogenous genes, such as B2M [89–91], RPS9 [92], GAPDH [93–96] and ACTB [97], have been used as references. In an evaluation study on several lymphoma-derived cell lines, the RPL13A gene was found to be the best suited reference gene out of 4 genes tested, i.e., ACTB, HPRT1, HMBS and RPL13A (Table 2) [98]. A study on non-Hodgkin's lymphomas, on the other hand, indicated that the expression of target genes should be normalized against the PRKG1 and/or the TBP genes [99].

10 Ovarian cancer

Ovarian cancer is another leading cause of cancer-related death in women worldwide [100]. A recent analysis of different housekeeping genes [101] identified IPO8 as the most suitable reference gene for this malignancy, followed by RPL4 (Table 2). In different evaluation studies of endogenous references (shown in Table 2), the GUSB, PPIA and TBP genes were reported to be most suitable for expression normalization in serous ovarian cancer [102], whereas the PPIA, RPS13 and SDHA genes were recommended for use in ovarian cancer-derived cell lines by Jacob et al. [23]. In some of the more recent ovarian cancer studies, the reference genes employed were RPLP0 [103], ACTB [104–106], GAPDH [107–109], 18S rRNA [110], GUSB and PPIA [111]. GAPDH was also employed in a study on ovarian cancer-derived cell lines, such as the SKOV3 and HO8910 cell lines and the highly invasive HO8910-PM cell line [112]. In yet another evaluation study on ovarian tissues, the RPLP0 and RPL4 genes were recommended as the best combination of reference genes [113].

11 Pancreatic cancer

Pancreatic cancer has one of the highest mortality rates among all cancers for both men and women worldwide. The results of an evaluation study of endogenous reference genes indicated that EIF2B1, ELF1, MRPL19 and POP4 were the most stably expressed genes in pancreatic cancer and, as such, should be used for the normalization of qPCR-based expression data (Table 2) [114]. In another study, RPL37A, RPLP0 and CASC3 were recommended as reference genes for the analysis of pancreatic cancer-derived cell lines [84]. The 18S rRNA and QRRS genes were also found to exhibit expression variation of less than 10 % and, therefore, they were recommended as reference genes for the analysis of pancreatic carcinoma tissues [45]. In other primary pancreatic carcinoma studies, the GAPDH gene was used as a reference gene, either alone [115] or together with PSMB6 [116]. Only recently it was reported that the expression of the PPM1 gene is more stable than that of the GAPDH gene in pancreatic cancer [117].

12 Prostate cancer

Prostate cancer, or carcinoma of the prostate, is one of the leading causes of cancer death in males worldwide [118]. In an evaluation study of 16 candidate housekeeping genes in prostate cancer, the best reference gene was found to be HPRT1, alone or in combination with ALAS1 and K-ALPHA-1 (Table 2) [119]. Another study recommended a combination of the GAPDH and SDHA genes for normalization of mRNA levels of target genes in primary cultures of prostate cancer cells transfected with siRNAs (Table 2) [120]. Yet another report showed that the ACTB gene was abundantly and stably expressed in prostate-derived cells and, thus, was considered suitable for use as a reference gene [121]. However, the same gene was not recommended for primary prostate cancer tissue samples. qPCR-based gene expression analyses of prostate cancer cell lines have been conducted using the 18S rRNA gene as the reference [122], and a comparison of B2M gene expression levels between healthy volunteers and patients with prostate cancer revealed a lack of significant variation, including the absence of an effect of hormonal treatment [123]. Another study on prostate cancer showed variation in expression levels of some commonly used endogenous control genes between aerobic and hypoxic samples [13]. In several additional qPCR-based prostate cancer expression studies, the housekeeping genes used as references included GAPDH [124–128], ACTB [129, 130], RPS14 [131] and S19 [132].

13 Stomach cancer

Stomach cancer, or gastric cancer, is also one of the most common cancers across the globe [133]. In a study aimed at identifying valid reference genes in stomach cancer-derived cell lines, a combination of the GAPDH and B2M genes was recommended for normalization (Table 2) [134]. In addition, a combination of the RPL29 and B2M genes has been proposed for comparisons between normal and stomach cancer tissues (Table 2). In another report, combinations of pairs of reference genes, i.e., GAPDH-B2M or ACTB-B2M, were recommended for gene expression analyses in gastric tissues and cell lines (Table 2) [135]. Another evaluation by Zhao et al. [136] revealed 18S rRNA as the most stably expressed gene when compared to GAPDH, ACTB and RPII and, thus, as the most suited for the normalization of qPCR-based expression data in gastric cancer samples (Table 2). An earlier study, however, identified 3 other genes (i.e., PMM1, ADA and SDHA) as candidate reference genes for the normalization of expression data in stomach cancer [45].

14 Thyroid cancer

In a recent evaluation study aimed at selecting the best candidate reference gene for thyroid cancer [137], ACTB was found to be the best suited among several other candidates tested [138]. In a comparative study on normal thyroid tissues using the NormFinder software tool, it was found that the ACTB gene was most stably expressed when compared to 5 other candidate genes (Table 2). A similar study also suggested that the ACTB gene was more stably expressed than the TBP, GAPDH and B2M genes in primary cultures of thyroid cells [139]. Moreover, ACTB has been employed as a reference gene in thyroid cancer gene expression studies by different groups [140–142]. Several additional studies on primary thyroid cancers [143–147] and thyroid cancer-derived cell lines [148, 149] have employed GAPDH as a reference gene. Additionally, other housekeeping genes such as 18S rRNA [150] and G6PDH [151] have been employed as references in thyroid cancer-related studies.

15 Discussion

It is obvious from the above overview that there is no single gene, or set of genes, that has in the past been universally applied as endogenous reference or normalizer in gene expression studies of the most common human cancers. Over 50 different reference genes have so far been used (listed in Table 1). Different software tools have been employed by researchers to evaluate suitable reference genes for data normalization, yielding a number of possible candidates for particular human cancers and/or their corresponding normal tissues in different studies (Table 2). Consequently, different reference genes have been used by different researchers for gene expression studies in the same human cancers. This makes harmonization or inter-comparison of valuable gene expression data difficult.

Table 1 Reference genes used for gene expression studies
Table 2 Human cancers and reference genes selected for gene expression studies

Differences in the metabolism of cancer cells may contribute to variation in the expression levels of housekeeping genes, even within the same organ, suggesting that cancer-related studies are mostly case-specific [152, 153]. This variation may even be more prominent when comparing one cancer type with another. In colon cancer, for example, it has been suggested that such variation may also, at least partly, be due to chromosomal aberrations resulting in gains or losses of segments containing the relevant gene(s) [154]. In such situations, it is recommended to consider the genotype of the tissue samples so that extra or missing chromosomal parts (i.e., copy number variation) can be taken into account. Obviously, the outcome of reference gene evaluations may also vary depending on the validation methodology used.

Based on our present evaluation, we find that all three software packages used for validation in cancer-related gene expression studies, i.e., geNorm, NormFinder and BestKeeper, using three different and complex algorithms and efficiency-corrected values [155], ranked the genes in similar and/or comparable patterns. This clearly suggests that the choice of the software package is not the primary factor causing variability in the outcome of gene expression data analyses. Therefore, it appears that taking only the highest ranking reference or normalizer gene, as revealed by either one of the three software packages, could be the primary source of variability. To counter this adverse effect, it would be best to consider a set, or subset, of candidate reference genes for normalization, as suggested by Chervoneva et al. [156]. They reported a robust and comprehensive technique to evaluate normalizing factors, which was based on all possible subsets of reference genes, rather than addressing the stability of individual reference genes. Consequently, reference or normalizer gene(s) with a low degree of variability got automatically included in the algorithm. We suggest that the use of at least a pair of reference or normalizer genes with distant functions will help to level the influence of a possible co-regulation between the reference gene(s) and the gene(s) under investigation [152]. For even more stringent and accurate results, it is recommended to employ at least three reference or normalizer genes [21, 157] and three different evaluation algorithms in a typical gene expression study [23].
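The idea of scoring subsets of reference genes, rather than individual genes, can be illustrated with a toy Python sketch. This is not the method of Chervoneva et al. [156]: it swaps in a deliberately simple score, namely the standard deviation across samples of the mean Cq of each candidate subset, and it assumes equal amplification efficiency and comparable input amounts for all samples. The function name, the input layout and the Cq values are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def rank_subsets(cq, min_size=2):
    """Score every subset of candidate reference genes (size >= min_size).

    cq: dict mapping gene name -> list of Cq values, one per sample
    (hypothetical input layout). Averaging Cq values within a subset
    corresponds to a geometric mean on the linear scale when amplification
    efficiencies are equal (an assumption of this sketch).

    Returns a list of (score, subset) tuples sorted from most to least
    stable, where score is the standard deviation across samples of the
    subset's mean Cq.
    """
    genes = sorted(cq)
    n_samples = len(cq[genes[0]])
    scored = []
    for size in range(min_size, len(genes) + 1):
        for subset in combinations(genes, size):
            per_sample = [
                mean(cq[g][s] for g in subset) for s in range(n_samples)
            ]
            scored.append((stdev(per_sample), subset))
    return sorted(scored)


# Hypothetical Cq values for three candidate genes in three samples.
example_cq = {
    "GENE_A": [20.0, 21.0, 22.0],
    "GENE_B": [22.0, 21.0, 20.0],
    "GENE_C": [18.0, 20.0, 25.0],
}
ranking = rank_subsets(example_cq)
best_score, best_subset = ranking[0]
```

In this toy example the pair GENE_A and GENE_B ranks first although neither gene is constant on its own, because their opposite trends cancel out in the average. A combined score can therefore favour a subset that would not be chosen by picking the single top-ranked gene.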

16 Conclusions and future perspectives

Upon careful scrutiny of the currently published data, we find that the PPIA gene (Table 1) is the only reference or normalizer gene that has been found suitable in several gene expression studies of at least five of the most prevalent human cancers, i.e., breast, colon, esophagus, kidney and ovary cancers (Table 3). The PPIA gene is a widely conserved eukaryotic gene that encodes a member of the peptidyl-prolyl cis-trans isomerase family of proteins. It catalyzes cis-trans isomerization of proline imidic peptide bonds in oligo-peptides to accelerate protein folding (http://www.ncbi.nlm.nih.gov/gene/5478). Further analysis indicates that the other most common reference genes for the normalization of gene expression data in several human cancers are GAPDH (cervix, lung, prostate and stomach cancers), ACTB (breast, esophagus, stomach and thyroid cancers), HPRT1 (colon, liver, lung and prostate cancers) and TBP (breast, kidney, liver and ovary cancers) (Table 3). Interestingly, the ribosomal RNA gene family, including the 18S rRNA gene (Tables 1 & 2) that is abundantly expressed and, hence, widely used as a normalizer gene in numerous studies, seems to have been preferred only rarely in studies related to human cancers (Table 4). The same holds for over two dozen other genes, including MT-ATP6, IPO8 and UBC, which were found suitable for gene expression studies in only 1, or at most 2, different human cancers (Table 4).

Table 3 Most commonly selected reference genes for gene expression studies
Table 4 Least commonly used reference genes for gene expression studies

From the analysis presented in this review, it is evident that there is not one single consensus endogenous reference gene to normalize gene expression data in different human cancers. However, a combination of PPIA with one or more of the GAPDH, ACTB, HPRT1 and TBP genes should be able to cover 11 of the 13 most common human cancers included in this review (i.e., breast, cervix, colon, esophagus, kidney, liver, lung, ovary, prostate, stomach and thyroid cancers) (Table 3). Furthermore, we found that the choice of the software package (i.e., geNorm, NormFinder or BestKeeper) for the validation of cancer-related gene expression data does not seem to have any influence on the final outcome. Therefore, we recommend that future gene expression studies in human cancer should seriously consider using the five reference or normalizer genes listed in Table 3 (in combinations of at least 2 or preferably 3) for normalization of the data. Once this approach is adopted by different researchers, the final output is anticipated to be harmonized, which will deepen our understanding of the mechanisms underlying cancer and support the use of gene expression data in diagnostics and prognostics. We strongly recommend first evaluating the suitability of the above five reference genes (preferably in appropriate combinations) using any of the available software packages in order to select 2 or 3 reference genes for normalization. Even if another reference gene, besides the above mentioned genes, is proposed, we recommend that it should first be evaluated in order to avoid obscuring real gene expression changes and/or yielding erroneous gene expression data. Such an approach would also meet the MIQE guidelines for qPCR data analysis [157].
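The coverage stated above can be checked directly from the gene-to-cancer assignments given in this section. The short Python sketch below encodes those assignments as sets and tallies which of the 13 cancers are covered by the five genes. The variable and function names are hypothetical, and the cancer labels are shortened organ names chosen for this illustration.

```python
# Cancers in which each gene was found suitable, as listed in this section.
SUITABLE_IN = {
    "PPIA": {"breast", "colon", "esophagus", "kidney", "ovary"},
    "GAPDH": {"cervix", "lung", "prostate", "stomach"},
    "ACTB": {"breast", "esophagus", "stomach", "thyroid"},
    "HPRT1": {"colon", "liver", "lung", "prostate"},
    "TBP": {"breast", "kidney", "liver", "ovary"},
}

# The 13 cancers covered by this review (sections 2 to 14).
ALL_CANCERS = {
    "breast", "cervix", "colon", "esophagus", "kidney", "liver", "lung",
    "lymphoma", "ovary", "pancreas", "prostate", "stomach", "thyroid",
}


def coverage(genes):
    """Return the set of cancers covered by at least one of the given genes."""
    covered = set().union(*(SUITABLE_IN[g] for g in genes))
    return covered & ALL_CANCERS


def supporting_genes():
    """Return, for each cancer, the sorted list of the five genes assigned to it."""
    return {
        cancer: sorted(g for g in SUITABLE_IN if cancer in SUITABLE_IN[g])
        for cancer in sorted(ALL_CANCERS)
    }


covered = coverage(SUITABLE_IN)
uncovered = ALL_CANCERS - covered
support = supporting_genes()
```

By this tally the five genes together cover 11 cancers, leaving lymphoma and pancreatic cancer uncovered. Cervix and thyroid are each supported by only one of the five genes (GAPDH and ACTB, respectively), which is relevant when at least 2 reference genes are to be combined.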