1 Introduction

Chromosomal aberrations such as deletions, amplifications, and structural rearrangements are hallmarks of cancer [10, 14, 23]. Identifying genomic regions associated with systematic aberrations therefore provides insight into the initiation and progression of cancer and can improve diagnosis, prognosis, and treatment strategies [15]. To study genome-wide genetic aberrations, comparative genomic hybridization (CGH) has been combined with array technology as array comparative genomic hybridization (aCGH). Because aCGH data contain high-throughput genetic information, analytic methods suited to this type of data should be applied in conjunction with microarray techniques. While the purpose of microarray data analysis is to select significant genes, the main issue in aCGH analysis is to segment the sequence of log ratios along the chromosome into regions of amplification, deletion, or no change [13]. Many studies conducted for this purpose have concentrated on smoothing the copy number variations (CNV) throughout the whole genome [2, 5, 9, 10, 15, 17, 19, 24, 27], where a CNV is defined as a duplication or deletion event involving >1 kb of DNA [6]. aCGH methods have also been applied to identify chromosomal aberrations in oral squamous cell carcinoma (OSCC) [7, 18, 21, 22, 25, 26].

Previous studies in this field have usually analyzed each experimental group separately and detected significant regions by comparing CNV patterns, or the relative CNV, among the different experimental groups. They therefore did not consider all samples across the different experimental groups together.

In this study, we used two small aCGH data sets from OSCC patients, collected at different time periods. The two data sets were combined after discretization, because a previous study showed that classification improved when data sets were combined after discretization [12]. The Chi-square test is commonly used to detect differentially expressed genes after discretization of expression intensities in microarray experiments. For matched data, however, the McNemar test should be used instead of the Chi-square test.

Based on these observations, we propose a method, the shifted McNemar test, which detects significant regions in aCGH data by considering the different experimental groups and the region size at the same time. The proposed method can identify significant regions, and classification accuracy can be improved by using these selected regions. In addition, the relationship between the detected genomic regions and the progression of OSCC can be investigated using this novel method, which will be the topic of a subsequent study.

2 Methods

2.1 Data set

Surgical OSCC tissues and their surgical margin tissues were obtained from 11 OSCC patients. The oral cell carcinoma tissue and its marginal tissue were called “tumor” and “dysplasia”, respectively, in this study. Experiments on four of the 11 patients were conducted in 2007 and the experiments on the remaining seven patients were conducted in 2008. The clinical features of the 11 samples used in this study are summarized in Table 1.

Table 1 Clinical features of patients

2.2 Microarray-CGH labeling and hybridization

In the aCGH experiments, we used 60-mer in situ synthesized oligonucleotide arrays containing 44k probes, designed and produced by Agilent Technologies (Santa Clara, CA, USA). Human genomic (male/female) DNA (Promega Corporation, Madison, WI, USA) was used as the reference sample. All array hybridizations were performed according to Agilent’s recommended protocols. Briefly, 3 µg of gDNA was digested with the restriction enzymes AluI and RsaI and fluorescently labeled using the Agilent DNA Labeling kit. Test samples and reference samples were fluorescently labeled with Cy3 or Cy5 dUTP. Labeled DNA was denatured and pre-annealed with Cot-1 DNA and Agilent blocking reagent prior to hybridization for 40 h at 20 rpm in a 65°C Agilent hybridization oven. Standard washing procedures were followed. Hybridized arrays were scanned at 5 µm resolution with an Agilent G2505A scanner. Scanned images were analyzed using Feature Extraction Software 9.1.1.1 (Agilent Technologies) with the CGH-v4_91 protocol for background subtraction and normalization. All array data passed the Agilent-recommended quality metrics.

2.3 Shifted McNemar test

For matched-pairs data, the McNemar test [20] has traditionally been applied only to the case in which there are two possible categories for the outcome. In practice, however, the outcomes can be classified into multiple categories. For this situation, the McNemar test was extended to data that contain more than two categories [1].

For example, in a study a test is performed before treatment and after treatment. The results of the test are coded “+” and “−”. Using this approach, we can test if there was a significant change in the result before and after treatment. When doing the test, the categorized values can be summarized as shown in Table 2.

Table 2 The summarized frequency table for McNemar test

A and D represent the number of patients not changed after treatment. B and C represent the number of patients changed after treatment.

A, B, C, and D represent the frequencies of patients satisfying each combination of the “before” and “after” conditions. The McNemar Chi-square statistic is calculated from the values of B and C, because these two values represent the change between “before” and “after”. The test statistic for the McNemar test is therefore calculated as in Eq. 1.

$$ \begin{aligned} \chi^{2} &= \sum \frac{(O - E)^{2}}{E} \\ &= \frac{\left(B - \frac{B + C}{2}\right)^{2}}{\frac{B + C}{2}} + \frac{\left(C - \frac{B + C}{2}\right)^{2}}{\frac{B + C}{2}} = \frac{(B - C)^{2}}{B + C} \end{aligned} $$
(1)
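As a minimal sketch of Eq. 1, the statistic can be computed directly from the two discordant counts. The function names and the example counts below are hypothetical and are not taken from the study; the p-value helper assumes the two-category case, in which the statistic is referred to a Chi-square distribution with one degree of freedom.

```python
import math


def mcnemar_statistic(b, c):
    """Return (B - C)^2 / (B + C) as in Eq. 1; 0.0 if there are no discordant pairs."""
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)


def chi2_upper_tail_df1(x):
    """Upper-tail probability of a Chi-square variable with 1 degree of freedom.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical discordant counts (not from the paper).
b, c = 15, 5
stat = mcnemar_statistic(b, c)
p_value = chi2_upper_tail_df1(stat)
print(stat, p_value)
```

Only B and C enter the calculation; the concordant counts A and D do not affect the statistic.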

In this equation, “O” and “E” represent the observed and expected values, respectively. The expected value of both B and C is (B + C)/2 if there is no change between “before” and “after”. The degrees of freedom are n × (n − 1)/2, where n is the number of outcome categories, which gives one degree of freedom in the two-category case. Using the McNemar test, we can therefore detect the probes that change significantly in the progression from dysplasia to tumor. The McNemar test is applied along the whole chromosome, shifting probe by probe; we therefore named the method the “shifted McNemar” test. Because the test is shifted probe by probe (Fig. 1a), the selected significant regions can partially overlap, as shown on the left-hand side of Fig. 1b. In this case, we merged the overlapping regions into a single extended region (right-hand side of Fig. 1b).
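The shifting and merging steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `window_stats` is a hypothetical list in which entry k holds the test statistic of the window covering probes k to k + size − 1, and the threshold value is arbitrary.

```python
def merge_overlapping(windows):
    """Merge inclusive (start, end) probe-index windows that share at least one probe."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def shifted_regions(window_stats, size, threshold):
    """Keep windows whose statistic exceeds the threshold, then merge overlaps."""
    hits = [(k, k + size - 1) for k, s in enumerate(window_stats) if s > threshold]
    return merge_overlapping(hits)


# Hypothetical per-window statistics (window k covers probes k..k+4).
window_stats = [1.0, 12.0, 14.0, 2.0, 0.5, 0.1, 0.3, 0.2, 13.0, 1.0]
regions = shifted_regions(window_stats, size=5, threshold=10.0)
print(regions)
```

In this toy input, the windows starting at probes 1 and 2 overlap and are combined into one longer region, while the window starting at probe 8 remains separate.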

Fig. 1

Data structure of aCGH data. The horizontal and vertical axes represent chromosomal location and different patient groups, respectively. a The analysis is executed by area. b The duplicated parts of the selected regions are combined

2.4 Discretization of copy number variations (CNV) for McNemar test

Let the expression intensities of patient i be X_i1, X_i2, …, X_in when there are n probes for a patient. The order statistics of the n expression intensities are then X_(1), X_(2), …, X_(n). The ordered intensities can be categorized into three levels using the lower quartile (Q1, the 25th percentile) and the upper quartile (Q3, the 75th percentile) for each patient (or experiment). Categorizing the expression intensities within each experiment may adjust for some of the bias that can exist between different data sets.
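A minimal sketch of the per-sample quartile discretization is given below. The level labels (−1, 0, 1), the treatment of values exactly equal to a quartile, and the quartile estimator (the default of Python's `statistics.quantiles`) are assumptions made for illustration; the paper does not specify them.

```python
import statistics


def discretize_by_quartiles(intensities):
    """Map one sample's intensities to three levels using its own Q1 and Q3.

    Assumed coding: -1 below Q1, 1 above Q3, 0 otherwise.
    """
    q1, _median, q3 = statistics.quantiles(intensities, n=4)
    return [-1 if x < q1 else (1 if x > q3 else 0) for x in intensities]


# Hypothetical log-ratio intensities for a single sample.
sample = [0.1, -0.4, 0.0, 0.9, -0.8, 0.2, 0.05, -0.1]
levels = discretize_by_quartiles(sample)
print(levels)
```

Because the cut points are computed separately for each sample, each sample contributes roughly the same proportion of low, middle, and high calls regardless of its overall intensity scale.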

In real data, the expression intensities were categorized and summarized by the following steps.

(i) The raw expression intensities were categorized as shown in Table 3. We used the raw intensities of Data2008 as an example; hence, the example data show seven tumors (T) and seven dysplasias (D).

Table 3 The process for categorization of the raw expression intensities of a probe for 7 paired experiments

(ii) The categorized values for the 11 paired experiments were summarized in a frequency table, as shown in Table 4. We used the combined data set, Data2007 and Data2008.

Table 4 The summarized frequencies of the consecutive five probes for McNemar test

The sample size of the data set used in this study was 11, which may be too small for the statistical test. However, this problem can be mitigated by considering the region size. For example, if we use a region size of five, the test is based on 11 × 5 = 55 paired observations, which is treated as a sample size of 55.
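The pooling of paired categories over a region can be sketched as below. The data layout (one list of discretized levels per patient for each tissue type) and the function names are hypothetical. The statistic shown is Bowker's test of symmetry, a common multi-category extension of the McNemar test; whether the authors used exactly this form is an assumption.

```python
from collections import Counter


def paired_table(dysplasia, tumor, start, size):
    """Count (dysplasia level, tumor level) pairs over probes start..start+size-1,
    pooling all patients, so the table total is patients x size."""
    counts = Counter()
    for d_row, t_row in zip(dysplasia, tumor):
        for j in range(start, start + size):
            counts[(d_row[j], t_row[j])] += 1
    return counts


def bowker_statistic(counts, levels=(-1, 0, 1)):
    """Sum (n_ab - n_ba)^2 / (n_ab + n_ba) over all unordered pairs of levels."""
    stat = 0.0
    for a in range(len(levels)):
        for b in range(a + 1, len(levels)):
            n_ab = counts[(levels[a], levels[b])]
            n_ba = counts[(levels[b], levels[a])]
            if n_ab + n_ba > 0:
                stat += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
    return stat


# Hypothetical discretized data: 2 patients x 3 probes.
dysplasia = [[0, 0, 1], [0, -1, 0]]
tumor = [[1, 1, 1], [1, 0, 0]]
table = paired_table(dysplasia, tumor, start=0, size=2)
print(sum(table.values()), bowker_statistic(table))
```

With 11 patients and a region size of five, the same pooling would give a table total of 55. Note that pooling counts each probe in the region as a separate paired observation.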

2.5 Evaluation of the proposed method

2.5.1 Inter-correlation within the selected region

The inter-correlation coefficient can be used for exploring the homogeneity within the selected region by the proposed method. To compare the homogeneity between the regions selected using the proposed method and a random method, we used 100 randomly selected regions that contain consecutive probes.

If we have a series of n measurements of probe X and probe Y written as x_i and y_i, where i = 1, 2, …, n, then the Spearman correlation coefficient can be used to estimate the correlation of X and Y. When x_i and y_i are converted into ranks x_(i) and y_(i), with differences d_i = x_(i) − y_(i), the Spearman correlation coefficient is calculated as in Eq. 2.

$$ r_{xy} = 1 - {\frac{{6\sum {d_{i}^{2} } }}{{n(n^{2} - 1)}}} $$
(2)

The mean value of the calculated pair-wise correlation coefficients was used as the inter-correlation among probes within the selected region.
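The inter-correlation of a region can be sketched with Eq. 2 as follows. The function names and example values are hypothetical, and the rank function assumes there are no tied values within a probe, which is the condition under which Eq. 2 is exact.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman correlation from rank differences, as in Eq. 2."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def inter_correlation(probes):
    """Mean of the pairwise Spearman coefficients among the probes of a region."""
    pairs = list(combinations(probes, 2))
    return sum(spearman(x, y) for x, y in pairs) / len(pairs)


# Hypothetical region: 3 probes, each measured in 4 samples.
region = [
    [0.1, 0.4, 0.2, 0.9],
    [0.2, 0.5, 0.3, 1.0],
    [0.9, 0.3, 0.5, 0.1],
]
print(inter_correlation(region))
```

A value close to 1 indicates that the probes in the region vary together across samples; the same function can be applied to randomly chosen runs of consecutive probes for comparison.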

2.5.2 Classification accuracy

To evaluate the classification accuracy of the selected regions, we calculated the out-of-bag (OOB) error rate using random forests (RF), as implemented in the randomForest package for R (http://www.r-project.org). For comparison, we randomly selected 100 regions containing consecutive probes.

3 Results

The distributions of CNVs for each sample of Data 2007 and Data 2008 are shown in Fig. 2.

Fig. 2

Distributions of CNVs for each sample of data 2007 and data 2008. Data 2007 and Data 2008 contain four paired and seven paired tissues, respectively

The CNVs of the dysplasia group were distributed similarly to those of the tumor group, and variation among individual samples was observed. We discretized the raw CNVs without normalization, so that the ranks of CNVs would be retained.

3.1 Decision of appropriate region size

To determine the appropriate region size, we explored the distribution of frequencies of each probe. As shown in Fig. 3, most frequencies were less than five; therefore, we decided that the appropriate region size was five.

Fig. 3

The distribution of the numbers of replications in data 2007 and data 2008. The horizontal and vertical axes represent the number of replications and the frequency of the same number of replications, respectively

The average distance from probe to probe was 28,530 bp in the 44k chip used for this study, and the maximum distance between the five probes was 143 kb. Based on information from the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/), 143 kb is an appropriate size for considering the probe set as a CNV.

3.2 Description of probes in the selected region

Using the shifted McNemar test, 21 regions containing 73 probes were detected. Here, we set the region size to 5 and the p-value threshold to 0.01; the number of regions can therefore increase or decrease depending on the p-value threshold. The selected probes are described in Table 5. The first column gives the probe number in the chip used for this study.

Table 5 Summary of 73 probes in the selected regions

From the “Genomic Variant” Database (http://projects.tcag.ca/variation/), we confirmed that genes covered by the selected probes, including ADAR, RDH14, NT5C1B, SSB, METTL5, KIAA0232, TBC1D14, CCDC96, GRPEL1, ANGPT1, FAM49B, POLE2, MPP5, ATP6V1D, EIF2S1, ADM2, and MIOX, had copy number variations in previous studies. In addition, MLZE on 8q24.21 is known to be expressed in metastatic melanoma cells [28].

3.3 Comparison of inter-correlation among probes within the selected regions

To compare inter-correlations, we used the mean value of the pairwise correlation coefficients among probes within the selected region.

The inter-correlation within the regions selected by the proposed method was significantly higher than that within the randomly selected regions (Fig. 4, p-value = 0.002870). This result indicates that the proposed method selected meaningful probes, which were homogeneous as well as consecutive within a region.

Fig. 4

Comparison of the correlation coefficients between the selected regions by the proposed method and a random method. The correlation coefficients determined using these two methods were significantly different

3.4 Exploration of CNV in the selected regions between different experimental groups

The CNV patterns of the selected regions in chromosome 2 and chromosome 8 were investigated.

Figure 5a, b shows the CNV patterns and discretized CNV patterns, respectively, of a region of chromosome 2p24, where the last row represents the McNemar Chi-square statistic. Large McNemar Chi-square statistics indicate regions in which the intensities differ significantly between the two experimental groups. The highlighted region was selected by the proposed method and includes RDH14, NT5C1B, and OSR1. These genes have already been reported to contain copy number variations in humans (http://projects.tcag.ca/variation/). OSR1 was shown to activate p53 through repression of HDM2 transcription, and its over-expression resulted in up-regulation of p53 activity [8]. The expression of OSR1 mRNA was significantly weakened in gastric cancer cell lines (OKAJIMA, MKN45), pancreatic cancer cell lines (PANC-1, BxPC-3, AsPC-1, PSN-1, Hs766T), and esophageal cancer cell lines (TE10) [11].

Fig. 5

CNV patterns of chromosome 2p24. The horizontal axis represents chromosomal locations and the vertical axes represent raw intensity (a), discretized intensity (b) and McNemar Chi-square statistic

The three regions detected on chromosome 8 were 8q22.2, 8q22.3-q23, and 8q24.1-q24.2. These regions include STK3, OSR2, ANGPT1, CCDC26, MLZE, and FAM49B. It is known that ANGPT1 and FAM49B are deleted in 8q23.1 and 8q24.21, respectively (DGV). ANGPT1 lies in the selected region shown in Fig. 6b, and this gene was deleted in the progression from dysplasia to tumor. It has also been reported that down-regulation of ANGPT1 was closely related to tumor angiogenesis and vessel maturation [16], and a high level of ANGPT1 has been associated with aggressive tumor behavior in OSCC [4].

Fig. 6

CNV patterns of chromosome 8q22-q24. The horizontal axis represents chromosomal locations and the vertical axes represent raw intensity (a) and discretized intensity (b) for the two experimental groups, and the McNemar Chi-square statistic

3.5 Classification accuracy of the selected regions

To compare classification accuracy, we calculated the out-of-bag (OOB) error rates using the mean values of the expression intensities in each region, for regions selected by the proposed method and by the random method. OOB error rates were calculated with the number of regions (CNV patterns) used for classification ranging from 2 to 10. To explore the distribution of OOB error rates, we used 100 repeatedly extracted sets of regions for each size. Figure 7 shows the distributions of OOB error rates, which were significantly different (p < 10e-16) regardless of the number of regions. The inter-quartile ranges of the OOB error rates were narrower for the proposed method than for the random method, and the mean OOB error rates were significantly lower. This result indicates that the regions selected by the proposed method classify tumor and dysplasia more accurately than randomly selected regions.

Fig. 7

Comparison of OOB error rates of the regions selected by the proposed method and random, using box plot. The vertical and horizontal axes represent OOB error rates and the number of regions used for classification, respectively. The gray and white regions represent the proposed method and random method, respectively

However, the average OOB error rates were about 40% even for the proposed method. This observation suggests that the tumor and dysplasia groups did not differ strongly from each other. The classification accuracy could therefore be substantially improved if the proposed method were applied to a data set that includes clearly distinct experimental groups, for example, tumor and normal tissue.

4 Discussion

The main issue in aCGH analysis is to segment the sequence of log ratios along the chromosome into regions of amplification, deletion, or no change [13]. A previous study indicated that the correlation of neighboring genomic intervals should be considered in the structural analysis of aCGH data sets [17], and neighboring probes have been shown to be correlated with each other [3]. These findings indicate that a significant region that includes correlated neighboring probes would be more reliable for classifying experimental groups.

In many aCGH studies, we are interested not only in the copy number variation in an experimental group but also in the comparison of groups of samples, i.e., whether there is a consistent change across the different experimental groups. Therefore, the proposed method could detect regions with significant genetic variations for comparison between different experimental groups. In addition, since we used two data sets for this study and combined these data sets before detection of significant genomic regions, we discretized the continuous CNV to minimize the bias between two data sets derived from different time periods.

The McNemar test has been commonly used to detect differentially expressed genes in paired and discretized microarray data sets. Whereas the general McNemar test is applied to each gene (probe) independently for significant gene selection, the proposed shifted McNemar test was used to detect significant regions, not individual probes, from aCGH data. In this extended McNemar test, regions of neighboring genomic intervals are taken into consideration. The method uses discretized aCGH data to identify regions that are significantly aberrant across the paired samples. Accordingly, the CNV patterns of the selected regions were shown to change between paired samples.

We illustrated the performance of the proposed method using inter-correlation within the selected regions and OOB error rates. The significant regions selected by the proposed method were strongly homogeneous, and high classification accuracies were achieved with these regions.

In conclusion, this method might be useful for identifying new candidate genes that neighbor known genes because the proposed method detects significant chromosomal regions and not independent probes. Also, the proposed method could be more useful in analyzing several data sets derived from different conditions. The candidate genes, which are selected by the proposed method, could be further analyzed based on known functionality and possible links to carcinogenesis.