Test–Retest Reproducibility Analysis of Lung CT Image Features

Balagurunathan, Yoganand; Kumar, Virendra; Gu, Yuhua; Kim, Jongphil; Wang, Hua; Liu, Ying; Goldgof, Dmitry B.; Hall, Lawrence O.; Korn, Rene; Zhao, Binsheng; Schwartz, Lawrence H.; Basu, Satrajit; Eschrich, Steven; Gatenby, Robert A.; Gillies, Robert J.

doi:10.1007/s10278-014-9716-x

Test–Retest Reproducibility Analysis of Lung CT Image Features

Published: 03 July 2014

Volume 27, pages 805–823, (2014)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Digital Imaging Aims and scope Submit manuscript

Test–Retest Reproducibility Analysis of Lung CT Image Features

Download PDF

Yoganand Balagurunathan¹,
Virendra Kumar¹,
Yuhua Gu¹,
Jongphil Kim²,
Hua Wang^1,7,
Ying Liu^1,7,
Dmitry B. Goldgof³,
Lawrence O. Hall³,
Rene Korn⁴,
Binsheng Zhao⁵,
Lawrence H. Schwartz⁵,
Satrajit Basu³,
Steven Eschrich²,
Robert A. Gatenby⁶ &
…
Robert J. Gillies^1,6,8

1733 Accesses
192 Citations
3 Altmetric
Explore all metrics

Abstract

Quantitative size, shape, and texture features derived from computed tomographic (CT) images may be useful as predictive, prognostic, or response biomarkers in non-small cell lung cancer (NSCLC). However, to be useful, such features must be reproducible, non-redundant, and have a large dynamic range. We developed a set of quantitative three-dimensional (3D) features to describe segmented tumors and evaluated their reproducibility to select features with high potential to have prognostic utility. Thirty-two patients with NSCLC were subjected to unenhanced thoracic CT scans acquired within 15 min of each other under an approved protocol. Primary lung cancer lesions were segmented using semi-automatic 3D region growing algorithms. Following segmentation, 219 quantitative 3D features were extracted from each lesion, corresponding to size, shape, and texture, including features in transformed spaces (laws, wavelets). The most informative features were selected using the concordance correlation coefficient across test–retest, the biological range and a feature independence measure. There were 66 (30.14 %) features with concordance correlation coefficient ≥ 0.90 across test–retest and acceptable dynamic range. Of these, 42 features were non-redundant after grouping features with R ² _Bet ≥ 0.95. These reproducible features were found to be predictive of radiological prognosis. The area under the curve (AUC) was 91 % for a size-based feature and 92 % for the texture features (runlength, laws). We tested the ability of image features to predict a radiological prognostic score on an independent NSCLC (39 adenocarcinoma) samples, the AUC for texture features (runlength emphasis, energy) was 0.84 while the conventional size-based features (volume, longest diameter) was 0.80. Test–retest and correlation analyses have identified non-redundant CT image features with both high intra-patient reproducibility and inter-patient biological range. Thus making the case that quantitative image features are informative and prognostic biomarkers for NSCLC.

Assessing the Robustness and Reproducibility of CT Radiomics Features in Non-small-cell Lung Carcinoma

CT texture analysis reliability in pulmonary lesions: the influence of 3D vs. 2D lesion segmentation and volume definition by a Hounsfield-unit threshold

Article Open access 22 March 2023

Identification of CT radiomic features robust to acquisition and segmentation variations for improved prediction of radiotherapy-treated lung cancer patient recurrence

Article Open access 19 April 2024

Discover the latest articles, news and stories from top researchers in related subjects.

Medical Imaging

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Classically, CT imaging is routinely used to establish anatomical and macroscopic pathologies in cancer patients. CT images of tumors also depict characteristics that can be related to physiological processes, such as cell density, necrosis, and perfusion, which may not be commonly evaluated. The appearance of the tumor in CT images has been used, qualitatively, to provide information about tumor type, degree of spread, and organ invasion [1, 2]. Such features are typically described subjectively (i.e., “mildly irregular”, “highly spiculated”, “moderate necrosis”). However, to be useful as biomarkers, features must be reproducible, quantifiable, and objective [2]. Thus, there is a need to identify features from CT images that can be reliably extracted and converted into quantifiable, mineable data as potential prognostic, predictive, or response biomarkers. In current clinical practice, only two tumor quantitative CT features; i.e., bi- and uni-dimensional measurements (WHO and RECIST, respectively) are routinely obtained and used to assess response to therapy. While these are satisfactory under some conditions, reduction in tumor size often does not reflect clinico-pathological response [3, 4].

Recent advances in both image acquisition and image analysis techniques allow semi-automated segmentation, extraction, and quantitation of numerous features from images, such as texture. Such features extracted from CT images of lung tumors have been shown to relate to glucose metabolism and stage [5], distinguish benign from malignant tumors [6–9], or differentiate between aggressive and nonaggressive malignant lung tumors [10, 11]. In liver cancer, combinations of 28 image features obtained from CT images could reconstruct 78 % of the global gene expression profiles [12]. As this area of investigation continues to expand, a number of critical questions remain unanswered, including correlated features and reproducibility of individual features. In the present study, we extracted and analyzed a large number of image features describing shape, size, run length encodings, pixel intensity histograms, textures, entropy, and wavelets. In this agnostic approach, we gave equal importance to all features with no prior bias towards radiologist preferences or accepted semantics. Such an analysis of a high dimensional feature space, i.e., “radiomics”, requires standardization and optimization to qualify these potential biomarkers for prognosis, prediction, or therapy response [13, 14]. An important step in the qualification process is to statistically characterize individual features as being reproducible, non-redundant, and having a large biological range. The most reproducible features are more likely to be able to identify subtle changes with time, pathophysiology, or in response to therapy. Additionally, the reproducibility must be compared to the entire biological range available to that feature across patients. The biological range can be expressed as a dynamic range, DR. It is expected that features will be more useful if they have a large dynamic range. In addition, features must be identified that are not redundant, as it is axiomatic that redundant features can overwhelm learning algorithms and be non-informative for decision support systems.

The inter-scan reproducibility of features may be affected by differences in patient variables, such as positioning, respiration phase, and contrast enhancement, as well as acquisition and processing parameters, including image acquisition power and slice thickness, image reconstruction algorithm, segmentation software, and user input for segmentation. In the present study, the acquisition and processing parameters were fixed, and patient variables were minimized by obtaining two separate CT scans from the same patient on the same machine using the same parameters, within 15 min of each other. Acquisition of these images and reproducibility of tumor uni-dimensional, bi-dimensional, and volumetric measurements has been previously reported [15]. These data have been made publically available under the NCI-sponsored Reference Imaging Database to Evaluate Response (RIDER) project [16]. The objective of the current study is to determine the variability in a large set of agnostic image features extracted from this data set in order to identify the most informative features using empirical filters.

In prior work, we have demonstrated that semi-automatic segmentation had 73 % overlap between operators across a test set of 129 patients [17]. Hence, lesions in the current study were volumetrically segmented using semi-automatic approach (with expert correction) and 219 features were extracted.

Although we began with a large feature set compared to prior conventional radiological analyses [15], it was expected that some features could be redundant due to the sample size. Thus, to reduce the dimensionality of this agnostic data set, we first filtered features based on their reproducibility, searching for those with the highest intra-feature concordance correlation coefficients between the repeats. As a second filter, we used the dynamic range (range of values observed across patients combined with repeatability). Finally, redundancy was assessed by computing an inter-feature coefficient of determination (R ² _Bet) between all possible pairs of features. A representative feature set was found by selecting from dependent groups to form an independent set.

We also acquired an existing, independent NSCLC data set (39 adenocarcinoma) of mixed histology, where the tumor size, node, and metastasis (TNM) stage ranges from I to III. The images features were computed on this subset. Using the known test/retest repeatable features (CCC and DR > 0.9), the subset of features obtained and redundancy was reduced (R ² _Bet ≥ 0.95). These image features showed high predictability of radiological prognosis. Figure 1 illustrates the process flow followed in the paper.

Data Collection

The details of patient recruitment have been described in Zhao et al. [15], the samples were collected in an Institutional Review Board (IRB)-approved study. In brief, baseline and follow up CTs of the thorax for each patient were acquired within 15 min of each other, using the same CT scanner and imaging protocol. Among other possibilities, this enables testing extracted image features for stability. Unenhanced thoracic CT images were acquired using 16-detector (GE Light Speed) or 64-detector (VCT; GE Heathcare) scanners, with 120-kvp tube voltage and image slices thickness of 1.25 mm were reconstructed using the same lung convolution kernel without overlap. The CT scans were acquired from 32 patients (mean age, 62.1 years; range, 29–82 years) with non-small cell lung cancer. There were 16 men (mean age, 61.8 years; range, 29–79 years) and 16 women (mean age, 62.4 years; range, 45–82 years). All patients had a primary pulmonary tumor of 1 cm or larger. The images are available in the “RIDER Lung CT” collection in NBIA under the “Collections” sections. The RIDER CT image data is on the National Biomedical Imaging Archive (NBIA) site [18], is de-identified without any patient information, complies with HIPPA requirement for sharing data.

Segmentation of Tumors

Definiens Developer XD [19] was used as the image analysis platform. It is based on the Cognition Network Technology® [20, 21] which allows the development and execution of image analysis applications. Here, the lung tumor analysis (LuTA) application was used [22]. LuTA contains a semi-automated three-dimensional click-and-grow approach for segmentation of tumors under the guidance of an operator. The semi-automatic segmentation workflow contained the following steps: (a) Preprocessing: The preprocessing performed automated organ segmentation with the main goal of segmenting the aerated lung with correct identification of the pleural wall in order to facilitate the semi-automated segmentation of juxtapleural lesions. (b) Semi-automated correction of the pulmonary boundary: In order to perform seed-based segmentation of a target lesion, the latter has to be completed within the extracted lung image object. In cases where a medical expert (trained radiologist with over 1 year of experience) concluded that the automated preprocessing described above failed to accurately identify the border between a target lesion and the pleural wall, it was necessary to enable correction of the automated lung segmentation. To this end, the image analysts identified the part of the lung that needed modification and placed a seed point manually where the segmentation should be corrected. A seed point outside the lung defined a lung extension, whereas a seed point inside the lung defined a reduction. (c) Click and grow: In order to segment a target lesion the image analysts identified the lesion within the segmented lung and placed a seed point in its interior—typically at the perceived center of the lesion. If the growing process did not sufficiently capture the target lesion, the operator could place additional seed points within the lesion and repeat the growing process outlined above. Upon completion of the segmentation, the individual image objects were merged to form a single image object representing the segmented target lesion. (d) Manual refinement and generation of lesion statistics: Upon completing a seed-based lesion segmentation as described above; the results were viewed by scrolling up and down the stacks of axial images to verify that the segmentation followed the anatomical compartment boundaries properly. To facilitate manual adjustment of the seed-based growing algorithm, tools of two types were constructed. The first type allowed the operator to limit the boundaries beyond which the region could grow during the “Click-Grow” step by manually placing “blocker” points. Another approach allowed for manual editing of the contour of each segmented lesion on each axial slice by cutting, merging and reclassifying objects and thus enabled the image analysts to perform any desired modifications of the segmented lesion. Image analysts were empowered to override as much or as little of the semi-automatically grown regions as their expertise suggested was indicated. The semi-automatic segmentation process (a)–(d) above required multiple human interactions in order to get the “correct” segmentation boundaries.

Once the segmentation of all target lesions was complete (e.g., Fig. 2a, b), quantified metrics on each lesion, such as volume, conventional size measurements, and custom feature implementations (texture categories), were extracted. In total, 64 lesions were segmented, i.e., two per patient. Then quantitative values of image features were extracted from each segmented volume. Figure 2c shows the distribution of the volume (in centimeter) measured after segmentations, for both test and retest scans. The volume distribution showed a diverse population, wherein half of the samples had small (volume ≤ 4 cm³) tumors while the rest of the samples are larger in size (largest group close to 140 cm³).

Image Features and Categorization

We extracted several types of image features to describe the tumors heterogeneous shape and structure (details in the subsection below). Note that there are multiple features extracted in some of the categories. As mentioned before, texture features have been shown to be good descriptors of the tumor and have shown relevance for survival prediction [23]. In this study, we have used 219 (including custom) 3D image features. Details of the features are described in Electronic Supplementary Material (ESM) Tables S.1 and description Table S.2. Most size- and shape-based feature computations were implemented within the Definiens XD® platform [18], while texture and other derived features were computed from algorithms implemented in C/C++. All the features were obtained from the region of interest (i.e., after the segmentation).

Although 219 features seem like a large set, other effective descriptors may yet need to be added. We categorize our feature set into seven broad categories to describe the lesion, namely: size based, shape based, location based, pixel intensity histogram based, run length and co-occurrence, law’s kernel-based texture and wavelets-based texture descriptors. Table 1 shows the number of features in each of the categories and a detailed description is provided in the ESM Section S.1. Our approach has been driven by the conventional radiologist belief that an ensemble of factors including tumor shape, size, location, and density best describe a heterogeneous tumor lesion. It has also been shown that features are dependent within and across the categories. Our approach is to find representative features in each category so as to “best describe” the tumor in feature space. We assembled comprehensive descriptors to cover most categories.

Table 1 Feature categories with the counts in each

Full size table

Feature Concordance and Dynamic Range

The sets of informative features were selected using a three-step process. We first tested the consistency of extracted features between the test and retest experiments. For each image feature, the concordance correlation coefficient was used to quantify reproducibility between two scans performed on each patient. The Concordance correlation coefficient (CCC) is superior to the Pearson correlation coefficient for repeated experiments [24]. Suppose that X _1,k and X _2,k are the feature values for kth feature and that (X _1,k(i), X _2,k(i)) are independent and follow a bivariate distribution with means and covariance matrix: μ _x1, μ _x2, and ([σ _x1,k ², σ _{x1,k, x2,k}], [σ _{x1,k ,x2,k} ², σ _x2,k] ), for the lesions measured in the ith test and retest experiment. Then the CCC [24] is defined as

$$ \mathrm{CCC}=\frac{\left(2{\sigma}_{x_{1, k},{x}_{2, k}}\right)}{\left({\sigma}_{x_{1, k}}^2+{\sigma}_{x_{2, k}}^2+{\left[{\mu}_{x_{1, k}}-{\mu}_{x_{2, k}}\right]}^2\right)} $$

(1)

The CCC, a standardization of mean squared deviation, evaluates a deviation from the identity line (i.e., the 45° line through the origin) and is typically used to measure concordance between test and retest. The CCC values ranges from 1 to −1, implying perfect agreement between the repeated experiments to reverse agreement between them.

On a selected set of highly reproducible features, the next step was to select the features with a large inter-patient variability, using the “dynamic range”. The normalized dynamic range for a feature was defined as the inverse of the average difference between measurements divided by the entire range of observed values in the sample set as in [2]:

$$ {\mathrm{DR}}_k=\left(1-\raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$ n$}\right.{\displaystyle \sum_{i=1}^n\frac{\left|{\mathrm{Test}}_k(i)-{\mathrm{Retest}}_k(i)\right|}{{\mathrm{Max}}_k-{\mathrm{Min}}_k}}\right) $$

(2)

where i refers to sample index, for the kth feature, Test_k(i) or Retest_k(i) are sample i’s, kth feature values for a test/retest population of n patient cases, the maximum (Max_k) and minimum (Min_k) are computed on the entire sample set. The dynamic range for feature k is, DR_k ∈ [0,1]. Values close to 1 are preferred, and imply that the feature has a large biological range relative to reproducibility. Increasing the variation between the test–retest repeats will lead to a reduction in the DR value. Screening for a large DR will eliminate features that show greater variability in the repeat scans compared to the range of the coverage. The dynamic range measure will effectively address the “effect size” by identifying features with a lower value that are either not reproducible (relative to their range), or that are not highly variable across an entire sample set. This metric helps to give features a higher score that have relatively larger coverage (with respect to the repeatable differences), this does not intend to describe the dynamic range of the entire population.

The last step is to eliminate redundancies, based on the calculation of dependencies within the group. We computed the coefficient of determination (R ² _Bet) between the features that passed a dynamic range threshold to quantify dependency. It is a linear estimate of the correlation or dependency and has a range of 0 to 1. Values close to 1 would mean that the data points are close to the fitted line (i.e., closer to dependency) [24, 25]. The coefficient of determination of simple regression is equal to the square of the Pearson correlation coefficient [25, 26]. We used different threshold values for R ² _Bet to consider each feature as linearly dependent on any other feature(s) in the list. The features that passed the cutoff limit were grouped and replaced by a representative from the group; the one having highest dynamic range. The purpose of this third filter was to eliminate redundancies (and not necessarily identify independence). A range of R ² _Bet thresholds were explored. We filtered the features category wise to find the representative in each category to ensure equal representation. Feature reduction taking all the categories of features together was also carried out, see ESM Table S.3.

Prognostic Label

We used our attending radiologist to categorize the lung CT dataset into two broad prognostic groups using quantitative metrics to score the tumor on a point scale. Both the RIDER test/retest data set and the independent NSCLC data set were scored separately using the same semantics. It has been reported that tumor size, differentiation, vascular invasion, margin status: negative vs. positive or close margins, have all been shown to have prognostic value [27–29]. We used five observable features: lobulated margin, size of the tumor lesion, spiculated margin, plueral wall attachment, and texture (e.g., ground glass opacity, GGO) as factors to scale the tumor into high risk to moderate risk individuals. The semantic scheme was first proposed and individual semantics prognostic ability of these metrics studied in NSCLC (Wang et al., in preparation). The observations were given a score of 1 to 5. In order to obtain a single risk score for a sample, the five semantic values were summed, averaged, and standardized to [0, 1] to obtain a prognostic score. A normalized prognostic score over the median prognostic score was considered high risk (or poor prognosis). A sample below the median would indicate relatively lower to moderate risk. In the RIDER data set, two samples could not be scored reliably using the point scale metric due to diffused lesions and one sample was partly scored due to obscured margin. So, three samples were eliminated.

The two created categories were then used to find discriminatory markers between the poor to better prognosis groups. The table in ESM Section S.4 shows the score for individual samples for RIDER set. Figure 3 shows examples of CT images for extreme semantic scores.

Feature Selection

The reproducibility of radiographic features obtained from CT scans of lung cancer was investigated to establish potential quantitative imaging biomarkers. Most of the features showed high reproducibility using an automated image analysis program with segmentation done by a single reader. Prior work has demonstrated the use of three types of features: uni-variate, bivariate, and volumetric for automatic and manually segmented lung lesions, which seems to be limited in describing the complex nature of a tumor [15]. In the current study, the tumor is described by many features using different categories: size (volume, diameter, border length), shape (shape index, compactness, asymmetry), boundary region (border length, spiculation), relation to the lung field, image intensity (relative attenuation) based features (mean, standard deviation, average air space, deviation of airspace, energy, entropy, skewness, etc.), and transformed texture descriptors (wavelet transform: entropy and energy and laws features). The consistency of this novel set of features in repeat scans (test, retest) was tested and they were filtered to find independent features. The stable, independent features provide an image feature set that may, for example, be used to predict prognosis.

One requirement for an image feature to be qualified as a response biomarker is that the change in its value between pre- and post-therapy scans must be significantly greater than the difference observed in the “Test–Retest” (or “15 minute coffee break”) measurements. In the present study, we can estimate the change of individual features that may be encountered post-therapy to be within the entire pre-therapy biological range. The changes between test and retest for the conventional radiological measurement (Longest diameter) and other related size measurements (Area and Volume) are within the accepted bounds with prior work [15], (see ESM Fig. 1 and ESM Table S. 5).

The ratio of the range to the inter-scan variability is a measure of “dynamic range” (see ESM Fig. 2A). Features showing high dynamic range were considered potentially more informative. The distribution of CCC values between test and retest, which as expected is skewed toward higher end values. It has high concordance between the test and retest cases (see ESM Fig. 2B). There is also a larger peak toward zero values. Investigating the peaks shows some of the laws and higher level wavelet features have low concordance between the repeated test and retest scans. It is hypothesized that the reimaging of the patients resulted in some change in texture (perhaps from small patient movements and segmentation differences). These Laws features compute energy after kernel convolution in a region. Small changes in sub regional textures would make these features vary, as they capture small localized changes. A similar analogy could be made for wavelet features for higher layer decompositions (or higher layers), where discordance can be seen.

In prior work, Segal et al. have used a correlation coefficient threshold of 0.9 to distinguish highly correlated features [12]. In the current study, we used the coefficient of determination (R ²) between the features to find the dependency.

Feature Reduction

Feature reduction to select an informative feature set is an active research field; metrics that have been used in the past, include: the correlation coefficient, regression methods, and classification accuracy [30–32]. In our study, we propose finding a representative feature set that will eliminate redundancy in terms of information content, as complete independence may not be relevant for our study as texture information is subjective (and affected by sample issues, scanner setting, protocol followed, etc.). We used the coefficient of determination (R ² _Bet) between features to quantify dependency. Features were grouped based on R ² _Bet between them; in this subset one representative was picked that had the highest dynamic range. The procedure was repeated recursively to cover all the features resulting in a most representative group, done independently for each category.

The test, retest values were averaged before computing R ² _Bet. We set different limits to combine the features, R ² _Bet from 0.75 to 0.99. For higher thresholds R ² _Bet, less features will be grouped together resulting in a larger representative group. Setting the R ² _Bet to a lower limit will group more features together resulting in a smaller representative feature set (i.e., set of independent features). The combination of reproducibility, plus informative and independent are needed characteristics for an imaging biomarker. Table 2 shows a number of representative features obtained at different thresholds for concordance and dynamic range. As an example, the midlevel threshold CCC_TreT and DR ≥ 0.90 yielded 66 features. In this subset, the representative features were found with R ² _Bet ≥ 0.95 resulting in 42 features. Figure 4 shows a heatmap of coefficient of determination (R ² _Bet) between the features, for CCC_TreT and DR ≥ 0.90. All the representative features with a cutoff of R ² _Bet ≥ 0.95 are outlined. Tables 3 and 4 lists the feature counts and description category wise.

Table 2 Features obtained after concordance correlation coefficient (test and retest) and dynamic range procedures

Full size table

Table 3 List of representative reproducible Image features that was obtained by combining those with high R ² _Bet for with CCC and DR ≥ 0.90. Number of features

Full size table

Table 4 List of representative reproducible image features that was obtained by combining those with high R ² _Bet for with CCC and DR ≥ 0.90. The feature description

Full size table

The image features left out of the representative reduced feature set could also be useful features. The image features are expected to capture different aspects of morphology and texture information. Due to the consistency in samples chosen and a limited sample population, the image features computed may show a higher level of dependency. It is hypothesized that the samples chosen as primary lung tumor may have a limited amount of texture or morphological changes.

Independent Data Set and Reader Variability

The lung tumor samples (NSCLC, 39 lung adenocarcinoma) were collected in an IRB-approved study. This study was confined to adenocarcinomas with mixed staging (TNM 1A/B, 15 samples; TNM 2A/B, 11 samples; TNM 3 A/B, 11 samples). There were three samples with missing pathology staging, which were replaced with clinical staging (all of them were 1A). The patients had a CT scan before surgical resection or treatment. The tumor samples were analyzed by a board-certified pathologist. The clinical and vital statistics were obtained from the Moffitt Cancer Registry (Tampa, FL). Vital statistics are typically updated on a yearly basis via patient survey. Table 5 shows the sample histology, which indicates a mixed population (tumor stage, gender, vital status, etc.). The mixture is maintained even after the median split based on the overall Radiological prognostic score.

Table 5 Survival and histological details of the independent data samples

Full size table

These samples were independently scored on the five-point scale and the agreement between the two radiologists was measured by the Weighted Kappa index for an ordinal variable. The kappa value is typically interpreted as the following: <0: poor agreement; 0–0.2: slight agreement; 0.2–0.4: fair agreement; 0.4–0.6: moderate agreement; 0.6–0.8: substantial agreement; >0.8: almost perfect agreement [33]. The agreement for the five radiological parameters used in the paper had substantial to perfect agreement; the values were in the range of 0.75 to 1.0, details shown in Table 6.

Table 6 Agreement between two radiological readers in an independent test data (~39 NSCLC samples)

Full size table

Results and Discussion

As described in an earlier section, the CT data were segmented semi-automatically with user input to obtain the tumor boundaries [3, 14]. The distribution of tumor volumes across the sample set is diverse (see Fig. 2c). In the segmented regions of interest (ROI), 219 3D features were extracted; a comprehensive list is given in ESM Table S.1. The feature names were abbreviated to fit them in the table format, for example: “F78:3D-Laws-20” would mean, feature#78 (from the total of 219 features), it’s a 3D texture feature, computed by the “Laws” kernel of type 20. The kernel reference can be found in ESM Table S.2, which in this case is “E5 S5 W5 Layer 1”. All features can be decoded in this way.

Conventional Radiologist Measures

In order to be comparable with previous work [15], we compared the concordance correlation confidence limits for segmentation on three commonly used features: length, area, and volume (ESM Table 5). As before, we found high concordance across test–retest. The difference distribution between test and retest for the three features was consistent with previous findings (See ESM Fig. 1). As the tumor size increased, the difference between test and retest was reduced, as observed in previous analyses [15].

Concordance in the New Features

The 219 extracted features were first compared using the CCC, which is a stringent measure of reproducibility. A CCC_TreT value ≥ 0.75 indicates that the data are of acceptable reproducibility. For our data set, we examined various thresholds with a preference for high stringency. These analyses identified 45, 66, and 93 features that had CCC_TreT values above thresholds of 0.95, 0.90, and 0.85, respectively (see Table 2).

Dynamic Range in New Features

At a second level of analysis, the dynamic range was computed as described in Methods. Features with a dynamic range ≥ 0.95 have a biological range that is more than 20-fold greater than the test–retest difference. These analyses identified 59, 189, and 219 features above dynamic range thresholds of 0.95, 0.90, and 0.85, respectively. Applying both the filters, we identified features that passed the threshold set by CCC as well as the dynamic range filter. These two filtering procedures result in a set of features that is reproducible and has a large range compared to the variability between the test and retest experiments.

Redundancy Reduction

It is known that agnostic features may be inter-dependent. To reduce redundancies, we used the coefficient of determination (R ² _Bet) between all possible pairwise combinations of features to quantify the level of similarity. The threshold level to flag features as linearly dependent is critical and subject to change with sample size, tumor shape, and texture. Using an R ² _Bet threshold of ≥ 0.95 to identify interdependence, there were 42 features that had CCC_TreT and DR values ≥ 0.90. At a lower setting, of CCC_TreT and DR ≥ 0.85, there were 93 features, with redundancy reduction (R ² _Bet ≥ 0.95) we obtained 44 features.

Metric Distribution

The ordered distribution plot for the dynamic ranges along with the distribution of concordance coefficients shown in ESM Fig. 2. The features’ concordance and dynamic range criteria were computed after segmentation and the features obtained for discrete cutoffs are shown in ESM Table 3.

Representative Features

Figure 4 shows a heatmap of coefficient of determination values (R ² _Bet ) for the selected features at CCC_TreT and DR ≥ 0.90, the highly inter-dependent features tend to be more closely grouped. Features were chosen with CCC_TreT and DR ≥ 0.90 and R ² _Bet ≥ 0.95. Most size features are highly reproducible and independent. Among all features in the size category, 69 % of them passed all three filters. At the other end of the spectrum, texture features showed high inter-dependency. Hence, only 4 % of the laws features and 10 % of wavelet features passed all three filters (see Tables 3 and 4). More inferences could be derived by comparing features across various categories. Table 3 lists the number of representative features obtained for different categories and Table 4 lists the feature details. ESM Table S.3 contains exhaustive feature tables and feature listings for other CCC, DR, and R ² _Bet threshold. In obtaining discriminatory biomarkers, one may have less interest in categorizing the features, and hence, the analysis was repeated without categorization, results are presented in ESM Table S.6A and B.

Practical Application of Repeatable Markers

Quantitative image features have been shown to predict prognosis in prior studies [34, 35]. Our objective in this work was to find reproducible, non-redundant, and high dynamic range image set of features that could be prognostic or response markers. The following is a practical example to illustrate the potential utility of these markers, for which reproducibility is a required trait to be useful as a prognostic predictor.

Radiological Prognosis Discrimination in the RIDER Dataset

As an application, we used the features that passed the concordance and dynamic range filters to test their ability to discriminate groups based on their prognostic score, as described in Methods. We applied statistical tests (two sample t test) to find image features which are associated with the prognostic groups. The unadjusted p value for each feature was computed and the significant features were identified using the false discovery rate (FDR) method in order to guarantee the family wise error rate [36, 37], the feature details are reported in ESM Table 7. Because they were reproducible, the feature values for the test and retest were taken as independent observations. The features that had a FDR of ≤0.05 were considered to be prognostic discriminators.

Optimal Threshold

Linear discriminant analysis was used to find a cutoff level for the significant features. Using the prognostic labels as ground truth sensitivity, specificity, and area under, the curve (AUC) were computed. Some of the texture features (runlength, laws, and wavelets) had to be linearly scaled (by a factor of 1,000) before computation to avoid numerical errors.

The ROC (receiver operator characteristics) for two features is shown in Fig. 5, the area under the curve is 90 % for the conventional size based feature (longest diameter) and 91 % for the chosen texture (run length) feature. The ESM Fig. 3 shows representative plots of prognostic features (size and texture category) expressed according to the prognostic score for RIDER data. We arbitrarily assigned scores below the median as “good” (in green) and those above the median as “bad” (in red), in a relative sense.

The example in Fig. 6, shows two extreme sample cases with representative slices for run length and a laws kernel feature. The laws 1 (E5 E5 E5 Layer 1) is an edge detecting kernel of length 5, applied across all directions (x, y, and z) and normalized based on the size of the tumor. It is expected to have an inverse relation to tumor size and expected to measure tumors edges. Figure 7 shows the relationship of the prognostic feature to the conventional measures. It shows in the lower range it tracks the size measurement but deviates as the feature value increases. We hypothesize that these texture features capture more information than traditional size based measurements.

The sample set considered was diverse with more large tumors than small ones, and hence most size-based features were near the top of the prognostic predictor list. Size is a well-known prognostic feature for many tumors [38]. In addition, we also observed that a large number of texture features (histogram, laws, and wavelets) were prognostic. Notably, texture-, size-, and shape-based descriptors showed equal prognostic value (see ESM Table 7). Table 7 reports sensitivity, specificity, and area under the curve for the discriminant features. It is noteworthy that some of the texture features perform better than the conventional size-based features.

Table 7 Sensitivity, specificity, and area under the curve computed for the significant features that discriminate prognostic score. An optimal threshold was obtained using linear discriminant function

Full size table

Radiological Prognosis Discrimination Using Independent Data

We obtained 39 patient CT scans of adenocarcinomas with mixed stage; the lesion was marked by a trained radiologist and segmented using a semi-automated tool. From the region of interest, 219 quantitative image features were computed. The image data was log transformed to minimize the effect of the outliers. Based on this test/retest study, we selected 66 image features that are known to be reproducible with CCC and DR ≥ 0.90 (Table 2 and details in ESM Table S.3.a, b). The redundancy between the features in the 39 independent lung tumor data was removed by computing the R ². The features were grouped by a cutoff (R ² ≥ 0.95) and the group of dependent features replaced by a representative that had the largest dynamic range. This reduced the number of features to 41. The resident radiologist scored these images on five known radiological semantics, using the five-point scale described in the previous section. Figure 8 shows two patient samples with better radiological prognosis (<median) and two samples with poor prognosis (≥median). It is interesting to note that there is a 9-month difference in median survival between samples in the better prognostic group arm compared to those in poor prognostic group arm.

These 41 image features were then used to find discriminators for the radiological prognostic score by computing the false discover rate on the p values (two tail t test), independently repeated in each category. For the discriminant features (FRD ≤ 0.05), the ability to predict the radiological scores was evaluated by computing the sensitivity, specificity, and area under the curve using linear discriminant analysis. Figure 9 and Table 8 show the discriminant features and their individual predictability. It is interesting to note conventional size-based measures (longest diameter, volume) show an AUC of over 0.8. The pixel histogram and runlength categories of features have better AUC than size distribution, over 0.84. This indicates more than the size of the tumor; the texture and pixel distributions carry prognostic information. Some of the features are picked up as prognostic in the RIDER data set. It is also interesting to note that laws and wavelets were eliminated due to broader line statistical significance and additional constraint to reduce false discovery [37]. We hypothesize that the runlength and other texture feature describe the heterogeneity of the tumor.

Table 8 Image features discrimination ability of the prognostic score in the independent data (39 adenocarcinoma). The representative features (CCC and DR ≥ 0.90, R ² ≥ 0.95) was obtained on the independent data set filtered for false discover rate ≤ 0.05. Sensitivity, specificity, and area under curve (AUC) computed for linear discriminant functions

Full size table

Conclusions

The current study demonstrates that the test–retest reproducibility of most CT features of primary lung cancer is high when using an image analysis approach with semi-automated segmentation. Across all patients, the biological ranges for the majority of individual features were very high. We propose that the features with lowest inter-scan variance relative to the largest biological range (i.e., dynamic range, DR) should be explored as potentially the most informative for use as imaging biomarkers. Additionally, a co-variance matrix of features identified several redundancies in the feature set that could be combined into a single variable. Combining inter-scan variance, biological range, and co-variance, we have reduced the total number of features from 219 to a most informative set of 42 features identified at a setting of CCC_TreT and DR ≥ 0.9 (R ² _Bet ≥ 0.95). These reproducible and representative features show high ability to discriminate tumors based on prognostic labels. For 69 % of size-based features, 62 % of histogram features and 29 % run length features, it was possible to discriminate between tumors with low and high prognostic scores with considerable accuracy (area under the curve over 90 %). In the independent data set, a similar trend can be seen. The size-based features discriminate the prognostic score with an AUC of 0.8 while the texture-based features (runlength, histogram based) resulted in an AUC of 0.84. The texture-based features provided additional information about the tumor and increased the ability to be prognostic. Multivariate methods with other clinical factors could improve predictability. The current findings shed light on the selection of reproducible, informative, and independent features that are candidate imaging biomarkers to predict prognosis and assess (or predict) better therapy planning and response.

References

Nguyen T, Rangayyan R: Shape analysis of breast masses in mammograms via the fractal dimension. Conf Proc IEEE Eng Med Biol Soc 3:3210–3213, 2005
PubMed Google Scholar
Schuster DP: The opportunities and challenges of developing imaging biomarkers to study lung function and disease. Am J Respir Crit Care Med 176(3):224–230, 2007
Article PubMed Google Scholar
Suzuki C, Jacobsson H, Hatschek T, et al: Radiologic measurements of tumor response to treatment: practical approaches and limitations. Radiographics 28(2):329–344, 2008
Article PubMed Google Scholar
Tuma RS: Sometimes size doesn't matter: reevaluating RECIST and tumor response rate endpoints. J Natl Cancer Inst 98(18):1272–1274, 2006
Article PubMed Google Scholar
Ganeshan B, Abaleke S, Young RC, et al: Texture analysis of non-small cell lung cancer on unenhanced computed tomography: initial evidence for a relationship with tumour glucose metabolism and stage. Cancer Imaging 10:137–143, 2010
Article PubMed Central PubMed Google Scholar
Way TW, Sahiner B, Chan HP, et al: Computer-aided diagnosis of pulmonary nodules on CT scans: improvement of classification performance with nodule surface features. Med Phys 36(7):3086–3098, 2009
Article PubMed Central PubMed Google Scholar
Samala R, Moreno W, You Y, et al: A novel approach to nodule feature optimization on thin section thoracic CT. Acad Radiol 16(4):418–427, 2009
Article PubMed Google Scholar
Lee MC, Boroczky L, Sungur-Stasik K, et al: Computer-aided diagnosis of pulmonary nodules using a two-step approach for feature selection and classifier ensemble construction. Artif Intell Med 50(1):43–53, 2010
Article PubMed Google Scholar
Zhu Y, Tan Y, Hua Y, et al: Feature selection and performance evaluation of support vector machine (SVM)-based classifier for differentiating benign and malignant pulmonary nodules by computed tomography. J Digit Imaging 23(1):51–65, 2010
Article PubMed Central PubMed Google Scholar
Al-Kadi O, Watson D: Texture analysis of aggressive and nonaggressive lung tumor CE CT images. IEEE Trans Biomed Eng 55(7):1822–1830, 2008
Article PubMed Google Scholar
Kido S, Kuriyama K, Higashiyama M, et al: Fractal analysis of internal and peripheral textures of small peripheral bronchogenic carcinomas in thin-section computed tomography: comparison of bronchioloalveolar cell carcinomas with nonbronchioloalveolar cell carcinomas. J Comput Assist Tomogr 27(1):56–61, 2003
Article PubMed Google Scholar
Segal E, Sirlin CB, Ooi C, et al: Decoding global gene expression programs in liver cancer by noninvasive imaging. Nat Biotechnol 25(6):675–680, 2007
Article CAS PubMed Google Scholar
Buckler AJ, Mozley PD, Schwartz L, et al: Volumetric CT in lung cancer: an example for the qualification of imaging as a biomarker. Acad Radiol 17(1):107–115, 2010
Article PubMed Google Scholar
America RSoN: Quantitative imaging biomarker alliance for volumetric CT image analysis: roadmap for a staged validation plan, 2010
Zhao B, James LP, Moskowitz CS, et al: Evaluating variability in tumor measurements from same-day repeat CT scans of patients with non-small cell lung cancer. Radiology 252(1):263–272, 2009
Article PubMed Central PubMed Google Scholar
RIDER. The Reference Image Database to Evaluate Therapy Response. Available at: https://wiki.cancerimagingarchive.net/display/Public/RIDER+Collections;jsessionid=C78203F71E49C7EA3A43E0D213CE5555. Accessed 24 Jun 2014
Gu Y, Kumar V, Hall LO, et al: Automated delineation of lung tumors from CT images using a single click ensemble segmentation approach. Pattern Recogn 46(3):692–702, 2013
Article Google Scholar
NBIA. National Biomedical Imaging Archive. Available at: https://imaging.nci.nih.gov/ncia. Accessed 30 June 2014
Definiens. Definiens AG, Munchen, Germany. Available at: http://www.definiens.com/product-services/definiens-xd-product-suite.html. Accessed 30 June 2014
Athelogou M, Schmidt G, Schaepe A, et al: Cognition network technology—a novel multimodal image analysis technique for automatic identification and quantification of biological image contents. In: Shorte SL, Frischknecht F Eds. Book cognition network technology—a novel multimodal image analysis technique for automatic identification and quantification of biological image contents. Springer-Verlag, New York City, 2007, pp 407–422
Google Scholar
Baatz M, Zimmermann J, Blackmore CG: Automated analysis and detailed quantification of biomedical images using Definiens Congnition Network Technology. Comb Chem High Throughput Screen 12(9):908–916, 2009
Article CAS PubMed Google Scholar
Bendtsen C, Kietzmann M, Korn R, Mozley P, Schmidt G, Binnig G: X-ray computed tomography: semiautomated volumetric analysis of late-stage lung tumors as a basis for response assessments. Int J Biomed Imaging, vol 2011, 2011
Basu S, Hall LO, Goldgof DB, et al: Developing a classifier model for lung tumors in ct-scan images. IEEE Intl Conf on Systems, Man and Cybernetics, (SMC 2011), Anchorage, Alaska, 2011
Lin LI-K: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45:13, 1989
Article Google Scholar
RGD Steel JT: Principles and procedures of statistics. McGraw-Hill, New York, 1960
Google Scholar
Colin C, Frank AW, Gramaji H, et al: An R-square measured of goodness of fit for some common nonlinear regression models. J Econ 77(2):1790–1792, 1997
Google Scholar
Aoki T, Tomoda Y, Watanabe H, et al: Peripheral lung adenocarcinoma: correlation of thin-section CT findings with histologic prognostic factors and survival. Radiology 220(3):803–809, 2001
Article CAS PubMed Google Scholar
Takashima S MY, Hasegawa M, Saito A, Haniuda M, Kadoya M. High-resolution CT features: prognostic significance in peripheral lung adenocarcinoma with bronchioloalveolar carcinoma components. Respiration: Int Rev Thorac Dis 70(1), 2003
Subramanian J, Simon R: Gene exression-based signature in lung cancer: ready for clinical use? JNCI 102(7):464–474, 2010
Article CAS PubMed Central PubMed Google Scholar
Jain AK, Zongker D: Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal 19(2):153–158, 1997
Article Google Scholar
Pudil P, Novovičová J, Kittler J: Floating search methods in feature selection. Pattern Recogn Lett 15:1119–1125, 1994
Article Google Scholar
Saeys Y, Inza I: A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517, 2007
Article CAS PubMed Google Scholar
Landis JR, Koch G: The measurement of observer agreement for categorical data. Biometrics 33:159–174, 1977
Article CAS PubMed Google Scholar
Ganeshan B, Panayiotou E, Burnand K, et al: Tumour heterogeneity in non-small cell lung carcinoma assessed by CT texture analysis: a potential marker of survival. Eur Radiol 22(4):796–802, 2012
Article PubMed Google Scholar
Yanagawa M, Tanaka Y, Kusumoto M, et al: Automated assessment of malignant degree of small peripheral adenocarcinomas using volumetric CT data: correlation with pathologic prognostic factors. Lung Cancer 70(3):286–294, 2010
Article PubMed Google Scholar
John S: A direct approach to false discovery rate. J R Stat Soc B 64(3):479–498, 2002
Article Google Scholar
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc 57(1):289–300, 1995
Google Scholar
Zhao B, Oxnard G, Moskowitz CS, et al: A pilot study of volume measurement as a method of tumor response evaluation to aid biomarker development. Clin Cancer Res 16(18):4647–4653, 2010
Article CAS PubMed Central PubMed Google Scholar

Download references

Author information

Authors and Affiliations

Department of Cancer Imaging and Metabolism, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
Yoganand Balagurunathan, Virendra Kumar, Yuhua Gu, Hua Wang, Ying Liu & Robert J. Gillies
Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
Jongphil Kim & Steven Eschrich
Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA
Dmitry B. Goldgof, Lawrence O. Hall & Satrajit Basu
Definiens AG, Bernhard-Wicki-Straße 5, 80636, Munchen, Germany
Rene Korn
Department of Radiology, Columbia University, New York, NY, USA
Binsheng Zhao & Lawrence H. Schwartz
Radiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
Robert A. Gatenby & Robert J. Gillies
Department of Radiology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China
Hua Wang & Ying Liu
Experimental Imaging Program, H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Drive, SRB-2, Tampa, FL, 33612, USA
Robert J. Gillies

Authors

Yoganand Balagurunathan
View author publications
You can also search for this author in PubMed Google Scholar
Virendra Kumar
View author publications
You can also search for this author in PubMed Google Scholar
Yuhua Gu
View author publications
You can also search for this author in PubMed Google Scholar
Jongphil Kim
View author publications
You can also search for this author in PubMed Google Scholar
Hua Wang
View author publications
You can also search for this author in PubMed Google Scholar
Ying Liu
View author publications
You can also search for this author in PubMed Google Scholar
Dmitry B. Goldgof
View author publications
You can also search for this author in PubMed Google Scholar
Lawrence O. Hall
View author publications
You can also search for this author in PubMed Google Scholar
Rene Korn
View author publications
You can also search for this author in PubMed Google Scholar
Binsheng Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Lawrence H. Schwartz
View author publications
You can also search for this author in PubMed Google Scholar
Satrajit Basu
View author publications
You can also search for this author in PubMed Google Scholar
Steven Eschrich
View author publications
You can also search for this author in PubMed Google Scholar
Robert A. Gatenby
View author publications
You can also search for this author in PubMed Google Scholar
Robert J. Gillies
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Robert J. Gillies.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 626 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Balagurunathan, Y., Kumar, V., Gu, Y. et al. Test–Retest Reproducibility Analysis of Lung CT Image Features. J Digit Imaging 27, 805–823 (2014). https://doi.org/10.1007/s10278-014-9716-x

Download citation

Published: 03 July 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10278-014-9716-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Test–Retest Reproducibility Analysis of Lung CT Image Features

Abstract

Similar content being viewed by others

Assessing the Robustness and Reproducibility of CT Radiomics Features in Non-small-cell Lung Carcinoma

CT texture analysis reliability in pulmonary lesions: the influence of 3D vs. 2D lesion segmentation and volume definition by a Hounsfield-unit threshold

Identification of CT radiomic features robust to acquisition and segmentation variations for improved prediction of radiotherapy-treated lung cancer patient recurrence

Introduction

Data Collection