Introduction

Breast cancer is one of the most prevalent causes of death among women worldwide. Figure 1 shows the estimated numbers of new cancer cases in 2017 in Korea and the United States [1, 2], confirming that the incidence of breast cancer is among the highest of all cancers in both countries. Early detection and treatment through accurate screening and diagnosis are the most effective ways to reduce the occurrence of breast cancer and the associated mortality rate.

Fig. 1

Most common types of estimated new cancer cases, 2017: a in Korea, b in the United States

The variety of medical images has grown with advances in imaging technology, and diagnostic results can vary depending on the experience of the radiologist. Computer-aided diagnosis (CADx) systems have been studied to provide quantitative information useful for diagnosis and to prevent misdiagnoses caused by incorrect interpretations or subjective judgments [3]. A CADx system specialized for breast ultrasound images can assist the radiologist by analyzing numerous forms of data and providing quantitative information.

The importance of research on ultrasound imaging is increasing in light of findings that the X-rays used for breast cancer screening can themselves contribute to the occurrence of breast cancer [4]. Ultrasound imaging is also a useful diagnostic method for distinguishing between malignant and benign masses [5]. Generally, the margin of a malignant mass is not clear in breast ultrasound images, whereas the margin of a benign mass is smooth and well defined. Figures 2 and 3 depict malignant and benign masses, respectively, as imaged by a breast ultrasound system.

Fig. 2

Malignant masses in breast ultrasound images

Fig. 3

Benign masses in breast ultrasound images

Research on CADx systems for breast ultrasound images is underway to assist radiologists by providing quantitative information about breast masses. After morphological and texture features are extracted from breast masses in ultrasound images, distance similarity measures are used to retrieve masses similar to a query mass from a reference library. Similarity measures are also broadly useful in pattern recognition, for example in classification, clustering, and retrieval problems. In this work, various similarity distance measures applicable to characterizing malignant and benign masses are reviewed and categorized by both their syntactic (notational) and semantic relationships. The purpose of this study is to design a CADx system for breast ultrasound images that improves the accuracy of breast cancer diagnoses using various similarity distance measures.

Methods

Data Set

In this study, we used records of patients who underwent breast imaging in the Department of Radiology at the University of Michigan to design the CADx system for breast ultrasound images. The use of the breast ultrasound data was approved by the Institutional Review Board (IRB), and all data were pathologically verified by biopsies [6].

In total, data from 250 patients with 96 malignant and 154 benign masses were used [6]. The radiologist selected two or more orthogonal ultrasound images that best represented each mass in the collected data; however, only a single ultrasound image was selected for certain masses not visible in orthogonal views. The collected data were randomly divided into two sets: Set 1 (\(S_{1}\)) includes a total of 230 ultrasound images with 41 malignant and 80 benign masses, and Set 2 (\(S_{2}\)) includes a total of 258 ultrasound images with 55 malignant and 74 benign masses.

Feature Extraction and Selection

To design the CADx system for breast ultrasound images, we segmented the breast masses in ultrasound images with an automated method using a previously designed active contour model [7]. The contour of the mass is automatically estimated from its center using this model. Morphological and texture features of the segmented masses were extracted, after which the features were used for characterization of a query mass for the CADx system.

We extracted two morphological features and six texture features based on the automated segmentation in the design of the CADx system [6]. The morphological features represent the shape of the mass, such as its size or appearance, and the texture features represent the texture shown in the ultrasound image. In breast ultrasound images, a taller-than-wide shape is a good indicator of malignancy; thus, the width-to-height ratio of the mass was extracted as a morphological feature. Another useful feature for differentiating between malignant and benign masses is the posterior shadowing feature, defined as the normalized average gray-level difference between the interior of the segmented mass and the darkest posterior strip [6]. The texture features, extracted from spatial gray-level dependence (co-occurrence) matrices, are the information measures of correlation 1 and 2, the difference entropy, entropy, energy, and sum entropy [8]. Here, entropy is an indicator of uncertainty: a larger value indicates that the characteristics of the data are uncertain, whereas a smaller value means that they are biased toward one side. Conversely, energy is an indicator of uniformity.
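As an illustration of how such co-occurrence texture features can be computed, the following minimal sketch uses scikit-image's graycomatrix; the quantization level, offset distance, and angle are illustrative assumptions, not the parameters used in the original study [6, 8].

```python
import numpy as np
from skimage.feature import graycomatrix

def glcm_entropy_energy(roi, levels=32, distance=1, angle=0.0):
    """Entropy and energy of a gray-level co-occurrence matrix (GLCM).

    The quantization level, offset, and angle are illustrative choices,
    not the settings of the original CADx system.
    """
    # Quantize the 8-bit ROI to `levels` gray levels to keep the GLCM small.
    quantized = (roi.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(quantized, distances=[distance], angles=[angle],
                        levels=levels, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]                 # normalized co-occurrence probabilities
    nz = p[p > 0]                        # skip zero entries to avoid log(0)
    entropy = -np.sum(nz * np.log2(nz))  # larger value -> more uncertain texture
    energy = np.sum(p ** 2)              # larger value -> more uniform texture
    return entropy, energy

# Example on a random 64x64 "ROI" standing in for a segmented mass region.
roi = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
print(glcm_entropy_energy(roi))
```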

To design the CADx system, the feature vectors of the selected masses were classified into training sets and test sets through a cross-validation method.

Distance Similarity Measures

To retrieve masses similar to the query mass, the same feature vectors extracted from the query mass are also extracted from every mass in the reference library. After the similarity between the feature vector of the query mass and each feature vector in the reference library is computed, the masses most similar to the query mass are retrieved from the library. The performance of the CADx system was evaluated by receiver operating characteristic (ROC) analysis.
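The original implementation is not published, so the following is a minimal sketch of this retrieval step with hypothetical names and array shapes; any of the distance measures defined below can be supplied as `distance_fn`.

```python
import numpy as np

def retrieve_similar(query, reference_features, distance_fn, k=5):
    """Return the indices of the k reference masses closest to the query.

    `reference_features` is an (n, d) array holding one feature vector per
    reference mass; `distance_fn(q, p)` is any distance measure defined below.
    """
    dists = np.array([distance_fn(query, p) for p in reference_features])
    return np.argsort(dists)[:k]  # indices of the k smallest distances

# Example with the Euclidean distance on hypothetical 8-dimensional features.
euclidean = lambda q, p: np.sqrt(np.sum((q - p) ** 2))
refs = np.random.default_rng(1).random((100, 8))
query = np.random.default_rng(2).random(8)
print(retrieve_similar(query, refs, euclidean, k=5))
```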

In this study, 39 distance similarity measures were applied to retrieve masses similar to the query masses, and the performance of each similarity measure was evaluated. The 39 similarity measures were classified into nine categories according to the similarity of their notation. The nine families are Mahalanobis (\(F_{0}\), one measure), the Minkowski family (\(F_{1}\), three measures), the \(L_{1}\) family (\(F_{2}\), six measures), the Intersection family (\(F_{3}\), seven measures), the Inner Product family (\(F_{4}\), four measures), the Fidelity or Squared-chord family (\(F_{5}\), three measures), the Squared \(L_{2}\) or \(\chi^{2}\) family (\(F_{6}\), eight measures), Shannon's entropy family (\(F_{7}\), four measures), and the Combinations family (\(F_{8}\), three measures).

Mahalanobis Distance (\(F_{0}\))

The Mahalanobis distance measures the degree of similarity while considering the probability distribution of the data. It corresponds to the Euclidean distance normalized by a covariance matrix [9]. Because it takes correlations among the features into account, the Mahalanobis distance usually performs better than the Euclidean distance [10]. It can be calculated as follows,

$$D_{Mah} = \sqrt {\left( {Q - P\left( {r_{j} } \right)} \right)\Sigma^{ - 1} \left( {Q - P\left( {r_{j} } \right)} \right)^{T} }$$
(1)

where \(Q\) is the feature vector of the query mass, \(P(r_{j})\) is the feature vector of the \(j\)th reference mass, and \(\Sigma\) is the covariance matrix.
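A minimal numpy sketch of Eq. (1), assuming the covariance matrix is estimated from the reference library's feature vectors (the reference data here are hypothetical):

```python
import numpy as np

def mahalanobis(q, p, cov_inv):
    """Mahalanobis distance of Eq. (1); `cov_inv` is the inverse covariance."""
    diff = q - p
    return np.sqrt(diff @ cov_inv @ diff)

# Estimate the covariance from a hypothetical (100 masses x 8 features) library.
refs = np.random.default_rng(0).random((100, 8))
cov_inv = np.linalg.inv(np.cov(refs, rowvar=False))
print(mahalanobis(refs[0], refs[1], cov_inv))
```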

Minkowski Family (\(F_{1}\))

The Minkowski distance is a general metric [9] that generalizes both the Euclidean distance and the City block distance. It can be defined as shown below.

$$D_{Mk} = \sqrt[L]{{\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|^{L} }}$$
(2)

In this equation, \(q_{i}\) is the \(i\)th feature of the query mass, \(p_{i} \left( {r_{j} } \right)\) is the \(i\)th feature of \(r_{j}\), \(r_{j}\) is the \(j\)th reference mass, and \(d\) is the dimension of the feature space.

The Minkowski family includes the Euclidean, City block, and Chebyshev distances. The Euclidean distance is the measure most commonly used to find distances between feature vectors in a multidimensional space [11]. It is the special case of the Minkowski distance with \(L = 2\) and is expressed as Eq. (3). The City block distance is the sum of the absolute differences between feature vectors [9]; it corresponds to \(L = 1\) in Eq. (2) and is calculated as Eq. (4). The Chebyshev distance considers only the dimension in which the two feature vectors differ most [9]; in the limit \(L = \infty\) of Eq. (2), it can be expressed as Eq. (5).

$$D_{Eu} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }$$
(3)
$$D_{City} = \mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(4)
$$D_{Cheb} = \mathop {\hbox{max} }\limits_{i} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(5)
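A minimal sketch of Eqs. (2)–(5) with hypothetical feature vectors; the Euclidean and City block results match the general Minkowski form for \(L = 2\) and \(L = 1\):

```python
import numpy as np

def minkowski(q, p, L):
    """General Minkowski distance, Eq. (2)."""
    return np.sum(np.abs(q - p) ** L) ** (1.0 / L)

def euclidean(q, p):    # L = 2, Eq. (3)
    return np.sqrt(np.sum((q - p) ** 2))

def city_block(q, p):   # L = 1, Eq. (4)
    return np.sum(np.abs(q - p))

def chebyshev(q, p):    # L -> infinity, Eq. (5): largest per-dimension gap
    return np.max(np.abs(q - p))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(minkowski(q, p, 2), euclidean(q, p))   # identical values
print(minkowski(q, p, 1), city_block(q, p))  # identical values
print(chebyshev(q, p))
```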

\(L_{1}\) Family (\(F_{2}\))

The \(L_{1}\) family extends the City block distance and is based on the sum of the absolute differences between two feature vectors. We apply six distance similarity measures in this family—Sorensen, Gower, Soergel, Kulczynski, Canberra, and Lorentzian—which can be calculated from Eqs. (6)–(11), respectively.

$$D_{Sen} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(6)
$$D_{Gow} = \frac{1}{d}\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(7)
$$D_{Sgel} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(8)
$$D_{Kld} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(9)
$$D_{Can} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(10)
$$D_{Lor} = \mathop \sum \limits_{i = 1}^{d} \ln \left( {1 + \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|} \right)$$
(11)

The Gower distance in Eq. (7) scales the vector space into a normalized space and then uses the absolute difference [12]. In the Sorensen and Canberra distances, the numerator expresses the difference and the denominator normalizes it [9]: Canberra divides the absolute difference by the sum of the feature values in each dimension, whereas Sorensen normalizes by the sum over all dimensions. The Lorentzian distance, Eq. (11), applies a natural logarithm to the absolute difference between the feature vectors; here, 1 is added to ensure non-negativity and to avoid the logarithm of zero [12].
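Three of these measures in a short numpy sketch (the sample vectors are hypothetical; strictly positive features are assumed so the Canberra denominator never vanishes):

```python
import numpy as np

def sorensen(q, p):    # Eq. (6): L1 difference normalized by the total sum
    return np.sum(np.abs(q - p)) / np.sum(q + p)

def canberra(q, p):    # Eq. (10): difference normalized in each dimension
    return np.sum(np.abs(q - p) / (q + p))

def lorentzian(q, p):  # Eq. (11): log-compressed absolute difference
    return np.sum(np.log(1.0 + np.abs(q - p)))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(sorensen(q, p), canberra(q, p), lorentzian(q, p))
```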

Intersection Family (\(F_{3}\))

The intersection between feature vectors is a widely used form of similarity [12]. An intersection similarity \(s\) can be converted into a distance via \(d(q,p) = 1 - s(q,p)\) or \(d(q,p) = 1/s(q,p)\), and many of the resulting distances reduce to \(L_{1}\)-based forms. This family includes the Intersection, Wave Hedges, Czekanowski, Motyka, Kulczynski, Ruzicka, and Tanimoto measures, which are expressed as Eqs. (12)–(18), respectively.

$$\begin{aligned} D_{Its} & = 1 - s_{Its} = 1 - \mathop \sum \limits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right) \\ & = \frac{1}{2}\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right| \\ \end{aligned}$$
(12)
$$D_{WH} = \mathop \sum \limits_{i = 1}^{d} \left( {1 - \frac{{\min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}} \right)$$
(13)
$$s_{Cze} = \frac{{2\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(14)
$$D_{Mot} = 1 - s_{Mot} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(15)
$$s_{Kls} = \frac{1}{{D_{Kld} }} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}$$
(16)
$$s_{Ruz} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(17)
$$D_{Ta} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right) - 2\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right) - \mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(18)
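As a numerical check on two of these measures (with hypothetical vectors), note that the Tanimoto distance of Eq. (18) equals one minus the Ruzicka similarity of Eq. (17), since the denominator of Eq. (18) is the sum of the element-wise maxima:

```python
import numpy as np

def ruzicka(q, p):   # similarity, Eq. (17)
    return np.sum(np.minimum(q, p)) / np.sum(np.maximum(q, p))

def tanimoto(q, p):  # distance, Eq. (18)
    s_min = np.sum(np.minimum(q, p))
    total = np.sum(q) + np.sum(p)
    return (total - 2 * s_min) / (total - s_min)

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(tanimoto(q, p), 1 - ruzicka(q, p))  # identical values
```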

Inner Product Family (\(F_{4}\))

The Inner Product family comprises measures whose definitions explicitly include the inner product ‘\(Q \cdot P\)’ [12]. We apply four measures—Cosine, Kumar-Hassebrook (PCE), Jaccard, and Dice—for this family, as expressed by Eqs. (19)–(22) below.

$$s_{Cos} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} } }}$$
(19)
$$s_{PCE} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} - \mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}$$
(20)
$$D_{Jac} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} - \mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}$$
(21)
$$s_{Dice} = \frac{{2\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} }}$$
(22)
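A short sketch of three Inner Product measures on hypothetical vectors; each is built from the same dot products:

```python
import numpy as np

def cosine_sim(q, p):    # Eq. (19)
    return np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))

def jaccard_dist(q, p):  # Eq. (21)
    return np.sum((q - p) ** 2) / (np.dot(q, q) + np.dot(p, p) - np.dot(q, p))

def dice_sim(q, p):      # Eq. (22)
    return 2 * np.dot(q, p) / (np.dot(q, q) + np.dot(p, p))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(cosine_sim(q, p), jaccard_dist(q, p), dice_sim(q, p))
```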

Fidelity Family or Squared-Chord Family (\(F_{5}\))

The Fidelity similarity is the sum of geometric means and is defined in Eq. (23) [12]. This family includes the Hellinger and Matusita distances; the squared-chord distance is the Matusita distance without the square root. All Fidelity-based measures have alternative representations in terms of the squared-chord distance. Equations (24), (25), and (26) give the Hellinger, Matusita, and squared-chord measures, respectively.

$$s_{Fid} = \mathop \sum \limits_{i = 1}^{d} \sqrt {q_{i} p_{i} \left( {r_{j} } \right)}$$
(23)
$$D_{Hel} = 2\sqrt {1 - \mathop \sum \limits_{i = 1}^{d} \sqrt {q_{i} p_{i} \left( {r_{j} } \right)} }$$
(24)
$$D_{Mat} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {\sqrt {q_{i} } - \sqrt {p_{i} \left( {r_{j} } \right)} } \right)^{2} }$$
(25)
$$D_{Scho} = \mathop \sum \limits_{i = 1}^{d} \left( {\sqrt {q_{i} } - \sqrt {p_{i} \left( {r_{j} } \right)} } \right)^{2}$$
(26)
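A brief sketch of Eqs. (23)–(26) with hypothetical probability-like vectors (non-negative and summing to one, so that \(1 - s_{Fid} \ge 0\)); the Matusita distance is recovered as the square root of the squared-chord distance:

```python
import numpy as np

def fidelity(q, p):       # similarity, Eq. (23)
    return np.sum(np.sqrt(q * p))

def hellinger(q, p):      # Eq. (24)
    return 2 * np.sqrt(1 - fidelity(q, p))

def squared_chord(q, p):  # Eq. (26)
    return np.sum((np.sqrt(q) - np.sqrt(p)) ** 2)

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
matusita = np.sqrt(squared_chord(q, p))  # Eq. (25)
print(fidelity(q, p), hellinger(q, p), squared_chord(q, p), matusita)
```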

Squared \(L_{2}\) Family or \(\chi^{2}\) Family (\(F_{6}\))

This family comprises eight similarity measures based on the squared Euclidean distance, which is defined in Eq. (27) [12]. The eight measures applied here are the Squared Euclidean, Pearson \(\chi^{2}\), Neyman \(\chi^{2}\), Squared \(\chi^{2}\), Probabilistic Symmetric \(\chi^{2}\), Divergence, Clark, and Additive Symmetric \(\chi^{2}\) measures, calculated by Eqs. (27)–(34), respectively.

$$D_{SEu} = \mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2}$$
(27)
$$D_{Pea} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{p_{i} \left( {r_{j} } \right)}}$$
(28)
$$D_{Ney} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} }}$$
(29)
$$D_{Squ} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(30)
$$D_{PSy} = 2\mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(31)
$$D_{Div} = 2\mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{\left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)^{2} }}$$
(32)
$$D_{Clk} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {\frac{{\left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right)^{2} }$$
(33)
$$D_{Add} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}{{q_{i} p_{i} \left( {r_{j} } \right)}}$$
(34)
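Three representative members of this family in a minimal sketch (hypothetical, strictly positive vectors are assumed so the denominators are defined):

```python
import numpy as np

def pearson_chi2(q, p):  # Eq. (28): squared difference normalized by the reference
    return np.sum((q - p) ** 2 / p)

def squared_chi2(q, p):  # Eq. (30): symmetric in q and p
    return np.sum((q - p) ** 2 / (q + p))

def clark(q, p):         # Eq. (33)
    return np.sqrt(np.sum((np.abs(q - p) / (q + p)) ** 2))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(pearson_chi2(q, p), squared_chi2(q, p), clark(q, p))
```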

Shannon’s Entropy Family (\(F_{7}\))

In this family, we apply four similarity measures based on probabilistic uncertainty, that is, entropy [12]. The four measures are the Jeffreys, Topsoe, Jensen-Shannon, and Jensen difference measures, expressed in terms of the entropy as Eqs. (35)–(38), respectively.

$$D_{Jef} = \mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)\ln \frac{{q_{i} }}{{p_{i} \left( {r_{j} } \right)}}$$
(35)
$$\begin{aligned} D_{Tsoe} = \mathop \sum \limits_{i = 1}^{d} q_{i} \ln \frac{{2q_{i} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}} \hfill \\ + \mathop \sum \limits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)\ln \frac{{2p_{i} \left( {r_{j} } \right)}}{{q_{i} + p_{i} \left( {r_{j} } \right)}} \hfill \\ \end{aligned}$$
(36)
$$\begin{aligned} D_{JSh} = \frac{1}{2}\left[ {\mathop \sum \limits_{i = 1}^{d} q_{i} \ln \frac{{2q_{i} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right. \hfill \\ \left. { + \mathop \sum \limits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)\ln \frac{{2p_{i} \left( {r_{j} } \right)}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right] \hfill \\ \end{aligned}$$
(37)
$$\begin{aligned} D_{Jdiff} = \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{q_{i} \ln q_{i} + p_{i} \left( {r_{j} } \right)\ln p_{i} \left( {r_{j} } \right)}}{2}} \right) \hfill \\ \quad\quad - \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}\ln \frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}} \right) \hfill \\ \end{aligned}$$
(38)
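A minimal sketch of two of these entropy-based measures (hypothetical, strictly positive vectors so the logarithms are defined):

```python
import numpy as np

def jeffreys(q, p):        # Eq. (35): a symmetrized Kullback-Leibler divergence
    return np.sum((q - p) * np.log(q / p))

def jensen_shannon(q, p):  # Eq. (37): half the Topsoe distance of Eq. (36)
    m = q + p
    return 0.5 * (np.sum(q * np.log(2 * q / m)) +
                  np.sum(p * np.log(2 * p / m)))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(jeffreys(q, p), jensen_shannon(q, p))
```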

Combinations (\(F_{8}\))

This family contains three similarity measures that combine multiple approaches [12]. First, the Taneja distance uses both arithmetic and geometric means, as expressed in Eq. (39). Second, Kumar-Johnson uses the arithmetic and geometric mean divergence, as expressed in Eq. (40). Finally, we applied the average of the City block and Chebyshev distances from the Minkowski family (\(F_{1}\)), which is calculated by Eq. (41).

$$D_{Taj} = \mathop \sum \limits_{i = 1}^{d} \frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}\ln \left( {\frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{{2\sqrt {q_{i} p_{i} \left( {r_{j} } \right)} }}} \right)$$
(39)
$$D_{KuJ} = \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{\left( {q_{i}^{2} - p_{i} \left( {r_{j} } \right)^{2} } \right)^{2} }}{{2\left( {q_{i} p_{i} \left( {r_{j} } \right)} \right)^{3/2} }}} \right)$$
(40)
$$D_{Avg} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right| + \mathop {\hbox{max} }\limits_{i} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{2}$$
(41)
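The three Combinations measures in a short sketch with hypothetical, strictly positive vectors:

```python
import numpy as np

def taneja(q, p):              # Eq. (39): arithmetic mean vs. geometric mean
    m = (q + p) / 2
    return np.sum(m * np.log(m / np.sqrt(q * p)))

def kumar_johnson(q, p):       # Eq. (40)
    return np.sum((q ** 2 - p ** 2) ** 2 / (2 * (q * p) ** 1.5))

def avg_city_chebyshev(q, p):  # Eq. (41): mean of City block and Chebyshev
    return (np.sum(np.abs(q - p)) + np.max(np.abs(q - p))) / 2

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(taneja(q, p), kumar_johnson(q, p), avg_city_chebyshev(q, p))
```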

In this study, we applied 39 distance similarity measures to retrieve masses similar to query masses in the CADx system. According to the notation similarity of each measure, the 39 similarity measures were classified into nine categories. The \(F_{0}\) family measures the similarity considering the probability distribution of the data, the \(F_{1}\) family is a general metric distance, and the \(F_{2}\) family uses absolute differences. The similarity measures in the \(F_{3}\) family are a widely used form in which the intersection between feature vectors is used. The \(F_{4}\) family uses the inner product between the feature vectors and the \(F_{5}\) family is a method that uses the sum of the geometric means. Finally, there is the \(F_{6}\) family, which uses the squared Euclidean distance, the \(F_{7}\) family which uses the probabilistic uncertainty or entropy, and the \(F_{8}\) family which uses multiple measures. Each has been described in depth [12].

Results and Discussion

To assist radiologists with breast cancer diagnoses, a CADx system for breast ultrasound images was studied. We applied 39 distance-based similarity measures, and the performance of the CADx system with each measure was analyzed and evaluated.

To design the CADx system for breast ultrasound images, breast masses in ultrasound images were classified into a training set and a test set through a cross-validation method. The 39 similarity measures were categorized into nine families based on the notation similarity in each case.

When \(S_{1}\) is used as the training data and \(S_{2}\) as the test data, the performance of the CADx system with each similarity measure is presented in Table 1 and Fig. 4. Table 2 and Fig. 5 show the performance when \(S_{2}\) is used as the training data and \(S_{1}\) as the test data. Here, \(k\) is the number of retrieved masses, that is, the number of reference breast ultrasound images presented to assist radiologists with their diagnoses. Tables 1 and 2 give the numerical performance of the 39 distance similarity measures, and Figs. 4 and 5 plot the performance of the nine families.

Table 1 The performances of the 39 similarity measures (training set S1, Test set S2)
Fig. 4

The performances of the nine similarity families (Training set S1, Test set S2)

Table 2 The performances of the 39 similarity measures (training set S2, Test set S1)
Fig. 5

The performances of the nine similarity families (Training set S2, Test set S1)

When using \(S_{1}\) as the training data and \(S_{2}\) as the test data, the performances of all similarity measures are similar. However, the performance of the \(F_{0}\) family (the Mahalanobis distance) is slightly better than those of the others when the number of retrieved masses is increased (i.e., k = 25–50). In addition, the average performance (i.e., k = 1–50) of the \(F_{0}\) family is better than those of the other similarity families (Table 1). The performance of each family is shown in Fig. 4, which indicates that the \(F_{0}\) family is superior to the other families for a large number of top retrieved masses (k > 25). When using \(S_{2}\) as the training data and \(S_{1}\) as the test data, the results are similar, as shown in Table 2 and Fig. 5.

In this study, we compared the performance of the CADx system with each of the 39 similarity measures. The performance of the \(F_{0}\) family exceeded those of the other similarity families in both experiments for a large number of top retrieved masses. The \(F_{0}\) family computes a covariance-based distance that takes the distribution of the given data into account, whereas the other similarity families consider only the distance between two feature vectors. Therefore, the \(F_{0}\) family (the Mahalanobis distance) outperforms the other similarity measures on average.

Conclusion

In this paper, a CADx system for breast ultrasound images was devised to assist radiologists in differentiating benign and malignant masses on breast ultrasound images. To design the CADx system, morphological and texture features were extracted from a database, and the feature vectors of the breast masses were classified into a training set and a test set through a cross-validation method. Using a k-nearest neighbor (k-NN) method, we applied 39 distance similarity measures, classified into nine similarity families (\(F_{0}\)–\(F_{8}\)) according to the notation similarity of each measure, to retrieve masses similar to a query mass in a reference library, and we compared the performance of the CADx system through an ROC analysis.

It was found that the performances of the individual similarity measures did not differ significantly. However, the performance of the \(F_{0}\) family was better than those of the other families when the number of retrieved masses was increased (k ≥ 25). When the number of retrieved masses is low, the probability distribution cannot be estimated reliably because there are too few data instances to consider. When the number of retrieved masses is increased, the probability distribution of the data can be fully taken into account. Therefore, the performance of the CADx system using the Mahalanobis distance (the \(F_{0}\) family), which considers the probability distribution of the data, is superior to those of the other families.

For a larger number (k > 10) of top retrieved masses, the classification performance of all similarity measures leveled off. The relationship between the usefulness of the retrieved masses as references for radiologists and the accuracy of estimating the likelihood of malignancy of the query mass warrants further investigation.

Future work includes applying the CBIR CADx system to a larger, independent dataset, expanding the feature space, and combining the developed ultrasound image characterization method with a mammographic characterization method. The effects of different CBIR CADx systems on the characterization of breast masses by a support vector machine (SVM) will also be evaluated.