Introduction

Breast cancer is one of the most prevalent causes of death among women worldwide. Figure 1 shows the estimated numbers of new cancer cases in 2017 in Korea and the United States [1, 2], confirming that the incidence of breast cancer is among the highest of all cancers in both countries. Early detection and treatment through accurate screening and diagnosis are the most effective ways to reduce the occurrence of breast cancer and the associated mortality rate.

Fig. 1

Most common types of estimated new cancer cases, 2017: a in Korea, b in the United States

The variety of medical images has grown with advances in imaging technology, and diagnostic results can vary depending on the experience of the radiologist. Computer-aided diagnosis (CADx) systems have been studied to provide quantitative information useful for diagnosis and to prevent misdiagnoses caused by incorrect interpretations or subjective judgments [3]. A CADx system specialized for breast ultrasound images can assist the radiologist by analyzing numerous forms of data and providing quantitative information.

The importance of research on ultrasound imaging is increasing in light of findings that the X-rays used for breast cancer screening can themselves contribute to the occurrence of breast cancer [4]. Ultrasound imaging is also a useful diagnostic method for distinguishing between malignant and benign masses [5]. Generally, the margin of a malignant mass is not clear in breast ultrasound images, whereas the margin of a benign mass is smooth and well defined. Figures 2 and 3 depict malignant and benign masses, respectively, as imaged by a breast ultrasound system.

Fig. 2

Malignant masses in breast ultrasound images

Fig. 3

Benign masses in breast ultrasound images

Research on CADx systems for breast ultrasound images is underway to assist radiologists by providing quantitative information about breast masses. After morphological and texture features are extracted from breast masses in ultrasound images, distance similarity measures are used to retrieve masses similar to a query mass from a reference library. Similarity measures are also broadly useful in pattern recognition, for example in classification, clustering, and retrieval problems. In this work, various similarity distance measures applicable to characterizing malignant and benign masses are reviewed and categorized by both their syntactic (notational) and semantic relationships. The purpose of this study is to design a CADx system for breast ultrasound images that improves the accuracy of breast cancer diagnoses using various similarity distance measures.

Methods

Data Set

In this study, we used records of patients who underwent breast imaging in the Department of Radiology at the University of Michigan to design the CADx system for breast ultrasound images. The use of the breast ultrasound data was approved by the Institutional Review Board (IRB), and all data were pathologically verified by biopsies [6].

In total, data from 250 patients with 96 malignant and 154 benign masses were used [6]. The radiologist selected two or more orthogonal ultrasound images that best represented each mass in the collected data; however, only a single ultrasound image was selected for certain masses not visible in orthogonal views. The collected data were randomly divided into two sets: Set 1 (\(S_{1}\)) includes a total of 230 ultrasound images with 41 malignant and 80 benign masses, and Set 2 (\(S_{2}\)) includes a total of 258 ultrasound images with 55 malignant and 74 benign masses.

Feature Extraction and Selection

To design the CADx system for breast ultrasound images, we segmented the breast masses in ultrasound images with an automated method using a previously designed active contour model [7]. The contour of the mass is automatically estimated from its center using this model. Morphological and texture features of the segmented masses were extracted, after which the features were used for characterization of a query mass for the CADx system.

We extracted two morphological features and six texture features based on the automated segmentation in the design of the CADx system [6]. The morphological features represent the shape of the mass, such as its size or appearance, and the texture features represent the texture shown in the ultrasound image. In breast ultrasound images, a taller-than-wide shape is a good indicator of malignancy; thus, the width-to-height ratio of the mass was extracted as a morphological feature. Another useful feature for differentiating between malignant and benign masses is the posterior shadowing feature, defined as the normalized average gray-level difference between the interior of the segmented mass and the darkest posterior strip [6]. The texture features, extracted from spatial gray-level dependence (co-occurrence) matrices, are the information measures of correlation 1 and 2, the difference entropy, entropy, energy, and sum entropy [8]. Here, entropy is an indicator of uncertainty: a larger value indicates that the characteristics of the data are uncertain, whereas a smaller value means that they are biased toward one side. Conversely, energy is an indicator of uniformity.
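As an illustration of how such co-occurrence texture features can be computed, the following minimal sketch uses scikit-image's graycomatrix; the quantization level, offset distance, and angle are illustrative assumptions, not the parameters used in the original study [6, 8].

```python
import numpy as np
from skimage.feature import graycomatrix

def glcm_entropy_energy(roi, levels=32, distance=1, angle=0.0):
    """Entropy and energy of a gray-level co-occurrence matrix (GLCM).

    The quantization level, offset, and angle are illustrative choices,
    not the settings of the original CADx system.
    """
    # Quantize the 8-bit ROI to `levels` gray levels to keep the GLCM small.
    quantized = (roi.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(quantized, distances=[distance], angles=[angle],
                        levels=levels, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]                 # normalized co-occurrence probabilities
    nz = p[p > 0]                        # skip zero entries to avoid log(0)
    entropy = -np.sum(nz * np.log2(nz))  # larger value -> more uncertain texture
    energy = np.sum(p ** 2)              # larger value -> more uniform texture
    return entropy, energy

# Example on a random 64x64 "ROI" standing in for a segmented mass region.
roi = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
print(glcm_entropy_energy(roi))
```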

To design the CADx system, the feature vectors of the selected masses were classified into training sets and test sets through a cross-validation method.

Distance Similarity Measures

To retrieve masses similar to the query mass, the same feature vectors extracted from the query mass are also extracted from every mass in the reference library. After the similarity between the feature vector of the query mass and each feature vector in the reference library is computed, the masses most similar to the query mass are retrieved from the library. The performance of the CADx system was evaluated by receiver operating characteristic (ROC) analysis.
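The original implementation is not published, so the following is a minimal sketch of this retrieval step with hypothetical names and array shapes; any of the distance measures defined below can be supplied as `distance_fn`.

```python
import numpy as np

def retrieve_similar(query, reference_features, distance_fn, k=5):
    """Return the indices of the k reference masses closest to the query.

    `reference_features` is an (n, d) array holding one feature vector per
    reference mass; `distance_fn(q, p)` is any distance measure defined below.
    """
    dists = np.array([distance_fn(query, p) for p in reference_features])
    return np.argsort(dists)[:k]  # indices of the k smallest distances

# Example with the Euclidean distance on hypothetical 8-dimensional features.
euclidean = lambda q, p: np.sqrt(np.sum((q - p) ** 2))
refs = np.random.default_rng(1).random((100, 8))
query = np.random.default_rng(2).random(8)
print(retrieve_similar(query, refs, euclidean, k=5))
```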

In this study, 39 distance similarity measures were applied to retrieve masses similar to the query masses, and the performance of each similarity measure was evaluated. The 39 similarity measures were classified into nine categories according to the similarity of their notation. The nine families are Mahalanobis (\(F_{0}\), one measure), the Minkowski family (\(F_{1}\), three measures), the \(L_{1}\) family (\(F_{2}\), six measures), the Intersection family (\(F_{3}\), seven measures), the Inner Product family (\(F_{4}\), four measures), the Fidelity or Squared-chord family (\(F_{5}\), three measures), the Squared \(L_{2}\) or \(\chi^{2}\) family (\(F_{6}\), eight measures), Shannon's entropy family (\(F_{7}\), four measures), and the Combinations family (\(F_{8}\), three measures).

Mahalanobis Distance (\(F_{0}\))

The Mahalanobis distance measures the degree of similarity while considering the probability distribution of the data. It corresponds to the Euclidean distance normalized by a covariance matrix [9]. Because it takes correlations among the features into account, the Mahalanobis distance usually performs better than the Euclidean distance [10]. It can be calculated as follows,

$$D_{Mah} = \sqrt {\left( {Q - P\left( {r_{j} } \right)} \right)\Sigma^{ - 1} \left( {Q - P\left( {r_{j} } \right)} \right)^{T} }$$
(1)

where \(Q\) is the feature vector of the query mass, \(P(r_{j})\) is the feature vector of the \(j\)th reference mass, and \(\Sigma\) is the covariance matrix.
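A minimal numpy sketch of Eq. (1), assuming the covariance matrix is estimated from the reference library's feature vectors (the reference data here are hypothetical):

```python
import numpy as np

def mahalanobis(q, p, cov_inv):
    """Mahalanobis distance of Eq. (1); `cov_inv` is the inverse covariance."""
    diff = q - p
    return np.sqrt(diff @ cov_inv @ diff)

# Estimate the covariance from a hypothetical (100 masses x 8 features) library.
refs = np.random.default_rng(0).random((100, 8))
cov_inv = np.linalg.inv(np.cov(refs, rowvar=False))
print(mahalanobis(refs[0], refs[1], cov_inv))
```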

Minkowski Family (\(F_{1}\))

The Minkowski distance is a general metric [9] that generalizes both the Euclidean distance and the City block distance. It can be defined as shown below.

$$D_{Mk} = \sqrt[L]{{\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|^{L} }}$$
(2)

In this equation, \(q_{i}\) is the \(i\)th feature of the query mass, \(p_{i} \left( {r_{j} } \right)\) is the \(i\)th feature of \(r_{j}\), \(r_{j}\) is the \(j\)th reference mass, and \(d\) is the dimension of the feature space.

The Minkowski family includes the Euclidean, City block, and Chebyshev distances. The Euclidean distance is the measure most commonly used to find distances between feature vectors in a multidimensional space [11]. It is the special case of the Minkowski distance with \(L = 2\) and is expressed as Eq. (3). The City block distance is the sum of the absolute differences between feature vectors [9]; it corresponds to \(L = 1\) in Eq. (2) and is calculated as Eq. (4). The Chebyshev distance considers only the dimension in which the two feature vectors differ most [9]; in the limit \(L = \infty\) of Eq. (2), it can be expressed as Eq. (5).

$$D_{Eu} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }$$
(3)
$$D_{City} = \mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(4)
$$D_{Cheb} = \mathop {\hbox{max} }\limits_{i} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(5)
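A minimal sketch of Eqs. (2)–(5) with hypothetical feature vectors; the Euclidean and City block results match the general Minkowski form for \(L = 2\) and \(L = 1\):

```python
import numpy as np

def minkowski(q, p, L):
    """General Minkowski distance, Eq. (2)."""
    return np.sum(np.abs(q - p) ** L) ** (1.0 / L)

def euclidean(q, p):    # L = 2, Eq. (3)
    return np.sqrt(np.sum((q - p) ** 2))

def city_block(q, p):   # L = 1, Eq. (4)
    return np.sum(np.abs(q - p))

def chebyshev(q, p):    # L -> infinity, Eq. (5): largest per-dimension gap
    return np.max(np.abs(q - p))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(minkowski(q, p, 2), euclidean(q, p))   # identical values
print(minkowski(q, p, 1), city_block(q, p))  # identical values
print(chebyshev(q, p))
```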

\(L_{1}\) Family (\(F_{2}\))

The \(L_{1}\) family extends the City block distance and is based on the sum of the absolute differences between two feature vectors. We apply six distance similarity measures in this family—Sorensen, Gower, Soergel, Kulczynski, Canberra, and Lorentzian—which can be calculated from Eqs. (6)–(11), respectively.

$$D_{Sen} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(6)
$$D_{Gow} = \frac{1}{d}\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|$$
(7)
$$D_{Sgel} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(8)
$$D_{Kld} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(9)
$$D_{Can} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(10)
$$D_{Lor} = \mathop \sum \limits_{i = 1}^{d} \ln \left( {1 + \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|} \right)$$
(11)

The Gower distance in Eq. (7) scales the vector space into a normalized space and then uses the absolute difference [12]. In the Sorensen and Canberra distances, the numerator expresses the difference and the denominator normalizes it [9]: Canberra divides the absolute difference by the sum of the feature values in each dimension, whereas Sorensen normalizes by the sum over all dimensions. The Lorentzian distance, Eq. (11), applies a natural logarithm to the absolute difference between the feature vectors; here, 1 is added to ensure non-negativity and to avoid the logarithm of zero [12].
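Three of these measures in a short numpy sketch (the sample vectors are hypothetical; strictly positive features are assumed so the Canberra denominator never vanishes):

```python
import numpy as np

def sorensen(q, p):    # Eq. (6): L1 difference normalized by the total sum
    return np.sum(np.abs(q - p)) / np.sum(q + p)

def canberra(q, p):    # Eq. (10): difference normalized in each dimension
    return np.sum(np.abs(q - p) / (q + p))

def lorentzian(q, p):  # Eq. (11): log-compressed absolute difference
    return np.sum(np.log(1.0 + np.abs(q - p)))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(sorensen(q, p), canberra(q, p), lorentzian(q, p))
```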

Intersection Family (\(F_{3}\))

The intersection between feature vectors is a widely used form of similarity [12]. An intersection similarity \(s\) can be converted into a distance via \(d(q,p) = 1 - s(q,p)\) or \(d(q,p) = 1/s(q,p)\), and many of the resulting distances reduce to \(L_{1}\)-based forms. This family includes the Intersection, Wave Hedges, Czekanowski, Motyka, Kulczynski, Ruzicka, and Tanimoto measures, which are expressed as Eqs. (12)–(18), respectively.

$$\begin{aligned} D_{Its} & = 1 - s_{Its} = 1 - \mathop \sum \limits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right) \\ & = \frac{1}{2}\mathop \sum \limits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right| \\ \end{aligned}$$
(12)
$$D_{WH} = \mathop \sum \limits_{i = 1}^{d} \left( {1 - \frac{{\min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}} \right)$$
(13)
$$s_{Cze} = \frac{{2\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(14)
$$D_{Mot} = 1 - s_{Mot} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}$$
(15)
$$s_{Kls} = \frac{1}{{D_{Kld} }} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}$$
(16)
$$s_{Ruz} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} \max \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(17)
$$D_{Ta} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right) - 2\mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right) - \mathop \sum \nolimits_{i = 1}^{d} \min \left( {q_{i} ,p_{i} \left( {r_{j} } \right)} \right)}}$$
(18)
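As a numerical check on two of these measures (with hypothetical vectors), note that the Tanimoto distance of Eq. (18) equals one minus the Ruzicka similarity of Eq. (17), since the denominator of Eq. (18) is the sum of the element-wise maxima:

```python
import numpy as np

def ruzicka(q, p):   # similarity, Eq. (17)
    return np.sum(np.minimum(q, p)) / np.sum(np.maximum(q, p))

def tanimoto(q, p):  # distance, Eq. (18)
    s_min = np.sum(np.minimum(q, p))
    total = np.sum(q) + np.sum(p)
    return (total - 2 * s_min) / (total - s_min)

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(tanimoto(q, p), 1 - ruzicka(q, p))  # identical values
```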

Inner Product Family (\(F_{4}\))

The Inner Product family comprises measures whose definitions explicitly include the inner product ‘\(Q \cdot P\)’ [12]. We apply four measures—Cosine, Kumar-Hassebrook (PCE), Jaccard, and Dice—for this family, as expressed by Eqs. (19)–(22) below.

$$s_{Cos} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} } }}$$
(19)
$$s_{PCE} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} - \mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}$$
(20)
$$D_{Jac} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} - \mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}$$
(21)
$$s_{Dice} = \frac{{2\mathop \sum \nolimits_{i = 1}^{d} q_{i} p_{i} \left( {r_{j} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{d} q_{i}^{2} + \mathop \sum \nolimits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)^{2} }}$$
(22)
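A short sketch of three Inner Product measures on hypothetical vectors; each is built from the same dot products:

```python
import numpy as np

def cosine_sim(q, p):    # Eq. (19)
    return np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))

def jaccard_dist(q, p):  # Eq. (21)
    return np.sum((q - p) ** 2) / (np.dot(q, q) + np.dot(p, p) - np.dot(q, p))

def dice_sim(q, p):      # Eq. (22)
    return 2 * np.dot(q, p) / (np.dot(q, q) + np.dot(p, p))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(cosine_sim(q, p), jaccard_dist(q, p), dice_sim(q, p))
```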

Fidelity Family or Squared-Chord Family (\(F_{5}\))

The Fidelity similarity is the sum of geometric means and is defined in Eq. (23) [12]. This family includes the Hellinger and Matusita distances; the squared-chord distance is the Matusita distance without the square root. All Fidelity-based measures have alternative representations in terms of the squared-chord distance. Equations (24), (25), and (26) give the Hellinger, Matusita, and squared-chord measures, respectively.

$$s_{Fid} = \mathop \sum \limits_{i = 1}^{d} \sqrt {q_{i} p_{i} \left( {r_{j} } \right)}$$
(23)
$$D_{Hel} = 2\sqrt {1 - \mathop \sum \limits_{i = 1}^{d} \sqrt {q_{i} p_{i} \left( {r_{j} } \right)} }$$
(24)
$$D_{Mat} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {\sqrt {q_{i} } - \sqrt {p_{i} \left( {r_{j} } \right)} } \right)^{2} }$$
(25)
$$D_{Scho} = \mathop \sum \limits_{i = 1}^{d} \left( {\sqrt {q_{i} } - \sqrt {p_{i} \left( {r_{j} } \right)} } \right)^{2}$$
(26)
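A brief sketch of Eqs. (23)–(26) with hypothetical probability-like vectors (non-negative and summing to one, so that \(1 - s_{Fid} \ge 0\)); the Matusita distance is recovered as the square root of the squared-chord distance:

```python
import numpy as np

def fidelity(q, p):       # similarity, Eq. (23)
    return np.sum(np.sqrt(q * p))

def hellinger(q, p):      # Eq. (24)
    return 2 * np.sqrt(1 - fidelity(q, p))

def squared_chord(q, p):  # Eq. (26)
    return np.sum((np.sqrt(q) - np.sqrt(p)) ** 2)

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
matusita = np.sqrt(squared_chord(q, p))  # Eq. (25)
print(fidelity(q, p), hellinger(q, p), squared_chord(q, p), matusita)
```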

Squared \(L_{2}\) Family or \(\chi^{2}\) Family (\(F_{6}\))

This family comprises eight similarity measures based on the squared Euclidean distance, which is defined in Eq. (27) [12]. The eight measures applied here are the Squared Euclidean, Pearson \(\chi^{2}\), Neyman \(\chi^{2}\), Squared \(\chi^{2}\), Probabilistic Symmetric \(\chi^{2}\), Divergence, Clark, and Additive Symmetric \(\chi^{2}\) measures, calculated by Eqs. (27)–(34), respectively.

$$D_{SEu} = \mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2}$$
(27)
$$D_{Pea} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{p_{i} \left( {r_{j} } \right)}}$$
(28)
$$D_{Ney} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} }}$$
(29)
$$D_{Squ} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(30)
$$D_{PSy} = 2\mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}$$
(31)
$$D_{Div} = 2\mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} }}{{\left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)^{2} }}$$
(32)
$$D_{Clk} = \sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {\frac{{\left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right)^{2} }$$
(33)
$$D_{Add} = \mathop \sum \limits_{i = 1}^{d} \frac{{\left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)^{2} \left( {q_{i} + p_{i} \left( {r_{j} } \right)} \right)}}{{q_{i} p_{i} \left( {r_{j} } \right)}}$$
(34)
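Three representative members of this family in a minimal sketch (hypothetical, strictly positive vectors are assumed so the denominators are defined):

```python
import numpy as np

def pearson_chi2(q, p):  # Eq. (28): squared difference normalized by the reference
    return np.sum((q - p) ** 2 / p)

def squared_chi2(q, p):  # Eq. (30): symmetric in q and p
    return np.sum((q - p) ** 2 / (q + p))

def clark(q, p):         # Eq. (33)
    return np.sqrt(np.sum((np.abs(q - p) / (q + p)) ** 2))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(pearson_chi2(q, p), squared_chi2(q, p), clark(q, p))
```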

Shannon’s Entropy Family (\(F_{7}\))

In this family, we apply four similarity measures based on probabilistic uncertainty, that is, entropy [12]. The four measures are the Jeffreys, Topsoe, Jensen-Shannon, and Jensen difference measures, expressed in terms of the entropy as Eqs. (35)–(38), respectively.

$$D_{Jef} = \mathop \sum \limits_{i = 1}^{d} \left( {q_{i} - p_{i} \left( {r_{j} } \right)} \right)\ln \frac{{q_{i} }}{{p_{i} \left( {r_{j} } \right)}}$$
(35)
$$\begin{aligned} D_{Tsoe} = \mathop \sum \limits_{i = 1}^{d} q_{i} \ln \frac{{2q_{i} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}} \hfill \\ + \mathop \sum \limits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)\ln \frac{{2p_{i} \left( {r_{j} } \right)}}{{q_{i} + p_{i} \left( {r_{j} } \right)}} \hfill \\ \end{aligned}$$
(36)
$$\begin{aligned} D_{JSh} = \frac{1}{2}\left[ {\mathop \sum \limits_{i = 1}^{d} q_{i} \ln \frac{{2q_{i} }}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right. \hfill \\ \left. { + \mathop \sum \limits_{i = 1}^{d} p_{i} \left( {r_{j} } \right)\ln \frac{{2p_{i} \left( {r_{j} } \right)}}{{q_{i} + p_{i} \left( {r_{j} } \right)}}} \right] \hfill \\ \end{aligned}$$
(37)
$$\begin{aligned} D_{Jdiff} = \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{q_{i} \ln q_{i} + p_{i} \left( {r_{j} } \right)\ln p_{i} \left( {r_{j} } \right)}}{2}} \right) \hfill \\ \quad\quad - \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}\ln \frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}} \right) \hfill \\ \end{aligned}$$
(38)
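A minimal sketch of two of these entropy-based measures (hypothetical, strictly positive vectors so the logarithms are defined):

```python
import numpy as np

def jeffreys(q, p):        # Eq. (35): a symmetrized Kullback-Leibler divergence
    return np.sum((q - p) * np.log(q / p))

def jensen_shannon(q, p):  # Eq. (37): half the Topsoe distance of Eq. (36)
    m = q + p
    return 0.5 * (np.sum(q * np.log(2 * q / m)) +
                  np.sum(p * np.log(2 * p / m)))

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(jeffreys(q, p), jensen_shannon(q, p))
```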

Combinations (\(F_{8}\))

This family contains three similarity measures that combine multiple approaches [12]. First, the Taneja distance uses both arithmetic and geometric means, as expressed in Eq. (39). Second, Kumar-Johnson uses the arithmetic and geometric mean divergence, as expressed in Eq. (40). Finally, we applied the average of the City block and Chebyshev distances from the Minkowski family (\(F_{1}\)), which is calculated by Eq. (41).

$$D_{Taj} = \mathop \sum \limits_{i = 1}^{d} \frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{2}\ln \left( {\frac{{q_{i} + p_{i} \left( {r_{j} } \right)}}{{2\sqrt {q_{i} p_{i} \left( {r_{j} } \right)} }}} \right)$$
(39)
$$D_{KuJ} = \mathop \sum \limits_{i = 1}^{d} \left( {\frac{{\left( {q_{i}^{2} - p_{i} \left( {r_{j} } \right)^{2} } \right)^{2} }}{{2\left( {q_{i} p_{i} \left( {r_{j} } \right)} \right)^{3/2} }}} \right)$$
(40)
$$D_{Avg} = \frac{{\mathop \sum \nolimits_{i = 1}^{d} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right| + \mathop {\hbox{max} }\limits_{i} \left| {q_{i} - p_{i} \left( {r_{j} } \right)} \right|}}{2}$$
(41)
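The three Combinations measures in a short sketch with hypothetical, strictly positive vectors:

```python
import numpy as np

def taneja(q, p):              # Eq. (39): arithmetic mean vs. geometric mean
    m = (q + p) / 2
    return np.sum(m * np.log(m / np.sqrt(q * p)))

def kumar_johnson(q, p):       # Eq. (40)
    return np.sum((q ** 2 - p ** 2) ** 2 / (2 * (q * p) ** 1.5))

def avg_city_chebyshev(q, p):  # Eq. (41): mean of City block and Chebyshev
    return (np.sum(np.abs(q - p)) + np.max(np.abs(q - p))) / 2

q, p = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(taneja(q, p), kumar_johnson(q, p), avg_city_chebyshev(q, p))
```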

In this study, we applied 39 distance similarity measures to retrieve masses similar to query masses in the CADx system. According to the notation similarity of each measure, the 39 similarity measures were classified into nine categories. The \(F_{0}\) family measures the similarity considering the probability distribution of the data, the \(F_{1}\) family is a general metric distance, and the \(F_{2}\) family uses absolute differences. The similarity measures in the \(F_{3}\) family are a widely used form in which the intersection between feature vectors is used. The \(F_{4}\) family uses the inner product between the feature vectors and the \(F_{5}\) family is a method that uses the sum of the geometric means. Finally, there is the \(F_{6}\) family, which uses the squared Euclidean distance, the \(F_{7}\) family which uses the probabilistic uncertainty or entropy, and the \(F_{8}\) family which uses multiple measures. Each has been described in depth [12].

Results and Discussion

To assist radiologists with breast cancer diagnoses, a CADx system for breast ultrasound images was studied. We applied 39 distance-based similarity measures, and the performance of the CADx system with each measure was analyzed and evaluated.

To design the CADx system for breast ultrasound images, breast masses in ultrasound images were classified into a training set and a test set through a cross-validation method. The 39 similarity measures were categorized into nine families based on the notation similarity in each case.

When \(S_{1}\) is used as the training data and \(S_{2}\) as the test data, the performance of the CADx system with each similarity measure is presented in Table 1 and Fig. 4. Table 2 and Fig. 5 show the performance when \(S_{2}\) is used as the training data and \(S_{1}\) as the test data. Here, \(k\) is the number of retrieved masses, that is, the number of reference breast ultrasound images presented to assist radiologists with their diagnoses. Tables 1 and 2 give the numerical performance of the 39 distance similarity measures, and Figs. 4 and 5 plot the performance of the nine families.

Table 1 The performances of the 39 similarity measures (training set S1, Test set S2)
Fig. 4

The performances of the nine similarity families (Training set S1, Test set S2)

Table 2 The performances of the 39 similarity measures (training set S2, Test set S1)
Fig. 5

The performances of the nine similarity families (Training set S2, Test set S1)

When using \(S_{1}\) as the training data and \(S_{2}\) as the test data, the performances of all similarity measures are similar. However, the performance of the \(F_{0}\) family (the Mahalanobis distance) is slightly better than those of the others when the number of retrieved masses is increased (i.e., k = 25–50). In addition, the average performance (i.e., k = 1–50) of the \(F_{0}\) family is better than those of the other similarity families (Table 1). The performance of each family is shown in Fig. 4, which indicates that the \(F_{0}\) family is superior to the other families for a large number of top retrieved masses (k > 25). When using \(S_{2}\) as the training data and \(S_{1}\) as the test data, the results are similar, as shown in Table 2 and Fig. 5.

In this study, we compared the performance of the CADx system with each of the 39 similarity measures. The performance of the \(F_{0}\) family exceeded those of the other similarity families in both experiments for a large number of top retrieved masses. The \(F_{0}\) family computes a covariance-based distance that takes the distribution of the given data into account, whereas the other similarity families consider only the distance between two feature vectors. Therefore, the \(F_{0}\) family (the Mahalanobis distance) outperforms the other similarity measures on average.

Conclusion

In this paper, a CADx system for breast ultrasound images was devised to assist radiologists in differentiating benign and malignant masses on breast ultrasound images. To design the CADx system, morphological and texture features were extracted from a database, and the feature vectors of the breast masses were classified into a training set and a test set through a cross-validation method. Using a k-nearest neighbor (k-NN) method, we applied 39 distance similarity measures, classified into nine similarity families (\(F_{0}\)–\(F_{8}\)) according to the notation similarity of each measure, to retrieve masses similar to a query mass in a reference library, and we compared the performance of the CADx system through an ROC analysis.

It was found that the performances of the individual similarity measures did not differ significantly. However, the performance of the \(F_{0}\) family was better than those of the other families when the number of retrieved masses was increased (k ≥ 25). When the number of retrieved masses is low, the probability distribution cannot be estimated reliably because there are too few data instances to consider. When the number of retrieved masses is increased, the probability distribution of the data can be fully taken into account. Therefore, the performance of the CADx system using the Mahalanobis distance (the \(F_{0}\) family), which considers the probability distribution of the data, is superior to those of the other families.

For a larger number (k > 10) of top retrieved masses, the classification performance of all similarity measures leveled off. The relationship between the usefulness of the retrieved masses as references for radiologists and the accuracy of estimating the likelihood of malignancy of the query mass warrants further investigation.

Future work includes applying the CBIR CADx system to a larger, independent dataset, expanding the feature space, and combining the developed ultrasound image characterization method with a mammographic characterization method. The effects of different CBIR CADx systems on the characterization of breast masses by a support vector machine (SVM) will also be evaluated.