1 Introduction

In order to make a prediction for a test instance, many data mining algorithms search for its k closest matches or nearest neighbors (k-NNs) in the given training set and base the prediction on those k-NNs. They use a (dis)similarity or distance measure to find the k-NNs. However, finding reliable k-NNs becomes a challenging task as the number of dimensions increases. In high-dimensional space, the data distribution becomes sparse, which makes the concept of distance meaningless, i.e., all pairs of points are almost equidistant for a wide range of data distributions and distance measures [1, 6, 12].

Let \(D=\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)},\cdots , \mathbf{x}^{(N)}\}\) be a collection of N data instances in an M-dimensional space \({{\mathcal {X}}}\). Each instance \(\mathbf{x}\) is represented as an M-dimensional vector \(\langle x_{1},x_{2},\cdots ,x_{M}\rangle \). Let \(d:{{\mathcal {X}}}\times {{\mathcal {X}}} \rightarrow {{\mathcal {R}}}\) (where \({{\mathcal {R}}}\) is the real domain) be a measure of dissimilarity between two vectors in \({{\mathcal {X}}}\). The most common approach to measuring the dissimilarity of two data instances \(\mathbf{x}\) and \(\mathbf{y}\) is based on a geometric model in which \({{\mathcal {X}}}\) is assumed to be a metric space (which has nice mathematical properties) and \(d(\mathbf{x}, \mathbf{y})\) is estimated as their geometric distance in the space. We use the term distance measure to refer to dissimilarity measures that are metric.

Minkowski distance (aka \(\ell _p\)-norm) [10] is a widely used distance measure. It estimates the dissimilarity of \(\mathbf{x}\) and \(\mathbf{y}\) by combining their distances in each dimension. Euclidean distance (\(\ell _2\)-norm) is a popular choice of distance function as it intuitively corresponds to the distance defined in the real three-dimensional world. In bag-of-words vector representation of documents, cosine distance (aka angular distance) has been shown to produce more reliable k-NNs than \(\ell _2\)-norm [26]. Cosine distance is proportional to the squared Euclidean distance of the length-normalized vectors (i.e., the vectors are scaled to unit length).

The performance of general-purpose distance measures such as \(\ell _p\)-norm and cosine distance depends on the data distribution: a distance measure that performs well in one distribution may perform poorly in others. A suspected reason is that these distance measures compute the dissimilarity between two instances solely from their geometric positions in the vector space; the data distribution (the positions of other vectors) is not taken into consideration.

Many psychologists have expressed concerns about the geometric model of dissimilarity measurement [17, 31], arguing that the judged dissimilarity between two objects is influenced by the context of measurement and by other objects in proximity. Krumhansl [17] has suggested a distance density model of dissimilarity, arguing that two objects in a relatively dense region would be less similar than two objects of equal distance located in a less dense region. For example, two Chinese individuals will be judged as more similar when compared in Europe (where there are fewer Chinese and more Caucasian people) than in China (where there are many Chinese people).

In order to understand the influence of data distribution on judged dissimilarity, let us consider an example of a data set with distributions in dimensions i and j as shown in Table 1. In this example, \(\mathbf{x}^{(1)}\) and \(\mathbf{x}^{(2)}\) have the same values in dimensions i and j. Their value in dimension i is significantly different from that of the rest of the instances, but their value in dimension j is a common one (9 out of 10 instances have the same value). In a geometric distance measure such as \(\ell _p\), because \(x^{(1)}_i - x^{(2)}_i = x^{(1)}_j - x^{(2)}_j = 0\), the differences in dimensions i and j make the same contribution to \(d(\mathbf{x}^{(1)}, \mathbf{x}^{(2)})\). The main concern raised by psychologists is that having the same value in dimension j (where the probability of the value is high) does not provide the same amount of information about the (dis)similarity between \(\mathbf{x}^{(1)}\) and \(\mathbf{x}^{(2)}\) as having the same value in dimension i (where the probability of the value is small). This scenario, where many instances have the same value in many dimensions, can be very common in high-dimensional spaces as data often lies in a low-dimensional subspace. For example, in bag-of-words vector representation, many entries in document vectors are zero as each document has only a small number of terms from the dictionary.

Table 1 An example of data distribution in two dimensions

In this paper, we propose a simple data-dependent general-purpose dissimilarity measure called ‘\(m_p\)-dissimilarity’ in which the dissimilarity between two instances is estimated based on the data distribution in each dimension. Rather than using the spatial distance in each dimension, \(m_p\)-dissimilarity evaluates the dissimilarity between two instances in terms of the probability data mass in a region covering the two instances in each dimension. The final dissimilarity between the two instances is estimated by combining the dissimilarity in every dimension as in \(\ell _p\)-norm. The intuition behind the proposed dissimilarity measure is that two instances are likely to be dissimilar if there are many instances in-between and around them in many dimensions. Under the proposed data-dependent dissimilarity measure, two instances in a dense region of the distribution are more dissimilar than two instances in a sparse region, even if the two pairs have the same geometric distance, as prescribed by the distance density model of psychologists [17].

Our empirical evaluation in k-NN classification and content-based multimedia information retrieval tasks shows that the proposed \(m_p\)-dissimilarity measure produces better task-specific performance than existing widely used general-purpose distance measures such as \(\ell _p\)-norm and cosine distance across a wide range of moderate- to high-dimensional data sets with continuous only, discrete only, and mixed attributes.

The rest of the paper is organized as follows. Previous work related to this paper is discussed in Sect. 2. The proposed \(m_p\)-dissimilarity is presented in Sect. 3, followed by empirical results in Sect. 4. The relationship of \(m_p\)-dissimilarity with \(\ell _p\)-norm after rank transformation of data is discussed in Sect. 5, followed by a broader discussion in Sect. 6. Finally, conclusions and future work are provided in the last section. From now on, we refer to \(m_p\)-dissimilarity and \(\ell _p\)-norm as \(m_p\) and \(\ell _p\), respectively.

2 Related work

In this section, we review some widely used techniques to measure dissimilarity between instances in domains with continuous only, discrete only, and mixed attributes.

2.1 Dissimilarity measures in continuous domain

In continuous domain where each dimension is numeric, i.e., \(\forall _i\) \(x_i\in {{\mathcal {R}}}\), the dissimilarity between two M-dimensional vectors \(\mathbf{x}\) and \(\mathbf{y}\) is primarily based on their positions in the vector space. Minkowski distance of order \(p>0\) (also known as \(\ell _p\)-norm distance) is defined as follows:

$$\begin{aligned} d_{mink,p}(\mathbf{x}, \mathbf{y}) = \ell _p(\mathbf{x}, \mathbf{y}) = \Vert \mathbf{x}-\mathbf{y}\Vert _p = \left( \sum _{i=1}^M abs(x_i - y_i)^p\right) ^{\frac{1}{p}} \end{aligned}$$
(1)

where \(abs(\cdot )\) denotes the absolute value.

Euclidean distance (\(p=2\)) is a popular choice of distance function as it intuitively corresponds to the distance defined in the real three-dimensional world.

As the distance in each dimension has equal influence, \(\ell _p\) is very sensitive to the units and scales of measurement. Min–max normalization (\(x_i'=\frac{x_i-\min _i}{\max _i-\min _i}\), where \(\min _i\) and \(\max _i\) are the minimum and maximum values in dimension i, respectively) is commonly used to rescale feature values to the unit range [0,1]. Even though min–max normalization accounts for scale differences between dimensions, it does not account for differences in variance across dimensions: a unit distance in a dimension with low variance may not be the same as that in a dimension with high variance. To ensure equal variance in each dimension, standard deviation normalization (\(x_i''=\frac{x_i}{\sigma _i}\), where \(\sigma _i\) is the standard deviation of the values of instances in dimension i) is used in the literature. We refer to \(\ell _p\) applied to standard-deviation-normalized vectors as standardized \(\ell _p\) (s-\(\ell _p\)), i.e., s-\(\ell _p(\mathbf{x},\mathbf{y}) = \ell _p(\mathbf{x''},\mathbf{y''})\). Standardized \(\ell _p\) with \(p=2\) (s-\(\ell _2\)) is the simplest variant of the Mahalanobis distance [10], where the covariance matrix is a diagonal matrix of the variances of values in each dimension.
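The following sketch (in Python with NumPy, not the paper's WEKA/Java implementation; the function names are ours) illustrates Eq. 1 and the two normalization schemes discussed above:

```python
import numpy as np

def lp_distance(x, y, p=2.0):
    """Minkowski distance of order p (Eq. 1)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def minmax_normalize(D):
    """Rescale each dimension of the N x M data matrix D to the unit range [0, 1]."""
    mins, maxs = D.min(axis=0), D.max(axis=0)
    return (D - mins) / np.where(maxs > mins, maxs - mins, 1.0)

def std_normalize(D):
    """Divide each dimension by its standard deviation (used for s-l_p)."""
    sigma = D.std(axis=0)
    return D / np.where(sigma > 0, sigma, 1.0)
```

s-\(\ell _p(\mathbf{x},\mathbf{y})\) is then obtained by applying lp_distance to the standard-deviation-normalized vectors.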

The Mahalanobis distance [10, 22] is defined as follows:

$$\begin{aligned} d_{{mah}}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T \Sigma ^{-1} (\mathbf{x}-\mathbf{y})} \end{aligned}$$
(2)

where \(\Sigma \in {{\mathcal {R}}}^{M\times M}\) is the covariance matrix of D.

Rather than using the inverse of the sample covariance matrix, the metric learning literature focuses on learning a generalized Mahalanobis distance [5, 18, 32, 33] from D, defined as follows:

$$\begin{aligned} d_{genMah}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T \Omega (\mathbf{x}-\mathbf{y})} \end{aligned}$$
(3)

where \(\Omega \in {{\mathcal {R}}}^{M\times M}\) is a positive semi-definite matrix.

Since \(\Omega \) is positive semi-definite, it can be factorized as \(\Omega = \Lambda ^{T}\Lambda \), where \(\Lambda \in {\mathcal R}^{\omega \times M} \) and \(\omega \) is a positive integer; hence, \(d_{genMah}(\mathbf{x}, \mathbf{y})\) can be written as \(d_{genMah}(\mathbf{x}, \mathbf{y}) = \Vert \Lambda \mathbf{x}-\Lambda \mathbf{y}\Vert _2\) [5, 18, 32]. The generalized Mahalanobis distance is thus the Euclidean distance between vectors transformed by the matrix \(\Lambda \). The goal of metric learning is to learn a transformation matrix \(\Lambda \) that improves the task-specific performance of the Euclidean distance subject to some optimality constraints, e.g., similar instances become closer to each other (similarity constraints) and dissimilar instances are separated further apart (dissimilarity constraints). Learning the best \(\Lambda \) requires intensive optimization, which is expensive in high-dimensional and/or large data sets. Furthermore, \(\Lambda \) is optimized specifically for the given task and may not be good for other tasks using the same data set. It is not a general-purpose measure like \(\ell _p\).
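As a minimal illustration of Eqs. 2 and 3 (a Python/NumPy sketch under the factorization \(\Omega = \Lambda ^{T}\Lambda \); not taken from any metric learning package):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d_mah (Eq. 2): Mahalanobis distance with covariance matrix cov."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def generalized_mahalanobis(x, y, Lambda):
    """d_genMah (Eq. 3) with Omega = Lambda^T Lambda, i.e., ||Lambda x - Lambda y||_2."""
    return float(np.linalg.norm(Lambda @ (x - y)))
```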

In many high-dimensional problems, data have the same value (0 or any other constant) in many dimensions. This leads to sparseness in data distribution. For example, only a small proportion of terms in a dictionary appear in each document of a corpus. Many entries of a term vector representing a document are zero. Euclidean distance is not a good choice of distance measure in such problems. The direction of vectors is more important than their lengths. The angular distance measure (aka cosine distance) [10] is a more sensible choice to measure dissimilarity between two documents. The cosine distance between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is defined as follows [10]:

$$\begin{aligned} d_{cos}(\mathbf{x},\mathbf{y}) = \displaystyle 1 - \frac{\sum _{i=1}^M x_i\times y_i}{\sqrt{\sum _{i=1}^M x_i^2}\times \sqrt{\sum _{i=1}^M y_i^2}} \end{aligned}$$
(4)

Cosine distance is proportional to the squared Euclidean distance when the vectors are length-normalized to unit length, which is referred to as cosine normalization in the literature. Different term weighting schemes are used to adjust the positions of the document vectors in the space based on the importance of their terms in order to improve the task-specific performance of the cosine distance [19, 25]. Cosine distance with term frequency–inverse document frequency (TF-IDF)-based term weighting [25] has been shown to perform well in many text-mining problems such as text categorization, text clustering, and text retrieval tasks.
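Eq. 4 transcribes directly as follows (Python/NumPy sketch; zero-length vectors are not handled):

```python
import numpy as np

def cosine_distance(x, y):
    """d_cos (Eq. 4): one minus the cosine of the angle between x and y."""
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```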

In both metric learning and term weighting, the focus is on transforming the data so that the task-specific performance of the Euclidean or cosine distance is maximized for the given data set. Some aspects of the data distribution are taken into consideration in the transformation in metric learning and in term weighting, but the measure is still restricted to be a metric in the transformed space, i.e., dissimilarity is still computed solely from geometric positions in the transformed space.

2.2 Dissimilarity measures in discrete domain

In discrete domain, each attribute is a categorical attribute, i.e., \(\forall _i\) \(x_{i}\) \(\in \) \(\{v_{i,1}\),\(\cdots ,\) \(v_{i,u_i}\}\) where \(v_{i,j}\) is a label out of \(u_i\) possible labels for \(x_i\). A discrete attribute can be ordinal (where there is an ordering of discrete labels \(v_{i,1}< v_{i,2}< \cdots < v_{i,u_i}\)) or nominal (where there is no ordering of discrete labels).

In order to measure similarity between two labels \(x_i\) and \(y_i\) for a discrete attribute i, \(s(x_i,y_i)\), the simplest overlap approach assigns maximum similarity of 1 if \(x_i=y_i\) and minimum similarity of 0 if \(x_i \ne y_i\) [7, 29]. Other approaches such as occurrence frequency (OF) and inverse occurrence frequency (IOF) [7] estimate \(s(x_i,y_i)\) based on the frequencies of \(x_i\) and \(y_i\) in D if \(x_i \ne y_i\) and assign maximum similarity of 1 if \(x_i=y_i\) regardless of the frequency. The definition of \(s(x_i,y_i)\) based on overlap, OF, and IOF [7] is provided in Table 2.

Table 2 \(s(x_i,y_i)\) of two labels \(x_i\) and \(y_i\) of a nominal attribute i. \(f(x_i)\) is the occurrence frequency of label \(x_i\) in D; \(N=|D|\)

Lin [20] defined similarity using information theory and suggested a probabilistic measure of similarity in ordinal discrete domain. The similarity between two ordinal labels \(x_i\) and \(y_i\) is defined as follows:

$$\begin{aligned} s_{lin,ord}(x_i, y_i) = \frac{2 \times \log \sum _{z_i=\min (x_i,y_i)}^{\max (x_i,y_i)} P(z_i)}{\log P(x_i) + \log P(y_i)} \end{aligned}$$
(5)

where \(P(x_i)\) is the probability of \(x_i\) and it is estimated from D as \({\hat{P}}(x_i) = \frac{f(x_i)+1}{N+u_i}\) where \(f(x_i)\) is the occurrence frequency of label \(x_i\) in D.

Boriah et al. [7] used Lin’s information theoretic definition of similarity in nominal discrete domain as follows:

$$\begin{aligned} s_{lin,nom}(x_i, y_i) = \frac{2 \times \log P(x_i \vee y_i)}{\log P(x_i)+\log P(y_i)} \end{aligned}$$
(6)

In a multivariate discrete domain, the dissimilarity between two instances \(\mathbf{x}\) and \(\mathbf{y}\) using Lin’s measure can be estimated as follows [7]:

$$\begin{aligned} d_{lin}(\mathbf{x}, \mathbf{y}) = 1-\frac{1}{M} \sum _{i=1}^M \displaystyle s_{lin}(x_i, y_i) \end{aligned}$$
(7)

Boriah et al. [7] have shown that \(d_{lin}\) performs better than \(d_{of}\) and \(d_{iof}\) in discrete domains. Even though measures such as \(s_{of}\), \(s_{iof}\) and \(s_{lin}\) assign similarity between \(x_i\) and \(y_i\) in each dimension based on the distribution of labels if \(x_i\ne y_i\), they assign the maximum similarity of 1 in the case of \(x_i = y_i\) regardless of the distribution of the label.
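The sketch below (Python; our own helper names, assuming integer-coded ordinal labels and per-dimension frequency counts, e.g., from collections.Counter) illustrates Eqs. 5 and 7 with the smoothed probability estimate \({\hat{P}}(x_i) = \frac{f(x_i)+1}{N+u_i}\) given above; nominal attributes would use the analogous \(s_{lin,nom}\) of Eq. 6.

```python
import math
from collections import Counter

def lin_ordinal_similarity(xi, yi, freq, N, u):
    """s_lin,ord (Eq. 5) for integer-coded ordinal labels.
    freq: Counter of label frequencies in dimension i; u: number of possible labels."""
    p = lambda v: (freq.get(v, 0) + 1) / (N + u)            # Laplace-smoothed P(v)
    mass = sum(p(v) for v in range(min(xi, yi), max(xi, yi) + 1))
    return 2 * math.log(mass) / (math.log(p(xi)) + math.log(p(yi)))

def d_lin(x, y, freqs, N, us):
    """d_lin (Eq. 7): one minus the average per-attribute Lin similarity."""
    sims = [lin_ordinal_similarity(xi, yi, f, N, u)
            for xi, yi, f, u in zip(x, y, freqs, us)]
    return 1.0 - sum(sims) / len(x)
```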

2.3 Dissimilarity measures in mixed domain

Many real-world applications have both continuous and discrete attributes, resulting in a mixed domain. In order to measure (dis)similarity between two instances in such a domain, the most commonly used \(\ell _p\)-norm treats the difference between two labels \(x_i\) and \(y_i\) of a discrete attribute i with the overlap approach, i.e., \(x_i-y_i = 0\) if \(x_i=y_i\), and 1 otherwise.

Other approaches include converting attributes into continuous only or discrete only and using (dis)similarity measures designed for continuous or discrete domain. A continuous attribute can be converted into a discrete attribute through discretization [13]. A discrete attribute with u discrete labels can be converted into u continuous attributes as follows: Each discrete label is converted into a binary attribute where 0 represents the absence of the label and 1 represents the presence, and all converted u binary attributes are treated as continuous attributes [13].

3 Data-dependent dissimilarity measure

In order to measure the dissimilarity between \(\mathbf{x}\) and \(\mathbf{y}\), instead of using \(abs(x_i-y_i)\) in Eq. 1, we propose to consider the relative positions of \(\mathbf{x}\) and \(\mathbf{y}\) with respect to the rest of the data distribution in each dimension. The dissimilarity between \(\mathbf{x}\) and \(\mathbf{y}\) in dimension i can be estimated as the probability data mass in a region \(R_i(\mathbf{x}, \mathbf{y})\) that encloses \(x_i\) and \(y_i\). If there are many instances in \(R_i(\mathbf{x}, \mathbf{y})\), \(\mathbf{x}\) and \(\mathbf{y}\) are likely to be dissimilar in dimension i. Using the same power mean formulation as in \(\ell _p\)-norm, the data-dependent dissimilarity measure based on probability mass is defined as:

$$\begin{aligned} m_p(\mathbf{x}, \mathbf{y}) = \displaystyle \left( \frac{1}{M}\sum _{i=1}^M \left( \frac{|R_i(\mathbf{x}, \mathbf{y})|}{N}\right) ^p\right) ^{\frac{1}{p}} \end{aligned}$$
(8)

where \(|R_i(\mathbf{x}, \mathbf{y})|\) is the data mass in region \(R_i(\mathbf{x}, \mathbf{y}) = [\min (x_i,y_i) - \delta , \max (x_i,y_i) + \delta ]\) (i.e., \(|R_i(\mathbf{x}, \mathbf{y})| = |\{z_i:\min (x_i,y_i)-\delta \le z_i \le \max (x_i,y_i)+\delta \}|\)), \(\delta \ge 0\), \(p>0\) and N is the total number of instances in D. An example of \(R_i(\mathbf{x}, \mathbf{y})\) is shown in Fig. 1.

Fig. 1 \(R_i(\mathbf{x}, \mathbf{y})\)

Fig. 2 Contour plots of dissimilarity of points in the space with reference to the centre (0.5, 0.5), based on \(m_p\) (with \(\delta \) for each dimension i set to \(\frac{\sigma _i}{2}\), where \(\sigma _i\) is the standard deviation of values of instances in dimension i) in three data distributions (uniform: left column, normal: middle column, and bimodal: right column). The darker the color, the lesser the dissimilarity. (a) \(p=2.0\), (b) \(p=0.5\)

The region is extended by a small \(\delta > 0\) beyond \(x_i\) and \(y_i\) to take into account the density distribution around them along with the distribution in-between them. The role of the parameter p is similar to that in \(\ell _p\), i.e., p controls the influence of the dissimilarity in each dimension.
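A brute-force sketch of Eq. 8 follows (Python/NumPy; our illustration, not the paper's WEKA implementation; delta may be a scalar or a per-dimension array such as \(\frac{\sigma _i}{2}\) used for Fig. 2):

```python
import numpy as np

def mp_dissimilarity(x, y, D, p=2.0, delta=0.0):
    """m_p (Eq. 8): per-dimension probability mass in
    R_i(x, y) = [min(x_i, y_i) - delta, max(x_i, y_i) + delta], counted over D (N x M)."""
    N = D.shape[0]
    lo = np.minimum(x, y) - delta
    hi = np.maximum(x, y) + delta
    mass = ((D >= lo) & (D <= hi)).sum(axis=0)   # |R_i(x, y)| for every dimension i
    return (np.mean((mass / N) ** p)) ** (1.0 / p)
```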

We call the proposed dissimilarity measure \(m_p(\mathbf{x}, \mathbf{y})\) ‘\(m_p\)-dissimilarity’. This measure captures the essence of the distance density model proposed by psychologists [17], which prescribes that two instances in a sparse region are more similar than two instances in a dense region. Although \(m_p\) employs the same power mean formulation as \(\ell _p\), the core calculation is based on probability mass rather than distance. The proposed \(m_p\)-dissimilarity has a probabilistic interpretation, which is provided in “Appendix 1”.

The dissimilarity between a pair of instances using Eq. 8 depends on the distribution of the data. Figure 2 shows the contour plots of \(m_p\)-dissimilarity between the point (0.5, 0.5) and every other point in the feature space in three different data distributions (uniform, normal and bimodal) for \(p=2.0\) and \(p=0.5\). In contrast, \(\ell _p\) or \(d_{cos}\) would produce the same contours in all three distributions. Under a uniform distribution with infinite samples, \(m_p\) yields the same result as \(\ell _p\) because the data mass in \(R_i(\mathbf{x}, \mathbf{y})\) is proportional to \(abs(x_i-y_i)\). This is depicted in the first column of Fig. 2, whose contours resemble those of \(\ell _2\) and \(\ell _{0.5}\).

3.1 Time complexity and efficient approximation

In continuous domains, estimating \(m_p(\mathbf{x}, \mathbf{y})\) using Eq. 8 is expensive, especially when either \(\mathbf{x}\) or \(\mathbf{y}\) is an unseen instance, as it requires a range search in each dimension to estimate \(|R_i(\mathbf{x}, \mathbf{y})|\). A one-dimensional range search can be done in \(O(\log N)\) using binary search trees, resulting in a time complexity of \(O(M\log N)\) to measure the dissimilarity of a pair of instances, compared with O(M) for \(\ell _p\). This is expensive in large data sets.

Alternatively, \(|R_i(\mathbf{x}, \mathbf{y})|\) can be approximated efficiently by using a histogram, i.e., dividing the range of real values in each dimension i into b bins (\(h_{i1}, h_{i2},\cdots ,h_{ib}\)). The number of instances in each bin can be computed in a preprocessing step. When two instances \(\mathbf{x}\) and \(\mathbf{y}\) are given for dissimilarity measurement, \(R_i(\mathbf{x}, \mathbf{y})\) can be computed from the bins spanning \(x_i\) and \(y_i\), as shown in Fig. 3. Even though the approximation using bins does not extend the range exactly by \(\delta \) beyond \(x_i\) and \(y_i\), the bins into which \(x_i\) and \(y_i\) fall provide a reasonable approximation of the distribution around them.

Fig. 3 Defining \(R_i(\mathbf{x}, \mathbf{y})\) using bins

If \(h_{il}\) and \(h_{io}\) are the two bins in which \(\min (x_i, y_i)\) and \(\max (x_i, y_i)\) fall, respectively, then \(|R_i(\mathbf{x}, \mathbf{y})|\) can be estimated as follows:

$$\begin{aligned} |R_i(\mathbf{x}, \mathbf{y})| = \sum _{q=l}^o |h_{iq}| \end{aligned}$$
(9)

Note that the binning can be done in two ways: (i) equal width, where each bin is of the same size (bins in a dense region have more data mass than those in a sparse region); and (ii) equal frequency, where each bin has approximately the same number of instances (bins are smaller in a dense region than in a sparse region). The former is sensitive to outliers: if there is a single instance with a value significantly different from the others, it may affect the discrimination between the remaining instances, as they may all fall into the same bin while many bins in the middle are left empty. Hence, in this paper we used the latter approach, where each bin has approximately the same number of instances, with \(b=100\), using the WEKA implementation [13]. Note that bins in a dimension can still have different data mass if many instances have the same values in that dimension, making it impossible to split them into b bins of equal data mass.

The preprocessing requires a total of \(O(NMb+Mb^2)\) time and \(O(Mb^2)\) space: it builds the histogram and the pairwise dissimilarity matrix of bins in each dimension. A histogram of b bins is built for each dimension, and the number of instances falling in each bin can be calculated in O(NMb) time. The dissimilarity \(|R_i(\cdot ,\cdot )|\) can be precomputed for each pair of bins in each dimension in \(O(Mb^2)\) time and stored in \(O(Mb^2)\) space. After preprocessing, the dissimilarity between two instances in each dimension reduces to a table look-up in O(1) time, resulting in O(M) time to measure the dissimilarity between a pair of instances, equivalent to that of \(\ell _p\) and \(d_{cos}\).
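The preprocessing and look-up can be sketched as follows (Python/NumPy; quantile-based cut points stand in for WEKA's equal-frequency discretization, and a prefix sum over the bin counts is used here in place of the precomputed b x b table per dimension, giving the same O(1) per-dimension look-up):

```python
import numpy as np

def build_equal_frequency_bins(D, b=100):
    """Preprocessing: equal-frequency bin edges and cumulative bin counts per dimension."""
    edges, cum_counts = [], []
    for col in D.T:
        e = np.quantile(col, np.linspace(0, 1, b + 1))[1:-1]   # b-1 internal cut points
        idx = np.searchsorted(e, col, side='right')            # bin id of every instance
        cum = np.concatenate(([0], np.cumsum(np.bincount(idx, minlength=b))))
        edges.append(e)
        cum_counts.append(cum)
    return edges, cum_counts

def mp_binned(x, y, edges, cum_counts, N, p=2.0):
    """Approximate m_p (Eqs. 8-9): |R_i| is the mass of the bins spanning x_i and y_i."""
    masses = np.empty(len(x))
    for i, (e, cum) in enumerate(zip(edges, cum_counts)):
        l, o = sorted((np.searchsorted(e, x[i], side='right'),
                       np.searchsorted(e, y[i], side='right')))
        masses[i] = cum[o + 1] - cum[l]                         # Eq. 9
    return (np.mean((masses / N) ** p)) ** (1.0 / p)
```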

3.2 Handling discrete attributes

For ordinal discrete attributes, \(|R_i(\mathbf{x}, \mathbf{y})|\) can be estimated as follows:

$$\begin{aligned} |R_i(\mathbf{x}, \mathbf{y})|= \sum _{z_i=\min (x_i,y_i)}^{\max (x_i,y_i)} f(z_i) \end{aligned}$$
(10)

where \(f(z_i)\) is the frequency of discrete label \(z_i\) in D.

Unlike \(d_{lin}\), which assigns dissimilarity in an ordinal attribute i based on the frequencies of the labels if \(x_i\ne y_i\) but assigns the minimal dissimilarity of 0 regardless of the distribution of labels if \(x_i=y_i\), \(m_p\) assigns dissimilarity based on the frequency of the label even when \(x_i=y_i\).

For nominal discrete attributes, \(|R_i(\mathbf{x}, \mathbf{y})|\) can be estimated as follows:

$$\begin{aligned} |R_i(\mathbf{x}, \mathbf{y})|= \left\{ \begin{array}{ll} f(x_i) &{} \text{ if } x_i = y_i \\ N &{} \text{ otherwise } \\ \end{array} \right. \end{aligned}$$
(11)

It is interesting to note the difference between \(m_p\) and existing dissimilarity measures for nominal domains such as \(d_{lin}\), \(d_{of}\) and \(d_{iof}\) [7]. For a nominal attribute i, they use the frequency of the labels if the two instances have different labels (\(x_i\ne y_i\)) and assign the maximal similarity of 1 (or minimal dissimilarity of 0) if \(x_i=y_i\). In contrast, \(m_p\) takes the opposite approach: it uses the frequency of the label if \(x_i=y_i\) and assigns the maximal dissimilarity of 1 otherwise. In the case of \(x_i=y_i\), existing measures assign the maximal similarity of 1 without considering the distribution of the label; it might be the case that all the other instances have the same label, in which case there is no discrimination between instances w.r.t. the attribute.

The frequency of each discrete label can be computed in a preprocessing step which requires O(NM) time and O(Mu) (where u is the average number of discrete labels per dimension) space.
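A sketch of Eqs. 10 and 11 (Python; the label frequencies f(·) are assumed to be precomputed per dimension, e.g., with collections.Counter, and ordinal labels are assumed to be integer-coded):

```python
def region_mass_ordinal(xi, yi, freq):
    """|R_i(x, y)| for an ordinal attribute (Eq. 10)."""
    return sum(freq.get(v, 0) for v in range(min(xi, yi), max(xi, yi) + 1))

def region_mass_nominal(xi, yi, freq, N):
    """|R_i(x, y)| for a nominal attribute (Eq. 11)."""
    return freq.get(xi, 0) if xi == yi else N
```

These per-dimension masses are then plugged into Eq. 8 in place of the range counts.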

3.3 Dissimilarity measure in bag-of-words vector representation

In the case of bag-of-words (bow) [26] vector representations, each component of a vector represents frequency of a feature (term in documents or a visual descriptor in images). Given any two vectors \(\mathbf{x}\) and \(\mathbf{y}\), many features have zero frequency i.e., \(x_i=y_i=0\) for many dimensions, because a document contains only a small proportion of words in the dictionary. Since the absence of a feature in both the instances does not provide any information about the (dis)similarity of \(\mathbf{x}\) and \(\mathbf{y}\), those features should be ignored. Hence, in the bow vector representation, \(m_p\)-dissimilarity of \(\mathbf{x}\) and \(\mathbf{y}\) is estimated using only those features that occur in either of \(\mathbf{x}\) or \(\mathbf{y}\) as follows:

$$\begin{aligned} m_p(\mathbf{x},\mathbf{y})=\left( \frac{1}{|F_{\mathbf{x},\mathbf{y}}|}\sum _{i\in F_{\mathbf{x},\mathbf{y}}} \left( \frac{|R_i(\mathbf{x}, \mathbf{y})|}{N}\right) ^p\right) ^{\frac{1}{p}} \end{aligned}$$
(12)

where \(F_\mathbf{x}\) is the set of features that occur in \(\mathbf{x}\) (i.e., \(F_\mathbf{x}=\{i:x_i>0\}\)), \(F_{\mathbf{x},\mathbf{y}}=F_\mathbf{x}\cup F_\mathbf{y}\), and \(|F_{\mathbf{x},\mathbf{y}}|\) is the normalization term employed to account for the different numbers of features used when measuring the dissimilarity of different pairs of instances.

It is important to ignore those features which have zero frequency in both instances (\(x_i=y_i=0\)); otherwise, \(m_p\) would assign a large dissimilarity w.r.t. those features, as many instances in the data set have 0 values. This is not an issue for \(\ell _p\) because it assigns 0 dissimilarity when \(x_i=y_i=0\).
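A sketch of Eq. 12 (Python/NumPy, with \(\delta =0\) and a brute-force count over the collection D for brevity; in practice the binned approximation of Sect. 3.1 would be used):

```python
import numpy as np

def mp_bow(x, y, D, p=2.0):
    """m_p for bag-of-words vectors (Eq. 12): only features present in x or y contribute."""
    F = np.flatnonzero((x > 0) | (y > 0))                   # F_{x,y} = F_x U F_y
    lo, hi = np.minimum(x[F], y[F]), np.maximum(x[F], y[F])
    mass = ((D[:, F] >= lo) & (D[:, F] <= hi)).sum(axis=0)  # |R_i| over the collection
    return (np.mean((mass / D.shape[0]) ** p)) ** (1.0 / p)
```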

3.4 Distinguishing properties of \(m_p\)

3.4.1 Data-dependent self-dissimilarity

The distinguishing characteristic of \(m_p\) against the geometry-based (\(\ell _p\) and \(d_{cos}\)) and probabilistic (\(d_{lin}\)) dissimilarity measures discussed in Sect. 2 is the self-dissimilarity. The self-dissimilarity of \(m_p\) is not zero, and it ranges from the minimum of \(\frac{1}{N}\) to the maximum of 1, depending on the data distribution in each dimension. In contrast, \(\ell _p(\mathbf{x}, \mathbf{x})=d_{cos}(\mathbf{x},\mathbf{x})=d_{lin}(\mathbf{x}, \mathbf{x})=0\) irrespective of the data distribution. Because of the data-dependent self-dissimilarity, \(m_p\) is non-metric.

The approximation of \(|R_i(\mathbf{x}, \mathbf{y})|\) using equal-frequency bins yields a nonzero constant self-dissimilarity for \(m_p\) only if every bin has the same number of instances in every dimension. This is often not possible because many instances have duplicate values in many dimensions, a common characteristic of high-dimensional data sets in which data often lie in a low-dimensional subspace. As a result, bins often have different numbers of instances, resulting in data-dependent self-dissimilarity. The advantage of the data-dependent self-dissimilarity of \(m_p\) over the data-independent self-dissimilarity of existing measures is discussed in Sect. 5.

In discrete domains, unlike \(d_{lin}\), \(d_{of}\) and \(d_{iof}\) based measures that use the probabilities of categorical labels only in the case of different labels, \(m_p\) uses the probability of the label even in the case of matching labels—data-dependent self-dissimilarity in action.

3.4.2 \(m_p\) is equivalent to \(\ell _p\) only under uniform distribution

Under uniform distribution and infinite data, \(m_p\) is equivalent to \(\ell _p\) as the data mass in the range is proportional to its length. This is the only condition under which \(m_p\)—a data-dependent measure—is equivalent to \(\ell _p\)—a geometric model-based measure.

3.4.3 Robust to scale, units of measurement and outliers

As \(m_p\) is based on counts and does not use the feature values in the dissimilarity measure directly, it is robust to the scale and units of measurement in continuous domains. It does not require preprocessing of the data to address scaling issues (min–max normalization) or differences in variance across dimensions (standard deviation normalization). In many real-world applications, different properties of the data may be represented or measured on different scales (e.g., income is represented in dollars and age as an integer: a one-unit difference is not the same in these two attributes). This is often the case in high-dimensional problems where different properties are measured by different sensors. For the same reason (i.e., being based on counts rather than the actual feature values), \(m_p\) is also less sensitive to outliers. In the case of \(\ell _p\), outliers can have an adverse impact as they might change the variance significantly.

4 Empirical evaluation

This section presents the results of experiments conducted to demonstrate that, simply by replacing the geometric distance with the probability mass in each dimension, \(m_p\) produces better task-specific performance than \(\ell _p\) and \(d_{cos}\) across a wide range of data sets. We evaluated the performance of \(m_p\) against the general-purpose dissimilarity measures Minkowski distance (\(\ell _p\)), Minkowski distance after standard deviation normalization (s-\(\ell _p\)), cosine distance (\(d_{cos}\)) and Lin's probabilistic measure (\(d_{lin}\)) in k-nearest neighbor (k-NN) classification and content-based multimedia information retrieval (CBMIR) tasks. We used two settings of \(p\in \{0.5, 2.0\}\) in \(\ell _p\), s-\(\ell _p\) and \(m_p\), resulting in eight measures: \(d_{cos}, d_{lin}, \ell _{0.5}, \ell _2, s\)-\(\ell _{0.5}, s\)-\(\ell _2, m_{0.5}\) and \(m_2\). All dissimilarity measures and algorithms were implemented in Java using the WEKA platform [13].

We used moderately high- to high-dimensional (\(M\ge 20\)) data sets from different application areas such as text, image, music, character and digit recognition, medicine and biology, and games. In the text collections, documents were represented by TF-IDF [25] weighted ‘bag-of-words’ [26] vectors. Feature values in each dimension of all other (non-text) data sets were normalized to the unit range. For \(d_{lin}\), continuous attributes were converted into ordinal attributes using discretization, as in the case of \(m_p\).

The properties of the data sets are provided in Table 3. NG20, R52, R8 and Webkb were from [8]; Ohscal, Wap, New3s and Fbis were from [14]; Caltech256 (SIFT bag-of-words features) was from [30]; Corel and Gtzan were from [34]; HBA was from [2]; and the rest of the data sets were from UCI [4] and WEKA [13].

Table 3 Data sets used to compare the performance of \(m_p\) with other distance or dissimilarity measures. The number of nominal attributes (\(M_\mathrm{nom}\)) is provided in brackets along with the total number of dimensions (M); c is the number of classes in a data set

We discuss the experimental setups and results in k-NN classification and content-based multimedia information retrieval (CBMIR) tasks in the next two subsections.

Table 4 Average accuracy of 5-NN classification over a tenfold cross-validation. The average accuracy and average rank of measures in 30 data sets are included in the last two rows

4.1 k-NN classification

In the k-NN classification context, in order to predict a class label for a test instance \(\mathbf{x}\), its k nearest neighbors were searched for in the training set using each of the eight dissimilarity measures, and the most frequent label among the k-NNs was predicted as the class label for the test instance. All classification experiments were conducted using a tenfold cross-validation: 10 train-and-test trials using 90% of the given data set for training and 10% for testing. We set k to the commonly used value of 5 (i.e., \(k=5\)). The average classification accuracy (%) over the tenfold cross-validation was reported. The accuracies of two algorithms were considered significantly different if their confidence intervals (based on two standard errors over the tenfold cross-validation) did not overlap. The average classification accuracies over the tenfold cross-validation of the eight dissimilarity measures in all data sets are provided in Table 4.

Out of 30 data sets, \(m_{0.5}\) and \(m_2\) produced the best result or a result equivalent to the best in 23 and 16 data sets, respectively. Either \(m_{0.5}\) or \(m_2\) produced significantly better classification accuracy than any other contender in the New3s, Ohscal, Wap, R52, NG20, R8, Webkb, Caltech, Corel, Connect-4 and Hypothyroid data sets. The summarized results in the last two rows of Table 4 show that both \(m_{0.5}\) and \(m_2\) produced consistently top or near-top results across different data sets. \(m_{0.5}\) and \(m_2\) have average rankings of 1.97 and 2.37, respectively, whereas the average rank of the closest contender, \(d_{cos}\), is 3.30. Table 5 provides the summarized result in terms of the win:loss:draw counts of \(m_{0.5}\) and \(m_2\) against the other six contenders using confidence intervals based on two standard errors over the tenfold cross-validation (standard errors are provided in Table 10 in “Appendix 3”). It shows that both \(m_{0.5}\) and \(m_2\) had significantly more wins than losses against all other contenders.

Table 5 Win:loss:draw counts of \(m_{0.5}\) and \(m_2\) w.r.t other measures in 5-NN classification

Note that in data sets with nominal attributes only (e.g., Connect-4, Chess, and Splice), \(d_{cos}\), \(\ell _p\) and s-\(\ell _p\) produced exactly the same results because they are effectively the same measure. Because of the one-of-all transformation, all the vectors have the same length (each instance has exactly M 1s), in which case \(d_{cos}\) is proportional to the squared \(\ell _2\) distance and yields the same neighbor ranking. Since the difference in each dimension is either 0 or 1, the parameter p does not affect the ranking either.

All eight measures had run times of the same order of magnitude. For example, predicting class labels for the instances in one fold of train-and-test in NG20 took 21,458 seconds for \(m_2\) and 26,484 seconds for \(m_{0.5}\), compared with 16,296 (\(d_{cos}\)), 28,168 (\(\ell _{0.5}\)), 24,380 (\(\ell _2\)), 29,210 (s-\(\ell _{0.5}\)), 25,944 (s-\(\ell _2\)) and 20,515 (\(d_{lin}\)) seconds. In Corel, \(m_2\) and \(m_{0.5}\) took 37 and 47 seconds, whereas \(d_{cos}\) took 22 seconds, followed by 32 (\(\ell _2\)), 45 (\(\ell _{0.5}\)), 47 (s-\(\ell _2\)), 59 (s-\(\ell _{0.5}\)) and 90 (\(d_{lin}\)) seconds.

4.2 Content-based multimedia information retrieval (CBMIR)

Given a query instance \(\mathbf{q}\) for a retrieval task, all the instances in a data set were ranked in ascending order of their dissimilarity to \(\mathbf{q}\) based on a dissimilarity measure, and the first k instances were presented as the instances relevant to \(\mathbf{q}\). For performance evaluation, an instance was considered relevant to \(\mathbf{q}\) if it has the same category label. A good information retrieval system returns relevant instances at the top; hence, the precision in the top 10 (P@10) retrieved results was used as the performance measure. The same process was repeated with each instance in a data set as the query while the rest of the instances were ranked, and the average P@10 over the N queries was reported. For the information retrieval task, we used 10 data sets with 10 or more classes from multimedia (text, music and image) applications: New3s, Ohscal, Wap, R52, NG20, Fbis, Caltech, Gtzan, Hba and Corel. The average P@10 of \(d_{cos}, \ell _{0.5}, \ell _2, s\)-\(\ell _{0.5}, s\)-\(\ell _2, d_{lin}, m_{0.5}\) and \(m_2\) is provided in Table 6.

Table 6 Average P@10 over N queries. The average P@10 and average rank of measures in 10 data sets are included in the last two rows
Table 7 Win:loss:draw counts of \(m_{0.5}\) and \(m_2\) w.r.t other measures in CBMIR

Table 7 presents the summarized result in terms of the win:loss:draw counts of \(m_{0.5}\) and \(m_2\) using confidence intervals based on two standard errors over the N queries (standard errors are provided in Table 11 in “Appendix 3”). It shows that both \(m_{0.5}\) and \(m_2\) produced significantly better retrieval results than the other six contenders in many data sets: \(m_{0.5}\) had only 1 loss and between 7 and 10 wins; \(m_2\) had at least 7 wins and at most 3 losses. The detailed results in Table 6 show that, out of the 10 data sets used, \(m_{0.5}\) and \(m_2\) produced the best result or a result equivalent to the best in 9 and 6 data sets, respectively. They have average rankings of 1.20 and 2.70, respectively, whereas the closest contender, \(d_{cos}\), has an average ranking of 3.2.

5 Relation to \(\ell _p\) with rank transformation

At first glance, it appears that \(m_p\) (Eq. 8 with \(\delta =0\)) is equivalent to \(\ell _p\) with rank transformation [9] in continuous domains, because both measure dissimilarity based on the number of instances in-between the two instances under measurement. In the rank transformation [9], instances in each dimension are ranked in ascending order, with the smallest value given rank 1, the second smallest value rank 2, and so on; the values of the instances are then replaced by their ranks. If \(n<N\) instances have the same value and that value has rank r, then all of them are assigned the same rank r, and the next available rank is \(r+n\) (i.e., the minimum rank is assigned in the case of ties).

The distance between two instances in each dimension after the rank transformation described above can be written as \(abs(rank(x_i)-rank(y_i)) = |\{z_i:\min (x_i,y_i)\le z_i < \max (x_i,y_i)\}|\). In \(m_p\) (with \(\delta =0\)) using the implementation based on the range search, \(|R_i(x_i,y_i)| = |\{z_i:\min (x_i,y_i)\le z_i \le \max (x_i,y_i)\}|\).

These two formulations are equivalent only if all values in dimension i are distinct, i.e., \(|R_i(x_i,y_i)|=abs(rank(x_i)-rank(y_i))+1\). They differ when there are duplicate values, and the degree of difference is proportional to the number of duplicates.

It is interesting to compare the self-dissimilarity of \(x_i\) when there are duplicates of \(x_i\): \(abs(rank(x_i)-rank(x_i)) = 0\) versus \(|R_i(x_i,x_i)| = f(x_i)\), where \(f(x_i)\) is the frequency of \(x_i\). Even though the rank difference between \(x_i\) and \(y_i\) is density (data) dependent when \(x_i\ne y_i\) (i.e., the rank difference between \(x_i\) and \(y_i\) is larger in a denser region than in a sparse region even if the geometric distance is the same), it is zero irrespective of the distribution when \(x_i=y_i\). In the extreme case where all instances have the same value in dimension i, the self-dissimilarity is 1 (the maximum) in the case of \(m_p\), whereas the self-dissimilarity of \(\ell _p\) after rank transformation is 0 (the minimum). In high-dimensional real-world problems, many instances often have the same value in many dimensions; e.g., many documents in a collection can have the same occurrence frequency of a term, or different individuals can have the same age.
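The difference can be seen in a small numerical sketch (Python/NumPy; the values are chosen purely for illustration): with duplicated values the rank difference collapses to 0 while \(|R_i|\) reflects the frequency, and with distinct values the two quantities differ only by 1.

```python
import numpy as np

col = np.array([3.0, 5.0, 5.0, 5.0, 7.0, 9.0])         # one dimension with duplicate values
ranks = np.array([1 + np.sum(col < v) for v in col])    # rank transformation, min rank for ties

def rank_diff(a, b):
    return abs(ranks[a] - ranks[b])                      # abs(rank(x_i) - rank(y_i))

def region_mass(a, b):
    lo, hi = sorted((col[a], col[b]))
    return int(np.sum((col >= lo) & (col <= hi)))        # |R_i(x_i, y_i)| with delta = 0

print(rank_diff(1, 2), region_mass(1, 2))   # equal duplicated values: 0 vs 3
print(rank_diff(0, 4), region_mass(0, 4))   # distinct values: 4 vs 5 (= rank difference + 1)
```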

We compared the performance of \(m_p\) and \(\ell _p\) with rank transformation (\(\ell _p^{rank}\)) in the k-NN classification task using data sets with continuous attributes only (as the rank transformation is applicable only in continuous domains). Both \(\ell _p^{rank}\) and \(m_p\) (used here without the efficient approximation of \(R_i(\cdot ,\cdot )\) discussed in Sect. 3.1) have high time complexities: estimating \(|R_i(\cdot ,\cdot )|\) in \(m_p\) and computing the rank of an unseen value of a test instance in \(\ell _p^{rank}\) in each dimension require \(O(\log N)\) time using binary search, resulting in a total time complexity of \(O(M\log N)\) to measure the dissimilarity of a pair of instances, which is very expensive in large data sets. We only managed to complete a tenfold cross-validation of k-NN classification within 24 h in ten relatively small data sets: Hba, Gtzan, Arcene, Mfeat, Madelon, Satellite, Fbis, Wap, Webkb, and R8.

In order to provide an idea of the number of duplicate values per dimension in a data set, the fraction of distinct values averaged over all dimensions, \(\alpha \), is calculated as:

$$\begin{aligned} \alpha =\frac{1}{M}\sum _{i=1}^M\frac{w_i}{N} \end{aligned}$$
(13)

where \(w_i\) is the number of distinct values in dimension i. \(\alpha =1\) indicates that the data set has unique values in all dimensions (no duplicate at all) and \(\alpha =\frac{1}{N}\) indicates that all instances have the same value in each and every dimension.
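Eq. 13 transcribes as a short function (Python/NumPy sketch):

```python
import numpy as np

def alpha(D):
    """Distinct-values statistic (Eq. 13): average fraction of distinct values per dimension."""
    N, M = D.shape
    return np.mean([len(np.unique(D[:, i])) / N for i in range(M)])
```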

The average accuracies of 5-NN classification over a tenfold cross-validation using \(\ell _{2}^{rank}\), \(\ell _{0.5}^{rank}\), \(m_{2}\) and \(m_{0.5}\) are provided in Table 8. Based on a significance test using confidence intervals of two standard errors, \(\ell _{2}^{rank}\) & \(m_{2}\) and \(\ell _{0.5}^{rank}\) & \(m_{0.5}\) produced similar results in Hba, Gtzan, Arcene, Mfeat, Madelon and Satellite, but both \(m_{2}\) and \(m_{0.5}\) produced significantly better accuracies than \(\ell _{2}^{rank}\) and \(\ell _{0.5}^{rank}\) in Fbis, Wap, Webkb, and R8. These results show that \(m_p\) performs better than \(\ell _p^{rank}\) when many instances have the same values (i.e., there are only a few distinct values) in many dimensions.

Table 8 The average accuracy of k-NN (\(k=5\)) classification in a tenfold cross-validation. The distinct values statistic \(\alpha \) is provided in the second column

In order to further demonstrate this difference, we conducted experiments with the Hba and Gtzan data sets (which have large \(\alpha \)) by increasing the number of duplicate values in many dimensions. The range of values in dimension i was divided into 10 equal-width bins with ids \(1,2,\cdots ,10\), and an instance's value was replaced by the id of the bin into which it falls, resulting in many duplicate values in dimension i. In order to control the number of dimensions with duplicate values, we introduced a parameter a that determines the proportion of attributes converted into bins: \(a=0\) indicates that no attribute was converted (values in all attributes were left as they were and no duplicate values were introduced) and \(a=1.0\) indicates that all attributes were converted (many instances have duplicate values in all dimensions). The k-NN (\(k=5\)) classification accuracies of \(\ell _{2}^{rank}, \ell _{0.5}^{rank}, m_{2}\) and \(m_{0.5}\) in the Hba and Gtzan data sets with \(a=0, 0.1, 0.2, 0.5, 0.75\) and 1.0 are shown in Fig. 4, and the corresponding \(\alpha \) values are provided in Table 9.

Fig. 4 5-NN classification accuracies of \(\ell _{2}^{rank}, \ell _{0.5}^{rank}, m_{2}\) and \(m_{0.5}\) for different values of a

Table 9 The distinct values statistic \(\alpha \) for different values of a

Figure 4 shows that there is a significant difference between the classification accuracies of \(m_{2}\) and \(m_{0.5}\) compared with those of \(\ell _{2}^{rank}\) and \(\ell _{0.5}^{rank}\) for \(a\ge 0.75\) in both the Hba and Gtzan data sets. This indicates that \(m_{p}\) can provide more reliable nearest neighbors than \(\ell _{p}^{rank}\) when many instances have duplicate values in many dimensions. This superior performance of \(m_p\) over \(\ell _{p}^{rank}\) is primarily due to its data-dependent self-dissimilarity.

Furthermore, the rank transformation is possible in continuous domains only. In contrast, \(m_p\) not only applies to both continuous and discrete domains, but also handles domains of mixed attribute types seamlessly.

6 Discussion

In a high-dimensional space, the most widely used Euclidean distance (\(\ell _2\)-norm) becomes ineffective. Many researchers have argued that this is due to the ‘concentration’ effect of \(\ell _p\), i.e., pairwise distances become almost equal and the contrast between the nearest and farthest instances diminishes [1, 6, 12]. Let \(dmax(\mathbf{x},d)\) and \(dmin(\mathbf{x},d)\) be the dissimilarity of \(\mathbf{x}\) to its farthest and nearest neighbors in D using dissimilarity measure d, respectively. For a given instance, the distance between the nearest and farthest instances does not increase as fast as the distance to the nearest instance for many distributions [6], i.e., the ‘relative contrast’ \(\left( \frac{dmax(\mathbf{x},{\ell _p})-dmin(\mathbf{x},{\ell _p})}{dmin(\mathbf{x},{\ell _p})}\right) \) vanishes as the number of dimensions increases.
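The relative contrast can be measured empirically as follows (Python/NumPy sketch; the query q is assumed not to be contained in D, and dissim can be any of the measures discussed above):

```python
import numpy as np

def relative_contrast(q, D, dissim):
    """(dmax - dmin) / dmin of query q with respect to the instances in D."""
    d = np.array([dissim(q, x) for x in D])
    return (d.max() - d.min()) / d.min()
```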

In our investigation, we observed that \(m_p\) is more concentrated than \(\ell _p\) and \(d_{cos}\), i.e., the relative contrast of \(m_p\) is smaller than that of \(\ell _p\) and \(d_{cos}\). Despite having a higher concentration effect, \(m_p\) provides more reliable nearest neighbors than \(\ell _p\) and \(d_{cos}\) in many data sets, particularly in high-dimensional problems (see the experimental results in Sects. 4.1 and 4.2). This indicates that the negative impact of the concentration phenomenon in practice may not be as severe as it is thought theoretically in the literature. This finding is consistent with that suggested by François et al. [12]. The detailed empirical result of the phenomenon of concentration of \(m_2\), \(\ell _2\) and \(d_{cos}\) is provided in the “Concentration” section of “Appendix 2”.

Another issue of distance measures in high-dimensional spaces discussed in the literature is “hubness” [24]. Let \(N_k(\mathbf{y})\) be the set of k nearest neighbors of \(\mathbf{y}\), and let the k-occurrences of \(\mathbf{x}\), \(O_k(\mathbf{x}) = |\{\mathbf{y}: \mathbf{x}\in N_k(\mathbf{y})\}|\), be the number of other instances in the given data set for which \(\mathbf{x}\) is one of their k nearest neighbors. As the number of dimensions increases, the distribution of \(O_k(\mathbf{x})\) becomes considerably skewed (i.e., there are many instances with zero or small \(O_k\) and only a few instances with large \(O_k\)) for many widely used distance measures [24]. The instances with large \(O_k(\cdot )\) are considered “hubs,” i.e., popular nearest neighbors. Hubness becomes prominent in high-dimensional space, and it affects the performance of k-NN based algorithms. For example, if \(\mathbf{x}\) is a hub, it appears in the k-NN sets of many test instances and contributes to the prediction decisions, but it may not be relevant for making predictions for all of those test instances.
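The k-occurrence counts used to quantify hubness can be computed as follows (Python/NumPy sketch with O(N^2) pairwise dissimilarity computations; dissim is any measure discussed above):

```python
import numpy as np

def k_occurrences(D, dissim, k=5):
    """O_k(x) for every x in D: how often x appears in other instances' k-NN lists."""
    N = len(D)
    O = np.zeros(N, dtype=int)
    for j in range(N):
        d = np.array([dissim(D[j], D[i]) if i != j else np.inf for i in range(N)])
        for i in np.argsort(d)[:k]:                      # the k nearest neighbors of D[j]
            O[i] += 1
    return O                                              # a heavily skewed O indicates hubness
```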

We observed that the hubness phenomenon in \(m_p\) is not as severe as in the case of \(\ell _p\) and \(d_{cos}\) when the number of dimensions is increased particularly in non-uniform distributions. This may contribute to the superior performance of \(m_p\) over \(\ell _p\) and \(d_{cos}\). The detailed empirical result of the phenomenon of hubness of \(m_2\), \(\ell _2\) and \(d_{cos}\) is provided in the “Hubness” section of “Appendix 2”.

In order to circumvent the high-dimensionality issue, dimensionality reduction techniques [11] are used before applying distance measures. In the continuous domain, Principal Component Analysis (PCA) [16] is commonly used to project data into a lower-dimensional space defined by the principal components with high variance. The principal components are computed by the eigendecomposition of the covariance or correlation matrix, which is computationally expensive for large M and N. PCA relies on the variance of the data in each dimension, which may not be enough to capture the characteristics of the local data distribution; and because it selects the dimensions with high variance, differences between instances in the dimensions with low variance may be lost.

In a nutshell, the main purpose of PCA is dimensionality reduction to enable applications to high-dimensional data sets; it usually does not improve predictive accuracy. This is exactly what we observed in the 5-NN classification task. For example, the 5-NN classification accuracies of \(d_{cos}\) and \(\ell _2\) increased in Corel and Hba, but that of \(\ell _{0.5}\) decreased in both data sets. Similarly, the classification accuracies of all three measures decreased significantly in Mnist and R52. In general, \(m_{2}\) and \(m_{0.5}\) in the original space (without dimensionality reduction) produced better and more consistent results across different data sets. The detailed results of this comparison are provided in Table 12 in “Appendix 4”.

Note that PCA changes the distribution of data to maximize the variance (which is defined by inter-point distances). Thus, it does not make sense to apply PCA when using \(m_p\).

Different data-dependent distance metric adaptation techniques are discussed in the literature to improve task-specific performance of distance measures in a given data set. Weighted Minkowski distance [10] assigns weight to the distance in each dimension based on the observed data. Note that standardized Euclidean distance (s-\(\ell _2\)) is a simple weighted Euclidean distance where the distance in each dimension is weighted by the inverse of data variance in that dimension. Assigning weights more intelligently requires some learning or optimization. In transductive learning, Lundell and Ventura [21] corrected the Euclidean distance between two instances based on meta-clustering which itself relies on pairwise Euclidean distances and can be computationally expensive in large and high-dimensional problems.

Metric learning [32, 33] projects data from the original space to a new low-dimensional space that best suits the Euclidean distance for the task at hand. Rather than projecting data into a low-dimensional space by ignoring dimensions with small eigenvalues, regularized matrix relevance learning [28] uses a regularization scheme which inhibits decays in the eigen profile. Both of these techniques require intensive learning, which is computationally expensive in large and/or high-dimensional data sets. They optimize the distance metric specifically for the given task, which may not be good for other tasks using the same data set. They are not general-purpose measures like \(m_p\), \(\ell _p\) or \(d_{cos}\).

All the adaptive metric learning techniques discussed in the literature attempt to adjust the inter-point distances in the space based on the data distribution so as to satisfy some optimality constraints. Because the transformed space is still embedded in the Euclidean space, the self-dissimilarity is still a constant (zero) regardless of the data distribution, and these techniques still rely on the geometric model and metric assumptions. Even though metric-based measures have nice mathematical properties, their assumptions might be inappropriate for modeling some problems. Recently, Schleif and Tino [27] discussed issues of metric-based proximity learning and provided a comprehensive review of non-metric proximity learning.

In this paper, we focus on general-purpose distance or dissimilarity measures which require no learning. We have evaluated the performance of the proposed data-dependent general-purpose dissimilarity measure \(m_p\) against the geometric general-purpose distance measures \(\ell _p\) and \(d_{cos}\). In the future, it would be interesting to investigate how learning can be applied to a data-dependent dissimilarity measure such as \(m_p\) to produce non-metric learning, and then to compare non-metric learning with metric learning.

Because of the implementation of \(m_p\) using bins, one can see some similarity with Locality Sensitive Hashing (LSH) [15]. However, the aims of binning are different in the two cases. In LSH, bins are used to quickly find a small set of candidate nearest neighbors of a test instance, among which the k-NNs are then searched using the Euclidean distance. In \(m_p\), in contrast, the probability data mass in bins is used directly as the measure of dissimilarity. It is an open question whether LSH can be used to generate a candidate set quickly for \(m_p\). LSH has nice theoretical bounds for the Euclidean distance, but it is not clear whether similar bounds can be derived for \(m_p\).

7 Conclusions and future work

In this paper, we proposed a new dissimilarity measure called “\(m_p\)-dissimilarity”. It estimates the dissimilarity between two instances in each dimension as a probability data mass in the region enclosing the two instances. The final dissimilarity between the two instances is estimated by combining all single-dimensional dissimilarities as in the case of \(\ell _p\). The fundamental difference between the formulations of \(m_p\) and \(\ell _p\) is the replacement of the geometric distance with the probability mass in each dimension.

Our empirical evaluations in k-NN classification and content-based multimedia information retrieval tasks show that \(m_p\) provides better closest matches than those provided by \(\ell _p\) and cosine distance in high-dimensional spaces. Its performance is more consistent across different data sets. By simply replacing the geometric distance in each dimension with the probability mass, k-NN using \(m_p\) significantly improves the performance of k-NN using \(\ell _p\) in many high-dimensional data sets.

In contrast to the commonly used distance measures, \(m_p\) does not use the values of instances in each dimension directly. Because it is based on data mass, it is insensitive to the units and scale of measurement and to differences in the variance of values between dimensions. Thus, it does not require any preprocessing such as min–max normalization to rescale values to the same range, standard deviation normalization to ensure unit variance across all dimensions, or TF-IDF weighting to adjust the importance of a term in a document.

Even though \(\ell _p\) can be made data dependent through the rank transformation, this is effective only when all instances have distinct values (or only a few duplicates) in each dimension. The data-dependent characteristic of \(m_p\), in contrast, applies regardless of whether many instances have duplicate values in many dimensions. Many instances having duplicate values in many dimensions is a common characteristic of high-dimensional data sets where data lies in a low-dimensional subspace. In such high-dimensional data sets, \(m_p\) produces better task-specific performance than \(\ell _p\) with the rank transformation.

Future work includes investigating learning for \(m_p\) and comparing the resulting non-metric learning with metric learning; examining the effectiveness of \(m_p\) in other data mining tasks such as clustering, anomaly detection, vector quantization and SVM kernel learning; and developing indexing schemes for \(m_p\) to speed up the nearest neighbor search in the case of large N.