Accelerating Metric Filtering by Improving Bounds on Estimated Distances

Mic, Vladimir; Zezula, Pavel

doi:10.1007/978-3-030-60936-8_1

Vladimir Mic¹⁶ &
Pavel Zezula¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12440))

Included in the following conference series:

International Conference on Similarity Search and Applications

919 Accesses
1 Citations

Abstract

Filtering is a fundamental strategy of metric similarity indexes to minimise the number of computed distances. Given a triple of objects for which distances of two pairs are known, the lower and upper bounds on the third distance can be set as the difference and the sum of these two already known distances, due to the triangle inequality rule of the metric space. For efficiency reasons, the tightness of bounds is crucial, but as angles within triangles of distances can be arbitrary, the worst case with zero and straight angles must also be considered for correctness. However, in data of real-life applications, the distribution of possible angles is skewed and extremes are very unlikely to occur. In this paper, we enhance the existing definition of bounds on the unknown distance with information about possible angles within triangles. We show that two lower bounds and one upper bound on each distance exist in case of limited angles. We analyse their filtering power and confirm high improvements of efficiency by experiments on several real-life datasets.

V. Mic and P. Zezula—This research was supported by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822).

Access provided by Autonomous University of Puebla. Download conference paper PDF

Fast Similarity Search with the Earth Mover’s Distance via Feasible Initialization and Pruning

Distance-Based Index Structures for Fast Similarity Search

Article 27 July 2017

Measuring similarity for query in geometric data

Article 10 September 2019

Keywords

1 Introduction

Metric spaces are often used to formalise a similarity of complex data objects from various domains. Given a domain of objects D, a metric space is pair (D, d) where $d: D \times D \mapsto \mathbb {R}^{+}_0$ is a distance function which quantifies the dissimilarity of objects. This function must be non-negative, symmetric, and the distances among three arbitrary objects from D must satisfy the triangle inequality. Metric similarity searching has become popular due to its wide applicability, and many metric indexes have been proposed [10, 12]. We consider the query by example paradigm: having a dataset $X\subseteq D$ and an arbitrary query object $q \in D$, the task is to efficiently find objects $o \in X$ that are close to q according to d.

The simplest type of a similarity query is the range query: for a threshold $\mathfrak {r}\ge 0$ and a query object $q \in D$, its solution is $\{o \in X |\, d(q,o) \le \mathfrak {r}\}$. Considering an arbitrary reference object $p_i$, called pivot, and assuming that distances $d(o, p_i)$ and $d(q, p_i)$ are known, the triangle inequalities define the lower bound on d(q, o):

$$\begin{aligned} d(q, o) \ge |d(o, p_i) - d(q, p_i)|. \end{aligned}$$

(1)

If the lower bound on distance d(q, o) given by an arbitrary pivot $p_i$ is greater than radius $\mathfrak {r}$, o cannot be in the answer of the range query. Accessing o and evaluation of d(q, o) can thus be avoided. By analogy, if the upper bound:

$$\begin{aligned} d(q, o) \le d(o, p_i) + d(q, p_i), \end{aligned}$$

(2)

given by triangle inequalities is smaller than $\mathfrak {r}$ for an arbitrary $p_i$, then o is guaranteed to be in the query answer, and the evaluation of d(q, o) can be skipped.

The triangle inequality rule enables to isometrically embed objects $q, o, p_i$ in 2D Euclidean space to form a triangle with sides d(q, o), $d(o, p_i)$, $d(p_i, q)$ [4]. We further focus on angles in this triangle in the Euclidean space. Please notice that findings in this article are valid for all metric spaces thanks to this embedding.

Bounds given by Eqs. 1 and 2 are tight, i.e., the equalities hold, if there are two zero angles and one straight angle within the triangle $q, o, p_i$ with distances d(q, o), $d(p_i, o)$, and $d(p_i, q)$. However, this is an unrealistic case in most of metric spaces that describe the similarity of complex real-life data.

We analyse triangle inequalities under the assumption of limited angles in triangles. We show that the limitation of angles can increase the lower bound given by Eq. 1 even by 66%, and decrease the upper bound given by Eq. 2 by 40% in real scenarios. Moreover, the third bound exists: the lower bound on a side based on a sum of lengths of two other sides in a triangle. These improvements have a dramatic impact on the filtering power of triangle inequalities.

Section 2 contains analysis to enhance triangle inequalities by angles limitation. Section 3 illustrates what happens if the limitation of angles is wrong. Section 4 defines, when are the newly proposed bounds correct. Section 5 provides instructions to set the angles limitation for similarity search. Section 6 presents experimental results and Sect. 7 concludes the paper.

Table 1. Notation used throughout this paper

Full size table

2 Triangle Inequalities with Limited Angles

We define novel lower and upper bounds on distances in this section, which are given by the triangle inequalities that assume the limited range of angles within triangles of distances. Specifically, we consider an arbitrary metric space (D, d) and three objects $q, o, p_i \in D$. The triangle with sides $\mathfrak {a}, \mathfrak {b}, \mathfrak {c}\in \mathbb {R}_0^{+}$ is given by pairwise distances between these objects, and we assume that the distance $\mathfrak {c}= d(q, o)$ is unknown and the distances $\mathfrak {a}$ and $\mathfrak {b}$ are already evaluated. We denote $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ the angles in triangle $\bigtriangleup \mathfrak {a}, \mathfrak {b}, \mathfrak {c}$ that are opposite to respective sides. Since we always focus on just one isolated triangle $\bigtriangleup \mathfrak {a}, \mathfrak {b}, \mathfrak {c}$, we can assume^{Footnote 1} $\mathfrak {a}\le \mathfrak {b}$ without a loss of generality for the application of the similarity search^{Footnote 2}. The whole notation is summarised in Table 1.

Two following lemmas form the core of the paper as they allow to define the bounds on $\mathfrak {c}$ given by triangle inequalities that consider a limited range of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$.

Lemma 1

For an arbitrary triangle with sides $\mathfrak {a}$, $\mathfrak {b}$, $\mathfrak {c}$ and corresponding angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ holds:

$$\mathfrak {c}= (\mathfrak {a}+ \mathfrak {b}) \cdot \frac{1 - \cos \mathfrak {\gamma }}{\cos \mathfrak {\alpha }+ \cos \mathfrak {\beta }}$$

Proof

All cosines in the fraction can be substituted using the cosine rule to get:

$$ (\mathfrak {a}+ \mathfrak {b}) \cdot \frac{1 - \cos \mathfrak {\gamma }}{\cos \mathfrak {\alpha }+ \cos \mathfrak {\beta }} = \frac{(\mathfrak {a}+ \mathfrak {b}) (1 - \frac{\mathfrak {a}^2 + \mathfrak {b}^2 - \mathfrak {c}^2}{2 \mathfrak {a}\mathfrak {b}})}{\frac{\mathfrak {b}^2 + \mathfrak {c}^2 - \mathfrak {a}^2}{2 \mathfrak {b}\mathfrak {c}} + \frac{\mathfrak {a}^2 + \mathfrak {c}^2 - \mathfrak {b}^2}{2 \mathfrak {a}\mathfrak {c}}} = \mathfrak {c}\frac{-\mathfrak {a}^3 + \mathfrak {a}^2 \mathfrak {b}+ \mathfrak {a}\mathfrak {b}^2 - \mathfrak {b}^3 + \mathfrak {a}\mathfrak {c}^2 + \mathfrak {b}\mathfrak {c}^2}{-\mathfrak {a}^3 + \mathfrak {a}^2 \mathfrak {b}+ \mathfrak {a}\mathfrak {b}^2 - \mathfrak {b}^3 + \mathfrak {a}\mathfrak {c}^2 + \mathfrak {b}\mathfrak {c}^2} = \mathfrak {c}$$

$\square $

Lemma 2

For an arbitrary triangle with sides $\mathfrak {a}$, $\mathfrak {b}$, $\mathfrak {c}$ and corresponding angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ holds:

$$ \mathfrak {c}= |\mathfrak {a}- \mathfrak {b}| \cdot \frac{1 + \cos \mathfrak {\gamma }}{|\cos \mathfrak {\alpha }- \cos \mathfrak {\beta }|} $$

Proof

The numerator of the fraction is non-negative, and all cosines in the fraction can be substituted using the cosine rule to get:

$$ \frac{|\mathfrak {a}- \mathfrak {b}| \cdot |1 + \cos \mathfrak {\gamma }|}{|\cos \mathfrak {\alpha }- \cos \mathfrak {\beta }|} = \frac{|(\mathfrak {a}- \mathfrak {b}) (1 + \frac{\mathfrak {a}^2 + \mathfrak {b}^2 - \mathfrak {c}^2}{2 \mathfrak {a}\mathfrak {b}})|}{|\frac{\mathfrak {b}^2 + \mathfrak {c}^2 - \mathfrak {a}^2}{2 \mathfrak {b}\mathfrak {c}} - \frac{\mathfrak {a}^2 + \mathfrak {c}^2 - \mathfrak {b}^2}{2 \mathfrak {a}\mathfrak {c}}|} = \mathfrak {c}\cdot \frac{|\mathfrak {a}^3 + \mathfrak {a}^2 \mathfrak {b}- \mathfrak {a}\mathfrak {b}^2 - \mathfrak {b}^3 - \mathfrak {a}\mathfrak {c}^2 + \mathfrak {b}\mathfrak {c}^2|}{|\mathfrak {a}^3 + \mathfrak {a}^2 \mathfrak {b}- \mathfrak {a}\mathfrak {b}^2 - \mathfrak {b}^3 - \mathfrak {a}\mathfrak {c}^2 + \mathfrak {b}\mathfrak {c}^2|} = \mathfrak {c}$$

$\square $

Lemma 1 expresses $\mathfrak {c}$ using the sum of $\mathfrak {a}$ and $\mathfrak {b}$ and the function of $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$. Notice thus the similarity with Eq. 2. Lemma 2 expresses $\mathfrak {c}$ using the difference of $\mathfrak {a}$ and $\mathfrak {b}$, similarly as Eq. 1, and another function of $\mathfrak {\alpha }$, $\mathfrak {\beta }$ and $\mathfrak {\gamma }$.

Real-life metric spaces contain triangles $\bigtriangleup {} \mathfrak {a}{}, \mathfrak {b}{}, \mathfrak {c}{}$ with angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ from a more narrow range than . Let us thus assume bounds $\varOmega _\textit{min}, \varOmega _\textit{max}$ on the angles such that $\forall \mathfrak {\alpha }, \mathfrak {\beta }, \mathfrak {\gamma }: \varOmega _\textit{min}\le \mathfrak {\alpha }, \mathfrak {\beta }, \mathfrak {\gamma }\le \varOmega _\textit{max}$. Please notice that bounds $\varOmega _\textit{min}$ and $\varOmega _\textit{max}$ are meaningful if and only if $0^{\circ } \le \varOmega _\textit{min}\le 60^{\circ } \le \varOmega _\textit{max}\le 180^{\circ }$, since $\mathfrak {\alpha }+ \mathfrak {\beta }+ \mathfrak {\gamma }= 180^{\circ }$. The key feature of Lemmas 1 and 2 is that $\varOmega _\textit{min}$ and $\varOmega _\textit{max}$ also limit the values of the fractions used in these lemmas. We further denote and alter these fractions as:

$$\begin{aligned} f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })= \frac{1 - \cos \mathfrak {\gamma }}{ \cos \mathfrak {\alpha }+ \cos \mathfrak {\beta }} = \frac{1 - \cos (180^{\circ } - \mathfrak {\alpha }- \mathfrak {\beta })}{ \cos \mathfrak {\alpha }+ \cos \mathfrak {\beta }} = \frac{1 + \cos (\mathfrak {\alpha }+ \mathfrak {\beta })}{ \cos \mathfrak {\alpha }+ \cos \mathfrak {\beta }}, \end{aligned}$$

(3)

$$\begin{aligned} f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })= \frac{1 + \cos \mathfrak {\gamma }}{ |\cos \mathfrak {\alpha }- \cos \mathfrak {\beta }| } = \frac{1 + \cos (180^{\circ } - \mathfrak {\alpha }- \mathfrak {\beta })}{|\cos \mathfrak {\alpha }- \cos \mathfrak {\beta }|} = \frac{1 - \cos (\mathfrak {\alpha }+ \mathfrak {\beta })}{|\cos \mathfrak {\alpha }- \cos \mathfrak {\beta }|} \end{aligned}$$

(4)

Intuitively, $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ is a coefficient exploited to express $\mathfrak {c}$ using the sum of $\mathfrak {a}$ and $\mathfrak {b}$, and $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ is utilised to express $\mathfrak {c}$ using the difference of $\mathfrak {a}$ and $\mathfrak {b}$.

We denote $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ and $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ the minimum possible values of $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ and $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ that are defined for a range of angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$. As the notation suggests, these minimum values define two lower bounds on $\mathfrak {c}$, since:

$$\begin{aligned} \mathfrak {c}= (\mathfrak {a}+ \mathfrak {b}) \cdot f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })\ge (\mathfrak {a}+ \mathfrak {b}) \cdot \mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}), \end{aligned}$$

$$\begin{aligned} \mathfrak {c}= |\mathfrak {a}- \mathfrak {b}| \cdot f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })\ge |\mathfrak {a}- \mathfrak {b}| \cdot \mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})\end{aligned}$$

We denote these lower bounds as:

$$\begin{aligned} \textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})= (\mathfrak {a}+ \mathfrak {b}) \cdot \mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}), \end{aligned}$$

(5)

$$\begin{aligned} \textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})= |\mathfrak {a}- \mathfrak {b}| \cdot \mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max}) \end{aligned}$$

(6)

Similarly, maximum possible values of $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ and $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ for a range $[\varOmega _\textit{min}, \varOmega _\textit{max}]$, denoted as $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ and $\mathcal {C}_{\textit{UB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$, define two upper bounds on $\mathfrak {c}$, since:

$$\begin{aligned} \mathfrak {c}= (\mathfrak {a}+ \mathfrak {b}) \cdot f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })\le (\mathfrak {a}+ \mathfrak {b}) \cdot \mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})\end{aligned}$$

$$\begin{aligned} \mathfrak {c}= |\mathfrak {a}- \mathfrak {b}| \cdot f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })\le |\mathfrak {a}- \mathfrak {b}| \cdot \mathcal {C}_{\textit{UB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max}) \end{aligned}$$

(7)

and we denote just the first one as:

$$\begin{aligned} \textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})= (\mathfrak {a}+ \mathfrak {b}) \cdot \mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}) \end{aligned}$$

(8)

A derivation of value $\mathcal {C}_{\textit{UB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ for given $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ is simple as it is infinity for all meaningful ranges $[\varOmega _\textit{min}, \varOmega _\textit{max}]$. This is given by the denominator in Eq. 4, which is zero for $\mathfrak {\alpha }= \mathfrak {\beta }$. Equation 7 thus defines a trivial upper bound on $\mathfrak {c}$: infinity, for all meaningful ranges $[\varOmega _\textit{min}, \varOmega _\textit{max}]$.

A derivation of concrete values of $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$, $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$, and $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ is slightly complicated as angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ are limited not only by $\varOmega _\textit{min}$ and $\varOmega _\textit{max}$, but also by equation $\mathfrak {\alpha }+ \mathfrak {\beta }+ \mathfrak {\gamma }= 180^{\circ }$. For this reason, we immediately formulate Algorithms 1–3 that evaluate $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$, $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$, and $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ for given $\varOmega _\textit{min}$ and $\varOmega _\textit{max}$.

Table 2. Examples of triangle inequalities for given ranges of angles $\varOmega _\textit{min}$, $\varOmega _\textit{max}$

Full size table

Table 2 gives examples of newly derived lower and upper bounds on $\mathfrak {c}$ for selected ranges $[\varOmega _\textit{min}{}, \varOmega _\textit{max}{}]$. We choose these ranges to illustrate several features:

We limit the angles by trivial values $[\varOmega _\textit{min}{}, \varOmega _\textit{max}{}]$ = in the first line, and we get the pure triangle inequalities.
The second line represents another extreme case: if all angles are $60^{\circ }{}$, i.e. the triangle $\bigtriangleup {} \mathfrak {a}{}, \mathfrak {b}{}, \mathfrak {c}{}$ is equilateral, bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is not defined^{Footnote 3}, and bounds $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ and $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ are tight. Together, they give the precise value $\mathfrak {c}{} = 0.5 \cdot (\mathfrak {a}{} + \mathfrak {b}{})$.
The lower bound $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is zero, and thus ineffective in case of trivially bounded angles $[\varOmega _\textit{min}{}, \varOmega _\textit{max}{}]$ = . If the angles are more limited, this bound can bring a new effective limitation on $\mathfrak {c}$.
The last row of the table illustrates that the bounds are not improved beyond pure triangle inequalities when preserving $\varOmega _\textit{min}{} = 0^{\circ }$ and decreasing $\varOmega _\textit{max}$ to $90^{\circ }$.
The table confirms a contribution of angles limitation. For instance, if all angles are guaranteed to be within range , $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is decreased by 31.6% to $0.684 \cdot (\mathfrak {a}+ \mathfrak {b})$ and $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is almost doubled to $1.938 \cdot |\mathfrak {a}- \mathfrak {b}|$, in comparison with unlimited angles. Moreover, the lower bound $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})= 0.259 \cdot (\mathfrak {a}+ \mathfrak {b})$ is established.

Table 3. Examples of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ that do not meet the limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ = and the consequences for the newly proposed bounds on $\mathfrak {c}$ that assume this angles limitation. The wrong assumption may, but does not have to lead to wrong bounds on distances. Wrong coefficients and angles are in red.

Please notice that if the maximum permitted angle is e.g. $80^{\circ }$, the sum of two arbitrary angles in a triangle $\bigtriangleup {} \mathfrak {a}{}, \mathfrak {b}{}, \mathfrak {c}{}$ is at most 160$^{\circ }{}$, and thus all angles within triangles are at least $20^{\circ }$. A setting of $\varOmega _\textit{min}$ smaller than $20^{\circ }$ for $\varOmega _\textit{max}= 80^{\circ }$ thus does not play a role as $\varOmega _\textit{min}$ is effectively at least $20^{\circ }$ in this case. Similarly, if e.g. $\varOmega _\textit{min}= 30^{\circ }$, then $\varOmega _\textit{max}$ is effectively at most $120^{\circ }$ as the sum of two smallest angles in a triangle is at least $60^{\circ }$. These features are taken into account by Algorithms 1–3.

3 Impact of Wrong Angles Limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}]$

Real-life metric space similarity models usually do not guarantee bounds on angles $\varOmega _\textit{min}$ and $\varOmega _\textit{max}$. In these cases, the angles limitation can be set experimentally to be valid for a vast majority of all triangles within a given metric space. Consequences of imprecise bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ can be of various kinds, as we illustrate by Table 3. Here, we assume limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ = , and show three examples of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ within triangles such that they violate the angles limitation.

Examples of the Upper Bound ${\varvec{UB}}_{\varvec{sum}}(\varvec{\mathfrak {a}}, \varvec{\mathfrak {b}}, \varOmega _{{\varvec{min}}}, \varOmega _{{\varvec{max}}})$ : The fourth column of Table 3 contains values $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ defined by Eq. 3 that are evaluated for actual angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ given in columns 1–3. In case of the first and second row of the table, this value is smaller than the value $\mathcal {C}_\textit{UB}\_\textit{sum}(30^{\circ }, 80^{\circ })$ which is provided by the fifth column of the table. Therefore, the upper bound $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct in case of these two rows despite wrong bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$. Specifically, in case of the first row holds:

$$\mathfrak {c}= 0.679 \cdot (\mathfrak {a}+ \mathfrak {b}) ~~~ \le ~~~ 0.684 \cdot (\mathfrak {a}+ \mathfrak {b}) = \textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$$

and the case of the second row is analogous. In case of the last row, wrong bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ cause a wrong upper bound $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$, since:

$$\mathfrak {c}= 0.696 \cdot (\mathfrak {a}+ \mathfrak {b}) ~~~ > ~~~ 0.684 \cdot (\mathfrak {a}+ \mathfrak {b}) = \textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$$

Wrong limitation of angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ in metric spaces thus can, and does not have to lead to a wrong upper bound $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$.

Examples of the Lower Bound ${\varvec{LB}}_{{\varvec{diff}}}(\varvec{\mathfrak {a}}, \varvec{\mathfrak {b}}, \varOmega _{{\varvec{min}}}, \varOmega _{{\varvec{max}}})$ : Column 6 of Table 3 contains values $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ defined by Eq. 4 that are evaluated for actual angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ given in columns 1–3. In case of first two rows of the table, these values are bigger than $\mathcal {C}_\textit{LB}\_\textit{diff}(30^{\circ }, 80^{\circ })$ which is presented in the seventh column. Therefore, the lower bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct in case of corresponding triangles $\bigtriangleup {} \mathfrak {a}{}, \mathfrak {b}{}, \mathfrak {c}{}$ despite wrong bounds on angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$. Specifically, in case of the first row holds:

$$\mathfrak {c}= 1.963 \cdot |\mathfrak {a}- \mathfrak {b}| ~~~ \ge ~~~ 1.938 \cdot |\mathfrak {a}- \mathfrak {b}| = \textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$$

and the case of the second row is analogous. In case of the last row, wrong bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ imply a wrong lower bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$, since:

$$\mathfrak {c}= 1.754 \cdot |\mathfrak {a}- \mathfrak {b}| ~~~ < ~~~ 1.938 \cdot |\mathfrak {a}- \mathfrak {b}| = \textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$$

Wrong limitation of angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ in metric spaces thus can, and does not have to lead to a wrong lower bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$.

Examples of the Lower Bound ${\varvec{LB}}_{\varvec{sum}}(\varvec{\mathfrak {a}}, \varvec{\mathfrak {b}}, \varOmega _{{\varvec{min}}}, \varOmega _{{\varvec{max}}})$ : Examples for the lower bound $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ are provided in columns 8 and 9 of the Table 3. The same reasoning as in the case of lower bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ reveals that the lower bound $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct in case of the first and third row of the Table 3, and wrong in case of the second row.

We have also proved that all new bounds on $\mathfrak {c}$: $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$, $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ and $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ can be correct at the same time even in case of a triangle that violates the assumption about the range of angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ – example is given by the first row of Table 3. The key question thus is, when are bounds correct, and when they are not.

4 When Are the Bounds on Distances Correct?

Table 2 proves that there exist different values $\varOmega _\textit{min}$, $\varOmega _\textit{max}$ that imply the same value $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$. This is also true for coefficients $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ and $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$. Identification of bounds $\varOmega _\textit{min}$, $\varOmega _\textit{max}$ that imply a fixed value of each of these coefficients will enable us to formally describe triangles for which are the newly proposed bounds on $\mathfrak {c}$ correct, and for which they are not.

Let us assume a given value of $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$. Since this is a minimal value of $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$, we can define values $\mathfrak {\alpha }$ and $\mathfrak {\beta }$ that imply the given value of $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ = $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ by an analysis of Eq. 4:

$$\begin{aligned} \mathfrak {\alpha }= 2 \cdot \arccos \left( \frac{\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \cos \mathfrak {\beta }+ 1}{\sqrt{\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max}){}^2 + 2\cdot \mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \cos \mathfrak {\beta }+ 1}}\right) - \mathfrak {\beta } \end{aligned}$$

(9)

To facilitate an understanding of this equation, we introduce plots as is the one in Fig. 1. It depicts the angles $\mathfrak {\alpha }$ and $\mathfrak {\beta }$ on axes y and x, respectively. Angle $\mathfrak {\gamma }= 180^{\circ } - \mathfrak {\alpha }- \mathfrak {\beta }$ also exists, despite it is not explicitly shown in the plot. The inequality $\mathfrak {\alpha }+ \mathfrak {\beta }\le 180^{\circ }$ limits the meaningful part of the plot, as well as the assumption $\mathfrak {\alpha }\le \mathfrak {\beta }$ used without a loss of generality for the applications in the similarity searching (see Sect. 2). These limitations are depicted by black lines in the figure, so we consider just the triangular area below these lines in the following.

Function given by Eq. 9 for value $\mathcal {C}_\textit{LB}\_\textit{diff}(30^{\circ }, 80^{\circ })$ is depicted by a blue curve in the Fig. 1. It is easy to verify that points $[\mathfrak {\alpha }, \mathfrak {\beta }]$ below this curve imply smaller values $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ than $\mathcal {C}_\textit{LB}\_\textit{diff}(30^{\circ }, 80^{\circ })$, and points above the curve imply bigger value $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ than $\mathcal {C}_\textit{LB}\_\textit{diff}(30^{\circ }, 80^{\circ })$. Formally:

if $\mathfrak {\alpha }$ is smaller than the right side of Eq. 9, then $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ is smaller than $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$,
if $\mathfrak {\alpha }$ is bigger than the right side of Eq. 9, then $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ is bigger than $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$,
if Eq. 9 holds, then $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ is equal to $\mathcal {C}_{\textit{LB}\_\textit{diff}}(\varOmega _\textit{min}, \varOmega _\textit{max})$.

Therefore, the lower bound $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct for all triangles with $\mathfrak {\alpha }$ bigger or equal to the right side of Eq. 9, and wrong for the others. If $\mathfrak {\alpha }$ equals to the right side of Eq. 9, then the lower bound is tight.

Similarly, we analyse Eq. 3 to reveal, when are the lower and upper bounds on $\mathfrak {c}$ based on a sum of $\mathfrak {a}$ and $\mathfrak {b}$ correct, tight, and wrong, respectively. The relation between angles $\mathfrak {\alpha }$ and $\mathfrak {\beta }$ that imply a given value of $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ = $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ is:

$$\begin{aligned} \mathfrak {\alpha }= 2 \cdot \arccos \left( \frac{\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \sin \mathfrak {\beta }}{\sqrt{\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){}^2 - 2\cdot \mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \cos \mathfrak {\beta }+ 1}}\right) - \mathfrak {\beta } \end{aligned}$$

(10)

This function is depicted by the orange curve for angles limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}] = [30^{\circ }, 80^{\circ }]$ in Fig. 1, and its semantics is the following:

If $\mathfrak {\alpha }$ is smaller or equal to the right side of Eq. 10, then the lower bound $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct,
if $\mathfrak {\alpha }$ is bigger than the right side of Eq. 10, then this lower bound is wrong,
if Eq. 10 holds, this lower bound is tight.

Finally, the relation between angles $\mathfrak {\alpha }$ and $\mathfrak {\beta }$ that imply a given value of $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ = $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ is^{Footnote 4}:

$$\begin{aligned} \mathfrak {\alpha }= 2 \cdot \arccos \left( \frac{\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \sin \mathfrak {\beta }}{\sqrt{\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){}^2 - 2\cdot \mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max}){} \cdot \cos \mathfrak {\beta }+ 1}}\right) - \mathfrak {\beta } \end{aligned}$$

(11)

and this function is depicted in Fig. 1 by the green curve for angles limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}] = [30^{\circ }, 80^{\circ }]$. The semantics of this equation is the following:

If $\mathfrak {\alpha }$ is bigger or equal to the right side of Eq. 11, then the upper bound $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ is correct,
if $\mathfrak {\alpha }$ is smaller than the right side of Eq. 11, then this upper bound is wrong,
if Eq. 11 holds, then this upper bound is tight.

Therefore, all three bounds on $\mathfrak {c}$ are correct in case of triangles whose angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$ are depicted between colour curves in the plot like in Fig. 1.

5 Setting Bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ for Similarity Search

Test Data. In the experiments that follow, we use three different high dimensional datasets, comprising DeCAF, SIFT and MPEG7 image visual descriptors.

DeCAF descriptors [5] are extracted from the Profiset image collection^{Footnote 5}. These descriptors derive from the Alexnet convolutional neural network [6], from which data from the second-last fully connected layer (FC7) is extracted as a 4,096-dimensional array of floating-point values. It has been demonstrated that Euclidean distance applied to the post-Relu [9] descriptors gives a good surrogate for semantic similarity over the original images [5].

SIFT descriptors [7] serve us as another real-life example of descriptors of images. Each descriptor comprises 128 floating point values. This dataset is known as ANN_SIFT1M dataset^{Footnote 6}.

MPEG7 visual descriptors [8] are provided by the CoPhIR data collection [2]. Each of five sub-descriptors is accompanied with a suitable metric function, and all five sub-descriptor spaces are combined into a single metric space (D, d) by a weighted sum of particular distances [1]. In total, this representation can be viewed as a 280-dimensional vector compared by non-Minkowski distance.

Selecting Angles $[\varOmega _{{\varvec{min}}}, \varOmega _{{\varvec{max}}}]$ for a Good Space Approximation. A simple way to adjust the lower and upper bounds on $\mathfrak {c}$ is to sample random triangles of distances from the searched metric data and depict angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$ in a plot like is the one in Fig. 1. The limitation of angles $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ should be selected to wrap the points by curves (the green, blue and orange) as tightly as possible.

We are, however, interested in the application of newly proposed bounds on $\mathfrak {c}$ to speed-up the similarity search. We thus have to pay special attention to triangles of distances $\bigtriangleup \mathfrak {a}{}, \mathfrak {b}{}, \mathfrak {c}{}$ where $\mathfrak {c}$ is an extremely small distance considering the data. These extreme cases form a significantly different distribution of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, and moreover, they are not effectively sampled by random triangles.

Selecting Angles $[\varOmega _{{\varvec{min}}}, \varOmega _{{\varvec{max}}}]$ for a Similarity Search. We thus randomly select 1000 objects $q \in D$ and find their 100 nearest neighbours $o_\textit{NN}$ from a random sample of X of size 100,000 objects. We denote $\mathfrak {c}= d(q, o_\textit{NN})$ and for each $o_\textit{NN}$ select another 100 random pivots $p_i \in X$ to form a triangle $\bigtriangleup \mathfrak {a}, \mathfrak {b}, \mathfrak {c}$ where $\mathfrak {a}$, $\mathfrak {b}$ are distances $d(q, p_i)$, $d(o_\textit{NN}, p_i)$ and $\mathfrak {b}\ge \mathfrak {a}$. Together, we have 10 million samples of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$ for each dataset.

We depict both angles distributions in Figs. 2a and b^{Footnote 7}. Specifically, purple points (MPEG7) and black circles (100NN – MPEG7) depict angles in triangles sampled in random, and with focus on nearest neighbours, respectively. Distributions are significantly different, and plots for the DeCAF and SIFT descriptors confirm this as well, though not shown due to the limited paper length.

Figure 2a also depicts angles $[\mathfrak {\alpha }, \mathfrak {\beta }]$ that imply $f_\textit{sum}(\mathfrak {\alpha }, \mathfrak {\beta })$ and $f_\textit{diff}(\mathfrak {\alpha }, \mathfrak {\beta })$ of the same values as are given by $\mathcal {C}_\textit{LB}\_\textit{diff}(16^{\circ }, 110^{\circ })$, $\mathcal {C}_\textit{LB}\_\textit{sum}(16^{\circ }, 110^{\circ })$, and $\mathcal {C}_\textit{UB}\_\textit{sum}(16^{\circ }, 110^{\circ })$. These curves, shown again in blue, orange and green, tightly embrace angles from randomly sampled triangles (purple points).

Figure 2b illustrates that a distribution of angles $[\mathfrak {\alpha }, \mathfrak {\beta }]$ from triangles with a near neighbour makes impossible to select bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ such that all three curves tightly embrace (the black) sampled points. This is caused by asymmetric semantics of angles $\mathfrak {\alpha }$, $\mathfrak {\beta }$, $\mathfrak {\gamma }$ in these triangles. We experimentally verified that $\varOmega _\textit{min}$ cannot be bigger than $4^{\circ }{}$ to set properly the orange curve (i.e. coefficient $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$). Consequently, this $\varOmega _\textit{min}$ implies the minimum meaningful value of $\varOmega _\textit{max}{} = 88^{\circ }{}$ due to equation $\mathfrak {\alpha }+ \mathfrak {\beta }+ \mathfrak {\gamma }= 180^{\circ }$, and limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}] = [4^{\circ }, 88^{\circ }]$ defines a very loose embrace of the sampled points by the green and blue curve – compare distribution of black circles “100NN – MPEG7” in Fig. 2b with the coloured curves.

Therefore, we propose to set bounds $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ independently for each of the newly proposed bounds $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$, $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$, $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ on $\mathfrak {c}$ to maximise their tightness. The formal approach to select these values of $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ forms the future work. For now, we just identify $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ for all curves and datasets to tightly wrap angles $[\mathfrak {\alpha }, \mathfrak {\beta }]$ sampled with a focus on nearest neighbours. These wrappings are illustrated in Fig. 3, and the values $[\varOmega _\textit{min}, \varOmega _\textit{max}]$ with corresponding bounds on $\mathfrak {c}$ are provided in Table 4.

Table 4. Selected angles limitation $[\varOmega _\textit{min}, \varOmega _\textit{max}]$, and new bounds on $\mathfrak {c}$

Full size table

6 Searching with New Bounds

We experimentally verify the contribution of proposed bounds on $\mathfrak {c}$ to the similarity search. We search for $k=10$ and $k=100$ closest objects (k nearest neighbours – kNN) to 1,000 randomly selected query objects $q \in D$ in 1 million datasets, and use a data filtering to speed-up the search. In particular, we select 256 pivots $p_i \in D$ and pre-compute distances $d(o, p_i), o \in X$ for all pivots. When the query object comes, we evaluate all distances $d(q, p_i)$, and then for each $o \in X$ we evaluate the biggest lower bound on d(q, o) provided by an arbitrary pivot $p_i$ – it can be either $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ or $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$. If it is bigger than the distance $\mathfrak {r}$ of the current kth nearest object to q, the evaluation of d(q, o) is skipped. Otherwise, the lowest upper bound $\textit{UB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ provided by pivots is computed, and if it is smaller than $\mathfrak {r}$, the evaluation of d(q, o) is skipped and o is added to the query answer. Please notice that the order of the verification of bounds matters, since they are sometimes in a contradiction.

The proposed bounds on $\mathfrak {c}$ do not provide any guarantees on their precision. Nevertheless, 999 out of a thousand 10NN queries on each dataset are evaluated with recall 1, i.e. all 10 true nearest neighbours are returned. Query answers may contain more than k objects due to object involvements based on upper-bounds^{Footnote 8}. But despite of this, the median answer on 10NN queries contains 10 objects in case of each examined dataset. Answers to 100NN queries contain all 100 true nearest neighbours in case of 955, 963 and 923 query objects in case of the DeCAF, SIFT, and MPEG7 dataset, respectively. The biggest answers, 105 on median, are returned in case of the MPEG7 dataset.

Numbers of saved distance computations are presented in Fig. 4. Box-plots describe the distribution of values over particular query objects. Dark-grey box-plots form the baseline, i.e. the metric filtering with bounds given by Eqs. 1 and 2. Light-grey box-plots present results achieved by newly proposed bounds. Median numbers of skipped distance computations increase from 0.4% to 11.8% (DeCAF), from 59.4% to 75.2% (SIFT), and from 64.5% to 80.2% (MPEG7) in case of 10NN queries. The results are coherent with 100NN queries.

7 Conclusions and Future Work

We analysed consequences of limited angles within triangles of distances in metric spaces and their impact on the bounds on distances given by triangle inequalities. We derived a new lower bound on a distance $\textit{LB}_\textit{sum}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})$ which is based on a sum of two opposite sides in a triangle. Our findings have a strong impact on the filtering power of triangle inequalities, which we confirmed by experiments with 3 real-life datasets. Moreover, the proposed enhancement of the filtering is extremely precise, as only 3 out of three thousand 10NN queries did not provide the query answer with the recall 1 in our experiments. The proposed method can be immediately incorporated into metric-based indexes to improve their efficiency, thanks to its simplicity and practically no overhead. In the future work, we would like to clarify the relation of this work to convex transforms of distance functions [3, 11]. We also would like to develop algorithms able to set angle limitations automatically. Also, an automatic setting of the angle limitations for each pivot independently might increase the efficiency of filtering even further. We plan to report such findings in our future publications.

Notes

1.
The assumption is used just in the last sections of the article, starting from Sect. 4.
2.
The swap of distances $\mathfrak {a}$ and $\mathfrak {b}$ is achieved by swapping the notation of objects q and o. While this swaps lengths $\mathfrak {a}$ and $\mathfrak {b}$, it preserves the distance $\mathfrak {c}= d(q, o)$ as distances d(q, o) and d(o, q) are symmetric.
3.
In this case, $\textit{LB}_\textit{diff}(\mathfrak {a}, \mathfrak {b}, \varOmega _\textit{min}, \varOmega _\textit{max})= 0 \cdot \infty $, which is an indefinite expression.
4.
This equation is derived in the same way as Eq. 10, as $\mathcal {C}_{\textit{UB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$ is given by the same function as $\mathcal {C}_{\textit{LB}\_\textit{sum}}(\varOmega _\textit{min}, \varOmega _\textit{max})$.
5.
http://disa.fi.muni.cz/profiset/.
6.
http://corpus-texmex.irisa.fr/.
7.
We depict just non-overlapping points due to hardware limitations.
8.
This is also caused by a gradual shrinking of a search radius to evaluate kNN queries.

References

Batko, M., et al.: Building a web-scale image similarity search system. Multimed. Tools Appl. 47(3), 599–629 (2010)
Article MathSciNet Google Scholar
Bolettieri, P., et al.: CoPhIR: a test collection for content-based image retrieval. CoRR abs/0905.4627v2 (2009)
Google Scholar
Connor, R., Dearle, A., Mic, V., Zezula, P.: On the application of convex transforms to metric search. Pattern Recognit. Lett. 138, 563–570 (2020)
Article Google Scholar
Connor, R., Vadicamo, L., Cardillo, F.A., Rabitti, F.: Supermetric search. Inf. Syst. 80, 108–123 (2019)
Article Google Scholar
Donahue, J.: DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of International Conference on Machine Learning, ICML, Beijing, pp. 647–655 (2014)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
Google Scholar
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, Kerkyra, Corfu, Greece, 20–25 September 1999, pp. 1150–1157 (1999)
Google Scholar
MPEG7: Multimedia content description interfaces. Part 3: Visual (2002)
Google Scholar
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML 2010, pp. 807–814. Omnipress, USA (2010)
Google Scholar
Novak, D., Kyselak, M., Zezula, P.: On locality-sensitive indexing in generic metric spaces. In: Third International Workshop on Similarity Search and Applications. SISAP 2010, 18–19 September 2010, Istanbul, Turkey, pp. 59–66. ACM (2010)
Google Scholar
Skopal, T.: Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Trans. Database Syst. 32(4), 29 (2007)
Article Google Scholar
Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search - The Metric Space Approach, Advances in Database Systems, vol. 32. Springer, Heidelberg (2006)
Book Google Scholar

Download references

Author information

Authors and Affiliations

Masaryk University, Brno, Czech Republic
Vladimir Mic & Pavel Zezula

Authors

Vladimir Mic
View author publications
You can also search for this author in PubMed Google Scholar
Pavel Zezula
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vladimir Mic .

Editor information

Editors and Affiliations

National Institute of Informatics, Tokyo, Japan
Shin'ichi Satoh
ISTI-CNR, Pisa, Italy
Lucia Vadicamo
University of Southern Denmark, Odense M, Denmark
Arthur Zimek
ISTI-CNR, Pisa, Italy
Fabio Carrara
University of Bologna, Bologna, Italy
Ilaria Bartolini
IT University of Copenhagen, Copenhagen, Denmark
Martin Aumüller
IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
IT University of Copenhagen, Copenhagen, Denmark
Rasmus Pagh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Mic, V., Zezula, P. (2020). Accelerating Metric Filtering by Improving Bounds on Estimated Distances. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science(), vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_1

Download citation

DOI: https://doi.org/10.1007/978-3-030-60936-8_1
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Accelerating Metric Filtering by Improving Bounds on Estimated Distances

Abstract

Similar content being viewed by others

Fast Similarity Search with the Earth Mover’s Distance via Feasible Initialization and Pruning

Distance-Based Index Structures for Fast Similarity Search

Measuring similarity for query in geometric data

Keywords

1 Introduction