Learning Distance Estimators from Pivoted Embeddings of Metric Objects

Carrara, Fabio; Gennaro, Claudio; Falchi, Fabrizio; Amato, Giuseppe

doi:10.1007/978-3-030-60936-8_28

Fabio Carrara¹⁶,
Claudio Gennaro¹⁶,
Fabrizio Falchi¹⁶ &
…
Giuseppe Amato¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12440))

Included in the following conference series:

International Conference on Similarity Search and Applications

862 Accesses
2 Citations
1 Altmetric

Abstract

Efficient indexing and retrieval in generic metric spaces often translate into the search for approximate methods that can retrieve relevant samples to a query performing the least amount of distance computations. To this end, when indexing and fulfilling queries, distances are computed and stored only against a small set of reference points (also referred to as pivots) and then adopted in geometrical rules to estimate real distances and include or exclude elements from the result set. In this paper, we propose to learn a regression model that estimates the distance between a pair of metric objects starting from their distances to a set of reference objects. We explore architectural hyper-parameters and compare with the state-of-the-art geometrical method based on the n-simplex projection. Preliminary results show that our model provides a comparable or slightly degraded performance while being more efficient and applicable to generic metric spaces.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Data-Driven Learned Metric Index: An Unsupervised Approach

Query Filtering with Low-Dimensional Local Embeddings

Real-Valued Embeddings and Sketches for Fast Distance and Similarity Estimation

Article 24 November 2016

Keywords

1 Introduction

Thanks to the impetus given by recent developments in deep learning, machine learning has gained unprecedented popularity and spread into unimaginable number domains of computing. Perhaps one of the most unexpected application domains is that of index structures. In an exploratory research paper, Kraska et al. [8] showed how machine learning models, including deep learning ones, can fully or partially replace existing index structures, such as B-Tree or Bloom filters. This work paved the way for a whole new area of research in index structure, termed learned index. The core idea of learned indexes is to obtain a more compact index representation or performance gains by learning from data distribution.

In particular, Kraska et al. show that an index can be seen as a model f that predicts the position y of a record x, i.e. $y = f(x)$. Seen with the eyes of machine learning, this problem is known as a regression problem. Therefore, in this context a learned index is simply an ML model (such as linear regression or a neural network) that replaces a B-Tree and predicts the position of query key. In this work, we would like to generalize the concept of learned index by providing a broader definition, in which the regression allows us to predict the representation of metric objects in vector form from the knowledge of distances from some anchors. This type of transformation of data is useful in the problems of searching in large-scale metric spaces, where a compromise between accuracy and speed of response to queries is often required. In large-scale scenarios, the amount of distance computations between objects needed for an exact search tends to saturate the available computational budget for obtaining reasonable response times, considering also that in metric spaces, distance functions are often expensive to compute.

The idea of reconstructing the distance between any pair of objects in a metric space by exploiting distances with a group of reference objects was probably first addressed in [9]. The authors proposed an embedding into another metric space where it is possible to deduce upper and lower bounds on the actual distance of any pair of objects. Connor et al. [4,5,6] observed that for a large class of metric spaces, distances to a set of n pivots can be used to project the data objects into a n-dimensional Euclidean space such that in the projected space the Euclidean distance between any two points is an upper or lower bound of the actual distance. They called this approach n-Simplex projection, and they proved that it can be used in all the metric spaces meeting the n-point property [2]. As also pointed out in [3], many common metric spaces meet the desired property, like Cartesian spaces of any dimension with the Euclidean, cosine or quadratic form distances, probability spaces with the Jenson-Shannon or the Triangular distance, and more generally any Hilbert-embeddable space [2, 10]. This approach has recently been used in an inverted index for approximate research on the n-nearest neighbors to obtain an estimate of the real distances of the objects present in the results of a query [12].

We intend to develop a similar approach to the n-simplex projection, however instead of using a handcrafted deterministic algorithm, in this paper we test a machine learning approach based on deep neural networks to probe their capabilities in this context. The rest of the paper is structured as follows. Section 2 describes the general idea of method developed and the model used. Section 3 presents the dataset and experimental evaluation. Section 4 concludes.

2 Method

2.1 Model Definition

Let $\mathcal {X}$ a metric space with distance function $d: \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}^+$ and $\mathcal {P}= \{\,p_i \in \mathcal {X}: i = 1 \dots N\,\}$ a set of reference points (or pivots) in $\mathcal {X}$. We define the pivoted embedding $e(x, \mathcal {P})$ of an object $x \in \mathcal {X}$ w.r.t $\mathcal {P}$ as an N-dimensional real-valued vector where the i-th component is the distance between x and the i-th pivot, i.e.

$$\begin{aligned} e(x,\mathcal {P}) = \left[ \, d(x, p_1), \dots , d(x, p_i), \dots , d(x, p_N) \,\right] \in \mathbb {R}^N \,. \end{aligned}$$

(1)

We are interested in estimating the distance d(x, y) between a pair of objects $x, y \in \mathcal {X}$ given their pivoted embeddings $\mathbf {e}_x = e(x, \mathcal {P}), \mathbf {e}_y = e(y, \mathcal {P})$ with respect to a common set of reference objects $\mathcal {P}$. We formulate this task as a regression problem: we define a parametric model f that outputs the estimated distance given the pivoted embeddings and optimize its parameters on a training set via gradient descent. In addition to $\mathbf {e}_x$ and $\mathbf {e}_y$, we include the distances between pivots $\{\, d(p_i, p_j) : i=1 \dots N, j = 1 \dots N, i < j \,\}$ in the inputs of our model as a real-valued vector $\mathbf {p}\in \mathbb {R}^\frac{N(N-1)}{2}$ is commonly computed once offline and available. Formally,

(2)

where indicates the estimate for d(x, y) of the model f having parameters $\theta $. Following common practice in metric learning, we define

$$\begin{aligned} f(\mathbf {e}_x, \mathbf {e}_y, \mathbf {p}; \theta ) = |\varPhi (\mathbf {e}_x, \mathbf {p}; \theta ) - \varPhi (\mathbf {e}_y, \mathbf {p}; \theta )|_2 \,, \end{aligned}$$

(3)

where $\varPhi (\mathbf {e}, \mathbf {p}; \theta )$ is a neural network that takes as input a pivoted embedding $\mathbf {e}$ and the distances between pivots $\mathbf {p}$ and outputs a real-valued vector representation.

As the architecture of $\varPhi (\mathbf {e}, \mathbf {p}; \theta )$, we choose a two-branch fully-connected residual network: $\mathbf {e}$ and $\mathbf {p}$ are independently processed by two MLPs with residual connections whose outputs are then merged by concatenation and followed by one or more additional fully-connected layer. Each branch comprises multiple residual blocks [7] having the structure reported in Fig. 1a. We explore and evaluate architectural hyperparameters such as depth and branch merging point in Sect. 3.

2.2 Model Training

We train our model with mini-batch gradient descent. Given a training set $\mathcal {X}_\text {tr} \subset \mathcal {X}$, to form a training batch we randomly draw N objects as pivots $\mathcal {P}$ and B pairs of objects $(x_i, y_i), i=1 \dots B, x_i, y_i \in \mathcal {X}_\text {tr}$, and we adopt the original metric distance d to obtain the inputs (the pivoted embedding of the objects $\mathbf {e}_{x_i}, \mathbf {e}_{y_i}$ and distances between pivots $\mathbf {p}$) and the target (exact distances between objects $d(x_i, y_i)$) of our model. We optimize the loss function

$$\begin{aligned} \mathcal {L}= \frac{1}{B} \sum _{i=1}^B \, \text {SmoothL1} \left( \, f( \mathbf {e}_{x_i}, \mathbf {e}_{y_i}, \mathbf {p}),\, d(x_i, y_i) \, \right) , \end{aligned}$$

(4)

with

$$\begin{aligned} \text {SmoothL1}(a,b) = {\left\{ \begin{array}{ll} \frac{1}{2} (a - b)^2, &{} \text {if } |a - b| < 1 \\ |a - b| - \frac{1}{2}, &{} \text {otherwise} \,. \end{array}\right. } \end{aligned}$$

(5)

We choose the SmoothL1 function to avoid huge gradients in the early phase of training that often lead to numerical instabilities. At each batch, we sample new pivots and couples and repeat the procedure. We periodically evaluate our model on a test set $\mathcal {X}_\text {ts}$ by measuring the estimation error, and we adopt early stopping when the test error stops decreasing.

3 Experiments

3.1 Dataset

Throughout all experiments, we adopt a subset of the YFCC100M-HNfc6 deep features dataset [1]—a dataset of 100M 4096-dimensional features extracted from YFCC100M [11] images using the HybridCNN [13] deep convolutional pretrained network and selecting the output of the fc6 layer. In the original space, features are compared with the $L_2$ distance, i.e. $d = L_2$. We select the first 1M features and divide them in training, validation and test sets with a 750K/150K/100K split. As a metric of performance, we report the mean absolute error and the mean absolute percentage error (MAPE) computed on the test set together with their standard deviations.

3.2 Choice of the $\varPhi $ Network

We perform experiments to evaluate two main architectural hyperparameters of $\varPhi $—the depth of the network and the fusion strategy of the two branches. For the former, we test a number of intermediate layers in $\{1, 2, 4\}$, while for the latter, we test concatenation of the two branches at the input level (early fusion), at half depth (mid fusion), and right before the final layer (late fusion). Figure 1 depicts the tested architectures for each parameter combination. We decided not to apply any bottleneck layer to reduce the dimensionality of the input, thus every layer keeps the dimensionality of its input except for the last projection. We are aware this leads to prohibitive memory requirements as N increases, as the dimensionality of $\mathbf {p}$ is $O(N^2)$, but in this preliminary phase, this reduces the architectural search space and enables us to evaluate the model without introducing performance caps. We train all models with SGD with momentum 0.9, learning rate of 0.05 (divided by 10 when the validation loss plateaus), batch size 100 for 10K iterations, validating every 100 iterations. We adopt early stopping by monitoring the MAPE on the validation set and selecting the model reaching the minimum^{Footnote 1}. On our single-GPU configuration, we were able to test all variations with N up to 128, as larger values are prohibitively expensive in terms of GPU memory required for training (+10GB); we left the exploration of larger values with reduced models to future work. Results in terms of MAPE and MAE are reported in Table 1. We notice that on average, an early fusion strategy is able to reach slightly better results with more performance gains with deeper networks. On the other hand, deeper networks with other fusion strategy suffer from numerical instabilities leading to divergence; we left to future work the tuning of optimization hyperparameters that may alleviate this phenomenon. Among shallower (and thus more efficient) ones, the model with late fusion and $\text {depth} = 1$ shows good overall performance on all tested N values with respect to other variants.

Table 1. Performance of architectural variations of our model on YFCC100M-HNfc6 features test subset. Bold entries indicate the model reaching the best mean absolute percentage error (MAPE) for each N. Dashes (-) indicate configurations that do not converge.

Full size table

3.3 Comparison with the State of the Art

We compare in Fig. 2 our models selected from Table 1 (the best for each N) with the n-Simplex projection on the same dataset. The n-Simplex projection provides geometrical upper (simplex-U) and lower (simplex-L) bounds for distance estimates in super-metric spaces given $\mathbf {e}$ and $\mathbf {p}$. We also report the estimation obtained by averaging the upper and lower bound (simplex-M) that usually provides a finer estimate. A main drawback of this technique is the iterative $O(N^3)$ simplex building procedure that needs to be executed for each value of $\mathbf {p}$ we are willing to use. Our approach provides a comparable or slightly degraded performance, but once trained, it can cope with different values of $\mathbf {p}$ without the need of expensive procedures. Moreover, our approach can be applied to generic metric spaces.

4 Conclusion

We explored the use of neural regressors for estimating distances from pivoted embedding in generic metric spaces. Preliminary experiments on deep-learned image descriptors suggest that the proposed approach can be used in approximated regimes providing a performance comparable to exact geometrical bounds while being more efficient. Moreover, our formulation is not limited to super-metric spaces and can be applied seamlessly to different set of reference points—properties that pave the way to advanced indexing structures including dynamically chosen reference points sets. Future work comprises the development of more compact architectures for higher dimensionalities and extended experimentation on additional metric and retrieval datasets.

Notes

1.
The code for reproducing our results is available at https://github.com/fabiocarrara/pivoted-estimation.

References

Amato, G., Falchi, F., Gennaro, C., Rabitti, F.: YFCC100M hybridnet fc6 deep features for content-based image retrieval. In: Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, pp. 11–18 (2016)
Google Scholar
Blumenthal, L.M.: Theory and Applications of Distance Geometry. Clarendon Press, Oxford (1953)
MATH Google Scholar
Connor, R., Cardillo, F.A., Vadicamo, L., Rabitti, F.: Hilbert exclusion: improved metric search through finite isometric embeddings. ACM Trans. Inf. Syst. 35(3), 17:1–17:27 (2016). https://doi.org/10.1145/3001583
Article Google Scholar
Connor, R., Vadicamo, L., Cardillo, F.A., Rabitti, F.: Supermetric search with the four-point property. In: Amsaleg, L., Houle, M.E., Schubert, E. (eds.) SISAP 2016. LNCS, vol. 9939, pp. 51–64. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46759-7_4
Chapter Google Scholar
Connor, R., Vadicamo, L., Cardillo, F.A., Rabitti, F.: Supermetric search. Inf. Syst. 80, 108–123 (2019)
Article Google Scholar
Connor, R., Vadicamo, L., Rabitti, F.: High-dimensional simplexes for supermetric search. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds.) Similarity Search and Applications, Proceedings of SISAP 2017, pp. 96–109. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68474-1_7
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The Case for Learned Index Structures. Association for Computing Machinery, New York (2018)
Book Google Scholar
Micó, M.L., Oncina, J., Vidal, E.: A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn. Lett. 15(1), 9–17 (1994). https://doi.org/10.1016/0167-8655(94)90095-7
Article Google Scholar
Schoenberg, I.J.: Metric spaces and completely monotone functions. Ann. Math. 39(4), 811–841 (1938)
Article MathSciNet Google Scholar
Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59(2), 64–73 (2016)
Article Google Scholar
Vadicamo, L., Gennaro, C., Falchi, F., Chávez, E., Connor, R., Amato, G.: Re-ranking via local embeddings: a use case with permutation-based indexing and the nSimplex projection. Inf. Syst. 101506 (2020)
Google Scholar
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)
Google Scholar

Download references

Acknowledgment

This work was partially supported by “Automatic Data and documents Analysis to enhance human-based processes” (ADA, CUP CIPE D55F17000290009) and the AI4EU project (funded by the EC, H2020 - Contract n. 825619). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Authors and Affiliations

ISTI-CNR, via G. Moruzzi 1, 56124, Pisa, Italy
Fabio Carrara, Claudio Gennaro, Fabrizio Falchi & Giuseppe Amato

Authors

Fabio Carrara
View author publications
You can also search for this author in PubMed Google Scholar
Claudio Gennaro
View author publications
You can also search for this author in PubMed Google Scholar
Fabrizio Falchi
View author publications
You can also search for this author in PubMed Google Scholar
Giuseppe Amato
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Fabio Carrara .

Editor information

Editors and Affiliations

National Institute of Informatics, Tokyo, Japan
Shin'ichi Satoh
ISTI-CNR, Pisa, Italy
Lucia Vadicamo
University of Southern Denmark, Odense M, Denmark
Arthur Zimek
ISTI-CNR, Pisa, Italy
Fabio Carrara
University of Bologna, Bologna, Italy
Ilaria Bartolini
IT University of Copenhagen, Copenhagen, Denmark
Martin Aumüller
IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
IT University of Copenhagen, Copenhagen, Denmark
Rasmus Pagh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Carrara, F., Gennaro, C., Falchi, F., Amato, G. (2020). Learning Distance Estimators from Pivoted Embeddings of Metric Objects. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science(), vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_28

Download citation

DOI: https://doi.org/10.1007/978-3-030-60936-8_28
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics