Keywords

1 Introduction

Thanks to the impetus given by recent developments in deep learning, machine learning has gained unprecedented popularity and spread into unimaginable number domains of computing. Perhaps one of the most unexpected application domains is that of index structures. In an exploratory research paper, Kraska et al. [8] showed how machine learning models, including deep learning ones, can fully or partially replace existing index structures, such as B-Tree or Bloom filters. This work paved the way for a whole new area of research in index structure, termed learned index. The core idea of learned indexes is to obtain a more compact index representation or performance gains by learning from data distribution.

In particular, Kraska et al. show that an index can be seen as a model f that predicts the position y of a record x, i.e. \(y = f(x)\). Seen with the eyes of machine learning, this problem is known as a regression problem. Therefore, in this context a learned index is simply an ML model (such as linear regression or a neural network) that replaces a B-Tree and predicts the position of query key. In this work, we would like to generalize the concept of learned index by providing a broader definition, in which the regression allows us to predict the representation of metric objects in vector form from the knowledge of distances from some anchors. This type of transformation of data is useful in the problems of searching in large-scale metric spaces, where a compromise between accuracy and speed of response to queries is often required. In large-scale scenarios, the amount of distance computations between objects needed for an exact search tends to saturate the available computational budget for obtaining reasonable response times, considering also that in metric spaces, distance functions are often expensive to compute.

The idea of reconstructing the distance between any pair of objects in a metric space by exploiting distances with a group of reference objects was probably first addressed in [9]. The authors proposed an embedding into another metric space where it is possible to deduce upper and lower bounds on the actual distance of any pair of objects. Connor et al. [4,5,6] observed that for a large class of metric spaces, distances to a set of n pivots can be used to project the data objects into a n-dimensional Euclidean space such that in the projected space the Euclidean distance between any two points is an upper or lower bound of the actual distance. They called this approach n-Simplex projection, and they proved that it can be used in all the metric spaces meeting the n-point property [2]. As also pointed out in [3], many common metric spaces meet the desired property, like Cartesian spaces of any dimension with the Euclidean, cosine or quadratic form distances, probability spaces with the Jenson-Shannon or the Triangular distance, and more generally any Hilbert-embeddable space [2, 10]. This approach has recently been used in an inverted index for approximate research on the n-nearest neighbors to obtain an estimate of the real distances of the objects present in the results of a query [12].

We intend to develop a similar approach to the n-simplex projection, however instead of using a handcrafted deterministic algorithm, in this paper we test a machine learning approach based on deep neural networks to probe their capabilities in this context. The rest of the paper is structured as follows. Section 2 describes the general idea of method developed and the model used. Section 3 presents the dataset and experimental evaluation. Section 4 concludes.

2 Method

2.1 Model Definition

Let \(\mathcal {X}\) a metric space with distance function \(d: \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}^+\) and \(\mathcal {P}= \{\,p_i \in \mathcal {X}: i = 1 \dots N\,\}\) a set of reference points (or pivots) in \(\mathcal {X}\). We define the pivoted embedding \(e(x, \mathcal {P})\) of an object \(x \in \mathcal {X}\) w.r.t \(\mathcal {P}\) as an N-dimensional real-valued vector where the i-th component is the distance between x and the i-th pivot, i.e.

$$\begin{aligned} e(x,\mathcal {P}) = \left[ \, d(x, p_1), \dots , d(x, p_i), \dots , d(x, p_N) \,\right] \in \mathbb {R}^N \,. \end{aligned}$$
(1)

We are interested in estimating the distance d(xy) between a pair of objects \(x, y \in \mathcal {X}\) given their pivoted embeddings \(\mathbf {e}_x = e(x, \mathcal {P}), \mathbf {e}_y = e(y, \mathcal {P})\) with respect to a common set of reference objects \(\mathcal {P}\). We formulate this task as a regression problem: we define a parametric model f that outputs the estimated distance given the pivoted embeddings and optimize its parameters on a training set via gradient descent. In addition to \(\mathbf {e}_x\) and \(\mathbf {e}_y\), we include the distances between pivots \(\{\, d(p_i, p_j) : i=1 \dots N, j = 1 \dots N, i < j \,\}\) in the inputs of our model as a real-valued vector \(\mathbf {p}\in \mathbb {R}^\frac{N(N-1)}{2}\) is commonly computed once offline and available. Formally,

(2)

where indicates the estimate for d(xy) of the model f having parameters \(\theta \). Following common practice in metric learning, we define

$$\begin{aligned} f(\mathbf {e}_x, \mathbf {e}_y, \mathbf {p}; \theta ) = |\varPhi (\mathbf {e}_x, \mathbf {p}; \theta ) - \varPhi (\mathbf {e}_y, \mathbf {p}; \theta )|_2 \,, \end{aligned}$$
(3)

where \(\varPhi (\mathbf {e}, \mathbf {p}; \theta )\) is a neural network that takes as input a pivoted embedding \(\mathbf {e}\) and the distances between pivots \(\mathbf {p}\) and outputs a real-valued vector representation.

As the architecture of \(\varPhi (\mathbf {e}, \mathbf {p}; \theta )\), we choose a two-branch fully-connected residual network: \(\mathbf {e}\) and \(\mathbf {p}\) are independently processed by two MLPs with residual connections whose outputs are then merged by concatenation and followed by one or more additional fully-connected layer. Each branch comprises multiple residual blocks [7] having the structure reported in Fig. 1a. We explore and evaluate architectural hyperparameters such as depth and branch merging point in Sect. 3.

2.2 Model Training

We train our model with mini-batch gradient descent. Given a training set \(\mathcal {X}_\text {tr} \subset \mathcal {X}\), to form a training batch we randomly draw N objects as pivots \(\mathcal {P}\) and B pairs of objects \((x_i, y_i), i=1 \dots B, x_i, y_i \in \mathcal {X}_\text {tr}\), and we adopt the original metric distance d to obtain the inputs (the pivoted embedding of the objects \(\mathbf {e}_{x_i}, \mathbf {e}_{y_i}\) and distances between pivots \(\mathbf {p}\)) and the target (exact distances between objects \(d(x_i, y_i)\)) of our model. We optimize the loss function

$$\begin{aligned} \mathcal {L}= \frac{1}{B} \sum _{i=1}^B \, \text {SmoothL1} \left( \, f( \mathbf {e}_{x_i}, \mathbf {e}_{y_i}, \mathbf {p}),\, d(x_i, y_i) \, \right) , \end{aligned}$$
(4)

with

$$\begin{aligned} \text {SmoothL1}(a,b) = {\left\{ \begin{array}{ll} \frac{1}{2} (a - b)^2, &{} \text {if } |a - b| < 1 \\ |a - b| - \frac{1}{2}, &{} \text {otherwise} \,. \end{array}\right. } \end{aligned}$$
(5)

We choose the SmoothL1 function to avoid huge gradients in the early phase of training that often lead to numerical instabilities. At each batch, we sample new pivots and couples and repeat the procedure. We periodically evaluate our model on a test set \(\mathcal {X}_\text {ts}\) by measuring the estimation error, and we adopt early stopping when the test error stops decreasing.

3 Experiments

3.1 Dataset

Throughout all experiments, we adopt a subset of the YFCC100M-HNfc6 deep features dataset [1]—a dataset of 100M 4096-dimensional features extracted from YFCC100M [11] images using the HybridCNN [13] deep convolutional pretrained network and selecting the output of the fc6 layer. In the original space, features are compared with the \(L_2\) distance, i.e. \(d = L_2\). We select the first 1M features and divide them in training, validation and test sets with a 750K/150K/100K split. As a metric of performance, we report the mean absolute error and the mean absolute percentage error (MAPE) computed on the test set together with their standard deviations.

3.2 Choice of the \(\varPhi \) Network

We perform experiments to evaluate two main architectural hyperparameters of \(\varPhi \)—the depth of the network and the fusion strategy of the two branches. For the former, we test a number of intermediate layers in \(\{1, 2, 4\}\), while for the latter, we test concatenation of the two branches at the input level (early fusion), at half depth (mid fusion), and right before the final layer (late fusion). Figure 1 depicts the tested architectures for each parameter combination. We decided not to apply any bottleneck layer to reduce the dimensionality of the input, thus every layer keeps the dimensionality of its input except for the last projection. We are aware this leads to prohibitive memory requirements as N increases, as the dimensionality of \(\mathbf {p}\) is \(O(N^2)\), but in this preliminary phase, this reduces the architectural search space and enables us to evaluate the model without introducing performance caps. We train all models with SGD with momentum 0.9, learning rate of 0.05 (divided by 10 when the validation loss plateaus), batch size 100 for 10K iterations, validating every 100 iterations. We adopt early stopping by monitoring the MAPE on the validation set and selecting the model reaching the minimumFootnote 1. On our single-GPU configuration, we were able to test all variations with N up to 128, as larger values are prohibitively expensive in terms of GPU memory required for training (+10GB); we left the exploration of larger values with reduced models to future work. Results in terms of MAPE and MAE are reported in Table 1. We notice that on average, an early fusion strategy is able to reach slightly better results with more performance gains with deeper networks. On the other hand, deeper networks with other fusion strategy suffer from numerical instabilities leading to divergence; we left to future work the tuning of optimization hyperparameters that may alleviate this phenomenon. Among shallower (and thus more efficient) ones, the model with late fusion and \(\text {depth} = 1\) shows good overall performance on all tested N values with respect to other variants.

Fig. 1.
figure 1

Explored architectures for \(\varPhi (\mathbf {e}, \mathbf {p})\)

Table 1. Performance of architectural variations of our model on YFCC100M-HNfc6 features test subset. Bold entries indicate the model reaching the best mean absolute percentage error (MAPE) for each N. Dashes (-) indicate configurations that do not converge.

3.3 Comparison with the State of the Art

We compare in Fig. 2 our models selected from Table 1 (the best for each N) with the n-Simplex projection on the same dataset. The n-Simplex projection provides geometrical upper (simplex-U) and lower (simplex-L) bounds for distance estimates in super-metric spaces given \(\mathbf {e}\) and \(\mathbf {p}\). We also report the estimation obtained by averaging the upper and lower bound (simplex-M) that usually provides a finer estimate. A main drawback of this technique is the iterative \(O(N^3)\) simplex building procedure that needs to be executed for each value of \(\mathbf {p}\) we are willing to use. Our approach provides a comparable or slightly degraded performance, but once trained, it can cope with different values of \(\mathbf {p}\) without the need of expensive procedures. Moreover, our approach can be applied to generic metric spaces.

Fig. 2.
figure 2

Comparison with the state-of-the-art n-Simplex estimator: simplex suffixes -U, -L, and -M indicate estimation using respectively the upper bound, lower bound, and their mean.

4 Conclusion

We explored the use of neural regressors for estimating distances from pivoted embedding in generic metric spaces. Preliminary experiments on deep-learned image descriptors suggest that the proposed approach can be used in approximated regimes providing a performance comparable to exact geometrical bounds while being more efficient. Moreover, our formulation is not limited to super-metric spaces and can be applied seamlessly to different set of reference points—properties that pave the way to advanced indexing structures including dynamically chosen reference points sets. Future work comprises the development of more compact architectures for higher dimensionalities and extended experimentation on additional metric and retrieval datasets.