
1 Introduction

Extraction of image and text features with pre-trained off-the-shelf models enjoys widespread adoption among practitioners (e.g., BERT-as-a-service [58]). These features are often used in tasks such as multimedia retrieval [50, 55], semantic similarity [12, 13, 26, 40, 52], word analogies [37, 40] or zero-shot image recognition [46, 61], to name a few. Not infrequently, the features are used directly without further supervised training, typically via a cosine-based semantic match. As noted in [1, 24], the cosine similarity is sensitive to centering, cross-dimension correlations and scale variations (Fig. 1). However, the extent to which this impacts task performance has not yet been systematically studied. Studies assessing the effect of feature transforms (e.g., normalizing or PCA) are typically restricted to a single domain and task (e.g., bilingual word embeddings [1, 59]) and a single modality (text). This prompts our first research question (RQ1): Can we improve the features with simple transforms in a variety of text and image tasks? In particular, the (hypothesized) negative impact of vector uncenteredness on cosine-based performance (Fig. 1) is among our foremost hypotheses to test.

Fig. 1. Illustration of uncentered vectors hindering cosine-similarity performance. Since the cosine similarity computes the angle (\(\alpha , \beta \)) from the origin \(\vec {0}\), in this example where all vectors are dimension-wise positive, the cosine judges two points from different classes as more similar than two points of the same class. Centering helps obtain more meaningful similarity estimates.

The cosine is generally chosen as the default similarity measure in retrieval [15, 50] and semantic similarity tasks [12, 26, 27, 31, 40, 52, 53]. This choice may be informed by a (labelled) validation set, or the metric itself can even be learned [14, 56] if a labelled training set exists. However, because often neither of these is available [12, 26, 31, 40, 46, 52], our study assumes a scenario without either set. This motivates our second research question (RQ2): Is the default choice of cosine similarity (versus Euclidean) empirically supported?

To answer RQ1 and RQ2, we perform an extensive empirical study in real-world tasks with both image and text data. We provide further insight and back up our claims in laboratory experiments. Our tests include 25 datasets with 6 different tasks covering text and image retrieval, word-, sentence- and visual-similarity, and paraphrase detection. We include 15 types of image (8) and text (7) embeddings, covering state-of-the-art models. Simple feature transforms are also compared with manifold learning methods.

Our findings reveal that: (i) centering and standardizing are remarkably effective across real-world tasks (RQ1); (ii) the cosine significantly outperforms the Euclidean similarity across 74 conditions (embedding \(\times \) task), hence supporting the default choice (RQ2). Ultimately, our findings provide actionable advice to practitioners and a warning about the negative impact of using the cosine similarity along with uncentered features.

This paper is organized as follows. In Sect. 2, we discuss related work. We present our methods in Sect. 3 and our tasks in Sect. 4. In Sect. 5, we describe our embeddings and setup. In Sect. 6, we discuss our empirical results. Section 7 concludes the paper.

2 Related Work

Feature transforms: [25] study the optimality of five different whitening transformations from the viewpoint of the properties of their covariance matrices. In contrast to our study, [25] do not include empirical evaluations on text or image problems.

Additionally, [11] studied the effect of transforming features with an untrained neural network (i.e., random projections), finding that the performance of the transformed vectors does not drop in word-similarity tasks. The impact of different feature transforms on the performance of classification problems, e.g., with biomedical data, has also been studied [4].

The closest works to ours are [32], [59] and [1], all of whom study the effect of feature transforms in the context of text problems. [32] study the effect of hyperparameters and normalization of word embeddings, revealing that the impact of design decisions and hyperparameters on performance is more important than the choice of the embedding algorithms themselves. [59] find that constraining word embeddings to the unit hyper-sphere (i.e., normalizing them) improves performance in mono-lingual word similarity and bi-lingual word translation. [1] investigate several transformations, including PCA, mean centering, normalization and whitening, in the context of multi-lingual word embeddings. In contrast with ours, these studies are restricted to a single domain and to text data (no images), and do not discuss standardizing – which we find to be a top performer.

Similarity measures: [24] analytically study the behavior and properties of similarity measures such as cosine similarity and the inner product from a geometric viewpoint, focusing on iso-similarity contours. Also analytically, [41] studies similarity measures in the retrieval context. In contrast to them, we carry out extensive empirical tests.

Metric learning: algorithms such as the ITML [14] or LMNN [56] learn a metric distance which can be seen as a form of learning a suitable transformation to the input vectors. However, this metric is learned in a supervised fashion, typically to be used in conjunction with a nearest-neighbor classifier, which falls out of the unsupervised scope of our study. It is worth mentioning that unsupervised metric learning algorithms also exist [9, 23], yet they do not witness widespread adoption among data practitioners.

Manifold learning: methods such as Isomap [49], Locally Linear Embedding (LLE) [42], diffusion maps [10], multidimensional scaling (MDS) [29] or t-SNE [34] try to discover the underlying data manifold, which enables disentangling the vectors in a lower-dimensional space. Such methods are widely used for data visualization, yet they are not popular as feature transforms for predictive models – perhaps due to their limited success for this purpose. Although we include manifold learning methods mainly for completeness – given that our focus is on simple feature transforms – an empirical comparison of simple transforms and manifold learning methods across multiple tasks has not been performed yet, and we believe it is of practical interest.

3 Method

Let us first lay down our general framework. Let \(S = \{s_i \}_{i=1}^N\) be a set of N data points (sentences, words or images). One extracts corresponding feature vectors \(V = \{v_i \}_{i=1}^N\) with a text or image encoder E() (e.g., BERT or a CNN model), where \(v_i = E(s_i)\) and \(v_i \in \mathbb {R}^d\). The parameters \(\theta \) of a feature transform \(T_{\theta }\) are learned using the vectors V (e.g., in centering, \(\theta \) are the dimension-wise means). A new vector v can then be transformed with \(T_{\theta }(v)\) (Sect. 3.1), where v may or may not belong to the set V used for learning \(T_{\theta }\).
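As a minimal sketch of this framework (our own illustration; the encoder and function names are placeholders, not the paper's API), with centering as the example transform:

```python
import numpy as np

def learn_transform(V):
    """Learn the transform parameters theta from the set of vectors V (here: centering)."""
    return {"mean": V.mean(axis=0)}

def apply_transform(v, theta):
    """Apply T_theta to a vector v, which may or may not belong to V."""
    return v - theta["mean"]

# V: N x d matrix of encoder outputs, i.e., V[i] = E(s_i) for some encoder E.
V = np.random.randn(100, 300) + 5.0       # toy stand-in for uncentered features
theta = learn_transform(V)                # theta are the dimension-wise means
v_new = np.random.randn(300) + 5.0        # a new vector, not necessarily in V
v_t = apply_transform(v_new, theta)
```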

3.1 Feature Transforms

In the following, we describe the feature transforms included in our experiments.

  • Original (orig): denotes the original vectors \(V=\{v_i\}_{i=1}^N\) without any transformation.

  • Centering (ctr): \(\text {ctr}(v) = v - \overline{V}\); subtracts the centroid vector \(\overline{V} = \frac{1}{N}\sum _{i=1}^N v_i = \frac{1}{N}\sum _{i=1}^N (v^1_i, \cdots , v^d_i)\) from a vector v.

  • Standardizing (stz): \(\text {stz}(v) = (v - \overline{V})/\text {sd(V)}\); where \(\text {sd(V)}\) are the component-wise standard deviations \(\text {sd(V)} = (\text {sd}(V^1), \cdots , \text {sd}(V^d))\) with \(V^k = \{ v^k_i \}_{i=1}^N\); and \(\text {sd()}\) is the standard deviation. Stz zero-means the data V and sets variances equal to 1.

  • Whitening (wht): We use the Zero Components Analysis (ZCA) whitening as described in [28]. ZCA de-correlates the data dimensions and makes the variances equal to 1.

  • Normalizing (nrm): \(\text {nrm}(v) = v/||v||\); moves any vector v to the unit hyper-sphere. Unlike the rest, this transform depends only on the vector v itself, and not on the whole set \(V=\{v_i\}_{i=1}^N\). Normalizing has no effect when the cosine similarity is used.

  • Isomap (Iso) [49] and Locally Linear Embedding (LLE) [42]: both are used analogously, by first learning the parameters \(\theta \) on the training set of vectors V and then applying the learned transformation \(T_{\theta }\) to a new vector \(v \in \mathbb {R}^d\), with \(T_{\theta }(v) \in \mathbb {R}^m\) and \(m \le d\)Footnote 1.

  • Principal Component Analysis (PCA): a classical dimensionality reduction method that finds the orthogonal directions that best fit the data in the least-squares sense. We keep a number of components (dimensions) such that 80% of the variance is explained.Footnote 2 Our implementation of PCA [39] centers but does not scale the data (for each feature) before applying the SVD decomposition.

Unlike the simple transforms (e.g., centering), the more complex PCA, Isomap and LLE have hyperparameters (e.g., the output dimensionality) that impact their performance. A validation set is thus often necessary, which is a shortcoming in our unsupervised setting.
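For concreteness, a compact NumPy sketch of the simple transforms is given below (our own illustration; the actual experiments rely on standard library implementations, e.g., sklearn):

```python
import numpy as np

def center(V, v):
    return v - V.mean(axis=0)

def standardize(V, v):
    return (v - V.mean(axis=0)) / V.std(axis=0)

def zca_whiten(V, v, eps=1e-5):
    # ZCA: decorrelate dimensions and set their variances to 1,
    # while staying as close as possible to the original axes.
    mu = V.mean(axis=0)
    cov = np.cov(V - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return (v - mu) @ W

def normalize(v):
    return v / np.linalg.norm(v)
```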

3.2 Complementary Experiments

  • Additive bias: As a complement to centering, we study the effect of “uncenteredness” on the cosine similarity (as the Euclidean is shift invariant) by uncentering \(V=\{v_i\}_{i=1}^N\) with a dimension-wise bias \(b > 0\), namely \((v^1_i + b, \cdots , v^d_i + b) \text { } \forall \text { } i=1,\cdots ,N\). This equates to shifting all vectors into the positive orthant (i.e., \(v^k_i > 0 \text { } \forall \text { } k=1,\cdots ,d\)) if b is large enough, or to moving them further up in case they already are positive (see Sect. 6 for a discussion).

  • Multiplicative bias: To study the effect of non-homogeneous scales and variances across dimensions, we multiply each dimension by a positive bias drawn uniformly at random, \(b \sim \mathcal {U}[0.001, 10]^d\), i.e., \(v_i = (b_1 v^1_i, \cdots , b_d v^d_i) \text { } \forall \text { } i=1,\cdots ,N\). This study complements the standardizing method.
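The following toy snippet (our own illustration, with arbitrary bias values) shows the intended effect of both biases: an additive bias changes cosine similarities drastically while leaving Euclidean distances untouched, whereas a multiplicative bias mainly distorts per-dimension scales:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(50), rng.standard_normal(50)   # roughly centered toy features

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

b = 10.0
print(cos(x, y), cos(x + b, y + b))                        # cosine changes drastically (tends to 1)
print(np.linalg.norm(x - y), np.linalg.norm((x + b) - (y + b)))  # Euclidean is shift invariant

scale = rng.uniform(0.001, 10, size=50)                    # multiplicative bias per dimension
print(cos(x, y), cos(x * scale, y * scale))                # cosine changes, typically far less
```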

4 Tasks and Data

In this section, we first describe the procedure for two broad groups of tasks (Sect. 4.1), and then we introduce the datasets used in each individual task (Sect. 4.2). Our dataset selection criteria included: (i) feasibility of implementing an unsupervised prediction approach (i.e., simply cosine-based); (ii) medium-sized datasets; (iii) rather popular and already clean data (thus requiring little pre-processing); (iv) diversity.

4.1 Task Descriptions

Grouping tasks: It is convenient to group our tasks into two functionally different categories, as they exhibit identical prediction-evaluation pipelines: (1) Retrieval tasks: (i) text retrieval and (ii) image retrieval; (2) Similarity tasks: (iii) word similarity, (iv) sentence similarity, (v) visual similarity and (vi) synthetic data. Throughout, we refer to all tasks except the synthetic ones as real-world tasks.

In all tasks, we consider two similarity measures to compute the predicted similarity \(\text {sim}(s_1, s_2)\) between any two inputs \(s_1, s_2\) (words, sentences or images) encoded with their respective features \(v_i, v_j \in \mathbb {R}^d\):

  • Cosine similarity: \(\cos (v_i,v_j) = \frac{v_i \cdot v_j}{\Vert v_i \Vert \, \Vert v_j \Vert }\).

  • Euclidean similarity: \(\text {Eucl}(v_i,v_j) = \frac{1}{1+ \Vert v_i - v_j \Vert }\).
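In code, the two measures read as follows (a direct transcription of the formulas above):

```python
import numpy as np

def cosine_sim(vi, vj):
    return vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))

def euclidean_sim(vi, vj):
    return 1.0 / (1.0 + np.linalg.norm(vi - vj))
```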

In the interest of the practitioner, we focus on simple and widely adopted transforms, cosine and Euclidean similarity, rather than aiming for an exhaustive comparison of all existing similarity measures and feature transforms. After having obtained the vectors \(V = \{v_i \}_{i=1}^N\) and learned \(T_{\theta }(v)\) as described in Sect. 3, we consider the task-specific procedures below.

\(\blacksquare \) Similarity tasks: All word-, sentence- and image-similarity datasets consist of a list of word, sentence or image pairs \((s_i, s_j)\), e.g., (‘car’, ‘truck’) along with a human (ground-truth) rating of their similarity or relatedness \(y_{i,j} \in [1, 10]\). The system needs to predict a similarity score \(\hat{y}_{i,j} \in [1, 10]\) for each pair \((s_i, s_j)\). Model predictions are computed via \(\cos (v_i, v_j)\) or \(\text {Eucl}(v_i, v_j)\), where \((v_i, v_j) = E(s_i, s_j)\).

  • Evaluation: Following [12, 26, 40], we use the Spearman correlation \(\rho (\hat{y},y)\) between the predicted \(\hat{y} \in \mathbb {R}_+^N\) and the ground-truth similarity scores \(y \in \mathbb {R}_+^N\) as the standard measure to evaluate the quality of semantic similarity predictions.
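A sketch of the full similarity-task pipeline is shown below (the encoder and similarity function are passed in as placeholders; scipy provides the Spearman correlation). Since the Spearman correlation is rank-based, the raw cosine or Euclidean scores can be evaluated directly, without rescaling them to the [1, 10] range.

```python
from scipy.stats import spearmanr

def evaluate_similarity(pairs, gold, encode, sim):
    """pairs: list of (s_i, s_j); gold: human ratings y_ij; encode: s -> vector; sim: similarity fn."""
    preds = [sim(encode(si), encode(sj)) for si, sj in pairs]
    rho, _ = spearmanr(preds, gold)
    return rho
```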

\(\blacksquare \) Retrieval tasks: We split the given test set \(V^{ts}\) into two disjoint sets: a query set \(\mathcal {Q}\) and a test collection \(\mathcal {T}\). Given a query \(s_i \in \mathcal {Q}\), the goal of the task is to rank the relevant items from \(\mathcal {T}\) higher than the non-relevant ones. The similarity between each item \(s_i \in \mathcal {Q}\) in the query set \(\mathcal {Q}\) is computed against every item \(s_j \in \mathcal {T}\) in the test collection \(\mathcal {T}\) via \(\cos (v_i, v_j)\) or \(\text {Eucl}(v_i, v_j)\) similarity, where \((v_i, v_j) = E(s_i, s_j)\).

  • Evaluation: Performance is evaluated with the TREC standard mean average precision (mAP), as described in [35]. Following [50, 54, 55], a test-collection item \(s_i \in \mathcal {T}\) is considered relevant to a query \(s_j \in \mathcal {Q}\) if they both belong to the same class.
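A simplified sketch of the retrieval evaluation is given below (our own average-precision computation for illustration; the reported results use the TREC implementation [35]):

```python
import numpy as np

def average_precision(ranked_relevance):
    """Binary relevance of test-collection items, sorted by decreasing similarity to the query."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(sim_matrix, query_labels, coll_labels):
    """sim_matrix[q, t]: similarity of query q to collection item t; relevant = same class."""
    coll_labels = np.asarray(coll_labels)
    aps = []
    for q, sims in enumerate(sim_matrix):
        order = np.argsort(-sims)                      # rank the collection by similarity
        aps.append(average_precision(coll_labels[order] == query_labels[q]))
    return float(np.mean(aps))
```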

4.2 Datasets

\(\blacksquare \) Text retrieval: AG-newsFootnote 3 is a text classification and retrieval benchmark [60] consisting of (120,000 train; 7,600 test) sentences, each belonging to exactly one of 4 classes (sports, world, business, sci/tech). E.g., “Economic growth in Japan slows down as the country experiences a drop in domestic and corporate spending” (class = business).

\(\blacksquare \) Image retrieval:

  • Caltech-256 [20] is a benchmark widely used in image retrieval [15] and classification. The data consists of 30,607 images, each of which belongs to exactly one of the 256 categories (e.g., sushi, swan, tripod, etc.).

  • CorelDB database [51]: consists of 10,800 images, each of which belongs to exactly one of the 80 classes (ship, waterfall, lion, etc.).

\(\blacksquare \) Word similarity tasks are typically used to evaluate the quality of word embedding models [2, 26, 31, 40, 52]. Following [12, 52, 53], we use five word similarity benchmarks, which include three types of similarity ratings: (i) Semantic similarity: SemSim [44], Simlex999 [22] and SimVerb-3500 [19]; (ii) Relatedness: MEN [3] and WordSim-353 [18]; (iii) Visual similarity: VisSim [44] which contains the same data as SemSim, yet word pairs are rated for visual similarity instead of semantic similarity.

\(\blacksquare \) Sentence similarity: Our datasets are from the GLUEFootnote 4 and SentEvalFootnote 5 collections.

  • STS (Semantic Textual Similarity) [5] is a semantic relatedness benchmark consisting of sentence pairs with a crowd-annotated similarity score. E.g., (“A woman is eating something”, “A woman is eating meat”) has a score of 3 (out of 5). There are 5,749 train, 1,500 val and 1,379 test pairs.

  • SICK (Sentences Involving Compositional Knowledge) [36] evaluates compositional distributional semantics. SICK contains sentence pairs along with their semantic relatedness score. E.g., (“Two men are boxing”, “Two men are fighting”) have a score of 4 (out of 5). SICK has 4,501 train, 501 val and 4,928 test sentence pairs.

  • MSRP (Microsoft Research Paraphrase Corpus) [17] does not strictly evaluate sentence similarity but paraphrase detection, yet due to functional parallels with the former, we include MSRP in this group. It contains (4,077 train; 1,726 test) sentence pairs along with a label {1 = paraphrase or 0 = not paraphrase}. MSRP is always used with supervision, thus it may not be the most adequate test-bed for our setting.

\(\blacksquare \) Visual similarity: Visual-STS (vis-STS) [30] is a subset of STS where each textual caption is associated with an image. Here, we only use the images, since (a larger super-set of) the sentences are already evaluated in STS. Vis-STS consists of 1,089 images and a single set of 829 image-image pairs along with their ground-truth similarity ratings.

Fig. 2. Synthetic datasets of our laboratory experiment. Color indicates the semantic value of each data point (either a class label or a continuous value) – best seen in color. The first five datasets are 3D while the rest are 2D. In the first seven datasets, the semantic value assigned to data points is continuous, while for the last five the class labels are discrete. (Color figure online)

\(\blacksquare \) Synthetic data: In contrast to real-world tasks, laboratory tasks offer a unique window into the behavior of feature transforms by giving full control over: (i) the (distribution of the) feature vectors, and (ii) the task itself, i.e., the assignment of a semantic value to each data point. The majority of our synthetic (laboratory) datasets come from sklearn [39], except sphere-z, unif-rad, unif-angle and spiral (Fig. 2), which we built ourselves.

We randomly generate 2,000 train and 200 test data points. Then, we build our similarity task by presenting all pairwise combinations of test points to the system, i.e., 40,000 pairs (= 200 \(\times \) 200). In the discrete-labelled datasets (e.g., circles, Fig. 2), where each data point \(s_i\) has a class label \(l_i \in \{t_1, \cdots , t_C \}\) (where C = \(\#\) classes), the ground-truth similarity \(y_{i,j} \in \{0, 1\}\) between two points \(s_i, s_j\) is 1 if they belong to the same class, and 0 otherwise. In the continuous-labelled datasets (e.g., sphere), where the semantic value assigned to each data point is a continuous value \(l_i \in \mathbb {R}_{+}\), the ground-truth similarity \(y_{i,j}\) between \(s_i, s_j\) is the absolute difference: \(y_{i,j} = |l_i - l_j| \in \mathbb {R}_{+}\).
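A sketch of how the pairwise ground truth is built (using sklearn's make_circles as an example generator; the continuous labels here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_circles

X, labels = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# Discrete labels: y_ij = 1 iff the two points share a class.
y_discrete = (labels[:, None] == labels[None, :]).astype(int)   # 200 x 200 pairwise matrix

# Continuous labels l_i (e.g., a radius or height value): y_ij = |l_i - l_j|.
l = np.linalg.norm(X, axis=1)                                   # toy continuous semantic value
y_continuous = np.abs(l[:, None] - l[None, :])
```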

5 Experimental Setup

5.1 Feature Vectors (Embeddings)

We group below our embeddings by the unit that they represent (a word, a sentence or an image). An overview of which embeddings apply to what task can be seen in Table 1.

\(\blacksquare \) Word-level features:

  • GloVeFootnote 6 [40]: We use 300-d vectors pre-trained on the Common Crawl corpus with 840B tokens and a 2.2M-word vocabulary.

  • word2vec (w2v) [37]: We use the skip-gram 300-d embeddings trained on Wikipedia.

  • In word-similarity, we adopt the publicly availableFootnote 7 VGG-128 [6] and ResNet [21] visual features from [12]. Notice that unlike the image retrieval and visual-STS tasks, word-similarity datasets do not have any images and hence one needs to find a way to visually represent each word (e.g., ‘cat’ or ‘table’) by using external visual data. To this end, [12] used ImageNet [43], and for each image they extracted 128-d VGG-128 and 2,048-d ResNet features from the last layer (before the softmax) by using the forward pass of the CNN. The final representation for any given word is the average feature vector (centroid) of all available images for this word in ImageNet.

\(\blacksquare \) Sentence-level features:

  • BERT [16]: The large uncased version of BERTFootnote 8 (24 layers, 1,024 units) is used as a sentence feature extractor. We obtain a 1,024-d vector from the last layer (24th), before the model top, by average-pooling the output sequence of hidden state vectors, similar to BERT-as-a-service [58] (see the sketch after this list). The model is pre-trained on masked language modeling and next sentence prediction on the Toronto Book Corpus and Wikipedia.

  • RoBERTa [33]: We obtain 1,024-d features in an identical manner as in BERT above with the large-version of a case-sensitive RoBERTa model.

  • Skipthoughts [27] is a popular neural-based universal sentence encoder that learns sentence representations by predicting the surrounding sentences. We use its best-performing 4,800-d vectors (combine-skip), as recommended by the authors.

  • Vector averaging (bag of words): In the sentence-level tasks (SICK, MSRP, STS and AG-news), we include the baseline sentence representation \(v = \frac{1}{m} \sum _{i=1}^m v_i\) of averaging word vectors in a sentence \(s = (s_1, \cdots , s_m)\), where \(v_i = E(s_i)\) and m is the number of words. We add a subscript avg to the averaged vectors (e.g., GloVe\(_\text {avg}\)).
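Below is a sketch of the average-pooled BERT features described above, using the Huggingface transformers API (we assume the bert-large-uncased checkpoint; the exact extraction details are in the Supplement):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").eval()

def sentence_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # shape (1, seq_len, 1024)
    return hidden.mean(dim=1).squeeze(0).numpy()      # average-pool over tokens -> 1,024-d vector
```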

\(\blacksquare \) Image-level features. Vector dimensionality is in parentheses: NASNet [62] (d = 4,032), ResNet-50 [21] (d = 2,048), ResNet-inception-v2 [47] (d = 1,536), Inception-v3 [48] (d = 2,048), VGG19 [45] (d = 512), Xception [8] (d = 2,048). For all these CNNs, the feature vector \(v_i = E(s_i)\) for a given image \(s_i\) is obtained in a forward pass as the average-pooled activations of the last layer before the output layer.
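A sketch of this extraction for one of the CNNs (assuming a TensorFlow/Keras installation; pre-processing and input sizes may differ slightly from the exact setup in the Supplement):

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Average-pooled activations of the last layer before the output layer (d = 2,048).
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def image_vectors(image_batch):
    """image_batch: float array of shape (n, 224, 224, 3), RGB, values in [0, 255]."""
    return encoder.predict(preprocess_input(np.copy(image_batch)))   # shape (n, 2048)
```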

5.2 Training Setup and Implementation

  • Given training data: In all datasets except word-similarity (Sect. 4.2), we obtain the training data \(V^{tr} = E(S^{tr})\) from the given dataset (yet without using class labels). For AG-news, STS, SICK and MSRP we use the provided train-test split (Sect. 4.2). Since CorelDB, Visual-STS and Caltech-256 do not have publicly available train-test splits, we create the splits ourselves via 3-fold cross-validation (see the sketch after this list): we split the full data \(S = \{s_i \}_{i=1}^N\) into 3 disjoint parts and employ 2 parts for training (\(S^{tr}\)) and 1 part for testing (\(S^{ts}\)), repeating this 3 times and reporting the average.

    However, our setting does not require having an available training set. There are two main alternatives to using the given train split: (1) learning \(T_{\theta }()\) on the test set; (2) generating \(S^{tr}\) ourselves. Although (1) is a legitimate option (as one does not use labels), it falls within a transductive learning setup and assumes a test set of a certain size to enable learning \(T_{\theta }()\); hence, it is not an option for a single-instance test set. We also evaluated learning \(T_{\theta }()\) on the test set, and the results are discussed in Sect. 6.1.

  • Built training data: For word-similarity, where no training data are available, we use external data to generate \(V^{tr}\) (option (2) above). Following [12], we build \(V^{tr}\) in word-similarity by using features obtained from all words in ImageNet, i.e., visual features for CNNs (ResNet & VGG128) and word embeddings for text (GloVe & word2vec).
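A sketch of the 3-fold protocol mentioned above (our own illustration with a placeholder metric; sklearn's KFold handles the disjoint splits):

```python
import numpy as np
from sklearn.model_selection import KFold

V = np.random.randn(300, 512)                # stand-in for the encoder outputs E(S)
fold_results = []
for tr_idx, ts_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(V):
    theta = V[tr_idx].mean(axis=0)           # e.g., centering parameters learned on 2/3 of the data
    V_ts = V[ts_idx] - theta                 # transform the held-out 1/3 with the learned theta
    fold_results.append(float(V_ts.mean()))  # placeholder for the actual task metric
print(np.mean(fold_results))                 # report the average over the 3 folds
```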

Implementation: We use diverse Python libraries, including: Keras [7] for the CNNs, Theano for skipthoughts, sklearn [39], Pytorch and Huggingface [57] for BERT & RoBERTa. We make our code publicly availableFootnote 9 as well as a Supplement with further specific implementation and hyperparameter details and additional results.

6 Results

Unless otherwise specified, results below are discussed for the cosine similarity (Table 1). Performance measures in the tables follow Sect. 4.1 and are scaled \(\times \) 100 for readability. Table 2 reports the statistical significance of comparing a given method with the original vectors under cosine (i.e., the top-left corner entry). Each comparison is a two-sided Wilcoxon signed-rank test across the 74 combinations of a real-world dataset with an embedding type (i.e., rows in Table 1). We report significance at \(p\,<\) 0.01 after a Bonferroni correction for 10 comparisons (7 methods in the first row + Eucl + add. bias + mult. bias)Footnote 10. Win-tie-loss results (W, T, L) indicate the number of wins (W), ties (T) and losses (L) of the first method against the second one, across the 74 combinations.
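This testing protocol can be reproduced with scipy as sketched below (toy scores, not the paper's actual numbers):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
orig_cos = rng.uniform(40, 80, size=74)              # toy scores of orig x cos, one per combination
method = orig_cos + rng.normal(2.0, 3.0, size=74)    # toy scores of a competing method

stat, p = wilcoxon(method, orig_cos, alternative="two-sided")
print("significant:", p < 0.01 / 10)                 # Bonferroni correction for 10 comparisons

wins, ties, losses = (method > orig_cos).sum(), (method == orig_cos).sum(), (method < orig_cos).sum()
print(f"W = {wins}, T = {ties}, L = {losses}")
```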

6.1 Real-World Tasks

Centeredness: Performance of original with an additive bias (Sect. 3.2) drops drastically (Fig. 3 and Table 2). This confirms the inadequacy of using uncentered vectors along with the cosine similarity. Results of PCA, ctr and stz are unaffected.

Fig. 3. Averaged results across datasets and features for different values of the biases (Sect. 3.2) on the original vectors. The point b = 0 means no bias.

  • Centering: Consistently with the results above, centering significantly improved (\(p<10^{-4}\)) the original features by an absolute 2.5% on average (Table 2), with a win-tie-loss of (W = 52, T = 1, L = 21) (Table 1), hence proving the effectiveness of this method (RQ 1).

  • Centeredness of original vectors: All our CNN vectors (ResNet, etc.) are positive (thus uncentered), and simple statistical inspection reveals that our text vectors are also uncentered. This implies that centering has an effect on all our features.

(Non-)homogeneity of variances and scale: In contrast with the large hindering effect of the additive bias, performance with the multiplicative bias (Sect. 3.2) barely drops (Fig. 3 and Table 2). This suggests that centeredness may have a larger impact on the cosine similarity than scale and variance differences across dimensions.

  • Standardizing is the overall winner in real-world tasks (RQ 1). It improved significantly (\(p<10^{-6}\)) the orig features by an absolute 3.3% on average (Table 2) and their win-tie-loss is (W = 60, T = 0, L = 14) (Table 1). Notice that stz also centers the vectors.

Cosine versus Euclidean: Cosine similarity significantly outperformed (\(p<10^{-6}\)) the Euclidean similarity (RQ 2) by an average absolute 5.1% (Table 2) and (W = 54, T = 7, L = 13), for the original vectors – yet the trend is similar for all transforms. This supports the common practice of defaulting to cosine similarity, yet we strongly recommend considering the remarks about centering above, to avoid sub-optimal performance. Further, if a labeled validation set is available (e.g., in SICK, STS, or AG-news), one may use it in order to make a more educated choice between cosine and Euclidean similarity.

Learning times: Remarkably, manifold learning methods are over 1,000 times slower than standardizing (Table 2), and perform markedly worse.

Learning in test set: Notably, centering and standardizing can be further improved by learning them on the test data (Table 2) – provided the test set is large enough.

Manifold learning methods generally underperform the simple transforms in real-world tasks. We emphasize that we do not claim to fairly portray the full potential of manifold learning methods (and PCA): for comparability with the simple transforms, we did not tune their hyperparameters (e.g., dimensionality) with a validation set – as our setting does not assume one.

PCA improved orig features by 1.8% on average (Table 2) and (W = 53, T = 0, L = 21).

Failure cases: Notably, VGG19 was not improved by any method on any dataset (Table 1), and all methods fared poorly on MSRP. However, the performance loss from standardizing or centering is small on MSRP, which suggests that, in the absence of a validation set for making more informed decisions, the large upside of defaulting to standardizing may offset its occasional and rather small performance downside.

Consistency: Some methods that perform poorly on average, such as Iso or wht (Table 2), occasionally achieve the largest gains (and losses) (Table 2). This contrasts with stz and ctr, which show less “volatility” and exhibit more consistent gains.

6.2 Synthetic Data

Unlike real-world data (Sect. 6.1), where the vectors and the semantic value assignment (i.e., the task) cannot be visualized, synthetic data enable intuitively grasping and visualizing the effect that transforming vectors (RQ 1) has on the similarity measures (RQ 2).

Table 1. Results with cosine similarity on real-world tasks. Since performance trends are similar, the word-similarity table (left) includes only the visual subsets, i.e., word-pairs for which images are available for both words – number of instances is in parenthesis. Results in all sets are in the Supplement. Best-performing method per row is boldfaced.
Table 2. Averaged results across real-world datasets and features. Rows include (in order) results of: (i) cosine similarity (i.e., averaged results of Tab. 1); (ii) Euclidean similarity; (iii) additive and (iv) multiplicative bias (Sect. 3.2) (b = 10); (v) learning \(T_{\theta }\) in the test set, and (vi) training times (in seconds). Despite omitting datasets, this table portrays a representative summary of the performance landscape. SDs are omitted for being uninformative, as they reflect inter-dataset variance. For individual results, see Table 1, the Supplement and win-tie-loss mentions in the text. Asterisks (\(^*\)) indicate statistically different performance (\(p\,<\,\)0.01) from orig \(\times \) cos (two-sided).

Centeredness: Crucially, original vectors in synthetic tasks are generally centered, while in real-world tasks features are uncentered (Sect. 6.1). It is reasonable not to expect features to be natively centered at \(\vec {0}\) unless this is explicitly imposed. Thus, using uncentered vectors orig (add) as a reference point in Table 3 may be more “realistic” than orig.

  • Applying an additive bias (orig (add)) generally hinders the original vectors (with cosine) (Table 3), yet one can find a pathological case in circles, where having centered vectors (e.g., orig or ctr) is detrimental. The reason is that, with centered vectors, the \(\vec {0}\) point falls inside the circles (Fig. 2), hence the angle (or cosine similarity), which stems from \(\vec {0}\), is utterly unhelpful to tell apart the inner from the outer circle (see the sketch below). Although it is important to gain insight into these cases with synthetic data, real-world feature vectors (and tasks) are unlikely to exhibit this onion-like structure unless it is explicitly imposed [38, 59]. Thus, there is no substitute for a systematic study on real-world tasks (Table 1).
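The following sketch (our own, with sklearn's make_circles) illustrates this pathology: on centered concentric circles, same-class and cross-class pairs receive indistinguishable cosine scores, since class membership depends on the radius, which the cosine discards.

```python
import numpy as np
from sklearn.datasets import make_circles

X, labels = make_circles(n_samples=400, factor=0.5, noise=0.02, random_state=0)

S = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto the unit circle
C = S @ S.T                                        # all pairwise cosine similarities
same = labels[:, None] == labels[None, :]
np.fill_diagonal(same, False)

print(C[same].mean(), C[~same].mean())             # both close to 0: cosine carries no class signal
```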

Task versus vectors: A key question that this paper answers is whether it suffices to look at (the statistics of) the vectors alone in order to tell when a transform will perform well. Unif-radius and unif-angle illustrate a negative answer. All methods fail at unif-radius (radius matters) while they all do reasonably well in unif-angle (angle matters). The only difference is the assignment of a semantic value to data points, i.e., the task itself. Thus, vectors alone do not suffice to determine effectiveness of a transform but they must be considered along with the task. Many real-world instances support this conclusion, e.g., stz improving NASNet in vis-STS, yet not in Caltech nor in CorelDB.

Failure cases: Circles illustrates a task where the cosine similarity is entirely unhelpful for telling the two classes apart (and the Euclidean only barely useful) for the regular methods, yet manifold learning methods fare better (Table 3). Further notice the detrimental effect of normalizing with the Euclidean similarity on the same dataset, as normalizing collapses both circles into one. We also highlight the general failure of all methods on our own “stress test” task, unif-rad. Likely, polar coordinates would have done a better job.

Table 3. Results on synthetic datasets. The (add) and (mult) indicate that an additive or multiplicative bias, respectively, is applied to the method (Sect. 3.2). SDs are left to the Supplement.

7 Conclusions and Future Work

Limitations. The answer to whether any of our top-performing transforms is a universal recipe to improve (text or image) features is negative. As usual, there is no free lunch. However, this study strives to include a representative and reasonable number of datasets and varied tasks to gain insight into the success rate and effect size of each transform. Performance trends show promise in defaulting to centering, PCA or standardizing the features in applications, as well as in using cosine-based (instead of Euclidean) semantic match. That said, our task selection is not exhaustive, and hence we encourage researchers to report results on new tasks and datasets.

A word of caution. In line with [33] and [32], an important contribution of this work is raising awareness about a potential source of improvements in some word and sentence embedding models, which are often tested on semantic-similarity tasks and default to cosine similarity. As shown, feature re-scaling can have a much greater impact on the overall performance than the embedding model itself. Hence, it is crucial to control for any feature re-scalings occurring in any step of the pipeline.