
1 Introduction and Literature Review

Many real-world applications require classifying “entities” that have not been encountered before, e.g., object recognition (where every object is a category) or cross-lingual dictionary induction (where every word is a category). One of the underlying reasons is the lack of resources to annotate available (and possibly systematically growing) datasets. To address this problem, zero-shot learning has been proposed.

While multiple zero-shot learning approaches have been introduced [9, 19], as of today, none of them has emerged as “the best”. In situations like this, meta-classifiers, which “receive suggestions” from individual classifiers and “judge” their value to select a “winner”, can be explored. The assumption is that such meta-classifiers will perform better than the individual ones.

Let us start from the formal problem formulation. Given a dataset of image embeddings \(\mathcal {X}=\{(x_i,y_i)\in \mathcal {X}\times \mathcal {Y}|i=1,...,N_{tr}+N_{te}\}\), each image is represented by a real D-dimensional embedding (feature) vector \(x_i\in \mathbb {R}^D\), and each class label is represented by an integer \(y_i\in \mathcal {Y}\equiv \{1,...,N_0,N_0+1,...,N_0+N_1\}\), giving \(N_0+N_1\) distinct classes. Here, for generality, it is assumed that . The dataset \(\mathcal {X}\) is divided into two subsets: (1) the training set and (2) the test set. The training set is given by \(X^{tr}=\{(x_i^{tr},y_i^{tr})\in \mathcal {X}\times \mathcal {Y}_0|i=1,...,N_{tr}\}\), where \(y_i^{tr}\in \mathcal {Y}_0\equiv \{1,...,N_0\}\), resulting in \(N_0\) distinct training classes. The test set is given by \(X^{te}=\{(x_i^{te},y_i^{te})\in \mathcal {X}\times \mathcal {Y}_1|i=N_{tr}+1,...,N_{tr}+N_{te}\}\), where \(y_i^{te}\in \mathcal {Y}_1\equiv \{N_0+1,...,N_0+N_1\}\), providing \(N_1\) distinct test classes.
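For concreteness, a minimal sketch of how such a split with disjoint seen/unseen class sets can be formed is shown below; the array names and the NumPy formulation are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Hypothetical arrays: X holds the D-dimensional image embeddings (one row per image),
# y holds the integer class labels. Classes 1..N0 are "seen" (training) classes;
# classes N0+1..N0+N1 are "unseen" (test) classes.
def zero_shot_split(X, y, N0):
    seen_mask = y <= N0            # instances of training classes Y_0 = {1, ..., N0}
    unseen_mask = ~seen_mask       # instances of test classes Y_1 = {N0+1, ..., N0+N1}
    X_tr, y_tr = X[seen_mask], y[seen_mask]
    X_te, y_te = X[unseen_mask], y[unseen_mask]
    return (X_tr, y_tr), (X_te, y_te)
```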

The goal of zero-shot learning is to train a model (on dataset \(X^{tr}\)) that performs “well” on the test dataset \(X^{te}\). Obviously, since \(\mathcal {Y}_0\cap \mathcal {Y}_1=\emptyset \), zero-shot learning requires auxiliary information associating the labels of the training and the test sets. The solution is to represent each class label y \((1\le y\le N_0+N_1)\) by its prototype (semantic embedding). Here, \(\pi (\cdot ):\mathcal {Y}_0\cup \mathcal {Y}_1\rightarrow \mathcal {P}\) is the prototyping function, and \(\mathcal {P}\) is the semantic embedding space. The prototype vectors are such that any two class labels \(y_0\) and \(y_1\) are similar if and only if their prototype representations \(\pi (y_0)=p_0\) and \(\pi (y_1)=p_1\) are close in the semantic embedding space \(\mathcal {P}\); for example, their inner product \(\langle \pi (y_0),\pi (y_1)\rangle _\mathcal {P}\) is large. Prototyping all class labels into a joint semantic space, i.e., \(\{\pi (y)|y\in \mathcal {Y}_0\cup \mathcal {Y}_1\}\), makes the labels mutually related. This resolves the problem of disjoint class sets: the model can learn from the labels in the training set and predict labels from the test set.
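As a toy illustration of how prototypes relate otherwise disjoint labels, consider attribute-based semantic embeddings; the class names and attribute vectors below are made up for illustration only.

```python
import numpy as np

# Made-up prototypes: each class is described by semantic attributes,
# e.g., "has stripes", "has four legs", "lives in water".
prototypes = {
    "zebra":   np.array([1.0, 1.0, 0.0]),
    "horse":   np.array([0.0, 1.0, 0.0]),
    "dolphin": np.array([0.0, 0.0, 1.0]),
}

def similarity(a, b):
    """Inner product in the semantic space P; a large value indicates related classes."""
    return float(np.dot(prototypes[a], prototypes[b]))

print(similarity("zebra", "horse"))    # 1.0 -> related classes
print(similarity("zebra", "dolphin"))  # 0.0 -> unrelated classes
```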

Multiple algorithms have been proposed to solve the zero-shot learning problem. DeViSE [6], ALE [2], and SJE [3] use a bilinear compatibility function, trained with Stochastic Gradient Descent (SGD) and implicitly regularized by early stopping. ESZSL [12] uses a square loss to learn the bilinear compatibility function and explicitly defines regularization with respect to the Frobenius norm. Kodirov et al. [8] propose a semantic encoder-decoder model (SAE), where the training instances are projected into the semantic embedding space \(\mathcal {P}\), with the projection matrix W, and then projected back into the feature space \(\mathcal {X}\), with the conjugate transpose of the projection matrix \(W^*\). Another group of approaches adds a non-linearity component to the linear compatibility function [18]. A third set of approaches uses probabilistic mappings [9]. A fourth group of algorithms expresses the input image features and the semantic embeddings as a mixture of seen classes [21]. A fifth group includes both seen and unseen classes in the training data [20].
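To make the bilinear-compatibility idea concrete, the sketch below shows the well-known closed-form solution of ESZSL [12]; the matrix names, the regularization values, and the use of NumPy are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def eszsl_train(X, S, Y, gamma=1.0, lam=1.0):
    """Closed-form ESZSL solution.
    X: D x N matrix of training features, S: a x C matrix of class prototypes,
    Y: N x C ground-truth matrix (one column per seen class)."""
    D, a = X.shape[0], S.shape[0]
    A = np.linalg.inv(X @ X.T + gamma * np.eye(D))  # (X X^T + gamma I)^-1
    B = np.linalg.inv(S @ S.T + lam * np.eye(a))    # (S S^T + lambda I)^-1
    return A @ X @ Y @ S.T @ B                      # D x a compatibility matrix W

def eszsl_predict(W, x, S_unseen):
    """Score unseen classes with the bilinear compatibility x^T W s and pick the best."""
    scores = x @ W @ S_unseen                       # S_unseen: a x C_unseen
    return int(np.argmax(scores))
```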

In this context, a comparison of five state-of-the-art zero-shot learning approaches, applied to five popular benchmarking datasets, is presented. Next, explorations into meta-classifiers for zero-shot learning are reported. An extended version of this work, with additional details and results, can be found in [14].

2 Selection of Methods and Experimental Setup

Based on the analysis of the literature, five robust zero-shot learning approaches were selected: (1) DeViSE, (2) ALE, (3) SJE, (4) ESZSL, and (5) SAE. Moreover, the following datasets, popular in the literature, have been picked: (a) CUB [17], (b) AWA1 [9], (c) AWA2 [19], (d) aPY [5], and (e) SUN [11]. Finally, six standard meta-classifiers have been tried: (A) Meta-Decision Tree MDT [16], (B) deep neural network DNN [7], (C) Game Theory-based approach GT [1], (D) Auction-based model Auc [1], (E) Consensus-based approach Con [4], and (F) a simple majority voting MV [13]. Here, meta-classifiers (C), (D), (E), and (F) have been implemented following the cited literature. The implementation of (A), however, differs from the one described in [16] by not applying the weight condition on the classifiers; the effects of this simplification can be explored in the future. Finally, the DNN has two hidden layers and an output layer, all of which use the rectified linear activation function; it is optimized with SGD and the mean squared error loss function. All code and the complete list of hyperparameter values for the individual classifiers and the meta-classifiers can be found in the GitHub repository (see Footnote 1). While the hyperparameter values were obtained through multiple experiments, no claim about their optimality is made. The following standard measures have been used to evaluate the performance of the explored approaches: (T1) Top-1, (T5) Top-5, (LogLoss) Logarithmic Loss, and (F1) F1 score. Their definitions can be found in [10, 15, 19].
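A minimal sketch of the DNN meta-classifier described above is given below. The hidden-layer widths and the assumption that the input is the concatenation of the per-class score vectors produced by the five individual classifiers are illustrative choices; the exact values used in the experiments are in the GitHub repository.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_meta_dnn(input_dim, n_classes):
    """Two hidden layers and an output layer, all with ReLU activation,
    trained with SGD and the mean squared error loss (as described in the text)."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),          # first hidden layer (width assumed)
        layers.Dense(64, activation="relu"),           # second hidden layer (width assumed)
        layers.Dense(n_classes, activation="relu"),    # output layer
    ])
    model.compile(optimizer="sgd", loss="mean_squared_error")
    return model
```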

Separately, the comparison with results reported in [19] has to be addressed. To the best of our knowledge, the code used there is not publicly available. Thus, the best effort was made to re-implement the methods from [19]. At this stage, the known differences are: (1) the feature vectors and semantic embedding vectors provided with the datasets were used, instead of calculated ones; (2) the dataset splits used in our code follow the splits proposed in [19], instead of the “standard splits”. Nevertheless, we believe that the results presented herein fairly represent the work reported in [19].

3 Experiments with Individual Classifiers

The first set of experimental results was obtained using the five classifiers applied to the five benchmark datasets. Table 1 reports the results available from [2, 3, 6, 8, 12] (column O), the results based on [19] (column R), and the results of the in-house implementations of the five models (column I). Overall, all classifiers performed “badly” when applied to the aPY dataset. Next, comparing columns R and I, out of 25 results, the methods based on [19] are somewhat more accurate in 15 cases. Since the performances are close, and one could claim that our implementation of the methods from [19] is “questionable”, from here on only results based on the “in-house” implementations of the zero-shot learning algorithms are reported.

Table 1. Individual classifier performance for the Top-1 accuracy

While the Top-1 performance measure is focused on the “key class”, other performance measures have been tried. In [14], performance measured using Top-5, LogLoss, and the F1 score has been reported. Overall, it can be stated that (A) different performance measures “promote” different zero-shot learning approaches; (B) aPY is the “most difficult dataset” regardless of the measure; (C) no individual classifier is a clear winner. Therefore, a simplistic method has been proposed to gain a better understanding of the “overall strength” of each classifier; however, what follows “should be treated with a grain of salt”. Here, individual classifiers score points ranging from 5 to 1 (from best to worst), based on the accuracy obtained for each dataset. This process is applied to all four accuracy measures. The combined scores are reported in Table 2. Interestingly, SAE is the weakest method both for the individual datasets and for the overall performance. The combined performances of ALE, SJE, and ESZSL are very similar.
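A sketch of this simple ranking scheme is shown below. The data layout and the explicit handling of LogLoss (where lower values are better) are assumptions made for illustration.

```python
# accuracy[measure][dataset][classifier] is assumed to hold the value achieved by
# one classifier on one dataset under one measure.
def combined_scores(accuracy, classifiers, lower_is_better=("LogLoss",)):
    scores = {c: 0 for c in classifiers}
    for measure, per_dataset in accuracy.items():
        reverse = measure not in lower_is_better   # for LogLoss, smaller is better
        for results in per_dataset.values():
            # the best classifier receives 5 points, ..., the worst receives 1 point
            ranked = sorted(classifiers, key=lambda c: results[c], reverse=reverse)
            for points, c in zip(range(5, 0, -1), ranked):
                scores[c] += points
    return scores
```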

Table 2. Individual classifier combined performance; “winners” marked in bold font.
Table 3. Analysis of Instance Difficulty (represented in %)

3.1 Analysis of the Datasets

Since it became clear that the performance of the classifiers is directly related to the datasets, their “difficulty” has been explored. Here, an instance (in a dataset) is classified as lvl 0 if no classifier identified it correctly, and as lvl 5 if it was recognized by all classifiers. The results in Table 3 show how many instances (in %) belong to each category, for each dataset. The aPY dataset has the largest percentage of instances that have not been recognized at all (36.37%), or only by one or two classifiers (jointly, 41.51%). At the same time, only 0.85% of its instances have been recognized by all classifiers. The SUN dataset is the easiest: 27.85% of its instances were correctly recognized by all classifiers, and about 55% of its instances are “relatively easy”.
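A minimal sketch of this instance-difficulty analysis is given below; the data structures are assumed for illustration.

```python
from collections import Counter

# predictions[c] is assumed to be the list of labels predicted by classifier c;
# y_true holds the ground-truth labels. An instance is at level k when exactly
# k of the five classifiers predicted its label correctly.
def difficulty_levels(predictions, y_true):
    levels = Counter()
    for i, true_label in enumerate(y_true):
        correct = sum(preds[i] == true_label for preds in predictions.values())
        levels[correct] += 1
    total = len(y_true)
    return {lvl: 100.0 * count / total for lvl, count in sorted(levels.items())}
```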

Approaching the issue from a different perspective, the “influence” of individual attributes has been “scored”. For each correctly recognized instance, its attributes have been given +1 “point”. For each incorrectly recognized instance, its attributes were given −1 “point”. This measure captures which attributes are the easiest/hardest to recognize. The obtained results can be found in Table 4. The most interesting observation is that the attributes “has eye color black” and “metal” are associated with so many instances that they are classified (in)correctly regardless of whether they actually “influenced” the “decision” of the classifier.
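A sketch of this attribute scoring, for a single classifier's predictions, could look as follows; the input format (per-instance attribute sets) is an assumption.

```python
from collections import defaultdict

# instance_attrs[i] is assumed to be the set of attribute names of instance i;
# y_pred and y_true are the predicted and ground-truth labels.
def attribute_scores(instance_attrs, y_pred, y_true):
    scores = defaultdict(int)
    for attrs, pred, true_label in zip(instance_attrs, y_pred, y_true):
        delta = 1 if pred == true_label else -1   # +1 for correct, -1 for incorrect
        for attribute in attrs:
            scores[attribute] += delta
    return dict(scores)
```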

Table 4. Analysis of the datasets

4 Meta-Classifiers

Let us now move to the meta-classifiers. Here, note that the number of hard instances found in each dataset (see Sect. 3.1) establishes a hard ceiling for DNN, MDT, and MV: if not a single classifier gave the correct answer, the combination produced by these approaches will also “fail”. In Table 5, the performance of the six meta-classifiers is compared using the Top-1 measure, where the Best row denotes the best result obtained by the “winning” individual classifier for a given dataset (see Table 1). Results for the F1 score can be found in [14].
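A minimal sketch of the majority-voting (MV) meta-classifier illustrates the ceiling: if no individual classifier is correct for an instance, the fused prediction cannot be correct either. The data layout is assumed for illustration.

```python
from collections import Counter

# per_classifier_predictions[c] is assumed to be the list of labels predicted by
# classifier c; ties are broken by the first label reaching the highest count.
def majority_vote(per_classifier_predictions):
    n_instances = len(next(iter(per_classifier_predictions.values())))
    fused = []
    for i in range(n_instances):
        votes = [preds[i] for preds in per_classifier_predictions.values()]
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused
```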

Table 5. Meta-classifier performance; Top-1 accuracy

Comparing the results, one can see that: (a) the best individual classifier performed better than the best meta-classifier on CUB, AWA2, and aPY (by 2.91%, 0.88%, and 3.59%, respectively); (b) the best meta-classifier performed better than the best individual classifier on the AWA1 and SUN datasets (by 0.08% in both cases).

Table 6. Meta-classifier and individual classifier combined performance

Finally, the “scoring method” was applied jointly to the meta-classifiers and the individual classifiers, for the Top-1 and the F1 score accuracy measures. Since 11 classifiers were compared, the top score was 11 points. Table 6 displays the results. It can be noticed that: (1) the meta-classifiers performed better than the individual classifiers (averaging 77.83 vs. 74.6 points); (2) combining the results of the individual classifiers using a simple majority voting algorithm brought the best results; (3) at the same time, the use of basic versions of more advanced meta-classifiers does not lead to immediate performance gains.

5 Concluding Remarks

In this contribution, the performance of five zero-shot learning models, applied to popular benchmarking datasets, has been studied. Moreover, the “nature of the difficulty” of these datasets has been explored. Finally, six standard meta-classifiers have been experimented with. The main findings are: (1) there is no single best classifier, and results depend on the dataset and the performance measure; (2) the aPY dataset is the most difficult for zero-shot learning; (3) standard meta-classifiers may bring some benefits; (4) the simplest methods obtained the best results (i.e., the individual classifier ESZSL and the meta-classifier MV). The obtained prediction accuracy (less than 70%) suggests that a lot of research is needed on both the individual classifiers and, possibly, the meta-classifiers. Moreover, datasets similar to aPY, on which it is particularly difficult to achieve good performance, should be used. Finally, some attention should be devoted to the role that individual attributes play in class (instance) recognition.