1 Introduction

Attributes are visual properties describable in words, capturing anything from material properties (metallic, furry), shapes (flat, boxy), expressions (smiling, surprised), to functions (sittable, drinkable). Since their introduction to the recognition community (Farhadi et al. 2009; Kumar et al. 2008; Lampert et al. 2009), attributes have inspired a number of useful applications in image search (Cai et al. 2015; Kovashka and Grauman 2013; Kovashka et al. 2012; Kumar et al. 2008; Siddiquie et al. 2011), biometrics (Chen et al. 2013; Kalayeh et al. 2017; Reid and Nixon 2014), and language-based supervision for recognition (Biswas and Parikh 2013; Demirel et al. 2017; Lampert et al. 2009; Parikh and Grauman 2011; Shrivastava et al. 2012; Yao et al. 2017).

Existing attribute models come in one of two forms: categorical or relative. Whereas categorical attributes are suited only for clear-cut predicates, such as wooden or four-legged, relative attributes can represent “real-valued” properties that inherently exhibit a spectrum of strengths, such as serious or sporty. These spectra allow a computer vision system to go beyond recognition into comparison, which opens up a number of interesting applications. In biometrics, the system could interpret descriptions like, “the suspect is taller than him” (Reid and Nixon 2014). In image search, the user could supply semantic feedback to pinpoint his desired content: “the shoes I want to buy are like these but more formal” (Kovashka et al. 2012). For subjective visual tasks, users could teach the system their personal perception, e.g., about which human faces are more attractive than others (Altwaijry and Belongie 2012).

In these and many other such cases, we are interested in inferring how a pair of images compares in terms of a particular attribute. Importantly, the distinctions of interest are often quite subtle, or in other words, fine-grained. Fine-grained comparisons arise both in image pairs that are very similar in almost every regard (e.g., two photos of the same individual wearing the same clothing, yet smiling more in one photo than the other), as well as image pairs that are holistically different yet exhibit only slight differences in the attribute in question (e.g., two individuals different in appearance, and one is smiling slightly more than the other).

Relative attributes are typically treated in a learning-to-rank setting: training data is ordered (e.g., we are told image A has the attribute more than image B) and a ranking function is optimized to preserve those orderings. Given a new image, the function returns a score conveying how strongly the attribute is present (Altwaijry and Belongie 2012; Cao et al. 2014; Datta et al. 2011; Fan et al. 2013; Kovashka et al. 2012; Li et al. 2012; Matthews et al. 2013; Parikh and Grauman 2011; Sadovnik et al. 2013; Sandeep et al. 2014).

A major challenge in training a ranking model is the sparsity of supervision. This sparsity stems from two factors: label availability and image availability. Because training instances consist of pairs of images—together with the ground truth human judgment about which exhibits the property more or less—the space of all possible comparisons is quadratic in the number of potential training images. This quickly makes it intractable to label an image collection exhaustively for its comparative properties. At the same time, attribute comparisons entail a greater cognitive load than, for example, object category labeling. Indeed, there is a major size gap between standard datasets labeled for classification [now in the millions (Deng et al. 2009)] and those for comparisons [at best in the thousands (Yu and Grauman 2014)].

More insidious than the annotation cost, however, is the problem of even curating training images that sufficiently illustrate fine-grained differences. Critically, sparse supervision arises not simply because we lack resources to get enough image pairs labeled, but also because we lack a direct way to curate photos demonstrating all sorts of subtle attribute changes. Existing models assume that useful unlabeled training images are already available, and that even if the existing labels are sparse, it is always possible to label more image pairs to increase the density of the overall supervision. However, this is not necessarily the case when training a fine-grained ranking model. For example, how might we gather unlabeled image pairs depicting all subtle differences in sportiness in clothing images or surprisedness in faces? As a result, even today’s best datasets contain only partial representations of an attribute.

Fig. 1 Relative attributes are trained from ordered pairs of images, in which we know the attribute is stronger in one image than the other. When learning the attribute smiling, real training images need not be representative of the entire attribute space (e.g., Web photos may cluster around commonly photographed expressions, like toothy smiles). Our idea densifies the supervision by “filling in” the sparsely sampled regions using generated ordered pairs of synthetic images. Given a novel pair (top), the nearest synthetic pairs (right) may present better training data than the nearest real pairs (left)

We liken the current situation to the “streetlight effect”, where a person searches for their lost keys under the streetlight simply because that is where the light is, not because that is where the keys were lost (Freedman 2010).Footnote 1 Similarly, the vision community risks inadequately training models if we restrict ourselves to the “streetlight” of unlabeled Web photos that can be crawled with keyword search to create training sets.

To overcome this sparsity of supervision, we propose to add targeted synthetic image pairs into the training data when learning a fine-grained ranking model. The main idea is to generate plausible photos exhibiting variations along some attribute(s), thereby recovering samples in regions of the attribute space that are underrepresented among the real training images (Fig. 1). After (optionally) verifying the comparative labels with human annotators, we train a discriminative ranking model using the synthetic training pairs in conjunction with real image pairs. The resulting model predicts attribute comparisons between novel pairs of real images.

We introduce two models for achieving the proposed “densification” of the training data: a passive semantic jitter model and an active adversarial model. For both models, the synthetic training images are created by an attribute-conditioned image generator based on a conditional Variational Autoencoder (CVAE); however, the way in which the images are sampled differs.

Our first model, the semantic jitter model, performs semantic-level manipulation of synthetic images offline, prior to training the attribute model. Whereas existing methods are limited to training with observed real images—which may overly concentrate on obvious attribute differences—semantic jitter alters an attribute slightly in order to directly curate subtly different training pairs. We call our idea semantic jitter because we are jittering the image according to a semantic (attribute) property. Unlike widely used data augmentation tricks that systematically perturb images with low-level transforms [mirroring, scaling, etc. (Dosovitskiy et al. 2014; Simard et al. 2003; Singh and Lee 2016; Souri et al. 2016; Vincent et al. 2008; Yang et al. 2016)], semantic jitter injects high-level changes that affect the very meaning of the training images. In other words, our jitter has a semantic basis rather than a purely geometric or photometric basis.

Building on this idea, we next develop a model for active training image creation (ATTIC) that learns to synthesize precisely the fine-grained image pairs that will most benefit the attribute ranking model. We devise an end-to-end system that synthesizes image pairs that would confuse the current ranking model, then presents them to human annotators for labeling. While our semantic jitter approach generates individual images passively by adjusting one target attribute at a time, our active training approach generates pairs of images actively by adjusting multiple attributes simultaneously and in an adversarial manner. Our idea can be seen as a form of active query synthesis. Whereas traditional pool-based active learning also aims to prioritize informative images for labeling (Freytag et al. 2014; Vijayanarasimhan and Grauman 2009, 2014; Zhao et al. 2011), it suffers from the streetlight effect by restricting the pool to manually curated unlabeled images (typically Web photos). In contrast, our approach creates targeted novel images for annotation, thereby accelerating learning and breaking free of the streetlight bias.

Lastly, to complement these new approaches, we curate and crowdsource a large-scale shoe dataset, building upon our previous UT Zappos50K (UT-Zap50K) dataset (Yu and Grauman 2014), which is tailored for fine-grained comparison tasks. To date, this is the largest relative attribute dataset that includes instance-level supervision.Footnote 2

This article expands upon our two prior conference papers (Yu and Grauman 2017, 2019). The main additions are as follows. First, we provide a complete and consistent set of experiments to relate our two methods. Second, we expand our experiments to analyze the manner in which our approach densifies supervision, both quantitatively and through new visualizations. Third, we create many new figures to illustrate the approach, how it works, and typical qualitative results. Finally, we unify the presentation of the motivation and related work.

The rest of the paper proceeds as follows. In Sect. 2, we discuss related work. In Sect. 3, we present the datasets. In Sect. 4, we discuss our proposed approaches for addressing the sparsity of supervision issue. In Sect. 5, we perform detailed evaluation of the effectiveness of our densification approaches in two distinct domains: human faces and shoes. Finally, we conclude in Sect. 6 and overview future work.

2 Related Work

We next overview relevant work in the areas of relative attributes, image generation, learning from synthetic images, and active learning.

2.1 Attribute Comparison and Fine-Grained Attributes

Since the introduction of relative attributes (Parikh and Grauman 2011), the task of attribute comparisons has gained attention for its variety of applications, such as online shopping (Kovashka et al. 2012, 2015), biometrics (Reid and Nixon 2013), novel forms of low-supervision learning (Biswas and Parikh 2013; Shrivastava et al. 2012), font selection (O’Donovan et al. 2014), 3D model editing (Chaudhuri et al. 2013; Yumer et al. 2015), and visual semantic reasoning (Demirel et al. 2017; Su et al. 2017). The original approach (Parikh and Grauman 2011) adopts a learning-to-rank framework (RankSVM) (Joachims 2002) that learns a global linear ranking function for each attribute. The model uses pairwise supervision—pairs of images ordered according to their perceived attribute strength based on human annotators—and trains a ranking function that preserves those orderings. Given a novel pair of images, the ranker indicates which image has the attribute more. Subsequently, non-linear ranking functions (Li et al. 2012), combining feature-specific rankers (Datta et al. 2011), and multi-task learning (Chen et al. 2014) have all been shown to further improve accuracy.

More recently, the success of deep networks has motivated end-to-end frameworks for learning features and attribute ranking functions simultaneously (Meng et al. 2018; Singh and Lee 2016; Souri et al. 2016; Yang et al. 2016). While these deep models offer a higher learning capacity compared to their “shallow” counterparts, they also come with a greater need for labeled data and higher computation cost. Aside from these learning-to-rank formulations, researchers have applied the Elo rating system for biometrics (Reid and Nixon 2014), and regression over “cumulative attributes” for age and crowd density estimation (Chen et al. 2013). Other work investigates features tailored for attribute comparisons, such as facial landmark detectors (Sandeep et al. 2014) and visual chains to discover relevant parts (Xiao and Lee 2015).

We use the term “fine-grained attributes” to refer to slight but perceptible differences in an attribute. Research on fine-grained categorization instead aims to recognize objects in a single domain, e.g., birds (Branson et al. 2010; Farrell et al. 2011), planes (Maji et al. 2013), and cars (Yang et al. 2015). While such problems also require making distinctions among visually close instances, our goal is to compare attributes, not categorize objects.

2.2 Attributes and Image Synthesis

A key ingredient for creating dense supervision from synthetic images is a generative model that can progressively modify the target attribute(s) while preserving the rest. Attribute-specific alterations have been considered in several recent methods, primarily for face images. Some target a specific domain and attribute, such as methods to enhance the “memorability” (Khosla et al. 2013) or age (Kemelmacher-Shlizerman et al. 2014) of facial photos, to edit outdoor scenes with transient attributes like weather (Laffont et al. 2014), or to modify 3D attributes such as pose and depth of objects (Dixit et al. 2017).

The success of deep neural networks for image generation [i.e., Generative Adversarial Nets (GAN) (Goodfellow et al. 2014; Huang et al. 2018; Isola et al. 2017; Lu et al. 2018; Radford et al. 2016; Zhang et al. 2018, 2017a; Zhu et al. 2017; Choi et al. 2018) or Variational Autoencoders (VAE) (Gregor et al. 2015; Kingma and Welling 2014; Kulkarni et al. 2015)] opens the door to learning how to generate images conditioned on desired properties (Dosovitskiy et al. 2015; Li et al. 2016; Pandey and Dukkipati 2016; Yan et al. 2016a, b). For example, a conditional multimodal autoencoder can generate faces from attribute descriptions (Pandey and Dukkipati 2016), and focus on identity-preserving changes (Li et al. 2016). Furthermore, conditional models can also synthesize an image based on an input, either a label map (Isola et al. 2017; Zhu et al. 2017) or a latent attribute label (Lample et al. 2017; Lu et al. 2018; Upchurch et al. 2017; Yan et al. 2016a). To densify the pairwise label space for fine-grained comparisons, we incorporate the model of Yan et al. (2016a), which has the property of an interpretable latent space (attributes) that our algorithm can adjust continuously; in principle other similarly equipped models could also be employed in our framework. We show how to sample images using this generator to “fill in the gaps” in the label space. Whereas the above methods aim to produce an image for human consumption, we aim to generate dense supervision for learning algorithms. To our knowledge, we are the first to propose direct image generation as a solution to the sparsity of supervision problem.

Despite the success of deep neural networks, research has demonstrated their sensitivity to small perturbations to the input image through “fooling networks” (Moosavi-Dezfooli et al. 2016; Nguyen et al. 2015). In our active image generation approach, rather than using adversarial generation to understand how features influence a classifier, our goal is to synthesize the very training samples that (once labeled by human annotators) will strengthen a learned ranker. Unlike any of the above, we create images for active query synthesis.

2.3 Training with Synthetic Images

The use of synthetic images as training data has been explored to a limited extent, and primarily for human bodies. Taking advantage of high quality graphics models for humanoids, rendered images of people from various viewpoints and body poses provide free data to train pose estimators (Shakhnarovich et al. 2003; Shotton et al. 2011; Varol et al. 2017) or person detectors (Pishchulin et al. 2011). Using the first frame of a video as reference, one can personalize a pose estimator by synthesizing deformations (Park and Ramanan 2015) or train an object tracker by synthesizing plausible future frames (Khoreva et al. 2017). Recent work also explores how to maintain realism when training with synthetic images for eye gaze and hand pose (Shrivastava et al. 2017).

For objects beyond people, recent work considers how to exploit non-photorealistic images generated from 3D CAD models to augment training sets for object detection (Peng et al. 2015; Yang and Deng 2017) and for indoor scene understanding (Zhang et al. 2017b), or words rendered in different fonts for text recognition (Jaderberg et al. 2014). Other work explores how to adversarially generate “hard” low-level transformations that are label-preserving for body pose estimation (Peng et al. 2018), greedily select useful transformations for image classification (Paulin et al. 2014), or actively evolve part-based 3D shapes to learn shape from shading (Yang and Deng 2017). Such methods share our motivation of generating samples where they are most needed, but for very different tasks. Furthermore, our approach aligns more with active learning than these data augmentation approaches: in our case the new synthetic samples can be distant from available real samples, and so they are manually annotated before being used for training.

Example synthesis is also of interest in the low-shot learning community, where training data for the novel classes are either severely limited (Dixit et al. 2017; Hariharan and Girshick 2017; Kwitt et al. 2016; Miller et al. 2000) or completely absent (Changpinyo et al. 2017; Verma et al. 2018; Xian et al. 2018; Zhu et al. 2018). The primary draw is the ability to “hallucinate” variability for learning where the variability is practically non-existent. In those works, synthesis takes place in the feature space (Dixit et al. 2017; Hariharan and Girshick 2017; Hauberg et al. 2017; Kwitt et al. 2016), not the image space. Though similarly motivated, our focus is not restricted to low-shot cases; in fact, our results will show that even after exhausting all available real labeled images, our method can improve performance. Furthermore, we explicitly generate images (not features) since the novel training instances we create need to be intelligible to be labeled by human annotators. Finally, rather than use manually defined heuristics to sample synthetic images, we show how to dynamically derive the images most valuable to training.

While in certain ways the above methods share our general concern about the sparsity of supervision, our focus on attributes and ranking is unique. Furthermore, many of the prior methods assume a graphics engine and 3D model to render new views with desired parameters (pose, viewpoint, etc.). In contrast, we investigate images generated from a 2D image synthesis engine in which the modes of variation are controlled by a learned model. Being data-driven can offer greater flexibility, allowing tasks beyond those requiring a 3D model, and variability beyond camera pose and lighting parameters, albeit in exchange for noisier generation. Finally, unlike any of the above, our approach actively targets the images to generate so as to benefit the given recognition task.

2.4 Active Learning

Active learning has been studied for decades (Settles 2010). For visual recognition problems, pool-based methods are the norm: the learner scans a pool of unlabeled samples and iteratively queries for the label on one or more of them based on uncertainty or expected model influence (e.g., Freytag et al. 2014; Vijayanarasimhan and Grauman 2009, 2014; Zhao et al. 2011). Active ranking models adapt the concepts of pool-based learning to select pairs for comparison (Liang and Grauman 2014; Qian et al. 2013). Hard negative mining—often used for training object detectors (Felzenszwalb et al. 2010; Shrivastava et al. 2016)—also focuses the learner’s attention on useful samples, though in this case from a pool of already-labeled data. Rather than display one query image to an annotator, the approach in Huijser and van Gemert (2017) selects a sample from the pool then displays a synthesized image spectrum around it in order to elicit feature points likely to be near the true linear decision boundary for image classification. We do not perform pool-based active selection. Unlike any of the above, our proposed active approach creates image samples that (once labeled) should best help the learner, and it does so in tight coordination with the ranking algorithm.

In contrast to pool-based active learning, active query synthesis methods request labels on novel samples from a given distribution (Alabdulmohsin et al. 2015; Angluin 1988; Settles 2010; Zhu and Bento 2017). When the labeler is a person (as opposed to an oracle or experimental outcome), a well known challenge with query synthesis is that the query may turn out to be unanswerable (Baum and Lang 1991). Accordingly, there is very limited prior work attempting active query synthesis for image problems. One recent study (Zhu and Bento 2017) uses the traditional active learning heuristic for linear SVMs (Tong and Koller 2017) to generate images for toy binary image classification tasks (e.g., two MNIST digits). To our knowledge, ours is the first approach that learns to actively generate training images jointly with the image prediction task, and we demonstrate its real impact on modern well-studied datasets with complex fashion products and face images. Furthermore, rather than sample from some given input distribution, our selection approach optimizes the latent parameters of an image pair to directly affect the current deep ranking model.

3 Datasets

Before we introduce our proposed approaches, we first present the datasets we use for evaluation. They span two fine-grained domains: shoes and faces. Both domains offer a good testbed for the fine-grained attribute problem, because in both there are slight but perceptible differences in attributes that matter for real applications. For example, slightly different smiles can correspond to subtle but meaningful differences in facial expression; subtle visual variations can make the difference between a consumer buying or bypassing a given shoe.

The supervision on these datasets comes in the form of relative annotations between a pair of images, i.e., given a target attribute \({\mathcal {A}}\), image i is perceived to have “more/less” of \({\mathcal {A}}\) than image j. Due to the fine-grained nature of the comparison tasks, we require the pairwise supervision to have the utmost precision, which most prior datasets do not enforce. Note that while we also make use of synthetically generated image pairs for training our ranking models, we will leave the specific details of their generation for Sect. 5.
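For concreteness, the sketch below shows one way such an instance-level comparison could be represented in code. It is purely illustrative: the field names and Python representation are our own and are not part of any released dataset format.

```python
from dataclasses import dataclass

@dataclass
class RelativePair:
    """One instance-level comparison for a single attribute.

    The label follows the convention used by the ranking objective later in
    the paper: 1.0 if image_i shows the attribute more than image_j, and
    0.5 if the two images are judged equal.
    """
    attribute: str      # e.g., "sporty" or "smiling"
    image_i: str        # identifier or path of the first image
    image_j: str        # identifier or path of the second image
    label: float        # 1.0 ("more") or 0.5 ("equal")
    num_votes: int = 5  # crowd responses aggregated by majority vote

# Hypothetical example record:
pair = RelativePair("sporty", "shoes/0001.jpg", "shoes/0042.jpg", label=1.0)
```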

3.1 Shoes Domain

Our original UT-Zap50K dataset (Yu and Grauman 2014) is the largest relative attribute dataset to date. Created in the context of an online shopping task, it contains over 50,000 catalog shoe images from Zappos.com with over 12,000 instance-level pairwise annotations on 4 relative attributes: open, pointy, sporty, and comfort. The images are roughly \(150\times 100\) pixels and shoes are pictured in the same orientation for convenient analysis.

Given UT-Zap50K’s initial success in fine-grained comparison tasks, we expand upon it by turning our attention towards the selection of the attribute names. The lexicon of attributes used in existing relative attributes datasets, including UT-Zap50K, is selected based on intuitions, i.e., words that seem domain relevant (Kovashka et al. 2015) or words that seem to exhibit the most subtle fine-grained differences. However, this kind of intuition is inherently biased by the person making the selection. Therefore, to address this limitation, we (1) use crowdsourcing to mine for an attribute lexicon that is explicitly fine-grained, and (2) collect an even larger number of pairwise labels for each attribute in that lexicon.

Given a pair of images, we ask workers from the Amazon Mechanical Turk (mTurk) platform to complete the sentence, “Shoe A is a little more \(\langle \)insert word\(\rangle \) than Shoe B” using a single word. They are instructed to identify subtle differences between the images and provide a short rationale. The goal is to discover how people describe fine-grained differences between shoe images. Over 1000 workers participated in the study, yielding a total of 350+ distinct word suggestions across 4000 image pairs viewed. This approach to lexicon generation takes inspiration from Maji (2012), but is tuned towards eliciting “almost indistinguishable” visual changes rather than arbitrary attribute differences. See the “Appendix” for more details about the crowdsourcing interfaces.

Fig. 2 Word cloud depicting our crowd-mined data for a fine-grained relative attribute lexicon for shoes (before post-processing)

Fig. 3 Sample image pairs from our improved UT-Zap50K dataset for five attributes from the new fine-grained lexicon. The left image has “more” of the attribute than the right image. Note that some images can look drastically different overall while still exhibiting subtle differences in the target attribute space, and vice-versa

Figure 2 shows a word cloud of the raw results, which we post-process through merging of synonyms and pruning based on user rationales. For example, if the provided rationale fails to explain the word choice, that response is removed. After post-processing, we select the 10 most frequent words as the new fine-grained relative attribute lexicon for shoes: comfort, casual, simple, sporty, colorful, durable, supportive, bold, sleek, and open.

Using this new lexicon, we collect pairwise supervision for about 4000 pairs for each of the 10 attributes, using the shoe images from UT-Zap50K. Figure 3 shows sample shoe image pairs along with their relative labels. This is a step towards denser supervision on real images—more than three times the total comparison labels provided in the original dataset across all attributes. Still, as we will see in our experiments in Sect. 5, the greater density offered by synthetic training instances is needed for best results.

Fig. 4 Sample image pairs from the LFW-10 dataset (Sandeep et al. 2014), in the same format as Fig. 3

3.2 Face Domain

Aside from the UT-Zap50K dataset in the shoe domain, the most popular domain for fine-grained comparison is faces. Among face datasets, LFW-10 (Sandeep et al. 2014) is the only one that contains pairwise supervision at the instance level. The LFW-10 dataset consists of 2000 face images taken from the Labeled Faces in the Wild (LFW) dataset (Huang et al. 2007). It contains 10 attributes: bald, dark hair, big eyes, good looking, masculine, mouth open, smiling, visible teeth, visible forehead, and young. After pruning pairs with less than 80% worker agreement, there are 600 pairwise labels on average per attribute. Figure 4 shows sample face image pairs from the dataset.

All the datasets employed in our experiments are widely used in the relative attributes literature. To our knowledge they are also the only available datasets for instance-level fine-grained attribute comparisons.

4 Approach

Our work addresses the sparsity of supervision issue for fine-grained comparisons. As discussed above, due to the pairwise nature of the supervision labels, the space of all possible comparisons is quadratic in the number of potential training images. This quickly makes it intractable to label an image collection exhaustively for its comparative properties. Meanwhile, curating real training images that sufficiently illustrate fine-grained differences is problematic—arguably even more insidious than the annotation cost.

Our idea is to “densify” supervision for visual comparisons by adding synthetically generated images to the training data. We first outline the attribute-conditioned image generator used to generate the synthetic images in Sect. 4.1, followed by the deep neural network based ranking model used to make predictions in Sect. 4.2. Then we present our proposed approach for generating and utilizing synthetic images, either passively through semantic jitter (Sect. 4.3) or actively through active training image creation (Sect. 4.4).

4.1 Attribute-Conditioned Image Generator

The key to improving coverage in the attribute space is the ability to generate images exhibiting subtle differences with respect to some attribute(s) while keeping the others constant. In other words, we want to traverse images in the high-level attribute space (Fig. 1). We adopt an existing image generation system, Attribute2Image (Attr2Img), introduced by Yan et al. (2016a, b), which can generate images that exhibit a given set of attributes and latent factors.

Suppose we have a lexicon of \(N_a\) attributes, \(\{{\mathcal {A}}_1,\dots ,{\mathcal {A}}_{N_a}\}\). Let \(\varvec{y} \in {\mathbb {R}}^{N_a}\) be a vector containing the strength of each attribute, and let \(\varvec{z} \in {\mathbb {R}}^{N_z}\) be the latent variables. The Attr2Img approach constructs a generative model for \(p_\theta (\varvec{x} | \varvec{y},\varvec{z})\) that produces realistic images \(\varvec{x} \in {\mathbb {R}}^{N_x}\) conditioned on \(\varvec{y}\) and \(\varvec{z}\). The approach maximizes the variational lower bound of the log-likelihood \(\text {log}\ p_\theta (\varvec{x}|\varvec{y})\) in order to obtain the model parameters \(\theta \).

The model is implemented with a conditional Variational Autoencoder (CVAE). The network architecture generates the entangled hidden representation of the attributes and latent factors with multilayer perceptrons, then generates the image pixels with a coarse-to-fine convolutional decoder. The authors apply their approach for attribute progression, image completion, and image retrieval. See Yan et al. (2016a, b) for more details.Footnote 3 As mentioned above, we choose this image generator due to its effectiveness and ability to condition on named attributes, though in principle other generation engines could also be employed within our framework.
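For readers who prefer code to prose, the following minimal PyTorch-style sketch illustrates only the sampling interface we rely on, namely a decoder that maps an attribute vector \(\varvec{y}\) and a latent code \(\varvec{z}\) to an image. The module names and layer sizes are placeholders and do not reproduce the actual Attr2Img architecture of Yan et al. (2016a).

```python
import torch
import torch.nn as nn

class ToyAttributeDecoder(nn.Module):
    """Stand-in for the conditional decoder p_theta(x | y, z).

    A minimal sketch, not the CVAE of Yan et al. (2016a): it maps the
    concatenated (y, z) conditioning vector to pixels with an MLP followed
    by a transposed-convolution upsampler.
    """
    def __init__(self, n_attrs=10, n_latent=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_attrs + n_latent, 256), nn.ReLU(),
            nn.Linear(256, 128 * 8 * 8), nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, y, z):
        h = self.fc(torch.cat([y, z], dim=1))
        return self.deconv(h.view(-1, 128, 8, 8))  # (B, 3, 64, 64) images

# Sampling one conditioned image:
decoder = ToyAttributeDecoder()
y = torch.rand(1, 10)      # attribute strengths
z = torch.randn(1, 256)    # latent (non-attribute) factors
x_hat = decoder(y, z)
```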

4.2 Deep Ranking Functions for Relative Attributes

Deep convolutional networks are a promising model for relative attributes. The network simultaneously learns the image features and an attribute ranking function, one per attribute (Singh and Lee 2016; Souri et al. 2016; Yang et al. 2016). One common aspect of these end-to-end networks is the use of the RankNet algorithm (Burges et al. 2015), which we will briefly overview next.

RankNet is a neural network based ranking algorithm with a probabilistic cost function. Given image \(\varvec{x}_i \in {\mathbb {R}}^{N_x}\), the objective is to learn a ranking function \(R_{\mathcal {A}} : {\mathbb {R}}^{N_x} \rightarrow {\mathbb {R}}\) consisting of a series of neural network modules, which outputs the corresponding real-valued strength \(v_i\) for attribute \({\mathcal {A}}\). Let \((\varvec{x}_i, \varvec{x}_j)\) be a pair of images and \(t_{ij}\) be the target probability that \(R_{\mathcal {A}}(\varvec{x}_i)\) is valued higher than \(R_{\mathcal {A}}(\varvec{x}_j)\). RankNet maps the rank estimates to a pairwise posterior probability \(p_{ij}\) using a logistic function:

$$\begin{aligned} p_{ij} = \frac{1}{1 + e^{-(v_i-v_j)}}. \end{aligned}$$
(1)

The ranking loss is then defined as:

$$\begin{aligned} {\mathcal {L}}_{rank} = -t_{ij} \log (p_{ij}) - (1-t_{ij})\log (1-p_{ij}), \end{aligned}$$
(2)

which is a standard cross-entropy loss where \(t_{ij} = 1\) if \((i,j)\) is an ordered pair or \(t_{ij} = 0.5\) if \((i,j)\) is an equal pair. This loss is well suited for ranking because it asymptotes to a linear function, making it more robust to noise than a quadratic loss. Furthermore, it can also handle equality cases, for which the cost becomes symmetric.Footnote 4 The overall ranking function can be trained using stochastic gradient descent or its variants. Using the RankNet algorithm, we are also free to design the learning network without restriction, as long as it outputs a real-valued score \(v_i\) at the end.
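As a concrete reference, a minimal sketch of this pairwise objective in PyTorch (illustrative only, not the authors' implementation) is:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(v_i, v_j, t_ij):
    """Pairwise RankNet loss of Eqs. (1)-(2).

    v_i, v_j : predicted attribute strengths R_A(x_i), R_A(x_j), shape (B,)
    t_ij     : target probability, 1.0 for ordered pairs (i more than j),
               0.5 for pairs labeled "equal".
    """
    # p_ij = sigmoid(v_i - v_j), followed by cross-entropy against t_ij;
    # binary_cross_entropy_with_logits computes both steps in a stable way.
    return F.binary_cross_entropy_with_logits(v_i - v_j, t_ij)

# Example: one ordered pair and one "equal" pair.
v_i = torch.tensor([2.3, 0.9])
v_j = torch.tensor([1.1, 1.0])
t   = torch.tensor([1.0, 0.5])
loss = ranknet_loss(v_i, v_j, t)
```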

However, the higher learning capacity also comes with a greater need for labeled data and a higher computation cost. Existing methods rely on data augmentation techniques in the image-space (mirroring, scaling, etc.) to compensate for the data needs; our approach instead shows how to automatically augment the image pairs with synthetically generated images having controlled attribute changes.

While any learning algorithm for visual comparisons could exploit the newly generated synthetic image pairs in principle, we consider a deep Siamese RankNet with a spatial transformer network (STN), which builds upon the RankNet-based deep learning formulation above. Our choice is motivated by its leading empirical performance (Singh and Lee 2016; Yu and Grauman 2017). This deep learning-to-rank method combines a convolutional neural network (CNN) optimized for a paired ranking loss (Burges et al. 2015) with a spatial transformer network (Jaderberg et al. 2015). In particular,

$$\begin{aligned} R^{(cnn)}_{\mathcal {A}}(\phi (\varvec{x})) = \text {RankNet}_{\mathcal {A}}(\text {STN}(\phi (\varvec{x}))), \end{aligned}$$
(3)

where \(\text {RankNet}\) denotes a Siamese network with duplicate stacks, and \(\phi (\cdot )\) refers to an image feature encoding. Let \({\mathcal {P}}_{\mathcal {A}}\) denote the training set of ordered image pairs \(\{(\varvec{x}_i,\varvec{x}_j)\}\) for attribute \({\mathcal {A}}\). During training these stacks process ordered pairs, learning filters that map the images to scalars that preserve the desired orderings in \({\mathcal {P}}_{\mathcal {A}}\). The STN is trained simultaneously to discover the localized patch per image that is most useful for ranking the given attribute (e.g., it may focus on the mouth for smiling). Given a single novel image, either stack can be used to assign a ranking score representing the attribute’s strength in that image. See Singh and Lee (2016) for details.
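The following schematic Siamese ranker sketches the structure of Eq. (3) in PyTorch. The spatial transformer is reduced to an identity placeholder and the layer sizes are illustrative, not those of Singh and Lee (2016).

```python
import torch
import torch.nn as nn

class ToyDeepSTNRanker(nn.Module):
    """Schematic of R_A(x) = RankNet(STN(phi(x))) with shared weights.

    The 'stn' below is only a placeholder; a real spatial transformer
    predicts and applies an affine crop over the input image.
    """
    def __init__(self):
        super().__init__()
        self.stn = nn.Identity()                        # placeholder for the STN
        self.phi = nn.Sequential(                       # feature encoder phi(.)
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.rank_head = nn.Linear(32 * 4 * 4, 1)       # scalar attribute strength

    def score(self, x):
        return self.rank_head(self.phi(self.stn(x))).squeeze(1)

    def forward(self, x_i, x_j):
        # Both branches share the same weights (the Siamese stacks).
        return self.score(x_i), self.score(x_j)

ranker = ToyDeepSTNRanker()
v_i, v_j = ranker(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```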

4.3 Semantic Jitter

Having defined all the essential background, we now move on to presenting our new approach. First we introduce our semantic jitter approach to densify supervision. The main idea is to “jitter” training images by sampling from an attribute-conditioned generative model. Specifically, we use the Attr2Img (Yan et al. 2016a) generator defined above. Our approach differs from existing label-preserving data augmentation strategies that rely on transforms like mirroring, scaling, or mild white noise. Instead of a purely geometric or photometric change, our semantic jitter injects high-level changes that affect the very meaning of the image. See Fig. 5. The generated synthetic image pairs can be used independently or in conjunction with existing real image pairs. Our goal is to “fill in” underrepresented regions of image space, which we show helps train a model to infer attribute comparisons.

Fig. 5 Whereas standard data augmentation with low-level “jitter” (left) expands training data with image-space alterations (mirroring, scaling, etc.), our semantic jitter (right) expands training data with high-level alterations, tweaking semantic properties in a controlled manner

4.3.1 Generating Dense Synthetic Image Pairs

The first step is to generate a series of synthetic identities and then, for each identity, sample images that lie close together in a desired semantic attribute space.Footnote 5 The resulting images will comprise a set of synthetic image pairs \({\mathcal {S}}_{\mathcal {A}}\). We explore two cases for using the generated pairs: one where their putative ordering is verified by human annotators, and another “Auto” case where the ordering implied by the generation engine is taken as their (noisy) label.

Each identity is defined by an entangled set of latent factors and attributes. Let \(p(\varvec{y})\) denote a prior over the attribute occurrences in the domain of interest, where each dimension of \(\varvec{y}\) is a real-valued random variable representing the degree to which that attribute is present in the image. We model this prior with a multivariate Gaussian whose mean and covariance are learned from the attribute strengths observed in real training images: \(p(\varvec{y}) = {\mathcal {N}}(\mu ,\Sigma )\). This distribution captures the joint interactions between attributes, such that a sample from the prior reflects the co-occurrence behavior of different pairs of attributes (e.g., shoes that are very pointy are often also uncomfortable, faces that have facial hair are often masculine, etc.).Footnote 6 The prior over latent factors, \(p(\varvec{z})\), captures all non-attribute properties like pose, background, and illumination. Following Yan et al. (2016b), we represent \(p(\varvec{z})\) with an isotropic multivariate Gaussian.

To sample an identity

$$\begin{aligned} {\mathcal {I}}_j = (\varvec{y}_j,\varvec{z}_j) \end{aligned}$$
(4)

we sample \(\varvec{y}_j\) and \(\varvec{z}_j\) from their respective priors. Then, using an Attr2Img model trained for the domain of interest, we sample from \(p_\theta (\varvec{x} | \varvec{y}_j,\varvec{z}_j)\) to generate an image \(\hat{\varvec{x}}_j \in {\mathbb {R}}^{N_x}\) for this identity. Alternatively, we could sample an identity from a single real image, after inferring its latent variables through the generative model (Yang et al. 2016). However, doing so requires having access to attribute labels for that image.
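A sketch of this identity-sampling step in NumPy, assuming the attribute strengths of the real training images are available as a matrix (the variable names are ours, not from a released implementation):

```python
import numpy as np

def sample_identities(Y_real, n_identities, n_latent=256, seed=0):
    """Sample identities I_j = (y_j, z_j) from the learned priors.

    Y_real : (N, N_a) attribute strengths observed on real training images,
             used to fit the Gaussian prior p(y) = N(mu, Sigma).
    z is drawn from an isotropic Gaussian, following the generator's prior.
    """
    rng = np.random.default_rng(seed)
    mu = Y_real.mean(axis=0)
    sigma = np.cov(Y_real, rowvar=False)   # captures attribute co-occurrence
    ys = rng.multivariate_normal(mu, sigma, size=n_identities)
    zs = rng.standard_normal((n_identities, n_latent))
    return ys, zs

# Hypothetical usage with 10 attributes measured on 5000 real images:
Y_real = np.random.rand(5000, 10)
ys, zs = sample_identities(Y_real, n_identities=100)
```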

Next we modify the strength of a single attribute in \(\varvec{y}\) while keeping all other variables constant. This yields two “tweaked” identities \({\mathcal {I}}_j^{(-)}\) and \({\mathcal {I}}_j^{(+)}\) that look much like \({\mathcal {I}}_j\), only with a bit less or more of the attribute, respectively. Specifically, let \(\sigma _{\mathcal {A}}\) denote the standard deviation of attribute scores observed in real training images for attribute \({\mathcal {A}}\). We revise the attribute vector for identity \({\mathcal {I}}_j\) by replacing the dimension for attribute \({\mathcal {A}}\) according to

$$\begin{aligned} \varvec{y}^{(-)}_j({\mathcal {A}}) &= \varvec{y}_j({\mathcal {A}}) - 2\sigma _{\mathcal {A}} \quad \text {and} \nonumber \\ \varvec{y}^{(+)}_j({\mathcal {A}}) &= \varvec{y}_j({\mathcal {A}}) + 2\sigma _{\mathcal {A}}, \end{aligned}$$
(5)

and \(\varvec{y}^{(-)}_j({\mathcal {A}}^\prime ) = \varvec{y}^{(+)}_j({\mathcal {A}}^\prime ) = \varvec{y}_j({\mathcal {A}}^\prime ), \forall {\mathcal {A}}^\prime \ne {\mathcal {A}}\). Finally, we sample an image pair \((\hat{\varvec{x}}^{(-)}_j,\hat{\varvec{x}}^{(+)}_j)\) as:

$$\begin{aligned} \hat{\varvec{x}}^{(-)}_j \sim p_\theta (\varvec{x} | \varvec{y}^{(-)}_j,\varvec{z}_j) \end{aligned}$$
(6)
$$\begin{aligned} \hat{\varvec{x}}^{(+)}_j \sim p_\theta (\varvec{x} | \varvec{y}^{(+)}_j,\varvec{z}_j). \end{aligned}$$
(7)

Recall that our identity sampling accounts for inter-attribute co-occurrences. Slightly altering a single attribute is in line with our goal to densify supervision—going beyond the existing distribution of real images—to recover plausible but yet-unseen instances.
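Continuing the sketch above, the \(\pm 2\sigma _{\mathcal {A}}\) perturbation of Eq. (5) and the pair sampling of Eqs. (6)–(7) could be implemented roughly as follows, treating the pre-trained attribute-conditioned decoder as a black box:

```python
def semantic_jitter_pair(decoder, y, z, attr_idx, sigma_attr):
    """Generate one pair of images differing subtly in one target attribute.

    y, z       : NumPy arrays holding one sampled identity (y_j, z_j)
    attr_idx   : index of the target attribute A within y
    sigma_attr : std. dev. of the attribute's strength on real training images
    decoder    : pre-trained generator, called as decoder(y, z) -> image
    Returns (x_minus, x_plus): images with slightly less / more of the attribute.
    """
    y_minus, y_plus = y.copy(), y.copy()
    y_minus[attr_idx] -= 2.0 * sigma_attr   # Eq. (5), "less" version
    y_plus[attr_idx]  += 2.0 * sigma_attr   # Eq. (5), "more" version
    x_minus = decoder(y_minus, z)           # sample from p_theta(x | y^-, z)
    x_plus  = decoder(y_plus, z)            # sample from p_theta(x | y^+, z)
    return x_minus, x_plus
```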

Fig. 6 Spectra of generated images given an identity and an attribute. We form two types of image pairs: The two solid boxes represent an intra-identity pair, whereas the two red boxes represent an inter-identity pair

Figure 6 shows examples of synthetic images generated for several sampled identities, varying only in one attribute. The generated images form a smooth progression in the attribute space. This is exactly what allows us to curate fine-grained pairs of images that are very similar in attribute strength. Crucially, such pairs are rarely possible to curate systematically among real images. The exception is special “hands-on” scenarios, e.g., for faces, asking subjects in a lab to slowly exhibit different facial expressions, or systematically varying lighting or head pose (cf. PIE, Yale face datasets). The hands-on protocol is not only expensive, it is inapplicable in most domains outside of faces and for rich attribute vocabularies. Furthermore, the generation process allows us to collect in a controlled manner subtle visual changes across identities as well.

Next we pair up the synthetic images to form the synthetic training data \({\mathcal {S}}_{\mathcal {A}}\), which, once (optionally) verified and pruned by human annotators, will augment the real training image pairs \({\mathcal {P}}_{\mathcal {A}}\). In order to maximize our coverage of the attribute space, we sample two types of synthetic image pairs: intra-identity pairs, whose images are sampled from the same identity’s spectrum, and inter-identity pairs, whose images are sampled from different identities’ spectra (see Fig. 6).
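The following sketch shows one way the two pair types could be assembled (bookkeeping only; the putative orderings are what annotators subsequently verify or correct, unless the “Auto” labels are used directly):

```python
import random

def build_synthetic_pairs(spectra, n_inter=1000, seed=0):
    """Form intra- and inter-identity pairs from generated spectra.

    spectra : list of lists; spectra[j] holds (strength, image) tuples for
              identity j, sorted by increasing strength of the target attribute.
    Returns a list of (img_less, img_more) pairs with their putative ordering.
    """
    rng = random.Random(seed)
    pairs = []
    # Intra-identity pairs: adjacent images along the same identity's spectrum.
    for spectrum in spectra:
        for (_, img_a), (_, img_b) in zip(spectrum, spectrum[1:]):
            pairs.append((img_a, img_b))
    # Inter-identity pairs: images drawn from two different identities,
    # ordered here by their generated ("Auto") attribute strengths.
    for _ in range(n_inter):
        j, k = rng.sample(range(len(spectra)), 2)
        s_a, img_a = rng.choice(spectra[j])
        s_b, img_b = rng.choice(spectra[k])
        pairs.append((img_a, img_b) if s_a < s_b else (img_b, img_a))
    return pairs
```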

We expect many of the generated pairs to be valid, meaning that both images are realistic and that the pair exhibits a slight difference in the attribute of interest. However, this need not always be true. As we will see in our experiments in the next section, in some cases the generator will create images that do not appear to manipulate the attribute of interest, or where the pair is close enough in the attribute to be indistinguishable, or where the images simply do not look realistic enough to tell (Fig. 15).

To correct or eliminate erroneous pairs, we collect order-labels from five crowdworkers per pair. However, while human-verified pairs are most trustworthy for a learning algorithm, we suspect that even noisy (unverified) pairs could be beneficial too, provided the learning algorithm (1) has high enough capacity to accept a lot of them and/or (2) is label-noise resistant. Unverified pairs are attractive because they are free to generate in mass quantities. We examine both cases in our experiments. The union of the original real pairs and newly generated pairs, \(\{{\mathcal {P}}_{\mathcal {A}} \bigcup {\mathcal {S}}_{\mathcal {A}}\}\), comprise the training data for the DeepSTN, which learns an attribute-specific ranking function \(R_{\mathcal {A}}\). Note that the ranker is ultimately tested on real image pairs. Hence there will be an inherent domain shift for the proposed model, which is trained on a mix of real and synthesized images.

4.3.2 Discussion

A natural question to ask is why not feed back the synthetic image pairs into the same generative model that produced them, to try and enhance its training? We avoid doing so for two important reasons. First, this would lead to a circularity bias where the system would essentially be trying to exploit new data that it has already learned to capture well (and hence could generate already). Second, the particular image generator we employ is not equipped to learn from relative supervision nor make relative comparisons on novel data. Rather, it learns from individual images with absolute attribute strengths. Thus, we use the synthetic data to train a distinct model capable of learning relative visual concepts.

Furthermore, while traditional data collection methods lack a direct way to curate image pairs covering the full space of attribute variations, our approach addresses exactly this sparsity. It densifies the attribute space via plausible synthetic images that venture into potentially undersampled regions of the attribute spectra. Our approach does not expect to get “something for nothing”. Indeed, the synthesized examples are still annotated by humans. The idea is to expose the learner to realistic images that are critical for fine-grained visual learning yet difficult to attain in traditional data collection pipelines.

4.4 Active Training Image Creation

Building on the fundamental premise of semantic jitter, we next expand our framework to intelligently target the training samples in a task-specific manner. In particular, we show how to actively generate those training image pairs that will most rapidly improve the attribute model.

Semantic jitter first generates the synthetic images through individual identities and then forms the supervision pairs based on automated sampling heuristics (Sect. 4.3). In this section, we propose to generate the image pairs directly in an adversarial manner, through active image generation. Unlike traditional active learning methods, where informative instances are selected from a pool of manually curated unlabeled images and prioritized for labeling, we design a system that directly synthesizes image pairs that confuse the current ranking model and presents them for labeling. We refer to this approach as ATTIC, for AcTive Training Image Creation.

Fig. 7 Schematic overview of main idea. Real images (green \(\times \)’s) are used to train a deep ranking function for the attribute (e.g., the openness attribute for shoes). The pool of real images consists of those that are labeled (dark \(\times \)’s) and those that are unlabeled (faded \(\times \)’s). Even with all the real images labeled, the ideal ranking function may be inadequately learned. Rather than select other manually curated images for labeling (faded green \(\times \)’s), ATTIC directly generates useful synthetic training images (red \(\bigcirc \)’s) through an adversarial learning process. The three shoes along each path of circles represent how ATTIC iteratively evolves the control parameters to obtain the final synthetic image pairs, not to be confused with incrementally adding “more” of a target attribute

Fig. 8 Architecture of our proposed end-to-end approach consisting of three primary modules. The control module first converts the random input \(\varvec{q}\) into control parameters \(\{(\varvec{y}^A,\varvec{z}^A),(\varvec{y}^B,\varvec{z}^B)\}\). Its architecture is detailed further in Fig. 9. The generator module then generates a pair of synthetic images \((\hat{\varvec{x}}^A,\hat{\varvec{x}}^B)\) using these control parameters. The ranker module finally uses the generated synthetic images (once manually labeled) and the real training images to train the ranking model, outputting their corresponding attribute strength \((v^A,v^B)\). During training, the ranking loss using the RankNet objective is fed back into the ranker (green dotted line), while the negative ranking loss from the same objective is fed back into the control module (red dotted line). Note that the decoders within the generator are pre-trained and their parameters are kept frozen throughout training

The main idea is to jointly learn the target visual task while also learning to generate novel realistic image pairs that, once manually labeled, will benefit that task. To this end, we propose an end-to-end framework for attribute-based image comparison, which serves as a continuation of our semantic jitter approach. The adversarial aspect of the model aims to avoid the streetlight effect of traditional pool-based active learning, which only looks for new images to annotate from a pre-defined pool of images. Rather than limit training to manually curated real images, ATTIC synthesizes image pairs that will be difficult for the ranker as it has been trained thus far. See Fig. 7.

4.4.1 End-to-End Architecture

Let \({\mathcal {P}}_{\mathcal {A}}\) again be the set of real training image pairs used to initialize the ranker. Just like in the previous section, our goal is to improve that ranker by creating synthetic training image pairs \({\mathcal {S}}_{\mathcal {A}}\), to form a hybrid training set \(\{{\mathcal {P}}_{\mathcal {A}} \bigcup {\mathcal {S}}_{\mathcal {A}}\}\).

Our proposed end-to-end ATTIC framework consists of three distinct components (Fig. 8): the ranker module, the generator module, and the control module. Our model performs end-to-end adversarial learning between the ranker and the control modules. The ranker tries to produce accurate attribute comparisons, while the control module tries to produce control parameters—latent image parameters—that will generate difficult image pairs to confuse the current ranker. By asking human annotators to label those confusing pairs, the ranker is actively improved. Compared to semantic jitter, this variant of our approach adds the control module and formulates an end-to-end adversarial network for all three modules. We next present the individual modules.

4.4.2 Ranking Module

For the ranking module in ATTIC, we again employ the state-of-the-art DeepSTN with a spatial transformer network (Singh and Lee 2016) detailed in Sect. 4.2. As before, RankNet handles pairwise outputs in a single differentiable layer using cross-entropy loss. The rank estimates \((v_i,v_j)\) for images \((\varvec{x}_i,\varvec{x}_j)\) are mapped to a pairwise posterior probability using a logistic function. The ranking loss is the RankNet loss from Equation 2.

4.4.3 Generator Module

For the generator module, we again use Attr2Image (Yan et al. 2016a) as introduced in Sect. 4.1. However, now instead of generating synthetic image pairs offline based on sampling identities, we connect the generator outputs directly to the inputs of the ranker module. Synthetic image pairs are then generated and modified on-the-fly throughout training.

The attribute-conditioned aspect of this generator allows us to iteratively refine the generated images in a semantically smooth manner, as we adversarially update its inputs \((\varvec{y}, \varvec{z})\) with the control module, defined next. We pre-train the generator using \(\{(\varvec{x}_i,\varvec{y}_i)\}\), a disjoint set of training images labeled by their \(N_a\) attribute strength labels. Subsequently, we take only the decoder part of the model and use it as our generator (see Fig. 8). We freeze all parameters in the generator during active image creation, since the mapping from latent parameters to pixels is independent of the rank and control learning.

Fig. 9 Architecture of the control module. The model above outputs a single set of control parameters \((\varvec{y},\varvec{z})\). Since we generate the synthetic images in pairs, we duplicate the architecture

4.4.4 Control Module

As defined thus far, linking together the ranker and generator would aimlessly feed new image sample pairs to the ranker. Next we define our control module and explain how it learns to feed pairs of intelligently chosen latent parameters to the generator for improving the ranker.

The control module is a neural network that precedes the generator (see Fig. 8, left). Its input is a random seed \(\varvec{q} \in {\mathbb {R}}^Q\), sampled from a multivariate Gaussian. Its output is a pair of control parameters \(\{(\varvec{y}^A,\varvec{z}^A), (\varvec{y}^B, \varvec{z}^B)\}\) for synthetic image generation. Figure 9 shows the control architecture. It is duplicated to create two branches (without parameter sharing) feeding to the generator and then the Siamese network in the DeepSTN ranker.

The attribute control variable \(\varvec{y}\) is formed by passing \(\varvec{q}\) through a few fully-connected layers, followed by a BatchNorm layer with scaling. In particular, the scaling parameters are obtained from the mean and standard deviation of the attribute strengths observed in the real training images, and are applied to the normalized \({\mathcal {N}}(0,1)\) outputs of the BatchNorm layer. The scaling ensures that the attribute strengths are bounded within a range appropriate for the pre-trained generator.

The latent feature control variable \(\varvec{z}\), which captures all the non-attribute properties (e.g., pose, illumination), is sampled from a Gaussian. We simply use half of the entries from \(\varvec{q}\) for \(\varvec{z}^A\) and \(\varvec{z}^B\), respectively. This Gaussian sample agrees with the original image generator’s prior \(p(\varvec{z})\) (Yan et al. 2016a).
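A schematic of one control branch in PyTorch is shown below; the layer counts and sizes are illustrative rather than those of Fig. 9, and `attr_mean`/`attr_std` are the per-attribute statistics measured on the real training images.

```python
import torch
import torch.nn as nn

class ToyControlBranch(nn.Module):
    """One branch of the control module: random seed q -> (y, z).

    Illustrative sketch only. The BatchNorm outputs are rescaled with the
    per-attribute mean and standard deviation observed on real training
    images, keeping y in a range the pre-trained generator has seen.
    """
    def __init__(self, q_dim, n_attrs, attr_mean, attr_std):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(q_dim, 128), nn.ReLU(),
            nn.Linear(128, n_attrs),
            nn.BatchNorm1d(n_attrs, affine=False),   # normalize to ~N(0, 1)
        )
        self.register_buffer("attr_mean", attr_mean)
        self.register_buffer("attr_std", attr_std)

    def forward(self, q):
        y = self.mlp(q) * self.attr_std + self.attr_mean   # rescaled attributes
        z = q[:, : q.shape[1] // 2]                         # half of q as latent z
        return y, z

# Hypothetical usage (q_dim=512 so z has 256 dims, matching the toy decoder):
branch = ToyControlBranch(512, 10, torch.zeros(10), torch.ones(10))
y, z = branch(torch.randn(8, 512))
```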

Algorithm 1 The end-to-end ATTIC training pipeline (rendered as a figure in the original)

4.4.5 Training and Active Image Creation

Given the three modules, we connect them in sequence to form our active learning network. The entire end-to-end pipeline is summarized in Algorithm 1. The generator and the ranker modules are duplicated for both branches to account for two images in each training pair. The decoders in the generator module are pre-trained and their parameters are kept frozen. During training, we optimize the RankNet loss for the ranker module, while at the same time optimizing the negative RankNet loss for the control module:

$$\begin{aligned} {\mathcal {L}}_{control} = -{\mathcal {L}}_{rank}. \end{aligned}$$
(8)

The control module thus learns to produce parameters that generate image pairs that are difficult for the ranker to predict. This instills an adversarial effect where the control module and the ranker module are competing against each other iteratively during training. The learning terminates when the ranker converges or reaches a certain threshold of training iterations. Our adversarial losses promote “hard” examples, and are not to be confused with adversarial “fooling” images (Moosavi-Dezfooli et al. 2016; Nguyen et al. 2015); image pairs generated by our method are only fed to the learner if humans can label them confidently.
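A compressed sketch of one adversarial step is given below, reusing the toy modules and the `ranknet_loss` function sketched earlier. It illustrates the logic in Algorithm 1 under our own simplifying assumptions (e.g., the putative ordering of a generated pair is taken from the generated attribute strengths), not the exact training schedule.

```python
import torch

def attic_training_step(control, generator, ranker, opt_ranker, opt_control,
                        real_batch, q, attr_idx):
    """One adversarial ATTIC step (illustrative sketch of Algorithm 1).

    control    : two-branch control module, q -> ((y_A, z_A), (y_B, z_B))
    generator  : pre-trained, frozen attribute-conditioned decoder
    ranker     : Siamese ranking network (e.g., the toy DeepSTN sketch above)
    real_batch : (x_i, x_j, t_ij) labeled pairs for the ranker update
    q          : batch of random seeds driving the control module
    attr_idx   : index of the target attribute within y
    """
    # Ranker update: minimize the RankNet loss on labeled pairs.
    x_i, x_j, t_ij = real_batch
    v_i, v_j = ranker(x_i, x_j)
    loss_rank = ranknet_loss(v_i, v_j, t_ij)   # from the Sect. 4.2 sketch
    opt_ranker.zero_grad()
    loss_rank.backward()
    opt_ranker.step()

    # Control update: minimize L_control = -L_rank (Eq. 8) on generated pairs,
    # i.e., make the generated pairs hard for the current ranker.
    (y_a, z_a), (y_b, z_b) = control(q)
    x_a, x_b = generator(y_a, z_a), generator(y_b, z_b)
    v_a, v_b = ranker(x_a, x_b)
    # Putative ordering from the generated attribute strengths (an assumption
    # of this sketch); pairs are only added to training after human labeling.
    t_putative = (y_a[:, attr_idx] > y_b[:, attr_idx]).float()
    loss_control = -ranknet_loss(v_a, v_b, t_putative)
    opt_control.zero_grad()
    loss_control.backward()   # only control parameters are stepped below;
    opt_control.step()        # the generator is frozen, and any ranker grads
                              # are cleared at the next zero_grad() call
    return loss_rank.item(), loss_control.item()
```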

Fig. 10 Visualization of the progression of some synthetic image pairs \((\hat{\varvec{x}}^A, \hat{\varvec{x}}^B)\) during training (not to be confused with a spectrum going from less to more of a given attribute as in Fig. 6). Our ATTIC model learns patterns between all the attributes, modifying multiple attributes simultaneously. For example, while modifying the face images for the attribute masculine (last row), our model learned to change the attribute smiling as well. The rightmost images, i.e., end of the progression, are manually labeled and augment the training data

To generate a batch of synthetic image pairs \({\mathcal {S}}_{\mathcal {A}} = \{(\hat{\varvec{x}}^A,\hat{\varvec{x}}^B)\}_{i=1}^T\),Footnote 7 we sample \(T\) vectors \(\varvec{q}\) as the inputs to the end-to-end model. Once training concludes, we obtain the image pairs from the generator module and present them to the annotators, who judge which image shows the attribute more; the pairs accepted by annotators as valid are added to the hybrid training set \(\{{\mathcal {P}}_{\mathcal {A}} \bigcup {\mathcal {S}}_{\mathcal {A}}\}\). Figure 10 shows examples of the progression of some synthetic image pairs during the training iterations. As we can see, ATTIC captures the joint interaction between the attributes and modifies each pair of images simultaneously in order to best confuse the current ranking model.

The primary novelty of this approach comes from the generation of synthetic image pairs through active query synthesis. From an active learning perspective, instead of selecting more real image pairs to be labeled based on existing pool-based strategies, our approach aims to directly generate the most beneficial synthetic image pairs (please refer back to Fig. 7). Furthermore, instead of sampling \((\varvec{y}_i,\varvec{z}_i)\) using a heuristic, as we do for semantic jitter, ATTIC automates this selection in a data-driven manner.

4.5 Discussion

While our two proposed approaches share the goal of densifying the training data using synthetic images, each possesses unique properties that could be preferable in different situations. In semantic jitter, we have direct control over the generation process by constraining \(\varvec{y}\) and \(\varvec{z}\) through user-specified distributions. In cases where prior knowledge or practitioner intuition is available, semantic jitter has the advantage of generating samples directly in the known sparse regions. Furthermore, semantic jitter’s sampling procedure yields more interpretable training pairs: only one attribute is subtly changed at a time, so a practitioner can more quickly see what new fine-grained comparisons are being added. However, if the supervision sparsity is complex for a given task, or practitioner prior knowledge is lacking, then semantic jitter could waste training samples where they are less valuable. In addition, since semantic jitter’s generation happens offline, its training time overhead is lower than ATTIC’s.

Fig. 11 Side-by-side comparison between semantic jitter (left) and ATTIC (right) when given the same starting image pair, i.e., the same \((\varvec{y},\varvec{z})\) pair. Whereas semantic jitter modifies the target attribute of individual images in a pre-determined manner, ATTIC modifies each pair of images as a single unit and considers multiple attributes at the same time

In ATTIC, on the other hand, we relinquish direct control over the generation process by letting the end-to-end model decide on the \((\varvec{y},\varvec{z})\)’s on its own. ATTIC only lets the practitioner control the learning at a higher level through adjustments of the typical learning hyperparameters. ATTIC’s key advantage is its adversarial process that can identify and generate samples for more complex regions of feature space. Our experiments below show that this often provides better empirical results. However, ATTIC is more tedious to train, requiring at least two rounds of training: one to generate the synthetic image pairs and another to fully train the model after labeling the synthetic image pairs. Figure 11 shows a side-by-side comparison of how the two approaches modify the same target image pair. While semantic jitter focuses on each individual image and the target attribute, ATTIC focuses on each pair of images as a single unit and considers all attributes simultaneously.

5 Experiments

We conduct fine-grained visual comparison experiments to validate the benefits of both of our proposed dense supervision approaches: semantic jitter (Sect. 4.3), which we will abbreviate as SemJitter, and active training image creation (Sect. 4.4), which we will abbreviate as ATTIC.

5.1 Experimental Setup

Datasets Our experiments make use of both real and synthetic images. For the real datasets, we use the largest available relative attribute datasets for the shoe and face domains, overviewed in Sect. 3. To our knowledge there exist no other instance-labeled relative attribute datasets.

  • Catalog Shoe Images We use the improved UT-Zap50K dataset with the fine-grained attributes from Sect. 3.1. There are 10 attributes (comfort, casual, simple, sporty, colorful, durable, supportive, bold, sleek, and open), each with about 4000 labeled pairs.

  • Human Face Images We use the LFW dataset and the LFW-10 dataset from Sect. 3.2. We use the 8 attributes (bald, dark hair, big eyes, masculine, mouth open, smiling, visible forehead, and young) in the intersection of these two datasets. For the real image pairs, there are about 600 labeled pairs per attribute from LFW-10.

To generate the synthetic dataset, we pre-train the Attr2Img (Yan et al. 2016a) image generator for each domain using a disjoint set of real images (38,000 and 11,000 images respectively for Shoes and Faces) and their real-valued attribute strengths. To enrich training of the image generator, we train it with all available binary attribute labels for the two datasets (e.g., flip flops or boots for Shoes; goatee or bangs for Faces); we refer to them as “meta-data” attributes, to distinguish them from the comparative relative attributes that our system learns. The Shoes dataset has 50 meta-data attributes (10 relative) and the Faces dataset has 73 meta-data attributes (8 relative).

We use the code shared by the authors for Attr2Img, with all default parameters. For SemJitter, we generate the synthetic image pairs offline before attribute training begins, while for ATTIC, we first use real image pairs to initialize training and then create the synthetic image pairs on-the-fly as training progresses. Unless otherwise specified, the synthetic images are labeled by human annotators before they are used to augment the training set.

For an apples-to-apples comparison, all images for all methods are resized to \(64\times 64\) pixels to match the output resolution of the image generator. We collect annotations on the synthetic training image pairs using mTurk, exactly as we did with the real training pairs; we obtain five worker responses per label and take the majority vote (see Sects. 3.1 and 3.2). For all experiments, we only use high quality (high agreement/high confidence) relative labels. Workers are free to vote for discarding a pair if they find it illegible, which happened for just 16% of the generated pairs overall for both SemJitter and ATTIC. Of the other 84% accepted pairs, at least 4 of 5 annotators agree on the same label 63% of the time, and they rate 48% of their annotations as “very confident” and 95% as at least “somewhat confident”. See the “Appendix” for the data collection interface and instructions to annotators.
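
The five-worker label aggregation described above reduces to a simple majority vote with a discard option. Below is a minimal sketch; the function name and the agreement threshold of 3 out of 5 are assumptions used only for illustration.

```python
from collections import Counter

def aggregate_pair_label(worker_votes, min_agreement=3):
    """Aggregate mTurk responses for one image pair.
    Votes are 'A', 'B', or 'discard'; the pair is dropped if the plurality vote
    is 'discard' or agreement is too low, otherwise the majority comparison
    label is returned together with the agreement count."""
    counts = Counter(worker_votes)
    label, n_agree = counts.most_common(1)[0]
    if label == 'discard' or n_agree < min_agreement:
        return None, n_agree
    return label, n_agree

# Example: 5 workers, 4 agree that the left image shows the attribute more.
print(aggregate_pair_label(['A', 'A', 'B', 'A', 'A']))   # ('A', 4)
```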

Implementation Details We run all experiments (including individual batches) to convergence or to a maximum of 250 and 100 epochs for Shoes and Faces, respectively. We use a 50%/25%/25% train/validation/test split. We monitor the ranking loss on the validation set throughout training to avoid overfitting. We validate all hyperparameters (such as the learning rate, the learning rate decay, and the weight decay) on a separate validation set. In all cases, we train with all available real data and then add the synthesized data as it is created by our model.

For the individual modules in ATTIC, implementation details are as follows. Ranker: We pre-train the DeepSTN ranking network without the global image channel using only the real image pairs [see Singh and Lee (2016) for details on the two rounds of training]. Generator: We use the pre-trained decoder from the Attr2Img framework while keeping the parameters constant throughout end-to-end training for the ranker and control (i.e., learning rate of zero on decoder). Control: We initialize the layers using ReLU initialization (He et al. 2017). The learning rate decays such that as learning goes on, the changes to \(\varvec{y},\varvec{z}\) become smaller. The input \(\varvec{q}\) is kept constant during training, and only re-sampled after each batch.
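
For concreteness, the module setup above (frozen decoder, He-initialized control layers, decaying learning rate) can be sketched in PyTorch as follows. The tiny networks, optimizer choice, and decay schedule are placeholders and assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three modules described above; the real
# networks (Attr2Img decoder, DeepSTN ranker, control network) are more complex.
decoder = nn.Sequential(nn.Linear(64, 64 * 64 * 3))
ranker  = nn.Sequential(nn.Linear(64 * 64 * 3, 1))
control = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

# Generator: keep the pre-trained decoder frozen during end-to-end training
# (equivalent to a learning rate of zero on the decoder).
for p in decoder.parameters():
    p.requires_grad = False

# Control: He initialization, appropriate for ReLU layers.
for m in control.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

# Decaying learning rate so the changes to (y, z) shrink as training proceeds.
optimizer = torch.optim.Adam([
    {'params': control.parameters(), 'lr': 1e-3},
    {'params': ranker.parameters(),  'lr': 1e-4},
])
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
```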

Fig. 12

Visual representation of all methods used for evaluation. The width of a gray rectangle denotes a “unit” of training data; n in the text refers to two units. The names on the blocks denote the source of the training images. For example, ATTIC-Auto uses n real images and n automatically labeled images created by ATTIC. Notice that all methods make use of only two units of human annotation effort total. In the case of our methods, those annotations are spent on a mix of real images and synthetic images. In the case of the Auto setting, no annotations are added beyond those on the real images

Fig. 13

Comparison between synthetic images generated by Jitter (left) and Semantic Jitter (right). While the modifications by Jitter are purely geometric/photometric, our Semantic Jitter changes high-level attributes in the images. Each image is seeded from a real image in the training set (leftmost image). For Semantic Jitter, we infer the latent \(\varvec{z}\) for the seed image using code provided by the authors of Yan et al. (2016a)

Baselines We consider the following baselines. Figure 12 presents a visual representation of these baselines in terms of the amount of human annotations used.

  • Real Standard approach which trains with only real labeled image pairs.

  • Real+ Slight modification that adds real image pairs with their pseudo labels to Real. The purpose of this baseline is to ensure that our advantage is not due to our network’s access to the attribute-strength labeled images that the image generator module requires for training. Pseudo relative attribute labels are assigned based on the difference between the two training images’ attribute strengths. The latter comes from the binary classifier outputs used as “noisy labels” to train the Attribute2Image image generator (cf. Sect. 4.3.1).

  • Jitter The traditional data augmentation process where the real images are jittered through low-level geometric and photometric transformations. We follow the jitter protocol defined in Dosovitskiy et al. (2014), which includes translation, scaling, rotation, contrast, and color. The jittered image pairs retain the corresponding real pairs’ respective labels. Figure 13 shows examples of the low-level jitter. Note how these low-level alterations differ from the attribute-conditioned “jitter” injected by our methods.

  • SemJitter Our “passive” semantic jittering approach from Sect. 4.3.

  • ATTIC Our “active” generation approach from Sect. 4.4.

All methods use the same state-of-the-art ranking network for training and predictions; hence any differences in results will be attributable to the training data and augmentation strategy.

Table 1 Accuracy for the 10 attributes in the Shoes dataset
Table 2 Accuracy for the 8 attributes in the Faces dataset

5.2 Relative Attribute Accuracy

First we evaluate the accuracy of all methods when given the exact same total amount of manual annotation. The accuracy metric is the rate of correct classification on input pairs, i.e., how often the model correctly predicts which of the two images exhibits the attribute more. Here we want to measure the impact of the synthetic training images created by our approaches on attribute comparisons. Note that the test pairs are real images. This means that while there is no domain shift for the baselines trained solely with real images, there is a domain shift for our model, which augments the real data with its synthesized training instances. Hence, an empirical question we also check here is whether synthetic images, despite their inherent domain shift, can still help the algorithm learn a more reliable model. Furthermore, we test whether our adversarial active approach pinpoints the most useful synthetic images to accelerate learning for the same amount of labels.
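
A minimal sketch of this pairwise accuracy metric is shown below, assuming a hypothetical rank_score function produced by the trained ranker.

```python
def pairwise_accuracy(test_pairs, rank_score):
    """Fraction of test pairs for which the ranker orders the images correctly.
    Each pair is (img_a, img_b, label) with label 'A' if img_a shows the
    attribute more, 'B' otherwise; `rank_score` is the learned ranking function."""
    correct = 0
    for img_a, img_b, label in test_pairs:
        predicted = 'A' if rank_score(img_a) > rank_score(img_b) else 'B'
        correct += (predicted == label)
    return correct / len(test_pairs)
```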

In the following experiment, the Real, Real+, and Jitter baselines use all n available real labeled image pairs. Semantic Jitter and ATTIC use half of the real labeled image pairs (\(\frac{n}{2}\)), then augment those pairs with \(\frac{n}{2}\) manually labeled synthetic image pairs that they generate. Jitter adopts the label of the source pair it jittered. See Fig. 12.

Tables 1 and 2 show the results for the Shoes and Faces datasets, respectively. We first look at the standard scenario (middle rows) where the synthetic images generated by our approaches are labeled by human annotators. Though using exactly the same amount of manual labels as the Real baseline, our approaches nearly always outperform it. This shows that simply having more real image pairs labeled is not always enough; our generated samples improve the training across the variety of attributes in ways the existing real image pairs could not. In addition, we see from Real+ that the image generator’s training images have only a marginal (and sometimes negative) effect on the baseline’s results. This indicates that both Real and Real+ suffer from the same sparsity issue, as their images are drawn from a similar pool of real images. Adding similarly distributed (real) images lacks the fine-grained details needed to train a stronger model. Jitter sometimes yields a slight performance boost, but can even be detrimental on these datasets.

Table 3 Results on Zap50K-1 (coarse pairs) and Zap50K-2 (fine-grained pairs) from the original UT-Zap50K dataset versus prior methods

When comparing our two approaches, ATTIC outperforms (or matches) Semantic Jitter in 8 out of 10 shoe attributes and 6 out of 8 face attributes, with gains of just over 3% in some cases. This demonstrates ATTIC’s key advantage over Semantic Jitter: it actively adapts the generated images to best suit the learning of the model, rather than to what looks best to human eyes. Unlike Semantic Jitter, which modifies one attribute at a time, ATTIC can modify multiple attributes simultaneously in a dynamic manner, accounting for their dependencies. According to a paired t-test between our approaches and the baselines, our results are statistically significant with 95% confidence on average over all attributes in each dataset.
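
A paired t-test of this kind can be run with a few lines of SciPy; the per-attribute accuracies below are purely illustrative placeholders, not results from the paper.

```python
from scipy.stats import ttest_rel

# Illustrative per-attribute accuracies for our method and a baseline.
ours     = [0.85, 0.88, 0.90, 0.84, 0.87]
baseline = [0.82, 0.86, 0.88, 0.83, 0.84]

stat, p_value = ttest_rel(ours, baseline)     # paired t-test across attributes
print(f"t = {stat:.3f}, p = {p_value:.4f}")   # significant at 95% if p < 0.05
```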

Next, we consider an “Auto” scenario for our approaches where, instead of adding the \(\frac{n}{2}\) generated images with their manual annotations, we bootstrap from all n real labeled image pairs. We then generate another n synthetic image pairs and, rather than having them labeled, simply adopt their inferred attribute comparison labels. In this case, the “ground truth” ordering for attribute j for generated images \(\hat{\varvec{x}}^A\) and \(\hat{\varvec{x}}^B\) is automatically determined by the magnitudes of their associated parameter values \(\varvec{y}^A(j)\) and \(\varvec{y}^B(j)\) output by the control module (ATTIC), or analogously by the jittered \(\varvec{y}(j)\) values for the sampled identity (Semantic Jitter). Once again, all methods use the same number of labels.
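
The automatic labeling rule reduces to comparing the attribute parameters the pair was generated from; a minimal sketch follows, in which the function name and tie handling are assumptions.

```python
def auto_label(y_a, y_b, attr_idx):
    """'Auto' variant: infer the comparison label for a generated pair directly
    from the attribute parameters the generator was conditioned on, instead of
    asking annotators."""
    if y_a[attr_idx] == y_b[attr_idx]:
        return None                      # no usable ordering for this attribute
    return 'A' if y_a[attr_idx] > y_b[attr_idx] else 'B'
```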

Fig. 14

Active learning curves for the shoe (left) and face (right) datasets. We show the average gain over the Real baseline after each batch of additional generated image pairs. Both of our densification approaches accelerate learning over the baselines, and ATTIC nearly doubles the gain achieved by Semantic Jitter for both domains

Tables 1 and 2 (bottom two rows) show the “Auto” results. The “Auto” variant tests the impact of having more synthetic images but with noisy (inferred) labels. Our models’ performance varies by dataset in this setting. For attributes where the Auto labels perform similarly to or even better than the “normal” setting, it suggests that their inferred labels are often accurate, and the extra volume of “free” training pairs can be helpful in certain scenarios. While our methods perform well overall, for a couple of attributes (i.e., mouth open and young) our ATTIC variant underperforms both Real and Semantic Jitter. Upon inspection, we find ATTIC’s weaker performance there is due to deficiencies in the image generator. As we will illustrate below, ATTIC tends to generate training images that are farther from the initial distribution of real samples, which naturally strains the generation engine. Even so, ATTIC still outperforms SemJitter in all 10 shoe attributes and 6 out of 8 face attributes. That said, we stress once again that our approach’s motivation is to improve accuracy by addressing the curation problem, not to reduce manual labeling effort; the Auto setting is exploratory, intended to gauge how important the label verification step is.

Fig. 15

Sample training image pairs. Left: “Harder” real pairs that are incorrectly predicted by the baseline model. Middle: Synthetic image pairs generated by our semantic jitter and active approach. Right: Synthetic image pairs that are rejected by the human annotators as illegible. For each pair of images, the left image always contains “more” of an attribute than the right image

Lastly, we take the idea of using automatically labeled synthetic data further, extending the total number of synthetic image pairs used up to five times the original amount for our ATTIC approach. We find that on the Shoes data, the gains for ATTIC-Auto have saturated. On Faces, however, the gains continue and saturate only once we generate 5\(\times \) as many auto-labeled pairs, for a final accuracy of 86.25% averaged over all attributes. We speculate that since the Faces training data is smaller to begin with, the model for the face domain is able to make further use of this extra training data.

5.3 Improving the State of the Art for Fine-Grained Attributes

The experiments thus far demonstrate that our approach allows more accurate fine-grained predictions for the same amount of manual annotation effort, compared to traditional training procedures with real images, with or without low-level jitter for data augmentation. Next we present results for our approach alongside all available comparable reported results on the original UT-Zap50K dataset. We test on the attributes that overlap between the original UT-Zap50K attributes and our newly collected fine-grained lexicon, namely open, sporty, and comfort, due to the availability of labeled synthetic image pairs. To avoid an unfair advantage, we do not use our newly collected real labeled data for our approaches.

Table 3 shows the results using the provided UT-Zappos50K train/test split. Following previous experiments, for an apples-to-apples comparison, all methods are applied to the same \(64 \times 64\) images. Our results use the approaches exactly as described above for the “Auto” scenario.

Our approaches outperform all the existing methods for all attributes. Semantic Jitter outperforms ATTIC for sporty in the first test set and open in the second test set, indicating that those attributes were similarly well-served by the heuristic choice for generated images. We explore this difference further in our qualitative evaluation below. However, ATTIC has the advantage overall. These results show that densifying supervision with our algorithms improves the state of the art for a popular fine-grained attribute task with real-world image data.

5.4 Active Versus Passive Training Image Generation

Next we examine more closely ATTIC’s active learning behavior. In this scenario, we suppose the methods have exhausted all available real training data (i.e., we use all n real labeled image pairs to initialize the model), and our goal is to augment this set. We generate the synthetic (labeled) image pairs in batches (again, not to be confused with the mini-batches when training neural networks). After each batch, we have them annotated, update the ranker’s training set, and re-evaluate it on the test set. The weights of the control module are carried over from batch to batch, while the ranker module restarts at its pre-trained state at the beginning of each batch.
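
The batch-wise protocol above can be summarized in a short sketch. All helper functions are hypothetical placeholders (not the released code); the essential points are that the control module's state persists across batches while the ranker is retrained from its pre-trained weights after each batch.

```python
def active_generation_loop(real_pairs, n_batches, batch_size, control,
                           generate_batch, annotate, retrain_ranker, evaluate):
    """Sketch of the active protocol: generate a batch of synthetic pairs,
    have them annotated, add the accepted pairs to the training set, retrain
    the ranker from its pre-trained state, and record test accuracy."""
    training_set = list(real_pairs)       # initialize with all real labeled pairs
    curve = []
    for _ in range(n_batches):
        synthetic = generate_batch(control, batch_size)        # control persists
        labeled = [lbl for lbl in map(annotate, synthetic) if lbl is not None]
        training_set.extend(labeled)                           # densify supervision
        ranker = retrain_ranker(training_set)                  # restart from pre-trained weights
        curve.append(evaluate(ranker))                         # one learning-curve point
    return curve
```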

Fig. 16

Examples of the real test pairs that SemJitter (left) and ATTIC (right) predicted correctly and incorrectly. For each pair of images, the left image is predicted to contain “more” of the attribute than the right image

Fig. 17

Box plots representing the intra-pair image distances of the various types of images used in our experiments. Each image is represented by its associated attribute values \(\varvec{y}\). The distances of all attributes from each dataset are combined for these computations

Figure 14 shows the results for both datasets. We plot active learning curves to show the accuracy improvements as a function of annotator effort—steeper curves are better, as they mean the system gets more accurate with less manual labeling. We see both of our densification approaches accelerate learning better than the Jitter baseline. Furthermore, ATTIC learns the fastest, showing that it has successfully learned to target the creation of useful training images better than the sampled identities used in Semantic Jitter. In particular, ATTIC achieves a gain of over 3% and 8% for the two domains, respectively, which is almost double that of our Semantic Jitter approach.

Jitter falls short once again, suggesting that traditional low-level jitter has limited impact in these fine-grained ranking tasks. While traditional low-level data augmentation builds in some invariance that benefits coarser image classification tasks, here for fine-grained ranking we observe more value in alterations along semantic dimensions of the class. In addition, for these face and shoe datasets, the main subject of the photo is typically aligned and in a canonical pose, such that typical jitter has less impact.

Finally, ATTIC-Auto (without human annotations) performs on par with Semantic Jitter (with human annotations), further highlighting the adaptability of our active generation framework.

5.5 Examples of Generated Training Images and Typical Predictions

As we have seen in the results above, the synthetic image pairs generated by our densification approaches outperform the baselines in most of the scenarios tested, and our active ATTIC approach successfully accelerates learning over Semantic Jitter.

Figure 10 in Sect. 4 shows examples of how the synthetic images evolve between the first and last epoch of training for ATTIC. We can see that pairs generated by ATTIC exhibit changes in multiple attributes while still keeping the target attribute of comparison at the forefront. Furthermore, the final pairs selected for labeling also show subtler visual differences than the initial pairs, suggesting that our model has indeed learned to generate “harder” pairs.

Figure 15 compares the “harder” pairs generated by our approaches to the real image pairs. Overall we see that the generated synthetic pairs tend to have fine-grained differences and/or offer visual diversity relative to the real training samples. The right-hand side of the figure shows examples of generated pairs rejected by annotators as illegible. The relatively low rejection rate (16%) is an encouraging sign for making active query synthesis viable for image labeling tasks. Figure 16 shows examples of real test pairs that the two approaches predicted correctly and incorrectly.

5.6 Densification Properties of ATTIC and Semantic Jitter

Finally, having seen the performance benefits of our approach to generating images for training ranking models, our last set of analyses aims to validate our original hypothesis: did our synthetic images densify the supervision in the attribute space? Furthermore, how do Semantic Jitter and ATTIC differ in the manner in which they densify the space? For our analysis, we represent each image using its associated attribute values \(\varvec{y}\). For the real images, we again use the attribute values \(\varvec{y}\) inferred by a pre-trained binary classifier.

First, we analyze the intra-pair image distances of the three types of training images used: Real, SemJitter, and ATTIC. A larger distance represents a greater difference in the respective feature space. Our automated sampling heuristics from Sect. 4.3.1 work under the assumption that shorter distances are better for learning fine-grained comparisons.
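
A minimal sketch of this intra-pair distance computation is given below; the use of Euclidean distance over the attribute vectors is an assumption, since the text does not pin down the metric.

```python
import numpy as np

def intra_pair_distances(pairs_y):
    """Distance between the attribute vectors of the two images in each pair.
    `pairs_y` is a list of (y_a, y_b) array pairs."""
    return np.array([np.linalg.norm(y_a - y_b) for y_a, y_b in pairs_y])

# Illustrative usage corresponding to the box plots in Fig. 17:
# for name, pairs in [('Real', real_y), ('SemJitter', jitter_y), ('ATTIC', attic_y)]:
#     d = intra_pair_distances(pairs)
#     print(name, np.percentile(d, [25, 50, 75]))
```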

Fig. 18

t-SNE visualization of each type of image: Real, SemJitter, and ATTIC (active batches) for the shoe domain (top) and face domain (bottom). Each batch of our ATTIC approach is represented with a different color. See the legend for their corresponding batch numbers. Best viewed on PDF

Figure 17 shows the box plot representation of these distances. Immediately, we see a clear contrast between the two image domains in terms of our ATTIC image pairs. While the Face images follow our hypothesis that pairs with lower distances are preferable, the Shoe images actually exhibit the opposite preference. Our ATTIC shoe distances are not only larger than those from Real and SemJitter but also occupy a wider range, resulting in more diverse pairs. This makes sense for shoe images, where differences in the attributes are often highly correlated with one another, especially in relation to the 40 meta-data labels (out of the 50 total attributes used to train the image generator). Therefore, when one attribute is automatically adjusted, many of the others are likely modified as well. On the other hand, ATTIC’s Face distances exhibit very subtle differences, representative of the fine-grained differences we are trying to identify on faces (e.g., minor changes in the size of the mouth for smiling). This stark difference between the generated image pairs in the two domains highlights the strength of our ATTIC approach, which adapts to each set of training data to generate the most beneficial set of images for its respective ranker.

Next, we visualize the distribution of these individual training images using 2D t-SNE (Maaten and Hinton 2008) embeddings. Figure 18 shows the 2D t-SNE embedding space of images from Real, SemJitter, and the active batches of ATTIC (numbered by their sequence). As an alternative visualization, Fig. 19 uses the same t-SNE embeddings to display the actual images in their approximate locations in the embedding space, filling the empty locations with their nearest neighbors in the attribute space. We color-code the images to indicate their individual types.
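
The embedding itself is standard; below is a minimal sketch using scikit-learn, where the random arrays are illustrative stand-ins for the attribute vectors \(\varvec{y}\) of the Real, SemJitter, and ATTIC images, and the t-SNE settings are assumptions rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative stand-ins for the per-image attribute vectors of each type.
y_real   = np.random.rand(200, 10)
y_jitter = np.random.rand(200, 10)
y_attic  = np.random.rand(200, 10)

X = np.vstack([y_real, y_jitter, y_attic])
labels = (['Real'] * len(y_real) + ['SemJitter'] * len(y_jitter)
          + ['ATTIC'] * len(y_attic))

# One 2D coordinate per image, which can then be scattered by type.
emb = TSNE(n_components=2, perplexity=30, init='pca',
           random_state=0).fit_transform(X)
```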

Fig. 19

t-SNE grid visualization where every point in the embedding is filled with its nearest neighbor. We show here complementary visualizations to the main embedding from Fig. 18, for the attributes supportive and smiling. The image border colors represent the type of images, with Real as blue, SemJitter as green, and ATTIC as red. Best viewed on PDF

These visualizations depict how our methods densify supervision: they interpolate and extrapolate beyond the real training images to flesh out the space more completely. SemJitter does so rather uniformly surrounding the real samples, consistent with our method design, whereas ATTIC often decides to create images more distant from the original real training data. In particular, for the shoe domain, our ATTIC images are mostly positioned away from the existing real images, generating samples in regions of the feature space that are unoccupied. On the other hand, for the face domain, our ATTIC images tend to be tightly clustered amongst one another, signaling fine-grained differences between the images. Overall, these observations agree with the main outcomes in the box plots above.

In addition, the different types of densification shed light on our quantitative results from Tables 1, 2, and 3, where ATTIC outperforms SemJitter only for some attributes but not others. While ATTIC’s more exploratory densification yields better performance most of the time, SemJitter’s more local densification can be useful for some attributes.

To recap, a key observation from both analyses above is the difference between the synthetic images generated by semantic jitter and ATTIC. Even though both approaches use the exact same image generator (Attr2Img), SemJitter densifies throughout the space but stays near the distribution of real images for both datasets. In contrast, ATTIC can venture into new parts of the feature space, demonstrating its ability to adapt its synthesized training images to each specific set of training data.

6 Conclusion

Supervision sparsity hurts fine-grained attribute learning: closely related image pairs are scarce in typical training data, yet they are exactly the ones the system must learn from. We address this supervision sparsity by proposing two new approaches to data augmentation, in which real training data is mixed with realistic synthetic examples that vary slightly in their attributes.

As our experiments demonstrate, sample density is distinct from sample quantity. Even in a deep learning model, the distribution of the training data can be as important as its absolute quantity. In other words, simply gathering more real images does not offer the same fine-grained density, due to the curation problem. On two difficult datasets from two distinct domains, we showed that our densification approaches offer a real payoff in accuracy for distinguishing subtle attribute differences.

Fig. 20

Screenshot of our “one word challenge” task on Amazon Mechanical Turk. Our goal is to capture the first word that comes to the workers’ mind when presented with the images

Fig. 21

Example of a single task within a HIT