
1 Introduction

Recently, Vision-and-Language (V+L) pre-training has received increasing attention [8, 29, 31, 32, 35, 41, 48, 53, 55, 64]. The objective is to learn multimodal representations from large-scale image-text pairs in order to improve various downstream unimodal or multimodal tasks. These models have proven to be highly effective thanks to two main factors: (i) there are plenty of image-text pairs on the Web, providing abundant training data for free (no additional annotation required), and (ii) Transformer-based model architectures have been widely adopted to learn contextualized representations of multimodal inputs.

Fig. 1. Examples from (left) the fashion dataset FACAD [68] and (right) Flickr30k [46]. Fashion data often present multiple images from different angles, associated with structured titles and descriptions containing multiple fine-grained attributes (highlighted in color)

In this work, we consider the fashion domain with a focus on V+L model pre-training. This is motivated by the following reasons. First, fashion V+L data are not just copious in volume but also high in quality, because online fashion shopping is increasingly ubiquitous. On an e-commerce website, each product detail page (PDP) contains product images and text, both of which are of very high quality (i.e., often created by domain experts). Second, driven by such strong commercial forces, a large number of downstream tasks naturally arise in real-world applications, ranging from multimodal product understanding [36, 42] and cross-modal retrieval [18] to text-guided image retrieval [65]. When applied to the fashion domain, however, we observe that existing V+L pre-training methods [18, 77] are less effective than in other domains (see Sect. 4). We believe this is because they are not designed to exploit the unique characteristics of both fashion V+L data and fashion downstream tasks.

In particular, in most existing generic-domain V+L datasets (e.g., COCO [37] and Flickr30k [46]), each datum is a single image-text pair with brief text (e.g., an image caption, as shown in Fig. 1). In contrast, fashion datasets are collected mostly from PDPs on e-commerce sites and exhibit two specialties: (i) There is typically more than one image associated with a given text, as shown in Fig. 1, where the garment “maxi dress” is presented from three different views to offer online shoppers rich product information. (ii) There are many more fine-grained concepts in the text due to its product-description nature. As shown in Fig. 1, the fashion text is more focused on the garment itself, with very detailed adjectives and nouns describing its appearance in the title, style, and description. From a statistical perspective, we computed this ratio on four combined fashion datasets [22, 50, 58, 68] and two combined generic datasets [37, 46]: 82% of the words in the fashion captions are adjectives or nouns, versus only 59% in the generic captions. None of the existing V+L models is capable of exploiting these specialties of fashion data.

Fashion downstream tasks are also more diverse, posing a challenge to V+L pre-training model architecture design. Specifically, in the generic V+L domain, existing models are either single-stream or two-stream, depending on the intended downstream tasks. For example, operating on concatenated image and text tokens, a single-stream model [8, 27, 29, 32, 53] is suitable for multimodal fusion tasks such as VQA [2], VCR [71] and RefCOCO [70]. In contrast, a two-stream model [28, 41, 48, 54, 55] is typically designed for efficient cross-modal retrieval tasksFootnote 1. In the fashion domain, apart from image-text fusion and cross-modal retrieval, we also need to tackle downstream tasks for which neither single-stream nor two-stream architectures are suitable. For instance, the text-guided image retrieval task [21, 60, 65] requires not only a strong fusion of a reference image and a modifying text, but also an efficient matching between the fused multimodal representation and any candidate image. Given such diverse downstream tasks in fashion, existing single-stream and two-stream methods are limited in both flexibility and versatility.

To overcome the aforementioned limitations of existing methods, we introduce a novel fashion-focused V+L representation learning framework termed FashionViL. Two fashion-focused pre-training tasks are proposed to fully exploit the specialties of fashion data. (I) Multi-View Contrastive Learning (MVC): Given a fashion datum with multiple images/views and one text description, we require that each modality (unimodal or multimodal representation) be semantically discriminative w.r.t. the same product. To that end, beyond the common image-text matching, we further minimize the distance between (i) the multimodal representation of one view and the text and (ii) the other views. (II) Pseudo-Attribute Classification (PAC) is designed to exploit the rich fine-grained fashion concepts in the description: we first extract common attributes/noun phrases from the fashion datasets to construct a pseudo-attribute set; the model then learns to predict those attributes explicitly during pre-training. Our intuition is that fashion items sharing the same attribute(s) should be clustered together, i.e., be semantically discriminative in the attribute space. As shown in Sect. 4.3, MVC and PAC are both effective and complementary to conventional V+L pre-training tasks such as Image-Text Contrastive Learning (ITC) and Masked Language Modeling (MLM).

Moreover, we formulate a flexible and versatile model architecture that allows a pre-trained model to be easily adapted to a diverse set of downstream tasks. Specifically, our model consists of an image encoder and a modality-agnostic Transformer module, which can be used as either a text encoder or a multimodal fusion encoder. This supports fine-tuning in different downstream modes: (i) an early-fusion single-stream mode for multimodal joint representation learning, e.g., multimodal classification; (ii) a late-fusion two-stream mode for unimodal representation learning, e.g., cross-modal retrieval; (iii) an early-fusion two-stream mode for multimodal compositional representation learning, e.g., text-guided image retrieval. As a result, our design synergistically fuses the strength of single-stream models in modality fusion with that of two-stream models in scalability. Crucially, it also caters to fashion-unique tasks, e.g., text-guided image retrieval and outfit complementary item retrieval.

Our contributions are summarized as follows: (1) A novel fashion-focused V+L pre-training framework is proposed to exploit the specialties of fashion data through two new V+L pre-training tasks. (2) A versatile and flexible architecture built around a modality-agnostic Transformer is introduced to accommodate a diverse set of downstream tasks in the fashion domain. (3) For extensive evaluation, we consider five fashion V+L tasks: image-to-text retrieval, text-to-image retrieval [50], text-guided image retrieval [65], (sub)category recognition [50] and outfit complementary item retrieval [58]. Our experiments show that FashionViL achieves a new state of the art with consistent and significant performance gains on every task. To the best of our knowledge, this is the first work capable of addressing these five diverse fashion tasks together.

2 Related Work

With the advent of the Transformer [59] and its success in NLP [10] and CV [13], there has been great progress in applying large-scale V+L pre-training to the generic domain [8, 31, 32, 48]. Some recent studies have started to focus on e-commerce domains, including fashion [11, 18, 74, 76, 77]. Existing works differ in two main aspects: architecture design and pre-training tasks.

Model Architecture. All V+L pre-training methods take image and text embedding sequences as input, model inter-modal and optionally intra-modal interactions through a CNN or Transformer architecture, and output a contextualized feature sequence [6]. There are many architecture design choices, including single-stream early fusion [8, 32, 35, 53] vs. two-stream late fusion [17, 28, 41, 48, 55], and different visual features (e.g., detector-based regions [73] vs. ConvNet patches [27] vs. linear projections [29, 67]). In many cases, the design is driven by the intended downstream tasks (e.g., VQA requires earlier fusion to enhance the joint representation, whereas cross-modal retrieval requires later fusion to speed up inference). There are also efforts to bridge the gap between different architectures through a retrieve-and-rerank strategy [19, 54] or knowledge distillation [39, 63]. Unlike them, inspired by recent advances in modality-agnostic models [1, 33, 61, 62, 69], we introduce a unified architecture that can be easily switched between single-stream and two-stream modes, so there is no need to modify the architecture for different downstream tasks.

Pre-training Tasks. Various tasks have been proposed for V+L pre-training. Masked Language Modeling (MLM) and Image-Text Matching (ITM) are the direct counterparts of the BERT objectives [10, 32]. Masked Image Modeling (MIM) is the extension of MLM to the visual modality, with several variants such as masked region classification [41, 53] and masked region feature regression [8]. Other tasks have also proven effective, such as predicting object tags [26, 35], sequential caption generation [64, 75] and image-text contrastive learning [31, 34, 48]. However, none of these tasks can take advantage of the two specialties of fashion data discussed earlier. We therefore propose two fashion-focused pre-training tasks in this work.

Fig. 2. Overview of the proposed FashionViL model architecture, consisting of an image encoder, a text encoder and a fusion encoder. The text encoder and fusion encoder share the same parameters. We adopt six pre-training tasks for richer representation learning

3 Methodology

3.1 Model Overview

The model architecture of FashionViL is illustrated in Fig. 2(a). It is composed of an image encoder (IE) and a Transformer module that can serve as both the text encoder (TE) and the fusion encoder (FE). Specifically, our image encoder uses a ConvNet backbone to convert the raw pixels into a sequence of visual embeddings by rasterizing the grid features of the final feature map. For the text encoder, we follow BERT [10] to tokenize the input sentence into WordPieces [66]. Each sub-word token’s embedding is obtained by summing its word embedding and a learnable position embedding, followed by Layer Normalization (LN) [3].
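As a rough illustration of this input pipeline (module and dimension names below are ours, not the released implementation), the ResNet grid features can be rasterized into a patch sequence and the text embedded as described:

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """Sketch: ResNet50 grid features rasterized into a visual embedding sequence."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # off-the-shelf ImageNet weights
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])        # keep the final feature map
        self.proj = nn.Linear(2048, hidden_dim)                          # map to the Transformer width

    def forward(self, images):                     # images: (B, 3, H, W)
        fmap = self.cnn(images)                    # (B, 2048, h, w)
        patches = fmap.flatten(2).transpose(1, 2)  # rasterize the grid: (B, h*w, 2048)
        return self.proj(patches)                  # (B, K, hidden_dim)

class TextEmbedding(nn.Module):
    """Sketch: WordPiece ids -> word + position embedding, then LayerNorm."""
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_len, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids):                  # token_ids: (B, T), already WordPiece-tokenized
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.word_emb(token_ids) + self.pos_emb(pos))
```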

One novelty of the model design lies in the shared Transformer for TE and FE, which allows us to flexibly build various multimodal architectures, each suited to a different type of downstream task. For example, Fig. 2(b) shows an early-fusion architecture, where the raw sentence and the computed image embeddings are jointly fed into the multimodal fusion encoder. Note that when the Transformer is used as the fusion encoder, we further add modality embeddings to the visual and word embeddings to help the model distinguish the modality type. This architecture is exactly the same as the well-known single-stream models in many previous pre-training works [8, 18, 32]. Figure 2(c) then shows a late-fusion two-stream architecture, where the shareable Transformer acts as the text encoder. The outputs of the image encoder and text encoder interact through a simple dot product to compute the similarity between the two modalities. This architecture has been widely adopted for efficient large-scale cross-modal retrieval [19, 54]. Furthermore, we can fine-tune the shared Transformer into a more complicated two-stream variant, shown in Fig. 2(d). Here, one stream operates in an early-fusion manner while the other stream is an image encoder. This architecture is needed for fashion-focused retrieval tasks with a multimodal query, e.g., text-guided image retrieval [60, 65]. Note that the FE and TE in all three architectures above are the same Transformer; the only difference lies in their inputs.
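The following sketch shows how a single Transformer can be reused across the modes of Fig. 2(b)-(d); the class and method names (SharedEncoder, encode_text, fuse) are illustrative only.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One Transformer serving as text encoder (TE) or multimodal fusion encoder (FE)."""
    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.modality_emb = nn.Embedding(2, hidden_dim)  # 0: text, 1: image (fusion mode only)

    def encode_text(self, text_emb):                     # TE mode, late fusion (Fig. 2(c))
        return self.transformer(text_emb)

    def fuse(self, text_emb, image_emb):                 # FE mode, early fusion (Fig. 2(b)/(d))
        t = text_emb + self.modality_emb.weight[0]       # add modality embeddings so the model
        v = image_emb + self.modality_emb.weight[1]      # can tell the two modalities apart
        return self.transformer(torch.cat([t, v], dim=1))
```

Fine-tuning simply chooses which entry point to call; the underlying parameters stay the same.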

Given an image-text pair, we denote its raw visual inputs as \(\textbf{v}_{i} = \left\{ \textbf{v}_{i}^{1}, \ldots , \textbf{v}_{i}^{K}\right\} \), and its input words as \(\textbf{w}_{i}=\left\{ \textbf{w}_{i}^{\textrm{cls}}, \textbf{w}_{i}^{1}, \ldots , \textbf{w}_{i}^{T}\right\} \), where the subscript i indicates the i-th pair in the dataset. An additional special [CLS] token is inserted at the beginning of the text sequence, as well as the multimodal sequence when modalities are concatenated. We follow the common pre-training + fine-tuning pipeline when applying the model to downstream tasks.

3.2 Pre-training Tasks

We first introduce the two new pre-training tasks, followed by the conventional pre-training tasks adopted in our framework.

Multi-View Contrastive Learning (MVC). As shown in Fig. 1, each fashion item is often associated with multiple views to provide a comprehensive overview of the product. To exploit the reciprocal information between different views, we propose to build a correlation between (i) the visual representation of the original view \(\textbf{v}\), and (ii) the compositional representation of another view \(\textbf{d}\) and the text \(\textbf{w}\). When only one view of the product is available, we synthesize another view by randomly cropping or horizontally flipping the given one. As shown in Fig. 2(d), the visual representation of the original view is extracted by the image encoder, while the compositional representation is computed in an early-fusion manner. The similarity between the multimodal input \([\textbf{w};\textbf{d}]\)Footnote 2 and \(\textbf{v}\) is then computed as:

$$\begin{aligned} s\left( [\textbf{w}_{i};\textbf{d}_{i}], \textbf{v}_{j}\right) =g_{\theta }\left( \textbf{d}_{i}^{\textrm{avg}}|\textbf{w}_{i}\right) ^{T} g_{\theta }\left( \textbf{v}_{j}^{\textrm{avg}}\right) , \end{aligned}$$
(1)

where g represents a linear transformation that projects the average pooled features into the normalized low-dimensional latent space. Next, we apply two symmetrical InfoNCE losses [44] to pull closer the matched compositional representations and visual representations in the shared latent space:

$$\begin{aligned} \mathcal {L}_{\textrm{InfoNCE}}(x, y)=-\mathbb {E}_{(x, y) \sim B} \log \frac{\exp (s(x, y) / \tau )}{\sum _{\hat{y} \in \hat{B}} \exp (s(x, \hat{y}) / \tau )}, \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{\textrm{MVC}} = \frac{1}{2} \left[ \mathcal {L}_{\textrm{InfoNCE}}([\textbf{w};\textbf{d}], \textbf{v}) + \mathcal {L}_{\textrm{InfoNCE}}(\textbf{v}, [\textbf{w};\textbf{d}])\right] , \end{aligned}$$
(3)

where \(\tau \) is a learnable temperature and \(\hat{B}\) contains the positive sample y and \(|\hat{B}|-1\) negative samples drawn from a mini-batch B.
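A minimal sketch of Eqs. (1)-(3), assuming average-pooled outputs of the fusion encoder and the image encoder; the projection head `proj_g` and the exact pooling are our simplifications:

```python
import torch
import torch.nn.functional as F

def info_nce(sim, tau):
    """Eq. (2) over a mini-batch: sim is a (B, B) similarity matrix with positives on the diagonal."""
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / tau, targets)

def mvc_loss(fused_avg, view_avg, proj_g, tau):
    """Eq. (3): fused_avg = avg-pooled [w; d] from the FE, view_avg = avg-pooled v from the IE."""
    q = F.normalize(proj_g(fused_avg), dim=-1)  # g_theta(d^avg | w)
    k = F.normalize(proj_g(view_avg), dim=-1)   # g_theta(v^avg)
    sim = q @ k.t()                             # Eq. (1) for every pair in the batch
    return 0.5 * (info_nce(sim, tau) + info_nce(sim.t(), tau))
```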

Fig. 3. Histogram of the top-50 pseudo attributes

Pseudo-Attribute Classification (PAC). As mentioned in Sect. 1, fashion descriptions contain a large number of fine-grained attributes. We propose to mine pseudo-attribute concepts from all the available textual information, including the title, description and meta-info. Specifically, we extract all nouns and adjectives using the NLTK tagger [5] and keep only those that appear more than 100 times, resulting in a list of 2,232 attributes. The histogram of the top-50 pseudo attributes is shown in Fig. 3; all of them are indeed highly related to the fashion domain.
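The mining step can be sketched as below, assuming NLTK's default tokenizer and perceptron tagger; the authors' exact filtering rules may differ:

```python
from collections import Counter
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def mine_pseudo_attributes(captions, min_count=100):
    """Keep nouns and adjectives that appear at least `min_count` times across the fashion text."""
    counter = Counter()
    for text in captions:
        tokens = nltk.word_tokenize(text.lower())
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN") or tag.startswith("JJ"):  # nouns / adjectives
                counter[word] += 1
    return sorted(w for w, c in counter.items() if c >= min_count)
```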

We then explore how to utilize these mined concepts. We let the model learn to explicitly recognize the pseudo attributes during the pre-training stage, modeling the task as a multi-label classification problem called Pseudo-Attribute Classification (PAC). As shown in Fig. 2(c), we apply PAC to both the visual and textual modalities so that both encoders learn to capture fine-grained concepts. As this is a weakly-supervised setting and the mined labels can be noisy, we apply label smoothing when generating the targets [24]. We use A to denote the full set of 2,232 pseudo attributes and a the smoothed soft target for each class. For example, if a sample has two ground-truth labels at positions 0 and 1, then \(a_0 = a_1 = 0.5\) while \(a_i = 0 \ (i \ne 0, 1)\). Our objective is:

$$\begin{aligned} \mathcal {L}_{\textrm{PAC}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \mathbb {E}_{a \sim A} \left[ a \log P_{\theta }\left( a|\textbf{w}\right) + a \log P_{\theta }\left( a|\textbf{v}\right) \right] , \end{aligned}$$
(4)

where \(\theta \) denotes the learnable parameters and each pair \((\textbf{w}, \textbf{v})\) is sampled from the whole training set D.
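Eq. (4) reduces to a soft-target cross-entropy applied to both unimodal heads. The sketch below, with hypothetical classification heads `text_head` and `image_head`, assumes equal label smoothing over each sample's ground-truth attributes:

```python
import torch
import torch.nn.functional as F

def pac_loss(text_avg, image_avg, text_head, image_head, gt_indices, num_attrs=2232):
    """Multi-label pseudo-attribute classification with smoothed soft targets (Eq. 4)."""
    soft = torch.zeros(text_avg.size(0), num_attrs, device=text_avg.device)
    for i, idx in enumerate(gt_indices):          # idx: list of attribute ids for sample i
        soft[i, idx] = 1.0 / len(idx)             # e.g. two labels -> 0.5 each
    log_p_text = F.log_softmax(text_head(text_avg), dim=-1)
    log_p_image = F.log_softmax(image_head(image_avg), dim=-1)
    return -(soft * (log_p_text + log_p_image)).sum(dim=-1).mean()
```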

Masked Patch Feature Classification (MPFC). While naive masked feature regression has been shown to be unhelpful in V+L pre-training [14, 29], we empirically found our version of masked patch modeling effective in the fashion domain. Specifically, instead of regressing the features of each masked patch, we predict the patch label given by an offline image tokenizer. To this end, we first train a discrete VAE [15, 49, 57] as the image tokenizer on our collected fashion images with a perceptual loss [12]. We also adopt an exponential moving average (EMA) to update the codebook, which has proven useful for increasing codeword utilization [12, 57]. We randomly replace 25% of the patch features with zeros using a block-wise masking strategy [4]Footnote 3. Since each patch now has a discrete label, the model can be trained to predict the label of each masked patch \(\mathbf {v_m}\) given the remaining patches \(\mathbf {v_{\backslash m}}\) and the text \(\textbf{w}\) by optimizing:

$$\begin{aligned} \mathcal {L}_{\textrm{MPFC}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \log P_{\theta }\left( \mathbf {v^t_m}|\textbf{v}_{\backslash \textbf{m}}, \textbf{w}\right) , \end{aligned}$$
(5)

where \(\mathbf {v^t_m}\) is the estimated target label for the masked patch.
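A compressed sketch of the MPFC step; `image_tokenizer` stands in for the offline discrete VAE, and `fusion_encoder`/`patch_head` are placeholder modules. The assumption that the fused sequence is ordered as [text; image] is ours:

```python
import torch
import torch.nn.functional as F

def mpfc_loss(images, patch_feats, text_emb, mask, image_tokenizer, fusion_encoder, patch_head):
    """Predict the tokenizer-assigned code of each masked patch (Eq. 5).

    patch_feats: (B, K, D) visual embeddings; mask: (B, K) bool, True = masked (~25%, block-wise).
    """
    with torch.no_grad():
        targets = image_tokenizer(images)                          # (B, K) discrete codebook indices
    corrupted = patch_feats.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked patches
    hidden = fusion_encoder(text_emb, corrupted)                   # early fusion of text and corrupted image
    vis_hidden = hidden[:, text_emb.size(1):]                      # keep the visual part of the fused sequence
    logits = patch_head(vis_hidden)                                # (B, K, codebook_size)
    return F.cross_entropy(logits[mask], targets[mask])            # loss only on masked positions
```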

Image-Text Contrastive Learning (ITC). We also use ITC to encourage the two unimodal representations to be close in the latent space. As shown in Fig. 2(c), the similarity of \(\textbf{w}\) and \(\textbf{v}\) is measured by the dot product of their average pooled features after being projected to the latent space with two linear transformations f and g: \( s\left( \textbf{w}_{i}, \textbf{v}_{j}\right) =f_{\theta }\left( \textbf{w}_{i}^{\textrm{avg}}\right) ^{T} g_{\theta }\left( \textbf{v}_{j}^{\textrm{avg}}\right) . \) The ITC loss is:

$$\begin{aligned} \mathcal {L}_{\textrm{ITC}} = \frac{1}{2} \left[ \mathcal {L}_{\textrm{InfoNCE}}(\textbf{w}, \textbf{v}) + \mathcal {L}_{\textrm{InfoNCE}}(\textbf{v}, \textbf{w})\right] . \end{aligned}$$
(6)
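ITC is thus the same symmetric InfoNCE applied directly to the projected unimodal features; a minimal sketch, with hypothetical projection heads `proj_f` and `proj_g`:

```python
import torch
import torch.nn.functional as F

def itc_loss(text_avg, image_avg, proj_f, proj_g, tau):
    """Eq. (6): symmetric InfoNCE on projected, average-pooled unimodal features."""
    t = F.normalize(proj_f(text_avg), dim=-1)   # f_theta(w^avg)
    v = F.normalize(proj_g(image_avg), dim=-1)  # g_theta(v^avg)
    sim = t @ v.t() / tau                       # (B, B), positives on the diagonal
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```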

Masked Language Modeling (MLM). In MLM, we randomly mask out the input words with a probability of 15%, and replace all subwords belonging to the masked words \(\textbf{w}_{\textbf{m}}\) with the special token [MASK]Footnote 4. The goal of MLM is to predict these masked sub-words based on their surrounding words \(\textbf{w}_{\backslash \textbf{m}}\) and all image patches \(\textbf{v}\), by minimizing the negative log-likelihood:

$$\begin{aligned} \mathcal {L}_{\textrm{MLM}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \log P_{\theta }\left( \textbf{w}_{\textbf{m}}|\textbf{w}_{\backslash \textbf{m}}, \textbf{v}\right) . \end{aligned}$$
(7)
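A simplified sketch of the token-masking step (it masks individual sub-words and ignores the refinements of Footnote 4):

```python
import torch

def mask_tokens(token_ids, mask_token_id, special_ids, prob=0.15):
    """Randomly mask input tokens for MLM and build the prediction targets.

    token_ids: (B, T) WordPiece ids; special_ids: 1-D tensor of ids never to mask ([CLS], [PAD], ...).
    """
    candidates = ~torch.isin(token_ids, special_ids)
    selected = candidates & (torch.rand_like(token_ids, dtype=torch.float) < prob)
    labels = token_ids.masked_fill(~selected, -100)        # -100 = ignored by the MLM loss
    corrupted = token_ids.masked_fill(selected, mask_token_id)
    return corrupted, labels
```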

Image-Text Matching (ITM). In ITM, the input is an image-text pair and the target is a binary label \(z \in \{0, 1\}\) indicating whether the pair is a match. Following [31], we sample hard negative pairs from the similarity matrix \(s\left( \textbf{w}_{i}, \textbf{v}_{j}\right) \) computed by ITC and construct a mini-batch H containing 50% negative pairs. We take the hidden output of [CLS] at the last layer as the joint representation of the two modalities and feed it into an FC layer for binary classification. We apply a cross-entropy loss for ITM:

$$\begin{aligned} \mathcal {L}_{\textrm{ITM}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim H} \log P_{\theta }\left( z|\textbf{w}, \textbf{v}\right) . \end{aligned}$$
(8)
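The sketch below illustrates hard-negative mining from the ITC similarity matrix and the binary ITM head; it takes the single hardest negative per query, which is a simplification of the sampling scheme of [31]:

```python
import torch
import torch.nn.functional as F

def hard_negative_images(sim):
    """For each text, pick the most similar non-matching image from the ITC similarity matrix."""
    sim = sim.clone()
    sim.fill_diagonal_(float("-inf"))   # exclude the positive pair
    return sim.argmax(dim=1)            # index of the hardest negative image per text

def itm_loss(cls_hidden, itm_head, labels):
    """Eq. (8): binary match/mismatch classification on the [CLS] joint representation."""
    return F.cross_entropy(itm_head(cls_hidden), labels)   # itm_head: nn.Linear(hidden_dim, 2)
```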

4 Experiments

In this section, we introduce our pre-training dataset and 5 practical downstream tasks. We use MMF [52] and PyTorch [45] for the implementation. For the image encoder, we use an off-the-shelf ResNet50 [23] to compare fairly with previous methods, most of which also used ResNet50. For the text encoder and multimodal fusion encoder (the shared Transformer), we initialize with BERT-base-uncased [53]. We use 4 RTX 3090 GPUs for pre-training. The hyper-parameter details are listed in the supplementary file.

Table 1. Statistics of the datasets used for pre-training

4.1 Pre-training Dataset and Downstream Tasks

Pre-training Dataset. Our pre-training dataset consists of 4 public fashion-related datasets, namely FashionGen [50], FACAD [68], Fashion200K [22] and PolyvoreOutfits [58]. In total, these datasets provide 373.5K fashion products for pre-training. Because each product may contain multiple images from different angles, this amounts to about 1.35 million image-text pairs. Detailed statistics are provided in Table 1.

Cross-modal Retrieval. Image-to-Text Retrieval (ITR) is a cross-modal retrieval task: given an image query, the model finds the best-aligned text from a large candidate pool. Previous fashion-domain pre-training works [18, 77] use the joint representation over the [CLS] token to predict the matching score, which incurs an impractical time complexity due to the exhaustive matching between each query item and all gallery items in an early-fusion model [19, 39, 54, 63, 72]. While one of our model architectures can do the same (Fig. 2(b)), we opt for the two-stream late-fusion model in Fig. 2(c), which computes a cosine similarity for far more efficient retrieval, as in [28, 48]. Text-to-Image Retrieval (TIR) is the inverse problem of ITR, where the query and gallery modalities are swapped. The architecture for TIR is the same as for ITR.
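The efficiency of the two-stream mode comes from pre-computing all gallery embeddings once; retrieval then reduces to a cosine-similarity lookup, roughly as follows:

```python
import torch
import torch.nn.functional as F

def retrieve(query_embs, gallery_embs, topk=10):
    """Late-fusion retrieval: cosine similarity between pre-computed unimodal embeddings."""
    q = F.normalize(query_embs, dim=-1)          # (num_queries, D)
    g = F.normalize(gallery_embs, dim=-1)        # (num_gallery, D), computed once offline
    scores = q @ g.t()                           # (num_queries, num_gallery)
    return scores.topk(topk, dim=-1).indices     # ranked gallery ids per query
```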

Text-Guided Image Retrieval (TGIR). TGIR is a special type of image retrieval problem whose query is a multimodal composition [20, 21, 60, 65]. Specifically, given a query image and a modifying sentence, the model is required to retrieve another image that looks similar to the query image but with appearance changes specified by the query text. It has many practical applications in fashion, such as retrieving another garment according to a user's reference garment and their feedback. To handle the uniqueness of the multimodal query, several interesting fusion approaches have been proposed, such as the gating mechanism [51, 60], hierarchical attention [7], and style-content modification [30]. In this work, we follow [40] and simply apply an early-fusion model to encode the compositional representation of the query image and modifying text, as shown in Fig. 2(d).
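For TGIR, the compositional query of Fig. 2(d) can be scored against candidate images roughly as below; `fusion_encoder` and `proj_g` are placeholders for the shared FE and a projection head:

```python
import torch.nn.functional as F

def tgir_scores(ref_image_emb, text_emb, cand_image_avgs, fusion_encoder, proj_g):
    """Score candidate images against a (reference image, modifying text) query."""
    fused = fusion_encoder(text_emb, ref_image_emb)       # early-fusion compositional query
    q = F.normalize(proj_g(fused.mean(dim=1)), dim=-1)    # average pool + project, (num_queries, d)
    c = F.normalize(proj_g(cand_image_avgs), dim=-1)      # pre-computed candidate image embeddings
    return q @ c.t()                                      # (num_queries, num_candidates)
```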

Category/Subcategory Recognition (CR/SCR). The (sub)category is a vital attribute for describing a product. (S)CR requires the model to produce a reliable joint representation. Following previous works [18, 77], we directly append a linear layer on top of [CLS] to predict the label for these tasks.

Outfit Complementary Item Retrieval (OCIR). OCIR aims to find item(s) visually compatible with several given items so as to complete an outfit. This is a very practical task, as people often buy garments that match previously selected or purchased ones, so OCIR can serve as a helpful recommendation feature for online retailers [25, 38]. To address this task, we replace the backbone of CSA-Net [38] with the pre-trained image encoder of FashionViL. Note that, unlike all the multimodal/cross-modal tasks above, only the pre-trained image encoder is used in this downstream task. We leverage this task to evaluate the performance of our image encoder under the proposed multimodal pre-training.

Table 2. Results of cross-modal retrieval on FashionGen [50] under the same protocol as KaleidoBERT [77]. -e2e: without end-to-end training, i.e., the image encoder is fixed. -pt: direct fine-tuning without multimodal pre-training

4.2 Comparative Results

Cross-modal Retrieval. We evaluate cross-modal retrieval on the FashionGen [50] test split (not included in pre-training), including both ITR and TIR. Table 2 compares the performance of previous V+L pre-training methods with our FashionViL. Because previous works [18, 77] are designed with a single-stream architecture, they can only be evaluated on a small retrieval set. For example, for TIR, the models are required to pick the best-matched image from only 101 images given a text queryFootnote 5. Recall (over 1K retrievals) is reported as the metric. The same setting is used for ITR. For a fair comparison, we strictly follow the same evaluation protocol, reporting the recall over 1K retrievalsFootnote 6.

In Table 2, we compare our FashionViL and its two variants with existing methods. In particular, -e2e and -pt denote our model without end-to-end training (image encoder fixed) and without multimodal pre-training, respectively. We make the following observations: (1) Even with the fixed image encoder and without pre-training, FashionViL already achieves results comparable to existing methods. This suggests that late fusion can be as effective as early fusion for such fine-grained cross-modal retrieval. (2) When we unfreeze the image encoder for end-to-end training, R@1 jumps from 21.13 to 58.84, suggesting that end-to-end training is highly effective and that redundant pre-processing may be unnecessary. (3) When we further apply our proposed multimodal pre-training, our model achieves SOTA performance, as shown in the last column of Table 2, with an R@1 more than twice that of the previous SOTA.

Table 3. Results of cross-modal retrieval on FashionGen [50] with full evaluation

Note that our model architecture for this task is two-stream, which means it can be applied to large-scale retrieval, unlike the compared baselines. Therefore, we additionally report evaluation results on the full test set (of 32K image-text pairs), i.e., each query item is compared with every gallery item in the full test set. The results can be found in Table 3. We encourage future work to follow this full evaluation protocol as well.

Table 4. Results of text-guided image retrieval on FashionIQ [65]
Table 5. Results of category/subcategory recognition on FashionGen [50]
Table 6. Results of outfit complementary item retrieval on PolyvoreOutfits [58]

Text-Guided Image Retrieval. For TGIR, we compare our FashionViL with previous V+L pre-training methods and task-specific methods on FashionIQ [65]Footnote 7. The results are shown in Table 4. For more comprehensive comparisons, we use the two different implementations adopted by previous methods, i.e., training with a fixed image encoder [40] or end-to-end training [7, 30, 60].

We first report the results with the fixed ResNet152 in Columns 1 to 4 (C1–C4). CIRR adopts OSCAR [35] as the fusion module and uses global image features as input. We find that FashionViL consistently outperforms CIRR with a relative 10%–20% gain, with or without multimodal pre-training (C1 vs. C3, C2 vs. C4). This improvement demonstrates that patch-level features are superior to global features for compositional multimodal fusion. With our proposed pre-training, the performance further improves from 31.78 to 34.19 (C3 vs. C4), showing that our pre-training also works well with an off-the-shelf fixed image encoder.

We then report results under the end-to-end training paradigm (C5–C10). We find that simply replacing the GRU with BERT (C5 vs. C8) already yields a clear gain (from 23.65 to 27.17), indicating the importance of a higher-quality text encoder. Additionally, all previous works apply late interaction between the image embeddings and modified-text embeddings with an elaborately designed fusion module, e.g., TIRG [60]. We argue that an earlier fusion of the two modalities should result in an even better compositional embedding for the query. Comparing C9 and C8, our FashionViL without pre-training already outperforms TIRG+BERT, indicating that better multimodal query embeddings are learned by our model. Note that our text encoder and fusion encoder are shared, so FashionViL also uses fewer training parameters than TIRG+BERT. With the help of pre-training, our FashionViL achieves the new SOTA result with another significant 11.2% relative gain (C9 vs. C10).

Category/Subcategory Recognition. Following KaleidoBERT [77], we evaluate CR and SCR on the FashionGen dataset [50]. The joint representation of the model architecture in Fig. 2(b) is used to predict the classification score. The results are shown in Table 5. Once again, the end-to-end learning and the well-designed fashion-specific pre-training tasks help our FashionViL outperform the two previous works by significant margins (10.4% and 3.2%, respectively). Furthermore, we also simulate a new task – multi-image subcategory recognition (M-SCR) to evaluate the performance of FashionViL with multiple input images. See more results in the supplementary file.

Table 7. Evaluation on pre-training tasks using ITR, TIR, TGIR, SCR and OCIR as downstream tasks. Each number is the mean value of all metrics for one specific downstream task. Meta-sum stands for the summation of all numbers in each row. The three shades of grey represent the top three results when sharing TE and FE

Outfit Complementary Item Retrieval. In addition to the aforementioned multimodal and instance-level downstream tasks, we also examine FashionViL on the unimodal, outfit-level task, i.e., OCIR. We compare our model with previous task-specific methods [25, 38] on the Disjoint split of Polyvore Outfits [58]Footnote 8. As shown in Table 6, our multimodal pre-training improves performance by 21.0%, even though only the image encoder is tuned.

4.3 Ablation Study

We analyze the effectiveness of the different pre-training tasks and the TE/FE sharing strategy through ablation studies over the aforementioned five downstream tasks. The complete results are listed in Table 7. In addition to the standard metrics for each benchmark, we use the Meta-sum (sum of all scores across all benchmarks) as a global metric.

First, we establish a baseline without any multimodal pre-training in Line 0 (L0), i.e., the image/text encoders are initialized with the off-the-shelf ResNet50 and BERT, which are pre-trained on vision-only and language-only data, respectively.

Second, we validate the effectiveness of each pre-training task by its standalone performance, i.e., each time we pick only one task for pre-training. The results for MPFC, MLM, PAC, MVC and ITC are shown in L2, L4, L5, L6 and L7. It is clear from Table 7 that all of these pre-training tasks benefit the downstream tasks. However, a pre-training task tends to be relatively more helpful to downstream tasks of a similar type. For example, both MPFC (L2) and MLM (L4) focus on modeling the cross-modal interaction; they thus bring more gain to SCR but contribute relatively less to ITR and TIR. In contrast, since ITC (L7) has the same objective as ITR and TIR, it significantly boosts the cross-modal retrieval performance. As for TGIR, which requires not only a high-quality compositional representation but also high-quality unimodal representations, each of the 5 pre-training tasks has a positive impact.

Third, we validate the effectiveness of the proposed PAC (L5) and MVC (L6). For PAC, we implement a comparative experiment: MLM applied only to the pre-defined pseudo-attribute words (L3). The main difference between L3 and L5 is whether the multi-label supervision is applied to each masked text token or to the global representation. L3 leads to much lower performance than L5, indicating that supervising the pseudo attributes on the global representation is the better choice. Interestingly, L3 achieves a result comparable to L4, where every word (including those other than the pseudo attributes) can be masked. This means that masking only the fine-grained words is as effective as masking all words uniformly, which indicates that the most important textual cues lie in those fine-grained concept words. We then verify the superiority of MVC. To this end, we add an ablation that does not utilize multi-angle images (L1), i.e., it replaces the sampled different-angle image with an augmented version of the original image. Comparing L1 and L6, we confirm that the improvement of MVC mainly comes from contrastive learning on images from different angles.

Fig. 4. t-SNE of the learned visual/textual/joint representations from FashionViL

Next, we study the effect of different combinations of these tasks. When we add MLM and MPFC to ITC (L8), we observe a gain in Meta-sum, while the performance of ITR and TIR slightly drops. This is expected, as different tasks may provide different update directions for the same parameters, causing some tasks to overshadow the effects of others. However, such minor conflicts between tasks can be largely alleviated by employing more tasks. As shown in L9, the overall performance can be further boosted by adding ITM, and the same happens when MVC is added (L10). When all six tasks are jointly trained (L11), we observe a significant performance gain across all benchmarks. Notably, the two new fashion-specific tasks, MVC and PAC, play the most important roles in achieving the SOTA performance.

Finally, we demonstrate the superiority of sharing TE and FE. We implement a comparative model (L12) with the same pre-training tasks as L11 but with separate TE and FE. We observe a clear performance drop when the parameter sharing is broken. This indicates that our modality-agnostic sharing strategy not only reduces the number of parameters but also performs far better.

4.4 Visualization

We visualize the representations from the image encoder, text encoder and fusion encoder via t-SNE [43] in Fig. 4. Specifically, we feed all image-text pairs from FashionGen's test split into our model and visualize the 10 most popular categories using different colors. We compare the t-SNE of the model without multimodal pre-training (initialized with ResNet+BERT) against the model trained with the full 6 pre-training tasks. We find that the clusters become more discriminative as more pre-training tasks are added, indicating that FashionViL learns more fine-grained concepts. See more in the supplementary file.
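A minimal sketch of such a visualization with scikit-learn's t-SNE (the parameters here are illustrative, not the exact settings used for Fig. 4):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, category_ids):
    """Project encoder outputs to 2-D with t-SNE and color points by (sub)category."""
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=category_ids, cmap="tab10", s=4)
    plt.axis("off")
    plt.show()
```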

5 Conclusions

We have introduced FashionViL, a novel end-to-end large-scale pre-training framework for V+L representation learning in the fashion domain. We proposed two effective fashion-specific pre-training tasks and introduced a novel modality-agnostic text/fusion encoder for a flexible and versatile multimodal architecture. Our FashionViL achieves new SOTA performance with superior efficiency on 5 popular fashion-related tasks.