
1 Introduction

Recently, Vision-and-Language (V+L) pre-training has received increasing attention [8, 29, 31, 32, 35, 41, 48, 53, 55, 64]. The objective is to learn multimodal representations from large-scale image-text pairs in order to improve various downstream unimodal or multimodal tasks. These models have proven to be highly effective thanks to two main factors: (i) there are plenty of image-text pairs on the Web, providing abundant training data for free (no additional annotation required), and (ii) Transformer-based model architectures have been widely adopted to learn contextualized representations of multimodal inputs.

Fig. 1. Examples from (left) the fashion dataset FACAD [68] and (right) Flickr30k [46]. Fashion data often present multiple images from different angles, associated with structured titles and descriptions containing multiple fine-grained attributes (highlighted in color)

In this work, we consider the fashion domain with a focus on V+L model pre-training. This is motivated by the following reasons. First, fashion V+L data are not just copious in volume but also high in quality, because online fashion shopping is increasingly ubiquitous. On an e-commerce website, each product detail page (PDP) contains product images and text, both of which are of very high quality (i.e., often created by domain experts). Second, driven by such strong commercial forces, a large number of downstream tasks naturally arise in real-world applications, ranging from multimodal product understanding [36, 42] and cross-modal retrieval [18] to text-guided image retrieval [65]. When applied to the fashion domain, however, we observe that existing V+L pre-training methods [18, 77] are less effective than in other domains (see Sect. 4). We believe this is because they are not designed to exploit the unique characteristics of both fashion V+L data and fashion downstream tasks.

In particular, in most existing generic-domain V+L datasets (e.g., COCO [37] and Flickr30k [46]), each datum is a single image-text pair with brief text (e.g., an image caption, as shown in Fig. 1). In contrast, fashion datasets are collected mostly from PDPs on e-commerce sites and exhibit two specialties: (i) There is typically more than one image associated with a given text, as shown in Fig. 1, where the garment “maxi dress” is presented from three different views to offer online shoppers rich product information. (ii) There are many more fine-grained concepts in the text due to its product-description nature. As shown in Fig. 1, the fashion text is more focused on the garment itself, with very detailed adjectives and nouns describing its appearance in the title, style, and description. From a statistical perspective, we computed this ratio on four combined fashion datasets [22, 50, 58, 68] and two combined generic datasets [37, 46]: 82% of the words in the fashion captions are adjectives or nouns, versus only 59% in the generic captions. None of the existing V+L models is capable of exploiting these specialties of fashion data.

Fashion downstream tasks are also more diverse, posing a challenge to V+L pre-training model architecture design. Specifically, in the generic V+L domain, existing models are either single-stream or two-stream, depending on the intended downstream tasks. For example, operating on concatenated image and text tokens, a single-stream model [8, 27, 29, 32, 53] is suitable for multimodal fusion tasks such as VQA [2], VCR [71] and RefCOCO [70]. In contrast, a two-stream model [28, 41, 48, 54, 55] is typically designed for efficient cross-modal retrieval tasksFootnote 1. In the fashion domain, apart from image-text fusion and cross-modal retrieval, we also need to tackle downstream tasks for which neither single-stream nor two-stream architectures are suitable. For instance, the text-guided image retrieval task [21, 60, 65] requires not only a strong fusion of a reference image and a modifying text, but also an efficient matching between the fused multimodal representation and any candidate image. Given such diverse downstream tasks in fashion, existing single-stream and two-stream methods are limited in both flexibility and versatility.

To overcome the aforementioned limitations of existing methods, we introduce a novel fashion-focused V+L representation learning framework termed FashionViL. Two fashion-focused pre-training tasks are proposed to fully exploit the specialties of fashion data. (I) Multi-View Contrastive Learning (MVC): Given a fashion datum with multiple images/views and one text description, we require that each modality (unimodal or multimodal representation) be semantically discriminative w.r.t. the same product. To that end, beyond the common image-text matching, we further minimize the distance between (i) the multimodal representation of one view and the text and (ii) the other views. (II) Pseudo-Attribute Classification (PAC) is designed to exploit the rich fine-grained fashion concepts in the description: we first extract common attributes/noun phrases from the fashion datasets to construct a pseudo-attribute set; the model then learns to predict those attributes explicitly during pre-training. Our intuition is that fashion items sharing the same attribute(s) should be clustered together, i.e., be semantically discriminative in the attribute space. As shown in Sect. 4.3, MVC and PAC are both effective and complementary to conventional V+L pre-training tasks such as Image-Text Contrastive Learning (ITC) and Masked Language Modeling (MLM).

Moreover, we formulate a flexible and versatile model architecture that allows a pre-trained model to be easily adapted to a diverse set of downstream tasks. Specifically, our model consists of an image encoder and a modality-agnostic Transformer module, which can be used as either a text encoder or a multimodal fusion encoder. This supports fine-tuning in different downstream modes: (i) an early-fusion single-stream mode for multimodal joint representation learning, e.g., multimodal classification; (ii) a late-fusion two-stream mode for unimodal representation learning, e.g., cross-modal retrieval; (iii) an early-fusion two-stream mode for multimodal compositional representation learning, e.g., text-guided image retrieval. As a result, our design synergistically fuses the strength of single-stream models in modality fusion with that of two-stream models in scalability. Crucially, it also caters to fashion-unique tasks, e.g., text-guided image retrieval and outfit complementary item retrieval.

Our contributions are summarized as follows: (1) A novel fashion-focused V+L pre-training framework is proposed to exploit the specialties of fashion data through two new V+L pre-training tasks. (2) A versatile and flexible architecture built around a modality-agnostic Transformer is introduced to accommodate a diverse set of downstream tasks in the fashion domain. (3) For extensive evaluation, we consider five fashion V+L tasks: image-to-text retrieval, text-to-image retrieval [50], text-guided image retrieval [65], (sub)category recognition [50] and outfit complementary item retrieval [58]. Our experiments show that FashionViL achieves a new state of the art with consistent and significant performance gains on every task. To the best of our knowledge, this is the first work capable of addressing these five diverse fashion tasks together.

2 Related Work

With the advent of the Transformer [59] and its success in NLP [10] and CV [13], there has been great progress in applying large-scale V+L pre-training to the generic domain [8, 31, 32, 48]. Some recent studies have started to focus on e-commerce domains, including fashion [11, 18, 74, 76, 77]. Existing works differ in two main aspects: architecture design and pre-training tasks.

Model Architecture. All V+L pre-training methods take image and text embedding sequences as input, model inter-modal and optionally intra-modal interactions through a CNN or Transformer architecture, and output a contextualized feature sequence [6]. There are many architecture design choices, including single-stream early fusion [8, 32, 35, 53] vs. two-stream late fusion [17, 28, 41, 48, 55], and different visual features (e.g., detector-based regions [73] vs. ConvNet patches [27] vs. linear projections [29, 67]). In many cases, the design is driven by the intended downstream tasks (e.g., VQA requires earlier fusion to enhance the joint representation, whereas cross-modal retrieval requires later fusion to speed up inference). There are also efforts to bridge the gap between different architectures through a retrieve-and-rerank strategy [19, 54] or knowledge distillation [39, 63]. Unlike them, inspired by recent advances in modality-agnostic models [1, 33, 61, 62, 69], we introduce a unified architecture that can be easily switched between single-stream and two-stream modes, so there is no need to modify the architecture for different downstream tasks.

Pre-training Tasks. Various tasks have been proposed for V+L pre-training. Masked Language Modeling (MLM) and Image-Text Matching (ITM) are the direct counterparts of the BERT objectives [10, 32]. Masked Image Modeling (MIM) is the extension of MLM to the visual modality, with several variants such as masked region classification [41, 53] and masked region feature regression [8]. Other tasks have also proven effective, such as predicting object tags [26, 35], sequential caption generation [64, 75] and image-text contrastive learning [31, 34, 48]. However, none of these tasks can take advantage of the two specialties of fashion data discussed earlier. We therefore propose two fashion-focused pre-training tasks in this work.

Fig. 2. Overview of the proposed FashionViL model architecture, consisting of an image encoder, a text encoder and a fusion encoder. The text encoder and fusion encoder share the same parameters. We adopt six pre-training tasks for richer representation learning

3 Methodology

3.1 Model Overview

The model architecture of FashionViL is illustrated in Fig. 2(a). It is composed of an image encoder (IE) and a Transformer module that can serve as both the text encoder (TE) and the fusion encoder (FE). Specifically, our image encoder uses a ConvNet backbone to convert the raw pixels into a sequence of visual embeddings by rasterizing the grid features of the final feature map. For the text encoder, we follow BERT [10] to tokenize the input sentence into WordPieces [66]. Each sub-word token’s embedding is obtained by summing its word embedding and a learnable position embedding, followed by Layer Normalization (LN) [3].
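As a rough illustration of this input pipeline (module and dimension names below are ours, not the released implementation), the ResNet grid features can be rasterized into a patch sequence and the text embedded as described:

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """Sketch: ResNet50 grid features rasterized into a visual embedding sequence."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # off-the-shelf ImageNet weights
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])        # keep the final feature map
        self.proj = nn.Linear(2048, hidden_dim)                          # map to the Transformer width

    def forward(self, images):                     # images: (B, 3, H, W)
        fmap = self.cnn(images)                    # (B, 2048, h, w)
        patches = fmap.flatten(2).transpose(1, 2)  # rasterize the grid: (B, h*w, 2048)
        return self.proj(patches)                  # (B, K, hidden_dim)

class TextEmbedding(nn.Module):
    """Sketch: WordPiece ids -> word + position embedding, then LayerNorm."""
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_len, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids):                  # token_ids: (B, T), already WordPiece-tokenized
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.word_emb(token_ids) + self.pos_emb(pos))
```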

One novelty of the model design lies in the shared Transformer for TE and FE, which allows us to flexibly build various multimodal architectures, each suited to a different type of downstream task. For example, Fig. 2(b) shows an early-fusion architecture, where the raw sentence and the computed image embeddings are jointly fed into the multimodal fusion encoder. Note that when the Transformer is used as the fusion encoder, we further add modality embeddings to the visual and word embeddings to help the model distinguish the modality type. This architecture is exactly the same as the well-known single-stream models in many previous pre-training works [8, 18, 32]. Figure 2(c) then shows a late-fusion two-stream architecture, where the shareable Transformer acts as the text encoder. The outputs of the image encoder and text encoder interact through a simple dot product to compute the similarity between the two modalities. This architecture has been widely adopted for efficient large-scale cross-modal retrieval [19, 54]. Furthermore, we can fine-tune the shared Transformer into a more complicated two-stream variant, shown in Fig. 2(d). Here, one stream operates in an early-fusion manner while the other stream is an image encoder. This architecture is needed for fashion-focused retrieval tasks with a multimodal query, e.g., text-guided image retrieval [60, 65]. Note that the FE and TE in all three architectures above are the same Transformer; the only difference lies in their inputs.
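The following sketch shows how a single Transformer can be reused across the modes of Fig. 2(b)-(d); the class and method names (SharedEncoder, encode_text, fuse) are illustrative only.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One Transformer serving as text encoder (TE) or multimodal fusion encoder (FE)."""
    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.modality_emb = nn.Embedding(2, hidden_dim)  # 0: text, 1: image (fusion mode only)

    def encode_text(self, text_emb):                     # TE mode, late fusion (Fig. 2(c))
        return self.transformer(text_emb)

    def fuse(self, text_emb, image_emb):                 # FE mode, early fusion (Fig. 2(b)/(d))
        t = text_emb + self.modality_emb.weight[0]       # add modality embeddings so the model
        v = image_emb + self.modality_emb.weight[1]      # can tell the two modalities apart
        return self.transformer(torch.cat([t, v], dim=1))
```

Fine-tuning simply chooses which entry point to call; the underlying parameters stay the same.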

Given an image-text pair, we denote its raw visual inputs as \(\textbf{v}_{i} = \left\{ \textbf{v}_{i}^{1}, \ldots , \textbf{v}_{i}^{K}\right\} \), and its input words as \(\textbf{w}_{i}=\left\{ \textbf{w}_{i}^{\textrm{cls}}, \textbf{w}_{i}^{1}, \ldots , \textbf{w}_{i}^{T}\right\} \), where the subscript i indicates the i-th pair in the dataset. An additional special [CLS] token is inserted at the beginning of the text sequence, as well as the multimodal sequence when modalities are concatenated. We follow the common pre-training + fine-tuning pipeline when applying the model to downstream tasks.

3.2 Pre-training Tasks

We first introduce the two new pre-training tasks, followed by the conventional pre-training tasks adopted in our framework.

Multi-View Contrastive Learning (MVC). As shown in Fig. 1, each fashion item is often associated with multiple views to provide a comprehensive overview of the product. To exploit the reciprocal information between different views, we propose to build a correlation between (i) the visual representation of the original view \(\textbf{v}\), and (ii) the compositional representation of another view \(\textbf{d}\) and the text \(\textbf{w}\). When only one view of the product is available, we synthesize another view by randomly cropping or horizontally flipping the given one. As shown in Fig. 2(d), the visual representation of the original view is extracted by the image encoder, while the compositional representation is computed in an early-fusion manner. The similarity between the multimodal input \([\textbf{w};\textbf{d}]\)Footnote 2 and \(\textbf{v}\) is then computed as:

$$\begin{aligned} s\left( [\textbf{w}_{i};\textbf{d}_{i}], \textbf{v}_{j}\right) =g_{\theta }\left( \textbf{d}_{i}^{\textrm{avg}}|\textbf{w}_{i}\right) ^{T} g_{\theta }\left( \textbf{v}_{j}^{\textrm{avg}}\right) , \end{aligned}$$
(1)

where g represents a linear transformation that projects the average pooled features into the normalized low-dimensional latent space. Next, we apply two symmetrical InfoNCE losses [44] to pull closer the matched compositional representations and visual representations in the shared latent space:

$$\begin{aligned} \mathcal {L}_{\textrm{InfoNCE}}(x, y)=-\mathbb {E}_{(x, y) \sim B} \log \frac{\exp (s(x, y) / \tau )}{\sum _{\hat{y} \in \hat{B}} \exp (s(x, \hat{y}) / \tau )}, \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{\textrm{MVC}} = \frac{1}{2} \left[ \mathcal {L}_{\textrm{InfoNCE}}([\textbf{w};\textbf{d}], \textbf{v}) + \mathcal {L}_{\textrm{InfoNCE}}(\textbf{v}, [\textbf{w};\textbf{d}])\right] , \end{aligned}$$
(3)

where \(\tau \) is a learnable temperature and \(\hat{B}\) contains the positive sample y and \(|\hat{B}|-1\) negative samples drawn from a mini-batch B.
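A minimal sketch of Eqs. (1)-(3), assuming average-pooled outputs of the fusion encoder and the image encoder; the projection head `proj_g` and the exact pooling are our simplifications:

```python
import torch
import torch.nn.functional as F

def info_nce(sim, tau):
    """Eq. (2) over a mini-batch: sim is a (B, B) similarity matrix with positives on the diagonal."""
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / tau, targets)

def mvc_loss(fused_avg, view_avg, proj_g, tau):
    """Eq. (3): fused_avg = avg-pooled [w; d] from the FE, view_avg = avg-pooled v from the IE."""
    q = F.normalize(proj_g(fused_avg), dim=-1)  # g_theta(d^avg | w)
    k = F.normalize(proj_g(view_avg), dim=-1)   # g_theta(v^avg)
    sim = q @ k.t()                             # Eq. (1) for every pair in the batch
    return 0.5 * (info_nce(sim, tau) + info_nce(sim.t(), tau))
```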

Fig. 3. Histogram of the top-50 pseudo attributes

Pseudo-Attribute Classification (PAC). As mentioned in Sect. 1, fashion descriptions contain a large number of fine-grained attributes. We propose to mine pseudo-attribute concepts from all the available textual information, including the title, description and meta-info. Specifically, we extract all nouns and adjectives using the NLTK tagger [5] and keep only those that appear more than 100 times, resulting in a list of 2,232 attributes. The histogram of the top-50 pseudo attributes is shown in Fig. 3; all of them are indeed highly related to the fashion domain.
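The mining step can be sketched as below, assuming NLTK's default tokenizer and perceptron tagger; the authors' exact filtering rules may differ:

```python
from collections import Counter
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def mine_pseudo_attributes(captions, min_count=100):
    """Keep nouns and adjectives that appear at least `min_count` times across the fashion text."""
    counter = Counter()
    for text in captions:
        tokens = nltk.word_tokenize(text.lower())
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN") or tag.startswith("JJ"):  # nouns / adjectives
                counter[word] += 1
    return sorted(w for w, c in counter.items() if c >= min_count)
```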

We then explore how to utilize these mined concepts. We let the model learn to explicitly recognize the pseudo attributes during the pre-training stage, modeling the task as a multi-label classification problem called Pseudo-Attribute Classification (PAC). As shown in Fig. 2(c), we apply PAC to both the visual and textual modalities so that both encoders learn to capture fine-grained concepts. As this is a weakly-supervised setting and the mined labels can be noisy, we apply label smoothing when generating the targets [24]. We use A to denote the full set of 2,232 pseudo attributes and a the smoothed soft target for each class. For example, if a sample has two ground-truth labels at positions 0 and 1, then \(a_0 = a_1 = 0.5\) while \(a_i = 0 \ (i \ne 0, 1)\). Our objective is:

$$\begin{aligned} \mathcal {L}_{\textrm{PAC}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \mathbb {E}_{a \sim A} \left[ a \log P_{\theta }\left( a|\textbf{w}\right) + a \log P_{\theta }\left( a|\textbf{v}\right) \right] , \end{aligned}$$
(4)

where \(\theta \) denotes the learnable parameters and each pair \((\textbf{w}, \textbf{v})\) is sampled from the whole training set D.
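Eq. (4) reduces to a soft-target cross-entropy applied to both unimodal heads. The sketch below, with hypothetical classification heads `text_head` and `image_head`, assumes equal label smoothing over each sample's ground-truth attributes:

```python
import torch
import torch.nn.functional as F

def pac_loss(text_avg, image_avg, text_head, image_head, gt_indices, num_attrs=2232):
    """Multi-label pseudo-attribute classification with smoothed soft targets (Eq. 4)."""
    soft = torch.zeros(text_avg.size(0), num_attrs, device=text_avg.device)
    for i, idx in enumerate(gt_indices):          # idx: list of attribute ids for sample i
        soft[i, idx] = 1.0 / len(idx)             # e.g. two labels -> 0.5 each
    log_p_text = F.log_softmax(text_head(text_avg), dim=-1)
    log_p_image = F.log_softmax(image_head(image_avg), dim=-1)
    return -(soft * (log_p_text + log_p_image)).sum(dim=-1).mean()
```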

Masked Patch Feature Classification (MPFC). While naive masked feature regression has been shown to be unhelpful in V+L pre-training [14, 29], we empirically found our version of masked patch modeling effective in the fashion domain. Specifically, instead of regressing the features of each masked patch, we predict the patch label given by an offline image tokenizer. To this end, we first train a discrete VAE [15, 49, 57] as the image tokenizer on our collected fashion images with a perceptual loss [12]. We also adopt an exponential moving average (EMA) to update the codebook, which has proven useful for increasing codeword utilization [12, 57]. We randomly replace 25% of the patch features with zeros using a block-wise masking strategy [4]Footnote 3. Since each patch now has a discrete label, the model can be trained to predict the label of each masked patch \(\mathbf {v_m}\) given the remaining patches \(\mathbf {v_{\backslash m}}\) and the text \(\textbf{w}\) by optimizing:

$$\begin{aligned} \mathcal {L}_{\textrm{MPFC}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \log P_{\theta }\left( \mathbf {v^t_m}|\textbf{v}_{\backslash \textbf{m}}, \textbf{w}\right) , \end{aligned}$$
(5)

where \(\mathbf {v^t_m}\) is the estimated target label for the masked patch.
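A compressed sketch of the MPFC step; `image_tokenizer` stands in for the offline discrete VAE, and `fusion_encoder`/`patch_head` are placeholder modules. The assumption that the fused sequence is ordered as [text; image] is ours:

```python
import torch
import torch.nn.functional as F

def mpfc_loss(images, patch_feats, text_emb, mask, image_tokenizer, fusion_encoder, patch_head):
    """Predict the tokenizer-assigned code of each masked patch (Eq. 5).

    patch_feats: (B, K, D) visual embeddings; mask: (B, K) bool, True = masked (~25%, block-wise).
    """
    with torch.no_grad():
        targets = image_tokenizer(images)                          # (B, K) discrete codebook indices
    corrupted = patch_feats.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked patches
    hidden = fusion_encoder(text_emb, corrupted)                   # early fusion of text and corrupted image
    vis_hidden = hidden[:, text_emb.size(1):]                      # keep the visual part of the fused sequence
    logits = patch_head(vis_hidden)                                # (B, K, codebook_size)
    return F.cross_entropy(logits[mask], targets[mask])            # loss only on masked positions
```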

Image-Text Contrastive Learning (ITC). We also use ITC to encourage the two unimodal representations to be close in the latent space. As shown in Fig. 2(c), the similarity of \(\textbf{w}\) and \(\textbf{v}\) is measured by the dot product of their average pooled features after being projected to the latent space with two linear transformations f and g: \( s\left( \textbf{w}_{i}, \textbf{v}_{j}\right) =f_{\theta }\left( \textbf{w}_{i}^{\textrm{avg}}\right) ^{T} g_{\theta }\left( \textbf{v}_{j}^{\textrm{avg}}\right) . \) The ITC loss is:

$$\begin{aligned} \mathcal {L}_{\textrm{ITC}} = \frac{1}{2} \left[ \mathcal {L}_{\textrm{InfoNCE}}(\textbf{w}, \textbf{v}) + \mathcal {L}_{\textrm{InfoNCE}}(\textbf{v}, \textbf{w})\right] . \end{aligned}$$
(6)
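ITC is thus the same symmetric InfoNCE applied directly to the projected unimodal features; a minimal sketch, with hypothetical projection heads `proj_f` and `proj_g`:

```python
import torch
import torch.nn.functional as F

def itc_loss(text_avg, image_avg, proj_f, proj_g, tau):
    """Eq. (6): symmetric InfoNCE on projected, average-pooled unimodal features."""
    t = F.normalize(proj_f(text_avg), dim=-1)   # f_theta(w^avg)
    v = F.normalize(proj_g(image_avg), dim=-1)  # g_theta(v^avg)
    sim = t @ v.t() / tau                       # (B, B), positives on the diagonal
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```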

Masked Language Modeling (MLM). In MLM, we randomly mask out the input words with a probability of 15%, and replace all subwords belonging to the masked words \(\textbf{w}_{\textbf{m}}\) with the special token [MASK]Footnote 4. The goal of MLM is to predict these masked sub-words based on their surrounding words \(\textbf{w}_{\backslash \textbf{m}}\) and all image patches \(\textbf{v}\), by minimizing the negative log-likelihood:

$$\begin{aligned} \mathcal {L}_{\textrm{MLM}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim D} \log P_{\theta }\left( \textbf{w}_{\textbf{m}}|\textbf{w}_{\backslash \textbf{m}}, \textbf{v}\right) . \end{aligned}$$
(7)
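A simplified sketch of the token-masking step (it masks individual sub-words and ignores the refinements of Footnote 4):

```python
import torch

def mask_tokens(token_ids, mask_token_id, special_ids, prob=0.15):
    """Randomly mask input tokens for MLM and build the prediction targets.

    token_ids: (B, T) WordPiece ids; special_ids: 1-D tensor of ids never to mask ([CLS], [PAD], ...).
    """
    candidates = ~torch.isin(token_ids, special_ids)
    selected = candidates & (torch.rand_like(token_ids, dtype=torch.float) < prob)
    labels = token_ids.masked_fill(~selected, -100)        # -100 = ignored by the MLM loss
    corrupted = token_ids.masked_fill(selected, mask_token_id)
    return corrupted, labels
```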

Image-Text Matching (ITM). In ITM, the input is an image-text pair and the target is a binary label \(z \in \{0, 1\}\) indicating whether the pair is a match. Following [31], we sample hard negative pairs from the similarity matrix \(s\left( \textbf{w}_{i}, \textbf{v}_{j}\right) \) computed by ITC and construct a mini-batch H containing 50% negative pairs. We take the hidden output of [CLS] at the last layer as the joint representation of the two modalities and feed it into an FC layer for binary classification. We apply a cross-entropy loss for ITM:

$$\begin{aligned} \mathcal {L}_{\textrm{ITM}}=-\mathbb {E}_{(\textbf{w}, \textbf{v}) \sim H} \log P_{\theta }\left( z|\textbf{w}, \textbf{v}\right) . \end{aligned}$$
(8)
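The sketch below illustrates hard-negative mining from the ITC similarity matrix and the binary ITM head; it takes the single hardest negative per query, which is a simplification of the sampling scheme of [31]:

```python
import torch
import torch.nn.functional as F

def hard_negative_images(sim):
    """For each text, pick the most similar non-matching image from the ITC similarity matrix."""
    sim = sim.clone()
    sim.fill_diagonal_(float("-inf"))   # exclude the positive pair
    return sim.argmax(dim=1)            # index of the hardest negative image per text

def itm_loss(cls_hidden, itm_head, labels):
    """Eq. (8): binary match/mismatch classification on the [CLS] joint representation."""
    return F.cross_entropy(itm_head(cls_hidden), labels)   # itm_head: nn.Linear(hidden_dim, 2)
```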

4 Experiments

In this section, we introduce our pre-training dataset and 5 practical downstream tasks. We use MMF [52] and PyTorch [45] for the implementation. For the image encoder, we use an off-the-shelf ResNet50 [23] to compare fairly with previous methods, most of which also used ResNet50. For the text encoder and multimodal fusion encoder (the shared Transformer), we initialize with BERT-base-uncased [53]. We use 4 RTX 3090 GPUs for pre-training. The hyper-parameter details are listed in the supplementary file.

Table 1. Statistics of the datasets used for pre-training

4.1 Pre-training Dataset and Downstream Tasks

Pre-training Dataset. Our pre-training dataset consists of 4 public fashion-related datasets, namely FashionGen [50], FACAD [68], Fashion200K [22] and PolyvoreOutfits [58]. In total, these datasets provide 373.5K fashion products for pre-training. Because each product may contain multiple images from different angles, this amounts to about 1.35 million image-text pairs. Detailed statistics are provided in Table 1.

Cross-modal Retrieval. Image-to-Text Retrieval (ITR) is a cross-modal retrieval task: given an image query, the model finds the best-aligned text from a large candidate pool. Previous fashion-domain pre-training works [18, 77] use the joint representation over the [CLS] token to predict the matching score, which incurs an impractical time complexity due to the exhaustive matching between each query item and all gallery items in an early-fusion model [19, 39, 54, 63, 72]. While one of our model architectures can do the same (Fig. 2(b)), we opt for the two-stream late-fusion model in Fig. 2(c), which computes a cosine similarity for far more efficient retrieval, as in [28, 48]. Text-to-Image Retrieval (TIR) is the inverse problem of ITR, where the query and gallery modalities are swapped. The architecture for TIR is the same as for ITR.
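The efficiency of the two-stream mode comes from pre-computing all gallery embeddings once; retrieval then reduces to a cosine-similarity lookup, roughly as follows:

```python
import torch
import torch.nn.functional as F

def retrieve(query_embs, gallery_embs, topk=10):
    """Late-fusion retrieval: cosine similarity between pre-computed unimodal embeddings."""
    q = F.normalize(query_embs, dim=-1)          # (num_queries, D)
    g = F.normalize(gallery_embs, dim=-1)        # (num_gallery, D), computed once offline
    scores = q @ g.t()                           # (num_queries, num_gallery)
    return scores.topk(topk, dim=-1).indices     # ranked gallery ids per query
```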

Text-Guided Image Retrieval (TGIR). TGIR is a special type of image retrieval problem whose query is a multimodal composition [20, 21, 60, 65]. Specifically, given a query image and a modifying sentence, the model is required to retrieve another image that looks similar to the query image but with appearance changes specified by the query text. It has many practical applications in fashion, such as retrieving another garment according to a user's reference garment and their feedback. To handle the uniqueness of the multimodal query, several interesting fusion approaches have been proposed, such as the gating mechanism [51, 60], hierarchical attention [7], and style-content modification [30]. In this work, we follow [40] and simply apply an early-fusion model to encode the compositional representation of the query image and modifying text, as shown in Fig. 2(d).
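For TGIR, the compositional query of Fig. 2(d) can be scored against candidate images roughly as below; `fusion_encoder` and `proj_g` are placeholders for the shared FE and a projection head:

```python
import torch.nn.functional as F

def tgir_scores(ref_image_emb, text_emb, cand_image_avgs, fusion_encoder, proj_g):
    """Score candidate images against a (reference image, modifying text) query."""
    fused = fusion_encoder(text_emb, ref_image_emb)       # early-fusion compositional query
    q = F.normalize(proj_g(fused.mean(dim=1)), dim=-1)    # average pool + project, (num_queries, d)
    c = F.normalize(proj_g(cand_image_avgs), dim=-1)      # pre-computed candidate image embeddings
    return q @ c.t()                                      # (num_queries, num_candidates)
```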

Category/Subcategory Recognition (CR/SCR). The (sub)category is a vital attribute for describing a product. (S)CR requires the model to produce a reliable joint representation. Following previous works [18, 77], we directly append a linear layer on top of [CLS] to predict the label for these tasks.

Outfit Complementary Item Retrieval (OCIR). OCIR aims to find item(s) visually compatible with several given items so as to complete an outfit. This is a very practical task, as people often buy garments that match previously selected or purchased ones, so OCIR can serve as a helpful recommendation feature for online retailers [25, 38]. To address this task, we replace the backbone of CSA-Net [38] with the pre-trained image encoder of FashionViL. Note that, unlike all the multimodal/cross-modal tasks above, only the pre-trained image encoder is used in this downstream task. We leverage this task to evaluate the performance of our image encoder under the proposed multimodal pre-training.

Table 2. Results of cross-modal retrieval on FashionGen [50] under the same protocol as KaleidoBERT [77]. -e2e: without end-to-end training, i.e., the image encoder is fixed. -pt: direct fine-tuning without multimodal pre-training

4.2 Comparative Results

Cross-modal Retrieval. We evaluate cross-modal retrieval on the FashionGen [50] test split (not included in pre-training), including both ITR and TIR. Table 2 compares the performance of previous V+L pre-training methods with our FashionViL. Because previous works [18, 77] are designed with a single-stream architecture, they can only be evaluated on a small retrieval set. For example, for TIR, the models are required to pick the best-matched image from only 101 images given a text queryFootnote 5. Recall (over 1K retrievals) is reported as the metric. The same setting is used for ITR. For a fair comparison, we strictly follow the same evaluation protocol, reporting the recall over 1K retrievalsFootnote 6.

In Table 2, we compare our FashionViL and its two variants with existing methods. In particular, -e2e and -pt denote our model without end-to-end training (image encoder fixed) and without multimodal pre-training, respectively. We make the following observations: (1) Even with the fixed image encoder and without pre-training, FashionViL already achieves results comparable to existing methods. This suggests that late fusion can be as effective as early fusion for such fine-grained cross-modal retrieval. (2) When we unfreeze the image encoder for end-to-end training, R@1 jumps from 21.13 to 58.84, suggesting that end-to-end training is highly effective and that redundant pre-processing may be unnecessary. (3) When we further apply our proposed multimodal pre-training, our model achieves SOTA performance, as shown in the last column of Table 2, with an R@1 more than twice that of the previous SOTA.

Table 3. Results of cross-modal retrieval on FashionGen [50] with full evaluation

Note that our model architecture for this task is two-stream, which means it can be applied to large-scale retrieval, unlike the compared baselines. Therefore, we additionally report evaluation results on the full test set (of 32K image-text pairs), i.e., each query item is compared with every gallery item in the full test set. The results can be found in Table 3. We encourage future work to follow this full evaluation protocol as well.

Table 4. Results of text-guided image retrieval on FashionIQ [65]
Table 5. Results of category/subcategory recognition on FashionGen [50]
Table 6. Results of outfit complementary item retrieval on PolyvoreOutfits [58]

Text-Guided Image Retrieval. For TGIR, we compare our FashionViL with previous V+L pre-training methods and task-specific methods on FashionIQ [65]Footnote 7. The results are shown in Table 4. For more comprehensive comparisons, we use the two different implementations adopted by previous methods, i.e., training with a fixed image encoder [40] or end-to-end training [7, 30, 60].

We first report the results with the fixed ResNet152 in Columns 1 to 4 (C1–C4). CIRR adopts OSCAR [35] as the fusion module and uses global image features as input. We find that FashionViL consistently outperforms CIRR with a relative 10%–20% gain, with or without multimodal pre-training (C1 vs. C3, C2 vs. C4). This improvement demonstrates that patch-level features are superior to global features for compositional multimodal fusion. With our proposed pre-training, the performance further improves from 31.78 to 34.19 (C3 vs. C4), showing that our pre-training also works well with an off-the-shelf fixed image encoder.

We then report results under the end-to-end training paradigm (C5–C10). We find that simply replacing the GRU with BERT (C5 vs. C8) already yields a clear gain (from 23.65 to 27.17), indicating the importance of a higher-quality text encoder. Additionally, all previous works apply late interaction between the image embeddings and modified-text embeddings with an elaborately designed fusion module, e.g., TIRG [60]. We argue that an earlier fusion of the two modalities should result in an even better compositional embedding for the query. Comparing C9 and C8, our FashionViL without pre-training already outperforms TIRG+BERT, indicating that better multimodal query embeddings are learned by our model. Note that our text encoder and fusion encoder are shared, so FashionViL also uses fewer training parameters than TIRG+BERT. With the help of pre-training, our FashionViL achieves the new SOTA result with another significant 11.2% relative gain (C9 vs. C10).

Category/Subcategory Recognition. Following KaleidoBERT [77], we evaluate CR and SCR on the FashionGen dataset [50]. The joint representation of the model architecture in Fig. 2(b) is used to predict the classification score. The results are shown in Table 5. Once again, the end-to-end learning and the well-designed fashion-specific pre-training tasks help our FashionViL outperform the two previous works by significant margins (10.4% and 3.2%, respectively). Furthermore, we also simulate a new task – multi-image subcategory recognition (M-SCR) to evaluate the performance of FashionViL with multiple input images. See more results in the supplementary file.

Table 7. Evaluation on pre-training tasks using ITR, TIR, TGIR, SCR and OCIR as downstream tasks. Each number is the mean value of all metrics for one specific downstream task. Meta-sum stands for the summation of all numbers in each row. The three shades of grey represent the top three results when sharing TE and FE

Outfit Complementary Item Retrieval. In addition to the aforementioned multimodal and instance-level downstream tasks, we also examine FashionViL on the unimodal, outfit-level task, i.e., OCIR. We compare our model with previous task-specific methods [25, 38] on the Disjoint split of Polyvore Outfits [58]Footnote 8. As shown in Table 6, our multimodal pre-training improves performance by 21.0%, even though only the image encoder is tuned.

4.3 Ablation Study

We analyze the effectiveness of the different pre-training tasks and the TE/FE sharing strategy through ablation studies over the aforementioned five downstream tasks. The complete results are listed in Table 7. In addition to the standard metrics for each benchmark, we use the Meta-sum (sum of all scores across all benchmarks) as a global metric.

First, we establish a baseline without any multimodal pre-training in Line 0 (L0), i.e., the image/text encoders are initialized with the off-the-shelf ResNet50 and BERT, which are pre-trained on vision-only and language-only data, respectively.

Second, we validate the effectiveness of each pre-training task by its standalone performance, i.e., each time we pick only one task for pre-training. The results for MPFC, MLM, PAC, MVC and ITC are shown in L2, L4, L5, L6 and L7. It is clear from Table 7 that all of these pre-training tasks benefit the downstream tasks. However, a pre-training task tends to be relatively more helpful to downstream tasks of a similar type. For example, both MPFC (L2) and MLM (L4) focus on modeling the cross-modal interaction; they thus bring more gain to SCR but contribute relatively less to ITR and TIR. In contrast, since ITC (L7) has the same objective as ITR and TIR, it significantly boosts the cross-modal retrieval performance. As for TGIR, which requires not only a high-quality compositional representation but also high-quality unimodal representations, each of the 5 pre-training tasks has a positive impact.

Third, we validate the effectiveness of the proposed PAC (L5) and MVC (L6). For PAC, we implement a comparative experiment: MLM applied only to the pre-defined pseudo-attribute words (L3). The main difference between L3 and L5 is whether the multi-label supervision is applied to each masked text token or to the global representation. L3 leads to much lower performance than L5, indicating that supervising the pseudo attributes on the global representation is the better choice. Interestingly, L3 achieves a result comparable to L4, where every word (including those other than the pseudo attributes) can be masked. This means that masking only the fine-grained words is as effective as masking all words uniformly, which indicates that the most important textual cues lie in those fine-grained concept words. We then verify the superiority of MVC. To this end, we add an ablation that does not utilize multi-angle images (L1), i.e., it replaces the sampled different-angle image with an augmented version of the original image. Comparing L1 and L6, we confirm that the improvement of MVC mainly comes from contrastive learning on images from different angles.

Fig. 4. t-SNE of the learned visual/textual/joint representations from FashionViL

Next, we study the effect of different combinations of these tasks. When we add MLM and MPFC to ITC (L8), we observe a gain in Meta-sum, while the performance of ITR and TIR slightly drops. This is expected, as different tasks may provide different update directions for the same parameters, causing some tasks to overshadow the effects of others. However, such minor conflicts between tasks can be largely alleviated by employing more tasks. As shown in L9, the overall performance can be further boosted by adding ITM, and the same happens when MVC is added (L10). When all six tasks are jointly trained (L11), we observe a significant performance gain across all benchmarks. Notably, the two new fashion-specific tasks, MVC and PAC, play the most important roles in achieving the SOTA performance.

Finally, we demonstrate the superiority of sharing TE and FE. We implement a comparative model (L12) with the same pre-training tasks as L11 but with separate TE and FE. We observe a clear performance drop when the parameter sharing is broken. This indicates that our modality-agnostic sharing strategy not only reduces the number of parameters but also performs far better.

4.4 Visualization

We visualize the representations from the image encoder, text encoder and fusion encoder via t-SNE [43] in Fig. 4. Specifically, we feed all image-text pairs from FashionGen's test split into our model and visualize the 10 most popular categories using different colors. We compare the t-SNE of the model without multimodal pre-training (initialized with ResNet+BERT) against the model trained with the full 6 pre-training tasks. We find that the clusters become more discriminative as more pre-training tasks are added, indicating that FashionViL learns more fine-grained concepts. See more in the supplementary file.
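A minimal sketch of such a visualization with scikit-learn's t-SNE (the parameters here are illustrative, not the exact settings used for Fig. 4):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, category_ids):
    """Project encoder outputs to 2-D with t-SNE and color points by (sub)category."""
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=category_ids, cmap="tab10", s=4)
    plt.axis("off")
    plt.show()
```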

5 Conclusions

We have introduced FashionViL, a novel end-to-end large-scale pre-training framework for V+L representation learning in the fashion domain. We proposed two effective fashion-specific pre-training tasks and introduced a novel modality-agnostic text/fusion encoder for a flexible and versatile multimodal architecture. Our FashionViL achieves new SOTA performance with superior efficiency on 5 popular fashion-related tasks.