1 Motivation

Multimodal retrieval is an important but understudied problem in e-commerce [48]. Even though e-commerce products are associated with rich multimodal information, current research focuses mainly on textual and behavioral signals to support product search and recommendation [1, 15, 42]. The majority of prior work on multimodal retrieval for e-commerce targets the fashion domain, e.g., recommendation of fashion items [34] and cross-modal fashion retrieval [13, 25]. In the more general e-commerce domain, multimodal retrieval remains underexplored [17, 31]. Motivated by this knowledge gap, we lay out two directions for the research agenda: category-to-image retrieval and fine-grained text-image retrieval (Fig. 1).

Fig. 1. Dissertation overview.

Category-to-Image Retrieval. First, we focus on category information in e-commerce. Product category trees are a key component of modern e-commerce as they assist customers when navigating across large product catalogues [16, 24, 46, 50]. Yet, retrieving an image for a given product category remains challenging, mainly due to noisy category and product data as well as the size and dynamic character of product catalogues [28, 48]. Motivated by this challenge, we introduce the task of retrieving a ranked list of relevant images of products that belong to a given category, which we call the category-to-image (CtI) retrieval task. Unlike image classification tasks, which operate on a predefined set of classes, the CtI retrieval task requires a model not only to identify which images belong to a given category but also to generalize to unseen categories. Use cases that motivate the CtI retrieval task include (1) showcasing different categories in search and recommendation results [24, 46, 48]; (2) inferring product categories when product categorical data is unavailable, noisy, or incomplete [52]; and (3) designing cross-categorical promotions and product category landing pages [39].

Fine-Grained Text-Image Retrieval. Second, we address the problem of fine-grained text-image retrieval. Text-image retrieval is the task of finding similar items across textual and visual modalities. Successful performance on the task depends on the domain. In the general domain, where images typically depict complex scenes of objects in their natural contexts, information across modalities is matched coarsely; examples of such datasets include MS COCO [33] and Flickr30k [53]. By contrast, in the e-commerce domain, where there is typically one object per image, fine-grained matching is more important. Therefore, we focus on fine-grained text-image retrieval. We define the task as a combination of two subtasks: (1) text-to-image retrieval: given a noun phrase that describes an object, retrieve the image that depicts the object; and (2) image-to-text retrieval: given an image of an object, retrieve the noun phrase that describes the object.

We start off by examining the topic in the context of reproducibility. Reproducibility is one of the major pillars of the scientific method and is of utmost importance for Information Retrieval (IR) as a discipline rooted in experimentation [10]. One of the first works to touch upon reproducibility in IR is the study by Armstrong et al. [2], in which the authors conducted a longitudinal analysis of papers published in the proceedings of CIKM and SIGIR between 1998 and 2008 and discovered that ad-hoc retrieval was not measurably improving. Later on, Yang et al. [51] provided a meta-analysis of results reported on TREC Robust04 and found that some of the more recent neural models were outperformed by strong baselines. Similar discoveries were made in recommender systems research [5, 6]. Motivated by these findings, we explore the reproducibility of fine-grained text-image retrieval results. More specifically, we examine how SOTA models for fine-grained text-image fashion retrieval generalize to other categories of e-commerce products. After analyzing the SOTA models in the domain, we plan to improve upon them in subsequent work.

2 Related Work

Category-to-Image Retrieval. Early work in image retrieval grouped images into a restricted set of semantic categories and allowed users to retrieve images by using category labels as queries [44]. Later work allowed for a wider variety of queries, ranging from natural language [20, 49], to attributes [37], to combinations of multiple modalities (e.g., title, description, and tags) [47]. Across these multimodal image retrieval approaches we find three common components: (1) an image encoder, (2) a query encoder, and (3) a similarity function to match the query to images [14, 40]. Depending on the focus of the work, some components may be pre-trained, whereas others are optimized for a specific task. In our work, we rely on pre-trained image and text encoders but learn a new multimodal composition of the query to perform CtI retrieval.

Fine-Grained Text-Image Retrieval. Early approaches to cross-modal mapping focused on correlation maximization through canonical correlation analysis [18, 19, 45]. Later approaches centered around convolutional and recurrent neural networks [11, 22, 23, 29], which were further expanded by adding attention on top of the encoders [29, 35, 38]. More recently, inspired by the success of transformers [8], a line of work emerged that aims to create a universal vision-language encoder [4, 30, 32, 36]. To address the problem of attribute granularity in the context of cross-modal retrieval, prior work proposed to segment images into fragments [27], use attention mechanisms [26], combine image features across multiple levels [13], and use pre-trained BERT as a backbone [12, 54]. Unlike prior work in this area, which focused on fashion, we focus on the general e-commerce domain.

3 Research Description and Methodology

The dissertation comprises two parts. Below, we describe each part of the thesis and elaborate on the methodology.

Category-to-Image Retrieval. Product categories are used in various contexts in e-commerce. However, in practice, during a user’s session there is often a mismatch between the textual and the visual representation of a given category. Motivated by this problem, we introduce the task of category-to-image retrieval in e-commerce and propose a model for the task.

We use the XMarket dataset recently introduced by Bonab et al. [3], which contains textual, visual, and attribute information of e-commerce products as well as a category tree. Following [7, 21, 43], we use BM25, MPNet, and CLIP as our baselines. To evaluate model performance, we use Precision@K where \(K = \{1, 5, 10 \}\), mAP@K where \(K = \{ 5, 10 \}\), and R-precision.
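To make the evaluation protocol concrete, below is a minimal sketch of the three metrics for a single category query. Normalizing AP@K by the smaller of K and the number of relevant products is a common convention and an assumption on our part, not necessarily the exact implementation we use.

```python
# Minimal sketch of the evaluation metrics for one category query.
# `retrieved` is a ranked list of product ids; `relevant` is the set of
# products whose category matches the query. mAP@K is the mean of AP@K
# over all category queries.

def precision_at_k(retrieved, relevant, k):
    return sum(1 for pid in retrieved[:k] if pid in relevant) / k

def average_precision_at_k(retrieved, relevant, k):
    hits, score = 0, 0.0
    for rank, pid in enumerate(retrieved[:k], start=1):
        if pid in relevant:
            hits += 1
            score += hits / rank
    # Normalizing by min(|relevant|, k) is an assumption (a common convention).
    return score / min(len(relevant), k) if relevant else 0.0

def r_precision(retrieved, relevant):
    return precision_at_k(retrieved, relevant, len(relevant)) if relevant else 0.0
```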

RQ1.1 How do baseline models perform on the CtI retrieval task? Specifically, how do unimodal and bi-modal baseline models perform? How does the performance differ w.r.t. category granularity?

To answer the question, we feed BM25 with corpora that contain textual product information, i.e., product titles. We use MPNet in a zero-shot manner. For all the products in the dataset, we pass the product title through the model. During the evaluation, we pass a category expressed as a textual query through MPNet and retrieve the top-k candidates ranked by cosine similarity w.r.t. the target category. We compare the categories of the top-k retrieved candidates with the target category. In addition, we use pre-trained CLIP in a zero-shot manner in a configuration with a text transformer and a vision transformer (ViT) [9]. We pass the product image through the image encoder. For evaluation, we pass a category through the text encoder and retrieve the top-k image candidates ranked by cosine similarity w.r.t. the target category. We compare the categories of the top-k retrieved image candidates with the target category.
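As an illustration of the zero-shot CLIP baseline, the sketch below ranks product images by cosine similarity to a category query; the checkpoint name and pre-processing are assumptions, not necessarily the exact setup used in our experiments.

```python
# Hedged sketch of zero-shot category-to-image retrieval with CLIP.
# The checkpoint is an assumption; `product_images` is a list of PIL images.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images_by_category(category, product_images, top_k=10):
    inputs = processor(text=[category], images=product_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    text_emb = F.normalize(outputs.text_embeds, dim=-1)    # (1, d) category embedding
    image_emb = F.normalize(outputs.image_embeds, dim=-1)  # (N, d) image embeddings
    scores = (image_emb @ text_emb.T).squeeze(-1)          # cosine similarities
    k = min(top_k, scores.numel())
    return scores.topk(k).indices.tolist()                 # indices of top-k images
```

The MPNet baseline follows the same pattern, with product title embeddings in place of image embeddings.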

RQ1.2 How does CLIP-I, a model that uses product image information for building product representations, perform on the CtI retrieval task?

To answer the question, we build product representations by training on e-commerce data and investigate how using product image data for building product representations impacts performance on the CtI retrieval task. To introduce visual information, we extend CLIP in two ways: (1) we use the ViT from CLIP as an image encoder and add a product projection head that takes product visual information as input; (2) we use the text encoder from MPNet as the category encoder and add a category projection head on top of it. We name the resulting model CLIP-I. We train CLIP-I on category-product pairs from the training set and use only visual information for building product representations.
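A hedged sketch of how CLIP-I could be assembled is shown below; the projection head design, embedding dimensions, and the symmetric in-batch contrastive loss are illustrative assumptions, not the exact implementation.

```python
# Hedged sketch of CLIP-I (dimensions, projection heads, and the loss are
# assumptions). The image encoder is, e.g., the ViT from CLIP; the category
# encoder is, e.g., MPNet; both map their inputs to fixed-size vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim),
                                 nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm joint-space vectors

class CLIPI(nn.Module):
    def __init__(self, image_encoder, category_encoder,
                 image_dim=512, text_dim=768, joint_dim=512):
        super().__init__()
        self.image_encoder = image_encoder        # maps images to image_dim vectors
        self.category_encoder = category_encoder  # maps category text to text_dim vectors
        self.product_head = ProjectionHead(image_dim, joint_dim)
        self.category_head = ProjectionHead(text_dim, joint_dim)

    def forward(self, images, category_tokens):
        product_repr = self.product_head(self.image_encoder(images))
        category_repr = self.category_head(self.category_encoder(category_tokens))
        return product_repr, category_repr

def contrastive_loss(product_repr, category_repr, temperature=0.07):
    """Symmetric in-batch InfoNCE over category-product pairs (assumption)."""
    logits = product_repr @ category_repr.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```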

RQ1.3 How does CLIP-IA, which extends CLIP-I with product attribute information, perform on the CtI retrieval task?

To answer the question, we extend CLIP-I by introducing attribute information into the product information encoding pipeline. We add an attribute encoder through which we obtain a representation of product attributes. We concatenate the resulting attribute representation with the image representation and pass the resulting vector to the product projection head. Thus, the resulting product representation \(\mathbf {p}\) is based on both visual and attribute product information. We name the resulting model CLIP-IA. We train CLIP-IA on category-product pairs and use both visual and attribute information for building product representations.
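The sketch below illustrates the image-attribute fusion in CLIP-IA, reusing the ProjectionHead from the CLIP-I sketch above; the attribute encoder and all dimensions are illustrative assumptions.

```python
# Illustrative sketch of CLIP-IA: the attribute representation is concatenated
# with the image representation before the product projection head.
# Reuses ProjectionHead from the CLIP-I sketch above.
import torch
import torch.nn as nn

class CLIPIA(nn.Module):
    def __init__(self, image_encoder, attribute_encoder, category_encoder,
                 image_dim=512, attr_dim=768, text_dim=768, joint_dim=512):
        super().__init__()
        self.image_encoder = image_encoder          # e.g., ViT from CLIP
        self.attribute_encoder = attribute_encoder  # maps attributes to attr_dim vectors
        self.category_encoder = category_encoder    # e.g., MPNet
        self.product_head = ProjectionHead(image_dim + attr_dim, joint_dim)
        self.category_head = ProjectionHead(text_dim, joint_dim)

    def forward(self, images, attribute_tokens, category_tokens):
        image_repr = self.image_encoder(images)
        attr_repr = self.attribute_encoder(attribute_tokens)
        # Concatenate image and attribute representations before projecting.
        product_repr = self.product_head(torch.cat([image_repr, attr_repr], dim=-1))
        category_repr = self.category_head(self.category_encoder(category_tokens))
        return product_repr, category_repr
```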

RQ1.4 Finally, how does CLIP-ITA, which extends CLIP-IA with product text information, perform on the CtI retrieval task?

To answer the question, we investigate how extending the product information processing pipeline with the textual modality impacts performance on the CtI retrieval task. We add a title encoder to the product information processing pipeline and use it to obtain a title representation. We concatenate the resulting representation with the product image and attribute representations and pass the resulting vector to the product projection head (see the sketch after this paragraph). The resulting model is CLIP-ITA. We train and test CLIP-ITA on category-product pairs, using visual, attribute, and textual information for building product representations. The results are to be published at ECIR’22 [16]. The follow-up work is planned to be published at SIGIR 2023.
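Relative to the CLIP-IA sketch above, only the product-side fusion changes; a minimal illustration (dimensions assumed) follows.

```python
# Sketch of the CLIP-ITA product-side fusion only: title, attribute, and image
# representations are concatenated before the product projection head. The
# category side and training objective mirror the CLIP-I/CLIP-IA sketches above.
import torch
import torch.nn as nn

class ProductFusionITA(nn.Module):
    def __init__(self, image_dim=512, attr_dim=768, title_dim=768, joint_dim=512):
        super().__init__()
        self.product_head = ProjectionHead(image_dim + attr_dim + title_dim, joint_dim)

    def forward(self, image_repr, attr_repr, title_repr):
        return self.product_head(torch.cat([image_repr, attr_repr, title_repr], dim=-1))
```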

Fine-Grained Text-Image Retrieval. The ongoing work focuses on fine-grained text-image retrieval in the context of reproducibility. For the experiments, we select two SOTA models for fine-grained cross-modal fashion retrieval, each with a distinctive architecture: one is Transformer-based, while the other is CNN-RNN-based. The Transformer-based model is Kaleido-BERT [54], which extends BERT [8]. The other model is the Multi-level Feature approach (MLF) [13]. Both models claim to deliver SOTA performance by learning image representations that better capture fine-grained attributes. They were evaluated on the Fashion-Gen dataset [41] but, to the best of our knowledge, have not been compared against each other.

In the work, we aim to answer the following research questions:

RQ2.1 How well do Kaleido-BERT and MLF perform on data from an e-commerce category that is different from Fashion?

RQ2.2 How well do both models generalize beyond the e-commerce domain? More specifically, how do they perform on object-centric data from the general domain?

RQ2.3 How do Kaleido-BERT and MLF compare to each other w.r.t. performance?

The results are planned to be published as a paper at SIGIR 2022. The follow-up work is planned to be published at ECIR 2023.