
1 Introduction and Related Work

Recommender systems (RSs) support users' decision-making by guiding them, in a personalized fashion, to a small subset of interesting products or services within massive corpora. In applications where visual factors are at play (e.g., fashion [22], food [14], or tourism [33]), customers' choices depend heavily on the visual appearance of products, which attracts attention, enhances emotions, and shapes their first impression. By incorporating this source of information when modeling users' preferences, visually-aware recommender systems (VRSs) have successfully extended the expressive power of pure collaborative recommender models [10, 12, 13, 17, 18].

Recommendation can hugely benefit from items' side information [4]. To date, several works have leveraged the high-level representational power of convolutional neural networks (CNNs) to extract item visual features, where the adopted CNN may be either pretrained on different datasets and tasks, e.g., [3, 11, 18, 26, 29], or trained end-to-end on the downstream recommendation task, e.g., [23, 38]. While the former family of VRSs builds upon a more convenient way of visually representing items (i.e., reusing the knowledge of pretrained models), such representations are not entirely in line with accurately estimating users' visual preferences. That is, CNN-extracted features cannot capture what each user enjoys about a product picture: a user might be attracted by the color and shape of a specific bag, but these aspects do not necessarily match what the pretrained CNN learned when classifying the product image as a bag.

Recently, there have been a few attempts to uncover users' personalized visual attitudes towards finer-grained item characteristics, e.g., [7,8,9, 21]. These solutions disentangle product images at (i) content level, by adopting item metadata and/or reviews [9, 31], (ii) region level, by pointing the user's interest towards parts of the image [8, 36] or video frames [7], and (iii) both content and region level [21]. It is worth mentioning that most of these approaches [7, 8, 21, 36] exploit attention mechanisms to weight the importance of the content or the region in driving the user's decisions.

Despite their superior performance, we recognize practical and conceptual limitations in adopting both content- and region-level item features, especially in the fashion domain. The former rely on additional side information (e.g., image tags or reviews), which is often hard to access and time-consuming to collect, while the latter ignore stylistic characteristics (e.g., color or texture) that can strongly influence the user's decision process [41].

Driven by these motivations, we propose a pipeline for visual recommendation that involves a set of visual features, i.e., the color, shape, and category of a fashion product, whose extraction is straightforward and always possible and which describe the items' content at a stylistic level. We use them as inputs to an attention- and neural-based visual recommender system, with the following purposes:

  • We disentangle the visual item representations at the stylistic content level (i.e., color, shape, and category) by making the attention mechanisms weight the importance of each feature on the user’s visual preference and making the neural architecture capture non-linearities in user/item interactions.

  • We reach a reasonable compromise between accuracy and beyond-accuracy performance, which we further justify through an ablation study to investigate the importance of attention (in all its configurations) on the recommendation performance. Notice that no ablation is performed on the content-style input features, as we learn to weight their contribution through the end-to-end attention network training procedure.

Fig. 1. Our proposed pipeline for visual recommendation, involving content-style item features, attention mechanisms, and a neural architecture.

2 Method

In the following, we present our visual recommendation pipeline (Fig. 1).

Preliminaries. We indicate with \(\mathcal {U}\) and \(\mathcal {I}\) the sets of users and items. Then, we adopt \(\mathbf {R}\) as the user/item interaction matrix, where \(r_{ui} \in \mathbf {R}\) is 1 for an interaction, 0 otherwise. As in latent factor models such as matrix factorization (MF) [25], we use \(\mathbf {p}_{u} \in \mathbb {R}^{1 \times h}\) and \(\mathbf {q}_{i} \in \mathbb {R}^{1 \times h}\) as user and item latent factors, respectively, where \(h \ll |\mathcal {U}|, |\mathcal {I}|\). Finally, we denote with \(\mathbf {f}_i \in \mathbb {R}^{1 \times v}\) the visual feature of item image i, usually the fully-connected layer activation of a pretrained convolutional neural network (CNN).

Content-Style Features. Let \(\mathcal {S}\) be the set of content-style features used to characterize item images. Although we adopt \(\mathcal {S} = \{\text {color}, \text {shape}, \text {category}\}\), for the sake of generality we indicate with \(\mathbf {f}_i^{s} \in \mathbb {R}^{1 \times v_s}\) the s-th content-style feature of item i. Since the features \(\mathbf {f}_i^{s}\) do not necessarily lie in the same latent space, we project them into a common latent space \(\mathbb {R}^{1 \times h}\), i.e., the same as that of \(\mathbf {p}_{u}\) and \(\mathbf {q}_{i}\). Thus, for each \(s \in \mathcal {S}\), we build an encoder function \(enc_s: \mathbb {R}^{1 \times v_s} \mapsto \mathbb {R}^{1 \times h}\), and encode the s-th content-style feature of item i as:

$$\begin{aligned} \mathbf {e}_i^s = enc_s(\mathbf {f}_i^s) \end{aligned}$$
(1)

where \(\mathbf {e}_i^s \in \mathbb {R}^{1 \times h}\), and \(enc_s\) is either trainable, e.g., a multi-layer perceptron (MLP), or handcrafted, e.g., principal-component analysis (PCA). In this work, we use an MLP-based encoder for the color feature, a CNN-based encoder for the shape, and PCA for the category.
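To make the encoding step concrete, the sketch below shows one possible PyTorch implementation of the three encoders described above. Layer widths, kernel sizes, the histogram length, and the use of ReLU activations are our illustrative assumptions, not the exact configuration used in our experiments.

```python
# A minimal sketch of the per-feature encoders enc_s (Eq. 1), assuming PyTorch.
# Layer sizes, the histogram length, and activations are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

h = 64  # common latent dimension, shared with p_u and q_i

# Trainable MLP encoder for the color feature (a flat vector, e.g., a color histogram).
color_encoder = nn.Sequential(
    nn.Linear(24, 128),  # 24 = assumed histogram length (8 bins x 3 channels)
    nn.ReLU(),
    nn.Linear(128, h),
)

# Trainable CNN encoder for the shape feature (assumed to be a single-channel texture/edge image).
shape_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, h),
)

# Handcrafted PCA encoder for the category feature (a pretrained-CNN activation);
# in practice it is fitted offline on the category features of the whole catalog.
category_encoder = PCA(n_components=h)

# Example usage on dummy inputs:
f_color = torch.rand(1, 24)         # f_i^color
f_shape = torch.rand(1, 1, 32, 32)  # f_i^shape
e_color = color_encoder(f_color)    # e_i^color in R^{1 x h}
e_shape = shape_encoder(f_shape)    # e_i^shape in R^{1 x h}
```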

Attention Network. We seek to produce recommendations conditioned on the visual preference of user u towards each content-style item characteristic. That is, the model should assign a different importance weight to each encoded feature \(\mathbf {e}_i^s\) based on the predicted user’s visual preference (\(\hat{r}_{u, i}\)). Inspired by previous works [7, 8, 21, 36], we use attention. Let \(ian(\cdot )\) be the function that aggregates the inputs to the attention network, \(\mathbf {p}_u\) and \(\mathbf {e}_i^s\), e.g., through element-wise multiplication. Given a user-item pair (ui), the network produces an attention weight vector \(\mathbf {a}_{u,i} = [a^{0}_{u, i}, a^{1}_{u, i}, \dots , a^{|\mathcal {S}| - 1}_{u, i}] \in \mathbb {R}^{1 \times |\mathcal {S}|}\), where \(a^{s}_{u, i}\) is calculated as:

$$\begin{aligned} a^{s}_{u, i} = \boldsymbol{\omega }_2(\boldsymbol{\omega }_1 \, ian(\mathbf {p}_u, \mathbf {e}^{s}_i) + \mathbf {b}_1) + \mathbf {b}_2 = \boldsymbol{\omega }_2(\boldsymbol{\omega }_1(\mathbf {p}_u \odot \mathbf {e}^{s}_i) + \mathbf {b}_1) + \mathbf {b}_2 \end{aligned}$$
(2)

where \(\odot \) is the Hadamard product (element-wise multiplication), while \(\boldsymbol{\omega }_{*}\) and \(\mathbf {b}_{*}\) are the weight matrices and biases of each attention layer, i.e., the network is implemented as a 2-layer MLP. Then, we normalize \(\mathbf {a}_{u, i}\) through the temperature-smoothed softmax function [20], so that \(\sum _s a_{u, i}^s = 1\), obtaining the normalized weight vector \(\boldsymbol{\alpha }_{u, i} = [\alpha ^{0}_{u, i}, \alpha ^{1}_{u, i}, \dots , \alpha ^{|\mathcal {S}| - 1}_{u, i}]\). We leverage the attention values to produce a unique, weighted stylistic representation for item i, conditioned on user u:

$$\begin{aligned} \mathbf {w}_i = \sum _{s \in \mathcal {S}} \alpha _{u, i}^s \mathbf {e}_i^s \end{aligned}$$
(3)

Finally, let \(oan(\cdot )\) be the function to aggregate the latent factor \(\mathbf {q}_i\) and the output of the attention network \(\mathbf {w}_i\) into a unique representation for item i, e.g., through addition. We calculate the final item representation \(\mathbf {q}'_i\) as:

$$\begin{aligned} \mathbf {q}'_i = oan(\mathbf {q}_i, \mathbf {w}_i) = \mathbf {q}_i + \mathbf {w}_i \end{aligned}$$
(4)
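As an illustration of Eqs. (2)-(4), the following sketch implements the attention network with the {Mult, Add} configuration adopted in this work; the attention hidden size and the temperature value are assumptions.

```python
# A minimal sketch of the attention network (Eqs. 2-4), assuming PyTorch.
# The attention hidden size and the softmax temperature are illustrative assumptions.
import torch
import torch.nn as nn

class StyleAttention(nn.Module):
    def __init__(self, h, attn_hidden=32, temperature=1.0):
        super().__init__()
        self.layer1 = nn.Linear(h, attn_hidden)  # omega_1, b_1
        self.layer2 = nn.Linear(attn_hidden, 1)  # omega_2, b_2
        self.temperature = temperature

    def forward(self, p_u, e_i, q_i):
        # p_u: (batch, h) user factors; e_i: (batch, |S|, h) encoded style features;
        # q_i: (batch, h) item factors.
        ian = p_u.unsqueeze(1) * e_i                        # Hadamard product (ian = Mult)
        a = self.layer2(self.layer1(ian))                   # unnormalized weights a_{u,i}^s (Eq. 2)
        alpha = torch.softmax(a / self.temperature, dim=1)  # temperature-smoothed softmax
        w_i = (alpha * e_i).sum(dim=1)                      # weighted style representation (Eq. 3)
        q_prime_i = q_i + w_i                               # final item representation (oan = Add, Eq. 4)
        return q_prime_i, alpha.squeeze(-1)
```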

Neural Inference. To capture non-linearities in user/item interactions, we adopt an MLP to run the prediction. Letting \(concat(\cdot )\) be the concatenation function and \(out(\cdot )\) a trainable MLP, we predict the rating \(\hat{r}_{u, i}\) for user u and item i as:

$$\begin{aligned} \hat{r}_{u, i} = out(concat(\mathbf {p}_u, \mathbf {q}'_i)) \end{aligned}$$
(5)
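One possible realization of Eq. (5) follows; the depth and width of \(out(\cdot )\) are assumptions.

```python
# A minimal sketch of the neural inference step (Eq. 5), assuming PyTorch.
# The depth and width of out(.) are illustrative assumptions.
import torch
import torch.nn as nn

h = 64
out_mlp = nn.Sequential(
    nn.Linear(2 * h, h),
    nn.ReLU(),
    nn.Linear(h, 1),
)

p_u = torch.rand(1, h)        # user latent factors
q_prime_i = torch.rand(1, h)  # attended item representation from Eq. 4
r_hat = out_mlp(torch.cat([p_u, q_prime_i], dim=-1))  # predicted preference score
```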

Objective Function and Training. We use Bayesian personalized ranking (BPR) [32]. Given a set \(\mathcal {T}\) of triples (user u, positive item p, negative item n), we seek to optimize the following objective function:

$$\begin{aligned} \max _{\boldsymbol{\Theta }} \sum _{(u, p, n) \in \mathcal {T}} \ln \sigma \left( \hat{r}_{u, p} - \hat{r}_{u, n} \right) - \lambda \Vert \boldsymbol{\Theta }\Vert ^2 \end{aligned}$$
(6)

where \(\boldsymbol{\Theta }\) and \(\lambda \) are the set of trainable weights and the regularization coefficient, respectively. We build \(\mathcal {T}\) from the training set by picking, for each randomly sampled (up) pair, a negative item n for u (i.e., an item u has not interacted with). Moreover, we adopt mini-batch Adam [24] as the optimization algorithm.
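The sketch below illustrates how the BPR objective in Eq. (6) can be minimized with mini-batch Adam; the model interface, the triple sampler, and the hyperparameter values are assumptions.

```python
# A minimal sketch of BPR training with mini-batch Adam (Eq. 6), assuming PyTorch.
# model(users, items), triple_loader, and all hyperparameter values are assumptions.
import torch
import torch.nn.functional as F

def bpr_loss(r_pos, r_neg, params, lam):
    # We minimize the negative objective: -ln sigma(r_up - r_un) + lambda * ||Theta||^2.
    ranking = -F.logsigmoid(r_pos - r_neg).mean()
    reg = lam * sum(p.pow(2).sum() for p in params)
    return ranking + reg

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for users, pos_items, neg_items in triple_loader:  # (u, p, n) triples sampled from T
#     r_pos = model(users, pos_items)
#     r_neg = model(users, neg_items)
#     loss = bpr_loss(r_pos, r_neg, list(model.parameters()), lam=1e-5)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```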

3 Experiments

Datasets. We use two popular categories from the Amazon dataset [17, 28], i.e., Boys & Girls and Men. After downloading the available item images, we filter out the items and the users with fewer than 5 interactions [17, 18]. Boys & Girls contains 1,425 users, 5,019 items, and 9,213 interactions (sparsity is 0.00129), while Men contains 16,278 users, 31,750 items, and 113,106 interactions (sparsity is 0.00022). In both cases, we have, on average, \(>6\) interactions per user.
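For reference, the reported sparsity and per-user interaction figures can be reproduced from the quoted statistics (sparsity = interactions / (users × items)), as in the following check.

```python
# Sanity-check arithmetic for the reported dataset statistics.
datasets = {
    "Boys & Girls": dict(users=1_425, items=5_019, interactions=9_213),
    "Men": dict(users=16_278, items=31_750, interactions=113_106),
}
for name, d in datasets.items():
    sparsity = d["interactions"] / (d["users"] * d["items"])
    per_user = d["interactions"] / d["users"]
    print(f"{name}: sparsity={sparsity:.5f}, interactions/user={per_user:.1f}")
# Boys & Girls: sparsity=0.00129, interactions/user=6.5
# Men: sparsity=0.00022, interactions/user=6.9
```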

Feature Extraction and Encoding. Since we address a fashion recommendation task, we extract color, shape/texture, and fashion category from item images [34, 41]. Unlike previous works, we leverage such features because they are easy to extract, always accessible, and represent the content of item images at a stylistic level. We extract the color information through the 8-bin RGB color histogram, the shape/texture as done in [34], and the fashion category from a pretrained ResNet50 [6, 11, 15, 37], where “category” refers to the classification task on which the CNN is pretrained. As for feature encoding, we use a trainable MLP and CNN for color (a vector) and shape (an image), respectively. Conversely, following [30], we adopt PCA to compress the fashion category feature, which also puts it on an equal footing with the color and shape features, which do not benefit from a pretrained feature extractor.
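As a hedged sketch of the extraction step, the snippet below computes a per-channel 8-bin RGB histogram and a ResNet50 activation; whether the 8 bins are taken per channel or jointly, the exact image preprocessing, and the PCA dimensionality are assumptions, and the shape/texture extraction of [34] is not reproduced here.

```python
# A minimal sketch of the color and category feature extractors, assuming torchvision.
# Binning scheme, preprocessing, and PCA size are assumptions; shape/texture follows [34] (omitted).
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

def color_histogram(img: Image.Image, bins: int = 8) -> np.ndarray:
    # 8-bin histogram per RGB channel, concatenated and L1-normalized (assumed variant).
    arr = np.asarray(img.convert("RGB"))
    hists = [np.histogram(arr[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    hist = np.concatenate(hists).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

# Category feature: penultimate-layer activation of an ImageNet-pretrained ResNet50.
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()  # drop the classification head to expose the 2048-d activation
resnet.eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = Image.new("RGB", (256, 256))                    # placeholder image
    category_feat = resnet(preprocess(img).unsqueeze(0))  # shape (1, 2048)

# The category feature is then compressed with PCA (Sect. 2), fitted on the whole catalog, e.g.:
# from sklearn.decomposition import PCA
# category_compressed = PCA(n_components=64).fit_transform(all_category_feats)
```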

Baselines. We compare our approach with pure collaborative and visual-based approaches, i.e., BPRMF [32] and NeuMF [19] for the former, and VBPR [18], DeepStyle [26], DVBPR [23], ACF [7], and VNPR [30] for the latter.

Evaluation and Reproducibility. For each user, we put the last interaction into the test set and the second-to-last into the validation set (i.e., temporal leave-one-out). Then, we measure model accuracy with the hit ratio (HR@k, the validation metric) and the normalized discounted cumulative gain (nDCG@k), as done in related works [7, 19, 39]. We also measure the fraction of the catalog covered by the recommendations (iCov@k), the expected free discovery (EFD@k) [35], and diversity with the 1’s complement of the Gini index (Gini@k) [16]. For the implementation, we use the Elliot framework [1, 2].
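Under leave-one-out evaluation, each user has a single relevant test item, so HR@k and nDCG@k reduce to the simple forms sketched below; the list-based interface is an assumption.

```python
# A minimal sketch of HR@k and nDCG@k under (temporal) leave-one-out evaluation,
# where each user has exactly one held-out relevant item.
import math

def hit_ratio_at_k(ranked_items, test_item, k):
    # 1 if the held-out item appears in the top-k list, 0 otherwise.
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, test_item, k):
    # With a single relevant item, nDCG@k reduces to 1 / log2(rank + 1).
    if test_item in ranked_items[:k]:
        rank = ranked_items.index(test_item) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Both metrics are averaged over all users' top-k recommendation lists.
```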

3.1 Results

What Are the Accuracy and Beyond-Accuracy Recommendation Performances? Table 1 reports the accuracy and beyond-accuracy metrics on top-20 recommendation lists. On Amazon Boys & Girls, our solution and DeepStyle are the best and second-best models on accuracy and beyond-accuracy measures, respectively (e.g., 0.03860 vs. 0.03719 for the HR). In addition, our approach outperforms all the other baselines on novelty and diversity, covering a broader fraction of the catalog (e.g., \(iCov \simeq 90\%\)). As for Amazon Men, the proposed approach is still consistently the most accurate model, even beating BPRMF, whose accuracy is superior to that of all the other visual baselines. Considering that BPRMF covers only 0.6% of the item catalog, its superior accuracy evidently comes from recommending the most popular items [5, 27, 40]. Given that, we maintain that our solution remains competitive: it is the best on accuracy while also covering about 29% of the item catalog and supporting the discovery of new products (e.g., \(EFD = 0.01242\) is the second-best value). That is, the proposed method shows a competitive performance trade-off between accuracy and beyond-accuracy metrics.

Table 1. Accuracy and beyond-accuracy metrics on top-20 recommendation lists.
Table 2. Ablation study on different configurations of attention, ian, and oan.

How Is Performance Affected by Different Configurations of Attention, ian, and oan? Following [8, 21], we explore three aggregations for the inputs of the attention network (ian), i.e., element-wise multiplication, element-wise addition, and concatenation, and two aggregations for the output of the attention network (oan), i.e., element-wise addition and multiplication. Table 2 reports the HR, i.e., the validation metric, and the iCov, i.e., a beyond-accuracy metric. No ablation study is run on the content-style features, as their relative influence on recommendation is learned during training. First, we observe that attention mechanisms, i.e., all rows but No Attention, lead to better-tailored recommendations. Second, although the {Concat, Add} choice reaches the highest accuracy on Men, the {Mult, Add} combination we used in this work is the most competitive on both accuracy and beyond-accuracy metrics.
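For clarity, the aggregation variants compared in the ablation can be sketched as follows; the linear projection used to bring the concatenated input back to dimension h is an assumption.

```python
# A minimal sketch of the ian/oan aggregation variants in the ablation, assuming PyTorch.
# The linear projection for the concatenation variant is an illustrative assumption.
import torch
import torch.nn as nn

def ian(p_u, e_s, mode="mult", proj: nn.Linear = None):
    # Aggregates user factors p_u and one encoded style feature e_s before the attention MLP.
    if mode == "mult":
        return p_u * e_s
    if mode == "add":
        return p_u + e_s
    if mode == "concat":
        return proj(torch.cat([p_u, e_s], dim=-1))  # e.g., proj = nn.Linear(2 * h, h)
    raise ValueError(mode)

def oan(q_i, w_i, mode="add"):
    # Aggregates item factors q_i and the attended style representation w_i.
    return q_i + w_i if mode == "add" else q_i * w_i
```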

4 Conclusion and Future Work

Unlike previous works, we argue that in visual recommendation scenarios (e.g., fashion), items should be represented by easy-to-extract and always accessible visual characteristics that describe their content from a stylistic perspective (e.g., color and shape). In this work, we disentangled these features via attention to assign personalized, user-specific importance weights to each content-style feature. Results confirmed that our solution reaches a competitive trade-off between accuracy and beyond-accuracy performance against other baselines, and an ablation study justified the adopted architectural choices. We plan to extend the content-style features to other visual recommendation domains, such as food and social media. Another area where item content visual features can be beneficial is improving accessibility to extremely long-tail items (distant tails), for which traditional CF or hybrid approaches are not helpful due to the scarcity of interaction data.