
1 Introduction

With the development of convolutional neural networks (CNNs) and the publication of large-scale fashion datasets, significant progress has been made in fashion-related research, including fashion item recognition [1,2,3], fashion compatibility recommendation [4, 5], fashion attribute prediction [1, 6, 7] and fashion image retrieval [8,9,10]. Fashion category classification is a multi-class classification task, and fashion attribute prediction is a multi-label classification task; both generate helpful information about fashion items. Traditional classification methods usually use only the features learned from images as input and ignore the attribute information.

Figure 1 illustrates three fashion items in a fashion dataset. Each image belongs to a category and is associated with several attribute labels. The attribute labels of a fashion image hint at its category. For example, an item with the ‘strapless’ attribute is unlikely to be a pair of ‘Jeans’, while an item with the ‘mini’ attribute is more likely to be a ‘dress’. In addition, the attribute labels describing a fashion image are not entirely independent and exhibit specific correlations. For example, ‘denim’ and ‘crochet’ will not be used to describe the same piece of clothing, while ‘strapless’ and ‘mini’ can describe the same piece of clothing because they are independent of each other. Exploiting labels and the dependencies between them helps to understand fashion items more accurately. Our goal is to use images and a group of known attribute labels to build a multi-modal classification model.

Fig. 1.

Examples of fashion images and attributes. Image (a) and Image (b) share some attributes: mini, strapless and sweetheart, and they belong to the same category. Image (c) belongs to a separate category and has completely different attributes.

To support multi-modal interaction, we use two types of attention mechanisms to facilitate the interaction between visual and semantic information, i.e., an attribute-specific spatial attention module and an attribute-specific channel attention module. They enable the network to learn multi-modal features based on known attribute labels. In the model training phase, we model the labels by representing their state as positive, negative or unknown. Suppose we know some attribute states of image (a) and set ‘strapless’ to positive and ‘maxi’ to negative; the model can then predict with high confidence that the image belongs to the ‘upper body’ category and has the ‘mini’ attribute. We compare our model with several competing methods on public datasets, which demonstrates its superiority. The main contributions are as follows:

  • We propose a fashion classification model (M2Fashion) based on multi-modal features. It is an attribute-guided attention-based model, which extracts richer correlations between images and attributes to promote accurate fashion classification and attribute prediction. A channel attention module and a spatial attention module are integrated into the model to fuse data from the two modalities.

  • We adopt a multi-task learning framework that combines the category classification and attribute prediction tasks. Unlike other classification models, the attributes in our model are not treated as independent; their relationships are encoded in the attribute hierarchy.

  • Extensive experiments are carried out to compare the proposed model with several state-of-the-art models on public datasets. Experimental results show the superiority of the proposed model. In addition, M2Fashion is applied to an attribute-specific image retrieval task by removing the final classifier. This supplementary experiment also demonstrates the effectiveness of our model.

2 Related Work

Attribute Learning.

The existing attribute learning methods can be categorized into two groups: 1) visual feature-based methods [9, 10], which embed images into a common low-dimensional space and use the resulting feature vectors for attribute classification; and 2) visual-semantic feature-based methods [11,12,13], which learn joint representations by exploring the correlations between multi-modal content. Some of these methods use semantic information from attributes or annotated text to extract saliency or visual attention from the image. The above studies all learn visual/semantic features but ignore the relationships between attributes. Our work aims to mine the inner correlation of multiple attributes to learn fine-grained image representations.

Attention-Based Models.

In recent years, the attention mechanism has been widely used in computer vision and natural language processing. This technique has also been researched and applied in the fashion domain. Ji et al. [14] proposed a tag-based attention mechanism and a context-based attention mechanism to improve the performance of cross-domain retrieval of fashion images. Li et al. [15] proposed a joint attribute detection and visual attention framework for clothes image captioning. Ma et al. [16] proposed an attribute feature embedding network, which learns attribute-based embeddings in an end-to-end manner to measure the attribute-specific fine-grained similarity of fashion items. Inspired by the success of the attention mechanism, we propose to use two attribute-aware attention modules for fine-grained image classification tasks.

Multi-task Learning.

Since it was proposed, multi-task learning (MTL) has achieved many successes in several domains, such as image classification with landmark detection [17], attribute-enhanced recipe retrieval [18], and visual question answering [19]. To explore the intrinsic correlation of attributes to obtain more reliable prediction results, we are motivated to build a multi-task framework to model the correlation and common representation of categories and multiple attributes of fashion images.

3 Methodology

3.1 Problem Formulation

Given a set of fashion items denoted by \(D=\{({x}_{1},{A}_{1}),...,({x}_{n},{A}_{n})\}\), where \({x}_{i} (1\le i\le n)\) is the i-th image, \({x}_{i}\in {\mathbb{R}}^{c\times h\times w}\) (c, h, and w are the number of channels, the height, and the width, respectively), \({A}_{i}=[{a}_{i1},{a}_{i2},...,{a}_{iK}]\) is a multi-hot attribute vector which describes the image appearance with \(K\) semantic attributes, \({a}_{ij}\in \{-1,0,1\}\) \((1\le j\le K)\), and \(K\) is the total number of attributes. The attribute set is denoted as \(\mathcal{A}=\{{\mathcal{A}}_{1}, {\mathcal{A}}_{2},...,{\mathcal{A}}_{K}\}\). The goal of our model is to map the unimodal representations of images and attributes into a joint semantic space and learn a classifier \(f(\cdot )\) in that space so that \(y=f(x,A;\theta )\). In the category classification task, \(y\) denotes the predicted image category, and in the attribute prediction task, \(y\) denotes the predicted attribute labels.
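For illustration, the following is a minimal sketch (in PyTorch, which we use for implementation) of how one sample may be represented under this formulation; the shapes and indices are purely illustrative.

```python
import torch

# Hypothetical sketch of a single training sample, following the notation above.
K = 1000                           # number of attribute values (as in DeepFashion-C)
x_i = torch.randn(3, 224, 224)     # image tensor, c x h x w
A_i = torch.zeros(K)               # attribute state vector, entries a_ij in {-1, 0, 1}
A_i[3] = 1.0                       # attribute 3 is known to be present (positive)
A_i[17] = -1.0                     # attribute 17 is known to be absent (negative)
# remaining entries stay 0 (unknown) and are what the model is asked to predict
```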

Fig. 2.

The framework of our proposed model. It consists of four key components: an input representation module, an attribute-aware spatial attention module, an attribute-aware channel attention module, and a multi-task classifier.

3.2 Network Structure

Figure 2 illustrates the framework of our model. It consists of four key components: an input representation module, an attribute-aware spatial attention (ASA) module, an attribute-aware channel attention (ACA) module, and a multi-task classifier. For an input image, the image embedding vector is extracted using a ResNet pre-trained on ImageNet. Then, to learn the fine-grained features of the image, we use the image and multiple attributes jointly to learn the feature representation. We adopt the architecture of [16] with some modifications. In their work, the spatial and channel attention of images guided by attributes is generated by embedding a single attribute category such as ‘sleeve_length’. In contrast, our model combines images with attribute values such as ‘3/4 sleeve’ to generate attribute-guided attention. Our intuition is that images with the same attribute values have more similar features. These attribute-aware features are then fed into the attribute classifier.

Input Representation.

To represent an image, we use ResNet, a CNN model pre-trained on ImageNet, as the backbone network. To maintain the spatial information of the image, we remove the last fully connected (FC) layer of the CNN. Given an image \({x}_{i}\in {\mathbb{R}}^{h\times w\times 3}\), the feature extractor outputs a tensor \({\overline{x} }_{i}\in {\mathbb{R}}^{h\times w\times d}\), where \(h\times w\) is the size of the feature map and \(d\) is the number of channels. We represent the attribute label status as positive, negative, or unknown, encoded as 1, −1 and 0, respectively. For an image \({x}_{i}\), we collect its labels in \({A}_{i}\); the j-th element \({a}_{ij}\) of \({A}_{i}\) indicates the state of the j-th attribute for the i-th image. Attribute label embeddings \({\overline{A} }_{i}\) are learned from an embedding layer of size \(d\times K\).

$$ \overline{A}_{i} = f_{1} (A_{i} ) = \delta (W_{a1} A_{i} ), $$
(1)

where \({W}_{a1}\in {\mathbb{R}}^{d\times K}\) denotes the transformation matrix, and \(\delta \) denotes the tanh activation function. Note that we broadcast \({\overline{A} }_{i}\) along the height and width dimensions so that its shape is compatible with the image feature map \({\overline{x} }_{i}\).
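A minimal sketch of this attribute embedding step (Eq. (1)) is given below; the class and parameter names are illustrative and not the exact implementation.

```python
import torch
import torch.nn as nn

class AttributeEmbedding(nn.Module):
    """Sketch of Eq. (1): A_bar = tanh(W_a1 A), broadcast over the spatial grid."""
    def __init__(self, num_attrs: int, dim: int):
        super().__init__()
        self.W_a1 = nn.Linear(num_attrs, dim, bias=False)    # W_a1 in R^{d x K}

    def forward(self, A, h, w):
        A_bar = torch.tanh(self.W_a1(A))                      # (batch, d)
        # broadcast along height and width to match the image feature map
        return A_bar[:, :, None, None].expand(-1, -1, h, w)   # (batch, d, h, w)
```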

Attribute-Aware Spatial Attention.

An attribute is related to a specific visual region of the fashion image. For example, the attribute ‘3/4 sleeve’ usually appears on either side of the middle area of the image, so to learn attribute-specific features such as ‘sleeve length’, the regions around the sleeves should receive more attention. To calculate the attribute-specific spatial attention, instead of using a single attribute category to guide attention, we use multiple attribute values to generate the attribute embedding. These values are organized into a hierarchical structure, called the attribute hierarchy.

Specifically, for an image \({x}_{i}\) and its attribute labels \({A}_{i}\), we use I and \({T}_{1}\) to represent \({\overline{x} }_{i}\) and \({\overline{A} }_{i}\), respectively. First, we compute the attribute-guided spatial attention feature vector, denoted as \({V}_{s}\), as the weighted average of the input image features according to the attribute label embedding. For the image embedding I, we employ a convolutional layer with d \(1\times 1\) convolution kernels followed by a nonlinear tanh activation function to transform the dimension of the image features to d. The mapped image feature vector is expressed as

$$ f_{2} (I) = \delta (W_{v1} I), $$
(2)

where \({W}_{v1}\) denotes a convolutional layer containing d \(1\times 1\) convolution kernels, and \(\delta \) denotes the tanh activation function.

The mapped image feature is then fused with the attribute feature using the element-wise product followed by an activation function.

$$ f_{s} (I,T_{1} ) = \delta (W_{v2} (f_{2} (I) \odot T_{1} )), $$
(3)

where \(\odot\) denotes the element-wise product, \({W}_{v2}\) is a \(1\times 1\) convolutional layer, and \(\delta \) denotes the tanh activation function. The attention weights are obtained through the softmax function.

$$ \alpha_{l}^{s} = \frac{{\exp (f_{s} (I_{l} ,T_{1} ))}}{{\sum\nolimits_{j}^{h \times w} {\exp (f_{s} (I_{j} ,T_{1} ))} }}. $$
(4)

Then, the spatial attention feature vector under the attention of attribute \({A}_{i}\) can be obtained by the following calculation.

$$ V_{s} = \sum\nolimits_{l}^{h \times w} {\alpha_{l}^{s} I_{l} ,} $$
(5)

where \({\alpha }^{s}\in {\mathbb{R}}^{h\times w}\) is the attention weight map, \({\alpha }_{l}^{s}\) is its value at location l, and \({I}_{l}\) is the image feature at location l.
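The ASA module described by Eqs. (2)–(5) can be sketched roughly as follows; the layer names and the single-channel output of \(W_{v2}\) reflect our reading of the text, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAwareSpatialAttention(nn.Module):
    """Sketch of the ASA module (Eqs. (2)-(5)); names and shapes are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_v1 = nn.Conv2d(dim, dim, kernel_size=1)   # Eq. (2)
        self.W_v2 = nn.Conv2d(dim, 1, kernel_size=1)     # Eq. (3): one score per location

    def forward(self, I, T1):
        # I:  image feature map, (batch, d, h, w)
        # T1: attribute embedding broadcast to (batch, d, h, w)
        f2 = torch.tanh(self.W_v1(I))                    # Eq. (2)
        fs = torch.tanh(self.W_v2(f2 * T1))              # Eq. (3), (batch, 1, h, w)
        b, _, h, w = fs.shape
        alpha = F.softmax(fs.view(b, -1), dim=1)         # Eq. (4), softmax over h*w locations
        alpha = alpha.view(b, 1, h, w)
        V_s = (alpha * I).sum(dim=(2, 3))                # Eq. (5), weighted average, (batch, d)
        return V_s
```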

Attribute-Aware Channel Attention.

We adopt the attention mechanism of Ma et al. [16] with one modification. In their work, sum pooling is applied to the output of the ASEN module. In contrast, we adopt global max pooling on the feature map \({V}_{s}\) to concentrate only on discriminative areas. For the attribute \({A}_{i}\), we employ a separate attribute embedding layer to generate an embedding vector with the same dimension as \({V}_{s}\),

$$ \tilde{A}_{i} = f_{3} (A_{i} ) = \delta (W_{a2} A_{i} ), $$
(6)

where \({W}_{a2}\in {\mathbb{R}}^{c\times K}\) is the embedding parameter, and \(\delta \) is the tanh activation function. For ease of notation, we use \({T}_{2}\) to represent \({\stackrel{\sim }{\mathrm{A}}}_{i}\). The spatially attended features and the attribute embedding are fused by concatenation and then fed into two sequential FC layers to generate the attribute-aware channel-attended feature. The attention weight \({\alpha }^{c}\in {\mathbb{R}}^{c}\) is calculated by

$$ \alpha^{c} = \sigma (W_{c2} \sigma (W_{c1} [T_{2} ,V_{s} ])), $$
(7)

where \([\cdot ,\cdot ]\) represents the concatenation operation, \(\sigma \) represents the sigmoid activation function, and \({W}_{c1}\) and \({W}_{c2}\) are the parameters of the FC layers. For simplicity, the bias terms are omitted from the formula. The final output of ACA is obtained by the element-wise product of \({V}_{s}\) and the attention weight \({\alpha }^{c}\).

$$ V_{c} = \alpha^{c} \odot V_{s} . $$
(8)

Finally, we further employ an FC layer over \({V}_{c}\) to generate the attribute-guided feature of the given image with known image labels.

$$Z=W{V}_{c}+b,$$
(9)

where \(W\in {\mathbb{R}}^{c\times c}\) is the transformation matrix, and \(b\in {\mathbb{R}}^{c}\) is the bias.
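A sketch of the ACA module and the final projection (Eqs. (6)–(9)) is shown below; the hidden dimension and layer names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AttributeAwareChannelAttention(nn.Module):
    """Sketch of the ACA module (Eqs. (6)-(9)); dimensions and names are assumed."""
    def __init__(self, num_attrs: int, dim: int, hidden: int = 512):
        super().__init__()
        self.W_a2 = nn.Linear(num_attrs, dim, bias=False)  # Eq. (6): separate attribute embedding
        self.W_c1 = nn.Linear(2 * dim, hidden)             # Eq. (7): first FC on [T2, V_s]
        self.W_c2 = nn.Linear(hidden, dim)                 # Eq. (7): second FC
        self.W    = nn.Linear(dim, dim)                    # Eq. (9): final transformation

    def forward(self, A, V_s):
        T2 = torch.tanh(self.W_a2(A))                      # Eq. (6)
        alpha_c = torch.sigmoid(self.W_c2(torch.sigmoid(
            self.W_c1(torch.cat([T2, V_s], dim=1)))))      # Eq. (7)
        V_c = alpha_c * V_s                                # Eq. (8)
        return self.W(V_c)                                 # Eq. (9): Z = W V_c + b
```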

Multi-task Learning.

In this paper, the MTL framework is used to predict the categories and attributes of images. The two tasks, category classification and attribute prediction, share feature vectors, which helps to share knowledge and distinguish the subtle differences between the tasks. At the end of the network, we add two branches, one for predicting the categories of images and the other for predicting their attributes. The shared attribute-guided image feature is fed into each branch. We use the cross-entropy loss for category classification, denoted as

$$ L_{category} = - \frac{1}{N}\sum\nolimits_{i = 1}^{N} {\{ y_{i}^{c} \log (P(\hat{y}_{i}^{c} |Z_{i} )) + (1 - y_{i}^{c} )\log (1 - P(\hat{y}_{i}^{c} |Z_{i} ))\} } . $$
(10)

The output of the attribute prediction branch is passed through a sigmoid layer to squeeze it into [0, 1], yielding \({\widehat{a}}_{j}\). We use the binary cross-entropy loss for attribute prediction, denoted as

$$ L_{attribute} = - \sum\nolimits_{j = 1}^{K} {\{ a_{j} \log (p(\hat{a}_{j} |x_{i} )) + (1 - a_{j} )\log (1 - p(\hat{a}_{j} |x_{i} ))\} }, $$
(11)

where \({a}_{j}\) is the j-th ground-truth binary attribute label, \(p\left({\widehat{a}}_{j}|{x}_{i}\right)\) is a component of \(Y=\left[{y}_{1},\cdots ,{y}_{K}\right]\), and Y is the predicted attribute distribution.
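A minimal sketch of the two task branches and their losses (Eqs. (10)–(11)) follows; layer names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Two branches on top of the shared attribute-guided feature Z."""
    def __init__(self, dim: int, num_categories: int, num_attrs: int):
        super().__init__()
        self.category_fc = nn.Linear(dim, num_categories)   # multi-class branch
        self.attribute_fc = nn.Linear(dim, num_attrs)        # multi-label branch

    def forward(self, Z):
        return self.category_fc(Z), self.attribute_fc(Z)

def multitask_losses(cat_logits, attr_logits, cat_target, attr_target):
    # Eq. (10): cross-entropy over categories (cat_target holds class indices)
    L_category = F.cross_entropy(cat_logits, cat_target)
    # Eq. (11): binary cross-entropy per attribute (attr_target holds 0/1 labels)
    L_attribute = F.binary_cross_entropy_with_logits(attr_logits, attr_target.float())
    return L_category, L_attribute
```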

3.3 Label Mask Training

We adopt the label mask training strategy proposed in [20] to learn the correlation between labels and allow the model to perform multi-label classification given partial labels. During training, we randomly mask a certain number of labels and use the ground truth of the remaining labels to predict the masked ones. For \(K\) possible labels, we set a subset \({Y}_{u}\) as unknown for a particular sample, where \(|{Y}_{u}|\) is a random number between \(0.25K\) and \(K\). \({Y}_{u}\) is randomly sampled from all available labels \(Y\), and its state is set to unknown. The remaining labels, denoted \({Y}_{k}\), are known. The known labels are used as input to the model along with the image, and the model predicts the labels in the unknown state. Because different labels are masked at each iteration, the model learns the associations among different combinations of known labels. After label mask training is incorporated, Eq. (11) is modified as

$$ L_{attribute} = \sum\nolimits_{j = 1}^{K} {E\{ CE(Y_{u} ,\hat{Y}_{u} |Y_{k} )\} .} $$
(12)
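The masking step itself can be sketched as follows; the function name and interface are illustrative only.

```python
import torch

def mask_labels(A, min_frac: float = 0.25):
    """Sketch of the label-mask sampling described above (following [20]).
    A: (K,) vector with entries in {-1, 0, 1}; masked entries are set to 0 (unknown).
    The number of masked labels is drawn uniformly between 0.25*K and K."""
    K = A.numel()
    n_unknown = torch.randint(int(min_frac * K), K + 1, (1,)).item()
    unknown_idx = torch.randperm(K)[:n_unknown]
    A_known = A.clone()
    A_known[unknown_idx] = 0          # these entries become the prediction targets
    return A_known, unknown_idx
```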

3.4 Triplet Network Training

We use the triplet network shown in Fig. 3 to train our model, aiming to learn an effective embedding and similarity measure that minimizes the distance between the anchor and positive samples and maximizes the distance between the anchor and negative samples.

The construction process of the input data in the triplet network is as follows. Given an image triplet \(\left\{{x}^{a},{x}^{p},{x}^{n}\right\}\), \({x}^{a}\) is the anchor image, \({x}^{p}\) is the positive image, and \({x}^{n}\) is the negative image. The positive image shares at least one attribute with the anchor image, while the negative image shares no attributes with it. Let \(\left\{{Z}^{a},{Z}^{p},{Z}^{n}\right\}\) be the corresponding attribute-attended feature embedding triplet. The similarity is defined as the cosine similarity.

$$ sim(Z^{a} ,Z^{p} ) = \frac{{Z^{a} \cdot Z^{p} }}{{\left\| {Z^{a} } \right\|\left\| {Z^{p} } \right\|}},\quad sim(Z^{a} ,Z^{n} ) = \frac{{Z^{a} \cdot Z^{n} }}{{\left\| {Z^{a} } \right\|\left\| {Z^{n} } \right\|}}. $$
(13)

We force the similarity between the anchor and the positive samples to be greater than the similarity between the anchor and the negative samples, i.e., \(sim\left({Z}^{a},{Z}^{p}\right)>sim\left({Z}^{a},{Z}^{n}\right)\). Then we define a triplet ranking loss function based on hinge loss as

$$ L_{tri} = \max \{ 0,sim(Z^{a} ,Z^{n} ) - sim(Z^{a} ,Z^{p} ) + m\} , $$
(14)

where m represents the margin between two similarities. The total loss is defined as

$$ L_{total} = L_{category} + \lambda L_{attribute} + \gamma L_{tri} , $$
(15)

where \(\lambda \) and \(\gamma \) are parameters that balance the contribution of all losses.
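Eqs. (13)–(15) can be sketched as follows; the margin value used here is an assumption, since m is not specified above.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(Z_a, Z_p, Z_n, margin: float = 0.2):
    """Sketch of Eqs. (13)-(14): hinge loss on cosine similarities of the triplet."""
    sim_ap = F.cosine_similarity(Z_a, Z_p, dim=1)
    sim_an = F.cosine_similarity(Z_a, Z_n, dim=1)
    return torch.clamp(sim_an - sim_ap + margin, min=0).mean()

def total_loss(L_category, L_attribute, L_tri, lam: float = 1.0, gamma: float = 0.5):
    # Eq. (15), with the weights reported in the implementation details
    return L_category + lam * L_attribute + gamma * L_tri
```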

Fig. 3.

The triplet network structure used to train our model.

4 Experiments

4.1 Experiments Settings

Datasets.

We conduct our experiments on DeepFashion [1], a public large-scale clothes dataset. We choose its Category and Attribute Prediction benchmark (abbreviated as DeepFashion-C), which is the most suitable for our tasks. DeepFashion-C contains 289,222 clothes images in 46 categories and five attribute categories with 1,000 attribute values. Each image is annotated with only one category and several attributes. We adopt the same train-valid-test division as [1].

Metrics.

For image category classification, top-k accuracy is adopted as the evaluation metric. For image attribute prediction, we follow [1] and use the top-k recall rate as the evaluation metric.
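For clarity, a simplified per-image version of the top-k recall metric could be computed as below; note that [1] aggregates recall per attribute type, so this sketch only approximates that protocol.

```python
import torch

def topk_recall(attr_scores, attr_target, k: int = 3):
    """Per-image sketch of top-k recall: the fraction of ground-truth attributes
    (attr_target in {0, 1}, shape (batch, K)) ranked among the k highest scores."""
    attr_target = attr_target.float()
    topk = attr_scores.topk(k, dim=1).indices                    # (batch, k)
    hit = torch.zeros_like(attr_target).scatter_(1, topk, 1.0)   # 1 where predicted in top-k
    recalled = (hit * attr_target).sum(dim=1)
    return (recalled / attr_target.sum(dim=1).clamp(min=1)).mean().item()
```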

Implementation Details.

The proposed model is implemented in the PyTorch framework with an NVIDIA GeForce GTX 1080Ti GPU. We use the ResNet-50 network pre-trained on ImageNet for feature extraction. The images are resized to \(224\times 224\). We use a \(1\times 1\) convolutional layer to reduce the dimension of the feature vector to 512. The multi-hot vector of the attributes is transformed into a 512-dimensional vector by an embedding layer followed by the tanh activation function. Then the image and attribute features are combined through the element-wise product to obtain the spatial attention. In the ACA module, we use a separate attribute embedding layer. We use SGD to train the triplet network; the total number of epochs is set to 20. The learning rate is 1e-5 and decays at a rate of 0.95 every epoch. We empirically set \(\lambda \) to 1 and \(\gamma \) to 0.5 in Eq. (15).
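The optimizer and schedule described above can be set up roughly as follows; the model shown is only a placeholder for the full network of Sect. 3.2.

```python
import torch
import torch.nn as nn

# Sketch of the training setup (SGD, lr 1e-5, decay 0.95 per epoch, 20 epochs).
model = nn.Linear(512, 46)                    # placeholder for the full M2Fashion network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    # ... one pass over the triplet batches, backpropagating L_total (Eq. (15)) ...
    scheduler.step()
```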

Baselines.

We conduct comparative experiments with several baseline models. All models use the same triplet sampling method for fine-tuning, but the training methods differ. WTBI [21] first trains a generalized similarity model and then fine-tunes it for each type of clothing to obtain a class-independent model. DARN [8] constructs a tree structure over all attributes to form a semantic representation space of clothing images. FashionNet [1] extracts features and landmark locations from images and combines them during training to predict image categories and attributes. Corbiere [9] uses weak label information and images crawled from the Internet, predicting the probability of each word in the vocabulary via dot products. Attentive [3] uses a bidirectional convolutional recurrent neural network to improve classification through landmark-aware attention and category-driven attention. Upsampling [22] increases the resolution of the feature map through up-sampling and uses the predicted landmark locations as a reference to improve classification.

4.2 Experiment Results

We validate the performance of our model on the DeepFashion-C dataset, and Table 1 summarizes the performance of different methods in terms of top-k (k = 3, 5) accuracy for category classification and top-k recall rate for attribute prediction. Some clothing classification and attribute recognition results are shown in Fig. 4. The following observations can be made.

Table 1. Performance comparison of different models on DeepFashion-C dataset
Fig. 4.

Results of clothing category classification and attribute prediction on DeepFashion-C dataset. The correct predictions are marked in green, and the wrong predictions are marked in red.

Fig. 5.

Visualization of the attribute-aware spatial attention on DeepFashion-C.

  • Our model outperforms all competitors in the category classification task and the attribute prediction task. For the former, our model improves the top-3 accuracy rate by 1.3%. For the latter, our model also improves the recall rate.

  • We evaluate our model using only one attention module at a time, yielding two variants: M2Fashion w/o ASA and M2Fashion w/o ACA. The former employs global max pooling instead of the attribute-aware spatial attention module to generate features. The latter uses the vector \({V}_{s}\) directly as the attribute-guided feature vector. Removing either the ASA or the ACA module reduces performance on both subtasks, showing the effectiveness of both modules.

  • The classification task has a more significant impact on part-related attribute prediction (+5.6% top-3 recall rate) and shape-related attribute prediction (+3.7% top-3 recall rate) than on texture-related attribute prediction (+1.3% top-3 recall rate). It does not perform as well on style-related and fabric-related attribute prediction, because it is hard to focus attention on these attributes in the images. Clothing classification depends more on the shape characteristics of clothing, and in turn it promotes the understanding of shape-related attributes.

4.3 Attention Visualization

Visualizations of our attention mechanisms are shown in Fig. 5. We observe that the learned attention gives a higher response in the attribute-related areas, which shows that the attention helps find which areas are relevant to the given attribute. According to our observations, attributes related to ‘part’, such as ‘maxi’ in Fig. 5(a) and ‘sleeve’ in Fig. 5(e), are more likely to highlight local visual features, while the attention maps of attributes related to ‘material’ or ‘style’ focus on the entire garment.

4.4 Impact of Joint Learning and the Pooling Methods

The Impact of Joint Learning.

We explore the correlation between category classification and attribute prediction. As shown in Table 2 (top), joint learning of categories and attributes improves the accuracy of both tasks. After adopting the multi-task learning framework, the top-3 accuracy of the classification task increases by 4.1% and the top-5 accuracy by 3.0%; for the attribute prediction task, the top-3 recall rate increases by 11.7% and the top-5 recall rate by 12.2%.

The Impact of Global Max Pooling.

We use global max pooling instead of global average pooling to capture global context information, since global max pooling is sensitive to discriminative local features. Its effect is verified by ablation experiments, with results shown in Table 2 (bottom). Global max pooling improves the performance of both category classification and attribute prediction.

Table 2. Performance comparison of different learning methods and pooling methods.
Table 3. Performance comparison of attribute-specific fashion retrieval on DARN using MAP

4.5 A Case: Attribute-Specific Image Retrieval

The learned model can be applied to attribute-specific image retrieval tasks by removing the final classifier. For instance, given a query consisting of an image and two labels, ‘v-neckline’ and ‘floral’, the model returns the top-k similar images carrying these two labels.
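A sketch of this retrieval step is shown below, assuming the attribute-guided embeddings have already been computed for the query and the gallery images; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_Z, gallery_Z, k: int = 10):
    """Rank gallery items by cosine similarity to the query's attribute-guided
    embedding Z (classifier removed). Shapes: query_Z (dim,), gallery_Z (N, dim)."""
    sims = F.cosine_similarity(query_Z.unsqueeze(0), gallery_Z, dim=1)  # (N,)
    return sims.topk(k).indices   # indices of the k most similar gallery images
```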

We conduct the experiment on the DARN [8] dataset, which contains about 253,983 upper-clothing images annotated with a total of 9 attributes and 179 attribute values. We randomly divide the dataset into training, validation and test sets with an 8:1:1 split. Similar to [16], we use mean average precision (MAP) as the evaluation metric. The following baselines are considered for comparison: Triplet Network, Conditional Similarity Network (CSN) [23], and Attribute-Specific Embedding Network (ASEN) [16].

Table 3 shows the results of the attribute-specific image retrieval task on the DARN dataset. We can see that (1) the triplet network that learns a universal embedding space performs the worst; and (2) our proposed M2Fashion outperforms all other baselines. We attribute the better performance to the fact that M2Fashion uses multiple attribute labels and label masking to learn both the associations between labels and the attention between labels and images, whereas ASEN uses a single attribute category to guide attention.

5 Conclusions

In this paper, we explore fine-grained fashion image embedding to capture multi-modal content for fashion categorization. The proposed model adopts the visual-text attention mechanism to capture the association between different modal data and effectively uses any number of partial labels to perform multi-label and multi-class classification tasks. It also helps to discover how different attributes focus on specific areas of an image. In the future, we will study the impact of the hierarchical structure of attributes on the model and extend the model to hierarchical attribute prediction.