Abstract
How similar are two items of fashion clothing? Fashion apparel exhibits diverse visual concepts through its designs, styles, and brands. Hence, there exists a hierarchy of similarities between clothing items, ranging from the exact instance or brand down to shared attributes and styles. An effective search method should therefore be able to represent these tiers of similarity. In this paper, we present a deep learning based fashion search framework for learning the tiers of similarity. We propose a new attribute-guided metric learning (AGML) approach with multitask CNNs that jointly learns fashion attributes and image embeddings while taking category and brand information into account. The two tasks in the framework are linked by a guiding signal. The guiding signal first helps in mining informative training samples; second, it weights training samples by their importance to capture the tiers of similarity. We conduct experiments on a new BrandFashion dataset which is richly annotated at different granularities. Experimental results demonstrate that the proposed method is very effective in capturing a tiered similarity search space and outperforms state-of-the-art fashion search methods.
1 Introduction
Fashion accounts for a significant portion of rapidly growing online shopping and social media [2, 9]. With such growth, visual fashion analysis has received substantial research attention [1, 15, 16, 19, 24] and has been successfully deployed by large e-commerce companies and websites such as eBay, Amazon, Pinterest, and Flipkart [13, 23, 29]. One of the most important aspects of visual fashion analysis is fashion search. This paper presents a new deep learning based fashion search framework built on a fusion of multitask and metric learning.
With the recent advances in deep learning, end-to-end metric learning methods for visual similarity measurement have been proposed [3, 22]. The main task here is to learn a discriminative feature space for image representation. For fashion search in particular, the feature space should incorporate various elements of fashion. As the fashion domain exhibits a huge diversity of visual concepts through designs, styles, and brands, there exist tiers of similarity for fashion clothing. Visual fashion similarity can be defined based on various concepts such as categories (e.g. dress, hoodie), brands (e.g. Adidas, Nike), attributes (e.g. color, pattern) or design (e.g. cropped, zippered). Figure 1 illustrates the tiers of similarity for clothing images. Clothing (A) and the reference clothing (R) are the exact same item (brand, model, category, color, etc.), and hence (A) lies closest, within the inner circle. Clothing (B) shares the same model as (R) but differs in color, and hence is second nearest to (R). Similarly, clothing items (C), (D), and (E) lie progressively farther away. We aim to learn such a tiered feature space, as this provides the desired retrieval outcome for practical fashion search applications.
Deep metric learning has demonstrated huge success in learning visual similarity [3, 8, 11, 22, 28]. Siamese networks [5, 8, 28] and triplet networks [22, 27] are the most popular models for metric learning, with the latter reported to perform better [10, 22]. Although successful, the existing triplet based methods [3, 10, 22] have a few limitations. First, they require exact instance/ID level annotations and do not perform well with weak label annotations, e.g. category labels (shown in Sect. 3). Second, these methods make hard binary decisions during triplet selection and treat the selected triplets with equal importance, which restricts their ability to learn tiers of similarity.
To learn a discriminative feature space, researchers have combined metric learning with auxiliary information using multitask networks, which have achieved better performance for face identification and recognition [6, 22, 30], person re-identification [14, 18], and clothing search [12]. For fashion representation in particular, multitask learning with attribute information is used in [12, 24]. Where-To-Buy-It (WTBI) [15] used pre-trained features and learned a similarity network using Siamese networks. Recently, FashionNet [19] proposed to jointly optimize classification, attribute prediction, and triplet losses for fashion search. However, these methods do not explore the possible interaction between the tasks and hence do not effectively learn the tiered similarity space required for fashion search.
In view of this, we propose a new attribute-guided metric learning (AGML) framework using multitask learning for fashion search. The proposed framework exploits the interaction between the attribute prediction and triplet networks by training them jointly. This has two major advantages over existing methods. First, it helps in mining informative triplets, especially when exact anchor-positive pair annotations are not available. Second, training samples are treated according to their importance in a soft manner, which helps in capturing the multiple tiers of similarity required for fashion search. We demonstrate the framework's effectiveness for fashion search using a new BrandFashion dataset. Compared to the existing fashion datasets [4, 7, 19], this dataset is richly annotated with essential elements of fashion, including clothing categories, attributes, and brand information, which capture different tiers of information in fashion.
2 Proposed Method
The architecture of the proposed framework is shown in Fig. 2. It consists of three identical CNNs with shared parameters \(\theta \) and accepts image triplets \(\left\{ x^a,x^p,x^n \right\} \), i.e. an anchor image \((x^a)\), a positive image \((x^p)\) from the same class as the anchor, and a negative image \((x^n)\) from a different class. The last fully connected layer has two branches for learning the feature embedding f(x) and the attribute vector \( \mathbf {v} \). The guiding signal links the two tasks and drives triplet sampling based on the importance of the samples. The network is trained end-to-end using the loss

\(L(\theta ) = L^{G}_{tri}(\theta ) + \lambda \, L_{attr}(\theta ),\)   (1)

where \(L^{G}_{tri}(\theta )\) and \(L_{attr}(\theta )\) represent the attribute-guided triplet loss and attribute loss respectively, and \(\lambda \) balances the contribution of the two losses.
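For concreteness, the following is a minimal PyTorch sketch of the two-branch multitask network. The VGG16 backbone and the branching from the last fully connected layer follow the description above; the 512-dimensional embedding size and the exact layer wiring are illustrative assumptions, not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class AGMLNet(nn.Module):
    """Shared-parameter CNN with two branches: embedding f(x) and attributes v."""
    def __init__(self, num_attrs: int = 32, embed_dim: int = 512):
        super().__init__()
        base = vgg16(pretrained=True)
        self.features = base.features
        # Keep VGG16's fully connected layers up to fc7 (4096-d output).
        self.fc = nn.Sequential(*list(base.classifier.children())[:-1])
        self.embed_head = nn.Linear(4096, embed_dim)  # feature embedding branch
        self.attr_head = nn.Linear(4096, num_attrs)   # attribute prediction branch

    def forward(self, x):
        h = self.fc(torch.flatten(self.features(x), 1))
        f = F.normalize(self.embed_head(h), p=2, dim=1)  # L2-normalized embedding f(x)
        v = torch.sigmoid(self.attr_head(h))             # attribute vector v in [0, 1]^K
        return f, v
```

A triplet is processed by passing each of its three images through this same network, realizing the three identical CNNs with shared \(\theta \).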
2.1 Attribute Prediction Network
We use K semantic attributes to describe the image appearance, denoted \( \mathbf {a} = [ a_1, a_2, \dots , a_K ]\), where each element \(a_i \in \left\{ 0,1 \right\} \) indicates the presence or absence of the \(i^{th}\) attribute. The attribute prediction problem is treated as multilabel classification. The first branch of the last fully connected layer is passed through a sigmoid layer to squash the output to [0, 1], yielding \( \mathbf {v} \). The attribute prediction is optimized using the binary cross-entropy loss \(L_{attr}(\theta ) = -\sum _{i=1}^K \left[ a_i \log (v_i) + \, (1-a_i) \log (1- v_i) \right] \), where \(a_i\) is the binary target label of the \(i^{th}\) attribute for image x, and \(v_i\) is a component of the predicted attribute distribution \( \mathbf {v} = [v_1, v_2, \dots , v_K]\).
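A minimal sketch of this loss in PyTorch follows; the tensor shapes and batch averaging are our assumptions for a batched implementation.

```python
import torch
import torch.nn.functional as F

def attribute_loss(v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """L_attr: binary cross-entropy over the K attributes.

    v: predicted attribute distribution in [0, 1], shape (B, K)
    a: binary target attribute labels, shape (B, K)
    """
    # Summing over the K attributes matches the formula above; averaging
    # over the batch is an assumption for stable loss scaling.
    return F.binary_cross_entropy(v, a.float(), reduction='none').sum(dim=1).mean()
```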
2.2 Attribute Guided Triplet Training
We use the predicted attribute vectors to guide both triplet mining and triplet loss training.
A. Triplet Mining
Random triplet sampling based on class/ID labels does not ensure the selection of the most informative examples for training. This is especially critical when only category information is available. For effective training, anchor-positive pairs should be reliable. Therefore, we propose to leverage the cosine similarity \(\langle \mathbf {x} , \mathbf {y} \rangle = \frac{ \mathbf {x} ^\intercal \mathbf {y} }{\Vert \mathbf {x} \,\Vert _2\Vert \mathbf {y} \,\Vert _2}\) between the anchor-positive attribute vectors (outputs of the attribute prediction network) to sample better triplets. In particular, we use a threshold \((\varPhi )\) such that only triplets with \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle > \varPhi \) are selected for training. This ensures that the anchor-positive pairs are similar in attribute space and hence reliable.
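This filtering step can be sketched as follows, assuming batched attribute vectors of shape (B, K); the threshold value 0.7 follows the setting reported in Sect. 3.

```python
import torch
import torch.nn.functional as F

def mine_reliable_triplets(v_a: torch.Tensor, v_p: torch.Tensor,
                           phi: float = 0.7) -> torch.Tensor:
    """Keep triplets whose anchor-positive attribute similarity exceeds Phi.

    v_a, v_p: predicted attribute vectors of anchors and positives, shape (B, K).
    Returns a boolean mask over the batch selecting the reliable triplets.
    """
    sim_ap = F.cosine_similarity(v_a, v_p, dim=1)  # <v^a, v^p> per triplet
    return sim_ap > phi
```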
B. Attribute Guided Triplet Training
We propose two ways to guide the triplet metric learning network. The first weights the whole triplet loss, while the second operates on the margin parameter of the loss function. Let \( \left\{ x^a, x^p, x^n \right\} \) be an input triplet and \(\left\{ f(x^a), f(x^p), f(x^n)\right\} \) be the corresponding embeddings. The proposed attribute-guided triplet loss is given by

\(L^{G}_{tri}(\theta ) = \sum w(a,p,n) \left[ \Vert f(x^a) - f(x^p) \Vert _2^2 - \Vert f(x^a) - f(x^n) \Vert _2^2 + m(a,p,n) \right] _{+},\)   (2)

where w(a, p, n) and m(a, p, n) are the loss weighting factor and margin factor, which are functions of the attribute distributions \(\left\{ \mathbf {v} ^a, \mathbf {v} ^p, \mathbf {v} ^n \right\} \), as explained below.
B1. Soft-Weighted (SW) Triplet Loss
The SW triplet loss operates on the overall loss via the weighting factor w(a, p, n). We use \(w(a,p,n) = \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \), the product of the similarities between the attribute vectors of the anchor-positive and anchor-negative pairs. This function adaptively alters the magnitude of the triplet loss. When the anchor-positive pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^p \rangle \) is high), the sample is more confident and reliable. Likewise, when the anchor-negative pair is similar in attribute space (i.e. \(\langle \mathbf {v} ^a, \mathbf {v} ^n \rangle \) is high), it forms a hard negative example, i.e. it carries high information. Hence, such a triplet is given higher priority and more attention during the network update. This is analogous to hard negative mining [22], but handled in a soft manner.
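A minimal sketch of the SW triplet loss is shown below. The squared Euclidean distance and the detaching of the guiding weight from the gradient (so that it scales, but is not trained by, this loss) are our assumptions; the paper does not specify these details.

```python
import torch
import torch.nn.functional as F

def sw_triplet_loss(f_a, f_p, f_n, v_a, v_p, v_n, m: float = 0.5):
    """Soft-weighted triplet loss: w(a,p,n) scales the whole hinge term."""
    # Guiding weight: product of anchor-positive and anchor-negative
    # attribute similarities (assumed detached from the gradient).
    w = (F.cosine_similarity(v_a, v_p, dim=1) *
         F.cosine_similarity(v_a, v_n, dim=1)).detach()
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # squared distance to positive
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # squared distance to negative
    return (w * F.relu(d_ap - d_an + m)).mean()
```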
B2. Soft-Margin (SM) Triplet Loss
The SM triplet loss operates on the margin parameter via m(a, p, n). The naive triplet loss uses a constant margin m, which treats all triplets equally and restricts learning the desired tiered similarity. The soft margin is an adaptive margin \( m(a,p,n) = m_0\log (1+ \langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \langle \mathbf {v} ^a, \mathbf {v} ^n \rangle )\), which promotes a tiered similarity space. As with the SW triplet loss, when both \(\langle \mathbf {v} ^a, \mathbf {v} ^p\rangle \) and \(\langle \mathbf {v} ^{a}, \mathbf {v} ^n \rangle \) are high, the triplet is more reliable and informative (a hard negative), and hence the effective margin becomes larger. In other words, when the negative image and anchor image are very similar in attribute space, a larger margin is used to learn the subtle difference and avoid confusion. Hence, both the SW and SM triplet losses exploit the importance of the triplets, which helps in learning a tiered similarity space.
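The SM variant can be sketched analogously; \(m_0 = 0.8\) follows Sect. 3, and the same squared-distance and detach assumptions apply as in the SW sketch above.

```python
import torch
import torch.nn.functional as F

def sm_triplet_loss(f_a, f_p, f_n, v_a, v_p, v_n, m0: float = 0.8):
    """Soft-margin triplet loss: m(a,p,n) = m0 * log(1 + <v^a,v^p><v^a,v^n>)."""
    s = (F.cosine_similarity(v_a, v_p, dim=1) *
         F.cosine_similarity(v_a, v_n, dim=1)).detach()
    margin = m0 * torch.log1p(s)  # adaptive per-triplet margin
    d_ap = (f_a - f_p).pow(2).sum(dim=1)
    d_an = (f_a - f_n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()
```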
3 Experiments
We collected a new BrandFashion dataset with about 10K clothing images bearing distinctive logos from 15 brands. The images are categorized into 16 clothing categories and annotated with 32 semantic attributes. The goal is to demonstrate the tiered similarity space using the category, brand, and attribute annotations. There are 50 query images in the dataset. We evaluated instance search performance using mean average precision (mAP) and tiered similarity search performance using the normalized discounted cumulative gain (NDCG), i.e. \(NDCG@k = \frac{1}{Z}\sum _{i=1}^k \frac{2^{r(i)}-1}{\log _2(1+i)}\). The relevance score of the \(i^{th}\) ranked image is calculated as a similarity match over three levels of information, namely category, brand, and attributes, i.e. \(r(i) = r_i^{cat} + r_i^{brand} + r_i^{attr}\), where \(r_i^{cat} \in \left\{ 0,1 \right\} \) and \(r_i^{brand}\in \left\{ 0,1 \right\} \). The attribute match \(r_i^{attr}\) is the ratio of the number of matched attributes to the total number of query attributes [12]. Overall, the relevance score summarizes the tiered similarity search performance.
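The metric can be computed as in the sketch below; reading Z as the DCG of the ideal ranking is our interpretation of the standard definition, as the paper does not spell Z out.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of relevance scores r(i)."""
    dcg = sum((2 ** r - 1) / math.log2(1 + i)
              for i, r in enumerate(relevances[:k], start=1))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# e.g. a result matching the query's category and brand with 3 of 4 query
# attributes has r = 1 + 1 + 0.75 = 2.75 (illustrative values).
```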
We used VGG16 [25] as the base CNN, trained using the loss defined in Eq. (1) with SGD (momentum 0.5, learning rate 0.001). We set \(\lambda \) to 1. The margin \(m\) is experimentally set to 0.5. For the SM triplet loss, \(m_0\) is set to 0.8 so that the effective margin swings around the original value. We set the threshold \(\varPhi \) to 0.7 and observe that the performance is fairly stable for \(\varPhi \in [0.5,0.9]\).
Items from the same category and brand are sampled as anchor-positive pairs, and items from different categories or brands constitute the negative samples. The \(L_2\)-normalized output of the last fully connected layer is used as the feature vector. We used PyTorch [20] for the implementation. Similar to [15, 19, 23], we crop out the clothing region prior to feature extraction, using Faster-RCNN [21] to jointly detect the brand logo and clothing items in the images.
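At search time, since the embeddings are \(L_2\)-normalized, ranking reduces to a dot product; the sketch below illustrates this, with cropping assumed to happen upstream.

```python
import torch

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor):
    """Rank gallery items by similarity to the query.

    query_feat: (D,) and gallery_feats: (N, D), both L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = gallery_feats @ query_feat
    return torch.argsort(scores, descending=True)  # indices, best match first
```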
Table 1 compares the performance of different methods in terms of mAP and NDCG@20. In terms of mAP, the naive triplet loss achieves 33.8%, while the multitask network (triplet + attribute) achieves 56.4%. This shows a clear benefit of incorporating auxiliary information through multitask metric learning. The proposed method additionally guides the triplet loss using the predicted attributes: AGML-SW and AGML-SM achieve mAPs of 63.79% and 63.71% respectively, demonstrating the advantage of the attribute-guided triplet loss. The proposed method clearly outperforms the deep feature encoding based methods [13, 17, 26] and state-of-the-art metric learning methods [15, 19, 23].
A similar trend can be observed for NDCG in Table 1. The proposed attribute-guided SW and SM triplet networks achieve NDCG@20 of 83.66% and 85.12% respectively. Our method clearly outperforms the other state-of-the-art methods, which demonstrates the advantage of learning a tiered similarity space. We further exploit logo detection to re-rank the retrieval results: with re-ranking based on detected brand logo information, the proposed method achieves mAP \(\approx \) 71% and NDCG@20 \(\approx \) 96%. Figure 3 shows example search results obtained using WTBI [15], FashionNet [19], and the proposed method, further illustrating the advantage of the proposed method.
4 Conclusions
We presented a new deep attribute-guided triplet network which exploits the importance of training samples and learns a tiered similarity space. The method uses a multitask CNN that shares information between the tasks to better tune the loss. Using the predicted attributes, the proposed method first mines informative triplets and then trains the triplet loss in a soft manner, which helps in capturing the tiered similarity desirable for fashion search. We believe that tiered similarity search will be valuable to fashion companies, online retailers, and customers alike.
References
Al-Halah, Z., Stiefelhagen, R., Grauman, K.: Fashion forward: forecasting visual style in fashion. In: IEEE International Conference on Computer Vision (ICCV), pp. 388–397. IEEE (2017)
Baldwin, C.: Online spending continues to increase thanks to fashion sector (2014). https://www.computerweekly.com/news/2240225386/Spend-online-continues-to-increase-thanks-to-fashion-sector
Bell, S., Bala, K.: Learning visual similarity for product design with convolutional neural networks. ACM Trans. Graph. (TOG) 34(4), 98 (2015)
Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., Van Gool, L.: Apparel classification with style. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012. LNCS, vol. 7727, pp. 321–335. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37447-0_25
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a siamese time delay neural network. In: Advances in Neural Information Processing Systems, pp. 737–744 (1994)
Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11, 1109–1135 (2010)
Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 609–623. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_44
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 539–546 (2005)
Financial Times: online retail sales continue to soar (2018). https://www.ft.com/content/a8f5c780-f46d-11e7-a4c9-bbdefa4f210b
Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv:1703.07737 (2017)
Hu, J., Lu, J., Tan, Y.P.: Discriminative deep metric learning for face verification in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1882 (2014)
Huang, J., Feris, R.S., Chen, Q., Yan, S.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: International Conference on Computer Vision, pp. 1062–1070 (2015)
Jing, Y., et al.: Visual search at Pinterest. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1889–1898. ACM (2015)
Khamis, S., Kuo, C.-H., Singh, V.K., Shet, V.D., Davis, L.S.: Joint learning for attribute-consistent person re-identification. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 134–146. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16199-0_10
Kiapour, M.H., Han, X., Lazebnik, S., Berg, A.C., Berg, T.L.: Where to buy it: matching street clothing photos in online shops. In: IEEE International Conference on Computer Vision, pp. 3343–3351 (2015)
Kiapour, M.H., Yamaguchi, K., Berg, A.C., Berg, T.L.: Hipster wars: discovering elements of fashion styles. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 472–488. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_31
Lin, K., Yang, H.F., Liu, K.H., Hsiao, J.H., Chen, C.S.: Rapid clothing retrieval via deep learning of binary codes and hierarchical search. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp. 499–502. ACM (2015)
Lin, Y., Zheng, L., Zheng, Z., Wu, Y., Yang, Y.: Improving person re-identification by attribute and identity learning. arXiv:1703.07220 (2017)
Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1096–1104 (2016)
Paszke, A., et al.: PyTorch. http://pytorch.org
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (2015)
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Shankar, D., Narumanchi, S., Ananya, H., Kompalli, P., Chaudhury, K.: Deep learning based large scale visual recommendation and search for e-commerce. arXiv:1703.02344 (2017)
Simo-Serra, E., Ishikawa, H.: Fashion style in 128 floats: joint ranking and classification using weak data for feature extraction. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 298–307 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv:1511.05879 (2015)
Wang, J., et al.: Learning fine-grained image similarity with deep ranking. arXiv:1404.4661 (2014)
Wang, X., Sun, Z., Zhang, W., Zhou, Y., Jiang, Y.G.: Matching user photos to online products with robust deep features. In: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 7–14. ACM (2016)
Yang, F., et al.: Visual Search at eBay. arXiv:1706.03154 (2017)
Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv:1411.7923 (2014)
Acknowledgment
This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the Infocomm Media Development Authority, Singapore. We gratefully acknowledge the support of NVIDIA AI Technology Center for their donation of GPUs used for our research.