1 Introduction

With the growing popularity of digital products and the widespread deployment of sensor techniques, large sets of data containing numbers, text, images, and videos are produced. Gantz and Reinsel (2011) predicted that the amount of data collected globally would reach 35 trillion GB by 2020. Companies such as Facebook and Google collect this large volume of data and then extract useful information from it. Formally, Big Data means that data are rapidly generated and accumulated along four Vs (Chen and Lin 2014; Najafabadi et al. 2015): volume (the amount of data), variety (the diversity and complexity of the raw data), velocity (the rate at which data increase), and veracity (the usefulness of results obtained from data analysis). However, such amounts of data cannot be stored, processed, or analyzed using traditional database techniques. Thus, dealing with Big Data has become a hot topic in recent years.

Not only must storage and computation techniques be improved for large data, but Big Data Analytics, which mines meaningful information from massive data to support decisions and predictions, is also a very important issue. To provide business analysis and decision guidance, companies such as Google and Microsoft process data at exabyte scale or larger, while social software companies such as Facebook and Twitter, which collect log records and user behaviors from billions of users, also generate very large quantities of data. However, Big Data Analytics faces a series of challenges in dealing with billions of records containing thousands of dimensions, features, and category labels (Chen and Lin 2014). Therefore, extracting valuable knowledge from a massive data set is not an ordinary task and cannot be completed in a conventional machine learning scenario.

Recently, one machine learning technique, called Deep Learning, has provided a solution to the data analysis and learning problems of Big Data. Deep Learning emulates the sensorial areas of the neocortex in the human brain, automatically learning high-order hierarchical representations in deep architectures from the underlying data (Bengio et al. 2013; Arel et al. 2010). Thus, through a hierarchical learning process, features ranging from low-level to complex high-level ones can be extracted by a Deep Learning algorithm. In addition, Deep Learning algorithms can learn relational and semantic data representations from large amounts of unsupervised raw data at the high-level layers (Hinton and Osindero 2006; Bengio et al. 2007), whereas conventional techniques with shallow learning hierarchies fail to extract complex and nonlinear patterns efficiently. Deep Learning algorithms have now been successful in several domains, for example, speech recognition (Dahl et al. 2012; Mohamed et al. 2012), natural language processing (Socher et al. 2011), and computer vision (Ciresan et al. 2010; Krizhevsky et al. 2012). Google, Apple, Microsoft, and IBM (Jones 2014; Wang et al. 2011) have developed projects that use Deep Learning to analyze data collected from users. For example, Siri, the virtual assistant in Apple cell phones, provides services such as weather reports, news updates, and ticket ordering through voice (Efrati 2013). Thus, with the aid of Deep Learning algorithms, discriminative tasks such as semantic indexing, information retrieval, and discriminative modeling can be readily developed in Big Data Analytics. However, for high-dimensional data such as images and videos, which greatly increase data volume, the learning process in a hierarchical structure is slow and computationally expensive (Sukumar 2014). Therefore, the application of Deep Learning algorithms to Big Data Analytics involving high-dimensional data remains largely unexplored (Sukumar 2014).

As images become the main content on the Internet and in social media applications such as social networks, there has been a rapid increase in the number of digital images collected from online users. Additionally, more commercial benefits can be obtained by analyzing these collected data to predict consumer behavior and then recommend products. Hence, in this study, we explore deep learning to analyze high-dimensional clothing images in an attempt to discover fashion style elements that can be used for consumption behavior prediction in the Big Data era. In the conventional computer vision field, many low-level features, such as LBP (Ojala et al. 2002) and SIFT (Lowe 2004), have been developed to extract specific patterns from domain images. However, there is a large gap between these hand-crafted features and the abstract meaning of an image. Inspired by deep learning's ability to learn from large amounts of data, we explore it to mine low- and high-level features from fashion images in a hierarchical layered structure and further extend it to the clothing retrieval task. Moreover, three modified deep architectures were developed with computational efficiency and system flexibility in mind. Different from conventional algorithms with shallow structures, a deep net contains many parameters to be optimized, and the slow learning process for high-dimensional image data, particularly the massive amounts collected from social applications or online shopping sites, makes such a system prohibitive. Therefore, inspired by the ideas of ensemble learning, one architecture is designed to use multiple deep nets, each computed on one client node in a distributed computing environment, so that training can be accelerated. On the other hand, to increase system flexibility, two architectures consisting of multiple deep nets with two outputs each are proposed for binary-class classification; when new classes are added, no additional computation over all the training data is needed.

The remainder of this paper is organized as follows: in Sect. 2, we review deep learning for computer applications and related studies in vision-based analysis of clothing images. In Sect. 3, the system framework for knowledge discovery of fashion style patterns and the learning process of the deep net are introduced. In Sect. 4, three modifications of the deep architecture are discussed to simulate the learning process for a large volume of data in a distributed computing environment. In Sect. 5, experiments are performed to evaluate the system performance, and Sect. 6 concludes the paper.

2 Related work

This section introduces the history of Deep Learning and its usage in computing fields, and then examines fashion-related studies.

2.1 Deep Learning

Deep Learning is a hierarchical model used to emulate computations similar to human learning. Different from conventional machine learning, which is best suited to small data volumes with simple structure and incorporates shallow learning architectures, Deep Learning can extrapolate abstract and complex data representations using a hierarchical layered structure in which less abstract features are learned in the lower levels, while more abstract features are extracted in the higher levels. Deep learning can be used for supervised problems when labeled data are available. Moreover, Google and Stanford (Le et al. 2012) proposed a vast deep neural network to learn high-level features using only unlabeled data. In Google's experiment, a nine-layered sparse autoencoder with 1 billion connections was trained on ten million \(200 \times 200\) images downloaded from the Internet (Sukumar 2014). The experiments were performed on a computational cluster of 1000 machines with 16,000 cores, and the training took 3 days. Using the extracted high-level abstract features, the trained deep net could recognize 22,000 object categories in the ImageNet dataset. Since deep learning has demonstrated the ability to extrapolate abstract representations from large amounts of unlabeled/unsupervised/unseen data and has outperformed other conventional machine learning methods, it has become an attractive technique, especially in the Big Data era.

In a deep net, the input of each layer is the output of the previous layer. In other words, data are sequentially passed from the first layer to the last output as the abstract representation. Using the layered structure, more abstract features are obtained by passing data through multiple nonlinear computations. For example, in face recognition, the first layer can learn edges in different orientations. In the second layer, more complex features like the different parts of a face, such as lips, nose, and eyes, are learned. In the third layer, features like face shapes are learned. The final representations can be used for face recognition, semantic data indexing, or other classification problems. The idea of Deep Learning is that the input data are nonlinearly transformed in each layer (Sukumar 2014), which tries to extract the underlying abstract factors in the data. Deep Learning has been applied to develop the Microsoft Research Audio Video Indexing System (MAVIS) for speech recognition, providing a search service for audio and video files that contain speech. Convolutional neural networks (CNNs) (LeCun et al. 1998) were proposed to scale Deep Learning up to high-dimensional data effectively. In CNNs, the neuron units in the latent layers do not need to be connected to all of the nodes in the previous layer, and the resolution of high-dimensional image data can be reduced via sampling methods when feeding data into higher layers in the network (LeCun et al. 1998). Researchers have taken advantage of CNNs to achieve impressive accuracy improvements in image classification on the ImageNet dataset, one of the largest databases for image object recognition, with 1.2 million \(256 \times 256\) RGB images (Krizhevsky et al. 2012). Additionally, CNNs have been applied to many other computer vision applications such as object detection (Girshick et al. 2014), face recognition (Sun et al. 2015), and human attribute prediction (Zhang et al. 2014).

Recently, there have been improvements to Deep Learning, including the use of drop-out (Hinton et al. 2012) to prevent the optimization from overfitting. More effective nonlinear activation functions, such as rectified linear units (Glorot et al. 2011) and maxout (Goodfellow et al. 2013), are also used, and network-in-network (NiN) functions have been proposed to reduce dimensionality (Lin et al. 2013). In addition, Socher et al. (2011) introduced recursive neural networks, which can predict hierarchical tree structures for the segmentation and annotation of complex image scenes and surpass other existing methods based on conditional random fields. Le et al. (2011) demonstrated that Deep Learning can be used for action scene recognition and video data tagging; their approach outperforms other existing methods in the video domain that adopt hand-designed image features like SIFT and HOG. These studies show the advantage of Deep Learning as an approach to extract data representations from different data types.

Based on the underlying assumption that low- or mid-level features are common across different problem domains, domain adaptation or transfer with deep neural networks has gained increased attention. The majority of existing approaches adapt by re-training the last few layers of the network using samples from the target domain, or by fine-tuning all the layers using backpropagation (Razavian et al. 2014; Oquab et al. 2014). Several works have shown that it is effective to transfer a Deep Learning model from a large-scale dataset, e.g., ImageNet, to other tasks (Chen et al. 2015b; Huang et al. 2015). However, these approaches usually require a relatively large training sample from the target domain to produce good results (Chen et al. 2015b). Thus, Chen et al. (2015b) proposed a new deep domain adaptation approach using a double-path network to learn domain-invariant hierarchical features directly and to transfer the domain information within intermediate layers, bridging the gap between the source and target clothing domain distributions. Alternatively, Huang et al. (2015) proposed a network architecture that learns effective features for measuring visual similarities across domains for cross-domain image retrieval.

2.2 Visual-based analytics on fashion images

Advances in mobile technology combined with the ubiquity of Internet service create an environment where users take more pictures and post them online. Because of the benefits realized by modeling and predicting consumers' consumption behavior from crawled data, there is a growing interest in popularity prediction based on online content (Yamaguchi et al. 2014). As pictures become a core content type on the Internet and human behavior migrates to social network sharing, the Big Data generated from login and browsing records is attracting considerable attention. For example, an online economic service can accept pictures as input and return corresponding recommendations to users. In Yamaguchi et al. (2014), the authors examined the social influence of clothing images in an online fashion social network using vision-based analytics.

Because of the high revenue of these services and their huge impact on e-commerce applications, fashion pictures were selected as an influential and popular target. There is a growing interest in methods for garment understanding or recognition (Yamaguchi et al. 2012, 2013), product suggestion (Kalantidis et al. 2013), outfit recommendation (Jagadeesh et al. 2014; Liu et al. 2012), clothing retrieval (Kalantidis et al. 2013), fashion style recognition (Chen and Liu 2015), and clothing attribute prediction (Bossard et al. 2013). Among these applications, human parsing, which partitions the human body into several clothing-specific regions (e.g., hat, left/right leg, and upper-body clothes) by giving each pixel in the input image a semantic label, has drawn much attention in recent years (Yamaguchi et al. 2012, 2013), serving as the basis for related applications such as clothing classification and retrieval (Liu et al. 2014). Frameworks of pixel-based parsing have been proposed to segment and classify simultaneously. Human parsing can be roughly divided into parametric and nonparametric methods. For the parametric methods, Yamaguchi et al. (2012) proposed an image parsing system consisting of three parsers with different classification models to recognize the clothing class of each pixel, and then combined all results from the parsers for the final prediction. Yang et al. (2014) developed a human parsing system consisting of two sequential phases for image co-segmentation and region co-labeling to capture the correlations between different human images. Conversely, nonparametric methods build pixel-level (Liu et al. 2011) or hypothesis-level (Tung and Little 2014) matches between a new image and annotated images in a dataset; the labels are then transferred from the annotated images to the new image. Instead of combining multiple sequential steps, CNN-based methods have been proposed for image parsing (Farabet et al. 2013). Farabet et al. (2013) trained a multi-scale CNN from raw pixels to assign a label to each pixel. Long et al. (2014) proposed recurrent CNNs, a state-of-the-art scene parsing method, to speed up parsing. However, these deep models cannot be easily updated when new semantic labels are incorporated (Liu et al. 2015). To increase the flexibility of human parsing, Liu et al. (2015) proposed a matching convolutional neural network (M-CNN) to predict the matching confidence based on a k-nearest-neighbor (KNN) nonparametric framework. Different from the classic CNN architecture, cross-image matching filters are embedded into every convolutional layer to characterize the matching between the test image and the semantic regions of the KNN images. The output is then fused by displacing the best matched region in the test image for a particular semantic region in one KNN image.

Fig. 1

System framework with three components: the pre-trained model, the fine-tuning process, and the retrieval task. Note that Architecture 1 contains one deep net with multiple outputs

Garment and clothing classification/retrieval is a hot issue as well. Song et al. (2011) predicted a person's occupation by recognizing their clothing style. Di et al. (2013) defined 12 fine-grained clothing classes, covering material, fastener types, collar types, folded collars, overcoats, and jackets, and used hand-crafted features including LBP, SIFT, and histogram of oriented gradients (HOG) with SVM classifiers. Bossard et al. (2013) proposed a clothing recognition system for 15 common classes with 78 attributes. The system identifies the upper-body region as the region of interest (ROI) and extracts speeded-up robust features (SURF), local binary patterns (LBP), and CIE L*a*b* color values as feature descriptors; multi-class learning based on a Random Forest is applied to recognize clothing styles. However, only 41% accuracy was achieved. Different from previous works that focus on classifying or retrieving similar garments from images, Chen et al. (2015a) proposed a sparse-coding representation approach to discover the elements of ten fashion styles with color, LBP, and HOG features; however, only a 68% accuracy rate was achieved. Although the authors define ten fashion styles by investigating color and texture statistics, the definition of each fashion style is abstract and difficult. Conversely, Kiapour et al. (2014) designed an online competitive style rating mechanism to associate clothing with style ratings for five style categories, namely hipster, bohemian, pinup, preppy, and goth, based on reliable human judgments. Between- and within-class style classification is then performed using the proposed style descriptor with linear-kernel SVM classifiers. Although these issues are interesting, the ability of existing methods to analyze them in a natural setting is limited because they are based on conventional machine learning techniques.

Rather than using conventional machine learning (Bossard et al. 2013; Chen et al. 2015a; Kiapour et al. 2014), domain adaptation or transfer with deep neural networks has been explored in the field of garment and clothing classification/retrieval. Khosla and Venkataraman (2015) applied CNNs with multiple VGGNet architectures to classify an input shoe image into a shoe category; transfer learning is applied in the VGGNet architecture, and the feature vector from the last fully connected layer of the trained model is used to retrieve the five most similar shoes in the data. A similar idea is applied in Bossard et al. (2013), where Random Forests are extended to be capable of transfer learning from different domains to reduce the noise that exists in crawled image data. Huang et al. (2015) proposed a dual attribute-aware ranking network (DARN) to address the gap between user photos with cluttered backgrounds and online images with clean backgrounds in the clothing retrieval problem. DARN consists of two sub-networks with similar structures for shopping and street scenarios; in addition, an enhanced R-CNN detector is applied to localize the clothing area in images with cluttered backgrounds. Chen et al. (2015b) addressed the same problem of bridging the gap between two clothing domains by implementing a specific double-path deep neural network, in which each path models one clothing domain and additional alignment layers connect the two paths for domain consistency. In Lin et al. (2015), the authors developed a hierarchical deep search framework for clothing retrieval in a recommendation system: mid-level visual features were learned in a pre-trained network, the network was then fine-tuned using a clothing dataset, and a latent layer was added so that hash codes could be learned. The authors conducted experiments on a dataset with 15 clothing categories and 161,234 clothing images from Yahoo shopping websites. However, in the Big Data era, the flexibility and scalability of models are important issues. Chen and Liu (2015) proposed three modified deep net architectures for a distributed computing environment. According to their experimental results, the classification rates for fashion style recognition were significantly improved by the deep learning methods; moreover, with distributed computation, the classification rates are comparable to those of the original architecture while the flexibility of deep learning is improved.

3 System framework

Figure 1 depicts the CNN framework used for style recognition and clothing pattern discovery. It comprises three components. The first component is the model pre-trained with supervision on the large ImageNet dataset (Krizhevsky et al. 2012; Lin et al. 2015). Based on the assumption that the parameters of the low- and mid-level network layers can be re-used across domains (Oquab et al. 2014; Chen et al. 2015b; Khosla and Venkataraman 2015), domain transfer learning is performed in the second component by using clothing style images to fine-tune the pre-trained network, with a latent layer containing multiple nodes (Lin et al. 2015). In the third component, the outputs of the nodes in the latent layer of the re-trained network are used as feature vectors for the clothing retrieval task. Note that because three modifications of this framework are introduced in the next section, the framework itself, containing one deep net with multiple outputs, is denoted Architecture 1.

Fig. 2

Deep convolutional net with eight layers

Figure 2 shows the deep convolutional net with eight layers: five convolutional layers, two fully connected layers, one latent layer, and the softmax output corresponding to the clothing styles. The first convolutional layer filters the \(227\times 227\times 3\) input region (Fig. 2), randomly cropped from \(256\times 256\times 3\), with 96 kernels of size \(11\times 11\times 3\) and a stride of four pixels. The rectified linear unit (ReLU) nonlinearity is then applied, and the responses are pooled with a kernel size of \(3\times 3\) and a stride of two pixels, normalized, and padded as the output of the first convolutional layer. The second convolutional layer takes the output of the first convolutional layer as input and filters it with 256 kernels of size \(5\times 5\times 48\); the same post-processing as in the first convolutional layer is performed before the subsequent layers. The third convolutional layer has 384 kernels of size \(3\times 3\times 256\) connected to the outputs of the second convolutional layer, and only the ReLU function is applied, without pooling or normalization. The fourth convolutional layer has 384 kernels of size \(3\times 3\times 192\), and the fifth convolutional layer has 256 kernels of size \(3\times 3\times 96\). The fully connected layer (fc6), containing 4096 nodes, takes the output of the fifth convolutional layer as input; pooling is applied, and dropout reduces overfitting in the fully connected layer (Jagadeesh et al. 2014) by setting the output of each hidden neuron to zero with probability 0.5. Before the output layer with the softmax function, a latent layer with 64, 128, or 256 nodes is added to extract feature vectors for retrieval. Because the memory requirements to train this network are huge, the batch size is set to 64 in the training process. After fine-tuning the network, the test data are fed into the network, and the outputs of the softmax function become the classification results.
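To make the layer configuration concrete, the following is a minimal sketch of such a net, assuming PyTorch rather than the Caffe setup used in our experiments; the class name and latent size are illustrative, and grouped convolutions are used to account for the per-group kernel depths such as \(5\times 5\times 48\):

```python
import torch
import torch.nn as nn

class ClothingStyleNet(nn.Module):
    """Illustrative AlexNet-style net of Fig. 2: five conv layers, fc6/fc7,
    a latent layer for retrieval features, and a softmax output."""
    def __init__(self, num_styles=10, latent_nodes=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 3x3, stride 2
            nn.LocalResponseNorm(5),                                  # normalization
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),   # conv2 (5x5x48 per group)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LocalResponseNorm(5),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),            # conv3, no pooling/norm
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),  # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),  # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),    # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),           # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, latent_nodes),   # latent layer (64/128/256 nodes)
        )
        self.output = nn.Linear(latent_nodes, num_styles)  # softmax via the loss

    def forward(self, x):
        latent = self.classifier(torch.flatten(self.features(x), 1))
        return self.output(latent), latent

net = ClothingStyleNet(num_styles=10)
logits, latent = net(torch.randn(1, 3, 227, 227))  # latent is the retrieval feature
```

At training time the softmax is applied implicitly through the cross-entropy loss, and the latent activations are kept as the retrieval feature vector described above.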

Fig. 3

a Forward and b backward propagation computation in the convolution layer

Figure 3 shows the detailed forward and backward propagation computations in the convolution layer. In general, the convolution layer comprises three functional elements, namely the convolution operator, the activation function, and pooling. Let the weights of the convolution filter be \(w=[w_1\; w_2\; \ldots \; w_n]\), where n is the length of w; the convolution operator is defined as

$$\begin{aligned} y=x*w=[y_n ], \quad \hbox { where }\quad y_n =\sum _{i=1}^{|w|} {x_{n+i-1} w_i }. \end{aligned}$$
(1)

The activation function is then applied to the convolution results; commonly used activation functions include sigmoid, tanh, maxout, and ReLU. In the proposed work, the ReLU function is applied as

$$\begin{aligned} {\hat{y}}_n =\max (0,y_n ) \end{aligned}$$
(2)

In the forward propagation, a downsampling operator is applied in the pooling layer: the aggregation function g is applied to subsample the results from the previous layer and avoid overfitting. The input of g is a vector, and the output is a scalar. Commonly used aggregation functions include mean pooling, max pooling, and \(L^{p}\) pooling; here, max pooling is applied. With m as the size of the pooling region, max pooling is represented as

$$\begin{aligned} {\tilde{y}}=g\left( {\hat{y}}\right) =\max \left( {\hat{y}}_n\right) ,\quad n=1, \ldots , m \end{aligned}$$
(3)

In the training process of the deep net, stochastic gradient descent (SGD) is applied to update the parameters through backpropagation. As shown in Fig. 3b, the parameters associated with the pooling layer are updated first. From the pooling layer to the activation function, the error signals \(\delta _{(n-1)m+1:nm}^{{\hat{y}}} \) for each training example are computed by upsampling using the following equation:

$$\begin{aligned} \delta _{(n-1)m+1:nm}^{{\hat{y}}}= & {} \delta _n^{{\tilde{y}}}\, g'_n =\delta _n^{{\tilde{y}}} \frac{\partial g}{\partial {\hat{y}}_{(n-1)m+1:nm} }\nonumber \\= & {} \frac{\partial J}{\partial {\tilde{y}}_n }\frac{\partial {\tilde{y}}_n }{\partial {\hat{y}}_{(n-1)m+1:nm} }=\frac{\partial J}{\partial {\hat{y}}_{(n-1)m+1:nm} } \end{aligned}$$
(4)

Note that the gradient of the max pooling function is \(\frac{\partial g}{\partial z_i} = \left\{ \begin{array}{ll} 1&{}\quad \hbox {if}\,{z_i =\max (z)} \\ 0 &{} \quad \hbox {otherwise} \\ \end{array} \right. \), where z is a dummy variable. After obtaining the error signals propagated from the pooling layer, the error signals in the convolutional layer can be obtained by

$$\begin{aligned} \delta _{n}^{y} =\delta _n^{{\hat{y}}} \cdot {f}^{\prime }(y_n ) \end{aligned}$$
(5)

where \({f}^{\prime }(y_n )\) is the derivative of the activation function. Then, the error signal \(\delta _n^x \) propagated from the convolution layer can be obtained by the equation:

$$\begin{aligned} \delta _n^x= & {} \frac{\partial J}{\partial x_n }=\frac{\partial J}{\partial y}\frac{\partial y}{\partial x_n }=\sum _{i=1}^{|w|} {\frac{\partial J}{\partial y_{n-i+1} }} \frac{\partial y_{n-i+1} }{\partial x_n }\nonumber \\= & {} \sum _{i=1}^{|w|} {\delta _{n-i+1}^y w_i } \end{aligned}$$
(6)

In other words, \(\delta ^x\) can be viewed as the convolution of \(\delta ^y\) with the flip of w. The gradient of the error function with respect to w is

$$\begin{aligned} \frac{\partial J}{\partial w_i }=\frac{\partial J}{\partial y}\frac{\partial y}{\partial w_i }=\sum _{n=1}^{|x|-|w|+1} {\frac{\partial J}{\partial y_n }\frac{\partial y_n }{\partial w_i }=} \sum _{n=1}^{|x|-|w|+1} {\delta _n^y x_{n+i-1} } \end{aligned}$$
(7)

Then, in the SGD process (Krizhevsky et al. 2012), the weights of the filters are updated by

$$\begin{aligned} v= & {} \gamma v+\alpha \nabla _w J \nonumber \\ w= & {} w-v \end{aligned}$$
(8)

where v is the current velocity vector with the same dimension as w, \(\alpha \) is the learning rate, and \(\gamma \in (0,1)\) determines how much of the previous gradients is incorporated into the current update; \(\gamma =0.5\) is used here.
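To make Eqs. (1)-(8) concrete, the following NumPy sketch implements the forward and backward passes of one 1-D convolution-ReLU-max-pooling layer together with the momentum update. It is an illustrative reconstruction from the equations, not our training code, and it assumes the activation length is a multiple of the pooling size m.

```python
import numpy as np

def conv_forward(x, w):
    """Eq. (1): y_n = sum_i x_{n+i-1} w_i, a 'valid' sliding inner product."""
    n_out = len(x) - len(w) + 1
    return np.array([np.dot(x[n:n + len(w)], w) for n in range(n_out)])

def relu(y):
    """Eq. (2): element-wise max(0, y_n)."""
    return np.maximum(0.0, y)

def max_pool(y_hat, m):
    """Eq. (3): max over non-overlapping regions of size m (len(y_hat) % m == 0).
    Also returns the arg-max positions needed by the backward pass."""
    regions = y_hat.reshape(-1, m)
    return regions.max(axis=1), regions.argmax(axis=1)

def conv_layer_backward(x, w, y, argmax, delta_pool, m):
    """Error signals of Eqs. (4)-(7) for one training example."""
    # Eq. (4): upsample through the pooling layer; the gradient of max
    # pooling is 1 at the max position of each region and 0 elsewhere.
    delta_hat = np.zeros_like(y)
    delta_hat[np.arange(len(delta_pool)) * m + argmax] = delta_pool
    # Eq. (5): through the activation; f'(y_n) = 1 if y_n > 0 else 0 for ReLU.
    delta_y = delta_hat * (y > 0)
    # Eq. (6): delta^x is the full convolution of delta^y with w.
    delta_x = np.convolve(delta_y, w)                  # length |x|
    # Eq. (7): dJ/dw_i = sum_n delta^y_n x_{n+i-1}.
    grad_w = np.array([np.dot(delta_y, x[i:i + len(delta_y)])
                       for i in range(len(w))])
    return delta_x, grad_w

def sgd_momentum_step(w, v, grad_w, alpha=0.01, gamma=0.5):
    """Eq. (8): v <- gamma*v + alpha*grad, then w <- w - v."""
    v = gamma * v + alpha * grad_w
    return w - v, v
```

Here conv_forward returns \(|x|-|w|+1\) responses, matching the summation limits in Eq. (7), and np.convolve realizes the full convolution of Eq. (6).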

In the following section, additional architectures are proposed that use either various training protocols or multiple deep nets in the fine-tuned component. In addition, the feature extraction and classification rules of each architecture are introduced.

4 Deep net architectures for clothing recognition

In the Big Data era, millions of data items are collected from the Internet. As noted in Sect. 2.1, training a deep net on millions of images took up to 3 days using 16,000 computing cores (Le et al. 2012). To utilize the advantage of parallel computing in a distributed environment, three additional architectures beyond Architecture 1 are proposed; they use either various training protocols or multiple deep nets in the fine-tuned component to accelerate the training of the deep nets through distribution. In addition, the proposed architectures address the issue of how to efficiently re-train the model without re-computing all the training data when a new class is added.

Fig. 4

Idea of Architecture 2 in the distributed computing environment

4.1 Multiple nets for load saving

MapReduce (Dean and Ghemawat 2008) is the most famous framework for distributed computing. A problem is divided into multiple independent sub-tasks, each of which can be distributed to and processed on one client of the cluster by the Map procedure. When every sub-task is completed, the results are combined using a summary operation in the Reduce procedure. The key challenge is how to divide the problem into multiple independent sub-tasks that can be computed in parallel on each client node. For deep net training, the input of each layer is the output of the previous layer, so the relations between layers are highly dependent; because the process is sequential, it is difficult to distribute the computation of each layer to different clients. In contrast, the computation within a layer, i.e., the convolution process, is suitable for processing on separate clients. Dean et al. (2012) developed a framework, DistBelief, which utilizes computing clusters with thousands of machines to accelerate the training process with two proposed algorithms, Downpour SGD and Sandblaster L-BFGS. It was the first implementation of parallelism that partitions large network architectures into several smaller structures, called blocks, where each block is assigned to and calculated on one machine. However, boundary nodes whose edges belong to more than one partition require data transfer between machines; fully connected networks, which have more boundary nodes, demand higher communication costs, so the performance benefits are limited (Chen and Lin 2014).

Hence, rather than partitioning the network into smaller structures, the first modified architecture, Architecture 2, fine-tunes multiple deep nets, inspired by the Random Forest machine learning technique (Bossard et al. 2013). A large volume of training data is randomly divided into sub-sets, and each sub-set is fed into one deep net that can be computed on one client node. Figure 4 illustrates Architecture 2 in a distributed computing environment. Note that each client node has one pre-trained model as depicted in Fig. 1, and cross-domain transfer learning is applied with the sub-set to fine-tune the corresponding deep net. Thus, the low-, mid-, and high-level features can be learned in each deep net. A sketch of the data dispatch step is given below.
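A minimal sketch of the random sub-set dispatch, assuming NumPy; the helper name is illustrative, and the actual shipping of each sub-set to a client node would be handled by the cluster framework:

```python
import numpy as np

def split_for_clients(image_paths, labels, num_nets, seed=0):
    """Architecture 2: randomly partition the training set into num_nets
    disjoint sub-sets, one per client node / deep net."""
    paths, labels = np.asarray(image_paths), np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(paths))
    return [(paths[idx], labels[idx]) for idx in np.array_split(order, num_nets)]
```

Each (paths, labels) pair is then sent to one node, which fine-tunes its own copy of the pre-trained model on that sub-set.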

4.2 Multiple nets for new class and pattern mining

A problem faced in the classification task when a new class is added is that the trained model might need to be re-trained with all the training data. Note that this differs from incremental learning, where models are updated as more information is added to existing classes. It is not feasible to re-train or refine models with massive volumes of data, especially when the data grow every day. Thus, inspired by ensemble algorithms, e.g., AdaBoost (in which multiple weak models are trained and combined into a final strong one), Architecture 3 is proposed to manage the problem of adding new classes. Figure 5 shows that Architecture 3 consists of multiple deep nets with two outputs. The net architecture is the same as Architecture 1, except that there are only two softmax outputs in the last layer for the binary-class problem: one node for the positive class, i.e., one clothing style, and the other for the negative class, i.e., the remaining clothing styles. In other words, each deep net performs a binary classification, and the number of deep nets in Architecture 3 depends on the number of object classes. Hence, when a class is added, only one deep net is added and fine-tuned. Moreover, this architecture benefits from data independence, so each deep net can be fine-tuned on one client node in a distributed computing environment.

Fig. 5

Idea of Architecture 3 with multiple nets of two softmax outputs for binary-class classification

In Architecture 4, the training data of the negative class are reduced to investigate the performance tolerance with respect to data volume. The deep nets are the same as in Architecture 3; the difference is that only a fraction s of the negative-class data is trained with the positive class in each net (in our simulation, \(s=1/4\)). Because each net is applied to discover the discriminant features of the positive class from the remaining styles (the complement of the positive class), sufficient positive data are more important. Through the training of multiple nets, each net learns part of the discriminant features from its sub-data, saving training time and the storage of massive volumes of data. The per-net training sets of Architecture 3 and Architecture 4 are sketched below.
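Because the two architectures differ only in the fraction of negatives kept, one hypothetical helper covers both (neg_fraction = 1 for Architecture 3 and 1/4 for Architecture 4):

```python
import random

def binary_training_set(samples, positive_class, neg_fraction=1.0, seed=0):
    """One net's two-class training set: label 1 for the positive style,
    0 for all remaining styles; Architecture 4 keeps only a fraction s
    of the negative examples (s = 1/4 in our simulation)."""
    pos = [(x, 1) for x, y in samples if y == positive_class]
    neg = [(x, 0) for x, y in samples if y != positive_class]
    random.Random(seed).shuffle(neg)
    return pos + neg[: int(len(neg) * neg_fraction)]
```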

4.3 Classification mechanism for deep net architectures

For clothing recognition and retrieval, different classification mechanisms are applied to obtain the final prediction results in the four proposed architectures; they can be categorized into softmax-layer and latent-layer processes.

4.3.1 Softmax layer process

For the nets in Architecture 1 and Architecture 2, the output of the cth node in the softmax layer, i.e., the last layer in the deep net, represents the posterior probability that the test data belong to the cth class. Hence, for Architecture 1, which contains only one net, the predicted label \(k^{*}\) for the test image \(I_q\) is assigned to the class label with the highest probability response. On the other hand, to combine the probability responses of the multiple nets in Architecture 2, a pooling process is first applied to obtain the maximum probability response for each class k across all nets as

$$\begin{aligned} p(\hbox {Class}_k | I_q ) = \max \left\{ {p\left( \hbox {Net}_k^1 | I_q\right) ,\ldots ,p\left( \hbox {Net}_k^m | I_q \right) ,\ldots ,p\left( \hbox {Net}_k^M | I_q\right) } \right\} \end{aligned}$$
(9)

where \(p(\hbox {Net}_k^m | I_q )\) is the response of the kth node in the softmax layer of the mth deep net. Then the final predicted label \(k^{*}\) for the test image \(I_q\) is given by

$$\begin{aligned} k^{*}=\mathop {\arg \max }\limits _{k\in \left\{ {1,2,\ldots ,C} \right\} } p(\hbox {Class}_k | I_q ) \end{aligned}$$
(10)

where C is the number of classes.

Architecture 3 and Architecture 4 also consist of multiple deep nets. However, the number of output nodes in the softmax layer differs from Architecture 2: there are only two output responses, one being the posterior probability that the test image belongs to one specific class (the positive class) and the other corresponding to the remaining classes (the negative class). Hence, the final predicted label \(k^{*}\) for the test image \(I_q\) is assigned by maximizing the posterior probability,

$$\begin{aligned} k^{*}=\mathop {\arg \max }\limits _{k\in \left\{ {1,2,\ldots ,C} \right\} } p\left( \hbox {Net}_{c=1}^k | I_q \right) \end{aligned}$$
(11)

where \(p(\hbox {Net}_{c=1}^k | I_q )\) is the output response of the positive-class node in the kth net and C is the number of nets in Architecture 3 or Architecture 4.
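The two decision rules of Eqs. (9)-(11) reduce to a few array operations; a NumPy sketch, assuming the per-net softmax responses have already been collected:

```python
import numpy as np

def predict_arch2(net_probs):
    """Eqs. (9)-(10): net_probs has shape (M, C), M nets by C classes.
    Max-pool each class posterior over the nets, then take the arg max."""
    return int(np.argmax(net_probs.max(axis=0)))

def predict_arch3_4(positive_probs):
    """Eq. (11): positive_probs[k] is the positive-node posterior of the
    k-th binary net; the class whose net responds most strongly wins."""
    return int(np.argmax(positive_probs))
```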

Fig. 6

Examples from the three datasets. First row: (left) examples from the dataset of Chen et al. (2015a) and (right) from the dataset of Kiapour et al. (2014). Second row: examples from the dataset of Bossard et al. (2013)

4.3.2 Latent layer process

To extend the deep nets to various applications, in all four proposed architectures the latent layer can be viewed as a feature extraction process, and a traditional classification mechanism, either a nonparametric classifier (e.g., the k-nearest-neighbor rule) or a parametric one such as the mean vector or an SVM, can be applied. In Architecture 1, for each class c, the mean vector \(f_c^{\hbox {Mean}} \) of the training data can be viewed as a kind of parametric model:

$$\begin{aligned} f_c^{\hbox {Mean}} =\frac{1}{N}\sum \nolimits _{y(I_n )\in c} {L(I_n )} \end{aligned}$$
(12)

where \(y(I_n )\) is the class label of the training datum \(I_n\), N is the number of training data belonging to class c, and \(L(I_n )\) is the output of the latent layer. Note that the length of \(L(I_n )\) is 64, 128, or 256 in the proposed work. Then the predicted label \(k^{*}\) for the test image \(I_q \) is given by finding the minimum distance between the extracted feature vector \(L(I_q )\) and \(f_c^{\hbox {Mean}} \) as

$$\begin{aligned} k^{*}=\mathop {\arg \min }\limits _{c\in \{1,\ldots ,C\}} \left\| {L(I_q )-f_c^{\hbox {Mean}} } \right\| _2 \end{aligned}$$
(13)

In addition, k-NN, a nonparametric model, can be applied by comparing the distances between \(\{L(I_n )\}_{n=1}^N \) and \(L(I_q )\), where N is the number of training data.

On the other hand, because Architecture 2 consists of multiple nets, for the training data \(I_n \) and the test image \(I_q \), the average pooling of the latent-layer responses across all nets is first applied as

$$\begin{aligned} {\tilde{L}}(I_n )= & {} \frac{1}{M}\sum \limits _{m=1}^M {L^{m}(I_n )}\nonumber \\ {\tilde{L}}(I_q )= & {} \frac{1}{M}\sum \limits _{m=1}^M {L^{m}(I_q )} \end{aligned}$$
(14)

The mean vector \({\tilde{f}}_c^{\mathrm{Mean}} \) can be obtained by

$$\begin{aligned} {\tilde{f}}_c^{\mathrm{Mean}} =\frac{1}{N}\sum \limits _{y(I_n )\in c} {{\tilde{L}}(I_n )} . \end{aligned}$$
(15)

Then the predicted label \(k^{*}\) for \(I_q \) can be estimated by the parametric model, i.e., by finding the minimum distance between the extracted feature vector \({\tilde{L}}(I_q )\) and \({\tilde{f}}_c^{\mathrm{Mean}} \), or by the k-NN classification rule. Note that not only recognition but also image retrieval results can be obtained.

However, the comparison process is less direct for Architecture 3 and Architecture 4, where the abstract meanings of the latent layers differ across nets. In other words, in each deep net of Architecture 3 and Architecture 4, the latent-layer responses are the abstract properties of only one specific class. Hence, the feature vectors for the training image \(I_n \) and the test image \(I_q \) are obtained by concatenating the responses of each net,

$$\begin{aligned} {\hat{L}}(I_n )= & {} L^{1}(I_n )\oplus \cdots \oplus L^{c}(I_n )\oplus \cdots \oplus L^{C}(I_n )\nonumber \\ {\hat{L}}(I_q )= & {} L^{1}(I_q )\oplus \cdots \oplus L^{c}(I_q )\oplus \cdots \oplus L^{C}(I_q ). \end{aligned}$$
(16)

where \(L^{c}(I_n)\) and \(L^{c}(I_q )\) are the responses of the latent layer in the cth net for \(I_n \) and \(I_q \), respectively, and \(\oplus \) is the concatenation operation. Following that, the parametric model, i.e., the mean vector, can be obtained as in Eq. (12), and the comparison can be applied as in Eq. (13). Moreover, the k-NN classification results obtained by comparing \({\hat{L}}(I_n )\) and \({\hat{L}}(I_q )\) can be used for recognition or retrieval applications.
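A NumPy sketch of the latent-layer rules in Eqs. (12)-(16); function names are illustrative:

```python
import numpy as np

def class_mean_vectors(latent, labels, num_classes):
    """Eqs. (12)/(15): per-class mean of latent-layer outputs; latent has
    shape (N, D) and labels holds the class of each of the N training images."""
    return np.stack([latent[labels == c].mean(axis=0) for c in range(num_classes)])

def predict_nearest_mean(query_feat, class_means):
    """Eq. (13): label of the nearest class mean under the L2 distance."""
    return int(np.argmin(np.linalg.norm(class_means - query_feat, axis=1)))

def arch2_feature(per_net_latent):
    """Eq. (14): average-pool the latent responses over the M nets."""
    return np.mean(per_net_latent, axis=0)

def arch3_4_feature(per_net_latent):
    """Eq. (16): concatenate the latent responses of the C binary nets."""
    return np.concatenate(per_net_latent)
```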

5 Experimental results

In this section, we introduce the development environment and the three public clothing datasets. The performance is then compared with existing systems on these datasets.

5.1 Development environment

Because the number of parameters to be optimized in CNNs is very large, the training process consumes a large amount of memory and much CPU/GPU time. Among well-known deep learning frameworks such as Caffe (Jia et al. 2014), cuda-convnet (Krizhevsky 2012), Decaf (Donahue et al. 2013), and OverFeat (Sermanet et al. 2014), Caffe was selected because it provides CPU/GPU computation and pre-trained models suitable for the transfer learning in our problem. Caffe (Jia et al. 2014) is a framework written in C++, with CUDA used for GPU computation and well-supported bindings to Python and MATLAB for training and deploying general-purpose convolutional neural networks. It is developed and maintained by the Berkeley Vision and Learning Center (BVLC) and supports large-scale applications and research prototypes in vision and multimedia. The experiments were performed on one PC with a 3.6 GHz four-core CPU, 16 GB of memory, and an NVIDIA GTX 980 graphics card with 2048 CUDA cores and 4 GB of GDDR5 memory.

5.2 Dataset and evaluation metric

We conducted the experiments on three public clothing datasets. The first dataset (Chen et al. 2015a) collects 800 images of ten style classes from fashion websites and online shopping stores, with sizes ranging from \(102 \times 62\) to \(540 \times 150\) pixels. The second dataset (Kiapour et al. 2014) contains 1893 images of five different fashion styles: bohemian, goth, hipster, pinup, and preppy. Note that all the images can be downloaded, but further information, such as the degree to which a clothing image is related to one specific style as provided by Internet users' clicks, is not available. The third dataset is the largest and contains 80,000 images with 15 classes of clothing and 78 attributes (Bossard et al. 2013). Figure 6 shows examples from the three datasets.

To analyze the proposed architectures, all three datasets were used to evaluate the classification performance. In addition, the second and third datasets were used for the performance evaluation of retrieving fashion styles and clothing types, respectively. Once an architecture is set, the retrieval task can be performed by feeding a query image \(I_q\) into the architecture, and the outputs of the latent layers are treated as the feature vector \(f_q\). The retrieval results can be obtained by selecting the top k images from the source images with the minimal Euclidean distance, given by

$$\begin{aligned} s(I_q ,I_n )=\left\| {f_q -f_n } \right\| _{2} , \end{aligned}$$
(17)

where \(f_n \) is the feature vector of the nth source image in the pool. Specifically, in Architecture 2, all source images and the query image \(I_q \) are fed into the M deep nets, and the average pooling of the latent-layer responses across all nets is applied as in Eq. (14) to obtain \(f_n \) and \(f_q\). Note that the lengths of \(f_n \) and \(f_q \) equal the number of nodes in the latent layer (64, 128, or 256 in the experiments). The training data can be used as the source images, and the corresponding feature vectors can thereby be extracted off-line. In Architecture 3 and Architecture 4, \(f_n \) and \(f_q \) are obtained by concatenating the latent-layer responses of each net as in Eq. (16), i.e., the lengths of \(f_n \) and \(f_q \) equal the number of classes C multiplied by the number of nodes in the latent layer. After extracting the feature vector of the query image, the top k images can be retrieved by Eq. (17). To evaluate the retrieval performance, a ranking-based criterion is applied (Lin et al. 2015; Deng et al. 2011). The precision of the top k ranked images with respect to a query image is defined as

$$\begin{aligned} \hbox {pre}_k =\frac{\sum _{n=1}^k {\hbox {RL}(n)} }{k}, \end{aligned}$$
(18)

where \(\hbox {RL}(n)\) denotes the ground-truth relevance between the query image and the nth ranked image, and \(\hbox {RL}(n)\in \{ {0,1} \}\) takes the value 1 when the query and the nth image share the same class label and 0 otherwise.
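Equations (17) and (18) amount to a nearest-neighbor search followed by a label check; a short NumPy sketch:

```python
import numpy as np

def retrieve_top_k(f_q, source_feats, k=5):
    """Eq. (17): indices of the k source images closest to the query in L2."""
    dists = np.linalg.norm(source_feats - f_q, axis=1)
    return np.argsort(dists)[:k]

def precision_at_k(query_label, source_labels, ranked_idx):
    """Eq. (18): RL(n) = 1 when the n-th ranked image shares the query's
    class label; precision is the mean of RL over the top k."""
    rl = (np.asarray(source_labels)[ranked_idx] == query_label)
    return float(rl.mean())
```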

5.3 Classification and retrieval results

Related clothing style recognition studies (Bossard et al. 2013; Chen et al. 2015a; Kiapour et al. 2014) used hand-crafted low- or mid-level features, such as RGB and Lab color features and the LBP, HOG, SURF, and MR8 texture responses. After feature extraction, machine learning techniques such as Random Forests and SVMs are applied for classification. However, because feature extraction and classifier learning are independent, the relations between them are ignored; Deep Learning merges these two processes into one training path. To compare with Chen et al. (2015a), the number of softmax output neurons in the deep net is set to 10, corresponding to the ten clothing styles in the first dataset. Each training/test image is resized to \(256 \times 256\) pixels, keeping the aspect ratio with zero padding. The deep net is trained for 10,000 iterations with a batch size of 64. The classification results are shown in Table 1. The best accuracy achieved by Chen et al. (2015a) is 68.2%, obtained using the sparse representation of a feature vector that concatenates Lab color, HOG, and LBP features with 96, 1025, and 512 dimensions, respectively. In contrast, the accuracy rate of Architecture 1 is 90.0 or 91.0% using the Flickr or ImageNet dataset as the pre-training dataset, respectively, a significant improvement of more than 20%. The performance of Architecture 1 without fine-tuning was also evaluated: not only did the training time increase dramatically, but only a 68% classification rate was achieved. Thus, the fine-tuning mechanism is recommended.

Table 1 Accuracy on the first clothing dataset (Chen et al. 2015a) of ten classes
Table 2 Performance comparison on the second clothing dataset with various node numbers in the latent layer and classification rules for Architectures 3 and 4

To analyze system performance with different latent layer designs, 64, 128, and 256 nodes are used in the latent layer of each net. Table 2 shows the accuracy rates with various node numbers and classification rules for Architecture 3 and Architecture 4 on the second dataset. We observe that the effect of the node number in the latent layer is not significant for the different classification rules; for each classification rule, the difference in accuracy is about 1%. In addition, the accuracy of Architecture 3 is overall higher than that of Architecture 4, because only \(1/4\) of the negative-class data are trained with the positive class in each net of Architecture 4.

Table 3 Performance comparison on the second clothing dataset between Kiapour et al. (2014) and the proposed method with different architectures and classification rules
Table 4 Confusion matrix on the second dataset using Architecture 3
Fig. 7

Image retrieval precision of the second dataset for Architecture 2

Fig. 8

Top 5 ranked images in the second dataset. From the top to the bottom row, the query images are from the Bohemian, Goth, Hipster, Pinup, and Preppy classes

Table 5 Performance comparison on the third clothing dataset with various node numbers in the latent layer and classification rules for Architectures 2 and 3
Table 6 Performance comparison on the third clothing dataset between Bossard et al. (2013) and the proposed method with different architectures and classification rules

Additionally, we evaluate the system performance with the different deep net architectures and compare with the work of Kiapour et al. (2014) on the second dataset. Note that in Kiapour et al. (2014), a parameter \(\delta \) is used to determine the percentage of the data used in classification, with \(\delta \) set from 0.1 to 0.5, where \(\delta =0.1\) represents the top 10% of the images. Images with a strong relation to one specific style are selected from each category, and the train-test process is executed 100 times with a 9:1 train-to-test ratio. According to those results, the best accuracy rate is about 75% with \(\delta =0.2\), and the worst rate is 70% with \(\delta =0.1\). Table 3 shows our results using all data with a 4:1 train-to-test ratio for the different architectures and classification rules. The accuracy rate using Architecture 1 is 67%, while a 77% accuracy rate is obtained using the proposed modified architecture, Architecture 3, with a 640-dimensional feature vector concatenated from the 128 latent-layer outputs of each net. Note that five deep nets are used in Architecture 3. It is hard to judge the improvement compared with Kiapour et al. (2014) because the numbers of training/test data are not the same; however, it is noted that no prior segmentation pre-processing is performed in our work. Architecture 2 and Architecture 4 are also tested on the same dataset to simulate the distributed computing environment. The difference between these two architectures is that the numbers of training data for both the positive and negative classes are reduced in Architecture 2, while in Architecture 4 only the negative-class data are reduced and the number of positive data is the same as in Architecture 3. It is observed that, compared with Architecture 1, Architecture 2 provides a higher accuracy rate, about a 5% improvement, when the classification rule of comparing with the mean vector under the Euclidean distance is applied. However, for Architecture 4, there is a 5-7% accuracy decrease compared with Architecture 3. This is predictable because overfitting may occur when less negative training data are used to estimate a large number of parameters. In contrast, when sub-data are used for both the positive and negative classes, the reduction might not cause severe degradation in the learning of discriminant features, because the multiple nets can compensate for each other to obtain better classification performance. Moreover, Table 4 shows the confusion matrix using Architecture 3. It can be observed that the Bohemian class is more easily recognized, whereas the hipster and preppy classes, which are more abstract and harder to define, result in low recognition rates and are easily confused.

To investigate the retrieval performance for fashion styles, we used all the test data as query images to obtain the precision among the top k ranked images in the second dataset. Note that we conducted an exhaustive search over the query and database images based on the \(L_{2}\)-norm distance between the latent-layer feature vectors, as shown in Eq. (17); the precision among the top k ranked images can then be obtained. Figure 7 shows the precision results using various latent-layer lengths for Architecture 2, and Fig. 8 shows examples of clothing retrieval: given a query image, the corresponding top 5 images are retrieved by Architecture 2.

Fig. 9

The learning curve with various training iterations with Architecture 1

Table 7 Confusion matrix on the third dataset using Architecture 2 with 128 nodes in the latent layer, where the test data are classified based on the softmax-layer output
Table 8 Confusion matrix on the third dataset using Architecture 3 with 128 nodes in the latent layer, where the test data are classified based on the softmax-layer output

Since the first and second datasets contain fewer than 2000 images each, the third dataset (Bossard et al. 2013), which contains more than 80,000 clothing images in 15 classes, was used to evaluate the proposed architectures on a larger scale. According to the number of classes, the deep net has 15 neurons in the output layer, and 100,000 training iterations with a batch size of 100 are used. First, we evaluate the effect of the node number on system performance, with the number of nodes in the latent layer set to 64, 128, and 256. Table 5 shows the accuracy with different classification rules and node numbers in the latent layer for Architecture 2 and Architecture 3. We observe that, for each classification rule, the differences in accuracy between node numbers are not significant for Architecture 2. However, for Architecture 3, the accuracy decreases with increasing node number when the k-NN rule or the Euclidean distance to the mean vector is applied. This is because the within-class variations in the third dataset are larger than those in the first and second datasets, and the risk of overfitting is higher when longer feature vectors are used. Note that for Architecture 3, the length of the feature vector, as shown in Eq. (16), is the number of classes multiplied by the number of nodes in the latent layer; in other words, the length is 1920 or 3840 for 128 or 256 nodes, respectively. Hence, the classification rule based on the softmax layer is recommended for Architecture 3.

Fig. 10

Image retrieval precision of the third dataset for Architecture 3

Fig. 11

Top 5 ranked images in the third dataset. From the top to the bottom row, the query images are from the blouses, cloak, coat, jacket, long, sport shirt, robe, shirt, short, suit, sweater, t-shirt, upper, uniform, and waistcoat classes

Table 6 shows the comparison with the study of Bossard et al. (2013); the accuracy rate with the best parameter, i.e., node number, is listed. Note that each deep net is fine-tuned from the model pre-trained on the ImageNet dataset. In Bossard et al. (2013), the authors applied Random Forests capable of transfer learning from different domains and achieved a 41.38% average accuracy, higher than the 35.07% of the SVM baseline. The deep net with Architecture 1 provides an accuracy of 59.5%, an 18% improvement over Bossard et al. (2013). Figure 9 shows the learning curve over training iterations for Architecture 1; it can be observed that the system is stable after 30,000 iterations. Moreover, Tables 7 and 8 show the confusion matrices using Architecture 2 and Architecture 3, respectively, with 128 nodes in the latent layer and the test data classified based on the softmax-layer output. From the classification results, it can be observed that the blouses and waistcoat classes are classified worse and are easily confused with other classes in both cases, while the shirt and upper classes are classified better. Additionally, the number of classes with at least a 50% classification rate is 8 in Table 7 and 10 in Table 8. This result is consistent with Architecture 3 having a higher classification rate than Architecture 2 in Table 5.

In addition, we randomly selected 2000 images as query images to obtain the precision of the top k ranked images in the third dataset. The experimental settings, including the distance measurement and feature vectors, were the same as for the second dataset. Figure 10 shows the precision results using various latent-layer lengths for Architecture 3, and Fig. 11 shows examples of clothing retrieval for each class: given a query image, the corresponding top five images are retrieved.

6 Conclusion

Inspired by the robust classification ability of Deep Learning and its capacity to analyze massive volumes of data in the Big Data era, we applied deep nets to clothing style recognition and proposed three modifications of the deep architecture for the distributed computing environment. Compared with existing approaches that use hand-crafted low-level features and machine learning algorithms with shallow structures, the proposed method provides more promising results on the three public clothing datasets, particularly on the large dataset with 80,000 images, where an 18% improvement in accuracy was achieved. Also, the proposed architectures, Architecture 2 and Architecture 4, which use sub-training data for either the positive or negative classes, do not cause severe accuracy degradation. Thus, these architectures provide one feasible way to apply Deep Learning to Big Data Analytics in a distributed computing environment.