1 Introduction

With the development of e-commerce platforms and information technology, the quantity of image data on e-commerce platforms has increased dramatically, and how to effectively classify, retrieve and organize this massive quantity of images has become an urgent problem. Managing these images requires acquiring the attribute information they contain. However, e-commerce images often carry complex and diverse semantic information rather than a single attribute. A two-attribute image is an image that contains two attributes: an e-commerce image has not only a “type” attribute but also a “color” attribute, so a commodity image can be described as, for example, a “white dress.” Each attribute describes the e-commerce image from a different aspect [1]. Customers have their own color preferences when shopping, so they focus not only on the type of a commodity but also on its color and other information. Only by acquiring more of the information contained in commodity images can e-commerce platforms better manage their image data and serve their customers.

Convolutional neural networks (CNNs) are common deep learning models that are widely used for the automatic extraction of image features. They combine low-level features into abstract high-level features and then classify images with a classifier such as softmax according to the extracted features [18]. CNNs are now widely used in image classification because of their excellent performance [8, 15, 25, 33,34,35, 43]. However, the traditional convolutional neural network is a single-task learning model that classifies images according to a single label. In e-commerce image classification, the category information of commodities, such as “shoes” and “shirts,” is mainly used. However, relying on the “type” attribute alone ignores important color information and cannot meet the needs of e-commerce image classification [16].

Two-attribute image classification can be considered a kind of multi-task learning (MTL) [4, 12, 20, 28]. The simplest method divides the task into simple, independent single tasks, learns each separately and then merges the results; that is, several different convolutional neural networks are constructed to learn the “type” and “color” information of the commodity image. However, decomposing the two attributes of a commodity image into independent attributes neglects the correlation between them: low-level features of an image, such as edges, are shared by the two attributes.

Two-attribute image classification is also a kind of multi-attribute image classification, which many scholars have studied. Li et al. [19] constructed the DeepMAR model for pedestrian multi-attribute image classification. Wang et al. [37] used a convolutional neural network based on an improved triplet loss function to classify vehicle multi-attribute images. Bossard et al. [6] fused low-level image features such as HOG, SURF and LBP and used transfer forests to classify multi-attribute e-commerce images. Liu et al. [22] divided the human body into several regions and extracted color histogram features of these regions for human body image classification. Bao et al. [3] proposed a convolutional neural network method based on metric learning for multi-attribute clothing image classification and retrieval. Ak et al. [2] classified multi-attribute e-commerce images with a method based on unsupervised segmentation and a convolutional neural network.

Recently, transfer learning has attracted considerable research attention. Drawing on the idea of parameter transfer in transfer learning, this paper proposes a new two-task learning method based on an improved convolutional neural network for classifying two-attribute images. The network has two channels, each responsible for learning a different attribute of the image. By sharing network parameters, the complexity of the network is reduced, its interpretability is better, and its generalization ability is improved. The method is applied to two-attribute e-commerce image classification, with the two channels corresponding to the learning tasks of the two attributes. Experiments show that the two learning tasks help each other learn, accelerate the convergence of the whole network and achieve a good classification effect.

2 Related algorithm

2.1 Deep learning and CNN

Deep learning has developed rapidly in recent years and has been widely used in image processing, including image classification, object detection, video processing and other fields [26, 27]. For example, Zhao et al. [41] proposed a 3D CNN architecture for facial expression recognition. Shu et al. [31, 32] proposed a novel fine-grained dictionary learning method for image classification and a novel H-LSTCM for recognizing human interactions. Bui et al. [7] proposed a deep learning-based approach that generates high-resolution photorealistic point renderings. Li et al. [21] proposed a CNN architecture to improve stereo-segmentation performance.

As a common deep learning model, a convolutional neural network can take original images as input directly without complex preprocessing, avoiding the manual extraction of complex low-level features and data reconstruction.

A CNN is generally composed of convolutional layers, pooling layers, fully connected layers and a classifier; the structure is shown in Fig. 1. The network structure is optimized and the parameters are adjusted with the back-propagation algorithm. The convolutional layers use weight sharing, so the network can extract image features with fewer parameters, which reduces network complexity [18].

Fig. 1

The structure of a convolutional neural network
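To make the structure concrete, the following is a minimal tf.keras sketch of the conv-pool-FC-softmax pattern just described; the layer sizes are illustrative only and are not the parameters of the network proposed in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal conv-pool-FC-softmax pattern; layer sizes are illustrative only.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(200, 200, 3)),
    layers.MaxPooling2D((2, 2)),            # pooling layer
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),   # fully connected layer
    layers.Dense(6, activation='softmax'),  # softmax classifier (e.g., 6 classes)
])
# Back-propagation adjusts the shared convolutional weights during training.
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
```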

2.2 Transfer learning

Transfer learning stores the knowledge gained in solving one problem and applies it to a different but related problem. The more factors two tasks share, the easier transfer learning becomes [24, 34]. The transfer learning process is shown in Fig. 2.

Fig. 2

The process of transfer learning

Parameter-based transfer learning is one kind of transfer learning. It assumes that some prior distributions or model parameters can be shared between the source and target domains and transferred during the training process [5, 13, 23, 38].

Yu et al. [39] used prior knowledge in a Gaussian process to establish connections among multiple tasks. Evgeniou and Pontil [12] proposed a transfer learning method based on a regularization framework in which SVM parameters are transferred. Bonilla et al. [5] proposed a transfer method using a covariance matrix to construct prior knowledge between source domain tasks and target domain tasks. Finkel and Manning [13] proposed a transfer learning algorithm based on hierarchical Bayesian prior knowledge.

2.3 Mix-up algorithm

CNNs have very large numbers of parameters and therefore require many labeled images for training. With insufficient training examples, a CNN suffers degraded performance. Many solutions to this problem have been proposed; for example, Shu et al. [30] proposed a novel weakly shared deep transfer network. Among these solutions, the most common is data augmentation.

Mix-up is a data augmentation algorithm proposed by Zhang et al. at ICLR 2018. It generates new samples by linear interpolation, which encourages the neural network to behave linearly between training samples [40].

To reduce the risk of over-fitting and improve the generalization of neural networks, large numbers of labeled samples are usually used for training, but acquiring large amounts of labeled data is costly, and in some cases, such as medical lesion images, more data simply cannot be obtained. Vicinal risk minimization (VRM) addresses insufficient training data by constructing neighborhood values of the training samples through prior knowledge, such as image flipping, image rotation and noise addition.

Mix-up, proposed by Zhang et al. [40], is a generic vicinal distribution. The calculation formula is as follows:

$$ \mu \left( {\tilde{x},\tilde{y}|x_{i} ,y_{i} } \right) = \frac{1}{n}\mathop \sum \limits_{j}^{n} {\mathbb{E}}_{\lambda}\left[ {\delta \left( {\begin{array}{*{20}c} {\tilde{x} = \lambda \cdot x_{i} + \left( {1 - \lambda } \right) \cdot x_{j} ,} \\ { \tilde{y} = \lambda \cdot y_{i} + \left( {1 - \lambda } \right) \cdot y_{j} } \\ \end{array} } \right)} \right] $$

where \( \lambda \sim Beta\left( {\alpha ,\alpha } \right) \) for \( \alpha \in \left( {0,\infty } \right) \). Virtual new samples can then be generated by sampling from the mix-up distribution. The calculation formulas are as follows:

$$ \tilde{x} = \lambda x_{i} + \left( {1 - \lambda } \right)x_{j} $$
$$ \tilde{y} = \lambda y_{i} + \left( {1 - \lambda } \right)y_{j} $$

where \( (x_{i} ,y_{i}) \) and \( (x_{j} ,y_{j}) \) are two arbitrary samples in the dataset and \( \lambda \in \left[ {0,1} \right] \). The mix-up hyperparameter \( \alpha \) controls the mixing strength of the two samples, and \( (\tilde{x},\tilde{y}) \) is the new sample generated by linear interpolation.
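As a concrete illustration, the following is a minimal NumPy sketch of this interpolation step; the function name and the default value of \( \alpha \) are ours (Sect. 3 sets \( \alpha = 100 \)).

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=100.0):
    """Generate one virtual sample (x_tilde, y_tilde) by linear interpolation.

    x_i, x_j: image arrays of the same shape; y_i, y_j: one-hot label vectors.
    lambda is drawn from Beta(alpha, alpha); alpha controls the mixing strength.
    """
    lam = np.random.beta(alpha, alpha)
    x_tilde = lam * x_i + (1.0 - lam) * x_j
    y_tilde = lam * y_i + (1.0 - lam) * y_j
    return x_tilde, y_tilde
```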

The mix-up algorithm is simple, effective, data-independent and versatile. Research shows that mix-up reduces the generalization errors of state-of-the-art neural network models (ResNet-50, ResNet-101, ResNeXt-101, etc.) on datasets such as ImageNet, CIFAR-10 and CIFAR-100, and it also reduces sensitivity to adversarial examples and training instability [11, 17, 36].

2.4 Grad-CAM algorithm

Grad-CAM is an algorithm proposed by Selvaraju et al. [29] (published at ICCV 2017). It addresses the interpretability of CNN classification by showing the key decision-making areas of the CNN in the form of a heat map. Grad-CAM improves on the CAM algorithm [42]: it does not change the structure of the original CNN model, so the CNN does not need to be retrained, and it obtains the interpretability of the CNN model without the accuracy reduction of the CAM algorithm. The Grad-CAM algorithm has therefore been widely used and has been extended to the analysis of image description and visual question answering [9, 10, 14].

Studies have shown that deeper convolutional layers in a CNN capture higher-level visual features of images. The later convolutional feature maps extract high-level image features and retain the spatial information of those features. However, this spatial information is lost once the features are flattened into vectors and input to the fully connected layers. Therefore, the output features of the last convolutional layer of the CNN contain both detailed high-level semantics and the spatial information of the image.

The Grad-CAM algorithm uses the gradient information of the last convolutional layer of the CNN to explain the importance of each neuron to the final classification determination, which makes it possible to explain the classification decision and locate the key areas of the image. In the Grad-CAM algorithm, the weight of feature map \( k \) for target class \( c \) is defined as \( \alpha_{k}^{c} \) and calculated as follows:

$$ \alpha_{k}^{c} = \frac{1}{Z}\mathop \sum \limits_{i} \mathop \sum \limits_{j} \frac{{\partial y^{c} }}{{\partial A_{ij}^{k} }} $$

where \( Z \) is the number of neurons in a feature map (the same for every feature map), \( y^{c} \) is the classification score of the CNN for class \( c \), and \( A_{ij}^{k} \) is the value at point \( \left( {i,j} \right) \) of the \( k \)-th feature map. After calculating the weight of each feature map for each class, the heat map is described by the following formula:

$$ L_{\text{Grad-CAM}}^{c} = {\text{ReLU}}\left( {\mathop \sum \limits_{k} \alpha_{k}^{c} A^{k} } \right) $$
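This computation can be sketched as follows with tf.keras; the sketch uses the TensorFlow 2 GradientTape API for brevity (the experiments in this paper use TensorFlow 1.9), and the model and layer names are assumptions.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name):
    """Grad-CAM heat map: L^c = ReLU(sum_k alpha_k^c * A^k)."""
    # Auxiliary model exposing the last conv layer's output A^k together
    # with the class scores y^c (ideally taken before the softmax layer).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        feature_maps, scores = grad_model(image[np.newaxis, ...])
        y_c = scores[:, class_index]
    grads = tape.gradient(y_c, feature_maps)     # dy^c / dA_ij^k
    alpha = tf.reduce_mean(grads, axis=(1, 2))   # (1/Z) sum_i sum_j, shape (1, K)
    heatmap = tf.nn.relu(tf.reduce_sum(
        alpha[:, tf.newaxis, tf.newaxis, :] * feature_maps, axis=-1))
    return heatmap[0].numpy()
```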

For example, the Grad-CAM algorithm can explain the image classification support for the “cat” class and “dog” class by CNN. Figure 3a is the original image, Fig. 3b is the support area for the “cat” class, and Fig. 3c is the support area for the “dog” class. The redder the color, the more important the area is for a classification determination. The bluer the color, the less important the area is for a classification determination.

Fig. 3

Original image and support area for classification

3 Data acquisition and augmentation

Web crawler technology based on Python is used to download commodity images from TaoBao and TianMao. The image format is JPG, and the size is 200 × 200. Each image has two attributes, one is “type” and the other is “color.” The categories are shown in Fig. 4. According to the type of commodities, they are divided into six categories: dresses, high heels, suits, leather clothes, leather shoes and shirts; according to color, they are divided into five categories: gray-white, black, red, blue and yellow.

Fig. 4

Image categories in the dataset

In these images, there are relatively more “gray-white,” “black” and “red” commodity images, while there are fewer “blue” and “yellow” commodity images, and some specific types of commodity images are very few (such as “blue shoes” or “yellow shirts”). The total number of images is 15,000.

The two attributes describe the e-commerce images from different aspects. For example, in Table 1, the leather shoes can be described as (leather shoes, black), the dress can be described as (dresses, gray-white), and the suit can be described as (suits, blue).

Table 1 Two-attribute image instances

Convolutional neural networks have a large number of parameters and require many labeled images for training. To strengthen training, data augmentation methods were used to expand the image training set.

Traditional image data augmentation methods include flipping, rotation and noise addition. To preserve the visual appearance of the commodity images, three traditional data augmentation methods are used in this paper: horizontal flip, vertical flip and 90-degree rotation. Figure 5 shows one shoe image and one dress image processed by the three methods. The number of images after augmentation is 15,000 × 4 = 60,000.

Fig. 5

Results of images processed by three augmentation methods
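A minimal NumPy sketch of these three operations follows; the function name is ours.

```python
import numpy as np

def augment(image):
    """Return the original image plus the three traditional augmentations
    used in this paper, giving the x4 expansion of the training set."""
    return [
        image,
        np.fliplr(image),  # horizontal flip
        np.flipud(image),  # vertical flip
        np.rot90(image),   # 90-degree rotation
    ]
```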

Although these three traditional data augmentation methods increase the number of training images, some specific kinds of commodity images (blue shoes, etc.) are still very scarce. Their proportion of the total has not changed, so the problem of class imbalance remains.

To address this problem and to further increase the number of specific types of images and reduce the risk of over-fitting, the mix-up algorithm is used to process the image dataset.

Assume the training set can be expressed as \( \left( {x_{1} ,y_{1} ,z_{1} } \right), \left( {x_{2} ,y_{2} ,z_{2} } \right), \ldots ,\left( {x_{n} ,y_{n} ,z_{n} } \right) \), where \( x_{i} \in R^{600 \times 600 \times 3} \), \( y_{i} \in R^{6} \) and \( z_{i} \in R^{5} \). \( x_{i} \) is a third-order tensor: 600 × 600 is the size of each image, and 3 is the number of channels, red (R), green (G) and blue (B). \( y_{i} \) is a six-dimensional one-hot vector whose dimensions correspond to the six categories dress, high-heeled shoes, suit, leather coat, leather shoes and shirt. \( z_{i} \) is a five-dimensional one-hot vector whose dimensions correspond to the five categories gray-white, black, red, blue and yellow. The one-hot encoding forms of \( y_{i} \) and \( z_{i} \) are shown in Table 2.

Table 2 The one-hot coding forms

\( y_{i} \) and \( z_{i} \) represent the categories of “type” and “color” of the image, respectively. For example, when \( y_{m} = \left( {0,0,1,0,0,0} \right) \) and \( z_{m} = \left( {0,1,0,0,0} \right) \), the image \( x_{m} \) is a suit image, and the color is black.
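A small sketch of this two-attribute label encoding, with the category order taken from the descriptions above:

```python
import numpy as np

TYPES = ['dress', 'high-heeled shoes', 'suit', 'leather coat',
         'leather shoes', 'shirt']
COLORS = ['gray-white', 'black', 'red', 'blue', 'yellow']

def encode_labels(type_name, color_name):
    """Build the (y_i, z_i) one-hot pair for a two-attribute image."""
    y = np.zeros(len(TYPES));  y[TYPES.index(type_name)] = 1.0
    z = np.zeros(len(COLORS)); z[COLORS.index(color_name)] = 1.0
    return y, z

# A black suit: y_m = (0,0,1,0,0,0), z_m = (0,1,0,0,0), as in the example above.
y_m, z_m = encode_labels('suit', 'black')
```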

In this paper, the mix-up hyperparameter \( \alpha \) is set to 100, so \( \lambda \sim Beta\left( {100,100} \right) \); since this Beta distribution concentrates sharply around 0.5, the two images in a pair are mixed in nearly equal proportions. According to the previous formula, new image data \( (\tilde{x},\tilde{y}) \) can be generated. For convenience of tabular representation, \( \lambda \) is assumed to be 0.5, and the information of some samples generated by the mix-up algorithm is illustrated in Table 3.

Table 3 The information generated from some samples

The 60,000 image samples are denoted \( \mathcal{D} \). According to the number of images in each category, the 30 kinds of images are divided into three classes \( \mathcal{D}_{1} \), \( \mathcal{D}_{2} \) and \( \mathcal{D}_{3} \), where \( \mathcal{D} = \left\{ \mathcal{D}_{1}, \mathcal{D}_{2}, \mathcal{D}_{3} \right\} \) and \( Card\left( {\mathcal{D}_{1} } \right) > Card\left( {\mathcal{D}_{2} } \right) > Card\left( {\mathcal{D}_{3} } \right) \). The \( \mathcal{D}_{1} \) class contains the commodity images with the most common matchings of type and color, such as “black shoes,” “white shirts” and “red dresses.” The \( \mathcal{D}_{3} \) class contains the commodity images with unusual matchings of type and color, such as “blue shoes,” “yellow shirts” and “gray-white suits.” The \( \mathcal{D}_{2} \) class lies between \( \mathcal{D}_{1} \) and \( \mathcal{D}_{3} \) and contains commodity images with common matchings, such as “red high heels.”

The sampling strategy used with the mix-up algorithm in this paper is as follows. The images in the \( \mathcal{D}_{3} \) class are randomly paired with each other, generating \( Card\left( {\mathcal{D}_{3} } \right)/2 \) new samples. The images in the \( \mathcal{D}_{2} \) class are randomly paired with images in the \( \mathcal{D}_{3} \) class; since \( Card\left( {\mathcal{D}_{2} } \right) > Card\left( {\mathcal{D}_{3} } \right) \), a total of \( Card\left( {\mathcal{D}_{3} } \right) \) matching pairs, and thus \( Card\left( {\mathcal{D}_{3} } \right) \) new samples, are generated. The images in the \( \mathcal{D}_{1} \) class do not participate in sampling.
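A minimal sketch of this pairing strategy (the function and variable names are ours); each generated pair is then interpolated with the mix-up formula of Sect. 2.3 using \( \lambda \sim Beta\left( {100,100} \right) \).

```python
import random

def mixup_pairs(d2, d3):
    """Build the matching pairs described above: D3 with D3 (Card(D3)/2
    pairs) and D2 with D3 (Card(D3) pairs); D1 does not participate."""
    random.shuffle(d3)                                   # shuffled in place
    pairs = list(zip(d3[0::2], d3[1::2]))                # D3-D3 pairs
    pairs += list(zip(random.sample(d2, len(d3)), d3))   # D2-D3 pairs
    return pairs
```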

A large number of images containing \( \mathcal{D}_{3} \)-class content are generated by this method, increasing the proportion of such content in the total data. The method is essentially an over-sampling method that addresses the problem of class imbalance to some extent [44].

Approximately 10,000 new samples are generated and, together with the original 60,000 images, used as training samples, improving the generalization of the neural network. Because of the particularity of these generated images, they participate only in the training process of the neural network in this paper and are not used for testing or other purposes.

4 Improved convolutional neural network

4.1 Two-channel convolutional neural network

The basic structure of a traditional convolutional neural network consists of input layers, convolutional layers, pooling layers, fully connected layers and a classifier, with convolutional and pooling layers usually connected alternately. In a convolutional layer, each feature map extracts a unique feature of the image: the lower convolutional layers extract low-level features such as image edges, and the higher convolutional layers extract high-level features such as texture. The deeper the network, the more complex the extracted image features.

In recent years, transfer learning has received extensive attention and research, and its success demonstrates the generality of the features extracted by deep learning. For the two e-commerce image attributes “type” and “color,” extracted features such as edges contribute to the classification of both. Fine-tuning is a commonly used transfer learning method [19, 31]: neural network parameters are initialized with existing parameters, migrating parts of a pre-trained model to other tasks so that the network learns from a good starting point, which saves considerable time when training new tasks.

In this paper, the idea of fine-tuning is used for reference. The proposed convolutional neural network improves on a simplified AlexNet. The workflow of the network is shown in Fig. 6, and the simplified network structure is shown in Fig. 7; the model is simple and easy to implement. The former part of the network, like a traditional network, has four convolutional layers, each followed by a pooling layer, and the two attributes share the parameters of these first four convolutional layers. From the fourth pooling layer, the network splits into two channels. Each channel consists of two convolutional layers, one pooling layer, three fully connected layers and a final softmax classifier. The first channel trains on and classifies commodity images by the “type” attribute, and the second channel by the “color” attribute.

The parameters of the former part of the network are listed in Table 4 and those of the latter part in Table 5. Max-pooling is used throughout the network. In the latter part, the parameters of the two channels are basically the same, but the last fully connected layer of the first channel outputs a six-dimensional vector corresponding to the six “type” categories, while that of the second channel outputs a five-dimensional vector corresponding to the five “color” categories. The two vectors are input to the two softmax classifiers; the larger an output value, the higher the probability that the image belongs to the corresponding category.

Fig. 6

The workflow of the proposed network. When an image is input, the two channels work simultaneously to predict the image’s “type” and “color” categories

Fig. 7

The improved structure of the convolutional neural network (C1 is short for convolutional layer 1, P1 is short for pooling layer 1, and so on)

Table 4 The former part of the network parameters
Table 5 The latter part of the network parameters
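As an illustration, the two-channel structure can be sketched in tf.keras as follows. The filter depths at P4 (256) and at C5/C7 and C6/C8 (128) follow the dimensions reported in Sect. 5.3; the kernel sizes, the earlier filter counts and the fully connected widths are illustrative stand-ins for the exact values in Tables 4 and 5.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Shared trunk: four convolutional layers, each followed by max-pooling
# (C1-P1 ... C4-P4 in Fig. 7); P4 outputs 12 x 12 x 256 for a 200 x 200 input.
inputs = layers.Input(shape=(200, 200, 3))
x = inputs
for filters in (32, 64, 128, 256):
    x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)

def channel(x, n_classes, name):
    """One attribute channel: conv (C5/C7), pool (P5/P7), conv (C6/C8),
    then three fully connected layers ending in a softmax classifier."""
    x = layers.Conv2D(128, (3, 3), padding='same', activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                      name=name + '_last_conv')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)  # FC widths illustrative
    x = layers.Dense(256, activation='relu')(x)
    return layers.Dense(n_classes, activation='softmax', name=name)(x)

type_out = channel(x, 6, 'type')     # six "type" categories
color_out = channel(x, 5, 'color')   # five "color" categories
model = Model(inputs, [type_out, color_out])
```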

When an image is input, the two learning tasks are carried out simultaneously, and the two classifiers simultaneously predict the “type” and “color” categories of the image. By sharing the parameters of the first four convolutional layers, the network transfers general knowledge between the two learning tasks, which reduces the scale of the parameters of the whole model and makes prediction more efficient.

Following the transfer learning training process, training is divided into two steps. (1) First, the whole network is pre-trained through the first channel, corresponding to the most important attribute of e-commerce images, the “type” attribute; a higher learning rate is chosen to speed up convergence. (2) Second, after pre-training, the two channels are used to train the network alternately; a lower learning rate is chosen to prevent the network from oscillating. Both channels adopt stochastic gradient descent, which is commonly used in convolutional neural networks, to adjust the network parameters. The purpose of pre-training is to let the network learn from a good starting point, accelerating the convergence of the second channel corresponding to the “color” attribute and improving the training efficiency of the whole network.

In step (2), each iteration consists of two parts: the first channel trains once, and then the second channel trains once. The network parameters are thus adjusted twice per iteration, through two forward and two backward propagation passes. Because the two attributes differ in complexity, their loss functions converge at different speeds and the training speeds are inconsistent; therefore, several groups of learning rates are used for training and testing for comparison.
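A minimal sketch of this two-step procedure, reusing the two-output `model` sketched above (batching and data loading are omitted; the learning rates follow the values compared in Sect. 5):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
opt = tf.keras.optimizers.SGD(1e-4)  # higher rate for step (1)

def train_channel(model, x, y_true, channel_index):
    """One forward and one backward pass through a single channel."""
    with tf.GradientTape() as tape:
        preds = model(x, training=True)          # [type_pred, color_pred]
        loss = loss_fn(y_true, preds[channel_index])
    grads = tape.gradient(loss, model.trainable_variables)
    # The inactive channel's private layers receive no gradient (None); skip.
    opt.apply_gradients([(g, v) for g, v in
                         zip(grads, model.trainable_variables)
                         if g is not None])
    return loss

# Step (1): pre-train through the "type" channel only.
#   for x, y_type, _ in batches: train_channel(model, x, y_type, 0)
# Step (2): lower the learning rate, then alternate the channels; one
# iteration adjusts the parameters twice (two forward, two backward passes).
#   opt.learning_rate = 1e-5
#   for x, y_type, y_color in batches:
#       train_channel(model, x, y_type, 0)
#       train_channel(model, x, y_color, 1)
```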

The proposed method has better interpretability: each channel is responsible for extracting different features of the image, and the learning rate of each channel can be adjusted according to the complexity of its features. The network shares low-level image features such as edges, which reduces the number of network parameters, optimizes training results and reduces network training time to a certain extent.

4.2 Improved Grad-CAM network

The original structure of the Grad-CAM network is shown in Fig. 8. The input image can be expressed as \( \left( {x_{i} ,y_{i} } \right) \), where \( x_{i} \) is the content of the image and \( y_{i} \) is its label (such as cat or dog). \( y_{i} \) is a one-hot vector: only the element corresponding to the category to which the image belongs is one, and all other elements are zero. After an image is input into the CNN, high-order features are extracted by the convolutional layers through forward propagation. The gradient information of the output features of the last convolutional layer and the classification scores before the softmax layer are calculated, and the heat maps of the corresponding categories are drawn using the weighting formulas. Because the algorithm considers only the influence of positive values on the final classification results, the ReLU function removes negative values from the calculation, reducing the influence of irrelevant categories on the final rendering.

Fig. 8

Original structure of the Grad-CAM network

The input image in this paper can be expressed as \( \left( {x_{i} ,y_{i} ,z_{i} } \right) \), where \( x_{i} \) is the content of the image, \( y_{i} \) is its “type” label, and \( z_{i} \) is its “color” label; \( y_{i} \) and \( z_{i} \) are both one-hot vectors. However, the traditional Grad-CAM network has only one channel and can draw the heat map for only a single attribute label, so it cannot draw the heat maps of two attributes simultaneously.

To address this problem, an improved structure of the Grad-CAM network is proposed in this paper: a channel is added to the traditional Grad-CAM network to match the proposed two-channel CNN. The improved Grad-CAM network can draw the heat maps of the “type” and “color” attributes simultaneously, visualizing the key areas of the classification determination for both attributes.

The improved structure of the Grad-CAM network is shown in Fig. 9. After a commodity image is input into the proposed CNN, two kinds of high-order image features are extracted by the convolutional layers of the two channels. The gradient information of the output feature maps of the last convolutional layers of the two channels (C6 and C8 in Fig. 7) and the classification scores of the two channels before the softmax classifiers are calculated.

Fig. 9

Improved structure of the Grad-CAM network

The weight of feature map \( k \) of channel \( l \) for target class \( c \) is defined as \( \alpha_{lk}^{c} \), and the improved formula is as follows:

$$ \alpha_{lk}^{c} = \frac{1}{Z}\mathop \sum \limits_{i} \mathop \sum \limits_{j} \frac{{\partial y^{{c^{\left( l \right)} }} }}{{\partial A_{lij}^{k} }} $$

where \( Z \) is the number of neurons in a feature map (the same for every feature map, since the two channels have the same network structure), \( y^{{c^{\left( l \right)} }} \) is the classification score of channel \( l \) for class \( c \), and \( A_{lij}^{k} \) is the value at point \( \left( {i,j} \right) \) of the \( k \)-th feature map of channel \( l \). After calculating the weight of each feature map of the two channels for each class, the heat map is described by the following formula:

$$ L_{l}^{c} = {\text{ReLU}}\left( {\mathop \sum \limits_{k} \alpha_{lk}^{c} A_{l}^{k} } \right). $$
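In other words, the improved Grad-CAM applies the single-channel computation of Sect. 2.4 once per channel. A hypothetical usage sketch, reusing the `grad_cam` function and the two-channel `model` sketched earlier (the layer names are ours):

```python
import tensorflow as tf

# Single-output views of the two-channel model, so grad_cam() can be
# applied per channel with that channel's last conv layer (C6 / C8).
type_view = tf.keras.Model(model.inputs, model.get_layer('type').output)
color_view = tf.keras.Model(model.inputs, model.get_layer('color').output)

# e.g., class 2 = suit, class 1 = black, per the one-hot ordering of Table 2.
heat_type = grad_cam(type_view, image, 2, 'type_last_conv')
heat_color = grad_cam(color_view, image, 1, 'color_last_conv')
```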

5 Experiments and analysis

5.1 Experimental environment

The experimental environment adopted in this paper is Windows 10 + CUDA 9.0 + TensorFlow 1.9, and the graphics card used is a GTX 1080 Ti.

5.2 Experimental classification analysis of the proposed CNN

First, the network parameters are initialized randomly without pre-training, and training step (2) is carried out directly: the two channels are used to train the network simultaneously. The learning rate is set to 0.00001 and 0.0001 for comparison, and the number of epochs is set to 1000. The convergence curves of the network loss functions are shown in Fig. 10; for a more intuitive display, the vertical axis shows only the (0, 200) range. Figure 10a shows that the network tends to converge after 400 epochs with a learning rate of 0.00001; Fig. 10b shows that the loss declines rapidly at the beginning and converges after 200 epochs with a learning rate of 0.0001. Under both learning rates, the loss function of the “color” attribute converges relatively quickly while that of the “type” attribute converges relatively slowly, because the features of the “type” attribute are more complex and need more iterations to train. During training, the loss functions fluctuate noticeably, with small oscillations; even near convergence, the curves still occasionally oscillate, especially at the relatively high learning rate of 0.0001.

Fig. 10

Convergence curve of the loss function for two-attribute learning tasks without pre-training and with training step (2)

The classification results of the two groups of experiments with different learning rates are shown in Table 6. The first two rows are the classification results with a learning rate of 0.00001, and the last two rows are the classification results with a learning rate of 0.0001.

Table 6 Classification results without pre-training

Then, the network is pre-trained using the “type” attribute. During pre-training, the second channel of the network is suspended, and the learning rate is set to 0.00001 and 0.0001 for comparison; to compare with the previous experiment, the number of epochs is still set to 1000. The convergence curve of the “type” loss function during pre-training is shown as the yellow curve in Fig. 11a, b, and the black curve is the “type” loss curve from the two-channel training above. The figures show that the loss function converges relatively slowly when only the “type” channel is trained, whereas training with the “color” channel at the same time accelerates convergence. The experiments show that the two tasks corresponding to the two attributes help each other learn, speeding up the adjustment of the parameters of the first four layers and hence the convergence of the network. However, compared with training only the “type” channel, training both channels simultaneously improves training speed at the cost of more obvious loss oscillation.

Fig. 11

Contrast chart of the convergence curve of the “type” attribute loss function between pre-training and two-channel training

After pre-training has adjusted the parameters of the first four layers, training step (2) can start the training of the two attributes from a good starting point. To accelerate convergence, the pre-training consists of 400 epochs at the higher learning rate of 0.0001. In training step (2), the latter part of the network parameters of the second channel is initialized randomly, and on the basis of pre-training, the two channels are used to train the network alternately with learning rates of 0.00001 and 0.0001 for comparison. The convergence curves of the network loss functions are shown in Fig. 12. The initial value of the “color” loss is very high and declines rapidly, while the “type” loss has undergone pre-training, so its initial value is very low. With a learning rate of 0.00001, the network trains better, and the oscillation amplitude of the convergence curve is smaller.

Fig. 12

Convergence curve of the network loss function using two channels on the basis of pre-training

The loss curve of the “color” attribute in Fig. 12 is compared with that of the “color” attribute in Fig. 10 without pre-training; the comparison results are shown in Fig. 13, where Fig. 13a, b correspond to learning rates of 0.00001 and 0.0001, respectively. With pre-training, the convergence speed is obviously improved, the loss curve declines faster, and the oscillation amplitude is relatively smaller.

Fig. 13

Contrast chart of the convergence curve of the “color” attribute loss function in step (2) between using pre-training and not using pre-training

The final classification results of the network with pre-training are shown in Table 7. Compared with the results in Table 6, the accuracy rates on both the training set and the test set are improved. The experiments show that the proposed method, by sharing low-level network parameters, improves the convergence speed of the network, optimizes training results and effectively classifies two-attribute images.

Table 7 Classification results using pre-training

5.3 Sparse rate analysis of CNN output

To study the complexity of the “type” and “color” attributes of e-commerce images, the output feature matrices of different network layers are examined. The high-level features of “type” and “color” are composed of many low-level features, which the convolutional neural network uses to determine the “type” and “color” of an image. In a convolutional neural network, most neurons in a convolutional layer output 0, so the output feature matrix is sparse: a neuron is activated only when the image contains the corresponding low-level feature, and other neurons are unaffected. Therefore, the less sparse the output feature matrix, the more complex the feature. Several test images are input into the network, and the average sparse rates are counted for the 12 × 12 feature matrices (depth = 256) output by pooling layer P4, the 12 × 12 feature matrices (depth = 128) output by convolutional layers C5 and C7, and the 6 × 6 feature matrices (depth = 128) output by pooling layers P5 and P7, as shown in Table 8; the comparison is shown in Fig. 14.

Table 8 Average sparse rate of feature matrices
Fig. 14

Experiment comparison chart of two channels

The feature matrices output by P4 are shared by the two attributes; those output by C5 and P5 correspond to the “type” attribute, and those output by C7 and P7 correspond to the “color” attribute. As shown in Table 8 and Fig. 14, the output sparse rate of convolutional layer C7 is much higher than that of convolutional layer C5 because the “color” attribute is relatively simple and can be represented with fewer features. The output sparse rates of pooling layers P5 and P7 are much lower than those of convolutional layers C5 and C7 because the network uses max-pooling, which filters out many zero values during pooling. The experiments show that the “type” attribute is more complex than the “color” attribute, which is consistent with everyday knowledge.
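The sparse rate itself is straightforward to compute; a minimal sketch follows (the function name is ours, and the activation arrays are assumed to be a layer's outputs collected over the test images):

```python
import numpy as np

def sparse_rate(feature_maps):
    """Fraction of zero activations in a layer's output feature matrices.
    A higher sparse rate indicates a simpler attribute representation."""
    return float(np.mean(feature_maps == 0.0))

# e.g., averaged over a batch of test images for a given layer:
#   rate_c5 = sparse_rate(activations_c5)  # "type" channel, C5
#   rate_c7 = sparse_rate(activations_c7)  # "color" channel, C7
```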

5.4 Baseline classification results

To validate the proposed method, another experiment is designed as a baseline, in which traditional CNNs are used to classify the two attributes.

First, the network is pre-trained using the “type” attribute, as in the experiments above. Then, keeping the trained network parameters unchanged, the latter part of the network is divided into two parts by channel, and two new networks are reconstructed; the former parts of the two networks are the same. The two reconstructed networks are shown in Fig. 15.

Fig. 15

The two reconstructed networks

Next, the two networks are trained separately: the first network is trained according to the “type” attribute, and the second network according to the “color” attribute. The hyperparameters are the same as in the experiments above. The final classification results are shown in Table 9. Since this setup applies only traditional CNNs to the classification of the two attributes, it can be regarded as the baseline.

Table 9 Classification results when training separately

Compared with the results in Table 7, the classification accuracy when the two attributes are trained simultaneously in one network is higher than when they are trained in two separate networks. The comparison results validate the proposed method.

5.5 Heat map analysis of the improved Grad-CAM network

To analyze the support that the two CNN channels provide for the classification of the “type” and “color” attributes, the improved two-channel Grad-CAM network is used to draw the key-area heat maps of the two attributes, which improves the interpretability of the CNN structure proposed in this paper.

By inputting commodity images into the improved Grad-CAM network, the key-area heat maps of the “type” and “color” attributes can be drawn simultaneously. Some of the visual results are shown in Table 10: the first column contains the original images, the second column the heat maps of the “type” attribute, and the third column the heat maps of the “color” attribute. The redder the color, the more important the area is for the classification determination; the bluer the color, the less important it is.

Table 10 Heat maps of “type” and “color” attribute

Table 10 shows that the key areas for “type” classification are relatively concentrated: for shirts, suits and leather clothes they are concentrated in the chest; for dresses, in the lower half of the skirt hem; and for high heels and leather shoes, in the tips of the shoes. The differences between one kind of commodity and the others are mainly reflected in these parts; for example, leather shoes cannot have a skirt hem, and dresses cannot have shoe tips. These parts have unique characteristics and are the key areas for the CNN’s recognition.

In contrast, the key areas for “color” classification are relatively scattered, distributed over almost the whole body of the commodity, because the “color” attribute has relatively few unique features compared with the “type” attribute. In the second high-heels image, even the black pants are taken as support for recognizing the “black” class. Nevertheless, the two kinds of heat maps do not differ much, and the main key areas are concentrated on the body of the commodity. These experiments show that the two channels can share low-level network parameters and confirm, from another perspective, that the proposed method is effective.

5.6 Experiments on cats and dogs dataset

To validate the proposed method on traditional image classification, further experiments are performed on the Cats and Dogs Dataset.

The Cats and Dogs Dataset is a Kaggle competition dataset and a classic image dataset in machine learning, containing approximately 30,000 images of cats and dogs. We selected 700 white cat images, 700 black cat images, 700 white dog images and 700 black dog images; some of them are shown in Fig. 16. According to type, the images are divided into two categories, cats and dogs; according to color, they are divided into two categories, white and black.

Fig. 16

Some of the images selected

The traditional data augmentation methods were used to expand the dataset, and all images were processed to a size of 200 × 200. Two groups of experiments are carried out for comparison.

In the first experiment, the CNN structure is based on the network in Fig. 7: the first channel corresponds to the “type” attribute (cat or dog), and the second channel corresponds to the “color” attribute (black or white). Because there are only two kinds of images and the training images are relatively simple, the number of epochs for both steps is set to 500, and the learning rate is set to 0.00001.

First, the network is pre-trained by using the “type” attribute. Then, two channels are used to train the network simultaneously. The final classification results are shown in Table 11.

Table 11 Classification results when training simultaneously in one network

In the second experiment, the CNN structure is based on the networks in Fig. 15. The learning rate is set to 0.00001, and the epochs are set to 500. First, the network is pre-trained using the “type” attribute. Then, the latter part of the network is divided into two parts, and two new networks are reconstructed. Next, the two networks are separately trained. The final classification results are shown in Table 12.

Table 12 Classification results when training separately in two networks

The results show that the classification accuracy for cat and dog images when training simultaneously in one network is higher than when training in two separate networks. The comparison results validate the proposed method on traditional image classification.

6 Conclusion

This paper studies and designs a two-channel convolutional neural network model whose two channels can simultaneously learn two attributes of e-commerce images. The network draws on the idea of transfer learning. First, the network is pre-trained through the channel corresponding to the most important attribute of the image, optimizing the parameters of the former part of the network. Then, the two channels are used to train the network together. During training, the two learning tasks help each other by sharing parameters, which improves the convergence speed of the network and the generalization ability of the model. To address the scarcity of specific kinds of commodity images in the dataset and the resulting class imbalance, an over-sampling method based on the mix-up algorithm is proposed. The proposed method also achieves good classification results on the Cats and Dogs Dataset, validating it on traditional image classification.

Experiments show that the two attribute learning tasks help each other learn and accelerate the convergence of the network by sharing the low-level parameters. The relationship between the complexity of the two attributes and the sparse rate of the CNN output feature matrices is studied, and experiments show that the less sparse the output feature matrix, the more complex the attribute. An improved Grad-CAM algorithm is used to visualize and analyze the key areas for attribute classification, which improves the interpretability of the network. Experiments show that the proposed method has a good classification effect for two-attribute e-commerce images.