1 Introduction

Product design knowledge refers to the knowledge generated in the process of product design, derived from existing product data. Product data are very significant references for designing a new proposal and are usually presented as description documents, product images, etc. After reviewing and analysing previous product examples, designers can identify the features and styles in the description documents and product images created by skilled designers, seek inspiration, and then propose an initial design scheme. Both description documents and product images should therefore be acquired and analysed before designing a new proposal.

It is reported that the image projected by a product depends to a large extent upon its physical form, and certain critical features of the product image can affect a consumer's psychological response to the product [23]. Existing product images are thus crucial carriers of product design knowledge and can also serve as references for designing a new proposal.

In traditional product design tasks, product images are collected at the beginning of the task, usually by hand; according to some reports, this manual collection consumes up to 27% of the whole design task. More importantly, the data collected by individual designers cannot cover most product images. As the product design environment updates rapidly, it is difficult to grasp current product design trends and design concepts from only a small amount of product data. Besides, designing with only a small amount of product data easily results in duplicate designs and adverse legal and economic consequences. The rapid growth of product design calls for intelligent methods to accomplish the image collection task. At present, enterprise official websites and shopping platforms offer a huge amount of product data; instead of a manual process, big data and deep learning technology can be used for mass product data acquisition and analysis [31, 44].

Another crucial carrier of product design knowledge is the description document. However, most collected data contain only product images without description documents, or the description texts that can be obtained are chaotic, complex and useless for product design knowledge. Description documents are usually used when demonstrating new products to the market, so they mainly contain product attribute data, design style and design elements that semantically describe the products. In this case, a new idea is proposed to structure the semantic description with multi-label learning, since multi-label techniques can encode several characteristics such as time, style and function. We claim that multi-label learning can provide valuable information on product data, especially for description documents, where each product image has its corresponding semantic description.

Departing from traditional single-label classification, multi-label learning associates each sample with multiple labels simultaneously. In the product design field, however, product images mostly contain only the products themselves, which means the images would have one label under the traditional annotation norm. We argue that there are two differences between the traditional multi-label case and the case in this paper.

First, for multi-label annotation, labels in the traditional case represent certain objects present in images, so an annotator is asked to determine, for each label, whether an image contains the corresponding object. In our case, product attribute data and the elements of industrial design are also used to better describe the products, so it is hard to decide whether a product image "has" these labels or not. Instead, the annotator must decide, for each label, which of two alternatives applies to the image; for example, '1' means an image has the current label and '0' means it does not. As Sect. 4 presents, for the color-numbers label, binary '0' and '1' represent color counts less than or equal to 2 and greater than 2, respectively.

Second, for multi-label prediction, the traditional case allows any subset of labels to be assigned to each object. In our case, we have 20 labels drawn from five subsets (category, chromatic or achromatic color, color numbers, value of HSV color components, and morphology), and labels from all five subsets are assigned to each single object, while the features corresponding to each object differ substantially. Most importantly, only one subset of labels corresponds to the category of the object, so constructing a framework to model the relationship between the labels and a single object is a harder problem than the traditional multi-label case.

Motivated by this, we present a pre-processing procedure that accomplishes product data acquisition, sorting and analysis to construct a new dataset, with the goal of shortening the period of data collection. Based on our newly structured data, we propose the application of multi-label learning to supplement the description documents for product design knowledge.

To the best of our knowledge, this is the first work that applies multi-label learning in the product design field. The key contributions of our work are as follows:

  • The construction of a dataset for the product design field, which avoids the high-cost, labor-intensive manual process, greatly shortens the design period and improves the efficiency of the image collection task.

  • An efficient image-processing procedure that improves the quality of the large-scale data collected from the Internet; this procedure can easily be generalized to add more categories to augment or update the data.

  • Multi-label learning that converts a huge pile of images into structured data, which can also be used for multi-label semantic learning.

The rest of the paper is structured as follows. Section 2 reviews related work on product image data and multi-label learning methods. Section 3 presents how we obtain the images and how we pre-process them, including the removal of "cluttered" and duplicate images. Section 4 is devoted to multi-label annotation based on these data. Conclusions and future work are finally reported in Sect. 5.

2 Related Work

With the development of computer vision, an increasing number of researchers pay attention to the understanding of visual scenes. Methods have therefore been proposed to recognize the objects in scene images and to characterize the relationship between objects and their scenes. To cope with the requirements of object-related research, several object classification and detection datasets [8, 9, 28, 39] were constructed and played an important role in the recent breakthroughs in both object classification and detection. However, as research progressed, it was found that scene images may contain more than one object, so traditional single-label classification cannot fully describe them. Multi-label classification has thus become one of the hottest topics in recent years, and several multi-label annotation datasets such as scene [2], NUS-WIDE [5] and Microsoft COCO [20] were presented as new benchmarks to test the performance of different multi-label learning methods.

Most datasets contain a large number of scenes and the objects that commonly occur in them, or natural images, but no dataset has been constructed for the product design field. Product data are very important references for designing a new proposal and are usually collected manually in traditional design work, which costs much time, labor and money. Recently, designers can download images from websites that provide platforms for sharing product images of original designs or existing products, but problems such as mixed backgrounds and low relevance of image retrieval remain.

The multi-label learning framework has attracted considerable attention in the literature over the last decade due to its numerous real-world applications [29, 32]. Traditionally, it is applied to text [17, 19], audio [35, 41], image classification [16] and even image segmentation [25]. Departing from traditional single-label classification, multi-label learning associates each sample with multiple labels simultaneously. One of its primary goals is to understand object categories, or scenes and the objects that commonly occur in them; object understanding involves multiple tasks, one of which is recognizing which objects are present in a scene image. During the past decade, a significant amount of progress has been made towards this emerging machine learning paradigm [43].

Traditionally, for multi-label annotation, labels represent certain objects present in images, so an annotator is asked to determine which objects are present for each label. In the product design field, however, a product image mostly contains only one object, i.e., the product itself, which means the image would have one label under the traditional annotation norm. To the best of our knowledge, no existing work considers this situation. Motivated by the absence of description documents when collecting product data, we propose multi-label learning research in the product design field.

3 Data Acquisition and Processing

In this section, the proposed method for image acquisition and pre-processing is presented in detail. To get as many samples as possible for each product category, images are crawled from several big online shopping websites, which host far more products than enterprise official platforms. One key contribution of this work lies in fusing multi-source data and pre-processing them with robust methods to ensure the integrity of the data.

3.1 Multi-source Data Fusion

A web crawler is a program that automatically grabs information from websites according to certain rules, so researchers usually utilize this technology to obtain specific data from particular websites. To ensure the integrity of the data, we crawl images from several big online shopping websites. However, different shopping websites may sell the same products, so duplicate products in the multi-source data are inevitable. Fortunately, shopping websites provide an identification number for each product in its specifications; hence, when product images are crawled, the corresponding identification number is crawled simultaneously.

More than 160,000 images are crawled from several big online shopping websites. After the crawling task finishes, the identification numbers of the products are read and compared with each other, and product images with the same identification number are put into the same folder. All folders are then gathered by category.
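
As a minimal sketch of this grouping step (the record format and folder layout are our illustrative assumptions, not the authors' actual code):

```python
import os
import shutil

def group_by_product_id(records, out_root):
    """Put images sharing an identification number into one folder,
    nested under their category folder.

    `records` is assumed to be an iterable of
    (image_path, product_id, category) tuples produced by the crawler.
    """
    for image_path, product_id, category in records:
        folder = os.path.join(out_root, category, product_id)
        os.makedirs(folder, exist_ok=True)
        shutil.copy(image_path, folder)
```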

3.2 Data Arrangement

It is observed that: (1) some images have different sizes, (2) some images have cluttered backgrounds and are useless as references for design work, and (3) some images are duplicates. First, images with a size of \(200 \times 200\) pixels are retained, while the others, whose sizes or length-width ratios are inappropriate, are resized to \(200 \times 200\) pixels.
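
A one-function sketch of this size normalization with Pillow, assuming the \(200 \times 200\) target described above:

```python
from PIL import Image

def normalize_size(path, size=(200, 200)):
    """Keep 200 x 200 images as they are; resize everything else."""
    img = Image.open(path).convert('RGB')
    return img if img.size == size else img.resize(size, Image.LANCZOS)
```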

The next step is to remove the cluttered images. At the beginning of this step, frequency-tuned salient region detection [13] is used to extract salient objects, since it achieves the best results among most saliency detection algorithms [1, 4, 45]. An edge detection algorithm then produces an edge map of the salient objects, after which features are extracted from the areas inside and outside the object edges. A feature judgement step classifies whether the background of an image is pure white or not: images with poor backgrounds or unclear product edges are marked "cluttered" and deleted, while images where the product sits in the central zone against a pure white background are retained. Taking the smart watch as an example, Fig. 1 shows examples of "cluttered" (a) and "pure" (b) images.
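
The saliency and background-judgement steps might look roughly as follows. The saliency function follows the frequency-tuned formulation of [13] (per-pixel Lab distance to the mean color of a blurred image); the border-based purity test and all its thresholds are our own simplified assumptions rather than the authors' feature-judgement procedure:

```python
import cv2
import numpy as np

def frequency_tuned_saliency(bgr):
    """Frequency-tuned saliency: Lab distance to the mean Lab color
    of a Gaussian-blurred copy of the image."""
    blurred = cv2.GaussianBlur(bgr, (5, 5), 0)
    lab = cv2.cvtColor(blurred, cv2.COLOR_BGR2LAB).astype(np.float64)
    sal = np.linalg.norm(lab - lab.reshape(-1, 3).mean(axis=0), axis=2)
    return cv2.normalize(sal, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def has_pure_background(bgr, border=10, white_level=245, clutter_ratio=0.02):
    """Crude purity judgement: almost all border pixels must be near-white."""
    strips = np.concatenate([
        bgr[:border].reshape(-1, 3), bgr[-border:].reshape(-1, 3),
        bgr[:, :border].reshape(-1, 3), bgr[:, -border:].reshape(-1, 3)])
    non_white = (strips.min(axis=1) < white_level).mean()
    return non_white < clutter_ratio
```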

Fig. 1

An example of "cluttered" and "pure" images. a The "cluttered" image with a poor background or an unclear product edge; b the "pure" image with a pure white background and a clear product edge

To get rid of data redundancy, a similarity comparison based on dhash [42] is used to remove duplicated images. First, the difference hash (dhash), a perceptual hash algorithm, is used to compute a hash value for every image remaining after "cluttered" removal. Then an image is randomly selected as sample A and the other images are selected in order as sample B. The Hamming distance (L) between samples A and B, which can be regarded as their dissimilarity, is compared with a threshold T to decide whether to remove sample B or keep it. The threshold T is set to 0 since we only want to remove exact duplicates. This loop repeats until sample B is the last item in the dhash list; sample A is then kept in the final data, and another image is selected as sample A after the dhash list is updated. The full duplicate-removal flow is repeated until sample A has traversed all the samples. These steps are performed automatically by data-analysis code written in Python, saving several hundred worker hours. This image pre-processing procedure can easily be employed to add more categories to augment the data; its flow diagram is presented in Fig. 2.
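
A compact sketch of the dhash-based duplicate removal, assuming the `imagehash` library (its `dhash` function and hash subtraction, which yields the Hamming distance, are real); the traversal is simplified to a single linear pass rather than the authors' exact loop:

```python
from pathlib import Path
from PIL import Image
import imagehash

def remove_duplicates(folder, threshold=0):
    """Return paths of images to keep; with T = 0 only exact
    perceptual duplicates (Hamming distance 0) are dropped."""
    kept = []  # list of (path, dhash) pairs already accepted
    for path in sorted(Path(folder).glob('*.jpg')):
        h = imagehash.dhash(Image.open(path))
        if all(h - kept_hash > threshold for _, kept_hash in kept):
            kept.append((path, h))
    return [p for p, _ in kept]
```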

Fig. 2

The flow of the image pre-processing

The remaining set contains 64,893 images covering 16 categories of daily electronic products from more than 4 thousand brands; 76.3% of the products have more than 5 images, including front, top and left views and some detailed local views. Hence, new folders are set up to classify the product images according to view, with images of the same view placed in the same folder. Figure 3 gives the categories of samples with their corresponding amounts and brand numbers. The data will be augmented to grow larger and cover more categories, and the complete dataset will be made available online as soon as possible.

Fig. 3

The categories of samples and their corresponding amounts and brand numbers

4 Multi-label Semantic Learning

4.1 Label Analysis

Annotating almost 65 thousand images is extremely time consuming, and manually annotating all of them is nearly impossible. Moreover, owing to subjective feelings, annotators may not grasp the meaning of some images and may fail to make critical annotations. Several multi-label annotation algorithms are therefore strategically applied to assist annotators, minimizing cost and increasing precision. However, this semi-manual ground-truth annotation still has shortcomings: annotators inevitably bring their own subjective feelings, which may yield unsatisfactory training data for multi-label learning, in which case the precision and recall of the algorithms will be lower than expected. To address these shortcomings, convincing multi-label algorithms can be adopted to reduce the effect of subjective feelings as far as possible. Moreover, label selection is a crucial way to verify the effectiveness of algorithms, and objective labels are the best choice.

Below, to verify the effectiveness of multi-label learning algorithms and to facilitate experimentation, we first select a set of important, basic labels picked from the product attribute data and the elements of industrial design. With the addition of the product categories, the number of labels is extended to 20. The five label subsets are listed below; a sketch of the resulting binary encoding follows the list.

  1. Category

    As mentioned before, the data contain a total of 16 categories of daily electronic products from more than three thousand brands. Although brand is part of the product attribute data, there are too many brands in the data, so we include categories rather than brands in the annotation task.

  2. Chromatic or achromatic color

    Achromatic color is a combination of white, black and the various degrees of gray formed between them. Achromatic colors can be arranged in a series from white through light gray, medium gray and dark gray to black, called the black-and-white series. In contrast, chromatic color comprises red, orange, yellow, green, blue and purple.

  3. Color numbers

    When designing products, designers usually need to consider the color assortment, so the number of colors is also a design element to be taken into consideration. Since products exhibit a wide variety of color counts, not just two, a threshold is needed to split them into two parts to fit the binary code of each label. The threshold is set to '2' on the basis of statistics, so binary '0' and '1' represent color counts less than or equal to 2 and greater than 2, respectively (the encoding sketch after this list illustrates this).

  4. Value of HSV color components

    Value indicates the brightness of a color; for object color, it is tied to the transmittance or reflectance of the object. The range is usually 0% (black) to 100% (white). As with the color-number label, after setting a threshold, binary '0' and '1' represent values less than or equal to and greater than the threshold, respectively (see the same sketch after this list).

  5. Morphology

    Morphology is another design element usually considered when designing a product. Common morphologies fall into two kinds: circular and rectangular.
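
To make the label structure concrete, the following sketch encodes one product into the 20-dimensional binary label vector: 16 one-hot category bits plus the four binary design-element bits. The category names and the helper itself are hypothetical illustrations, not the authors' code:

```python
import numpy as np

# Hypothetical category list; the paper's 16 categories include, e.g.,
# smart watches, mobiles and intelligent lamps.
CATEGORIES = ['smart_watch', 'mobile', 'intelligent_lamp']  # ... 16 in total

def encode_labels(category, chromatic, color_count, value_pct, circular):
    """16 one-hot category bits + 4 binary design-element bits = 20 labels."""
    y = np.zeros(len(CATEGORIES) + 4, dtype=int)
    y[CATEGORIES.index(category)] = 1        # category subset (one-hot)
    base = len(CATEGORIES)
    y[base + 0] = int(chromatic)             # 1 chromatic, 0 achromatic
    y[base + 1] = int(color_count > 2)       # color numbers, threshold 2
    y[base + 2] = int(value_pct > 0.8)       # HSV value, threshold 80%
    y[base + 3] = int(circular)              # 1 circular, 0 rectangular
    return y
```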

4.2 Image Annotation

Owing to the objective nature of these labels, they should be annotatable automatically. But images of some categories may affect the results; taking mobiles as an example, the pictures shown on their screens may decrease the annotation precision for the color-numbers and chromatic/achromatic-color labels. Hence, the annotation task is divided into two parts: one part is annotated by code, and the other by annotators with the help of a human-computer interaction (HCI) application.

  1. Automatic annotation

    For the category and HSV-value labels, programs are written to annotate the labels automatically. For the category label, the task is easy since the data were managed into their corresponding category folders when crawled from the web. For the HSV-value label, we refer to the pre-processing step discussed in Sect. 3: saliency detection is applied first to get the object region, and then the value of that region of the image is obtained. To keep the balance between labels '0' and '1', statistics suggest that the numbers of samples with the two labels are approximately equal when the threshold is set to 80% (a sketch of this step follows the list).

  2. Semi-manual annotation

    For the chromatic/achromatic-color, color-numbers and morphology labels, an HCI application is designed to assist the annotation task: an annotator is asked to annotate '0' or '1' on each image for each label.

    Next, we briefly introduce the semi-manual procedure for annotating the multiple labels. Note that, for the last four labels, annotation here differs from the traditional annotation task: in traditional multi-label annotation, an annotator determines whether or not each image contains the object for each label, whereas here the annotator determines which of two alternatives each image should be labelled with. For instance, '1' means an image has the current label and '0' means it does not; as presented above, binary '0' and '1' represent color counts less than or equal to 2 and greater than 2, respectively.
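
A sketch of the automatic HSV-value annotation from step 1, assuming a binary salient-region mask is available (e.g., by thresholding the saliency map of Sect. 3.2); the exact region handling and the mask source are our assumptions:

```python
import cv2
import numpy as np

def hsv_value_label(bgr, salient_mask, threshold=0.8):
    """Average the V channel of HSV over the salient region and
    binarize: 1 if the mean value exceeds 80%, else 0."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    v = hsv[:, :, 2].astype(np.float64) / 255.0  # V scaled to [0, 1]
    mean_v = v[salient_mask > 0].mean()          # salient pixels only
    return int(mean_v > threshold)
```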

As shown in Fig. 4, annotators manually view and annotate about a quarter of all images using a hierarchical labeling approach [7, 15] for each category [5, 20]. This portion of annotated images is used as training data to perform multi-label induction on the remaining unlabeled images, after which semi-manual annotation is applied. Two state-of-the-art multi-label algorithms generate annotations for each unlabeled image, trained on VGG_16 [30] features; these two algorithms achieved the best results compared with other algorithms [18]. RAkEL–DT denotes the ensemble of RAndom k-labELsets [36] with Decision Trees [6], and ECC–DT the ensemble of classifier chains [26] with Decision Trees [6]. The corresponding annotation results L(R-D) and L(E-D) from the RAkEL–DT and ECC–DT algorithms are then compared in the judgement procedure.

Fig. 4

The flow of multi-label annotation

If the two annotation results are equal for the same input, the agreed annotation is added to the training data to extend it; otherwise, the images with differing annotations are annotated manually again. If the annotator sees a certain concept in the image, it is labelled positive; if the concept does not exist, or the annotator is uncertain whether the image contains it, it is labelled negative [5].
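
The judgement loop might be sketched as follows, with scikit-multilearn's `RakelD` standing in for RAkEL–DT and a single scikit-learn `ClassifierChain` standing in for the full ECC–DT ensemble; the feature and label matrices are synthetic placeholders for the VGG_16 features and 20-label annotations:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.multioutput import ClassifierChain
from skmultilearn.ensemble import RakelD

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))          # stand-in VGG_16 features
Y_train = rng.integers(0, 2, size=(200, 20))   # stand-in 20-label matrix
X_unlabeled = rng.normal(size=(100, 512))

rakel_dt = RakelD(base_classifier=DecisionTreeClassifier(), labelset_size=3)
cc_dt = ClassifierChain(DecisionTreeClassifier(), order='random', random_state=0)
rakel_dt.fit(X_train, Y_train)
cc_dt.fit(X_train, Y_train)

pred_rd = np.asarray(rakel_dt.predict(X_unlabeled).todense())  # L(R-D)
pred_ed = cc_dt.predict(X_unlabeled)                           # L(E-D)

agree = (pred_rd == pred_ed).all(axis=1)  # judgement: identical label vectors
X_train = np.vstack([X_train, X_unlabeled[agree]])  # extend the training data
Y_train = np.vstack([Y_train, pred_rd[agree]])
X_manual = X_unlabeled[~agree]  # disagreements go back to manual annotation
```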

4.3 Experiments Setting and Evaluation

As for state-of-the-art multi-label algorithms, the machine learning community has provided a large number of approaches [11, 43]. These can be broadly divided into three categories according to their underlying theories: problem transformation, algorithm adaptation and ensemble approaches [22]. Existing datasets [5, 20] used for multi-label learning usually applied kNN-based algorithms for evaluation, since these outperform most problem transformation methods; however, kNN-based algorithms, which belong to the algorithm adaptation family, sometimes cannot match the performance of ensemble approaches [18]. Due to these properties, ensemble approaches are very attractive.

We finally chose two ensemble approaches, EA(LP–RF) and EA(BR–SVM). EA(LP–RF) denotes the ensemble approach of Random Forests [3] under a Label Powerset [37] multi-label classifier, and EA(BR–SVM) the ensemble approach of Support Vector Machines (SVM) [38] under a Binary Relevance (BR) [21] multi-label classifier. To conduct the experiments, we use the algorithm implementations provided by Wroclaw University of Technology [34]. Random selection of label subsets is likely to hurt an ensemble's performance, since the chosen subsets may not cover all labels or inter-label relationships [27]; note that both approaches instead partition the label space using fast greedy community detection on a label co-occurrence graph [10] rather than overlapping RAndom k-labELsets [36]. Further details can be found in [34].
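
Assuming the implementation in [34] corresponds to the scikit-multilearn library, the two ensembles could be instantiated roughly as follows; the hyperparameters are library defaults, not the paper's tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from skmultilearn.problem_transform import LabelPowerset, BinaryRelevance
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import (LabelCooccurrenceGraphBuilder,
                                  NetworkXLabelGraphClusterer)

# Partition the label space by community detection on the label
# co-occurrence graph, then train one multi-label classifier per community.
builder = LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False)
clusterer = NetworkXLabelGraphClusterer(builder, method='louvain')

ea_lp_rf = LabelSpacePartitioningClassifier(
    classifier=LabelPowerset(classifier=RandomForestClassifier()),
    clusterer=clusterer)
ea_br_svm = LabelSpacePartitioningClassifier(
    classifier=BinaryRelevance(classifier=SVC()),
    clusterer=clusterer)
```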

However, these two ensemble approaches learn from extracted features rather than raw images. CNN-based algorithms can effectively extract features from original images, so we extract inception_V1 [33], inception_V2 [14], ResNet_V1-50, ResNet_V1-101 [12], VGG_16 and VGG_19 [30] features to evaluate annotation performance. We randomly select 2 thousand samples from each category except intelligent lamp to avoid sample disproportion and split training and testing samples 8:2; an overview of the performance is given in Tables 1 and 2.
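
One plausible VGG_16 feature extractor in Keras; the 224 x 224 input size, global average pooling and framework choice are our assumptions, since the paper does not specify them:

```python
import numpy as np
import tensorflow as tf

extractor = tf.keras.applications.VGG16(weights='imagenet',
                                        include_top=False, pooling='avg')

def vgg16_features(image_paths):
    """Load images, apply VGG16 preprocessing, return (n, 512) features."""
    batch = np.stack([
        tf.keras.preprocessing.image.img_to_array(
            tf.keras.preprocessing.image.load_img(p, target_size=(224, 224)))
        for p in image_paths])
    return extractor.predict(tf.keras.applications.vgg16.preprocess_input(batch))
```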

Table 1 Performance (\({\text {mean}} \pm {\text {SD}}\)) of each ensemble approach over ten measures
Table 2 Performance (\({\text {mean}} \pm {\text {SD}}\)) of each ensemble approach over ten measures

Let \({T = \left\{ {\left( {{x_i},{Y_i}} \right) |i = 1 \ldots n} \right\} }\) be the multi-label data with n examples, where i indexes the ith sample. In the following formulas, for a test sample x, \({Z_x}\) denotes its predicted label set and \({{r_x}\left( \lambda \right) }\) the associated rank of label \({\lambda }\) in the predicted ranking. The measures chosen to evaluate performance are listed below; a short scikit-learn sketch of their computation follows the list.

  (a) Coverage (C) calculates how far, on average, one must go down the ranked label list to cover all relevant labels of an example.

    $$\begin{aligned} {C = \frac{1}{n}\sum \limits _{i = 1}^n {\mathop {\max }\limits _{\lambda \in {Y_i}} } \;{r_{{x_i}}}\left( \lambda \right) - 1} \end{aligned}$$
    (1)
  (b) Average precision (AP) computes the average fraction of relevant labels ranked at or above each relevant label.

    $$\begin{aligned} {AP = \frac{1}{n}\sum \limits _{i = 1}^n {\frac{1}{{\left| {{Y_i}} \right| }}\sum \limits _{\lambda \in {Y_i}} {\frac{{\left| {\left\{ {\lambda ' \in {Y_i}:{r_{{x_i}}}\left( {\lambda '} \right) \le {r_{{x_i}}}\left( \lambda \right) } \right\} } \right| }}{{{r_{{x_i}}}\left( \lambda \right) }}} } } \end{aligned}$$
    (2)
  (c) Ranking loss (Rl) evaluates the average fraction of label pairs that are incorrectly ordered.

    $$\begin{aligned} {Rl = \frac{1}{n}\sum \limits _{i = 1}^n {\frac{1}{{\left| {{Y_i}} \right| \left| {{{\bar{Y}}_i}} \right| }}} \left| {{G_i}} \right| } \end{aligned}$$
    (3)

    where \({G_i} = \left\{ {\left( {\lambda ',\lambda ''} \right) \in {Y_i} \times {{\bar{Y}}_i}:{r_{{x_i}}}\left( {\lambda '} \right) > {r_{{x_i}}}\left( {\lambda ''} \right) } \right\}\) and \({{\bar{Y}}_i}\) denotes the complement of \({Y_i}\).

  (d) Hamming loss (Hl) reports the fraction of example-label pairs in which a label is predicted incorrectly.

    $$\begin{aligned} {Hl = \frac{1}{n}\sum \limits _{i = 1}^n {\frac{1}{m}\left| {{Y_i}\,\Delta \,{Z_{{x_i}}}} \right| } } \end{aligned}$$
    (4)

    where \(\Delta \) denotes the symmetric difference between the true label set \({Y_i}\) and the predicted label set \({Z_{{x_i}}}\), and m is the number of labels.

  (e) Micro and macro measures.

    $${\text{Micro-precision}}= \frac{{\sum \nolimits _{j = 1}^m {T{P_{{\lambda _j}}}} }}{{\sum \nolimits _{j = 1}^m {T{P_{{\lambda _j}}}} + \sum \nolimits _{j = 1}^m {F{P_{{\lambda _j}}}} }}$$
    (5)
    $${\text{Micro-recall}}= \frac{{\sum \nolimits _{j = 1}^m {T{P_{{\lambda _j}}}} }}{{\sum \nolimits _{j = 1}^m {T{P_{{\lambda _j}}}} + \sum \nolimits _{j = 1}^m {F{N_{{\lambda _j}}}} }}$$
    (6)
    $${\text{Macro-precision}}= \frac{1}{m}\sum \nolimits _{j = 1}^m {\frac{{T{P_{{\lambda _j}}}}}{{T{P_{{\lambda _j}}} + F{P_{{\lambda _j}}}}}}$$
    (7)
    $${\text{Macro-recall}}= \frac{1}{m}\sum \nolimits _{j = 1}^m {\frac{{T{P_{{\lambda _j}}}}}{{T{P_{{\lambda _j}}} + F{N_{{\lambda _j}}}}}}$$
    (8)

    where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively, and m is the number of labels. The F-measure is the harmonic mean of precision and recall; see [40] for further details.

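For reference, most of these measures have direct scikit-learn counterparts; below is a sketch on synthetic predictions (note that scikit-learn's `coverage_error` counts ranks from 1, hence the subtraction to match Eq. (1)):

```python
import numpy as np
from sklearn.metrics import (coverage_error, hamming_loss,
                             label_ranking_average_precision_score,
                             label_ranking_loss, precision_score, recall_score)

rng = np.random.default_rng(0)
Y_true = rng.integers(0, 2, size=(100, 20))   # ground-truth label matrix
Y_score = rng.random(size=(100, 20))          # ranking scores
Y_pred = (Y_score > 0.5).astype(int)          # binarized predictions

C = coverage_error(Y_true, Y_score) - 1                      # Eq. (1)
AP = label_ranking_average_precision_score(Y_true, Y_score)  # Eq. (2)
Rl = label_ranking_loss(Y_true, Y_score)                     # Eq. (3)
Hl = hamming_loss(Y_true, Y_pred)                            # Eq. (4)
micro_p = precision_score(Y_true, Y_pred, average='micro')   # Eq. (5)
macro_r = recall_score(Y_true, Y_pred, average='macro')      # Eq. (8)
```
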
For each feature, five cross-validation experiments are conducted for each algorithm, and the average is taken as the final result. For the ensemble approach of Random Forests under Label Powerset, the VGG_16 feature achieves the best performance on all measures except coverage and macro precision; for average precision, most features score approximately 90% or above. For the ensemble approach of SVM under Binary Relevance, the VGG_16 feature still has a clear advantage on most measures except micro precision, macro precision and macro F-score; the best scores on these three metrics are achieved by the inception_V2 feature and are even higher than 96%, suggesting that false positives are very few and true positives fairly high. Analyzing the tables as a whole, we observe that almost all features under these two ensemble approaches achieve quite high scores, so we argue that the quality of the annotation is reasonable and satisfactory. Next, we examine predictive performance with respect to the number of training samples. This is a critical parameter, since it is directly related to the cost and manpower required for classifying and understanding newly acquired product images. We consider training set sizes ranging from 200 to 1600 per category, averaged over 8 realizations, with the testing set fixed at 400 samples for all realizations.

Different from the former experiment, this step focuses on the number of training samples, so a single feature suffices. Since Tables 1 and 2 show that the VGG_16 feature has a clear advantage over the other features, it is selected as the base feature for the training samples. All other parameters are the same as before except the number of training samples. Figures 5, 6, 7 and 8 show the interactions between training set size and classification performance.

Fig. 5

Average precision with regard to the number of training samples for two ensemble algorithms

Fig. 6

Hamming loss with regard to the number of training samples for two ensemble algorithms

Fig. 7

Macro precision and recall for EA(LP + RF) with regard to the number of training samples

Fig. 8

Micro-averaged AUC with regard to the number of training samples for two ensemble algorithms

Figure 5 gives the interaction between average precision and the number of training samples for the two ensemble algorithms. The average precision already reaches almost 0.6 at the start, when the number of testing samples is twice the number of training samples. It then increases smoothly and steadily for both algorithms until the number of training samples reaches 1400, after which it increases more slowly for the EA(BR + SVM) algorithm than for EA(LP + RF). The BR approach lacks consideration of label correlations while LP methods can capture inter-label relationships, which may lead to under- or overestimation of the active labels, or to the identification of multiple labels that never co-occur, for EA(BR + SVM) once the training set size exceeds 1400.

Figure 6 shows the variation of the Hamming loss. This metric is highly representative, since the Hamming loss is an example-based metric and gives an overall intuition of the misclassified object-label pairs. Opposite to average precision, the Hamming loss of both algorithms drops as the number of training samples grows: it decreases steadily for the \({\text {EA}}({\text {BR}}+{\text {SVM}})\) and \({\text {EA}}({\text {LP}}+{\text {RF}})\) algorithms owing to increasingly sufficient training, and the two become almost equal when the number of training samples reaches 1600.

Tables 1 and 2 illustrate that the macro and micro precision and recall of the \({\text {EA}} ({\text {BR}}+{\text {SVM}})\) algorithm vary more widely than those of \({\text {EA}} ({\text {LP}}+{\text {RF}})\) to some extent. To examine the interaction between training set size and precision and recall, a further experiment is conducted. Figure 7 gives only the macro precision and recall for the \({\text {EA}} ({\text {LP}}+{\text {RF}})\) algorithm, since macro measures focus on classes rather than samples, regardless of how many samples belong to each class [40]. We observe that the macro precision remains high and increases slowly as the number of training samples grows from 200 to 1600, while the macro recall changes greatly, from approximately 0.4 to 0.9, increasing sharply with training set growth. This indicates that the number of FP is considerably smaller than FN relative to TP, which may be due to insufficient training when the training set is small.

More specifically, we consider the area under the curve (AUC) metric, calculated from the receiver operating characteristic (ROC) curve. The AUC score describes the overall quality of performance independently of individual threshold configurations regarding specific trade-offs between TP and FP [24]. Different from Fig. 7, we here choose micro measures to focus on the performance over samples. Figure 8 presents the micro-averaged AUC score with respect to the number of training samples. For both algorithms, performance increases significantly until the number of training samples equals the number of testing samples; as the training set grows further, it improves at a slower rate until the end.

Looking closer at the graphs in Figs. 5, 6, 7 and 8, the results indicate that the predictive performance of the two ensemble algorithms improves as the training set grows, with slight differences attributable to the intrinsic characteristics of each algorithm. For average precision and Hamming loss, the approximately constant rate of change indirectly verifies the reasonable and satisfactory prediction quality. The variation of macro precision, macro recall and micro-averaged AUC suggests that, to cut cost and manpower, one can obtain reasonable predictive performance on new images once the training set is at least as large as the testing set; otherwise, the larger the training set, the better the performance.

5 Conclusions and Future Work

In this paper, we have proposed a new approach to supplementing description documents, casting the problem as an instance of multi-label learning. To achieve this, a product dataset crawled from several big online shopping websites is presented. We develop a pre-processing procedure that achieves multi-source data acquisition, fusion, "cluttered" image removal and duplicate removal; this procedure can easily be generalized to add more categories to augment or update the data.

Based on these data, an application is developed to accomplish the automatic and semi-manual annotation task with a set of basic labels picked from the product attribute data and the elements of industrial design. To evaluate the quality of the annotated labels, we have considered an extensive set of experiments, employing state-of-the-art multi-label learning algorithms under diverse and challenging scenarios. Experimental results suggest that almost all features under the two ensemble approaches achieve quite high scores, indirectly verifying the reasonable and satisfactory quality. In addition to the annotation work, the classification performance with respect to the number of training samples is further examined. Multi-label learning algorithms trained on part of these data can also be employed to annotate newly acquired product images and generate the corresponding description documents, greatly cutting cost and manpower. Experimental results show that one can obtain satisfying performance on new images once the training set is at least as large as the testing set.

More subjective elements of industrial design will be covered in future annotation, which can better semantically describe design proposals. Owing to the differences in multi-label annotation between the product design field and the traditional object understanding field, robust multi-label algorithms suited to subjective labels in the product design field will be the main focus of future work.