1 Introduction

Deep Learning (DL) models are made of multiple layers of nonlinear functions that automatically learn features from raw inputs. This property of DL models reduces the cost and time of feature engineering compared to developing classical Machine Learning (ML) models [1]. However, for a model to learn the mapping between raw inputs and desired outputs while achieving good generalization, it must be fed a large amount of labeled data (for supervised learning). The availability of large datasets is one of the main reasons for the success of DL models, especially Convolutional Neural Networks (CNNs), which have dominated the state of the art in recent years in computer vision problems such as object detection, image classification, and semantic segmentation [16]. Nevertheless, gathering the required amount of labeled data to build strong solutions is impossible in many cases. Reasons include the high cost of data labeling when expert knowledge is required, for instance in medical fields; observations (input data) that are difficult to obtain because they rarely occur; or privacy concerns around acquiring the data. For commercial solutions, the time required to assemble large datasets usually does not fit the time-to-market constraints or tight budgets common in industry.

In particular, one of the biggest challenges in developing computer vision applications is the amount of data required [8]. Different methods are used to train DL models with scarce data: (i) Transfer Learning, which aims to reuse knowledge from a source model trained on a different but related domain and task [23]; (ii) Data augmentation, which applies a set of transformations (e.g., affine transformations or point operators) to each sample of the training dataset to generate new ones [7]; (iii) Synthetically generated data, which uses simulations to create new data (in computer vision tasks, renders or animations are used to train the model [13]); and (iv) Few-shot learning, which aims to train a model with few samples that still generalizes to unfamiliar ones [6]. In all of these methods, the main challenge is the discrepancy between the source and target domains, whose feature spaces or label spaces differ or follow different distributions.

Open Data (OD) refers to data that can be published and reused without price or permission barriers [12]. In recent years, the availability of large OD collections online has increased through platforms such as Kaggle or Roboflow that allow users to share labeled datasets. Although they can be used for benchmarking in academic settings, they cannot be used directly for industrial or commercial purposes because they differ from the target domain and task: for instance, they have different labels, or the images are related but come from different contexts. This article presents an approach for using OD from multiple domains and label spaces to train a supervised multi-label image model (Sect. 3). The approach is composed of three elements: a training pipeline (Sect. 3.1), an attention module (Sect. 3.2), and a Focal loss for incomplete data (Sect. 3.3). We validated the proposed approach in a case study of waste identification (Sect. 2), where the model is trained to predict the material, type of packaging, and product brand of waste in an image (Sect. 4). The proposed approach reached an average macro F1-score of 86% on the target dataset and 85.8% on the source domains (Sect. 4.3). Section 5 presents the research conclusions.

Figure 1 shows a random sample from the source dataset (which also includes 12.5% of target domain images) with the three categories of labels: the material, the type of packaging, and the product brand occupy the first, second, and third positions, respectively.

Fig. 1.

Random sample from the source (also includes 13% of target domain images) dataset with the three categories of labels. The label of the material, the type of packaging, and the product brand occupy the first, second, and third positions, respectively. Section 6 describes the datasets used and the link where they are available.

2 Waste Identification Challenge

The world’s cities generate 1.5 billion tonnes of solid waste per year, and this is expected to increase to 2.2 billion tonnes by 2025 [4]. Therefore, Solid Waste Management (SWM) is a global issue that affects every individual and government. At least 21% of municipal waste is not managed correctly, directly impacting public health and the environment [4]. Correct waste separation increases the material recovery rate and reduces environmental impacts and economic expense [2]. Nevertheless, waste separation is difficult, as it is affected by multiple factors such as physical conditions, human behavior, knowledge, and policies, among others [14].

ML models are frequently used for waste classification and localization in waste management systems. CNNs are commonly used to perform these tasks [15, 17, 19, 22], but two main barriers are typically mentioned: (i) the size of the datasets: given that the visual appearance of waste can vary widely due to strong deformations, dirt, and material degradation, many samples of the same type of object covering these variations are required; and (ii) location-tailoring: brands, products, objects’ appearance, and even recycling categories vary from place to place, so training datasets need to be tailored to a specific location. These reasons prevent the reuse of datasets across different places.

In the proposed case study, a DL model is trained using datasets from different locations with different products, contexts, brands, and labels. The predictive function takes an image of a waste item and outputs three categories of predictions (Fig. 1 shows a random sample from the source):

  1. Material: one of 12 classes: plastic, PET, cardboard, aluminum, paper, Tetrapac, glass, steel, paper towel, mix, paper receipt, and organic.

  2. Type of packaging: not only the material but also the type of object gives insight into the recycling process and category. For instance, aluminum cans can be reused to contain food again, unlike aluminum used in other products. The model predicts 12 classes of packaging: wrapper, container, box, can, bottle, foil, cap, pencil, cutlery, organic, battery, and masks.

  3. Product brand: Extended Producer Responsibility (EPR) policies make the producer responsible for the post-consumer stage of a product’s life, including recycling [10]. Thus, the model predicts 39 local brands (from Colombia), plus one additional class for “other” brands outside this set, so the model is not forced to assign one of the defined brands. Brand identification is the most location-tailored prediction category, as materials and types of packaging share more similar features across places.


3 Our Open Data Approach to Train Deep Learning Models

The presented approach aims to improve the performance of a model on a target domain that has too few samples (or none) to learn a predictive function, by using samples from other domains. This problem is formally defined under the term Transfer Learning (TL) [20], where a domain \(\mathcal {D}=\{\mathcal {X}, P(\textbf{X})\}\) is described by two parts: a feature space \(\mathcal {X}\) and a marginal probability distribution \(P(\textbf{X})\), where \(\textbf{X}=\{x_1, \dots , x_n\}\). The feature space \(\mathcal {X}\) comprises all possible features, and \(\textbf{X}\) is a particular sample set of the domain with n instances. A task \(\mathcal {T}\) is likewise defined by two components: a label space \(\mathcal {Y}\) and a predictive function \(f(\cdot )\) learned from feature-label pairs \(\{x_i, y_i\}\), where \(x_i \in \textbf{X}\) and \(y_i \in \textbf{Y}\). \(\textbf{Y} = \{ y_1, \dots , y_n \}\) are the labels corresponding to the learning sample set \(\textbf{X}\), and the predictive function \(f(\cdot )\) can be seen as P(y|x), the probability of y given a feature x. In the general case, we have two domains with their related tasks: the source domain \(\mathcal {D^S}=\{\mathcal {X^S}, P(\mathbf {X^S}) \}\) with its task \(\mathcal {T^S} = \{ \mathcal {Y^S}, P(\mathbf {Y^S} | \mathbf {X^S}) \}\), and the target domain \(\mathcal {D^T}=\{\mathcal {X^T}, P(\mathbf {X^T}) \}\) with \(\mathcal {T^T} = \{ \mathcal {Y^T}, P(\mathbf {Y^T} | \mathbf {X^T}) \}\). TL is then defined as the process of improving the target predictive function \( P(\mathbf {Y^T} | \mathbf {X^T}) \) using information from \(\mathcal {D^S}\) and \(\mathcal {T^S}\), under the condition that \(\mathcal {D^S} \ne \mathcal {D^T}\) or \(\mathcal {T^S} \ne \mathcal {T^T}\).

In our case, we have a source composed of multiple domains, each with its related task, \(\mathbf {D^S} = \left[ (\mathcal {D^S}, \mathcal {T^S})_1, \dots , (\mathcal {D^S}, \mathcal {T^S})_n \right] \), used to improve one target predictive function in the domain \(\mathcal {D^T}=\{\mathcal {X^T}, P(\mathbf {X^T})\}\).

The proposed approach comprises three elements: a training pipeline, an attention module placed at the head of a pre-trained feature extractor, and a Focal loss for incomplete data. The approach is defined for a multi-label task whose labels can be grouped into categories that are not mutually exclusive; the same approach can also be used for traditional multi-label or classification problems.

3.1 Training Pipeline

The pipeline comprises three stages: first, dataset preparation; then, model training; and finally, model evaluation.

  • Data preparation. After gathering the source datasets, the first step is label mapping. Each class of the target categories (e.g., material or brand) is label-encoded (each class is represented by a consecutive integer starting from zero), and an additional class “Unknown”, represented by \(-1\), is added. The mapping is performed as follows: if the class is in the set of target classes, the corresponding label (\(y_i\)) is assigned; otherwise, it is mapped to “Unknown”:

    $$ y_i = {\left\{ \begin{array}{ll} y_i^T, &{} \text {if } y_i^S \in \mathcal {Y^T} \\ -1, &{} \text {otherwise} \end{array}\right. } $$

    The label mapping allows processing all the samples of the source datasets without manual relabeling and, at training time, detecting whether a sample has an “Unknown” label (its value is less than 0). Next, the labels are one-hot encoded by category: each category label is represented by a K-dimensional vector, where K is the number of classes in the category; the element at the position of the encoded label is set to 1, and the rest remain 0. The final label is the vector of concatenated one-hot encoded categories in \(\mathbb {R}^{\sum K_i} \). Figure 2 shows an example of the label encoding process for each image, where each category generates a one-hot encoded vector and the vectors are concatenated to produce the final label. If the label is “Unknown”, a K-dimensional zero vector is generated (see the sketch after Fig. 2).

  • Model training. Model training is performed with Adam [9], a stochastic gradient-based optimizer. The training split is composed of all source datasets, randomly shuffled, where a training sample is a feature-label pair \((x_i, y_i)\). The training is performed in two steps:

    1. Attention training: the parameters of the attention module described in Sect. 3.2 are tuned first; the rest of the model parameters are frozen (not updated during backpropagation), and the attention module is trained only with the target dataset on one prediction category (e.g., brand or material). The category selected is the one that provides the most local, domain-tailored information (here, the brand category).

    2. Prediction training: after the attention module is trained, the layers in charge of the final prediction are trained. Both the attention module and the feature extractor are frozen during this step, and the model is trained with the source datasets.

    The proposed pipeline assumes that the feature extractor is already pre-trained on a general dataset (ImageNet) and “knows” how to extract features from images. The attention module is trained to weigh the features relevant to the target task on the tailored dataset, so that step 2 (Prediction training) pays more attention to these features while training the final prediction layers. The model is trained using backpropagation with the loss presented in Sect. 3.3.

  • Model evaluation. To evaluate the model on highly unbalanced datasets with missing labels, a non-zero average macro F1-score is used as the performance metric. The F1-score is the harmonic mean of precision and recall, thus penalizing a model whose predictions are biased toward a majority class. The macro F1-score is calculated per class and averaged over the prediction categories; classes with no samples in a dataset are not considered in the average (a minimal sketch of this metric follows the list).
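To make the metric concrete, the following is a minimal sketch of the non-zero average macro F1-score for one prediction category, assuming integer-encoded labels with \(-1\) for “Unknown” and using scikit-learn’s per-class F1; the helper name and the use of scikit-learn are our assumptions, not the article’s implementation.

```python
import numpy as np
from sklearn.metrics import f1_score

def non_zero_macro_f1(y_true, y_pred):
    """Non-zero average macro F1 for one prediction category (sketch).

    Samples labeled "Unknown" (-1) are dropped, and only classes that
    actually appear in the ground truth enter the average, avoiding the
    division by zero that empty classes would cause in the recall.
    """
    known = y_true >= 0
    y_true, y_pred = y_true[known], y_pred[known]
    present = np.unique(y_true)                        # classes with samples
    per_class = f1_score(y_true, y_pred, labels=present, average=None)
    return float(per_class.mean())

# The reported score would then be the mean over the three categories, e.g.:
# score = np.mean([non_zero_macro_f1(t, p) for t, p in zip(truths, preds)])
```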

Fig. 2.

Example of the label encoding process for each image. Each category generates a one-hot encoded vector, and the vectors are concatenated to produce the final label.
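As a concrete illustration of Fig. 2 and the data preparation step, below is a minimal sketch of the label mapping and concatenated one-hot encoding. The class lists are truncated placeholders (the full sets are listed in Sect. 2), and the function names are our assumptions.

```python
import numpy as np

# Hypothetical, truncated target class sets for two of the categories.
TARGET_CLASSES = {
    "material": ["plastic", "PET", "cardboard", "aluminum"],
    "packaging": ["wrapper", "container", "box", "can"],
}

def map_label(source_label, category):
    """Label mapping: a source class inside the target set keeps its
    integer-encoded label; anything else becomes "Unknown" (-1)."""
    classes = TARGET_CLASSES[category]
    return classes.index(source_label) if source_label in classes else -1

def encode_sample(raw_labels):
    """One-hot encode each category and concatenate them into the final
    label vector in R^(sum K_i). Unknown labels become a zero vector."""
    parts = []
    for category, classes in TARGET_CLASSES.items():
        one_hot = np.zeros(len(classes), dtype=np.float32)
        idx = map_label(raw_labels.get(category), category)
        if idx >= 0:                     # leave the zero vector for -1
            one_hot[idx] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)

# e.g., a source image with a known material but no packaging label:
print(encode_sample({"material": "PET"}))  # -> [0. 1. 0. 0. 0. 0. 0. 0.]
```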

3.2 Attention Module

The model architecture is composed of a feature extractor, an attention module, and a custom head for performing the predictions (Fig. 3).

Fig. 3.

Model architecture composed of a feature extractor (blue) that produces a feature map (red). The feature map is passed through a CBAM attention module with spatial (yellow) and channel (green) submodules, and finally fully connected layers produce the prediction for each category. (Color figure online)

The feature extractor is a CNN model pre-trained on ImageNet, with the prediction head removed; only the output of the last convolutional layer is used. Any state-of-the-art model could be used for feature extraction, such as VGG16 [18], MobileNet [5], or ResNet50 [3]. For example, ResNet50 (which achieved the highest score) produces a feature map of dimensions (7, 7, 2048) (red in Fig. 3). ResNet50 is a CNN architecture that uses skip connections to mitigate the vanishing-gradient problem of backpropagation in deep models. The feature map is passed through a CBAM attention module [21], composed of channel (green in Fig. 3) and spatial (yellow in Fig. 3) submodules.

(i) The channel attention module produces a channel attention map (\(M_c\)) from the inter-channel relationships of the feature map (F), where each channel can be considered a feature detector; channel attention emphasizes what is meaningful in an input image. In the channel module, the spatial dimensions are squeezed by two pooling operations that produce average (\(F_{avg}^c\)) and max (\(F_{max}^c\)) pooled features, each passed through the same Fully Connected (FC) network:

$$ M_c(F) = \sigma \left( FC(F_{avg}^c) + FC(F_{max}^c) \right) $$

(ii) The spatial attention module produces a map (\(M_s\)) using the inter-spatial relationships of the features. It determines the relevant information of the feature map (F) and encodes where to emphasize or suppress its contribution to the prediction. The spatial attention map is calculated by applying a Sigmoid activation (\(\sigma \)) to a convolution (\(f^{n\times n}\)) over the concatenated max (\(F_{max}^s\)) and average (\(F_{avg}^s\)) pooled features of the feature map:

$$ M_s(F) = \sigma \left( f^{n \times n}(F_{avg}^s ; F_{max}^s ) \right) $$

Each attention map multiplies the feature map: the channel attention vector scales each channel of the map, and the spatial attention map scales each position (Fig. 3). The attention modules (spatial and channel) are trained only on one prediction category with the target dataset; later, their weights are frozen, and the rest of the model is trained for all the categories.
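A minimal tf.keras sketch of the CBAM module just described is shown below; the reduction ratio and kernel size follow common CBAM defaults [21] and are assumptions here, as the article does not report them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(feature_map, reduction=16, kernel_size=7):
    """CBAM sketch: channel attention M_c = sigmoid(FC(F_avg) + FC(F_max)),
    then spatial attention M_s = sigmoid(conv([F_avg; F_max]))."""
    channels = feature_map.shape[-1]

    # Channel attention: squeeze the spatial dims with avg/max pooling and
    # pass both descriptors through the same shared FC network.
    shared = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),
        layers.Dense(channels),
    ])
    avg = shared(layers.GlobalAveragePooling2D()(feature_map))
    mx = shared(layers.GlobalMaxPooling2D()(feature_map))
    m_c = tf.sigmoid(avg + mx)[:, None, None, :]          # (B, 1, 1, C)
    x = feature_map * m_c                                 # scale each channel

    # Spatial attention: pool over the channel axis, concatenate, and
    # convolve to produce a (B, H, W, 1) attention map.
    avg_s = tf.reduce_mean(x, axis=-1, keepdims=True)
    max_s = tf.reduce_max(x, axis=-1, keepdims=True)
    m_s = layers.Conv2D(1, kernel_size, padding="same",
                        activation="sigmoid")(tf.concat([avg_s, max_s], -1))
    return x * m_s                                        # scale each position

# e.g., applied to the (7, 7, 2048) ResNet50 feature map mentioned above:
# features = tf.keras.applications.ResNet50(include_top=False)(images)
# attended = cbam_block(features)
```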

3.3 Focal Loss for Incomplete Data

The Focal Loss [11] is used for training models on highly unbalanced datasets. For instance, in object detection there is usually far more background than foreground objects, which biases the model toward the over-represented class. Other techniques exist for dealing with dataset imbalance, such as dataset sampling, performed by oversampling the classes with fewer observations or undersampling the majority classes. Another common technique is to weight each sample depending on its class, defining its contribution to the loss during training (weighted loss). However, both of these techniques usually do not work well when there are large differences between the classes (e.g., the brand distribution in Fig. 6), because training becomes inefficient and the model tends to overfit as a small set of samples is continually repeated.

Focal Loss (FL) is intended for extreme class imbalance: it adds a term \((1-p)^\gamma \) to the Cross-Entropy loss that reduces the contribution of easy samples and focuses training on the difficult ones. The parameter \(\gamma \) adjusts how much easy samples are down-weighted:

$$ FL(p_i) = -(1 - p_i)^\gamma \log (p_i) $$

In our case, the multi-label problem can be decomposed into multiple classification problems, given that the categories are independent. We have a composed loss L that sums each category’s loss \(L_c\) weighted by a factor \(W_c\):

$$ L = \sum _{c=1}^{C} W_c L_c $$

Each category loss is computed using the Focal Loss, and each sample i is only considered if its label for that category is known. Given that the “Unknown” labels are set to a zero vector and the Cross-Entropy loss is computed by multiplying the log of the predictions by the true probability (\(t_i\)), the loss for each category is computed as follows:

$$ L_c = - \frac{1}{n_c} \sum _{i=1}^{n} (1 - p_i^c)^\gamma \log (p_i^c) \, t_i^c $$

Each category loss is normalized by the number of samples with known labels (\(n_c\)), where \(p_i^c\) is the model prediction for sample i in category c and \(t_i^c\) is the true label of the sample for the same category.
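The following is a minimal sketch of this composed loss, assuming one softmax output per category concatenated in the prediction vector; the category slices and the weights \(W_c\) are illustrative placeholders, not values from the article.

```python
import tensorflow as tf

# Slices of each category inside the concatenated label vector; the case
# study uses 12 materials, 12 packaging types, and 40 brands.
CATEGORY_SLICES = {"material": (0, 12), "packaging": (12, 24), "brand": (24, 64)}
CATEGORY_WEIGHTS = {"material": 1.0, "packaging": 1.0, "brand": 1.0}  # W_c, assumed

def focal_loss_incomplete(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Sketch of L = sum_c W_c * L_c, each L_c a focal loss normalized by
    the number of *known* samples n_c. Because an unknown label is encoded
    as a zero vector, multiplying by t_i^c removes those samples from the
    sum automatically."""
    total = 0.0
    for name, (lo, hi) in CATEGORY_SLICES.items():
        t = y_true[:, lo:hi]                        # true one-hot (or zeros)
        p = tf.clip_by_value(y_pred[:, lo:hi], eps, 1.0)
        per_sample = -tf.reduce_sum((1.0 - p) ** gamma * tf.math.log(p) * t,
                                    axis=-1)
        n_c = tf.maximum(tf.reduce_sum(t), 1.0)     # known samples in batch
        total += CATEGORY_WEIGHTS[name] * tf.reduce_sum(per_sample) / n_c
    return total
```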

4 Definition of a Multi-label Deep Learning Model Using Our Open Data Approach

4.1 Dataset Acquisition and Preparation

In our case study, we train a predictive function that takes an image of waste and outputs three categories of predictions: its material, type of packaging, and product brand (Sect. 2). For acquiring the target data, a photography device (A in Fig. 4) was designed to control environment variables such as lighting and background and to perform efficient data collection. The photography device is composed of five elements (B in Fig. 4):

1. Chassis made of aluminum profile that allows configuring the components in multiple positions;
2. Profile unions;
3. Multiple cameras with supports that allow placing them in any position and rotation;
4. Configurable lighting ring with three colors (natural, warm, and cold) and ten light intensities;
5. Background panels that can be changed to use different colors.

Fig. 4.

(A) Photography device used for the target dataset acquisition; the photo shows three cameras and a lighting ring. (B) Photography device design, composed of five main parts: chassis in aluminum profile, profile unions, multiple cameras, lighting ring, and background panels

The target dataset comprises 624 products commonly consumed at Universidad EAFIT, Colombia. An ID is assigned to each product, which is labeled according to the three categories to predict (material, type of packaging, and product brand). Generic products are also included for items whose labels cannot be determined. For each product, multiple photos are taken at three deformation levels: (0) no deformation; (1) mild deformation, where the product is opened for consumption and somewhat deformed; and (2) severe deformation of the packaging. In addition to the photos taken with the device, 191 photos of products in different contexts were taken. Figure 5 shows random samples from the target dataset.

Fig. 5.

Random sample images from the target dataset, taken with the photography device and in random contexts

The target dataset is composed of 11,207 images, of which 25% are used for testing. Figure 6 shows the distribution of the three prediction categories in the target dataset.

Fig. 6.

Class distribution of the target dataset for the three prediction categories

The source dataset was created from 34 online datasets and contains 75,101 images, of which 20% are used for testing. The sources, each class’s contribution to the source dataset, and the respective references are available at the following link: Open Data waste datasets.

4.2 Model Training

The training dataset split comprises the source and target training splits. First, the attention module (Sect. 3.2) is trained for brand prediction, as this category contains the most target-tailored information. After the attention module is trained, the rest of the model (type of packaging and material) is trained with the attention module’s weights frozen. The model is trained with Adam [9] and the loss function described in Sect. 3.3. All the models were trained for 18 epochs in total: ten epochs for the attention module and eight epochs, with early stopping, for the rest of the prediction heads (material and type of packaging). A sketch of the two training steps is shown below.
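The following tf.keras sketch illustrates the two-step freezing schedule; the layer names, the datasets `target_ds`/`source_ds`, the loss (reusing the `focal_loss_incomplete` sketch from Sect. 3.3), and the early-stopping patience are all assumptions for illustration, not the article’s exact code.

```python
import tensorflow as tf

# `model` is assumed to be a Keras model exposing the sub-modules of Fig. 3
# under these hypothetical layer names.
backbone = model.get_layer("feature_extractor")     # ImageNet pre-trained
attention = model.get_layer("cbam")
heads = [model.get_layer(n) for n in
         ("material_head", "packaging_head", "brand_head")]

# Step 1 - Attention training: only the CBAM weights are updated, using the
# target dataset and the brand category; gradients still flow back through
# the frozen brand head into the attention module.
backbone.trainable = False
for head in heads:
    head.trainable = False
attention.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=focal_loss_incomplete)
model.fit(target_ds, epochs=10)

# Step 2 - Prediction training: freeze the attention module as well, and
# train the prediction heads on the shuffled union of all source datasets.
attention.trainable = False
for head in heads:
    head.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=focal_loss_incomplete)
model.fit(source_ds, epochs=8,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)])
```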

Five models were trained, four of them using the proposed approach with different feature extractors: (i) OpenData-VGG16, using VGG16 [18] as the feature extractor; (ii) OpenData-MOBIL, using MobileNet [5]; (iii) OpenData-RESN, using ResNet50 [3]; and (iv) OpenData-NoFL, with ResNet50 as the feature extractor but trained without Focal Loss. The fifth, BASE, is the traditional approach where a pre-trained ResNet is fine-tuned with Cross-Entropy loss.

Figure 7 shows the training curves of the highest-performing model (OpenData-RESN). There is a larger gap between the training and validation splits for the brand category because it is harder to generalize: it contains more classes, and many of them look very similar overall (the package) while differing only in details (the brand logo). This is less the case for material and packaging type; for the packaging type, for example, the object’s shape characteristics are more universal.

Fig. 7.

Training curves of the highest-performing model (OpenData-RESN). (A) Training of the material and type of packaging heads; (B) brand and attention module training. Dashed lines show the performance on the validation split (5% of the training dataset).

The data pre-processing was the same for every model evaluated: the images were loaded and converted from RGB to BGR, and each color channel was zero-centered with respect to the ImageNet dataset, without pixel intensity scaling (a minimal sketch of this pre-processing is shown below). After pre-processing, feature extraction was performed with a pre-trained CNN model directly from the images, without feature selection. Hyperparameter selection was performed manually on the cross-validation split (5% of the training split), and the samples of each split remained the same in all the experiments (Table 1).
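The described pre-processing corresponds to the “caffe”-style ImageNet preprocessing shipped with Keras, so one plausible implementation is the snippet below; the file name and input size are illustrative.

```python
import numpy as np
import tensorflow as tf

# Load one image at the network input size (224x224 is the ResNet50 default).
image = tf.keras.preprocessing.image.load_img("waste.jpg", target_size=(224, 224))
x = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)  # (1, 224, 224, 3)

# RGB -> BGR and per-channel zero-centering on ImageNet means, no scaling.
x = tf.keras.applications.resnet50.preprocess_input(x)

# Frozen pre-trained feature extractor: outputs the (7, 7, 2048) map of Fig. 3.
extractor = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
features = extractor(x)
print(features.shape)   # (1, 7, 7, 2048)
```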

Table 1. Training hyperparameters used in all the experiments.

4.3 Model Evaluation

Table 2 summarizes the performance of the evaluated models; blue indicates performance on the source dataset and green on the target dataset. The metric used is the non-zero average macro F1-score, where non-zero means that classes without samples are not considered in the computation, to avoid division by zero. Given the unbalanced target dataset, the F1-score is used as the evaluation metric because it penalizes low precision and low recall. The proposed OpenData-RESN, with ResNet50 as the feature extractor, achieved the highest result. Although the average macro F1-score is almost the same on the source and target datasets (within 1%), there are major differences in two categories: material (-15 on target) and brand (+23 on target). The difference in material performance may arise because the prediction depends on specific visual features that can be misleading, for instance packaging that looks like another material (plastic resembling aluminum). On the other hand, many source samples are labeled as “other” while looking very similar to a brand in the target dataset.

Table 2. Models performance evaluation results. The metric used is the macro F1-score (Sect. 3.1)

Figure 8 shows the confusion matrices for material and type of packaging of OpenData-RESN on the target dataset. It is worth noting that the target dataset is extremely unbalanced (Fig. 6). Additionally, one class (organic) has no evaluation samples and is therefore not taken into account in the evaluation metric (non-zero average macro F1-score), to avoid division by zero in the recall computation. Also, most of the misclassifications occur in the classes with more samples (“easy classes”); this could be due to the Focal loss, and to the model being able to learn the difficult classes because it is trained on a larger domain.

Fig. 8.

Normalized confusion matrices for material (left) and type of packaging (right) predictions of the highest-performing model (OpenData-RESN) on the target dataset.

To evaluate the effect of the Focal loss, two brand classifiers were trained, with and without it: the model with Focal loss achieved 84%, while the one trained with Cross-Entropy loss achieved 59%.

5 Conclusions

This article presented a TL approach for using multiple source datasets from different domains with incomplete labels (Sect. 3), built on three main elements: a training pipeline, an attention module, and a Focal loss for missing labels. The approach was validated in a case study of waste identification using an image-based multi-label model with three categories: material, type of packaging, and product brand (Sect. 4).

The proposed approach performs better than the standard way of training custom models (a 15% increase in average macro F1-score; see Table 2) on the target domain. At the same time, the model trained with the proposed approach performs similarly across all the domains (85% average macro F1-score). The model was not biased toward the classes with more samples, even though the dataset was highly unbalanced (Fig. 6), and was able to differentiate the classes in both domains (Table 2).

The selection of the feature extractor has little impact on performance (1% across all the domains and 6% on the target). These differences could be due to the size of the feature map of each evaluated architecture.

The difference in the model’s performance across prediction categories (11% between material and type of packaging) could be due to some categories using more “generalizable” image features: for instance, the type of packaging is related to the shape of the object, which is the same in different contexts, whereas the brand or the material depends solely on the object’s appearance.

The inclusion of the Focal Loss has a positive impact on the models’ performance: 4% higher for the OpenData models on the target dataset (see Table 2) and 25% higher for the brand classifier trained with Focal loss (see the last paragraph of Sect. 4.3).

Future work should focus on techniques and algorithms for using fewer samples of the target dataset and on reducing the impact of the image background on inter-domain models, as well as on exploring different electromagnetic spectra and lighting conditions, with proper camera calibration, to improve material identification.

6 Dataset Statement

This research uses 34 datasets collected from different sources. The sources and their respective references are available at the following link: Open Data waste datasets.