1 Introduction

Content-based image retrieval (CBIR) is a method to retrieve the most similar images to a query image. This system searches similar images using features such as color, texture, and shape. One of the most common methods that can be used for this purpose is the K-nearest neighbors algorithm (KNN) [1,2,3]. This algorithm returns the most similar image using a distance function within feature space.

Retrieving similar images using only low-level features is one of the main shortcomings of the conventional content-based image retrieval (CBIR) systems. Among various techniques, machine learning such as deep learning [4,5,6,7,8,9,10], sparse coding for bag-of-words (BoW)-based approach [11, 12], Fisher vectors [13], etc., have been actively investigated as possible directions to bridge the semantic gap in the long term. This paper presents an effort to overcome this drawback and proposes a CBIR approach in which the retrieved labels of images in the multi-label classification framework satisfy user expectations.

It is also noteworthy to consider that images are in different classes and each of them would be more separable based on one particular kind of feature. For instance, assume a color feature would be more discriminant in one image, so this image is more distinguishable based on the color feature, while another one would be more discernible based on texture or shape features. Therefore, to have an effective search with more accurate results, one way is to search images based on a combination of features using the distance function. Besides, for the purpose of using the discriminative power of each feature concurrently and more purposefully, it is advantageous to train the weights for features. For example, when color features are more likely to be discriminant than the other features, more weight must be assigned to the color features and vice versa [14,15,16].

Then, if weighting is supposed to be applied to the whole data, considering that the data scope is large and dispersed, it would be difficult to find an ideal set of weights for features. Contrastingly, if the weights are found within subspaces, it will be more probable to achieve weights which yield more accurate retrieval results. Thus, we suggest dividing feature space into some regions in order to do a more intelligent search within smaller regions to retrieve similar images to the query. This strategy makes searching faster and more accurate [17,18,19]. The clustering algorithm is an intelligent method to divide the space into regions containing more similar data. Obviously, finding samples similar to the query image would be easier in such regions [20].

Based on the aforementioned discussion, in this paper a local feature weighting algorithm [21] is used for CBIR application. This algorithm first clusters the data and simultaneously trains the features in each cluster to have the appropriate weights for such features based on minimizing the classification error rate. In other words, both clustering and classification methods are used conjointly to create a powerful framework for retrieving similar images to the query.

The rest of the paper is organized as follows. In Sect. 2, we review some related studies in our multi-label learning framework such as C-means clustering, KNN and a feature extraction algorithm, which are used in this paper. In Sect. 3, we explain our method for multi-label classification using feature weighting algorithms and C-means clustering based on classification error rate minimization (MLC-FWC). In Sect. 4, we present experimental results and show the superiority of our algorithm over its counterparts which do not use either feature weighting or a multi-label framework. Finally, conclusions are presented in Sect. 5.

2 Related works

2.1 Multi-label classification

Multi-label classification has become popular in many areas [22]. For instance, in text categorization, a document may belong to multiple classes simultaneously [23], or in an automatic image annotation scenario, a scene may be linked to multiple concepts [24]. Similarly, different labels can be assigned to each video in a video indexing domain [25]. This situation also happens in functional genomics where multiple functions may be represented by a single gene [26]. In all the above examples, each instance is associated with multiple labels, and this situation is different from traditional single-label classification. Figure 1 shows a sample image from the Corel1000 database that can be classified as beach, sea, sky, tree and even mountain by considering the semantic meaning of the image.

Fig. 1
figure 1

Sample multi-label data from Corel1000: beach, sea, sky, tree and mountain

There are some methods that do CBIR in multi-label framework. For example, in [27, 28] the authors created a visual dictionary for each group in the dataset. Then, they proposed a system for image retrieval based on local features using a BoW model that considered mid-level representations. The methods based on BoW create a codebook of visual discriminating patches (visual words) and then compute statistics (using the codebook) on the visual word occurrences in the test image. The system tries to bring more accuracy with the option to use local rather than global features to reduce the semantic gap issue and enhance the performance of the CBIR by performing image annotation.

In what follows, we provide a formal definition of the multi-label classification problem [29]. Let X denote the instance space and Y the set of class labels. Then, the goal is to learn a function \( f:{\mathbf{X}} \to 2^{{\mathbf{Y}}} \) which is defined on a given dataset { \( ({\mathbf{x}}_{1} .{\mathbf{y}}_{1} ).({\mathbf{x}}_{2} .{\mathbf{y}}_{2} ) \ldots ({\mathbf{x}}_{N} .{\mathbf{y}}_{N} )\} \), where \( {\mathbf{x}}_{i} \in {\mathbf{X}}. i = 1 \ldots N \) is an instance and \( {\mathbf{y}}_{i} \subseteq {\mathbf{Y}} \) is a set of labels \( \left\{ {y_{i1} .y_{i2} \ldots y_{{il_{i} }} } \right\} \),\( y_{ik} \in {\mathbf{Y}} .k = 1.2 \ldots l_{i} \), \( l_{i} = 1 \ldots H. \) Here, H is the number of labels, N is the number of data samples and li denotes the number of labels in \( {\mathbf{y}}_{i} \).

2.2 KNN classification

The KNN classification rule has been used successfully in many pattern recognition applications. This method is simple, and the distance measure in this algorithm can be various kinds of distance function. We use Euclidean distance, mostly used for dissimilarity measurement in image retrieval due to its effectiveness. It measures the distance between two vectors of images by \( d\left( {{\mathbf{x}},{\mathbf{x}}^{{\prime }} } \right) = \sqrt {\mathop \sum \nolimits_{j = 1}^{n} \left( {{\text{x}}_{j} - {\text{x}}_{j}^{{\prime }} } \right)^{2} } \), where x and x’ are two different vectors and n is the feature dimension. Since the purpose of our algorithm is to assign weights to the features, in this framework we use the weighted Euclidean distance, defined as:

$$ d_{w} \left( {{\mathbf{x}},{\mathbf{x}}^{{\prime }} } \right) = \sqrt {\mathop \sum \limits_{j = 1}^{n} {\text{w}}_{j} \left( {{\text{x}}_{j} - {\text{x}}_{j}^{{\prime }} } \right)^{2} } $$
(1)

where \( {\text{w}}_{j} \) is the weight corresponding to the jth feature.

2.3 C-means clustering

C-means [30] is one of the most reliable and widely used clustering algorithms. The procedure follows a simple and quick way to cluster a dataset to a certain number of clusters. The main idea is to define C centroids, one for each cluster as a prototype. Then, a binding is created between each data point and one of these C cluster centroids. First, the cluster centroids are set randomly [31]. This algorithm minimizes iteratively the total squared Euclidean distance between the data points in feature space and the cluster centroids. In each iteration, the centroids change their location with respect to the data points in each cluster. After changing the centers, the data points associated with each cluster will change based on the new centroids. These steps will be repeated until no more changes happen or centroids move too slightly (less than a threshold).

Let \( {\mathbf{X}} = \left\{ {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots {\mathbf{x}}_{N} } \right\} \) be a set of data points and V = \( \left\{ {{\mathbf{v}}_{1} ,{\mathbf{v}}_{2} , \ldots {\mathbf{v}}_{C} } \right\} \) represent the C cluster centers. C-partitioning \( \left\{ {\theta_{1} \ldots \theta_{C} } \right\} \) is created by the Euclidean C-means algorithm with the aim to minimize the following objective function:

$$ J_{1} \left( {{\mathbf{V}},{\mathbf{W}}} \right) = \mathop \sum \limits_{k = 1}^{C} \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} \|{\mathbf{x}}_{i} - {\mathbf{v}}_{k}\|_{{\mathbf{w}}_{\text{k}}}^{2}$$
(2)

where \( {\mathbf{W}} = \left\{ {\varvec{w}_{1} ,{\mathbf{w}}_{2} , \ldots ,\varvec{w}_{\varvec{C}} } \right\} \) is the set of local feature weights in each cluster and \( \|{\mathbf{x}}_{i} - {\mathbf{v}}_{k}\|_{{\mathbf{w}}_{\text{k}}}^{2}\) is a weighted Euclidean distance measure between a data point \( {\mathbf{x}}_{i} \) and the cluster center \( {\mathbf{v}}_{k} \). In the kth cluster, \( {\text{N}}_{\text{k}} \) is the number of samples and \( {\mathbf{w}}_{k} = \left\{ {w_{kj} , 1 \le j \le n} \right\} \), where n is the number of features (dimensions).

2.4 Feature weighting

Feature weighting is a technique used to approximate the optimal degree of influence of individual features using a training set. When successfully applied, highly relevant features are attributed a high weight value, whereas irrelevant features are given a weight value close to zero [32]. Feature weighting can be used not only to improve the classification accuracy but also to discard features with weights below a certain threshold value and thereby increase the resource efficiency of the classifier [33, 34].

2.5 Feature extraction

Feature extraction has an undeniable role in the efficiency of the classification process within the image retrieval system. In the proposed method, we use low-level features such as color and texture. For texture features, the wavelet correlogram method [35] and, for color, the method based on a color correlogram with vector quantization [36] are used, because they have shown a good individual ability to classify images.

2.5.1 Wavelet correlogram

Wavelet correlogram is a CBIR system based on texture. It applies the color correlogram method [37] on quantized wavelet coefficients and creates texture features. As a result, this method gains from the multiscale–multiresolution characteristic of the wavelet transform and inherits the translation–rotation invariance property from the color correlogram.

According to the enhanced Gabor wavelet correlogram (EGWC) indexing algorithm, three different scales of wavelet transform are calculated.

In each scale, the non-maximum suppression block is aimed to reduce the data redundancy as it is explained in [35]. Then, the coefficients are discretized using three sets of quantization thresholds. Thanks to optimal selection of thresholds, improved retrieval performance is guaranteed. Finally, the text feature is calculated based on horizontal and vertical auto-correlogram of quantized coefficients.

2.5.2 Vector-quantized color correlogram

A vector-quantized color correlogram (VQCC) is a compound CBIR system designed to utilize both color correlogram and vector quantization methods [36]. The VQCC algorithm quantizes the RGB color of the input image using a vector quantizer. The resulting pixel vector represents a cluster of points in the three-dimensional RGB space. Then, for each pixel inside a cluster, based on its Euclidean distance from the cluster center, a binary code is specified. In other words, the pixel is labeled with the code in the codebook corresponding to the closest cluster center. Finally, an index vector is created by computing the auto-correlogram of these binary codes.

Image retrieval is performed using a comparison between indices. The codebook is constructed by the following three steps:

  1. a.

    Some images are selected randomly from the dataset. (The number of images is determined experimentally based on the total number of images and image categories.)

  2. b.

    C-means clustering extracts pixel clusters for this population.

  3. c.

    The centers of clusters make code vectors in the codebook for quantization.

3 Multi-label classification using feature weighting and C-means clustering (MLC-FWC)

The goal of the learning algorithm is to minimize the total error rate of the KNN classifier [32]. To achieve this goal, we use C-means clustering and local feature weighting in each cluster. The weights of features as parameters of the distance function are learnt by our proposed method that attempts to minimize the average error rate per cluster.

The schematic diagram of the proposed method is shown in Fig. 2. In the training phase, first, texture and color features of database images are extracted. Then, in an iterative process, a supervised C-means clustering, which is explained later, is run to divide the input space into proper subspaces, and then a KNN classifier tries to minimize the classification error on each cluster by means of local feature weighting. This procedure is executed till the feature weights converge. In the test phase, first the nearest cluster to the query is found based on its distance from the center of each cluster. The nearest cluster together with the weights (found in the training phase) represents a new input space within which the most similar images to the input query are retrieved.

Fig. 2
figure 2

The schematic diagram of multi-label classification with features weighting and C-means clustering (MLC-FWC). The classification error is minimized within each cluster. (Among images with the same label, retrieve N similar images according to their distance from the query.) The dashed boxes show how shape features can be integrated to MLC-FWC

3.1 Training phase

In the training phase of this algorithm, clustering and classification are combined to construct a model that benefits from the advantages of both methods. In other words, clusters are improved by taking the classifier feedback, and classification is improved by using clustering information. The feedback from classification is used for relocating misclassified samples into other clusters and finding a new partitioning of the data, so a new labeling is assigned accordingly. In other words, a quasi-supervised clustering is used in this framework to classify more accurately. In contrast, classification can be improved by the clustering information, since it divides the space into some related subspaces. This process continues till no sample is displaced between clusters. For this process, a method is also proposed to weight features locally based on minimizing the error of each cluster [32].

So in our framework, clustering and classification are incorporated in a single objective function in order to partition the input space into related subspaces. It makes classification and retrieval simpler and faster. Our purpose is to construct related clusters and minimize the error rate of the NN classifier within each cluster using leave-one-out (LOO) error rate by doing feature weighting. These weights can be represented as a weight matrix: \( {\mathbf{W}} = \left\{ {{\mathbf{w}}_{k} ,1 \le k \le C} \right\} \), where C is the number of clusters. It is noteworthy that this definition assigns distinct weights to the different features in each cluster.

MLC-FWC first uses an approximation of the NN classification error as introduced in [32] to define a classification objective function. Then, it tries to minimize the objective function such that the samples are assigned to their corresponding cluster.

In the training phase, first the data are randomly clustered. Then, the idea is to learn the weights of the features in each cluster such that, for each sample \( {\mathbf{x}}_{i} \), the nearest same-class sample seems closer, while the nearest different-class sample appears farther. This strategy reduces the weighted distance between same-class samples. The LOO NN classification error as a function of feature weights can be written as [32]:

$$ J_{2} \left( {\mathbf{W}} \right) = \frac{1}{N}\mathop \sum \limits_{k = 1}^{C} \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} \text{S}_{\beta } \left( {\frac{{d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ = } } \right)}}{{d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ \ne } } \right)}}} \right) $$
(3)

where \( N \) is the number of samples, \( {\mathbf{x}}_{ = } , {\mathbf{x}}_{ \ne } \) are the nearest same-class and different-class samples of \( {\mathbf{x}}_{i} \) in the kth cluster and \( d_{w} \) is the weighted Euclidean distance as defined in (2). As the framework performs on the multi-label images, the same-class sample \( {\mathbf{x}}_{ = } \) denotes one which has the largest number of common labels with \( {\mathbf{x}}_{i} \), and different-class sample \( {\mathbf{x}}_{ \ne } \) indicates one that has the fewest common labels with \( {\mathbf{x}}_{i} \). If there are more than one sample with this condition, the nearest and farthest ones are selected as the same-class and different-class samples, respectively. Let \( \text{S}_{\beta } \) be the sigmoid function with slope β, centered at z = 1:

$$ \text{S}_{\beta } \left( z \right) = \frac{1}{{1 + e^{{\beta \left( {1 - z} \right)}} }} $$
(4)

For large β, if \( {\mathbf{x}}_{i} \) is closer to some prototypes of its own class than to any other from a different class, then \( d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ = } } \right) \) < \( d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ \ne } } \right) \) and the argument of \( \text{S}_{\beta } \) will be smaller than 1 and \( \text{S}_{\beta } \) approaches 0. On the contrary, if \( {\mathbf{x}}_{i} \) is closer to some prototypes of a different class than to any other from its own class, the argument of \( \text{S}_{\beta } \) will be greater than 1 and \( \text{S}_{\beta } \) approaches 1. Accordingly, \( J_{2} \left( {\mathbf{W}} \right) \) is in fact the LOO NN estimate of the misclassification probability over the training set X.

In this paper, gradient descent optimization is employed, so it requires the function \( J_{2} \left( {\mathbf{W}} \right) \) to be differentiable with respect to the corresponding parameters \( {\mathbf{w}}_{k} \) and \( {\mathbf{v}}_{k} \) to be minimized, which is why using the sigmoid function is adopted here.

Now, we combine the two objective functions (\( J_{1} \left( {{\mathbf{V}},{\mathbf{W}}} \right) \) and \( J_{2} \left( {\mathbf{W}} \right) \)) which both try to minimize the error, one for classification and the other for clustering. So the overall objective function is defined as:

$$ J\left( {{\mathbf{V}},{\mathbf{W}}} \right) = J_{1} \left( {{\mathbf{V}},{\mathbf{W}}} \right) + \gamma J_{2} \left( {\mathbf{W}} \right) $$
(5)

where γ is a weight which determines the trade-off between the two objective functions. Substituting \( J_{1} \left( {{\mathbf{V}},{\mathbf{W}}} \right) {\text{and}} J_{2} \left( {\mathbf{W}} \right) \) yields to:

$$ J\left( {{\mathbf{V}},{\mathbf{W}}} \right) = \left( {\mathop \sum \limits_{k = 1}^{C} \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} \| \varvec{x}_{i} - \varvec{v}_k\|_{{{\mathbf{w}}_{k} }}^{2} + \frac{1}{N}\mathop \sum \limits_{k = 1}^{C} \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} \text{S}_{\beta } \left( {\frac{{d_{w} \left( {\varvec{x}_{i} ,\varvec{x}_{ = } } \right)}}{{d_{w} \left( {\varvec{x}_{i} ,\varvec{x}_{ \ne } } \right)}}} \right)} \right) $$
(6)

In this equation, the first term specifies the clustering error rate, while the second term designates the classification error and \( \gamma \) is set to 1. Within each cluster, the weight of features corresponding to (\( {\mathbf{x}}_{ = } \)) should be modified to make this sample appear closer to \( {\mathbf{x}}_{i} \) in a feature-dependent manner, while those corresponding to (\( {\mathbf{x}}_{ \ne } \)) should be modified such that it appears farther.

Obviously, (6) is differentiable. To solve the objective function, a gradient descent procedure is proposed which guarantees convergence. The gradient descent evolution equation is:

$$ \theta^{t + 1} = \theta^{t} - \alpha \left( {\frac{\partial J\left( \theta \right)}{\partial \theta }} \right) $$
(7)

where θt and α denote the optimal parameters (here v and w) in the tth iteration and the step size, respectively. Before replacing J(θ) with the objective function in (6), we define:

$$ R\left( {{\mathbf{x}}_{i} } \right) = \left( {\frac{{d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ = } } \right)}}{{d_{w} \left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{ \ne } } \right)}}} \right) $$
(8)

The partial derivative of \( J\left( {{\mathbf{V}},{\mathbf{W}}} \right)_{ } \) with respect to \( {\mathbf{w}}_{k} \) is calculated as:

$$ \frac{{\partial J\left( {\varvec{V},\varvec{W}} \right)}}{{ \partial \varvec{w}_{k} }} \cong \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }}\| {\mathbf{x}}_{i} - {\mathbf{v}}_{k}\|^{2} + \frac{1}{N}\mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} \text{S}_{\beta }^{{\prime }} \left( {R\left( {{\mathbf{x}}_{i} } \right)} \right)\frac{{\partial R\left( {{\mathbf{x}}_{i} } \right)}}{{\partial \varvec{w}_{k} }} $$
(9)

where the derivative of \( \text{S}_{\beta }^{{\prime }} \left( z \right) \) and \( \frac{{\partial R\left( {{\mathbf{x}}_{i} } \right)}}{{\partial {\mathbf{w}}_{k} }} \) is:

$$ \text{S}_{\beta }^{{\prime }} \left( z \right) = \frac{{dS_{\beta } \left( z \right)}}{dz} = \frac{{\beta e^{{\beta \left( {1 - z} \right)}} }}{{\left( {1 + e^{{\beta \left( {1 - z} \right)}} } \right)^{2} }} $$
(10)
$$ \frac{{\partial R\left( {\varvec{x}_{i} } \right)}}{{\partial \varvec{w}_{k} }} = \frac{1}{{2d_{w}^{2} \left( {\varvec{x}_{i} ,\varvec{x}_{ \ne } } \right)}}\left( {\frac{1}{{R\left( {\varvec{x}_{i} } \right)}}\|\varvec{x}_{i} - \varvec{x}_{ = }\|^{2} - R\left( {\varvec{x}_{i} } \right)\|\varvec{x}_{i} - \varvec{x}_{ \ne }\|^{2} } \right) $$
(11)

And the partial derivation of \( J\left( {{\mathbf{V}},{\mathbf{W}}} \right)_{ } \) with respect to \( {\mathbf{v}}_{k} \) is calculated as:

$$ \frac{{\partial J\left( {{\mathbf{V}},{\mathbf{W}}} \right)}}{{\partial {\mathbf{v}}_{k} }} \cong \mathop \sum \limits_{i = 1}^{{{\text{N}}_{\text{k}} }} - 2\varvec{w}_{k} \odot \left( {{\mathbf{x}}_{i} - {\mathbf{v}}_{k} } \right) $$
(12)

Therefore, we have two parameters \( {\mathbf{w}}_{k} \) and \( {\mathbf{v}}_{k} \) that should be optimized. We use the gradient descent algorithm to optimize these parameters. In this approach, the number of clusters starts from an initial value and then is learnt dynamically by the algorithm; we first update W in each cluster and then based on the modified W, the cluster centers will be updated. Samples are clustered again with the new weights and centers.

The gradient descent algorithm minimizes \( J\left( {{\mathbf{V}},{\mathbf{W}}} \right) \) via an iterative procedure. At each iteration, the weights \( {\mathbf{w}}_{k} \) and \( {\mathbf{v}}_{k} \) are updated in the opposite direction to \( \nabla J\left( {{\mathbf{V}},{\mathbf{W}}} \right) \) using the step sizes of \( \alpha \) and \( \delta \):

$$ \varvec{w}_{k}^{{\left( {t + 1} \right)}} = \varvec{w}_{k}^{\left( t \right)} - \alpha \frac{{\partial J\left( {{\mathbf{V}},{\mathbf{W}}} \right)}}{{\partial \varvec{w}_{k} }} $$
(13)
$$ {\mathbf{v}}_{k}^{{\left( {t + 1} \right)}} = {\mathbf{v}}_{k}^{\left( t \right)} - \delta \frac{{\partial J\left( {{\mathbf{V}},{\mathbf{W}}} \right)}}{{\partial {\mathbf{v}}_{k} }} $$
(14)

The values of \( \alpha \) and \( \delta \) are referred to as learning rates or learning step factors.

A pseudo-code of our training algorithm is presented in Fig. 3. It is noteworthy that all the parameters are initialized randomly. The maximum number of iterations is set empirically. The other parameter α, \( \delta \) and β are selected experimentally for each dataset.

Fig. 3
figure 3

A pseudo-code of the MLC-FWC algorithm

The computational complexity of the proposed algorithm as we can see in the pseudo-code of MLC-FWC algorithm in Fig. 3 is \( O\left( {I \times {\text{K}} \times \left( {N_{K} + {\text{N}}} \right)} \right) \), where I is the number of iterations, K is the number of clusters, NK is the number of samples in each cluster and N is the number of all training samples. Based on the experiments, the algorithm converges within less than 10 iterations, which is not a large number, K is a constant much smaller than the number of samples, and therefore negligible. NK is the number of samples in each cluster, which in the worst case is equal to N (for only one cluster case). Therefore, the overall complexity is O(N).

3.2 Test phase

In the test phase, for each query, its distance to all cluster centers is calculated based on local feature weights in each cluster. The nearest one is considered as the cluster corresponding to the query image. In other words, we find the proper cluster k which minimizes \( \sqrt {\mathop \sum \nolimits_{j = 1}^{n} {\text{w}}_{kj} \left( {x_{ij} - v_{kj} } \right)^{2} } \) among all clusters (k = 1,…,C). Based on the K value, the common labels among KNN will be assigned to the query as labels. If K = 1, all labels of NN will be considered as the query label or if K = 5, the labels that are repeated two times or more between all 5NN will be assigned to the query image.

Then, M similar images within that cluster with the same labels are assigned to the query and are retrieved as the similar images in the retrieval framework based on their distance to the query image.

3.2.1 Experimental results

The proposed method was evaluated on the Corel1000 database of images. The database consists of 1000 images from 10 categories in which each category has 100 images. The categories are African people, beaches, buildings, buses, dinosaurs, elephants, roses, horses, mountains and food. All these categories were used for the experiments. This dataset is most often used for CBIR systems as a single-label dataset. Since our algorithm is introduced in a multi-label classification framework, we labeled manually this dataset with 1 to maximum 5 labels based on the related semantics of the image. We did the same for the Corel5000 dataset [35]. We also evaluated our proposed method using the Scene multi-label dataset [38] which is a benchmark for multi-label image classification containing 2407 natural scene images. Their characteristics are summarized in Table 1.

Table 1 Characteristics of data sets

The performance of the proposed algorithm can be evaluated within two frameworks: first, based on criteria which are used for evaluating multi-label classification algorithms; second, by criteria that are used for evaluation of image retrieval systems. In the following, various criteria of both approaches are discussed.

3.3 Evaluation criteria

The performance of any retrieval system can be measured in terms of its precision and recall. Precision measures the ability of the system to retrieve only models that are relevant, while recall measures the ability of the system to retrieve all models that are relevant. In this paper, we use the criteria for a multi-label framework as defined in [23] and for image retrieval as defined in [35].

3.3.1 Multi-label classification framework

Let D = \( \left\{ {x_{1} ,x_{2} , \ldots x_{N} } \right\} \) be a multi-label evaluation dataset containing N labeled examples. Let \( {\hat{\text{Y}}}_{i} = {\text{H}}\left( {{\text{Q}}_{\text{i}} } \right) \) be the predicted label set for the pattern Qi as query image, while Yi is the ground truth label set for xi. A first metric called the averaged accuracy gives an average degree of similarity between the predicted and the ground truth label sets of all test examples:

$$ {\text{AA}}\left( {{\text{H}}, {\text{D}}} \right) = \frac{1}{\text{N}}\mathop \sum \limits_{{{\text{i}} = 1}}^{N} \frac{{\left| {{\text{Y}}_{\text{i}} \cap {\hat{\text{Y}}}_{i} } \right|}}{{\left| {{\text{Y}}_{\text{i}} \cup {\hat{\text{Y}}}_{i} } \right|}} $$
(15)

where |S| represents the cardinality of the set S. Two other metrics called average precision and average recall are also used in the literature [39] to evaluate a multi-label learning system. The former computes the proportion of correct positive predictions, while the latter calculates the proportion of true labels that have been predicted as positives:

$$ {\text{AP}}\left( {{\text{H}}, {\text{D}}} \right) = \frac{1}{\text{N}}\mathop \sum \limits_{{{\text{i}} = 1}}^{N} \frac{{\left| {{\text{Y}}_{\text{i}} \cap {\hat{\text{Y}}}_{i} } \right|}}{{\left| {{\hat{\text{Y}}}_{i} } \right|}} $$
(16)
$$ {\text{AR}}\left( {{\text{H}}, {\text{D}}} \right) = \frac{1}{\text{N}}\mathop \sum \limits_{{{\text{i}} = 1}}^{\text{N}} \frac{{\left| {{\text{Y}}_{\text{i}} \cap {\hat{\text{Y}}}_{i} } \right|}}{{\left| {{\text{Y}}_{\text{i}} } \right|}} $$
(17)

Another evaluation criterion is the F1 measure that is defined as the harmonic mean of the AP and AR metrics [40]:

$$ {\text{F}}1 = \frac{2}{{\frac{1}{\text{AP}} + \frac{1}{\text{AR}}}} $$
(18)

The values of these evaluation criteria are in the interval [0, 1]. Larger values of these metrics correspond to higher classification quality.

3.3.2 Image retrieval framework

Let \( {\text{Y}}\left( {{\text{Q}}_{\text{i}} } \right) \) be the set of retrieved images which are matched to the query image \( {\text{Q}}_{\text{i}} \):

$$ {\text{Y}}\left( {{\text{Q}}_{\text{i}} } \right) = \left\{ {{\text{I}}_{\text{i}} | {\text{rank}}\left( {{\text{I}}_{\text{i}} } \right) \le {\text{M}}, {\text{I}}_{\text{i}} \in {\text{A}}} \right\} $$
(19)

where \( {\text{rank}}\left( {{\text{I}}_{\text{i}} } \right) \) is the rank of \( {\text{I}}_{\text{i}} \) among M retrieved images and A is the subset of the image database having the same cluster as the query image. Retrieval precision was used for evaluation of the image retrieval. Precision is defined as the ratio of the number of retrieved relevant images to the total number of retrieved images:

$$ {\text{P}}\left( {{\text{Q}}_{\text{i}} } \right) = \frac{{\left| {{\text{Y}}\left( {{\text{Q}}_{\text{i}} } \right)} \right|}}{\text{M}} $$
(20)

where |Y(\( {\text{Q}}_{\text{i}} \))| represents the size of Y(\( {\text{Q}}_{\text{i}} \)). The average precision per cluster is defined as:

$$ {\text{P}}\left( {{\text{cluster}}_{k} } \right) = \mathop \sum \limits_{{{\text{i}} = 1}}^{{N_{k} }} {\text{P}}\left( {{\text{Q}}_{\text{i}} } \right)/N_{k} $$
(21)

where Nk is the number of images in the kth cluster. Finally, the average precision is given by:

$$ {\text{P}}_{\text{avg}} = \mathop \sum \limits_{{{\text{k}} = 1}}^{\text{C}} {\text{P}}\left( {{\text{cluster}}_{k} } \right)/{\text{C }} $$
(22)

where C is the number of clusters.

4 Results and discussion

There are several approaches to improve retrieval effectivity by feature weighting, and also plenty of methods which address the multi-label classification problem. However, we could not find any algorithm which uses feature weighting in a multi-label classification and retrieval framework. Consequently, we decided to evaluate our algorithm in the four following ways. First, we compare our results before and after the MLC-FWC algorithm within the multi-label classification framework. Second, we compare the MLC-FWC method to methods in [4147] which use feature weighting (FW) or feature fusion (FF) algorithms to combine features in a single-label image retrieval framework. Third, we compare our algorithm within a multi-label classification framework with the algorithms in [24, 29] which perform multi-label classification without feature weighting on the Scene multi-label dataset. Finally, the retrieval accuracy of the algorithm is evaluated by comparing it to two CBIR methods [35, 36] on the Corel1000 dataset within a multi-label image retrieval framework. For these results, all images are selected as query from the same dataset. In particular, a retrieved image is considered a match if and only if it is in the same category as the query in a single-label framework.

Tables 2 and 3 present the average values of accuracy, precision, recall and F1 measure for various groups of images in the Corel1000 and Corel5000 databases. Here, we used VQCC [36] as color feature and EGWC [35] as texture feature. Tables 2 and 3 demonstrate that the proposed MLC-FWC method provides better results when the EGWC and VQCC features are fused.

Table 2 Average values of accuracy, precision, recall and F1 measure resulting from applying MLC-FWC to Corel1000 database with/without feature combining by FW
Table 3 Average values of accuracy, precision, recall and F1 measure resulting from applying MLC-FWC to Corel5000 database with/without feature combining by FW

To evaluate our proposed method in comparison with other algorithms, the average precision results of the MLC-FWC algorithm and various state-of-the-art methods are shown in Table 4. In these algorithms [41,39,40,41,42,43,44,48], features are weighted or fused in a single-label image retrieval framework to retrieve more similar images and improve their result by feature weighting. All of these algorithms evaluated their methods on the Corel1000 dataset.

Table 4 Average precision of some new feature combination (FW and FF) methods in comparison with MLC-FWC with k = 3 on Corel1000 dataset within single-label image retrieval framework

In Table 5, we also present the average precision and F1 measure results of our algorithm in comparison with some new multi-label classification algorithms [24, 29] on the Scene multi- label dataset with its original features within a multi-label classification framework. All the results are reported directly from their corresponding papers.

Table 5 Average precision of some new multi-label algorithms in comparison with MLC-FWC with k = 3 on scene dataset and within multi-label classification framework

In Table 6, we evaluate the performance of the MLC-FWC algorithm in the retrieval framework. In the current implementation of MLC-FWC, the dataset images are indexed separately with the EGWC [35] and VQCC [36] algorithms. Therefore, we compared the results of the MLC-FWC with retrieval results of these two individual algorithms in terms of average precision (Table 6).

Table 6 Average precision for each category of EGWC and VQCC in comparison with MLC-FWC on Corel1000 dataset within multi-label image retrieval framework

5 Conclusion

In this paper, we present a novel approach for image retrieval in a multi-label framework. The idea of this approach is to make a framework using clustering and classification together to reveal the structure of data and divide the input space into meaningful parts. Also, we fuse features by feature weighting learning algorithm to improve the retrieval results by purposefully using the ability of each feature to separate data.

Compared to the majority of papers that use deep learning architectures to perform CBIR [4,5,6,7,8,9,10], the proposed method simplifies the retrieval process and decreases the problem of “semantic gap,” by finding the labels of the query in the multi-label framework before doing retrieval, and also improves the performance through feature weighting. The main advantages of using a deep neural network are providing high accuracy and reducing the efforts in handcrafted feature design. The deep neural network automatically enables feature learning from raw data [10]. Nonetheless, it requires a huge computing power to train, so it is a costly and time-consuming process. It also suffers from difficulty in configuration and has not interpretable results. So if the features are ready to use with a satisfactory performance, it is not valuable to use that as a training model.

Experimental results on the Corel1000 and Corel5000 datasets demonstrate a significant supremacy of the proposed CBIR system. The results are evaluated by four statistical measures, all of which confirm the significant improvement in the proposed framework.