Introduction

The world is moving towards the fourth technological revolution, in which everything is automated through the integration of technologies that blur the boundaries between the physical, biological, and digital spheres. In this digital environment, multimedia data are growing exponentially in every domain, and roughly one-third of these data are images. According to an International Data Corporation (IDC) report, the big data market will exceed US$125 billion by 2019, and the number of sensors will reach 1 trillion by 2030 (Marjani et al. 2017).

The number of image repositories in domains such as medicine, digital image archives, art galleries, geographic information systems, e-commerce, law enforcement, biometric identification, and historical analysis is increasing dramatically. Active research into image matching and retrieval began in the 1970s. Text-based image retrieval was the common technique in the early stages: images are matched on textual descriptions such as labels, captions, keywords, and semantic context (Yue et al. 2011). Such descriptions do not capture the entire image content, different users may give different interpretations and annotations to the same image, and the annotations are subjective and incomplete. Text-based image retrieval is therefore neither a standard nor a practical method for large image repositories. Content-based image retrieval is the better alternative that overcomes these problems, and active research on it began in the early 1990s.

Deep learning has been the hottest research area in computer vision and machine learning over the last decade. It mimics the human brain, applying high computing and processing power through multiple stages of transformation without hand-crafted features. Its applications include face recognition, image detection, voice recognition, video analysis, health care, smart cities, smart agriculture, smart grid energy usage analysis, business intelligence, natural language processing, and more. A convolutional neural network (CNN) is a stack of operations, such as convolution, pooling, and activation layers, that recognizes the visual patterns of images. The milestone began in 2012, when the supervised deep CNN model AlexNet (Krizhevsky et al. 2012) won the image classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Popular deep neural network models perform object detection, object localization, and image classification tasks on these challenges (Russakovsky et al. 2015). ZFNet is an extension of AlexNet with a small change in filter size to avoid pixel loss; it uses a 7 × 7 filter, smaller than AlexNet's, but it fails to reduce the computational cost. The Inception model, also called GoogLeNet, provides a much deeper network of 22 layers yet reduces computational complexity, parameter count, and memory usage through the inception module, which performs dimension reduction with 1 × 1 convolutions. Microsoft's research team developed the Residual Network (ResNet), which overcomes the vanishing/exploding gradient problem in extremely deep networks by introducing residual blocks: a ResNet is a stack of residual blocks with identity connections between layers. ResNeXt stacks blocks in the ResNet style and introduces cardinality, the size of the set of transformations; because of its uniform topology, fewer parameters are required for deeper networks. Progressive Neural Architecture Search (PNASNet) uses a new learning structure for CNNs based on reinforcement learning and evolutionary algorithms, producing more optimized results than the previous models.

The taxonomy of image retrieval systems is shown in Fig. 1. The first category, text-based retrieval, specifies the possible ways to annotate or describe images.
The next category, content-based image retrieval, describes an image by its features or content, and its types are specified in the figure. The last category is the hybrid approach, which combines both text and content to describe images.

Fig. 1

Taxonomy of image retrieval systems

In content-based image retrieval, images are matched on the features or contents of the images themselves. QBIC (Faloutsos et al. 1994), Photobook (Pentland et al. 1996), Virage (Gupta and Jain 1997), VisualSEEK (Smith and Chang 1996), Netra (Ma and Manjunath 1997), and SIMPLIcity (Wang et al. 2001) are some commercial CBIR systems, and wide-ranging surveys of CBIR can be found in (Liu et al. 2007; Datta et al. 2005). Improving retrieval efficiency and reducing the response time, the semantic gap, and the sensory gap are the primary objectives of every image retrieval system. A survey of high-level semantic-based systems that narrow the semantic gap is presented in (Liu et al. 2007). Based on the type of features, user intervention, and computational intelligence, CBIR has been classified into low-level image retrieval, high- or semantic-level image retrieval, image retrieval using relevance feedback, and intelligent image retrieval. For a large-scale image repository, the conventional method of matching every image during the query phase degrades the performance of the retrieval system. It is therefore necessary to first select a subset of the most relevant images from the large repository and then match the query image features only against that subset. The K-means algorithm, an unsupervised machine learning technique, is the simplest and most common clustering algorithm and can be applied to select this initial subset of images.

The standard K-means algorithm was first proposed by Lloyd in 1957 at Bell Labs, and it remains one of the most popular data mining clustering algorithms because of its efficiency and simplicity. It increases intra-cluster similarity and decreases inter-cluster similarity using the sum of squared distances between feature points. A generalized version of the K-means algorithm is presented in (Cheung 2003) to reduce the problems of the conventional algorithm: conventional K-means requires the number of clusters to be pre-determined, and it suffers from the dead unit problem, i.e., incorrect initialization of cluster centers. K-means converges quickly to a local optimum but often fails to find the global optimum. To overcome these drawbacks, the moth flame optimizer is applied before K-means clustering (Mirjalili 2015). The moth flame optimizer, developed by Mirjalili in 2015, produces a higher convergence rate towards the global solution. Hence, to improve convergence and avoid trapping in local optima, this system reduces the search space by combining the K-means clustering algorithm with the bio-inspired moth flame optimizer. The MFO algorithm improves the initial random solutions and converges to a better point in the search space, so the initial seed values of the K-means algorithm, namely the number of clusters and the cluster centroids, are taken from the moth flame algorithm. The performance is evaluated on the WANG (COREL1K) dataset and the COIL dataset.

Related Work

In (Younus et al. 2015), particle swarm optimization is combined with K-means clustering to reduce the search space by clustering the images. The K-means algorithm finds a local optimal solution effectively but rarely reaches the global optimum, so this method first uses particle swarm optimization to locate the cluster centroids optimally, and these centroids are given as seed values to K-means. The experiment was conducted on the WANG dataset, and the method proved better than the other CBIR systems compared. The inter-class boundary problem in the feature space is addressed by replacing simple distance-based retrieval for texture databases (Dash et al. 2015); class membership-based retrieval reduces the searching time. Class membership and classification confidence-based retrieval (CM-CCR) is more computationally efficient than class membership-based retrieval and yields better retrieval performance than classification confidence-based retrieval. Texture-based image retrieval using two novel wavelet features is proposed in (Huang and Dai 2003): energy distribution pattern strings provide a fuzzy matching mechanism that acts as a filter, and the selected images are compared with the query image using composite sub-band gradient vectors.

Graph-theoretical cluster-based image retrieval using unsupervised learning can be embedded in any CBIR system, including those with relevance feedback. Most image classification or clustering algorithms are global, static, and independent of the query; this one is a dynamic clustering algorithm because it captures the characteristics of the query image (Chen et al. 2005). An efficient framework for image retrieval based on a rule-based system is proposed in (ElAlami 2011a): the retrieval process is limited to the set of images matched to the same rule as the query image, and it requires rule generation and rule pruning. A survey of clustering techniques is presented in (Xu and Wunsch 2005). Unsupervised clustering is used to select a subset of relevant images, narrowing the feature search space; images within a particular cluster tend to have high similarity and are dissimilar to images in other clusters. Their findings show that no clustering algorithm is universally accepted as providing the most promising results on general datasets; hence, a clustering algorithm should be selected based on domain-specific information, with suitable proximity measures and a criterion function.

K-means clustering with B+ tree indexing is used in (Yildizer et al. 2012): during the querying phase, relevant images are retrieved by matching the cluster centroids, and the three closest clusters are targeted for similarity matching with the query image. It uses two parameters, CG and CS, to determine the distance range, which increases the computational complexity. Unlike static clustering, an SQL-based query, a dynamic way of selecting the images closest to the query image, is used in (Annrose and Seldev 2016). It reduces the search space using rule generation: intra-normalization is applied to divide each feature element into five intervals; during querying, the interval of each query feature element is determined, the intervals are combined with a Boolean operator, and a rule is generated. This method misses some true positive images that lie on the border of adjacent intervals, which affects the recognition rate. To overcome this issue, an SQL-based range query is used to select the initial set of images (Annrose and CC, 2018); it selects the images whose features lie within a range close to the query image features. As discussed in the introduction, deep learning addresses feature representation and similarity matching for CBIR tasks, and deep learning-based clustering techniques cluster data points based on complex patterns rather than distance measures. Running K-means on representation vectors learned by a deep autoencoder tends to give better results than running K-means directly on the input vectors. An empirical study of deep CNNs for the CBIR task is provided in (Wan et al., 2014), with the following conclusions: pre-trained CNN models can be used for feature extraction that captures everything from low-level features to high-level semantic information, and these features outperform traditional hand-crafted features, resulting in significant improvements in retrieval efficiency. Caron et al. (2019) proposed deep clustering, a novel end-to-end method that jointly learns the parameters of a neural network and the cluster assignments; K-means clustering works on the set of features generated by the ConvNet. A recurrent framework was proposed in (Yang et al. 2016) to iteratively learn ConvNet features and clusters within a single model; it shows promising performance on small datasets but may not scale to large image datasets. Although deep neural network models provide the best results, they require high-performance GPUs, mandating upgraded system requirements.

Proposed System

Content-based image retrieval is commonly used in a broad range of fields, from handheld devices to large-scale applications such as commerce, satellite image processing, and the medical industry. The proposed CBIR system, shown in Fig. 2, operates in four primary stages: the first three are performed during the indexing (offline) process, and the querying or retrieval phase is performed as an online process.

Fig. 2

Proposed CBIR architecture

Indexing (offline process)

  • Feature extraction

  • Feature transformation and reduction

  • Selection of a subset of images (KMFO clustering)

Querying phase (online process)

  • Query feature extraction and pre-processing

  • Query feature matching with cluster centroid

  • Similarity matching with selected subset of images

The two phases in designing a CBIR system are:

Indexing phase

In this phase, image information such as color, texture, or shape is separated into features that are stored in an index data structure along with a link to the actual image. Indexing enhances data access speed and improves the accuracy of the retrieval process; hence, it is an important factor in image database systems. In content-based image retrieval systems, indexing facilitates automatic identification and abstraction of the visual content of an image.

Image acquisition

This method uses a heterogeneous, broad-domain image dataset, the COREL1K dataset, also known as the WANG dataset (Wang et al. 2001), which consists of 1000 images in 10 different categories.

Feature extraction

Feature extraction is the primary reduction process in an image retrieval system: the process of transforming images from pixel space to feature space (Otávio et al. 2012). This system extracts both global and local features by fusing low-level shape, color, and texture features. The gray-level co-occurrence matrix (GLCM) and the Gabor wavelet transform are used to extract texture features. The dominant color descriptor, HSV color histogram, color correlogram, and color moments are extracted as color features, and shape features are extracted using region-based shape components to determine the largest connected component.

Gray-level co-occurrence matrix

It is a second-order statistical analysis method that captures the intensity variation of pixel pairs at different distances and orientations. It gives the distribution of gray values that co-occur throughout the image, i.e., the probability of co-occurrence of gray values at given relative positions. For an n × m image A, the co-occurrence matrix GLCM is defined as:

$$ glcm\left(x,y\right)=\sum \limits_{p=1}^n\sum \limits_{q=1}^m\left\{\begin{array}{l}1, if\ A\left(p,q\right)=x\ and\ A\left(r,s\right)=y\\ {}0, otherwise\end{array}\right. $$
(1)

where x and y are image intensity values, and (p, q) and (r, s) are adjacent spatial positions in image A, i.e., |p − r| ≤ 1 and |q − s| ≤ 1.

The contrast, correlation, energy, and homogeneity features are extracted from this spatial distribution. Contrast gives the gray-level intensity difference between adjacent pixels over the entire image.

$$ Contrast={\sum}_{x,y}^n{\left|x-y\right|}^2 glcm\left(x,y\right) $$
(2)

Inverse difference moment gives the local homogeneity, and it is the inverse of contrast.

$$ \mathrm{Inverse\ Diff\ Moment}={\sum}_{x,y}^{n}\frac{1}{{\left|x-y\right|}^2}\, glcm\left(x,y\right),\quad x\ne y $$
(3)

Correlation specifies the measure of the interrelationship between adjacent pixel values over the whole image.

$$ Correlation={\sum}_{x,y}^{n}\frac{\left(x-{\mu}_x\right)\left(y-{\mu}_y\right)\, glcm\left(x,y\right)}{\sigma_x{\sigma}_y} $$
(4)

Energy measures textural uniformity; it equals one for a constant image.

$$ Energy=\sum \limits_{x,y} glcm{\left(x,y\right)}^2 $$
(5)

The uniformity and closeness of pixels in an image are determined by the homogeneity.

$$ homogeneity={\sum}_{x,y}\frac{glcm\left(x,y\right)}{1+\left|x-y\right|} $$
(6)
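As an illustration, these four GLCM statistics can be computed with scikit-image; the sketch below assumes a grayscale uint8 image, and the distance and orientation settings are illustrative choices rather than values fixed by this paper.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray):
    """Contrast, correlation, energy, and homogeneity from a GLCM.

    `gray` is a 2-D uint8 image; distance 1 and four orientations
    (0, 45, 90, 135 degrees) are illustrative settings.
    """
    glcm = graycomatrix(gray,
                        distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "correlation", "energy", "homogeneity"]
    # Average each property over the four orientations
    return np.array([graycoprops(glcm, p).mean() for p in props])
```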

HSV color histogram

The HSV color histogram is a widely used feature that computes the histogram of an image in the HSV color model.
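A minimal sketch of such a histogram with OpenCV follows; the bin counts per channel are assumptions for illustration, not the paper's configuration.

```python
import cv2

def hsv_histogram(bgr_img, bins=(8, 4, 4)):
    """Normalized 3-D HSV color histogram; bin counts are illustrative."""
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    # OpenCV ranges: H in [0, 180), S and V in [0, 256)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()
```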

Region-based shape descriptor

Object shape features provide a strong cue to object identity; an object can be recognized easily from its shape. The two categories of shape features are contour- or boundary-based and region-based descriptors (Su et al. 2010). Contour-based shape descriptors use peripheral information about object shapes rather than interior shape details, whereas region-based descriptors use information from both the peripheral and interior regions of the shape. Region props, a global feature extraction method, measure the properties of image regions; here, the number of connected components in an image is determined and the largest connected component is extracted. Five shape features, namely area, centroid, perimeter, solidity, and circularity, are extracted from this largest region.
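A sketch of this extraction with scikit-image's regionprops is shown below, assuming a binary segmentation mask has already been obtained upstream; circularity is computed here as 4πA/P², a common definition. The centroid contributes two components, so the five shape features yield six values.

```python
import numpy as np
from skimage.measure import label, regionprops

def shape_features(binary_mask):
    """Area, centroid, perimeter, solidity, and circularity of the
    largest connected component in a binary mask."""
    regions = regionprops(label(binary_mask))
    if not regions:
        return np.zeros(6)
    largest = max(regions, key=lambda r: r.area)
    perimeter = largest.perimeter or 1.0  # guard against zero perimeter
    circularity = 4 * np.pi * largest.area / perimeter ** 2
    cy, cx = largest.centroid
    return np.array([largest.area, cy, cx,
                     largest.perimeter, largest.solidity, circularity])
```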

Dominant color

The histogram of each color channel is determined in the RGB color model, and from the histogram values the percentages of the red, green, and blue color components are computed.

Color moments

Color moments describe the distribution of color. The first-, second-, and third-order color moments are determined using the following equations; the first-order moment (mean) is:

$$ {M}_i=\frac{1}{n}{\sum}_{j=1}^{n}A\left(i,j\right) $$
(7)

Standard deviation is the second-order moment, obtained as the square root of the variance:

$$ {SD}_i=\sqrt{\frac{1}{n}{\sum}_{j=1}^{n}{\left(A\left(i,j\right)-{M}_i\right)}^2} $$
(8)

Skewness, the third-order color moment, describes the shape and asymmetry of the color distribution; it is calculated as:

$$ {Sw}_i=\sqrt[3]{\frac{1}{n}{\sum}_{j=1}^{n}{\left(A\left(i,j\right)-{M}_i\right)}^3} $$
(9)

The color moments form a scale-invariant color feature vector that can be applied to images of any size.
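The three moments of Eqs. (7)-(9) can be computed per channel as in the following NumPy sketch; the signed cube root is used so that negative third moments keep their sign.

```python
import numpy as np

def color_moments(img):
    """First three color moments (mean, standard deviation, skewness)
    per channel, following Eqs. (7)-(9)."""
    feats = []
    for c in range(img.shape[2]):
        ch = img[..., c].astype(np.float64).ravel()
        mean = ch.mean()
        sd = np.sqrt(((ch - mean) ** 2).mean())
        third = ((ch - mean) ** 3).mean()
        # Signed cube root preserves the sign of the third moment
        skew = np.sign(third) * np.abs(third) ** (1.0 / 3.0)
        feats.extend([mean, sd, skew])
    return np.array(feats)  # 9 values for a 3-channel image
```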

Wavelet transform

The multi-resolution analysis of an image is represented using wavelets, which localize a signal in both the space and frequency domains. In the one-dimensional discrete wavelet transform, a signal is decomposed into high-frequency and low-frequency components (Arai and Rahmad 2012). The two-dimensional DWT decomposes an image into four sub-bands: LL retains the approximation details, HL gives the vertical edge detail, LH retains the horizontal edge detail, and HH contains the high-frequency (diagonal) detail. Features are extracted by determining the first-order mean and second-order standard deviation of each 2D DWT sub-band.
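A minimal sketch with PyWavelets follows; the Haar wavelet is an illustrative choice, and the sub-band names follow PyWavelets' (approximation, (horizontal, vertical, diagonal)) convention.

```python
import numpy as np
import pywt

def wavelet_features(gray, wavelet="haar"):
    """Mean and standard deviation of the four 2-D DWT sub-bands."""
    # pywt returns (approximation, (horizontal, vertical, diagonal))
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(np.float64), wavelet)
    feats = []
    for band in (cA, cH, cV, cD):
        feats.extend([band.mean(), band.std()])
    return np.array(feats)  # 8 values
```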

Color correlogram

The color correlogram encodes the spatial correlation of colors; it describes the global distribution of the local spatial correlation of colors. The image is quantized to 16 levels; hence, it generates 64 feature dimensions.
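A minimal auto-correlogram sketch on a quantized grayscale image is given below; the distance set {1, 3, 5, 7} is an assumption, chosen so that 16 levels × 4 distances yield the 64 dimensions mentioned above.

```python
import numpy as np

def auto_correlogram(gray, levels=16, distances=(1, 3, 5, 7)):
    """For each gray level and distance d, estimate the probability that
    a horizontal/vertical neighbor at distance d has the same level."""
    q = np.clip((gray.astype(np.int64) * levels) // 256, 0, levels - 1)
    feats = []
    for d in distances:
        counts = np.zeros(levels)
        totals = np.zeros(levels)
        for dx, dy in ((d, 0), (0, d)):  # neighbors below and to the right
            a = q[:q.shape[0] - dx, :q.shape[1] - dy]
            b = q[dx:, dy:]
            same = a == b
            for lv in range(levels):
                mask = a == lv
                totals[lv] += mask.sum()
                counts[lv] += (mask & same).sum()
        feats.extend(counts / np.maximum(totals, 1))
    return np.array(feats)  # levels * len(distances) = 64 dimensions
```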

Feature transformation and selection

Feature transformation is a pre-processing step that can be applied before any data mining or machine learning technique. This system uses multiple features, and each feature has its own domain or range of values. During similarity matching, features with larger values dominate and effectively receive more weight than features with small ranges. Hence, feature transformation is necessary to give all feature elements the same significance (Aksoy and Haralick 2001). This method uses intra-normalization, where each feature element is normalized independently:

$$ {FV}_{i,j}=\frac{FV_{i,j}-{\mathit{\operatorname{Min}}}_i}{{\mathit{\operatorname{Max}}}_i-{\mathit{\operatorname{Min}}}_i} $$
(10)
where FVi,j is the feature vector, and Maxi and Mini are the maximum and minimum values of each feature element.
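A NumPy sketch of this intra-normalization, assuming features are stored as a matrix with one row per image, follows; a guard avoids division by zero for constant columns.

```python
import numpy as np

def intra_normalize(FV):
    """Min-max normalization of each feature column to [0, 1], Eq. (10)."""
    mins = FV.min(axis=0)
    maxs = FV.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (FV - mins) / span
```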

The transformed features undergo a selection process to remove less significant features. Three types of feature selection are:

  • Filter-based feature selection is independent of any classifier or learning algorithms.

  • The wrapper-based selection method depends on learning algorithms.

  • The hybrid method combines both the filter method and the wrapper method.

This paper uses a filter-based feature selection method with a ranking algorithm that filters out feature elements having a large number of null values and few distinct values.


Algorithm 1 SQL-based feature reduction

Thus, features that have a large number of repeated or NULL values are eliminated: the counts of repeated and NULL values in each feature element are computed, and if the count exceeds a threshold, that feature is removed, yielding a reduced feature set. In the COREL dataset, 158 feature elements are initially extracted, and the reduced feature set consists of 143 feature dimensions.
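The original algorithm is SQL-based; the following pandas sketch mirrors the same rule, with the two thresholds as assumed parameters rather than the paper's values.

```python
import pandas as pd

def reduce_features(df, null_thresh=0.5, distinct_thresh=5):
    """Drop feature columns with too many NULLs or too few distinct
    values; thresholds are assumed parameters, not the paper's."""
    keep = []
    for col in df.columns:
        null_ratio = df[col].isna().mean()
        distinct = df[col].nunique(dropna=True)
        if null_ratio <= null_thresh and distinct >= distinct_thresh:
            keep.append(col)
    return df[keep]
```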

Search space reduction

When feature reduction techniques are used, some important features may be lost, causing retrieval efficiency to suffer. To improve retrieval accuracy and reduce response time, it is therefore necessary to reduce the search space as well. This system uses the K-means algorithm to cluster the images: the query image features are first compared with the cluster centers, and only the most relevant cluster of images is used for similarity matching, drastically shrinking the search. The K-means algorithm is an unsupervised hard clustering method widely used across applications, but the conventional algorithm requires the number of clusters to be pre-determined and suffers from the dead unit problem, i.e., incorrect initialization of cluster centers. Hence, this work combines the moth flame optimization algorithm with K-means clustering: the MFO algorithm determines the optimum number of flames, the flame positions are used to initialize the cluster centers, and the number of flames is assigned as the number of clusters (k) for K-means. The following algorithm gives the K-means clustering procedure.


Algorithm 2 K-means algorithm
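Since the algorithm figure is not reproduced here, the following minimal NumPy sketch shows the K-means procedure seeded with externally supplied centroids, as KMFO requires.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-6):
    """Plain K-means (Algorithm 2): assign each point to its nearest
    centroid, then recompute centroids, until convergence."""
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```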

Moth flame optimizer

Seyedali Mirjalili developed the moth flame optimization (MFO) algorithm, a bio-inspired algorithm, in 2015. The moth is a fancy insect belonging to the same family as the butterfly. It has a special navigation mechanism: a moth flies at night by maintaining a fixed angle with respect to the position of the moon. Around artificial light, however, moths fly spirally and gradually drop towards the flame, as represented in Fig. 3.

Fig. 3

Flying behavior of moth around the light flame

The set of moths is represented in a matrix as follows, and it represents the image feature vector:

$$ M=\left[\begin{array}{cccc}{M}_{1,1}& {M}_{1,2}& ..& {M}_{1,d}\\ {}{M}_{2,1}& {M}_{2,2}& ..& {M}_{2,d}\\ {}:& :& :& :\\ {}{M}_{n,1}& {M}_{n,2}& ..& {M}_{n,d}\end{array}\right] $$
(11)

where n is the number of moths (number of images) and d is the number of feature elements (number of feature dimensions).

Every moth has an array for storing the resultant fitness values:

$$ OM=\left[\begin{array}{l}{OM}_1\\ {}{OM}_2\\ {}:\\ {}{OM}_n\end{array}\right] $$
(12)

where OMi is the fitness value of the ith moth Mi.

Flame is the other key component of the algorithm. The flame matrix has the same dimensions as the moth matrix, and the set of flames is represented as:

$$ F=\left[\begin{array}{cccc}{F}_{1,1}& {F}_{1,2}& ..& {F}_{1,d}\\ {}{F}_{2,1}& {F}_{2,2}& ..& {F}_{2,d}\\ {}:& :& :& :\\ {}{F}_{n,1}& {F}_{n,2}& ..& {F}_{n,d}\end{array}\right] $$
(13)

where m is the number of flames and d is the number of variables.

Every flame has an array storing the resulting fitness values:

$$ OF=\left[\begin{array}{l}{OF}_1\\ {}{OF}_2\\ {}:\\ {}{OF}_n\end{array}\right] $$
(14)

Here, moths and flames are both solutions. The key distinction is that moths are search agents that move around the search space, while flames are the best positions found by the moths so far. The mathematical model is specified in the following equation, where the position of each moth is updated with respect to a flame:

$$ {M}_i=S\left({M}_i,{F}_j\right) $$
(15)

where Mi is the ith moth, Fj is the jth flame, and S indicates spiral function.

The logarithmic spiral function is stated as follows:

$$ S\left({M}_i,{F}_j\right)={D}_i\cdot {e}^{bt}\cdot \cos \left(2\pi t\right)+{F}_j $$
(16)

where Di is the distance between the ith moth and the jth flame, the constant b defines the shape of the logarithmic spiral, and t is a random number in [− 1, 1]; the spiral is illustrated in Fig. 4.

Fig. 4

A logarithmic spiral, space around a flame

The distance between an ith moth and the jth flame is calculated as follows:

$$ {D}_i=\left|{F}_j-{M}_i\right| $$
(17)

Here, the number of flames is decreased over the iterations, and it is calculated as:

$$ \mathrm{Flame\ no.}=\mathrm{round}\left(N-C\cdot \frac{N-1}{T}\right) $$
(18)

where C is the current iteration number, N is the maximum number of flames, and T is the maximum number of iterations. This gradual decrease in the number of flames balances exploration and exploitation of the search space.
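The following compact NumPy sketch illustrates the MFO loop of Eqs. (15)-(18); the bounds, the spiral constant b, and the rule assigning surplus moths to the worst remaining flame are common simplifications, not this paper's exact implementation.

```python
import numpy as np

def mfo(fitness, n_moths, dim, lb, ub, max_iter=100, b=1.0):
    """Minimal moth flame optimizer: moths spiral around flames (Eq. 16)
    while the flame count shrinks per Eq. (18). Minimizes `fitness`."""
    moths = lb + np.random.rand(n_moths, dim) * (ub - lb)
    flames, flame_fit = None, None
    for C in range(1, max_iter + 1):
        fit = np.array([fitness(m) for m in moths])
        if flames is None:
            order = fit.argsort()
            flames, flame_fit = moths[order].copy(), fit[order].copy()
        else:
            # Merge previous flames with current moths, keep the best
            all_pos = np.vstack([flames, moths])
            all_fit = np.concatenate([flame_fit, fit])
            order = all_fit.argsort()[:n_moths]
            flames, flame_fit = all_pos[order], all_fit[order]
        n_flames = round(n_moths - C * (n_moths - 1) / max_iter)  # Eq. (18)
        for i in range(n_moths):
            j = min(i, n_flames - 1)          # surplus moths share a flame
            D = np.abs(flames[j] - moths[i])  # Eq. (17)
            t = np.random.uniform(-1, 1, dim)
            moths[i] = D * np.exp(b * t) * np.cos(2 * np.pi * t) + flames[j]
            moths[i] = np.clip(moths[i], lb, ub)
    return flames[0], flame_fit[0]  # best position and its fitness
```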

K-means moth flame optimizer

The proposed work, illustrated in Fig. 5, combines the moth flame algorithm with K-means clustering. Mirjalili's (2015) MFO algorithm is theoretically able to improve the initial random solutions and converge to a better point in the search space. Using the moth flame algorithm, the number of clusters and the cluster centroids are determined and given as initial seed values to the K-means algorithm, as the following steps describe (see the sketch after the steps).

Step 1: The moth matrix and the flame matrix are initialized with the image feature vectors. In the context of clustering, a single flame position signifies a cluster centroid.

    M = {M1, M2, …, Mn} and F = {F1, F2, …, Fn} // initial moth and flame values

Step 2: For each moth:

    a) Calculate the moth fitness based on the clustering criterion argmin_j |Mi − Fj|, and update the best positions of the moths and flames.

    b) Calculate the number of flames using Eq. (18).

    c) Repeat until the stopping criterion is satisfied (maximum number of iterations).

Step 3: Apply the K-means algorithm using the best flame positions and the number of flames obtained from MFO.

Step 4: Return the clustered images and the cluster centroids.
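A sketch tying the steps together is given below; it reuses the mfo() and kmeans() sketches above, and the number of clusters k and the sum-of-squared-errors fitness are assumptions for illustration.

```python
import numpy as np

def kmfo_cluster(features, k=10, n_moths=20, mfo_iters=50):
    """KMFO sketch: MFO searches for good centroid positions, which
    then seed the kmeans() sketch from Algorithm 2."""
    n, d = features.shape
    lb, ub = features.min(axis=0), features.max(axis=0)

    def fitness(candidate):
        # A candidate (moth/flame) encodes k centroids as one flat vector
        cents = candidate.reshape(k, d)
        dists = ((features[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return dists.min(axis=1).sum()  # sum of squared errors (SSE)

    best, _ = mfo(fitness, n_moths=n_moths, dim=k * d,
                  lb=np.tile(lb, k), ub=np.tile(ub, k), max_iter=mfo_iters)
    return kmeans(features, centroids=best.reshape(k, d))
```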

Fig. 5

Proposed K-means moth flame optimizer

Retrieval phase

In this phase, the search of the CBIR index for images matching the query is performed. The desired image is described either by supplying a query image or by specifying image features. The collection of images is represented as a set of feature vectors; for the query input, the same set of features is extracted and processed using the feature transformation and selection techniques. The query feature vector is then compared with the cluster centroids obtained by the proposed KMFO algorithm, and the images of the best-matched cluster are compared with the query image to produce the most relevant results.
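A minimal sketch of this querying logic follows, assuming the query feature vector has already been normalized with the same intra-normalization as the index.

```python
import numpy as np

def retrieve(query_fv, centroids, labels, features, top_k=20):
    """Querying sketch: match the query to the nearest cluster centroid,
    then rank only that cluster's images by Euclidean distance."""
    nearest = ((centroids - query_fv) ** 2).sum(axis=1).argmin()
    idx = np.where(labels == nearest)[0]
    d = ((features[idx] - query_fv) ** 2).sum(axis=1)
    return idx[d.argsort()[:top_k]]  # indices of the most relevant images
```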

Experimental Results

This section presents the experimental results of the proposed method and compares it with other existing CBIR systems. In the proposed system, images are clustered in the offline process by combining moth flame optimization with the K-means algorithm. During the online or querying phase, the query image is first compared with the cluster centroids to identify the most relevant cluster, and the images belonging to the selected cluster are then compared with the query image to retrieve the most relevant images.

The performance measures used to evaluate the proposed system are precision, recall, F-measure, and mean average precision. Precision is the ratio of the number of relevant images retrieved (Nr) to the total number of images retrieved (Rt). Recall is the ratio of the number of relevant images retrieved (Nr) to the total number of relevant images in the dataset (Nt). F-measure is the harmonic mean of precision and recall. The computation time of the query phase is measured by the response time. Both precision and recall should be high for good retrieval performance; hence, the joint precision-recall curve is used to characterize the performance of the image retrieval system.

$$ Precision=\frac{Nr}{Rt}=\frac{TP}{TP+ FP} $$
(19)
$$ Recall=\frac{Nr}{Nt}=\frac{TP}{TP+ FN} $$
(20)
$$ F- Measure=2\left[\left( Precision\ast Recall\right)/\left( Precision+ Recall\right)\right] $$
(21)
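These measures reduce to a few lines of Python; the worked example below uses the bus query reported in Fig. 6a (16 relevant images among 20 retrieved, with 100 relevant images in the dataset).

```python
def retrieval_metrics(relevant_retrieved, retrieved, relevant_total):
    """Precision, recall, and F-measure per Eqs. (19)-(21)."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant_total
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Bus query of Fig. 6a: precision 0.80, recall 0.16
print(retrieval_metrics(16, 20, 100))
```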

The tests are conducted on two different datasets, COREL and COIL. More than half of the surveyed papers use the COREL dataset, which covers varied content including animals, buildings, African people, and natural scenery. This system uses the COREL1K dataset, widely accepted because of its heterogeneity and human-annotated ground truth: the images are pre-classified by domain experts into 10 categories of 100 images each. Initially, 158 image features are extracted in the feature extraction phase, and 15 features having few distinct values and many null values are removed. The images are then grouped into clusters using KMFO, and query inputs are randomly selected from each category to measure the average number of true positive images. Figure 6 shows the top 20 images retrieved for two sample query images; the first image in each category is the query image. Figure 6a shows that four of the twenty images are unrelated to the query image, giving 80% retrieval precision. The same test is repeated several times with different query images in the same category, and the average precision rate is 81%, as reported in Table 1. Figure 6b shows that one image is unrelated to the query image, a retrieval precision of 95%; the average precision rate for the horse class is also 95%. Table 1 compares the average precision of the proposed KMFO with other image retrieval systems; the average precision of KMFO is better than all the other systems except range query-based image retrieval. Figure 7 compares recall with other retrieval systems on the COREL1K dataset.

Fig. 6

a Sample input from bus category with 16 relevant images out of 20 retrieved. b Input from horse category with 19 relevant images out of 20 images retrieved

Table 1 Precision comparison of the different image retrieval systems
Fig. 7

Recall comparison of the different image retrieval systems

The next comparison is image retrieval with K-means versus KMFO; the performance measures precision, recall, and F-measure are shown in Table 2. The results show that the moth flame optimizer improves the performance of K-means to some extent.

Table 2 Precision, recall, and F-measure of K-means and KMFO (COREL1K dataset)

The COIL dataset (Nene et al., n.d.) is the second experimental image dataset; it consists of 1440 images in 20 categories, with 72 images per category. The images were captured with a camera against a common black background, with each object placed on a turntable and rotated through 360 degrees, yielding 72 different poses. After feature extraction, the search space is reduced by grouping similar images using the K-means algorithm with the moth flame optimizer, and the results are compared with and without the optimizer. A sample image from each category is shown in Fig. 8.

Fig. 8

Sample images in each category of the COIL dataset

Table 3 gives the retrieval precision, recall, and F-measure on the COIL dataset using the K-means and KMFO algorithms. The results show that the performance of the K-means algorithm improves slightly when its seed values are initialized by the moth flame optimizer.

Table 3 Precision, recall, and F-measure of K-means and KMFO (COIL dataset)

The retrieval time is compared with and without search space reduction. Figure 9 shows the execution times of the three methods: KMFO is slightly faster than K-means, and both are drastically faster than the original method without search space reduction.

Fig. 9

Comparison of retrieval time on COIL dataset

Conclusion

Clustering algorithms have been applied in the feature space to reduce the searching time, thereby reducing the response time without compromising retrieval accuracy. The proposed CBIR system combines the moth flame optimization algorithm with K-means clustering to overcome the drawbacks of the conventional K-means algorithm: random selection of the initial cluster centroids and of the number of clusters leads to the dead unit problem, which is mitigated by providing optimal values through MFO. The proposed method is compared with other existing systems on the COREL and COIL datasets. The results show that this system provides a satisfactory outcome, slightly better than the other methods. Future work will aim to improve retrieval accuracy by including feature dimension reduction and by applying other bio-inspired optimization algorithms.