1 Introduction

Content-based image retrieval (CBIR) has grown into an active research topic in computer vision (Peng et al. 2019; Zeng et al. 2019). Feature extraction plays a core role in computer vision applications: its goal is to extract and formulate a discriminative, meaningful image representation so that the most relevant or similar images can be returned for a given query image. Most earlier CBIR methods relied heavily on low-level image features such as spatial information, shape, texture and color (Somasundaran et al. 2020; Tian et al. 2019). Global features describe the visual content of an image as a whole; that is, they capture the complete visual content without singling out specific points of interest (Hongpeng 2019; Tesfaye and Pelillo 2019; Sujatha and Shalini Punithavathani 2018; Vinu 2016).

Over the past decades, computer vision researchers have devoted considerable attention to local image descriptors such as histograms of oriented gradients (HOG) (Argyriou and Tzimiropoulos 2017), speeded-up robust features (SURF) (Bay et al. 2008) and the scale-invariant feature transform (SIFT) (Lowe 2004). Local descriptors typically characterize the local information around key points in particular image portions (corners/edges, objects of interest, and regions). They have proved their strength in a wide range of current computer vision applications, including image retrieval, object tracking, visual object classification, panoramic stitching and scene categorization. Their main advantages over conventional global features are reliable matching across a wide range of imaging conditions (Amato et al. 2019; Lai et al. 2016) and invariance to image scale and rotation. Exploiting the complementary benefits of local and global image features to improve the discrimination and robustness of the image representation is the challenging and interesting task addressed in this work (Zheng et al. 2018).

In recent years, several soft computing methods and feature designs have been developed for computer vision applications (Sundararaj 2019; Al-Janabi and Alkaim 2019; Al-Janabi and Mahdi 2019; Sundararaj et al. 2018; Vinu 2019). Local descriptors handle various forms of image deformation (viewpoint changes, noise, rotation) and scale variation well, and thus improve system robustness. Global features, in contrast, consider the complete image structure; their treatment of objects and spatial relations is closer to the characteristics of human vision. However, high retrieval accuracy can only be achieved by extracting the proper features, and the dimension of the feature vector must also be considered carefully: if this step is neglected, increased computational cost and memory consumption degrade the performance of CBIR systems (Gladis 2019).

In this paper, a multi-level matching scheme combining local and global features is proposed for content-based image retrieval (CBIR). Our contribution has two main parts. First, a new multi-level structured representation is developed that captures the local and global information of an image effectively and is suitable for characterizing complex scenes and event categories. Second, a multi-level matching technique is introduced that identifies the optimal similarity among images using the Euclidean distance, and retrieval accuracy is improved through a hybrid similarity combining local and global information. The contributions of this work to the CBIR field are: (a) global and local information are combined into hybrid feature information; (b) a color-related feature (CRF) is introduced and combined with other relevant features to improve retrieval performance; (c) a multi-level matching (MLM) scheme with a two-step retrieval process is introduced.

The rest of this paper is organized as follows: Sect. 2 details related work on CBIR methods. Section 3 explains the multi-level feature extraction and the hybrid similarity-based multi-level matching scheme. The proposed image retrieval framework is introduced in Sect. 4. In Sect. 5, the experimental results are analyzed. Section 6 concludes the paper.

2 Review of related works

Current CBIR methods commonly employ local features within a Bag-of-Words (BoW) representation. CBIR studies of this kind have focused heavily on primitive image features (shape, spatial information, texture and color) to attain better retrieval accuracy (Dubey et al. 2014). The work in Bagri and Johari (2015) proposed a feature extraction technique based on texture and shape properties, using both shape-invariant Hu moments and the gray-level co-occurrence matrix; texture and shape features were combined for comparison, and retrieval accuracy was measured using recall and precision. In object recognition, shape is a major visual cue: binary shapes are classified based on feature detection, feature extraction and vector quantization. In Ramesh et al. (2015), the BoW model was used to develop an invariant-feature-based classification framework, with experiments applying a shape classifier to an animal shapes dataset. The work in Montazer and Giveki (2015) introduced image descriptors with two significant methods: a feature matrix was obtained by k-means clustering of extracted SIFT features, and two different forms of dimensionality reduction were employed to obtain a high precision rate; the Li database images and Caltech-101 were used for experimental validation. The work in Li et al. (2015) proposed a 3D shape retrieval technique.

In that evaluation, 6 and 12 dissimilar 3D shape retrieval techniques were considered, and a common benchmark evaluation was adopted to compare 26 retrieval techniques in the experimental analysis. The work in Anandh et al. (2016) used the wavelet transform, Gabor wavelet and color auto-correlogram for feature generation: first, color features were extracted in the RGB color space; then the proposed feature extraction technique extracted texture information; third, corner and edge detection was used to extract shape-based information.

The image retrieval method of Vipparthi and Nagar (2014) extracts color-texture features called Color Directional Local Quinary Patterns, computing RGB channel-wise directional edge information between the reference pixel and its surrounding pixels; the MIT-Color and Corel-5000 databases were used for experimental validation. The work in Iakovidou et al. (2015) used four image extraction techniques to extend and simplify the MPEG-7 descriptor functionality, ultimately generating interest points for an input image; UKbench and UCID were the two databases used for validation. Korytkowski et al. (2016) generated fuzzy classifiers from local image features for object classification, with the local features determined using a meta-learning approach; three classes of the PASCAL Visual Object Classes (VOC) dataset were used in the experimental analysis. The work in Wang et al. (2017) proposed a technique combining texture and shape features: texture features were extracted by applying a localized angular phase histogram in the hue-saturation-intensity (HSI) color space, whereas shape features were extracted by applying the exponent moments descriptor in the RGB color space. To simplify the selection process, a feature selection technique was suggested in El Alami (2011), intended to select the most relevant image features: texture and color features were extracted using a Gabor filter and a 3D color histogram, the feature space complexity was reduced by a genetic algorithm, and the Corel-1000 dataset was used for the experimental analysis.

Average precision rates were reported for all complex background images. The work in Shrivastava and Tyagi (2015) avoided feature fusion and retrieved images in separate steps: color features were first applied to retrieve a fixed number of images, and shape and texture features were then applied to filter the most relevant ones. Eliminating the normalization and fusion steps significantly reduced the computational cost, but the spatial information of an image was not classified accurately. Moreover, in El Alami (2014) the color co-occurrence matrix (CCM) was adopted by an ANN classifier; texture and color features were extracted by computing the difference between pixels of scan patterns (DBPSP), a feature matching strategy was presented to compute similarity values, and average rates were reported for all object images. Instead of multichannel descriptors, Xiao et al. (2014) captured image properties using two image channels, with embedded Sobel filter information used to improve the performance of a hyper-opponent color space; however, this method did not classify background and foreground objects accurately.

3 Multi-level structured feature extraction

In this section, a multi-level feature extraction scheme combining global and local features is explained. Local features are selected for their robustness and scalability with respect to local characteristics of an image, while global features are selected for their ability to capture the overall characteristics of an image. Combining local and global features improves the retrieval accuracy of CBIR systems. The global features are computed using CEDD and the color-related feature (CRF); the local features are computed using LBP and SURF.

3.1 Global feature extraction

3.1.1 Color-related features (CRF)

To extract color information, we propose a new image descriptor named the color-related feature (CRF), which effectively describes the spatial color information of an image. It works similarly to the color histogram feature (CHF) (Guo et al. 2015) and also describes image properties such as color distribution, color information and image brightness. The CRF is computed from max- and min-quantizers.

In CHF computation, color indexing is performed via a color truncation process that applies a balanced-tree clustering method. Figure 1 depicts the CHF computation process (Guo et al. 2015). A balanced tree is formed from a set of nodes: the top node is the root, and the bottom nodes, which have no children, are the leaf nodes. Each internal node has a left and a right child; the right child holds a value higher than or equal to its parent node, while the left child holds a value lower than its parent. A balanced tree can be built from the color codebook of the CHF.

Fig. 1 Example of CHF feature computation

The balanced tree is built from the norms of all CHF codewords: the codewords are sorted in ascending order of their norms, which makes the tree easy to construct, and the sorted codewords are assigned to the leaf nodes. To complete the balanced tree, each pair of adjacent sibling nodes is averaged to obtain a new value, and a node holding this value acts as their parent. This process is repeated over all codewords (all leaf nodes) until the root node is reached.

Once the complete balanced tree is formed, the color quantizers can be truncated effectively: the color truncation process assigns a single representative value to each max- and min-quantizer. Assume the balanced trees formed for the min- and max-quantizers have leaf node sets \( T_{\min} = \{\hat{q}_{1}, \hat{q}_{2}, \ldots, \hat{q}_{N_{\min}}\} \) and \( T_{\max} = \{\hat{q}_{1}, \hat{q}_{2}, \ldots, \hat{q}_{N_{\max}}\} \), respectively.

Here, \( N_{\min} \) and \( N_{\max} \) denote the sizes of the min and max color clusters. Note that different image databases require different balanced trees for the color truncation process. Consider an image block \( (i,j) \) with \( i = 1,2,\ldots,\frac{M}{m} \) and \( j = 1,2,\ldots,\frac{N}{n} \); its min- and max-quantizers are denoted \( q_{\min}(i,j) \) and \( q_{\max}(i,j) \), respectively. For the min-quantizer, the color truncation process is expressed as follows:

$$ \xi \{ q_{\min}(i,j) \} \Rightarrow \hat{q}_{a} $$
(1)

Here, \( a = 1,2,\ldots,N_{\min} \), and \( \xi\{\cdot\} \) denotes the color truncation process, which returns the index of a color codeword at a leaf of the balanced tree. The leaf node matching the min-quantizer most closely is selected, i.e., the one satisfying \( \arg\min_{a = 1,2,\ldots,N_{\min}} \left\| q_{\min}(i,j) - \hat{q}_{a} \right\|_{2}^{2} \). For the max-quantizer, the color truncation process is expressed analogously as

$$ \xi \{ q_{\max}(i,j) \} \Rightarrow \hat{q}_{b} $$
(2)

Here, \( b = 1,2,\ldots,N_{\max} \). Equation (2) expresses the closest match between the max-quantizer and the color clusters in the set \( T_{\max} \), satisfying \( \arg\min_{b = 1,2,\ldots,N_{\max}} \left\| q_{\max}(i,j) - \hat{q}_{b} \right\|_{2}^{2} \).

The CRF matches the color quantizers against the balanced tree in a traversal fashion. Starting at the root, the similarity scores between the color quantizer and the children of the current node are computed; if the left child matches more closely than the right child, the search continues in the left subtree, and otherwise in the right subtree. The process stops when a leaf node is reached, and its returned index is the color truncation output. The steps of the color truncation process are illustrated in Fig. 2. With this strategy, the CHF incurs more computational complexity than the CRF.

Fig. 2 Example illustrating number of color clusters a 8 and b 16 in the color truncation process

For the min-quantizer, the CRF has complexity \( O\left( d \times \log_{2} N_{\min} \right) \), whereas the CHF has complexity \( O\left( d \times N_{\min} \right) \), where d is the color dimension of the quantizers; for the RGB color space, d = 3. The max-quantizer requires the same computational complexity.

Based on the color truncation process, two image feature descriptors, CRFmin and CRFmax, are obtained. They are expressed as follows:

$$ CRF_{\min}(a) = \Pr\left[ \xi\{ q_{\min}(i,j) \} = \hat{q}_{a} \;\middle|\; i = 1,2,\ldots,\frac{M}{m};\; j = 1,2,\ldots,\frac{N}{n} \right] $$
(3)
$$ CRF_{\max}(b) = \Pr\left[ \xi\{ q_{\max}(i,j) \} = \hat{q}_{b} \;\middle|\; i = 1,2,\ldots,\frac{M}{m};\; j = 1,2,\ldots,\frac{N}{n} \right] $$
(4)

Here, a = 1, 2, …, Nmin and b = 1, 2, …, Nmax. The probability \( \Pr[\cdot] \) counts the relative number of occurrences of each codeword index over all min- or max-quantizers. The dimensionalities of CRFmin and CRFmax equal Nmin and Nmax, respectively, i.e., the number of leaf nodes in the balanced tree (the size of the color clusters).
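To make this concrete, the following Python sketch builds the balanced tree and performs the traversal-based truncation and histogram computation of Eqs. (1)-(4). It is a minimal sketch under our own assumptions: the codebook is given (e.g., from clustering), its size is a power of two so the tree is complete, and all function names are ours rather than from the cited works.

```python
import numpy as np

def build_balanced_tree(codewords):
    """Leaves are the codewords sorted by L2 norm; each parent is the
    average of its two adjacent children, repeated up to a single root.
    Assumes the number of codewords is a power of two (complete tree)."""
    order = np.argsort(np.linalg.norm(codewords, axis=1))
    levels = [codewords[order]]                      # sorted leaf level
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append(0.5 * (prev[0::2] + prev[1::2]))
    return levels, order                             # levels[-1][0] is the root

def truncate(q, levels, order):
    """Color truncation xi{q} (Eqs. (1)-(2)): descend from the root
    toward the child closer to the quantizer q and return the original
    codebook index of the reached leaf. Costs O(d log2 N)."""
    idx = 0
    for depth in range(len(levels) - 2, -1, -1):     # children of current node
        left, right = levels[depth][2 * idx], levels[depth][2 * idx + 1]
        closer_left = np.sum((q - left) ** 2) <= np.sum((q - right) ** 2)
        idx = 2 * idx if closer_left else 2 * idx + 1
    return order[idx]

def crf_histogram(quantizers, codewords):
    """CRF_min or CRF_max (Eqs. (3)-(4)): relative frequency of each
    codeword index over all blocks' min- or max-quantizers."""
    levels, order = build_balanced_tree(np.asarray(codewords, float))
    hist = np.zeros(len(codewords))
    for q in quantizers:                             # one RGB triple per block
        hist[truncate(np.asarray(q, float), levels, order)] += 1
    return hist / max(len(quantizers), 1)
```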

3.1.2 Color and edge directivity descriptor (CEDD)

This section explains the color and edge directivity descriptor (CEDD) and its structural components. CEDD is a universal image feature descriptor designed for CBIR; its compact size and efficient storage yield good system performance. CEDD segments an image into 1600 rectangular image parts and extracts texture and color data from the related image blocks. Figure 3 shows the flow of the CEDD descriptor. Color extraction unit: First, the input image I is divided into image blocks, and the RGB values of each color unit within the blocks are transformed into the HSV color space. A fuzzy linking histogram is then generated by a two-stage fuzzy system; the three mean HSV channel values form the fuzzy system input, and its output is a 10-bin histogram.

Fig. 3 Block diagram of CEDD descriptor

The fuzzy system takes the hue (H), saturation (S) and value (V) channels as inputs: hue is divided into 8 fuzzy regions, saturation into 2 and value into 3. A set of 20 rules produces the fuzzy output, which is limited to the range 0 to 1 to generate a crisp value; this crisp value builds the first-stage 10-bin histogram. The membership functions are illustrated in Fig. 4: Fig. 4a shows the borders of the 8 fuzzy regions of hue, Fig. 4b the 2 fuzzy regions of saturation, Fig. 4c the 3 fuzzy regions of value, and Fig. 4d the 2 fuzzy regions of saturation and value used for the 24-bin expansion. In the second stage, a Takagi–Sugeno–Kang (TSK) fuzzy linking system supplements each of the seven hue colors (the remaining bins being white, gray and black) with a brightness value. The mean S and V values of the image block serve as the fuzzy inputs, and the output is a 3-bin histogram of crisp values indicating whether the color is dark-hued, normal or light. Finally, pooling the first- and second-stage histograms (the two outputs) produces a 24-bin histogram.

Fig. 4 Membership functions for a hue, b saturation, c value, d saturation and value for 24-bins expansion

3.1.2.1 Texture extraction unit

Texture plays a main feature role in the color and edge directivity descriptor (CEDD). First, the image block is converted to the YIQ color space, from which the texture unit is obtained. The five digital filters suggested by the MPEG-7 Edge Histogram Descriptor (EHD) are then applied to this texture unit, grouping an extra non-edge case with the five edge types (vertical, horizontal, 45° diagonal, 135° diagonal and isotropic). Each image block is divided into 4 sub-blocks for filtering, and a fuzzy mapping assigns the specific edge types present in each block. The final output per image block is a 6-bin vector: one bin represents the non-edge case, and the other five bins represent the edge textures. A bin is labeled '1' if the image block contains the corresponding edge type, and '0' otherwise, producing a binary texture vector for the image block.
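As an illustration of this texture unit, the sketch below applies the five standard MPEG-7 EHD 2×2 filter coefficients to the four sub-block means of a block. CEDD proper uses a fuzzy mapping with per-filter thresholds, so a block may activate several bins; this crisp single-bin variant and its threshold value are simplifying assumptions of ours.

```python
import numpy as np

# The five MPEG-7 EHD 2x2 filter coefficient masks.
EHD_FILTERS = {
    "vertical":        np.array([[1.0, -1.0], [1.0, -1.0]]),
    "horizontal":      np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "diagonal_45":     np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),
    "diagonal_135":    np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),
    "non_directional": np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def texture_unit(block, edge_threshold=11.0):
    """6-bin texture vector of one grayscale image block (crisp variant):
    bin 0 = non-edge, bins 1-5 = the five edge types above."""
    h, w = block.shape
    # Mean intensity of each of the four sub-blocks, as a 2x2 array.
    means = np.array([
        [block[:h // 2, :w // 2].mean(), block[:h // 2, w // 2:].mean()],
        [block[h // 2:, :w // 2].mean(), block[h // 2:, w // 2:].mean()],
    ])
    responses = {k: abs((means * f).sum()) for k, f in EHD_FILTERS.items()}
    bins = np.zeros(6)
    best = max(responses, key=responses.get)
    if responses[best] < edge_threshold:
        bins[0] = 1.0                        # non-edge block
    else:
        bins[1 + list(EHD_FILTERS).index(best)] = 1.0
    return bins
```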

3.1.2.2 CEDD descriptor

This descriptor first forms a 144-bin vector, which is divided into six regions of 24 bins each; each region illustrates a divergent texture type. The relevant regions of the 144-bin vector, those whose texture bin is marked '1,' are filled with the 24-bin color histogram computed for the image block.

The image descriptor is then generated by adding together all the image-block descriptors. Finally, these vectors are normalized and quantized into 8 predefined levels. On completion of this process, the CEDD descriptor of an image characterizes its visual content in a distinctive and compressed style.

3.2 Local feature extraction

3.2.1 Local binary pattern

The local structure of an image is described by a nonparametric descriptor called LBP (Ojala et al. 2002). An operator value is assigned to each pixel of an image, obtained by thresholding the circular neighborhood of the pixel with the value of the center pixel: if a neighboring pixel value is higher than or equal to the center pixel value, its bit is set to 1; otherwise, it is set to 0. The LBP formulation is as follows: for the input pixel at (uc, vc), the resulting decimal LBP value is expressed as:

$$ h_{p} = i_{p} - i_{c} $$
(5)
$$ LBP\left( u_{c}, v_{c} \right) = \sum_{p = 0}^{P - 1} S_{c}\left( h_{p} \right) \cdot 2^{p} $$
(6)

Here, ic is the gray-level value of the central pixel, ip is the gray-level value of its p-th neighbor on the neighborhood circle of radius R, and P is the number of surrounding pixels. Equation (7) gives the numerical expression of the function Sc(x).

$$ S_{c}\left( x \right) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0 \end{cases} $$
(7)

The binary values characterize the local structure around the center pixel, and the histogram of the LBP image captures the distribution of the 256 possible patterns, thereby describing the whole image structure. To avoid excessive dimensionality without major information loss, we select only a reduced number of patterns, namely the uniform patterns of the LBP histogram. A uniform pattern is one that contains at most two bitwise transitions (from 0 to 1 or 1 to 0) when the bit string is traversed circularly. For instance, 11100000 (two transitions) is a uniform shape, whereas 10110111 (four transitions) is non-uniform. The 8-bit LBP representation contains 58 different uniform patterns and 198 unique non-uniform patterns, so the histogram requires only 59 bins (one shared bin for all non-uniform patterns) instead of 256 bins for texture representation.
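A minimal NumPy sketch of this 59-bin uniform histogram for P = 8, R = 1 (the eight immediate neighbors), following Eqs. (5)-(7); the bin ordering and function name are our own choices.

```python
import numpy as np

def uniform_lbp_histogram(gray):
    """59-bin uniform LBP histogram (P = 8, R = 1): each of the 58
    uniform codes gets its own bin; the 198 non-uniform codes share
    the last bin."""
    gray = np.asarray(gray, dtype=np.int32)
    center = gray[1:-1, 1:-1]
    # The eight circular neighbors for R = 1, ordered around the circle.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for p, (du, dv) in enumerate(offsets):
        neighbor = gray[1 + du:gray.shape[0] - 1 + du,
                        1 + dv:gray.shape[1] - 1 + dv]
        codes += (neighbor >= center).astype(np.int32) << p  # S_c(h_p) * 2^p

    def transitions(c):
        # Circular 0/1 transitions of an 8-bit code.
        rot = ((c >> 1) | ((c & 1) << 7)) & 0xFF
        return bin(c ^ rot).count("1")

    uniform_codes = [c for c in range(256) if transitions(c) <= 2]
    lut = np.full(256, 58, dtype=np.int32)        # non-uniform -> bin 58
    lut[uniform_codes] = np.arange(len(uniform_codes))  # 58 uniform bins
    return np.bincount(lut[codes.ravel()], minlength=59)
```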

3.2.2 Speeded-up robust features (SURF)

Bay et al. (2008) introduced SURF, an inventive scale- and rotation-invariant interest point detector and descriptor. The SURF algorithm includes two main phases: the detector and the descriptor.

3.2.2.1 Detector

The detector is based on an integral image and a basic Hessian matrix approximation. It involves four main steps: (a) integral image, (b) Hessian matrix-based interest points, (c) scale-space representation and (d) interest point localization.

Step 1: Integral Image

The SURF method uses integral images \( M_{\varSigma}(u) \) to further enhance the speed of local feature extraction. The entry of the integral image \( M_{\varSigma}(u) \) at a position \( u = \left( u,v \right)^{T} \) is the sum of all pixels of the input image M within the rectangular region spanned by the origin and u:

$$ M_{\varSigma}(u) = \sum_{k = 0}^{u} \sum_{l = 0}^{v} M\left( k,l \right) $$
(8)

The process of integral image calculation is illustrated in Fig. 5. Once the integral image has been computed, the sum of intensities over any rectangular area can be calculated with just three additions, so the computation time is independent of the rectangle size.

Fig. 5 Representation of integral image
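A minimal sketch of Eq. (8) and the three-addition box sum; the inclusive-corner convention and function names are our own.

```python
import numpy as np

def integral_image(img):
    """Integral image of Eq. (8): cumulative sums over rows and columns."""
    return np.asarray(img, dtype=np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of intensities in a rectangle (inclusive corners) using at
    most three additions/subtractions, independent of rectangle size."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```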

Step 2: Hessian matrix-based interest points

Here, the interest points are determined using the Hessian matrix H(i, σ). Equation (9) defines the Hessian matrix H(i, σ) at point i and scale σ:

$$ H\left( i,\sigma \right) = \begin{bmatrix} L_{ii}\left( i,\sigma \right) & L_{ij}\left( i,\sigma \right) \\ L_{ij}\left( i,\sigma \right) & L_{jj}\left( i,\sigma \right) \end{bmatrix} $$
(9)

where \( L_{ii}\left( i,\sigma \right) \) denotes the convolution of the Gaussian second-order derivative \( \frac{\partial^{2}}{\partial i^{2}} g\left( \sigma \right) \) with the image M at point i, and \( L_{ij}\left( i,\sigma \right) \) and \( L_{jj}\left( i,\sigma \right) \) are defined analogously. Furthermore, SURF approximates H(i, σ) to minimize the computational cost:

$$ H_{\text{approx}} = \begin{bmatrix} D_{ii} & D_{ij} \\ D_{ij} & D_{jj} \end{bmatrix} $$
(10)

Blob-like structures are detected at locations where the determinant is maximal, as expressed in (11):

$$ \det \left( H_{\text{approx}} \right) = D_{ii} D_{jj} - \left( w D_{ij} \right)^{2} $$
(11)

Here, the relative weight w balances the expression \( \det \left( H_{\text{approx}} \right) \); it compensates for the approximation of the Gaussian kernels and thereby improves the energy conservation between the exact and approximated kernels (w ≈ 0.9 in Bay et al. 2008).

Step 3: Scale-space representation

Interest points are extracted from a scale space consisting of different filter size levels; that is, Gaussian approximation filters of increasing size are applied at each level. The scale-invariant feature transform (SIFT) algorithm also adopts the scale-space representation notion.

However, SIFT gradually reduces the image size, whereas SURF uses integral images to upscale the filter at reduced cost. Thus, SURF evaluates the scale space with more computational efficiency and without aliasing, retaining the high-frequency components.

Step 4: Interest point localization

Interest points are detected by applying non-maximum suppression (NMS) in a 3 × 3 × 3 neighborhood over space and scale. NMS keeps the feature points at which the determinant of the Hessian matrix is a local maximum. Figure 6 shows the interest points detected in input images using the SURF algorithm.

Fig. 6 Interest points detected from the input images using SURF feature
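The sketch below illustrates steps 2-4 using SciPy. It uses true Gaussian second-order derivatives rather than SURF's box-filter approximations (the setting for which the weight w ≈ 0.9 was derived), and the scale samples and threshold are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def hessian_response(img, sigma, w=0.9):
    """det(H) response at one scale, per Eqs. (9)-(11). Gaussian
    derivatives replace SURF's box filters here, so w = 0.9 is kept
    only to mirror Eq. (11)."""
    img = np.asarray(img, dtype=np.float64)
    L_ii = ndimage.gaussian_filter(img, sigma, order=(2, 0))
    L_jj = ndimage.gaussian_filter(img, sigma, order=(0, 2))
    L_ij = ndimage.gaussian_filter(img, sigma, order=(1, 1))
    return L_ii * L_jj - (w * L_ij) ** 2

def interest_points(img, sigmas=(1.2, 1.6, 2.0, 2.4), threshold=1e-4):
    """Step 4: keep responses that are maxima of their 3 x 3 x 3
    space-scale neighborhood and exceed a threshold."""
    stack = np.stack([hessian_response(img, s) for s in sigmas])
    local_max = ndimage.maximum_filter(stack, size=3)
    peaks = (stack == local_max) & (stack > threshold)
    scale_idx, rows, cols = np.nonzero(peaks)
    return [(r, c, sigmas[s]) for s, r, c in zip(scale_idx, rows, cols)]
```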

3.2.2.2 Descriptor

To make the interest points invariant, the descriptor assigns each interest point its own orientation indicator. The SURF descriptor process includes two main steps: (a) orientation assignment and (b) the descriptor built from sums of Haar wavelet responses.

Step 1: Orientation assignment

To make the interest points invariant to image rotation, an orientation is assigned to each of them. The orientation is computed by summing Gaussian-weighted Haar wavelet responses within a sliding orientation window of size π/3 over a circular region around the interest point and detecting the dominant vector (Schnorrenberg et al. 2000). The horizontal and vertical Haar wavelet responses encode the directional property and strength around the interest point, so the orientation effectively represents the most significant image structure under rotation.

Step 2: Descriptor using the sum of Haar wavelet responses

The descriptor is computed over a square region constructed around each interest point and aligned with the orientation selected in the orientation assignment step. Each square region is split into 4 × 4 smaller sub-regions, and for each sub-region the horizontal Haar wavelet response di and the vertical Haar wavelet response dj are computed at 5 × 5 regularly spaced sample points. A 4D description vector is then formed from di and dj of each sub-region as follows:

$$ v = \left( {\sum {d_{i} ,\sum {d_{j} ,\sum {\left| {d_{i} } \right|,\sum {\left| {d_{j} } \right|} } } } } \right) $$
(12)
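In practice, a SURF detector/descriptor is available in OpenCV's contrib distribution; a brief usage sketch follows. SURF is patented and exposed only in builds compiled with the nonfree xfeatures2d module, and the image path here is hypothetical.

```python
import cv2

# Requires opencv-contrib built with OPENCV_ENABLE_NONFREE.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical path
keypoints, descriptors = surf.detectAndCompute(img, None)
# Each row of `descriptors` is the 64-d concatenation of the 4-d
# vectors of Eq. (12) over the 4 x 4 sub-regions (128-d if extended).
print(len(keypoints), descriptors.shape)
```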

3.3 Multi-level matching scheme

This section explains the multi-level matching scheme for content-based image retrieval. Given a query image Q, the images most similar to it are retrieved from the database \( D^{D} \) according to the matching scheme. Both the local and global features described above are used to enhance the speed of the query processing unit in relevant image searches, together with similarity matching (SM). The SM adopts the benefits of both local and global features to handle good shape representation, so objects in the query images are identified directly, and the retrieved database images are then verified to determine whether they match the input query image. The good outcomes provided by these specific similarity measures motivate us to hybridize the global and local features.

In the matching scheme, the global features QGlobal and local features QLocal are computed from the query image Q. Likewise, each database image is represented by a global feature vector \( I\left[ I_{\text{Global}}^{1}, I_{\text{Global}}^{2}, \ldots, I_{\text{Global}}^{n} \right] \) and a local feature vector \( I\left[ I_{\text{Local}}^{1}, I_{\text{Local}}^{2}, \ldots, I_{\text{Local}}^{n} \right] \). The aim of the scheme is to choose the n best images resembling the input query image, so the distance between the query image and each image in the database (DB) is measured to select the n top matching images. The steps given below explain the multi-level matching process.

  • Step 1: Consider Q, QGlobal and QLocal

  • Step 2: The query image features \( Q\left[ Q_{\text{Local}}, Q_{\text{Global}} \right] \) are matched against the database image features \( I\left[ I_{\text{Local}}^{1}, I_{\text{Local}}^{2}, \ldots, I_{\text{Local}}^{n} \right] \) and \( I\left[ I_{\text{Global}}^{1}, I_{\text{Global}}^{2}, \ldots, I_{\text{Global}}^{n} \right] \).

  • Step 3: First, the global similarity between the query image and each database image is computed. Equation (13) gives the global similarity between images.

    $$ S_{\text{Global}} \, = d_{ij} \, = \,\,f\,\left( {\,Q^{\text{Global}} \,,\,I^{\text{Global}} } \right) $$
    (13)

    Here, \( f\left( Q^{\text{Global}}, I^{\text{Global}} \right) \) denotes the Euclidean distance between the global features of the query image Q and a database image in \( D^{D} \).

  • Step 4: The local similarity between the query image Q and a database image, evaluated using the Euclidean distance (ED), is expressed as:

    $$ S_{\text{Local}} \, = \,d_{ij} \, = \,f\,\left( {\,Q^{\text{local}} \,,\,I^{\text{local}} } \right) $$
    (14)
    $$ d_{ij} = \sqrt{\sum_{i = 1}^{n} \left( Q_{i}^{\text{local}} - I_{i}^{\text{local}} \right)^{2}} $$
    (15)
  • Step 5: The distance measures given by Eqs. (13) and (14) are normalized. The hybrid similarity measure then synthesizes the global and local similarities as follows:

    $$ S_{\text{hybrid}} \,\, = \,C\, \times \,S_{\text{Global}} \,\, + \,\,\,\left( {1 - C} \right)\,\,S_{\text{Local}} $$
    (16)

    Here, the weight C adjusts the relative significance of the global and local similarity measures. The hybrid measure can be balanced according to the user's expectations by altering the value of the weight C, so the system offers good user flexibility.

  • Step 6: The top n best images for the query are selected from the database by sorting the hybrid score values. The operation of the proposed multi-level matching scheme is indicated in Fig. 7; a code sketch of the procedure is given after the figure.

    Fig. 7 Multi-level matching scheme used in proposed model
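The sketch below implements the two-step variant of this procedure with NumPy, under our own assumptions: features are stored one vector per row, min-max normalization is one plausible reading of the unspecified normalization in Step 5, the candidate count is illustrative, and C = 0.3 follows the setting reported in Sect. 5.

```python
import numpy as np

def two_step_retrieval(q_global, q_local, db_global, db_local,
                       C=0.3, n_candidates=50, n_top=10):
    """Multi-level matching (Eqs. (13)-(16)): step 1 ranks the database
    by global Euclidean distance and keeps a candidate set; step 2
    re-ranks the candidates with the hybrid score. Since normalized
    distances are combined, a smaller score means a more similar image."""
    d_global = np.linalg.norm(db_global - q_global, axis=1)           # Eq. (13)
    candidates = np.argsort(d_global)[:n_candidates]                  # step 1
    d_local = np.linalg.norm(db_local[candidates] - q_local, axis=1)  # Eq. (15)

    def normalize(d):                       # map distances into [0, 1]
        rng = d.max() - d.min()
        return (d - d.min()) / rng if rng > 0 else np.zeros_like(d)

    score = C * normalize(d_global[candidates]) + (1 - C) * normalize(d_local)
    return candidates[np.argsort(score)[:n_top]]                      # step 6
```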

4 Proposed approach

This work mainly aims to introduce a new content-based medical image retrieval system based on a hybrid similarity measure and a multi-level matching scheme. The proposed system contains two significant modules: (a) feature extraction and (b) multi-level matching. First, the color images in the database are converted to grayscale, and global and local features are extracted from them. Then, the multi-level matching similarity measure retrieves the images relevant to the given query image from the database DB: the most relevant images are first retrieved using global similarity alone, and the local similarity measure is then used to filter out images that lie too far from the query image. Figure 8 illustrates the entire implementation process of the proposed framework, and Algorithm 1 lists its step-by-step working procedure.

Fig. 8 Flow diagram of the proposed system

5 Results and discussion

In this section, we discuss the results obtained with the proposed hybrid similarity-based multi-level matching scheme for color image retrieval. The implementation uses MATLAB R2017a on a Windows machine with a 1.6 GHz Intel Core i5 processor and 4 GB RAM. A general purpose database, the SIMPLIcity image database (http://wang.ist.psu.edu) (Wang et al. 2001), is adopted to test the proposed CBIR system; it is organized into 10 semantic groups (including elephants, dinosaurs, buses, villages and African people) with about 100 sample images in each category. A medical image database is also used for retrieval: the gastro-intestinal database (http://www.gastrolab.net), comprising endoscopic images of gastro-intestinal disorders, including images of cancer-affected regions, is partially shown in Fig. 9.

Fig. 9 Sample images used in experimentation a planet, b sunrise from a general database and c small bowel category from the medical database

5.1 Performance measures

The system performance is analyzed with the commonly used evaluation metrics precision (P), recall (R) and F-measure (F). The precision rate is the ratio of the number of relevant images retrieved to the total number of retrieved images. The recall rate is the ratio of the number of relevant images retrieved to the total number of images in the database relevant to the query. The F-measure (F) is the harmonic mean of precision and recall. Numerically, (P), (R) and (F) are expressed as follows:

$$ P\, = \,\frac{{N^{Q} }}{{T^{Q} }} \times 100 $$
(17)
$$ R\, = \,\frac{{N^{Q} }}{{D^{Q} }} \times 100 $$
(18)
$$ F = 2 \times \frac{P \cdot R}{P + R} $$
(19)

Here, NQ denotes the number of relevant images retrieved from the database, TQ the total number of retrieved images, and DQ the number of images in the database relevant to the query image Q.

To determine the influence of the weight C, the performance metric 'area under the precision-recall curve' (AUC) is applied as follows:

$$ AUC\left( C \right) = \sum_{i = 2}^{R_{\max}} \frac{\left( P_{C}\left( i \right) + P_{C}\left( i - 1 \right) \right) \times \left( R_{C}\left( i \right) - R_{C}\left( i - 1 \right) \right)}{2} $$
(20)

In Eq. (20), \( P_{C}\left( i \right) \) and \( R_{C}\left( i \right) \) denote the precision and recall values after the ith image retrieval, and \( R_{\max} \) is the maximum number of images retrieved.
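A short sketch of Eqs. (17), (18) and (20): precision and recall after each of the first R_max retrievals, followed by the trapezoidal AUC. The data structures (a ranked list of image ids and a set of relevant ids) are assumptions of ours.

```python
import numpy as np

def pr_curve(retrieved, relevant, r_max):
    """Precision (N^Q / T^Q) and recall (N^Q / D^Q), in percent, after
    each of the first r_max retrievals for one query."""
    hits = np.cumsum([1 if r in relevant else 0 for r in retrieved[:r_max]])
    ranks = np.arange(1, r_max + 1)
    return 100.0 * hits / ranks, 100.0 * hits / len(relevant)

def auc(precision, recall):
    """Trapezoidal area under the precision-recall curve, Eq. (20)."""
    p, r = np.asarray(precision), np.asarray(recall)
    return np.sum((p[1:] + p[:-1]) * (r[1:] - r[:-1]) / 2.0)
```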

5.2 Experimental results

The performance of the proposed medical image retrieval system is analyzed using the metrics recall, precision and F-measure. The images retrieved by the proposed model for given query images are shown in Figs. 10, 11, 12 and 13.

Fig. 10 Retrieved results for general database a query image 'rose' and b retrieved images

Fig. 11 Retrieved results for general database a query image 'butterfly' and b retrieved images

Fig. 12 Retrieved results for medical database a query image 'Oesophagitis' and b retrieved images

Fig. 13 Retrieved results for medical database a query image 'Cardia' and b retrieved images

5.3 Comparative analysis

The main objective of the proposed system is to improve the image retrieval rate. The overall similarity between images is computed by applying Eq. (13) once the global and local features have been extracted; the most relevant images are those with the minimum distance. The multi-level matching step plays a significant role in query-based image retrieval, and the matching scheme can employ various types of distance measure. In this paper, the Euclidean distance (ED) measure is applied in the image matching system, and two conventional distance measures, the Canberra distance (CD) and the Manhattan distance (MD), are used to compare the effectiveness of the proposed ED over the two databases (Figs. 14, 15, 16, 17).

Fig. 14 Results obtained with general database on using different distance formulas in multi-level matching scheme a precision, b recall and c F-measure (for one-step retrieval process)

Fig. 15 Results obtained with medical database on using different distance formulas in multi-level matching scheme a precision, b recall and c F-measure (for one-step retrieval process)

Fig. 16 Results obtained with general database on using different distance formulas in multi-level matching scheme a precision, b recall and c F-measure (for two-step retrieval process)

Fig. 17 Results obtained with medical database on using different distance formulas in multi-level matching scheme a precision, b recall and c F-measure (for two-step retrieval process)

Results for the one-step retrieval process of the proposed approach with different distance measures are shown in Figs. 14 and 15. With CD and MD, the maximum precision rates obtained by the proposed model are 75% and 80%, respectively, whereas with ED it achieves a higher precision rate of 88%, as shown in Fig. 14a. The recall and F-measure rates of the one-step retrieval process with different distance measures are shown in Fig. 14b, c. Similarly, the precision, recall and F-measure rates obtained with the one-step retrieval process on the medical database are illustrated in Fig. 15. Results for the two-step retrieval process with different distance measures are illustrated in Figs. 16 and 17.

Figure 16a shows the higher precision rate (92%) achieved by the proposed model using the ED measure in the two-step retrieval process. The recall rates achieved by the two-step retrieval process with the different distance measures are plotted in Fig. 16b. The results show that the proposed model performs better than the other traditional approaches.

The F-measure rate achieved by the proposed model in the two-step retrieval process is shown in Fig. 16c. Similarly, the precision, recall and F-measure rates obtained with the two-step retrieval process on the medical database are illustrated in Fig. 17. The results show that the ED measure used in the multi-level matching scheme performs better than the other conventional measures.

The AUC values generated for the general purpose database are indicated in Fig. 18. These values were produced by varying the weight C from 0 to 1 in increments of 0.02, considering the precision and recall curves of the MLM-hybrid technique. For each dataset, an optimal weight value balances the significance of the local and global information; for the MLM-hybrid method, setting the weight to 0.3 proved a suitable choice for testing the CBMIR system. Figure 19 shows a comparative analysis of the retrieval process by feature extraction method. The figures show that our proposed approach achieves a higher precision rate than the other traditional approaches.

Fig. 18 Different weights of AUC measure in general image database

Fig. 19 Feature extraction-based proposed method comparison a general dataset and b medical database

In the one-step retrieval system, the query image is matched against the database in a single pass: a similarity score is obtained by combining the global and local features, and the corresponding images are retrieved directly using that score. In this work, the hybrid similarity measure depending on both global and local features is adopted.

5.3.1 General database

Table 1 shows the performance of the one-step retrieval process for the query image 'butterfly.' With MLM using local features alone or global features alone, the proposed model achieves an 86% precision rate, whereas with the full multi-level matching scheme it achieves a 90% precision rate.

Table 1 Performance efficiency of one-step retrieval process in a general database for a given query image ‘butterfly’

From the analysis, it is inferred that the hybrid MLM method yields better image retrieval performance than retrieval based on global or local features alone. For the query image 'butterfly,' the two-step retrieval performance is analyzed in Table 2. Compared to the one-step image retrieval process, the two-step retrieval process achieves better computational efficiency; it not only provides accurate, relevant retrieval results but also supports users with a rapid query response. The similarity measure computation adopts the benefits of both local and global features for shape representation. Compared with retrieval based on global features or individual local features, the proposed approach achieves a higher precision rate of 93% (Table 6). For the query image 'rose,' the performance of the one-step retrieval process is indicated in Table 3.

Table 2 Performance efficiency of two-step retrieval process in a general database for a given query image ‘butterfly’
Table 3 Performance efficiency of one-step retrieval process in a general database for a given query image ‘rose’

From the tables, it is observed that the one-step retrieval process achieves lower retrieval performance than the two-step retrieval process (Table 4). For the query image 'Oesophagitis,' the retrieval performance of the two-step and one-step retrieval processes is shown in Tables 5 and 6. The results show that our proposed model achieves a higher precision value than the systems based on global features or individual local features.

Table 4 Performance efficiency of two-step retrieval process in a general database for a given query image ‘rose’
Table 5 Performance efficiency of two-step retrieval process in a medical database for a given query image ‘Oesophagitis’
Table 6 Performance efficiency of one-step retrieval process in a medical database for a given query image ‘Oesophagitis’

5.4 Comparison with other published approaches

In this section, the efficiency of the proposed approach is compared with different existing works. In medical image retrieval, the works of Kumar et al. (2014) and Srinivas et al. (2015) have demonstrated strong image retrieval performance, characterizing the global and local features of an image with various representations. Table 7 shows that our proposed approach yields better performance than both; this is because our method describes the visual features of an image more clearly. Srinivas et al. (2015) used a dictionary learning method in a clustering-based image retrieval process.

Table 7 Comparative analysis of the state-of-the-art methods

In that work, a query image is matched against the learned dictionaries, using the Orthogonal Matching Pursuit (OMP) algorithm to identify the dictionary giving the sparsest representation. The graph-based approach of Kumar et al. (2014) retrieved multi-modality medical images. In addition, comparisons were made with published results: the precision rates achieved by the techniques of El Alami (2011) and Kumar et al. (2014) are 79.2% and 71%, respectively, whereas the proposed approach achieves a higher precision rate of 93% using the multi-level matching scheme (Table 7). The results show that the proposed approach yields a higher precision rate than the other traditional methods.

6 Conclusion

In this study, a multi-level matching scheme is introduced for image retrieval based on a hybrid feature similarity integrating the local and global features of an input image. Local features effectively reduce the complexity of retrieving target objects, while global features capture whole-image data such as color, texture and shape. Global and local features have complementary merits and demerits, but the discrimination ability of local features is relatively high compared with global features. To ease the retrieval process, similarity measures were adopted in this work to hybridize the global and local features of an image. The performance of the proposed method was determined on a medical image database in terms of the evaluation metrics F-measure, recall and precision, and both one-step and two-step retrieval processes were applied to evaluate retrieval efficiency. Experimental results show that the proposed approach achieves higher precision rates of 91% and 92% on the two databases. In the future, we intend to extend the dataset scale, further normalize the image tags, and enrich the query terms.