1 Introduction

Advances in technology have driven progress in areas such as academics, medicine, forensic analysis and entertainment. In image processing, this progress has created a growing need for retrieving information from images, and a significant amount of research has been devoted to extracting text and related information from them. Image recognition is used in many real-time applications, for example tracking automobiles on CCTV cameras or helping visually impaired people to travel. The main hindrance to such systems is image quality: blurry images or images with strong contour deformations are challenging for text detection, and lighting conditions and complex backgrounds pose further difficulties. To overcome these limitations of text-based image retrieval, content-based image retrieval (CBIR) was introduced. In CBIR, textual descriptions are avoided; instead, images are retrieved from large databases by ranking them in descending order of their content similarity (texture, color, shape, etc.) to the query image. Numerous studies on content-based image retrieval [2,3,4,5,6,7,8,9,10] have been carried out in the recent past involving both color and texture features.

Color quantization is closely related to color models. A large number of color models have been proposed and used for image retrieval and related tasks over the years, so selecting an appropriate color space plays an important role. The model most commonly used in image processing and computer vision is the RGB model, which contains three color channels, namely Red (R channel), Green (G channel) and Blue (B channel). Apart from texture, color information also plays an important role in content-based image retrieval [11,12,13,14], since it provides global information about the image in terms of the distribution of its color components. The drawback of the RGB color model is that the color information contained in the three channels is highly correlated. This prompted us to use the HSV color space in our content-based image retrieval framework in order to capture the color information efficiently. In the HSV color space, H, S and V stand for Hue, Saturation and Value, respectively. The Hue component is defined as an angle that varies from 0 to 1; the corresponding color passes from red through yellow, green, cyan, blue and magenta and back to red, so that red appears at both 0 and 1.0. Saturation indicates the purity of the color, while the Value component indicates the brightness and is close to the gray-scale version of the RGB image.

In this paper, the primary motive for developing a feature descriptor is to efficiently capture both the color and the texture information present in an image. This makes the feature a multipurpose descriptor that can work for a large variety of images belonging to different databases, and our work aims at giving equally effective results on different publicly available image datasets. Specifically, we develop a novel color descriptor by exploring the inter-channel, or mutual, relationship between the H and S channels, which has not been done in any previous work. Along with this, a texture descriptor is designed that considers the relationship between pixels symmetric about the left and right diagonals of a 3 × 3 window. These two descriptors individually perform better than existing color and texture descriptors, and their concatenation gives a significant improvement over existing descriptors for content-based color image retrieval.

The rest of the paper is organized as follows. Related work is discussed in Sect. 1.1. In Sect. 1.2, the main contributions of our work are stated with respect to existing techniques. In Sect. 2, we detail the proposed color and texture descriptors. We summarize the proposed method in Sect. 3. Section 4 presents the experimental results and the advantages of the proposed descriptor. Finally, the conclusion is given in the last section of the paper.

1.1 Related work

The image retrieval process mainly focuses on texture and color analysis. The local intensity of an image defines its texture to some extent, which is why local neighborhood features and statistical features are used for such texture patterns; similarly, the color correlogram, color histogram, color coherence vector, etc., are used as low-level color feature descriptors. The most renowned method for statistical feature extraction from images, the Gray Level Co-occurrence Matrix (GLCM), was first proposed by Haralick [2]. The GLCM, also known as the Gray Level Spatial Dependence Matrix, examines texture by considering the spatial relationship of pixels: it characterizes the texture of an image by counting the co-occurrences of pixel pairs with specific values in a particular spatial relationship, and statistical measures are then extracted from the resulting matrix. Zhang et al. [3] first computed features by applying the GLCM directly to the texture image and then used an edge image, obtained with the Prewitt edge detector in four directions, to gather more concrete and relevant information from the GLCM of the edge image. The GLCM was further extended to single-channel and then multi-channel co-occurrence matrices for the LUV and RGB color channels, which introduced its application to color texture image retrieval [11]. Partio et al. [4] used the Gray Level Co-occurrence Matrix for the retrieval of rock texture images. Siqueira et al. [5] utilized pyramid representation and Gaussian smoothing for multi-scale feature extraction and retrieval. Some other applications of the GLCM are [15,16,17].

An integrated color and intensity co-occurrence matrix was proposed to compute a joint representation of texture and color features; color was represented using the HSV color space instead of RGB, and image retrieval was carried out on labeled and unlabeled image datasets [18]. The color histogram considers the frequency of every intensity but discards the spatial correlation of colors; this spatial correlation is utilized in the color correlogram [19]. The color correlogram combined with supervised learning was then used for feature vector extraction, improving image retrieval in two different ways: first by modifying the query image, and second by distance metric learning [20]. The color coherence vector, based on the coherence and incoherence of image pixel colors, was proposed for image retrieval and compared with the color histogram [13]. Park et al. [21] applied an artificial neural network (ANN) to speed up image retrieval by clustering images. Gaussian Mixture Vector Quantization (GMVQ) was used to quantize the color histogram for image retrieval [12]. The Motif Co-occurrence Matrix, which builds a 3D matrix corresponding to local statistics of images, was proposed for image retrieval [22]. Murala et al. [14] extended it in the Modified Color Motif Co-occurrence Matrix (MCMCM), which uses the relationships between the color channels for image retrieval. The motif matrix was further combined with histogram-based features in the HSV color space in [23].

The wavelet transform has found extensive application in the description of image texture, where texture quality is determined along its most prominent perceptual dimensions [24]. The wavelet transform was used in [25] to collect texture and color features from an image. Daubechies' wavelet transform (DWT) was used for image searching and indexing by Wang et al. [26]; there, feature vectors are constructed from the wavelet coefficients in the lowest few frequency bands and their variances. The idea of the wavelet correlogram in content-based image retrieval was first proposed in [27]. The DWT extracts information from an image in only three directions (horizontal, vertical and diagonal); this directional limitation was removed by Gabor wavelet feature-based texture analysis in [28]. Ahmadian et al. used the Gabor wavelet transform for texture classification in [29]. The Gabor wavelet correlogram, an extension of [27], was proposed as a rotation-invariant feature for content-based image retrieval [30].

Ojala et al. [31] first proposed the local binary pattern (LBP) for texture feature extraction. In LBP, an eight-bit binary string represents the spatial relationship between the local neighboring pixels and the center pixel. Uniform and rotation-invariant versions of LBP have been introduced for image classification and retrieval. Various extensions of LBP, e.g., Completed LBP (CLBP) [32], Block-based Local Binary Pattern (BLK LBP) [33], Dominant LBP (DLBP) [34], Center Symmetric LBP (CS-LBP) [35], etc., were introduced for image retrieval and texture classification. One major drawback of the traditional LBP method is that anisotropic features are not described by its circular sampling region. The Multi-structure Local Binary Pattern (Ms-LBP) [36] operator was proposed as a solution to this problem, where an extended LBP operator was obtained by changing the shape of the sampling region for texture image classification. Gaussian as well as wavelet-based low-pass filters were used with LBP in the Pyramid Local Binary Pattern (PLBP) [37] proposed by Qian et al., where multi-resolution images are extracted from the original image with a low-pass filter and LBP features are collected from these multi-resolution images. A combination of LBP with the Gabor filter gave further improvements [38]. Besides these, moments were applied to feature extraction in [39], and edge information was used for feature extraction in the Directional Local Extrema Pattern [40]. Various improvements over LBP were made in the Dominant Local Binary Pattern (DLBP) [34], Local Bit-plane Decoded Pattern (LBDP) [41], Local Edge Patterns for Segmentation and Image Retrieval (LEPSEG and LEPINV) [42], Local Mesh Pattern (LMP) [43], Average Local Binary Pattern (ALBP) [44], etc. Numerous algorithms have also been formulated to minimize the effect of noise on LBP. In the Local Ternary Pattern (LTP) [45], a threshold value t is chosen; if a neighboring pixel value \( I_{i} \) lies within the range \( \left( {I_{c} - t,I_{c} + t} \right) \) around the center pixel \( I_{c} \), the code 0 is assigned, if it lies below this range − 1 is assigned, and otherwise + 1 is assigned. This ternary pattern is then split into upper and lower binary bit patterns; an improved version, Improved LTP [46], gives better results. Noise-Resistant LBP (NR-LBP) [47] and Robust LBP (RLBP) [48] are used to reduce the effect of noise on LBP features. Local Tetra Patterns [49] consider second-order derivatives in the horizontal and vertical directions, which are then transformed into binary patterns for calculation, and give better results than LBP. The Local Oppugnant Pattern [50] extends Local Tetra Patterns to the RGB color space. Murala et al. proposed spherical symmetric 3D Local Ternary Patterns [51], which use Gaussian filters and the RGB color space to form a 3D space and extract LTPs in every direction. A texture synthesis-based texture hashing framework has been proposed by Bhunia et al. [52]. Zhang et al. [53] introduced a novel learning framework that transforms tree-structured data into a vector representation and examined its performance in the content-based image retrieval task. Recently, a few works [54,55,56,57,58] have introduced new methods for texture classification. A Dense Micro-block Difference (DMD)-based method was proposed by Dong et al. [55].
A multi-scale rotation-invariant representation (MRIR) of textures based on multi-scale sampling was proposed in [57].

It is to be noted that our work is inspired by the work in [59], which used both a color histogram and a texture descriptor in order to capture the global and local information of the image, respectively. However, to our knowledge, none of the earlier local descriptors considered the inter-channel relationship for histogram calculation in the HSV color space. In other words, the mutual relationship between the channels has not been thoroughly investigated for building feature descriptors in the content-based image retrieval task. Similarly, the relationship between diagonally symmetric pairs in a 3 × 3 window has not been explored earlier. Following this, we develop one color histogram descriptor and one texture descriptor which, upon concatenation, provide a significant improvement over the method in [59] as well as other existing methods.

1.2 Main contributions

A number of works [59,60,61] have focused on color content-based image retrieval. The work in [60] proposed a novel image feature representation method using color information from the L*a*b* color space, called the Color Difference Histogram (CDH), for image retrieval. Walia et al. [61] exploited the Color Difference Histogram (CDH) and Angular Radial Transform (ART) features to obtain the color, texture and shape information of an image, using a modified Color Difference Histogram to improve retrieval performance. In [59], the authors simply quantized the histograms of the H and S channels into different bins and then concatenated those histograms to obtain the color feature. Very few works, such as the one by Lu et al. [62], who developed a novel LBP-based color feature named Ternary-Color LBP (TCLBP), represent inter-channel information, and that work used the RGB color space. To the best of our knowledge, none of the existing works have exploited the inter-channel information available in the HSV color space. In this paper, we exploit the inter-channel relationship between the H and S channels for color histogram computation. This is done by quantizing the H channel (an angle, i.e., a range of colors) into different bins and voting with the saturation value of the corresponding pixel position. This explores the inter-channel, or mutual, relationship between the H and S channels in a novel manner: the histogram computation takes into account the actual saturation (S channel) values corresponding to a range of colors (a particular bin in the H channel) rather than just the number of occurrences of those values within that range. With a similar motivation, we quantize the S channel into different bins and use the corresponding H channel (color range) values for voting.

A number of texture features based on local patterns have been proposed for texture-based image retrieval by considering the relation among neighbors that are symmetric with respect to the center. The most popular among them is the Center Symmetric Local Binary Pattern (CSLBP) [35], in which only the relation between the center-symmetric pixels is considered for calculating the local pattern and the remaining neighbors are ignored. A modified form of CSLBP was proposed by Verma et al. [59] for calculating the feature map, computed on the V channel. However, few works have focused on exploring the mutual relationship between diagonally symmetric neighbors. The importance of representing the relationship between diagonal neighbors was shown in the work of Dubey et al. [1], where first-order local diagonal derivatives are calculated to exploit the relationship among the diagonal neighbors of a given center pixel, and the intensity value of the center pixel is compared with the intensity values of the local diagonal neighbors. Motivated by this work, we explore the relationship among the diagonally symmetric neighbors along the left and right diagonals of a 3 × 3 window of an image. Moreover, we calculate the GLCM of the resulting feature map rather than its histogram, in order to retain the spatial correlation information.

The major contributions of our paper are as follows. First, we introduce a novel method for color histogram calculation from the H and S channels of the HSV color space with the objective of exploring the mutual relationship between the two channels. Second, a new texture descriptor is developed using the relationship between the diagonally symmetric neighbors along both diagonals of a 3 × 3 window of an image. These two feature descriptors, the color histogram and the texture feature, are concatenated in order to utilize both the global and the local information of the image, respectively, which was found to be beneficial in our experiments. Third, the resultant feature descriptor has been used for color image retrieval on different databases (Corel-1K, Corel-5K, Corel-10K, the Salzburg texture database and the MIT-VisTex database) and has been found to perform significantly better than the method in [59] as well as other existing methods.

2 Color and texture descriptor

2.1 Color histogram using inter-channel voting

Since the primary objective of this work is to exploit the inter-channel relationship, we do not separately quantize the H and S channels into bins and concatenate their histograms as done in [59]. Our principle is motivated by the popular HOG (Histogram of Oriented Gradients) descriptor, which calculates the gradient of an image in two directions, X and Y, and computes the orientation and magnitude of the gradient. The gradient orientation is quantized into a histogram of P bins, each bin specifying a particular range of the angular space, and the histogram is formed by adding the gradient magnitude g(x,y) to the bin indicated by the quantized gradient orientation Ω(x,y). Similarly, our aim in this work is to quantize the Hue (color information) value \( \emptyset (x,y) \) into different bins and to add the Saturation value \( S\left( {x,y} \right) \) to the bin indicated by \( \emptyset (x,y) \). Studying the variation of Hue with Saturation is equally as important as studying the variation of Saturation with Hue, so we also quantize the Saturation value \( S\left( {x,y} \right) \) into different bins and form a histogram by voting with the Hue values \( \emptyset (x,y) \). If the Hue value at a particular pixel position (i,j) of the image is \( \emptyset (i,j) \) and it belongs to the kth quantized histogram bin, then we may write:

$$ {\text{Bin}}\left( k \right) = {\text{Bin}}\left( k \right) + S\left( {i,j} \right) $$
(1)

where \( S\left( {i,j} \right) \) is the saturation value at pixel position (i,j). Following the same principle, if we quantize the Saturation values into L bins and vote with the corresponding Hue value, we may write:

$$ {\text{Bin}}\left( l \right) = {\text{Bin}}\left( l \right) + H\left( {i,j} \right) $$
(2)

Thus, we construct two histograms, one with K bins and another with L bins, to exploit the inter-channel relationship. The traditional histogram quantization method and our proposed histogram quantization method are shown in Fig. 1a, b respectively.
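To make the voting concrete, a minimal Python/NumPy sketch of Eqs. (1) and (2) is given below, assuming a floating-point HSV image with all channels scaled to [0, 1]; the function name and default bin counts are illustrative, not the exact configuration used in our experiments.

```python
import numpy as np

def inter_channel_histograms(hsv, K=72, L=20):
    """Inter-channel voting histograms (Eqs. 1 and 2).

    hsv: float array of shape (rows, cols, 3) with H, S, V in [0, 1].
    Returns the concatenation of a K-bin histogram of H voted by S
    and an L-bin histogram of S voted by H.
    """
    h, s = hsv[..., 0].ravel(), hsv[..., 1].ravel()

    # Quantize H into K bins and accumulate the saturation values (Eq. 1).
    h_bins = np.minimum((h * K).astype(int), K - 1)
    hist_h = np.bincount(h_bins, weights=s, minlength=K)

    # Quantize S into L bins and accumulate the hue values (Eq. 2).
    s_bins = np.minimum((s * L).astype(int), L - 1)
    hist_s = np.bincount(s_bins, weights=h, minlength=L)

    return np.concatenate([hist_h, hist_s])
```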

Fig. 1
figure 1

a Traditional histogram quantization method. b Our proposed histogram quantization method in order to explore the inter-channel relationship

2.2 Local pattern

2.2.1 Local binary patterns

Ojala et al. proposed the Local Binary Pattern (LBP), which was used mainly for texture classification [63], but its computational simplicity led to its further use in medical imaging [64], image classification [31], object tracking [65] and facial expression recognition [66]. The method operates on a small window of an image: each of the N neighboring pixels surrounding the center pixel is compared with the center pixel, and a binary value (0 or 1) is assigned based on this intensity difference (as given in Eq. 3). The final value is obtained by multiplying these bits with specific weights and summing them, and the center pixel is replaced with this binary pattern value. By replacing each center pixel with its binary pattern value, a local binary map of the gray-level image is generated, and a histogram of this map forms the feature vector. Equations (3)–(6) give the formulas for LBP and its histogram.

$$ {\text{LBP}}\left( N \right) = \mathop \sum \limits_{k = 1}^{N} 2^{k - 1} \times \nabla_{1} \left( {I_{k} ,I_{c} } \right) $$
(3)
$$ \nabla_{1} \left( {I_{k} ,I_{c} } \right) = \left\{ {\begin{array}{ll} {1,} & {I_{k} \ge I_{c} } \\ {0,} & {\text{otherwise}} \\ \end{array} } \right. $$
(4)
$$ {\text{Hist}}\left( N \right)|_{\text{LBP}} = \mathop \sum \limits_{k = 1}^{X} \mathop \sum \limits_{l = 1}^{Y} \nabla_{2} \left( {{\text{LBP}}\left( {k,l} \right),L} \right);\quad L \in \left[ {0,\left( {2^{N} - 1} \right)} \right] $$
(5)
$$ \nabla_{2} \left( {b_{1} ,b_{2} } \right) = \left\{ {\begin{array}{ll} {1,} & {b_{1} = b_{2} } \\ {0,} & {\text{otherwise}} \\ \end{array} } \right. $$
(6)

Here N represents the number of neighboring pixels, the kth neighboring pixel is denoted by \( I_{k} \) and the center pixel by \( I_{c} \); X and Y denote the dimensions of the pattern map. The final histogram of the pattern map is computed by Eq. (5). An example window for LBP calculation is given in Fig. 3a.
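For reference, a straightforward (unoptimized) NumPy sketch of Eqs. (3)–(5) on a gray-scale image is given below; the neighbor ordering is an assumption made for illustration, since any fixed ordering merely decides which bit receives which weight.

```python
import numpy as np

def lbp_map(img):
    """Local binary pattern map for the 8-neighborhood (Eqs. 3 and 4)."""
    img = np.asarray(img, dtype=np.int32)
    rows, cols = img.shape
    # Eight neighbor offsets, ordered clockwise from the top-left corner
    # (illustrative ordering; it only fixes which bit gets which weight).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    out = np.zeros_like(center)
    for k, (dr, dc) in enumerate(offsets):
        neighbor = img[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        out += (neighbor >= center).astype(np.int32) << k  # weight 2^k (zero-based k)
    return out

def lbp_histogram(img):
    """Histogram of the pattern map over L in [0, 2^8 - 1] (Eq. 5)."""
    return np.bincount(lbp_map(img).ravel(), minlength=256)
```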

2.2.2 Gray level co-occurrence matrix

The concept of the Gray Level Co-occurrence Matrix (GLCM) was proposed by Haralick et al. [2], who derived 14 textural features from it. It is used to study the co-occurrence of pixel pairs within a specific distance and in a particular direction in an image and is a very popular statistical method for feature calculation. In this paper, we calculate the GLCM of the feature map obtained after applying the proposed texture descriptor DSCoP (Diagonally Symmetric Co-occurrence Pattern) rather than computing a histogram. This is done to exploit the spatial correlation of pixels in the feature map, which is lost in a histogram, since a histogram is purely a frequency distribution. The GLCM of an input image is calculated as:

$$ M_{d} \left( {l,m} \right) = \# \left\{ {\left( {a,b} \right),\left( {c,d} \right):I\left( {a,b} \right) = l,I\left( {c,d} \right) = m} \right\} $$
(7)

where \( \left( {a,b} \right),\left( {c,d} \right) \in H_{a} \times H_{b} \) and \( \left( {c,d} \right) = \left( {a + k \times \emptyset_{1} ,\;b + k \times \emptyset_{2} } \right) \).

In this equation, \( M_{d} \) is the gray level co-occurrence matrix, \( k \) represents the distance and \( \left( {\emptyset_{1} ,\emptyset_{2} } \right) \) represents the direction. \( H_{a} \times H_{b} \) represents the horizontal and vertical spatial domains, and \( I\left( {a,b} \right) \;{\text{and}}\;I\left( {c,d} \right) \) are the pixel intensity values at positions \( \left( {a,b} \right) \) and \( \left( {c,d} \right) \). An example of GLCM calculation is shown in Fig. 2: Fig. 2a shows the original matrix, and Fig. 2b shows its GLCM computed for horizontally adjacent pixel pairs (distance one, zero-degree direction).

Fig. 2
figure 2

Example showing the gray level co-occurrence pattern matrix calculation in b for matrix a
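A small NumPy sketch of Eq. (7) for the one-distance, zero-degree case illustrated in Fig. 2 is given below; the function name and the number of gray levels are placeholders.

```python
import numpy as np

def glcm_horizontal(img, levels):
    """Co-occurrence matrix for horizontally adjacent pixel pairs
    (distance 1, direction 0 degrees), as in Eq. (7)."""
    img = np.asarray(img, dtype=np.int32)
    M = np.zeros((levels, levels), dtype=np.int64)
    left, right = img[:, :-1].ravel(), img[:, 1:].ravel()
    np.add.at(M, (left, right), 1)  # count each (l, m) pair
    return M
```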

2.2.3 Diagonally symmetric co-occurrence pattern

In the present work, we calculate a texture feature named the Diagonally Symmetric Co-occurrence Pattern (DSCoP) for image retrieval. In this pattern, we consider the relationship between the diagonally symmetric neighboring pairs of a 3 × 3 window, as shown in Fig. 3b. Every 3 × 3 window has two diagonals: the principal diagonal and the counter diagonal. If the symmetric neighbor pair about a given diagonal is \( \left( {I_{k} ,I_{j} } \right) \), where k ∈ {1,2,…,8} indexes the eight neighbors of the center and the center pixel is denoted by \( I_{c} \), then \( I^{{\prime }}_{k} \) may be written as:

Fig. 3
figure 3

a Example showing local binary pattern calculation b illustration for diagonally symmetric co-occurrence pattern calculation

$$ I_{k}^{{\prime }} = I_{k} - I_{c } \quad k = 1,2, \ldots ,8 $$
(8)

The values of k can be divided into two subsets: one set of values (k = 1,7,8) for the principal diagonal and a second set (k = 1,2,3) for the counter diagonal. As a result, the values of \( I_{j}^{{\prime }} \) may be expressed as:

$$ I_{j}^{{\prime }} = \left\{ {\begin{array}{ll} {I_{{\bmod \left( {12 - k,8} \right)}}^{{\prime }} ,} & {k = 1,7,8\;\left( {{\text{principal}}\;{\text{diagonal}}} \right)} \\ {I_{8 - k}^{{\prime }} ,} & {k = 1,2,3\;\left( {{\text{counter}}\;{\text{diagonal}}} \right)} \\ \end{array} } \right. $$
(9)

The relationship between \( I'_{k} \) and \( I^{{\prime }}_{j} \) may be represented as:

$$ \rho \left( {I_{k}^{{\prime }} ,I_{j}^{{\prime }} } \right) = \left\{ {\begin{array}{ll} {1,} & {I_{k}^{{\prime }} \times I_{j}^{{\prime }} \ge 0} \\ {0,} & {\text{otherwise}} \\ \end{array} } \right. $$
(10)

Thus, when \( I^{{\prime }}_{k} \) and \( I^{{\prime }}_{j} \) have the same sign (i.e., their product is non-negative), the resultant bit is 1; otherwise, it is 0. There are six neighbor pairs in total, three for each diagonal, so we obtain a six-bit binary string whose decimal equivalent replaces the center pixel of the window. Thereafter, we calculate the GLCM of the resulting map, which helps to exploit the spatial correlation. For GLCM computation we follow the same set of specifications as in [59]. However, since the DSCoP map contains 64 possible values instead of 16, we quantize it into 16 levels from 0 to 15 so as to maintain the same feature dimension as used in [59].
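As an illustration, a possible NumPy sketch of the DSCoP map and the subsequent 16-level quantization is given below. The diagonal-symmetric pairs are written as (row, column) offsets from the center; the exact neighbor indexing of Fig. 3b, the bit ordering, and the reuse of the horizontal GLCM sketch above (in place of the full GLCM specifications of [59]) are assumptions made for illustration.

```python
import numpy as np

def dscop_map(v_channel):
    """Diagonally Symmetric Co-occurrence Pattern map (Eqs. 8-10).

    v_channel: 2-D array (e.g., the V channel of an HSV image).
    Returns a map with values in [0, 63] (six bits: three pairs per diagonal).
    """
    img = np.asarray(v_channel, dtype=np.float64)
    rows, cols = img.shape
    center = img[1:-1, 1:-1]

    def diff(dr, dc):
        # I'_k = I_k - I_c for the neighbor at offset (dr, dc) (Eq. 8).
        return img[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc] - center

    # Neighbor pairs symmetric about the principal and counter diagonals
    # of the 3x3 window, given as (row, col) offsets from the center.
    pairs = [((-1, 0), (0, -1)), ((-1, 1), (1, -1)), ((0, 1), (1, 0)),    # principal
             ((-1, -1), (1, 1)), ((-1, 0), (0, 1)), ((0, -1), (1, 0))]    # counter
    out = np.zeros_like(center, dtype=np.int32)
    for bit, (p, q) in enumerate(pairs):
        same_sign = (diff(*p) * diff(*q)) >= 0  # Eq. (10)
        out += same_sign.astype(np.int32) << bit
    return out

def dscop_glcm_feature(v_channel):
    """Quantize the 64-level DSCoP map to 16 levels and build a
    16 x 16 GLCM, flattened to a 256-dimensional vector."""
    quantized = dscop_map(v_channel) // 4  # 64 -> 16 levels
    return glcm_horizontal(quantized, levels=16).ravel()
```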

2.3 Advantage of proposed descriptor

1. The texture descriptor proposed in our work takes into account the relationship between the diagonally symmetric neighboring pairs about the principal and counter diagonals of a 3 × 3 window of an image, rather than considering only the relationship between center-symmetric pixels; this relationship has not been studied in the literature so far.

2. A novel color descriptor is proposed that takes into consideration the inter-channel relationship between the H and S channels of an image. This type of relationship between the H and S channels has not been studied in the literature so far.

3. The proposed descriptor has been evaluated on a number of publicly available color and texture datasets, and on each of them it has outperformed the existing descriptors for image retrieval.

3 Proposed system framework

The proposed method is illustrated with the help of the block diagram shown in Fig. 4, and the corresponding algorithm is given in Sect. 3.1. In this work, we compute the color feature by quantizing the H channel into different bins and studying the variation of Saturation with respect to Hue, following a principle similar to that of the HOG descriptor. Hue represents the color component and has a value between 0 and 1. For our experiments, the H channel has been divided into 18/36/72 bins. The variation of Hue with Saturation has also been studied following the same principle; for this purpose, the S channel has been quantized into 10/20 bins. All possible combinations of Hue and Saturation quantization have been used, and the results are reported in the results section. We use the same normalization factor as in [59] for all databases so as to justify the superiority of our method fairly. For texture feature extraction, we calculate the GLCM of the DSCoP map, following the same set of GLCM specifications as mentioned in [59]. The only difference is that we quantize the DSCoP map into 16 levels from 0 to 15 to obtain a 256-dimensional feature vector and keep the feature dimension the same as in [59].

Fig. 4
figure 4

Proposed system block diagram

The algorithm is shown in two parts. The first part describes the system framework: an image is fed as input, and the output feature vector is obtained by concatenating the histograms of the Modified Color Histogram with the GLCM vector of the DSCoP feature. In the second part, image retrieval is performed using the proposed feature extraction method: a query image is taken as input, and the retrieved images are obtained as output based on the similarity between its feature vector and the feature vectors computed as in part 1.

3.1 System framework algorithm

Part 1: Construction of feature vector

Input: An Image from the database

Output: Feature vector.

1. Choose an image from the database and convert it from RGB to HSV color space.

2. Construct histograms by quantizing the Hue into different numbers of bins and voting with the corresponding Saturation value, and vice versa.

3. Obtain the DSCoP map from the Value channel of the HSV color space.

4. Quantize the DSCoP map into 16 levels and form its GLCM.

5. Transform the GLCM into a vector.

6. Concatenate the GLCM vector of step 5 with the histograms of step 2 to construct the final feature vector (a minimal sketch combining these steps follows this list).
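For concreteness, a minimal end-to-end sketch of Part 1 is given below; it reuses the illustrative inter_channel_histograms, dscop_glcm_feature and glcm_horizontal helpers from the earlier snippets, and the library choice (scikit-image for the RGB-to-HSV conversion) and bin counts are assumptions rather than the exact experimental configuration.

```python
import numpy as np
from skimage import color, io  # any RGB-to-HSV conversion would do

def feature_vector(path, K=72, L=20):
    """Part 1: build the concatenated feature vector for one image."""
    rgb = io.imread(path)
    hsv = color.rgb2hsv(rgb)                            # step 1: RGB -> HSV
    color_hist = inter_channel_histograms(hsv, K, L)    # step 2: inter-channel voting
    texture = dscop_glcm_feature(hsv[..., 2])           # steps 3-5 on the V channel
    return np.concatenate([color_hist, texture])        # step 6: final feature vector
```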

Part 2: Image retrieval using DSCoP + Modified Color Histogram

Input: Database query image

Output: Retrieved images after similarity measure

1. Take the query image as input.

2. Perform steps 2 to 6 of part 1 to extract the feature vector of the query image.

3. Compute the similarity index between the feature vector of the query image and that of every database image using the different similarity measures.

4. Sort the similarity indices to produce the set of best-matching retrieved images as the final result.

3.2 Similarity measure

In content-based image retrieval, the similarity measure is as important for retrieving and classifying images as the color and texture feature computation. After feature calculation, the similarity measure gives the distance in feature space between the query image feature vector and the feature vector of every image in the database. The retrieved images are then indexed and sorted so that images with smaller distances are ranked first. Similarity matching is computed using the following five distance measures.

a. d1 distance:

    $$ \partial_{{D^{ } ,q_{k} }} = \mathop \sum \limits_{l = 1}^{n} \left| {\frac{{\rho_{d}^{k} \left( l \right) - \rho_{{q_{k} }} (l)}}{{1 + \rho_{d}^{k} \left( l \right) + \rho_{{q_{k} }} (l)}}} \right| $$
    (11)
b. Euclidean distance:

    $$ \partial_{{D^{ } ,q_{k} }} = \left( {\mathop \sum \limits_{l = 1}^{n} \left| {(\rho_{d}^{k} \left( l \right) - \rho_{{q_{k} }} (l))^{2} } \right|} \right)^{1/2} $$
    (12)
c. Manhattan distance:

    $$ \partial_{{D^{ } ,q_{k} }} = \mathop \sum \limits_{l = 1}^{n} \left| {\rho_{d}^{k} \left( l \right) - \rho_{{q_{k} }} (l)} \right| $$
    (13)
d. Canberra distance:

    $$ \partial_{{D^{ } ,q_{k} }} = \mathop \sum \limits_{l = 1}^{n} \left| {\frac{{\rho_{d}^{k} \left( l \right) - \rho_{{q_{k} }} (l)}}{{\rho_{d}^{k} \left( l \right) + \rho_{{q_{k} }} (l)}}} \right| $$
    (14)
e. Chi-square distance:

    $$ \partial_{{D^{ } ,q_{k} }} = \frac{1}{2}\mathop \sum \limits_{l = 1}^{n} \frac{{(\rho_{d}^{k} \left( l \right) - \rho_{{q_{k} }} (l))^{2} }}{{\rho_{d}^{k} \left( l \right) + \rho_{{q_{k} }} (l)}} $$
    (15)

Here, \( \partial_{{D^{ } ,q_{k} }} \) represents the distance function for the database \( D \) and the query image \( q_{k} \), and n is the length of the feature vector. \( \rho_{d}^{k} \left( l \right) \) and \( \rho_{{q_{k} }} (l) \) are the lth elements of the feature vectors of the kth database image and the query image, respectively.
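A short sketch of the d1 measure of Eq. (11) and of the ranking step of Part 2 is given below; the function names and the choice of returning the top-n indices are illustrative.

```python
import numpy as np

def d1_distance(query_feat, db_feat):
    """d1 distance between the query feature and one database feature (Eq. 11)."""
    num = np.abs(query_feat - db_feat)
    den = 1.0 + query_feat + db_feat
    return np.sum(num / den)

def retrieve(query_feat, db_feats, top_n=10):
    """Rank database images by increasing d1 distance and return their indices."""
    dists = np.array([d1_distance(query_feat, f) for f in db_feats])
    return np.argsort(dists)[:top_n]
```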

4 Experimental results and analysis

In this paper, we have evaluated the performance of our method on five different datasets, including the texture image databases MIT-VisTex and Salzburg texture (STex) and the natural scene databases Corel-1K, Corel-5K and Corel-10K. The superiority of the proposed method is validated by evaluating the precision and recall rates and comparing them with existing methods on all five datasets. Precision relates the number of relevant images retrieved for a given query image to the total number of images retrieved from the database, as in Eq. 16; it decreases as more images are retrieved. The precision rate is given by:

$$ {\text{Precision}}\;{\text{Rate}}\left( {P_{k} ,Q} \right) = \frac{{{\text{Total}}\;{\text{no}} .\;{\text{of}}\;{\text{correct}}\;{\text{images}}\;{\text{retrieved}}\;{\text{from}}\;{\text{the}}\;{\text{database}}}}{{{\text{Total}}\;{\text{no}} .\;{\text{of}}\;{\text{images}}\;{\text{retrieved}}\;{\text{from}}\;{\text{the}}\;{\text{database}}\,(n)}} $$
(16)

Here \( Q \) is the query image and \( P_{k} \) represents the precision rate obtained for it.

Another commonly used accuracy measure is recall, which can be defined as the probability of retrieving a relevant image for the query. For an image retrieval system, recall improves as the number of retrieved images increases. Recall is the ratio of the number of relevant images retrieved for a given query image to the total number of relevant images of that class in the database, as in Eq. 17.

$$ {\text{Recall}}\;{\text{Rate}}\left( {R_{k} ,Q} \right) = \frac{{{\text{Total}}\;{\text{no}} .\;{\text{of}}\;{\text{correct}}\;{\text{images}}\;{\text{retrieved}}\;{\text{from}}\;{\text{the}}\;{\text{database}}}}{{{\text{Total}}\;{\text{no}} .\;{\text{of}}\;{\text{relevant}}\;{\text{images}}\;{\text{in}}\;{\text{the}}\;{\text{database}}\,(N_{k} )}} $$
(17)

Here \( N_{k} \) indicates number of images in each category of the database, i.e., the total number of relevant images in the database.

The average precision rate may be calculated as follows:

$$ P_{\text{avg}} \left( M \right) = \frac{1}{j}\mathop \sum \limits_{s = 1}^{j} P_{s} $$
(18)

In Eq. 18, \( P_{\text{avg}} \left( M \right) \) represents the average precision rate for category \( \left( M \right) \), where j is the total no of images in that category. Similarly, the recall rate for each category may be expressed as given in Eq. 19.

$$ R_{\text{avg}} \left( M \right) = \frac{1}{j}\mathop \sum \limits_{s = 1}^{j} R_{s} $$
(19)

On similar terms, we can compute the total precision and total recall for our experiment using Eqs. 20 and 21.

$$ P_{\text{total}} = \frac{1}{C}\mathop \sum \limits_{i = 1}^{C} P_{\text{avg}} \left( i \right) $$
(20)
$$ R_{\text{total}} = \frac{1}{C}\mathop \sum \limits_{i = 1}^{C} R_{\text{avg}} \left( i \right) $$
(21)

Here, C is the total number of categories present in the particular database. Total recall is also known as the Average Recall Rate (ARR). The performance of the proposed method has been compared with a number of state-of-the-art methods; the list of abbreviations for these methods is given in Table 1.
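The evaluation protocol of Eqs. (16)–(21) can be sketched as follows, assuming the category label of every image is known; variable and function names are illustrative.

```python
import numpy as np

def precision_recall(retrieved_labels, query_label, n_relevant):
    """Precision (Eq. 16) and recall (Eq. 17) for a single query."""
    correct = np.sum(np.asarray(retrieved_labels) == query_label)
    precision = correct / len(retrieved_labels)
    recall = correct / n_relevant
    return precision, recall

def average_rates(per_query_precisions, per_query_recalls):
    """Per-category averages (Eqs. 18-19); averaging these over all
    categories gives the totals of Eqs. 20-21."""
    return np.mean(per_query_precisions), np.mean(per_query_recalls)
```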

Table 1 Abbreviations of the different methods used

4.1 Dataset 1

The first dataset used in our experiment is the Corel-1K database. It consists of 10 categories with 100 images in each category, giving a total of 1000 images. The categories include Asians, buildings, beaches, elephants, flowers, dinosaurs, buses, mountains, hills and floods. Each image in this database has a size of 256 × 384 or 384 × 256. Some sample images from this database are shown in Fig. 5b. The precision, recall and average retrieval rate have been evaluated for this database, and the precision and recall curves for different numbers of retrieved images are shown in Fig. 6. For our experiment, we initially retrieve 10 images and then increase the number of retrieved images in steps of 10 until 100 images are retrieved. In Fig. 7, the query image is the first image of each row and the remaining images are the retrieved images for that query. A comparative study of the color and texture patterns for this dataset, considering each of them individually, is shown in Fig. 8a, b in terms of precision and recall.

Fig. 5
figure 5

Sample images from different datasets

Fig. 6
figure 6

Precision and recall curves with number of images retrieved for Corel-1K database

Fig. 7
figure 7

Query image and retrieved images from Corel-1 K dataset

Fig. 8
figure 8

Precision and recall values of proposed methods for Corel-1 K database

4.2 Dataset 2

The second database used in our experiment is the Corel-5K dataset. It consists of a total of 5000 images in 50 categories, with 100 images in each category. The dataset includes images of animals (e.g., bears, lions, foxes, tigers), humans, buildings, paintings, natural scenes, fruits, cars, etc. In this experiment, we initially retrieve 10 images and increase this number up to 100 images to provide a fair comparison. The precision and recall curves for this dataset with varying numbers of retrieved images are shown in Fig. 9. Figure 5e shows some sample images from this dataset, and the images retrieved for these sample images are shown in Fig. 10. The average retrieval rate for this dataset, shown in Table 2, indicates that the proposed method performs better than the state-of-the-art methods listed in Table 1. As shown in Fig. 11, the texture pattern individually is more effective than the color pattern for CBIR.

Fig. 9
figure 9

Precision and recall curve with number of images retrieved for Corel-5 K database

Fig. 10
figure 10

Query image and retrieved images from Corel-5K dataset

Table 2 Average retrieval rate for STex and MIT-Vistex datasets
Fig. 11
figure 11

Precision and recall value of proposed methods for Corel-5K database

4.3 Dataset 3

The third database used in our experiment is the Corel-10K database.Footnote 1 It consists of 100 categories of images with 100 images in each category and is an extension of the Corel-5K database. It contains images belonging to categories such as buses, ships, textures, food, army, airplanes, furniture, oceans, cats and fishes. Some sample images from this database are shown in Fig. 5c. The precision and recall curves for different numbers of retrieved images are shown in Fig. 12, and the average retrieval rate for this dataset shows an improvement over the existing methods listed in Table 1. For the experimental study, we initially retrieve 10 images and then progressively increase the number of retrieved images up to 100 to provide a detailed comparison. In Fig. 13, the query image is the first image of each row and the remaining images are the retrieved images for that query. Figure 14 compares both the texture and the color features of our method with those of LECoP.

Fig. 12
figure 12

Precision and recall curve with number of images retrieved for Corel-10K database

Fig. 13
figure 13

Query image and retrieved images from Corel-10K dataset

Fig. 14
figure 14

Precision and recall value of proposed methods for Corel-10K database

4.4 Database 4

We have evaluated the performance of the proposed method on our fourth database, the Salzburg texture (STex) database.Footnote 2 The database consists of 7616 images, comprising 476 categories with 16 images in each category. Sample images from this dataset are presented in Fig. 5a. For the STex dataset, we retrieve 16 images to measure the precision and recall performance; to provide a more detailed study, larger numbers of retrieved images are also considered and reported with the help of the precision and recall curves in Fig. 15a, b. In Fig. 16, the query image is the first image of each row and the remaining images are the retrieved images for that query. Figure 17a, b compares both the texture and the color features of our method with those of LECoP.

Fig. 15
figure 15

Precision and recall curve with number of images retrieved for STex database

Fig. 16
figure 16

Query image and retrieved images from STex database

Fig. 17
figure 17

Precision and recall value of proposed methods for STex database

4.5 Database 5

Finally, our method is tested on the MIT-VistexFootnote 3 database created by the MIT Vision and Modeling Group. The dataset contains 40 gray-scale texture images of size 512 × 512, each subdivided into sub-images of size \( 128 \times 128 \), so the dataset comprises 40 different types of images with 16 images of each type. In this experiment, 16 images are retrieved initially and the number of retrieved images is then increased in steps of 16, up to a maximum of 96. The precision and recall rates for all images in the database are calculated and compared with the methods in Table 1; a graph supporting our observations is shown in Fig. 18. Some sample images from this dataset are shown in Fig. 5d, and some query images with their corresponding retrieved images are shown in Fig. 19. Our proposed texture and color features perform better than the texture and color features of LECoP, as shown in Fig. 20a, b.

Fig. 18
figure 18

Precision and recall curve with number of images retrieved for MIT-Vistex database

Fig. 19
figure 19

Query image and retrieved images from MIT-Vistex dataset

Fig. 20
figure 20

Precision and recall value of proposed methods for MIT-Vistex database

The average retrieval rate for this method is given in Table 2, where it is compared with recent methods on the different datasets. We have also studied the image retrieval and feature extraction times of several state-of-the-art feature descriptors and compared them with our method; the feature vector length and image retrieval time of the recently developed techniques are shown in Table 3. Various similarity metrics, as given in Eqs. 11–15, have been considered for performance evaluation, and the performance of the proposed method under the different similarity or distance metrics is given in Table 4. The performance varies with the distance metric and the dataset; however, the best performance on all datasets is obtained with the d1 distance metric. We have also compared the proposed texture and color patterns separately against those of LECoP (Local Extrema Co-occurrence Pattern); both show an improvement over the corresponding patterns of LECoP. Moreover, the texture pattern turns out to be more effective than the color pattern for all datasets, as indicated by the precision and recall values shown in the bar graphs. In Table 5, the individual importance of hue, saturation and value in the HSV color space is analyzed with different quantization levels of the hue and saturation components for all databases.

Table 3 Feature retrieval time, extraction time and feature length using different methods
Table 4 Comparative study with different distance metrics
Table 5 Precision and recall values of our method with different quantization schemes for all databases

Overall, our proposed color and texture descriptors capture the color and texture information better in the encoded feature representation, and the combined descriptor performs better than existing handcrafted texture descriptors. The large-scale growth of image collections, driven by the easy availability of smartphones, demands automatic annotation and retrieval systems that can operate on the content of images in real time.

5 Conclusion

This paper presents a novel approach to content-based image retrieval through a descriptor that combines color and texture information. The texture descriptor is named the Diagonally Symmetric Co-occurrence Pattern (DSCoP), since it effectively captures the co-occurrence relationship between the neighbor pairs symmetric about the principal and counter diagonals of a 3 × 3 window of an image. The color descriptor captures the inter-channel relationship between the H and S channels of the HSV color space by quantizing the H channel into bins and voting with the Saturation values, and by replicating the process for the S channel. The method has been evaluated on the texture image databases MIT-VisTex and Salzburg texture as well as on the natural scene databases Corel-1K, Corel-5K and Corel-10K, and the results have been compared with existing techniques by calculating precision and recall values for all of them. The proposed method turns out to be better than the existing approaches in terms of both precision and recall, while its feature vector length and image retrieval time are competitive with most approaches. Thus, this image retrieval technique is effective and efficient for real-time systems.