1 Introduction

The main goal of a content-based image retrieval (CBIR) [20, 38] scheme is to retrieve the most relevant images from a digital image repository using low level visual contents such as the color, texture and/or shape of a given query image. Digital image repositories have been growing larger day by day, fed by areas including social networking sites, entertainment, medical imaging, crime prevention, historical archives, and broadcasting, so retrieving the most relevant images from such a large repository is a challenging task. In a typical CBIR system, the low level visual contents are first computed from the query image and from the target images in the repository to construct feature vectors/descriptors. An appropriate similarity measure is then calculated between the feature descriptor of the query image and that of each target image, and the target images are ranked according to their similarity to the query image. Visual contents such as color, texture and shape play an important role in producing the results the user expects. Several CBIR schemes [6, 37, 41, 45] based on texture and shape features can be found in the literature. In this paper, the author therefore considers color, texture and shape visual descriptors/features [28] for the CBIR system: a probability histogram based color feature descriptor is computed using color moments, a novel texture feature extraction technique is proposed in the DCT domain with the help of the co-occurrence matrix, and a shape feature extraction technique is suggested based on multi-resolution sub-images.

The color feature descriptor is widely adopted in image retrieval schemes because of its rotation and scaling invariance. Liu et al. [18] computed a color feature descriptor for image retrieval using color difference histograms based on first and second order partial derivatives and a uniform quantization approach. Several image retrieval systems based on color visual descriptors have been proposed, such as the co-occurrence matrix [17], color autocorrelation [33], exponent color moments [52] and color moments [5]. In general, color moments represent the color characteristics of a digital image. They also capture important visual features of the image under different circumstances and various lighting environments. However, color moments alone are not sufficient to identify and differentiate the various image contents significantly and effectively.

Texture is one of the most essential image features and plays a vital role in developing effective CBIR applications. Texture moments or features capture distinguishing image properties such as smoothness, homogeneity, coarseness, and regularity. In an image, various objects can be distinguished solely by their texture patterns. Malik et al. [22] proposed an image retrieval scheme where texture features are extracted from non-overlapping DCT blocks of the grayscale image. In this scheme, the DC and first three AC coefficients of each DCT block are collected in zigzag scanning order and four histograms of the DC and three AC coefficients are constructed individually. The histograms are then quantized into different numbers of bins, and statistical parameters such as mean, standard deviation, skewness, kurtosis and smoothness are calculated from them to construct the texture feature descriptor. Various similarity distances are used for effective retrieval of images from the database. Phadikar et al. [27] also proposed a CBIR scheme in the DCT domain, where three visual features, i.e., color histogram, color moments, and edge histogram, are computed directly from the transform domain. Since not all of these texture visual features contribute equally to the feature vector, a genetic algorithm (GA) is run before similarity matching to optimize the set of visual features using weight factors, which increases the retrieval accuracy. Haralick [10] proposed the gray level co-occurrence matrix (GLCM) [29], which captures the correlation between pixels at a particular distance and provides much of the spatial information of an image. Wang et al. [48] suggested an image retrieval scheme based on texture features in which an image is first divided into 8-connectivity regions and GLCM based texture features are extracted from each connected region. Vahid et al. [24] proposed a new texture descriptor, known as the CoALTP texture feature vector, obtained by an efficient combination of the GLCM and the Local Ternary Pattern (LTP); it inherits the properties of both the co-occurrence matrix and the LTP. The retrieval results obtained with this texture feature vector under different distance metrics were then analyzed.

The shape feature is an important image component that is extensively used as a discriminating element in CBIR applications [11]. It yields accurate results when the image contains objects, various shapes, distinct structures and distinguishable edges in many directions. In general, shape is described in two ways: contour based methods, where Fourier descriptors represent the shape of the image, and region based methods, where invariant moments represent the shape features. Akrem et al. [7] developed a shape based image retrieval scheme in which shape signatures are extracted using the Fourier descriptor (FD) and the farthest point distance (FPD) technique. The shape signatures are computed at each point of the shape contour, which also provides scale and translation invariance and thereby improves retrieval accuracy. However, some valuable information is lost while acquiring these invariance properties. Another shape based image retrieval scheme was therefore suggested by Emir et al. [39], which overcomes the problem of the existing scheme [7] while preserving the invariance properties. They adopt only the phase of the Fourier coefficients and use it at specific points (or pseudo mirror points) as a shape orientation reference. The shape signature remains invariant under translation, scaling and rotation thanks to the phase-preserving Fourier descriptors. Li et al. [15] suggested an invariant-moment based CBIR scheme in which the shape feature vector is constructed by combining Zernike moment (ZM) based phase coefficients with ZM magnitudes. Kothyari et al. [14] proposed a CBIR scheme that directly computes the seven invariant moments to form the feature vector. However, natural images are not noise free, so some significant preprocessing is needed before extracting visual contents or moments from them. In this paper, we analyze images at different multi-resolution levels prior to extracting the shape features/moments, since information at a single resolution has proven to be insufficient.

Most of the CBIR schemes discussed above use only one of the three low-level visual contents, i.e., color, texture or shape. However, it is very difficult to achieve adequate retrieval results with a single feature descriptor alone, because natural images contain a variety of visual attributes. To improve retrieval performance, many researchers have developed CBIR schemes [21, 49] based on proficient combinations of texture and shape features. Wang et al. [49] suggested an image retrieval scheme in which shape and texture features are combined efficiently: shape features are computed from an RGB color image using exponent moments, which possess numerous desirable image characteristics, and texture features are computed using the histogram of the localized angular phase of the intensity plane. Liu et al. [21] proposed a CBIR scheme that combines texture and shape features through a weighted distance measurement, where the texture features are computed from the optimal non-subsampled shearlet transform based decomposed images and the shape features are extracted using low-order quaternion polar harmonic transforms (QPHTs). A single distance is then computed from the optimal weighted similarities of the texture and shape features, and this distance is used in the retrieval process. In the literature, a number of image retrieval schemes [42, 50] have been developed using various combinations of different image visual features, which improves retrieval performance to a certain extent. However, it has been observed that combining such visual features does not guarantee better retrieval results. For an effective CBIR scheme, it is necessary to extract suitable visual features that are significant enough to represent a low dimensional feature vector effectively without compromising retrieval performance. In the presented work, the author proposes a novel CBIR scheme based on color, shape and texture visual moments/features. The main contributions of the paper are highlighted by the following points:

  • 1. The color moments are computed from the probability histograms of the image planes after applying an effective preprocessing algorithm to the color image.

  • 2. The Gray Level Co-occurrence Matrix (GLCM) based texture moments are computed in a new fashion by selecting salient components in the Discrete Cosine Transform (DCT) domain after determining the inter-relationship between DCT blocks. In this way, a new texture feature extraction technique is proposed that computes GLCM features from matrices arranged from the DC and AC coefficients of the DCT image blocks.

  • 3. The shape moments are extracted from multi-resolution sub-images, since most of the detailed information of an image plane is not visible at a single resolution level, whereas significant visual information can be analyzed across different multi-resolution levels.

  • 4. Finally, the low dimensional feature descriptor is constructed by a simple fusion of the color, texture and shape moments of an image, which reduces the computational overhead. This simple fusion approach also improves the retrieval accuracy of the CBIR system.

  • 5. A new similarity distance is proposed, and comparative results against the Euclidean distance are presented on three standard image datasets.

The rest of the paper is organized as follows. Section 2 describes some preliminaries on the discrete cosine transform, the gray level co-occurrence matrix and the Gaussian image pyramid. Section 3 elaborates the proposed CBIR scheme in detail. Section 4 provides the experimental results and discussion. Finally, Section 5 presents the conclusions of the paper.

2 Preliminary concepts

Before presenting the proposed CBIR scheme, we briefly describe some basic concepts of the discrete cosine transformation, the Gaussian image pyramid and the gray level co-occurrence matrix. These concepts are used in the proposed feature extraction techniques during the formation of the feature descriptors.

2.1 Block level discrete cosine transformation

The discrete cosine transformation (DCT) [2] has been used intensively in signal and digital image processing applications. It is a proficient tool that converts an image from the pixel/spatial domain into the frequency/transform domain. The DCT considers only the real part of the frequency domain, which makes it faster than other transformation tools such as the discrete Fourier transformation. A DCT transformed block consists of DCT coefficients, where the top left component is known as the DC coefficient (or the energy of the image block) and all remaining components are called AC coefficients. Moreover, most of the significant visual information of the transformed block lies in the few coefficients located in its top left part. If the DCT coefficients are selected in zigzag scanning order, they therefore represent the most significant visual information of the image block. The selection of DCT coefficients from a transformed block is depicted in Fig. 1. The special characteristics of this transformation are pixel de-correlation and high energy compaction, which help to split an image into its different frequencies while preserving the energies. Due to these characteristics, the DCT plays an important role in various fields of digital image processing such as image compression, image segmentation, feature extraction, image enhancement and visual content based image retrieval [44]. The 2-D DCT of an image block of size N × N is defined as:

$$ \begin{array}{l} F(u, v) = \frac{2}{N}c\left( u \right)c\left( v \right)\sum\limits_{x = 0}^{N - 1} {\sum\limits_{y = 0}^{N - 1} {f(x, y)} } \cos \left[ {\frac{{(2x + 1)u\pi }}{{2N}}} \right] \times \cos \left[ {\frac{{(2y + 1)v\pi }}{{2N}}} \right]\\[4pt] c(u), c(v) = \left\{ \begin{array}{ll} \frac{1}{{\sqrt 2 }} & \text{if } u, v = 0\\ 1 & \text{if } u, v > 0 \end{array} \right. \end{array} $$
(1)

where F(u, v) and f(x, y) are the transformed and original image blocks respectively. The top left component F(0, 0), the DC coefficient, is the average intensity or energy of the image block. Only a few DCT coefficients are sufficient to represent an approximate image block without losing significant visual information. Therefore, in the presented work, 27 out of the 63 AC coefficients of each 8 × 8 DCT block are taken in zigzag scanning order and the corresponding statistical values are computed to represent the visual information. Several researchers have suggested techniques to extract visual information from DCT blocks; for example, Jiang et al. [12] proposed an image retrieval scheme in which the visual information is extracted by considering the spatial relationship between the DCT coefficients of a block and its sub-blocks. In the proposed work, the DC and AC coefficients are collected separately, with the DC coefficients kept in the form of a matrix. Only 27 of the 63 AC coefficients of each block are selected in zigzag scanning order and divided into three groups, where the first group contains the smallest number of AC coefficients and the last group the largest. Thereafter, the coefficient of variation (CV) is computed for each of the three groups G1, G2 and G3: the mean (μ) and standard deviation (σ) of each group are calculated, and the ratio of the standard deviation to the mean gives the CV. These three CVs are later used to construct three AC matrices, because this selection retains the local structure of the image block [4]. The computation of the CV is defined as:

$$ \begin{array}{l} C{V_{Gr}} = \sigma_{Gr}/\mu_{Gr}, \quad Gr \in \left\{ {G_{1}, G_{2}, G_{3}} \right\}\\[4pt] {\mu_{Gr}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {A{C_{i}}} , \quad \sigma_{Gr} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {{{(A{C_{i}} - {\mu_{Gr}})}^{2}}} } \end{array} $$
(2)

where n is the number of AC coefficients in a particular group Gr and ACi is its i-th AC coefficient. The value of the CV is high in areas of the DCT image block where edges exist, while it is very low in uniform areas [23]. Hence a high CV indicates that the AC coefficients of the block belong to edges, while a low CV indicates that they belong to a uniform region. Indeed, these characteristics show that the CV can be considered a good region detector; it is also used to detect non-spurious items in the image block. For texture feature extraction, the DC coefficients and the CVs of the selected AC coefficients are arranged in separate matrices. In this way, three matrices based on the AC coefficients and one matrix based on the DC coefficients are obtained, and these matrices are used for the texture feature representation. The whole texture feature extraction process is described in the proposed CBIR scheme section.
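
To make the block-level DCT arrangement above concrete, here is a minimal Python sketch (not the paper's MATLAB implementation) that applies an 8 × 8 block DCT to a single image plane, keeps the DC coefficient and the first 27 AC coefficients of each block in zigzag order, splits the AC coefficients into three groups G1–G3 and stores the coefficient of variation of each group, yielding one DC matrix and three CV matrices per plane. The equal split into three groups of nine coefficients and the small epsilon guarding a zero mean are assumptions made for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n=8):
    """Return (row, col) pairs of an n x n block in zigzag scanning order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def dc_and_cv_matrices(plane, block=8, n_ac=27, n_groups=3, eps=1e-8):
    """Block-wise DCT of one image plane; returns the DC matrix and three CV matrices."""
    h, w = (plane.shape[0] // block) * block, (plane.shape[1] // block) * block
    plane = plane[:h, :w].astype(float)
    zz = zigzag_indices(block)[1:n_ac + 1]               # skip DC, keep first 27 AC positions
    groups = np.array_split(np.arange(n_ac), n_groups)   # assumed equal split: 9/9/9
    dc = np.zeros((h // block, w // block))
    cv = np.zeros((n_groups, h // block, w // block))
    for bi in range(h // block):
        for bj in range(w // block):
            blk = plane[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            # separable 2-D DCT (type II, orthonormal)
            coeffs = dct(dct(blk.T, norm='ortho').T, norm='ortho')
            dc[bi, bj] = coeffs[0, 0]
            ac = np.array([coeffs[r, c] for r, c in zz])
            for g, ids in enumerate(groups):
                mu, sigma = ac[ids].mean(), ac[ids].std()
                cv[g, bi, bj] = sigma / (mu + eps)        # coefficient of variation, Eq. (2)
    return dc, cv
```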

Fig. 1

Selection of DCT coefficients in zigzag scanning order

2.2 Gray-level spatial dependence matrix

The gray-level spatial dependence matrix was introduced by Haralick [10] in 1973; it converts an image into a matrix using the frequency of occurrence of pairs of pixel values at a specific distance in the original image. The gray-level spatial dependence matrix is also known as the gray level co-occurrence matrix (GLCM), and it is one of the most effective tools for analyzing texture visual features in an image [44, 54, 55] using second order statistical parameters. Moreover, the GLCM approach captures the spatial relationship between two pixel values at a particular distance rather than describing a single pixel value. Let I(x, y) be an image consisting of Nx horizontal and Ny vertical resolution cells. To reduce the computational complexity, the gray level (gray tone) values of the resolution cells are quantized into Ng gray tone values. Let Lx = {1, 2, ..., Nx} and Ly = {1, 2, ..., Ny} represent the horizontal and vertical spatial domains, and N = {1, 2, ..., Ng} the set of quantized gray tone values obtained by uniform quantization. The image is then a function I: Ly × Lx → N. In the proposed work, the spatial relationship among pixel values at unit distance and in four different directions is determined. For each direction, the computed co-occurrence matrix is converted into a normalized gray level co-occurrence matrix (NGLCM). These NGLCMs are then used for computing the texture visual features via second order statistical parameters/moments. The element (i, j) of the GLCM counts how often a quantized gray tone value i occurs in a particular relation with a quantized gray tone value j in the image; each element is determined by the occurrence of a quantized gray tone value next to its neighboring values at a specific displacement and direction. Since the size of the GLCM depends on the number of quantized gray tones, we compute the GLCM using the symmetric property. The normalized co-occurrence matrix therefore contains the probabilities of occurrence of all pairs of gray level values at distance d = 1 and direction 𝜃, where 𝜃 ∈ {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. The co-occurrence probability can then be computed as

$$ {P_{r}}(x) = \left\{ {{P_{ij}}\left| {(d, \theta )} \right.} \right\} $$
(3)

where the element Pij of the NGLCM for the pair of gray tones i and j is defined as:

$$ {P_{ij}} = \frac{{{C_{ij}}}}{{\sum\nolimits_{i = 1}^{G} {\sum\nolimits_{j = 1}^{G} {{C_{ij}}} } }} $$
(4)

Cij represents the frequency of occurrence of gray tones i and j with parameters (d, 𝜃), and G is the total number of gray tones obtained by the uniform quantization process. The value \({\sum \nolimits _{i = 1}^{G} {\sum \nolimits _{j = 1}^{G} {{C_{ij}}} } }\) is the sum of the elements of the GLCM for a particular (d, 𝜃). For ease of computation, the distance d is normally taken as less than or equal to 10, while the eight orientations 𝜃 ∈ {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°} are adopted. The GLCM approach [10] suggests fourteen statistical parameters, from which similarity measures can be calculated for texture analysis. To reduce the computational complexity, we compute only four textural features, namely contrast, correlation, energy and homogeneity, from the GLCMs for distinct values of (d, 𝜃), since this statistical information is sufficient to characterize the texture of an image effectively.
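
As a quick illustration of these statistics, the sketch below quantizes a matrix/sub-image into a small number of gray tones, builds a symmetric normalized GLCM at unit distance for the four base orientations, and reads off contrast, correlation, energy and homogeneity with scikit-image. The names graycomatrix/graycoprops apply to scikit-image ≥ 0.19 (older releases spell them greycomatrix/greycoprops); the choice of 16 quantization levels and the averaging over orientations are assumptions for illustration, not details fixed by the paper.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_texture_features(matrix, levels=16, distance=1):
    """Contrast, correlation, energy and homogeneity from normalized GLCMs
    at unit distance and the four base orientations (0, 45, 90, 135 degrees)."""
    # uniform quantization of the input matrix/sub-image into `levels` gray tones
    m = np.asarray(matrix, dtype=float)
    span = m.max() - m.min() + 1e-12
    q = np.floor((m - m.min()) / span * (levels - 1)).astype(np.uint8)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    glcm = graycomatrix(q, distances=[distance], angles=angles,
                        levels=levels, symmetric=True, normed=True)
    feats = []
    for prop in ('contrast', 'correlation', 'energy', 'homogeneity'):
        # one value per orientation; averaged here to keep the descriptor compact
        feats.append(graycoprops(glcm, prop).mean())
    return np.array(feats)   # [f1, f2, f3, f4] as in Eq. (12)
```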

2.3 Gaussian image pyramid

The presented work uses a multiscale representation, since multiscale approaches are widely used for deriving significant visual information at different resolution levels. The visual information of a digital image can be analyzed easily at various scales, because it is very difficult to visualize all significant visual features of the original image directly. The Gaussian image pyramid (GIP) [1, 51] is the most popular multiscale technique for image analysis and feature extraction, and it is adopted in the proposed scheme [51] due to its easy implementation and low computational overhead. The GIP is built in two steps, averaging/smoothing and down-sampling. Smoothing is done by convolving a filter or mask with the image, after which the smoothed image is reduced by a factor of two. The whole process is repeated until the lowest resolution image of size 1 × 1 is obtained; each iteration yields a smaller image with increased smoothing. In this way the GIP is constructed as a collection of images of decreasing resolution stacked in a pyramidal shape. Let I0 be the original image of size N × N. It is convolved with a low pass filter (a Gaussian kernel) and down-sampled to generate the next level image I1 of size N/2 × N/2. Similarly, the image I2 at level two is obtained from I1, and the process continues until the lowest level Gn of the image pyramid is reached. The mathematical formula for the GIP of an original image I(x, y) is defined as

$$ \begin{array}{l} {I_{0}}\left( {x, y} \right) = I\left( {x, y} \right)\\ {I_{l}}\left( {x, y} \right) = \sum\limits_{a = - 2}^{2} {\sum\limits_{b = - 2}^{2} {w(a, b)} } {I_{l - 1}}\left( {2x + a, 2y + b} \right), \quad \forall\, 1 \le l \le G_{n} \end{array} $$
(5)

where w(a, b) is a low pass filter (an approximate Gaussian filter), sometimes also called the weighting function or Gaussian generating kernel. These weighting functions are constant, separable and symmetric for all decomposition levels. In the proposed work, the following Gaussian kernel is used:

$$\frac{1}{{256}}\left[ {\begin{array}{*{20}{c}} 1&4&6&4&1\\ 4&{16}&{24}&{16}&4\\ 6&{24}&{36}&{24}&6\\ 4&{16}&{24}&{16}&4\\ 1&4&6&4&1 \end{array}} \right]$$

The mean of the low pass filter lies at the middle of the Gaussian generating kernel, i.e., at w(a, b) = w(0, 0). The kernel weights can be generated using the following Gaussian function [35]:

$$w\left( {a, b} \right) = \frac{1}{{\sigma \sqrt {2\pi } }}{e^{- \frac{1}{{2{\sigma^{2}}}}\left[ {{{\left( {a - \mu } \right)}^{2}} + {{\left( {b - \mu } \right)}^{2}}} \right]}}$$

where a and b are the coordinates of the kernel along the horizontal and vertical axes, and μ and σ are the mean and standard deviation of the Gaussian distribution. The Gaussian function produces an almost symmetric curve (the familiar Gaussian shape) in two dimensions. The underlying idea of the GIP technique is that neighboring pixel values within a specific region or window of the image often have similar properties and are therefore highly correlated. For visualization purposes, a 3-level decomposition of the cameraman image of size 256 × 256 using the GIP is shown in Fig. 2, although the GIP can decompose an image down to a size of 1 × 1 at the lowest level.
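
A minimal sketch of Eq. (5), assuming reflective boundary handling and three levels as in Fig. 2, might look as follows.

```python
import numpy as np
from scipy.ndimage import convolve

# 5x5 approximate Gaussian generating kernel w(a, b) from Section 2.3
W = np.array([[1,  4,  6,  4, 1],
              [4, 16, 24, 16, 4],
              [6, 24, 36, 24, 6],
              [4, 16, 24, 16, 4],
              [1,  4,  6,  4, 1]], dtype=float) / 256.0

def gaussian_pyramid(image, levels=3):
    """Return [I0, I1, ..., I_levels]; each level is smoothed and halved (Eq. 5)."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        smoothed = convolve(pyramid[-1], W, mode='reflect')  # low pass filtering
        pyramid.append(smoothed[::2, ::2])                   # down-sample by a factor of two
    return pyramid
```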

Fig. 2

The visualization of gray-scale cameraman image using three levels of GIP

3 Proposed content-based image retrieval scheme

The main purpose of the proposed CBIR scheme is to provide an effective and efficient image retrieval system using color, texture and shape visual features/moments. Most existing CBIR schemes use high dimensional visual descriptors, which slows retrieval and increases retrieval time. The presented scheme not only constructs a low dimensional visual feature descriptor but also achieves retrieval accuracy comparable to existing schemes. In the presented work, the color feature descriptor is constructed by computing color moments from a probability histogram model, the texture feature descriptor is calculated using the DCT and GLCM tools, and the shape visual features are extracted using GIP based multi-resolution sub-images and geometric moments. The low dimensional feature descriptor is then obtained by a proficient combination of the color, texture and shape descriptors. During retrieval, the similarity metric is measured between the feature descriptors of the repository images and the given query image. The preprocessing step and the color, texture and shape feature extraction techniques are presented in the following subsections.

3.1 Preprocessing

The preprocessing approach consists of three steps, i.e., histogram equalization, sharpening and cropping of the image. The whole process is shown in Fig. 3. Images are generally captured in different environments, so their foreground and background may be too bright or too dark. Histogram equalization [5] is a widely used technique to enhance contrast by adjusting the intensity values, thereby improving the quality of the original image data. Moreover, human visual perception is highly sensitive to edges, isolated points and fine details of an image. To bring out thin lines and isolated points, the Laplacian filter [9] is used to compute a finer detailed image, since it is based on the second-order partial derivative, whereas filters such as Prewitt and Sobel rely on first-order partial derivatives. Let I(x, y) be an RGB color image; then the Laplacian operator can be defined as:

$$ {\nabla^{2}}\left[ {I(x, y)} \right] = \left[ \begin{array}{l} {\nabla^{2}}R(x, y)\\ {\nabla^{2}}G(x, y)\\ {\nabla^{2}}B(x, y) \end{array} \right] $$
(6)

where R, G and B represent the red, green and blue color components respectively, and x and y are the horizontal and vertical coordinates. The Laplacian filter is applied to each color component of the RGB image individually. The Laplacian derivative for an image component Icc(x, y), cc ∈ {R, G, B}, is defined as

$$ \begin{array}{l} {\nabla^{2}}\left[ {{I_{cc}}(x, y)} \right] = \frac{{{\partial^{2}}{I_{cc}}(x, y)}}{{\partial {x^{2}}}} + \frac{{{\partial^{2}}{I_{cc}}(x, y)}}{{\partial {y^{2}}}}\\ = {I_{cc}}(x + 1, y) + {I_{cc}}(x - 1, y) + {I_{cc}}(x, y + 1)\\ + {I_{cc}}(x, y - 1) + {I_{cc}}(x - 1, y - 1) + {I_{cc}}(x + 1, y - 1)\\ + {I_{cc}}(x - 1, y + 1) + {I_{cc}}(x + 1, y + 1) - 8{I_{cc}}(x, y) \end{array} $$
(7)

The 3 × 3 filter mask with center −8 given in (8) is obtained from (7); it covers the horizontal, vertical and diagonal edges of objects in the image. Because the Laplacian filter produces the filtered image from second order partial derivatives, it drives constant areas of the image towards zero.

$$ \left[ {\begin{array}{*{20}{c}} 1&~1&~1\\ 1&{-8}&~1\\ 1&~1&~1 \end{array}} \right] $$
(8)

During the filtering process, some information in the image is lost. To restore this information and obtain an enhanced/sharpened image g, the Laplacian filtered image L1 is subtracted from the original image I as follows:

$$ g = I - {L_{1}} $$
(9)

After the histogram equalization and sharpening processes, the resulting color image is cropped around its center. The color image of size R × S is decomposed into two parts, a peripheral region and a central region, as depicted in Fig. 3d, where R and S denote the rows and columns of the image. Since the main object most commonly lies at the central position of the image, the proposed scheme considers only the central area and eliminates the peripheral area. The cropped image, shown in Fig. 3e, is then used for the feature extraction process.

Fig. 3

Preprocessing of an RGB color image

The major steps of the preprocessing of a color image are described in Algorithm 1.


The image obtained in step 6 is used for the color and texture feature extraction process.
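
A rough sketch of the three preprocessing steps is given below: per-channel histogram equalization, Laplacian sharpening with the mask of Eq. (8) followed by the subtraction of Eq. (9), and a central crop. The fraction of the image kept by the crop (here, the middle half of each dimension) is an assumption, since the exact size of the central region is given only in Algorithm 1 and Fig. 3.

```python
import numpy as np
from scipy.ndimage import convolve

LAPLACIAN = np.array([[1,  1, 1],
                      [1, -8, 1],
                      [1,  1, 1]], dtype=float)     # mask of Eq. (8)

def equalize_channel(ch):
    """Plain histogram equalization of one 8-bit channel."""
    hist, _ = np.histogram(ch.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum() / ch.size
    return np.interp(ch.flatten(), np.arange(256), 255 * cdf).reshape(ch.shape)

def preprocess(rgb, crop_frac=0.5):
    """Histogram equalization -> Laplacian sharpening (Eq. 9) -> central crop."""
    img = np.stack([equalize_channel(rgb[..., c]) for c in range(3)], axis=-1)
    lap = np.stack([convolve(img[..., c], LAPLACIAN, mode='reflect')
                    for c in range(3)], axis=-1)
    sharp = np.clip(img - lap, 0, 255)               # g = I - L1
    r, s = sharp.shape[:2]
    dr, ds = int(r * (1 - crop_frac) / 2), int(s * (1 - crop_frac) / 2)
    return sharp[dr:r - dr, ds:s - ds]               # keep the central region only
```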

3.2 Probability histogram based color moments

The color visual feature is one of the most widely used image contents due to its computational simplicity and its invariance to rotation, scaling, translation and other spatial transformations [26]. An appropriate global feature representation performs better than a local one, since it minimizes the computational overhead and increases image retrieval performance to a certain extent. Statistical color moments [43] represent the distribution of colors in an image and can be computed in any color model; in the proposed scheme, the HSV color model is used for the color feature representation. The HSV model is chosen for its relation to human visual perception: in accordance with the three elements of the color vision characteristics of the human eye, HSV is more in line with human visual perception than other commonly used color spaces and reveals visual consistency to human eyes well [40]. In the presented work, the cropped and enhanced RGB color image obtained in step 6 of Algorithm 1 is converted into an HSV image and decomposed into its hue (H), saturation (S) and value (V) color planes. The histogram of each color component is then constructed and the corresponding probability histogram is computed. The statistical color moments, namely mean, standard deviation, skewness and kurtosis, are computed from each probability histogram to form the color feature descriptor. These statistical moments are calculated as

$$ \small \begin{array}{l} \text{Mean: } {\mu_{CC}} = \sum\limits_{r = 0}^{L - 1} {rP(r)} \\[4pt] \text{Standard deviation: } {\sigma_{CC}} = \sqrt {\sum\limits_{r = 0}^{L - 1} {{{(r - {\mu_{CC}})}^{2}}P(r)} } \\[4pt] \text{Skewness: } s{k_{CC}} = \frac{1}{{\sigma_{CC}^{3}}}\sum\limits_{r = 0}^{L - 1} {{{(r - {\mu_{CC}})}^{3}}P(r)} \\[4pt] \text{Kurtosis: } {k_{CC}} = \frac{1}{{\sigma_{CC}^{4}}}\sum\limits_{r = 0}^{L - 1} {{{(r - {\mu_{CC}})}^{4}}P(r)} \\[4pt] \text{where } CC \in \{ {H_{H}}, {S_{H}}, {V_{H}}\} \text{ and the probability } P\left( r \right) = \frac{{\text{number of pixels at bin } r}}{{Width \times Height}} \end{array} $$
(10)

where CC ∈ {HH, SH, VH} indexes the probability histograms of the color components of the HSV image and the pixel values range from 0 to (L − 1). The mean represents the brightness and average color information of the image, while the standard deviation reflects the contrast and measures the spread of the pixel values about the mean of the histogram. The skewness measures the asymmetry of the pixel values about the mean, and the kurtosis measures the peakedness of the distribution about the mean. The color moments of all three probability histograms of the HSV planes together form a 12-D feature descriptor. Hence, the color feature descriptor is defined as

$$ F{V_{Color}} = \left\{ {{\mu_{CC}}, {\sigma_{CC}}, s{k_{CC}}, {k_{CC}}} \right\} , CC \in \{ {H_{H}}, {S_{H}}, {V_{H}}\} $$
(11)
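
The 12-D color descriptor of Eqs. (10)–(11) can be sketched roughly as below; using scikit-image's rgb2hsv for the color conversion and 256 histogram bins per plane are assumptions made for illustration.

```python
import numpy as np
from skimage.color import rgb2hsv

def color_moments(rgb, bins=256):
    """12-D color descriptor: mean, std, skewness and kurtosis of the
    probability histogram of each HSV plane (Eqs. 10-11)."""
    hsv = rgb2hsv(rgb)                       # H, S, V planes scaled to [0, 1]
    descriptor = []
    r = np.arange(bins)
    for plane in np.moveaxis(hsv, -1, 0):
        vals = np.clip((plane * (bins - 1)).astype(int), 0, bins - 1)
        p = np.bincount(vals.ravel(), minlength=bins) / vals.size   # probability P(r)
        mu = np.sum(r * p)
        sigma = np.sqrt(np.sum((r - mu) ** 2 * p))
        sk = np.sum((r - mu) ** 3 * p) / (sigma ** 3 + 1e-12)
        ku = np.sum((r - mu) ** 4 * p) / (sigma ** 4 + 1e-12)
        descriptor += [mu, sigma, sk, ku]
    return np.array(descriptor)              # FV_Color: 3 planes x 4 moments
```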

3.3 Gray level co-occurrence matrix based texture moments

In this section, the proposed texture feature extraction technique is presented. Initially, each color component is transformed by the block level DCT tool and the DC coefficients of all blocks are collected. These DC coefficients are arranged so that they form the DC matrix/sub-image; the formation of the DC matrix is depicted in Fig. 4.

Fig. 4

Formation of the DC matrix/sub-image

Thereafter, the CVs of the three groups of AC coefficients are computed and the corresponding three matrices/sub-images are formed separately; these are known as the CV matrices and are formed in the same way as the DC matrix. In this way, four matrices are obtained: one DC matrix and three CV matrices based on the AC coefficients. The DCT blocks are correlated and carry interrelated information: the DCT coefficients describe local feature information inside each block (intra-block), while more global features are obtained by exploiting the spatial information between each block and its neighboring blocks (inter-block). The GLCM is a statistical tool that provides exactly this spatial/inter-related information among values at a particular distance and a specific orientation. Here, the DC matrix contains the DC coefficients of the various DCT blocks, and the three CV matrices hold information about the AC coefficients of those blocks. To capture the inter-related information between DCT blocks, the GLCMs of the DC matrix and of the three CV matrices are used for extracting texture features, since they provide the spatial relationship among the matrix elements at a specific distance and particular orientations. The presented work uses GLCMs in four and eight different directions with unit distance between the matrix elements, exploits the symmetric property, and computes the corresponding normalized GLCMs (NGLCMs); the computational details of these matrices are discussed in Section 2.2. The statistical parameters contrast, correlation, energy and homogeneity are widely used for characterizing the texture properties of an image; these four features have been used in existing CBIR schemes [36, 53], where satisfactory retrieval results were reported. To form the texture feature descriptor, these four statistical texture parameters are computed from all the NGLCMs obtained from the DCT coefficient based matrices. These statistical moments are defined as

$$ \begin{array}{l} \text{Contrast: } {f_{1}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {{{\left( {i - j} \right)}^{2}}{P_{ij}}} } \\[4pt] \text{Correlation: } {f_{2}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {\frac{{\left( {i - {\mu_{i}}} \right)\left( {j - {\mu_{j}}} \right)}}{{{\sigma_{i}}{\sigma_{j}}}}} } {P_{ij}}, \quad {\sigma_{i}}, {\sigma_{j}} \ne 0\\[4pt] \text{Energy: } {f_{3}} = {\sum\limits_{i}^{G}} {{\sum\limits_{j}^{G}} {{P_{ij}}^{2}} } \\[4pt] \text{Homogeneity: } {f_{4}} = {\sum\limits_{i}^{G}} {{\sum\limits_{j}^{G}} {\frac{{{P_{ij}}}}{{1 + \left| {i - j} \right|}}} } \end{array} $$
(12)
$$\begin{array}{l} {\text{where}} \\ {\mu_{i}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {i{P_{ij}}} } , {\mu_{j}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {j{P_{ij}}} } \\ {\sigma_{i}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {{{\left( {i - {\mu_{i}}} \right)}^{2}}{P_{ij}}} } , {\sigma_{j}} = \sum\limits_{i = 1}^{G} {\sum\limits_{j = 1}^{G} {{{\left( {j - {\mu_{j}}} \right)}^{2}}{P_{ij}}} } \end{array}$$

where μi and μj are the means and σi and σj are the standard deviations. The contrast feature f1 measures the contrast between a value and its adjacent values over the matrix/image block and represents the variation among matrix elements in the texture. The correlation feature f2 measures how an element of the matrix is correlated with the other elements over the matrix/image block. The energy feature f3 is the sum of the squared elements of the NGLCM and is sometimes called the angular second moment or uniformity of energy; if the energy equals 1, the image block is constant. The homogeneity feature f4 measures the closeness of the distribution of elements in the NGLCM to its diagonal; homogeneity is always one for a diagonal matrix. For an image component I, the feature vector FVI is then formed as

$$ FV_{I} = \{ {f_{1}}, {f_{2}}, {f_{3}}, {f_{4}}\} $$
(13)

In the presented paper, an image component I is first decomposed into N × N blocks and each block is transformed using the DCT tool. The DC matrix is then constructed from the collected DC coefficients (the energies of the image blocks), together with the CV matrices of the different groups of AC coefficients. For the image component I, let FVDC, \(FV{_{AC\_G_{1}}}\), \(FV{_{AC\_G_{2}}}\) and \(FV{_{AC\_G_{3}}}\) be the feature vectors of the DC matrix and of the CV matrices of the three groups of AC coefficients, each computed by (13). The single feature vector \(FV{_{DCT\_I}}\) for the image component I is obtained as follows:

$$ F{V_{DCT\_I}} = [F{V_{DC}}, F{V_{AC\_G_{1}}}, F{V_{AC\_G_{2}}} , F{V_{AC\_G_{3}}}] $$
(14)

We now briefly describe the procedure for constructing the feature descriptor of an RGB color image. First, the RGB color image is decomposed into its red (R), green (G) and blue (B) planes and each plane is divided into non-overlapping N × N blocks. All blocks are then transformed using the DCT tool, and the matrices are constructed from the specific arrangement of DCT coefficients described above. Let \(FV{_{DCT\_R}}\), \(FV{_{DCT\_G}}\) and \(FV{_{DCT\_B}}\) be the visual feature vectors/descriptors of the red (R), green (G) and blue (B) components respectively, each computed using (14). The single RGB feature descriptor is then obtained as

$$ F{V_{DCT\_RGB}} = [FV{_{DCT\_R}}, FV{_{DCT\_G}}, FV{_{DCT\_B}}] $$
(15)
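
Putting the pieces together, a rough end-to-end sketch of the 48-D texture descriptor might look as follows; it reuses the dc_and_cv_matrices and glcm_texture_features helpers sketched in Sections 2.1–2.2 (names introduced in this write-up, not taken from the paper) and simply concatenates the four GLCM moments of the DC matrix and the three CV matrices over the R, G and B planes, in the spirit of Eqs. (13)–(15).

```python
import numpy as np

def texture_descriptor(rgb):
    """48-D texture descriptor: 3 color planes x 4 matrices x 4 GLCM moments.
    Relies on dc_and_cv_matrices() and glcm_texture_features() sketched earlier."""
    features = []
    for c in range(3):                                      # R, G, B planes in turn
        dc, cv = dc_and_cv_matrices(rgb[..., c])
        for matrix in (dc, cv[0], cv[1], cv[2]):            # DC matrix + three CV matrices
            features.append(glcm_texture_features(matrix))  # Eq. (13) per matrix
    return np.concatenate(features)     # Eq. (15): [FV_DCT_R, FV_DCT_G, FV_DCT_B]
```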

3.4 Multi-resolution based shape moments

Shape is one of the most important feature descriptors for identifying objects [13] in an image: humans can often recognize objects solely from their shapes. The main purpose of shape representation is to determine attributes of an object that can be used in the matching process during image retrieval. In general, there are two ways of extracting shape features from an image: edge based methods and region based methods. In CBIR applications, invariance properties are essential for effective retrieval of images from large databases. Because they represent shape descriptors efficiently, moments have been used as pattern features in many image retrieval applications [30, 46]. Often, most of the object/shape features of an image are not recognizable at a single resolution but can be visualized at different resolution levels. Therefore, in this paper, moment based shape features are extracted from the gray scale image using the GIP multi-resolution approach [51]. For a two dimensional discrete image f(x, y), the moment of order p and q is defined as

$$ {m_{p q}} = \sum\limits_{x} {\sum\limits_{y} {{x^{p}}{y^{q}}f(x,y)} } , \forall p, q = 0, 1, 2 $$
(16)

where x and y are spatial coordinates of the image. The central moments are defined as

$$ {\mu_{pq}} = \sum\limits_{x} {\sum\limits_{y} {{{\left( {x - \overline x } \right)}^{p}}{{\left( {y - \overline y } \right)}^{q}}f(x,y)} } $$
(17)

where \( \overline x=m_{10}/m_{00}\) and \( \overline y=m_{01}/m_{00}\) define the center of the region. Hence the central moments up to order three can be calculated as:

$$ \begin{array}{l} {\mu_{00}} = {m_{0 0}}\\ {\mu_{10}} = 0\\ {\mu_{01}} = 0\\ {\mu_{11}} = {m_{11}} - \overline y {m_{10}}\\ {\mu_{20}} = {m_{20}} - \overline x {m_{10}}\\ {\mu_{02}} = {m_{02}} - \overline y {m_{01}}\\ {\mu_{30}} = {m_{30}} - 3\overline x {m_{20}} + 2{\overline x^{2}}{m_{10}}\\ {\mu_{21}} = {m_{21}} - 2\overline x {m_{11}} - \overline y {m_{20}} + 2{\overline x^{2}}{m_{01}}\\ {\mu_{12}} = {m_{12}} - 2\overline y {m_{11}} - \overline x {m_{02}} + 2{\overline y^{2}}{m_{10}}\\ {\mu_{03}} = {m_{03}} - 3\overline y {m_{02}} + 2{\overline y^{2}}{m_{01}} \end{array} $$
(18)

The central moments of order p and q are normalized as

$$ {\mu_{pq}} = {\mu_{pq}}/{\mu^{\gamma} }_{00}, \forall p,q = 0, 1, 2,... $$
(19)

where γ = (p + q)/2 + 1. The set of seven moments (ϕ1–ϕ7) for (p + q) = 2, 3, ... can be calculated as follows:

$$ \begin{array}{l} {\phi_{1}} = {\mu_{20}} + {\mu_{02}}\\ {\phi_{2}} = {\left( {{\mu_{20}} - {\mu_{02}}} \right)^{2}} + 4{\mu_{11}^{2}}\\ {\phi_{3}} = {\left( {{\mu_{30}} - 3{\mu_{12}}} \right)^{2}} + {\left( {3{\mu_{21}} - {\mu_{03}}} \right)^{2}}\\ {\phi_{4}} = {\left( {{\mu_{30}} + {\mu_{12}}} \right)^{2}} + {\left( {{\mu_{21}} + {\mu_{03}}} \right)^{2}}\\ {\phi_{5}} = \left( {{\mu_{30}} - 3{\mu_{12}}} \right)\left( {{\mu_{30}} + {\mu_{12}}} \right)\left[ {{{\left( {{\mu_{30}} + {\mu_{12}}} \right)}^{2}} - 3{{\left( {{\mu_{21}} + {\mu_{03}}} \right)}^{2}}} \right]\\ ~~~~~~~+ \left( {3{\mu_{21}} - {\mu_{03}}} \right)\left( {{\mu_{21}} + {\mu_{03}}} \right)\left[ {3{{\left( {{\mu_{30}} + {\mu_{12}}} \right)}^{2}} - {{\left( {{\mu_{21}} + {\mu_{03}}} \right)}^{2}}} \right]\\ {\phi_{6}} = \left( {{\mu_{20}} - {\mu_{02}}} \right)\left[ {{{\left( {{\mu_{30}} + {\mu_{12}}} \right)}^{2}} - {{\left( {{\mu_{21}} + {\mu_{03}}} \right)}^{2}}} \right] + 4{\mu_{11}}\left( {{\mu_{30}} + {\mu_{12}}} \right)\left( {{\mu_{21}} + {\mu_{03}}} \right)\\ {\phi_{7}} = \left( {3{\mu_{21}} - {\mu_{03}}} \right)\left( {{\mu_{30}} + {\mu_{12}}} \right)\left[ {{{\left( {{\mu_{30}} + {\mu_{12}}} \right)}^{2}} - 3{{\left( {{\mu_{21}} + {\mu_{03}}} \right)}^{2}}} \right]\\ ~~~~~~~- \left( {{\mu_{30}} - 3{\mu_{12}}} \right)\left( {{\mu_{21}} + {\mu_{03}}} \right)\left[ {3{{\left( {{\mu_{30}} + {\mu_{12}}} \right)}^{2}} - {{\left( {{\mu_{21}} + {\mu_{03}}} \right)}^{2}}} \right] \end{array} $$
(20)

The six moments ϕ1–ϕ6 are invariant to size, orientation and translation, while ϕ7 is skew invariant, which allows it to distinguish mirror images. The set of seven central moments represents the shape feature descriptor of an image and is written as:

$$ F{V_{mu}} = \left[ {{\phi_{1}}, {\phi_{2}},{\phi_{3}},{\phi_{4}},{\phi_{5}}, {\phi_{6}}, {\phi_{7}}} \right] $$
(21)

Further, the set of seven central moments is computed from each multi-resolution decomposed gray scale image, and the collection of these sets over all resolutions forms the shape feature descriptor. Let FVM be the final multi-resolution shape feature descriptor, obtained as:

$$ F{V_{M}} = [FV_{mu1}, F{V_{mu2}}, F{V_{mu3}}] $$
(22)

where FVmu1, FVmu2, and FVmu3 represent the feature descriptors of the multi-resolution images (sub-images) at levels 1, 2 and 3 respectively.
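
A compact sketch of the 21-D shape descriptor is given below. It uses OpenCV's pyrDown, which smooths with the same 5 × 5 kernel as Section 2.3 and downsamples by two, and takes the seven invariant moments from cv2.HuMoments (computed from the normalized central moments of Eq. (19)) rather than evaluating Eqs. (16)–(20) term by term; treat it as an illustration of the multi-resolution idea rather than the paper's exact implementation.

```python
import cv2
import numpy as np

def shape_descriptor(gray, levels=3):
    """21-D shape descriptor: seven invariant moments per pyramid level."""
    descriptor = []
    level_img = gray.astype(np.float32)
    for _ in range(levels):                       # sub-images at levels 1, 2 and 3
        level_img = cv2.pyrDown(level_img)        # Gaussian smoothing + factor-2 downsampling
        m = cv2.moments(level_img)                # raw, central and normalized moments
        descriptor.append(cv2.HuMoments(m).flatten())   # seven invariant moments
    return np.concatenate(descriptor)             # FV_M = [FV_mu1, FV_mu2, FV_mu3]
```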

3.5 Fused features

Let \(F{V_{Color}} = \left \{ {{f_{c1}}, {f_{c2}}, {f_{c3}},..., {f_{cn}}} \right \}\), \(F{V_{Texture}} = \left \{ {{f_{t1}}, {f_{t2}}, {f_{t3}},..., {f_{tn}}} \right \}\) and \(F{V_{Shape}} = \left \{ {{f_{s1}}, {f_{s2}}, {f_{s3}},..., {f_{sn}}} \right \}\) be the color, texture and shape feature descriptors respectively, where cn, tn and sn denote the numbers of color, texture and shape feature components. Natural images are generally complex and cluttered, and are recognized and identified by visual contents such as color, shape and/or texture. To represent the characteristics of the color, shape and texture features simultaneously, a specific fusion technique is proposed. The whole feature extraction process is represented in Fig. 5.

Fig. 5

Block diagram for the formation of fused feature vector descriptor

The feature components should be normalized to a common range/scale, because different multimedia data have features/components with different ranges; normalization avoids component-variance effects during the similarity measurement process and is especially important for distinguishing different kinds of images. The major steps of the image feature representation are summarized in Algorithm 2.
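
The fusion step can be sketched as follows. Since the normalization of Eq. (23) is given only in Algorithm 2, the per-component min-max scaling to [−1, 1] used here is an assumption that merely reproduces the range reported in Table 1.

```python
import numpy as np

def fused_descriptor(color_fv, texture_fv, shape_fv):
    """Concatenate the 12-D color, 48-D texture and 21-D shape descriptors (81-D)."""
    return np.concatenate([color_fv, texture_fv, shape_fv])

def normalize_database(feature_matrix):
    """Component-wise min-max scaling of all database descriptors to [-1, 1].
    (Assumed form of the normalization; Eq. (23) itself is given in Algorithm 2.)"""
    fmin = feature_matrix.min(axis=0)
    fmax = feature_matrix.max(axis=0)
    scale = np.where(fmax - fmin > 0, fmax - fmin, 1.0)   # avoid division by zero
    return 2 * (feature_matrix - fmin) / scale - 1
```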


3.6 Similarity measurements and image retrieval

In this section, the author describes the proposed similarity measurement along with some existing similarity distances. Let the fused feature descriptor of the query image be FVQ = [FQ1, FQ2, FQ3, ..., FQn] and the fused feature descriptor of a target image in the database be FVT = [FT1, FT2, FT3, ..., FTn], where n is the length of the feature descriptor. These feature descriptors are computed using (23) of Algorithm 2. The aim of the similarity measure is to obtain from the digital repository the top images most similar to the given query image. In the presented work, a similarity distance (Dmn) based on the minimum and maximum values of the feature descriptors, called the min-max distance, is proposed. It is defined as:

$$ \small {\Delta} {D_{mn}} = \sum\limits_{i = 1}^{dd} {\sqrt {\frac{{\left| {\max \left\{ {F{V_{Qi}}, F{V_{Ti}}} \right\}} \right|}}{{\left| {\min \left\{ {F{V_{Qi}}, F{V_{Ti}}} \right\}} \right| + \epsilon }}} } , \quad \text{where } 0 < \epsilon < 0.5,\; i = 1, 2, ..., dd $$
(24)

where dd is the dimension of the fused feature descriptor, and FVQi and FVTi are the fused feature descriptors of the query image and a target image of the dataset. In this distance, the minimum of the two descriptor values can sometimes be zero, so a small quantity ε is added to the denominator to avoid an undefined distance value.

This paper also uses the Euclidean distance to measure the similarity between the query image and the target images in the database. It computes the distance between the feature descriptors of two images as the square root of the sum of their squared absolute differences. The Euclidean distance ΔD is computed as:

$$ {\Delta} D = \sqrt {\sum\limits_{i = 1}^{dd} {{{\left| {F{V_{Qi}} - F{V_{Ti}}} \right|}^{2}}} } , \quad i = 1, 2, ..., dd $$
(25)

A smaller distance represents a better retrieval result in terms of relevancy; if the distance is zero, the two images are identical. The block diagram of the proposed image retrieval scheme is depicted in Fig. 6.
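
The two similarity measures and the ranking step can be sketched as follows; min_max_distance follows Eq. (24) with a small epsilon guarding the denominator, euclidean_distance follows the textual definition accompanying Eq. (25), and retrieve simply sorts the database by increasing distance.

```python
import numpy as np

def min_max_distance(fv_q, fv_t, eps=0.1):
    """Proposed min-max distance of Eq. (24); eps avoids division by zero."""
    hi = np.abs(np.maximum(fv_q, fv_t))
    lo = np.abs(np.minimum(fv_q, fv_t))
    return np.sum(np.sqrt(hi / (lo + eps)))

def euclidean_distance(fv_q, fv_t):
    """Euclidean distance between two fused feature descriptors (Eq. 25)."""
    return np.sqrt(np.sum((fv_q - fv_t) ** 2))

def retrieve(fv_query, database_fvs, top_l=20, distance=euclidean_distance):
    """Return indices of the top-L database images closest to the query."""
    dists = np.array([distance(fv_query, fv_t) for fv_t in database_fvs])
    return np.argsort(dists)[:top_l]
```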

Fig. 6

Block diagram of proposed image retrieval scheme

The major algorithmic steps of the proposed image retrieval system are presented in Algorithm 3.


4 Experimental results and discussion

In this section, the experimental results are discussed and analyzed. The retrieval performance is also compared with some state-of-the-art CBIR schemes.

4.1 Databases

The retrieval performance of the proposed CBIR scheme is validated on three standard image datasets: Corel-1K [16], OT-8 [25] and GHIM-10K [19]. The Corel-1K dataset consists of 1000 images divided into 10 categories, where each category contains 100 images of a similar type. The semantic names of the categories are people, building, food, horse, bus, flower, elephant, mountain, beach and dinosaur. All images are in JPEG format with a size of either 384 × 256 or 256 × 384. Sample images of the Corel dataset are depicted in Fig. 7. The Corel dataset contains a wide variety of images and, owing to the diversity of its contents, meets all the requirements for evaluating image retrieval. The OT-8 dataset consists of 2688 images divided into 8 categories, where each category has a different number of images. The semantic names and numbers of images per category are coast (360), open country (410), forest (328), mountain (374), highway (260), street (292), inside city (308) and tall building (356). The forest category includes all forest and river scenes; most images contain sky regions, and there is no specific sky category. The images of this dataset have diverse contents that overlap with other categories. They are in JPEG format and each image has a size of 265 × 265. The dataset is imbalanced because each category has a different number of images. Sample images of each OT-8 category are shown in Fig. 8. The GHIM-10K database consists of 10000 images divided into 20 categories, where each category contains 500 images of a similar type, of size 400 × 300 or 300 × 400, in JPEG format. The semantic names of the categories include beaches, flowers, horses, ships, flies, cars, bikes, insects, etc. Sample images of the GHIM-10K dataset, one image from each category, are depicted in Fig. 9.

Fig. 7

Sample images of the Corel-1K dataset

Fig. 8

Sample images of OT-8 dataset

Fig. 9

Sample images of the GHIM-10K dataset

The proposed CBIR scheme was implemented in MATLAB 2013b on a computer with an Intel(R) Core i3 2.27 GHz processor, 6 GB RAM and Microsoft Windows 7 Ultimate 32-bit operating system.

4.2 Performance evaluation metrics

The retrieval performance of CBIR systems is measured by two standard metrics, precision and recall, which quantify the relevancy of the retrieved images with respect to the query image. The precision for a query image q is defined as:

$$ \begin{array}{l} P(q) = X/Y \end{array} $$
(26)

A precision of 100.00% means that all images retrieved from the database are relevant. The recall for a query image q is defined as:

$$ \begin{array}{l} R(q) = X/Z \end{array} $$
(27)

where X is the number of relevant retrieved images, Y is the total number of retrieved images, and Z is the number of relevant images available in the corresponding database category. Precision and recall alone are not capable of conveying the whole effectiveness of an image retrieval system, so their weighted harmonic mean, known as the F-score or F-measure, is also used. The F-score is computed as

$$ F(q) = \frac{{\left( {{\beta^{2}} + 1} \right) \times P(q) \times R(q)}}{{{\beta^{2}} \times P(q) + R(q)}}, \quad {\beta^{2}} \in [0, \infty ) $$
(28)

The parameter β balances the weighting of precision and recall. If β = 1, the measure is said to be balanced and the F-score is called the F1-score; β = 1 is the value most commonly used in CBIR systems. The F-score then reduces to

$$ F(q) = \frac{{2 \times P(q) \times R(q)}}{{P(q) + R(q)}} $$
(29)

The values of precision, recall, and F-score normally lie between 0 and 1, but they are also reported as percentages.
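
For completeness, a tiny helper computing the three metrics for a single query (X relevant images among the Y retrieved, Z relevant images in the category) might look like this:

```python
def retrieval_metrics(num_relevant_retrieved, num_retrieved, num_relevant_in_db):
    """Precision, recall and balanced F-score (beta = 1) for a single query,
    following Eqs. (26), (27) and (29)."""
    precision = num_relevant_retrieved / num_retrieved               # X / Y
    recall = num_relevant_retrieved / num_relevant_in_db             # X / Z
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

# Example: 15 relevant among the top 20 retrieved, 100 relevant images in the category
# -> precision 0.75, recall 0.15, F-score 0.25
```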

4.3 Quantitative and qualitative results

In the proposed scheme, the author extracts three kinds of visual contents, i.e., color, texture and shape. Before retrieval, the preprocessing technique is applied to remove distorted/unwanted information from the image. The color based visual contents are then extracted by computing four statistical parameters from the color components of the HSV image, yielding a 12-D color feature descriptor that is very small compared with the original size of the image. Next, the texture features are computed using the DCT and GLCM tools: the preprocessed RGB color image is decomposed into its red, green and blue components, each component is divided into fixed 8 × 8 blocks, and each block is transformed by the DCT tool. The DC coefficients (the block energies) of all blocks are arranged into a matrix, and only 27 AC coefficients representing the significant information of each block are selected. The selected AC coefficients are divided into three groups, the CV of each group is computed, and these CVs are arranged into matrices. In this way four matrices are obtained, one for the DC coefficients and three for the AC coefficients, and four GLCM features are computed from each of them. The dimension of the texture feature descriptor is therefore Mn × Cn × GLn = 48, where Mn = 4 is the number of DC and AC coefficient based matrices, Cn = 3 the number of color components, and GLn = 4 the number of texture moments. These GLCM features, in combination with the other features, have been tested with 4 and 8 different directions on the Corel image dataset. Lastly, the multi-resolution shape features are computed: the original RGB color image is converted to gray scale, the Gaussian image pyramid is applied up to three levels, and seven moments are calculated at each level, giving a 21-D shape feature descriptor. Finally, the author concatenates the three feature descriptors into a single descriptor of overall dimension 12 + 48 + 21 = 81-D. Table 1 shows the component values of the fused feature vector descriptor of a bus sample image from the Corel-1K dataset, where each value lies within [−1, 1] since it is normalized individually by (23). The first 12 values represent the color moments, the next 48 values the texture moments, and the last 21 values the shape moments of the image.

Table 1 The component values of fused feature descriptor of bus image of Corel-1K image dataset

Table 2 shows the retrieval accuracy in terms of precision, recall and F-score (in percentages) for the different directional GLCM texture features using the Euclidean distance and the single feature descriptor described above, where only the top L = 20 images are retrieved from the Corel-1K dataset. The table shows that the dinosaur images reach 100.00% precision for both 4-direction settings and for the 8-direction setting, since these images do not have much structural information or complex attributes. The lowest precision varies with the directions, as one category may have its most prominent features in one direction while another category has them in other directions. For the horizontal and vertical directions, the beach images produce the lowest precision, 55.00%, and this drops to 50.00% for the diagonal directions, since beach images share features with other categories such as mountain, people and elephant images; it is therefore hard to derive significant features from such images. For 8 directions, two categories, beaches and elephants, have the lowest precision. In the Corel-1K database, the people, beach, elephant and mountain categories require high quality feature extraction algorithms to obtain features significant enough not to affect the results. Table 2 also shows that the average precision, average recall and average F-score for the 4 angles (0°–180° and 90°–270°) are 72.50%, 14.50% and 24.17%, while they become 70.50%, 14.10% and 23.50% for the other 4 angles (45°–225° and 135°–315°), i.e., the accuracy decreases slightly. The combined result for all 8 angles, i.e., the horizontal, vertical and diagonal directions (0°–180°, 90°–270°, 45°–225° and 135°–315°), is better and yields an average precision, recall and F-score of 73.50%, 14.70% and 24.50% respectively.

Table 2 The Precision, recall and F-score in percentages (%) for top L = 20 retrieved images from Corel-1K dataset using different directional texture features with color and shape descriptors based on Euclidean distance

In the proposed image retrieval scheme, the experimental results are evaluated with two similarity measures: the Euclidean distance and the distance suggested by the author. The proposed distance depends entirely on the minimum and maximum values of the feature vectors of the query and target images. It provides good results, although slightly lower than the Euclidean similarity measurement; its main advantage is its lower computational overhead compared with the Euclidean measurement. Table 3 reports the retrieval results for the top L = 20 images using the suggested and Euclidean distances with the eight-directional texture features combined with the color and shape visual contents. For the Euclidean distance, the minimum precision, recall and F-score of 55.00%, 11.00% and 18.00% are obtained for the beach and elephant images, while the newly proposed distance gives its lowest precision (30.00%), recall (6.00%) and F-score (10.00%) for the building category. The table shows that the average precision decreases from 73.50% to 54.00% when moving from the Euclidean distance to the proposed distance on the Corel-1K database. Nevertheless, the suggested distance may produce good results for real-life applications or other datasets. The experimental results were also computed for the other, larger image datasets, where the author found very little change in retrieval accuracy between the Euclidean distance and the suggested distance.

Table 3 Precision, recall and F-score (%) for the top L = 20 images retrieved from the Corel-1K dataset using the Euclidean distance and the proposed distance

The proposed CBIR scheme is also validated on the OT-8 dataset, which is imbalanced in nature. Table 4 shows the retrieval results in terms of precision, recall and F-score for the top L = 20 images retrieved from the OT-8 dataset. It is clear from the table that the street category images receive 100.00% precision with the Euclidean distance, which drops to 75.00% with the proposed distance. The lowest retrieval results for the Euclidean distance (precision 45.00%) and the proposed distance (precision 35.00%) are obtained for the tall-building image category, since the contents of these images are mixed with the contents of other categories; it is therefore very difficult for the proposed feature vector descriptor to distinguish the actual contents of the images. The average precision, average recall and average F-score are 68.13%, 4.17% and 7.84% for the Euclidean distance, while these metrics decrease to 52.50%, 3.24% and 6.10% for the proposed distance. Nevertheless, the overall retrieval performance is satisfactory both category-wise and in terms of the overall average metrics (i.e., precision, recall and F-score).
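For reference, the reported figures follow the usual top-L retrieval metrics. A minimal sketch is given below, assuming per-image category labels and a known category size (e.g., 100 images per category in Corel-1K, and variable sizes in the imbalanced OT-8 set, which is why the recall values stay small at L = 20).

```python
# Minimal sketch of the metrics reported in Tables 2-5 for a top-L retrieval:
# precision = relevant retrieved / L, recall = relevant retrieved / category size,
# F-score  = harmonic mean of precision and recall.
def retrieval_metrics(retrieved_labels, query_label, category_size, L=20):
    relevant = sum(1 for lab in retrieved_labels[:L] if lab == query_label)
    precision = relevant / L
    recall = relevant / category_size
    f_score = (2 * precision * recall / (precision + recall)) if relevant else 0.0
    return precision, recall, f_score
```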

Table 4 Precision, recall and F-score (%) for the top L = 20 images retrieved from the OT-8 dataset using the Euclidean distance and the proposed distance

The proposed scheme is also validated on the GHIM-10K image dataset, where it produces satisfactory retrieval results. Table 5 shows the retrieval accuracy in terms of precision, recall and F-score using the Euclidean distance and the proposed distance, where the top L = 20 images are retrieved from the GHIM-10K database. In this table, the horse and airplane images obtain the lowest retrieval performance, i.e., precision (35.00%), recall (1.40%) and F-score (2.69%), using the Euclidean distance, while the proposed distance gives the lowest precision (20.00%), recall (0.80%) and F-score (1.53%) for the wall image category. The highest retrieval performance is achieved by the fireworks images with both the Euclidean distance and the proposed distance. The average precision (52.25%), average recall (2.09%) and average F-score (4.02%) are achieved with the Euclidean distance, while the proposed distance obtains an average precision of 48.00%, average recall of 1.92% and average F-score of 3.69%. The precision thus decreases from 52.25% to 48.00% when moving from the Euclidean distance to the proposed distance. This difference is small and acceptable for a natural image dataset, and the proposed distance also requires a lower computational overhead in the retrieval process because it avoids the square-root computation.

Table 5 Precision, recall and F-score (%) for the top L = 20 images retrieved from the GHIM-10K dataset using the Euclidean distance and the proposed distance

4.4 Comparative results and discussion with related state-of-the-art CBIR schemes

To check the validity of the proposed image retrieval scheme, the experimental results have also been compared with some recently developed CBIR schemes [3, 8, 31, 32, 34, 47] in terms of retrieval accuracy. Alamin et al. [32] have computed color and texture visual descriptors in the singular value decomposition (SVD) domain, where an RGB color image is first transformed into the HSV color space and all color components, i.e., hue (H), saturation (S) and value (V), are divided into non-overlapping blocks. The SVD tool is then applied to each block of every color component to compute the color-texture information of the image, discarding the non-significant singular values. In that work, the building category images obtained the lowest precision, i.e., 24.00%, while the highest precision was achieved by the dinosaur category images; the overall average precision is acceptable at 63.00%. Rahimi et al. [31] have also suggested a color-texture-feature-based CBIR scheme, where the color information is extracted from the red, green and blue color components using the spatial relationship between pixel values. The texture information is computed from manually segmented image regions, where each region is processed by the dual-tree complex wavelet transform (DT-CWT) and the SVD tool. Since segmentation based on human visual perception is not suitable for image classification, this scheme does not produce good retrieval results on the Corel image dataset, and the average precision obtained is 49.69%. Fadaei et al. [8] have extracted color information from a uniform division of the H, S and V planes of the HSV color image using the dominant color descriptor (DCD) technique, and this color information is integrated with wavelet- and curvelet-based texture information. The integrated information is optimized using the particle swarm optimization (PSO) algorithm. In the DCD-based method, the dinosaur images produce the best precision (99.75%), while the worst precision (45.05%) is obtained for the mountain image category; the average precision of the DCD technique is 71.05%, and once the integrated information is optimized by the PSO algorithm, the average precision rises to 76.50%. Ashraf et al. [3] have proposed a CBIR scheme based on fused features from the edge histogram and discrete wavelet domain information. After the fused feature database is constructed, Artificial Neural Networks (ANN) are applied to it, and precision and recall values are computed from the images retrieved from the dataset. The building image category has the worst retrieval precision (50.00%) and the dinosaur image category the highest (100.00%). The average precision of this scheme is 73.50%, which is equivalent to that of our proposed CBIR scheme. Abdolreza et al. [34] have computed multi-resolution (wavelet) and color information features using DWT and color histogram techniques. The most relevant features are selected using an ant colony optimization technique, which maximizes the image retrieval accuracy of the CBIR system. In this work, the mountain image category has the lowest precision (39.80%), while the dinosaur image category has the highest precision (99.80%). Vimina et al. [47] have proposed a CBIR scheme based on a multi-cue fusion approach in a BoVW framework using early and late fusion methods.
For fusion, composite edge, Speeded Up Robust Features (SURF) and color visual feature descriptors are extracted to represent local regions of the image effectively. Independent vocabularies of these visual feature descriptors are then used to form histograms, which are fused to characterize the image. The retrieval process is performed on the basis of these histograms and satisfactory image retrieval accuracy is obtained. This scheme gives the worst results for the building image category, while the dinosaur category has the highest retrieval results. In the schemes discussed above, the image features are extracted from both the spatial and transform domains, but the fusion methods are complicated. In our proposed CBIR scheme, the image visual feature descriptors are extracted effectively from both the spatial and transform domains, a very simple fusion technique is applied, and satisfactory results are produced in most instances. It also produces 100.00% precision for the dinosaur image category and the worst precision (55.00%) for the beach image category. The average precision is reasonably good compared with the other existing CBIR methods. The comparative Table 6 shows the retrieval performance in terms of precision, recall and F-score for all image categories of the Corel-1K dataset, where the top L = 20 images are retrieved from the database. From Table 6, we observe that in most cases the worst results are obtained for the mountain, building and beach image categories. These categories share similar visual contents such as sky and ocean, so the actual contents are mixed across categories, and the images also have complex structures and shapes. Therefore, the proposed and existing feature extraction algorithms are unable to extract the actual contents and do not always produce satisfactory results; for such images, higher-level feature extraction algorithms are required.
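As an illustration of what such a simple fusion step can look like, the sketch below normalizes each feature block separately and concatenates the results into a single descriptor. This is an assumed, generic formulation rather than the authors' exact fusion rule.

```python
# Illustrative descriptor-level fusion (assumed form, not the authors' exact rule):
# each feature block is L2-normalized so that no single feature type dominates the
# distance computation, then the blocks are concatenated.
import numpy as np

def fuse_descriptors(color_vec, texture_vec, shape_vec, eps=1e-12):
    blocks = []
    for v in (color_vec, texture_vec, shape_vec):
        v = np.asarray(v, dtype=np.float64)
        blocks.append(v / (np.linalg.norm(v) + eps))   # per-block normalization
    return np.concatenate(blocks)                      # single fused descriptor
```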

Table 6 Comparison of proposed CBIR scheme with some recently developed schemes on Corel-1K image database in terms of average precision

The proposed CBIR scheme is also compared with existing schemes in terms of the dimensionality of the feature vectors. Table 7 lists the feature vector (FV) dimension and the average precision on the Corel-1K image dataset for four existing methods. The scheme in [3] achieves similar results, but its feature vector dimension is higher than that of the proposed feature vector descriptor. Similarly, [40] has the highest feature vector dimension among the existing methods but provides the lowest average precision. However, a low feature vector dimension can also yield good average precision [44]. In [47], the dimension of the feature vector descriptor is 86 and the average precision is 69.20%. From this discussion, it is clear that extracting a significant feature vector descriptor with a low dimension, without compromising retrieval performance, is very important in an image retrieval system.

Table 7 Comparison of the proposed CBIR scheme in terms of the dimensionality of feature vectors

4.5 Simulation results

To visualize the retrieval results, we present different image categories from the Corel-1K dataset using the Euclidean distance and the newly proposed distance. On the Corel-1K dataset, with the Euclidean distance the lowest precision (i.e., 55.00%) is obtained for the elephant and beach categories and the dinosaur category produces the best precision (100.00%), while with the proposed distance the building and dinosaur categories have the worst and best retrieval precision respectively. Figure 10 shows the retrieval results for the beach and dinosaur images using the Euclidean distance, while Fig. 11 depicts the results for the building and dinosaur images using the proposed distance.

Fig. 10 The retrieval results using the Euclidean distance for the Corel-1K image database, (a)-(b), where the top-left corner images are the queries

Fig. 11 The retrieval results using the proposed distance for the Corel-1K database, (a)-(b), where the top-left corner images are the queries

5 Conclusions

In this paper, a low-dimensional feature descriptor is constructed through a simple and effective fusion of color, texture and shape moments. The color moments are computed from the pre-processed HSV color components using a probability histogram model. The texture moments are calculated by determining an inter-relationship between DCT blocks, and GLCM-based statistics are computed from them, which provides significant textural information about the image. Lastly, the invariant shape moments of the image are computed at different resolutions of the GIP, since multi-resolution analysis captures significant information that is not covered at a single resolution. The combined visual feature descriptor is tested with two distances, the Euclidean distance and the suggested distance. Both distances provide satisfactory retrieval results on three standard image databases, i.e., Corel-1K, OT-8 and GHIM-10K, and the proposed scheme should also be valid for other standard natural image datasets. The experimental results also show that the proposed scheme outperforms several existing image retrieval schemes.