1 Introduction

WITH advances in information technology, image databases (DBs) have grown explosively, demanding effective and efficient tools that allow users to search through such large collections. Traditionally, the most straightforward way to implement an image database management system is to use a conventional database management system, such as a relational or object-oriented database. Systems of this kind are usually called keyword-based, since the images are annotated with keywords. As databases grow larger, retrieving a particular image with these methods becomes tedious and inadequate. To address these problems, content-based image retrieval (CBIR) has emerged as a promising alternative and has drawn substantial research attention in the last decade [8]. In a typical CBIR system, low-level features related to visual content are first extracted from a query image, the similarity between the feature set of the query image and that of each target image in a DB is then computed, and the target images most similar to the query image are retrieved. Extracting good visual features that compactly represent a query image is one of the most important tasks in CBIR [13].

Most of the early studies on CBIR used only a single feature among the various low-level visual features. However, it is hard to attain satisfactory retrieval results with a single feature because, in general, an image exhibits diverse visual characteristics. Recently, active research on image retrieval using combinations of low-level visual features has been carried out [12]. In these methods, several different low-level visual features are extracted and combined, but it has been shown that such a combination of features does not always guarantee better retrieval accuracy [12, 40, 41]. It is challenging to choose visual features that are complementary to each other and to combine the chosen features effectively so as to yield improved retrieval performance.

In this paper, we propose a new content-based image retrieval method based on an efficient combination of shape and texture features. The novelty of the proposed method is threefold: 1) Exponent moments (EMs) are introduced to describe image shape; they have many desirable properties, such as expression efficiency, robustness to noise, geometric invariance, and fast computation; 2) the localized angular phase histogram (LAPH) is used to depict image texture and is robust to illumination changes, scaling, and image blurring; 3) the EMs descriptor (EMD) and LAPH are combined effectively for color image retrieval.

The rest of this paper is organized as follows. A review of previous related work is presented in Section 2. Section 3 recalls the decomposition and reconstruction of images with EMs. Section 4 details the construction of LAPH. Section 5 discusses content-based color image retrieval using EMD and LAPH. Simulation results in Section 6 show the performance of our CBIR scheme. Finally, Section 7 concludes the paper.

2 Related work

Over the past decades, CBIR has been an active area of research in many applications, and many low-level visual features have been extracted and applied to image retrieval. Color is one of the most common and discriminative low-level visual features, being stable against variations in orientation, image size, and background complexity. Conventional color features used in CBIR include the color histogram, color correlogram, color structure descriptor (CSD), and scalable color descriptor (SCD); the latter two are MPEG-7 color descriptors [40]. Chen et al. [6] proposed an adaptive color feature extraction scheme that considers the color distribution of an image. Based on the binary quaternion-moment-preserving (BQMP) thresholding technique, their extraction methods, fixed cardinality (FC) and variable cardinality (VC), extract color features that preserve the color distribution of an image up to the third moment and substantially reduce the distortion incurred in the extraction process. Li et al. [20] presented a novel algorithm based on running subblocks with different similarity weights for object-based image retrieval. By splitting the entire image into subblocks, color region information and similarity matrix analysis are used to retrieve images when querying for a specific object. Liu et al. [26] proposed a novel image feature representation, the color difference histogram (CDH), to describe image features for retrieval. In this histogram, orientation and perceptual color information are combined in a unified framework, and the spatial layouts of both are considered. Talib et al. [39] proposed a new semantic feature extracted from dominant colors (a weight for each DC). The technique helps reduce the effect of the image background on the matching decision, so that an object's colors receive much more focus. Aptoula et al. [4] presented three morphological color descriptors: one makes use of granulometries computed independently for each subquantized color, and two employ the principle of multiresolution histograms for describing color, using morphological levelings and watersheds, respectively.

Texture is an important visual attribute both for human perception and for image analysis systems; its role in domain-specific image retrieval is particularly vital because of its close relation to the underlying semantics in such cases. Texture features, such as the gray-level co-occurrence matrix (GLCM), the Markov random field (MRF) model, the simultaneous auto-regressive (SAR) model, the Wold decomposition model, and the edge histogram descriptor (EHD), have long been studied in image processing, computer vision, and computer graphics. He et al. [11] presented a novel method that uses non-separable wavelet filter banks to extract features of texture images for texture image retrieval. Compared with traditional tensor-product wavelets (such as DB wavelets), the new method can capture more directional and edge information of texture images. Lasmar et al. [19] introduced two new multivariate models using generalized Gaussian and Weibull densities, respectively. These models can capture both the subband marginal distributions and the correlation between wavelet coefficients. Aptoula [3] presented the results of applying global morphological texture descriptors to content-based remote sensing image retrieval, specifically exploring the potential of recently developed multiscale texture descriptors, namely the circular covariance histogram and rotation-invariant point triplets. Pappas et al. [28] reviewed recently proposed texture similarity metrics and applications that critically depend on such metrics, with emphasis on image and video compression and content-based retrieval. Atto et al. [5] derived a 2-D spectrum estimator from recent results on the statistical properties of wavelet packet coefficients of random processes, and discussed the performance of this wavelet-based estimator, in comparison with the conventional 2-D Fourier-based spectrum estimator, on texture analysis and content-based image retrieval. Rakvongthai et al. [32] investigated the use of complex wavelets for texture retrieval in a noisy environment where the query image is noisy; based on a statistical framework, the feature vector is formed by modeling an image in the complex wavelet domain and estimating parameters from the image.

Shape is known to play an important role in human recognition and perception. Object shape features provide a powerful clue to object identity; humans can recognize objects solely from their shapes. The significance of shape as a visual feature can be seen from the fact that every major CBIR system incorporates shape features in one form or another [44]. Using a mathematical form of analysis, Li et al. [21] compared the amount of visual information captured by the Zernike moment (ZM) phase with that captured by the ZM magnitude, and then proposed combining the magnitude and phase coefficients to form a new shape descriptor for CBIR. Shu et al. [36] suggested a novel shape contour descriptor for shape matching and retrieval, called the contour points distribution histogram (CPDH), which is based on the distribution of points on the object contour in polar coordinates. CPDH not only conforms to human visual perception but also has low computational complexity. Jian et al. [15] proposed an efficient method based on singular values and a potential-field representation for face-image retrieval, in which the rotation-, shift-, and scale-invariant properties of the singular values are exploited to devise a compact, global feature for face-image representation. Liu et al. [25] considered the family of total Bregman divergences (tBDs) as an efficient and robust "distance" measure to quantify the dissimilarity between shapes, and used the tBD-based l1-norm center as the representative of a set of shapes. Anuar et al. [2] proposed a novel technique for trademark retrieval whose improved performance stems from the integration of two shape descriptors: the Zernike moments as the global descriptor and the edge-gradient co-occurrence matrix as the local descriptor.

Most of the early studies on CBIR used only a single feature among the various low-level visual features. However, it is hard to attain satisfactory retrieval results with a single feature because, in general, an image exhibits diverse visual characteristics. Recently, active research on image retrieval using combinations of low-level visual features has been carried out [7]. Lin et al. [24] proposed three image features for retrieval: the first two are based on color and texture, respectively called the color co-occurrence matrix (CCM) and the difference between pixels of scan pattern (DBPSP); the third is based on color distribution and is called the color histogram for K-means (CHKM). Yap et al. [43] proposed content-based image retrieval using Legendre chromaticity distribution moments (LCDM), which provide a compact, fixed-length, and computationally effective representation of the color content of an image. Yu et al. [16] considered multiple features from different views, i.e., the color histogram, a Hausdorff edge feature, and a skeleton feature, to represent cartoon characters with different colors, shapes, and gestures; each visual feature reflects a unique characteristic of a cartoon character, and the features are complementary to each other for retrieval and synthesis. Jacob et al. [14] proposed the local oppugnant color texture pattern (LOCTP), which discriminates information derived from spatial inter-chromatic texture patterns of different spectral channels within a region; it determines the relationship, in terms of intensity and directional information, between the referenced pixels and their oppugnant neighbors, and strives to exploit the harmonized link between color and texture, helping the system incorporate human perception. Farsi et al. [9] presented a novel CBIR method based on the combination of the Hadamard matrix and the discrete wavelet transform (HDWT) in the hue-min-max-difference colour space; the average normalized rank and a combination of precision and recall are used as metrics to evaluate and compare the proposed method against different methods. Kashif et al. [17] proposed a CBIR scheme based on three well-known algorithms: the colour histogram, used to extract the colour features of an image; the Gabor filter, used to extract the texture features; and moment invariants, used to extract the shape features. Prasad et al. [31] presented a technique for retrieving images by region matching using a combined feature index based on color, shape, and location within the MPEG-7 framework; dominant regions within each image are indexed using integrated color, shape, and location features, and various combinations of regions are also indexed. Seetharaman et al. [34] proposed a unified scheme for automatic image retrieval based on multivariate parametric tests, in which the mean and covariance represent both query and target images, and statistical features such as the coefficient of variation, skewness, kurtosis, variance-covariance, spectrum of energy, and the number of shapes in the images are used. Sherin [35] proposed a novel integrated curvelet-based image retrieval scheme (ICTEDCT-CBIR), in which curvelet multiscale ridgelets are integrated with region-based vector codebook subband clustering for enhanced dominant color extraction and texture analysis. Singh et al. [30] proposed a novel image retrieval solution combining local and global features: local features are extracted by detecting linear edges of the edge map of the image using the Hough transform and then computing normalized histograms of the distances of the lines from the centroid of the edge image, while the global features are represented by Zernike moments. Susana et al. [38] proposed a color-texture descriptor, the texture component descriptor (TCD), which arises from the decomposition of the image into its textural components, i.e., groups of blobs with similar attributes, whether color, shape, or orientation. Wang et al. [42] proposed a new and effective color image retrieval scheme that combines a dynamic dominant color, a steerable-filter texture feature, and a pseudo-Zernike moment shape descriptor. Singha et al. [37] proposed a CBIR approach based on the combination of the Haar wavelet transform using the lifting scheme and the colour histogram (CH); the colour feature is described by the CH, which is translation- and rotation-invariant, the Haar wavelet transform is used to extract the texture features and the local characteristics of an image, and the lifting scheme reduces the processing time for retrieving images. In [23], two-dimensional or one-dimensional histograms of the CIELab chromaticity coordinates are chosen as color features, and variances extracted by discrete wavelet frame analysis are chosen as texture features. In these methods, several different low-level visual features are extracted and combined, but it has been shown that such a combination of features does not always guarantee better retrieval accuracy [12, 40, 41]. It is challenging to choose visual features that are complementary to each other and to combine the chosen features effectively so as to yield improved retrieval performance.

3 Exponent moments descriptor (EMD)

Shape is known to play an important role in human recognition and perception [1]. Object shape features provide a powerful clue to object identity; humans can recognize objects solely from their shapes. The significance of shape as a feature for content-based image retrieval can be seen from the fact that every major CBIR system incorporates shape features in one form or another. As the most commonly used approach to shape description, moments and functions of moments have been utilized as pattern features in numerous applications [1]. The theory of moments provides useful series expansions for the representation of object shapes. In this section, we propose a robust and effective shape feature based on a set of new orthogonal image moments known as Exponent moments.

In 2011, Meng & Ping [27] extended radial harmonic Fourier moments and introduced a new moment set named Exponent moments. Compared with other orthogonal moments, EMs have many desirable properties, such as better image reconstruction, lower noise sensitivity, geometric invariance, and lower computational complexity. Moreover, EMs are free of numerical instability, so high-order moments can be computed accurately.

A function set \( {P}_{n,m}\left(r,\theta \right) \), defined in a polar coordinate system \( \left(r,\theta \right) \), consists of the radial function \( {A}_n(r) \) and the angular Fourier factor \( \exp \left(jm\theta \right) \)

$$ {P}_{n,m}\left(r,\theta \right)={A}_n(r) \exp \left(jm\theta \right) $$
(1)

where \( {A}_n(r)=\sqrt{2/r}\, \exp \left(j2n\pi r\right) \), \( n,m=-\infty, \cdots, 0,\cdots, +\infty \), \( 0\le r\le 1 \), and \( 0\le \theta \le 2\pi \). Owing to the properties of the radial function and of the angular Fourier factor, the set \( {P}_{n,m}\left(r,\theta \right) \) is orthogonal and complete over the interior of the unit circle

$$ \int_0^{2\pi }\!\int_0^1{P}_{n,m}\left(r,\theta \right){P}_{k,l}^{*}\left(r,\theta \right)\,r\,dr\,d\theta =4\pi {\delta}_{n,k}{\delta}_{m,l} $$
(2)

where 4π is the normalization factor, \( {\delta}_{n,k} \) and \( {\delta}_{m,l} \) are the Kronecker deltas, and \( {P}_{k,l}^{*}\left(r,\theta \right) \) is the conjugate of \( {P}_{k,l}\left(r,\theta \right) \).

The image \( f\left(r,\theta \right) \) can be decomposed over the set \( {P}_{n,m}\left(r,\theta \right) \) as

$$ f\left(r,\theta \right)=\sum_{n=-\infty}^{+\infty }\sum_{m=-\infty}^{+\infty }{E}_{n,m}{A}_n(r) \exp \left(jm\theta \right) $$
(3)

where \( {E}_{n,m} \) is the EM of order n with repetition m, defined as

$$ {E}_{n,m}=\frac{1}{4\pi }\int_0^{2\pi }\!\int_0^1f\left(r,\theta \right){A}_n^{*}(r) \exp \left(-jm\theta \right)\,r\,dr\,d\theta $$
(4)

where \( {A}_n^{*}(r) \) is the conjugate of \( {A}_n(r) \).

Following the principle of orthogonal functions, the image function \( f\left(r,\theta \right) \) can be reconstructed approximately from a limited number of EM orders (\( \left|n\right|\le {n}_{\max } \), \( \left|m\right|\le {m}_{\max } \)); the more orders used, the more accurate the image description

$$ {f}^{\prime}\left(r,\theta \right)=\sum_{n=-\infty}^{+\infty }\sum_{m=-\infty}^{+\infty }{E}_{n,m}{A}_n(r) \exp \left(jm\theta \right)\approx \sum_{n=-{n}_{\max}}^{n_{\max }}\sum_{m=-{m}_{\max}}^{m_{\max }}{E}_{n,m}{A}_n(r) \exp \left(jm\theta \right) $$
(5)

where \( {f}^{\prime}\left(r,\theta \right) \) is the reconstructed image. The basis functions \( {A}_n(r) \exp \left(jm\theta \right) \) of the EMs are orthogonal over the interior of the unit circle, so each EM order makes an independent contribution to the reconstruction of the image.
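To make the decomposition concrete, the following Python/NumPy sketch approximates Eq. (4) by a discrete sum over the pixels inside the unit disk. It is a minimal illustration rather than the authors' implementation; in particular, the mapping of the image onto the unit disk (the inscribed-circle convention) and the rectangle-rule quadrature are our assumptions.

```python
import numpy as np

def exponent_moments(img, n_max, m_max):
    """Discrete approximation of Eq. (4): EMs E_{n,m} of a grayscale image.
    The image is mapped onto the unit disk (an assumed convention)."""
    H, W = img.shape
    y, x = np.mgrid[0:H, 0:W].astype(float)
    xc = (2 * x - W + 1) / W              # pixel centers mapped to [-1, 1]
    yc = (2 * y - H + 1) / H
    r = np.hypot(xc, yc)
    theta = np.arctan2(yc, xc)
    inside = (r <= 1.0) & (r > 0)         # r > 0 avoids the sqrt(2/r) singularity
    dA = (2.0 / W) * (2.0 / H)            # pixel area; note r dr dtheta = dx dy
    f, rr, tt = img[inside], r[inside], theta[inside]
    E = np.zeros((2 * n_max + 1, 2 * m_max + 1), dtype=complex)
    for n in range(-n_max, n_max + 1):
        An_conj = np.sqrt(2.0 / rr) * np.exp(-1j * 2 * np.pi * n * rr)
        for m in range(-m_max, m_max + 1):
            kern = An_conj * np.exp(-1j * m * tt)
            E[n + n_max, m + m_max] = np.sum(f * kern) * dA / (4 * np.pi)
    return E
```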

Below, we derive and analyze the geometric invariance of EMs. Let \( {f}^r\left(r,\theta \right)=f\left(r,\theta +\alpha \right) \) denote the image \( f\left(r,\theta \right) \) rotated by an angle α; then the EMs of \( f\left(r,\theta +\alpha \right) \) and \( f\left(r,\theta \right) \) are related by

$$ {E}_{n,m}\left({f}^r\right)={E}_{n,m}(f) \exp \left(jm\alpha \right) $$

where \( {E}_{n,m}\left({f}^r\right) \) and \( {E}_{n,m}(f) \) are the EMs of \( {f}^r\left(r,\theta \right) \) and \( f\left(r,\theta \right) \), respectively. According to the above equation, a rotation of the image by an angle α induces a phase shift \( {e}^{jm\alpha } \) in \( {E}_{n,m}(f) \). Taking the modulus on both sides, we have

$$ \left|{E}_{n,m}\left({f}^r\right)\right|=\left|{E}_{n,m}(f) \exp \left(jm\alpha \right)\right|=\left|{E}_{n,m}(f)\right|\left| \exp \left(jm\alpha \right)\right|=\left|{E}_{n,m}(f)\right| $$

Thus, rotation invariance can be achieved by taking the modulus of the image's EMs; in other words, the EM magnitudes \( \left|{E}_{n,m}(f)\right| \) are invariant to rotation. Moreover, the EM magnitudes are invariant to scaling provided the computation area covers the same content; in practice, this condition is met because the EMs are defined on the unit disk.
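As a quick numerical check of this property (our illustration, using `exponent_moments` from the sketch above), one can compare the EM magnitudes of a smooth synthetic image before and after rotation; they agree up to interpolation error:

```python
from scipy.ndimage import rotate

# A smooth off-center blob, so a 30-degree rotation visibly changes the image.
yy, xx = np.mgrid[0:64, 0:64]
img = np.exp(-((xx - 32.0) ** 2 + (yy - 20.0) ** 2) / 200.0)

E0 = np.abs(exponent_moments(img, 5, 5))
E1 = np.abs(exponent_moments(rotate(img, 30.0, reshape=False), 5, 5))
print(np.max(np.abs(E0 - E1)))  # small: |E_{n,m}| is rotation invariant
```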

Figure 1 gives some examples of image reconstruction using ZMs and EMs for the images "Plane" and "Flower" (moment orders K = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50). As can be seen from Fig. 1, the images reconstructed with EMs show more visual resemblance to the original image at the early orders, and their edges are better defined, with less jaggedness. In Fig. 2, we plot the average reconstruction errors [10] for "Plane" and "Flower". It can be observed that the images reconstructed with EMs resemble the original image at the early orders, and that EMs are free of numerical instability. Figure 3 compares the moment computation time of ZMs and EMs for "Plane" and "Flower". Figures 4 and 5 show the EM magnitudes of "Plane" and "Flower" under various common image processing operations and geometric transforms; the EM magnitudes exhibit good robustness against all of them. Hence, the EM magnitudes (called the EMs descriptor, EMD) are well suited to a CBIR system.

Fig. 1

Some samples of reconstructed images (moment orders K = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50): (a) Original images Plane and Flower, (b) Reconstructed images Plane for ZMs, (c) Reconstructed images Plane for EMs, (d) Reconstructed images Flower for ZMs, (e) Reconstructed images Flower for EMs

Fig. 2

The reconstruction error yielded by ZMs and EMs in the image reconstruction experiment

Fig. 3

The moments computing time using ZMs and EMs for image “Plane” and “Flower”

Fig. 4

The EMs magnitudes for image Plane under various common image processing operations and geometric transforms: (a) Original image, (b) Image blurring, (c) Edge Sharpening, (d) Light increasing, (e) Median filtering, (f) Rotation, (h) Scaling, (i) Translation

Fig. 5

The EMs magnitudes for image Flower under various common image processing operations and geometric transforms: (a) Original image, (b) Image blurring, (c) Edge Sharpening, (d) Light increasing, (e) Median filtering, (f) Rotation, (h) Scaling, (i) Translation

4 Localized angular phase histogram (LAPH)

Generally, texture features play a very important role in computer vision and pattern recognition, especially in describing the content of images. The importance of texture stems from its presence in many real-world images: clouds, trees, bricks, hair, and fabric, for example, all have textural characteristics. Earlier methods for texture representation suffer from two main drawbacks: they are either computationally expensive or poor in retrieval accuracy [18, 22, 33]. In this section, we introduce a new texture feature for CBIR, the localized angular phase histogram, which is efficient in terms of both accuracy and computational complexity.

4.1 Localized angular phase (LAP) of image pixel

In this paper, LAP is based on a localized Fourier transform that provides information in both the spatial and frequency domains. In contrast to the 2D short-term Fourier transform, LAP applies the 1D Fourier transform to a 1D signal of pixels taken from a local image window. The phase is obtained by computing the arctangent of the ratio between the imaginary and real Fourier coefficients. The phase signs are analyzed to form 8-bit codewords, and the distribution of their decimal values is used to describe the pixel texture.

For pixel texture extraction, we use the HSI representation of the color image because this color space separates color and intensity information. The image pixel texture is extracted from the I component, because the intensity (I) component closely matches human perception of lightness. For each image pixel, the construction of LAP can be summarized as follows; a minimal computational sketch is given after Fig. 7.

  1) Construct Local Image Window and Convert Signal

    As shown in Fig. 6, the local image window (shaded part) is first constructed centered on pixel s(x, y), and the 17 pixels of the window are then converted into a 1D discrete signal along the arrow direction. We denote this discrete signal by \( p(n),\ n=0,1,\dots, 16 \).

    Fig. 6

    Local image window for computing LAP of image pixel s(x,y)

  2) Perform 1D Fourier Transform

    The 1D Fourier transform and inverse transform of p(n) are given by

    $$ P(k)=\sum_{n=0}^{N-1}p(n){e}^{-\frac{2\pi i}{N}kn},\qquad p(n)=\frac{1}{N}\sum_{k=0}^{N-1}P(k){e}^{\frac{2\pi i}{N}kn} $$
    (6)

    where N is the number of samples in p(n); for the local image window, N is 17. Using (6), the discrete signal p(n) is converted into the Fourier coefficients P(k).

    After the Fourier transform, the 17 complex coefficients P(0), P(1), …, P(16) are obtained. The next step is to select some of these coefficients for extracting the phase information.

  3) Extract Phase Information

    P(0) is the DC value of the Fourier transform and contains no phase information, so it is excluded from the selected coefficients. Because the image contains only real values, its Fourier transform is conjugate-symmetric, and half of the coefficients are redundant. If the number of samples were even, e.g., 16, the resulting coefficients would contain a second DC-like value, which would reduce the number of useful non-redundant complex coefficients. To avoid this loss of information, the LAP method uses 17 samples instead, which yields only one DC value. Then, 8 non-redundant complex coefficients are selected: either half, P(1), P(2), …, P(8) or P(9), P(10), …, P(16), may be used. The phase information C is then extracted from these 8 complex Fourier coefficients by computing the arctangent of the ratio between their imaginary and real parts.

    $$ C=\left[{C}_1\;{C}_2\;{C}_3\;{C}_4\;{C}_5\;{C}_6\;{C}_7\;{C}_8\right],\qquad {C}_i= \arctan \frac{a_i}{b_i}\quad \left(i=1,2,\cdots, 8\right) $$

    where \( a_i \) and \( b_i \) are the imaginary part and the real part, respectively, of the i-th selected complex Fourier coefficient.

  4) Construct Localized Angular Phase (LAP)

    The phase vector C is quantized into an 8-bit binary code using the following formula

    $$ b(k)=\left\{\begin{array}{ll}1, & \text{if } {C}_k\ge 0\\ 0, & \text{otherwise}\end{array}\right. $$
    (7)

    where b(k) is the sign bit of the phase value \( C_k \).

    By arranging b(1), b(2), …, b(8) into an 8-bit binary code and weighting each b(k) by the corresponding power of two, (7) can be transformed into a unique LAP number (LAP element), given by

    $$ \mathrm{LAP}=\sum_{k=1}^8b(k){2}^{k-1} $$
    (8)

    By (8), the LAP is a decimal value between 0 and 255 resulting from the 8-bit binary code. Figure 7 shows two color images and their LAP matrices.

    Fig. 7

    The color images and their LAP matrices: (a) The color image, (b) The LAP matrix
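The per-pixel computation of steps 1)–4) can be sketched as follows (a minimal Python illustration; the exact ordering of the 17 window pixels in Fig. 6 is assumed to be handled by the caller, which passes the 1D signal p(n)):

```python
import numpy as np

def lap_code(p):
    """LAP code of one pixel from its 17-sample window signal p(n)."""
    p = np.asarray(p, dtype=float)
    assert p.size == 17                        # odd length: exactly one DC value
    P = np.fft.fft(p)                          # 1D Fourier transform, Eq. (6)
    sel = P[1:9]                               # 8 non-redundant coefficients P(1)..P(8)
    real = np.where(sel.real == 0, 1e-12, sel.real)  # guard against division by zero
    C = np.arctan(sel.imag / real)             # phase: arctangent of imag/real
    b = (C >= 0).astype(int)                   # sign quantization, Eq. (7)
    return int(np.sum(b * 2 ** np.arange(8)))  # LAP number in [0, 255], Eq. (8)
```

Scanning every pixel of the I component and applying `lap_code` to its window signal yields the LAP matrix shown in Fig. 7.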

4.2 Localized angular phase histogram (LAPH) of image

Considerable research has been carried out on representations of image content, and the most popular global representation of image information is the histogram. Statistically, a color histogram denotes the joint probability of the intensities of the three image channels, thus describing the global pixel distribution in an image. In general, the histogram provides useful clues for expressing similarity between images, owing to its robustness to background complications and object distortion. Moreover, it is translation-, scale-, and rotation-invariant and very simple to implement, and histogram-based systems exhibit a fast retrieval response that eases real-time implementation. In this paper, we introduce a new texture feature for CBIR, the localized angular phase histogram (LAPH), which is efficient in terms of both accuracy and computational complexity.

Let L denote the LAP matrix of a color image I, containing N LAP elements; the corresponding LAPH is then given by

$$ H(k)=\frac{n_k}{N},\qquad k=0,1,\cdots, 255 $$
(9)

where \( n_k \) is the number of LAP elements falling in the k-th bin.
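A minimal sketch of Eq. (9), assuming the LAP matrix has already been computed (e.g., with `lap_code` above):

```python
import numpy as np

def laph(lap_matrix):
    """Normalized 256-bin histogram of LAP codes, Eq. (9)."""
    lap_matrix = np.asarray(lap_matrix)
    hist, _ = np.histogram(lap_matrix, bins=256, range=(0, 256))
    return hist / lap_matrix.size
```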

Figures 8 and 9 show the traditional color histogram (TCH) and the localized angular phase histogram (LAPH) for pairs of images with similar/different content. From Figs. 8 and 9, we can see that LAPH reflects image content effectively and is superior to TCH.

Fig. 8

The TCH and LAPH for different images with similar/different content: (a) The color images with similar/different content, (b) The TCH for different images with similar/different content, (c) The LAPH for different images with similar/different content

Fig. 9

The TCH and LAPH for different images with similar/different content: (a) The different color images with similar/different content, (b) The TCH for different images with similar/different content, (c) The LAPH for different images with similar/different content

5 Content-based color image retrieval using EMD and LAPH

For content-based color image retrieval, the features of the images in a color image database are extracted and stored in an index file linked to the original color images. The descriptor of the query color image is represented in vector form, and the similarity is calculated between the descriptor vectors of the database color images and that of the query color image. This section presents a content-based color image retrieval scheme based on EMD and LAPH. Figure 10 shows the framework of our image retrieval system.

Fig. 10

Block diagram of the proposed color image retrieval system

5.1 EMD computing and selection

According to Section 3, the EMD, i.e., the EM magnitudes, can be computed rapidly. However, we do not need many EM magnitudes for color image retrieval, since color image features can normally be captured by just a few low-frequency EM magnitudes. The choice of the maximum order value \( {n}_{\max } \) depends on the size of the given color image and on the resolution needed. Moreover, the symmetry of the EM magnitude distribution (\( \left|{E}_{n,m}\right|=\left|{E}_{-n,-m}\right| \)) must be fully taken into account when the EM magnitudes are selected. Table 1 lists the selected EMD for different maximum orders (in this paper, the maximum order value is set to 5). The shape feature vector based on the EMD in RGB color space is then given by

$$ {\mathbf{F}}_1=\left({\mathbf{E}}^R,{\mathbf{E}}^G,{\mathbf{E}}^B\right) $$
(10)

where \( {\mathbf{E}}^R \), \( {\mathbf{E}}^G \), and \( {\mathbf{E}}^B \) denote the EMD of the red, green, and blue components, respectively.

Table 1 List of the selected EMD for different max orders
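The selection can be sketched as follows (our illustration, using `exponent_moments` from Section 3; we assume Table 1 keeps exactly one magnitude from each symmetric pair \( \left|{E}_{n,m}\right|=\left|{E}_{-n,-m}\right| \)):

```python
import numpy as np

def emd_feature(channel, n_max=5):
    """EM-magnitude shape feature of one color channel, with the
    symmetric duplicates |E_{n,m}| = |E_{-n,-m}| removed."""
    E = exponent_moments(channel, n_max, n_max)
    feats = [abs(E[n + n_max, m + n_max])
             for n in range(0, n_max + 1)
             for m in range(-n_max, n_max + 1)
             if n > 0 or m >= 0]              # one of each (n,m)/(-n,-m) pair
    return np.array(feats)

# Eq. (10): concatenate the EMD of the R, G, B channels of an RGB image.
# F1 = np.concatenate([emd_feature(img[..., c]) for c in range(3)])
```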

5.2 LAPH computing

According to Section 4, we can rapidly compute the LAP of each image pixel and then extract the LAPH of the image in HSI color space. The resulting texture feature vector based on LAPH is denoted \( {\mathbf{F}}_2 \).

5.3 Similarity measure

After the shape feature vector F 1 and texture feature vector F 2 are extracted, the retrieval system combines these feature vectors, calculates the similarity between the combined feature vector of the query image and that of each target image in an image DB, and retrieves a given number of the most similar target images.

  (1) Shape Feature Similarity Measure

    The shape feature similarity is given by

    $$ {S}_1\left(Q,I\right)=\left|{\mathbf{F}}_1^Q-{\mathbf{F}}_1^I\right| $$
    (11)

    where \( {\mathbf{F}}_1^Q \) denotes the shape feature vector of the query image Q, and \( {\mathbf{F}}_1^I \) denotes the shape feature vector of the target image I.

  (2) Texture Feature Similarity Measure

    We give the texture feature similarity as follows

    $$ {S}_2\left(Q,I\right)=\left|{\mathbf{F}}_2^Q-{\mathbf{F}}_2^I\right| $$
    (12)

    where \( {\mathbf{F}}_2^Q \) denotes the texture feature vector of the query image Q, and \( {\mathbf{F}}_2^I \) denotes the texture feature vector of the target image I.

  (3) Combined Feature Similarity Measure

    So the distance used for computing the similarity between the query feature vector and the target feature vector is given as

    $$ S\left(Q,I\right)={w}_1{S}_1\left(Q,I\right)+{w}_2{S}_2\left(Q,I\right),\qquad {w}_1+{w}_2=1 $$
    (13)

    where \( w_1 \) and \( w_2 \) are the weights of the shape and texture features, respectively.

    When retrieving images, we first calculate the similarity between the query image and each target image in the image DB, and then sort the retrieval results according to the similarity value.
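The full similarity computation can be sketched as below (our illustration; the paper writes |·| without naming the vector norm, so the L1 norm and the equal default weights here are assumptions):

```python
import numpy as np

def combined_distance(f1_q, f1_t, f2_q, f2_t, w1=0.5, w2=0.5):
    """Weighted distance of Eqs. (11)-(13) between query and target features."""
    s1 = np.sum(np.abs(f1_q - f1_t))   # shape distance, Eq. (11)
    s2 = np.sum(np.abs(f2_q - f2_t))   # texture distance, Eq. (12)
    return w1 * s1 + w2 * s2           # combined distance, Eq. (13)
```

Sorting the database in ascending order of this distance yields the retrieval list; in practice, the two distances would typically be normalized to comparable ranges before weighting.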

6 Simulation results

To evaluate the performance of the proposed algorithm, we conduct an extensive set of CBIR experiments by comparing the proposed scheme to several state-of-the-art image retrieval approaches [7, 9, 43].

6.1 Image database

The color image retrieval system has been implemented in the MATLAB 7.0 programming environment on a Pentium 4 (2 GHz) PC. To check the retrieval efficiency of the proposed method, we perform experiments over 10000 images from 150 categories of the COREL photo gallery, in which each category contains 100 images. Every database image is of size 256 × 384 or 384 × 256, and the images cover a variety of topics, such as "Flowers", "Buses", "Beach", "Elephants", "Sunset", "Buildings", "Horses", etc.

We also perform experiments over 10000 images from the 256 object categories of the Caltech image database. The Caltech database comprises 30607 images, in which each category contains a minimum of 80 images. The Caltech images were harvested from other popular online image databases, and they represent a diverse set of lighting conditions, poses, backgrounds, image sizes, and camera characteristics. The categories were hand-picked by the database's authors to represent a wide variety of natural and artificial objects in various settings. The organization is simple, and the images are ready to use without cropping or other processing. Corel and Caltech images have been widely used by the image processing and CBIR research communities. Figure 11 shows the interface of our image retrieval system.

Fig. 11

User interface of our image retrieval system

6.2 The performance of parameters selection

In our image retrieval system, the EM magnitudes (EMD) are used to capture the shape feature. To evaluate the overall retrieval performance of the proposed image feature, a number of experiments were performed.

In order to choose a good maximum EM order for feature extraction, we randomly selected 500 images as query images from the above image database and tested the retrieval accuracy for different maximum EM order values. Figure 12 shows the mean retrieval precision over the 500 queries for different maximum EM order values, which reflects the image retrieval efficiency. From Fig. 12, we obtain the optimal maximum EM order value \( {n}_{\max }=5 \).

Fig. 12

Mean retrieval precision over 500 queries for different maximum EM order values: (a) Average retrieval performance for n max = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, (b) Average retrieval performance for n max = 5, 10, 15, 20, 25

Figure 13 shows the average retrieval precision over 500 queries for different feature weight values \( w_1, w_2 \), which reflects the image retrieval efficiency. In Figs. 14 and 15, we demonstrate our retrieval results with EMD only, LAPH only, and both EMD and LAPH, respectively. The image in the top left corner is the query image; the other 20 images are the retrieval results.

Fig. 13

The average retrieval precision for different feature weight values

Fig. 14

Our image retrieval results (Panda): (a) By taking only the EMD, (b) By taking only the LAPH, (c) By taking both EMD and LAPH

Fig. 15

Our image retrieval results (Horse): (a) By taking only the EMD, (b) By taking only the LAPH, (c) By taking both EMD and LAPH

6.3 Comparative performance evaluation

We report experimental results that show the feasibility and utility of the proposed algorithm and compare its performance with three state-of-the-art image retrieval approaches [7, 9, 43]. To simulate the practical situation of online users, the sequence of query images used in all the experiments is generated at random.

Figures 16, 17 and 18 show the image retrieval results using scheme [43], scheme [7], scheme [9], and the proposed method. The image in the top left corner is the query image; the other 20 images are the retrieval results.

Fig. 16

The image retrieval results (Flower) using different schemes: (a) The retrieval scheme [43], (b) The retrieval scheme [7], (c) The retrieval scheme [9], (d) The proposed retrieval method

Fig. 17

The image retrieval results (Horse) using different schemes: (a) The retrieval scheme [43], (b) The retrieval scheme [7], (c) The retrieval scheme [9], (d) The proposed retrieval method

Fig. 18

The image retrieval results (Bust) using different schemes: (a) The retrieval scheme [43], (b) The retrieval scheme [7], (c) The retrieval scheme [9], (d) The proposed retrieval method

In order to further confirm the validity of the proposed algorithm, we randomly selected 500 images as query images from the above image database (the 10 tested semantic classes are bus, horse, flower, dinosaur, building, elephant, people, beach, scenery, and dish). Fifty images are drawn from each class, and each query returns the 20 most similar images as the retrieval results. For each class, the average normalized precision and average normalized recall over 10 queries are calculated; these values are taken as the retrieval performance measures of the algorithm, as shown in Fig. 19.

Fig. 19

The average retrieval performance of four schemes (COREL dataset): (a) The average normalized precision, (b) The average normalized recall

We also compared the proposed method with some state-of-the-art image retrieval approaches [9, 29] on another two publicly available retrieval datasets, NUS-WIDE and Oxford Buildings. NUS-WIDE is a large-scale web image dataset collected from Flickr as a benchmark for evaluating multimedia search techniques; it contains 269648 images and their ground-truth annotations for 81 concepts. Oxford Buildings contains 5062 high-resolution images (1024 × 768) showing either one of 11 Oxford landmarks or other places in Oxford. The dataset includes 5 queries for each landmark (55 queries in total), each with a bounding box locating the object of interest. Figure 20 presents the average retrieval performance on NUS-WIDE and Oxford Buildings.

Fig. 20

The average retrieval performance on NUS-WIDE and Oxford Buildings: (a) The NUS-WIDE dataset, (b) The Oxford Buildings dataset

From Figs. 16, 17, 18, 19 and 20, we see that the image retrieval accuracy of the proposed method is competitive with the other tested methods. The effectiveness of the proposed method results from the following: (1) the Exponent moments descriptor (EMD) is adopted to depict image shape and has many desirable properties, such as expression efficiency, robustness to noise, geometric invariance, and fast computation; (2) the localized angular phase histogram (LAPH) is used to describe texture information and is robust to illumination changes, scaling, and image blurring.

In order to further improve the retrieval performance, relevance feedback can be added to this scheme. Image retrieval with relevance feedback has four main components: query, retrieval, labeling, and learning. When a query is submitted, its low-level visual features (EMD and LAPH) are extracted. Then, all images in the database are sorted according to a similarity metric. If the user is satisfied with the result, the retrieval process ends. Otherwise, the user can label some images as positive feedback and/or some images as negative feedback. Using this feedback, the system is trained by machine learning with the embedded relevance feedback algorithm, and all images are re-sorted based on the recalculated similarity metric. If the user is still not satisfied with the result, the process repeats. A minimal sketch of this loop is given below.
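The loop just described can be summarized by the following skeleton (our sketch; the `distance`, `learn`, and `get_feedback` callables are placeholders, since the paper does not commit to a concrete feedback algorithm):

```python
def retrieve_with_feedback(query_feats, db_feats, distance, learn, get_feedback,
                           max_rounds=5):
    """Query-retrieval-labeling-learning loop with relevance feedback."""
    params = None
    ranking = []
    for _ in range(max_rounds):
        # rank the whole database by the (possibly re-weighted) distance
        ranking = sorted(range(len(db_feats)),
                         key=lambda i: distance(query_feats, db_feats[i], params))
        positives, negatives = get_feedback(ranking)   # user labels results
        if not positives and not negatives:            # user satisfied: stop
            break
        params = learn(params, positives, negatives)   # learning step
    return ranking
```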

7 Conclusion

CBIR has drawn substantial research attention in the last decade. CBIR systems usually index images by low-level visual features which, though they cannot completely characterize semantic content, are easier to integrate into mathematical formulations. In this paper, we have proposed a content-based image retrieval approach using the Exponent moments descriptor and the localized angular phase histogram. Experimental results showed that the proposed method yields higher retrieval accuracy than the other conventional methods without a larger feature vector dimension. In addition, the proposed method almost always showed performance gains in average normalized precision and average normalized recall over the other methods. In future work, the proposed retrieval method will be evaluated on more diverse DBs and applied to video retrieval.