1 Introduction

In the big data era, content-based image retrieval (CBIR) has become a hot research topic because of its many practical applications, including multimedia management and analysis [5, 40], biomedical image retrieval [43], dermatological lesion analysis [21], intelligent video surveillance [15, 18, 19, 49], traffic congestion analysis [48, 51], video captioning [10], and so on.

Nowadays, with the progress of camera sensors, multimedia technology, and the internet, the number of digital images is increasing rapidly [55]. These vast amounts of big image data have important applications in both military and civil domains. Therefore, how to effectively and efficiently manage and retrieve images from the internet has become a hot research issue. Traditional image search by text (query by text, QBT) mainly depends on manually labeled images. However, QBT has obvious shortcomings: different people have different understandings of the same image content, and manual labeling is impractical for the great variety of images on the internet. As a result, the matches produced by such search engines often do not correspond to what users are really seeking. In the 1990s, content-based image retrieval (CBIR) was first proposed and investigated. CBIR uses the image content itself, such as color, shape, and texture features, instead of manual labels, as the search index [17, 25]. A CBIR system extracts the query image's features (query by example) and retrieves matches from the internet or an image database.

In general, one of the main steps in a CBIR system is to effectively extract feature vectors that depict the relevant visual contents of images [27]. The feature vectors of the given input example (the query image) and the candidate images are then compared with a mathematical similarity function, and the images with the closest feature vectors are returned. Nevertheless, traditional CBIR methods using holistic/global feature vectors normally cannot represent all the characteristics of an image, because a typical image also involves local features associated with objects or meaningful regions of interest. Clearly, local features are crucial for depicting an image in a CBIR system [17, 26, 53]. To bridge the semantic gap, a generic semantic image/video indexing method based on three levels of analysis of the spatial information in visual content was proposed in [6]. Recently, Xu designed a multiple-instance learning (MIL) based decision neural network that attempts to learn the concept from the training bags for CBIR [52]. Interested readers can refer to the survey paper on semantic content-based image retrieval [1].

In addition, research in human psychology has shown that people commonly pay close attention to the visually meaningful regions of interest in an image [20, 30]. Thus, in a CBIR system, users want to retrieve images containing regions similar to those in the query example, while neglecting unrelated areas (such as irrelevant texture regions or the image background). A straightforward way to address this is to use image segmentation algorithms to separate the user's regions or objects of interest from the background. A large number of segmentation-based algorithms [2, 8, 9, 23, 44] have been introduced to discuss local feature representation for CBIR. The obvious drawback of segmentation-based CBIR systems is that, when the segmentation result is not precise enough, retrieval performance deteriorates seriously. What is more, when segmentation fails during the search process, retrieval performance becomes even worse [25]. In short, precise and general image segmentation is still an open and hard problem in computer vision and pattern recognition. Consequently, as Pavlidis suggested in [31], it is difficult to pursue general CBIR when we do not have robust segmentation methods that can identify objects in an image efficiently.

In order to exploit saliency regions for CBIR, different saliency detection models have been investigated. In [46], Wang et al. proposed a supervised Markov-random-field framework based on a differential threshold for saliency detection. Based on multiple-instance learning, a saliency detection model integrating low-, mid-, and high-level features in the detection procedure was proposed in [47]. Meanwhile, for scalable image and video retrieval, a general quantization-based hashing framework, which preserves the properties of the original space and reduces the quantization error, was proposed in [38]. To tackle the bottleneck of matching speed on large-scale datasets, Song et al. [39] designed an efficient Deep Region Hashing (DRH) method using an image patch as the query. Interested investigators can refer to the comprehensive survey paper on learning to hash [50].

Psychological investigations have observed that a local region with high global variation normally attracts human visual attention, and this observation plays a vital role in designing effective CBIR systems. In this paper, we develop a novel hierarchical-local-feature extraction approach, in a biologically and psychologically plausible way, for CBIR, which avoids complicated and time-consuming image segmentation. The main advantages and contributions of this paper are as follows:

  (1) Since directionality information is one of the important, intrinsic cues used by the human visual system (HVS) to perceive and analyze images, a perception-based directional-patch extraction method is proposed to describe the local features of an image for CBIR, so that exclusive reliance on global image features can be avoided. Using wavelet decomposition, the directional patches in an image can be detected and extracted from the corresponding vertical, horizontal, and diagonal wavelet sub-bands.

  (2) Simultaneously, visually salient patches always catch human visual attention and interest [30]. Therefore, an improved wavelet-transform salient-patch detector based on Chauvenet's criterion is designed. The extracted patches can be used as local properties to characterize an image, so that global image features and image segmentation can again be excluded.

  (3) Both proposed local feature extraction methods, which aim to capture an image's local characteristics, have an implicit merit: since psychological research has shown that the HVS applies multi-scale frequency analysis when analyzing an image [20], the wavelet-decomposition process closely resembles the zooming-in and zooming-out mechanism of the HVS.

  (4) A clustering algorithm is performed to divide the extracted directional and salient patches in parallel into different regions according to their spatial distribution. The clustering scheme in the proposed CBIR system divides the salient patches into meaningful subclasses, which are further used as hierarchical local features during the image-retrieval process.

  (5) A hierarchical-local-feature extraction scheme comprising three levels is devised to represent an image, which is particularly suitable for indexing and managing big image data on the network. In the proposed approach, both global and local features are taken into consideration simultaneously, whereas time-consuming and complex image segmentation is avoided.

The rest of the paper is organized as follows. Section 2 introduces the proposed hierarchical-local-feature extraction scheme and presents in detail the clustering algorithm based on the spatial distribution of the extracted patches. Experimental results and comparisons are given in Section 3. Finally, Section 4 concludes the paper.

2 Hierarchical-local-feature extraction scheme

The framework of the proposed novel scheme based on a three-level hierarchical structure for content-based image retrieval is illustrated in Fig. 1.

Fig. 1

The proposed framework for CBIR based on a three-level hierarchical structure. The circles of different colors represent the different meaningful regions that describe the local image properties; they are clustered into meaningful subclasses based on the directional and visually salient patches of the input image. Finally, these meaningful regions (the leaf nodes of the TREE hierarchical data structure) are also used to construct local feature vectors describing the separate parts of the image for CBIR

In the first level, global features, i.e., color moments [11] and Gabor filter features [13, 36, 54] of the whole image, are used to form a global/holistic feature vector to index the image. Next, in the second level, directional- and salient-patch detectors are applied to extract the perception-based directional patches and the visually salient patches of the image in parallel. In the third level, a clustering algorithm divides the two types of patches (i.e., directional and salient patches) into different meaningful regions/subclasses that characterize the local image properties. Finally, these meaningful regions, namely the leaf nodes of the TREE hierarchical data structure, are also used to construct local feature vectors describing the separate parts of the image for CBIR.
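For concreteness, the sketch below shows one possible in-memory layout of this three-level tree index; all class and field names are our own illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (assumed names) of the three-level TREE index:
# root = whole image, mid level = patches, leaves = clustered regions.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class PatchNode:             # second level: one directional or salient patch
    kind: str                # "directional" or "salient"
    feature: np.ndarray      # color moments + Gabor features of the patch

@dataclass
class RegionNode:            # third level: one clustered meaningful region (leaf)
    patches: List[PatchNode]
    feature: np.ndarray      # color moments + Gabor features of the region

@dataclass
class ImageNode:             # first level: the whole image (root)
    global_feature: np.ndarray                         # holistic feature vector
    regions: List[RegionNode] = field(default_factory=list)
```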

The following sections will introduce the details of the proposed hierarchical-local-feature extraction structure.

2.1 Perception-based directional patch extraction

Throughout human history, the sense of direction perception has developed alongside society, giving rise to the concept of directionality/orientation. The directionality/orientation cue in an image is therefore a solid and credible feature, and this intrinsic cue is used by the human visual system (HVS) to perceive and analyze images. Consequently, the directionality/orientation feature is very beneficial for computer-vision applications such as saliency detection [20] and image retrieval [16].

In this subsection, we use a wavelet-based algorithm to extract directional patches. The wavelet transform, a multi-resolution analysis (MRA) technique, can represent image variations at different resolutions and scales [4, 28]. In wavelet theory, a wavelet is an attenuated, oscillating function, and the two-dimensional discrete wavelet transform (DWT) of an image f(x, y) can be written as [4, 28]:

$$ {\displaystyle \begin{array}{l}{W}_{\varphi}\left({j}_0,m,n\right)=\frac{1}{\sqrt{MN}}\sum \limits_{x=0}^{M-1}\sum \limits_{y=0}^{N-1}f\left(x,y\right){\varphi}_{{j}_0,m,n}\left(x,y\right)\kern0.5em \mathrm{and}\\ {}{W}_{\psi}^{i}\left(j,m,n\right)=\frac{1}{\sqrt{MN}}\sum \limits_{x=0}^{M-1}\sum \limits_{y=0}^{N-1}f\left(x,y\right){\psi}_{j,m,n}^{i}\left(x,y\right),\kern0.5em j\ge {j}_0,\end{array}} $$
(1)

where M × N denotes the size of f(x, y); j0 is an arbitrary initial scale; φ is the scaling function and ψ is the wavelet function; and i ∈ {H, V, D} indexes the horizontal, vertical, and diagonal high-frequency detail sub-bands, respectively.
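As a quick illustration of Eq. (1), the following minimal sketch computes a one-level 2-D DWT with the PyWavelets library; the library choice and the placeholder image are our assumptions, since the paper does not specify an implementation.

```python
import numpy as np
import pywt

f = np.random.rand(256, 256)              # placeholder image f(x, y)
cA, (cH, cV, cD) = pywt.dwt2(f, 'db4')    # approximation plus H, V, D detail sub-bands
print(cA.shape, cH.shape, cV.shape, cD.shape)
```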

Since the {H, V, D} sub-images of the wavelet transform cover high-frequency information in three different directions, we first calculate the vertical, horizontal, and diagonal directionality maps from the three {H, V, D} sub-images. These {H, V, D} directionality maps are then re-scaled and normalized so that they can be integrated into a compound directionality map for detecting the perceptually meaningful directional patches.

The horizontal directionality is computed from the HL sub-band (which contains the vertical high-frequency information) of the wavelet transform. Suppose h(x, y) denotes the sub-band coefficients, where x and y are the row and column coordinates of the sub-band, respectively. For every row of the HL sub-band image, the horizontal directionality is expressed in the following form [20]:

$$ {Dir}_H\left(x,y\right)=\frac{1}{2W}\mid h\left(x,y\right)\mid \left\{\sum \limits_{r=-W}^W|h\left(x+r,y\right)|\right\}, $$
(2)

where W represents the width of the neighborhood of adjacent coefficients, and in practice it can be set to p (where p is the regularity of an orthogonal wavelet with compact support).

Analogous to the computation of the horizontal directionality, the vertical directionality is defined along each column:

$$ {Dir}_V\left(x,y\right)=\frac{1}{2W}\mid h\left(x,y\right)\mid \left\{\sum \limits_{r=-W}^W|h\left(x,y+r\right)|\right\}. $$
(3)

Compared with the horizontal and vertical directionality, the calculation of the diagonal directionality is more complicated. The two perpendicular diagonal directions, namely π/4 and 3π/4, are considered together. The calculation process resembles that of the horizontal and vertical directionality. The diagonal directionality is then defined as the average over the two perpendicular diagonal directions π/4 and 3π/4:

$$ {Dir}_D\left(x,y\right)=\frac{1}{2}\left({Dir}_{D1}\left(x,y\right)+{Dir}_{D2}\left(x,y\right)\right), $$
(4)

where DirD1 and DirD2 denote the diagonal directionalities along π/4 and 3π/4, respectively.

To fuse the vertical, horizontal, and diagonal directionality maps, the three maps are re-scaled to the size of the original image, and then normalized and merged to construct a compound orientation/directionality map [20]. The patches with the K (e.g., K = 60) largest values are then extracted from the orientation map. Figure 2b shows some detected perception-based directional patches with the maximum values extracted from the compound directionality map. As illustrated in Fig. 2b, the proposed method can efficiently detect the directionally salient patches in the image, which can serve as a typical local feature representing the local properties of the image for CBIR.
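A rough sketch of this directional-patch pipeline (Eqs. (2)-(4) plus the fusion step) might look as follows; the window handling, the sub-band-to-direction mapping, and the patch-selection details are our simplifying assumptions rather than the authors' exact procedure.

```python
import numpy as np
import pywt
from scipy.ndimage import uniform_filter1d, zoom

def win_sum(a, size, axis):
    # sliding-window sum of |coefficients| along one axis
    return uniform_filter1d(np.abs(a), size=size, axis=axis, mode='nearest') * size

def diag_sum(a, W, sign):
    # windowed sum of |coefficients| along one diagonal (pi/4 or 3*pi/4)
    s = np.zeros_like(a)
    for r in range(-W, W + 1):
        s += np.abs(np.roll(a, shift=(r, sign * r), axis=(0, 1)))
    return s

def directionality_map(image, wavelet='db4', W=4):
    _, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    dir_h = np.abs(cH) * win_sum(cH, 2 * W + 1, axis=0) / (2 * W)   # Eq. (2)
    dir_v = np.abs(cV) * win_sum(cV, 2 * W + 1, axis=1) / (2 * W)   # Eq. (3)
    dir_d = np.abs(cD) * (diag_sum(cD, W, +1) + diag_sum(cD, W, -1)) / (4 * W)  # Eq. (4)
    fused = np.zeros(image.shape)
    for d in (dir_h, dir_v, dir_d):
        d = zoom(d, (image.shape[0] / d.shape[0], image.shape[1] / d.shape[1]))
        fused += (d - d.min()) / (d.max() - d.min() + 1e-12)  # rescale and normalize
    return fused                                              # compound map

def top_k_locations(dmap, K=60):
    idx = np.argsort(dmap.ravel())[::-1][:K]                  # K largest values
    return np.column_stack(np.unravel_index(idx, dmap.shape))
```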

Fig. 2

Extraction of directional and salient patches: (a) an input image, (b) some extracted directional patches, and (c) some extracted salient patches, both obtained using the Db4 wavelet

2.2 Improved salient-patch extraction

According to wavelet decomposition theory, at a coarse scale a local absolute maximum of the wavelet-transform coefficients corresponds to a region with high global variation and reflects a singularity point of the signal. Consequently, a salient point detector based on the wavelet transform, which can express image variations at different resolutions, is devised to detect the visually salient points [22–25].

An orthogonal, compactly supported wavelet filter produces a non-redundant decomposition of a signal. For compactly supported wavelets, each wavelet coefficient \( {W}_{2^j}f(n) \) at the scale \( 2^j \) is computed from \( 2^{-j}p \) signal points, where p is the wavelet regularity. Accordingly, we can further examine the wavelet coefficients at the finer scale \( 2^{j+1} \): for each coefficient at the scale \( 2^j \), there is a set of coefficients at the scale \( 2^{j+1} \) computed from the same signal points. These coefficients \( C\left({W}_{2^j}f(n)\right) \) are called the children of the coefficient \( {W}_{2^j}f(n) \), and the relationship between them is expressed as follows:

$$ C\left({W}_{2^j}f(n)\right)=\left\{{W}_{2^{j+1}}f(k),2n\le k\le 2n+2p-1\right\}, $$
(5)

where \( 0\le n\le {2}^{j}N \) and N is the length of the signal.

In signal processing, the variation of a signal point is reflected by its children coefficients \( C\left({W}_{2^j}f(n)\right) \). Therefore, salient points in an image can be detected and extracted by finding a representative point for each global variation through examining coefficients at different scales. Classic methods employ the following quantity, named the saliency value, to extract the salient points [33–35, 42]:

$$ saliency=\sum \limits_{k=1}^{-j}\left|{C}^{(k)}\left({W}_{2^j}f(n)\right)\right|,\kern1em 0\le n\le {2}^{j}N,\kern0.5em -{\log}_2N\le j\le -1. $$
(6)
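To make the children-tracking idea of Eqs. (5)-(6) concrete, here is a minimal 1-D sketch in Python. It simplifies Eq. (5) by giving each coefficient only the two children {2n, 2n+1} (ignoring the extra 2p − 2 coefficients from the filter support), so it illustrates the recursion rather than reimplementing the cited detectors faithfully.

```python
import numpy as np
import pywt

def saliency_1d(signal, wavelet='db4', levels=3):
    """For each coarsest-scale detail coefficient, follow its maximal child
    down to the finest scale, accumulating absolute values (cf. Eqs. (5)-(6))."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)  # [cA_L, cD_L, ..., cD_1]
    details = coeffs[1:]                                  # coarsest -> finest
    sal = np.abs(details[0]).copy()                       # start at coarsest scale
    pos = np.arange(details[0].size)
    for d in details[1:]:
        kids = np.stack([np.abs(d[np.minimum(2 * pos, d.size - 1)]),
                         np.abs(d[np.minimum(2 * pos + 1, d.size - 1)])])
        best = kids.argmax(axis=0)                        # follow the larger child
        sal += kids[best, np.arange(pos.size)]
        pos = np.minimum(2 * pos + best, d.size - 1)
    return sal, pos   # saliency per coarse coefficient, finest-scale position
```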

However, the largest absolute values of the wavelet coefficients at different scales have different means and ranges [14, 20].

Based on this observation, Jian and Dong proposed a revised salient point extraction algorithm in [14], which uses a weight w(k) (the reciprocal of the standard deviation) to normalize the maximum wavelet coefficients and produce more accurate extraction results. Details can be found in [14].

2.2.1 Improved salient-point extraction

In this paper, for the purpose of detecting salient points more precisely, an improved salient point detector is first proposed. In statistical theory, Chauvenet's criterion assesses whether one piece of experimental data from a set of observations is likely to be spurious [41]. Another method for eliminating spurious data, the "3σ" criterion, was introduced some years before Chauvenet's criterion. However, Chauvenet's criterion is a more rigorous method for the rational deletion of noise points and outliers. In practice, Chauvenet's criterion offers a quantitative and objective approach to data rejection, particularly for small sets or where a Gaussian distribution cannot be assumed.

To employ Chauvenet's criterion, we first compute the mean \( \overline{x} \) and standard deviation Sx of the experimental data. Let n be the total number of wavelet coefficients in the sub-band; then Chauvenet's criterion can be simplified to:

$$ {\omega}_n=1+0.4\ln (n); $$
(7)

Let xmax and xmin be the maximum and minimum values of the sub-band's wavelet coefficients that satisfy \( \mid x\left(j,k\right)-\overline{x}\mid <{\omega}_n\ast {S}_x \). The following function (8) normalizes the coefficients satisfying \( \mid x\left(j,k\right)-\overline{x}\mid <{\omega}_n\ast {S}_x \) into [0, 1]:

$$ {x}^{\prime}\left(j,k\right)=\frac{x\left(j,k\right)-{x}_{\mathrm{min}}}{x_{\mathrm{max}}-{x}_{\mathrm{min}}}. $$
(8)

Then, if \( \left(x\left(j,k\right)-\overline{x}\right)\ge {\omega}_n\ast {S}_x \), we set x′(j, k) = 1; if \( \left(\overline{x}-x\left(j,k\right)\right)\ge {\omega}_n\ast {S}_x \), we set x′(j, k) = 0.

In order to weight the sub-bands of the wavelet decomposition at different scales and orientations equally, the wavelet coefficients are processed by Chauvenet's criterion. In this paper, this process of transforming the values into the range [0, 1] is called wavelet-coefficient normalization.
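The normalization just described can be sketched as follows; this is a minimal NumPy version under our reading of Eqs. (7)-(8), with illustrative variable names.

```python
import numpy as np

def chauvenet_normalize(x):
    """Normalize one sub-band's wavelet coefficients into [0, 1],
    saturating the outliers flagged by Chauvenet's criterion."""
    x = np.asarray(x, dtype=float)
    w_n = 1.0 + 0.4 * np.log(x.size)                # Eq. (7)
    mean, std = x.mean(), x.std()
    inlier = np.abs(x - mean) < w_n * std
    x_min, x_max = x[inlier].min(), x[inlier].max()
    out = (x - x_min) / (x_max - x_min + 1e-12)     # Eq. (8) on the inliers
    out[x - mean >= w_n * std] = 1.0                # high outliers -> 1
    out[mean - x >= w_n * std] = 0.0                # low outliers  -> 0
    return np.clip(out, 0.0, 1.0)
```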

The improved saliency value based on the normalized wavelet coefficients can then be expressed as:

$$ Imp\_ saliency=\sum \limits_{k=1}^{-j}\left|{C}^{(k)}\left({W}_{2^j}^{\prime }f(n)\right)\right|,\kern1em 0\le n\le {2}^{j}N,\kern0.5em -{\log}_2N\le j\le -1, $$
(9)

where the normalized wavelet coefficients in the sub-bands of the three orientations and different scales are accumulated and weighted equally.

2.2.2 Extending improved salient-point detection to salient-patch extraction

Because human visual ability and visual resolution are limited, people are, in reality, normally interested in the visually meaningful salient patches of an image [14, 20] rather than in isolated salient points. With the aim of reflecting the local characteristics of an image using salient patches, a salient-patch detector is developed to detect the visually meaningful patches in an image and then represent the image's local characteristics for the CBIR system. Therefore, we propose a method to extend the improved salient-point detection to salient-patch extraction for CBIR.

In detail, we define a set of pixels as the s-patch of a pixel p. By considering the wavelet coefficient at one scale together with its s-patch, we can further observe the s-patch at the finer scale; in other words, there is a set of coefficients at the finer scale computed from the same points of the s-patch. These coefficients are named the children of the s-patch. The children coefficients reflect the variations of the s-patch, and the most salient patches are those whose wavelet coefficients sum to the largest absolute values across the different scales. Accordingly, salient patches can be extracted by detecting this maximum and examining its largest child patches. The saliency at position (x, y) of an image is calculated by summing the saliency values inside Rect(x, y). The whole saliency value of a patch is the summation of the normalized saliency values calculated from the three different sub-images. Finally, we detect the visually salient patches using these improved saliency values.
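As one way to realize this patch extension, the sketch below scores every fixed-size window Rect(x, y) of an improved-saliency map by the sum of values inside it (via an integral image), then greedily keeps the highest-scoring non-overlapping windows; the window size, K, and the non-overlap rule are our assumptions.

```python
import numpy as np

def salient_patches(sal_map, h=16, w=16, K=60):
    """Return the top-left corners of the K most salient h-by-w windows."""
    ii = np.pad(sal_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # integral image
    H, W = sal_map.shape
    score = (ii[h:H + 1, w:W + 1] - ii[:H - h + 1, w:W + 1]
             - ii[h:H + 1, :W - w + 1] + ii[:H - h + 1, :W - w + 1])
    patches, taken = [], np.zeros_like(score, dtype=bool)
    for idx in np.argsort(score.ravel())[::-1]:                 # best first
        y, x = np.unravel_index(idx, score.shape)
        # reject windows overlapping an already selected one
        if not taken[max(0, y - h + 1):y + h, max(0, x - w + 1):x + w].any():
            patches.append((y, x))
            taken[y, x] = True
            if len(patches) == K:
                break
    return patches
```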

Figure 2c displays some visually salient patches detected by the improved salient-patch detection algorithm based on the Db4 wavelet. As illustrated in Fig. 2c, the developed salient-patch detector can extract the salient patches accurately, and they can be used as a local feature depicting the image's local characteristics in the proposed CBIR system.

2.3 Clustering using the spatial distribution of the extracted patches

In this subsection, we use a clustering algorithm to divide the two types of patches (i.e., directional and salient patches) into different meaningful regions that represent the bottom-level local image characteristics for CBIR.

As a popular and widely used clustering method, the DBSCAN algorithm performs excellently in spatial data clustering. DBSCAN was first investigated by Ester et al. [7, 32] and relies on a density-based notion of clusters. It is especially suited to large data with noise. With only one input parameter, DBSCAN can cluster the data into different meaningful subclasses. For details of the DBSCAN algorithm, please refer to [7, 32].

In the proposed method, the DBSCAN algorithm is applied to identify clusters of salient patches, and the clusters are then used to automatically divide the salient patches into meaningful salient regions. The basic idea of DBSCAN is that, for every point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points; that is, the density in the neighborhood has to exceed a predefined threshold. Thereby, the DBSCAN algorithm requires an input parameter, the neighborhood radius, to divide the salient patches into meaningful subclasses in our CBIR system. In this paper, a simple algorithm is developed to determine the neighborhood radius threshold Eps adaptively:

Assume K salient patches are extracted in an image, and for each salient patch, one center point (xc, yc) represents the spatial location of that patch. Let Cen = (x’, y’) be the centroid of the center points of the salient patches, where x’ and y’ are expressed as follows:

$$ {x}^{\prime }=\frac{1}{K}\sum {x}_c; $$
(10)
$$ {y}^{\prime }=\frac{1}{K}\sum {y}_c. $$
(11)

Define the centroid Cen = (x’, y’) as the center of the detected salient patches. Let R be the maximum radius of a circle centered at Cen; it is defined as follows:

$$ R=\max {\left[{\left({x}_c-{x}^{\prime}\right)}^2+{\left({y}_c-{y}^{\prime}\right)}^2\right]}^{\frac{1}{2}}. $$
(12)

After that, the neighborhood radius threshold can be determined:

$$ \mathrm{Mean}\_R=\pi R\times R/K; $$
(13)
$$ Eps={\alpha}^{\ast}\mathrm{Mean}\_R; $$
(14)

The parameter α can be set in the range [0, 1]. After the threshold Eps is obtained, we can localize the salient regions and efficiently cluster the salient patches into meaningful subclasses, namely salient regions, according to the patches' spatial distribution, representing the bottom-level local image characteristics for CBIR.
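Putting Eqs. (10)-(14) together with DBSCAN, a compact sketch could read as follows; min_samples is an assumed DBSCAN parameter that the text does not specify, and note that Eq. (13) yields a mean area per patch, which Eq. (14) scales by α to obtain Eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_patches(centers, alpha=0.02, min_samples=3):
    """Derive Eps from the spatial spread of the K patch centres
    (Eqs. (10)-(14)), then cluster the centres with DBSCAN."""
    centers = np.asarray(centers, dtype=float)      # K rows of (x_c, y_c)
    K = len(centers)
    cen = centers.mean(axis=0)                      # Eqs. (10)-(11): centroid
    R = np.sqrt(((centers - cen) ** 2).sum(axis=1)).max()  # Eq. (12)
    mean_R = np.pi * R * R / K                      # Eq. (13)
    eps = alpha * mean_R                            # Eq. (14)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    return labels, eps                              # label -1 marks noise patches
```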

To sum up, the proposed scheme in our CBIR system is a three-level hierarchical structure, as shown in Fig. 1. In the first level, the color moments [11] and Gabor filter features [13] of each whole image in the database are used to construct a global/holistic feature vector describing the image. In the second level, the directional- and salient-patch detectors are run to detect the perception-based directional patches and the visually salient patches of each image, respectively. After the two types of patches (i.e., directional and salient patches) are extracted, the color moments and Gabor texture features of the individual patches are used to build feature vectors indexing the different patches of the image, called mid-level local image features, to reflect the local image properties (as shown in the second level of Fig. 1). In the third level, the DBSCAN clustering algorithm clusters the two types of patches into different meaningful regions according to their spatial distribution, representing the bottom-level local image characteristics. The color moments and Gabor filter features of the resulting meaningful regions are finally used to construct local feature vectors, which index and represent the different parts of the image for CBIR. Since a bank of filters at different scales and orientations can effectively extract texture frequency and orientation information, the texture feature vector is computed by filtering the image with Gabor filters at 6 orientations and 4 scales and computing the mean and standard deviation of the output in the frequency domain. Figure 3 shows some typical results of the individual steps in the proposed three-level hierarchical structure for CBIR.
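A simplified version of the per-region feature vector (nine color moments plus mean/std of 4 × 6 Gabor responses) is sketched below; the Gabor frequencies are assumed values, and for brevity the statistics are taken over the spatial-domain filter responses rather than in the frequency domain as described above.

```python
import numpy as np
from skimage.filters import gabor

def region_feature(rgb):
    """57-D sketch: 9 color moments (mean, std, skewness per RGB channel)
    plus mean and std of Gabor magnitude at 4 scales x 6 orientations."""
    feats = []
    for c in range(3):
        ch = rgb[..., c].astype(float).ravel()
        mu, sd = ch.mean(), ch.std()
        feats += [mu, sd, np.cbrt(((ch - mu) ** 3).mean())]  # color moments
    gray = rgb.astype(float).mean(axis=2)
    for freq in (0.05, 0.1, 0.2, 0.4):                       # 4 scales (assumed)
        for k in range(6):                                   # 6 orientations
            real, imag = gabor(gray, frequency=freq, theta=k * np.pi / 6)
            mag = np.hypot(real, imag)                       # response magnitude
            feats += [mag.mean(), mag.std()]
    return np.asarray(feats)                                 # 9 + 48 = 57 values
```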

Fig. 3

Some typical results of the separate steps in the proposed three-level hierarchical structure

3 Experimental results and analysis

To verify the effectiveness of the proposed scheme, a large number of retrieval experiments were conducted to evaluate the performance of our CBIR scheme. In the retrieval experiments, sixty directional and salient patches are extracted in parallel; directional-patch detection is based on a 1-level wavelet decomposition, while the salient patches are extracted with a 3-level wavelet decomposition, both using the Db4 wavelet (an overlapping wavelet with compact support, whose regularity p is 4). The DBSCAN clustering algorithm divides the salient patches into different salient regions/groups, and the value of α is empirically set to α = 0.02. After the input parameter α is set, the neighborhood radius threshold Eps can be computed adaptively according to Eq. (14); the density-based clustering algorithm then divides the directional and salient patches into different meaningful subclasses, as shown at the bottom level of Fig. 1, which are used to form bottom-level local feature vectors describing the different parts of the image for CBIR. Experimental results show that this overall parameter setting effectively clusters both the directional and salient patches into different meaningful regions reflecting the local image characteristics. In addition, the proposed approach can efficiently avoid the over-segmentation problem that usually emerges in traditional segmentation-based CBIR algorithms.

In the retrieval stage, a user chooses a query example/image, or selects an object or region of interest in an image, through a graphical user interface (GUI). The wavelet transform with the Db4 wavelet is then applied to the query image, and the hierarchical-local-feature extraction scheme is run. As illustrated in Fig. 3c and e, the clustering algorithm divides the extracted directional and salient patches in parallel into different regions, namely different meaningful subclasses. Regarding the region specified in the query image through the GUI, when there are multiple salient regions in the query image, the user can choose a typical one or the whole image as input. The images whose feature vectors are closest to the feature vector of the specified region/image in the index space are then returned. In our CBIR system, the hierarchical-local-feature extraction scheme is realized by a three-level tree structure (the TREE hierarchical data structure, in which a node with T child nodes represents an image divided into T visually meaningful regions according to their spatial distribution by the DBSCAN clustering algorithm), and the root node carries the global/holistic feature vector of the whole image. The main advantage of the developed hierarchical-local-feature extraction scheme is that it takes the global and hierarchical local features into consideration simultaneously, which makes it suitable and very useful for indexing big image data on the network. In the similarity-computation stage, the CBIR system outputs a specified number of images whose indices are closest to the query feature vector, i.e., images containing the user's salient regions/objects of interest, based on the weighted Euclidean distance [17]. The returned images are sorted so that the candidate with the salient region most similar to the user's region of interest appears first. Even if an image contains several regions similar to the query key (feature vector), the retrieved image is displayed only once in the retrieval interface (Fig. 4).
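The similarity computation itself reduces to a few lines; a sketch with an assumed weight vector (the actual weighting follows [17]):

```python
import numpy as np

def weighted_euclidean(q, x, w):
    """Weighted Euclidean distance between query and candidate feature vectors."""
    return float(np.sqrt(np.sum(w * (np.asarray(q) - np.asarray(x)) ** 2)))

# rank candidate indices in the database by distance to the query vector q
# order = sorted(range(len(db)), key=lambda i: weighted_euclidean(q, db[i], w))
```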

Fig. 4

Some examples used in the retrieval experiments

We perform extensive experiments to evaluate the performance of the proposed approach. A hybrid, large image dataset was assembled from the Corel image database [3], the McGill Calibrated Color Image Database [29], the SIMPLIcity image database [37], the Ground Truth Database [12], and some images collected from the internet by our research group. The dataset comprises more than 16,000 images in different categories. For the CBIR experiments, the test set is randomly selected and includes 80 categories with 4000 images in total. Figure 4 shows some examples, with various regions/objects in an image, used in our retrieval experiments.

For comparison, we evaluate our developed method against seven state-of-the-art CBIR methods. The seven methods are:

  (1) a classic CBIR method using wavelet-based salient points [42],

  (2) three region-based image retrieval methods [9, 17, 44],

  (3) a method using the color difference histogram, proposed by Liu and Yang [24],

  (4) a fusion framework for color image retrieval, proposed by Walia et al. [45], and

  (5) a method using the Quaternionic Local Ranking Binary Pattern, proposed by Lan et al. [22].

To objectively compare the developed CBIR system with the other seven state-of-the-art CBIR approaches, precision and recall, the two most widely used evaluation measures in CBIR, are calculated to assess the different schemes. Precision is defined as the fraction of retrieved images that are relevant among all returned images, while recall is the ratio of retrieved relevant images to the total number of relevant images:

$$ \mathrm{Precision}=\frac{\mathrm{the}\ \mathrm{number}\ \mathrm{of}\ \mathrm{relevant}\ \mathrm{images}\ \mathrm{retrieved}}{\mathrm{the}\ \mathrm{number}\ \mathrm{of}\ \mathrm{all}\ \mathrm{returned}\ \mathrm{images}}, $$
(15)

and

$$ \mathrm{Recall}=\frac{\mathrm{the}\ \mathrm{number}\ \mathrm{of}\ \mathrm{relevant}\ \mathrm{images}\ \mathrm{retrieved}}{\mathrm{the}\ \mathrm{number}\ \mathrm{of}\ \mathrm{total}\ \mathrm{relevant}\ \mathrm{images}}. $$
(16)
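In code, the two measures of Eqs. (15)-(16) reduce to a few lines; this is a sketch over collections of image IDs with hypothetical argument names.

```python
def precision_recall(retrieved, relevant):
    """Precision over the returned list, recall over the full relevant set."""
    relevant = set(relevant)
    hits = sum(1 for img in retrieved if img in relevant)
    return hits / len(retrieved), hits / len(relevant)

# e.g. P(n): precision over the first n returned images
# p_at_n, _ = precision_recall(returned[:n], relevant)
```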

First, the retrieval performance is evaluated in terms of the average precision P(n), defined as a function of the number of correctly retrieved images among the first n returned images (Eq. (15)). In our experiments, the precision P(n) for the first n (n = 10, 30, 50, and 70) retrieval results is recorded for the seven state-of-the-art methods and our method. The comparison of the proposed method with the other seven approaches in terms of average precision is tabulated in Table 1. From Table 1, it can be seen that our method achieves a higher average precision than the other seven state-of-the-art CBIR methods; the average retrieval results demonstrate the effectiveness of the developed CBIR system.

Table 1 Average precision P(n) of eight different methods

Because the two evaluation measures, precision and recall, are mutually constrained by definition [17, 25], the precision-recall curve is also used to assess the overall image-retrieval performance. In practice, the average precision-recall curve (PR curve) is produced by varying the number of returned images n. Figure 5 shows the retrieval performance of the eight different methods according to the average precision-recall curve. As can be seen from Fig. 5, our proposed three-level hierarchical structure for content-based image retrieval clearly achieves better overall performance than the other state-of-the-art CBIR algorithms. This is because, when an input image contains local features, for instance several meaningful regions or various objects, CBIR systems using only global/holistic features degrade significantly [22, 45]. Our proposed hierarchical-local-feature extraction scheme, which comprises three levels and considers both global and local features simultaneously, uses the local characteristics of images to remove some of the influence of irrelevant texture areas or image background (see Fig. 3).

Fig. 5

Precision-recall curve of eight different methods

Furthermore, as Fig. 5 illustrates, the proposed three-level hierarchical structure achieves much better retrieval performance than the previous algorithm of Jian et al. [17]. This is attributable to the flexibility of the deep three-level hierarchical indexing structure and the strong representation ability of the directional and improved salient patches for describing the local image properties. The latter algorithm [17], which mainly focuses on extracting salient regions alone, is unable to capture the local features of the perception-based directional patches. In sum, the devised three-level hierarchical indexing structure is efficient and effective for representing and indexing big image data on the network, and the integrated feature vectors with the TREE hierarchical data structure are suitable for depicting an image with various objects or different meaningful regions. Experimental results show that the proposed hierarchical-local-feature extraction scheme is more effective than the other seven state-of-the-art algorithms.

4 Conclusion and discussion

Since global features usually cannot capture the local characteristics of different parts of an image, this paper introduces a hierarchical-local-feature extraction scheme with a three-level TREE hierarchical data structure to represent an image. Our purpose is not to accurately segment every single object in the image, as traditional image-segmentation based methods do, but to focus on and extract the salient regions that are visually meaningful from the viewpoint of human visual perception for CBIR. With the aid of the wavelet transform, a directional-patch detector and an improved wavelet-transform salient-patch detector based on Chauvenet's criterion are developed to describe the local image properties. Extensive experimental results verify that the developed CBIR scheme produces promising retrieval performance.

In our future work, we hope to investigate and design a compact scale-, rotation-, and shift-invariant representation of images, including an edge representation matrix and a corner representation matrix, to further improve image search results.