1 Introduction

There has been much recent interest, in both research and practice, in classifying images into categories. To achieve this goal, the first stage is keypoint extraction. Keypoints are salient image patches that contain rich local information about an image. There are different keypoint detectors, which are surveyed by Mikolajczyk and Schmid [43] and Zhang et al. [65]. Keypoints are described by descriptors such as the Scale-Invariant Feature Transform (SIFT).

Lowe [37] proposed SIFT, a feature that is robust to scaling, rotation, translation, and illumination changes and partially invariant to affine distortion. In addition, there is no need to preprocess the images further; the SIFT features only need to be quantized with the well-known Bag of Visual Words (BoVW) technique, first presented by Csurka et al. [13].

The Bag of Words (BoW) model is a popular technique for document classification. In this method, a document is represented as a bag of its words, and features are extracted from the frequency of occurrence of each word. The Bag of Words model has also been used for computer vision by Perona [48]. Therefore, instead of the document-oriented name (BoW), the term Bag of Visual Words (BoVW) is used in the present research. For BoVW extraction, blobs and features (e.g., SIFT) must first be extracted. In the next stage, a visual vocabulary must be built using a clustering method (e.g., K-means). Representation of images with BoVW histograms is the third stage. The final stage is image classification using a classifier (e.g., a Support Vector Machine [SVM]).
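The following minimal sketch illustrates these four stages. It is only an illustration, assuming an OpenCV/scikit-learn environment; the function and variable names are ours and it is not the implementation used in this paper.

```python
# Minimal BoVW sketch (illustrative only; not the authors' MATLAB/VLFeat code).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(gray_img):
    """Stage 1: detect keypoints and compute SIFT descriptors for one grayscale image."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(train_imgs, k=1024):
    """Stage 2: cluster all training descriptors into k visual words."""
    all_desc = np.vstack([sift_descriptors(im) for im in train_imgs])
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(all_desc)

def bovw_histogram(gray_img, vocab):
    """Stage 3: represent an image as a normalized histogram of visual words."""
    words = vocab.predict(sift_descriptors(gray_img))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# Stage 4: train a classifier on the BoVW histograms (train_imgs and labels assumed given).
# vocab = build_vocabulary(train_imgs)
# X = np.array([bovw_histogram(im, vocab) for im in train_imgs])
# clf = SVC(kernel="rbf").fit(X, labels)
```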

O’Hara and Draper [46] presented a survey on BoVW image representations. They highlighted recent techniques that mitigated quantization errors, improved feature detection, and sped up image retrieval. Lazebnik et al. [30] presented an extension of the BoVW model for recognizing scene categories based on global geometric correspondence (the spatial pyramid framework). Their method divides each image into sub-regions and computes histograms of local features for each sub-region. This spatial pyramid was applied in later-generation BoVW models, such as that of Ionescu et al. [20], which extracted dense SIFT descriptors from whole images or from a spatial pyramid of the image; they also proposed a method for classifying human facial expressions from low-resolution images based on a bag of words representation. The Pyramidal Histogram of Visual Words (PHOW) was proposed by Bosch et al. [10]. PHOW is an image descriptor based on the SIFT feature: it uses a grid of dense points in the image, and a SIFT descriptor is computed at each point of the grid. By default, it uses three scales and builds a pyramid of descriptors.

Most of the above-mentioned models concentrate on grayscale versions of pictures and ignore color information. Therefore, researchers have attempted to combine color features with other features to obtain better results. Vigo et al. [54] found that integrating color significantly improves the overall performance of both feature detection and extraction. Khan et al. [25] presented a method for object recognition using multiple cues (shape and color); their combination was based on modulating shape features with category-specific color attention. Alqasrawi et al. [2] used a keypoint-density-based weighting method to combine a BoVW model with color information on a spatial pyramid layout. Recently, Barata et al. [4] compared grayscale methods against color sampling methods (the Harris-Laplacian detector and the SIFT descriptor) and found that color detectors and Color-SIFT perform better. Jalali et al. [22] utilized color to enhance object and scene recognition in a method inspired by the characteristics of color- and object-selective neurons. A comprehensive discussion of the combination of color with Bag of Visual Words image representations may be found in Weijer et al. [58].

Some researchers have investigated the application of BoVW classification in specific domains. For instance, the authors of this article investigated the potential use of the bag of SIFT features for animal classification and determined which classification method works better for animal pictures [40]. Ionescu et al. [20] proposed a method for classifying human facial expressions from low-resolution images based on a BoVW representation. Abdelkhalak and Zouaki [1] suggested a new descriptor for bird search in images: they concatenated shape, the first color moment (mean), and the second color moment (variance), an early fusion of color and shape, to build a BoVW.

As can be seen, color features remain an active direction of research and are still being improved. In addition, color is one of the important characteristics of human vision. However, in the traditional version of the Dominant Color Descriptor (DCD), if background colors outnumber foreground ones, images with similar background colors are wrongly retrieved as belonging to the same category. Moreover, a color feature alone is not sufficient for similar objects with different color information. Therefore, this paper presents a new Salient DCD (SDCD) to add semantic information and reduce the background effect. A new fusion model for fusing the SDCD and PHOW MSDSIFT histograms is also proposed.

Our SDCD & PHOW MSDSIFT approach consists of six main steps. The first is saliency map computation based on Jiang et al. [24], which discriminates the background from the main object. The second step is divided into two parallel stages: SDCD color extraction from the salient part and PHOW feature extraction from the salient and original parts. In the next step, their codebooks are constructed in parallel by K-means clustering. Again in parallel, in the fourth stage, spatial histogram descriptors are quantized based on KD-trees (binary trees in which every node is a k-dimensional point) to identify the visual words. Homogeneous kernel maps of the histograms are then extracted. Finally, these histogram kernel maps are fused together with a new fusion model, described in Section 3, to obtain a superior visual word representation constructed from SDCD and PHOW features. To test our model, the spatial histograms of visual words of test pictures were compared with those of the training pictures using an SVM with a chi-square kernel (SVM CHI2), because SVM CHI2 [42] has shown better results in the literature and in the current researchers' early experiments [40]. Subsequently, the appropriate concept names were extracted for the test images by assigning class labels to them.

The rest of the paper is structured as follows. In Section 2, materials and methods related to our research are reviewed. Section 3 introduces the SDCD algorithm and the SDCD & PHOW fusion model for image retrieval. Section 4 presents the experimental setup. A discussion of the proposed model, the research results, and the usefulness of the SDCD & PHOW fusion model is given in Section 5. The paper concludes with some comments on future research in Section 6.

2 Materials and methods

Finding appropriate methods for image classification and location-based feature extraction is a recent and actively debated endeavor [3, 27,28,29, 36, 45, 66]. In the traditional BoVW model, visual words are collected and treated in the same way, even though they may come from an important part or from the background of a picture. This means that the classifier often relies on visual words that fall in the background and merely describe the context of the object [47]. Moreover, background features typically account for a higher percentage than foreground ones, and previous image classification methods did not add semantic location information to the features. They may therefore retrieve images that contain a similar background rather than images with a similar foreground; that is, they depend on background features, which are not useful information. On the other hand, color is not sufficient for similar objects with various colors, such as a white dog and a black dog. Based on these problems, an SDCD algorithm to extract the important colors of the salient parts of pictures and a new SDCD & PHOW fusion model for fusing SDCD color features with PHOW MSDSIFT features are proposed. This model can collect visual words from the whole picture and from its salient parts. In what follows, we first briefly review common stages and materials for color extraction and image retrieval techniques.

2.1 Image segmentation

The first stage, although not a mandatory one, is image segmentation. A segmentation algorithm divides images into different parts based on feature similarity. Segmentation approaches proposed in the literature include background-removal-based, clustering-based, grid-based, model-based, contour-based, graph-based, region-growing-based, and saliency-based methods. For a comprehensive segmentation review, readers are referred to [16]. In this study, the focus is on saliency-based methods, because locating the object and removing the background parts is an important stage. Recently, many models have been designed to compute saliency maps. There are five major research areas for detecting saliency in images: salient object detection methods, localization salient models, aggregation and optimization salient models, active salient models, and segmentation salient models. These research areas are described in detail in the following paragraphs.

2.1.1 Salient object detection methods

Based on the survey conducted by Borji et al., salient or interesting objects in images can be detected along two attributes: block-based vs. region-based analysis and intrinsic vs. extrinsic cues [9].

  • Block-based vs. region-based analysis: Block-based analysis (i.e., pixels and patches) is an early approach to finding salient objects, while region-based analysis became widespread with the development of superpixel algorithms.

  • Intrinsic vs. extrinsic cues: The key difference is whether attributes are taken from the image itself (intrinsic cues) or from auxiliary sources such as user annotations, depth maps, or statistical information from similar images (extrinsic cues) to facilitate detecting salient objects.

Based on the literature review and the attributes mentioned above, most existing salient object detection approaches can be divided into three major categories: block-based models with intrinsic cues, region-based models with intrinsic cues, and models with extrinsic cues.

  1. Block-based models with intrinsic cues: These models detect salient objects based on blocks (i.e., pixels or patches) using only intrinsic cues. Their drawbacks are that they detect high-contrast edges as the salient object instead of the real salient object, and that if the block size is large, the boundary of the salient object is not preserved well. To address these problems, later researchers focused more on region-based maps; because the number of regions is much smaller than the number of blocks, better features can be extracted from regions.

  2. Region-based models with intrinsic cues: In these models, the input image is first segmented into regions aligned with intensity edges, and then a regional saliency map is computed. Three types of region extraction methods are used for saliency computation (graph-based segmentation, the mean-shift algorithm, or clustering/quantization). The first advantage of this approach over block-based analysis is that several cues, such as backgroundness, objectness, focusness, and boundary connectivity, are available for improving these models. Besides, regions provide more advanced cues (e.g., color histograms). Another advantage of using regions instead of blocks (i.e., pixels or patches) is computational cost: because each image has far fewer regions than pixels, computing regional saliency is cheaper than producing full-resolution saliency maps. Despite these advantages, the newer generation of methods uses extrinsic cues. Jiang et al. proposed an approach based on multi-scale local region contrast, which calculates saliency values across multiple segmentations and combines these regional saliency values to get a pixel-wise saliency map [23].

  3. Models with extrinsic cues: These models help salient object extraction in images and videos. The cues can be derived from ground truth annotations of the training images, similar images, the video sequence, a set of input images containing the common salient objects, depth maps, or light field images. Borji et al. [9] concluded that DRFI, presented by Jiang et al. [24], is an extrinsic-cue model; although it was trained on only a small subset of MSRA5K, it still consistently outperforms other methods on all datasets. The preceding categorizations (block-based vs. region-based analysis and intrinsic vs. extrinsic cues) concern salient object detection.

2.1.2 Localization salient models

Borji et al. [9] noted that some other research efforts are not mainly concerned with saliency map computation, but rather segment or localize salient objects directly with bounding boxes. They classified these as Localization Models, Segmentation Models, Aggregation and Optimization Models, and Active Models. The output of localization models is rectangles around the salient objects, obtained by converting binary segmentations into bounding boxes. The most common approach is to use a sliding window and classify each window as either target or background. For example, Lampert et al. [29] proposed an object localization method based on maximization over sub-images with a branch-and-bound scheme, but their approach cannot find two or more important objects in one picture. Another problem with sliding windows occurs when the local image information is insufficient, e.g., when the target is very small or highly occluded; in these cases, other parts of the picture help to classify it [45]. Therefore, K. Murphy et al. presented a model combining local and global (gist) features of the scene, which is useful for solving the previous problem. They found that local features alone cause many false positives and that the scale estimation is sometimes incorrect as well; they concluded that using global features can correct the estimation and decrease the ambiguity caused by using only local object detection methods. However, the basic assumption of the previous approaches, that at least one salient object exists in the input image, may not always hold, since some background images contain no salient objects at all. Wang et al. [57] investigated the problem of detecting the existence and location of salient objects in thumbnail images using a random forest learning approach. Recently, the current researchers proposed a Salient-Based Bag of Visual Words model (SBBoVW) to recognize difficult objects that had low accuracy with previous methods [41]. This method integrates SIFT features of the original and salient parts of pictures and fuses them to generate better codebooks using the bag of visual words method; it can also find the object location automatically based on the saliency map. However, it did not use any color information.

2.1.3 Aggregation and optimization salient models

These models try to combine several saliency maps in order to form a more accurate map that helps the detection of salient objects. Borji et al. [8] proposed a standard saliency aggregation. Recently, Yan et al. [60] combined saliency maps based on hierarchical segmentation to obtain a tree-structured graphical model over three layers of different scales; in this model, each node is related to a region. They concluded that hierarchical algorithms can select optimal weights for each region instead of globally weighting superpixels.

2.1.4 Active salient models

These models combine two stages (detection of the most salient object and its segmentation) into one. Recently, Borji [7] presented an active model that locates the salient object by finding the peak pixels of the fixation map and then segments the picture using superpixels; this method connects fixation prediction and salient object segmentation. Based on the research of Mikolajczyk et al. [44] on different scale- and affine-invariant interest point detectors, the best results are obtained by the Hessian-Laplace and salient regions methods.

2.1.5 Segmentation salient models

In these models, separating the salient object from the background is the main goal. Kim et al. [26] proposed a region detection approach that uses dense local region detectors to extract features suitable for object recognition and image matching. Having applied boundary-preserving local regions (BPLRs), they asserted that their method can find the connectivity of pixels and preserve object boundaries for foreground discovery and object classification. Wang et al. [19] presented a framework that automatically segments the salient object using contextual cues; their method incorporates texture, luminance, and color cues, measures the similarity between neighboring pixels, and computes an edge probability map to label pixels as background/foreground. Recently, Jiang et al. [24] formulated saliency estimation as a regression problem, and their method still consistently outperforms other saliency methods on all datasets. Therefore, we selected their method to generate the saliency map.
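As a minimal illustration of how a saliency map can be turned into a binary salient-region mask, the sketch below uses OpenCV's spectral residual saliency (available in opencv-contrib) only as a simple stand-in; the paper itself relies on the DRFI method of Jiang et al. [24].

```python
# Illustrative saliency map + mask extraction (spectral residual as a stand-in for DRFI [24]).
import cv2
import numpy as np

def salient_mask(bgr_img):
    """Return a rough binary mask of the salient region (1 = salient)."""
    saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, sal_map = saliency.computeSaliency(bgr_img)      # float saliency map in [0, 1]
    if not ok:
        return np.ones(bgr_img.shape[:2], np.uint8)      # fall back to the whole image
    sal_u8 = (sal_map * 255).astype(np.uint8)
    # Otsu thresholding separates the salient foreground from the background.
    _, mask = cv2.threshold(sal_u8, 0, 1, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return mask

# Example: mask = salient_mask(cv2.imread("image.jpg"))
```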

2.2 Feature extraction

The next stage of image retrieval is feature extraction. In the following sections, SIFT and its variants, e.g., Speeded Up Robust Features (SURF), PHOW, and the Pyramid Histogram of Oriented Gradients (PHOG), as well as color features, are described.

2.3 SIFT and SURF

SIFT was first proposed by Lowe [37]. The feature frame has four parameters: the keypoint center (x and y coordinates), the scale (the radius of the region), and the orientation (an angle expressed in radians). The SIFT detector is invariant and robust to translation, rotation, and scaling, and is partially invariant to affine distortion and illumination changes. Later, Bay et al. [5] proposed Speeded Up Robust Features (SURF), a faster alternative to SIFT. Liu et al. [33] suggested a fast algorithm for computing a dense set of SIFT descriptors. Dalal et al. [14] used the Histogram of Oriented Gradients (HOG) descriptor for pedestrian detection. Pyramid HOG (PHOG) and PHOW, the newer generation of SIFT-based features, are described below; for more information, the authors refer the reader to [10].

2.4 PHOW and PHOG

PHOW is a newer variant of the SIFT feature proposed by Bosch et al. [10]. It uses dense SIFT at different scales and builds a pyramid of descriptors. PHOG is the edge-based counterpart of PHOW; it gathers features from an edge-detected picture (e.g., using the Canny detector). The stages of PHOW and PHOG are depicted in Fig. 1. In forming the pyramid, the grid at level \(l\) has \(2^{l}\) cells along each dimension. Consequently, level 0 is represented by an N-vector corresponding to the N bins of the histogram, level 1 by a 4N-vector, and so on. The pyramid descriptor of the entire image (PHOW, PHOG) is a vector with dimensionality \(N{\sum }_{l=0}^{L} 4^{l}\); for example, with L = 2 the descriptor has N(1 + 4 + 16) = 21N dimensions.

Fig. 1

Spatial SIFT representation (a,c) Grids for levels l = 0 to l = 2 for appearance and shape representation; (b,d) Appearance and SIFT histogram representations corresponding to each level

Therefore, we selected this type of SIFT rather than pure SIFT, which recent studies do not recommend.
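To make the dense multi-scale idea behind PHOW concrete, the sketch below computes SIFT descriptors on a regular grid at several patch sizes using OpenCV. It is an illustrative approximation under assumed parameters, not the VLFeat vl_phow implementation used in this work.

```python
# Illustrative dense multi-scale SIFT, approximating PHOW.
import cv2
import numpy as np

def dense_multiscale_sift(gray, sizes=(4, 6, 8, 10), step=2):
    """Compute SIFT descriptors on a dense grid at several patch sizes."""
    sift = cv2.SIFT_create()
    all_desc = []
    h, w = gray.shape
    for size in sizes:
        # One keypoint every `step` pixels; `size` controls the patch scale.
        kps = [cv2.KeyPoint(float(x), float(y), float(size))
               for y in range(size, h - size, step)
               for x in range(size, w - size, step)]
        if not kps:
            continue
        _, desc = sift.compute(gray, kps)
        all_desc.append(desc)
    return np.vstack(all_desc)  # (number of grid points over all scales, 128)

# Example: descs = dense_multiscale_sift(cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE))
```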

2.5 Color features

A color feature is another important feature that helps us recognize objects as a human does. The first step in color feature extraction is color space selection. There are several color spaces, e.g., Red, Green, Blue (RGB); Cyan, Magenta, Yellow, and Black (CMYK); Hue, Saturation, Value (HSV); and CIE Luv, an Adams chromatic valence color space proposed by the International Commission on Illumination (CIE) in 1976. Digital images are usually stored in RGB color space. Unfortunately, the color distance in RGB space does not represent perceptual color distance [62] (e.g., two colors with a larger distance can be perceptually more similar than another two colors with a smaller distance). Considering this drawback, CIE Luv space was selected for the present research, because it is a uniform color space in terms of color distance. MPEG-7 includes a set of color descriptors, described by Yamada [59]; it defines seven color descriptors, including dominant colors, the scalable color histogram, color structure, color layout, and group of frames/group of pictures (GoF/GoP) color. In MPEG-7, the Dominant Color Descriptor (DCD) describes the color distribution in an image or a region. This paper focuses on the DCD because of the advantages of this color descriptor reported in the review by Zhang et al. [63]; other color descriptors, such as the color coherence vector (CCV), the color correlogram, and the color structure descriptor (CSD), are useful for whole-image representation.

The DCD feature descriptor has two main components: (1) representative colors and (2) a percentage for each color. This descriptor is defined as:

$$ F=\left\{ \left\{c_{i},p_{i}\right\},i=1, . . . ,N \right\} $$
(1)

where N is the overall number of dominant colors in an image, \(c_{i}\) is a dominant color vector, \(p_{i}\) is the percentage of each dominant color, and the \(p_{i}\) sum to 1. MPEG-7 recommends limiting the number of colors in a region and suggests that N be in the range of 1 to 8.
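As a hypothetical illustration of (1), a salient region that is mostly red with some white and a little black (using color names in place of the Luv vectors \(c_i\)) would be described as

$$ F=\left\{ \left(c_{1}=\text{red},\,p_{1}=0.6\right),\ \left(c_{2}=\text{white},\,p_{2}=0.3\right),\ \left(c_{3}=\text{black},\,p_{3}=0.1\right) \right\} $$

with N = 3 dominant colors and percentages summing to 1.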

Yang et al. [61] presented a fast MPEG-7 dominant color extraction method with a new similarity measure for image retrieval. In comparison with previous versions of the DCD, it has higher accuracy and performance. According to Yang et al. [61], the distance between two images \(F_{1}\) and \(F_{2}\) is calculated by:

$$ D^{2} (F_{1},F_{2} )=1-SIM(F_{1},F_{2}) $$
(2)

where \(SIM(F_{1},F_{2})\) is the similarity measure. For two color features \(F_{1}=\{\{c_{i},p_{i}\},\,i=1,\ldots,N_{1}\}\) and \(F_{2}=\{\{b_{j},p_{j}\},\,j=1,\ldots,N_{2}\}\), it is defined as:

$$ SIM(F_{1},F_{2} )= \sum\limits_{i=1}^{N_{1}} \sum\limits_{j=1}^{N_{2}} a_{ij} S_{ij} $$
(3)

where \(a_{ij}\) is the color similarity coefficient:

$$ a_{ij}=\left\{ \begin{array}{cl} 1-\frac{d_{i,j}}{d_{max}} & { d_{i,j}\leq T_{d}} \\ 0 & { d_{i,j}> T_{d}} \end{array} \right. $$
(4)

where \(d_{i,j}\) is the Euclidean distance between the two color clusters \(c_{i}\) and \(b_{j}\). Based on the research of Islam et al. [21], the value of \(T_{d}\) is fixed at 20 in the CIE Luv color space. Because the dominant colors should be significant enough, insignificant colors are merged into nearby colors.

$$ d_{i,j}=\parallel c_{i}-b_{j} \parallel \quad \text{and}\quad d_{max}=\alpha T_{d} $$
(5)

To properly reflect the similarity coefficient between two color clusters, the parameter \(\alpha\) was set to 1 and \(T_{d}=20\) in the present research.

\(S_{ij}\) is the similarity score between two dominant colors, given by:

$$ S_{ij}=[1-|p_{q} (i)-p_{t} (j)|]\times min(p_{q} (i),p_{t} (j)) $$
(6)

where \(p_{q}(i)\) and \(p_{t}(j)\) are the percentages of the i-th dominant color in the query image and the j-th dominant color in the target image, respectively.
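A direct implementation of the similarity measure in (2)–(6) might look as follows. This is an illustrative Python sketch; the function and variable names are ours.

```python
# Illustrative implementation of the DCD distance and similarity of Eqs. (2)-(6).
import numpy as np

T_D = 20.0            # color-distance threshold in CIE Luv (Islam et al. [21])
ALPHA = 1.0
D_MAX = ALPHA * T_D   # Eq. (5)

def dcd_similarity(colors_q, perc_q, colors_t, perc_t):
    """SIM(F1, F2): colors_* are (N, 3) Luv vectors, perc_* are percentages summing to 1."""
    sim = 0.0
    for c, pq in zip(colors_q, perc_q):
        for b, pt in zip(colors_t, perc_t):
            d = np.linalg.norm(c - b)                      # Eq. (5)
            if d <= T_D:
                a = 1.0 - d / D_MAX                        # Eq. (4)
                s = (1.0 - abs(pq - pt)) * min(pq, pt)     # Eq. (6)
                sim += a * s                               # Eq. (3)
    return sim

def dcd_distance(colors_q, perc_q, colors_t, perc_t):
    """D^2(F1, F2) = 1 - SIM(F1, F2), Eq. (2)."""
    return 1.0 - dcd_similarity(colors_q, perc_q, colors_t, perc_t)
```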

For dominant color vector quantization, an algorithm for automatic categorization was presented by Islam et al. [21]. They found that only 1.3% of image regions need more than four colors and therefore restricted the number of dominant colors to four. With their algorithm, however, some salient regions cannot be properly described; in addition, the algorithm is complicated, hard to implement, and time-consuming. In 2013, [50] proposed the Weighted DCD (WDCD), which assigns a weight to each dominant color based on saliency map extraction. However, WDCD has a significant drawback: it cannot retrieve similar objects with different colors, such as a white dog and a black dog, which means that color alone is not sufficient for image retrieval. To counteract these disadvantages, this paper presents a new, semantically based, fast, and easy-to-understand dominant color vector quantization algorithm that finds an appropriate number of colors based on their distances and extracts the DCD colors of the salient part of a picture. This algorithm is described in Section 3. This color descriptor is later fused with another feature in a new fusion model so that the same objects with different colors can be retrieved.

2.6 Learning

In terms of training techniques, learning methods are divided into supervised, unsupervised, and semi-supervised (hybrid) models; the hybrid models combine clustering and classification techniques and are growing quickly [51].

In terms of feature selection, learning methods are divided into single-view and multi-view feature extraction. These methods are described in the following paragraphs.

2.6.1 Single view learning

Single-view feature selection techniques are traditional learning methods that usually select features from a single task [39]. These methods have a basic drawback: they cannot precisely distinguish images containing several semantic concepts. Therefore, multiple-feature selection methods are the newer generation of feature extraction, designed to eliminate the problems of single-view feature extraction methods [39].

2.6.2 Multi view learning

Feature selection and feature transformation are the two main ideas for feature extraction, and the former is the preferred method [34]. Although traditional feature selection methods use a single task, recent methods focus on multiple-feature selection [34, 35, 38, 39]. Multi-task feature extraction handles correlated and noisy features. Even though all the features could be joined into one large vector, this strategy is not suitable: it ignores the diversity among features and may lead to the curse of dimensionality [34].

A supervised multi-label multi-task feature learning method was proposed by Wang et al. [55], but it is not suitable for classification.

In 2013, Liu and Tao [34] proposed a multiview Hessian Regularization (mHR) method for image annotation. Their method combines multiview features and Hessian regularization from different views.

Sparse coding finds a sparse linear combination of dictionary atoms and shows promising results for image denoising. In recent years, several sparse coding algorithms have been developed [35]. The most notable sparse coding methods are based on Laplacian Regularization (LR), but LR-based methods suffer from poor generality and deal with only a single view, even though images are usually represented by multiple visual features [35]. To overcome these drawbacks, W. Liu et al. applied multiview Hessian Discriminative Sparse Coding (mHDSC) to linear SVM and least-squares regression for image annotation. However, their method was tested on a small number of concepts (PASCAL VOC07, with 20 concepts).

In 2015, Y. Luo et al. proposed a multimodal multi-task feature extraction framework (LM3FE) that is suitable for image classification [38]. LM3FE uses all kinds of features, even noisy ones, and exploits the complementarity of different modalities to reduce the redundancy in each modality. However, their method was tested on small datasets (NUS-WIDE, 12 concepts, and MIR, 38 concepts).

Luo et al. [39] proposed a weight-based matrix combination framework for transductive multi-label image classification. However, the overall performance was not always satisfactory.

3 SDCD algorithm and SDCD & PHOW fusion model

In this section, the proposed algorithm and model are described in detail.

3.1 Proposed salient dominant color descriptor (SDCD) algorithm

The Dominant Color Descriptor (DCD) is an MPEG-7 color descriptor. DCD extracts the colors of a region; based on the research of Zhang et al. [63], this color descriptor is useful for region color extraction, whereas other color descriptors, such as the CCV, the color correlogram, and the CSD, are useful for whole-image representation.

However, in the traditional version of DCD, if background colors outnumber foreground ones, the algorithm retrieves images with similar background colors. Moreover, color is not sufficient for similar objects with various colors (e.g., a white dog vs. a black dog). Another drawback of the previous version of DCD is that the maximum number of dominant colors was fixed at four [64]. Traditional segmentation methods (e.g., JSEG) create many regions for each picture and cannot distinguish the most significant region or the important foreground region of the picture. WDCD, proposed by [50], cannot retrieve similar objects with different colors (e.g., a white dog vs. a black dog).

To solve these problems of the previous DCD color descriptor, the salient region is extracted first, so that the colors of the foreground, i.e., the important region of the picture, can be extracted. Because the proposed algorithm extracts the DCD colors of the salient part of pictures, more than four colors may be needed. For this reason, in the new algorithm, the Salient DCD of a region can have several colors, determined by the distances between the colors. SDCD combines semantic location information with DCD and removes background colors, which are not useful information. Moreover, the algorithm is easier to implement and understand. Later, a fusion model is proposed to fuse the SDCD color feature and the PHOW MSDSIFT shape feature in order to retrieve similar objects with different colors accurately. The SDCD algorithm is depicted in Algorithm 1, and an illustrative code sketch follows the listing. Its steps are:

  1. Extract the saliency map, based on Jiang et al. [24].
  2. Extract the salient region mask.
  3. Clean the salient region mask of small spots.
  4. If the mask is empty, no salient region was found; in this case, a mask covering the whole image is created instead.
  5. Multiply the mask with the original picture to create a masked picture.
  6. Find the line borders of the masked picture and crop them.
  7. Perform image smoothing and impulse noise removal with peer group filtering (PGF) [15]. This algorithm swaps each image pixel with the weighted average of its peer group members, which are classified according to the color resemblance of neighboring pixels.
  8. Classify the colors of the smoothed picture into N colors.
  9. Calculate the histogram of the N colors and divide it by its sum to obtain the percentage of each color.
  10. Calculate the Euclidean distances between the colors.
  11. Merge nearby color clusters (i.e., those whose distance is less than \(d_{max}\)).
  12. Remove colors with small percentages of occurrence (less than 10 percent).

Algorithm 1 The SDCD algorithm
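A simplified Python sketch of Algorithm 1 is given below. It assumes the saliency mask from steps 1–2 is already available, and a median filter stands in for peer group filtering, so it is an illustrative approximation rather than the exact implementation used in this work.

```python
# Illustrative SDCD sketch following Algorithm 1 (saliency mask assumed given;
# median filtering stands in for PGF).
import cv2
import numpy as np

def sdcd(bgr_img, saliency_mask, n_colors=8, d_max=20.0, min_percent=0.10):
    """Return dominant Luv colors and their percentages for the salient region."""
    mask = cv2.medianBlur(saliency_mask.astype(np.uint8), 5)            # step 3: remove small spots
    if mask.sum() == 0:                                                 # step 4: no salient region found
        mask = np.ones(bgr_img.shape[:2], np.uint8)
    masked = cv2.bitwise_and(bgr_img, bgr_img, mask=mask)               # step 5: masked picture
    ys, xs = np.nonzero(mask)                                           # step 6: crop to the mask borders
    crop = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop = cv2.medianBlur(crop, 3)                                      # step 7: smoothing (PGF stand-in)
    luv = cv2.cvtColor(crop, cv2.COLOR_BGR2Luv)[crop_mask > 0].astype(np.float32)
    # step 8: quantize the salient-region colors with k-means
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(luv, n_colors, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    percents = np.bincount(labels.ravel(), minlength=n_colors) / len(labels)   # step 9
    # steps 10-11: merge color clusters closer than d_max
    colors, shares = list(centers), list(percents)
    i = 0
    while i < len(colors):
        j = i + 1
        while j < len(colors):
            if np.linalg.norm(colors[i] - colors[j]) < d_max:
                w = shares[i] + shares[j]
                colors[i] = (shares[i] * colors[i] + shares[j] * colors[j]) / w
                shares[i] = w
                del colors[j], shares[j]
            else:
                j += 1
        i += 1
    # step 12: drop colors covering less than 10% of the salient pixels
    keep = [k for k, s in enumerate(shares) if s >= min_percent]
    return np.array([colors[k] for k in keep]), np.array([shares[k] for k in keep])
```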

3.2 Proposed salient based fusion model (SDCD& PHOW MSDSIFT BoVW)

In the traditional BoVW model, a classifier often relies on visual words that fall in the background and merely describe the context of the object [47]. As mentioned before, the WDCD proposed in [50] cannot retrieve similar objects with different colors (e.g., a white dog vs. a black dog). To address this problem, the new SDCD & PHOW model for fusing SDCD color features with PHOW MSDSIFT features is proposed. This model collects visual words from both the whole picture and its salient parts. The stages of the model are:

  1. Saliency map and salient region mask computation.

  2. SDCD BoVW stages:

     (a) Generate masked pictures.
     (b) Extract the SDCD of the masked pictures.
     (c) Create the SDCD codebook, based on K-means clustering.
     (d) Quantize the SDCD visual-word spatial histograms, based on KD-trees.
     (e) Transform the non-linear histograms into a compact linear representation with the homogeneous kernel map.

  3. PHOW MSDSIFT BoVW stages:

     (a) Generate the salient rectangular parts of the picture.
     (b) Extract PHOW MSDSIFT features from the salient rectangular parts and the normal pictures.
     (c) Create the PHOW MSDSIFT codebook, based on K-means clustering.
     (d) Quantize the visual-word spatial histograms, based on KD-trees.
     (e) Transform the non-linear histograms into a compact linear representation with the homogeneous kernel map.

  4. Histogram fusion, combining the SDCD and PHOW MSDSIFT histograms into one histogram. This fusion concatenates the homogeneous kernel map of the SDCD histogram, the mean of the SDCD histogram, the standard deviation of the SDCD histogram, the homogeneous kernel map of the PHOW histogram, the mean of the PHOW histogram, and the standard deviation of the PHOW histogram into one vector.

  5. SVM training with the chi-square kernel (SVM CHI2).

  6. Score extraction from the SVM.

  7. Maximum pooling.

  8. Testing of the model on previously unseen pictures.

Because this model has many stages, it is divided into two parts, shown in Figs. 2 and 3, to aid understanding; intermediate results are labeled with the numbers of the stages above. Figure 2 represents how the color and PHOW MSDSIFT features are extracted. To test our model, both features (SDCD and PHOW MSDSIFT) are extracted and their histograms are generated. After that, both the color histogram and the PHOW histogram are mapped from a non-linear to a linear representation with the homogeneous kernel map. The mean and standard deviation of both linear histograms are then calculated and combined into a vector with six items: the linear SDCD histogram, the mean of the linear SDCD histogram, the standard deviation of the linear SDCD histogram, the linear PHOW histogram, the mean of the linear PHOW histogram, and the standard deviation of the linear PHOW histogram. Then, the spatial histograms of visual words of the test pictures are compared with those of the training pictures using SVM CHI2, and the maximum scores are pooled. Afterward, the appropriate concept names are extracted by finding the maximum score among the retrieved concept names for the test images. Figure 3 shows how the visual words and histograms are quantized and, in the final stage, fused together by the proposed late fusion model. The fusion is the concatenation of the homogeneous kernel maps of the PHOW MSDSIFT and SDCD histograms together with the mean and standard deviation of each feature histogram's kernel map; an illustrative sketch is given after Figs. 2 and 3, and the internal fusion is described in more detail in the following subsection.

Fig. 2

SDCD and PHOW MSDSIFT BoVW model (first part); continued in Fig. 3

Fig. 3

SDCD & PHOW MSDSIFT BoVW model (second part); continued from Fig. 2
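As an illustration of the fusion stage, the following sketch builds the fused feature vector from per-image SDCD and PHOW histograms, using scikit-learn's additive chi-square sampler as the homogeneous kernel map. This is an assumed Python analogue, not the MATLAB/VLFeat implementation used in the paper, and a linear SVM on the mapped features stands in for the chi-square SVM.

```python
# Illustrative fusion of the SDCD and PHOW MSDSIFT histograms.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

def fuse_histograms(sdcd_hists, phow_hists):
    """Build the fused vectors: kernel maps plus their per-image means and standard deviations."""
    sdcd_map = AdditiveChi2Sampler(sample_steps=2).fit_transform(sdcd_hists)
    phow_map = AdditiveChi2Sampler(sample_steps=2).fit_transform(phow_hists)
    parts = [
        sdcd_map,
        sdcd_map.mean(axis=1, keepdims=True),
        sdcd_map.std(axis=1, keepdims=True),
        phow_map,
        phow_map.mean(axis=1, keepdims=True),
        phow_map.std(axis=1, keepdims=True),
    ]
    return np.hstack(parts)   # one row per image: kernel maps plus 4 scalar statistics

# X_train = fuse_histograms(sdcd_hists, phow_hists)        # one row per training image
# clf = LinearSVC(C=10.0).fit(X_train, labels)
# scores = clf.decision_function(fuse_histograms(sdcd_test, phow_test))
# predictions = scores.argmax(axis=1)                      # maximum-score pooling over classes
```

Because the homogeneous kernel map linearizes the chi-square kernel, a linear SVM on the fused vectors approximates SVM CHI2 at much lower training cost.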

3.3 The proposed model feature extraction and fusion in detail

In this subsection, detailed information about the feature dimensions and explanations of the intermediate results are given:

  1. Images are standardized in size to less than 300 × 300 pixels.

  2. Saliency map and salient region mask computation.

  3. SDCD BoVW stages:

     (a) Generate masked pictures.
     (b) Extract the SDCD of the masked pictures in Luv color space: \(3 \times C_{N}\), where \(C_{N}\) is the total number of colors over all training pictures.
     (c) Create the SDCD codebook and color table based on K-means clustering: \(N_{color} \times N_{train}\) and \(N_{color} \times 3\), where \(N_{color}\) is the number of colors remaining after SDCD codebook extraction and \(N_{train}\) is the number of training images.
     (d) Quantize the SDCD visual-word spatial histograms, based on KD-trees: \(N_{train} \times N_{color}\).
     (e) Map the SDCD histogram from a non-linear to a compact linear representation with the homogeneous kernel map.

  4. PHOW MSDSIFT BoVW stages:

     (a) Generate the salient rectangular parts of the picture.
     (b) Extract PHOW MSDSIFT features from the salient rectangular parts and the normal pictures: \(128 \times N\) and \(128 \times M\), where 128 is the SIFT descriptor dimension, N is the number of features of the salient rectangular part, and M is the number of features of the normal picture. These features are combined into a larger feature matrix with dimensions \(128 \times (M + 2N)\); that is, the salient features are repeated twice and the normal features once.
     (c) Create the PHOW MSDSIFT codebook, based on K-means clustering: \(128 \times N_{\#PHOWcodebook}\), where \(N_{\#PHOWcodebook}\) is 1024 for the Caltech-101 dataset and 2048 for the Caltech-256 dataset.
     (d) Quantize the visual-word spatial histograms, based on KD-trees: \(N_{train} \times N_{\#PHOWcodebook}\), where \(N_{train}\) is the number of training images and \(N_{\#PHOWcodebook}\) is the codebook size.
     (e) Map the PHOW histogram from a non-linear to a compact linear representation with the homogeneous kernel map.

  5. Histogram fusion, combining the SDCD and PHOW MSDSIFT histograms (\(x_{SDCD}\), \(x_{PHOW}\)), their means (\(\bar {x}_{SDCD}\), \(\bar {x}_{PHOW}\)), and their standard deviations (\(\sigma_{SDCD}\), \(\sigma_{PHOW}\)) into one vector: \(N_{train} \times (N_{color} + N_{\#PHOWcodebook} + 4)\) (see the code sketch following Fig. 3).

  6. SVM training with the chi-square kernel (SVM CHI2).

  7. Extract the SVM scores.

  8. Maximum-score pooling to recognize the object.

  9. Testing of the model on previously unseen pictures.

4 Experimental setup

As mentioned earlier, this paper aims to investigate the potential and accuracy of the SDCD & PHOW MSDSIFT BoVW model, which fuses SDCD color features and PHOW MSDSIFT features, for recognizing color objects in image retrieval. The MSDSIFT scales are 4, 6, 8, and 10, and the MSDSIFT step (in pixels) of the grid at which the dense SIFT features are extracted is 2. The codebook is created with Elkan K-means clustering. Visual words are quantized into spatial histograms based on KD-trees. The proposed model is trained with an SVM with a chi-square kernel, and the scored histogram fusion is a concatenation of the linear SDCD histogram (\(x_{SDCD}\)), the mean of the SDCD histogram (\(\bar {x}_{SDCD}\)), the standard deviation of the SDCD histogram (\(\sigma_{SDCD}\)), the linear PHOW MSDSIFT histogram (\(x_{PHOW}\)), the mean of the PHOW MSDSIFT histogram (\(\bar {x}_{PHOW}\)), and the standard deviation of the PHOW MSDSIFT histogram (\(\sigma_{PHOW}\)). The best result is selected with the maximum pooling method. Evaluations are performed on the Caltech-101 dataset [32] and on the animal subset of the Caltech-256 dataset [18]. The codebook sizes are 1024 and 1500 for Caltech-101 and Caltech-256, respectively.
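For concreteness, the following sketch shows the KD-tree-based quantization step, assigning each descriptor to its nearest codeword and building a normalized histogram of visual words. It is an assumed SciPy-based analogue of the VLFeat KD-tree used in our MATLAB implementation.

```python
# Illustrative KD-tree quantization of descriptors into visual words.
import numpy as np
from scipy.spatial import cKDTree

def quantize_to_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and return a normalized histogram."""
    tree = cKDTree(codebook)                       # codebook: (n_words, 128) cluster centers
    _, word_ids = tree.query(descriptors, k=1)     # nearest visual word for each descriptor
    hist = np.bincount(word_ids, minlength=len(codebook)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# Example: hist = quantize_to_histogram(dense_multiscale_sift(img), kmeans_centers)
```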

4.1 Caltech-101 dataset

This dataset has approximately 40–800 images per category. It contains a total of 9,146 images split between 101 distinct object categories (including faces, watches, ants, pianos, etc.) and a background category (for a total of 102 categories). As suggested by Wang et al. [56] and other researchers [6, 18], the dataset is partitioned into 5, 10, …, 30 training images per class and no more than 50 test images per class. The number of extracted code words was 1024. To compare this method with the grayscale PHOW descriptor of Vedaldi and Fulkerson [53], the same training and test images were used. For comparison with color feature extraction methods, PHOW-color, the HSV color histogram, and the RGB color histogram, as provided by Vedaldi and Fulkerson [53], were used.

4.2 Caltech-256 dataset

From the Caltech-256 dataset, 20 different animals (bear, butterfly, camel, dog, house-fly, frog, giraffe, goose, gorilla, horse, hummingbird, ibis, iguana, octopus, ostrich, owl, penguin, starfish, swan, and zebra) were selected from different environments (lake, desert, sea, sand, jungle, bush, etc.). A common training setup (15, 30, 45, and 60 training images for each class) was followed [56]. There were fewer than 50 test images per class. To compare this method with the basic BoVW model, the same training and test images were chosen. The number of extracted code words was 1500.

4.3 Essential needs

The essential software for running the program is MATLAB 2013a/2014a. The essential open-source libraries are VLFeat, an open-source library that implements popular computer vision algorithms specializing in image understanding and local feature extraction and matching, and LIBSVM, a library for support vector machines.

4.4 Accuracy of proposed method

In the following section, we present the results obtained on the datasets and compare our method with two recent studies. For evaluation, we used three well-known measures: precision, accuracy, and classification rate, which were also used in [6, 12, 17, 31, 52, 53]. Since these formulas are well known, we do not describe them in detail here.
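For reference, these measures follow their standard definitions, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives for a class:

$$ \text{Precision}=\frac{TP}{TP+FP},\qquad \text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \text{Classification rate}=\frac{\text{correctly classified test images}}{\text{total test images}} $$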

5 Results and discussion

A comparison with Vedaldi’s color descriptors (PHOW + Opp.-MSDSIFT, PHOW + HSV-MSDSIFT, and PHOW + RGB-MSDSIFT) was carried out using the same training and test pictures and the same codebook sizes (Table 1). In this table, the proposed model consistently outperformed the other color descriptors for different numbers of training images and different codebook sizes, and 1,024 codewords always improved the final classification rate. Therefore, 1,024 codewords were selected for the Caltech-101 dataset in the rest of the experiments. The reason behind this improvement is the extraction of the dominant colors of the salient part instead of feature extraction from different color spaces. The proposed method for object classification outperformed the methods from three recent studies [11, 49, 53] and 19 different color feature extraction methods for all numbers of training images (5, 10, ..., 30) on the Caltech-101 dataset (see Table 2).

Table 1 Comparison with other color descriptors for different codebook sizes on the Caltech-101 dataset
Table 2 Classification rate (%) comparison on the Caltech-101 dataset with different colored-SIFT methods from three state-of-the-art studies (color SIFT [49], CSIFT [11], Color PHOW [53])

A comparison between the proposed model, the basic BoVW model, and the previously proposed SBBoVW model on the Caltech-256 dataset is provided in Table 3 for different numbers of training images (15, 30, ..., 60). This table demonstrates that the proposed model outperformed the SBBoVW model in all settings, owing to the addition of the SDCD color descriptor. Figure 4 shows the retrieved object names for five test images for BoVW, SBBoVW, and SDCD + PHOW (the proposed model); the proposed model retrieved the correct names for all five images.

Table 3 Classification rate comparison between SBBoVW [41] and SDCD & PHOW MSDSIFT (the new model) on the animal subset of Caltech-256
Fig. 4

Retrieved object name between BoVW, SBBoVW, and SDCD + PHOW (proposed)

Fig. 5

Accuracy comparison of PHOW+MSDSIFT [53], SBBoVW [41] and the proposed model on Caltech-101

These results demonstrate the effectiveness of the proposed SDCD & PHOW MSDSIFT fusion model for improving color object classification. The final accuracy and precision results are depicted in Figs. 5, 6 and 7 for the same training and test images. In Fig. 5, the final accuracy results are compared with PHOW+MSDSIFT [53] and SBBoVW for the Caltech-101 dataset. In addition, Fig. 6 shows the precision comparison between these methods (PHOW+MSDSIFT, SBBoVW, and the proposed model) for the Caltech-101 dataset. This figure shows that the proposed late fusion model outperforms PHOW+MSDSIFT, owing to the added color and salient feature information, for 56 concepts (BACKGROUND-Google, Faces, Faces-easy, Leopards, Motorbikes, accordion, airplanes, anchor, binocular, butterfly, cellphone, chair, chandelier, cougar-body, crab, crocodile, cup, dollar-bill, dolphin, dragonfly, electric-guitar, euphonium, ferry, garfield, gerenuk, grand-piano, headphone, hedgehog, helicopter, ibis, inline-skate, joshua-tree, kangaroo, lamp, laptop, lobster, menorah, metronome, minaret, octopus, pagoda, panda, revolver, scissors, sea-horse, snoopy, soccer-ball, stop-sign, strawberry, trilobite, umbrella, watch, wild-cat, windsor-chair, wrench, and yin-yang), but does not outperform it for 46 concepts (ant, barrel, bass, beaver, bonsai, brain, brontosaurus, buddha, camera, cannon, car-side, ceiling-fan, cougar-face, crayfish, crocodile-head, dalmatian, elephant, emu, ewer, flamingo, flamingo-head, gramophone, hawksbill, ketch, llama, lotus, mandolin, mayfly, nautilus, okapi, pigeon, pizza, platypus, pyramid, rhino, rooster, saxophone, schooner, scorpion, stapler, starfish, stegosaurus, sunflower, tick, water-lilly, and wheelchair), because the salient part is incorrectly extracted for objects that have narrow lines or spotted or non-smooth patterns or colors. These results are supported by the accuracy comparison (see Fig. 5).

Fig. 6

Precision comparison of PHOW+MSDSIFT [53], SBBoVW [41] and the proposed model on Caltech-101

Fig. 7

Comparison of accuracy of BoVW model, SBBoVW [41] model and proposed model on a subset of Caltech-256

In a detailed precision comparison of the proposed fusion model and the BoVW model on the animal subset of the Caltech-256 dataset (see Fig. 8), the proposed late fusion model outperforms, owing to the added color and salient feature information, for 11 concepts (bear, butterfly, gorilla, horse, hummingbird, iguana, octopus, ostrich, owl, starfish, and zebra); but for animals that do not have smooth patterns and colors (e.g., giraffe, ...) the salient extraction does not work properly, and the model gets worse results for 9 concepts (camel, dog, frog, giraffe, goose, house-fly, ibis, penguin, and swan). On the other hand, in comparison with the SBBoVW method, the proposed late fusion model outperforms, due to the added salient features and color information, for 13 concepts (bear, butterfly, dog, gorilla, horse, house-fly, hummingbird, iguana, octopus, ostrich, owl, starfish, and zebra), but does not outperform for seven concepts because of incorrect salient object extraction on animals that have spotted patterns or non-smooth colors (camel, frog, giraffe, goose, ibis, penguin, and swan). These results are supported by the accuracy comparison of the BoVW model, SBBoVW, and the proposed model on the Caltech-256 subset (see Fig. 7).

Fig. 8

Precision comparison of BoVW model, SBBoVW [41] model and proposed model on a subset of Caltech-256

Based on these results, the proposed fusion model improves the final precision, accuracy, and classification rate in images in which the salient region could be correctly extracted.

6 Conclusions and future research

In this paper, a new SDCD algorithm to extract the colors of the salient object of a picture was first presented. Using this algorithm, a new model, SDCD & PHOW MSDSIFT BoVW, was proposed to fuse the SDCD histogram with the PHOW MSDSIFT histogram. The proposed model classifies color objects that had low accuracy with prior methods. It mixes SDCD and PHOW MSDSIFT features of the original and salient parts of pictures and fuses them together with a new fusion model to generate better codebooks. The final results and the comparison with three state-of-the-art models and 19 different color feature extraction methods show that the extraction of SDCD colors improved the final results. However, this model still needs improvement for objects for which color is not as effective for classification. In the future, with the help of other features, such as texture, difficult objects could be recognized more accurately. In addition, multi-object datasets, such as PASCAL VOC07, may offer another way to improve the proposed model. Parallel processing is another future direction to make the code run faster.