1 Introduction

The human visual system (HVS) can accurately detect the most salient object in an image, but developing a computational model for salient object detection [39, 40, 62, 69, 93] with comparable capabilities remains an open challenge in computer vision and pattern recognition. The task of salient object detection is to locate the most salient object or region in an image. In recent years, salient object detection [10, 46, 49, 89] has gained increasing attention for many computer vision and graphics applications, such as image and video compression [34], image retargeting [88], image thumbnailing [67], image segmentation [14, 53], object recognition [31, 70, 75], content-aware image editing [3], image classification/retrieval [4, 45], video surveillance systems [38], photograph rearrangement [72], image quality assessment [71], remote sensing [58], automatic image cropping [76], displaying items on small portable screens [17], automatic target detection [43, 44], robotics [17, 43, 76], medical imaging [50], advertising design [43], image collection browsing [74] and image enhancement [29]. Motivated by these applications, salient object detection emphasizes highlighting foreground objects.

In general, multi-scale analysis using wavelets helps represent images at different resolutions and mimics the zooming-in and zooming-out process of the HVS. The ability of the wavelet transform to represent singularities of images plays a key role in designing wavelet-based saliency detection algorithms, as human eyes are very sensitive to orientation features. It has been verified that multi-scale edge-to-texture features computed using the discrete wavelet transform (DWT) play a significant role in salient object detection because of the DWT's ability to provide multi-scale analysis of an image simultaneously in the frequency and spatial domains [41, 52]. Unfortunately, DWT-based saliency detection techniques have a limited ability to reveal singularities in different directions, as the DWT has only three directional sub-bands, oriented at 0°, 45° and 90°. However, as natural images comprise smooth regions punctuated by edges at several orientations, the DWT may fail to represent the geometric regularity along the singularities, which requires higher directional selectivity. In 2004, Selesnick [77,78,79] proposed the 2D double-density dual-tree complex wavelet transform (DDDTCWT), which yields four approximate sub-band images and 32 high-pass detailed sub-band images at each scale, adequate for representing any geometric object with high precision. This motivates us to use the DDDTCWT to detect salient objects. In this paper, we first compute detailed multi-scale edge-to-texture feature maps using the inverse double-density dual-tree complex wavelet transform (IDDDTCWT) to capture band-pass local information with different frequency bandwidths, which helps detect irregularities at different bandwidths. These feature maps are then combined to generate a saliency map. The DDDTCWT-based saliency map is integrated with the segmentation results obtained from a bipartite graph partitioning-based approach to generate an initial rough segmented saliency map (RSSM). Each pixel of the RSSM is assigned to the salient region or the background region based on its value relative to a threshold. Finally, the pixels of the two regions are treated as samples drawn from a multivariate kernel function whose parameters are estimated using the expectation maximization algorithm, yielding the final saliency map. The performance of the proposed model is evaluated in terms of precision, recall, F-measure, area under curve and computation time on six publicly available image datasets, as well as in terms of mean precision, mean recall, mean F-measure and mean AUC over all six datasets. Both qualitative and quantitative evaluations on these six publicly available benchmark datasets demonstrate the robustness and efficacy of the proposed method against 29 state-of-the-art methods.

The remainder of this paper is organized as follows. Section 2 reviews related state-of-the-art methods for salient object detection. Section 3 describes the proposed model (DDDTCWT-SS). Section 4 presents the experimental results and comparisons with several state-of-the-art salient region detection methods. Section 5 concludes the paper with a discussion.

2 Related work

Recently, a plethora of computational models have been proposed for salient object detection [18, 19, 28, 55,56,57, 83,84,85], which can be roughly categorized into bottom-up and top-down approaches [11,12,13]. Bottom-up approaches are fast, stimulus driven and task independent: they extract certain low-level features from the image and combine them into a saliency map. Top-down approaches, on the other hand, rely on prior knowledge of the human visual system (HVS) and high-level data processing to support the task of salient object detection; they are therefore slow and task dependent, and are typically integrated with bottom-up approaches to detect the salient object. Both categories of computational models focus on producing saliency maps to detect the salient object in an input image. For a comprehensive review of related work, we refer readers to recent survey papers for a detailed discussion of 256 studies in computer vision [12], a taxonomy and critical comparison of 65 models [11], and a qualitative and quantitative analysis of 41 different state-of-the-art methods [13] in the two major research areas of fixation prediction [11, 48] and salient object detection [13].

Here, we focus on the different bottom-up approaches by which salient object detection has been designed. Inspired by the biologically plausible architecture proposed by Koch and Ullman [54], Itti et al. [44] (IT) determined image saliency by utilizing centre-surround differences across multi-scale image features using a difference of Gaussians (DoG) approach. Later, Bruce and Tsotsos [15] proposed a computational model (AIM) based on information maximization to implement saliency using joint likelihood, Shannon’s self-information and features learned from input images using independent component analysis (ICA). Harel et al. [35] utilized Itti et al.’s [44] method to create low-level feature maps but performed normalization using a graph-based approach (GBVS). Liu et al. [64, 65] (SLRG) integrated saliency cues such as centre-surround histogram contrast, multi-scale contrast and colour spatial distribution in a conditional random field to segment the salient object. Zhang et al. (SUN) [94] used a Bayesian framework to locate the salient object in an image. Achanta et al. (FT) [3] computed the saliency of each pixel as the contrast of its colour feature to the mean colour of the whole image. The work of Achanta and Susstrunk (ASS) [2] relied on the maximum symmetric surround difference to compute the saliency map. Goferman et al. [30] (CASD) used the contrast of a patch to its K nearest patches in the image to compute saliency. Shen and Wu [80] formulated saliency detection as a low-rank matrix (LRK) recovery problem, where salient objects are represented by a sparse matrix (noise) while the background is indicated by a low-rank matrix; however, this sparse and low-rank assumption may not hold in complex scenes, leading to unsatisfactory results. Liu et al. [63] (STREE) locate salient objects by exploiting the concept of a saliency tree. Zhu et al. [96] (MSA) used multivariate normal distribution estimation to extract salient regions in an image. Xie et al. [90] (BSM) proposed a Bayesian saliency method utilizing low- and mid-level cues. Sun et al. [86] (MCA) used the concept of Markov chain absorption to detect the salient object present in the image. Jiang et al. [47] performed pre-segmentation on an input image, extracted a set of discriminative features from each segmented region and then adopted a random forest regressor to map the multiple features to a region saliency score. Singh et al. (SOD-C-PSO) [82] suggested a linearly weighted combination of different feature maps and estimated the weights using constrained particle swarm optimization. Cheng et al. [19] proposed an unsupervised saliency cut (Grab cut)-based image segmentation approach to automatically segment the most salient object; they also proposed histogram-based contrast (HC) and spatial information-enhanced region-based contrast (RC) methods for salient object detection. Kim et al. [51] introduced a high-dimensional colour transform (HDCT)-based saliency detection approach, the main idea of which is to map a low-dimensional RGB colour to a feature vector in a high-dimensional colour space in order to separate the salient object from the background by finding an optimal linear combination of colour coefficients.

Recent studies [32, 36, 37] have tried to detect image saliency in the transform domain [16]. Frequency domain approaches for salient object detection have become popular owing to their fast computational speed. The first spectral domain approach for detecting saliency is due to Hou and Zhang [36], who computed image saliency in the frequency domain by comparing the dissimilarity of the characteristic spectrum with the perceived spectrum of greyscale images. Hou et al. [37] proposed an image signature (IS) descriptor to spatially approximate the sparse foreground position concealed in a spectrally sparse background. In 2008, Guo et al. [33] observed that the phase spectrum of the Fourier transform (PFT), in comparison with the amplitude spectrum, contributes more to locating the position of the salient object. Guo et al. [32, 33] extended the PFT model to a phase spectrum of quaternion Fourier transform (PQFT) model for multiple channels, representing the multi-dimensional data at each pixel as a quaternion. Yu et al. [92] used the lateral surround inhibition behaviour of neurons to compute saliency in an image, capturing this behaviour by utilizing the pulsed discrete cosine transform (PCT). In 2008, Bian and Zhang [7,8,9] proposed the frequency domain divisive normalization (FDN) model, which integrates the speed of frequency domain models with the topology of biologically based methods, and verified its biological predictions against spatial biological models. However, this model takes only a global surround into consideration. To relax the global surround constraint, Bian and Zhang [8, 9] extended the FDN model into piecewise frequency domain divisive normalization (PFDN) [8, 9] by separating the input image into overlapping local patches and conducting FDN on every patch, providing better biological plausibility. Fang et al. proposed the amplitude spectrum of quaternion Fourier transform (AQFT) [24] model and modelling from bitstream (BS) [25]. In 2013, Li et al. [60] proposed a hypercomplex Fourier transform (HFT)-based saliency detection approach which takes advantage of scale-space analysis of the amplitude spectrum. Li et al. [61] (SDS) designed a saliency detector by exploiting the phase of intermediate frequencies. Arya et al. [5] (HLGM) suggested a salient region detection approach integrating global and local saliency in the frequency domain using the fast Walsh–Hadamard transform (FWHT) and PFDN, respectively. Arya et al. [6] (BHGT) developed a grey-level co-occurrence matrix (GLCM)-based saliency framework in both the spatial and frequency domains.

Recently, the wavelet transform (WT) has been found to be useful in the field of salient object detection. In 2001, Tian et al. [87] proposed a salient point detector based on the wavelet transform; as this WT-based approach detects salient points rather than salient objects, a direct comparison with our salient object detection model is not possible. Murray et al. [68] (SIM) computed saliency based on a nonparametric low-level vision approach, where the scale information is integrated through a simple inverse wavelet transform over the set of extended contrast sensitivity function (ECSF) responses for each colour sub-band. The ECSF is a function of scale and centre-surround contrast energy that accounts for human sensitivity to local contrast and the energy ratio of the centre-surround regions. They also introduced training steps on both colour appearance and eye-fixation psychophysical data to reduce ad hoc parameters. İmamoğlu et al. [41] (WT) proposed a salient object detection model utilizing low-level features obtained in the discrete wavelet transform (DWT) domain to create multi-scale feature maps, which can represent different features from edges to texture. These multi-scale feature maps with increasing frequency bandwidths are obtained by applying the inverse wavelet transform to the band-pass filtered regions of the input image at various scales. Using these features, the local saliency at a location is modulated by its global saliency, calculated from the likelihood of the features, to generate the final saliency map. However, the DWT is not an optimal choice for creating feature maps, as it represents line (curve) singularities weakly owing to its limited number of directional sub-bands.

3 Proposed model

In this paper, we propose a double-density dual-tree complex wavelet transform (DDDTCWT)-based salient object detection model (DDDTCWT-SS), as illustrated in Fig. 1. In this framework, an input image is first segmented using a computationally efficient bipartite graph partitioning-based segmentation approach to capture the intrinsic structural information of the image. For each segment, saliency is computed based on multi-scale low-level edge-to-texture features extracted using the two-dimensional double-density dual-tree complex wavelet transform (2D DDDTCWT). Figure 1a, b shows the respective results of the segmentation procedure and the saliency map calculated using the DDDTCWT. As humans are sensitive to orientation features [66], an initial rough segmented saliency map (RSSM) (as shown in Fig. 1c) is generated by assigning a saliency value to each segment on the basis of the DDDTCWT-based saliency map, which comprises specific information from 32 detailed sub-band images, as shown in Fig. 3d. The mean intensity value of the DDDTCWT-based saliency map is then used as a threshold to obtain a rough estimate of the salient and background regions in the RSSM. In the initial RSSM, a pixel with an intensity value greater than or equal to the threshold is considered part of the salient object and is assigned the value ‘1’, while a pixel with an intensity value below the threshold is considered part of the background and is assigned the value ‘0’. In this way, we obtain the thresholded RSSM shown in Fig. 1d. To improve the accuracy of the thresholded initial RSSM, a Gaussian mixture model (GMM) is built over it for saliency re-estimation of each region. The GMM parameters are updated using the expectation maximization (EM) method to obtain the final saliency map (shown in Fig. 1e).

Fig. 1
figure 1

Proposed framework for detecting the most salient object in an image

The main contribution of the proposed method is to exploit the sensitivity of the human eye to orientation features using the multi-scale structure of the 2D DDDTCWT, since natural images comprise smooth regions punctuated by edges at several orientations. The proposed algorithm performs a multi-scale frequency analysis of the image by representing it at different resolutions, exploiting the zooming-in and zooming-out process of the HVS, and strikes a trade-off between detection accuracy and computation time, achieving better detection accuracy in less computation time. The pseudo-code of the proposed algorithm is given as follows.

figure d

3.1 Segmentation of an input image by aggregating superpixels using a bipartite graph

As demonstrated in recent studies [47, 51, 73, 80, 91], features from superpixels [27, 28] are effective and efficient for salient object detection. Superpixels group pixels into perceptually meaningful atomic regions that can replace the rigid structure of the pixel grid. They capture image redundancy, provide a convenient primitive for computing image features and greatly reduce the complexity of subsequent image processing tasks. Superpixels are key building blocks for many computer vision algorithms, such as image segmentation. There are many approaches to generating superpixels, each with its own advantages and drawbacks that may make it better suited to a particular application; for example, the graph-based method of Felzenszwalb and Huttenlocher [26] may be an ideal choice [1] to accurately capture image boundaries. Recently, Li et al. [59] proposed an improved image segmentation algorithm which takes advantage of the different and complementary information provided by various popular segmentation algorithms [20, 26, 81]. To fuse this complementary information, Li et al. collected a variety of superpixels generated by different segmentation algorithms with varying parameters; superpixels generated in this way help capture diverse and multi-scale visual patterns in the input image. To effectively aggregate these multi-layer superpixels, Li et al. proposed a bipartite graph partitioning-based segmentation framework constructed over both pixels and superpixels. The pixels and superpixels form the vertices of the bipartite graph, and edges between these vertices are established on the basis of superpixel cues and smoothness cues. To enforce superpixel cues, a pixel is connected to a superpixel if the pixel is part of that superpixel, while smoothness cues are enforced by connecting each superpixel to itself and to its nearest neighbour in feature space among its spatially adjacent superpixels. Li et al. [59] solve this bipartite graph segmentation framework efficiently using a linear-time spectral algorithm.

To map the relationship between pixels and superpixels, a bipartite graph is built which consists of two parts describing the pixel–superpixel and superpixel–superpixel relationships. In particular, taking into account the demand of sparsity for a good-quality graph, a pixel is connected to the superpixel containing it, and a superpixel is connected to neighbouring superpixels that are close in feature space. For a given image \( {\mathbf{I}} \), the set of pixels and the set of (multi-layer) superpixels are denoted by P and S, respectively. More precisely, let \( {\mathbf{G}} = \left\{ {\chi ,\gamma , {\mathbf{B}}} \right\} \) denote the bipartite graph, where \( \chi = P \cup S = \left\{ {\varvec{x}_{\varvec{i}} } \right\}_{i = 1}^{{N_{\chi } }} \) and \( \gamma = S = \left\{ {\varvec{y}_{\varvec{j}} } \right\}_{j = 1}^{{N_{\gamma } }} \), with \( N_{\chi } = \left| P \right| + \left| S \right| \) and \( N_{\gamma } = \left| S \right| \) being the numbers of nodes in \( \chi \) and \( \gamma \), respectively.

The across-affinity matrix \( {\mathbf{B}} = \left( {b_{ij} } \right)_{{N_{\chi } \times N_{\gamma } }} \) is defined as:

$$ b_{ij} = \left\{ {\begin{array}{*{20}l} {\alpha ,} \hfill & {{\text{if}}\;\varvec{x}_{\varvec{i}} \in \varvec{y}_{\varvec{j}} ,\;\varvec{x}_{\varvec{i}} \in P,\;\varvec{y}_{\varvec{j}} \in S,} \hfill \\ {e^{{ - \beta d_{ij} }} ,} \hfill & {{\text{if}}\;\varvec{x}_{\varvec{i}} \sim \varvec{y}_{\varvec{j}} ,\;\varvec{x}_{\varvec{i}} \in S,\;\varvec{y}_{\varvec{j}} \in S,} \hfill \\ {0,} \hfill & {{\text{otherwise}},} \hfill \\ \end{array} } \right. $$
(1)

where \( d_{ij} \) denotes the distance between the features of superpixels \( \varvec{x}_{\varvec{i}} \) and \( \varvec{y}_{\varvec{j}} \). We use \( d_{ij} = \left\| {\varvec{c}_{\varvec{i}} - \varvec{c}_{\varvec{j}} } \right\|_{2} \), where \( \varvec{c}_{\varvec{i}} \) and \( \varvec{c}_{\varvec{j}} \) represent the average colours of the pixels within superpixels \( \varvec{x}_{\varvec{i}} \) and \( \varvec{y}_{\varvec{j}} \), respectively, in the chosen colour space. \( \sim \) signifies a certain neighbourhood between superpixels: \( \varvec{x}\sim\varvec{y} \), \( \varvec{x} \in S \), \( \varvec{y} \in S \), if \( \varvec{x} = \varvec{y} \), or if \( \varvec{y} \) is adjacent to \( \varvec{x} \) and is most similar to \( \varvec{x} \) in terms of average colour. \( \alpha > 0 \) and \( \beta > 0 \) balance the superpixel and smoothness cues, respectively. With this construction, a pixel and the superpixel it belongs to are likely to be grouped together owing to the connection between them, and two superpixels that are close in feature space are also likely to be grouped together. A bipartite graph constructed in this manner also enforces smoothness over the superpixels. Using the bipartite graph \( {\mathbf{G}} \), the input image \( {\mathbf{I}} \) is segmented into \( k \) segments by accumulating nodes with the same label into a segment with the help of a spectral clustering algorithm. To segment an image into \( k \) groups, the \( k \) bottom eigenvectors of the generalized eigenvalue problem are computed as:

$$ {\mathbf{L}}\varvec{f} = \lambda {\mathbf{D}}\varvec{f,} $$
(2)

where \( {\mathbf{L}} \) and \( {\mathbf{D}} \) represent the graph Laplacian and the degree matrix, respectively; \( {\mathbf{D}} \) is calculated as \( {\mathbf{D}} = diag\left( {{\mathbf{B}}1} \right) \). Instead of solving this eigenvalue problem using the Lanczos method or singular value decomposition, which take \( O\left( {k\left( {N_{\chi } + N_{\gamma } } \right)^{3/2} } \right) \) running time [59], Li et al. exploited the unbalanced structure of the graph to solve it efficiently. The number of rows in the affinity matrix \( {\mathbf{B}} \) is much larger than the number of columns (\( N_{\chi } = N_{\gamma } + \left| {\mathbf{I}} \right| \) and \( \left| {\mathbf{I}} \right| \gg N_{\gamma } \)), so \( N_{\chi } \gg N_{\gamma } \). This large difference between the numbers of rows and columns clearly demonstrates the unbalanced structure of the bipartite graph. To exploit this structure, Li et al. proposed a transfer cut method that computes the bottom \( k \) eigenvectors in reduced time by transforming the eigenvalue problem into the following:

$$ {\mathbf{L}}_{\gamma } \varvec{v} = \eta {\mathbf{D}}_{\gamma } \varvec{v,} $$
(3)

where \( {\mathbf{L}}_{\gamma } = {\mathbf{D}}_{\gamma } - {\mathbf{W}}_{\gamma } \), \( {\mathbf{D}}_{\gamma } = diag\left( {{\mathbf{B}}^{{\mathbf{T}}} 1} \right) \) and \( {\mathbf{W}}_{\gamma } = {\mathbf{B}}^{{\mathbf{T}}} {\mathbf{D}}_{\chi }^{ - 1} {\mathbf{B}} \), with \( {\mathbf{D}}_{\chi } = diag\left( {{\mathbf{B}}1} \right) \). \( {\mathbf{L}}_{\gamma } \) is exactly the Laplacian of the graph \( {\mathbf{G}}_{\gamma } = \left\{ {\gamma ,{\mathbf{W}}_{\gamma } } \right\} \), since \( \varvec{D}_{\gamma } = diag\left( {{\mathbf{B}}^{{\mathbf{T}}} 1} \right) = diag\left( {{\mathbf{W}}_{\gamma } 1} \right) \), where \( 1 \) is the vector of ones of appropriate size. Partitioning the graph \( {\mathbf{G}} \) into \( k \) groups then takes \( O\left( {2k\left( {1 + d_{\chi } } \right)N_{\chi } + kN_{\gamma }^{{\frac{3}{2}}} } \right) \) time, where \( d_{\chi } \) is the average number of edges connected to each node in \( \chi \). Since our work addresses salient object detection, a comprehensive discussion of segmentation approaches is beyond the scope of this paper; we refer readers to the article of Li et al. [59] for a detailed treatment of this segmentation approach.
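To make these steps concrete, the following is a minimal sketch (not the authors' code) of the bipartite-graph construction of Eq. (1) and the reduced eigenvalue problem of Eq. (3), assuming the multi-layer superpixel memberships, their mean colours and their adjacency pairs have already been computed; `pixel_to_sp`, `sp_colours`, `sp_neighbours`, `alpha` and `beta` are placeholder inputs, and a generic sparse eigensolver stands in for the specialized linear-time solver of Li et al. [59].

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans


def transfer_cut_segmentation(pixel_to_sp, sp_colours, sp_neighbours,
                              alpha=1e-3, beta=20.0, k=5):
    """Sketch of bipartite-graph segmentation in the spirit of Li et al. [59].

    pixel_to_sp   : sparse (n_pixels, n_sp) 0/1 membership of pixels in superpixels
    sp_colours    : (n_sp, 3) mean colour of each superpixel
    sp_neighbours : iterable of (i, j) index pairs with superpixel i ~ superpixel j
    """
    n_px, n_sp = pixel_to_sp.shape

    # Across-affinity matrix B of Eq. (1): pixel-superpixel edges get weight
    # alpha, superpixel-superpixel edges get exp(-beta * d_ij).
    px_i, sp_j = pixel_to_sp.nonzero()
    rows, cols = list(px_i), list(sp_j)
    vals = [alpha] * len(px_i)
    for i, j in sp_neighbours:
        d_ij = np.linalg.norm(sp_colours[i] - sp_colours[j])
        rows.append(n_px + i); cols.append(j); vals.append(np.exp(-beta * d_ij))
    B = coo_matrix((vals, (rows, cols)), shape=(n_px + n_sp, n_sp)).tocsr()

    # Reduced problem of Eq. (3): L_y v = eta D_y v, with W_y = B^T D_x^{-1} B,
    # D_x = diag(B 1) and D_y = diag(B^T 1).
    d_x = np.asarray(B.sum(axis=1)).ravel()
    d_y = np.asarray(B.sum(axis=0)).ravel()
    W_y = B.T @ diags(1.0 / np.maximum(d_x, 1e-12)) @ B
    L_y = diags(d_y) - W_y

    # Bottom k generalized eigenvectors over the small superpixel graph
    # (a plain ARPACK call here; the paper uses a faster dedicated solver).
    eigvals, V = eigsh(L_y, k=k, M=diags(d_y), which='SM')

    # Transfer cut: for a reduced eigenpair (gamma, v), the full eigenvector
    # restricted to the pixel side is f = D_x^{-1} B v / (1 - lambda),
    # with lambda = 1 - sqrt(1 - gamma).
    lam = 1.0 - np.sqrt(np.clip(1.0 - eigvals, 0.0, None))
    F = diags(1.0 / np.maximum(d_x, 1e-12)) @ B @ V
    F = F / np.maximum(1.0 - lam, 1e-12)

    # Spectral clustering step: k-means on the pixel embeddings gives k segments.
    return KMeans(n_clusters=k, n_init=10).fit_predict(F[:n_px])
```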

To choose the most salient region among these \( k \) segmented regions \( {\mathbf{H}}_{p, } p = 1, \ldots k \), the saliency value of each region, \( {\mathbf{H}}_{p } \), needs to be computed. To find the saliency value of each region, we utilize double-density dual-tree complex wavelet transform (DDDTCWT), which is discussed in Sect. 3.2.

3.2 Double-density dual-tree complex wavelet transform (DDDTCWT)-based saliency detection model

Psychophysical investigation has shown that the HVS performs a multi-scale frequency analysis when we observe an image [21, 87]. This mechanism is similar to the zooming-in and zooming-out process of the HVS. As natural images exhibit smooth regions punctuated by edges at several orientations, the discrete wavelet transform (DWT) may fail to represent the geometric regularity along the singularities, which requires higher directional selectivity. In order to overcome the limited directional selectivity of the traditional 2D DWT, we utilize the double-density dual-tree complex wavelet transform (DDDTCWT) for salient object detection for the first time in the literature. The DDDTCWT possesses the properties of both the dual-tree complex wavelet transform (DTCWT) and the double-density DWT (DDDWT). The two transforms share several properties: both are nearly shift invariant, both are over-complete by a factor of two, and both are based on FIR perfect reconstruction filter banks. However, the two wavelets used in the dual-tree DWT form an approximate Hilbert transform pair, while the two wavelets used in the double-density DWT are offset by one half. We briefly describe the DTCWT and DDDWT in the following sections.

3.2.1 Dual-tree complex wavelet transform (DTCWT)

The dual-tree complex wavelet transform (DTCWT) [79] has more directional sub-bands than the 2D DWT, which has only three directional sub-bands oriented at \( 0^{^\circ } \), \( 45^{^\circ } \) and \( 90^{^\circ } \). The 1D DTCWT is implemented using two real discrete wavelet transforms \( \varPsi_{h} \left( t \right) \) and \( \varPsi_{g} \left( t \right) \), which are employed in parallel to generate the real and imaginary parts of the complex wavelet \( \varPsi \left( t \right) = \varPsi_{h} \left( t \right) + j\varPsi_{g} \left( t \right) \). Here, the complex wavelet \( \varPsi \left( t \right) \) is approximately analytic, as \( \varPsi_{g} \left( t \right) \) is approximately the Hilbert transform of \( \varPsi_{h} \left( t \right) \), i.e. \( \varPsi_{g} \left( t \right) \approx H\left( {\varPsi_{h} \left( t \right)} \right) \).
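As a toy illustration of the Hilbert-pair property \( \varPsi_{g}(t) \approx H\{\varPsi_{h}(t)\} \) (using a generic narrowband Gabor-like wavelet, not the actual DTCWT filters), the snippet below compares a numerically computed Hilbert transform with the expected quadrature component:

```python
import numpy as np
from scipy.signal import hilbert

# Toy narrowband "wavelet": psi_h(t) = exp(-t^2/2) * cos(5t).
t = np.linspace(-8, 8, 4096)
psi_h = np.exp(-t**2 / 2) * np.cos(5 * t)

# For a narrowband signal, the Hilbert transform of the cosine carrier is the
# sine carrier, so the quadrature wavelet should be close to exp(-t^2/2) sin(5t).
psi_g_expected = np.exp(-t**2 / 2) * np.sin(5 * t)
psi_g_numeric = np.imag(hilbert(psi_h))   # imaginary part of the analytic signal

print(np.max(np.abs(psi_g_numeric - psi_g_expected)))   # prints a small number
```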

The 2D DTCWT is realized by filtering an image separately along rows and columns: two trees are used for the rows of the image and two for the columns. This process computes 12 sub-bands at each scale in six main directions (± 15°, ± 45° and ± 75°), with two wavelets in each direction. One of the two wavelets can be interpreted as the real part of a complex-valued 2D wavelet and the other as the imaginary part. Unlike the 2D DWT, all of these wavelets are free of the chequerboard artefact. However, six directions are still not sufficient to represent an arbitrary geometric object with high precision. To capture information from more directions, Selesnick proposed the double-density discrete wavelet transform [77], which is explained in Sect. 3.2.2.

3.2.2 Double-density discrete wavelet transform (DDDWT)

The double-density discrete wavelet transform [77] utilizes one scaling function and two distinct wavelets designed to be offset from one another by one half (\( \varPsi_{2} \left( t \right) = \varPsi_{1} \left( {t - 0.5} \right) \)). It is approximately shift invariant and provides perfect reconstruction with limited redundancy. For 2D images, this transform outperforms the standard DWT and the DTCWT, as both have fewer degrees of freedom than the DDDWT. The procedure of the 2D DDDWT is shown in Fig. 2a. The 2D DDDWT is realized by applying the transform first to the rows and then to the columns of the image. After this process, one approximate sub-band image and eight detail sub-band images are obtained, describing information in eight distinct directions, as shown in Fig. 2a.

Fig. 2
figure 2

a Filter bank structures for two-dimensional DDDWT, b filter bank for two-dimensional DDDTCWT

3.2.3 Double-density dual-tree complex wavelet transform (DDDTCWT)

Although the 2D DDDWT utilizes more wavelets, some of them lack a dominant spatial orientation, which prevents them from isolating specific directions. To overcome this problem, Selesnick [78] suggested the double-density dual-tree DWT (DDDTCWT), which combines the characteristics of the double-density DWT (DDDWT) and the dual-tree complex DWT (DTCWT). The DDDTCWT employs two different scaling functions \( \varPhi_{h} \left( t \right) \) and \( \varPhi_{g} \left( t \right) \) and four distinct wavelet functions \( \varPsi_{h,j} \left( t \right) \), \( \varPsi_{g,j} \left( t \right)\left( {j = 1,2} \right) \), where the two wavelets of each tree are offset from one another by one half:

$$ \varPsi_{h,1} \left( t \right) = \varPsi_{h,2} \left( {t - 0.5} \right), \varPsi_{g,1} \left( t \right) = \varPsi_{g,2} \left( {t - 0.5} \right) $$
(4)

and the two wavelets \( \varPsi_{h,1} \left( t \right) \)and\( \varPsi_{g,1} \left( t \right) \) form an approximate Hilbert transform pair:

$$ \varPsi_{g,1} \left( t \right) \approx H\left\{ {\varPsi_{h,1} \left( t \right)} \right\}. $$
(5)

Similarly, the two wavelets \( \varPsi_{h,2} \left( t \right) \), \( \varPsi_{g,2} \left( t \right) \) form an approximate Hilbert transform pair:

$$ \varPsi_{g,2} \left( t \right) \approx H\left\{ {\varPsi_{h,2} \left( t \right)} \right\}. $$
(6)

The properties satisfied by the four wavelet functions ensure that the DDDTCWT has improved directional selectivity. The 2D DDDTCWT is realized by applying four oversampled 2D DDDWTs in parallel to the same image, with different filter sets for the rows and columns, and then taking sums and differences of the resulting sub-bands. This gives rise to 36 2D sub-band images, as shown in Fig. 2b: four are 2D low-pass sub-bands and the other 32 are 2D high-pass (detail) sub-bands, which describe more specific information in 16 directions. The procedure of the two-level 2D DDDTCWT is shown in Fig. 2b.

The impulse responses of DWT, DTCWT, DDDWT and DDDTCWT are shown in Fig. 3a–d, respectively.

Fig. 3
figure 3

Impulse responses of a DWT, b DTCWT, c DDDWT, d DDDTCWT

The filter banks are applied recursively to the low-pass sub-band, using the analysis filters for the forward transform and the synthesis filters for the inverse transform; the synthesis filters are the time-reversed versions of the analysis filters. The filter bank structure can be implemented using FIR (finite impulse response) perfect reconstruction filter banks. The 32 detailed sub-band images generated by the 2D DDDTCWT are considered sufficient to represent any geometric object with high precision [95].

In Fig. 4a, an image containing a curve is purposely designed to demonstrate the improved directionality of the DDDTCWT [95]. As shown in Fig. 4b, DWT reconstructions can only accurately represent vertical and horizontal lines. The reconstructed curve looks smoother with reduced artefacts in Fig. 4c, d owing to the additional directional sub-bands of the DTCWT and DDDWT, respectively. The reconstructed image shown in Fig. 4e is much closer to the original image owing to the larger number of orientation sub-bands of the DDDTCWT. Therefore, in this paper, we utilize the DDDTCWT to compute edge-to-texture feature maps.

Fig. 4
figure 4

Improved directionality of double-density dual-tree complex wavelet transform: a original test image; reconstructed image using only the lowest level coefficients of b DWT, c DTCWT, d DDDWT, e DDDTCWT, f zoomed-in original image; zoomed-in reconstructed image using only the lowest level coefficients of g DWT, h DTCWT, i DDDWT, j DDDTCWT. Grey level is normalized between [0, 1] for all images and 4-level transform is used

3.2.4 Features extraction using double-density dual-tree complex wavelet transform (DDDTCWT)

To generate feature maps, the input RGB image \( {\mathbf{I}} \) is first converted into the CIE Lab colour space, which is much closer to human vision. Then, the input image \( {\mathbf{I}} \) is filtered using a low-pass filter to remove noise:

$$ {\mathbf{I}}^{\prime } = {\mathbf{I}}*{\mathbf{W}}, $$
(7)

where \( {\mathbf{W}} \) is a 2D Gaussian low-pass filter of size 3 × 3, \( {\mathbf{I^{\prime}}} \) is the filtered image, and \( * \) represents the convolution operation. Using Eq. 8, we compute the 2D DDDTCWT of the filtered image \( {\mathbf{I}}^{\prime } \). The wavelet coefficients are stored in a cell array data structure \( o\left\{ s \right\}\left\{ g \right\}\left\{ t \right\}\left\{ d \right\} \), for \( s = 1,2, \ldots ,N \), \( g = 1,2 \), \( t = 1,2 \) and \( d \in \left\{ {1, \ldots ,8} \right\} \), where \( g \) indicates the real or imaginary part (1 or 2, respectively) and \( \left( {t,d} \right) \) represents the orientation.

$$ \left[ {\varvec{A}_{\varvec{N}}^{\varvec{c}} , \left( {\varvec{o}\left\{ \varvec{s} \right\}\left\{ \varvec{g} \right\}\left\{ \varvec{t} \right\}\left\{ \varvec{d} \right\}} \right)_{\varvec{c}} } \right] = DDDTCWT_{N} \left( { {\mathbf{I^{\prime}}}} \right), $$
(8)

where the 2D DDDTCWT is implemented using the analysis filters (as explained in Sect. 3.2.3) for an N-level decomposition; N denotes the maximum scaling number for the DDDTCWT, i.e. the resolution index \( s \in \left\{ {1, \ldots ,N} \right\} \), with the \( N \)th level being the coarsest resolution; c is the channel of \( {\mathbf{I}}^{\prime } \), \( c \in \left\{ {L, a, b} \right\} \); and \( A_{N}^{c} \) denotes the approximation output at the \( N \)th level for channel \( c \). In total, the DDDTCWT isolates edges by capturing information from 32 detailed high-pass sub-band images.

The detailed wavelet coefficients (neglecting the approximation coefficients \( A_{N}^{c} \)) are used to compute several feature maps, signifying contrast from edge to texture, by means of the inverse double-density dual-tree complex wavelet transform (IDDDTCWT), which is implemented using the synthesis filters of the DDDTCWT as explained in Sect. 3.2.3. The feature map for a given pixel, channel c and \( s \)th-level decomposition is computed using \( {\text{IDDDTCWT}}\left( \cdot \right) \) as:

$$ f_{s}^{c} \left( {x,y} \right) = \frac{{\left( {{\text{IDDDTCWT}}_{s} \left( {\left( {\varvec{o}\left\{ \varvec{s} \right\}\left\{ \varvec{g} \right\}\left\{ \varvec{t} \right\}\left\{ \varvec{d} \right\}} \right)_{\varvec{c}} } \right)} \right)^{2} }}{\eta }, $$
(9)

where \( \eta \) is a scaling factor. For a given input image, we obtain \( 3 \times N \) feature maps. These feature maps are then used to generate a saliency map, \( S\left( {x,y} \right) \), as follows:

$$ S\left( {x,y} \right) = \left( {\mathop \sum \limits_{s = 1}^{N} arg max\left( {f_{s}^{L} \left( {x,y} \right), f_{s}^{a} \left( {x,y} \right), f_{s}^{b} \left( {x,y} \right) } \right)} \right)*{\mathbf{P}}, $$
(10)

where \( f_{s}^{L} \left( {x,y} \right) \), \( f_{s}^{a} \left( {x,y} \right) \) and \( f_{s}^{b} \left( {x,y} \right) \) represent feature maps for \( L , a, \) and \( b \) channels, respectively, at scale \( s \) and \( {\mathbf{P}} \) is a 2D low-pass Gaussian filter.
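The preprocessing and feature-map computation of Eqs. (7)-(10) can be sketched as below. The transform wrappers `dddtcwt2` and `idddtcwt2` are assumed interfaces to an existing DDDTCWT implementation (e.g. Selesnick's filter banks) and are not defined here; `rgb2lab` and `gaussian_filter` come from scikit-image and SciPy, and the channel-wise arg max of Eq. (10) is read here as taking the maximum response over the three channels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.color import rgb2lab


def dddtcwt_saliency(img_rgb, dddtcwt2, idddtcwt2, N=4, eta=1.0):
    """Sketch of the DDDTCWT-based saliency map of Eqs. (7)-(10).

    dddtcwt2(channel, N) -> (approx, detail) coefficients for N levels (assumed API)
    idddtcwt2(detail, s) -> band-pass reconstruction using only the detail
                            coefficients of scale s (assumed API)
    """
    # Eq. (7): Lab conversion followed by a small Gaussian low-pass filter W.
    img_lab = gaussian_filter(rgb2lab(img_rgb), sigma=(0.5, 0.5, 0))

    H, W, _ = img_lab.shape
    feature_maps = np.zeros((3, N, H, W))
    for c in range(3):                                   # L, a, b channels
        _approx, detail = dddtcwt2(img_lab[..., c], N)   # Eq. (8)
        for s in range(N):
            # Eq. (9): squared band-pass reconstruction from the scale-s details.
            feature_maps[c, s] = idddtcwt2(detail, s) ** 2 / eta

    # Eq. (10): per-pixel maximum over channels, summed over scales,
    # then smoothed with a 2D Gaussian low-pass filter P.
    sal = gaussian_filter(feature_maps.max(axis=0).sum(axis=0), sigma=3)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
```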

3.3 Initial rough segmented saliency map (RSSM) generation

We use the DDDTCWT coefficients to locate the salient object in an image, while superpixel segmentation is utilized to improve the object contours. Given the segmented regions \( H_{p}, p = 1, \ldots ,k \), where \( k \) is the number of segmented regions, the average intensity of each region \( H_{p} \) is computed from the corresponding DDDTCWT-based saliency map values of the region. Each pixel \( x \in H_{p} \) is assigned the average intensity value \( v \), computed as:

$$ v = \mathop \sum \limits_{i = 1}^{{\left| {H_{p } } \right|}} v_{i} /\left| {H_{p} } \right|,\quad \forall x \in H_{p, } , p = 1, \ldots ,k, $$
(11)

where \( v_{i} \) is the intensity value of the \( i \)th pixel \( x_{i} \) and \( \left| {H_{p} } \right| \) is the number of pixels in region \( H_{p} \). In this way, an initial rough segmented saliency map (RSSM) is obtained in which each segment is assigned a saliency value calculated from the DDDTCWT coefficients. Then, the average intensity value of the DDDTCWT-based saliency map is used as a threshold to obtain a rough estimate of the salient and background regions in the RSSM. If a pixel intensity value in the initial RSSM is greater than or equal to the threshold, the pixel is considered salient and assigned a value of ‘1’; otherwise it is considered background and assigned ‘0’. Examining the RSSM, we note that for some images parts of the salient objects are not highlighted. This may occur for the following reasons: (1) some pixels may be misclassified by the bipartite graph partitioning-based segmentation approach, or (2) some pixels may be wrongly detected as part of the background while actually belonging to the salient object (or vice versa) by the proposed DDDTCWT-based saliency detection method.
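A minimal sketch of the RSSM construction (Eq. 11 plus the thresholding step), assuming `sal` is the DDDTCWT-based saliency map from Sect. 3.2 and `labels` the segment map from Sect. 3.1:

```python
import numpy as np


def rough_segmented_saliency_map(sal, labels):
    """Eq. (11): give each segment the mean saliency of its pixels, then
    threshold at the global mean of the saliency map (1 = salient, 0 = background)."""
    rssm = np.zeros_like(sal, dtype=float)
    for p in np.unique(labels):
        mask = labels == p
        rssm[mask] = sal[mask].mean()        # average intensity v of region H_p
    return (rssm >= sal.mean()).astype(float)
```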

To further improve the accuracy of the initial RSSM, the Gaussian mixture model (GMM) is built over RSSM to re-estimate the saliency of each region. The parameters of GMM are updated using expectation maximization (EM) method to obtain the final saliency map, which is discussed in Sect. 3.4.

3.4 Refinement of initial rough segmented saliency map (RSSM) using Gaussian mixture model (GMM) and expectation maximization (EM) algorithm

A Gaussian mixture model (GMM) is useful for modelling data that come from one of several groups: the groups may differ from each other, but data points within the same group are well modelled by a Gaussian distribution. The parameters of the GMM are the strengths (weights), means and covariances of the Gaussian components. The initial RSSM, obtained in Sect. 3.3, has two regions: salient and background. The pixels of these two regions are regarded as samples from two different Gaussian kernels. A GMM is then constructed over these two regions, with the parameters of the Gaussian components (means, covariances and strengths) estimated using expectation maximization (EM). The EM algorithm estimates the parameters of the multivariate probability density function in the form of a Gaussian mixture distribution with a specified number of components. The GMM initialization for the EM algorithm is done as follows. The strengths \( w_{1} \) and \( w_{2} \) of the two Gaussian components are given as:

$$ w_{1}^{0} = \frac{{n_{1} }}{{n_{1} + n_{2} }}\quad {\text{and}}\quad w_{2}^{0} = \frac{{n_{2} }}{{n_{1} + n_{2} }}, $$
(12)

where \( n_{1} = \mathop \sum \nolimits_{p \in P} {\text{RSSM}}\left( p \right) \), \( n_{2} = \mathop \sum \nolimits_{p \in P} \left( {1 - {\text{RSSM}}\left( p \right)} \right) \) and \( P \) denotes the set of image pixels. Since the RSSM separates the two regions spatially, their initial means \( \varvec{\mu}_{1}^{0} \) and \( \varvec{\mu}_{2}^{0} \) are given by the image values \( \varvec{I}\left( {\varvec{U}_{1}^{0} } \right) \) and \( \varvec{I}\left( {\varvec{U}_{2}^{0} } \right) \) at the spatial centroids \( \varvec{U}_{1}^{0} \) and \( \varvec{U}_{2}^{0} \), respectively, where

$$ \varvec{U}_{1}^{0} = \frac{1}{{n_{1} }}\mathop \sum \limits_{{{\mathbf{U}} \in SC_{xy} }} \varvec{U}.{\text{RSSM}}\left( U \right)\quad {\text{and}}\quad \varvec{U}_{2}^{0} = \frac{1}{{n_{2} }}\mathop \sum \limits_{{{\mathbf{U}} \in SC_{xy} }} \varvec{U}.\left( {1 - {\text{RSSM}}\left( U \right)} \right), $$
(13)

where \( {\mathbf{SC}}_{xy} \) are image spatial coordinates. Their covariances \( {\varvec{\Sigma}}_{1} \) and \( {\varvec{\Sigma}}_{2} \) are given as:

$$ \begin{aligned} & {\varvec{\Sigma}}_{1}^{0} = \frac{1}{{n_{1} - 1}}\mathop \sum \limits_{p \in P} \left( {\varvec{I}\left( p \right) -\varvec{\mu}_{1}^{0} } \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{1}^{0} } \right)^{T} .{\text{RSSM}}\left( p \right), \\ & {\varvec{\Sigma}}_{2}^{0} = \frac{1}{{n_{2} - 1}}\mathop \sum \limits_{p \in P} \left( {\varvec{I}\left( p \right) -\varvec{\mu}_{2}^{0} } \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{2}^{0} } \right)^{T} .\left( {1 - {\text{RSSM}}\left( p \right)} \right). \\ \end{aligned} $$
(14)
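A hedged sketch of this initialization (Eqs. 12-14); `img` is the colour image as an (H, W, 3) array and `rssm` the binary map from Sect. 3.3. The reading of Eq. (13), in which the initial means are the image colours sampled at the spatial centroids of the two regions, follows the notation above.

```python
import numpy as np


def init_gmm(img, rssm):
    """GMM initialization of Eqs. (12)-(14) over the two RSSM regions."""
    H, W, _ = img.shape
    n1, n2 = rssm.sum(), (1.0 - rssm).sum()

    # Eq. (12): mixture strengths from the two region sizes.
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)

    # Eq. (13): spatial centroids of the two regions; the initial means are the
    # image colours sampled at those centroids.
    ys, xs = np.mgrid[0:H, 0:W]
    u1 = np.array([(ys * rssm).sum(), (xs * rssm).sum()]) / n1
    u2 = np.array([(ys * (1 - rssm)).sum(), (xs * (1 - rssm)).sum()]) / n2
    mu1 = img[int(u1[0]), int(u1[1])].astype(float)
    mu2 = img[int(u2[0]), int(u2[1])].astype(float)

    # Eq. (14): RSSM-weighted sample covariances of the pixel colours.
    X = img.reshape(-1, 3).astype(float)
    r = rssm.reshape(-1)
    d1, d2 = X - mu1, X - mu2
    cov1 = (d1 * r[:, None]).T @ d1 / (n1 - 1)
    cov2 = (d2 * (1 - r)[:, None]).T @ d2 / (n2 - 1)
    return (w1, mu1, cov1), (w2, mu2, cov2)
```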

After the initialization step, the GMM parameters are updated using the EM algorithm until convergence. The probability of a pixel \( p \) belonging to either cluster, using the current parameters at the \( l \)th iteration, is calculated as:

$$ \begin{aligned} & Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right) = \frac{{w_{1}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1}^{l} ,{\varvec{\Sigma}}_{1}^{l} } \right)}}{{w_{1}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1}^{l} ,{\varvec{\Sigma}}_{1}^{l} } \right) + w_{2}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2}^{l} ,{\varvec{\Sigma}}_{2}^{l} } \right)}}, \\ & Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right) = \frac{{w_{2}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2}^{l} ,{\varvec{\Sigma}}_{2}^{l} } \right)}}{{w_{1}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1}^{l} ,{\varvec{\Sigma}}_{1}^{l} } \right) + w_{2}^{l} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2}^{l} ,{\varvec{\Sigma}}_{2}^{l} } \right)}}. \\ \end{aligned} $$
(15)

The parameters of both Gaussian signals are updated in the following manner:

$$ w_{1}^{l + 1} = \frac{1}{W \times H}\mathop \sum \limits_{p \in P} Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right)\quad {\text{and}}\quad w_{2}^{l + 1} = \frac{1}{W \times H}\mathop \sum \limits_{p \in P} Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right), $$
$$ \varvec{\mu}_{1}^{l + 1} = \frac{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right).\varvec{I}\left( p \right)}}{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right)}}\quad {\text{and}}\quad\varvec{\mu}_{2}^{l + 1} = \frac{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right).\varvec{I}\left( p \right)}}{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right)}}, $$
$$ {\varvec{\Sigma}}_{1}^{l + 1} = \frac{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{1}^{l} } \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{1}^{l} } \right)^{T} }}{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {1 |\varvec{I}\left( p \right)} \right)}} $$

and

$$ {\varvec{\Sigma}}_{2}^{l + 1} = \frac{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{2}^{l} } \right).\left( {\varvec{I}\left( p \right) -\varvec{\mu}_{2}^{l} } \right)^{T} }}{{\mathop \sum \nolimits_{p \in P} Prob^{l} \left( {2 |\varvec{I}\left( p \right)} \right)}}, $$
(16)

where \( W \) denotes the width and \( H \) the height of the image. The log-likelihood at the \( \left( {l + 1} \right) \)th iteration is computed as:

$$ loglikelihood^{l + 1} = \mathop \sum \limits_{p \in P} \left( {{ \log }\left( {w_{1}^{l + 1} .{\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1}^{l + 1} ,{\varvec{\Sigma}}_{1}^{l + 1} } \right) + w_{2}^{l + 1} .{\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2}^{l + 1} ,{\varvec{\Sigma}}_{2}^{l + 1} } \right)} \right)} \right). $$
(17)

The inequality for the convergence condition is given as:

$$ {\text{abs}}\left( {loglikelihood^{l + 1} - loglikelihood^{l} } \right) < 1.0e - 3. $$
(18)

On completion of the updating procedure, the final parameter values of the GMM are used to assign each pixel \( p \) to the two regions (salient or background) with a probability given as:

$$ \begin{aligned} & Prob^{final} \left( {1 |\varvec{I}\left( p \right)} \right) = \frac{{w_{1} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1} ,{\varvec{\Sigma}}_{1} } \right)}}{{w_{1} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1} ,{\varvec{\Sigma}}_{1} } \right) + w_{2} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2} ,{\varvec{\Sigma}}_{2} } \right)}}, \\ & Prob^{final} \left( {2 |\varvec{I}\left( p \right)} \right) = \frac{{w_{2} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2} ,{\varvec{\Sigma}}_{2} } \right)}}{{w_{1} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{1} ,{\varvec{\Sigma}}_{1} } \right) + w_{2} {\mathcal{N}}\left( {\varvec{I}\left( p \right) |\varvec{\mu}_{2} ,{\varvec{\Sigma}}_{2} } \right)}}. \\ \end{aligned} $$
(19)

The two segments are weighted based on a centre prior computed as:

$$ Center\_prior\left( i \right) = \mathop \sum \limits_{p \in P} Prob^{final} \left( {i |\varvec{I}\left( p \right)} \right).sdist_{p} ;i = 1,2, $$
(20)

where \( sdist_{p} \) is the spatial distance of the \( p \)th pixel from the image centre. \( Center\_prior\left( i \right) \) is normalized to \( \left[ {\left. {0,1} \right)} \right. \) as

$$ Center\_prior\left( i \right) = \frac{{Center\_prior\left( i \right) - min\left\{ {Center\_prior\left( j \right)} \right\}_{j = 1}^{2} }}{{max\left\{ {Center\_prior\left( j \right)} \right\}_{j = 1}^{2} }}. $$
(21)

The saliency value of a pixel \( p \) is given as:

$$ {\text{SM}}\left( p \right) = \mathop \sum \limits_{i = 1}^{2} Prob^{final} \left( {i |\varvec{I}\left( p \right)} \right).\left( {1 - Center\_prior\left( i \right)} \right), $$
(22)

where SM represents the final saliency map generated by the proposed algorithm, which is normalized in the range 0–1.
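The EM refinement and centre-prior fusion of Eqs. (15)-(22) can be sketched as follows, continuing from the `init_gmm` sketch above; `scipy.stats.multivariate_normal` supplies the Gaussian densities. This is one reading of the equations, not the authors' implementation (the covariance update below uses the freshly updated means, which is the standard EM M-step, whereas Eq. (16) is written with the previous means).

```python
import numpy as np
from scipy.stats import multivariate_normal


def refine_saliency(img, rssm, max_iter=100, tol=1e-3):
    """EM refinement of the rough map (Eqs. 15-19) followed by the
    centre-prior weighting (Eqs. 20-22) that yields the final map SM."""
    H, W, _ = img.shape
    X = img.reshape(-1, 3).astype(float)
    (w1, mu1, S1), (w2, mu2, S2) = init_gmm(img, rssm)   # Eqs. (12)-(14)

    def pdfs():
        g1 = multivariate_normal(mu1, S1, allow_singular=True).pdf(X)
        g2 = multivariate_normal(mu2, S2, allow_singular=True).pdf(X)
        return w1 * g1, w2 * g2

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step, Eq. (15): posterior responsibilities of the two Gaussians.
        p1, p2 = pdfs()
        total = p1 + p2 + 1e-300
        r1, r2 = p1 / total, p2 / total

        # M-step, Eq. (16): update strengths, means and covariances.
        w1, w2 = r1.mean(), r2.mean()            # 1/(W*H) * sum of posteriors
        mu1 = (r1[:, None] * X).sum(0) / r1.sum()
        mu2 = (r2[:, None] * X).sum(0) / r2.sum()
        d1, d2 = X - mu1, X - mu2
        S1 = (d1 * r1[:, None]).T @ d1 / r1.sum()
        S2 = (d2 * r2[:, None]).T @ d2 / r2.sum()

        # Eqs. (17)-(18): stop when the log-likelihood change falls below tol.
        p1, p2 = pdfs()
        ll = np.log(p1 + p2 + 1e-300).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    # Eq. (19): final posteriors for every pixel.
    p1, p2 = pdfs()
    post = np.stack([p1, p2], axis=1) / (p1 + p2 + 1e-300)[:, None]

    # Eqs. (20)-(21): centre prior of each component from the posterior-weighted
    # distance of the pixels to the image centre, normalized to [0, 1).
    ys, xs = np.mgrid[0:H, 0:W]
    sdist = np.sqrt((ys - H / 2.0) ** 2 + (xs - W / 2.0) ** 2).reshape(-1)
    cp = post.T @ sdist
    cp = (cp - cp.min()) / (cp.max() + 1e-12)

    # Eq. (22): components far from the image centre are down-weighted.
    sm = post @ (1.0 - cp)
    sm = (sm - sm.min()) / (sm.max() - sm.min() + 1e-12)
    return sm.reshape(H, W)
```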

4 Experimental set-up and results

In this section, we evaluate and compare the performances of our method (DDDTCWT-SS) against 29 state-of-the-art algorithms on six representative benchmark datasets.

Salient object datasets and state-of-the-art methods

Six benchmark datasets are used for evaluation: the commonly used Microsoft Research Asia Salient Object Database (MSRA SOD) image set B (5000 images), the Achanta Saliency Database (ASD) (1000 images), SAA_GT (5000 images), SOD (500 images) and SED, which consists of two parts, SED1 (one-object set) and SED2 (two-object set), each containing 100 images. All images are of size 400 × 300 or 300 × 400 with intensity values in [0, 255]. We compare the proposed method (DDDTCWT-SS) with 29 state-of-the-art salient object detection methods: IT [44], AIM [15], GBVS [35], SR [36], SLRG [64, 65], SUN [94], FT [3], ASS [2], CASD [30], LRK [80], SIM [68], WT [41], PFT [32], STREE [63], MSA [96], PCT [92], FDN [8, 9], PFDN [8, 9], PQFT [33], AQFT [24], BS [25], HFT [60], SDS [61], HLGM [5], IS [37], BSM [90], MCA [86], HDCT [51] and BHGT [6].

Experimental set-up All experiments are carried out in a Windows 7 environment on an Intel(R) Xeon(R) processor running at 2.27 GHz with 4 GB RAM.

4.1 Qualitative performance

The qualitative analysis of our algorithm (DDDTCWT-SS) with 29 other state-of-the-art recently proposed saliency models is presented in Fig. 5.

Fig. 5
figure 5figure 5

Qualitative comparison of the proposed model with existing 29 models. S spatial domain, FD frequency domain, W wavelet domain

We include these models based on their relevance to our work, recency and the availability of their saliency maps. We randomly choose three images from the MSRA dataset, two from SOD, two from SED1 and two from SED2 for qualitative comparison. Figure 5 shows that the proposed model (DDDTCWT-SS) achieves better saliency maps than the 29 state-of-the-art methods. Saliency maps generated by the proposed algorithm for some additional challenging and complex images are shown in Fig. 6.

Fig. 6
figure 6

Saliency maps generated by the proposed algorithm for some challenging and complex images

4.2 Quantitative performance

We evaluate the quantitative performance of the proposed model (DDDTCWT-SS) against 29 other state-of-the-art models in terms of precision, recall, F-measure, area under curve (AUC) and computation time. The performance of the proposed model is also evaluated in terms of mean precision, mean recall, mean F-measure and mean AUC over all six datasets. The outcome of the salient object detection procedure is a saliency map. To compare the quality of saliency maps in terms of precision, recall, F-measure and area under curve for the task of segmenting salient objects, we rely on a ground truth database. A suitable threshold \( t \) [5, 82] is first applied to the saliency map of an image \( {\mathbf{I}} \) to generate an attention mask \( {\mathbf{R}} \) (also called the ‘predicted condition’ or detection result), and then \( {\mathbf{R}} \) is compared with the ground truth \( {\mathbf{G}} \) (the ‘true condition’) associated with \( {\mathbf{I}} \). Both \( {\mathbf{R}} \) and \( {\mathbf{G}} \) consist of pixels with only two values, 0 and 1. Based on the values of \( {\mathbf{R}} \) and \( {\mathbf{G}} \), the following terms are defined:

  • The true positive (TP) is the number of pixels correctly detected in \( {\mathbf{R}} \) as belonging to the salient object in ground truth image \( {\mathbf{G}} \).

  • False positive (FP) is the number of pixels wrongly detected as salient in \( {\mathbf{R}} \) as belonging to the background in the ground truth image \( {\mathbf{G}} \).

  • False negative (FN) is the number of pixels incorrectly detected as background in \( {\mathbf{R}} \) as belonging to the salient object in the ground truth image \( {\mathbf{G}} \).

  • True negatives (TN) are the pixels correctly detected as background in \( {\mathbf{R}} \) as belonging to the background in the ground truth image \( {\mathbf{G}} \).

The obtained attention mask \( {\mathbf{R}} \) and ground truth map \( {\mathbf{G}} \) are used to compute the precision, recall and F-measure as:

$$ {\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}\quad {\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}\quad F_{\beta } = \frac{{\left( {1 + \beta^{2} } \right) \times {\text{Precision}} \times {\text{Recall}}}}{{\beta^{2} \times {\text{Precision}} + {\text{Recall}}}}, $$
(23)

where \( {\text{TP}} = \mathop \sum \limits_{{G\left( {x,y} \right) = 1}} R\left( {x,y} \right)\quad {\text{FP}} = \mathop \sum \limits_{{G\left( {x,y} \right) = 0}} R\left( {x,y} \right)\quad {\text{FN}} = \mathop \sum \limits_{{R\left( {x,y} \right) = 0}} G\left( {x,y} \right). \)

\( \beta \) is chosen to be 1 to give equal importance to both precision and recall. The ROC curve is used to measure the similarity between the saliency map and the ground truth, and the area under the curve (AUC) is used for quantitative comparison between different models. The ROC curve is generated by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis [82]. TPR and FPR are given by

$$ {\text{TPR}} = \frac{\text{TP}}{{\mathop \sum \nolimits_{{\left( {x,y} \right)}} G\left( {x,y} \right)}}\quad {\text{FPR}} = \frac{\text{FP}}{{W \times H - \mathop \sum \nolimits_{{\left( {x,y} \right)}} G\left( {x,y} \right)}}, $$
(24)

where \( W \) and \( H \) represent the width and height of the image, respectively. A model is considered good if it achieves high values of precision, recall, F-measure and AUC. Tables 1, 2, 3, 4 and 5 show the quantitative performance of the proposed method in comparison with the other 29 state-of-the-art methods on all six datasets in terms of precision, recall, F-measure, AUC and average computation time per image, respectively. The mean value of each quantitative measure over all six datasets is also shown in the respective table for each model.
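These measures (Eqs. 23-24) reduce to a few lines of array code; the sketch below assumes a binarized attention mask `R` and a ground truth mask `G` of the same size with values in {0, 1}.

```python
import numpy as np


def evaluate(R, G, beta2=1.0):
    """Precision, recall and F-measure (Eq. 23) plus TPR/FPR (Eq. 24)."""
    R, G = R.astype(bool), G.astype(bool)
    tp = np.sum(R & G)          # salient pixels correctly detected
    fp = np.sum(R & ~G)         # background pixels marked salient
    fn = np.sum(~R & G)         # salient pixels missed

    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f_measure = ((1 + beta2) * precision * recall
                 / (beta2 * precision + recall + 1e-12))

    tpr = tp / (G.sum() + 1e-12)              # equals recall
    fpr = fp / (G.size - G.sum() + 1e-12)     # W*H - sum(G) in Eq. (24)
    return precision, recall, f_measure, tpr, fpr
```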

Table 1 Quantitative comparison on six datasets in terms of precision
Table 2 Quantitative comparison on six datasets in terms of recall
Table 3 Quantitative comparison on six datasets in terms of F-measure
Table 4 Quantitative comparison on six datasets in terms of AUC
Table 5 Quantitative comparison in terms of computation time

In Fig. 7, we show the ROC curves of the state-of-the-art algorithms, including the proposed method (DDDTCWT-SS) corresponding to the six datasets. On the basis of Fig. 7 and Tables 1, 2, 3, 4 and 5 (the best results are shown in bold), we make the following observation:

Fig. 7
figure 7figure 7figure 7

ROC for the six datasets a MSRA-B, b ASD, c SAA_GT, d SOD, e SED1, f SED2

  1. 1.

    Quantitative evaluation on MSRA dataset The proposed model (DDDTCWT-SS) achieves the highest F-measure on the MSRA dataset. It outperforms all state-of-the-art methods except HFT, SDS and MCA in terms of precision, all except SIM and PFDN in terms of recall, and all except BHGT in terms of AUC.

  2. 2.

    Quantitative evaluation on ASD dataset In terms of precision, DDDTCWT-SS outperforms all state-of-the-art methods except MSA, HDCT, MCA and STREE. In terms of recall, DDDTCWT-SS outperforms all state-of-the-art methods. In terms of F-measure, DDDTCWT-SS outperforms all state-of-the-art methods. In terms of AUC value DDDTCWT-SS outperforms all state-of-the-art methods except MCA and HDCT.

  3. 3.

    Quantitative evaluation on SAA dataset In terms of precision, DDDTCWT-SS outperforms all state-of-the-art methods except HDCT and MCA. In terms of recall and F-measure, DDDTCWT-SS outperforms all state-of-the-art methods. In terms of AUC value, DDDTCWT-SS outperforms all state-of-the-art methods except HDCT.

  4. 4.

    Quantitative evaluation on SOD dataset In terms of precision, DDDTCWT-SS outperforms all state-of-the-art methods except MCA. In terms of recall, F-measure and AUC, DDDTCWT-SS outperforms all state-of-the-art methods.

  5. 5.

    Quantitative evaluation on SED1 dataset In terms of precision, DDDTCWT-SS outperforms all state-of-the-art methods except MSA, HDCT, MCA and STREE. In terms of recall and F-measure, DDDTCWT-SS outperforms all state-of-the-art methods. In terms of AUC value, DDDTCWT-SS outperforms all state-of-the-art methods except BHGT, HDCT, MCA, STREE and BSM.

  6. 6.

    Quantitative evaluation on SED2 dataset In terms of precision, DDDTCWT-SS outperforms all state-of-the-art methods except BHGT, FT, MSA and HDCT. In terms of recall and F-measure, DDDTCWT-SS outperforms all state-of-the-art methods except BHGT. In terms of AUC, DDDTCWT-SS outperforms all state-of-the-art methods.

  7. 7.

    Some models are better in terms of precision and others in terms of recall. A model is considered good if both precision and recall are high, but this is difficult to achieve; hence, a model should have a high F-measure value, which is the weighted harmonic mean of precision and recall. The proposed model furnishes the highest \( F \)-measure value on all six datasets.

  8. 8.

    Our model covers the maximum area under the ROC curve in comparison with all state-of-the-art methods for SOD and SED2 datasets and hence gives the highest AUC value on these two datasets.

  9. 9.

    PFT [32] takes the least computation time.

  10. 10.

    Spatial domain models provide good detection accuracy at the cost of high computation time, while frequency domain models offer fast computation to meet real-time requirements at the cost of poorer detection accuracy. To strike a trade-off between computation time and accuracy, our model provides high detection accuracy while taking less time than most of the existing spatial domain methods, as given in Table 5.

  11. 11.

    The bipartite graph-based image segmentation and the DDDTCWT-based saliency map computation steps are independent of each other. Therefore, these steps are carried out in parallel during the experiments, which helps reduce the overall execution time of the proposed model.

  12. 12.

    In terms of mean precision value, DDDTCWT-SS outperforms all state-of-the-art methods except HDCT and MCA. In terms of mean recall value, mean F-measure value and mean AUC value, DDDTCWT-SS outperforms all state-of-the-art methods.

  13. 13.

    We have used a threshold equal to the average intensity value of all pixels computed using the multi-scale edge-to-texture features extracted from the 2D DDDTCWT coefficients to obtain a rough estimate of the salient and background regions in the initial rough segmented saliency map (RSSM). The threshold needs to be chosen carefully, as too many salient pixels can be lost, or too many extraneous background pixels retained, during thresholding. To assign appropriate salient and background pixels to the two different Gaussian kernels for the refinement process, we have chosen a more reliable approach based on the DDDTCWT coefficients.

  14. 14.

    To show the statistical significance of the performance results, the Friedman statistical test is performed on the F-measure and AUC performance measures, following the research work of Demšar [22] and Derrac et al. [23] (a sketch of this ranking procedure is given after this list). The null hypothesis assumes that the models are equivalent in terms of their performance. A comparison of multiple models can be accomplished after ranking them according to their F-measure and AUC values. For each case, a rank ranging from \( 1 \) to \( k \) is associated with every model; rank values 1 and \( k \) denote the best and worst results, respectively. Let this rank be denoted by \( r_{i}^{j} \left( {1 \le i \le N,1 \le j \le k} \right) \). For each model \( j \), let \( R^{j} \) denote the average of the ranks over the \( N \) experimental observations. The computed ranks are given in Table 6 for the 30 models for both performance measures. Table 6 shows that the best performing DDDTCWT-SS algorithm has the lowest rank value for both F-measure (1.17) and AUC (2.50). The p values computed using the Iman and Davenport [42] statistic for F-measure and AUC are 4.0622267404084276E−55 and − 2.2204454167762003E−16, respectively, which suggest a statistically significant difference among the models considered in our quantitative comparison, hence rejecting the null hypothesis.

    Table 6 Average ranking of the algorithms corresponding to F-measure and AUC using Friedman statistic
  15. 15.

    We have used six datasets containing images with a wide range of shapes, scales and appearances. The proposed method produces more accurate saliency maps in various challenging cases, e.g. a salient object touching the image border (columns 6(a) and 6(b)), multiple disconnected salient objects (columns 6(c) and 6(d)), low contrast between the salient object and the background (column 6(e)) and camouflage conditions (column 6(f)), while it may not work properly for some images containing multiple objects on highly cluttered backgrounds (columns 6(g) and 6(h)). It may also not perform well on images with salient objects under partial occlusion.

  16. 16.

    To show the contribution of the refinement step using the Gaussian mixture model (GMM) and expectation maximization (EM) algorithm, the quantitative and qualitative performance of the algorithm before and after the refinement step is shown in Table 7 and Fig. 8, respectively. Table 7 (best results shown in bold) and Fig. 8 show that performance improves after the refinement step.

    Table 7 Quantitative performance of the proposed algorithm before and after the refinement step
    Fig. 8
    figure 8

    Qualitative performance of the proposed algorithm before and after the refinement step
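As mentioned in observation 14, the ranking-based significance test can be reproduced in a few lines; `scores` is a hypothetical (N datasets × k models) array of F-measure or AUC values, with higher values better.

```python
import numpy as np
from scipy.stats import rankdata, f


def friedman_iman_davenport(scores):
    """Average ranks (rank 1 = best), Friedman chi-square and the
    Iman-Davenport F statistic with its p value (Demšar [22])."""
    N, k = scores.shape
    # Rank the models within each dataset; negate so the best score gets rank 1.
    ranks = np.apply_along_axis(rankdata, 1, -scores)
    avg_ranks = ranks.mean(axis=0)

    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)      # Iman-Davenport correction
    p_value = f.sf(ff, k - 1, (k - 1) * (N - 1))
    return avg_ranks, ff, p_value
```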

5 Conclusion and future work

Performance assessment of the proposed model is done in terms of precision, recall, \( F \)-measure, AUC and computation time on six publicly available image datasets, as well as in terms of mean precision, mean recall, mean F-measure and mean AUC over all six datasets. Experimental results show that the proposed model (DDDTCWT-SS) outperforms several existing competitors in terms of F-measure on five datasets, recall on four datasets and AUC on two datasets. The proposed method outperforms all state-of-the-art methods in terms of mean recall, mean F-measure and mean AUC over all six datasets. Moreover, the proposed model demands less computation time than most of the existing spatial domain methods. Although the proposed method (DDDTCWT-SS) is simple, several important issues require further investigation, such as the incorporation of more sophisticated visual features to further improve performance. The framework can also be extended to handle difficulties such as partial occlusion, articulation, background clutter and real-time requirements. We hope to encourage more future work along this direction; a challenging dataset targeting specific difficulties like partial occlusion and background clutter, with accurate annotation and an appropriate evaluation methodology, would be desirable. In future work, we will also focus on application-oriented problems such as image or video compression and video summarization using visual saliency.