1 Introduction

While watching a play in a theatre, a sudden spotlight typically grabs the attention of the audience sitting in the dark. At any point during the performance, many visual stimuli reach the human eye: the actors, their props and costumes, the stage setup, and so on. In spite of this heavy load of visual stimuli, the spotlight guides the human gaze by identifying the areas of relevance in the scene. The mechanism of identifying the relevant regions in a given image or scene is called visual attention [6, 30], and these relevant regions or objects are termed salient objects in the field of computer vision [6, 30]. Detection of salient objects finds real-time applications in surveillance systems [21], remote sensing [36], image retrieval [3, 20], and object detection and recognition [35, 45]. It is also helpful in automatic target detection [28, 30], robotics, image and video compression [28], automatic cropping/centering [41] to display objects on small portable screens [10], medical imaging [35], advertising design [28], image collection browsing [40], image enhancement [18], video summarization [38], and many more.

Visual attention is a cognitive process that helps humans and other primates rapidly select highly relevant information from a scene [7]. This information is then further processed by high-level visual processes such as scene understanding and object detection. It is commonly believed that visual attention is guided by two components: (i) a bottom-up (BU) component, a data-driven and task-independent form of visual saliency based only on low-level, image-based outliers and conspicuities; and (ii) a top-down (TD) component, a volitionally controlled mechanism that guides attention and gaze in a task-dependent, goal-directed manner, orchestrating the sequential acquisition of information from the visual environment.

When information about a specific search target, search task, time limit, or other constraints is not given to an observer in advance, bottom-up (image-derived) information plays the predominant role in guiding attention toward potentially interesting targets [29]. When attention is captured by salient stimuli, it is considered bottom-up, memory-free, and reactive: it depends only on the instantaneous sensory input, driven by low-level stimuli in the scene, without taking the internal state of the organism into account. In highly cluttered backgrounds, lacking top-down prior knowledge, bottom-up saliency algorithms tend to respond to numerous unrelated low-level visual stimuli (false positives) and may thus miss the objects of interest (false negatives). Most research has focused on this bottom-up aspect of visual attention. With the advancement of bottom-up approaches, researchers have begun to distinguish two closely related problems: fixation prediction and salient object detection [8, 11, 35]. Fixation prediction models aim to find the fixation points in a given scene, i.e., the points on which human eyes focus when the scene is shown for a few seconds; these points are useful for eye movement prediction. Salient object detection models, the second category, detect the most salient object in an image by drawing an accurate silhouette of it, which requires segmenting the image into two regions: the salient object and the background. Both categories of models construct saliency maps, which serve different purposes.

The other way to guide and improve attention is to use top-down, memory-dependent, or anticipatory mechanisms. Top-down attention is driven by cognitive factors such as knowledge, expectations, and current goals [13]. Top-down methods are task-dependent and exploit human observation behaviour to achieve specific goals; top-down models are always integrated with bottom-up models to generate saliency maps for localizing objects of interest. Both bottom-up and top-down visual attention can be modelled in the spatial domain or the frequency domain to automatically generate a saliency map that encodes visual conspicuity. In general, spatial domain methods provide higher detection accuracy but take more computation time to obtain features; the literature therefore suggests determining features in the frequency domain to reduce computation time.

Most models in the frequency domain focus only on local saliency, while others focus only on global saliency. However, to detect a salient object, both local and global saliency information play a vital role. In the literature [5, 12, 31], research works have fused local and global features obtained in the spatial domain to enhance detection accuracy, but at the cost of higher computation time. To the best of our knowledge, no model proposed to date detects salient objects using both local and global saliency information in the frequency domain. In this paper we therefore propose a novel and effective hybrid approach for salient object detection that utilizes both local and global saliency information in the frequency domain to reduce computation time without substantially degrading detection accuracy. Local saliency is computed in terms of PFDN, as suggested in the work of Bian and Zhang [4], and global saliency is determined using the fast Walsh-Hadamard transform (WHT) [15, 24, 44]. The WHT is computationally inexpensive because its basis elements take only the binary values ±1, so it requires only addition and subtraction operations. Finally, the saliency map is generated as a linear weighted combination of the local and global saliency, where the weights are determined using an entropy measure [26, 39, 43]. To check the efficacy of the proposed hybrid model, experiments are performed on two publicly available datasets and one newly derived dataset, and performance is compared with existing state-of-the-art methods.

The contribution of this paper is threefold: 1) a fast frequency domain approach for salient object detection is proposed which makes full use of local and global information, unlike recent frequency domain methods that model saliency either as a global or as a local phenomenon; 2) we employ the fast Walsh-Hadamard transform (WHT) to compute global saliency because of its simplicity, efficiency, and speed; 3) we derive an object-contour based ground truth dataset to obtain the exact shapes of salient objects. Performance is evaluated on this derived dataset to check how well the proposed method recovers the accurate object shape for all 5000 images of MSRA SOD image set B, rather than using ground truth based on rectangular bounding boxes.

The paper is organized as follows. Section 2 reviews related state-of-the-art methods for salient object detection. In Section 3, we present the proposed saliency model (HLGM) based on local and global saliency in the frequency domain. The experimental setup and results are presented in Section 4. Conclusion and future work are discussed in Section 5.

2 Related work

2.1 Bottom-up methods

Visual attention [6, 17] is achieved by either a fast bottom-up component or a slow task-dependent top-down component. Most researchers focus on computing bottom-up visual attention in the spatial domain. Itti et al. [30] suggested a biologically plausible saliency detection approach, modelled on the early primate visual system, which generates activation maps by applying a centre-surround operator over a number of scales and finally combines these activation maps into a saliency map. Han et al. [25] proposed a model which uses Markov random fields and region growing techniques in combination with Itti et al.'s model [30] for salient object segmentation in colour images. Bruce and Tsotsos [9] proposed a neurally plausible bottom-up salient object detection model based on the principle of information maximization. Achanta et al. [1] proposed a frequency-tuned method for saliency detection. Achanta and Susstrunk [2] proposed a salient region detection approach using a maximum symmetric surround technique, assigning a large bandwidth to the filter at the centre of the image and a small bandwidth at the borders.

Spatial domain models are generally complex and computationally expensive, which limits their use in real-time applications. To overcome these limitations, researchers have employed frequency domain techniques for salient object detection. Hou and Zhang [27] used the spectral residual of the Fourier transform to detect salient objects. Guo et al. [22] pointed out that the phase spectrum of the Fourier transform (PFT), rather than the amplitude spectrum, is the key to determining the position of salient objects, and proposed a saliency detection model based on it. Guo and Zhang [23] extended the PFT model to compute a multi-resolution spatiotemporal saliency map using a quaternion representation of the image. Yu et al. [46] proposed a salient object detection model based on the pulsed discrete cosine transform. Bian and Zhang [4] suggested a frequency domain divisive normalization (FDN) approach for saliency detection using the contourlet transform and frequency divisive normalization; FDN exhibits biological plausibility as it utilizes the concepts of initial feature extraction and cortical surround inhibition. Bian and Zhang [4] further extended FDN by decomposing an image into overlapping local patches and conducting piecewise FDN (PFDN) on these patches. Fang et al. [14] utilized the amplitude spectrum of the quaternion Fourier transform (AQFT) of different image patches to detect salient objects. More recently, Li et al. [34] built a saliency detection model based on the hypercomplex Fourier transform (HFT), in which Gaussian functions of different variances filter the log amplitude spectrum while the phase spectrum is kept.

2.2 Top-down methods

Top-down approaches are integrated with bottom-up approaches in order to detect salient locations. Zhang et al. [47] proposed a Bayesian framework to classify a pixel as salient object or background by considering position, area, and intensity saliencies together with a maximum saliency difference technique. Goferman et al. [19] proposed a context-aware saliency detection algorithm exploiting four principles: local low-level factors, global considerations, visual organization rules, and high-level factors. Liu et al. [37] proposed a supervised approach incorporating a set of features that depict a salient object at the local, regional, and global levels. Their method consists of two phases: first, multi-scale contrast, centre-surround histogram, and colour spatial distribution features are extracted from the image; second, a conditional random field is employed to linearly combine these features into a saliency map.

3 Hybrid approach based on local and global saliency maps

In this paper, we propose a hybrid approach (HLGM) that exploits the benefits of both local and global saliency information in the frequency domain for salient object detection. The proposed HLGM model involves three phases. In the first phase, local saliency information is computed using PFDN, as suggested in the work of Bian and Zhang [4]; in the second phase, global saliency information is computed using the fast Walsh-Hadamard transform [15, 24, 44]; finally, the hybrid saliency map is determined as a weighted linear combination of the local and global saliency maps, where the weights are determined using an entropy measure. The resulting hybrid saliency map is used to produce a segmentation mask around the region of interest, also called the attention mask, which is described in detail in Section 3.3. These three phases are discussed below.

3.1 Local saliency computation

Most models working in the frequency domain are global in nature, except for some models that compute local saliency information over image patches, such as the PFDN model, which exhibits biological plausibility in the frequency domain [4]. In the spectral domain, each feature map can be seen as a sub-band of the frequency spectrum, and the saliency map is generated across all scales and orientations [4]. In the first phase of HLGM, we therefore use PFDN to compute the local saliency map. The input image is first transformed into the LAB colour space, and each colour channel is decomposed into a number of scales using a Laplacian pyramid. For a given colour channel and scale, the image is then partitioned into P overlapping local patches, and the Fourier transform of a patch I_p is computed as

$$ R_p(k) = F_k\left(\mathbf{I}_p\right) \qquad \forall p = 1, 2, \dots, P $$
(1)

where F represents the Fourier transform and R_p(k) is the Fourier coefficient at frequency component k for patch p. These Fourier coefficients are grouped using the decomposition scheme shown in Fig. 1, as suggested in the work of Bian and Zhang [4]. This scheme separates the amplitude spectrum of the input image into feature maps at four scales with 16, 8, 4, and 1 orientations from the highest scale to the lowest, resulting in a total of 29 sub-bands corresponding to 29 feature maps.

Fig. 1 2D multiscale directional filter bank of the contourlet transform

The normalization term E_i for the i-th sub-band is then calculated as

$$ E_i = \sqrt{w \sum_{k \in i} \frac{\left\Vert R_p(k) \right\Vert^2}{N} + \sigma^2} \qquad \forall i = 1, 2, \dots, 29 $$
(2)

where N is the number of pixels, and w and σ² are constants, both set to 1. After computing the normalization term E_i for each sub-band, the divisive normalization coefficients of sub-band i in the frequency domain are obtained as

$$ \widehat{R}_p(k)_{k \in i} = \frac{R_p(k)_{k \in i}}{E_i} \qquad \forall i = 1, 2, \dots, 29 $$
(3)

The normalization in the above equation suppresses frequency bands with high energy concentration, which helps in obtaining a desirable saliency map. The combined divisive normalization coefficients for the p-th patch are given by

$$ \widehat{\mathbf{R}}_p = \bigcup_i \left( \widehat{R}_p(k)_{k \in i} \right) \qquad \forall i = 1, 2, \dots, 29 $$
(4)

The saliency map SM_p of the p-th patch is obtained [4] as

$$ \mathbf{SM}_p = \mathbf{W} \left| F^{-1}\left\{ \widehat{\mathbf{R}}_p \right\} \right|^2 $$
(5)

where W denotes a windowing function used to remove edge effects.

For each scale and colour channel, the saliency maps of all patches are combined by taking the maximum value at each pixel location. The saliency maps obtained at all scales are resized to the dimensions of the original image, and the final local saliency map (LSM) is obtained as the pixel-wise maximum across all scales and colour channels. Finally, the map is smoothed with a Gaussian filter and the values of LSM are normalized to [0, 1].
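For concreteness, the following is a minimal single-channel, single-patch sketch of the computation of Eqs. (1)-(5) in Python. It is illustrative only: it replaces the 29-sub-band contourlet-style grouping of Fig. 1 with a simple radial-ring grouping of Fourier coefficients, and all helper names are our own.

```python
import numpy as np

def subband_index(shape, n_bands=8):
    # Assign each frequency k to a radial ring; a simplified stand-in
    # for the contourlet-style sub-band grouping of Fig. 1.
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.hypot(fy, fx)
    return np.minimum((radius / radius.max() * n_bands).astype(int),
                      n_bands - 1)

def pfdn_patch(patch, w=1.0, sigma2=1.0, n_bands=8):
    # Single-patch divisive normalization in the frequency domain.
    R = np.fft.fft2(patch)                      # Eq. (1)
    bands = subband_index(patch.shape, n_bands)
    R_hat = np.empty_like(R)
    N = patch.size
    for i in range(n_bands):                    # Eqs. (2)-(4)
        idx = bands == i
        E_i = np.sqrt(w * (np.abs(R[idx]) ** 2).sum() / N + sigma2)
        R_hat[idx] = R[idx] / E_i
    # Eq. (5): windowed squared magnitude of the inverse transform.
    window = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    return window * np.abs(np.fft.ifft2(R_hat)) ** 2
```

In the full model this is applied per overlapping patch, per scale, and per LAB channel, with the per-patch maps then combined by pixel-wise maxima as described above.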

3.2 Global saliency computation

To reduce the computation time needed to capture the global saliency of the image, the proposed HLGM model uses the fast Walsh-Hadamard transform (WHT) [15, 16, 24, 44] instead of the fast Fourier transform (FFT). The elements of the WHT basis vectors take only the values +1 and −1, whereas the FFT requires complex multiplications. The computations involved in the WHT are very simple: when an image is projected onto the basis images, all that is needed is to multiply each pixel by ±1, so the WHT is computationally more efficient than the FFT [26, 42]. The WHT coefficients [26] of an image I of size N × N, where N = 2^n, are computed as

$$ \mathrm{WH}(u,v) = \frac{1}{N} \sum_{r=0}^{N-1} \sum_{c=0}^{N-1} \mathrm{I}(r,c) \, (-1)^{\sum_{i=0}^{n-1} \left[ b_i(r) p_i(u) + b_i(c) p_i(v) \right]} \qquad 0 \le u, v \le N-1 $$
(6)

where b_i(r) is the i-th bit in the binary representation of r, (u, v) is the index in the frequency domain, and (r, c) is the index in the spatial domain. The term p_i(u) is defined as follows:

$$ \begin{aligned} p_0(u) &= b_{n-1}(u) \\ p_i(u) &= b_{n-i}(u) + b_{n-i-1}(u) \qquad \text{for } i = 1, \dots, n-1 \end{aligned} $$
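As a small illustration (Python; the helper names are our own), these bit functions translate directly into code:

```python
def b(i, x):
    # b_i(x): the i-th bit in the binary representation of x.
    return (x >> i) & 1

def p(i, u, n):
    # p_i(u) as defined above, with n = log2(N); the exponent of (-1)
    # in Eq. (6) depends on these sums only modulo 2.
    return b(n - 1, u) if i == 0 else b(n - i, u) + b(n - i - 1, u)
```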

These WHT coefficients correspond to frequencies ranging from the lowest, at the origin, to the highest, with a broad band of mid-range frequencies in between, as shown in Fig. 2, where the origin is at the top-left corner. LF, MF, and HF refer to the low, middle, and high frequencies of the image respectively.

Fig. 2 Range of low frequencies (LF), middle frequencies (MF) and high frequencies (HF)

We pick all the high-frequency coefficients shown in Fig. 2, which are selected by the following mask M:

$$ \mathrm{M}(u,v) = \begin{cases} 0, & \text{if } u \le \frac{9N}{10} \text{ and } v \le \frac{9N}{10} \\ 1, & \text{otherwise} \end{cases} $$

The selected WHT coefficients are given by

$$ \mathrm{WH}'(u,v) = \mathrm{WH}(u,v) \cdot \mathrm{M}(u,v) $$
(7)

The global saliency map (GSM) is computed as

$$ \mathrm{GSM}(r,c) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \mathrm{WH}'(u,v) \, (-1)^{\sum_{i=0}^{n-1} \left[ b_i(r) p_i(u) + b_i(c) p_i(v) \right]} \qquad \text{for } 0 \le r, c \le N-1 $$
(8)

The values of GSM are normalized to [0, 1], and the resulting map is resized to the dimensions of the original image.

Being a real, symmetric, and orthogonal transform, the WHT transformation matrix H has the following properties [26, 39, 43]:

$$ \mathbf{H} = \mathbf{H}^{*} = \mathbf{H}^{T} = \mathbf{H}^{-1} $$

The most attractive aspect of the WHT is that it involves only additions and subtractions, with no multiplication operations. Since multiplication is time-consuming, using the WHT saves a significant amount of computation time [15, 24, 26, 39, 43, 44]. In this manuscript, this WHT-based global saliency approach is abbreviated as WHTM (Walsh-Hadamard Transform Method).
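A minimal sketch of the WHTM pipeline of Eqs. (6)-(8) is given below (Python; function names are our own). One assumption to flag: the fast butterfly produces coefficients in natural (Hadamard) order, so we permute them into sequency order, where the index-based mask of Eq. (7) meaningfully separates low from high frequencies; the reconstruction exploits the self-inverse property of H noted above, and the overall scaling is immaterial because GSM is min-max normalized.

```python
import numpy as np

def fwht(a):
    # 1D fast Walsh-Hadamard butterfly: additions and subtractions only.
    a = a.astype(np.float64).copy()
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def sequency_perm(n_bits):
    # perm[s] = natural-order position of the coefficient with sequency s
    # (bit reversal followed by Gray-to-binary conversion).
    perm = np.empty(1 << n_bits, dtype=int)
    for h in range(1 << n_bits):
        r = int(format(h, f"0{n_bits}b")[::-1], 2)   # bit reversal
        s, g = 0, r
        while g:                                     # Gray -> binary
            s ^= g
            g >>= 1
        perm[s] = h
    return perm

def global_saliency(img):
    # img: N x N grayscale array with N = 2**n (resized beforehand).
    N = img.shape[0]
    perm = sequency_perm(int(np.log2(N)))
    # Separable 2D WHT in sequency order (Eq. 6, with the 1/N factor).
    WH = np.apply_along_axis(fwht, 0, np.apply_along_axis(fwht, 1, img))
    WH = WH[np.ix_(perm, perm)] / N
    # High-frequency mask M and selected coefficients WH' (Eq. 7).
    low = np.arange(N) <= 9 * N / 10
    WHp = WH * (1.0 - np.outer(low, low))
    # Inverse transform (Eq. 8): H is self-inverse up to scaling.
    inv = np.empty_like(WHp)
    inv[np.ix_(perm, perm)] = WHp                    # back to natural order
    gsm = np.apply_along_axis(fwht, 0, np.apply_along_axis(fwht, 1, inv)) / N
    return (gsm - gsm.min()) / (gsm.max() - gsm.min() + 1e-12)
```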

3.3 Composite saliency computation

For some images, global saliency information plays the vital role in detecting the salient object properly, while for others local saliency information matters more. The two must therefore be combined in such a way that the dominant saliency information receives the higher weight. To capture this composite information in the proposed HLGM model, we compute the final saliency map (FSM) as a linear weighted combination of the local and global saliency maps:

$$ \mathbf{FSM} = w_L \times \mathbf{LSM} + w_G \times \mathbf{GSM} $$
(9)

where w_L and w_G are the weights assigned to the local and global saliency maps respectively. A desirable saliency map should highlight the salient objects while suppressing non-salient ones. In that case, the histogram of saliency values is not uniformly distributed over all bins and the corresponding entropy is small. A few experiments readily show that the saliency map with the minimum entropy value most clearly separates the salient region from the background: the lower the entropy of a saliency map, the less scattered the pixels of the salient object. Hence, a higher weight is assigned to the saliency map with lower entropy and a lower weight to the one with higher entropy. To satisfy this criterion, the weights w_L and w_G are set inversely proportional to the entropies of the local and global saliency maps respectively. For some example images, the global and local saliency maps with their corresponding weights are shown in columns (b) and (c) of Fig. 3. The entropies of the local and global saliency maps are computed as

$$ E_L = -\sum_{i=1}^{b} p_i(\mathbf{LSM}) \log_2 p_i(\mathbf{LSM}) \qquad \text{and} \qquad E_G = -\sum_{i=1}^{b} p_i(\mathbf{GSM}) \log_2 p_i(\mathbf{GSM}) $$
(10)

where b is the number of bins and p_i(·) is the probability of pixels belonging to the i-th bin of the histogram. In the experiments, the number of bins used to compute the hybrid saliency map is set to 16. The weights w_L and w_G are computed as

$$ w_L = \begin{cases} \dfrac{E_G}{E_L + E_G}, & \text{if } E_L \ne 0 \text{ and } E_G \ne 0 \\ 1, & \text{if } E_G = 0 \text{ and } E_L \ne 0 \end{cases} $$
(11)
$$ w_G = \begin{cases} \dfrac{E_L}{E_L + E_G}, & \text{if } E_L \ne 0 \text{ and } E_G \ne 0 \\ 1, & \text{if } E_L = 0 \text{ and } E_G \ne 0 \end{cases} $$
(12)
Fig. 3 a Original image. b Global saliency map and its corresponding global weight. c Local saliency map and its corresponding local weight. d Hybrid saliency map using the HLGM approach. e Attention mask generated from the hybrid saliency map
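A minimal sketch of this entropy-weighted fusion (Eqs. (9)-(12)), assuming both input maps are already normalized to [0, 1]; the function names are our own:

```python
import numpy as np

def entropy(sal, bins=16):
    # Histogram entropy of a saliency map with values in [0, 1] (Eq. 10).
    hist, _ = np.histogram(sal, bins=bins, range=(0.0, 1.0))
    prob = hist / hist.sum()
    prob = prob[prob > 0]          # 0 * log2(0) is taken as 0
    return float(-(prob * np.log2(prob)).sum())

def fuse(lsm, gsm, bins=16):
    # Entropy-weighted linear combination (Eqs. 9, 11, 12): the map
    # with the lower entropy receives the higher weight.
    e_l, e_g = entropy(lsm, bins), entropy(gsm, bins)
    if e_l != 0 and e_g != 0:
        w_l, w_g = e_g / (e_l + e_g), e_l / (e_l + e_g)
    else:
        # One entropy is zero; Eqs. (11)-(12) give the other map weight 1.
        # (The degenerate case of both entropies being zero is undefined.)
        w_l, w_g = float(e_l != 0), float(e_g != 0)
    return w_l * lsm + w_g * gsm
```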

The saliency map FSM is normalized to [0, 1]; the normalized saliency value of a pixel p is computed as

$$ {\mathrm{FSM}}_N(p)=\frac{\mathrm{FSM}(p)-m}{M-m} $$
(13)

where m and M denote the minimum and maximum values of FSM over the set P of all pixels in the image. A threshold is required to classify a pixel p as an attention pixel or a background pixel. For this purpose a fixed threshold is generally selected, equal to half of the maximum saliency value; however, a fixed threshold may not suit every saliency map. In our experiments we instead use an adaptive threshold τ that depends on the saliency map itself. The threshold τ is calculated in two steps. In the first step, a Canny edge operator is applied to the normalized saliency map FSM_N to generate the object's silhouette. The edge information E for every pixel p is given as

$$ \mathrm{E}(p) = edge\left(\mathrm{FSM}_N(p)\right) = \begin{cases} 1, & p \text{ is an edge pixel} \\ 0, & \text{otherwise} \end{cases} $$
(14)

In the second step, the average of the saliency values along the object's silhouette is used as the threshold τ to classify a pixel p as an attention pixel or a background pixel. The threshold τ is computed as

$$ \tau = \frac{\sum_{p \in \mathrm{P}} \mathrm{E}(p) \cdot \mathrm{FSM}_N(p)}{\sum_{p \in \mathrm{P}} \mathrm{E}(p)} $$
(15)

A binary threshold map T is then generated from the grayscale saliency map FSM_N as

$$ \mathrm{T}(p) = \begin{cases} 1, & \mathrm{FSM}_N(p) \ge \tau \\ 0, & \text{otherwise} \end{cases} $$
(16)

where T(p) = 1 marks attention pixels and T(p) = 0 background pixels. The threshold map T generated by Eq. (16) may contain several objects, and these objects may contain holes, i.e., sets of background pixels that cannot be reached by filling in the background from the edge of the object. We first fill the holes, then label the connected components of T, assigning each component a unique label using an 8-connected neighbourhood. After discarding the background, the remaining connected components are sorted by area, and the component with the largest area is chosen as the attention mask corresponding to the saliency map.
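A compact sketch of this post-processing (Eqs. (13)-(16) plus the component selection), using SciPy and scikit-image; the fallback to half of the maximum saliency when no edge pixel is found is our own assumption, motivated by the fixed-threshold convention mentioned above:

```python
import numpy as np
from scipy import ndimage
from skimage import feature

def attention_mask(fsm):
    # Normalize the saliency map to [0, 1] (Eq. 13).
    fsm_n = (fsm - fsm.min()) / (fsm.max() - fsm.min() + 1e-12)
    # Adaptive threshold: mean saliency along the Canny silhouette
    # (Eqs. 14-15); fall back to half the maximum if no edges are found.
    edges = feature.canny(fsm_n)
    tau = fsm_n[edges].mean() if edges.any() else 0.5 * fsm_n.max()
    # Binary threshold map (Eq. 16), with holes filled.
    t = ndimage.binary_fill_holes(fsm_n >= tau)
    # Label 8-connected components and keep the largest one.
    labels, num = ndimage.label(t, structure=np.ones((3, 3)))
    if num == 0:
        return t
    areas = ndimage.sum(t, labels, index=range(1, num + 1))
    return labels == (1 + int(np.argmax(areas)))
```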

Figure 4 compares the local, global, and final saliency maps and their corresponding attention masks on selected images. Case 1 is one where neither local nor global saliency alone gives a good result. Case 2 is one where local saliency performs better than global saliency, and Case 3 one where global saliency performs better than local saliency. Case 4 is one where local and global saliency both give good and nearly comparable results. In all four cases the proposed hybrid approach HLGM renders better performance, both in terms of saliency maps and attention masks.

Fig. 4 a Original image. b Local saliency maps and corresponding attention masks. c Global saliency maps and corresponding attention masks. d Final saliency maps and corresponding attention masks by the proposed model

4 Experimental setup and results

To check the efficacy of the proposed HLGM model, its performance is evaluated both qualitatively and quantitatively and compared with existing approaches [1, 2, 4, 9, 14, 22, 23, 27, 30, 34, 46]. The performance of HLGM and eleven other state-of-the-art models is examined on two popular, publicly available datasets and one new ground-truth dataset. The first is the Microsoft Research Asia Salient Object Database (MSRA SOD) image set B. It contains 5000 high-quality colour images of various object categories and scene types, in 10 subfolders, with ground truth manually labelled by nine users in the form of a rectangle bounding the salient object. The second is Binary Masks, containing 1000 images selected from the 5000 images of MSRA SOD image set B; these images are manually segmented, and the ground truth takes the form of a binary mask. Achanta et al. [1] argued that bounding-box ground truth is inaccurate, as a single box may contain several objects, and therefore proposed an object-contour based ground truth dataset; however, they covered only 1000 of the 5000 images. We derived a new ground-truth dataset, called SAA_GroundTruth, covering all 5000 images of MSRA SOD image set B, manually segmented in such a way that the result matches the majority of the bounding boxes of MSRA SOD image set B. All images are of size 400 × 300 or 300 × 400 with intensity values in [0, 255]. For both the qualitative and the quantitative evaluation, all experiments with the proposed approach and the other state-of-the-art models are carried out in a Windows 7 environment on an Intel(R) Xeon(R) processor at 2.27 GHz with 4 GB of RAM.

4.1 Qualitative evaluation

The qualitative evaluation of the proposed model and eleven other state-of-the-art models on five images is shown in Fig. 5. These five images were chosen from the test dataset because they contain objects differing in shape, size, position, type, etc. The following observations about the attention masks can be drawn from Fig. 5:

Fig. 5 Qualitative evaluation of the HLGM model and eleven other state-of-the-art models

  • Itti et al. [30] worked at the local level and neglected the global details, hence gave disappointing results.

  • Bruce and Tsotsos [9] gave better saliency results than Itti et al. [30] by utilizing the information maximization approach.

  • Hou and Zhang [27] lacked the shape information of the objects.

  • Guo et al. [22] gave unsatisfactory results that lacked shape details.

  • Yu et al. [46] failed to give satisfactory results with deteriorated shapes.

  • Achanta et al. [1] gave clear results for some images, but their results deteriorated for others.

  • Achanta and Susstrunk [2] gave better results than their earlier work [1] but included some extraneous information.

  • Guo and Zhang [23] failed to capture the shape information of the object.

  • Bian and Zhang [4] gave better results than all the above models in terms of saliency detection, but the shape information needed enhancement and the results contained unnecessary details.

  • Fang et al. [14] missed finer shape details: it localised the objects properly but with deteriorated shapes.

  • Li et al. [34] either missed some portion of the object or gave extra information of the object that was not required.

  • The proposed HLGM model gave better shape information and clear boundaries of the object.

4.2 Quantitative evaluation

The quantitative evaluation of the proposed model and eleven other state-of-the-art models is done in terms of precision, recall, F-measure, and computation time. Using the ground truth G and the detection result R, precision, recall, and F-measure are calculated as

$$ Precision = \frac{TP}{TP + FP}; \qquad Recall = \frac{TP}{TP + FN}; \qquad F_{\alpha} = \frac{(1+\alpha) \times Precision \times Recall}{\alpha \times Precision + Recall} $$
$$ \text{where} \qquad TP = \sum_{\mathbf{G}(x,y)=1} \mathbf{R}(x,y); \qquad FP = \sum_{\mathbf{G}(x,y)=0} \mathbf{R}(x,y); \qquad FN = \sum_{\mathbf{R}(x,y)=0} \mathbf{G}(x,y) $$
(17)

where TP (true positives) is the number of salient pixels detected as salient, FP (false positives) is the number of background pixels detected as salient, and FN (false negatives) is the number of salient pixels detected as background.

While computing the F-measure, we chose α = 1 to give equal weight to both precision and recall.
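The pixel-wise counts of Eq. (17) translate directly into code; a brief sketch over binary masks (the function name is our own):

```python
import numpy as np

def precision_recall_f(gt, res, alpha=1.0):
    # gt: binary ground truth G; res: binary detection result R (Eq. 17).
    gt, res = gt.astype(bool), res.astype(bool)
    tp = np.sum(gt & res)          # salient pixels detected as salient
    fp = np.sum(~gt & res)         # background pixels detected as salient
    fn = np.sum(gt & ~res)         # salient pixels detected as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = alpha * precision + recall
    f_alpha = (1 + alpha) * precision * recall / denom if denom else 0.0
    return precision, recall, f_alpha
```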

Tables 6, 7 and 8 show the quantitative performance of the proposed method in comparison with the other state-of-the-art methods on MSRA SOD image set B, Binary Masks, and SAA_GroundTruth respectively. The computation times of the models are given in Table 9. To avoid bias when comparing computation times, all experiments are carried out in the same Windows 7 environment on an Intel(R) Xeon(R) processor at 2.27 GHz with 4 GB of RAM. The best results are shown in bold.

To highlight the contributions of the local approach (PFDN) and the global approach (WHTM) to the proposed approach (HLGM), their quantitative results are shown in Tables 1, 2 and 3.

Table 1 Quantitative comparison for MSRA SOD image set B
Table 2 Quantitative comparison for Binary Masks
Table 3 Quantitative comparison for SAA Ground Truth

It can be observed that for all three datasets the proposed method (HLGM) outperforms both the local approach (PFDN) and the global approach (WHTM) in terms of F-measure. It can also be noted that for all three datasets the global approach (WHTM) outperforms the local approach (PFDN) in terms of precision only, while the local approach (PFDN) outperforms the global approach (WHTM) in terms of recall and F-measure. The experimental results suggest that the hybrid approach (HLGM) performs better because of the combination of the local (PFDN) and global (WHTM) approaches and the way they are combined.

For each of the 5000 images in MSRA SOD image set B, we calculate a local and a global weight corresponding to the local and global saliency maps respectively, yielding 5000 local weights and 5000 global weights. The average local weight is then calculated over all 5000 local weights, and likewise the average global weight over all 5000 global weights. The average local weight with its standard deviation for the local approach (PFDN) and the average global weight with its standard deviation for the global approach (WHTM) are shown in Table 4.

Table 4 Average weight with standard deviation for local and global approaches

The computation times of the local method, the global method, and the proposed method can be observed in Table 5. It is clear from Table 5 that the global approach (WHTM) is fast enough to compute global saliency in real time.

Table 5 Computational time taken by models

The following can be observed from Tables 6, 7 and 8:

Table 6 Quantitative comparison for MSRA SOD image set B
Table 7 Quantitative comparison for Binary Masks
Table 8 Quantitative comparison for SAA Ground Truth
  • In terms of precision, Li et al. [34] outperforms the rest of the state-of-the-art methods on the MSRA and Binary Masks datasets, while Achanta and Susstrunk [2] shows the highest precision on the SAA dataset.

  • In terms of recall, the Bian and Zhang [4] model gives the best performance on all three datasets, but it has the lowest precision on MSRA and SAA and also shows poor precision on the Binary Masks dataset.

  • The model proposed by Guo and Zhang [23] performs poorly in terms of recall and F-measure on all three datasets.

In terms of F-measure, the proposed method outperforms the other state-of-the-art methods on all three datasets, as can also be observed in Fig. 6. This signifies that the proposed method performs well in terms of both precision and recall, whereas other models trade one for the other, giving a high precision value but a low recall value or vice versa.

The following can be observed from Table 9:

Table 9 Computational time for the proposed model and different state-of-the-art models
  • The model suggested by Guo et al. [22] takes the least computation time, but it does not perform well in terms of precision, recall, and F-measure.

  • The Bruce and Tsotsos [9] model takes the most time.

  • The Hou and Zhang [27], Yu et al. [46], Achanta et al. [1], Bian and Zhang [4], Guo and Zhang [23], and Guo et al. [22] models take less computation time than the proposed model, but their performance in terms of F-measure is not as good.

  • The computation time of the proposed model is considerably less than that of Itti et al. [30], Bruce and Tsotsos [9], Achanta and Susstrunk [2], Fang et al. [14], and Li et al. [34], which can be helpful for detecting salient objects with high accuracy in real time.

Fig. 6 Comparison of F-measure values for the state-of-the-art methods and the proposed method

5 Conclusion and future work

In this paper, we proposed a novel, fast, and biologically plausible frequency domain approach for salient object detection that considers both local and global saliency. The proposed approach involves three phases. In the first phase, locally salient features are generated following the work of Bian and Zhang [4]. In the second phase, globally salient features are computed using the fast Walsh-Hadamard transform. Finally, the saliency map is obtained as a linear weighted combination of the local and global saliency, where the weights are calculated using an entropy measure. The performance of the proposed model was evaluated in terms of precision, recall, F-measure, and computation time on two publicly available image datasets and one new dataset. Experiments on a variety of images showed that, in terms of F-measure, the proposed approach outperforms the Bian and Zhang model, which considers only local saliency, as well as the other existing state-of-the-art methods, while remaining computationally inexpensive for accurate salient object detection.

Many issues remain open for further investigation, such as partial occlusion, intra-class variation, viewpoint variation, background clutter, and articulation. In the future, a non-linear combination of features can be explored to evaluate performance. We also plan to extend our work to detect any number of salient objects, or no salient object at all.