1 Introduction

Humans can understand complex natural scenes without much effort. This capability is modeled computationally through Salient Object Detection (SOD). The term 'salient' refers to those areas in an image or video which are distinctive when compared to their surroundings. SOD in natural images has become an active research area in the past two decades due to its numerous applications in areas such as object recognition [145], image resizing [6], image retargeting [40], image and video compression [42, 50], image thumbnailing [117], video summarization [53, 112], photo collages [150], image quality assessment [98], small device displays [20], image segmentation [135], image editing and manipulation [41, 118], image retrieval [159], object discovery [33, 66] and human-robot interaction [167].

The origins of salient object detection lie in the feature integration theory (FIT) proposed by Treisman and Gelade [173]. This theory points out that visual features drive human attention and help the task of searching. The concept was furthered by Koch and Ullman [75], who developed a feed-forward neural network model and introduced the concept of a saliency map, which represents the attended locations in an image. The pioneering work on salient object detection was done by Itti et al. [51], who implemented the first salient object detection framework on natural as well as synthetic images. They also suggested that saliency can be computed in two ways, i.e. (i) bottom-up and (ii) top-down. Bottom-up methods employ low-level image features such as contrast, color, edges and orientation for computing the saliency map of an image. Bottom-up methods are stimulus driven, fast and independent of the task at hand. On the other hand, top-down methods are based on high-level cognitive characteristics such as knowledge and expectation. Top-down methods are slow, volition-controlled and task dependent. Some methods [125, 132] in the literature employ both bottom-up and top-down mechanisms for salient object detection and can be put in the category of mixed methods. Studies of the human visual system [162] have also shown that both bottom-up and top-down mechanisms are necessary for the proper functioning of the human visual attention system.

A comprehensive survey of salient object detection methods has been done by Borji and Itti [12]. They broadly divided salient object detection models into eight categories [12], viz. (i) Cognitive Models [51, 125, 177] (ii) Bayesian Models [96, 200] (iii) Decision Theoretic Models [39] (iv) Information Theoretic Models [17, 48] (v) Graphical Models [7, 45, 100, 101] (vi) Spectral Analysis Models [3, 43, 47] (vii) Pattern Classification Models [24, 62, 68, 132] (viii) Other Models [142, 143]. Some methods lie in more than one category. However, most salient object detection methods are derived from cognitive models directly or indirectly.

Once a saliency map has been obtained using some salient object detection method, the next step is to obtain the corresponding binary attention mask by applying a threshold. Thresholding is an important image processing operation with several applications [157] such as document image analysis [1, 63], map processing [174], target detection [8] and defect detection in materials [158]. In thresholding, it is presumed that there is a significant difference between the gray levels of an object in the image and those of the background, so that an image containing the object can be converted to a binary image by applying a threshold. In the image processing context, a threshold is a gray level which partitions the pixels of an image into two classes, i.e. (i) pixels whose values are greater than the threshold and (ii) pixels whose values are less than the threshold. Pixels whose value is equal to the threshold can be put in either of the two classes. Thus, thresholding converts a gray level image into a binary image. The pixels greater than the threshold can be given label 1 and the pixels less than the threshold label 0, or vice versa, depending upon the application. A comprehensive survey of more than 40 image thresholding techniques was done by Sezgin and Sankur [157]. They categorized thresholding methods into six categories depending upon the information used for computing the threshold. Each of these six categories is discussed briefly below:

  • Histogram Shape based: Thresholding methods under this category are based on the shape properties of the histogram. Methods in this category usually search for peaks and valleys to determine the threshold. Rosenfeld and Torre [148] analyzed the histogram concavities and the convex hull and suggested that the concavities with minimum values can be used as threshold. Sezan [156] convolved the histogram of an image with a smoothing and differencing kernel and analyzed the peaks in the histogram. The threshold is determined to be between the first and second zero crossings.

  • Clustering based: In these thresholding methods, the gray level image data are clustered into two classes. In [83], the threshold is taken to be the midpoint of the two peaks of the histogram. Some researchers have fitted Gaussian Mixture Models (GMMs) to the histogram for determining the threshold [72]. A modified form of clustering called mean-square clustering is suggested by Otsu [127] and fuzzy clustering is suggested by Jawahar et al. [52] for determining the threshold.

  • Entropy based: These thresholding methods employ an entropy measure of the image for determining the threshold. Johannsen and Bille [60] and Pal et al. [128] proposed the seminal work in this category. The threshold is determined based on the idea that maximum information transfer is indicated by maximization of the entropy measure in the thresholded image [65, 133, 134, 152]. In [87], the threshold is determined by minimizing the cross entropy between the original gray level image and the thresholded image. In recent years, entropy based thresholding has received attention from researchers in various fields. Mahmoudi and Zaart [114] have carried out a recent survey of entropy based thresholding techniques.

  • Attribute Similarity based: These methods exploit the similarity of some measure between the original image and the binarized image. The attributes can be gray-level moments, edge matching, stability of segmented objects, shape compactness, texture etc. In some research works, the similarity of features between the original and binary images is measured using a fuzzy measure [122, 141], cumulative probability distributions [32] or the amount of information gained after segmentation [84]. A popular method in this category, proposed by Tsai [175], hypothesizes that a gray image is a blurred binary image. The threshold is computed in such a way that the first three gray level moments of the original and the binary image match.

  • Spatial Thresholding Method: In this category of thresholding methods, not only the gray level information is exploited but also the relationship of a pixel to its neighboring pixels. One of the earliest methods under this category was suggested by Kirby and Rosenfeld [71], in which local gray levels are used for computing the threshold. Other researchers followed with improvements such as using relaxation for improving binary maps [31], enhancement of the histogram using the Laplacian of the image [183], quadtree thresholding [185], etc.

  • Locally Adaptive thresholding: In these thresholding methods, a threshold is computed at each pixel depending upon local pixel characteristics. Nakagawa and Rosenfeld [123] and Deravi and Pal [25] were among the early researchers in this category. In [126, 153] local variance is employed, while in [184, 194] local contrast is employed. Palumbo et al. [129] suggested a centre-surround scheme for computing the threshold.

From the above discussion, it is apparent that thresholding is an important image processing operation. It is also extensively used in salient object detection to obtain the binary attention mask from the saliency map. However, there is no research work in the literature that gives insight into the various thresholding methods used in salient object detection. In this paper, we develop a novel taxonomy of the thresholding methods used in the literature for salient object detection. We have considered various factors while forming this taxonomy, as discussed in Section 2.

The rest of the paper is organized as follows: Section 2 introduces the general concept of thresholding in the SOD context and presents a novel taxonomy. Sections 3 to 6 describe in detail the methods under the proposed taxonomy. In Section 7, we present existing and proposed performance measures for SOD methods which depend on thresholding. Section 8 presents a comparative study of popular thresholding methods in salient object detection. Discussion and unexplored thresholding methods are presented in Section 9. Concluding remarks and future directions are given in Section 10.

2 Thresholding in salient object detection

As we know, the major objective in salient object detection is to determine a binary attention mask which can be employed for extracting the object of interest from a digital image. Most SOD methods generate a saliency map which is required to be converted into a binary map. In this binary image, foreground is represented as label 1 and background is represented as label 0 or vice versa depending upon the application.

Now suppose we are given a digital image I(x, y) of size W × H, where x ∈ {1, 2, ..., W} and y ∈ {1, 2, ..., H}, and S(x, y) is the corresponding saliency map. Some thresholding is applied to the saliency map S to convert it into a binary attention mask A(x, y). Figure 1 shows a sample image with its corresponding saliency map, the binary attention mask corresponding to the saliency map and the ground truth of the image. Ideally, the thresholded version should be as close to the ground truth as possible.

Fig. 1 (a) Original image, (b) corresponding saliency map, (c) binary attention mask obtained after thresholding the saliency map in (b), and (d) ground truth

From the above discussion, it is apparent that thresholding is an indispensable step in most salient object detection methods. Here, we present a comprehensive survey of thresholding methods employed in salient object detection keeping in view the following factors:

  1. Whether threshold is applied by SOD method or not?

  2. If a threshold is applied then whether a single, multiple or a combination of thresholds is applied on the image?

  3. Whether thresholding is image dependent or not?

  4. Whether thresholding requires human intervention or not?

A few SOD methods produce only saliency maps and not binary attention masks. Most salient object detection methods, on the other hand, produce a binary attention mask by applying some threshold to the saliency map. This threshold can be found in several ways. To better understand the various thresholding methods employed in SOD, and considering the above-mentioned factors, we have developed a simple taxonomy of thresholding methods in salient object detection, as shown in Fig. 2. We broadly divide the thresholding methods employed in salient object detection into five categories, viz. (i) No Threshold (ii) Global Thresholding (iii) Local Thresholding (iv) Hybrid Thresholding and (v) Other Thresholding Methods. Methods falling under each of these categories are discussed in detail below and in the following sections.

Fig. 2 Taxonomy of various thresholding methods in Salient Object Detection

A few SOD methods aim only to find salient regions in an image and output just the saliency map. Thus, there is no requirement of thresholding in such cases. The performance of these methods is usually analyzed from the quality of the saliency map they generate. This saliency map is simply a grey level digital image which highlights the object of interest, i.e. the salient object, and suppresses the background. The saliency map generated by one method is compared subjectively with those generated by other methods. SOD methods falling under this category, together with the features employed by each method, are listed in Table 1.

Table 1 Methods in which No Thresholding is employed

As no thresholding is applied to the saliency map to obtain an attention mask, these methods have the lowest time complexity of all the categories in the taxonomy. In addition, the saliency map obtained can be used for applications including edge or contour detection. The disadvantages of these methods include (i) lack of objective performance measures and (ii) difficulty in choosing the better method among close competitors.

3 Global thresholding

Salient object detection methods under this category employ Global or single threshold (T) for converting the saliency map into binary attention mask using the following equation:

$$ \mathbf{A}(x,y) = \left\{\begin{array}{lllllllll} 0 & \; if \; \mathbf{S}(x,y) < T \\ 1 & \; if \; \mathbf{S}(x,y) \geq T \end{array}\right. $$
(1)

The pixels in the saliency map having values greater than or equal to the threshold are marked as foreground while the pixels having values less than the threshold are marked as background. It is therefore essential to consider how this single threshold is determined. One of the simplest ways is to try all possible threshold values on all the images at hand and choose the one which gives the best performance. However, this leads to higher computational requirements. Thus, we need a computationally efficient way of finding a threshold which still gives good performance. Various researchers have proposed several global thresholding methods over time.
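As a minimal sketch of Eq. (1), the following pure-Python snippet binarizes a saliency map stored as a list of rows. The function name `binarize` and the 3 × 3 example map are illustrative assumptions, not part of any surveyed method.

```python
def binarize(saliency, threshold):
    """Apply Eq. (1): label 1 where S(x, y) >= T, otherwise 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in saliency]


# Hypothetical 3 x 3 saliency map normalized to [0, 255].
saliency_map = [
    [10, 40, 200],
    [30, 150, 220],
    [5, 149, 90],
]
mask = binarize(saliency_map, threshold=150)
```

Note that the pixel with value 150 is labeled as foreground while the pixel with value 149 is not, which reflects the "greater than or equal to" convention of Eq. (1).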

In global thresholding methods, an important factor to consider is whether the threshold depends on the image at hand or not. An image dependent threshold varies from image to image, while an image independent threshold remains the same for all the images under consideration. Another factor to consider is whether human intervention is required or not. Based on these factors, global thresholding is further divided into three subcategories: (i) Fixed thresholding (ii) Semi-Adaptive thresholding and (iii) Adaptive thresholding.

3.1 Fixed thresholding

SOD methods falling under this category employ image independent thresholding and do not use any image characteristics to determine the threshold. In these methods, threshold is fixed to some value between [0, 255] or [0, 1] depending upon the range of saliency map normalization. Thresholding methods under this category are further divided into two classes: (a) Single Fixed Thresholding (b) Range of Fixed thresholding.

In Single Fixed thresholding, only one threshold irrespective of the image is employed. The value of the threshold lies in the range T ∈ [0, 255] or [0, 1]. The methods which employ single fixed threshold are given in Table 2.

Table 2 Methods which use Single Fixed Thresholding

In Range of Fixed Thresholding methods, the whole range of possible thresholds, sampled at suitable intervals, is employed in the range [0,255] or [0,1]. In the range [0,255], the threshold is usually varied from 0 to 255 in steps of size 1, thereby producing 256 attention masks. In the range [0,1], the threshold is usually varied from 0 to 1 in steps of 0.01, resulting in 101 attention masks. This type of thresholding is widely used in the literature because it is needed for drawing the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve and for computing the Area Under the ROC Curve (AUC) performance measure. The ROC curve is drawn between the true positive rate (TPR) and the false positive rate (FPR) at each threshold from 0 to 255. Methods employing fixed range thresholding are listed in Table 3, while in a few Fixed Range thresholding methods [59, 64, 92, 178, 182] the threshold is varied between 0 and 1.

Table 3 Methods which use Range ([0,255]) of Fixed Thresholding
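The range-of-fixed-thresholds idea can be sketched as a simple sweep that produces one attention mask per threshold. The snippet below is only an illustration under the assumption of an integer saliency map in [0, 255]; `threshold_sweep` and the tiny example map are hypothetical names and data.

```python
def threshold_sweep(saliency, levels=256):
    """Return one binary mask per integer threshold 0 .. levels - 1."""
    masks = []
    for t in range(levels):
        masks.append([[1 if v >= t else 0 for v in row] for row in saliency])
    return masks


# Hypothetical 2 x 2 saliency map normalized to [0, 255].
saliency_map = [[0, 128], [200, 255]]
masks = threshold_sweep(saliency_map)

# Number of foreground pixels in each mask; it can only shrink as T grows.
foreground_counts = [sum(map(sum, m)) for m in masks]
```

Each of the 256 masks can then be compared with the ground truth to obtain one point of an ROC or PR curve.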

3.2 Semi-adaptive thresholding

In this category, the threshold consists of two components: (i) one component set by the user and (ii) other component computed from the image. One of the methods in this category is proposed by Jin et al. [59]. In their method, the threshold T is computed using the following equation:

$$ T = \alpha \times \sigma $$
(2)

where α = 0.70 and σ is the variance of the image.

In these thresholding methods, manual intervention is required to set the value of the parameter α. This value can be varied to tune the performance of salient object detection for the application at hand. However, these methods have some drawbacks: (i) they are not fully automatic and (ii) they generalize poorly. Therefore, this type of thresholding is not very popular in the literature.
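A minimal sketch of Eq. (2) is given below. It assumes that σ is the population variance of whatever map is passed in and that the values are normalized to [0, 1]; both are assumptions made for illustration, since the text only states that σ is the variance of the image. The function name `semi_adaptive_threshold` is hypothetical.

```python
from statistics import pvariance


def semi_adaptive_threshold(values, alpha=0.70):
    """Eq. (2): T = alpha * sigma, with sigma taken as the population variance."""
    flat = [v for row in values for v in row]
    return alpha * pvariance(flat)


# Hypothetical 2 x 2 map with values normalized to [0, 1].
example = [[0.0, 0.0], [1.0, 1.0]]
threshold = semi_adaptive_threshold(example)  # variance 0.25, so T is about 0.175
```

The user-set component is `alpha`; the image-dependent component is the variance.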

3.3 Adaptive thresholding

This is the most widely employed thresholding method in salient object detection. In this kind of thresholding, a single global threshold (T) is determined using image characteristics. Due to this dependence, adaptive thresholding is image dependent i.e. it varies with each individual image. To determine the adaptive threshold, various thresholding functions or algorithms are employed by SOD methods as given in Table 4.

Table 4 Thresholding functions for Adaptive Global Thresholding

Hou and Zhang [47] set the threshold to be thrice the average saliency value. The threshold T is computed using (3):

$$ T = \frac{3}{W \times H}\sum\limits_{x = 0}^{W-1}\sum\limits_{y = 0}^{H-1}\mathbf{S}(x,y) $$
(3)

Another popular threshold was suggested by Achanta et al. [3]. The proposed threshold is twice the average saliency value, as given in (4). This threshold has subsequently been used by several researchers, as shown in Table 4.

$$ T = \frac{2}{W \times H}\sum\limits_{x = 0}^{W-1}\sum\limits_{y = 0}^{H-1}\mathbf{S}(x,y) $$
(4)
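Equations (3) and (4) differ only in the factor applied to the average saliency value, so both can be sketched with one helper. The function name `mean_multiple_threshold` and the example map are illustrative assumptions.

```python
def mean_multiple_threshold(saliency, factor):
    """Return factor times the average saliency value of the map."""
    total = sum(sum(row) for row in saliency)
    count = sum(len(row) for row in saliency)
    return factor * total / count


# Hypothetical 2 x 3 saliency map normalized to [0, 255]; its mean is 75.
saliency_map = [[10, 20, 30], [40, 100, 250]]

t_hou = mean_multiple_threshold(saliency_map, 3)      # Eq. (3): thrice the mean
t_achanta = mean_multiple_threshold(saliency_map, 2)  # Eq. (4): twice the mean
```

Other factors reported in the literature for this family of thresholds can be tried by changing `factor`.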

Judd et al. [62] suggested choosing the threshold so that a given percentage of image pixels (e.g. 1, 3, 5, 10, 15, 20, 25 or 30) is marked as salient. Rosin et al. [149] employed Tsai’s moment preserving algorithm [175] for determining the global threshold. In Tsai’s algorithm, a grey level image is assumed to be a blurred version of an ideal binary image. In order to find the ideal unblurred version of the image, i.e. the binary image, the first three moments are preserved. Thus, these three moments are the same in the grey level image and its binarized version. The i-th moment (m_i) of an image I is defined as below:

$$ m_{i} = \frac{1}{W \times H}\sum\limits_{x = 0}^{W-1}\sum\limits_{y = 0}^{H-1}\mathbf{I}^{i}(x,y) $$
(5)
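The moment-preserving idea can be illustrated with a short sketch. It computes the moments of Eq. (5), solves for the two representative grey levels z0 < z1 and the fraction p0 of pixels assigned to z0 such that the first three moments are preserved, and then picks the grey level whose cumulative pixel fraction is closest to p0. The closed-form solution follows the usual bilevel formulation of moment preservation; the "closest fraction" selection rule, the function names and the example image are assumptions made for this illustration, and constant images are not handled.

```python
import math


def image_moment(image, order):
    """Eq. (5): mean of the pixel values raised to the given power."""
    pixels = [v for row in image for v in row]
    return sum(v ** order for v in pixels) / len(pixels)


def tsai_parameters(image):
    """Return (z0, z1, p0) such that a binary image with grey levels z0 and z1,
    where a fraction p0 of the pixels takes the value z0, has the same first
    three moments as the input image."""
    m1 = image_moment(image, 1)
    m2 = image_moment(image, 2)
    m3 = image_moment(image, 3)
    cd = m2 - m1 * m1  # variance; zero for a constant image (not handled)
    c0 = (m1 * m3 - m2 * m2) / cd
    c1 = (m1 * m2 - m3) / cd
    # z0 and z1 are the roots of z^2 + c1 * z + c0 = 0.
    root = math.sqrt(c1 * c1 - 4.0 * c0)
    z0 = 0.5 * (-c1 - root)
    z1 = 0.5 * (-c1 + root)
    p0 = (z1 - m1) / (z1 - z0)
    return z0, z1, p0


def tsai_threshold(image, levels=256):
    """Smallest grey level T whose fraction of pixels below T is closest to p0."""
    pixels = [v for row in image for v in row]
    _, _, p0 = tsai_parameters(image)

    def gap(t):
        below = sum(1 for v in pixels if v < t) / len(pixels)
        return abs(below - p0)

    return min(range(levels + 1), key=gap)


# Hypothetical image that is already bilevel: three pixels at 50, one at 200.
image = [[50, 50], [50, 200]]
z0, z1, p0 = tsai_parameters(image)
threshold = tsai_threshold(image)
```

For an image that is already bilevel, the recovered grey levels and fraction coincide with the actual ones, which gives a quick sanity check of the formulas.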

Alexe et al. [4] first resized the saliency map to 64 × 64 resolution and then took the average saliency value of this re-scaled saliency map as the threshold. The Otsu algorithm [127] has also been employed by various researchers such as Khuwuthyakorn et al. [67], Luo et al. [109], Liang et al. [93] and Hu et al. [49]. The Otsu algorithm is a clustering based algorithm for determining the global threshold. In this method, a global threshold is determined such that the spread between the foreground pixels and the background pixels is maximum while the combined intra-class spread is minimum.
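The Otsu criterion described above can be sketched in a few lines by maximizing the between-class variance over all candidate thresholds, which is equivalent to minimizing the combined intra-class spread. The snippet assumes an integer saliency map in [0, 255]; the function name `otsu_threshold` and the example map are illustrative.

```python
def otsu_threshold(saliency, levels=256):
    """Return T maximizing the between-class variance of {v < T} and {v >= T}."""
    hist = [0] * levels
    for row in saliency:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    total_sum = sum(g * h for g, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels with grey level < t
    sum0 = 0  # sum of grey levels of those pixels
    for t in range(1, levels):
        w0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not meaningful
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical map with a dark background and a bright salient region.
saliency_map = [[20, 22, 24], [200, 210, 220]]
t = otsu_threshold(saliency_map)
mask = [[1 if v >= t else 0 for v in row] for row in saliency_map]
```

When several thresholds give the same score, this sketch keeps the smallest one; other tie-breaking rules are equally valid.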

Yu et al. [196] proposed using a threshold which is α_t times the average saliency value. The value of α_t is set to a low value in order to get a high recall rate. Jia et al. [54] and Scharfenberger et al. [154] suggested using a threshold which is the sum of two quantities, i.e. the mean (μ) and standard deviation (σ) of the saliency map.

Jiang et al. [57] and Zhao et al. [210] employed a threshold obtained by multiplying the average saliency value by a factor of 1.5. This multiplication factor was varied over [0.1, 6] by Wang et al. [181] and set to 1.7 by Xu et al. [188]. Similarly, Xu et al. [189] introduced another factor by which the average saliency value is multiplied to determine the threshold: a pixel in the saliency map (SMAP) is termed salient if its value is more than this factor times mean(SMAP), otherwise it is marked as background. Singh et al. [165] and Arya et al. [5] have suggested a two-step thresholding. In the first step, the silhouettes of objects are determined using the Canny edge operator [18]. Subsequently, the mean value of the saliency map over the object silhouette is used as the threshold. Guo et al. [44] have suggested using the median value of the saliency map as the threshold.

Gao et al. [37], Zhang et al. [198], Lu et al. [108], Liu et al. [99] and Zhou et al. [215] used the average saliency value as the threshold, while Wang et al. [179] set the threshold to be half the maximum saliency value. Ren et al. [145] have suggested the saliency cut algorithm for converting the saliency map into a binary attention mask. In the saliency cut algorithm, an attention window AW of size 10 × 10 is first initialized such that the center of the window is the position where the mean saliency value in the window is maximum. This window is then extended in both the x-direction and the y-direction until the pixels on the boundary are smaller than a fixed threshold. Zhou et al. [212] set the threshold to be the maximum of twice the average saliency value and the maximum grey level value. Kumar et al. [78] have employed the negative transform of the mean saliency value as the threshold. In their research work, the saliency map was normalized between 0 and 255.

The global thresholding methods have the major advantage of being simple to implement, as only a single threshold is employed throughout the saliency map. The time complexity of these methods is higher than that of No Threshold methods but lower than that of local thresholding methods. Performance measures such as the ROC curve and PR curve can only be drawn with the help of range of fixed thresholding, which is a global thresholding method. The disadvantage of global thresholding is that no neighbourhood information, i.e. the local structure of the image, is used in determining the threshold. Single fixed thresholding also generalizes poorly across different images, as can be observed from Fig. 3: a fixed threshold may be good for some images and bad for others. Moreover, a single global threshold may not be suitable for all the regions in an image. Semi-Adaptive thresholding requires human intervention, which is unwanted for fully automatic salient object detection.

Fig. 3 (Top to bottom) (i) Sample images (ii) Saliency maps obtained using Liu et al. [101] (iii) Binary attention mask obtained from saliency map after employing fixed threshold of 150 (iv) Binary attention mask obtained by employing global adaptive thresholding same as Achanta et al. [3]

4 Local thresholding

In local thresholding methods, multiple thresholds are computed and applied in a local manner. These local thresholds may be based on a superpixel or a block. In this paper, we assume that superpixels are groups of pixels that can have arbitrary shapes, while blocks are rectangular patches of pixels. Local methods are computationally more expensive than global thresholding methods. We categorize local thresholding methods into two classes: (i) Superpixel based and (ii) Block based.

4.1 Superpixel based

Chang et al. [19] suggested using the same global adaptive thresholding as Achanta et al. [3] but applied it in a local manner on the superpixels. Given superpixels indexed by m, each of size a_m, the threshold is computed using the following equation:

$$ T = 2 \times \frac{{\sum}_{m} d_{m} a_{m}}{{\sum}_{m^{\prime}} a_{m^{\prime}}} $$
(6)

where d_m is the number of eye fixations in the superpixel.
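Equation (6) is a size-weighted average of the per-superpixel values, doubled. The sketch below computes it from two parallel lists; the function name `superpixel_threshold` and the example numbers are hypothetical, and the final step that compares each superpixel's value with T is an assumption added for illustration, since the text does not spell out how the threshold is applied.

```python
def superpixel_threshold(fixations, sizes):
    """Eq. (6): T = 2 * sum_m(d_m * a_m) / sum_m(a_m)."""
    weighted = sum(d * a for d, a in zip(fixations, sizes))
    return 2 * weighted / sum(sizes)


# Hypothetical data for three superpixels: d_m values and sizes a_m in pixels.
fixations = [0, 3, 1]
sizes = [100, 50, 50]

threshold = superpixel_threshold(fixations, sizes)

# Assumed usage: keep the superpixels whose value reaches the threshold.
salient = [m for m, d in enumerate(fixations) if d >= threshold]
```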

4.2 Block based

Li et al. [90] have suggested another local patch based thresholding method in which the saliency value of a patch is multiplied by some constant depending upon the variance of the patch. In this method, a patch p_k is determined to be non-salient or low-salient with the help of the following equation:

$$ p_{k} = \left\{\begin{array}{lllllllll} \mu_{1}p_{k} & \text{if } \sigma_{2} \leq \mathbf{S}(p_{k}) \leq \sigma_{1}\\ \mu_{2}p_{k} & \text{if} \; \mathbf{S}(p_{k}) < \sigma_{2} \\ p_{k}& \text{in other cases} \end{array}\right. $$
(7)

Here σ_1 and σ_2 are the thresholds which determine whether a patch is non-salient or low-salient. Depending upon these thresholds, patch p_k is multiplied by a constant to modify the saliency value of all the pixels in that patch. These constants, viz. µ_1 and µ_2, are chosen such that 0 < µ_2 < µ_1 < 1 (Table 5).

Table 5 Local Thresholding Methods
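The patch-wise rule of Eq. (7) can be sketched as follows. The quantity S(p_k) is taken here to be the mean saliency of the patch, which is an assumption made for this illustration; the function name `attenuate_patch` and all numeric values are likewise hypothetical.

```python
def attenuate_patch(patch, sigma1, sigma2, mu1, mu2):
    """Scale a patch according to Eq. (7).

    S(p_k) is assumed to be the mean saliency of the patch.
    """
    assert 0 < mu2 < mu1 < 1, "constants must satisfy 0 < mu2 < mu1 < 1"
    flat = [v for row in patch for v in row]
    s = sum(flat) / len(flat)
    if sigma2 <= s <= sigma1:
        scale = mu1   # first case of Eq. (7)
    elif s < sigma2:
        scale = mu2   # second case: strongest attenuation
    else:
        scale = 1.0   # all other cases: patch left unchanged
    return [[scale * v for v in row] for row in patch]


# Hypothetical parameters and 2 x 2 patches on a [0, 255] scale.
params = dict(sigma1=100, sigma2=40, mu1=0.5, mu2=0.25)
weak = attenuate_patch([[20, 20], [20, 20]], **params)
medium = attenuate_patch([[60, 80], [60, 80]], **params)
strong = attenuate_patch([[200, 220], [210, 230]], **params)
```

Because µ_2 < µ_1 < 1, patches with the lowest saliency are suppressed the most, while clearly salient patches pass through unchanged.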

The advantage of local thresholding methods is that the local neighbourhood structure is used in determining multiple thresholds in an image. Furthermore, local thresholding methods are usually insensitive to global changes such as illumination. Local thresholding also suffers from some drawbacks. The time complexity of these methods is higher than that of global thresholding methods. Determining the optimal patch size for threshold determination is another problem. These methods do not exploit global image characteristics and may also suffer from blocking artifacts. Due to these disadvantages, local thresholding methods are scarcely used in the literature.

5 Hybrid thresholding

Hybrid thresholding employs both local and global thresholding methods for converting the saliency map into an attention mask. In this category, either a combination of local and global methods or region growing is employed for determining the final threshold(s) or for obtaining the binary attention mask from the saliency map. Based on these criteria, hybrid thresholding methods are divided into two categories: (i) Local and Global thresholding (ii) Region growing/Optimization thresholding (Table 6).

Table 6 Hybrid Thresholding Methods

5.1 Local and global

In this thresholding, both local and global thresholds are computed using the saliency map. The local threshold is computed from local patches while the global threshold is computed from the whole saliency map. These patches are then marked as foreground or background using some mechanism. In research works [76, 88, 160, 172], the image I is first over-segmented using the mean-shift algorithm. The average saliency value p_i is then computed for each segment s_i, and m denotes the overall mean saliency value of I. A segment s_i is marked as foreground if p_i > 2 × m.
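The segment-level rule above can be sketched as follows. The over-segmentation itself is not implemented; the snippet assumes that a label map of the same shape as the saliency map is already available (from mean-shift or any other segmentation), and the function name `label_segments` and the example data are hypothetical.

```python
def label_segments(saliency, segments):
    """Mark a segment as foreground if its mean saliency exceeds 2 * overall mean."""
    totals, counts = {}, {}
    for s_row, l_row in zip(saliency, segments):
        for value, label in zip(s_row, l_row):
            totals[label] = totals.get(label, 0) + value
            counts[label] = counts.get(label, 0) + 1

    overall_mean = sum(totals.values()) / sum(counts.values())
    foreground = {
        label for label in totals
        if totals[label] / counts[label] > 2 * overall_mean
    }
    return [[1 if label in foreground else 0 for label in l_row] for l_row in segments]


# Hypothetical 3 x 4 saliency map and a segmentation with three segments.
saliency_map = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 10, 10],
]
segment_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]
mask = label_segments(saliency_map, segment_labels)
```

Here the global statistic (the overall mean) sets the bar, while the decision is taken locally, once per segment.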

5.2 Region growing

Ma and Zhang [111] suggested using fuzzy growing for segmenting the saliency map into a binary attention mask. In their research work, the saliency map is modeled as a probability space with regard to fuzzy events corresponding to attended (U_f) and unattended (U_b) regions. The fuzzy membership functions of these events are defined as below:

$$ \mu_{f} = \left\{\begin{array}{lllllllll} 1 & x \geq a \\ \frac{x-u}{a-u} & u < x < a \\ 0 & x \leq u \end{array}\right. $$
(8)
$$ \mu_{b} = \left\{\begin{array}{lllllllll} 0 & x \geq a \\ \frac{x-a}{u-a} & u < x < a \\ 1 & x \leq u \end{array}\right. $$
(9)

where x denotes the gray level of the saliency map. Here, a and u are the parameters which are determined using a modified minimal difference of entropy metric.
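The two membership functions of Eqs. (8) and (9) can be written directly as code. The sketch assumes u < a and uses arbitrary example values for both parameters; it does not implement the entropy-based estimation of a and u.

```python
def membership_foreground(x, a, u):
    """Eq. (8): membership of grey level x in the attended (foreground) set."""
    if x >= a:
        return 1.0
    if x <= u:
        return 0.0
    return (x - u) / (a - u)


def membership_background(x, a, u):
    """Eq. (9): membership of grey level x in the unattended (background) set."""
    if x >= a:
        return 0.0
    if x <= u:
        return 1.0
    return (x - a) / (u - a)


# Hypothetical parameters on a [0, 255] scale, with u < a.
a, u = 200, 100
mu_f = membership_foreground(125, a, u)
mu_b = membership_background(125, a, u)
```

For any grey level the two memberships sum to one, so a pixel between u and a is partly attended and partly unattended.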

Mehrani et al. [120] have suggested using binary graph-cut optimization for converting the saliency map into binary attention mask. In their research work, initial segmentation is refined using a binary graph-cut optimization based trained classifier. This initial segmentation thus obtained is supplied to Iterative graph cut algorithm [15] which renders the final attention mask.

Marchesotti et al. [117], Luo et al. [109], Muratov et al. [121] and Xie et al. [187] have employed the iterative graph cut algorithm for finding the binary attention mask from the saliency map. Marchesotti et al. [117] suggested obtaining the initial binary map for the iterative graph cut algorithm in one of two ways:

  (i) a binary map S_b using a hard threshold th_bin, which is set to 0.6 in their research work.

  (ii) a binary map S_g using two thresholds, i.e. th+ and th-. The pixels above th+ are labeled as foreground pixels and pixels below th- are marked as background pixels. The pixels lying between th+ and th- are assigned to foreground or background based on an energy minimization function.

The initialization of the iterative graph cut algorithm is done with binary map S* as follows:

$$ \mathbf{S}^{*} = \left\{\begin{array}{lllllllll} \mathbf{S}_{g} & \text{if} \frac{\mathbf{S}_{b} \cap \mathbf{S}_{g} }{\mathbf{S}_{b} \cup \mathbf{S}_{g}} > th_{d} \\ \mathbf{S}_{b} & \text{otherwise} \end{array}\right. $$
(10)

where th_d is set to 0.1.
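The selection rule of Eq. (10) can be sketched as an overlap test between the two candidate maps. The ratio in Eq. (10) is interpreted here as the number of pixels in the intersection divided by the number of pixels in the union of the two foreground regions, which is an assumption; the function name `choose_initial_map` and the example masks are hypothetical.

```python
def choose_initial_map(s_b, s_g, th_d=0.1):
    """Eq. (10): return S_g if its overlap with S_b exceeds th_d, else S_b."""
    intersection = union = 0
    for row_b, row_g in zip(s_b, s_g):
        for b, g in zip(row_b, row_g):
            intersection += 1 if (b and g) else 0
            union += 1 if (b or g) else 0
    overlap = intersection / union if union else 0.0
    return s_g if overlap > th_d else s_b


# Hypothetical 2 x 3 binary maps.
s_b = [[1, 1, 0], [0, 0, 0]]
s_g_similar = [[1, 1, 1], [0, 0, 0]]   # overlap 2 / 3
s_g_disjoint = [[0, 0, 0], [0, 0, 1]]  # overlap 0

picked_similar = choose_initial_map(s_b, s_g_similar)
picked_disjoint = choose_initial_map(s_b, s_g_disjoint)
```

The two-threshold map is thus only trusted when it agrees at least slightly with the hard-thresholded map.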

Luo et al. [110] suggested global salient information maximization (GSIM) for saliency computation and employed the output of GSIM for initializing the iterative graph cut algorithm. Muratov et al. [121] and Xie et al. [187] employ generic iterative graph cut algorithm without any initialization.

Cheng et al. [22], Jian et al. [56] and Ju et al. [61] have employed a fixed threshold value of 70 to find an initial binary attention mask. This attention mask is then supplied to the Grab-cut algorithm for segmentation. Zhang et al. [199] and Zhao et al. [211] have employed the Grab-cut algorithm for segmentation of the saliency map. The main drawback of the Grab-cut algorithm is that it is interactive: the user is required to draw a rectangle to initialize the algorithm.

Hybrid thresholding possesses the advantage of exploiting local as well as global information from the saliency map for determining the threshold. However, the complexity of these methods is more than local and global thresholding methods. In some hybrid thresholding methods, manual intervention is required which is undesirable. Moreover, region growing methods may suffer from local minima.

6 Other thresholding methods

Some thresholding methods which do not fall under any of the above categories are classified under this category. Two of the popular thresholding methods used by Itti et al. [51] were Winner-Take-All (WTA) and Inhibition of Return (IOR) [74]. Salient locations are detected in an image in order of decreasing saliency value. Conditional Random Field (CRF) [80] is another way to perform thresholding in salient object detection. In a CRF model, linear weights for combining multiple feature maps are learnt from a training set. Simultaneously, the CRF partitions the image pixels into foreground and background. A CRF is based on conditional probability and on minimizing an energy function, which is defined in different ways by various researchers. These thresholding methods are not popular in the literature.

7 Performance measures

Once a thresholding method has been chosen for an algorithm, the next critical step in salient object detection is measuring the performance of the SOD method(s). A method can be claimed to outperform other methods only if it scores better under agreed performance measures. In this section, we present a brief review of the popular performance measurement criteria. Performance of the SOD methods can be measured in qualitative and quantitative terms.

7.1 Qualitative measures

The performance of SOD methods is measured in terms of qualitative measure by comparing the saliency maps produced by SOD methods. This type of performance measure is employed to visually show the quality of produced saliency map. A drawback of qualitative measure is that it is subjective and varies across different people.

7.2 Quantitative measures

In contrast to qualitative measures, quantitative measures assign numerical values to performance. Quantitative measures are also objective performance measures. For a given image I, suppose S is the corresponding saliency map produced by some method. If G denotes the ground truth corresponding to I, then a confusion matrix corresponding to pixel wise saliency map is computed as given in Table 7.

Table 7 Confusion matrix for Salient Object Detection

Based on this confusion matrix, various performance measures such as Precision, Recall and F-Measure are computed. Precision is the ratio of correctly detected foreground pixels (TP) to all pixels that the method labels as foreground (TP + FP). Recall is the ratio of correctly detected foreground pixels (TP) to all foreground pixels (TP + FN) in the ground truth.

$$ Precision = \frac{TP}{TP+FP} $$
(11)
$$ Recall = \frac{TP}{TP+FN} $$
(12)
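As an illustration of Eqs. (11) and (12), the following minimal sketch counts the confusion-matrix entries for a thresholded saliency map and derives Precision and Recall from them. The function names, the flat-list representation of S and G, and the convention that a pixel is foreground when its saliency is greater than or equal to the threshold are assumptions made for this example only.

```python
def confusion_counts(saliency, ground_truth, threshold):
    """Return (TP, FP, FN, TN) for a saliency map binarized at `threshold`.

    `saliency` is a flat sequence of saliency values and `ground_truth`
    a flat sequence of 0/1 labels of the same length (assumed layout).
    A pixel is treated as foreground when saliency >= threshold (assumed).
    """
    tp = fp = fn = tn = 0
    for s, g in zip(saliency, ground_truth):
        predicted_foreground = s >= threshold
        if predicted_foreground and g:
            tp += 1
        elif predicted_foreground:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Eq. (11) and Eq. (12); returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: six pixels, the first three belong to the salient object.
S = [200, 180, 90, 40, 220, 10]
G = [1, 1, 1, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(S, G, threshold=100)  # (2, 1, 1, 2)
precision, recall = precision_recall(tp, fp, fn)        # both equal 2/3
```

Returning 0.0 for an empty denominator is one possible convention; other evaluations may prefer to leave such cases undefined.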

There is always a trade-off between Precision and Recall: for example, one can achieve a high Precision or a high Recall simply by setting an appropriate threshold. These two performance measures are therefore combined into another performance measure called the F-Measure in the literature [146]. The traditional F-Measure, called the F1-Measure, is defined as follows:

$$ F_{1} - Measure = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
(13)

The F1-Measure gives equal importance to Precision and Recall. A more general F-Measure, called the Fβ-Measure, is defined in the literature as follows:

$$ F_{\beta}-Measure = \frac{(1+\beta^{2})\times Precision \times Recall}{\beta^{2} \times Precision + Recall} $$
(14)

Here, the parameter β controls the relative importance of Precision versus Recall. A value of β ∈ [0,1) gives more importance to Precision than Recall, β = 1 gives equal importance to both (i.e. the F1-Measure), and β > 1 gives more importance to Recall than Precision.
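The effect of β in Eq. (14) can be checked numerically with a short sketch. The function name `f_beta` and the Precision and Recall values of 0.9 and 0.6 are made up for illustration and are not results from this paper.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta measure of Eq. (14); beta = 1 gives the F1-Measure of Eq. (13)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both inputs are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Illustrative values only: high Precision, lower Recall.
precision, recall = 0.9, 0.6
f_half = f_beta(precision, recall, beta=0.5)  # about 0.818, pulled towards Precision
f_one = f_beta(precision, recall, beta=1.0)   # 0.72, the harmonic mean
f_two = f_beta(precision, recall, beta=2.0)   # about 0.643, pulled towards Recall
```

With Precision above Recall, the score rises as β shrinks below 1 and falls as β grows above 1, which matches the interpretation given above.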

All the above performance measures require a particular threshold to be fixed before they can be computed. This limitation can be overcome by sweeping every possible threshold (e.g. 0 to 255 if the saliency map is normalized to [0,255]). The False Positive Rate and True Positive Rate are then computed for each threshold and plotted in a 2-D graph called the Receiver Operating Characteristic (ROC) curve. The methods whose performance is to be compared are plotted in a single graph, and the method whose curve lies closest to the top-left corner is declared best. It is possible that two or more ROC curves cross each other, making it difficult to say which method performs best. To overcome this problem, the Area Under the ROC Curve (AUC) is used, whose value provides a numerical comparison between methods. Similar to the ROC curve, there is another curve called the Precision-Recall (PR) curve, in which the Precision and Recall obtained at every threshold are plotted against each other.
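The threshold sweep described above can be sketched as follows. This is a minimal illustration that assumes a saliency map stored as a flat list of integers in [0,255], a pixel being foreground when its saliency is greater than or equal to the threshold, and trapezoidal integration for the AUC. The names `roc_points` and `auc` are hypothetical.

```python
def roc_points(saliency, ground_truth, levels=range(256)):
    """Return one (FPR, TPR) pair per threshold in `levels`."""
    positives = sum(1 for g in ground_truth if g)
    negatives = len(ground_truth) - positives
    points = []
    for t in levels:
        tp = sum(1 for s, g in zip(saliency, ground_truth) if s >= t and g)
        fp = sum(1 for s, g in zip(saliency, ground_truth) if s >= t and not g)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule.

    The end points (0, 0) and (1, 1) are added so that the curve always
    spans the full FPR range (an assumption of this sketch).
    """
    curve = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# A saliency map that separates object and background perfectly.
S = [255, 255, 0, 0]
G = [1, 1, 0, 0]
print(auc(roc_points(S, G)))  # prints 1.0
```

A PR curve can be obtained from the same sweep by recording Precision and Recall instead of FPR and TPR at each threshold.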

Besides the above popular performance measures, there are other measures which are used in the image processing literature [157] but have not been used in salient object detection. Two thresholding measures which can prove useful in measuring the performance of SOD models are (i) Misclassification Error and (ii) Relative foreground area error. We have redefined these measures in terms of the entries of the confusion matrix.

(i) Misclassification Error (ME):

This measure gives the fraction of pixels that are misclassified, i.e. the false positives (FP) and false negatives (FN) relative to all pixels in the image. ME is defined mathematically in (15):

$$ ME = \frac{FP+FN}{TP+FN+FP+TN} $$
(15)

(ii) Relative foreground area error (RAE):

This performance measure was originally defined by Zhang et al. [197] and called relative ultimate measurement accuracy (RUMA). It was modified by Sezgin et al. [157] for the foreground area of an image and is defined as follows:

$$ RAE = \left\{\begin{array}{ll} \frac{FN}{TP + FN} & \text{if} \; TP < P \\ \frac{FP}{TP + FP} & \text{otherwise} \end{array}\right. $$
(16)

where P is the foreground area in the ground truth of image I and TP is the area of foreground image in the thresholded image I′.
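Both measures can be computed directly from the confusion-matrix counts. The sketch below follows Eqs. (15) and (16) as written and assumes, for the condition in Eq. (16), that P equals TP + FN, i.e. the number of foreground pixels in the ground truth. That reading of P, the function names and the zero-denominator conventions are assumptions of this example.

```python
def misclassification_error(tp, fp, fn, tn):
    """Eq. (15): share of all pixels that are labelled incorrectly."""
    total = tp + fp + fn + tn
    return (fp + fn) / total if total else 0.0


def relative_foreground_area_error(tp, fp, fn):
    """Eq. (16), taking P = TP + FN (assumed interpretation of P)."""
    p = tp + fn
    if tp < p:
        # Part of the ground-truth foreground was missed.
        return fn / p
    # No foreground pixel was missed; penalize spurious foreground instead.
    return fp / (tp + fp) if (tp + fp) else 0.0


# Counts from a toy six-pixel image: TP = 2, FP = 1, FN = 1, TN = 2.
me = misclassification_error(2, 1, 1, 2)       # 2 / 6
rae = relative_foreground_area_error(2, 1, 1)  # 1 / 3
```

Both values lie in [0, 1], with 0 indicating a mask that agrees with the ground truth under the respective criterion.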

8 Performance evaluation

Several aspects have to be considered before choosing a thresholding method. Here, we present a comparison of various thresholding methods in terms of time complexity, quantitative performance and qualitative performance.

8.1 Time and space complexity

The amount of time required by a thresholding method is proportional to the number of basic operations performed for computing the threshold and then applying this threshold to the saliency map. The amount of space required is directly proportional to the number of thresholds to be applied to the image. We give a general comparison of time and space complexities among different thresholding categories. In No Thresholding, SOD methods output only the saliency maps and hence no operation is involved in obtaining the attention mask. In global thresholding methods, only a single threshold is applied throughout the saliency map to obtain the attention mask. In local thresholding methods, multiple thresholds are computed at super-pixel or block level to convert the saliency map into an attention mask. Hybrid thresholding requires both local and global thresholds or region growing for finding the attention mask; region growing is an optimization problem and therefore requires more computation. Hence, the thresholding categories can be put in the following order of time and space complexity:

$$No \; Threshold < Global < Local < Hybrid$$

In Fixed thresholding methods, there is no need to compute the threshold explicitly. Hence, these methods have O(1) complexity. Adaptive thresholding methods exploit the saliency map for computing the global threshold. If n (= W × H for an image of size W × H) is the total number of pixels in the saliency map of an image, then the time complexity of popular global adaptive thresholding methods is as given in Table 8 below:

Table 8 Time complexity of popular global adaptive thresholding methods

The Semi-adaptive thresholding method proposed by Jin et al. [59] involves computation of the standard deviation and setting of a constant. Thus, the time complexity of the method of Jin et al. [59] is O(n). Local thresholding methods involve computation of multiple thresholds on patches of the saliency map. Let p be the number of local patches of average size n_p. Then the method of Chang et al. [19] has a time complexity of O(p n_p). Li et al. [90] apply two fixed thresholds s_1 and s_2 to define low saliency regions and non-salient regions respectively; as with the method of Chang et al. [19], the time complexity is O(p n_p). Hybrid thresholding methods segment the saliency map into an attention mask based on global and local thresholding or by optimizing some objective function. We have already discussed the time complexity of global and local methods. The hybrid methods which optimize an objective function require a number of iterations (T) for convergence. The time complexity of the mean shift algorithm is O(T n^2), while for other thresholding methods such as Grab-Cut or the iterative graph cut algorithm it is O(T n^3). The thresholding methods under Other Thresholding require training images to learn some parameters before producing the attention mask. It is difficult to directly compare these methods with thresholding methods in other categories.
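The operation counts behind these complexity classes can be made concrete with a small sketch. It is only illustrative: the fixed value 128 is arbitrary, the default factor of 2 mirrors the twice-the-mean rule usually attributed to Achanta et al. [3], and the block-mean rule is a generic stand-in for local thresholding rather than the method of Chang et al. [19] or Li et al. [90]. All function names are hypothetical.

```python
def fixed_threshold(value=128):
    """Fixed thresholding: no pass over the map is needed, O(1)."""
    return value


def adaptive_mean_threshold(saliency_map, factor=2.0):
    """Global adaptive thresholding: one pass over all n pixels, O(n)."""
    pixels = [v for row in saliency_map for v in row]
    return factor * sum(pixels) / len(pixels)


def block_mean_thresholds(saliency_map, block_size):
    """Local thresholding: one threshold per block, O(p * n_p) in total."""
    height, width = len(saliency_map), len(saliency_map[0])
    thresholds = {}
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [saliency_map[y][x]
                     for y in range(top, min(top + block_size, height))
                     for x in range(left, min(left + block_size, width))]
            thresholds[(top, left)] = sum(block) / len(block)
    return thresholds


def binarize(saliency_map, threshold):
    """Apply a single threshold to obtain the binary attention mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency_map]


# Toy 2 x 4 saliency map with a bright right-hand column.
S = [[0, 0, 0, 240],
     [0, 0, 0, 240]]
t = adaptive_mean_threshold(S)       # 2 * 60 = 120.0
mask = binarize(S, t)                # [[0, 0, 0, 1], [0, 0, 0, 1]]
local = block_mean_thresholds(S, 2)  # {(0, 0): 0.0, (0, 2): 120.0}
```

The global variant stores a single number, whereas the local variant stores one threshold per block, which reflects the space ordering given above.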

8.2 Quantitative evaluation

In this paper, we have presented more than 30 thresholding methods. Comparing all of them is beyond the scope of this paper. However, more than half of the thresholding methods in salient object detection fall under global adaptive thresholding. This is because global thresholding is simple to implement and these methods compute the threshold value from the saliency map of an image. Here, we present a quantitative comparison of popular thresholding methods. The experiments are performed on the ASD [3] and ECSSD [161] datasets. Each of these datasets has 1000 natural images and the corresponding ground truth images. Some example images from both datasets are shown in Fig. 4. All the images in the datasets are of size 300 × 400 or 400 × 300.

Fig. 4 Sample images from (top row) ASD [3] and (bottom row) ECSSD [161] datasets

The saliency maps for all the images were obtained using the method of Liu et al. [101], and the performance was compared in terms of Precision, Recall and F-Measure. The performance comparison of popular thresholding methods is shown in Figs. 5 and 6 for the ASD and ECSSD datasets respectively.

Fig. 5 Performance comparison of popular thresholding methods on ASD [3] dataset

Fig. 6 Performance comparison of popular thresholding methods on ECSSD [161] dataset

It can be readily observed from Figs. 5 and 6 that thresholding using Yu et al. [196] gives the highest recall among the compared methods. This is because, in the thresholding function employed by Yu et al., a_t is set to a small value in order to achieve better recall. Further, the threshold functions used by Achanta et al. [3] and Kumar et al. [78] give the best precision and F-Measure among all the compared methods on both datasets. It is also worth noting that the choice of thresholding clearly affects the performance of salient object detection methods, which supports the view that thresholding is an important step in any SOD method.

8.3 Qualitative comparison

As is apparent from the previous discussion, thresholding affects the performance of SOD methods. Here, we present a qualitative (or visual) comparison of the popular thresholding functions compared in Figs. 5 and 6.

Figure 7 shows two sample images from the ASD [3] dataset and their corresponding saliency maps obtained using the method of Liu et al. [101]. These saliency maps are then converted into binary attention masks by employing the thresholding methods given in the previous subsection. From Fig. 7, it can be seen that the binary attention masks for both images in (c), (h) and (m) are of poor quality. The binary attention masks in (e), (f), (g), (j) and (l) are better than the other attention masks. However, among these better attention masks, it is difficult to say which one is best. The visual comparison also confirms the dependence of the performance of SOD methods on thresholding.

Fig. 7 a Sample images from ASD [3] dataset b Corresponding saliency maps obtained using Liu et al. [101] c-p Attention masks corresponding to Global Thresholding functions used in Hou and Zhang [47], Achanta et al. [3], Rosin et al. [149], Alexe et al. [4], Luo et al. [109], Yu et al. [196], Jia et al. [54], Jiang et al. [58], Xu et al. [190], Singh et al. [165], Gao et al. [38], Zhou et al. [212], Kumar et al. [78], Wang et al. [179]

Figure 8 shows two sample images from the ECSSD [161] dataset and their corresponding saliency maps obtained using the method of Liu et al. [101]. From Fig. 8, it can be readily observed that, for both images, the binary attention masks in (h), (m) and (n) are not of good quality. The binary attention masks in (c), (i), (o) and (p) are better than the other attention masks. The attention masks in (k) and (l) are good for the top image but poor for the bottom image. However, among the better attention masks, it is difficult to say which attention mask is best.

Fig. 8 a Sample images from ECSSD [161] dataset b Corresponding saliency maps obtained using Liu et al. [101] c-p Attention masks corresponding to Global Thresholding functions used in Hou and Zhang [47], Achanta et al. [3], Rosin et al. [149], Alexe et al. [4], Luo et al. [109], Yu et al. [196], Jia et al. [54], Jiang et al. [58], Xu et al. [190], Singh et al. [165], Gao et al. [38], Zhou et al. [212], Kumar et al. [78], Wang et al. [179]

9 Discussion

Thresholding is an indispensable step in most salient object detection methods. Here, we have presented a novel taxonomy of thresholding methods employed in salient object detection. A few SOD methods output only the saliency map without paying much attention to finding the binary attention mask. Some SOD methods have employed fixed thresholding, which does not depend on any image characteristic, while some methods have employed all the possible thresholds in a range with suitable steps. The most popular thresholding methods are the fixed range thresholding and the global adaptive thresholding method proposed by Achanta et al. [3]. More than half of the research works have employed one of these two thresholding methods. Some researchers have modified the factor by which the average saliency value obtained using their proposed methods is multiplied. Some research works have employed more than one thresholding method for analyzing the performance of their proposed method. The major advantage of global thresholding methods is that they are simple to implement, which is why most researchers in this field use them. In addition, the ROC curve and PR curve can only be drawn with the help of global thresholding, which helps in comparing various salient object detection methods. Local thresholding methods have also been used in a few research works. However, these methods are scarcely used in the literature due to several disadvantages. Hybrid thresholding methods are more complex than global and local methods, which limits their use. For automatic salient object detection, thresholding must be fully automatic. Some methods, such as semi-adaptive and hybrid thresholding, require human intervention, which prohibits their application when the amount of data to be processed is extremely large.

Here, we have presented various thresholding methods employed for salient object detection in the literature. However, there exist some thresholding methods which have not been explored in salient object detection but are extensively employed in the image processing context. These methods can also be useful in salient object detection and are discussed next:

(i) Histogram shape based:

These thresholding methods employ the histogram characteristics of an image for determining the threshold. If the histogram of an individual image is used, then a global adaptive thresholding method can be developed.

(ii) Entropy based thresholding:

Entropy based thresholding has never been employed in salient object detection. These are local methods which employ entropy measure of a pixel together with its neighborhood for determining the threshold. Thus, with the help of entropy based thresholding methods, local thresholding methods for SOD can be developed.

(iii) Pixel level thresholding:

The methods employed for thresholding in SOD are either global, local or hybrid. Pixel level thresholding is another approach which is still unexplored in the domain of salient object detection. A method based on pixel level thresholding would pave the way for another approach under local thresholding.

(iv) Hybrid thresholding:

A combination of one or more of the above unexplored thresholding methods as well as the ones discussed in Section 3 can be employed for hybrid thresholding methods in SOD.
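To give a flavour of how such unexplored methods could be applied to a saliency map, the sketch below implements one well-known entropy criterion from the image processing literature, the maximum-entropy rule of Kapur et al., on a histogram of integer saliency values. It is an illustrative assumption rather than a method evaluated in this paper. It is applied here to a whole list of pixels for brevity; passing only the pixels of a block or neighbourhood would give a local variant. The function name is hypothetical.

```python
import math


def max_entropy_threshold(pixels, levels=256):
    """Return the level t that maximizes the sum of the entropies of the
    classes {0..t} and {t+1..levels-1}, or None if no split is possible.

    `pixels` is a flat sequence of integers in [0, levels) (assumed layout).
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)

    best_level, best_score = None, -math.inf
    count_low = 0
    for t in range(levels - 1):
        count_low += histogram[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class would be empty at this level
        entropy_low = -sum((h / count_low) * math.log(h / count_low)
                           for h in histogram[:t + 1] if h)
        entropy_high = -sum((h / count_high) * math.log(h / count_high)
                            for h in histogram[t + 1:] if h)
        if entropy_low + entropy_high > best_score:
            best_level, best_score = t, entropy_low + entropy_high
    return best_level


# Toy map: dark background values and bright object values.
S = [10, 20, 30, 200, 210, 220] * 5
t = max_entropy_threshold(S)
mask = [1 if s > t else 0 for s in S]  # pixels above t form the attention mask
```

The search costs O(levels^2) on top of the O(n) histogram pass, so for a fixed number of grey levels it remains linear in the number of pixels.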

Besides the above thresholding methods, local thresholding methods have also received less attention than global and hybrid thresholding methods. Thus, local thresholding methods can be further explored for SOD.

10 Conclusion and future work

Salient object detection has become an active research area in the last two decades due to its enormous applications in diverse fields. The main objective of SOD methods is to segment the image into foreground and background. SOD methods usually produce a saliency map which is converted to a binary image by applying some thresholding method. In this paper, we have presented a comprehensive survey of various thresholding methods used in salient object detection and developed a novel taxonomy. The thresholding methods are divided into five different categories: (i) No Threshold (ii) Global Thresholding (iii) Local Thresholding (iv) Hybrid Thresholding and (v) Other Thresholding methods. A few SOD methods output only the saliency map and use no thresholding. Global thresholding methods employ only a single threshold for the whole image. Global thresholding is further divided into three classes: (a) Fixed Thresholding (b) Semi-adaptive Thresholding and (c) Adaptive Thresholding. Fixed thresholding is image independent while global adaptive thresholding is image dependent. In Semi-adaptive thresholding, one factor is set by the user and the other is automatically computed from the saliency map of the image at hand. Local thresholding methods apply thresholds on superpixels or blocks in the saliency map. Thus, multiple thresholds are computed in local thresholding methods. In Hybrid methods, either a combination of local and global thresholding or region growing is employed. In region growing methods, either an initial threshold is set manually or the saliency map is directly used as initialization. The methods which do not fit in any of these categories are classified under Other Thresholding methods. The most popular thresholding method in salient object detection is Global thresholding, which is employed by more than half of the research works in this field. Local thresholding methods are scarcely used but offer a good direction of research.
We have also discussed some of the thresholding methods which are employed in image processing but have not been explored in salient object detection. Novel thresholding methods provide a good research direction. We have also briefly presented the existing and proposed thresholding performance criteria. Novel criteria for SOD performance evaluation are another research direction.