1 Introduction

A salient object is the object that attracts human attention and occupies a distinct spatial position in an image. Recognizing a salient object in a cluttered image is a very complex task. Because the resolution of the retina is highest at its center, the human eye typically orients the center of its gaze toward the most relevant spatial area of the visual scene. A salient object is characterized by the contrast of an image at the super-pixel level (i.e., color, orientation, or intensity), which is the most attractive factor for the human visual system. Examples of salient object segmentation are shown in Fig. 1.

Fig. 1

Examples of salient object detection. The first row shows the original images and the second row shows the extracted salient objects

Saliency detection and segmentation are helpful in different multimedia tasks such as image retrieval (Gao et al. 2015), adaptive image display (Chen et al. 2003), image segmentation (Liu et al. 2010; Azaza et al. 2018; Badoual et al. 2019), content-aware image editing (Ding and Tong 2010), video surveillance (Ding and Tong 2010; Nawaz et al. 2019; Sokhandan and Monadjemi 2018), image compression (Liu et al. 2014), facial expression recognition (Shahid et al. 2020a; Nawaz and Yan 2020), and image classification (Murabito et al. 2018).

Local and global contrast models are usually discussed in the literature. A local contrast model computes the saliency map by comparing the characteristics of each region with those of its adjacent regions. Global contrast refers to the difference of a region from all other regions in the image, not only the local ones. Saliency models fall into two types: (1) salient object detection and (2) human fixation prediction. Salient object detection models are frequently used to identify the salient object and exploit saliency information in computer vision applications such as content-aware image resizing and image editing. Human fixation models are expected to predict human fixation patterns in order to identify the salient region accurately. The maps generated by these two types of models differ noticeably because of their different purposes: salient object detection models produce smooth, connected regions, while fixation models generally produce blob-like salient regions. According to the neurobiological and psychological definition (Goferman et al. 2011a), visual saliency detection against a complex background poses challenges in real computer vision applications (see Fig. 2).

Fig. 2

The flow chart of the proposed object detection and segmentation method. A color image is converted into different maps, which are blended with each other using the Porter–Duff method. a Shows the fast fuzzy c-mean clustering maps, b shows the map blending and the morphological filtering process that removes outliers in the blended maps, and c shows the mask generation using a combination of frequency, color, and location priors

2 Related work

Many salient object detection and segmentation methods based on superpixels, histograms, contrast, and feature graphs have been proposed in the last two decades (Azaza et al. 2018). Local and global contrast are often used to extract salient regions from a noisy background. Specifically, local contrast models compare the characteristics of each region with those of its neighboring regions to compute the saliency map. As a result, high-contrast pixels are labeled salient, while the interior pixels belonging to the salient objects are not captured by local contrast-based techniques. In global contrast models, all pixels are considered candidate salient pixels, which yields better internal consistency; however, some background pixels may be assigned high saliency. Some researchers combined both local and global models (Itti et al. 1998; Harel et al. 2007; Perazzi et al. 2012; Xue et al. 2011; Nawaz and Yan 2020; Shahid et al. 2020b; Goferman et al. 2011a; Zhu 2014; Rahtu et al. 2010). The contrast computed by the combined models essentially indicates the uniqueness of an area in the image, which is sensitive to the detection of salient objects. However, if some of the foreground objects differ from each other, or the background is cluttered, a critical gap can occur. On the other hand, some partial background areas may be treated as essential parts of the salient object, which degrades saliency detection during clustering. Different supervised and unsupervised saliency detection methods have been proposed to overcome these problems. More recently, convolutional neural network-based models have been used in computer vision applications. Xi et al. (2019) proposed a supervised detection and segmentation approach based on an efficient end-to-end saliency regression network. However, this strategy did not achieve reasonable results in locating salient regions, even with pre-processing and post-processing practices. It employed VGG-16 as a backbone to train the model; it performs well on single-object scenes but breaks down when detecting multiple objects. Wang et al. (2019) comprehensively reviewed the salient object detection problem and its solutions; they differentiate supervised and unsupervised SOD (salient object detection) models using different statistical models and extensive experiments.

Unsupervised methods are based on low-level features (e.g., background prior, color, contrast). Zhang et al. (2013) presented a novel technique to evaluate the importance of a region by integrating three fundamental priors: color, frequency, and location. A visual patch attention saliency detection and segmentation method was suggested by Jian et al. (2014), where patches of information and direction are used, like neuronal signals, to detect the salient object. Barranco et al. (2014) proposed a visual saliency architecture with top-down attention modulation for a field-programmable gate array (FPGA).

Fig. 3

An illustration of the histogram and the intensity values of an RGB image. The first row shows the original images, and the second and third rows show the histogram and the intensity values of the image

This model comprises a real-time hardware architecture that integrates the FPGA into a robot system. It uses robust biologically inspired operations to detect and segment the salient area in a color image. Li et al. (2013b) used contextual hypergraph modeling for the detection and segmentation of salient objects, which first retrieves the contextual characteristics of the image; a cost-sensitive support vector machine is then used to find the salient object. Wang et al. (2016) proposed an algorithm that uses prior background information to detect the salient object and to produce precise and stronger saliency maps. Kim et al. (2017) suggested Gaussian mixture models with a saliency-based initialization for fully automatic object segmentation. In this approach, the Gaussian mixture model estimates the color distributions of the background and foreground pixels of the object, and the resulting means, covariance matrices, and mixing coefficients are then used to locate the prominent pixels in an image. A regularized random walk system for saliency detection and segmentation was proposed by Yuan et al. (2017). This approach first eliminates the adjacent boundary of the foreground superpixels, leading to a random walk ranking model that calculates the prior saliency for each pixel in the image.

There are several disadvantages to the approaches discussed above. For example, separating background clusters from foreground clusters in contrast- and graph-based saliency techniques is a challenging task. More specifically, one or more background clusters may include a portion of the foreground cluster (salient region) (Yang et al. 2013b), which causes errors and reduces the accuracy of image segmentation. In other image segmentation methods, the saliency maps are binarized with different threshold techniques at the initial saliency map stage, whereas the proposed method does not use any threshold technique at this stage. Using different threshold values (adaptive, global, fixed, etc.) in saliency detection models reduces the efficiency of automated segmentation. Superpixel methods, as discussed above, promote boundary-based pre-processing in saliency detection. Assigning the same saliency value to all pixels in a patch, as superpixel-based methods do, ignores certain important information, which can lead to a decrease in the image's visual quality. Based on these considerations, we propose a technique that uses the Porter–Duff method (Duff 2015) to compose the salient object maps obtained by fast fuzzy c-mean (FFCM) clustering. FFCM clustering is a powerful tool that accurately clusters a large number of feature points based on the histogram and intensity values of the color image, as shown in Fig. 3. In image processing, the histogram of an image refers to the distribution of pixel intensities: a graph that shows how many pixels share similar values. For a color image, a separate histogram can be retrieved for each channel (red, green, and blue). A histogram with three color channels is used to select class memberships at different threshold values. The clustering calculates the chi-squared distance \(X^{2}\) between histograms, which is defined as:

$$\begin{aligned} X^{2}(h_{i},h_{j})= \sum _{k=1}^{K}\frac{\left( h_{i}(k)-h_{j}(k)\right) ^{2}}{h_{i}(k)+h_{j}(k)} \end{aligned}$$
(1)

where \(h_i\) and \(h_j\) are the two histograms being compared and \(K\) is the number of histogram bins. Benefiting from the clustering step, the obtained maps contain parts of the salient features (salient regions), which are composed using the Porter–Duff composition method. Outliers in the extracted salient maps are removed using morphological techniques in a post-processing step.
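As an illustration of Eq. (1), the following Python sketch (our own illustration, not the authors' code) computes normalized per-channel histograms of an RGB image and the chi-squared distance between two of them; all function names are ours.

```python
import numpy as np

def channel_histogram(channel, bins=256):
    """Normalized intensity histogram of one color channel."""
    hist, _ = np.histogram(channel, bins=bins, range=(0, 256))
    return hist.astype(np.float64) / max(hist.sum(), 1)

def chi_squared_distance(h_i, h_j, eps=1e-12):
    """Chi-squared distance of Eq. (1) between two histograms."""
    num = (h_i - h_j) ** 2
    den = h_i + h_j + eps   # eps avoids division by zero in empty bins
    return np.sum(num / den)

# Example on a random RGB image (H x W x 3, uint8)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
hists = [channel_histogram(image[:, :, c]) for c in range(3)]  # R, G, B
print(chi_squared_distance(hists[0], hists[1]))
```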

This work is an extension of our conference paper (Nawaz et al. 2019). The main additional contributions of this work are summarized as follows:

  • We propose a clustering-based segmentation technique in which fast fuzzy c-mean clustering-based maps are blended by the baseline Porter–Duff composition method (Duff 2015). The blending method decides how the foreground and background colors interact with each other, and it efficiently highlights the foreground pixels during composition.

  • We propose a multiscale morphological gradient reconstruction procedure to eliminate boundary outliers in the rough saliency maps. This technique helps to integrate the neighborhood information of each salient pixel and reduces the number of outlier pixels in the saliency map.

  • The proposed framework fuses the composed FFCM saliency maps more efficiently than other approaches. The FFCM-based saliency model outperforms superpixel-based image segmentation models because superpixel models rely on a pre-processing step in which all pixels of a patch are assigned the same saliency value; this ignores some important information and decreases the visual quality of the salient object in an image.

  • We demonstrate the superiority of the proposed method through detailed experiments on six different benchmarks in comparison with thirteen different models.

3 Proposed framework

In this section, we introduce the proposed technique, including the fast fuzzy c-mean clustering based saliency maps, the composition and blending of saliency maps, and the mask construction using frequency, color, and location priors.

3.1 Saliency maps construction and composition

The FFCM clustering technique is used to find the membership maps. Using the histogram and the image intensity levels, as illustrated in Fig. 2a, it splits the feature points into different clusters. FFCM is a leading technique for building unsupervised data models: instead of assigning a data point absolutely to a particular cluster, it determines a degree of membership (likelihood), as discussed below.

3.2 Saliency maps

Let X be the input RGB image and \(X =\left\{ x_{1},x_{2},x_{3},x_{4}...x_{n}\right\}\) a data set of n samples, where each sample \(x_{k}\) is described by a set of f features. A partition of X into C clusters is a collection of mutually disjoint subsets \(X_i\) of X such that \(X_1\bigcup ...\bigcup X_{c}=X\) and \(X_i\bigcap X_j=\phi\) for any \(i\ne j\). A hard clustering can be described by the \((c \times n)\) partition matrix U, whose general term is \(u_{ik } = 1\) if \(x_{k} \in X_i\), and 0 otherwise. To obtain a fuzzy partition, we minimize the objective function \(J_m\) with fuzziness coefficient \(m>1\):

$$\begin{aligned} J_{m}=\sum _{k=1}^{N}\sum _{j=1}^{C}U_{jk}^{m}\left\| x_{k}-{V}_{j} \right\| ^{2}, \end{aligned}$$
(2)

where \(N\) is the number of data points, \(C\) is the total number of clusters, \({V}_j\) is the center of the \(j^{th}\) cluster, and \(U_{jk}\) is the degree of membership of the \(k^{th}\) data point \(x_k\) in the \(j^{th}\) cluster. The norm \(\left\| . \right\|\) measures the proximity of a data point to the cluster center \({V}_j\). The center of each cluster is calculated as:

$$\begin{aligned} {{V}}_j=\frac{\sum _{k=1}^{N}{U_{jk}^m.x_k}}{\sum _{k=1}^{N}U_{jk}^m}, \end{aligned}$$
(3)

where m is the fuzziness coefficient. The membership \({U_{jk}}\) of point \(k\) in cluster \(j\) is calculated as:

$$\begin{aligned} U_{jk}=\frac{1}{\sum _{l=1}^{C}{\left( \frac{\parallel x_k-{{V}}_j\parallel }{\parallel x_k-{{V}}_l\parallel }\right) }^\frac{2}{m-1}}. \end{aligned}$$
(4)
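The update rules of Eqs. (2)–(4) can be sketched in a few lines of NumPy. The sketch below runs plain fuzzy c-means on one-dimensional intensity data; the histogram-based acceleration that makes FFCM fast is omitted for brevity, and the stopping tolerance and fuzziness value are our own assumptions.

```python
import numpy as np

def fuzzy_c_means(x, c=3, m=2.0, iters=100, tol=1e-5, seed=0):
    """Fuzzy c-means on 1-D data x (shape N), following Eqs. (2)-(4).
    Returns cluster centers V (shape c) and memberships U (shape c x N)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)                # each column sums to 1
    for _ in range(iters):
        Um = U ** m
        V = (Um @ x) / Um.sum(axis=1)                # Eq. (3): cluster centers
        d = np.abs(x[None, :] - V[:, None]) + 1e-12  # distances |x_k - V_j|
        # Eq. (4): U_jk = 1 / sum_l (d_jk / d_lk)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            return V, U_new
        U = U_new
    return V, U
```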

The number of initial saliency maps depends on the total number of clusters. Let X be an RGB color image of size \(M\times N\times K\), where \(M\) and \(N\) are the height and width of the image, respectively, and \(K\) is the total number of color channels. If \(C\) is the total number of classes/clusters in the image, then the total number of maps is calculated as:

$$\begin{aligned} B_{map}\ =\ K\times \ C. \end{aligned}$$
(5)

For natural color images, we found through many experiments that \(K\) = 3 produces the best final results; with \(C\) = 3 clusters this gives a total of nine saliency maps. In Fig. 4, different membership maps of the \(\textit{Baby}\) image are generated by the fast fuzzy c-mean clustering based technique, which relies on the image histogram and intensity values. Figure 4a shows the nine saliency maps of the baby image with the \(Source-In\) blending mode, Fig. 4b with the \(Destination-In\) blending mode, Fig. 4c with the \(Source-Atop\) blending mode, and Fig. 4d with the \(Destination-Atop\) blending mode. Figure 5 shows the matching score matrix that is used to identify the good saliency maps.
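Following Eq. (5) with \(K\) = 3 channels and \(C\) = 3 clusters, the nine initial membership maps can be collected by clustering each channel separately. The sketch below reuses the hypothetical fuzzy_c_means helper given earlier and clusters raw pixel intensities; in the actual FFCM the clustering is driven by the histogram, so this is only an approximation of the map-construction step.

```python
import numpy as np

def initial_saliency_maps(image, c=3, m=2.0):
    """Build K x C membership maps from an RGB image (Eq. (5)):
    each channel is clustered on its intensity values with fuzzy c-means."""
    H, W, K = image.shape
    maps = []
    for ch in range(K):
        x = image[:, :, ch].reshape(-1).astype(np.float64)
        _, U = fuzzy_c_means(x, c=c, m=m)      # memberships: c x (H*W)
        for j in range(c):
            maps.append(U[j].reshape(H, W))    # one map per cluster
    return maps                                # K * c maps (9 for K = C = 3)
```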

3.3 Maps composition and blending

The Porter–Duff method involves two basic steps: composition and blending (Duff 2015). Composition is the process of integrating the graphic elements of the foreground with those of the background maps. Blending defines how the colors of the foreground and background graphic elements interact with each other. In composition, the resulting color is first determined from the foreground and background maps, and the foreground color is then replaced with this resulting color using a particular composition operator. Porter–Duff composition is a pixel-based model in which two maps (source and destination) interact to generate the final salient map, as shown in Fig. 2c. Blending measures how the maps are mixed where the foreground and background components overlap: the colors of the background and foreground pixels are blended, and the foreground element is then composited with the background pixels, as shown in Fig. 2b. In general, there are 12 distinct Porter–Duff composition operators, each using a different combination of source and destination. Table 1 addresses only four composition operators, in which the alpha values of the source (foreground) and background pixels are \(a_{s}\) and \(a_{b}\), \(f_{a}\) and \(f_{b}\) are the fractional terms of the source and destination maps, \(C_{s}\) and \(C_{b}\) are the colors of the source and background maps, and \(C_{o}\) and \(A_{o}\) are the output color and alpha of the composited saliency map. The entire mixing process is carried out in a single stage. In the proposed method, the following blending modes are used.
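As a sketch of the composition step, the snippet below follows the standard Porter–Duff formulation as we understand it (the blend result is mixed in proportion to the backdrop alpha, and the output color is premultiplied); the function names and the choice of the "Source Atop" fractions \(F_a = a_b\), \(F_b = 1 - a_s\) reflect the operator used later in Sect. 4.2.

```python
import numpy as np

def composite(Cs, As, Cb, Ab, blend, Fa, Fb):
    """Generic Porter-Duff compositing with a blend function B(Cb, Cs).
    Cs, Cb: source/backdrop colors in [0, 1]; As, Ab: their alpha values.
    Fa, Fb: fractional terms of the chosen composition operator."""
    Cs_mixed = (1.0 - Ab) * Cs + Ab * blend(Cb, Cs)   # blending step
    Ao = As * Fa + Ab * Fb                            # output alpha
    Co = As * Fa * Cs_mixed + Ab * Fb * Cb            # premultiplied output color
    return Co, Ao

def source_atop(Cs, As, Cb, Ab, blend):
    """'Source Atop' operator: Fa = alpha_b, Fb = 1 - alpha_s."""
    return composite(Cs, As, Cb, Ab, blend, Fa=Ab, Fb=1.0 - As)
```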

3.3.1 Normal blend mode

The default blending mode does not perform any mixing: the blending function simply selects the foreground color, so \(B(C_b,C_s)\) is defined as:

$$\begin{aligned} B(C_{b},C_{s})=C_{s}. \end{aligned}$$
(6)

3.3.2 Multiply blend mode

In multiply blending, the foreground pixel color is replaced by the product of the foreground and background colors. The resulting map is always at least as dark as either the foreground or the background color. Here, the saliency maps are a combination of both foreground and background maps, so the resultant map appears gray in this blending mode:

$$\begin{aligned} B(C_{b},C_{s})=C_{b}\times C_{s}. \end{aligned}$$
(7)

3.3.3 Screen blend mode

The complements of the foreground and background pixel colors are multiplied, and the result is then complemented. The resulting color is always at least as light as either of the two constituent colors; screening any color with white produces white in this blending mode:

$$\begin{aligned} B(C_{b},C_{s})=1-\left[ (1-C_{b})\times (1-C_{s})\right] . \end{aligned}$$
(8)
$$\begin{aligned} B(C_{b},C_{s})=C_{b}+C_{s}-(C_{b}\times C_{s}). \end{aligned}$$
(9)

3.3.4 Overlay blend mode

Depending on the background pixel value, the foreground and background maps are either multiplied or screened. The foreground pixel overlays the background pixel while preserving its highlights and shadows: the background pixel values are not replaced but are mixed with the foreground pixel values to reflect the lightness or darkness of the background:

$$\begin{aligned} B(C_{b},C_{s})=\textit{HardLight}(C_{b},C_{s}). \end{aligned}$$
(10)
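The four blend functions \(B(C_b, C_s)\) described above can be written directly from Eqs. (6)–(10); colors are assumed to lie in [0, 1]. The hard-light helper follows the common compositing definition, and the overlay function below uses the usual convention of hard light with the layers exchanged, whereas Eq. (10) writes \(\textit{HardLight}(C_b, C_s)\).

```python
import numpy as np

def normal(Cb, Cs):
    """Eq. (6): the blend result is simply the foreground color."""
    return Cs

def multiply(Cb, Cs):
    """Eq. (7): the result is at least as dark as either input."""
    return Cb * Cs

def screen(Cb, Cs):
    """Eqs. (8)-(9): complement of the product of the complements."""
    return Cb + Cs - Cb * Cs

def hard_light(Cb, Cs):
    """Multiply or screen the backdrop depending on the foreground value."""
    return np.where(Cs <= 0.5, multiply(Cb, 2.0 * Cs), screen(Cb, 2.0 * Cs - 1.0))

def overlay(Cb, Cs):
    """Eq. (10): overlay, expressed here as hard light with layers swapped."""
    return hard_light(Cs, Cb)
```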

4 Mask construction and salient region extraction

To find the appropriate salient object from the array of blended maps, we use a mask obtained by the SDSP (saliency detection by simple priors) process (Liu et al. 2014). It combines three basic priors: frequency, color, and location, as shown in Fig. 2c.

4.1 Mask construction

We use the log-Gabor filter instead of the DoG filter for band-pass filtering of the color image, for two main reasons. First, a log-Gabor filter of arbitrary bandwidth can be constructed with no DC component. Second, the transfer function of the log-Gabor filter has an extended tail at the high-frequency end, making it more capable of encoding natural images than other conventional band-pass filters. The filter is defined in the frequency domain as:

$$\begin{aligned} g(v)=exp\left\{ -\left( log\frac{\left\| v \right\| _{2}}{\omega } \right) ^{2}/2\sigma _{F}^{2} \right\} , \end{aligned}$$
(11)

where \(\omega\) is the center frequency and \(v=(i,j)\in {\mathbb {R}}^{2}\) is the coordinate in the frequency domain. The parameter \(\sigma _{F}^{2}\) controls the frequency bandwidth of the filter; the optimized values are \(\omega = 1/6\) and \(\sigma _{F}^{2} = 0.3\). Let X denote an RGB image. It is first converted to the CIELab color space with three channels: \(C_l\), \(C_a\), and \(C_b\). The frequency prior \(P_{f}(X)\) is defined as:

$$\begin{aligned} P_{f}(X)=\left( (C_{l}*g)^{2}+(C_{a}*g)^{2}+(C_{b}*g)^{2}\right) ^{\frac{1}{2}}(X), \end{aligned}$$
(12)

where \(g\) is the log-Gabor filter and \((*)\) denotes convolution of the filter with each channel of the color image (i.e., \(C_l\), \(C_a\), and \(C_b\)). The human visual system is more sensitive to warm colors such as red and yellow than to colors such as green and blue. Therefore, the color prior is defined as:

$$\begin{aligned} P_{c}(X)=1-exp\left( -\frac{C_{an}^{2}(X)+C_{bn}^{2}(X)}{\sigma _{c}^{2}}\right) , \end{aligned}$$
(13)

where \(P_{c}\) is the color prior and \(\sigma _{c}\) is its parameter, and \(\textit{C}_{an}\) and \(\textit{C}_{bn}\) are the normalized \(C_a\) and \(C_b\) channels, defined as:

$$\begin{aligned} C_{an}=\frac{C_{a}(X)-min(a)}{max(a)-min(a)}, \quad C_{bn}=\frac{C_{b}(X)-min(b)}{max(b)-min(b)}. \end{aligned}$$
(14)
Fig. 4

The results of four blending modes of nine baby saliency maps. a Is “Source In” blend mode, where foreground pixels overlap the background pixels and replace the background, b is “Destination In” blend mode, where background pixels overlap the foreground pixels and replace the foreground, c is “Source Atop” blend mode, where the foreground pixels overlap the background pixels and keep both overlap pixels and background, and d is “Destination Atop” blend mode, where the background pixels overlap the foreground pixels and keep both overlap pixels and foreground

Fig. 5

An illustration of the similarity matrix of the nine baby saliency maps. The maximum similarity values are shown in light orange. The proposed method chooses the maps with high values and fuses them together. The final fusion process is described in Eq. (16)

Table 1 The mathematical and graphical representation of four compositing methods

Objects near the image center are more appealing to the human eye, so an object in the middle of the image or near its center is more likely to be salient. Let C be the image center; then the location saliency of pixel X under a Gaussian map is calculated as:

$$\begin{aligned} P_{l}(X)=exp\left( -\frac{\left\| X-C \right\| _{2}^{2}}{\omega _{l}^{2}}\right) , \end{aligned}$$
(15)

where \(P_l\) denotes the location prior and \(\omega _{l}\) is the location parameter. By combining the three priors listed above, the final mask is represented as:

$$\begin{aligned} M=P_{f}(X)\times P_{c}(X)\times P_{l}(X). \end{aligned}$$
(16)

The mask construction procedure is shown in Fig. 2. As shown in the proposed flow chart, the resulting mask is used to select the appropriate output.
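A condensed Python/OpenCV sketch of the mask of Eq. (16) is given below, combining the frequency, color, and location priors of Eqs. (11)–(15). Only \(\omega\) and \(\sigma_F^2\) take the values stated in the text; \(\sigma_c\) and \(\omega_l\) are illustrative placeholders, and the function names are ours.

```python
import numpy as np
import cv2

def log_gabor(shape, omega=1.0 / 6.0, sigma_f_sq=0.3):
    """Log-Gabor transfer function of Eq. (11) on an FFT frequency grid."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    r[0, 0] = 1.0                          # avoid log(0); DC is zeroed below
    g = np.exp(-(np.log(r / omega)) ** 2 / (2.0 * sigma_f_sq))
    g[0, 0] = 0.0                          # no DC component
    return g

def band_pass(channel, g):
    """Convolution with the log-Gabor filter, done in the frequency domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(channel) * g))

def prior_mask(rgb, sigma_c=0.25, omega_l=None):
    """Frequency, color, and location priors (Eqs. (12)-(16)) combined into a mask."""
    lab = cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB).astype(np.float64)
    Cl, Ca, Cb = lab[:, :, 0], lab[:, :, 1], lab[:, :, 2]
    g = log_gabor(Cl.shape)
    # Frequency prior, Eq. (12)
    Pf = np.sqrt(band_pass(Cl, g) ** 2 + band_pass(Ca, g) ** 2 + band_pass(Cb, g) ** 2)
    # Color prior, Eqs. (13)-(14): warm colors receive higher saliency
    Can = (Ca - Ca.min()) / (Ca.max() - Ca.min() + 1e-12)
    Cbn = (Cb - Cb.min()) / (Cb.max() - Cb.min() + 1e-12)
    Pc = 1.0 - np.exp(-(Can ** 2 + Cbn ** 2) / sigma_c ** 2)
    # Location prior, Eq. (15): Gaussian centered on the image center
    H, W = Cl.shape
    y, x = np.mgrid[0:H, 0:W]
    if omega_l is None:
        omega_l = 0.5 * np.sqrt(H ** 2 + W ** 2)   # placeholder scale
    Pl = np.exp(-((x - W / 2.0) ** 2 + (y - H / 2.0) ** 2) / omega_l ** 2)
    # Final mask, Eq. (16), rescaled to [0, 1]
    M = Pf * Pc * Pl
    return (M - M.min()) / (M.max() - M.min() + 1e-12)
```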

4.2 Salient region extraction

In the proposed method, two combinations of composition and blending are used. In the first, the "Source Atop" composition operator is used with the "Multiply" blend mode: the background pixels overlap the foreground pixels, and the overlapping area of the foreground and background is multiplied, as shown in Fig. 2b. In the second, the "Source Atop" composition operator is used with the "Screen" blend mode: the background pixels overlap the foreground pixels as above, but the "Screen" mode multiplies the complements of the overlapping foreground and background area to find the final output. The second blending recovers the area missed during the first blending, and using both combinations removes noisy areas from the final salient map. Some outliers remain in the salient maps, and these are removed using morphological operations (closing and opening) (see Fig. 2b).
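The post-processing step can be sketched with standard OpenCV morphology; plain opening followed by closing is used here as a simple stand-in for the multiscale morphological gradient reconstruction mentioned earlier, and the kernel size is an assumption.

```python
import numpy as np
import cv2

def clean_saliency_map(saliency, kernel_size=7):
    """Remove small outliers from a rough saliency map in [0, 1] using
    morphological opening (drops bright specks) and closing (fills holes)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    smap = np.clip(saliency * 255.0, 0, 255).astype(np.uint8)
    opened = cv2.morphologyEx(smap, cv2.MORPH_OPEN, kernel)
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
    return closed.astype(np.float64) / 255.0
```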

5 Experiments

We compared our method with thirteen different methods on six different data sets: MSRA (Liu et al. 2010), DUT-OMRON (Nawaz and Yan 2020), MSRA-10K (Li et al. 2013a), THUR-15000 (THUR 2013), HKU-IS (Nawaz and Yan 2020), and SED (Zhu 2014). These data sets are widely used for extracting salient objects from color images. The performance of the proposed method and of the thirteen other methods, SIM (Nawaz et al. 2019), SUN (Li et al. 2013a), SEG (Itti et al. 1998), SeR (Seo and Milanfar 2009), CA (Goferman et al. 2011a), GR (Yang et al. 2013a), FES (Tavakoli et al. 2011), MC (Jiang et al. 2013), DSR (Li et al. 2013b), RBD (Zhu 2014), CYB (Chen et al. 2020), ResNet-50 (Goyal et al. 2019), and SDDF (Nawaz and Yan 2020), is shown in Figs. 7, 8, 9, 10 and 11.

5.1 Data sets

The MSRA, MSRA-10K, THUR-15K, and SED data sets are used to determine the method's performance; all of them include pixel-wise ground truth labelled by humans. The MSRA data set contains 5000 RGB images with ground-truth masks created by Liu et al. (2010). It is commonly used for salient object identification, and all of its images contain a single object against a noisy background. The MSRA-10K data set comprises 10,000 images with foreground masks (Li et al. 2013a). THUR-15000 (THUR 2013) is a commonly used data set of 15,000 color images; it includes low-contrast salient objects of different sizes, which makes object detection very difficult. The SED data set contains 100 color images with ground truth (Li et al. 2013a); its images contain single and multiple salient objects and are labelled with pixel-wise ground truth.

Fig. 6

The quantitative results of the other approaches and the proposed method in terms of the precision-recall (PR) curve. a Shows the PR curve of the MSRA data set, b shows the PR curve of the MSRA-10K data set, c shows the PR curve of the THUR-15000 data set, and d shows the PR curve of the SED data set. The proposed method has an efficient precision-recall curve on all data sets

Fig. 7

Performance of the other methods and the proposed method in terms of precision, recall, and F-measure. The top left plot shows the average results for MSRA, the top right plot for MSRA-10K, the bottom left plot for THUR-15000, and the bottom right plot for SED. The proposed approach is effective on all data sets in terms of precision, recall, and F-measure values

5.2 Parameter settings

Various parameters are used to maximize the performance of the proposed method, and choosing optimized parameter values is sometimes difficult. We have tested our method on a large number of images with optimized parameter values. In all experiments, nine saliency maps of each color image are combined and blended one by one using the Porter–Duff technique, as shown in Fig. 2b. To construct these maps, we fix the parameters of Eq. (5): \(K\) is the number of color channels, which is three for RGB images, and \(C\) is the number of clusters/classes; the optimized value of \(C\) in all experiments is 3. In map blending, instead of all 12 operators, we use the "Source Atop" composition operator with the "Multiply" and "Screen" blend modes.
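For reference, the parameter values reported in this paper can be gathered into a single configuration as sketched below; the fuzziness coefficient \(m\) is our own assumption, since its value is not stated in the text.

```python
# Illustrative configuration of the reported parameter values.
PARAMS = {
    "K": 3,                    # number of color channels (RGB), Eq. (5)
    "C": 3,                    # clusters per channel, Eq. (5) -> 9 initial maps
    "m": 2.0,                  # fuzziness coefficient for Eqs. (2)-(4); assumed, not stated
    "omega": 1.0 / 6.0,        # log-Gabor center frequency, Eq. (11)
    "sigma_F_sq": 0.3,         # log-Gabor bandwidth parameter, Eq. (11)
    "beta_sq": 0.3,            # F-measure weight, Eq. (17)
    "compositions": [("source-atop", "multiply"), ("source-atop", "screen")],
}
```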

Fig. 8

Mean absolute error (MAE) values of the proposed method and the other approaches. The top left plot shows the MAE results for the MSRA data set, the top right plot for the MSRA-10K data set, the bottom left plot for the THUR-15000 data set, and the bottom right plot for the SED data set. All plots show that the proposed method has the lowest MAE value on all data sets

5.3 Performance comparisons

The qualitative and quantitative analyses of the proposed method and the other methods on the six commonly used data sets are useful to assess accuracy. The results of all other methods in this section are obtained using publicly available codes. The similarity matching score matrix is shown in Fig. 5.

Fig. 9

Visual results on six different data sets. We divide the chosen images into two groups, and each group has a different type of salient object. The first five rows show the results of the proposed method on single-object images and the last six rows show the results on multiple-object images. Columns 2–11 show the results of different state-of-the-art methods, the second last column shows the results of the proposed method, and the last column shows the ground truth of each data set. The visual results on the different data sets show the superiority of the proposed method

Table 2 Saliency detection results on three single object-based benchmarks
Fig. 10

Saliency detection results on the DUT-OMRON and HKU-IS data sets. Columns 2–4 show the results of recent saliency detection methods and column 5 shows the results of the proposed method. The ground truth of these images is shown in the last column. The visual results in this figure show that the proposed method is closer to the ground truth than the other methods

Fig. 11

Object segmentation results on gray-scale and remote sensing images. The first and second rows show the gray-scale images and their results, respectively. The third and fourth rows show the remote sensing land images and their segmentation results, respectively. These results show that our proposed method performs efficiently on these types of data sets

Table 3 Saliency detection results of different techniques on three multiple object-based benchmarks

5.3.1 Quantitative comparison

The precision and recall rates suggested by Achanta et al. (2009) are the criteria for quantitative evaluation. Precision is defined as the ratio of correctly detected salient pixels to all detected salient pixels, and recall is the ratio of correctly detected salient pixels to all ground-truth salient pixels. The precision and recall curves are shown in Fig. 6. The proposed method dominates the PR curve on three data sets (MSRA, MSRA-10K, and SED) and has similar values on the THUR-15000 data set. The PR curves are obtained with different threshold values (0–255). In certain instances, precision and recall values are not enough to compare the results, so the F-measure is used. The general F-measure is the weighted harmonic mean of precision and recall, defined as:

$$\begin{aligned} F_{m}=\frac{(1+\beta ^{2})\times Precision\times Recall }{\beta ^{2}\times Precision+Recall}. \end{aligned}$$
(17)

We use \(\beta ^{2}=0.3\), as in Achanta et al. (2009), to calculate the F-measure. The average precision, recall, and F-measure are shown in Fig. 7. Our approach performs well on the MSRA data set, where most of the images are single-object images with a clear background, and its F-measure is substantially better than that of the other methods. Furthermore, for the MSRA, SED, and MSRA-10K data sets, the precision and recall curves of our method are also high. The precision and recall values on the THUR-15000 data set are similar to those of the RBD (Zhu 2014) and DSR (Li et al. 2013b) methods, but our method works better on this data set in terms of the F-measure. The proposed method also obtains efficient PR curve results on the SED data set, although the majority of its images have complex and noisy backgrounds. For a more detailed comparison, we use the mean absolute error (MAE), which tests the similarity between the ground truth map and the saliency map. The MAE is the average per-pixel difference between the saliency map \(S_{map}\) and the ground truth map \(G_{truth}\):

$$\begin{aligned} MAE=\frac{1}{W\times H} \sum _{x=1}^{W}\sum _{y=1}^{H}\left| S_{map}(x,y)-G_{truth}(x,y) \right| . \end{aligned}$$
(18)
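Assuming the saliency map and the ground-truth mask are available as NumPy arrays, the evaluation metrics of Eqs. (17) and (18) can be computed as sketched below; the helper names are ours.

```python
import numpy as np

def precision_recall(smap, gtruth, threshold):
    """Binarize the saliency map at one threshold and compare with ground truth."""
    pred = smap >= threshold
    gt = gtruth > 0
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return precision, recall

def f_measure(precision, recall, beta_sq=0.3):
    """Weighted F-measure of Eq. (17)."""
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom > 0 else 0.0

def mae(smap, gtruth):
    """Mean absolute error of Eq. (18); both maps scaled to [0, 1]."""
    return np.mean(np.abs(smap.astype(np.float64) - gtruth.astype(np.float64)))

# PR curve: sweep the 256 thresholds used in the paper (saliency values in 0-255)
# curve = [precision_recall(smap, gtruth, t) for t in range(256)]
```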

Figure 8 shows the MAE values of all methods on four separate data sets; it measures the error between the saliency map and the ground truth map of each image. Figure 8a shows the MAE values on MSRA, where the MAE of the proposed method is clearly lower than that of the other methods. The MAE values on the MSRA-10K data set are shown in Fig. 8b: our method has an MAE similar to RBD (Zhu 2014) and lower than the other methods in this graph. Figure 8c shows the MAE values of all methods on the THUR-15000 data set, where the MAE of our method is low in comparison with the other methods. The MAE of all methods on the SED data set is shown in Fig. 8d; on this data set the proposed method has the lowest mean absolute error. Since a larger MAE indicates worse results, the overall MAE results of the proposed method are good.

5.3.2 Qualitative comparison

Six widely used data sets are used for a detailed qualitative comparison between the proposed method and the other methods, as shown in Figs. 9 and 10. In Fig. 9, the first column shows the color RGB images and the last column shows the ground truth of each color image. The performance of the other methods is shown in columns 2–10. The results in columns 2–7 indicate that these methods are unable to cleanly separate the salient object from the background, which may be caused by dense texture in the background of the image. The visual quality of columns 8–10 shows that the MC (Jiang et al. 2013), RBD (Zhu 2014), and DSR (Li et al. 2013b) methods are comparatively better than the SIM, SUN, SEG, SeR, and CA methods. In terms of visual quality, our method is significantly better than the other methods, as shown in the second last column of Fig. 9. Through this series of experiments, it is easy to compare the proposed method and the other methods visually on images with complex and noisy backgrounds. The proposed results have a high degree of resemblance to the ground truth and highlight the salient object more uniformly than the other methods.

Table 4 Comparison of average running time (s) per image on different data sets

5.4 Implementation

The implementation of the proposed method, including the training data sets and the computational time, is explained in this section.

Training data sets Training data greatly influence the final behavior of saliency detection models (Wang et al. 2019). We construct the model using six data sets (SED1, MSRA, MSRA-10K, THUR-15K, DUT-OMRON, and HKU-IS), which contain both single and multiple objects. Images from the MSRA-10K data set are randomly sampled for training and validation. The proposed model also uses saliency map composition, which has been found to improve the performance of many visual tasks effectively. From an intuitive point of view, the sharpness of the salient area is of vital importance for salient object detection tasks. Figure 11 shows object segmentation results on gray-scale and remote sensing images; the results in the third and fourth rows show that the proposed method also works well on gray-scale and remote sensing land images. We executed our method on both single- and multiple-object data sets of real images, as shown in Tables 2 and 3, respectively. The quantitative results (average precision (AP), area under the curve (AUC), F-measure, and generalized F-measure) of the 13 different techniques are shown in Tables 2 and 3. The mathematical form of these evaluation metrics is given in Nawaz and Yan (2020).

Running time The proposed method is implemented in MATLAB 2018b on a Core i7 machine with 8 GB of RAM. We compare the average running time of the proposed method and the other thirteen methods on six different data sets, as shown in Table 4. Most images in the data sets have a size of \(400\times 300\). The running times of the other methods are obtained from publicly available codes. Note that our approach is implemented in MATLAB without code optimization, whereas the RBD, DSR, and MC approaches use C++. The proposed method is faster than DSR, GR, MC, and CA, but slightly slower than FES and RBD, as shown in Table 4.

Fig. 12

Failure cases of the proposed method, showing the difficulty of generating an accurate saliency map when dealing with images containing small and shaded objects in a landscape context

5.5 Failure cases

In certain cases, our proposed method could not obtain efficient results for single or multiple object detection. Figure 12 shows failure cases of the proposed method on the six different data sets: columns 1 and 4 show the original images, columns 2 and 5 show the ground truth images, and columns 3 and 6 show the failure cases. These four images are taken from the six data sets. In the butterfly image, the ground truth map contains one butterfly, but the salient map has two butterflies; this failure of the proposed method is due to color similarity. In the cup image, the result of the proposed method is affected by the object's shadow and a similar color appearance at the bottom. In the dog image, the color of the car in the background merges with the color of the dog, which affects the object detection process. In the last image, there should arguably be two salient objects, a butterfly and a plant, but according to the given ground truth map only the butterfly is salient; our proposed method fails to detect the butterfly because of the high contrast of the plant. These failure cases indicate that, for some images with low contrast and backgrounds of similar color, the proposed method is not efficient enough to extract the salient regions.

6 Conclusion

We proposed a clustering-based segmentation approach in which the fast fuzzy c-mean clustering maps are blended using the Porter–Duff (Duff 2015) composition method. In this method, one map acting as the foreground is blended with a second map acting as the background. A composite of the frequency, color, and location priors (Zhang et al. 2013) is used to select the final saliency map from the list of initially constructed saliency maps. The results of the proposed method are effective and precise compared with other methods when the foreground and background pixels have a similar color appearance. The efficiency of the proposed method can be improved further through optimized context subtraction and effective morphological pixel-based techniques, and boundary smoothing techniques can also be used to improve the visual quality of the constructed saliency maps.

The goal of salient object detection is to develop a detection method based on the human visual perception model. The results on six different data sets, compared against thirteen different detection methods, indicate the superiority of our method. The difference between our approach and the other approaches is clearly seen in Figs. 6, 7, 8, 9, and 10. For single and multiple objects, our proposed approach outperforms the others in both the quantitative and qualitative comparisons, as shown in Tables 2 and 3. Our proposed method produces precise salient detection results with complete edge information, while other algorithms lose boundary information in images with high contrast and cluttered backgrounds. In future work, we will investigate more powerful superpixel-based techniques to obtain more reliable edge information for an image. Improving the background detection process would also help to produce better results. Deep learning models and structural properties will also be studied for detection in low-color images, where weakly supervised learning may be used.