1 Introduction

Image segmentation, the grouping of similar visual objects, has received extensive attention over the last decades. It plays an important role in computer vision and graphics applications such as object localization and recognition, data compression, tracking, and image retrieval.

Image segmentation techniques can be divided into image-domain-based, physics-based, and feature-space-based techniques [18]. All of these techniques, with different variations, have been used extensively, but each has its own pros and cons. Image-domain-based techniques use both color features and the spatial relationships among pixels to carry out the segmentation task; their main difficulty is the selection of the seed region. Physics-based techniques use the reflection properties of materials to carry out color image segmentation. They perform well when the reflectance of the underlying objects is known; hence, these techniques are application specific. Feature-space-based techniques use only color features and ignore spatial relationships; therefore, the segmented regions are usually fragmented [17]. Contour-based approaches start from an initial boundary shape in the form of spline curves, which are iteratively evolved to minimize some energy function [6, 10, 22, 30]. Their difficulty is the manual initialization of the spline curves to the boundaries of an object of interest.

The Markov Random Field (MRF) is another statistical model that uses spatial connectivity among neighboring pixels [2]. Many approaches are based on Markov Random Fields, such as Graph Cut [12–14, 25]. These algorithms solve the two-class problem, i.e., separating the background and foreground objects, and need human interaction to manually specify the seed regions. Similarly, background and foreground segmentation is treated with watershed segmentation [3, 29]. The watershed approach suffers from over-segmentation, which can be handled by morphological operations; moreover, the segmentation accuracy is highly sensitive to the morphological structuring element. Recently, [16] proposed a superpixel segmentation algorithm that uses the entropy rate of a random walk on graphs. This algorithm performs best in the case of superpixel segmentation, but for general segmentation it produces over-segmented results.

Image segmentation can be treated as a clustering problem: the features of each pixel correspond to a pattern, and a combination of pixels (i.e., a segment) corresponds to a cluster. Keeping this analogy in mind, many clustering algorithms have been used for image segmentation [1, 5, 7]. Among the clustering algorithms adopted for color image segmentation, the most famous is k-means [7]. Fuzzy c-means (FCM) [1] has also been used extensively for image segmentation due to its simplicity and ease of implementation. However, its application faces two problems: (1) how many clusters should be made, and (2) how to select initial centroids that are properly distributed in the image. The number of cluster centers and their initialization have a considerable impact on the final results. Tan and Isa [27] used histogram thresholding to initialize the cluster centers to the dominating colors and used compactness as an objective function to optimize the cluster centers. This approach performs poorly for complex color images because cluster compactness alone is not sufficient to validate segmentation accuracy. Level sets and spatial fuzzy clustering have been combined for medical image segmentation [15]; this algorithm performs well for two-class problems but is unable to segment general color images of complex scenes. Another color image segmentation technique, based on merging regions of maximal similarity, is proposed in [23], where the Bhattacharyya distance between the histograms of the regions is used as a cluster validity index. The main weakness of this approach is the human intervention needed to mark the seed regions.

In this paper, we propose a novel differential evolution algorithm based on spatial fuzzy c-means (sFCM) [5] to make compact and well-separated clusters. A combination of fuzzy separation and fuzzy overlap is used as a cluster validity index to determine the optimal number of clusters automatically. Initialization greatly affects convergence and diversity: purely random initial solutions are more likely to visit, or even revisit, unproductive regions of the search space [24], whereas opposite numbers decrease the chances of visiting unproductive regions again and again. For this purpose, we take the best individuals from the union of purely random solutions and their opposites as the initial population. An external archive is maintained to store dropped solutions in order to give poor solutions a chance in the evolution process; a poor solution is not always poor and may give the best results after a slight perturbation. To represent each cluster center, we use the HSV color space, relying on the hue, saturation, and value properties of the color. This color space allows us to specify colors in a way that is close to the human experience of color.

The rest of the paper is organized as follows. The basic concept of the classical differential evolution algorithm is presented in Sect. 2. Section 3 explains the proposed algorithm, modified adaptive differential evolution (MoADE), in detail. Experimental results are described in Sect. 4, followed by the conclusion in Sect. 5.

2 Classical differential evolution algorithm

Differential evolution (DE) is a relatively recent population-based evolutionary heuristic used to optimize problems over continuous domains. In DE, a decision variable or solution is represented by a vector of real numbers. An individual vector, denoted by a boldface letter \(\varvec{{x}}\), can be represented in the following way:

$$\begin{aligned} \varvec{{x}}=[x^{(1)}, x^{(2)}, x^{(3)}, \ldots , x^{(D)}] \end{aligned}$$
(1)

where the superscripts \((1,2,3,\ldots ,D)\) index the dimensions of the problem. In classical differential evolution, the initial population \(\{\varvec{{x}}_{i,0}=(x_{i,0}^{(1)}, x_{i,0}^{(2)}, x_{i,0}^{(3)}, \ldots , x_{i,0}^{(D)}) \mid i=1,2,3,\ldots ,N_{p}\}\) is generated randomly according to a uniform distribution \(x_{low}^{j}\le x_{i,0}^{j}\le x_{up}^{j}\) for \(j=1,2,\ldots ,D\), where \(D\) is the dimension of the problem and \(N_{p}\) is the population size. After initialization, DE enters a loop of evolutionary operations: mutation, crossover, and selection.

Mutation: To mutate a particular target vector \(\varvec{{x}}\) at iteration \(t\), two vectors are randomly selected from the current population and their scaled difference is added to the target. In other words, the \(j\)th component of the mutant (donor) vector \(\varvec{{\vartheta }}\) at iteration \(t\) is generated as follows:

$$\begin{aligned} \vartheta _{t}^{j}=x_{t}^{j}+F\cdot (x_{r1,t}^{j}-x_{r2,t}^{j}) \end{aligned}$$
(2)

where the indices \(r1\) and \(r2\) are distinct integers uniformly chosen from the set \( \{1,2,\ldots ,N_{p}\}\), and \(F\) is the mutation factor, which lies in the interval \((0,1+)\).

Crossover: The crossover operation is performed to increase the diversity of the perturbed vector. A binomial crossover operation forms the final trial/offspring vector \(\varvec{{\upsilon }}_{t}=(\upsilon _{t}^{1}, \upsilon _{t}^{2},\ldots , \upsilon _{t}^{D})\):

$$\begin{aligned} \upsilon _{t}^{j}= \left\{ \begin{array}{ll} \vartheta _{t}^{j} \quad \mathrm{if} \quad rand (0,1)\le CR \quad or \quad j = j_{rand}, \\ x_{t}^{j}, \quad \mathrm{otherwise} \end{array} \right. \end{aligned}$$
(3)

where \(rand(0,1)\) is a uniform random number in the interval \([0,1]\), independently generated for each \(j\) and each \(i\), and \(j_{rand} = randint(1, D)\) is an integer randomly chosen from \(1\) to \(D\). The crossover probability \(CR\in [0, 1]\) roughly corresponds to the average fraction of vector components that are inherited from the mutation vector.

Selection: This step decides whether a particular trial individual \(\varvec{{\upsilon }}\) or the parent \(\varvec{{x}}\) at generation \(t\) should survive to the next generation. Selection is performed on the basis of the fitness values \(f(\cdot )\) of the trial and parent vectors. For example, for a minimization problem, the selection is given by:

$$\begin{aligned} \varvec{{x}}_{t+1}= \left\{ \begin{array}{ll} \varvec{{{\upsilon }}}_{t} \,\,\,\, \mathrm{if} \,\, f(\varvec{{{\upsilon }}}_{t})< f(\varvec{{{x}}}_{t}) \\ \varvec{{{x}}}_{t}, \,\,\mathrm{otherwise} \end{array} \right. \end{aligned}$$
(4)

The operation in Eq. (4) is called a successful update if the trial vector \(\varvec{{\upsilon }}\) is better than the parent \(\varvec{{x}}\), i.e., if the improvement or evolution progress \(\Delta = f(\varvec{{x}}) - f(\varvec{{\upsilon }})\) is positive. The evolutionary process continues for a fixed number of iterations, and the best vector in the last generation is taken as the solution.
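To make the loop concrete, the following sketch implements the classical DE of Eqs. (1)–(4) for a generic minimization problem. It is a minimal illustration, not the paper's implementation: the function and parameter names are ours, and the defaults \(N_p=50\), \(F=0.5\), \(CR=0.9\) are common settings rather than values taken from the text.

```python
import numpy as np

def classic_de(f, bounds, n_pop=50, F=0.5, CR=0.9, iters=200, seed=0):
    """Minimize f over a box with classical DE (Eqs. 1-4).

    bounds: (D, 2) array of [low, up] limits per dimension.
    """
    rng = np.random.default_rng(seed)
    low, up = bounds[:, 0], bounds[:, 1]
    D = len(bounds)
    pop = low + rng.random((n_pop, D)) * (up - low)   # uniform initialization
    fit = np.array([f(x) for x in pop])

    for _ in range(iters):
        for i in range(n_pop):
            r1, r2 = rng.choice([k for k in range(n_pop) if k != i],
                                size=2, replace=False)
            donor = pop[i] + F * (pop[r1] - pop[r2])  # mutation, Eq. (2)
            jrand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[jrand] = True                        # keep at least one donor gene
            trial = np.clip(np.where(mask, donor, pop[i]), low, up)  # Eq. (3)
            ft = f(trial)
            if ft < fit[i]:                           # greedy selection, Eq. (4)
                pop[i], fit[i] = trial, ft
    return pop[fit.argmin()], fit.min()
```

For example, `classic_de(lambda x: np.sum(x**2), np.array([[-5.0, 5.0]] * 10))` minimizes a 10-dimensional sphere function.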

3 Modified adaptive differential evolution

In classical differential evolution, the crossover and mutation parameters are fixed for all trial vectors at all generations. In adaptive DE algorithms, each individual \(i\) is associated with its own crossover probability \(CR_{i}\) and mutation factor \(F_{i}\). In this paper, an adaptive differential evolution (ADE [32]) is applied to the domain of color image segmentation by incorporating a new objective function and opposition-based learning. The new algorithm is called MoADE. The pseudocode in Algorithm 1 explains MoADE in detail.

Algorithm 1 Pseudocode of the proposed MoADE algorithm

The DE algorithm needs to be initialized before the start of the iterative procedure. In practice, initialization is done either by the user, using some sort of domain knowledge, or by a random procedure. Pure random initialization has proved to be inappropriate, since it can force clustering algorithms to converge to local optima: there is a higher probability of visiting, or even revisiting, unproductive regions of the search space. Opposite numbers have a higher chance of producing good results than purely random ones. Here, the initial population takes the best individuals from the union of purely random solutions and their opposites [24] in order to avoid unproductive samples or solutions. In Algorithm 1, steps \(2{-}8\) show the implementation of opposition-based initialization. The opposite of each component of each individual is calculated as:

$$\begin{aligned} \bar{x}_{i}^{j}=a^{j}+b^{j}-x_{i}^{j} \quad i=1, 2, \ldots , N_{p}; j=1, 2, \ldots , D \end{aligned}$$
(5)

where \(x_{i}^{j}\) and \(\bar{x}_{i}^{j}\) denote the \(j\)th component/dimension of the \(i\)th vector and its opposite, respectively, and \(a^{j}\) and \(b^{j}\) are the interval bounds of the \(j\)th dimension of the search space. The jumping condition \(randn(0,1)<1/(3+e^{-1/g})\) encourages high exploration in the initial stages and high exploitation in the final stages. If this condition holds for a particular individual \(\varvec{{x}}_{i}\), then its opposite is calculated as:

$$\begin{aligned} \varvec{{\bar{x}}}_{i,g}=\hbox {MIN}^{d}+\hbox {MAX}^{d}-\varvec{{x}}_{i,g} \end{aligned}$$
(6)

where \(\hbox {MIN}^{d}\) and \(\hbox {MAX}^{d}\) are the minimum and maximum values of a particular dimension in the current generation \(g\). The interval \([\hbox {MIN}^{d},\hbox {MAX}^{d}]\) keeps the jump within the already shrunken search space. If other boundary intervals were used instead of \([\hbox {MIN}^{d},\hbox {MAX}^{d}]\), we would jump outside the already shrunken search space and lose the knowledge of the current reduced space (the converged population).
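A minimal sketch of both opposition-based steps follows, assuming NumPy vector conventions; the function names are illustrative. The first routine builds the initial population from the union of random solutions and their opposites (Eq. 5); the second applies generation jumping within the population's current bounding box (Eq. 6).

```python
import numpy as np

def opposition_init(f, low, up, n_pop, rng):
    """Opposition-based initialization (Eq. 5): evaluate N_p random
    vectors and their opposites, keep the N_p fittest of the union."""
    D = len(low)
    pop = low + rng.random((n_pop, D)) * (up - low)
    opp = low + up - pop                        # component-wise opposite, Eq. (5)
    union = np.vstack([pop, opp])
    fit = np.array([f(x) for x in union])
    keep = np.argsort(fit)[:n_pop]
    return union[keep], fit[keep]

def generation_jump(pop, g, rng):
    """Generation jumping: with probability 1/(3 + e^(-1/g)), reflect an
    individual inside [MIN^d, MAX^d] of the current generation (Eq. 6),
    so the jump stays within the already shrunken search space."""
    lo, hi = pop.min(axis=0), pop.max(axis=0)   # MIN^d and MAX^d
    out = pop.copy()
    for i in range(len(pop)):
        if rng.random() < 1.0 / (3.0 + np.exp(-1.0 / g)):  # g >= 1
            out[i] = lo + hi - pop[i]                      # Eq. (6)
    return out
```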

Sometimes inferior solutions, with little change, provide additional information about the progress direction compared to the current population [32]. For this purpose, the mutation strategy DE/current-to-pbest/1 with archive [32] is adopted. This mutation strategy uses a set of archived inferior solutions, denoted by A, and the current population, denoted by P, to generate a mutation vector as given by Eq. (7).

$$\begin{aligned} \varvec{{\vartheta }}_{i,g}= \varvec{{x}}_{i,g}+F_{i}(\varvec{{x}}_{best,g} - \varvec{{x}}_{i,g})+F_{i}(\varvec{{x}}_{r1,g}- \varvec{{x}}_{r2,g}) \end{aligned}$$
(7)

where \(\varvec{{x}}_{best,g}\) is randomly chosen as one of the top \(p\,\% \) individuals in the current population and \(F_{i}\) is the mutation factor associated with \(\varvec{{x}}_{i}\), re-generated at each generation by the adaptation process introduced later in Eq. (12). The vectors \(\varvec{{x}}_{i,g}\) and \(\varvec{{x}}_{r1,g}\) are selected from P, while \(\varvec{{x}}_{r2,g}\) is randomly selected from \(\mathbf P \cup \mathbf A \).

Initially, the archive is empty. After each generation, the parent solutions that fail in the selection process are added to it. The archive has a fixed size to avoid significant computational overhead: if the archive size exceeds a certain threshold, say \(N_{p}\), then some solutions are randomly removed to keep the archive size at \(N_{p}\).

The archive provides information about the progress direction and improves the population diversity. A larger \(F\) increases the population diversity, while a smaller \(F\) speeds up convergence but tends toward local minima. The proposed algorithm, with its archive and opposition-based learning, is able to search the space effectively in order to escape local optima.
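The following sketch shows the archive-assisted mutation of Eq. (7) and the size-capped archive; the helper names and the default \(p\) are illustrative assumptions, and the distinctness checks between \(i\), \(r1\), and \(r2\) are simplified for brevity.

```python
import numpy as np

def mutate_current_to_pbest(pop, fit, archive, i, F_i, rng, p=0.05):
    """DE/current-to-pbest/1 with archive (Eq. 7).

    pop: current population P (N_p x D); archive: list A of dropped parents.
    """
    n = len(pop)
    top = max(1, int(round(p * n)))
    x_best = pop[rng.choice(np.argsort(fit)[:top])]   # one of the top p%
    r1 = rng.choice([k for k in range(n) if k != i])  # r1 drawn from P
    union = np.vstack([pop] + ([np.asarray(archive)] if archive else []))
    r2 = rng.integers(len(union))                     # r2 drawn from P u A
    return pop[i] + F_i * (x_best - pop[i]) + F_i * (pop[r1] - union[r2])

def trim_archive(archive, n_pop, rng):
    """Randomly drop entries so the archive never exceeds N_p."""
    while len(archive) > n_pop:
        archive.pop(rng.integers(len(archive)))
    return archive
```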

3.1 Parameters adaptation

Similar to [32], the crossover probability \(CR_{i}\) of each individual \(\varvec{{{x}}}_i\) is independently generated according to a normal distribution of mean \(\mu _{CR}\) and standard deviation 0.1.

$$\begin{aligned} CR_{i} = randn_{i}(\mu _{CR}, 0.1) \end{aligned}$$
(8)

The set of all successful crossover probabilities \(CR_{i}\) at generation \(g\) is denoted by \(S_{CR}\). The mean \(\mu _{CR}\) is initialized to 0.5 and then updated at the end of each generation as:

$$\begin{aligned} \mu _{CR} = (1 - c)\cdot \mu _{CR} + c\cdot mean_{W}(S_{CR}) \end{aligned}$$
(9)

where \(c\) is a positive constant between 0 and 1 and \(mean_{W}(\cdot )\) is the weighted average of all successful crossover probabilities at generation \(g\). The \(mean_{W}(\cdot )\) can be calculated as:

$$\begin{aligned} mean_{W}(S_{CR})=\sum _{s_{j} \in S_{CR}}w(\mu ,s_{j})\, s_{j} \end{aligned}$$
(10)

The weights \(w(\mu ,s_{j})\) are then computed as follows:

$$\begin{aligned} w(\mu ,s_{j})=\frac{1}{Z_{j}}e^{-\Vert \mu -s_{j}\Vert _{2}^{2}} \end{aligned}$$
(11)

where \(Z_{j}\) is a normalization constant ensuring that \(\sum _{s_{j} \in S_{CR}}w(\mu ,s_{j})=1\). Compared to the arithmetic mean, the weighted average gives more weight to frequently used successful crossover probabilities and thus maintains an optimal crossover range.

Similar to [32], at each generation \(g\), the mutation factor \(F_{i}\) of each individual \(\varvec{{{x}}}_{i}\) is independently generated according to a Cauchy distribution with location parameter \(\mu _{F}\) and scale parameter 0.1.

$$\begin{aligned} F_{i} = randc_{i}(\mu _{F}, 0.1) \end{aligned}$$
(12)

When the mutation factors are highly concentrated around a certain value, greedy mutation strategies (such as DE/best and DE/current-to-pbest) tend toward premature convergence. The Cauchy distribution helps to diversify the mutation factors and thus avoids premature convergence more effectively than a normal distribution. At a particular generation \(g\), the set of all successful mutation factors is represented by \(S_{F}\). The location parameter \(\mu _{F}\) of the Cauchy distribution is initialized to 0.5 and then updated at the end of each generation as:

$$\begin{aligned} \mu _{F} = (1 - c)\cdot \mu _{F} + c\cdot mean_{L}(S_{F}) \end{aligned}$$
(13)

where \(mean_{L}(\cdot )\) is the Lehmer mean

$$\begin{aligned} mean_{L}(S_{F})=\frac{\sum _{F\in S_{F}}F^{2}}{\sum _{F\in S_{F}}F} \end{aligned}$$
(14)

The Lehmer mean gives more weight to larger successful mutation factors and is therefore helpful in propagating larger mutation factors, which improves the progress rate.
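A compact sketch of the adaptation of Eqs. (8)–(14) is given below. Truncating \(CR_i\) to [0, 1] and resampling non-positive \(F_i\) follow the usual JADE convention and are assumptions here, as is writing the weight of Eq. (11) for scalar arguments.

```python
import numpy as np

def sample_cr(mu_cr, rng):
    """CR_i ~ N(mu_CR, 0.1), truncated to [0, 1] (Eq. 8)."""
    return float(np.clip(rng.normal(mu_cr, 0.1), 0.0, 1.0))

def sample_f(mu_f, rng):
    """F_i ~ Cauchy(mu_F, 0.1) via inverse CDF (Eq. 12);
    resample while F <= 0 and clip at 1."""
    f = 0.0
    while f <= 0.0:
        f = mu_f + 0.1 * np.tan(np.pi * (rng.random() - 0.5))
    return min(f, 1.0)

def update_mu_cr(mu_cr, s_cr, c=0.1):
    """Eq. (9) with the weighted mean of Eqs. (10)-(11); the weights
    favour successful CR values near the current mean and sum to one."""
    s = np.asarray(s_cr, dtype=float)
    if s.size == 0:
        return mu_cr
    w = np.exp(-(mu_cr - s) ** 2)
    w /= w.sum()
    return (1.0 - c) * mu_cr + c * float(w @ s)

def update_mu_f(mu_f, s_f, c=0.1):
    """Eq. (13) with the Lehmer mean of Eq. (14), which is biased
    toward larger successful mutation factors."""
    s = np.asarray(s_f, dtype=float)
    if s.size == 0:
        return mu_f
    return (1.0 - c) * mu_f + c * float((s ** 2).sum() / s.sum())
```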

3.2 Objective function formulation

The objective function, or cluster validity index (CVI), plays an important role in clustering and segmentation. It checks whether the provided clustering \((U;V )\) of \(X\) reflects the original structure of the data. Clustering algorithms need \(c\), the number of clusters, to be known in advance; due to the unsupervised nature of clustering, the user does not have prior knowledge of \(c\). A reliable CVI should take into account both compactness and separation for the \(c\)-partition \(U\). If only the former is considered, the best partition consists of as many clusters as there are points; if only the latter is considered, the best partition consists of a single cluster. For accurate and reliable segmentation, it is important to combine both measures in the CVI. However, separation measures generally rely on distances between centroids of different clusters, which are insufficient to represent the geometrical structure of the data.

In this paper, we use aggregation functions for cluster validity that map a collection of values \(\varvec{{u}}\) in [0,1] to a value in [0,1], formally a vector \(\varvec{{u}}=(u_{1};\ldots ;u_{c})^{t}\longmapsto A(\varvec{{u}})\). Triangular norms (t-norms) and conorms (t-conorms) are frequently used aggregation operators; they implement conjunctive and disjunctive operators in the field of fuzzy logic and fuzzy set theory. A t-norm \(\top \) and a t-conorm \(\bot \) are commutative, associative, and monotone functions having 1 and 0, respectively, as neutral elements [11].

If \({\varvec{{v}}_{1}}\), \({\varvec{{v}}_{2}}\),..., \({\varvec{{v}}_{c}}\) denote the cluster centers, then the memberships \({{u}_{ij}}\), \(i = 1, 2,\ldots , c\) and \(j = 1, 2,\ldots , n\), can be computed as:

$$\begin{aligned} u_{ij}=\frac{1}{\sum _{k=1}^{c}\left( \frac{D(\varvec{{v}}_{i},\varvec{{x}}_{j})}{D(\varvec{{v}}_{k},\varvec{{x}}_{j})}\right) ^{\frac{1}{m-1}}}, \quad \mathrm{for}\ 1\le i\le c,\ 1 \le j\le n \end{aligned}$$
(15)

where \(D({\varvec{{v}}_{i}},{\varvec{{x}}_{j}})\) is the Euclidean distance between point \({\varvec{{x}}_{j}}\) and cluster center \({\varvec{{v}}_{i}}\), and \(m\) is the weighting coefficient. While computing \({{u}_{ij}}\) using Eq. (15), if \(D({\varvec{{v}}_{k}},{\varvec{{x}}_{j}})\) is equal to zero for some \(k\), then \({{u}_{ij}}\) is set to zero for all \(i=1,2,\ldots ,c,\,i\ne k\), and \({{u}_{kj}}\) is set equal to one.

In images, neighboring pixels are highly correlated [5]: they possess similar features, and there is a high probability that they belong to the same cluster. A spatial function, defined in Eq. (16), exploits this spatial information [5].

$$\begin{aligned} {{h}_{ij}}=\sum _{k\in NB({\varvec{{{x}}}_{j}})}{{{u}_{ik}}} \end{aligned}$$
(16)

where \(NB({\varvec{{x}}_{j}})\) represents a square window centered on pixel \({\varvec{{x}}_{j}}\) in the spatial domain. The spatial function \({{h}_{ij}}\) represents the probability that pixel \({\varvec{{x}}_{j}}\) belongs to the \(i\)th cluster. The spatial information can be incorporated into the membership function as:

$$\begin{aligned} {u}^{\prime }_{ij}=\frac{{{u}^{p}_{ij}}{{h}^{q}_{ij}}}{\sum _{k=1}^{c}{{{u}^{p}_{kj}}{{h}^{q}_{kj}}}} \end{aligned}$$
(17)

where \(p\) and \(q\) are parameters to control the relative importance of both functions. The centers can be updated using (18).

$$\begin{aligned} {\varvec{{{v}}}_{i}}=\frac{\sum _{j=1}^{n}{{{({{u}^{\prime }_{ij}})}^{m}}{\varvec{{{x}}}_{j}}}}{\sum _{j=1}^{n}{{{({{u}^{\prime }_{ij}})}^{m}}}},\,1\le i\le c \end{aligned}$$
(18)
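For concreteness, the sketch below performs one sFCM membership and center update (Eqs. 15–18) on an image array. It assumes SciPy's `uniform_filter` for the window aggregation of Eq. (16); the filter returns the window mean rather than the sum, but the constant factor cancels in the normalization of Eq. (17). The window size and the exponents \(p, q\) are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sfcm_step(img, centers, m=2.0, p=1, q=1, win=5):
    """One sFCM update (Eqs. 15-18) on an H x W x 3 feature image."""
    h, w, _ = img.shape
    x = img.reshape(-1, 3)                                  # n pixels
    d = np.linalg.norm(x[None, :, :] - centers[:, None, :], axis=2)  # c x n
    d = np.fmax(d, 1e-12)          # approximates the crisp zero-distance rule
    u = d ** (-1.0 / (m - 1.0))
    u /= u.sum(axis=0, keepdims=True)                       # Eq. (15)
    # spatial function h_ij: memberships aggregated over a square window
    hs = np.stack([uniform_filter(ui.reshape(h, w), size=win).ravel()
                   for ui in u])                            # Eq. (16)
    u2 = (u ** p) * (hs ** q)
    u2 /= u2.sum(axis=0, keepdims=True)                     # Eq. (17)
    um = u2 ** m
    centers = (um @ x) / um.sum(axis=1, keepdims=True)      # Eq. (18)
    return u2, centers
```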

Equations (15)–(18) give the fuzzy clustering (\(U;V\)) of the data \(X\), where \(u_{ij}\) defines the similarity of a data point \(\varvec{{x}}_{j}\) to a center \(\varvec{{v}}_{i}\). Using the membership vector \(\varvec{{u}}^{\prime }_{j} =(u^{\prime }_{1j};\ldots ;u^{\prime }_{cj})^{t}\) (the \(j\)th column of \(U\)), the standard t-conorm (max operator) is commonly used to assign a particular data point to the best-matching group. However, the data point \(\varvec{{x}}_{j}\) may satisfy more than one group description (i.e., lower values interact with the greatest), so exclusive partitioning (only the t-conorm) is not sufficient. The aggregate value of \(\varvec{{u}}^{\prime }_{j}\) can assess how much a particular point \(\varvec{{x}}_{j}\) belongs to several other clusters. We use the fuzzy OR operator (fOR-\(\ell \) for short) defined in [20] to evaluate the degrees of similarity at a given order \(\ell \). Let \(\rho \) be the power set of \(C = \{1,2,\ldots ,c\}\) and \(\rho _{\ell } =\{A \in \rho \mid |A|=\ell \}\), where \(|A|\) denotes the cardinality of the subset \(A\); then the fOR-\(\ell \) associates a single value \(\overset{\ell }{\bot }(\varvec{{u}}^{\prime }_{j})\in [0, 1]\) with \(\varvec{{u}}^{\prime }_{j}\), as given by the following equation:

$$\begin{aligned} \overset{\ell }{\bot }(\varvec{{u}}^{\prime }_{j})=\underset{i=1,2,\ldots ,c}{\overset{\ell }{\bot }} u^{\prime }_{ij}=\underset{A\in \rho _{\ell -1}}{\top }\left( \underset{i\in C\backslash A }{\bot } u^{\prime }_{ij}\right) \end{aligned}$$
(19)

It can be viewed as a generalization of the \(\ell \)th highest value, \(\ell \in C\): with the standard t-norms, \(\overset{\ell }{\bot }(\varvec{{u}}^{\prime }_{j})\) is the \(\ell \)th highest element of \(\varvec{{u}}^{\prime }_{j}\) [20].

In this paper, we use two measures, fuzzy overlap and fuzzy separation [4], to relate a point \(\varvec{{x}}_{j}\) to the clusters and overcome the drawbacks of compactness and separation measures. The fuzzy overlap measure evaluates the degree of overlap of a specified number \(\ell \) of fuzzy clusters, and the fuzzy separation measure indicates the degree of overlap of the most probable cluster, i.e., the one corresponding to the highest membership degree, with respect to the \(c-1\) other clusters; a low value of the latter measure denotes a large separation of the most probable cluster from the others. For each point \(\varvec{{x}}_{j}\) of \(X\), described by its membership vector \(\varvec{{u}}^{\prime }_{j}\), the overlap measure between \(\ell \) fuzzy clusters is obtained by Eq. (20). A combination of \(\ell \)-order overlap degrees for \(\varvec{{x}}_{j}\) is obtained by successively computing \(\overset{\ell }{\bot }(\varvec{{u}}^{\prime }_{j})\) for different values of \(\ell \). To determine the overall degree of overlap for a given point \(\varvec{{x}}_{j}\), the order that induces the highest overlap is found. The most satisfied order(s) can be quantified by the fuzzy disjunction of the \(\ell \)-order overlap measures \((\ell =2,\ldots ,c)\), so the overall overlap measure of a point \(\varvec{{x}}_{j}\) is given by:

$$\begin{aligned} O_{\bot }(\varvec{{u}}^{\prime }_{j};c)=\underset{{\ell }=2,c}{\overset{1}{\bot }}\left( \underset{i=1,c}{\overset{\ell }{\bot }}u^{\prime }_{ij}\right) \end{aligned}$$
(20)

Inter-cluster separation plays an important role in cluster validity. For this purpose, the fuzzy separation [4] of each point \(\varvec{{x}}_{j}\), with membership vector \(\varvec{{u}}^{\prime }_{j}\), is computed. It is the overlap measure within one fuzzy cluster, i.e., its separation from the other fuzzy clusters, since \(\varvec{{u}}^{\prime }_{j}\) sums to one. This aggregation corresponds to the fuzzy disjunction of membership degrees for a given point \(\varvec{{x}}_{j}\), which selects the most probable cluster. The fuzzy separation of \(\varvec{{x}}_{j}\) with respect to the \(c\) clusters is given as:

$$\begin{aligned} S_{\bot }(\varvec{{u}}^{\prime }_{j};c)=\overset{1}{\bot }\left( \underbrace{\underset{i=1,c}{\overset{1}{\bot }}u^{\prime }_{ij},\ldots \underset{i=1,c}{\overset{1}{\bot }}u^{\prime }_{ij}}_{c-1\ times}\right) \end{aligned}$$
(21)

The highest membership degree of a point \(\varvec{{x}}_{j}\) corresponds to a small overlap degree \(O_{\bot }(\varvec{{u}}^{\prime }_{j}; c)\) and a large separation \(S_{\bot }(\varvec{{u}}^{\prime }_{j}; c)\): a point \(\varvec{{x}}_{j}\) with a small \(O_{\bot }(\varvec{{u}}^{\prime }_{j}; c)\) and a large \(S_{\bot }(\varvec{{u}}^{\prime }_{j}; c)\) belongs to a well-separated, non-overlapping cluster. The \(\ell \)-order overlap and separation index (OSI), taking values in [0, 1], is then defined as the average value of the ratio of both measures:

$$\begin{aligned} OSI_{\bot }(c)=\frac{1}{n}\sum _{j=1}^{n}\frac{O_{\bot }(\varvec{{u}}^{\prime }_{j};c)}{S_{\bot }(\varvec{{u}}^{\prime }_{j};c)} \end{aligned}$$
(22)

MoADE is run for \(c\in [c_\mathrm{min}, c_\mathrm{max}]\), and the value of \(c\) that minimizes Eq. (22) is selected. Here, \(c_\mathrm{min}\) and \(c_\mathrm{max}\) are set to 2 and 20, respectively. Small regions of fewer than 100 pixels are merged into their neighboring segments on the basis of similarity to avoid over-segmentation.
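Under the standard max/min pair, for which \(\overset{\ell }{\bot }\) reduces to the \(\ell \)th highest membership (as noted above), the overlap of Eq. (20) becomes the second-highest membership of a point and the separation of Eq. (21) its highest. The sketch below computes the OSI of Eq. (22) under this assumption only; the other norm pairs studied in [4] would yield different aggregations.

```python
import numpy as np

def osi(u, eps=1e-12):
    """Overlap-separation index (Eq. 22) under standard max/min operators.

    u: c x n matrix of spatially weighted memberships u'.
    """
    s = np.sort(u, axis=0)[::-1]        # memberships, descending per point
    overlap = s[1]                      # O(u'_j; c), Eq. (20) under max/min
    separation = np.fmax(s[0], eps)     # S(u'_j; c), Eq. (21) under max/min
    return float(np.mean(overlap / separation))

# model selection: run MoADE for each c in [2, 20], compute the OSI of the
# resulting memberships, and keep the c with the smallest index value
```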

MoADE explores the underlying search space efficiently using opposition-based learning. Besides diversifying the search process, MoADE also speeds up convergence considerably. Figure 1 shows the results of ADE [32] and the proposed MoADE after 100 iterations of optimizing the new objective function; the first row shows the results of ADE, and the second row the results of MoADE. It is clear from the figure that ADE does not detect the body of the boy, while MoADE performs better in this case. Similarly, for the pyramid image (second column of Fig. 1), ADE fails to detect the black gate-shaped and white building-shaped objects, whereas the proposed MoADE detects the underlying objects accurately. Note that ADE uses a random initialization procedure and is unable to uniformly initialize the search space to the productive regions; once the search space is initialized to unproductive regions, the algorithm converges to local optima. On the other hand, the proposed MoADE uses the opposition-based learning strategy to initialize and evolve the population: the best individuals from the union of the current population and its opposite are selected for the new generation in order to reach the global best solution.

Fig. 1

Results of ADE [32] and the proposed MoADE after 100 iterations. The first row contains the results of ADE, while the second row contains the results of MoADE

4 Experimental results

Experiments have been carried out on images from the Berkeley segmentation database (BSD) [19], which contains 300 natural images of size (\(321\times 481\)) or (\(481\times 321\)). The results of the proposed algorithm are compared with three state-of-the-art image segmentation algorithms: segmentation via lossy data compression (LDC) [31], efficient graph-based image segmentation (EG) [8], and normalized cuts (NC) [26].

The region number in NC is set to 20, which is the average number of segments marked by the human subjects in each image. For the EG algorithm, the Gaussian smoothing parameter \(\sigma =0.8\), the threshold value \(k=300\), and a minimal region size of 50 pixels are set, as described in [8]. The source codes of all algorithms were downloaded from their respective web sites.

4.1 Qualitative comparisons

In this section, the segmentation results of the MoADE, NC, EG, and LDC approaches are evaluated visually. Eight images, shrub (\(481\times 321\)), church1 (\(481\times 321\)), eagle1 (\(481\times 321\)), eagle2 (\(481\times 321\)), texture (\(321\times 481\)), building (\(481\times 321\)), lion (\(481\times 321\)), and church2 (\(481\times 321\)), are selected from the Berkeley segmentation database for qualitative comparison. Figure 2 shows the results of the proposed MoADE, NC, EG, and LDC. For the shrub image, the result of MoADE closely matches human perception, i.e., the shrub and non-shrub areas are well separated, whereas NC divides the uniform background into unnecessary segments; the performance of EG and LDC is poor for this image because they over-segment it. In the case of the eagle1 image, the proposed algorithm outperforms the other state-of-the-art algorithms. LDC performs well for eagle1 compared to NC, which divides the uniform background into multiple segments, and EG, which fails to detect the actual boundaries of the object. Note that the proposed MoADE also outperforms the other techniques on the church1 image. For the eagle2 image, the proposed technique performs better by accurately separating the tree and the eagle from the background region, while the performance of the other three methods is not good enough. In accordance with human perception, the proposed MoADE correctly segments the texture image by separating the two different textures. For the same image, LDC performs well compared to NC, which divides the same texture into different segments, and EG, which over-segments and fails to detect the texture boundaries correctly. It is clear from the figure that the proposed algorithm also outperforms the other state-of-the-art methods on the building, lion, and church2 images.

Fig. 2

Segmentation results on some of the images from the Berkeley segmentation dataset and benchmark. The first, second, third, and fourth columns contain the MoADE, NC, EG, and LDC segmented images, respectively

NC tends to partition an image into regions of similar sizes, resulting in region boundaries that differ from the real edges. EG gives strongly over-segmented results. LDC usually obtains better visual results than NC and EG, but it often fails to find real edges and creates strongly over-segmented regions. It is easy to judge from the results that the proposed method performs better than the other state-of-the-art algorithms by generating boundaries that match the real edges of the objects.

4.2 Quantitative comparisons

Quantitative comparisons are also important for evaluating the performance of an algorithm objectively. Region differencing [13] measures the extent to which one segmentation can be viewed as a refinement of another. Boundary matching [9] compares the results of an algorithm with those of human subjects by finding the average displacement error of the boundary pixels. However, region differencing and boundary matching are not sufficient to evaluate segmentation results objectively [28]: if each pixel is considered a separate segment, both measures attain their best scores. Thus, over-segmented results, which are far from the human perception of image understanding, may be ranked as good.

In this paper, we have used the probabilistic Rand index (PRI) [28] and the variation of information (VoI) [21] to evaluate the segmentation results objectively. Consider a set of ground truths \({S}_{1}, {S}_{2}, \ldots , {S}_{K}\), labeled by \(K\) persons, for an image consisting of \(N\) pixels, and let \(S_{test}\) be the segmentation result to be compared with the ground truths. The PRI value is then defined as:

$$\begin{aligned} \hbox {PRI}(S_{test},\{S_{k}\})=\frac{1}{{N\atopwithdelims ()2}}\sum _{p<q}[\bar{p}_{pq}^{c_{pq}}(1-\bar{p}_{pq})^{1-c_{pq}}] \end{aligned}$$
(23)

where \((p,q)\) is a pixel pair in the image, \(c_{pq}=T(l_{p}^{S_{test}}=l_{q}^{S_{test}})\) denotes the event of the pixel pair \(p\) and \(q\) having the same label in the test result \(S_{test}\), and \(\bar{p}_{pq}=\frac{1}{K}\sum _{k=1}^{K}T(l_{p}^{S_{k}}=l_{q}^{S_{k}})\) is regarded as the probability of \(p\) and \(q\) having the same label. The value of PRI lies in the range [0,1]; the larger, the better. The variation of information (VoI) is defined as:

$$\begin{aligned} \hbox {VoI}(S_{test},\{S_{k}\})=\frac{1}{K}\sum _{k}[H(S_{test})+H(S_{k})-2I(S_{test},S_{k})] \end{aligned}$$
(24)

where \(H\) and \(I\) denote the entropy and the mutual information, respectively; their detailed definitions can be found in [21]. VoI is an information-based measure that determines how much information one segmentation gives about another. The value of VoI falls in \([0, \infty )\); the smaller, the better.
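For reference, the following sketch computes both scores directly from flattened integer label maps. The quadratic pixel-pair loop in `pri` is written for clarity; for full-size BSD images a contingency-table formulation is used in practice.

```python
import numpy as np

def pri(test, gts):
    """Probabilistic Rand index (Eq. 23); labels are integer arrays."""
    test = np.asarray(test).ravel()
    flat = [np.asarray(g).ravel() for g in gts]
    n, total = test.size, 0.0
    for p in range(n):
        for q in range(p + 1, n):                   # all pixel pairs
            p_bar = np.mean([g[p] == g[q] for g in flat])
            total += p_bar if test[p] == test[q] else 1.0 - p_bar
    return total / (n * (n - 1) / 2.0)

def voi(test, gts):
    """Mean variation of information (Eq. 24); non-negative integer labels."""
    test = np.asarray(test).ravel()
    flat = [np.asarray(g).ravel() for g in gts]

    def entropy(a):
        prob = np.bincount(a) / a.size
        prob = prob[prob > 0]
        return -np.sum(prob * np.log2(prob))

    def mutual_info(a, b):
        joint = np.zeros((a.max() + 1, b.max() + 1))
        np.add.at(joint, (a, b), 1.0)               # joint label histogram
        joint /= a.size
        pa, pb = joint.sum(axis=1), joint.sum(axis=0)
        nz = joint > 0
        return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz]))

    return float(np.mean([entropy(test) + entropy(g)
                          - 2.0 * mutual_info(test, g) for g in flat]))
```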

Table 1 shows the average values of PRI and VoI for the four algorithms on the 300 images of the Berkeley segmentation database. In this table, the second column shows the average PRI and VoI values between different human subjects, which are the best scores. It can be observed from the table that the proposed algorithm performs better than the other three state-of-the-art algorithms in terms of both PRI and VoI scores. The average PRI value of EG is close to that of the proposed algorithm, but its average VoI value is much larger. Similarly, the average VoI of LDC is close to that of the proposed MoADE and better than those of the other two algorithms, but its PRI value is worse. Figures 3 and 4 graphically present the PRI and VoI scores, respectively, for the four algorithms on the 300 BSD images. It is clear from the curves in Figs. 3 and 4 that the proposed algorithm performs better than the other state-of-the-art methods.

Table 1 Average values of PRI and VoI for the four algorithms on 300 BSD images
Fig. 3

PRI values achieved on individual images by the four algorithms. The values are plotted in increasing order

Fig. 4

VoI values achieved on individual images by the four algorithms. The values are plotted in increasing order

5 Conclusion

In this paper, we presented a novel image segmentation algorithm. It combines spatial fuzzy c-means (sFCM), which exploits similarity in the local neighborhood among pixels, with differential evolution (DE), which diversifies the search to find the optimal solution. The ratio of fuzzy overlap to fuzzy separation is used as a cluster validity index to find the optimal number of clusters automatically. The results of the proposed algorithm were compared with three state-of-the-art image segmentation algorithms; the qualitative and quantitative results demonstrate that the proposed algorithm performs better than these methods. In the future, we plan to construct a unified matrix from different feature spaces, i.e., color and texture, in such a way that the important information of each feature space is preserved; the segmentation will then be performed on this joint or combined search space.