1 Introduction

Promoting and maintaining diversity of candidate solutions is a key requirement of evolutionary algorithms (EAs) in general and multi-objective EAs (MOEAs) in particular [1, 2]. Many ways of measuring diversity have been considered, and many shortcomings identified [3]. Perhaps the most theoretically attractive diversity measure, used by [4, 5], is the Solow-Polasky diversity [6]. It turns out that a recently systematized theory of diversity in generalized metric spaces [7] singles out the Solow-Polasky diversity or magnitude of a (certain frequently total subset of a) finite metric space as equal to the maximum value of the “correct” definition (1) of diversity that uniquely satisfies various natural desiderata. While the notion of magnitude was implicit in the mathematical ecology literature over 25 years ago, an underlying notion of a diversity-maximizing probability distribution is much more recent and has not yet been applied to EAs.

In the context of MOEAs, a practical shortcoming associated with magnitude is its \(O(n^3)\) algorithmic cost. To avoid this, [4, 5] use an efficient approximation to merely measure diversity rather than attempting to enhance it.

However, it can be profitable to incur the marginal cost of computing a so-called weighting en route to the magnitude, since we can use it to enhance diversity near the boundary of the image of the candidate solution set under the objective functions. The nondominated part of this image is the current approximation to the Pareto front; the ability of weightings to couple both diversity and convergence to the Pareto front dovetails with recent indicator-based EA approaches to Pareto-dominance based MOEAs [8, 9]. Moreover, the agnosticism of weightings to dimension further enhances their suitability for such applications.

In this paper, we construct a gradient flow that systematically manipulates finite subsets of Euclidean space to enhance their diversity, which provides a useful primitive for quality diversity [10]. We then apply this primitive in service of MOEAs by diversifying solution data through local mutations. For the sake of illustration, we only perform these mutations on the results already obtained by a MOEA, though they can be performed during evolution.

The paper is organized as follows. In Sect. 2, we sketch the concepts of weightings, magnitude, and diversity, and describe an efficiently computable scale above which a weighting is guaranteed to be proportional to the unique diversity-maximizing distribution. In Sect. 3, we develop a notion of a weighting gradient (estimate) and an associated flow. In Sect. 4, we use this gradient flow to demonstrate diversity enhancement on a toy problem before turning to benchmark problems in Sect. 5. Finally, we discuss algorithmic extensions in Sect. 6 before remarks in Sect. 7.

2 Weightings, Magnitude, and Diversity

For details on the ideas in this section, see §6 of [7] and also [16, 30].

Call a square matrix \(Z \ge 0\) a similarity matrix if \(\text {diag}(Z) > 0\). A motivating class of examples is \(Z = \exp [-td]\) where square brackets indicate entrywise function application, \(t \in (0,\infty )\), and d is a square dissimilarity matrix (e.g., the matrix encoding a finite metric space). A weighting w is a column vector satisfying \(Zw = 1\), where the vector of all ones is indicated on the right. A coweighting is the transpose of a weighting for \(Z^T\). If Z admits both a weighting w and a coweighting, then its magnitude is defined via \(1^T w = \sum _j w_j\), which also turns out to equal the sum of the coweighting components.

In the case \(Z = \exp [-td]\) and d is the distance matrix corresponding to a finite subset of Euclidean space, Z is positive definite [11], hence invertible, and so its weighting and magnitude are well-defined and unique. More generally, if Z is invertible then its magnitude is \(1^T Z^{-1} 1\). For d as specified above, the magnitude function is defined as the map \(t \mapsto 1^T (\exp [-td])^{-1} 1\).

Weightings are excellent scale-dependent boundary detectors in Euclidean space (see, e.g., Fig. 2 and [12, 13]). Meanwhile, magnitude is a very general notion of size that encompasses rich scale-dependent geometrical data [14].

Example 1

Consider a three-point space with \(d_{12} = d_{13} = 1 = d_{21} = d_{31}\) and \(d_{23} = \delta = d_{32}\). A routine calculation yields that

$$\begin{aligned} w_1 = \frac{e^{(\delta +2)t}-2e^{(\delta +1)t}+e^{2t}}{e^{(\delta +2)t}-2e^{\delta t}+e^{2t}}; \quad w_2 = w_3 = \frac{e^{(\delta +2)t}-e^{(\delta +1)t}}{e^{(\delta +2)t}-2e^{\delta t}+e^{2t}}. \end{aligned}$$

For \(\delta = 10^{-3}\), Fig. 1 shows that at \(t = 10^{-2}\), the “effective size” of the nearby points is \({\approx }0.25\); that of the distal point is \({\approx }0.5\), so the “effective number of points” is \({\approx }1\). At \(t = 10\), these effective sizes are respectively \({\approx }0.5\) and \({\approx }1\), so the effective number of points is \({\approx }2\). Finally, at \(t = 10^4\), the effective sizes are all \({\approx }1\), so the effective number of points is \({\approx }3\).

Fig. 1.
figure 1

Weighting components for an “isoceles” metric space. The magnitude function \(w_1+w_2+w_3\) gives a scale-dependent “effective number of points.”

For a probability distribution p and similarity matrix Z, the diversity \(D_q^Z(p)\) is defined for \(1< q < \infty \) (and via limits for \(q = 1,\infty \)) via

$$\begin{aligned} \log D_q^Z(p) := \frac{1}{1-q} \log \sum _{j: p_j > 0} p_j (Zp)_j^{q-1}. \end{aligned}$$
(1)

This is the “correct” measure of diversity in essentially the same way that Shannon entropy is the “correct” measure of information [7]. (In fact, the expression (1) is a generalization of the Rényi entropy of order q. In the event \(Z = I\), the usual Rényi entropy is recovered, with Shannon entropy as the case \(q = 1\).) We therefore restrict our attention to it versus other measures such as those discussed in [2, 3].

Recent mathematical developments [7, 15] have clarified the role of magnitude in maximizing (1) versus merely computing it. Specifically, if \(Z = \exp [-td]\) is positive definite with d symmetric, nonnegative, and with zero diagonal, and if Z admits a positive weighting \(w = Z^{-1}1\), then this (unique) weighting is proportional to the diversity-maximizing distribution. This situation holds automatically if d is the distance matrix of a finite subset of Euclidean space and if Z is diagonally dominant (i.e., \(Z_{jj} > \sum _{k \ne j} Z_{jk}\)).

For d with zero diagonal and all other entries positive, there is a least \(t_d >0\) such that \(\exp [-td]\) is diagonally dominant for any \(t > t_d\). Because \(\exp [-td]\) is diagonally dominant iff \(1 > \max _j \sum _{k \ne j} \exp (-t d_{jk})\), we can efficiently estimate \(t_d\) using the following elementary bounds and a binary search:

Lemma 1

For \(d \in M_n\) as above, \(\frac{\log (n-1)}{\min _j \max _k d_{jk}} \le t_d \le \frac{\log (n-1)}{\min _j \min _{k \ne j} d_{jk}}\).

More importantly, we can also use Lemma 1 to find the least \(t_+ < t_d\) such that \(\exp [-td]\) admits a positive weighting for \(t > t_+\).

Fig. 2.
figure 2

(Top) Weighting components for 500 points sampled without replacement from a probability distribution on \(\mathbb {Z}^2\) that is approximately uniform on its support. From left to right, various scale factors t defining \(Z = \exp [-td]\) (with \(d =\) Euclidean distance) are shown in terms of the intrinsic scales \(t_d\) and \(t_+\). Both the color and size of a point indicate its weighting component; the nonzero color axis tick mark is at half maximum. (Bottom) Weighting gradient estimate (2) for the data above. The gradient vectors are scaled uniformly in each panel for visualization purposes. Note that for the largest value of t the large gradient vectors have basepoints near other large gradient vectors.

3 The Weighting Gradient Flow

We now define a gradient flow that (for \(t \ge t_+\)) increases the diversity of finite subsets of Euclidean space and thereby provides a useful primitive for EAs. Although there are various sophisticated approaches to estimating gradients on point clouds (see, e.g., [17]), a reasonable heuristic estimate for the specific case of the gradient of a weighting w on \(\{x_j\}_j\) in Euclidean space is

$$\begin{aligned} (\hat{\nabla }w)_j := \sum _{k \ne j} \frac{Z_{jk}}{\sum _{k' \ne j} Z_{jk'}} \frac{w_k - w_j}{d_{jk}} e_{jk}, \end{aligned}$$
(2)

where \(e_{jk} := \frac{x_k - x_j}{d_{jk}}\). The weighting gradient flow induced by (2) is

$$\begin{aligned} \dot{x} = \hat{\nabla }w. \end{aligned}$$
(3)

Example 2

Figure 2 illustrates how weightings identify boundaries at various scales, and the corresponding weighting gradient estimates (2).

4 Enhancing Diversity

Following [18], we apply the ideas sketched above to a toy problem where the objective function f has three components, each measuring the distance to a vertex of a regular triangle with vertices in \(S^1\). The application is mostly conceptually straightforward, but we mention a few implementation details:

  • We begin with a uniformly distributed sample of \(n_0 = 10^3\) points in the disk of radius 1.25, and retain n points that are dominated by \(\le \delta = 0.1\);

  • Replace misbehaving points (e.g., out of bounds or NaNs) with predecessors;

  • Set \(S_j := 1-2\frac{\text {dom}_j}{\max _k \text {dom}_k}\), where \(\text {dom}_j = |\{\text {points dominating the } j\text {th point} \}|\);

  • Evolve the n points under a modulated version of (3) on the objective space with \(t = t_+\) as \(dy_j = ds \cdot S_j (\hat{\nabla }w)_j\) for only \(N = 10\) steps and step size \(ds = \sqrt{\langle \min _{k \ne j} (d_f)_{jk} \rangle /n}\), where the pullback metric is \(d_f(x,x') := d(f(x),f(x'))\);

  • Pull back the weighting gradient flow from objective to solution space using the Jacobian’s pseudoinverse, then recompute points in objective space.

The result of this experiment is depicted in Figs. 3 and 4. The salutary effect on diversity in objective (and solution) space is apparent. This can be quantified via the objective space magnitude functions, as shown in Fig. 5.

Fig. 3.
figure 3

Comparision of and locations of points in the solution space. The weighting gradient flow produces more evenly distributed terminal points. The triangle defining objective components (by distance to vertices) is shown. The actual Pareto front is the interior of the triangle; the area displayed is \([-1,1]^2\). Bottom: comparision of and points in the objective space. The terminal points are more evenly distributed. (Color figure online)

Fig. 4.
figure 4

Comparision of and points in the objective space. The terminal points are more evenly distributed. (Color figure online)

Fig. 5.
figure 5

Magnitude increases for the experiment of Sect. 4 at scales above \(t_+\), where magnitude equals diversity. (Top) Magnitude function quotients at various timesteps for feasible points under the evolution of the (modulated) weighting gradient flow. The horizontal axis t indicates the scale parameter; timesteps of numerators are indicated via color, going from to : the denominator is the function at the initial timestep. Circles indicate the scales \(t_+\). (Bottom) As above, but for non-dominated points. (Color figure online)

5 Performance on Benchmarks

The effectiveness of the (modulated) weighting gradient flow approach hinges on the ability to cover and thereby “keep pressure on” the Pareto front. A straightforward way to do this is to use a MOEA to produce an initial overapproximation of the Pareto front as in [19], and then improve the diversity of the overapproximation via the weighting gradient flow. We proceed to detail the results of an experiment along these lines. For the experiment we considered two leading MOEAs (NSGA-II [20] and SPEA2 [21]) and two leading benchmark problem sets (DTLZ [22]Footnote 1 and WFG [24]), all implemented in PlatEMO version 2.9 [25]. For each problem, we used 10 decision variables, three objectives (to enable visualization), and performed 10 runs (which appears quite adequate for characterization purposes) with population size 250 and \(10^4\) fitness evaluations. We then took \(N = 10\) timesteps for the weighting gradient flow as before.

Figure 6 (cf. Fig. 5) shows magnitude functions at various timesteps of the (modulated) weighting gradient flow applied to the results of NSGA-II on the WFG2 benchmark. Feasible points show a diversity (as measured by magnitude at scale \(t_+\) for the feasible objective points) increase of about 10%, whereas non-dominated points show a diversity increase of several percent as well, even as the total number of non-dominated points decreases by about 15%.

Fig. 6.
figure 6

As in Fig. 5, but for a solution of the WFG2 benchmark via NSGA-II.

We produce an ensemble characterization in Fig. 7. The figure shows that the number of non-dominated points decreases since the weighting gradient flow pushes some points a short distance away from the Pareto front (as illustrated in Fig. 9) before they are halted or reversed. The figure also shows that the diversity of non-dominated points generally increases slightly, and the diversity of feasible points increases significantly. As a consequence, the diversity contributions of individual solutions (as measured by the average weighting, i.e., the magnitude of non-dominated points divided by their cardinality) also increases significantly. For less challenging problems such as in Fig. 3, the number of non-dominated points will decrease less, and the diversity gains will be enhanced.

On the other hand, the effects of the weighting gradient flow are considerably reduced in the case of SPEA2, which produces a visibly more uniform distribution in objective space than NSGA-II: see Fig. 11. The weighting gradient flow appears to decrease this uniformity; the formation of a gap just behind the boundary along with a slight increase in the population near the boundary are the main visible indicators that something useful (at least for DTLZ4, WFG2, WFG3, WFG6, and WFG8, per Fig. 7) is actually happening.

Fig. 7.
figure 7

Diversity of solutions increases markedly under the weighting gradient flow, even as some points become slightly dominated. (Top) Average diversity quotients of and points under the weighting gradient flow along with proportion of population that remains non-dominated (black). Here the diversity is the magnitude at scale \(t_+\). Shaded bands indicate one standard deviation. All panels have the same horizontal axis, viz., the number of timesteps (from 0 to \(N = 10\)). The vertical axes are \([1-\varDelta ,1+\varDelta ]\), with \(\varDelta \) shown below each panel. Not shown explicitly is the average weighting of non-dominated points, i.e., the red curve divided by the black one, but so long as the colored bands already shown are visibly separate, this consistently lies above the blue band. (Bottom) As for the top panels, but for SPEA2. (Color figure online)

Fig. 8.
figure 8

As in Fig. 7, but for diversity taken as the magnitude at the scale maximizing the quotient by the initial timestep.

Although Figs. 7 and 8 shows that the weighting gradient flow causes a significant proportion of points to become dominated, Fig. 9 uses the inverted generational distance (IGD) relative to uniformly distributed reference points on Pareto fronts [26] to show that this qualitative change in dominance is belied by only minor quantitative changes in the distance to Pareto fronts.Footnote 2 (Note that the relatively large increases in IGD for DTLZ4 and DTLZ7 are consequences of starting from a low baseline.) That is, feasible points give a better quantitative sense of diversification performance than nondominated points, especially in light of use cases in which the weighting gradient flow is not limited to postprocessing.

Rather than relying solely on a delicate characterization of diversity, we also visualize some of the results directly: this is the rationale for three-objective problems. Figure 10 shows how diversity in objective space is promoted for WFG2-3. Figure 11 shows analogous results for SPEA2.

Careful inspection reveals that the weighting gradient flow tends to induce a gap between the boundary of the non-dominated region and its interior, which is consistent with the generally observed phenomenon that the largest weights in finite subsets of Euclidean space tend to occur on boundaries and the smallest weights immediately “behind” the boundary. Meanwhile, the boundary region tends to become slightly more populated.Footnote 3 From the perspective of a MOEA, this is frequently a benefit, since extremal and non-extremal points on the non-dominated approximation of the Pareto front differ in practical significance.Footnote 4

6 Algorithmic Extensions

6.1 Multi-objective Weighting Gradient Flow

We can combine the weighting gradient flow with a multi-gradient descent strategy in a way somewhat akin to [28]. The basic additional ideas are:

  • Introduce variable regularizing terms \(\lambda _w\) and \(\lambda _f\) for the weighting and function gradient flows, respectively;

  • Form the objective-space differentials \(dy_j = ds \cdot [\lambda _w S_j (\hat{\nabla }w)_j + \lambda _f \sum _\ell (\hat{\nabla }f)_\ell ]\), where the sum is over \(\ell \) such that \(\langle (\hat{\nabla }w)_j,(\hat{\nabla }f_\ell )_j \rangle > 0\).

While we have tried this technique in isolation on MOEA benchmarks, the results are poor. However, this is unsurprising: the benchmarks are designed to frustrate MOEAs, much less multi-objective techniques relying on gradients.

6.2 Recycling Function Evaluations

In our experiments with post-processing the output of MOEAs, the weighting gradient flow evolution took time comparable to (and in the case of NSGA-II, slightly more than) the MOEA itself. Most of the time is spent evaluating the fitness function: apart from an initialization step, the evaluations are performed to compute Jacobians in service of pullback operations, and a lesser number are performed to compute pushforwards to maintain consistency.

Fig. 9.
figure 9

The weighting gradient flow only slightly affects the quantitative dominance behavior of points, as measured by IGD. (Top) IGD under the weighting gradient flow starting from the results of NSGA-II runs, using uniformly distributed reference points on Pareto fronts. Shaded bands indicate one standard deviation. All panels have the same horizontal axis, viz., the number of timesteps (from 0 to \(N = 10\)). The vertical axes are [0, y], where y is shown below each panel. (Bottom) As above, but for SPEA2.

Fig. 10.
figure 10

(Far left) Initial configuration in objective space for WFG2 after a NSGA-II run. (Center left) Configuration after subsequently evolving under the weighting gradient flow. Dominated points are gray. (Right panels) As on the left, but for WFG3.

Fig. 11.
figure 11

As in Fig. 10, but for SPEA2. Note the formation of gaps behind the boundary.

However, our motivating problems require significant time (on the order of a second) for function evaluations. This demands a more efficient pullback scheme that minimizes or avoids function evaluations, even if the results are substantially worse. A reasonable idea is “recycling” in a sense similar to that employed in some modern Monte Carlo algorithms [29]. Specifically, rather than computing a good approximation to the Jacobian by evaluating functions afresh at very close points along coordinate axes, we settle instead for an approximation of lesser quality that exploits existing function evaluations. We have implemented this in concert with a de novo computation of the Jacobian in the event that this initial Jacobian estimate does not have full rank. Our experiments suggest that this works reasonably well: for a typical run from Sect. 5, the number of function evaluations is reduced from 30250 to 2750, and the actual results are broadly comparable (sometimes better, sometimes worse): see Figs. 12 and 13.

This strategy will work poorly if evaluation points lie on a manifold of nonzero codimension or low curvature, because in such cases a matrix that transforms vectors from a base point to evaluation points into (a small multiple of) the standard basis will have a large condition number. However, these situations are relatively unlikely to present major problems in practice, and the recycling approach is likely to be useful when function evaluations are expensive.

7 Remarks

Although our experiments have focused on the results of applying the weighting gradient flow and related constructions after a MOEA has been applied, the more natural application is in the course of a MOEA. As mentioned in Sect. 6, there is ample scope to refine and build on ideas for increasing weighting components in specific contexts. It is nevertheless clear that the theory of magnitude informs principled and practical diversity-promoting mechanisms that can already be usefully applied to benchmark multi-objective problems.

Fig. 12.
figure 12

As in the top panel of Fig. 7, but for a Jacobian approximation that uses existing function evaluations, increasing speed at the cost of accuracy.

Fig. 13.
figure 13

As in Fig. 9, but for a Jacobian approximation that uses existing function evaluations, increasing speed at the cost of accuracy.