Abstract
Simplex gradients are widely used in derivative-free optimization. This article clarifies some of the properties of simplex gradients and presents calculus rules similar to those of an ordinary gradient. For example, the simplex gradient does not depend on the order of sample points in the underdetermined and determined cases but this property is not true in the overdetermined case. Moreover, although the simplex gradient is the gradient of the corresponding linear model in the determined case, this is not necessarily true in the underdetermined and overdetermined cases. However, the simplex gradient is the gradient of an alternative linear model that is required to interpolate the reference data point. Also, the negative of the simplex gradient is a descent direction for any interpolating linear function in the determined and underdetermined cases but this is again not necessarily true for the linear regression model in the overdetermined case. In addition, this article reviews a previously established error bound for simplex gradients. Finally, this article treats the simplex gradient as a linear operator and provides formulas for the simplex gradients of products and quotients of two multivariable functions and a power rule for simplex gradients.
1 Introduction
In the fully determined case, a simplex gradient of a function is the gradient of a linear model that interpolates the function at a maximal set of affinely independent sample points (i.e., the vertices of a simplex). The notion of a simplex gradient is widely used in derivative-free optimization. For example, it is used in the analysis of optimization methods for noisy problems that utilize function values on a sequence of simplices, such as Nelder–Mead and implicit filtering (Bortz and Kelley [2], Kelley [8], Conn et al. [5]). Moreover, Custódio et al. [6] analyzed sequences of simplex gradients computed for nonsmooth functions in the context of direct search methods of the directional type such as Generalized Pattern Search (GPS) (Torczon [12]) and Mesh Adaptive Direct Search (MADS) (Audet and Dennis [1]). Simplex gradients have also been used to enhance the performance of pattern search by using them to reorder the objective function evaluations associated with the various poll directions (Custódio and Vicente [7]). In addition, in the benchmarking of derivative-free optimization algorithms, the data profile of a solver is defined as the percentage of problems solved for a given number of simplex gradient estimates (Moré and Wild [9]). More recently, Regis [10] used underdetermined simplex gradients to develop an initialization strategy for surrogate-based, high-dimensional expensive black-box optimization.
The purpose of this article is to clarify some of the properties of the simplex gradient and present calculus rules similar to those of an ordinary gradient. In particular, the simplex gradient does not depend on the order of sample points in the underdetermined and determined cases but this property does not hold in the overdetermined case. Moreover, as expected, the simplex gradient is the gradient of the corresponding linear interpolation model in the determined case. However, it is not necessarily the gradient of the linear model corresponding to the minimum norm least squares solution of the associated linear system in both the underdetermined and overdetermined cases. It turns out, though, that the simplex gradient is the gradient of an alternative linear model that is required to interpolate the reference data point. Also, in the underdetermined and determined cases, the negative of the simplex gradient with respect to a set of data points is shown to be a descent direction for any linear model that interpolates these data points. However, in the overdetermined case, the negative of the simplex gradient is not necessarily a descent direction for the corresponding linear regression model. Next, a previously established error bound for simplex gradients is reviewed. Furthermore, a convenient notation for the simplex gradient is introduced that treats it as a linear operator and some calculus rules such as product and quotient rules are proved. Finally, although the simplex gradient does not seem to satisfy a general chain rule, a power rule for simplex gradients is also proved.
2 Preliminaries
2.1 Definition of a simplex gradient
Throughout this article, \(f\) and \(g\) are functions from \(\mathbb {R}^d\) to \(\mathbb {R}\). Let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) points in \(\mathbb {R}^d\), where \(k \ge 1\). Define \(S(\mathcal {X})=[x_1-x_0 \quad x_2-x_0 \quad \ldots \quad x_k-x_0] \in \mathbb {R}^{d \times k}\) and \(\delta _f(\mathcal {X})=[f(x_1)-f(x_0),\ldots ,f(x_k)-f(x_0)]^T \in \mathbb {R}^k\).
First, consider the case where \(\mathcal {X}\) consists of \(k+1\) affinely independent points (and so, \(k \le d\)). When \(k=d\) (the determined case), \(S(\mathcal {X})\) is invertible and the simplex gradient of \(f\) with respect to \(\mathcal {X}\), denoted by \(\nabla _s f(\mathcal {X})\), is given by \(\nabla _s f(\mathcal {X})=S(\mathcal {X})^{-T}\delta _f(\mathcal {X})\).
When \(k<d\) (the underdetermined case), the simplex gradient of \(f\) with respect to \(\mathcal {X}\) is the minimum 2-norm solution of the linear system \(S(\mathcal {X})^Tz=\delta _f(\mathcal {X})\) in the unknown \(z \in \mathbb {R}^d\),
which is given by \(\nabla _s f(\mathcal {X})=S(\mathcal {X})(S(\mathcal {X})^TS(\mathcal {X}))^{-1}\delta _f(\mathcal {X})\). In this case, note that \(S(\mathcal {X})^TS(\mathcal {X})\) is symmetric and positive definite (and hence nonsingular) since \(S(\mathcal {X})\) has full (column) rank. Moreover, \(\nabla _s f(\mathcal {X})\) is a linear combination of \(x_1-x_0,\ldots ,x_k-x_0\) since \(\nabla _s f(\mathcal {X})=S(\mathcal {X})v\), where \(v=(S(\mathcal {X})^TS(\mathcal {X}))^{-1}\delta _f(\mathcal {X}) \in \mathbb {R}^k\).
Next, let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) distinct points in \(\mathbb {R}^d\) that contains a proper subset of \(d+1\) affinely independent points, and so, \(k>d\) (the overdetermined case). In this case, \(\{[1,x_0^T],[1,x_1^T],\ldots ,[1,x_k^T]\}\) contains a subset of \(d+1\) linearly independent points and it can be assumed that this subset contains \([1,x_0^T]\). Hence, \(\mathcal {X}\) is guaranteed to have a proper subset of \(d+1\) affinely independent points that includes \(x_0\). This implies that \(S(\mathcal {X})\) has full (row) rank, and so, \(S(\mathcal {X})S(\mathcal {X})^T\) is symmetric and positive definite. Now the simplex gradient of \(f\) with respect to \(\mathcal {X}\) is the least squares solution of the linear system \(S(\mathcal {X})^Tz=\delta _f(\mathcal {X})\),
which is given by \(\nabla _s f(\mathcal {X})=(S(\mathcal {X})S(\mathcal {X})^T)^{-1}S(\mathcal {X})\delta _f(\mathcal {X})\).
Custódio et al. [6] note that the definitions of the simplex gradient for the three cases above can be combined into a single definition by considering a reduced SVD of \(S(\mathcal {X})^T\) as shown next.
Definition 1
Let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) points in \(\mathbb {R}^d\) where \(k \ge 1\). Suppose \(S(\mathcal {X})\) has full rank so that \(\text{ rank }(S(\mathcal {X}))=\min \{d,k\}\). Then the simplex gradient of \(f\) with respect to \(\mathcal {X}\) is given by \(\nabla _s f(\mathcal {X})=V(\mathcal {X})\Sigma (\mathcal {X})^{-1}U(\mathcal {X})^T\delta _f(\mathcal {X}),\)
where \(U(\mathcal {X})\Sigma (\mathcal {X}) V(\mathcal {X})^T\) is a reduced SVD of \(S(\mathcal {X})^T\).
Note that \(V(\mathcal {X})\Sigma (\mathcal {X})^{-1}U(\mathcal {X})^T\) is precisely the Moore–Penrose pseudoinverse of \(S(\mathcal {X})^T\), and so, the simplex gradient of \(f\) with respect to \(\mathcal {X}\) can also be expressed as \(\nabla _s f(\mathcal {X})=(S(\mathcal {X})^T)^{\dagger }\delta _f(\mathcal {X})\), where \(A^{\dagger }\) denotes the Moore–Penrose pseudoinverse of the matrix \(A\).
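The pseudoinverse form covers all three cases at once, which makes it convenient to compute. Below is a minimal numpy sketch (the helper name `simplex_gradient` is my own, not from the article) implementing \(\nabla _s f(\mathcal {X})=(S(\mathcal {X})^T)^{\dagger }\delta _f(\mathcal {X})\).

```python
import numpy as np

def simplex_gradient(points, fvals):
    """Simplex gradient of f with respect to the ordered sample set.

    points: list of k+1 points in R^d (first point is the reference x0);
    fvals: the corresponding function values.  Uses the pseudoinverse form
    (S^T)^dagger * delta_f, which handles the underdetermined, determined,
    and overdetermined cases uniformly.
    """
    x0 = np.asarray(points[0], dtype=float)
    S = np.column_stack([np.asarray(p, dtype=float) - x0 for p in points[1:]])
    delta = np.array(fvals[1:], dtype=float) - fvals[0]
    return np.linalg.pinv(S.T) @ delta

# Determined case in R^2: f(x) = 3 + x1 - 2*x2 sampled at a simplex.
f = lambda x: 3.0 + x[0] - 2.0 * x[1]
pts = [np.zeros(2), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
g = simplex_gradient(pts, [f(p) for p in pts])   # -> [1, -2]
```

In the determined case this agrees with \(S(\mathcal {X})^{-T}\delta _f(\mathcal {X})\), since the pseudoinverse of an invertible matrix is its inverse.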
2.2 Linear interpolation and regression
Consider a set \(\mathcal {X}=\{x_0,x_1,\ldots ,x_k\}\) of \(k+1\) points in \(\mathbb {R}^d\). If the points in \(\mathcal {X}\) are affinely independent (which implies \(k \le d\)), then there exists a linear function (an infinite number if \(k<d\)) that interpolates the data points \(\{(x_0,f(x_0)),(x_1,f(x_1))\), \(\ldots ,(x_k,f(x_k))\}\). More precisely, if \(m(x)=c_0+c^Tx\), where \(c=[c_1,\ldots ,c_d]^T\), is a linear polynomial in \(d\) variables that interpolates these data points, then \(c_0+c^Tx_i=f(x_i)\), \(i=0,1,\ldots ,k\) (1).
For convenience, define the \((k+1) \times (d+1)\) interpolation matrix \(L(\mathcal {X})\) and the column vector \(F(\mathcal {X})\) as follows: \(L(\mathcal {X})=\left[ \begin{array}{c@{\quad }c} 1 &{} x_0^T \\ 1 &{} x_1^T \\ \vdots &{} \vdots \\ 1 &{} x_k^T \\ \end{array} \right] \) and \(F(\mathcal {X})=[f(x_0),f(x_1),\ldots ,f(x_k)]^T\).
Equation (1) then becomes \(L(\mathcal {X})\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] = F(\mathcal {X})\) (2).
When \(k>d\) and \(L(\mathcal {X})\) has full (column) rank, Eq. (2) can be solved in a least squares sense. For convenience, the following definition from Conn, Scheinberg and Vicente [5] is used below.
Definition 2
Let \(\mathcal {X}=\{x_0,x_1,\ldots ,x_k\}\) be a set of \(k+1\) points in \(\mathbb {R}^d\). When \(k=d\) (determined case), the set \(\mathcal {X}\) is said to be poised for linear interpolation in \(\mathbb {R}^d\) if \(L(\mathcal {X})\) is nonsingular. When \(k>d\) (overdetermined case), the set \(\mathcal {X}\) is said to be poised for linear regression in \(\mathbb {R}^d\) if \(L(\mathcal {X})\) has full (column) rank.
If \(\mathcal {X}\) consists of exactly \(d+1\) affinely independent points (determined case), then \(L(\mathcal {X})\) is nonsingular (and so \(\mathcal {X}\) is poised for linear interpolation in \(\mathbb {R}^d\)) and the coefficients of the linear model are given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] = L(\mathcal {X})^{-1} F(\mathcal {X})\). If \(\mathcal {X}\) consists of \(k+1\) affinely independent points, where \(k<d\) (underdetermined case), then \(L(\mathcal {X})\) has full row rank and the minimum 2-norm solution to Eq. (1) is given by \(\left[ \begin{array}{c} c_0 \\ c\\ \end{array} \right] = L(\mathcal {X})^T(L(\mathcal {X})L(\mathcal {X})^T)^{-1} F(\mathcal {X})\). If \(k>d\) and \(\mathcal {X}\) contains \(d+1\) affinely independent points (overdetermined case), then \(L(\mathcal {X})\) has full column rank (and so \(\mathcal {X}\) is poised for linear regression in \(\mathbb {R}^d\)) and the least squares solution to Eq. (1) is given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] = (L(\mathcal {X})^T L(\mathcal {X}))^{-1} L(\mathcal {X})^T F(\mathcal {X})\). As before, \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] = L(\mathcal {X})^{\dagger } F(\mathcal {X})\) in all these cases. For a comprehensive treatment of the geometry of sample sets of points for interpolation (determined and underdetermined cases) and regression (overdetermined case) in the context of derivative-free optimization, the reader is referred to the papers by Conn, Scheinberg and Vicente ([3, 4]) and Scheinberg and Toint [11].
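All three coefficient formulas above are instances of \(L(\mathcal {X})^{\dagger } F(\mathcal {X})\), which the following sketch exploits (the helper name is my own, not from the article):

```python
import numpy as np

def linear_model_coeffs(points, fvals):
    """Coefficients [c0, c] of the linear model via the pseudoinverse of the
    interpolation matrix L(X) whose rows are [1, x_i^T].  This covers the
    determined, underdetermined, and overdetermined cases uniformly."""
    X = np.asarray(points, dtype=float)
    L = np.hstack([np.ones((X.shape[0], 1)), X])   # (k+1) x (d+1)
    return np.linalg.pinv(L) @ np.asarray(fvals, dtype=float)

# Overdetermined example in R^1: four samples of f(x) = 1 + 2x.
coef = linear_model_coeffs([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
# coef -> [1, 2], i.e., c0 = 1 and c = 2
```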
The first proposition below relates poisedness for linear interpolation or regression with the existence of the simplex gradient in the determined and overdetermined cases.
Proposition 1
Let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) points in \(\mathbb {R}^d\) with \(k \ge d\). Then \(L(\mathcal {X})\) has full (column) rank if and only if \(S(\mathcal {X})\) has full (row) rank.
Proof
Note that \(\text{ rank }(L(\mathcal {X}))=d+1\) if and only if \(L(\mathcal {X})\) has \(d+1\) linearly independent rows. Since the first nonzero row of any matrix can be extended to a maximal set of linearly independent rows of that matrix, it follows that \(L(\mathcal {X})\) has full (column) rank if and only if \(L(\mathcal {X})\) has \(d+1\) linearly independent rows that include \([1,x_0^T]\), and this is true if and only if \(\mathcal {X}\) has a subset of \(d+1\) affinely independent points that include \(x_0\). This implies that \(L(\mathcal {X})\) has full (column) rank if and only if \(S(\mathcal {X})\) has \(d\) linearly independent columns, or equivalently, \(\text{ rank }(S(\mathcal {X}))=d\). \(\square \)
The above proposition says that \(\mathcal {X}\) is poised for linear interpolation or linear regression (depending on whether \(k=d\) or \(k>d\)) if and only if the simplex gradient \(\nabla _s f(\mathcal {X})\) is defined.
3 Basic properties of simplex gradients
When \(k \le d\) (the determined and underdetermined cases), the following proposition, which was proved in Regis [10], shows that the simplex gradient \(\nabla _s f(\mathcal {X})\) does not depend on the order of the points in \(\mathcal {X}\). A slightly different proof from the one in Regis [10] is included below to make this article self-contained.
Proposition 2
Suppose \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) is an ordered set of \(k+1\) affinely independent points in \(\mathbb {R}^d\), where \(1 \le k \le d\). Let \(\alpha \) be a permutation of the indices \(\{0,1,\ldots ,k\}\) and let \({\mathcal {X}}_{\alpha }=\langle x_{\alpha (0)},x_{\alpha (1)},\ldots ,x_{\alpha (k)} \rangle \). Then \(\nabla _s f(\mathcal {X}_{\alpha }) = \nabla _s f(\mathcal {X})\).
Proof
First, consider the case where \(\alpha (0)=0\). Then \(S(\mathcal {X}_{\alpha })=S(\mathcal {X})P\) and \(\delta _f(\mathcal {X}_{\alpha })=P^T\delta _f(\mathcal {X})\)
for some permutation matrix \(P\). Now \(\nabla _s f(\mathcal {X}_{\alpha })=S(\mathcal {X}_{\alpha })\big (S(\mathcal {X}_{\alpha })^TS(\mathcal {X}_{\alpha })\big )^{-1}\delta _f(\mathcal {X}_{\alpha })=S(\mathcal {X})P\big (P^TS(\mathcal {X})^TS(\mathcal {X})P\big )^{-1}P^T\delta _f(\mathcal {X})=S(\mathcal {X})PP^T\big (S(\mathcal {X})^TS(\mathcal {X})\big )^{-1}PP^T\delta _f(\mathcal {X})=S(\mathcal {X})\big (S(\mathcal {X})^TS(\mathcal {X})\big )^{-1}\delta _f(\mathcal {X})=\nabla _s f(\mathcal {X})\).
The previous argument shows that if the points in \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) are permuted except for the reference point \(x_0\), then the simplex gradient remains the same. Next, observe that any permutation of \(\{0,1,\ldots ,k\}\) that maps 0 to another element can be obtained from a permutation that fixes 0 by means of a single transposition. Hence, it only remains to show that the simplex gradient is preserved by a permutation of \(\{0,1,\ldots ,k\}\) that switches 0 and another element while holding the other elements fixed.
Let \(\alpha \) be a permutation of \(\{0,1,\ldots ,k\}\) such that \(\alpha (0)=j \ne 0\), \(\alpha (j)=0\), and that fixes all other elements. Note that \(S(\mathcal {X}_{\alpha })=[x_{\alpha (1)} - x_{\alpha (0)} \quad \ldots \quad x_{\alpha (k)}-x_{\alpha (0)}]\) can be transformed to \(S(\mathcal {X})=[x_1-x_0 \quad \ldots \quad x_k-x_0]\) by applying a series of elementary column operations to \(S(\mathcal {X}_{\alpha })\). To see this, begin by multiplying the \(j\)th column of \(S(\mathcal {X}_{\alpha })\) by \(-1\). The result is also given by \(S(\mathcal {X}_{\alpha })M\), where \(M\) is the elementary matrix obtained by replacing the \(j\)th diagonal entry of \(I_k\) by \(-1\). Next, for each \(i=1,\ldots ,k,\ i \ne j\), perform an elementary column operation that consists of adding the \(j\)th column of \(S(\mathcal {X}_{\alpha })M\) to the \(i\)th column and storing the result in the latter column. The result is some permutation of the columns of \(S(\mathcal {X})\), and so, \(S(\mathcal {X}_{\alpha })M E_1 E_2 \ldots E_{k-1} P=S(\mathcal {X}),\)
where \(P\) is a permutation matrix and \(E_1, E_2, \ldots , E_{k-1}\) are the elementary matrices obtained by adding the \(j\)th column of \(I_k\) to the other columns and storing the results in those columns.
Let \(F=M E_1 E_2 \ldots E_{k-1} P\). Then \(S(\mathcal {X}_{\alpha }) F = S(\mathcal {X})\) and \(F\) is nonsingular because it is the product of nonsingular matrices. Observe that the same column operations applied to the row vector \(\delta _f(\mathcal {X}_{\alpha })^T\) yield \(\delta _f(\mathcal {X})^T\); that is, \(\delta _f(\mathcal {X}_{\alpha })^T F=\delta _f(\mathcal {X})^T\), or equivalently, \(F^T\delta _f(\mathcal {X}_{\alpha })=\delta _f(\mathcal {X})\).
Hence, \(\nabla _s f(\mathcal {X})=S(\mathcal {X})\big (S(\mathcal {X})^TS(\mathcal {X})\big )^{-1}\delta _f(\mathcal {X})=S(\mathcal {X}_{\alpha })F\big (F^TS(\mathcal {X}_{\alpha })^TS(\mathcal {X}_{\alpha })F\big )^{-1}F^T\delta _f(\mathcal {X}_{\alpha })=S(\mathcal {X}_{\alpha })\big (S(\mathcal {X}_{\alpha })^TS(\mathcal {X}_{\alpha })\big )^{-1}\delta _f(\mathcal {X}_{\alpha })=\nabla _s f(\mathcal {X}_{\alpha })\).
\(\square \)
When \(k>d\) (the overdetermined case), \(\nabla _s f(\mathcal {X})\) depends on the order of the points in \(\mathcal {X}\) as can be seen from the following examples in \(\mathbb {R}\) and \(\mathbb {R}^2\).
Example 1
Consider \(x_0=-1, x_1=0, x_2=1\) and suppose \(f(x_0)=2, f(x_1)=1, f(x_2)=3\). Let \({\mathcal {X}}=\langle x_0,x_1,x_2 \rangle \) and consider the permutation of \(\{0,1,2\}\) given by \(\alpha = \left( \begin{array}{c@{\quad }cc} 0 &{} 1 &{} 2 \\ 1 &{} 2 &{} 0 \\ \end{array} \right) \). Then \({\mathcal {X}}_{\alpha }=\langle x_1,x_2,x_0 \rangle \). Note that \(S(\mathcal {X})\) and \(S({\mathcal {X}}_{\alpha })\) have full rank and \(\nabla _s f(\mathcal {X})=\big (S(\mathcal {X})S(\mathcal {X})^T\big )^{-1}S(\mathcal {X})\delta _f(\mathcal {X})=[1/5]\)
and \(\nabla _s f(\mathcal {X}_{\alpha })=\big (S(\mathcal {X}_{\alpha })S(\mathcal {X}_{\alpha })^T\big )^{-1}S(\mathcal {X}_{\alpha })\delta _f(\mathcal {X}_{\alpha })=[1/2]\),
and so, \(\nabla _s f(\mathcal {X}) \ne \nabla _s f(\mathcal {X}_{\alpha })\).
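Example 1 can be replayed numerically; the helper below is a hypothetical illustration using the pseudoinverse form of the simplex gradient.

```python
import numpy as np

def sgrad(points, fvals):
    """(S^T)^dagger * delta_f with the first listed point as reference."""
    p0 = points[0]
    S = np.column_stack([np.atleast_1d(p - p0) for p in points[1:]])
    delta = np.array(fvals[1:], dtype=float) - fvals[0]
    return np.linalg.pinv(S.T) @ delta

# X = <-1, 0, 1> with f-values (2, 1, 3)
g = sgrad([-1.0, 0.0, 1.0], [2.0, 1.0, 3.0])     # -> [1/5]
# X_alpha = <0, 1, -1> with f-values (1, 3, 2): same points, new reference
ga = sgrad([0.0, 1.0, -1.0], [1.0, 3.0, 2.0])    # -> [1/2]
```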
Example 2
Consider \(x_0=[0, 0]^T, x_1=[1, 0]^T, x_2=[0, 1]^T, x_3=[1, 2]^T\) and suppose \(f(x_0)=1, f(x_1)=2, f(x_2)=0, f(x_3)=1\). Let \({\mathcal {X}}=\langle x_0,x_1,x_2,x_3 \rangle \) and consider the permutation of \(\{0,1,2,3\}\) given by \(\alpha = \left( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} 0 &{} 1 &{} 2 &{} 3 \\ 3 &{} 2 &{} 0 &{} 1 \\ \end{array} \right) \). Then \({\mathcal {X}}_{\alpha }=\langle x_3,x_2,x_0,x_1 \rangle \). Note that \(S(\mathcal {X})\) and \(S({\mathcal {X}}_{\alpha })\) have full rank and \(\nabla _s f(\mathcal {X})=\left[ \begin{array}{c} 7/6 \\ -2/3 \\ \end{array} \right] \)
and \(\nabla _s f(\mathcal {X}_{\alpha })=\left[ \begin{array}{c} 4/3 \\ -5/9 \\ \end{array} \right] \),
and so, \(\nabla _s f(\mathcal {X}) \ne \nabla _s f(\mathcal {X}_{\alpha })\).
For convenience, call the first point in the ordered set \(\mathcal {X}\) the reference point. The next proposition shows that, when \(k>d\) (the overdetermined case), the simplex gradient is not affected by changing the order of the sample points that are not the reference point. That is, \(\nabla _s f(\mathcal {X})\) depends only on which point in \(\mathcal {X}\) is used as the reference point and not on the order of the other sample points. Note that although the DFO book by Conn et al. [5] and other papers (e.g., Custódio and Vicente [7]) did not explicitly mention the dependence of the simplex gradient on the reference point, the notation \(\nabla _s f(x_0)\) in the book and in these other papers suggests that the authors were aware of this dependence.
Proposition 3
Suppose \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) is an ordered set of \(k+1\) points in \(\mathbb {R}^d\), with \(k>d\), that contains a proper subset of \(d+1\) affinely independent points. Let \(\alpha \) be a permutation of the indices \(\{0,1,\ldots ,k\}\) such that \(\alpha (0)=0\) and let \({\mathcal {X}}_{\alpha }=\langle x_{\alpha (0)},x_{\alpha (1)},\ldots ,x_{\alpha (k)} \rangle \). Then \(\nabla _s f(\mathcal {X}_{\alpha }) = \nabla _s f(\mathcal {X})\).
Proof
Since \(\alpha (0)=0\), it follows that \(S(\mathcal {X}_{\alpha })=S(\mathcal {X})P\) and \(\delta _f(\mathcal {X}_{\alpha })=P^T\delta _f(\mathcal {X})\)
for some permutation matrix \(P\). Now \(\nabla _s f(\mathcal {X}_{\alpha })=\big (S(\mathcal {X}_{\alpha })S(\mathcal {X}_{\alpha })^T\big )^{-1}S(\mathcal {X}_{\alpha })\delta _f(\mathcal {X}_{\alpha })=\big (S(\mathcal {X})PP^TS(\mathcal {X})^T\big )^{-1}S(\mathcal {X})PP^T\delta _f(\mathcal {X})=\big (S(\mathcal {X})S(\mathcal {X})^T\big )^{-1}S(\mathcal {X})\delta _f(\mathcal {X})=\nabla _s f(\mathcal {X})\).
\(\square \)
4 Simplex gradients and linear interpolation and regression models
Next, this section analyzes the relationship between the simplex gradient and the gradient of the corresponding linear model. When \(\mathcal {X}\) consists of exactly \(d+1\) affinely independent points, the following result mentioned in Conn et al. [5] shows that the simplex gradient \(\nabla _s f(\mathcal {X})\) is the gradient of the unique linear function that interpolates the points in \(\mathcal {X}\) and their function values.
Proposition 4
Let \({\mathcal {X}}=\{ x_0,x_1,\ldots ,x_d \}\) be a set of \(d+1\) affinely independent points in \(\mathbb {R}^d\) (and so \(\mathcal {X}\) is poised for linear interpolation in \(\mathbb {R}^d\)). Then \(\nabla _s f(\mathcal {X})\) is the gradient of the unique linear function that interpolates the data points \(\{(x_0,f(x_0)),(x_1,f(x_1))\), \(\ldots ,(x_d,f(x_d))\}\).
Proof
Let \(m(x)=c_0+c^Tx\), where \(c_0 \in \mathbb {R}\) and \(c \in \mathbb {R}^d\), be the unique linear function that interpolates the data points \(\{(x_0,f(x_0)),(x_1,f(x_1))\), \(\ldots ,(x_d,f(x_d))\}\). Then \(c_0+c^Tx_i=f(x_i)\) for \(i=0,1,\ldots ,d\) (3). Subtracting \(c_0+c^Tx_0=f(x_0)\) from each of the equations in (3) for \(i=1,\ldots ,d\) gives \(c^T(x_i-x_0)=f(x_i)-f(x_0)\) for \(i=1,\ldots ,d\). Hence, \(c\) satisfies \(S(\mathcal {X})^Tc=\delta _f(\mathcal {X})\). Since \(\mathcal {X}\) consists of \(d+1\) affinely independent points, \(S(\mathcal {X})^T\) is nonsingular and \(c=S(\mathcal {X})^{-T}\delta _f(\mathcal {X})=\nabla _s f(\mathcal {X})\).
\(\square \)
Next, consider the overdetermined case. Let \(\mathcal {X}=\langle x_0,x_1,\ldots ,x_k\rangle \) be an ordered set of \(k+1\) distinct points in \(\mathbb {R}^d\) that contains a proper subset of \(d+1\) affinely independent points (and so \(k>d\)) and let \(m(x)=c_0+c^Tx\) be the linear regression model for the data points \(\{(x_0,f(x_0)), (x_1,f(x_1))\), \(\ldots , (x_k,f(x_k))\}\). Recall that the gradient \(\nabla m(x)=c\) is obtained from the least squares solution of \(L(\mathcal {X})\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] = F(\mathcal {X}),\)
while the simplex gradient \(\nabla _s f(\mathcal {X})\) is the least squares solution of \(S(\mathcal {X})^Tz=\delta _f(\mathcal {X})\).
The examples below show that \(\nabla _s f(\mathcal {X})\) is not necessarily equal to \(\nabla m(x)=c\) in the overdetermined case. In fact, Example 4 below shows that none of the simplex gradients \(\nabla _s f(\mathcal {X})\) using all possible reference points have to equal \(\nabla m(x)=c\).
Example 3
Consider \(\mathcal {X}\) and \(F(\mathcal {X})\) from Example 1: \(x_0=-1, x_1=0, x_2=1\) and \(f(x_0)=2, f(x_1)=1, f(x_2)=3\). Then \(\nabla _s f(\mathcal {X}) = [ 1/5 ]\). Now the coefficients of the linear regression model \(m(x)=c_0+c^Tx\) are given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] =(L(\mathcal {X})^TL(\mathcal {X}))^{-1}L(\mathcal {X})^TF(\mathcal {X})=\left[ \begin{array}{c} 2 \\ 1/2 \\ \end{array} \right] \).
Hence, \(\nabla m(x) = c = [ 1/2 ] \ne \nabla _s f(\mathcal {X})\).
Example 4
Consider \(\mathcal {X}\) and \(F(\mathcal {X})\) from Example 2: \(x_0=[0, 0]^T, x_1=[1, 0]^T, x_2=[0, 1]^T, x_3=[1, 2]^T\) and \(f(x_0)=1, f(x_1)=2, f(x_2)=0, f(x_3)=1\). Then \(\nabla _s f(\mathcal {X}) = \left[ \begin{array}{c} 7/6\\ -2/3 \\ \end{array} \right] \). Now the coefficients of the linear regression model \(m(x)=c_0+c^Tx\) are given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] =(L(\mathcal {X})^TL(\mathcal {X}))^{-1}L(\mathcal {X})^TF(\mathcal {X})=\left[ \begin{array}{c} 4/5 \\ 13/10 \\ -3/5 \\ \end{array} \right] \).
Hence, \(\nabla m(x) = c = [13/10, -3/5]^T \ne \nabla _s f(\mathcal {X})\).
By similar calculations, the simplex gradients obtained by using \(x_1, x_2\) and \(x_3\) as reference points are \([11/9, -5/9]^T, [3/2, -2/3]^T\) and \([4/3, -5/9]^T\), respectively. Note that none of these simplex gradients are equal to \(\nabla m(x) = c = [13/10, -3/5]^T\).
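The calculations in Examples 2 and 4 are easy to check numerically (helper names below are mine): the four possible reference points give four different simplex gradients, none of which equals the regression gradient.

```python
import numpy as np

pts = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 2.]])
fv  = np.array([1., 2., 0., 1.])

def sgrad(ref):
    """Simplex gradient using pts[ref] as the reference point."""
    others = [i for i in range(len(pts)) if i != ref]
    S = (pts[others] - pts[ref]).T                 # 2 x 3, full row rank
    delta = fv[others] - fv[ref]
    return np.linalg.pinv(S.T) @ delta

# Gradient c of the least squares regression model m(x) = c0 + c^T x
L = np.hstack([np.ones((4, 1)), pts])
c0, *c = np.linalg.lstsq(L, fv, rcond=None)[0]

grads = [sgrad(r) for r in range(4)]               # one per reference point
```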
The fact that Proposition 4 does not hold for the overdetermined case is not really surprising considering that Examples 1 and 2 and Proposition 3 showed that \(\nabla _s f(\mathcal {X})\) depends on which point in \(\mathcal {X}\) is chosen as the reference point whereas the linear regression model and its gradient are fixed for a given \(\mathcal {X}\) containing a subset of \(d+1\) affinely independent points.
Finally, in the underdetermined case (\(k<d\)), \(\nabla _s f(\mathcal {X})\) is also not the gradient of the linear model whose coefficients are the minimum 2-norm solution to Eq. (1) as can be seen from the following counterexample.
Example 5
Consider \(x_0=[1, 0, 0]^T, x_1=[0, 1, 0]^T, x_2=[0, 0, 1]^T\) and suppose \(f(x_0)=2, f(x_1)=0, f(x_2)=1\). Let \({\mathcal {X}}=\langle x_0,x_1,x_2 \rangle \). Then \(\nabla _s f(\mathcal {X})=S(\mathcal {X})\big (S(\mathcal {X})^TS(\mathcal {X})\big )^{-1}\delta _f(\mathcal {X})=[1,-1,0]^T\).
On the other hand, the minimum 2-norm solution to Eq. (1) is given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] =L(\mathcal {X})^T(L(\mathcal {X})L(\mathcal {X})^T)^{-1}F(\mathcal {X})=[3/4,5/4,-3/4,1/4]^T\).
Again, \(c = [5/4, -3/4, 1/4]^T \ne \nabla _s f(\mathcal {X})\).
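This underdetermined counterexample can also be checked directly (a sketch, with my own variable names):

```python
import numpy as np

x  = np.eye(3)                 # x0, x1, x2: the standard basis vectors of R^3
fv = np.array([2., 0., 1.])

# Underdetermined simplex gradient: minimum 2-norm solution of S^T g = delta_f
S = (x[1:] - x[0]).T                         # 3 x 2, full column rank
delta = fv[1:] - fv[0]                       # [-2, -1]
sg = S @ np.linalg.solve(S.T @ S, delta)

# Minimum 2-norm solution [c0, c] of the interpolation system L(X)[c0; c] = F(X)
L = np.hstack([np.ones((3, 1)), x])          # 3 x 4, full row rank
coef = np.linalg.pinv(L) @ fv
```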
The next proposition shows that, in the underdetermined and determined cases, \(-\nabla _s f(\mathcal {X})\) is a descent direction for any linear function that interpolates the points in \(\mathcal {X}\) and their function values. In particular, although \(\nabla _s f(\mathcal {X})\) is not necessarily equal to the gradient of the linear model corresponding to the minimum 2-norm solution to Eq. (1) in the underdetermined case, this proposition says that \(-\nabla _s f(\mathcal {X})\) is always a descent direction for this linear model.
Proposition 5
Suppose \({\mathcal {X}}=\{ x_0,x_1,\ldots ,x_k \}\) \((k \le d)\) is a set of \(k+1\) affinely independent points in \(\mathbb {R}^d\). If \(f(x_0),f(x_1),\ldots ,f(x_k)\) are not all equal, then \(-\nabla _s f(\mathcal {X})\) is a descent direction for any linear function that interpolates the data points \((x_0,f(x_0)),(x_1,f(x_1)),\ldots ,(x_k,f(x_k))\).
Proof
Let \(g(x)=c_0+c^Tx\) be any linear function that interpolates the data points \((x_0,f(x_0)),(x_1,f(x_1))\), \(\ldots ,(x_k,f(x_k))\), where \(c_0 \in \mathbb {R}\) and \(c \in \mathbb {R}^d\). Then \(c_0+c^Tx_i=f(x_i)\) for \(i=0,1,\ldots ,k\), and so, \(c^T(x_i-x_0)=f(x_i)-f(x_0)\) for \(i=1,\ldots ,k\). Hence, \(c^TS({\mathcal {X}})=\delta _f({\mathcal {X}})^T\). Now for any \(x \in \mathbb {R}^d\), \(\nabla g(x)^T\nabla _s f({\mathcal {X}})=c^TS({\mathcal {X}})\big (S({\mathcal {X}})^TS({\mathcal {X}})\big )^{-1}\delta _f({\mathcal {X}})=\delta _f({\mathcal {X}})^T\big (S({\mathcal {X}})^TS({\mathcal {X}})\big )^{-1}\delta _f({\mathcal {X}})\).
Since \(S({\mathcal {X}})\) has full column rank, it follows that \(S({\mathcal {X}})^T S({\mathcal {X}})\) and its inverse are both symmetric and positive definite. Moreover, since not all the \(f(x_i)\)’s are equal, it follows that \(\delta _f({\mathcal {X}})\) is not the zero vector. Hence, \(\nabla g(x)^T(-\nabla _s f({\mathcal {X}}))<0\) for any \(x \in \mathbb {R}^d\), and so, \(-\nabla _s f({\mathcal {X}})\) is a descent direction for \(g(x)=c_0+c^Tx\) from any point \(x \in \mathbb {R}^d\). \(\square \)
On the other hand, in the overdetermined case, the following example shows that \(-\nabla _s f({\mathcal {X}})\) is not always a descent direction for the corresponding linear regression model.
Example 6
Consider the ordered set of sample points and their function values: \(\mathcal {X}= \langle x_0, x_1, x_2 \rangle = \langle 0, 1, 2\rangle \) and \(f(x_0)=2, f(x_1)=1, f(x_2)=9/4\). Then the coefficients of the linear regression model \(m(x)=c_0+c^Tx\) are given by \(\left[ \begin{array}{c} c_0 \\ c \\ \end{array} \right] =(L(\mathcal {X})^TL(\mathcal {X}))^{-1}L(\mathcal {X})^TF(\mathcal {X})=\left[ \begin{array}{c} 13/8 \\ 1/8 \\ \end{array} \right] \).
Hence, \(\nabla m(x) = c = [ 1/8]\). The simplex gradient is given by \(\nabla _s f(\mathcal {X})=\big (S(\mathcal {X})S(\mathcal {X})^T\big )^{-1}S(\mathcal {X})\delta _f(\mathcal {X})=[-1/10]\).
Note that \(\nabla m(x)^T (-\nabla _s f(\mathcal {X})) = [1/8]^T[1/10] = 1/80 > 0\) for any \(x \in \mathbb {R}\), and so, \(-\nabla _s f(\mathcal {X})\) is not a descent direction for \(m(x)\) from any \(x \in \mathbb {R}\).
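Example 6 can be confirmed with a short computation (variable names are my own):

```python
import numpy as np

pts = np.array([0., 1., 2.])
fv  = np.array([2., 1., 9/4])

# Linear regression model m(x) = c0 + c*x
L = np.column_stack([np.ones(3), pts])
c0, c = np.linalg.lstsq(L, fv, rcond=None)[0]

# Simplex gradient with reference point x0 = 0
S = np.atleast_2d(pts[1:] - pts[0])              # 1 x 2
sg = np.linalg.pinv(S.T) @ (fv[1:] - fv[0])

# Slope of m along -sg: a positive value means -sg is NOT a descent direction
slope = c * (-sg[0])
```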
Examples 3, 4 and 5 above showed that the simplex gradient \(\nabla _s f(\mathcal {X})\) is not necessarily the gradient of the corresponding linear model \(m(x)=c_0+c^Tx\) for the data points \(\mathcal {X}=\langle x_0,x_1,\ldots ,x_k\rangle \) in the underdetermined and overdetermined cases. In particular, Examples 3 and 4 correct the statement on page 33 of the DFO book by Conn et al. [5] where it is stated that, in the overdetermined case, \(\nabla _s f(\mathcal {X})\) is also the gradient of the linear regression model for the data points \(\{(x_0,f(x_0)),(x_1,f(x_1)), \ldots ,(x_k,f(x_k))\}\). That statement in the DFO book [5] has also been rephrased in the errata for Theorem 2.13 of the book where it is stated that the simplex gradient is the gradient of the following alternative linear model: \(m(x)=f(x_0)+c^T(x-x_0)\) (6),
where \(c=[c_1,\ldots ,c_d]^T\) are the coefficients to be determined. One main difference between the original linear model \(m(x)=c_0+c^Tx\) and the above alternative linear model is that the original model has \(d+1\) coefficients to be determined (\(c_0\) and the components of \(c\)) while the alternative model has only \(d\) coefficients. Moreover, the alternative linear model is required to interpolate the reference data point \((x_0,f(x_0))\). The coefficients of the linear model (6) are obtained by finding a minimum norm least squares solution to the following linear system: \(c^T(x_i-x_0)=f(x_i)-f(x_0)\), \(i=1,\ldots ,k\).
This linear system is equivalent to \(S(\mathcal {X})^Tc=\delta _f(\mathcal {X})\),
which is precisely the linear system that yields the simplex gradient \(\nabla _s f(\mathcal {X})\). Thus, the gradient of the alternative linear model \(m(x)=f(x_0)+c^T(x-x_0)\) is indeed the simplex gradient \(\nabla _s f(\mathcal {X})\).
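Using the data of Example 1, one can check that solving \(S(\mathcal {X})^Tc=\delta _f(\mathcal {X})\) in the least squares sense yields the simplex gradient, and that the resulting alternative model interpolates the reference data point (a sketch, with my own variable names):

```python
import numpy as np

# Data of Example 1: x0 = -1, x1 = 0, x2 = 1 with f-values 2, 1, 3
pts = np.array([-1., 0., 1.])
fv  = np.array([2., 1., 3.])

S = np.atleast_2d(pts[1:] - pts[0])              # 1 x 2
sg = np.linalg.pinv(S.T) @ (fv[1:] - fv[0])      # least squares sol. of S^T c = delta_f

# Alternative model m(x) = f(x0) + c*(x - x0) with c equal to the simplex gradient
m = lambda t: fv[0] + sg[0] * (t - pts[0])
```

By construction \(m(x_0)=f(x_0)\), which is exactly the interpolation requirement at the reference data point.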
5 Error bounds for linear interpolation and regression models
Throughout this section, let \({\mathcal {X}}=\{ x_0,x_1,\ldots ,x_k\}\) be a set of sample points in \(\mathbb {R}^d\) and assume that \(f\) is continuously differentiable in an open domain \(\Omega \) containing the closed ball \(B(x_0,\Delta )=\{ x \in \mathbb {R}^d\ :\ \Vert x-x_0\Vert \le \Delta \}\), where \(\Delta =\max _{1 \le i \le k}\Vert x_i-x_0\Vert \). Further, assume that \(\nabla f\) is Lipschitz continuous in \(\Omega \) with constant \(\nu >0\).
The following result from Conn et al. [5] (Theorem 2.11 in [5]) provides an error bound for the gradient of the linear interpolation model \(m(x)=c_0+c^Tx\) in the determined case. Since \(\nabla m(x)\) is identical to the simplex gradient \(\nabla _s f(\mathcal {X})\) in the determined case, Proposition 6 also provides an error bound for the simplex gradient.
Proposition 6
Assume that the set \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_d\rangle \subset \mathbb {R}^d\) is poised for linear interpolation in \(\mathbb {R}^d\) and suppose that the conditions mentioned above hold. Then the gradient of the linear interpolation model satisfies, for all points \(x \in B(x_0,\Delta )\), an error bound of the form \(\Vert \nabla f(x)-\nabla m(x)\Vert \le \kappa _{eg}\Delta ,\)
where \(\kappa _{eg}=\nu (1+d^{1/2}\Vert \widehat{S}(\mathcal {X})^{-T}\Vert /2)\) and \(\widehat{S}(\mathcal {X})=S(\mathcal {X})/\Delta \).
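The bound in Proposition 6 can be checked on a concrete smooth function; here \(f(x)=\Vert x\Vert ^2\), whose gradient \(2x\) is Lipschitz with \(\nu =2\) (the sample points below are my own choice, not from the article):

```python
import numpy as np

f  = lambda x: float(x @ x)    # f(x) = ||x||^2, grad f(x) = 2x, nu = 2
nu = 2.0

x0 = np.array([0.3, -0.2])
pts = [x0, x0 + np.array([0.1, 0.0]), x0 + np.array([0.0, 0.1])]
fv  = [f(p) for p in pts]

S = np.column_stack([p - x0 for p in pts[1:]])        # 2 x 2, determined case
delta_vec = np.array(fv[1:]) - fv[0]
sg = np.linalg.solve(S.T, delta_vec)                  # simplex gradient = grad of m
Delta = max(np.linalg.norm(p - x0) for p in pts[1:])  # here 0.1
S_hat = S / Delta
kappa = nu * (1.0 + np.sqrt(2) * np.linalg.norm(np.linalg.inv(S_hat.T), 2) / 2)

err = np.linalg.norm(2 * x0 - sg)    # || grad f(x0) - grad m(x0) ||
```

For this quadratic the simplex gradient works out to \(2x_0+[h,h]^T\) with \(h=0.1\), so the error is well within \(\kappa _{eg}\Delta \).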
The following proposition extends Theorems 2.11 and 2.13 in Conn et al. [5] (with the correction from the errata for the book). This result was essentially established in [3, 4], and [8] but the statement of the proposition was derived from Custódio and Vicente [7] and Custódio et al. [6]. This proposition provides an error bound for the gradient of the alternative linear model \(m(x)=f(x_0)+c^T(x-x_0)\), which is also the simplex gradient \(\nabla _s f(\mathcal {X})\). While Theorems 2.11 and 2.13 in [5] cover the determined and overdetermined cases, respectively, this proposition covers all three cases. The proof uses the same arguments as in the proofs of Theorems 2.11 and 2.13 in [5] and is included for completeness.
Proposition 7
Let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) points in \(\mathbb {R}^d\) such that \(S(\mathcal {X})\) has full rank and assume that the conditions mentioned above hold. Consider the alternative minimum norm least squares linear model \(m(x)=f(x_0)+c^T (x-x_0)\) for the data points \(\{(x_0,f(x_0)),(x_1,f(x_1))\), \(\ldots ,(x_k,f(x_k))\}\) so that \(\nabla m(x)=c=\nabla _s f(\mathcal {X})\). Then \(\Vert \widehat{S}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \frac{k^{1/2}\nu }{2}\Delta ,\)
where \(\widehat{S}(\mathcal {X})=S(\mathcal {X})/\Delta \). Moreover, when \(k \ge d\) (determined and overdetermined cases), \(\Vert \nabla f(x_0)-\nabla _s f(\mathcal {X})\Vert \le \Vert (\widehat{S}(\mathcal {X})^T)^{\dagger }\Vert \frac{k^{1/2}\nu }{2}\Delta \)
and the gradient of this alternative linear model satisfies, for all points \(x \in B(x_0,\Delta )\), an error bound of the form \(\Vert \nabla f(x)-\nabla m(x)\Vert \le \kappa _{eg}\Delta ,\)
where \(\kappa _{eg}=\nu (1+k^{1/2}\Vert (\widehat{S}(\mathcal {X})^T)^{\dagger }\Vert /2)\).
Proof
Define the vector \(r(\mathcal {X})=[r_1(\mathcal {X}),\ldots ,r_k(\mathcal {X})]^T \in \mathbb {R}^k\) by \(r_i(\mathcal {X})=f(x_i)-f(x_0)-\nabla f(x_0)^T(x_i-x_0)\) for \(i=1,\ldots ,k\).
By the integral form of the mean value theorem, \(f(x_i)-f(x_0)=\int _0^1 \nabla f(x_0+t(x_i-x_0))^T(x_i-x_0)\,dt\).
From this equation, it follows that for \(i=1,\ldots ,k\), \(|r_i(\mathcal {X})| \le \int _0^1 \Vert \nabla f(x_0+t(x_i-x_0))-\nabla f(x_0)\Vert \,\Vert x_i-x_0\Vert \,dt \le \frac{\nu }{2}\Vert x_i-x_0\Vert ^2 \le \frac{\nu }{2}\Delta ^2\).
Hence, \(\Vert r(\mathcal {X})\Vert \le \frac{k^{1/2}\nu }{2}\Delta ^2\).
Now \(\delta _f(\mathcal {X})=S(\mathcal {X})^T\nabla f(x_0)+r(\mathcal {X})\) and \(\nabla _s f(\mathcal {X})=(S(\mathcal {X})^T)^{\dagger }\delta _f(\mathcal {X})\),
and so, \(\widehat{S}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))=-\frac{1}{\Delta }S(\mathcal {X})^T(S(\mathcal {X})^T)^{\dagger }r(\mathcal {X})\). Since \(S(\mathcal {X})^T(S(\mathcal {X})^T)^{\dagger }\) is an orthogonal projection, \(\Vert \widehat{S}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \Vert r(\mathcal {X})\Vert /\Delta \le \frac{k^{1/2}\nu }{2}\Delta \).
When \(k \ge d\), \(\widehat{S}(\mathcal {X})^T\) has full column rank, and so, \((\widehat{S}(\mathcal {X})^T)^{\dagger }\) is a left inverse of \(\widehat{S}(\mathcal {X})^T\). In this case, \(\Vert \nabla f(x_0)-\nabla _s f(\mathcal {X})\Vert =\Vert (\widehat{S}(\mathcal {X})^T)^{\dagger }\widehat{S}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \Vert (\widehat{S}(\mathcal {X})^T)^{\dagger }\Vert \frac{k^{1/2}\nu }{2}\Delta \).
Finally, when \(k \ge d\), note that for all \(x \in B(x_0,\Delta )\), \(\Vert \nabla f(x)-\nabla m(x)\Vert \le \Vert \nabla f(x)-\nabla f(x_0)\Vert +\Vert \nabla f(x_0)-\nabla _s f(\mathcal {X})\Vert \le \nu \Delta +\Vert (\widehat{S}(\mathcal {X})^T)^{\dagger }\Vert \frac{k^{1/2}\nu }{2}\Delta =\kappa _{eg}\Delta \).
\(\square \)
The corollary below provides the error bound for the simplex gradient as stated in Custódio and Vicente [7] and Custódio et al. [6]. Note that this is essentially the first part of Proposition 7 stated in terms of the reduced SVD of \(\widehat{S}(\mathcal {X})^T=S(\mathcal {X})^T\!\!/\Delta \). Moreover, as in the first part of Proposition 7, the gradient \(\nabla f\) is evaluated at \(x_0\) and not just at any \(x \in B(x_0,\Delta )\). An error bound involving \(\nabla f(x)\) for all \(x \in B(x_0,\Delta )\) that is similar to the one in Proposition 7 can be obtained when \(k \ge d\).
Corollary 1
Let \({\mathcal {X}}=\langle x_0,x_1,\ldots ,x_k \rangle \) be an ordered set of \(k+1\) points in \(\mathbb {R}^d\) such that \(S(\mathcal {X})\) has full rank. Moreover, assume that the conditions mentioned above hold. Further, let \(\widehat{U}(\mathcal {X})\widehat{\Sigma }(\mathcal {X}) \widehat{V}(\mathcal {X})^T\) be a reduced SVD of \(\widehat{S}(\mathcal {X})^T=S(\mathcal {X})^T{/}\Delta \). Then \(\Vert \widetilde{V}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \Vert \widehat{\Sigma }(\mathcal {X})^{-1}\Vert \frac{k^{1/2}\nu }{2}\Delta ,\)
where \(\widetilde{V}(\mathcal {X})=I_d\) if \(k \ge d\) and \(\widetilde{V}(\mathcal {X})=\widehat{V}(\mathcal {X})\) if \(k < d\).
Proof
From the previous proposition, \(\Vert \widehat{U}(\mathcal {X})\widehat{\Sigma }(\mathcal {X})\widehat{V}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \frac{k^{1/2}\nu }{2}\Delta \).
Since the columns of \(\widehat{U}(\mathcal {X})\) are orthonormal, it follows that \(\Vert \widehat{\Sigma }(\mathcal {X})\widehat{V}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \frac{k^{1/2}\nu }{2}\Delta \).
Moreover, since \(S(\mathcal {X})\) has full rank, it follows that \(\widehat{\Sigma }(\mathcal {X})\) is nonsingular, and so, \(\Vert \widehat{V}(\mathcal {X})^T(\nabla f(x_0)-\nabla _s f(\mathcal {X}))\Vert \le \Vert \widehat{\Sigma }(\mathcal {X})^{-1}\Vert \frac{k^{1/2}\nu }{2}\Delta \).
When \(k \ge d\), \(\widehat{V}(\mathcal {X})^T\) is an orthogonal matrix, and so, \(\Vert \nabla f(x_0)-\nabla _s f(\mathcal {X})\Vert \le \Vert \widehat{\Sigma }(\mathcal {X})^{-1}\Vert \frac{k^{1/2}\nu }{2}\Delta \).
\(\square \)
6 The simplex gradient as a linear operator and calculus rules
The purpose of this section is to explore what calculus rules for ordinary gradients also hold for simplex gradients. It begins with a convenient way of defining the simplex gradients of a function. As before, \(f\) and \(g\) are functions from \(\mathbb {R}^d\) to \(\mathbb {R}\).
Definition 3
Let \(x_0 \in \mathbb {R}^d\) and let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank, i.e., \(\text{ rank }(Y)=\min \{d,k\}\). The simplex gradient of \(f\) at \(x_0\) with respect to the matrix \(Y\), denoted by \(\nabla _Y f(x_0)\), is the minimum 2-norm least-squares solution to the system \(Y^Tz=\delta _{f,Y}(x_0)\), with \(\delta _{f,Y}(x_0)=[f(x_0+y_1)-f(x_0),\ldots ,f(x_0+y_k)-f(x_0)]^T\),
where \(y_1,\ldots ,y_k\) are the columns of \(Y\). This solution can also be expressed as \(\nabla _Y f(x_0)=(Y^{\dagger })^T\delta _{f,Y}(x_0)\),
where \(Y^{\dagger }\) is the Moore–Penrose pseudoinverse of \(Y\).
In the previous definition, \(\nabla _Y f(x_0)=Y^{-T}\delta _{f,Y}(x_0)\) when \(k=d\), \(\nabla _Y f(x_0)=Y(Y^TY)^{-1}\delta _{f,Y}(x_0)\) when \(k<d\), and \(\nabla _Y f(x_0)=(YY^T)^{-1}Y\delta _{f,Y}(x_0)\) when \(k>d\). Note that \(\nabla _Y f(x_0) = \nabla _s f(\langle x_0,x_0+y_1,\ldots ,x_0+y_k \rangle )\), where \(y_1,\ldots ,y_k\) are the columns of \(Y\). Moreover, if \(Y=hI_d\), where \(h\) is a positive constant, then \(\displaystyle {\nabla _Y f(x_0)=\frac{1}{h}\delta _{f,Y}(x_0)}\) is simply the finite-difference gradient of \(f\) at \(x_0\) with fixed step size \(h\).
Example 7
Suppose \(f(x)\) is a linear function, say \(f(x)=c_0+c^Tx\), where \(c_0 \in \mathbb {R}\) and \(c \in \mathbb {R}^d\), and \(Y \in \mathbb {R}^{d \times k}\) has full rank. Then, for any \(x \in \mathbb {R}^d\), \(\nabla _Y f(x) = (Y^{\dagger })^T \delta _{f,Y}(x) = (Y^{\dagger })^T Y^T c = (Y Y^{\dagger })^T c\). Furthermore, if \(Y\) is nonsingular, then \(\nabla _Y f(x) =c\) for any \(x \in \mathbb {R}^d\).
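Example 7 admits a quick numerical confirmation. In the sketch below (test data and seed are illustrative), a nonsingular \(Y\) is drawn at random and the simplex gradient of a linear \(f(x)=c_0+c^Tx\) recovers \(c\) up to rounding.

```python
# Sketch verifying Example 7: for linear f, the simplex gradient
# with nonsingular Y recovers the coefficient vector c exactly.
import numpy as np

rng = np.random.default_rng(0)
d, c0 = 4, 1.5
c = rng.standard_normal(d)
f = lambda x: c0 + c @ x

x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, d))          # nonsingular with probability 1

delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
g = np.linalg.pinv(Y).T @ delta          # (Y^dagger)^T delta_{f,Y}(x0)
err_linear = np.linalg.norm(g - c)
```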
The following proposition is an immediate consequence of Propositions 2 and 3.
Proposition 8
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any permutation matrix \(P \in \mathbb {R}^{k \times k}\) and any \(x \in \mathbb {R}^d\),
\[
\nabla _{YP} f(x) = \nabla _Y f(x).
\]
Next, let \(y_0=0\) and let \(y_1,\ldots ,y_k\) be the columns of \(Y\). Furthermore, let \(\alpha \) be a permutation of the indices \(\{0,1,\ldots ,k\}\) and define \(Y_{\alpha } \in \mathbb {R}^{d \times k}\) to be the matrix whose columns are \(y_{\alpha (1)}-y_{\alpha (0)},\ldots ,y_{\alpha (k)}-y_{\alpha (0)}\). When \(k \le d\) (underdetermined and determined cases only), the simplex gradient has the more general property that for any \(x \in \mathbb {R}^d\),
\[
\nabla _{Y_{\alpha }} f\big ( x+y_{\alpha (0)} \big ) = \nabla _Y f(x).
\]
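Both invariance properties of Proposition 8 can be observed numerically in the underdetermined case. The sketch below (helper `sgrad` and test function are mine, not from the article) permutes the columns of \(Y\) and also swaps the reference point \(x\) with the sample point \(x+y_1\); the simplex gradient is unchanged in both cases.

```python
# Sketch: permutation invariance of the simplex gradient (k < d).
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(1)
d, k = 5, 3                               # underdetermined case
f = lambda x: np.sin(x).sum() + x @ x
x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, k))

# Permuting the columns of Y leaves the simplex gradient unchanged.
P = np.eye(k)[rng.permutation(k)]
err_perm = np.linalg.norm(sgrad(f, x0, Y @ P) - sgrad(f, x0, Y))

# Re-choosing the reference point (swap x0 with x0 + y_1) also leaves
# the simplex gradient unchanged when k <= d.
y1 = Y[:, 0]
Y_alpha = np.column_stack([-y1, Y[:, 1:] - y1[:, None]])
err_ref = np.linalg.norm(sgrad(f, x0 + y1, Y_alpha) - sgrad(f, x0, Y))
```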
The next proposition shows that the simplex gradient is a linear operator on the space of functions from \(\mathbb {R}^d\) to \(\mathbb {R}\).
Proposition 9
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) and for any constant \(c\),
(a) \(\nabla _Y (f+g)(x) = \nabla _Y f(x) + \nabla _Y g(x)\), and

(b) \(\nabla _Y (cf)(x) = c\,\nabla _Y f(x)\).
Proof
For any \(x \in \mathbb {R}^d\),
\[
\nabla _Y (f+g)(x) = (Y^{\dagger })^T \delta _{f+g,Y}(x) = (Y^{\dagger })^T \big ( \delta _{f,Y}(x) + \delta _{g,Y}(x) \big ) = \nabla _Y f(x) + \nabla _Y g(x).
\]
Moreover, for any \(x \in \mathbb {R}^d\) and any constant \(c\),
\[
\nabla _Y (cf)(x) = (Y^{\dagger })^T \delta _{cf,Y}(x) = c\,(Y^{\dagger })^T \delta _{f,Y}(x) = c\,\nabla _Y f(x).
\]
\(\square \)
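Linearity is also easy to confirm numerically. The following sketch (helper `sgrad` and the test functions are illustrative, not from the article) checks both parts of Proposition 9 on random data.

```python
# Sketch: the map f -> simplex gradient of f is linear (Proposition 9).
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(6)
d, k, c = 4, 6, 2.5
f = lambda x: np.sin(x).sum()
g = lambda x: x @ x
x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, k))

err_sum = np.linalg.norm(sgrad(lambda x: f(x) + g(x), x0, Y)
                         - (sgrad(f, x0, Y) + sgrad(g, x0, Y)))
err_scale = np.linalg.norm(sgrad(lambda x: c * f(x), x0, Y)
                           - c * sgrad(f, x0, Y))
```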
The next proposition provides a product rule for simplex gradients. For convenience, \(\text{ diag }(a_1,\ldots ,a_k)\) denotes a diagonal matrix whose diagonal entries are \(a_1,\ldots ,a_k\).
Proposition 10
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\),
\[
\nabla _Y (fg)(x) = (Y^{\dagger })^T\, \text{ diag }\big ( f(x+y_1),\ldots ,f(x+y_k) \big )\, \delta _{g,Y}(x) + g(x)\, \nabla _Y f(x).
\]
Proof
For any \(x \in \mathbb {R}^d\), the \(i\)th component of \(\delta _{fg,Y}(x)\) is
\[
f(x+y_i)g(x+y_i) - f(x)g(x) = f(x+y_i)\big ( g(x+y_i)-g(x) \big ) + g(x)\big ( f(x+y_i)-f(x) \big ).
\]
Since multiplying a vector componentwise by \(\big ( f(x+y_1),\ldots ,f(x+y_k) \big )\) is the same as applying \(\text{ diag }\big ( f(x+y_1),\ldots ,f(x+y_k) \big )\), it follows that
\[
\delta _{fg,Y}(x) = \text{ diag }\big ( f(x+y_1),\ldots ,f(x+y_k) \big )\, \delta _{g,Y}(x) + g(x)\, \delta _{f,Y}(x).
\]
Applying \((Y^{\dagger })^T\) to both sides gives the result.
\(\square \)
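As a sanity check, the sketch below numerically verifies the product rule in the matrix form \(\nabla _Y (fg)(x) = (Y^{\dagger })^T \text{ diag }(f(x+y_1),\ldots ,f(x+y_k))\, \delta _{g,Y}(x) + g(x)\, \nabla _Y f(x)\); the helper `sgrad` and the test functions are illustrative choices, not from the article.

```python
# Sketch: numerical check of the product rule for simplex gradients.
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(2)
d, k = 3, 5                               # overdetermined case
f = lambda x: x @ x + 1.0
g = lambda x: np.exp(0.1 * x).sum()
x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, k))

lhs = sgrad(lambda x: f(x) * g(x), x0, Y)
D_f = np.diag([f(x0 + y) for y in Y.T])   # diag(f(x0+y_1),...,f(x0+y_k))
delta_g = np.array([g(x0 + y) - g(x0) for y in Y.T])
rhs = np.linalg.pinv(Y).T @ (D_f @ delta_g) + g(x0) * sgrad(f, x0, Y)
err_product = np.linalg.norm(lhs - rhs)
```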
The next proposition provides a quotient rule for simplex gradients.
Proposition 11
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) for which \(g(x),g(x+y_1),\ldots ,g(x+y_k)\) are all nonzero,
\[
\nabla _Y \left( \frac{f}{g}\right) (x) = \frac{1}{g(x)} \left[ \nabla _Y f(x) - (Y^{\dagger })^T\, \text{ diag }\left( \frac{f(x+y_1)}{g(x+y_1)},\ldots ,\frac{f(x+y_k)}{g(x+y_k)}\right) \delta _{g,Y}(x) \right].
\]
Proof
By the previous proposition applied to the product \(f = \left( \frac{f}{g}\right) \cdot g\),
\[
\nabla _Y f(x) = (Y^{\dagger })^T\, \text{ diag }\left( \frac{f(x+y_1)}{g(x+y_1)},\ldots ,\frac{f(x+y_k)}{g(x+y_k)}\right) \delta _{g,Y}(x) + g(x)\, \nabla _Y \left( \frac{f}{g}\right) (x).
\]
Solving for \(\nabla _Y \left( \frac{f}{g}\right) (x)\) gives the result.
\(\square \)
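The quotient rule can likewise be checked numerically in the form \(\nabla _Y (f/g)(x) = \frac{1}{g(x)}\big [ \nabla _Y f(x) - (Y^{\dagger })^T \text{ diag }\big ( \frac{f(x+y_i)}{g(x+y_i)}\big ) \delta _{g,Y}(x) \big ]\); the helper `sgrad` and the sample functions below are illustrative, not from the article.

```python
# Sketch: numerical check of the quotient rule for simplex gradients.
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(3)
d, k = 4, 4                               # determined case
f = lambda x: 1.0 + x @ x
g = lambda x: 2.0 + np.cos(x).sum()       # bounded away from zero near x0
x0 = rng.standard_normal(d) * 0.1
Y = rng.standard_normal((d, k)) * 0.1

lhs = sgrad(lambda x: f(x) / g(x), x0, Y)
ratios = np.array([f(x0 + y) / g(x0 + y) for y in Y.T])
delta_g = np.array([g(x0 + y) - g(x0) for y in Y.T])
rhs = (sgrad(f, x0, Y) - np.linalg.pinv(Y).T @ (ratios * delta_g)) / g(x0)
err_quotient = np.linalg.norm(lhs - rhs)
```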
Corollary 2
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) for which \(f(x),f(x+y_1),\ldots ,f(x+y_k)\) are all nonzero,
\[
\nabla _Y \left( \frac{1}{f}\right) (x) = -\frac{1}{f(x)}\, (Y^{\dagger })^T\, \text{ diag }\left( \frac{1}{f(x+y_1)},\ldots ,\frac{1}{f(x+y_k)}\right) \delta _{f,Y}(x).
\]
There does not seem to be a general chain rule for simplex gradients. However, it is possible to derive a version of the power rule for simplex gradients as shown next.
Proposition 12
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) and for any positive integer \(n\),
\[
\nabla _Y (f^n)(x) = \sum _{i=0}^{n-1} f(x)^{n-1-i}\, (Y^{\dagger })^T\, \text{ diag }\big ( f(x+y_1)^i,\ldots ,f(x+y_k)^i \big )\, \delta _{f,Y}(x).
\]
Proof
Proceed by induction on \(n\). The equation is obviously true for \(n=1\). Next, assume that the equation is true for \(n=\ell \) for some integer \(\ell \ge 1\). Applying the product rule to \(f^{\ell +1} = f^{\ell } \cdot f\),
\[
\nabla _Y (f^{\ell +1})(x) = (Y^{\dagger })^T\, \text{ diag }\big ( f(x+y_1)^{\ell },\ldots ,f(x+y_k)^{\ell } \big )\, \delta _{f,Y}(x) + f(x)\, \nabla _Y (f^{\ell })(x).
\]
By the induction hypothesis, the second term equals \(\sum _{i=0}^{\ell -1} f(x)^{\ell -i}\, (Y^{\dagger })^T\, \text{ diag }\big ( f(x+y_1)^i,\ldots ,f(x+y_k)^i \big )\, \delta _{f,Y}(x)\), while the first term is the \(i=\ell \) summand of the claimed formula. Hence, the equation is also true for \(n=\ell +1\) and the induction is complete. \(\square \)
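A numerical check of the power rule in the form \(\nabla _Y (f^n)(x) = \sum _{i=0}^{n-1} f(x)^{n-1-i} (Y^{\dagger })^T \text{ diag }\big ( f(x+y_j)^i \big ) \delta _{f,Y}(x)\) follows; the helper `sgrad` and the test function are illustrative, not from the article.

```python
# Sketch: numerical check of the power rule for simplex gradients (n = 4).
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(4)
d, k, n = 3, 2, 4                         # underdetermined case
f = lambda x: 1.0 + x @ x
x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, k))

lhs = sgrad(lambda x: f(x)**n, x0, Y)
fvals = np.array([f(x0 + y) for y in Y.T])
delta_f = fvals - f(x0)
rhs = sum(f(x0)**(n - 1 - i) * (np.linalg.pinv(Y).T @ (fvals**i * delta_f))
          for i in range(n))
err_power = np.linalg.norm(lhs - rhs)
```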
Corollary 3
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) for which \(f(x),f(x+y_1),\ldots ,f(x+y_k)\) are all nonzero and any positive integer \(n\),
\[
\nabla _Y \left( \frac{1}{f^n}\right) (x) = -\sum _{i=0}^{n-1} \frac{1}{f(x)^{i+1}}\, (Y^{\dagger })^T\, \text{ diag }\left( \frac{1}{f(x+y_1)^{n-i}},\ldots ,\frac{1}{f(x+y_k)^{n-i}}\right) \delta _{f,Y}(x).
\]
Proof
By Corollary 2 and Proposition 12,
\(\square \)
Also, there does not seem to be a chain rule for function composition involving exponential functions. However, the following proposition gives a rule for the simplex gradient of an exponential function.
Proposition 13
Let \(Y \in \mathbb {R}^{d \times k}\) be a matrix with full rank. Then for any \(x \in \mathbb {R}^d\) and for any positive integer \(n\),
\[
\nabla _Y \big ( n^{f} \big )(x) = n^{f(x)}\, (Y^{\dagger })^T \big ( n^{\delta _{f,Y}(x)} - 1_{k \times 1} \big ),
\]
where \(1_{k \times 1}\) is a vector of all 1’s and the exponentiation is taken componentwise.
Proof
For each \(i\), the \(i\)th component of \(\delta _{n^{f},Y}(x)\) is
\[
n^{f(x+y_i)} - n^{f(x)} = n^{f(x)} \big ( n^{f(x+y_i)-f(x)} - 1 \big ),
\]
and so, \(\delta _{n^{f},Y}(x) = n^{f(x)} \big ( n^{\delta _{f,Y}(x)} - 1_{k \times 1} \big )\). Applying \((Y^{\dagger })^T\) to both sides gives the result. \(\square \)
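The exponential rule can also be confirmed numerically in the form \(\nabla _Y (n^f)(x) = n^{f(x)} (Y^{\dagger })^T ( n^{\delta _{f,Y}(x)} - 1_{k \times 1} )\), with componentwise exponentiation; the helper `sgrad` and the test function below are illustrative, not from the article.

```python
# Sketch: numerical check of the exponential rule with base n = 2.
import numpy as np

def sgrad(f, x0, Y):
    delta = np.array([f(x0 + y) - f(x0) for y in Y.T])
    return np.linalg.pinv(Y).T @ delta

rng = np.random.default_rng(5)
d, k, n = 3, 3, 2                         # determined case, base n = 2
f = lambda x: np.tanh(x).sum()
x0 = rng.standard_normal(d)
Y = rng.standard_normal((d, k))

lhs = sgrad(lambda x: float(n)**f(x), x0, Y)
delta_f = np.array([f(x0 + y) - f(x0) for y in Y.T])
# n^{delta_f} - 1 is taken componentwise, as in the statement.
rhs = float(n)**f(x0) * (np.linalg.pinv(Y).T @ (float(n)**delta_f - 1.0))
err_exp = np.linalg.norm(lhs - rhs)
```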
7 Summary and conclusions
This article clarified some properties of simplex gradients that were previously not made explicit in the literature. In particular, the simplex gradient was shown to be independent of the order of the points in the underdetermined and determined cases, while in the overdetermined case it depends only on which point is used as the reference point. Moreover, although the simplex gradient and the gradient of the corresponding linear model are equal in the determined case, this is not necessarily true in the underdetermined and overdetermined cases. However, the simplex gradient turns out to be the gradient of an alternative linear model that has one fewer coefficient than the original model and that is required to interpolate the reference data point. The negative of the simplex gradient was also shown to be a descent direction for any interpolating linear function in the determined and underdetermined cases, but this property does not hold in general in the overdetermined case. In addition, a previously established error bound for simplex gradients was reviewed. Finally, calculus rules for simplex gradients, similar to those for ordinary gradients, were proved.
References
Audet, C., Dennis Jr, J.E.: Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim. 17(2), 188–217 (2006)
Bortz, D.M., Kelley, C.T.: The simplex gradient and noisy optimization problems. In: Borggaard, J., et al. (eds.) Computational Methods for Optimal Design and Control, Progress in Systems and Control Theory, vol. 24, pp. 77–90. Springer, New York (1998)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of interpolation sets in derivative free optimization. Math. Progr. 111(1–2), 141–172 (2008a)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of sample sets in derivative-free optimization: polynomial regression and underdetermined interpolation. IMA J. Numer. Anal. 28(4), 721–748 (2008b)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. MOS/SIAM book series on optimization. SIAM, Philadelphia (2009)
Custódio, A.L., Dennis Jr, J.E., Vicente, L.N.: Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal. 28(4), 770–784 (2008)
Custódio, A.L., Vicente, L.N.: Using sampling and simplex derivatives in pattern search methods. SIAM J. Optim. 18(2), 537–555 (2007)
Kelley, C.T.: Iterative Methods for Optimization. SIAM, Philadelphia (1999)
Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20(1), 172–191 (2009)
Regis, R.G.: An initialization strategy for high-dimensional surrogate-based expensive black-box optimization. In: Zuluaga, L.F., Terlaky, T. (eds.) Selected Contributions from the MOPTA 2012 Conference Series. Springer Proceedings in Mathematics and Statistics, vol 62, pp. 51–85 (2013)
Scheinberg, K., Toint, P.L.: Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization. SIAM J. Optim. 20(6), 3512–3532 (2010)
Torczon, V.: On the convergence of pattern search algorithms. SIAM J. Optim. 7(1), 1–25 (1997)
Acknowledgments
The author would like to thank the three anonymous referees. Their comments and suggestions greatly improved this article. In particular, the alternative linear model in Sect. 4 was suggested by one of the referees.
Regis, R.G. The calculus of simplex gradients. Optim Lett 9, 845–865 (2015). https://doi.org/10.1007/s11590-014-0815-x