
One of the newest approaches in general NSO is to use the gradient sampling algorithms developed by Burke et al. [51, 52]. The gradient sampling method (GS) minimizes an objective function that is locally Lipschitz continuous and smooth on an open dense subset \(D \subset {\mathbb {R}^{n}}\); the objective may be nonsmooth and/or nonconvex. The GS may be considered a stabilized steepest descent algorithm. The central idea behind these techniques is to approximate the subdifferential of the objective function through random sampling of gradients near the current iterate. The ongoing development of gradient sampling algorithms (see e.g. [67]) suggests that they have the potential to rival bundle methods in terms of both theoretical strength and practical performance. However, here we introduce only the original GS [51, 52].

1 Gradient Sampling Method

Let \(f\) be a locally Lipschitz continuous function on \({\mathbb {R}^{n}}\), and suppose that \(f\) is smooth on an open dense subset \( D \subset \mathbb {R}^n\). In addition, assume that there exists a point \(\bar{\varvec{x}}\) such that the level set \({{\mathrm{lev}}}_{f(\bar{\varvec{x}})} = \{ {\varvec{x}}\mid f({\varvec{x}}) \le f(\bar{\varvec{x}})\}\) is compact.
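For instance (a simple illustration, not taken from [51, 52]), the function

\[
f({\varvec{x}}) = |x_1| + x_2^2, \qquad {\varvec{x}} = (x_1, x_2) \in \mathbb {R}^2,
\]

is locally Lipschitz continuous on \(\mathbb {R}^2\), smooth on the open dense subset \(D = \{ {\varvec{x}} \in \mathbb {R}^2 \mid x_1 \ne 0 \}\), and all of its level sets are compact.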

At a given iterate \({\varvec{x}}_k\), the gradient of the objective function is computed at a set of randomly generated nearby points \({\varvec{x}}_{kj}\) with \(j \in \{1,2,\ldots ,m\}\) and \(m>n+1\). These gradients are used to construct a search direction as the vector with the shortest norm in the convex hull of the sampled gradients. A standard line search is then applied to obtain a point with a lower objective function value. The stabilization of the method is controlled by the sampling radius \(\varepsilon _k\) used to sample the gradients.

The pseudo-code of the GS is the following:

[Algorithm figure: pseudo-code of the GS; not reproduced here.]
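Since the pseudo-code figure is not reproduced here, the following is a minimal Python sketch of the GS iteration under two simplifications: the sampling radius is kept fixed and gradients are assumed to be available at every sampled point. The helper min_norm_in_hull, the generic SLSQP call used for the min-norm subproblem, the Armijo-type backtracking line search, and all parameter values are illustrative choices, not part of the original algorithm in [51, 52], which additionally updates the sampling radius and handles points falling outside \(D\).

```python
import numpy as np
from scipy.optimize import minimize


def min_norm_in_hull(G):
    """Shortest vector in the convex hull of the rows of G (shape (m, n)).

    Solves  min ||G^T lam||^2  s.t.  lam >= 0, sum(lam) = 1  with a
    generic SLSQP call; a dedicated QP solver would normally be used.
    """
    m = G.shape[0]
    lam0 = np.full(m, 1.0 / m)
    res = minimize(lambda lam: np.sum((G.T @ lam) ** 2), lam0,
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq",
                                 "fun": lambda lam: np.sum(lam) - 1.0}])
    return G.T @ res.x


def gradient_sampling(f, grad, x0, eps=0.1, m=None, beta=1e-4,
                      gamma=0.5, tol=1e-6, max_iter=200, rng=None):
    """Sketch of gradient sampling with a fixed sampling radius eps."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    n = x.size
    m = m if m is not None else 2 * n + 2          # m > n + 1 sample points
    for _ in range(max_iter):
        # Draw m points uniformly from the eps-ball around the iterate and
        # collect their gradients together with the gradient at x itself.
        U = rng.normal(size=(m, n))
        U *= (eps * rng.uniform(size=(m, 1)) ** (1.0 / n)
              / np.linalg.norm(U, axis=1, keepdims=True))
        G = np.vstack([grad(x)] + [grad(x + u) for u in U])
        # Search direction: negative of the shortest vector in the convex
        # hull of the sampled gradients.
        g = min_norm_in_hull(G)
        if np.linalg.norm(g) <= tol:               # approximately stationary
            break                                  # for the current radius
        d = -g / np.linalg.norm(g)
        # Simple backtracking (Armijo) line search for a lower f-value.
        t = 1.0
        while (f(x + t * d) > f(x) - beta * t * np.linalg.norm(g)
               and t > 1e-12):
            t *= gamma
        x = x + t * d
    return x
```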

Note that the probability of obtaining a point \({\varvec{x}}_{kj}\notin D\) is zero in the above algorithm. In addition, it is reported in [52] that having \({\varvec{x}}_k+t_k {\varvec{d}}_k \notin D\) is highly unlikely in practice.

The GS algorithm may be applied to any function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) that is continuous on \(\mathbb {R}^n\) and differentiable almost everywhere. Furthermore, it has been shown that when \(f\) is locally Lipschitz continuous, smooth on an open dense subset \(D\) of \(\mathbb {R}^n\), and has bounded level sets, every cluster point \(\bar{{\varvec{x}}}\) of the sequence generated by the GS with fixed \(\varepsilon \) is \(\varepsilon \)-stationary with probability 1 (that is, \({\pmb 0}\in \partial _\varepsilon ^G f(\bar{{\varvec{x}}})\), see also Definition 3.3 in Part I). In addition, if \(f\) has a unique \(\varepsilon \)-stationary point \(\bar{{\varvec{x}}}\), then the set of all cluster points generated by the algorithm converges to \(\bar{{\varvec{x}}}\) as \(\varepsilon \) is reduced to zero.
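As a small illustration of this behaviour, the gradient_sampling sketch given after the pseudo-code above can be run with a fixed \(\varepsilon \) on a hypothetical nonsmooth test problem (not taken from [51, 52]); the iterates are then expected to settle near an (approximately) \(\varepsilon \)-stationary point, which for this convex example is close to the unique minimizer.

```python
import numpy as np

# f(x) = |x1| + 2*(x2 - 1)^2: locally Lipschitz, smooth off the line x1 = 0,
# with compact level sets and unique minimizer (0, 1).
f = lambda x: abs(x[0]) + 2.0 * (x[1] - 1.0) ** 2
grad = lambda x: np.array([np.sign(x[0]), 4.0 * (x[1] - 1.0)])

x_bar = gradient_sampling(f, grad, x0=[3.0, -2.0], eps=0.1, rng=0)
print(x_bar)   # expected to end up near the minimizer (0, 1)
```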