1 Introduction

The structure of many practical problems suggests the employment of Block Coordinate Descent (BCD) methods for optimization. At each iteration of a BCD method, only a block of variables is modified, with the purpose of obtaining a sufficient decrease of the objective function.

Wright [11] surveyed traditional approaches and modern advances on the definition and analysis of Coordinate Descent methods. His analysis addresses mostly unconstrained problems in which the objective function is convex. The case in which a smooth function is minimized over a closed and convex set with a separable structure was considered in [7]. Although the Coordinate Descent paradigm is very natural and is implicitly used in different mathematical contexts, a classical example by Powell [10] showed that convergence results cannot be achieved under excessively naive implementations.

In a recent report [2] it was shown that, by requiring sufficient descent based on regularization, the drawbacks represented by Powell’s example can be removed. In that paper it was also shown that methods based on high-order regularization can be defined for which convergence and worst-case complexity can be proved. However, the main results in [2] indicate that it is not worthwhile to use Taylor-like models of order greater than 2, because complexity is dominated by the necessity of keeping consecutive iterates close enough, a requirement that is hard to achieve if models and regularizations are of high order. This is the reason why, in the present paper, we restrict ourselves to quadratic models of the objective function and quadratic regularization.

The novelty of our approach lies in the extension of the techniques of [2] to the case in which there are arbitrary constraints on every block. As a consequence, at each iteration of BCD we solve a minimization problem with (typically) a small number of variables that must satisfy arbitrary constraints. Moreover, the block feasible set may not be defined by a global set of equalities and inequalities, as is usual in constrained optimization. Instead, the equalities and inequalities that define the feasible set are local in nature, in a sense that will be defined below, making it possible to handle more general domains than those defined by global equalities and inequalities.

This paper is organized as follows. In Sect. 2 the definition of the optimization problem is given. In Sect. 3 we define the BCD method for solving the main problem. In Sect. 4 we prove convergence and complexity results. In Sect. 5 we explain how to solve the subproblems. Experiments are shown in Sect. 6 and, in Sect. 7, we state conclusions and lines for future research.

Notation. \(\Vert \cdot \Vert\) denotes the Euclidean norm. \(\nabla _i\) denotes the gradient with respect to the ith block of coordinates. \(\mathbb {N}= \{0, 1, 2,\ldots \}\).

2 The problem

Assume that \(f: \mathbb {R}^n \rightarrow \mathbb {R}\), \(n_i \ge 1\) for all \(i=1,\ldots ,n_{\mathrm {blocks}}\), and \(\sum _{i=1}^{n_{\mathrm {blocks}}} n_i = n\). Let us write \(x = (\varvec{x}_1^T, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^T)^T\), where \(\varvec{x}_i \in \mathbb {R}^{n_i}\) for all \(i=1, \dots , n_{\mathrm {blocks}}\), so

$$\begin{aligned} f(x) = f(x_1, \dots , x_n) = f(\varvec{x}_1, \dots , \varvec{x}_{n_{\mathrm {blocks}}}). \end{aligned}$$

The problem considered in the present work is given by

$$\begin{aligned} {{\,\mathrm{Minimize}\,}}f(\varvec{x}_1, \dots , \varvec{x}_{n_{\mathrm {blocks}}}) \text{ subject } \text{ to } \varvec{x}_i \in \Omega _i, i = 1,\dots ,n_{\mathrm {blocks}}. \end{aligned}$$
(1)

The assumption below includes conditions on the sets \(\Omega _i\) and on the way they are described.

Assumption A1

For all \(i=1,\dots ,n_{\mathrm {blocks}}\) the set \(\Omega _i \subset \mathbb {R}^{n_i}\) is closed and bounded. Moreover, for all \(i = 1, \dots ,n_{\mathrm {blocks}}\), there exist open sets \(A_{i,j}\), \(j=1,\dots ,n_{\mathrm {ops}}(i)\), such that

$$\begin{aligned} \Omega _i \subset \cup _{j=1}^{n_{\mathrm {ops}}(i)} A_{i,j} \end{aligned}$$

and there exist \(n_{\mathrm {g}}(i,j)\) smooth functions

$$\begin{aligned} g_{i,j,\ell }: A_{i,j} \rightarrow \mathbb {R}, \; \ell =1,\dots ,n_{\mathrm {g}}(i,j) \end{aligned}$$
(2)

such that

$$\begin{aligned} \Omega _i \cap A_{i,j} = \{\varvec{x}_i \in A_{i,j} \;|\; g_{i,j,\ell }(\varvec{x}_i) \le 0, \; \ell =1,\dots ,n_{\mathrm {g}}(i,j)\}. \end{aligned}$$
(3)

The constraints \(g_{i,j,\ell }(\varvec{x}_i) \le 0\) are said to be constitutive of \(\Omega _i\) in the open covering set \(A_{i,j}\). The explicit inclusion of equality constraints in (2) and (3) offers no difficulty, and we state the case with only inequalities for the sole purpose of simplifying the notation. See Fig. 1 for an example of a set \(\Omega _i\), the open covering \(A_{i,j}\), and functions \(g_{i,j,\ell }\).

Fig. 1: Illustration of a set \(\Omega _1\) (dashed red) covered by \(n_{\mathrm {ops}}(1)=12\) open sets \(A_{1,1}, \dots , A_{1,12}\)

In Fig. 1, the sets \(A_{1,j}\) for \(j=1,2,3,5\) (four out of the five open circles) are such that \(n_g(1,j)=2\), while the sets \(A_{1,j}\) for \(j=6,7,8,9,10\) (the open rectangles) are such that \(n_g(1,j)=1\). In all these cases, the constitutive constraints are linear. The sets \(A_{1,11}\) and \(A_{1,12}\) are not displayed in the picture for clarity. They are two congruent open triangles that cover the interior of \(\Omega _1\) that appears uncovered in the picture; and they are such that \(n_g(1,11)=n_g(1,12)=0\). The “internal kink” at \(A_{1,4}\) makes the constitutive constraint of \(A_{1,4}\) deserve special consideration. In \(A_{1,4}\) the feasible region is the complement of a set defined by constraints of the form \(a_1 x_1 + b_1 x_2 + c_1 \ge 0\) and \(a_2 x_1 + b_2 x_2 + c_2 \ge 0\). Let us write \(z_1 = a_1 x_1 + b_1 x_2 + c_1\) and \(z_2 = a_2 x_1 + b_2 x_2 + c_2\). Then, in the plane \((z_1, z_2)\), the feasible region is, locally, the complement of the non-negative orthant. Define \(\varphi (z_1, z_2) = (z_1 z_2)^2\) if \(z_1 \ge 0\) and \(z_2 \ge 0\), and \(\varphi (z_1, z_2) = - (z_1 z_2)^2\) otherwise. Then, the feasible region is, locally, given by \(\varphi (z_1, z_2) \le 0\); i.e. \(n_g(1,4)=1\) and the constitutive constraint is given by \(\varphi\).
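To make the role of this reparametrized constraint concrete, the following sketch evaluates \(\varphi\); the half-plane coefficients are illustrative values chosen for the sketch, not quantities taken from Fig. 1.

```python
# Illustrative coefficients of the two half-planes forming the "internal kink";
# these are assumptions for the sketch, not values read from Figure 1.
a1, b1, c1 = 1.0, 0.0, 0.0   # z1 = x1
a2, b2, c2 = 0.0, 1.0, 0.0   # z2 = x2

def phi(x1, x2):
    """Constitutive constraint of A_{1,4}: phi <= 0 exactly when (z1, z2) lies
    outside the open positive orthant {z1 > 0, z2 > 0}."""
    z1 = a1 * x1 + b1 * x2 + c1
    z2 = a2 * x1 + b2 * x2 + c2
    s = (z1 * z2) ** 2
    return s if (z1 >= 0.0 and z2 >= 0.0) else -s

print(phi(0.5, 0.5) > 0.0)    # True: locally infeasible (inside the orthant)
print(phi(-0.5, 0.5) <= 0.0)  # True: locally feasible
```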

In Sect. 6, we illustrate our method using a problem in which the sets \(\Omega _i\) are simple nonconvex polygons. A small set \(\Omega _i\) of this type is depicted in Fig.  1. Let us show here that we can cover \(\Omega _i\) with open sets \(A_{ij}\), and define constitutive constraints, in such a way that, if \(\bar{x}\) is a (local or global) minimizer of a smooth function \(\psi (x_1,x_2)\) over \(\Omega _i\), then there exists \(A_{ij}\) containing \(\bar{x}\) such that the KKT conditions with respect to the constitutive constraints in \(A_{ij}\) are satisfied; see Assumption A7 in Sect.  5. In fact, the form described here for defining the open sets \(A_{ij}\) and the constitutive constraints clearly satisfies Assumption A1 and also satisfies Assumption A6 in Sect.  5.

The idea is to cover the polygon with the following open sets: (a) the interior of \(\Omega _i\); (b) open sets whose intersection with \(\Omega _i\) is an open side of the boundary (a side of the boundary that does not contain the extreme vertices); (c) convex open sets whose intersection with the boundary of \(\Omega _i\) is a vertex plus a fraction of the two segments that connect the vertex with its two neighbors. In Case (a) there are no constitutive constraints. In Case (b) there is only one constitutive constraint, which defines a half-plane. In Case (c), the corresponding open set is divided into a region contained in \(\Omega _i\) and a region that is not contained in \(\Omega _i\). If the region contained in \(\Omega _i\) is convex, the constitutive constraints are linear and define the intersection of two half-planes. Otherwise the expression of the constitutive constraints is not relevant and could be, for example, like the constraint \(\varphi\) mentioned in the example of Fig. 1.

In Cases (a) and (b) it is straightforward that a local minimizer of \(\psi\) must satisfy the KKT conditions. Analogously, the KKT conditions also hold in Case (c) if the region contained in \(\Omega _i\) is convex. In fact, in these three situations the constitutive constraints are naturally linear, a condition under which minimizers are necessarily KKT points [9]. Therefore, we only need to analyze Case (c) when the region contained in \(\Omega _i\) is not convex. In this case, by the geometry of the feasible region in a neighborhood of \(\bar{x}\), there exist linearly independent directions \(v_1\) and \(v_2\) such that \(\pm v_1\) and \(\pm v_2\) are feasible directions. Since \(\bar{x}\) is a minimizer, we must have \(\nabla \psi (\bar{x})^T v_1 \ge 0\), \(\nabla \psi (\bar{x})^T (-v_1) \ge 0\), \(\nabla \psi (\bar{x})^T v_2 \ge 0\), and \(\nabla \psi (\bar{x})^T (-v_2) \ge 0\), implying that \(\nabla \psi (\bar{x}) = 0\). So \(\bar{x}\) is a KKT point with null multipliers.

3 Block coordinate descent method

In this section we present the main algorithm designed for solving (1). At each iteration k of the Block Coordinate Descent (BCD) method, given the current iterate \(\varvec{x}^k =((\varvec{x}_1^k)^T, \dots , (\varvec{x}_{n_{\mathrm {blocks}}}^k)^T)^T\), we choose \(i_k \in \{1, \dots , n_{\mathrm {blocks}}\}\) and compute \(\varvec{x}_{i_k}^{k+1}\) by approximately solving

$$\begin{aligned} {{\,\mathrm{Minimize}\,}}_{\varvec{x}_{i_k} \in \mathbb {R}^{n_{i_k}}} f(\varvec{x}_1, \dots , \varvec{x}_{n_{\mathrm {blocks}}}) \text{ subject } \text{ to } \varvec{x}_{i_k} \in \Omega _{i_k} \text{ and } \varvec{x}_j = \varvec{x}_j^k \text{ for } \text{ all } j \ne i_k, \end{aligned}$$
(4)

i.e. we approximately minimize the function f fixing the blocks \(\varvec{x}_j\) such that \(j \ne i_k\). For \(j \ne i_k\), we define \(\varvec{x}_j^{k+1} = \varvec{x}_j^k\). The sense in which problem (4) is solved only approximately is clarified below.

The algorithmic parameters of BCD are the sufficient descent parameter \(\alpha\), which defines the progress of the objective function and implicitly penalizes the distance between consecutive iterates; the tolerance \(\delta\) with respect to complementarity conditions; the parameter \(\theta\) that defines sufficient descent of the model at each iteration; and the minimal positive regularization parameter \(\sigma _{\min }\). The model Hessian matrices \(B_k\) may be used to mimic available second derivative information, but they do not play any significant role from the point of view of complexity or convergence, and the choice \(B_k = 0\) is always possible. At Step 1 of the algorithm we choose the block of variables with respect to which we wish to improve the objective function. At Step 2, in general, we approximately minimize a quadratic model, increasing the regularization parameter as long as the sufficient descent condition (10) is not satisfied. Alternatively, we may employ the true objective function as a model, since this alternative is affordable in many practical problems.

Algorithm 3.1

Assume that \(\alpha > 0\), \(\delta \ge 0\), \(\theta > 0\), \(\sigma _{\min } > 0\), and \(\varvec{x}_i^0 \in \Omega _i\) for \(i=1,\dots ,n_{\mathrm {blocks}}\) are given. Initialize the iteration number \(k \leftarrow 0\).

Step 1.:

Set \(\sigma \leftarrow 0\), choose \(i_k \in \{1,\dots ,n_{\mathrm {blocks}}\}\), and define a symmetric matrix \(B_k \in \mathbb {R}^{n_{i_k} \times n_{i_k}}\).

Step 2.:

Find \(j_k \in \{1,\dots ,n_{\mathrm {ops}}(i_k)\}\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}} \in A_{i_k,j_k}\) such that if \(\sigma >0\) then Alternative 1 holds while if \(\sigma =0\) then either Alternative 1 or Alternative 2 holds. Alternative 1:

$$\begin{aligned} \nabla _{i_k} f(x^k)^T (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{1}{2}(\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k)^T B_k (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{\sigma }{2} \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2 \le 0 \end{aligned}$$
(5)

and there exist \(\mu _{i_k,j_k,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i_k,j_k)\) for which

$$\begin{aligned}&\left\| \nabla _{i_k} f(x^k) + B_k (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \sigma (\varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k)\right. \nonumber \\&\quad \left. +\sum _{\ell =1}^{n_{\mathrm {g}}(i_k,j_k)} \mu _{i_k,j_k,\ell } \nabla g_{i_k,j_k,\ell }(\varvec{x}_{i_k}^{{\mathrm {trial}}}) \right\| \le \theta \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k\Vert \end{aligned}$$
(6)

and

$$\begin{aligned} \min \{\mu _{i_k,j_k,\ell }, -g_{i_k,j_k,\ell }(\varvec{x}_{i_k}^{{\mathrm {trial}}})\} \le \delta , \; \ell =1,\dots ,n_{\mathrm {g}}(i_k, j_k). \end{aligned}$$
(7)

Alternative 2: There exist \(\mu _{i_k,j_k,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i_k,j_k)\) for which

$$\begin{aligned} \left\| \nabla _{i_k} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i_k,j_k)} \mu _{i_k,j_k,\ell } \nabla g_{i_k,j_k,\ell }(\varvec{x}_{i_k}^{{\mathrm {trial}}}) \right\| = 0 \end{aligned}$$
(8)

and

$$\begin{aligned} \min \{\mu _{i_k,j_k,\ell }, -g_{i_k,j_k,\ell }(\varvec{x}_{i_k}^{{\mathrm {trial}}})\} \le 0, \; \ell =1,\dots ,n_{\mathrm {g}}(i_k, j_k). \end{aligned}$$
(9)
Step 3.:

Test the sufficient descent condition

$$\begin{aligned} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k) \le f(x^k) - \alpha \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2. \end{aligned}$$
(10)

If (10) holds, then define \(\sigma _k = \sigma\), \(\varvec{x}_{i_k}^{k+1} = \varvec{x}_{i_k}^{{\mathrm {trial}}}\), and \(\varvec{x}_i^{k+1} = \varvec{x}_i^k\) for all \(i \ne i_k\), set \(k \leftarrow k+1\), and go to Step 1. Otherwise, set \(\sigma \leftarrow \max \{ \sigma _{\min }, 2 \sigma \}\) and go to Step 2.

Remark

We will see that, from the theoretical point of view, Alternative 2 is not necessary. Convergence and complexity results follow without any difficulty with Alternative 1 only. Alternative 2 was included because, in many cases, a procedure exists to find a global minimizer with respect to a single block. So, in Alternative 2 we allow the algorithm to choose such a minimizer, with the only condition that it must satisfy the KKT conditions in the block. However, note that the test (10) is still necessary and cannot be eliminated. The reason is that its fulfillment implies that the difference between consecutive iterates tends to zero, and this feature is essential for the convergence of coordinate search methods. See the counterexample in [10] and the discussion with only box constraints in [2].
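Purely to fix ideas about the control flow of Algorithm 3.1, the sketch below shows one possible organization of the outer loop. The routines `f`, `solve_step2`, and `choose_block` are hypothetical user-supplied callables; in particular, `solve_step2` is assumed to return a trial block satisfying Alternative 1 (or Alternative 2 when \(\sigma = 0\)).

```python
import numpy as np

def bcd(f, x0_blocks, solve_step2, choose_block,
        alpha=1e-8, sigma_min=1e-8, max_iter=1000):
    """Skeleton of Algorithm 3.1 (a sketch, not the implementation of Sect. 6).

    f            : callable mapping a list of blocks to the objective value
    x0_blocks    : list of numpy arrays, the initial blocks x_i^0 in Omega_i
    solve_step2  : callable (blocks, i, sigma) -> trial block (Step 2)
    choose_block : callable k -> block index i_k in {0, ..., n_blocks - 1}
    """
    x = [b.copy() for b in x0_blocks]
    for k in range(max_iter):
        i = choose_block(k)          # Step 1 (B_k is assumed chosen inside solve_step2)
        sigma = 0.0
        while True:
            trial = solve_step2(x, i, sigma)                      # Step 2
            candidate = [trial if j == i else b for j, b in enumerate(x)]
            dist2 = np.linalg.norm(trial - x[i]) ** 2
            if f(candidate) <= f(x) - alpha * dist2:              # test (10)
                x[i] = trial
                break
            sigma = max(sigma_min, 2.0 * sigma)                   # increase sigma, retry
    return x
```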

4 Convergence and complexity

In this section we prove convergence and complexity results. We say that \((\varvec{x}_1, \dots , \varvec{x}_{n_{\mathrm {blocks}}})\) is \((\delta ,\varepsilon )\)-critical if there exist open sets \(A_{i,j}\) (\(i=1,\dots ,n_{\mathrm {blocks}}\), \(j=1,\dots ,n_{\mathrm {ops}}(i)\)) satisfying Assumption A1 such that, for all \(i=1,\dots ,n_{\mathrm {blocks}}\), \(\varvec{x}_i\) satisfies the KKT conditions for the minimization of \(f(\varvec{x}_1,\ldots ,\varvec{x}_{n_{\mathrm {blocks}}})\) restricted to the constitutive constraints of \(A_{i,j}\), \(j=1,\dots ,n_{\mathrm {ops}}(i)\), with tolerance \(\varepsilon >0\), and satisfies complementarity and feasibility with respect to the same constraints with tolerance \(\delta >0\). Under proper assumptions we prove that, given \(\delta \ge 0\) (which is a parameter of the algorithm BCD), the natural measure of \((\delta , 0)\)-criticality tends to zero and the number of iterations and evaluations that are necessary to obtain \((\delta , \varepsilon )\)-criticality is \(O(\varepsilon ^{-2})\).
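For concreteness, approximate criticality of one block within a given covering set can be checked as in the sketch below; the inputs (the gradient of f with respect to the block, the values and gradients of the constitutive constraints, and candidate multipliers) are assumed to be supplied by the user.

```python
import numpy as np

def block_criticality_ok(grad_f_block, g_vals, g_grads, mu, delta, eps):
    """Check the (delta, eps)-criticality conditions for one block with respect
    to the constitutive constraints of one covering set A_{i,j}."""
    if any(m < 0.0 for m in mu):
        return False
    # Approximate KKT residual with tolerance eps
    residual = grad_f_block + sum(m * gg for m, gg in zip(mu, g_grads))
    if np.linalg.norm(residual) > eps:
        return False
    # Approximate complementarity and feasibility with tolerance delta
    return all(min(m, -gv) <= delta for m, gv in zip(mu, g_vals))
```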

In Assumption A2 we state that the gradients of the objective function must satisfy Lipschitz conditions.

Assumption A2

There exists \(\gamma > 0\) such that, for all \(\varvec{x}_{i_k}^k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) computed by Algorithm 3.1,

$$\begin{aligned} \Vert \nabla _{i_k} f(x^k) - \nabla _{i_k} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k) \Vert \le \gamma \Vert \varvec{x}_{i_k}^k - \varvec{x}_{i_k}^{{\mathrm {trial}}}\Vert \end{aligned}$$
(11)

and

$$\begin{aligned} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k) \le f(x^k) + \nabla _{i_k} f(x^k)^T (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{\gamma }{2} \Vert \varvec{x}_{i_k}^k - \varvec{x}_{i_k}^{{\mathrm {trial}}}\Vert ^2. \end{aligned}$$
(12)

Moreover, if \(x^{k_1}\) differs from \(x^{k_2}\) in only one block \(i_{\mathrm {diff}}\),

$$\begin{aligned} \Vert \nabla _{i} f(x^{k_1}) - \nabla _{i} f(x^{k_2})\Vert \le \gamma \Vert \varvec{x}_{i_{\mathrm {diff}}}^{k_2} - \varvec{x}_{i_{\mathrm {diff}}}^{k_1}\Vert \end{aligned}$$
(13)

for all \(i=1,\dots ,n_{\mathrm {blocks}}\).

Assumption A3 merely states that model Hessians should be uniformly bounded.

Assumption A3

There exists \(c_B > 0\) such that, for all \(k \in \mathbb {N}\),

$$\begin{aligned} \Vert B_k\Vert \le c_B. \end{aligned}$$
(14)

Assumptions A2 and A3 are sufficient to prove that every iteration of BCD is well defined, since the sufficient descent condition (10) is obtained after increasing the regularization parameter \(\sigma\) a finite number of times.

Lemma 4.1

Assume that Assumptions A2 and A3 hold and that, for all \(k \in \mathbb {N}\), the computation of \(j_k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) according to Step 2 of BCD is possible. Then, the test (10) is satisfied after at most

$$\begin{aligned} \log _2 \left( \frac{\gamma + c_B + 2 \alpha }{\sigma _{\min }} \right) + 1 \end{aligned}$$

increases of \(\sigma\) at Step 3. Moreover,

$$\begin{aligned} \sigma _k < \sigma _{\max } := 2 (\gamma + c_B + 2 \alpha ). \end{aligned}$$
(15)

Proof

If the test (10) is satisfied when Step 2 is executed with \(\sigma =0\), then the thesis holds trivially. So, we need to consider only the case in which \(\sigma >0\) and, in consequence, \(j_k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) satisfy Alternative 1. By (12) in Assumption A2,

$$\begin{aligned} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k)&\le f(x^k) + \nabla _{i_k} f(x^k)^T (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{\gamma }{2} \Vert \varvec{x}_{i_k}^k - \varvec{x}_{i_k}^{{\mathrm {trial}}}\Vert ^2\\&+ \left( \frac{1}{2}(\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k)^T B_k (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{1}{2}\sigma \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2 \right) \\&- \left( \frac{1}{2}(\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k)^T B_k (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) + \frac{1}{2}\sigma \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2 \right) . \end{aligned}$$

Then, by (5),

$$\begin{aligned}&f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k) \le f(x^k) + \frac{\gamma }{2} \Vert \varvec{x}_{i_k}^k - \varvec{x}_{i_k}^{{\mathrm {trial}}}\Vert ^2 - \frac{1}{2}(\varvec{x}_{i_k}^{{\mathrm {trial}}}\\&\quad -\varvec{x}_{i_k}^k)^T B_k (\varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k) - \frac{1}{2}\sigma \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2. \end{aligned}$$

Therefore, by (14) in Assumption A3,

$$\begin{aligned} f(\varvec{x}_1^k, \dots , \varvec{x}_{i_k}^{{\mathrm {trial}}}, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^k)&\le f(x^k) + \frac{\gamma }{2} \Vert \varvec{x}_{i_k}^k - \varvec{x}_{i_k}^{{\mathrm {trial}}}\Vert ^2 + \frac{1}{2}c_B \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}}-\varvec{x}_{i_k}^k \Vert ^2\\&- \frac{1}{2}\sigma \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2\\&= f(x^k) + \frac{1}{2}( \gamma + c_B - \sigma ) \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2. \end{aligned}$$

Then, the inequality (10) holds if

$$\begin{aligned} \frac{1}{2}( \gamma + c_B - \sigma ) \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2 \le -\alpha \Vert \varvec{x}_{i_k}^{{\mathrm {trial}}} - \varvec{x}_{i_k}^k\Vert ^2, \end{aligned}$$

i.e. if \(\sigma \ge \gamma + c_B + 2 \alpha\). Since, by definition, \(\sigma\) initially receives the value zero and then receives values of the form \(2^{\ell -1} \sigma _{\min }\), where \(\ell\) is the number of executions of \(\sigma \leftarrow \max \{\sigma _{\min }, 2 \sigma \}\), then the number of increases of \(\sigma\) that are necessary to obtain (10) is bounded above by

$$\begin{aligned} \log _2 \left( \frac{\gamma + c_B + 2 \alpha }{\sigma _{\min }} \right) + 1 \end{aligned}$$

as we wanted to prove. Finally, (15) comes from the fact that the largest unsuccessful value of \(\sigma\) must be strictly less than \(\gamma + c_B + 2 \alpha\) and, by definition, the next (successful) value is at most twice that amount. \(\square\)
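As a concrete illustration with hypothetical values, if \(\gamma + c_B + 2\alpha = 1\) and \(\sigma _{\min } = 2^{-10}\), Lemma 4.1 guarantees that (10) is satisfied after at most

$$\begin{aligned} \log _2 \left( \frac{\gamma + c_B + 2 \alpha }{\sigma _{\min }} \right) + 1 = \log _2 \left( 2^{10} \right) + 1 = 11 \end{aligned}$$

increases of \(\sigma\), and that \(\sigma _k < \sigma _{\max } = 2 (\gamma + c_B + 2 \alpha ) = 2\).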

Assumption A4

The sequence \(\{f(\varvec{x}^k)\}\) is bounded below.

The following lemma is a simple consequence of (10) and Assumption A4.

Lemma 4.2

Assume that Assumptions A2, A3, and A4 hold and that for all \(k \in \mathbb {N}\), the computation of \(j_k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) at Step 2 is possible. Then, \(\lim _{k \rightarrow \infty } \Vert \varvec{x}_{i_k}^{k+1} - \varvec{x}_{i_k}^k\Vert = \lim _{k\rightarrow \infty } \Vert \varvec{x}^{k+1}-\varvec{x}^k\Vert = 0\) and, given \(\varepsilon > 0\), the number of iterations at which \(\Vert \varvec{x}_{i_k}^{k+1} - \varvec{x}_{i_k}^k\Vert > \varepsilon\) is bounded above by

$$\begin{aligned} \frac{f(\varvec{x}^0) - f_{{\mathrm {bound}}}}{\alpha \varepsilon ^2} \end{aligned}$$
(16)

where \(f_{{\mathrm {bound}}}\) is an arbitrary lower bound of \(\{f(\varvec{x}^k)\}\).

Proof

By Assumption A4 there exists \(f_{{\mathrm {bound}}} \in \mathbb {R}\) such that \(f(\varvec{x}^k) \ge f_{{\mathrm {bound}}}\) for all \(k \in \mathbb {N}\). Then, the fact that \(\Vert \varvec{x}^{k+1}_{i_k}-\varvec{x}^k_{i_k}\Vert\) tends to zero comes from (10); and this implies that \(\Vert \varvec{x}^{k+1}-\varvec{x}^k\Vert\) tends to zero as well because, by definition, \(\Vert \varvec{x}^{k+1}-\varvec{x}^k\Vert = \Vert \varvec{x}^{k+1}_{i_k}-\varvec{x}^k_{i_k}\Vert\). Finally, if \(\Vert \varvec{x}_{i_k}^{k+1} - \varvec{x}_{i_k}^k\Vert > \varepsilon\), then, by (10), \(f(x^{k+1}) \le f(x^k) - \alpha \varepsilon ^2\); and this reduction cannot occur more than (16) times, as we wanted to prove. \(\square\)

Assumption A5 states that every block of components i is chosen for minimization at infinitely many iterations and, moreover, at every m consecutive iterations we necessarily find at least one at which i is chosen.

Assumption A5

There exists \(m \in \{1,2,\dots \}\) such that, for all \(\nu \in \{1,\dots ,n_{\mathrm {blocks}}\}\), \(i_k = \nu\) infinitely many times and, if \(k_1< k_2 < k_3, \dots\) is the set of all the iteration indices k such that \(i_k=\nu\), one has that \(k_1 \le m\), and \(k_{j+1}-k_j \le m\) for all \(j= 1, 2, 3, \dots\).

The following theorems constitute the main convergence results of this paper. The idea is the following. According to Algorithm 3.1, at iteration k, we select a block \(i_k = i_{{\mathrm {chosen}}}\) and optimize with respect to the variables of this block up to the approximate fulfillment of restricted KKT conditions. These restricted KKT conditions hold in one of the open sets that cover \(\Omega _{i_{{\mathrm {chosen}}}}\) and involve the constraints that are constitutive in this open set. The variables \(\varvec{x}_{i_{{\mathrm {chosen}}}}\) do not change during some (fewer than m) iterations; therefore, during these iterations, thanks to the Lipschitz assumption (13), the approximate KKT conditions with respect to the variables \(i_{{\mathrm {chosen}}}\) still hold with respect to the same open set and the same constitutive constraints used at iteration k. After these (fewer than m) iterations the block \(i_{{\mathrm {chosen}}}\) is selected again, and the process is repeated. Since all the blocks are chosen infinitely many times in the way described by Assumption A5, approximate KKT conditions eventually hold with respect to all the blocks and we are able to establish the number of iterations needed for the fulfillment of the KKT conditions up to an arbitrary precision \(\varepsilon\).

Theorem 4.1

Assume that Assumptions A2, A3, A4, and A5 hold and that for all \(k \in \mathbb {N}\), the computation of \(j_k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) at Step 2 is possible. For \(i \in \{1, \dots , n_{\mathrm {blocks}}\}\) and \(k \ge m\), define \(o(i,k):=j_{\hat{k}}\), where \(\hat{k}\) is the latest iteration (not larger than k) at which \(i_{\hat{k}}=i\). Then, for \(i \in \{1, \dots , n_{\mathrm {blocks}}\}\) and \(k \ge m\), we have that \(\mu _{i,o(i,k),\ell } \ge 0\) for \(\ell = 1,\dots ,n_{\mathrm {g}}(i,o(i,k))\),

$$\begin{aligned}&\varvec{x}^k_i \in A_{i,o(i,k)},\\&\quad \min \{\mu _{i,o(i,k),\ell }, -g_{i,o(i,k),\ell }(\varvec{x}^k_i) \} \le \delta , \end{aligned}$$

and

$$\begin{aligned} \lim _{k\rightarrow \infty } \left\| \nabla _i f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,o(i,k))} \mu _{i,o(i,k),\ell } \nabla g_{i,o(i,k),\ell }(\varvec{x}^k_i) \right\| = 0. \end{aligned}$$

Moreover, given \(\varepsilon > 0\), the number of iterations at which

$$\begin{aligned} \left\| \nabla _i f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,o(i,k))} \mu _{i,o(i,k),\ell } \nabla g_{i,o(i,k),\ell }(\varvec{x}^k_i) \right\| > \varepsilon \end{aligned}$$

is bounded above by

$$\begin{aligned} \frac{f(\varvec{x}^0) - f_{{\mathrm {bound}}}}{\alpha (\varepsilon /(c_4 m))^2}, \end{aligned}$$

where \(f_{{\mathrm {bound}}}\) is an arbitrary lower bound of \(\{f(\varvec{x}^k)\}\) and

$$\begin{aligned} c_4 := c_B + \sigma _{\max } + \theta + \gamma . \end{aligned}$$
(17)

Proof

Let \(i \in \{1, \dots , n_{\mathrm {blocks}}\}\) be arbitrary. By Assumption A5, there exists \(k_1 \le m\) such that \(i_{k_1} = i\). Without loss of generality, in order to simplify the notation, suppose that \(k_1 = 0\). Consider first the case where, in Step 2, Alternative 1 holds. Then, at iteration \(k_1 = 0\) one defines \(B_0\) and finds \(j_0\), \(\varvec{x}_i^1 \in A_{i,j_0}\), and \(\mu _{i,j_0,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i,j_0)\) such that

$$\begin{aligned} \left\| \nabla _i f(x^0) + B_0 (\varvec{x}_i^1-\varvec{x}_i^0) + \sigma _0 (\varvec{x}_i^1 - \varvec{x}_i^0) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le \theta \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert \end{aligned}$$
(18)

and

$$\begin{aligned} \min \{\mu _{i,j_0,\ell }, -g_{i,j_0,\ell }(\varvec{x}_i^1)\} \le \delta \; \text{ for } \; \ell =1,\dots , n_{\mathrm {g}}(i,j_0). \end{aligned}$$
(19)

By (18) and (15),

$$\begin{aligned} \left\| \nabla _i f(x^0) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le (c_B + \sigma _{\max } +\theta ) \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert . \end{aligned}$$
(20)

Then, by (11),

$$\begin{aligned} \left\| \nabla _i f(\varvec{x}_1^0, \dots , \varvec{x}_i^1, \dots , \varvec{x}_{n_{\mathrm {blocks}}}^0) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le (c_B + \sigma _{\max } +\theta + \gamma ) \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert . \end{aligned}$$
(21)

Since \(\varvec{x}_s^1 = \varvec{x}_s^0\) for every \(s=1,\dots ,n_{\mathrm {blocks}}\), \(s \ne i\), by (21) and the definition of \(c_4\) in (17), we have that

$$\begin{aligned} \left\| \nabla _i f(x^1) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le c_4 \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert . \end{aligned}$$
(22)

Recall that (19) and (22) were obtained under Alternative 1. On the other hand, under Alternative 2, (19) and (22) follow trivially from (9) and (8), respectively.

Since \(\varvec{x}^2\) may differ from \(\varvec{x}^1\) only in the block \(i_1\), by (13), we have that

$$\begin{aligned} \Vert \nabla _i f(\varvec{x}^2) - \nabla _i f(\varvec{x}^1)\Vert \le \gamma \Vert \varvec{x}_{i_1}^2 - \varvec{x}_{i_1}^1\Vert . \end{aligned}$$

Then, by (22),

$$\begin{aligned} \left\| \nabla _i f(x^2) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le c_4 \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert + \gamma \Vert \varvec{x}_{i_1}^2 - \varvec{x}_{i_1}^1\Vert . \end{aligned}$$
(23)

Since \(\varvec{x}^3\) may differ from \(\varvec{x}^2\) only in the block \(i_2\), by (13) and (23), we have that

$$\begin{aligned} \left\| \nabla _i f(x^3) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le c_4 \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert + \gamma \Vert \varvec{x}_{i_1}^2 - \varvec{x}_{i_1}^1\Vert + \gamma \Vert \varvec{x}_{i_2}^3 - \varvec{x}_{i_2}^2\Vert . \end{aligned}$$
(24)

So, using an inductive argument, for all \(k \in \mathbb {N}\),

$$\begin{aligned} \left\| \nabla _i f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le c_4 \Vert \varvec{x}_i^1-\varvec{x}_i^0\Vert + \gamma \sum _{\nu =2}^k \Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert . \end{aligned}$$
(25)

Thus, by the definition of \(c_4\) in (17), for all \(k \in \mathbb {N}\),

$$\begin{aligned} \left\| \nabla _i f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_0)} \mu _{i,j_0,\ell } \nabla g_{i,j_0,\ell }(\varvec{x}_i^1) \right\| \le c_4 \sum _{\nu =1}^k \Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert . \end{aligned}$$
(26)

Let us now get rid of the simplifying assumption \(i_0 = i\) and assume that the set of indices k at which \(i_k=i\) is \(k_1< k_2 < \dots\). Renaming the indices in (19) and (26), we get that

$$\begin{aligned} \min \{\mu _{i,j_{k_r},\ell }, -g_{i,j_{k_r},\ell }(\varvec{x}_i^{k_r+1})\} \le \delta \; \text{ for } \; \ell =1,\dots , n_{\mathrm {g}}(i,j_{k_r}) \end{aligned}$$
(27)

and for all \(r=1,2,\dots\) and all \(k=k_r+1,k_r+2,\dots\), in particular for all \(k=k_r+1,k_r+2,\dots ,k_{r+1}\),

$$\begin{aligned} \left\| \nabla _i f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_{k_r})} \mu _{i,j_{k_r},\ell } \nabla g_{i,j_{k_r},\ell }(\varvec{x}_i^{k_r+1}) \right\| \le c_4 \sum _{\nu =k_r+1}^{k} \Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert . \end{aligned}$$
(28)

But, by the definition of the sequence \(k_1, k_2, \dots\), \(\varvec{x}_i^{k_r}\) may change from iteration \(k_r\) to iteration \(k_r+1\) but it does not change from \(k_r+1\) to \(k_r+2\), \(k_r+2\) to \(k_r+3\), until \(k_{r+1}-1\) to \(k_{r+1}\); and it may change again from iteration \(k_{r+1}\) to iteration \(k_{r+1}+1\). This means that, for all \(r=1,2,\dots\), we have that \(\varvec{x}_i^{k_r+1} = \varvec{x}_i^{k_r+2} = \dots = \varvec{x}_i^{k_{r+1}}\). Therefore, (27) and (28) imply that for all \(r=1,2,\dots\) and \(k=k_r+1, k_r+2,\dots ,k_{r+1}\),

$$\begin{aligned} \min \{\mu _{i,j_{k_r},\ell }, -g_{i,j_{k_r},\ell }(\varvec{x}_i^k)\} \le \delta \; \text{ for } \; \ell =1,\dots , n_{\mathrm {g}}(i,j_{k_r}) \end{aligned}$$
(29)

and

$$\begin{aligned} \left\| \nabla _{i} f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,j_{k_r})} \mu _{i,j_{k_r},\ell } \nabla g_{i,j_{k_r},\ell }(\varvec{x}_i^{k}) \right\| \le c_4 \sum _{\nu =k_r+1}^k \Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert . \end{aligned}$$
(30)

Moreover, by definition, \(o(i,k_r+1) = o(i,k_r+2) = \dots = o(i,k_{r+1}) = j_{k_r}\) for \(r=1,2,\dots\). So, from (29) and (30), we get that, for \(k \ge m\), \(\varvec{x}_i^k \in A_{i,o(i,k)}\) and \(\mu _{i,o(i,k),\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i,o(i,k))\) are such that

$$\begin{aligned} \min \{\mu _{i,o(i,k),\ell }, -g_{i,o(i,k),\ell }(\varvec{x}_i^k)\} \le \delta \; \text{ for } \; \ell =1,\dots , n_{\mathrm {g}}(i,o(i,k)) \end{aligned}$$
(31)

and

$$\begin{aligned} \left\| \nabla _{i} f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,o(i,k))} \mu _{i,o(i,k),\ell } \nabla g_{i,o(i,k),\ell }(\varvec{x}_i^{k}) \right\| \le c_4 \sum _{\nu =k_r+1}^k \Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert . \end{aligned}$$
(32)

By Lemma 4.2, given \(\varepsilon > 0\), the number of iterations at which \(\Vert \varvec{x}^\nu - \varvec{x}^{\nu -1}\Vert > \varepsilon /(c_4 m)\) is bounded above by \(( f(\varvec{x}^0) - f_{{\mathrm {bound}}} ) / ( \alpha (\varepsilon /(c_4 m))^2 )\). Then, since the sum on the right-hand side of (30) involves at most m terms, the number of iterations at which the left-hand side of (30) is bigger than \(\varepsilon\) is bounded above by \(( f(\varvec{x}^0) - f_{{\mathrm {bound}}} ) / ( \alpha (\varepsilon /(c_4 m))^2 )\). Moreover, since \(\varepsilon >0\) is arbitrary, taking limits on both sides of (32), we have that

$$\begin{aligned} \lim _{k\rightarrow \infty } \left\| \nabla _{i} f(x^k) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,o(i,k))} \mu _{i,o(i,k),\ell } \nabla g_{i,o(i,k),\ell }(\varvec{x}_i^{k}) \right\| = 0. \end{aligned}$$
(33)

\(\square\)

The final theorem of this section proves worst-case functional complexity of order \(O(\varepsilon ^{-2})\).

Theorem 4.2

Assume that Assumptions A2, A3, A4, and A5 hold and that for all \(k \in \mathbb {N}\), the computation of \(j_k\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) at Step 2 is possible. Let the indices \(o(i,k)\) be as defined in Theorem 4.1. Then, given \(\varepsilon >0\), Algorithm 3.1 performs at most

$$\begin{aligned} n_{\mathrm {blocks}}\left( \frac{f(x^0) - f_{{\mathrm {bound}}}}{\alpha (\varepsilon /(c_4 m))^2} \right) \end{aligned}$$

iterations and at most

$$\begin{aligned} \log _2 \left( \frac{\gamma + c_B + 2 \alpha }{\sigma _{\min }} \right) + 2 \end{aligned}$$

functional evaluations per iteration, where \(f_{{\mathrm {bound}}}\) is an arbitrary lower bound of \(\{f(\varvec{x}^k)\}\) and \(c_4\) is given by (17), to compute an iterate \(x^{k+1}\) such that

$$\begin{aligned}&\varvec{x}^{k+1}_i \in A_{i,o(i,k)},\\&\quad \mu _{i,o(i,k),\ell } \ge 0 \; \text{ and } \; \min \{\mu _{i,o(i,k),\ell }, -g_{i,o(i,k),\ell }(\varvec{x}^{k+1}_i) \} \le \delta , \; \ell = 1,\dots ,n_{\mathrm {g}}(i,o(i,k)), \end{aligned}$$

and

$$\begin{aligned} \left\| \nabla _i f(x^{k+1}) + \sum _{\ell =1}^{n_{\mathrm {g}}(i,o(i,k))} \mu _{i,o(i,k),\ell } \nabla g_{i,o(i,k),\ell }(\varvec{x}^{k+1}_i) \right\| \le \varepsilon \end{aligned}$$

for all \(i=1,\dots ,n_{\mathrm {blocks}}\).

Proof

The proof follows from Theorem 4.1, Lemma 4.1, and the definition of Algorithm 3.1, because the number of functional evaluations per iteration is equal to the number of increases of \(\sigma\) plus one. (This ignores the fact that, except at the first iteration, the value of \(f(x^k)\) can in fact be obtained from the previous iteration, in which case the number of functional evaluations and the number of increases of \(\sigma\) per iteration coincide.) \(\square\)

5 Solving subproblems

In this section we present an algorithm for computing \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) at Step 2 of iteration k of Algorithm 3.1. The well-definedness of the proposed approach requires, in addition to Assumption A1, two additional assumptions. The first one concerns the relation between each set \(\Omega _i\), its covering sets \(A_{i,j}\), and its constitutive constraints \(g_{i,j,\ell }\). This assumption states that, if a point is in the closure of \(A_{i,j}\) and satisfies the constitutive constraints associated with \(A_{i,j}\), then this point necessarily belongs to \(\Omega _i\).

Assumption A6

For all \(i \in \{1,\dots ,n_{\mathrm {blocks}}\}\) and all \(j \in \{1,\dots ,n_{\mathrm {ops}}(i)\}\), if \(\varvec{x}_i \in \overline{A_{i,j}}\) and \(g_{i,j,\ell }(\varvec{x}_i) \le 0\) for \(\ell =1,\dots ,n_g(i,j)\), then \(\varvec{x}_i \in \Omega _i\).

To illustrate the necessity of Assumption A6, Fig.  2 exhibits a pathological example at which the assumption does not hold. In the figure, the set \(\Omega _1\) (in red) is the union of the segments \(S_{h} = \{(t,2) \;|\; t \in [2,4]\}\) and \(S_{v} = \{(2,t) \;|\; t \in [-1,2]\}\). Defining \(n_{\mathrm {ops}}(1)=2\) and the open sets \(A_{1,1} = \{(x_1,x_2) \;|\; (x_1-2)^2 + x_2^2 < 1\}\) and \(A_{1,2} = \{(x_1,x_2) \;|\; x_2 > 1/2\}\), we have that \(\Omega _1 \subset \cup _{j=1}^{n_{\mathrm {ops}}(1)} A_{1,j}\). The open ball \(A_{1,1}\) is displayed in the figure, while the open half-space \(A_{1,2}\) is not. Inside the open ball \(A_{1,1}\), the set \(\Omega _1\) can be described by the deliberately unorthodox choice \((x_1-1)(x_1-2)=0\), while inside the open half-space \(A_{1,2}\), the set \(\Omega _1\) can be described by the equality \((x_1-2)(x_2-2)=0\) plus the inequalities \(2 \le x_1 \le 4\) and \(1/2 \le x_2 \le 2\). Therefore, (3) is satisfied for \(i=1\) and \(j=1,\dots ,n_{\mathrm {ops}}(1)\) by defining \(n_g(1,1)=2\), \(n_g(1,2)=6\), \(g_{1,1,1}(x_1,x_2)=(x_1-1)(x_1-2)\), \(g_{1,1,2}(x_1,x_2)=-g_{1,1,1}(x_1,x_2)\), \(g_{1,2,1}(x_1,x_2)=(x_1-2)(x_2-2)\), \(g_{1,2,2}(x_1,x_2)=-g_{1,2,1}(x_1,x_2)\), \(g_{1,2,3}(x_1,x_2)=2-x_1\), \(g_{1,2,4}(x_1,x_2)=x_1-4\), \(g_{1,2,5}(x_1,x_2)=1/2-x_2\), and \(g_{1,2,6}(x_1,x_2)=x_2-2\). The set of points that satisfy the constitutive constraints of \(\Omega _1\) in \(A_{1,1}\) is shown in light-gray in the figure. The point \((x_1,x_2)^T=(1,0)^T\) belongs to \(\overline{A_{1,1}}\) and satisfies the corresponding constitutive constraints \(g_{1,1,1}(x_1,x_2) \le 0\) and \(g_{1,1,2}(x_1,x_2) \le 0\). However, it does not belong to \(\Omega _1\). Therefore, Assumption A6 does not hold. The equality that describes \(\Omega _1\) within \(A_{1,1}\) was deliberately chosen to construct an example that does not satisfy Assumption A6 and to show that Assumption A6 is necessary. To satisfy Assumption A6 in this example, it would suffice to consider as constitutive constraints of \(\Omega _1\) within \(A_{1,1}\) the two inequalities equivalent to \(x_1=2\) instead of the two inequalities equivalent to \((x_1-1)(x_1-2)=0\).

Fig. 2: A pathological example at which Assumption A6 does not hold
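The failure of Assumption A6 in this construction can be checked directly, as in the sketch below: the point \((1,0)\) lies in the closure of \(A_{1,1}\) and satisfies the two constitutive constraints, but it does not belong to \(\Omega _1\).

```python
def in_closure_A11(x1, x2):
    """Closed ball of radius 1 centered at (2, 0)."""
    return (x1 - 2.0) ** 2 + x2 ** 2 <= 1.0

def g_111(x1, x2):
    """Constitutive constraint (x1 - 1)(x1 - 2) <= 0."""
    return (x1 - 1.0) * (x1 - 2.0)

def g_112(x1, x2):
    """Its negation, so that together they encode (x1 - 1)(x1 - 2) = 0."""
    return -g_111(x1, x2)

def in_Omega1(x1, x2):
    """Union of the horizontal segment S_h and the vertical segment S_v."""
    on_Sh = x2 == 2.0 and 2.0 <= x1 <= 4.0
    on_Sv = x1 == 2.0 and -1.0 <= x2 <= 2.0
    return on_Sh or on_Sv

x1, x2 = 1.0, 0.0
print(in_closure_A11(x1, x2))                           # True
print(g_111(x1, x2) <= 0.0 and g_112(x1, x2) <= 0.0)    # True
print(in_Omega1(x1, x2))                                # False: A6 fails
```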

The second additional assumption states that every global minimizer of a smooth function over \(\Omega _i\) belongs to some \(A_{i,j}\) and satisfies the KKT conditions with respect to the constitutive constraints associated with \(A_{i,j}\).

Assumption A7

For all \(i \in \{1,\dots ,n_{\mathrm {blocks}}\}\), if \(\varvec{x}_i\) is a global minimizer of a smooth function over \(\Omega _i\), then there exists \(j \in \{1,\dots ,n_{\mathrm {ops}}(i)\}\) such that \(\varvec{x}_i \in A_{i,j}\) and \(\varvec{x}_i\) satisfies the KKT conditions with respect to the constitutive constraints associated with \(A_{i,j}\).

To illustrate the necessity of Assumption A7, let us show that it does not hold in the example of Fig. 1. Assume that the global minimizer G of a smooth function \(\psi\) over the set \(\Omega _1\) belongs to the boundary of \(\Omega _1\) and to the ball \(A_{1,4}\), but does not belong to \(A_{1,9} \cup A_{1,10} \cup A_{1,11} \cup A_{1,12}\) and is not the center \(C_{1,4}\) of \(A_{1,4}\). For example, take an adequate infeasible point Q very close to the desired global minimizer G and define \(\psi (P) = \Vert P - Q\Vert ^2\). The global minimizer G belongs only to the open set \(A_{1,4}\) and the gradient \(\nabla \psi (G)\) is nonnull but, according to the definition of the constitutive constraint \(\varphi\) of \(A_{1,4}\), \(\nabla \varphi (G)=0\). Therefore, the KKT condition does not hold in this case. Fortunately, Fig. 1 shows just an arbitrary illustrative example and, as already described in Sect. 2, there are simple ways to define the open sets \(A_{ij}\) and the constitutive constraints in such a way that Assumptions A1, A6, and A7 hold.

In order to pursue Alternative 1, at Step 2 of Algorithm 3.1, \(j_k \in \{1,\dots ,n_{\mathrm {ops}}(i_k)\}\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}} \in A_{i_k,j_k}\) must be found such that (5) holds and such that there exist \(\mu _{i_k,j_k,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i_k,j_k)\) for which (6) and (7) hold. To accomplish this, for j from 1 to \(n_{\mathrm {ops}}(i_k)\), provided it is affordable, we could compute a global minimizer \(\varvec{z}_j^*\) of

$$\begin{aligned} \begin{array}{c} \displaystyle {{\,\mathrm{Minimize}\,}}_{\varvec{x}\in \mathbb {R}^{n_{i_k}}} \nabla _{i_k} f(x^k)^T (\varvec{x}- \varvec{x}_{i_k}^k) + \frac{1}{2}(\varvec{x}- \varvec{x}_{i_k}^k)^T B_k (\varvec{x}- \varvec{x}_{i_k}^k) + \frac{\sigma }{2} \Vert \varvec{x}- \varvec{x}_{i_k}^k \Vert ^2\\[2mm] \text{ subject } \text{ to } \varvec{x}\in \overline{A_{i_k,j}} \text{ and } g_{i_k,j,\ell }(\varvec{x}) \le 0 \text{ for } \ell =1,\dots ,n_g(i_k,j). \end{array} \end{aligned}$$
(34)

If the objective function value at \(\varvec{z}_j^*\) is non-positive and \(\varvec{z}_j^* \in A_{i_k,j}\), then, defining \(j_k=j\), we have that \(\varvec{x}_{i_k}^{{\mathrm {trial}}} = \varvec{z}_j^*\) satisfies (5). If we additionally assume that global minimizers of (34) for every \(j \in \{1,\dots ,n_{\mathrm {ops}}(i_k)\}\) satisfy KKT conditions, then we have that there exist \(\mu _{i_k,j_k,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i_k,j_k)\) for which (6) and (7) hold.

If none of the global minimizers \(\varvec{z}_j^*\) is such that the objective function value at \(\varvec{z}_j^*\) is non-positive and \(\varvec{z}_j^* \in A_{i_k,j}\), then we can define \(j_k=j^a\) and \(\varvec{x}_{i_k}^{{\mathrm {trial}}} = \varvec{z}_{j^b}^*\), where \(\varvec{z}_{j^b}^*\) is the global minimizer (among the \(n_{\mathrm {ops}}(i_k)\) computed global minimizers \(\varvec{z}_1^*, \dots , \varvec{z}_{n_{\mathrm {ops}}(i_k)}^*\)) that achieves the lowest value of the objective function in (34), and \(j^a\) is such that \(\varvec{z}_{j^b}^* \in A_{i_k,j^a}\). The value of the objective function of (34) at \(\varvec{z}_{j^b}^*\) is non-positive because the objective function vanishes at \(\varvec{x}_{i_k}^k\), which is a feasible point of (34) for at least one \(j \in \{1,\dots ,n_{\mathrm {ops}}(i_k)\}\); and \(\varvec{z}_{j^b}^* \in A_{i_k,j^a}\) for some \(j^a\) because, by Assumption A6, \(\varvec{z}_{j^b}^* \in \Omega _{i_k}\) and, by definition, \(\Omega _{i_k} \subset A_{i_k} = \cup _{j=1}^{n_{\mathrm {ops}}(i_k)} A_{i_k,j}\). Moreover, \(\varvec{z}_{j^b}^*\) must also be a global minimizer of (34) with \(j=j^a\). Thus, by Assumption A7, it fulfills the KKT conditions and, therefore, there exist \(\mu _{i_k,j_k,\ell } \ge 0\) for \(\ell = 1,\dots , n_{\mathrm {g}}(i_k,j_k)\) for which (6) and (7) hold.
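The strategy just described can be organized as in the following sketch, where `solve_34` is a hypothetical routine returning a global minimizer of (34) over the closure of a given covering set, `model` evaluates the regularized quadratic objective of (34), and `contains` tests membership in the open set \(A_{i_k,j}\).

```python
def step2_trial(model, solve_34, n_ops, contains):
    """Choose j_k and x_trial following the strategy of Sect. 5 (a sketch).

    model    : callable z -> value of the objective of (34) at z
    solve_34 : callable j -> global minimizer z_j* of (34) over cl(A_{i_k, j})
    n_ops    : number of covering sets of Omega_{i_k}
    contains : callable (j, z) -> True iff z belongs to the open set A_{i_k, j}
    """
    minimizers = [solve_34(j) for j in range(n_ops)]
    # First try: some z_j* with nonpositive model value that lies in its own A_{i_k, j}
    for j, z in enumerate(minimizers):
        if model(z) <= 0.0 and contains(j, z):
            return j, z
    # Fallback: the best minimizer overall, paired with a covering set containing it
    j_best = min(range(n_ops), key=lambda j: model(minimizers[j]))
    z_best = minimizers[j_best]
    for j in range(n_ops):
        if contains(j, z_best):
            return j, z_best
    raise RuntimeError("Assumption A6 violated: best minimizer is not covered")
```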

It is worth noting that the objective function in (34) is a linear function if \(B_k=0\) and \(\sigma =0\); and it is a convex quadratic function if \(B_k + \sigma I\) is positive definite. Moreover, it is always possible to choose the open covering sets \(A_{i,j}\), for all i and j, in such a way that their closures are simple sets such as balls, boxes, or polyhedra. Furthermore, more efficient problem-dependent alternatives may exist for the computation of \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\); and it is also possible to try to find \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) satisfying Alternative 2 when \(\sigma =0\).

The correctness of the strategy proposed in this section depends on the satisfaction of Assumptions A1 to A7. Assumptions A1, A6, and A7 refer to the relationship between the feasible region of problem (1), the open sets, and the constitutive constraints. We have already shown that, for simple nonconvex polygons, these elements can be defined in such a way that the assumptions are satisfied. Assumption A3 requires the model Hessians \(B_k\) to be uniformly bounded. This can be satisfied by definition, for example by choosing \(B_k=0\) at all iterations, or any other option that satisfies the assumption. Likewise, Assumption A5 refers to the way in which the blocks \(i_k\) are chosen at each iteration. This assumption can also be satisfied in many ways, for example with a cyclic choice of \(i_k\). The remaining assumptions (Assumptions A2 and A4) depend on the objective function of problem (1) and their validity will be discussed in Sect. 6.1, when the problem under consideration is introduced.

6 Experiments

In this section we describe numerical experiments using the BCD method. Sect. 6.1 describes what we have called the continuous version of the traveling salesman problem (TSP). In this problem, the BCD method is used to evaluate the merit of each candidate solution (permutation) of the combinatorial problem. Since this evaluation is performed many times along the whole process, this problem provides many experimental applications of BCD. In Sect. 6.2, we describe a simple heuristic and a way to generate a starting point for solving the continuous TSP. Although simple, the considered methods are part of the state of the art of methods used to solve the classical TSP. Moreover, they serve to illustrate the application of the BCD method, which could be used in the same way in combination with any other strategy. Sect. 6.3 describes a problem-dependent way to find an \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) in the BCD method that satisfies the requirements of Alternative 2. Sect. 6.4 describes the computational experiment itself.

6.1 Continuous traveling salesman problem

The traveling salesman problem (TSP) is one of the most studied combinatorial optimization problems, for which a vast literature exists; see, for example, [3] and the references therein. In its classical version, p cities with known pairwise “distances” \(d_{ij}>0\) are given and the problem consists in finding a permutation \(i_1, i_2, \dots , i_p\) that minimizes \(d_{i_p,i_1} + \sum _{\nu =1}^{p-1} d_{i_\nu ,i_{\nu +1}}\). In the present work, we consider a continuous variant of the classical TSP in which “cities” are not fixed and, therefore, their pairwise distances vary. More precisely, given a set of polygons \(\Omega _1, \Omega _2, \dots , \Omega _p\), which may be nonconvex, the problem consists of finding points \(\varvec{x}_i \in \Omega _i\) for \(i=1,\dots ,p\) and a permutation \(i_1, i_2, \dots , i_p\) that minimize \(\Vert \varvec{x}_{i_p} - \varvec{x}_{i_1} \Vert + \sum _{\nu =1}^{p-1} \Vert \varvec{x}_{i_\nu } - \varvec{x}_{i_{\nu +1}} \Vert\). Polygons may be seen as representing countries, regions, districts, or neighborhoods of a city; and the interpretation is that “visiting a polygon” is equivalent to “visiting any point within the polygon”.

The application of the BCD method in this context is very natural. Any method to solve the classical TSP requires evaluating the merit of a permutation \(i_1, i_2, \dots , i_p\) by calculating \(d_{i_p,i_1} + \sum _{\nu =1}^{p-1} d_{i_\nu ,i_{\nu +1}}\). In the variant we are considering, given a permutation \(i_1, i_2, \dots , i_p\), the BCD method is used to find the points \(\varvec{x}_{i_\nu } \in \Omega _{i_\nu }\) for \(\nu =1,\dots ,p\) that minimize \(\Vert \varvec{x}_{i_p} - \varvec{x}_{i_1} \Vert + \sum _{\nu =1}^{p-1} \Vert \varvec{x}_{i_\nu } - \varvec{x}_{i_{\nu +1}} \Vert\). In other words, given a permutation \(i_1, i_2, \dots , i_p\), the BCD method is used to find a solution \(x^*\) to the problem

$$\begin{aligned}&{{\,\mathrm{Minimize}\,}}_{x \in \mathbb {R}^n} f(i_1,\dots ,i_p;x) := \Vert \varvec{x}_{i_p} - \varvec{x}_{i_1} \Vert \nonumber \\&\quad + \sum _{\nu =1}^{p-1} \Vert \varvec{x}_{i_\nu } - \varvec{x}_{i_{\nu +1}} \Vert \text{ subject } \text{ to } \varvec{x}_{i_\nu } \in \Omega _{i_\nu } \text{ for } \nu =1,\dots ,p, \end{aligned}$$
(35)

where \(n_{\mathrm {blocks}}=p\), \(n_i=2\) for \(i=1,\dots ,n_{\mathrm {blocks}}\), \(n=2p\), and \(x = (\varvec{x}_1^T,\dots ,\varvec{x}_p^T)^T\); while the problem as a whole consists in finding a permutation \(i_1^*,\dots ,i_p^*\) such that \(f(i_1^*,\dots ,i_p^*; x^*)\) is as small as possible. That is, the BCD method is integrated into the process of evaluating the merit of a given permutation. With this tool, constructive heuristics and neighborhood-based local searches already developed for the classical TSP can be adapted to the problem under consideration.

Assumption A4 requires that the sequence \(\{f(\varvec{x}^k)\}\) is bounded below. This is clearly valid for the objective function f defined in (35) since, being a sum of distances, it is always nonnegative. Let us now show that Assumption A2 also holds. The closed polygons \(\Omega _1, \dots , \Omega _p\) are such that \(\Omega _{i_1} \cap \Omega _{i_2} = \emptyset\) for all \(i_1 \ne i_2\). Then, there exists \(\eta > 0\) such that, for every trial point computed by the algorithm, the distance between any two blocks of coordinates is greater than \(\eta\). Consider the auxiliary function \(d: \mathbb {R}^2 \times \mathbb {R}^2 \rightarrow \mathbb {R}\) given by \(d(x,y) = \sqrt{\Vert x-y\Vert ^2 + ( \Vert x-y\Vert ^2 - \eta ^2)^4}\) if \(\Vert x-y\Vert \le \eta\) and \(d(x,y) = \Vert x-y\Vert\) otherwise; and consider the function \(\tilde{f}\) defined as the function f in (35) but replacing each appearance of \(\Vert x-y\Vert\) by \(d(x,y)\). Since \(d(x,y)\) coincides with \(\Vert x-y\Vert\) for every pair of blocks of coordinates computed by the algorithm, the sequence of iterates computed using f as defined in (35) coincides with the sequence of iterates computed if we use \(\tilde{f}\) instead of f. Therefore, we may consider that the objective function of problem (35) is \(\tilde{f}\). Clearly, the new function has continuous second derivatives; therefore its gradient satisfies a Lipschitz condition as required in Assumption A2. (It should be noted that \(\tilde{f}\), being a sum of nonnegative terms, also satisfies Assumption A4.)
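As a sanity check of this construction (and assuming a value of \(\eta\), which in practice would be any positive lower bound on the distance between distinct polygons), the sketch below evaluates the smoothed distance \(d\) and the corresponding objective \(\tilde{f}\); \(d\) coincides with the Euclidean distance whenever \(\Vert x-y\Vert > \eta\).

```python
import numpy as np

eta = 0.1  # assumed lower bound on the distance between distinct polygons

def d(x, y):
    """Smoothed distance used to justify Assumption A2 for problem (35)."""
    r2 = float(np.dot(x - y, x - y))
    if r2 <= eta ** 2:   # i.e., ||x - y|| <= eta
        return np.sqrt(r2 + (r2 - eta ** 2) ** 4)
    return np.sqrt(r2)

def f_tilde(points):
    """Objective of (35) with ||.|| replaced by d: length of the closed tour."""
    p = len(points)
    return sum(d(points[nu], points[(nu + 1) % p]) for nu in range(p))

# d agrees with the Euclidean distance for points farther apart than eta
x, y = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(abs(d(x, y) - 1.0) < 1e-12)  # True
```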

6.2 Discrete optimization strategy

In the present work, among the huge range of possibilities and in order to illustrate the usage of the BCD method, we consider a local search with an insertion-based neighborhood. The initial solution is given by a constructive heuristic also based on insertions, as we now describe; see [1] and the references therein. The construction of the initial guess starts by defining \((i_1,i_2)=(1,2)\) and \(\varvec{x}_{i_1} \in \Omega _{i_1}\) and \(\varvec{x}_{i_2} \in \Omega _{i_2}\) as the points that minimize \(\Vert \varvec{x}_{i_1} - \varvec{x}_{i_2} \Vert\), computed with the BCD method. Then, to construct \((i_1,i_2,i_3)\), the method considers inserting index 3 before \(i_1\), between \(i_1\) and \(i_2\), and after \(i_2\). For each of the three possibilities, optimal \(\varvec{x}_{i_1} \in \Omega _{i_1}\), \(\varvec{x}_{i_2} \in \Omega _{i_2}\), and \(\varvec{x}_{i_3} \in \Omega _{i_3}\) are computed with the BCD method. Among the three permutations, the one with smallest \(\Vert \varvec{x}_{i_1} - \varvec{x}_{i_2} \Vert + \Vert \varvec{x}_{i_2} - \varvec{x}_{i_3} \Vert + \Vert \varvec{x}_{i_3} - \varvec{x}_{i_1} \Vert\) is chosen. The method proceeds in this way until a permutation with p elements, which constitutes the initial guess, is completed. A typical iteration of the local search proceeds as follows. Given the current permutation \((i_1, i_2, \dots , i_p)\) and its associated points \(\varvec{x}_{i_\nu } \in \Omega _{i_\nu }\) for \(\nu =1,\dots ,p\), each \(i_s\) for \(s=1,\dots ,p\) is removed and reinserted at all possible places \(t \ne s\). For each possible insertion, the corresponding points \(\varvec{x}_{i_1}, \varvec{x}_{i_2}, \dots , \varvec{x}_{i_p}\) are computed with the BCD method. This type of movement is also known as relocation and, as mentioned in [1, p. 342], it has been used with great success in the TSP [8]. Once an insertion is found that improves the current solution, the iteration is completed, i.e. the first neighbor that improves the current solution defines the new iterate, in contrast to a “best movement” strategy in which all neighbors are considered and the best of them defines the new iterate. The local search ends when no neighbor is found that improves the current iterate.
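The following sketch outlines this discrete strategy under the assumption that a routine `tour_length` is available which, given a permutation, calls the BCD method to place one point inside each polygon and returns the resulting route length. It is an illustration of the insertion construction and of the first-improvement relocation search, not the Fortran implementation used in the experiments.

```python
def build_initial_tour(p, tour_length):
    """Constructive heuristic: insert each new polygon at its best position."""
    tour = [0, 1]
    for city in range(2, p):
        candidates = [tour[:pos] + [city] + tour[pos:] for pos in range(len(tour) + 1)]
        tour = min(candidates, key=tour_length)
    return tour

def relocation_local_search(tour, tour_length):
    """First-improvement local search: remove one polygon and reinsert it elsewhere."""
    improved = True
    while improved:
        improved = False
        best = tour_length(tour)
        for s in range(len(tour)):
            reduced = tour[:s] + tour[s + 1:]
            for t in range(len(reduced) + 1):
                if t == s:
                    continue  # skip the insertion that reproduces the current tour
                candidate = reduced[:t] + [tour[s]] + reduced[t:]
                if tour_length(candidate) < best:
                    tour, improved = candidate, True
                    break
            if improved:
                break
    return tour
```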

6.3 Finding optimal points with BCD method for a given permutation

In this section we describe how to solve problem (35) with the BCD method. At iteration k of the BCD method, an index \(i_k \in \{1,\dots ,n_{\mathrm {blocks}}\}\) is chosen at Step 1. Then, at Step 2, there are two alternatives. If \(\sigma =0\), then \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) satisfying either Alternative 1, i.e. (5)–(7), or Alternative 2, i.e. (8)–(9), must be computed; while, if \(\sigma >0\), then only Alternative 1 is a possibility. Sect. 5 describes a way of computing \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\) satisfying Alternative 1 for any value of \(\sigma\). However, for the particular problem under consideration, minimizing f(x) as a function of \(\varvec{x}_{i_k} \in \Omega _{i_k}\) reduces to

$$\begin{aligned} {{\,\mathrm{Minimize}\,}}_{\varvec{x}_{i_k} \in \mathbb {R}^2} \Vert a - \varvec{x}_{i_k} \Vert + \Vert \varvec{x}_{i_k} - b \Vert \text{ subject } \text{ to } \varvec{x}_{i_k} \in \Omega _{i_k}, \end{aligned}$$
(36)

where \(a\) and \(b \in \mathbb {R}^2\) stand for the “previous” and the “next” points in the permutation which, in general, correspond to \(\varvec{x}_{i_{k-1}}\) and \(\varvec{x}_{i_{k+1}}\), respectively. Thus, when \(\sigma =0\), it is easy, computationally tractable, and affordable to compute the global minimizer of (36), which clearly satisfies the requirements of Alternative 2. The global minimizer is either on the segment [a, b] intersected with \(\Omega _{i_k}\) (that intersection is given by a finite set of segments) or on the boundary of \(\Omega _{i_k}\), which is also given by a finite set of segments (its edges). Each segment can be parameterized with a single variable \(\lambda \in [0,1]\). Then, the global minimizer of (36) is given by the best among the global minimizers of these simple box-constrained one-dimensional problems. The global minimizer of each box-constrained one-dimensional problem can be computed by brute force up to any desired precision. Moreover, if multiple solutions exist, in order to increase the chance of satisfying (10), the one closest to \(\varvec{x}_{i_k}^k\) should be preferred.
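Each candidate segment can be handled as in the sketch below: the segment is parameterized by \(\lambda \in [0,1]\) and the one-dimensional problem is minimized by brute force on a uniform grid; the grid size is an illustrative choice, not the precision used in the experiments.

```python
import numpy as np

def best_point_on_segment(a, b, v0, v1, n_grid=10001):
    """Brute-force minimization of ||a - x|| + ||x - b|| over the segment [v0, v1]."""
    lam = np.linspace(0.0, 1.0, n_grid)
    points = v0[None, :] + lam[:, None] * (v1 - v0)[None, :]
    values = np.linalg.norm(points - a, axis=1) + np.linalg.norm(points - b, axis=1)
    best = int(np.argmin(values))
    return points[best], values[best]

def solve_36(a, b, segments):
    """Global minimizer of (36): best point over all candidate segments, i.e.,
    the edges of Omega_{i_k} and the pieces of [a, b] contained in Omega_{i_k}."""
    candidates = [best_point_on_segment(a, b, v0, v1) for v0, v1 in segments]
    return min(candidates, key=lambda c: c[1])[0]
```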

6.4 Traveling in São Paulo City

For the numerical experiments, we implemented the discrete optimization strategy described in Sect. 6.2 and the BCD method (Algorithm 3.1) described in Sect. 3 with the strategy described in Sect. 6.3 for the computation of \(\varvec{x}_{i_k}^{{\mathrm {trial}}}\). In Algorithm 3.1, we chose the cyclic rule \(i_k=\mathrm {mod}(k,n_{\mathrm {blocks}})+1\). (Preliminary experiments with a random choice of \(i_k\), with a discrete uniform distribution in \(\{1, 2, \dots , n_{\mathrm {blocks}}\}\), showed similar results.) Based on the theoretical results, we stop the method at iteration k if \(x^k = x^{k-1} = \dots = x^{k-n_{\mathrm {blocks}}+1}\). In the numerical experiments, following [2, 5, 6], we consider \(\alpha =10^{-8}\). In all our experiments the required conditions at Step 2 were satisfied using Alternative 2.

All methods were implemented in Fortran 90. Tests were conducted on a computer with a 3.4 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 RAM, running macOS Mojave (version 10.14.6). The code was compiled with the GFortran compiler of GCC (version 8.2.0) with the -O3 optimization directive enabled.

The city of São Paulo, with an area of more than 1,500 square kilometers and more than 12 million inhabitants, is the most populous city in Brazil, the American continent, the Portuguese-speaking countries, and the entire southern hemisphere. It is administratively divided into thirty-two regions, each of which, in turn, is divided into districts, the latter sometimes subdivided into subdistricts (popularly called neighborhoods); see https://pt.wikipedia.org/wiki/S%C3%A3o_Paulo. The city has a total of 96 neighborhoods. The problem considered here consists in finding a shortest route to visit all of them.

The construction of the problem started by downloading a political-administrative map of the city from the city hall website; see http://geosampa.prefeitura.sp.gov.br/. The map describes each neighborhood as a polygon. The polygon with the largest number of vertices has 5,691 vertices and, all together, the polygons have 156,852 vertices. To turn the problem into something more tractable, we redefined the polygons with number of vertices \(n_v>100\) (all of them in fact) by considering only the vertices with indices of the form \(1 + \lfloor n_v / 50 \rfloor j\) for \(j=0,1,2,\dots\). This way, all polygons were left with a number of vertices between 51 and 57, totaling 4,966 vertices. Moreover, for artistic reasons related to the graphical representation of the problem, we shrank each polygon by 20%. The shrinkage consisted in replacing each vertex \(v_i\) by \(o_i + 0.8 ( v_i - o_i )\), where the offset \(o_i = \frac{1}{2}( x_{\min } + x_{\max }, y_{\min } + y_{\max })^T\), and \((x_{\min },y_{\min })^T\) and \((x_{\max },y_{\max })^T\) correspond to the lower-left and upper-right corners of the smallest rectangle that encloses the polygon. With this procedure we ended up with the \(p=96\) polygons \(\Omega _i\) for \(i=1,\dots ,p\) that determine the problem (35) under consideration; see Fig. 3.
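The preprocessing just described can be summarized by the sketch below, which subsamples the vertices of a polygon and shrinks it toward the center of its bounding box; the polygon is assumed to be given as a NumPy array of its vertices.

```python
import numpy as np

def preprocess_polygon(vertices, target=50, shrink=0.8):
    """Subsample a polygon with n_v > 100 vertices and shrink it by 20%."""
    n_v = len(vertices)
    step = max(1, n_v // target)          # floor(n_v / 50)
    kept = vertices[::step]               # vertices with indices 1 + floor(n_v/50) * j
    lower = vertices.min(axis=0)          # (x_min, y_min)
    upper = vertices.max(axis=0)          # (x_max, y_max)
    offset = 0.5 * (lower + upper)        # center of the enclosing rectangle
    return offset + shrink * (kept - offset)
```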

Fig. 3
figure 3

Representation of the \(p=96\) polygons that determine problem (35). The considered polygons appear in gray, while the original polygons appear in coral in the background, merely to improve the artistic appearance of the drawing

Table 1 shows the details of the optimization process. For each iteration, the table shows the length of the current route and how many neighbors had to be evaluated before finding one that improves the current route. Since each evaluation of a neighbor corresponds to a call to the BCD method, the next two columns show the number of calls to the BCD method per iteration and the number of cycles these calls used; the last two columns show these two values accumulated over the iterations. It can be noted from the table that the BCD method was used to solve more than 200,000 subproblems and that this required, altogether, the execution of more than 3 million cycles, i.e. an average of 15 cycles per subproblem. The instance under consideration has \(p=96\) points. The constructive heuristic used to generate the initial point evaluates (by calling the BCD method) approximately \(\frac{1}{2}p^2\) permutations, while the reinsertion neighborhood evaluates, in the worst case, \(O(p^2)\) neighbors. The first line of the table shows the cost of the constructive heuristic and is consistent with these figures. The remaining lines show that in the first 4 iterations, and in a few intermediate iterations, the method quickly finds a neighbor that improves the current solution. On the other hand, the average number of neighbors evaluated per iteration is 4,164, which corresponds to approximately 45% of the neighbors. The running time of the algorithm is dominated by the cost of computing \(\varvec{x}_{i_k}^{\mathrm {trial}}\). The constructive heuristic used to compute the initial point \(x^0\) took 29.58 seconds of CPU time, while the method as a whole consumed 3,589.31 seconds, i.e. practically one hour of CPU time.
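
Although the details of the discrete strategy belong to Sect. 6.2, the figures in Table 1 are easier to interpret with the following hedged sketch in mind: a relocation (reinsertion) move removes one city from the permutation and reinserts it at another position, each candidate permutation is evaluated by a call to the BCD method that returns the tour length, and the first improving neighbor is accepted. The function tour_length is a placeholder for that call; the first-improvement rule and the loop structure are assumptions consistent with the per-iteration neighbor counts reported in the table, not a transcription of the authors' code.

def relocation_local_search(perm, tour_length):
    # Hedged sketch of the local search: `perm` is a list with the order of the
    # p cities and `tour_length(perm)` calls the BCD method to compute the
    # merit (route length) of that ordering.
    best_len = tour_length(perm)
    improved = True
    while improved:
        improved = False
        for i in range(len(perm)):       # city to be relocated
            for j in range(len(perm)):   # position where it is reinserted
                if j == i:
                    continue             # skip the move that restores the same ordering
                neighbor = perm[:i] + perm[i + 1:]
                neighbor.insert(j, perm[i])
                length = tour_length(neighbor)
                if length < best_len:    # accept the first improving neighbor
                    perm, best_len, improved = neighbor, length, True
                    break
            if improved:
                break
    return perm, best_len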

Table 1 Performance of the heuristic method applied to solve the considered instance of the continuous version of the TSP problem

Figure  4 shows the evolution of the route length over the iterations of the method, while Fig.  5 shows some of the generated routes. Figure 6 shows the final iterate in detail.

Fig. 4
figure 4

Route length as a function of the iteration number

Fig. 5
figure 5

Sample of the routes built throughout the iterative optimization process. The length of each route appears near the route. The map of São Paulo city appears in the background for artistic purposes, but the polygons representing the districts are omitted for the sake of clarity. (a) represents the initial guess given by the constructive heuristic; (f) represents the final iterate (obtained at iteration 51); and (b)–(e) represent the iterates of iterations 10, 20, 30 and 40, respectively. It is worth noting that the red dots, each always within its respective (not displayed) polygon, move from one iteration to another

Fig. 6
figure 6

Final iterate with route length equal to 212,292.01

Up to this point, we have described and illustrated in detail the way in which the discrete heuristic method described in Sect.  6.2 made intensive use of the BCD method to solve problem (35). We close the numerical experiments section by showing in Fig.  7, with a graphic and a table, the iterates of the BCD method for a specific fixed permutation. To make this figure, we considered the permutation given by the constructive heuristic used to generate \(x^0\) and we randomly drew points inside each of the polygons. The graph and table show the iterations of 13 complete cycles. The method actually uses 44 cycles, but the functional value varies only from 229,139.66 at the end of cycle 13 to 229,139.65 at the end of cycle 44, when it stops because every block of variables is repeated from cycle 43 to cycle 44. The initial points are in yellow or light orange and the color changes to red at cycle 13. The evolution of each point is marked with a dotted line whose color changes along with the color of the point. Independently of that, the route determined by the points of each cycle is marked in blue: the lightest blue corresponds to the route given by the initial points (yellow or light orange), and the color of the route gets darker and darker until it reaches the route of cycle 13, marked with the strongest blue. Roughly speaking, the points move a little more in the first 3 cycles, in which the objective function decreases the most, and then there are only small accommodations of the points until the method converges. The authors are aware that the figure is difficult to read; zooming in on the image is recommended to see the details of the evolution of the iterates. In particular, the middle-left part clearly shows how the route is modified as the points move.

Fig. 7
figure 7

Sequence of iterates of the BCD method when applied to random initial points within the polygons, with the order given by the constructive heuristic used to generate the initial point

7 Conclusions

The framework presented in this work could be extended to consider Taylor-like high-order models [4] satisfying well-established regularity assumptions, as was done in [2] for the case of box constraints. However, the theoretical results in [2] reveal that using high-order models in association with Coordinate Descent methods is not worthwhile. The reason is that the overall computational work is dominated by the necessity of obtaining a fast decrease of the distance between consecutive iterates, and high-order models do not help to achieve that purpose.

More interesting, from the practical point of view, is to exploit the particular case in which the constraints that define each \(\Omega _i\) can be expressed as global inequalities and equalities. (Of course, this is a particular case of the one addressed in this paper, corresponding to setting \(n_{\mathrm {ops}}(i)=1\) for all \(i=1,\ldots ,n_{\mathrm {blocks}}\).) In this case the obvious choice for solving the subproblems is to use some well-established constrained optimization software. From the theoretical point of view there is nothing to be added, since practical methods for constrained optimization may fail for different reasons, leading to the abrupt interruption of the overall optimization process. However, we have no doubt that in many practical problems the standard constrained optimization approach associated with block coordinate descent should be useful.

The reason why, in this paper, we considered feasible sets \(\Omega _i\) with the local constrained structure defined by open covering sets and constitutive constraints is not strictly related to block coordinate methods. In fact, in contact with several practical problems (an example of which is the one presented in Sect.  6), we observed that the non-global structure of constraints is not unusual and requires specific handling. We believe that approaches different from the one suggested in this paper are possible, most of them motivated by the particular structure of the practical problems at hand. Further research on this subject is expected in the coming years.

In the present work, a variant of the classical TSP was introduced in which each city can move freely within a given feasible set. In this context, the proposed BCD method was used to evaluate the merit function, namely the tour length for a given ordering of the cities. (Recall that computing the tour length requires determining the optimal position of each city within its own feasible region.) To illustrate the practical use of the BCD method, a basic strategy for solving this TSP variant (local search with a neighborhood determined by relocation moves) was implemented. As future work, state-of-the-art methods for solving the classical TSP could be adapted to the problem introduced in this work.