1 Introduction

The celebrated top99 Matlab code developed by Sigmund (2001) has certainly promoted the spread of topology optimization among engineers and researchers, and the speedups carried by its heir, top88 (Andreassen et al. 2011), substantially increased the scale of examples that can be solved on a laptop.

Following in these footsteps, several other codes have appeared, extending the approach to 3D problems (Liu and Tovar 2014; Amir et al. 2014), material design (Andreassen and Andreasen 2014; Xia and Breitkopf 2015), level-set parametrizations (Wang 2007; Challis 2010), advanced discretization techniques (Talischi et al. 2012; Suresh 2010; Sanders et al. 2018), or the integration of TO within finite element frameworks.

With the evolution of TO and its application to more and more challenging problems, some implementations in top88 may have become outdated. Also, Matlab has improved in the last decade. Hence, we believe it is time to present a new “exemplary” code collecting shortcuts and speedups, making it possible to tackle medium- and large-scale TO problems efficiently on a laptop. Preconditioned iterative solvers, applied for example in Amir and Sigmund (2011), Amir et al. (2014), Ferrari et al. (2018) and Ferrari and Sigmund (2020), allow the solution of the state equation with nearly optimal efficiency (Saad 1992). Thus, the computational bottleneck has shifted to other operations, such as the matrix assembly or the repeated application of filters. Efficiency improvements for these operations were touched upon by Andreassen et al. (2011), however, without a quantitative analysis of the time and memory savings.

Here, we provide compact Matlab codes for minimum compliance topology optimization of 2D and 3D continua which show a substantial speedup compared to the top88 code. We include several extensions by default, such as specification of passive domains, a volume-preserving density projection (Guest et al. 2004; Wang et al. 2011) and continuation strategies for the penalization and projection parameters in a very compact, yet sharp, implementation. Coincidentally, the new 2D TO implementation consists of 99 lines of code and is thus named top99neo. We also show how to include an acceleration technique recently investigated for TO by Li et al. (2020), with a few extra lines of code and potentially carrying major speedups. Changes needed for the extension to 3D problems are remarkably small, making the corresponding code (top3D125) the most compact and efficient Matlab implementation for 3D compliance TO to date.

Our primary goal is not to present innovative new research. Rather, we aim at sharing some shortcuts and speedups that we have noticed over time, to the benefit of the research community. The improvements introduced by the present codes will be even more useful for advanced problems, such as buckling optimization, which will be dealt with in an upcoming work.

The paper is organized as follows. In Section 2, we recall the setting of TO for minimum compliance. Section 3 is devoted to describing the overall structure of the 2D code, focusing on differences with respect to top88. Sections 3.1–3.5 give insights about the main speedups and show performance improvements with respect to top88. The very few changes needed for the 3D code are listed in Section 4, where an example is presented and the efficiency is compared to the previous code from Amir et al. (2014). Some final remarks are given in Section 5. Appendix A gives some details about the redesign step that are useful for better understanding a method proposed in Section 3.2, and the Matlab codes are listed in Appendices B and C.

2 Problem formulation and solution scheme

We consider a 2D/3D discretization Ωh consisting of m equi-sized quadrilateral (2D) or hexahedral (3D) elements Ωe. Hereafter we denote by n the global number of Degrees of Freedom (DOFs) in the discretization and by d the number of (local) DOFs of each element.

Let x = {xe}e= 1:m ∈ [0,1]m be partitioned between \(\mathbf {x}_{\mathcal {A}}\) and \(\mathbf {x}_{\mathcal {P}}\), the sets of active (design) variables and passive elements, respectively. The latter may be further split into the sets of passive solid \(\mathcal {P}_{1}\) (xe = 1) and void \(\mathcal {P}_{0}\) (xe = 0) elements, of cardinalities \(m_{\mathcal {P}_{1}}\) and \(m_{\mathcal {P}_{0}}\), respectively (see Fig. 1a).

Fig. 1

Definition of the active \(\mathcal {A}\), passive solid \(\mathcal {P}_{1}\), and void \(\mathcal {P}_{0}\) domains (a) and illustration of the connectivity matrix C for a simple discretization (b). The set of indices \(\mathcal {I}\), here shown for the element e = 1, is used by the assembly operation. The symmetric repetitions in \(\mathcal {I}\) are highlighted, and their elimination gives the reduced set \(\mathcal {I}_{r}\) (see Section 3.1)

The set of physical variables \(\hat {\mathbf {x}}_{\mathcal {A}} = {\mathscr{H}}(\tilde {\mathbf {x}})\) is defined by the relaxed Heaviside projection (Wang et al. 2011)

$$ \mathcal{H}(\tilde{x}_{e}, \eta, \upbeta ) = \frac{\tanh(\upbeta\eta) + \tanh(\upbeta(\tilde{x}_{e}-\eta))}{\tanh(\upbeta\eta)+\tanh(\upbeta(1-\eta))} $$
(1)

with threshold η and sharpness factor β, where \(\tilde {\mathbf {x}} = H\mathbf {x}\) is the filtered field, obtained by the linear operator

$$ H\left( x_{e}, r_{\min} \right) := \frac{{\sum}_{i \in \mathcal{N}_{e}}h_{e,i}x_{i}} {{\sum}_{i \in \mathcal{N}_{e}}h_{e,i}} $$
(2)

where \(\mathcal {N}_{e} = \{ i \mid \text {dist}({\Omega }_{i}, {\Omega }_{e} ) \leq r_{\min \limits } \}\) and \(h_{e,i} = \max \limits (0, r_{\min \limits } - \text {dist}({\Omega }_{i}, {\Omega }_{e} ) )\).
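For later reference, the projection (1) maps directly onto a Matlab anonymous function; a minimal sketch (the handle name prj is illustrative, cf. PRE.4) in Section 3):

```matlab
% relaxed Heaviside projection (1), acting element-wise on the filtered field
prj = @(v,eta,beta) (tanh(beta*eta) + tanh(beta*(v - eta))) ./ ...
                    (tanh(beta*eta) + tanh(beta*(1 - eta)));
```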

Given a load vector \(\mathbf {f}\in \mathbb {R}^{n}\) and the volume fraction f ∈ (0,1), we consider the optimization problem

$$ \begin{cases} \min\limits_{\mathbf{x}_{\mathcal{A}}\in\left[ 0, 1 \right]^{m_{\mathcal{A}}}} & c\left( \hat{\mathbf{x}} \right) \\ \text{s.t.} & V\left( \hat{\mathbf{x}} \right) \leq f|{\Omega}_{h}| \end{cases} $$
(3)

for the minimization of compliance \(c\left (\hat {\mathbf {x}}\right ) = \mathbf {u}^{T}\mathbf {f}\) with an upper bound on the overall volume

$$ V\left( \hat{\mathbf{x}} \right) = \sum\limits^{m}_{e=1} |{\Omega}_{e}| \hat{x}_{e} = \frac{1}{m} \left( m_{\mathcal{P}_{1}} + \sum\limits_{e\in\mathcal{A}} \hat{x}_{e} \right) \leq f $$
(4)

Problem (3) is solved with a nested iterative loop. At each iteration, the displacement u is computed by solving the equilibrium problem

$$ K\mathbf{u} = \mathbf{f} $$
(5)

where the stiffness matrix \(K = K(\hat {\mathbf {x}})\) depends on the physical variables through a SIMP interpolation (Bendsøe and Sigmund 1999) of Young's modulus

$$ E(\hat{x}_{e}) = E_{\min} + \hat{x}^{p}_{e}(E_{0} - E_{\min}) $$
(6)

with E0 and \(E_{\min \limits }\) the moduli of solid and void (\(E_{\min \limits } \ll E_{0}\)). The gradients of compliance and structural volume with respect to \(\hat {\mathbf {x}}\) read (χe = 1 if \(e\in \mathcal {A}\) and 0 otherwise, and 1m is the vector of ones of dimension m)

$$ \nabla_{\hat{\mathbf{x}}} c\left( \hat{\mathbf{x}} \right) = - {\mathbf{u}}^{T}\nabla_{\hat{\mathbf{x}}}K\mathbf{u}\chi_{\mathcal{A}} , \quad \nabla_{\hat{\mathbf{x}}} V\left( \hat{\mathbf{x}}\right) = \frac{1}{m}\mathbf{1}_{m}\chi_{\mathcal{A}} $$
(7)

and the sensitivities with respect to the design variables are recovered as

$$ \begin{aligned} \nabla_{\mathbf{x}} c\left( \mathbf{x} \right) & = \nabla_{\tilde{\mathbf{x}}}\mathcal{H} \odot (H^{T}\nabla_{\hat{\mathbf{x}}} c\left( \hat{\mathbf{x}} \right)) \\ \nabla_{\mathbf{x}} V\left( \mathbf{x} \right) & = \nabla_{\tilde{\mathbf{x}}}\mathcal{H} \odot (H^{T}\nabla_{\hat{\mathbf{x}}} V\left( \hat{\mathbf{x}} \right)) \end{aligned} $$
(8)

where ⊙ represents the element-wise multiplication and

$$ \nabla_{\tilde{\mathbf{x}}}\mathcal{H} = \upbeta\frac{1-\tanh(\upbeta(\tilde{\mathbf{x}}-\eta))^{2}} {\tanh(\upbeta\eta)+\tanh(\upbeta(1-\eta))} $$
(9)

The active design variables \(e\in \mathcal {A}\) are then updated by the optimality criterion rule (Sigmund 2001)

$$ x_{k+1,e} = \mathcal{U}(x_{k, e}) = \begin{cases} \delta_{-} & \text{if } \mathcal{F}_{k,e} < \delta_{-} \\ \delta_{+} & \text{if } \mathcal{F}_{k,e} > \delta_{+} \\ \mathcal{F}_{k,e} & \text{otherwise} \end{cases} $$
(10)

where \(\delta _{-} = \max \limits (0, x_{k, e} - \mu )\), \(\delta _{+} = \min \limits (1, x_{k, e} + \mu )\), for the fixed move limit μ ∈ (0,1) and

$$ \mathcal{F}_{k,e} = x_{k, e} \left( - \frac{\partial_{e} c_{k}} {\tilde{\lambda}_{k}\partial_{e} V_{k}}\right)^{1/2} $$
(11)

depends on the element sensitivities.

In (11), \(\tilde {\lambda }_{k}\) is the approximation to the current Lagrange multiplier \(\lambda ^{\ast }_{k}\) associated with the volume constraint. This is obtained by imposing \(V(\hat {\mathbf {x}}_{k+1}(\tilde {\lambda })) - f|{\Omega }_{h}| \approx 0\), e.g., by bisection on an interval \({\Lambda }^{(0)}_{k} \ni \lambda ^{\ast }_{k}\).

3 Matlab implementation and speedups

The Matlab routine for 2D problems (see Appendix B) is called with the following arguments

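i.e., with the signature (reconstructed from the argument list described below):

```matlab
top99neo(nelx,nely,volfrac,penal,rmin,ft,ftBC,eta,beta,move,maxit)
```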

where nelx and nely define the physical dimensions and the mesh resolution, volfrac is the allowed volume fraction on the overall domain (i.e., \(\mathcal {A}\cup \mathcal {P}\)), penal the penalization used in (6), and rmin the filter radius for (2). The parameter ft is used to select the filtering scheme: density filtering alone if ft= 1, whereas ft= 2 or ft= 3 also allows the projection (1), with eta and beta as parameters. ftBC specifies the filter boundary conditions ('N' for zero-Neumann or 'D' for zero-Dirichlet), move is the move limit used in the OC update and maxit sets the maximum number of redesign steps.

The routine is organized into a set of operations that are performed only once, followed by the iterative TO redesign loop. The initializing operations are grouped as follows

$$ \begin{aligned} &\texttt{PRE.1) MATERIAL AND CONTINUATION PARAMETERS} \\ &\texttt{PRE.2) DISCRETIZATION FEATURES}\\ &\texttt{PRE.3) LOADS, SUPPORTS AND PASSIVE DOMAINS} \\ &\texttt{PRE.4) DEFINE IMPLICIT FUNCTIONS} \\ &\texttt{PRE.5) PREPARE FILTER} \\ &\texttt{PRE.6) ALLOCATE AND INITIALIZE OTHER PARAMETERS} \end{aligned} $$

and below we give details only about parameters and instructions not found in the top88 code.

To apply continuation on the generic parameter “par,” a data structure is defined

$$ \texttt{parCont = \{istart, maxPar, isteps, deltaPar\};} $$

such that the continuation starts when loop=istart and the parameter is increased by deltaPar every isteps iterations, up to the value maxPar. This is implemented in Lines 6 and 7 for the penalization parameter p and the projection factor β, respectively. The update is then performed by the instruction (see Line 92)

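A sketch consistent with this description (the handle cnt and the exact form below are reconstructed, not copied from the listing in Appendix B):

```matlab
% apply continuation: par grows by deltaPar every isteps iterations,
% once loop >= istart and while par < maxPar
cnt = @(par,parCont,loop) par + (loop >= parCont{1}) .* ...
      (par < parCont{2}) .* (mod(loop,parCont{3}) == 0) .* parCont{4};
[penal,beta] = deal(cnt(penal,penalCnt,loop), cnt(beta,betaCnt,loop));
```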

making use of compact logical operations. Continuation can be switched off, e.g., by setting maxPar<=par, or istart>=maxit.

The blocks defining the discretization (PRE.2) contain some changes compared to top88. The number of elements (nEl), DOFs (nDof), and the set of node numbers (nodeNrs) are defined explicitly, to ease and shorten some subsequent instructions. The setup of the indices iK and jK, used for the sparse assembly, is performed in Lines 15–21 and follows the concept detailed in Section 3.1. The coefficients of the lower triangular part of the elemental stiffness matrix are defined in vectorized form, such that \(\texttt {Ke} = \mathcal {V}(K^{(s)}_{e})\) (see Lines 22–26). Ke is used for the assembly strategy described in Section 3.1. However, in Lines 27–29, we also recover the complete elemental matrix (Ke0), used to perform the double product \(\mathbf {u}^{T}_{e}K_{e}\mathbf {u}_{e}\) when computing the compliance sensitivity (7). Although this could also be written in terms of the matrix \(K^{(s)}_{e}\) only, that option would increase the number of matrix/vector multiplications.

In PRE.3), the user can specify the sets of restrained (fixed) and loaded (lcDof) DOFs and the passive regions (\(\mathcal {P}_{1} \leftrightarrow \texttt {pasS}\) and \(\mathcal {P}_{0} \leftrightarrow \texttt {pasV}\)) for the given configuration. Supports and loads are defined as in the top88 code, whereas passive domains may be specified by targeting a set of columns and rows of the array elNrs. Independently of the particular example, Lines 34–36 define the vector of applied loads, the set of free DOFs, and the set of active \(\mathcal {A} \leftrightarrow \texttt {act}\) design variables.

In order to make the code more compact and readable, operations which are repeatedly performed within the TO optimization loop are defined as anonymous functions in PRE.4) (Lines 38–43). The filter operator is built in PRE.5) making use of the built-in Matlab function imfilter, which is a much more efficient alternative to the explicit construction of the neighborhood array. A similar approach was already outlined by Andreassen et al. (2011), pointing to the Matlab function conv2, which is however not completely equivalent to the original operator, as it only allows zero-Dirichlet boundary conditions for the convolution. Here, we choose imfilter, which is essentially as efficient as conv2, but gives the flexibility to specify zero-Dirichlet (default option) or zero-Neumann boundary conditions.
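A minimal sketch of this construction, assuming bcF holds the padding option (0 for zero-Dirichlet, 'symmetric' for zero-Neumann):

```matlab
% density filter (2) applied through 2D convolution with the conic kernel h
[dy,dx] = meshgrid(-ceil(rmin)+1 : ceil(rmin)-1);
h  = max(0, rmin - sqrt(dx.^2 + dy.^2));                % weights h_{e,i}
Hs = imfilter(ones(nely,nelx), h, bcF);                 % normalization term
xTilde = imfilter(reshape(x,nely,nelx), h, bcF) ./ Hs;  % filtered field
```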

Some final initializations and allocations are performed in PRE.6). The design variables are initialized with the modified volume fraction, accounting for the passive domains (Lines 52–53), and the constant volume sensitivity (7) is computed in Line 51.

Within the redesign loop, the following five blocks of operations are repeatedly performed

$$ \begin{aligned} &\texttt{RL.1) COMPUTE PHYSICAL DENSITY FIELD} \\ &\texttt{RL.2) SETUP AND SOLVE EQUILIBRIUM EQUATIONS}\\ &\texttt{RL.3) COMPUTE SENSITIVITIES} \\ &\texttt{RL.4) UPDATE DESIGN VARIABLES AND APPLY CONTINUATION}\\ &\texttt{RL.5) PRINT CURRENT RESULTS AND PLOT DESIGN} \end{aligned} $$

In block RL.1), the physical field is obtained by applying the density filter and, if selected, also the projection. If ft= 3, the special value of the threshold eta giving a volume-preserving projection is computed, as discussed in Section 3.2.

The stiffness interpolation and its derivative (sK, dsK) are defined, and the stiffness matrix is assembled (see Lines 73–76). Ideally, one could also get rid of Lines 73–74 and directly define sK in Line 75 and dsK within Line 79. However, we keep these operations apart to enhance the readability of the code and to ease the specification of different interpolation schemes. Equation (5) is solved on Line 77 using the Matlab function decomposition, which can work with only half of the stiffness matrix (see Section 3.1). The sensitivity of compliance is computed, and the back-filtering operations (8) are performed in RL.3).
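A sketch of the solve, assuming K stores only the lower triangular part K(s) and that F and free follow the naming used above:

```matlab
% factorize once, telling Matlab to use only the lower triangular part
dK = decomposition(K(free,free), 'chol', 'lower');
U(free) = dK \ F(free);   % solve the equilibrium system (5)
```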

The update (10), with the nested application of the bisection process for finding \(\tilde {\lambda }_{k}\), is implemented in RL.4) (Lines 86–91), and we remark that lm represents \(\sqrt {\lambda }\).

Some information about the process is printed and the current design is plotted in RL.5) (Lines 94–97). On small discretizations, repeated plotting operations absorb a significant fraction of the CPU time (e.g., 15% for m = 4800). Therefore, one might just plot the final design, moving Lines 96–97 outside the redesign loop.

The tests in the following have been run on a laptop equipped with an Intel(R) Core(TM) i7-5500U @ 2.40 GHz CPU, 15 GB of RAM, and Matlab 2018b running in serial mode under Ubuntu 18.04 (but a similar performance is expected in Windows setups). We will often refer to the half MBB beam example (see Fig. 2) for numerical testing. Unless stated otherwise, we choose Ωh = 300 × 100, f = 0.5, and \(r_{\min \limits } = 8.75\) (Sigmund 2007). The load, having total magnitude |q| = 1, is applied to the first node. No passive domains are introduced for this example; therefore, pasS=[];, pasV=[]; and we set E0 = 1, Emin = 10⁻⁹, and ν = 0.3 in all the tests.

Fig. 2

Geometrical setting for the MBB example

3.1 Speedup of the assembly operation

In top88, the assembly of the global stiffness matrix is performed by the built-in Matlab function sparse

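which, in the top88 listing, reads:

```matlab
K = sparse(iK, jK, sK);   % triplet-based assembly of the full matrix
K = (K + K') / 2;         % symmetrize to get rid of roundoff errors
```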

where \(\texttt {sK}\in \mathbb {R}^{m*d^{2}\times 1}\) collects the coefficients of all the elemental matrices in a column-wise vectorized form (i.e., \(\mathcal {V}(K_{e})\)) and iK and jK are the sets of indices mapping each sK(i) to the global location K(iK(i),jK(i)).

These two sets are set up through the operations

$$ \texttt{iK} = \mathcal{V}\left[(C \otimes \mathbf{1}_{d})^{T}\right] \ , \qquad \texttt{jK} = \mathcal{V}\left[(C \otimes \mathbf{1}^{T}_{d})^{T} \right] $$
(12)

where C[m×d] is the connectivity matrix and “⊗” is the Kronecker product (Horn and Johnson 2012). The size of the array \(\mathcal {I} = [ \texttt {iK}, \texttt {jK} ] \in \mathbb {N}^{m*d^{2}\times 2}\) grows very quickly with the number of elements m, especially for 3D discretizations (see Table 1), and even though its elements are integers, the sparse function requires them to be specified as double precision numbers. The corresponding memory burden slows down the assembly process and restricts the size of problems workable on a laptop.

Table 1 Number of entries in the array \(\mathcal {I}\) and corresponding memory requirement for the 2D and 3D test discretizations. White background refers to the F strategy with coefficients specified as double, cyan background to the H strategy, and light green to the H strategy with elements specified as int32. The H strategy cuts \(|\mathcal {I}|\) and memory by ≈ 44% in 2D and ≈ 48% in 3D. Specifying the indices as int32 then cuts the memory by another 50%

The efficiency of the assembly can be substantially improved by

  1. Acknowledging the symmetry of both Ke and K;

  2. Using an assembly routine working with iK and jK specified as integers.

To understand how to take advantage of the symmetry of matrices, we refer to Fig. 1b and to the connectivity matrix C. Each coefficient \(C_{ej}\in \mathbb {N}\) addresses the global DOF targeted by the j th local DOF of element e. Therefore, (12) explicitly reads

$$ \begin{aligned} \texttt{iK}^{e} & = \{\underbrace{\mathbf{c}_{e}, \mathbf{c}_{e}, \ldots, \mathbf{c}_{e}}_{d \ \text{times}} \} \\ \texttt{jK}^{e} & = \{ \underbrace{c_{e1}, \ldots, c_{e1}}_{d \ \text{times}}, \underbrace{c_{e2}, \ldots, c_{e2}}_{d \ \text{times}}, \ldots, \underbrace{c_{ed}, \ldots, c_{ed}}_{d \ \text{times}}\} \end{aligned} $$
(13)

where \(\mathbf {c}_{e} = \left \{ c_{e1}, c_{e2}, \ldots , c_{ed} \right \}\) is the row corresponding to element e.

If we only consider the coefficients of the (lower) symmetric part of the elemental matrix \(K^{(s)}_{e}\) and their locations in the global one K(s), the set of indices can be reduced to

$$ \begin{aligned} \texttt{iK}^{e} & = \{ c_{e1}, \ldots, c_{ed}, c_{e2}, \ldots, c_{ed}, c_{e3}, \ldots, c_{ed}, \ldots, c_{ed} \} \\ \texttt{jK}^{e} & = \{ \underbrace{c_{e1}, \ldots, c_{e1}}_{d \ \text{times}}, \underbrace{c_{e2}, \ldots, c_{e2}}_{(d-1) \ \text{times}}, \underbrace{c_{e3}, \ldots, c_{e3}}_{(d-2) \ \text{times}}, \ldots, c_{ed} \} \end{aligned} $$
(14)

and the overall indexing array becomes \(\mathcal {I}_{r} = [\texttt {iK},\texttt {jK}] \in \mathbb {N}^{\tilde {d}*m\times 2}\) where \(\tilde {d} = {\sum }^{d}_{j=1} j = d(d+1)/2\). The entries of the indexing array and the memory usage are reduced by approx. 45% (see Table 1).

The set of indices (14) can be constructed by the following instructions (see Lines 15–21)

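A sketch matching this description (cMat denotes the connectivity matrix C in the code; details may differ from the listing in Appendix B):

```matlab
% pattern of the lower triangular part of the d x d elemental matrix
[sI,sII] = deal([]);
for j = 1 : d
    sI  = cat(2, sI, j:d);                  % local row indices j,...,d
    sII = cat(2, sII, repmat(j,1,d-j+1));   % local column index j, repeated
end
[iKr,jKr] = deal(cMat(:,sI)', cMat(:,sII)');
Ir = sort([iKr(:), jKr(:)], 2, 'descend');  % enforce iKr(i) >= jKr(i)
```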

which can be adapted to any isoparametric 2D/3D element just by changing the number d of elemental DOFs accordingly. In the attached scripts, based on 4-noded bilinear Q4 and 8-noded trilinear H8 elements, we set d = 8 and d = 24, respectively. The last instruction sorts the indices such that iKr(i) ≥ jKr(i), so that K(s) contains only lower triangular terms.

The syntax K=sparse(iK,jK,sK) now returns the lower triangular matrix K(s) and we remark that the full operator can be recovered by

$$ K = K^{(s)} + (K^{(s)})^{T} - \text{diag}[K^{(s)}] $$
(15)

which costs as much as the averaging operation \(\frac {1}{2}(K+K^{T})\), performed in top88 to get rid of roundoff errors. However, the Matlab built-in Cholesky solver and the corresponding decomposition routine can use just K(s), if called with the option 'lower'.
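In Matlab, (15) amounts to a one-liner; a sketch, with Ks denoting the lower triangular K(s):

```matlab
K = Ks + tril(Ks,-1)';   % add the transposed strictly lower part, cf. (15)
```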

Point 2 gives the most dramatic improvement and can be accomplished by using routines developed by independent researchers. The sparse2 function, from SuiteSparse (Davis 2019), was already pointed out by Andreassen et al. (2011) as a better alternative to the built-in Matlab sparse; however, no quantitative comparisons were given. According to the CHOLMOD reference manual (Davis 2009), sparse2 works exactly as sparse, but allows the indices iK and jK to be specified as integers (accomplished by defining this type for the connectivity matrix; see Lines 11 and 13).

Here we suggest the fsparse routine, developed by Engblom and Lukarski (2016). Besides working with integer iK and jK, the function enhances the efficiency of the sparse assembly by a better sorting of the operations. From our experience on a single-core process, fsparse gives a speedup of 170–250% compared to sparse2, and is also highly parallelizable (Engblom and Lukarski 2016). Defining the sets iK and jK as int32 type drastically cuts the memory requirements, while still being able to address n ≈ 2.1 ⋅ 10⁹ numbers, far beyond the size of problems one can tackle in Matlab.

In order to use fsparse, one needs to download the “stenglib” library and follow the installation instructions in the README.md file. The packages of the library can be installed by running the “makeall.m” file. As fsparse is contained within the folder “Fast,” one may select only this folder when running makeall.m.
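The call then mirrors sparse, with the int32 index array Ir of Section 3.1 (a sketch):

```matlab
% triplet-based assembly of the lower triangular K^(s) with integer indices
K = fsparse(Ir(:,1), Ir(:,2), sK, [nDof, nDof]);
```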

We test the efficiency of the assembly approaches on 2D and 3D uniform discretizations with m² and m³ elements, respectively. Figure 3 shows time scalings for the different strategies: “F” corresponds to the assembly in top88, “H” takes advantage of the matrix symmetry only, and “H, fsparse” corresponds to the additional use of the fsparse routine (Engblom and Lukarski 2016). All the approaches exhibit a linear scaling of CPU time w.r.t. the number of DOFs. However, half the CPU time can be cut just by assembling K(s) (strategy H, sparse). Therefore, we definitely recommend this to users who aim to solve medium-size (10⁵ to 10⁶ DOFs) structural TO problems on a laptop. However, the most substantial savings follow from using fsparse (Engblom and Lukarski 2016), and by coupling these two strategies (H, fsparse) speedups of 10 times for the 2D and 15 times for the 3D setting can be achieved. It is worth highlighting that a 3D stiffness matrix of size ≈ 9 ⋅ 10⁵ can be assembled in less than a second and even one of size 6.2 ⋅ 10⁶ can be assembled on a laptop in less than 10 s. For this last case, the sole storage of the arrays iK, jK, and sK would cause a memory overflow, ruling out the “F” approach.

Fig. 3

Scaling of assembly time performed with the 3 strategies discussed in Section 3.1. Compared to the standard (F) assembly, the H strategy alone cuts nearly 50% of time and memory, and the additional use of fsparse gives an overall efficiency improvement of 10–15 times

3.2 Speedup of the OC update

The cost of the redesign step \(\mathbf {x}_{k+1} = \mathcal {U}(\mathbf {x}_{k})\) is proportional to the number of bisections (nbs) required for computing the approximation \(\tilde {\lambda }_{k} \approx \lambda ^{\ast }_{k}\). The following estimate (Quarteroni et al. 2000)

$$ n_{\text{bs}} \geq \frac{\log(|{\Lambda}^{(0)}|) - \log(\tau)}{\log(2)} - 1 $$
(16)

is a lower bound to this number for a given accuracy \(\tau > |\lambda ^{\ast }_{k} - \tilde {\lambda }_{k}|\) and it is clear that nbs would decrease if Λ(0), the initial guess for the interval bracketing \(\lambda ^{\ast }_{k}\), could be shrunk. Moreover, the volume constraint should be imposed on the physical field (\(\tilde {\mathbf {x}}\) or \(\hat {\mathbf {x}}\)) and, in the original top88 implementation, this requires a filter application at each bisection step, which may become expensive.

The efficiency of the redesign step can be improved by a two-step strategy

  1. Using volume-preserving filtering schemes;

  2. Estimating the interval \({\Lambda }^{(0)}_{k}\) bracketing the current Lagrange multiplier \(\lambda ^{\ast }_{k}\).

Concerning point 1, the density filter is naturally volume-preserving (i.e., \(V(\mathbf {x}_{k}) = V(\tilde {\mathbf {x}}_{k})\)) (Bourdin 2001; Bruns and Tortorelli 2001). Therefore, the volume constraint can be enforced on V (xk) as long as the density filter alone is considered (ft= 1). The relaxed Heaviside projection (1), on the other hand, is not volume-preserving for a generic η; thus, it would require one filter-and-projection application at each bisection step. However, (1) can also be made volume-preserving by computing, for each \(\tilde {\mathbf {x}}_{k}\), the threshold \(\eta ^{\ast }_{k}\) such that (Xu et al. 2010; Li and Khandelwal 2015)

$$ \eta^{\ast}_{k} \longrightarrow \min\limits_{\eta \in [0, 1]} |V(\hat{\mathbf{x}}_{k}(\eta))-V(\tilde{\mathbf{x}}_{k})| $$
(17)

This can be done, e.g., by the Newton method, starting from the last computed \(\eta ^{\ast }_{k-1}\) and using the derivative of the volume with respect to η

$$ \frac{\partial V(\hat{\mathbf{x}}(\eta))}{\partial \eta} = - \frac{2\upbeta}{m} {\sum}_{i\in\mathcal{A}} \frac{(e^{\upbeta(1-\tilde{x}_{i})}-e^{\upbeta(\tilde{x}_{i}-1)})(e^{\upbeta \tilde{x}_{i}}-e^{-\upbeta \tilde{x}_{i}})} {(e^{\upbeta}-e^{-\upbeta})\left[e^{\upbeta(\tilde{x}_{i}-\eta)}+e^{\upbeta(\eta-\tilde{x}_{i})}\right]^{2}} $$
(18)

Existence of η∗ ∈ [0,1] for all \(\tilde {\mathbf {x}} \in [0, 1]^{m}\) follows from the fact that \(g(\eta ) = V(\hat {\mathbf {x}}(\eta ))-V(\tilde {\mathbf {x}})\) is continuous on [0,1] and g(0)g(1) ≤ 0; uniqueness follows from the fact that \(\frac {\partial g}{\partial \eta } < 0\) for all η ∈ (0,1).
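A minimal Newton sketch for (17), assuming prj implements (1), dVdeta implements (18), and xTilde is the filtered field (all handle names illustrative):

```matlab
eta = etaOld;                                    % warm start from last eta*
g = mean(prj(xTilde,eta,beta)) - mean(xTilde);   % g(eta) = V(xHat) - V(xTilde)
while abs(g) > 1e-6
    eta = eta - g / dVdeta(xTilde,eta,beta);     % Newton step (dV/deta < 0)
    g = mean(prj(xTilde,eta,beta)) - mean(xTilde);
end
```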

Numerical tests on the MBB beam show that generally \(\eta ^{\ast }_{k}\in \left [ 0.4, 0.52\right ]\), with the larger variability occurring for low volume fractions (see Fig. 4a). We also observe that \(\eta ^{\ast }_{k}\) takes values slightly above 0.5 when \(r_{\min \limits }\) is increased or β is raised. Convergence to \(\eta ^{\ast }_{k}\) is generally attained in 1–2 Newton iterations (see Fig. 4a).

Fig. 4

Evolution of the parameter η∗ realizing the equivalence \(V(\tilde {\mathbf {x}}) = V(\hat {\mathbf {x}})\), for different volume fractions f and filter radii \(r_{\min \limits }\) (a) and evolution of the Lagrange multiplier estimate λ# given by (19) compared to λ∗ (b). For both plots, the cumulative number of Newton iterations nNewton (viz. number of bisection steps nbs) is shown against the right axis

The procedure for computing \(\eta ^{\ast }_{k}\) from (17), with tolerance 𝜖 = 10⁻⁶ and initial guess η0 = eta provided by the user, is implemented in Lines 63–67, which are executed if the routine top99neo is called with the parameter ft= 3. Otherwise, if ft= 2, the input threshold eta is kept fixed. In the latter case, the volume constraint should be consistently applied on \(V(\hat {\mathbf {x}})\); otherwise, some violation or over-shooting of the constraint will occur. In particular, if the volume constraint is imposed on x and η is kept fixed, one has \(V(\hat {\mathbf {x}}) > f |{\Omega }_{h}|\) if η < 0.5, and \(V(\hat {\mathbf {x}}) < f |{\Omega }_{h}|\) if η > 0.5.

Even though we usually observed only small differences, these may result in local optima or bad designs, especially for low volume fractions or high β values. Therefore, to account for this more general situation, Lines 87–91 should be replaced by the following

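A sketch of such a replacement, with the bisection on lm = √λ performed on the projected volume (fil applies the filter (2) and prj the projection (1); all names are illustrative, not verbatim from Appendix B):

```matlab
xT  = x(act);
ocP = xT .* sqrt(-dc(act) ./ dV(act));        % OC candidate numerator, cf. (11)
l   = [0, sum(ocP) / (nEl*volfrac)];          % bracket for sqrt(lambda), cf. (19)
while (l(2) - l(1)) / (l(2) + l(1)) > 1e-4
    lmid = 0.5 * (l(1) + l(2));
    x(act) = max(max(min(min(ocP/lmid, xT+move), 1), xT-move), 0);
    if mean(prj(fil(x), eta, beta)) > volfrac % constraint on the projected field
        l(1) = lmid;
    else
        l(2) = lmid;
    end
end
```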

However, there could be other situations when one cannot rely on volume-preserving filters (e.g., when imposing length scale through robust design). Therefore, a more general strategy to reduce the cost of the OC update is to cut the number of bisection steps.

To this end, the selection of the initial bracketing interval \({\Lambda }^{(0)}_{k}\) may build upon the upper bound estimate for \(\lambda ^{\ast }_{k}\) (Hestenes 1969; Arora et al. 1991)

$$ \lambda^{\#}_{k} = \left[ \frac{1}{mf} \sum\limits^{m}_{e=1} x_{k,e} \left( -\frac{\partial_{e}c_{k}} {\partial_{e}V_{k}} \right)^{1/2} \right]^{2} $$
(19)

More details on the derivation of (19) are given in Appendix A. The behavior of the estimate (19) is shown in Fig. 4b for the MBB example. The overall number of bisections (nbs) needed to compute \(\lambda ^{\ast }_{k}\) within the tolerance τ = 10⁻⁸ when considering \({\Lambda }^{(0)}_{k} = [0, \lambda ^{\#}_{k}]\) is cut by about 50%, compared with the one required when starting from Λ(0) = [0,10⁹] as in top88. Moreover, if no projection is applied, (19) could be used together with (10) to perform an explicit primal-dual iteration computing \((\mathbf {x}_{k+1}, \lambda ^{\ast }_{k})\), and this would reduce the number of steps even more (see the green curve in Fig. 4b).

However, in the basic versions of the codes, given in Appendices B and C, we consider the bisection process and (19) is used to bracket the search interval, as this procedure is more general.

3.3 Acceleration of the OC iteration

The update rule (10) resembles a fixed-point (FP) iteration \(\mathbf {x}_{k+1} = \mathcal {U}(\mathbf {x}_{k})\), generating a sequence {xk} converging to a point such that \(\mathbf {r} = \mathcal {U}(\mathbf {x}^{\ast })-\mathbf {x}^{\ast } = \mathbf {0}\).

Several methods are available to speed up the convergence of such a sequence (Brezinski and Chehab 1998; Ramiere and Helfer 2015), many of which belong to the family of quasi-Newton methods (Eyert 1996). The acceleration proposed by Anderson (1965), for instance, is nowadays experiencing a renewed interest (Fang and Saad 2009; Pratapa et al. 2016; Peng et al. 2018) and has recently been applied to TO by Li et al. (2020).

Anderson acceleration takes into account the residuals ri, their differences Δri = ri+1 − ri, and the differences of the updates Δxi = xi+1 − xi for the last mr iterations (i.e., i = k − mr,…, k − 1), and obtains the new element of the vector sequence as

$$ \mathbf{x}_{k+1} = \mathbf{x}^{\#}_{k} + \zeta\mathbf{r}^{\#}_{k} $$
(20)

where ζ ∈ [0,1] is a damping coefficient and

$$ \begin{aligned} \mathbf{x}^{\#}_{k} & = \mathbf{x}_{k} - \sum\limits^{k-1}_{i=k-m_{r}}\gamma^{(k)}_{i} {\Delta} \mathbf{x}_{i} = \mathbf{x}_{k} - X_{k} \boldsymbol{\gamma}_{k} \\ \mathbf{r}^{\#}_{k} & = \mathbf{r}_{k} - \sum\limits^{k-1}_{i=k-m_{r}}\gamma^{(k)}_{i} {\Delta} \mathbf{r}_{i} = \mathbf{r}_{k} - R_{k} \boldsymbol{\gamma}_{k} \end{aligned} $$
(21)

The coefficients \(\gamma ^{(k)}_{i}\) minimize the following

$$ \{\gamma^{(k)}_{i}\}^{m_{r}}_{i=1} \rightarrow \min\limits_{\boldsymbol{\gamma}} \|\mathbf{r}^{\#}_{k}(\boldsymbol{\gamma})\|^{2}_{2} $$
(22)

The rationale behind the method is to compute a rank-mr update of the inverse Jacobian matrix \(J^{-1}_{k}\) of the nonlinear system rk = 0. This has been shown to be equivalent to a multi-secant Broyden method (Eyert 1996; Fang and Saad 2009) starting from \(J^{-1}_{0} = -\zeta I\).

The update rule (20) is usually applied only once every q steps. Thus, we can write more generally xk+1 = xk + zk, where (Pratapa et al. 2016)

$$ \mathbf{z}_{k} = \begin{cases} \alpha \mathbf{r}_{k} & \text{if } \frac{k+1}{q} \notin \mathbb{N} \\ \zeta\mathbf{r}_{k} - \left( X_{k} + \zeta R_{k} \right) \boldsymbol{\gamma}_{k} & \text{if } \frac{k+1}{q} \in \mathbb{N} \end{cases} $$
(23)

with α ∈ (0,1), obtaining the so-called periodic Anderson extrapolation (PAE) (Pratapa et al. 2016; Li et al. 2020).

The implementation can be obtained, e.g., by adding the following few lines after the OC step (Line 91)

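A sketch of such an addition (XH and RH hold the last mr + 1 designs and residuals, initialized empty before the loop; xNew denotes the result of the OC update; all names are illustrative):

```matlab
r = xNew - x(act);                         % residual of the OC map
[XH, RH] = deal([XH, x(act)], [RH, r]);    % append to the history
if size(RH,2) > mr+1, XH(:,1) = []; RH(:,1) = []; end
if loop >= q0 && mod(loop,q) == 0 && size(RH,2) > 1
    x(act) = andersonStep(x(act), r, XH, RH, zeta);  % extrapolated step
else
    x(act) = x(act) + alpha*r;             % damped fixed-point step, cf. (23)
end

function xa = andersonStep(x, r, XH, RH, zeta)
% hypothetical routine: least squares problem (22) via "\", update (20)-(21)
[DX, DR] = deal(diff(XH,1,2), diff(RH,1,2));
gamma = DR \ r;                            % minimizes ||r - DR*gamma||_2
xa = x + zeta*r - (DX + zeta*DR)*gamma;
end
```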

where the part solving (22) and the update has been put in a separate routine for better efficiency.

In the above, we use “∖” for solving the least squares problem (22); however, strategies based on a QR (or SVD) decomposition may be preferred in terms of numerical stability. We refer to Fang and Saad (2009) for a deeper discussion of this point.

In order to assess the effect of different filtering schemes and of parameter continuation, Anderson acceleration is tested on the MBB example, considering the following options:

  T1. Density filter alone, p = 3;

  T2. Density-and-projection filter, with η∗ computed from (17) and β = 2;

  T3. As T2, but with continuation on both β and p, defined by the parameters betaCnt={250,16,25,2} and penalCnt={50,3,25,0.25};

  T4. As T2, but for the discretization Ωh = 600 × 200.

For all the cases, the TO loop stops when \(\|\mathbf {r}_{k}\|_{2}/\sqrt {m} < 10^{-6}\), where the residual is defined with respect to the physical variables (i.e., \(\mathbf {r}_{k} = \tilde {\mathbf {x}}_{k} - \tilde {\mathbf {x}}_{k-1}\) for T1 and \(\mathbf {r}_{k} = \hat {\mathbf {x}}_{k} - \hat {\mathbf {x}}_{k-1}\) for T2–T4). The acceleration is applied every q = 4 steps, considering the last mr = 4 residuals, starting from iteration q0 = 20 for T1–T2 and from q0 = 500 for T3–T4, when both continuations have finished. We set α = 0.9 for the non-accelerated steps. The choice mr = 4 is based on the observation that convergence improvements increase very slowly for mr > 3 (Anderson 1965; Eyert 1996). However, a deeper discussion about the influence of all parameters on the convergence is outside the scope of the present work and we refer to Li et al. (2020) or, in a more general context, to Walker and Ni (2011) for this.

Results are collected in Table 2 and Figs. 5 and 6, showing the evolution of the norm of the residual, the flatness of the normalized compliance Δck/c0 = (ck − ck−1)/c0, and the non-discreteness measure \(m_{\text {ND}}= 100 \cdot 4\mathbf {x}^{T}(1-\mathbf {x})/m\). We observe how Anderson acceleration substantially reduces the number of iterations needed to fulfill the stopping criterion, at the price of just a moderate increase in compliance (0.2–3%). Moreover, starting the acceleration just a few iterations later (e.g., it = 50 or it = 100 for T1) gives much lower compliance values (c = 254.3 and c = 252.9, respectively), and for T3 and T4, where the acceleration is started once the design has stabilized, compliance differences are negligible.

Table 2 Comparison of convergence-related parameters for the standard (T) and accelerated (T-PAE) TO tests, for the MBB example
Fig. 5

Optimized designs obtained without (left column) and with Anderson acceleration (right column) of the TO loop

Fig. 6

Evolution of some parameters related to convergence for the standard and Anderson accelerated TO process. The first row shows the normalized norm of the residual defined on physical variables, the second row shows a measure of the flatness of the objective function and the last row shows the non-discreteness measure

From Fig. 5 it is easy to notice the tendency of PAE to produce designs with a few more bars. This may even give slightly stiffer structures, as for case T3, where the non-accelerated approach removes some bars after it = 2000, whereas stopping at the design of T3–PAE gives a stiffer structure.

A final comment concerns the convergence criterion, which differs from the one in top88 (maximum absolute change of the design variables, \(\| \mathbf {x}_{k+1} - \mathbf {x}_{k} \|_{\infty }\)). Here, we consider it more appropriate to check the residual with respect to the physical design field, and the 2-norm gives a more global measure, less affected by local oscillations.

3.4 Performance comparison to top88

We compare the performance of top99neo to the previous top88 code. In the following, we will refer to “top88” as the original code provided by Andreassen et al. (2011) and to “top88U” as its updated version, making use of the sparse2 function (Davis 2009) for the assembly, with iK and jK specified as integers, and of the filter implemented using conv2.

The codes are tested by running 100 iterations for the MBB beam example (see Fig. 2), for the discretizations 300 × 100, 600 × 200, and 1200 × 400, a volume fraction f = 0.5, and considering mesh-independent filters of radii \(r_{\min \limits } = 4\), 8, and 16, respectively. For top88 and top88U, we only consider density filtering, whereas for the new top99neo, we also consider the Heaviside projection, with η∗ computed as described in Section 3.2. It will be apparent that the cost of this last operation is negligible.

Timings are collected in Table 3, where tit is the average cost per iteration, tA and tS are the overall times spent by the assembly and solver, respectively, and tU is the overall time spent updating the design variables. For top88 and top88U, the latter consists of the OC update and the filtering operations performed when applying the bisection on the volume constraint. For top99neo, this term accounts for the cost of the OC update, of the estimation of the Lagrange multiplier λ∗ discussed in Section 3.2, and of the filter and projection operations (Lines 59–70). tP collects all the preliminary operations, such as the setup of the discretization and of the filter, performed only once before the TO loop starts.

Table 3 Comparison of numerical performance between the old top88/top88U and the new top99neo Matlab code. tit is the cost per iteration; tA, tS, tU are the overall times for assembly, equilibrium equation solve, and design update, respectively. tP is the time spent for all the preliminary operations. Values within brackets represent the % weight of the corresponding operation on the overall CPU time. On the larger mesh, top88 is run with \(r_{\min \limits } = 12\), because of memory issues

From tit, we clearly see that top99neo outperforms the original top88 by 2.66, 3.85, and 5.5 times on the three discretizations, respectively. Furthermore, the timings of top88 on the largest discretization (1200 × 400) relate to a smaller filter size (\(r_{\min \limits } = 12\)), because of memory issues; thus, the speedup is even underestimated in this case. Compared to the top88U version, the improvements are less pronounced (i.e., 1.55, 1.57, and 1.78 times) but still substantial. The computational cost of the new assembly strategy is very low, even compared to the top88U version, and its weight on the overall computational cost is basically constant. Also, from Table 3, it is clear that the design variable update accounts for a large share of the overall CPU time, for both top88 and top88U. On the contrary, this step becomes very cheap in the new top99neo thanks to the strategies discussed in Section 3.2; tU takes about 4–5% of the overall CPU time.

Computational savings become even higher when adopting the larger filter size \(r_{\min \limits } = 8.75\) for the mesh 300 × 100, and scaling it to \(r_{\min \limits } = 17.5\) and \(r_{\min \limits } = 35\) on the two finer discretizations. For these cases, speedups with respect to top88 amount to 4.45 and 10.35 times on the first two meshes, whereas for the largest one, the setup of the filter in top88 causes a memory overflow. Speedups with respect to top88U amount to 1.55, 2.55, and 3.6 times, respectively.

3.5 Frame reinforcement problem

Let us go back to the example of Fig. 1a, adding the specification of passive domains and a different loading condition.

We may think of a practical application like a reinforcement problem for the solid frame, with thickness t = L/50 (\(\mathcal {P}_{1}\)), subjected to two simultaneous loads: a vertical, uniformly distributed load with density q = − 2 and a horizontal, height-proportional load with density b = ±y/L. Some structural material has to be optimally placed within the active design domain \(\mathcal {A}\) in order to minimize the compliance, while keeping the void space (\(\mathcal {P}_{0}\)), which may represent a service opening.

To describe this configuration, we only need to replace Lines 31–33 with the following

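A hedged sketch of such definitions follows; the index sets below are illustrative and do not reproduce the exact geometry of Fig. 1a:

```matlab
t = round(nelx/50);                         % frame thickness (t = L/50)
pasS = unique([reshape(elNrs([1:t, end-t+1:end], :), 1, []), ...
               reshape(elNrs(:, [1:t, end-t+1:end]), 1, [])]); % solid frame P1
pasV = reshape(elNrs(round(nely/3):round(2*nely/3), ...
                     round(nelx/3):round(2*nelx/3)), 1, []);   % void opening P0
lDofv = 2*nodeNrs(1, :);                    % vertical DOFs, top edge (assumed)
lDofh = 2*nodeNrs(:, end)' - 1;             % horizontal DOFs, right edge (assumed)
fixed = 1:2*(nely+1);                       % e.g., clamp the left edge (assumed)
```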

where lDofv and lDofh target the DOFs subjected to vertical and horizontal forces, respectively. Then, the load (Line 34) is replaced with

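A hedged sketch, superposing the two nodal force distributions (fv and fh are assumed consistent nodal values of the densities q and b; signs and magnitudes are illustrative):

```matlab
fv = -2/nelx * ones(numel(lDofv), 1);   % vertical load, density q = -2
fh = (0:nely)' / nely;                  % height-proportional values b = y/L
F  = fsparse(lDofv', 1, fv, [nDof, 1]) + fsparse(lDofh', 1, fh, [nDof, 1]);
```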

Figure 7 shows the two optimized designs corresponding to the two orientations of the horizontal load b, after 100 redesign steps. The routine top99neo has been called with the following arguments: nely=nelx= 900, volfrac= 0.2, penal= 3, rmin= 8, ft= 3, eta= 0.5, beta= 2, and no continuation is applied. The cost per iteration is about 10.8 s, which is very reasonable considering the fairly large discretization of 1.62 ⋅ 10⁶ DOFs.

Fig. 7

Designs obtained for the frame reinforcement problem sketched in Fig. 1a. In a, the horizontal, triangular load distribution is pointing leftwards, whereas in b, it is pointing rightwards

4 Extension to 3D

The implementation described in Section 3 is remarkably easy to extend to 3D problems (see Appendix C).

Notable modifications are the definition of \(K^{(s)}_{e}\) for the 8-node hexahedron (Lines 24–47) and the solution of the equilibrium (5), now performed by

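i.e., a direct Cholesky factorization that again uses only the lower triangular operator (a sketch):

```matlab
L = chol(K(free,free), 'lower');   % Cholesky factor from the lower triangle
U(free) = L' \ (L \ F(free));      % forward and backward substitution
```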

which in this context has been observed to be faster than the decomposition routine. Then, apart from the plotting instructions, all the operations are the same as in the 2D code and only 12 lines need minor modifications, basically to account for the extra space dimension (see the tags “#3D#” in Appendix C).

We test the 3D implementation on the cantilever example shown in Fig. 8a, for the same data considered in Amir et al. (2014). The discretization is set to Ωh = 48 × 24 × 24, the volume fraction is f = 0.12, and the filter radius \(r_{\min \limits } = \sqrt {3}\). We also consider the volume-preserving Heaviside projection (ft= 3). Figure 8b and c show the designs obtained after 100 redesign steps, for the two different filter boundary conditions. The design in (b), identical to the one in Amir et al. (2014), corresponds to zero-Neumann boundary conditions (i.e., the option “symmetric” was used in imfilter). The design in (c), on the other hand, corresponds to zero-Dirichlet boundary conditions for the filter operator and is clearly a worse local minimum.

Fig. 8

Geometrical sketch of the 3D cantilever example (a) and optimized topology for Ωh = 48 × 24 × 24 and considering the two filter boundary conditions (b, c). The design in d corresponds to the finer mesh Ωh = 96 × 48 × 48 and has been obtained by replacing the direct solver with the multigrid–preconditioned CG (see Amir et al. 2014 for details)

The overall CPU time spent over 100 iterations is 1741 s and about 96% of this is due to the solution of the state equation. Only 1.2% of the CPU time is taken by matrix assemblies and 0.4% by filtering and the design update processes.

Upon replacing the direct solver in top3D125 with the same multigrid preconditioned CG solver of Amir et al. (2014), we can compare the efficiency of the two codes. We refer to Table 4 for the CPU timings, considering the discretizations Ωh = 48 × 24 × 24 (l = 3 multigrid levels) and Ωh = 96 × 48 × 48 (l = 4 multigrid levels). top3D125 shows speedups of about 1.8 and 1.9, respectively, and most of the time is saved on the matrix assembly. In the code of Amir et al. (2014), this operation takes about 50% of the overall time (notably, the same weight as the state equation solve), whereas in top3D125 this weight is cut to 7–10%. Also, the time spent for the OC update is reduced, even though the code of Amir et al. (2014) already implemented a strategy for avoiding filtering at each bisection step.

Table 4 Performance comparison between the new top3D125 code and the one from Amir et al. (2014). tit, tA, tS, tU, and tP have the same meaning as in Table 3 and numbers between brackets denote the % weight of the operations on the overall CPU time

5 Concluding remarks

We have presented new Matlab implementations of compliance topology optimization for 2D and 3D domains. Compared to the previous top88 code (Andreassen et al. 2011) and available 3D codes (e.g., by Liu and Tovar 2014 or Amir et al. 2014), the new codes show remarkable speedups.

Improvements are mainly due to the following:

  1. The matrix assembly is made much more efficient by defining mesh-related quantities as integers (Matlab int32) and assembling just one half of the matrix.

  2. The number of bisections in the OC update is drastically cut by exploiting the explicit estimate (19) of the Lagrange multiplier for the problem at hand.

  3. The filter implementation and the volume-preserving density projection speed up the redesign step.

The new codes are computationally well balanced, and as the problem size increases, the majority of the time (85 to 90% for 2D and even 96% for 3D discretizations) is spent on the solution of the equilibrium system. This is precisely what we aimed at, as this step can be dealt with efficiently by preconditioned iterative solvers (Amir et al. 2014; Ferrari et al. 2018; Ferrari and Sigmund 2020). We also discussed Anderson acceleration, which has recently been applied to TO by Li et al. (2020), to speed up the convergence of the overall optimization loop.

We point out that even though we specifically addressed volume-constrained compliance minimization and density-based TO, the methods above can also be applied to level-set and other TO approaches. Point 1 extends to all problems governed by symmetric matrices. Points 2 and 3 can also be extended to other problems, to some extent, and Anderson acceleration is usable in a more general setting as well (e.g., within MMA).

Therefore, we believe that this contribution will be helpful to all researchers and practitioners who aim at tackling TO problems on laptops, and that it sets a solid framework for the efficient implementation of more advanced procedures.