1 Introduction

A cluster is a group of identical molecules or atoms loosely bound by inter-atomic forces. The optimal geometry minimises the potential energy of the cluster. The simplest model (yet one that is extremely difficult to solve) uses the Lennard–Jones pairwise potential energy function. Variants of this problem include carbon and argon clusters as well as water molecule clusters (see, e.g. [1–3]). In addition, the Lennard–Jones potential is an important component of many of the potential energy models used, for instance, in complex molecular conformation, protein–peptide docking, and protein folding problems [4–6].

The objective function of the Lennard–Jones potential is smooth (continuously differentiable) and easy to implement. However, it has an extremely complicated landscape with a huge number of local minima. A smooth penalised modification of the Lennard–Jones pairwise potential function was introduced in [7] that allows a local search method to escape from the enormous number of local minima of the Lennard–Jones energy landscape. This modification was reported to result in convergence to the global minimum with much greater success than starting local optimisation from random points.

The idea of the penalised potential was further modified in [8], resulting in a nonsmooth penalised Lennard–Jones potential. According to the very limited number of test cases used in [8], this formulation, together with the discrete gradient method [9] used for minimisation, yields yet another improvement in the success rate of finding the global minimum.

In this paper, we study different parameter values for the nonsmooth penalised Lennard–Jones potential introduced in [8], as well as some new modifications of this nonsmooth formulation. Our main goal is to confirm the results obtained in [8] and to further improve the success rate of finding the global minimum of the original Lennard–Jones potential.

As a solver for the minimisation problem, we use the limited memory discrete gradient bundle method (LDGB) [10], a derivative-free method for moderately large nonsmooth problems. The LDGB is a hybrid of the discrete gradient method [9] and the limited memory bundle method [11, 12]. The choice of solver is motivated by three facts: First, we need a solver that is capable of solving (locally) nonsmooth nonconvex problems. Second, the computation of subgradients (generalised gradients [13]) is not an easy task, since the problem is subdifferentially irregular (see, e.g. [14]) and, thus, the calculus exists only in the form of inclusions; the choice of a derivative-free method is therefore justified. Finally, the number of variables in the clustering problem is 3N, which means that our solution algorithm needs to be able to solve moderately large problems. In addition, it has been shown that the discrete gradient method—although not a global optimisation method—has an aptitude for jumping over small local minima (see, e.g. [9, 15]). We hope that its descendant, the LDGB, shares this feature.

The paper is organised as follows. In Sect. 2, we recall the Lennard–Jones potential and the penalised modifications introduced in [7] and [8]. In Sect. 3, we introduce the formulae used in our experiments. Then, in Sect. 4, we briefly describe the basic ideas of the LDGB, and in Sect. 5, we give the results of our numerical experiments. Finally, in Sect. 6, we conclude the paper and give some ideas for future research.

2 Lennard–Jones Pairwise Potential

The optimal geometry of the cluster minimises the potential energy E expressed as a function of Cartesian coordinates

$$\begin{aligned} E(x,y,z):=\sum _{i=1}^N\sum _{j=i+1}^N v(r_{ij}), \end{aligned}$$
(1)

where N is the number of atoms (molecules) in the cluster and \(r_{ij}\) is the distance between the centres of a pair of atoms (molecules). That is,

$$\begin{aligned} r_{ij}:=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2+(z_i-z_j)^2}. \end{aligned}$$
(2)

The simplest model (yet extremely difficult to solve) uses the Lennard–Jones pairwise potential energy function

$$\begin{aligned} v(r_{ij}):=\frac{1}{r_{ij}^{12}}-\frac{2}{r_{ij}^6}. \end{aligned}$$
(3)
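To make the model concrete, the following Python sketch evaluates the cluster energy (1) with the pairwise terms (2) and (3). The function name and the flat coordinate layout are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def lj_energy(coords):
    """Lennard-Jones cluster energy (1) with pairwise potential (3).

    coords is a flat array of length 3N, interpreted here as N rows of
    (x_i, y_i, z_i); this layout is an illustrative assumption.
    """
    pts = coords.reshape(-1, 3)
    energy = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            r = np.linalg.norm(pts[i] - pts[j])   # distance r_ij, Eq. (2)
            energy += 1.0 / r**12 - 2.0 / r**6    # pairwise term, Eq. (3)
    return energy

# Two atoms at unit distance sit at the pairwise minimum v(1) = -1.
print(lj_energy(np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])))   # -1.0
```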

The objective function of the Lennard–Jones potential (1) and (3) is smooth, assuming that \(r_{ij}>0\), and easy to implement. However, it has an extremely complicated landscape with a huge number of local minima. In [7], a smooth penalised modification of the Lennard–Jones pairwise potential function (3) was introduced. The formula of this penalised Lennard–Jones potential is

$$\begin{aligned} \bar{v}(r)=\frac{1}{r^{2p}}-\frac{2}{r^p}+\mu r + \beta \left( \max \left\{ 0,r^2-D^2\right\} \right) ^2, \end{aligned}$$
(4)

where \(p>0\), \(\mu , \beta \ge 0\) are real constants, and \(D>0\) is an underestimate of the diameter of the cluster. The local minimum of the modified objective function (1) and (4) was then used as a starting point for a local optimisation of the Lennard–Jones potential function (1) and (3). As said in the introduction, this procedure was reported to result in convergence to the global minimum with much greater success than starting local optimisation from random points [7].
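For illustration, the pairwise formula (4) translates directly into Python as follows. The function and parameter names are ours, and the default values merely mirror the parameter set plotted later in Fig. 3a; none of this is prescribed by [7].

```python
def v_smooth(r, p=6.0, mu=0.3, beta=1.0, D=2.0):
    """Smooth penalised Lennard-Jones pairwise potential, Eq. (4).

    The squared max term keeps the diameter penalty continuously
    differentiable at r = D.
    """
    return (1.0 / r**(2.0 * p) - 2.0 / r**p
            + mu * r
            + beta * max(0.0, r**2 - D**2)**2)
```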

The idea of the penalised potential (4) was further modified in [8], resulting in the nonsmooth penalised Lennard–Jones potential

$$\begin{aligned} \bar{v}(r)=\frac{1}{r^{12}}-\frac{1}{r^6}+\mu r + \beta \left( \max \left\{ 0,r^2-D^2\right\} \right) . \end{aligned}$$
(5)

This formulation, together with the discrete gradient method [9] used for minimisation, has been reported to yield yet another improvement in the success rate of finding the global minimum [8].

3 Nonsmooth Polyatomic Clustering Problem

In this paper, we try to escape from the local minima of the Lennard–Jones energy landscape using a nonsmooth penalised Lennard–Jones potential of the form

$$\begin{aligned} \bar{v}(r)=\frac{1}{r^{2p}}-\frac{2}{r^p}+\mu r + \beta \left( \max \left\{ 0,r^2-D^2\right\} \right) , \end{aligned}$$
(6)

where \(p>0\), \(\mu , \beta \ge 0\) are real constants, and \(D>0\) is an underestimate of the diameter of the cluster. Note that, by choosing \(p=6\) and \(\mu , \beta = 0\), the penalised Lennard–Jones potential \(\bar{v}\) coincides with the Lennard–Jones pairwise potential (3).

The first penalty term \(\mu r\) in (6) penalises the distances between the atoms [see [16] for a detailed analysis of the first penalty term in (6)]. The penalty increases linearly as a function of distance. Nevertheless, there is no good reason (other than the smoothness of the model) to penalise distances smaller than 1. Moreover, using this linear penalty slightly dislocates the minimum of the pairwise potential (see Figs. 1a, 2b, 3a, b). Thus, we now introduce a formula where, instead of the linear penalty \(\mu r\), the first penalty is given by a piecewise linear function. That is,

$$\begin{aligned} \bar{v}(r)=\frac{1}{r^{2p}}-\frac{2}{r^p}+\mu (\max \{0,r-1.1\})+ \beta \left( \max \left\{ 0,r^2-D^2\right\} \right) . \end{aligned}$$
(7)

In (7), we do not penalise atomic distances smaller than 1.1.
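A minimal Python sketch of the two nonsmooth pairwise potentials makes the difference explicit; the only change is the first penalty term (function names are our own):

```python
def v_linear(r, p, mu, beta, D):
    """Nonsmooth penalised pairwise potential (6): linear first penalty."""
    return (1.0 / r**(2.0 * p) - 2.0 / r**p
            + mu * r + beta * max(0.0, r**2 - D**2))

def v_pw_linear(r, p, mu, beta, D):
    """Nonsmooth penalised pairwise potential (7): the piecewise linear
    first penalty leaves distances below 1.1 unpenalised."""
    return (1.0 / r**(2.0 * p) - 2.0 / r**p
            + mu * max(0.0, r - 1.1) + beta * max(0.0, r**2 - D**2))

# Sanity check: with p = 6 and mu = beta = 0, both reduce to (3),
# whose pairwise minimum value is v(1) = -1.
assert v_linear(1.0, 6.0, 0.0, 0.0, 2.0) == -1.0
assert v_pw_linear(1.0, 6.0, 0.0, 0.0, 2.0) == -1.0
```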

In both of the formulae (6) and (7), the parameter p affects the rigidity of the model. By choosing \(p<6\), the atoms (molecules) can move more freely, and by decreasing p, the infinite barrier at \(r=0\), which prevents atoms from getting too close to each other, is also lowered. As already said, the first penalty term \(\mu r\) in (6) and \(\mu (\max \{0,r-1.1\})\) in (7) penalises the distances between the atoms. In turn, the second penalty term penalises the diameter of the cluster. It has no influence on pairs of atoms close to each other, but it adds a strong penalty to atoms far away from each other. As in [7], the local minima of the modified objective functions (1) and (6) or (1) and (7) will be used as starting points for a local optimisation of the Lennard–Jones potential function (1) and (3), as sketched below.
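The two-phase procedure can be sketched as follows, reusing `v_pw_linear` from the sketch above. Since the LDGB itself is a Fortran code, we use SciPy's derivative-free Powell method as a stand-in local solver, so this illustrates the procedure rather than our actual experiments; `cluster_energy` generalises (1) to an arbitrary pairwise potential.

```python
import numpy as np
from scipy.optimize import minimize

def cluster_energy(coords, v_pair):
    """Cluster energy (1) for an arbitrary pairwise potential v_pair(r)."""
    pts = coords.reshape(-1, 3)
    n = len(pts)
    return sum(v_pair(np.linalg.norm(pts[i] - pts[j]))
               for i in range(n) for j in range(i + 1, n))

def two_phase_minimise(x0, p=4.0, mu=0.3, beta=1.0, D=3.0):
    """Phase 1: local minimisation of the modified objective (1) and (7);
    phase 2: restart a local search on the original Lennard-Jones
    potential (1) and (3) from the phase-1 minimiser.
    Powell is only a stand-in for the LDGB."""
    def modified(x):
        return cluster_energy(x, lambda r: v_pw_linear(r, p, mu, beta, D))

    def original(x):
        return cluster_energy(x, lambda r: 1.0 / r**12 - 2.0 / r**6)

    x1 = minimize(modified, x0, method="Powell").x   # phase 1
    res = minimize(original, x1, method="Powell")    # phase 2
    return res.x, res.fun
```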

In Fig. 1, the formulae (6) and (7) with parameters \(\mu = 0.3, \beta =0, D=0\) and \(p=6\) or \(p=4\) are displayed and compared with the Lennard–Jones pairwise potential (3). In Fig. 2, the corresponding cases with diameter penalisation—that is, \(\mu = 0.3, \beta =1.0, D=2.0\)—are displayed. Finally, in Fig. 3, we compare the smooth formulation (4) with the nonsmooth one (7). Here, we have used two sets of parameters: in Fig. 3a, we have set \(p=6, \mu = 0.3, \beta =1.0\), and \(D=2.0\), and in Fig. 3b, we have set \(p=4, \mu = 1.0, \beta =1.0\), and \(D=2.0\).

Fig. 1 Comparison between Lennard–Jones and modified potentials: a linear penalty, b piecewise linear penalty

Fig. 2 Comparison between Lennard–Jones and modified potentials (cont.): a penalty (6), b penalty (7)

Fig. 3 Comparison between smooth and nonsmooth potentials: a \(p = 6\), b \(p = 4\)

We do not give a more detailed analysis of the effects of these penalised formulae here. The interested reader is referred to [7, 16, 17].

4 Limited Memory Discrete Gradient Bundle Method

In this section, we briefly describe the basic ideas of the derivative-free LDGB that is used as a solver for the minimisation problem discussed in the previous section. As said in the introduction, we need a solver that is capable of solving moderately large nonsmooth nonconvex problems. Moreover, the computation of subgradients is not straightforward since the problem is subdifferentially irregular and, thus, we need a derivative-free solver. The only assumptions made here are that the objective function is locally Lipschitz continuous and that, at every point \({\mathbf {x}}\), we can evaluate the value of the objective function \(f({\mathbf {x}})\).

A simple flow chart of the method is given in Fig. 4.

Fig. 4 Program LDGB

The LDGB exploits the ideas of the variable metric bundle method [18], namely the utilisation of null steps, simple aggregation, and subgradient locality measures. Nevertheless, discrete gradients are used instead of subgradients, and the search direction is calculated using the limited memory approach. Both outer and inner iterations are used in the LDGB (see Fig. 4): The inner iteration of the LDGB is essentially the same as in the limited memory bundle method [11, 12], but now we use discrete gradients instead of subgradients of the objective function. The outer iteration is used in order to avoid too tight approximations to the subgradients at the beginning of the computation (thus, we have a derivative-free method). That is, we start with a “large” discrete gradient parameter \(\delta \) and make it smaller as we get closer to the optimum. For a reader more interested in nonsmooth optimisation and the details of the method, we refer to [10, 14] and the Appendix.
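To convey the flavour of the outer iteration, the following caricature replaces the discrete gradient of [9] with a plain forward-difference approximation with spacing \(\delta \). The true discrete gradient is considerably more involved, so this sketch only illustrates why shrinking \(\delta \) between outer iterations first smooths and then tightens the approximation.

```python
import numpy as np

def fd_gradient(f, x, delta):
    """Forward-difference stand-in for a discrete gradient with spacing
    delta: a large delta gives a coarse (smoothing) approximation,
    a small delta a tight one."""
    fx = f(x)
    g = np.empty_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = delta
        g[i] = (f(x + step) - fx) / delta
    return g

# Outer iterations: start with a "large" delta and shrink it as the
# iterates approach a minimiser (the inner bundle solver, here a
# hypothetical inner_iteration, is omitted):
# for delta in (0.1, 0.01, 0.001):
#     x = inner_iteration(f, x, lambda y: fd_gradient(f, y, delta))
```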

5 Numerical Experiments

We now give the results of our numerical experiments. To solve the problems, we have used the solver LDGB discussed in the previous section. The Fortran 95 source code of the solver is available for downloading at http://napsu.karmitsa.fi/ldgb/. The experiments were performed on an Intel® Core™ 2 CPU at 1.80 GHz. To compile the code, we used gfortran, the GNU Fortran compiler.

In order to test the performance of the modified formulae (6) and (7), we made a series of numerical experiments by running LDGB 1000 times with \(N=2,\ldots ,40\). We started local optimisation of the modified objective functions (1) and (6) or (1) and (7) from random points with \(x_i \in [-\frac{1}{2}\sqrt{3N},\frac{1}{2}\sqrt{3N}]\), \(i=1,\ldots ,3N\). That is, no special point generation procedure similar to [7] was used, nor did we take any precautions to prevent atom overlap during the starting point generation. The local minima of the modified potentials were then used as starting points for a local optimisation of the original Lennard–Jones potential function (1) and (3). In what follows, we report the percentage of the trials which led to the putative global minimum as given in [19].
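The starting points described above can be generated, for example, as follows (a sketch; uniform sampling within the stated box is our reading of “random points”):

```python
import numpy as np

def random_start(N, rng=None):
    """Uniform random starting point with all 3N coordinates in
    [-sqrt(3N)/2, sqrt(3N)/2], as in the experiments; no overlap check."""
    if rng is None:
        rng = np.random.default_rng()
    half = 0.5 * np.sqrt(3 * N)
    return rng.uniform(-half, half, size=3 * N)
```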

Note that the original Lennard–Jones potential function (1) and (3) (and, thus, the second phase of the modified problem) is a smooth problem, and the gradients are readily computable. Thus, some efficient gradient-based method could have been used. Nevertheless, our main interest was in comparing the different formulae, not in the solution algorithms, and thus, we used LDGB also to solve the original Lennard–Jones potential function (1) and (3) as well as the second phase of the modified problem.
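Put together, one multi-start experiment for a given N might look like the sketch below, reusing `random_start` and `two_phase_minimise` from the earlier sketches; `E_star` stands for the putative global minimum tabulated in [19], and the tolerance is our own illustrative choice.

```python
def success_rate(N, E_star, trials=1000, tol=1e-4):
    """Percentage of multi-start trials whose two-phase result reaches
    the putative global minimum E_star within tolerance tol."""
    hits = 0
    for _ in range(trials):
        x0 = random_start(N)
        _, energy = two_phase_minimise(x0)
        if energy <= E_star + tol:
            hits += 1
    return 100.0 * hits / trials
```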

5.1 Linear and Piecewise Linear Penalty

Let us first study only the linear and piecewise linear penalty terms (i.e. we set \(\beta =0\) in Eqs. (6) and (7)) with different values of the parameters p and \(\mu \). The results of these experiments are given in Tables 1 (\(p=6\)) and 2 (\(p\le 4\)), where “N” denotes the number of atoms, “L–J” denotes the success rate obtained with the original Lennard–Jones formulation (1) and (3), “Locatelli” stands for the results of Locatelli and Schoen [7] (we recall some of these results for comparison purposes), “linear” stands for the linear penalty (i.e. Eq. (6) with \(\beta =0\)), and “PW linear” stands for the piecewise linear penalty (i.e. Eq. (7) with \(\beta =0\)).

Table 1 Success rate with linear penalty and \(p=6\)
Table 2 Success rate with piecewise linear penalty and \(p=4\)

First, it is worth noting that all the penalised formulations were superior to the original Lennard–Jones formulation. For example, with \(N=13\), the putative global minimum was found only 12 times when the original Lennard–Jones formulation was used, while with the piecewise linear penalty with \(p=6\) and \(\mu =5\), it was found 989 times. In addition, with the original Lennard–Jones formulation, no putative global minimum was found with \(N>25\). Nevertheless, there was one exception: with \(N=8\), the piecewise linear penalty with \(p=6\) and \(\mu =5\) was worse than the original Lennard–Jones formulation (see Table 1).

In each table, the best values among all the LDGB results are set in bold. In addition, in Table 1, we have used red to show which of the smooth and nonsmooth formulations with the same parameters is better (or equal); that is, the same values of p and \(\mu \) are compared. It is now easy to see that the piecewise linear penalty usually gave better or equal results (in 75 % of the cases). This may follow from the fact that the minimiser of the piecewise linearly penalised formula (7) and that of the original formula (3) are the same, while with the linearly penalised formula (6), there is a small displacement (see Fig. 1). Nevertheless, the magnitudes of the results are about the same.

In Table 1, we have highlighted in blue those values where Locatelli’s results are better than the run with LDGB with the same formula and parameters (i.e. “linear” with \(p=6\) and \(\mu =0.3\)). It can be seen that Locatelli’s results are usually a little better, especially with larger N. This difference is probably due to the specialised starting point generation procedure used in [7] rather than any inferiority of our solution algorithm. In fact, since the results are of the same magnitude, it may be that LDGB does give some advantage—almost as strong as specialised starting point generation—when solving these kinds of problems.

When comparing the different values of \(\mu \) in Tables 1 and 2, it seems that a large \(\mu \) is well suited for small N, but when N increases, a smaller value of \(\mu \) is better. Indeed, when \(N<22\), the best success rates with \(p=6\) were usually obtained with the piecewise linear penalty and \(\mu =5\). However, no successful runs were made when \(N>31\). With \(p=4\) and \(\mu =5\), no successful runs were made when \(N>13\). This trend can be seen both with the linear and with the piecewise linear penalty. Therefore, with \(p=4\), we also tested values of \(\mu \) that depend on the value of N, namely \(\mu =10/N\) and \(\mu =1/N\). In Table 2, we see that this strategy gives somewhat better results. Although the best success rate with small N is still usually obtained with either \(\mu =2\) or \(\mu =5\), the overall performance is best with \(\mu =10/N\). Nevertheless, it is worth noting that \(\mu =1/N\) is the only choice of the parameter that gave the putative global optimum at least once with every N during our test runs.

In Table 2, we have compared Locatelli’s results to the piecewise linear formulation with the same parameters (blue). As before, Locatelli’s results are slightly better than those of LDGB when N increases. However, the trend is not as clear as in Table 1. In addition, we compared Locatelli’s results to our “best” parameters with \(p=4\), that is, \(\mu =10/N\). In Table 2, we have highlighted in red those results that are better than or equal to Locatelli’s results. That happens in 77 % of the cases. It is also worth noting that the improvements here are often clear.

When comparing the different values of the parameter p, we see that \(p=4\) usually gives better results than \(p=6\) when only the linear or the piecewise linear penalty is used: the best values obtained with \(p=4\) were better than the best values obtained with \(p=6\) in 82 % of the cases. In addition to the values \(p=6\) and \(p=4\), we made one trial set with the piecewise linear penalty and \(p=3\) (see Table 2). Although \(p=3\) worked well for small N, the putative global minimum was not found within 1000 trials with \(N>24\). In Table 2, we have highlighted in blue those results with \(p=3\) that were better than or equal to the corresponding results with \(p=4\).

5.2 Formulations with Diameter Penalisations

We now study the formulations with diameter penalisation. The results with different formulae and parameter values are given in Tables 3, 4, and 5. Here, as before, “Locatelli” stands for the results of Locatelli and Schoen [7]. In addition, “Beliakov” stands for the results of Beliakov et al. [8], “smooth” for the smooth penalty (4) run with LDGB, “lin+max” for formula (6), and “PWlin+max” for formula (7). In Tables 3 and 4, we study the case with \(p=6\), and in Table 5, we give the results for \(p=4\). We have used the value \(\beta =1.0\) in all our trials. The best values among all the LDGB results with different values of p are set in bold.

Table 3 Success rate with diameter penalisation and \(p=6\)
Table 4 Number of successes with diameter penalisation and \(p=6\) (cont.)
Table 5 Number of successes with \(p=4\) and diameter penalisation

None of the formulae and parameter combinations tested gave the putative global optimum with every N within our 1000 test trials. With \(p=6\), the overall best results were obtained with formula (7) with parameters \(\mu =2.0\) and \(D=3.0\) (see Tables 3 and 4). However, this combination failed to find the putative global optimum (at least once) for six different values of N. In that sense, the best results were obtained with the same formula but with \(\mu =10.0/N\) and \(D=5.0\), in which case we failed only for three different values of N. With \(p=4\), the overall best performance was obtained with formula (7). Here, the best parameters were \(\mu =10.0/N\) and \(D=3.0\), and the number of failures in finding the putative global optimum was five (see Table 5). The same formula with parameters \(\mu =0.3\) and \(D=3.0\) succeeded in finding the putative global optimum for all but two values of N.

As before, we compare Locatelli’s results to the corresponding smooth formulation with \(p=6\), \(\mu =0.3\), and \(D=3\), and to formula (7) with \(p=4\), \(\mu =0.3\), and \(D=3\), both run with LDGB (blue in Tables 3 and 5). Now, with \(p=6\), Locatelli’s results were better for only three different values of N, and with \(p=4\), they were better in only seven cases. This result—especially when taking into account the specialised starting point generation procedure used in [7]—means that with these more complex formulae, the usage of the solver LDGB clearly gives us a small advantage.

Next, we compare the smooth penalty (4) to the nonsmooth ones (6) and (7). In Table 3, we have again used red for those results of the nonsmooth formulae that are better than or equal to those of the smooth formula with the same parameters. Here, we can conclude that both of the nonsmooth formulae give some improvement. This mostly confirms the results obtained in [8]. However, we could not reproduce the huge improvement in the success rate of finding the difficult cluster with \(N=38\) when using LDGB (see Table 3).

When comparing formulae (6) and (7), there was not a big difference between the success rates of the formulae with the same parameters. Nevertheless, as before, (7) was slightly better (see Table 3).

The formulae with diameter penalisation usually gave better results than those with only the linear or the piecewise linear penalty when the same parameter combinations were compared (see Tables 1, 2, 3, 4, 5). The clear exception to this rule is the smooth formulation with \(D=1.5\), where no successful runs were made with \(N \ge 22\). The differences in the success rates are sometimes enormous. Two interesting examples occur with \(N=6\) and \(N=38\). In these cases, the optimal structure of the cluster is face-centred cubic (FCC), which is claimed to be one of the most difficult structures to find, especially with large N (see, e.g. [7, 19]). In both of these cases, the formulae with diameter penalisation made a large difference: with \(N=6\), \(p=6\), and \(\mu =0.3\), the formulations without diameter penalisation gave only about 10 % success, while with the nonsmooth diameter penalisations, the success rates were around 98 %; with \(N=38\), \(p=6\), and \(\mu =0.3\), the corresponding values were less than 0.5 and 6.7 %. Nevertheless, there are some results where the differences go the other way: that is the case, for example, with \(N=15\), \(p=6\), \(\mu =0.3\), and \(D=1.5\). Moreover, when comparing the best values obtained with and without diameter penalisation, the choice of the better formula is not so clear. Indeed, a similar improvement as above for \(N=6\) and \(p=6\) can be obtained with only the (piecewise) linear penalty by setting \(\mu =5\) (see Table 1). For \(N=38\), this kind of effect was not observed.

The parameter D is supposed to be an underestimate of the diameter of the cluster and, naturally, its optimal value depends on the number of atoms in the cluster but also on the optimal structure of the cluster. For example, the nonicosahedral optimal cluster structures with \(N=75\) are spherical and compact, with a smaller diameter than the clusters with just a few atoms less or more. With the smooth formulation, the smaller D usually gave good results with small N (say \(N<20\)), but the number of successes decreased dramatically when N increased. For example, with \(p=6\), the success rate with \(D=3\) was always better than that with \(D=1.5\) when \(N \ge 18\) and, as already said, with \(N \ge 22\), no successful runs were made with \(D=1.5\). However, when \(N \le 17\), the formula with \(D=1.5\) gave a slightly better success rate than that with \(D=3.0\) (see Table 3). With the nonsmooth formulations and \(p=6\), this trend was not so obvious. This is probably due to the fact that the nonsmooth penalty term does not increase as fast as the smooth one. However, with \(p=4\) and \(D=1.5\), no successful runs were made when \(N > 18\), although with smaller N the results with different values of D were comparable. The value \(D=3\) usually gave a slightly better success rate than \(D=5\) with our test set with a maximum of 40 atoms. Nevertheless, the last ten rows in Table 4 (i.e. results from \(N=30\) upwards) may indicate that this could change if larger clusters were optimised (in Tables 4 and 5, we have used red to emphasise the best result with different values of D).

5.3 Performance of the LDGB

Finally, we say a few words about the performance of the optimisation algorithm. Apart from some exceptional cases, the average numbers of function evaluations needed to find a putative global minimum were clearly smaller with any of the formulae with \(p=4\) than with \(p=6\). When comparing the different formulae, the differences were not as clear. Nevertheless, the minimisation of the smooth formula (4) usually used fewer evaluations than the minimisation of the nonsmooth formulae (6) or (7), although the average numbers of evaluations were of the same magnitude. When comparing formula (6) with (7) with \(p=4\), \(\mu =0.3\), and \(D=3\), formula (7) usually came out ahead of (6). Nevertheless, again the differences were not large. In addition, when comparing the formulae with linear or piecewise linear penalties with parameters \(p=6\) and \(\mu =1.0\), the piecewise linear penalty usually used slightly fewer evaluations than the linear one. However, with \(\mu =0.3\), they used on average the same number of evaluations. The most interesting and unexpected result was obtained when comparing the evaluations needed with the piecewise linear penalty alone and with formula (7) with \(p=4\), \(\mu =0.3\), and \(D=3\) or \(D=5\): in all cases, the average numbers of function evaluations needed to find a putative global minimum were smaller with the more complicated formula (7) than when only the piecewise linear penalty was used. A possible reason for this is that the diameter penalty forces the atoms that are far from the centre of the cluster to move close enough more quickly.

As already said, we also compared the results obtained with LDGB to those given in [7]. In spite of the fact that a special starting point generation procedure was used in [7], the results were of the same magnitude when only the linear penalty was used, and the results with LDGB were usually better than those given in [7] when both the linear penalty and the diameter penalisation (4) were used. Thus, we can conclude that LDGB seems to share—at least on a small scale—the aptitude of its predecessor, the discrete gradient method [9], to jump over small local minima.

6 Conclusions

In this paper, we have studied different modifications of the Lennard–Jones potential in order to improve the success rate of finding the global minimum of the original potential. The main interest of the paper was in the nonsmooth penalised forms of the Lennard–Jones potential. Our goal was to confirm the earlier very promising results with the nonsmooth formulation and to improve the success rate of finding the global minimum of the original problem when using a local search optimisation method. The preliminary numerical experiments confirm that with all the penalised formulae, the success rates were greatly improved. The results obtained with the nonsmooth formulae (6) or (7) were usually a little better than those with the smooth penalty (4). In addition, when both the linear penalty and the diameter penalisation in (4) were used, the results obtained in our experiments were usually better than those given in [7], in spite of the fact that a special starting point generation procedure was used in [7]. This is probably due to our solution algorithm LDGB, which seems to share—at least on a small scale—the aptitude of its predecessor, the discrete gradient method [9], to jump over small local minima.

When comparing the different nonsmooth formulae, the one with the piecewise linear penalty term (i.e. formula (7)) seems to be slightly better. Nevertheless, the differences were not significant.

In this paper, our main interest was in comparing the different formulae, not in the solution algorithm itself. Thus, we have used here a crude multi-start method. Nevertheless, the multi-start method is obviously neither the most reliable nor the most efficient way to solve global optimisation problems. Therefore, it is not suitable for solving large clusters. In the future, the aim is to develop efficient and reliable solvers specially targeted at solving (also larger instances of) the modified Lennard–Jones potentials introduced in this paper. This includes combining some more sophisticated global optimisation method with the LDGB. In addition to traditional global optimisation methods like simulated annealing or genetic algorithms, an interesting idea would be to use an incremental approach (see, e.g. [20]) with the modified potentials to solve these kinds of clustering problems. Furthermore, the second phase of the problem [i.e. minimising the original Lennard–Jones potential function (1) and (3)] is a smooth problem, and the gradients are readily computable. Thus, solving this part of the problem with some efficient gradient-based method would probably make a big difference to the efficiency of this approach. A natural choice of gradient-based method would be the limited memory bundle method [11, 12].