1 Introduction

The probability collectives (PC) algorithm is a distributed random-search optimization method. It extends and formalizes the collective intelligence (COIN) framework and has deep connections to game theory, statistical physics, and optimization for modeling and controlling distributed multi-agent systems (MAS) (Bieniawski and Stefan Richard 2005; Wolpert 2006; Kulkarni et al. 2008; Lee et al. 2004).

The key insights of PC focus on two points. First, PC is a distributed algorithm: each agent independently improves its local objective and updates its own actions according to the rewards from the previous round. This process continues until no further increase in reward can be made by any individual agent, which, from the perspective of game theory, is exactly a Nash equilibrium. Second, the PC algorithm allocates probability values over each agent's set of possible strategies instead of operating on variable values directly. Strategies with higher probability have more opportunities to appear in the next round, while those with lower values are gradually abandoned in terms of probability. PC therefore converges to a distribution that places a high probability value (\(\approx \)1) near the optimal strategy while assigning the other feasible regions a relatively low probability value (\(\approx \)0). In addition to the optimal solution, PC also provides a 'slice' of the multidimensional objective function, which describes how the function varies. These features distinguish PC from other random-search algorithms such as the genetic algorithm (GA), simulated annealing (SA) and swarm optimization, and allow it to solve problems with discrete, continuous and mixed variables (Kulkarni et al. 2009).

The PC process shares the same spirit as several other methods, such as mutual-information-maximizing input clustering (MIMIC), population-based incremental learning (PBIL) and distributed reinforcement learning (DRL). The similarity among these methods is that they all employ learning processes and use probability distributions to guide the optimization procedure instead of operating directly on variable values. However, each method retains its own characteristics. PC uses gradient-based optimization methods and function values to adjust the probability distribution, whereas MIMIC directly uses truncated samples and PBIL adopts the GA idea of simulating the evolution of a population to find the final solution. Moreover, PC pays more attention than DRL does to how an individual can influence the system's performance in the presence of the other agents, and is thus particularly suitable for distributed problems (Kulkarni and Tai 2010a; Kulkarni et al. 2010b; Autry and Brian 2008; Busoniu et al. 2008).

The PC algorithm was first presented by Wolpert and Bieniawski (2004). Since then, many modifications and applications have been developed. Lee et al. (2004) replace the world utility (called the Team Game) with a private utility (called the Aristocrat Utility) to reduce the sample size used in the original algorithm with no bias and low variance. Bieniawski and Stefan Richard (2005) show that a data-aging technique can effectively reduce the sample size. Kulkarni and Kang modify the original Monte Carlo sampling rule by narrowing the sampling region around the current optimal points (Kulkarni et al. 2008; Lee et al. 2004). Updating strategies such as the steepest descent method, the nearest Newton method and the Brouwer fixed point method have also been developed (Wolpert et al. 2011). The BFGS method is used to optimize the model by Kulkarni et al. (2011). In addition, Wolpert et al. (2011) classify the original PC as 'delayed sampling PC' and put forward an 'immediate sampling PC' that uses importance sampling and parametric machine learning techniques.

There are also several applications of the PC algorithm. Numerical results on a set of benchmark functions show that the PC method outperforms the traditional GA in the rate of descent (Huang et al. 2005). PC has also been used to improve Metropolis-Hastings sampling and in mechanism design (Wolpert and Tumer 1999, 2001, 2002). More broadly, it has been applied to combinatorial problems such as the multi-depot multiple traveling salesmen problem (MTSP) (Kulkarni and Tai 2010a; Kulkarni et al. 2010b), the single-depot MTSP (Kulkarni and Tai 2010a; Kulkarni et al. 2010b), the fleet assignment problem (Antoine et al. 2004), school timetable scheduling problems (Autry and Brian 2008) and vehicle routing problems (Kulkarni and Tai 2010a; Kulkarni et al. 2010b).

However, for Wolpert's PC algorithm, which is based on random sampling, several potential improvements can still be identified. First, the gradient derived from the 'Maxent Lagrangian' model directly influences the performance of the original algorithm. Since the gradient contains both the expectation \(E(G)\) and the conditional expectation \(E(G\vert x)\) of the function value, which are estimated through Monte Carlo sampling, the original algorithm cannot always guarantee a satisfactory result because of the estimation error. For example, if the true value of the expectation \(E(G)\) is 1,000 and the nearest Newton method is used to update the probability, a one percent estimation error may produce an unacceptable outcome, since the error can drastically change the descent direction and destroy the basis of the iteration. Second, the calculation of the step size along the descent direction is also a major challenge for the PC model, because choosing the best step size requires repeatedly updating the probability and using Monte Carlo sampling to evaluate the function value. Finally, the temperature \(T\) is vital to the PC algorithm, so a fixed proportional decrease may not be a satisfactory schedule.

Based on the above analysis, this paper presents a mixed PC algorithm that combines the trust region (TR) method with the PC algorithm. A new adaptive annealing schedule is also adopted to replace the original one. Numerical experiments show that the proposed algorithm outperforms the original PC algorithm.

The article is organized as follows: in Sect. 2, the original PC algorithm is reviewed. The proposed PC algorithm is then elaborated in Sect. 3. In Sect. 4, the performance of the new algorithm is illustrated with results on 10 benchmark functions, and comparisons with another modified PC algorithm are also made. Finally, conclusions are given in Sect. 5 and future research is discussed.

2 The original PC algorithm

2.1 The framework

Different from conventional 'black box' optimization methods, the PC algorithm focuses on the distribution over the strategy set of each variable instead of the variable values. Through adapting distributions, the original problem is converted into a new problem over a convex space of probability distributions, whose variables take real values (probability values) on their feasible regions. Thus powerful deterministic optimization tools can be employed to solve this problem.

In the PC algorithm, the variables can be viewed as intelligent agents. In each iteration, every agent independently selects actions from its own particular interval, receives local rewards based on the objective function and decides its next round of action according to the previous rewards. Actions that make greater contributions to the global optimization receive higher probability values and are more likely to appear in the next round, while those making little or no contribution receive lower probability and eventually disappear from the optimal strategy set in terms of probability. As this iterative process continues, the system ultimately reaches an equilibrium, meaning that the objective function cannot be further improved by reallocating probability among strategies and the probability distribution has converged to the optimal one. At this point, the optimal solution is the combination of the highest-probability actions of each agent.

Another way to interpret PC is through game theory. If the final equilibrium is viewed as the direct choice of a group of fully rational agents, then the whole PC process can be viewed as a bounded-rational game: through a learning process, each agent explores both the possible moves and the payoff matrix and then gradually converges toward the Nash equilibrium. It is worth noting that during this process the actual moves of every agent are independent and all couplings between agents occur indirectly; it is not their moves but their separate distributions that are coupled.

2.2 The original models

Consider the following unconstrained optimization problem:

$$\begin{aligned} \mathop {\min }\limits _x G(x), \end{aligned}$$
(1)

where \(G(x)\) is the objective function and \(x = (x_{1},x_{2},\ldots ,x_{\mathrm{n}})\) is the variable/agent vector.

Let \(q_{i}\) be a distribution over the \(i\)th agent’s strategy set. The optimization problem can be seen as a non-cooperative game with \(n\) independent players. Each player has its own strategy set in the game and the goal is to reach the Nash equilibrium, so the problem can also be described as

$$\begin{aligned} \mathop {\min }\limits _q E_q [G]=\int {G(x)\prod \nolimits _\mathrm{i} {q_i (x_i )} \mathrm{d}x} \end{aligned}$$
(2)

The subscript in the expectation indicates that its value depends on the distribution \(q=\prod \nolimits _i q_i \).
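
As an illustration of how (2) is evaluated in practice, the sketch below (with hypothetical helper names, not taken from the paper) draws joint actions from the product distribution \(q=\prod \nolimits _i q_i \) on a discretized strategy grid and averages the objective values; PC never needs \(G\) in closed form, only such samples.

```python
import numpy as np

def sample_joint(q, rng):
    """Draw one joint action: agent i picks strategy index j with probability q[i][j]."""
    return [rng.choice(len(qi), p=qi) for qi in q]

def estimate_EG(G, q, grids, n_samples, rng=None):
    """Monte Carlo estimate of E_q[G] in (2) under the product distribution q = prod_i q_i.

    G: objective on real-valued points; grids[i][j]: real value of agent i's j-th
    discretized strategy. All names are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    vals = []
    for _ in range(n_samples):
        idx = sample_joint(q, rng)
        vals.append(G([grids[i][idx[i]] for i in range(len(q))]))
    return float(np.mean(vals))

# Example: two agents, three strategies each, G(x) = x1^2 + x2^2.
q = [np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])]
grids = [np.array([-1.0, 0.0, 1.0])] * 2
print(estimate_EG(lambda x: x[0] ** 2 + x[1] ** 2, q, grids, 1000))
```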

However, as mentioned above, the full rationality assumption in game theory usually cannot be satisfied. This means the optimal probability distribution cannot be found directly but can only be approximated step by step. In other words, the uncertainty of the final distribution decreases gradually during the game. If the Shannon entropy is employed here to measure the uncertainty of the distribution \(q\), it satisfies

$$\begin{aligned} S(q)=-\int {q(x)\ln q(x)\,\mathrm{d}x} \end{aligned}$$
(3)

and the entropy should gradually decrease as the objective function approaches its minimum. Note that, according to the maximum entropy principle, the distribution \(q\) should carry the minimal amount of extra information beyond the prior knowledge; thus the entropy should be as large as possible and an equality is used instead of an inequality.

So the original problem can also be described as follows:

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _q E_q [G]=\int {G(x)\prod \nolimits _\mathrm{i} {q_i (x_i )} \mathrm{d}x} \\ s.t.\quad S(q)=s \\ \quad \quad \int {q_i (x_i )\mathrm{d}x_i =1} \\ \quad \quad q_i (x_i )\ge 0 \end{array} \end{aligned}$$
(4)

where \(s\) is the prior knowledge about the distribution.

The last two constraints keep the distribution normalized and will be satisfied through the iterations. Using the Lagrange multiplier method, we can get

$$\begin{aligned} \mathop {\min }\limits _q L(q,T)=\int {G(x)\prod \nolimits _\mathrm{i} {q_i (x_i )} \mathrm{d}x} -T(S(q)-s), \end{aligned}$$
(5)

where \(T\) is a Lagrange multiplier, also referred to as the temperature, which is fixed during one iteration and decreased when the iteration finishes. The objective function (5) is called the Maxent Lagrangian and is widely used in statistical physics, where it is referred to as the free energy.

Since \(q\) is a product joint distribution, the entropy can be written in summation form, so minimizing the local/private objective function for each agent simultaneously minimizes the global problem. The objective function can thus be further written as

$$\begin{aligned}&\mathop {\min }\limits _{q_i } L(q_i ,T_i )=\int {G(x)\prod \nolimits _\mathrm{j} {q_j (x_j )} \mathrm{d}x} -T_i (S(q_i )-s_i ),\nonumber \\&\quad i=1,2,\ldots ,n \end{aligned}$$
(6)
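
The summation form of the entropy invoked above follows from the product structure of \(q\) and the normalization of each factor; for completeness,

$$\begin{aligned} S(q)=-\int {\prod \nolimits _j q_j (x_j )\ln \Big (\prod \nolimits _k q_k (x_k )\Big )\mathrm{d}x} =-\sum \limits _k \int {q_k (x_k )\ln q_k (x_k )\,\mathrm{d}x_k } =\sum \limits _k S(q_k ). \end{aligned}$$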

Here the private objective functions are set equal to the global objective function, which is called the Team Game (TG). A natural way to solve this problem is to gradually decrease \(s_{i}\) and, each time, find a distribution that matches \(S(q_{i})=s_{i}\). The distribution then automatically approaches the optimal solution, and when \(s_{i}\) becomes sufficiently small, the unique distribution that minimizes the problem can be found. However, in Wolpert and Lee (2004), the temperature \(T\) is shown to be non-decreasing in \(s_{i}\), which means \(s_{i}\) decreases as \(T\) decreases. Moreover, decreasing the temperature is much easier than decreasing \(s_{i}\). Objective (6) can therefore be written equivalently as the formula below:

$$\begin{aligned}&\mathop {\min }\limits _{q_i } L(q_i ,T_i )=\int {G(x)\prod \nolimits _\mathrm{j} {q_j (x_j )} \mathrm{d}x} -T_i S(q_i ),\nonumber \\&\quad i=1,2,\ldots ,n \end{aligned}$$
(7)

In statistical physics, the equations above are also called Boltzmann equations and their solutions are

$$\begin{aligned} q_i (x_i )\propto \exp \left( {-\frac{E_{(q_i )} (G(x)\vert x_i )}{T}}\right) ,\quad i=1,2,\ldots ,n, \end{aligned}$$

which is called the Boltzmann distribution. The subscript in the conditional expectation indicates that its value depends on the joint distribution of all agents other than \(q_{i}\). It reflects the fact that the probability of agent \(i\) choosing pure strategy \(x_{i}\) depends on the effect of that choice on the utilities of the other agents.
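
As a concrete illustration, the sketch below (illustrative names only) evaluates this Boltzmann distribution on a discretized strategy set and shows how a lower temperature concentrates the probability mass on the strategies with the smallest conditional expectation.

```python
import numpy as np

def boltzmann_distribution(cond_expectation, T):
    """Discretized Boltzmann distribution q_i(j) proportional to exp(-E[G|x_i=j] / T).

    cond_expectation: estimated E[G | x_i = j] over agent i's discretized strategy set;
    T: temperature. Subtracting the minimum before exponentiating only rescales the
    unnormalized weights and avoids overflow.
    """
    e = np.asarray(cond_expectation, dtype=float)
    w = np.exp(-(e - e.min()) / T)
    return w / w.sum()

# A quadratic 'slice' E[G | x_i = j] on a 5-point grid: low T concentrates the mass
# on the minimizer, high T stays close to uniform.
print(boltzmann_distribution([4.0, 1.0, 0.0, 1.0, 4.0], T=0.1))
print(boltzmann_distribution([4.0, 1.0, 0.0, 1.0, 4.0], T=10.0))
```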

3 Trust region PC method (TRPC)

The Maxent Lagrangian formulation is powerful, since it successfully converts the primal problem into the probability space; the remaining work is to solve formula (7). However, not all gradient-based optimization methods are suitable for this model and some implicit requirements must be satisfied. In this section, these requirements are first identified and then the new PC algorithm is presented.

3.1 Potential requirements

Differentiating formula (7) with respect to \(q_i (x_i =j)\), we get

$$\begin{aligned} \frac{\partial L}{\partial q_i (x_i =j)}=E[G\vert x_i =j]+T_i [1+\ln q_i (j)] \end{aligned}$$
(8)

So gradient-based methods can be used to find the optimal \(q^{*}\) for a fixed temperature \(T\). Since the closed form of \(G(x)\) is not known, Monte Carlo sampling is used to estimate the conditional expectation \(E[G\vert x_i =j]\) in Eq. (8).
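
A minimal sketch of this estimation step is given below, reusing sample_joint from the sketch in Sect. 2.2; it is a plain Monte Carlo estimator written for illustration, not the paper's exact sampling procedure.

```python
import numpy as np

def estimate_conditional(G, q, grids, n_samples, rng=None):
    """Monte Carlo estimates of E(G) and of E[G | x_i = j] appearing in the gradient (8).

    E[G | x_i = j] is estimated by averaging G over the samples in which agent i
    happened to pick strategy j; entries that receive no samples stay NaN.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(q)
    totals = [np.zeros(len(qi)) for qi in q]
    counts = [np.zeros(len(qi)) for qi in q]
    g_sum = 0.0
    for _ in range(n_samples):
        idx = sample_joint(q, rng)
        g = G([grids[i][idx[i]] for i in range(n)])
        g_sum += g
        for i in range(n):
            totals[i][idx[i]] += g
            counts[i][idx[i]] += 1
    E_G = g_sum / n_samples
    E_cond = [np.where(c > 0, t / np.maximum(c, 1), np.nan) for t, c in zip(totals, counts)]
    return E_G, E_cond
```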

However, the adoption of Monte Carlo sampling brings some critical issues that must be considered. First, the optimization method should be an interior point method: in each iteration the values we update are probability values, and they are used both to perform Monte Carlo sampling and to calculate the entropy, so the method should automatically keep every probability value within [0, 1]. (Normalization may not be necessary; this is discussed later.) Second, the calculation of a step size should be avoided, because it is extremely costly in this model. For example, to compare two step sizes along a given descent direction, we would need to use each of them to update the probability distributions, sample under the new distributions and evaluate the new function values. When the step size is a real number, this approach becomes impractical. Third, since the gradient of \(L\) contains the conditional expectation \(E[G\vert x_i =j]\), which carries random errors in practice, the algorithm should not only guarantee an effective descent in each iteration but also tolerate the randomness and estimation error.

Former PC algorithms do not fully address all of these problems. In the widely used nearest Newton method, the interior point requirement is handled through the iterative step

$$\begin{aligned} q_i (j)\leftarrow q_i (j)-\alpha \,q_i (j)\left\{ E( {g_i \vert x_i =j})-E(G)+T\left[ {S(q_i )+\ln q_i (j)} \right] \right\} \end{aligned}$$
(9)

It can easily be checked that \(\sum _j q_i (j)\{E(g_i \vert x_i =j)-E(G)+T[S(q_i )+\ln q_i (j)]\}=0\). So the summation of the probabilities stays equal to one in every iteration, provided the algorithm starts from a normalized distribution such as the uniform distribution. However, this iterative step works well only when the estimation error of \(E(G\vert x_i =j)\) is small, which can hardly be guaranteed, especially in the early iterations. In addition, this equation cannot guarantee that every new \(q_i (j)\) remains strictly positive during the iteration. Although a negative \(q_i (j)\) can be reset to zero, in extremely bad situations all of the \(q_i (j)\) may become negative after the update. Worse, directly resetting values makes the summation of the probabilities smaller than 1, which affects both the MC sampling and the entropy. Finally, the parameter \(\alpha \) acts like a step size that damps the estimation error and keeps the iteration on track, but a small \(\alpha \) also shrinks the effect of descent along a correct direction.
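
For reference, the update (9), together with the clipping repair just described, can be transcribed as follows (illustrative names; this is the original PC step that TRPC replaces):

```python
import numpy as np

def nearest_newton_update(q_i, E_cond, E_G, T, alpha):
    """One nearest Newton step, Eq. (9), for agent i's distribution.

    q_i: current probabilities; E_cond[j]: estimate of E(G | x_i = j); E_G: estimate of
    E(G); T: temperature; alpha: damping factor. Negative entries are clipped to zero
    afterwards, which is exactly the ad-hoc repair criticized in the text (the
    probabilities may then sum to less than 1).
    """
    S_qi = -np.sum(q_i * np.log(q_i + 1e-300))            # Shannon entropy of q_i
    bracket = E_cond - E_G + T * (S_qi + np.log(q_i + 1e-300))
    q_new = q_i - alpha * q_i * bracket
    return np.clip(q_new, 0.0, None)                      # may destroy normalization
```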

Given all of these problems, a trust region method is used here to solve the Maxent Lagrangian. Before introducing the details, a slight modification of the model is needed. Note that the equality constraint

$$\begin{aligned} \int {q_i (x_i )\mathrm{d}x_i } =1\;,\quad i=1,2,\ldots ,n \end{aligned}$$

may not be necessary, since the purpose is to make one or a few probability values large (\(\approx \)1) while the others go to zero, and this equality is needed only when the MC sampling is performed or the entropy is calculated. The probabilities can therefore be normalized just before these two calculations and simply kept within (0, 1) at other times. In other words, we only need to keep each probability value between 0 and 1 while solving the Maxent Lagrangian and normalize it only for MC sampling and the entropy calculation. This constraint can thus be relaxed and the model changed to

$$\begin{aligned}&\mathop {\min }\limits _{q_i } L(q_i ,T_i )=\int {G(x)\prod \nolimits _\text {j} {q_j (x_j )} \mathrm{d}x} -T_i S(q_i ),\nonumber \\&\qquad i=1,2,\ldots ,n \nonumber \\&\;\text {s.t}.\quad 0<q_i (x_i )<1 \end{aligned}$$
(10)

3.2 The trust region probability collectives (TRPC)

Now the original problem has been transformed into a minimization problem subject to bound constraints, and the new algorithm can be discussed in detail. Considering all of the requirements, we redesign the updating rule for \(T\) and introduce a modified trust region (TR) method for bounded variables, first presented by Coleman et al. (1996), to solve the Maxent Lagrangian.

First, a vector \(v(q_i )\) and an affine scaling matrix \(D(q_i )\) are defined as follows:

Definition 1

Let \(v(q_{i})=(v_{i1}, v_{i2}, \ldots , v_{in})\) be a vector for \(q_{i}\), and then

$$\begin{aligned} v_{ij} =\left\{ {\begin{array}{ll} q_{ij} -1,&{}\quad \text {if}\;\nabla L_{ij} <0 \\ q_{ij} ,&{}\quad \text {if}\;\nabla L_{ij} \ge 0\\ \end{array}} \right. \end{aligned}$$

where \(q_{ij}\) is the probability value of the \(j\)th strategy of agent \(i\).

Definition 2

For all \(v(q_{i})\), let

$$\begin{aligned} D(q_{i} ) = {\text {diag}}(\left| {v(q_{i} )} \right| ^{{ - 1/2}} ), \end{aligned}$$

where diag(\(\cdot \)) denotes a diagonal matrix.

Based on these two definitions, the quadratic sub-problem in TR can be defined as follows. Here the result is given directly and more details are presented in Appendix A.

$$\begin{aligned} m_k (s)=g_k^T s+\frac{1}{2}s^TM_k s, \end{aligned}$$
(11)

where

$$\begin{aligned} \begin{array}{l} g_k =D_k^{-1} \nabla L_k =\text {diag}(\left| {v(q_i )} \right| ^{\frac{1}{2}})\nabla L_k \\ M_k =D_k^{-1} B_k D_k^{-1} +\text {diag}(\nabla L_k )J_k \\ \end{array} \end{aligned}$$

\(J_{k}\) plays the role of the Jacobian matrix of \(\vert v(q_{i})\vert \), allowing for a second-order Newton process (Coleman et al. 1994). We set \(J_{k}=\text {diag}(\text {sign}(\nabla L_{k}))\).
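
Under the assumption that Definitions 1 and 2 are applied exactly as stated, these quantities can be assembled as in the sketch below; the helper name and the choice of also returning the diagonal of \(D_k^{-1}\) (used later to map the scaled step back to \(q_i\)) are illustration-only conventions.

```python
import numpy as np

def scaled_model(q_i, grad_L, B):
    """Coleman-Li style scaled quantities for sub-problem (11)-(12).

    q_i: the bounded variables (0 < q_ij < 1); grad_L: gradient of the Maxent
    Lagrangian w.r.t. q_i; B: current (e.g. BFGS) Hessian approximation.
    Follows Definitions 1 and 2: v_ij = q_ij - 1 where the gradient component is
    negative, else q_ij; D = diag(|v|^{-1/2}); g_k = D^{-1} grad;
    M_k = D^{-1} B D^{-1} + diag(grad) J with J = diag(sign(grad)).
    """
    v = np.where(grad_L < 0, q_i - 1.0, q_i)
    d_inv = np.abs(v) ** 0.5                     # diagonal of D^{-1}
    D_inv = np.diag(d_inv)
    J = np.diag(np.sign(grad_L))
    g_k = D_inv @ grad_L
    M_k = D_inv @ B @ D_inv + np.diag(grad_L) @ J
    return g_k, M_k, d_inv
```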

Following (11), it can be easily checked that the statements below hold.

(a) \(\quad g_{\mathrm{k}}=0\) iff the first-order necessary conditions for form (10) are satisfied.

(b) \(\quad M_{\mathrm{k}}\) is positive definite and \(g_{\mathrm{k}}=0\) iff the second-order sufficiency conditions are satisfied.

(c) \(\quad M_{\mathrm{k}}\) is positive semi-definite and \(g_{\mathrm{k}}=0\) iff the second-order necessary conditions are satisfied.

So \(q_{\mathrm{i}}\) will be a local minimum of model (10) iff \(s=0\) is a solution to the quadratic sub-problem

$$\begin{aligned} \mathop {\min }\limits _s \left\{ {m_k (s)=g_k^T s+\frac{1}{2}s^TM_k s:\;\left\| s \right\| \le \Delta _k } \right\} \end{aligned}$$
(12)

Hence solving the sub-problem (12) is a reasonable step to attempt when \(q_{i}\) is not a local minimizer.

Here, a so-called dog-leg method is used to solve the problem approximately, and it can be checked that solutions obtained by this method satisfy the sufficient-reduction assumptions that guarantee the convergence of the algorithm. Again, only the result is given here and more details are shown in Appendix B; a short sketch of the dog-leg step follows the three cases below.

Here, three cases are considered:

(a) If the trust region radius is sufficiently large (\(\left\| {s^B} \right\| =\left\| {-M_k^{-1} g_k } \right\| \le \Delta _k)\), \(s^{B}\) is exactly the solution.

(b) If the trust region radius is rather small (\(\left\| {s^U} \right\| =\left\| {\,-\frac{g_k^T g_k }{g_k^T M_k g_k }g_k } \right\| \ge \Delta _k )\), the solution is \(s^{U}\) scaled back to the boundary, i.e. \(\alpha s^{U}\) with \(\Vert \alpha s^{U}\Vert =\Delta _{k}\).

(c) Otherwise, the solution is a combination of \(s^{B}\) and \(s^{U}\), namely \(s^{U}+(\tau -1)(s^{B}-s^{U})\), where \(\tau \) is obtained by solving \(\Vert s^{U}+(\tau -1)(s^{B}-s^{U})\Vert =\Delta _{k}\).
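
The sketch below covers the three cases; it assumes \(M_k\) is positive definite so that the full step exists, which is the standard setting for the dog-leg construction (names are illustrative).

```python
import numpy as np

def dogleg_step(g_k, M_k, delta):
    """Approximate solution of sub-problem (12) by the dog-leg method."""
    s_B = -np.linalg.solve(M_k, g_k)                      # full (quasi-)Newton step
    if np.linalg.norm(s_B) <= delta:                      # case (a)
        return s_B
    s_U = -(g_k @ g_k) / (g_k @ M_k @ g_k) * g_k          # unconstrained Cauchy step
    if np.linalg.norm(s_U) >= delta:                      # case (b): scale back to the boundary
        return delta * s_U / np.linalg.norm(s_U)
    # Case (c): walk from s_U toward s_B until the trust region boundary is reached.
    d = s_B - s_U
    a, b, c = d @ d, 2 * (s_U @ d), s_U @ s_U - delta ** 2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)       # root in [0, 1], equals tau - 1
    return s_U + t * d
```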

So the whole TRPC algorithm can be described as follows (a compact sketch of the loop is given after the steps):

Step 1 Check the stopping criterion. If \(\sum _j \vert q_{ij}^{k}-q_{ij}^{k-1}\vert \le \eta \), stop the algorithm; the current distribution is the optimal solution. Otherwise, execute the next step.

Step 2 Use Monte Carlo sampling to sample from the current distribution \(q^{k}\) in the \(k\)th iteration and estimate the expectations \(E(G(x))\) and \(E(G\vert x_{i})\).

Step 3 According to \(E(G\vert x_{i})\), calculate \(T\) for each agent by the annealing schedule, and calculate the gradient.

Step 4 Use the BFGS method to update \(B_{k}\) as an approximation to the Hessian matrix and obtain \(s_{k}\) by (approximately) solving the sub-problem (12). The ratio \(r_{k}\) can be evaluated from

$$\begin{aligned} r_k =\frac{L_k -L(q_k +s_k )}{m_k (0)-m_k (s_k )}=\frac{L_k -L(q_k +s_k )}{-m_k (s_k )} \end{aligned}$$

Then,

If \(r_{k }<\varepsilon \), set \(\Delta _{k+1}=\vert \vert s_{k}\vert \vert \)/4;

if \(r_{k }>\) 0.75 and \(\vert \vert s_{k}\vert \vert =\Delta _{k}\), set \(\Delta _{k+1}=2\vert \vert s_{k}\vert \vert \);

otherwise, set \(\Delta _{k+1}=\Delta _{k}\).

To get an approximation of the exact \(s_{k}^*\) in the sub-problem by dog-leg method, define

$$\begin{aligned} s^B=-M_k^{-1} g_k ,\;s^U=-\frac{g_k^T g_k }{g_k^T M_k g_k }g_k \end{aligned}$$

and choose \(s^{B}\) as \(s\)* if \(\vert \vert s^{B}\vert \vert \le \Delta _{k}\). Otherwise, we replace \(s\)* with the following form:

$$\begin{aligned} \tilde{s}_k (\tau )=\left\{ {\begin{array}{ll} \tau s^U &{}\quad 0\le \tau \le 1 \\ s^U+(\tau -1)(s^B-s^U) &{}\quad 1\le \tau \le 2, \\ \end{array}} \right. \end{aligned}$$

where \(\tau \) is solved through \(\left\| {\tilde{s}_k (\tau )} \right\| =\Delta _k \).

Note that, it is necessary to generate another sample set to evaluate \(L(q_{k}+s_{k})\).

Step 5 If \(r_{k}>0\), set \(q_{k+1}=q_{k}+s_{k}\); otherwise let \(q_{k+1}=q_{k}\). Then go to Step 1.
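
The five steps can be assembled into a compact loop. The sketch below only composes the earlier helper sketches (sample_joint, estimate_EG, estimate_conditional, scaled_model, dogleg_step); the geometric temperature decay, the fixed Hessian approximation and the simplified interior handling are stand-ins for the paper's procedure (the adaptive schedule is described in Sect. 3.3), so it should be read as a schematic rather than a reference implementation.

```python
import numpy as np

def lagrangian_hat(G, q, i, qi, grids, T, n_samples, rng):
    """Sampled value of agent i's Maxent Lagrangian (7): E_q[G] - T * S(q_i).

    The probabilities are normalized only for sampling and for the entropy,
    as argued for model (10).
    """
    q_norm = [qj / qj.sum() for qj in q]
    q_norm[i] = qi / qi.sum()
    S_qi = -np.sum(q_norm[i] * np.log(q_norm[i] + 1e-300))
    return estimate_EG(G, q_norm, grids, n_samples, rng) - T * S_qi

def trpc(G, grids, n_samples=50, eta=1e-4, eps=0.25, max_iter=2000, rng=None):
    """Schematic TRPC loop (Steps 1-5)."""
    rng = np.random.default_rng() if rng is None else rng
    q = [np.full(len(g), 1.0 / len(g)) for g in grids]     # uniform start, 0 < q_ij < 1
    B = [np.eye(len(g)) for g in grids]                    # fixed Hessian approximations (no BFGS here)
    delta = [1.0] * len(grids)
    T = 10.0                                               # stand-in initial temperature
    for k in range(max_iter):
        q_prev = [qi.copy() for qi in q]
        # Step 2: Monte Carlo estimates under the current (normalized) distributions.
        q_norm = [qi / qi.sum() for qi in q]
        E_G, E_cond = estimate_conditional(G, q_norm, grids, n_samples, rng)
        for i, qi in enumerate(q):
            # Step 3: gradient (8); strategies that received no samples fall back to E(G).
            Ec = np.where(np.isnan(E_cond[i]), E_G, E_cond[i])
            grad = Ec + T * (1.0 + np.log(qi + 1e-300))
            # Step 4: scaled sub-problem, dog-leg step, trust region update.
            g_k, M_k, d_inv = scaled_model(qi, grad, B[i])
            s_hat = dogleg_step(g_k, M_k, delta[i])
            s_q = d_inv * s_hat                            # scaled step mapped back to q (simplified)
            pred = -(g_k @ s_hat + 0.5 * s_hat @ M_k @ s_hat)
            qi_new = np.clip(qi + s_q, 1e-12, 1.0 - 1e-12)  # stay strictly inside (0, 1)
            L_old = lagrangian_hat(G, q, i, qi, grids, T, n_samples, rng)
            L_new = lagrangian_hat(G, q, i, qi_new, grids, T, n_samples, rng)
            r_k = (L_old - L_new) / max(pred, 1e-12)
            if r_k < eps:
                delta[i] = np.linalg.norm(s_hat) / 4
            elif r_k > 0.75 and np.isclose(np.linalg.norm(s_hat), delta[i]):
                delta[i] = 2 * np.linalg.norm(s_hat)
            # Step 5: accept or reject the step.
            if r_k > 0:
                q[i] = qi_new
        T *= 0.95                                          # stand-in for the adaptive schedule
        # Step 1: stopping criterion on the change of the distributions.
        if all(np.abs(qi - qp).sum() <= eta for qi, qp in zip(q, q_prev)):
            break
    return [qi / qi.sum() for qi in q]
```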

3.3 Annealing

The temperature \(T\) plays an important role in the performance of PC, balancing the fully rational part (the expectation) and the irrational part (the entropy) in the Maxent Lagrangian. As in SA, it is necessary to gradually decrease \(T\) in PC to reach the optimal point. Since decreasing \(T\) too quickly may lead to a local optimum while decreasing it too slowly may require too many samplings and iterations, a proper annealing schedule is quite important.

The traditional annealing schedule is a geometric schedule that decreases \(T\) proportionally by multiplying by a fixed parameter. However, the temperature \(T\) used here is quite different from the \(T\) used in the simulated annealing algorithm. Since the gradient contains \(T\), it is possible to further describe the relationships among the temperature, the search scope and the accuracy of the algorithm, so a much more flexible and efficient schedule can be designed.

Considering the structure of the gradient (8),

$$\begin{aligned} \frac{\partial L}{\partial q_i (x_i =j)}=E[G\vert x_i =j]+T_i [1+\ln q_i (j)] \end{aligned}$$

for the \(i\)th agent, the conditional expectation describes the payoff of each of its strategies given the moves of the other agents. In other words, \(E(G\vert x_{i}=j)\) tells us what objective value to expect when strategy \(j\) is chosen for agent \(i\) under the other variables' current probability distributions, and the curve of \(E(G\vert x_{i}=j)\) reflects how the objective value varies over the different strategies of agent \(i\); it looks just like a 'slice' of the objective function. The optimization of the Maxent Lagrangian can therefore be interpreted as increasing the probability of strategies with low conditional expectation while decreasing the probability of those with high conditional expectation. Thus the whole second term in gradient (8) can be treated as a controller, through which negative gradient values are assigned to strategies with smaller conditional expectation and positive gradient values to the larger ones. Since the optimal direction in TR is roughly opposite to the gradient, strategies with negative gradient values get higher probability in the next round and strategies with positive gradient values have their probability lowered. So, by controlling \(T\), we control the sign of the gradient and thus the increase and decrease of the probability values. In practice, the temperature \(T\) can be adjusted according to the corresponding probability.

For example, Fig. 1 shows the conditional expectation of the quadratic function \(f(x)=x^{2}\), \(x\in (-2, 2)\), with 50 samples per iteration. Here \([-2, 2]\) is equally divided into 200 points.

Fig. 1 \(E(G\vert x)\) of function \(f(x)=x^{2}\) with 50 samples

Line (a) in Fig. 1 is exactly the second term of the gradient; that is, we choose \(T_{ij}[1+\ln (q_{ij})]=-1\). Note that \(g_{ij}=E(G\vert x_{i}=j)+T_{ij}[1+\ln (q_{ij})]\), so the gradient value \(g_{ij}\) is positive on the intervals \(-2\le x_{ij}\le -1\) and \(1\le x_{ij}\le 2\) and negative on the interval \(-1\le x_{ij}\le 1\). Since \(x_{ij}\in (-2,-1)\cup (1,2)\) generates larger objective values than the interval \((-1,1)\), the probability corresponding to the former regions should decrease while the probability of the latter region should increase in the next iteration. Thus the gradient can be used to generate the needed descent direction. Based on this interpretation, the annealing process can be viewed as both lowering line (a) and narrowing the probability-increasing region on the x-axis: the closer the controller decreases to 0, the closer the region narrows around the optimal point. Finally, only one point is left that should increase its probability. Moreover, as the probability value gradually concentrates on a narrow area, so does the sampling. For example, in Fig. 1 the algorithm will finally assign a high probability (\(\approx \)1) around \(x=0\), so the samples generated by Monte Carlo sampling will cluster around \(x=0\), and \(E(G(x))\) will finally approach a fixed value through these iterations.
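
One way to turn a chosen controller level \(c\) into per-strategy temperatures, following the reading of Fig. 1 above, is sketched below; this construction is an illustration, not a formula stated in the paper.

```python
import numpy as np

def temperature_from_controller(E_cond, q_i, c):
    """Per-strategy temperatures that place the controller of gradient (8) at level c.

    Choosing T_ij so that T_ij * (1 + ln q_ij) = -c makes the gradient equal to
    E(G | x_i = j) - c, i.e. negative (probability should rise) exactly for strategies
    whose conditional expectation lies below the controller line c. The degenerate case
    q_ij = e^{-1} is sidestepped by a small nudge.
    """
    denom = 1.0 + np.log(np.asarray(q_i, dtype=float) + 1e-300)
    denom = np.where(np.abs(denom) < 1e-8, -1e-8, denom)
    T_ij = -c / denom
    grad = np.asarray(E_cond, dtype=float) + T_ij * denom   # equals E_cond - c
    return T_ij, grad < 0                                   # temperatures, strategies gaining probability
```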

The new annealing schedule also gives another insight into the optimization process. Consider the function \(f(x) =\sin (2x)+\sin (x)\) for \(x\in [0, 6]\); the corresponding \(E(G\vert x)\) is shown in Fig. 2. Clearly, this function has one local minimum and one global minimum. At the beginning of the algorithm, the controller is set to a relatively high value (the upper line (a) in Fig. 2). Obviously, the probability of any \(x_{i}=j\) with \(E(G\vert x_{i}=j)\) above line (a) will decrease in the next iteration, while the probability values of points near the local and global optimal solutions will increase. In other words, when line (a) is used, MC sampling mainly focuses on the areas around the local minimum and the global minimum. However, as the controller is gradually lowered from line (a) to line (b), the conditional expectation at the local minimum (i.e. \(E(G\vert x_{i}=\) local minimum)) lies above line (b) and its gradient value changes from negative to positive, so the corresponding probability begins to decrease. Therefore, when line (b) is used, the probability-increasing region contains only the areas around the global optimal point, and as the process continues this point is ultimately picked out.

Fig. 2 Function \(f(x) =\sin (2x)+\sin (x)\) with 50 samples

Besides, note that the optimal solution of the Maxent Lagrangian lies on the boundary of the feasible region, which means the probability is close to either 1 or 0, with negative and positive gradient respectively. So the final solution satisfies the first-order necessary conditions of a local optimal point.

The estimation of \(E(G\vert x)\) is another important part of the algorithm. If the conditional expectation were known exactly at the first iteration, the optimal point could be found directly. However, \(E(G\vert x)\) may not be as precise as expected at the beginning, or even at the end, of the algorithm, especially when the sample size is small, and the randomness in the estimation can hardly be eliminated. So, even though a TR method is used, the new PC algorithm still has the characteristics of a random search method. In general, the estimation can be viewed as a process that gradually sharpens \(E(G\vert x)\) under the guidance of the probability. At first, the algorithm pays equal attention to every point of the space and obtains a rough description of the objective function. In the following rounds, as the probability concentrates, the algorithm focuses on a narrower area and describes \(E(G\vert x)\) more accurately than in previous rounds. As this process continues, TRPC gradually narrows the areas that may include the global optimum and finally picks out an interval that indeed contains the optimal point.

In practice, a gradually shrinking controller is a good choice, but here a much simpler strategy is used. In each iteration, the minimal \(E(G\vert x)\) is found and the largest value in its neighborhood is chosen as the controller. If the distribution remains unchanged for several iterations as the algorithm proceeds, the controller is sharply narrowed to \(\min E(G\vert x)+\varepsilon \). This strategy can be called the two-phase schedule. The choices of the neighborhood width and of the value of \(\varepsilon \) depend on the objective function. To handle this, several TRPC iterations (say 1,000) can be run with a very large width and \(\varepsilon \) to observe the trend of the objective function through \(E(G\vert x)\). For sharply varying functions, numerical experiments show that small values of the width and \(\varepsilon \) find the global minimum more efficiently than large ones. A sketch of this schedule is given below.
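
The sketch below spells out the controller selection under these assumptions; the neighborhood width, \(\varepsilon \) and the stall flag are illustrative parameters, not the paper's exact implementation.

```python
import numpy as np

def two_phase_controller(E_cond, grid, width, eps, stalled):
    """Two-phase choice of the controller level c for one agent.

    Phase 1: c is the largest E(G|x) value inside a neighborhood of half-width `width`
    around the current minimizer of E(G|x) on the strategy grid. Phase 2 (once the
    distribution has stayed unchanged for several iterations, flagged by `stalled`):
    c = min E(G|x) + eps. c can be turned into per-strategy temperatures with
    temperature_from_controller from the sketch after Fig. 1.
    """
    E_cond = np.asarray(E_cond, dtype=float)
    grid = np.asarray(grid, dtype=float)
    j_min = int(np.nanargmin(E_cond))
    if stalled:
        return float(np.nanmin(E_cond)) + eps
    near = np.abs(grid - grid[j_min]) <= width
    return float(np.nanmax(np.where(near, E_cond, np.nan)))
```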

So far the new TRPC algorithm has been discussed in detail; the whole procedure is summarized in Fig. 3. It is worth noting that \(q_{i}\) may not be updated in every iteration: if \(q_{k+1}=q_{k}\) in the \(k\)th iteration, the samples generated in the two iterations can be combined.

Fig. 3 Procedure of the TRPC algorithm

4 Experimental results

To compare the performance of TRPC with that of the original PC algorithm, the same benchmark functions used in Huang et al. (2005) are adopted here. These four benchmark functions are the Rosenbrock function with 10 dimensions, the Ackley function with 10 dimensions, Schaffer's function \(F_{7}\) with 2 dimensions and Michalewicz's epistatic function with 10 dimensions. Another 6 functions are also used, so in total the new algorithm is tested on ten benchmark functions, of which 6 are of 10 dimensions and 4 are of two dimensions.

As mentioned in Huang et al. (2005), the performance of the original PC algorithm shows no difference between sample sizes of 25 and 50. In TRPC, however, increasing the sample size does make sense, so we use 40–50 samples per iteration. Moreover, since the original PC algorithm usually needs thousands of iterations while TRPC acquires satisfactory results within 2,000 iterations, it is inconvenient to show their convergence processes in one figure, so the performances of the original PC and of TRPC are shown separately.

In this section we mainly focus on the first four functions: the results generated by the TRPC algorithm are explained in detail and compared with those of the original PC algorithm. For the other six functions the results are very similar, so they are only listed in the table.

The numerical experiments were implemented in Matlab 2010b under the Vista operating system and run on a personal computer with an Intel(R) Core(TM) 2 2.00 GHz CPU and 2 GB of RAM.

4.1 Comparison with the original PC

(1) The Rosenbrock function

The Rosenbrock function is one particular case discussed in Huang et al. (2005). Its definition is

$$\begin{aligned} f(x)=\sum \limits _{i=1}^{N-1} {[100(x_i^2 -x_{i+1} )^2+(1-x_i )^2]}, \end{aligned}$$

where \(x=(x_{1},x_{2},\ldots ,x_{N})\), \(-5\leqslant x_{i}\leqslant 5\) and \(N=10\). The optimal point is \((1,1,\ldots ,1)\) and the corresponding function value is 0.

Fig. 4 Performance of the original PC on the Rosenbrock function

Figure 4a shows the results obtained by the original PC; the x-axis shows the iteration count and the y-axis the value of \(E(G)\). Figure 4b displays a detailed view of \(E(G)\) over the interval [0, 1,000]. The figure shows that the function value is still near 100 after 1,000 iterations of the original PC. For the same function, however, a satisfactory result is obtained within 1,200 iterations by the new TRPC algorithm; the results are shown in Figs. 5 and 6. Although TRPC needs more samples in this case, it sharply decreases the objective value from 4,000 to 10 within 10 iterations and accurately finds the global optimum.

Fig. 5 \(E(G)\) by TRPC in the Rosenbrock function

It should be noted that \(E(G)\) is used here for the convenience of comparison with Huang's numerical results, but it is not the final objective value. The final optimal value comes from \(G(x^{*})\), where \(x^{*}\) is the point with the highest probability value for each agent. In addition, the accuracy of the result is closely related to the range of the intervals and the discretization.

Fig. 6 \(E(G\vert x)\) and \(P\) in the Rosenbrock function

Other outputs of the new algorithm are shown in Fig. 6. Figure 6a presents the curve of \(E(G\vert x)\), and Fig. 6b shows the probability values of \(x_{ij}\). Here \([-2,2]\) is equally divided into 200 intervals to discretize the continuous function, so the x-axis ranges from 0 to 200 in both figures. In Fig. 6a, the curve \(x_{1}\) depicts \(E(G\vert x_{1})\), the curve \(x_{10}\) depicts \(E(G\vert x_{10})\), and the other variables share the same unlabelled curve. Since the closed form of the objective function is not known, Fig. 6a provides some useful information about it. First, \((-1,1,1,\ldots ,1)\) is a local optimal point, since the curve \(x_{1}\) is close to the x-axis at \(x_{1}=-1\) when the other variables take the value 1. Second, a slight change of \(x_{10}\) does not cause a big change in the objective value, because the curve \(x_{10}\) is relatively flat around its optimal solution. Third, the variables \(x_{2},\ldots ,x_{9}\) are symmetric, because their curves overlap. The final probability concentrates on the point \((1,1,\ldots ,1)\) and the exact values are shown in Table 1. It is clear that \(f^{*}=0\).

Table 1 Probability values in Rosenbrock function

(2) The Ackley function

The Ackley function is another case discussed in Huang et al. (2005). The definition of the function is

$$\begin{aligned} f(x)=-a\,e^{-b\left( \frac{\sum \nolimits _i {x_i^2} }{N}\right) ^{1/2}}-e^{\frac{\sum \nolimits _i {\cos (cx_i )} }{N}}+a+e, \end{aligned}$$

where \(a=20\), \(b=0.2\), \(c=2\pi \), \(N=10\) and \(-3\leqslant x_{i}\leqslant 3\). The optimal point is \((0,0,\ldots ,0)\) and \(f(0,0,\ldots ,0)=0\).

Fig. 7 \(E(G)\) by TRPC

Fig. 8 Performance of the original PC on the Ackley function

Figure 7a, b show the results of TRPC and Fig. 8 displays the results of the original PC. This experiment may be even more persuasive than the Rosenbrock case, because only 40 samples per iteration are used for TRPC and a satisfactory result is obtained within 1,500 iterations, whereas the original PC algorithm needs more than 8,000 iterations. The result once again shows that the TRPC method outperforms the original PC algorithm.

Fig. 9 \(E(G\vert x)\) and \(P\) in the Ackley function

Similar to the Rosenbrock function, Fig. 9a provides information about the objective function through \(E(G\vert x)\), and Fig. 9b shows that the optimal point should be \(x=(0,0,\ldots ,0)\). It is clear that this function has only one global optimum and that all \(x_{i}\) are symmetric. The final probabilities are shown in Table 2.

Table 2 Probability values in the Ackley function

In this case, it is worth noting that \(f^{*}\) is 0.0427, since \(x_{3}\) deviates from 0 by one grid unit (0.03 in this case). The reason for this deviation is the two-phase schedule: the \(\varepsilon \) used in \(\min E(G\vert x)+\varepsilon \) is too small, so even though TRPC finds the right candidate area in the first phase, the probability concentrates too quickly on a false optimum without exploring other potential points around it. This problem could be solved by using more phases in the annealing schedule.

(3) The Schaffer’s function \(F_{7}\)

The third benchmark function used in the literature is Schaffer’s function \(F_{7}\). The function is defined as

$$\begin{aligned} f(x)=(x_1^2 +x_2^2 )^{0.25}[\sin ^2(50(x_1^2 +x_2^2 )^{0.1})+1], \end{aligned}$$

where \(-2 \leqslant x_{i} \leqslant 2\) for \(i=1,2\). The optimal point is (0,0) and the function value at this point is 0.

Fig. 10 \(E(G)\) by TRPC and PC in Schaffer's function \(F_{7}\)

Figure 10a was obtained by TRPC, plotting one point every ten iterations to allow comparison with the result of the original PC algorithm shown in Fig. 10b. At first glance, the two figures seem to show no difference between TRPC and PC, but looking at the probability it is obvious that the optimal point has already been found within 2,000 iterations by TRPC. The gap between \(E(G)\) and the optimal value 0 is due to the randomness and the discretization.

Fig. 11 \(E(G\vert x)\) and \(P\) in Schaffer's function \(F_{7}\)

Figure 11a shows clearly that there are many local minima in the interval \([-2,2]\); TRPC is not trapped in any of them and successfully finds the global optimum. The optimal function value \(f^{*}\) is 0 in this case. Table 3 gives the final probability.

Table 3 Probability values in the Schaffer’s function \(F_{7}\)

(4) Michalewicz's epistatic function

The final test function used in the literature is Michalewicz's epistatic function, which has the following form:

$$\begin{aligned} f(x)=-\sum \limits _{i=1}^N {\sin (y_i )\sin ^{2m}\left( \frac{iy_i^2 }{\pi }\right) }, \end{aligned}$$

where

$$\begin{aligned}&y_i =x_i \cos \frac{\pi }{6}-x_{i+1} \sin \frac{\pi }{6}\;\mathrm{if}\,i\;\text {mod 2}=1\text { and}\;i\;\ne N;\\&y_i =x_{i-1} \sin \frac{\pi }{6}+x_i \cos \frac{\pi }{6}\;\mathrm{if}\,i\;\text {mod 2}=0\text { and}\;i\ne N;\\&y_N =x_N ; \end{aligned}$$

\(0 \leqslant x_{i} \leqslant \pi \)    for \(i=1,2,{\ldots },N\).

$$\begin{aligned} m=10; \end{aligned}$$

The optimal function value is \(-9.66\) when \(N=10\).

Fig. 12 \(E(G)\) by TRPC and PC in the Michalewicz epistatic function

Figure 12a shows the \(E(G)\) generated by TRPC and Fig. 12b shows the results of the original PC. The advantage of the TRPC method is even more obvious in this experiment: although \(E(G)\) fluctuates sharply before 1,800 iterations, it finally stabilizes at \(-9.5526\) in TRPC, whereas in the original PC algorithm the objective value is still larger than \(-9\) even after 10,000 iterations. The gap between \(-9.662\) and \(-9.552\) is due to the discretization.

Fig. 13 \(E(G\vert x)\) in the Michalewicz epistatic function

Here we pick out \(E(G\vert x_{i})\) for \(x_{1}\), \(x_{4}\), \(x_{8}\) and \(x_{10}\). It is interesting to note that \(x_{i}\) has \(i\) local minima in this function. For convenience of coding, Fig. 13 shifts the minimum of \(E(G\vert x_{i})\) to zero, but this does not affect the other parts of the algorithm. Figure 14 shows the probability values; the actual probability table is omitted since the optimal points are far apart and can hardly be put into one table.

Fig. 14 \(P\) in the Michalewicz epistatic function

4.2 Other numerical experiments

Besides the functions used in Huang et al. (2005), another 6 test problems are chosen and the results are listed in Table 4. The fifth column of Table 4 gives the solutions from TRPC and the last column the analytical optimal solutions.

Table 4 Other test functions and results

4.3 Comparison with another PC modification

The other PC modification considered here is the method described in Kulkarni et al. (2009); for convenience we call it the modified PC (MPC) below. Its main process is shown in Fig. 15:

Fig. 15 MPC process

The main characteristic of MPC is that it actually shrinks the sampling space in every iteration. According to the current best solution (the 'favorable strategy' in the literature), it artificially sets upper and lower bounds (favorable strategy \(\pm \)(0.5 or 0.1)\(\times \)favorable strategy) for the sampling space. The advantage of this setting is that it improves the accuracy of the final solution. However, since MPC still uses the nearest Newton method to update the probability, the problems mentioned in Sect. 3.1 remain, and Fig. 15 clearly shows that a large number of probability updates (\(k\times n\) times) is needed per iteration to 'sharpen' the probability of the current best solution. So MPC does not change the updating equation of the original PC; instead, it redesigns the whole process to increase the number of probability updates in each iteration and limits the sampling space to compensate for the potential ineffectiveness of the iterative step.

In contrast, TRPC completely changes the iterative step and can therefore drive the probabilities very close to 0 and 1, which means it shrinks the sampling space in terms of probability without actually changing the interval. In other words, MPC directly limits the sampling space around the current best solution, whereas TRPC uses the temperature \(T\) to manipulate the gradient, controls the increase and decrease of the probabilities and thus finally locks onto the best solution. TRPC therefore updates the probabilities more efficiently than MPC. On the other hand, the discretization has a larger impact on TRPC's final solution than on MPC's, since TRPC does not actually reduce the search interval; this is the major disadvantage of the TRPC method.

Although the Rosenbrock function in Kulkarni and Tai (2009) has 5 dimensions and the range of each variable is different, this does not affect the comparison between the two methods, for the following reasons. First, TRPC finds the optimal solution within 1,200 function evaluations even on a harder problem (10 dimensions), while MPC needs \(2.047\times 10^5\)–\(3.591\times 10^5\) iterations; this clearly shows that TRPC needs fewer function evaluations. Second, the actual search interval in both methods is not related to the dimension. In TRPC it is \([-2,2]\) for each variable from beginning to end, and the discretization is also on this interval. In MPC, the actual search interval is determined by the lower and upper bounds (favorable strategy \(\pm 0.1\times \)favorable strategy) and the final discretized intervals are \([-1,-0.9]\) for \(x_{1}\), \([1, 1.1]\) for \(x_{5}\) and \([0.9, 1.1]\) for the other variables. Since the final search interval in MPC is much smaller than that of TRPC, the variation of the optimal solution in MPC (\(\pm 10^{-3}\)) is also smaller than TRPC's (\(\pm 10^{-2}\)). Third, since neither TRPC nor MPC changes the structure of the Maxent Lagrangian, the ranges of the variables in the two methods may differ.

5 Conclusions

In this paper, a modified PC algorithm is presented by combining a TR method with a new annealing schedule. The original probability collectives (PC) algorithm is a heuristic algorithm that focuses on adapting the distribution over the strategy set of each variable. To improve its performance, the requirements for solving the Maxent Lagrangian are first identified: the gradient-based optimization should not only be an interior point method but should also tolerate randomness and avoid the computation of a step size. An interior point TR method is used to meet all of these demands. Another improvement is the adoption of a new annealing schedule that replaces the geometrically decreasing one. To check the performance, ten benchmark functions are used and the numerical results show that the new TRPC method significantly outperforms the original PC algorithm.

The analysis in this article shows that the TRPC method still has room for improvement, and future research could focus on the following points. First, the PC algorithm is inherently designed as a distributed algorithm, so parallel computing could be used to redesign the whole algorithm. Second, the annealing schedule used here is a basic form; more efficient schedules could be developed on this basis to make the algorithm even more powerful. Third, since the PC algorithm can solve both discrete and continuous problems, there is great potential to apply it to real-world problems.