
1 Introduction

A learning automaton (LA) can be seen as an adaptive decision-making unit that constantly interacts with a random environment and adjusts its choices in order to maximize the probability of being rewarded. The process of the LA interacting with the environment is shown in Fig. 1 [1]. At each instant t, the LA chooses an action α(t) to apply to the random environment and receives the environment's feedback β(t), which can be either a reward or a penalty. The automaton then updates its state probability vector according to the received feedback. Because of its simple algorithm, strong noise tolerance and strong optimization ability, the LA has received extensive attention and has been applied in many fields, such as random function optimization, QoS optimization and certificate authentication.

Fig. 1. LA interacts with the random environment

In the LA field, the most classical discretized pursuit algorithm with a deterministic estimator is the DPRI algorithm given by Oommen in [2]. The main idea of DPRI is to increase the state probability of the action with the maximum value in the running estimates when the environment rewards the current action, and to decrease the others; otherwise, the automaton changes nothing. Many other classical pursuit algorithms, such as DGPA [3] and SERI [4], also use estimators and discretization to improve the convergence speed of the automaton. In [5], an algorithm named last-position elimination-based LA (LELA), which works contrary to the classical pursuit algorithms, is proposed. Instead of greedily increasing the probability of the action with the best estimate, this algorithm reduces the probability of choosing the action with the currently worst estimate. Experiments show that LELA converges faster than DGPA.

However, the estimator used by LELA has some innate defects. One typical flaw is the cold-start and initialization problem: since the maximum likelihood estimator has no information at the beginning, each action has to interact with the environment a certain number of times before its estimate becomes meaningful, which increases the cost of gathering information in some complicated situations. Moreover, the update strategy of the LELA algorithm simply makes all active actions share the penalized state probability of the last-position action equally, without considering the difference between the optimal action and the other actions. Thus, in this paper, we propose a weight assignment LELA (WLELA) algorithm with the following three changes: 1. improved initialization of the parameters; 2. an improved estimator; 3. a modified probability vector update strategy.

In Sect. 2, a brief introduction to the LELA algorithm is given. In Sect. 3 we describe our algorithm WLELA in detail. In Sect. 4 we give the simulation results of the WLELA algorithm, compared with LELA and other classic algorithms such as DPRI and DGPA. Finally, Sect. 5 concludes the paper.

2 Related Work

An LA can be represented by a four-tuple \( \left\langle {\text{A, B, Q, T}} \right\rangle \) model. Its components are explained as follows.

A is the set of actions. B is the set of feedbacks from the random environment; when B = {0, 1}, β = 0 represents that the LA has been penalized and β = 1 means the LA has been rewarded. Q = <P, E>, where E represents the estimator; it contains all the historical information gathered by each action's interactions with the environment. The most commonly used estimator is the maximum likelihood estimator. P is the state probability vector governing the choice of an action α(t) at any instant t, and it satisfies ∑pi(t) = 1. T is the state transition function of the LA, which determines how the LA moves to the state at time t + 1 according to its output, input and state at time t.

The random environment can also be described by a triple <A, B, C> mathematical model, where A and B are defined in the same way as above, and C is defined as C = {cij = Pr{β(t) = βj | α(t) = αi}}.
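As a concrete illustration of these definitions, the following Python sketch models a stationary random environment and the state Q = <P, E> kept by an estimator-based LA. The class and attribute names (RandomEnvironment, AutomatonState) are our own illustrative choices and are not part of the cited work.

```python
import numpy as np


class RandomEnvironment:
    """Stationary random environment <A, B, C>: action i is rewarded
    (beta = 1) with probability c[i] and penalized (beta = 0) otherwise."""

    def __init__(self, reward_probs, seed=0):
        self.c = np.asarray(reward_probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    def feedback(self, action):
        return 1 if self.rng.random() < self.c[action] else 0


class AutomatonState:
    """State Q = <P, E> of an estimator-based LA with r allowable actions."""

    def __init__(self, r):
        self.p = np.full(r, 1.0 / r)  # state probability vector P, sums to 1
        self.W = np.zeros(r)          # accumulated rewards per action (part of E)
        self.Z = np.zeros(r)          # number of times each action was selected
```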

The original LELA algorithm [5] uses the maximum likelihood estimator to record the historical information of all actions, according to the following formula

$$ d_{i} \left( t \right) = \frac{{W_{i} \left( t \right)}}{{Z_{i} \left( t \right)}} $$
(1)

where Zi(t) is the number of times the action αi was selected up to time instant t and Wi(t) is the sum of the environmental feedbacks it received up to time t. When an action is rewarded, the automaton selects the worst-performing action according to the estimator vector and decreases its state probability by a step ∆ = 1/rn, where r is the number of allowable actions and n is a resolution parameter. If some action's state probability is reduced to zero during this process, the action is removed from the set of selectable actions, while the remaining actions evenly share the probability mass removed at each decrease. The update scheme is described as follows (a minimal code sketch is given after the scheme):

  • If β(t) = 1 then

    Find m ∈ Nr such that

    $$ d_{m} \left( t \right) = \hbox{min} \left\{ { d_{i} \left( t \right) |p_{i} \left( t \right) \ne 0 } \right\}, i \in N_{r} $$
    $$ p_{m} \left( {t + 1} \right) = { \hbox{max} }\left\{ {p_{m} \left( t \right) - \Delta ,0} \right\} $$

    If \( p_{m} \left( {t + 1} \right) = 0 \) Then \( k\left( t \right) = k\left( t \right) - 1 \) Endif

    $$ p_{j} \left( {t + 1} \right) = \hbox{min} \left\{ {p_{j} \left( t \right) + \frac{{p_{m} \left( t \right) - p_{m} \left( {t + 1} \right)}}{{k\left( t \right)}},1} \right\}, \forall j \in N_{r}\,{\text{such that}}\,p_{j} \left( t \right) > 0. $$
  • Else

    $$ p_{i} \left( {t + 1} \right) = p_{i} \left( t \right) \forall i \in N_{r} $$
  • Endif.

k(t) denotes the number of currently active actions and is initialized to r.
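For illustration, the following Python sketch implements the LELA update scheme above for a single interaction. It assumes NumPy arrays p and d holding P(t) and the estimates d_i(t), and it follows the order of the pseudocode: k is decremented before the freed probability mass is shared among the actions that remain active.

```python
import numpy as np


def lela_update(p, d, k, delta, beta):
    """One LELA state-probability update.
    p: state probability vector P(t) (modified in place),
    d: maximum likelihood estimates d_i(t),
    k: number of active actions, delta: step size 1/(r*n),
    beta: feedback for the chosen action (1 = reward, 0 = penalty)."""
    if beta != 1:
        return p, k                      # on a penalty, P(t) is left unchanged
    active = np.flatnonzero(p > 0)
    m = active[np.argmin(d[active])]     # last-position action: worst active estimate
    new_pm = max(p[m] - delta, 0.0)
    freed = p[m] - new_pm                # probability mass taken from action m
    p[m] = new_pm
    if new_pm == 0.0:
        k -= 1                           # action m is eliminated
    remaining = p > 0                    # actions that stay in the game
    p[remaining] = np.minimum(p[remaining] + freed / k, 1.0)
    return p, k
```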

LELA has been proved to be ε-optimal in every stationary random environment.

3 Proposed Learning Automata

In order to overcome the shortcomings of the LELA algorithm, we propose the following improvements.

Firstly, we use the Bayesian estimator introduced in [6] to solve the cold-start and initialization problems. However, if the Bayesian estimator is used directly in the LA algorithm, the convergence speed is adversely affected. Therefore, in WLELA we replace it with the mean of the posterior distribution, which amounts to setting di(0) = 0.5 for all actions, thereby improving convergence efficiency while still overcoming the cold-start and initialization problems.

Secondly, in order to obtain more information, the WLELA algorithm uses the confidence interval estimator proposed in [7], which is

$$ d_{i} \left( t \right) = \left\{ {1 + \frac{{Z_{i} \left( t \right) - W_{i} \left( t \right)}}{{\left( {W_{i} \left( t \right) + 1} \right)F_{{2\left( {W_{i} \left( t \right) + 1} \right),2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right),0.005}} }}} \right\}^{ - 1} ,\forall i \in N_{r} $$
(2)

where \( F_{{2\left( {W_{i} \left( t \right) + 1} \right),2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right),0.005}} \) is the 0.005 right-tail critical value of the F distribution with \( 2\left( {W_{i} \left( t \right) + 1} \right) \) and \( 2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right) \) degrees of freedom.
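For concreteness, this estimate can be computed with a standard statistics library. The sketch below uses SciPy's F-distribution inverse survival function; the library choice and the helper name confidence_interval_estimate are our own and are not prescribed by [7].

```python
from scipy.stats import f as f_dist


def confidence_interval_estimate(W, Z, alpha=0.005):
    """Eq. (2): confidence-interval estimate d_i(t) for an action that
    received W = W_i(t) rewards in Z = Z_i(t) selections."""
    failures = Z - W
    if failures == 0:
        return 1.0  # no penalties observed; the estimate saturates at 1
    # 0.005 right-tail critical value of F with 2(W+1) and 2(Z-W) degrees of freedom
    f_crit = f_dist.isf(alpha, 2 * (W + 1), 2 * failures)
    return 1.0 / (1.0 + failures / ((W + 1) * f_crit))


# Example: a hypothetical action with W = 6 rewards out of Z = 10 selections
print(confidence_interval_estimate(6, 10))
```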

Last, since the components of the state probability vector always sum to 1, the value of each component can be regarded as the weight of the corresponding action in the vector set. WLELA therefore increases the probabilities of the active actions according to these weights. In this way, similarly to the idea of pursuing the optimal action in the classical pursuit algorithms, the probability of the optimal action gets more attention during the update, so that a larger value is added to the optimal action's state probability at each step. Assigning the added value by weight is more in line with the learning automaton's goal of selecting the optimal action.

A detailed description of WLELA is as follows.

Algorithm WLELA

Initialize \( p_{i} \left( 0 \right) = \frac{1}{r};W_{i} \left( 0 \right) = 1,Z_{i} \left( 0 \right) = 2,\forall i \in N_{r} \)

Initialize \( d_{i} \left( 0 \right) = \left\{ {1 + \frac{{Z_{i} \left( 0 \right) - W_{i} \left( 0 \right)}}{{\left( {W_{i} \left( 0 \right) + 1} \right)F_{{2\left( {W_{i} \left( 0 \right) + 1} \right),2\left( {Z_{i} \left( 0 \right) - W_{i} \left( 0 \right)} \right),0.005}} }}} \right\}^{ - 1} ,\forall i \in N_{r} \)

Step 1: At time t, pick α(t) = αi according to the state probability vector P(t);

Step 2: Receive the feedback βi(t) ∈ {0, 1}. Update the estimate values

$$ W_{i} \left( t \right) = W_{i} \left( {t - 1} \right) + \beta_{i} \left( t \right), $$
$$ Z_{i} \left( t \right) = Z_{i} \left( {t - 1} \right) + 1 $$
$$ d_{i} \left( t \right) = \left\{ {1 + \frac{{Z_{i} \left( t \right) - W_{i} \left( t \right)}}{{\left( {W_{i} \left( t \right) + 1} \right)F_{{2\left( {W_{i} \left( t \right) + 1} \right),2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right),0.005}} }}} \right\}^{ - 1} $$

Step 3: If βi(t) = 1 Then

Find m ∈ Nr such that

$$ d_{m} \left( t \right) = \hbox{min} \left\{ { d_{i} \left( t \right) |p_{i} \left( t \right) \ne 0 } \right\}, i \in N_{r} $$
$$ p_{m} \left( {t + 1} \right) = { \hbox{max} }\left\{ {p_{m} \left( t \right) - \Delta ,0} \right\} $$

If \( p_{m} \left( {t + 1} \right) = 0 \) Then \( k\left( t \right) = k\left( t \right) - 1 \) Endif

$$ p_{j} \left( {t + 1} \right) = \hbox{min} \left\{ {p_{j} \left( t \right) + \left( {p_{m} \left( t \right) - p_{m} \left( {t + 1} \right)} \right) \times \left( {p_{j} \left( t \right) + \frac{{p_{m} \left( t \right) - p_{m} \left( {t + 1} \right)}}{{k\left( t \right)}}} \right),1} \right\}, \forall j \in N_{r}\,{\text{such that}}\,p_{j} \left( t \right) > 0 $$

Endif

Step 4: If βi(t) = 0 Then

$$ p_{i} \left( {t + 1} \right) = p_{i} \left( t \right) \forall i \in N_{r} $$

Goto Step 1.

Endif

Step 5: If \( \hbox{max} \left\{ {{\text{P}}\left( {\text{t}} \right)} \right\} = 1, \)

Then CONVERGE to the action whose p = \( \hbox{max} \left\{ {{\text{P}}\left( {\text{t}} \right)} \right\} \).

Else

Goto Step 1.

Endif

END

The parameter k(t) has the same meaning as in the LELA algorithm.
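To make the steps above concrete, the following Python sketch implements one WLELA interaction, combining the estimator of Eq. (2) with the weighted probability update of Step 3 as written above. It is a minimal sketch: wlela_step is our own name, and env is assumed to expose a feedback(action) method like the RandomEnvironment sketch in Sect. 2.

```python
import numpy as np
from scipy.stats import f as f_dist


def wlela_step(p, W, Z, k, delta, env, rng):
    """One WLELA interaction (Steps 1-4): select an action, query the
    environment, update the estimator and the state probability vector."""
    r = len(p)
    i = rng.choice(r, p=p / p.sum())             # Step 1: sample an action from P(t)
    beta = env.feedback(i)                       # Step 2: environment feedback in {0, 1}
    W[i] += beta
    Z[i] += 1
    # Confidence-interval estimates d_i(t), Eq. (2); with W(0) = 1, Z(0) = 2
    # we always have Z - W >= 1, so the F quantile is well defined.
    d = 1.0 / (1.0 + (Z - W) / ((W + 1) * f_dist.isf(0.005, 2 * (W + 1), 2 * (Z - W))))
    if beta == 1:                                # Step 3: reward -> update P(t)
        active = np.flatnonzero(p > 0)
        m = active[np.argmin(d[active])]         # worst-estimate active action
        new_pm = max(p[m] - delta, 0.0)
        freed = p[m] - new_pm
        p[m] = new_pm
        if new_pm == 0.0:
            k -= 1                               # eliminate action m
        j = p > 0                                # actions that remain active
        # Weighted redistribution of the freed mass, following Step 3 as written
        p[j] = np.minimum(p[j] + freed * (p[j] + freed / k), 1.0)
    return p, W, Z, k                            # Step 4: on a penalty P(t) is unchanged
```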

4 Simulation Results

In this section we compare the relative performance of the proposed WLELA with LELA and the classical pursuit algorithms DPRI and DGPA, presenting their accuracy and convergence speed. The random environments used are the commonly adopted benchmark environments E1-E4 with 10 allowable actions, as shown in Table 1.

Table 1. Benchmark environments

In the LA simulation experiments, if the state probability of a certain action exceeds a set threshold T (0 < T ≤ 1), the algorithm is considered to have converged; if the converged action has the highest reward probability in the environment, the learning automaton is considered to have converged correctly.

All algorithms in the experiment are simulated with their best parameters, defined as the values that yield the fastest convergence while guaranteeing that the automaton converges to the optimal action in a sequence of NE experiments. Specifically, we use the same threshold T and NE as in [2, 3, 5], that is, T = 0.999 and NE = 750. After tuning to the best parameters, we carried out 250,000 experiments to evaluate the average convergence speed and accuracy.
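The evaluation loop can be sketched as follows. The reward probabilities shown are a hypothetical stand-in for a benchmark environment (the actual E1-E4 values are those of Table 1), and run_experiment reuses the wlela_step and RandomEnvironment sketches from the previous sections.

```python
import numpy as np

# Hypothetical 10-action reward probabilities; the real E1-E4 values are in Table 1.
example_c = [0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]


def run_experiment(c, n, T=0.999, seed=0, max_iters=200_000):
    """Run WLELA once in environment c with resolution n;
    return (converged_correctly, iterations_used)."""
    rng = np.random.default_rng(seed)
    env = RandomEnvironment(c, seed=seed + 1)
    r = len(c)
    p = np.full(r, 1.0 / r)
    W, Z = np.ones(r), np.full(r, 2.0)    # WLELA initialization W_i(0) = 1, Z_i(0) = 2
    k, delta = r, 1.0 / (r * n)
    for t in range(1, max_iters + 1):
        p, W, Z, k = wlela_step(p, W, Z, k, delta, env, rng)
        if p.max() >= T:                  # convergence criterion of Sect. 4
            return int(np.argmax(p) == int(np.argmax(c))), t
    return 0, max_iters
```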

Accuracy is an indicator of the performance of an automaton; it is defined as the probability that the learning automaton converges to the optimal action in a given environment. As can be seen from Table 2, in which “Res” denotes the best resolution parameter, all algorithms converge with high accuracy, while WLELA achieves higher accuracy than the other algorithms, although the differences are small.

Table 2. Accuracy (number of correct convergences/number of experiments)
A. Average convergence times

Convergence speed is one of the most critical performance indicators of learning automata. The convergence speed comparison is shown in Table 3, where “Ite” denotes the average number of iterations needed to converge.

From Table 3, we can see that the WLELA algorithm outperforms the other algorithms in terms of convergence speed. Compared with LELA, the convergence-speed improvement in the four environments is {6.93, 19.76, 3.13, 12.49}%. Compared with the traditional DGPA and DPRI algorithms, the improvement is {27.91, 40.43, 18.01, 28.28}% and {51.45, 72.24, 21.65, 56.02}%, respectively. Thus WLELA converges faster than the other three algorithms, and even in E2, the most complex of the environments, WLELA still performs best.

Table 3. Convergence speed

5 Conclusion

This paper proposes an improved algorithm, WLELA. Bayesian-style initialization eliminates the cold-start problem, the confidence interval estimator extracts more information from each interaction, and the weight allocation strategy realizes the classical LA idea of pursuing the best action. These three improvements allow the WLELA algorithm to achieve high accuracy and fast convergence in the simulation experiments; the results show that WLELA not only has the highest accuracy but also the fastest convergence speed, and it still performs very well in the most complex environment. In future work, we will consider using a stochastic estimator instead of the deterministic estimator in WLELA, and the WLELA algorithm can be applied in the many applications that require learning automata.