Abstract
Learning Automata (LA) are adaptive decision-making units in the reinforcement learning category. An LA learns the randomness of its environment by interacting with it and adaptively adjusts its behavior to maximize its long-term reward from the environment. This learning behavior reflects the strong optimization ability of learning automata, and LA have therefore been applied in many fields. However, the estimators commonly used in previous LA algorithms suffer from problems such as cold start, and the initialization process can also degrade an estimator's performance. In this paper, we address these two weaknesses by replacing the maximum likelihood estimator with a confidence interval estimator, using Bayesian initialization of the parameters, and proposing a new update strategy. The resulting algorithm is named weight-assignment last-position elimination-based learning automata (WLELA). Simulation experiments show that WLELA achieves higher accuracy and faster convergence than several classical algorithms.
1 Introduction
A learning automaton can be seen as an adaptive decision-making unit: it constantly interacts with a random environment and adjusts its choices to maximize the probability of being rewarded. The process of an LA interacting with the environment is shown in Fig. 1 [1]. At each instant t, the LA chooses an action α(t) to interact with the random environment and receives the environment's feedback β(t), which can be either a reward or a penalty. The automaton then updates its state probability vector according to the received feedback. Because of its simple algorithm, strong noise tolerance, and strong optimization ability, the LA has received extensive attention and has been applied in many fields, such as stochastic function optimization, QoS optimization, and certificate authentication.
In the LA field, the most classical discretized pursuit algorithm with a deterministic estimator is the DPRI algorithm given by Oommen in [2]. The main idea of DPRI is to increase the probability of the action with the maximum value in the running estimates when the environment rewards the current action, and to decrease the others; otherwise, the automaton changes nothing. Many other classical pursuit algorithms, such as DGPA [3] and SERI [4], also use estimators and discretization to improve the convergence speed of the automaton. In [5], an algorithm named last-position elimination-based LA (LELA), which works contrary to the classical pursuit algorithms, is proposed. Instead of greedily increasing the probability of the action with the best estimate, this algorithm reduces the probability of choosing the action with the currently worst estimate. Experiments show that LELA converges faster than DGPA.
However, the estimator used by LELA has some innate defects. One typical flaw is the cold-start and initialization problem: since the maximum likelihood estimator has no information at the beginning, each action must interact with the environment a certain number of times, which increases the cost of gathering information in complicated situations. Moreover, the update strategy of LELA simply makes all active actions share the penalized state probability of the last-position action equally, without considering the difference between the optimal action and the other actions. Thus, in this paper we propose a weight-assignment LELA (WLELA) algorithm with the following three changes: (1) improved initialization parameters; (2) an improved estimator; (3) a new probability vector update strategy.
In Sect. 2, a brief introduction to the LELA algorithm is given. In Sect. 3, we present our WLELA algorithm in detail. In Sect. 4, we give the simulation results of WLELA, compared with LELA and other classical algorithms such as DPRI and DGPA. Finally, we summarize in Sect. 5.
2 Related Work
LA can be represented by a four-tuple \( \left\langle {\text{A, B, Q, T}} \right\rangle \) model. They are explained as follows.
A is the set of actions. B is the set of feedbacks from the random environment; when B = {0, 1}, β = 0 represents that the LA has been penalized and β = 1 that it has been rewarded. Q = <P, E>, where E represents the estimator; it contains all the historical information from each action's interactions with the environment. The most commonly used estimator is the maximum likelihood estimator. P is the state probability vector of choosing an action α(t) at any instant t; it satisfies ∑pi(t) = 1. T is the state transition function of the LA, which determines how the LA migrates to the state at t + 1 according to the output, input, and state at time t.
The random environment can also be described by a triple <A, B, C>, where A and B are defined as above and C = {cij = Pr{β(t) = βj | α(t) = αi}}.
In the original LELA algorithm [5], the maximum likelihood estimator is used to record the historical information of all actions according to the following formula:

$$ d_{i} \left( t \right) = \frac{{W_{i} \left( t \right)}}{{Z_{i} \left( t \right)}}, \quad \forall i \in N_{r} $$
where Zi(t) is the number of times the action ai was selected up to time instant t and Wi(t) is the sum of the environmental feedbacks received up to time t. When an action is rewarded, the automaton selects the worst-performing action from the estimator vector set and decreases the corresponding state probability by a step ∆ = 1/rn, where r is the number of allowable actions and n is a resolution parameter. If some action's state probability is reduced to zero during the process, that action is removed from the set of optional actions, while the remaining actions evenly share the state probability value from each decrease. The update scheme is described as follows:
If β(t) = 1 Then
Find m ∈ Nr such that
$$ d_{m} \left( t \right) = \hbox{min} \left\{ { d_{i} \left( t \right) |p_{i} \left( t \right) \ne 0 } \right\}, i \in N_{r} $$
$$ p_{m} \left( {t + 1} \right) = \hbox{max} \left\{ {p_{m} \left( t \right) - \Delta ,0} \right\} $$
If \( p_{m} \left( {t + 1} \right) = 0 \) Then \( k\left( t \right) = k\left( t \right) - 1 \) Endif
$$ p_{j} \left( {t + 1} \right) = \hbox{min} \left\{ {p_{j} \left( t \right) + \frac{{p_{m} \left( t \right) - p_{m} \left( {t + 1} \right)}}{k\left( t \right)},1} \right\}, \forall j \in N_{r} ,\, such \,that\; p_{j} \left( t \right) > 0 $$
Else
$$ p_{i} \left( {t + 1} \right) = p_{i} \left( t \right), \forall i \in N_{r} $$
Endif
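The LELA update scheme above can be sketched in Python. This is a minimal illustration, not the authors' code; the function name and the convention of sharing the freed probability among the other active actions are assumptions based on the description in the text.

```python
def lela_update(p, d, delta):
    """One LELA update after a rewarded interaction (sketch).

    p     : list of state probabilities (entries equal to 0 mark eliminated actions)
    d     : list of estimator values d_i(t)
    delta : step size, 1 / (r * n)
    """
    # Find the worst-estimated action m among those still active.
    active = [i for i in range(len(p)) if p[i] > 0]
    m = min(active, key=lambda i: d[i])
    # Decrease its probability by delta, clipped at zero.
    new_pm = max(p[m] - delta, 0.0)
    freed = p[m] - new_pm
    p[m] = new_pm
    # The remaining active actions evenly share the freed probability.
    sharers = [i for i in active if i != m and p[i] > 0]
    if sharers:
        share = freed / len(sharers)
        for i in sharers:
            p[i] = min(p[i] + share, 1.0)
    return p
```

Note that the total probability mass is conserved: whatever is subtracted from the last-position action is redistributed to the remaining active actions.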
k(t) denotes the number of active actions and is initialized to r.
LELA has been proved to be ε-optimal in every stationary random environment.
3 Proposed Learning Automata
In order to overcome the shortcomings of the LELA algorithms, we propose the following improvements.
First, we use the Bayesian estimator introduced in [6] to solve the cold-start and initialization problems. However, using the Bayesian estimator directly in the LA algorithm would additionally slow convergence, so in WLELA we instead set every action's initial estimate to the mean of the posterior distribution, i.e., di(0) = 0.5, thereby improving convergence efficiency while still overcoming the cold-start and initialization problems.
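To see where the 0.5 comes from, consider a standard Bayesian estimate of a reward probability under a uniform Beta(1, 1) prior; the snippet below is an illustrative sketch (the function name is ours, not from the paper).

```python
def posterior_mean(W, Z):
    """Posterior mean of an action's reward probability under a uniform
    Beta(1, 1) prior, after observing W rewards in Z interactions."""
    return (W + 1) / (Z + 2)

# Before any interaction (W = 0, Z = 0) every action's estimate is 0.5,
# which matches the d_i(0) = 0.5 initialization used by WLELA.
print(posterior_mean(0, 0))  # 0.5
```

Unlike the maximum likelihood estimate W/Z, this value is well defined at Z = 0, which is precisely what removes the cold-start problem.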
Second, in order to extract more information, the WLELA algorithm uses the confidence interval estimator proposed in [7]:

$$ d_{i} \left( t \right) = \left\{ {1 + \frac{{Z_{i} \left( t \right) - W_{i} \left( t \right)}}{{\left( {W_{i} \left( t \right) + 1} \right)F_{{2\left( {W_{i} \left( t \right) + 1} \right),2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right),0.005}} }}} \right\}^{ - 1} $$
where \( F_{{2\left( {W_{i} \left( t \right) + 1} \right),2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right),0.005}} \) is the 0.005 right-tail quantile of the F distribution with \( 2\left( {W_{i} \left( t \right) + 1} \right) \) and \( 2\left( {Z_{i} \left( t \right) - W_{i} \left( t \right)} \right) \) degrees of freedom.
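This F-distribution expression is the classical Clopper-Pearson upper confidence limit for a binomial proportion. A sketch of how it could be computed, assuming SciPy's `scipy.stats.f.isf` is available:

```python
from scipy.stats import f as f_dist

def ci_upper(W, Z, alpha=0.005):
    """Upper confidence bound d_i(t) for the reward probability after
    W rewards in Z trials, using the F-distribution form from the text:
        d_i = [1 + (Z - W) / ((W + 1) * F)]^{-1},
    where F is the right-tail alpha quantile of F(2(W+1), 2(Z-W))."""
    if W >= Z:
        # Every trial was rewarded: the upper bound is 1.
        return 1.0
    F = f_dist.isf(alpha, 2 * (W + 1), 2 * (Z - W))
    return 1.0 / (1.0 + (Z - W) / ((W + 1) * F))
```

With the initial values Wi(0) = 1 and Zi(0) = 2 used by the algorithm, this bound lies strictly above the maximum likelihood value 0.5, giving every action an optimistic starting estimate.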
Last, since all state probabilities add up to 1, the value of each state probability can be regarded as that action's weight in the vector set. WLELA therefore increases the probabilities of the active actions according to their weights. In this way, similar to the idea of pursuing the optimal action in the classical pursuit algorithms, the probability of the optimal action receives more attention during the update, so that more value is added to the optimal action's state probability each time. Assigning the added value by weight is more in line with the learning automaton's goal of selecting the optimal action.
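The weight-assignment idea can be sketched as follows. Since the paper does not spell out the exact normalization, the proportional rule below is our assumption of the simplest weight-based split, labeled as such in the code.

```python
def weighted_share(p, m, freed):
    """Distribute the probability mass `freed`, removed from action m,
    among the other active actions in proportion to their current
    probabilities (their weights). The exact WLELA normalization is an
    assumption here; this is the simplest proportional rule."""
    active = [j for j in range(len(p)) if p[j] > 0 and j != m]
    total = sum(p[j] for j in active)
    for j in active:
        p[j] = min(p[j] + freed * p[j] / total, 1.0)
    return p
```

Under this rule the action with the largest current probability receives the largest share of the freed mass, mirroring the pursuit-style preference for the estimated best action, while total probability mass is still conserved.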
A detailed description of the WLELA is as follows
Algorithm WLELA
Initialize \( p_{i} \left( 0 \right) = \frac{1}{r};W_{i} \left( 0 \right) = 1,Z_{i} \left( 0 \right) = 2,\forall i \in N_{r} \)
Initialize \( d_{i} \left( 0 \right) = \left\{ {1 + \frac{{Z_{i} \left( 0 \right) - W_{i} \left( 0 \right)}}{{\left( {W_{i} \left( 0 \right) + 1} \right)F_{{2\left( {W_{i} \left( 0 \right) + 1} \right),2\left( {Z_{i} \left( 0 \right) - W_{i} \left( 0 \right)} \right),0.005}} }}} \right\}^{ - 1} ,\forall i \in N_{r} \)
Step 1: At time t, pick α(t) = αi according to the state probability vector P(t);
Step 2: Receive feedback βi(t) ∈ {0, 1}. Update the estimate values Zi(t + 1) = Zi(t) + 1 and Wi(t + 1) = Wi(t) + βi(t), and recompute di(t + 1) with the confidence interval estimator;
Step 3: If βi(t) = 1 Then
Find m ∈ Nr such that
If \( p_{m} \left( {t + 1} \right) = 0 \) Then \( k\left( t \right) = k\left( t \right) - 1 \) Endif
Endif
Step 4: If βi(t) = 0 Then
Goto Step 1.
Endif
Step 5: If \( \hbox{max} \left\{ {{\text{P}}\left( {\text{t}} \right)} \right\} = 1, \)
Then CONVERGE to the action whose p = \( \hbox{max} \left\{ {{\text{P}}\left( {\text{t}} \right)} \right\} \).
ELSE
Goto step 1.
Endif
END
The parameter k(t) has the same meaning as in algorithm LELA.
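Putting the pieces together, a last-position-elimination learning loop against a simulated environment can be sketched as below. This is a simplified illustration (it uses the plain maximum likelihood estimate and even sharing rather than the full WLELA estimator and weighted update), with function and variable names of our own choosing.

```python
import random

def run_la(c, delta, T=0.999, seed=0):
    """Minimal last-position-elimination LA loop against a simulated
    environment with reward probabilities c. A simplified sketch, not
    the full WLELA algorithm. Returns the index of the converged action."""
    rng = random.Random(seed)
    r = len(c)
    p = [1.0 / r] * r
    W, Z = [1] * r, [2] * r          # initial counts as in the algorithm
    d = [W[i] / Z[i] for i in range(r)]
    while max(p) < T:
        # Step 1: sample an action according to the state probabilities.
        i = rng.choices(range(r), weights=p)[0]
        beta = 1 if rng.random() < c[i] else 0
        # Step 2: update the counts and the (ML, for this sketch) estimate.
        Z[i] += 1; W[i] += beta
        d[i] = W[i] / Z[i]
        # Step 3: on reward, penalize the worst-estimated active action.
        if beta == 1:
            active = [j for j in range(r) if p[j] > 0]
            m = min(active, key=lambda j: d[j])
            freed = p[m] - max(p[m] - delta, 0.0)
            p[m] -= freed
            others = [j for j in active if j != m and p[j] > 0]
            for j in others:
                p[j] = min(p[j] + freed / len(others), 1.0)
    return p.index(max(p))
```

With a clearly separated environment, the loop should almost always converge to the action with the highest reward probability.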
4 Simulation Results
In this section we compare the relative performance of the proposed WLELA with LELA and the classical pursuit algorithms DPRI and DGPA in terms of accuracy and convergence speed. The random environments used are the most common benchmark environments E1–E4 with 10 allowable actions, as shown in Table 1.
In the LA simulation experiments, the algorithm is considered to have converged once the state probability of some action exceeds a set threshold T (0 < T ≤ 1); if the converged action has the highest reward probability in the environment, the learning automaton is considered to have converged correctly.
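The convergence criterion can be stated compactly in code; the function names below are ours, added for illustration.

```python
def has_converged(p, T=0.999):
    """The automaton is considered converged once some action's state
    probability reaches the threshold T (0 < T <= 1)."""
    return max(p) >= T

def converged_correctly(p, c, T=0.999):
    """Correct convergence: the converged action is the one with the
    highest reward probability c_i in the environment."""
    return has_converged(p, T) and p.index(max(p)) == c.index(max(c))
```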
All algorithms in the experiment are simulated with their best parameters, defined as the values that yield the fastest convergence while guaranteeing that the automaton converges to the optimal action in a sequence of NE experiments. Specifically, we set the same threshold T and NE as in [2, 3, 5], that is, T = 0.999 and NE = 750. After tuning to the best parameters, we carried out 250,000 experiments to evaluate the average convergence speed and accuracy.
Accuracy is an indicator for judging the performance of an automaton; it is defined as the probability that a learning automaton converges to the optimal action in an environment. As can be seen from Table 2, where "Res" denotes the best resolution parameter, all algorithms converge with high accuracy, while WLELA is more accurate than the other algorithms, although the margin is small.
A. Average Convergence Times
Convergence speed is one of the most critical performance indicators for learning automata. The convergence speed comparison is shown in Table 2, where "Ite" denotes the number of iterations to converge.
From Table 3, we can see that WLELA outperforms the other algorithms in terms of convergence speed. Compared with LELA, the convergence improvement in each environment is {6.93%, 19.76%, 3.13%, 12.49%}. Compared with the traditional DGPA and DPRI algorithms, the improvements are {27.91%, 40.43%, 18.01%, 28.28%} and {51.45%, 72.24%, 21.65%, 56.02%}, respectively. WLELA thus converges faster than all three algorithms; even in E2, the most complex of the environments, WLELA still performs best.
5 Conclusion
This paper proposes an improved algorithm, WLELA. Bayesian initialization eliminates the cold-start problem, the confidence interval estimator extracts more information from each interaction, and the weight-assignment strategy realizes the classical LA idea of pursuing the best action. These three improvements allow WLELA to achieve high accuracy and fast convergence in the simulation experiments: the results show that WLELA not only has the highest accuracy but also the fastest convergence speed, and it still performs very well in the most complex environment. In future work, we will consider using a stochastic estimator instead of a deterministic one in WLELA, and applying the WLELA algorithm in the many applications that need learning automata.
References
Thathachar M, Sastry PS (2004) Networks of learning automata: techniques for online stochastic optimization. Kluwer, Dordrecht
Oommen BJ, Lanctôt JK (1990) Discretized pursuit learning automata. IEEE Trans Syst Man Cybern 20(4):931–938
Agache M, Oommen BJ (2002) Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Trans Syst Man Cybern Part B Cybern 32(6):738–749
Papadimitriou GI, Sklira M, Pomportsis AS (2004) A new class of ε-optimal learning automata. IEEE Trans Syst Man Cybern Part B 34(1):246–254
Zhang J, Wang C, Zhou MC (2014) Last-position elimination-based learning automata. IEEE Trans Cybern 44(12):2484–2492
Xuan Z, Granmo OC, Oommen BJ (2013) On incorporating the paradigms of discretization and Bayesian estimation to create a new family of pursuit learning automata. Appl Intell 39(4):782–792
Hao G, Jiang W, Li S et al (2015) A novel estimator based learning automata algorithm. Appl Intell 42(2):262–275
Acknowledgements
This work was supported by the National Key Research and Development Project of China under Grant 2016YFB0801003.
© 2020 Springer Nature Singapore Pte Ltd.
An, H., Di, C., Li, S. (2020). Weight-Assignment Last-Position Elimination-Based Learning Automata. In: Liang, Q., Wang, W., Liu, X., Na, Z., Jia, M., Zhang, B. (eds) Communications, Signal Processing, and Systems. CSPS 2019. Lecture Notes in Electrical Engineering, vol 571. Springer, Singapore. https://doi.org/10.1007/978-981-13-9409-6_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9408-9
Online ISBN: 978-981-13-9409-6