
1 Introduction

In order to determine the feasibility or optimality of a given course of action, it is necessary to observe and monitor the outcomes of repeated trials. Repetition is necessary to ensure reliability and ongoing effectiveness, particularly in an environment which is subject to chance influences and where complete information on all the underlying factors is not available. This is especially vital for mission-critical operations, such as aviation safety, where ongoing validation or monitoring of a given policy is essential, but it is also relevant for non-mission-critical activities [18, 19, 21]. In most practical situations, such trials cannot be performed in parallel but have to be undertaken sequentially. In the context of reinforcement learning, each outcome can be classified as a reward or a punishment. However, the cost of carrying out such learning trials can be significant, and such sequential validation can have varying degrees of stringency. In this study, the stochastic structure of the environment is explicitly modeled, and the performance measures of the associated validation costs are analyzed.

In trialing or learning a given course of action, the observed rewards and punishments are usually probabilistic. For instance, when one is experimenting with a new route between an origin A and a destination B, an increase in the journey duration by a given amount may be viewed as a punishment, whereas a reduction in the duration may be viewed as a reward; having thus learned, say, the acceptability of the new route, one would adopt the new route as a policy for traveling between A and B.

Thus, through repeated trials resulting in outcomes of either reward or punishment, one establishes the feasibility of the new policy and completes the learning phase. Subsequent to the learning phase, the new policy, if learned successfully (i.e. when the reward-to-punishment ratio is sufficiently high), is adopted from that point onwards without being questioned or evaluated afterwards. In this particular example, the learning is primarily done during the pre-adoption phase. In some situations, however, even after the policy is adopted, ongoing validation and monitoring is still carried out; this is especially necessary for safety-critical and mission-critical operations. If, in the course of ongoing monitoring, an overwhelming number of punishments is observed, then the adoption of the policy may be called into question, and termination of the policy may be necessary.

In this paper, we study such reinforcement learning scenarios of stochastically receiving rewards and punishments, covering both the pre-adoption phase and the post-adoption phase. To be concrete, we shall use a scenario of aviation safety, as we believe this scenario is sufficiently general and of particular relevance to present-day concerns. Nevertheless, we wish to point out that many other everyday learning situations are similar; examples include trialing a new machine translation algorithm, learning the effectiveness of a new advertising channel, and route discovery in self-driving vehicles.

An autonomous agent in reinforcement learning learns through interaction with the environment to maximize its rewards while minimizing or avoiding punishments. In most practical situations, the underlying environment is non-stationary and noisy [1, 4, 6, 20, 22], and taking the same action under the same policy may not lead to the same next state or outcome every time, but appears stochastic [2, 7]. In [3], Brafman and Tennenholtz introduce R-Max, a model-based reinforcement learning algorithm for dealing with stochastic games [5]. Such stochastic elements can notably increase the complexity of multi-agent systems and multi-agent tasks, where agents learn to cooperate and compete simultaneously [6, 10]. As other agents adapt and actively adjust their policies, the best policy for each agent evolves dynamically, giving rise to non-stationarity [8, 9]. In these studies, the cost of a trial to receive either a reward or a punishment can be seen to be significant, and ideally one would like to arrive at the correct conclusion while incurring minimum cost. In reinforcement learning algorithms, the hope is always to converge rapidly to an optimal policy with the least amount of data, computation, and learning iterations, and with minimal complexity [11, 12]. To do so, one should explicitly define stopping rules specifying the conditions under which learning should terminate and a conclusion should be drawn, based on the observations so far, as to whether the learning has been successful.

Establishing stopping rules is an active research topic in reinforcement learning, closely linked to the problems of optimal policies and policy convergence [13]. Conventional approaches mainly address relatively small-scale problems with finite states and actions. The stopping rules involved are well defined for each category of algorithms, such as those utilizing the Bellman equation in Q-learning [14]. To deal with continuous action or state spaces, new algorithms, such as the Cacla algorithm [15] and the CMA-ES algorithm [16], have been developed with specific stopping rules. Some stopping rules for stochastic reinforcement learning under different assumptions have been proposed and studied in [23, 24]. Still, most studies on stopping rules are procedure-oriented and lack a unified measure that would facilitate comparison.

In our study, in addition to learning from the observations in the pre-adoption and post-adoption phases, we also focus on the stopping criteria, so that what has been observed and learned can form the basis of policy decision making. The next section provides a representation of the stochastic learning environment and establishes stopping rules for the different phases. An analysis of these rules is given in Sect. 3. Assessment of the learning cost and the rewards ratio, along with experimental evaluation, is given in Sect. 4, and conclusions are drawn in Sect. 5.

2 The Learning Environment and Stopping Rules

We assume that trials are systematically carried out in a sequential manner. Due to the presence of a multiplicity of unknown factors and hidden variables, indicating an environment about which we have incomplete information, the outcomes of different trials will be subject to probabilistic influences. As mentioned earlier, we shall employ the aviation safety learning situation to develop the main ideas. The reason for using this situation is twofold:

  i. it has a high degree of generality that is able to subsume a variety of learning situations as special cases, and

  ii. it has a particular relevance and interest to current concerns of airlines, aircraft manufacturers, and passengers.

We shall divide the learning reinforcement of a policy into two distinct phases:

  i. the pre-adoption phase, which we shall refer to as Phase I, and

  ii. the post-adoption phase, which we shall refer to as Phase II.

The former phase is concerned with learning the acceptability of the policy through systematic trials, while the latter is concerned with the continued validity of the policy subsequent to adoption and, in this case, with whether the policy should under some circumstances be discontinued. An especially relevant example is whether a particular aircraft model recently introduced should continue in service or should be discontinued, at least temporarily, for the safety and well-being of its passengers, perhaps following some serious incidents.

Here, we are dealing with a sequence of independent learning trials, each of which results in either a reward or a punishment. In our particular aviation example, typically a set of indicators is logged for each trial and a final score is computed, which forms the basis of a decision on either a pass (reward) or a failure (punishment) for the trial. We let p and q, with p + q = 1, denote the probabilities of receiving a reward or a punishment respectively for a given trial. For example, if p ≫ q, then the decision should be that of adopting the policy. In general, the requirements for Phase I are much more stringent than those for Phase II, and the cost of different decision rules will be analyzed in the next section.

Let us consider the following two stopping rules.

  • Rule I : In the course of undertaking the learning trials, an agent concludes the learning process when m consecutive rewards are obtained.

  • Rule II : In the course of undertaking the learning trials, an agent concludes the learning process when m total rewards are obtained.

Rule I is a somewhat stringent stopping rule but is particularly applicable to mission-critical operations where a high degree of reliability is required. It is also more widely used for the learning phase proper (Phase I) than for the validation phase after learning (Phase II). Rule II is a less stringent stopping rule and is often used for the validation phase. In some applications, such as finding an optimal route from A to B for a self-driving vehicle, it is usually sufficient to use Rule II for Phase I learning, with no need for Phase II evaluation.
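To make the two rules concrete, the following is a minimal simulation sketch of our own (not part of the formal development): each trial is modeled as an independent Bernoulli draw with reward probability p, and each function returns the number of trials taken until the respective stopping condition is met. All function names and parameter values here are illustrative.

```python
import random

def trials_until_rule_I(p: float, m: int, rng: random.Random) -> int:
    """Number of trials until m *consecutive* rewards are observed (Rule I)."""
    trials, run = 0, 0
    while run < m:
        trials += 1
        if rng.random() < p:   # reward with probability p
            run += 1
        else:                  # a punishment resets the current run
            run = 0
    return trials

def trials_until_rule_II(p: float, m: int, rng: random.Random) -> int:
    """Number of trials until m rewards *in total* are observed (Rule II)."""
    trials, rewards = 0, 0
    while rewards < m:
        trials += 1
        if rng.random() < p:
            rewards += 1
    return trials

if __name__ == "__main__":
    rng = random.Random(42)                       # arbitrary seed for reproducibility
    print(trials_until_rule_I(0.6, 5, rng))       # one episode under Rule I
    print(trials_until_rule_II(0.6, 5, rng))      # one episode under Rule II
```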

In addition, there is a significant difference between the objective of Phase I and that of Phase II. While the objective of Phase I is to adopt the policy by accumulating a sufficient number of rewards, the aim of Phase II, on the other hand, is to look for alerts that may lead to a discontinuation of the policy. As we shall see, the analysis in Phase II requires the application of the Reflection Principle [17], interchanging the probabilities p and q as well as the rewards and punishments. Such a reversal of roles leads to variations of Rule I and Rule II, which we shall call Rule IR and Rule IIR respectively, with the suffix R signifying reflection.

  • Rule IR : In the course of undertaking the validation trials, an agent concludes the learning process when m’ consecutive punishments are received.

  • Rule IIR : In the course of undertaking the validation trials, an agent concludes the learning process when m’ total punishments are received.

As we shall see in the next section, the use of Rule I for Phase I means that acceptance of the policy is more stringent than when Rule II is used. On the other hand, the use of Rule IIR in Phase II also signifies a more stringent requirement, since rejection of the policy is easier than when Rule IR is used (Table 1).

Table 1. The typical learning scenarios for different types of applications for the two phases.

3 Analysis of the Performance of the Stopping Rules

Rule I above is concerned with collecting a given number of consecutive reinforcements or rewards, so we shall first establish the probability of the occurrence of such an event for the first time. Let bn be the probability that m consecutive rewards occur at trial n, with n ≥ m, not necessarily for the first time, and denote by B(z) the corresponding probability generating function. From [17], this probability generating function can be obtained as

$$ B\left( z \right) = \frac{{1 - z + qp^{m} z^{m + 1} }}{{\left( {1 - z} \right)\left( {1 - p^{m} z^{m} } \right)}} . $$
(1)

Since we need the corresponding generating function for the probability that this event occurs for the first time, we consider the relationship between the two events. We shall use the random variable X to denote the number of trials up to and including the first occurrence of m consecutive rewards. Thus X is the stopping time for Rule I, measured in terms of the number of trials, and we let an be the probability

$$ a_{n} = \Pr \left[ {\varvec{X} = n} \right] , n = m, m + 1, \ldots . $$
(2)

We denote by A(z) the probability generating function for the event that m consecutive rewards occur for the first time. It is shown in [17] that A(z) is related to B(z) by

$$ A\left( z \right) = \frac{B\left( z \right) - 1}{B\left( z \right)} . $$
(3)

From this, we obtain, after simplification,

$$ A\left( z \right) = \frac{{p^{m} z^{m} }}{{1 - qz\mathop \sum \nolimits_{k = 0}^{m - 1} p^{k} z^{k} }} . $$
(4)

From this, the mean and variance of X can be readily obtained after simplification,

$$ {\text{E}}\left[ \varvec{X} \right] = A^{\prime}\left( 1 \right) = \frac{{1 - p^{m} }}{{qp^{m} }} , $$
(5)
$$ {\text{Var}}\left[ \varvec{X} \right] = A^{\prime\prime}\left( 1 \right) + A^{\prime}\left( 1 \right) - A^{\prime}\left( 1 \right)^{2} = \frac{1}{{q^{2} p^{2m} }} - \frac{2m + 1}{{qp^{m} }} - \frac{p}{{q^{2} }} , $$
(6)

where the prime denotes differentiation.
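As a sanity check (our own addition, not taken from [17]), the coefficients a_n of A(z) in (4) can be generated numerically: clearing the denominator of (4) (equivalently, conditioning on the position of the first punishment) yields a simple recursion, and the resulting mean can be compared with (5). The sketch below is a minimal version of this check, with an arbitrary truncation point N of our choosing.

```python
def rule_I_stopping_distribution(p: float, m: int, N: int) -> list:
    """Coefficients a_n = Pr[X = n] of A(z) in Eq. (4), for n = 0..N.

    Clearing the denominator of Eq. (4) gives the recursion
    a_n = p^m [n == m] + sum_{j=1}^{min(n, m)} q p^(j-1) a_{n-j},
    i.e. either the first m trials are all rewards, or the first
    punishment occurs at trial j and the process restarts.
    """
    q = 1.0 - p
    a = [0.0] * (N + 1)
    for n in range(m, N + 1):
        a[n] = p ** m if n == m else 0.0
        a[n] += sum(q * p ** (j - 1) * a[n - j] for j in range(1, min(n, m) + 1))
    return a

if __name__ == "__main__":
    p, m, N = 0.6, 5, 2000                            # illustrative values only
    a = rule_I_stopping_distribution(p, m, N)
    print(sum(a))                                     # close to 1
    print(sum(n * a_n for n, a_n in enumerate(a)),    # mean of the truncated distribution
          (1 - p ** m) / ((1 - p) * p ** m))          # closed form of Eq. (5), about 29.65
```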

Next, we examine Rule II, and let the random variable Y be the number of observations up to and including the first reward; thus

$$ \Pr \left[ {\varvec{Y} = k} \right] = pq^{k - 1} , k = 1, 2, 3, \ldots $$
(7)

The probability generating function for a random variable W which excludes the reward itself has been obtained in [23] and is given by

$$ \frac{p}{{\left( {1 - qz} \right)}} . $$

The random variables W and Y are related by Y = W + 1, and since the generating function of 1 is z, we have, for the probability generating function of Y,

$$ \frac{pz}{{\left( {1 - qz} \right)}} . $$
(8)
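As an intermediate step, added here for completeness, differentiating (8) and setting z = 1 gives the mean of Y directly:

$$ {\text{E}}\left[ \varvec{Y} \right] = \left. {\frac{d}{dz}\frac{pz}{1 - qz}} \right|_{z = 1} = \left. {\frac{p}{{\left( {1 - qz} \right)^{2} }}} \right|_{z = 1} = \frac{1}{p} , $$

which, combined with the decomposition of Z given next, yields the result (11) below.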

After the occurrence of the first reward, the process probabilistically repeats itself, so that the number of trials up to the mth reward (bearing in mind that under Rule II the rewards need not occur consecutively) is

$$ \varvec{Z} = \mathop \sum \limits_{k = 1}^{m} \varvec{Y}_{k} , $$
(9)

where each Yk has the same distribution as Y. Consequently, the probability generating function F(z) of Z may be obtained as

$$ F\left( z \right) = [\frac{pz}{{\left( {1 - qz} \right)}}]^{m} $$
(10)

From this, the mean and variance of Z can be readily obtained by differentiation,

$$ {\text{E}}\left[ \varvec{Z} \right] = \frac{m}{p} , $$
(11)
$$ {\text{Var}}\left[ \varvec{Z} \right] = \frac{mq}{{p^{2} }} . $$
(12)

It is not hard to see that E(X) ≥ E(Z), since achieving m consecutive rewards necessarily implies having achieved at least m rewards in total, with equality holding if and only if m = 1, in which case there is no difference between the two situations.
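As a concrete illustration (using the same p = 0.6 as in Fig. 1 and taking m = 5), substituting into (5) and (11) gives

$$ {\text{E}}\left[ \varvec{X} \right] = \frac{{1 - 0.6^{5} }}{{0.4 \times 0.6^{5} }} \approx 29.7 , \quad {\text{E}}\left[ \varvec{Z} \right] = \frac{5}{0.6} \approx 8.3 , $$

so requiring the rewards to be consecutive more than triples the expected number of trials in this case.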

Figure 1 compares the average cost of sequential trials under the two rules. Here, the left vertical axis is used for E(X), while the right vertical axis is used for E(Z). We see that the stringency of Rule I is manifested in a steep climb in the number of trials as m increases, as opposed to a relatively moderate increase under Rule II.

Fig. 1. Cost comparison of Rules I and II (p = 0.6).

4 Learning Cost Evaluation and Experimentation

Carrying out the trials needed to complete the learning episode is often a costly process. Let h be the numerical cost associated with a single trial; we can standardize on h as the cost unit so that, without loss of generality, we can set h = 1. Having specified m, a minimum observation cost of mh must therefore be incurred. What is uncertain is the number of punishments obtained in the process, and ideally, to achieve minimum cost, this number should be small. The average trialing costs are given by Eqs. (5) and (11). Simulation experiments are carried out to compare the actual average learning costs and their standard deviations with the theoretical predictions of Eqs. (5), (6), (11) and (12); these comparisons are given in Table 3.

Table 2. Cost ratios for the two phases.
Table 3. Comparison of the average trialing costs and the standard deviations.

The mean number of trials required in order to accumulate m rewards has a direct bearing on the adoption of the given policy. The learning overhead, or cost ratio, is given by the ratio of the average total number of trials to the number of required rewards m. For Rule I, this is given by

$$ r_{1} \left( m \right) = \frac{{1 - p^{m} }}{{mqp^{m} }} , $$
(13)

and for Rule II, this is given by,

$$ r_{2} \left( m \right) = \frac{1}{p} . $$
(14)

Clearly, the decision for adoption or successful validation will tend to be positive when r1 and r2 are small, and negative when they are large. Thus, decision rules can be established by linking them to cost thresholds h1 and h2, whereby, for example, an adoption decision is made whenever r1 < h1.
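A minimal sketch of such a threshold-based decision procedure is given below: it simply evaluates (13) and (14) and compares r1 with a threshold h1. The threshold value and the estimate of p used in the example call are hypothetical placeholders of ours, not values prescribed in this paper.

```python
def cost_ratio_rule_I(p: float, m: int) -> float:
    """Learning overhead r1(m) of Eq. (13): average trials per required reward."""
    q = 1.0 - p
    return (1.0 - p ** m) / (m * q * p ** m)

def cost_ratio_rule_II(p: float) -> float:
    """Learning overhead r2(m) of Eq. (14); note that it does not depend on m."""
    return 1.0 / p

def adopt_policy(p: float, m: int, h1: float) -> bool:
    """Adopt the policy when the Rule I overhead stays below the threshold h1."""
    return cost_ratio_rule_I(p, m) < h1

if __name__ == "__main__":
    # Hypothetical setting: estimated reward probability 0.6, m = 5, threshold h1 = 8.
    print(cost_ratio_rule_I(0.6, 5), cost_ratio_rule_II(0.6))  # about 5.93 and 1.67
    print(adopt_policy(0.6, 5, h1=8.0))                        # True under these assumptions
```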

As indicated earlier, for some situations only Phase I learning is necessary and Phase II is not required. However, in the case of our aviation scenario, as indicated in the previous section, both Phase I and Phase II are necessary, with Rule I used for Phase I and Rule IIR used for Phase II. In this case, assuming we have learned the usefulness of the given policy, say, to put the particular aircraft model in service, Rule IIR would instead look for punishments that may cause termination of the service, in order to safeguard the safety and well-being of its passengers. We note that Rule IIR represents a stricter criterion with a stronger propensity towards termination, since discontinuation would be harder if one used Rule IR instead: under Rule IIR, we may decide to discontinue the service upon an accumulation of m′ punishments, not necessarily consecutive. From (11), the mean number of observations E[Z′] relating to Rule IIR for Phase II is, by the Reflection Principle,

$$ {\text{E}}\left[ {\varvec{Z'}} \right] = \frac{m'}{q} . $$
(15)

Similarly, from (5), the mean number of observations E[X′] relating to Rule IR for Phase II is, again by the Reflection Principle,

$$ {\text{E}}\left[ {\varvec{X'}} \right] = \frac{{1 - q^{m'} }}{{pq^{m'} }} . $$
(16)
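For illustration only (with a hypothetical per-trial punishment probability q = 0.05 and m′ = 3), Eqs. (15) and (16) give

$$ {\text{E}}\left[ {\varvec{Z'}} \right] = \frac{3}{0.05} = 60 , \quad {\text{E}}\left[ {\varvec{X'}} \right] = \frac{{1 - 0.05^{3} }}{{0.95 \times 0.05^{3} }} \approx 8420 , $$

so Rule IIR would, on average, raise the termination question after about 60 trials, whereas Rule IR, which requires the punishments to be consecutive, would on average take several thousand; this quantifies the greater stringency of Rule IIR for Phase II.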

The associated cost ratios are summarized in Table 2 below.

Simulation experiments are performed to gauge the accuracy of the above results; these are shown in Table 3 below, which compares the average values and the standard deviations, the latter corresponding to the square roots of the variances determined above. For a given combination of parameters, 100,000 trial episodes are performed, and the expected values and standard deviations are calculated from these 100,000 episodes. The error percentages are computed as follows:

$$ Err = \frac{{\left| {{\text{theoretical prediction}} - {\text{empirical measurement}}} \right|}}{{{\text{empirical measurement}}}} \times 100\% . $$
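For concreteness, the sketch below is our own minimal reproduction of one such comparison for Rule I, following the stated protocol of 100,000 episodes with p = 0.6 and m = 5, and computing the Err measure just defined; the seed and output formatting are arbitrary choices.

```python
import random
import statistics

def trials_until_m_consecutive_rewards(p: float, m: int, rng: random.Random) -> int:
    """One learning episode under Rule I: number of trials until m consecutive rewards."""
    trials, run = 0, 0
    while run < m:
        trials += 1
        run = run + 1 if rng.random() < p else 0
    return trials

if __name__ == "__main__":
    p, m, episodes = 0.6, 5, 100_000
    q = 1.0 - p
    rng = random.Random(0)
    samples = [trials_until_m_consecutive_rewards(p, m, rng) for _ in range(episodes)]

    emp_mean, emp_std = statistics.mean(samples), statistics.stdev(samples)
    th_mean = (1 - p ** m) / (q * p ** m)                                           # Eq. (5)
    th_var = 1 / (q ** 2 * p ** (2 * m)) - (2 * m + 1) / (q * p ** m) - p / q ** 2  # Eq. (6)
    th_std = th_var ** 0.5

    err_mean = abs(th_mean - emp_mean) / emp_mean * 100   # Err, as defined above
    err_std = abs(th_std - emp_std) / emp_std * 100
    print(f"mean: empirical {emp_mean:.2f} vs theoretical {th_mean:.2f} (Err {err_mean:.2f}%)")
    print(f"std : empirical {emp_std:.2f} vs theoretical {th_std:.2f} (Err {err_std:.2f}%)")
```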

We see that the agreement is quite acceptable in all cases, with errors below 1%. Figure 2 plots the experimental data for the case p = 0.6, m = 5. Some significant transient fluctuations are evident in the first 200 episodes, but the estimates gradually settle towards an equilibrium after around 400 episodes. While some fluctuations are still present thereafter, the estimates eventually converge to the values predicted by the theory. Although we have not shown the behavior for other parameter settings, it is much the same as that shown in Fig. 2.

Fig. 2. Simulation results with p = 0.6, m = 5.

5 Summary and Conclusion

In this paper, we have studied the practical situation of learning the usefulness of a given policy for adoption by repeated sequential trials, each of which can result in a reward or a punishment. The entire evaluation process may be divided into two distinct phases, one for assessing initial acceptability and one for assessing ongoing feasibility. Due to the large number of unknown factors and incomplete information, the outcome of each trial is subject to probabilistic variations and cannot be predicted exactly.

Such a learning process requires suitable stopping criteria in order for the results of the observations to be consolidated and learned. Here, the probabilistic influence of the learning environment is explicitly modeled, with the outcome of each observational trial taken to be independent and identically distributed. Four operational stopping rules, suited to varying levels of mission criticality, are established for the two phases of learning reinforcement.

The performance of these rules is analyzed, and closed-form expressions for the key measures of interest are given. In particular, cost ratios are obtained for the two phases of learning for system operations exhibiting different characteristics, and decision rules linking the trial outcomes to policy choices are developed. Experiments have also been carried out, and the experimental results exhibit good agreement with the theoretical findings.

The present study is applicable to a wide variety of learning situations in an unknown environment based on rewards and punishments. The proposed method is useful in helping to arrive at sound operational decisions, and the associated costs of systematic evaluation have been calculated. While we have adopted an independent, identically distributed set of random variables for the outcomes, future studies may relax this assumption and examine situations where the outcomes are Markov dependent or where the underlying random variables are not identically distributed; doing so should further enhance the usefulness of these results.