
1 Introduction

To address the low sample utilization and slow learning of RL [7] in large-scale problems, researchers have introduced hierarchical reinforcement learning (HRL) algorithms, such as HAMs, MAXQ, and Options [5], which decompose intricate problems into subproblems. However, owing to their hierarchical structure, these methods may not guarantee convergence to the optimal policy. To overcome this limitation, researchers have leveraged the prior knowledge encoded in Reward Machines (RMs) [4] to mitigate convergence to suboptimal solutions. RMs have received widespread research attention, including LSRM [8], CRM [2], and HRM [2]. These works use an RM to decompose a task and to output the combination of reward functions under different conditions. Although they solve some tasks in these environments, they still suffer from slow learning and limited performance.

Our Contribution. To improve learning speed and problem-solving ability, this work proposes an algorithm dubbed Hierarchy based on Options with an Experience Replay Mechanism (HOERM). HOERM adopts an option-based hierarchical structure related to HRM [2], but adds a well-structured reward mechanism and an experience replay mechanism. We refine the segmented reward mechanism by using value iteration to compute a potential value for each state of the RM (see Sect. 2 for details). The differences in potential values between RM states serve as intermediate rewards, so the agent also receives appropriate rewards when transitioning between intermediate RM states. In addition, we design an experience replay mechanism with two replay buffers: \( ReplayBuffer_1 \) is a standard buffer that stores all experiences collected during learning, whereas \( ReplayBuffer_2 \) specifically stores experiences gathered when the agent reaches task-relevant states. Exploiting these experiences helps the agent learn its policy and complete tasks faster. Finally, we demonstrate the effectiveness of the proposed mechanism through experiments.

2 HOERM Through Reward Machines

This section introduces the definition of the Reward Machine (RM) and explains how the proposed HOERM learns policies through RMs.

2.1 Reward Machine

A Reward Machine (RM) [6] is a special finite-state machine that takes an abstract description of the environment as input and outputs the combination of reward functions under different conditions. The formal definition of an RM is given below:

Definition 1

\((Reward\ Machine )\): Given a set of propositional symbols (\(\mathcal {P}\)), a set of environment states (S), and a set of feasible actions (A), an RM is a five-tuple \(RM = \langle U, u_0, F, \delta _u, \delta _r \rangle \), where U is a finite set of states, \(u_0 \in U\) is the initial state, and F is a finite set of terminal states such that \(U \cap F = \emptyset \). \(\delta _u\) is the state-transition function, \(\delta _u: U \times 2^{\mathcal {P}} \rightarrow U \cup F\), and \(\delta _r\) is the reward function output by the RM, \(\delta _r: U \rightarrow [S \times A \times S \rightarrow \mathbb R]\).
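For concreteness, the following is a minimal Python sketch of how such a five-tuple could be represented; the class name, the dictionary-based encoding of \(\delta_u\) and \(\delta_r\), and the use of exact truth assignments as edge keys (instead of general logical formulas) are illustrative choices, not the implementation used in our experiments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Set

@dataclass
class RewardMachine:
    """Minimal sketch of Definition 1: RM = <U, u0, F, delta_u, delta_r>."""
    states: Set[int]                                 # U
    initial_state: int                               # u0
    terminal_states: Set[int]                        # F (disjoint from U)
    # delta_u: RM state u -> {truth assignment sigma -> next RM state in U or F}.
    # Keying by exact truth assignments is a simplification of formula satisfaction.
    delta_u: Dict[int, Dict[FrozenSet[str], int]]
    # delta_r: RM state u -> reward function over (s, a, s').
    delta_r: Dict[int, Callable] = field(default_factory=dict)

    def step(self, u: int, sigma: FrozenSet[str]) -> int:
        """Advance the RM state using the truth assignment sigma."""
        return self.delta_u.get(u, {}).get(sigma, u)  # stay in u if no edge fires
```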

Fig. 1. The Minecraft environment and one RM for an example task

Figure 1 depicts a “make-bridge” task and its associated RM in the Minecraft environment. An RM \(\mathcal {R}\) starts from the initial state \(u_0\) and is in some state \(u_t\) (\(u_t \in U \cup F\)) after the agent has taken a number of steps. Every edge in Fig. 1 (b) is labeled with a tuple \(\langle \varphi , r \rangle \), in which \(\varphi \) is a logical formula over the propositional symbols in \(\mathcal {P}\) and r is the reward. When a truth assignment \(\sigma \) satisfies \(\varphi \) (i.e., \(\sigma \models \varphi \)), \(\mathcal {R}\) transitions from state \(u_t\) to \(u_{t+1}\) and emits the corresponding reward \(r=\delta _{r}(u_t)(s, a, s^{\prime })\). The truth assignment \(\sigma \) is the set of propositions in \(\mathcal {P}\) that hold while the RM is in state \(u_t\).

To obtain the truth assignment \(\sigma \), we introduce a labeling function \(L: S \times A \times S \rightarrow 2^{\mathcal {P}}\). Given a state transition \(e = (s, a, s^{\prime })\), the labeling function \(L(s, a, s^{\prime })\) assigns truth values to the propositional symbols in \(\mathcal {P}\). Based on the RM definition and \(L(s, a, s^{\prime })\), we can formally describe how to apply RMs in RL.
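For illustration, one possible labeling function for a make-bridge-style grid task could be written as the rule-based check below; the proposition names and the dictionary encoding of environment states are hypothetical choices, introduced only to make \(L\) concrete. Definition 2 then formalizes how such a labeling function connects an MDP to an RM.

```python
from typing import FrozenSet

# Hypothetical propositional symbols for a make-bridge-style task (illustrative only).
P = {"got_wood", "got_iron", "used_factory"}

def labeling_function(s: dict, a: int, s_next: dict) -> FrozenSet[str]:
    """L: S x A x S -> 2^P, returning the propositions that hold after (s, a, s').

    The 'object' key is an assumed encoding of the grid cell the agent lands on."""
    true_props = set()
    if s_next.get("object") == "wood":
        true_props.add("got_wood")
    if s_next.get("object") == "iron":
        true_props.add("got_iron")
    if s_next.get("object") == "factory":
        true_props.add("used_factory")
    return frozenset(true_props)

# Example: stepping onto a wood cell makes 'got_wood' true.
sigma = labeling_function({"object": None}, 0, {"object": "wood"})  # frozenset({'got_wood'})
```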

Definition 2

MDPRM: A Markov Decision Process with an RM (MDPRM) is a tuple \(\mathcal{M}\mathcal{R} = \langle S, A, p, \gamma , \mathcal {P}, L, U, u_0, F, \delta _u, \delta _r \rangle \), in which S, A, p, and \(\gamma \) are defined as in a conventional MDP; \(\mathcal {P}\) is the set of propositional symbols; \(L:S \times A \times S \rightarrow 2^{\mathcal {P}}\) is the labeling function; and U, \(u_0\), F, \(\delta _u\), and \(\delta _r\) are the components of the RM.

In an \(\mathcal{M}\mathcal{R}\), the states of the environment and of the RM are updated after each step of the agent. When the agent takes action a and the environment transitions from s to \(s^{\prime }\), the RM state transitions to \(u^{\prime } = \delta _u(u, L(s, a, s^{\prime }))\) and the agent receives the reward \(r = \delta _r(u)(s, a, s^{\prime })\), which it uses to learn and adjust its policy.
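These cross-product dynamics can be sketched as a thin wrapper that advances the environment and the RM together. In the sketch below, `env_step`, `delta_u`, `delta_r`, and `labeling_function` are placeholders for the components described above, and the `(s', done)` environment interface is an assumption rather than a fixed API.

```python
def mdprm_step(env_step, delta_u, delta_r, labeling_function, s, u, a):
    """One MDPRM step: advance the environment, advance the RM, and compute the reward.

    env_step(s, a) -> (s', done) is an assumed environment interface;
    delta_u and delta_r are the RM transition and reward functions from Definition 1."""
    s_next, done = env_step(s, a)
    sigma = labeling_function(s, a, s_next)       # truth assignment L(s, a, s')
    u_next = delta_u[u].get(sigma, u)             # u' = delta_u(u, L(s, a, s'))
    r = delta_r[u](s, a, s_next)                  # r = delta_r(u)(s, a, s')
    return s_next, u_next, r, done
```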

2.2 Proposed HOERM

This subsection introduces how HOERM learns MDPRM policies. HOERM learns a set of options [5] for each task, each of which aims to drive the RM from one state to another. The high-level policy learns to select the most suitable option, while the low-level policies, modeled with double DQN, execute primitive actions and learn to complete the corresponding subtasks.

As mentioned above, HOERM learns an option for each transition \(\langle u, u_t \rangle \) between RM states. The initiation set of option\(_{\langle u, u_t \rangle }\) is \(\mathcal {I}_{\langle u, u_t \rangle } =\{ \langle s, u\rangle : s\in S \}\). Since option\(_{\langle u, u_t \rangle }\) can only be selected when the RM state is u, its policy can be written as \(\pi _{\langle u, u_t \rangle }(a|s)\), conditioned only on the environment state s. The termination condition of option\(_{\langle u, u_t \rangle }\) is given by Eq. (1): the option terminates when the RM transitions to a new state or the environment reaches a terminal state.

$$\begin{aligned} \beta _{\langle u, u_{t}\rangle }\left( s^{\prime }, u^{\prime }\right) = \begin{cases} 1 &{} \text {if } u^{\prime } \ne u \text { or } s^{\prime } \text { is terminal} \\ 0 &{} \text {otherwise} \end{cases} \end{aligned}$$
(1)
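The initiation set and Eq. (1) translate almost directly into code. The `Option` container below is an illustrative packaging of these two conditions, not the data structure used in our implementation; the `is_env_terminal` flag stands in for "\(s^{\prime}\) is terminal".

```python
from dataclasses import dataclass

@dataclass
class Option:
    """One option per RM transition <u, u_t>."""
    u: int      # RM state in which the option may be initiated
    u_t: int    # target RM state the option tries to reach

    def can_initiate(self, u_current: int) -> bool:
        # Initiation set I_<u, u_t>: any environment state s, as long as the RM is in u.
        return u_current == self.u

    def beta(self, u_next: int, is_env_terminal: bool) -> float:
        # Eq. (1): terminate when the RM state changes or s' is terminal.
        return 1.0 if (u_next != self.u or is_env_terminal) else 0.0
```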

The goal of the option policy \(\pi _{\langle u,u_t \rangle }\) is to guide the agent to complete the option as quickly as possible. To this end, we designed the following reward mechanism.

$$\begin{aligned} r_{\langle u, u_{t} \rangle }\left( s, a, s^{\prime }\right) = \begin{cases} \delta _{r}(u)\left( s, a, s^{\prime }\right) +\gamma \varPhi (u_t) - \varPhi (u) &{} \text {if } \delta _{u}\left( u, L\left( s, a, s^{\prime }\right) \right) = u_{t} \\ \delta _{r}(u)\left( s, a, s^{\prime }\right) +r^{+} &{} \text {if } \delta _{u}\left( u, L\left( s, a, s^{\prime }\right) \right) \notin \{u, u_{t}\} \\ \delta _{r}(u)\left( s, a, s^{\prime }\right) +r^{-} &{} \text {otherwise} \end{cases} \end{aligned}$$
(2)

In Eq. (2), \(\delta _{r}(u)\left( s, a, s^{\prime }\right) \) is the reward function output by the RM, which is 1 only when the final state is reached and 0 otherwise. \(\varPhi (u)\) is a potential function that measures the proximity of the current RM state to the terminal state: the closer u is to the terminal state, the larger \(\varPhi (u)\). \(\gamma \) is the discount factor of the MDP, and the hyper-parameters \(r^+\) and \(r^-\) are an auxiliary reward and punishment, respectively. If the transition \((s, a, s^{\prime })\) moves the RM from u to the target state \(u_t\) (i.e., \(u_t = \delta _u(u, L(s, a, s^{\prime }))\)), the agent receives the reward \(\delta _r(u)(s, a, s^{\prime })\) plus an additional shaping reward of \(\gamma \varPhi (u_t) - \varPhi (u)\), rewarding the agent for completing the intended RM transition.
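As stated in Sect. 1, the potentials \(\varPhi\) are obtained by value iteration over the RM states. The sketch below is one way such a computation could look, assuming a reward of 1 for entering a terminal RM state and 0 elsewhere, with the MDP discount \(\gamma\); these particular reward and discount choices are assumptions for illustration.

```python
def rm_potentials(states, terminal_states, edges, gamma=0.9, n_iters=100):
    """Value iteration over the RM graph: Phi(u) grows as u gets closer to a terminal state.

    `edges` maps each RM state u to the set of RM states reachable in one transition.
    The reward model (1 for reaching a terminal state, 0 otherwise) is an assumption."""
    phi = {u: 0.0 for u in states}
    for _ in range(n_iters):
        for u in states:
            values = []
            for u_next in edges.get(u, set()):
                r = 1.0 if u_next in terminal_states else 0.0
                values.append(r + gamma * phi.get(u_next, 0.0))
            if values:
                phi[u] = max(values)
    return phi

# Example: a chain u0 -> u1 -> terminal, so Phi(u1) > Phi(u0).
phi = rm_potentials(states={0, 1}, terminal_states={2},
                    edges={0: {1}, 1: {2}}, gamma=0.9)
```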

In addition, when the RM state transitions from u to another state \(\overline{u}\) (\(\overline{u} \notin \{u, u_t\}\), but \(\overline{u} \in U\)), the agent receives a reward of \(\delta _r(u)(s, a, s^{\prime })\) and an additional auxiliary reward of \(r^+\). This is because even though the transition is not the intended one, the experience is still valuable for the overall task.

For this reason, the experience replay mechanism includes a dedicated experience buffer, denoted \( ReplayBuffer_2 \). It specifically stores the agent's experiences when a high-level event is triggered that moves the RM from u to another state \(\overline{u} = \delta _u(u, L(s, a, s^{\prime }))\), where \(\overline{u} \notin \{u, u_t\}\) but \(\overline{u} \in U\). There are two main reasons for this design: (1) conventional random sampling from a single experience buffer often returns many irrelevant experiences, which slows down training; (2) regardless of whether \(\overline{u}\) equals \(u_t\), the experience gained when the agent triggers a high-level event is valuable and helps the agent quickly learn the policy for the subsequent subtask.
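A minimal sketch of this two-buffer scheme is shown below. Following reason (2), the sketch stores every transition that changes the RM state in \( ReplayBuffer_2 \); the buffer capacity, the 25% sampling mix, and the transition tuple layout are assumptions made purely for illustration.

```python
import random
from collections import deque

class DualReplayBuffer:
    """ReplayBuffer_1 stores all transitions; ReplayBuffer_2 stores only transitions
    that triggered a high-level event (i.e., changed the RM state)."""
    def __init__(self, capacity=100_000):
        self.buffer_1 = deque(maxlen=capacity)
        self.buffer_2 = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, u, u_next, done):
        transition = (s, a, r, s_next, u, u_next, done)
        self.buffer_1.append(transition)
        if u_next != u:                           # high-level event: RM state changed
            self.buffer_2.append(transition)

    def sample(self, batch_size, event_fraction=0.25):
        """Mix ordinary and event experiences; the 25% split is an assumed choice."""
        n_event = min(int(batch_size * event_fraction), len(self.buffer_2))
        batch = random.sample(list(self.buffer_2), n_event) if n_event else []
        batch += random.sample(list(self.buffer_1),
                               min(batch_size - n_event, len(self.buffer_1)))
        return batch
```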

Finally, if the agent fails to trigger any high-level event and the RM state remains unchanged, it receives an auxiliary punishment (\(r^-\)) in addition to the reward \(\delta _r(u)(s, a, s^{\prime })\). This negative feedback motivates the agent to adjust its policy for the subtask as quickly as possible.
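Putting the three cases together, Eq. (2) can be transcribed almost literally as follows; `phi` is the precomputed potential table, `env_reward` stands for \(\delta_r(u)(s, a, s^{\prime})\), `u_next` for \(\delta_u(u, L(s, a, s^{\prime}))\), and the numeric defaults for \(\gamma\), \(r^+\), and \(r^-\) are illustrative assumptions.

```python
def option_reward(u, u_t, u_next, env_reward, phi, gamma=0.9, r_plus=0.1, r_minus=-0.01):
    """Eq. (2): shaped reward for option <u, u_t>.

    env_reward is delta_r(u)(s, a, s'); u_next is delta_u(u, L(s, a, s')).
    The numeric values of r_plus and r_minus are illustrative assumptions."""
    if u_next == u_t:                         # reached the intended RM state
        return env_reward + gamma * phi[u_t] - phi[u]
    if u_next != u:                           # reached some other RM state (still useful)
        return env_reward + r_plus
    return env_reward + r_minus               # no high-level event was triggered
```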

3 Experiments

In this section, we evaluate algorithm performance in environments with discrete (Minecraft [1]) and continuous (Water World [3]) state spaces. These environments cover both single-task and multi-task scenarios. We compare against three representative, state-of-the-art baselines for combining RMs with reinforcement learning: Q-learning for RMs (QRM), proposed by [4], and counterfactual experiences for RMs (CRM) and hierarchical RL for RMs (HRM), both proposed by [2].

3.1 Experimental Setup

Each algorithm was run independently three times, and the average performance is reported. The results are presented uniformly throughout this paper, comparing four algorithms: the red line is the proposed HOERM, the green line is QRM, the sky-blue line is CRM, and the magenta line is HRM. The X-axis shows the number of training steps (in millions), and the Y-axis shows the average normalized reward across all tasks. Tables 1 and 2 also report the average and maximum reward obtained by each algorithm.

3.2 Experiment 1: Results in Minecraft

The first experiment was conducted in the Minecraft environment with a discrete state space, introduced by [1]. As shown in Fig. 1 (a), the environment grid contains the agent and raw materials needed for the tasks, with randomly generated positions. The task set \(\varOmega = \{\varphi _{1},\varphi _{2}, \ldots , \varphi _{9}, \varphi _{10} \}\), where \(\varphi _{i} \in \varOmega , i \in [1,10]\) is one task in Minecraft (e.g., the “make-bridge” task).

Fig. 2. Results in the Minecraft environment.

Figure 2 and Table 1 show the results obtained by each algorithm in the single-task and multitask tests. According to the experimental results, HRM learns quickly at the beginning of both tests. In the single-task test, HRM's average reward is 0.6480, only slightly lower than HOERM's 0.6712; however, HRM's maximum reward of 0.81 is the lowest. QRM and CRM accumulate reward more slowly than HRM but reach a maximum reward of 1. The results are similar in the multitask test. Overall, HOERM performs best in both tests: although its initial learning speed is slightly slower than HRM's, it obtains the highest average and maximum rewards. Its maximum reward of 1 in both tests indicates that HOERM can complete all of the tasks.

Table 1. Rewards in the Minecraft environment.

3.3 Experiment 2: Results in Water World

The second experiment was conducted in Water World, an environment with a continuous state space consisting of a 2D box that contains balls of various colors. We defined 10 tasks in this environment, including touching different specific colored balls. More detailed information can be found in [2].

Fig. 3. Results in the Water World environment.

Figure 3 and Table 2 present the results of the four algorithms in Water World. In the single-task test, the four algorithms exhibit similar initial learning speeds, but HRM again achieves the lowest maximum reward, 0.666. In the multitask test, HRM's average reward is 0.6318, close to that of CRM and higher than that of QRM, thanks to its fast initial learning; however, its maximum reward of 0.7125 remains the lowest. Neither CRM nor QRM improves its rewards significantly over HRM in this complex environment. Although HOERM learns slightly more slowly than HRM at the beginning, its effective sample utilization enhances its task-completion capability, yielding higher rewards than all three baselines and demonstrating its strong performance.

Table 2. Rewards in the Water World environment.

4 Discussion

In this paper, we reviewed the concept of the RM and proposed an HRL algorithm, HOERM. The proposed algorithm effectively leverages RMs and achieves efficient sample utilization, leading to rapid policy learning and strong experimental results. In future work, we will explore the application of RMs to multi-agent problems, tackling more complex tasks by sharing RM knowledge.