
1 Introduction

Markov Decision Processes (MDPs) are extensively used for problems in which an agent needs to control a process by selecting actions that are allowed at the process’ states and that affect state transitions. Decision-making problems over MDPs amount to resolving action non-determinism, and result in policy synthesis problems. Policies are synthesised to maximise the expected long-term reward obtained from the process. This paper introduces a new software tool, LCRL, which performs policy synthesis for unknown MDPs when the goal is to maximise the probability of satisfying a task (or constraint) specified in Linear Temporal Logic (LTL). LTL is a formal, high-level, and intuitive language to describe complex tasks [9]. In particular, unlike static (space-dependent) rewards, LTL can describe time-dependent and complex non-Markovian tasks that can be derived from natural language [16, 36, 46]. Any LTL specification can be translated efficiently into a Limit-Deterministic Büchi Automaton (LDBA), which allows LCRL to automatically shape a reward function for the task that is later employed by the RL learner for optimal policy synthesis. LCRL is implemented in Python, the de facto standard programming language for machine learning applications.

1.1 Related Work

There exist a few tools that solve control (policy) synthesis problems in a model-free fashion, but not under full LTL specifications. One exception is the work in [6], which proposes an interleaved reward and discounting mechanism. However, the dependence of the reward shaping on the discounting mechanism can make the reward sparse and small, which might negatively affect convergence. The work in [17] puts forward a tool for an average-reward scheme based on earlier theoretical work. Other model-free approaches with available code bases are either (1) focused on fragments of LTL and classes of regular languages (namely finite-horizon specifications) or (2) unable to deal with unknown black-box MDPs. The approach proposed in [29, 30] presents a model-free RL solution, but for regular specifications that are expressed as deterministic finite-state machines. The work in [10, 11] takes a set of LTL\({}_f\)/LDL\({}_f\) formulae interpreted over finite traces as constraints, and then finds a policy that maximises an external reward function. The VSRL software tool [12,13,14, 28] solves a control synthesis problem whilst maintaining a set of safety constraints during learning.

1.2 Contributions

The LCRL software tool has the architecture presented in Fig. 1 and offers the following features:

  • LCRL leverages model-free RL algorithms, employing only traces of the system (assumed to be an unknown MDP) to formally synthesise optimal policies that satisfy a given LTL specification with maximal probability. LCRL finds such policies by learning over a set of traces extracted from the MDP under LTL-guided exploration. This efficient, guided exploration is owed to reward shaping based on the automaton [18,19,20,21, 23, 26]. The guided exploration enables the algorithm to focus only on relevant parts of the state/action spaces, as opposed to traditional Dynamic Programming (DP) solutions, where the Bellman iteration is exhaustively applied over the whole state/action spaces [5]. Under standard RL convergence assumptions, the LCRL output is an optimal policy whose traces satisfy the given LTL specification with maximal probability.

  • LCRL is scalable owing to its LTL-guided exploration, which allows it to cope with, and perform efficiently on, MDPs whose state and action spaces are significantly large. There exist a few LDBA construction algorithms for LTL, but not all of the resulting automata can be employed for quantitative model-checking and probabilistic synthesis [31]. The succinctness of the construction proposed in [39], which is used in LCRL, is another factor contributing to LCRL scalability. The scalability of LCRL is evaluated on an array of numerical examples and benchmarks, including high-dimensional Atari 2600 games [3, 7].

  • LCRL is the first RL synthesis method for LTL specifications over continuous-state/action MDPs. So far no tool has been available to enable RL, whether model-based or model-free, to synthesise policies for LTL on continuous-state/action MDPs. Alternative approaches for continuous-space MDPs [1, 34, 41, 44] either discretise the model into a finite-state MDP, or propose DP-based methods with value function approximation [15].

  • LCRL is robust to hyper-parameter tuning. Specifically, we observed that LCRL results, although problem-specific, are not significantly affected when hyper-parameters are not tuned with care.

Fig. 1. The LCRL architecture: the inputs to the tool are the environment (MDP), in particular its states s and corresponding labels L(s), as well as the LDBA generated from the user-defined LTL specification. The MDP (with state s) and the LDBA (with state q) are synchronised on-the-fly, thus generating the pair (s, q). A reward r is then automatically generated by LCRL, based on the environment label L(s) and the automaton state q; actions a are selected accordingly and passed back to the environment (MDP).

2 Logically-Constrained Reinforcement Learning (LCRL)

We assume the unknown environment is captured by an MDP, which in this work is a discrete-time stochastic control process defined as the tuple \(\mathfrak {M}=(\mathcal {S},\mathcal {A},s_0,P)\), where \(\mathcal {S}=\mathbb {R}^n\) is a set of continuous states, \(\mathcal {A}=\mathbb {R}^m\) is a set of continuous actions, and \(s_0 \in \mathcal {S}\) is an initial state. \(P:\mathcal {B}(\mathcal {S})\times \mathcal {S}\times \mathcal {A}\rightarrow [0,1]\) is a conditional transition kernel that assigns to any pair of state \(s \in \mathcal {S}\) and action \(a \in \mathcal {A}\) a probability measure \(P(\cdot |s,a)\) over the measurable sets of \(\mathcal {S}\) [4]. A finite state/action MDP is a special case in which \(|\mathcal {S}|<\infty \), \(|\mathcal {A}|<\infty \), and \(P:\mathcal {S}\times \mathcal {A}\times \mathcal {S}\rightarrow [0,1]\) is a transition probability matrix assigning to each state-action pair a conditional probability of entering a successor state in \(\mathcal {S}\). A random variable \(R(s,a)\sim \varUpsilon (\cdot |s,a)\in \mathcal {P}(\mathbb {R}^+)\) is defined over the MDP \(\mathfrak {M}\), representing the reward obtained when action a is taken at state s; it is sampled from the reward distribution \(\varUpsilon \), where \(\mathcal {P}(\mathbb {R}^+)\) denotes the set of probability distributions over \(\mathbb {R}^+\).

LCRL is a policy synthesis architecture for tasks that are expressed as LTL specifications [18,19,20,21,22,23,24,25,26]. The LCRL Core in Fig. 1 is compatible with any general RL scheme that conforms with the environment state and action spaces. Inside the LCRL module the MDP and LDBA states are synchronised, resulting in an on-the-fly product MDP. Intuitively, the product MDP encompasses the extra dimension of the LDBA states, which is added to the state space of the original MDP. The role of the added dimension is to track the sequence of labels that have been read across episodes, and thus to act as a memory register for the given task; this makes it possible to evaluate the (partial) satisfaction of the corresponding temporal property. More importantly, this synchronisation breaks the non-Markovian LTL specification down into a set of Markovian reach-avoid components, which facilitates the RL convergence to a policy whose traces satisfy the overall LTL specification. In practice, no product between the MDP and the LDBA is explicitly computed: the LDBA simply monitors the traces executed by the agent as episodes of the RL scheme.
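To make the on-the-fly synchronisation concrete, the following is a minimal Python sketch of one learning episode over the product state (s, q); it assumes MDP and LDBA objects exposing the interface described in Sect. 2.2 (reset, step, state_label) and, for illustration only, that reset() returns the initial state. It is not part of the LCRL code base.

def run_episode(mdp, ldba, policy, max_steps=1000):
    # illustrative only: we assume reset() returns the initial MDP/LDBA states;
    # the actual LCRL classes may expose these differently
    s = mdp.reset()
    q = ldba.reset()
    episode = []
    for _ in range(max_steps):
        a = policy(s, q)                    # policy acts on the product state (s, q)
        s_next = mdp.step(a)                # black signal in Fig. 1
        labels = mdp.state_label(s_next)    # blue signal in Fig. 1, e.g. ['safe', 'room1']
        q_next = ldba.step(labels)          # the automaton tracks the label sequence
        episode.append((s, q, a, s_next, q_next))
        s, q = s_next, q_next
        if q == -1:                         # non-accepting sink state of the LDBA
            break
    return episode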

Remark 1

The LDBA construction inherently introduces a limited form of non-determinism, called \(\varepsilon \)-transitions, which is treated as extra actions over the original MDP action space [39, 40]. Namely, whenever there exists a non-deterministic transition in an LDBA state, the MDP action space is augmented with the non-deterministic transition predicate of the LDBA. These non-deterministic transitions are automatically handled by LCRL during learning, and appropriate on-the-fly modifications are carried out, so that the RL agent can learn to resolve them in order to reach the accepting conditions of the LDBA. We emphasise that the underlying assumption in LCRL is that the MDP model is unknown (Fig. 1), and thus a single state is obtained as output when a state and an action are given as input. \(\square \)
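A minimal sketch of how the \(\varepsilon \)-transitions of Remark 1 can be exposed as extra actions is given below; it relies on the epsilon_transitions dictionary described in Sect. 2.2, and the helper function is illustrative rather than part of the LCRL API.

def available_actions(mdp_actions, ldba, q):
    # augment the MDP actions with the epsilon-actions of the current LDBA state q,
    # e.g. ldba.epsilon_transitions = {0: ['epsilon_0', 'epsilon_1']} (cf. Sect. 2.2)
    eps_actions = getattr(ldba, 'epsilon_transitions', {}).get(q, [])
    return list(mdp_actions) + list(eps_actions)

# note: taking an epsilon-action only moves the automaton; the MDP state is unchanged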

LCRL defines a reward function R for the RL Core, whose objective is to maximise the expected discounted return [42]:

$$\begin{aligned} \mathbb {E}^\pi \Big [\sum \limits _{n=0}^{\infty } \gamma ^n\, R(s_n,a_n)\,\Big |\,s_0=s\Big ], \end{aligned}$$
(1)

where \(\mathbb {E}^{\pi } [\cdot ]\) denotes the expected value given that the agent follows the policy \(\pi :\mathcal {S} \times \mathcal {A} \rightarrow [0,1]\) from state s; the parameter \(\gamma \in [0,1]\) is a discount factor; and \(s_0,a_0,s_1,a_1,\ldots \) is the sequence of state/action pairs, initialised at \(s_0 = s\). This reward is intrinsically defined over the product MDP, namely it is a function of the MDP state (describing where the agent is in the environment) and of the state of the automaton (encoding the partial satisfaction of the task). For further details on the LCRL reward shaping, please refer to [18,19,20,21, 23, 26].

The discount factor \(\gamma \) affects the optimality of the synthesised policy and has to be tuned with care. There is standard work in RL on state-dependent discount factors [6, 35, 37, 45, 47], which is shown to preserve convergence and optimality guarantees. In LCRL the learner discounts the received reward whenever it is positive, and leaves it undiscounted otherwise:

$$\begin{aligned} \gamma (s) = \left\{ \begin{array}{ll} \eta &{} \text {if } R(s,a)>0,\\ 1 &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(2)

where \(0<\eta <1\) is a constant [20, 47]. Hence, (1) reduces to an expected return that is bounded, namely

$$\begin{aligned} \mathbb {E}^{\pi } [\sum \limits _{n=0}^{\infty } \gamma (s_n)^{N(s_n)}~ R(s_n,\pi (s_n))|s_0=s],~0<\gamma (s)\le 1, \end{aligned}$$
(3)

where \(N(s_n)\) is the number of times a positive reward has been observed at state \(s_n\).

For any state \(s \in \mathcal {S}\) and any action \(a \in \mathcal {A}\), LCRL assigns a quantitative action-value \(Q:\mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}\), which is initialised with an arbitrary, finite value over all state-action pairs. As the agent learns, the action-value \(Q(s,a)\) is updated as a linear combination of the current value \(Q(s,a)\) and the target value:

$$R(s,a)+\gamma \max \limits _{a' \in \mathcal {A}}Q({s'},a'),$$

with the weight factors \(1-\mu \) and \(\mu \) respectively, where \(\mu \) is the learning rate.
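The update above can be summarised by the following minimal tabular sketch, which combines a standard Q-learning step with the state-dependent discount of Eq. (2); the dictionary-based Q table, indexed here by product states (s, q), and the function names are illustrative and do not reproduce the LCRL API.

from collections import defaultdict

# minimal tabular sketch: standard Q-learning update with the discount of Eq. (2),
# applied only when a positive reward is received
Q = defaultdict(float)   # arbitrary finite initialisation (here: 0)

def q_update(state, action, reward, next_state, actions, mu=0.9, eta=0.99):
    gamma = eta if reward > 0 else 1.0                                  # Eq. (2)
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)  # target value
    Q[(state, action)] = (1 - mu) * Q[(state, action)] + mu * target    # weights 1-mu, mu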

An optimal stationary Markov policy synthesised by LCRL on the product MDP that maximises the expected return is guaranteed to induce a finite-memory policy on the original MDP that maximises the probability of satisfying the given LTL specification [20]. Of course, in finite-state and finite-action MDPs the set of stationary deterministic policies is finite, and thus after a finite number of learning steps RL converges to an optimal policy. However, when function approximators are used in RL to tackle very large or even infinite-state (or -action) MDPs, such theoretical guarantees are valid only asymptotically [21, 24].

2.1 Installation

LCRL can be set up via the pip package manager as easily as:

$$\begin{aligned} \texttt {pip\,install\,lcrl} \end{aligned}$$

This allows the user to readily import LCRL as a package into any Python project

$$\begin{aligned}>>> \texttt {import\,lcrl} \end{aligned}$$

and employ its modules. Alternatively, the provided setup file found within the distribution package will automatically install all the required dependencies. The installation setup has been tested successfully on Ubuntu 18.04.1, macOS 11.6.5, and Windows 11.

2.2 Input Interface

The LCRL training module lcrl.src.train takes as input two main objects (cf. Fig. 1): an MDP black-box object that generates training episodes, and an LDBA object, as well as the learning hyper-parameters that are listed in Table 1.

Table 1. List of hyper-parameters and features that can be externally selected

MDP: An MDP is an object with internal attributes that are a priori unknown to the agent, namely the state space, the transition kernel, and the labelling function (denoted by \(\mathcal {S}\), P, and L, respectively). The states and their labels are observable upon being reached. To formalise the agent-MDP interface we adopt a scheme that is widely accepted in the RL literature [7]. In this scheme the learning agent can invoke the following methods from any state of the MDP:

  • reset(): this resets the MDP to its initial state. This allows the agent to start a new learning episode whenever necessary.

  • step(action): the MDP step function takes an action (the yellow signal in Fig. 1) as input, and outputs a new state, i.e. the black signal in Fig. 1.

A number of well-known MDP environments (e.g., the stochastic grid-world) are embedded as classes within LCRL, and can be found within the module lcrl.src.environments. Most of these classes can be used to easily set up an MDP object. However, note that the state signal output by the step function needs to be fed to a labelling function state_label(state), which outputs a list of labels (in string format) for its input state (in Fig. 1, the black output signal from the MDP is fed to the blue box, i.e. the labelling function, which outputs the set of labels). For example, state_label(state) = [‘safe’, ‘room1’]. The labelling function state_label(state) can be positioned outside of the MDP class, or it can be an internal method of the MDP class. The built-in MDP classes in the lcrl.src.environments module have an empty state_label(state) method that is ready to be overridden at the instance level:

(Listing: instance-level override of the state_label method.)
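A minimal, self-contained sketch of such an instance-level override is given below; the stand-in class is used only so that the snippet runs on its own, whereas in practice mdp would be an instance of one of the built-in classes in lcrl.src.environments.

import types

# stand-in for a built-in LCRL environment class (illustrative only)
class _StubGridMDP:
    def state_label(self, state):
        return []               # built-in classes ship an empty labelling method

mdp = _StubGridMDP()

def state_label(self, state):
    # purely illustrative labelling of grid cells
    if tuple(state) == (3, 3):
        return ['unsafe']
    if tuple(state) == (0, 0):
        return ['safe', 'room1']
    return ['safe']

mdp.state_label = types.MethodType(state_label, mdp)   # bind to this instance only
print(mdp.state_label((0, 0)))                         # ['safe', 'room1']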

LDBA: An LDBA object is an instance of the lcrl.src.automata.ldba class. This class is structured according to the automaton construction in [39], and it encompasses modifications dealing with non-determinism, as per Remark 1. The LDBA initial state is numbered as 0, or can alternatively be specified using the class attribute initial_automaton_state once an LDBA object is created. The LDBA non-accepting sink state is numbered as \(-1\). Finally, the set of accepting sets, on which we elaborate further below, has to be specified at the instance level by configuring accepting_sets (Listing 1.2 line 1). The key interface methods for the LDBA object are:

  • accepting_frontier_function(state): this automatically updates an internal attribute of the LDBA class called accepting_sets. This is a list of the accepting sets of the LDBA, e.g. \(\mathcal {F}=\{F_1,...,F_f\}\): for instance, if the set of LDBA accepting sets is \(\mathcal {F}=\{\{3,4\},~\{5,6\}\}\), then this attribute is the list of corresponding state numbers accepting_sets = [[3,4],[5,6]]. As discussed above, accepting_sets has to be specified once the LDBA class is instantiated (Listing 1.2 line 1). The main role of the accepting frontier function is to determine whether an accepting set can be reached, so that a corresponding reward is given to the agent (cf. red signal in Fig. 1). Once an accepting set is visited, it is temporarily removed from accepting_sets until the agent has visited all the accepting sets within accepting_sets; after that, accepting_sets is reset to the original list (a minimal sketch of this mechanism is given after this list). To set up an LDBA class in LCRL the user needs to specify accepting_sets for the LDBA; LCRL then automatically shapes the reward function and calls the accepting_frontier_function whenever necessary. Further details on the accepting_frontier_function and the accepting_sets can be found in [18,19,20,21, 23, 26].

  • step(label): the LDBA step function takes a label set, i.e. the blue signal in Fig. 1, as input and outputs a new LDBA state. The label set is delivered to the step function by LCRL. The step method is empty by default and has to be specified manually after the LDBA class is instantiated (Listing 1.2 line 5).

  • reset(): this method resets the state and accepting_sets to their initial assignments. This corresponds to the agent starting a new learning episode.
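A minimal sketch of the accepting-frontier mechanism described in the first bullet above is the following; the helper function is illustrative and does not reproduce the LCRL implementation.

def accepting_frontier(q, frontier, accepting_sets):
    # 'frontier' is the current list of accepting sets still to be visited,
    # e.g. [[3, 4], [5, 6]]; a positive reward is due whenever q hits one of them
    reward_due = False
    for acc_set in list(frontier):
        if q in acc_set:
            frontier.remove(acc_set)    # temporarily remove the visited set
            reward_due = True
    if not frontier:                    # all sets visited: reset to the original list
        frontier = [list(acc) for acc in accepting_sets]
    return frontier, reward_due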

(Listing 1.2: construction and configuration of an LDBA object.)
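A hedged sketch in the spirit of Listing 1.2 is given below (the line numbers cited in the text refer to the original listing, not to this sketch); the constructor keyword, the attribute automaton_state, and the transition structure are assumptions made for illustration, and may differ from the actual LCRL classes and from the exact automaton of Fig. 2.

import types
from lcrl.src.automata.ldba import LDBA   # class name assumed; module path as in the text

goal1_or_goal2 = LDBA(accepting_sets=[[1], [2]])   # accepting sets (cf. Listing 1.2 line 1)

def step(self, label):
    # illustrative transition function for a specification like
    # (<>[] goal1 \/ <>[] goal2) /\ []!unsafe
    if 'unsafe' in label:
        self.automaton_state = -1               # non-accepting sink state
    elif self.automaton_state == 0 and 'epsilon_0' in label:
        self.automaton_state = 1                # commit to the '[] goal1' branch
    elif self.automaton_state == 0 and 'epsilon_1' in label:
        self.automaton_state = 2                # commit to the '[] goal2' branch
    elif self.automaton_state == 1 and 'goal1' not in label:
        self.automaton_state = -1
    elif self.automaton_state == 2 and 'goal2' not in label:
        self.automaton_state = -1
    return self.automaton_state

goal1_or_goal2.step = types.MethodType(step, goal1_or_goal2)   # cf. Listing 1.2 line 5
goal1_or_goal2.epsilon_transitions = {0: ['epsilon_0', 'epsilon_1']}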
Fig. 2. LDBA for the LTL specification \((\lozenge \square goal1 \vee \lozenge \square goal2 ) \wedge \square \lnot unsafe \).

If the automaton happens to have \(\varepsilon \)-transitions, e.g. as in Fig. 2, they have to be distinguishable, e.g. numbered. For instance, there exist two \(\varepsilon \)-transitions in the LDBA in Fig. 2, and each is marked by an integer. Furthermore, the LDBA class has an attribute called epsilon_transitions, which is a dictionary specifying which states in the automaton contain \(\varepsilon \)-transitions. In Fig. 2, only state 0 has outgoing \(\varepsilon \)-transitions and thus the attribute epsilon_transitions in the LDBA object goal1_or_goal2 has to be set to

$$\begin{aligned} \texttt {goal1\_or\_goal2.epsilon\_transitions = \{0:['epsilon\_0', 'epsilon\_1']\}} \end{aligned}$$
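Putting the pieces together, a hedged sketch of a complete training call is shown below; the entry-point name train and the keyword arguments (standing for the hyper-parameters of Table 1) are assumptions made for illustration and may differ from the actual lcrl.src.train signature.

from lcrl.src.train import train   # entry-point name assumed

# 'mdp' and 'goal1_or_goal2' are the MDP and LDBA objects constructed above;
# keyword names and values below are illustrative (cf. Table 1)
results = train(
    mdp, goal1_or_goal2,
    algorithm='ql',           # tabular Q-learning as the RL core
    episode_num=5000,
    discount_factor=0.95,     # eta in Eq. (2)
    learning_rate=0.9,        # mu in the Q update
    save_dir='./results',     # directory for the .pkl outputs and plots (Sect. 2.3)
)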

2.3 Output Interface

LCRL provides the results of learning and testing as .pkl files. Tests are closed-loop simulations in which the learned policy is applied to the MDP and the results are observed. For any selected learning algorithm, the learned model is saved as learned_model.pkl and the test results as test_results.pkl. Instructions on how to load these files are also displayed at the end of training, for ease of re-loading the data and for post-processing. Depending on the chosen learning algorithm, LCRL generates a number of plots to visualise the learning progress and the testing results; these plots are saved in the save_dir directory. The user additionally has the option to export a generated animation of the testing progress: LCRL prompts the user with this option upon completion of the test. During the learning phase, LCRL displays the progress in real time and allows the user to stop the learning task (in an any-time fashion) and save the generated outcomes.
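As a minimal post-processing sketch, the saved outputs can be re-loaded as follows; the paths assume save_dir='./results', and a drop-in replacement such as dill may be needed instead of pickle depending on how the objects were serialised.

import pickle

# re-load the learned model and the test results for post-processing
with open('./results/learned_model.pkl', 'rb') as f:
    learned_model = pickle.load(f)

with open('./results/test_results.pkl', 'rb') as f:
    test_results = pickle.load(f)

print(type(learned_model), type(test_results))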

3 Experimental Evaluation

Fig. 3. (a) Policy synthesised by LCRL in \(\mathtt {minecraft\text {-}t3}\); (b) the \(\mathtt {cart\text {-}pole}\) experiment [43]; (c) \(\mathtt {pacman\text {-}lrg}\): the white square on the left is labelled as food 1 (\( f_1 \)) and the one on the right as food 2 (\( f_2 \)); the state of being caught by a ghost is labelled g, and the rest of the state space is labelled neutral (n).

Table 2. Learning results with LCRL. MDP state and action space cardinalities are \(|\mathcal {S}|\) and \(|\mathcal {A}|\), and the number of automaton states in the LDBA is denoted by \(|\mathcal {Q}|\). The optimal action value function at the initial state is denoted by “LCRL \(\max _a Q(s_0,a)\)”, which represents the LCRL estimate of the maximum satisfaction probability; for each experiment, the reported result includes the mean and the standard error over ten learning trials with LCRL. This probability is also calculated by the PRISM model checker [33] and, whenever the MDP model can be processed by PRISM, it is reported in the column “max sat. prob. at \(s_0\)”. The closer “LCRL \(\max _a Q(s_0,a)\)” is to “max sat. prob. at \(s_0\)”, the better. Note that for continuous-state-action MDPs the maximum satisfaction probability cannot be computed exactly by model-checking tools, unless abstraction or approximation techniques are applied, hence “n/a”. Furthermore, if the MDP state (or action) space is large enough, e.g. pacman, model-checking tools cannot parse the model and the model-checking process times out, i.e. “t/o”. The column “LCRL conv. ep.” reports the episode number at which LCRL converged. Finally, “wall_clock time” reports the average elapsed real time needed for LCRL to converge on a test machine. The remaining columns provide the values of the hyper-parameters described in Table 1.
Table 3. Robustness of LCRL performance against hyper-parameter tuning, for the frozen-lake-1 experiment. The maximum probability of satisfaction is \(99.83\%\), as calculated by PRISM (cf. Table 2). The reported values are the percentages of runs in which executing the final LCRL policy produced traces that satisfied the LTL property. Statistics are taken over 10 training runs and 100 tests for each training run, namely 1000 trials for each hyper-parameter configuration.

We apply LCRL on a number of case studies highlighting its features, performance and robustness across various environment domains and tasks. All the experiments are run on a standard machine, with an Intel Core i5 CPU at 2.5 GHz and with 20 GB of RAM. The experiments are listed in Table 2 and discussed next.

The minecraft environment [2] requires solving challenging low-level control tasks (\(\mathtt {minecraft\text {-}tX}\)), and features many sequential goals. For instance, in \(\mathtt {minecraft\text {-}t3}\) (Fig. 3a) the agent is tasked with collecting three items sequentially and with reaching a final checkpoint, which is encoded as the following LTL specification: \(\Diamond ( wood \wedge \Diamond ( grass \wedge \Diamond ( iron \wedge \Diamond ( craft\_table ))))\), where \(\Diamond \) is the well-known eventually temporal operator.

The \(\mathtt {mars\text {-}rover}\) problems are realistic robotic benchmarks taken from [21], where the environment features continuous state and action spaces.

The well-known \(\mathtt {cart\text {-}pole}\) experiment (Fig. 3b) [8, 24, 43] has a task that is expressed by the LTL specification \(\square \lozenge y \wedge \square \lozenge g \wedge \square \lnot u\): starting with the pole in the upright position, the goal is to prevent it from falling over by moving the cart, whilst alternating between the yellow (y) and green (g) regions (\(\square \lozenge y \wedge \square \lozenge g\)) and avoiding the red, unsafe parts of the track (\(\square \lnot u\), namely always not u).

The \(\mathtt {robot\text {-}surve}\) example [38] has the task of repeatedly visiting two regions (A and B) in sequence, while avoiding multiple obstacles (C) on the way: \(\square \lozenge A \wedge \square \lozenge B \wedge \square \lnot C\).

Environments \(\mathtt {slp\text {-}easy}\) and \(\mathtt {slp\text {-}hard}\) are inspired by the widely used stochastic MDPs in [42, Chapter 6]: the goal in \(\mathtt {slp\text {-}easy}\) is to reach a particular region of the state space, whereas the goal in \(\mathtt {slp\text {-}hard}\) is to visit four distinct regions sequentially in a given order.

The \(\mathtt {frozen\text {-}lake}\) benchmarks are adopted from the OpenAI Gym [7]: the first three are reachability problems, whereas the last three require sequential visits of four regions, in the presence of unsafe regions to be always avoided.

Finally, \(\mathtt {pacman\text {-}sml}\) and \(\mathtt {pacman\text {-}lrg}\) are inspired by the well-known Atari game Pacman, and are initialised in a tricky configuration (\(\mathtt {pacman\text {-}lrg}\) as in Fig. 3c), in which the agent is likely to be caught: in order to win the game, the agent has to collect the available tokens (food sources) without being caught by the moving ghosts. Formally, the agent is required to choose one of the two available foods and then find the other one (\( \lozenge [ (f_1 \wedge \lozenge f_2) \vee (f_2 \wedge \lozenge f_1)] \)), while avoiding the ghosts (\( \square \lnot g\)). We thus feed the agent the conjunction of these requirements as the following LTL specification: \( \lozenge [ (f_1 \wedge \lozenge f_2) \vee (f_2 \wedge \lozenge f_1)] \wedge \square \lnot g\). Standard QL fails to find a policy generating satisfying traces for this experiment. We emphasise that the two tasks in cart-pole and robot-surve are not co-safe, namely they require possibly infinite traces as witnesses.

Additionally, we have evaluated the robustness of LCRL to the tuning of key RL hyper-parameters, i.e. the discount factor \(\eta \) and the learning rate \(\mu \), by training the LCRL agent 10 times and testing each final policy 100 times. The evaluation results and the overall rate of satisfying the given LTL specification are reported for the frozen-lake-1 experiment in Table 3. The statistics are taken across \(10\times 100\) tests, which results in 1000 trials for each hyper-parameter configuration.

4 Conclusions and Extensions

This paper presented LCRL, a new software tool for policy synthesis with RL under LTL and omega-regular specifications. There is a plethora of extensions that we are planning to explore. In the short term, we intend to: (1) directly interface LCRL with automata synthesis tools such as OWL [32]; (2) link LCRL with other model checking tools such as PRISM [33] and Storm [27]; and (3) embed more RL algorithms for policy synthesis, so that we can tackle policy synthesis problems in more challenging environments. In the longer term, we plan to extend LCRL so that (1) it can handle other forms of temporal logic, e.g., signal temporal logic; and (2) it offers a graphical user interface for ease of interaction.