1 Introduction

Control theory methods have attracted research interest in fluid dynamics due to the scientific challenges involved and the potential impact that such a technology might have in several engineering sectors, ranging from aeronautics to naval and road transport. Further impetus to these developments undoubtedly comes from current environmental needs: carbon dioxide emissions are considered among the causes of global warming, and any reduction of these emissions is beneficial in this regard. In this work, we focus on active control based on reinforcement learning (RL) algorithms, one of the main sub-fields of machine learning [3, 7]; RL is mainly used in robotics and has gained popularity in recent years for the super-human performance achieved on tasks as complex as the game of go [14]. A possible definition can be given by quoting a recent work by [11]: “RL [...] studies how to use past data to enhance the future manipulation of a dynamical system”. Not surprisingly, this definition could also apply to control theory algorithms: RL is deeply rooted in optimal control theory [9, 11, 12] as it relies on data-driven solutions to the Bellman equation [2]. Indeed, while sharing the theoretical ground of optimal control, RL is fully data-driven and, as such, inherits the practical advantages of data-based approaches, such as the ability to determine an optimal control policy using only a limited number of sensor measurements.

Following this rationale, we aim at leveraging RL strategies for closed-loop nonlinear control. We demonstrated in our recent work [4] that a nonlinear, chaotic system governed by the Kuramoto-Sivashinsky (KS) equation can be controlled without relying on a priori knowledge of the dynamics, but solely on localized measurements of the system. Effective policies were computed, capable of driving the system to the vicinity of the unstable, non-trivial solutions of the KS equation in a chaotic regime [5]. Here, we further extend these results: we briefly introduce the basics of nonlinear optimal control theory and RL in §2, and a parametric analysis is proposed in §3 with the aim of discussing the robustness of the computed controllers.

2 Reinforcement Learning: Introductory Elements

In this section, we briefly introduce the main elements of optimal control theory and the fundamentals of RL, with the aim of comparing the two frameworks. We refer the interested reader to the literature on optimal control [9], including its fluid mechanics applications [6, 8, 13], and on RL for deeper insights [7, 10, 14].

2.1 Bellman’s Optimality Condition

First of all, we introduce the state-space model

$$\begin{aligned} \dfrac{{\mathrm {d}}\mathbf {v}}{{\mathrm {d}}t}&= \mathcal {F}\left( \mathbf {v}(t),\mathbf {u}(t),t\right) , \end{aligned}$$
(1a)
$$\begin{aligned} \mathbf {x}(t)&= \mathcal {G}\left( \mathbf {v}(t)\right) , \end{aligned}$$
(1b)

describing a dynamical system governed by the nonlinear map \(\mathcal {F}\) and propagating the state \(\mathbf {v}\in \mathbb {R}^{N}\). The model is forced by an input vector \(\mathbf {u}\in \mathbb {R}^{m}\), with \(m\) being the number of inputs. In the second relation, the map \(\mathcal {G}\) associates the state \(\mathbf {v}\) with the observable \(\mathbf {x}\in \mathbb {R}^{p}\), a function of time t recorded as output by \(p\) sensors. In the following, we will generically refer to the observables \(\mathbf {x}\) and the input vector \(\mathbf {u}\) as signals. The control signal \(\mathbf {u}\) corresponds to the amplitude in time of localized forcing introduced in the system, typically by means of actuators.

The optimal control problem applied to the dynamical system in Eq. 1 can be stated as follows:

To compute the control signal \(\mathbf {u}\in \mathbb {R}^{m}\) using the sensor measurements \(\mathbf {x}\in \mathbb {R}^{p}\), such that an objective function \(\mathcal {J}\) is minimized.

A general expression of the objective function is given by

$$\begin{aligned} \mathcal {J}(\mathbf {v}_t,t,\underset{t\le \tau \le T}{\mathbf {u}(\tau )}) = h\left( \mathbf {v}(T),T\right) + \int _t^{T} r\left( \mathbf {v}(\tau ),\mathbf {u}(\tau ),\tau \right) \,{\mathrm {d}}\tau , \end{aligned}$$
(2)

where h provides the terminal condition at time T, the optimization horizon, and r is the reward associated with the state \(\mathbf {v}\) and the action \(\mathbf {u}\). Note that t can be any value less than or equal to T. As previously stated, the objective of the controller is to provide a mapping between the sensor signal \(\mathbf {x}\) and the control actions \(\mathbf {u}\); this mapping is usually called the policy and will be indicated as \(\mathbf {\pi }\), such that the unknown optimal signal is obtained as

$$\begin{aligned} \mathbf {u}^\star (t) = \mathbf {\pi }^\star \left( \mathbf {x}(t),t\right) . \end{aligned}$$
(3)

Hereafter, optimal solutions will be indicated with a \(({^\star })\). When the system in Eq. 1 is known, linear (or linearizable) and time-invariant, a classic approach to optimal control is the linear quadratic regulator (LQR), obtained when the reward r is quadratic; in that case, it is possible to solve the associated Riccati equation and compute the corresponding policy [9]. Here, we keep the formulation as general as possible and proceed by maximizing the objective function in Eq. 2, writing its optimal value as

$$\begin{aligned} \begin{aligned} \mathcal {J}^\star (\mathbf {v}(t),t)&= \max _{\underset{t\le \tau \le T}{\mathbf {\pi }(\tau )}} \left[ \int _t^{T} r\left( \mathbf {v}(\tau ),\mathbf {u}(\tau ),\tau \right) \,{\mathrm {d}}\tau + h\left( \mathbf {v}(T),T\right) \right] . \end{aligned} \end{aligned}$$
(4)

The right-hand side (RHS) of Eq. 4 can be further manipulated by splitting the integral into two contributions

$$\begin{aligned} \mathcal {J}^\star (\mathbf {v}(t),t) = \max _{\underset{t\le \tau \le t+\Delta t}{\mathbf {\pi }(\tau )}} \left[ \int _t^{t+\Delta t} r\,{\mathrm {d}}\tau + \mathcal {J}^\star (\mathbf {v}(t+\Delta t),t+\Delta t) \right] , \end{aligned}$$
(5)

where the first term, defined in the interval \([t,t+\Delta t]\), corresponds to an immediate reward, while the remaining terms are replaced by the optimal value function. The term \(\mathcal {J}^\star (\mathbf {v}(t+\Delta t),t+\Delta t)\) can be expanded in a Taylor series about \(\mathbf {v}(t)\) and, in the limit \(\Delta t \rightarrow 0\), this leads to the well-known Hamilton-Jacobi-Bellman (HJB) equation

$$\begin{aligned} - \dot{\mathcal {J}^\star }(\mathbf {v}(t),t) = \max _{\mathbf {\pi }(t)} \left[ r\left( \mathbf {v}(t),\mathbf {u}(t),t\right) + \mathcal {J}^\star _{\mathbf {v}}\left( \mathbf {v}(t),t\right) \, \mathcal {F}\left( \mathbf {v}(t),\mathbf {u}(t),t\right) \right] , \end{aligned}$$
(6)

with \(\mathcal {J}^\star _{\mathbf {v}}\) being the derivative with respect to the state, and the terminal condition \(\mathcal {J}^\star (\mathbf {v}(T),T) = h\left( \mathbf {v}(T),T\right) \). This functional equation is continuous in time and defined backward in time. If the HJB equation is solved over the whole state space and its value function is differentiable, it provides a necessary and sufficient condition for optimality. More interestingly for what follows, it can be shown that the discrete counterpart of the HJB equation is the Bellman equation

$$\begin{aligned} \mathcal {J}^\star (\mathbf {v}_t) = \max _\mathbf {u} \left[ \Delta t \, r(\mathbf {v}_t,\mathbf {u}_t) + \gamma \mathcal {J}^\star (\mathbf {v}_{t+\Delta t}) \right] , \end{aligned}$$
(7)

where \(\gamma = \exp \left( -\Delta t \, \rho \right) \) is the discount factor, with \(\rho \) a discount rate, and \(\Delta t\) the time step. This equation is applied within the Markov decision process (MDP) framework, where the probability of evolving from the present state to the future one under the action \(\mathbf {u}\) is expressed by transition matrices. Due to the probabilistic framework, the value function is reformulated as the expectation of the cumulative discounted reward,

$$\begin{aligned} \mathcal {J}^{\mathbf {\pi }}(\mathbf {v}_t) = \mathbb {E}\left[ \sum _{l =0}^\infty {\gamma ^l r(\mathbf {v}_{t+l \, \Delta t})}\right] . \end{aligned}$$
(8)

The Bellman equation in (7) is central in dynamic programming, discrete optimal control and RL. We can observe an important property: the discounted, infinite-horizon optimal problem is decomposed into a sequence of local optimal problems; more precisely, quoting [2],

“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”

This property is Bellman's principle of optimality; it allows one to solve the optimization problem by breaking it into a sequence of simpler problems.
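
To make the Bellman recursion concrete, the sketch below applies value iteration to a toy discrete MDP (Python/NumPy); the transition probabilities, rewards and discount factor are purely illustrative and are not taken from the present work.

```python
import numpy as np

# Toy MDP with 3 states and 2 actions (hypothetical numbers).
# P[a, s, s2] = probability of moving from state s to s2 under action a.
P = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 1.0],   # R[a, s]: immediate reward in state s under action a
              [0.1, 0.1, 1.0]])
gamma = 0.95                     # discount factor

# Value iteration: repeated application of the Bellman optimality operator (cf. Eq. 7).
J = np.zeros(3)
for _ in range(500):
    J = np.max(R + gamma * P @ J, axis=0)

pi = np.argmax(R + gamma * P @ J, axis=0)   # greedy policy extracted from J*
print("optimal values:", J, "optimal policy:", pi)
```

Each sweep solves a local, one-step problem for every state and re-uses the previously computed values of the successor states, which is precisely the decomposition stated by the principle of optimality.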

2.2 Reinforcement Learning

One of the assumptions in the previous section was the knowledge of the whole action-state space: when considering nonlinear maps \(\mathcal {F}\) of large dimension, the computational cost of exploring it would be prohibitive. As an alternative, we can observe that the model does not appear explicitly in the Bellman equation: it suffices to observe the state \(\mathbf {v}_t\) and measure the reward r to recover \(\mathcal {J}^\pi (\mathbf {v}_t)\) from the interaction of the system with the environment under the policy \(\pi \). If \(\mathcal {J}^\pi (\mathbf {v}_t)\) is a solution of Eq. 7, we obtain a data-driven approximation of the optimal solution of the nonlinear control problem. This idea leads to the reinforcement learning (RL) framework; in the specific case of deep reinforcement learning (DRL), the policy and the value function are represented by neural networks (NN).

2.3 General Classification for RL Algorithms

A rather general classification of RL algorithms can be made by identifying three main classes of techniques: (i) Actor–only, (ii) Critic–only and (iii) Actor–Critic. The word actor is a synonym for policy, while critic indicates the value function.

  1.

    Actor–only methods consist of evaluating parametric policies. In this procedure, each policy is evaluated by observing the system for a long time and computing the cumulative discounted reward; the optimization is performed by means of stochastic gradient-descent algorithms used to update the policy parameters. These algorithms are usually referred to as REINFORCE algorithms. From the mathematical viewpoint, the actor–only method satisfies Pontryagin's maximum principle, a necessary condition for optimality, where the system is optimized in the vicinity of only one trajectory.

  2.

    Critic–only methods are based on the value function approximation; a general expression of the Bellman equation associated with this class of algorithm is given by

    $$\begin{aligned} Q^{\pi }(\mathbf {v}_t, \mathbf {u}_t) = r(\mathbf {v}_t, \mathbf {u}_t) + \gamma Q^{\pi }(\mathbf {v}_{t+\Delta t},\mathbf {u}_{t+\Delta t}). \end{aligned}$$
    (9)

    In this way, the state-action value function, or Q-function, is written as a solution of the Bellman equation and measures the long-term reward of a system evolving along a trajectory emanating from \(\mathbf {v}_t\) under an action \(\mathbf {u}_t\), and subsequently driven by a policy \(\pi \). Q-learning algorithms aim at approximating the optimal action-value function; a minimal tabular sketch of this update is given after this list.

  3.

    Actor–critic algorithms combine the two techniques, providing an approximation of the policy while guaranteeing that the Q-function is a solution of the Bellman equation; this condition is satisfied when the analyzed system is Markovian and fully known from the observables.
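
As an illustration of the critic-only approach (item 2 above), the sketch below performs tabular Q-learning, i.e. a sample-based approximation of the optimal Q-function; the environment interface (`reset`/`step`), the learning rate `alpha` and the ε-greedy exploration are assumptions made for the example, not elements of the present work.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning on a discrete environment.

    `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration of the state-action space
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a') (cf. Eq. 9)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Once Q has converged, the greedy policy \(\pi (s)=\arg \max _a Q(s,a)\) approximates the optimal one.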

2.4 Deep Deterministic Policy Gradient as an Actor-Critic Algorithm for DRL

In this application, we opted for an actor-critic strategy, the Deep Deterministic Policy Gradient (DDPG) [10], capable of handling continuous actions. First of all, we define the so-called tuple, composed of the current state \(\mathbf {x}_t\), the associated reward \(r_t\), the action \(\mathbf {u}_t\) and the state \(\mathbf {x}_{t+1}\) obtained under this action. The tuples are iteratively stacked in memory and constitute the sampled experience of the MDP. Note that, in the most general case, we do not consider the full state but only local measurements such that \(\mathbf {x}=\mathcal {G}(\mathbf {v})\): in this case, the observability of the MDP can be limited, and we typically refer to a partially observable MDP (POMDP). This aspect is crucial as it can lead to non-Markovian representations of the system and, as a consequence, non-optimal solutions.

The approximations of policy \(\pi \) and value function Q are obtained by NN. In particular, each element of the i-th layer of the NN approximation can be written as

$$\begin{aligned} x^{i}_j = f_j\left( \mathbf {\psi } \mathbf {x}^{i-1} +\mathbf {b}\right) , \end{aligned}$$
(10)

where \(\{f_j\}\) represents the set of nonlinear activation functions (swish or tanh in the present work) selected for the approximation, with \(j=1,\dots ,h\) and h the dimension of the hidden layer. The argument of these functions is a linear combination of the nodes \(\mathbf {x}^{i-1}\) of the previous layer through the weights and biases \(\mathbf {\theta }=\{\mathbf {\psi },\mathbf {b}\}\); these coefficients \(\theta \) are the unknowns and are computed using a stochastic, gradient-based optimization. In particular, following the sketch in Fig. 1, the update of the value function Q, the critic part, is driven by the temporal difference (TD) error

$$\begin{aligned} TD = Q^{\pi }(\mathbf {x}_t, \mathbf {u}_t|\theta ) -\left[ r(\mathbf {x}_t, \mathbf {u}_t) + \gamma Q^{\pi }(\mathbf {x}_{t+1},\mathbf {u}_{t+1}|\theta )\right] . \end{aligned}$$
(11)

The gradient \(\nabla _{\theta } TD\) is used to update the coefficients of the NN approximating the value function. By feeding back into the system the signal \(\mathbf {u}\), based on the sensor measurements \(\mathbf {x}\), we are able to close the loop and control the system, as sketched in Fig. 1. In more detail, the Q-function allows the update of the actor part, which provides the policy \(\pi \),

$$\begin{aligned} \mathbf {u}_t=\pi (\mathbf {x}_t|\omega ) +\mathcal {N}, \end{aligned}$$
(12)

and thus the action \(\mathbf {u}_t\). The coefficients \(\omega \) of the NN approximating \(\pi \) are updated via the gradient \(\nabla _{\omega } Q\). A crucial aspect is exploration: the optimality of the control is guaranteed only under the hypothesis that the state-action space is known. To this end, the parameters \(\omega \) of the NN describing the policy are perturbed and noise \(\mathcal {N}\) is introduced on the action; both noise processes vary over time and are damped as the solution converges. As a final note, we stress that one of the main features of DRL is the continuous, real-time learning of the optimal policy.
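
A minimal sketch of one DDPG update is given below (PyTorch), combining the critic update driven by the TD error of Eq. 11, the actor update via \(\nabla _{\omega } Q\), and the additive exploration noise of Eq. 12. The target networks with soft updates, the critic signature `critic(x, u)`, and all hyper-parameters are standard DDPG ingredients assumed for the example; they do not reproduce the exact implementation used in this work.

```python
import torch

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update from a batch of stored tuples (x, u, r, x_next)."""
    x, u, r, x_next = batch

    # Critic: minimize the squared TD error (cf. Eq. 11); the target uses slowly
    # updated copies of the networks for stability.
    with torch.no_grad():
        target = r + gamma * critic_targ(x_next, actor_targ(x_next))
    td = critic(x, u) - target
    critic_loss = (td ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the Q-function, i.e. follow the gradient of Q w.r.t. the policy weights.
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks.
    for net, net_targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), net_targ.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

def act(actor, x, noise_std=0.1, u_max=1.0):
    """Exploration: perturb the deterministic policy with additive noise (cf. Eq. 12)."""
    with torch.no_grad():
        u = actor(x)
        u = u + noise_std * torch.randn_like(u)
    return torch.clamp(u, -u_max, u_max)
```

In practice, `noise_std` (and, if used, the perturbation of the policy weights) is decayed over the episodes as the solution converges.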

Fig. 1

Sketch of the DDPG algorithm applied to the control of the KS system. The system is observed by means of localized sensors, from which the current state \(\mathbf {x}\) is recorded. Based on the action \(\mathbf {u}\) and the scalar reward r, the updates of the Q-function and of the policy \(\pi \) are performed. More details are provided in the text

3 Control of Chaotic Regimes: The Kuramoto-Sivashinsky System

In this section, we discuss the control of the one-dimensional Kuramoto–Sivashinsky (KS) equation using DRL. The KS system exhibits rather rich dynamics, ranging from steady solutions to chaotic regimes. The critical parameter is the domain extent, here indicated with L. In particular, it can be shown that for \(L<L_c = 2\pi \), the dynamics is stable and converges towards \(\mathbf {E}_0=\mathbf {0}\), while chaotic dynamics emerges for \(L>L_c\). We consider the solutions obtained for \(L=22\), corresponding to a regime characterized by a maximum Lyapunov exponent \(\lambda _1\approx 0.043\) and a Kaplan-Yorke dimension \(D_{KY}\approx 5.2\); in this case, the dynamics is low-dimensional and lies in a space characterized by three non-trivial equilibria and two traveling waves [5]. In Fig. 2, we show the null solution \(\mathbf {E}_0\) and the three non-trivial solutions labelled \(\mathbf {E}_i\), with \(i=1,2,3\); each of these solutions is unstable, and the dynamics of the system becomes chaotic after a short transient. When increasing the domain extent, the number of positive Lyapunov exponents increases and the dynamics exhibits spatio-temporal chaos. The evolution in time of the velocity \(\mathbf {v}\in \mathbb {R}^{N}\) is governed by the equation

$$\begin{aligned} \dfrac{\partial \mathbf {v}}{\partial t}+\mathbf {v}\dfrac{\partial \mathbf {v}}{\partial x} = -\dfrac{\partial ^2 \mathbf {v}}{\partial x^2} -\dfrac{\partial ^4 \mathbf {v}}{\partial x^4} + \mathbf {g}(t), \end{aligned}$$
(13)

here discretized with a resolution of \(N=64\) grid points on a periodic domain. The periodicity allows for a Fourier-mode expansion in the numerical solution. Time marching is performed by a third-order Runge-Kutta scheme; the nonlinear terms are treated explicitly, while the linear terms are treated implicitly. For all numerical simulations, a time step of 0.05 is adopted.
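
A minimal pseudospectral sketch of the forced KS step is given below (Python/NumPy). For brevity it uses a first-order implicit-explicit Euler step instead of the third-order Runge-Kutta scheme described above, but it follows the same splitting: linear terms implicit in Fourier space, nonlinear term explicit.

```python
import numpy as np

def ks_step(v, dt, L, g=None):
    """Advance the KS equation (Eq. 13) by one time step on a periodic domain.

    Linear terms (-v_xx - v_xxxx) are treated implicitly in Fourier space,
    the advection term -v v_x explicitly; g is an optional forcing field.
    """
    N = v.size
    k = 2.0 * np.pi * np.fft.fftfreq(N, d=L / N)   # wavenumbers
    lin = k**2 - k**4                              # Fourier symbol of -d_xx - d_xxxx
    nonlin_hat = -0.5j * k * np.fft.fft(v**2)      # -v v_x = -(1/2) d_x(v^2)
    g_hat = np.fft.fft(g) if g is not None else 0.0
    v_hat = (np.fft.fft(v) + dt * (nonlin_hat + g_hat)) / (1.0 - dt * lin)
    return np.real(np.fft.ifft(v_hat))

# Parameters of the present study: L = 22, N = 64, dt = 0.05.
L, N, dt = 22.0, 64, 0.05
x = np.linspace(0.0, L, N, endpoint=False)
v = 0.1 * np.cos(2.0 * np.pi * x / L)              # arbitrary small initial condition
for _ in range(2000):
    v = ks_step(v, dt, L)
```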

Fig. 2

When the domain extent is \(L=22\), the KS system exhibits four equilibria: the null solution \(\mathbf {E}_0\) (top-left) and the non-trivial solutions labelled \(\mathbf {E}_i\), with \(i=1,2,3\) (top-right, bottom-left and bottom-right, respectively). All of them are unstable, as shown by the dynamics of the system in the spatio-temporal plots

The control forcing is introduced through the term \(\mathbf {g}(t)=\mathbf{B} \mathbf {u}(t)\), where \(\mathbf{B} \in \mathbb {R}^{N\times m}\) contains the spatial distribution of \(m=4\) localized, Gaussian-shaped actuators

$$\begin{aligned} \mathbf {B}(x_a)= \left( 2\pi \sigma \right) ^{-1/2} \, \exp \left( -\dfrac{\left( x -x_a\right) ^2}{2 \sigma ^2}\right) , \end{aligned}$$
(14)

placed at \(x_a\in \left\{ 0, L/4, L/2, 3L/4\right\} \) and modulated in amplitude by the forcing signal \(\mathbf {u}\in \mathbb {R}^{m}\), computed by the DDPG algorithm from \(p=8\) localized sensor measurements, equally spaced and staggered with respect to the actuator locations. It can be shown that the KS equation can be controlled using linear controllers in combination with localized actuation [1]; however, the scope of this investigation is to demonstrate the feasibility of a purely model-free approach to the control of nonlinear flows.
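
A short sketch of the actuation term \(\mathbf {g}(t)=\mathbf {B}\mathbf {u}(t)\) is given below; the actuator width \(\sigma \) is not specified in the text and is set here to an arbitrary illustrative value, and periodic images of the Gaussians are neglected for simplicity.

```python
import numpy as np

L, N, m = 22.0, 64, 4
x = np.linspace(0.0, L, N, endpoint=False)
x_a = np.array([0.0, L / 4, L / 2, 3 * L / 4])   # actuator centres
sigma = 0.4                                      # actuator width (assumed value)

# Column j of B is the Gaussian spatial support of the j-th actuator (Eq. 14).
B = np.exp(-(x[:, None] - x_a[None, :])**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma)

def forcing(u):
    """Control forcing g(t) = B u(t), with the amplitudes u provided by the policy."""
    return B @ u
```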

3.1 Implementation of DDPG

We choose as objective for our controller to drive the system towards the solution \(\mathbf {E}_2\); accordingly, the reward is defined as \(r := -\Vert \mathbf {E}_2-\mathbf {v}\Vert _2\), so that maximizing the reward minimizes the distance to the target. As mentioned before, the DDPG policy is based on NN and its structure is as follows:

  1.

    The actor part, representing the mapping from sensors to actuators, has \(p=8\) inputs and \(m=4\) outputs. Two hidden layers are considered, of dimensions 128 and 64, with swish and tanh activation functions, respectively.

  2.

    The critic part, representing the value function, has an input of dimension \(m+p=12\) and a scalar output. Two hidden layers of 256 and 128 nodes are introduced, both with swish activation functions.

Adam optimization is applied for the update.
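
A sketch of the two networks with the layer sizes and activations listed above is reported below (PyTorch, where swish corresponds to `nn.SiLU`); the output scaling of the actor by the actuation bound and the learning rates are assumptions made for the example.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: p = 8 sensor inputs -> m = 4 actuator amplitudes."""
    def __init__(self, p=8, m=4, u_max=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p, 128), nn.SiLU(),      # swish activation
            nn.Linear(128, 64), nn.Tanh(),
            nn.Linear(64, m), nn.Tanh())       # bounded output (assumption)
        self.u_max = u_max

    def forward(self, x):
        return self.u_max * self.net(x)

class Critic(nn.Module):
    """Q-function network: (state, action) of dimension p + m = 12 -> scalar value."""
    def __init__(self, p=8, m=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p + m, 256), nn.SiLU(),
            nn.Linear(256, 128), nn.SiLU(),
            nn.Linear(128, 1))

    def forward(self, x, u):
        return self.net(torch.cat([x, u], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # Adam updates, learning rates assumed
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```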

Fig. 3

Five policies for the control of the dynamics of the KS are compared for the same objective function: driving the system to the vicinity of \(\mathbf {E}_2\). For simplicity of the discussion, the initial condition is set to be the invariant solution \(\mathbf {E}_1\). The insets (a-b) show the behaviour of the system for \(|\mathbf {u}|<1\) and \(\gamma =0.95\) (blue-dashed), \(\gamma =0.97\) (red-dotted), \(\gamma =0.99\) (green); the trajectory is shown in phase space (a), while the corresponding reward is in (b). In the plots (c-d), we fix \(\gamma =0.99\) and consider three amplitudes: \(|\mathbf {u}|<0.5\) (blue-dashed), \(|\mathbf {u}|<1.0\) (red-dotted), \(|\mathbf {u}|<1.5\) (green); the corresponding trajectories (c) and rewards (d) are shown

3.2 Results

We extend the results of [4] by considering a parametric analysis on the values of the discount factor \(\gamma \) and the maximum amplitude of the outputs. In particular, we consider three policies with \(\gamma =\{0.95, 0.97, 0.99\}\) and \(|\mathbf {u}|<1.0\), and two further policies with \(\gamma =0.99\) and maximum output amplitudes \(|\mathbf {u}|<\{0.5,1.5\}\). The policy with \(|\mathbf {u}|<1.0\) and \(\gamma =0.99\) is the same as that analysed in [4]. Due to the Markovianity of the system, the controllers are capable of driving the system to the target state \(\mathbf {E}_2\) regardless of the initial conditions; here, for the sake of conciseness and to make the comparison possible, we choose \(\mathbf {E}_1\) as the initial condition for all test cases.

In Fig. 3a-b, we show the trajectory in phase space (obtained by projecting the dynamics on the first three Fourier modes) and the reward, respectively, for \(\gamma =0.95\) (blue-dashed), \(\gamma =0.97\) (red-dotted) and \(\gamma =0.99\) (green). The output is bounded as \(|\mathbf {u}|<1.0\). Surprisingly, although all three controllers are capable of driving the dynamics of the system towards \(\mathbf {E}_2\), the case with \(\gamma =0.99\) is also the one that exhibits the smallest excursions in phase space before converging, together with a higher reward. This behaviour resembles what is observed in model predictive control when longer time horizons are chosen. In the second set of results, obtained with \(\gamma =0.99\), a different behaviour appears when changing the amplitude bounds \(|\mathbf {u}|<\{0.5,1,1.5\}\), respectively depicted with blue-dashed, red-dotted and green curves in Fig. 3c-d. In this case, as one would expect, in the presence of greater control authority the policies converge more rapidly towards the vicinity of \(\mathbf {E}_2\); although the case with \(|\mathbf {u}|<1.5\) shows the highest reward, it is also characterized by a less regular behaviour in phase space than the case with \(|\mathbf {u}|<1.0\) (Fig. 3c).

4 Conclusions and Perspectives

This contribution is part of a larger research effort aimed at applying reinforcement learning strategies to Navier-Stokes systems. Without any a priori knowledge of the system, it is possible, by using localized sensors and actuators, to drive the dynamics of the chaotic KS system towards target states, here represented by unstable solutions of the system. The results are encouraging, although numerous questions remain to be addressed. From the application point of view, the control signals (not shown here) are highly non-trivial; in this sense, we are currently analysing the extent to which we are capable of reproducing an action comparable to a linearized, optimal control in the vicinity of the unstable state, together with the associated energy budget. A challenging aspect of this work is the extension of this control strategy to Navier-Stokes systems. A well-known limitation is represented, for instance, by the presence of time delays in convective systems [6, 13]; this problem “translates” in RL into the so-called credit-assignment problem. Also, it is important to keep a reasonable and realistic set-up, i.e. to limit the number of sensors and actuators; these choices require a trade-off between the engineering needs on the one hand and, on the other, the reduced observability, which can lead to a loss of Markovianity of the system, and the reduced control authority.

A future path is represented by the re-interpretation of RL from a control-oriented viewpoint: tools in standard, model-based control theory, such as model predictive control and adaptive algorithms [6, 15], rely on the Bellman formalism. The interplay between tools from optimal control theory and RL could help the development of reliable tools for the control of fluid systems.