1 Introduction

Natural agents are believed to develop an internal model of their motor system by generating actions in muscles and observing limb movements [11]. It has been suggested that forming this internal model is analogous to a form of online system identification [24]. System identification, i.e. estimating dynamical parameters from observed input and output signals, has a rich history in engineering. But there might still be much to gain from considering biologically-plausible procedures. Here, I explore online system identification using a leading theory of how brains process information: free energy minimisation [3, 8].

To test free energy minimisation for use in engineering applications, I consider a specific benchmark problem called a Duffing oscillator. Duffing oscillators are described by relatively well-behaved nonlinear differential equations, making them excellent toy problems for methodological research. The oscillator's differential equation is cast to a generative model with a corresponding factor graph. The factor graph admits a recursive parameter estimation procedure through message passing [12, 14]. Specifically, variational message passing minimises free energy [5, 13, 18]. Here, I infer the parameters of a Duffing oscillator using online variational message passing. Experiments show that it performs as well as a nonlinear ARX model whose parameters are trained offline using prediction error minimisation [2].

2 System

Consider a rigid frame with two prongs facing rightwards (see Fig. 1 left). A steel beam is attached to the top prong. If the frame is driven by a periodic forcing term, the beam displaces horizontally as a driven damped harmonic oscillator. Two magnets are attached to the bottom prong, with the steel beam suspended in between; they act as a nonlinear feedback term on the beam's position, attracting or repelling the beam as it draws closer [15].

Fig. 1. (Left) Example of a physical implementation of a Duffing oscillator. (Right) Example of input and output signals.

Let y(t) be the observed displacement, x(t) the true displacement, and u(t) the observed driving force. The position of the beam is described as follows [25]:

$$\begin{aligned} {\mathrm m}\frac{d^2 x(t)}{dt^2} + {\mathrm c}\frac{d x(t)}{dt} + {\mathrm a}x(t) + {\mathrm b}x^3(t) =&\ u(t) + w(t) \end{aligned}$$
(1a)
$$\begin{aligned} y(t) =&\ x(t) + v(t), \end{aligned}$$
(1b)

where \({\mathrm m}\) is mass, \({\mathrm c}\) is damping, \({\mathrm a}\) the linear and \({\mathrm b}\) the nonlinear spring stiffness coefficient. Both the state transition and the observation likelihood contain noise terms, which are assumed to be Gaussian distributed: \(w(t) \sim \mathcal {N}(0, \tau ^{-1})\) (process noise) and \(v(t) \sim \mathcal {N}(0, \xi ^{-1})\) (measurement noise). The challenge is to estimate \({\mathrm m}\), \({\mathrm c}\), \({\mathrm a}\), \({\mathrm b}\), \(\tau \) and \(\xi \) such that the output of the system can be predicted as accurately as possible.

3 Identification

First, I discretise the state transition of Eq. 1 using a central difference for the second derivative and a forward difference for the first derivative. Re-arranging to form an expression in terms of \(x_{t+1}\) yields:

$$\begin{aligned} x_{t+1} = \frac{2{\mathrm m}+ {\mathrm c}\delta - {\mathrm a}\delta ^2}{{\mathrm m}+ {\mathrm c}\delta } x_{t} + \frac{-{\mathrm b}\delta ^2}{{\mathrm m}+ {\mathrm c}\delta }x_t^3 + \frac{-{\mathrm m}}{{\mathrm m}+ {\mathrm c}\delta } x_{t-1} + \frac{\delta ^2}{{\mathrm m}+ {\mathrm c}\delta } (u_t + w_t) \, , \end{aligned}$$
(2)

where \(\delta \) is the sample time step. Secondly, to ease inference at a later stage, I perform the following variable substitutions:

$$\begin{aligned} \theta _1 = \frac{2{\mathrm m} + {\mathrm c}\delta - {\mathrm a}\delta ^2}{{\mathrm m} + {\mathrm c}\delta } , \ \theta _2 = \frac{-{\mathrm b}\delta ^2}{{\mathrm m} + {\mathrm c}\delta } , \ \theta _3 = \frac{-{\mathrm m}}{{\mathrm m} + {\mathrm c}\delta } , \ \eta = \frac{\delta ^2}{{\mathrm m} + {\mathrm c}\delta } , \ \gamma = \frac{\tau ({\mathrm m} + {\mathrm c}\delta )^2}{\delta ^4} , \end{aligned}$$
(3)

where the square in the numerator for \(\gamma \) stems from absorbing the coefficient into the noise term (\(\mathbb {V}[\eta w_t] = \eta ^2\mathbb {V}[w_t]\)). Note that the mapping between \(\phi = ({\mathrm m}, {\mathrm c}, {\mathrm a}, {\mathrm b}, \tau )\) and \(\psi = (\theta _1, \theta _2, \theta _3, \eta , \gamma )\) can be inverted to recover point estimates:

$$\begin{aligned} {\mathrm m}= \frac{-\theta _3 \delta ^2}{\eta } \, , \ \ {\mathrm c}= \frac{(1+\theta _3)\delta }{\eta } \, , \ \ {\mathrm a}= \frac{1-\theta _1 - \theta _3}{\eta } \, , \ \ {\mathrm b}= \frac{-\theta _2}{\eta } \, , \ \ \tau = \gamma \eta ^2 \, . \end{aligned}$$
(4)
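To make the substitutions concrete, here is a minimal Julia sketch (Julia being the language of the ForneyLab.jl toolbox used later) of the forward map in Eq. 3 and the inverse map in Eq. 4, with a round-trip check. The parameter values are illustrative only.

```julia
# Forward map (Eq. 3): physical parameters ϕ = (m, c, a, b, τ) to ψ = (θ1, θ2, θ3, η, γ).
function phi2psi(m, c, a, b, τ, δ)
    denom = m + c*δ
    θ1 = (2m + c*δ - a*δ^2) / denom
    θ2 = -b*δ^2 / denom
    θ3 = -m / denom
    η  = δ^2 / denom
    γ  = τ * denom^2 / δ^4
    return θ1, θ2, θ3, η, γ
end

# Inverse map (Eq. 4): recover point estimates of the physical parameters.
function psi2phi(θ1, θ2, θ3, η, γ, δ)
    m = -θ3 * δ^2 / η
    c = (1 + θ3) * δ / η
    a = (1 - θ1 - θ3) / η
    b = -θ2 / η
    τ = γ * η^2
    return m, c, a, b, τ
end

# Round-trip check with illustrative values.
δ = 1 / 610.35                      # sample time step (Silverbox sampling rate)
ψ = phi2psi(1.0, 0.1, 2.0, 0.5, 100.0, δ)
ϕ = psi2phi(ψ..., δ)
@assert all(isapprox.(ϕ, (1.0, 0.1, 2.0, 0.5, 100.0); rtol=1e-10))
```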

Thirdly, the state transition can be cast to a multivariate first-order form:

$$\begin{aligned} \underbrace{\begin{bmatrix} x_{t+1} \\ x_{t} \end{bmatrix}}_{z_t} = \underbrace{\begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}}_{S} \underbrace{\begin{bmatrix} x_{t} \\ x_{t-1} \end{bmatrix}}_{z_{t-1}} + \underbrace{\begin{bmatrix} 1 \\ 0 \end{bmatrix}}_{s} g(\theta , z_{t-1}) + \begin{bmatrix} 1 \\ 0 \end{bmatrix} \eta u_t + \begin{bmatrix} 1 \\ 0 \end{bmatrix} \tilde{w}_t \, , \end{aligned}$$
(5)

where \(g(\theta , z_{t-1}) = \theta _1 x_t + \theta _2 x_t^3 + \theta _3 x_{t-1}\) and \(\tilde{w}_t \sim \mathcal {N}(0, \gamma ^{-1})\). The system is now a nonlinear autoregressive process. Lastly, integrating out \(\tilde{w}_t\) and \(v_t\) produces a Gaussian state transition and a Gaussian likelihood, respectively:

$$\begin{aligned} z_t \sim&\ \mathcal {N}(f(\theta , z_{t-1}, \eta , u_t), V) \end{aligned}$$
(6a)
$$\begin{aligned} y_t \sim&\ \mathcal {N}(s^{\top } z_t, \xi ^{-1}) \, , \end{aligned}$$
(6b)

where \(f(\theta , z_{t-1}, \eta , u_t) = Sz_{t-1} + s g(\theta , z_{t-1}) + s \eta u_t\) and \(V = \begin{bmatrix} \gamma ^{-1} & 0 \\ 0 & \epsilon \end{bmatrix}\). The number \(\epsilon \) represents a small noise injection to stabilise inference [6].
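As a concrete instance of Eqs. 5 and 6, the following Julia sketch samples one step of the generative model using Distributions.jl; all numerical values are illustrative assumptions, not inferred parameters.

```julia
using LinearAlgebra, Distributions

S = [0.0 0.0; 1.0 0.0]              # shift matrix (Eq. 5)
s = [1.0, 0.0]                      # selection vector

# Nonlinear autoregressive function (Eq. 5).
g(θ, z) = θ[1]*z[1] + θ[2]*z[1]^3 + θ[3]*z[2]

# State transition mean (Eq. 6a).
f(θ, z, η, u) = S*z + s*g(θ, z) + s*η*u

# One step of the generative model, with illustrative parameter values.
θ, η, γ, ξ = [1.5, -0.01, -0.7], 0.001, 1e3, 1e8
ϵ = 1e-8                            # small noise injection to stabilise inference
V = [1/γ 0.0; 0.0 ϵ]                # process noise covariance (Eq. 6a)

z_prev, u_t = [0.1, 0.09], 0.5
z_t = rand(MvNormal(f(θ, z_prev, η, u_t), V))   # state transition (Eq. 6a)
y_t = rand(Normal(dot(s, z_t), sqrt(1/ξ)))      # observation (Eq. 6b)
```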

To complete the generative model description, priors must be defined. Mass \({\mathrm m}\) and process precision \(\tau \) are known to be strictly positive parameters, while the damping and stiffness coefficients can be both positive and negative. By examining the variable substitutions, it can be seen that \(\theta _1\), \(\theta _2\), \(\theta _3\) and \(\eta \) can be both positive and negative, but \(\gamma \) can only be positive. As such, the following parametric forms can be chosen for the priors:

$$\begin{aligned} \theta \sim \mathcal {N}(m^{0}_{\theta }, V^{0}_{\theta }) \, , \quad \eta \sim \mathcal {N}(m^{0}_{\eta }, v^{0}_{\eta }) \, , \quad \gamma \sim \varGamma (a^{0}_\gamma , b^{0}_\gamma ) \, , \quad \xi \sim \varGamma (a^{0}_\xi , b^{0}_\xi ) \, . \end{aligned}$$
(7)

3.1 Free Energy Minimisation

Given the generative model, a free energy functional over a recognition model q can be formed by upper-bounding the negative log-evidence via Jensen's inequality:

$$\begin{aligned} -\log p({\mathbf{y}}, {\mathbf{u}}) \le \iint q(\psi ,{\mathbf{z}}) \log \frac{q(\psi ,{\mathbf{z}})}{p({\mathbf{y}}, {\mathbf{u}}, {\mathbf{z}}, \psi )}\ {\mathrm d}{\mathbf{z}}\, {\mathrm d}\psi = \mathcal {F}[q] \, , \end{aligned}$$
(8)

where \({\mathbf{z}}= (z_1, \dots , z_T)\), \({\mathbf{y}}= (y_1, \dots , y_T)\) and \({\mathbf{u}}= (u_1, \dots , u_T)\). I assume that the states factorise over time and that the parameters are largely independent:

$$\begin{aligned} q(\psi , {\mathbf{z}}) = q(\theta ) q(\eta ) q(\gamma ) q(\xi ) \prod _{t=1}^{T} q(z_t). \end{aligned}$$
(9)

All recognition densities are Gaussian distributed, except for \(q(\gamma )\) and \(q(\xi )\), which are Gamma distributed. In free energy minimisation, the parameters of the recognition distributions depend on each other and are iteratively updated.

3.2 Factor Graphs and Message Passing

In online system identification, parameter estimates should be updated at each time-step, which puts time constraints on the inference procedure. Message passing is an ideal inference procedure due to its efficiency in factorised generative models [12]. Figure 2 is a graphical representation of the generative model, with nodes for factors and edges for variables. Square nodes marked with Greek letters represent stochastic operations, while the remaining nodes represent deterministic operations. The node marked “NLARX” represents the state transition described in Eq. 6a.

Fig. 2. Forney-style factor graph of the generative model of a Duffing oscillator. Nodes represent conditional distributions and edges represent variables. Nodes send messages to connected edges. When two messages on an edge collide, the marginal belief q for the corresponding variable is updated. Each belief update reduces free energy; by iterating message passing, free energy is minimised.

The terminal nodes on the left represent the initial priors for the states and dynamical parameters; inference starts when these nodes pass messages. The subgraph, separated by columns of dots, represents the structure of a single time step, applied recursively. Incoming messages carry beliefs q from previous time-steps, while a message arriving at the state transition node originates from the likelihood node attached to observation \(y_t\). These messages combine priors from previous time steps with likelihoods of observations, and are used to update the beliefs q. The updated state belief is then passed forward as the incoming prior message of the next time step.

The graph contains more messages than shown, such as those sent by equality nodes; I have hidden them to avoid cluttering the figure. Their forms have been described extensively in the literature [12, 14], and modern message passing toolboxes, such as Infer.NET and ForneyLab.jl, incorporate them automatically. The NLARX node, however, is new. Its messages can be computed with:

(10a)
(10b)
(10c)
(10d)

where I use a first-order Taylor expansion to approximate the expected value of the nonlinear autoregressive function \(g(\theta , z_{t-1})\).
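To illustrate that step, the sketch below propagates the mean-field Gaussian beliefs over \(\theta \) and \(z_{t-1}\) through g with a first-order Taylor expansion, computing gradients with ForwardDiff.jl. This is my reconstruction of the linearisation under the assumption of independent beliefs; the exact messages in Eqs. 10a–10d may differ in detail.

```julia
using LinearAlgebra, ForwardDiff

g(θ, z) = θ[1]*z[1] + θ[2]*z[1]^3 + θ[3]*z[2]

# First-order Taylor approximation of E[g(θ, z)] and V[g(θ, z)]
# under independent Gaussian beliefs q(θ) = N(mθ, Vθ) and q(z) = N(mz, Vz).
function approx_moments(mθ, Vθ, mz, Vz)
    Eg = g(mθ, mz)                                  # mean: g evaluated at the belief means
    Jθ = ForwardDiff.gradient(θ -> g(θ, mz), mθ)    # ∂g/∂θ at the means
    Jz = ForwardDiff.gradient(z -> g(mθ, z), mz)    # ∂g/∂z at the means
    Vg = dot(Jθ, Vθ * Jθ) + dot(Jz, Vz * Jz)        # variance: propagate both uncertainties
    return Eg, Vg
end

# Illustrative beliefs.
mθ, Vθ = [1.5, -0.01, -0.7], Matrix(0.1I, 3, 3)
mz, Vz = [0.1, 0.09], Matrix(0.01I, 2, 2)
Eg, Vg = approx_moments(mθ, Vθ, mz, Vz)
```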

Loeliger et al. (2007) have written an accessible introduction to message passing in factor graphs [14]. Variational message passing in autoregressive processes has also been described in detail [5, 19].

4 Experiment

The Duffing oscillator has been implemented in an electronic system called Silverbox [25]. The data set consists of T = 131702 samples, gathered at a sampling frequency of 610.35 Hz. Figure 3 shows the time-series, plotted at every 80th time step. There are two regimes: the first 40000 samples are subject to a linearly increasing input amplitude (left of the black line in Fig. 3) and the remaining samples are subject to a constant amplitude but contain only odd harmonics (right of the black line). The second regime is used as the training data set, in which both input and output data are given and the parameters must be inferred. The first regime is used as the validation data set, in which the inferred parameters are fixed and the model must predict the output signal.

Fig. 3. Silverbox data set, sampled at every 80th time step for visualisation. The black line splits it into validation data (left) and training data (right).

I performed two experiments: a 1-step ahead prediction error setting and a simulation error setting. I used ForneyLab.jl, with NLARX as a custom node, to run the message passing inference procedure [4]. I call the model above FEM-NLARX, for Nonlinear Latent Autoregressive model with eXogenous input using Free Energy Minimisation. I implemented two baselines: the first is NLARX without the nonlinearity (i.e. the nonlinear spring coefficient \({\mathrm b}= 0\)), dubbed FEM-LARX. The second is a standard NARX model, implemented using MATLAB’s System Identification Toolbox, with the static nonlinearity modelled by a sigmoid network of 4 units (in line with the 4 coefficients used by NLARX and LARX). Its parameters were inferred offline using prediction error minimisation; hence, this baseline is called PEM-NARX.

I chose uninformative priors for the coefficients \(\theta \) and \(\eta \): Gaussians centred at 1 with precisions of 0.1. The authors of Silverbox indicate that the signal-to-noise ratio at measurement time was high [25]. I therefore chose informative priors for the noise parameters: shape parameters \(a_{\xi }^0 = 10^8\) and \(a_{\gamma }^0 = 10^3\), and scale parameters \(b_{\xi }^0 = 10^3\) and \(b_{\gamma }^0 = 10\).
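In Distributions.jl notation, and assuming the paper's (a, b) pairs map onto the shape–scale parameterisation, these priors read as follows (note that Normal takes a standard deviation, hence \(\sqrt{1/0.1}\)):

```julia
using LinearAlgebra, Distributions

# Uninformative priors on the coefficients: mean 1, precision 0.1 (variance 10).
prior_θ = MvNormal(ones(3), Matrix(10.0I, 3, 3))
prior_η = Normal(1.0, sqrt(10.0))

# Informative priors on the noise precisions, with shape a and scale b.
prior_ξ = Gamma(1e8, 1e3)   # measurement precision: high SNR reported for Silverbox
prior_γ = Gamma(1e3, 1e1)   # process precision
```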

4.1 1-Step Ahead Prediction Error

At each time-step in the validation data, the models were given the previous output signals \(y_{t-1}, y_{t-2}\) and the current input signal \(u_t\), and had to infer the current output \(y_t\). This is a relatively easy task, which is reflected in the performance of all three models. The top row in Fig. 4 shows the predictions of all three models in purple and their squared errors with respect to the true output signal in black. The left column shows the offline NARX baseline (PEM-NARX), the middle column the linear online latent autoregressive baseline (FEM-LARX) and the right column the nonlinear online latent autoregressive model (FEM-NLARX). The errors in the top row appear completely flat; the bottom row therefore plots them on a log-scale. PEM-NARX attains a mean squared error of 5.831e−5, FEM-LARX of 5.945e−5 and FEM-NLARX of 5.830e−5.
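For reference, here is a point-estimate sketch of this prediction rule, substituting posterior means for the beliefs over \(\theta \) and \(\eta \) and observed outputs for the latent states; this is a simplification of the full belief propagation.

```julia
# 1-step ahead prediction with point estimates (posterior means) of θ and η.
# At each step the model sees the true previous outputs and the current input.
function one_step_ahead(y, u, θ, η)
    ŷ = similar(y)
    ŷ[1:2] = y[1:2]                 # initialise with observed outputs
    for t in 3:length(y)
        ŷ[t] = θ[1]*y[t-1] + θ[2]*y[t-1]^3 + θ[3]*y[t-2] + η*u[t]
    end
    return ŷ
end

mse(y, ŷ) = sum(abs2, y .- ŷ) / length(y)
```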

Fig. 4. 1-step ahead prediction errors. (Left) Offline NARX model with sigmoid net (PEM-NARX), (middle) online linear model (FEM-LARX) and (right) online nonlinear model (FEM-NLARX). (Top) Predictions (purple) and squared errors (black). (Bottom) Squared prediction errors on a log-scale.

4.2 Simulation Error

In this experiment, the models were not given the previous output signals, but had to use their own predictions from previous time-steps. This is a much harder task, because errors accumulate. The top row in Fig. 5 again shows the predictions of all three models (purple) and their squared errors (black). The errors visibly increase as the input signal’s amplitude rises. The bottom row plots the errors on a log-scale. PEM-NARX attains a mean squared error of 1.000e−3, FEM-LARX of 1.002e−3 and FEM-NLARX of 0.926e−3.
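The corresponding free-run simulation feeds predictions back in place of observations; again a point-estimate sketch under the same assumptions as above.

```julia
# Free-run simulation: previous predictions replace the observed outputs.
function simulate(y_init, u, θ, η)
    ŷ = zeros(length(u))
    ŷ[1:2] = y_init                 # seed with the first two observed outputs
    for t in 3:length(u)
        ŷ[t] = θ[1]*ŷ[t-1] + θ[2]*ŷ[t-1]^3 + θ[3]*ŷ[t-2] + η*u[t]
    end
    return ŷ
end
```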

Fig. 5. Simulation errors. (Left) Offline NARX model with sigmoid net (PEM-NARX), (middle) online linear model (FEM-LARX) and (right) online nonlinear model (FEM-NLARX). (Top) Predictions (purple) and squared errors (black). (Bottom) Squared prediction errors on a log-scale.

5 Discussion

The experimental results seem to justify looking to nature for inspiration. Free energy minimisation, in the form of variational message passing, seems a generally applicable and well-performing inference technique. The difficulties mostly lie in deriving variational messages (i.e. Eqs. 10).

Improvements in the proposed procedure could be made with a richer approximation of the nonlinear autoregressive function (e.g. an unscented transform) [20]. Alternatively, a hierarchy of latent Gaussian filters or autoregressive processes could be used to obtain time-varying noise parameters or time-varying coefficients [19, 22]. Furthermore, instead of discretising such that an autoregressive model is obtained, one could express the evolution of the states in generalised coordinates. Lastly, black-box models could be explored for further performance improvements.

A natural next step is for an active inference agent to determine the control signal regime (i.e. optimal design). Unfortunately, this is not straightforward: the current formulation relies on variational free energy, which does not produce an epistemic term in the objective. The epistemic term is needed to encourage exploration, i.e. trying sub-optimal inputs to reduce uncertainty. To arrive at an epistemic term, one would need to work with expected free energy [17]. But it is unclear how expected free energy could be incorporated into factor graphs.

5.1 Related Work

Online system identification procedures typically employ recursive least-squares or maximum likelihood inference, with nonlinearities modelled by basis expansions or neural networks [7, 16, 23]. Online Bayesian identification procedures come in two flavours: sequential Monte Carlo samplers [1, 10] and online variational Bayes [9, 26]. This work is novel in the use of variational message passing as an efficient implementation of online variational Bayes and its application to a nonlinear autoregressive model.

6 Conclusion

I have presented a free energy minimisation procedure for online system identification. Experimental results showed comparable performance to a state-of-the-art nonlinear model with parameters estimated offline. This indicates that the procedure performs well enough to be deployed in engineering applications.

Future work should test variational message passing in more challenging nonlinear identification settings, such as a Wiener-Hammerstein benchmark [21]. Furthermore, problems with time-varying dynamical parameters, such as a robotic arm picking up objects of varying mass, would be interesting for their connection to natural agents.