Variational Approach for Learning Markov Processes from Time Series Data

Wu, Hao; Noé, Frank

doi:10.1007/s00332-019-09567-y

Variational Approach for Learning Markov Processes from Time Series Data

Published: 05 August 2019

Volume 30, pages 23–66, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Nonlinear Science Aims and scope Submit manuscript

Variational Approach for Learning Markov Processes from Time Series Data

Download PDF

4814 Accesses
144 Citations
6 Altmetric
4 Mentions
Explore all metrics

Abstract

Inference, prediction, and control of complex dynamical systems from time series is important in many areas, including financial markets, power grid management, climate and weather modeling, or molecular dynamics. The analysis of such highly nonlinear dynamical systems is facilitated by the fact that we can often find a (generally nonlinear) transformation of the system coordinates to features in which the dynamics can be excellently approximated by a linear Markovian model. Moreover, the large number of system variables often change collectively on large time- and length-scales, facilitating a low-dimensional analysis in feature space. In this paper, we introduce a variational approach for Markov processes (VAMP) that allows us to find optimal feature mappings and optimal Markovian models of the dynamics from given time series data. The key insight is that the best linear model can be obtained from the top singular components of the Koopman operator. This leads to the definition of a family of score functions called VAMP-r which can be calculated from data, and can be employed to optimize a Markovian model. In addition, based on the relationship between the variational scores and approximation errors of Koopman operators, we propose a new VAMP-E score, which can be applied to cross-validation for hyper-parameter optimization and model selection in VAMP. VAMP is valid for both reversible and nonreversible processes and for stationary and nonstationary processes or realizations.

Learning Stochastic Dynamical Systems via Bridge Sampling

Koopman Operator Framework for Time Series Modeling and Analysis

Article 09 January 2018

Bayesian model selection for complex dynamic systems

Article Open access 04 May 2018

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Extracting dynamical models and their main characteristics from time series data is a recurring problem in many areas of science and engineering. In the particularly popular approach of Markovian models, the future evolution of the system, e.g., state ${\mathbf {x}}_{t+\tau }$, only depends on the current state ${\mathbf {x}}_{t}$, where t is the time step and $\tau $ is the delay or lag time. Markovian models are easier to analyze than models with explicit memory terms. They are justified by the fact that many physical processes—including both deterministic and stochastic processes—are inherently Markovian. Even when only a subset of the variables in which the system is Markovian are observed, a variety of physics and engineering processes have been shown to be accurately modeled by Markovian models on sufficiently long observation lag times $\tau $. Examples include molecular dynamics (Chodera and Noé 2014; Prinz et al. 2011), wireless communications (Konrad et al. 2001; Ma et al. 2001) and fluid dynamics (Mezić 2013; Froyland et al. 2016).

In the past decades, a collection of closely related Markov modeling methods were developed in different fields, including Markov state models (MSMs) (Schütte et al. 1999; Prinz et al. 2011; Bowman et al. 2014), Markov transition models (Wu and Noé 2015), Ulam’s Galerkin method (Dellnitz et al. 2001; Bollt and Santitissadeekorn 2013; Froyland et al. 2014), blind-source separation (Molgedey and Schuster 1994; Ziehe and Müller 1998), the variational approach of conformation dynamics (VAC) (Noé and Nüske 2013; Nüske et al. 2014), time-lagged independent component analysis (TICA) (Perez-Hernandez et al. 2013; Schwantes and Pande 2013), dynamic mode decomposition (DMD) (Rowley et al. 2009; Schmid 2010; Tu et al. 2014), extended dynamic mode decomposition (EDMD) (Williams et al. 2015a), variational Koopman models (Hao et al. 2017), variational diffusion maps (Boninsegna et al. 2015), sparse identification of nonlinear dynamics (Brunton et al. 2016b) and corresponding kernel embeddings (Harmeling et al. 2003; Song et al. 2013; Schwantes and Pande 2015) and tensor formulations (Nüske et al. 2016; Klus and Schütte 2015). All these models approximate the Markov dynamics at a lag time $\tau $ by a linear model in the following form:

$$\begin{aligned} {\mathbb {E}}\left[ {\mathbf {g}}({\mathbf {x}}_{t+\tau })\right] = {\mathbf {K}}^{\top }{\mathbb {E}}\left[ {\mathbf {f}}({\mathbf {x}}_{t})\right] . \end{aligned}$$

(1)

Here ${\mathbf {f}}({\mathbf {x}})=(f_{1}({\mathbf {x}}),f_{2}({\mathbf {x}}),\ldots )^{\top }$ and ${\mathbf {g}}({\mathbf {x}})=(g_{1}({\mathbf {x}}),g_{2}({\mathbf {x}}),\ldots )^{\top }$ are feature transformations that transform the state variables ${\mathbf {x}}$ into the feature space in which the dynamics are approximately linear. ${\mathbb {E}}$ denotes an expectation value over time that accounts for stochasticity in the dynamics and can be omitted for deterministic dynamical systems. In some methods, such as DMD, the feature transformation is an identity transformation: ${\mathbf {f}}\left( {\mathbf {x}}\right) ={\mathbf {g}}\left( {\mathbf {x}}\right) ={\mathbf {x}}$—and then Eq. (1) defines a linear dynamical system in the original state variables. If ${{\mathbf {{f}}}}$ and ${\mathbf {g}}$ are indicator functions that partition $\varOmega $ into substates, such that $f_{i}\left( {\mathbf {x}}\right) =g_{i}\left( {\mathbf {x}}\right) =1$ if ${\mathbf {x}}\in A_{i}$ and 0 otherwise, Eq. (1) is the propagation law of an MSM, or equivalently of Ulam’s Galerkin method, as the expectation values ${\mathbb {E}}\left[ {\mathbf {f}}({\mathbf {x}}_{t})\right] $ and ${\mathbb {E}}\left[ {\mathbf {g}}({\mathbf {x}}_{t+\tau })\right] $ represent the vector of probabilities to be in any substate at times t and $t+\tau $, and $K_{ij}$ is the probability to transition from set $A_{i}$ to set $A_{j}$ in time $\tau $. In general, (1) can be interpreted as a finite-rank approximation of the so-called Koopman operator (Koopman 1931; Mezić 2005), which governs the time evolution of observables of the system state and can fully characterize the Markovian dynamics. As shown in Korda and Mezić (2018), this approximation becomes exact in the limit of infinitely sized feature transformations with ${\mathbf {f}}={\mathbf {g}}$, and a similar conclusion can also be obtained when ${\mathbf {f}},{\mathbf {g}}$ are infinite-dimensional feature functions deduced from a characteristic kernel (Song et al. 2013).

A direct method to estimate the matrix ${\mathbf {K}}$ from data is to solve the linear regression problem ${\mathbf {g}}({\mathbf {x}}_{t+\tau })\approx {\mathbf {K}}^{\top }{\mathbf {f}}({\mathbf {x}}_{t})$, which facilitates the use of regularized solution methods, such as the LASSO method (Tibshirani 1996). Alternatively, feature functions ${\mathbf {f}}$ and ${\mathbf {g}}$ that allow Eq. (1) to have a probabilistic interpretation (e.g., in MSMs), ${\mathbf {K}}$ can be estimated by a maximum likelihood or Bayesian method (Prinz et al. 2011; Noé 2008).

However, as yet, it is still unclear what are the optimal choices for ${\mathbf {f}}$and ${\mathbf {g}}$—either given a fixed dimension or a fixed amount of data. Notice that this problem cannot be solved by minimizing the regression error of Eq. (1), because a regression error of zero can be trivially achieved by choosing a completely uninformative model with ${\mathbf {f}}\left( {\mathbf {x}}\right) \equiv {\mathbf {g}}\left( {\mathbf {x}}\right) \equiv 1$ and ${\mathbf {K}}=1$. An approach that can be applied to deterministic systems and for stochastic systems with additive white noise is to set ${\mathbf {g}}\left( {\mathbf {x}}\right) ={\mathbf {x}}$, and then choose ${\mathbf {f}}$ as the transformation with smallest modeling error (Brunton et al. 2016a, b).

A more general approach is to optimize the dominant spectrum of the Koopman operator. At long timescales, the dynamics of the system are usually dominated by the Koopman eigenfunctions of the Koopman operator with large eigenvalues. If the dynamics obey detailed balance, those eigenvalues are real-valued, and the variational approach for reversible Markov processes can be applied that has made great progress in the field of molecular dynamics (Noé and Nüske 2013; Nüske et al. 2014). In such processes, the smallest modeling error of (1) is achieved by setting ${\mathbf {f}}={\mathbf {g}}$ equal to the corresponding eigenfunctions. Noé and Nüske (2013) describes a general approach to approximate the unknown eigenfunction from time series data of a reversible Markov process: Given a set of orthogonal candidate functions, ${\mathbf {f}}$, it can be shown that their time-autocorrelations are lower bounds to the corresponding Koopman eigenvalues, and are equal to them exactly if, and only if ${\mathbf {f}}$ are equal to the Koopman eigenfunctions. This approach provides a variational score, such as the sum of estimated eigenvalues (the Rayleigh trace), that can be optimized to approximate the eigenfunctions. If ${\mathbf {f}}$ is defined by a linear superposition of a given set of basis functions, then the optimal coefficients are found equivalently by either maximizing the variational score, or minimizing the regression error in the feature space as done in EDMD (Williams et al. 2015a)—see Hao et al. (2017). However, the regression error cannot be used to select the form and the number of basis functions themselves, whereas the variational score can. When working with a finite dataset, however, it is important to avoid overfitting, and to this end a cross-validation method has been proposed to compute variational scores that take the statistical error into account (McGibbon and Pande 2015). Such cross-validated variational scores can be used to determine the size and type of the function classes and the other hyper-parameters of the dynamicalmodel.

While this approach is extremely powerful for stationary and data and reversible Markov processes, almost all real-world dynamical processes and time series thereof are irreversible and often even nonstationary. In this paper, we introduce a variational approach for Markov processes (VAMP) that can be employed to optimize parameters and hyper-parameters of arbitrary Markov processes. VAMP is based on the singular value decomposition of the Koopman operator, which overcomes the limited usefulness of the eigenvalue decomposition of time-irreversible and nonstationary processes. We first show that the approximation error of the Koopman operator deduced from the linear model (1) can be minimized by setting ${\mathbf {f}}$ and ${\mathbf {g}}$ to be the top left and right singular functions of the Koopman operator. Then, by using the variational description of singular components, a class of variational scores, VAMP-r for $r=1,2,\ldots $, are proposed to measure the similarity between the estimated singular functions and the true ones. Maximization of any of these variational scores leads to optimal model parameters and is algorithmically identical to Canonical Correlation Analysis (CCA) between the featurized time-lagged pair of variables ${\mathbf {x}}_{t}$ and ${\mathbf {x}}_{t+\tau }$. This approach can also be employed to learn the feature transformations by nonlinear function approximators, such as deep neural networks. Furthermore, we establish a relationship between the VAMP-2 score and the approximation error of the dynamical model with respect to the true Koopman operator. We show that this approximation error can be practically computed up to a constant, and define its negative as the VAMP-E score. Finally, we demonstrate that optimizing the VAMP-E score in a cross-validation framework leads to an optimal choice of hyper-parameters.

2 Theory

2.1 Koopman Analysis of Dynamical Systems and Its Singular Value Decomposition

The Koopman operator ${\mathcal {K}}_{\tau }$ of a Markov process is a linear operator defined by

$$\begin{aligned} {\mathcal {K}}_{\tau }g({\mathbf {x}})\triangleq {\mathbb {E}} \left[ g({\mathbf {x}}_{t+\tau })\mid {\mathbf {x}}_{t}= {\mathbf {x}}\right] . \end{aligned}$$

(2)

For given ${\mathbf {x}}_{t}$, the Koopman operator can be used to compute the conditional expected value of an arbitrary observable g at time $t+\tau $. For the special choice that g is the Dirac delta function $\delta _{{\mathbf {y}}}$ centered at ${\mathbf {y}}$, application of the Koopman operator evaluates the transition density of the dynamics, ${\mathcal {K}}_{\tau }\delta _{{\mathbf {y}}}({\mathbf {x}})={\mathbb {P}}({\mathbf {x}}_{t+\tau }={\mathbf {y}}|{\mathbf {x}}_{t}={\mathbf {x}})$ (see Appendix A.3). Thus, the Koopman operator is a complete description of the dynamical properties of a Markovian system. For convenience of analysis, we consider here ${\mathcal {K}}_{\tau }$ as a mapping from ${\mathcal {L}}_{\rho _{1}}^{2}=\left\{ g|\left\langle g,g\right\rangle _{\rho _{1}}<\infty \right\} $ to ${\mathcal {L}}_{\rho _{0}}^{2}=\left\{ f|\left\langle f,f\right\rangle _{\rho _{0}}<\infty \right\} $, where $\rho _{0}$ and $\rho _{1}$ are empirical distributions of ${\mathbf {x}}_{t}$ and ${\mathbf {x}}_{t+\tau }$ of all transition pairs $\{({\mathbf {x}}_{t},{\mathbf {x}}_{t+\tau })\}$ occurring in the given time series (see Appendix A.1), and the inner products are defined by

$$\begin{aligned} \left\langle f,g\right\rangle _{\rho _{0}}=\int f\left( {\mathbf {x}}\right) g\left( {\mathbf {x}}\right) \rho _{0} \left( {\mathbf {x}}\right) \mathrm {d}{\mathbf {x}},\quad \left\langle f,g\right\rangle _{\rho _{1}}=\int f\left( {\mathbf {x}}\right) g\left( {\mathbf {x}}\right) \rho _{1} \left( {\mathbf {x}}\right) \mathrm {d}{\mathbf {x}}. \end{aligned}$$

(3)

How is the finite-dimensional linear model (1) related to the Koopman operator description? Let us consider ${\mathbf {f}}({\mathbf {x}}_{t})$ to be a sufficient statistics for ${\mathbf {x}}_{t}$, and let ${\mathbf {g}}$ be a dictionary of observables, then the value of an arbitrary observable h in the subspace of ${\mathbf {g}}$, i.e., $h={\mathbf {c}}^{\top }{\mathbf {g}}$, with some coefficients ${\mathbf {c}}$, can be predicted from ${\mathbf {x}}_{t}$ as ${\mathbb {E}}\left[ h({\mathbf {x}}_{t+\tau })|{\mathbf {x}}_{t}\right] ={\mathbf {c}}^{\top }{\mathbf {K}}^{\top }{\mathbf {f}}({\mathbf {x}}_{t})$. This implies that Eq. (1) is an algebraic representation of the projection of the Koopman operator onto the subspace spanned by functions ${\mathbf {f}}$ and ${\mathbf {g}}$, and the matrix ${\mathbf {K}}$ is therefore called the Koopman matrix. Combining this insight with the generalized Eckart–Young Theorem (Hsing and Eubank 2015) leads to our first result, namely what is the optimal choice of functions ${\mathbf {f}}$ and ${\mathbf {g}}$:

Theorem 1

Optimal approximation of Koopman operator. If ${\mathcal {K}}_{\tau }$ is a Hilbert–Schmidt operator between the separable Hilbert spaces ${\mathcal {L}}_{\rho _{1}}^{2}$ and ${\mathcal {L}}_{\rho _{0}}^{2}$, the linear model (1) with the smallest modeling error in Hilbert–Schmidt norm is given by ${\mathbf {f}}=(\psi _{1},\ldots ,\psi _{k})^{\top }$, ${\mathbf {g}}=(\phi _{1},\ldots ,\phi _{k})^{\top }$ and ${\mathbf {K}}=\mathrm {diag}(\sigma _{1},\ldots ,\sigma _{k})$, i.e.,

$$\begin{aligned} {\mathbb {E}}\left[ \phi _{i}({\mathbf {x}}_{t+\tau })\right] =\sigma _{i}{ \mathbb {E}}\left[ \psi _{i}({\mathbf {x}}_{t})\right] ,\quad \mathrm{for}\; i=1,\ldots ,k \end{aligned}$$

(4)

under the constraint $\mathrm {dim}({\mathbf {f}}),\mathrm {dim}({\mathbf {g}})\le k$, and the projected Koopman operator deduced from (4) is

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\sum _{i=1}^{k}\sigma _{i}\left\langle g,\phi _{i}\right\rangle _{\rho _{1}}\psi _{i}, \end{aligned}$$

(5)

where the singular value $\sigma _{i}>0$ is the square root of the $i\hbox {th}$ largest eigenvalue of ${\mathcal {K}}_{\tau }^{*}{\mathcal {K}}_{\tau }$ or ${\mathcal {K}}_{\tau }{\mathcal {K}}_{\tau }^{*}$, the left and right singular function $\psi _{i},\phi _{i}$ are the ith eigenfunctions of ${\mathcal {K}}_{\tau }^{*}{\mathcal {K}}_{\tau }$ and ${\mathcal {K}}_{\tau }{\mathcal {K}}_{\tau }^{*}$ with

$$\begin{aligned} \left\langle \psi _{i},\psi _{j}\right\rangle _{\rho _{0}}=1_{i=j},\quad \left\langle \phi _{i},\phi _{j}\right\rangle _{\rho _{1}}=1_{i=j}, \end{aligned}$$

(6)

and the first singular component is always given by $(\sigma _{1},\phi _{1},\psi _{1})=(1,\mathbb {1},\mathbb {1})$ with $\mathbb {1}\left( {\mathbf {x}}\right) \equiv 1$.

Proof

See Appendix A.2. $\square $

This theorem is universal for Markov processes, and the major assumption is that the Koopman operator is Hilbert–Schmidt, which is required for the existences of the singular value decomposition (SVD) of ${\mathcal {K}}_{\tau }$ and the finite Hilbert–Schmidt norm $\Vert {\mathcal {K}}_{\tau }\Vert _{\mathrm {HS}}$. Appendix A.4 provides two sufficient conditions for the assumption. However, it is worth noting that the Koopman operators of deterministic systems are not Hilbert–Schmidt or even compact in usual cases (see Appendix A.5), and thus all conclusions and methods in this paper are not applicable to deterministic systems.

In addition, we prove in Appendix A.3 that $\Vert \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\Vert _{\mathrm {HS}}$ is equal to a weighted ${\mathcal {L}}^{2}$ error of the transition density, which provides a more meaningful interpretation of the modeling error in Hilbert–Schmidt norm.

Example 1

Consider a one-dimensional dynamical system

$$\begin{aligned} x_{t+1}=\frac{x_{t}}{2}+\frac{7x_{t}}{1+0.12x_{t}^{2}}+6\cos x_{t}+\sqrt{10}u_{t} \end{aligned}$$

(7)

evolving in the state space $[-20,20]$, where $u_{t}$ is a standard Gaussian white noise zero mean and unit variance (see Appendix K.1 for details on the numerical simulations and analysis). This system has two metastable states with the boundary close to $x=0$ as shown in Fig. 1a, and the singular components are summarized in Fig. 1c, d. As shown in the figures, the sign structures of the second left and right singular functions clearly indicate the metastable states, and the third and forth singular functions provide more detailed information on the dynamics. An accurate estimate of the transition density can be obtained by combining the first four singular components, and the corresponding relative approximation error of the Koopman operator is only $6.6\%$ (see Fig. 1b, e). In addition, we utilize the finite-rank approximate Koopman operators to predict the time evolution of the distribution of $x_{t}$ for $t=1,\ldots ,256$ with the initial state $x_{0}=12$, and a small error can also be achieved when the rank is only 4 as displayed in Fig. 2, where

$$\begin{aligned} \mathrm {error}=\sum _{t=1}^{256}\int \rho _{1}(x_{t})^{-1} \left( \hat{{\mathbb {P}}}(x_{t}|x_{0})- {\mathbb {P}}(x_{t}|x_{0})\right) ^{2}\mathrm {d}x_{t} \end{aligned}$$

(8)

is the cumulative kinetic distance (Noé and Clementi 2015) between the transition density ${\mathbb {P}}(x_{t}|x_{0})$ and its estimate $\hat{{\mathbb {P}}}(x_{t}|x_{0})$, and $\rho _{1}$ is the stationary distribution.

There are other formalisms to describe Markovian dynamics, for example, the Markov propagator or the weighted Markov propagator, also called transfer operator (Schütte et al. 1999). These propagators are commonly used for modeling physical processes such as molecular dynamics, and describe the evolution of probability densities instead of observables. We show in Appendix B that all conclusions in this paper can be equivalently established by interpreting $(\sigma _{i},\rho _{1}\phi _{i},\rho _{0}\psi _{i})$ as the singular components of the Markov propagator.

2.2 Variational Principle for Markov Processes

In order to allow the optimal model (4) to be estimated from data, we develop a variational principle for the approximation of singular values and singular functions of Markov processes.

According to the Rayleigh variational principle of singular values, the first singular component maximizes the generalized Rayleigh quotient of ${\mathcal {K}}_{\tau }$ as

$$\begin{aligned} (\psi _{1},\phi _{1})=\arg \max _{f,g}\frac{\left\langle f,{\mathcal {K}}_{\tau }g\right\rangle _{\rho _{0}}}{\sqrt{\left\langle f,f\right\rangle _{\rho _{0}}\cdot \left\langle g,g\right\rangle _{\rho _{1}}}} \end{aligned}$$

(9)

and the maximal value of the generalized Rayleigh quotient is equal to the first singular value $\sigma _{1}=\left\langle \psi _{1},\,{\mathcal {K}}_{\tau }\phi _{1}\right\rangle _{\rho _{0}}$. For the $i\hbox {th}$ singular component with $i>1$, we have

$$\begin{aligned} (\psi _{i},\phi _{i})=\arg \max _{f,g}\frac{\left\langle f,{\mathcal {K}}_{\tau }g\right\rangle _{\rho _{0}}}{\sqrt{\left\langle f,f\right\rangle _{\rho _{0}}\cdot \left\langle g,g\right\rangle _{\rho _{1}}}} \end{aligned}$$

(10)

under constraints

$$\begin{aligned} \left\langle f,\psi _{j}\right\rangle _{\rho _{0}}=\left\langle g,\phi _{j}\right\rangle _{\rho _{1}}=0,\quad \forall j=1,\ldots ,i-1 \end{aligned}$$

(11)

and the maximal value is equal to $\sigma _{i}=\left\langle \psi _{i},\,{\mathcal {K}}_{\tau }\phi _{i}\right\rangle _{\rho _{0}}$. These insights can be summarized by the following variational theorem for seeking all top k singular components simultaneously:

Theorem 2

VAMP variational principle. The k dominant singular components of a Koopman operator are the solution of the following maximization problem:

$$\begin{aligned} \sum _{i=1}^{k}\sigma _{i}^{r}&=\max _{{\mathbf {f}},{\mathbf {g}}}{\mathcal {R}}_{r}\left[ {\mathbf {f}},{\mathbf {g}}\right] ,\nonumber \\ s.t.&\left\langle f_{i},f_{j}\right\rangle _{\rho _{0}}=1_{i=j},\nonumber \\&\left\langle g_{i},g_{j}\right\rangle _{\rho _{1}}=1_{i=j}, \end{aligned}$$

(12)

where $r\ge 1$ can be any positive integer. The maximal value is achieved by the singular functions $f_{i}=\psi _{i}$ and $g_{i}=\phi _{i}$ and

$$\begin{aligned} {\mathcal {R}}_{r}\left[ {\mathbf {f}},{\mathbf {g}}\right] =\sum _{i=1}^{k}\left\langle f_{i},{\mathcal {K}}_{\tau }g_{i}\right\rangle _{\rho _{0}}^{r} \end{aligned}$$

(13)

is called the VAMP-r score of ${\mathbf {f}}$ and ${\mathbf {g}}$.

Proof

See Appendix C. $\square $

This theorem generalizes Proposition 2 in Froyland (2013) where only the case of $k=2$ is considered. It is important to note that this theorem has direct implications for the data-driven estimation of dynamical models. For $r=1$, ${\mathcal {R}}_{r}\left[ {\mathbf {f}},{\mathbf {g}}\right] $ is actually the time correlation between ${\mathbf {f}}({\mathbf {x}}_{t})$ and ${\mathbf {g}}({\mathbf {x}}_{t+\tau })$ since $\left\langle f_{i},{\mathcal {K}}_{\tau }g_{i}\right\rangle _{\rho _{0}}={\mathbb {E}}_{t}[f_{i}({\mathbf {x}}_{t})g_{i}({\mathbf {x}}_{t+\tau })]$ and ${\mathbb {E}}_{t}[\cdot ]$ denotes the expectation value over all transition pairs $(x_{t},x_{t+\tau })$ in the time series. Hence the maximization of VAMP-r is analogous to the problem of seeking orthonormal transformations of ${\mathbf {x}}_{t}$ and ${\mathbf {x}}_{t+\tau }$ with maximal time-correlations, and we can thus utilize the canonical correlation analysis (CCA) algorithm (Hardoon et al. 2004) in order to estimate the singular components from data.

2.3 Comparison with Related Analysis Approaches

The SVD of the Koopman operator is equivalent to the eigenvalue decomposition when the Markov process is time-reversible and stationary with $\rho _{0}=\rho _{1}$, and therefore the variational principle presented here is a generalization of that developed for reversible conformation dynamics (Noé and Nüske 2013; Nüske et al. 2014). Specifically, VAMP-1 maximizes the Rayleigh trace, i.e., the sum of the estimated eigenvalues (Noé and Nüske 2013; McGibbon and Pande 2015), and VAMP-2 maximizes the kinetic variance introduced in Noé and Clementi (2015). See Appendix D for a detailed derivation of the reversible variational principle from the VAMP variational principle. For irreversible Markov processes, the singular functions can provide low-dimensional embeddings of kinetic distances between states like eigenfunctions of reversible processes (Paul et al. 2018). Furthermore, the coherent sets of nonstationary Markov processes, which are the generalization of metastable states, can be identified from dominant singular functions (Koltai et al. 2018).

The dynamics of an irreversible Markov process can also be analyzed through solving the eigenvalue problem ${\mathcal {K}}_{\tau }g=\lambda g$ [see, e.g., Williams et al. (2015a, b), Klus et al. (2015),Klus and Schütte (2015)], and the eigenfunctions form an invariant subspace of the Koopman operator for multiple lag times since the eigenvalue problem satisfies

$$\begin{aligned} {\mathcal {K}}_{\tau }g=\lambda g\Rightarrow {\mathcal {K}}_{n\tau }g=\lambda ^{n}g,\quad \forall n\ge 1. \end{aligned}$$

(14)

However, as far as we know, there is no variational principle for approximate eigenfunctions of irreversible Markov processes, and it is difficult to evaluate errors of projections of Koopman operators to the invariant subspaces. The SVD-based analysis approach overcomes the above problems and yields the optimal finite-rank approximate models. The major limitation of this approach comes from the fact that the singular functions are dependent on the choice of the lag time and the optimality of model (4) holds only for a fixed $\tau $. The optimization and error analysis of Koopman models for multiple lag times will be studied in our future work.

3 Estimation Algorithms

We introduce algorithms to estimate optimal dynamical models from time series data. We make the Ansatz to represent the feature functions ${\mathbf {f}}$ and ${\mathbf {g}}$ as linear combinations of basis functions $\varvec{\chi }_{0}=(\chi _{0,1},\chi _{0,2},\ldots )^{\top }$ and $\varvec{\chi }_{1}=(\chi _{1,1},\chi _{1,2},\ldots )^{\top }$:

$$\begin{aligned} {\mathbf {f}}&={\mathbf {U}}^{\top }\varvec{\chi }_{0},\nonumber \\ {\mathbf {g}}&={\mathbf {V}}^{\top }\varvec{\chi }_{1}. \end{aligned}$$

(15)

Here, ${\mathbf {U}}$ and ${\mathbf {V}}$ are matrices of size $m\times k$ and $m^{\prime }\times k$, i.e., we are trying to approximate k singular components by linearly combining m and $m^{\prime }$ basis functions. For the sake of generality we have assumed that ${\mathbf {f}}$ and ${\mathbf {g}}$ are represented by different basis sets. However, in practice one can justify using a single basis set the joint set $\varvec{\chi }^{\top }=(\varvec{\chi }_{0}^{\top },\varvec{\chi }_{1}^{\top })$ as an Ansatz for both ${\mathbf {f}}$ and ${\mathbf {g}}$. Please note that despite the linear Ansatz (15), the feature functions may be strongly nonlinear in the system’s state variables ${\mathbf {x}}$, thus we are not restricting the generality of the functions ${\mathbf {f}}$ and ${\mathbf {g}}$ that can be represented. In this section, we consider three problems: (i) optimizing ${\mathbf {U}}$ and ${\mathbf {V}}$, (ii) optimizing $\varvec{\chi }_{0}$ and $\varvec{\chi }_{1}$ and (iii) assessing the quality of the resulting dynamical model.

For convenience of notation, we denote by ${\mathbf {C}}_{00},{\mathbf {C}}_{11},{\mathbf {C}}_{01}$ the covariance matrices and time-lagged covariance matrices of basis functions, which can be computed from a trajectory $\{x_{1},\ldots ,x_{T}\}$ by

$$\begin{aligned} {\mathbf {C}}_{00}\triangleq & {} {\mathbb {E}}_{t}\left[ \varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) \varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) ^{\top }\right] \approx \frac{1}{T-\tau }\sum _{t=1}^{T-\tau }\varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) \varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) ^{\top }, \end{aligned}$$

(16)

$$\begin{aligned} {\mathbf {C}}_{11}\triangleq & {} {\mathbb {E}}_{t}\left[ \varvec{\chi }_{1}\left( {\mathbf {x}}_{t+\tau }\right) \varvec{\chi }_{1}\left( {\mathbf {x}}_{t+\tau }\right) ^{\top }\right] \approx \frac{1}{T-\tau }\sum _{t=1+\tau }^{T}\varvec{\chi }_{1}\left( {\mathbf {x}}_{t}\right) \varvec{\chi }_{1}\left( {\mathbf {x}}_{t}\right) ^{\top }, \end{aligned}$$

(17)

$$\begin{aligned} {\mathbf {C}}_{01}\triangleq & {} {\mathbb {E}}_{t}\left[ \varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) \varvec{\chi }_{1}\left( {\mathbf {x}}_{t+\tau }\right) ^{\top }\right] \approx \frac{1}{T-\tau }\sum _{t=1}^{T-\tau }\varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) \varvec{\chi }_{1}\left( {\mathbf {x}}_{t+\tau }\right) ^{\top }. \end{aligned}$$

(18)

If there are multiple trajectories, the covariance matrices can be computed in the same manner by averaging over all trajectories. Instead of the direct estimators (16–18), more elaborated estimation methods such as regularization methods (Tibshirani 1996) and reweighting estimators (Hao et al. 2017) may be used.

3.1 Feature TCCA: Finding the Best Linear Model in a Given Feature Space

We first propose a solution for the problem of finding the optimal parameter matrices ${\mathbf {U}}$ and ${\mathbf {V}}$ given that the basis functions $\varvec{\chi }_{0}$ and $\varvec{\chi }_{1}$ are known. Substituting the linear Ansatz (15) into the VAMP variational principle shows that ${\mathbf {U}}$ and ${\mathbf {V}}$ can be computed as the solutions of the maximization problem:

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}&{\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})\nonumber \\ \mathrm {s.t.}&{\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}={\mathbf {I}}\nonumber \\&{\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}={\mathbf {I}}, \end{aligned}$$

(19)

where

$$\begin{aligned} {\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})=\sum _{i=1}^{k}\left( {\mathbf {u}}_{i}^{\top }{\mathbf {C}}_{01}{\mathbf {v}}_{i}\right) ^{r} \end{aligned}$$

(20)

is a matrix representation of VAMP-r score, and ${\mathbf {u}}_{i}$ and ${\mathbf {v}}_{i}$ are the ith columns of ${\mathbf {U}}$ and ${\mathbf {V}}$. This problem can be solved by applying linear CCA (Hardoon et al. 2004) in the feature spaces defined by the basis sets $\varvec{\chi }_{0}({\mathbf {x}}_{t})$ and $\varvec{\chi }_{1}({\mathbf {x}}_{t+\tau })$, and the same solution will be obtained for any other choice of r. (See Appendices E.1 and E.2 for more detailed proof and analysis.) The resulting algorithm for finding the best linear model is a CCA in feature space, applied on time-lagged data. Hence we briefly call this algorithm feature TCCA:

1.
Compute covariance matrices ${\mathbf {C}}_{00},{\mathbf {C}}_{01},{\mathbf {C}}_{11}$ via (16–18).
2.
Perform the truncated SVD
$$\begin{aligned} \bar{{\mathbf {K}}}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{ \mathbf {C}}_{11}^{-\frac{1}{2}}\approx {\mathbf {U}}^{\prime }{ \mathbf {K}}{\mathbf {V}}^{\prime \top }, \end{aligned}$$
where $\bar{{\mathbf {K}}}$ is the Koopman matrix for the normalized basis functions ${\mathbf {C}}_{00}^{-\frac{1}{2}}\varvec{\chi }_{0}$ and ${\mathbf {C}}_{11}^{-\frac{1}{2}}\varvec{\chi }_{1}$, ${\mathbf {K}}=\mathrm {diag}(K_{11},\ldots ,K_{kk})$ is a diagonal matrix of the first k singular values that approximate the true singular values $\sigma _{1},\ldots ,\sigma _{k}$, and ${\mathbf {U}}^{\prime }$ and ${\mathbf {V}}^{\prime }$ consist of the k corresponding left and right singular vectors respectively.
3.
Compute ${\mathbf {U}}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {U}}^{\prime }$ and ${\mathbf {V}}={\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {V}}^{\prime }$.
4.
Output the linear model (1) with $K_{ii}$, $f_{i}={\mathbf {u}}_{i}^{\top }\varvec{\chi }_{0}$ and $g_{i}={\mathbf {v}}_{i}^{\top }\varvec{\chi }_{1}$ being the estimates of the ith singular value, left singular function and right singular function of the Koopman operator.

Please note that this pseudocode is given only for illustrative purposes and cannot be executed literally if ${\mathbf {C}}_{00}$ and ${\mathbf {C}}_{11}$ do not have full rank, i.e., are not invertible. To handle this problem, we ensure that the basis functions are linearly independent by applying a decorrelation (whitening) transformation that ensures that ${\mathbf {C}}_{00}$ and ${\mathbf {C}}_{11}$ will both have full rank. We then add the constant function $\mathbb {1}\left( x\right) \equiv 1$ to the decorrelated basis sets to ensure that $\mathbb {1}$ belongs to the subspaces spanned by $\varvec{\chi }_{0}$ and by $\varvec{\chi }_{1}$. It can be shown that the singular values given by the feature TCCA algorithm with these numerical modifications are bounded by 1, and the first estimated singular component is exactly $(K_{11},f_{1},g_{1})=(1,\mathbb {1},\mathbb {1})$ even in the presence of statistical noise and modeling error—see Appendix F.1 for details.

In the case of $k=\mathrm {dim}(\varvec{\chi }_{0})=\mathrm {dim}(\varvec{\chi }_{1})$ and full rank ${\mathbf {C}}_{00},{\mathbf {C}}_{11}$, the output of the feature TCCA can be equivalently written as

$$\begin{aligned}&{\mathbb {E}}\left[ {\mathbf {V}}^{\top }\varvec{\chi }_{1} \left( {\mathbf {x}}_{t+\tau }\right) \right] ={\mathbf {K}}^{\top }{\mathbb {E}} \left[ {\mathbf {U}}^{\top }\varvec{\chi }_{0} \left( {\mathbf {x}}_{t}\right) \right] \nonumber \\&\quad \Rightarrow {\mathbb {E}}\left[ \varvec{\chi }_{1} \left( {\mathbf {x}}_{t+\tau }\right) \right] ={\mathbf {K}}_{\chi }^{\top }{ \mathbb {E}}\left[ \varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) \right] \end{aligned}$$

(21)

where

$$\begin{aligned} {\mathbf {K}}_{\chi }= & {} {\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{-1}\nonumber \\= & {} {\mathbf {C}}_{00}^{-1}{\mathbf {C}}_{01} \end{aligned}$$

(22)

is equal to the least square solution to the regression problem $\varvec{\chi }_{1}\left( {\mathbf {x}}_{t+\tau }\right) \approx {\mathbf {K}}_{\chi }^{\top }\varvec{\chi }_{0}\left( {\mathbf {x}}_{t}\right) $. Note that if we further assume that $\varvec{\chi }_{0}=\varvec{\chi }_{1}$, (21) is identical to the linear model of EDMD. Thus, the feature TCCA can be seen as a generalization of EDMD that can provide approximate Markov models for different basis $\varvec{\chi }_{0}$ and $\varvec{\chi }_{1}$. More discussion on the relationship between the two methods is provided in Appendix G.

3.2 Nonlinear TCCA: Optimizing the Basis Functions

We now extend feature TCCA to a more flexible representation of the transformation functions ${\mathbf {f}}$ and ${\mathbf {g}}$ by optimizing the basis functions themselves:

$$\begin{aligned} {\mathbf {f}}\left( {\mathbf {x}}\right)&={\mathbf {U}}^{\top }\varvec{\chi }_{0}\left( {\mathbf {x}};{\mathbf {w}}\right) ,\nonumber \\ {\mathbf {g}}\left( {\mathbf {x}}\right)&={\mathbf {V}}^{\top }\varvec{\chi }_{1}\left( {\mathbf {x}};{\mathbf {w}}\right) . \end{aligned}$$

(23)

Here, ${\mathbf {w}}$ represents a set of parameters that determines the form of the basis functions. As a simple example, consider ${\mathbf {w}}$ to represent the mean vectors and covariance matrices of a Gaussian basis set. However, $\varvec{\chi }_{0}\left( {\mathbf {x}};{\mathbf {w}}\right) $ and $\varvec{\chi }_{1}\left( {\mathbf {x}};{\mathbf {w}}\right) $ can also represent very complex and nonlinear learning structures, such as neural networks and decision trees.

The parameters ${\mathbf {w}}$ could conceptually be determined together with the linear expansion coefficients ${\mathbf {U}},{\mathbf {V}}$ by solving (19) with ${\mathbf {C}}_{00}$, ${\mathbf {C}}_{11}$, ${\mathbf {C}}_{01}$ treated as functions of ${\mathbf {w}}$, but this method is not practical due to the nonlinear equality constraints are involved. In practice, we can set k to be $\min \{\mathrm {dim}\left( \varvec{\chi }_{0}\right) ,\mathrm {dim}\left( \varvec{\chi }_{1}\right) \}$, i.e., the largest number of singular components that can be approximated given the basis set. Then the maximal VAMP-r score for a fixed ${\mathbf {w}}$ can be represented as

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}{\mathcal {R}}_{r}=\left\| {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}, \end{aligned}$$

(24)

which can also be interpreted as the sum over the r’th power of all singular values of the projected Koopman operator on subspaces of $\varvec{\chi }_{0},\varvec{\chi }_{1}$ (see Eq. (80) in Appendix E.1 and Eq. (84) in Appendix E.2). Here $\left\| {\mathbf {A}}\right\| _{r}$ denotes the r-Schatten norm of matrix ${\mathbf {A}}$, which is the $\ell ^{r}$ norm of singular values of ${\mathbf {A}}$, and $\left\| {\mathbf {A}}\right\| _{2}$ equals the Frobenius norm of ${\mathbf {A}}$. The parameters ${\mathbf {w}}$ can be optimized without computing ${\mathbf {U}}$ and ${\mathbf {V}}$ explicitly. Using these ideas, nonlinear TCCA can be performed as follows:

1.
Compute ${\mathbf {w}}^{*}=\arg \max _{{\mathbf {w}}}\left\| {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}$ by gradient descent or other nonlinear optimization methods.
2.
Approximate the Koopman singular values and singular functions using the feature TCCA algorithm with basis sets $\varvec{\chi }_{0}\left( {\mathbf {x}};{\mathbf {w}}^{*}\right) $ and $\varvec{\chi }_{1}\left( {\mathbf {x}};{\mathbf {w}}^{*}\right) $.

Unlike the estimated singular components generated by the feature TCCA, the estimation results of the nonlinear TCCA do generally depend on the value of r. (An example is given in Appendix E.3, where the VAMP scores can be analytically computed.) We suggest to set $r=2$ in applications for the direct relationship between the VAMP-2 score and the approximation error of Koopman operators and the convenience of cross-validation (see below). The details of the nonlinear TCCA, including the optimization algorithm and regularization, are beyond the scope of this paper. Appendix F.2 provides a brief description of the implementation, and related work based on kernel methods and deep networks can be found in Andrew et al. (2013) and Mardt et al. (2018).

Example 2

Let us consider the stochastic system described in Example 1 again. We generate 10 simulation trajectories of length 500 and approximate the dominant singular components by the feature TCCA. Here, the basis functions are

$$\begin{aligned} \chi _{0,i}(x)=\chi _{1,i}(x)=1_{\frac{40\cdot (i-1)}{m}-20\le x\le \frac{40\cdot i}{m}-20},\quad \text { for }i=1,\ldots ,m, \end{aligned}$$

(25)

which define a partition of the domain $[-20,20]$ into $m=33$ disjoint intervals. In other words, the approximation is performed based on an MSM with 33 discrete states. Estimation results are given in Fig. 3a, where the discretization errors arising from indicator basis functions are clearly shown. For comparison, we also implement the nonlinear TCCA algorithm with radial basis functions

$$\begin{aligned} \chi _{0,i}(x;w)=\chi _{1,i}(x;w)=\frac{\exp \left( -w\left( x-c_{i} \right) ^{2}\right) }{\sum _{j=1}^{m}\exp \left( -w\left( x-c_{j}\right) ^{2} \right) } \end{aligned}$$

(26)

with smoothing parameter $w\ge 0$, where $c_{i}=\frac{40\cdot (i-0.5)}{m}-20$ for $i=1,\ldots ,m$ are uniformly distributed in $[-20,20]$. Notice that the basis functions given in (25) are a specific case of the radial basis functions with $w=\infty $, and it is therefore possible to achieve better approximation by optimizing w. As can be seen from Fig. 3b, the nonlinear TCCA provides more accurate estimates of singular functions and singular values (see Appendix K.1 for more details). In addition, both feature TCCA and nonlinear TCCA underestimate the dominant singular values as stated by the variational principle.

The nonlinear TCCA is similar to the EDMD with dictionary learning (EDMD-DL) (Li et al. 2017), where the feature transformations are optimized by minimizing the regression error of (1). The major advantages of the nonlinear TCCA over EDMD-DL are: First, the uninformative model with zero regression error can be systematically excluded without any extra constraints on features. Second, the optimization objective is directly related to the approximation error of the Koopman operator (see Sect. 3.3). Some recent methods extend EDMD-DL for modeling Koopman operators of deterministic systems (Takeishi et al. 2017; Lusch et al. 2018; Otto and Rowley 2019), and solve the first problem by using the prediction error between the observed $x_{t}$ and that predicted by the low-dimensional model. But they cannot be applied to stochastic Koopman operators of Markov processes directly.

3.3 Error Analysis

According to (5), both feature TCCA and nonlinear TCCA lead to a rank k approximation

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\sum _{i=1}^{k}K_{ii}\left\langle g,g_{i}\right\rangle _{\rho _{1}}f_{i}=\sum _{i=1}^{k}K_{ii}\left\langle g,{\mathbf {v}}_{i}^{\top }\varvec{\chi }_{1}\right\rangle _{\rho _{1}}{\mathbf {u}}_{i}^{\top }\varvec{\chi }_{0} \end{aligned}$$

(27)

to ${\mathcal {K}}_{\tau }$. We consider here the approximation error of (27) in a general case where ${\mathbf {f}}={\mathbf {U}}^{\top }\varvec{\chi }_{0}$ and ${\mathbf {g}}={\mathbf {V}}^{\top }\varvec{\chi }_{1}$ may not satisfy the orthonormal constraints due to statistical noise and numerical errors. After a few steps of derivation, the approximation error can be expressed as

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}=-{\mathcal {R}}_{E}[{\mathbf {K}},{\mathbf {f}},{\mathbf {g}}]+\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2} \end{aligned}$$

(28)

with

$$\begin{aligned} {\mathcal {R}}_{E}[{\mathbf {K}},{\mathbf {f}},{\mathbf {g}}]=2\sum _{i}K_{ii}\left\langle f_{i},{\mathcal {K}}_{\tau }g_{i}\right\rangle _{\rho _{0}}-\sum _{i,j}K_{ii}K_{jj}\left\langle f_{i},f_{j}\right\rangle _{\rho _{0}}\left\langle g_{i},g_{j}\right\rangle _{\rho _{1}}. \end{aligned}$$

(29)

Remarkably, this error decomposes into a unknown constant part (the square of Hilbert–Schmidt norm of ${\mathcal {K}}_{\tau }$), and a model-dependent part ${\mathcal {R}}_{E}$ that can be entirely estimated from data by its matrix representation:

$$\begin{aligned} {\mathcal {R}}_{E}({\mathbf {K}},{\mathbf {U}},{\mathbf {V}})=\mathrm {tr} \left[ 2{\mathbf {K}}{\mathbf {U}}^{\top }{\mathbf {C}}_{01}{\mathbf {V}}-{ \mathbf {K}}{\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}{\mathbf {K}}{ \mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}\right] . \end{aligned}$$

(30)

${\mathcal {R}}_{E}$, is thus a score that can be used alternatively to the VAMP-r scores, and we call ${\mathcal {R}}_{E}$ VAMP-E score. It can be proved that the maximization of ${\mathcal {R}}_{E}$ is equivalent to maximization of ${\mathcal {R}}_{2}$ in feature TCCA or nonlinear TCCA. However, these scores will behave differently in terms of hyper-parameter optimization (see Sect. 4.1). Proofs and analysis are given in Appendix H.

4 Model Validation

4.1 Cross-Validation for Hyper-parameter Optimization

For a data-driven estimation of dynamical models, either using feature TCCA or nonlinear TCCA, we have to strike a balance between the modeling or discretization error and the statistical or overfitting error. The choice of number and type of basis functions is critical for both. If basis sets are very small and not flexible enough to capture singular functions, the approximation results may be inaccurate with large biases. We can improve the variational score and reduce the modeling error by larger and more flexible basis sets. But too complicated basis sets will produce unstable estimates with large statistical variances, and in particular poor predictions on data that has not been used in the estimation process—this problem is known as overfitting in the machine learning community. A popular way to achieve the balance between the statistical bias and variance are resampling methods, including bootstrap and cross-validation (Friedman et al. 2001). They iteratively fit a model in a training set, which are sampled from the data with or without replacement, and validate the model in the complementary dataset. Alternatively, there are also Bayesian hyper-parameter optimization methods. See Arlot and Celisse (2010) and Snoek et al. (2012) for an overview. Here, we will focus on cross-validation and describe how to use the VAMP scores in this and similar resampling frameworks.

Let $\varvec{\theta }$ be hyper-parameters in feature TCCA or nonlinear TCCA that need to be specified. For example, $\varvec{\theta }$ includes the number and functional form of basis functions used in feature TCCA, or the architecture and connectivity of a neural network used for nonlinear TCCA. Generally speaking, different values of $\varvec{\theta }$ correspond to different dynamical models that we want to rank, and these models may be of completely different types. The cross-validation of $\varvec{\theta }$ can be performed as follows:

1.
Separate the available trajectories into J disjoint folds ${\mathcal {D}}_{1},\ldots ,{\mathcal {D}}_{J}$ with approximately equal size. If there are only a small number of long trajectories, we can divide each trajectory into blocks of length L with $\tau <L\ll T$ and create folds based on the blocks. This defines a number of J training sets, with training set j consisting of all data except the jth fold, ${\mathcal {D}}_{j}^{\mathrm {train}}=\cup _{l\ne j}{\mathcal {D}}_{l}$, and the jth fold used as test set ${\mathcal {D}}_{j}^{\mathrm {test}}={\mathcal {D}}_{j}$.
2.
For each hyper-parameter set $\varvec{\theta }$:
1. (a)
  For $j=1,\ldots ,J:$
  1. i.
    Train on ${\mathcal {D}}_{j}^{\mathrm {train}}$n: training set ${\mathcal {D}}_{j}^{\mathrm {train}}$, construct the best k-dimensional linear model consisting of $({\mathbf {K}},{\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {V}}^{\top }\varvec{\chi }_{1})$ by applying the feature TCCA or nonlinear TCCA with hyper-parameters $\varvec{\theta }$
  2. ii.
    Validate on ${\mathcal {D}}_{\mathrm {test}}$: measure the performance of the estimated singular components by a score
    $$\begin{aligned} \mathrm {CV}_{j}\left( \varvec{\theta }\right) =\mathrm {CV}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) \end{aligned}$$
    (31)
2. (b)
  Compute cross-validation score
  $$\begin{aligned} \mathrm {MCV}\left( \varvec{\theta }\right) = \frac{1}{J}\sum _{j=1}^{J}\mathrm {CV}_{j}\left( \varvec{\theta }\right) \end{aligned}$$
  (32)
3.
Select model/hyper-parameter set with maximal $\mathrm {MCV}\left( \varvec{\theta }\right) $.

The key to the above procedure is how to evaluate the estimated singular components for given test set. It is worth pointing out that we cannot simply define the validation score directly as the VAMP-r score of estimated singular functions for the test data, because the singular functions obtained from training data are usually not orthonormal with respect to the test data.

A feasible way is to utilize the subspace variational score as proposed for reversible Markov processes in McGibbon and Pande (2015). For VAMP-r this score becomes:

$$\begin{aligned} \mathrm {CV}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{ \mathrm {test}}\right)= & {} {\mathcal {R}}_{r}^{\mathrm {space}}\left( { \mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) \nonumber \\= & {} \left\| \left( {\mathbf {U}}^{\top }{\mathbf {C}}_{00}^{ \mathrm {test}}{\mathbf {U}}\right) ^{-\frac{1}{2}}\left( {\mathbf {U}}^{\top }{ \mathbf {C}}_{01}^{\mathrm {test}}{\mathbf {V}}\right) \left( { \mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\mathrm {test}}{\mathbf {V}} \right) ^{-\frac{1}{2}}\right\| _{r}^{r}, \end{aligned}$$

(33)

where ${\mathcal {R}}_{r}^{\mathrm {space}}$ measures the consistency between the singular subspace and the estimated one without the constraint of orthonormality. However, this scheme suffers from the following limitations in practical applications: Firstly, the value of k must be chosen a priori and kept fixed during the cross-validation procedure, which implies that models with a different number of singular components cannot be compared by the validation scores. Secondly, computation of the validation score possibly suffers from numerical instability. (See Appendix I for detailed analysis.)

We suggest in this paper to perform the cross-validation based on the approximation error of Koopman operators. According to conclusions in Sect. 3.3, feature TCCA and VAMP-2 base nonlinear TCCA both maximize the VAMP-E score ${\mathcal {R}}_{E}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {train}}\right) $ for a given training set ${\mathcal {D}}_{\mathrm {train}}$.

Therefore, we can score the performance of estimated singular components on the test set by

$$\begin{aligned} \mathrm {CV}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{ \mathrm {test}}\right) ={\mathcal {R}}_{E}\left( {\mathbf {K}},{\mathbf {U}},{ \mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) . \end{aligned}$$

(34)

In contrast with the validation score (33) deduced from the subspace VAMP-r score, the validation score defined by (34) allows us to choose k according to practical requirements: If we are only interested in a small number of dominant singular components, we can select a fixed value of k. If we want to evaluate the statistical performance of the approximate model consisting of all available estimated singular components as in the EDMD method, we can set $k=\min \{\mathrm {dim}(\varvec{\chi }_{0}),\mathrm {dim}(\varvec{\chi }_{1})\}$. We can even view k as a hyper-parameter and select a suitable rank of the model via cross-validation. Another advantage of the VAMP-E based validation score is that it does not involve any inverse operation of matrices and can be stably computed.

It is worth pointing out that a validation score is proposed Kurebayashi et al. (2016) for cross-validation of kernel DMD based on the analysis of approximation error of transition densities, which has a similar form to that of VAMP-E. The theoretical and empirical comparisons between the two scores will be performed in our future work.

Example 3

We consider here the choice of the basis function number m for the nonlinear TCCA in Example 2. We use five fold cross-validation with the VAMP-E score to compare different values of m. While the average score computed by training sets keeps increasing with m, both the cross-validation score and the exact VAMP-E score achieve their maximum value at $m=33$ as in Example 2 (see Fig. 4a). The optimality can also be demonstrated by comparing Figs. 3b and 4b. A much smaller basis set with $m=13$ yields large errors in the approximation of singular functions. When $m=250$, the estimation of singular functions suffers from overfitting and the estimated singular value is even larger than the true value due to the statistical noise.

4.2 Chapman–Kolmogorov Test for Choice of Lag Times

Besides hyper-parameters mentioned in above, the lag time $\tau $ is also an essential parameter especially for time-continuous Markov processes. If $\tau \rightarrow 0$, the ${\mathcal {K}}_{\tau }$ is usually close to the identity operator and cannot be accurately approximated by a low-rank model, whereas a too high value of $\tau $ can cause the loss of kinetic information in data since ${\mathbb {P}}({\mathbf {x}}_{t+\tau }|{\mathbf {x}}_{t})$ is approximately independent of ${\mathbf {x}}_{t}$ in the case of ergodic processes. However, the variational approach presented in this paper is based on analysis of the approximation error of the Koopman operator for a fixed $\tau $, so we cannot compare models with different lag times and choose $\tau $ by the VAMP scores.

In order to address this problem, the Chapman–Kolmogorov test can be used, which is common in building Markov state models (Prinz et al. 2011). Let us consider the covariance

$$\begin{aligned} \mathrm {cov}(f,g;n\tau )\triangleq & {} \left\langle f,{\mathcal {K}}_{n\tau }g\right\rangle _{\rho _{0}(n\tau )}\nonumber \\= & {} {\mathbb {E}}_{{\mathbf {x}}_{t}\sim \rho _{0}(n\tau )}\left[ f({\mathbf {x}}_{t})g({\mathbf {x}}_{t+n\tau })\right] \end{aligned}$$

(35)

between observables f and g of lag time $n\tau $, which can be estimated from data as

$$\begin{aligned} \mathrm {cov}^{\mathrm {emp}}(f,g;n\tau )=\frac{1}{T-n\tau } \sum _{t=1}^{T-n\tau }f\left( {\mathbf {x}}_{t}\right) g \left( {\mathbf {x}}_{t+n\tau }\right) ^{\top }, \end{aligned}$$

(36)

where $\rho _{0}(n\tau )$ is the empirical distribution of the simulation data excluding $\{x_{t}|t>T-n\tau \}$. If our methods provide an ideal Markov model of lag time $\tau $, the Koopman operator ${\mathcal {K}}_{n\tau }$ can be approximated by $\hat{{\mathcal {K}}}_{\tau }^{n}$, and the covariance can also be predicted as

$$\begin{aligned} \mathrm {cov}^{\mathrm {pred}}(f,g;n\tau )= & {} \left\langle f,\hat{{\mathcal {K}}}_{\tau }^{n}g\right\rangle _{\rho _{0}(n\tau )}\nonumber \\= & {} {\mathbb {E}}_{{\mathbf {x}}_{t}\sim \rho _{0}(n\tau )}\left[ f({\mathbf {x}}_{t})\varvec{\chi }_{0}({\mathbf {x}}_{t})^{\top }\right] \nonumber \\&\cdot {\mathbf {U}}{\mathbf {R}}^{n-1}{\mathbf {K}}{\mathbf {V}}^{\top }\nonumber \\&\cdot {\mathbb {E}}_{{\mathbf {x}}_{t}\sim \rho _{1}}\left[ \varvec{\chi }_{1}({\mathbf {x}}_{t})g({\mathbf {x}}_{t})\right] \end{aligned}$$

(37)

(see Appendix J), where

$$\begin{aligned} {\mathbf {R}}={\mathbf {K}}\cdot {\mathbb {E}}_{{\mathbf {x}}_{t} \sim \rho _{1}}\left[ {\mathbf {g}}({\mathbf {x}}_{t})^{\top }{\mathbf {f}}({ \mathbf {x}}_{t})\right] . \end{aligned}$$

(38)

Therefore, the lag time $\tau $ can be selected according to the following criteria in applications: (i) The lag time is smaller than the timescale that we are interested in. (ii) The equation

$$\begin{aligned} \mathrm {cov}^{\mathrm {pred}}(f,g;n\tau )=\mathrm {cov}^{\mathrm {emp}}(f,g;n\tau ) \end{aligned}$$

(39)

holds approximately for multiple observables f, g and lag times $n\tau $. In this paper, we simply set f, g to be the estimated leading singular functions since they dominate the dynamics of the Markov process.

5 Numerical Examples

5.1 Double-Gyre System

Let’s consider a stochastic double-gyre system defined by:

$$\begin{aligned} \mathrm {d}x_{t}= & {} -\pi \,A\,\sin (\pi \,x_{t})\,\cos (\pi \,y_{t})\,\mathrm {d}t+ \varepsilon \sqrt{x_{t}/4+1}\,\mathrm {d}{\mathbf {W}}_{t,1},\nonumber \\ \mathrm {d}y_{t}= & {} \phantom {-}\pi \,A\,\cos (\pi \,x_{t})\,\sin (\pi \,y_{t})\,\mathrm {d}t+\varepsilon \, \mathrm {d}{\mathbf {W}}_{t,2}, \end{aligned}$$

(40)

where ${\mathbf {W}}_{t,1}$ and ${\mathbf {W}}_{t,2}$ are two independent standard Wiener processes. The dynamics are defined on the domain $[0,2]\times [0,1]$ with reflecting boundary. For $\varepsilon =0$, it can be seen from the flow field depicted in Fig. 5a that there is no transport between the left half and the right half of the domain and both subdomains are invariant sets with measure $\frac{1}{2}$ (Froyland and Padberg 2009; Froyland and Padberg-Gehle 2014). For $\varepsilon >0$, there is a small amount of transport due to diffusion and the subdomains are almost invariant. Here we used the parameters $A=0.25$, $\epsilon =0.1$, and lag time $\tau =2$ in analysis and simulations. The first two nontrivial singular components are shown in Fig. 5c, where the two almost invariant sets are clearly visible in $\psi _{2},\phi _{2}$ and $\psi _{3},\phi _{3}$ are associated with the rotational kinetics within the almost invariant sets.

We generate 10 trajectories of length 4 with step size 0.02, and perform modeling by nonlinear TCCA with basis functions

$$\begin{aligned} \chi _{0,i}(x,y;w)=\chi _{1,i}(x,y;w)=\frac{\exp \left( -w\left\| (x,y)^{\top }-{\mathbf {c}}_{i}\right\| ^{2}\right) }{\sum _{j=1}^{m}\exp \left( -w\left\| (x,y)^{\top }-{\mathbf {c}}_{j}\right\| ^{2}\right) },\quad \text {for }i=1,\ldots ,m \end{aligned}$$

(41)

where ${\mathbf {c}}_{1},\ldots ,{\mathbf {c}}_{m}$ are cluster centers given by k-means algorithm, and the smoothing parameter w is determined via maximizing the VAMP-2 score given in (24) (see Appendix K.2 for more details of numerical computations). The size of basis set $m=37$ is selected by the VAMP-E based cross-validation proposed in 4.1 with 5 folds (see Fig. 5b), and it can be observed from Fig. 5c–e that the leading singular components are accurately estimated. In contrast, as shown in Fig. 5f, g, a much small value of m leads to significant approximation errors of singular components, while for a much larger value, the estimates are obviously influenced by statistical noise. Figure 6 illustrates that the Koopman operator estimated by nonlinear TCCA can successfully predict the long-time evolution of the distribution of the state. The Chapman–Kolmogorov test results are displayed in Fig. 7, which confirm that $\tau =2$ is a suitable choice of the lag time.

5.2 Stochastic Lorenz System

As the last example, we investigate the stochastic Lorenz system which obeys the following stochastic differential equation:

$$\begin{aligned} \mathrm {d}x_{t}= & {} s(y-x)\,\mathrm {d}t+\epsilon x_{t}\,\mathrm {d}{\mathbf {W}}_{t,1},\nonumber \\ \mathrm {d}y_{t}= & {} (rx_{t}-y_{t}-x_{t}z_{t})\,\mathrm {d}t+\varepsilon y_{t}\,\mathrm {d}{\mathbf {W}}_{t,2},\nonumber \\ \mathrm {d}z_{t}= & {} (-bz_{t}+x_{t}y_{t})\,\mathrm {d}t+\varepsilon z_{t}\,\mathrm {d}{\mathbf {W}}_{t,2}, \end{aligned}$$

(42)

with parameters $s=10$, $r=28$ and $b=8/3$. The deterministic Lorenz system with $\epsilon =0$ is known to exhibit chaotic behavior (Sparrow 1982) with a strange attractor characterized by two lobes as illustrated in Fig. 8a. We generate 20 trajectories of length 25 with $\epsilon =0.3$ by using the Euler–Maruyama scheme with step size 0.005, and one of them is shown in Fig. 8b. As stated in Chekroun et al. (2011), all the trajectories move around the deterministic attractor with small random perturbations and switch between the two lobes.

The leading singular components computed from the simulation data by the nonlinear TCCA are summarized in Fig. 8c, where the lag time $\tau =0.75$ is determined via the Chapman–Kolmogorov test (see Fig. 8d), $\varvec{\chi }_{0}=\varvec{\chi }_{1}$ consist of m normalized radial basis functions similar to those used in Sect. 5.1, and the selection of m is also implemented by 5-fold cross-validation. According to the patterns of the singular functions, the stochastic Lorenz system can be coarse-grained into a simplified model which transitions between four macrostates corresponding to inner and outer basins of the two attractor lobes. In particular, the sign boundary of $\psi _{1}$ closely matches that between the almost invariant sets of the Lorenz flow (Froyland and Padberg 2009).

Next, we map the simulation data to a higher dimensional space via the nonlinear transformation $\varvec{\eta }_{t}=\varvec{\eta }(x_{t},y_{t},z_{t})$ defined by

$$\begin{aligned} \begin{array}{lll} \eta _{t}^{1}=\left( \frac{z_{t}}{50}+\frac{1}{2}\right) \cos \left( \frac{\pi x_{t}}{30}+\frac{z_{t}}{50}-1\right) , &{} &{} \eta _{t}^{2}=\left( \frac{z_{t}}{50}+\frac{1}{2}\right) \sin \left( \frac{\pi x_{t}}{30}+\frac{z_{t}}{50}-1\right) ,\\ \eta _{t}^{3}=\left( \frac{z_{t}}{50}+\frac{1}{2}\right) \cos \left( \frac{\pi y_{t}}{30}+\frac{z_{t}}{50}-1\right) , &{} &{} \eta _{t}^{4}=\left( \frac{z_{t}}{50}+\frac{1}{2}\right) \sin \left( \frac{\pi y_{t}}{30}+\frac{z_{t}}{50}-1\right) ,\\ \eta _{t}^{5}=\cos \frac{\pi \left( x_{t}+y_{t}\right) }{40}, &{} &{} \eta _{t}^{6}=\cos \frac{\pi \left( x_{t}-y_{t}\right) }{40}. \end{array} \end{aligned}$$

(43)

Figure 9a plots the transformed points of the illustrative trajectory in Fig. 8b. We utilize the nonlinear TCCA to compute the singular components in the space of $\varvec{\eta }_{t}=(\eta _{t}^{1},\ldots ,\eta _{t}^{6})$ by assuming that the available observable is $\varvec{\eta }_{t}$ instead of $(x_{t},y_{t},z_{t})$, show in Fig. 9b the projections of the singular functions back on the three-dimensional space

$$\begin{aligned} \psi _{i}^{\mathrm {proj}}(x_{t},y_{t},z_{t})=\psi _{i}( \varvec{\eta }(x_{t},y_{t},z_{t})),\quad \phi _{i}^{ \mathrm {proj}}(x_{t},y_{t},z_{t})=\phi _{i}( \varvec{\eta }(x_{t},y_{t},z_{t})), \end{aligned}$$

(44)

and compares the singular values estimated from trajectories of $(x_{t},y_{t},z_{t})$ and $(\eta _{t}^{1},\ldots ,\eta _{t}^{6})$. It can be seen the projected leading singular components are almost the same as those directly computed from the three-dimensional data, which illustrates the transformation invariance of VAMP. Notice it is straightforward to prove that the exact $\psi _{i}^{\mathrm {proj}}$ and $\phi _{i}^{\mathrm {proj}}$ are the solution to the variational problem (12) in the space of $(x_{t},y_{t},z_{t})^{\top }$ if there is an inverse mapping $\varvec{\eta }^{-1}$ with $\varvec{\eta }^{-1}(\varvec{\eta }(x_{t},y_{t},z_{t}))\equiv (x_{t},y_{t},z_{t})$.

6 Conclusion

The linearized coarse-grained models of Markov systems are commonly used in a broad range of fields, such as power systems, fluid mechanics and molecular dynamics. Although the models were developed independently in different communities, the VAMP proposed in this paper provides a general framework for analysis of them, and the modeling accuracy can be quantitatively evaluated by the VAMP-r and VAMP-E scores. Moreover, a set of data-driven methods, including feature TCCA, nonlinear TCCA and VAMP-E based cross-validation, are developed to achieve optimal modeling for given finite model dimensions and finite data sets.

The major challenge in real-world applications of VAMP is how to overcome the curse of dimensionality and solve the variational problem effectively and efficiently for high-dimensional systems. One feasible way of addressing this challenge is to approximate singular components by deep neural networks, which yields the concept of VAMPnet (Mardt et al. 2018). The optimal models can therefore be obtained by deep learning techniques. Another possible way is to utilize tensor decomposition based approximation approaches. Some tensor analysis methods have been presented based on the reversible variational principle and EDMD (Nüske et al. 2016; Klus and Schütte 2015; Klus et al. 2018), and it is worth studying more general variational tensor method within the framework of VAMP in future.

One drawback of the methods developed in this paper is that the resulting models are possibly not valid probabilistic models with nonnegative transition densities if only the operator error is considered, and the probability-preserving modeling method requires further investigations. Moreover, the applications of VAMP to detection of metastable states (Deuflhard and Weber 2005), coherent sets (Froyland and Padberg-Gehle 2014) and dominant cycles (Conrad et al. 2016) will also be explored in nextsteps.

References

Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255 (2013)
Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
MathSciNet MATH Google Scholar
Bollt, E.M., Santitissadeekorn, N.: Applied and Computational Measurable Dynamics. SIAM (2013)
Boninsegna, L., Gobbo, G., Noé, F., Clementi, C.: Investigating molecular kinetics by variationally optimized diffusion maps. J. Chem. Theory Comput. 11, 5947–5960 (2015)
Google Scholar
Bowman, G.R., Pande, V.S., Noé, F. (eds.): An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation. Volume 797 of Advances in Experimental Medicine and Biology. Springer, Heidelberg (2014)
MATH Google Scholar
Brunton, S.L., Brunton, B.W., Proctor, J.L., Kutz, J.N.: Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control. PLoS ONE 11(2), e0150171 (2016a)
Google Scholar
Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. 113(15), 3932–3937 (2016b)
MathSciNet MATH Google Scholar
Chekroun, M.D., Simonnet, E., Ghil, M.: Stochastic climate dynamics: random attractors and time-dependent invariant measures. Physica D Nonlinear Phenom. 240(21), 1685–1700 (2011)
MathSciNet MATH Google Scholar
Chodera, J.D., Noé, F.: Markov state models of biomolecular conformational dynamics. Curr. Opin. Struct. Biol. 25, 135–144 (2014)
Google Scholar
Conrad, N.D., Weber, M., Schütte, C.: Finding dominant structures of nonreversible Markov processes. Multiscale Model. Simul. 14(4), 1319–1340 (2016)
MathSciNet MATH Google Scholar
Dellnitz, M., Froyland, G., Junge, O.: The algorithms behind gaio–set oriented numerical methods for dynamical systems. In: Fiedler, B. (ed.) Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems, pp. 145–174. Springer, Berlin (2001)
MATH Google Scholar
Deuflhard, P., Weber, M.: Robust perron cluster analysis in conformation dynamics. In: Dellnitz, M., Kirkland, S., Neumann, M., Schütte, C. (eds.) Linear Algebra Application, vol. 398C, pp. 161–184. Elsevier, New York (2005)
Google Scholar
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer, New York (2001)
MATH Google Scholar
Froyland, G.: An analytic framework for identifying finite-time coherent sets in time-dependent dynamical systems. Physica D Nonlinear Phenom. 250, 1–19 (2013)
MathSciNet MATH Google Scholar
Froyland, G., Padberg, K.: Almost-invariant sets and invariant manifolds—connecting probabilistic and geometric descriptions of coherent structures in flows. Physica D Nonlinear Phenom. 238(16), 1507–1523 (2009)
MathSciNet MATH Google Scholar
Froyland, G., Padberg-Gehle, K.: Almost-invariant and finite-time coherent sets: directionality, duration, and diffusion. In: Bahsoun, W., Bose, C., Froyland, G. (eds.) Ergodic Theory, Open Dynamics, and Coherent Structures, pp. 171–216. Springer, Berlin (2014)
MATH Google Scholar
Froyland, G., Gottwald, G.A., Hammerlindl, A.: A computational method to extract macroscopic variables and their dynamics in multiscale systems. SIAM J. Appl. Dyn. Syst. 13(4), 1816–1846 (2014)
MathSciNet MATH Google Scholar
Froyland, G., González-Tokman, C., Watson, T.M.: Optimal mixing enhancement by local perturbation. SIAM Rev. 58(3), 494–513 (2016)
MathSciNet MATH Google Scholar
Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
MATH Google Scholar
Harmeling, S., Ziehe, A., Kawanabe, M., Müller, K.-R.: Kernel-based nonlinear blind source separation. Neural Comput. 15(5), 1089–1124 (2003)
MATH Google Scholar
Hsing, T., Eubank, R.: Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley, Amsterdam (2015)
MATH Google Scholar
Klus, S., Schütte, C.: Towards tensor-based methods for the numerical approximation of the perron-frobenius and koopman operator (2015). arXiv:1512.06527
Klus, S., Koltai, P., Schütte, C.: On the numerical approximation of the perron-frobenius and koopman operator (2015). arXiv:1512.05997
Klus, S., Gelß, P., Peitz, S., Schütte, C.: Tensor-based dynamic mode decomposition. Nonlinearity 31(7), 3359 (2018)
MathSciNet MATH Google Scholar
Koltai, P., Wu, H., Noe, F., Schütte, C.: Optimal data-driven estimation of generalized Markov state models for non-equilibrium dynamics. Computation 6(1), 22 (2018)
Google Scholar
Konrad, A., Zhao, B.Y., Joseph, A.D., Ludwig, R.: A Markov-based channel model algorithm for wireless networks. In: Proceedings of the 4th ACM International Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 28–36. ACM (2001)
Koopman, B.O.: Hamiltonian systems and transformations in hilbert space. Proc. Natl. Acad. Sci. U.S.A. 17, 315–318 (1931)
MATH Google Scholar
Korda, M., Mezić, I.: On convergence of extended dynamic mode decomposition to the Koopman operator. J. Nonlinear Sci. 28(2), 687–710 (2018)
MathSciNet MATH Google Scholar
Kurebayashi, W., Shirasaka, S., Nakao, H.: Optimal parameter selection for kernel dynamic mode decomposition. In: Proceedings of the International Symposium NOLTA, volume 370, p. 373 (2016)
Li, Q., Dietrich, F., Bollt, E.M., Kevrekidis, I.G.: Extended dynamic mode decomposition with dictionary learning: a data-driven adaptive spectral decomposition of the Koopman operator. Chaos 27(10), 103111 (2017)
MathSciNet MATH Google Scholar
Lusch, B., Kutz, J.N., Brunton, S.L.: Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9(1), 4950 (2018)
Google Scholar
Ma, Y., Han, J.J., Trivedi, K.S.: Composite performance and availability analysis of wireless communication networks. IEEE Trans. Veh. Technol. 50(5), 1216–1223 (2001)
Google Scholar
Mardt, A., Pasquali, L., Wu, H., Noé, F.: Vampnets for deep learning of molecular kinetics. Nat. Commun. 9(1), 5 (2018)
Google Scholar
Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, vol. 143. Springer, Berlin (1979)
MATH Google Scholar
McGibbon, R.T., Pande, V.S.: Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys. 142, 124105 (2015)
Google Scholar
Mezić, I.: Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dyn. 41, 309–325 (2005)
MathSciNet MATH Google Scholar
Mezić, I.: Analysis of fluid flows via spectral properties of the Koopman operator. Annu. Rev. Fluid Mech. 45, 357–378 (2013)
MathSciNet MATH Google Scholar
Molgedey, L., Schuster, H.G.: Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72, 3634–3637 (1994)
Google Scholar
Noé, F.: Probability distributions of molecular observables computed from Markov models. J. Chem. Phys. 128, 244103 (2008)
Google Scholar
Noé, F., Clementi, C.: Kinetic distance and kinetic maps from molecular dynamics simulation. J. Chem. Theory Comput. 11, 5002–5011 (2015)
Google Scholar
Noé, F., Nüske, F.: A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul. 11, 635–655 (2013)
MathSciNet MATH Google Scholar
Nüske, F., Keller, B.G., Pérez-Hernández, G., Mey, A.S.J.S., Noé, F.: Variational approach to molecular kinetics. J. Chem. Theory Comput. 10, 1739–1752 (2014)
Google Scholar
Nüske, F., Schneider, R., Vitalini, F., Noé, F.: Variational tensor approach for approximating the rare-event kinetics of macromolecular systems. J. Chem. Phys. 144, 054105 (2016)
Google Scholar
Otto, S.E., Rowley, C.W.: Linearly recurrent autoencoder networks for learning dynamics. SIAM J. Appl. Dyn. Syst. 18(1), 558–593 (2019)
MathSciNet MATH Google Scholar
Paul, F., Wu, H., Vossel, M., Groot, B., Noe, F.: Identification of kinetic order parameters for non-equilibrium dynamics. J. Chem. Phys. 150, 164120 (2018)
Google Scholar
Perez-Hernandez, G., Paul, F., Giorgino, T., Fabritiis, G.D., Noé, F.: Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013)
Google Scholar
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge (2007)
MATH Google Scholar
Prinz, J.-H., Wu, H., Sarich, M., Keller, B.G., Senne, M., Held, M., Chodera, J.D., Schütte, C., Noé, F.: Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134, 174105 (2011)
Google Scholar
Renardy, M., Rogers, R.C.: An Introduction to Partial Differential Equations. Springer, New York (2004)
MATH Google Scholar
Rowley, C.W., Mezić, I., Bagheri, S., Schlatter, P., Henningson, D.S.: Spectral analysis of nonlinear flows. J. Fluid Mech. 641, 115 (2009)
MathSciNet MATH Google Scholar
Schmid, P.J.: Dynamic mode decomposition of numerical and experimental data. J. Fluid Mech. 656, 5–28 (2010)
MathSciNet MATH Google Scholar
Schütte, C., Fischer, A., Huisinga, W., Deuflhard, P.: A direct approach to conformational dynamics based on hybrid Monte Carlo. J. Comput. Phys. 151, 146–168 (1999)
MathSciNet MATH Google Scholar
Schwantes, C.R., Pande, V.S.: Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. J. Chem. Theory Comput. 9, 2000–2009 (2013)
Google Scholar
Schwantes, C.R., Pande, V.S.: Modeling molecular kinetics with tica and the kernel trick. J. Chem. Theory Comput. 11, 600–608 (2015)
Google Scholar
Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 30(4), 98–111 (2013)
Google Scholar
Sparrow, C.: The Lorenz Equations: Bifurcations, Chaos, and Strange Attractors. Springer, New York (1982)
MATH Google Scholar
Takeishi, N., Kawahara, Y., Yairi, T.: Learning Koopman invariant subspaces for dynamic mode decomposition. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, pp. 1130–1140 (2017)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)
MathSciNet MATH Google Scholar
Tu, J.H., Rowley, C.W., Luchtenburg, D.M., Brunton, S.L., Kutz, J.N.: On dynamic mode decomposition: theory and applications. J. Comput. Dyn. 1(2), 391–421 (2014)
MathSciNet MATH Google Scholar
Williams, M.O., Kevrekidis, I.G., Rowley, C.W.: A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J. Nonlinear Sci. 25, 1307–1346 (2015a)
MathSciNet MATH Google Scholar
Williams, M.O., Rowley, C.W., Kevrekidis, I.G.: A kernel-based method for data-driven Koopman spectral analysis. J. Comput. Dyn. 2(2), 247–265 (2015b)
MathSciNet MATH Google Scholar
Wu, H., Noé, F.: Gaussian Markov transition models of molecular kinetics. J. Chem. Phys. 142, 084104 (2015)
Google Scholar
Wu, H., Nüske, F., Paul, F., Klus, S., Koltai, P., Noé, F.: Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. J. Chem. Phys. 146, 154104 (2017)
Google Scholar
Ziehe, A., Müller, K.-R.: TDSEP —an efficient algorithm for blind separation using time structure. In: ICANN 98, pp. 675–680. Springer (1998)

Download references

Author information

Authors and Affiliations

School of Mathematical Sciences, Tongji University, 1239 Siping Road, Shanghai, 200092, China
Hao Wu
Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 6, 14195, Berlin, Germany
Hao Wu & Frank Noé
Department of Physics, Freie Universität Berlin, Arnimallee 14, 14195, Berlin, Germany
Frank Noé
Department of Chemistry, Rice University, Houston, TX, 77005, USA
Frank Noé

Authors

Hao Wu
View author publications
You can also search for this author in PubMed Google Scholar
Frank Noé
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Hao Wu or Frank Noé.

Additional information

Communicated by Dr. Paul Newton.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was funded by Deutsche Forschungsgemeinschaft (SFB 1114/A4) and European Research Commission (ERC StG 307494 “pcCell” ).

Appendices

Appendix

For convenience of notation, we denote by $p_{\tau }({\mathbf {x}},{\mathbf {y}})={\mathbb {P}}({\mathbf {x}}_{t+\tau }={\mathbf {y}}|{\mathbf {x}}_{t}={\mathbf {x}})$ the transition density which satisfies

$$\begin{aligned} \int _{A}p_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {y}}={\mathbb {P}}({\mathbf {x}}_{t+\tau }\in A|{\mathbf {x}}_{t}={\mathbf {x}}) \end{aligned}$$

(45)

for every measurable set A, and define the matrix of scalar products:

$$\begin{aligned} \left\langle {\mathbf {a}},{\mathbf {b}}^{\top }\right\rangle _{\rho }= & {} \left[ \left\langle a_{i},b_{j}\right\rangle _{\rho }\right] \in {\mathbb {R}}^{m\times n} \end{aligned}$$

(46)

$$\begin{aligned} {\mathcal {K}}{\mathbf {g}}= & {} ({\mathcal {K}}g_{1},{\mathcal {K}}g_{2},\ldots )^{\top } \end{aligned}$$

(47)

for ${\mathbf {a}}=(a_{1},a_{2},\ldots ,a_{m})^{\top }$, ${\mathbf {b}}=(b_{1},b_{2},\ldots ,b_{n})^{\top }$ and ${\mathbf {g}}=(g_{1},g_{2},\ldots )^{\top }$. In addition, ${\mathcal {N}}(\cdot |c,\sigma ^{2})$ denotes the probability density function of the normal distribution with mean c and variance $\sigma ^{2}$.

Analysis of Koopman Operators

1.1 Definition of Empirical Distributions

We first consider the case where the simulation data consist of S independent trajectories $\{{\mathbf {x}}_{t}^{1}\}_{t=1}^{T},\ldots ,\{{\mathbf {x}}_{t}^{S}\}_{t=1}^{T}$ of length T and the initial state $x_{0}^{s}{\mathop {\sim }\limits ^{\mathrm {iid}}}p_{0}\left( {\mathbf {x}}\right) $. In this case, $\rho _{0}$ and $\rho _{1}$ can be defined by

$$\begin{aligned} \rho _{0}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }{\mathcal {P}}_{t}p_{0}, \quad \rho _{1}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }{\mathcal {P}}_{t+\tau }p_{0}, \end{aligned}$$

(48)

and they satisfy

$$\begin{aligned} \rho _{1}={\mathcal {P}}_{\tau }\rho _{0}, \end{aligned}$$

(49)

where ${\mathcal {P}}_{t}$ denotes the Markov propagator defined in (63). We can then conclude that the estimates of ${\mathbf {C}}_{00},{\mathbf {C}}_{11},{\mathbf {C}}_{01}$ given by (16–18) are unbiased and consistent as$S\rightarrow \infty $.

In more general cases where trajectories $\{{\mathbf {x}}_{t}^{1}\}_{t=1}^{T_{1}},\ldots ,\{{\mathbf {x}}_{t}^{S}\}_{t=1}^{T_{S}}$ are generated with different initial conditions and different lengths, the similar conclusions can be obtained by defining $\rho _{0},\rho _{1}$ as the averages of marginal distributions of $\{{\mathbf {x}}_{t}^{s}|1\le t\le T_{s}-\tau ,1\le s\le S\}$ and $\{{\mathbf {x}}_{t}^{s}|1+\tau \le t\le T_{s},1\le s\le S\}$ respectively.

1.2 Proof of Theorem 1

Because ${\mathcal {K}}_{\tau }$ is a Hilbert–Schmidt operator from ${\mathcal {L}}_{\rho _{1}}^{2}$ to ${\mathcal {L}}_{\rho _{0}}^{2}$, there exists the following SVD of ${\mathcal {K}}_{\tau }$:

$$\begin{aligned} {\mathcal {K}}_{\tau }g=\sum _{i=1}^{\infty }\sigma _{i}\left\langle g,\phi _{i}\right\rangle _{\rho _{1}}\psi _{i}. \end{aligned}$$

(50)

Due to the orthonormality of right singular functions, the projection of any function $g\in {\mathcal {L}}_{\rho _{1}}^{2}$ onto the space spanned by $\{\phi _{1},\ldots ,\phi _{k}\}$ can be written as $\sum _{i=1}^{k}\left\langle g,\phi _{i}\right\rangle _{\rho _{1}}\phi _{i}$. Then $\hat{{\mathcal {K}}}_{\tau }$ defined by (5) is the approximate Koopman operator deduced from model (4), and it is the best rank k approximation to ${\mathcal {K}}_{\tau }$ in Hilbert–Schmidt norm according to the generalized Eckart–Young Theorem (see Theorem 4.4.7 in Hsing and Eubank (2015)).

Since the adjoint operator ${\mathcal {K}}_{\tau }^{*}$ of ${\mathcal {K}}_{\tau }$ satisfies

$$\begin{aligned} \left\langle f,{\mathcal {K}}_{\tau }^{*}\mathbb {1}\right\rangle _{\rho _{1}}= & {} \left\langle {\mathcal {K}}_{\tau }f,\mathbb {1}\right\rangle _{\rho _{0}}\\= & {} \int {\mathbb {E}}[f({\mathbf {x}}_{t+\tau })|{\mathbf {x}}_{t}={\mathbf {x}}]\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\\= & {} \int {\mathbb {E}}[f({\mathbf {x}})]\rho _{1}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\\= & {} \left\langle f,\mathbb {1}\right\rangle _{\rho _{1}} \end{aligned}$$

for all f, we can obtain

$$\begin{aligned} {\mathcal {K}}_{\tau }^{*}\mathbb {1}={\mathcal {K}}_{\tau }\mathbb {1}=\mathbb {1}, \end{aligned}$$

(51)

and conclude from Proposition 2 in Froyland (2013) that $(\sigma _{1},\phi _{1},\psi _{1})=(1,\mathbb {1},\mathbb {1})$.

1.3 Transition Densities Deduced from Koopman Operators

The Koopman operator can also be written as

$$\begin{aligned} {\mathcal {K}}_{\tau }g({\mathbf {x}})=\int p_{\tau }({\mathbf {x}},{\mathbf {y}})g({\mathbf {y}})\mathrm {d}{\mathbf {y}} \end{aligned}$$

(52)

if the transition density is given, which implies that

$$\begin{aligned} {\mathcal {K}}_{\tau }\delta _{{\mathbf {y}}}({\mathbf {x}})=p_{\tau }({\mathbf {x}},{\mathbf {y}}). \end{aligned}$$

(53)

Then the transition density deduced from the approximate Koopman operator $\hat{{\mathcal {K}}}_{\tau }$ defined by (5) is

$$\begin{aligned} {\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})= & {} \hat{{\mathcal {K}}}_{\tau }\delta _{{\mathbf {y}}}({\mathbf {x}})\nonumber \\= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}({\mathbf {x}})\phi _{i}({\mathbf {y}})\rho _{1}({\mathbf {y}}). \end{aligned}$$

(54)

From (52), we can show that

$$\begin{aligned} \left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{i}\left\langle {\mathcal {K}}_{\tau }\phi _{i},{\mathcal {K}}_{\tau }\phi _{i}\right\rangle _{\rho _{0}}\nonumber \\= & {} \int \sum _{i}\left( \int p({\mathbf {x}},{\mathbf {y}})\phi _{i}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) ^{2}\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \int \sum _{i}\left( \int \frac{p({\mathbf {x}},{\mathbf {y}})}{\rho _{1}({\mathbf {y}})}\cdot \phi _{i}({\mathbf {y}})\cdot \rho _{1}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) ^{2}\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \int \left( \int \left( \frac{p({\mathbf {x}},{\mathbf {y}})}{\rho _{1}({\mathbf {y}})}\right) ^{2}\cdot \rho _{1}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) \rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}p({\mathbf {x}},{\mathbf {y}})^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}, \end{aligned}$$

(55)

and

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}=\iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}\left( {\hat{p}}({\mathbf {x}},{\mathbf {y}})-p({\mathbf {x}},{\mathbf {y}})\right) ^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}, \end{aligned}$$

(56)

i.e., the operator error between $\hat{{\mathcal {K}}}_{\tau }$ and ${\mathcal {K}}_{\tau }$ can be represented by the error between ${\hat{p}}_{\tau }$ and $p_{\tau }$.

It is worth pointing out that the approximate transition density in (54) satisfies the normalization constraint with

$$\begin{aligned} \int {\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {y}}= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}({\mathbf {x}})\left\langle \phi _{j},\mathbb {1}\right\rangle _{\rho _{1}}\nonumber \\= & {} \sigma _{1}\psi _{1}({\mathbf {x}})\nonumber \\\equiv & {} 1, \end{aligned}$$

(57)

but ${\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})$ is possibly negative for some ${\mathbf {x}},{\mathbf {y}}$. Thus, the approximate Koopman operators and transition densities are not guaranteed to yield valid probabilistic models, although they can still be utilized to quantitative analysis of Markov processes.

1.4 Sufficient Conditions for Theorem 1

We show here ${\mathcal {L}}_{\rho _{0}}^{2},{\mathcal {L}}_{\rho _{1}}^{2}$ are separable Hilbert spaces and ${\mathcal {K}}_{\tau }:{\mathcal {L}}_{\rho _{1}}^{2}\mapsto {\mathcal {L}}_{\rho _{0}}^{2}$ is Hilbert–Schmidt if one of the following conditions is satisfied:

Condition 1

The state space of the Markov process is a finite set.

Proof

The proof is trivial by considering ${\mathcal {K}}_{\tau }$ is a linear operator between finite-dimensional spaces, and thus omitted. $\square $

Condition 2

The state space of the Markov process is ${\mathbb {R}}^{d}$, $\rho _{0}({\mathbf {x}}),\rho _{1}({\mathbf {y}})$ are positive for all ${\mathbf {x}},{\mathbf {y}}\in {\mathbb {R}}^{d}$, and there exists a constant M so that

$$\begin{aligned} p_{\tau }({\mathbf {x}},{\mathbf {y}})\le M\rho _{1}({\mathbf {y}}),\quad \forall {\mathbf {x}},{\mathbf {y}} \end{aligned}$$

(58)

Proof

Let $\{e_{1},e_{2},\ldots \}$ be a orthonormal basis of ${\mathcal {L}}^{2}({\mathbb {R}}^{d})$. Then ${\mathcal {L}}_{\rho _{0}}^{2},{\mathcal {L}}_{\rho _{1}}^{2}$ are separable because they have the countable orthonormal bases $\{\rho _{0}^{-\frac{1}{2}}e_{1},\rho _{0}^{-\frac{1}{2}}e_{2},\ldots \}$ and $\{\rho _{1}^{-\frac{1}{2}}e_{1},\rho _{1}^{-\frac{1}{2}}e_{2},\ldots \}$.

Now we prove that $\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}<\infty $. Because

$$\begin{aligned} \iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}p_{\tau }({\mathbf {x}},{\mathbf {y}})^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}\le & {} \iint M\rho _{0}({\mathbf {x}})p_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}\nonumber \\= & {} M, \end{aligned}$$

(59)

the operator ${\mathcal {S}}$ defined by

$$\begin{aligned} {\mathcal {S}}f({\mathbf {x}})=\int \sqrt{\frac{\rho _{0}({ \mathbf {x}})}{\rho _{1}({\mathbf {y}})}}p_{\tau }({\mathbf {x}},{ \mathbf {y}})f({\mathbf {y}})\mathrm {d}{\mathbf {y}} \end{aligned}$$

(60)

is a Hilbert–Schmidt integral operator from ${\mathcal {L}}^{2}({\mathbb {R}}^{d})$ to ${\mathcal {L}}^{2}({\mathbb {R}}^{d})$ with $\left\| {\mathcal {S}}\right\| _{\mathrm {HS}}^{2}\le M$ (Renardy and Rogers 2004). Therefore,

$$\begin{aligned} \left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{i}\left\langle {\mathcal {K}}_{\tau }\rho _{1}^{-\frac{1}{2}}e_{i},{\mathcal {K}}_{\tau }\rho _{1}^{-\frac{1}{2}}e_{i}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{i}\left\langle {\mathcal {S}}e_{i},{\mathcal {S}}e_{i}\right\rangle \nonumber \\= & {} \left\| {\mathcal {S}}\right\| _{\mathrm {HS}}^{2}\le M, \end{aligned}$$

(61)

where $\left\langle f,g\right\rangle =\int f({\mathbf {x}})g({\mathbf {x}})\mathrm {d}{\mathbf {x}}$. $\square $

1.5 Koopman Operators of Deterministic Systems

For the completeness of paper, we prove here the following proposition by contradiction: The Koopman operator ${\mathcal {K}}_{\tau }$ of the deterministic system ${\mathbf {x}}_{t+\tau }=F({\mathbf {x}}_{t})$defined by

$$\begin{aligned} {\mathcal {K}}_{\tau }g({\mathbf {x}})=g(F({\mathbf {x}})) \end{aligned}$$

(62)

is not a compact operator from ${\mathcal {L}}_{\rho _{1}}^{2}$ to ${\mathcal {L}}_{\rho _{0}}^{2}$ if ${\mathcal {L}}_{\rho _{1}}^{2}$ is infinite-dimensional.

Assume that ${\mathcal {K}}_{\tau }$ is compact. Then, the SVD (50) of ${\mathcal {K}}_{\tau }$ exists with $\sigma _{i}\rightarrow 0$ as $i\rightarrow \infty $, and there is j so that $0\le \sigma _{j}<1$. This implies $\left\langle {\mathcal {K}}_{\tau }\psi _{j},{\mathcal {K}}_{\tau }\psi _{j}\right\rangle _{\rho _{0}}=\sigma _{j}^{2}<1$. However, according to the definition of the Koopman operator, $\left\langle {\mathcal {K}}_{\tau }\psi _{j},{\mathcal {K}}_{\tau }\psi _{j}\right\rangle _{\rho _{0}}=\left\langle \psi _{j},\psi _{j}\right\rangle _{\rho _{1}}=1$, which leads to a contradiction. We can conclude that ${\mathcal {K}}_{\tau }$ is not compact and hence not Hilbert–Schmidt.

Markov Propagators

The Markov propagator ${\mathcal {P}}_{\tau }$ is defined by

$$\begin{aligned} p_{t+\tau }\left( {\mathbf {x}}\right)= & {} {\mathcal {P}}_{\tau }p_{t}\left( {\mathbf {x}}\right) \nonumber \\\triangleq & {} \int p_{\tau }\left( {\mathbf {y}},{\mathbf {x}}\right) p_{t}\left( {\mathbf {y}}\right) \mathrm {d}{\mathbf {y}}, \end{aligned}$$

(63)

with $p_{t}\left( {\mathbf {x}}\right) ={\mathbb {P}}({\mathbf {x}}_{t}={\mathbf {x}})$ being the probability density of ${\mathbf {x}}_{t}$. According to the SVD of the Koopman operator given in (50), we have

$$\begin{aligned} p_{\tau }\left( {\mathbf {x}},{\mathbf {y}}\right) ={\mathcal {K}}_{\tau }\delta _{{\mathbf {y}}}\left( {\mathbf {x}}\right) =\sum _{i=1}^{\infty }\sigma _{i}\psi _{i}\left( {\mathbf {x}}\right) \phi _{i}\left( {\mathbf {y}}\right) \rho _{1}\left( {\mathbf {y}}\right) . \end{aligned}$$

(64)

Then

$$\begin{aligned} {\mathcal {P}}_{\tau }p_{t}\left( {\mathbf {x}}\right)= & {} \int p_{\tau }\left( {\mathbf {y}},{\mathbf {x}}\right) p_{t}\left( {\mathbf {y}}\right) \mathrm {d}{\mathbf {y}}\nonumber \\= & {} \sum _{i=1}^{\infty }\sigma _{i}\left\langle p_{t},\rho _{0}\psi _{i}\right\rangle _{\rho _{0}^{-1}}\rho _{1}\left( {\mathbf {x}}\right) \phi _{i}\left( {\mathbf {x}}\right) . \end{aligned}$$

(65)

Where the following normalizations were used:

$$\begin{aligned} \left\langle \rho _{0}\psi _{i},\rho _{0}\psi _{j}\right\rangle _{\rho _{0}^{-1}}= & {} \left\langle \psi _{i},\psi _{j}\right\rangle _{\rho _{0}}=1_{i=j}\end{aligned}$$

(66)

$$\begin{aligned} \left\langle \rho _{1}\phi _{i},\rho _{1}\phi _{j}\right\rangle _{\rho _{1}^{-1}}= & {} \left\langle \phi _{i},\phi _{j}\right\rangle _{\rho _{1}}=1_{i=j}, \end{aligned}$$

(67)

The SVD of ${\mathcal {P}}_{\tau }$ can be written as

$$\begin{aligned} {\mathcal {P}}_{\tau }p_{t}=\sum _{i=1}^{\infty }\sigma _{i}\left\langle p_{t},\rho _{0}\psi _{i}\right\rangle _{\rho _{0}^{-1}}\rho _{1}\phi _{i}. \end{aligned}$$

(68)

Proof of the Variational Principle

Notice that ${\mathbf {f}}$ and ${\mathbf {g}}$ can be expressed as

$$\begin{aligned} {\mathbf {f}}={\mathbf {D}}_{0}^{\top } \varvec{\psi },\quad {\mathbf {g}}={\mathbf {D}}_{1}^{\top }\varvec{\phi } \end{aligned}$$

(69)

where $\varvec{\psi }=(\psi _{1},\psi _{2},\ldots )^{\top }$, $\varvec{\phi }=(\phi _{1},\phi _{2},\ldots )^{\top }$ and ${\mathbf {D}}_{0},{\mathbf {D}}_{1}\in {\mathbb {R}}^{\infty \times k}$.

Since

$$\begin{aligned} \left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}= & {} {\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}\end{aligned}$$

(70)

$$\begin{aligned} \left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}= & {} {\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1} \end{aligned}$$

(71)

and

$$\begin{aligned} \left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}= & {} {\mathbf {D}}_{0}^{\top }\left\langle \varvec{\psi },{\mathcal {K}}_{\tau }\varvec{\phi }^{\top }\right\rangle _{\rho _{0}}{\mathbf {D}}_{1}\\= & {} {\mathbf {D}}_{0}^{\top }\left\langle \varvec{\psi },\varvec{\psi }^{\top }\right\rangle _{\rho _{0}}\varvec{\varSigma }{\mathbf {D}}_{1}\\= & {} {\mathbf {D}}_{0}^{\top }\varvec{\varSigma }{\mathbf {D}}_{1}, \end{aligned}$$

the optimization problem can be equivalently written as

$$\begin{aligned} \max _{{\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}={\mathbf {I}},{\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1}={\mathbf {I}}}\sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}, \end{aligned}$$

(72)

where $\varvec{\varSigma }=\mathrm {diag}(\sigma _{1},\sigma _{2},\ldots )$.According to the Cauchy–Schwarz inequality and the conclusion in Section I.3.C of Marshall et al. (1979), we have

$$\begin{aligned} \sum _{i=1}^{k}\left| \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right| \le \sum _{i=1}^{k}\sigma _{i} \end{aligned}$$

(73)

and

$$\begin{aligned} \sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}\le \sum _{i=1}^{k}\left| \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right| ^{r}\le \sum _{i=1}^{k}\sigma _{i}^{r} \end{aligned}$$

(74)

under the constraint ${\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}={\mathbf {I}},{\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1}={\mathbf {I}}$. The variational principle can then be proven by considering

$$\begin{aligned} \sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}=\sum _{i=1}^{k}\sigma _{i}^{r} \end{aligned}$$

(75)

when the first k rows of ${\mathbf {D}}_{0}$ and ${\mathbf {D}}_{1}$ are identity matrix.

Variational Principle of Reversible Markov Processes

The variational principle of reversible Markov processes can be summarized as follows: If the Markov process $\{{\mathbf {x}}_{t}\}$ is time-reversible with respect to stationary distribution $\mu $ and all eigenvalues of ${\mathcal {K}}_{\tau }$ is nonnegative, then

$$\begin{aligned} \sum _{i=1}^{k}\lambda _{i}^{r}&=\max \sum _{i=1}^{k}\left\langle f_{i},{\mathcal {K}}_{\tau }f_{i}\right\rangle _{\mu }^{r}\nonumber \\ s.t.&\left\langle f_{i},f_{j}\right\rangle _{\mu }=1_{i=j} \end{aligned}$$

(76)

for $r\ge 1$ and the maximal value is achieved with $f_{i}=\psi _{i}$, where $\psi _{i}$ denotes the eigenfunction with the ith largest eigenvalue $\lambda _{i}$. The proof is trivial by using variational principle of general Markov processes and considering that the eigendecomposition of ${\mathcal {K}}_{\tau }$ is equivalent to its SVD if $\{{\mathbf {x}}_{t}\}$ is time-reversible and $\rho _{0}=\rho _{1}=\mu $.

Analysis of Estimation Algorithms

1.1 Correctness of Feature TCCA

We show in this appendix that the feature TCCA algorithm described in Sect. 3.1 solves the optimization problem (19).

Let ${\mathbf {U}}^{\prime }={\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}=({\mathbf {u}}_{1}^{\prime },\ldots ,{\mathbf {u}}_{k}^{\prime })$ and ${\mathbf {V}}^{\prime }={\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {V}}=({\mathbf {v}}_{1}^{\prime },\ldots ,{\mathbf {v}}_{k}^{\prime })$, (19) can be equivalently expressed as

$$\begin{aligned} \max _{{\mathbf {U}}^{\prime },{\mathbf {V}}^{\prime }}&\sum _{i=1}^{k}\left( {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right) ^{r}\nonumber \\ \mathrm {s.t.}&{\mathbf {U}}^{\prime \top }{\mathbf {U}}^{\prime }={\mathbf {I}}\nonumber \\&{\mathbf {V}}^{\prime \top }{\mathbf {V}}^{\prime }={\mathbf {I}}. \end{aligned}$$

(77)

According to the Cauchy–Schwarz inequality and the conclusion in Section I.3.C of Marshall et al. (1979), we have

$$\begin{aligned} \sum _{i=1}^{k}\left( {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right) ^{r}\le & {} \sum _{i=1}^{k}\left| {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right| ^{r}\nonumber \\\le & {} \sum _{i=1}^{k}s_{i}^{r} \end{aligned}$$

(78)

under the constraints ${\mathbf {U}}^{\prime \top }{\mathbf {U}}^{\prime }={\mathbf {I}},{\mathbf {V}}^{\prime \top }{\mathbf {V}}^{\prime }={\mathbf {I}}$, where $s_{i}$ is the ith largest singular value of ${\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}$. Considering the equalities hold in the above when ${\mathbf {U}}^{\prime },{\mathbf {V}}^{\prime }$ are the first k left and right singular vectors of ${\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}$, we get

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}&{\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})=\sum _{i=1}^{k}s_{i}^{r}\nonumber \\ \mathrm {s.t.}&{\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}={\mathbf {I}}\nonumber \\&{\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}={\mathbf {I}}, \end{aligned}$$

(79)

and the correctness of the feature TCCA can then be proved.

Furthermore, if $k=\min \{\mathrm {dim}\left( \varvec{\chi }_{0}\right) ,\mathrm {dim}\left( \varvec{\chi }_{1}\right) \}$, we can get

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}{\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})=\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$

(80)

under the orthonormality constraints.

1.2 Feature TCCA of Projected Koopman Operators

Define projection operators

$$\begin{aligned} {\mathcal {Q}}_{\varvec{\chi }_{0}}f\triangleq & {} \mathop {\mathrm{arg\,min}}\limits _{f^{\prime }\in \mathrm {span}\{\chi _{0,1},\chi _{0,2},\ldots \}}\left\langle f^{\prime }-f,f^{\prime }-f\right\rangle _{\rho _{0}}\nonumber \\= & {} \left\langle f,\varvec{\chi }_{0}^{\top }\right\rangle _{\rho _{0}}{\mathbf {C}}_{00}^{-1}\varvec{\chi }_{0},\end{aligned}$$

(81)

$$\begin{aligned} {\mathcal {Q}}_{\varvec{\chi }_{1}}g\triangleq & {} \mathop {\mathrm{arg\,min}}\limits _{g^{\prime }\in \mathrm {span}\{\chi _{1,1},\chi _{1,2},\ldots \}}\left\langle g^{\prime }-g,g^{\prime }-g\right\rangle _{\rho _{1}}\nonumber \\= & {} \left\langle g,\varvec{\chi }_{1}^{\top }\right\rangle _{\rho _{1}}{\mathbf {C}}_{11}^{-1}\varvec{\chi }_{1}, \end{aligned}$$

(82)

and let ${\mathcal {K}}_{\tau }^{\mathrm {proj}}={\mathcal {Q}}_{\varvec{\chi }_{0}}{\mathcal {K}}_{\tau }{\mathcal {Q}}_{\varvec{\chi }_{1}}$ be the projection of the Koopman operator ${\mathcal {K}}_{\tau }$ onto the subspaces of $\varvec{\chi }_{0},\varvec{\chi }_{1}$. Then for any $f={\mathbf {u}}^{\top }\varvec{\chi }_{0}\in \mathrm {span}\{\chi _{0,1},\chi _{0,2},\ldots \}$ and $g={\mathbf {v}}^{\top }\varvec{\chi }_{1}\in \mathrm {span}\{\chi _{1,1},\chi _{1,2},\ldots \}$,

$$\begin{aligned} \left\langle f,{\mathcal {K}}_{\tau }^{\mathrm {proj}}g\right\rangle _{\rho _{0}}= & {} \left\langle g,\varvec{\chi }_{1}^{\top }\right\rangle _{\rho _{1}}{\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}\left\langle \varvec{\chi }_{0},f\right\rangle _{\rho _{0}}\nonumber \\= & {} {\mathbf {u}}^{\top }{\mathbf {C}}_{01}{\mathbf {v}}\nonumber \\= & {} \left\langle f,{\mathcal {K}}_{\tau }g\right\rangle _{\rho _{0}}, \end{aligned}$$

(83)

which implies that Eq. (19) can also be interpreted as the variational problem for the feature TCCA of ${\mathcal {K}}_{\tau }^{\mathrm {proj}}$.

By ignoring the statistical noise, we can conclude from Theorem 2 that the $\{(s_{i},f_{i},g_{i})\}$ provided by the feature TCCA are exactly the singular components of ${\mathcal {K}}_{\tau }^{\mathrm {proj}}$, and the optimality of the estimation result is therefore invariant for any choice of $r\ge 1$. In addition, the sum over the r’th power of all singular values of ${\mathcal {K}}_{\tau }^{\mathrm {proj}}$ is

$$\begin{aligned} \sum _{i}s_{i}^{r}=\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{r}^{r}. \end{aligned}$$

(84)

1.3 An Example of Nonlinear TCCA

Consider a stochastic system

$$\begin{aligned} x_{t+1}=\frac{1}{2}x_{t}+u_{t}, \end{aligned}$$

(85)

where $u_{t}$ is Gaussian white noise with mean zero and variance 1. By setting

$$\begin{aligned} \rho _{0}(x)=\rho _{1}(x)={\mathcal {N}}\left( x|0,\frac{4}{3}\right) \end{aligned}$$

(86)

to be the stationary distribution and basis functions

$$\begin{aligned} \varvec{\chi }_{0}(x)=\varvec{\chi }_{1}(x)=\left( 1,\exp (-wx^{2})-\sqrt{\frac{3}{8w+3}},x\exp (-(1-w^{\frac{1}{10}})x^{2})\right) ^{\top } \end{aligned}$$

(87)

with parameter $w\in [0.01,1]$, we can obtain

$$\begin{aligned} {\mathbf {C}}_{00}={\mathbf {C}}_{11}= & {} \mathrm {diag}\left( 1,\left( \frac{16}{3}w+1\right) ^{-\frac{1}{2}}-\frac{3}{8w+3},4\sqrt{3}\left( -16w^{\frac{1}{10}}+19\right) ^{-\frac{3}{2}}\right) ,\nonumber \\ {\mathbf {C}}_{01}= & {} \mathrm {diag}\Bigg (1,\left( \frac{16}{3}w^{2}+\frac{16}{3}w+1\right) ^{-\frac{1}{2}}-\frac{3}{8w+3},\nonumber \\&\quad \quad \quad 2\sqrt{3}\left( 16(1-w^{\frac{1}{10}})^{2}-16w^{\frac{1}{10}}+19\right) ^{-\frac{3}{2}}\Bigg ) \end{aligned}$$

(88)

The maximal VAMP-r score for a given w can then be analytically computed by

$$\begin{aligned} {\mathcal {R}}_{r}(w)=\mathrm {tr}\left[ \left( {\mathbf {C}}_{00}(w)^{-1}{\mathbf {C}}_{01}(w)\right) ^{r}\right] \end{aligned}$$

(89)

according to (24). We evaluate ${\mathcal {R}}_{r}(w)$ at 9901 equally spaced points of w in the interval [0.01, 1] for $r=1,2$, and the maximal values of ${\mathcal {R}}_{1},{\mathcal {R}}_{2}$ are achieved at $w=0.3157$ and $w=0.7069$ respectively.

Implementation of Estimation Algorithms

1.1 Decorrelation of Basis Functions

For convenience of notation, here we define

$$\begin{aligned} {\mathbf {X}}= & {} \left( \varvec{\chi }_{0}({\mathbf {x}}_{1}),\ldots ,\varvec{\chi }_{0}({\mathbf {x}}_{T-\tau })\right) ^{\top } \end{aligned}$$

(90)

$$\begin{aligned} {\mathbf {Y}}= & {} \left( \varvec{\chi }_{1}({\mathbf {x}}_{1+\tau }),\ldots ,\varvec{\chi }_{0}({\mathbf {x}}_{T})\right) ^{\top }. \end{aligned}$$

(91)

In this paper, we utilize principal component analysis (PCA) to explicitly reduce correlations between basis functions as follows: First, we compute the empirical means of basis functions and the covariance matrices of mean-centered basis functions:

$$\begin{aligned} \varvec{\pi }_{0}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {1}} \end{aligned}$$

(92)

$$\begin{aligned} \varvec{\pi }_{1}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {1}} \end{aligned}$$

(93)

$$\begin{aligned} \mathrm {COV}_{0}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {X}}-\varvec{\pi }_{0}\varvec{\pi }_{0}^{\top } \end{aligned}$$

(94)

$$\begin{aligned} \mathrm {COV}_{1}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {Y}}-\varvec{\pi }_{1}\varvec{\pi }_{1}^{\top }. \end{aligned}$$

(95)

Next, perform the truncated eigen decomposition of the covariance matrices as

$$\begin{aligned} \mathrm {COV}_{0}\approx & {} {\mathbf {Q}}_{0,d}^{\top }{\mathbf {S}}_{0,d}{\mathbf {Q}}_{0,d} \end{aligned}$$

(96)

$$\begin{aligned} \mathrm {COV}_{1}\approx & {} {\mathbf {Q}}_{1,d}^{\top }{\mathbf {S}}_{1,d}{\mathbf {Q}}_{1,d}, \end{aligned}$$

(97)

where the diagonal of matrices ${\mathbf {S}}_{0,d},{\mathbf {S}}_{1,d}$ contain all positive eigenvalues that are larger than $\epsilon _{0}$ and absolute values of all negative eigenvalues ($\epsilon _{0}=10^{-10}$ in our applications). Last, the new basis functions are given by

$$\begin{aligned} \varvec{\chi }_{0}^{\mathrm {new}}=\left[ \begin{array}{c} {\mathbf {Q}}_{0,d}^{\top }{\mathbf {S}}_{0,d}^{\frac{1}{2}}\left( \varvec{\chi }_{0}-\varvec{\pi }_{0}\right) \\ \mathbb {1} \end{array}\right] ,\quad \varvec{\chi }_{1}^{\mathrm {new}}=\left[ \begin{array}{c} {\mathbf {Q}}_{1,d}^{\top }{\mathbf {S}}_{1,d}^{\frac{1}{2}}\left( \varvec{\chi }_{1}-\varvec{\pi }_{1}\right) \\ \mathbb {1} \end{array}\right] \end{aligned}$$

(98)

We denote the transformation (98) by

$$\begin{aligned} \varvec{\chi }_{0}^{\mathrm {new}},\varvec{\chi }_{1}^{\mathrm {new}}=\mathrm {DC}\left[ \varvec{\chi }_{0},\varvec{\chi }_{1}|\varvec{\pi }_{0},\varvec{\pi }_{1},\mathrm {COV}_{0},\mathrm {COV}_{1}\right] \end{aligned}$$

(99)

Then the feature TCCA algorithm with decorrelation of basis functions can be summarized as:

1.
Compute $\varvec{\pi }_{0},\varvec{\pi }_{1}$ and $\mathrm {COV}_{0},\mathrm {COV}_{1}$ by (92–95).
2.
Let $\varvec{\chi }_{0},\varvec{\chi }_{1}:=\mathrm {DC}\left[ \varvec{\chi }_{0},\varvec{\chi }_{1}|\varvec{\pi }_{0},\varvec{\pi }_{1},\mathrm {COV}_{0},\mathrm {COV}_{1}\right] $, and recalculate ${\mathbf {X}}$ and ${\mathbf {Y}}$ according to the new basis functions.
3.
Compute covariance matrices ${\mathbf {C}}_{00},{\mathbf {C}}_{01},{\mathbf {C}}_{11}$ by
$$\begin{aligned} {\mathbf {C}}_{00}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {X}}\\ {\mathbf {C}}_{01}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {Y}}\\ {\mathbf {C}}_{11}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {Y}} \end{aligned}$$
4.
Perform the truncated SVD ${\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}={\mathbf {U}}_{k}^{\prime }\hat{\varvec{\varSigma }}_{k}{\mathbf {V}}_{k}^{\prime \top }$.
5.
Output estimated singular components $\hat{\varvec{\varSigma }}_{k}=\mathrm {diag}({\hat{\sigma }}_{1},\ldots ,{\hat{\sigma }}_{k})$, ${\mathbf {U}}_{k}^{\top }\varvec{\chi }_{0}=({\hat{\psi }}_{1},\ldots ,{\hat{\psi }}_{k})^{\top }$ and ${\mathbf {V}}_{k}^{\top }\varvec{\chi }_{1}=({\hat{\phi }}_{1},\ldots ,{\hat{\phi }}_{k})^{\top }$ with ${\mathbf {U}}_{k}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {U}}_{k}^{\prime }$ and ${\mathbf {V}}_{k}={\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {V}}_{k}^{\prime }$.

Notice that the estimated ${\mathbf {C}}_{00}$, ${\mathbf {C}}_{01}$ and ${\mathbf {C}}_{11}$ in the above algorithm satisfy

$$\begin{aligned} \left[ \begin{array}{cc} {\mathbf {C}}_{00} &{} {\mathbf {C}}_{01}\\ {\mathbf {C}}_{01}^{\top } &{} {\mathbf {C}}_{11} \end{array}\right]= & {} \frac{1}{T-\tau }\left[ \begin{array}{cc} {\mathbf {X}}^{\top }{\mathbf {X}} &{} {\mathbf {X}}^{\top }{\mathbf {Y}}\\ {\mathbf {Y}}^{\top }{\mathbf {X}} &{} {\mathbf {Y}}^{\top }{\mathbf {Y}} \end{array}\right] \nonumber \\= & {} \frac{1}{T-\tau }\left( {\mathbf {X}},{\mathbf {Y}}\right) ^{\top }\left( {\mathbf {X}},{\mathbf {Y}}\right) \nonumber \\\succeq & {} 0 \end{aligned}$$

(100)

where ${\mathbf {C}}\succeq 0$ means ${\mathbf {C}}$ is a positive semi-definite matrix. According to the Schur complement lemma, we have

$$\begin{aligned}&{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top } \preceq {\mathbf {C}}_{00}\nonumber \\&\quad \Rightarrow \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) ^{\top } \preceq {\mathbf {I}} \end{aligned}$$

(101)

where ${\mathbf {I}}$ denotes an identity matrix of appropriate size. So the estimated $\sigma _{1}\le 1$.

Furthermore, since ${\mathbf {v}}_{0}^{\top }\varvec{\chi }_{0}={\mathbf {v}}_{1}^{\top }\varvec{\chi }_{1}=\mathbb {1}$ for ${\mathbf {v}}_{0}=(0,\ldots ,0,1)^{\top }$ and ${\mathbf {v}}_{1}=(0,\ldots ,0,1)^{\top }$,

$$\begin{aligned} \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) ^{\top }{\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {v}}_{0}= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}\left( {\mathbf {X}}^{\top }{\mathbf {X}}\right) ^{-1}{\mathbf {X}}^{\top }{\mathbf {Y}}\left( {\mathbf {Y}}^{\top }{\mathbf {Y}}\right) ^{-1}{\mathbf {Y}}^{\top }{\mathbf {X}}{\mathbf {v}}_{0}\nonumber \\= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {X}}^{+}{\mathbf {Y}}{\mathbf {Y}}^{+}{\mathbf {1}}\nonumber \\= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {v}}_{0} \end{aligned}$$

(102)

which implies that 1 is the largest singular value of ${\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}$.

1.2 Parameter Optimization in Nonlinear TCCA

The optimization problem

$$\begin{aligned} \max _{{\mathbf {w}}}{\mathcal {R}}_{r}({\mathbf {w}})=\left\| {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$

(103)

can be solved by direct search as in our examples (see Appendix K.1). But for a high-dimensional parameter vector ${\mathbf {w}}$, it is more efficient to perform the optimization by the gradient descent method in the form of

$$\begin{aligned} {\mathbf {w}}\leftarrow {\mathbf {w}}+\eta \frac{\partial {\mathcal {R}}_{r}({\mathbf {w}})}{\partial {\mathbf {w}}}, \end{aligned}$$

(104)

where $\eta $ is the step size. When $r=2$, the gradient of ${\mathcal {R}}_{r}$ with respect to an element $w_{i}$ in ${\mathbf {w}}$ can be written as

$$\begin{aligned} \frac{\partial {\mathcal {R}}_{r}}{\partial w_{i}}= & {} \frac{2}{T-\tau }\mathrm {tr}\left[ {\mathbf {C}}_{00}^{-1}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}\left( {\mathbf {Y}}-{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}{\mathbf {X}}\right) \left( \frac{\partial {\mathbf {X}}}{\partial w_{i}}\right) ^{\top }\right] \nonumber \\&+\frac{2}{T-\tau }\mathrm {tr}\left[ {\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}\left( {\mathbf {X}}-{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}{\mathbf {Y}}\right) \left( \frac{\partial {\mathbf {Y}}}{\partial w_{i}}\right) ^{\top }\right] , \end{aligned}$$

(105)

where ${\mathbf {X}},{\mathbf {Y}}$ have the same definitions as in Appendix F.1. If the data size is too large, we can approximate the gradient based on a random subset of data in each iteration, and update ${\mathbf {w}}$ in a stochastic gradient descent manner (Andrew et al. 2013; Mardt et al. 2018).

Like feature TCCA, the nonlinear TCCA also suffers from the numerical singularity when ${\mathbf {C}}_{00}$ or ${\mathbf {C}}_{11}$ is not full rank. This problem can be addressed by the decorrelation of basis functions when performing direct search. For the gradient descent method (or stochastic gradient descent method), we can replace the objective function ${\mathcal {R}}_{r}({\mathbf {w}})$ by a regularized one

$$\begin{aligned} {\mathcal {R}}_{r}({\mathbf {w}};\epsilon )=\left\| \left( {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) +\epsilon { \mathbf {I}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) \left( {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) + \epsilon {\mathbf {I}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}, \end{aligned}$$

(106)

where $\epsilon >0$ is a hyper-parameter and can be selected by the cross-validation.

Relationship Between VAMP and EDMD

The proof of (21) is trivial. Here, we only show that the eigenvalue problem of $\hat{{\mathcal {K}}}_{\tau }$ given by the feature TCCA is equivalent to that of matrix ${\mathbf {K}}_{\chi }$ as

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\lambda g\Longleftrightarrow {\mathbf {K}}_{\chi }{\mathbf {b}}=\lambda {\mathbf {b}}\text { with }g={\mathbf {b}}^{\top }\varvec{\chi } \end{aligned}$$

(107)

under the assumption that $\varvec{\chi }_{0}=\varvec{\chi }_{1}=\varvec{\chi }$ and ${\mathbf {C}}_{00}$ is invertible, which is consistent with the spectral approximation theory in EDMD. First, if g and $\lambda $ satisfy ${\mathcal {K}}_{\tau }g=\lambda g$, there must exist vector ${\mathbf {b}}$ so that $g={\mathbf {b}}^{\top }\varvec{\chi }$. Then

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g= & {} \lambda g\nonumber \\ \Rightarrow {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }\varvec{\chi }= & {} \lambda {\mathbf {b}}^{\top }\varvec{\chi }\nonumber \\ \Rightarrow {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }{\mathbf {C}}_{00}= & {} \lambda {\mathbf {b}}^{\top }{\mathbf {C}}_{00}\nonumber \\ \Rightarrow {\mathbf {K}}_{\chi }{\mathbf {b}}= & {} \lambda {\mathbf {b}}. \end{aligned}$$

(108)

Second, if ${\mathbf {K}}_{\chi }{\mathbf {b}}=\lambda {\mathbf {b}}$,

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }{\mathbf {b}}^{\top }\varvec{\chi }= & {} {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }\varvec{\chi }\nonumber \\= & {} \lambda {\mathbf {b}}^{\top }\varvec{\chi }. \end{aligned}$$

(109)

Analysis of the VAMP-E Score

1.1 Proof of (28)

Here we define

$$\begin{aligned} {\mathbf {C}}_{ff}= & {} \left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}={\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}, \end{aligned}$$

(110)

$$\begin{aligned} {\mathbf {C}}_{gg}= & {} \left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}={\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}, \end{aligned}$$

(111)

$$\begin{aligned} {\mathbf {C}}_{fg}= & {} \left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}={\mathbf {U}}^{\top }{\mathbf {C}}_{01}{\mathbf {V}}. \end{aligned}$$

(112)

Considering $\{\phi _{i}\}$ is an orthonormal basis of ${\mathcal {L}}_{\rho _{1}}^{2}$, we have

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{j}\left\langle \hat{{\mathcal {K}}}_{\tau }\phi _{j},\hat{{\mathcal {K}}}_{\tau }\phi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}{\mathbf {f}},{\mathbf {f}}^{\top }{\mathbf {K}}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\sum _{j}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle \sum _{j}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{ff}{\mathbf {K}}{\mathbf {C}}_{gg}\right] \end{aligned}$$

(113)

and

$$\begin{aligned} \left\langle \hat{{\mathcal {K}}}_{\tau },{\mathcal {K}}_{\tau }\right\rangle _{\mathrm {HS}}= & {} \sum _{j}\left\langle \hat{{\mathcal {K}}}_{\tau }\phi _{j},{\mathcal {K}}_{\tau }\phi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {S}}{\mathbf {f}},\sigma _{j}\psi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\sigma _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {S}}\left\langle {\mathbf {f}},\psi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\sum _{j}\sigma _{j}\left\langle {\mathbf {f}},\psi _{j}\right\rangle _{\rho _{0}}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},\sum _{j}\sigma _{j}\psi _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right\rangle _{\rho _{0}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{fg}\right] , \end{aligned}$$

(114)

where $\left\langle \cdot ,\cdot \right\rangle _{\mathrm {HS}}$ denotes the Hilbert–Schmidt inner product of operators. Then, according to the definition of Hilbert–Schmidt norm,

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \left\| \hat{{\mathcal {K}}}_{\tau }\right\| _{\mathrm {HS}}^{2}-2\sum _{j}\left\langle \hat{{\mathcal {K}}}_{\tau },{\mathcal {K}}_{\tau }\right\rangle _{\mathrm {HS}}+\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{ff}{\mathbf {K}}{\mathbf {C}}_{gg}-2{\mathbf {K}}{\mathbf {C}}_{fg}\right] +\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2} \end{aligned}$$

(115)

1.2 Relationship Between VAMP-2 and VAMP-E

We first show that the feature TCCA algorithm maximizes VAMP-E. Notice that

$$\begin{aligned} {\mathcal {R}}_{E}({\mathbf {K}},{\mathbf {U}},{\mathbf {V}})= & {} \mathrm {tr}\left[ 2\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) ^{\top }\left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \right. \nonumber \\&\left. -\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) ^{\top }\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) \right] \nonumber \\= & {} -\left\| {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}\nonumber \\= & {} -\left\| {\mathbf {U}}^{\prime }{\mathbf {K}}{\mathbf {V}}^{\prime \top }-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}, \end{aligned}$$

(116)

where $\left\| \cdot \right\| _{F}$ denotes the Frobenius norm and ${\mathbf {U}}^{\prime }={\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}$, ${\mathbf {V}}^{\prime }={\mathbf {C}}_{11}^{\frac{1}{2}}{\mathbf {V}}$. It can be seen that the feature TCCA algorithm maximizes the first term on the right-hand side of (116) and therefore maximizes VAMP-E.

For the optimal model generated by the nonlinear TCCA, the first term on the right-hand side of (116) is equal to zero, and the second term is maximized as a function of ${\mathbf {w}}$. Thus, the nonlinear TCCA also maximizes VAMP-E.

In addition, for ${\mathbf {K}},{\mathbf {U}},{\mathbf {V}}$ provided by both feature TCCA and nonlinear TCCA,

$$\begin{aligned} {\mathcal {R}}_{E}({\mathbf {K}},{\mathbf {U}},{\mathbf {V}})= & {} -\left\| {\mathbf {U}}^{\prime }{\mathbf {K}}{\mathbf {V}}^{\prime \top }-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}\nonumber \\= & {} -\sum _{i=k+1}^{\min \{m,n\}}K_{ii}^{2}+\sum _{i=1}^{\min \{m,n\}}K_{ii}^{2}\nonumber \\= & {} \sum _{i=1}^{k}K_{ii}^{2}\nonumber \\= & {} {\mathcal {R}}_{2}({\mathbf {U}},{\mathbf {V}}). \end{aligned}$$

(117)

Subspace Variational Principle

The variational principle proposed in Sect. 2.2 can be further extended to singular subspaces of the Koopman operator as follows:

$$\begin{aligned} \sum _{i=1}^{k}\sigma _{i}^{r}\ge {\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] =\left\| {\mathbf {C}}_{ff}^{-\frac{1}{2}}{\mathbf {C}}_{fg}{\mathbf {C}}_{gg}^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$

(118)

for $r\ge 1$, and the equality holds if $\mathrm {span}\{\psi _{1},\ldots ,\psi _{k}\}=\mathrm {span}\{f_{1},\ldots ,f_{k}\}$ and $\mathrm {span}\{\phi _{1},\ldots ,\phi _{k}\}=\mathrm {span}\{g_{1},\ldots ,g_{k}\}$, where ${\mathbf {C}}_{ff}=\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}$, ${\mathbf {C}}_{fg}=\left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}$ and ${\mathbf {C}}_{gg}=\left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}$. This statement can be proven by implementing the feature TCCA algorithm with feature functions ${\mathbf {f}}$ and ${\mathbf {g}}$.

The ${\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] $ is a relaxation of VAMP-r, which measures the consistency between the subspaces spanned by ${\mathbf {f}},{\mathbf {g}}$ and the dominant singular spaces, and we call it the subspace VAMP-r score. ${\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] $ is invariant with respect to the invertible linear transformations of ${\mathbf {f}}$ and ${\mathbf {g}}$, i.e., ${\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] ={\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {A}}_{f}{\mathbf {f}},{\mathbf {A}}_{g}{\mathbf {g}}\right] $ for any invertible matrices ${\mathbf {A}}_{f},{\mathbf {A}}_{g}$.

In the cross-validation for feature TCCA, we can utilize ${\mathcal {R}}_{r}^{\mathrm {space}}$ to calculate the validation score by

$$\begin{aligned} \mathrm {CV}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right)= & {} {\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) \nonumber \\= & {} {\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {V}}^{\top }\varvec{\chi }_{1}|{\mathcal {D}}_{\mathrm {test}}\right] \nonumber \\= & {} \left\| \left( {\mathbf {U}}^{\top }{\mathbf {C}}_{00}^{\mathrm {test}}{\mathbf {U}}\right) ^{-\frac{1}{2}}\left( {\mathbf {U}}^{\top }{\mathbf {C}}_{01}^{\mathrm {test}}{\mathbf {V}}\right) \left( {\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\mathrm {test}}{\mathbf {V}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}.\nonumber \\ \end{aligned}$$

(119)

We now analyze the difficulties of applying ${\mathcal {R}}_{r}^{\mathrm {space}}$ to the cross-validation. First, for given basis functions $\varvec{\chi }_{0},\varvec{\chi }_{1}$, ${\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) $ is monotonically increasing with respect to k and

$$\begin{aligned} {\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}}_{k},{\mathbf {V}}_{k}|{\mathcal {D}}_{\mathrm {test}}\right) =\left\| \left( {\mathbf {C}}_{00}^{\mathrm {test}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}^{\mathrm {test}}\left( {\mathbf {C}}_{11}^{\mathrm {test}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$

(120)

is independent of the estimated singular components if $k=\max \{\mathrm {dim}(\varvec{\chi }_{0}),\mathrm {dim}(\varvec{\chi }_{1})\}$. Therefore, k is a new hyper-parameter that cannot be determined by the cross-validation. Second, for training set, ${\mathbf {U}}_{k}^{\top }{\mathbf {C}}_{00}^{\mathrm {train}}{\mathbf {U}}_{k}={\mathbf {V}}_{k}^{\top }{\mathbf {C}}_{11}^{\mathrm {train}}{\mathbf {V}}_{k}={\mathbf {I}}$. But for test set, ${\mathbf {U}}_{k}^{\top }{\mathbf {C}}_{00}^{\mathrm {test}}{\mathbf {U}}_{k}$ and ${\mathbf {V}}_{k}^{\top }{\mathbf {C}}_{11}^{\mathrm {test}}{\mathbf {V}}_{k}$ are possibly singular and the validation score cannot be reliably computed.

Computation of $\hat{{\mathcal {K}}}_{\tau }^{n}$

The approximate Koopman operator in the form of (27) can also be written as

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\left\langle g,{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}{\mathbf {f}}. \end{aligned}$$

(121)

Hence,

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }^{n}g=\left\langle g,{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}\left( {\mathbf {R}}^{n-1}\right) ^{\top }{\mathbf {f}}, \end{aligned}$$

(122)

and we have

$$\begin{aligned} \left\langle f,\hat{{\mathcal {K}}}_{\tau }^{n}g\right\rangle _{\rho _{0}(n\tau )}=\left\langle f,{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}(n\tau )}{\mathbf {R}}^{n-1}{\mathbf {K}}\left\langle {\mathbf {g}},g\right\rangle _{\rho _{1}} \end{aligned}$$

(123)

and

$$\begin{aligned} {\hat{p}}_{n\tau }({\mathbf {x}},{\mathbf {y}})= & {} \hat{{\mathcal {K}}}_{\tau }^{n}\delta _{{\mathbf {y}}}({\mathbf {x}})\nonumber \\= & {} {\mathbf {f}}({\mathbf {x}})^{\top }{\mathbf {R}}^{n-1}{\mathbf {K}}{\mathbf {g}}({\mathbf {y}})\rho _{1}({\mathbf {y}}), \end{aligned}$$

(124)

where

$$\begin{aligned} {\mathbf {R}}={\mathbf {K}}\left\langle {\mathbf {g}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{1}}. \end{aligned}$$

(125)

Notice that substituting ${\mathbf {f}}={\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {g}}={\mathbf {V}}^{\top }\varvec{\chi }_{1}$ into (123) yields (37).

Details of Numerical Examples

1.1 One-Dimensional System

For convenience of analysis and computation, we partition the state space $[-20,20]$ into 2000 bins $S_{1},\ldots ,S_{2000}$ uniformly, and discretize the one-dimensional dynamical system described in Example 1 as

$$\begin{aligned} {\mathbb {P}}(x_{t+1}\in S_{j}|x_{t}\in S_{i})\propto {\mathcal {N}}\left( s_{j}|\frac{s_{i}}{2}+\frac{7s_{i}}{1+0.12s_{i}^{2}}+6\cos s_{i},10\right) , \end{aligned}$$

(126)

where $s_{i}$ is the center of the bin $S_{i}$, and the local distribution of $x_{t}$ within any bin is always uniform distribution. All numerical computations and simulations in Examples 1, 2 and 3 are based on (126), and the initial state $x_{0}$ is distributed according to the stationary distribution $\rho _{0}=\rho _{1}=\mu $.

In Example 1, the stationary distribution and singular components of the Koopman operator are analytically computed by the feature TCCA with basis functions $\chi _{0,i}(x)=\chi _{1,i}(x)=1_{x\in S_{i}}$ as follows:

1.
Compute the transition matrix ${\mathbf {P}}=[P_{ij}]=[{\mathbb {P}}(x_{t+1}\in S_{j}|x_{t}\in S_{i})]$ and the stationary vector $\varvec{\pi }=[\pi _{i}]$ satisfying
$$\begin{aligned} \varvec{\pi }^{\top }{\mathbf {P}}=\varvec{\pi }^{\top },\quad \sum _{i}\pi _{i}=1. \end{aligned}$$
2.
Compute covariance matrices ${\mathbf {C}}_{00}={\mathbf {C}}_{11}=\mathrm {diag}(\varvec{\pi })$ and ${\mathbf {C}}_{01}=\mathrm {diag}(\varvec{\pi }){\mathbf {P}}$.
3.
Perform the SVD
$$\begin{aligned} \bar{{\mathbf {K}}}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{ \mathbf {C}}_{11}^{-\frac{1}{2}}={\mathbf {U}}^{\prime }{\mathbf {K}}{ \mathbf {V}}^{\prime \top } \end{aligned}$$
with ${\mathbf {K}}=\mathrm {diag}(\sigma _{1},\ldots ,\sigma _{2000})$ and $\sigma _{1}\ge \sigma _{2}\ge \ldots \ge \sigma _{2000}$.
4.
Compute ${\mathbf {U}}=[U_{ij}]={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {U}}^{\prime }$ and ${\mathbf {V}}=[V_{ij}]={\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {V}}^{\prime }$.
5.
Output the stationary distribution $\mu (x)=\sum _{i}50\pi _{i}\cdot 1_{x\in S_{i}}$ and singular components $(\sigma _{i},\psi _{i}(x),\phi _{i}(x))=(\sigma _{i},\sum _{j}U_{ji}\cdot 1_{x\in S_{j}},\sum _{j}V_{ji}\cdot 1_{x\in S_{j}})$.

The transition density of the projected Koopman operator $\hat{{\mathcal {K}}}_{\tau }=\sum _{i=1}^{k}\sigma _{i}\left\langle \cdot ,\phi _{i}\right\rangle _{\rho _{1}}\psi _{i}$ is obtained by

$$\begin{aligned} {\hat{p}}_{\tau }(x,y)= & {} \hat{{\mathcal {K}}}_{\tau }\delta _{y}(x)\nonumber \\= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}(x)\phi _{i}(y)\mu (y) \end{aligned}$$

(127)

(see Appendix A.3) and the corresponding approximate transition matrix is

$$\begin{aligned} \hat{{\mathbf {P}}}={\mathbf {U}}_{k}^{\top }{\mathbf {K}}_{k}{ \mathbf {V}}_{k}\mathrm {diag}(\varvec{\pi }), \end{aligned}$$

(128)

where ${\mathbf {U}}_{k},{\mathbf {V}}_{k}$ consist of the first k columns of ${\mathbf {U}},{\mathbf {V}}$, and ${\mathbf {K}}_{k}=\mathrm {diag}(\sigma _{1},\ldots ,\sigma _{k})$. Then the relative error of $\hat{{\mathcal {K}}}_{\tau }$ in Fig. 1e can be calculated by

$$\begin{aligned} \frac{\Vert \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau } \Vert _{\mathrm {HS}}}{\Vert {\mathcal {K}}_{\tau }\Vert _{\mathrm {HS}}}= \frac{\sqrt{\sum _{i=k+1}^{2000}\sigma _{i}^{2}}}{ \sqrt{\sum _{i=1}^{2000}\sigma _{i}^{2}}}, \end{aligned}$$

(129)

the long-time transition density in Fig. 2 is given by

$$\begin{aligned} {\hat{p}}_{n\tau }(x,y)=50\sum _{j}\left[ \hat{{\mathbf {P}}}^{n} \right] _{ij}\cdot 1_{y\in S_{j}}, \end{aligned}$$

(130)

and the cumulative error of ${\hat{p}}_{n\tau }(x,y)$ is

$$\begin{aligned} \mathrm {error}= & {} \sum _{n=1}^{256}\int \mu (y)^{-1}\left( {\hat{p}}_{n\tau } (x,y)-p_{n\tau }(x,y)\right) ^{2}\mathrm {d}y\nonumber \\= & {} \sum _{n=1}^{256}\sum _{j=1}^{2000}\pi _{j}^{-1}\left( \left[ \hat{{\mathbf {P}}}^{n}\right] _{ij}-\left[ {\mathbf {P}}^{n}\right] _{ij}\right) ^{2} \end{aligned}$$

(131)

for $x\in S_{i}$.

In Examples 2 and 3 , the smoothing parameter w are optimized by the golden-section search algorithm (Press et al. 2007) as follows for nonlinear TCCA:

1.
Let $a=-6$, $b=6$, $c=0.618a+0.382b$, $d=0.382a+0.618b$.
2.
Compute ${\mathcal {R}}_{2}(\exp a)$, ${\mathcal {R}}_{2}(\exp b)$, ${\mathcal {R}}_{2}(\exp c)$ and ${\mathcal {R}}_{2}(\exp d)$, where ${\mathcal {R}}_{2}(w)=\left\| {\mathbf {C}}_{00}\left( w\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( w\right) {\mathbf {C}}_{11}\left( w\right) ^{-\frac{1}{2}}\right\| _{F}^{2}$ and $\Vert \cdot \Vert _{F}$ denotes the Frobenius norm.
3.
If $\max \{{\mathcal {R}}_{2}(\exp a),{\mathcal {R}}_{2}(\exp b),{\mathcal {R}}_{2}(\exp c)\}>\max \{{\mathcal {R}}_{2}(\exp b),{\mathcal {R}}_{2}(\exp c),{\mathcal {R}}_{2}(\exp d)\}$, let $(a,b,c,d):=(a,d,0.618a+0.382d,c)$. Otherwise, let $(a,b,c,d):=(c,b,d,0.618b+0.382c)$.
4.
If $|a-b|<10^{-3}$, output $\log w\in \{a,b,c,d\}$ with the largest value of ${\mathcal {R}}_{2}(w)$. Otherwise, go back to Step 2.

Furthermore, w is computed in the same way when perform nonlinear TCCA in Sects. 5.1 and 5.2 .

1.2 Double-Gyre System

For the double-gyre system in Sect. 5.1, we first perform the temporal discretization by the Euler–Maruyama scheme as

$$\begin{aligned} {\mathbb {P}}(x_{t+\varDelta }|{\mathbf {x}}_{t})= & {} {\mathcal {N}}(x_{t+\varDelta }|x_{t}-\pi A\sin (\pi x_{t})\cos (\pi y_{t})\varDelta ,\epsilon ^{2}(x_{t}/4+1)),\nonumber \\ {\mathbb {P}}(y_{t+\varDelta }|{\mathbf {x}}_{t})= & {} {\mathcal {N}}(y_{t+\varDelta }|y_{t}+\pi A\cos (\pi x_{t})\sin (\pi y_{t})\varDelta ,\epsilon ^{2}), \end{aligned}$$

(132)

where ${\mathbf {x}}_{t}=(x_{t},y_{t})^{\top }$ and $\varDelta =0.02$ is the step size. Then perform the spatial discretization as

$$\begin{aligned} {\mathbb {P}}({\mathbf {x}}_{t+\varDelta }\in S_{j}|{\mathbf {x}}_{t}\in S_{i})\propto & {} {\mathcal {N}}(s_{j,x}|s_{i,x}-\pi A\sin (\pi s_{i,x})\cos (\pi s_{i,y})\varDelta ,\epsilon ^{2}(s_{i,x}/4+1))\nonumber \\&\cdot {\mathcal {N}}(s_{j,y}|s_{i,y}+\pi A\cos (\pi s_{i,x})\sin (\pi s_{i,y})\varDelta ,\epsilon ^{2}). \end{aligned}$$

(133)

Here $S_{1},\ldots ,S_{1250}$ are $50\times 25$ bins which form a uniform partition of the state space $[0,2]\times [0,1]$ and $(s_{i,x},s_{i,y})$ represents the center of $S_{i}$. Simulation data and the “true” singular components are all computed by using (133) with the initial distribution of $(x_{0},y_{0})$ being the stationary one.

In Fig. 6, the transition density of lag time $n\tau $ is computed from the estimated singular components $({\mathbf {K}},{\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {V}}^{\top }\varvec{\chi }_{1})$ as

$$\begin{aligned} {\hat{p}}_{n\tau }({\mathbf {x}},{\mathbf {y}})=625\sum _{j}\left[ \hat{{\mathbf {P}}}^{n}\right] _{ij}\cdot 1_{{\mathbf {y}}\in S_{j}},\quad \text {for }x\in S_{i} \end{aligned}$$

(134)

where

$$\begin{aligned} \hat{{\mathbf {P}}}={\mathbf {U}}^{\top }{\mathbf {K}}{\mathbf {V}}\mathrm {diag}(\varvec{\rho }_{1}) \end{aligned}$$

(135)

is the approximate transition matrix, and $\varvec{\rho }_{1}=[\varvec{\rho }_{1i}]$ with

$$\begin{aligned} \varvec{\rho }_{1i}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }1_{{\mathbf {x}}_{t+\tau }\in S_{i}}. \end{aligned}$$

(136)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wu, H., Noé, F. Variational Approach for Learning Markov Processes from Time Series Data. J Nonlinear Sci 30, 23–66 (2020). https://doi.org/10.1007/s00332-019-09567-y

Download citation

Received: 01 February 2018
Accepted: 22 July 2019
Published: 05 August 2019
Issue Date: February 2020
DOI: https://doi.org/10.1007/s00332-019-09567-y

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Variational Approach for Learning Markov Processes from Time Series Data

Abstract

Similar content being viewed by others

Learning Stochastic Dynamical Systems via Bridge Sampling

Koopman Operator Framework for Time Series Modeling and Analysis

Bayesian model selection for complex dynamic systems

1 Introduction

2 Theory

2.1 Koopman Analysis of Dynamical Systems and Its Singular Value Decomposition

Theorem 1

Proof

Example 1

2.2 Variational Principle for Markov Processes

Theorem 2

Proof

2.3 Comparison with Related Analysis Approaches

3 Estimation Algorithms

3.1 Feature TCCA: Finding the Best Linear Model in a Given Feature Space

3.2 Nonlinear TCCA: Optimizing the Basis Functions

Example 2

3.3 Error Analysis

4 Model Validation

4.1 Cross-Validation for Hyper-parameter Optimization

Example 3

4.2 Chapman–Kolmogorov Test for Choice of Lag Times

5 Numerical Examples

5.1 Double-Gyre System

5.2 Stochastic Lorenz System

6 Conclusion

References

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher's Note

Appendices

Appendix

Analysis of Koopman Operators

1.1 Definition of Empirical Distributions

1.2 Proof of Theorem 1

1.3 Transition Densities Deduced from Koopman Operators

1.4 Sufficient Conditions for Theorem 1

Condition 1

Proof

Condition 2

Proof

1.5 Koopman Operators of Deterministic Systems

Markov Propagators

Proof of the Variational Principle

Variational Principle of Reversible Markov Processes

Analysis of Estimation Algorithms

1.1 Correctness of Feature TCCA

1.2 Feature TCCA of Projected Koopman Operators

1.3 An Example of Nonlinear TCCA

Implementation of Estimation Algorithms

1.1 Decorrelation of Basis Functions

1.2 Parameter Optimization in Nonlinear TCCA

Relationship Between VAMP and EDMD

Analysis of the VAMP-E Score

1.1 Proof of (28)

1.2 Relationship Between VAMP-2 and VAMP-E

Subspace Variational Principle

Computation of \(\hat{{\mathcal {K}}}_{\tau }^{n}\)

Details of Numerical Examples

1.1 One-Dimensional System

1.2 Double-Gyre System

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation