The number one task in machine learning is to efficiently create a sufficiently rich class of functions that can represent the data with the desired accuracy. The most straightforward approach is that of approximation theory: one starts with linear functions and then builds nonlinear ones using splines, wavelets, or other basis functions [1]. The obvious obstacle with this approach is the curse of dimensionality. To deal with this issue, one often has to make a simplified ansatz, postulating special forms for the functions, for example an additive or a multiplicative form [2].

In recent years, a new class of techniques has shown remarkable success: the deep neural network model [3]. Neural networks are an old idea, but recent experience has shown that deep networks with many layers do a surprisingly good job of modeling complicated data sets. The difference between neural networks and traditional approximation theory is that neural networks use compositions of simple functions to approximate complicated ones; i.e., the neural network approach is compositional, whereas classical approximation theory is usually additive. Although we still lack a theoretical framework for understanding deep neural networks, their practical success has been very encouraging.

In this note, we go one step further and explore the possibility of producing nonlinear functions using continuous dynamical systems, pushing the compositional approach to an infinitesimal limit. In the framework of supervised learning, this gives rise to a new class of control problems. In this view, deep neural networks can be thought of as discrete dynamical systems. Compared with deep neural networks, a continuous approach has several potential advantages.

  1.

    Mathematically, continuous dynamical systems are easier to think about and deal with. The continuous formulation offers more flexibility (for example, adding constraints, adapting the dynamical system to the problem, or imposing structure on it), and continuous dynamical systems are easier to analyze than discrete ones.

  2.

    A deep neural network can be thought of as a discretization of a continuous dynamical system. However, from the viewpoint of discretizing dynamical systems, there are many possibilities one can explore. For example, one can use an adaptive time step size, which corresponds to choosing the layers adaptively. One can use high-order or even implicit discretizations, which do not yet have an analog in deep neural networks. One can also use advanced numerical methods for training, such as the multi-grid method or the parallel shooting method (see [4, 5]).

  3.

    Most models in the physical sciences (physics, chemistry, etc.) are represented by dynamical systems in the form of differential equations. The continuous dynamical systems approach makes it easier to combine ideas from machine learning and physical modeling.

  4.

    The vast majority of the applied mathematics community is familiar with differential equations. The continuous dynamical systems approach to machine learning will be of particular interest to them.

In this note we only introduce some of the simplest ideas, leaving their further exploration and practical implementation to future work (see for example [4, 5]). We note that the continuous dynamical systems approach was mentioned in [6,7,8]. The connection to control theory has been explored in the work of [7], among others. See also the interesting blog by Recht [9]. However, it is the author's belief that there is still a lot of room for development in this setting.

1 Formulation

Consider the differential equation in \({\mathbb {R}}^d\):

$$\begin{aligned} \frac{\mathrm{{d}}z}{\mathrm{{d}}t} = f(t, z), \quad z(0) =x. \end{aligned}$$
(1.1)

The solution at time t will be denoted by z(t, x). Fix a time horizon T; the flow map

$$\begin{aligned} x \rightarrow z(T, x) \end{aligned}$$
(1.2)

gives rise to a function of x, which in general is nonlinear. The basic idea behind the dynamical system approach to supervised learning is to tune f so that the flow map can produce the kind of nonlinear function needed to fit the data.
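
To make this concrete, here is a minimal Python sketch (not from the original text) that evaluates the flow map numerically; the right-hand side \(f(t, z) = \tanh (z)\) is an arbitrary choice made purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, z):
    # A hand-picked, purely illustrative right-hand side.
    return np.tanh(z)

def flow_map(x, T=1.0):
    # Solve dz/dt = f(t, z), z(0) = x, and return z(T, x) as in (1.2).
    sol = solve_ivp(f, (0.0, T), np.atleast_1d(x), rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

# The map x -> z(T, x) is a nonlinear function of the input x:
print(flow_map(np.array([0.5, -1.0, 2.0])))
```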

We will consider supervised learning. At the abstract level, the problem can be formulated as follows. Let \(\mu \) be a probability distribution on a domain D in \({\mathbb {R}}^d\). Let y be a function on D. If y is a discrete-valued function, then the problem is called a classification problem. Otherwise, it is called a regression problem. Let \(\{x_i\}\) be a finite sample of \(\mu \), and let \(y_i = y(x_i) + \varepsilon _i\) be the label associated with \(x_i\). Here \(\varepsilon _i\) denotes measurement noise. Our problem is to predict y from the data \(\{(x_i, y_i)\}\).

There are several well-known, very elementary supervised learning models, for example linear regression for the regression problem, and support vector machines (SVM) and logistic regression for the classification problem. Associated with each of these models is a loss function. In the case of linear regression, the loss function is \((u(x_i) - y_i)^2\), where u is the fitted linear function. Below we will focus on the regression problem.

Consider the dynamical system:

$$\begin{aligned} \frac{\mathrm{{d}}z}{\mathrm{{d}}t} = f(A(t), z), \quad z(0) =x, \end{aligned}$$
(1.3)

where f is a function chosen beforehand and A is the control. Fix a time horizon T. Our goal is to choose A such that the flow map at time T can be fitted accurately by linear regression. Let \(u_\alpha (x) = \alpha _1 \cdot z(T, x) + \alpha _0\), where \(\alpha _1 \in {\mathbb {R}}^d, \alpha _0 \in {\mathbb {R}}^1\), and \(\alpha = (\alpha _0, \alpha _1)\). Our objective is then

$$\begin{aligned} \min _{\alpha , A} \int _D (y(x) - u_\alpha (x))^2 \mathrm{{d}} \mu (x). \end{aligned}$$
(1.4)

Obviously, we can use a more general loss function L in the last expression:

$$\begin{aligned} \min _{\alpha , A} \int _D L(y(x), u_\alpha (x)) \mathrm{{d}} \mu (x). \end{aligned}$$
(1.5)

In practice, we only have a finite set of data \(\{ (x_i, y_i) \}\), where \(\{x_i\}\) is a finite sample of \(\mu \) and \(y_i\) is the label associated with \(x_i\), so (1.4) should really be replaced by:

$$\begin{aligned} \min _{\alpha , A} \sum _{i=1}^N (y_i - u_\alpha (x_i))^2. \end{aligned}$$
(1.6)

(1.5) and (1.6) have the flavor of a control problem of an unusual type. Here we are interested in a large collection of initial data, instead of a single initial configuration and a single target.

We need to specify the function f. This is where some creative thinking has to take place. Two of the simplest possible examples are:

  1.

    Component-wise nonlinearity

    $$\begin{aligned} f(A(t), z) = (\beta _1(t)F(A_1(t) z_1), \beta _2(t)F(A_2(t) z_2), \cdots , \beta _d(t)F(A_d(t)z_d))^T, \end{aligned}$$
    (1.7)

    where F is a simple nonlinear function, for example the sigmoid function \(F(t) = 1/(1+ e^{-t})\). We have also added the functions \((\beta _1, \beta _2, \cdots , \beta _d)\) to the set of controls. (A code sketch of both examples follows this list.)

  2.

    Full nonlinearity

    $$\begin{aligned} f(A(t), z) = \beta (t)F(A(t) z), \end{aligned}$$
    (1.8)

    where F is some simple vector-valued nonlinear function.
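
A minimal Python sketch of these two parameterizations, assuming the sigmoid nonlinearity and treating `beta` and `A` as the controls at a fixed time t:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_componentwise(beta, A, z):
    # (1.7): each component evolves independently; beta and A are d-vectors.
    return beta * sigmoid(A * z)

def f_full(beta, A, z):
    # (1.8): A is a d x d matrix, F is applied entrywise to A z,
    # and beta is taken to be a scalar (an assumption for illustration).
    return beta * sigmoid(A @ z)
```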

It is straightforward to extend this formulation to other supervised learning models, such as logistic regression and support vector machines.

1.1 Finding the Optimal Control

To find the optimal control, we almost always need to compute the gradient of the objective function with respect to the control. Here we show how this can be done. Our notation here is a bit sloppy, but this makes things more transparent.

Fix \( t \in [0, T]\). Define

$$\begin{aligned} I(\alpha , A) = \int \left( y(x) - u_\alpha (x) \right) ^2 \mathrm{{d}}\mu (x). \end{aligned}$$
(1.9)

We will compute \(\frac{\delta I}{\delta A(t)}\). It is easy to see that

$$\begin{aligned} \frac{\delta I}{\delta A(t)} = -2\alpha _1 \cdot \int \left( y(x) - \alpha _1 \cdot z(T, x) -\alpha _0 \right) \frac{\delta z(T, x)}{\delta A(t)} \mathrm{{d}}\mu (x). \end{aligned}$$
(1.10)

To compute \(\frac{\delta z(T, x)}{\delta A(t)}\), consider the dynamical system in which the control is kept the same except in a small neighborhood of t, where A is perturbed by a small amount. The leading-order perturbation to the solution of the dynamical system is given by the following problem:

$$\begin{aligned} \frac{\mathrm{{d}}v(\tau )}{\mathrm{{d}}\tau } = \nabla _z f(A(\tau ), z(\tau , x)) v(\tau ), \quad \tau > t \end{aligned}$$
(1.11)

with the condition \(v(t) = \nabla _A f(A(t), z(t, x))\). We then have

$$\begin{aligned} \frac{\delta z(T, x)}{\delta A(t)} = v(T). \end{aligned}$$
(1.12)

At first sight, (1.11) and (1.12) should be solved forward in time. However, note that we are interested in solving this equation for multiple values of t. Therefore, it might be more advantageous to solve (1.11) and (1.12) backward in time, i.e., to first form the product of the propagators from time T to time t and then multiply this propagator by the initial value at time t. In this way, (1.11) and (1.12) can be solved simultaneously for all t in one sweep. This corresponds to the idea of back propagation.
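
The following sketch mirrors this computation on a forward-Euler discretization of (1.3). The elementwise dynamics \(f(a, z) = \tanh (az)\) and the quadratic loss are assumptions made for illustration; the single backward sweep over the layers is the discrete counterpart of solving (1.11) and (1.12) for all t in one pass.

```python
import numpy as np

def f(a, z):
    # Assumed elementwise dynamics: f(a, z) = tanh(a * z).
    return np.tanh(a * z)

def df_dz(a, z):
    return a * (1.0 - np.tanh(a * z) ** 2)

def df_da(a, z):
    return z * (1.0 - np.tanh(a * z) ** 2)

def forward(x, A, dt):
    # Forward Euler for (1.3); A has shape (L, d): one control per step.
    zs = [x]
    for a in A:
        zs.append(zs[-1] + dt * f(a, zs[-1]))
    return zs

def grad_A(x, y, A, alpha1, alpha0, dt):
    # Gradient of (y - alpha1 . z(T) - alpha0)^2 with respect to A,
    # computed in a single backward sweep: the discrete analog of
    # (1.10)-(1.12), i.e., back propagation.
    zs = forward(x, A, dt)
    r = y - alpha1 @ zs[-1] - alpha0
    p = -2.0 * r * alpha1                        # dLoss/dz(T), cf. (1.10)
    g = np.zeros_like(A)
    for l in reversed(range(len(A))):
        g[l] = p * dt * df_da(A[l], zs[l])       # dLoss/dA_l, cf. (1.12)
        p = p * (1.0 + dt * df_dz(A[l], zs[l]))  # adjoint step, cf. (1.11)
    return g
```

A stochastic gradient descent step is then simply `A -= lr * grad_A(x_i, y_i, A, alpha1, alpha0, dt)` for a sampled data point \((x_i, y_i)\).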

With these results in place, it is now clear how to formulate various optimization algorithms, including the gradient descent and stochastic gradient descent algorithms.

2 Connection with Deep Neural Networks

Deep neural networks can be considered as discrete dynamical systems, the basic dynamics at each step being a linear transformation followed by a component-wise nonlinear (activation) function. As such, they are among the simplest nonlinear discrete dynamical systems one can imagine. These discrete dynamical systems offer another feature, namely that the dimensionality of the system can be changed from step to step, simply by changing the size of the linear transformation.

The continuous dynamical systems approach advocated here can in principle incorporate more general dynamics. This added flexibility can be important in many situations, such as the ones encountered in physical science. For example, when trying to learn the inter-atomic potential of a molecular dynamics system with both short- and long-range interactions, this added flexibility should come in handy. Continuous dynamical systems, on the other hand, do not allow the dimensionality to change. Therefore, the lifting and projection of the dynamics to higher and lower dimensions have to be performed at the beginning or end of the dynamics, i.e., at \(t=0\) or \(t=T\).

The behavior of dynamical systems at large times is a notoriously difficult problem in mathematics, particularly for discrete dynamical systems. One may encounter situations where the dynamics explodes, converges to stationary states, or exhibits chaotic behavior. This may explain some of the difficulties (such as vanishing and exploding gradients) encountered in the training of deep neural networks. Continuous dynamical systems also experience these difficulties. However, in the continuous setting, several ideas can be explored to help alleviate these problems:

  1.

    Imposing structure on the dynamical system. The simplest structures one can imagine are the Hamiltonian and the gradient flow structures [10] (see the sketch after this list).

  2.

    Using sophisticated numerical discretization methods, such as adaptive time stepping and implicit or semi-implicit algorithms. These are particularly helpful for long-time (large time horizon) computations.
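
As an illustration of the first idea, a gradient flow structure can be imposed by construction, taking \(f(A, z) = -\nabla _z V(z; A)\) for a parameterized potential V; the softplus potential below is an arbitrary assumption made for illustration.

```python
import numpy as np

def grad_flow_rhs(A, z):
    # dz/dt = -grad_z V(z; A) with the assumed potential
    # V(z; A) = sum_i softplus((A z)_i), whose gradient is A^T sigmoid(A z).
    s = 1.0 / (1.0 + np.exp(-(A @ z)))
    return -(A.T @ s)
```

Along such a flow, V(z(t); A) is nonincreasing, and with this particular choice the velocity is uniformly bounded, so the dynamics cannot blow up in finite time.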

2.1 Connection with Deep Residual Networks

To better appreciate these points, we take a more detailed look at deep residual networks. In their general form, deep residual networks are described by the following discrete dynamical system [11]:

$$\begin{aligned} y_l&= h(z_l) + {\mathscr {F}}(z_l, W_l), \end{aligned}$$
(2.1)
$$\begin{aligned} z_{l+1}&= g(y_l), \end{aligned}$$
(2.2)

where \(z_l\) and \(z_{l+1}\) are the input and output of the l-th layer, \(y_l\) is an auxiliary variable for the l-th layer, and h and g are some mappings which could in principle be nonlinear. A main result of [11], found through numerous numerical experiments, is that for very deep networks (hundreds or thousands of layers), training is easiest if both g and h are the identity map. This is quite expected from the viewpoint of dynamical systems. In fact, denoting by G the inverse map of g, we can write the above dynamical system as:

$$\begin{aligned} z_{l+1} = G(h(z_l) + {\mathscr {F}}(z_l, W_l)). \end{aligned}$$
(2.3)

In order to have stable behavior (gradients that neither vanish nor explode), the gradient of the right-hand side should be close to the identity map. Assuming that \({\mathscr {F}}\) is a small perturbation, we then need

$$\begin{aligned} \nabla G \nabla h \sim I. \end{aligned}$$
(2.4)

This is certainly fulfilled when both g and h are identity maps. In general, one has

$$\begin{aligned} z_{l+1} \sim G(h(z_l)) + \nabla G \cdot {\mathscr {F}}(z_l, W_l). \end{aligned}$$
(2.5)

This means that the leading-order behavior is dominated by the composition \(G \circ h\). The added flexibility from choosing more general g and h does not offer much real improvement if one wants to use a large number of layers.

Let g and h both be identity maps. Then (2.3) becomes:

$$\begin{aligned} z_{l+1} = z_l + {\mathscr {F}}(z_l, W_l). \end{aligned}$$
(2.6)

This can be viewed as a discretization of the dynamical system:

$$\begin{aligned} \frac{\mathrm{{d}}z}{\mathrm{{d}}t} = {\mathscr {F}}(z, W(t)). \end{aligned}$$
(2.7)

In fact the simplest discretization of (2.7) takes the form:

$$\begin{aligned} z_{l+1} = z_l + \Delta t_l {\mathscr {F}}(z_l, W_l), \end{aligned}$$
(2.8)

where \(\Delta t_l\) is the step size at the l-th time step. By adaptively choosing \(\Delta t_l\), one can improve the efficiency and stability of the algorithm at the same time [12]. In the language of deep learning, adaptive time stepping means adaptively choosing the layers.
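
A minimal sketch of (2.8) with per-layer step sizes; the residual block \({\mathscr {F}}(z, W) = W_2\, \mathrm {ReLU}(W_1 z + b)\) is an assumed form chosen for illustration.

```python
import numpy as np

def resnet_flow(x, params, dts):
    # params: list of per-layer weights (W1, W2, b); dts: step sizes Δt_l.
    z = x
    for (W1, W2, b), dt in zip(params, dts):
        F = W2 @ np.maximum(W1 @ z + b, 0.0)  # assumed residual block F(z, W)
        z = z + dt * F                        # one step of (2.8)
    return z
```

Setting all \(\Delta t_l = 1\) recovers the standard residual network (2.6).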

Remark It should be noted that, contrary to the conventional problem of solving differential equations, here our concern is not the dynamical system itself but rather the representation of the data. Therefore, accuracy in the numerical approximation of the differential equation is not our primary concern; the training error and the efficiency and stability of the training are our main interests.

3 The Representability and Controllability Problem

One important question is whether the flow map of a given dynamical system can represent the data at hand. The same issue is also baffling the deep learning community. From our perspective, this representability problem can be viewed as a controllability problem.

The standard controllability problem asks the following question: Given an initial and a target position, can we use the control to move the system from the initial position to an arbitrarily small neighborhood of the target position within the specified time horizon? For some general results, we refer to [13]. Our concern here is different. We are interested in the following question: Given a final supervised learning model, such as linear regression, can we use the control to move the flow map to a set on which the specified model works satisfactorily?

To get a clearer idea about this, let us consider an idealized version of this problem: Given a mapping M on D and given T, can we choose the control such that \(z(T, x) = M(x)\) for all \(x \in D\)?

The one-dimensional version of this problem may shed some light. Consider the case when our dynamical system takes the form:

$$\begin{aligned} \frac{\mathrm{{d}}z}{\mathrm{{d}}t} = A(t) f(z) \end{aligned}$$
(3.1)

For simplicity of illustration, we assume that f does not vanish. We can rewrite the above equation as

$$\begin{aligned} \frac{1}{f(z)} \frac{\mathrm{{d}}z}{\mathrm{{d}}t} = A(t) \end{aligned}$$
(3.2)

Integrating both sides, we obtain

$$\begin{aligned} G(M(x)) - G(x) = \int _0^T A(t) \mathrm{{d}}t \end{aligned}$$
(3.3)

where G is an antiderivative of 1/f. The right-hand side of (3.3) is independent of x. Therefore, for (3.3) to hold, the left-hand side must also be independent of x. This imposes a severe constraint on M and f.
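
As a concrete instance (not in the original text, added for illustration), take \(f(z) = z\), so that \(G(z) = \log z\) for \(z > 0\). Then (3.3) reads

$$\begin{aligned} \log M(x) - \log x = \int _0^T A(t) \, \mathrm{{d}}t, \qquad \text {i.e.,} \qquad M(x) = x \, e^{\int _0^T A(t) \mathrm{{d}}t}, \end{aligned}$$

so no matter how A is chosen, only linear scalings of x can be represented exactly.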

This example is highly simplified, but it gives us some hints about the issues involved in understanding whether the dynamical system approach gives a good model for the learning problem at hand. In general, the viewpoint of exact representability is very limited. Instead, one should take an approximation viewpoint. For that purpose, one may consider lifting the dynamical system to higher dimensions, i.e., instead of considering the dynamical system in the same dimension as the data, one may consider it in a different dimension and use

$$\begin{aligned} z(0) = P x \end{aligned}$$
(3.4)

as the initial condition. Here P is a lifting matrix. One may also consider extending (1.8) to

$$\begin{aligned} f(A(t), z) = \sum _{k=1}^n \beta _k(t)F_k(A_k(t) z). \end{aligned}$$
(3.5)

In the general case, if the mapping M is smoothly homotopic to the identity map, i.e., if there exists a one-parameter family of mappings \(G(\cdot ): [0, 1] \rightarrow \mathrm{{diff}}(D)\), where diff(D) is the space of all diffeomorphisms on D, such that \(G(0, x) = x, G(1, x) = M(x)\), then there always exists an \(f: [0, 1] \times D \rightarrow {\mathbb {R}}^d\) such that the flow map generated by the dynamical system

$$\begin{aligned}\frac{\mathrm{{d}}z}{\mathrm{{d}}t} = f(t, z)\end{aligned}$$

gives rise to M: \( z(1, x) = M(x)\).

To see this, one just has to define \(f(t, z) = \frac{\partial G}{\partial t}(t, G^{-1}(z))\), where \(G^{-1}\) is the inverse of the mapping \(G(t, \cdot )\).

4 Extensions

4.1 Continuum in Space

For some tasks in computer vision and artificial intelligence, it is of interest to view the data itself as continuum objects. This is the case, for example, in image recognition and voice recognition. In such settings, we need to introduce another independent variable to represent space.

When the spatial variable is taken into account, the most commonly seen dynamical systems are those in the form of partial differential equations (PDEs). These dynamical systems are local, since differentiation is a local operation. For applications in vision, however, the situation seems to be quite different: the dynamical systems of interest should be nonlocal in nature.

One way of introducing nonlocal operations is to use integral operators. For example, we may replace A(t)z in (1.7) by \(\int K(x, y, t) z(y, t) \mathrm{{d}}y\), where K is some kernel function (in this section we use x and y to denote the spatial variables). To ensure translation invariance, the kernel function should take the form of a convolution:

$$\begin{aligned} \int K(x, y, t)z(y, t) \mathrm{{d}}y = \int K_0(x-y, t) z(y, t) \mathrm{{d}}y = (K_0 * z)(x, t). \end{aligned}$$
(4.1)

The simplest dynamical system that uses this convolution operator may take the form:

$$\begin{aligned} \frac{\partial z}{ \partial t}(x, t)= f((K_0 * z)(x, t)), \end{aligned}$$
(4.2)

where f is some nonlinear function. This leads to a model that is similar in spirit to the convolutional neural network model.
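
A minimal sketch of (4.2) on a periodic one-dimensional grid; the grid, the periodic boundary, and the choice \(f = \tanh \) are assumptions made for illustration.

```python
import numpy as np

def conv_rhs(z, K0):
    # (4.2): dz/dt(x) = f((K0 * z)(x)), with the circular convolution
    # computed via the FFT (both K0 and z live on the same periodic grid).
    conv = np.real(np.fft.ifft(np.fft.fft(K0) * np.fft.fft(z)))
    return np.tanh(conv)

def evolve(z0, K0, dt=0.01, steps=200):
    z = z0.copy()
    for _ in range(steps):
        z = z + dt * conv_rhs(z, K0)  # forward Euler in time
    return z
```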

4.2 Adding Constraints

The dynamical system formulation also allows us to impose constraints with relative ease. For example, one might ask the control A to be orthogonal at each time t: \(A(t)^T A(t) = I\), where I is the identity matrix. One might also impose constraints on z. This can be done by introducing suitable Lagrange multipliers.
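
For the orthogonality constraint, one simple option (an illustration, not a prescription from the text) is to project the control back onto the orthogonal group after each optimization update, using the polar factor of the SVD:

```python
import numpy as np

def project_orthogonal(A):
    # Nearest orthogonal matrix to A in the Frobenius norm
    # (polar decomposition), so that A^T A = I after projection.
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt
```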

4.3 Structure of the Dynamical System

We may ask the dynamical system to have certain structure. For example we may ask the dynamical system to be a gradient flow. One such example is given below (see the section on density estimation). We may also ask the dynamical system to have a Hamiltonian structure. This direction remains to be explored.

4.4 Regularization

Within the context of control theory, it is natural to consider regularized models, obtained by adding regularization terms on the control. For example, two natural choices of regularization term are \(\int _0^T \Vert A(t) \Vert _1 \mathrm{{d}}t\) and \(\int _0^T \Vert A(t)\Vert _F^2 \mathrm{{d}}t \), where \(\Vert A(t) \Vert _1 \) and \( \Vert A(t)\Vert _F \) are the \(L^1\) norm and the Frobenius norm of A(t), respectively. In the language of control theory, these regularization terms give rise to a “running cost” for the model (the costs considered in (1.4), (1.5), (1.6) are “terminal costs”). The corresponding Bellman equation becomes less degenerate in the presence of running costs.
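
Discretizing the time integrals, the running costs can be evaluated as below; `lam1` and `lam2` are hypothetical regularization weights, and \(\Vert A\Vert _1\) is taken entry-wise (one common convention, assumed here).

```python
import numpy as np

def running_cost(A_path, dt, lam1=0.0, lam2=0.0):
    # A_path: list of control matrices A(t_l); approximates
    # lam1 * ∫ ||A(t)||_1 dt + lam2 * ∫ ||A(t)||_F^2 dt by a Riemann sum.
    l1 = sum(np.abs(A).sum() for A in A_path)
    fro2 = sum((A ** 2).sum() for A in A_path)
    return dt * (lam1 * l1 + lam2 * fro2)
```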

Deep neural network models usually do not include such regularization terms, although the pooling and dropout operations often used there may have some regularization effect.

5 Other Examples

There are numerous other examples of learning via dynamical systems. Here we only mention a few.

5.1 Clustering Dynamics

One well-known model of clustering dynamics is the Hegselmann–Krause system. An interesting noisy version of this dynamics is described by the following system of stochastic differential equations (see [14]):

$$\begin{aligned} \mathrm{{d}} x_i = - \frac{1}{N} \sum _{j, \, |x_i - x_j| \le R} (x_i - x_j) \, \mathrm{{d}}t + \sigma \mathrm{{d}} W_i, \quad x_i(0) = x_i^0. \end{aligned}$$
(5.1)

Here the \(\{x_i\}\)’s are the interacting agents, the \(\{W_i\}\)’s are independent Wiener processes, and R is the interaction range. This can be considered as a clustering algorithm for the data set \(\{x_i^0\}\). In the continuum limit, when there is a continuum of agents, the density \(\rho \) of the agents evolves according to:

$$\begin{aligned} \frac{\partial \rho (x, t)}{\partial t} = \nabla \cdot \left( \rho (x, t)\int _{|x-y| \le R} (x-y) \rho (y,t) \mathrm{{d}}y \right) + \frac{\sigma ^2}{2} \Delta \rho (x, t). \end{aligned}$$
(5.2)

As shown in [14], PDE methods can be quite powerful in helping us understand the dynamics of the Hegselmann–Krause system.
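
An Euler–Maruyama sketch of (5.1) for one-dimensional agents; the dimension, parameters, and step size are illustrative assumptions.

```python
import numpy as np

def hk_step(x, R, sigma, dt, rng):
    # One Euler-Maruyama step of (5.1) for agents x of shape (N,).
    diff = x[:, None] - x[None, :]          # pairwise x_i - x_j
    mask = np.abs(diff) <= R                # interaction range
    drift = -(diff * mask).sum(axis=1) / x.size
    return x + dt * drift + sigma * np.sqrt(dt) * rng.standard_normal(x.size)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)         # the data set {x_i^0}
for _ in range(2000):
    x = hk_step(x, R=1.0, sigma=0.05, dt=0.01, rng=rng)
# after many steps the agents concentrate into a few clusters
```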

5.2 Density Estimation

In a very interesting paper, Tabak and Vanden-Eijnden proposed performing density estimation by mapping the initial distribution of data to a target distribution, usually an isotropic Gaussian [15]. This mapping can be obtained from the gradient flow dynamics of the log-likelihood function. In the continuous form (when the data set is a continuum), the flow dynamics takes the form:

$$\begin{aligned} \frac{\partial \phi (x, t)}{\partial t} = \frac{\nabla \mu (x)}{\mu (x)} \rho (x, t) - \nabla \rho (x, t), \quad \rho (x, t) = \frac{\rho _0(z)}{J(z, t)}, \quad x= \phi (z, t). \end{aligned}$$
(5.3)

Here \(\phi \) is the flow map, \(\rho _0\) is the initial distribution, \(\mu \) is the target distribution (usually a Gaussian), and J is the Jacobian of the flow map. The evolution of the density takes a simpler form:

$$\begin{aligned} \frac{\partial \rho (x, t)}{\partial t} = \nabla \cdot \left( \left( \nabla \rho (x, t) - \frac{\nabla \mu (x)}{\mu (x)} \rho (x, t) \right) \rho (x, t) \right) . \end{aligned}$$
(5.4)

The idea behind this can be useful for other unsupervised learning tasks.

6 Conclusions

We have proposed some ideas for using continuous dynamical systems as a tool for machine learning. Besides offering some algorithmic and analytical advantages, we hope that this also provides an attractive alternative view of deep learning.

There are numerous open questions. The most important one is whether there are other natural candidates for the underlying dynamical systems besides the ones suggested in deep learning. Practical issues such as discretization and advanced numerical algorithms also remain to be explored.