1 Background

The goal of dynamic time warping (DTW) is to find a time warping function that transforms, or warps, time in order to approximately align two signals together (Sakoe and Chiba 1978). At the same time, we prefer that the time warping be as gentle as possible, in some sense, or we require that it satisfy some requirements.

DTW is a versatile tool used in many scientific fields, including biology, economics, signal processing, finance, and robotics. It can be used to measure a realistic distance between two signals, usually by taking the distance between them after one is time-warped. In another case, the distance can be the minimum amount of warping needed to align one signal to the other with some level of fidelity. Time warping can be used to develop a simple model of a signal, or to improve a predictor; as a simple example, a suitable time warping can lead to a signal being well fit by an auto-regressive or other model. DTW can also be used for pattern recognition and searching for a match among a database of signals (Rakthanmanon et al 2012). It can be employed in any machine-learning application that relies on signals, such as PCA, clustering, regression, logistic regression, or multi-class classification. (We return to this topic in §7.)

Almost all DTW methods are based on the original DTW algorithm (Sakoe and Chiba 1978), which uses dynamic programming to compute a time warping path that minimizes misalignments in the time-warped signals while satisfying monotonicity, boundary, and continuity constraints. The monotonicity constraint ensures that the path represents a monotone increasing function of time. The boundary constraint enforces that the warping path begins with the origin point of both signals and ends with their terminal points. The continuity constraint restricts transitions in the path to adjacent points in time.

Despite its popularity, DTW has a longstanding problem with producing sharp irregularities in the time warp function that cause many time points of one signal to be erroneously mapped onto a single point, or “singularity,” in the other signal. Most of the literature on reducing the occurrence of singularities falls into two camps: preprocessing the input signals, and variations on continuity constraints. Preprocessing techniques rely on transformations of the input signals, which make them smoother or emphasize features or landmarks, to indirectly influence the smoothness of the warping function. Notable approaches use combinations of first and second derivatives (Keogh and Pazzani 2001; Marron et al 2015; Singh et al 2008), square-root velocity functions (Srivastava et al 2011), adaptive down-sampling (Dupont and Marteau 2015), and ensembles of features including wavelet transforms, derivatives, and several others (Zhao and Itti 2016). Variations of the continuity constraints relax the restriction on transitions in the path, which allows smoother warping paths to be chosen. Instead of only restricting transitions to one of three neighboring points in time, as in the original DTW algorithm, these variations expand the set of allowable points to those specified by a “step pattern,” of which there are many, including symmetric or asymmetric, types I-IV, and sub-types a-d (Sakoe and Chiba 1978; Itakura 1975; Myers et al 1980; Rabiner and Juang 1993). While preprocessing and step patterns may result in smoother warping functions, they are ad-hoc techniques that often require hand-selection for different types of input signals.

We propose to handle these issues entirely within an optimization framework in continuous time. Here we pose DTW as an optimization problem with several penalty terms in the objective. The basic term in our objective penalizes misalignments in the time-warped signals, while two additional terms penalize (and constrain) the time warping function. One of these terms penalizes the cumulative warping, which limits over-fitting similar to “ridge” or “lasso” regularization (Tikhonov and Arsenin 1977; Tibshirani 1996). The other term penalizes the instantaneous rate of time warping, which produces smoother warping functions, an idea previously proposed in (Green and Silverman 1993; Ramsay and Silverman 2005, 2007; Srivastava and Klassen 2016).

Our formulation offers almost complete freedom in choosing the functions used to compare the signals and to penalize the warping function. We allow constraints on the fit and on the warping function by letting these functions take on the value \(+\infty\). Traditional penalty functions include the square or absolute value. Less traditional but useful ones include, for example, the fraction of time the two signals are within some threshold distance, or a minimum or maximum on the cumulative warping function. The choice of these functions, and how they are scaled with respect to each other, gives a very wide range of choices for potential time warpings.

Our continuous time formulation allows for non-uniformly sampled signals, which lets us use simple out-of-sample validation techniques to help guide the choice of time warping penalties; in particular, we can determine whether a time warp is ‘over-fit’. Our handling of missing data in the input signals is useful in itself, since real-world data often have missing entries. Prior work uses validation to select hyper-parameters, such as the “warping window,” by splitting a dataset of signals into test signals and train signals (Dau et al 2018). To the best of our knowledge, we are the first to use out-of-sample validation for model selection in DTW by building the test and training datasets from a partition of the samples of a single signal.

We develop a single, efficient algorithm that solves our formulation, independent of the particular choices of the penalty functions. Our algorithm uses dynamic programming to exactly solve a discretized version of the problem with linear time complexity, coupled with iterative refinement at higher and higher resolutions. Our discretized formulation can be thought of as generalizing the Itakura parallelogram (Itakura 1975); the iterated refinement scheme is similar in nature to FastDTW (Salvador and Chan 2007). We offer our implementation as open source C++ code with an intuitive Python package called GDTW (https://github.com/dderiso/gdtw).

We describe several extensions and variations of our method. In one, we extend our optimization framework to find a time-warped center of, or template for, a set of signals; in another, we cluster a set of signals into groups, each of which is time-warped to one of a set of templates or prototypes.

2 Dynamic time warping

Signals. A (vector-valued) signal f is a function \(f:[a,b] \rightarrow {\mathbf{R}}^d\), with argument time. A signal can be specified or described in many ways, for example by a formula, or by a sequence of samples along with a method for interpolating the signal values between samples. For example, we can describe a signal as taking values \(s_1, \ldots , s_N \in {\mathbf{R}}^d\), at points (times) \(a\le t_1< t_2< \cdots < t_N \le b\), with linear interpolation in between these values and a constant extension outside the first and last values:

$$\begin{aligned} f(t)= \left\{ \begin{array}{ll} s_1 & a \le t< t_1\\ \frac{t_{i+1}-t}{t_{i+1}-t_i} s_{i} + \frac{t-t_i}{t_{i+1}-t_i} s_{i+1} & t_i \le t< t_{i+1}, \quad i=1, \ldots , N-1,\\ s_N & t_N \le t \le b. \end{array}\right. \end{aligned}$$

For simplicity, we will consider signals on the time interval [0, 1].
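This interpolate-and-clamp description maps directly onto a short routine; a minimal sketch for scalar signals (\(d=1\)), using the fact that `numpy.interp` linearly interpolates between samples and holds the first/last value constant outside them (for vector-valued signals, one would interpolate each component):

```python
import numpy as np

def make_signal(t_samples, s_samples):
    """Scalar signal f: [a, b] -> R from samples s_1, ..., s_N at times
    t_1 < ... < t_N, with linear interpolation between samples and
    constant extension outside them."""
    t_samples = np.asarray(t_samples, dtype=float)
    s_samples = np.asarray(s_samples, dtype=float)
    # np.interp clamps to the first/last sample outside [t_1, t_N],
    # matching the constant extension in the formula above.
    return lambda t: np.interp(t, t_samples, s_samples)

f = make_signal([0.1, 0.4, 0.9], [1.0, 3.0, 2.0])
print(f(0.0), f(0.25), f(1.0))  # 1.0 (extension), 2.0 (interpolated), 2.0 (extension)
```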

Time warp function. Suppose \(\phi :[0,1]\rightarrow [0,1]\) is increasing, with \(\phi (0)=0\) and \(\phi (1)=1\). We refer to \(\phi\) as the time warp function, and \(\tau = \phi (t)\) as the warped time associated with real or original time t. When \(\phi (t)=t\) for all t, the warped time is the same as the original time. In general we can think of

$$\begin{aligned} \tau -t = \phi (t) -t \end{aligned}$$

as the amount of cumulative warping at time t, and

$$\begin{aligned} \frac{d}{dt}(\tau -t)= \phi '(t)-1 \end{aligned}$$

as the instantaneous rate of time warping at time t. These are both zero when \(\phi (t)=t\) for all t.
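As a concrete toy illustration (our own example, not from the text), take \(\phi(t) = t + 0.1\sin(\pi t)\), which satisfies \(\phi(0)=0\), \(\phi(1)=1\), and is increasing since \(\phi'(t) = 1 + 0.1\pi\cos(\pi t) > 0\):

```python
import numpy as np

t = np.linspace(0, 1, 5)
phi = t + 0.1 * np.sin(np.pi * t)        # phi(0) = 0, phi(1) = 1, increasing
print(phi - t)                           # cumulative warping, largest near t = 1/2
print(0.1 * np.pi * np.cos(np.pi * t))   # phi'(t) - 1: positive early, negative late
```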

Time-warped signal. If x is a signal, we refer to the signal \({\tilde{x}} = x\circ \phi\), i.e.,

$$\begin{aligned} {\tilde{x}}(t) = x(\tau )=x(\phi (t)), \end{aligned}$$

as the time-warped signal, or the time-warped version of the signal x.

Dynamic time warping. Suppose we are given two signals x and y. Roughly speaking, the dynamic time warping problem is to find a warping function \(\phi\) so that \({\tilde{x}} = x\circ \phi \approx y\). In other words, we wish to warp time so that the time-warped version of the first signal is close to the second one. We refer to the signal y as the target, since the goal is to warp x to match, or align with, the target.

Fig. 1

Top. A signal x and target signal y. Middle. Warping function \(\phi\) drawn as lines between x and y. Bottom. The time-warped signal \({\tilde{x}}\) and y

Fig. 2

Top. A time warping function \(\phi (t)\). Middle. The cumulative warp \(\phi (t)-t\). Bottom. The instantaneous rate of time warping \(\phi '(t) - 1\)

Example. An example is shown in Fig. 1. The top plot shows a scalar signal x and target signal y, and the bottom plot shows the time-warped signal \({\tilde{x}} = x\circ \phi\) and y. The middle plot shows the correspondence between x and y associated with the warping function \(\phi\). Figure 2 shows the time warping function (top), the cumulative warp (middle), and the instantaneous rate of time warping (bottom).

3 Optimization formulation

We will formulate the dynamic time warping problem as an optimization problem, where the time warp function \(\phi\) is the (infinite-dimensional) optimization variable to be chosen. Our formulation is very similar to those used in machine learning, where a fitting function is chosen to minimize an objective that includes a loss function that measures the error in fitting the given data, and regularization terms that penalize the complexity of the fitting function (Friedman et al 2001).

Loss functional. Let \(L: {\mathbf{R}}^d \rightarrow {\mathbf{R}}\) be a vector penalty function. We define the loss associated with a time warp function \(\phi\), on the two signals x and y, as

$$\begin{aligned} {\mathcal {L}}(\phi ) = \int _{0}^{1} L(x(\phi (t))- y(t)) \, dt, \end{aligned}$$
(1)

the average value of the penalty function of the difference between the time-warped first signal and the second signal. The smaller \({\mathcal {L}}(\phi )\) is, the better we consider \({\tilde{x}} =x\circ \phi\) to approximate y.

Simple choices of the penalty include \(L(u)=\Vert u\Vert _2^2\) or \(L(u)=\Vert u\Vert _1\). The corresponding losses are the mean-square deviation and mean-absolute deviation, respectively. One useful variation is the Huber penalty (Huber 2011; Boyd and Vandenberghe 2004),

$$\begin{aligned} L(u)= \left\{ \begin{array}{ll} \Vert u\Vert _2^2 & \Vert u\Vert _2 \le M\\ 2M \Vert u\Vert _2-M^2 & \Vert u\Vert _2 > M, \end{array}\right. \end{aligned}$$

where \(M>0\) is a parameter. The Huber penalty coincides with the least squares penalty for small u, but grows more slowly for u large, and so is less sensitive to outliers. Many other choices are possible, for example

$$\begin{aligned} L(u)= \left\{ \begin{array}{ll} 0 & \Vert u\Vert \le \epsilon \\ 1 & \text{otherwise}, \end{array}\right. \end{aligned}$$

where \(\epsilon \in {\mathbf{R}}_{+}\) is a positive parameter. The associated loss \({\mathcal {L}}(\phi )\) is the fraction of time the time-warped signal is farther than \(\epsilon\) from the second signal (measured by the norm \(\Vert \cdot \Vert\)).

The choice of penalty function L (and therefore loss functional \({\mathcal {L}}\)) will influence the warping found, and should be chosen to capture the notion of approximation appropriate for the given application.
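Each of these penalties is a one-liner in code; a sketch (function names are ours, and the Huber parameter M and threshold \(\epsilon\) are the parameters defined above):

```python
import numpy as np

def square(u):                 # L(u) = ||u||_2^2; mean-square deviation when averaged over t
    return float(np.sum(np.square(u)))

def absolute(u):               # L(u) = ||u||_1; mean-absolute deviation when averaged over t
    return float(np.sum(np.abs(u)))

def huber(u, M=1.0):           # quadratic for small u, linear for large u
    r = np.linalg.norm(u)
    return r**2 if r <= M else 2*M*r - M**2

def threshold(u, eps=0.1):     # averaged over t, gives the fraction of time
    return float(np.linalg.norm(u) > eps)  # the signals are farther than eps apart
```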

Cumulative warp regularization functional. We express our desired qualities for or requirements on the time warp function using a regularization functional for the cumulative warp,

$$\begin{aligned} {\mathcal {R}}^\mathrm {cum}(\phi ) = \int _{0}^{1} R^\mathrm {cum}(\phi (t)-t) \, dt, \end{aligned}$$
(2)

where \(R^\mathrm {cum}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is a penalty function on the cumulative warp. The function \(R^\mathrm {cum}\) can take on the value \(+\infty\), which allows us to encode constraints on \(\phi\). While we do not require it, we typically have \(R^\mathrm {cum}(0)=0\), i.e., there is no cumulative regularization cost when the warped time and true time are the same.

Instantaneous warp regularization functional. The regularization functional for the instantaneous warp is

$$\begin{aligned} {\mathcal {R}}^\mathrm {inst}(\phi ) = \int _{0}^{1} R^\mathrm {inst}(\phi '(t)-1) \, dt, \end{aligned}$$
(3)

where \(R^\mathrm {inst}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is the penalty function on the instantaneous rate of time warping. Like the function \(R^\mathrm {cum}\), \(R^\mathrm {inst}\) can take on the value \(+\infty\), which allows us to encode constraints on \(\phi '\). Recall that the argument of \(R^\mathrm {inst}\) is \(\phi '(t)-1\); by assigning \(R^\mathrm {inst}(u)=+\infty\) for \(u<s^\mathrm {min}-1\), for example, we require that \(\phi '(t)\ge s^\mathrm {min}\) for all t. We will assume that this is the case for some positive \(s^\mathrm {min}\), which ensures that \(\phi\) is invertible. While we do not require it, we typically have \(R^\mathrm {inst}(0)=0\), i.e., there is no instantaneous regularization cost when the instantaneous rate of time warping is one.

As a simple example, we might choose

$$\begin{aligned} R^\mathrm {cum}(u) = u^2, \qquad R^\mathrm {inst}(u) = \left\{ \begin{array}{ll} u^2 & s^\mathrm {min}-1 \le u \le s^\mathrm {max}-1\\ \infty & \text{otherwise}, \end{array}\right. \end{aligned}$$

i.e., a quadratic penalty on cumulative warping, and a square penalty on instantaneous warping, plus the constraint that the slope of \(\phi\) must be between \(s^\mathrm {min}\) and \(s^\mathrm {max}\). A very wide variety of penalties can be used to express our wishes and requirements on the warping function.
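Encoding constraints by letting the penalty take the value \(+\infty\) is equally direct in code; a sketch of the simple example above (function names and default slope bounds are ours):

```python
import numpy as np

def R_cum(u):
    return u**2               # quadratic penalty on the cumulative warp phi(t) - t

def R_inst(u, s_min=0.5, s_max=2.0):
    # The argument is u = phi'(t) - 1, so the slope constraint
    # s_min <= phi'(t) <= s_max reads s_min - 1 <= u <= s_max - 1.
    if s_min - 1 <= u <= s_max - 1:
        return u**2
    return np.inf             # infinite cost encodes the constraint
```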

Dynamic time warping via regularized loss minimization. We propose to choose \(\phi\) by solving the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & f(\phi ) = {\mathcal {L}}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \end{array} \end{aligned}$$
(4)

where \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters used to vary the relative weight of the three terms. The variable in this optimization problem is the time warp function \(\phi\).

Optimal control formulation. The problem (4) is an infinite-dimensional, and generally non-convex, optimization problem. Such problems are generally impractical to solve exactly, but we will see that this particular problem can be efficiently and practically solved.

It can be formulated as a classical continuous-time optimal control problem (Bertsekas 2005), with scalar state \(\phi (t)\) and action or input \(u(t) = \phi '(t)\):

$$\begin{aligned} \begin{array}{ll} \text{minimize} & \int _0^1 \left( \ell (\phi (t),u(t),t) + \lambda ^\mathrm {inst}R^\mathrm {inst}(u(t)-1)\right) \, dt\\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \phi '(t)=u(t), \quad 0\le t \le 1, \end{array} \end{aligned}$$
(5)

where \(\ell\) is the state-action cost function

$$\begin{aligned} \ell (u,v,t) = L(x(u)- y(t)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(u-t). \end{aligned}$$

There are many classical methods for numerically solving the optimal control problem (5), but these generally make strong assumptions about the loss and regularization functionals (such as smoothness), and do not solve the problem globally. We will instead solve (5) by brute force dynamic programming, which is practical since the state has dimension one, and so can be discretized.

Lasso and ridge regularization. Before describing how we solve the optimal control problem (5), we mention two types of regularization that are widely used in machine learning, and the types of warping functions that typically result from using them. They correspond to \(R^\mathrm {cum}\) and \(R^\mathrm {inst}\) being either \(u^2\) (quadratic, ridge, or Tikhonov regularization (Tikhonov and Arsenin 1977; Hansen 2005)) or |u| (absolute value, \(\ell _1\), or lasso regularization (Golub and Van Loan 2012, p. 564; Tibshirani 1996)).

With \(R^\mathrm {cum}(u)=u^2\), the regularization discourages large deviations between \(\tau\) and t, but not the rate at which \(\tau\) changes with t. With \(R^\mathrm {inst}(u)=u^2\), the regularization discourages large instantaneous warping rates. The larger \(\lambda ^\mathrm {cum}\) is, the less \(\tau\) deviates from t; the larger \(\lambda ^\mathrm {inst}\) is, the smoother the time warping function \(\phi\) is.

Using absolute value regularization is more interesting. It is well known in machine learning that using absolute value or \(\ell _1\) regularization leads to solutions in which the argument of the absolute value is sparse, that is, often zero (Boyd and Vandenberghe 2004). When \(R^\mathrm {cum}\) is the absolute value, we can expect many times when \(\tau = t\), that is, the warped time and true time are the same. When \(R^\mathrm {inst}\) is the absolute value, we can expect many times when \(\phi '(t) =1\), that is, the instantaneous rate of time warping is zero. Typically these regions grow larger as we increase the hyper-parameters \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\).

Discretized time formulation. To solve the problem (4) we discretize time with the N values

$$\begin{aligned} 0 = t_1< t_2< \cdots < t_N = 1. \end{aligned}$$

We will assume that \(\phi\) is piecewise linear with knot points at \(t_1, \ldots , t_N\); to describe it we only need to specify the warp values \(\tau _i = \phi (t_i)\) for \(i=1, \ldots , N\), which we express as a vector \(\tau \in {\mathbf{R}}^N\). We assume that the points \(t_i\) are closely enough spaced that the restriction to piecewise linear form is acceptable. The values \(t_i\) could be taken as the values at which the signal y is sampled (if it is given by samples), or just the default linear spacing, \(t_i = (i-1)/(N-1)\). The constraints \(\phi (0)=0\) and \(\phi (1)=1\) are expressed as \(\tau _1=0\) and \(\tau _N=1\).

Using a simple Riemann approximation of the integrals and the approximation

$$\begin{aligned} \phi '(t_i) = \frac{\phi (t_{i+1})-\phi (t_i)}{t_{i+1}-t_i} = \frac{\tau _{i+1}-\tau _i}{t_{i+1}-t_i}, \quad i=1, \ldots , N-1, \end{aligned}$$

we obtain the discretized objective

$$\begin{aligned} {\hat{f}} (\tau ) = \sum _{i=1}^{N-1} (t_{i+1} - t_i) \left( L(x(\tau _i)-y(t_i)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(\tau _i - t_i) + \lambda ^\mathrm {inst}R^\mathrm {inst}\left( \frac{\tau _{i+1} - \tau _{i}}{t_{i+1} - t_i} - 1\right) \right) . \end{aligned}$$
(6)

The discretized problem is to choose the vector \(\tau ^\star \in {\mathbf{R}}^N\) that minimizes \({\hat{f}}(\tau )\), subject to \(\tau _1=0\), \(\tau _N=1\). With \(\tau ^\star\), we can construct an approximation to the function \(\phi\) using piecewise-linear interpolation. The only approximation here is the discretization; we can use standard techniques based on bounds on derivatives of the functions involved to bound the deviation between the continuous-time objective \(f(\phi )\) and its discretized approximation \({\hat{f}}(\tau )\).
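The objective (6) transcribes directly into code; a sketch (here `x` and `y` are callables as in §2, and `L`, `R_cum`, `R_inst` are penalty functions as sketched above):

```python
import numpy as np

def f_hat(tau, t, x, y, L, R_cum, R_inst, lam_cum, lam_inst):
    """Discretized objective (6); tau and t are length-N arrays with
    tau[0] = 0 and tau[-1] = 1."""
    total = 0.0
    for i in range(len(t) - 1):
        dt = t[i+1] - t[i]
        slope = (tau[i+1] - tau[i]) / dt
        total += dt * (L(x(tau[i]) - y(t[i]))
                       + lam_cum * R_cum(tau[i] - t[i])
                       + lam_inst * R_inst(slope - 1))
    return total
```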

4 Dynamic programming with refinement

In this section we describe a simple method to minimize \({\hat{f}}(\tau )\) subject to \(\tau _1 =0\) and \(\tau _N=1\), i.e., to solve the optimal control problem (5) to obtain \(\tau ^*\). We first discretize the possible values of \(\tau _i\), whereupon the problem can be expressed as a shortest path problem on a graph, and then efficiently and globally solved using standard dynamic programming techniques. To reduce the error associated with the discretization of the values of \(\tau _i\), we choose a new discretization with the same number of values, but in a reduced range (and therefore, more finely spaced values) around the previously found values. This refinement converges in a few steps to a highly accurate solution of the discretized problem. Subject only to the reasonable assumption that the discretization of the original time and warped time are sufficiently fine, this method finds the global solution.

4.1 Dynamic programming

We now discretize the values that \(\tau _i\) is allowed to take:

$$\begin{aligned} \tau _i \in {\mathcal {T}}_i = \{\tau _{i1}, \ldots , \tau _{iM}\}, \quad i=1,\ldots , N. \end{aligned}$$

One choice for these discretized values is linear spacing between given lower and upper bounds on \(\tau _i\), \(0\le l_i \le u_i\le 1\):

$$\begin{aligned} \tau _{ij} = l_i + \frac{j-1}{M-1} (u_i-l_i), \quad j=1,\ldots , M, \quad i=1,\ldots , N. \end{aligned}$$

Here M is the number of values that we use to discretize each value of \(\tau _i\) (which we take to be the same for each i, for simplicity). We will assume that \(0 \in {\mathcal {T}}_1\) and \(1\in {\mathcal {T}}_N\), so the constraints \(\tau _1=0\) and \(\tau _N=1\) are feasible.

The bounds can be chosen as

$$\begin{aligned} l_i = \max \{ s^\mathrm {min}t_i , 1-s^\mathrm {max}(1-t_i) \}, \quad u_i = \min \{ s^\mathrm {max}t_i , 1-s^\mathrm {min}(1-t_i) \}, \quad i=1,\ldots , N, \end{aligned}$$
(7)

where \(s^\mathrm {min}\) and \(s^\mathrm {max}\) are the given minimum and maximum allowed values of \(\phi '\). This is illustrated in Fig. 3, where the nodes of \({\mathcal {T}}\) are drawn at position \((t_i,\tau _{ij})\), for \(N=30, M=20\) and various values of \(s^\mathrm {min}\) and \(s^\mathrm {max}\). Note that since \(N/M\) is the minimum nonzero slope representable on the grid, M should be chosen to satisfy \(M > N/s^\mathrm {max}\), a consideration that is automated in the provided software.
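The grid itself takes only a few lines to build; a sketch (the array `Tau` holds \(\tau_{ij}\), with row i linearly spaced between the bounds (7)):

```python
import numpy as np

def make_grid(N, M, s_min=0.5, s_max=2.0):
    t = np.linspace(0, 1, N)
    l = np.maximum(s_min * t, 1 - s_max * (1 - t))   # lower bounds (7)
    u = np.minimum(s_max * t, 1 - s_min * (1 - t))   # upper bounds (7)
    # Tau[i, j] = tau_ij; note Tau[0, :] = 0 and Tau[-1, :] = 1,
    # so the boundary constraints are feasible.
    Tau = l[:, None] + np.linspace(0, 1, M)[None, :] * (u - l)[:, None]
    return t, Tau
```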

Fig. 3

Left. Unconstrained grid. Left center. Effect of introducing \(s^\mathrm {min}\). Right center. Effect of \(s^\mathrm {max}\). Right. Typical parameters that work well for our method

The objective (6) splits into a sum of terms that are functions of \(\tau _i\), and terms that are functions of \(\tau _{i+1}-\tau _i\). (These correspond to the separable state-action loss function terms in the optimal control problem associated with \(\phi (t)\) and \(\phi '(t)\), respectively.) The problem is then globally solved by standard methods of dynamic programming (Bellman and Dreyfus 2015), using the methods we now describe.

We form a graph with MN nodes, associated with the values \(\tau _{ij}\), \(i=1,\ldots , N\) and \(j=1,\ldots , M\). (Note that i indexes the discretized values of t, and j indexes the discretized values of \(\tau\).) Each node \(\tau _{ij}\) with \(i<N\) has M outgoing edges that terminate at the nodes of the form \(\tau _{i+1,k}\) for \(k=1, \ldots , M\). The total number of edges is therefore \((N-1)M^2\). This is illustrated in Fig. 3, where the nodes are shown at the location \((t_i,\tau _{ij})\). (In practice M and N would be considerably larger.)

At each node \(\tau _{ij}\), we associate the node cost

$$\begin{aligned} (t_{i+1} - t_i) ( L(x(\tau _{ij})-y(t_i)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(\tau _{ij} - t_i) ), \end{aligned}$$

and on the edge from \(\tau _{ij}\) to \(\tau _{i+1,k}\), we associate the edge cost

$$\begin{aligned} (t_{i+1} - t_i) \left( \lambda ^\mathrm {inst}R^\mathrm {inst}\left( \frac{ \tau _{i+1,k} - \tau _{ij} }{ t_{i+1} - t_{i}} - 1 \right) \right) . \end{aligned}$$

With these node and edge costs, the objective \({\hat{f}}(\tau )\) is the total cost of a path starting at node \(\tau _{11}=0\) and ending at \(\tau _{NM}=1\). (Infeasible paths, for example ones for which \(\tau _{i+1,k}<\tau _{i,j}\), have cost \(+\infty\).) Our problem is therefore to find the shortest weighted path through a graph, which is readily done by dynamic programming.

The computational cost of dynamic programming is order \(NM^2\) flops (not counting the evaluation of the loss and regularization terms). With current hardware, it is entirely practical for \(M=N=1000\) or even (much) larger. The path found is the globally optimal one, i.e., \(\tau ^*\) minimizes \({\hat{f}}(\tau )\), subject to the discretization constraints on the values of \(\tau _i\).
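The full dynamic programming pass is short enough to sketch (unoptimized Python, assuming the `make_grid` sketch above and penalty functions `L`, `R_cum`, `R_inst` as before; the GDTW package implements this computation in C++ for speed):

```python
import numpy as np

def shortest_path(t, Tau, x, y, L, R_cum, R_inst, lam_cum, lam_inst):
    N, M = Tau.shape
    cost = np.full((N, M), np.inf)     # cost-to-arrive at each node
    prev = np.zeros((N, M), dtype=int)
    cost[0, :] = 0.0                   # all nodes in T_1 equal 0 by construction
    for i in range(N - 1):
        dt = t[i+1] - t[i]
        for j in range(M):             # node (i, j): node cost as above
            node = dt * (L(x(Tau[i, j]) - y(t[i]))
                         + lam_cum * R_cum(Tau[i, j] - t[i]))
            for k in range(M):         # edge (i, j) -> (i+1, k): edge cost as above
                edge = dt * lam_inst * R_inst((Tau[i+1, k] - Tau[i, j]) / dt - 1)
                c = cost[i, j] + node + edge
                if c < cost[i+1, k]:
                    cost[i+1, k] = c
                    prev[i+1, k] = j
    # Backtrack from the terminal value tau_N = 1 (the last row is all ones).
    path = [M - 1]
    for i in range(N - 1, 0, -1):
        path.append(prev[i, path[-1]])
    return Tau[np.arange(N), path[::-1]]
```

The triple loop makes the \(NM^2\) flop count explicit: N stages, with \(M^2\) candidate transitions per stage.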

4.2 Iterative refinement

After solving the problem above by dynamic programming, we can reduce the error induced by discretizing the values of \(\tau _i\) by updating \(l_i\) and \(u_i\). We shrink them both toward the current value of \(\tau ^*_i\), thereby reducing the gap between adjacent discretized values and reducing the discretization error. One simple method for updating the bounds is to reduce the range \(u_i-l_i\) by a fixed fraction \(\eta\), say 1/2 or 1/8.

To do this we set

$$\begin{aligned} l_i^{(q+1)} = \max \{ \tau ^{*(q)}_i - \eta \frac{u_i^{(q)} - l_i^{(q)}}{2}, l_i^{(0)} \}, \quad u_i^{(q+1)} = \min \{ \tau ^{*(q)}_i + \eta \frac{u_i^{(q)} - l_i^{(q)}}{2}, u_i^{(0)} \} \end{aligned}$$

in iteration \(q+1\), where the superscripts in parentheses indicate the iteration. Using the same data as Fig. 2, Fig. 4 shows the iterative refinement of \(\tau ^*\). Here, nodes of \({\mathcal {T}}\) are plotted at position \((t_i,\tau _{ij})\) as the grid is iteratively refined around \(\tau ^*_i\).
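One refinement step, as a sketch:

```python
import numpy as np

def refine_bounds(tau_star, l, u, l0, u0, eta=0.5):
    """Shrink [l_i, u_i] by a factor eta around the current solution tau_star,
    never expanding past the original bounds [l0_i, u0_i]."""
    half = eta * (u - l) / 2
    l_new = np.maximum(tau_star - half, l0)
    u_new = np.minimum(tau_star + half, u0)
    return l_new, u_new
```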

Fig. 4

Left to right. Iterative refinement of \(\tau ^*\) for iterations \(q=0,1,2,3\), with \(\tau ^*\) colored orange

4.3 Implementation

GDTW package. The algorithm described above has been implemented as the open source Python package GDTW, with the dynamic programming portion written in C++ for improved efficiency. The node costs are computed and stored in an \(M\times N\) array; the edge costs are computed on the fly, avoiding storage of an \(M\times M\times N\) array. For multiple iterations on group-level alignments (see §7), multi-threading is used to distribute the work across worker threads.

Performance. We give an example of the performance attained by GDTW using the real-world signals described in §5, which are uniformly sampled with \(N=1000\). Although the choice has no effect on runtime, we take square loss, square cumulative warp regularization, and square instantaneous warp regularization. We take \(M=100\).

The computations are carried out on a 4-core MacBook. Computing the node costs requires 0.0055 seconds, and computing the shortest path requires 0.0832 seconds. With refinement factor \(\eta = 0.15\), only three iterations are needed before no significant improvement is obtained, and the result is essentially the same for other choices of the algorithm parameters N, M, and \(\eta\). Over 10 trials, our method took an average of 0.25 seconds, using a graph size equivalent to a FastDTW radius of 50 (i.e., \(M=100\)). All of the data and example code necessary to reproduce these results are available in the GDTW repository, along with supplementary materials containing step-by-step instructions and demonstrations for reproducing them.

4.4 Validation

To test the generalization ability of a specific time warping model, parameterized by L, \(\lambda ^\mathrm {cum}\), \(R^\mathrm {cum}\), \(\lambda ^\mathrm {inst}\), and \(R^\mathrm {inst}\), we use out-of-sample validation. We form two increasing sequences, \(t^\mathrm {train}\in {\mathbf{R}}^{N^\mathrm {train}}\) and \(t^\mathrm {test}\in {\mathbf{R}}^{N^\mathrm {test}}\), by randomly sampling without replacement from the N discretized time values, and we include the boundaries \(t_1 = 0\) and \(t_N = 1\) in each sequence. Using only the time points in \(t^\mathrm {train}\), we obtain our time warping function \(\phi\) by minimizing our discretized objective (6). (Recall that since our method does not require signals to be sampled at regular intervals, it will work with the irregularly spaced time points in \(t^\mathrm {train}\).)

We compute two loss values: a training error

$$\begin{aligned} \ell ^\mathrm {train} = \sum _{i=1}^{N^\mathrm {train}-1} (t^\mathrm {train}_{i+1} - t^\mathrm {train}_{i} )\, L(x(\phi (t^\mathrm {train}_i ))- y(t^\mathrm {train}_i )), \end{aligned}$$

and a test error

$$\begin{aligned} \ell ^\mathrm {test} = \sum _{i=1}^{N^\mathrm {test}-1} (t^\mathrm {test}_{i+1}-t^\mathrm {test}_i )\, L(x(\phi (t^\mathrm {test}_i ))-y(t^\mathrm {test}_i )). \end{aligned}$$

Figure 5 shows \(\ell ^\mathrm {test}\) over a grid of values of \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\), for a partitioning where \(t^\mathrm {train}\) and \(t^\mathrm {test}\) each contain 50% of the time points. In this example, we use the signals shown in Fig. 1.
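The split itself is simple; a sketch (a hypothetical helper, with the boundary points kept in both sequences as described above):

```python
import numpy as np

def split_times(t, frac_train=0.5, seed=0):
    """Partition the interior time points into train/test at random;
    both sequences keep the boundary points t[0] = 0 and t[-1] = 1."""
    rng = np.random.default_rng(seed)
    interior = rng.permutation(np.arange(1, len(t) - 1))
    n_train = int(frac_train * len(interior))
    train_idx = np.sort(np.r_[0, interior[:n_train], len(t) - 1])
    test_idx = np.sort(np.r_[0, interior[n_train:], len(t) - 1])
    return t[train_idx], t[test_idx]
```

The warp is then fit on the first sequence, and both losses are evaluated with the sums above.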

Fig. 5

Test loss

4.4.1 Ground truth estimation

When a ground truth warping function \(\phi ^\mathrm {true}\) is available, we can score how well our \(\phi\) approximates \(\phi ^\mathrm {true}\) by computing the following errors:

$$\begin{aligned} \epsilon ^\mathrm {train}=\sum _{i=1}^{N^\mathrm {train}-1}\left( t^\mathrm {train}_{i+1}-t^\mathrm {train}_{i}\right) L(\phi ^\mathrm {true}(t^\mathrm {train}_{i})-\phi (t^\mathrm {train}_{i})), \end{aligned}$$

and

$$\begin{aligned} \epsilon ^\mathrm {test}=\sum _{i=1}^{N^\mathrm {test}-1}\left( t^\mathrm {test}_{i+1}-t^\mathrm {test}_{i}\right) L(\phi ^\mathrm {true}(t^\mathrm {test}_{i})-\phi (t^\mathrm {test}_{i})). \end{aligned}$$

In the example shown in Fig. 5, the target signal y is constructed by composing x with a known warping function \(\phi ^\mathrm {true}\), so that \(y(t) = (x\circ \phi ^\mathrm {true})(t)\). Figure 6 shows the contours of \(\epsilon ^\mathrm {test}\) for this example.
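Such a synthetic pair is easy to construct; a sketch with our own toy warp (assuming `x` is a callable signal on [0, 1]):

```python
import numpy as np

def phi_true(t):
    return t + 0.1 * np.sin(np.pi * t)   # a known, valid warp (our toy choice)

def make_target(x):
    return lambda t: x(phi_true(t))      # y = x o phi_true

# A recovered warp phi can then be scored against phi_true using the
# held-out sums epsilon_train and epsilon_test defined above.
```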

Fig. 6

Test error

5 Examples

We present a few examples of alignments using our method. Figure 7 is a synthetic example of different types of time warping functions. Figure 8 is a real-world example using biological signals (ECGs). We compare our method, using varying amounts of regularization \(\lambda ^\mathrm {inst}\in \{0.01,0.1,0.5 \}\) with \(N=1000,M=100\), to FastDTW (Salvador and Chan 2007) with the equivalent graph size (\(N=1000\), \(\mathrm {radius}=50\)). As expected, the alignments using regularization are smoother and less prone to singularities than those from FastDTW, which are unregularized. Figure 9 shows how the time warp functions become smoother as \(\lambda ^\mathrm {inst}\) grows.

Fig. 7

Left. Signal x and target signal y. Middle. Warping function \(\phi\) and the ground truth warping \(\phi ^\mathrm {true}\). Right. The time-warped x and y

Fig. 8

Top four. ECGs warped using our method while increasing \(\lambda ^\mathrm {inst}\). Bottom. Results using FastDTW, with a few of the singularities circled in red

Fig. 9

Left. \(\phi (t)\). Right. \(\phi (t)-t\) for ECGs. (Smoother lines correspond to larger \(\lambda ^\mathrm {inst}\).)

6 Extensions and variations

We will show how to extend our formulation to address more complex scenarios, such as aligning a portion of a signal to the target, regularizing higher-order derivatives, and symmetric time warping, in which both signals are warped toward each other.

6.1 Alternate boundary and slope constraints

We can align a portion of a signal with the target by adjusting the boundary constraints to allow \(0 \le \phi (0) \le \beta\) and \(1-\beta \le \phi (1) \le 1\). We incorporate this by reformulating (7) as

$$\begin{aligned} l_i = \max \{ s^\mathrm {min}t_i , (1-s^\mathrm {max}(1-t_i))-\beta \}, \quad u_i = \min \{ s^\mathrm {max}t_i + \beta , 1 -s^\mathrm {min}(1-t_i) \}, \end{aligned}$$

for \(i=1,\ldots , N\).

We can also allow the slope of \(\phi\) to be negative, by choosing \(s^\mathrm {min}< 0\). These modifications are illustrated in Fig. 10, where the nodes of \({\mathcal {T}}\) are drawn at position \((t_i,\tau _{ij})\), for \(N=30, M=20\) and various values of \(\beta\), \(s^\mathrm {min}\), and \(s^\mathrm {max}\).

Fig. 10

Left. Effect of introducing \(\beta\) to unconstrained grid. Left center. Effect of introducing \(\beta\) using typical parameters. Right center. Effect of introducing \(\beta\) using larger \(s^\mathrm {min}\). Right. Effect of negative \(s^\mathrm {min}\)

6.2 Penalizing higher-order derivatives

We can extend the formulation to include a constraint or objective term on the higher-order derivatives, such as the second derivative \(\phi ''\). This requires us to extend the discretized state space to include not just the current M values, but also the last M values, so the state space size grows to \(M^2\) in the dynamic programming problem.

In the continuous formulation, the regularization functional for the second-order instantaneous warp is

$$\begin{aligned} {\mathcal {R}}^\mathrm {inst^2}(\phi ) = \int _{0}^{1} R^\mathrm {inst^2}(\phi ''(t)) \, dt, \end{aligned}$$

where \(R^\mathrm {inst^2}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is the penalty function on the second-order instantaneous rate of time warping. Like the function \(R^\mathrm {inst}\), \(R^\mathrm {inst^2}\) can take on the value \(+\infty\), which allows us to enforce constraints on \(\phi ''\).

With this additional regularization functional, we can reformulate the problem in (4) as

$$\begin{aligned} \begin{array}{ll} \text{minimize} & f(\phi ) = {\mathcal {L}}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) + \lambda ^\mathrm {inst^2}{\mathcal {R}}^\mathrm {inst^2}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \end{array} \end{aligned}$$
(8)

where \(\lambda ^\mathrm {inst^2}\in {\mathbf{R}}_{+}\) is a positive hyper-parameter.

To solve the problem (8), we can include a discretized version of \({\mathcal {R}}^\mathrm {inst^2}\) inside the discretized objective (6). We propose a discretized formulation of \({\mathcal {R}}^\mathrm {inst^2}\) using a three-point central difference approximation of the second derivative. Note that this approximation depends on the spacing of the time points. For regularly spaced time points, we can use

$$\begin{aligned} \phi ''(t_i) = \frac{\phi (t_{i+1})-2\phi (t_i)+\phi (t_{i-1})}{(t_{i+1}-t_{i})^2} = \frac{\tau _{i+1}-2\tau _i+\tau _{i-1}}{(t_{i+1}-t_{i})^2}, \quad i=2, \ldots , N-1, \end{aligned}$$

and for irregularly spaced time points, we can use

$$\begin{aligned} \phi ''(t_i) = \frac{2(\delta _1 \phi (t_{i+1}) - (\delta _1 + \delta _2)\phi (t_i) + \delta _2 \phi (t_{i-1}))}{\delta _1 \delta _2 (\delta _1 + \delta _2)} = \frac{2(\delta _1 \tau _{i+1} - (\delta _1 + \delta _2)\tau _i + \delta _2 \tau _{i-1})}{ \delta _1 \delta _2 (\delta _1 + \delta _2)}, \end{aligned}$$

for \(i=2, \ldots , N-1\), where \(\delta _1 = t_{i}-t_{i-1}\) and \(\delta _2 = t_{i+1}-t_{i}\).
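The irregular-grid expression translates directly into code; a sketch, valid at interior points:

```python
def phi_ddot(tau, t, i):
    """Three-point approximation of phi'' at an interior time t[i]
    (0-indexed: 1 <= i <= len(t) - 2) on an irregular grid; reduces to
    the usual central difference when the spacing is uniform."""
    d1 = t[i] - t[i-1]
    d2 = t[i+1] - t[i]
    return (2 * (d1 * tau[i+1] - (d1 + d2) * tau[i] + d2 * tau[i-1])
            / (d1 * d2 * (d1 + d2)))
```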

6.3 General loss

The two signals need not be vector valued; they could have categorical values, for example

$$\begin{aligned} L(\tau _i, t_i)= \left\{ \begin{array}{ll} 1 & \tau _i \ne t_i\\ 0 & \text{otherwise}, \end{array}\right. \end{aligned}$$

or

$$\begin{aligned} L(\tau _i, t_i)= \left\{ \begin{array}{ll} g(\tau _i,t_i) & \tau _i \ne t_i\\ 0 & \text{otherwise}, \end{array}\right. \end{aligned}$$

where \(g:{\mathbf{R}}_{++} \times {\mathbf{R}}_{++} \rightarrow {\mathbf{R}}\) is a categorical distance function that can specify the cost of certain mismatches or a similarity matrix (Needleman and Wunsch 1970).

Another example could use the Earth mover’s distance, \(\text{ EMD }:{\mathbf{R}}^n \times {\mathbf{R}}^n \rightarrow {\mathbf{R}}\), between two short-time spectra

$$\begin{aligned} L(\phi , t_i) = \text{ EMD }( \{ \phi (t_i - \rho ), \ldots , \phi (t_i), \ldots , \phi (t_i + \rho ) \}, \{ t_i - \rho , \ldots , t_i , \ldots , t_i + \rho \} ), \end{aligned}$$

where \(\rho \in {\mathbf{R}}\) is a radius around time point \(t_i\).

6.4 Symmetric time warping

Until this point, we have used unidirectional time warping, where signal x is time-warped to align with y, so that \(x\circ \phi \approx y\). We can also perform bidirectional time warping, where signals x and y are both time-warped toward each other. Bidirectional time warping results in two time warp functions, \(\phi\) and \(\psi\), with \(x \circ \phi \approx y \circ \psi\).

Bidirectional time warping requires a different loss functional. Here we define the bidirectional loss associated with time warp functions \(\phi\) and \(\psi\), on the two signals x and y, as

$$\begin{aligned} {\mathcal {L}}(\phi ,\psi ) = \int _{0}^{1} L(x(\phi (t))- y(\psi (t))) \, dt, \end{aligned}$$

where we distinguish the bidirectional case by using two arguments, \({\mathcal {L}}(\phi ,\psi )\), instead of one, \({\mathcal {L}}(\phi )\), as in (1).

Bidirectional time warping can be symmetric or asymmetric. In the symmetric case, we choose \(\phi , \psi\) by solving the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & {\mathcal {L}}(\phi ,\psi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \psi (t)=2t-\phi (t), \end{array} \end{aligned}$$

where the constraint \(\psi (t)=2t-\phi (t)\) ensures that \(\phi\) and \(\psi\) are symmetric about the identity. The symmetric case does not add additional computational complexity, and can be readily solved using the iterative refinement procedure described in §4.
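Since \(\psi\) is eliminated by the constraint, the only change to the dynamic program of §4 is the loss term in the node cost; a sketch (the constant extension of y outside [0, 1] handles any values of \(2t-\phi(t)\) that fall outside the interval):

```python
def symmetric_loss_term(tau_i, t_i, x, y, L):
    # With psi(t) = 2t - phi(t), the loss compares x(phi(t_i))
    # with y(2 t_i - phi(t_i)); all other costs are unchanged.
    return L(x(tau_i) - y(2 * t_i - tau_i))
```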

In the asymmetric case, \(\phi , \psi\) are chosen by solving the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & {\mathcal {L}}(\phi ,\psi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\psi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\psi )\\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \psi (0)=0, \quad \psi (1)=1. \end{array} \end{aligned}$$

The asymmetric case requires \(R^\mathrm {cum}\), \(R^\mathrm {inst}\) to allow negative slopes for \(\psi\). Further, it requires a modified iterative refinement procedure (not described here) with an increased complexity of order \(NM^4\) flops, which is impractical when M is not small.

7 Time-warped distance, centering, and clustering

In this section we describe three simple extensions of our optimization formulation that yield useful methods for analyzing a set of signals \(x_1, \ldots , x_M\).

7.1 Time-warped distance

For signals x and y, we can interpret the optimal value of (4) as the time-warped distance between x and y, denoted D(x, y). (Note that this distance measure takes into account both the loss and the regularization, which measures how much warping was needed.) When \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\) are zero, we recover the unconstrained DTW distance (Sakoe and Chiba 1978). This distance is not symmetric; we can (and usually do) have \(D(x,y)\ne D(y,x)\). If a symmetric distance is preferred, we can take \((D(x,y)+D(y,x))/2\), or the optimal value of the group alignment problem (9), with the set of original signals \(\{x, y\}\).

The warp distance can be used in many places where a conventional distance between two signals is used. For example, we can use warp distance to carry out k nearest neighbors regression (Xi et al 2006) or classification. Warp distance can also be used to create features for further machine learning. For example, suppose that we have carried out clustering into K groups, as discussed in §7.3 below, with target or group centers or exemplar signals \(y_1, \ldots , y_K\). From these we can create a set of K features related to the warp distance of a new signal x to the centers \(y_1, \ldots , y_K\), as

$$\begin{aligned} z_i = \frac{e^{d_i/\sigma }}{\sum _{j=1}^K e^{d_j/\sigma }}, \quad i=1, \ldots , K, \end{aligned}$$

where \(d_i = D(x,y_i)\), and \(\sigma \in {\mathbf{R}}_{+}\) is a positive (scale) hyper-parameter.
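A sketch of the feature construction (assuming a hypothetical helper `warp_distance(x, y)` that returns the optimal value of problem (4)):

```python
import numpy as np

def warp_features(x, templates, warp_distance, sigma=1.0):
    """Features z_1, ..., z_K for a new signal x, given group
    templates y_1, ..., y_K."""
    d = np.array([warp_distance(x, y) for y in templates])
    e = np.exp(d / sigma)
    return e / e.sum()   # z_i as defined above
```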

7.2 Time-warped alignment and centering

In time-warped alignment, the goal is to find a common target signal \(\mu\) that each of the original signals can be warped to, at low cost. We pose this in the natural way as the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L(x_i(\phi _i(t)) - \mu (t)) \, dt + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \end{array} \end{aligned}$$
(9)

where the variables are the warp functions \(\phi _1, \ldots , \phi _M\) and the target \(\mu\), and \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters. The objective is the sum of the objectives for time warping each \(x_i\) to \(\mu\). This is very much like our basic formulation (4), except that we have multiple signals to warp, and the target \(\mu\) is also a variable that we can choose.

The problem (9) is hard to solve exactly, but a simple iterative procedure seems to work well. We observe that if we fix the target \(\mu\), the problem splits into M separate dynamic time warping problems that we can solve (separately, in parallel) using the method described in §4. Conversely, if we fix the warping functions \(\phi _1, \ldots , \phi _M\), we can optimize over \(\mu\) by minimizing

$$\begin{aligned} \sum _{i=1}^M \int _0^1 L(x_i(\phi _i(t)) - \mu (t)) \; dt. \end{aligned}$$

This amounts to choosing each \(\mu (t)\) to minimize

$$\begin{aligned} \sum _{i=1}^M L(x_i(\phi _i(t))- \mu (t)). \end{aligned}$$

This is typically easy to do; for example, with square loss, we choose \(\mu (t)\) to be the mean of \(x_i(\phi _i(t))\); with absolute value loss, we choose \(\mu (t)\) to be the median of \(x_i(\phi _i(t))\).

This method of alternating between updating the target \(\mu\) and updating the warp functions (in parallel) typically converges quickly. However, it need not converge to the global minimum. One simple initialization is to start with no warping, i.e., \(\phi _i(t)=t\). Another is to choose one of the original signals as the initial value for \(\mu\).
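In code, the alternating updates look like this (a sketch for square loss, so the target update is a pointwise mean; `solve_warp` is a hypothetical single-pair solver as in §4, returning warp values on the grid t):

```python
import numpy as np

def align(signals, t, solve_warp, n_iters=5):
    """Alternating minimization for problem (9) with square loss.
    signals: list of callables; returns mu sampled on t, and the warps."""
    taus = [t.copy() for _ in signals]     # initialize with no warping
    mu = None
    for _ in range(n_iters):
        # mu-update: with square loss, mu(t) is the mean of x_i(phi_i(t)).
        mu = np.mean([x(tau) for x, tau in zip(signals, taus)], axis=0)
        # phi-updates: M independent warping problems, solvable in parallel.
        taus = [solve_warp(x, mu, t) for x in signals]
    return mu, taus
```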

As a variation, we can also require the warping functions to be evenly arranged about a common time warp center, for example \(\phi (t) = t\). We can do this by imposing a centering constraint on (9),

$$\begin{aligned} \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L(x_i(\phi _i(t)) - \mu (t)) \, dt + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \quad \frac{1}{M}\sum _{i=1}^M \phi _i(t) = t , \end{array} \end{aligned}$$
(10)

where \(\frac{1}{M}\sum _{i=1}^M \phi _i(t) = t\) forces \(\phi _1, \ldots , \phi _M\) to be evenly distributed around the identity \(\phi (t)=t\). The resulting centered time warp functions can be used to produce a centered time-warped mean. Figure 11 compares a time-warped mean with and without centering, using synthetic data consisting of multi-modal signals from (Srivastava et al 2011).

Fig. 11

Top. Time-warped mean. Bottom. Centered time-warped mean. Left. Original signals. Left center. Warped signals after iteration 1. Right center. Warped signals after iteration 2. Right. Time warp functions after iteration 2

Fig. 12

Top. ECG signals. Bottom. Engine sensor signals. Left. Original signals. Left center. Warped signals after iteration 1. Right center. Warped signals after iteration 2. Right. Time warp functions after iteration 2

Figure 12 shows examples of centered time-warped means of real-world data (using our default parameters), consisting of ECGs and sensor data from an automotive engine (Abou-Nasr and Feldkamp 2008). The ECG example demonstrates that subtle features of the input sequences are preserved in the alignment process, and the engine example demonstrates that the alignment process can find structure in noisy data.

7.3 Time-warped clustering

A further generalization of our optimization formulation allows us to cluster a set of signals \(x_1, \ldots , x_M\) into K groups, with each group having a template or center or exemplar. This can be considered a time-warped version of K-means clustering; see, e.g., (Boyd and Vandenberghe 2018, Chapter 4). To describe the clusters we use the M-vector c, with \(c_i = j\) meaning that signal \(x_i\) is assigned to group j, where \(j\in \{1,\ldots ,K\}\). The exemplars or templates are the signals denoted \(y_1, \ldots , y_K\). We pose the clustering problem as the optimization problem

$$\begin{aligned} \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L(x_i(\phi _i(t)) - y_{c_i}(t)) \, dt + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \end{array} \end{aligned}$$
(11)

where the variables are the warp functions \(\phi _1, \ldots , \phi _M\), the templates \(y_1, \ldots , y_K\), and the assignment vector c. As above, \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters.

We solve this (approximately) by cyclically optimizing over the warp functions, the templates, and the assignments. Figure 13 shows an example of this procedure (using our default parameters) on a set of sinusoidal, square, and triangular signals of varying phase and amplitude.
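A sketch of one way to organize the cyclic updates (square loss; `solve_warp` and `warp_distance` are hypothetical helpers as above):

```python
import numpy as np

def cluster(signals, t, K, solve_warp, warp_distance, n_iters=5, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize templates with K of the original signals, sampled on t.
    templates = [signals[i](t) for i in rng.choice(len(signals), K, replace=False)]
    taus = [t.copy() for _ in signals]
    for _ in range(n_iters):
        # Assignment update: nearest template in time-warped distance.
        c = [int(np.argmin([warp_distance(x, y) for y in templates]))
             for x in signals]
        # Warp update: align each signal to its assigned template.
        taus = [solve_warp(x, templates[c[i]], t) for i, x in enumerate(signals)]
        # Template update: pointwise mean of assigned, warped signals (square loss).
        for j in range(K):
            members = [signals[i](taus[i]) for i in range(len(signals)) if c[i] == j]
            if members:
                templates[j] = np.mean(members, axis=0)
    return c, templates, taus
```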

Fig. 13

K-means alignment on synthetic data

8 Conclusion

We claim three main contributions. We propose a full reformulation of DTW in continuous time that eliminates singularities without the need for preprocessing or step patterns. Because our formulation allows for non-uniformly sampled signals, we are the first to demonstrate how out-of-sample validation can be used on a single signal for selecting DTW hyper-parameters. Finally, we distribute our C++ implementation as an open-source Python package called GDTW.