Abstract
The goal of dynamic time warping is to transform or warp time in order to approximately align two signals. We pose the choice of warping function as an optimization problem with several terms in the objective. The first term measures the misalignment of the time-warped signals. Two additional regularization terms penalize the cumulative warping and the instantaneous rate of time warping; constraints on the warping can be imposed by assigning the value \(+\infty\) to the regularization terms. Different choices of the three objective terms yield different time warping functions that trade off signal fit or alignment and properties of the warping function. The optimization problem we formulate is a classical optimal control problem, with initial and terminal constraints, and a state dimension of one. We describe an effective general method that minimizes the objective by discretizing the values of the original and warped time, and using standard dynamic programming to compute the (globally) optimal warping function with the discretized values. Iterated refinement of this scheme yields a high accuracy warping function in just a few iterations. Our method is implemented as an open source Python package GDTW.
1 Background
The goal of dynamic time warping (DTW) is to find a time warping function that transforms, or warps, time in order to approximately align two signals together (Sakoe and Chiba 1978). At the same time, we prefer that the time warping be as gentle as possible, in some sense, or we require that it satisfy some requirements.
DTW is a versatile tool used in many scientific fields, including biology, economics, signal processing, finance, and robotics. It can be used to measure a realistic distance between two signals, usually by taking the distance between them after one is time-warped. In another case, the distance can be the minimum amount of warping needed to align one signal to the other with some level of fidelity. Time warping can be used to develop a simple model of a signal, or to improve a predictor; as a simple example, a suitable time warping can lead to a signal being well fit by an auto-regressive or other model. DTW can also be used for pattern recognition and searching for a match among a database of signals (Rakthanmanon et al 2012). It can be employed in any machine-learning application that relies on signals, such as PCA, clustering, regression, logistic regression, or multi-class classification. (We return to this topic in §7.)
Almost all DTW methods are based on the original DTW algorithm (Sakoe and Chiba 1978), which uses dynamic programming to compute a time warping path that minimizes misalignments in the time-warped signals while satisfying monotonicity, boundary, and continuity constraints. The monotonicity constraint ensures that the path represents a monotone increasing function of time. The boundary constraint enforces that the warping path begins at the origin point of both signals and ends at their terminal points. The continuity constraint restricts transitions in the path to adjacent points in time.
Despite its popularity, DTW has a longstanding problem with producing sharp irregularities in the time warp function that cause many time points of one signal to be erroneously mapped onto a single point, or “singularity,” in the other signal. Most of the literature on reducing the occurrence of singularities falls into two camps: preprocessing the input signals, and variations on continuity constraints. Preprocessing techniques rely on transformations of the input signals, which make them smoother or emphasize features or landmarks, to indirectly influence the smoothness of the warping function. Notable approaches use combinations of first and second derivatives (Keogh and Pazzani 2001; Marron et al 2015; Singh et al 2008), square-root velocity functions (Srivastava et al 2011), adaptive down-sampling (Dupont and Marteau 2015), and ensembles of features including wavelet transforms, derivatives, and several others (Zhao and Itti 2016). Variations of the continuity constraints relax the restriction on transitions in the path, which allows smoother warping paths to be chosen. Instead of only restricting transitions to one of three neighboring points in time, as in the original DTW algorithm, these variations expand the set of allowable points to those specified by a “step pattern,” of which there are many, including symmetric or asymmetric, types I-IV, and sub-types a-d (Sakoe and Chiba 1978; Itakura 1975; Myers et al 1980; Rabiner and Juang 1993). While preprocessing and step patterns may result in smoother warping functions, they are ad-hoc techniques that often require hand-selection for different types of input signals.
We propose to handle these issues entirely within an optimization framework in continuous time. Here we pose DTW as an optimization problem with several penalty terms in the objective. The basic term in our objective penalizes misalignments in the time-warped signals, while two additional terms penalize (and constrain) the time warping function. One of these terms penalizes the cumulative warping, which limits over-fitting similar to “ridge” or “lasso” regularization (Tikhonov and Arsenin 1977; Tibshirani 1996). The other term penalizes the instantaneous rate of time warping, which produces smoother warping functions, an idea previously proposed in (Green and Silverman 1993; Ramsay and Silverman 2005, 2007; Srivastava and Klassen 2016).
Our formulation offers almost complete freedom in choosing the functions used to compare the sequences, and to penalize the warping function. We include constraints on the fit and warping functions by allowing these functions to take on the value \(+\infty\). Traditional penalty functions include the square or absolute value. Less traditional but useful ones include for example the fraction of time the two signals are within some threshold distance, or a minimum or maximum on the cumulative warping function. The choice of these functions, and how much they are scaled with respect to each other, gives a very wide range of choices for potential time warpings.
Our continuous time formulation allows for non-uniformly sampled signals, which lets us use simple out-of-sample validation techniques to help guide the choice of time warping penalties; in particular, we can determine whether a time warp is ‘over-fit’. Our handling of missing data in the input signals is useful in itself since real-world data often have missing entries. There are many examples of using validation to select hyper-parameters, such as the “warping window,” but these split a dataset of signals into test signals and training signals (Dau et al 2018). To the best of our knowledge, ours is the first use of out-of-sample validation for model selection in DTW in which the test and training datasets are built by partitioning samples from a single signal.
We develop a single, efficient algorithm that solves our formulation, independent of the particular choices of the penalty functions. Our algorithm uses dynamic programming to exactly solve a discretized version of the problem with linear time complexity, coupled with iterative refinement at higher and higher resolutions. Our discretized formulation can be thought of as generalizing the Itakura parallelogram (Itakura 1975); the iterated refinement scheme is similar in nature to FastDTW (Salvador and Chan 2007). We offer our implementation as open source C++ code with an intuitive Python package called GDTW (https://github.com/dderiso/gdtw).
We describe several extensions and variations of our method. In one extension, we extend our optimization framework to find a time-warped center of or template for a set of signals; in a further extension, we cluster a set of signals into groups, each of which is time-warped into one of a set of templates or prototypes.
2 Dynamic time warping
Signals. A (vector-valued) signal f is a function \(f:[a,b] \rightarrow {\mathbf{R}}^d\), with argument time. A signal can be specified or described in many ways, for example a formula, or via a sequence of samples along with a method for interpolating the signal values in between samples. For example we can describe a signal as taking values \(s_1, \ldots , s_N \in {\mathbf{R}}^d\), at points (times) \(a\le t_1< t_2< \cdots < t_N \le b\), with linear interpolation in between these values and a constant extension outside the first and last values:
\[
f(t) = \begin{cases} s_1 & a \le t < t_1 \\ \dfrac{t_{i+1}-t}{t_{i+1}-t_i}\, s_i + \dfrac{t-t_i}{t_{i+1}-t_i}\, s_{i+1} & t_i \le t < t_{i+1}, \quad i=1,\ldots ,N-1 \\ s_N & t_N \le t \le b. \end{cases}
\]
For simplicity, we will consider signals on the time interval [0, 1].
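The sampled description of a signal can be sketched in a few lines of Python. This is only an illustration for scalar signals (d = 1); the helper name `make_signal` is hypothetical and is not part of GDTW. It relies on `numpy.interp`, which interpolates linearly between samples and holds the first and last sample values outside their range.

```python
import numpy as np

def make_signal(t_samples, s_samples):
    """Return f(t): linear interpolation between samples, constant outside them."""
    t_samples = np.asarray(t_samples, dtype=float)
    s_samples = np.asarray(s_samples, dtype=float)

    def f(t):
        # np.interp keeps the first/last sample value outside [t_1, t_N].
        return np.interp(t, t_samples, s_samples)

    return f

# Three irregularly spaced samples on [0, 1].
f = make_signal([0.2, 0.5, 0.8], [1.0, 3.0, 2.0])
values = f(np.array([0.0, 0.35, 0.5, 1.0]))
```

Evaluating before the first sample or after the last one returns the boundary sample values, which matches the constant extension described above.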
Time warp function. Suppose \(\phi :[0,1]\rightarrow [0,1]\) is increasing, with \(\phi (0)=0\) and \(\phi (1)=1\). We refer to \(\phi\) as the time warp function, and \(\tau = \phi (t)\) as the warped time associated with real or original time t. When \(\phi (t)=t\) for all t, the warped time is the same as the original time. In general we can think of
\[
\tau - t = \phi (t) - t
\]
as the amount of cumulative warping at time t, and
\[
\phi '(t) - 1
\]
as the instantaneous rate of time warping at time t. These are both zero when \(\phi (t)=t\) for all t.
Time-warped signal. If x is a signal, we refer to the signal \({\tilde{x}} = x\circ \phi\), i.e.,
\[
{\tilde{x}}(t) = x(\phi (t)), \quad t \in [0,1],
\]
as the time-warped signal, or the time-warped version of the signal x.
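As a small numerical illustration of these definitions, the sketch below evaluates a time-warped signal together with the cumulative warp \(\phi (t)-t\) and a finite-difference estimate of the instantaneous rate \(\phi '(t)-1\). The warp \(\phi (t)=t^2\) and the sinusoidal signal are arbitrary choices made for the example.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 101)
phi = t ** 2                              # increasing, phi(0) = 0, phi(1) = 1
x = lambda s: np.sin(2 * np.pi * s)       # example scalar signal

x_warped = x(phi)                         # time-warped signal (x o phi)(t)
cumulative = phi - t                      # cumulative warping phi(t) - t
rate = np.diff(phi) / np.diff(t) - 1.0    # slope on each interval, minus one

# For the identity warp both quantities vanish.
identity_cumulative = t - t
identity_rate = np.diff(t) / np.diff(t) - 1.0
```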
Dynamic time warping. Suppose we are given two signals x and y. Roughly speaking, the dynamic time warping problem is to find a warping function \(\phi\) so that \({\tilde{x}} = x\circ \phi \approx y\). In other words, we wish to warp time so that the time-warped version of the first signal is close to the second one. We refer to the signal y as the target, since the goal is to warp x to match, or align with, the target.
Example. An example is shown in Fig. 1. The top plot shows a scalar signal x and target signal y, and the bottom plot shows the time-warped signal \({\tilde{x}} = x\circ \phi\) and y. The middle plot shows the correspondence between x and y associated with the warping function \(\phi\). Figure 2 shows the time warping function; the next plot is the cumulative warp, and the next is the instantaneous rate of time warping.
3 Optimization formulation
We will formulate the dynamic time warping problem as an optimization problem, where the time warp function \(\phi\) is the (infinite-dimensional) optimization variable to be chosen. Our formulation is very similar to those used in machine learning, where a fitting function is chosen to minimize an objective that includes a loss function that measures the error in fitting the given data, and regularization terms that penalize the complexity of the fitting function (Friedman et al 2001).
Loss functional. Let \(L: {\mathbf{R}}^d \rightarrow {\mathbf{R}}\) be a vector penalty function. We define the loss associated with a time warp function \(\phi\), on the two signals x and y, as
\[
{\mathcal {L}}(\phi ) = \int _0^1 L\big (x(\phi (t)) - y(t)\big )\, dt, \tag{1}
\]
the average value of the penalty function of the difference between the time-warped first signal and the second signal. The smaller \({\mathcal {L}}(\phi )\) is, the better we consider \({\tilde{x}} =x\circ \phi\) to approximate y.
Simple choices of the penalty include \(L(u)=\Vert u\Vert _2^2\) or \(L(u)=\Vert u\Vert _1\). The corresponding losses are the mean-square deviation and mean-absolute deviation, respectively. One useful variation is the Huber penalty (Huber 2011; Boyd and Vandenberghe 2004),
\[
L(u) = \begin{cases} \Vert u\Vert _2^2 & \Vert u\Vert _2 \le M \\ 2M\Vert u\Vert _2 - M^2 & \Vert u\Vert _2 > M, \end{cases}
\]
where \(M>0\) is a parameter. The Huber penalty coincides with the least squares penalty for small u, but grows more slowly for u large, and so is less sensitive to outliers. Many other choices are possible, for example
\[
L(u) = \begin{cases} 1 & \Vert u\Vert > \epsilon \\ 0 & \text{otherwise}, \end{cases}
\]
where \(\epsilon \in {\mathbf{R}}_{+}\) is a positive parameter. The associated loss \({\mathcal {L}}(\phi )\) is the fraction of time the time-warped signal is farther than \(\epsilon\) from the second signal (measured by the norm \(\Vert \cdot \Vert\)).
The choice of penalty function L (and therefore loss functional \({\mathcal {L}}\)) will influence the warping found, and should be chosen to capture the notion of approximation appropriate for the given application.
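The penalties mentioned above are easy to write down as plain functions. The sketch below is illustrative only; the function names are hypothetical, and the Huber penalty is written in the common form that equals \(\Vert u\Vert _2^2\) for \(\Vert u\Vert _2\le M\) and \(2M\Vert u\Vert _2 - M^2\) otherwise, which is an assumption about the exact parameterization.

```python
import numpy as np

def square(u):
    """Squared Euclidean norm."""
    return float(np.sum(np.square(u)))

def absolute(u):
    """l1 norm."""
    return float(np.sum(np.abs(u)))

def huber(u, M=1.0):
    """Quadratic for small deviations, linear growth beyond radius M."""
    r = float(np.linalg.norm(u))
    return r ** 2 if r <= M else 2.0 * M * r - M ** 2

def outside_threshold(u, eps=0.1):
    """1 when the deviation exceeds eps, else 0 (averages to a fraction of time)."""
    return 1.0 if np.linalg.norm(u) > eps else 0.0
```

The Huber penalty is continuous at the switch point: at \(\Vert u\Vert _2 = M\) both branches give \(M^2\).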
Cumulative warp regularization functional. We express our desired qualities for or requirements on the time warp function using a regularization functional for the cumulative warp,
\[
{\mathcal {R}}^\mathrm {cum}(\phi ) = \int _0^1 R^\mathrm {cum}(\phi (t)-t)\, dt,
\]
where \(R^\mathrm {cum}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is a penalty function on the cumulative warp. The function \(R^\mathrm {cum}\) can take on the value \(+\infty\), which allows us to encode constraints on \(\phi\). While we do not require it, we typically have \(R^\mathrm {cum}(0)=0\), i.e., there is no cumulative regularization cost when the warped time and true time are the same.
Instantaneous warp regularization functional. The regularization functional for the instantaneous warp is
\[
{\mathcal {R}}^\mathrm {inst}(\phi ) = \int _0^1 R^\mathrm {inst}(\phi '(t)-1)\, dt,
\]
where \(R^\mathrm {inst}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is the penalty function on the instantaneous rate of time warping. Like the function \(R^\mathrm {cum}\), \(R^\mathrm {inst}\) can take on the value \(+\infty\), which allows us to encode constraints on \(\phi '\). By assigning \(R^\mathrm {inst}(u)=+\infty\) for \(u<s^\mathrm {min}\), for example, we require that \(\phi '(t)\ge s^\mathrm {min}\) for all t. We will assume that this is the case for some positive \(s^\mathrm {min}\), which ensures that \(\phi\) is invertible. While we do not require it, we typically have \(R^\mathrm {inst}(0)=0\), i.e., there is no instantaneous regularization cost when \(\phi '(t)=1\), that is, when the instantaneous rate of time warping is zero.
As a simple example, we might choose
\[
R^\mathrm {cum}(u) = u^2, \qquad R^\mathrm {inst}(u) = \begin{cases} u^2 & s^\mathrm {min}\le u \le s^\mathrm {max} \\ \infty & \text{otherwise}, \end{cases}
\]
i.e., a quadratic penalty on cumulative warping, and a square penalty on instantaneous warping, plus the constraint that the slope of \(\phi\) must be between \(s^\mathrm {min}\) and \(s^\mathrm {max}\). A very wide variety of penalties can be used to express our wishes and requirements on the warping function.
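A regularizer that doubles as a constraint can be coded by returning infinity outside the allowed range. In the sketch below the instantaneous penalty is written directly in terms of the slope \(\phi '\) (an assumption made here for clarity): it charges the squared deviation of the slope from one and returns \(+\infty\) when the slope leaves \([s^\mathrm {min}, s^\mathrm {max}]\). The names `r_cum` and `make_r_inst` are hypothetical.

```python
import math

def r_cum(u):
    """Quadratic penalty on the cumulative warp u = phi(t) - t."""
    return u ** 2

def make_r_inst(s_min, s_max):
    """Build a slope penalty that is infinite outside [s_min, s_max]."""
    def r_inst(slope):
        if not (s_min <= slope <= s_max):
            return math.inf          # encodes the hard slope constraint
        return (slope - 1.0) ** 2    # zero cost when the slope is one
    return r_inst

r_inst = make_r_inst(0.5, 2.0)
```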
Dynamic time warping via regularized loss minimization. We propose to choose \(\phi\) by solving the optimization problem
\[
\begin{array}{ll} \text{minimize} & f(\phi ) = {\mathcal {L}}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \end{array} \tag{4}
\]
where \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters used to vary the relative weight of the three terms. The variable in this optimization problem is the time warp function \(\phi\).
Optimal control formulation. The problem (4) is an infinite-dimensional, and generally non-convex, optimization problem. Such problems are generally impractical to solve exactly, but we will see that this particular problem can be efficiently and practically solved.
It can be formulated as a classical continuous-time optimal control problem (Bertsekas 2005), with scalar state \(\phi (t)\) and action or input \(u(t) = \phi '(t)\):
\[
\begin{array}{ll} \text{minimize} & \int _0^1 \ell (\phi (t), u(t), t)\, dt \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \phi '(t) = u(t), \quad 0\le t\le 1, \end{array} \tag{5}
\]
where \(\ell\) is the state-action cost function
\[
\ell (v,u,t) = L(x(v)-y(t)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(v-t) + \lambda ^\mathrm {inst}R^\mathrm {inst}(u-1).
\]
There are many classical methods for numerically solving the optimal control problem (5), but these generally make strong assumptions about the loss and regularization functionals (such as smoothness), and do not solve the problem globally. We will instead solve (5) by brute force dynamic programming, which is practical since the state has dimension one, and so can be discretized.
Lasso and ridge regularization. Before describing how we solve the optimal control problem (5), we mention two types of regularization that are widely used in machine learning, and what types of warping functions typically result when using them. They correspond to \(R^\mathrm {cum}\) and \(R^\mathrm {inst}\) being either \(u^2\) (quadratic, ridge, or Tikhonov regularization (Tikhonov and Arsenin 1977; Hansen 2005)) or |u| (absolute value, \(\ell _1\) regularization, or Lasso (Golub and Van Loan 2012, p564; Tibshirani 1996)).
With \(R^\mathrm {cum}(u)=u^2\), the regularization discourages large deviations between \(\tau\) and t, but not the rate at which \(\tau\) changes with t. With \(R^\mathrm {inst}(u)=u^2\), the regularization discourages large instantaneous warping rates. The larger \(\lambda ^\mathrm {cum}\) is, the less \(\tau\) deviates from t; the larger \(\lambda ^\mathrm {inst}\) is, the smoother the time warping function \(\phi\) is.
Using absolute value regularization is more interesting. It is well known in machine learning that using absolute value or \(\ell _1\) regularization leads to solutions with an argument of the absolute value that is sparse, that is, often zero (Boyd and Vandenberghe 2004). When \(R^\mathrm {cum}\) is the absolute value, we can expect many times when \(\tau = t\), that is, the warped time and true time are the same. When \(R^\mathrm {inst}\) is the absolute value, we can expect many times when \(\phi '(t) =1\), that is, the instantaneous rate of time warping is zero. Typically these regions grow larger as we increase the hyper-parameters \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\).
Discretized time formulation. To solve the problem (4) we discretize time with the N values
\[
0 = t_1< t_2< \cdots < t_N = 1.
\]
We will assume that \(\phi\) is piecewise linear with knot points at \(t_1, \ldots , t_N\); to describe it we only need to specify the warp values \(\tau _i = \phi (t_i)\) for \(i=1, \ldots , N\), which we express as a vector \(\tau \in {\mathbf{R}}^N\). We assume that the points \(t_i\) are closely enough spaced that the restriction to piecewise linear form is acceptable. The values \(t_i\) could be taken as the values at which the signal y is sampled (if it is given by samples), or just the default linear spacing, \(t_i = (i-1)/(N-1)\). The constraints \(\phi (0)=0\) and \(\phi (1)=1\) are expressed as \(\tau _1=0\) and \(\tau _N=1\).
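Using a simple Riemann approximation of the integrals and the approximation
\[
\phi '(t_i) \approx \frac{\tau _{i+1}-\tau _i}{t_{i+1}-t_i}, \quad i=1,\ldots ,N-1,
\]
we obtain the discretized objective
\[
{\hat{f}}(\tau ) = \sum _{i=1}^{N-1} (t_{i+1}-t_i) \left( L(x(\tau _i)-y(t_i)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(\tau _i-t_i) + \lambda ^\mathrm {inst}R^\mathrm {inst}\left( \frac{\tau _{i+1}-\tau _i}{t_{i+1}-t_i} - 1 \right) \right). \tag{6}
\]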
The discretized problem is to choose the vector \(\tau ^\star \in {\mathbf{R}}^N\) that minimizes \({\hat{f}}(\tau )\), subject to \(\tau _1=0\), \(\tau _N=1\). With \(\tau ^\star\), we can construct an approximation to function \(\phi\) using piecewise-linear interpolation. The only approximation here is the discretization; we can use standard techniques based on bounds on derivatives of the functions involved to bound the deviation between the continuous-time objective \(f(\phi )\) and its discretized approximation \({\hat{f}}(\tau )\).
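The discretized objective is a weighted sum over the time grid and can be evaluated directly. The sketch below assumes scalar signals given as callables and takes the instantaneous penalty as a function of the finite-difference slope minus one; the function name and argument order are hypothetical and do not reflect the GDTW interface.

```python
import numpy as np

def discretized_objective(tau, t, x, y, L, R_cum, R_inst, lam_cum, lam_inst):
    """Riemann-sum approximation of loss plus both regularization terms."""
    tau = np.asarray(tau, dtype=float)
    t = np.asarray(t, dtype=float)
    dt = np.diff(t)                       # interval widths t_{i+1} - t_i
    slope = np.diff(tau) / dt             # finite-difference estimate of phi'
    total = 0.0
    for i in range(len(t) - 1):
        total += dt[i] * (L(x(tau[i]) - y(t[i]))
                          + lam_cum * R_cum(tau[i] - t[i])
                          + lam_inst * R_inst(slope[i] - 1.0))
    return total

# Identical signals: the identity warp costs nothing, any other warp does.
t = np.linspace(0.0, 1.0, 51)
x = y = lambda s: np.sin(2 * np.pi * s)
sq = lambda u: float(u) ** 2
f_identity = discretized_objective(t, t, x, y, sq, sq, sq, 0.1, 0.1)
f_warped = discretized_objective(t ** 2, t, x, y, sq, sq, sq, 0.1, 0.1)
```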
4 Dynamic programming with refinement
In this section we describe a simple method to minimize \({\hat{f}}(\tau )\) subject to \(\tau _1 =0\) and \(\tau _N=1\), i.e., to solve the optimal control problem (5) to obtain \(\tau ^*\). We first discretize the possible values of \(\tau _i\), whereupon the problem can be expressed as a shortest path problem on a graph, and then efficiently and globally solved using standard dynamic programming techniques. To reduce the error associated with the discretization of the values of \(\tau _i\), we choose a new discretization with the same number of values, but in a reduced range (and therefore, more finely spaced values) around the previously found values. This refinement converges in a few steps to a highly accurate solution of the discretized problem. Subject only to the reasonable assumption that the discretization of the original time and warped time are sufficiently fine, this method finds the global solution.
4.1 Dynamic programming
We now discretize the values that \(\tau _i\) is allowed to take:
\[
\tau _i \in {\mathcal {T}}_i = \{ \tau _{i1}, \ldots , \tau _{iM} \}, \quad i=1,\ldots ,N.
\]
One choice for these discretized values is linear spacing between given lower and upper bounds on \(\tau _i\), \(0\le l_i \le u_i\le 1\):
\[
\tau _{ij} = l_i + \frac{j-1}{M-1}(u_i - l_i), \quad j=1,\ldots ,M.
\]
Here M is the number of values that we use to discretize each value of \(\tau _i\) (which we take to be the same for each i, for simplicity). We will assume that \(0 \in {\mathcal {T}}_1\) and \(1\in {\mathcal {T}}_N\), so the constraints \(\tau _1=0\) and \(\tau _N=1\) are feasible.
The bounds can be chosen as
\[
l_i = \max \{ s^\mathrm {min}t_i,\; 1-s^\mathrm {max}(1-t_i) \}, \qquad u_i = \min \{ s^\mathrm {max}t_i,\; 1-s^\mathrm {min}(1-t_i) \}, \qquad i=1,\ldots ,N, \tag{7}
\]
where \(s^\mathrm {min}\) and \(s^\mathrm {max}\) are the given minimum and maximum allowed values of \(\phi '\). This is illustrated in Fig. 3, where the nodes of \({\mathcal {T}}\) are drawn at position \((t_i,\tau _{ij})\), for \(N=30, M=20\) and various values of \(s^\mathrm {min}\) and \(s^\mathrm {max}\). Note that with uniformly spaced values the smallest nonzero slope that the grid can represent is roughly \(\frac{N}{M}\), so M should be chosen to satisfy \(M > \frac{N}{s^\mathrm {max}}\), a consideration that is automated in the provided software.
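The grid of candidate warp values can be built with a few array operations. The sketch below uses the bounds implied by the slope limits together with \(\phi (0)=0\) and \(\phi (1)=1\), namely \(l_i=\max \{s^\mathrm {min}t_i, 1-s^\mathrm {max}(1-t_i)\}\) and \(u_i=\min \{s^\mathrm {max}t_i, 1-s^\mathrm {min}(1-t_i)\}\), and spaces M values linearly between them. The function name `warp_grid` is hypothetical.

```python
import numpy as np

def warp_grid(t, M, s_min, s_max):
    """Return lower bounds, upper bounds, and an N x M grid of tau values."""
    t = np.asarray(t, dtype=float)
    lower = np.maximum(s_min * t, 1.0 - s_max * (1.0 - t))
    upper = np.minimum(s_max * t, 1.0 - s_min * (1.0 - t))
    frac = np.linspace(0.0, 1.0, M)       # (j - 1) / (M - 1) for j = 1..M
    grid = lower[:, None] + frac[None, :] * (upper - lower)[:, None]
    return lower, upper, grid

t = np.linspace(0.0, 1.0, 30)
lower, upper, grid = warp_grid(t, M=20, s_min=0.5, s_max=2.0)
```

With these bounds the first column of nodes collapses to 0 and the last to 1, so the boundary constraints hold automatically, and the identity warp always lies between the bounds when \(s^\mathrm {min}\le 1\le s^\mathrm {max}\).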
The objective (6) splits into a sum of terms that are functions of \(\tau _i\), and terms that are functions of \(\tau _{i+1}-\tau _i\). (These correspond to the separable state-action loss function terms in the optimal control problem associated with \(\phi (t)\) and \(\phi '(t)\), respectively.) The problem is then globally solved by standard methods of dynamic programming (Bellman and Dreyfus 2015), using the methods we now describe.
We form a graph with MN nodes, associated with the values \(\tau _{ij}\), \(i=1,\ldots , N\) and \(j=1,\ldots , M\). (Note that i indexes the discretized values of t, and j indexes the discretized values of \(\tau\).) Each node \(\tau _{ij}\) with \(i<N\) has M outgoing edges that terminate at the nodes of the form \(\tau _{i+1,k}\) for \(k=1, \ldots , M\). The total number of edges is therefore \((N-1)M^2\). This is illustrated in Fig. 3 for \(M=25\) and \(N=100\), where the nodes are shown at the location \((t_i,\tau _{ij})\). (In practice M and N would be considerably larger.)
At each node \(\tau _{ij}\) with \(i<N\), we associate the node cost
\[
(t_{i+1}-t_i)\left( L(x(\tau _{ij})-y(t_i)) + \lambda ^\mathrm {cum}R^\mathrm {cum}(\tau _{ij}-t_i) \right),
\]
and on the edge from \(\tau _{ij}\) to \(\tau _{i+1,k}\), we associate the edge cost
\[
(t_{i+1}-t_i)\, \lambda ^\mathrm {inst}R^\mathrm {inst}\left( \frac{\tau _{i+1,k}-\tau _{ij}}{t_{i+1}-t_i} - 1 \right).
\]
With these node and edge costs, the objective \({\hat{f}}(\tau )\) is the total cost of a path starting at node \(\tau _{11}=0\) and ending at \(\tau _{NM}=1\). (Infeasible paths, for example ones for which \(\tau _{i+1,k}<\tau _{i,j}\), have cost \(+\infty\).) Our problem is therefore to find the shortest weighted path through a graph, which is readily done by dynamic programming.
The computational cost of dynamic programming is order \(NM^2\) flops (not counting the evaluation of the loss and regularization terms). With current hardware, it is entirely practical for \(M=N=1000\) or even (much) larger. The path found is the globally optimal one, i.e., \(\tau ^*\) minimizes \({\hat{f}}(\tau )\), subject to the discretization constraints on the values of \(\tau _i\).
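The shortest-path computation can be sketched as a forward pass that records, for every node, the cheapest way to reach it, followed by a backward pass that reads off the path. This is a minimal pure-Python illustration with hypothetical names, not the C++ routine in GDTW; the toy node and edge costs (square loss, square regularizers, no decreasing steps) are assumptions chosen so that the identity warp is the unique optimum.

```python
import numpy as np

def dp_warp(grid, node_cost, edge_cost):
    """Globally minimize total node + edge cost over paths through grid (N x M)."""
    N, M = grid.shape
    cost = np.full((N, M), np.inf)        # best cost of a path ending at (i, j)
    back = np.zeros((N, M), dtype=int)    # best predecessor index in row i - 1
    cost[0] = [node_cost(0, v) for v in grid[0]]
    for i in range(1, N):
        for k in range(M):
            via = np.array([cost[i - 1, j]
                            + edge_cost(i - 1, grid[i - 1, j], grid[i, k])
                            for j in range(M)])
            back[i, k] = int(np.argmin(via))
            cost[i, k] = via[back[i, k]] + node_cost(i, grid[i, k])
    j = int(np.argmin(cost[-1]))
    best = float(cost[-1, j])
    tau = np.empty(N)
    for i in range(N - 1, -1, -1):        # backtrack from the last time point
        tau[i] = grid[i, j]
        j = back[i, j]
    return tau, best

# Toy problem: x equals y, so the identity warp has zero cost.
N = M = 21
t = np.linspace(0.0, 1.0, N)
dt = np.diff(t)
grid = np.tile(np.linspace(0.0, 1.0, M), (N, 1))
grid[0, :] = 0.0                          # enforce tau_1 = 0
grid[-1, :] = 1.0                         # enforce tau_N = 1
x = y = lambda s: np.sin(2 * np.pi * s)
lam_cum, lam_inst = 0.1, 0.1

def node_cost(i, v):
    if i == N - 1:
        return 0.0                        # the sum in the objective stops at N - 1
    return dt[i] * ((x(v) - y(t[i])) ** 2 + lam_cum * (v - t[i]) ** 2)

def edge_cost(i, v_from, v_to):
    slope = (v_to - v_from) / dt[i]
    if slope < 0.0:
        return np.inf                     # decreasing steps are infeasible
    return dt[i] * lam_inst * (slope - 1.0) ** 2

tau, best = dp_warp(grid, node_cost, edge_cost)
```

The triple loop makes the order \(NM^2\) cost visible: each of the \((N-1)M\) nodes examines M incoming edges.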
4.2 Iterative refinement
After solving the problem above by dynamic programming, we can reduce the error induced by discretizing the values of \(\tau _i\) by updating \(l_i\) and \(u_i\). We shrink them both toward the current value of \(\tau ^*_i\), thereby reducing the gap between adjacent discretized values and reducing the discretization error. One simple method for updating the bounds is to reduce the range \(u_i-l_i\) by a fixed fraction \(\eta\), say 1/2 or 1/8.
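To do this we set
\[
l_i^{(q+1)} = \max \left\{ \tau _i^{*(q)} - \eta \, \frac{u_i^{(q)}-l_i^{(q)}}{2},\; l_i^{(0)} \right\}, \qquad u_i^{(q+1)} = \min \left\{ \tau _i^{*(q)} + \eta \, \frac{u_i^{(q)}-l_i^{(q)}}{2},\; u_i^{(0)} \right\}
\]
in iteration \(q+1\), where the superscripts in parentheses above indicate the iteration. Using the same data as Fig. 2, Fig. 4 shows the iterative refinement of \(\tau ^*\). Here, nodes of \({\mathcal {T}}\) are plotted at position \((t_i,\tau _{ij})\), as it is iteratively refined around \(\tau ^*_i\).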
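One way to code the bound update is shown below. It centers a range of width \(\eta (u_i-l_i)\) on the current solution and clips it to the initial bounds so that the original constraints are never violated; the clipping step and the function name `refine_bounds` are assumptions of this sketch.

```python
import numpy as np

def refine_bounds(tau_star, lower, upper, lower0, upper0, eta=0.5):
    """Shrink [lower, upper] around tau_star by the factor eta."""
    half = eta * (upper - lower) / 2.0
    new_lower = np.maximum(tau_star - half, lower0)   # stay inside initial bounds
    new_upper = np.minimum(tau_star + half, upper0)
    return new_lower, new_upper

lower0 = np.zeros(3)
upper0 = np.ones(3)
tau_star = np.array([0.1, 0.5, 0.9])
new_lower, new_upper = refine_bounds(tau_star, lower0, upper0, lower0, upper0, eta=0.5)
```

Re-running the dynamic program on the M values spread over the narrower range then reduces the spacing between candidate values by at least the factor \(\eta\).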
4.3 Implementation
GDTW package. The algorithm described above has been implemented as the open source Python package GDTW, with the dynamic programming portion written in C++ for improved efficiency. The node costs are computed and stored in an \(M\times N\) array, and the edge costs are computed on the fly and stored in an \(M\times M\times N\) array. For multiple iterations on group-level alignments (see §7), multi-threading is used to distribute the program onto worker threads.
Performance. We give an example of the performance attained by GDTW using real-world signals described in §5, which are uniformly sampled with \(N=1000\). Although it has no effect on method performance, we take square loss, square cumulative warp regularization, and square instantaneous warp regularization. We take \(M=100\).
The computations are carried out on a 4 core MacBook. To compute the node costs requires 0.0055 seconds, and to compute the shortest path requires 0.0832 seconds. With refinement factor \(\eta = 0.15\), only three iterations are needed before no significant improvement is obtained, and the result is essentially the same with other choices for the algorithm parameters N, M, and \(\eta\). Over 10 trials, our method took an average of only 0.25 seconds to compute using a radius of 50, which is equivalent to \(M=100\). All of the data and example code necessary to reproduce these results are available in the GDTW repository. Also available are supplementary materials that contain step-by-step instructions and demonstrations on how to reproduce these results.
4.4 Validation
To test the generalization ability of a specific time warping model, parameterized by \(L\), \(\lambda ^\mathrm {cum}\), \(R^\mathrm {cum}\), \(\lambda ^\mathrm {inst}\), and \(R^\mathrm {inst}\), we use out-of-sample validation. We form two increasing sequences, \(t^\mathrm {train}\in {\mathbf{R}}^{N^\mathrm {train}}\) and \(t^\mathrm {test}\in {\mathbf{R}}^{N^\mathrm {test}}\), by randomly sampling without replacement from the N discretized time values, and we include the boundaries \(t_1 = 0\) and \(t_N = 1\) in each sequence. Using only the time points in \(t^\mathrm {train}\), we obtain our time warping function \(\phi\) by minimizing our discretized objective (6). (Recall that since our method does not require signals to be sampled at regular intervals, it will work with the irregularly spaced time points in \(t^\mathrm {train}\).)
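We compute two loss values: a training error
\[
\ell ^\mathrm {train} = \sum _{i=1}^{N^\mathrm {train}-1} \left( t^\mathrm {train}_{i+1}-t^\mathrm {train}_i \right) L\left( x(\phi (t^\mathrm {train}_i)) - y(t^\mathrm {train}_i) \right),
\]
and a test error
\[
\ell ^\mathrm {test} = \sum _{i=1}^{N^\mathrm {test}-1} \left( t^\mathrm {test}_{i+1}-t^\mathrm {test}_i \right) L\left( x(\phi (t^\mathrm {test}_i)) - y(t^\mathrm {test}_i) \right).
\]
Figure 5 shows \(\ell ^\mathrm {test}\) over a grid of values of \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\), for a partitioning where \(t^\mathrm {train}\) and \(t^\mathrm {test}\) each contain 50% of the time points. In this example, we use the signals shown in Fig. 1.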
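The partitioning step can be sketched as follows. The interior time points are split at random without replacement, the two boundary points are added to both sequences, and the loss is evaluated as a weighted sum over whichever sequence is passed in. The function names, the weighting by interval width, and the use of callables for \(\phi\), x, and y are assumptions of this illustration.

```python
import numpy as np

def split_time_points(t, train_fraction=0.5, seed=0):
    """Randomly partition interior time points; keep both boundaries in each part."""
    t = np.asarray(t, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(t)
    interior = np.arange(1, N - 1)
    n_train = int(round(train_fraction * len(interior)))
    train_idx = rng.choice(interior, size=n_train, replace=False)
    test_idx = np.setdiff1d(interior, train_idx)
    train = np.sort(np.concatenate(([0], train_idx, [N - 1])))
    test = np.sort(np.concatenate(([0], test_idx, [N - 1])))
    return t[train], t[test]

def weighted_loss(t_pts, phi, x, y, L):
    """Interval-weighted loss of the warped signal on the given time points."""
    dt = np.diff(t_pts)
    return float(sum(dt[i] * L(x(phi(t_pts[i])) - y(t_pts[i]))
                     for i in range(len(dt))))

t = np.linspace(0.0, 1.0, 11)
t_train, t_test = split_time_points(t, train_fraction=0.5, seed=0)
signal = lambda s: np.sin(2 * np.pi * s)
loss_test = weighted_loss(t_test, lambda s: s, signal, signal, lambda u: u ** 2)
```

In practice \(\phi\) would be fitted on `t_train` only and then scored on `t_test`; here the identity warp with identical signals simply gives zero loss.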
4.4.1 Ground truth estimation
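When a ground truth warping function, \(\phi ^\mathrm {true}\), is available, we can score how well our \(\phi\) approximates \(\phi ^\mathrm {true}\) by computing the following errors:
\[
\epsilon ^\mathrm {train} = \sum _{i=1}^{N^\mathrm {train}-1} \left( t^\mathrm {train}_{i+1}-t^\mathrm {train}_i \right) L\left( \phi ^\mathrm {true}(t^\mathrm {train}_i) - \phi (t^\mathrm {train}_i) \right),
\]
and
\[
\epsilon ^\mathrm {test} = \sum _{i=1}^{N^\mathrm {test}-1} \left( t^\mathrm {test}_{i+1}-t^\mathrm {test}_i \right) L\left( \phi ^\mathrm {true}(t^\mathrm {test}_i) - \phi (t^\mathrm {test}_i) \right).
\]
In the example shown in Fig. 5, target signal y is constructed by composing x with a known warping function \(\phi ^\mathrm {true}\), such that \(y(t) = (x\circ \phi ^\mathrm {true})(t)\). Figure 6 shows the contours of \(\epsilon ^\mathrm {test}\) for this example.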
5 Examples
We present a few examples of alignments using our method. Figure 7 is a synthetic example of different types of time warping functions. Figure 8 is a real-world example using biological signals (ECGs). We compare our method using varying amounts of regularization \(\lambda ^\mathrm {inst}\in \{0.01,0.1,0.5 \},N=1000,M=100\) to those obtained with FastDTW (Salvador and Chan 2007), using the equivalent graph size \(N=1000,\mathrm {radius}=50\). As expected, the alignments using regularization are smoother and less prone to singularities than those from FastDTW, which are unregularized. Figure 9 shows how the time warp functions become smoother as \(\lambda ^\mathrm {inst}\) grows.
6 Extensions and variations
We will show how to extend our formulation to address complex scenarios, such as aligning a portion of a signal to the target, regularization of higher-order derivatives, and symmetric time warping, where both signals align to each other.
6.1 Alternate boundary and slope constraints
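We can align a portion of a signal with the target by adjusting the boundary constraints to allow \(0 \le \phi (0) \le \beta\) and \((1-\beta ) \le \phi (1) \le 1\). We incorporate this by reformulating (7) as
\[
l_i = \max \{ s^\mathrm {min}t_i,\; (1-s^\mathrm {max}(1-t_i)) - \beta \}, \qquad u_i = \min \{ s^\mathrm {max}t_i + \beta ,\; 1-s^\mathrm {min}(1-t_i) \},
\]
for \(i=1,\ldots , N\).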
We can also allow the slope of \(\phi\) to be negative, by choosing \(s^\mathrm {min}< 0\). These modifications are illustrated in Fig. 10, where the nodes of \({\mathcal {T}}\) are drawn at position \((t_i,\tau _{ij})\), for \(N=30, M=20\) and various values of \(\beta\), \(s^\mathrm {min}\), and \(s^\mathrm {max}\).
6.2 Penalizing higher-order derivatives
We can extend the formulation to include a constraint or objective term on the higher-order derivatives, such as the second derivative \(\phi ''\). This requires us to extend the discretized state space to include not just the current M values, but also the last M values, so the state space size grows to \(M^2\) in the dynamic programming problem.
In the continuous formulation, the regularization functional for the second-order instantaneous warp is
\[
{\mathcal {R}}^\mathrm {inst^2}(\phi ) = \int _0^1 R^\mathrm {inst^2}(\phi ''(t))\, dt,
\]
where \(R^\mathrm {inst^2}: {\mathbf{R}}\rightarrow {\mathbf{R}}\cup \{ \infty \}\) is the penalty function on the second-order instantaneous rate of time warping. Like the function \(R^\mathrm {inst}\), \(R^\mathrm {inst^2}\) can take on the value \(+\infty\), which allows us to enforce constraints on \(\phi ''\).
With this additional regularization functional, we can reformulate the problem in (4) as
\[
\begin{array}{ll} \text{minimize} & {\mathcal {L}}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) + \lambda ^\mathrm {inst^2}{\mathcal {R}}^\mathrm {inst^2}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \end{array} \tag{8}
\]
where \(\lambda ^\mathrm {inst^2}\in {\mathbf{R}}_{+}\) is a positive hyper-parameter.
To solve the problem (8), we can include a discretized version of \({\mathcal {R}}^\mathrm {inst^2}\) inside the discretized objective (6). We propose a discretized formulation of \({\mathcal {R}}^\mathrm {inst^2}\) using a three-point central difference approximation of the second derivative. Note that this approximation depends on the spacing of the time points. For regularly spaced time points, we can use
\[
\phi ''(t_i) \approx \frac{\tau _{i+1}-2\tau _i+\tau _{i-1}}{(t_{i+1}-t_i)^2},
\]
and for irregularly spaced time points, we can use
\[
\phi ''(t_i) \approx \frac{2\left( \delta _1\tau _{i+1} - (\delta _1+\delta _2)\tau _i + \delta _2\tau _{i-1} \right) }{\delta _1\delta _2(\delta _1+\delta _2)},
\]
for the interior points \(i=2, \ldots , N-1\), and where \(\delta _1 = t_{i}-t_{i-1}\) and \(\delta _2 = t_{i+1}-t_{i}\).
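The three-point difference on an irregular grid can be checked numerically. The sketch below uses the standard non-uniform formula \(2(\delta _1\tau _{i+1} - (\delta _1+\delta _2)\tau _i + \delta _2\tau _{i-1})/(\delta _1\delta _2(\delta _1+\delta _2))\) at the interior points; it is exact for quadratics, which makes a convenient test. The function name is hypothetical.

```python
import numpy as np

def second_derivative(tau, t):
    """Three-point estimate of phi'' at interior points of a possibly irregular grid."""
    tau = np.asarray(tau, dtype=float)
    t = np.asarray(t, dtype=float)
    d1 = t[1:-1] - t[:-2]                 # spacing to the previous point
    d2 = t[2:] - t[1:-1]                  # spacing to the next point
    num = d1 * tau[2:] - (d1 + d2) * tau[1:-1] + d2 * tau[:-2]
    return 2.0 * num / (d1 * d2 * (d1 + d2))

t = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
curvature_quadratic = second_derivative(t ** 2, t)   # phi(t) = t^2 has phi'' = 2
curvature_identity = second_derivative(t, t)         # identity warp has phi'' = 0
```

On a uniform grid with spacing h the expression reduces to \((\tau _{i+1}-2\tau _i+\tau _{i-1})/h^2\).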
6.3 General loss
The two signals need not be vector valued; they could have categorical values, for example
\[
L(\tau _i, t_i) = \begin{cases} 1 & x(\tau _i) \ne y(t_i) \\ 0 & \text{otherwise}, \end{cases}
\]
or
\[
L(\tau _i, t_i) = g(x(\tau _i), y(t_i)),
\]
where \(g:{\mathbf{R}}_{++} \times {\mathbf{R}}_{++} \rightarrow {\mathbf{R}}\) is a categorical distance function that can specify the cost of certain mismatches or a similarity matrix (Needleman and Wunsch 1970).
Another example could use the Earth mover’s distance, \(\text{ EMD }:{\mathbf{R}}^n \times {\mathbf{R}}^n \rightarrow {\mathbf{R}}\), between two short-time spectra
where \(\rho \in {\mathbf{R}}\) is a radius around time point \(t_i\).
6.4 Symmetric time warping
Until this point, we have used unidirectional time warping, where signal x is time-warped to align with y such that \(x\circ \phi \approx y\). We can also perform bidirectional time warping, where signals x and y are time-warped to each other. Bidirectional time warping results in two time warp functions, \(\phi\) and \(\psi\), where \(x \circ \phi \approx y \circ \psi\).
Bidirectional time warping requires a different loss functional. Here we define the bidirectional loss associated with time warp functions \(\phi\) and \(\psi\), on the two signals x and y, as
\[
{\mathcal {L}}(\phi ,\psi ) = \int _0^1 L\big (x(\phi (t)) - y(\psi (t))\big )\, dt,
\]
where we distinguish the bidirectional case by using two arguments, \({\mathcal {L}}(\phi ,\psi )\), instead of one, \({\mathcal {L}}(\phi )\), as in (1).
Bidirectional time warping can be symmetric or asymmetric. In the symmetric case, we choose \(\phi , \psi\) by solving the optimization problem
\[
\begin{array}{ll} \text{minimize} & {\mathcal {L}}(\phi ,\psi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \psi (t)=2t-\phi (t), \end{array}
\]
where the constraint \(\psi (t)=2t-\phi (t)\) ensures that \(\phi\) and \(\psi\) are symmetric about the identity. The symmetric case does not add additional computational complexity, and can be readily solved using the iterative refinement procedure described in §4.
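The symmetric constraint is simple to illustrate numerically: given \(\phi\), the companion warp is \(\psi (t)=2t-\phi (t)\), so the two warps average to the identity. The sketch below uses an arbitrary example warp whose slope stays strictly between 0 and 2 (so that \(\psi\) is also increasing) and evaluates a Riemann-sum version of the bidirectional loss; the signals and the warp are assumptions for the example.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)
phi = t + 0.1 * np.sin(2 * np.pi * t)     # slope in (0, 2), phi(0)=0, phi(1)=1
psi = 2.0 * t - phi                       # mirror image of phi about the identity

x = lambda s: np.sin(2 * np.pi * s)
y = lambda s: np.cos(2 * np.pi * s)
dt = np.diff(t)

# Riemann-sum approximation of the bidirectional square loss.
loss = float(np.sum(dt * np.square(x(phi[:-1]) - y(psi[:-1]))))
```

Because \(\psi\) is determined by \(\phi\), the search is still over a single warp function, which is why the symmetric case costs no more than the unidirectional one.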
In the asymmetric case, \(\phi , \psi\) are chosen by solving the optimization problem
\[
\begin{array}{ll} \text{minimize} & {\mathcal {L}}(\phi ,\psi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\phi ) + \lambda ^\mathrm {cum}{\mathcal {R}}^\mathrm {cum}(\psi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\phi ) + \lambda ^\mathrm {inst}{\mathcal {R}}^\mathrm {inst}(\psi ) \\ \text{subject to} & \phi (0)=0, \quad \phi (1)=1, \quad \psi (0)=0, \quad \psi (1)=1. \end{array}
\]
The asymmetric case requires \(R^\mathrm {cum}\), \(R^\mathrm {inst}\) to allow negative slopes for \(\psi\). Further, it requires a modified iterative refinement procedure (not described here) with an increased complexity of order \(NM^4\) flops, which is impractical when M is not small.
7 Time-warped distance, centering, and clustering
In this section we describe three simple extensions of our optimization formulation that yield useful methods for analyzing a set of signals \(x_1, \ldots , x_M\).
7.1 Time-warped distance
For signals x and y, we can interpret the optimal value of (4) as the time-warped distance between x and y, denoted D(x, y). (Note that this distance measure takes into account both the loss and the regularization, which measures how much warping was needed.) When \(\lambda ^\mathrm {cum}\) and \(\lambda ^\mathrm {inst}\) are zero, we recover the unconstrained DTW distance (Sakoe and Chiba 1978). This distance is not symmetric; we can (and usually do) have \(D(x,y)\ne D(y,x)\). If a symmetric distance is preferred, we can take \((D(x,y)+D(y,x))/2\), or the optimal value of the group alignment problem (9), with a set of original signals x, y.
The warp distance can be used in many places where a conventional distance between two signals is used. For example, we can use warp distance to carry out k nearest neighbors regression (Xi et al 2006) or classification. Warp distance can also be used to create features for further machine learning. For example, suppose that we have carried out clustering into K groups, as discussed in §7.3, with target or group centers or exemplar signals \(y_1, \ldots , y_K\). From these we can create a set of K features related to the warp distance of a new signal x to the centers \(y_1, \ldots , y_K\), as
\[ z_i = \frac{e^{-d_i/\sigma }}{\sum _{j=1}^K e^{-d_j/\sigma }}, \quad i=1,\ldots ,K, \]
where \(d_i = D(x,y_i)\), and \(\sigma \in {\mathbf{R}}_{+}\) is a positive (scale) hyper-parameter.
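As an illustration of such distance-based features, the sketch below assumes a softmax-type map \(e^{-d_i/\sigma }/\sum _j e^{-d_j/\sigma }\), so that nearby centers receive large weights and the K features sum to one. The function name `warp_distance_features` and the example distances are hypothetical; the distances would in practice come from solving (4) once per center.

```python
import numpy as np

def warp_distance_features(d, sigma):
    """Map warp distances d_1..d_K to K nonnegative features that sum to one.

    Assumed form: z_i = exp(-d_i / sigma) / sum_j exp(-d_j / sigma).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    logits = -np.asarray(d, dtype=float) / sigma
    logits = logits - logits.max()   # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical distances from a new signal x to K = 3 centers.
z = warp_distance_features([0.2, 1.0, 3.0], sigma=0.5)
z_equal = warp_distance_features([1.0, 1.0, 1.0], sigma=0.5)
```

A small \(\sigma\) concentrates the weight on the nearest center, while a large \(\sigma\) spreads it almost uniformly, so \(\sigma\) would typically be chosen by validation.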
7.2 Time-warped alignment and centering
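In time-warped alignment, the goal is to find a common target signal \(\mu\) that each of the original signals can be warped to, at low cost. We pose this in the natural way as the optimization problem
\[ \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L\big (x_i(\phi _i(t)) - \mu (t)\big )\, dt + \lambda ^\mathrm {cum} {\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst} {\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \quad i=1,\ldots ,M, \end{array} \tag{9} \]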
where the variables are the warp functions \(\phi _1, \ldots , \phi _M\) and the target \(\mu\), and \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters. The objective is the sum of the objectives for time warping each \(x_i\) to \(\mu\). This is very much like our basic formulation (4), except that we have multiple signals to warp, and the target \(\mu\) is also a variable that we can choose.
The problem (9) is hard to solve exactly, but a simple iterative procedure seems to work well. We observe that if we fix the target \(\mu\), the problem splits into M separate dynamic time warping problems that we can solve (separately, in parallel) using the method described in §4. Conversely, if we fix the warping functions \(\phi _1, \ldots , \phi _M\), we can optimize over \(\mu\) by minimizing
\[ \sum _{i=1}^M \int _0^1 L\big (x_i(\phi _i(t)) - \mu (t)\big )\, dt. \]
This amounts to choosing each \(\mu (t)\) to minimize
\[ \sum _{i=1}^M L\big (x_i(\phi _i(t)) - \mu (t)\big ). \]
This is typically easy to do; for example, with square loss, we choose \(\mu (t)\) to be the mean of \(x_i(\phi _i(t))\); with absolute value loss, we choose \(\mu (t)\) to be the median of \(x_i(\phi _i(t))\).
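The pointwise target update can be sketched in a few lines of Python. The array layout (`warped[i, k]` holding \(x_i(\phi _i(t_k))\) on a common time grid) and the function name `update_target` are assumptions made for this example.

```python
import numpy as np

def update_target(warped, loss="square"):
    """Return mu[k] minimizing the summed loss over signals at each time t_k.

    warped[i, k] holds x_i(phi_i(t_k)) for signal i at grid point k.
    """
    warped = np.asarray(warped, dtype=float)
    if loss == "square":
        return warped.mean(axis=0)        # mean minimizes the sum of squares
    if loss == "absolute":
        return np.median(warped, axis=0)  # median minimizes the sum of absolute values
    raise ValueError("loss must be 'square' or 'absolute'")

# Three warped signals sampled at two time points; the last is an outlier.
warped = np.array([[0.0, 1.0],
                   [2.0, 1.0],
                   [10.0, 4.0]])
mu_square = update_target(warped, "square")
mu_absolute = update_target(warped, "absolute")
```

In this toy data the median-based target is less affected by the outlying third signal than the mean-based one, which is the usual reason to prefer absolute value loss.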
This method of alternating between updating the target \(\mu\) and updating the warp functions (in parallel) typically converges quickly. However, it need not converge to the global minimum. One simple initialization is to start with no warping, i.e., \(\phi _i(t)=t\). Another is to choose one of the original signals as the initial value for \(\mu\).
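The alternating procedure can be sketched end to end as follows. This is a simplified stand-in, not the GDTW solver: it assumes uniformly sampled scalar signals, square loss, quadratic penalties \((\phi (t)-t)^2\) and \((\phi '(t)-1)^2\), slope bounds `smin` and `smax`, and a single fixed grid of M candidate warp values with no iterated refinement. The names `solve_warp` and `align` are hypothetical.

```python
import numpy as np

def solve_warp(x, mu, t, M=41, lam_cum=0.1, lam_inst=0.1, smin=0.0, smax=4.0):
    """Dynamic programming over a fixed grid of M warp values in [0, 1].

    Returns the optimal grid warp phi (with phi(0)=0, phi(1)=1) and its cost.
    """
    N, dt = len(t), t[1] - t[0]
    s = np.linspace(0.0, 1.0, M)
    xs = np.interp(s, t, x)                         # x at candidate warped times
    node = (xs[None, :] - mu[:, None]) ** 2 \
        + lam_cum * (s[None, :] - t[:, None]) ** 2  # loss + cumulative penalty
    slope = (s[None, :] - s[:, None]) / dt          # slope[i, j]: move s_i -> s_j
    ok = (slope >= smin - 1e-9) & (slope <= smax + 1e-9)
    edge = np.where(ok, lam_inst * (slope - 1.0) ** 2, np.inf)
    cost = np.full((N, M), np.inf)
    back = np.zeros((N, M), dtype=int)
    cost[0, 0] = node[0, 0]                         # initial constraint phi(0)=0
    for k in range(1, N):
        total = cost[k - 1][:, None] + edge         # total[i, j]
        back[k] = np.argmin(total, axis=0)
        cost[k] = total[back[k], np.arange(M)] + node[k]
    phi, j = np.empty(N), M - 1                     # terminal constraint phi(1)=1
    for k in range(N - 1, -1, -1):
        phi[k] = s[j]
        j = back[k, j]
    return phi, cost[-1, -1] * dt

def align(signals, t, iters=5, **kw):
    """Alternate between the target update and the per-signal warp updates."""
    X = np.asarray(signals, dtype=float)
    phis = np.tile(t, (len(X), 1))                  # initialize with no warping
    history = []
    for _ in range(iters):
        warped = np.array([np.interp(p, t, x) for p, x in zip(phis, X)])
        mu = warped.mean(axis=0)                    # square loss: pointwise mean
        results = [solve_warp(x, mu, t, **kw) for x in X]
        phis = np.array([r[0] for r in results])
        history.append(sum(r[1] for r in results))
    return mu, phis, history

t = np.linspace(0.0, 1.0, 41)
signals = [np.exp(-((t - c) / 0.08) ** 2) for c in (0.4, 0.5, 0.6)]
mu, phis, history = align(signals, t, iters=4, M=41)
```

Each step minimizes the same discretized objective over one block of variables, so the recorded objective values in `history` do not increase from one pass to the next. The M warp problems inside the loop are independent and could be dispatched in parallel.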
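As a variation, we can also require the warping functions to be evenly arranged about a common time warp center, for example \(\phi (t) = t\). We can do this by imposing a centering constraint on (9),
\[ \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L\big (x_i(\phi _i(t)) - \mu (t)\big )\, dt + \lambda ^\mathrm {cum} {\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst} {\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \quad i=1,\ldots ,M, \quad \frac{1}{M}\sum _{i=1}^M \phi _i(t) = t, \end{array} \]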
where \(\frac{1}{M}\sum _{i=1}^M \phi _i(t) = t\) forces \(\phi _1, \ldots , \phi _M\) to be evenly distributed around the identity \(\phi (t)=t\). The resulting centered time warp functions can be used to produce a centered time-warped mean. Figure 11 compares a time-warped mean with and without centering, using synthetic data consisting of multi-modal signals from (Srivastava et al 2011).
Figure 12 shows examples of centered time-warped means of real-world data (using our default parameters), consisting of ECGs and sensor data from an automotive engine (Abou-Nasr and Feldkamp 2008). The ECG example demonstrates that subtle features of the input sequences are preserved in the alignment process, and the engine example demonstrates that the alignment process can find structure in noisy data.
7.3 Time-warped clustering
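A further generalization of our optimization formulation allows us to cluster a set of signals \(x_1, \ldots , x_M\) into K groups, with each group having a template or center or exemplar. This can be considered a time-warped version of K-means clustering; see, e.g., (Boyd and Vandenberghe 2018, Chapter 4). To describe the clusters we use the M-vector c, with \(c_i = j\) meaning that signal \(x_i\) is assigned to group j, where \(j\in \{1,\ldots ,K\}\). The exemplars or templates are the signals denoted \(y_1, \ldots , y_K\). We pose time-warped clustering as the optimization problem
\[ \begin{array}{ll} \text{minimize} & \sum _{i=1}^M \left( \int _0^1 L\big (x_i(\phi _i(t)) - y_{c_i}(t)\big )\, dt + \lambda ^\mathrm {cum} {\mathcal {R}}^\mathrm {cum}(\phi _i) + \lambda ^\mathrm {inst} {\mathcal {R}}^\mathrm {inst}(\phi _i) \right) \\ \text{subject to} & \phi _i(0)=0, \quad \phi _i(1)=1, \quad i=1,\ldots ,M, \end{array} \]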
where the variables are the warp functions \(\phi _1, \ldots , \phi _M\), the templates \(y_1, \ldots , y_K\), and the assignment vector c. As above, \(\lambda ^\mathrm {cum}\in {\mathbf{R}}_{+}\) and \(\lambda ^\mathrm {inst}\in {\mathbf{R}}_{+}\) are positive hyper-parameters.
We solve this (approximately) by cyclically optimizing over the warp functions, the templates, and the assignments. Figure 13 shows an example of this procedure (using our default parameters) on a set of sinusoidal, square, and triangular signals of varying phase and amplitude.
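The assignment step of this cycle is the simplest of the three and can be sketched as follows. The example assumes that the warp distances from every signal to every template have already been computed and stored in a matrix; the names `assign_clusters` and `clustering_objective`, the 0-based group labels, and the numbers in `dist` are illustrative assumptions.

```python
import numpy as np

def assign_clusters(dist):
    """Assign each signal to its nearest template.

    dist[i, j] is the warp distance from signal x_i to template y_j.
    Returns the assignment vector c with 0-based group labels.
    """
    return np.argmin(np.asarray(dist, dtype=float), axis=1)

def clustering_objective(dist, c):
    """Sum over signals of the distance to the assigned template."""
    dist = np.asarray(dist, dtype=float)
    return float(dist[np.arange(len(c)), c].sum())

# Hypothetical distances for M = 3 signals and K = 2 templates.
dist = np.array([[0.10, 2.00],
                 [1.50, 0.30],
                 [0.20, 0.25]])
c = assign_clusters(dist)
value = clustering_objective(dist, c)
```

With the templates and warps held fixed, no other assignment gives a smaller objective, so this step can only keep or decrease the clustering objective. The template and warp updates then proceed per group as in time-warped alignment.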
8 Conclusion
We claim three main contributions. We propose a full reformulation of DTW in continuous time that eliminates singularities without the need for preprocessing or step functions. Because our formulation allows for non-uniformly sampled signals, we are the first to demonstrate how out-of-sample validation can be used on a single signal for selecting DTW hyper-parameters. Finally, we distribute our C++ code as an open-source Python package called GDTW.
References
Abou-Nasr M, Feldkamp L (2008) Ford Classification Challenge. Zip Archive http://www.timeseriesclassification.com/description.php?Dataset=FordA
Bellman RE, Dreyfus SE (2015) Applied dynamic programming, vol 2050. Princeton University Press
Bertsekas DP (2005) Dynamic programming and optimal control, vol 1. Athena Scientific, Belmont, MA
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Boyd S, Vandenberghe L (2018) Introduction to applied linear algebra: vectors, matrices, and least squares. Cambridge University Press, Cambridge
Dau H, Silva D, Petitjean F et al (2018) Optimizing dynamic time warping’s window width for time series data mining applications. Data Min Knowl Discov 32(4):1074–1120
Dupont M, Marteau P (2015) Coarse-DTW for sparse time series alignment. In: International Workshop on Advanced Analytics and Learning on Temporal Data, Springer, pp 157–172
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer series in statistics, New York, NY, USA
Golub GH, Van Loan CF (2012) Matrix computations, vol 3. JHU Press
Green PJ, Silverman BW (1993) Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press
Hansen PC (1998) Rank-deficient and discrete ill-posed problems: numerical aspects of linear inversion. Society for Industrial and Applied Mathematics
Huber P (2011) Robust statistics. In: International encyclopedia of statistical science. Springer, p 1248–1251
Itakura F (1975) Minimum prediction residual principle applied to speech recognition. IEEE Trans Acoust Speech Signal Process 23(1):67–72
Keogh E, Pazzani M (2001) Derivative dynamic time warping. In: Proceedings of the 2001 SIAM international conference on data mining, SIAM, pp 1–1
Marron J, Ramsay J, Sangalli L et al (2015) Functional data analysis of amplitude and phase variation. Stat Sci 30(4):468–484
Myers C, Rabiner L, Rosenberg A (1980) Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Trans Acoust Speech Signal Process 28(6):623–635
Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice-Hall, Inc.
Rakthanmanon T, Campana B, Mueen A, et al (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the international conference on knowledge discovery and data mining, ACM, pp 262–270
Ramsay JO, Silverman BW (2005) Fitting differential equations to functional data: Principal differential analysis. Springer, New York, pp 327–348
Ramsay J, Silverman B (2007) Applied functional data analysis: Methods and case studies. Springer
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process 26(1):43–49
Salvador S, Chan P (2007) Toward accurate dynamic time warping in linear time and space. Intell Data Anal 11(5):561–580
Singh M, Cheng I, Mandal M, et al (2008) Optimization of symmetric transfer error for sub-frame video synchronization. In: European conference on computer vision, Springer, pp 554–567
Srivastava A, Wu W, Kurtek S, et al (2011) Registration of functional data using Fisher-Rao metric. arXiv preprint arXiv:1103.3817
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc 58(1):267–288
Tikhonov A, Arsenin V (1977) Solutions of Ill-Posed Problems, vol 14. Winston
Xi X, Keogh E, Shelton C, et al (2006) Fast time series classification using numerosity reduction. In: Proceedings of the 23rd international conference on machine learning, ACM, pp 1033–1040
Zhao J, Itti L (2016) ShapeDTW: Shape dynamic time warping. arXiv preprint arXiv:1606.01601
Disclosure statement
The authors declare that they have no conflicts of interest.
Deriso, D., Boyd, S. A general optimization framework for dynamic time warping. Optim Eng 24, 1411–1432 (2023). https://doi.org/10.1007/s11081-022-09738-z