Dynamic Time Warping (DTW) Simulator

An interactive tool for Dynamic Time Warping, the classic algorithm that compares two time series that differ in speed or phase by finding the optimal alignment along a stretched time axis. Change the sequence lengths, Sakoe-Chiba window width, distance metric, phase shift and amplitude scale, and the DTW distance, naive L2 distance, restricted cell count, memory savings and speedup update in real time.

Parameters
Sequence 1 length N (pt): number of samples in the reference series
Sequence 2 length M (pt): number of samples in the query series
Sakoe-Chiba window width (%): maximum deviation from the diagonal, as a percentage of series length
Distance metric: definition of the point-to-point distance d(x,y)
True phase shift (%): how much sequence 2 is shifted in time relative to sequence 1
Amplitude scale (%): amplitude of sequence 2 relative to sequence 1
Results
DTW distance
Naive L2 distance
Alignment length
Restricted cells
Memory savings (%)
Speedup ratio
Cost matrix and optimal warping path

Top-left: cost matrix grid with the Sakoe-Chiba band highlighted in yellow and the optimal warping path in blue. Bottom: the two raw time series (white = series 1, blue = series 2), with the alignment links connecting matched points.

Raw series vs aligned series
DTW distance vs window width (%)
Theory & Key Formulas

$$D(i,j) = d(x_i, y_j) + \min\bigl\{D(i-1,j-1),\, D(i-1,j),\, D(i,j-1)\bigr\}$$

D is the accumulated cost matrix and d is the point-to-point distance. The dynamic-programming recursion fills D cell by cell, and the final cell D(N,M) is the DTW distance.
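The recursion can be written out in a few lines. The following is a minimal sketch, not the simulator's own code: it assumes 1-D numeric sequences, an absolute-difference point cost by default, and a border of infinities around D so that D[0][0] = 0 starts the recursion. The function name `dtw` and the backtracking tie-break are illustrative choices.

```python
import math


def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Return (DTW distance, warping path) for two sequences."""
    n, m = len(x), len(y)
    # Row 0 and column 0 are padding: only D[0][0] is reachable.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the final cell to recover one optimal path
    # as a list of 0-based (index in x, index in y) pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path


distance, path = dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])
print(distance, len(path))  # 0.0 6
```

The length of the returned path corresponds to the "alignment length" reported in the results panel; it always lies between max(N, M) and N + M − 1.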

$$|i-j| \le w, \qquad w = \max\bigl(|N-M|,\ \lceil r\cdot\max(N,M)\rceil\bigr)$$

Sakoe-Chiba band constraint. r is the window width as a fraction (the slider percentage divided by 100); cells outside the band are skipped. The restricted cell count N·(2w+1) − w·(w+1), which is exact when N = M, sets the memory and computation budget.
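The band arithmetic is easy to check by hand. The sketch below assumes the tool evaluates w and the cell count exactly as in the formulas above; the function names are illustrative, and the clamp to N·M is an assumption added so that a 100% window reports the full matrix.

```python
import math


def band_half_width(n, m, percent):
    """w = max(|N - M|, ceil(r * max(N, M))) with r = percent / 100."""
    return max(abs(n - m), math.ceil(percent * max(n, m) / 100))


def restricted_cells(n, m, percent):
    """Cell count N*(2w+1) - w*(w+1), clamped to the full N*M matrix."""
    w = band_half_width(n, m, percent)
    # Assumption: the count can never exceed the full matrix.
    return min(n * m, n * (2 * w + 1) - w * (w + 1))


def memory_savings(n, m, percent):
    """Percentage of the full matrix that is never allocated."""
    return 100 * (1 - restricted_cells(n, m, percent) / (n * m))


print(band_half_width(100, 80, 20))   # 20
print(restricted_cells(100, 80, 20))  # 3680
```

For N = 100, M = 80 and a 20% window this reproduces the 3680 cells and roughly 54% memory savings quoted in the dialogue below.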

$$\text{Speedup} = \frac{N \cdot M}{N \cdot (2w+1) - w(w+1)}$$

Speedup over the full N·M search. Narrowing the window makes DTW faster but risks pushing the path off its true optimum.
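The trade-off between speed and accuracy can be seen on a toy pair of series. This is a hedged sketch of a banded DTW, assuming an absolute-difference point cost and the same definition of w as above; `dtw_banded` is a hypothetical name, and the two series are made up for illustration (the second is the first shifted by two samples).

```python
import math


def dtw_banded(x, y, percent, dist=lambda a, b: abs(a - b)):
    """DTW distance restricted to cells with |i - j| <= w."""
    n, m = len(x), len(y)
    w = max(abs(n - m), math.ceil(percent * max(n, m) / 100))
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only columns inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = dist(x[i - 1], y[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


x = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0, 0, 0, 0]  # same bump, two samples earlier
for percent in (100, 20, 10, 0):
    print(percent, dtw_banded(x, y, percent))
```

With a 20% window (w = 2) the band is just wide enough to absorb the two-sample shift and the distance is 0; at 0% only the diagonal is available and the distance rises to 6.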

Dynamic Time Warping (DTW) — Time-Series Pattern Matching

🙋
"Dynamic time warping" sounds heavy — what does it actually do? I read that it compares "two series with the same waveform but different speed".
🎓
That's roughly right. Imagine the same word spoken in 1.0 s and again in 1.3 s. If you stack the waveforms and subtract, the elementwise difference explodes because of the speed mismatch. DTW instead searches, via dynamic programming, for the alignment that "stretches and shrinks" one of them so they match best. Two utterances of the same word land close together, even at different speeds. Sakoe and Chiba, then at NEC, published the classic formulation in 1978 for speech recognition. It's a venerable algorithm that is still in use.
🙋
I set N=100 and M=80 on the left. The naive L2 distance is 101, but DTW reports just 1.2. So DTW understood that these are basically the same waveform, only at different lengths?
🎓
Exactly. The naive L2 charges a penalty of 5 for each of the 20 samples left without a partner by the length mismatch, which gives 100 right off the bat. DTW compresses the time axis to align the 100 samples of series 1 onto the 80 samples of series 2, so only the genuine waveform difference is left. With phase shift 10% and amplitude 100% (i.e. equal), the final cell D(N,M) lands at 1.2. That's the reason DTW is still in active use for ECG, gait analysis and speech recognition.
🙋
What are "restricted cells" and "54% memory savings" on the right? With window 100% it uses all 8000 cells, with 20% only 3680.
🎓
That's the Sakoe-Chiba band kicking in. Plain DTW must fill all N·M = 100·80 = 8000 cells, but in real speech or ECG the correspondence rarely strays far from the diagonal. So we declare "only cells within ±w of the diagonal will be computed", and the cost drops massively. At 20% you get 3680 cells, 54% memory saved, and an 8000/3680 ≈ 2.17× speedup. On embedded devices and long sequences with thousands of samples, that's a huge win.
🙋
Then should I just push the window down to 1% and get a 50× speedup?
🎓
That's the trap. Too narrow a window pushes the true optimal path outside the band, and the distance shoots up. That's why the tool's verdict switches to a warning when the window is below 5%. Rules of thumb: speech recognition uses 5-10%, gait or heartbeats 10-20%, and for unknown data you start at 20-30%. Think of the window as "the maximum speed variation you're willing to allow", then narrow it for as long as your validation accuracy holds.
🙋
Why are there three distance metrics — Euclidean, Manhattan and cosine?
🎓
To swap the definition of the cell distance d(x,y). For 1-D scalar series (temperature, stock price, audio amplitude) Euclidean squared error is the default. If the data is noisy with outliers, Manhattan (L1) is more robust because it doesn't get dragged by single spikes. For multidimensional feature vectors such as MFCC in speech, the direction matters more than the magnitude, so cosine distance is standard. The DP skeleton of DTW stays the same — you just swap d. Try the dropdown and see how both the path and the cost react.
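The "keep the DP skeleton, swap d" idea from the dialogue can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: samples are tuples of features, the three metric functions and their names are hypothetical, and returning 1.0 for a zero-length vector in the cosine case is a convention chosen here.

```python
import math


def squared_euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    if na == 0 or nb == 0:
        return 1.0  # assumption: undefined direction counts as maximally far
    return 1.0 - dot / (na * nb)


def dtw(x, y, dist):
    """Same recursion for every metric; only dist changes."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]
            )
    return D[n][m]


# Two feature sequences with the same directions but different magnitudes.
x = [(1, 0), (0, 1)]
y = [(2, 0), (2, 0), (0, 3)]
for metric in (squared_euclidean, manhattan, cosine_distance):
    print(metric.__name__, dtw(x, y, metric))
```

In this toy case the cosine version reports a distance of 0 because only direction is compared, while the Euclidean and Manhattan versions still charge for the magnitude difference.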

Frequently Asked Questions

Elementwise L2 stacks two series at the same indices and subtracts them, so even small speed or phase differences blow up the distance, and series of different lengths cannot be compared at all. DTW fills a cost matrix D with dynamic programming and finds the optimal path that allows the time axis to stretch or compress. Two walking patterns at slightly different speeds therefore land close in DTW distance, and speech or ECG recordings of different length become comparable. With the defaults N=100, M=80 in this tool, the naive L2 is 101.0 while the DTW distance is just 1.2.
The Sakoe-Chiba band restricts the warping path to within distance w of the diagonal, slashing computation and memory. At 100% you get full DTW; at 0% with equal-length series only the diagonal is used, which collapses back to the naive pointwise distance. In practice 5-20% is standard for speech and gait; pick the value that bounds the speed variation you expect. Too narrow and the path is pushed off its true optimum, producing false mismatches, so tune it on a validation set.
For one-dimensional scalar series (temperature, stock prices, audio amplitude), Euclidean or Manhattan are the natural choices; Manhattan (L1) is more robust to outliers. For multidimensional feature vectors such as MFCC coefficients in speech, you usually care about direction more than magnitude, so cosine distance is common. This tool only swaps the definition of d(x,y), so you can feel how the same series gives different optimal paths and costs under different metrics.
Classical DTW is O(N·M), which becomes painful when N and M exceed a few thousand. FastDTW finds a coarse path at low resolution and refines it, giving an approximate result in roughly O(N) time. PrunedDTW uses an upper bound on the minimum cost to prune cells that cannot be optimal. UCR-DTW is the gold standard for query search, with normalization and aggressive early abandonment. Use FastDTW for thousands to tens of thousands of points, UCR-DTW for database search, and a straight Sakoe-Chiba implementation on the embedded side.
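The early-abandonment idea mentioned above can be illustrated in isolation. The sketch below is a simplified stand-in and not the UCR-DTW implementation: it assumes a non-negative point cost, keeps only two rows of D in memory, and gives up as soon as every cell in the current row already exceeds a caller-supplied bound `best_so_far` (a hypothetical parameter name). Because every warping path crosses every row and costs only accumulate, the final distance can never be lower than the smallest value in any row.

```python
import math


def dtw_early_abandon(x, y, best_so_far=math.inf,
                      dist=lambda a, b: abs(a - b)):
    """Return the DTW distance, or math.inf if it must exceed best_so_far."""
    n, m = len(x), len(y)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # padding row: only the origin is reachable
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            curr[j] = dist(x[i - 1], y[j - 1]) + min(
                prev[j - 1], prev[j], curr[j - 1]
            )
        if min(curr) > best_so_far:
            return math.inf  # no path through this row can beat the bound
        prev = curr
    return prev[m]


print(dtw_early_abandon([0, 0, 0], [1, 1, 1]))     # 3.0
print(dtw_early_abandon([0, 0, 0], [1, 1, 1], 2))  # inf
```

In a nearest-neighbour search, `best_so_far` would be the smallest distance found among the candidates examined so far.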

Real-world applications

Speech recognition and speaker verification: the original use case for DTW, which was a workhorse of isolated-word recognizers in the 1970s and 1980s. The MFCC sequence of the input is matched against dictionary templates with DTW, and the lowest-distance word is chosen. Even after the rise of HMMs and deep learning, DTW remains a practical choice in low-resource, small-vocabulary settings such as toy voice commands and embedded keyword spotting.

ECG and pulse-wave analysis: a patient's heartbeats vary with breathing and motion, so a naive pointwise distance can miss even obvious arrhythmias. Matching template heartbeats against the patient's recording with DTW helps detect atrial fibrillation and premature ventricular contractions. It is also widely used for searching for specific patterns inside long Holter monitor recordings.

Gait analysis and motion capture: stride and walking speed change day by day, so directly comparing accelerometer or IMU streams is hard. DTW absorbs the speed difference, allowing detection of changes in gait pattern before and after rehab, early signs of fall risk, and recognition of specific movements. Smartwatch fall detection can use this kind of pipeline.

Stock price and sensor anomaly detection: chart-pattern recognition in financial time series (head-and-shoulders, etc.), vibration diagnosis on factory motors, and energy-consumption pattern comparison in HVAC all depend on the "same pattern at slightly different speed" problem. With NumPy/Pandas and a small DTW library it is easy to implement and works without labelled data.

Common misconceptions and pitfalls

The biggest trap is "DTW can compare anything". DTW absorbs differences along the time axis, but amplitude differences still go straight into the cost. That is why setting the amplitude scale slider to 50% or 200% in this tool increases the DTW distance. In practice you should z-normalize each series (mean 0, standard deviation 1) before DTW; the UCR-DTW paper argues that comparisons made without normalization are not meaningful.
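The effect of normalization is easy to reproduce. The sketch below assumes population standard deviation and an absolute-difference point cost; `z_normalize` and the toy series are illustrative, and returning all zeros for a constant series is a convention chosen here.

```python
import math


def z_normalize(series):
    """Rescale to mean 0 and standard deviation 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        return [0.0] * n  # assumption: a flat series maps to all zeros
    return [(v - mean) / std for v in series]


def dtw(x, y):
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]
            )
    return D[n][m]


x = [0, 1, 2, 1, 0, 1, 2, 1, 0]
y = [2 * v + 5 for v in x]  # same shape, doubled amplitude plus an offset

print(dtw(x, y))                            # large: amplitude goes into the cost
print(dtw(z_normalize(x), z_normalize(y)))  # close to zero after normalization
```

Without normalization the offset and gain dominate the distance even though the two shapes are identical.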

Second, "the wider the window, the better the accuracy". A wider band reduces the risk of cutting off the true optimum, but it also lets noise create unphysical warps (paths that wander far from the diagonal and invent spurious correspondences). Large-scale experiments by Ratanamahatana et al. on the UCR datasets show that for most series a window of 5-10% is optimal and that wider bands either plateau or hurt accuracy. Start narrow and widen until validation error stops dropping.

Third, "DTW distance is a real metric". DTW distance is non-negative and symmetric, but it does not satisfy the triangle inequality (you can have d(A,C) > d(A,B) + d(B,C)). That rules out metric-tree accelerators such as VP-trees for nearest-neighbour search. The standard workaround is to prune candidates with a fast lower bound such as LB_Keogh first and run full DTW only on the survivors. Assume that kind of two-stage pipeline whenever you scale DTW up.
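The two-stage pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: all series have equal length, the point cost is the squared difference, the band half-width w is given in samples, and the function names are hypothetical. The lower bound follows the usual LB_Keogh construction (distance from each query sample to the candidate's upper/lower envelope within the band).

```python
import math


def dtw_band(x, y, w):
    """Banded DTW with squared point cost for equal-length series."""
    n = len(x)
    D = [[math.inf] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][n]


def lb_keogh(query, candidate, w):
    """Cheap lower bound on dtw_band(query, candidate, w)."""
    n = len(candidate)
    total = 0.0
    for i, q in enumerate(query):
        window = candidate[max(0, i - w):min(n, i + w + 1)]
        upper, lower = max(window), min(window)
        if q > upper:
            total += (q - upper) ** 2
        elif q < lower:
            total += (lower - q) ** 2
    return total


def nearest_neighbor(query, database, w):
    """Return (best index, best distance, number of full DTW runs)."""
    best_dist, best_idx, full_runs = math.inf, -1, 0
    for idx, cand in enumerate(database):
        if lb_keogh(query, cand, w) >= best_dist:
            continue  # the bound already rules this candidate out
        full_runs += 1
        d = dtw_band(query, cand, w)
        if d < best_dist:
            best_dist, best_idx = d, idx
    return best_idx, best_dist, full_runs


query = [0, 1, 2, 1, 0, 0]
database = [[5] * 6, [0, 0, 1, 2, 1, 0], [3] * 6]
print(nearest_neighbor(query, database, w=1))  # (1, 0.0, 2)
```

In this toy database the third candidate is discarded by the lower bound alone, so full DTW runs only twice for three candidates.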

How to Use

  1. Set Sequence 1 length (50–500 samples) and Sequence 2 length (50–500 samples) using the range sliders or numeric inputs.
  2. Define the warping window constraint as a percentage (0–100%) to limit the diagonal band width in the cost matrix, reducing computation from O(N·M) cells to roughly O(N·w), where w is the band half-width in samples.
  3. Apply phase shift (0–100%) to offset Sequence 2 temporally, simulating real-world misalignments in sensor data or speech signals.
  4. Click Calculate to compute DTW distance, view the accumulated cost matrix heatmap, and observe the optimal warping path.

Worked Example

Two vibration time series from bearing monitoring: Sequence 1 = 200 samples, Sequence 2 = 220 samples with 15% phase shift. Without a windowing constraint, the cost matrix has 200 × 220 = 44,000 cells. A 10% warping window gives w = max(|200 − 220|, ⌈0.10 × 220⌉) = 22, so the band formula N·(2w+1) − w·(w+1) yields 200 × 45 − 22 × 23 = 8,494 restricted cells (80.7% memory savings, about 5.2× speedup). DTW distance = 47.3 units vs. naive Euclidean L2 = 156.8 units, showing that temporal elasticity absorbs the time lag that pointwise comparison penalizes. Alignment length = 312 warping steps.

Practical Notes

  1. Use a warping window of 5–15% for signals with a known maximum acceptable time skew; wider windows tolerate larger timing deviations but increase computation. At a 0% window with equal-length series, DTW reduces to the naive pointwise distance.
  2. Phase shift >20% typically indicates preprocessing misalignment; check signal synchronization (e.g., ECG baseline drift, seismic P-wave arrival variance) before widening the window.
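  3. Speedup ratio is the cell-count ratio N·M / (restricted cells), a theoretical figure rather than a measured wall-clock time; for two 50-sample sequences with a 20% window, w = 10 and the ratio is 2500/940 ≈ 2.7×. Measured gains depend on the implementation.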