# Distance Matrices

Efficient computation of distance matrices for multiple time series, supporting parallel processing, memory optimization through blocking, and various output formats for large-scale time series analysis. Distance matrices are essential for clustering, similarity search, and comparative analysis of time series datasets.

## Capabilities

### Full Distance Matrix Computation

Compute pairwise DTW distances between all sequences in a collection, with support for both full matrices and condensed formats.
```python { .api }
def distance_matrix(s, max_dist=None, max_length_diff=None, window=None,
                    max_step=None, penalty=None, psi=None, block=None,
                    compact=False, parallel=False, use_c=False,
                    use_nogil=False, show_progress=False):
    """
    Compute distance matrix for all sequence pairs in a collection.

    Parameters:
    - s: list/array/SeriesContainer, collection of time series sequences
    - max_dist: float, early stopping threshold for individual distances
    - max_length_diff: int, maximum length difference between sequences
    - window: int, warping window constraint
    - max_step: float, maximum step size constraint
    - penalty: float, penalty for compression/expansion
    - psi: int, psi relaxation parameter
    - block: tuple, restrict computation to a block of the matrix,
      e.g. ((row_start, row_end), (col_start, col_end))
    - compact: bool, return condensed array instead of full matrix
    - parallel: bool, enable parallel computation
    - use_c: bool, use C implementation
    - use_nogil: bool, use GIL-free C implementation for better parallelization
    - show_progress: bool, display progress bar

    Returns:
    array: distance matrix of shape (n, n) or condensed array of length n*(n-1)/2
    """

def distance_matrix_fast(s, max_dist=None, max_length_diff=None, window=None,
                         max_step=None, penalty=None, psi=None, block=None,
                         compact=False, parallel=True):
    """
    Fast C version of distance matrix computation.

    Automatically uses the optimized C implementation, with parallel
    processing enabled by default for better performance on multi-core
    systems.

    Parameters:
    Same as distance_matrix(), but with parallel=True by default.

    Returns:
    array: distance matrix or condensed array
    """

def distance_matrix_python(s, block=None, show_progress=False,
                           max_length_diff=None, dist_opts=None):
    """
    Pure Python distance matrix implementation.

    Provides a reference implementation that does not require C extensions,
    useful for debugging or when C extensions are unavailable.

    Parameters:
    - s: list/array, collection of sequences
    - block: tuple, memory blocking configuration
    - show_progress: bool, display progress bar
    - max_length_diff: int, maximum length difference
    - dist_opts: dict, options passed to the distance function

    Returns:
    array: condensed distance array
    """
```
### Distance Matrix Configuration

Get pre-configured distance matrix functions with specific settings for repeated use with consistent parameters.
```python { .api }
def distance_matrix_func(use_c=False, use_nogil=False, parallel=False,
                         show_progress=False):
    """
    Return a configured distance matrix function.

    Creates a partially configured distance matrix function with the
    specified performance and display options, useful for repeated
    computations with consistent settings.

    Parameters:
    - use_c: bool, use C implementation
    - use_nogil: bool, use GIL-free implementation
    - parallel: bool, enable parallel processing
    - show_progress: bool, show progress bar

    Returns:
    function: configured distance matrix function
    """
```
### Matrix Format Conversion

Convert between different distance matrix representations for compatibility with various analysis tools.
```python { .api }
def distances_array_to_matrix(dists, nb_series, block=None):
    """
    Convert condensed distance array to full symmetric matrix.

    Transforms the condensed array format (n*(n-1)/2 elements) into a full
    symmetric n x n matrix with zeros on the diagonal.

    Parameters:
    - dists: array, condensed distance array
    - nb_series: int, number of original sequences
    - block: tuple, optional blocking for memory efficiency

    Returns:
    array: full symmetric distance matrix of shape (nb_series, nb_series)
    """

def distance_array_index(a, b, nb_series):
    """
    Get the index in the condensed distance array for sequence pair (a, b).

    Computes the position in the condensed array where the distance
    between sequences a and b is stored.

    Parameters:
    - a, b: int, sequence indices (with a < b)
    - nb_series: int, total number of sequences

    Returns:
    int: index in condensed distance array
    """
```
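The condensed layout is the conventional row-major upper-triangle ordering (the same convention used by scipy's `squareform`). As a standalone sketch, independent of the library, a hypothetical `condensed_index` helper mirroring `distance_array_index` works out to:

```python
# Hypothetical helper: in the row-major upper-triangle layout, rows
# 0..a-1 contribute (n-1) + (n-2) + ... + (n-a) entries before row a,
# then b is offset within row a.
def condensed_index(a, b, nb_series):
    return a * nb_series - a * (a + 1) // 2 + (b - a - 1)

# Enumerate the pairs of 4 series in storage order and check the formula
pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
for idx, (a, b) in enumerate(pairs):
    assert condensed_index(a, b, 4) == idx

print(pairs)  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

The same formula underlies the conversion to a full matrix: entry (a, b) of the matrix is read from this position in the condensed array.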
## Usage Examples

### Basic Distance Matrix
```python
from dtaidistance import dtw

# Create a collection of time series
series = [
    [1, 2, 3, 2, 1],
    [0, 1, 2, 3, 2, 1, 0],
    [1, 3, 2, 1],
    [2, 1, 0, 1, 2, 3, 2]
]

# Compute full distance matrix
distances = dtw.distance_matrix(series)
print("Distance matrix shape:", distances.shape)
print("Distance matrix:")
print(distances)

# Compute condensed distance array (more memory efficient)
distances_condensed = dtw.distance_matrix(series, compact=True)
print("Condensed array length:", len(distances_condensed))
print("Condensed distances:", distances_condensed)
```
### High-Performance Distance Matrix
```python
from dtaidistance import dtw
import numpy as np
import time

# Generate a larger dataset
np.random.seed(42)
series = [np.cumsum(np.random.randn(100)) for _ in range(50)]

# Compare different implementations; note that distance_matrix_python
# returns a condensed array, while the others return a full matrix
methods = [
    ("Python", lambda: dtw.distance_matrix_python(series)),
    ("C sequential", lambda: dtw.distance_matrix(series, use_c=True, parallel=False)),
    ("C parallel", lambda: dtw.distance_matrix_fast(series)),
    ("C nogil parallel", lambda: dtw.distance_matrix(series, use_c=True,
                                                     use_nogil=True, parallel=True))
]

for name, method in methods:
    start = time.time()
    result = method()
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s, shape: {np.asarray(result).shape}")
```
### Memory-Efficient Blocking
```python
from dtaidistance import dtw
import numpy as np

# Large dataset that might not fit in memory all at once
large_series = [np.random.randn(200) for _ in range(100)]

# Process in blocks to manage memory usage
block_size = 25
n_series = len(large_series)

distances = np.zeros((n_series, n_series))

for i in range(0, n_series, block_size):
    for j in range(i, n_series, block_size):
        end_i = min(i + block_size, n_series)
        end_j = min(j + block_size, n_series)

        if i == j:
            # Diagonal block: all pairs within the same block
            block_series = large_series[i:end_i]
            distances[i:end_i, i:end_i] = dtw.distance_matrix(block_series,
                                                              compact=False)
        else:
            # Off-diagonal block: cross distances between two blocks
            for x in range(i, end_i):
                for y in range(j, end_j):
                    dist = dtw.distance(large_series[x], large_series[y])
                    distances[x, y] = dist
                    distances[y, x] = dist  # symmetric

print(f"Computed {n_series}x{n_series} distance matrix in blocks")
```
### Distance Matrix with Constraints
```python
from dtaidistance import dtw

# Time series with different characteristics
series = [
    [1, 2, 3, 4, 5],           # Increasing trend
    [5, 4, 3, 2, 1],           # Decreasing trend
    [1, 3, 2, 4, 1],           # Oscillating
    [2, 2, 2, 2, 2],           # Constant
    [1, 2, 3, 4, 5, 6, 7, 8]   # Longer increasing
]

# Distance matrix with various constraints
constraints = {
    'window': 3,           # Sakoe-Chiba band width
    'max_dist': 10.0,      # Early stopping threshold
    'max_length_diff': 4,  # Maximum allowed length difference
    'penalty': 0.1         # Warping penalty
}

distances = dtw.distance_matrix(series, **constraints, show_progress=True)

print("Constrained distance matrix:")
print(distances)

# Inspect all pairwise distances
n = len(series)
for i in range(n):
    for j in range(i + 1, n):
        print(f"Distance between series {i} and {j}: {distances[i, j]:.2f}")
```
### Integration with Clustering
```python
from dtaidistance import dtw, clustering
import numpy as np

# Generate sample time series data
np.random.seed(42)
series_length = 50

# Create three groups of similar time series (20 series in total)
cluster1 = [np.sin(np.linspace(0, 4*np.pi, series_length)) +
            0.1*np.random.randn(series_length) for _ in range(7)]
cluster2 = [np.cos(np.linspace(0, 3*np.pi, series_length)) +
            0.1*np.random.randn(series_length) for _ in range(6)]
cluster3 = [np.linspace(0, 1, series_length) +
            0.1*np.random.randn(series_length) for _ in range(7)]

all_series = cluster1 + cluster2 + cluster3

# Compute the distance matrix directly (e.g. for inspection); note that
# the clusterer below computes its own distances via dists_fun
distances = dtw.distance_matrix(all_series, use_c=True, parallel=True)

# Perform hierarchical clustering
clusterer = clustering.Hierarchical(
    dists_fun=dtw.distance_matrix_fast,
    dists_options={},
    max_dist=np.inf
)

cluster_tree = clusterer.fit(all_series)
print("Clustering completed")
print(f"Number of clusters found: {len(cluster_tree)}")
```
### Format Conversion Examples
```python
from dtaidistance import dtw

series = [[1, 2, 3], [2, 3, 1], [3, 1, 2], [1, 3, 2]]

# Get condensed distance array
condensed = dtw.distance_matrix(series, compact=True)
print("Condensed array:", condensed)

# Convert to full matrix
full_matrix = dtw.distances_array_to_matrix(condensed, len(series))
print("Full matrix:")
print(full_matrix)

# Verify the conversion by accessing specific distances
for i in range(len(series)):
    for j in range(i + 1, len(series)):
        condensed_idx = dtw.distance_array_index(i, j, len(series))
        condensed_dist = condensed[condensed_idx]
        matrix_dist = full_matrix[i, j]
        print(f"Distance ({i},{j}): condensed={condensed_dist:.3f}, "
              f"matrix={matrix_dist:.3f}, match={abs(condensed_dist - matrix_dist) < 1e-10}")
```
### Performance Optimization Strategies
```python
from dtaidistance import dtw
from dtaidistance.util import SeriesContainer
import numpy as np

# Prepare the data
raw_series = [list(np.random.randn(100)) for _ in range(30)]

# Use SeriesContainer for better performance with the C implementation
series_container = SeriesContainer(raw_series)

# Configure an optimized distance matrix function for repeated use
fast_distance_matrix = dtw.distance_matrix_func(
    use_c=True,
    use_nogil=True,
    parallel=True,
    show_progress=True
)

# Compute with optimizations
distances = fast_distance_matrix(
    series_container,
    window=10,      # Reasonable warping window constraint
    max_dist=50.0,  # Early stopping threshold
    compact=True    # Memory-efficient condensed format
)

print(f"Computed pairwise distances for {len(raw_series)} series")
print(f"Result shape: {distances.shape if hasattr(distances, 'shape') else len(distances)}")
```
## Memory and Performance Considerations

### Memory Usage

- **Full matrix**: O(n²) memory for n sequences
- **Condensed array**: n(n-1)/2 entries, roughly half the memory of the full matrix and more practical for large datasets
- **Blocking**: process subsets to bound memory usage on very large datasets
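To make the difference concrete, a back-of-the-envelope sketch (plain Python; the dataset size `n` and float64 entries are assumptions for illustration):

```python
n = 5000             # hypothetical number of sequences
bytes_per_entry = 8  # float64

full_bytes = n * n * bytes_per_entry                  # full n x n matrix
condensed_bytes = n * (n - 1) // 2 * bytes_per_entry  # upper triangle only

print(f"Full matrix:     {full_bytes / 1e9:.2f} GB")      # 0.20 GB
print(f"Condensed array: {condensed_bytes / 1e9:.2f} GB") # 0.10 GB
```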
### Computational Complexity

- **Time complexity**: O(n² × m²) for n sequences of average length m
- **Parallel processing**: near-linear speedup on multi-core systems with `parallel=True`
- **C extensions**: typically 10-100x faster than the pure Python implementation
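As a rough cost model (a sketch, not a measurement): each of the n(n-1)/2 pairs fills an m × m warping matrix, and a Sakoe-Chiba `window` of width w shrinks that to roughly m(2w + 1) cells per pair:

```python
def dtw_cell_estimate(n, m, window=None):
    # Number of pairwise comparisons times cells per warping matrix
    pairs = n * (n - 1) // 2
    cells_per_pair = m * (2 * window + 1) if window is not None else m * m
    return pairs * cells_per_pair

print(f"{dtw_cell_estimate(100, 200):,}")             # 198,000,000 cells
print(f"{dtw_cell_estimate(100, 200, window=10):,}")  # 20,790,000 cells
```

The nearly tenfold reduction from a modest window is why constraining the warping path is usually the first optimization to try.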
### Optimization Guidelines

1. Use `compact=True` for memory efficiency
2. Enable `parallel=True` for multi-core systems
3. Set appropriate `window` constraints to reduce computation
4. Use `max_dist` for early stopping in similarity search
5. Consider blocking for very large datasets (>1000 sequences)