# Clustering Analysis

Specialized clustering tree operations for hierarchical clustering analysis, cluster validation, and dendrogram-based data exploration. ETE3 provides enhanced tree classes specifically designed for clustering workflows.

## Capabilities

### ClusterTree and ClusterNode Classes

Enhanced tree classes specialized for clustering analysis and validation.

```python { .api }
class ClusterTree(Tree):
    """
    Tree specialized for hierarchical clustering analysis.
    Inherits all Tree functionality plus clustering-specific methods.
    """

    def __init__(self, newick=None, **kwargs):
        """
        Initialize clustering tree.

        Parameters:
        - newick (str): Newick format string or file path
        - kwargs: Additional Tree initialization parameters
        """

class ClusterNode(ClusterTree):
    """Alias for ClusterTree - same functionality."""
    pass
```

### Cluster Profile Analysis

Extract and analyze cluster profiles and characteristics.

```python { .api }
def get_cluster_profile(self):
    """
    Get profile characteristics for cluster represented by this node.

    Returns:
    dict: Cluster profile including:
        - size: Number of items in cluster
        - height: Cluster height/dissimilarity
        - members: List of cluster members
        - profile: Statistical summary of cluster data
    """

def get_cluster_size(self):
    """
    Get number of items in cluster.

    Returns:
    int: Cluster size (number of leaf nodes)
    """

def get_cluster_members(self):
    """
    Get all members (leaf names) of this cluster.

    Returns:
    list: List of member names in cluster
    """

def get_cluster_height(self):
    """
    Get cluster height (distance at which cluster was formed).

    Returns:
    float: Cluster formation height/distance
    """
```

### Cluster Validation Metrics

Calculate various cluster validation and quality metrics.

```python { .api }
def get_silhouette(self):
    """
    Calculate silhouette coefficient for cluster.

    Measures how similar items are to their own cluster compared to other clusters.
    Values range from -1 to 1, where higher values indicate better clustering.

    Returns:
    float: Silhouette coefficient (-1 to 1)
    """

def get_intra_cluster_distance(self):
    """
    Calculate average intra-cluster distance.

    Returns:
    float: Average distance between items within cluster
    """

def get_inter_cluster_distance(self, other_cluster):
    """
    Calculate distance between this cluster and another cluster.

    Parameters:
    - other_cluster (ClusterTree): Other cluster for comparison

    Returns:
    float: Distance between clusters
    """

def get_cluster_variance(self):
    """
    Calculate within-cluster variance.

    Returns:
    float: Variance of distances within cluster
    """
```
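As a rough illustration of what a silhouette value measures, the sketch below computes it from a plain pairwise distance matrix and a list of flat cluster labels. It is not ETE3's implementation: the function name `silhouette_scores` and the list-of-lists input layout are assumptions made for this example only.

```python
from statistics import mean


def silhouette_scores(dist, labels):
    """Return the silhouette value s(i) for every item.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each item, in the same order as dist
    """
    # Group item indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Usual convention: singletons (or a single cluster) score 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the item's own cluster.
        a = mean(dist[i][j] for j in own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist[i][j] for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Four points on a line forming two tight, well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
scores = silhouette_scores(dist, [0, 0, 1, 1])
print([round(s, 3) for s in scores])
```

Averaging the per-item values gives one score for a whole partition, which is the quantity typically maximized when choosing a cut.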

### Cluster Cutting and Partitioning

Methods for extracting clusters at different levels of the hierarchy.

```python { .api }
def get_clusters_at_height(self, height):
    """
    Get clusters by cutting tree at specified height.

    Parameters:
    - height (float): Height at which to cut the tree

    Returns:
    list: List of ClusterTree objects representing clusters
    """

def get_clusters_by_size(self, min_size=2, max_size=None):
    """
    Get clusters within specified size range.

    Parameters:
    - min_size (int): Minimum cluster size
    - max_size (int): Maximum cluster size (None for no limit)

    Returns:
    list: List of clusters meeting size criteria
    """

def get_optimal_clusters(self, criterion="silhouette"):
    """
    Find optimal number of clusters using specified criterion.

    Parameters:
    - criterion (str): Optimization criterion ("silhouette", "gap", "elbow")

    Returns:
    tuple: (optimal_k, clusters_list, criterion_values)
    """
```
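To make the idea of cutting at a height concrete, here is a minimal sketch on a toy dendrogram stored as nested tuples. The `(height, left, right)` representation and the helper names `leaves` and `cut_at_height` are assumptions for illustration, not part of ETE3. The sketch also assumes merge heights never decrease toward the root.

```python
def leaves(node):
    """Collect leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut_at_height(node, height):
    """Return the clusters (lists of leaf names) left after cutting at `height`.

    A subtree whose merge height is at or below the cut stays together;
    otherwise the cut passes through it and both children are examined.
    """
    if isinstance(node, str) or node[0] <= height:
        return [leaves(node)]
    _, left, right = node
    return cut_at_height(left, height) + cut_at_height(right, height)


# Internal nodes are (merge_height, left_child, right_child).
dendrogram = (1.0, (0.2, "a", "b"), (0.5, "c", (0.1, "d", "e")))

for h in (0.05, 0.3, 0.6, 1.0):
    print(h, cut_at_height(dendrogram, h))
```

Lower cut heights yield more, smaller clusters; a cut at or above the root height returns a single cluster containing every leaf.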

### Cluster Comparison and Analysis

Compare different clusterings and analyze cluster relationships.

```python { .api }
def compare_clusters(self, other_clustering, method="adjusted_rand"):
    """
    Compare this clustering with another clustering.

    Parameters:
    - other_clustering (ClusterTree or dict): Other clustering to compare
    - method (str): Comparison metric ("adjusted_rand", "normalized_mutual_info", "homogeneity")

    Returns:
    float: Clustering similarity score
    """

def get_cluster_stability(self, bootstrap_samples=100):
    """
    Assess cluster stability through bootstrap resampling.

    Parameters:
    - bootstrap_samples (int): Number of bootstrap iterations

    Returns:
    dict: Stability scores for each cluster
    """
```
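For reference, the Adjusted Rand Index can be computed from two flat label assignments as sketched below. This is a standalone illustration, not the code behind `compare_clusters`; `adjusted_rand_index` is a hypothetical helper name, and both inputs are assumed to list the same items in the same order.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_a)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. every item is its own cluster in both labelings.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Label names do not matter, only which items are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

A score of 1 means the two partitions agree exactly, values near 0 are what random labelings would give, and negative values indicate less agreement than expected by chance.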

### Distance Matrix Integration

Work with distance matrices and clustering algorithms.

```python { .api }
def from_distance_matrix(self, distance_matrix, labels=None, method="average"):
    """
    Create clustering tree from distance matrix.

    Parameters:
    - distance_matrix (array-like): Symmetric distance matrix
    - labels (list): Labels for matrix rows/columns
    - method (str): Linkage method ("single", "complete", "average", "ward")

    Returns:
    ClusterTree: Hierarchical clustering result
    """

def get_distance_matrix(self):
    """
    Extract distance matrix from clustering tree.

    Returns:
    numpy.ndarray: Distance matrix between all leaf pairs
    """
```
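The sketch below shows, under simplifying assumptions, how a distance matrix can be turned into a Newick string using naive average linkage. It is a cubic-time teaching example rather than the algorithm behind `from_distance_matrix`; the helper names are hypothetical, and node heights are taken to be the merge distances (branch length = parent height minus child height).

```python
from itertools import combinations


def _mean_distance(dist, members_a, members_b):
    """Average pairwise distance between two groups of item indices."""
    total = sum(dist[i][j] for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))


def average_linkage_newick(dist, labels):
    """Cluster items by average linkage and return the tree as Newick.

    Assumes at least one item and a symmetric distance matrix.
    """
    # Active clusters: id -> (member indices, Newick text, merge height).
    clusters = {i: ([i], str(labels[i]), 0.0) for i in range(len(labels))}
    next_id = len(labels)

    while len(clusters) > 1:
        # Merge the two active clusters with the smallest average distance.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda p: _mean_distance(dist, clusters[p[0]][0], clusters[p[1]][0]),
        )
        members_a, text_a, height_a = clusters.pop(a)
        members_b, text_b, height_b = clusters.pop(b)
        height = _mean_distance(dist, members_a, members_b)
        text = (
            f"({text_a}:{height - height_a:.4f},"
            f"{text_b}:{height - height_b:.4f})"
        )
        clusters[next_id] = (members_a + members_b, text, height)
        next_id += 1

    _, newick, _ = next(iter(clusters.values()))
    return newick + ";"


points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
print(average_linkage_newick(dist, ["a", "b", "c", "d"]))
```

The resulting string is ordinary Newick, so it can be passed to any constructor that accepts a Newick string, such as the `newick` argument documented for `ClusterTree` above.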

### Cluster Visualization

Specialized visualization methods for clustering results.

```python { .api }
def show_cluster_heatmap(self, data_matrix=None, color_map="viridis"):
    """
    Display clustering results with associated data heatmap.

    Parameters:
    - data_matrix (array-like): Data matrix to display alongside tree
    - color_map (str): Color scheme for heatmap
    """

def render_dendrogram(self, orientation="top", leaf_rotation=90, **kwargs):
    """
    Render tree as dendrogram with clustering-specific formatting.

    Parameters:
    - orientation (str): Dendrogram orientation ("top", "bottom", "left", "right")
    - leaf_rotation (int): Rotation angle for leaf labels
    - kwargs: Additional rendering parameters
    """
```

## Integration with Data Analysis

### ArrayTable Integration

Seamless integration with ETE3's ArrayTable for data-driven clustering.

```python { .api }
# In ArrayTable class
def cluster_data(self, method="ward", metric="euclidean"):
    """
    Perform hierarchical clustering on table data.

    Parameters:
    - method (str): Linkage method ("ward", "complete", "average", "single")
    - metric (str): Distance metric ("euclidean", "manhattan", "cosine", "correlation")

    Returns:
    ClusterTree: Clustering result tree
    """
```
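The four metric names listed for `cluster_data` commonly denote the distance functions sketched below. These are textbook definitions written in plain Python for illustration; whether `cluster_data` uses exactly these formulas is an assumption, and the function names are chosen only for this example.

```python
from math import sqrt


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation(u, v):
    """One minus the Pearson correlation: cosine distance after centering."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(cosine([1, 0], [0, 1]))
print(correlation([1, 2, 3], [2, 4, 6]))
```

Correlation distance ignores scale and offset, so profiles that rise and fall together are close even when their magnitudes differ, which is why it is a frequent choice for expression data. Euclidean distance is the usual pairing for Ward linkage.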

## Usage Examples

### Basic Clustering Analysis

```python
from ete3 import ClusterTree
import numpy as np

# Load clustering result (from distance matrix or linkage)
cluster_tree = ClusterTree("clustering_result.nw")

# Basic cluster information
print(f"Total items clustered: {len(cluster_tree.get_leaves())}")
print(f"Tree height: {cluster_tree.get_tree_root().get_cluster_height()}")

# Analyze individual clusters
for node in cluster_tree.traverse():
    if not node.is_leaf():
        profile = node.get_cluster_profile()
        print(f"Cluster size: {profile['size']}, height: {profile['height']:.3f}")
```

### Cluster Validation

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("hierarchical_clustering.nw")

# Calculate silhouette scores for all clusters
silhouette_scores = {}
for node in cluster_tree.traverse():
    if not node.is_leaf() and len(node.get_leaves()) > 1:
        silhouette = node.get_silhouette()
        silhouette_scores[node] = silhouette
        print(f"Cluster {len(node.get_leaves())} items: silhouette = {silhouette:.3f}")

# Find best clusters based on silhouette
best_clusters = [node for node, score in silhouette_scores.items() if score > 0.5]
print(f"Found {len(best_clusters)} high-quality clusters")
```

### Cluster Cutting and Optimization

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("clustering_dendrogram.nw")

# Cut tree at different heights
heights = [0.1, 0.2, 0.5, 1.0]
for height in heights:
    clusters = cluster_tree.get_clusters_at_height(height)
    print(f"Height {height}: {len(clusters)} clusters")

    # Analyze cluster sizes
    sizes = [len(cluster.get_leaves()) for cluster in clusters]
    print(f" Cluster sizes: {sizes}")

# Find optimal clustering
optimal_k, optimal_clusters, scores = cluster_tree.get_optimal_clusters(criterion="silhouette")
print(f"Optimal number of clusters: {optimal_k}")
print(f"Optimal clustering silhouette: {max(scores):.3f}")
```

### Integration with Data Analysis

```python
from ete3 import ArrayTable, ClusterTree
import numpy as np

# Load expression data
expression_data = ArrayTable("gene_expression.txt")

# Perform clustering
cluster_result = expression_data.cluster_data(method="ward", metric="euclidean")

# Analyze clustering quality
for node in cluster_result.traverse():
    if not node.is_leaf():
        cluster_profile = node.get_cluster_profile()
        if cluster_profile['size'] >= 5:  # Focus on larger clusters
            silhouette = node.get_silhouette()
            variance = node.get_cluster_variance()
            print(f"Cluster {cluster_profile['size']} genes:")
            print(f" Silhouette: {silhouette:.3f}")
            print(f" Variance: {variance:.3f}")
            print(f" Members: {node.get_cluster_members()[:5]}...")  # Show first 5
```

### Cluster Comparison

```python
from ete3 import ClusterTree

# Load two different clustering results
clustering1 = ClusterTree("method1_clustering.nw")
clustering2 = ClusterTree("method2_clustering.nw")

# Compare clusterings
similarity = clustering1.compare_clusters(clustering2, method="adjusted_rand")
print(f"Clustering similarity (Adjusted Rand Index): {similarity:.3f}")

# Assess stability
stability_scores = clustering1.get_cluster_stability(bootstrap_samples=50)
for cluster, stability in stability_scores.items():
    print(f"Cluster stability: {stability:.3f}")
```

### Advanced Clustering Workflow

```python
from ete3 import ArrayTable, ClusterTree
import numpy as np

# Complete clustering analysis workflow
def analyze_clustering(data_file, methods=("ward", "complete", "average")):
    # Load data
    data = ArrayTable(data_file)

    # Try different clustering methods
    results = {}
    for method in methods:
        cluster_tree = data.cluster_data(method=method, metric="euclidean")

        # Find optimal clusters
        opt_k, opt_clusters, scores = cluster_tree.get_optimal_clusters()

        # Calculate overall quality metrics (singleton clusters are skipped)
        avg_silhouette = np.mean([cluster.get_silhouette()
                                  for cluster in opt_clusters
                                  if len(cluster.get_leaves()) > 1])

        results[method] = {
            'tree': cluster_tree,
            'optimal_k': opt_k,
            'avg_silhouette': avg_silhouette,
            'clusters': opt_clusters
        }

        print(f"{method}: k={opt_k}, silhouette={avg_silhouette:.3f}")

    # Select best method (highest average silhouette)
    best_method = max(results.keys(),
                      key=lambda m: results[m]['avg_silhouette'])

    print(f"\nBest method: {best_method}")
    return results[best_method]

# Run analysis
best_clustering = analyze_clustering("expression_matrix.txt")

# Visualize best result
best_clustering['tree'].show_cluster_heatmap()
```

### Custom Distance Metrics

```python
from ete3 import ClusterTree
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

def newick_from_scipy_tree(node, parent_dist):
    """Illustrative helper (not part of ETE3 or scipy): scipy ClusterNode -> Newick.

    Branch length is the parent's merge height minus this node's merge height.
    Leaves are named by their row index in the data matrix.
    """
    length = parent_dist - node.dist
    if node.is_leaf():
        return f"{node.id}:{length:.6f}"
    left = newick_from_scipy_tree(node.get_left(), node.dist)
    right = newick_from_scipy_tree(node.get_right(), node.dist)
    return f"({left},{right}):{length:.6f}"

# Custom clustering with correlation distance
def correlation_clustering(data_matrix, method="average"):
    # Condensed pairwise distances; scipy's 'correlation' metric is 1 - r
    condensed_distances = pdist(data_matrix, metric='correlation')

    # Perform hierarchical clustering
    linkage_matrix = linkage(condensed_distances, method=method)

    # Convert the scipy tree to Newick, then to a ClusterTree
    scipy_tree = to_tree(linkage_matrix)
    newick = newick_from_scipy_tree(scipy_tree, scipy_tree.dist) + ";"
    return ClusterTree(newick)

# Use custom clustering
data = np.random.rand(50, 100)  # 50 samples, 100 features
custom_cluster_tree = correlation_clustering(data)

# Analyze results
optimal_k, optimal_clusters, scores = custom_cluster_tree.get_optimal_clusters()
print(f"Custom clustering found {len(optimal_clusters)} optimal clusters")
```