Tessl Tile for pypi/ete3@3.1.0

# Clustering Analysis

Specialized tree operations for hierarchical clustering analysis, cluster validation, and dendrogram-based data exploration. ETE3 provides enhanced tree classes designed specifically for clustering workflows.

## Capabilities

### ClusterTree and ClusterNode Classes

Enhanced tree classes specialized for clustering analysis and validation.

```python { .api }
class ClusterTree(Tree):
    """
    Tree specialized for hierarchical clustering analysis.
    Inherits all Tree functionality plus clustering-specific methods.
    """

    def __init__(self, newick=None, **kwargs):
        """
        Initialize clustering tree.

        Parameters:
        - newick (str): Newick format string or file path
        - kwargs: Additional Tree initialization parameters
        """

class ClusterNode(ClusterTree):
    """Alias for ClusterTree - same functionality."""
    pass
```

### Cluster Profile Analysis

Extract and analyze cluster profiles and characteristics.

```python { .api }
def get_cluster_profile(self):
    """
    Get profile characteristics for cluster represented by this node.

    Returns:
    dict: Cluster profile including:
        - size: Number of items in cluster
        - height: Cluster height/dissimilarity
        - members: List of cluster members
        - profile: Statistical summary of cluster data
    """

def get_cluster_size(self):
    """
    Get number of items in cluster.

    Returns:
    int: Cluster size (number of leaf nodes)
    """

def get_cluster_members(self):
    """
    Get all members (leaf names) of this cluster.

    Returns:
    list: List of member names in cluster
    """

def get_cluster_height(self):
    """
    Get cluster height (distance at which cluster was formed).

    Returns:
    float: Cluster formation height/distance
    """
```

### Cluster Validation Metrics

Calculate various cluster validation and quality metrics.

```python { .api }
def get_silhouette(self):
    """
    Calculate silhouette coefficient for cluster.

    Measures how similar items are to their own cluster compared to other clusters.
    Values range from -1 to 1, where higher values indicate better clustering.

    Returns:
    float: Silhouette coefficient (-1 to 1)
    """

def get_intra_cluster_distance(self):
    """
    Calculate average intra-cluster distance.

    Returns:
    float: Average distance between items within cluster
    """

def get_inter_cluster_distance(self, other_cluster):
    """
    Calculate distance between this cluster and another cluster.

    Parameters:
    - other_cluster (ClusterTree): Other cluster for comparison

    Returns:
    float: Distance between clusters
    """

def get_cluster_variance(self):
    """
    Calculate within-cluster variance.

    Returns:
    float: Variance of distances within cluster
    """
```
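
To make the silhouette definition above concrete, here is a minimal pure-Python sketch of the classic per-point silhouette value. It does not use ETE3, and the function name `silhouette_scores` and the toy data are assumptions for illustration only. The per-cluster value described above may be computed differently by the library.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def silhouette_scores(points, labels):
    """Return the silhouette value s(i) for every point (hypothetical helper).

    a(i): mean distance from point i to the other members of its own cluster.
    b(i): smallest mean distance from point i to the members of another cluster.
    s(i) = (b - a) / max(a, b); singleton clusters get 0 by convention.
    Requires at least two distinct labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Two well-separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette_scores(points, ["A", "A", "B", "B"])
bad = silhouette_scores(points, ["A", "B", "A", "B"])
print(f"good labeling: {sum(good) / len(good):.3f}")
print(f"bad labeling:  {sum(bad) / len(bad):.3f}")
```

Averaging the per-point values over a cluster, or over the whole data set, gives a single score in the range -1 to 1 that can be compared across candidate clusterings.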

### Cluster Cutting and Partitioning

Methods for extracting clusters at different levels of the hierarchy.

```python { .api }
def get_clusters_at_height(self, height):
    """
    Get clusters by cutting tree at specified height.

    Parameters:
    - height (float): Height at which to cut the tree

    Returns:
    list: List of ClusterTree objects representing clusters
    """

def get_clusters_by_size(self, min_size=2, max_size=None):
    """
    Get clusters within specified size range.

    Parameters:
    - min_size (int): Minimum cluster size
    - max_size (int): Maximum cluster size (None for no limit)

    Returns:
    list: List of clusters meeting size criteria
    """

def get_optimal_clusters(self, criterion="silhouette"):
    """
    Find optimal number of clusters using specified criterion.

    Parameters:
    - criterion (str): Optimization criterion ("silhouette", "gap", "elbow")

    Returns:
    tuple: (optimal_k, clusters_list, criterion_values)
    """
```
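
The idea of cutting a dendrogram at a height can be sketched without ETE3. The `Node` class and `cut_at_height` function below are hypothetical stand-ins, and the sketch assumes merge heights never increase from parent to child. It returns the topmost subtrees whose merge height does not exceed the cut.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical dendrogram node; leaves have height 0 and no children."""
    name: str = ""
    height: float = 0.0
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the names of all leaves below this node."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]

def cut_at_height(node, height):
    """Return the topmost subtrees whose merge height is <= `height`."""
    if not node.children or node.height <= height:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(cut_at_height(child, height))
    return clusters

# Small example dendrogram with merges at heights 0.1, 0.2, 0.5 and 1.0
tree = Node(height=1.0, children=[
    Node(height=0.2, children=[Node("a"), Node("b")]),
    Node(height=0.5, children=[
        Node("c"),
        Node(height=0.1, children=[Node("d"), Node("e")]),
    ]),
])

for h in (0.05, 0.3, 1.0):
    print(h, [cluster.leaves() for cluster in cut_at_height(tree, h)])
```

A low cut yields many small clusters, and raising the cut merges them until only the root remains.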

### Cluster Comparison and Analysis

Compare different clusterings and analyze cluster relationships.

```python { .api }
def compare_clusters(self, other_clustering, method="adjusted_rand"):
    """
    Compare this clustering with another clustering.

    Parameters:
    - other_clustering (ClusterTree or dict): Other clustering to compare
    - method (str): Comparison metric ("adjusted_rand", "normalized_mutual_info", "homogeneity")

    Returns:
    float: Clustering similarity score
    """

def get_cluster_stability(self, bootstrap_samples=100):
    """
    Assess cluster stability through bootstrap resampling.

    Parameters:
    - bootstrap_samples (int): Number of bootstrap iterations

    Returns:
    dict: Stability scores for each cluster
    """
```
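
For readers unfamiliar with the "adjusted_rand" metric named above, the following is a minimal standard-library sketch of the Adjusted Rand Index computed from two flat label lists. The function name `adjusted_rand_index` is an assumption, and this is not the library's implementation.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of items grouped together in both, in A only, and in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # crossed partition
```

The index depends only on which items are grouped together, so renaming cluster labels does not change the score. Values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.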

### Distance Matrix Integration

Work with distance matrices and clustering algorithms.

```python { .api }
def from_distance_matrix(self, distance_matrix, labels=None, method="average"):
    """
    Create clustering tree from distance matrix.

    Parameters:
    - distance_matrix (array-like): Symmetric distance matrix
    - labels (list): Labels for matrix rows/columns
    - method (str): Linkage method ("single", "complete", "average", "ward")

    Returns:
    ClusterTree: Hierarchical clustering result
    """

def get_distance_matrix(self):
    """
    Extract distance matrix from clustering tree.

    Returns:
    numpy.ndarray: Distance matrix between all leaf pairs
    """
```
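
One common way to read a leaf-to-leaf distance matrix off a dendrogram is the cophenetic distance: the height of the lowest merge that joins two leaves. The sketch below assumes that interpretation and uses a hypothetical nested-tuple tree encoding rather than an ETE3 object, so it illustrates the idea only.

```python
def cophenetic_matrix(tree):
    """Return (labels, matrix) of cophenetic distances.

    `tree` is either a leaf name (str) or a tuple
    (merge_height, left_subtree, right_subtree).
    """
    distances = {}

    def visit(node):
        if isinstance(node, str):
            return [node]
        height, left, right = node
        left_leaves, right_leaves = visit(left), visit(right)
        # Leaves on opposite sides of this merge first meet at `height`
        for a in left_leaves:
            for b in right_leaves:
                distances[frozenset((a, b))] = height
        return left_leaves + right_leaves

    labels = visit(tree)
    matrix = [
        [0.0 if a == b else distances[frozenset((a, b))] for b in labels]
        for a in labels
    ]
    return labels, matrix

example = (1.0, (0.2, "a", "b"), (0.5, "c", "d"))
labels, matrix = cophenetic_matrix(example)
print(labels)
for row in matrix:
    print(row)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any routine that expects a square distance matrix.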

### Cluster Visualization

Specialized visualization methods for clustering results.

```python { .api }
def show_cluster_heatmap(self, data_matrix=None, color_map="viridis"):
    """
    Display clustering results with associated data heatmap.

    Parameters:
    - data_matrix (array-like): Data matrix to display alongside tree
    - color_map (str): Color scheme for heatmap
    """

def render_dendrogram(self, orientation="top", leaf_rotation=90, **kwargs):
    """
    Render tree as dendrogram with clustering-specific formatting.

    Parameters:
    - orientation (str): Dendrogram orientation ("top", "bottom", "left", "right")
    - leaf_rotation (int): Rotation angle for leaf labels
    - kwargs: Additional rendering parameters
    """
```

## Integration with Data Analysis

### ArrayTable Integration

Seamless integration with ETE3's ArrayTable for data-driven clustering.

```python { .api }
# In ArrayTable class
def cluster_data(self, method="ward", metric="euclidean"):
    """
    Perform hierarchical clustering on table data.

    Parameters:
    - method (str): Linkage method ("ward", "complete", "average", "single")
    - metric (str): Distance metric ("euclidean", "manhattan", "cosine", "correlation")

    Returns:
    ClusterTree: Clustering result tree
    """
```
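
The four metric names listed above have standard definitions, sketched here in plain Python for two equal-length vectors. The function names are assumptions for illustration, and the library may compute these values through other routines.

```python
from math import sqrt

def euclidean(u, v):
    """Straight-line distance."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cosine similarity; raises ZeroDivisionError for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def correlation_distance(u, v):
    """1 - Pearson correlation, i.e. cosine distance of mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(cosine_distance((1, 0), (0, 1)))
print(correlation_distance((1, 2, 3), (3, 2, 1)))
```

Euclidean and Manhattan distances are sensitive to magnitude, while cosine and correlation distances compare only the shape of the two profiles. This is why correlation is a frequent choice for expression data.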

## Usage Examples

### Basic Clustering Analysis

```python
from ete3 import ClusterTree

# Load clustering result (from distance matrix or linkage)
cluster_tree = ClusterTree("clustering_result.nw")

# Basic cluster information
print(f"Total items clustered: {len(cluster_tree.get_leaves())}")
print(f"Tree height: {cluster_tree.get_tree_root().get_cluster_height()}")

# Analyze individual clusters
for node in cluster_tree.traverse():
    if not node.is_leaf():
        profile = node.get_cluster_profile()
        print(f"Cluster size: {profile['size']}, height: {profile['height']:.3f}")
```

### Cluster Validation

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("hierarchical_clustering.nw")

# Calculate silhouette scores for all clusters
silhouette_scores = {}
for node in cluster_tree.traverse():
    if not node.is_leaf() and len(node.get_leaves()) > 1:
        silhouette = node.get_silhouette()
        silhouette_scores[node] = silhouette
        print(f"Cluster {len(node.get_leaves())} items: silhouette = {silhouette:.3f}")

# Find best clusters based on silhouette
best_clusters = [node for node, score in silhouette_scores.items() if score > 0.5]
print(f"Found {len(best_clusters)} high-quality clusters")
```

### Cluster Cutting and Optimization

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("clustering_dendrogram.nw")

# Cut tree at different heights
heights = [0.1, 0.2, 0.5, 1.0]
for height in heights:
    clusters = cluster_tree.get_clusters_at_height(height)
    print(f"Height {height}: {len(clusters)} clusters")

    # Analyze cluster sizes
    sizes = [len(cluster.get_leaves()) for cluster in clusters]
    print(f"  Cluster sizes: {sizes}")

# Find optimal clustering
optimal_k, optimal_clusters, scores = cluster_tree.get_optimal_clusters(criterion="silhouette")
print(f"Optimal number of clusters: {optimal_k}")
print(f"Optimal clustering silhouette: {max(scores):.3f}")
```

### Integration with Data Analysis

```python
from ete3 import ArrayTable

# Load expression data
expression_data = ArrayTable("gene_expression.txt")

# Perform clustering
cluster_result = expression_data.cluster_data(method="ward", metric="euclidean")

# Analyze clustering quality
for node in cluster_result.traverse():
    if not node.is_leaf():
        cluster_profile = node.get_cluster_profile()
        if cluster_profile['size'] >= 5:  # Focus on larger clusters
            silhouette = node.get_silhouette()
            variance = node.get_cluster_variance()
            print(f"Cluster {cluster_profile['size']} genes:")
            print(f"  Silhouette: {silhouette:.3f}")
            print(f"  Variance: {variance:.3f}")
            print(f"  Members: {node.get_cluster_members()[:5]}...")  # Show first 5
```

### Cluster Comparison

```python
from ete3 import ClusterTree

# Load two different clustering results
clustering1 = ClusterTree("method1_clustering.nw")
clustering2 = ClusterTree("method2_clustering.nw")

# Compare clusterings
similarity = clustering1.compare_clusters(clustering2, method="adjusted_rand")
print(f"Clustering similarity (Adjusted Rand Index): {similarity:.3f}")

# Assess stability
stability_scores = clustering1.get_cluster_stability(bootstrap_samples=50)
for cluster, stability in stability_scores.items():
    print(f"Cluster stability: {stability:.3f}")
```

### Advanced Clustering Workflow

```python
from ete3 import ArrayTable
import numpy as np

# Complete clustering analysis workflow
def analyze_clustering(data_file, methods=("ward", "complete", "average")):
    # Load data
    data = ArrayTable(data_file)

    # Try different clustering methods
    results = {}
    for method in methods:
        cluster_tree = data.cluster_data(method=method, metric="euclidean")

        # Find optimal clusters
        opt_k, opt_clusters, scores = cluster_tree.get_optimal_clusters()

        # Calculate overall quality metrics (singleton clusters are skipped)
        avg_silhouette = np.mean([cluster.get_silhouette()
                                  for cluster in opt_clusters
                                  if len(cluster.get_leaves()) > 1])

        results[method] = {
            'tree': cluster_tree,
            'optimal_k': opt_k,
            'avg_silhouette': avg_silhouette,
            'clusters': opt_clusters
        }

        print(f"{method}: k={opt_k}, silhouette={avg_silhouette:.3f}")

    # Select best method
    best_method = max(results.keys(),
                      key=lambda m: results[m]['avg_silhouette'])

    print(f"\nBest method: {best_method}")
    return results[best_method]

# Run analysis
best_clustering = analyze_clustering("expression_matrix.txt")

# Visualize best result
best_clustering['tree'].show_cluster_heatmap()
```

### Custom Distance Metrics

```python
from ete3 import ClusterTree
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Helper defined here for the example; it is not part of ETE3 or SciPy
def newick_from_scipy_tree(node, parent_height, labels):
    """Serialize a SciPy ClusterNode (from to_tree) as a Newick string."""
    branch_length = parent_height - node.dist
    if node.is_leaf():
        return f"{labels[node.id]}:{branch_length:.6f}"
    left = newick_from_scipy_tree(node.get_left(), node.dist, labels)
    right = newick_from_scipy_tree(node.get_right(), node.dist, labels)
    return f"({left},{right}):{branch_length:.6f}"

# Custom clustering with correlation distance
def correlation_clustering(data_matrix, labels=None, method="average"):
    if labels is None:
        labels = [f"item_{i}" for i in range(len(data_matrix))]

    # Condensed pairwise distances; SciPy's 'correlation' metric is 1 - Pearson r
    condensed_distances = pdist(data_matrix, metric='correlation')

    # Perform hierarchical clustering
    linkage_matrix = linkage(condensed_distances, method=method)

    # Convert the linkage matrix to a SciPy tree, then to Newick for ETE3
    scipy_tree = to_tree(linkage_matrix)
    newick = newick_from_scipy_tree(scipy_tree, scipy_tree.dist, labels) + ";"
    return ClusterTree(newick)

# Use custom clustering
data = np.random.rand(50, 100)  # 50 samples, 100 features
custom_cluster_tree = correlation_clustering(data)

# Analyze results
optimal_clusters = custom_cluster_tree.get_optimal_clusters()
print(f"Custom clustering found {len(optimal_clusters[1])} optimal clusters")
```
