# Clustering Analysis

Specialized clustering tree operations for hierarchical clustering analysis, cluster validation, and dendrogram-based data exploration. ETE3 provides enhanced tree classes specifically designed for clustering workflows.

## Capabilities

### ClusterTree and ClusterNode Classes

Enhanced tree classes specialized for clustering analysis and validation.

```python { .api }
class ClusterTree(Tree):
    """
    Tree specialized for hierarchical clustering analysis.
    Inherits all Tree functionality plus clustering-specific methods.
    """

    def __init__(self, newick=None, **kwargs):
        """
        Initialize clustering tree.

        Parameters:
        - newick (str): Newick format string or file path
        - kwargs: Additional Tree initialization parameters
        """

class ClusterNode(ClusterTree):
    """Alias for ClusterTree - same functionality."""
    pass
```

### Cluster Profile Analysis

Extract and analyze cluster profiles and characteristics.

```python { .api }
def get_cluster_profile(self):
    """
    Get profile characteristics for cluster represented by this node.

    Returns:
    dict: Cluster profile including:
        - size: Number of items in cluster
        - height: Cluster height/dissimilarity
        - members: List of cluster members
        - profile: Statistical summary of cluster data
    """

def get_cluster_size(self):
    """
    Get number of items in cluster.

    Returns:
    int: Cluster size (number of leaf nodes)
    """

def get_cluster_members(self):
    """
    Get all members (leaf names) of this cluster.

    Returns:
    list: List of member names in cluster
    """

def get_cluster_height(self):
    """
    Get cluster height (distance at which cluster was formed).

    Returns:
    float: Cluster formation height/distance
    """
```
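
To make the meaning of size, members, and height concrete, here is a minimal pure-Python sketch. It does not use ETE3: the dendrogram is modeled as nested `(height, left, right)` tuples with plain strings as leaves, and the `members` and `profile` helpers are hypothetical names chosen for illustration. The statistical `profile` entry is left out because it depends on the underlying data.

```python
def members(node):
    """Return the leaf names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _height, left, right = node
    return members(left) + members(right)


def profile(node):
    """Summarize a cluster as size, formation height, and member names."""
    names = members(node)
    # A leaf has no merge event, so its height is taken to be 0.0 here.
    height = 0.0 if isinstance(node, str) else node[0]
    return {"size": len(names), "height": height, "members": names}


# Toy dendrogram: (a, b) merge at 0.5, (d, e) at 0.4, c joins (d, e) at 1.0,
# and the two groups merge at 2.0.
dendrogram = (2.0, (0.5, "a", "b"), (1.0, "c", (0.4, "d", "e")))

root_profile = profile(dendrogram)
right_profile = profile(dendrogram[2])
print(root_profile)
print(right_profile)
```

Every internal node describes one cluster, so walking the tree and calling `profile` on each internal node lists all clusters in the hierarchy.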

### Cluster Validation Metrics

Calculate various cluster validation and quality metrics.

```python { .api }
def get_silhouette(self):
    """
    Calculate silhouette coefficient for cluster.

    Measures how similar items are to their own cluster compared to other clusters.
    Values range from -1 to 1, where higher values indicate better clustering.

    Returns:
    float: Silhouette coefficient (-1 to 1)
    """

def get_intra_cluster_distance(self):
    """
    Calculate average intra-cluster distance.

    Returns:
    float: Average distance between items within cluster
    """

def get_inter_cluster_distance(self, other_cluster):
    """
    Calculate distance between this cluster and another cluster.

    Parameters:
    - other_cluster (ClusterTree): Other cluster for comparison

    Returns:
    float: Distance between clusters
    """

def get_cluster_variance(self):
    """
    Calculate within-cluster variance.

    Returns:
    float: Variance of distances within cluster
    """
```
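
For reference, the standard per-item silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from item i to the other members of its own cluster and b(i) is the lowest mean distance from i to the members of any other cluster. The sketch below computes it in pure Python from a distance matrix and flat labels. It illustrates the textbook definition and is not necessarily how ETE3 computes its value; `silhouette_scores` is a hypothetical helper name, and at least two clusters are assumed.

```python
def silhouette_scores(dist, labels):
    """Per-item silhouette values from a square distance matrix and flat labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for other, idx in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Four points on a line forming two well-separated pairs.
values = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in values] for x in values]
scores = silhouette_scores(dist, labels=[0, 0, 1, 1])
mean_silhouette = sum(scores) / len(scores)
print(scores, mean_silhouette)
```

Averaging the per-item values over a cluster, or over the whole data set, gives the cluster-level or overall silhouette.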

### Cluster Cutting and Partitioning

Methods for extracting clusters at different levels of the hierarchy.

```python { .api }
def get_clusters_at_height(self, height):
    """
    Get clusters by cutting tree at specified height.

    Parameters:
    - height (float): Height at which to cut the tree

    Returns:
    list: List of ClusterTree objects representing clusters
    """

def get_clusters_by_size(self, min_size=2, max_size=None):
    """
    Get clusters within specified size range.

    Parameters:
    - min_size (int): Minimum cluster size
    - max_size (int): Maximum cluster size (None for no limit)

    Returns:
    list: List of clusters meeting size criteria
    """

def get_optimal_clusters(self, criterion="silhouette"):
    """
    Find optimal number of clusters using specified criterion.

    Parameters:
    - criterion (str): Optimization criterion ("silhouette", "gap", "elbow")

    Returns:
    tuple: (optimal_k, clusters_list, criterion_values)
    """
```
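
Cutting at a height means descending from the root and stopping at the first node on each path whose merge height is at or below the threshold. The following sketch shows that idea on a toy dendrogram of nested `(height, left, right)` tuples. The representation and the `leaves` and `cut_at_height` helpers are assumptions made for illustration, and merge heights are assumed to decrease monotonically from root to leaves.

```python
def leaves(node):
    """Return the leaf names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _height, left, right = node
    return leaves(left) + leaves(right)


def cut_at_height(node, height):
    """Return the clusters (lists of leaf names) obtained by cutting at `height`."""
    if isinstance(node, str):
        return [[node]]
    merge_height, left, right = node
    if merge_height <= height:
        # The whole subtree was already merged below the cut: one cluster.
        return [leaves(node)]
    # The merge happened above the cut, so the two children stay separate.
    return cut_at_height(left, height) + cut_at_height(right, height)


dendrogram = (2.0, (0.5, "a", "b"), (1.0, "c", (0.4, "d", "e")))

for threshold in (0.1, 0.6, 1.5, 2.0):
    print(threshold, cut_at_height(dendrogram, threshold))
```

On a SciPy linkage matrix `Z`, `scipy.cluster.hierarchy.fcluster(Z, t=height, criterion="distance")` produces an equivalent flat labeling.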

### Cluster Comparison and Analysis

Compare different clusterings and analyze cluster relationships.

```python { .api }
def compare_clusters(self, other_clustering, method="adjusted_rand"):
    """
    Compare this clustering with another clustering.

    Parameters:
    - other_clustering (ClusterTree or dict): Other clustering to compare
    - method (str): Comparison metric ("adjusted_rand", "normalized_mutual_info", "homogeneity")

    Returns:
    float: Clustering similarity score
    """

def get_cluster_stability(self, bootstrap_samples=100):
    """
    Assess cluster stability through bootstrap resampling.

    Parameters:
    - bootstrap_samples (int): Number of bootstrap iterations

    Returns:
    dict: Stability scores for each cluster
    """
```
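
As background for the `"adjusted_rand"` metric, the adjusted Rand index counts item pairs that two flat clusterings group the same way, then corrects for chance agreement: 1.0 means identical partitions, values near 0 mean chance-level agreement, and negative values mean worse than chance. The sketch below follows the standard contingency-table formula in pure Python. `adjusted_rand_index` is a hypothetical helper, not an ETE3 function, and both inputs are assumed to be label sequences over the same items, with at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two flat labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 items)")

    # Pairs that fall in the same cell of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (for example, all singletons in both labelings).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that disagree on every pair
print(same, crossed)
```

To compare two hierarchical clusterings this way, each tree first has to be reduced to flat labels, for example by cutting both at a chosen height or at the same number of clusters.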

### Distance Matrix Integration

Work with distance matrices and clustering algorithms.

```python { .api }
def from_distance_matrix(self, distance_matrix, labels=None, method="average"):
    """
    Create clustering tree from distance matrix.

    Parameters:
    - distance_matrix (array-like): Symmetric distance matrix
    - labels (list): Labels for matrix rows/columns
    - method (str): Linkage method ("single", "complete", "average", "ward")

    Returns:
    ClusterTree: Hierarchical clustering result
    """

def get_distance_matrix(self):
    """
    Extract distance matrix from clustering tree.

    Returns:
    numpy.ndarray: Distance matrix between all leaf pairs
    """
```
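
One common way to read a leaf-to-leaf distance matrix off a dendrogram is the cophenetic distance: the height of the merge at which two leaves first share a cluster. Whether `get_distance_matrix` uses this definition is not stated here, so treat the following pure-Python sketch as an illustration of the concept only. The nested `(height, left, right)` tuple representation and the `cophenetic_distances` helper are assumptions.

```python
def cophenetic_distances(node, table=None):
    """Return (leaf_names, table); table maps frozenset({a, b}) to the merge height."""
    if table is None:
        table = {}
    if isinstance(node, str):
        return [node], table
    height, left, right = node
    left_names, _ = cophenetic_distances(left, table)
    right_names, _ = cophenetic_distances(right, table)
    # Leaves on opposite sides of this node first meet at this node's height.
    for a in left_names:
        for b in right_names:
            table[frozenset((a, b))] = height
    return left_names + right_names, table


dendrogram = (2.0, (0.5, "a", "b"), (1.0, "c", (0.4, "d", "e")))
names, table = cophenetic_distances(dendrogram)

# Expand the pair table into a full symmetric matrix in leaf order.
matrix = [
    [0.0 if a == b else table[frozenset((a, b))] for b in names]
    for a in names
]
for name, row in zip(names, matrix):
    print(name, row)
```

For a SciPy linkage matrix `Z`, `scipy.cluster.hierarchy.cophenet(Z)` returns the same distances in condensed form.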

### Cluster Visualization

Specialized visualization methods for clustering results.

```python { .api }
def show_cluster_heatmap(self, data_matrix=None, color_map="viridis"):
    """
    Display clustering results with associated data heatmap.

    Parameters:
    - data_matrix (array-like): Data matrix to display alongside tree
    - color_map (str): Color scheme for heatmap
    """

def render_dendrogram(self, orientation="top", leaf_rotation=90, **kwargs):
    """
    Render tree as dendrogram with clustering-specific formatting.

    Parameters:
    - orientation (str): Dendrogram orientation ("top", "bottom", "left", "right")
    - leaf_rotation (int): Rotation angle for leaf labels
    - kwargs: Additional rendering parameters
    """
```

## Integration with Data Analysis

### ArrayTable Integration

Seamless integration with ETE3's ArrayTable for data-driven clustering.

```python { .api }
# In ArrayTable class
def cluster_data(self, method="ward", metric="euclidean"):
    """
    Perform hierarchical clustering on table data.

    Parameters:
    - method (str): Linkage method ("ward", "complete", "average", "single")
    - metric (str): Distance metric ("euclidean", "manhattan", "cosine", "correlation")

    Returns:
    ClusterTree: Clustering result tree
    """
```
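
For intuition about what the `method` parameter controls, the sketch below runs a naive agglomerative clustering in pure Python. At each step it merges the two closest clusters, and the linkage rule decides how the distance between two clusters is derived from the pairwise item distances. It is a teaching sketch with hypothetical names (`agglomerate`, `LINKAGES`), covers only single, complete, and average linkage with Euclidean distance, and is far slower than a library implementation such as `scipy.cluster.hierarchy.linkage`.

```python
from itertools import combinations
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Each rule reduces the list of pairwise item distances to one cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(points, method="average"):
    """Cluster a {name: vector} dict; return nested (height, left, right) tuples."""
    combine = LINKAGES[method]
    # Each active cluster is a (subtree, member_names) pair.
    clusters = [(name, [name]) for name in points]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            ds = [euclidean(points[a], points[b])
                  for a in clusters[i][1] for b in clusters[j][1]]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merged = ((d, clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


points = {"a": (0.0,), "b": (1.0,), "c": (10.0,), "d": (12.0,)}
tree_average = agglomerate(points, method="average")
tree_single = agglomerate(points, method="single")
print(tree_average)
print(tree_single)
```

The two trees share the same topology here but differ in the height of the final merge, which is why the choice of linkage changes where a height-based cut falls.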

## Usage Examples

### Basic Clustering Analysis

```python
from ete3 import ClusterTree
import numpy as np

# Load clustering result (from distance matrix or linkage)
cluster_tree = ClusterTree("clustering_result.nw")

# Basic cluster information
print(f"Total items clustered: {len(cluster_tree.get_leaves())}")
print(f"Tree height: {cluster_tree.get_tree_root().get_cluster_height()}")

# Analyze individual clusters
for node in cluster_tree.traverse():
    if not node.is_leaf():
        profile = node.get_cluster_profile()
        print(f"Cluster size: {profile['size']}, height: {profile['height']:.3f}")
```

### Cluster Validation

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("hierarchical_clustering.nw")

# Calculate silhouette scores for all clusters
silhouette_scores = {}
for node in cluster_tree.traverse():
    if not node.is_leaf() and len(node.get_leaves()) > 1:
        silhouette = node.get_silhouette()
        silhouette_scores[node] = silhouette
        print(f"Cluster {len(node.get_leaves())} items: silhouette = {silhouette:.3f}")

# Find best clusters based on silhouette
best_clusters = [node for node, score in silhouette_scores.items() if score > 0.5]
print(f"Found {len(best_clusters)} high-quality clusters")
```

### Cluster Cutting and Optimization

```python
from ete3 import ClusterTree

cluster_tree = ClusterTree("clustering_dendrogram.nw")

# Cut tree at different heights
heights = [0.1, 0.2, 0.5, 1.0]
for height in heights:
    clusters = cluster_tree.get_clusters_at_height(height)
    print(f"Height {height}: {len(clusters)} clusters")

    # Analyze cluster sizes
    sizes = [len(cluster.get_leaves()) for cluster in clusters]
    print(f" Cluster sizes: {sizes}")

# Find optimal clustering
optimal_k, optimal_clusters, scores = cluster_tree.get_optimal_clusters(criterion="silhouette")
print(f"Optimal number of clusters: {optimal_k}")
print(f"Optimal clustering silhouette: {max(scores):.3f}")
```

### Integration with Data Analysis

```python
from ete3 import ArrayTable, ClusterTree
import numpy as np

# Load expression data
expression_data = ArrayTable("gene_expression.txt")

# Perform clustering
cluster_result = expression_data.cluster_data(method="ward", metric="euclidean")

# Analyze clustering quality
for node in cluster_result.traverse():
    if not node.is_leaf():
        cluster_profile = node.get_cluster_profile()
        if cluster_profile['size'] >= 5:  # Focus on larger clusters
            silhouette = node.get_silhouette()
            variance = node.get_cluster_variance()
            print(f"Cluster {cluster_profile['size']} genes:")
            print(f" Silhouette: {silhouette:.3f}")
            print(f" Variance: {variance:.3f}")
            print(f" Members: {node.get_cluster_members()[:5]}...")  # Show first 5
```

### Cluster Comparison

```python
from ete3 import ClusterTree

# Load two different clustering results
clustering1 = ClusterTree("method1_clustering.nw")
clustering2 = ClusterTree("method2_clustering.nw")

# Compare clusterings
similarity = clustering1.compare_clusters(clustering2, method="adjusted_rand")
print(f"Clustering similarity (Adjusted Rand Index): {similarity:.3f}")

# Assess stability
stability_scores = clustering1.get_cluster_stability(bootstrap_samples=50)
for cluster, stability in stability_scores.items():
    print(f"Cluster stability: {stability:.3f}")
```

### Advanced Clustering Workflow

```python
from ete3 import ArrayTable, ClusterTree
import numpy as np

# Complete clustering analysis workflow
# (the default is a tuple to avoid a shared mutable default argument)
def analyze_clustering(data_file, methods=("ward", "complete", "average")):
    # Load data
    data = ArrayTable(data_file)

    # Try different clustering methods
    results = {}
    for method in methods:
        cluster_tree = data.cluster_data(method=method, metric="euclidean")

        # Find optimal clusters
        opt_k, opt_clusters, scores = cluster_tree.get_optimal_clusters()

        # Calculate overall quality metrics
        avg_silhouette = np.mean([cluster.get_silhouette()
                                  for cluster in opt_clusters
                                  if len(cluster.get_leaves()) > 1])

        results[method] = {
            'tree': cluster_tree,
            'optimal_k': opt_k,
            'avg_silhouette': avg_silhouette,
            'clusters': opt_clusters
        }

        print(f"{method}: k={opt_k}, silhouette={avg_silhouette:.3f}")

    # Select best method
    best_method = max(results.keys(),
                      key=lambda m: results[m]['avg_silhouette'])

    print(f"\nBest method: {best_method}")
    return results[best_method]

# Run analysis
best_clustering = analyze_clustering("expression_matrix.txt")

# Visualize best result
best_clustering['tree'].show_cluster_heatmap()
```

### Custom Distance Metrics

```python
from ete3 import ClusterTree
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Helper written for this example (not part of ETE3 or SciPy):
# serialize a SciPy ClusterNode tree as a Newick string.
def newick_from_scipy_tree(node, parent_dist=None):
    if node.is_leaf():
        text = f"sample_{node.id}"  # leaf names are illustrative
    else:
        children = (node.get_left(), node.get_right())
        text = "(" + ",".join(newick_from_scipy_tree(child, node.dist)
                              for child in children) + ")"
    if parent_dist is None:
        return text + ";"  # root: terminate the Newick string
    # Branch length = parent merge height minus this node's height
    return f"{text}:{parent_dist - node.dist:.6f}"

# Custom clustering with correlation distance
def correlation_clustering(data_matrix, method="average"):
    # Correlation distance (1 - Pearson r) between rows, in condensed form
    condensed_distances = pdist(data_matrix, metric='correlation')

    # Perform hierarchical clustering
    linkage_matrix = linkage(condensed_distances, method=method)

    # Convert the linkage matrix to a SciPy tree, then to Newick for ETE3
    scipy_tree = to_tree(linkage_matrix)
    return ClusterTree(newick_from_scipy_tree(scipy_tree))

# Use custom clustering
data = np.random.rand(50, 100)  # 50 samples, 100 features
custom_cluster_tree = correlation_clustering(data)

# Analyze results
optimal_clusters = custom_cluster_tree.get_optimal_clusters()
print(f"Custom clustering found {len(optimal_clusters[1])} optimal clusters")
```