# Data Tables and Arrays

Efficient handling of numerical data associated with trees and sequences, with support for matrix operations, statistical analysis, and integration with scientific computing workflows. ETE3's `ArrayTable` stores this data in a NumPy-backed 2D table with named rows and columns.

## Capabilities

### ArrayTable Class

Main class for handling 2D numerical data with matrix operations and scientific computing integration.

```python { .api }
class ArrayTable:
    """
    Efficient 2D data table with matrix operations and scientific computing support.
    Built on NumPy for high performance numerical operations.
    """

    def __init__(self, matrix_file=None, mtype="float"):
        """
        Initialize array table.

        Parameters:
        - matrix_file (str): Path to matrix data file
        - mtype (str): Data type ("float", "int", "str")
        """

    def __len__(self):
        """Number of rows in table."""

    def __str__(self):
        """String representation of table."""
```

### Data Access and Retrieval

Methods for accessing rows, columns, and individual data elements.

```python { .api }
def get_column_array(self, colname):
    """
    Get column data as NumPy array.

    Parameters:
    - colname (str): Column name

    Returns:
        numpy.ndarray: Column data array
    """

def get_row_array(self, rowname):
    """
    Get row data as NumPy array.

    Parameters:
    - rowname (str): Row name

    Returns:
        numpy.ndarray: Row data array
    """

def get_several_column_arrays(self, colnames):
    """
    Get multiple columns as arrays.

    Parameters:
    - colnames (list): List of column names

    Returns:
        dict: Mapping from column names to arrays
    """

def get_several_row_arrays(self, rownames):
    """
    Get multiple rows as arrays.

    Parameters:
    - rownames (list): List of row names

    Returns:
        dict: Mapping from row names to arrays
    """

# Properties for data access
matrix: numpy.ndarray  # Underlying data matrix
colNames: list         # Column names
rowNames: list         # Row names
colValues: dict        # Column name to index mapping
rowValues: dict        # Row name to index mapping
```
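
Name-based access of this kind reduces to a dictionary lookup followed by a NumPy slice. The sketch below illustrates that idea with plain NumPy only. The variables `matrix`, `row_index`, and `col_index` and the three helper functions are hypothetical stand-ins chosen for illustration, not ETE3's internals.

```python
import numpy as np

# Hypothetical stand-ins for the table's matrix and name lists.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row_names = ["geneA", "geneB"]
col_names = ["s1", "s2", "s3"]

# Name-to-index mappings, built once so that lookups are O(1).
row_index = {name: i for i, name in enumerate(row_names)}
col_index = {name: j for j, name in enumerate(col_names)}


def column_array(name):
    """Return the column called `name` as a 1-D array."""
    return matrix[:, col_index[name]]


def row_array(name):
    """Return the row called `name` as a 1-D array."""
    return matrix[row_index[name], :]


def several_columns(names):
    """Return a dict mapping each requested column name to its array."""
    return {name: column_array(name) for name in names}


print(column_array("s2"))             # [2. 5.]
print(row_array("geneB"))             # [4. 5. 6.]
print(list(several_columns(["s1", "s3"])))
```

Basic slicing returns views into `matrix`, so writing into a returned array would modify the underlying data. Call `.copy()` on the result if that is not wanted.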

### Matrix Operations

Mathematical operations and transformations on the data matrix.

```python { .api }
def transpose(self):
    """
    Transpose the matrix (swap rows and columns).

    Returns:
        ArrayTable: New transposed table
    """

def remove_column(self, colname):
    """
    Remove column from table.

    Parameters:
    - colname (str): Column name to remove
    """

def remove_row(self, rowname):
    """
    Remove row from table.

    Parameters:
    - rowname (str): Row name to remove
    """

def add_column(self, colname, colvalues):
    """
    Add new column to table.

    Parameters:
    - colname (str): Name for new column
    - colvalues (array-like): Column data values
    """

def add_row(self, rowname, rowvalues):
    """
    Add new row to table.

    Parameters:
    - rowname (str): Name for new row
    - rowvalues (array-like): Row data values
    """
```
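
Each of these operations has to keep the name lists in step with the matrix. The following is a minimal sketch of that bookkeeping with plain NumPy. The free functions here are hypothetical illustrations that return new objects; they are not the library's implementation, which may modify the table in place.

```python
import numpy as np


def remove_column(matrix, col_names, name):
    """Return copies of the matrix and name list without column `name`."""
    j = col_names.index(name)
    return np.delete(matrix, j, axis=1), col_names[:j] + col_names[j + 1:]


def add_column(matrix, col_names, name, values):
    """Return copies of the matrix and name list with a column appended."""
    values = np.asarray(values, dtype=matrix.dtype)
    if values.shape != (matrix.shape[0],):
        raise ValueError("expected exactly one value per row")
    return np.column_stack([matrix, values]), col_names + [name]


def transpose(matrix, row_names, col_names):
    """Swap the axes of the matrix and the roles of the two name lists."""
    return matrix.T, col_names, row_names


matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
rows, cols = ["r1", "r2"], ["a", "b"]

matrix, cols = add_column(matrix, cols, "c", [5.0, 6.0])
matrix, cols = remove_column(matrix, cols, "a")
t_matrix, t_rows, t_cols = transpose(matrix, rows, cols)
print(cols)      # ['b', 'c']
print(t_rows)    # ['b', 'c']
```

The length check in `add_column` matters in practice: a new column needs one value per existing row, and a mismatch is easier to diagnose when it is reported before the arrays are stacked.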

### File I/O Operations

Read and write table data in various formats.

```python { .api }
def write(self, fname=None, colnames=None):
    """
    Write table to file.

    Parameters:
    - fname (str): Output file path, if None returns string
    - colnames (list): Specific columns to write

    Returns:
        str: Formatted table string (if fname is None)
    """

def read(self, matrix_file, mtype="float", **kwargs):
    """
    Read table data from file.

    Parameters:
    - matrix_file (str): Input file path
    - mtype (str): Data type for parsing
    - kwargs: Additional parsing parameters
    """
```
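
The on-disk layout is not specified in this section. As an illustration only, the sketch below assumes a tab-delimited text file whose first line lists the column names after a `#NAMES` marker, with each following line holding a row name and its values. The functions `read_table` and `write_table` are hypothetical helpers built on the standard library, not part of ETE3.

```python
import io


def read_table(handle, cast=float):
    """Parse an assumed '#NAMES'-headed, tab-delimited table from a file object."""
    col_names, row_names, rows = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if fields[0].upper() == "#NAMES":
            col_names = fields[1:]  # header: column names
            continue
        row_names.append(fields[0])
        rows.append([cast(value) for value in fields[1:]])
    return col_names, row_names, rows


def write_table(col_names, row_names, rows):
    """Format the table back into the same assumed layout and return a string."""
    lines = ["\t".join(["#NAMES"] + col_names)]
    for name, values in zip(row_names, rows):
        lines.append("\t".join([name] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


text = "#NAMES\ts1\ts2\ngeneA\t1.5\t2.0\ngeneB\t0.5\t4.0\n"
col_names, row_names, rows = read_table(io.StringIO(text))
print(col_names)   # ['s1', 's2']
print(rows)        # [[1.5, 2.0], [0.5, 4.0]]
```

The `cast` argument plays the role that `mtype` plays in the signatures above: it decides how each cell is converted when the file is parsed.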

### Statistical Operations

Built-in statistical analysis and data summary methods.

```python { .api }
def get_stats(self):
    """
    Calculate basic statistics for all columns.

    Returns:
        dict: Statistics including mean, std, min, max for each column
    """

def get_column_stats(self, colname):
    """
    Calculate statistics for specific column.

    Parameters:
    - colname (str): Column name

    Returns:
        dict: Column statistics (mean, std, min, max, etc.)
    """

def normalize(self, method="standard"):
    """
    Normalize data using specified method.

    Parameters:
    - method (str): Normalization method ("standard", "minmax", "robust")

    Returns:
        ArrayTable: Normalized table
    """
```
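
The three method names are not defined further here. The sketch below shows what they conventionally mean: z-scores for "standard", rescaling to the range 0 to 1 for "minmax", and centering on the median with interquartile-range scaling for "robust". Applying them per column, and the function name `normalize_columns`, are assumptions made for illustration; the library's own behavior may differ.

```python
import numpy as np


def normalize_columns(matrix, method="standard"):
    """Normalize each column of a 2D array with a conventional scheme."""
    m = np.asarray(matrix, dtype=float)
    if method == "standard":
        # z-score: zero mean, unit (population) standard deviation
        center, scale = m.mean(axis=0), m.std(axis=0)
    elif method == "minmax":
        # rescale so each column spans 0..1
        center = m.min(axis=0)
        scale = m.max(axis=0) - center
    elif method == "robust":
        # median and interquartile range are less sensitive to outliers
        q1, median, q3 = np.percentile(m, [25, 50, 75], axis=0)
        center, scale = median, q3 - q1
    else:
        raise ValueError(f"unknown method: {method!r}")
    # Constant columns have zero spread; avoid dividing by zero.
    scale = np.where(scale == 0, 1.0, scale)
    return (m - center) / scale


data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(normalize_columns(data, "minmax"))
```

A constant column has zero spread under all three schemes, so the sketch leaves such columns centered but unscaled instead of producing NaN values.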

### Data Filtering and Selection

Filter and select subsets of data based on criteria.

```python { .api }
def filter_columns(self, condition_func):
    """
    Filter columns based on condition function.

    Parameters:
    - condition_func (function): Function that takes column array, returns bool

    Returns:
        ArrayTable: Filtered table
    """

def filter_rows(self, condition_func):
    """
    Filter rows based on condition function.

    Parameters:
    - condition_func (function): Function that takes row array, returns bool

    Returns:
        ArrayTable: Filtered table
    """

def select_columns(self, colnames):
    """
    Select specific columns.

    Parameters:
    - colnames (list): Column names to select

    Returns:
        ArrayTable: Table with selected columns
    """

def select_rows(self, rownames):
    """
    Select specific rows.

    Parameters:
    - rownames (list): Row names to select

    Returns:
        ArrayTable: Table with selected rows
    """
```
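
Predicate-based filtering can be pictured as building a boolean mask, one entry per column, and using it to index both the matrix and the name list. The sketch below shows this with plain NumPy; `filter_columns` here is a hypothetical free function written for illustration, not the method documented above.

```python
import numpy as np


def filter_columns(matrix, col_names, keep):
    """Keep the columns for which keep(column_array) is true."""
    mask = np.array(
        [bool(keep(matrix[:, j])) for j in range(matrix.shape[1])],
        dtype=bool,
    )
    kept_names = [name for name, flag in zip(col_names, mask) if flag]
    return matrix[:, mask], kept_names


matrix = np.array([[1.0, 5.0, 2.0],
                   [1.0, 9.0, 2.0],
                   [1.0, 1.0, 2.0]])
col_names = ["a", "b", "c"]

# Keep only columns whose variance exceeds 1.0.
filtered, kept = filter_columns(matrix, col_names, lambda col: np.var(col) > 1.0)
print(kept)            # ['b']
print(filtered.shape)  # (3, 1)
```

Row filtering follows the same pattern with the axes swapped: evaluate the predicate on `matrix[i, :]` and index with `matrix[mask, :]`.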

### Integration with Trees

Methods for associating tabular data with tree structures.

```python { .api }
def link_to_tree(self, tree, attr_name="profile"):
    """
    Link table data to tree nodes.

    Parameters:
    - tree (Tree): Tree to link data to
    - attr_name (str): Attribute name for storing data in nodes
    """

def get_tree_profile(self, tree, attr_name="profile"):
    """
    Extract profile data from tree nodes.

    Parameters:
    - tree (Tree): Tree with profile data
    - attr_name (str): Attribute name containing data

    Returns:
        ArrayTable: Table with tree profile data
    """
```

## Clustering Integration

### ClusterTree with ArrayTable

Enhanced clustering functionality when combined with data tables.

```python { .api }
def get_distance_matrix(self):
    """
    Calculate distance matrix between rows.

    Returns:
        numpy.ndarray: Symmetric distance matrix
    """

def cluster_data(self, method="ward", metric="euclidean"):
    """
    Perform hierarchical clustering on data.

    Parameters:
    - method (str): Linkage method ("ward", "complete", "average", "single")
    - metric (str): Distance metric ("euclidean", "manhattan", "cosine")

    Returns:
        ClusterTree: Tree representing clustering hierarchy
    """
```
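
A row-by-row distance matrix is the input that hierarchical clustering works from. The sketch below computes one with NumPy broadcasting, assuming the Euclidean metric; `euclidean_distance_matrix` is a hypothetical helper for illustration and makes no claim about how the library computes distances.

```python
import numpy as np


def euclidean_distance_matrix(matrix):
    """Return the symmetric matrix of Euclidean distances between rows."""
    m = np.asarray(matrix, dtype=float)
    # diff[i, j, :] holds row i minus row j for every pair of rows.
    diff = m[:, None, :] - m[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


rows = np.array([[0.0, 0.0],
                 [3.0, 4.0],
                 [6.0, 8.0]])
dist = euclidean_distance_matrix(rows)
print(dist)
```

The broadcasted intermediate array has shape `(n, n, k)` for `n` rows and `k` columns, so this approach suits small and medium tables. For larger data, SciPy's `scipy.spatial.distance.pdist` computes the same distances in condensed form, and `scipy.cluster.hierarchy.linkage` builds the hierarchy from them.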

## Usage Examples

### Basic Table Operations

```python
from ete3 import ArrayTable
import numpy as np

# Create table from file
table = ArrayTable("data_matrix.txt", mtype="float")

# Basic properties
print(f"Table dimensions: {len(table.rowNames)} x {len(table.colNames)}")
print(f"Column names: {table.colNames}")
print(f"Row names: {table.rowNames}")

# Access data
col_data = table.get_column_array("column1")
row_data = table.get_row_array("row1")

print(f"Column1 stats: mean={np.mean(col_data):.2f}, std={np.std(col_data):.2f}")
```

### Data Manipulation

```python
from ete3 import ArrayTable

# Load data
table = ArrayTable("expression_data.txt")

# Remove unwanted columns/rows
table.remove_column("control_sample")
table.remove_row("uninformative_gene")

# Add new data (one value per remaining row; five rows are assumed here)
new_column_data = [1.5, 2.3, 0.8, 3.1, 1.9]
table.add_column("new_condition", new_column_data)

# Transpose for different analysis perspective
transposed = table.transpose()

# Save results
table.write("modified_data.txt")
```

### Statistical Analysis

```python
from ete3 import ArrayTable
import numpy as np  # needed for np.var in the filter below

table = ArrayTable("experimental_data.txt")

# Get overall statistics
stats = table.get_stats()
for col, col_stats in stats.items():
    print(f"{col}: mean={col_stats['mean']:.2f}, std={col_stats['std']:.2f}")

# Normalize data
normalized_table = table.normalize(method="standard")

# Filter based on criteria
def high_variance_filter(col_array):
    return np.var(col_array) > 1.0

high_var_table = table.filter_columns(high_variance_filter)
print(f"Filtered to {len(high_var_table.colNames)} high-variance columns")
```

### Integration with Trees

```python
from ete3 import ArrayTable, Tree

# Load data and tree
table = ArrayTable("gene_expression.txt")
tree = Tree("species_tree.nw")

# Link expression data to tree nodes
table.link_to_tree(tree, attr_name="expression")

# Access linked data
for leaf in tree.get_leaves():
    if hasattr(leaf, 'expression'):
        print(f"{leaf.name}: {leaf.expression[:5]}...")  # First 5 values

# Extract profile data back from tree
extracted_table = table.get_tree_profile(tree, attr_name="expression")
```

### Clustering Analysis

```python
from ete3 import ArrayTable

# Load expression data
expression_table = ArrayTable("gene_expression_matrix.txt")

# Perform hierarchical clustering
cluster_tree = expression_table.cluster_data(method="ward", metric="euclidean")

# Analyze clustering results
print(f"Clustering tree: {cluster_tree.get_ascii()}")

# Get distance matrix for further analysis
dist_matrix = expression_table.get_distance_matrix()
print(f"Distance matrix shape: {dist_matrix.shape}")
```

### Advanced Data Analysis

```python
from ete3 import ArrayTable, ClusterTree
import numpy as np

# Load and prepare data
table = ArrayTable("multi_condition_data.txt")

# Select specific conditions
selected_conditions = ["treatment1", "treatment2", "control"]
filtered_table = table.select_columns(selected_conditions)

# Normalize the selected conditions
normalized = filtered_table.normalize(method="standard")

# Filter for genes with significant variation
def significant_variation(row_array):
    return np.max(row_array) - np.min(row_array) > 2.0

variable_genes = normalized.filter_rows(significant_variation)

# Cluster the filtered, normalized data
cluster_result = variable_genes.cluster_data(method="complete")

# Visualize clustering
cluster_result.show()

# Save processed data
variable_genes.write("filtered_normalized_data.txt")
```

### Custom Data Processing

```python
from ete3 import ArrayTable
import numpy as np

# Create table from Python data
data_matrix = np.random.rand(100, 20)  # 100 genes, 20 samples
row_names = [f"gene_{i}" for i in range(100)]
col_names = [f"sample_{i}" for i in range(20)]

# Initialize empty table and populate
table = ArrayTable()
table.matrix = data_matrix
table.rowNames = row_names
table.colNames = col_names
table.rowValues = {name: i for i, name in enumerate(row_names)}
table.colValues = {name: i for i, name in enumerate(col_names)}

# Apply custom transformations
# np.log2 returns a new array, so table.matrix is left unchanged
log_transformed = np.log2(table.matrix + 1)  # log2(x+1) transformation

# Create new table with transformed data
# Names and mappings are copied so the two tables do not share mutable state
log_table = ArrayTable()
log_table.matrix = log_transformed
log_table.rowNames = list(table.rowNames)
log_table.colNames = list(table.colNames)
log_table.rowValues = dict(table.rowValues)
log_table.colValues = dict(table.colValues)

# Save transformed data
log_table.write("log_transformed_data.txt")
```