# Quantification Data Processing

Comprehensive processing of quantified peptide and protein data from multiple proteomics platforms and file formats. Provides unified interfaces for reading, reformatting, and processing quantification results from DIA-NN, Spectronaut, MaxQuant, and other proteomics tools.

## Capabilities

### Quantification Reader Manager

Central management system for importing and processing quantified proteomics data from multiple sources, with automatic format detection and standardization.

```python { .api }
def import_data(data_path: str,
                data_type: str = None,
                config_dict: dict = None,
                **kwargs) -> pd.DataFrame:
    """
    Import quantified proteomics data from various formats.

    Parameters:
    - data_path: Path to quantification data file
    - data_type: Format type ('spectronaut', 'diann', 'maxquant', etc.)
    - config_dict: Configuration dictionary for import settings
    - **kwargs: Additional format-specific options

    Returns:
    DataFrame with standardized quantification data
    """

def get_supported_formats() -> List[str]:
    """
    Get list of supported quantification formats.

    Returns:
    List of supported format names
    """

def get_format_config(format_name: str) -> dict:
    """
    Get default configuration for a specific format.

    Parameters:
    - format_name: Name of the format

    Returns:
    Configuration dictionary with default settings
    """

def validate_quantification_data(df: pd.DataFrame,
                                 format_type: str = None) -> dict:
    """
    Validate quantification data integrity and completeness.

    Parameters:
    - df: Quantification DataFrame to validate
    - format_type: Expected format type for validation

    Returns:
    Dictionary with validation results and issues
    """
```
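
The manager functions above can be combined for a quick import-and-check pass. A minimal sketch, assuming these helpers live in the same `quant_reader_manager` module used in the usage examples below and that the validation dictionary exposes an `issues` entry (the exact keys are not specified here); file names are placeholders:

```python
from alphabase.quantification.quant_reader.quant_reader_manager import (
    import_data,
    get_supported_formats,
    get_format_config,
    validate_quantification_data,
)

# List supported formats and inspect the default DIA-NN import settings
print(get_supported_formats())
print(get_format_config('diann'))

# Import a DIA-NN report (placeholder path) and sanity-check it before use
df = import_data('report.tsv', data_type='diann')
validation = validate_quantification_data(df, format_type='diann')
if validation.get('issues'):  # 'issues' key is assumed, not guaranteed by the API above
    print(f"Validation issues: {validation['issues']}")
```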

### Long-Format Data Reader

Specialized reader for long-format quantification tables commonly produced by DIA-NN, Spectronaut, and other DIA search engines.

```python { .api }
class LongFormatReader:
    """Reader for long-format quantification data tables."""

    def __init__(self, config_dict: dict = None):
        """
        Initialize long-format reader.

        Parameters:
        - config_dict: Configuration for column mappings and processing
        """

    def read_file(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Read long-format quantification file.

        Parameters:
        - filepath: Path to quantification file
        - **kwargs: Additional reading options

        Returns:
        DataFrame with processed quantification data
        """

    def set_column_mapping(self, mapping: dict) -> None:
        """
        Set custom column name mappings.

        Parameters:
        - mapping: Dictionary mapping file columns to standard names
        """

    def filter_data(self, df: pd.DataFrame,
                    min_confidence: float = 0.01,
                    remove_decoys: bool = True) -> pd.DataFrame:
        """
        Apply quality filters to quantification data.

        Parameters:
        - df: Input quantification DataFrame
        - min_confidence: Minimum confidence threshold
        - remove_decoys: Whether to remove decoy identifications

        Returns:
        Filtered DataFrame
        """

    def aggregate_to_protein_level(self, df: pd.DataFrame,
                                   method: str = 'sum') -> pd.DataFrame:
        """
        Aggregate peptide-level to protein-level quantification.

        Parameters:
        - df: Peptide-level quantification DataFrame
        - method: Aggregation method ('sum', 'mean', 'median', 'maxlfq')

        Returns:
        Protein-level quantification DataFrame
        """

def standardize_long_format_columns(df: pd.DataFrame,
                                    source_format: str) -> pd.DataFrame:
    """
    Standardize column names for long-format data.

    Parameters:
    - df: Input DataFrame with format-specific columns
    - source_format: Source format name ('diann', 'spectronaut', etc.)

    Returns:
    DataFrame with standardized column names
    """
```
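
Column mapping can be set before reading or applied to an already loaded table. A brief sketch, assuming the `longformat_reader` module path used in the usage examples below; the DIA-NN-style column names in the mapping are illustrative:

```python
from alphabase.quantification.quant_reader.longformat_reader import (
    LongFormatReader,
    standardize_long_format_columns,
)

reader = LongFormatReader()

# Map tool-specific column names onto the standard schema before reading
reader.set_column_mapping({
    'Stripped.Sequence': 'sequence',     # illustrative DIA-NN-style columns
    'Protein.Group': 'proteins',
    'Precursor.Quantity': 'intensity',
})
df = reader.read_file('diann_report.tsv')

# Alternatively, standardize a DataFrame that was loaded elsewhere
standardized_df = standardize_long_format_columns(df, source_format='diann')
```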

### Wide-Format Data Reader

Reader for wide-format quantification tables with samples as columns, commonly used in label-free quantification workflows.

```python { .api }
class WideFormatReader:
    """Reader for wide-format quantification data tables."""

    def __init__(self, config_dict: dict = None):
        """
        Initialize wide-format reader.

        Parameters:
        - config_dict: Configuration for processing settings
        """

    def read_file(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Read wide-format quantification file.

        Parameters:
        - filepath: Path to quantification file
        - **kwargs: Additional reading options

        Returns:
        DataFrame with processed quantification data
        """

    def identify_sample_columns(self, df: pd.DataFrame) -> List[str]:
        """
        Automatically identify sample/intensity columns.

        Parameters:
        - df: Input DataFrame

        Returns:
        List of column names containing quantification values
        """

    def convert_to_long_format(self, df: pd.DataFrame,
                               sample_columns: List[str] = None) -> pd.DataFrame:
        """
        Convert wide-format to long-format table.

        Parameters:
        - df: Wide-format DataFrame
        - sample_columns: List of sample columns to melt

        Returns:
        Long-format DataFrame
        """

    def normalize_intensities(self, df: pd.DataFrame,
                              method: str = 'median') -> pd.DataFrame:
        """
        Normalize quantification intensities across samples.

        Parameters:
        - df: Quantification DataFrame
        - method: Normalization method ('median', 'mean', 'quantile')

        Returns:
        Normalized DataFrame
        """

def detect_wide_format_type(df: pd.DataFrame) -> str:
    """
    Detect the type of wide-format quantification data.

    Parameters:
    - df: Input DataFrame

    Returns:
    Format type string ('maxquant', 'proteomics_ruler', 'generic')
    """
```
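
Format detection can be run on a small preview of the table before committing to a full read. A minimal sketch, assuming `detect_wide_format_type` is importable from the same `wideformat_reader` module; the file name is a placeholder:

```python
import pandas as pd

from alphabase.quantification.quant_reader.wideformat_reader import (
    WideFormatReader,
    detect_wide_format_type,
)

# Peek at the first rows only to decide how to treat the table
preview = pd.read_csv('proteinGroups.txt', sep='\t', nrows=100)
print(f"Detected wide-format type: {detect_wide_format_type(preview)}")

# Then read the full file with the wide-format reader
reader = WideFormatReader()
df = reader.read_file('proteinGroups.txt')
```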

### Configuration Management

System for managing format-specific configurations and column mappings for different quantification platforms.

```python { .api }
class ConfigDictLoader:
    """Configuration management for quantification data formats."""

    def __init__(self, config_path: str = None):
        """
        Initialize configuration loader.

        Parameters:
        - config_path: Path to custom configuration file
        """

    def load_config(self, format_name: str) -> dict:
        """
        Load configuration for specific format.

        Parameters:
        - format_name: Name of the quantification format

        Returns:
        Configuration dictionary
        """

    def save_config(self, config: dict, format_name: str) -> None:
        """
        Save custom configuration for format.

        Parameters:
        - config: Configuration dictionary to save
        - format_name: Name for the configuration
        """

    def get_column_mapping(self, format_name: str) -> dict:
        """
        Get column name mappings for format.

        Parameters:
        - format_name: Format name

        Returns:
        Dictionary mapping format columns to standard names
        """

    def update_column_mapping(self, format_name: str,
                              mapping: dict) -> None:
        """
        Update column mappings for format.

        Parameters:
        - format_name: Format name to update
        - mapping: New column mappings
        """

# Standard configuration constants
STANDARD_QUANTIFICATION_COLUMNS: dict = {
    'sequence': str,     # Peptide sequence
    'proteins': str,     # Protein identifiers
    'sample': str,       # Sample identifier
    'intensity': float,  # Quantification intensity
    'rt': float,         # Retention time
    'charge': int,       # Precursor charge
    'mz': float,         # Precursor m/z
    'qvalue': float,     # Identification confidence
    'run': str,          # LC-MS run identifier
    'channel': str,      # Labeling channel (for TMT/iTRAQ)
}

def get_default_config(format_name: str) -> dict:
    """
    Get default configuration for quantification format.

    Parameters:
    - format_name: Quantification format name

    Returns:
    Default configuration dictionary
    """
```
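
The standard column schema and per-format mappings can be inspected and patched at runtime. A short sketch, assuming the `config_dict_loader` module path used in the usage examples below; the Spectronaut column name being remapped is illustrative:

```python
from alphabase.quantification.quant_reader.config_dict_loader import (
    ConfigDictLoader,
    STANDARD_QUANTIFICATION_COLUMNS,
    get_default_config,
)

# The standard schema the readers converge on (column name -> dtype)
print(STANDARD_QUANTIFICATION_COLUMNS)

loader = ConfigDictLoader()

# Inspect the Spectronaut mapping and redirect one column (illustrative name)
mapping = loader.get_column_mapping('spectronaut')
mapping['EG.TotalQuantity'] = 'intensity'
loader.update_column_mapping('spectronaut', mapping)

# Defaults remain available for comparison
print(get_default_config('spectronaut'))
```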

### Data Reformatting and Processing

Utilities for reformatting and processing quantification data for downstream analysis workflows.

```python { .api }
class TableReformatter:
    """Reformatter for quantification data tables."""

    def __init__(self):
        """Initialize table reformatter."""

    def reformat_for_analysis(self, df: pd.DataFrame,
                              analysis_type: str = 'differential') -> pd.DataFrame:
        """
        Reformat data for specific analysis workflows.

        Parameters:
        - df: Input quantification DataFrame
        - analysis_type: Type of analysis ('differential', 'network', 'timecourse')

        Returns:
        Reformatted DataFrame suitable for analysis
        """

    def create_design_matrix(self, df: pd.DataFrame,
                             sample_info: pd.DataFrame) -> pd.DataFrame:
        """
        Create design matrix for statistical analysis.

        Parameters:
        - df: Quantification DataFrame
        - sample_info: Sample metadata DataFrame

        Returns:
        Design matrix DataFrame
        """

    def pivot_to_matrix(self, df: pd.DataFrame,
                        index_cols: List[str],
                        value_col: str = 'intensity') -> pd.DataFrame:
        """
        Pivot quantification data to matrix format.

        Parameters:
        - df: Long-format quantification DataFrame
        - index_cols: Columns to use as row identifiers
        - value_col: Column containing values to pivot

        Returns:
        Matrix-format DataFrame
        """

    def handle_missing_values(self, df: pd.DataFrame,
                              method: str = 'impute') -> pd.DataFrame:
        """
        Handle missing quantification values.

        Parameters:
        - df: Quantification DataFrame with missing values
        - method: Handling method ('impute', 'remove', 'flag')

        Returns:
        DataFrame with missing values handled
        """

class PlexDIAReformatter:
    """Specialized reformatter for plexDIA quantification data."""

    def __init__(self):
        """Initialize plexDIA reformatter."""

    def process_plexdia_output(self, filepath: str) -> pd.DataFrame:
        """
        Process plexDIA output files.

        Parameters:
        - filepath: Path to plexDIA output file

        Returns:
        Processed quantification DataFrame
        """

    def extract_channel_intensities(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Extract individual channel intensities from plexDIA data.

        Parameters:
        - df: Raw plexDIA DataFrame

        Returns:
        DataFrame with separated channel intensities
        """

    def normalize_channels(self, df: pd.DataFrame,
                           method: str = 'sum') -> pd.DataFrame:
        """
        Normalize intensities across plexDIA channels.

        Parameters:
        - df: plexDIA quantification DataFrame
        - method: Normalization method ('sum', 'median', 'reference')

        Returns:
        Channel-normalized DataFrame
        """

def merge_quantification_data(dataframes: List[pd.DataFrame],
                              merge_on: List[str] = None) -> pd.DataFrame:
    """
    Merge multiple quantification datasets.

    Parameters:
    - dataframes: List of quantification DataFrames to merge
    - merge_on: Columns to merge on (default: sequence, proteins, charge)

    Returns:
    Merged quantification DataFrame
    """

def calculate_fold_changes(df: pd.DataFrame,
                           control_samples: List[str],
                           treatment_samples: List[str]) -> pd.DataFrame:
    """
    Calculate fold changes between sample groups.

    Parameters:
    - df: Quantification DataFrame
    - control_samples: List of control sample identifiers
    - treatment_samples: List of treatment sample identifiers

    Returns:
    DataFrame with fold changes and statistics
    """
```
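
Pivoting and missing-value handling are typically chained before statistics. A small sketch with a hand-built toy table (purely illustrative values), assuming the `table_reformatter` module path used in the usage examples below:

```python
import pandas as pd

from alphabase.quantification.quant_reader.table_reformatter import TableReformatter

# Toy long-format table in the standard schema (illustrative values)
long_df = pd.DataFrame({
    'sequence':  ['PEPTIDEK', 'PEPTIDEK', 'LVNELTEFAK'],
    'proteins':  ['P12345', 'P12345', 'P02768'],
    'sample':    ['sample_1', 'sample_2', 'sample_1'],
    'intensity': [1.2e6, 1.5e6, 3.4e5],
})

reformatter = TableReformatter()

# Pivot to a (sequence, proteins) x sample intensity matrix;
# LVNELTEFAK has no sample_2 measurement, so the matrix contains a gap
matrix_df = reformatter.pivot_to_matrix(
    long_df,
    index_cols=['sequence', 'proteins'],
    value_col='intensity',
)

# Impute (or 'remove'/'flag') the missing intensities before analysis
complete_df = reformatter.handle_missing_values(matrix_df, method='impute')
```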

### Quality Control and Statistics

Functions for quality assessment and statistical analysis of quantification data.

```python { .api }
def assess_data_quality(df: pd.DataFrame) -> dict:
    """
    Assess quantification data quality metrics.

    Parameters:
    - df: Quantification DataFrame

    Returns:
    Dictionary with quality metrics and statistics
    """

def calculate_cv_statistics(df: pd.DataFrame,
                            sample_groups: dict) -> pd.DataFrame:
    """
    Calculate coefficient of variation statistics.

    Parameters:
    - df: Quantification DataFrame
    - sample_groups: Dictionary mapping samples to groups

    Returns:
    DataFrame with CV statistics
    """

def identify_outlier_samples(df: pd.DataFrame,
                             method: str = 'pca') -> List[str]:
    """
    Identify outlier samples in quantification data.

    Parameters:
    - df: Quantification DataFrame
    - method: Outlier detection method ('pca', 'correlation', 'distance')

    Returns:
    List of outlier sample identifiers
    """

def generate_qa_report(df: pd.DataFrame,
                       output_path: str = None) -> dict:
    """
    Generate comprehensive quality assessment report.

    Parameters:
    - df: Quantification DataFrame
    - output_path: Optional path to save HTML report

    Returns:
    Dictionary with QA metrics and plots
    """
```
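
Coefficient-of-variation statistics complement the overall quality report. A brief sketch, assuming `calculate_cv_statistics` is exposed by the same `quantreader_utils` module as the other QC helpers in the quality-assessment example below; sample names and the input file are placeholders:

```python
from alphabase.quantification.quant_reader.quant_reader_manager import import_data
from alphabase.quantification.quant_reader.quantreader_utils import calculate_cv_statistics

# Load a standardized quantification table (placeholder file name)
quant_df = import_data('report.tsv', data_type='diann')

# Assign each sample to its experimental group (placeholder sample names)
sample_groups = {
    'control_1': 'control',
    'control_2': 'control',
    'treated_1': 'treatment',
    'treated_2': 'treatment',
}

cv_df = calculate_cv_statistics(quant_df, sample_groups=sample_groups)
print(cv_df.head())
```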

## Usage Examples

### Basic Quantification Data Import

```python
from alphabase.quantification.quant_reader.quant_reader_manager import import_data

# Import DIA-NN quantification data
diann_df = import_data('report.tsv', data_type='diann')
print(f"Imported {len(diann_df)} quantification entries")

# Import Spectronaut data
spectronaut_df = import_data('spectronaut_export.tsv', data_type='spectronaut')
print(f"Imported {len(spectronaut_df)} quantification entries")

# Let the manager auto-detect the format
unknown_df = import_data('unknown_quant.tsv')
print(f"Auto-detected format: {unknown_df.attrs.get('format_type', 'unknown')}")
```

### Processing Long-Format Data

```python
from alphabase.quantification.quant_reader.longformat_reader import LongFormatReader

# Create a reader (default configuration; pass config_dict to customize)
reader = LongFormatReader()

# Read and process DIA-NN data
df = reader.read_file('diann_report.tsv')

# Apply quality filters
filtered_df = reader.filter_data(
    df,
    min_confidence=0.01,  # 1% FDR
    remove_decoys=True
)

# Aggregate to protein level
protein_df = reader.aggregate_to_protein_level(
    filtered_df,
    method='sum'  # Sum peptide intensities
)

print(f"Peptide-level: {len(filtered_df)} entries")
print(f"Protein-level: {len(protein_df)} entries")
```

### Working with Wide-Format Data

```python
from alphabase.quantification.quant_reader.wideformat_reader import WideFormatReader

# Process MaxQuant proteinGroups.txt
reader = WideFormatReader()
df = reader.read_file('proteinGroups.txt')

# Auto-identify intensity columns
sample_cols = reader.identify_sample_columns(df)
print(f"Found {len(sample_cols)} sample columns: {sample_cols[:5]}...")

# Convert to long format for analysis
long_df = reader.convert_to_long_format(df, sample_columns=sample_cols)

# Normalize intensities
normalized_df = reader.normalize_intensities(long_df, method='median')
print(f"Converted to long format: {len(long_df)} entries")
```

### Advanced Data Processing

```python
import pandas as pd

from alphabase.quantification.quant_reader.quant_reader_manager import import_data
from alphabase.quantification.quant_reader.table_reformatter import TableReformatter
from alphabase.quantification.quant_reader.quantreader_utils import (
    merge_quantification_data, calculate_fold_changes
)

# Merge data from multiple experiments
experiment_dfs = [
    import_data('exp1_diann.tsv', data_type='diann'),
    import_data('exp2_diann.tsv', data_type='diann'),
    import_data('exp3_diann.tsv', data_type='diann')
]

merged_df = merge_quantification_data(
    experiment_dfs,
    merge_on=['sequence', 'proteins', 'charge']
)

# Create design matrix for statistical analysis
reformatter = TableReformatter()
sample_info = pd.DataFrame({
    'sample': ['exp1', 'exp2', 'exp3'],
    'condition': ['control', 'treatment', 'treatment'],
    'batch': [1, 1, 2]
})

design_matrix = reformatter.create_design_matrix(merged_df, sample_info)

# Calculate fold changes
fold_changes = calculate_fold_changes(
    merged_df,
    control_samples=['exp1'],
    treatment_samples=['exp2', 'exp3']
)

print(f"Calculated fold changes for {len(fold_changes)} proteins")
```

### Quality Assessment

```python
from alphabase.quantification.quant_reader.quantreader_utils import (
    assess_data_quality, generate_qa_report, identify_outlier_samples
)

# Assess data quality (merged_df from the previous example)
quality_metrics = assess_data_quality(merged_df)
print("Quality metrics:")
print(f" Missing values: {quality_metrics['missing_percentage']:.1f}%")
print(f" CV median: {quality_metrics['cv_median']:.2f}")
print(f" Dynamic range: {quality_metrics['dynamic_range']:.1f}")

# Identify outlier samples
outliers = identify_outlier_samples(merged_df, method='pca')
if outliers:
    print(f"Outlier samples detected: {outliers}")

# Generate comprehensive QA report
qa_report = generate_qa_report(merged_df, output_path='qa_report.html')
print(f"QA report saved with {len(qa_report['plots'])} plots")
```

### Custom Configuration

```python
from alphabase.quantification.quant_reader.config_dict_loader import ConfigDictLoader
from alphabase.quantification.quant_reader.quant_reader_manager import import_data

# Initialize the configuration loader
config_loader = ConfigDictLoader()

# Get the default DIA-NN configuration
diann_config = config_loader.load_config('diann')
print(f"Default DIA-NN columns: {diann_config['column_mapping']}")

# Create a custom configuration for a new format
custom_config = {
    'column_mapping': {
        'peptide_sequence': 'sequence',
        'protein_id': 'proteins',
        'sample_name': 'sample',
        'peak_area': 'intensity',
        'retention_time': 'rt',
        'precursor_charge': 'charge'
    },
    'filters': {
        'min_confidence': 0.01,
        'remove_contaminants': True
    }
}

config_loader.save_config(custom_config, 'custom_format')

# Use the custom configuration for import
custom_df = import_data('custom_data.tsv',
                        data_type='custom_format',
                        config_dict=custom_config)
```