# Quantification Data Processing

Comprehensive quantification data processing capabilities for handling multi-format quantified peptide and protein data from various proteomics platforms. Provides unified interfaces for reading, reformatting, and processing quantification results from DIA-NN, Spectronaut, MaxQuant, and other proteomics tools.

## Capabilities

### Quantification Reader Manager

Central management system for importing and processing quantified proteomics data from multiple sources, with automatic format detection and standardization.

```python { .api }
def import_data(data_path: str,
                data_type: str = None,
                config_dict: dict = None,
                **kwargs) -> pd.DataFrame:
    """
    Import quantified proteomics data from various formats.

    Parameters:
    - data_path: Path to quantification data file
    - data_type: Format type ('spectronaut', 'diann', 'maxquant', etc.)
    - config_dict: Configuration dictionary for import settings
    - **kwargs: Additional format-specific options

    Returns:
    DataFrame with standardized quantification data
    """

def get_supported_formats() -> List[str]:
    """
    Get list of supported quantification formats.

    Returns:
    List of supported format names
    """

def get_format_config(format_name: str) -> dict:
    """
    Get default configuration for a specific format.

    Parameters:
    - format_name: Name of the format

    Returns:
    Configuration dictionary with default settings
    """

def validate_quantification_data(df: pd.DataFrame,
                                 format_type: str = None) -> dict:
    """
    Validate quantification data integrity and completeness.

    Parameters:
    - df: Quantification DataFrame to validate
    - format_type: Expected format type for validation

    Returns:
    Dictionary with validation results and issues
    """
```
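
For orientation, a minimal sketch of how the manager-level helpers above could fit around an import. It assumes these functions are importable alongside `import_data` from `quant_reader_manager` and that the validation result exposes an `issues` entry; the file name is a placeholder.

```python
from alphabase.quantification.quant_reader.quant_reader_manager import (
    import_data, get_supported_formats, get_format_config, validate_quantification_data
)

# List the formats the manager can handle
print(get_supported_formats())

# Inspect the default settings that would be applied for DIA-NN
diann_defaults = get_format_config('diann')

# Import, then sanity-check the standardized table
df = import_data('report.tsv', data_type='diann', config_dict=diann_defaults)
validation = validate_quantification_data(df, format_type='diann')
if validation.get('issues'):  # the 'issues' key is an assumption about the result dict
    print(f"Validation flagged {len(validation['issues'])} issue(s)")
```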

### Long-Format Data Reader

Specialized reader for long-format quantification tables commonly produced by DIA-NN, Spectronaut, and other DIA search engines.

```python { .api }
class LongFormatReader:
    """Reader for long-format quantification data tables."""

    def __init__(self, config_dict: dict = None):
        """
        Initialize long-format reader.

        Parameters:
        - config_dict: Configuration for column mappings and processing
        """

    def read_file(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Read long-format quantification file.

        Parameters:
        - filepath: Path to quantification file
        - **kwargs: Additional reading options

        Returns:
        DataFrame with processed quantification data
        """

    def set_column_mapping(self, mapping: dict) -> None:
        """
        Set custom column name mappings.

        Parameters:
        - mapping: Dictionary mapping file columns to standard names
        """

    def filter_data(self, df: pd.DataFrame,
                    min_confidence: float = 0.01,
                    remove_decoys: bool = True) -> pd.DataFrame:
        """
        Apply quality filters to quantification data.

        Parameters:
        - df: Input quantification DataFrame
        - min_confidence: Minimum confidence threshold
        - remove_decoys: Whether to remove decoy identifications

        Returns:
        Filtered DataFrame
        """

    def aggregate_to_protein_level(self, df: pd.DataFrame,
                                   method: str = 'sum') -> pd.DataFrame:
        """
        Aggregate peptide-level to protein-level quantification.

        Parameters:
        - df: Peptide-level quantification DataFrame
        - method: Aggregation method ('sum', 'mean', 'median', 'maxlfq')

        Returns:
        Protein-level quantification DataFrame
        """

def standardize_long_format_columns(df: pd.DataFrame,
                                    source_format: str) -> pd.DataFrame:
    """
    Standardize column names for long-format data.

    Parameters:
    - df: Input DataFrame with format-specific columns
    - source_format: Source format name ('diann', 'spectronaut', etc.)

    Returns:
    DataFrame with standardized column names
    """
```
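
A short sketch of the column-mapping path, under the assumption that `standardize_long_format_columns` is importable from the same `longformat_reader` module as the class; the export headers on the left are placeholders for whatever the source tool actually writes.

```python
import pandas as pd

from alphabase.quantification.quant_reader.longformat_reader import (
    LongFormatReader, standardize_long_format_columns
)

# Option 1: tell the reader how the export columns map onto the standard names
reader = LongFormatReader()
reader.set_column_mapping({
    'Protein.Group': 'proteins',        # placeholder export headers
    'Run': 'sample',
    'Precursor.Quantity': 'intensity',
})
df = reader.read_file('long_format_report.tsv')

# Option 2: load the table yourself and standardize the columns afterwards
raw = pd.read_csv('long_format_report.tsv', sep='\t')
standardized = standardize_long_format_columns(raw, source_format='diann')
```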

### Wide-Format Data Reader

Reader for wide-format quantification tables with samples as columns, commonly used in label-free quantification workflows.

```python { .api }
class WideFormatReader:
    """Reader for wide-format quantification data tables."""

    def __init__(self, config_dict: dict = None):
        """
        Initialize wide-format reader.

        Parameters:
        - config_dict: Configuration for processing settings
        """

    def read_file(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Read wide-format quantification file.

        Parameters:
        - filepath: Path to quantification file
        - **kwargs: Additional reading options

        Returns:
        DataFrame with processed quantification data
        """

    def identify_sample_columns(self, df: pd.DataFrame) -> List[str]:
        """
        Automatically identify sample/intensity columns.

        Parameters:
        - df: Input DataFrame

        Returns:
        List of column names containing quantification values
        """

    def convert_to_long_format(self, df: pd.DataFrame,
                               sample_columns: List[str] = None) -> pd.DataFrame:
        """
        Convert wide-format to long-format table.

        Parameters:
        - df: Wide-format DataFrame
        - sample_columns: List of sample columns to melt

        Returns:
        Long-format DataFrame
        """

    def normalize_intensities(self, df: pd.DataFrame,
                              method: str = 'median') -> pd.DataFrame:
        """
        Normalize quantification intensities across samples.

        Parameters:
        - df: Quantification DataFrame
        - method: Normalization method ('median', 'mean', 'quantile')

        Returns:
        Normalized DataFrame
        """

def detect_wide_format_type(df: pd.DataFrame) -> str:
    """
    Detect the type of wide-format quantification data.

    Parameters:
    - df: Input DataFrame

    Returns:
    Format type string ('maxquant', 'proteomics_ruler', 'generic')
    """
```
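
A brief sketch of format detection ahead of melting, assuming `detect_wide_format_type` sits in the same `wideformat_reader` module as the class; the file name and branching are illustrative.

```python
from alphabase.quantification.quant_reader.wideformat_reader import (
    WideFormatReader, detect_wide_format_type
)

reader = WideFormatReader()
df = reader.read_file('proteinGroups.txt')

# Decide how to treat the table before melting it
format_type = detect_wide_format_type(df)  # 'maxquant', 'proteomics_ruler', or 'generic'
if format_type == 'maxquant':
    sample_cols = reader.identify_sample_columns(df)
    long_df = reader.convert_to_long_format(df, sample_columns=sample_cols)
```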

### Configuration Management

System for managing format-specific configurations and column mappings for different quantification platforms.

```python { .api }
class ConfigDictLoader:
    """Configuration management for quantification data formats."""

    def __init__(self, config_path: str = None):
        """
        Initialize configuration loader.

        Parameters:
        - config_path: Path to custom configuration file
        """

    def load_config(self, format_name: str) -> dict:
        """
        Load configuration for specific format.

        Parameters:
        - format_name: Name of the quantification format

        Returns:
        Configuration dictionary
        """

    def save_config(self, config: dict, format_name: str) -> None:
        """
        Save custom configuration for format.

        Parameters:
        - config: Configuration dictionary to save
        - format_name: Name for the configuration
        """

    def get_column_mapping(self, format_name: str) -> dict:
        """
        Get column name mappings for format.

        Parameters:
        - format_name: Format name

        Returns:
        Dictionary mapping format columns to standard names
        """

    def update_column_mapping(self, format_name: str,
                              mapping: dict) -> None:
        """
        Update column mappings for format.

        Parameters:
        - format_name: Format name to update
        - mapping: New column mappings
        """

# Standard configuration constants
STANDARD_QUANTIFICATION_COLUMNS: dict = {
    'sequence': str,     # Peptide sequence
    'proteins': str,     # Protein identifiers
    'sample': str,       # Sample identifier
    'intensity': float,  # Quantification intensity
    'rt': float,         # Retention time
    'charge': int,       # Precursor charge
    'mz': float,         # Precursor m/z
    'qvalue': float,     # Identification confidence
    'run': str,          # LC-MS run identifier
    'channel': str,      # Labeling channel (for TMT/iTRAQ)
}

def get_default_config(format_name: str) -> dict:
    """
    Get default configuration for quantification format.

    Parameters:
    - format_name: Quantification format name

    Returns:
    Default configuration dictionary
    """
```
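
A small sketch of working with the shipped mappings, assuming `STANDARD_QUANTIFICATION_COLUMNS` and `get_default_config` are importable from `config_dict_loader` alongside the loader class; the Spectronaut column name is illustrative.

```python
from alphabase.quantification.quant_reader.config_dict_loader import (
    ConfigDictLoader, STANDARD_QUANTIFICATION_COLUMNS, get_default_config
)

loader = ConfigDictLoader()

# The standard target schema that every format is mapped onto
print(sorted(STANDARD_QUANTIFICATION_COLUMNS))

# Start from the shipped Spectronaut mapping and adjust a single column
mapping = loader.get_column_mapping('spectronaut')
mapping['PG.ProteinGroups'] = 'proteins'  # illustrative export header
loader.update_column_mapping('spectronaut', mapping)

# Full default configuration for comparison
spectronaut_defaults = get_default_config('spectronaut')
```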

### Data Reformatting and Processing

Utilities for reformatting and processing quantification data for downstream analysis workflows.

```python { .api }
class TableReformatter:
    """Reformatter for quantification data tables."""

    def __init__(self):
        """Initialize table reformatter."""

    def reformat_for_analysis(self, df: pd.DataFrame,
                              analysis_type: str = 'differential') -> pd.DataFrame:
        """
        Reformat data for specific analysis workflows.

        Parameters:
        - df: Input quantification DataFrame
        - analysis_type: Type of analysis ('differential', 'network', 'timecourse')

        Returns:
        Reformatted DataFrame suitable for analysis
        """

    def create_design_matrix(self, df: pd.DataFrame,
                             sample_info: pd.DataFrame) -> pd.DataFrame:
        """
        Create design matrix for statistical analysis.

        Parameters:
        - df: Quantification DataFrame
        - sample_info: Sample metadata DataFrame

        Returns:
        Design matrix DataFrame
        """

    def pivot_to_matrix(self, df: pd.DataFrame,
                        index_cols: List[str],
                        value_col: str = 'intensity') -> pd.DataFrame:
        """
        Pivot quantification data to matrix format.

        Parameters:
        - df: Long-format quantification DataFrame
        - index_cols: Columns to use as row identifiers
        - value_col: Column containing values to pivot

        Returns:
        Matrix-format DataFrame
        """

    def handle_missing_values(self, df: pd.DataFrame,
                              method: str = 'impute') -> pd.DataFrame:
        """
        Handle missing quantification values.

        Parameters:
        - df: Quantification DataFrame with missing values
        - method: Handling method ('impute', 'remove', 'flag')

        Returns:
        DataFrame with missing values handled
        """

class PlexDIAReformatter:
    """Specialized reformatter for plexDIA quantification data."""

    def __init__(self):
        """Initialize plexDIA reformatter."""

    def process_plexdia_output(self, filepath: str) -> pd.DataFrame:
        """
        Process plexDIA output files.

        Parameters:
        - filepath: Path to plexDIA output file

        Returns:
        Processed quantification DataFrame
        """

    def extract_channel_intensities(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Extract individual channel intensities from plexDIA data.

        Parameters:
        - df: Raw plexDIA DataFrame

        Returns:
        DataFrame with separated channel intensities
        """

    def normalize_channels(self, df: pd.DataFrame,
                           method: str = 'sum') -> pd.DataFrame:
        """
        Normalize intensities across plexDIA channels.

        Parameters:
        - df: plexDIA quantification DataFrame
        - method: Normalization method ('sum', 'median', 'reference')

        Returns:
        Channel-normalized DataFrame
        """

def merge_quantification_data(dataframes: List[pd.DataFrame],
                              merge_on: List[str] = None) -> pd.DataFrame:
    """
    Merge multiple quantification datasets.

    Parameters:
    - dataframes: List of quantification DataFrames to merge
    - merge_on: Columns to merge on (default: sequence, proteins, charge)

    Returns:
    Merged quantification DataFrame
    """

def calculate_fold_changes(df: pd.DataFrame,
                           control_samples: List[str],
                           treatment_samples: List[str]) -> pd.DataFrame:
    """
    Calculate fold changes between sample groups.

    Parameters:
    - df: Quantification DataFrame
    - control_samples: List of control sample identifiers
    - treatment_samples: List of treatment sample identifiers

    Returns:
    DataFrame with fold changes and statistics
    """
```
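
A compact sketch of the matrix and plexDIA paths, reusing a long-format table produced by one of the readers above; the `plexdia_reformatter` module path and the file name are assumptions made for illustration.

```python
from alphabase.quantification.quant_reader.table_reformatter import TableReformatter
from alphabase.quantification.quant_reader.plexdia_reformatter import PlexDIAReformatter

reformatter = TableReformatter()

# `long_df`: a long-format table produced by one of the readers above
matrix = reformatter.pivot_to_matrix(
    long_df,
    index_cols=['sequence', 'charge'],
    value_col='intensity'
)
matrix = reformatter.handle_missing_values(matrix, method='impute')

# plexDIA output gets its own reformatter with per-channel handling
plex = PlexDIAReformatter()
plex_df = plex.process_plexdia_output('plexdia_report.tsv')  # placeholder file name
channels = plex.extract_channel_intensities(plex_df)
channels = plex.normalize_channels(channels, method='sum')
```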

### Quality Control and Statistics

Functions for quality assessment and statistical analysis of quantification data.

```python { .api }
def assess_data_quality(df: pd.DataFrame) -> dict:
    """
    Assess quantification data quality metrics.

    Parameters:
    - df: Quantification DataFrame

    Returns:
    Dictionary with quality metrics and statistics
    """

def calculate_cv_statistics(df: pd.DataFrame,
                            sample_groups: dict) -> pd.DataFrame:
    """
    Calculate coefficient of variation statistics.

    Parameters:
    - df: Quantification DataFrame
    - sample_groups: Dictionary mapping samples to groups

    Returns:
    DataFrame with CV statistics
    """

def identify_outlier_samples(df: pd.DataFrame,
                             method: str = 'pca') -> List[str]:
    """
    Identify outlier samples in quantification data.

    Parameters:
    - df: Quantification DataFrame
    - method: Outlier detection method ('pca', 'correlation', 'distance')

    Returns:
    List of outlier sample identifiers
    """

def generate_qa_report(df: pd.DataFrame,
                       output_path: str = None) -> dict:
    """
    Generate comprehensive quality assessment report.

    Parameters:
    - df: Quantification DataFrame
    - output_path: Optional path to save HTML report

    Returns:
    Dictionary with QA metrics and plots
    """
```
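
A minimal sketch of per-group CV assessment, assuming the sample-to-group orientation described in the docstring above and the `quantreader_utils` module path used in the Quality Assessment example below; the group labels are illustrative and `merged_df` refers to the merged table built under Usage Examples.

```python
from alphabase.quantification.quant_reader.quantreader_utils import calculate_cv_statistics

# Map each sample to its experimental group (labels are illustrative)
sample_groups = {
    'exp1': 'control',
    'exp2': 'treatment',
    'exp3': 'treatment',
}

# `merged_df`: see the merge example under Usage Examples below
cv_df = calculate_cv_statistics(merged_df, sample_groups=sample_groups)
print(cv_df.head())
```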

## Usage Examples

### Basic Quantification Data Import

```python
from alphabase.quantification.quant_reader.quant_reader_manager import import_data

# Import DIA-NN quantification data
diann_df = import_data('report.tsv', data_type='diann')
print(f"Imported {len(diann_df)} quantification entries")

# Import Spectronaut data
spectronaut_df = import_data('spectronaut_export.tsv', data_type='spectronaut')
print(f"Imported {len(spectronaut_df)} quantification entries")

# Auto-detect the format when data_type is omitted
unknown_df = import_data('unknown_quant.tsv')
print(f"Auto-detected format: {unknown_df.attrs.get('format_type', 'unknown')}")
```

### Processing Long-Format Data

```python
from alphabase.quantification.quant_reader.longformat_reader import LongFormatReader
import pandas as pd

# Create reader with default configuration
reader = LongFormatReader()

# Read and process DIA-NN data
df = reader.read_file('diann_report.tsv')

# Apply quality filters
filtered_df = reader.filter_data(
    df,
    min_confidence=0.01,  # 1% FDR
    remove_decoys=True
)

# Aggregate to protein level
protein_df = reader.aggregate_to_protein_level(
    filtered_df,
    method='sum'  # Sum peptide intensities
)

print(f"Peptide-level: {len(filtered_df)} entries")
print(f"Protein-level: {len(protein_df)} entries")
```

### Working with Wide-Format Data

```python
from alphabase.quantification.quant_reader.wideformat_reader import WideFormatReader

# Process MaxQuant proteinGroups.txt
reader = WideFormatReader()
df = reader.read_file('proteinGroups.txt')

# Auto-identify intensity columns
sample_cols = reader.identify_sample_columns(df)
print(f"Found {len(sample_cols)} sample columns: {sample_cols[:5]}...")

# Convert to long format for analysis
long_df = reader.convert_to_long_format(df, sample_columns=sample_cols)

# Normalize intensities
normalized_df = reader.normalize_intensities(long_df, method='median')
print(f"Converted to long format: {len(long_df)} entries")
```

### Advanced Data Processing

```python
import pandas as pd

from alphabase.quantification.quant_reader.quant_reader_manager import import_data
from alphabase.quantification.quant_reader.table_reformatter import TableReformatter
from alphabase.quantification.quant_reader.quantreader_utils import (
    merge_quantification_data, calculate_fold_changes
)

# Merge data from multiple experiments
experiment_dfs = [
    import_data('exp1_diann.tsv', data_type='diann'),
    import_data('exp2_diann.tsv', data_type='diann'),
    import_data('exp3_diann.tsv', data_type='diann')
]

merged_df = merge_quantification_data(
    experiment_dfs,
    merge_on=['sequence', 'proteins', 'charge']
)

# Create design matrix for statistical analysis
reformatter = TableReformatter()
sample_info = pd.DataFrame({
    'sample': ['exp1', 'exp2', 'exp3'],
    'condition': ['control', 'treatment', 'treatment'],
    'batch': [1, 1, 2]
})

design_matrix = reformatter.create_design_matrix(merged_df, sample_info)

# Calculate fold changes
fold_changes = calculate_fold_changes(
    merged_df,
    control_samples=['exp1'],
    treatment_samples=['exp2', 'exp3']
)

print(f"Calculated fold changes for {len(fold_changes)} proteins")
```

### Quality Assessment

```python
from alphabase.quantification.quant_reader.quantreader_utils import (
    assess_data_quality, generate_qa_report, identify_outlier_samples
)

# Assess data quality (merged_df comes from the previous example)
quality_metrics = assess_data_quality(merged_df)
print("Quality metrics:")
print(f"  Missing values: {quality_metrics['missing_percentage']:.1f}%")
print(f"  CV median: {quality_metrics['cv_median']:.2f}")
print(f"  Dynamic range: {quality_metrics['dynamic_range']:.1f}")

# Identify outlier samples
outliers = identify_outlier_samples(merged_df, method='pca')
if outliers:
    print(f"Outlier samples detected: {outliers}")

# Generate comprehensive QA report
qa_report = generate_qa_report(merged_df, output_path='qa_report.html')
print(f"QA report saved with {len(qa_report['plots'])} plots")
```

623

624

### Custom Configuration

625

626

```python

627

from alphabase.quantification.quant_reader.config_dict_loader import ConfigDictLoader

628

629

# Create custom configuration

630

config_loader = ConfigDictLoader()

631

632

# Get default DIA-NN configuration

633

diann_config = config_loader.load_config('diann')

634

print(f"Default DIA-NN columns: {diann_config['column_mapping']}")

635

636

# Create custom configuration for new format

637

custom_config = {

638

'column_mapping': {

639

'peptide_sequence': 'sequence',

640

'protein_id': 'proteins',

641

'sample_name': 'sample',

642

'peak_area': 'intensity',

643

'retention_time': 'rt',

644

'precursor_charge': 'charge'

645

},

646

'filters': {

647

'min_confidence': 0.01,

648

'remove_contaminants': True

649

}

650

}

651

652

config_loader.save_config(custom_config, 'custom_format')

653

654

# Use custom configuration

655

custom_df = import_data('custom_data.tsv',

656

data_type='custom_format',

657

config_dict=custom_config)

658

```