0
# Utilities and Constants
1
2
Helper functions, error handling classes, and constants used throughout the pysam library for quality score conversion, string encoding, and genomic data processing.
3
4
## Capabilities
5
6
### HTSlib Verbosity Control
7
8
Functions for controlling HTSlib logging verbosity.
9
10
```python { .api }
11
def get_verbosity():
12
"""
13
Get current HTSlib verbosity level.
14
15
Returns:
16
int, current verbosity level
17
"""
18
19
def set_verbosity(verbosity):
20
"""
21
Set HTSlib verbosity level.
22
23
Parameters:
24
- verbosity: int, verbosity level (0=silent, 1=errors, 2=warnings, 3=info)
25
26
Returns:
27
int, previous verbosity level
28
"""
29
```
30
31
### Quality Score Utilities
32
33
Functions for converting between different quality score representations.
34
35
```python { .api }
36
def qualitystring_to_array(input_str, offset=33):
37
"""
38
Convert quality string to array of integers.
39
40
Parameters:
41
- input_str: str, quality string
42
- offset: int, ASCII offset (33 for Sanger, 64 for Illumina 1.3+)
43
44
Returns:
45
array, quality scores as integers
46
"""
47
48
def array_to_qualitystring(qualities, offset=33):
49
"""
50
Convert array of integers to quality string.
51
52
Parameters:
53
- qualities: array, quality scores as integers
54
- offset: int, ASCII offset (33 for Sanger, 64 for Illumina 1.3+)
55
56
Returns:
57
str, quality string
58
"""
59
60
def qualities_to_qualitystring(qualities, offset=33):
61
"""
62
Convert list of quality scores to quality string.
63
64
Parameters:
65
- qualities: list, quality scores
66
- offset: int, ASCII offset
67
68
Returns:
69
str, quality string
70
"""
71
```
72
73
### String Encoding Utilities
74
75
Functions for handling string encoding in genomic data.
76
77
```python { .api }
78
def get_encoding_error_handler():
79
"""
80
Get current string encoding error handler.
81
82
Returns:
83
str, error handler name ('strict', 'ignore', 'replace')
84
"""
85
86
def set_encoding_error_handler(errors):
87
"""
88
Set string encoding error handler.
89
90
Parameters:
91
- errors: str, error handler ('strict', 'ignore', 'replace')
92
"""
93
```
94
95
### Error Handling Classes
96
97
Exception classes for pysam operations.
98
99
```python { .api }
100
class SamtoolsError(Exception):
101
"""
102
Exception raised when samtools/bcftools commands fail.
103
104
Attributes:
105
- cmd: str, command that failed
106
- returncode: int, exit code
107
- stderr: str, error output
108
"""
109
110
def __init__(self, cmd, returncode, stderr=None):
111
"""
112
Initialize SamtoolsError.
113
114
Parameters:
115
- cmd: str, failed command
116
- returncode: int, exit code
117
- stderr: str, error output
118
"""
119
120
class PysamDispatcher:
121
"""
122
Command-line dispatcher for samtools/bcftools.
123
124
Handles execution of external commands and error reporting.
125
"""
126
127
def __init__(self, name):
128
"""
129
Initialize dispatcher.
130
131
Parameters:
132
- name: str, tool name ('samtools' or 'bcftools')
133
"""
134
135
def __call__(self, *args, **kwargs):
136
"""
137
Execute command.
138
139
Parameters:
140
*args: command arguments
141
**kwargs: execution options
142
143
Returns:
144
int, exit code
145
"""
146
147
class unquoted_str(str):
148
"""
149
String subclass for unquoted VCF header values.
150
151
Used to indicate that a header value should not be quoted
152
when writing VCF files.
153
"""
154
```
155
156
### CIGAR Operation Constants
157
158
Constants for CIGAR string operations in sequence alignments.
159
160
```python { .api }
161
# CIGAR operation constants
162
CMATCH: int # Match/mismatch (0)
163
CINS: int # Insertion (1)
164
CDEL: int # Deletion (2)
165
CREF_SKIP: int # Reference skip (3)
166
CSOFT_CLIP: int # Soft clipping (4)
167
CHARD_CLIP: int # Hard clipping (5)
168
CPAD: int # Padding (6)
169
CEQUAL: int # Sequence match (7)
170
CDIFF: int # Sequence mismatch (8)
171
CBACK: int # Back (9)
172
173
# CIGAR operations tuple
174
CIGAR_OPS: tuple # All CIGAR operation constants
175
```
176
177
### SAM Flag Constants
178
179
Constants for SAM flag bits indicating read properties.
180
181
```python { .api }
182
# SAM flag constants
183
FPAIRED: int # Template having multiple segments (0x1)
184
FPROPER_PAIR: int # Each segment properly aligned (0x2)
185
FUNMAP: int # Segment unmapped (0x4)
186
FMUNMAP: int # Next segment unmapped (0x8)
187
FREVERSE: int # SEQ reverse complemented (0x10)
188
FMREVERSE: int # SEQ of next segment reverse complemented (0x20)
189
FREAD1: int # First segment in template (0x40)
190
FREAD2: int # Last segment in template (0x80)
191
FSECONDARY: int # Secondary alignment (0x100)
192
FQCFAIL: int # Not passing quality controls (0x200)
193
FDUP: int # PCR or optical duplicate (0x400)
194
FSUPPLEMENTARY: int # Supplementary alignment (0x800)
195
196
# SAM flags tuple
197
SAM_FLAGS: tuple # All SAM flag constants
198
```
199
200
### Version Information
201
202
Access to library version information.
203
204
```python { .api }
205
__version__: str # Pysam version string
206
__samtools_version__: str # Underlying samtools version
207
__htslib_version__: str # Underlying htslib version
208
```
209
210
### Configuration Functions
211
212
Functions for getting pysam build configuration.
213
214
```python { .api }
215
def get_include():
216
"""
217
Get list of include directories for pysam headers.
218
219
Returns:
220
list, include directory paths
221
"""
222
223
def get_defines():
224
"""
225
Get list of compilation defines.
226
227
Returns:
228
list, compilation defines
229
"""
230
231
def get_libraries():
232
"""
233
Get list of pysam libraries for linking.
234
235
Returns:
236
list, library file paths
237
"""
238
```
239
240
## Usage Examples
241
242
### Quality Score Conversion
243
244
```python
245
import pysam
246
247
# Convert quality string to numeric scores
248
quality_string = "###$$%&'()*+,-./0123456789"
249
scores = pysam.qualitystring_to_array(quality_string)
250
print(f"Quality scores: {scores}")
251
252
# Convert back to string
253
converted_string = pysam.array_to_qualitystring(scores)
254
print(f"Converted string: {converted_string}")
255
assert quality_string == converted_string
256
257
# Handle different encoding offsets
258
illumina_quality = "BCDEFGHIJKLMNOP" # Illumina 1.3+ encoding
259
sanger_scores = pysam.qualitystring_to_array(illumina_quality, offset=64)
260
sanger_quality = pysam.array_to_qualitystring(sanger_scores, offset=33)
261
print(f"Illumina to Sanger: {illumina_quality} -> {sanger_quality}")
262
263
# Convert list of qualities
264
quality_list = [20, 25, 30, 35, 40]
265
quality_str = pysam.qualities_to_qualitystring(quality_list)
266
print(f"Quality list to string: {quality_list} -> {quality_str}")
267
```
268
269
### Working with CIGAR Operations
270
271
```python
272
import pysam
273
274
def analyze_cigar(cigar_tuples):
275
"""Analyze CIGAR operations in an alignment."""
276
operations = {
277
pysam.CMATCH: "Match/Mismatch",
278
pysam.CINS: "Insertion",
279
pysam.CDEL: "Deletion",
280
pysam.CREF_SKIP: "Reference Skip",
281
pysam.CSOFT_CLIP: "Soft Clip",
282
pysam.CHARD_CLIP: "Hard Clip",
283
pysam.CPAD: "Padding",
284
pysam.CEQUAL: "Sequence Match",
285
pysam.CDIFF: "Sequence Mismatch",
286
pysam.CBACK: "Back"
287
}
288
289
print("CIGAR analysis:")
290
for op, length in cigar_tuples:
291
op_name = operations.get(op, f"Unknown({op})")
292
print(f" {op_name}: {length} bases")
293
294
# Example usage with aligned segment
295
with pysam.AlignmentFile("example.bam", "rb") as bamfile:
296
for read in bamfile.fetch():
297
if read.cigar:
298
print(f"Read: {read.query_name}")
299
analyze_cigar(read.cigar)
300
break
301
```
302
303
### Working with SAM Flags
304
305
```python
306
import pysam
307
308
def analyze_sam_flags(flag):
309
"""Analyze SAM flags for a read."""
310
flags = {
311
pysam.FPAIRED: "Paired",
312
pysam.FPROPER_PAIR: "Proper pair",
313
pysam.FUNMAP: "Unmapped",
314
pysam.FMUNMAP: "Mate unmapped",
315
pysam.FREVERSE: "Reverse strand",
316
pysam.FMREVERSE: "Mate reverse strand",
317
pysam.FREAD1: "First in pair",
318
pysam.FREAD2: "Second in pair",
319
pysam.FSECONDARY: "Secondary alignment",
320
pysam.FQCFAIL: "QC fail",
321
pysam.FDUP: "Duplicate",
322
pysam.FSUPPLEMENTARY: "Supplementary"
323
}
324
325
print(f"SAM flag {flag} ({flag:b} binary):")
326
active_flags = []
327
for flag_bit, description in flags.items():
328
if flag & flag_bit:
329
active_flags.append(description)
330
331
if active_flags:
332
for desc in active_flags:
333
print(f" - {desc}")
334
else:
335
print(" - No flags set")
336
337
# Example usage
338
with pysam.AlignmentFile("example.bam", "rb") as bamfile:
339
for read in bamfile.fetch():
340
print(f"Read: {read.query_name}")
341
analyze_sam_flags(read.flag)
342
343
# Check specific flags using properties
344
if read.is_paired:
345
print(" This is a paired read")
346
if read.is_proper_pair:
347
print(" This is a proper pair")
348
if read.is_reverse:
349
print(" This read is reverse complemented")
350
351
break
352
```
353
354
### Error Handling
355
356
```python
357
import pysam
358
359
def safe_samtools_operation(command_func, *args, **kwargs):
360
"""Safely execute samtools command with error handling."""
361
try:
362
result = command_func(*args, **kwargs)
363
print(f"Command succeeded with exit code: {result}")
364
return result
365
except pysam.SamtoolsError as e:
366
print(f"Samtools command failed:")
367
print(f" Command: {e.cmd}")
368
print(f" Exit code: {e.returncode}")
369
if e.stderr:
370
print(f" Error: {e.stderr}")
371
return None
372
373
# Example usage
374
result = safe_samtools_operation(pysam.sort, "-o", "output.bam", "input.bam")
375
if result is not None:
376
print("Sort operation completed successfully")
377
```
378
379
### String Encoding Handling
380
381
```python
382
import pysam
383
384
# Check current encoding error handler
385
current_handler = pysam.get_encoding_error_handler()
386
print(f"Current encoding error handler: {current_handler}")
387
388
# Set more permissive error handling for problematic files
389
pysam.set_encoding_error_handler('ignore')
390
391
try:
392
# Process file that might have encoding issues
393
with pysam.AlignmentFile("problematic.bam", "rb") as bamfile:
394
for read in bamfile.fetch():
395
# Process reads, ignoring encoding errors in strings
396
pass
397
finally:
398
# Restore original handler
399
pysam.set_encoding_error_handler(current_handler)
400
```
401
402
### Version and Configuration Information
403
404
```python
405
import pysam
406
407
# Display version information
408
print(f"Pysam version: {pysam.__version__}")
409
print(f"Samtools version: {pysam.__samtools_version__}")
410
411
# Get build configuration
412
includes = pysam.get_include()
413
print(f"Include directories: {includes}")
414
415
defines = pysam.get_defines()
416
print(f"Compilation defines: {defines}")
417
418
libraries = pysam.get_libraries()
419
print(f"Libraries: {libraries}")
420
421
# Check available CIGAR operations
422
print(f"Available CIGAR operations: {pysam.CIGAR_OPS}")
423
424
# Check available SAM flags
425
print(f"Available SAM flags: {pysam.SAM_FLAGS}")
426
```
427
428
### Custom Quality Processing
429
430
```python
431
import pysam
432
import numpy as np
433
434
def filter_by_average_quality(input_bam, output_bam, min_avg_quality=20):
435
"""Filter reads by average base quality."""
436
437
with pysam.AlignmentFile(input_bam, "rb") as infile:
438
with pysam.AlignmentFile(output_bam, "wb", template=infile) as outfile:
439
440
reads_processed = 0
441
reads_kept = 0
442
443
for read in infile:
444
reads_processed += 1
445
446
if read.query_qualities:
447
# Calculate average quality
448
avg_quality = np.mean(read.query_qualities)
449
450
if avg_quality >= min_avg_quality:
451
outfile.write(read)
452
reads_kept += 1
453
else:
454
# Keep reads without quality scores
455
outfile.write(read)
456
reads_kept += 1
457
458
print(f"Processed {reads_processed} reads, kept {reads_kept} reads")
459
print(f"Filtering rate: {(reads_processed - reads_kept) / reads_processed * 100:.1f}%")
460
461
# Usage
462
filter_by_average_quality("input.bam", "filtered.bam", min_avg_quality=25)
463
```
464
465
## Performance and Best Practices
466
467
- Quality score conversions are optimized for speed with Cython implementations
468
- Use appropriate offset values for different sequencing platforms
469
- CIGAR and SAM flag constants enable readable code instead of magic numbers
470
- Error handlers can be adjusted for problematic input files
471
- Version information helps ensure compatibility across different installations
472
- Configuration functions are useful for building extensions that depend on pysam