# Pysam

Pysam is a Python wrapper for the HTSlib C library that reads, manipulates, and writes genomic data sets in standard bioinformatics formats. It supports SAM/BAM/CRAM for sequence alignments, VCF/BCF for variant calls, and FASTA/FASTQ for sequences, along with tabix-indexed and BGZF-compressed files.

## Package Information

- **Package Name**: pysam
- **Language**: Python
- **Installation**: `pip install pysam`

## Core Imports

```python
import pysam
```

Common specific imports:

```python
from pysam import AlignmentFile, VariantFile, FastaFile, TabixFile, BGZFile
from pysam import samtools, bcftools  # For command-line functions
```

## Basic Usage

```python
import pysam
import pysam.bcftools  # bcftools commands live in this submodule

# Reading SAM/BAM/CRAM alignment files (region fetch requires an index, e.g. example.bam.bai)
with pysam.AlignmentFile("example.bam", "rb") as samfile:
    for read in samfile.fetch("chr1", 1000, 2000):
        print(f"Read: {read.query_name}, Position: {read.reference_start}")

# Reading VCF/BCF variant files (region fetch requires a bgzip-compressed, indexed file)
with pysam.VariantFile("example.vcf.gz") as vcffile:
    for record in vcffile.fetch("chr1", 1000, 2000):
        print(f"Variant at {record.pos}: {record.ref} -> {record.alts}")

# Reading FASTA files (random access relies on the faidx index, reference.fa.fai)
with pysam.FastaFile("reference.fa") as fastafile:
    sequence = fastafile.fetch("chr1", 1000, 2000)
    print(f"Sequence: {sequence}")

# Command-line tool integration (samtools commands are exposed at the top level)
pysam.sort("-o", "sorted.bam", "input.bam")
pysam.index("sorted.bam")

# BCFtools for variant processing
pysam.bcftools.call("-mv", "-o", "calls.vcf", "pileup.bcf")
pysam.bcftools.filter("-i", "QUAL>=20", "-o", "filtered.vcf", "calls.vcf")
```

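The `fetch()` calls above take 0-based, half-open coordinates, whereas samtools-style region strings and VCF positions (`record.pos`) are 1-based. As a minimal sketch of that conversion, the hypothetical helper `parse_region` below (not part of pysam) turns a 1-based, inclusive region string into arguments suitable for `fetch()`. It assumes contig names contain no colon.

```python
import re

# Hypothetical helper, not part of pysam: parses "contig:start-end" (1-based, inclusive).
_REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(region):
    """Return (contig, start, stop) as 0-based, half-open coordinates."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"Unrecognised region string: {region!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"Invalid coordinates in region: {region!r}")
    # 1-based inclusive [start, end] covers the same bases as 0-based half-open [start - 1, end)
    return match["contig"], start - 1, end


print(parse_region("chr1:1,001-2,000"))  # ('chr1', 1000, 2000)
```

With an open file this could be used as `samfile.fetch(*parse_region("chr1:1001-2000"))`. The `fetch()` method of `AlignmentFile` also accepts a `region` string directly, so the helper mainly serves to make the coordinate convention explicit.
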
## Architecture

Pysam follows a modular architecture built around HTSlib's C API:

- **File Classes**: High-level interfaces (`AlignmentFile`, `VariantFile`, `FastaFile`, `TabixFile`) that provide Pythonic access to genomic file formats
- **Record Classes**: Data structures (`AlignedSegment`, `VariantRecord`, `FastxRecord`) representing individual entries with attribute access
- **Proxy Classes**: Efficient access to parsed data without copying (`GTFProxy`, `VCFProxy`, `BedProxy`)
- **Iterator Classes**: Different iteration patterns (row-wise, column-wise, pileup) for accessing data
- **Command Integration**: Direct access to samtools and bcftools command-line functionality

This design enables efficient processing of large genomic datasets while maintaining Python's ease of use.

## Capabilities

### SAM/BAM/CRAM Alignment Files

Read and write sequence alignment files with support for indexing, random access, and comprehensive metadata handling.

```python { .api }
class AlignmentFile:
    def __init__(self, filepath, mode, **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...
    def pileup(self, contig=None, start=None, stop=None): ...

class AlignedSegment:
    query_name: str
    reference_start: int
    reference_end: int
    query_sequence: str
    query_qualities: list
```

[SAM/BAM/CRAM Files](./alignment-files.md)

### VCF/BCF Variant Files

Handle variant call format files with full header support, sample data access, and filtering capabilities.

```python { .api }
class VariantFile:
    def __init__(self, filepath, mode="r", **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...

class VariantRecord:
    contig: str
    pos: int
    ref: str
    alts: tuple
    qual: float
```

[VCF/BCF Files](./variant-files.md)

### FASTA/FASTQ Sequence Files

Access sequence files with both random access (FASTA with index) and streaming capabilities (FASTA/FASTQ).

```python { .api }
class FastaFile:
    def __init__(self, filename): ...
    def fetch(self, reference, start=None, end=None): ...

class FastxFile:
    def __init__(self, filename, mode="r"): ...
    def __iter__(self): ...

class FastxRecord:
    name: str
    sequence: str
    comment: str
    quality: str
```

[FASTA/FASTQ Files](./sequence-files.md)

### Tabix-Indexed Files

Access compressed, indexed genomic files with support for multiple formats (BED, GFF, GTF, VCF).

```python { .api }
class TabixFile:
    def __init__(self, filename, parser=None): ...
    def fetch(self, reference, start=None, end=None, parser=None): ...

def tabix_index(filename, preset=None, **kwargs): ...
def tabix_compress(filename_in, filename_out, **kwargs): ...
```

[Tabix Files](./tabix-files.md)

### Compressed Files (BGZF)

Handle block gzip compressed files commonly used in genomics.

```python { .api }
class BGZFile:
    def __init__(self, filepath, mode): ...
    def read(self, size=-1): ...
    def write(self, data): ...
    def seek(self, offset, whence=0): ...
```

[BGZF Files](./bgzf-files.md)

### Command-Line Tools Integration

Access samtools and bcftools functionality directly from Python, with subcommands exposed as functions.

```python { .api }
# samtools subcommands (called as pysam.<name>)
def view(*args, **kwargs): ...
def sort(*args, **kwargs): ...
def index(*args, **kwargs): ...
def stats(*args, **kwargs): ...
def merge(*args, **kwargs): ...
# bcftools subcommands (called as pysam.bcftools.<name>)
def call(*args, **kwargs): ...
```

[Command-Line Tools](./command-tools.md)

### Utility Functions and Constants

Helper functions for quality score conversion, error handling, and genomic constants.

```python { .api }
def qualitystring_to_array(s): ...
def array_to_qualitystring(a): ...

class SamtoolsError(Exception): ...

# CIGAR operations
CMATCH: int
CINS: int
CDEL: int

# SAM flags
FPAIRED: int
FUNMAP: int
FREVERSE: int
```

[Utilities](./utilities.md)

## Error Handling

Pysam raises `SamtoolsError` when a wrapped samtools or bcftools command fails, and standard Python exceptions (such as `OSError` and `ValueError`) for file I/O and data access problems. The file classes support context managers (`with` statements), which close the underlying file handle on exit.

## Performance Considerations

- Use indexed files and `fetch()` with coordinates for random access instead of scanning the whole file
- Process large datasets as streams by iterating over records rather than loading them all into memory
- Use context managers to ensure file handles are closed
- Use proxy classes for memory-efficient access to parsed data
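
As a sketch of the streaming pattern in the list above, the function below tallies reads per contig in a single pass, holding only the counters in memory. It is written against any iterable of objects with a `reference_name` attribute, so it is demonstrated here with stand-in records rather than a real BAM file. `count_by_contig` is a hypothetical helper, not part of pysam.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_contig(records):
    """Hypothetical helper: count records per contig without materialising them."""
    counts = Counter()
    for record in records:  # consumes the iterator one record at a time
        # Unmapped reads have no contig; group them under a placeholder key.
        counts[record.reference_name or "unmapped"] += 1
    return counts


# Stand-in records so the sketch runs without a BAM file.
fake_reads = (
    SimpleNamespace(reference_name=name) for name in ["chr1", "chr2", "chr1", None]
)
print(dict(count_by_contig(fake_reads)))  # {'chr1': 2, 'chr2': 1, 'unmapped': 1}
```

With pysam installed, the same function could presumably be called as `count_by_contig(samfile.fetch(until_eof=True))` inside a `with pysam.AlignmentFile(...)` block, which iterates over the whole file without needing an index.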