tessl/pypi-pysam

Package for reading, manipulating, and writing genomic data

Describes: pkg:pypi/pysam@0.23.x

To install, run

npx @tessl/cli install tessl/pypi-pysam@0.23.0

Pysam

A comprehensive Python wrapper for the HTSlib library that provides facilities for reading, manipulating, and writing genomic data sets in standard bioinformatics formats. Pysam supports SAM/BAM/CRAM for sequence alignments, VCF/BCF for variant calls, and FASTA/FASTQ for sequences, along with tabix-indexed files and compressed formats.

Package Information

  • Package Name: pysam
  • Language: Python
  • Installation: pip install pysam

Core Imports

import pysam

Common specific imports:

from pysam import AlignmentFile, VariantFile, FastaFile, TabixFile, BGZFile
from pysam import samtools, bcftools  # For command-line functions

Basic Usage

import pysam

# Reading SAM/BAM/CRAM alignment files (region fetch() requires an index, e.g. example.bam.bai)
with pysam.AlignmentFile("example.bam", "rb") as samfile:
    for read in samfile.fetch("chr1", 1000, 2000):
        print(f"Read: {read.query_name}, Position: {read.reference_start}")

# Reading VCF/BCF variant files (region fetch() requires a bgzip-compressed, tabix-indexed file)
with pysam.VariantFile("example.vcf.gz") as vcffile:
    for record in vcffile.fetch("chr1", 1000, 2000):
        print(f"Variant at {record.pos}: {record.ref} -> {record.alts}")

# Reading FASTA files
with pysam.FastaFile("reference.fa") as fastafile:
    sequence = fastafile.fetch("chr1", 1000, 2000)
    print(f"Sequence: {sequence}")

# Command-line tool integration
pysam.sort("-o", "sorted.bam", "input.bam")
pysam.index("sorted.bam")

# BCFtools for variant processing (these commands live in the pysam.bcftools module)
from pysam import bcftools
bcftools.call("-mv", "-o", "calls.vcf", "pileup.bcf")
bcftools.filter("-i", "QUAL>=20", "-o", "filtered.vcf", "calls.vcf")

Architecture

Pysam follows a modular architecture built around HTSlib's C API:

  • File Classes: High-level interfaces (AlignmentFile, VariantFile, FastaFile, TabixFile) that provide Pythonic access to genomic file formats
  • Record Classes: Data structures (AlignedSegment, VariantRecord, FastxRecord) representing individual entries with attribute access
  • Proxy Classes: Efficient access to parsed data without copying (GTFProxy, VCFProxy, BedProxy)
  • Iterator Classes: Row-wise iteration over records (fetch()) and column-wise iteration over reference positions (pileup())
  • Command Integration: Direct access to samtools and bcftools command-line functionality

This design enables efficient processing of large genomic datasets while maintaining Python's ease of use.
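As a rough illustration of the row-wise and column-wise iteration patterns listed above, the sketch below consumes reads as returned by fetch() and pileup columns as returned by pileup(). The helper names `read_names`, `depth_by_position`, and `inspect_region` are hypothetical. The attributes used (`query_name`, `reference_pos`, `nsegments`) are the ones pysam exposes on AlignedSegment and PileupColumn. pysam is imported lazily so the helpers can be tried with stand-in objects.

```python
from types import SimpleNamespace


def read_names(reads):
    """Row-wise: one AlignedSegment-like object per alignment record."""
    return [read.query_name for read in reads]


def depth_by_position(columns):
    """Column-wise: one PileupColumn-like object per reference position."""
    return {col.reference_pos: col.nsegments for col in columns}


def inspect_region(path, contig, start, stop):
    """Apply both patterns to an indexed BAM file (hypothetical helper)."""
    import pysam  # imported lazily; needs pysam and an index next to the BAM

    with pysam.AlignmentFile(path, "rb") as bam:
        names = read_names(bam.fetch(contig, start, stop))
        depth = depth_by_position(bam.pileup(contig, start, stop))
    return names, depth


# Stand-in objects with the same attribute names, so the sketch runs without data.
fake_reads = [SimpleNamespace(query_name="r1"), SimpleNamespace(query_name="r2")]
fake_columns = [SimpleNamespace(reference_pos=100, nsegments=2)]
names = read_names(fake_reads)
depth = depth_by_position(fake_columns)
```

With real data, `inspect_region("example.bam", "chr1", 1000, 2000)` would return the read names and per-position depth for that window.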

Capabilities

SAM/BAM/CRAM Alignment Files

Read and write sequence alignment files with support for indexing, random access, and comprehensive metadata handling.

class AlignmentFile:
    def __init__(self, filepath, mode, **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...
    def pileup(self, contig=None, start=None, stop=None): ...

class AlignedSegment:
    query_name: str
    reference_start: int
    reference_end: int
    query_sequence: str
    query_qualities: array.array  # per-base Phred scores (unsigned bytes), or None

SAM/BAM/CRAM Files
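A minimal sketch of how the fetch() iterator might be used to summarize reads in a region. The function names and the `min_mapq` threshold are assumptions made for illustration. `is_unmapped`, `mapping_quality`, and `is_reverse` are AlignedSegment attributes.

```python
from types import SimpleNamespace


def summarize_reads(reads, min_mapq=20):
    """Count reads that are mapped and meet a mapping-quality threshold."""
    total = kept = reverse = 0
    for read in reads:
        total += 1
        if read.is_unmapped or read.mapping_quality < min_mapq:
            continue  # skip unmapped or low-confidence alignments
        kept += 1
        if read.is_reverse:
            reverse += 1
    return {"total": total, "kept": kept, "reverse": reverse}


def summarize_region(path, contig, start, stop, min_mapq=20):
    """Hypothetical wrapper: summarize reads overlapping one region."""
    import pysam  # imported lazily; the BAM file must be indexed

    with pysam.AlignmentFile(path, "rb") as bam:
        return summarize_reads(bam.fetch(contig, start, stop), min_mapq)


# Stand-in reads so the counting logic can be checked without a BAM file.
fake_reads = [
    SimpleNamespace(is_unmapped=False, mapping_quality=60, is_reverse=False),
    SimpleNamespace(is_unmapped=False, mapping_quality=5, is_reverse=False),
    SimpleNamespace(is_unmapped=True, mapping_quality=0, is_reverse=False),
    SimpleNamespace(is_unmapped=False, mapping_quality=30, is_reverse=True),
]
summary = summarize_reads(fake_reads)
```

Because `summarize_reads` only iterates once and keeps three counters, memory use stays constant regardless of region size.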

VCF/BCF Variant Files

Handle variant call format files with full header support, sample data access, and filtering capabilities.

class VariantFile:
    def __init__(self, filepath, mode="r", **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...

class VariantRecord:
    contig: str
    pos: int
    ref: str
    alts: tuple
    qual: float

VCF/BCF Files
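The filtering mentioned above could look roughly like the following sketch, which keeps single-nucleotide variants at or above a quality threshold. `is_snv`, `passing_snvs`, and the default `min_qual` are hypothetical. The code relies only on the `ref`, `alts`, and `qual` attributes of VariantRecord, and allows for `alts` and `qual` being None.

```python
from types import SimpleNamespace

BASES = {"A", "C", "G", "T"}


def is_snv(record):
    """True when REF and every ALT allele are single nucleotides."""
    alts = record.alts or ()
    return (
        record.ref.upper() in BASES
        and len(alts) > 0
        and all(alt.upper() in BASES for alt in alts)
    )


def passing_snvs(records, min_qual=20.0):
    """Lazily yield SNV records whose QUAL is present and high enough."""
    for record in records:
        if record.qual is not None and record.qual >= min_qual and is_snv(record):
            yield record


def snvs_in_region(path, contig, start, stop, min_qual=20.0):
    """Hypothetical wrapper around an indexed VCF/BCF file."""
    import pysam  # imported lazily; region queries need an index

    with pysam.VariantFile(path) as vcf:
        return [rec.pos for rec in passing_snvs(vcf.fetch(contig, start, stop), min_qual)]


# Stand-in records for a quick check without a VCF file.
fake_records = [
    SimpleNamespace(pos=101, ref="A", alts=("G",), qual=50.0),
    SimpleNamespace(pos=150, ref="AT", alts=("A",), qual=99.0),  # deletion
    SimpleNamespace(pos=200, ref="C", alts=("T",), qual=10.0),   # low quality
    SimpleNamespace(pos=250, ref="G", alts=None, qual=None),     # no ALT or QUAL
]
kept_positions = [rec.pos for rec in passing_snvs(fake_records)]
```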

FASTA/FASTQ Sequence Files

Access sequence files in two ways: FastaFile provides random access to indexed FASTA files, and FastxFile streams FASTA or FASTQ records sequentially.

class FastaFile:
    def __init__(self, filename): ...
    def fetch(self, reference, start=None, end=None): ...

class FastxFile:
    def __init__(self, filename, persist=True): ...
    def __iter__(self): ...

class FastxRecord:
    name: str
    sequence: str
    comment: str
    quality: str

FASTA/FASTQ Files
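To make the streaming case concrete, here is a small sketch that computes per-record GC content and mean base quality. The helper names are hypothetical, and the Phred offset of 33 is an assumption that matches Sanger-style FASTQ. `name`, `sequence`, and `quality` are FastxRecord attributes, and `quality` is None for FASTA input.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence (0.0 for empty input)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_phred(quality, offset=33):
    """Mean Phred score of an ASCII quality string, or None if absent."""
    if not quality:
        return None
    return sum(ord(char) - offset for char in quality) / len(quality)


def fastx_stats(path):
    """Hypothetical generator: stream records and yield simple statistics."""
    import pysam  # imported lazily so the helpers above work without it

    with pysam.FastxFile(path) as records:
        for entry in records:
            yield entry.name, gc_fraction(entry.sequence), mean_phred(entry.quality)


# Quick checks on plain strings.
gc = gc_fraction("GGCCAATT")
mean_q = mean_phred("II")
```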

Tabix-Indexed Files

Access bgzip-compressed, tabix-indexed genomic files with support for multiple formats (BED, GFF, GTF, VCF).

class TabixFile:
    def __init__(self, filename, parser=None): ...
    def fetch(self, reference, start=None, end=None, parser=None): ...

def tabix_index(filename, preset=None, **kwargs): ...
def tabix_compress(filename_in, filename_out, **kwargs): ...

Tabix Files
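When no parser is given, TabixFile.fetch() yields each matching line as a tab-separated string. The sketch below, with hypothetical helper names, parses BED-style lines and sums interval lengths. It assumes the first three columns are chromosome, start, and end, and it does not merge overlapping intervals.

```python
def parse_bed_line(line):
    """Split a BED-style line into (chrom, start, end)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def total_interval_length(lines):
    """Sum end - start over all lines (overlaps are counted twice)."""
    return sum(end - start for _, start, end in map(parse_bed_line, lines))


def interval_length_in_region(path, contig, start, end):
    """Hypothetical wrapper around a bgzip-compressed, tabix-indexed BED file."""
    import pysam  # imported lazily; needs the .tbi index next to the file

    with pysam.TabixFile(path) as tbx:
        return total_interval_length(tbx.fetch(contig, start, end))


# Stand-in lines shaped like the strings fetch() returns without a parser.
fake_lines = ["chr1\t100\t200\tfeatureA", "chr1\t150\t300"]
length = total_interval_length(fake_lines)
```

Passing a parser such as `pysam.asBed()` to fetch() would return proxy objects instead of strings, which avoids the manual split.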

Compressed Files (BGZF)

Handle block gzip compressed files commonly used in genomics.

class BGZFile:
    def __init__(self, filepath, mode): ...
    def read(self, size=-1): ...
    def write(self, data): ...
    def seek(self, offset, whence=0): ...

BGZF Files
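A minimal write-then-read sketch for the interface above. The `roundtrip` helper and its `opener` parameter are hypothetical: by default the helper uses pysam.BGZFile, while the demonstration passes the standard library's gzip.open as a stand-in so the code runs without pysam. gzip.open produces ordinary gzip output rather than BGZF blocks, so it only mimics the read/write calls.

```python
import gzip
import os
import tempfile


def roundtrip(path, payload, opener=None):
    """Write payload to path, then read it back through the same file class."""
    if opener is None:
        import pysam  # imported lazily; only needed for real BGZF output

        opener = pysam.BGZFile
    with opener(path, "wb") as out:
        out.write(payload)
    with opener(path, "rb") as src:
        return src.read()


# Stand-in run with gzip.open (assumption: same open/read/write call shape).
payload = b"chr1\t100\t200\n"
with tempfile.TemporaryDirectory() as tmp:
    result = roundtrip(os.path.join(tmp, "demo.gz"), payload, opener=gzip.open)
```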

Command-Line Tools Integration

Access samtools and bcftools functionality directly from Python with all subcommands available as functions.

def view(*args, **kwargs): ...
def sort(*args, **kwargs): ...
def index(*args, **kwargs): ...
def stats(*args, **kwargs): ...
def call(*args, **kwargs): ...  # bcftools only; available as pysam.bcftools.call
def merge(*args, **kwargs): ...

Command-Line Tools
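Each command function takes its options as separate string arguments, much like entries in argv. The sketch below builds those argument lists for a sort-then-index step. The helper names are hypothetical, and `-@` and `-o` are the samtools sort options for thread count and output file.

```python
def sort_and_index_args(input_bam, output_bam, threads=1):
    """Build argument lists for samtools sort and samtools index."""
    sort_args = ["-@", str(threads), "-o", output_bam, input_bam]
    index_args = [output_bam]
    return sort_args, index_args


def sort_and_index(input_bam, output_bam, threads=1):
    """Hypothetical wrapper that runs both commands through pysam."""
    import pysam  # imported lazily; the argument builder works without it

    sort_args, index_args = sort_and_index_args(input_bam, output_bam, threads)
    pysam.sort(*sort_args)
    pysam.index(*index_args)


sort_args, index_args = sort_and_index_args("in.bam", "out.bam", threads=4)
```

Keeping argument construction separate from execution makes the command line easy to log or inspect before anything touches disk.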

Utility Functions and Constants

Helper functions for quality score conversion, error handling, and genomic constants.

def qualitystring_to_array(s): ...
def array_to_qualitystring(a): ...

class SamtoolsError(Exception): ...

# CIGAR operations
CMATCH: int
CINS: int
CDEL: int
# SAM flags
FPAIRED: int
FUNMAP: int
FREVERSE: int

Utilities
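The following pure-Python sketch shows what the flag constants and quality helpers represent. The bit values are those fixed by the SAM specification, which pysam exposes as FPAIRED, FUNMAP, and FREVERSE. `decode_flag`, `phred_scores`, and `phred_string` are hypothetical stand-ins written for illustration, and the offset of 33 is assumed.

```python
# Bit values from the SAM specification (pysam.FPAIRED, pysam.FUNMAP, pysam.FREVERSE).
FLAGS = {"FPAIRED": 0x1, "FUNMAP": 0x4, "FREVERSE": 0x10}


def decode_flag(flag):
    """Return the names of the known flag bits set in a SAM FLAG value."""
    return [name for name, bit in FLAGS.items() if flag & bit]


def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def phred_string(scores, offset=33):
    """Convert integer Phred scores back to an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


set_bits = decode_flag(0x1 | 0x10)
scores = phred_scores("I5!")
```

In real code, `pysam.qualitystring_to_array` and `pysam.array_to_qualitystring` perform the same conversion and return or accept an array rather than a list.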

Error Handling

Pysam raises SamtoolsError when a samtools or bcftools command fails, and standard Python exceptions (such as IOError and ValueError) for file I/O and data access problems. Most file classes support the context-manager protocol (with statements), which closes the file handle on exit.
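One way to handle command failures is to wrap the call and report the error instead of letting it propagate. In this sketch, `run_command` and `index_bam` are hypothetical helpers. The exception type is passed in so the wrapper can be exercised with a stand-in command, and real use would pass pysam.SamtoolsError.

```python
def run_command(command, args, error_type=Exception):
    """Call command(*args); return (True, result) or (False, error message)."""
    try:
        return True, command(*args)
    except error_type as exc:
        return False, str(exc)


def index_bam(path):
    """Hypothetical wrapper: index a BAM file and report failure as a tuple."""
    import pysam  # imported lazily

    return run_command(pysam.index, [path], pysam.SamtoolsError)


def failing_command(*args):
    """Stand-in for a command that fails, used only for the check below."""
    raise RuntimeError("command failed")


ok, detail = run_command(failing_command, ["missing.bam"], RuntimeError)
```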

Performance Considerations

  • Use indexed files and fetch() with coordinates for random access instead of scanning whole files
  • Process large datasets by streaming through iterators rather than loading all records into memory
  • Use context managers so file handles are closed promptly
  • Use proxy classes for memory-efficient access to parsed data
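The streaming advice above can be sketched with a small batching generator that never holds more than `size` records at a time. The `batched` helper is hypothetical and works on any iterator, including the one returned by fetch().

```python
from itertools import islice


def batched(records, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# A range stands in for a record iterator here.
batches = list(batched(range(5), 2))
```

A loop such as `for batch in batched(bam.fetch("chr1"), 10000): ...` would then process a large file in fixed-size chunks.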