
tessl/pypi-pysam

Package for reading, manipulating, and writing genomic data

Pysam

A comprehensive Python wrapper for the HTSlib library that provides facilities for reading, manipulating, and writing genomic data sets in standard bioinformatics formats. Pysam supports SAM/BAM/CRAM for sequence alignments, VCF/BCF for variant calls, and FASTA/FASTQ for sequences, along with tabix-indexed files and compressed formats.

Package Information

  • Package Name: pysam
  • Language: Python
  • Installation: pip install pysam

Core Imports

import pysam

Common specific imports:

from pysam import AlignmentFile, VariantFile, FastaFile, TabixFile, BGZFile
from pysam import samtools, bcftools  # For command-line functions

Basic Usage

import pysam

# Reading SAM/BAM/CRAM alignment files
with pysam.AlignmentFile("example.bam", "rb") as samfile:
    for read in samfile.fetch("chr1", 1000, 2000):
        print(f"Read: {read.query_name}, Position: {read.reference_start}")

# Reading VCF/BCF variant files (fetch() by region needs a bgzip-compressed, indexed VCF or a BCF)
with pysam.VariantFile("example.vcf.gz") as vcffile:
    for record in vcffile.fetch("chr1", 1000, 2000):
        print(f"Variant at {record.pos}: {record.ref} -> {record.alts}")

# Reading FASTA files
with pysam.FastaFile("reference.fa") as fastafile:
    sequence = fastafile.fetch("chr1", 1000, 2000)
    print(f"Sequence: {sequence}")

# Command-line tool integration
pysam.sort("-o", "sorted.bam", "input.bam")
pysam.index("sorted.bam")

# BCFtools for variant processing (bcftools subcommands live in the pysam.bcftools module)
from pysam import bcftools
bcftools.call("-mv", "-o", "calls.vcf", "pileup.bcf")
bcftools.filter("-i", "QUAL>=20", "-o", "filtered.vcf", "calls.vcf")

Architecture

Pysam follows a modular architecture built around HTSlib's C API:

  • File Classes: High-level interfaces (AlignmentFile, VariantFile, FastaFile, TabixFile) that provide Pythonic access to genomic file formats
  • Record Classes: Data structures (AlignedSegment, VariantRecord, FastxRecord) representing individual entries with attribute access
  • Proxy Classes: Efficient access to parsed data without copying (GTFProxy, VCFProxy, BedProxy)
  • Iterator Classes: Different iteration patterns (row-wise over reads or records, column-wise over pileup positions) for accessing data
  • Command Integration: Direct access to samtools and bcftools command-line functionality

This design enables efficient processing of large genomic datasets while maintaining Python's ease of use.

Capabilities

SAM/BAM/CRAM Alignment Files

Read and write sequence alignment files with support for indexing, random access, and comprehensive metadata handling.

class AlignmentFile:
    def __init__(self, filepath, mode=None, **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...
    def pileup(self, contig=None, start=None, stop=None): ...

class AlignedSegment:
    query_name: str
    reference_start: int
    reference_end: int
    query_sequence: str
    query_qualities: array.array  # Phred scores as integers; None if absent
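As a rough illustration of how these read attributes are typically consumed, the sketch below tallies reads from any iterable of read-like objects. The helper name `summarize_reads` and the `min_mapq` threshold are hypothetical and not part of pysam; `is_unmapped`, `is_reverse`, and `mapping_quality` are AlignedSegment attributes that the summary above does not list.

```python
from collections import Counter


def summarize_reads(reads, min_mapq=20):
    """Tally reads by category (hypothetical helper, not part of pysam).

    `reads` is any iterable of objects exposing the AlignedSegment
    attributes `is_unmapped`, `is_reverse`, and `mapping_quality`.
    """
    counts = Counter()
    for read in reads:
        counts["total"] += 1
        if read.is_unmapped:
            counts["unmapped"] += 1
            continue
        if read.mapping_quality < min_mapq:
            counts["low_mapq"] += 1
            continue
        # Remaining reads are mapped with acceptable mapping quality.
        counts["reverse" if read.is_reverse else "forward"] += 1
    return dict(counts)
```

With pysam installed, the iterator returned by `samfile.fetch("chr1", 1000, 2000)` could be passed in directly as `reads`, so the helper does not need to know how the file was opened.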

SAM/BAM/CRAM Files

VCF/BCF Variant Files

Handle variant call format files with full header support, sample data access, and filtering capabilities.

class VariantFile:
    def __init__(self, filepath, mode="r", **kwargs): ...
    def fetch(self, contig=None, start=None, stop=None): ...

class VariantRecord:
    contig: str
    pos: int
    ref: str
    alts: tuple
    qual: float
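To show how the `ref` and `alts` fields might be used together, here is a minimal sketch that labels a variant by comparing allele lengths. The function name `classify_variant` and its label strings are hypothetical, and the rules are deliberately simplistic (breakend notation, for example, is not handled).

```python
def classify_variant(ref, alts):
    """Label a variant from its alleles (hypothetical helper, not part of pysam).

    `ref` and `alts` mirror VariantRecord.ref and VariantRecord.alts.
    Returns one of: "no_alt", "snv", "mnv", "indel", "symbolic", "mixed".
    """
    if not alts:
        # `alts` is empty or None when the record has no ALT allele.
        return "no_alt"
    kinds = set()
    for alt in alts:
        if alt.startswith("<") or alt == "*":
            kinds.add("symbolic")      # e.g. <DEL>, or the spanning-deletion allele
        elif len(alt) == len(ref) == 1:
            kinds.add("snv")           # single-base substitution
        elif len(alt) != len(ref):
            kinds.add("indel")         # length change: insertion or deletion
        else:
            kinds.add("mnv")           # same length, more than one base
    return kinds.pop() if len(kinds) == 1 else "mixed"
```

Inside a loop over `vcffile.fetch(...)`, this would be called as `classify_variant(record.ref, record.alts)`. Note that `record.pos` is 1-based, whereas the coordinates passed to `fetch()` are 0-based and half-open.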

VCF/BCF Files

FASTA/FASTQ Sequence Files

Access sequence files with both random access (FASTA with index) and streaming capabilities (FASTA/FASTQ).

class FastaFile:
    def __init__(self, filename): ...
    def fetch(self, reference, start=None, end=None): ...

class FastxFile:
    def __init__(self, filename, persist=True): ...
    def __iter__(self): ...

class FastxRecord:
    name: str
    sequence: str
    comment: str
    quality: str
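The record fields above are enough for simple streaming quality control. The sketch below is one possible approach, with hypothetical helper names; it assumes Phred+33 encoded quality strings and that `quality` is `None` or empty for FASTA records.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of an ASCII-encoded quality string, or None if absent."""
    if not quality:
        return None  # FASTA records carry no qualities
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_quality(records, min_mean_quality=20.0):
    """Lazily yield records whose mean quality meets the threshold.

    `records` is any iterable of objects with a `quality` attribute,
    such as the records produced by iterating over a FastxFile.
    """
    for record in records:
        score = mean_phred(record.quality)
        if score is not None and score >= min_mean_quality:
            yield record
```

Because `filter_by_quality` is a generator, it can wrap an open `pysam.FastxFile` without reading the whole file into memory.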

FASTA/FASTQ Files

Tabix-Indexed Files

Access compressed, indexed genomic files with support for multiple formats (BED, GFF, GTF, VCF).

class TabixFile:
    def __init__(self, filename, parser=None): ...
    def fetch(self, reference, start=None, end=None, parser=None): ...

def tabix_index(filename, preset=None, **kwargs): ...
def tabix_compress(filename_in, filename_out, **kwargs): ...
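When no parser is given, rows fetched from a tabix file arrive as tab-separated strings, so some parsing is usually needed. The sketch below shows a hand-rolled parser for BED rows; the `Interval` type and `parse_bed_line` function are hypothetical names used for illustration. pysam also ships ready-made parsers such as `pysam.asBed()`, which would normally be the simpler choice.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A BED interval (hypothetical type, not part of pysam)."""
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def parse_bed_line(line):
    """Parse one tab-separated BED row into an Interval."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED rows need at least 3 columns, got {len(fields)}")
    # The name column is optional in BED; default to an empty string.
    name = fields[3] if len(fields) > 3 else ""
    return Interval(fields[0], int(fields[1]), int(fields[2]), name)
```

A typical use would be `[parse_bed_line(row) for row in tbx.fetch("chr1", 1000, 2000)]`, where `tbx` is an open `TabixFile`.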

Tabix Files

Compressed Files (BGZF)

Handle block gzip compressed files commonly used in genomics.

class BGZFile:
    def __init__(self, filepath, mode): ...
    def read(self, size=-1): ...
    def write(self, data): ...
    def seek(self, offset, whence=0): ...

BGZF Files

Command-Line Tools Integration

Access samtools and bcftools functionality directly from Python with all subcommands available as functions.

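# samtools subcommands: in pysam.samtools and at the top level (e.g. pysam.sort)
def view(*args, **kwargs): ...
def sort(*args, **kwargs): ...
def index(*args, **kwargs): ...
def stats(*args, **kwargs): ...
def merge(*args, **kwargs): ...

# bcftools subcommands: in pysam.bcftools (e.g. pysam.bcftools.call)
def call(*args, **kwargs): ...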
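Because these functions take the same positional strings as the command line, it can help to build the argument list in one place. The sketch below is a hypothetical helper (`build_sort_args` is not part of pysam) that assembles arguments for `samtools sort` using its `-o`, `-@`, and `-n` options.

```python
def build_sort_args(input_path, output_path, extra_threads=0, by_name=False):
    """Build the argument list for `samtools sort` (hypothetical helper).

    Every argument is converted to a string, as on the command line.
    """
    args = ["-o", str(output_path)]
    if extra_threads > 0:
        args += ["-@", str(extra_threads)]  # additional sorting threads
    if by_name:
        args.append("-n")                   # sort by read name instead of coordinate
    args.append(str(input_path))
    return args
```

Assuming pysam is installed, the list would be unpacked into the call, for example `pysam.sort(*build_sort_args("input.bam", "sorted.bam"))`. A failing command is reported as a `SamtoolsError`, which can be caught with an ordinary `try`/`except`.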

Command-Line Tools

Utility Functions and Constants

Helper functions for quality score conversion, error handling, and genomic constants.

def qualitystring_to_array(s, offset=33): ...
def array_to_qualitystring(a, offset=33): ...

class SamtoolsError(Exception): ...

# CIGAR operations
CMATCH: int
CINS: int
CDEL: int
# SAM flags  
FPAIRED: int
FUNMAP: int
FREVERSE: int
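The constants are plain integers, so they combine with ordinary bitwise and comparison operators. The sketch below redefines a few of them locally, using the values from the SAM specification, so that it runs without pysam; in real code the `pysam.FPAIRED`, `pysam.CMATCH`, and related names would be used instead. The helper names are hypothetical.

```python
# Values from the SAM specification; pysam exposes the same numbers
# under these names (pysam.FPAIRED, pysam.CMATCH, and so on).
FPAIRED, FUNMAP, FREVERSE = 0x1, 0x4, 0x10
CMATCH, CINS, CDEL = 0, 1, 2


def decode_flag(flag):
    """Names of the three flag bits above that are set in `flag`."""
    names = {"FPAIRED": FPAIRED, "FUNMAP": FUNMAP, "FREVERSE": FREVERSE}
    return [name for name, bit in names.items() if flag & bit]


def reference_span(cigartuples):
    """Reference bases covered by a list of (operation, length) pairs.

    Only M and D are counted in this sketch; I consumes query bases only,
    and other reference-consuming operations (N, =, X) are ignored here.
    """
    return sum(length for op, length in cigartuples if op in (CMATCH, CDEL))
```

For a pysam read, `decode_flag(read.flag)` and `reference_span(read.cigartuples)` would be the corresponding calls, assuming the read is mapped and has a CIGAR.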

Utilities

Error Handling

Pysam uses SamtoolsError for command-line tool errors and standard Python exceptions for file I/O and data access issues. Most file operations support context managers for proper resource cleanup.

Performance Considerations

  • Use indexed files and fetch() with coordinates for random access instead of scanning the whole file
  • Process large datasets as a stream by iterating over records rather than loading them into a list
  • Use context managers so file handles are closed promptly
  • Use proxy classes for memory-efficient access to parsed data
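The streaming advice above can be sketched as a single-pass aggregation that keeps only a small dictionary in memory. The function name `count_starts_per_window` and the window size are hypothetical; the code assumes each item exposes `reference_start`, and that unmapped reads report it as negative or `None`.

```python
from collections import Counter


def count_starts_per_window(reads, window=1000):
    """Count read start positions per fixed-size window in one pass.

    Returns a dict mapping each window's start coordinate to a count.
    """
    counts = Counter()
    for read in reads:
        start = read.reference_start
        if start is None or start < 0:
            continue  # skip reads without a mapped position
        counts[start // window * window] += 1
    return dict(sorted(counts.items()))
```

With pysam, this would typically be called inside the `with pysam.AlignmentFile(...)` block, passing the iterator from `fetch()`, so the reads are consumed while the file is still open.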
Describes: pkg:pypi/pysam@0.23.x