A Python package for exploring and analysing genetic variation data.
npx @tessl/cli install tessl/pypi-scikit-allel@1.3.0
Scikit-allel is a comprehensive Python library for exploratory analysis of large-scale genetic variation data. It provides efficient data structures and algorithms for working with genomic datasets commonly found in population genetics and evolutionary biology research, including tools for genotype arrays, allele frequency calculations, genetic diversity metrics, and statistical analysis of population structure.
pip install scikit-allel
import allel
import numpy as np
# Import specific modules for focused functionality
from allel import GenotypeArray, HaplotypeArray, AlleleCountsArray
from allel.stats import diversity, fst, ld, selection
from allel.io import vcf_read, gff
import allel
import numpy as np
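# Read a VCF file; read_vcf returns a dict of NumPy arrays keyed by field name
callset = allel.read_vcf('data.vcf', fields=['variants/*', 'samples', 'calldata/GT'])
# Create genotype array for analysis
g = allel.GenotypeArray(callset['calldata/GT'])
# Calculate basic statistics
ac = g.count_alleles()
# Nucleotide diversity; assumes positions are sorted and come from a single chromosome
pi = allel.sequence_diversity(callset['variants/POS'], ac)
# Population structure analysis: two subpopulations given as sample indices
# (assumes the file contains at least 20 samples)
subpops = [list(range(10)), list(range(10, 20))]
a, b, c = allel.weir_cockerham_fst(g, subpops)
# Combine the variance components into a single FST estimate over all variants
fst = np.sum(a) / (np.sum(a) + np.sum(b) + np.sum(c))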
Scikit-allel is organized around several key concepts:
Core array classes for representing and manipulating genetic variation data, including genotype arrays, haplotype arrays, and allele count arrays with efficient operations and memory management.
Key functionality: GenotypeArray, HaplotypeArray, AlleleCountsArray, array creation functions, indexing classes (SortedIndex, UniqueIndex, ChromPosIndex), chunked storage backends (Dask, HDF5, Zarr), vector classes for single variants.
Comprehensive statistical methods for analyzing genetic diversity, population structure, and natural selection, including windowed analyses and hypothesis tests commonly used in population genetics.
Key functionality: diversity metrics (π, θ, Tajima's D), FST calculations (Weir & Cockerham, Hudson, Patterson), linkage disequilibrium analysis, selection tests (iHS, XP-EHH, nSL), site frequency spectrum analysis, principal component analysis, admixture statistics (F2, F3, D), Hardy-Weinberg equilibrium tests, runs of homozygosity detection, and Mendelian inheritance analysis.
Reading and writing genetic data in standard formats with support for large files and selective data loading, enabling integration with existing bioinformatics pipelines.
Key functionality: VCF reading/writing, GFF parsing, FASTA processing, format conversion utilities, chunked I/O for large datasets.
Supporting utilities for data management, caching, memory optimization, and integration with the scientific Python ecosystem, plus chunked storage backends for large-scale data.
Key functionality: HDF5 caching decorator, array validation functions (check_ndim, check_dtype, asarray_ndim), error handling utilities, integration with the scientific Python ecosystem (NumPy, pandas, matplotlib).