A Python package for exploring and analysing genetic variation data.
npx @tessl/cli install tessl/pypi-scikit-allel@1.3.00
# Scikit-allel Knowledge Tile
## Overview
Scikit-allel is a comprehensive Python library for exploratory analysis of large-scale genetic variation data. It provides efficient data structures and algorithms for working with genomic datasets commonly found in population genetics and evolutionary biology research, including tools for genotype arrays, allele frequency calculations, genetic diversity metrics, and statistical analysis of population structure.
## Package Information
- **Name**: scikit-allel
- **Version**: 1.3.13
- **Type**: Python library
- **Language**: Python (with Cython extensions)
- **Installation**: `pip install scikit-allel`
## Core Imports
```python
import allel
import numpy as np

# Import specific modules for focused functionality
from allel import GenotypeArray, HaplotypeArray, AlleleCountsArray
from allel.stats import diversity, fst, ld, selection
from allel.io import vcf_read, gff
```
{ .api }
## Basic Usage
```python
import allel
import numpy as np

# Read the VCF file; read_vcf returns a dict keyed by field name
callset = allel.read_vcf('data.vcf', fields=['variants/*', 'samples', 'calldata/GT'])

# Create genotype array for analysis
g = allel.GenotypeArray(callset['calldata/GT'])

# Calculate basic statistics
ac = g.count_alleles()
diversity = allel.sequence_diversity(callset['variants/POS'], ac)

# Population structure analysis: weir_cockerham_fst returns the variance
# components (a, b, c); Fst is then estimated as a ratio of sums
subpops = [list(range(10)), list(range(10, 20))]
a, b, c = allel.weir_cockerham_fst(g, subpops)
fst = np.sum(a) / (np.sum(a) + np.sum(b) + np.sum(c))
```
{ .api }
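
The second argument to `weir_cockerham_fst` in the example above is a list of sample-index lists, one per population. In practice those indices usually come from sample metadata rather than hard-coded ranges. The following is a minimal sketch of building them in plain Python; the `build_subpops` helper, the `population_of` mapping, and the sample names are hypothetical and not part of scikit-allel.

```python
def build_subpops(samples, population_of):
    """Group sample column indices by population label.

    `samples` is the ordered list of sample names (column order of the
    genotype data); `population_of` maps each sample name to a label.
    Both are illustrative assumptions, not scikit-allel objects.
    """
    subpops = {}
    for index, sample in enumerate(samples):
        label = population_of[sample]
        subpops.setdefault(label, []).append(index)
    return subpops


# Hypothetical sample names and population labels
samples = ["s1", "s2", "s3", "s4", "s5"]
population_of = {"s1": "east", "s2": "west", "s3": "east", "s4": "west", "s5": "east"}

subpops = build_subpops(samples, population_of)

# A list of index lists, in the shape the Fst call above expects
subpop_list = [subpops["east"], subpops["west"]]
```

Keeping the mapping keyed by label makes it straightforward to pick any pair of populations for a pairwise comparison.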
## Architecture
Scikit-allel is organized around several key concepts:
- **Data Structures**: Specialized array classes for genetic data (genotypes, haplotypes, allele counts)
- **Statistical Methods**: Population genetics statistics organized by analysis type
- **I/O Operations**: Reading and writing genetic data formats (VCF, GFF, FASTA)
- **Storage Backends**: Support for chunked storage (HDF5, Zarr) and distributed computing (Dask)
## Capabilities
### Genetic Data Structures
Core array classes for representing and manipulating genetic variation data, including genotype arrays, haplotype arrays, and allele count arrays with efficient operations and memory management.

Key functionality: `GenotypeArray`, `HaplotypeArray`, `AlleleCountsArray`, array creation functions, indexing classes (`SortedIndex`, `UniqueIndex`, `ChromPosIndex`), chunked storage backends (Dask, HDF5, Zarr), vector classes for single variants.

[Data Structures Documentation](./data-structures.md)
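
To make the data layout concrete, here is a dependency-free sketch of the three representations named above, using nested Python lists in place of the library's NumPy-backed classes. Genotypes are indexed as variants × samples × ploidy, with `-1` marking a missing allele call. The helper names mirror the roles of `GenotypeArray.count_alleles` and `GenotypeArray.to_haplotypes` but are illustrative only; the toy data is made up.

```python
# Hypothetical toy data: 3 variants x 2 diploid samples.
# Allele 0 is the reference, 1 and above are alternates, -1 is a missing call.
genotypes = [
    [[0, 0], [0, 1]],
    [[0, 1], [1, 1]],
    [[0, 2], [-1, -1]],
]


def count_alleles(genotypes, n_alleles):
    """Return one row per variant with the count of each allele index."""
    counts = []
    for variant in genotypes:
        row = [0] * n_alleles
        for call in variant:
            for allele in call:
                if allele >= 0:  # skip missing calls
                    row[allele] += 1
        counts.append(row)
    return counts


def to_haplotypes(genotypes):
    """Flatten the sample and ploidy dimensions into one haplotype axis."""
    return [[allele for call in variant for allele in call] for variant in genotypes]


allele_counts = count_alleles(genotypes, n_alleles=3)
haplotypes = to_haplotypes(genotypes)
```

The allele counts table has shape variants × alleles, and the haplotype table has shape variants × (samples × ploidy); these are the shapes that the statistics functions generally operate on.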
### Population Genetics Statistics
Comprehensive statistical methods for analyzing genetic diversity, population structure, and natural selection, including windowed analyses and hypothesis tests commonly used in population genetics.

Key functionality: diversity metrics (π, Watterson's θ, Tajima's D), FST calculations (Weir & Cockerham, Hudson, Patterson), linkage disequilibrium analysis, selection tests (iHS, XP-EHH, nSL), site frequency spectrum analysis, principal component analysis, admixture statistics (F2, F3, D), Hardy-Weinberg equilibrium tests, runs of homozygosity detection, and Mendelian inheritance analysis.

[Statistics Documentation](./statistics.md)
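
As an illustration of what a diversity metric such as π measures, the sketch below computes nucleotide diversity from allele counts in plain Python: for each variant, the fraction of allele pairs that differ, summed over variants and divided by the number of bases in the region. This is a simplified sketch of the statistic, not the library's implementation; the function names `site_mpd` and `pi_per_base`, the `n_bases` argument, and the example counts are assumptions made for illustration, and sites with fewer than two called alleles are treated as contributing zero.

```python
from math import comb


def site_mpd(site_counts):
    """Mean pairwise difference at one site from its allele counts."""
    n = sum(site_counts)
    if n < 2:
        return 0.0  # simplification: too few calls to form a pair
    n_pairs = comb(n, 2)
    n_same = sum(comb(c, 2) for c in site_counts)
    return (n_pairs - n_same) / n_pairs


def pi_per_base(allele_counts, n_bases):
    """Sum of per-site mean pairwise differences, per base of sequence."""
    return sum(site_mpd(counts) for counts in allele_counts) / n_bases


# Hypothetical counts for 3 variants across 4 chromosomes in a 10 bp region
allele_counts = [[3, 1], [2, 2], [4, 0]]
pi = pi_per_base(allele_counts, n_bases=10)
```

Dividing by the region length rather than by the number of variants is what makes π comparable between regions with different variant densities.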
### File I/O Operations
Reading and writing genetic data in standard formats with support for large files and selective data loading, enabling integration with existing bioinformatics pipelines.

Key functionality: VCF reading/writing, GFF parsing, FASTA processing, format conversion utilities, chunked I/O for large datasets.

[I/O Documentation](./io.md)
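
To show how VCF text maps onto the field names used when loading data (`samples`, `variants/POS`, `calldata/GT`), here is a deliberately small, stdlib-only sketch that parses genotype calls from VCF text into nested lists. It is not the library's parser and handles only the `GT` field; the function name `parse_vcf_text` and the inline VCF content are hypothetical.

```python
import io


def parse_vcf_text(text):
    """Parse samples, positions and GT calls from VCF text (sketch only)."""
    samples, chrom, pos, gt = [], [], [], []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        gt_idx = fields[8].split(":").index("GT")  # locate GT in FORMAT
        calls = []
        for sample_field in fields[9:]:
            raw = sample_field.split(":")[gt_idx]
            alleles = raw.replace("|", "/").split("/")
            calls.append([-1 if a == "." else int(a) for a in alleles])
        chrom.append(fields[0])
        pos.append(int(fields[1]))
        gt.append(calls)
    return {"samples": samples, "variants/CHROM": chrom,
            "variants/POS": pos, "calldata/GT": gt}


# Hypothetical two-variant, two-sample VCF
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "s1", "s2"]),
    "\t".join(["1", "100", ".", "A", "T", ".", "PASS", ".", "GT:DP",
               "0/1:12", "1|1:9"]),
    "\t".join(["1", "250", ".", "G", "C", ".", "PASS", ".", "GT",
               "./.", "0/0"]),
])
callset = parse_vcf_text(VCF_TEXT)
```

Phased (`|`) and unphased (`/`) calls are treated identically here, and missing alleles become `-1`, matching the missing-value convention used for genotype data.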
### Utilities and Storage
Supporting utilities for data management, caching, memory optimization, and integration with the scientific Python ecosystem, plus chunked storage backends for large-scale data.

Key functionality: HDF5 caching decorator, array validation functions (`check_ndim`, `check_dtype`, `asarray_ndim`), error handling utilities, integration with the scientific Python ecosystem (NumPy, pandas, matplotlib).

[Utilities Documentation](./utilities.md)
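
The idea behind chunked storage is that a computation runs over fixed-size blocks of variants so that only one block needs to be in memory at a time. The stdlib-only sketch below illustrates that pattern for allele counting; it is not how the library's backends are implemented, and the helper names, the `blen` (block length) parameter name, and the toy data are assumptions for illustration.

```python
def iter_blocks(rows, blen):
    """Yield consecutive blocks of at most `blen` rows."""
    for start in range(0, len(rows), blen):
        yield rows[start:start + blen]


def count_alleles_block(block, n_alleles):
    """Count alleles per variant within one block, ignoring missing (-1)."""
    counts = []
    for variant in block:
        row = [0] * n_alleles
        for call in variant:
            for allele in call:
                if allele >= 0:
                    row[allele] += 1
        counts.append(row)
    return counts


def count_alleles_chunked(genotypes, n_alleles, blen):
    """Process the data block by block and concatenate the results."""
    counts = []
    for block in iter_blocks(genotypes, blen):
        counts.extend(count_alleles_block(block, n_alleles))
    return counts


# Hypothetical toy data: 5 variants x 2 diploid samples
genotypes = [
    [[0, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[0, 0], [0, 0]],
    [[-1, -1], [1, 1]],
    [[0, 1], [0, 1]],
]
block_sizes = [len(block) for block in iter_blocks(genotypes, blen=2)]
chunked_counts = count_alleles_chunked(genotypes, n_alleles=2, blen=2)
```

Because each variant is counted independently, the block length affects memory use but not the result.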