tessl/pypi-scikit-allel

A Python package for exploring and analysing genetic variation data.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/scikit-allel@1.3.x

To install, run

npx @tessl/cli install tessl/pypi-scikit-allel@1.3.0

# Scikit-allel Knowledge Tile

## Overview

Scikit-allel is a comprehensive Python library for exploratory analysis of large-scale genetic variation data. It provides efficient data structures and algorithms for working with genomic datasets commonly found in population genetics and evolutionary biology research, including tools for genotype arrays, allele frequency calculations, genetic diversity metrics, and statistical analysis of population structure.

## Package Information

- **Name**: scikit-allel
- **Version**: 1.3.13
- **Type**: Python library
- **Language**: Python (with Cython extensions)
- **Installation**: `pip install scikit-allel`

## Core Imports

```python
import allel
import numpy as np

# Import specific modules for focused functionality
from allel import GenotypeArray, HaplotypeArray, AlleleCountsArray
from allel.stats import diversity, fst, ld, selection
from allel.io import vcf_read, gff
```

{ .api }

## Basic Usage

```python
import allel
import numpy as np

# Read a VCF file; read_vcf returns a dict of NumPy arrays keyed by field name
callset = allel.read_vcf('data.vcf', fields=['variants/*', 'samples', 'calldata/GT'])

# Create genotype array for analysis
g = allel.GenotypeArray(callset['calldata/GT'])

# Calculate basic statistics
ac = g.count_alleles()
diversity = allel.sequence_diversity(callset['variants/POS'], ac)

# Population structure analysis (assumes the file holds at least 20 samples).
# weir_cockerham_fst returns variance components a, b, c rather than Fst itself.
subpops = [list(range(10)), list(range(10, 20))]
a, b, c = allel.weir_cockerham_fst(g, subpops)
fst = np.sum(a) / (np.sum(a) + np.sum(b) + np.sum(c))
```

{ .api }

## Architecture

Scikit-allel is organized around several key concepts:

- **Data Structures**: Specialized array classes for genetic data (genotypes, haplotypes, allele counts)
- **Statistical Methods**: Population genetics statistics organized by analysis type
- **I/O Operations**: Reading and writing genetic data formats (VCF, GFF, FASTA)
- **Storage Backends**: Support for chunked storage (HDF5, Zarr) and distributed computing (Dask)

## Capabilities

### Genetic Data Structures

Core array classes for representing and manipulating genetic variation data, including genotype arrays, haplotype arrays, and allele count arrays with efficient operations and memory management.

Key functionality: `GenotypeArray`, `HaplotypeArray`, `AlleleCountsArray`, array creation functions, indexing classes (`SortedIndex`, `UniqueIndex`, `ChromPosIndex`), chunked storage backends (Dask, HDF5, Zarr), vector classes for single variants.
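
To make the array layout concrete, here is a plain NumPy sketch of the data model behind these classes: genotypes as a `(n_variants, n_samples, ploidy)` integer array with `-1` for missing calls, haplotypes as the same data flattened to two dimensions, and allele counts as a `(n_variants, n_alleles)` table. The helper `count_alleles` below is a hypothetical illustration written for this example, not the library's implementation.

```python
import numpy as np

# Toy genotypes (assumed data): 3 variants x 2 diploid samples; -1 marks a missing call.
gt = np.array([
    [[0, 0], [0, 1]],
    [[0, 2], [1, 1]],
    [[2, 2], [-1, -1]],
], dtype=np.int8)


def count_alleles(gt: np.ndarray) -> np.ndarray:
    """Count each allele per variant, skipping missing (-1) calls."""
    n_variants = gt.shape[0]
    n_alleles = int(gt.max()) + 1
    calls = gt.reshape(n_variants, -1)  # one column per sampled chromosome
    counts = np.zeros((n_variants, n_alleles), dtype=np.int32)
    for allele in range(n_alleles):
        counts[:, allele] = (calls == allele).sum(axis=1)
    return counts


# Haplotype view: samples and ploidy merged into a single axis
haplotypes = gt.reshape(gt.shape[0], -1)

# Heterozygous calls: both alleles present and different
is_het = (gt[..., 0] != gt[..., 1]) & (gt >= 0).all(axis=-1)

ac = count_alleles(gt)
print(haplotypes.shape, ac.tolist(), is_het.tolist())
```

In scikit-allel itself the equivalent steps would go through `GenotypeArray` methods such as `count_alleles()`, which return the wrapper classes listed above rather than bare NumPy arrays.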

[Data Structures Documentation](./data-structures.md)

### Population Genetics Statistics

Comprehensive statistical methods for analyzing genetic diversity, population structure, and natural selection, including windowed analyses and hypothesis tests commonly used in population genetics.

Key functionality: diversity metrics (π, θ, Tajima's D), FST calculations (Weir & Cockerham, Hudson, Patterson), linkage disequilibrium analysis, selection tests (iHS, XP-EHH, nSL), site frequency spectrum analysis, principal component analysis, admixture statistics (F2, F3, D), Hardy-Weinberg statistics (observed and expected heterozygosity, inbreeding coefficient), runs of homozygosity detection, and Mendelian inheritance analysis.
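
As a rough guide to what two of the diversity metrics measure, the sketch below computes nucleotide diversity (π) and Watterson's θ directly from an allele counts table using NumPy. The function names, the toy counts, and the 50-base region length are assumptions made for this example; it follows the textbook definitions and is not presented as the library's code.

```python
import numpy as np


def mean_pairwise_difference(ac):
    """Per-variant probability that two alleles drawn without replacement differ."""
    ac = np.asarray(ac, dtype=float)
    an = ac.sum(axis=1)                       # alleles called per variant
    n_pairs = an * (an - 1) / 2               # all pairs of alleles
    n_same = (ac * (ac - 1) / 2).sum(axis=1)  # pairs carrying the same allele
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(n_pairs > 0, (n_pairs - n_same) / n_pairs, np.nan)


def nucleotide_diversity(ac, n_bases):
    """pi: summed mean pairwise difference, per base of the region."""
    return np.nansum(mean_pairwise_difference(ac)) / n_bases


def watterson_theta(ac, n_bases):
    """theta_W: segregating sites scaled by the harmonic number a1, per base."""
    ac = np.asarray(ac)
    n = int(ac.sum(axis=1).max())  # number of chromosomes sampled
    n_segregating = int(((ac > 0).sum(axis=1) > 1).sum())
    a1 = sum(1.0 / i for i in range(1, n))
    return n_segregating / a1 / n_bases


# Toy allele counts (assumed data): 5 variants, 8 chromosomes, biallelic.
# The last variant is fixed for the reference allele, so it is not segregating.
ac = np.array([[5, 3], [3, 5], [6, 2], [4, 4], [8, 0]])

pi = nucleotide_diversity(ac, n_bases=50)
theta_w = watterson_theta(ac, n_bases=50)
print(pi, theta_w)
```

Tajima's D compares these two estimates: under neutrality both estimate the same quantity, so a large standardized difference between them is taken as a signal of selection or demographic change.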

[Statistics Documentation](./statistics.md)

### File I/O Operations

Reading and writing genetic data in standard formats with support for large files and selective data loading, enabling integration with existing bioinformatics pipelines.

Key functionality: VCF reading/writing, GFF parsing, FASTA processing, format conversion utilities, chunked I/O for large datasets.

[I/O Documentation](./io.md)

### Utilities and Storage

Supporting utilities for data management, caching, and memory optimization, plus chunked storage backends for large-scale data.

Key functionality: HDF5 caching decorator, array validation functions (`check_ndim`, `check_dtype`, `asarray_ndim`), error handling utilities, integration with the scientific Python ecosystem (NumPy, pandas, matplotlib).

[Utilities Documentation](./utilities.md)