0
# Statistics
1
2
Scikit-allel provides comprehensive statistical methods for population genetics analysis, including diversity metrics, population structure analysis, linkage disequilibrium, and tests for natural selection.
3
4
## Diversity Analysis
5
6
### Sequence Diversity
7
8
Calculate nucleotide diversity (π) and related metrics.
9
10
```python
11
import allel
12
13
# Basic sequence diversity
14
diversity = allel.sequence_diversity(positions, allele_counts, start=None, stop=None)
15
16
# Mean pairwise differences
17
mpd = allel.mean_pairwise_difference(allele_counts)
18
19
# Between-population differences
20
div = allel.mean_pairwise_difference_between(ac1, ac2)
21
```
22
{ .api }
23
24
**Key Functions:**
25
26
```python
27
def sequence_diversity(pos, ac, start=None, stop=None, is_accessible=None):
28
"""
29
Calculate sequence diversity (π) over a genomic region.
30
31
Args:
32
pos (array_like): Variant positions
33
ac (AlleleCountsArray): Allele counts per variant
34
start (int, optional): Start position for calculation
35
stop (int, optional): Stop position for calculation
36
is_accessible (array_like, optional): Boolean array of accessible sites
37
38
Returns:
39
float: Sequence diversity value
40
"""
41
42
def mean_pairwise_difference(ac, an=None):
43
"""
44
Calculate mean pairwise nucleotide differences.
45
46
Args:
47
ac (AlleleCountsArray): Allele counts per variant
48
an (array_like, optional): Total allele numbers per variant
49
50
Returns:
51
ndarray: Mean pairwise differences per variant
52
"""
53
54
def windowed_diversity(pos, ac, windows, is_accessible=None):
55
"""
56
Calculate diversity in genomic windows.
57
58
Args:
59
pos (array_like): Variant positions
60
ac (AlleleCountsArray): Allele counts
61
windows (array_like): Window coordinates as (start, stop) pairs
62
is_accessible (array_like, optional): Accessible sites mask
63
64
Returns:
65
ndarray: Diversity values per window
66
"""
67
```
68
{ .api }
69
70
### Sequence Divergence
71
72
Calculate divergence between populations or sequences.
73
74
```python
75
# Between-population divergence
76
divergence = allel.sequence_divergence(positions, ac_pop1, ac_pop2, start=1000, stop=2000)
77
78
# Windowed divergence analysis
79
div_windows = allel.windowed_divergence(positions, ac_pop1, ac_pop2, windows)
80
81
# Windowed differentiation
82
df_windows = allel.windowed_df(positions, ac_pop1, ac_pop2, windows)
83
```
84
{ .api }
85
86
```python
87
def sequence_divergence(pos, ac1, ac2, start=None, stop=None, is_accessible=None):
88
"""
89
Calculate sequence divergence between two populations.
90
91
Args:
92
pos (array_like): Variant positions
93
ac1 (AlleleCountsArray): Allele counts for population 1
94
ac2 (AlleleCountsArray): Allele counts for population 2
95
start (int, optional): Start position
96
stop (int, optional): Stop position
97
is_accessible (array_like, optional): Accessible sites mask
98
99
Returns:
100
float: Sequence divergence value
101
"""
102
103
def windowed_divergence(pos, ac1, ac2, windows, is_accessible=None):
104
"""
105
Calculate divergence between populations in genomic windows.
106
107
Args:
108
pos (array_like): Variant positions
109
ac1 (AlleleCountsArray): Allele counts for population 1
110
ac2 (AlleleCountsArray): Allele counts for population 2
111
windows (array_like): Window coordinates
112
is_accessible (array_like, optional): Accessible sites mask
113
114
Returns:
115
ndarray: Divergence values per window
116
"""
117
118
def windowed_df(pos, ac1, ac2, windows, is_accessible=None):
119
"""
120
Calculate windowed differentiation (Df) between populations.
121
122
Args:
123
pos (array_like): Variant positions
124
ac1 (AlleleCountsArray): Allele counts for population 1
125
ac2 (AlleleCountsArray): Allele counts for population 2
126
windows (array_like): Window coordinates
127
is_accessible (array_like, optional): Accessible sites mask
128
129
Returns:
130
ndarray: Df values per window
131
"""
132
```
133
{ .api }
134
135
### Watterson's Theta
136
137
Estimate population mutation rate using Watterson's method.
138
139
```python
140
# Single value
141
theta_w = allel.watterson_theta(allele_counts)
142
143
# Windowed analysis
144
theta_windows = allel.windowed_watterson_theta(positions, allele_counts, windows)
145
```
146
{ .api }
147
148
```python
149
def watterson_theta(ac, an=None):
150
"""
151
Calculate Watterson's estimator of theta (population mutation rate).
152
153
Args:
154
ac (AlleleCountsArray): Allele counts per variant
155
an (array_like, optional): Total allele numbers per variant
156
157
Returns:
158
float: Watterson's theta estimate
159
"""
160
161
def windowed_watterson_theta(pos, ac, windows, is_accessible=None):
162
"""
163
Calculate Watterson's theta in genomic windows.
164
165
Args:
166
pos (array_like): Variant positions
167
ac (AlleleCountsArray): Allele counts
168
windows (array_like): Window coordinates
169
is_accessible (array_like, optional): Accessible sites mask
170
171
Returns:
172
ndarray: Theta values per window
173
"""
174
```
175
{ .api }
176
177
### Tajima's D
178
179
Test for departure from neutral evolution.
180
181
```python
182
# Single genomic region
183
tajima_d = allel.tajima_d(allele_counts, positions, start=1000, stop=2000)
184
185
# Windowed analysis
186
d_windows = allel.windowed_tajima_d(positions, allele_counts, windows)
187
188
# Moving window
189
d_moving = allel.moving_tajima_d(positions, allele_counts, size=1000, step=500)
190
```
191
{ .api }
192
193
```python
194
def tajima_d(ac, pos=None, start=None, stop=None, is_accessible=None):
195
"""
196
Calculate Tajima's D test statistic.
197
198
Args:
199
ac (AlleleCountsArray): Allele counts per variant
200
pos (array_like, optional): Variant positions
201
start (int, optional): Start position
202
stop (int, optional): Stop position
203
is_accessible (array_like, optional): Accessible sites mask
204
205
Returns:
206
float: Tajima's D value
207
"""
208
209
def windowed_tajima_d(pos, ac, windows, is_accessible=None):
210
"""
211
Calculate Tajima's D in genomic windows.
212
213
Args:
214
pos (array_like): Variant positions
215
ac (AlleleCountsArray): Allele counts
216
windows (array_like): Window coordinates
217
is_accessible (array_like, optional): Accessible sites mask
218
219
Returns:
220
ndarray: Tajima's D values per window
221
"""
222
223
def moving_tajima_d(pos, ac, size, start=None, stop=None, step=None, is_accessible=None):
224
"""
225
Calculate Tajima's D in sliding windows.
226
227
Args:
228
pos (array_like): Variant positions
229
ac (AlleleCountsArray): Allele counts
230
size (int): Window size in base pairs
231
start (int, optional): Start position
232
stop (int, optional): Stop position
233
step (int, optional): Step size between windows
234
is_accessible (array_like, optional): Accessible sites mask
235
236
Returns:
237
tuple: (window_centers, tajima_d_values)
238
"""
239
```
240
{ .api }
241
242
## Population Structure (FST)
243
244
### FST Estimators
245
246
Calculate fixation indices between populations.
247
248
```python
249
# Weir & Cockerham FST
250
fst_wc = allel.weir_cockerham_fst(genotypes, subpops)
251
252
# Hudson's FST
253
fst_hudson = allel.hudson_fst(allele_counts, subpops)
254
255
# Patterson's FST (for two populations)
256
fst_patterson = allel.patterson_fst(ac_pop1, ac_pop2)
257
```
258
{ .api }
259
260
**Key Functions:**
261
262
```python
263
def weir_cockerham_fst(g, subpops, max_allele=None):
264
"""
265
Calculate Weir & Cockerham FST between subpopulations.
266
267
Args:
268
g (GenotypeArray): Genotype data
269
subpops (list): List of arrays containing sample indices for each subpopulation
270
max_allele (int, optional): Maximum allele number to consider
271
272
Returns:
273
tuple: (a, b, c) variance components for FST calculation
274
"""
275
276
def hudson_fst(ac, subpops):
277
"""
278
Calculate Hudson's FST estimator.
279
280
Args:
281
ac (AlleleCountsArray): Allele counts per variant
282
subpops (list): Subpopulation definitions
283
284
Returns:
285
tuple: (num, den) numerator and denominator for FST calculation
286
"""
287
288
def average_weir_cockerham_fst(g, subpops, max_allele=None):
289
"""
290
Calculate average FST across all variants.
291
292
Args:
293
g (GenotypeArray): Genotype data
294
subpops (list): Subpopulation sample indices
295
max_allele (int, optional): Maximum allele number
296
297
Returns:
298
float: Average FST value
299
"""
300
301
def windowed_weir_cockerham_fst(pos, g, subpops, windows, max_allele=None):
302
"""
303
Calculate FST in genomic windows.
304
305
Args:
306
pos (array_like): Variant positions
307
g (GenotypeArray): Genotype data
308
subpops (list): Subpopulation definitions
309
windows (array_like): Window coordinates
310
max_allele (int, optional): Maximum allele number
311
312
Returns:
313
ndarray: FST values per window
314
"""
315
```
316
{ .api }
317
318
### Additional FST Methods
319
320
Advanced FST calculations for specific analysis needs.
321
322
```python
323
# Blockwise FST calculations (for bootstrapping)
324
fst_blocks_wc = allel.blockwise_weir_cockerham_fst(genotypes, subpops, blocks)
325
fst_blocks_hudson = allel.blockwise_hudson_fst(ac, subpops, blocks)
326
fst_blocks_patterson = allel.blockwise_patterson_fst(ac1, ac2, blocks)
327
328
# Moving window FST
329
fst_moving_hudson = allel.moving_hudson_fst(positions, ac, subpops, size=10000)
330
fst_moving_patterson = allel.moving_patterson_fst(positions, ac1, ac2, size=10000)
331
fst_moving_wc = allel.moving_weir_cockerham_fst(positions, genotypes, subpops, size=10000)
332
333
# Average FST calculations
334
avg_hudson = allel.average_hudson_fst(ac, subpops)
335
avg_patterson = allel.average_patterson_fst(ac1, ac2)
336
```
337
{ .api }
338
339
```python
340
def blockwise_weir_cockerham_fst(g, subpops, blocks, max_allele=None):
341
"""
342
Calculate Weir & Cockerham FST for genomic blocks.
343
344
Args:
345
g (GenotypeArray): Genotype data
346
subpops (list): Subpopulation sample indices
347
blocks (array_like): Block definitions
348
max_allele (int, optional): Maximum allele number
349
350
Returns:
351
ndarray: FST values per block
352
"""
353
354
def blockwise_hudson_fst(ac, subpops, blocks):
355
"""
356
Calculate Hudson's FST for genomic blocks.
357
358
Args:
359
ac (AlleleCountsArray): Allele counts
360
subpops (list): Subpopulation definitions
361
blocks (array_like): Block definitions
362
363
Returns:
364
ndarray: FST values per block
365
"""
366
367
def blockwise_patterson_fst(ac1, ac2, blocks):
368
"""
369
Calculate Patterson's FST for genomic blocks.
370
371
Args:
372
ac1 (AlleleCountsArray): Allele counts for population 1
373
ac2 (AlleleCountsArray): Allele counts for population 2
374
blocks (array_like): Block definitions
375
376
Returns:
377
ndarray: FST values per block
378
"""
379
380
def moving_hudson_fst(pos, ac, subpops, size, start=None, stop=None, step=None):
381
"""
382
Calculate Hudson's FST in sliding windows.
383
384
Args:
385
pos (array_like): Variant positions
386
ac (AlleleCountsArray): Allele counts
387
subpops (list): Subpopulation definitions
388
size (int): Window size in base pairs
389
start (int, optional): Start position
390
stop (int, optional): Stop position
391
step (int, optional): Step size between windows
392
393
Returns:
394
tuple: (window_centers, fst_values)
395
"""
396
397
def moving_patterson_fst(pos, ac1, ac2, size, start=None, stop=None, step=None):
398
"""
399
Calculate Patterson's FST in sliding windows.
400
401
Args:
402
pos (array_like): Variant positions
403
ac1 (AlleleCountsArray): Population 1 allele counts
404
ac2 (AlleleCountsArray): Population 2 allele counts
405
size (int): Window size in base pairs
406
start (int, optional): Start position
407
stop (int, optional): Stop position
408
step (int, optional): Step size between windows
409
410
Returns:
411
tuple: (window_centers, fst_values)
412
"""
413
414
def moving_weir_cockerham_fst(pos, g, subpops, size, start=None, stop=None, step=None, max_allele=None):
415
"""
416
Calculate Weir & Cockerham FST in sliding windows.
417
418
Args:
419
pos (array_like): Variant positions
420
g (GenotypeArray): Genotype data
421
subpops (list): Subpopulation sample indices
422
size (int): Window size in base pairs
423
start (int, optional): Start position
424
stop (int, optional): Stop position
425
step (int, optional): Step size between windows
426
max_allele (int, optional): Maximum allele number
427
428
Returns:
429
tuple: (window_centers, fst_values)
430
"""
431
432
def average_hudson_fst(ac, subpops):
433
"""
434
Calculate average Hudson's FST across all variants.
435
436
Args:
437
ac (AlleleCountsArray): Allele counts
438
subpops (list): Subpopulation definitions
439
440
Returns:
441
float: Average FST value
442
"""
443
444
def average_patterson_fst(ac1, ac2):
445
"""
446
Calculate average Patterson's FST across all variants.
447
448
Args:
449
ac1 (AlleleCountsArray): Population 1 allele counts
450
ac2 (AlleleCountsArray): Population 2 allele counts
451
452
Returns:
453
float: Average FST value
454
"""
455
```
456
{ .api }
457
458
## Linkage Disequilibrium
459
460
### LD Metrics
461
462
Measure linkage disequilibrium between variants.
463
464
```python
465
# Rogers & Huff r correlation
466
r = allel.rogers_huff_r(genotype_matrix)
467
468
# Between two sets of variants
469
r_between = allel.rogers_huff_r_between(gn_a, gn_b)
470
471
# Windowed r-squared
472
r2_windows = allel.windowed_r_squared(positions, genotype_matrix, size=10000)
473
474
# Visualize LD patterns
475
allel.plot_pairwise_ld(gn_subset, positions_subset)
476
```
477
{ .api }
478
479
```python
480
def rogers_huff_r(gn):
481
"""
482
Calculate Rogers & Huff r correlation coefficient matrix.
483
484
Args:
485
gn (array_like): Genotype matrix (n_variants, n_samples)
486
487
Returns:
488
ndarray: Correlation matrix between all pairs of variants
489
"""
490
491
def rogers_huff_r_between(gn1, gn2):
492
"""
493
Calculate Rogers & Huff r correlation between two sets of variants.
494
495
Args:
496
gn1 (array_like): First set of genotype data (n_variants1, n_samples)
497
gn2 (array_like): Second set of genotype data (n_variants2, n_samples)
498
499
Returns:
500
ndarray: Correlation matrix between variants in the two sets
501
"""
502
503
def windowed_r_squared(pos, gn, size, start=None, stop=None, step=None):
504
"""
505
Calculate r² in sliding windows.
506
507
Args:
508
pos (array_like): Variant positions
509
gn (array_like): Genotype matrix
510
size (int): Window size in base pairs
511
start (int, optional): Start position
512
stop (int, optional): Stop position
513
step (int, optional): Step size between windows
514
515
Returns:
516
tuple: (window_centers, mean_r2_values)
517
"""
518
519
def locate_unlinked(gn, size, step, threshold):
520
"""
521
Identify unlinked variants based on LD threshold.
522
523
Args:
524
gn (array_like): Genotype matrix
525
size (int): Window size for LD calculation
526
step (int): Step size between windows
527
threshold (float): LD threshold for considering variants unlinked
528
529
Returns:
530
ndarray: Indices of unlinked variants
531
"""
532
533
def plot_pairwise_ld(gn, pos, start=None, stop=None, ax=None):
534
"""
535
Plot pairwise linkage disequilibrium patterns.
536
537
Args:
538
gn (array_like): Genotype matrix
539
pos (array_like): Variant positions
540
start (int, optional): Start position for plotting
541
stop (int, optional): Stop position for plotting
542
ax (matplotlib axis, optional): Axis to plot on
543
544
Returns:
545
matplotlib figure: LD heatmap plot
546
"""
547
```
548
{ .api }
549
550
## Principal Component Analysis
551
552
### PCA Methods
553
554
Dimensionality reduction for population structure analysis.
555
556
```python
557
# Standard PCA
558
coords, model = allel.pca(genotype_matrix, n_components=10)
559
560
# Randomized PCA for large datasets
561
coords, model = allel.randomized_pca(genotype_matrix, n_components=10)
562
563
# Use different scalers
564
coords, model = allel.pca(genotype_matrix, scaler='patterson')
565
```
566
{ .api }
567
568
```python
569
def pca(gn, n_components=10, scaler=None, copy=True, random_state=None):
570
"""
571
Perform principal component analysis on genotype data.
572
573
Args:
574
gn (array_like): Genotype matrix (n_variants, n_samples)
575
n_components (int): Number of principal components to compute
576
scaler (str or object, optional): Scaling method ('standard', 'patterson', or scaler object)
577
copy (bool): Whether to copy input data
578
random_state (int, optional): Random seed for reproducibility
579
580
Returns:
581
tuple: (coordinates, fitted_model) where coordinates are sample projections
582
"""
583
584
def randomized_pca(gn, n_components=10, scaler=None, copy=True, random_state=None):
585
"""
586
Randomized PCA for large datasets.
587
588
Args:
589
gn (array_like): Genotype matrix
590
n_components (int): Number of components
591
scaler (str or object, optional): Scaling method
592
copy (bool): Whether to copy data
593
random_state (int, optional): Random seed
594
595
Returns:
596
tuple: (coordinates, fitted_model)
597
"""
598
```
599
{ .api }
600
601
## Data Preprocessing
602
603
### Scaling Methods
604
605
Standardize genotype data for analysis.
606
607
```python
608
# Standard scaling (mean=0, std=1)
609
scaler = allel.StandardScaler()
610
gn_scaled = scaler.fit_transform(genotype_matrix)
611
612
# Center scaling (mean=0)
613
scaler = allel.CenterScaler()
614
gn_centered = scaler.fit_transform(genotype_matrix)
615
616
# Patterson scaling (for PCA)
617
scaler = allel.PattersonScaler()
618
gn_patterson = scaler.fit_transform(genotype_matrix)
619
620
# Get scaler by name
621
scaler = allel.get_scaler('patterson')
622
```
623
{ .api }
624
625
```python
626
class StandardScaler:
627
"""
628
Standardize genotype data to have zero mean and unit variance.
629
"""
630
631
def fit(self, X):
632
"""Compute mean and standard deviation for later scaling."""
633
634
def transform(self, X):
635
"""Scale data using computed statistics."""
636
637
def fit_transform(self, X):
638
"""Fit scaler and transform data in one step."""
639
640
class CenterScaler:
641
"""
642
Center genotype data to have zero mean.
643
"""
644
645
def fit(self, X):
646
"""Compute mean for later centering."""
647
648
def transform(self, X):
649
"""Center data using computed mean."""
650
651
def fit_transform(self, X):
652
"""Fit scaler and center data in one step."""
653
654
class PattersonScaler:
655
"""
656
Scale genotype data using Patterson method for PCA.
657
658
Scales variants by their expected variance under Hardy-Weinberg
659
equilibrium, which is appropriate for principal component analysis.
660
"""
661
662
def fit(self, X):
663
"""Compute scaling factors."""
664
665
def transform(self, X):
666
"""Apply Patterson scaling."""
667
668
def fit_transform(self, X):
669
"""Fit and transform in one step."""
670
671
def get_scaler(name):
672
"""
673
Get a scaler by name.
674
675
Args:
676
name (str): Scaler name ('standard', 'center', or 'patterson')
677
678
Returns:
679
scaler: Scaler object
680
"""
681
```
682
{ .api }
683
684
## Selection Tests
685
686
### Extended Haplotype Homozygosity
687
688
Test for recent positive selection using haplotype structure.
689
690
```python
691
# Integrated Haplotype Score (iHS)
692
ihs_scores = allel.ihs(haplotypes, map_positions, min_ehh=0.05)
693
694
# Cross-population Extended Haplotype Homozygosity (XP-EHH)
695
xpehh_scores = allel.xpehh(hap_pop1, hap_pop2, map_positions)
696
697
# Standardize scores
698
ihs_std = allel.standardize(ihs_scores)
699
```
700
{ .api }
701
702
```python
703
def ihs(h, map_pos, min_ehh=0.05, include_edges=False, gap_scale=20000, max_gap=200000, discard_integration_warnings=False):
704
"""
705
Calculate integrated haplotype score (iHS) for each variant.
706
707
Args:
708
h (HaplotypeArray): Haplotype data
709
map_pos (array_like): Genetic map positions in centiMorgans
710
min_ehh (float): Minimum EHH value for integration cutoff
711
include_edges (bool): Whether to include variants at edges
712
gap_scale (float): Scale parameter for gap penalty
713
max_gap (float): Maximum gap size to allow
714
discard_integration_warnings (bool): Whether to suppress integration warnings
715
716
Returns:
717
ndarray: iHS scores per variant
718
"""
719
720
def xpehh(h1, h2, map_pos, min_ehh=0.05, include_edges=False, gap_scale=20000, max_gap=200000, discard_integration_warnings=False):
721
"""
722
Calculate cross-population EHH (XP-EHH) between two populations.
723
724
Args:
725
h1 (HaplotypeArray): Haplotypes from population 1
726
h2 (HaplotypeArray): Haplotypes from population 2
727
map_pos (array_like): Genetic map positions
728
min_ehh (float): Minimum EHH cutoff
729
include_edges (bool): Include edge variants
730
gap_scale (float): Gap penalty scale
731
max_gap (float): Maximum allowed gap
732
discard_integration_warnings (bool): Suppress warnings
733
734
Returns:
735
ndarray: XP-EHH scores per variant
736
"""
737
738
def standardize_by_allele_count(score, aac, flip=None):
739
"""
740
Standardize test statistic by allele count.
741
742
Args:
743
score (array_like): Raw test statistic values
744
aac (array_like): Alternate allele counts
745
flip (array_like, optional): Boolean array indicating which scores to flip
746
747
Returns:
748
ndarray: Standardized scores
749
"""
750
```
751
{ .api }
752
753
### Number of Segregating Sites by Length
754
755
Alternative selection test less sensitive to recombination rate variation.
756
757
```python
758
# NSL scores
759
nsl_scores = allel.nsl(haplotypes, map_positions)
760
761
# Cross-population NSL
762
xpnsl_scores = allel.xpnsl(hap_pop1, hap_pop2, map_positions)
763
```
764
{ .api }
765
766
```python
767
def nsl(h, map_pos=None, max_extent=None):
768
"""
769
Calculate number of segregating sites by length (NSL).
770
771
Args:
772
h (HaplotypeArray): Haplotype data
773
map_pos (array_like, optional): Map positions for physical distance
774
max_extent (float, optional): Maximum extent for calculation
775
776
Returns:
777
ndarray: NSL scores per variant
778
"""
779
780
def xpnsl(h1, h2, map_pos=None, max_extent=None):
781
"""
782
Calculate cross-population NSL.
783
784
Args:
785
h1 (HaplotypeArray): Population 1 haplotypes
786
h2 (HaplotypeArray): Population 2 haplotypes
787
map_pos (array_like, optional): Map positions
788
max_extent (float, optional): Maximum extent
789
790
Returns:
791
ndarray: XP-NSL scores per variant
792
"""
793
```
794
{ .api }
795
796
### Additional Selection Tests
797
798
More methods for detecting natural selection.
799
800
```python
801
# EHH decay analysis
802
ehh_values = allel.ehh_decay(haplotypes, map_positions, focal_variant=100)
803
804
# Voight painting (visualization of haplotype decay)
805
painting = allel.voight_painting(haplotypes, map_positions, focal_variant=100)
806
allel.plot_voight_painting(painting)
807
808
# Standardization methods
809
scores_std = allel.standardize(scores)
810
scores_std_ac = allel.standardize_by_allele_count(scores, allele_counts)
811
812
# Moving Tajima's D difference
813
delta_td = allel.moving_delta_tajima_d(positions, ac_pop1, ac_pop2, size=10000)
814
815
# Population Branch Statistic (PBS)
816
pbs_scores = allel.pbs(fst_AB, fst_AC, fst_BC)
817
```
818
{ .api }
819
820
```python
821
def ehh_decay(h, map_pos, focal=None, max_gap=200000, gap_scale=20000, truncate_left=None, truncate_right=None):
822
"""
823
Calculate Extended Haplotype Homozygosity (EHH) decay.
824
825
Args:
826
h (HaplotypeArray): Haplotype data
827
map_pos (array_like): Genetic map positions
828
focal (int, optional): Index of focal variant
829
max_gap (float): Maximum gap between variants
830
gap_scale (float): Gap scaling factor
831
truncate_left (float, optional): Left truncation point
832
truncate_right (float, optional): Right truncation point
833
834
Returns:
835
tuple: (distances, ehh_values)
836
"""
837
838
def voight_painting(h, map_pos, focal=None, max_gap=200000, gap_scale=20000):
839
"""
840
Generate Voight painting data for haplotype visualization.
841
842
Args:
843
h (HaplotypeArray): Haplotype data
844
map_pos (array_like): Genetic map positions
845
focal (int, optional): Index of focal variant
846
max_gap (float): Maximum gap between variants
847
gap_scale (float): Gap scaling factor
848
849
Returns:
850
dict: Painting data for visualization
851
"""
852
853
def plot_voight_painting(painting, ax=None):
854
"""
855
Plot Voight painting visualization.
856
857
Args:
858
painting (dict): Painting data from voight_painting()
859
ax (matplotlib axis, optional): Axis to plot on
860
861
Returns:
862
matplotlib figure: Voight painting plot
863
"""
864
865
def fig_voight_painting(painting, height=None):
866
"""
867
Create figure for Voight painting.
868
869
Args:
870
painting (dict): Painting data
871
height (float, optional): Figure height
872
873
Returns:
874
matplotlib figure: Figure object
875
"""
876
877
def plot_haplotype_frequencies(h, start=None, stop=None, ax=None):
878
"""
879
Plot haplotype frequencies.
880
881
Args:
882
h (HaplotypeArray): Haplotype data
883
start (int, optional): Start position index
884
stop (int, optional): Stop position index
885
ax (matplotlib axis, optional): Axis to plot on
886
887
Returns:
888
matplotlib figure: Haplotype frequency plot
889
"""
890
891
def plot_moving_haplotype_frequencies(h, size, start=None, stop=None, step=None, ax=None):
892
"""
893
Plot moving window haplotype frequencies.
894
895
Args:
896
h (HaplotypeArray): Haplotype data
897
size (int): Window size
898
start (int, optional): Start position
899
stop (int, optional): Stop position
900
step (int, optional): Step size
901
ax (matplotlib axis, optional): Axis to plot on
902
903
Returns:
904
matplotlib figure: Moving haplotype frequency plot
905
"""
906
907
def standardize(score):
908
"""
909
Standardize test statistic to z-scores.
910
911
Args:
912
score (array_like): Raw test statistic values
913
914
Returns:
915
ndarray: Standardized scores
916
"""
917
918
def moving_delta_tajima_d(pos, ac1, ac2, size, start=None, stop=None, step=None, is_accessible=None):
919
"""
920
Calculate difference in Tajima's D between populations in sliding windows.
921
922
Args:
923
pos (array_like): Variant positions
924
ac1 (AlleleCountsArray): Allele counts for population 1
925
ac2 (AlleleCountsArray): Allele counts for population 2
926
size (int): Window size in base pairs
927
start (int, optional): Start position
928
stop (int, optional): Stop position
929
step (int, optional): Step size
930
is_accessible (array_like, optional): Accessible sites mask
931
932
Returns:
933
tuple: (window_centers, delta_tajima_d_values)
934
"""
935
936
def pbs(fst_AB, fst_AC, fst_BC):
937
"""
938
Calculate Population Branch Statistic (PBS) for three populations.
939
940
Args:
941
fst_AB (array_like): FST between populations A and B
942
fst_AC (array_like): FST between populations A and C
943
fst_BC (array_like): FST between populations B and C
944
945
Returns:
946
ndarray: PBS values for population A
947
"""
948
949
def moving_haplotype_diversity(h, size, start=None, stop=None, step=None):
950
"""
951
Calculate haplotype diversity in sliding windows.
952
953
Args:
954
h (HaplotypeArray): Haplotype data
955
size (int): Window size (number of variants)
956
start (int, optional): Start position index
957
stop (int, optional): Stop position index
958
step (int, optional): Step size
959
960
Returns:
961
tuple: (window_centers, diversity_values)
962
"""
963
964
def moving_garud_h(h, size, start=None, stop=None, step=None):
965
"""
966
Calculate Garud's H statistics in sliding windows.
967
968
Args:
969
h (HaplotypeArray): Haplotype data
970
size (int): Window size (number of variants)
971
start (int, optional): Start position index
972
stop (int, optional): Stop position index
973
step (int, optional): Step size
974
975
Returns:
976
tuple: (window_centers, h1_values, h12_values, h123_values, h2_h1_values)
977
"""
978
```
979
{ .api }
980
981
### Haplotype Diversity
982
983
Measure diversity of haplotype patterns.
984
985
```python
986
# Basic haplotype diversity
987
h_div = allel.haplotype_diversity(haplotypes, start=1000, stop=2000)
988
989
# Moving window analysis
990
h_div_moving = allel.moving_haplotype_diversity(haplotypes, size=1000, step=500)
991
992
# Garud's H statistics
993
h_stats = allel.garud_h(haplotypes, start=1000, stop=2000)
994
```
995
{ .api }
996
997
```python
998
def haplotype_diversity(h, start=None, stop=None):
999
"""
1000
Calculate haplotype diversity in a genomic region.
1001
1002
Args:
1003
h (HaplotypeArray): Haplotype data
1004
start (int, optional): Start position index
1005
stop (int, optional): Stop position index
1006
1007
Returns:
1008
float: Haplotype diversity value
1009
"""
1010
1011
def garud_h(h, start=None, stop=None):
1012
"""
1013
Calculate Garud's H1, H12, H123, and H2/H1 statistics.
1014
1015
Args:
1016
h (HaplotypeArray): Haplotype data
1017
start (int, optional): Start position index
1018
stop (int, optional): Stop position index
1019
1020
Returns:
1021
tuple: (H1, H12, H123, H2_H1) statistics
1022
"""
1023
```
1024
{ .api }
1025
1026
## Site Frequency Spectrum
1027
1028
### SFS Analysis
1029
1030
Analyze allele frequency distributions.
1031
1032
```python
1033
# Basic SFS
1034
sfs = allel.sfs(derived_allele_counts, n=sample_size)
1035
1036
# Folded SFS (when ancestral state unknown)
1037
sfs_folded = allel.sfs_folded(derived_allele_counts, n=sample_size)
1038
1039
# Joint SFS between populations
1040
joint_sfs = allel.joint_sfs(dac_pop1, dac_pop2, n1=n_pop1, n2=n_pop2)
1041
```
1042
{ .api }
1043
1044
```python
1045
def sfs(dac, n=None):
1046
"""
1047
Calculate site frequency spectrum.
1048
1049
Args:
1050
dac (array_like): Derived allele counts per variant
1051
n (int, optional): Sample size (number of chromosomes)
1052
1053
Returns:
1054
ndarray: Site frequency spectrum
1055
"""
1056
1057
def sfs_folded(dac, n=None):
1058
"""
1059
Calculate folded site frequency spectrum.
1060
1061
Args:
1062
dac (array_like): Derived allele counts
1063
n (int, optional): Sample size
1064
1065
Returns:
1066
ndarray: Folded SFS
1067
"""
1068
1069
def joint_sfs(dac1, dac2, n1=None, n2=None):
1070
"""
1071
Calculate joint site frequency spectrum between two populations.
1072
1073
Args:
1074
dac1 (array_like): Derived allele counts for population 1
1075
dac2 (array_like): Derived allele counts for population 2
1076
n1 (int, optional): Sample size for population 1
1077
n2 (int, optional): Sample size for population 2
1078
1079
Returns:
1080
ndarray: Joint SFS as 2D array
1081
"""
1082
```
1083
{ .api }
1084
1085
### SFS Transformations and Scaling
1086
1087
Transform and scale site frequency spectra.
1088
1089
```python
1090
# Scale SFS by mutation rate
1091
sfs_scaled = allel.sfs_scaled(derived_counts, n=sample_size, theta=0.01)
1092
sfs_folded_scaled = allel.sfs_folded_scaled(derived_counts, n=sample_size, theta=0.01)
1093
1094
# Joint SFS scaling
1095
joint_sfs_scaled = allel.joint_sfs_scaled(dac1, dac2, n1=n1, n2=n2, theta=0.01)
1096
joint_sfs_folded_scaled = allel.joint_sfs_folded_scaled(dac1, dac2, n1=n1, n2=n2, theta=0.01)
1097
1098
# Fold existing SFS
1099
folded_sfs = allel.fold_sfs(unfolded_sfs)
1100
folded_joint_sfs = allel.fold_joint_sfs(unfolded_joint_sfs)
1101
1102
# Scale existing SFS
1103
scaled_sfs = allel.scale_sfs(sfs, theta=0.01)
1104
scaled_folded_sfs = allel.scale_sfs_folded(sfs, theta=0.01)
1105
scaled_joint_sfs = allel.scale_joint_sfs(joint_sfs, theta=0.01)
1106
scaled_joint_folded_sfs = allel.scale_joint_sfs_folded(joint_sfs, theta=0.01)
1107
```
1108
{ .api }
1109
1110
```python
1111
def sfs_scaled(dac, n=None, theta=1.0):
1112
"""
1113
Calculate scaled site frequency spectrum.
1114
1115
Args:
1116
dac (array_like): Derived allele counts
1117
n (int, optional): Sample size
1118
theta (float): Scaling factor (mutation rate)
1119
1120
Returns:
1121
ndarray: Scaled SFS
1122
"""
1123
1124
def sfs_folded_scaled(dac, n=None, theta=1.0):
1125
"""
1126
Calculate scaled folded site frequency spectrum.
1127
1128
Args:
1129
dac (array_like): Derived allele counts
1130
n (int, optional): Sample size
1131
theta (float): Scaling factor
1132
1133
Returns:
1134
ndarray: Scaled folded SFS
1135
"""
1136
1137
def joint_sfs_scaled(dac1, dac2, n1=None, n2=None, theta=1.0):
1138
"""
1139
Calculate scaled joint site frequency spectrum.
1140
1141
Args:
1142
dac1 (array_like): Population 1 derived allele counts
1143
dac2 (array_like): Population 2 derived allele counts
1144
n1 (int, optional): Population 1 sample size
1145
n2 (int, optional): Population 2 sample size
1146
theta (float): Scaling factor
1147
1148
Returns:
1149
ndarray: Scaled joint SFS
1150
"""
1151
1152
def joint_sfs_folded_scaled(dac1, dac2, n1=None, n2=None, theta=1.0):
1153
"""
1154
Calculate scaled folded joint site frequency spectrum.
1155
1156
Args:
1157
dac1 (array_like): Population 1 derived allele counts
1158
dac2 (array_like): Population 2 derived allele counts
1159
n1 (int, optional): Population 1 sample size
1160
n2 (int, optional): Population 2 sample size
1161
theta (float): Scaling factor
1162
1163
Returns:
1164
ndarray: Scaled folded joint SFS
1165
"""
1166
1167
def fold_sfs(sfs):
1168
"""
1169
Fold an existing site frequency spectrum.
1170
1171
Args:
1172
sfs (array_like): Unfolded SFS
1173
1174
Returns:
1175
ndarray: Folded SFS
1176
"""
1177
1178
def fold_joint_sfs(joint_sfs):
1179
"""
1180
Fold an existing joint site frequency spectrum.
1181
1182
Args:
1183
joint_sfs (array_like): Unfolded joint SFS
1184
1185
Returns:
1186
ndarray: Folded joint SFS
1187
"""
1188
1189
def scale_sfs(sfs, theta=1.0):
1190
"""
1191
Scale an existing site frequency spectrum.
1192
1193
Args:
1194
sfs (array_like): SFS to scale
1195
theta (float): Scaling factor
1196
1197
Returns:
1198
ndarray: Scaled SFS
1199
"""
1200
1201
def scale_sfs_folded(sfs, theta=1.0):
1202
"""
1203
Scale an existing folded site frequency spectrum.
1204
1205
Args:
1206
sfs (array_like): Folded SFS to scale
1207
theta (float): Scaling factor
1208
1209
Returns:
1210
ndarray: Scaled folded SFS
1211
"""
1212
1213
def scale_joint_sfs(joint_sfs, theta=1.0):
1214
"""
1215
Scale an existing joint site frequency spectrum.
1216
1217
Args:
1218
joint_sfs (array_like): Joint SFS to scale
1219
theta (float): Scaling factor
1220
1221
Returns:
1222
ndarray: Scaled joint SFS
1223
"""
1224
1225
def scale_joint_sfs_folded(joint_sfs, theta=1.0):
1226
"""
1227
Scale an existing folded joint site frequency spectrum.
1228
1229
Args:
1230
joint_sfs (array_like): Folded joint SFS to scale
1231
theta (float): Scaling factor
1232
1233
Returns:
1234
ndarray: Scaled folded joint SFS
1235
"""
1236
```
1237
{ .api }
1238
1239
### SFS Visualization
1240
1241
Plot site frequency spectra.
1242
1243
```python
1244
# Basic SFS plots
1245
allel.plot_sfs(sfs)
1246
allel.plot_sfs_folded(sfs_folded)
1247
allel.plot_sfs_scaled(sfs_scaled)
1248
allel.plot_sfs_folded_scaled(sfs_folded_scaled)
1249
1250
# Joint SFS plots
1251
allel.plot_joint_sfs(joint_sfs)
1252
allel.plot_joint_sfs_folded(joint_sfs_folded)
1253
allel.plot_joint_sfs_scaled(joint_sfs_scaled)
1254
allel.plot_joint_sfs_folded_scaled(joint_sfs_folded_scaled)
1255
```
1256
{ .api }
1257
1258
```python
1259
def plot_sfs(sfs, ax=None, n=None):
1260
"""
1261
Plot site frequency spectrum.
1262
1263
Args:
1264
sfs (array_like): SFS values
1265
ax (matplotlib axis, optional): Axis to plot on
1266
n (int, optional): Sample size for x-axis labels
1267
1268
Returns:
1269
matplotlib figure: SFS plot
1270
"""
1271
1272
def plot_sfs_folded(sfs, ax=None, n=None):
1273
"""
1274
Plot folded site frequency spectrum.
1275
1276
Args:
1277
sfs (array_like): Folded SFS values
1278
ax (matplotlib axis, optional): Axis to plot on
1279
n (int, optional): Sample size for x-axis labels
1280
1281
Returns:
1282
matplotlib figure: Folded SFS plot
1283
"""
1284
1285
def plot_sfs_scaled(sfs, ax=None, n=None):
1286
"""
1287
Plot scaled site frequency spectrum.
1288
1289
Args:
1290
sfs (array_like): Scaled SFS values
1291
ax (matplotlib axis, optional): Axis to plot on
1292
n (int, optional): Sample size
1293
1294
Returns:
1295
matplotlib figure: Scaled SFS plot
1296
"""
1297
1298
def plot_sfs_folded_scaled(sfs, ax=None, n=None):
1299
"""
1300
Plot scaled folded site frequency spectrum.
1301
1302
Args:
1303
sfs (array_like): Scaled folded SFS values
1304
ax (matplotlib axis, optional): Axis to plot on
1305
n (int, optional): Sample size
1306
1307
Returns:
1308
matplotlib figure: Scaled folded SFS plot
1309
"""
1310
1311
def plot_joint_sfs(joint_sfs, ax=None):
1312
"""
1313
Plot joint site frequency spectrum as heatmap.
1314
1315
Args:
1316
joint_sfs (array_like): Joint SFS matrix
1317
ax (matplotlib axis, optional): Axis to plot on
1318
1319
Returns:
1320
matplotlib figure: Joint SFS heatmap
1321
"""
1322
1323
def plot_joint_sfs_folded(joint_sfs, ax=None):
1324
"""
1325
Plot folded joint site frequency spectrum.
1326
1327
Args:
1328
joint_sfs (array_like): Folded joint SFS matrix
1329
ax (matplotlib axis, optional): Axis to plot on
1330
1331
Returns:
1332
matplotlib figure: Folded joint SFS heatmap
1333
"""
1334
1335
def plot_joint_sfs_scaled(joint_sfs, ax=None):
1336
"""
1337
Plot scaled joint site frequency spectrum.
1338
1339
Args:
1340
joint_sfs (array_like): Scaled joint SFS matrix
1341
ax (matplotlib axis, optional): Axis to plot on
1342
1343
Returns:
1344
matplotlib figure: Scaled joint SFS heatmap
1345
"""
1346
1347
def plot_joint_sfs_folded_scaled(joint_sfs, ax=None):
1348
"""
1349
Plot scaled folded joint site frequency spectrum.
1350
1351
Args:
1352
joint_sfs (array_like): Scaled folded joint SFS matrix
1353
ax (matplotlib axis, optional): Axis to plot on
1354
1355
Returns:
1356
matplotlib figure: Scaled folded joint SFS heatmap
1357
"""
1358
```
1359
{ .api }
1360
1361
## Hardy-Weinberg Equilibrium
1362
1363
### HWE Statistics
1364
1365
Test for deviations from Hardy-Weinberg equilibrium.
1366
1367
```python
1368
# Observed heterozygosity
1369
het_obs = allel.heterozygosity_observed(genotypes)
1370
1371
# Expected heterozygosity
1372
het_exp = allel.heterozygosity_expected(allele_frequencies)
1373
1374
# Inbreeding coefficient (FIS)
1375
fis = allel.inbreeding_coefficient(genotypes)
1376
```
1377
{ .api }
1378
1379
```python
1380
def heterozygosity_observed(g, alleles=None):
1381
"""
1382
Calculate observed heterozygosity per variant.
1383
1384
Args:
1385
g (GenotypeArray): Genotype data
1386
alleles (array_like, optional): Specific alleles to consider
1387
1388
Returns:
1389
ndarray: Observed heterozygosity per variant
1390
"""
1391
1392
def heterozygosity_expected(af, ploidy=2):
1393
"""
1394
Calculate expected heterozygosity under Hardy-Weinberg equilibrium.
1395
1396
Args:
1397
af (array_like): Allele frequencies
1398
ploidy (int): Ploidy level
1399
1400
Returns:
1401
ndarray: Expected heterozygosity per variant
1402
"""
1403
1404
def inbreeding_coefficient(g, alleles=None):
1405
"""
1406
Calculate inbreeding coefficient (FIS) per variant.
1407
1408
Args:
1409
g (GenotypeArray): Genotype data
1410
alleles (array_like, optional): Specific alleles to consider
1411
1412
Returns:
1413
ndarray: Inbreeding coefficient per variant
1414
"""
1415
```
1416
{ .api }
1417
1418
## Admixture Statistics
1419
1420
### Patterson's F-statistics
1421
1422
Test for population admixture and phylogenetic relationships.
1423
1424
```python
1425
# F2 statistic (genetic distance)
1426
f2_values = allel.patterson_f2(ac_pop1, ac_pop2)
1427
1428
# F3 statistic (admixture test)
1429
f3_values = allel.patterson_f3(ac_pop1, ac_pop2, ac_pop3)
1430
1431
# D statistic (4-population test)
1432
d_values = allel.patterson_d(ac_pop1, ac_pop2, ac_pop3, ac_pop4)
1433
1434
# Blockwise calculations for bootstrapping
1435
f3_blocks = allel.blockwise_patterson_f3(ac_pop1, ac_pop2, ac_pop3, blocks)
1436
d_blocks = allel.blockwise_patterson_d(ac_pop1, ac_pop2, ac_pop3, ac_pop4, blocks)
1437
1438
# Average statistics
1439
avg_f3 = allel.average_patterson_f3(ac_pop1, ac_pop2, ac_pop3)
1440
avg_d = allel.average_patterson_d(ac_pop1, ac_pop2, ac_pop3, ac_pop4)
1441
1442
# Moving window analysis
1443
f3_moving = allel.moving_patterson_f3(positions, ac_pop1, ac_pop2, ac_pop3, size=100000)
1444
d_moving = allel.moving_patterson_d(positions, ac_pop1, ac_pop2, ac_pop3, ac_pop4, size=100000)
1445
```
1446
{ .api }
1447
1448
**Key Functions:**
1449
1450
```python
1451
def patterson_f2(ac1, ac2):
1452
"""
1453
Calculate Patterson's F2 statistic (genetic distance).
1454
1455
Args:
1456
ac1 (AlleleCountsArray): Allele counts for population 1
1457
ac2 (AlleleCountsArray): Allele counts for population 2
1458
1459
Returns:
1460
ndarray: F2 values per variant
1461
"""
1462
1463
def patterson_f3(ac1, ac2, ac3):
1464
"""
1465
Calculate Patterson's F3 statistic (test for admixture).
1466
1467
Args:
1468
ac1 (AlleleCountsArray): Allele counts for population 1 (test population)
1469
ac2 (AlleleCountsArray): Allele counts for population 2 (source 1)
1470
ac3 (AlleleCountsArray): Allele counts for population 3 (source 2)
1471
1472
Returns:
1473
ndarray: F3 values per variant
1474
"""
1475
1476
def patterson_d(ac1, ac2, ac3, ac4):
1477
"""
1478
Calculate Patterson's D statistic (4-population test).
1479
1480
Args:
1481
ac1 (AlleleCountsArray): Population 1 allele counts
1482
ac2 (AlleleCountsArray): Population 2 allele counts
1483
ac3 (AlleleCountsArray): Population 3 allele counts
1484
ac4 (AlleleCountsArray): Population 4 allele counts (outgroup)
1485
1486
Returns:
1487
ndarray: D values per variant
1488
"""
1489
1490
def blockwise_patterson_f3(ac1, ac2, ac3, blocks):
1491
"""
1492
Calculate F3 statistic for genomic blocks.
1493
1494
Args:
1495
ac1 (AlleleCountsArray): Population 1 allele counts
1496
ac2 (AlleleCountsArray): Population 2 allele counts
1497
ac3 (AlleleCountsArray): Population 3 allele counts
1498
blocks (array_like): Block definitions
1499
1500
Returns:
1501
ndarray: F3 values per block
1502
"""
1503
1504
def blockwise_patterson_d(ac1, ac2, ac3, ac4, blocks):
1505
"""
1506
Calculate D statistic for genomic blocks.
1507
1508
Args:
1509
ac1 (AlleleCountsArray): Population 1 allele counts
1510
ac2 (AlleleCountsArray): Population 2 allele counts
1511
ac3 (AlleleCountsArray): Population 3 allele counts
1512
ac4 (AlleleCountsArray): Population 4 allele counts
1513
blocks (array_like): Block definitions
1514
1515
Returns:
1516
ndarray: D values per block
1517
"""
1518
1519
def average_patterson_f3(ac1, ac2, ac3):
1520
"""
1521
Calculate average F3 statistic across all variants.
1522
1523
Args:
1524
ac1 (AlleleCountsArray): Population 1 allele counts
1525
ac2 (AlleleCountsArray): Population 2 allele counts
1526
ac3 (AlleleCountsArray): Population 3 allele counts
1527
1528
Returns:
1529
float: Average F3 value
1530
"""
1531
1532
def average_patterson_d(ac1, ac2, ac3, ac4):
1533
"""
1534
Calculate average D statistic across all variants.
1535
1536
Args:
1537
ac1 (AlleleCountsArray): Population 1 allele counts
1538
ac2 (AlleleCountsArray): Population 2 allele counts
1539
ac3 (AlleleCountsArray): Population 3 allele counts
1540
ac4 (AlleleCountsArray): Population 4 allele counts
1541
1542
Returns:
1543
float: Average D value
1544
"""
1545
1546
def moving_patterson_f3(pos, ac1, ac2, ac3, size, start=None, stop=None, step=None):
1547
"""
1548
Calculate F3 statistic in sliding windows.
1549
1550
Args:
1551
pos (array_like): Variant positions
1552
ac1 (AlleleCountsArray): Population 1 allele counts
1553
ac2 (AlleleCountsArray): Population 2 allele counts
1554
ac3 (AlleleCountsArray): Population 3 allele counts
1555
size (int): Window size in base pairs
1556
start (int, optional): Start position
1557
stop (int, optional): Stop position
1558
step (int, optional): Step size
1559
1560
Returns:
1561
tuple: (window_centers, f3_values)
1562
"""
1563
1564
def moving_patterson_d(pos, ac1, ac2, ac3, ac4, size, start=None, stop=None, step=None):
1565
"""
1566
Calculate D statistic in sliding windows.
1567
1568
Args:
1569
pos (array_like): Variant positions
1570
ac1 (AlleleCountsArray): Population 1 allele counts
1571
ac2 (AlleleCountsArray): Population 2 allele counts
1572
ac3 (AlleleCountsArray): Population 3 allele counts
1573
ac4 (AlleleCountsArray): Population 4 allele counts
1574
size (int): Window size in base pairs
1575
start (int, optional): Start position
1576
stop (int, optional): Stop position
1577
step (int, optional): Step size
1578
1579
Returns:
1580
tuple: (window_centers, d_values)
1581
"""
1582
```
1583
{ .api }
1584
1585
## Windowed Analysis
1586
1587
### Window Functions
1588
1589
Apply statistical methods over genomic windows.
1590
1591
```python
1592
# Create equally-sized windows
1593
windows = allel.equally_accessible_windows(positions, size=10000)
1594
1595
# Apply any statistic to windows
1596
window_stats = allel.windowed_statistic(positions, values, np.mean, windows)
1597
1598
# Moving window statistics
1599
moving_stats = allel.moving_statistic(values, np.mean, size=100, step=50)
1600
1601
# Count variants in windows
1602
counts = allel.windowed_count(positions, windows)
1603
1604
# Per-base analysis
1605
per_base_values = allel.per_base(positions, values, start=1000, stop=2000)
1606
1607
# Moving statistics with different functions
1608
moving_means = allel.moving_mean(values, size=100, step=50)
1609
moving_stds = allel.moving_std(values, size=100, step=50)
1610
moving_mids = allel.moving_midpoint(positions, size=100, step=50)
1611
1612
# Window coordinate helpers
1613
index_wins = allel.index_windows(positions, window_size=100)
1614
pos_wins = allel.position_windows(positions, window_size=10000, step=5000)
1615
win_locs = allel.window_locations(windows)
1616
```
1617
{ .api }
1618
1619
### Additional Window Functions
1620
1621
More specialized windowing utilities.
1622
1623
```python
1624
def windowed_count(pos, windows, is_accessible=None):
1625
"""
1626
Count variants in genomic windows.
1627
1628
Args:
1629
pos (array_like): Variant positions
1630
windows (array_like): Window coordinates as (start, stop) pairs
1631
is_accessible (array_like, optional): Accessible sites mask
1632
1633
Returns:
1634
ndarray: Variant counts per window
1635
"""
1636
1637
def per_base(pos, values, start=None, stop=None, is_accessible=None):
1638
"""
1639
Calculate per-base values over a genomic region.
1640
1641
Args:
1642
pos (array_like): Variant positions
1643
values (array_like): Values per variant
1644
start (int, optional): Start position
1645
stop (int, optional): Stop position
1646
is_accessible (array_like, optional): Accessible sites mask
1647
1648
Returns:
1649
float: Per-base value
1650
"""
1651
1652
def moving_mean(values, size, step=1):
1653
"""
1654
Calculate moving mean over a sequence of values.
1655
1656
Args:
1657
values (array_like): Input values
1658
size (int): Window size
1659
step (int): Step size between windows
1660
1661
Returns:
1662
ndarray: Moving mean values
1663
"""
1664
1665
def moving_std(values, size, step=1):
1666
"""
1667
Calculate moving standard deviation.
1668
1669
Args:
1670
values (array_like): Input values
1671
size (int): Window size
1672
step (int): Step size between windows
1673
1674
Returns:
1675
ndarray: Moving standard deviation values
1676
"""
1677
1678
def moving_midpoint(positions, size, step=1):
1679
"""
1680
Calculate midpoints for moving windows.
1681
1682
Args:
1683
positions (array_like): Position coordinates
1684
size (int): Window size
1685
step (int): Step size between windows
1686
1687
Returns:
1688
ndarray: Window midpoint positions
1689
"""
1690
1691
def index_windows(positions, window_size):
1692
"""
1693
Create windows based on variant indices.
1694
1695
Args:
1696
positions (array_like): Variant positions
1697
window_size (int): Number of variants per window
1698
1699
Returns:
1700
ndarray: Window definitions as index ranges
1701
"""
1702
1703
def position_windows(positions, window_size, step=None):
1704
"""
1705
Create windows based on genomic positions.
1706
1707
Args:
1708
positions (array_like): Variant positions
1709
window_size (int): Window size in base pairs
1710
step (int, optional): Step size between windows
1711
1712
Returns:
1713
ndarray: Window coordinates as (start, stop) pairs
1714
"""
1715
1716
def window_locations(windows):
1717
"""
1718
Calculate locations (centers) of genomic windows.
1719
1720
Args:
1721
windows (array_like): Window coordinates as (start, stop) pairs
1722
1723
Returns:
1724
ndarray: Window center positions
1725
"""
1726
```
1727
{ .api }
1728
1729
## Mendelian Analysis
1730
1731
### Inheritance Pattern Analysis
1732
1733
Analyze transmission patterns in family data to identify Mendelian errors and phase genotypes.
1734
1735
```python
1736
# Identify Mendelian errors in trio data
1737
errors = allel.mendel_errors(offspring_genotypes, parent_genotypes)
1738
1739
# Phase genotypes using transmission patterns
1740
phased_offspring = allel.phase_progeny_by_transmission(offspring_genotypes, transmission)
1741
phased_parents = allel.phase_parents_by_transmission(parent_genotypes, transmission)
1742
1743
# Complete phasing workflow
1744
phased_offspring, phased_parents = allel.phase_by_transmission(offspring_genotypes, parent_genotypes)
1745
```
1746
{ .api }
1747
1748
**Key Functions:**
1749
1750
```python
1751
def mendel_errors(offspring_genotypes, parent_genotypes):
1752
"""
1753
Identify Mendelian inheritance errors in family data.
1754
1755
Args:
1756
offspring_genotypes (GenotypeArray): Genotypes of offspring
1757
parent_genotypes (GenotypeArray): Genotypes of parents
1758
1759
Returns:
1760
ndarray: Array indicating Mendelian error status for each variant/offspring pair
1761
"""
1762
1763
def phase_progeny_by_transmission(offspring_genotypes, transmission):
1764
"""
1765
Phase offspring genotypes using transmission patterns.
1766
1767
Args:
1768
offspring_genotypes (GenotypeArray): Offspring genotype data
1769
transmission (array_like): Transmission pattern information
1770
1771
Returns:
1772
GenotypeArray: Phased offspring genotypes
1773
"""
1774
1775
def phase_parents_by_transmission(parent_genotypes, transmission):
1776
"""
1777
Phase parent genotypes using transmission patterns.
1778
1779
Args:
1780
parent_genotypes (GenotypeArray): Parent genotype data
1781
transmission (array_like): Transmission pattern information
1782
1783
Returns:
1784
GenotypeArray: Phased parent genotypes
1785
"""
1786
1787
def phase_by_transmission(offspring_genotypes, parent_genotypes):
1788
"""
1789
Phase both offspring and parent genotypes using transmission patterns.
1790
1791
Args:
1792
offspring_genotypes (GenotypeArray): Offspring genotypes
1793
parent_genotypes (GenotypeArray): Parent genotypes
1794
1795
Returns:
1796
tuple: (phased_offspring, phased_parents) GenotypeArray objects
1797
"""
1798
```
1799
{ .api }
1800
1801
**Inheritance Constants:**
1802
1803
```python
1804
# Mendelian inheritance pattern constants
1805
INHERIT_MISSING = -1 # Missing inheritance pattern
1806
INHERIT_NONPARENTAL = 0 # Non-parental allele
1807
INHERIT_NONSEG_ALT = 1 # Non-segregating alternative
1808
INHERIT_NONSEG_REF = 2 # Non-segregating reference
1809
INHERIT_PARENT1 = 3 # Inherited from parent 1
1810
INHERIT_PARENT2 = 4 # Inherited from parent 2
1811
INHERIT_PARENT_MISSING = 5 # Parent has missing data
1812
INHERIT_UNDETERMINED = 6 # Cannot determine inheritance
1813
```
1814
{ .api }
1815
1816
## Runs of Homozygosity
1817
1818
### ROH Detection
1819
1820
Detect runs of homozygosity using Hidden Markov Models.
1821
1822
```python
1823
# Multi-state HMM for ROH detection
1824
roh_results = allel.roh_mhmm(genotypes, positions, phet_roh=0.001, phet_nonroh=0.01, transition=1e-6)
1825
1826
# Poisson HMM approach
1827
roh_poisson = allel.roh_poissonhmm(genotypes, positions, phet_roh=0.001, phet_nonroh=0.01,
1828
transition_hethet=1e-5, transition_homhom=1e-8)
1829
```
1830
{ .api }
1831
1832
```python
1833
def roh_mhmm(g, pos, phet_roh, phet_nonroh, transition, map_pos=None):
1834
"""
1835
Detect runs of homozygosity using multi-state Hidden Markov Model.
1836
1837
Args:
1838
g (GenotypeArray): Genotype data
1839
pos (array_like): Variant positions
1840
phet_roh (float): Probability of heterozygosity in ROH segments
1841
phet_nonroh (float): Probability of heterozygosity in non-ROH segments
1842
transition (float): Transition probability between states
1843
map_pos (array_like, optional): Genetic map positions
1844
1845
Returns:
1846
ndarray: ROH state predictions
1847
"""
1848
1849
def roh_poissonhmm(g, pos, phet_roh, phet_nonroh, transition_hethet, transition_homhom, map_pos=None):
1850
"""
1851
Detect runs of homozygosity using Poisson Hidden Markov Model.
1852
1853
Args:
1854
g (GenotypeArray): Genotype data
1855
pos (array_like): Variant positions
1856
phet_roh (float): Het probability in ROH
1857
phet_nonroh (float): Het probability in non-ROH
1858
transition_hethet (float): Transition probability between het states
1859
transition_homhom (float): Transition probability between hom states
1860
map_pos (array_like, optional): Genetic map positions
1861
1862
Returns:
1863
ndarray: ROH state predictions
1864
"""
1865
```
1866
{ .api }
1867
1868
## Windowed Analysis
1869
1870
### Window Functions
1871
1872
Apply statistical methods over genomic windows.
1873
1874
```python
1875
# Create equally-sized windows
1876
windows = allel.equally_accessible_windows(positions, size=10000)
1877
1878
# Apply any statistic to windows
1879
window_stats = allel.windowed_statistic(positions, values, np.mean, windows)
1880
1881
# Moving window statistics
1882
moving_stats = allel.moving_statistic(values, np.mean, size=100, step=50)
1883
```
1884
{ .api }
1885
1886
```python
1887
def windowed_statistic(pos, values, statistic, windows, fill=None):
1888
"""
1889
Apply a statistic function to values within genomic windows.
1890
1891
Args:
1892
pos (array_like): Positions corresponding to values
1893
values (array_like): Values to calculate statistics on
1894
statistic (callable): Function to apply (e.g., np.mean, np.sum)
1895
windows (array_like): Window coordinates as (start, stop) pairs
1896
fill (float, optional): Fill value for empty windows
1897
1898
Returns:
1899
ndarray: Statistic values per window
1900
"""
1901
1902
def equally_accessible_windows(pos, size, start=None, stop=None):
1903
"""
1904
Create equally-sized genomic windows.
1905
1906
Args:
1907
pos (array_like): Variant positions to determine window bounds
1908
size (int): Window size in base pairs
1909
start (int, optional): Start coordinate
1910
stop (int, optional): Stop coordinate
1911
1912
Returns:
1913
ndarray: Window coordinates as (start, stop) pairs
1914
"""
1915
1916
def moving_statistic(values, statistic, size, step=1):
1917
"""
1918
Apply statistic over a moving window.
1919
1920
Args:
1921
values (array_like): Input values
1922
statistic (callable): Function to apply
1923
size (int): Window size
1924
step (int): Step size between windows
1925
1926
Returns:
1927
ndarray: Statistic values per window
1928
"""
1929
```
1930
{ .api }
1931
1932
## Miscellaneous Functions
1933
1934
### Utility Functions
1935
1936
Additional statistical utilities and visualization tools.
1937
1938
```python
1939
# Variant locator plot for exploring data
1940
allel.plot_variant_locator(genotypes, positions, start=1000, stop=2000)
1941
1942
# Tabulate state transitions (for HMM analysis)
1943
transitions = allel.tabulate_state_transitions(states)
1944
1945
# Tabulate state blocks
1946
blocks = allel.tabulate_state_blocks(states)
1947
```
1948
{ .api }
1949
1950
```python
1951
def plot_variant_locator(gn, pos, start=None, stop=None, ax=None):
1952
"""
1953
Plot variant locator for exploring genotype data.
1954
1955
Args:
1956
gn (GenotypeArray): Genotype data
1957
pos (array_like): Variant positions
1958
start (int, optional): Start position for plotting
1959
stop (int, optional): Stop position for plotting
1960
ax (matplotlib axis, optional): Axis to plot on
1961
1962
Returns:
1963
matplotlib figure: Variant locator plot
1964
"""
1965
1966
def tabulate_state_transitions(states, pos=None):
1967
"""
1968
Tabulate transitions between states in a sequence.
1969
1970
Args:
1971
states (array_like): State sequence
1972
pos (array_like, optional): Positions corresponding to states
1973
1974
Returns:
1975
DataFrame: Transition table with counts and positions
1976
"""
1977
1978
def tabulate_state_blocks(states, pos=None):
1979
"""
1980
Tabulate contiguous blocks of the same state.
1981
1982
Args:
1983
states (array_like): State sequence
1984
pos (array_like, optional): Positions corresponding to states
1985
1986
Returns:
1987
DataFrame: Block table with start/stop positions and lengths
1988
"""
1989
```
1990
{ .api }