# Data Structures

Scikit-allel provides specialized array classes for representing genetic variation data with efficient operations and memory management. These data structures are the foundation for all genetic analyses in the library.

## Core Array Classes

### GenotypeArray

A specialized NumPy array for storing genotype data with shape `(n_variants, n_samples, ploidy)`.

```python
import allel
import numpy as np

# Create from list/array data
genotypes = [[[0, 0], [0, 1]], [[1, 1], [1, 0]]]  # 2 variants, 2 samples, diploid
g = allel.GenotypeArray(genotypes)

# Properties
print(g.n_variants)  # Number of variants
print(g.n_samples)   # Number of samples
print(g.ploidy)      # Ploidy (e.g., 2 for diploid)
print(g.shape)       # Array shape
print(g.dtype)       # Data type
```
{ .api }

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all samples.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts with shape (n_variants, n_alleles)
    """

def to_haplotypes(self):
    """
    Convert genotypes to haplotype representation.

    Returns:
        HaplotypeArray: Haplotype array with shape (n_variants, n_haplotypes)
    """

@property
def is_phased(self):
    """
    Boolean array marking which genotype calls are phased.

    Returns:
        ndarray or None: Boolean array with shape (n_variants, n_samples),
            or None if phase information has not been set
    """

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): New allele values for each position

    Returns:
        GenotypeArray: Array with remapped alleles
    """

def subset(self, sel0=None, sel1=None):
    """
    Select subset of variants and/or samples.

    Args:
        sel0 (array_like, optional): Variant selection
        sel1 (array_like, optional): Sample selection

    Returns:
        GenotypeArray: Subset of original array
    """

def count_het(self, allele=None, axis=None):
    """
    Count heterozygous genotypes.

    Args:
        allele (int, optional): Specific allele to check heterozygosity
        axis (int, optional): Axis to count over; None counts over the
            whole array, axis=1 gives counts per variant

    Returns:
        int or ndarray: Heterozygous count, overall or along the given axis
    """

def count_hom_alt(self, axis=None):
    """
    Count homozygous alternate genotypes.

    Args:
        axis (int, optional): Axis to count over; None counts over the
            whole array, axis=1 gives counts per variant

    Returns:
        int or ndarray: Homozygous alternate count, overall or along the given axis
    """

def is_het(self, allele=None):
    """
    Determine which genotypes are heterozygous.

    Args:
        allele (int, optional): Specific allele to check

    Returns:
        ndarray: Boolean array indicating heterozygous genotypes
    """

def is_hom_alt(self):
    """
    Determine which genotypes are homozygous alternate.

    Returns:
        ndarray: Boolean array indicating homozygous alternate genotypes
    """

def is_called(self):
    """
    Determine which genotypes are called (not missing).

    Returns:
        ndarray: Boolean array indicating called genotypes
    """
```
{ .api }

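To make the semantics of these methods concrete, the following NumPy-only sketch reproduces allele counting, called-genotype detection, and heterozygosity detection on a toy array. It assumes the convention that a missing allele call is coded as `-1`; the helper functions are illustrative stand-ins written for this example, not scikit-allel's own implementation.

```python
import numpy as np

# Toy genotypes: 3 variants x 2 samples x ploidy 2; -1 marks a missing allele call.
g = np.array([
    [[0, 0], [0, 1]],
    [[1, 1], [1, 0]],
    [[0, 2], [-1, -1]],
], dtype="i1")

def count_alleles(g, max_allele=None):
    """Count each allele per variant, ignoring missing (-1) calls."""
    if max_allele is None:
        max_allele = int(g.max())
    flat = g.reshape(g.shape[0], -1)  # (n_variants, n_samples * ploidy)
    return np.stack(
        [(flat == allele).sum(axis=1) for allele in range(max_allele + 1)],
        axis=1,
    )

def is_called(g):
    """A genotype is called when none of its alleles is missing."""
    return (g >= 0).all(axis=2)

def is_het(g):
    """A called genotype is heterozygous when its alleles are not all equal."""
    return is_called(g) & (g != g[:, :, :1]).any(axis=2)

ac = count_alleles(g)                    # shape (n_variants, n_alleles)
het_per_variant = is_het(g).sum(axis=1)  # one count per variant
haplotypes = g.reshape(g.shape[0], -1)   # flatten samples and ploidy into one axis
```

Reshaping to `(n_variants, n_samples * ploidy)` is also the layout a haplotype view uses, which is why allele counting works on the flattened array.
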
### HaplotypeArray

A specialized array for haplotype data with shape `(n_variants, n_haplotypes)`.

```python
# Create haplotype array
haplotypes = [[0, 1, 0, 1], [1, 0, 1, 0]]  # 2 variants, 4 haplotypes
h = allel.HaplotypeArray(haplotypes)

# Convert from genotypes
g = allel.GenotypeArray([[[0, 1], [1, 0]]])
h = g.to_haplotypes()
```
{ .api }

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all haplotypes.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts per variant
    """

def distinct_frequencies(self):
    """
    Calculate frequencies of distinct haplotypes.

    Returns:
        ndarray: Frequency of each distinct haplotype, most common first
    """

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): New allele values

    Returns:
        HaplotypeArray: Array with remapped alleles
    """
```
{ .api }

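As an illustration of what a distinct-haplotype frequency calculation involves, the sketch below treats each column of a small array as one haplotype and counts identical columns with NumPy. The function is a hypothetical helper for this example and makes no claim about how the library computes its result.

```python
import numpy as np

# 3 variants x 4 haplotypes; each column is one haplotype.
h = np.array([
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
], dtype="i1")

def distinct_frequencies(h):
    """Frequency of each distinct haplotype (column), most common first."""
    # With axis=1, np.unique treats every column as a single item.
    _, counts = np.unique(h, axis=1, return_counts=True)
    counts = np.sort(counts)[::-1]
    return counts / h.shape[1]

freqs = distinct_frequencies(h)
```

Here the first two columns are identical, so one haplotype accounts for half of the sample and the remaining two for a quarter each.
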
### AlleleCountsArray

Array for storing allele count data with shape `(n_variants, n_alleles)`.

```python
# Create from genotype array
g = allel.GenotypeArray([[[0, 0], [0, 1]], [[1, 1], [1, 0]]])
ac = g.count_alleles()

# Direct creation
counts = [[3, 1], [2, 2]]  # 2 variants, counts for alleles 0,1
ac = allel.AlleleCountsArray(counts)
```
{ .api }

**Key Methods:**

```python
def count_segregating(self):
    """
    Count number of segregating (polymorphic) variants.

    Returns:
        int: Number of segregating variants
    """

def is_segregating(self):
    """
    Determine which variants are segregating.

    Returns:
        ndarray: Boolean array indicating segregating variants
    """

def count_singleton(self, allele=1):
    """
    Count singleton variants (exactly one copy of the given allele).

    Args:
        allele (int, optional): Allele index to check

    Returns:
        int: Number of singleton variants
    """

def allelism(self):
    """
    Determine number of alleles per variant.

    Returns:
        ndarray: Number of alleles observed per variant
    """

def max_allele(self):
    """
    Find maximum allele number per variant.

    Returns:
        ndarray: Maximum allele number per variant
    """

def to_frequencies(self, fill=np.nan):
    """
    Convert counts to frequencies.

    Args:
        fill (float): Value to use where a variant has no allele calls

    Returns:
        ndarray: Allele frequencies
    """
```
{ .api }

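The quantities above all derive from the same `(n_variants, n_alleles)` count matrix. The following NumPy sketch shows one way to compute them by hand; the helper `to_frequencies` and the variable names are assumptions made for this example, and the code is not the library's implementation.

```python
import numpy as np

# Allele counts: 4 variants x 3 alleles (columns are alleles 0, 1, 2).
ac = np.array([
    [3, 1, 0],
    [4, 0, 0],
    [2, 1, 1],
    [0, 0, 0],  # no allele calls at this variant
])

def to_frequencies(ac, fill):
    """Divide each row by its total; rows with no calls are set to `fill`."""
    an = ac.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(an > 0, ac / an, fill)

allelism = (ac > 0).sum(axis=1)  # number of distinct alleles observed
is_segregating = allelism > 1    # more than one allele present
is_singleton = ac[:, 1] == 1     # exactly one copy of allele 1
af = to_frequencies(ac, fill=0)
```

The last row shows why a fill value is needed: a variant with no calls has a total of zero, so its frequencies are undefined unless a placeholder is chosen.
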
## Array Creation Functions

### Genotype Array Creation

```python
def GenotypeArray(data, dtype=None):
    """
    Create a genotype array from input data.

    Args:
        data (array_like): Genotype data with shape (n_variants, n_samples, ploidy)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        GenotypeArray: Genotype array instance
    """
```
{ .api }

### Haplotype Array Creation

```python
def HaplotypeArray(data, dtype=None):
    """
    Create a haplotype array from input data.

    Args:
        data (array_like): Haplotype data with shape (n_variants, n_haplotypes)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        HaplotypeArray: Haplotype array instance
    """
```
{ .api }

### Allele Counts Array Creation

```python
def AlleleCountsArray(data, dtype=None):
    """
    Create an allele counts array from input data.

    Args:
        data (array_like): Allele count data with shape (n_variants, n_alleles)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        AlleleCountsArray: Allele counts array instance
    """
```
{ .api }

## Structured Arrays

### VariantTable

Structured array for storing variant metadata.

```python
# Create variant table with metadata
variants = allel.VariantTable({
    'CHROM': ['1', '1', '2'],
    'POS': [100, 200, 150],
    'REF': ['A', 'G', 'T'],
    'ALT': ['T', 'A', 'C']
})

# Access fields
positions = variants['POS']
chromosomes = variants['CHROM']
```
{ .api }

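For readers unfamiliar with structured arrays, the sketch below builds the same kind of table with a plain NumPy record array: one named, typed column per field, with rows selectable by a boolean mask. It uses only NumPy and is meant to illustrate the underlying idea, not the behavior of `VariantTable` itself.

```python
import numpy as np

# One array per field; names become the column labels.
variants = np.rec.fromarrays(
    [["1", "1", "2"], [100, 200, 150], ["A", "G", "T"], ["T", "A", "C"]],
    names="CHROM,POS,REF,ALT",
)

positions = variants["POS"]                      # column access by field name
on_chrom1 = variants[variants["CHROM"] == "1"]   # row selection by boolean mask
```
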
### FeatureTable

Structured array for genomic features like genes and exons.

```python
# Create feature table
features = allel.FeatureTable({
    'seqid': ['1', '1', '2'],
    'start': [1000, 2000, 1500],
    'end': [2000, 3000, 2500],
    'type': ['gene', 'exon', 'gene']
})
```
{ .api }

## Chunked Arrays

For datasets too large to hold in memory, scikit-allel provides chunked versions of the core array classes backed by Dask.

### GenotypeDaskArray

```python
import dask.array as da

# `data` is assumed to be an existing array-like (e.g., NumPy, HDF5, or Zarr)
# with shape (n_variants, n_samples, ploidy)

# Create chunked genotype array
chunks = (1000, 100, 2)  # chunk sizes for (variants, samples, ploidy)
g_chunked = allel.GenotypeDaskArray(data, chunks=chunks)

# Compute results
ac = g_chunked.count_alleles().compute()
```
{ .api }

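The idea behind chunked computation can be sketched without Dask: because allele counts are computed independently per variant, the variant axis can be processed in fixed-size blocks and the partial results concatenated. The functions and the random toy data below are hypothetical and only illustrate that principle; Dask additionally handles scheduling, laziness, and out-of-core storage.

```python
import numpy as np

def count_alleles_block(block, n_alleles):
    """Per-variant allele counts for one block of genotypes."""
    flat = block.reshape(block.shape[0], -1)
    return np.stack([(flat == a).sum(axis=1) for a in range(n_alleles)], axis=1)

def count_alleles_chunked(g, chunk_size, n_alleles):
    """Walk the variant axis in fixed-size blocks and concatenate the results."""
    parts = [
        count_alleles_block(g[start:start + chunk_size], n_alleles)
        for start in range(0, g.shape[0], chunk_size)
    ]
    return np.concatenate(parts, axis=0)

# Hypothetical toy data: 2500 variants, 10 diploid samples, alleles 0 and 1.
rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=(2500, 10, 2), dtype=np.int8)

ac_chunked = count_alleles_chunked(g, chunk_size=1000, n_alleles=2)
ac_whole = count_alleles_block(g, n_alleles=2)
```

Only one block needs to be in memory at a time, and the chunked result matches the whole-array result row for row.
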
### HaplotypeDaskArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeDaskArray(data, chunks=(1000, 200))
```
{ .api }

### AlleleCountsDaskArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsDaskArray(data, chunks=(1000, 4))
```
{ .api }

## Chunked Arrays (Legacy)

Scikit-allel also provides chunked array implementations backed by Zarr or HDF5 storage. These are maintained for backwards compatibility; prefer Dask arrays for new code.

### GenotypeChunkedArray

```python
# Create chunked genotype array with Zarr storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='zarr')

# With HDF5 storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='hdf5')
```
{ .api }

### HaplotypeChunkedArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeChunkedArray(data, chunks=(1000, 200), storage='zarr')
```
{ .api }

### AlleleCountsChunkedArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsChunkedArray(data, chunks=(1000, 4), storage='zarr')
```
{ .api }

### GenotypeAlleleCountsChunkedArray

```python
# Create chunked genotype allele counts array
gac_chunked = allel.GenotypeAlleleCountsChunkedArray(data, chunks=(1000, 100, 4), storage='zarr')
```
{ .api }

## Index Classes

### UniqueIndex

Index for unique identifiers with efficient lookup.

```python
# Create unique index
sample_ids = ['sample_1', 'sample_2', 'sample_3']
idx = allel.UniqueIndex(sample_ids)

# Lookup by value
pos = idx.locate_key('sample_2')  # Returns position
```
{ .api }

### SortedIndex

Index for sorted values with range queries.

```python
# Create sorted index for positions
positions = [100, 200, 300, 400, 500]
idx = allel.SortedIndex(positions)

# Range queries
selection = idx.locate_range(150, 350)  # Slice covering positions within 150..350
```
{ .api }

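Range queries on sorted values reduce to two binary searches. The sketch below shows that technique with `np.searchsorted`; the function name mirrors the method above for readability, but the code is a standalone illustration, and raising `KeyError` on an empty range is a choice made for this example.

```python
import numpy as np

positions = np.array([100, 200, 300, 400, 500])

def locate_range(sorted_values, start, stop):
    """Return a slice selecting values in the closed interval [start, stop]."""
    lo = np.searchsorted(sorted_values, start, side="left")
    hi = np.searchsorted(sorted_values, stop, side="right")
    if lo == hi:
        raise KeyError((start, stop))  # nothing falls inside the range
    return slice(int(lo), int(hi))

loc = locate_range(positions, 150, 350)
selected = positions[loc]
```

Returning a slice rather than a boolean mask means the same selection can be applied cheaply to any array aligned with the positions, such as a genotype array's variant axis.
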
### SortedMultiIndex

Multi-level index for hierarchical data.

```python
# Create multi-level index (chromosome, position)
chroms = ['1', '1', '1', '2', '2']
positions = [100, 200, 300, 150, 250]
idx = allel.SortedMultiIndex(chroms, positions)  # the two levels are separate arguments

# Query specific chromosome
chrom1_variants = idx.locate_key('1')
```
{ .api }

### ChromPosIndex

Specialized index for chromosome-position data.

```python
# Create chromosome-position index
chroms = ['1', '1', '2', '2']
positions = [100, 200, 150, 250]
idx = allel.ChromPosIndex(chroms, positions)

# Query by chromosome and position range
variants = idx.locate_range('1', 50, 150)
```
{ .api }

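A chromosome-position lookup can be thought of as two steps: find the block of rows belonging to the chromosome, then binary-search the positions inside that block. The sketch below implements this with NumPy, assuming rows are grouped by chromosome and positions are sorted within each chromosome. Both helpers are hypothetical and written for this example only.

```python
import numpy as np

chroms = np.array(["1", "1", "2", "2"])
positions = np.array([100, 200, 150, 250])

def chrom_slices(chroms):
    """Map each chromosome to the slice of rows it occupies (rows must be grouped)."""
    out = {}
    start = 0
    for i in range(1, len(chroms) + 1):
        if i == len(chroms) or chroms[i] != chroms[start]:
            out[str(chroms[start])] = slice(start, i)
            start = i
    return out

def locate_range(chroms, positions, chrom, start, stop):
    """Slice of rows on `chrom` whose position lies in [start, stop]."""
    block = chrom_slices(chroms)[chrom]
    pos = positions[block]
    lo = np.searchsorted(pos, start, side="left")
    hi = np.searchsorted(pos, stop, side="right")
    return slice(block.start + int(lo), block.start + int(hi))

loc = locate_range(chroms, positions, "1", 50, 150)
```

The offsets found inside the block are shifted by `block.start`, so the returned slice indexes the full arrays directly.
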
## Vector Classes

### GenotypeVector

Genotype calls along a single dimension, either one variant across samples or one sample across variants, with shape `(n_calls, ploidy)`.

```python
# Create genotype vector for a single variant across samples
genotypes_1d = [[0, 1], [1, 0]]  # 2 calls, diploid
gv = allel.GenotypeVector(genotypes_1d)

# Properties
print(gv.n_calls)  # Number of genotype calls
print(gv.shape)    # Vector shape
```
{ .api }

### GenotypeAlleleCountsVector

Genotype calls for a single variant or sample, encoded as per-call allele counts with shape `(n_calls, n_alleles)`.

```python
# Create genotype allele counts vector: one row per call, one column per allele
counts_1d = [[2, 0], [1, 1], [0, 2]]  # 3 diploid calls, counts for alleles 0, 1
acv = allel.GenotypeAlleleCountsVector(counts_1d)

# Analysis methods
print(acv.is_het())     # Which calls are heterozygous
print(acv.is_called())  # Which calls are non-missing
```
{ .api }

## Memory-Efficient Storage