Tessl Tile for pypi/scikit-allel@1.3.0

# Data Structures

Scikit-allel provides specialized array classes for representing genetic variation data with efficient operations and memory management. These data structures are the foundation for all genetic analyses in the library.

## Core Array Classes

### GenotypeArray

A specialized numpy array for storing genotype data with shape `(n_variants, n_samples, ploidy)`.

```python
import allel
import numpy as np

# Create from list/array data
genotypes = [[[0, 0], [0, 1]], [[1, 1], [1, 0]]]  # 2 variants, 2 samples, diploid
g = allel.GenotypeArray(genotypes)

# Properties
print(g.n_variants)  # Number of variants
print(g.n_samples)   # Number of samples
print(g.ploidy)      # Ploidy (e.g., 2 for diploid)
print(g.shape)       # Array shape
print(g.dtype)       # Data type
```
{ .api }

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all samples.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts with shape (n_variants, n_alleles)
    """

def to_haplotypes(self):
    """
    Convert genotypes to haplotype representation.

    Returns:
        HaplotypeArray: Haplotype array with shape (n_variants, n_haplotypes)
    """

@property
def is_phased(self):
    """
    Phasing status of each genotype call (a property, not a method).

    Returns:
        ndarray or None: Boolean array with shape (n_variants, n_samples),
            True where the call is phased; None if no phasing information is set
    """

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): One row per variant; column i holds the new
            value for allele i at that variant

    Returns:
        GenotypeArray: Array with remapped alleles
    """

def subset(self, sel0=None, sel1=None):
    """
    Select subset of variants and/or samples.

    Args:
        sel0 (array_like, optional): Variant selection
        sel1 (array_like, optional): Sample selection

    Returns:
        GenotypeArray: Subset of original array
    """

def count_het(self, allele=None, axis=None):
    """
    Count heterozygous genotypes.

    Args:
        allele (int, optional): Specific allele to check heterozygosity
        axis (int, optional): Axis to sum over; axis=1 gives counts per variant

    Returns:
        int or ndarray: Total count if axis is None, otherwise counts along axis
    """

def count_hom_alt(self, axis=None):
    """
    Count homozygous alternate genotypes.

    Args:
        axis (int, optional): Axis to sum over; axis=1 gives counts per variant

    Returns:
        int or ndarray: Total count if axis is None, otherwise counts along axis
    """

def is_het(self, allele=None):
    """
    Determine which genotypes are heterozygous.

    Args:
        allele (int, optional): Specific allele to check

    Returns:
        ndarray: Boolean array indicating heterozygous genotypes
    """

def is_hom_alt(self):
    """
    Determine which genotypes are homozygous alternate.

    Returns:
        ndarray: Boolean array indicating homozygous alternate genotypes
    """

def is_called(self):
    """
    Determine which genotypes are called (not missing).

    Returns:
        ndarray: Boolean array indicating called genotypes
    """
```
{ .api }

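To make the `(n_variants, n_samples, ploidy)` layout concrete, the following plain NumPy sketch reproduces the kind of result that allele counting and heterozygosity checks produce on such an array. It is an illustration only, not scikit-allel's implementation: the helper names `count_alleles_sketch` and `is_het_sketch` are hypothetical, and missing calls are assumed to be coded as negative values.

```python
import numpy as np

def count_alleles_sketch(g, n_alleles):
    """Count occurrences of each allele per variant (hypothetical helper).

    Negative values are assumed to mean "missing" and are never counted.
    """
    g = np.asarray(g)
    flat = g.reshape(g.shape[0], -1)  # (n_variants, n_samples * ploidy)
    counts = np.zeros((g.shape[0], n_alleles), dtype=int)
    for allele in range(n_alleles):
        counts[:, allele] = np.sum(flat == allele, axis=1)
    return counts

def is_het_sketch(g):
    """True where a call is fully called and carries at least two different alleles."""
    g = np.asarray(g)
    called = np.all(g >= 0, axis=-1)
    differs = np.any(g != g[..., :1], axis=-1)
    return called & differs

# 3 variants, 2 diploid samples; the last call is missing
genotypes = np.array([[[0, 0], [0, 1]],
                      [[1, 1], [1, 0]],
                      [[0, 1], [-1, -1]]])

ac = count_alleles_sketch(genotypes, n_alleles=2)  # shape (n_variants, n_alleles)
het = is_het_sketch(genotypes)                     # shape (n_variants, n_samples)
het_per_variant = het.sum(axis=1)                  # summing over the sample axis
```

Summing the boolean result over axis 1 collapses the sample dimension, which is how a per-call mask turns into a per-variant count.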
### HaplotypeArray

A specialized array for haplotype data with shape `(n_variants, n_haplotypes)`.

```python
# Create haplotype array
haplotypes = [[0, 1, 0, 1], [1, 0, 1, 0]]  # 2 variants, 4 haplotypes
h = allel.HaplotypeArray(haplotypes)

# Convert from genotypes
g = allel.GenotypeArray([[[0, 1], [1, 0]]])
h = g.to_haplotypes()
```
{ .api }

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all haplotypes.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts per variant
    """

def distinct_frequencies(self):
    """
    Calculate frequencies of distinct haplotypes.

    Returns:
        ndarray: Frequency of each distinct haplotype
    """

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): One row per variant; column i holds the new
            value for allele i at that variant

    Returns:
        HaplotypeArray: Array with remapped alleles
    """
```
{ .api }

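The relationship between the genotype and haplotype layouts can be sketched with plain NumPy. The example below assumes haplotypes are laid out sample by sample, so that flattening the sample and ploidy axes yields `n_samples * ploidy` columns; it is a conceptual sketch rather than the library's own code, and it uses `np.unique` to tally distinct haplotypes.

```python
import numpy as np

# Genotypes: 3 variants, 2 diploid samples
g = np.array([[[0, 1], [0, 1]],
              [[1, 0], [1, 0]],
              [[0, 0], [0, 0]]])

# Flatten the sample and ploidy axes: (n_variants, n_samples * ploidy)
h = g.reshape(g.shape[0], -1)

# Each column of h is one haplotype; tally how often each distinct column occurs
distinct, counts = np.unique(h.T, axis=0, return_counts=True)
freqs = counts / h.shape[1]
```

Here two distinct haplotypes are each carried twice among the four columns, so both have frequency 0.5.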
### AlleleCountsArray

Array for storing allele count data with shape `(n_variants, n_alleles)`.

```python
# Create from genotype array
g = allel.GenotypeArray([[[0, 0], [0, 1]], [[1, 1], [1, 0]]])
ac = g.count_alleles()

# Direct creation
counts = [[3, 1], [2, 2]]  # 2 variants, counts for alleles 0,1
ac = allel.AlleleCountsArray(counts)
```
{ .api }

**Key Methods:**

```python
def count_segregating(self):
    """
    Count number of segregating (polymorphic) variants.

    Returns:
        int: Number of segregating variants
    """

def is_segregating(self):
    """
    Determine which variants are segregating.

    Returns:
        ndarray: Boolean array indicating segregating variants
    """

def count_singleton(self, allele=1):
    """
    Count variants carrying exactly one copy of the given allele.

    Args:
        allele (int): Allele index to check

    Returns:
        int: Number of singleton variants
    """

def allelism(self):
    """
    Determine number of alleles per variant.

    Returns:
        ndarray: Number of alleles observed per variant
    """

def max_allele(self):
    """
    Find maximum allele number per variant.

    Returns:
        ndarray: Maximum allele number per variant
    """

def to_frequencies(self, fill=np.nan):
    """
    Convert counts to frequencies.

    Args:
        fill (float): Value to use where a variant has no called alleles

    Returns:
        ndarray: Allele frequencies
    """
```
{ .api }

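The summaries described above all derive from the same `(n_variants, n_alleles)` count matrix. The NumPy sketch below shows one way such quantities can be computed; it is not scikit-allel's implementation, and the choice of `NaN` as the fill value for variants with no called alleles is an assumption made for this example.

```python
import numpy as np

# 4 variants, up to 3 alleles; the last variant has no called alleles
ac = np.array([[3, 1, 0],
               [4, 0, 0],
               [2, 1, 1],
               [0, 0, 0]])

an = ac.sum(axis=1)                # total called alleles per variant
allelism = np.sum(ac > 0, axis=1)  # number of distinct alleles observed
is_segregating = allelism > 1      # more than one allele present
n_segregating = int(is_segregating.sum())

# Frequencies, leaving NaN where the total count is zero
freqs = np.full(ac.shape, np.nan)
np.divide(ac, an[:, None], out=freqs, where=an[:, None] > 0)
```

Guarding the division with `where=` avoids divide-by-zero warnings for variants where every sample is missing.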
## Array Creation Functions

### Genotype Array Creation

```python
def GenotypeArray(data, dtype=None):
    """
    Create a genotype array from input data.

    Args:
        data (array_like): Genotype data with shape (n_variants, n_samples, ploidy)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        GenotypeArray: Genotype array instance
    """
```
{ .api }

### Haplotype Array Creation

```python
def HaplotypeArray(data, dtype=None):
    """
    Create a haplotype array from input data.

    Args:
        data (array_like): Haplotype data with shape (n_variants, n_haplotypes)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        HaplotypeArray: Haplotype array instance
    """
```
{ .api }

### Allele Counts Array Creation

```python
def AlleleCountsArray(data, dtype=None):
    """
    Create an allele counts array from input data.

    Args:
        data (array_like): Allele count data with shape (n_variants, n_alleles)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        AlleleCountsArray: Allele counts array instance
    """
```
{ .api }

## Structured Arrays

### VariantTable

Structured array for storing variant metadata.

```python
# Create variant table with metadata
variants = allel.VariantTable({
    'CHROM': ['1', '1', '2'],
    'POS': [100, 200, 150],
    'REF': ['A', 'G', 'T'],
    'ALT': ['T', 'A', 'C']
})

# Access fields
positions = variants['POS']
chromosomes = variants['CHROM']
```
{ .api }

### FeatureTable

Structured array for genomic features like genes and exons.

```python
# Create feature table
features = allel.FeatureTable({
    'seqid': ['1', '1', '2'],
    'start': [1000, 2000, 1500],
    'end': [2000, 3000, 2500],
    'type': ['gene', 'exon', 'gene']
})
```
{ .api }

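For readers unfamiliar with structured arrays, the sketch below builds a comparable table directly with NumPy record arrays. It only illustrates the general idea of named, typed columns sharing one row index; it does not use or reproduce the table classes above.

```python
import numpy as np

# One array per column, combined into a single record array
variants = np.rec.fromarrays(
    [
        np.array(['1', '1', '2']),
        np.array([100, 200, 150]),
        np.array(['A', 'G', 'T']),
        np.array(['T', 'A', 'C']),
    ],
    names=['CHROM', 'POS', 'REF', 'ALT'],
)

# Column access by field name
positions = variants['POS']

# Row selection with a boolean mask keeps all fields aligned
on_chrom1 = variants[variants['CHROM'] == '1']
```

Selecting rows with a mask returns another record array, so every field stays in step with the selected variants.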
## Chunked Arrays

For large datasets, scikit-allel provides Dask-backed versions of the core arrays, which evaluate operations chunk by chunk.

### GenotypeDaskArray

```python
import dask.array as da

# `data` is an existing array-like of genotype calls
# with shape (n_variants, n_samples, ploidy)

# Create chunked genotype array
chunks = (1000, 100, 2)  # chunk sizes for (variants, samples, ploidy)
g_chunked = allel.GenotypeDaskArray(data, chunks=chunks)

# Compute results
ac = g_chunked.count_alleles().compute()
```
{ .api }

### HaplotypeDaskArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeDaskArray(data, chunks=(1000, 200))
```
{ .api }

### AlleleCountsDaskArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsDaskArray(data, chunks=(1000, 4))
```
{ .api }

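The idea behind chunked evaluation can be sketched without Dask: process a block of variants at a time and write each block's result into the output, so only one block needs to be in memory at once. The helper `count_alleles_chunked` below is hypothetical and uses plain NumPy; Dask schedules comparable per-chunk work lazily rather than in an explicit loop.

```python
import numpy as np

def count_alleles_chunked(g, n_alleles, chunk_size):
    """Count alleles per variant, reading `chunk_size` variants at a time (hypothetical helper)."""
    n_variants = g.shape[0]
    out = np.zeros((n_variants, n_alleles), dtype=int)
    for start in range(0, n_variants, chunk_size):
        stop = min(start + chunk_size, n_variants)
        block = np.asarray(g[start:stop])          # only this block is materialized
        flat = block.reshape(block.shape[0], -1)   # (block_variants, n_samples * ploidy)
        for allele in range(n_alleles):
            out[start:stop, allele] = np.sum(flat == allele, axis=1)
    return out

# Small random example: 10 variants, 4 diploid samples, alleles 0 and 1
rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=(10, 4, 2))

ac = count_alleles_chunked(g, n_alleles=2, chunk_size=3)
ac_whole = count_alleles_chunked(g, n_alleles=2, chunk_size=10)
```

Because each variant's counts depend only on that variant's row, the chunk size affects memory use but not the result.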
## Chunked Arrays (Legacy)

Scikit-allel also provides chunked array implementations backed by Zarr or HDF5 storage. These are maintained for backwards compatibility; prefer Dask arrays for new code.

### GenotypeChunkedArray

```python
# Create chunked genotype array with Zarr storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='zarr')

# With HDF5 storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='hdf5')
```
{ .api }

### HaplotypeChunkedArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeChunkedArray(data, chunks=(1000, 200), storage='zarr')
```
{ .api }

### AlleleCountsChunkedArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsChunkedArray(data, chunks=(1000, 4), storage='zarr')
```
{ .api }

### GenotypeAlleleCountsChunkedArray

```python
# Create chunked genotype allele counts array
gac_chunked = allel.GenotypeAlleleCountsChunkedArray(data, chunks=(1000, 100, 4), storage='zarr')
```
{ .api }

## Index Classes

### UniqueIndex

Index for unique identifiers with efficient lookup.

```python
# Create unique index
sample_ids = ['sample_1', 'sample_2', 'sample_3']
idx = allel.UniqueIndex(sample_ids)

# Lookup by value
pos = idx.locate_key('sample_2')  # Returns position
```
{ .api }

### SortedIndex

Index for sorted values with range queries.

```python
# Create sorted index for positions
positions = [100, 200, 300, 400, 500]
idx = allel.SortedIndex(positions)

# Range queries
selection = idx.locate_range(150, 350)  # Find positions in range
```
{ .api }

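Range queries on sorted values reduce to two binary searches. The sketch below uses Python's standard `bisect` module to show the principle; `locate_range_sketch` is a hypothetical helper, and both the inclusive bounds and the `KeyError` on an empty range are choices made for this example rather than documented library behavior.

```python
from bisect import bisect_left, bisect_right

def locate_range_sketch(sorted_values, start, stop):
    """Return a slice covering values v with start <= v <= stop (hypothetical helper)."""
    lo = bisect_left(sorted_values, start)   # first index with value >= start
    hi = bisect_right(sorted_values, stop)   # first index with value > stop
    if lo >= hi:
        raise KeyError((start, stop))        # nothing falls inside the range
    return slice(lo, hi)

positions = [100, 200, 300, 400, 500]
sel = locate_range_sketch(positions, 150, 350)
hits = positions[sel]
```

Returning a slice rather than a list of indices means the same selection can be applied cheaply to any array aligned with the positions.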
### SortedMultiIndex

Two-level index over a pair of sorted arrays, such as chromosome and position.

```python
# Create multi-level index (chromosome, position)
chroms = ['1', '1', '1', '2', '2']
positions = [100, 200, 300, 150, 250]
idx = allel.SortedMultiIndex(chroms, positions)  # levels are passed as separate arguments

# Query specific chromosome
chrom1_variants = idx.locate_key('1')
```
{ .api }

### ChromPosIndex

Specialized index for chromosome-position data.

```python
# Create chromosome-position index
chroms = ['1', '1', '2', '2']
positions = [100, 200, 150, 250]
idx = allel.ChromPosIndex(chroms, positions)

# Query by chromosome and position range
variants = idx.locate_range('1', 50, 150)
```
{ .api }

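A chromosome-position lookup can be thought of as two steps: find the block of rows belonging to the chromosome, then binary-search positions within that block. The pure-Python sketch below illustrates this under the assumption that rows are grouped by chromosome and sorted by position within each group; the helper names are hypothetical and this is not the library's implementation.

```python
from bisect import bisect_left, bisect_right

def build_chrom_slices(chroms):
    """Map each chromosome to the slice of rows it occupies (rows grouped by chromosome)."""
    slices = {}
    start = 0
    for i in range(1, len(chroms) + 1):
        # Close the current block at the end of the list or when the chromosome changes
        if i == len(chroms) or chroms[i] != chroms[start]:
            slices[chroms[start]] = slice(start, i)
            start = i
    return slices

def locate_chrom_range(chroms, positions, chrom, start, stop):
    """Return a slice of rows on `chrom` with start <= position <= stop."""
    region = build_chrom_slices(chroms)[chrom]
    lo = bisect_left(positions, start, region.start, region.stop)
    hi = bisect_right(positions, stop, region.start, region.stop)
    return slice(lo, hi)

chroms = ['1', '1', '2', '2']
positions = [100, 200, 150, 250]

sel_chrom1 = locate_chrom_range(chroms, positions, '1', 50, 150)
sel_chrom2 = locate_chrom_range(chroms, positions, '2', 100, 300)
```

Restricting the binary search to one chromosome's block is what allows positions to restart from low values on each chromosome without breaking the sorted-order assumption.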
## Vector Classes

### GenotypeVector

Genotype calls for a single variant across samples (or a single sample across variants), with shape `(n_calls, ploidy)`.

```python
# Create genotype vector for a single variant across 4 diploid samples
calls = [[0, 0], [0, 1], [1, 1], [1, 0]]  # One row per sample
gv = allel.GenotypeVector(calls)

# Properties
print(gv.n_calls)  # Number of genotype calls
print(gv.shape)    # Vector shape (n_calls, ploidy)
```
{ .api }

### GenotypeAlleleCountsVector

Genotype calls for a single variant (or a single sample), stored as per-call allele counts with shape `(n_calls, n_alleles)`.

```python
# Create allele counts vector: one row per call, one column per allele
counts = [[2, 0, 0], [1, 1, 0], [0, 0, 2]]  # Counts for alleles 0, 1, 2
acv = allel.GenotypeAlleleCountsVector(counts)

# Analysis methods
print(acv.is_het())     # Which calls are heterozygous
print(acv.is_called())  # Which calls are non-missing
```
{ .api }

## Memory-Efficient Storage
