# Data Structures

Scikit-allel provides specialized array classes for representing genetic variation data with efficient operations and memory management. These data structures are the foundation for all genetic analyses in the library.

## Core Array Classes

### GenotypeArray

A specialized NumPy array for storing genotype data with shape `(n_variants, n_samples, ploidy)`.

```python
import allel
import numpy as np

# Create from list/array data
genotypes = [[[0, 0], [0, 1]], [[1, 1], [1, 0]]]  # 2 variants, 2 samples, diploid
g = allel.GenotypeArray(genotypes)

# Properties
print(g.n_variants)  # Number of variants
print(g.n_samples)   # Number of samples
print(g.ploidy)      # Ploidy (e.g., 2 for diploid)
print(g.shape)       # Array shape
print(g.dtype)       # Data type
```
{ .api }

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all samples.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts with shape (n_variants, n_alleles)
    """

def to_haplotypes(self):
    """
    Convert genotypes to haplotype representation.

    Returns:
        HaplotypeArray: Haplotype array with shape (n_variants, n_haplotypes)
    """

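@property
def is_phased(self):
    """
    Per-call phasing flags (a property, not a method).

    Returns:
        ndarray or None: Boolean array with shape (n_variants, n_samples) marking phased calls, or None if phasing information has not been set
    """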

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): New allele values for each position

    Returns:
        GenotypeArray: Array with remapped alleles
    """

def subset(self, sel0=None, sel1=None):
    """
    Select subset of variants and/or samples.

    Args:
        sel0 (array_like, optional): Variant selection
        sel1 (array_like, optional): Sample selection

    Returns:
        GenotypeArray: Subset of original array
    """

def count_het(self, allele=None, axis=None):
    """
    Count heterozygous genotypes.

    Args:
        allele (int, optional): Specific allele to check heterozygosity
        axis (int, optional): Axis to count over; axis=1 gives counts per variant

    Returns:
        int or ndarray: Total count if axis is None, otherwise counts along the given axis
    """

def count_hom_alt(self, axis=None):
    """
    Count homozygous alternate genotypes.

    Args:
        axis (int, optional): Axis to count over; axis=1 gives counts per variant

    Returns:
        int or ndarray: Total count if axis is None, otherwise counts along the given axis
    """

def is_het(self, allele=None):
    """
    Determine which genotypes are heterozygous.

    Args:
        allele (int, optional): Specific allele to check

    Returns:
        ndarray: Boolean array indicating heterozygous genotypes
    """

def is_hom_alt(self):
    """
    Determine which genotypes are homozygous alternate.

    Returns:
        ndarray: Boolean array indicating homozygous alternate genotypes
    """

def is_called(self):
    """
    Determine which genotypes are called (not missing).

    Returns:
        ndarray: Boolean array indicating called genotypes
    """
```
{ .api }
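To make the semantics of allele counting and heterozygosity checks concrete, the following is a minimal pure-Python sketch that works on a nested list with the same `(n_variants, n_samples, ploidy)` layout. The helper names `count_alleles_sketch` and `is_het_sketch` are hypothetical and are not part of scikit-allel. The sketch treats negative allele values as missing calls, and it ignores performance entirely.

```python
def count_alleles_sketch(genotypes, max_allele=None):
    """Count alleles per variant; returns rows shaped (n_variants, n_alleles)."""
    if max_allele is None:
        # Highest allele index observed anywhere (missing calls are negative).
        max_allele = max(a for variant in genotypes for call in variant for a in call)
    counts = []
    for variant in genotypes:
        row = [0] * (max_allele + 1)
        for call in variant:
            for allele in call:
                if 0 <= allele <= max_allele:  # skip missing (negative) alleles
                    row[allele] += 1
        counts.append(row)
    return counts


def is_het_sketch(genotypes):
    """True where a call is fully called and carries more than one distinct allele."""
    return [
        [min(call) >= 0 and len(set(call)) > 1 for call in variant]
        for variant in genotypes
    ]


genotypes = [[[0, 0], [0, 1]], [[1, 1], [1, 0]]]  # 2 variants, 2 samples, diploid
allele_counts = count_alleles_sketch(genotypes)
het_mask = is_het_sketch(genotypes)
print(allele_counts)  # [[3, 1], [1, 3]]
print(het_mask)       # [[False, True], [False, True]]
```

Summing each row of `het_mask` gives a per-variant heterozygote count, which is the kind of quantity the counting methods above report.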

### HaplotypeArray

A specialized array for haplotype data with shape `(n_variants, n_haplotypes)`.

```python
# Create haplotype array
haplotypes = [[0, 1, 0, 1], [1, 0, 1, 0]]  # 2 variants, 4 haplotypes
h = allel.HaplotypeArray(haplotypes)

# Convert from genotypes
g = allel.GenotypeArray([[[0, 1], [1, 0]]])
h = g.to_haplotypes()
```
{ .api }
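The relationship between the two layouts can be sketched in plain Python: converting genotypes to haplotypes amounts to flattening the sample and ploidy dimensions, so that `n_haplotypes = n_samples * ploidy`. The helper name `to_haplotypes_sketch` is hypothetical and only illustrates the reshaping, not the library's implementation.

```python
def to_haplotypes_sketch(genotypes):
    """Flatten (n_variants, n_samples, ploidy) into (n_variants, n_samples * ploidy)."""
    return [[allele for call in variant for allele in call] for variant in genotypes]


genotypes = [[[0, 1], [1, 0]], [[0, 0], [1, 1]]]  # 2 variants, 2 samples, diploid
haplotypes = to_haplotypes_sketch(genotypes)
print(haplotypes)  # [[0, 1, 1, 0], [0, 0, 1, 1]]
```

Note that the flattened columns are only meaningful as haplotypes when the underlying genotype calls are phased.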

**Key Methods:**

```python
def count_alleles(self, max_allele=None):
    """
    Count alleles per variant across all haplotypes.

    Args:
        max_allele (int, optional): Maximum allele number to count

    Returns:
        AlleleCountsArray: Allele counts per variant
    """

def distinct_frequencies(self):
    """
    Calculate frequencies of distinct haplotypes.

    Returns:
        ndarray: Frequency of each distinct haplotype, shape (n_distinct,)
    """

def map_alleles(self, mapping):
    """
    Remap allele values using provided mapping.

    Args:
        mapping (array_like): New allele values

    Returns:
        HaplotypeArray: Array with remapped alleles
    """
```
{ .api }

### AlleleCountsArray

Array for storing allele count data with shape `(n_variants, n_alleles)`.

```python
# Create from genotype array
g = allel.GenotypeArray([[[0, 0], [0, 1]], [[1, 1], [1, 0]]])
ac = g.count_alleles()

# Direct creation
counts = [[3, 1], [2, 2]]  # 2 variants, counts for alleles 0, 1
ac = allel.AlleleCountsArray(counts)
```
{ .api }

**Key Methods:**

```python
def count_segregating(self):
    """
    Count number of segregating (polymorphic) variants.

    Returns:
        int: Number of segregating variants
    """

def is_segregating(self):
    """
    Determine which variants are segregating.

    Returns:
        ndarray: Boolean array indicating segregating variants
    """

def count_singleton(self, allele=1):
    """
    Count singleton variants (the given allele is observed exactly once).

    Args:
        allele (int, optional): Allele index to check

    Returns:
        int: Number of singleton variants
    """

def allelism(self):
    """
    Determine number of alleles per variant.

    Returns:
        ndarray: Number of alleles observed per variant
    """

def max_allele(self):
    """
    Find maximum allele number per variant.

    Returns:
        ndarray: Maximum allele number per variant
    """

def to_frequencies(self, fill=np.nan):
    """
    Convert counts to frequencies.

    Args:
        fill (float): Value to use for variants with no allele calls

    Returns:
        ndarray: Allele frequencies
    """
```
{ .api }
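The per-variant quantities above follow directly from one row of counts. The sketch below shows one way to derive them in pure Python; the `*_sketch` helper names are hypothetical, and using NaN as the default fill for variants with no calls is an assumption of this sketch.

```python
def allelism_sketch(row):
    """Number of distinct alleles observed (count > 0) for one variant."""
    return sum(1 for count in row if count > 0)


def is_segregating_sketch(row):
    """A variant segregates when more than one allele is observed."""
    return allelism_sketch(row) > 1


def to_frequencies_sketch(row, fill=float("nan")):
    """Divide each count by the row total; use `fill` when there are no calls."""
    total = sum(row)
    if total == 0:
        return [fill] * len(row)
    return [count / total for count in row]


counts = [[3, 1], [4, 0], [0, 0]]  # 3 variants, counts for alleles 0 and 1
allelism = [allelism_sketch(row) for row in counts]
segregating = [is_segregating_sketch(row) for row in counts]
freqs = [to_frequencies_sketch(row) for row in counts]
print(allelism)      # [2, 1, 0]
print(segregating)   # [True, False, False]
print(freqs[:2])     # [[0.75, 0.25], [1.0, 0.0]]
```

The last variant has no called alleles, so its frequencies take the fill value rather than triggering a division by zero.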

## Array Creation Functions

### Genotype Array Creation

```python
def GenotypeArray(data, dtype=None):
    """
    Create a genotype array from input data.

    Args:
        data (array_like): Genotype data with shape (n_variants, n_samples, ploidy)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        GenotypeArray: Genotype array instance
    """
```
{ .api }

### Haplotype Array Creation

```python
def HaplotypeArray(data, dtype=None):
    """
    Create a haplotype array from input data.

    Args:
        data (array_like): Haplotype data with shape (n_variants, n_haplotypes)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        HaplotypeArray: Haplotype array instance
    """
```
{ .api }

### Allele Counts Array Creation

```python
def AlleleCountsArray(data, dtype=None):
    """
    Create an allele counts array from input data.

    Args:
        data (array_like): Allele count data with shape (n_variants, n_alleles)
        dtype (numpy.dtype, optional): Data type for array

    Returns:
        AlleleCountsArray: Allele counts array instance
    """
```
{ .api }

## Structured Arrays

### VariantTable

Structured array for storing variant metadata.

```python
# Create variant table with metadata
variants = allel.VariantTable({
    'CHROM': ['1', '1', '2'],
    'POS': [100, 200, 150],
    'REF': ['A', 'G', 'T'],
    'ALT': ['T', 'A', 'C']
})

# Access fields
positions = variants['POS']
chromosomes = variants['CHROM']
```
{ .api }

### FeatureTable

Structured array for genomic features like genes and exons.

```python
# Create feature table
features = allel.FeatureTable({
    'seqid': ['1', '1', '2'],
    'start': [1000, 2000, 1500],
    'end': [2000, 3000, 2500],
    'type': ['gene', 'exon', 'gene']
})
```
{ .api }

## Chunked Arrays

For large datasets, scikit-allel provides chunked versions of the core array classes backed by Dask.

### GenotypeDaskArray

```python
import dask.array as da

# `data` is any array-like of shape (n_variants, n_samples, ploidy),
# e.g. a NumPy array or an on-disk HDF5/Zarr dataset

# Create chunked genotype array
chunks = (1000, 100, 2)  # chunk sizes for (variants, samples, ploidy)
g_chunked = allel.GenotypeDaskArray(data, chunks=chunks)

# Compute results
ac = g_chunked.count_alleles().compute()
```
{ .api }
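The idea behind chunked computation can be sketched without Dask: walk the variant axis in fixed-size blocks so that only one block is processed at a time, then combine the per-block results. The helpers `iter_chunks` and `count_alleles_chunked` are hypothetical names for this illustration. The sketch runs eagerly and sequentially, so it says nothing about how Dask builds or schedules its task graph.

```python
def iter_chunks(n_rows, chunk_rows):
    """Yield (start, stop) row ranges covering n_rows in blocks of chunk_rows."""
    for start in range(0, n_rows, chunk_rows):
        yield start, min(start + chunk_rows, n_rows)


def count_alleles_chunked(genotypes, chunk_rows, n_alleles=2):
    """Count alleles per variant, processing one block of variants at a time."""
    counts = []
    for start, stop in iter_chunks(len(genotypes), chunk_rows):
        block = genotypes[start:stop]  # only this block is touched per step
        for variant in block:
            row = [0] * n_alleles
            for call in variant:
                for allele in call:
                    if 0 <= allele < n_alleles:  # skip missing (negative) alleles
                        row[allele] += 1
            counts.append(row)
    return counts


genotypes = [[[0, 0], [0, 1]], [[1, 1], [1, 0]], [[0, 1], [-1, -1]]]
chunked_counts = count_alleles_chunked(genotypes, chunk_rows=2)
print(list(iter_chunks(3, 2)))  # [(0, 2), (2, 3)]
print(chunked_counts)           # [[3, 1], [1, 3], [1, 1]]
```

Because allele counts are computed independently per variant, the result does not depend on the chunk size, which is what makes this kind of operation a good fit for chunked storage.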

### HaplotypeDaskArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeDaskArray(data, chunks=(1000, 200))
```
{ .api }

### AlleleCountsDaskArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsDaskArray(data, chunks=(1000, 4))
```
{ .api }

## Chunked Arrays (Legacy)

Scikit-allel also provides chunked array implementations backed by Zarr or HDF5 storage. These are maintained for backwards compatibility; prefer Dask arrays for new code.

### GenotypeChunkedArray

```python
# Create chunked genotype array with Zarr storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='zarr')

# With HDF5 storage
g_chunked = allel.GenotypeChunkedArray(data, chunks=(1000, 100, 2), storage='hdf5')
```
{ .api }

### HaplotypeChunkedArray

```python
# Create chunked haplotype array
h_chunked = allel.HaplotypeChunkedArray(data, chunks=(1000, 200), storage='zarr')
```
{ .api }

### AlleleCountsChunkedArray

```python
# Create chunked allele counts array
ac_chunked = allel.AlleleCountsChunkedArray(data, chunks=(1000, 4), storage='zarr')
```
{ .api }

### GenotypeAlleleCountsChunkedArray

```python
# Create chunked genotype allele counts array
gac_chunked = allel.GenotypeAlleleCountsChunkedArray(data, chunks=(1000, 100, 4), storage='zarr')
```
{ .api }

## Index Classes

### UniqueIndex

Index for unique identifiers with efficient lookup.

```python
# Create unique index
sample_ids = ['sample_1', 'sample_2', 'sample_3']
idx = allel.UniqueIndex(sample_ids)

# Lookup by value
pos = idx.locate_key('sample_2')  # Returns position
```
{ .api }

### SortedIndex

Index for sorted values with range queries.

```python
# Create sorted index for positions
positions = [100, 200, 300, 400, 500]
idx = allel.SortedIndex(positions)

# Range queries
selection = idx.locate_range(150, 350)  # Find positions in range
```
{ .api }
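Range queries on sorted values are efficient because they reduce to two binary searches. The sketch below illustrates that idea with the standard-library `bisect` module; `locate_range_sketch` is a hypothetical helper, and treating both bounds as inclusive is an assumption of this sketch rather than a statement about the library's behavior.

```python
from bisect import bisect_left, bisect_right


def locate_range_sketch(sorted_values, start, stop):
    """Return a slice selecting values v with start <= v <= stop."""
    lo = bisect_left(sorted_values, start)   # first index with value >= start
    hi = bisect_right(sorted_values, stop)   # first index with value > stop
    return slice(lo, hi)


positions = [100, 200, 300, 400, 500]
selection = locate_range_sketch(positions, 150, 350)
print(selection)             # slice(1, 3, None)
print(positions[selection])  # [200, 300]
```

Returning a slice rather than a list of indices means the same selection can be applied cheaply to any other array aligned with the positions, such as the variant axis of a genotype array.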

### SortedMultiIndex

Multi-level index for hierarchical data.

```python
# Create multi-level index (chromosome, position)
chroms = ['1', '1', '1', '2', '2']
positions = [100, 200, 300, 150, 250]
idx = allel.SortedMultiIndex(chroms, positions)  # the two levels are separate arguments

# Query specific chromosome
chrom1_variants = idx.locate_key('1')
```
{ .api }

### ChromPosIndex

Specialized index for chromosome-position data.

```python
# Create chromosome-position index
chroms = ['1', '1', '2', '2']
positions = [100, 200, 150, 250]
idx = allel.ChromPosIndex(chroms, positions)

# Query by chromosome and position range
variants = idx.locate_range('1', 50, 150)
```
{ .api }

## Vector Classes

### GenotypeVector

A vector of genotype calls for a single variant (across samples) or a single sample (across variants), with shape `(n_calls, ploidy)`.

```python
# Create genotype vector for a single variant: one row of alleles per sample
calls = [[0, 0], [0, 1], [1, 1], [1, 0]]  # 4 calls, diploid
gv = allel.GenotypeVector(calls)

# Properties
print(gv.n_calls)  # Number of calls
print(gv.ploidy)   # Ploidy
print(gv.shape)    # Vector shape
```
{ .api }

### GenotypeAlleleCountsVector

A vector of genotype calls for a single variant or sample, stored as allele counts per call with shape `(n_calls, n_alleles)`.

```python
# Create allele counts vector: per-call counts of alleles 0 and 1
counts = [[2, 0], [1, 1], [0, 2]]  # hom ref, het, hom alt (diploid)
acv = allel.GenotypeAlleleCountsVector(counts)

# Analysis methods
print(acv.is_het())  # Which calls are heterozygous
print(acv.n_calls)   # Number of calls
```
{ .api }

## Memory-Efficient Storage