# Data Handling

PyMC3 provides specialized data structures and utilities for handling observed data, minibatch processing, and generator-based data sources. These tools enable efficient memory usage and streaming data processing in Bayesian models.

## Capabilities

### Data Container

The primary data container for observed variables in PyMC3 models.

```python { .api }
class Data:
    """
    Data container for observed variables with mutable and shared data support.

    Creates a shared variable that can be updated during sampling or between
    model fits, enabling out-of-sample prediction and data augmentation.

    Parameters:
    - name: str, name for the data variable
    - value: array-like, initial data values
    - dims: tuple, named dimensions for the data
    - export_index_as_coords: bool, export index as coordinates
    """
```
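Conceptually, `Data` wraps a mutable shared value that the model references rather than copies, so swapping the value later changes what the model sees without rebuilding the graph. A pure-Python sketch of that idea (the `SharedData` class and its registry below are illustrative only, not PyMC3 API):

```python
class SharedData:
    """Toy stand-in for a shared variable: holds a value that can be
    swapped out after a model is built, without rebuilding the model."""
    _registry = {}

    def __init__(self, name, value):
        self.name = name
        self.value = list(value)
        SharedData._registry[name] = self

    @classmethod
    def set_data(cls, updates):
        # Replace the stored values in place, mimicking pm.set_data
        for name, new_value in updates.items():
            cls._registry[name].value = list(new_value)


x = SharedData("x", [1.0, 2.0, 3.0])
model_input = x  # a "model" keeps a reference, not a copy

SharedData.set_data({"x": [4.0, 5.0]})
print(model_input.value)  # the model now sees [4.0, 5.0]
```

Because the model holds a reference to the container, every downstream computation picks up the new values automatically — the same property that makes out-of-sample prediction with `pm.set_data` work.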

### Minibatch Processing

Efficient minibatch processing for large datasets and stochastic variational inference.

```python { .api }
class Minibatch:
    """
    Multidimensional minibatch container for stochastic inference.

    Enables efficient processing of large datasets by sampling random
    minibatches during each iteration, essential for scalable variational
    inference and large dataset handling.

    Parameters:
    - data: array-like, full dataset to sample from
    - batch_size: int or list, size of minibatches or list of sizes with random seeds
    - dtype: str, data type for minibatch arrays
    - broadcastable: tuple, broadcasting pattern (defaults to (False,) * ndim)
    - name: str, name for minibatch variable (default "Minibatch")
    - random_seed: int, random seed for minibatch sampling
    - update_shared_f: callable, function to update underlying shared variable
    """

def align_minibatches(*minibatches):
    """
    Align multiple minibatch variables to sample consistent indices.

    Ensures that multiple minibatch variables sample the same data points
    in each iteration, maintaining consistency across related datasets.

    Parameters:
    - minibatches: Minibatch variables to align

    Returns:
    - tuple: aligned minibatch variables
    """
```
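Under the hood, each minibatch draw amounts to sampling a fresh set of row indices and rescaling the resulting log-likelihood by `total_size / batch_size` so it is an unbiased estimate of the full-data likelihood. A stdlib-only sketch of that mechanism (the function names here are hypothetical, not PyMC3 API):

```python
import random

def minibatch_indices(n_total, batch_size, rng):
    """Draw one set of random row indices, shared by aligned minibatches."""
    return [rng.randrange(n_total) for _ in range(batch_size)]

def minibatch_logp(log_liks, n_total):
    """Rescale a minibatch log-likelihood so its expectation matches
    the full-data log-likelihood (the role of total_size)."""
    batch_size = len(log_liks)
    return (n_total / batch_size) * sum(log_liks)


rng = random.Random(42)
N = 10_000
X = list(range(N))       # feature "rows"
y = [2 * v for v in X]   # matching targets

idx = minibatch_indices(N, batch_size=100, rng=rng)
X_batch = [X[i] for i in idx]  # aligned: the SAME idx is used for both
y_batch = [y[i] for i in idx]  # arrays, as align_minibatches guarantees

# Every sampled pair still matches, so the likelihood sees consistent rows
assert all(b == 2 * a for a, b in zip(X_batch, y_batch))
```

This is why alignment matters: if `X_batch` and `y_batch` drew independent indices, the features and targets in each iteration would belong to different observations.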

### Generator Adaptation

Support for generator-based data sources and streaming data processing.

```python { .api }
class GeneratorAdapter:
    """
    Adapter for generator-based data sources in PyMC3 models.

    Converts Python generators into PyMC3-compatible tensor variables,
    enabling streaming data processing and infinite data sources with
    automatic type inference from the first generated item.

    Parameters:
    - generator: Python generator yielding data arrays

    Methods:
    - make_variable(gop, name): create tensor variable from generator
    - set_gen(gen): update underlying generator
    - set_default(value): set default value for variable
    """
```
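The adapter pattern can be sketched in plain Python: peek at the first generated item to infer metadata, replay it so no data is lost, and fall back to a default once the stream ends. The `ToyGeneratorAdapter` below is an illustrative sketch of that idea, not PyMC3's implementation:

```python
class ToyGeneratorAdapter:
    """Illustrative adapter: infers shape from the first item, replays
    it before streaming the rest, and supports a fallback default."""

    def __init__(self, generator):
        self.gen = generator
        first = next(self.gen)   # peek to infer metadata
        self.shape = (len(first),)
        self.default = None
        self._pending = [first]  # replay the peeked item later

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            return self._pending.pop(0)
        try:
            return next(self.gen)
        except StopIteration:
            if self.default is not None:
                return self.default  # stream ended: serve the default
            raise

    def set_default(self, value):
        self.default = value


def stream():
    yield [1, 2, 3]
    yield [4, 5, 6]

adapter = ToyGeneratorAdapter(stream())
assert adapter.shape == (3,)             # inferred from the first item
assert next(adapter) == [1, 2, 3]        # the peeked item is not lost
assert next(adapter) == [4, 5, 6]
```

The replay buffer is the key detail: type inference consumes the first item, so the adapter must hand it back on the first real read or the model would silently skip one batch.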

### Data Utilities

Helper functions for data loading and management.

```python { .api }
def get_data(filename):
    """
    Load package data files for examples and testing.

    Retrieves data files from the PyMC3 package or downloads them from a
    remote repository if not available locally.

    Parameters:
    - filename: str, name of data file to load

    Returns:
    - BytesIO: file-like object containing data
    """
```
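The load-locally-else-download behavior can be sketched with the standard library; `get_data_sketch` and its `search_dirs` parameter are hypothetical illustrations, and the real `get_data` resolves files relative to the installed package before falling back to the remote repository:

```python
import io
import os
import tempfile

def get_data_sketch(filename, search_dirs=("data",)):
    """Return a BytesIO over a local copy of the file if one exists;
    otherwise signal that a remote download would be needed."""
    for d in search_dirs:
        path = os.path.join(d, filename)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return io.BytesIO(f.read())
    raise FileNotFoundError(
        f"{filename} not found locally; get_data would download it here"
    )


# Demonstrate with a throwaway directory standing in for package data
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "coal.csv"), "w") as f:
    f.write("year,disasters\n1851,4\n")

buf = get_data_sketch("coal.csv", search_dirs=(tmp,))
print(buf.read().decode().splitlines()[0])  # prints "year,disasters"
```

Returning a `BytesIO` rather than a path keeps the caller's code identical whether the bytes came from disk or from the network, which is why `pd.read_csv(pm.get_data(...))` works either way.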

## Usage Examples

### Basic Data Container

```python
import pymc3 as pm
import numpy as np

data = np.random.randn(100)

with pm.Model() as model:
    # Mutable data container (must be created inside a model context)
    shared_data = pm.Data('shared_data', data)

    mu = pm.Normal('mu', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)

    # Use shared data in the likelihood
    obs = pm.Normal('obs', mu, sigma, observed=shared_data)

    trace = pm.sample(1000)

# Update data for out-of-sample prediction
new_data = np.random.randn(50)
with model:
    pm.set_data({'shared_data': new_data})

    # Generate predictions with the new data
    post_pred = pm.sample_posterior_predictive(trace)
```

### Minibatch Processing for Large Datasets

```python
import pymc3 as pm
import numpy as np

# Large dataset
N = 10000
X = np.random.randn(N, 5)
y = np.random.randn(N)

# Create minibatches
batch_size = 100
X_batch = pm.Minibatch(X, batch_size=batch_size)
y_batch = pm.Minibatch(y, batch_size=batch_size)

# Align minibatches to sample consistent indices
X_batch, y_batch = pm.align_minibatches(X_batch, y_batch)

with pm.Model() as model:
    # Model parameters
    beta = pm.Normal('beta', 0, 1, shape=5)
    sigma = pm.HalfNormal('sigma', 1)

    # Minibatch likelihood; total_size rescales it to the full dataset
    mu = pm.math.dot(X_batch, beta)
    obs = pm.Normal('obs', mu, sigma, observed=y_batch, total_size=N)

    # Use ADVI for the large dataset
    approx = pm.fit(n=10000, method='advi')
    trace = approx.sample(1000)
```

### Generator-Based Data Processing

```python
import pymc3 as pm
import numpy as np

def data_generator():
    """Generator yielding streaming data batches."""
    while True:
        yield np.random.randn(10, 3)

# Create generator adapter
gen = data_generator()
adapter = pm.GeneratorAdapter(gen)

with pm.Model() as model:
    # Create a tensor variable from the generator
    data_var = adapter.make_variable(name='streaming_data')

    # Model using streaming data
    mu = pm.Normal('mu', 0, 1, shape=3)
    sigma = pm.HalfNormal('sigma', 1)
    obs = pm.Normal('obs', mu, sigma, observed=data_var)

    # Note: special handling is needed for generator-based sampling
```

### Data Loading Utilities

```python
import pymc3 as pm
import pandas as pd

# Load package data
data_file = pm.get_data('coal.csv')
coal_data = pd.read_csv(data_file)

# Use in model
with pm.Model() as model:
    # Process loaded data
    years = coal_data['year'].values
    disasters = pm.Data('disasters', coal_data['disasters'].values)

    # Change-point model: the rate switches from lambda1 to lambda2 at tau
    lambda1 = pm.Exponential('lambda1', 1)
    lambda2 = pm.Exponential('lambda2', 1)
    # tau must range over actual year values, since it is compared to years
    tau = pm.DiscreteUniform('tau', lower=years.min(), upper=years.max())

    rate = pm.math.switch(years < tau, lambda1, lambda2)
    obs = pm.Poisson('obs', rate, observed=disasters)

    trace = pm.sample(2000)
```

### Dynamic Data Updates

```python
import pymc3 as pm
import numpy as np

# Initial training data
X_train = np.random.randn(100, 3)
y_train = np.random.randn(100)

with pm.Model() as model:
    # Shared variables (created inside the model context)
    X_shared = pm.Data('X_shared', X_train)
    y_shared = pm.Data('y_shared', y_train)

    # Model parameters
    beta = pm.Normal('beta', 0, 1, shape=3)
    sigma = pm.HalfNormal('sigma', 1)

    # Model definition
    mu = pm.math.dot(X_shared, beta)
    obs = pm.Normal('obs', mu, sigma, observed=y_shared)

    # Fit to initial data
    trace = pm.sample(1000)

# Update with new data batches
for batch_idx in range(5):
    X_new = np.random.randn(20, 3)
    y_new = np.random.randn(20)

    with model:
        # Update shared variables
        pm.set_data({'X_shared': X_new, 'y_shared': y_new})

        # Refit on the updated data
        trace_batch = pm.sample(200, start=trace[-1])
```