# Data Handling

PyMC3 provides specialized data structures and utilities for handling observed data, minibatch processing, and generator-based data sources. These tools enable efficient memory usage and streaming data processing in Bayesian models.

## Capabilities

### Data Container

The primary data container for observed variables in PyMC3 models.

```python { .api }
class Data:
    """
    Data container for observed variables with mutable and shared data support.

    Creates a shared variable that can be updated during sampling or between
    model fits, enabling out-of-sample prediction and data augmentation.

    Parameters:
    - name: str, name for the data variable
    - value: array-like, initial data values
    - dims: tuple, named dimensions for the data
    - export_index_as_coords: bool, export index as coordinates
    """
```
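
Since `Data` wraps a shared variable, the key mechanic is that downstream computations see updated values without being rebuilt. A NumPy-only sketch of that idea (no PyMC3; `Shared` is a hypothetical stand-in for a Theano shared variable):

```python
import numpy as np

# Stand-in for a Theano shared variable: a mutable holder whose contents
# can be swapped without rebuilding the computation that reads it.
class Shared:
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)

    def set_value(self, value):
        self.value = np.asarray(value, dtype=float)

shared = Shared(np.zeros(100))

# A "compiled" computation that reads the holder at call time.
def log_lik(mu):
    return -0.5 * np.sum((shared.value - mu) ** 2)

baseline = log_lik(0.0)          # evaluated against the original data
shared.set_value(np.ones(50))    # swap in new data; log_lik needs no rebuild
updated = log_lik(1.0)
```

This swap-in-place behavior is what makes out-of-sample prediction possible without redefining the model.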

### Minibatch Processing

Efficient minibatch processing for large datasets and stochastic variational inference.

```python { .api }
class Minibatch:
    """
    Multidimensional minibatch container for stochastic inference.

    Enables efficient processing of large datasets by sampling random
    minibatches during each iteration, essential for scalable variational
    inference and large dataset handling.

    Parameters:
    - data: array-like, full dataset to sample from
    - batch_size: int or list, size of minibatches or list of sizes with random seeds
    - dtype: str, data type for minibatch arrays
    - broadcastable: tuple, broadcasting pattern (defaults to (False,) * ndim)
    - name: str, name for minibatch variable (default "Minibatch")
    - random_seed: int, random seed for minibatch sampling
    - update_shared_f: callable, function to update underlying shared variable
    """

def align_minibatches(*minibatches):
    """
    Align multiple minibatch variables to sample consistent indices.

    Ensures that multiple minibatch variables sample the same data points
    in each iteration, maintaining consistency across related datasets.

    Parameters:
    - minibatches: Minibatch variables to align

    Returns:
    - tuple: aligned minibatch variables
    """
```
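
The consistency guarantee of `align_minibatches` can be illustrated without PyMC3: if `X` and `y` are subsampled with one shared set of row indices per iteration, every minibatch row of `X` stays paired with its response in `y`. A NumPy-only sketch of that invariant:

```python
import numpy as np

rng = np.random.default_rng(42)
N, batch_size = 10_000, 100
X = rng.normal(size=(N, 5))
y = X @ np.arange(1.0, 6.0)      # y[i] is fully determined by row X[i]

# One shared draw of indices per iteration keeps rows paired,
# which is what aligning minibatches guarantees.
idx = rng.integers(0, N, size=batch_size)
X_batch, y_batch = X[idx], y[idx]
```

Sampling `X` and `y` with independent index draws would break the `X[i] -> y[i]` pairing and silently corrupt the likelihood.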

### Generator Adaptation

Support for generator-based data sources and streaming data processing.

```python { .api }
class GeneratorAdapter:
    """
    Adapter for generator-based data sources in PyMC3 models.

    Converts Python generators into PyMC3-compatible tensor variables,
    enabling streaming data processing and infinite data sources with
    automatic type inference from the first generated item.

    Parameters:
    - generator: Python generator yielding data arrays

    Methods:
    - make_variable(gop, name): create tensor variable from generator
    - set_gen(gen): update underlying generator
    - set_default(value): set default value for variable
    """
```
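
The automatic type inference mentioned above relies on a peek-and-splice pattern: consume the first generated item to learn its dtype and shape, then chain it back so the stream is not shortened. A stdlib/NumPy sketch of that pattern (a conceptual illustration, not the PyMC3 implementation):

```python
import itertools
import numpy as np

def adapt(gen):
    # Peek at the first item to infer dtype and shape...
    first = np.asarray(next(gen))
    # ...then splice it back in front so no data is lost.
    stream = itertools.chain([first], gen)
    return first.dtype, first.shape, stream

gen = (np.ones((10, 3)) for _ in range(5))
dtype, shape, stream = adapt(gen)
```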

### Data Utilities

Helper functions for data loading and management.

```python { .api }
def get_data(filename):
    """
    Load package data files for examples and testing.

    Retrieves data files from the PyMC3 package, or downloads them from
    the remote repository if they are not available locally.

    Parameters:
    - filename: str, name of data file to load

    Returns:
    - BytesIO: file-like object containing data
    """
```
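
The local-first, remote-fallback behavior described above can be sketched with the standard library alone. `DATA_DIR` and `load_data` are hypothetical names, and the download branch is stubbed out:

```python
import io
from pathlib import Path

DATA_DIR = Path(".")  # hypothetical package data directory

def load_data(filename):
    """Return a BytesIO for a local data file, mimicking the fallback order."""
    path = DATA_DIR / filename
    if path.exists():
        return io.BytesIO(path.read_bytes())
    # A real implementation would download from the remote repository here.
    raise FileNotFoundError(f"{filename} not found locally")
```

Returning `BytesIO` lets callers treat packaged and downloaded data uniformly, e.g. passing the result straight to `pd.read_csv`.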

## Usage Examples

### Basic Data Container

```python
import pymc3 as pm
import numpy as np

data = np.random.randn(100)

with pm.Model() as model:
    # pm.Data must be created inside a model context
    shared_data = pm.Data('shared_data', data)

    mu = pm.Normal('mu', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)

    # Use shared data in likelihood
    obs = pm.Normal('obs', mu, sigma, observed=shared_data)

    trace = pm.sample(1000)

# Update data for out-of-sample prediction; set_data also
# requires the model context
new_data = np.random.randn(50)
with model:
    pm.set_data({'shared_data': new_data})

    # Generate predictions with the new data
    post_pred = pm.sample_posterior_predictive(trace)
```

### Minibatch Processing for Large Datasets

```python
import pymc3 as pm
import numpy as np

# Large dataset
N = 10000
X = np.random.randn(N, 5)
y = np.random.randn(N)

# Create minibatches
batch_size = 100
X_batch = pm.Minibatch(X, batch_size=batch_size)
y_batch = pm.Minibatch(y, batch_size=batch_size)

# Align minibatches so both sample the same row indices each iteration
X_batch, y_batch = pm.align_minibatches(X_batch, y_batch)

with pm.Model() as model:
    # Model parameters
    beta = pm.Normal('beta', 0, 1, shape=5)
    sigma = pm.HalfNormal('sigma', 1)

    # Minibatch likelihood; total_size rescales it to the full dataset
    mu = pm.math.dot(X_batch, beta)
    obs = pm.Normal('obs', mu, sigma, observed=y_batch, total_size=N)

    # Use ADVI for the large dataset
    approx = pm.fit(n=10000, method='advi')
    trace = approx.sample(1000)
```

### Generator-Based Data Processing

```python
import pymc3 as pm
import numpy as np

def data_generator():
    """Generator yielding streaming data batches."""
    while True:
        yield np.random.randn(10, 3)

# Create generator adapter
gen = data_generator()
adapter = pm.GeneratorAdapter(gen)

with pm.Model() as model:
    # Create a tensor variable from the generator
    data_var = adapter.make_variable(name='streaming_data')

    # Model using streaming data
    mu = pm.Normal('mu', 0, 1, shape=3)
    sigma = pm.HalfNormal('sigma', 1)
    obs = pm.Normal('obs', mu, sigma, observed=data_var)

# Note: generator-based sampling requires special handling
```

### Data Loading Utilities

```python
import pymc3 as pm
import pandas as pd

# Load package data
data_file = pm.get_data('coal.csv')
coal_data = pd.read_csv(data_file)

# Use in model
with pm.Model() as model:
    # Process loaded data
    years = coal_data['year'].values
    disasters = pm.Data('disasters', coal_data['disasters'].values)

    # Build a switch-point model: tau ranges over actual years,
    # so it can be compared directly with the `years` array
    lambda1 = pm.Exponential('lambda1', 1)
    lambda2 = pm.Exponential('lambda2', 1)
    tau = pm.DiscreteUniform('tau', years.min(), years.max())

    rate = pm.math.switch(tau >= years, lambda1, lambda2)
    obs = pm.Poisson('obs', rate, observed=disasters)

    trace = pm.sample(2000)
```

### Dynamic Data Updates

```python
import pymc3 as pm
import numpy as np

# Initial training data
X_train = np.random.randn(100, 3)
y_train = np.random.randn(100)

with pm.Model() as model:
    # Shared variables must be created inside the model context
    X_shared = pm.Data('X_shared', X_train)
    y_shared = pm.Data('y_shared', y_train)

    # Model parameters
    beta = pm.Normal('beta', 0, 1, shape=3)
    sigma = pm.HalfNormal('sigma', 1)

    # Model definition
    mu = pm.math.dot(X_shared, beta)
    obs = pm.Normal('obs', mu, sigma, observed=y_shared)

    # Fit to initial data
    trace = pm.sample(1000)

# Update with new data batches
for batch_idx in range(5):
    X_new = np.random.randn(20, 3)
    y_new = np.random.randn(20)

    with model:
        # Update the shared variables, then continue sampling or refit
        pm.set_data({'X_shared': X_new, 'y_shared': y_new})
        trace_batch = pm.sample(200, start=trace[-1])
```