# Data Handling

Data transformation, loading, and preprocessing functionality, including support for pandas DataFrames, CSV/JSON files, and other common formats. Altair provides a flexible data transformation pipeline that works with multiple data sources.

## Capabilities

### Data Transformers

Registry system for managing the different data transformation backends that convert various data formats into Vega-Lite compatible specifications.

```python { .api }
class DataTransformerRegistry:
    def enable(self, name, **kwargs):
        """Enable a data transformer by name."""

    def disable(self):
        """Disable current data transformer."""

    def register(self, name, func):
        """Register a new data transformer function."""

    def get(self):
        """Get currently active data transformer."""

    @property
    def active(self):
        """Get name of active data transformer."""

    def names(self):
        """Get list of available transformer names."""

# Global data transformers registry
data_transformers = DataTransformerRegistry()
```
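
As a sketch of how the registry is typically used, the snippet below registers a hypothetical transformer named `keep_columns` that drops unused columns before inlining the data; the transformer name, the column list, and the import of `to_values` from `altair.utils.data` are assumptions for illustration only.

```python
import altair as alt
import pandas as pd
from altair.utils.data import to_values

def keep_columns(data, columns=('x', 'y')):
    """Hypothetical transformer: drop columns the chart does not use,
    then inline the remaining rows as values."""
    if isinstance(data, pd.DataFrame):
        data = data[[c for c in columns if c in data.columns]]
    return to_values(data)

# Register the transformer under a custom name and activate it
alt.data_transformers.register('keep_columns', keep_columns)
alt.data_transformers.enable('keep_columns')
print(alt.data_transformers.active)  # 'keep_columns'
```

The built-in behavior can be restored afterwards with `alt.data_transformers.enable('default')`.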

### Data Format Converters

Functions for converting data between different formats compatible with Vega-Lite.

```python { .api }
def to_json(data, prefix='altair-data', extension='json', **kwargs):
    """
    Convert data to JSON format.

    Parameters:
    - data: Input data (DataFrame, dict, list)
    - prefix: Filename prefix for generated files
    - extension: File extension to use

    Returns:
    dict: Vega-Lite data specification with JSON URL
    """

def to_csv(data, prefix='altair-data', extension='csv', **kwargs):
    """
    Convert data to CSV format.

    Parameters:
    - data: Input data (DataFrame, dict, list)
    - prefix: Filename prefix for generated files
    - extension: File extension to use

    Returns:
    dict: Vega-Lite data specification with CSV URL
    """

def to_values(data):
    """
    Convert data to inline values format.

    Parameters:
    - data: Input data (DataFrame, dict, list)

    Returns:
    dict: Vega-Lite data specification with inline values
    """
```
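
A minimal sketch of the converters in use, assuming they are imported from `altair.utils.data`, where Altair keeps these helpers; the small DataFrame is illustrative.

```python
import pandas as pd
from altair.utils.data import to_values, to_json

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Inline values: embeds the rows directly in the chart specification
spec = to_values(df)
# {'values': [{'x': 1, 'y': 4}, {'x': 2, 'y': 5}, {'x': 3, 'y': 6}]}

# File-based: writes a JSON file next to the script/notebook and
# returns a {'url': ..., 'format': ...} specification instead
url_spec = to_json(df)
```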

### Data Limiting and Sampling

Functions for managing large datasets by limiting rows or sampling data.

```python { .api }
def limit_rows(max_rows=5000):
    """
    Data transformer that limits the number of rows.

    Parameters:
    - max_rows: Maximum number of rows to include

    Returns:
    Configured data transformer function
    """

def sample(n=None, frac=None):
    """
    Sample random subset of data rows.

    Parameters:
    - n: Number of rows to sample
    - frac: Fraction of rows to sample (0-1)

    Returns:
    Sampled data
    """

class MaxRowsError(Exception):
    """Exception raised when data exceeds maximum allowed rows."""
```
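
To show how the row limit surfaces in practice, here is a minimal sketch: serializing a chart whose data exceeds the active transformer's limit raises `MaxRowsError`, which can be caught and worked around. The 6,000-row DataFrame and the sampling fallback are illustrative.

```python
import altair as alt
import pandas as pd

big = pd.DataFrame({'x': range(6000), 'y': range(6000)})
chart = alt.Chart(big).mark_point().encode(x='x:Q', y='y:Q')

try:
    # Serialization applies the active transformer (default limit: 5000 rows)
    chart.to_dict()
except alt.MaxRowsError:
    # Fall back to a manual sample that fits under the limit
    chart = alt.Chart(big.sample(n=5000, random_state=0)).mark_point().encode(
        x='x:Q', y='y:Q'
    )
```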

### Default Data Transformer

```python { .api }
def default_data_transformer(data):
    """
    Apply the default data transformation (row limiting followed by
    conversion to inline values) to the input data.

    Parameters:
    - data: Input data

    Returns:
    Transformed data specification
    """
```
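
For orientation, a minimal sketch of calling the default transformer directly, assuming it is importable from the top-level `altair` namespace as listed above; the two-row DataFrame is illustrative.

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Mirrors what Altair does when a chart built on this data is rendered:
# limit the rows, then inline them as values
spec = alt.default_data_transformer(df)
# {'values': [{'x': 1, 'y': 3}, {'x': 2, 'y': 4}]}
```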

### Data Generation Functions

Standalone functions for generating synthetic data sources.

```python { .api }
def sequence(start, stop=None, step=None, as_='data'):
    """
    Generate sequence of numbers as data source.

    Parameters:
    - start: Starting value
    - stop: Ending value (exclusive)
    - step: Step size (default 1)
    - as_: Output field name

    Returns:
    SequenceGenerator: Sequence data specification
    """

def graticule(extent=None, extentMajor=None, extentMinor=None, step=None, stepMajor=None, stepMinor=None, precision=None):
    """
    Generate graticule (geographic grid lines) as data source.

    Parameters:
    - extent: Overall extent [[x0, y0], [x1, y1]]
    - extentMajor: Major line extent
    - extentMinor: Minor line extent
    - step: Overall step size [dx, dy]
    - stepMajor: Major line step size
    - stepMinor: Minor line step size
    - precision: Line precision

    Returns:
    GraticuleGenerator: Graticule data specification
    """

def sphere():
    """
    Generate sphere geometry as data source.

    Returns:
    SphereGenerator: Sphere data specification
    """

def topo_feature(topology, feature):
    """
    Extract feature from TopoJSON topology.

    Parameters:
    - topology: TopoJSON topology object or URL
    - feature: Feature name to extract

    Returns:
    dict: Data specification for extracted feature
    """
```

### Data Source Types

Support for various data input formats and sources.

```python { .api }
# Inline data
class InlineData:
    def __init__(self, values=None, format=None): ...

# URL-based data
class UrlData:
    def __init__(self, url=None, format=None): ...

# Named datasets
class NamedData:
    def __init__(self, name=None): ...

# Generated data
class SequenceGenerator:
    def __init__(self, start=None, stop=None, step=None, as_=None): ...

class GraticuleGenerator:
    def __init__(self, extent=None, extentMajor=None, extentMinor=None, step=None, stepMajor=None, stepMinor=None, precision=None): ...

class SphereGenerator:
    def __init__(self): ...
```
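
A minimal sketch showing how these data source classes plug into `alt.Chart`; the URL, values, and dataset name are placeholders.

```python
import altair as alt

# Inline values wrapped explicitly
inline = alt.InlineData(values=[{'x': 1, 'y': 2}, {'x': 2, 'y': 3}])
chart1 = alt.Chart(inline).mark_line().encode(x='x:Q', y='y:Q')

# Remote CSV referenced by URL, with an explicit format
remote = alt.UrlData(url='https://example.com/data.csv',
                     format=alt.CsvDataFormat())
chart2 = alt.Chart(remote).mark_point().encode(x='x:Q', y='y:Q')

# Named dataset, expected to be supplied at render time
# (referenced through the specification's top-level `datasets` entry)
named = alt.NamedData(name='my_dataset')
chart3 = alt.Chart(named).mark_bar().encode(x='x:N', y='y:Q')
```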

### Data Format Specifications

Classes for specifying data parsing and formatting options.

```python { .api }
class DataFormat:
    def __init__(self, type=None, **kwargs): ...

class CsvDataFormat(DataFormat):
    def __init__(self, parse=None, delimiter=None, **kwargs): ...

class JsonDataFormat(DataFormat):
    def __init__(self, parse=None, property=None, **kwargs): ...

class TopoDataFormat(DataFormat):
    def __init__(self, feature=None, mesh=None, **kwargs): ...

class DsvDataFormat(DataFormat):
    def __init__(self, delimiter=None, parse=None, **kwargs): ...
```
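
As an illustration of a format class not covered in the usage examples below, here is a minimal sketch that reads a pipe-delimited file with `DsvDataFormat`; the filename `data.psv` and its columns are placeholders.

```python
import altair as alt

# Pipe-delimited file: DSV with an explicit delimiter
chart = alt.Chart(
    alt.UrlData(
        url='data.psv',
        format=alt.DsvDataFormat(delimiter='|')
    )
).mark_bar().encode(
    x='category:N',
    y='value:Q'
)
```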

## Usage Examples

### Basic Data Loading

```python
import altair as alt
import pandas as pd

# From a pandas DataFrame (types are inferred from the columns)
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
chart = alt.Chart(df).mark_point().encode(x='x', y='y')

# From a URL (explicit type shorthands are required for URL data)
chart = alt.Chart('https://example.com/data.csv').mark_point().encode(x='x:Q', y='y:Q')

# From a JSON file
chart = alt.Chart('data.json').mark_point().encode(x='x:Q', y='y:Q')
```

### Data Transformer Configuration

```python
import altair as alt

# Enable the JSON data transformer (writes data to a file, referenced by URL)
alt.data_transformers.enable('json')

# Enable the data server for large datasets (requires the altair_data_server package)
alt.data_transformers.enable('data_server')

# Custom row limit for the default transformer
alt.data_transformers.enable('default', max_rows=10000)

# Check the active transformer
print(alt.data_transformers.active)
```

### Large Dataset Handling

```python
import altair as alt
import pandas as pd

# Large dataset
large_df = pd.DataFrame({
    'x': range(100000),
    'y': range(100000)
})

# Avoid embedding all rows in the spec: write the data to a JSON file instead
alt.data_transformers.enable('json')
chart = alt.Chart(large_df).mark_point().encode(x='x', y='y')

# Or manually sample before plotting
sampled_data = large_df.sample(n=1000)
chart = alt.Chart(sampled_data).mark_point().encode(x='x', y='y')
```

### Generated Data Sources

```python
import altair as alt

# Sequence data for mathematical functions
sequence_chart = alt.Chart(alt.sequence(1, 100)).transform_calculate(
    y='sin(datum.data * PI / 20)'
).mark_line().encode(
    x='data:Q',
    y=alt.Y('y:Q', title='sin(x)')
).properties(
    title='Sine Wave from Generated Sequence'
)

# Multiple sequences for comparison
comparison_chart = alt.Chart(alt.sequence(0, 10, 0.1)).transform_calculate(
    sin_val='sin(datum.data)',
    cos_val='cos(datum.data)'
).transform_fold(
    ['sin_val', 'cos_val'], as_=['function', 'value']
).mark_line().encode(
    x='data:Q',
    y='value:Q',
    color='function:N'
)

# Graticule for geographic maps
world_with_graticule = alt.layer(
    alt.Chart(alt.sphere()).mark_geoshape(fill='lightblue'),
    alt.Chart(alt.graticule()).mark_geoshape(
        stroke='white',
        strokeWidth=0.5,
        fill=None
    )
).resolve_scale(color='independent')

# Custom graticule spacing
custom_graticule = alt.Chart(
    alt.graticule(step=[30, 30])  # 30-degree grid
).mark_geoshape(stroke='gray', strokeWidth=1)

# TopoJSON feature extraction
us_states = alt.Chart(
    alt.topo_feature('https://vega.github.io/vega-datasets/data/us-10m.json', 'states')
).mark_geoshape().encode(
    color=alt.value('steelblue'),
    stroke=alt.value('white')
)
```

### Data Format Parsing

```python
import altair as alt

# CSV with custom date parsing (the pattern follows the Vega loader syntax)
chart = alt.Chart(
    alt.UrlData(
        url='data.csv',
        format=alt.CsvDataFormat(parse={'date': "date:'%Y-%m-%d'"})
    )
).mark_line().encode(
    x='date:T',
    y='value:Q'
)

# JSON with property extraction
chart = alt.Chart(
    alt.UrlData(
        url='data.json',
        format=alt.JsonDataFormat(property='results')
    )
).mark_bar().encode(
    x='category:N',
    y='value:Q'
)
```

## Types

```python { .api }
from typing import Union, Dict, Any, Optional, List, Callable, Literal

import pandas as pd

# Data source types
DataSource = Union[
    pd.DataFrame,
    str,                    # URL
    Dict[str, Any],         # Specification
    List[Dict[str, Any]],   # Inline values
    InlineData,
    UrlData,
    NamedData,
    SequenceGenerator,
    GraticuleGenerator,
    SphereGenerator
]

# Data transformer function type
DataTransformer = Callable[[Any], Dict[str, Any]]

# Parse specification
ParseDict = Dict[str, Union[str, None]]

# Format types
FormatType = Literal['json', 'csv', 'tsv', 'dsv', 'topojson']
```