# Dataset Loading

Load any of 70 curated datasets from the Vega visualization ecosystem. Datasets can be read from local (bundled) or remote sources, with automatic format detection and parsing into pandas DataFrames.

## Capabilities

### DataLoader Class

Main interface for accessing all available datasets, with flexible loading options and format support.

```python { .api }
class DataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load a dataset by name.

        Parameters:
        - name: str, dataset name (use list_datasets() to see available names)
        - return_raw: bool, if True return raw bytes instead of DataFrame
        - use_local: bool, if True prefer local data over remote when available
        - **kwargs: additional arguments passed to pandas parser (read_csv, read_json)

        Returns:
            pandas.DataFrame or bytes (if return_raw=True)
        """

    def list_datasets(self) -> List[str]:
        """Return list of all available dataset names."""

    def __getattr__(self, dataset_name: str):
        """Access datasets as attributes (e.g., data.iris())."""

    def __dir__(self) -> List[str]:
        """Support for tab completion and introspection."""
```
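The attribute-style access that `__getattr__` and `__dir__` provide can be sketched with a small stand-in class. This is a minimal illustration and not the package's implementation: `MiniLoader`, its loader callables, and the `example-data` name are hypothetical, and the sketch assumes hyphens in dataset names map to underscores in attribute names.

```python
from typing import Callable, Dict, List


class MiniLoader:
    """Hypothetical stand-in for a DataLoader-like object (not the library's code)."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        # Map dataset names (which may contain hyphens) to zero-argument loaders.
        self._loaders = loaders

    def list_datasets(self) -> List[str]:
        return sorted(self._loaders)

    def __call__(self, name: str) -> object:
        # Lookup by the original string name.
        if name not in self._loaders:
            raise ValueError(f"No such dataset: {name!r}")
        return self._loaders[name]()

    def __getattr__(self, attr: str) -> Callable[[], object]:
        # Python only calls this when normal attribute lookup fails.
        if attr.startswith("_"):
            raise AttributeError(attr)
        for name, loader in self._loaders.items():
            if name.replace("-", "_") == attr:
                return loader
        raise AttributeError(f"No dataset named {attr!r}")

    def __dir__(self) -> List[str]:
        # Expose attribute-safe names so tab completion can find them.
        return [name.replace("-", "_") for name in self._loaders]


demo = MiniLoader({
    "iris": lambda: "iris rows",
    "example-data": lambda: "example rows",
})
print(demo("iris"))          # iris rows
print(demo.example_data())   # example rows
print(dir(demo))             # ['example_data', 'iris']
```

Returning a callable from `__getattr__` is what lets the same object support both `loader.name()` and `loader("name")`.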

### LocalDataLoader Class

Loader restricted to locally bundled datasets, which guarantees offline operation.

```python { .api }
class LocalDataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load a locally bundled dataset by name.

        Parameters:
        - name: str, local dataset name (use list_datasets() to see available names)
        - return_raw: bool, if True return raw bytes instead of DataFrame
        - use_local: bool, ignored (always True for local loader)
        - **kwargs: additional arguments passed to pandas parser

        Returns:
            pandas.DataFrame or bytes (if return_raw=True)

        Raises:
            ValueError: if dataset is not available locally
        """

    def list_datasets(self) -> List[str]:
        """Return list of locally available dataset names."""

    def __getattr__(self, dataset_name: str):
        """Access local datasets as attributes."""
```

### Dataset Base Class

Individual dataset handler providing metadata and flexible loading options.

```python { .api }
class Dataset:
    # Class methods
    @classmethod
    def init(cls, name: str) -> 'Dataset':
        """Return an instance of appropriate Dataset subclass for the given name."""

    @classmethod
    def list_datasets(cls) -> List[str]:
        """Return list of all available dataset names."""

    @classmethod
    def list_local_datasets(cls) -> List[str]:
        """Return list of locally available dataset names."""

    # Instance methods
    def raw(self, use_local: bool = True) -> bytes:
        """
        Load raw dataset bytes.

        Parameters:
        - use_local: bool, if True and dataset is local, load from package

        Returns:
            bytes: raw dataset content
        """

    def __call__(self, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load and parse dataset.

        Parameters:
        - use_local: bool, prefer local data when available
        - **kwargs: passed to pandas parser (read_csv, read_json, read_csv with sep='\t')

        Returns:
            pandas.DataFrame: parsed dataset
        """

    # Properties
    @property
    def filepath(self) -> str:
        """Local file path (only valid for local datasets)."""

    # Instance attributes
    name: str              # Dataset name
    methodname: str        # Method-safe name (hyphens -> underscores)
    filename: str          # Original filename
    url: str               # Full remote URL
    format: str            # File format ('csv', 'json', 'tsv', 'png')
    pkg_filename: str      # Path within package
    is_local: bool         # True if bundled locally
    description: str       # Dataset description
    references: List[str]  # Academic references
```
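To make the relationship between a few of these attributes concrete, here is a minimal sketch of a metadata record. It is not the package's `Dataset` class: `DatasetInfo`, `BASE_URL`, and the `example-data` values are hypothetical, and the sketch assumes `url` is a base URL joined with `filename`.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the real package defines its own base URL.
BASE_URL = "https://example.com/data/"


@dataclass(frozen=True)
class DatasetInfo:
    """Illustrative metadata record (not the package's Dataset class)."""

    name: str       # dataset name, may contain hyphens
    filename: str   # original filename
    format: str     # 'csv', 'json', 'tsv', or 'png'
    is_local: bool  # True if a bundled copy exists

    @property
    def methodname(self) -> str:
        # Hyphens are not valid in Python identifiers, so replace them.
        return self.name.replace("-", "_")

    @property
    def url(self) -> str:
        # Assumed layout: remote files live directly under the base URL.
        return BASE_URL + self.filename


info = DatasetInfo(
    name="example-data",
    filename="example-data.csv",
    format="csv",
    is_local=False,
)
print(info.methodname)  # example_data
print(info.url)         # https://example.com/data/example-data.csv
```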

## Usage Examples

### Basic Dataset Loading

```python
from vega_datasets import data

# Load classic iris dataset
iris = data.iris()
print(iris.shape)  # (150, 5)
print(iris.columns.tolist())  # ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species']

# Load by string name
cars = data('cars')
print(cars.head())

# Pass pandas arguments
airports = data.airports(usecols=['iata', 'name', 'city', 'state'])
```

### Local vs Remote Loading

```python
from vega_datasets import data, local_data

# Force remote loading (even for local datasets)
iris_remote = data.iris(use_local=False)

# Local-only loading (fails for remote datasets)
try:
    stocks_local = local_data.stocks()  # Works - stocks is local
    github_local = local_data.github()  # Fails - github is remote-only
except ValueError as e:
    print(f"Error: {e}")

# Check if dataset is local
print(f"Iris is local: {data.iris.is_local}")  # True
print(f"GitHub is local: {data.github.is_local}")  # False
```

### Raw Data Access

```python
import json

from vega_datasets import data

# Get raw bytes instead of DataFrame
raw_data = data.iris.raw()
print(type(raw_data))  # <class 'bytes'>

# Use with custom parsing
raw_json = data.cars.raw()
custom_data = json.loads(raw_json.decode())

# Raw data through call method
raw_csv = data('airports', return_raw=True)
```

### Dataset Discovery

```python
from vega_datasets import data, local_data

# List all datasets
all_datasets = data.list_datasets()
print(f"Total datasets: {len(all_datasets)}")  # 70

# List only local datasets
local_datasets = local_data.list_datasets()
print(f"Local datasets: {len(local_datasets)}")  # 17

# Check specific dataset availability
print("Local datasets:", local_datasets[:5])
# ['airports', 'anscombe', 'barley', 'burtin', 'cars']

# Use tab completion in interactive environments
# data.<TAB> shows all available datasets
```

### Advanced Pandas Integration

```python
from vega_datasets import data
import pandas as pd

# Load with pandas options
flights = data.flights(
    parse_dates=['date'],
    dtype={'origin': 'category', 'destination': 'category'}
)

# TSV format handling (automatic)
seattle_temps = data.seattle_temps()  # Handles TSV automatically

# JSON with custom options
github_data = data.github(lines=True)  # If supported by dataset format
```

### Metadata Access

```python
from vega_datasets import data

# Access dataset metadata
iris_dataset = data.iris  # Get Dataset object (don't call yet)
print(f"Name: {iris_dataset.name}")
print(f"Format: {iris_dataset.format}")
print(f"URL: {iris_dataset.url}")
print(f"Local: {iris_dataset.is_local}")
print(f"Description: {iris_dataset.description}")

# Get file path for local datasets
if iris_dataset.is_local:
    print(f"Local path: {iris_dataset.filepath}")
```

### Error Handling

```python
from vega_datasets import data
from urllib.error import URLError

# Handle invalid dataset names.
# Depending on the lookup path, an unknown name may surface as
# AttributeError (attribute lookup) or ValueError, so catch both.
try:
    df = data('invalid-name')
except (AttributeError, ValueError) as e:
    print(f"Dataset error: {e}")

# Handle network issues for remote datasets
try:
    df = data.github(use_local=False)
except URLError as e:
    print(f"Network error: {e}")
    # Fallback to local if available
    if data.github.is_local:
        df = data.github(use_local=True)
```
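The remote-then-local fallback above can be generalized into a small helper. The following is a sketch under stated assumptions, not part of the package: `first_successful`, `flaky_remote`, and `bundled_copy` are hypothetical names, and the two loaders are stand-ins so the example runs without network access.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar
from urllib.error import URLError

T = TypeVar("T")


def first_successful(
    loaders: Sequence[Callable[[], T]],
    retry_on: Tuple[Type[BaseException], ...] = (URLError,),
) -> T:
    """Try each loader in order and return the first result that succeeds."""
    last_error = None
    for load in loaders:
        try:
            return load()
        except retry_on as exc:
            # Remember the failure and move on to the next loader.
            last_error = exc
    if last_error is None:
        raise ValueError("no loaders were provided")
    raise last_error


def flaky_remote():
    # Stand-in for a remote load that fails.
    raise URLError("simulated network failure")


def bundled_copy():
    # Stand-in for a local load that succeeds.
    return [{"symbol": "AAPL"}]


result = first_successful([flaky_remote, bundled_copy])
print(result)  # [{'symbol': 'AAPL'}]
```

Exceptions outside `retry_on` propagate immediately, so a misspelled dataset name is not silently masked by the fallback.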

### Connection Testing

```python
from vega_datasets import data, local_data
from vega_datasets.utils import connection_ok

# Check internet connectivity before loading remote datasets
if connection_ok():
    github_data = data.github()
    print("Loaded remote dataset successfully")
else:
    print("No internet connection - using local datasets only")
    local_datasets = local_data.list_datasets()
    stocks_data = local_data.stocks()
```

## Supported File Formats

The package automatically handles multiple data formats:

- **CSV**: Comma-separated values (most common)
- **JSON**: JavaScript Object Notation (nested data structures)
- **TSV**: Tab-separated values (parsed with `read_csv` using a tab separator)
- **PNG**: Portable Network Graphics (for the `7zip` dataset; returns raw bytes)

The format is taken from each dataset's metadata, and the matching pandas parser is used for each format.

**Note**: PNG datasets (such as `7zip`) can only be accessed via the `raw()` method or with `return_raw=True`; parsing them into a DataFrame raises a `ValueError` for the unsupported format.
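The format-based dispatch described above can be sketched as a lookup table from format label to parser. This is an illustration only: it uses the standard library's `csv` and `json` modules as stand-ins for the pandas parsers, and `PARSERS`, `parse_raw`, and `_parse_delimited` are hypothetical names that do not exist in the package.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List


def _parse_delimited(raw: bytes, delimiter: str) -> List[Dict[str, str]]:
    # Decode the bytes and read each row as a dict keyed by the header row.
    text = io.StringIO(raw.decode("utf-8"), newline="")
    return list(csv.DictReader(text, delimiter=delimiter))


# Hypothetical dispatch table: format label -> parser for raw bytes.
PARSERS: Dict[str, Callable[[bytes], Any]] = {
    "csv": lambda raw: _parse_delimited(raw, ","),
    "tsv": lambda raw: _parse_delimited(raw, "\t"),
    "json": lambda raw: json.loads(raw.decode("utf-8")),
}


def parse_raw(raw: bytes, fmt: str) -> Any:
    """Parse raw bytes according to a format label, or raise ValueError."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        # Formats without a tabular parser (such as 'png') end up here.
        raise ValueError(f"Unrecognized file format: {fmt!r}") from None
    return parser(raw)


rows = parse_raw(b"name\tvalue\na\t1\n", "tsv")
print(rows)  # [{'name': 'a', 'value': '1'}]
```

Keeping the parsers in a table means an unsupported format fails with one clear error instead of being handed to the wrong parser.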