# Vega Datasets

A Python package for accessing the Vega visualization datasets, a collection of well-known datasets commonly used in data visualization and statistical analysis. A subset is bundled with the package for offline use, and the rest are fetched on demand. Most datasets are returned as pandas DataFrames, so they slot directly into Python data science workflows.

## Package Information

- **Package Name**: vega_datasets
- **Language**: Python
- **Installation**: `pip install vega_datasets`

## Core Imports

```python
import vega_datasets
from vega_datasets import data, local_data
```

Access individual components:

```python
from vega_datasets import DataLoader, LocalDataLoader
from vega_datasets.utils import connection_ok
```

## Basic Usage

```python
from vega_datasets import data

# Load a dataset by calling the data loader with the dataset name
iris_df = data('iris')
print(type(iris_df))  # pandas.DataFrame

# Or use attribute access
iris_df = data.iris()

# Get the list of available datasets
all_datasets = data.list_datasets()
print(len(all_datasets))  # 70 datasets

# Extra keyword arguments are forwarded to the underlying pandas reader.
# cars is stored as JSON and pd.read_json has no usecols option,
# so select the columns after loading instead.
cars_df = data.cars()[['Name', 'Miles_per_Gallon', 'Horsepower']]

# Access only locally bundled datasets (no internet required)
from vega_datasets import local_data
stocks_df = local_data.stocks()

# Get raw data instead of a parsed DataFrame
raw_data = data.iris.raw()
print(type(raw_data))  # bytes
```

## Architecture

The package follows a loader pattern with automatic fallback between local and remote data sources. With `use_local=True` (the default), a dataset is read from its bundled copy when one exists and fetched remotely otherwise:

- **DataLoader**: Main interface for accessing all 70 datasets (17 local + 53 remote)
- **LocalDataLoader**: Restricted interface for only locally bundled datasets
- **Dataset**: Base class handling individual dataset loading, parsing, and metadata
- **Specialized Dataset Subclasses**: Custom loaders for datasets requiring specific handling

The design supports both bundled offline access and remote fetching, so code that uses only bundled datasets runs without a network connection.
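
The local-then-remote fallback can be sketched in a few lines. This is a minimal illustration only, not the package's implementation: `SketchLoader`, `fake_fetch`, and the byte payloads are hypothetical names and data invented for the example.

```python
from typing import Callable, Dict, List


class SketchLoader:
    """Hypothetical loader illustrating local-first lookup with remote fallback."""

    def __init__(
        self,
        local: Dict[str, bytes],
        fetch_remote: Callable[[str], bytes],
        remote_names: List[str],
    ) -> None:
        self._local = local
        self._fetch_remote = fetch_remote
        self._names = sorted(set(local) | set(remote_names))

    def list_datasets(self) -> List[str]:
        return list(self._names)

    def __call__(self, name: str, use_local: bool = True) -> bytes:
        if name not in self._names:
            raise ValueError(f"No such dataset: {name}")
        # Prefer the bundled copy; otherwise fall back to the remote source.
        if use_local and name in self._local:
            return self._local[name]
        return self._fetch_remote(name)


calls: List[str] = []


def fake_fetch(name: str) -> bytes:
    # Stand-in for a network request; records which names were fetched.
    calls.append(name)
    return b"remote:" + name.encode()


loader = SketchLoader(
    local={"iris": b"local:iris"},
    fetch_remote=fake_fetch,
    remote_names=["iris", "github"],
)
bundled = loader("iris")                  # served from the bundled copy
forced = loader("iris", use_local=False)  # bypasses the bundled copy
remote_only = loader("github")            # no bundled copy, so it is fetched
```

Injecting the fetch function keeps the sketch testable without network access; a real loader would also parse the bytes into a DataFrame.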

## Capabilities

### Core Data Loading

Primary interface for loading datasets, either by calling the loader with a dataset name or through attribute access, with automatic format detection and pandas DataFrame conversion.

```python { .api }
class DataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame: ...
    def list_datasets(self) -> List[str]: ...

class LocalDataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame: ...
    def list_datasets(self) -> List[str]: ...
```

[Dataset Loading](./dataset-loading.md)

### Specialized Dataset Handling

Enhanced loaders for datasets requiring custom parsing, date handling, or alternative return types beyond standard DataFrames.

```python { .api }
# Stocks with pivot support
def stocks(pivoted: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame: ...

# Miserables returns tuple of DataFrames
def miserables(use_local: bool = True, **kwargs) -> Tuple[pd.DataFrame, pd.DataFrame]: ...

# Geographic data returns dict objects
def us_10m(use_local: bool = True, **kwargs) -> dict: ...
def world_110m(use_local: bool = True, **kwargs) -> dict: ...
```

[Specialized Datasets](./specialized-datasets.md)
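
To make the `pivoted` option of `stocks` concrete, the sketch below shows what a long-to-wide pivot looks like in pandas. It rests on assumptions: the rows are synthetic, the column names `symbol`, `date`, and `price` are assumed, and the exact reshaping the loader performs may differ.

```python
import pandas as pd

# Synthetic long-format rows shaped like a stocks table (values are made up).
long_df = pd.DataFrame(
    {
        "symbol": ["AAA", "AAA", "BBB", "BBB"],
        "date": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-01-01", "2020-02-01"]
        ),
        "price": [10.0, 11.0, 20.0, 19.5],
    }
)

# Wide format: one row per date, one column per symbol.
wide_df = long_df.pivot(index="date", columns="symbol", values="price")
```

The wide layout is convenient for plotting one line per symbol, while the long layout suits grammar-of-graphics tools that encode `symbol` as a color channel.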

## Dataset Categories

**Locally Bundled (17 datasets)** - Available without internet connection:
- Statistical classics: `iris`, `anscombe`, `cars`
- Time series: `stocks`, `seattle-weather`, `seattle-temps`, `sf-temps`
- Economic data: `iowa-electricity`, `us-employment`
- Geographic: `airports`, `la-riots`
- Scientific: `barley`, `wheat`, `burtin`, `crimea`, `driving`
- Financial: `ohlc`

**Remote Datasets (53 datasets)** - Require internet connection:
- Visualization examples: `7zip`, `flare`, `flare-dependencies`
- Global data: `countries`, `world-110m`, `population`
- Economic/social: `budget`, `budgets`, `disasters`, `gapminder`
- Scientific: `climate`, `co2-concentration`, `earthquakes`, `annual-precip`
- Technology: `github`, `ffox`, `movies`
- And many more specialized datasets
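
Many of the names above contain hyphens, which are not valid in Python identifiers, so attribute access uses underscores instead (for example `data.seattle_weather()` for `seattle-weather`). The helper below is a hypothetical illustration of that mapping, not part of the package; it also shows that a name such as `7zip` cannot be written as a plain attribute at all.

```python
def attribute_name(dataset_name: str) -> str:
    """Convert a listed dataset name into its attribute-style spelling."""
    return dataset_name.replace("-", "_")


listed = ["seattle-weather", "us-employment", "iris"]
as_attributes = [attribute_name(name) for name in listed]

# Names that start with a digit are not valid identifiers, so they need
# the call form (data('7zip')) or getattr rather than dot access.
seven_zip_is_identifier = attribute_name("7zip").isidentifier()
```

When in doubt, the call form with the listed name, such as `data('seattle-weather')`, avoids the identifier question.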

## Error Handling

```python
from vega_datasets import data

# Dataset not found. Depending on how the name is resolved, the failure may
# surface as AttributeError or ValueError, so both are caught here.
try:
    df = data('nonexistent-dataset')
except (AttributeError, ValueError) as e:
    print(e)  # Message naming the missing dataset

# Remote-only dataset requested from LocalDataLoader
from vega_datasets import local_data
try:
    df = local_data.github()  # github is remote-only
except ValueError as e:
    print(e)  # "'github' dataset is not available locally..."

# Network issues for remote datasets
try:
    df = data.github(use_local=False)  # Force remote access
except Exception as e:
    print(f"Network error: {e}")
```

## Utility Functions

```python { .api }
def connection_ok() -> bool:
    """
    Check if web connection is available for remote datasets.

    Returns:
        bool: True if web connection is OK, False otherwise.
    """
```
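
One plausible use of `connection_ok` is to choose between the full loader and the bundled-only loader before loading. The sketch below is an assumption about usage rather than a documented recipe: `load_with_connectivity_check` is a hypothetical helper, and the loaders and connectivity check are passed in as plain callables so the example runs without the package or a network.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_connectivity_check(
    name: str,
    is_online: Callable[[], bool],
    load_any: Callable[[str], T],
    load_local: Callable[[str], T],
) -> T:
    """Use the full loader when online, the bundled-only loader otherwise."""
    if is_online():
        return load_any(name)
    return load_local(name)


# Stand-in loaders that report which path was taken.
online_result = load_with_connectivity_check(
    "stocks", lambda: True, lambda n: ("full", n), lambda n: ("local", n)
)
offline_result = load_with_connectivity_check(
    "stocks", lambda: False, lambda n: ("full", n), lambda n: ("local", n)
)
```

With the package installed, the same helper could be wired up as `load_with_connectivity_check("stocks", connection_ok, data, local_data)`, since both loaders accept a dataset name when called.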

## Types

```python { .api }
from typing import List, Tuple, Dict, Any
import pandas as pd

# Core classes
class DataLoader: ...
class LocalDataLoader: ...
class Dataset: ...

# Package-level exports
data: DataLoader
local_data: LocalDataLoader
__version__: str

# Utility functions
def connection_ok() -> bool: ...
```