# Dataset Loading

The loader API provides access to 70 curated datasets from the Vega visualization ecosystem. It supports both local and remote data sources, detects the file format automatically, and returns pandas DataFrames.

## Capabilities

### DataLoader Class

Main interface for accessing all available datasets, with flexible loading options and format support.

```python { .api }
class DataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load a dataset by name.

        Parameters:
        - name: str, dataset name (use list_datasets() to see available names)
        - return_raw: bool, if True return raw bytes instead of DataFrame
        - use_local: bool, if True prefer local data over remote when available
        - **kwargs: additional arguments passed to pandas parser (read_csv, read_json)

        Returns:
        pandas.DataFrame or bytes (if return_raw=True)
        """

    def list_datasets(self) -> List[str]:
        """Return list of all available dataset names."""

    def __getattr__(self, dataset_name: str):
        """Access datasets as attributes (e.g., data.iris())."""

    def __dir__(self) -> List[str]:
        """Support for tab completion and introspection."""
```
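
To make the attribute-style access and tab completion described above more concrete, here is a minimal sketch of the general technique. `MiniLoader` and its placeholder return value are hypothetical and are not taken from the package; the sketch only assumes that hyphens in dataset names are mapped to underscores for attribute access.

```python
from typing import Callable, Dict, List


class MiniLoader:
    """Hypothetical stand-in for a loader; not the vega_datasets implementation."""

    def __init__(self, names: List[str]) -> None:
        # Map method-safe names (hyphens -> underscores) back to dataset names.
        self._methods: Dict[str, str] = {n.replace("-", "_"): n for n in names}

    def list_datasets(self) -> List[str]:
        return sorted(self._methods.values())

    def __call__(self, name: str) -> str:
        if name not in self._methods.values():
            raise ValueError(f"No such dataset: {name!r}")
        return f"<contents of {name}>"  # placeholder for the parsed data

    def __getattr__(self, attr: str) -> Callable[[], str]:
        # Only reached when normal attribute lookup fails.
        methods = self.__dict__.get("_methods", {})
        if attr in methods:
            return lambda: self(methods[attr])
        raise AttributeError(f"No dataset named {attr!r}")

    def __dir__(self) -> List[str]:
        # Expose dataset names to tab completion alongside regular attributes.
        return sorted(set(super().__dir__()) | set(self._methods))


loader = MiniLoader(["iris", "cars", "seattle-temps"])
print(loader.seattle_temps())
print("seattle_temps" in dir(loader))
```

Routing attribute access through `__call__` keeps a single code path for name validation, so `loader.cars()` and `loader("cars")` behave the same way in this sketch.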

### LocalDataLoader Class

Restricted loader that exposes only the locally bundled datasets, ensuring offline operation.

```python { .api }
class LocalDataLoader:
    def __call__(self, name: str, return_raw: bool = False, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load a locally bundled dataset by name.

        Parameters:
        - name: str, local dataset name (use list_datasets() to see available names)
        - return_raw: bool, if True return raw bytes instead of DataFrame
        - use_local: bool, ignored (always True for local loader)
        - **kwargs: additional arguments passed to pandas parser

        Returns:
        pandas.DataFrame or bytes (if return_raw=True)

        Raises:
        ValueError: if dataset is not available locally
        """

    def list_datasets(self) -> List[str]:
        """Return list of locally available dataset names."""

    def __getattr__(self, dataset_name: str):
        """Access local datasets as attributes."""
```
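
The local-only restriction can be pictured as a membership check that runs before any loading happens. The sketch below is illustrative only: `require_local` and the two name lists are hypothetical, and the error messages are not the package's own.

```python
from typing import Iterable

# Hypothetical name lists; real ones would come from list_datasets().
ALL_NAMES = ["cars", "github", "stocks"]
LOCAL_NAMES = ["cars", "stocks"]


def require_local(name: str,
                  local_names: Iterable[str] = LOCAL_NAMES,
                  all_names: Iterable[str] = ALL_NAMES) -> str:
    """Return the name if it is bundled locally, otherwise raise ValueError."""
    if name in set(local_names):
        return name
    if name in set(all_names):
        # Known dataset, but it would need a download.
        raise ValueError(f"'{name}' is not available locally")
    raise ValueError(f"No such dataset: '{name}'")


print(require_local("stocks"))
try:
    require_local("github")
except ValueError as exc:
    print(f"Error: {exc}")
```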

### Dataset Base Class

Individual dataset handler providing metadata and flexible loading options.

```python { .api }
class Dataset:
    # Class methods
    @classmethod
    def init(cls, name: str) -> 'Dataset':
        """Return an instance of appropriate Dataset subclass for the given name."""

    @classmethod
    def list_datasets(cls) -> List[str]:
        """Return list of all available dataset names."""

    @classmethod
    def list_local_datasets(cls) -> List[str]:
        """Return list of locally available dataset names."""

    # Instance methods
    def raw(self, use_local: bool = True) -> bytes:
        """
        Load raw dataset bytes.

        Parameters:
        - use_local: bool, if True and dataset is local, load from package

        Returns:
        bytes: raw dataset content
        """

    def __call__(self, use_local: bool = True, **kwargs) -> pd.DataFrame:
        """
        Load and parse dataset.

        Parameters:
        - use_local: bool, prefer local data when available
        - **kwargs: passed to pandas parser (read_csv, read_json, read_csv with sep='\t')

        Returns:
        pandas.DataFrame: parsed dataset
        """

    # Properties
    @property
    def filepath(self) -> str:
        """Local file path (only valid for local datasets)."""

    # Instance attributes
    name: str              # Dataset name
    methodname: str        # Method-safe name (hyphens -> underscores)
    filename: str          # Original filename
    url: str               # Full remote URL
    format: str            # File format ('csv', 'json', 'tsv', 'png')
    pkg_filename: str      # Path within package
    is_local: bool         # True if bundled locally
    description: str       # Dataset description
    references: List[str]  # Academic references
```
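
`init` is described as returning an instance of the appropriate subclass for a name. One common way to get that behaviour is a registry keyed by dataset name, sketched below. This is an assumption about the general pattern, not the package's actual code; `BaseDataset`, `StocksDataset`, and `read_kwargs` are hypothetical names.

```python
from typing import Dict, Type


class BaseDataset:
    """Hypothetical base class; all names here are illustrative only."""

    name: str = ""
    read_kwargs: Dict[str, object] = {}
    _registry: Dict[str, Type["BaseDataset"]] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if cls.name:
            # Register each specialised subclass under its dataset name.
            BaseDataset._registry[cls.name] = cls

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def init(cls, name: str) -> "BaseDataset":
        # Use a specialised subclass when one is registered, else the base class.
        return cls._registry.get(name, BaseDataset)(name)


class StocksDataset(BaseDataset):
    name = "stocks"
    read_kwargs = {"parse_dates": ["date"]}  # assumed per-dataset parser defaults


print(type(BaseDataset.init("stocks")).__name__)
print(type(BaseDataset.init("iris")).__name__)
```

A registry like this lets a dataset carry its own parser defaults while callers keep using a single entry point.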

## Usage Examples

### Basic Dataset Loading

```python
from vega_datasets import data

# Load classic iris dataset
iris = data.iris()
print(iris.shape)  # (150, 5)
print(iris.columns.tolist())  # ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species']

# Load by string name
cars = data('cars')
print(cars.head())

# Pass pandas arguments
airports = data.airports(usecols=['iata', 'name', 'city', 'state'])
```

### Local vs Remote Loading

```python
from vega_datasets import data, local_data

# Force remote loading (even for local datasets)
iris_remote = data.iris(use_local=False)

# Local-only loading (fails for remote datasets)
try:
    stocks_local = local_data.stocks()  # Works - stocks is local
    github_local = local_data.github()  # Fails - github is remote-only
except ValueError as e:
    print(f"Error: {e}")

# Check if dataset is local
print(f"Iris is local: {data.iris.is_local}")      # True
print(f"GitHub is local: {data.github.is_local}")  # False
```

### Raw Data Access

```python
import json

from vega_datasets import data

# Get raw bytes instead of DataFrame
raw_data = data.iris.raw()
print(type(raw_data))  # <class 'bytes'>

# Use with custom parsing
raw_json = data.cars.raw()
custom_data = json.loads(raw_json.decode())

# Raw data through call method
raw_csv = data('airports', return_raw=True)
```
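
Raw bytes are also convenient when a file should be stored unchanged, for example in a project-level cache. The following is a small sketch under stated assumptions: `save_raw` is a hypothetical helper, and `sample` is a made-up payload standing in for whatever `raw()` would return.

```python
import hashlib
import tempfile
from pathlib import Path


def save_raw(raw: bytes, directory: Path, filename: str) -> Path:
    """Write raw bytes to directory/filename without re-encoding them."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    target.write_bytes(raw)
    return target


# Made-up payload; in practice this would be the bytes returned by raw().
sample = b'[{"Name": "example", "Year": "1970-01-01"}]'

with tempfile.TemporaryDirectory() as tmp:
    path = save_raw(sample, Path(tmp) / "cache", "example.json")
    # Compare checksums to confirm the file round-trips byte for byte.
    unchanged = (hashlib.sha256(path.read_bytes()).hexdigest()
                 == hashlib.sha256(sample).hexdigest())

print(unchanged)
```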

### Dataset Discovery

```python
from vega_datasets import data, local_data

# List all datasets
all_datasets = data.list_datasets()
print(f"Total datasets: {len(all_datasets)}")  # 70

# List only local datasets
local_datasets = local_data.list_datasets()
print(f"Local datasets: {len(local_datasets)}")  # 17

# Check specific dataset availability
print("Local datasets:", local_datasets[:5])
# ['airports', 'anscombe', 'barley', 'burtin', 'cars']

# Use tab completion in interactive environments
# data.<TAB> shows all available datasets
```
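
Given the two lists above, the datasets that require a download can be derived with a set difference. This is a minimal sketch; `remote_only` is a hypothetical helper and the example names are placeholders rather than the real listing.

```python
from typing import Iterable, List


def remote_only(all_names: Iterable[str], local_names: Iterable[str]) -> List[str]:
    """Return, in sorted order, the names that are not bundled locally."""
    return sorted(set(all_names) - set(local_names))


# Placeholder lists; in practice pass data.list_datasets() and
# local_data.list_datasets().
print(remote_only(["cars", "github", "stocks"], ["cars", "stocks"]))
```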

### Advanced Pandas Integration

```python
from vega_datasets import data
import pandas as pd

# Load with pandas options
flights = data.flights(
    parse_dates=['date'],
    dtype={'origin': 'category', 'destination': 'category'}
)

# TSV format handling (automatic)
seattle_temps = data.seattle_temps()  # Handles TSV automatically

# JSON with custom options
github_data = data.github(lines=True)  # If supported by dataset format
```

### Metadata Access

```python
from vega_datasets import data

# Access dataset metadata
iris_dataset = data.iris  # Get Dataset object (don't call yet)
print(f"Name: {iris_dataset.name}")
print(f"Format: {iris_dataset.format}")
print(f"URL: {iris_dataset.url}")
print(f"Local: {iris_dataset.is_local}")
print(f"Description: {iris_dataset.description}")

# Get file path for local datasets
if iris_dataset.is_local:
    print(f"Local path: {iris_dataset.filepath}")
```

### Error Handling

```python
from vega_datasets import data
from urllib.error import URLError

# Handle invalid dataset names
try:
    df = data('invalid-name')
except ValueError as e:
    print(f"Dataset error: {e}")

# Handle network issues for remote datasets
try:
    df = data.github(use_local=False)
except URLError as e:
    print(f"Network error: {e}")
    # Fallback to local if available
    if data.github.is_local:
        df = data.github(use_local=True)
```
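
The remote-then-local pattern above can be factored into a reusable helper. The sketch below assumes only that the loader callable accepts a `use_local` keyword and raises `URLError` on network failure, as in the example; `load_with_fallback` and `fake_load` are hypothetical names.

```python
from typing import Any, Callable
from urllib.error import URLError


def load_with_fallback(load: Callable[..., Any], is_local: bool, **kwargs: Any) -> Any:
    """Try a remote load first; fall back to the bundled copy on a network error."""
    try:
        return load(use_local=False, **kwargs)
    except URLError:
        if not is_local:
            raise  # nothing bundled to fall back to
        return load(use_local=True, **kwargs)


def fake_load(use_local: bool = True) -> str:
    # Stand-in for a dataset callable while the network is unavailable.
    if not use_local:
        raise URLError("simulated network failure")
    return "bundled copy"


print(load_with_fallback(fake_load, is_local=True))
```

With the real package, the call would look like `load_with_fallback(data.stocks, data.stocks.is_local)`.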

### Connection Testing

```python
from vega_datasets import data, local_data
from vega_datasets.utils import connection_ok

# Check internet connectivity before loading remote datasets
if connection_ok():
    github_data = data.github()
    print("Loaded remote dataset successfully")
else:
    print("No internet connection - using local datasets only")
    local_datasets = local_data.list_datasets()
    stocks_data = local_data.stocks()
```

## Supported File Formats

The package automatically handles multiple data formats:

- **CSV**: Comma-separated values (most common)
- **JSON**: JavaScript Object Notation (nested data structures)
- **TSV**: Tab-separated values (parsed with `read_csv` using `sep='\t'`)
- **PNG**: Portable Network Graphics (for the 7zip dataset; available as raw bytes only)

The format is taken from each dataset's metadata, and the matching pandas parser is used for it.

**Note**: PNG format datasets (like 7zip) can only be accessed via the `raw()` method or with `return_raw=True`, as the DataFrame parsing will raise a ValueError for unsupported formats.
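
Format-based dispatch of this kind can be sketched with a lookup table from format name to parser. The version below uses only the standard library in place of the pandas readers, so it illustrates the idea rather than the package's implementation; `PARSERS` and `parse` are hypothetical names.

```python
import csv
import io
import json
from typing import Any, Callable, Dict


def _read_delimited(delimiter: str) -> Callable[[bytes], Any]:
    """Build a parser for delimited text that returns a list of row dicts."""
    def reader(raw: bytes) -> Any:
        text = io.StringIO(raw.decode("utf-8"))
        return list(csv.DictReader(text, delimiter=delimiter))
    return reader


# Format name -> parser; 'png' is deliberately absent.
PARSERS: Dict[str, Callable[[bytes], Any]] = {
    "csv": _read_delimited(","),
    "tsv": _read_delimited("\t"),
    "json": lambda raw: json.loads(raw.decode("utf-8")),
}


def parse(raw: bytes, fmt: str) -> Any:
    """Parse raw bytes according to fmt, or raise ValueError if unsupported."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported format for parsing: {fmt!r}") from None
    return parser(raw)


print(parse(b"a\tb\n1\t2\n", "tsv"))
try:
    parse(b"\x89PNG", "png")
except ValueError as exc:
    print(f"Error: {exc}")
```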