# PyStow

PyStow is a Python library that provides a standardized, configurable way to manage data directories for Python applications. It offers a simple API for creating and accessing application-specific data directories in a user's file system, with support for nested directory structures, automatic directory creation, and environment variable-based configuration.

The library lets developers download, cache, and manage files from the internet, with built-in support for data formats including CSV, RDF, Excel, and compressed archives (ZIP, TAR, LZMA, GZ). Files are downloaded only once and then cached locally. Tabular data is handled through pandas integration and RDF data through rdflib integration. Storage locations are configurable and respect both traditional home-directory patterns and the XDG Base Directory specification.

## Package Information

- **Package Name**: pystow
- **Language**: Python
- **Installation**: `pip install pystow`

## Core Imports

```python
import pystow

# Most common usage patterns
module = pystow.module("myapp")
path = pystow.join("myapp", "data")
data = pystow.ensure_csv("myapp", url="https://example.com/data.csv")
```

## Basic Usage

### Directory Management

```python
import pystow

# Get a module for your application
module = pystow.module("myapp")

# Create nested directories and get paths
data_dir = module.join("datasets", "version1")
config_path = module.join("config", name="settings.json")

# Using functional API
path = pystow.join("myapp", "data", name="file.txt")
```

### File Download and Caching

```python
import pystow

# Download and cache a file
path = pystow.ensure(
    "myapp", "data",
    url="https://example.com/dataset.csv",
    name="dataset.csv"
)

# File is automatically cached - subsequent calls return the cached version
# Use force=True to re-download
path = pystow.ensure(
    "myapp", "data",
    url="https://example.com/dataset.csv",
    name="dataset.csv",
    force=True
)
```

### Data Format Integration

```python
import pystow
import pandas as pd

# Download and load CSV as DataFrame
df = pystow.ensure_csv(
    "myapp", "datasets",
    url="https://example.com/data.csv"
)

# Download and parse JSON
data = pystow.ensure_json(
    "myapp", "config",
    url="https://api.example.com/config.json"
)

# Work with compressed files
graph = pystow.ensure_rdf(
    "myapp", "ontologies",
    url="https://example.com/ontology.rdf.gz",
    parse_kwargs={"format": "xml"}
)
```

## Architecture

PyStow is built around a modular architecture with two main usage patterns:

1. **Functional API**: Direct function calls for quick operations (`pystow.ensure()`, `pystow.join()`)
2. **Module-based API**: Create `Module` instances for organized data management (`pystow.module()`)

The core `Module` class manages directory structures and provides methods for file operations, while the functional API provides convenient shortcuts for common tasks. All operations support:

- **Configurable base directories** via environment variables
- **Version-aware storage** for handling different data versions
- **Automatic directory creation** with the `ensure_exists` parameter
- **Force re-download capabilities** for cache invalidation
- **Flexible data format support** through specialized ensure/load/dump methods

## Capabilities

### [Directory Management](./directory-management.md)

Core functionality for creating and managing application data directories with configurable storage locations and automatic directory creation.

```python { .api }
def module(key: str, *subkeys: str, ensure_exists: bool = True) -> Module:
    """Return a module for the application.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        ensure_exists: Should all directories be created automatically? Defaults to true.

    Returns:
        The module object that manages getting and ensuring
    """

def join(key: str, *subkeys: str, name: str | None = None, ensure_exists: bool = True, version: VersionHint = None) -> Path:
    """Return the home data directory for the given module.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join
        name: The name of the file (optional) inside the folder
        ensure_exists: Should all directories be created automatically? Defaults to true.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.

    Returns:
        The path of the directory or subdirectory for the given module.
    """
```
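The lookup order described in these docstrings can be illustrated with the standard library alone. The following is a minimal sketch, not PyStow's implementation: `resolve_base`, `join_sketch`, the `default_root` parameter, the `~/.data` fallback, and the `sketchapp` key are all hypothetical names and assumptions chosen for the example. It checks `<KEY>_HOME` first, prepends an optional version before the subkeys, and creates directories when `ensure_exists` is true.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional, Union

# Mirrors the VersionHint idea: a string, None, or a no-argument callable.
VersionHint = Union[None, str, Callable[[], Optional[str]]]


def resolve_base(key: str, default_root: Path) -> Path:
    """Return the base directory for ``key`` (illustrative only)."""
    # Check the <KEY>_HOME environment variable first.
    override = os.environ.get(f"{key.upper()}_HOME")
    if override:
        return Path(override).expanduser()
    # Otherwise use a folder named after the key under the default root.
    return default_root / key


def join_sketch(
    key: str,
    *subkeys: str,
    name: Optional[str] = None,
    ensure_exists: bool = True,
    version: VersionHint = None,
    default_root: Optional[Path] = None,
) -> Path:
    """Build a directory (or file) path for ``key`` (hypothetical helper)."""
    # Assumption: ~/.data is used here as the fallback root for the sketch.
    root = default_root if default_root is not None else Path.home() / ".data"
    directory = resolve_base(key, root)
    # A callable version hint is evaluated lazily; the result goes before the subkeys.
    resolved = version() if callable(version) else version
    if resolved is not None:
        directory = directory / resolved
    directory = directory.joinpath(*subkeys)
    if ensure_exists:
        directory.mkdir(parents=True, exist_ok=True)
    return directory / name if name else directory


with tempfile.TemporaryDirectory() as tmp:
    demo_path = join_sketch(
        "sketchapp", "datasets", name="file.txt", version="v1", default_root=Path(tmp)
    )
    demo_relative = demo_path.relative_to(tmp).as_posix()
```

Only the directories are created; the file named by `name` is never touched, which matches the idea of `join` returning a path rather than writing data.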

### [File Download and Caching](./file-operations.md)

Comprehensive file download system with caching, compression support, and cloud storage integration.

```python { .api }
def ensure(key: str, *subkeys: str, url: str, name: str | None = None, version: VersionHint = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None) -> Path:
    """Ensure a file is downloaded.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.

    Returns:
        The path of the file that has been downloaded (or already exists)
    """
```
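The download-once behavior and the `force` flag can be pictured as a small guard around whatever fetches the file. This is a minimal sketch under stated assumptions, not PyStow's code: `ensure_sketch` and `fake_download` are hypothetical names, and the "download" writes a fixed string so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_sketch(path: Path, fetch: Callable[[Path], None], force: bool = False) -> Path:
    """Call ``fetch`` only when ``path`` is missing or ``force`` is true."""
    if force or not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


calls: List[str] = []


def fake_download(target: Path) -> None:
    """Hypothetical stand-in for a network download; records each invocation."""
    calls.append(target.name)
    target.write_text("a,b\n1,2\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "dataset.csv"
    ensure_sketch(target, fake_download)              # first call fetches the file
    ensure_sketch(target, fake_download)              # second call reuses the cached copy
    ensure_sketch(target, fake_download, force=True)  # force=True fetches again
    download_count = len(calls)
```

The cache check here is existence only. The sketch does not validate the content of an existing file, so a truncated download would be reused until `force=True` is passed.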

### [Data Format Support](./data-formats.md)

Built-in support for common data formats including CSV, JSON, XML, RDF, Excel, and Python objects with pandas and specialized library integration.

```python { .api }
def ensure_csv(key: str, *subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, read_csv_kwargs: Mapping[str, Any] | None = None) -> pd.DataFrame:
    """Download a CSV and open as a dataframe with pandas.

    Args:
        key: The module name
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        read_csv_kwargs: Keyword arguments to pass through to pandas.read_csv.

    Returns:
        A pandas DataFrame
    """

def ensure_json(key: str, *subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, open_kwargs: Mapping[str, Any] | None = None, json_load_kwargs: Mapping[str, Any] | None = None) -> JSON:
    """Download JSON and open with json.

    Args:
        key: The module name
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        open_kwargs: Additional keyword arguments passed to open
        json_load_kwargs: Keyword arguments to pass through to json.load.

    Returns:
        A JSON object (list, dict, etc.)
    """
```
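The `open_kwargs` and `json_load_kwargs` parameters are optional mappings that get forwarded to `open` and `json.load`. The sketch below shows one way such forwarding might look once a file is already on disk. `load_json_sketch` is a hypothetical helper, not part of PyStow, and the file is written locally instead of downloaded.

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path
from typing import Any, Mapping, Optional


def load_json_sketch(
    path: Path,
    open_kwargs: Optional[Mapping[str, Any]] = None,
    json_load_kwargs: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Open ``path`` and parse it as JSON, forwarding optional keyword arguments."""
    # A missing mapping is treated as "no extra arguments".
    with open(path, **(open_kwargs or {})) as file:
        return json.load(file, **(json_load_kwargs or {}))


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text('{"threshold": 0.5}', encoding="utf-8")
    config = load_json_sketch(
        config_path,
        open_kwargs={"encoding": "utf-8"},
        json_load_kwargs={"parse_float": Decimal},  # parse floats as Decimal
    )
```

Defaulting the mappings to `None` rather than `{}` avoids sharing a mutable default between calls.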

### [Web Scraping](./web-scraping.md)

HTML parsing and web content extraction with BeautifulSoup integration for downloading and parsing web pages.

```python { .api }
def ensure_soup(key: str, *subkeys: str, url: str, name: str | None = None, version: VersionHint = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, beautiful_soup_kwargs: Mapping[str, Any] | None = None) -> bs4.BeautifulSoup:
    """Ensure a webpage is downloaded and parsed with BeautifulSoup.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given,
            returns the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given.
            Also useful for URLs that don't have proper filenames with extensions.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.
        force: Should the download be done again, even if the path already
            exists? Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        beautiful_soup_kwargs: Additional keyword arguments passed to BeautifulSoup

    Returns:
        A BeautifulSoup object
    """
```

### [Archive and Compression](./archives.md)

Support for compressed archives including ZIP, TAR, GZIP, LZMA, and BZ2 with automatic extraction and content access.

```python { .api }
def ensure_untar(key: str, *subkeys: str, url: str, name: str | None = None, directory: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, extract_kwargs: Mapping[str, Any] | None = None) -> Path:
    """Ensure a file is downloaded and untarred.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        directory: Overrides the name of the directory into which the tar archive is
            extracted. If none given, will use the stem of the file name that gets
            downloaded.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        extract_kwargs: Keyword arguments to pass to tarfile.TarFile.extractall.

    Returns:
        The path of the directory where the file that has been downloaded gets
        extracted to
    """
```
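The extraction step can be sketched with the standard `tarfile` module. This is an illustration under assumptions, not PyStow's implementation: `untar_sketch` is a hypothetical helper, "stem" is taken to mean the file name up to its first dot, and skipping extraction when the target directory already exists is an assumed caching rule. The archive is built locally so the example runs offline.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def untar_sketch(archive: Path, directory: Optional[str] = None, force: bool = False) -> Path:
    """Extract ``archive`` next to itself and return the extraction directory."""
    # Assumption: the default directory name is the file name up to the first dot.
    target = archive.parent / (directory or archive.name.split(".")[0])
    if force or not target.is_dir():
        target.mkdir(parents=True, exist_ok=True)
        # Use the safer "data" extraction filter when this Python version supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive) as tar_file:
            tar_file.extractall(target, **extra)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "readme.txt"
    source.write_text("hello")
    archive = Path(tmp) / "bundle.tar.gz"
    with tarfile.open(archive, "w:gz") as tar_file:
        tar_file.add(source, arcname="readme.txt")
    extracted = untar_sketch(archive)
    extracted_name = extracted.name
    extracted_text = (extracted / "readme.txt").read_text()
```

Passing `directory="custom"` would extract into a folder with that name instead of the one derived from the archive name.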

### [Cloud Storage Integration](./cloud-storage.md)

Download files from cloud storage services including AWS S3 and Google Drive with authentication support.

```python { .api }
def ensure_from_s3(key: str, *subkeys: str, s3_bucket: str, s3_key: str | Sequence[str], name: str | None = None, force: bool = False, **kwargs: Any) -> Path:
    """Ensure a file is downloaded from AWS S3.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        s3_bucket: The S3 bucket name
        s3_key: The S3 key name
        name: Overrides the name of the file at the end of the S3 key, if given.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        kwargs: Remaining kwargs to forward to Module.ensure_from_s3.

    Returns:
        The path of the file that has been downloaded (or already exists)
    """
```

### [Configuration Management](./configuration.md)

Environment variable and INI file-based configuration system for storing API keys, URLs, and other settings.

```python { .api }
def get_config(module: str, key: str, *, passthrough: X | None = None, default: X | None = None, dtype: type[X] | None = None, raise_on_missing: bool = False) -> Any:
    """Get a configuration value.

    Args:
        module: Name of the module (e.g., pybel) to get configuration for
        key: Name of the key (e.g., connection)
        passthrough: If this is not none, will get returned
        default: If the environment and configuration files don't contain anything,
            this is returned.
        dtype: The datatype to parse out. Can either be int, float,
            bool, or str. If none, defaults to str.
        raise_on_missing: If true, will raise a value error if no data is found and
            no default is given

    Returns:
        The config value or the default.

    Raises:
        ConfigError: If raise_on_missing conditions are met
    """

def write_config(module: str, key: str, value: str) -> None:
    """Write a configuration value.

    Args:
        module: The name of the app (e.g., indra)
        key: The key of the configuration in the app
        value: The value of the configuration in the app
    """
```
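The precedence among `passthrough`, the environment, and `default` can be sketched as follows. This is a simplified illustration, not PyStow's code: `get_config_sketch`, `parse_value`, `MissingConfig`, and the `environ` parameter are hypothetical, the `<MODULE>_<KEY>` upper-case variable naming is an assumption, the set of truthy strings for `bool` is an assumption, and the INI file lookup is left out entirely.

```python
import os
from typing import Any, Mapping, Optional


class MissingConfig(ValueError):
    """Hypothetical stand-in for a configuration lookup error."""


def parse_value(raw: str, dtype: Optional[type]) -> Any:
    """Convert a raw string to ``dtype`` (str when ``dtype`` is None)."""
    if dtype is None or dtype is str:
        return raw
    if dtype is bool:
        # Assumption: these spellings count as true; the real accepted set may differ.
        return raw.strip().lower() in {"true", "t", "1", "yes", "y"}
    if dtype in (int, float):
        return dtype(raw)
    raise TypeError(f"unsupported dtype: {dtype}")


def get_config_sketch(
    module: str,
    key: str,
    *,
    passthrough: Any = None,
    default: Any = None,
    dtype: Optional[type] = None,
    raise_on_missing: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Look up a setting: passthrough first, then the environment, then the default."""
    if passthrough is not None:
        return passthrough
    env = os.environ if environ is None else environ
    # Assumption: variables are named <MODULE>_<KEY> in upper case.
    raw = env.get(f"{module}_{key}".upper())
    if raw is not None:
        return parse_value(raw, dtype)
    if default is None and raise_on_missing:
        raise MissingConfig(f"no configuration found for {module}.{key}")
    return default


demo_env = {"MYAPP_PORT": "8080", "MYAPP_DEBUG": "true"}
port = get_config_sketch("myapp", "port", dtype=int, environ=demo_env)
debug = get_config_sketch("myapp", "debug", dtype=bool, environ=demo_env)
fallback = get_config_sketch("myapp", "missing", default="n/a", environ=demo_env)
```

Passing an explicit `environ` mapping keeps the demonstration independent of the real process environment.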

### [NLTK Integration](./nltk-integration.md)

Integration with NLTK (Natural Language Toolkit) for managing linguistic data resources.

```python { .api }
def ensure_nltk(resource: str = "stopwords") -> tuple[Path, bool]:
    """Ensure NLTK data is downloaded in a standard way.

    Args:
        resource: Name of the resource to download, e.g., stopwords

    Returns:
        A pair of the NLTK cache directory and a boolean that says if download was successful
    """
```

### [Module Class API](./module-class.md)

The core `Module` class provides an object-oriented interface for data directory management, with all file operations available as methods.

```python { .api }
class Module:
    """The class wrapping the directory lookup implementation."""

    def __init__(self, base: str | Path, ensure_exists: bool = True) -> None:
        """Initialize the module.

        Args:
            base: The base directory for the module
            ensure_exists: Should the base directory be created automatically?
                Defaults to true.
        """

    @classmethod
    def from_key(cls, key: str, *subkeys: str, ensure_exists: bool = True) -> Module:
        """Get a module for the given directory or one of its subdirectories.

        Args:
            key: The name of the module. No funny characters. The envvar <key>_HOME
                where key is uppercased is checked first before using the default home
                directory.
            subkeys: A sequence of additional strings to join. If none are given,
                returns the directory for this module.
            ensure_exists: Should all directories be created automatically? Defaults
                to true.

        Returns:
            A module
        """
```

## Type Definitions

```python { .api }
from typing import Union, Optional, Callable, Any
from pathlib import Path

# Version specification type
VersionHint = Union[None, str, Callable[[], Optional[str]]]

# JSON data type
JSON = Any

# File provider function type
Provider = Callable[..., None]

# HTTP timeout specification
TimeoutHint = Union[int, float, None, tuple[Union[float, int], Union[float, int]]]
```

## Exception Classes

```python { .api }
class ConfigError(ValueError):
    """Raised when configuration cannot be looked up."""

    def __init__(self, module: str, key: str):
        """Initialize the configuration error.

        Args:
            module: Name of the module, e.g., bioportal
            key: Name of the key inside the module, e.g., api_key
        """
```