# PyStow

PyStow is a Python library that provides a standardized and configurable way to manage data directories for Python applications. It offers a simple API for creating and accessing application-specific data directories in a user's file system, with support for nested directory structures, automatic directory creation, and environment variable-based configuration.

The library can download, cache, and manage files from the internet, with built-in support for data formats including CSV, RDF, and Excel, and for compressed archives (ZIP, TAR, LZMA, GZ). Files are downloaded only once and then served from the local cache. Tabular data is handled through pandas integration and RDF data through rdflib integration. Storage locations are configurable and respect both the traditional home directory pattern and the XDG Base Directory specification.

## Package Information

- **Package Name**: pystow
- **Language**: Python
- **Installation**: `pip install pystow`

## Core Imports

```python
import pystow

# Most common usage patterns
module = pystow.module("myapp")
path = pystow.join("myapp", "data")
data = pystow.ensure_csv("myapp", url="https://example.com/data.csv")
```

## Basic Usage

### Directory Management
```python
import pystow

# Get a module for your application
module = pystow.module("myapp")

# Create nested directories and get paths
data_dir = module.join("datasets", "version1")
config_path = module.join("config", name="settings.json")

# Using functional API
path = pystow.join("myapp", "data", name="file.txt")
```

### File Download and Caching
```python
import pystow

# Download and cache a file
path = pystow.ensure(
    "myapp", "data",
    url="https://example.com/dataset.csv",
    name="dataset.csv"
)

# File is automatically cached - subsequent calls return the cached version
# Use force=True to re-download
path = pystow.ensure(
    "myapp", "data",
    url="https://example.com/dataset.csv",
    name="dataset.csv",
    force=True
)
```

### Data Format Integration
```python
import pystow
import pandas as pd

# Download and load CSV as DataFrame
df = pystow.ensure_csv(
    "myapp", "datasets",
    url="https://example.com/data.csv"
)

# Download and parse JSON
data = pystow.ensure_json(
    "myapp", "config",
    url="https://api.example.com/config.json"
)

# Work with compressed files
graph = pystow.ensure_rdf(
    "myapp", "ontologies",
    url="https://example.com/ontology.rdf.gz",
    parse_kwargs={"format": "xml"}
)
```

## Architecture

PyStow is built around a modular architecture with two main usage patterns:

1. **Functional API**: Direct function calls for quick operations (`pystow.ensure()`, `pystow.join()`)
2. **Module-based API**: Create Module instances for organized data management (`pystow.module()`)

The core `Module` class manages directory structures and provides methods for file operations, while the functional API provides convenient shortcuts for common tasks. All operations support:

- **Configurable base directories** via environment variables
- **Version-aware storage** for handling different data versions
- **Automatic directory creation** with the `ensure_exists` parameter
- **Force re-download capabilities** for cache invalidation
- **Flexible data format support** through specialized ensure/load/dump methods
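
The relationship between the two styles can be pictured with a small stand-in written only with the standard library. This is a sketch, not PyStow's implementation: `MiniModule` and the module-level `join` below are hypothetical names, and the root directory is passed in explicitly rather than looked up from the environment.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class MiniModule:
    """Hypothetical stand-in for a module object rooted at one directory."""

    def __init__(self, base: Path, ensure_exists: bool = True) -> None:
        self.base = Path(base)
        if ensure_exists:
            self.base.mkdir(parents=True, exist_ok=True)

    def join(self, *subkeys: str, name: str | None = None, ensure_exists: bool = True) -> Path:
        """Return a subdirectory of the base, or a file path inside it if name is given."""
        directory = self.base.joinpath(*subkeys)
        if ensure_exists:
            directory.mkdir(parents=True, exist_ok=True)
        return directory / name if name else directory


def join(root: Path, key: str, *subkeys: str, name: str | None = None) -> Path:
    """Functional shortcut: build a throwaway MiniModule and delegate to it."""
    return MiniModule(root / key).join(*subkeys, name=name)


# Both styles resolve to the same location.
root = Path(tempfile.mkdtemp())
module = MiniModule(root / "myapp")
object_style = module.join("datasets", "version1", name="data.csv")
functional_style = join(root, "myapp", "datasets", "version1", name="data.csv")
```

In this model the functional call is a thin wrapper, which is why holding on to a module object is convenient when many paths share the same key.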

## Capabilities

### [Directory Management](./directory-management.md)
Core functionality for creating and managing application data directories with configurable storage locations and automatic directory creation.

```python { .api }
def module(key: str, *subkeys: str, ensure_exists: bool = True) -> Module:
    """Return a module for the application.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        ensure_exists: Should all directories be created automatically? Defaults to true.

    Returns:
        The module object that manages getting and ensuring files and directories
    """

def join(key: str, *subkeys: str, name: str | None = None, ensure_exists: bool = True, version: VersionHint = None) -> Path:
    """Return the home data directory for the given module.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join
        name: The name of the file (optional) inside the folder
        ensure_exists: Should all directories be created automatically? Defaults to true.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.

    Returns:
        The path of the directory or subdirectory for the given module.
    """
```
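
The lookup order described in the docstrings (the `<KEY>_HOME` environment variable first, then a default location) can be sketched with the standard library. This is an illustration under stated assumptions, not PyStow's code: `resolve_base` is a hypothetical helper, and `~/.data/<key>` is assumed here as the default location.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def resolve_base(key: str, default_root: Path | None = None) -> Path:
    """Return the base directory for a key (hypothetical helper).

    The environment variable <KEY>_HOME wins if it is set; otherwise the
    directory falls back to <default_root>/<key>.
    """
    override = os.environ.get(f"{key.upper()}_HOME")
    if override:
        return Path(override).expanduser()
    # Assumption: the default root is ~/.data
    root = default_root if default_root is not None else Path.home() / ".data"
    return root / key


# Without an override, the default root is used.
os.environ.pop("MYAPP_HOME", None)
default_path = resolve_base("myapp", default_root=Path("/tmp/data"))

# With MYAPP_HOME set, the override takes precedence.
custom_dir = tempfile.mkdtemp()
os.environ["MYAPP_HOME"] = custom_dir
override_path = resolve_base("myapp", default_root=Path("/tmp/data"))
os.environ.pop("MYAPP_HOME")
```

Setting the variable per application lets one dataset live on a large external disk while everything else stays in the default location.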

### [File Download and Caching](./file-operations.md)
Comprehensive file download system with caching, compression support, and cloud storage integration.

```python { .api }
def ensure(key: str, *subkeys: str, url: str, name: str | None = None, version: VersionHint = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None) -> Path:
    """Ensure a file is downloaded.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.

    Returns:
        The path of the file that has been downloaded (or already exists)
    """
```
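
The download-once behavior and the `force` flag can be illustrated without touching the network. The sketch below is a simplified model, not PyStow's implementation: `ensure_file` and `fake_fetch` are hypothetical, the fetch function is injected so the example runs offline, and the file name is taken from the last segment of the URL path when `name` is omitted.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def ensure_file(
    directory: Path,
    url: str,
    fetch: Callable[[str, Path], None],
    name: str | None = None,
    force: bool = False,
) -> Path:
    """Fetch url into directory unless a cached copy already exists."""
    if name is None:
        # Fall back to the last path segment of the URL.
        name = Path(urlparse(url).path).name
    path = directory / name
    if force or not path.exists():
        fetch(url, path)
    return path


calls: list[str] = []


def fake_fetch(url: str, path: Path) -> None:
    """Stand-in for a real download; records the call and writes a small file."""
    calls.append(url)
    path.write_text("id,value\n1,42\n")


cache_dir = Path(tempfile.mkdtemp())
url = "https://example.com/dataset.csv"
first = ensure_file(cache_dir, url, fake_fetch)              # fetches
second = ensure_file(cache_dir, url, fake_fetch)             # served from cache
third = ensure_file(cache_dir, url, fake_fetch, force=True)  # fetches again
```

All three calls return the same path; only the first and the forced call reach the fetch function.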

### [Data Format Support](./data-formats.md)
Built-in support for common data formats including CSV, JSON, XML, RDF, Excel, and Python objects with pandas and specialized library integration.

```python { .api }
def ensure_csv(key: str, *subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, read_csv_kwargs: Mapping[str, Any] | None = None) -> pd.DataFrame:
    """Download a CSV and open as a dataframe with pandas.

    Args:
        key: The module name
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        read_csv_kwargs: Keyword arguments to pass through to pandas.read_csv.

    Returns:
        A pandas DataFrame
    """

def ensure_json(key: str, *subkeys: str, url: str, name: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, open_kwargs: Mapping[str, Any] | None = None, json_load_kwargs: Mapping[str, Any] | None = None) -> JSON:
    """Download JSON and open with json.

    Args:
        key: The module name
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        open_kwargs: Additional keyword arguments passed to open
        json_load_kwargs: Keyword arguments to pass through to json.load.

    Returns:
        A JSON object (list, dict, etc.)
    """
```
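
Each loader above pairs the caching step with a parser and forwards the extra keyword arguments to it. A stdlib-only sketch of that shape for JSON follows. `load_cached_json` is a hypothetical helper that starts from a file already on disk, so the download step is left out.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Mapping


def load_cached_json(
    path: Path,
    open_kwargs: Mapping[str, Any] | None = None,
    json_load_kwargs: Mapping[str, Any] | None = None,
) -> Any:
    """Open a cached file and parse it, forwarding kwargs to open() and json.load()."""
    with open(path, **(open_kwargs or {})) as file:
        return json.load(file, **(json_load_kwargs or {}))


# Pretend this file was downloaded and cached earlier.
cached = Path(tempfile.mkdtemp()) / "config.json"
cached.write_text('{"threshold": 0.5, "labels": ["a", "b"]}', encoding="utf-8")

config = load_cached_json(
    cached,
    open_kwargs={"encoding": "utf-8"},
    json_load_kwargs={"parse_float": str},  # keep floats as strings, for illustration
)
```

Keeping the parser options in their own mapping means the caller can tune parsing without the wrapper having to mirror every parameter of the underlying function.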

### [Web Scraping](./web-scraping.md)
HTML parsing and web content extraction with BeautifulSoup integration for downloading and parsing web pages.

```python { .api }
def ensure_soup(key: str, *subkeys: str, url: str, name: str | None = None, version: VersionHint = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, beautiful_soup_kwargs: Mapping[str, Any] | None = None) -> bs4.BeautifulSoup:
    """Ensure a webpage is downloaded and parsed with BeautifulSoup.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given,
            returns the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given.
            Also useful for URLs that don't have proper filenames with extensions.
        version: The optional version, or no-argument callable that returns an
            optional version. This is prepended before the subkeys.
        force: Should the download be done again, even if the path already
            exists? Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        beautiful_soup_kwargs: Additional keyword arguments passed to BeautifulSoup

    Returns:
        A BeautifulSoup object
    """
```

### [Archive and Compression](./archives.md)
Support for compressed archives including ZIP, TAR, GZIP, LZMA, and BZ2 with automatic extraction and content access.

```python { .api }
def ensure_untar(key: str, *subkeys: str, url: str, name: str | None = None, directory: str | None = None, force: bool = False, download_kwargs: Mapping[str, Any] | None = None, extract_kwargs: Mapping[str, Any] | None = None) -> Path:
    """Ensure a file is downloaded and untarred.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        url: The URL to download.
        name: Overrides the name of the file at the end of the URL, if given. Also
            useful for URLs that don't have proper filenames with extensions.
        directory: Overrides the name of the directory into which the tar archive is
            extracted. If none given, will use the stem of the file name that gets
            downloaded.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        download_kwargs: Keyword arguments to pass through to pystow.utils.download.
        extract_kwargs: Keyword arguments to pass to tarfile.TarFile.extractall.

    Returns:
        The path of the directory where the file that has been downloaded gets
        extracted to
    """
```
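
The `directory` default and the extraction step can be sketched with `tarfile` alone. `untar_cached` is a hypothetical helper, not part of PyStow. It assumes the archive is already on disk, and the way the stem is computed here (dropping everything after the first dot) is an assumption made for the example.

```python
from __future__ import annotations

import io
import tarfile
import tempfile
from pathlib import Path


def untar_cached(archive: Path, directory: str | None = None, force: bool = False) -> Path:
    """Extract archive next to itself and return the extraction directory."""
    if directory is None:
        # Assumption: "data.tar.gz" -> "data"
        directory = archive.name.split(".", 1)[0]
    target = archive.parent / directory
    if target.is_dir() and not force:
        return target  # already extracted
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar_file:
        tar_file.extractall(target)
    return target


# Build a tiny archive so the example runs offline.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "data.tar.gz"
payload = b"hello\n"
with tarfile.open(archive_path, "w:gz") as tar_file:
    info = tarfile.TarInfo(name="readme.txt")
    info.size = len(payload)
    tar_file.addfile(info, io.BytesIO(payload))

extracted = untar_cached(archive_path)
```

A second call with the same archive returns the existing directory without extracting again, mirroring the caching behavior of the download step.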

### [Cloud Storage Integration](./cloud-storage.md)
Download files from cloud storage services including AWS S3 and Google Drive with authentication support.

```python { .api }
def ensure_from_s3(key: str, *subkeys: str, s3_bucket: str, s3_key: str | Sequence[str], name: str | None = None, force: bool = False, **kwargs: Any) -> Path:
    """Ensure a file is downloaded from AWS S3.

    Args:
        key: The name of the module. No funny characters. The envvar <key>_HOME where
            key is uppercased is checked first before using the default home directory.
        subkeys: A sequence of additional strings to join. If none are given, returns
            the directory for this module.
        s3_bucket: The S3 bucket name
        s3_key: The S3 key name
        name: Overrides the name of the file at the end of the S3 key, if given.
        force: Should the download be done again, even if the path already exists?
            Defaults to false.
        kwargs: Remaining kwargs to forward to Module.ensure_from_s3.

    Returns:
        The path of the file that has been downloaded (or already exists)
    """
```

### [Configuration Management](./configuration.md)
Environment variable and INI file-based configuration system for storing API keys, URLs, and other settings.

```python { .api }
def get_config(module: str, key: str, *, passthrough: X | None = None, default: X | None = None, dtype: type[X] | None = None, raise_on_missing: bool = False) -> Any:
    """Get a configuration value.

    Args:
        module: Name of the module (e.g., pybel) to get configuration for
        key: Name of the key (e.g., connection)
        passthrough: If this is not none, will get returned
        default: If the environment and configuration files don't contain anything,
            this is returned.
        dtype: The datatype to parse out. Can either be int, float,
            bool, or str. If none, defaults to str.
        raise_on_missing: If true, will raise a value error if no data is found and
            no default is given

    Returns:
        The config value or the default.

    Raises:
        ConfigError: If raise_on_missing conditions are met
    """

def write_config(module: str, key: str, value: str) -> None:
    """Write a configuration value.

    Args:
        module: The name of the app (e.g., indra)
        key: The key of the configuration in the app
        value: The value of the configuration in the app
    """
```
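
The precedence described for `get_config` (passthrough, then environment, then configuration file, then default) can be modeled in a few lines. This is a sketch under assumptions, not PyStow's implementation: `lookup_config` is hypothetical, the environment variable is assumed to be named `<MODULE>_<KEY>` in upper case, the INI file is passed in explicitly, and a plain `ValueError` stands in for `ConfigError`.

```python
from __future__ import annotations

import configparser
import os
import tempfile
from pathlib import Path
from typing import Any


def lookup_config(
    module: str,
    key: str,
    *,
    ini_path: Path,
    passthrough: Any = None,
    default: Any = None,
    dtype: type = str,
    raise_on_missing: bool = False,
) -> Any:
    """Resolve a setting: passthrough, then environment, then INI file, then default."""
    if passthrough is not None:
        return passthrough
    raw = os.environ.get(f"{module}_{key}".upper())  # assumed naming scheme
    if raw is None:
        parser = configparser.ConfigParser()
        parser.read(ini_path)
        raw = parser.get(module, key, fallback=None)
    if raw is None:
        if default is None and raise_on_missing:
            raise ValueError(f"no configuration found for {module}:{key}")
        return default
    if dtype is bool:
        return raw.strip().lower() in {"1", "true", "yes", "t"}
    return dtype(raw)


ini_file = Path(tempfile.mkdtemp()) / "myapp.ini"
ini_file.write_text("[myapp]\nretries = 3\n")

# Start from a clean environment for the keys used below.
for variable in ("MYAPP_RETRIES", "MYAPP_TIMEOUT"):
    os.environ.pop(variable, None)

from_file = lookup_config("myapp", "retries", ini_path=ini_file, dtype=int)

os.environ["MYAPP_RETRIES"] = "5"
from_env = lookup_config("myapp", "retries", ini_path=ini_file, dtype=int)
os.environ.pop("MYAPP_RETRIES")

missing = lookup_config("myapp", "timeout", ini_path=ini_file, default=30)
```

Checking the environment before the file lets a deployment override a stored setting without editing anything on disk.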

### [NLTK Integration](./nltk-integration.md)
Integration with NLTK (Natural Language Toolkit) for managing linguistic data resources.

```python { .api }
def ensure_nltk(resource: str = "stopwords") -> tuple[Path, bool]:
    """Ensure NLTK data is downloaded in a standard way.

    Args:
        resource: Name of the resource to download, e.g., stopwords

    Returns:
        A pair of the NLTK cache directory and a boolean that says if download was successful
    """
```

### [Module Class API](./module-class.md)
The core Module class that provides an object-oriented interface for data directory management with all file operations as methods.

```python { .api }
class Module:
    """The class wrapping the directory lookup implementation."""

    def __init__(self, base: str | Path, ensure_exists: bool = True) -> None:
        """Initialize the module.

        Args:
            base: The base directory for the module
            ensure_exists: Should the base directory be created automatically?
                Defaults to true.
        """

    @classmethod
    def from_key(cls, key: str, *subkeys: str, ensure_exists: bool = True) -> Module:
        """Get a module for the given directory or one of its subdirectories.

        Args:
            key: The name of the module. No funny characters. The envvar <key>_HOME
                where key is uppercased is checked first before using the default home
                directory.
            subkeys: A sequence of additional strings to join. If none are given,
                returns the directory for this module.
            ensure_exists: Should all directories be created automatically? Defaults
                to true.

        Returns:
            A module
        """
```

## Type Definitions

```python { .api }
from typing import Union, Optional, Callable, Any
from pathlib import Path

# Version specification type
VersionHint = Union[None, str, Callable[[], Optional[str]]]

# JSON data type
JSON = Any

# File provider function type
Provider = Callable[..., None]

# HTTP timeout specification
TimeoutHint = Union[int, float, None, tuple[Union[float, int], Union[float, int]]]
```
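
Because `VersionHint` accepts either a literal string or a no-argument callable, a consumer has to normalize it before building a path. A minimal sketch of that normalization follows. `resolve_version` and `versioned_parts` are hypothetical helpers, not part of the documented API.

```python
from __future__ import annotations

from typing import Callable, Optional, Union

VersionHint = Union[None, str, Callable[[], Optional[str]]]


def resolve_version(version: VersionHint) -> Optional[str]:
    """Turn a VersionHint into a plain string or None."""
    if callable(version):
        return version()
    return version


def versioned_parts(version: VersionHint, *subkeys: str) -> tuple[str, ...]:
    """Prepend the resolved version, if any, before the subkeys."""
    resolved = resolve_version(version)
    return ((resolved,) if resolved else ()) + subkeys


static = versioned_parts("1.0", "raw")
lazy = versioned_parts(lambda: "2024-01", "raw")
unversioned = versioned_parts(None, "raw")
```

The callable form is useful when working out the current version is itself expensive, for example when it requires a network request, because the lookup only runs when a path is actually needed.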

## Exception Classes

```python { .api }
class ConfigError(ValueError):
    """Raised when configuration cannot be looked up."""

    def __init__(self, module: str, key: str):
        """Initialize the configuration error.

        Args:
            module: Name of the module, e.g., bioportal
            key: Name of the key inside the module, e.g., api_key
        """
```