# Utilities and Configuration

Helper functions for URL parsing, path manipulation, tokenization, and configuration management that support the core filesystem operations. These utilities provide the essential infrastructure for protocol handling, caching, and system configuration.

## Capabilities

### URL and Path Processing

Functions for parsing URLs, extracting protocols, and manipulating filesystem paths across different storage backends.

```python { .api }
def infer_storage_options(urlpath, inherit_storage_options=None):
    """
    Infer storage options from URL parameters.

    Parameters:
    - urlpath: str, URL with potential query parameters
    - inherit_storage_options: dict, existing options to inherit/override

    Returns:
    dict, storage options extracted from URL
    """

def get_protocol(url):
    """
    Extract protocol from URL.

    Parameters:
    - url: str, URL to parse

    Returns:
    str, protocol name (e.g., 's3', 'gcs', 'file')
    """

def stringify_path(filepath):
    """
    Convert path object to string.

    Parameters:
    - filepath: str or Path-like, file path

    Returns:
    str, string representation of path
    """
```
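As a rough illustration of protocol extraction, a simplified sketch using only the standard library (not fsspec's actual implementation; `simple_get_protocol` is a hypothetical name):

```python
from urllib.parse import urlsplit

def simple_get_protocol(url):
    # A plain local path has no scheme, so fall back to 'file';
    # fsspec's real get_protocol also handles chained URLs and
    # Windows drive letters, which this sketch ignores.
    scheme = urlsplit(url).scheme
    return scheme if scheme else "file"

print(simple_get_protocol("s3://bucket/file.txt"))   # s3
print(simple_get_protocol("/local/path/file.txt"))   # file
```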
### Compression Detection

Utilities for automatically detecting compression formats from filenames and extensions.

```python { .api }
def infer_compression(filename):
    """
    Infer compression format from filename.

    Parameters:
    - filename: str, file name or path

    Returns:
    str or None, compression format name or None if uncompressed
    """
```
### Tokenization and Hashing

Functions for generating consistent hash tokens from filesystem paths and parameters, used internally for caching and deduplication.

```python { .api }
def tokenize(*args, **kwargs):
    """
    Generate hash token from arguments.

    Parameters:
    - *args: positional arguments to hash
    - **kwargs: keyword arguments to hash

    Returns:
    str, hash token string
    """
```
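The idea behind tokenization is a deterministic digest of the arguments. A minimal sketch hashing the `repr` of the inputs (fsspec's `tokenize` handles far more input types robustly; `simple_tokenize` is a hypothetical name):

```python
import hashlib

def simple_tokenize(*args, **kwargs):
    # Sort kwargs so keyword order does not change the token
    payload = repr((args, sorted(kwargs.items()))).encode()
    return hashlib.md5(payload).hexdigest()
```

Equal inputs always produce equal tokens, which is the property caching relies on.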
### Block Reading Utilities

Low-level utilities for reading data blocks with delimiter support, useful for implementing custom file readers and parsers.

```python { .api }
def read_block(file, offset, length, delimiter=None):
    """
    Read a block of data from file.

    Parameters:
    - file: file-like object, source file
    - offset: int, byte offset to start reading
    - length: int, maximum bytes to read
    - delimiter: bytes, delimiter to read until (optional)

    Returns:
    bytes, block data
    """
```
### Filename Generation

Utilities for generating systematic filenames for batch operations and parallel processing.

```python { .api }
def build_name_function(max_int):
    """
    Build function for generating sequential filenames.

    Parameters:
    - max_int: int, maximum number to generate names for

    Returns:
    callable, function that takes int and returns filename string
    """
```
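The effect is zero-padding wide enough for the largest index. A hypothetical equivalent (`simple_name_function` sketches the underlying idea, not fsspec's implementation):

```python
def simple_name_function(max_int):
    # Pad to the number of digits in the largest index (max_int - 1)
    width = len(str(max_int - 1))
    return lambda i: str(i).zfill(width)

name = simple_name_function(1000)
print([name(i) for i in range(3)])  # ['000', '001', '002']
```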
### Atomic File Operations

Utilities for ensuring atomic file writes and preventing data corruption during file operations.

```python { .api }
def atomic_write(path, mode='wb'):
    """
    Context manager for atomic file writing.

    Parameters:
    - path: str, target file path
    - mode: str, file opening mode

    Returns:
    context manager, yields temporary file object
    """
```
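The usual pattern behind such a context manager is to write to a temporary file in the target directory and rename it into place only on success. A sketch of that pattern (illustrative only; `simple_atomic_write` is not fsspec's implementation):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def simple_atomic_write(path, mode="wb"):
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem (os.replace is then atomic on POSIX)
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    os.close(fd)
    try:
        with open(tmp, mode) as f:
            yield f
        os.replace(tmp, path)  # only reached if the writes succeeded
    except BaseException:
        os.unlink(tmp)  # leave the original target untouched
        raise
```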
### Pattern Matching

Utilities for translating glob patterns to regular expressions and other pattern matching operations.

```python { .api }
def glob_translate(pat):
    """
    Translate glob pattern to regular expression.

    Parameters:
    - pat: str, glob pattern

    Returns:
    str, regular expression pattern
    """
```
### Configuration Management

Global configuration system for fsspec behavior and default settings.

```python { .api }
conf: dict
"""Global configuration dictionary with fsspec settings"""

conf_dir: str
"""Configuration directory path"""

def set_conf_env(conf_dict, envdict=os.environ):
    """
    Set configuration from environment variables.

    Parameters:
    - conf_dict: dict, configuration dictionary to update
    - envdict: dict, environment variables dictionary
    """

def apply_config(cls, kwargs):
    """
    Apply configuration to class constructor arguments.

    Parameters:
    - cls: type, class to configure
    - kwargs: dict, keyword arguments to modify

    Returns:
    dict, modified keyword arguments with config applied
    """
```
## Usage Patterns

### URL Parameter Extraction

```python
# Extract storage options from URL query parameters
url = 's3://bucket/path?key=ACCESS_KEY&secret=SECRET_KEY&region=us-west-2'
storage_options = fsspec.utils.infer_storage_options(url)
print(storage_options)
# {'key': 'ACCESS_KEY', 'secret': 'SECRET_KEY', 'region': 'us-west-2'}

# Use extracted options
fs = fsspec.filesystem('s3', **storage_options)

# Inherit and override options
base_options = {'key': 'BASE_KEY', 'timeout': 30}
url = 's3://bucket/path?secret=SECRET_KEY'
final_options = fsspec.utils.infer_storage_options(url, base_options)
# Result: {'key': 'BASE_KEY', 'timeout': 30, 'secret': 'SECRET_KEY'}
```
### Protocol Detection

```python
# Extract protocol from various URL formats
urls = [
    's3://bucket/file.txt',
    'gcs://bucket/file.txt',
    'https://example.com/api',
    '/local/path/file.txt',
    'file:///absolute/path'
]

for url in urls:
    protocol = fsspec.utils.get_protocol(url)
    print(f"{url} -> {protocol}")

# s3://bucket/file.txt -> s3
# gcs://bucket/file.txt -> gcs
# https://example.com/api -> https
# /local/path/file.txt -> file
# file:///absolute/path -> file
```
### Compression Auto-Detection

```python
# Automatically detect compression from filenames
filenames = [
    'data.csv.gz',
    'archive.tar.bz2',
    'logs.txt.xz',
    'config.json',
    'model.pkl.lz4'
]

for filename in filenames:
    compression = fsspec.utils.infer_compression(filename)
    print(f"{filename} -> {compression}")

# data.csv.gz -> gzip
# archive.tar.bz2 -> bz2
# logs.txt.xz -> lzma
# config.json -> None
# model.pkl.lz4 -> lz4
```
### Path Standardization

```python
import pathlib

# Convert various path types to strings
paths = [
    '/local/file.txt',
    pathlib.Path('/local/file.txt'),
    pathlib.PurePosixPath('/local/file.txt')
]

for path in paths:
    str_path = fsspec.utils.stringify_path(path)
    print(f"{type(path)} -> {str_path}")
```
### Tokenization for Caching

```python
# Generate consistent tokens for caching
token1 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token2 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token3 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-west-2')

print(token1 == token2)  # True - same parameters
print(token1 == token3)  # False - different region

# Use for cache keys
cache_key = fsspec.utils.tokenize(protocol, path, **storage_options)
```
### Block Reading with Delimiters

```python
# Read file in blocks with line boundaries
with open('large_file.txt', 'rb') as f:
    offset = 0
    block_size = 1024 * 1024  # 1MB blocks

    while True:
        # Read a block that starts and ends at a line boundary
        block = fsspec.utils.read_block(f, offset, block_size, delimiter=b'\n')
        if not block:
            break

        # Process complete lines
        for line in block.split(b'\n'):
            if line:  # Skip empty lines
                process_line(line)

        # Advance by the nominal block size; read_block itself adjusts
        # the actual start/end of each block to the delimiter
        offset += block_size
```
### Sequential Filename Generation

```python
# Generate systematic filenames for batch output
name_func = fsspec.utils.build_name_function(1000)

filenames = [name_func(i) for i in range(5)]
print(filenames)
# ['000', '001', '002', '003', '004']

# Use with fsspec.open_files for multiple outputs
output_files = fsspec.open_files(
    'output-*.json',
    'w',
    num=10,
    name_function=name_func
)
```
### Atomic File Writing

```python
# Ensure atomic writes to prevent corruption
with fsspec.utils.atomic_write('/important/file.txt', 'w') as f:
    f.write('Critical data that must be written atomically\n')
    f.write('If this fails, the original file remains unchanged\n')
# File is only moved to its final location if all writes succeed

# Works with binary mode too
import pickle

with fsspec.utils.atomic_write('/data/model.pkl', 'wb') as f:
    pickle.dump(model, f)
```
### Glob Pattern Processing

```python
# Convert glob patterns to regex for custom matching
patterns = ['*.txt', 'data_*.csv', 'logs/*/error.log']

for pattern in patterns:
    regex = fsspec.utils.glob_translate(pattern)
    print(f"{pattern} -> {regex}")

# Use compiled regex for matching
import re

regex_pattern = fsspec.utils.glob_translate('data_*.csv')
compiled = re.compile(regex_pattern)

files = ['data_1.csv', 'data_2.csv', 'config.json', 'data_old.csv']
matches = [f for f in files if compiled.match(f)]
print(matches)  # ['data_1.csv', 'data_2.csv', 'data_old.csv']
```
### Global Configuration

```python
# Check current configuration
print("Current fsspec config:", fsspec.config.conf)

# Set configuration options
fsspec.config.conf['default_cache_type'] = 'blockcache'
fsspec.config.conf['default_block_size'] = 1024 * 1024

# Configuration from environment variables
import os
os.environ['FSSPEC_CACHE_TYPE'] = 'readahead'
os.environ['FSSPEC_BLOCK_SIZE'] = '2097152'

fsspec.utils.set_conf_env(fsspec.config.conf)
print("Updated config:", fsspec.config.conf)
```
### Custom Utility Functions

```python
def get_file_info(url):
    """Get comprehensive file information from URL."""
    protocol = fsspec.utils.get_protocol(url)
    compression = fsspec.utils.infer_compression(url)
    storage_options = fsspec.utils.infer_storage_options(url)

    return {
        'protocol': protocol,
        'compression': compression,
        'storage_options': storage_options,
        'token': fsspec.utils.tokenize(url, **storage_options)
    }

# Use custom utility
info = get_file_info('s3://bucket/data.csv.gz?region=us-west-2')
print(info)
```
### Error Handling with Utilities

```python
def safe_infer_compression(filename):
    """Safely infer compression with fallback."""
    try:
        return fsspec.utils.infer_compression(filename)
    except Exception:
        # Return None if compression inference fails
        return None

def safe_get_protocol(url):
    """Safely extract protocol with fallback."""
    try:
        return fsspec.utils.get_protocol(url)
    except Exception:
        # Default to the file protocol
        return 'file'
```
### Performance Optimization with Utilities

```python
# Cache tokenization results for repeated operations
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_tokenize(*args, **kwargs):
    """Cached version of tokenize for performance."""
    # Sort kwargs so keyword order does not affect the token
    sorted_kwargs = tuple(sorted(kwargs.items()))
    return fsspec.utils.tokenize(*args, *sorted_kwargs)

# Use cached tokenization
token = cached_tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
```
## Configuration Options

### Global Settings

```python
# Common configuration options in fsspec.config.conf
{
    'default_cache_type': 'readahead',   # Default cache strategy
    'default_block_size': 1024 * 1024,   # Default block size (1MB)
    'connect_timeout': 10,               # Connection timeout in seconds
    'read_timeout': 30,                  # Read timeout in seconds
    'max_connections': 100,              # Max concurrent connections
    'cache_dir': '/tmp/fsspec',          # Cache directory
    'logging_level': 'INFO'              # Logging verbosity
}
```
### Environment Variable Mapping

```
# Environment variables that affect fsspec behavior
FSSPEC_CACHE_TYPE -> conf['default_cache_type']
FSSPEC_BLOCK_SIZE -> conf['default_block_size']
FSSPEC_TIMEOUT    -> conf['connect_timeout']
FSSPEC_CACHE_DIR  -> conf['cache_dir']
```
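Applying such a mapping is straightforward. The helper below is a sketch based only on the table above (the `ENV_MAP` dictionary and the `conf_from_env` name are assumptions for illustration, not fsspec's actual parsing logic):

```python
import os

# Mapping taken from the table above
ENV_MAP = {
    "FSSPEC_CACHE_TYPE": "default_cache_type",
    "FSSPEC_BLOCK_SIZE": "default_block_size",
    "FSSPEC_TIMEOUT": "connect_timeout",
    "FSSPEC_CACHE_DIR": "cache_dir",
}

def conf_from_env(environ=os.environ):
    """Collect config values from any matching environment variables."""
    return {conf_key: environ[env_key]
            for env_key, conf_key in ENV_MAP.items()
            if env_key in environ}
```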
### Per-Filesystem Configuration

```python
# Apply configuration to specific filesystem instances
config_overrides = {
    's3': {'default_cache_type': 'mmap'},
    'gcs': {'default_block_size': 2 * 1024 * 1024},
    'http': {'connect_timeout': 5}
}

# Configuration is applied when creating filesystem instances
for protocol, overrides in config_overrides.items():
    fsspec.utils.apply_config(fsspec.get_filesystem_class(protocol), overrides)
```