# Utilities and Configuration

Helper functions for URL parsing, path manipulation, tokenization, and configuration management that support the core filesystem operations. These utilities provide the essential infrastructure for protocol handling, caching, and system configuration.

## Capabilities

### URL and Path Processing

Functions for parsing URLs, extracting protocols, and manipulating filesystem paths across different storage backends.

```python { .api }
def infer_storage_options(urlpath, inherit_storage_options=None):
    """
    Infer storage options from URL parameters.

    Parameters:
    - urlpath: str, URL with potential query parameters
    - inherit_storage_options: dict, existing options to inherit/override

    Returns:
    dict, storage options extracted from URL
    """

def get_protocol(url):
    """
    Extract protocol from URL.

    Parameters:
    - url: str, URL to parse

    Returns:
    str, protocol name (e.g., 's3', 'gcs', 'file')
    """

def stringify_path(filepath):
    """
    Convert path object to string.

    Parameters:
    - filepath: str or Path-like, file path

    Returns:
    str, string representation of path
    """
```
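As a rough illustration of protocol extraction, a simplified sketch using only the standard library (not fsspec's actual implementation; `simple_get_protocol` is a hypothetical name):

```python
from urllib.parse import urlsplit

def simple_get_protocol(url):
    # A plain local path has no scheme, so fall back to 'file';
    # fsspec's real get_protocol also handles chained URLs and
    # Windows drive letters, which this sketch ignores.
    scheme = urlsplit(url).scheme
    return scheme if scheme else "file"

print(simple_get_protocol("s3://bucket/file.txt"))   # s3
print(simple_get_protocol("/local/path/file.txt"))   # file
```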
### Compression Detection

Utilities for automatically detecting compression formats from filenames and extensions.

```python { .api }
def infer_compression(filename):
    """
    Infer compression format from filename.

    Parameters:
    - filename: str, file name or path

    Returns:
    str or None, compression format name or None if uncompressed
    """
```
### Tokenization and Hashing

Functions for generating consistent hash tokens from filesystem paths and parameters, used internally for caching and deduplication.

```python { .api }
def tokenize(*args, **kwargs):
    """
    Generate hash token from arguments.

    Parameters:
    - *args: positional arguments to hash
    - **kwargs: keyword arguments to hash

    Returns:
    str, hash token string
    """
```
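The idea behind tokenization is a deterministic digest of the arguments. A minimal sketch hashing the `repr` of the inputs (fsspec's `tokenize` handles far more input types robustly; `simple_tokenize` is a hypothetical name):

```python
import hashlib

def simple_tokenize(*args, **kwargs):
    # Sort kwargs so keyword order does not change the token
    payload = repr((args, sorted(kwargs.items()))).encode()
    return hashlib.md5(payload).hexdigest()
```

Equal inputs always produce equal tokens, which is the property caching relies on.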
### Block Reading Utilities

Low-level utilities for reading data blocks with delimiter support, useful for implementing custom file readers and parsers.

```python { .api }
def read_block(file, offset, length, delimiter=None):
    """
    Read a block of data from file.

    Parameters:
    - file: file-like object, source file
    - offset: int, byte offset to start reading
    - length: int, maximum bytes to read
    - delimiter: bytes, delimiter to read until (optional)

    Returns:
    bytes, block data
    """
```
### Filename Generation

Utilities for generating systematic filenames for batch operations and parallel processing.

```python { .api }
def build_name_function(max_int):
    """
    Build function for generating sequential filenames.

    Parameters:
    - max_int: int, maximum number to generate names for

    Returns:
    callable, function that takes int and returns filename string
    """
```
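The effect is zero-padding wide enough for the largest index. A hypothetical equivalent (`simple_name_function` sketches the underlying idea, not fsspec's implementation):

```python
def simple_name_function(max_int):
    # Pad to the number of digits in the largest index (max_int - 1)
    width = len(str(max_int - 1))
    return lambda i: str(i).zfill(width)

name = simple_name_function(1000)
print([name(i) for i in range(3)])  # ['000', '001', '002']
```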
### Atomic File Operations

Utilities for ensuring atomic file writes and preventing data corruption during file operations.

```python { .api }
def atomic_write(path, mode='wb'):
    """
    Context manager for atomic file writing.

    Parameters:
    - path: str, target file path
    - mode: str, file opening mode

    Returns:
    context manager, yields temporary file object
    """
```
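The usual pattern behind such a context manager is to write to a temporary file in the target directory and rename it into place only on success. A sketch of that pattern (illustrative only; `simple_atomic_write` is not fsspec's implementation):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def simple_atomic_write(path, mode="wb"):
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem (os.replace is then atomic on POSIX)
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    os.close(fd)
    try:
        with open(tmp, mode) as f:
            yield f
        os.replace(tmp, path)  # only reached if the writes succeeded
    except BaseException:
        os.unlink(tmp)  # leave the original target untouched
        raise
```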
### Pattern Matching

Utilities for translating glob patterns to regular expressions and other pattern matching operations.

```python { .api }
def glob_translate(pat):
    """
    Translate glob pattern to regular expression.

    Parameters:
    - pat: str, glob pattern

    Returns:
    str, regular expression pattern
    """
```
### Configuration Management

Global configuration system for fsspec behavior and default settings.

```python { .api }
conf: dict
"""Global configuration dictionary with fsspec settings"""

conf_dir: str
"""Configuration directory path"""

def set_conf_env(conf_dict, envdict=os.environ):
    """
    Set configuration from environment variables.

    Parameters:
    - conf_dict: dict, configuration dictionary to update
    - envdict: dict, environment variables dictionary
    """

def apply_config(cls, kwargs):
    """
    Apply configuration to class constructor arguments.

    Parameters:
    - cls: type, class to configure
    - kwargs: dict, keyword arguments to modify

    Returns:
    dict, modified keyword arguments with config applied
    """
```
## Usage Patterns

### URL Parameter Extraction

```python
# Extract storage options from URL query parameters
url = 's3://bucket/path?key=ACCESS_KEY&secret=SECRET_KEY&region=us-west-2'
storage_options = fsspec.utils.infer_storage_options(url)
print(storage_options)
# {'key': 'ACCESS_KEY', 'secret': 'SECRET_KEY', 'region': 'us-west-2'}

# Use extracted options
fs = fsspec.filesystem('s3', **storage_options)

# Inherit and override options
base_options = {'key': 'BASE_KEY', 'timeout': 30}
url = 's3://bucket/path?secret=SECRET_KEY'
final_options = fsspec.utils.infer_storage_options(url, base_options)
# Result: {'key': 'BASE_KEY', 'timeout': 30, 'secret': 'SECRET_KEY'}
```
### Protocol Detection

```python
# Extract protocol from various URL formats
urls = [
    's3://bucket/file.txt',
    'gcs://bucket/file.txt',
    'https://example.com/api',
    '/local/path/file.txt',
    'file:///absolute/path'
]

for url in urls:
    protocol = fsspec.utils.get_protocol(url)
    print(f"{url} -> {protocol}")

# s3://bucket/file.txt -> s3
# gcs://bucket/file.txt -> gcs
# https://example.com/api -> https
# /local/path/file.txt -> file
# file:///absolute/path -> file
```
### Compression Auto-Detection

```python
# Automatically detect compression from filenames
filenames = [
    'data.csv.gz',
    'archive.tar.bz2',
    'logs.txt.xz',
    'config.json',
    'model.pkl.lz4'
]

for filename in filenames:
    compression = fsspec.utils.infer_compression(filename)
    print(f"{filename} -> {compression}")

# data.csv.gz -> gzip
# archive.tar.bz2 -> bz2
# logs.txt.xz -> lzma
# config.json -> None
# model.pkl.lz4 -> lz4
```
### Path Standardization

```python
import pathlib

# Convert various path types to strings
paths = [
    '/local/file.txt',
    pathlib.Path('/local/file.txt'),
    pathlib.PurePosixPath('/local/file.txt')
]

for path in paths:
    str_path = fsspec.utils.stringify_path(path)
    print(f"{type(path)} -> {str_path}")
```
### Tokenization for Caching

```python
# Generate consistent tokens for caching
token1 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token2 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token3 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-west-2')

print(token1 == token2)  # True - same parameters
print(token1 == token3)  # False - different region

# Use for cache keys
cache_key = fsspec.utils.tokenize(protocol, path, **storage_options)
```
### Block Reading with Delimiters

```python
# Read file in blocks with line boundaries
with open('large_file.txt', 'rb') as f:
    offset = 0
    block_size = 1024 * 1024  # 1MB blocks

    while True:
        # Read a block that starts and ends at a line boundary
        block = fsspec.utils.read_block(f, offset, block_size, delimiter=b'\n')
        if not block:
            break

        # Process complete lines
        for line in block.split(b'\n'):
            if line:  # Skip empty lines
                process_line(line)

        # Advance by the nominal block size; read_block itself adjusts
        # the actual start/end of each block to the delimiter
        offset += block_size
```
### Sequential Filename Generation

```python
# Generate systematic filenames for batch output
name_func = fsspec.utils.build_name_function(1000)

filenames = [name_func(i) for i in range(5)]
print(filenames)
# ['000', '001', '002', '003', '004']

# Use with fsspec.open_files for multiple outputs
output_files = fsspec.open_files(
    'output-*.json',
    'w',
    num=10,
    name_function=name_func
)
```
### Atomic File Writing

```python
# Ensure atomic writes to prevent corruption
with fsspec.utils.atomic_write('/important/file.txt', 'w') as f:
    f.write('Critical data that must be written atomically\n')
    f.write('If this fails, the original file remains unchanged\n')
# File is only moved to its final location if all writes succeed

# Works with binary mode too
import pickle

with fsspec.utils.atomic_write('/data/model.pkl', 'wb') as f:
    pickle.dump(model, f)
```
### Glob Pattern Processing

```python
# Convert glob patterns to regex for custom matching
patterns = ['*.txt', 'data_*.csv', 'logs/*/error.log']

for pattern in patterns:
    regex = fsspec.utils.glob_translate(pattern)
    print(f"{pattern} -> {regex}")

# Use compiled regex for matching
import re

regex_pattern = fsspec.utils.glob_translate('data_*.csv')
compiled = re.compile(regex_pattern)

files = ['data_1.csv', 'data_2.csv', 'config.json', 'data_old.csv']
matches = [f for f in files if compiled.match(f)]
print(matches)  # ['data_1.csv', 'data_2.csv', 'data_old.csv']
```
### Global Configuration

```python
# Check current configuration
print("Current fsspec config:", fsspec.config.conf)

# Set configuration options
fsspec.config.conf['default_cache_type'] = 'blockcache'
fsspec.config.conf['default_block_size'] = 1024 * 1024

# Configuration from environment variables
import os
os.environ['FSSPEC_CACHE_TYPE'] = 'readahead'
os.environ['FSSPEC_BLOCK_SIZE'] = '2097152'

fsspec.utils.set_conf_env(fsspec.config.conf)
print("Updated config:", fsspec.config.conf)
```
### Custom Utility Functions

```python
def get_file_info(url):
    """Get comprehensive file information from URL."""
    protocol = fsspec.utils.get_protocol(url)
    compression = fsspec.utils.infer_compression(url)
    storage_options = fsspec.utils.infer_storage_options(url)

    return {
        'protocol': protocol,
        'compression': compression,
        'storage_options': storage_options,
        'token': fsspec.utils.tokenize(url, **storage_options)
    }

# Use custom utility
info = get_file_info('s3://bucket/data.csv.gz?region=us-west-2')
print(info)
```
### Error Handling with Utilities

```python
def safe_infer_compression(filename):
    """Safely infer compression with fallback."""
    try:
        return fsspec.utils.infer_compression(filename)
    except Exception:
        # Return None if compression inference fails
        return None

def safe_get_protocol(url):
    """Safely extract protocol with fallback."""
    try:
        return fsspec.utils.get_protocol(url)
    except Exception:
        # Default to the file protocol
        return 'file'
```
### Performance Optimization with Utilities

```python
# Cache tokenization results for repeated operations
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_tokenize(*args, **kwargs):
    """Cached version of tokenize for performance."""
    # Sort kwargs so keyword order does not affect the token
    sorted_kwargs = tuple(sorted(kwargs.items()))
    return fsspec.utils.tokenize(*args, *sorted_kwargs)

# Use cached tokenization
token = cached_tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
```
## Configuration Options

### Global Settings

```python
# Common configuration options in fsspec.config.conf
{
    'default_cache_type': 'readahead',   # Default cache strategy
    'default_block_size': 1024 * 1024,   # Default block size (1MB)
    'connect_timeout': 10,               # Connection timeout in seconds
    'read_timeout': 30,                  # Read timeout in seconds
    'max_connections': 100,              # Max concurrent connections
    'cache_dir': '/tmp/fsspec',          # Cache directory
    'logging_level': 'INFO'              # Logging verbosity
}
```
### Environment Variable Mapping

```
# Environment variables that affect fsspec behavior
FSSPEC_CACHE_TYPE -> conf['default_cache_type']
FSSPEC_BLOCK_SIZE -> conf['default_block_size']
FSSPEC_TIMEOUT    -> conf['connect_timeout']
FSSPEC_CACHE_DIR  -> conf['cache_dir']
```
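Applying such a mapping is straightforward. The helper below is a sketch based only on the table above (the `ENV_MAP` dictionary and the `conf_from_env` name are assumptions for illustration, not fsspec's actual parsing logic):

```python
import os

# Mapping taken from the table above
ENV_MAP = {
    "FSSPEC_CACHE_TYPE": "default_cache_type",
    "FSSPEC_BLOCK_SIZE": "default_block_size",
    "FSSPEC_TIMEOUT": "connect_timeout",
    "FSSPEC_CACHE_DIR": "cache_dir",
}

def conf_from_env(environ=os.environ):
    """Collect config values from any matching environment variables."""
    return {conf_key: environ[env_key]
            for env_key, conf_key in ENV_MAP.items()
            if env_key in environ}
```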
### Per-Filesystem Configuration

```python
# Apply configuration to specific filesystem instances
config_overrides = {
    's3': {'default_cache_type': 'mmap'},
    'gcs': {'default_block_size': 2 * 1024 * 1024},
    'http': {'connect_timeout': 5}
}

# Configuration is applied when creating filesystem instances
for protocol, overrides in config_overrides.items():
    fsspec.utils.apply_config(fsspec.get_filesystem_class(protocol), overrides)
```