# Utilities and Configuration

Helper functions for URL parsing, path manipulation, tokenization, and configuration management that support the core filesystem operations. These utilities provide essential infrastructure for protocol handling, caching, and system configuration.

## Capabilities

### URL and Path Processing

Functions for parsing URLs, extracting protocols, and manipulating filesystem paths across different storage backends.

```python { .api }
def infer_storage_options(urlpath, inherit_storage_options=None):
    """
    Infer storage options from URL parameters.

    Parameters:
    - urlpath: str, URL with potential query parameters
    - inherit_storage_options: dict, existing options to inherit/override

    Returns:
    dict, storage options extracted from URL
    """

def get_protocol(url):
    """
    Extract protocol from URL.

    Parameters:
    - url: str, URL to parse

    Returns:
    str, protocol name (e.g., 's3', 'gcs', 'file')
    """

def stringify_path(filepath):
    """
    Convert path object to string.

    Parameters:
    - filepath: str or Path-like, file path

    Returns:
    str, string representation of path
    """
```
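As a quick, hedged sanity check (exact return values depend on the fsspec version and platform), the three helpers can be exercised together:

```python
import pathlib
import fsspec.utils

print(fsspec.utils.get_protocol("s3://bucket/data.csv"))        # 's3'
print(fsspec.utils.stringify_path(pathlib.Path("/tmp/x.txt")))  # '/tmp/x.txt'
# Storage options inferred from the URL; the shape varies by protocol
print(fsspec.utils.infer_storage_options("s3://bucket/data.csv"))
```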

### Compression Detection

Utilities for automatically detecting compression formats from filenames and extensions.

```python { .api }
def infer_compression(filename):
    """
    Infer compression format from filename.

    Parameters:
    - filename: str, file name or path

    Returns:
    str or None, compression format name or None if uncompressed
    """
```
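In practice this inference often happens implicitly: `fsspec.open` accepts `compression="infer"`, which applies the same detection to the target path. A small sketch:

```python
import fsspec

# compression="infer" lets fsspec pick gzip from the ".gz" suffix
with fsspec.open("data.csv.gz", "rt", compression="infer") as f:
    header = f.readline()
```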

### Tokenization and Hashing

Functions for generating consistent hash tokens from filesystem paths and parameters, used internally for caching and deduplication.

```python { .api }
def tokenize(*args, **kwargs):
    """
    Generate hash token from arguments.

    Parameters:
    - *args: positional arguments to hash
    - **kwargs: keyword arguments to hash

    Returns:
    str, hash token string
    """
```
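Because the token is a plain hash string, it is filename-safe. A hedged sketch of the caching idea (the cache directory and layout here are illustrative, not an fsspec default):

```python
import os
import fsspec.utils

# Derive a stable, filename-safe cache key from the request parameters
token = fsspec.utils.tokenize("s3", "bucket/file.txt", region="us-east-1")
cache_path = os.path.join("/tmp/demo-cache", f"{token}.bin")  # hypothetical layout
```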

### Block Reading Utilities

Low-level utilities for reading data blocks with delimiter support, useful for implementing custom file readers and parsers.

```python { .api }
def read_block(file, offset, length, delimiter=None):
    """
    Read a block of data from file.

    Parameters:
    - file: file-like object, source file
    - offset: int, byte offset to start reading
    - length: int, maximum bytes to read
    - delimiter: bytes, delimiter to read until (optional)

    Returns:
    bytes, block data
    """
```
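Filesystem instances expose the same operation by path, which avoids opening the file yourself. A brief sketch (the local file name is illustrative):

```python
import fsspec

fs = fsspec.filesystem("file")
# Read roughly the first kilobyte, extended so the block ends on a newline
chunk = fs.read_block("/tmp/example.log", 0, 1024, delimiter=b"\n")
```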

### Filename Generation

Utilities for generating systematic filenames for batch operations and parallel processing.

```python { .api }
def build_name_function(max_int):
    """
    Build function for generating sequential filenames.

    Parameters:
    - max_int: int, maximum number to generate names for

    Returns:
    callable, function that takes int and returns filename string
    """
```
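The width of the generated names scales with `max_int`, so the sequence sorts lexicographically. A quick hedged illustration:

```python
import fsspec.utils

name_func = fsspec.utils.build_name_function(99)
# Expected zero-padded, two-digit names: ['00', '05', '42']
print([name_func(i) for i in (0, 5, 42)])
```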

### Atomic File Operations

Utilities for ensuring atomic file writes and preventing data corruption during file operations.

```python { .api }
def atomic_write(path, mode='wb'):
    """
    Context manager for atomic file writing.

    Parameters:
    - path: str, target file path
    - mode: str, file opening mode

    Returns:
    context manager, yields temporary file object
    """
```
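The underlying technique is write-to-temp-then-rename: data goes to a temporary file in the same directory, which is moved over the target only on success. A minimal sketch of that pattern (illustrating the idea, not fsspec's exact implementation):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write_sketch(path, mode="wb"):
    # Temp file in the target directory so os.replace stays on one filesystem
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    os.close(fd)
    try:
        with open(tmp, mode) as f:
            yield f
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp)  # leave the original target untouched on failure
        raise
```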

### Pattern Matching

Utilities for translating glob patterns to regular expressions and other pattern matching operations.

```python { .api }
def glob_translate(pat):
    """
    Translate glob pattern to regular expression.

    Parameters:
    - pat: str, glob pattern

    Returns:
    str, regular expression pattern
    """
```

### Configuration Management

Global configuration system for fsspec behavior and default settings.

```python { .api }
conf: dict
"""Global configuration dictionary with fsspec settings"""

conf_dir: str
"""Configuration directory path"""

def set_conf_env(conf_dict, envdict=os.environ):
    """
    Set configuration from environment variables.

    Parameters:
    - conf_dict: dict, configuration dictionary to update
    - envdict: dict, environment variables dictionary
    """

def apply_config(cls, kwargs):
    """
    Apply configuration to class constructor arguments.

    Parameters:
    - cls: type, class to configure
    - kwargs: dict, keyword arguments to modify

    Returns:
    dict, modified keyword arguments with config applied
    """
```

## Usage Patterns

### URL Parameter Extraction

```python
import fsspec
import fsspec.utils

# Extract storage options from URL query parameters
url = 's3://bucket/path?key=ACCESS_KEY&secret=SECRET_KEY&region=us-west-2'
storage_options = fsspec.utils.infer_storage_options(url)
print(storage_options)
# {'key': 'ACCESS_KEY', 'secret': 'SECRET_KEY', 'region': 'us-west-2'}

# Use extracted options
fs = fsspec.filesystem('s3', **storage_options)

# Inherit and override options
base_options = {'key': 'BASE_KEY', 'timeout': 30}
url = 's3://bucket/path?secret=SECRET_KEY'
final_options = fsspec.utils.infer_storage_options(url, base_options)
# Result: {'key': 'BASE_KEY', 'timeout': 30, 'secret': 'SECRET_KEY'}
```

### Protocol Detection

```python
import fsspec.utils

# Extract protocol from various URL formats
urls = [
    's3://bucket/file.txt',
    'gcs://bucket/file.txt',
    'https://example.com/api',
    '/local/path/file.txt',
    'file:///absolute/path'
]

for url in urls:
    protocol = fsspec.utils.get_protocol(url)
    print(f"{url} -> {protocol}")

# s3://bucket/file.txt -> s3
# gcs://bucket/file.txt -> gcs
# https://example.com/api -> https
# /local/path/file.txt -> file
# file:///absolute/path -> file
```

### Compression Auto-Detection

```python
import fsspec.utils

# Automatically detect compression from filenames
filenames = [
    'data.csv.gz',
    'archive.tar.bz2',
    'logs.txt.xz',
    'config.json',
    'model.pkl.lz4'
]

for filename in filenames:
    compression = fsspec.utils.infer_compression(filename)
    print(f"{filename} -> {compression}")

# data.csv.gz -> gzip
# archive.tar.bz2 -> bz2
# logs.txt.xz -> lzma
# config.json -> None
# model.pkl.lz4 -> lz4
```

### Path Standardization

```python
import pathlib

import fsspec.utils

# Convert various path types to strings
paths = [
    '/local/file.txt',
    pathlib.Path('/local/file.txt'),
    pathlib.PurePosixPath('/local/file.txt')
]

for path in paths:
    str_path = fsspec.utils.stringify_path(path)
    print(f"{type(path)} -> {str_path}")
```

### Tokenization for Caching

```python
import fsspec.utils

# Generate consistent tokens for caching
token1 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token2 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
token3 = fsspec.utils.tokenize('s3', 'bucket', 'file.txt', region='us-west-2')

print(token1 == token2)  # True - same parameters
print(token1 == token3)  # False - different region

# Use for cache keys
protocol, path = 's3', 'bucket/file.txt'
storage_options = {'region': 'us-east-1'}
cache_key = fsspec.utils.tokenize(protocol, path, **storage_options)
```

### Block Reading with Delimiters

```python
import os

import fsspec.utils

# Read file in blocks aligned to line boundaries
block_size = 1024 * 1024  # 1MB blocks
size = os.path.getsize('large_file.txt')

with open('large_file.txt', 'rb') as f:
    for offset in range(0, size, block_size):
        # Each block is shifted/extended to start and end at a delimiter,
        # so fixed offsets cover the file without splitting lines
        block = fsspec.utils.read_block(f, offset, block_size, delimiter=b'\n')

        # Process complete lines (process_line is assumed defined elsewhere)
        for line in block.split(b'\n'):
            if line:  # Skip empty lines
                process_line(line)
```

### Sequential Filename Generation

```python
import fsspec
import fsspec.utils

# Generate systematic filenames for batch output
name_func = fsspec.utils.build_name_function(1000)

filenames = [name_func(i) for i in range(5)]
print(filenames)
# ['000', '001', '002', '003', '004']

# Use with fsspec.open_files for multiple outputs
output_files = fsspec.open_files(
    'output-*.json',
    'w',
    num=10,
    name_function=name_func
)
```

### Atomic File Writing

```python
import pickle

import fsspec.utils

# Ensure atomic writes to prevent corruption
with fsspec.utils.atomic_write('/important/file.txt', 'w') as f:
    f.write('Critical data that must be written atomically\n')
    f.write('If this fails, the original file remains unchanged\n')
# File is only moved to final location if all writes succeed

# Works with binary mode too (model is assumed defined elsewhere)
with fsspec.utils.atomic_write('/data/model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

### Glob Pattern Processing

```python
import re

import fsspec.utils

# Convert glob patterns to regex for custom matching
patterns = ['*.txt', 'data_*.csv', 'logs/*/error.log']

for pattern in patterns:
    regex = fsspec.utils.glob_translate(pattern)
    print(f"{pattern} -> {regex}")

# Use compiled regex for matching
regex_pattern = fsspec.utils.glob_translate('data_*.csv')
compiled = re.compile(regex_pattern)

files = ['data_1.csv', 'data_2.csv', 'config.json', 'data_old.csv']
matches = [f for f in files if compiled.match(f)]
print(matches)  # ['data_1.csv', 'data_2.csv', 'data_old.csv']
```

### Global Configuration

```python
import os

import fsspec.config

# Check current configuration
print("Current fsspec config:", fsspec.config.conf)

# Set configuration options
fsspec.config.conf['default_cache_type'] = 'blockcache'
fsspec.config.conf['default_block_size'] = 1024 * 1024

# Configuration from environment variables
os.environ['FSSPEC_CACHE_TYPE'] = 'readahead'
os.environ['FSSPEC_BLOCK_SIZE'] = '2097152'

fsspec.config.set_conf_env(fsspec.config.conf)
print("Updated config:", fsspec.config.conf)
```

### Custom Utility Functions

```python
import fsspec.utils

def get_file_info(url):
    """Get comprehensive file information from URL."""
    protocol = fsspec.utils.get_protocol(url)
    compression = fsspec.utils.infer_compression(url)
    storage_options = fsspec.utils.infer_storage_options(url)

    return {
        'protocol': protocol,
        'compression': compression,
        'storage_options': storage_options,
        'token': fsspec.utils.tokenize(url, **storage_options)
    }

# Use custom utility
info = get_file_info('s3://bucket/data.csv.gz?region=us-west-2')
print(info)
```

### Error Handling with Utilities

```python
import fsspec.utils

def safe_infer_compression(filename):
    """Safely infer compression with fallback."""
    try:
        return fsspec.utils.infer_compression(filename)
    except Exception:
        # Return None if compression inference fails
        return None

def safe_get_protocol(url):
    """Safely extract protocol with fallback."""
    try:
        return fsspec.utils.get_protocol(url)
    except Exception:
        # Default to file protocol
        return 'file'
```

### Performance Optimization with Utilities

```python
from functools import lru_cache

import fsspec.utils

# Cache tokenization results for repeated operations
@lru_cache(maxsize=1000)
def cached_tokenize(*args, **kwargs):
    """Cached version of tokenize for performance."""
    # Sort kwargs for consistent hashing
    sorted_kwargs = tuple(sorted(kwargs.items()))
    return fsspec.utils.tokenize(*args, *sorted_kwargs)

# Use cached tokenization
token = cached_tokenize('s3', 'bucket', 'file.txt', region='us-east-1')
```

## Configuration Options

### Global Settings

```python
# Common configuration options in fsspec.config.conf
{
    'default_cache_type': 'readahead',  # Default cache strategy
    'default_block_size': 1024 * 1024,  # Default block size (1MB)
    'connect_timeout': 10,              # Connection timeout seconds
    'read_timeout': 30,                 # Read timeout seconds
    'max_connections': 100,             # Max concurrent connections
    'cache_dir': '/tmp/fsspec',         # Cache directory
    'logging_level': 'INFO'             # Logging verbosity
}
```
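Individual settings can be read defensively, since not every key is guaranteed to be present. A small hedged sketch:

```python
import fsspec.config

# Fall back to a local default when the key has not been configured
block_size = fsspec.config.conf.get('default_block_size', 1024 * 1024)
```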

### Environment Variable Mapping

Environment variables that affect fsspec behavior:

| Environment variable | Configuration key |
| --- | --- |
| `FSSPEC_CACHE_TYPE` | `conf['default_cache_type']` |
| `FSSPEC_BLOCK_SIZE` | `conf['default_block_size']` |
| `FSSPEC_TIMEOUT` | `conf['connect_timeout']` |
| `FSSPEC_CACHE_DIR` | `conf['cache_dir']` |

### Per-Filesystem Configuration

```python
import fsspec
import fsspec.config

# Register configuration overrides for specific protocols
fsspec.config.conf['s3'] = {'default_cache_type': 'mmap'}
fsspec.config.conf['gcs'] = {'default_block_size': 2 * 1024 * 1024}
fsspec.config.conf['http'] = {'connect_timeout': 5}

# Configuration is applied when creating filesystem instances;
# apply_config merges the per-protocol entries into constructor kwargs
for protocol in ('s3', 'gcs', 'http'):
    cls = fsspec.get_filesystem_class(protocol)
    kwargs = fsspec.config.apply_config(cls, {})
    print(protocol, kwargs)
```