# Compression Support

fsspec can compress and decompress files transparently on any filesystem backend. Pass a format name as `compression=` to `fsspec.open`, or pass `compression='infer'` to have fsspec choose the format from the file extension. The default, `compression=None`, reads and writes bytes unchanged. The actual encoding and decoding is delegated to Python's compression libraries.

## Capabilities

### Compression Registration

Register a new compression format and associate it with file extensions and a handler that opens compressed streams.

```python { .api }
def register_compression(name, callback, extensions, force=False):
    """
    Register a compression format.

    Parameters:
    - name: str, compression format name
    - callback: callable, called as callback(file, mode=..., **kwargs); returns a file-like object that decompresses on read and compresses on write
    - extensions: str or list of str, file extensions (without the leading dot) associated with this format
    - force: bool, whether to overwrite an existing registration
    """
```
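
To illustrate the `force` flag, here is a minimal sketch that registers `bz2.BZ2File` under a made-up name and extension (`demo-bz2` and `demobz2` are illustrative only, not part of fsspec). It assumes, as in current fsspec releases, that registering an existing name again without `force=True` raises `ValueError`.

```python
import bz2

from fsspec.compression import register_compression

# Hypothetical name and extension, chosen only for this demo.
# force=True keeps the snippet re-runnable in the same interpreter.
register_compression("demo-bz2", bz2.BZ2File, "demobz2", force=True)

try:
    # Second registration without force: expected to be rejected.
    register_compression("demo-bz2", bz2.BZ2File, "demobz2")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print("duplicate rejected:", duplicate_rejected)
```

Note that the extension is given without a leading dot, which is how the built-in formats are registered.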

### Available Compressions

Query which compression formats are currently supported by the system.

```python { .api }
def available_compressions():
    """
    List all available compression formats.

    Returns:
    list of str, compression format names
    """
```

## Built-in Compression Formats

### Standard Library Formats

Compression formats supported through Python's standard library:

```python { .api }
# GZIP compression (.gz files)
'gzip': Uses gzip module for compression/decompression

# BZIP2 compression (.bz2 files)
'bz2': Uses bz2 module for compression/decompression

# LZMA compression (.lzma, .xz files)
'lzma': Uses lzma module for compression/decompression

# ZIP archive format (.zip files)
'zip': Uses zipfile module for archive access
```

### Optional Third-Party Formats

Additional compression formats available when optional dependencies are installed:

```python { .api }
# Snappy compression (requires python-snappy)
'snappy': Fast compression optimized for speed over ratio

# LZ4 compression (requires lz4)
'lz4': Ultra-fast compression with .lz4 extension

# Zstandard compression (requires zstandard)
'zstd': Modern compression with .zst extension, good speed/ratio balance
```

## Usage Patterns

### Automatic Compression Detection

```python
import fsspec

# With compression='infer', fsspec picks the codec from the file extension.
# The default (compression=None) performs no decompression.

# Reading compressed files
with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:
    # Decompressed with gzip
    content = f.read()

with fsspec.open('logs.txt.bz2', 'rt', compression='infer') as f:
    for line in f:
        process_line(line)  # placeholder for your own handler

with fsspec.open('archive.tar.xz', 'rb', compression='infer') as f:
    # LZMA decompression; yields the bytes of the inner .tar
    data = f.read()
```
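
Detection looks only at the final file extension. The sketch below calls `fsspec.utils.infer_compression`, the helper that `compression='infer'` relies on, to show which codec name each path maps to. The file names are made up for illustration and no files are opened.

```python
from fsspec.utils import infer_compression

names = ["data.csv.gz", "logs.txt.bz2", "archive.tar.xz", "plain.csv"]

# Map each file name to the codec name fsspec would infer (None = no codec).
inferred = {name: infer_compression(name) for name in names}

for name, codec in inferred.items():
    print(f"{name} -> {codec}")
```

A path whose extension is not registered maps to `None`, so it is read and written without any compression.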

### Explicit Compression Specification

```python
# Force a specific compression format
with fsspec.open('data.csv', 'rt', compression='gzip') as f:
    content = f.read()

# Ignore the extension and name the codec yourself
with fsspec.open('file.gz', 'rt', compression='bz2') as f:
    # Treats the .gz file as bz2-compressed
    content = f.read()

# No decompression for a .gz file (compression=None is also the default)
with fsspec.open('not-compressed.gz', 'rt', compression=None) as f:
    # Reads the raw file without decompression
    content = f.read()
```
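
The difference between the settings is easiest to see on a real file. This is a small self-contained sketch using a temporary local directory (the path and contents are made up): it writes gzip data, reads the stored bytes back with `compression=None`, then reads the decoded text with `compression='infer'`.

```python
import os
import tempfile

import fsspec

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")

    # Write gzip-compressed text by naming the codec explicitly.
    with fsspec.open(path, "wt", compression="gzip") as f:
        f.write("a,b\n1,2\n")

    # compression=None returns the bytes exactly as stored on disk.
    with fsspec.open(path, "rb", compression=None) as f:
        raw = f.read()

    # compression="infer" selects gzip from the ".gz" suffix.
    with fsspec.open(path, "rt", compression="infer") as f:
        text = f.read()

print(raw[:2])  # first two stored bytes (the gzip magic number)
print(text)
```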

### Writing Compressed Files

```python
# Write compressed data
with fsspec.open('output.csv.gz', 'wt', compression='infer') as f:
    # Compressed with gzip, inferred from the .gz extension
    f.write('column1,column2\n')
    f.write('value1,value2\n')

# Write with explicit compression
with fsspec.open('output.txt', 'wt', compression='bz2') as f:
    f.write('This will be bz2 compressed\n')
```
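
Files written this way are ordinary compressed files, so other tools can read them. As a quick check under that assumption, the sketch below writes through fsspec with `compression='bz2'` and reads the result back with the standard library's `bz2` module. The temporary path is illustrative.

```python
import bz2
import os
import tempfile

import fsspec

with tempfile.TemporaryDirectory() as tmp:
    # No ".bz2" suffix on purpose: the codec is named explicitly.
    path = os.path.join(tmp, "output.txt")

    with fsspec.open(path, "wt", compression="bz2") as f:
        f.write("This will be bz2 compressed\n")

    # Read it back without fsspec to confirm the on-disk format.
    with bz2.open(path, "rt") as f:
        round_tripped = f.read()

print(round_tripped)
```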

### Remote Files with Compression

```python
import json

import fsspec
import pandas as pd

# S3 files with compression (the s3:// protocol requires s3fs)
with fsspec.open('s3://bucket/data.csv.gz', 'rt', compression='infer') as f:
    df = pd.read_csv(f)

# HTTP files with compression
with fsspec.open('https://example.com/data.json.gz', 'rt', compression='infer') as f:
    data = json.load(f)

# GCS files with compression (the gcs:// protocol requires gcsfs)
with fsspec.open('gcs://bucket/logs.txt.bz2', 'rt', compression='infer') as f:
    for line in f:
        process_log(line)  # placeholder for your own handler
```

### Batch Processing with Mixed Compression

```python
import fsspec
import pandas as pd

# Process files with different compression formats
files = [
    's3://bucket/data1.csv.gz',   # gzip
    's3://bucket/data2.csv.bz2',  # bzip2
    's3://bucket/data3.csv.xz',   # lzma
    's3://bucket/data4.csv',      # uncompressed
]

dataframes = []
for file_path in files:
    with fsspec.open(file_path, 'rt', compression='infer') as f:
        # Codec chosen per file from its extension
        df = pd.read_csv(f)
        dataframes.append(df)

combined_df = pd.concat(dataframes)
```

### Archive File Access

```python
# Access files within ZIP archives
with fsspec.open('zip://data.csv::archive.zip', 'rt') as f:
    # Reads data.csv from within archive.zip
    content = f.read()

# Remote ZIP archives
with fsspec.open('zip://data.csv::s3://bucket/archive.zip', 'rt') as f:
    content = f.read()
```
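
The chained URL reads as "member path, then `::`, then the location of the archive". Here is a runnable sketch of the same pattern against a local archive built with the standard `zipfile` module. The archive name and member names are made up for the demo.

```python
import os
import tempfile
import zipfile

import fsspec

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "archive.zip")

    # Build a small archive with two members.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
        zf.writestr("notes/readme.txt", "hello\n")

    # Member path first, then "::", then the path of the archive itself.
    with fsspec.open(f"zip://data.csv::{archive_path}", "rt") as f:
        content = f.read()

print(content)
```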

### Custom Compression Registration

```python
import fsspec
import my_compression_lib  # hypothetical library, stands in for your codec

def my_compression_opener(file, mode='rb', **kwargs):
    """Custom compression opener function."""
    if 'r' in mode:
        return my_compression_lib.decompress_file(file)
    elif 'w' in mode:
        return my_compression_lib.compress_file(file)
    else:
        raise ValueError(f"Unsupported mode: {mode}")

# Register custom compression format (extensions are given without the leading dot)
fsspec.compression.register_compression(
    name='myformat',
    callback=my_compression_opener,
    extensions=['mycomp', 'mc']
)

# Now use custom compression
with fsspec.open('data.txt.mycomp', 'rt', compression='infer') as f:
    content = f.read()
```
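
For a version that runs without a third-party codec, the sketch below registers a stand-in format that simply wraps `gzip.GzipFile`. The name `demo-codec` and the extension `democ` are invented for this example. The opener receives the already-open binary file plus a mode and returns a file-like object that handles the encoding.

```python
import gzip
import os
import tempfile

import fsspec
from fsspec.compression import register_compression


def open_demo_codec(infile, mode="rb", **kwargs):
    """Wrap an already-open binary file in a gzip stream (stand-in codec)."""
    return gzip.GzipFile(fileobj=infile, mode=mode, **kwargs)


# Hypothetical format name and extension; force=True makes re-runs safe.
register_compression("demo-codec", open_demo_codec, ["democ"], force=True)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt.democ")

    # The ".democ" suffix now resolves to the registered opener.
    with fsspec.open(path, "wt", compression="infer") as f:
        f.write("hello\n")

    with fsspec.open(path, "rt", compression="infer") as f:
        content = f.read()

print(content)
```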

### Performance Considerations

```python
# Choose compression based on use case

# For speed-critical applications
with fsspec.open('data.csv.lz4', 'rt', compression='infer') as f:  # Fast decompression
    df = pd.read_csv(f)

# For space-critical applications
with fsspec.open('data.csv.xz', 'rt', compression='infer') as f:  # High compression ratio
    df = pd.read_csv(f)

# For general use
with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:  # Good balance
    df = pd.read_csv(f)
```

### Compression with Caching

```python
# Compression works with caching layers.
# Keyword arguments to fsspec.open go to the filesystem constructor; the
# names below are the s3fs options for the default read cache.
with fsspec.open('s3://bucket/large-data.csv.gz',
                 'rt',
                 compression='infer',
                 default_cache_type='blockcache',
                 default_block_size=1024*1024) as f:
    # Compressed bytes are cached; decompression happens after the cache
    df = pd.read_csv(f)
```

### Multi-threaded Compression

```python
import concurrent.futures

import fsspec

def process_compressed_file(file_path):
    """Return the number of decoded characters in one file."""
    with fsspec.open(file_path, 'rt', compression='infer') as f:
        return len(f.read())

# Process multiple compressed files in parallel
compressed_files = [
    's3://bucket/file1.csv.gz',
    's3://bucket/file2.csv.bz2',
    's3://bucket/file3.csv.xz'
]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_compressed_file, compressed_files))
```
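
The same pattern can be tried without cloud credentials. This sketch creates a few local files in a temporary directory (the names and contents are made up) and reads them back through a thread pool. It is meant as an offline way to exercise the approach, not as a benchmark.

```python
import concurrent.futures
import os
import tempfile

import fsspec


def count_characters(file_path):
    """Open one file with inferred compression and count decoded characters."""
    with fsspec.open(file_path, "rt", compression="infer") as f:
        return len(f.read())


with tempfile.TemporaryDirectory() as tmp:
    names = ("a.txt.gz", "b.txt.bz2", "c.txt.xz", "d.txt")
    paths = [os.path.join(tmp, name) for name in names]

    # Create the inputs; each file gets the codec implied by its suffix.
    for path in paths:
        with fsspec.open(path, "wt", compression="infer") as f:
            f.write("x" * 1000)

    # executor.map preserves the order of the input paths.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        sizes = list(executor.map(count_characters, paths))

print(sizes)
```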

### Checking Compression Support

```python
import fsspec

# Check which compression formats are available
available = fsspec.compression.available_compressions()
print("Available compression formats:", available)

# Check for a specific format
if 'lz4' in available:
    print("LZ4 compression is available")
    with fsspec.open('data.csv.lz4', 'rt', compression='lz4') as f:
        content = f.read()
else:
    print("LZ4 not available, using gzip")
    with fsspec.open('data.csv.gz', 'rt', compression='gzip') as f:
        content = f.read()
```
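
When several formats are acceptable, the check can be wrapped in a small helper. `pick_compression` below is a hypothetical convenience function, not part of fsspec. It returns the first name from a preference list that is registered in the current environment.

```python
from fsspec.compression import available_compressions


def pick_compression(preferred):
    """Return the first name in `preferred` that fsspec can handle here."""
    available = set(available_compressions())
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"None of {preferred!r} is available")


# Prefer the optional codecs, fall back to gzip from the standard library.
chosen = pick_compression(["zstd", "lz4", "gzip"])
print("Using compression:", chosen)
```

The returned name can be passed straight to `fsspec.open(..., compression=chosen)`.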

### Error Handling with Compression

```python
try:
    with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:
        content = f.read()
except ImportError as e:
    print(f"Required package not available: {e}")
except ValueError as e:
    print(f"Compression format not supported: {e}")
except OSError as e:
    print(f"Compression error (possibly corrupted file): {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
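
To see the corrupted-file path in action, the sketch below writes bytes that are not valid gzip data to a temporary file (the name and contents are invented) and then tries to read it as gzip. It assumes the gzip reader reports the bad header as an `OSError` subclass, which is how the standard library's `gzip` module behaves.

```python
import os
import tempfile

import fsspec

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "broken.csv.gz")

    # Deliberately store plain bytes under a ".gz" name.
    with open(path, "wb") as f:
        f.write(b"this is not gzip data")

    try:
        with fsspec.open(path, "rt", compression="gzip") as f:
            f.read()
        error_type = None
    except OSError as exc:
        # Record the concrete exception class for inspection.
        error_type = type(exc).__name__

print("error raised:", error_type)
```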

## Compression Format Details

### GZIP (.gz)
- **Use case**: General purpose, widely supported
- **Performance**: Medium compression ratio, medium speed
- **Availability**: Python standard library (always available)

### BZIP2 (.bz2)
- **Use case**: Better compression than gzip
- **Performance**: High compression ratio, slower speed
- **Availability**: Python standard library (always available)

### LZMA (.xz, .lzma)
- **Use case**: Best compression ratio
- **Performance**: Highest compression ratio, slowest speed
- **Availability**: Python standard library (Python 3.3+)

### LZ4 (.lz4)
- **Use case**: Speed-critical applications
- **Performance**: Lower compression ratio, very fast
- **Availability**: Requires `lz4` package (`pip install lz4`)

### Snappy
- **Use case**: High-speed compression/decompression
- **Performance**: Fast with reasonable compression
- **Availability**: Requires `python-snappy` package

### Zstandard (.zst)
- **Use case**: Modern replacement for gzip
- **Performance**: Good compression ratio and speed
- **Availability**: Requires `zstandard` package

### ZIP (.zip)
- **Use case**: Archive files with multiple entries
- **Performance**: Variable depending on internal compression
- **Availability**: Python standard library (zipfile module)
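
The ratio and speed characterizations above are general tendencies, and actual results depend on the data. A rough way to compare the standard-library codecs on your own payload is sketched below. The sample bytes are synthetic and highly repetitive, so the numbers are only illustrative.

```python
import bz2
import gzip
import lzma

# Synthetic, repetitive sample; replace with a slice of real data.
sample = b"timestamp,level,message\n" + (
    b"2024-01-01T00:00:00,INFO,request served\n" * 5000
)

# Compressed size in bytes for each standard-library codec.
sizes = {
    "raw": len(sample),
    "gzip": len(gzip.compress(sample)),
    "bz2": len(bz2.compress(sample)),
    "lzma": len(lzma.compress(sample)),
}

for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```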

## Integration Examples

### With Pandas

```python
import fsspec
import pandas as pd

# Read compressed CSV into pandas; pass the opened file, not the OpenFile wrapper
with fsspec.open('s3://bucket/data.csv.gz', 'rt', compression='infer') as f:
    df = pd.read_csv(f)

# Write compressed CSV from pandas
with fsspec.open('output.csv.gz', 'wt', compression='infer') as f:
    df.to_csv(f, index=False)
```

### With JSON

```python
import json

import fsspec

# Read compressed JSON
with fsspec.open('config.json.gz', 'rt', compression='infer') as f:
    config = json.load(f)

# Write compressed JSON (here the config that was just read)
with fsspec.open('output.json.bz2', 'wt', compression='infer') as f:
    json.dump(config, f, indent=2)
```

### With Numpy

```python
import fsspec
import numpy as np

# Read compressed numpy array
with fsspec.open('array.npy.gz', 'rb', compression='infer') as f:
    array = np.load(f)

# Write compressed numpy array
with fsspec.open('output.npy.gz', 'wb', compression='infer') as f:
    np.save(f, array)
```