# Compression Support

Automatic compression and decompression support for multiple formats, enabling transparent handling of compressed files across all filesystem backends. When `compression='infer'` is passed to `fsspec.open`, fsspec detects the compression format from the file extension and handles it through Python's compression libraries.

## Capabilities

### Compression Registration

Register new compression formats and associate them with file extensions and compression handlers.

```python { .api }
def register_compression(name, callback, extensions, force=False):
    """
    Register a compression format.

    Parameters:
    - name: str, compression format name
    - callback: callable, takes an open binary file and a mode and returns
      a file-like object that compresses or decompresses on the fly
    - extensions: list of str, file extensions associated with this format,
      given without the leading dot (e.g. "gz")
    - force: bool, whether to overwrite an existing registration
    """
```

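As a runnable sketch of the registration mechanics, the example below registers a "null" codec that performs no compression at all; the codec name and the `none` extension are invented for this illustration, and the callback signature assumes fsspec invokes it with the open binary file and a mode.

```python
import fsspec
from fsspec.compression import register_compression

def null_codec(infile, mode="rb", **kwargs):
    # The callback receives an already-open binary file and must return a
    # file-like object; here we simply pass it through unchanged.
    return infile

# Extensions are registered without the leading dot
register_compression("null", null_codec, extensions=["none"], force=True)

# The "none" extension now maps to the codec when compression='infer' is used
with fsspec.open("memory://demo.txt.none", "wt", compression="infer") as f:
    f.write("stored verbatim\n")
```

Because the codec is a pass-through, the bytes land in the in-memory filesystem exactly as written.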
### Available Compressions

Query which compression formats are currently supported by the system.

```python { .api }
def available_compressions():
    """
    List all available compression formats.

    Returns:
    - list of str, compression format names
    """
```

## Built-in Compression Formats

### Standard Library Formats

Compression formats supported through Python's standard library:

```python { .api }
# GZIP compression (.gz files)
'gzip': Uses the gzip module for compression/decompression

# BZIP2 compression (.bz2 files)
'bz2': Uses the bz2 module for compression/decompression

# LZMA/XZ compression (.lzma and .xz files)
'lzma', 'xz': Use the lzma module for compression/decompression

# ZIP archive format (.zip files)
'zip': Uses the zipfile module for archive access
```

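A minimal local round-trip with the gzip codec shows the standard-library formats in action; the path and payload are illustrative.

```python
import os
import tempfile

import fsspec

path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")

# 'infer' picks gzip from the .gz extension for both writing and reading
with fsspec.open(path, "wt", compression="infer") as f:
    f.write("hello compressed world\n")

# The bytes on disk are gzip-compressed, not plain text
with open(path, "rb") as raw:
    assert raw.read(2) == b"\x1f\x8b"  # gzip magic bytes

with fsspec.open(path, "rt", compression="infer") as f:
    assert f.read() == "hello compressed world\n"
```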
### Optional Third-Party Formats

Additional compression formats available when optional dependencies are installed:

```python { .api }
# Snappy compression (requires python-snappy)
'snappy': Fast compression optimized for speed over ratio

# LZ4 compression (requires lz4)
'lz4': Ultra-fast compression with the .lz4 extension

# Zstandard compression (requires zstandard)
'zstd': Modern compression with the .zst extension, balancing speed and ratio
```

## Usage Patterns

### Automatic Compression Detection

```python
# With compression='infer', fsspec picks the codec from the file extension

# Reading compressed files
with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:
    # Automatically decompressed
    content = f.read()

with fsspec.open('logs.txt.bz2', 'rt', compression='infer') as f:
    for line in f:
        process_line(line)

with fsspec.open('archive.tar.xz', 'rb', compression='infer') as f:
    # xz decompression; yields the raw tar stream
    data = f.read()
```

### Explicit Compression Specification

```python
# Force a specific compression format
with fsspec.open('data.csv', 'rt', compression='gzip') as f:
    content = f.read()

# Override the extension
with fsspec.open('file.gz', 'rt', compression='bz2') as f:
    # Treats the .gz file as bz2-compressed
    content = f.read()

# Disable compression (None is also the default when the keyword is omitted)
with fsspec.open('not-compressed.gz', 'rt', compression=None) as f:
    # Reads the raw file without decompression
    content = f.read()
```

### Writing Compressed Files

```python
# Write compressed data
with fsspec.open('output.csv.gz', 'wt', compression='infer') as f:
    # Compressed with gzip, inferred from the .gz extension
    f.write('column1,column2\n')
    f.write('value1,value2\n')

# Write with explicit compression
with fsspec.open('output.txt', 'wt', compression='bz2') as f:
    f.write('This will be bz2 compressed\n')
```

### Remote Files with Compression

```python
# S3 files with compression
with fsspec.open('s3://bucket/data.csv.gz', 'rt', compression='infer') as f:
    df = pd.read_csv(f)

# HTTP files with compression
with fsspec.open('https://example.com/data.json.gz', 'rt', compression='infer') as f:
    data = json.load(f)

# GCS files with compression
with fsspec.open('gcs://bucket/logs.txt.bz2', 'rt', compression='infer') as f:
    for line in f:
        process_log(line)
```

### Batch Processing with Mixed Compression

```python
# Process files with different compression formats
files = [
    's3://bucket/data1.csv.gz',   # gzip
    's3://bucket/data2.csv.bz2',  # bzip2
    's3://bucket/data3.csv.xz',   # lzma
    's3://bucket/data4.csv'       # uncompressed; 'infer' leaves it untouched
]

dataframes = []
for file_path in files:
    with fsspec.open(file_path, 'rt', compression='infer') as f:
        # Compression handled per file extension
        df = pd.read_csv(f)
        dataframes.append(df)

combined_df = pd.concat(dataframes)
```

### Archive File Access

```python
# Access files within ZIP archives
with fsspec.open('zip://data.csv::archive.zip', 'rt') as f:
    # Reads data.csv from within archive.zip
    content = f.read()

# Remote ZIP archives
with fsspec.open('zip://data.csv::s3://bucket/archive.zip', 'rt') as f:
    content = f.read()
```

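The protocol chaining above can be tried locally: the sketch below builds a small archive with the standard library, then reads one member back through the `zip://` protocol. The archive location and member name are illustrative.

```python
import os
import tempfile
import zipfile

import fsspec

# Create a small archive on the local filesystem
archive = os.path.join(tempfile.mkdtemp(), "archive.zip")
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("data.csv", "a,b\n1,2\n")

# Read the member back via protocol chaining: member-path :: archive-path
with fsspec.open(f"zip://data.csv::{archive}", "rt") as f:
    assert f.read() == "a,b\n1,2\n"
```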
### Custom Compression Registration

```python
import fsspec
import my_compression_lib

def my_compression_opener(file, mode='rb', **kwargs):
    """Custom compression opener: wraps an open binary file."""
    if 'r' in mode:
        return my_compression_lib.decompress_file(file)
    elif 'w' in mode:
        return my_compression_lib.compress_file(file)
    else:
        raise ValueError(f"Unsupported mode: {mode}")

# Register the custom format; extensions are given without the leading dot
fsspec.compression.register_compression(
    name='myformat',
    callback=my_compression_opener,
    extensions=['mycomp', 'mc']
)

# Now use the custom compression
with fsspec.open('data.txt.mycomp', 'rt', compression='infer') as f:
    content = f.read()
```

### Performance Considerations

```python
# Choose compression based on the use case

# For speed-critical applications
with fsspec.open('data.csv.lz4', 'rt', compression='infer') as f:  # fast decompression
    df = pd.read_csv(f)

# For space-critical applications
with fsspec.open('data.csv.xz', 'rt', compression='infer') as f:  # high compression ratio
    df = pd.read_csv(f)

# For general use
with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:  # good balance
    df = pd.read_csv(f)
```

### Compression with Caching

```python
# Compression works with caching layers
with fsspec.open('s3://bucket/large-data.csv.gz',
                 'rt',
                 compression='infer',
                 cache_type='blockcache',
                 block_size=1024 * 1024) as f:
    # Compressed bytes are cached; decompression happens after the cache
    df = pd.read_csv(f)
```

### Multi-threaded Compression

```python
import concurrent.futures

import fsspec

def process_compressed_file(file_path):
    with fsspec.open(file_path, 'rt', compression='infer') as f:
        return len(f.read())

# Process multiple compressed files in parallel
compressed_files = [
    's3://bucket/file1.csv.gz',
    's3://bucket/file2.csv.bz2',
    's3://bucket/file3.csv.xz'
]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(process_compressed_file, compressed_files))
```

### Checking Compression Support

```python
# Check which compression formats are available
available = fsspec.compression.available_compressions()
print("Available compression formats:", available)

# Check for a specific format
if 'lz4' in available:
    print("LZ4 compression is available")
    with fsspec.open('data.csv.lz4', 'rt', compression='infer') as f:
        content = f.read()
else:
    print("LZ4 not available, using gzip")
    with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:
        content = f.read()
```

### Error Handling with Compression

```python
try:
    with fsspec.open('data.csv.gz', 'rt', compression='infer') as f:
        content = f.read()
except ImportError as e:
    print(f"Compression library not available: {e}")
except OSError as e:
    print(f"Compression error (possibly a corrupted file): {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Compression Format Details

### GZIP (.gz)
- **Use case**: General purpose, widely supported
- **Performance**: Medium compression ratio, medium speed
- **Availability**: Python standard library (always available)

### BZIP2 (.bz2)
- **Use case**: Better compression ratio than gzip
- **Performance**: High compression ratio, slower speed
- **Availability**: Python standard library (always available)

### LZMA (.xz, .lzma)
- **Use case**: Highest compression ratio of the built-in formats
- **Performance**: Highest compression ratio, slowest speed
- **Availability**: Python standard library (lzma module)

### LZ4 (.lz4)
- **Use case**: Speed-critical applications
- **Performance**: Lower compression ratio, very fast
- **Availability**: Requires the `lz4` package (`pip install lz4`)

### Snappy
- **Use case**: High-speed compression/decompression
- **Performance**: Fast with reasonable compression
- **Availability**: Requires the `python-snappy` package

### Zstandard (.zst)
- **Use case**: Modern replacement for gzip
- **Performance**: Good compression ratio and speed
- **Availability**: Requires the `zstandard` package

### ZIP (.zip)
- **Use case**: Archive files with multiple entries
- **Performance**: Variable, depending on the archive's internal compression
- **Availability**: Python standard library (zipfile module)

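The ratio trade-offs above can be sampled directly with the standard-library codecs. The sketch below compresses one highly repetitive payload with each built-in format; the payload is invented, and real-world ratios depend heavily on the data.

```python
import bz2
import gzip
import lzma

# A deliberately repetitive payload, which compresses very well
payload = b"timestamp,level,message\n" * 10_000

sizes = {
    "none": len(payload),
    "gzip": len(gzip.compress(payload)),
    "bz2": len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
}

for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```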
## Integration Examples

### With Pandas

```python
# Read a compressed CSV into pandas via an open file object
with fsspec.open('s3://bucket/data.csv.gz', 'rt', compression='infer') as f:
    df = pd.read_csv(f)

# Write a compressed CSV from pandas
with fsspec.open('output.csv.gz', 'wt', compression='infer') as f:
    df.to_csv(f, index=False)
```

### With JSON

```python
# Read compressed JSON
with fsspec.open('config.json.gz', 'rt', compression='infer') as f:
    config = json.load(f)

# Write compressed JSON
with fsspec.open('output.json.bz2', 'wt', compression='infer') as f:
    json.dump(data, f, indent=2)
```

### With Numpy

```python
# Read a compressed numpy array
with fsspec.open('array.npy.gz', 'rb', compression='infer') as f:
    array = np.load(f)

# Write a compressed numpy array
with fsspec.open('output.npy.gz', 'wb', compression='infer') as f:
    np.save(f, array)
```