# Caching System

The caching system provides several strategies for optimizing filesystem access: memory mapping, block caching, read-ahead caching, and background prefetching. With remote storage, these caches improve performance by reducing the number of network requests and by prefetching data that is likely to be read next.

## Capabilities

### Base Cache Class

Abstract base class that defines the interface for all caching implementations.

```python { .api }
class BaseCache:
    """Base class for caching implementations."""

    def __init__(self, blocksize, fetcher, size, **kwargs):
        """
        Initialize cache.

        Parameters:
        - blocksize: int, size of cache blocks in bytes
        - fetcher: callable, function to fetch data; called as fetcher(start, end)
        - size: int, total size in bytes of the cached object
        - **kwargs: additional cache-specific options
        """

    def _fetch(self, start, end):
        """
        Fetch data range.

        Parameters:
        - start: int, start byte offset
        - end: int, end byte offset

        Returns:
        bytes, fetched data
        """

    def _read_cache(self, start, end):
        """
        Read from cache if available.

        Parameters:
        - start: int, start byte offset
        - end: int, end byte offset

        Returns:
        bytes or None, cached data or None if not cached
        """
```
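
To make the constructor contract concrete, the following minimal sketch pairs a hypothetical fetcher with a toy pass-through cache. `CountingFetcher` and `PassThroughCache` are illustrative names, not part of the library; the sketch only assumes that the fetcher is called with a start and end byte offset and returns the bytes in that range.

```python
class CountingFetcher:
    """Hypothetical fetcher: serves byte ranges from an in-memory payload."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.calls = 0  # number of simulated remote requests

    def __call__(self, start: int, end: int) -> bytes:
        self.calls += 1
        return self.payload[start:end]


class PassThroughCache:
    """Toy cache with the (blocksize, fetcher, size) constructor shape."""

    def __init__(self, blocksize: int, fetcher, size: int):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.size = size

    def _fetch(self, start: int, end: int) -> bytes:
        # Clamp the request to the object bounds, then delegate to the fetcher.
        start = max(start, 0)
        end = min(end, self.size)
        if start >= end:
            return b""
        return self.fetcher(start, end)


payload = bytes(range(256))
fetcher = CountingFetcher(payload)
cache = PassThroughCache(blocksize=64, fetcher=fetcher, size=len(payload))
first = cache._fetch(10, 20)       # ordinary range
past_end = cache._fetch(250, 400)  # clamped to the object size
```

Every call reaches the fetcher here; the concrete caches differ in how they avoid that round trip.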

### Memory-Mapped Cache

Stores fetched blocks in a memory-mapped local file, which is particularly useful for large files with random access patterns.

```python { .api }
class MMapCache(BaseCache):
    """Memory-mapped file cache for efficient random access."""

    def __init__(self, blocksize, fetcher, size, location=None, blocks=None):
        """
        Initialize memory-mapped cache.

        Parameters:
        - blocksize: int, size of cache blocks
        - fetcher: callable, function to fetch data
        - size: int, total size of cached object
        - location: str, local file path for the memory-mapped file;
          a temporary file is used if None
        - blocks: set, block numbers already present in the cache file;
          treated as empty if None
        """
```
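
The idea can be sketched with the standard library alone. The class below is a toy illustration under stated assumptions, not the library's implementation: `ToyMMapCache` is a hypothetical name, the backing file is always a temporary file, and the set of fetched block numbers is kept only in memory.

```python
import mmap
import tempfile


class ToyMMapCache:
    """Toy block cache backed by a memory-mapped temporary file."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.size = size
        self.fetched = set()  # block numbers already written to the mapping
        self._file = tempfile.TemporaryFile()
        self._file.truncate(size)  # backing file as large as the remote object
        self._map = mmap.mmap(self._file.fileno(), size)

    def _fetch(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.blocksize
        last = (end - 1) // self.blocksize
        for n in range(first, last + 1):
            if n not in self.fetched:
                lo = n * self.blocksize
                hi = min(lo + self.blocksize, self.size)
                self._map[lo:hi] = self.fetcher(lo, hi)  # fill the missing block
                self.fetched.add(n)
        return self._map[start:end]

    def close(self):
        self._map.close()
        self._file.close()


data = bytes(i % 251 for i in range(100))
requests = []


def fetcher(start, end):
    requests.append((start, end))
    return data[start:end]


cache = ToyMMapCache(blocksize=16, fetcher=fetcher, size=len(data))
part = cache._fetch(40, 50)   # touches blocks 2 and 3
again = cache._fetch(33, 60)  # same blocks, served from the mapping
cache.close()
```

Because blocks live in a file rather than on the Python heap, the operating system decides which pages stay resident, which is why this layout suits large files.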

### Read-Ahead Cache

Implements a read-ahead strategy: each fetch reads beyond the requested range, so that subsequent sequential reads are served from the buffer. The cache holds a single buffer and takes no `maxblocks` option.

```python { .api }
class ReadAheadCache(BaseCache):
    """Read-ahead cache optimized for sequential access patterns."""

    def __init__(self, blocksize, fetcher, size):
        """
        Initialize read-ahead cache.

        Parameters:
        - blocksize: int, size of cache blocks
        - fetcher: callable, function to fetch data
        - size: int, total size of cached object
        """
```
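
A minimal sketch of the read-ahead idea follows. `ToyReadAhead` is a hypothetical class written for illustration; it assumes a single contiguous buffer and fetches one extra block past each missed request, which may differ in detail from the library's behavior.

```python
class ToyReadAhead:
    """Toy read-ahead cache holding one contiguous buffer."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.size = size
        self.start = 0  # offset of the first buffered byte
        self.end = 0    # offset one past the last buffered byte
        self.buffer = b""

    def _fetch(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        if self.start <= start and end <= self.end:
            # Hit: the whole request lies inside the buffer.
            return self.buffer[start - self.start:end - self.start]
        # Miss: fetch the request plus one block of read-ahead.
        fetch_end = min(self.size, end + self.blocksize)
        self.buffer = self.fetcher(start, fetch_end)
        self.start, self.end = start, fetch_end
        return self.buffer[:end - start]


data = bytes(i % 251 for i in range(1000))
requests = []


def fetcher(start, end):
    requests.append((start, end))
    return data[start:end]


cache = ToyReadAhead(blocksize=100, fetcher=fetcher, size=len(data))
# Ten sequential 10-byte reads are satisfied by a single remote request.
chunks = [cache._fetch(i, i + 10) for i in range(0, 100, 10)]
```

A backward seek or a jump past the buffer discards it, which is why this strategy is a poor fit for random access.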

### Block Cache

LRU-based block caching with a configurable maximum number of cached blocks.

```python { .api }
class BlockCache(BaseCache):
    """Block-based cache with LRU eviction policy."""

    def __init__(self, blocksize, fetcher, size, maxblocks=32):
        """
        Initialize block cache.

        Parameters:
        - blocksize: int, size of cache blocks
        - fetcher: callable, function to fetch data
        - size: int, total size of cached object
        - maxblocks: int, maximum number of blocks to keep in cache
        """
```
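
The LRU policy can be sketched with an `OrderedDict`. `ToyBlockCache` is a hypothetical class for illustration only; it assumes blocks are aligned to multiples of `blocksize` and that the least recently used block is evicted once more than `maxblocks` are held.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy LRU cache of fixed-size, block-aligned byte ranges."""

    def __init__(self, blocksize, fetcher, size, maxblocks=32):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.size = size
        self.maxblocks = maxblocks
        self.blocks = OrderedDict()  # block number -> bytes, oldest first

    def _get_block(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)  # mark as most recently used
            return self.blocks[n]
        lo = n * self.blocksize
        block = self.fetcher(lo, min(lo + self.blocksize, self.size))
        self.blocks[n] = block
        if len(self.blocks) > self.maxblocks:
            self.blocks.popitem(last=False)  # evict least recently used
        return block

    def _fetch(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.blocksize
        last = (end - 1) // self.blocksize
        joined = b"".join(self._get_block(n) for n in range(first, last + 1))
        offset = start - first * self.blocksize
        return joined[offset:offset + (end - start)]


data = bytes(range(100))
fetched_blocks = []


def fetcher(start, end):
    fetched_blocks.append(start // 10)
    return data[start:end]


cache = ToyBlockCache(blocksize=10, fetcher=fetcher, size=len(data), maxblocks=2)
a = cache._fetch(0, 5)    # miss: loads block 0
b = cache._fetch(12, 18)  # miss: loads block 1
c = cache._fetch(3, 7)    # hit: block 0 becomes most recently used
d = cache._fetch(25, 28)  # miss: loads block 2 and evicts block 1
```

After the last call the cache holds blocks 0 and 2, because block 1 was the least recently used when the limit was exceeded.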

### Bytes Cache

Simple in-memory cache that holds file contents in a single bytes object. For small files, one fetch usually covers the whole file.

```python { .api }
class BytesCache(BaseCache):
    """In-memory bytes cache for small files."""

    def __init__(self, blocksize, fetcher, size, trim=True):
        """
        Initialize bytes cache.

        Parameters:
        - blocksize: int, size of cache blocks
        - fetcher: callable, function to fetch data
        - size: int, total size of cached object
        - trim: bool, whether to discard the start of the buffer as
          reading moves more than a blocksize beyond it
        """
```

### Background Block Cache

Advanced block cache with background prefetching for improved performance with predictable access patterns.

```python { .api }
class BackgroundBlockCache(BaseCache):
    """Block cache with background prefetching capabilities."""

    def __init__(self, blocksize, fetcher, size, maxblocks=32):
        """
        Initialize background block cache.

        Parameters:
        - blocksize: int, size of cache blocks
        - fetcher: callable, function to fetch data
        - size: int, total size of cached object
        - maxblocks: int, maximum number of blocks to cache
        """
```
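
Background prefetching can be illustrated with a single worker thread. `ToyBackgroundCache` is a hypothetical sketch, not the library's implementation: it assumes that only the next block is prefetched, and it omits eviction for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyBackgroundCache:
    """Toy block cache that prefetches the next block in a worker thread."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize
        self.fetcher = fetcher
        self.size = size
        self.nblocks = -(-size // blocksize)  # ceiling division
        self.blocks = {}   # block number -> bytes
        self.pending = {}  # block number -> Future of an in-flight prefetch
        self.pool = ThreadPoolExecutor(max_workers=1)

    def _load(self, n):
        lo = n * self.blocksize
        return self.fetcher(lo, min(lo + self.blocksize, self.size))

    def get_block(self, n):
        if n in self.pending:
            # A prefetch is in flight: wait for it instead of refetching.
            self.blocks[n] = self.pending.pop(n).result()
        elif n not in self.blocks:
            self.blocks[n] = self._load(n)
        nxt = n + 1
        if nxt < self.nblocks and nxt not in self.blocks and nxt not in self.pending:
            self.pending[nxt] = self.pool.submit(self._load, nxt)
        return self.blocks[n]

    def close(self):
        self.pool.shutdown(wait=True)


data = bytes(range(35))
requests = []


def fetcher(start, end):
    requests.append(start)
    return data[start:end]


cache = ToyBackgroundCache(blocksize=10, fetcher=fetcher, size=len(data))
blocks_read = [cache.get_block(n) for n in range(cache.nblocks)]
cache.close()
```

While the caller processes block `n`, the worker is already fetching block `n + 1`, so sequential consumers spend less time waiting on the network.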

### Cache Registry

Dictionary of available cache implementations that can be selected by name.

```python { .api }
caches: dict
"""
Mapping of cache names to cache classes.

Available caches:
- 'mmap': MMapCache
- 'readahead': ReadAheadCache
- 'blockcache': BlockCache
- 'bytes': BytesCache
- 'background': BackgroundBlockCache
"""
```

## Usage Patterns

### Specifying Cache Type in File Opening

```python
import fsspec

# Extra keyword arguments to fsspec.open() are treated as filesystem
# storage options, so per-file cache settings go to the filesystem's open()
fs = fsspec.filesystem('s3')

# Use a specific cache type when opening a file
with fs.open('bucket/large-file.dat', 'rb', cache_type='mmap') as f:
    # File uses memory-mapped caching
    data = f.read(1024)

# Use block cache with custom parameters; cache-specific options
# such as maxblocks are passed through cache_options
with fs.open('bucket/file.dat', 'rb',
             cache_type='blockcache',
             block_size=1024*1024,
             cache_options={'maxblocks': 64}) as f:
    data = f.read()
```

### Cache Configuration for Different Access Patterns

```python
import json

import fsspec

fs = fsspec.filesystem('s3')

# Sequential reading - use read-ahead cache
with fs.open('bucket/log-file.txt', 'rb',
             cache_type='readahead',
             block_size=64*1024) as f:
    for line in f:
        process_line(line)  # application-defined; each line is a bytes object

# Random access - use memory-mapped cache
with fs.open('bucket/database.dat', 'rb',
             cache_type='mmap',
             block_size=4096) as f:
    # Jump to different positions efficiently
    f.seek(1000000)
    data1 = f.read(100)
    f.seek(5000000)
    data2 = f.read(100)

# Small files - use bytes cache
with fs.open('bucket/config.json', 'rb',
             cache_type='bytes') as f:
    config = json.load(f)
```

### Background Prefetching

```python
import fsspec

fs = fsspec.filesystem('s3')
path = 'bucket/time-series.dat'
file_size = fs.size(path)
chunk_size = 1024 * 1024  # read in chunks that match the block size

# Use background cache for predictable access patterns
with fs.open(path, 'rb',
             cache_type='background',
             block_size=1024*1024,
             cache_options={'maxblocks': 16}) as f:
    # Cache will prefetch subsequent blocks in background
    for i in range(0, file_size, chunk_size):
        f.seek(i)
        chunk = f.read(chunk_size)
        process_chunk(chunk)  # application-defined
```

### Filesystem-Level Cache Configuration

```python
import fsspec

# Configure caching at filesystem level
# (default_cache_type and default_block_size are options of the S3 backend)
s3 = fsspec.filesystem('s3',
                       key='ACCESS_KEY',
                       secret='SECRET_KEY',
                       default_cache_type='blockcache',
                       default_block_size=1024*1024)

# All files opened through this filesystem use the cache settings
with s3.open('bucket/file1.dat') as f:
    data1 = f.read()

with s3.open('bucket/file2.dat') as f:
    data2 = f.read()
```

### Cache Performance Tuning

```python
import json

import fsspec

fs = fsspec.filesystem('s3')

# Tune cache parameters for specific workloads

# Large files with sequential access
large_file_cache = {
    'cache_type': 'readahead',
    'block_size': 8 * 1024 * 1024,  # 8MB blocks; readahead takes no maxblocks
}

# Database-like files with random access
random_access_cache = {
    'cache_type': 'mmap',
    'block_size': 64 * 1024,  # 64KB blocks; mmap takes no maxblocks
}

# Many small files
small_files_cache = {
    'cache_type': 'bytes',  # One fetch usually covers the whole file
}

# Open files with appropriate cache settings
with fs.open('bucket/large.dat', 'rb', **large_file_cache) as f:
    process_large_file(f)  # application-defined

with fs.open('bucket/index.db', 'rb', **random_access_cache) as f:
    lookup_data(f)  # application-defined

with fs.open('bucket/config.json', 'rb', **small_files_cache) as f:
    config = json.load(f)
```
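
For the block-based caches that accept `maxblocks`, the amount of data held is bounded by `block_size * maxblocks`. The helpers below are a small worked example of that arithmetic; `cache_budget_bytes` and `maxblocks_for_budget` are hypothetical names introduced here, not library functions.

```python
MiB = 1024 * 1024


def cache_budget_bytes(block_size: int, maxblocks: int) -> int:
    """Upper bound on cached data for a block cache, in bytes."""
    return block_size * maxblocks


def maxblocks_for_budget(block_size: int, budget_bytes: int) -> int:
    """Largest maxblocks that fits the budget (always at least one block)."""
    return max(1, budget_bytes // block_size)


print(cache_budget_bytes(8 * MiB, 4) // MiB)      # 32
print(cache_budget_bytes(64 * 1024, 256) // MiB)  # 16
print(maxblocks_for_budget(1 * MiB, 64 * MiB))    # 64
```

Larger blocks mean fewer requests but more wasted transfer on small random reads, so the same memory budget can be split very differently depending on the workload.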

### Monitoring Cache Performance

```python
import fsspec

fs = fsspec.filesystem('s3')

# Access cache statistics (implementation-dependent)
with fs.open('bucket/file.dat', 'rb', cache_type='blockcache') as f:
    # Perform operations
    data = f.read(1024*1024)

    # Some cache implementations provide statistics
    if hasattr(f.cache, 'hit_count'):
        hits, misses = f.cache.hit_count, f.cache.miss_count
        print(f"Cache hits: {hits}")
        print(f"Cache misses: {misses}")
        if hits + misses:  # avoid dividing by zero before any request
            print(f"Hit ratio: {hits / (hits + misses):.2%}")
```
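
When the same statistics are needed in several places, the ratio can be factored into a helper that tolerates an unused cache. This is a sketch: `hit_ratio` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a cache that exposes `hit_count` and `miss_count`.

```python
from types import SimpleNamespace


def hit_ratio(cache) -> float:
    """Return hits / (hits + misses), or 0.0 if no statistics are available."""
    hits = getattr(cache, "hit_count", 0)
    misses = getattr(cache, "miss_count", 0)
    total = hits + misses
    return hits / total if total else 0.0


# Stand-in objects; a real cache object would be passed in their place.
busy = SimpleNamespace(hit_count=30, miss_count=10)
unused = SimpleNamespace(hit_count=0, miss_count=0)
no_stats = SimpleNamespace()

print(hit_ratio(busy))      # 0.75
print(hit_ratio(unused))    # 0.0
print(hit_ratio(no_stats))  # 0.0
```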

### Combining with Compression

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem('s3')

# Caching works with compression
with fs.open('bucket/data.csv.gz', 'rb',
             compression='gzip',
             cache_type='readahead',
             block_size=1024*1024) as f:
    # Compressed data is cached, decompression happens after cache
    df = pd.read_csv(f)
```

### Cache Location Control

```python
import os
import tempfile

import fsspec

fs = fsspec.filesystem('s3')

# Control where the memory-mapped cache file is stored
cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, 'large-file.cache')

with fs.open('bucket/large-file.dat', 'rb',
             cache_type='mmap',
             cache_options={'location': cache_file}) as f:
    # Memory-mapped cache file stored at cache_file
    data = f.read()

# The cache file persists after closing.
# Reusing its contents on a later open also requires passing the set of
# already-fetched block numbers through the `blocks` option.
```

### Cache Invalidation

```python
import fsspec

# Per-file data caches belong to the open file object and are released
# when it is closed; the calls below clear filesystem-level caches
fs = fsspec.filesystem('s3')

# Discard cached directory listings for a specific path
fs.invalidate_cache('bucket/file.dat')

# Discard all cached directory listings for this filesystem
fs.invalidate_cache()

# Clear cached filesystem instances of this class (nuclear option)
type(fs).clear_instance_cache()
```

## Cache Selection Guidelines

### By Access Pattern

- **Sequential Reading**: `ReadAheadCache` - Reads ahead of each request automatically
- **Random Access**: `MMapCache` - Efficient memory mapping for jumping around
- **Mixed Access**: `BlockCache` - Good general-purpose LRU cache
- **One-time Read**: `BytesCache` - Simple for small files read once
- **Predictable Patterns**: `BackgroundBlockCache` - Prefetches upcoming blocks in the background

### By File Size

- **Small files (<1MB)**: `BytesCache` - Holds the file contents in memory
- **Medium files (1MB-100MB)**: `BlockCache` or `ReadAheadCache`
- **Large files (>100MB)**: `MMapCache` for random access, `ReadAheadCache` for sequential

### By Network Conditions

- **High latency**: Larger block sizes, more aggressive prefetching
- **Low bandwidth**: Smaller block sizes, conservative caching
- **Reliable connection**: `BackgroundBlockCache` for background prefetching
- **Unreliable connection**: `BlockCache` with smaller blocks for retry resilience
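
The access-pattern and file-size guidelines above can be folded into a small selection helper. This is one possible reading of those rules rather than a library feature: `suggest_cache_type` and its pattern labels are hypothetical, and the thresholds are the rough 1MB and 100MB boundaries from the lists.

```python
MB = 1024 * 1024


def suggest_cache_type(size_bytes: int, pattern: str) -> str:
    """Map a file size and access pattern to a cache registry name.

    pattern is one of 'sequential', 'random', 'mixed', 'predictable'.
    """
    if pattern == "predictable":
        return "background"
    if size_bytes < 1 * MB:
        return "bytes"  # small file: keep the contents in memory
    if pattern == "sequential":
        return "readahead"
    if pattern == "random":
        # Large random-access files favor memory mapping.
        return "mmap" if size_bytes > 100 * MB else "blockcache"
    return "blockcache"  # general-purpose default for mixed access


print(suggest_cache_type(200 * 1024, "random"))    # bytes
print(suggest_cache_type(500 * MB, "sequential"))  # readahead
print(suggest_cache_type(500 * MB, "random"))      # mmap
print(suggest_cache_type(20 * MB, "mixed"))        # blockcache
```

Network conditions are left out of the helper because they mainly affect block size and prefetch depth rather than the choice of cache class.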