# Optimization and Maintenance

Performance optimization and table maintenance operations for Delta Lake, including file compaction, Z-ordering, vacuum operations, and data layout optimization. These APIs provide fine-grained control over storage efficiency and query performance.

## Capabilities

### Optimize Operations

Improve query performance through data layout optimization.

```python { .api }
class DeltaTable:
    def optimize(self) -> DeltaOptimizeBuilder:
        """
        Create an optimize builder for data layout optimization.

        Returns:
            DeltaOptimizeBuilder for configuring optimization operations
        """
```

```scala { .api }
class DeltaTable {
  def optimize(): DeltaOptimizeBuilder
}
```

### Optimize Configuration

Configure optimization scope and execution.

```python { .api }
class DeltaOptimizeBuilder:
    def where(self, partitionFilter: str) -> DeltaOptimizeBuilder:
        """
        Apply a partition filter to limit optimization scope.

        Parameters:
        - partitionFilter: SQL condition to filter partitions for optimization

        Returns:
            DeltaOptimizeBuilder for method chaining
        """

    def executeCompaction(self) -> DataFrame:
        """
        Execute file compaction to reduce small-file overhead.

        Returns:
            DataFrame with optimization metrics, including files compacted and size changes
        """

    def executeZOrderBy(self, *cols: str) -> DataFrame:
        """
        Execute Z-order optimization for improved query performance.

        Parameters:
        - cols: Column names to use for Z-order clustering

        Returns:
            DataFrame with Z-order optimization metrics
        """
```

```scala { .api }
class DeltaOptimizeBuilder {
  def where(partitionFilter: String): DeltaOptimizeBuilder
  def executeCompaction(): DataFrame
  def executeZOrderBy(cols: String*): DataFrame
}
```

### Vacuum Operations

Clean up unused files and optimize storage usage.

```python { .api }
class DeltaTable:
    def vacuum(self, retentionHours: Optional[float] = None) -> DataFrame:
        """
        Remove files no longer referenced by the table.

        Parameters:
        - retentionHours: Hours to retain files (default: 168 hours / 7 days)

        Returns:
            DataFrame with vacuum operation results
        """
```

```scala { .api }
class DeltaTable {
  def vacuum(): DataFrame
  def vacuum(retentionHours: Double): DataFrame
}
```
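
Vacuuming with a retention shorter than the seven-day default is blocked by a safety check. A minimal sketch of relaxing that check via the `spark.databricks.delta.retentionDurationCheck.enabled` configuration; `spark` and `delta_table` are assumed to exist as in the examples below:

```python
# Sketch: vacuum with a retention shorter than the 168-hour default.
# Delta refuses short retention unless the safety check is disabled;
# only do this when no readers or jobs depend on older snapshots.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

vacuum_result = delta_table.vacuum(24)  # retain 24 hours of history

# Re-enable the check once the one-off vacuum is done.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
```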

### Manifest Generation

Generate metadata manifests for external system integration.

```python { .api }
class DeltaTable:
    def generate(self, mode: str) -> None:
        """
        Generate manifests for external system compatibility.

        Parameters:
        - mode: Manifest type ("symlink_format_manifest" for Presto/Athena)
        """
```

```scala { .api }
class DeltaTable {
  def generate(mode: String): Unit
}
```
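
Manifests go stale as soon as the table changes. A short sketch of keeping them in sync automatically using the documented `delta.compatibility.symlinkFormatManifest.enabled` table property; the table path is illustrative:

```python
# Sketch: regenerate the symlink manifest on every write so that
# Presto/Athena always see the latest table snapshot.
spark.sql("""
    ALTER TABLE delta.`/path/to/table`
    SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")
```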

## Usage Examples

### File Compaction

```python
# Compact all partitions
optimize_result = delta_table.optimize().executeCompaction()
optimize_result.show()

# Compact specific partitions
optimize_result = (delta_table.optimize()
    .where("date >= '2023-01-01' AND region = 'us-west'")
    .executeCompaction())

# View optimization metrics
optimize_result.select(
    "path",
    "metrics.numFilesAdded",
    "metrics.numFilesRemoved",
    "metrics.filesAdded.totalSize",
    "metrics.filesRemoved.totalSize"
).show()
```

### Z-Order Optimization

```python
# Z-order by commonly filtered columns
zorder_result = (delta_table.optimize()
    .executeZOrderBy("customer_id", "transaction_date"))

# Z-order specific partitions
zorder_result = (delta_table.optimize()
    .where("year = 2023")
    .executeZOrderBy("customer_id", "product_category"))

zorder_result.show()
```

### Vacuum Operations

```python
# Vacuum with default retention (7 days)
vacuum_result = delta_table.vacuum()
vacuum_result.show()

# Vacuum with custom retention period (retention below the 168-hour default
# requires disabling spark.databricks.delta.retentionDurationCheck.enabled)
vacuum_result = delta_table.vacuum(24)  # 1 day retention

# Dry run to see what would be deleted (use SQL)
spark.sql("VACUUM delta.`/path/to/table` RETAIN 168 HOURS DRY RUN").show()
```

### Manifest Generation

```python
# Generate symlink manifest for Presto/Athena
delta_table.generate("symlink_format_manifest")

# Check generated manifest files
manifest_path = "/path/to/table/_symlink_format_manifest"
dbutils.fs.ls(manifest_path)  # In Databricks; use a Hadoop FileSystem listing elsewhere
```

### Comprehensive Maintenance Workflow

```python
from pyspark.sql.functions import col

# 1. Analyze table statistics before optimization
pre_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_before = pre_stats.select("numFiles").collect()[0][0]

print(f"Files before optimization: {file_count_before}")

# 2. Optimize frequently queried partitions
recent_partitions = (delta_table.optimize()
    .where("date >= current_date() - interval 30 days")
    .executeZOrderBy("customer_id", "product_id"))

# 3. Compact older partitions
older_partitions = (delta_table.optimize()
    .where("date < current_date() - interval 30 days")
    .executeCompaction())

# 4. Update table statistics (path-based ANALYZE support varies by Spark version)
spark.sql(f"ANALYZE TABLE delta.`{table_path}` COMPUTE STATISTICS")

# 5. Clean up old files
vacuum_result = delta_table.vacuum(168)  # 7 days

# 6. Generate manifests for external access
delta_table.generate("symlink_format_manifest")

# 7. Check final statistics
post_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_after = post_stats.select("numFiles").collect()[0][0]

print(f"Files after optimization: {file_count_after}")
print(f"File reduction: {file_count_before - file_count_after}")
```

### Performance Monitoring

```python
from pyspark.sql.functions import col

# Monitor optimize operations over time
optimize_history = (delta_table.history()
    .filter(col("operation") == "OPTIMIZE")
    .select(
        "timestamp",
        "operationParameters.predicate",
        col("operationMetrics.numAddedFiles").alias("files_added"),
        col("operationMetrics.numRemovedFiles").alias("files_removed"),
        col("operationMetrics.numAddedBytes").alias("bytes_added"),
        col("operationMetrics.numRemovedBytes").alias("bytes_removed")
    ))

optimize_history.show(truncate=False)

# Check vacuum history (Delta logs VACUUM START and VACUUM END operations;
# deletion metrics are recorded on VACUUM END)
vacuum_history = (delta_table.history()
    .filter(col("operation") == "VACUUM END")
    .select(
        "timestamp",
        col("operationMetrics.numDeletedFiles").alias("deleted_files"),
        col("operationMetrics.numVacuumedDirectories").alias("vacuumed_dirs")
    ))

vacuum_history.show()
```

## Optimization Best Practices

### File Compaction
- Run regularly to prevent small file accumulation
- Target 128 MB-1 GB file sizes for optimal performance (see the sketch below)
- Focus on frequently accessed partitions
- Monitor file count vs. query performance
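
The compaction target is tunable per session. A minimal sketch, assuming the documented `spark.databricks.delta.optimize.maxFileSize` configuration (default 1 GB):

```python
# Sketch: tune the size OPTIMIZE bin-packs toward before compacting.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)  # 256 MB

delta_table.optimize().executeCompaction()
```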

### Z-Order Optimization
- Choose columns commonly used in WHERE clauses
- Limit to 3-4 columns for best effectiveness
- Reorder periodically as data patterns change
- Consider column cardinality and query patterns (a quick check is sketched below)
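
Z-ordering pays off most on high-cardinality columns used in selective filters. A hedged sketch of a quick cardinality check to guide column choice; the table path and column names are illustrative:

```python
from pyspark.sql.functions import approx_count_distinct

# Sketch: estimate cardinality of candidate Z-order columns.
# Very low-cardinality columns (e.g., a boolean flag) rarely benefit.
df = spark.read.format("delta").load("/path/to/table")
df.select(
    approx_count_distinct("customer_id").alias("customer_id_card"),
    approx_count_distinct("product_category").alias("product_category_card"),
).show()
```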

### Vacuum Operations
- Balance retention with storage costs
- Coordinate with time travel requirements (see the retention sketch below)
- Run during low-activity periods
- Monitor storage reclaimed vs. files deleted
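
Time travel only reaches as far back as vacuum retention allows. A minimal sketch, assuming the documented `delta.deletedFileRetentionDuration` and `delta.logRetentionDuration` table properties:

```python
# Sketch: align vacuum retention with time-travel needs via table properties.
# deletedFileRetentionDuration bounds VACUUM; logRetentionDuration bounds history.
spark.sql("""
    ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (
        delta.deletedFileRetentionDuration = 'interval 14 days',
        delta.logRetentionDuration = 'interval 30 days'
    )
""")
```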

### Partition Strategy
- Partition by commonly filtered columns
- Avoid over-partitioning (aim for 1 GB+ per partition)
- Consider date-based partitioning for time-series data (sketched below)
- Use OPTIMIZE with partition filters for efficiency
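
A minimal sketch of date-based partitioning on write; the DataFrame `events_df` and the column name are illustrative:

```python
# Sketch: write a time-series table partitioned by date, so that partition
# pruning and OPTIMIZE with a partition filter stay cheap.
(events_df
    .write
    .format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/path/to/table"))
```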

## Optimization Metrics

Common metrics returned by optimization operations (in the `metrics` struct of the result DataFrame):

- `numFilesAdded` / `numFilesRemoved`: File count changes
- `filesAdded.totalSize` / `filesRemoved.totalSize`: Size changes in bytes
- `numBatches`: Number of optimization batches executed
- `totalConsideredFiles` / `totalFilesSkipped`: Files evaluated for vs. excluded from optimization
- `zOrderStats`: Z-order clustering effectiveness metrics
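
A short sketch of pulling these metrics back into plain Python, using the field names listed above:

```python
# Sketch: read optimize metrics into Python for logging or alerting.
result = delta_table.optimize().executeCompaction()
metrics = result.select("metrics").collect()[0]["metrics"]

print(f"files added:   {metrics['numFilesAdded']}")
print(f"files removed: {metrics['numFilesRemoved']}")
print(f"batches:       {metrics['numBatches']}")
```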