# Optimization and Maintenance

Performance optimization and table maintenance operations for Delta Lake, including file compaction, Z-ordering, vacuum operations, and data layout optimization. These APIs provide fine-grained control over storage efficiency and query performance.

## Capabilities

### Optimize Operations

Improve query performance through data layout optimization.

```python { .api }
class DeltaTable:
    def optimize(self) -> DeltaOptimizeBuilder:
        """
        Create an optimize builder for data layout optimization.

        Returns:
            DeltaOptimizeBuilder for configuring optimization operations
        """
```

```scala { .api }
class DeltaTable {
  def optimize(): DeltaOptimizeBuilder
}
```

### Optimize Configuration

Configure optimization scope and execution.

```python { .api }
class DeltaOptimizeBuilder:
    def where(self, partitionFilter: str) -> DeltaOptimizeBuilder:
        """
        Apply a partition filter to limit optimization scope.

        Parameters:
        - partitionFilter: SQL condition to filter partitions for optimization

        Returns:
            DeltaOptimizeBuilder for method chaining
        """

    def executeCompaction(self) -> DataFrame:
        """
        Execute file compaction to reduce small-file overhead.

        Returns:
            DataFrame with optimization metrics, including files compacted and size changes
        """

    def executeZOrderBy(self, *cols: str) -> DataFrame:
        """
        Execute Z-order optimization for improved query performance.

        Parameters:
        - cols: Column names to use for Z-order clustering

        Returns:
            DataFrame with Z-order optimization metrics
        """
```

```scala { .api }
class DeltaOptimizeBuilder {
  def where(partitionFilter: String): DeltaOptimizeBuilder
  def executeCompaction(): DataFrame
  def executeZOrderBy(cols: String*): DataFrame
}
```

### Vacuum Operations

Clean up unused files and optimize storage usage.

```python { .api }
class DeltaTable:
    def vacuum(self, retentionHours: Optional[float] = None) -> DataFrame:
        """
        Remove files no longer referenced by the table.

        Parameters:
        - retentionHours: Hours to retain files (default: 168 hours / 7 days)

        Returns:
            DataFrame with vacuum operation results
        """
```

```scala { .api }
class DeltaTable {
  def vacuum(): DataFrame
  def vacuum(retentionHours: Double): DataFrame
}
```
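
Vacuuming with a retention shorter than the seven-day default is blocked by a safety check. A minimal sketch of relaxing that check via the `spark.databricks.delta.retentionDurationCheck.enabled` configuration; `spark` and `delta_table` are assumed to exist as in the examples below:

```python
# Sketch: vacuum with a retention shorter than the 168-hour default.
# Delta refuses short retention unless the safety check is disabled;
# only do this when no readers or jobs depend on older snapshots.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

vacuum_result = delta_table.vacuum(24)  # retain 24 hours of history

# Re-enable the check once the one-off vacuum is done.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
```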

### Manifest Generation

Generate metadata manifests for external system integration.

```python { .api }
class DeltaTable:
    def generate(self, mode: str) -> None:
        """
        Generate manifests for external system compatibility.

        Parameters:
        - mode: Manifest type ("symlink_format_manifest" for Presto/Athena)
        """
```

```scala { .api }
class DeltaTable {
  def generate(mode: String): Unit
}
```
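
Manifests go stale as soon as the table changes. A short sketch of keeping them in sync automatically using the documented `delta.compatibility.symlinkFormatManifest.enabled` table property; the table path is illustrative:

```python
# Sketch: regenerate the symlink manifest on every write so that
# Presto/Athena always see the latest table snapshot.
spark.sql("""
    ALTER TABLE delta.`/path/to/table`
    SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")
```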

## Usage Examples

### File Compaction

```python
# Compact all partitions
optimize_result = delta_table.optimize().executeCompaction()
optimize_result.show()

# Compact specific partitions
optimize_result = (delta_table.optimize()
    .where("date >= '2023-01-01' AND region = 'us-west'")
    .executeCompaction())

# View optimization metrics
optimize_result.select(
    "path",
    "metrics.numFilesAdded",
    "metrics.numFilesRemoved",
    "metrics.filesAdded.totalSize",
    "metrics.filesRemoved.totalSize"
).show()
```

### Z-Order Optimization

```python
# Z-order by commonly filtered columns
zorder_result = (delta_table.optimize()
    .executeZOrderBy("customer_id", "transaction_date"))

# Z-order specific partitions
zorder_result = (delta_table.optimize()
    .where("year = 2023")
    .executeZOrderBy("customer_id", "product_category"))

zorder_result.show()
```

### Vacuum Operations

```python
# Vacuum with default retention (7 days)
vacuum_result = delta_table.vacuum()
vacuum_result.show()

# Vacuum with custom retention period (retention below the 168-hour default
# requires disabling spark.databricks.delta.retentionDurationCheck.enabled)
vacuum_result = delta_table.vacuum(24)  # 1 day retention

# Dry run to see what would be deleted (use SQL)
spark.sql("VACUUM delta.`/path/to/table` RETAIN 168 HOURS DRY RUN").show()
```

### Manifest Generation

```python
# Generate symlink manifest for Presto/Athena
delta_table.generate("symlink_format_manifest")

# Check generated manifest files
manifest_path = "/path/to/table/_symlink_format_manifest"
dbutils.fs.ls(manifest_path)  # In Databricks; use a Hadoop FileSystem listing elsewhere
```

### Comprehensive Maintenance Workflow

```python
from pyspark.sql.functions import col

# 1. Analyze table statistics before optimization
pre_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_before = pre_stats.select("numFiles").collect()[0][0]

print(f"Files before optimization: {file_count_before}")

# 2. Optimize frequently queried partitions
recent_partitions = (delta_table.optimize()
    .where("date >= current_date() - interval 30 days")
    .executeZOrderBy("customer_id", "product_id"))

# 3. Compact older partitions
older_partitions = (delta_table.optimize()
    .where("date < current_date() - interval 30 days")
    .executeCompaction())

# 4. Update table statistics (path-based ANALYZE support varies by Spark version)
spark.sql(f"ANALYZE TABLE delta.`{table_path}` COMPUTE STATISTICS")

# 5. Clean up old files
vacuum_result = delta_table.vacuum(168)  # 7 days

# 6. Generate manifests for external access
delta_table.generate("symlink_format_manifest")

# 7. Check final statistics
post_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_after = post_stats.select("numFiles").collect()[0][0]

print(f"Files after optimization: {file_count_after}")
print(f"File reduction: {file_count_before - file_count_after}")
```

### Performance Monitoring

```python
from pyspark.sql.functions import col

# Monitor optimize operations over time
optimize_history = (delta_table.history()
    .filter(col("operation") == "OPTIMIZE")
    .select(
        "timestamp",
        "operationParameters.predicate",
        col("operationMetrics.numAddedFiles").alias("files_added"),
        col("operationMetrics.numRemovedFiles").alias("files_removed"),
        col("operationMetrics.numAddedBytes").alias("bytes_added"),
        col("operationMetrics.numRemovedBytes").alias("bytes_removed")
    ))

optimize_history.show(truncate=False)

# Check vacuum history (Delta logs VACUUM START and VACUUM END operations;
# deletion metrics are recorded on VACUUM END)
vacuum_history = (delta_table.history()
    .filter(col("operation") == "VACUUM END")
    .select(
        "timestamp",
        col("operationMetrics.numDeletedFiles").alias("deleted_files"),
        col("operationMetrics.numVacuumedDirectories").alias("vacuumed_dirs")
    ))

vacuum_history.show()
```

## Optimization Best Practices

### File Compaction
- Run regularly to prevent small file accumulation
- Target 128 MB-1 GB file sizes for optimal performance (see the sketch below)
- Focus on frequently accessed partitions
- Monitor file count vs. query performance
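
The compaction target is tunable per session. A minimal sketch, assuming the documented `spark.databricks.delta.optimize.maxFileSize` configuration (default 1 GB):

```python
# Sketch: tune the size OPTIMIZE bin-packs toward before compacting.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)  # 256 MB

delta_table.optimize().executeCompaction()
```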

### Z-Order Optimization
- Choose columns commonly used in WHERE clauses
- Limit to 3-4 columns for best effectiveness
- Reorder periodically as data patterns change
- Consider column cardinality and query patterns (a quick check is sketched below)
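
Z-ordering pays off most on high-cardinality columns used in selective filters. A hedged sketch of a quick cardinality check to guide column choice; the table path and column names are illustrative:

```python
from pyspark.sql.functions import approx_count_distinct

# Sketch: estimate cardinality of candidate Z-order columns.
# Very low-cardinality columns (e.g., a boolean flag) rarely benefit.
df = spark.read.format("delta").load("/path/to/table")
df.select(
    approx_count_distinct("customer_id").alias("customer_id_card"),
    approx_count_distinct("product_category").alias("product_category_card"),
).show()
```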

### Vacuum Operations
- Balance retention with storage costs
- Coordinate with time travel requirements (see the retention sketch below)
- Run during low-activity periods
- Monitor storage reclaimed vs. files deleted
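
Time travel only reaches as far back as vacuum retention allows. A minimal sketch, assuming the documented `delta.deletedFileRetentionDuration` and `delta.logRetentionDuration` table properties:

```python
# Sketch: align vacuum retention with time-travel needs via table properties.
# deletedFileRetentionDuration bounds VACUUM; logRetentionDuration bounds history.
spark.sql("""
    ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (
        delta.deletedFileRetentionDuration = 'interval 14 days',
        delta.logRetentionDuration = 'interval 30 days'
    )
""")
```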

### Partition Strategy
- Partition by commonly filtered columns
- Avoid over-partitioning (aim for 1 GB+ per partition)
- Consider date-based partitioning for time-series data (sketched below)
- Use OPTIMIZE with partition filters for efficiency
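
A minimal sketch of date-based partitioning on write; the DataFrame `events_df` and the column name are illustrative:

```python
# Sketch: write a time-series table partitioned by date, so that partition
# pruning and OPTIMIZE with a partition filter stay cheap.
(events_df
    .write
    .format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/path/to/table"))
```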

## Optimization Metrics

Common metrics returned by optimization operations (in the `metrics` struct of the result DataFrame):

- `numFilesAdded` / `numFilesRemoved`: File count changes
- `filesAdded.totalSize` / `filesRemoved.totalSize`: Size changes in bytes
- `numBatches`: Number of optimization batches executed
- `totalConsideredFiles` / `totalFilesSkipped`: Files evaluated for vs. excluded from optimization
- `zOrderStats`: Z-order clustering effectiveness metrics
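
A short sketch of pulling these metrics back into plain Python, using the field names listed above:

```python
# Sketch: read optimize metrics into Python for logging or alerting.
result = delta_table.optimize().executeCompaction()
metrics = result.select("metrics").collect()[0]["metrics"]

print(f"files added:   {metrics['numFilesAdded']}")
print(f"files removed: {metrics['numFilesRemoved']}")
print(f"batches:       {metrics['numBatches']}")
```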