# Optimization and Maintenance

Performance optimization and table maintenance operations for Delta Lake, including file compaction, Z-ordering, vacuum operations, and data layout optimization. These operations provide fine-grained control over storage efficiency and query performance.

## Capabilities

### Optimize Operations

Improve query performance through data layout optimization.

```python { .api }
class DeltaTable:
    def optimize(self) -> DeltaOptimizeBuilder:
        """
        Create optimize builder for data layout optimization.

        Returns:
            DeltaOptimizeBuilder for configuring optimization operations
        """
```

```scala { .api }
class DeltaTable {
  def optimize(): DeltaOptimizeBuilder
}
```

### Optimize Configuration

Configure optimization scope and execution.

30

31

```python { .api }

32

class DeltaOptimizeBuilder:

33

def where(self, partition_filter: str) -> DeltaOptimizeBuilder:

34

"""

35

Apply partition filter to limit optimization scope.

36

37

Parameters:

38

- partition_filter: SQL condition to filter partitions for optimization

39

40

Returns:

41

DeltaOptimizeBuilder for method chaining

42

"""

43

44

def executeCompaction(self) -> DataFrame:

45

"""

46

Execute file compaction to reduce small file overhead.

47

48

Returns:

49

DataFrame with optimization metrics including files compacted, size changes

50

"""

51

52

def executeZOrderBy(self, *cols: str) -> DataFrame:

53

"""

54

Execute Z-order optimization for improved query performance.

55

56

Parameters:

57

- cols: Column names to use for Z-order clustering

58

59

Returns:

60

DataFrame with Z-order optimization metrics

61

"""

62

```

```scala { .api }
class DeltaOptimizeBuilder {
  def where(partitionFilter: String): DeltaOptimizeBuilder
  def executeCompaction(): DataFrame
  def executeZOrderBy(cols: String*): DataFrame
}
```
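
The builder methods correspond to Delta's `OPTIMIZE` SQL command. A minimal sketch of the SQL equivalents, assuming standard Delta SQL support and a placeholder table path:

```python
# SQL equivalents of the builder calls above (path is a placeholder)
spark.sql("OPTIMIZE delta.`/path/to/table` WHERE date >= '2023-01-01'")
spark.sql("OPTIMIZE delta.`/path/to/table` ZORDER BY (customer_id)")
```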

### Vacuum Operations

Clean up unused files and optimize storage usage.

```python { .api }
class DeltaTable:
    def vacuum(self, retentionHours: Optional[float] = None) -> DataFrame:
        """
        Remove files no longer referenced by the table.

        Parameters:
        - retentionHours: Hours to retain files (default: 168 hours / 7 days)

        Returns:
            DataFrame with vacuum operation results
        """
```

```scala { .api }
class DeltaTable {
  def vacuum(): DataFrame
  def vacuum(retentionHours: Double): DataFrame
}
```

### Manifest Generation

Generate metadata manifests for external system integration.

```python { .api }
class DeltaTable:
    def generate(self, mode: str) -> None:
        """
        Generate manifests for external system compatibility.

        Parameters:
        - mode: Manifest type ("symlink_format_manifest" for Presto/Athena)
        """
```

```scala { .api }
class DeltaTable {
  def generate(mode: String): Unit
}
```

## Usage Examples

### File Compaction

```python
# Compact all partitions
optimize_result = delta_table.optimize().executeCompaction()
optimize_result.show()

# Compact specific partitions (the filter may reference partition columns only)
optimize_result = (delta_table.optimize()
    .where("date >= '2023-01-01' AND region = 'us-west'")
    .executeCompaction())

# View optimization metrics
optimize_result.select(
    "path",
    "metrics.numFilesAdded",
    "metrics.numFilesRemoved",
    "metrics.filesAdded.totalSize",
    "metrics.filesRemoved.totalSize"
).show()
```

### Z-Order Optimization

```python
# Z-order by commonly filtered columns
zorder_result = (delta_table.optimize()
    .executeZOrderBy("customer_id", "transaction_date"))

# Z-order specific partitions
zorder_result = (delta_table.optimize()
    .where("year = 2023")
    .executeZOrderBy("customer_id", "product_category"))

zorder_result.show()
```

### Vacuum Operations

```python
# Vacuum with default retention (7 days)
vacuum_result = delta_table.vacuum()
vacuum_result.show()

# Vacuum with custom retention period (1 day)
# Note: retention below 7 days is rejected unless the retention duration
# safety check is disabled (see the sketch below)
vacuum_result = delta_table.vacuum(retentionHours=24)

# Dry run to see what would be deleted (use SQL)
spark.sql("VACUUM delta.`/path/to/table` RETAIN 168 HOURS DRY RUN").show()
```
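
Retention periods shorter than the 7-day default are rejected by Delta's retention safety check, because they can break time travel and concurrent readers. A minimal sketch for overriding it deliberately, assuming the standard `spark.databricks.delta.retentionDurationCheck.enabled` configuration:

```python
# Deliberately allow retention shorter than the default safety threshold.
# Only do this when no reader or time-travel query needs the removed files.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
try:
    delta_table.vacuum(retentionHours=24)
finally:
    # Restore the safety check afterwards
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
```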

### Manifest Generation

```python
# Generate symlink manifest for Presto/Athena
delta_table.generate("symlink_format_manifest")

# Check generated manifest files
manifest_path = "/path/to/table/_symlink_format_manifest"
dbutils.fs.ls(manifest_path)  # In Databricks
```
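
Manifests generated this way become stale as soon as the table changes. A sketch of automating regeneration on every write, assuming the standard `delta.compatibility.symlinkFormatManifest.enabled` table property:

```python
# Keep the symlink manifest up to date automatically on each table write
# (table path is a placeholder)
spark.sql("""
    ALTER TABLE delta.`/path/to/table`
    SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")
```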

### Comprehensive Maintenance Workflow

```python
# Placeholder path for the table being maintained
table_path = "/path/to/table"

# 1. Analyze table statistics before optimization
pre_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_before = pre_stats.select("numFiles").collect()[0][0]

print(f"Files before optimization: {file_count_before}")

# 2. Optimize frequently queried partitions
# (the filter may reference partition columns only)
recent_partitions = (delta_table.optimize()
    .where("date >= current_date() - interval 30 days")
    .executeZOrderBy("customer_id", "product_id"))

# 3. Compact older partitions
older_partitions = (delta_table.optimize()
    .where("date < current_date() - interval 30 days")
    .executeCompaction())

# 4. Update table statistics
spark.sql(f"ANALYZE TABLE delta.`{table_path}` COMPUTE STATISTICS")

# 5. Clean up old files
vacuum_result = delta_table.vacuum(retentionHours=168)  # 7 days

# 6. Generate manifests for external access
delta_table.generate("symlink_format_manifest")

# 7. Check final statistics
post_stats = spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`")
file_count_after = post_stats.select("numFiles").collect()[0][0]

print(f"Files after optimization: {file_count_after}")
print(f"File reduction: {file_count_before - file_count_after}")
```

### Performance Monitoring

```python
from pyspark.sql.functions import col

# Monitor optimize operations over time
optimize_history = (delta_table.history()
    .filter(col("operation") == "OPTIMIZE")
    .select(
        "timestamp",
        "operationParameters.predicate",
        col("operationMetrics.numFilesAdded").alias("files_added"),
        col("operationMetrics.numFilesRemoved").alias("files_removed"),
        col("operationMetrics.totalFilesSize").alias("total_size"),
        col("operationMetrics.numBatches").alias("batches")
    ))

optimize_history.show(truncate=False)

# Check vacuum history (Delta records separate VACUUM START and VACUUM END entries)
vacuum_history = (delta_table.history()
    .filter(col("operation").startswith("VACUUM"))
    .select(
        "timestamp",
        "operation",
        col("operationMetrics.numDeletedFiles").alias("deleted_files"),
        col("operationMetrics.sizeOfDeletedFiles").alias("deleted_size")
    ))

vacuum_history.show()
```

## Optimization Best Practices

### File Compaction

- Run regularly to prevent small file accumulation
- Target 128MB-1GB file sizes for optimal performance (see the sketch below)
- Focus on frequently accessed partitions
- Monitor file count versus query performance
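
The target file size is tunable. A minimal sketch, assuming the OSS delta-spark session configuration `spark.databricks.delta.optimize.maxFileSize` (value in bytes):

```python
# Ask OPTIMIZE to produce files of roughly 512 MB instead of the default ~1 GB
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(512 * 1024 * 1024))

delta_table.optimize().executeCompaction()
```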

### Z-Order Optimization

- Choose columns commonly used in WHERE clauses
- Limit to 3-4 columns for best effectiveness
- Reorder periodically as data patterns change
- Consider column cardinality and query patterns (see the sketch below)
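
Z-ordering pays off most on high-cardinality columns that queries actually filter on. A short sketch for checking candidate cardinalities before choosing clustering columns (table path and column names are placeholders):

```python
from pyspark.sql.functions import approx_count_distinct

# Estimate cardinality of candidate Z-order columns
df = spark.read.format("delta").load("/path/to/table")
df.select(
    approx_count_distinct("customer_id").alias("customer_id_card"),
    approx_count_distinct("product_category").alias("product_category_card")
).show()
```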

### Vacuum Operations

- Balance retention with storage costs
- Coordinate with time travel requirements (see the sketch below)
- Run during low-activity periods
- Monitor storage reclaimed versus files deleted
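
Vacuum permanently deletes the data files that older table versions point to, so versions beyond the retention window become unreadable even though they still appear in the history. A minimal sketch for checking how far back the history reaches before tightening retention:

```python
from pyspark.sql.functions import min as spark_min

# Timestamp of the oldest commit still listed in the table history
oldest = (delta_table.history()
    .select(spark_min("timestamp").alias("oldest"))
    .collect()[0]["oldest"])
print(f"History reaches back to: {oldest}")
# Pick a retention window that does not cut into versions you still query
```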

268

269

### Partition Strategy

270

- Partition by commonly filtered columns

271

- Avoid over-partitioning (aim for 1GB+ per partition)

272

- Consider date-based partitioning for time-series data

273

- Use OPTIMIZE with partition filters for efficiency
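
A minimal sketch of date-based partitioning at write time (`events_df` and the path are placeholders):

```python
# Write a time-series dataset partitioned by date so that OPTIMIZE and
# queries can prune down to the relevant partitions
(events_df.write
    .format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/path/to/table"))
```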

274

275

## Optimization Metrics

276

277

Common metrics returned by optimization operations:

278

279

- `numFilesAdded` / `numFilesRemoved`: File count changes

280

- `filesAdded.size` / `filesRemoved.size`: Size changes in bytes

281

- `numBatches`: Number of optimization batches executed

282

- `totalFilesSize`: Total size after optimization

283

- `zOrderStats`: Z-order clustering effectiveness metrics
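
These fields live in the nested `metrics` struct of the DataFrame returned by the execute methods. A short sketch of reading them back in Python, using the field names documented above:

```python
# Run a compaction and read the nested metrics struct from the result row
result = delta_table.optimize().executeCompaction()
metrics = result.select("metrics").collect()[0]["metrics"]
print(f"Added {metrics['numFilesAdded']} files, removed {metrics['numFilesRemoved']}")
```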