# Storage Management

RDD storage level management system for controlling persistence strategies across memory, disk, and off-heap storage with configurable replication and serialization options.

## Capabilities

### StorageLevel Class

Main class for configuring RDD persistence behavior with fine-grained control over storage tiers and replication.

```scala { .api }
/**
 * Flags for controlling the storage of an RDD.
 * Records whether to use memory, disk, or off-heap storage, as well as
 * serialization preferences and the replication factor.
 * @param useDisk Whether to store RDD partitions on disk
 * @param useMemory Whether to store RDD partitions in memory
 * @param useOffHeap Whether to use off-heap memory storage
 * @param deserialized Whether to keep data in deserialized form
 * @param replication Number of replicas to maintain (default 1)
 */
@DeveloperApi
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1
) extends Externalizable {

  /** Returns whether this storage level uses disk storage */
  def useDisk: Boolean

  /** Returns whether this storage level uses memory storage */
  def useMemory: Boolean

  /** Returns whether this storage level uses off-heap memory */
  def useOffHeap: Boolean

  /** Returns whether data is stored in deserialized form */
  def deserialized: Boolean

  /** Returns the replication factor */
  def replication: Int

  /** Returns the memory mode (ON_HEAP or OFF_HEAP) */
  private[spark] def memoryMode: MemoryMode

  /** Returns whether this storage level configuration is valid */
  def isValid: Boolean

  /** Returns a human-readable description of the storage configuration */
  def description: String

  /** Creates a copy of this storage level */
  override def clone(): StorageLevel

  /** Converts the storage level to an integer representation for serialization */
  def toInt: Int
}
```
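The `toInt` representation can be pictured as a small bit field, one bit per boolean flag. The sketch below is an illustrative, self-contained model of that idea in plain Scala; the `StorageFlags` object and its bit assignments are invented for this example, not the class's actual implementation:

```scala
// Illustrative sketch: packing four boolean storage flags into an Int,
// one bit each, in the spirit of StorageLevel.toInt.
object StorageFlags {
  def pack(useDisk: Boolean, useMemory: Boolean,
           useOffHeap: Boolean, deserialized: Boolean): Int = {
    var flags = 0
    if (useDisk) flags |= 8
    if (useMemory) flags |= 4
    if (useOffHeap) flags |= 2
    if (deserialized) flags |= 1
    flags
  }

  // Recover the four flags from the packed integer
  def unpack(flags: Int): (Boolean, Boolean, Boolean, Boolean) =
    ((flags & 8) != 0, (flags & 4) != 0, (flags & 2) != 0, (flags & 1) != 0)
}

// MEMORY_AND_DISK-style flags: disk + memory + deserialized
val packed = StorageFlags.pack(useDisk = true, useMemory = true,
  useOffHeap = false, deserialized = true)
assert(packed == 13) // 8 + 4 + 1
assert(StorageFlags.unpack(packed) == (true, true, false, true))
```

A compact integer form along these lines is what makes the `apply(flags: Int, replication: Int)` factory and `Externalizable` round trips cheap.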

**Usage Examples:**

```scala
import org.apache.spark.storage.StorageLevel

// Using predefined storage levels
val memoryOnly = StorageLevel.MEMORY_ONLY
val memoryAndDisk = StorageLevel.MEMORY_AND_DISK_SER

// Custom storage level creation
val customLevel = StorageLevel(
  useDisk = true,
  useMemory = true,
  deserialized = false,
  replication = 2
)

// Checking storage level properties (memoryMode is private[spark],
// so user code relies on the public accessors instead)
if (customLevel.useMemory) {
  println(s"Uses memory with replication ${customLevel.replication}")
}
println(s"Configuration: ${customLevel.description}")

// Applying to RDDs (conceptual - would be used in a broader Spark context)
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```

### StorageLevel Companion Object

Factory methods and predefined constants for common storage configurations.

```scala { .api }
object StorageLevel {
  /** Predefined storage levels for common use cases */
  val NONE: StorageLevel
  val DISK_ONLY: StorageLevel
  val DISK_ONLY_2: StorageLevel
  val DISK_ONLY_3: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_ONLY_2: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val MEMORY_ONLY_SER_2: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_AND_DISK_2: StorageLevel
  val MEMORY_AND_DISK_SER: StorageLevel
  val MEMORY_AND_DISK_SER_2: StorageLevel
  val OFF_HEAP: StorageLevel

  /** Create StorageLevel from string representation */
  @DeveloperApi
  def fromString(s: String): StorageLevel

  /** Create StorageLevel with full parameter control */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int
  ): StorageLevel

  /** Create StorageLevel without off-heap parameter */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      deserialized: Boolean,
      replication: Int = 1
  ): StorageLevel

  /** Create StorageLevel from integer flags and replication */
  @DeveloperApi
  def apply(flags: Int, replication: Int): StorageLevel

  /** Create StorageLevel from ObjectInput stream */
  @DeveloperApi
  def apply(in: ObjectInput): StorageLevel
}
```

**Usage Examples:**

```scala
import org.apache.spark.storage.StorageLevel

// Using predefined levels
val diskOnly = StorageLevel.DISK_ONLY
val memoryAndDiskSer = StorageLevel.MEMORY_AND_DISK_SER_2

// Creating custom levels
val highReplication = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = false,
  deserialized = true,
  replication = 3
)

val serializedMemory = StorageLevel(
  useDisk = false,
  useMemory = true,
  deserialized = false,
  replication = 2
)

// String-based creation
val fromString = StorageLevel.fromString("MEMORY_AND_DISK")

// Validation
if (highReplication.isValid) {
  println(s"Valid configuration: ${highReplication.description}")
}
```

### Storage Level Patterns

Common storage configuration patterns for different use cases.

```scala { .api }
// Performance-oriented: fast access, memory preferred
val performanceLevel = StorageLevel.MEMORY_ONLY

// Reliability-oriented: fault tolerance through replication
val reliableLevel = StorageLevel.MEMORY_AND_DISK_2

// Memory-efficient: serialized storage to reduce the memory footprint
val efficientLevel = StorageLevel.MEMORY_ONLY_SER

// Large dataset: disk storage with memory caching
val largeDataLevel = StorageLevel.MEMORY_AND_DISK

// Fault-tolerant large dataset: disk + memory, serialized, with replication
val faultTolerantLevel = StorageLevel.MEMORY_AND_DISK_SER_2

// Off-heap storage: avoids GC pressure
val offHeapLevel = StorageLevel.OFF_HEAP

// No persistence: recompute on demand
val noPersistence = StorageLevel.NONE
```

### Memory Mode Integration

Storage levels integrate with Spark's memory management system through the `MemoryMode` enumeration.

```java { .api }
/**
 * Memory storage mode enumeration.
 * Controls whether data is stored on-heap or off-heap.
 */
@Private
public enum MemoryMode {
  /** Standard JVM heap memory */
  ON_HEAP,

  /** Off-heap memory (e.g., using sun.misc.Unsafe) */
  OFF_HEAP
}
```

**Usage Examples:**

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.memory.MemoryMode

val onHeapLevel = StorageLevel.MEMORY_ONLY
val offHeapLevel = StorageLevel.OFF_HEAP

// Check the memory mode (memoryMode is private[spark], so this only
// compiles from code inside the org.apache.spark package)
onHeapLevel.memoryMode match {
  case MemoryMode.ON_HEAP => println("Using JVM heap memory")
  case MemoryMode.OFF_HEAP => println("Using off-heap memory")
}

// Off-heap storage reduces GC pressure but requires configuration
val offHeapConfig = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = true,
  deserialized = false, // off-heap data is always serialized
  replication = 1
)
```

## Storage Strategy Selection

### Performance Considerations

- **MEMORY_ONLY**: Fastest access; capacity limited by available memory
- **MEMORY_ONLY_SER**: Reduced memory footprint at the cost of serialization CPU overhead
- **MEMORY_AND_DISK**: Automatically spills to disk when memory is full
- **DISK_ONLY**: Slowest, but handles datasets far larger than memory
- **OFF_HEAP**: Avoids GC pressure; requires additional configuration
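The trade-offs above can be turned into a rough selection heuristic. The sketch below is self-contained and purely illustrative: the `chooseLevel` helper, its thresholds, and the string return values are invented for this example (real code would return `StorageLevel` constants):

```scala
// Illustrative heuristic: pick a storage strategy from rough size estimates.
// Thresholds are arbitrary examples, not Spark recommendations.
def chooseLevel(datasetBytes: Long, freeMemoryBytes: Long): String =
  if (datasetBytes <= freeMemoryBytes / 2) "MEMORY_ONLY"      // fits comfortably
  else if (datasetBytes <= freeMemoryBytes) "MEMORY_ONLY_SER" // fits if serialized
  else "MEMORY_AND_DISK_SER"                                  // must spill to disk

assert(chooseLevel(100L, 1000L) == "MEMORY_ONLY")
assert(chooseLevel(800L, 1000L) == "MEMORY_ONLY_SER")
assert(chooseLevel(5000L, 1000L) == "MEMORY_AND_DISK_SER")
```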

### Fault Tolerance Considerations

- **Replication Factor**: Higher replication increases fault tolerance but consumes more memory and disk
- **Disk Storage**: Survives memory eviction; pairing it with replication on another node is what protects against losing the node itself
- **Memory + Disk**: Balances performance and reliability

### Memory Management Considerations

- **Deserialized**: Faster access but a larger memory footprint
- **Serialized**: Compact storage but CPU overhead on each access
- **Off-heap**: Avoids GC impact but requires the serialized format
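To make the deserialized-versus-serialized trade-off concrete, the self-contained sketch below caches the same data both ways using plain Java serialization (the `Record` type and helper functions are invented for this example; Spark uses its own serializers):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A cached value can be kept as live objects (fast access, larger footprint)
// or as bytes (compact, but every access pays a deserialization cost).
case class Record(id: Int, name: String)

def toBytes(value: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(value)
  oos.close()
  bos.toByteArray
}

def fromBytes[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}

val records = (1 to 100).map(i => Record(i, s"row-$i")).toList

// "Deserialized" storage: the live list itself
val deserializedCache: List[Record] = records

// "Serialized" storage: a single byte array; reading requires deserialization
val serializedCache: Array[Byte] = toBytes(records)
assert(fromBytes[List[Record]](serializedCache) == records) // same data back
```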

### Selection Guidelines

```scala
// Interactive/iterative workloads - prioritize speed
val interactiveLevel = StorageLevel.MEMORY_ONLY

// Production ETL - balance performance and reliability
val etlLevel = StorageLevel.MEMORY_AND_DISK_SER

// Large reference datasets - efficient memory usage plus a backup replica
val referenceLevel = StorageLevel.MEMORY_ONLY_SER_2

// Temporary intermediate results - fast, and cheap to recompute if evicted
val temporaryLevel = StorageLevel.MEMORY_ONLY

// Critical datasets - maximum fault tolerance
val criticalLevel = StorageLevel.MEMORY_AND_DISK_2
```