
# Split Brain Resolution

Split Brain Resolver (SBR) in Akka Cluster handles network partitions by automatically downing members according to a configured strategy, so that the two sides of a partition do not keep running as separate clusters.

## DowningProvider API

### Base DowningProvider

```scala { .api }
abstract class DowningProvider {
  def downRemovalMargin: FiniteDuration
  def downingActorProps: Option[Props]
}

object DowningProvider {
  def load(fqcn: String, system: ActorSystem): DowningProvider
}
```

### NoDowning (Default)

```scala { .api }
final class NoDowning(system: ActorSystem) extends DowningProvider {
  // Taken from akka.cluster.down-removal-margin (zero when that setting is off)
  override def downRemovalMargin: FiniteDuration
  // No downing actor is started
  override val downingActorProps: Option[Props] = None
}
```

## Split Brain Resolver Provider

### SplitBrainResolverProvider

```scala { .api }
class SplitBrainResolverProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration
  override def downingActorProps: Option[Props]
}
```

### Configuration

```hocon
akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

  split-brain-resolver {
    # Select strategy: keep-majority, lease-majority, static-quorum, keep-oldest, down-all
    active-strategy = "keep-majority"

    # How long membership and reachability must stay unchanged
    # before SBR takes a downing decision
    stable-after = 20s

    # If on, down all members when no stable decision can be
    # reached for too long after the first unreachability event
    down-all-when-unstable = "on"
  }
}
```
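
As a rough worked example of the timing, the sketch below computes the extra waiting period implied by `down-all-when-unstable = on`. It assumes the rule stated in Akka's reference configuration (3/4 of `stable-after`, but not less than 4 seconds); `DownAllTiming` and `derivedDownAllAfter` are hypothetical names used only for illustration, not Akka API.

```scala
import scala.concurrent.duration._

object DownAllTiming {
  // Assumed rule: "on" derives 3/4 of stable-after, with a floor of 4 seconds.
  def derivedDownAllAfter(stableAfter: FiniteDuration): FiniteDuration =
    (stableAfter * 3L / 4L).max(4.seconds)

  // Latest point (measured from the first unreachability event) at which
  // all members would be downed if no stable decision was possible.
  def downAllDeadline(stableAfter: FiniteDuration): FiniteDuration =
    stableAfter + derivedDownAllAfter(stableAfter)
}
```

Under that assumption, `stable-after = 20s` gives a derived margin of 15 seconds, so the fallback would trigger about 35 seconds after the first unreachability event.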

## SBR Strategies

### Keep Majority Strategy

Keeps the partition with the majority of nodes and downs the minority.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"

  keep-majority {
    # If set, only members with this role are counted when determining the majority
    role = ""
  }
}
```

**Behavior:**
- Partition with >50% of nodes survives
- Minority partitions are downed
- Equal-sized partitions: the side containing the member with the lowest address survives

**Usage Example:**
```scala
// 5-node cluster splits into 3+2
// 3-node partition survives, 2-node partition is downed

// 4-node cluster splits into 2+2
// The side containing the member with the lowest address survives; the other side is downed
```
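
A minimal sketch of the decision each side makes, assuming Akka's documented tie-break (the side holding the lowest address wins an even split). `KeepMajoritySketch`, `Node`, and `decide` are hypothetical names, and addresses are compared as plain strings here, which is a simplification of Akka's address ordering.

```scala
object KeepMajoritySketch {
  // Hypothetical, simplified member model: an address string plus its roles.
  final case class Node(address: String, roles: Set[String] = Set.empty)

  sealed trait Decision
  case object KeepThisSide extends Decision
  case object DownThisSide extends Decision

  // Decide from the point of view of the side that can still see `reachable`.
  def decide(
      reachable: Set[Node],
      unreachable: Set[Node],
      role: Option[String] = None): Decision = {
    // When a role is configured, only members with that role are counted.
    def counted(nodes: Set[Node]): Set[Node] =
      role.fold(nodes)(r => nodes.filter(_.roles.contains(r)))

    val mine = counted(reachable)
    val theirs = counted(unreachable)

    if (mine.size > theirs.size) KeepThisSide
    else if (mine.size < theirs.size) DownThisSide
    else {
      // Even split: the side that holds the lowest address survives.
      // If nobody is counted at all, this sketch downs the side.
      val lowest = (mine ++ theirs).toSeq.map(_.address).sorted.headOption
      if (lowest.exists(a => mine.exists(_.address == a))) KeepThisSide
      else DownThisSide
    }
  }
}
```

Because both sides evaluate the same rule on mirrored inputs, they reach complementary decisions without communicating.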

### Lease Majority Strategy

Uses a distributed lease to determine which partition can continue.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "lease-majority"

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    # How long the minority side waits before it tries to acquire the lease
    acquire-lease-delay-for-minority = 2s
    # Release the lease after this duration
    release-after = 40s
  }
}
```

**Behavior:**
- Majority partition tries to acquire the lease immediately and survives if it succeeds
- Minority waits for `acquire-lease-delay-for-minority`, then attempts lease acquisition; a side that cannot acquire the lease is downed
- Only one partition can hold the lease at a time
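
The ordering effect of the minority delay can be modeled without any Akka dependency. The sketch below is only a toy model: `InMemoryLease` is a hypothetical stand-in for a real distributed lease (it is not Akka's `Lease` API), and `Side`, `canReachLease`, and `survivor` are illustrative names.

```scala
object LeaseMajoritySketch {
  // Hypothetical in-memory stand-in for a distributed lease.
  final class InMemoryLease {
    private var holder: Option[String] = None

    // Succeeds if the lease is free or already held by `owner`.
    def acquire(owner: String): Boolean = synchronized {
      if (holder.forall(_ == owner)) { holder = Some(owner); true }
      else false
    }
  }

  final case class Side(name: String, size: Int, canReachLease: Boolean = true)

  // Returns the name of the side that ends up holding the lease, if any.
  // A strict majority attempts at t = 0; every other side waits for the delay.
  def survivor(
      sides: Seq[Side],
      lease: InMemoryLease,
      minorityDelayMillis: Long = 2000L): Option[String] = {
    val total = sides.map(_.size).sum
    val attempts = sides
      .map(s => (if (s.size * 2 > total) 0L else minorityDelayMillis, s))
      .sortBy(_._1) // earliest attempt first; stable for equal delays

    attempts
      .find { case (_, s) => s.canReachLease && lease.acquire(s.name) }
      .map { case (_, s) => s.name }
  }
}
```

In this model the minority only wins when the majority cannot reach the lease store, which is the case the delay is meant to leave room for. If no side can reach the lease, no side survives.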

### Static Quorum Strategy

Keeps a partition only if it still has at least `quorum-size` members; smaller partitions are downed.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "static-quorum"

  static-quorum {
    # Minimum number of members a partition needs in order to survive
    quorum-size = 3

    # If set, only members with this role are counted toward the quorum
    role = ""
  }
}
```

**Behavior:**
- Partitions with fewer than `quorum-size` nodes are downed
- Multiple partitions can survive if both meet quorum, so the cluster must not grow beyond `quorum-size * 2 - 1` members
- Useful for clusters with a fixed, known number of nodes
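
The quorum arithmetic can be checked with a few lines of plain Scala. This is a sketch of the rule as stated in the bullets, with hypothetical helper names (`StaticQuorumSketch`, `survives`, `maxSafeClusterSize`, `canSplitBrain`) that are not part of Akka.

```scala
object StaticQuorumSketch {
  // A side survives only if it can still see at least quorumSize members.
  def survives(reachableCount: Int, quorumSize: Int): Boolean =
    reachableCount >= quorumSize

  // Largest cluster for which two sides can never both meet the quorum.
  def maxSafeClusterSize(quorumSize: Int): Int =
    quorumSize * 2 - 1

  // True if the cluster is large enough that both sides of a split
  // could meet the quorum at the same time.
  def canSplitBrain(totalMembers: Int, quorumSize: Int): Boolean =
    totalMembers > maxSafeClusterSize(quorumSize)
}
```

With `quorum-size = 3`, a 5-node cluster is safe (a 3+2 split keeps only the 3-node side), whereas a 6-node cluster could split 3+3 and leave both sides running.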

### Keep Oldest Strategy

Keeps the partition containing the oldest member (by cluster join time).

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-oldest"

  keep-oldest {
    # If set, only members with this role are considered when determining the oldest member
    role = ""

    # If the oldest member is partitioned from all other members,
    # down it and keep the rest of the cluster
    down-if-alone = on
  }
}
```

**Behavior:**
- Partition with oldest member survives
- Other partitions are downed
- Deterministic: always the same result for the same partition scenario
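
A simplified sketch of the keep-oldest rule, including the `down-if-alone` exception. The names (`KeepOldestSketch`, `Node`, `upNumber`, `keepThisSide`) are hypothetical, and the sketch assumes the oldest member is only sacrificed when at least two members sit on the other side; Akka's real implementation handles further edge cases, so treat this as an approximation.

```scala
object KeepOldestSketch {
  // upNumber is a hypothetical stand-in for member age: lower means joined earlier.
  final case class Node(address: String, upNumber: Int)

  // Decide, from the side that can see `reachable`, whether this side survives.
  def keepThisSide(
      reachable: Set[Node],
      unreachable: Set[Node],
      downIfAlone: Boolean = true): Boolean = {
    val all = reachable ++ unreachable
    if (all.isEmpty) false
    else {
      val oldest = all.minBy(_.upNumber)
      if (reachable.contains(oldest))
        // Oldest is here: survive unless it is alone against two or more members.
        !(downIfAlone && reachable.size == 1 && unreachable.size >= 2)
      else
        // Oldest is on the other side: survive only if it is alone over there.
        downIfAlone && unreachable.size == 1 && reachable.size >= 2
    }
  }
}
```

With `downIfAlone = false` the side holding the oldest member always survives, even if that side is a single node.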

### Down All Strategy

Downs all members of the cluster, reachable and unreachable alike, when a partition is detected.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "down-all"
}
```

**Behavior:**
- All members are downed, including those on the reachable side
- No partition continues; the nodes must be restarted to form a new cluster
- Use with caution in production

## SBR Settings

### SplitBrainResolverSettings

```scala { .api }
class SplitBrainResolverSettings(config: Config) {
  def activeStrategy: String
  def stableAfter: FiniteDuration
  def downAllWhenUnstable: DownAllWhenUnstable
}

sealed trait DownAllWhenUnstable
case object DownAllWhenUnstableOn extends DownAllWhenUnstable
case object DownAllWhenUnstableOff extends DownAllWhenUnstable
```

### Global SBR Configuration

```hocon
akka.cluster.split-brain-resolver {
  # Strategy to use
  active-strategy = "keep-majority"

  # Time to wait before taking downing decision
  stable-after = 20s

  # Down all members when the cluster stays unstable for too long
  down-all-when-unstable = "on"

  # Additional settings per strategy
  keep-majority {
    # Only members with this role are counted when determining the majority
    role = "core"
  }

  static-quorum {
    quorum-size = 3
    role = "important"
  }

  keep-oldest {
    role = "seed"
    down-if-alone = off
  }

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    acquire-lease-delay-for-minority = 2s
    release-after = 40s
  }
}
```

## Custom Downing Provider

### Creating Custom Provider

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.{Cluster, DowningProvider, Member}
import akka.cluster.ClusterEvent.UnreachableMember
import scala.concurrent.duration._

class CustomDowningProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration = 10.seconds

  override def downingActorProps: Option[Props] =
    Some(Props(classOf[CustomDowningActor]))
}

class CustomDowningActor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  // Subscribe to unreachability events
  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[UnreachableMember])
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive: Receive = {
    case UnreachableMember(member) =>
      log.info("Member {} is unreachable", member)

      // Custom downing logic
      if (shouldDownMember(member)) {
        log.warning("Downing unreachable member {}", member)
        cluster.down(member.address)
      }
  }

  private def shouldDownMember(member: Member): Boolean = {
    // Custom logic - example: down after 30 seconds unreachable
    // In practice, you'd track unreachable time
    true
  }
}
```
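
The "track unreachable time" step can be isolated into a small helper that has no Akka dependency. The sketch below is one possible shape, with hypothetical names (`UnreachableTracker`, `markUnreachable`, `markReachable`, `shouldDown`); addresses are plain strings and the caller supplies timestamps so the logic stays testable.

```scala
import scala.concurrent.duration._

object UnreachableTrackerSketch {
  // Records when each address was first seen unreachable.
  final class UnreachableTracker(threshold: FiniteDuration) {
    // address -> timestamp (ms) of the first unreachability observation
    private var since = Map.empty[String, Long]

    // Keep the earliest timestamp; repeated observations do not reset the clock.
    def markUnreachable(address: String, nowMillis: Long): Unit =
      if (!since.contains(address)) since += (address -> nowMillis)

    // Forget the address once it becomes reachable again.
    def markReachable(address: String): Unit =
      since -= address

    // True once the address has been unreachable for at least `threshold`.
    def shouldDown(address: String, nowMillis: Long): Boolean =
      since.get(address).exists(start => nowMillis - start >= threshold.toMillis)
  }
}
```

An actor could feed this from `UnreachableMember` and `ReachableMember` events and re-check on a timer. Note that a pure timeout rule, applied independently on both sides of a partition, lets each side down the other, which is the split-brain outcome the built-in strategies are designed to avoid.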

### Registering Custom Provider

```hocon
akka.cluster.downing-provider-class = "com.example.CustomDowningProvider"
```

## SBR Monitoring and Observability

### SBR Decision Logging

```scala
// SBR logs each downing decision it takes
// Illustrative log messages (exact wording and log level depend on the Akka version):
// "SBR is downing [Member(akka://sys@host1:2551, Up)] in partition [...]"
// "SBR is keeping partition [Member(akka://sys@host2:2551, Up), ...]"
```

### Monitoring SBR Events

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.{Cluster, MemberStatus}
import akka.cluster.ClusterEvent._

class SBRMonitor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[MemberDowned], classOf[MemberRemoved])
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive: Receive = {
    // MemberDowned is also published for manual downing, not only for SBR decisions
    case MemberDowned(member) =>
      log.warning("Member downed: {}", member)
      // Send alert/metric

    case MemberRemoved(member, previousStatus) =>
      if (previousStatus == MemberStatus.Down) {
        log.info("Previously downed member removed: {}", member)
        // Update monitoring dashboard
      }
  }
}
```

### Health Check Integration

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster
import akka.http.scaladsl.server.Route
import akka.http.scaladsl.server.Directives._

// The ActorSystem is passed in explicitly so the route has no hidden dependencies
def healthRoute(system: ActorSystem): Route = {
  path("health") {
    get {
      val cluster = Cluster(system)
      val unreachableCount = cluster.state.unreachable.size

      if (unreachableCount == 0) {
        complete("healthy")
      } else {
        complete(s"unhealthy: $unreachableCount unreachable members")
      }
    }
  }
}
```

## Production Best Practices

### Strategy Selection Guidelines

**Keep Majority:**
- Best for most scenarios
- Good balance of availability and consistency
- Works well with an odd number of nodes

**Lease Majority:**
- Use with external coordination systems (Kubernetes, etcd)
- Provides strongest consistency guarantees
- Requires a reliable lease implementation

**Static Quorum:**
- Use when minimum cluster size is known
- Good for clusters with well-defined capacity requirements
- May result in multiple surviving partitions if the cluster grows beyond `quorum-size * 2 - 1` members

**Keep Oldest:**
- Use when one node has special significance
- Deterministic but potentially less available
- Good for master/worker patterns

### Configuration Recommendations

```hocon
# Production configuration
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"
  stable-after = 30s # Allow time for transient network issues
  down-all-when-unstable = "on" # Down all members if the cluster stays unstable for too long

  keep-majority {
    # Use role-based majority for heterogeneous clusters
    role = "core"
  }
}
```

### Operational Considerations

```scala
import akka.actor.ActorSystem
import akka.cluster.{Cluster, MemberStatus}

// Wrapped in a method so that `system` and `log` are defined
def checkClusterHealth(system: ActorSystem): Unit = {
  val log = system.log

  // Monitor cluster health
  val cluster = Cluster(system)

  // Check for unreachable members
  val unreachableMembers = cluster.state.unreachable
  if (unreachableMembers.nonEmpty) {
    log.warning("Unreachable members detected: {}",
      unreachableMembers.map(_.address).mkString(", "))
  }

  // Monitor cluster size
  val memberCount = cluster.state.members.count(_.status == MemberStatus.Up)
  val minimumRequired = 3 // Your application's minimum

  if (memberCount < minimumRequired) {
    log.error("Cluster size {} below minimum required {}", memberCount, minimumRequired)
    // Consider alerting or graceful degradation
  }
}
```

### Testing SBR Strategies

```scala
import akka.remote.testkit.MultiNodeSpec
import akka.remote.transport.ThrottlerTransportAdapter.Direction
import scala.concurrent.duration._

// Use MultiNodeSpec for testing split brain scenarios.
// Assumed to exist elsewhere: SplitBrainConfig (a MultiNodeConfig defining the roles
// first..fifth), plus the helpers awaitClusterUp and cluster, and the test-style
// mixins and abstract members that MultiNodeSpec requires.
class SplitBrainResolverSpec extends MultiNodeSpec(SplitBrainConfig) {
  import SplitBrainConfig._

  "Split Brain Resolver" should {
    "down minority partition in keep-majority strategy" in {
      // Create 5-node cluster
      awaitClusterUp(first, second, third, fourth, fifth)

      // Partition cluster into 3+2; only one node drives the test conductor
      runOn(first) {
        for (majority <- Seq(first, second, third); minority <- Seq(fourth, fifth))
          testConductor.blackhole(majority, minority, Direction.Both).await
      }
      enterBarrier("partitioned")

      // Verify majority partition (first, second, third) survives
      runOn(first, second, third) {
        within(30.seconds) {
          awaitAssert {
            cluster.state.members.size should be(3)
            cluster.state.unreachable should be(empty)
          }
        }
      }

      // Verify minority partition (fourth, fifth) is downed
      runOn(fourth, fifth) {
        within(30.seconds) {
          awaitAssert {
            cluster.isTerminated should be(true)
          }
        }
      }
    }
  }
}
```