# Split Brain Resolution

Split Brain Resolution (SBR) in Akka Cluster provides strategies for handling network partitions by automatically downing unreachable members to maintain cluster consistency and availability.

## DowningProvider API

### Base DowningProvider

```scala { .api }
abstract class DowningProvider {
  def downRemovalMargin: FiniteDuration
  def downingActorProps: Option[Props]
}

object DowningProvider {
  def load(fqcn: String, system: ActorSystem): DowningProvider
}
```

### NoDowning (Default)

```scala { .api }
final class NoDowning(system: ActorSystem) extends DowningProvider {
  // Falls back to the cluster-wide akka.cluster.down-removal-margin setting
  override def downRemovalMargin: FiniteDuration = Cluster(system).settings.DownRemovalMargin
  override def downingActorProps: Option[Props] = None
}
```

## Split Brain Resolver Provider

### SplitBrainResolverProvider

```scala { .api }
class SplitBrainResolverProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration
  override def downingActorProps: Option[Props]
}
```

### Configuration

```hocon
akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

  split-brain-resolver {
    # Select strategy: keep-majority, lease-majority, static-quorum, keep-oldest, down-all
    active-strategy = "keep-majority"

    # The downing decision is taken once there have been no membership or
    # reachability changes for this duration
    stable-after = 20s

    # If on, down all members when the cluster keeps changing and does not
    # become stable within a timeout derived from stable-after
    down-all-when-unstable = "on"
  }
}
```
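
When `down-all-when-unstable` is `on`, the timeout is derived from `stable-after` rather than configured directly. The sketch below shows that arithmetic, assuming the rule stated in Akka's reference configuration (three quarters of `stable-after`, but never less than 4 seconds). The object and method names are hypothetical and only serve the illustration.

```scala
import scala.concurrent.duration._

// Hypothetical helper, not part of Akka: models how the "on" value is
// assumed to be turned into a concrete timeout.
object DownAllWhenUnstableSketch {
  def derived(stableAfter: FiniteDuration): FiniteDuration = {
    val threeQuarters = stableAfter * 3L / 4L
    threeQuarters.max(4.seconds) // lower bound of 4 seconds
  }
}
```

With the default `stable-after = 20s` this gives 15 seconds, and a short `stable-after` such as 4 seconds is clamped to the 4-second minimum.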

## SBR Strategies

### Keep Majority Strategy

Keeps the partition with the majority of nodes and downs the minority.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"

  keep-majority {
    # If set, the majority is counted only among members with this role
    role = ""
  }
}
```

**Behavior:**
- The partition with more than 50% of the nodes survives
- Minority partitions are downed
- Equal-sized partitions: the side containing the member with the lowest address survives

**Usage Example:**
```scala
// 5-node cluster splits into 3+2
// 3-node partition survives, 2-node partition is downed

// 4-node cluster splits into 2+2
// The side containing the member with the lowest address survives
```
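
The decision rule above can be sketched as a pure function evaluated independently on each side of the partition. This is a simplified model, not Akka's implementation: members are plain strings, and the tie-break uses string ordering where Akka compares member addresses. `KeepMajoritySketch` and its members are hypothetical names.

```scala
// Simplified model of the keep-majority rule; names are illustrative only.
object KeepMajoritySketch {
  sealed trait Decision
  case object DownUnreachable extends Decision // this side survives
  case object DownReachable extends Decision   // this side downs itself

  def decide(reachable: Set[String], unreachable: Set[String]): Decision = {
    // The deciding node always counts itself as reachable.
    require(reachable.nonEmpty, "the deciding side must contain at least one member")
    if (reachable.size > unreachable.size) DownUnreachable
    else if (reachable.size < unreachable.size) DownReachable
    else {
      // Tie: keep the side that holds the lowest address overall.
      val lowest = (reachable ++ unreachable).min
      if (reachable.contains(lowest)) DownUnreachable else DownReachable
    }
  }
}
```

Because both sides apply the same rule to mirrored inputs, they reach complementary decisions without communicating: in a 2+2 split, exactly one side sees the lowest address among its reachable members.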

### Lease Majority Strategy

Uses a distributed lease to determine which partition can continue.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "lease-majority"

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    # How long the minority side waits before it tries to acquire the lease,
    # which gives the majority side a head start
    acquire-lease-delay-for-minority = 2s
    # Release the lease after this duration
    release-after = 40s
  }
}
```

**Behavior:**
- The majority side tries to acquire the lease immediately and survives if it gets it
- The minority side waits for `acquire-lease-delay-for-minority` before it attempts acquisition
- Only one partition can hold the lease at a time; a side that fails to acquire it downs itself
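
To make the ordering concrete, here is a minimal sketch that replaces the distributed lease with an in-memory stand-in. It is not the Akka coordination API: `InMemoryLease`, `survivor`, and the millisecond offsets are assumptions made purely for illustration.

```scala
// Toy model of lease-based arbitration; all names are hypothetical.
object LeaseMajoritySketch {
  // In-memory stand-in for a distributed lease: at most one owner at a time.
  final class InMemoryLease {
    private var owner: Option[String] = None

    def acquire(who: String): Boolean = synchronized {
      if (owner.isEmpty) owner = Some(who)
      owner.contains(who)
    }
  }

  // Each side attempts the lease at its own time offset (milliseconds).
  // The earliest attempt wins; every other side has to down itself.
  def survivor(lease: InMemoryLease, attempts: Map[String, Long]): Option[String] =
    attempts.toList.sortBy(_._2).map(_._1).find(lease.acquire)
}
```

If the majority attempts at offset 0 and the minority at offset 2000 (the configured delay), the majority wins. If the majority cannot reach the lease store at all, only the minority attempts and it survives instead, which is the case the delay is designed to allow.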

### Static Quorum Strategy

Downs any partition that has fewer members than the configured quorum size.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "static-quorum"

  static-quorum {
    # Minimum number of members a partition needs in order to survive
    quorum-size = 3

    # If set, quorum-size is counted only among members with this role
    role = ""
  }
}
```

**Behavior:**
- Partitions with fewer than `quorum-size` nodes are downed
- Two partitions can both survive if the cluster has grown beyond `quorum-size * 2 - 1` members, so keep the cluster at or below that size
- Useful for clusters with a fixed, known number of nodes
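
The size limit follows from simple arithmetic, sketched below as two hypothetical helper functions (they are not part of Akka). A side survives when it still sees at least `quorum-size` members, so two sides can both survive only when the cluster holds at least twice the quorum.

```scala
// Illustrative arithmetic for static-quorum; names are hypothetical.
object StaticQuorumSketch {
  // Largest cluster size for which two sides can never both reach quorum.
  def maxSafeClusterSize(quorumSize: Int): Int = quorumSize * 2 - 1

  // A side survives when it can still see at least quorumSize members.
  def survives(reachableCount: Int, quorumSize: Int): Boolean =
    reachableCount >= quorumSize
}
```

With `quorum-size = 3` the safe maximum is 5 members. A 6-node cluster splitting 3+3 would leave both sides at quorum, which is exactly the split brain the strategy is meant to prevent.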

### Keep Oldest Strategy

Keeps the partition containing the oldest member (by cluster join time).

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-oldest"

  keep-oldest {
    # If set, only members with this role are considered when finding the oldest member
    role = ""

    # If on, an oldest member that is cut off from all other members
    # downs itself and the other members survive
    down-if-alone = on
  }
}
```

**Behavior:**
- The partition with the oldest member survives
- Other partitions are downed
- Deterministic: the same partition scenario always gives the same result
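
A minimal model of this rule, including `down-if-alone`, might look as follows. It is a sketch under simplifying assumptions: `Node`, `upNumber`, and `survives` are hypothetical names, age is modelled as a plain integer, and the edge cases that Akka handles for very small clusters are not reproduced.

```scala
// Simplified model of keep-oldest; not Akka's implementation.
object KeepOldestSketch {
  // A lower upNumber means the node joined the cluster earlier.
  final case class Node(address: String, upNumber: Int)

  // Returns true when the side described by `reachable` should survive.
  def survives(reachable: Set[Node], unreachable: Set[Node], downIfAlone: Boolean): Boolean = {
    val oldest = (reachable ++ unreachable).minBy(_.upNumber)
    if (reachable.contains(oldest)) {
      // The oldest is on this side; it gives up only when it is alone and downIfAlone is on.
      !(downIfAlone && reachable.size == 1 && unreachable.nonEmpty)
    } else {
      // The oldest is on the other side; this side survives only if the oldest is alone there.
      downIfAlone && unreachable.size == 1
    }
  }
}
```

With `downIfAlone = true`, a lone oldest node downs itself and the rest of the cluster carries on. With `downIfAlone = false`, the lone oldest node survives and every other member is downed, which illustrates why the setting defaults to `on`.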

### Down All Strategy

Downs all members on every side of the partition, so that the cluster can be restarted from a clean state.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "down-all"
}
```

**Behavior:**
- All members are downed, whether reachable or not
- No partition continues; the nodes must be restarted to form a new cluster
- Use with caution in production

## SBR Settings

### SplitBrainResolverSettings

In Akka itself this class is internal to the `akka.cluster.sbr` package, so read the signatures below as a summary of the settings it exposes rather than as a stable public API.

```scala { .api }
class SplitBrainResolverSettings(config: Config) {
  def activeStrategy: String
  def stableAfter: FiniteDuration
  def downAllWhenUnstable: DownAllWhenUnstable
}

sealed trait DownAllWhenUnstable
case object DownAllWhenUnstableOn extends DownAllWhenUnstable
case object DownAllWhenUnstableOff extends DownAllWhenUnstable
```

### Global SBR Configuration

```hocon
akka.cluster.split-brain-resolver {
  # Strategy to use
  active-strategy = "keep-majority"

  # How long membership and reachability must stay unchanged before a downing decision is taken
  stable-after = 20s

  # Down all members if the cluster stays unstable for too long
  down-all-when-unstable = "on"

  # Additional settings per strategy
  keep-majority {
    # Count the majority only among members with this role
    role = "core"
  }

  static-quorum {
    quorum-size = 3
    role = "important"
  }

  keep-oldest {
    role = "seed"
    down-if-alone = off
  }

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    acquire-lease-delay-for-minority = 2s
    release-after = 40s
  }
}
```

## Custom Downing Provider

### Creating Custom Provider

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.{Cluster, DowningProvider, Member}
import akka.cluster.ClusterEvent.{CurrentClusterState, UnreachableMember}
import scala.concurrent.duration._

class CustomDowningProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration = 10.seconds

  override def downingActorProps: Option[Props] =
    Some(Props(classOf[CustomDowningActor]))
}

class CustomDowningActor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  // Subscribe to unreachability events
  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[UnreachableMember])
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive = {
    case _: CurrentClusterState =>
      // Initial state snapshot sent on subscription; not needed here

    case UnreachableMember(member) =>
      log.info("Member {} is unreachable", member)

      // Custom downing logic
      if (shouldDownMember(member)) {
        log.warning("Downing unreachable member {}", member)
        cluster.down(member.address)
      }
  }

  private def shouldDownMember(member: Member): Boolean = {
    // Custom logic - example: down after 30 seconds unreachable
    // In practice, you'd track unreachable time.
    // Returning true unconditionally downs on the first unreachability
    // report, which is not safe during a real network partition.
    true
  }
}
```
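
The placeholder `shouldDownMember` above mentions tracking how long a member has been unreachable. One way to do that is sketched here as a small helper that is independent of Akka; `UnreachableTracker` and its methods are hypothetical, and addresses are plain strings for brevity.

```scala
import scala.concurrent.duration._

// Hypothetical helper: remembers when each address was first seen unreachable.
final class UnreachableTracker(threshold: FiniteDuration) {
  // address -> time (in nanoseconds) of the first unreachability report
  private var since = Map.empty[String, Long]

  def markUnreachable(address: String, nowNanos: Long): Unit =
    if (!since.contains(address)) since += (address -> nowNanos)

  def markReachable(address: String): Unit =
    since -= address

  // True once the address has been unreachable for at least `threshold`.
  def shouldDown(address: String, nowNanos: Long): Boolean =
    since.get(address).exists(start => nowNanos - start >= threshold.toNanos)
}
```

Inside an actor, such a tracker would typically be fed from `UnreachableMember` and `ReachableMember` events and checked on a periodic timer, with `System.nanoTime()` as the clock. The class is not thread-safe, which is acceptable when it is only touched from within a single actor.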

### Registering Custom Provider

```hocon
akka.cluster.downing-provider-class = "com.example.CustomDowningProvider"
```

## SBR Monitoring and Observability

### SBR Decision Logging

```scala
// SBR logs each downing decision together with the members it affects.
// The messages below are illustrative; exact wording and log level depend on the Akka version:
// "SBR is downing [Member(akka://sys@host1:2551, Up)] in partition [...]"
// "SBR is keeping partition [Member(akka://sys@host2:2551, Up), ...]"
```

### Monitoring SBR Events

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.{Cluster, MemberStatus}
import akka.cluster.ClusterEvent._

class SBRMonitor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[MemberDowned], classOf[MemberRemoved])
  }

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case _: CurrentClusterState =>
      // Initial state snapshot sent on subscription; ignored here

    case MemberDowned(member) =>
      // Emitted for every downing, whether decided by SBR or triggered manually
      log.warning("Member downed by SBR: {}", member)
      // Send alert/metric

    case MemberRemoved(member, previousStatus) =>
      if (previousStatus == MemberStatus.Down) {
        log.info("Previously downed member removed: {}", member)
        // Update monitoring dashboard
      }
  }
}
```

### Health Check Integration

```scala
import akka.cluster.Cluster
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Route
import akka.http.scaladsl.server.Directives._

// Assumes an ActorSystem named `system` is in scope
def healthRoute: Route = {
  path("health") {
    get {
      val cluster = Cluster(system)
      val unreachableCount = cluster.state.unreachable.size

      if (unreachableCount == 0) {
        complete("healthy")
      } else {
        // Respond with 503 so probes and load balancers see the failure
        complete(StatusCodes.ServiceUnavailable, s"unhealthy: $unreachableCount unreachable members")
      }
    }
  }
}
```

## Production Best Practices

### Strategy Selection Guidelines

**Keep Majority:**
- Best for most scenarios
- Good balance of availability and consistency
- Works best with an odd number of nodes, which avoids ties

**Lease Majority:**
- Use with external coordination systems (Kubernetes, etcd)
- Provides the strongest consistency guarantees, because the lease lets only one side survive
- Requires a reliable lease implementation

**Static Quorum:**
- Use when the minimum cluster size is known
- Good for clusters with well-defined capacity requirements
- May result in multiple surviving partitions if the cluster grows beyond `quorum-size * 2 - 1` members

**Keep Oldest:**
- Use when one node has special significance
- Deterministic but potentially less available, since a small partition holding the oldest member wins over a larger one
- Good for master/worker patterns built on Cluster Singleton, which runs on the oldest member

### Configuration Recommendations

```hocon
# Production configuration
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"
  stable-after = 30s # Allow time for transient network issues
  down-all-when-unstable = "on" # Down all members if the cluster never stabilizes

  keep-majority {
    # Use role-based majority for heterogeneous clusters
    role = "core"
  }
}
```

### Operational Considerations

```scala
import akka.cluster.{Cluster, MemberStatus}

// Assumes an ActorSystem `system` and a logger `log` are in scope

// Monitor cluster health
val cluster = Cluster(system)

// Check for unreachable members
val unreachableMembers = cluster.state.unreachable
if (unreachableMembers.nonEmpty) {
  log.warning("Unreachable members detected: {}",
    unreachableMembers.map(_.address).mkString(", "))
}

// Monitor cluster size
val memberCount = cluster.state.members.count(_.status == MemberStatus.Up)
val minimumRequired = 3 // Your application's minimum

if (memberCount < minimumRequired) {
  log.error("Cluster size {} below minimum required {}", memberCount, minimumRequired)
  // Consider alerting or graceful degradation
}
```

### Testing SBR Strategies

```scala
// Use MultiNodeSpec for testing split brain scenarios.
// SplitBrainConfig (a MultiNodeConfig with roles first..fifth and
// testTransport(on = true)), the awaitClusterUp helper, and the `cluster`
// value are assumed to come from your own test setup.
import SplitBrainConfig._

class SplitBrainResolverSpec extends MultiNodeSpec(SplitBrainConfig) {

  "Split Brain Resolver" should {
    "down minority partition in keep-majority strategy" in {
      // Create 5-node cluster
      awaitClusterUp(first, second, third, fourth, fifth)

      // Partition cluster into 3+2; drive the conductor from one node only
      runOn(first) {
        for (a <- List(first, second, third); b <- List(fourth, fifth))
          testConductor.blackhole(a, b, Direction.Both).await
      }
      enterBarrier("partitioned")

      // Verify majority partition (first, second, third) survives.
      // The timeout must be longer than the configured stable-after.
      runOn(first, second, third) {
        within(30.seconds) {
          awaitAssert {
            cluster.state.members.size should be(3)
            cluster.state.unreachable should be(empty)
          }
        }
      }

      // Verify minority partition (fourth, fifth) is downed
      runOn(fourth, fifth) {
        within(30.seconds) {
          awaitAssert {
            cluster.isTerminated should be(true)
          }
        }
      }
    }
  }
}
```