Tessl Tile for maven/com.typesafe.akka/akka-cluster_2.12@2.8.0

# Split Brain Resolution

Split Brain Resolution (SBR) in Akka Cluster provides strategies for handling network partitions by automatically downing unreachable members to maintain cluster consistency and availability.

## DowningProvider API

### Base DowningProvider

```scala { .api }
abstract class DowningProvider {
  def downRemovalMargin: FiniteDuration
  def downingActorProps: Option[Props]
}

object DowningProvider {
  def load(fqcn: String, system: ActorSystem): DowningProvider
}
```

### NoDowning (Default)

```scala { .api }
class NoDowning extends DowningProvider {
  override def downRemovalMargin: FiniteDuration = Duration.Zero
  override def downingActorProps: Option[Props] = None
}
```

## Split Brain Resolver Provider

### SplitBrainResolverProvider

```scala { .api }
class SplitBrainResolverProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration
  override def downingActorProps: Option[Props]
}
```

### Configuration

```hocon
akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

  split-brain-resolver {
    # Select strategy: keep-majority, lease-majority, static-quorum, keep-oldest, down-all
    active-strategy = "keep-majority"

    # How long membership and reachability must stay unchanged
    # before SBR takes a downing decision
    stable-after = 20s

    # If on, down all members when the cluster stays unstable (membership or
    # reachability keeps changing) for too long after stable-after has passed.
    # Accepts on, off, or a duration.
    down-all-when-unstable = on
  }
}
```
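
The timing of the two settings above is easier to follow with concrete numbers. The sketch below assumes the documented default in which `down-all-when-unstable = on` derives its timeout as 3/4 of `stable-after`, with a floor of 4 seconds; check the `reference.conf` of your Akka version before relying on it. `DownAllTimingSketch` and `derivedDownAllTimeout` are hypothetical names, not part of the Akka API.

```scala
import scala.concurrent.duration._

// Hypothetical helper, not part of Akka: models how the "on" setting is
// assumed to translate into a concrete timeout.
object DownAllTimingSketch {
  // Assumption: 3/4 of stable-after, but never less than 4 seconds.
  def derivedDownAllTimeout(stableAfter: FiniteDuration): FiniteDuration =
    (stableAfter * 3L / 4L).max(4.seconds)
}
```

With `stable-after = 20s` this yields 15 seconds, so under that assumption a cluster that never settles would be fully downed roughly 35 seconds after the first change.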

## SBR Strategies

### Keep Majority Strategy

Keeps the partition with the majority of nodes and downs the minority.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"

  keep-majority {
    # If set, only members with this role are counted when deciding the majority
    role = ""
  }
}
```

**Behavior:**
- The partition with more than 50% of the nodes survives
- Minority partitions are downed
- Equal-sized partitions: the side containing the member with the lowest address survives, and the other side is downed

**Usage Example:**
```scala
// 5-node cluster splits into 3+2
// 3-node partition survives, 2-node partition is downed

// 4-node cluster splits into 2+2
// The side holding the lowest-address member survives; the other side is downed
```
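
The keep-majority rule can be sketched as a pure function that each side of a partition evaluates from its own point of view. This is a simplified model, not Akka's implementation: `KeepMajoritySketch`, `Node`, and `survives` are hypothetical names, role filtering is left out, and addresses are compared as plain strings, whereas Akka orders real member addresses by host and port.

```scala
// Simplified, hypothetical model of the keep-majority decision.
object KeepMajoritySketch {
  // Stand-in for a cluster member; only the address matters here.
  final case class Node(address: String)

  // Returns true if `mySide` should keep running, false if it should be downed.
  def survives(mySide: Set[Node], otherSide: Set[Node]): Boolean =
    if (mySide.size != otherSide.size) {
      mySide.size > otherSide.size
    } else {
      // Tie: the side that contains the lowest address wins.
      val all = mySide ++ otherSide
      all.nonEmpty && mySide.contains(all.minBy(_.address))
    }
}
```

Because both sides apply the same rule to the same membership view, they reach opposite answers, which is what prevents two surviving partitions.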

### Lease Majority Strategy

Uses a distributed lease (lock) to determine which partition can continue.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "lease-majority"

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    # How long the minority side waits before it tries to acquire the lease,
    # which gives the majority side a head start
    acquire-lease-delay-for-minority = 2s
    # Release the lease after this duration
    release-after = 40s
  }
}
```

**Behavior:**
- Each side tries to acquire the lease; the side that gets it survives, and a side that fails to get it is downed
- The minority side delays its attempt by `acquire-lease-delay-for-minority`, so the majority side normally wins
- Only one partition can hold the lease at a time

### Static Quorum Strategy

Downs any partition that has fewer members than the configured quorum size.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "static-quorum"

  static-quorum {
    # Minimum number of members a partition needs in order to survive
    quorum-size = 3

    # If set, only members with this role are counted toward the quorum
    role = ""
  }
}
```

**Behavior:**
- Partitions with fewer than `quorum-size` nodes are downed
- More than one partition can survive if the cluster has grown beyond `quorum-size * 2 - 1` members, so keep the cluster at or below that size
- Useful for clusters with a fixed, known number of nodes
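
The quorum check and the size limit it implies can be sketched in a few lines. `StaticQuorumSketch` and its methods are hypothetical names used only for illustration; role filtering is omitted.

```scala
// Simplified, hypothetical model of the static-quorum decision.
object StaticQuorumSketch {
  // A side survives only if it still sees at least `quorumSize` members.
  def survives(reachableCount: Int, quorumSize: Int): Boolean =
    reachableCount >= quorumSize

  // Largest cluster size for which two sides can never both meet the quorum.
  def maxSafeClusterSize(quorumSize: Int): Int =
    quorumSize * 2 - 1
}
```

With `quorum-size = 3` the limit is 5 members. A 6-node cluster that splits 3+3 would leave both sides satisfying the quorum, which is exactly the split brain this strategy is meant to prevent.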

### Keep Oldest Strategy

Keeps the partition containing the oldest member (the member that joined the cluster first).

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "keep-oldest"

  keep-oldest {
    # If set, only members with this role are considered when determining the oldest member
    role = ""

    # If on and the oldest member is partitioned from all other members,
    # down the oldest member and keep the rest
    down-if-alone = on
  }
}
```

**Behavior:**
- Partition with oldest member survives
- Other partitions are downed
- With `down-if-alone = on`, an oldest member that is cut off on its own is downed instead, and the other members survive
- Deterministic: the same partition scenario always yields the same result
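
The keep-oldest rule, including the `down-if-alone` exception, can be sketched as a pure function. This is a simplified model under stated assumptions: `KeepOldestSketch`, `Node`, and the `joinOrder` field are hypothetical (a lower `joinOrder` stands for an earlier join), roles are ignored, and the exception is assumed to apply only when the oldest member is alone and at least two members are on the other side.

```scala
// Simplified, hypothetical model of the keep-oldest decision.
object KeepOldestSketch {
  // Stand-in for a cluster member; lower joinOrder means it joined earlier.
  final case class Node(name: String, joinOrder: Int)

  // Returns true if `mySide` should keep running, false if it should be downed.
  def survives(mySide: Set[Node], otherSide: Set[Node], downIfAlone: Boolean): Boolean = {
    val all = mySide ++ otherSide
    if (all.isEmpty) false
    else {
      val oldest = all.minBy(_.joinOrder)
      val oldestOnMySide = mySide.contains(oldest)
      val (oldestSide, restSide) =
        if (oldestOnMySide) (mySide, otherSide) else (otherSide, mySide)
      // Exception: an isolated oldest member gives way to the rest of the cluster.
      val oldestGivesWay = downIfAlone && oldestSide.size == 1 && restSide.size >= 2
      if (oldestGivesWay) !oldestOnMySide else oldestOnMySide
    }
  }
}
```

Without the exception, a network fault that isolates only the oldest node would take down every other member, which is why `down-if-alone` is usually left on.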

### Down All Strategy

Downs all members of the cluster, on every side of the partition, when unreachability is detected.

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = "down-all"
}
```

**Behavior:**
- All members are downed, including the reachable ones
- No partition continues; the nodes must be restarted and form a new cluster
- Use with caution in production: every partition triggers a full cluster restart, which suits small clusters on unreliable networks

## SBR Settings

### SplitBrainResolverSettings

```scala { .api }
class SplitBrainResolverSettings(config: Config) {
  def activeStrategy: String
  def stableAfter: FiniteDuration
  def downAllWhenUnstable: DownAllWhenUnstable
}

sealed trait DownAllWhenUnstable
case object DownAllWhenUnstableOn extends DownAllWhenUnstable
case object DownAllWhenUnstableOff extends DownAllWhenUnstable
```

### Global SBR Configuration

```hocon
akka.cluster.split-brain-resolver {
  # Strategy to use
  active-strategy = "keep-majority"

  # How long the cluster must be stable before a downing decision is taken
  stable-after = 20s

  # Down all members if the cluster stays unstable for too long
  down-all-when-unstable = on

  # Additional settings per strategy
  keep-majority {
    # Count only members with this role when deciding the majority
    role = "core"
  }

  static-quorum {
    quorum-size = 3
    role = "important"
  }

  keep-oldest {
    role = "seed"
    down-if-alone = off
  }

  lease-majority {
    lease-implementation = "akka.coordination.lease.kubernetes"
    acquire-lease-delay-for-minority = 2s
    release-after = 40s
  }
}
```

## Custom Downing Provider

### Creating Custom Provider

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.{Cluster, DowningProvider, Member}
import akka.cluster.ClusterEvent.UnreachableMember
import scala.concurrent.duration._

class CustomDowningProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration = 10.seconds

  override def downingActorProps: Option[Props] =
    Some(Props(classOf[CustomDowningActor]))
}

class CustomDowningActor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  // Subscribe to unreachability events
  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[UnreachableMember])
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive: Receive = {
    case UnreachableMember(member) =>
      log.info("Member {} is unreachable", member)

      // Custom downing logic
      if (shouldDownMember(member)) {
        log.warning("Downing unreachable member {}", member)
        cluster.down(member.address)
      }
  }

  private def shouldDownMember(member: Member): Boolean = {
    // Placeholder: downs every unreachable member immediately.
    // In practice, track how long the member has been unreachable
    // (for example, down it only after 30 seconds).
    true
  }
}
```
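
The `shouldDownMember` placeholder above notes that a real implementation would track how long a member has been unreachable. One way to do that is sketched below as plain Scala with no Akka dependency. `UnreachableTracker` and its methods are hypothetical names, addresses are represented as strings for simplicity, and the caller supplies the current time so the logic stays easy to test.

```scala
import scala.concurrent.duration.FiniteDuration

// Hypothetical helper: remembers when each address first became unreachable.
final class UnreachableTracker(threshold: FiniteDuration) {
  private var unreachableSince = Map.empty[String, Long]

  // Record the first time an address was seen as unreachable; repeats are ignored.
  def markUnreachable(address: String, nowMillis: Long): Unit =
    if (!unreachableSince.contains(address)) unreachableSince += (address -> nowMillis)

  // Forget an address once it is reachable again.
  def markReachable(address: String): Unit =
    unreachableSince -= address

  // True once the address has been unreachable for at least `threshold`.
  def shouldDown(address: String, nowMillis: Long): Boolean =
    unreachableSince.get(address).exists(since => nowMillis - since >= threshold.toMillis)
}
```

In an actor this would be driven by `UnreachableMember` and `ReachableMember` events plus a periodic tick. A time threshold alone does not prevent split brain, because each side of a partition may down the other; it only filters out short-lived network glitches.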

### Registering Custom Provider

```hocon
akka.cluster.downing-provider-class = "com.example.CustomDowningProvider"
```

## SBR Monitoring and Observability

### SBR Decision Logging

```scala
// SBR logs each downing decision it takes.
// Illustrative log messages (exact wording and log level depend on the Akka version):
// "SBR is downing [Member(akka://sys@host1:2551, Up)] in partition [...]"
// "SBR is keeping partition [Member(akka://sys@host2:2551, Up), ...]"
```

### Monitoring SBR Events

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.{Cluster, MemberStatus}
import akka.cluster.ClusterEvent._

class SBRMonitor extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    cluster.subscribe(self, classOf[MemberDowned], classOf[MemberRemoved])
  }

  override def postStop(): Unit = {
    cluster.unsubscribe(self)
  }

  def receive: Receive = {
    case MemberDowned(member) =>
      // Fires for any downing, whether decided by SBR or triggered manually
      log.warning("Member downed: {}", member)
      // Send alert/metric

    case MemberRemoved(member, previousStatus) =>
      if (previousStatus == MemberStatus.Down) {
        log.info("Previously downed member removed: {}", member)
        // Update monitoring dashboard
      }
  }
}
```

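### Health Check Integration

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster
import akka.http.scaladsl.model.{HttpResponse, StatusCodes}
import akka.http.scaladsl.server.Route
import akka.http.scaladsl.server.Directives._

def healthRoute(system: ActorSystem): Route = {
  path("health") {
    get {
      // Evaluate inside complete so the cluster state is read on every request
      complete {
        val unreachableCount = Cluster(system).state.unreachable.size

        if (unreachableCount == 0) {
          HttpResponse(StatusCodes.OK, entity = "healthy")
        } else {
          // Return 503 so probes and load balancers see the node as unhealthy
          HttpResponse(
            StatusCodes.ServiceUnavailable,
            entity = s"unhealthy: $unreachableCount unreachable members")
        }
      }
    }
  }
}
```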

## Production Best Practices

### Strategy Selection Guidelines

**Keep Majority:**
- Best for most scenarios
- Good balance of availability and consistency
- Works well with an odd number of nodes, which avoids ties

**Lease Majority:**
- Use with external coordination systems (Kubernetes, etcd)
- Provides the strongest consistency guarantees
- Requires a reliable lease implementation

**Static Quorum:**
- Use when the minimum cluster size is known
- Good for clusters with well-defined capacity requirements
- May result in multiple surviving partitions if the cluster grows beyond `quorum-size * 2 - 1` members

**Keep Oldest:**
- Use when one node has special significance, such as the node hosting a Cluster Singleton
- Deterministic but potentially less available
- Good for master/worker patterns

### Configuration Recommendations

```hocon
# Production configuration
akka.cluster.split-brain-resolver {
  active-strategy = "keep-majority"
  stable-after = 30s  # Allow time for transient network issues
  down-all-when-unstable = on  # Down everything if the cluster never stabilizes

  keep-majority {
    # Use role-based majority for heterogeneous clusters
    role = "core"
  }
}
```

### Operational Considerations

```scala
import akka.cluster.{Cluster, MemberStatus}

// Assumes an ActorSystem `system` and a logger `log` are in scope

// Monitor cluster health
val cluster = Cluster(system)

// Check for unreachable members
val unreachableMembers = cluster.state.unreachable
if (unreachableMembers.nonEmpty) {
  log.warning("Unreachable members detected: {}",
              unreachableMembers.map(_.address).mkString(", "))
}

// Monitor cluster size
val memberCount = cluster.state.members.count(_.status == MemberStatus.Up)
val minimumRequired = 3 // Your application's minimum

if (memberCount < minimumRequired) {
  log.error("Cluster size {} below minimum required {}", memberCount, minimumRequired)
  // Consider alerting or graceful degradation
}
```

### Testing SBR Strategies

```scala
import akka.remote.testkit.MultiNodeSpec
import akka.remote.transport.ThrottlerTransportAdapter.Direction
import scala.concurrent.duration._

// Use MultiNodeSpec for testing split brain scenarios.
// Assumes SplitBrainConfig defines the roles first..fifth with testTransport(on = true),
// and that awaitClusterUp, cluster, and the ScalaTest mix-in come from your own test base.
class SplitBrainResolverSpec extends MultiNodeSpec(SplitBrainConfig) {

  "Split Brain Resolver" should {
    "down minority partition in keep-majority strategy" in {
      // Create 5-node cluster
      awaitClusterUp(first, second, third, fourth, fifth)

      // Partition cluster into 3+2; issue the commands once, from a single node
      runOn(first) {
        for (a <- List(first, second, third); b <- List(fourth, fifth))
          testConductor.blackhole(a, b, Direction.Both).await
      }
      enterBarrier("partitioned")

      // Verify majority partition (first, second, third) survives.
      // The timeout must exceed failure detection time plus stable-after.
      runOn(first, second, third) {
        within(30.seconds) {
          awaitAssert {
            cluster.state.members.size should be(3)
            cluster.state.unreachable should be(empty)
          }
        }
      }

      // Verify minority partition (fourth, fifth) is downed
      runOn(fourth, fifth) {
        within(30.seconds) {
          awaitAssert {
            cluster.isTerminated should be(true)
          }
        }
      }
    }
  }
}
```
