# Apache Spark GraphX

GraphX is Apache Spark's API for graphs and graph-parallel computation. It provides a distributed graph processing framework built on top of Spark RDDs, offering both graph-parallel and data-parallel views of the same physical data. GraphX enables users to seamlessly move between graph structures and tabular data, making it ideal for ETL, exploratory analysis, and iterative graph computation.

## Package Information

- **Package Name**: org.apache.spark/spark-graphx_2.12
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Add to `build.sbt`: `libraryDependencies += "org.apache.spark" %% "spark-graphx" % "3.5.6"`

## Core Imports

```scala
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
```

For specific imports:

```scala
import org.apache.spark.graphx.{Graph, VertexId, Edge, EdgeTriplet}
import org.apache.spark.graphx.{VertexRDD, EdgeRDD, GraphOps}
import org.apache.spark.graphx.{PartitionStrategy, EdgeDirection}
```

## Basic Usage

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// `sc` is an existing SparkContext (spark-shell provides one)

// Create vertices RDD: (VertexId, VertexAttribute)
val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
  (1L, "Alice"),
  (2L, "Bob"),
  (3L, "Charlie")
))

// Create edges RDD
val edges: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(1L, 2L, "friend"),
  Edge(2L, 3L, "friend"),
  Edge(3L, 1L, "friend")
))

// Build the graph
val graph: Graph[String, String] = Graph(vertices, edges)

// Basic operations
println(s"Vertices: ${graph.numVertices}")
println(s"Edges: ${graph.numEdges}")

// Transform vertex attributes
val transformedGraph = graph.mapVertices((id, attr) => attr.toUpperCase)

// Run PageRank until the per-vertex change falls below the tolerance
val ranks = graph.pageRank(0.0001).vertices
```
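
The PageRank result in the basic example is keyed by vertex ID only. The following is a minimal sketch of joining the scores back to the vertex names. It assumes a local SparkContext; the `local[*]` master, the app name, and the `rankedNames` value are illustrative choices rather than anything GraphX prescribes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: a local context for experimentation; in spark-shell the existing one is reused
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-ranks-sketch").setMaster("local[*]"))

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend"), Edge(3L, 1L, "friend")))
val graph: Graph[String, String] = Graph(vertices, edges)

// PageRank yields (VertexId, rank) pairs; join on the ID to recover each name
val ranks = graph.pageRank(0.0001).vertices
val rankedNames: Array[(String, Double)] = vertices.join(ranks)
  .map { case (_, (name, rank)) => (name, rank) }
  .collect()
  .sortBy(_._1)

rankedNames.foreach { case (name, rank) => println(f"$name%-8s $rank%.4f") }
```

Because the three vertices form a symmetric cycle, their scores should come out equal; on an asymmetric graph the join is what lets you report scores by name.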

## Architecture

GraphX is built around several key components:

- **Graph Abstraction**: Immutable, distributed graphs with typed vertex and edge attributes
- **Specialized RDDs**: `VertexRDD` and `EdgeRDD` provide efficient graph-specific operations
- **Triplet View**: `EdgeTriplet` joins edges with adjacent vertex attributes for message passing
- **Partitioning Strategies**: Optimize data locality and minimize communication overhead
- **Pregel API**: Vertex-centric programming model for iterative graph algorithms
- **Algorithm Library**: Pre-implemented graph algorithms like PageRank and Connected Components
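
To make the triplet view from the list above concrete, here is a small sketch that reads each edge together with the attributes of its two endpoints. The graph contents, the `"follows"` label, and the value names are made up for illustration, and a local SparkContext is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-triplets-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val social: Graph[String, String] = Graph(users, follows)

// Each triplet exposes srcAttr, attr and dstAttr for one edge.
// Map to plain strings before collecting instead of collecting triplets directly.
val sentences: Array[String] = social.triplets
  .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
  .collect()
  .sorted

sentences.foreach(println)
// Alice follows Bob
// Bob follows Charlie
```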

## Capabilities

### Core Graph Operations

Fundamental graph construction, transformation, and analysis operations for building and manipulating graph structures.

```scala { .api }
// Graph construction
def Graph.apply[VD: ClassTag, ED: ClassTag](
  vertices: RDD[(VertexId, VD)],
  edges: RDD[Edge[ED]]
): Graph[VD, ED]

def Graph.fromEdges[VD: ClassTag, ED: ClassTag](
  edges: RDD[Edge[ED]],
  defaultValue: VD
): Graph[VD, ED]

// Graph transformations
def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): Graph[VD, ED2]
```

[Core Graph API](./core-graph-api.md)
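
As a rough illustration of the signatures above, the sketch below builds a graph from edges alone and then derives new graphs with `mapVertices` and `mapEdges`. The edge data and the names `roads`, `base`, `labelled`, and `doubled` are hypothetical; a local SparkContext is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-core-ops-sketch").setMaster("local[*]"))

// Only edges are given; vertices 1, 2 and 3 are created with the default attribute
val roads: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 5), Edge(2L, 3L, 7)))
val base: Graph[String, Int] = Graph.fromEdges(roads, defaultValue = "unknown")

// Graphs are immutable: each transformation returns a new graph and leaves `base` unchanged
val labelled: Graph[String, Int] = base.mapVertices((id, _) => s"v$id")
val doubled: Graph[String, Int] = labelled.mapEdges((e: Edge[Int]) => e.attr * 2)

val vertexLabels = labelled.vertices.collect().sortBy(_._1).toSeq
val edgeWeights = doubled.edges.map(_.attr).collect().sorted.toSeq

println(vertexLabels) // (1,v1), (2,v2), (3,v3)
println(edgeWeights)  // 10, 14
```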

### Graph Algorithms

Built-in graph algorithms, including PageRank, Connected Components, Triangle Counting, and community detection.

```scala { .api }
// PageRank algorithms
def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]

// Component algorithms
def connectedComponents(): Graph[VertexId, ED]
def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]

// Triangle counting
def triangleCount(): Graph[Int, ED]
```

[Graph Algorithms](./graph-algorithms.md)
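
The sketch below runs two of the listed algorithms on a tiny hand-made graph: a triangle (vertices 1, 2, 3) plus a separate edge (4, 5). The data and value names are illustrative, and a local SparkContext is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-algorithms-sketch").setMaster("local[*]"))

val links: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1), // a triangle
  Edge(4L, 5L, 1)                                     // a separate pair
))
val g: Graph[Int, Int] = Graph.fromEdges(links, defaultValue = 0)

// Each vertex is labelled with the lowest vertex ID in its component
val components = g.connectedComponents().vertices.collect().sortBy(_._1).toSeq

// Each vertex is labelled with the number of triangles passing through it
val triangles = g.triangleCount().vertices.collect().sortBy(_._1).toSeq

println(components) // (1,1), (2,1), (3,1), (4,4), (5,4)
println(triangles)  // (1,1), (2,1), (3,1), (4,0), (5,0)
```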

### Pregel Message-Passing API

Vertex-centric programming framework for implementing custom iterative graph algorithms using the Pregel computational model.

```scala { .api }
def pregel[A: ClassTag](
  initialMsg: A,
  maxIterations: Int = Int.MaxValue,
  activeDirection: EdgeDirection = EdgeDirection.Either
)(
  vprog: (VertexId, VD, A) => VD,
  sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
  mergeMsg: (A, A) => A
): Graph[VD, ED]
```

[Pregel API](./pregel-api.md)
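
One common use of the signature above is single-source shortest paths. The following sketch shows how the three functions fit together under that interpretation: `vprog` keeps the smaller distance, `sendMsg` offers a shorter path to a neighbour when one exists, and `mergeMsg` picks the best offer. The edge weights, `sourceId`, and value names are illustrative, and a local SparkContext is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-pregel-sketch").setMaster("local[*]"))

val weighted: RDD[Edge[Double]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
val sourceId: VertexId = 1L

// Distance 0 at the source, infinity everywhere else
val initial: Graph[Double, Double] = Graph.fromEdges(weighted, defaultValue = 0.0)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val shortest = initial.pregel(Double.PositiveInfinity)(
  // vprog: keep the smaller of the current and the incoming distance
  (_, dist, newDist) => math.min(dist, newDist),
  // sendMsg: tell the destination only if going through this edge is shorter
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // mergeMsg: of several offers, keep the shortest
  (a, b) => math.min(a, b)
)

val distances = shortest.vertices.collect().sortBy(_._1).toSeq
println(distances) // (1,0.0), (2,1.0), (3,3.0)
```

Vertex 3 ends at distance 3.0 because the two-hop route through vertex 2 (1.0 + 2.0) beats the direct edge of weight 5.0.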

### Utilities and Graph Generation

Graph loading, generation, and utility functions for creating test graphs, importing data, and performance optimization.

```scala { .api }
// Graph loading
def GraphLoader.edgeListFile(
  sc: SparkContext,
  path: String,
  canonicalOrientation: Boolean = false,
  numEdgePartitions: Int = -1
): Graph[Int, Int]

// Graph generation
def GraphGenerators.logNormalGraph(
  sc: SparkContext,
  numVertices: Int,
  numEParts: Int = 0,
  mu: Double = 4.0,
  sigma: Double = 1.3,
  seed: Long = -1
): Graph[Long, Int]
```

[Utilities](./utilities.md)
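
A short sketch of both utilities follows. It writes a three-edge list to a temporary file (the file name and contents are made up for the example), loads it with `GraphLoader.edgeListFile`, and generates a random test graph with `GraphGenerators.logNormalGraph`. A local SparkContext and a local filesystem are assumed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.util.GraphGenerators

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-utilities-sketch").setMaster("local[*]"))

// One "srcId dstId" pair per line; lines starting with '#' are treated as comments
val edgeFile = Files.createTempFile("edges", ".txt")
Files.write(edgeFile, "# src dst\n1 2\n2 3\n3 1\n".getBytes(StandardCharsets.UTF_8))

// Loaded graphs carry Int attributes on both vertices and edges
val loaded = GraphLoader.edgeListFile(sc, edgeFile.toUri.toString)
val loadedEdges = loaded.numEdges
val loadedVertices = loaded.numVertices

// A random graph whose out-degrees follow a log-normal distribution
val synthetic = GraphGenerators.logNormalGraph(sc, numVertices = 100)
val syntheticVertices = synthetic.numVertices

println(s"loaded: $loadedVertices vertices, $loadedEdges edges")
println(s"synthetic: $syntheticVertices vertices, ${synthetic.numEdges} edges")
```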

## Core Types

```scala { .api }
// Type aliases
type VertexId = Long
type PartitionID = Int

// Core data structures
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

class EdgeTriplet[VD, ED] extends Edge[ED] {
  val srcAttr: VD
  val dstAttr: VD
}

abstract class Graph[VD: ClassTag, ED: ClassTag] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
}

abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
```

## Common Patterns

**Graph Construction from Data:**
```scala
// From edge list file
val graphFromFile = GraphLoader.edgeListFile(sc, "path/to/edges.txt")

// From existing RDDs
val graphFromRdds = Graph(verticesRDD, edgesRDD)

// From edge tuples (srcId, dstId) with default vertex values
val graphFromTuples = Graph.fromEdgeTuples(edgeTuples, defaultValue = "Unknown")
```

**Performance Optimization:**

```scala
// Cache for iterative algorithms
val cachedGraph = graph.cache()

// Partition for better locality
val partitionedGraph = graph.partitionBy(PartitionStrategy.EdgePartition2D)

// Checkpoint for fault tolerance (requires sc.setCheckpointDir(...) to be set first)
graph.checkpoint()
```
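
The three calls above can be combined in one pipeline. The sketch below is one possible ordering, not a tuning recommendation: it repartitions a generated star graph, caches it, sets a checkpoint directory, and marks the graph for checkpointing before the first action. The temporary directory and value names are assumptions for a local run; on a cluster the checkpoint directory would normally be on a reliable store such as HDFS.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.PartitionStrategy
import org.apache.spark.graphx.util.GraphGenerators

// Assumption: a local context for experimentation
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-performance-sketch").setMaster("local[*]"))

// checkpoint() needs a checkpoint directory; a throwaway local one is used here
sc.setCheckpointDir(Files.createTempDirectory("graphx-checkpoints").toString)

// starGraph(sc, 10): vertices 1..9 each have one edge pointing to vertex 0
val star = GraphGenerators.starGraph(sc, 10)
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .cache()

// Mark for checkpointing before the first action; data is written when a job runs
star.checkpoint()
val edgeCount = star.numEdges
val vertexCount = star.numVertices

println(s"$vertexCount vertices, $edgeCount edges")
```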