GraphX is Apache Spark's API for graphs and graph-parallel computation
npx @tessl/cli install tessl/maven-org-apache-spark--spark-graphx_2-12@3.5.00
# Apache Spark GraphX
1
2
GraphX is Apache Spark's API for graphs and graph-parallel computation. It provides a distributed graph processing framework built on top of Spark RDDs, offering both graph-parallel and data-parallel views of the same physical data. GraphX enables users to seamlessly move between graph structures and tabular data, making it ideal for ETL, exploratory analysis, and iterative graph computation.
3
4
## Package Information
5
6
- **Package Name**: org.apache.spark/spark-graphx_2.12
7
- **Package Type**: maven
8
- **Language**: Scala
9
- **Installation**: Add to `build.sbt`: `"org.apache.spark" %% "spark-graphx" % "3.5.6"`
10
11
## Core Imports
12
13
```scala
14
import org.apache.spark.graphx._
15
import org.apache.spark.graphx.lib._
16
import org.apache.spark.rdd.RDD
17
import org.apache.spark.{SparkContext, SparkConf}
18
import org.apache.spark.storage.StorageLevel
19
```
20
21
For specific imports:
22
23
```scala
24
import org.apache.spark.graphx.{Graph, VertexId, Edge, EdgeTriplet}
25
import org.apache.spark.graphx.{VertexRDD, EdgeRDD, GraphOps}
26
import org.apache.spark.graphx.{PartitionStrategy, EdgeDirection}
27
```
28
29
## Basic Usage
30
31
```scala
32
import org.apache.spark.graphx._
33
import org.apache.spark.rdd.RDD
34
35
// Create vertices RDD: (VertexId, VertexAttribute)
36
val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
37
(1L, "Alice"),
38
(2L, "Bob"),
39
(3L, "Charlie")
40
))
41
42
// Create edges RDD
43
val edges: RDD[Edge[String]] = sc.parallelize(Array(
44
Edge(1L, 2L, "friend"),
45
Edge(2L, 3L, "friend"),
46
Edge(3L, 1L, "friend")
47
))
48
49
// Build the graph
50
val graph: Graph[String, String] = Graph(vertices, edges)
51
52
// Basic operations
53
println(s"Vertices: ${graph.numVertices}")
54
println(s"Edges: ${graph.numEdges}")
55
56
// Transform vertex attributes
57
val transformedGraph = graph.mapVertices((id, attr) => attr.toUpperCase)
58
59
// Run PageRank algorithm
60
val ranks = graph.pageRank(0.0001).vertices
61
```
62
63
## Architecture
64
65
GraphX is built around several key components:
66
67
- **Graph Abstraction**: Immutable, distributed graphs with typed vertex and edge attributes
68
- **Specialized RDDs**: `VertexRDD` and `EdgeRDD` provide efficient graph-specific operations
69
- **Triplet View**: `EdgeTriplet` joins edges with adjacent vertex attributes for message passing
70
- **Partitioning Strategies**: Optimize data locality and minimize communication overhead
71
- **Pregel API**: Vertex-centric programming model for iterative graph algorithms
72
- **Algorithm Library**: Pre-implemented graph algorithms like PageRank and Connected Components
73
74
## Capabilities
75
76
### Core Graph Operations
77
78
Fundamental graph construction, transformation, and analysis operations for building and manipulating graph structures.
79
80
```scala { .api }
81
// Graph construction
82
def Graph.apply[VD: ClassTag, ED: ClassTag](
83
vertices: RDD[(VertexId, VD)],
84
edges: RDD[Edge[ED]]
85
): Graph[VD, ED]
86
87
def Graph.fromEdges[VD: ClassTag, ED: ClassTag](
88
edges: RDD[Edge[ED]],
89
defaultValue: VD
90
): Graph[VD, ED]
91
92
// Graph transformations
93
def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
94
def mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): Graph[VD, ED2]
95
```
96
97
[Core Graph API](./core-graph-api.md)
98
99
### Graph Algorithms
100
101
Comprehensive collection of graph algorithms including PageRank, Connected Components, Triangle Counting, and community detection.
102
103
```scala { .api }
104
// PageRank algorithms
105
def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
106
def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]
107
108
// Component algorithms
109
def connectedComponents(): Graph[VertexId, ED]
110
def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
111
112
// Community detection
113
def triangleCount(): Graph[Int, ED]
114
```
115
116
[Graph Algorithms](./graph-algorithms.md)
117
118
### Pregel Message-Passing API
119
120
Vertex-centric programming framework for implementing custom iterative graph algorithms using the Pregel computational model.
121
122
```scala { .api }
123
def pregel[A: ClassTag](
124
initialMsg: A,
125
maxIterations: Int = Int.MaxValue,
126
activeDirection: EdgeDirection = EdgeDirection.Either
127
)(
128
vprog: (VertexId, VD, A) => VD,
129
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
130
mergeMsg: (A, A) => A
131
): Graph[VD, ED]
132
```
133
134
[Pregel API](./pregel-api.md)
135
136
### Utilities and Graph Generation
137
138
Graph loading, generation, and utility functions for creating test graphs, importing data, and performance optimization.
139
140
```scala { .api }
141
// Graph loading
142
def GraphLoader.edgeListFile(
143
sc: SparkContext,
144
path: String,
145
canonicalOrientation: Boolean = false,
146
numEdgePartitions: Int = -1
147
): Graph[Int, Int]
148
149
// Graph generation
150
def GraphGenerators.logNormalGraph(
151
sc: SparkContext,
152
numVertices: Int,
153
numEParts: Int = -1,
154
mu: Double = 4.0,
155
sigma: Double = 1.3
156
): Graph[Long, Int]
157
```
158
159
[Utilities](./utilities.md)
160
161
## Core Types
162
163
```scala { .api }
164
// Type aliases
165
type VertexId = Long
166
type PartitionID = Int
167
168
// Core data structures
169
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
170
171
class EdgeTriplet[VD, ED] extends Edge[ED] {
172
val srcAttr: VD
173
val dstAttr: VD
174
}
175
176
abstract class Graph[VD: ClassTag, ED: ClassTag] {
177
val vertices: VertexRDD[VD]
178
val edges: EdgeRDD[ED]
179
val triplets: RDD[EdgeTriplet[VD, ED]]
180
}
181
182
abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
183
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
184
```
185
186
## Common Patterns
187
188
**Graph Construction from Data:**
189
```scala
190
// From edge list file
191
val graph = GraphLoader.edgeListFile(sc, "path/to/edges.txt")
192
193
// From existing RDDs
194
val graph = Graph(verticesRDD, edgesRDD)
195
196
// From edge tuples with default vertex values
197
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = "Unknown")
198
```
199
200
**Performance Optimization:**
201
```scala
202
// Cache for iterative algorithms
203
val cachedGraph = graph.cache()
204
205
// Partition for better locality
206
val partitionedGraph = graph.partitionBy(PartitionStrategy.EdgePartition2D)
207
208
// Checkpoint for fault tolerance
209
graph.checkpoint()
210
```