
# PySpark

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It provides high-level APIs for distributed data processing, data analysis, and machine learning across clusters, letting Python developers work with large-scale datasets through familiar Python syntax.

## Package Information

- **Package Name**: pyspark
- **Language**: Python
- **Installation**: `pip install pyspark`

## Core Imports

```python
import pyspark
from pyspark import SparkContext, SparkConf
```

For SQL and DataFrame operations:

```python
from pyspark.sql import SparkSession, DataFrame, Column, Row
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
```

For window operations:

```python
from pyspark.sql.window import Window, WindowSpec
```

For streaming operations:

```python
from pyspark.sql.streaming import StreamingQuery, StreamingQueryManager
```

For machine learning:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
```

## Basic Usage

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Create DataFrame from data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Basic DataFrame operations
df.show()
df.filter(df.age > 25).show()

# SQL operations
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 25")
result.show()

# Stop the session
spark.stop()
```

## Architecture

PySpark's architecture enables distributed computing through several key components:

- **SparkSession**: Main entry point providing a unified API for Spark functionality
- **SparkContext**: Low-level Spark functionality and RDD operations
- **DataFrames**: Distributed data collections with schema and SQL support
- **RDDs**: Resilient Distributed Datasets, the fundamental data abstraction
- **Executors**: Distributed workers that execute tasks across the cluster
- **Driver**: Coordinates the Spark application and distributes work

This architecture allows PySpark to process large datasets by distributing computations across cluster nodes while providing fault tolerance, automatic optimization, and familiar Python APIs for data manipulation, SQL queries, machine learning, and streaming analytics.
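As a rough illustration of how these pieces fit together (a minimal sketch using a local master purely for demonstration), the driver-side `SparkSession` wraps a `SparkContext`, which hands partitioned work to executors:

```python
from pyspark.sql import SparkSession

# Driver-side entry points: the session wraps a SparkContext
spark = SparkSession.builder.master("local[2]").appName("ArchitectureDemo").getOrCreate()
sc = spark.sparkContext

# The driver schedules tasks; each partition is processed by an executor
rdd = sc.parallelize(range(8), numSlices=4)
print(rdd.getNumPartitions())            # 4
print(rdd.map(lambda x: x * x).sum())    # 140

spark.stop()
```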

## Capabilities

### Core Spark Context and RDDs

Low-level distributed computing with Resilient Distributed Datasets (RDDs), broadcast variables, accumulators, and Spark configuration. Provides foundational distributed computing primitives.

```python { .api }
class SparkContext:
    def __init__(self, master=None, appName=None, conf=None): ...
    def parallelize(self, c, numSlices=None): ...
    def textFile(self, name, minPartitions=None, use_unicode=True): ...

class RDD:
    def map(self, f): ...
    def filter(self, f): ...
    def collect(self): ...

class SparkConf:
    def setAppName(self, value): ...
    def setMaster(self, value): ...
```

[Core Spark Context and RDDs](./core-context-rdds.md)
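A minimal usage sketch of these primitives (names and values are illustrative):

```python
from pyspark import SparkConf, SparkContext

# Configure and start a context, then run a simple transform/action chain
conf = SparkConf().setAppName("RDDDemo").setMaster("local[2]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16, 25]

sc.stop()
```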

### SQL and DataFrames

Structured data processing with DataFrames, SQL queries, data I/O, and 500+ built-in functions. Provides the primary interface for structured data analysis and processing.

```python { .api }
class SparkSession:
    def createDataFrame(self, data, schema=None): ...
    def sql(self, sqlQuery): ...
    read: DataFrameReader
    def table(self, tableName): ...

class DataFrame:
    def select(self, *cols): ...
    def filter(self, condition): ...
    def groupBy(self, *cols): ...
    def show(self, n=20, truncate=True): ...
```

[SQL and DataFrames](./sql-dataframes.md)
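A short sketch of the DataFrame reader and transformation APIs; the file path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

# Hypothetical CSV with department and salary columns
df = spark.read.option("header", True).option("inferSchema", True).csv("employees.csv")

(df.filter(col("salary") > 50000)
   .groupBy("department")
   .agg(avg("salary").alias("avg_salary"))
   .show())

spark.stop()
```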

### Machine Learning (ML)

Modern machine learning pipeline API with estimators, transformers, and comprehensive algorithms for classification, regression, clustering, and feature processing.

```python { .api }
class Pipeline:
    def __init__(self, stages=None): ...
    def fit(self, dataset): ...

class LogisticRegression:
    def __init__(self, featuresCol="features", labelCol="label"): ...
    def fit(self, dataset): ...

class VectorAssembler:
    def __init__(self, inputCols=None, outputCol=None): ...
```

[Machine Learning (ML)](./machine-learning.md)
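A compact pipeline sketch wiring these pieces together on toy data (feature and label names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLDemo").getOrCreate()

# Toy training data: two numeric features and a binary label
train = spark.createDataFrame(
    [(1.0, 0.5, 1.0), (0.2, 0.1, 0.0), (0.9, 0.8, 1.0), (0.1, 0.3, 0.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("label", "prediction").show()
spark.stop()
```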

### Legacy MLlib

Legacy machine learning library with RDD-based algorithms for classification, regression, clustering, and collaborative filtering. Maintained for backward compatibility.

```python { .api }
class LogisticRegressionWithSGD:
    @classmethod
    def train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0): ...

class KMeans:
    @classmethod
    def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"): ...
```

[Legacy MLlib](./legacy-mllib.md)
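A minimal sketch of the RDD-based API, here clustering a few 2-D points (the data is illustrative):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[2]", "MLlibDemo")

# Two well-separated groups of 2-D points
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)
print(model.predict([0.05, 0.05]))
sc.stop()
```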

### Pandas API on Spark

Pandas-compatible API for familiar pandas operations on distributed datasets. Enables seamless scaling of pandas workflows to large datasets.

```python { .api }
class DataFrame:
    def head(self, n=5): ...
    def describe(self): ...
    def groupby(self, by=None): ...
    def merge(self, right, how="inner", on=None): ...

# Module-level functions
def read_csv(path, **kwargs): ...
def concat(objs, axis=0): ...
```

[Pandas API on Spark](./pandas-api.md)
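A brief sketch of pandas-style operations running on Spark (column names and values are illustrative):

```python
import pyspark.pandas as ps

# A pandas-on-Spark DataFrame behaves like a pandas DataFrame
psdf = ps.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dora"],
    "dept": ["eng", "eng", "ops", "ops"],
    "salary": [100, 80, 90, 95],
})

print(psdf.head(2))
print(psdf.groupby("dept")["salary"].mean())
print(psdf.describe())
```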

### Streaming

Real-time data processing for continuous data ingestion, processing, and output to various sinks, covering both Structured Streaming and the legacy DStream-based API.

```python { .api }
class StreamingContext:
    def __init__(self, sparkContext, batchDuration): ...
    def socketTextStream(self, hostname, port): ...
    def start(self): ...

class DStream:
    def map(self, f): ...
    def filter(self, f): ...
    def foreachRDD(self, func): ...
```

[Streaming](./streaming.md)
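A minimal Structured Streaming word-count sketch; it assumes a text source on localhost:9999 (for example `nc -lk 9999`), which is purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()

# Read lines from a socket source (assumed to be running locally)
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and maintain a running count
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```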

### Resource Management

Resource allocation and management for Spark applications including task resources, executor resources, and resource profiles for optimized cluster utilization.

```python { .api }
class ResourceProfile:
    def __init__(self, executorResources=None, taskResources=None): ...

class ExecutorResourceRequests:
    def cores(self, amount): ...
    def memory(self, amount): ...
```

[Resource Management](./resource-management.md)
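A hedged sketch of building a resource profile with `ResourceProfileBuilder` and attaching it to an RDD; the resource amounts are illustrative, and stage-level scheduling is only honored by cluster managers that support it:

```python
from pyspark.sql import SparkSession
from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder

spark = SparkSession.builder.appName("ResourceDemo").getOrCreate()
sc = spark.sparkContext

# Request 4 cores / 8g per executor and 2 CPUs per task for a specific stage
executor_reqs = ExecutorResourceRequests().cores(4).memory("8g")
task_reqs = TaskResourceRequests().cpus(2)
profile = ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build

# Attach the profile to an RDD; requires a cluster manager with
# stage-level scheduling support (e.g. YARN/Kubernetes with dynamic allocation)
rdd = sc.parallelize(range(100)).withResources(profile)
print(rdd.getResourceProfile())
```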

## Types

```python { .api }
class StorageLevel:
    DISK_ONLY: StorageLevel
    MEMORY_ONLY: StorageLevel
    MEMORY_AND_DISK: StorageLevel

class TaskContext:
    def partitionId(self): ...
    def stageId(self): ...
    def taskAttemptId(self): ...

class Row:
    def __init__(self, **kwargs): ...
    def asDict(self): ...

# Data Types
class DataType:
    """Base class for data types."""

class StringType(DataType):
    """String data type."""

class IntegerType(DataType):
    """Integer data type."""

class LongType(DataType):
    """Long integer data type."""

class FloatType(DataType):
    """Float data type."""

class DoubleType(DataType):
    """Double precision float data type."""

class BooleanType(DataType):
    """Boolean data type."""

class TimestampType(DataType):
    """Timestamp data type."""

class DateType(DataType):
    """Date data type."""

class ArrayType(DataType):
    """Array data type."""
    def __init__(self, elementType, containsNull=True): ...

class MapType(DataType):
    """Map data type."""
    def __init__(self, keyType, valueType, valueContainsNull=True): ...

class StructType(DataType):
    """Struct data type representing a row."""
    def __init__(self, fields=None): ...
    def add(self, field, data_type=None, nullable=True, metadata=None): ...

class StructField:
    """Field in a StructType."""
    def __init__(self, name, dataType, nullable=True, metadata=None): ...

# Common Exception Types
class PySparkException(Exception):
    """Base exception for PySpark errors."""

class AnalysisException(PySparkException):
    """Exception thrown when analysis of a SQL query fails."""

class ParseException(PySparkException):
    """Exception thrown when parsing of a SQL query fails."""

class StreamingQueryException(PySparkException):
    """Exception thrown by streaming queries."""
```
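
A short usage sketch tying these types together: defining an explicit schema and creating a DataFrame from `Row` objects (names and values are illustrative):

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("TypesDemo").getOrCreate()

# Explicit schema instead of relying on inference
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([Row(name="Alice", age=25), Row(name="Bob", age=30)], schema)
df.printSchema()
df.show()
spark.stop()
```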