Python API for Apache Spark, providing distributed computing, data analysis, and machine learning capabilities
npx @tessl/cli install tessl/pypi-pyspark@4.0.00
# PySpark

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It provides high-level APIs for distributed computing, data analysis, and machine learning workloads across clusters, letting Python developers work with large-scale datasets through familiar Python syntax.

## Package Information

- **Package Name**: pyspark
- **Language**: Python
- **Installation**: `pip install pyspark`

## Core Imports
```python
import pyspark
from pyspark import SparkContext, SparkConf
```

For SQL and DataFrame operations:

```python
from pyspark.sql import SparkSession, DataFrame, Column, Row
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
```

For window operations:

```python
from pyspark.sql.window import Window, WindowSpec
```

For streaming operations:

```python
from pyspark.sql.streaming import StreamingQuery, StreamingQueryManager
```

For machine learning:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
```
## Basic Usage

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Create DataFrame from data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Basic DataFrame operations
df.show()
df.filter(df.age > 25).show()

# SQL operations
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 25")
result.show()

# Stop the session
spark.stop()
```
## Architecture

PySpark's architecture enables distributed computing through several key components:

- **SparkSession**: Main entry point providing a unified API for Spark functionality
- **SparkContext**: Low-level Spark functionality and RDD operations
- **DataFrames**: Distributed data collections with schema and SQL support
- **RDDs**: Resilient Distributed Datasets, the fundamental data abstraction
- **Executors**: Distributed workers that execute tasks across the cluster
- **Driver**: Coordinates the Spark application and distributes work

This architecture allows PySpark to process large datasets by distributing computations across cluster nodes while providing fault tolerance, automatic optimization, and familiar Python APIs for data manipulation, SQL queries, machine learning, and streaming analytics.
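
To make the relationship between these components concrete, here is a minimal, hedged sketch: the app name and the `local[2]` master URL are illustrative assumptions, while `SparkSession`, `sparkContext`, `parallelize`, and `range` are standard PySpark API.

```python
from pyspark.sql import SparkSession

# The SparkSession is the unified entry point, created on the driver
spark = (SparkSession.builder
         .appName("ArchitectureSketch")
         .master("local[2]")  # assumption: a local 2-core "cluster" for illustration
         .getOrCreate())

# The lower-level SparkContext is still reachable for RDD work
sc = spark.sparkContext
print(sc.master, sc.defaultParallelism)

# RDD: the fundamental distributed collection
rdd = sc.parallelize(range(10), numSlices=2)
print(rdd.map(lambda x: x * x).collect())

# DataFrame: a distributed collection with a schema, executed by the same engine
df = spark.range(10)
df.selectExpr("id", "id * id AS squared").show()

spark.stop()
```
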
## Capabilities

### Core Spark Context and RDDs

Low-level distributed computing with Resilient Distributed Datasets (RDDs), broadcast variables, accumulators, and Spark configuration. Provides foundational distributed computing primitives.

```python { .api }
class SparkContext:
    def __init__(self, master=None, appName=None, conf=None): ...
    def parallelize(self, c, numSlices=None): ...
    def textFile(self, name, minPartitions=None, use_unicode=True): ...

class RDD:
    def map(self, f): ...
    def filter(self, f): ...
    def collect(self): ...

class SparkConf:
    def setAppName(self, value): ...
    def setMaster(self, value): ...
```

[Core Spark Context and RDDs](./core-context-rdds.md)
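
A short sketch of these primitives in use; the local master URL, app name, and sample data are assumptions for illustration, and `broadcast`/`accumulator` correspond to the broadcast variables and accumulators mentioned above.

```python
from pyspark import SparkConf, SparkContext

# Assumption: running locally; any master URL would work the same way
conf = SparkConf().setAppName("RDDSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Build an RDD from a local collection and apply lazy transformations
rdd = sc.parallelize(range(1, 11))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# collect() is the action that triggers execution and returns results to the driver
print(evens_squared.collect())  # [4, 16, 36, 64, 100]

# Broadcast variables and accumulators
lookup = sc.broadcast({2: "two", 4: "four"})
counter = sc.accumulator(0)

def tag(x):
    counter.add(1)
    return lookup.value.get(x, "other")

print(rdd.map(tag).collect())
print("rows processed:", counter.value)

sc.stop()
```
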
### SQL and DataFrames

Structured data processing with DataFrames, SQL queries, data I/O, and 500+ built-in functions. Provides the primary interface for structured data analysis and processing.

```python { .api }
class SparkSession:
    def createDataFrame(self, data, schema=None): ...
    def sql(self, sqlQuery): ...
    @property
    def read(self) -> DataFrameReader: ...
    def table(self, tableName): ...

class DataFrame:
    def select(self, *cols): ...
    def filter(self, condition): ...
    def groupBy(self, *cols): ...
    def show(self, n=20, truncate=True): ...
```

[SQL and DataFrames](./sql-dataframes.md)
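
A hedged example of the DataFrame and SQL interfaces listed above; the column names and sample rows are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, when

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "eng", 25), ("Bob", "eng", 30), ("Charlie", "sales", 35)],
    ["name", "dept", "age"],
)

# Column expressions with select and when/otherwise
df.select(
    "name",
    "dept",
    when(col("age") >= 30, "30+").otherwise("under 30").alias("bracket"),
).show()

# Filtering and aggregation
(df.filter(col("age") > 25)
   .groupBy("dept")
   .agg(avg("age").alias("avg_age"))
   .show())

# The same aggregation via SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT dept, AVG(age) AS avg_age FROM people WHERE age > 25 GROUP BY dept").show()

spark.stop()
```
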
### Machine Learning (ML)

Modern machine learning pipeline API with estimators, transformers, and comprehensive algorithms for classification, regression, clustering, and feature processing.

```python { .api }
class Pipeline:
    def __init__(self, stages=None): ...
    def fit(self, dataset): ...

class LogisticRegression:
    def __init__(self, featuresCol="features", labelCol="label"): ...
    def fit(self, dataset): ...

class VectorAssembler:
    def __init__(self, inputCols=None, outputCol=None): ...
```

[Machine Learning (ML)](./machine-learning.md)
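
A minimal pipeline sketch combining the three classes above; the toy training data and column names are assumptions, and `maxIter` is one of LogisticRegression's standard constructor parameters.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLPipelineSketch").getOrCreate()

# Toy training data: two numeric features and a binary label
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.2, 1), (0.1, 1.3, 0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into the single vector column expected by ML estimators
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# fit() runs each stage in order and returns a fitted PipelineModel
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("f1", "f2", "label", "prediction").show()

spark.stop()
```
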
### Legacy MLlib

Legacy machine learning library with RDD-based algorithms for classification, regression, clustering, and collaborative filtering. Maintained for backward compatibility.

```python { .api }
class LogisticRegressionWithSGD:
    @classmethod
    def train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0): ...

class KMeans:
    @classmethod
    def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"): ...
```

[Legacy MLlib](./legacy-mllib.md)
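
As a hedged illustration of the RDD-based style, here is a small `KMeans.train` run using `pyspark.mllib.clustering`; the sample points and parameter values are arbitrary.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(master="local[*]", appName="LegacyMLlibSketch")

# RDD of feature vectors (plain Python lists are accepted)
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model with the RDD-based API
model = KMeans.train(points, k=2, maxIterations=10, initializationMode="random")
print(model.clusterCenters)
print(model.predict([0.5, 0.5]))

sc.stop()
```
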
### Pandas API on Spark
166
167
Pandas-compatible API for familiar pandas operations on distributed datasets. Enables seamless scaling of pandas workflows to large datasets.
168
169
```python { .api }
170
class DataFrame:
171
def head(self, n=5): ...
172
def describe(self): ...
173
def groupby(self, by=None): ...
174
def merge(self, right, how="inner", on=None): ...
175
176
def read_csv(path, **kwargs): ...
177
def concat(objs, axis=0): ...
178
```
179
180
[Pandas API on Spark](./pandas-api.md)
181
182
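
A brief sketch of the pandas-style interface via `pyspark.pandas`; the column names and values are illustrative.

```python
import pyspark.pandas as ps

# A pandas-like DataFrame whose operations are executed by Spark
psdf = ps.DataFrame({
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [100, 120, 90, 95],
})

print(psdf.head(2))
print(psdf.describe())
print(psdf.groupby("dept").mean())

# Convert to a regular Spark DataFrame when needed
sdf = psdf.to_spark()
sdf.show()
```
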
### Streaming

Real-time data processing for continuous data ingestion, processing, and output to various sinks, covering both Structured Streaming and the legacy DStream API.

```python { .api }
class StreamingContext:
    def __init__(self, sparkContext, batchDuration): ...
    def socketTextStream(self, hostname, port): ...
    def start(self): ...

class DStream:
    def map(self, f): ...
    def filter(self, f): ...
    def foreachRDD(self, func): ...
```

[Streaming](./streaming.md)
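
The API summary above lists the older DStream interface; the sketch below uses the Structured Streaming DataFrame API (`spark.readStream`/`writeStream`) referenced in the Core Imports section, and assumes a line-oriented text server on `localhost:9999` (for example `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read an unbounded stream of lines from a socket source
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Classic streaming word count over the unbounded DataFrame
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write running counts to the console; start() returns a StreamingQuery
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```
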
### Resource Management

Resource allocation and management for Spark applications, including task resources, executor resources, and resource profiles for optimized cluster utilization.

```python { .api }
class ResourceProfile:
    def __init__(self, executorResources=None, taskResources=None): ...

class ExecutorResourceRequests:
    def cores(self, amount): ...
    def memory(self, amount): ...
```

[Resource Management](./resource-management.md)
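
A hedged sketch using the builder classes from `pyspark.resource`; the core and memory amounts are arbitrary, and actually scheduling with a custom profile requires a cluster manager that supports stage-level scheduling, so treat this as illustrative rather than something to run as-is on a plain local session.

```python
from pyspark.sql import SparkSession
from pyspark.resource import (
    ExecutorResourceRequests,
    ResourceProfileBuilder,
    TaskResourceRequests,
)

spark = SparkSession.builder.appName("ResourceProfileSketch").getOrCreate()

# Describe what each executor and each task should get for a particular stage
executor_reqs = ExecutorResourceRequests().cores(4).memory("8g")
task_reqs = TaskResourceRequests().cpus(2)

# Combine the requests into a ResourceProfile (build is a property, not a method)
profile = ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build

# Attach the profile to an RDD so its stages are scheduled with these resources
rdd = spark.sparkContext.parallelize(range(1000)).withResources(profile)
print(rdd.getResourceProfile())

spark.stop()
```
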
## Types

```python { .api }
class StorageLevel:
    DISK_ONLY: StorageLevel
    MEMORY_ONLY: StorageLevel
    MEMORY_AND_DISK: StorageLevel

class TaskContext:
    def partitionId(self): ...
    def stageId(self): ...
    def taskAttemptId(self): ...

class Row:
    def __init__(self, **kwargs): ...
    def asDict(self): ...

# Data Types
class DataType:
    """Base class for data types."""

class StringType(DataType):
    """String data type."""

class IntegerType(DataType):
    """Integer data type."""

class LongType(DataType):
    """Long integer data type."""

class FloatType(DataType):
    """Float data type."""

class DoubleType(DataType):
    """Double precision float data type."""

class BooleanType(DataType):
    """Boolean data type."""

class TimestampType(DataType):
    """Timestamp data type."""

class DateType(DataType):
    """Date data type."""

class ArrayType(DataType):
    """Array data type."""
    def __init__(self, elementType, containsNull=True): ...

class MapType(DataType):
    """Map data type."""
    def __init__(self, keyType, valueType, valueContainsNull=True): ...

class StructType(DataType):
    """Struct data type representing a row."""
    def __init__(self, fields=None): ...
    def add(self, field, data_type=None, nullable=True, metadata=None): ...

class StructField:
    """Field in a StructType."""
    def __init__(self, name, dataType, nullable=True, metadata=None): ...

# Common Exception Types
class PySparkException(Exception):
    """Base exception for PySpark errors."""

class AnalysisException(PySparkException):
    """Exception thrown when analysis of SQL query fails."""

class ParseException(PySparkException):
    """Exception thrown when parsing of SQL query fails."""

class StreamingQueryException(PySparkException):
    """Exception thrown by streaming queries."""
```
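
A short example tying several of these types together: building an explicit schema with `StructType`/`StructField` and catching an `AnalysisException` (imported here from `pyspark.errors`); the sample data and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.errors import AnalysisException

spark = SparkSession.builder.appName("TypesSketch").getOrCreate()

# Explicit schema: non-nullable name, nullable age
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], schema)
df.printSchema()

# Referring to a column that does not exist surfaces as an AnalysisException
try:
    df.select("missing_column").show()
except AnalysisException as err:
    print("analysis error:", err)

spark.stop()
```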