Python API for Apache Spark, providing distributed computing, data analysis, and machine learning capabilities
npx @tessl/cli install tessl/pypi-pyspark@4.0.00
# PySpark

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It provides high-level APIs for distributed computing, data analysis, and machine learning workloads across clusters, letting Python developers work with large-scale datasets through familiar Python syntax.

## Package Information

- **Package Name**: pyspark
- **Language**: Python
- **Installation**: `pip install pyspark`

## Core Imports
```python
import pyspark
from pyspark import SparkContext, SparkConf
```

For SQL and DataFrame operations:

```python
from pyspark.sql import SparkSession, DataFrame, Column, Row
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
```

For window operations:

```python
from pyspark.sql.window import Window, WindowSpec
```

For streaming operations:

```python
from pyspark.sql.streaming import StreamingQuery, StreamingQueryManager
```

For machine learning:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
```
## Basic Usage

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Create DataFrame from data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Basic DataFrame operations
df.show()
df.filter(df.age > 25).show()

# SQL operations
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 25")
result.show()

# Stop the session
spark.stop()
```
## Architecture

PySpark's architecture enables distributed computing through several key components:

- **SparkSession**: Main entry point providing a unified API for Spark functionality
- **SparkContext**: Low-level Spark functionality and RDD operations
- **DataFrames**: Distributed data collections with schema and SQL support
- **RDDs**: Resilient Distributed Datasets, the fundamental data abstraction
- **Executors**: Distributed workers that execute tasks across the cluster
- **Driver**: Coordinates the Spark application and distributes work

This architecture allows PySpark to process large datasets by distributing computations across cluster nodes while providing fault tolerance, automatic optimization, and familiar Python APIs for data manipulation, SQL queries, machine learning, and streaming analytics.
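
To make the relationship between these components concrete, here is a minimal, hedged sketch: the app name and the `local[2]` master URL are illustrative assumptions, while `SparkSession`, `sparkContext`, `parallelize`, and `range` are standard PySpark API.

```python
from pyspark.sql import SparkSession

# The SparkSession is the unified entry point, created on the driver
spark = (SparkSession.builder
         .appName("ArchitectureSketch")
         .master("local[2]")  # assumption: a local 2-core "cluster" for illustration
         .getOrCreate())

# The lower-level SparkContext is still reachable for RDD work
sc = spark.sparkContext
print(sc.master, sc.defaultParallelism)

# RDD: the fundamental distributed collection
rdd = sc.parallelize(range(10), numSlices=2)
print(rdd.map(lambda x: x * x).collect())

# DataFrame: a distributed collection with a schema, executed by the same engine
df = spark.range(10)
df.selectExpr("id", "id * id AS squared").show()

spark.stop()
```
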
## Capabilities

### Core Spark Context and RDDs

Low-level distributed computing with Resilient Distributed Datasets (RDDs), broadcast variables, accumulators, and Spark configuration. Provides foundational distributed computing primitives.

```python { .api }
class SparkContext:
    def __init__(self, master=None, appName=None, conf=None): ...
    def parallelize(self, c, numSlices=None): ...
    def textFile(self, name, minPartitions=None, use_unicode=True): ...

class RDD:
    def map(self, f): ...
    def filter(self, f): ...
    def collect(self): ...

class SparkConf:
    def setAppName(self, value): ...
    def setMaster(self, value): ...
```

[Core Spark Context and RDDs](./core-context-rdds.md)
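
A short sketch of these primitives in use; the local master URL, app name, and sample data are assumptions for illustration, and `broadcast`/`accumulator` correspond to the broadcast variables and accumulators mentioned above.

```python
from pyspark import SparkConf, SparkContext

# Assumption: running locally; any master URL would work the same way
conf = SparkConf().setAppName("RDDSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Build an RDD from a local collection and apply lazy transformations
rdd = sc.parallelize(range(1, 11))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# collect() is the action that triggers execution and returns results to the driver
print(evens_squared.collect())  # [4, 16, 36, 64, 100]

# Broadcast variables and accumulators
lookup = sc.broadcast({2: "two", 4: "four"})
counter = sc.accumulator(0)

def tag(x):
    counter.add(1)
    return lookup.value.get(x, "other")

print(rdd.map(tag).collect())
print("rows processed:", counter.value)

sc.stop()
```
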
### SQL and DataFrames

Structured data processing with DataFrames, SQL queries, data I/O, and 500+ built-in functions. Provides the primary interface for structured data analysis and processing.

```python { .api }
class SparkSession:
    def createDataFrame(self, data, schema=None): ...
    def sql(self, sqlQuery): ...
    @property
    def read(self) -> DataFrameReader: ...
    def table(self, tableName): ...

class DataFrame:
    def select(self, *cols): ...
    def filter(self, condition): ...
    def groupBy(self, *cols): ...
    def show(self, n=20, truncate=True): ...
```

[SQL and DataFrames](./sql-dataframes.md)
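
A hedged example of the DataFrame and SQL interfaces listed above; the column names and sample rows are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, when

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "eng", 25), ("Bob", "eng", 30), ("Charlie", "sales", 35)],
    ["name", "dept", "age"],
)

# Column expressions with select and when/otherwise
df.select(
    "name",
    "dept",
    when(col("age") >= 30, "30+").otherwise("under 30").alias("bracket"),
).show()

# Filtering and aggregation
(df.filter(col("age") > 25)
   .groupBy("dept")
   .agg(avg("age").alias("avg_age"))
   .show())

# The same aggregation via SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT dept, AVG(age) AS avg_age FROM people WHERE age > 25 GROUP BY dept").show()

spark.stop()
```
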
### Machine Learning (ML)

Modern machine learning pipeline API with estimators, transformers, and comprehensive algorithms for classification, regression, clustering, and feature processing.

```python { .api }
class Pipeline:
    def __init__(self, stages=None): ...
    def fit(self, dataset): ...

class LogisticRegression:
    def __init__(self, featuresCol="features", labelCol="label"): ...
    def fit(self, dataset): ...

class VectorAssembler:
    def __init__(self, inputCols=None, outputCol=None): ...
```

[Machine Learning (ML)](./machine-learning.md)
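
A minimal pipeline sketch combining the three classes above; the toy training data and column names are assumptions, and `maxIter` is one of LogisticRegression's standard constructor parameters.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLPipelineSketch").getOrCreate()

# Toy training data: two numeric features and a binary label
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.2, 1), (0.1, 1.3, 0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into the single vector column expected by ML estimators
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# fit() runs each stage in order and returns a fitted PipelineModel
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("f1", "f2", "label", "prediction").show()

spark.stop()
```
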
### Legacy MLlib

Legacy machine learning library with RDD-based algorithms for classification, regression, clustering, and collaborative filtering. Maintained for backward compatibility.

```python { .api }
class LogisticRegressionWithSGD:
    @classmethod
    def train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0): ...

class KMeans:
    @classmethod
    def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"): ...
```

[Legacy MLlib](./legacy-mllib.md)
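
As a hedged illustration of the RDD-based style, here is a small `KMeans.train` run using `pyspark.mllib.clustering`; the sample points and parameter values are arbitrary.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(master="local[*]", appName="LegacyMLlibSketch")

# RDD of feature vectors (plain Python lists are accepted)
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a 2-cluster model with the RDD-based API
model = KMeans.train(points, k=2, maxIterations=10, initializationMode="random")
print(model.clusterCenters)
print(model.predict([0.5, 0.5]))

sc.stop()
```
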
### Pandas API on Spark
166
167
Pandas-compatible API for familiar pandas operations on distributed datasets. Enables seamless scaling of pandas workflows to large datasets.
168
169
```python { .api }
170
class DataFrame:
171
def head(self, n=5): ...
172
def describe(self): ...
173
def groupby(self, by=None): ...
174
def merge(self, right, how="inner", on=None): ...
175
176
def read_csv(path, **kwargs): ...
177
def concat(objs, axis=0): ...
178
```
179
180
[Pandas API on Spark](./pandas-api.md)
181
182
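
A brief sketch of the pandas-style interface via `pyspark.pandas`; the column names and values are illustrative.

```python
import pyspark.pandas as ps

# A pandas-like DataFrame whose operations are executed by Spark
psdf = ps.DataFrame({
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [100, 120, 90, 95],
})

print(psdf.head(2))
print(psdf.describe())
print(psdf.groupby("dept").mean())

# Convert to a regular Spark DataFrame when needed
sdf = psdf.to_spark()
sdf.show()
```
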
### Streaming

Real-time data processing for continuous data ingestion, processing, and output to various sinks, covering both Structured Streaming and the legacy DStream API.

```python { .api }
class StreamingContext:
    def __init__(self, sparkContext, batchDuration): ...
    def socketTextStream(self, hostname, port): ...
    def start(self): ...

class DStream:
    def map(self, f): ...
    def filter(self, f): ...
    def foreachRDD(self, func): ...
```

[Streaming](./streaming.md)
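
The API summary above lists the older DStream interface; the sketch below uses the Structured Streaming DataFrame API (`spark.readStream`/`writeStream`) referenced in the Core Imports section, and assumes a line-oriented text server on `localhost:9999` (for example `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read an unbounded stream of lines from a socket source
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Classic streaming word count over the unbounded DataFrame
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write running counts to the console; start() returns a StreamingQuery
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```
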
### Resource Management

Resource allocation and management for Spark applications, including task resources, executor resources, and resource profiles for optimized cluster utilization.

```python { .api }
class ResourceProfile:
    def __init__(self, executorResources=None, taskResources=None): ...

class ExecutorResourceRequests:
    def cores(self, amount): ...
    def memory(self, amount): ...
```

[Resource Management](./resource-management.md)
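
A hedged sketch using the builder classes from `pyspark.resource`; the core and memory amounts are arbitrary, and actually scheduling with a custom profile requires a cluster manager that supports stage-level scheduling, so treat this as illustrative rather than something to run as-is on a plain local session.

```python
from pyspark.sql import SparkSession
from pyspark.resource import (
    ExecutorResourceRequests,
    ResourceProfileBuilder,
    TaskResourceRequests,
)

spark = SparkSession.builder.appName("ResourceProfileSketch").getOrCreate()

# Describe what each executor and each task should get for a particular stage
executor_reqs = ExecutorResourceRequests().cores(4).memory("8g")
task_reqs = TaskResourceRequests().cpus(2)

# Combine the requests into a ResourceProfile (build is a property, not a method)
profile = ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build

# Attach the profile to an RDD so its stages are scheduled with these resources
rdd = spark.sparkContext.parallelize(range(1000)).withResources(profile)
print(rdd.getResourceProfile())

spark.stop()
```
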
## Types

```python { .api }
class StorageLevel:
    DISK_ONLY: StorageLevel
    MEMORY_ONLY: StorageLevel
    MEMORY_AND_DISK: StorageLevel

class TaskContext:
    def partitionId(self): ...
    def stageId(self): ...
    def taskAttemptId(self): ...

class Row:
    def __init__(self, **kwargs): ...
    def asDict(self): ...

# Data Types
class DataType:
    """Base class for data types."""

class StringType(DataType):
    """String data type."""

class IntegerType(DataType):
    """Integer data type."""

class LongType(DataType):
    """Long integer data type."""

class FloatType(DataType):
    """Float data type."""

class DoubleType(DataType):
    """Double precision float data type."""

class BooleanType(DataType):
    """Boolean data type."""

class TimestampType(DataType):
    """Timestamp data type."""

class DateType(DataType):
    """Date data type."""

class ArrayType(DataType):
    """Array data type."""
    def __init__(self, elementType, containsNull=True): ...

class MapType(DataType):
    """Map data type."""
    def __init__(self, keyType, valueType, valueContainsNull=True): ...

class StructType(DataType):
    """Struct data type representing a row."""
    def __init__(self, fields=None): ...
    def add(self, field, data_type=None, nullable=True, metadata=None): ...

class StructField:
    """Field in a StructType."""
    def __init__(self, name, dataType, nullable=True, metadata=None): ...

# Common Exception Types
class PySparkException(Exception):
    """Base exception for PySpark errors."""

class AnalysisException(PySparkException):
    """Exception thrown when analysis of SQL query fails."""

class ParseException(PySparkException):
    """Exception thrown when parsing of SQL query fails."""

class StreamingQueryException(PySparkException):
    """Exception thrown by streaming queries."""
```
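
A short example tying several of these types together: building an explicit schema with `StructType`/`StructField` and catching an `AnalysisException` (imported here from `pyspark.errors`); the sample data and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.errors import AnalysisException

spark = SparkSession.builder.appName("TypesSketch").getOrCreate()

# Explicit schema: non-nullable name, nullable age
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], schema)
df.printSchema()

# Referring to a column that does not exist surfaces as an AnalysisException
try:
    df.select("missing_column").show()
except AnalysisException as err:
    print("analysis error:", err)

spark.stop()
```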