# PyArrow

PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and compute libraries. It enables efficient data interchange, in-memory analytics, and seamless integration with the Python data science ecosystem, including pandas, NumPy, and big data processing systems.

## Package Information

- **Package Name**: pyarrow
- **Language**: Python
- **Installation**: `pip install pyarrow`
- **Documentation**: https://arrow.apache.org/docs/python

## Core Imports

```python
import pyarrow as pa
```

Common specialized imports:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight
```

## Basic Usage

```python
import pyarrow as pa

# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])

# Create tables
table = pa.table({
    'integers': [1, 2, 3, 4],
    'strings': ['foo', 'bar', 'baz', None],
    'floats': [1.0, 2.5, 3.7, 4.1]
})

# Read/write Parquet files
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')

# Compute operations
import pyarrow.compute as pc
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))
```

## Architecture

PyArrow's design centers around the Arrow columnar memory format:

- **Columnar Storage**: Data organized by columns for efficient analytical operations
- **Zero-Copy Operations**: Memory-efficient data sharing between processes and languages
- **Type System**: Rich data types including nested structures, decimals, and temporal types
- **Compute Engine**: Vectorized operations for high-performance analytics
- **Format Support**: Native support for Parquet, CSV, JSON, ORC, and custom formats
- **Interoperability**: Seamless integration with pandas, NumPy, and other Python libraries

This architecture enables PyArrow to serve as a foundational component for building scalable data processing applications with fast data movement between systems while maintaining memory efficiency through columnar layouts.

## Capabilities

### Core Data Structures

Fundamental data containers including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.

```python { .api }
def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, names=None, schema=None, metadata=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...

class Array: ...
class Table: ...
class Schema: ...
class Field: ...
```

[Core Data Structures](./core-data-structures.md)

### Data Types System

Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides type checking, conversion, and inference capabilities essential for data processing workflows.

```python { .api }
def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...

class DataType: ...
def is_integer(type): ...  # in pyarrow.types
def cast(arr, target_type, safe=True): ...  # in pyarrow.compute
```

[Data Types](./data-types.md)

### Compute Functions

High-performance vectorized compute operations including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides 200+ functions optimized for columnar data.

```python { .api }
def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...
```

[Compute Functions](./compute-functions.md)

### File Format Support

Native support for reading and writing multiple file formats including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.

```python { .api }
# Parquet
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...

# CSV
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...
```

[File Formats](./file-formats.md)

### Memory and I/O Management

Memory pool management, buffer operations, compression codecs, and file system abstraction. Provides control over memory allocation and efficient I/O operations across different storage systems.

```python { .api }
def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...

class Buffer: ...
class MemoryPool: ...
```

[Memory and I/O](./memory-io.md)

### Dataset Operations

Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across multiple files with automatic partition discovery.

```python { .api }
def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...

class Dataset: ...
class Scanner: ...
```

[Dataset Operations](./dataset-operations.md)

### Arrow Flight RPC

High-performance RPC framework for distributed data services. Provides client-server architecture for streaming large datasets with authentication, metadata handling, and custom middleware support.

```python { .api }
def connect(location, **kwargs): ...

class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...
```

[Arrow Flight](./arrow-flight.md)

### Advanced Features

Specialized functionality including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and system integration.

```python { .api }
# CUDA support
class Context: ...
class CudaBuffer: ...

# Substrait integration
def run_query(plan): ...
def serialize_expressions(expressions): ...
```

[Advanced Features](./advanced-features.md)

## Version and Build Information

```python { .api }
def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...
```

Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.

## Exception Handling

```python { .api }
class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...
```

Comprehensive exception hierarchy for error handling in data processing workflows.