# PyArrow

PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and compute libraries. It enables efficient data interchange, in-memory analytics, and seamless integration with the Python data science ecosystem, including pandas, NumPy, and big data processing systems.

## Package Information

- **Package Name**: pyarrow
- **Language**: Python
- **Installation**: `pip install pyarrow`
- **Documentation**: https://arrow.apache.org/docs/python

## Core Imports

```python
import pyarrow as pa
```

Common specialized imports:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight
```

## Basic Usage

```python
import pyarrow as pa

# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])

# Create tables
table = pa.table({
    'integers': [1, 2, 3, 4],
    'strings': ['foo', 'bar', 'baz', None],
    'floats': [1.0, 2.5, 3.7, 4.1]
})

# Read/write Parquet files
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')

# Compute operations
import pyarrow.compute as pc
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))
```

## Architecture

PyArrow's design centers around the Arrow columnar memory format:

- **Columnar Storage**: Data organized by columns for efficient analytical operations
- **Zero-Copy Operations**: Memory-efficient data sharing between processes and languages
- **Type System**: Rich data types including nested structures, decimals, and temporal types
- **Compute Engine**: Vectorized operations for high-performance analytics
- **Format Support**: Native support for Parquet, CSV, JSON, ORC, and custom formats
- **Interoperability**: Seamless integration with pandas, NumPy, and other Python libraries

This architecture enables PyArrow to serve as a foundational component for building scalable data processing applications with fast data movement between systems while maintaining memory efficiency through columnar layouts.

## Capabilities

### Core Data Structures

Fundamental data containers including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.

```python { .api }
def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, names=None, schema=None, metadata=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...

class Array: ...
class Table: ...
class Schema: ...
class Field: ...
```

[Core Data Structures](./core-data-structures.md)

### Data Types System

Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides type checking, conversion, and inference capabilities essential for data processing workflows.

```python { .api }
def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...

class DataType: ...
def is_integer(type): ...  # in pyarrow.types
def cast(arr, target_type, safe=True): ...  # in pyarrow.compute
```

[Data Types](./data-types.md)

### Compute Functions

High-performance vectorized compute operations including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides 200+ functions optimized for columnar data.

```python { .api }
def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...
```

[Compute Functions](./compute-functions.md)

### File Format Support

Native support for reading and writing multiple file formats including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.

```python { .api }
# Parquet
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...

# CSV
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...
```

[File Formats](./file-formats.md)

### Memory and I/O Management

Memory pool management, buffer operations, compression codecs, and file system abstraction. Provides control over memory allocation and efficient I/O operations across different storage systems.

```python { .api }
def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...

class Buffer: ...
class MemoryPool: ...
```

[Memory and I/O](./memory-io.md)

### Dataset Operations

Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across multiple files with automatic partition discovery.

```python { .api }
def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...

class Dataset: ...
class Scanner: ...
```

[Dataset Operations](./dataset-operations.md)

### Arrow Flight RPC

High-performance RPC framework for distributed data services. Provides client-server architecture for streaming large datasets with authentication, metadata handling, and custom middleware support.

```python { .api }
def connect(location, **kwargs): ...

class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...
```

[Arrow Flight](./arrow-flight.md)

### Advanced Features

Specialized functionality including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and system integration.

```python { .api }
# CUDA support
class Context: ...
class CudaBuffer: ...

# Substrait integration
def run_query(plan): ...
def serialize_expressions(expressions): ...
```

[Advanced Features](./advanced-features.md)

## Version and Build Information

```python { .api }
def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...
```

Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.

## Exception Handling

```python { .api }
class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...
```

Comprehensive exception hierarchy for error handling in data processing workflows.