tessl/pypi-pyarrow

Python library for Apache Arrow columnar memory format and computing libraries

Describes pypipkg:pypi/pyarrow@21.0.x

To install, run:

    npx @tessl/cli install tessl/pypi-pyarrow@21.0.0

# PyArrow

PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and computing libraries. It enables efficient data interchange, in-memory analytics, and seamless integration with the Python data science ecosystem, including pandas, NumPy, and big data processing systems.

## Package Information

- **Package Name**: pyarrow
- **Language**: Python
- **Installation**: `pip install pyarrow`
- **Documentation**: https://arrow.apache.org/docs/python

## Core Imports

```python
import pyarrow as pa
```

Common specialized imports:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight
```

## Basic Usage

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])

# Create tables
table = pa.table({
    'integers': [1, 2, 3, 4],
    'strings': ['foo', 'bar', 'baz', None],
    'floats': [1.0, 2.5, 3.7, 4.1]
})

# Read/write Parquet files
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')

# Compute operations
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))
```

## Architecture

PyArrow's design centers on the Arrow columnar memory format:

- **Columnar Storage**: Data organized by column for efficient analytical operations
- **Zero-Copy Operations**: Memory-efficient data sharing between processes and languages
- **Type System**: Rich data types, including nested structures, decimals, and temporal types
- **Compute Engine**: Vectorized operations for high-performance analytics
- **Format Support**: Native support for Parquet, CSV, JSON, ORC, and custom formats
- **Interoperability**: Seamless integration with pandas, NumPy, and other Python libraries

This architecture lets PyArrow serve as a foundational component for scalable data processing applications: data moves quickly between systems, while columnar layouts keep memory use efficient.

## Capabilities

### Core Data Structures

Fundamental data containers, including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.

```python { .api }
def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, schema=None, metadata=None, columns=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...

class Array: ...
class Table: ...
class Schema: ...
class Field: ...
```

[Core Data Structures](./core-data-structures.md)

### Data Types System

Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides the type checking, conversion, and inference capabilities essential for data processing workflows.

```python { .api }
def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...

class DataType: ...
def is_integer(type): ...
def cast(arr, target_type, safe=True): ...
```

[Data Types](./data-types.md)

### Compute Functions

High-performance vectorized compute operations, including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides over 200 functions optimized for columnar data.

```python { .api }
def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...
```

[Compute Functions](./compute-functions.md)

### File Format Support

Native support for reading and writing multiple file formats, including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.

```python { .api }
# Parquet (pyarrow.parquet)
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...

# CSV (pyarrow.csv)
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...
```

[File Formats](./file-formats.md)

### Memory and I/O Management

Memory pool management, buffer operations, compression codecs, and filesystem abstraction. Provides control over memory allocation and efficient I/O across different storage systems.

```python { .api }
def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...

class Buffer: ...
class MemoryPool: ...
```

[Memory and I/O](./memory-io.md)

### Dataset Operations

Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across many files, with automatic partition discovery.

```python { .api }
def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...

class Dataset: ...
class Scanner: ...
```

[Dataset Operations](./dataset-operations.md)

### Arrow Flight RPC

High-performance RPC framework for distributed data services. Provides a client-server architecture for streaming large datasets, with authentication, metadata handling, and custom middleware support.

```python { .api }
def connect(location, **kwargs): ...

class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...
```

[Arrow Flight](./arrow-flight.md)

### Advanced Features

Specialized functionality, including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and systems integration.

```python { .api }
# CUDA support (pyarrow.cuda)
class Context: ...
class CudaBuffer: ...

# Substrait integration (pyarrow.substrait)
def run_query(plan): ...
def serialize_expressions(expressions): ...
```

[Advanced Features](./advanced-features.md)

## Version and Build Information

Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.

```python { .api }
def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...
```

## Exception Handling

Comprehensive exception hierarchy for error handling in data processing workflows.

```python { .api }
class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...
```