Python library for the Apache Arrow columnar memory format and computing libraries
```bash
npx @tessl/cli install tessl/pypi-pyarrow@21.0.0
```

PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and computing libraries. It enables efficient data interchange, in-memory analytics, and seamless integration with the Python data science ecosystem, including pandas, NumPy, and big data processing systems.
```bash
pip install pyarrow
```

```python
import pyarrow as pa
```

Common specialized imports:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight
```

```python
import pyarrow as pa
import numpy as np
# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])
# Create tables
table = pa.table({
'integers': [1, 2, 3, 4],
'strings': ['foo', 'bar', 'baz', None],
'floats': [1.0, 2.5, 3.7, 4.1]
})
# Read/write Parquet files
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')
# Compute operations
import pyarrow.compute as pc
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))
```

PyArrow's design centers around the Arrow columnar memory format.
This architecture enables PyArrow to serve as a foundational component for scalable data processing applications, moving data quickly between systems while keeping memory usage efficient through columnar layouts.
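Because Arrow data is columnar and language-independent, conversions to and from NumPy and pandas can often avoid copying entirely. A minimal sketch of that interchange, assuming pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Wrap a NumPy array as an Arrow array; primitive types can share memory
np_data = np.array([1, 2, 3, 4, 5], dtype=np.int64)
arr = pa.array(np_data)

# Convert back without copying (raises if a copy would be required)
back = arr.to_numpy(zero_copy_only=True)

# Round-trip a pandas DataFrame through an Arrow table
df = pd.DataFrame({'x': [1.0, 2.5, 3.7], 'y': ['a', 'b', 'c']})
table = pa.Table.from_pandas(df)
df_again = table.to_pandas()
```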
Fundamental data containers including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.

```python
def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, schema=None, metadata=None, columns=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...
class Array: ...
class Table: ...
class Schema: ...
class Field: ...
```
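A short sketch of these constructors in use:

```python
import pyarrow as pa

# Build an explicit schema from typed fields
schema = pa.schema([
    pa.field('id', pa.int64(), nullable=False),
    pa.field('name', pa.string()),
])

# Create a table that conforms to the schema
table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', None]}, schema=schema)

# Arrays and tables expose their types and schema for inspection
arr = pa.array([1, 2, 3])
print(arr.type)      # int64
print(table.schema)  # id: int64 not null; name: string
```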
Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides type checking, conversion, and inference capabilities essential for data processing workflows.

```python
def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...
class DataType: ...
def is_integer(type): ...
def cast(arr, target_type, safe=True): ...
```
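A sketch of type construction, checking, and casting; note that the type predicates such as is_integer live in the pyarrow.types module:

```python
import pyarrow as pa
import pyarrow.types as pat

# Temporal and nested type constructors
ts_type = pa.timestamp('ms', tz='UTC')
list_type = pa.list_(pa.int32())
struct_type = pa.struct([('x', pa.int64()), ('y', pa.string())])

# Type checking on an array's type
arr = pa.array([1, 2, 3], type=pa.int32())
print(pat.is_integer(arr.type))  # True

# Safe casting raises on lossy conversions; safe=False permits truncation
floats = pa.array([1.5, 2.7])
ints = floats.cast(pa.int64(), safe=False)
```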
High-performance vectorized compute operations including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides 200+ functions optimized for columnar data.

```python
def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...
```
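A brief sketch of the vectorized kernels in action:

```python
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2, 3, 4])
b = pa.array([10, 20, 30, 40])

print(pc.add(a, b))                  # [11, 22, 33, 44]
print(pc.sum(a).as_py())             # 10
print(pc.take(b, pa.array([0, 3])))  # [10, 40]

# Boolean masks from comparisons combine with filtering
mask = pc.greater(a, 2)
print(pc.filter(b, mask))            # [30, 40]
```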
Native support for reading and writing multiple file formats including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.

```python
# Parquet
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...
# CSV
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...
```
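For instance, round-tripping a table through CSV and writing Parquet with explicit compression (compression='zstd' assumes a build with Zstandard support, which the standard wheels include):

```python
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# CSV round trip
csv.write_csv(table, 'data.csv')
from_csv = csv.read_csv('data.csv')

# Parquet with compression, then a column-pruned read
pq.write_table(table, 'data.parquet', compression='zstd')
from_parquet = pq.read_table('data.parquet', columns=['a'])
```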
Memory pool management, buffer operations, compression codecs, and file system abstraction. Provides control over memory allocation and efficient I/O operations across different storage systems.

```python
def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...
class Buffer: ...
class MemoryPool: ...
```
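A small sketch of pools, buffers, and compression round trips, assuming the gzip codec is available (it is bundled with standard builds):

```python
import pyarrow as pa

# Inspect the default allocator
pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated())

# Zero-copy view over Python bytes
buf = pa.py_buffer(b'hello arrow')
print(buf.size)  # 11

# Compress, then decompress (the original size must be supplied)
compressed = pa.compress(buf, codec='gzip')
restored = pa.decompress(compressed, decompressed_size=buf.size, codec='gzip')
```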
Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across multiple files with automatic partition discovery.

```python
def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...
class Dataset: ...
class Scanner: ...
```
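For example, writing a partitioned dataset and scanning it back with a filter that prunes partitions (the directory name here is illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'year': [2023, 2023, 2024], 'value': [1.0, 2.0, 3.0]})

# Write one directory per distinct 'year' value
ds.write_dataset(table, 'data_root', format='parquet', partitioning=['year'])

# Lazily scan, reading only the matching partition and column
dataset = ds.dataset('data_root', format='parquet', partitioning=['year'])
result = dataset.to_table(filter=ds.field('year') == 2024, columns=['value'])
```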
High-performance RPC framework for distributed data services. Provides a client-server architecture for streaming large datasets with authentication, metadata handling, and custom middleware support.

```python
def connect(location, **kwargs): ...
class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...
```
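A minimal client-side sketch, assuming a Flight server is already listening at grpc://localhost:8815 and serves a dataset under the path shown:

```python
import pyarrow.flight as flight

# Connect to a running Flight server (address is an assumption)
client = flight.connect('grpc://localhost:8815')

# Ask the server how to retrieve a dataset identified by path
descriptor = flight.FlightDescriptor.for_path('example-dataset')
info = client.get_flight_info(descriptor)

# Stream the data from the first endpoint and assemble a table
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
```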
Specialized functionality including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and system integration.

```python
# CUDA support
class Context: ...
class CudaBuffer: ...
# Substrait integration
def run_query(plan): ...
def serialize_expressions(expressions): ...
```
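As a sketch of the CUDA surface, copying a host buffer into device memory and back; this assumes a CUDA-enabled PyArrow build and an NVIDIA GPU:

```python
import pyarrow as pa
from pyarrow import cuda

# Bind to GPU device 0 (requires CUDA-enabled build and hardware)
ctx = cuda.Context(0)

# Copy a host buffer into freshly allocated device memory
host_buf = pa.py_buffer(b'gpu-bound bytes')
device_buf = ctx.buffer_from_data(host_buf)

# Copy back to the host to verify the round trip
assert device_buf.copy_to_host().to_pybytes() == b'gpu-bound bytes'
```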
```python
def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...
```

Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.
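For instance, printing the installed version and a dependency report when diagnosing an environment (show_versions writes directly to stdout):

```python
import pyarrow as pa

print(pa.__version__)                # e.g. '21.0.0'
pa.show_versions()                   # PyArrow and optional-dependency versions
print(pa.runtime_info().simd_level)  # SIMD level selected at runtime
```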
```python
class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...
```

Comprehensive exception hierarchy for error handling in data processing workflows.
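For example, a safe cast that cannot be represented raises ArrowInvalid:

```python
import pyarrow as pa

try:
    pa.array(['not-a-number']).cast(pa.int64())
except pa.ArrowInvalid as exc:
    print(f'cast failed: {exc}')
```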