tessl/pypi-pyarrow

Python library for Apache Arrow columnar memory format and computing libraries

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/pyarrow@21.0.x

To install, run

npx @tessl/cli install tessl/pypi-pyarrow@21.0.0


PyArrow

PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and its computing libraries. It enables efficient data interchange and in-memory analytics, integrating seamlessly with the Python data science ecosystem (pandas, NumPy) as well as with big data processing systems.

Package Information

  • Package Name: pyarrow
  • Language: Python
  • Installation: pip install pyarrow
  • Documentation: https://arrow.apache.org/docs/python

Core Imports

import pyarrow as pa

Common specialized imports:

import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight

Basic Usage

import pyarrow as pa
import numpy as np

# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])

# Create tables
table = pa.table({
    'integers': [1, 2, 3, 4],
    'strings': ['foo', 'bar', 'baz', None],
    'floats': [1.0, 2.5, 3.7, 4.1]
})

# Read/write Parquet files
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')

# Compute operations
import pyarrow.compute as pc
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))

Architecture

PyArrow's design centers around the Arrow columnar memory format:

  • Columnar Storage: Data organized by columns for efficient analytical operations
  • Zero-Copy Operations: Memory-efficient data sharing between processes and languages
  • Type System: Rich data types including nested structures, decimals, and temporal types
  • Compute Engine: Vectorized operations for high-performance analytics
  • Format Support: Native support for Parquet, CSV, JSON, ORC, and custom formats
  • Interoperability: Seamless integration with pandas, NumPy, and other Python libraries

This architecture lets PyArrow serve as a foundational component for scalable data processing applications: data moves quickly between systems while the columnar layout keeps memory usage efficient.

Capabilities

Core Data Structures

Fundamental data containers including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.

def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, schema=None, metadata=None, columns=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...

class Array: ...
class Table: ...
class Schema: ...
class Field: ...

See docs/core-data-structures.md.

Data Types System

Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides type checking, conversion, and inference capabilities essential for data processing workflows.

def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...

class DataType: ...
def is_integer(type): ...
def cast(arr, target_type, safe=True): ...

See docs/data-types.md.

Compute Functions

High-performance vectorized compute operations including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides 200+ functions optimized for columnar data.

def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...

See docs/compute-functions.md.

File Format Support

Native support for reading and writing multiple file formats including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.

# Parquet
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...

# CSV  
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...

See docs/file-formats.md.

Memory and I/O Management

Memory pool management, buffer operations, compression codecs, and file system abstraction. Provides control over memory allocation and efficient I/O operations across different storage systems.

def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...

class Buffer: ...
class MemoryPool: ...

See docs/memory-io.md.

Dataset Operations

Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across multiple files with automatic partition discovery.

def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...

class Dataset: ...
class Scanner: ...

See docs/dataset-operations.md.

Arrow Flight RPC

High-performance RPC framework for distributed data services. Provides client-server architecture for streaming large datasets with authentication, metadata handling, and custom middleware support.

def connect(location, **kwargs): ...

class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...

See docs/arrow-flight.md.

Advanced Features

Specialized functionality including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and system integration.

# CUDA support
class Context: ...
class CudaBuffer: ...

# Substrait integration  
def run_query(plan): ...
def serialize_expressions(expressions): ...

See docs/advanced-features.md.

Version and Build Information

def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...

Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.

Exception Handling

class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...

Comprehensive exception hierarchy for error handling in data processing workflows.