tessl/pypi-xorbits

Scalable Python data science, in an API compatible & lightning fast way.

Pandas Integration

Xorbits pandas is a drop-in replacement for pandas with distributed computing capabilities. It exposes the same API as pandas while distributing the work, so computations can run on datasets that exceed the memory of a single machine.

Capabilities

Core Data Structures

The fundamental data structures that mirror pandas DataFrame, Series, and Index with distributed capabilities.

class DataFrame:
    """
    Distributed DataFrame with pandas-compatible API.
    
    Provides all pandas DataFrame functionality with automatic distribution
    across multiple workers for scalable data processing.
    """

class Series:
    """
    Distributed Series with pandas-compatible API.
    
    One-dimensional labeled array capable of holding any data type,
    distributed across multiple workers.
    """

class Index:
    """
    Distributed Index with pandas-compatible API.
    
    Immutable sequence used for indexing and alignment,
    supporting distributed operations.
    """
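
To make the relationship between the three structures concrete, here is a minimal sketch of how a DataFrame, its columns (Series), and its row labels (Index) fit together. It is written against plain pandas so it runs without a cluster; the assumption is that replacing the import with `import xorbits.pandas as pd` gives the same semantics, as the section states. The column names and values are illustrative only.

```python
import pandas as pd  # assumption: xorbits.pandas mirrors these constructors

# A DataFrame is a table whose rows are labelled by an Index.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "visits": [1, 10, 7]},
    index=pd.Index(["a", "b", "c"], name="key"),
)

column = df["visits"]  # selecting one column yields a Series
labels = df.index      # the row labels are an Index

# Arithmetic between Series aligns on labels, not on position.
# Label "b" has no partner on the right-hand side, so the result is missing there.
bonus = pd.Series({"c": 1, "a": 1})
aligned = column + bonus
```

Label alignment is the reason the Index matters: the two operands above are in different orders and of different lengths, yet each value is paired with the matching label.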

Data Types and Time Components

Pandas-compatible data types and time-related classes for working with temporal data.

class Timedelta:
    """Time delta class for representing durations."""

class DateOffset:
    """Date offset class for date arithmetic."""

class Interval:
    """Interval class for representing intervals between values."""

class Timestamp:
    """Timestamp class for representing points in time."""

NaT: object
# Not-a-Time constant for missing time values.

NA: object
# Missing value indicator (pandas >= 1.0).

class NamedAgg:
    """Named aggregation class for groupby operations (pandas >= 1.0)."""

class ArrowDtype:
    """Arrow data type for PyArrow integration (pandas >= 1.5)."""
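
The following sketch shows how these types are typically combined. It uses plain pandas so that it runs locally; the assumption is that the same names imported from `xorbits.pandas` behave identically. The `sales` frame and its column names are made up for illustration.

```python
import pandas as pd  # assumption: xorbits.pandas exposes the same names

# Timestamp + Timedelta gives another Timestamp.
start = pd.Timestamp("2024-01-01 09:00")
end = start + pd.Timedelta(hours=36)

# DateOffset does calendar-aware arithmetic (one month, not a fixed number of days).
month_later = start + pd.DateOffset(months=1)

# Interval closed on the right: 0 is excluded, 10 is included.
window = pd.Interval(left=0, right=10, closed="right")

# NaT and NA are both recognised as missing by isna.
nat_is_missing = pd.isna(pd.NaT)
na_is_missing = pd.isna(pd.NA)

# NamedAgg names the output columns of a groupby aggregation.
sales = pd.DataFrame({"region": ["east", "east", "west"], "amount": [10, 20, 5]})
summary = sales.groupby("region").agg(
    total=pd.NamedAgg(column="amount", aggfunc="sum"),
    largest=pd.NamedAgg(column="amount", aggfunc="max"),
)
```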

Configuration Functions

Configuration management specific to pandas operations, mirroring the pandas options system.

def describe_option(option_name: str) -> None:
    """
    Describe a configuration option.
    
    Parameters:
    - option_name: Name of the option to describe
    """

def get_option(option_name: str):
    """
    Get the value of a configuration option.
    
    Parameters:
    - option_name: Name of the option to retrieve
    
    Returns:
    - Current value of the option
    """

def set_option(option_name: str, value) -> None:
    """
    Set the value of a configuration option.
    
    Parameters:
    - option_name: Name of the option to set
    - value: New value for the option
    """

def reset_option(option_name: str) -> None:
    """
    Reset a configuration option to its default value.
    
    Parameters:
    - option_name: Name of the option to reset
    """

def option_context(*args):
    """
    Context manager for temporarily changing pandas options.
    
    Parameters:
    - *args: Option names and values as alternating arguments,
      e.g. option_context('display.max_rows', 20, 'display.max_columns', 5)
    
    Returns:
    - Context manager that restores the previous values on exit
    """

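def set_eng_float_format(accuracy: int = 3, use_eng_prefix: bool = False) -> None:
    """
    Format float representation in DataFrames with SI (engineering) notation.
    
    Parameters:
    - accuracy: Number of digits after the decimal point
    - use_eng_prefix: Use SI prefixes (e.g. "k", "M") instead of an exponent such as "E+03"
    """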
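
As a rough illustration of the less obvious functions above, the sketch below captures the text printed by `describe_option`, changes two options at once with `option_context`, and installs an engineering formatter. It is written against plain pandas, where `set_eng_float_format` takes `accuracy` and `use_eng_prefix`; that `xorbits.pandas` forwards these calls unchanged is an assumption.

```python
import contextlib
import io

import pandas as pd  # assumption: xorbits.pandas forwards these calls unchanged

# describe_option prints its text, so capture stdout to inspect it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    pd.describe_option("display.max_rows")
description = buffer.getvalue()

before = pd.get_option("display.max_rows")

# Several options can be changed together by alternating names and values.
with pd.option_context("display.max_rows", 5, "display.max_columns", 3):
    inside = (
        pd.get_option("display.max_rows"),
        pd.get_option("display.max_columns"),
    )

after = pd.get_option("display.max_rows")  # previous value is restored on exit

# set_eng_float_format stores a formatter under display.float_format.
pd.set_eng_float_format(accuracy=1, use_eng_prefix=True)
formatter = pd.get_option("display.float_format")
pd.reset_option("display.float_format")  # back to the default
```

Resetting `display.float_format` at the end matters because the formatter is global state and would otherwise affect every later print in the session.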

Specialized Modules

Access to pandas specialized functionality through submodules.

# Submodules providing specialized functionality
accessors  # DataFrame and Series accessor functionality
core       # Core pandas data structures
groupby    # GroupBy functionality
plotting   # Plotting functionality
window     # Window operations
offsets    # Date offset functionality

Dynamic Function Access

All pandas module-level functions are available through dynamic import, including but not limited to:

# Data I/O functions
def read_csv(filepath_or_buffer, **kwargs): ...
def read_parquet(path, **kwargs): ...
def read_json(path_or_buf, **kwargs): ...
def read_excel(io, **kwargs): ...
def read_sql(sql, con, **kwargs): ...
def read_pickle(filepath_or_buffer, **kwargs): ...

# Data manipulation functions
def concat(objs, **kwargs): ...
def merge(left, right, **kwargs): ...
def merge_asof(left, right, **kwargs): ...
def crosstab(index, columns, **kwargs): ...
def pivot_table(data, **kwargs): ...
def melt(frame, **kwargs): ...

# Utility functions
def cut(x, bins, **kwargs): ...
def qcut(x, q, **kwargs): ...
def get_dummies(data, **kwargs): ...
def factorize(values, **kwargs): ...
def unique(values): ...
def value_counts(values, **kwargs): ...

# Date/time utilities
def date_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def period_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def timedelta_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def to_datetime(arg, **kwargs): ...
def to_timedelta(arg, **kwargs): ...
def to_numeric(arg, **kwargs): ...
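
A few of the functions listed above, shown on small inputs. This is a sketch using plain pandas so it can be run locally; the assumption is that the dynamically imported `xorbits.pandas` versions accept the same arguments. All data is invented for illustration.

```python
import pandas as pd  # assumption: xorbits.pandas resolves the same function names

# cut: bin numeric values into labelled, right-closed intervals.
ages = pd.Series([5, 17, 30, 64, 80])
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# melt: reshape from wide to long format.
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
long = pd.melt(wide, id_vars="id", var_name="quarter", value_name="score")

# date_range / to_datetime: build and parse datetimes.
days = pd.date_range(start="2024-01-01", periods=3, freq="D")
parsed = pd.to_datetime(["2024-01-01", "2024-01-03"])

# to_numeric: unparseable entries become NaN with errors="coerce".
numbers = pd.to_numeric(pd.Series(["1", "2", "oops"]), errors="coerce")

# factorize: encode values as integer codes plus the unique values.
codes, uniques = pd.factorize(pd.Series(["b", "a", "b"]))
```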

Usage Examples:

import xorbits
import xorbits.pandas as pd
import xorbits.numpy as np

xorbits.init()

# Creating DataFrames (same as pandas)
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Reading data (same as pandas)
df_from_csv = pd.read_csv('data.csv')

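# Data manipulation (same as pandas)
grouped = df.groupby('B').agg({'A': 'sum', 'C': 'mean'})

# A second frame sharing column 'B', so the merge has a key to join on
other_df = pd.DataFrame({'B': ['a', 'b', 'c'], 'D': [10, 20, 30]})
merged = pd.merge(df, other_df, on='B')
concatenated = pd.concat([df, df])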

# All pandas operations work the same way
result = df.query('A > 2').sort_values('C').head(10)

# Execute computation
computed = xorbits.run(result)

xorbits.shutdown()

Configuration Usage

import xorbits.pandas as pd

# Get current display options
max_rows = pd.get_option('display.max_rows')

# Set display options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

# Use option context for temporary changes
large_dataframe = pd.DataFrame({'x': range(1000)})
with pd.option_context('display.max_rows', 20):
    print(large_dataframe)  # Output is truncated instead of printing all 1000 rows

# Reset options
pd.reset_option('display.max_rows')

Install with Tessl CLI

npx tessl i tessl/pypi-xorbits
