tessl/pypi-xorbits

Scalable Python data science, in an API compatible & lightning fast way.

  • Workspace: tessl
  • Visibility: Public
  • Describes: pkg:pypi/xorbits@0.8.x

To install, run

npx @tessl/cli install tessl/pypi-xorbits@0.8.0


Xorbits

Xorbits is an open-source computing framework that enables seamless scaling of data science and machine learning workloads from single machines to distributed clusters. It provides a familiar Python API that supports popular libraries like pandas, NumPy, PyTorch, and XGBoost, allowing users to scale their existing workflows with minimal code changes.

Package Information

  • Package Name: xorbits
  • Package Type: pypi
  • Language: Python
  • Installation: pip install xorbits
  • Python Requires: >=3.9

Core Imports

import xorbits

Common imports for specific functionality:

import xorbits.pandas as pd
import xorbits.numpy as np
import xorbits.sklearn as sk

Basic Usage

import xorbits
import xorbits.pandas as pd
import xorbits.numpy as np

# Initialize Xorbits runtime
xorbits.init()

# Create distributed DataFrame (same API as pandas)
df = pd.DataFrame({
    'A': np.random.randn(10000),
    'B': np.random.randn(10000),
    'C': np.random.randn(10000)
})

# Perform operations (lazy evaluation)
result = df.groupby('A').agg({'B': 'mean', 'C': 'sum'})

# Execute computation
computed_result = xorbits.run(result)
print(computed_result)

# Shutdown when done
xorbits.shutdown()

Architecture

Xorbits leverages a distributed computing architecture built on top of Mars:

  • DataRef System: All distributed objects are represented as DataRef instances that contain references to underlying Mars entities
  • Lazy Evaluation: Operations are recorded as computation graphs and executed when explicitly triggered
  • Drop-in Compatibility: APIs mirror popular libraries (pandas, numpy, sklearn) with minimal code changes required
  • Distributed Execution: Automatically handles data partitioning, task scheduling, and parallel execution across workers
  • Memory Management: Intelligent memory management with spilling to disk when needed
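The lazy-evaluation idea in the list above can be illustrated with a toy sketch in plain Python. This is not how Xorbits or Mars is implemented; `LazyNode` and `traced` are hypothetical names invented for the illustration. The sketch only shows the pattern: building an expression records a graph, and nothing is computed until execution is requested.

```python
# Toy illustration of lazy evaluation. NOT Xorbits or Mars internals;
# LazyNode and traced are hypothetical names used only for this sketch.

class LazyNode:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, func, inputs=()):
        self.func = func
        self.inputs = tuple(inputs)

    def __add__(self, other):
        # Record the addition instead of performing it.
        return LazyNode(lambda a, b: a + b, (self, other))

    def __mul__(self, other):
        # Record the multiplication instead of performing it.
        return LazyNode(lambda a, b: a * b, (self, other))

    def run(self):
        # Evaluate inputs first, then apply this node's operation.
        args = [node.run() for node in self.inputs]
        return self.func(*args)


log = []  # records the order in which leaf values are actually produced


def traced(value):
    """Create a leaf node that logs when it is evaluated."""
    def produce():
        log.append(value)
        return value
    return LazyNode(produce)


# Building the expression only records the graph.
expr = (traced(2) + traced(3)) * traced(4)
evaluations_before_run = len(log)

# Execution happens only when explicitly triggered.
result = expr.run()
```

In Xorbits the role of the explicit trigger is played by `xorbits.run`, as shown in Basic Usage.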

Capabilities

Runtime Management

Core functions for initializing, managing, and shutting down Xorbits runtime environments, including local and distributed cluster configurations.

from typing import Dict, List, Optional, Union
# no_default is a sentinel from Xorbits' vendored Mars package (a private module)
from xorbits._mars.utils import no_default

def init(
    address: Optional[str] = None,
    init_local: bool = no_default,
    session_id: Optional[str] = None,
    timeout: Optional[float] = None,
    n_worker: int = 1,
    n_cpu: Union[int, str] = "auto",
    mem_bytes: Union[int, str] = "auto",
    cuda_devices: Union[List[int], List[List[int]], str] = "auto",
    web: Union[bool, str] = "auto",
    new: bool = True,
    storage_config: Optional[Dict] = None,
    **kwargs
) -> None: ...

def shutdown(**kw) -> None: ...

def run(obj, **kwargs): ...
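A minimal sketch of how the three functions above might be combined in a local session. It assumes `xorbits` is installed and uses only parameters that appear in the signatures shown; the helper name `local_session_demo` is hypothetical, and the import sits inside the function so that defining it has no side effects.

```python
def local_session_demo(n_cpu: int = 2):
    """Start a local runtime, run one computation, and always shut down."""
    # Imported here so the sketch can be defined without starting anything.
    import xorbits
    import xorbits.numpy as np

    # n_cpu is taken from the init() signature above; other arguments keep their defaults.
    xorbits.init(n_cpu=n_cpu)
    try:
        total = np.arange(1_000).sum()  # builds a lazy computation
        xorbits.run(total)              # explicitly triggers execution
        print(total)
    finally:
        # Release the runtime even if the computation raises.
        xorbits.shutdown()
```

Calling `local_session_demo()` starts and stops a runtime on the current machine. Wrapping the work in `try`/`finally` is a design choice of this sketch, not a requirement stated by the library.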

Full reference: runtime-management.md

Configuration

Configuration management through options system, providing control over execution behavior and runtime settings.

# Configuration objects and functions
options: object
def option_context(*args, **kwargs): ...

Full reference: configuration.md

Pandas Integration

Drop-in replacement for pandas with distributed computing capabilities, covering DataFrames, Series, and a large part of the pandas API.

class DataFrame: ...
class Series: ...
class Index: ...

# Data types and constants
class Timedelta: ...
class DateOffset: ...
class Interval: ...
class Timestamp: ...
NaT: object
NA: object
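A short sketch of the drop-in usage described above, assuming `xorbits` is installed. The column names and values are made up for illustration, and `grouped_totals` is a hypothetical helper name.

```python
def grouped_totals():
    """Group a small DataFrame by a categorical key and sum a numeric column."""
    # Imported here so the sketch can be defined without starting a runtime.
    import xorbits
    import xorbits.pandas as pd

    xorbits.init()
    try:
        # Illustrative data; the construction mirrors the pandas API.
        df = pd.DataFrame({
            "city": ["Oslo", "Lima", "Oslo", "Lima"],
            "sales": [10, 20, 30, 40],
        })
        totals = df.groupby("city")["sales"].sum()  # lazy until run
        xorbits.run(totals)
        print(totals)
    finally:
        xorbits.shutdown()
```

Apart from the import line and the explicit `xorbits.run` call, the body reads like ordinary pandas code.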

Full reference: pandas-integration.md

NumPy Integration

Distributed array computing with a NumPy-compatible API, covering a large part of NumPy's operations on large distributed arrays.

class ndarray: ...

# NumPy constants and types
bool_: type
int8: type
int16: type
int32: type
int64: type
float16: type
float32: type
float64: type
complex64: type
complex128: type
dtype: type
pi: float
e: float
inf: float
nan: float
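A minimal sketch of array computation through the NumPy-compatible module, assuming `xorbits` is installed. `column_means` is a hypothetical helper name, and the array shape is arbitrary.

```python
def column_means(rows: int = 1000, cols: int = 4):
    """Compute per-column means of a random array with xorbits.numpy."""
    # Imported here so the sketch can be defined without starting a runtime.
    import xorbits
    import xorbits.numpy as np

    xorbits.init()
    try:
        data = np.random.rand(rows, cols)  # same call shape as numpy.random.rand
        means = data.mean(axis=0)          # lazy until run
        xorbits.run(means)
        print(means)
    finally:
        xorbits.shutdown()
```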

Full reference: numpy-integration.md

Machine Learning

Distributed machine learning capabilities through sklearn, XGBoost, and LightGBM integrations, enabling scalable model training and prediction.

# Sklearn submodules
from xorbits.sklearn import cluster, datasets, decomposition, ensemble
from xorbits.sklearn import linear_model, metrics, model_selection, neighbors
from xorbits.sklearn import preprocessing, semi_supervised

# XGBoost and LightGBM classes dynamically exposed
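A hedged sketch of a training workflow using the submodules imported above. The source lists only the submodules, so the class and function names used here (`LinearRegression`, `train_test_split`) are assumptions that these mirror their scikit-learn counterparts; check the machine-learning reference before relying on them. `fit_linear_model` is a hypothetical helper name.

```python
def fit_linear_model(n_samples: int = 200):
    """Fit a linear model on synthetic data and print its predictions."""
    # Imported here so the sketch can be defined without starting a runtime.
    import xorbits
    import xorbits.numpy as np
    # Assumed to mirror scikit-learn's names; not confirmed by the text above.
    from xorbits.sklearn.linear_model import LinearRegression
    from xorbits.sklearn.model_selection import train_test_split

    xorbits.init()
    try:
        # Synthetic features and a target that is an exact linear combination.
        X = np.random.rand(n_samples, 3)
        y = X.dot(np.array([1.0, 2.0, 3.0]))

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=0
        )
        model = LinearRegression()
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        xorbits.run(predictions)
        print(predictions)
    finally:
        xorbits.shutdown()
```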

Full reference: machine-learning.md

Datasets

Large-scale dataset handling with support for Hugging Face datasets and efficient data loading patterns.

class Dataset: ...
def from_huggingface(dataset_name: str, **kwargs): ...

Full reference: datasets.md

Remote Computing

Remote function execution capabilities for distributed computing workloads.

def spawn(func, **kwargs): ...

Full reference: remote-computing.md

Types

Core Data Types

class Data:
    """Base data container class."""

class DataRef:
    """Reference to distributed data object."""

class DataRefMeta:
    """Metaclass for DataRef."""

from enum import Enum

class DataType(Enum):
    """Enumeration of data types."""
    object_ = 1
    scalar = 2
    tensor = 3
    dataframe = 4
    series = 5
    index = 6
    categorical = 7
    dataframe_groupby = 8
    series_groupby = 9
    dataset = 10