Scalable Python data science, in an API compatible & lightning fast way.
npx @tessl/cli install tessl/pypi-xorbits@0.8.00
# Xorbits

Xorbits is an open-source computing framework that enables seamless scaling of data science and machine learning workloads from single machines to distributed clusters. It provides a familiar Python API that supports popular libraries like pandas, NumPy, PyTorch, and XGBoost, allowing users to scale their existing workflows with minimal code changes.

## Package Information

- **Package Name**: xorbits
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install xorbits`
- **Python Requires**: >=3.9

## Core Imports

```python
import xorbits
```

Common imports for specific functionality:

```python
import xorbits.pandas as pd
import xorbits.numpy as np
import xorbits.sklearn as sk
```

## Basic Usage

```python
import xorbits
import xorbits.pandas as pd
import xorbits.numpy as np

# Initialize Xorbits runtime
xorbits.init()

# Create distributed DataFrame (same API as pandas)
df = pd.DataFrame({
    'A': np.random.randn(10000),
    'B': np.random.randn(10000),
    'C': np.random.randn(10000)
})

# Perform operations (lazy evaluation)
result = df.groupby('A').agg({'B': 'mean', 'C': 'sum'})

# Execute computation
computed_result = xorbits.run(result)
print(computed_result)

# Shutdown when done
xorbits.shutdown()
```

## Architecture

Xorbits leverages a distributed computing architecture built on top of Mars:

- **DataRef System**: All distributed objects are represented as `DataRef` instances that contain references to underlying Mars entities
- **Lazy Evaluation**: Operations are recorded as computation graphs and executed when explicitly triggered
- **Drop-in Compatibility**: APIs mirror popular libraries (pandas, numpy, sklearn) with minimal code changes required
- **Distributed Execution**: Automatically handles data partitioning, task scheduling, and parallel execution across workers
- **Memory Management**: Intelligent memory management with spilling to disk when needed

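The lazy-evaluation model in the list above can be illustrated with a toy example. This sketch is not Xorbits or Mars code: `LazyValue`, `constant`, and `run` are hypothetical names invented here. It only shows the general idea of recording operations as a graph and computing nothing until execution is explicitly requested.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Log of executed operations, used only to show when work actually happens.
executed: List[str] = []


@dataclass(frozen=True)
class LazyValue:
    """Hypothetical graph node: an operation plus the nodes it depends on."""

    name: str
    op: Callable[..., float]
    inputs: Tuple["LazyValue", ...] = ()

    def __add__(self, other: "LazyValue") -> "LazyValue":
        # Record the addition; do not compute it yet.
        return LazyValue("add", lambda a, b: a + b, (self, other))

    def __mul__(self, other: "LazyValue") -> "LazyValue":
        # Record the multiplication; do not compute it yet.
        return LazyValue("mul", lambda a, b: a * b, (self, other))


def constant(value: float) -> LazyValue:
    """Wrap a plain number as a leaf node of the graph."""
    return LazyValue("const", lambda: value)


def run(node: LazyValue) -> float:
    """Evaluate a node by first evaluating everything it depends on."""
    args = [run(dep) for dep in node.inputs]
    executed.append(node.name)
    return node.op(*args)


# Building the expression only records operations.
graph = (constant(2.0) + constant(3.0)) * constant(4.0)
assert executed == []  # nothing has been computed so far

# Execution happens only when explicitly triggered.
result = run(graph)
print(result)
```

A real distributed engine would additionally partition data and schedule the graph across workers; the toy version evaluates everything recursively in one process.
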
## Capabilities

### Runtime Management

Core functions for initializing, managing, and shutting down Xorbits runtime environments, including local and distributed cluster configurations.

```python { .api }
from typing import Dict, List, Optional, Union
from .._mars.utils import no_default

def init(
    address: Optional[str] = None,
    init_local: bool = no_default,
    session_id: Optional[str] = None,
    timeout: Optional[float] = None,
    n_worker: int = 1,
    n_cpu: Union[int, str] = "auto",
    mem_bytes: Union[int, str] = "auto",
    cuda_devices: Union[List[int], List[List[int]], str] = "auto",
    web: Union[bool, str] = "auto",
    new: bool = True,
    storage_config: Optional[Dict] = None,
    **kwargs
) -> None: ...

def shutdown(**kw) -> None: ...

def run(obj, **kwargs): ...
```

[Runtime Management](./runtime-management.md)

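Because `init()` and `shutdown()` come as a pair, one way to make sure the runtime is always torn down is to wrap them in a context manager. The sketch below is an assumption about usage, not part of the Xorbits API: `xorbits_runtime` is a hypothetical helper, and its `runtime` parameter exists only so the pattern can be exercised with a stand-in object (`FakeRuntime`, also hypothetical) when Xorbits is not installed.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List, Optional, Tuple


@contextmanager
def xorbits_runtime(runtime: Optional[Any] = None, **init_kwargs: Any) -> Iterator[Any]:
    """Hypothetical helper: call init() on entry and shutdown() on exit.

    `runtime` is any object exposing init() and shutdown(); when omitted,
    the real `xorbits` module is imported lazily.
    """
    if runtime is None:
        import xorbits as runtime  # requires `pip install xorbits`
    runtime.init(**init_kwargs)
    try:
        yield runtime
    finally:
        # Runs even if the body raises, so the runtime is not left running.
        runtime.shutdown()


class FakeRuntime:
    """Stand-in that records calls instead of starting a cluster."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, dict]] = []

    def init(self, **kwargs: Any) -> None:
        self.calls.append(("init", kwargs))

    def shutdown(self, **kwargs: Any) -> None:
        self.calls.append(("shutdown", kwargs))


fake = FakeRuntime()
with xorbits_runtime(fake, n_worker=1):
    pass  # build and run computations here
print(fake.calls)
```

With Xorbits installed, the same helper would be used as `with xorbits_runtime(n_cpu=4) as xb: ...`, passing through any of the `init()` parameters listed above.
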
### Configuration

Configuration management through options system, providing control over execution behavior and runtime settings.

```python { .api }
# Configuration objects and functions
options: object
def option_context(*args, **kwargs): ...
```

[Configuration](./configuration.md)

### Pandas Integration

Drop-in replacement for pandas with distributed computing capabilities, supporting DataFrames, Series, and the full pandas API.

```python { .api }
class DataFrame: ...
class Series: ...
class Index: ...

# Data types and constants
class Timedelta: ...
class DateOffset: ...
class Interval: ...
class Timestamp: ...
NaT: object
NA: object
```

[Pandas Integration](./pandas-integration.md)

### NumPy Integration

Distributed array computing with NumPy-compatible API, supporting all NumPy operations on large distributed arrays.

```python { .api }
class ndarray: ...

# NumPy constants and types
bool_: type
int8: type
int16: type
int32: type
int64: type
float16: type
float32: type
float64: type
complex64: type
complex128: type
dtype: type
pi: float
e: float
inf: float
nan: float
```

[NumPy Integration](./numpy-integration.md)

### Machine Learning

Distributed machine learning capabilities through sklearn, XGBoost, and LightGBM integrations, enabling scalable model training and prediction.

```python { .api }
# Sklearn submodules
from xorbits.sklearn import cluster, datasets, decomposition, ensemble
from xorbits.sklearn import linear_model, metrics, model_selection, neighbors
from xorbits.sklearn import preprocessing, semi_supervised

# XGBoost and LightGBM classes dynamically exposed
```

[Machine Learning](./machine-learning.md)

### Datasets

Large-scale dataset handling with support for Hugging Face datasets and efficient data loading patterns.

```python { .api }
class Dataset: ...
def from_huggingface(dataset_name: str, **kwargs): ...
```

[Datasets](./datasets.md)

### Remote Computing

Remote function execution capabilities for distributed computing workloads.

```python { .api }
def spawn(func, **kwargs): ...
```

[Remote Computing](./remote-computing.md)

## Types

### Core Data Types

```python { .api }
class Data:
    """Base data container class."""

class DataRef:
    """Reference to distributed data object."""

class DataRefMeta:
    """Metaclass for DataRef."""

from enum import Enum

class DataType(Enum):
    """Enumeration of data types."""
    object_ = 1
    scalar = 2
    tensor = 3
    dataframe = 4
    series = 5
    index = 6
    categorical = 7
    dataframe_groupby = 8
    series_groupby = 9
    dataset = 10
```
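
Since `DataType` is declared as a standard `Enum`, its members can be looked up by name or value and grouped for dispatch. The sketch below re-declares the enumeration locally, copied from the listing above, so that it runs without Xorbits installed. `is_groupby` is a hypothetical helper added for illustration, not a function provided by the package.

```python
from enum import Enum


class DataType(Enum):
    """Local copy of the enumeration listed above, for illustration only."""

    object_ = 1
    scalar = 2
    tensor = 3
    dataframe = 4
    series = 5
    index = 6
    categorical = 7
    dataframe_groupby = 8
    series_groupby = 9
    dataset = 10


# Members that represent grouped (not yet aggregated) objects.
_GROUPBY_TYPES = frozenset({DataType.dataframe_groupby, DataType.series_groupby})


def is_groupby(data_type: DataType) -> bool:
    """Hypothetical helper: report whether a type is one of the groupby kinds."""
    return data_type in _GROUPBY_TYPES


by_value = DataType(4)          # look up a member by its value
by_name = DataType["series"]    # look up a member by its name
print(by_value, by_name, is_groupby(DataType.series_groupby))
```
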