
# Xorbits


Xorbits is an open-source computing framework that enables seamless scaling of data science and machine learning workloads from single machines to distributed clusters. It provides a familiar Python API that supports popular libraries like pandas, NumPy, PyTorch, and XGBoost, allowing users to scale their existing workflows with minimal code changes.


## Package Information

- **Package Name**: xorbits
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install xorbits`
- **Python Requires**: >=3.9

## Core Imports

```python
import xorbits
```

Common imports for specific functionality:

```python
import xorbits.pandas as pd
import xorbits.numpy as np
import xorbits.sklearn as sk
```

## Basic Usage

```python
import xorbits
import xorbits.pandas as pd
import xorbits.numpy as np

# Initialize Xorbits runtime
xorbits.init()

# Create distributed DataFrame (same API as pandas)
df = pd.DataFrame({
    'A': np.random.randn(10000),
    'B': np.random.randn(10000),
    'C': np.random.randn(10000)
})

# Perform operations (lazy evaluation)
result = df.groupby('A').agg({'B': 'mean', 'C': 'sum'})

# Execute computation
computed_result = xorbits.run(result)
print(computed_result)

# Shutdown when done
xorbits.shutdown()
```

## Architecture


Xorbits leverages a distributed computing architecture built on top of Mars:

- **DataRef System**: Every distributed object is represented as a `DataRef` instance that holds a reference to the underlying Mars entity
- **Lazy Evaluation**: Operations are recorded as a computation graph and executed only when explicitly triggered
- **Drop-in Compatibility**: APIs mirror popular libraries (pandas, NumPy, scikit-learn), so existing code needs minimal changes
- **Distributed Execution**: Data partitioning, task scheduling, and parallel execution across workers are handled automatically
- **Memory Management**: Memory is managed automatically, with spilling to disk when needed
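
The lazy-evaluation idea in the list above can be illustrated with a toy sketch in plain Python. This is not how Xorbits or Mars is implemented; `LazyValue` and its `run` method are hypothetical names, used only to show operations being recorded as a graph and evaluated on demand.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class LazyValue:
    """A node in a tiny computation graph (hypothetical, for illustration)."""

    func: Callable[..., Any]
    inputs: Tuple["LazyValue", ...] = ()

    @staticmethod
    def constant(value: Any) -> "LazyValue":
        # Leaf node: calling its function simply returns the stored value.
        return LazyValue(lambda: value)

    def __add__(self, other: "LazyValue") -> "LazyValue":
        # Nothing is computed here; the operation is only recorded.
        return LazyValue(operator.add, (self, other))

    def __mul__(self, other: "LazyValue") -> "LazyValue":
        return LazyValue(operator.mul, (self, other))

    def run(self) -> Any:
        # Evaluate the inputs first, then apply this node's function.
        return self.func(*(node.run() for node in self.inputs))


a = LazyValue.constant(2)
b = LazyValue.constant(3)
expr = (a + b) * b   # builds a graph; no arithmetic happens yet
result = expr.run()  # explicit trigger, similar in spirit to xorbits.run
print(result)        # 15
```

A real engine would also partition data and schedule graph nodes across workers; the sketch covers only the record-then-execute step.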


## Capabilities


### Runtime Management

Core functions for initializing, managing, and shutting down Xorbits runtime environments, including local and distributed cluster configurations.

```python { .api }
from typing import Dict, List, Optional, Union
from .._mars.utils import no_default

def init(
    address: Optional[str] = None,
    init_local: bool = no_default,
    session_id: Optional[str] = None,
    timeout: Optional[float] = None,
    n_worker: int = 1,
    n_cpu: Union[int, str] = "auto",
    mem_bytes: Union[int, str] = "auto",
    cuda_devices: Union[List[int], List[List[int]], str] = "auto",
    web: Union[bool, str] = "auto",
    new: bool = True,
    storage_config: Optional[Dict] = None,
    **kwargs
) -> None: ...

def shutdown(**kw) -> None: ...

def run(obj, **kwargs): ...
```

[Runtime Management](./runtime-management.md)
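
Because `init` and `shutdown` come as a pair, one way to keep them balanced is a small context manager. The sketch below is an assumption, not part of Xorbits: `managed_runtime` is a hypothetical helper, and it takes the two callables as arguments so the pattern can be shown with stand-ins (`fake_init`, `fake_shutdown`) instead of a live cluster.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, List


@contextmanager
def managed_runtime(
    init: Callable[..., Any],
    shutdown: Callable[..., Any],
    **init_kwargs: Any,
) -> Iterator[None]:
    """Call init(**init_kwargs) on entry and shutdown() on exit, even on error."""
    init(**init_kwargs)
    try:
        yield
    finally:
        shutdown()


# Stand-in callables so the sketch runs without a cluster.
events: List[str] = []


def fake_init(**kwargs: Any) -> None:
    events.append(f"init {sorted(kwargs)}")


def fake_shutdown() -> None:
    events.append("shutdown")


with managed_runtime(fake_init, fake_shutdown, n_cpu=2):
    events.append("work")

print(events)  # ["init ['n_cpu']", 'work', 'shutdown']
```

With Xorbits installed, the same helper could presumably be used as `with managed_runtime(xorbits.init, xorbits.shutdown, n_cpu=2): ...`, passing only parameters that appear in the `init` signature.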


### Configuration

Configuration management through the options system, which controls execution behavior and runtime settings.

```python { .api }
# Configuration objects and functions
options: object
def option_context(*args, **kwargs): ...
```

[Configuration](./configuration.md)
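
An option context typically applies overrides for the duration of a `with` block and restores the previous values afterwards. The toy sketch below shows that pattern with a plain dictionary. It is not Xorbits' implementation, and `toy_options`, `toy_option_context`, and the option names `chunk_size` and `show_progress` are hypothetical.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator

# Hypothetical option store; the keys are made up for illustration.
toy_options: Dict[str, Any] = {"chunk_size": None, "show_progress": True}


@contextmanager
def toy_option_context(overrides: Dict[str, Any]) -> Iterator[None]:
    """Temporarily override known options, restoring them on exit."""
    unknown = set(overrides) - set(toy_options)
    if unknown:
        raise KeyError(f"unknown options: {sorted(unknown)}")
    previous = {key: toy_options[key] for key in overrides}
    toy_options.update(overrides)
    try:
        yield
    finally:
        # Restore the saved values even if the block raised.
        toy_options.update(previous)


with toy_option_context({"chunk_size": 1024}):
    inside = toy_options["chunk_size"]
outside = toy_options["chunk_size"]
print(inside, outside)  # 1024 None
```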


### Pandas Integration

Drop-in replacement for pandas with distributed computing capabilities, supporting DataFrames, Series, and the full pandas API.

```python { .api }
class DataFrame: ...
class Series: ...
class Index: ...

# Data types and constants
class Timedelta: ...
class DateOffset: ...
class Interval: ...
class Timestamp: ...
NaT: object
NA: object
```

[Pandas Integration](./pandas-integration.md)


### NumPy Integration

Distributed array computing with NumPy-compatible API, supporting all NumPy operations on large distributed arrays.

```python { .api }
class ndarray: ...

# NumPy constants and types
bool_: type
int8: type
int16: type
int32: type
int64: type
float16: type
float32: type
float64: type
complex64: type
complex128: type
dtype: type
pi: float
e: float
inf: float
nan: float
```

[NumPy Integration](./numpy-integration.md)


### Machine Learning

Distributed machine learning capabilities through sklearn, XGBoost, and LightGBM integrations, enabling scalable model training and prediction.

```python { .api }
# Sklearn submodules
from xorbits.sklearn import cluster, datasets, decomposition, ensemble
from xorbits.sklearn import linear_model, metrics, model_selection, neighbors
from xorbits.sklearn import preprocessing, semi_supervised

# XGBoost and LightGBM classes dynamically exposed
```

[Machine Learning](./machine-learning.md)


### Datasets

Large-scale dataset handling with support for Hugging Face datasets and efficient data loading patterns.

```python { .api }
class Dataset: ...
def from_huggingface(dataset_name: str, **kwargs): ...
```

[Datasets](./datasets.md)


### Remote Computing

Remote function execution capabilities for distributed computing workloads.

```python { .api }
def spawn(func, **kwargs): ...
```

[Remote Computing](./remote-computing.md)
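
The general shape of remote function execution is: submit a function, get back a handle, and fetch the result later. The sketch below shows that shape with the standard library's `concurrent.futures` rather than Xorbits itself, so it says nothing about the actual parameters or return type of `spawn`. `square` is a hypothetical workload.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """Hypothetical unit of work to run away from the calling thread."""
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a Future (the "handle").
    futures = [pool.submit(square, n) for n in range(4)]
    # result() blocks until each piece of work has finished.
    results = [future.result() for future in futures]

print(results)  # [0, 1, 4, 9]
```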


## Types


### Core Data Types

```python { .api }
from enum import Enum

class Data:
    """Base data container class."""

class DataRef:
    """Reference to distributed data object."""

class DataRefMeta:
    """Metaclass for DataRef."""

class DataType(Enum):
    """Enumeration of data types."""
    object_ = 1
    scalar = 2
    tensor = 3
    dataframe = 4
    series = 5
    index = 6
    categorical = 7
    dataframe_groupby = 8
    series_groupby = 9
    dataset = 10

```
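
An enumeration like `DataType` lends itself to lookups by value or by name. The sketch below redefines a trimmed local copy of the enum, an assumption made so the example is self-contained, and relies only on standard `enum` behavior.

```python
from enum import Enum


class DataType(Enum):
    """Trimmed local copy for illustration; the real class lives in Xorbits."""

    tensor = 3
    dataframe = 4
    series = 5


by_value = DataType(4)         # look up a member by its value
by_name = DataType["series"]   # look up a member by its name
print(by_value, by_name.value)  # DataType.dataframe 5
```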