# Pandas Integration

Drop-in replacement for pandas with distributed computing capabilities. `xorbits.pandas` exposes the same API as pandas while distributing computation across workers, so it can process datasets that exceed the memory of a single machine.

## Capabilities

### Core Data Structures

The fundamental data structures mirror the pandas DataFrame, Series, and Index, with distributed execution.

```python { .api }
class DataFrame:
    """
    Distributed DataFrame with pandas-compatible API.

    Provides all pandas DataFrame functionality with automatic distribution
    across multiple workers for scalable data processing.
    """

class Series:
    """
    Distributed Series with pandas-compatible API.

    One-dimensional labeled array capable of holding any data type,
    distributed across multiple workers.
    """

class Index:
    """
    Distributed Index with pandas-compatible API.

    Immutable sequence used for indexing and alignment,
    supporting distributed operations.
    """
```

### Data Types and Time Components

Pandas-compatible data types and time-related classes for working with temporal data.

```python { .api }
class Timedelta:
    """Time delta class for representing durations."""

class DateOffset:
    """Date offset class for date arithmetic."""

class Interval:
    """Interval class for representing intervals between values."""

class Timestamp:
    """Timestamp class for representing points in time."""

NaT: object
"""Not-a-Time constant for missing time values."""

NA: object
"""Missing value indicator (pandas >= 1.0)."""

class NamedAgg:
    """Named aggregation class for groupby operations (pandas >= 1.0)."""

class ArrowDtype:
    """Arrow data type for PyArrow integration (pandas >= 1.5)."""
```
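
As a minimal sketch of how the time-related classes behave, the snippet below combines `Timestamp`, `Timedelta`, `DateOffset`, and `Interval`. The `try`/`except` import is an assumption made so the example also runs where only plain pandas is installed; the values are illustrative and not taken from the Xorbits documentation.

```python
try:
    import xorbits.pandas as pd  # preferred: the distributed drop-in module
except ImportError:
    import pandas as pd  # assumption: plain pandas as a stand-in

# Timestamp + Timedelta: fixed-duration arithmetic
start = pd.Timestamp("2024-01-31 09:00")
later = start + pd.Timedelta(hours=36)

# DateOffset: calendar-aware arithmetic, clipped to the end of a shorter month
next_month = start + pd.DateOffset(months=1)

# Interval is closed on the right by default, so 5 is inside and 0 is not
window = pd.Interval(0, 5)
inside = 5 in window
outside = 0 in window

print(later)
print(next_month)
print(inside, outside)
```

The difference between `Timedelta` and `DateOffset` matters at month boundaries: the former adds an exact duration, the latter moves by calendar units.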

### Configuration Functions

Configuration management specific to pandas operations, mirroring the pandas options system.

```python { .api }
def describe_option(option_name: str) -> None:
    """
    Describe a configuration option.

    Parameters:
    - option_name: Name of the option to describe
    """

def get_option(option_name: str):
    """
    Get the value of a configuration option.

    Parameters:
    - option_name: Name of the option to retrieve

    Returns:
    - Current value of the option
    """

def set_option(option_name: str, value) -> None:
    """
    Set the value of a configuration option.

    Parameters:
    - option_name: Name of the option to set
    - value: New value for the option
    """

def reset_option(option_name: str) -> None:
    """
    Reset a configuration option to its default value.

    Parameters:
    - option_name: Name of the option to reset
    """

def option_context(*args, **kwargs):
    """
    Context manager for temporarily changing pandas options.

    Parameters:
    - *args: Option names and values as alternating arguments
    - **kwargs: Option names and values as keyword arguments

    Returns:
    - Context manager for temporary option changes
    """

def set_eng_float_format(accuracy: int = 3, use_eng_prefix: bool = False) -> None:
    """
    Set engineering float format for display.

    Parameters:
    - accuracy: Number of decimal digits after the decimal point
    - use_eng_prefix: Use SI prefixes (e.g. "k", "M") instead of exponents such as "E+03"
    """
```
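
To illustrate what "temporary" means for `option_context`, the sketch below records an option, overrides it inside the context, and checks that the previous value is back afterwards. The fallback import to plain pandas is an assumption so the snippet runs without Xorbits installed; the chosen option names and values are only examples.

```python
try:
    import xorbits.pandas as pd  # preferred: the distributed drop-in module
except ImportError:
    import pandas as pd  # assumption: plain pandas as a stand-in

# Remember the value that is active before entering the context
before = pd.get_option("display.max_rows")

# Several options can be overridden at once as alternating name/value pairs
with pd.option_context("display.max_rows", 5, "display.precision", 2):
    inside_rows = pd.get_option("display.max_rows")
    inside_precision = pd.get_option("display.precision")

# On exit the previous values are restored
after = pd.get_option("display.max_rows")

print(before, inside_rows, after)
```

Compared with a `set_option` / `reset_option` pair, the context manager restores the earlier value even when the body raises, and it restores whatever was set before rather than the library default.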

### Specialized Modules

Access to pandas specialized functionality through submodules.

```python { .api }
# Submodules providing specialized functionality
accessors  # DataFrame and Series accessor functionality
core       # Core pandas data structures
groupby    # GroupBy functionality
plotting   # Plotting functionality
window     # Window operations
offsets    # Date offset functionality
```

### Dynamic Function Access

All pandas module-level functions are available through dynamic import, including but not limited to:

```python { .api }
# Data I/O functions
def read_csv(filepath_or_buffer, **kwargs): ...
def read_parquet(path, **kwargs): ...
def read_json(path_or_buf, **kwargs): ...
def read_excel(io, **kwargs): ...
def read_sql(sql, con, **kwargs): ...
def read_pickle(filepath_or_buffer, **kwargs): ...

# Data manipulation functions
def concat(objs, **kwargs): ...
def merge(left, right, **kwargs): ...
def merge_asof(left, right, **kwargs): ...
def crosstab(index, columns, **kwargs): ...
def pivot_table(data, **kwargs): ...
def melt(frame, **kwargs): ...

# Utility functions
def cut(x, bins, **kwargs): ...
def qcut(x, q, **kwargs): ...
def get_dummies(data, **kwargs): ...
def factorize(values, **kwargs): ...
def unique(values): ...
def value_counts(values, **kwargs): ...

# Date/time utilities
def date_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def period_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def timedelta_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def to_datetime(arg, **kwargs): ...
def to_timedelta(arg, **kwargs): ...
def to_numeric(arg, **kwargs): ...
```
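
A few of the utility and date/time functions listed above can be combined as in the following sketch. It assumes they behave as in plain pandas, and the import falls back to pandas when Xorbits is not installed; the frame contents and bin edges are made up for illustration.

```python
try:
    import xorbits.pandas as pd  # preferred: the distributed drop-in module
except ImportError:
    import pandas as pd  # assumption: plain pandas as a stand-in

# Hypothetical input: dates stored as strings next to a numeric column
raw = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "amount": [20.0, 75.0, 130.0],
})

# to_datetime: parse the string column into timestamps
raw["day"] = pd.to_datetime(raw["day"])

# cut: bin the amounts into labeled ranges (0, 50], (50, 100], (100, 200]
raw["size"] = pd.cut(
    raw["amount"], bins=[0, 50, 100, 200], labels=["small", "medium", "large"]
)

# date_range: build a daily calendar covering the same three days
calendar = pd.date_range(start="2024-01-01", periods=3, freq="D")

# melt: reshape the numeric column into long (variable, value) form
long_form = pd.melt(raw, id_vars=["day"], value_vars=["amount"])

print(raw)
print(calendar)
print(long_form)
```

With the Xorbits import, printing is what triggers the deferred computation; with plain pandas each call is evaluated immediately.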

**Usage Examples:**

```python
import xorbits
import xorbits.pandas as pd
import xorbits.numpy as np

xorbits.init()

# Creating DataFrames (same as pandas)
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
})
# Second frame sharing column 'B', used by merge and concat below
other_df = pd.DataFrame({'B': ['a', 'b', 'c'], 'D': [10, 20, 30]})

# Reading data (same as pandas); assumes 'data.csv' exists in the working directory
df_from_csv = pd.read_csv('data.csv')

# Data manipulation (same as pandas)
grouped = df.groupby('B').agg({'A': 'sum', 'C': 'mean'})
merged = pd.merge(df, other_df, on='B')
concatenated = pd.concat([df, other_df])

# Method chaining works the same way as in pandas
result = df.query('A > 2').sort_values('C').head(10)

# Execute computation
computed = xorbits.run(result)

xorbits.shutdown()
```

### Configuration Usage

```python
import xorbits.pandas as pd

# Example frame with more rows than the limits set below
large_dataframe = pd.DataFrame({'value': list(range(1000))})

# Get current display options
max_rows = pd.get_option('display.max_rows')

# Set display options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

# Use option context for temporary changes
with pd.option_context('display.max_rows', 20):
    print(large_dataframe)  # Repr is truncated because the frame exceeds 20 rows

# Reset options
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```