# Data Manipulation

cuDF provides GPU-accelerated operations for reshaping, joining, aggregating, and transforming data. All operations run in parallel on the GPU and are most effective on large datasets.

## Import Statements

```python
# Core manipulation functions
from cudf import concat, merge, pivot, pivot_table, melt, crosstab
from cudf import unstack, get_dummies

# Algorithm functions
from cudf import factorize, unique, cut

# Time/date operations
from cudf import date_range, to_datetime, interval_range, DateOffset
from cudf import to_numeric

# Groupby operations
from cudf import Grouper, NamedAgg
```
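
These names are also reachable through the top-level `cudf` namespace. A one-line sanity check, as a minimal sketch assuming cuDF and a CUDA-capable GPU are available:

```python
import cudf

# Confirm the GPU-backed DataFrame machinery imports and runs
print(cudf.__version__)
print(cudf.concat([cudf.Series([1]), cudf.Series([2])]).to_pandas().tolist())  # [1, 2]
```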

## Concatenation

Combine cuDF objects along axes with flexible alignment and indexing options.

```{ .api }
def concat(
    objs,
    axis=0,
    join='outer',
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=False,
    copy=True
) -> Union[DataFrame, Series]:
    """
    Concatenate cuDF objects along a particular axis with GPU acceleration

    Efficiently combines multiple DataFrames or Series along rows or columns
    with flexible joining and indexing options. GPU-optimized for large datasets.

    Parameters:
        objs: sequence of DataFrame, Series, or dict
            Objects to concatenate (list, tuple, or dict of objects)
        axis: int or str, default 0
            Axis to concatenate along (0/'index' for rows, 1/'columns' for columns)
        join: str, default 'outer'
            How to handle indexes on the other axis ('inner' or 'outer')
        ignore_index: bool, default False
            If True, reset index to default integer index
        keys: sequence, optional
            Construct hierarchical index using keys as outermost level
        levels: list of sequences, optional
            Specific levels to use for MultiIndex construction
        names: list, optional
            Names for levels in resulting hierarchical index
        verify_integrity: bool, default False
            Check whether new concatenated axis contains duplicates
        sort: bool, default False
            Sort non-concatenation axis if not already aligned
        copy: bool, default True
            If False, avoid copying data where possible

    Returns:
        Union[DataFrame, Series]: Concatenated result of same type as input objects

    Examples:
        # Concatenate DataFrames vertically (rows)
        df1 = cudf.DataFrame({'A': [1, 2], 'B': [3, 4]})
        df2 = cudf.DataFrame({'A': [5, 6], 'B': [7, 8]})
        result = cudf.concat([df1, df2])  # 4 rows, 2 columns

        # Concatenate horizontally (columns)
        df3 = cudf.DataFrame({'C': [9, 10], 'D': [11, 12]})
        result = cudf.concat([df1, df3], axis=1)  # 2 rows, 4 columns

        # With hierarchical indexing
        result = cudf.concat([df1, df2], keys=['first', 'second'])

        # Ignore original indexes
        result = cudf.concat([df1, df2], ignore_index=True)
    """
```
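
The docstring snippets above omit setup. A minimal, self-contained sketch of row- and column-wise concatenation, assuming cuDF is installed and a CUDA GPU is available (column names are illustrative):

```python
import cudf

# Two frames with the same columns: stack them vertically
sales_q1 = cudf.DataFrame({"store": ["A", "B"], "revenue": [100, 200]})
sales_q2 = cudf.DataFrame({"store": ["A", "B"], "revenue": [150, 250]})

stacked = cudf.concat([sales_q1, sales_q2], ignore_index=True)
print(len(stacked))        # 4 rows, fresh integer index

# axis=1 concatenates column-wise, aligning rows on the index
costs = cudf.DataFrame({"cost": [60, 120]})
wide = cudf.concat([sales_q1, costs], axis=1)
print(list(wide.columns))  # ['store', 'revenue', 'cost']
```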

## Merging and Joining

Database-style join operations with various merge strategies and optimizations.

```{ .api }
def merge(
    left,
    right,
    how='inner',
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=('_x', '_y'),
    copy=True,
    indicator=False,
    validate=None,
    method='hash'
) -> DataFrame:
    """
    Merge DataFrame objects with database-style join operations

    High-performance GPU joins with automatic optimization and support
    for various join algorithms. Handles large datasets efficiently.

    Parameters:
        left: DataFrame
            Left DataFrame to merge
        right: DataFrame
            Right DataFrame to merge
        how: str, default 'inner'
            Type of merge ('left', 'right', 'outer', 'inner', 'cross')
        on: label or list, optional
            Column or index level names to join on (must exist in both objects)
        left_on: label or list, optional
            Column or index level names to join on in left DataFrame
        right_on: label or list, optional
            Column or index level names to join on in right DataFrame
        left_index: bool, default False
            Use left DataFrame's index as join key
        right_index: bool, default False
            Use right DataFrame's index as join key
        sort: bool, default False
            Sort join keys lexicographically in result
        suffixes: tuple of str, default ('_x', '_y')
            Suffixes to apply to overlapping column names
        copy: bool, default True
            If True, always copy data; set False to avoid copies when possible
        indicator: bool or str, default False
            Add column indicating source of each row
        validate: str, optional
            Check uniqueness of merge keys ('one_to_one', 'one_to_many', etc.)
        method: str, default 'hash'
            Join algorithm ('hash', 'sort')

    Returns:
        DataFrame: Merged DataFrame combining left and right

    Examples:
        # Inner join on common column
        left = cudf.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
        right = cudf.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
        result = cudf.merge(left, right, on='key')  # Returns A, B rows

        # Left join with explicit key columns
        result = cudf.merge(
            left, right,
            left_on='key', right_on='key',
            how='left'
        )

        # Multiple key join
        result = cudf.merge(df1, df2, on=['key1', 'key2'], how='outer')

        # Index-based join
        result = cudf.merge(
            left, right,
            left_index=True, right_index=True,
            how='inner'
        )
    """
```
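
A runnable left-join sketch built on the call above, assuming cuDF is installed (table contents are illustrative):

```python
import cudf

orders = cudf.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", "bob", "ann", "dan"],
    "amount": [25.0, 40.0, 10.0, 60.0],
})
customers = cudf.DataFrame({
    "customer": ["ann", "bob", "cat"],
    "region": ["east", "west", "east"],
})

# Left join keeps every order; customers absent from the right side yield nulls
enriched = cudf.merge(orders, customers, on="customer", how="left")
print(enriched.sort_values("order_id"))

# The same operation is available as a DataFrame method
matched_only = orders.merge(customers, on="customer", how="inner")
print(len(matched_only))  # 3 rows: orders from 'ann' and 'bob'
```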

## Reshaping Operations

Transform data layout between wide and long formats with pivoting and melting.

```{ .api }
def pivot(
    data,
    index=None,
    columns=None,
    values=None
) -> DataFrame:
    """
    Pivot data to reshape from long to wide format

    Reorganizes data by pivoting column values into new columns.
    GPU-accelerated for large pivot operations.

    Parameters:
        data: DataFrame
            Input DataFrame to pivot
        index: str, list, or array, optional
            Column(s) to use to make new DataFrame's index
        columns: str, list, or array
            Column(s) to use to make new DataFrame's columns
        values: str, list, or array, optional
            Column(s) to use for populating new DataFrame's values

    Returns:
        DataFrame: Pivoted DataFrame with reshaped data

    Examples:
        # Basic pivot
        df = cudf.DataFrame({
            'date': ['2023-01', '2023-01', '2023-02', '2023-02'],
            'variable': ['A', 'B', 'A', 'B'],
            'value': [1, 2, 3, 4]
        })
        result = cudf.pivot(df, index='date', columns='variable', values='value')

        # Multiple values columns (assumes df has 'value1' and 'value2' columns)
        result = cudf.pivot(df, columns='variable', values=['value1', 'value2'])
    """

def pivot_table(
    data,
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    sort=True
) -> DataFrame:
    """
    Create pivot table with aggregation functions

    Generalized pivot operation that applies aggregation functions to
    grouped data. Supports multiple aggregation functions and fill values.

    Parameters:
        data: DataFrame
            Input DataFrame to create pivot table from
        values: str, list, or array, optional
            Column(s) to aggregate
        index: str, list, or array, optional
            Keys to group by on pivot table index
        columns: str, list, or array, optional
            Keys to group by on pivot table columns
        aggfunc: function, list, dict, default 'mean'
            Aggregation function(s) to apply ('mean', 'sum', 'count', etc.)
        fill_value: scalar, optional
            Value to replace missing values with
        margins: bool, default False
            Add row/column margins (subtotals)
        dropna: bool, default True
            Drop columns with all NaN values
        margins_name: str, default 'All'
            Name of margins row/column
        sort: bool, default True
            Sort resulting pivot table by index/columns

    Returns:
        DataFrame: Pivot table with aggregated values

    Examples:
        # Basic pivot table with aggregation
        df = cudf.DataFrame({
            'A': ['foo', 'foo', 'bar', 'bar'],
            'B': ['one', 'two', 'one', 'two'],
            'C': [1, 2, 3, 4],
            'D': [10, 20, 30, 40]
        })
        table = cudf.pivot_table(df, values='C', index='A', columns='B', aggfunc='sum')

        # Multiple aggregation functions
        table = cudf.pivot_table(
            df, values='C', index='A', columns='B',
            aggfunc=['sum', 'mean', 'count']
        )

        # With margins
        table = cudf.pivot_table(df, values='C', index='A', columns='B', margins=True)
    """

def melt(
    frame,
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name='value',
    col_level=None,
    ignore_index=True
) -> DataFrame:
    """
    Unpivot DataFrame from wide to long format (reverse of pivot)

    Transforms columns into rows by "melting" the DataFrame. Useful for
    converting wide-format data to long format for analysis.

    Parameters:
        frame: DataFrame
            DataFrame to melt
        id_vars: list of str, optional
            Column(s) to use as identifier variables
        value_vars: list of str, optional
            Column(s) to unpivot (default: all columns not in id_vars)
        var_name: str, optional
            Name for variable column (default: 'variable')
        value_name: str, default 'value'
            Name for value column
        col_level: int or str, optional
            Level to melt for MultiIndex columns
        ignore_index: bool, default True
            Reset index in result

    Returns:
        DataFrame: Melted DataFrame in long format

    Examples:
        # Basic melt
        df = cudf.DataFrame({
            'id': ['A', 'B'],
            'var1': [1, 3],
            'var2': [2, 4]
        })
        result = cudf.melt(df, id_vars=['id'])  # Long format

        # Specify columns to melt
        result = cudf.melt(
            df,
            id_vars=['id'],
            value_vars=['var1', 'var2'],
            var_name='variable',
            value_name='measurement'
        )
    """
```
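
A small wide-to-long-and-back round trip using melt and pivot_table; a sketch assuming cuDF is installed (column names are illustrative):

```python
import cudf

wide = cudf.DataFrame({
    "city": ["nyc", "sf"],
    "jan":  [10, 20],
    "feb":  [12, 22],
})

# Wide -> long: one row per (city, month) pair
long = cudf.melt(wide, id_vars=["city"], var_name="month", value_name="temp")
print(long)  # 4 rows: city, month, temp

# Long -> wide again, aggregating in case of duplicate (city, month) pairs
back = cudf.pivot_table(long, index="city", columns="month",
                        values="temp", aggfunc="mean")
print(back)
```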

## Cross-tabulation and Dummy Variables

Statistical cross-tabulation and categorical variable encoding.

```{ .api }
def crosstab(
    index,
    columns,
    values=None,
    rownames=None,
    colnames=None,
    aggfunc=None,
    margins=False,
    margins_name='All',
    dropna=True,
    normalize=False
) -> DataFrame:
    """
    Compute cross-tabulation of two or more factors

    Creates frequency table showing relationship between categorical variables.
    GPU-accelerated for large categorical datasets.

    Parameters:
        index: array-like, Series, or list of arrays/Series
            Values to group by in rows
        columns: array-like, Series, or list of arrays/Series
            Values to group by in columns
        values: array-like, optional
            Values to aggregate (default: frequency count)
        rownames: sequence, optional
            Names for row index levels
        colnames: sequence, optional
            Names for column index levels
        aggfunc: function, optional
            Aggregation function if values is specified
        margins: bool, default False
            Add row/column margins
        margins_name: str, default 'All'
            Name for margin row/column
        dropna: bool, default True
            Drop missing value combinations
        normalize: bool or str, default False
            Normalize by dividing by sum ('all', 'index', 'columns')

    Returns:
        DataFrame: Cross-tabulation table

    Examples:
        # Basic cross-tabulation
        a = cudf.Series(['foo', 'foo', 'bar', 'bar'])
        b = cudf.Series(['one', 'two', 'one', 'two'])
        result = cudf.crosstab(a, b)

        # With values and aggregation
        values = cudf.Series([1, 2, 3, 4])
        result = cudf.crosstab(a, b, values=values, aggfunc='sum')

        # Normalized
        result = cudf.crosstab(a, b, normalize=True)
    """

def get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
) -> DataFrame:
    """
    Convert categorical variables to dummy/indicator variables

    Creates binary columns for each category in categorical variables.
    Commonly used for machine learning feature encoding.

    Parameters:
        data: array-like, Series, or DataFrame
            Data to create dummy variables from
        prefix: str, list of str, or dict, optional
            Prefix for dummy column names
        prefix_sep: str, default '_'
            Separator between prefix and category name
        dummy_na: bool, default False
            Add column for missing values
        columns: list-like, optional
            Column names to encode (default: all categorical columns)
        sparse: bool, default False
            Return sparse matrix (not supported, included for compatibility)
        drop_first: bool, default False
            Drop first category to avoid multicollinearity
        dtype: numpy.dtype, optional
            Data type for dummy variables

    Returns:
        DataFrame: DataFrame with dummy variables

    Examples:
        # From Series
        s = cudf.Series(['a', 'b', 'c', 'a'])
        result = cudf.get_dummies(s)  # Creates 3 binary columns

        # From DataFrame with prefix
        df = cudf.DataFrame({'col': ['red', 'blue', 'red', 'green']})
        result = cudf.get_dummies(df, prefix='color')

        # Drop first category
        result = cudf.get_dummies(df, drop_first=True)
    """

def unstack(
    level=-1,
    fill_value=None
) -> DataFrame:
    """
    Pivot index level to columns (MultiIndex method)

    Transforms index level into columns, effectively pivoting the data.
    Used with MultiIndex DataFrames to reshape hierarchical data.

    Parameters:
        level: int, str, or list, default -1
            Level(s) of index to unstack
        fill_value: scalar, optional
            Value to use for missing combinations

    Returns:
        DataFrame: DataFrame with unstacked index level as columns

    Examples:
        # Create MultiIndex DataFrame
        arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
        index = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
        df = cudf.DataFrame({'value': [10, 20, 30, 40]}, index=index)

        # Unstack inner level
        result = df.unstack()  # number level becomes columns

        # Unstack specific level
        result = df.unstack(level='letter')
    """
```
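
A quick categorical-encoding sketch combining crosstab and get_dummies, assuming cuDF is installed (data is illustrative):

```python
import cudf

survey = cudf.DataFrame({
    "plan":    ["free", "pro", "free", "pro", "free"],
    "churned": ["yes", "no", "no", "no", "yes"],
})

# Frequency table: how churn breaks down by plan
counts = cudf.crosstab(survey["plan"], survey["churned"])
print(counts)

# One-hot encode the plan column for a downstream model
features = cudf.get_dummies(survey, columns=["plan"])
print(list(features.columns))  # ['churned', 'plan_free', 'plan_pro']
```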

## Algorithm Functions

Fundamental algorithms for data analysis and preprocessing.

```{ .api }
def factorize(
    values,
    sort=False,
    na_sentinel=-1,
    use_na_sentinel=True
) -> tuple[cupy.ndarray, Index]:
    """
    Encode input values as enumerated type or categorical variable

    Converts object array to integer codes and unique values. Useful for
    creating categorical encodings and memory-efficient representations.

    Parameters:
        values: array-like
            Sequence to factorize (Series, Index, or array-like)
        sort: bool, default False
            Sort unique values and codes
        na_sentinel: int, default -1
            Value to mark missing values with
        use_na_sentinel: bool, default True
            Whether to use sentinel value for missing data

    Returns:
        tuple: (codes, uniques)
            codes: cupy.ndarray of integer codes
            uniques: Index of unique values

    Examples:
        # Basic factorization
        values = cudf.Series(['red', 'blue', 'red', 'green'])
        codes, uniques = cudf.factorize(values)
        # codes: [0, 1, 0, 2], uniques: ['red', 'blue', 'green']

        # With sorting
        codes, uniques = cudf.factorize(values, sort=True)

        # Handle missing values
        values_na = cudf.Series(['a', None, 'b', 'a'])
        codes, uniques = cudf.factorize(values_na)
    """

def unique(values) -> Union[cupy.ndarray, Index]:
    """
    Return unique values from array-like object

    GPU-accelerated unique value extraction with automatic deduplication.
    Preserves data type and handles missing values appropriately.

    Parameters:
        values: array-like
            Input array, Series, or Index

    Returns:
        Union[cupy.ndarray, Index]: Unique values in same type as input

    Examples:
        # From Series
        s = cudf.Series([1, 2, 2, 3, 1, 4])
        unique_vals = cudf.unique(s)  # [1, 2, 3, 4]

        # From array with strings
        arr = ['a', 'b', 'a', 'c', 'b']
        unique_vals = cudf.unique(arr)  # ['a', 'b', 'c']

        # Preserves data type
        dates = cudf.Series(['2023-01-01', '2023-01-02', '2023-01-01'])
        dates = cudf.to_datetime(dates)
        unique_dates = cudf.unique(dates)
    """

def cut(
    x,
    bins,
    right=True,
    labels=None,
    retbins=False,
    precision=3,
    include_lowest=False,
    duplicates='raise'
) -> Union[Series, tuple]:
    """
    Bin continuous values into discrete intervals

    Segments and sorts data values into bins. Useful for creating categorical
    variables from continuous data and histogram-like operations.

    Parameters:
        x: array-like
            Input array to be binned (1-dimensional)
        bins: int, sequence, or IntervalIndex
            Criteria for binning (number of bins or bin edges)
        right: bool, default True
            Whether intervals include right edge
        labels: array-like or False, optional
            Labels for returned bins (length must match number of bins)
        retbins: bool, default False
            Whether to return bins array
        precision: int, default 3
            Precision for bin edge display
        include_lowest: bool, default False
            Whether first interval should be left-inclusive
        duplicates: str, default 'raise'
            Treatment of duplicate bin edges ('raise' or 'drop')

    Returns:
        Union[Series, tuple]: Categorical Series with bin assignments
            If retbins=True, returns (binned_series, bin_edges)

    Examples:
        # Equal-width bins
        values = cudf.Series([1, 7, 5, 4, 6, 3])
        result = cudf.cut(values, bins=3)  # 3 equal-width bins

        # Custom bin edges
        result = cudf.cut(values, bins=[0, 3, 6, 9])

        # With custom labels
        result = cudf.cut(
            values,
            bins=3,
            labels=['low', 'medium', 'high']
        )

        # Return bin edges
        result, bin_edges = cudf.cut(values, bins=4, retbins=True)
    """
```
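
A short preprocessing sketch tying cut and factorize together, assuming cuDF is installed (thresholds and data are illustrative):

```python
import cudf

ages = cudf.Series([23, 37, 45, 19, 62, 30])

# Bucket a continuous column into labelled age bands
bands = cudf.cut(ages, bins=[0, 25, 45, 100], labels=["young", "mid", "senior"])
print(bands)

# Integer-encode a string column; codes index into `uniques`
colors = cudf.Series(["red", "blue", "red", "green"])
codes, uniques = cudf.factorize(colors)
print(codes)    # e.g. [0 1 0 2]
print(uniques)  # ['red', 'blue', 'green']
```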

## Date and Time Operations

Comprehensive date/time functionality for temporal data analysis.

```{ .api }
def date_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    tz=None,
    normalize=False,
    name=None,
    closed=None
) -> DatetimeIndex:
    """
    Generate sequence of dates with GPU acceleration

    Creates DatetimeIndex with regular frequency between start and end dates.
    Supports various frequency specifications and timezone handling.

    Parameters:
        start: str or datetime-like, optional
            Left bound for generating dates
        end: str or datetime-like, optional
            Right bound for generating dates
        periods: int, optional
            Number of periods to generate
        freq: str or DateOffset, optional
            Frequency string ('D', 'H', 'min', 'S', 'MS', etc.)
        tz: str or tzinfo, optional
            Timezone name for localized DatetimeIndex
        normalize: bool, default False
            Normalize start/end dates to midnight
        name: str, optional
            Name of resulting DatetimeIndex
        closed: str, optional
            Make interval closed ('left', 'right', or None)

    Returns:
        DatetimeIndex: Fixed frequency DatetimeIndex

    Examples:
        # Basic date range
        dates = cudf.date_range('2023-01-01', '2023-01-10', freq='D')

        # By number of periods
        dates = cudf.date_range('2023-01-01', periods=10, freq='D')

        # Hourly frequency
        dates = cudf.date_range('2023-01-01', periods=24, freq='H')

        # With timezone
        dates = cudf.date_range('2023-01-01', periods=5, freq='D', tz='UTC')

        # Business days only
        dates = cudf.date_range('2023-01-01', periods=10, freq='B')
    """

def to_datetime(
    arg,
    errors='raise',
    dayfirst=False,
    yearfirst=False,
    utc=None,
    format=None,
    exact=True,
    unit=None,
    infer_datetime_format=False,
    origin='unix',
    cache=True
) -> Union[datetime, Series, DatetimeIndex]:
    """
    Convert argument to datetime with GPU acceleration

    Flexible datetime parsing with automatic format detection and
    error handling. Optimized for large-scale datetime conversions.

    Parameters:
        arg: int, float, str, datetime, list, tuple, array, Series, DataFrame
            Object to convert to datetime
        errors: str, default 'raise'
            Error handling ('raise', 'coerce', 'ignore')
        dayfirst: bool, default False
            Interpret first value as day in ambiguous cases
        yearfirst: bool, default False
            Interpret first value as year in ambiguous cases
        utc: bool, optional
            Return UTC DatetimeIndex if True
        format: str, optional
            Strftime format to use for parsing
        exact: bool, default True
            Whether format must match exactly
        unit: str, optional
            Unit for numeric conversions ('D', 's', 'ms', 'us', 'ns')
        infer_datetime_format: bool, default False
            Attempt to infer format automatically
        origin: scalar, default 'unix'
            Define origin for numeric conversions
        cache: bool, default True
            Use cache for repeated conversion patterns

    Returns:
        Union[datetime, Series, DatetimeIndex]: Converted datetime object

    Examples:
        # String conversion
        dates = cudf.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])

        # With format specification
        dates = cudf.to_datetime(
            ['01/01/2023', '01/02/2023'],
            format='%m/%d/%Y'
        )

        # Numeric timestamps
        timestamps = [1609459200, 1609545600, 1609632000]  # Unix timestamps
        dates = cudf.to_datetime(timestamps, unit='s')

        # Error handling
        mixed = ['2023-01-01', 'invalid', '2023-01-03']
        dates = cudf.to_datetime(mixed, errors='coerce')  # Invalid -> NaT
    """

def interval_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    name=None,
    closed='right'
) -> IntervalIndex:
    """
    Generate sequence of intervals with fixed frequency

    Creates IntervalIndex with regular intervals between start and end.
    Useful for time-based and numeric interval operations.

    Parameters:
        start: numeric or datetime-like, optional
            Left bound for generating intervals
        end: numeric or datetime-like, optional
            Right bound for generating intervals
        periods: int, optional
            Number of intervals to generate
        freq: numeric, str, or DateOffset, optional
            Length of each interval
        name: str, optional
            Name of resulting IntervalIndex
        closed: str, default 'right'
            Which side of intervals is closed ('left', 'right', 'both', 'neither')

    Returns:
        IntervalIndex: Fixed frequency IntervalIndex

    Examples:
        # Numeric intervals
        intervals = cudf.interval_range(start=0, end=10, periods=5)

        # Date intervals
        intervals = cudf.interval_range(
            start='2023-01-01',
            end='2023-01-10',
            freq='2D'
        )

        # Custom frequency
        intervals = cudf.interval_range(start=0, periods=4, freq=2.5)
    """

class DateOffset:
    """
    Standard offset class for date arithmetic and frequency operations

    Base class for date offsets that can be added to datetime objects.
    Provides consistent interface for date manipulation operations.

    Parameters:
        n: int, default 1
            Number of offset periods

    Examples:
        # Create date offset
        offset = cudf.DateOffset(days=1)

        # Add to datetime
        date = cudf.to_datetime('2023-01-01')
        new_date = date + offset

        # Use in date_range
        dates = cudf.date_range('2023-01-01', periods=5, freq=offset)
    """

def to_numeric(
    arg,
    errors='raise',
    downcast=None
) -> Union[Series, scalar]:
    """
    Convert argument to numeric type with GPU acceleration

    Attempts to convert object to numeric type with flexible error handling
    and optional downcasting for memory efficiency.

    Parameters:
        arg: scalar, list, tuple, array, Series
            Object to convert to numeric type
        errors: str, default 'raise'
            Error handling ('raise', 'coerce', 'ignore')
        downcast: str, optional
            Downcast to smallest possible numeric type ('integer', 'signed', 'unsigned', 'float')

    Returns:
        Union[Series, scalar]: Converted numeric object

    Examples:
        # String to numeric conversion
        strings = cudf.Series(['1', '2', '3.5', '4'])
        numeric = cudf.to_numeric(strings)

        # Error handling
        mixed = cudf.Series(['1', '2', 'invalid', '4'])
        numeric = cudf.to_numeric(mixed, errors='coerce')  # Invalid -> NaN

        # Downcast for memory efficiency
        large_ints = cudf.Series([1, 2, 3, 4])  # Default int64
        small_ints = cudf.to_numeric(large_ints, downcast='integer')  # Smallest int type
    """
```
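
A combined parsing-and-cleaning sketch for a small event log, assuming cuDF is installed (data is illustrative):

```python
import cudf

events = cudf.DataFrame({
    "when":  ["2023-01-01 09:00", "2023-01-01 10:30", "not a date"],
    "value": ["10", "12.5", "oops"],
})

# Coerce unparseable entries to NaT / NaN instead of raising
events["when"] = cudf.to_datetime(events["when"], errors="coerce")
events["value"] = cudf.to_numeric(events["value"], errors="coerce")
print(events.dtypes)

# Build a regular hourly index covering one day
hours = cudf.date_range("2023-01-01", periods=24, freq="H")
print(len(hours), hours[0])
```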

## Groupby Operations

Flexible grouping utilities for split-apply-combine operations.

```{ .api }
class Grouper:
    """
    Groupby specification object for complex grouping operations

    Provides detailed control over groupby operations including time-based
    grouping, level selection, and custom key functions.

    Parameters:
        key: str, optional
            Grouping key (column name for DataFrame, None for Series)
        level: int, str, or list, optional
            Level name or number for MultiIndex grouping
        freq: str or DateOffset, optional
            Frequency for time-based grouping
        axis: int, default 0
            Axis to group along
        sort: bool, default True
            Sort group keys

    Examples:
        # Time-based grouping
        df = cudf.DataFrame({
            'date': cudf.date_range('2023-01-01', periods=10, freq='D'),
            'value': range(10)
        })
        monthly = df.groupby(cudf.Grouper(key='date', freq='M')).sum()

        # MultiIndex grouping
        grouper = cudf.Grouper(level='category')
        result = df.groupby(grouper).mean()
    """

class NamedAgg:
    """
    Named aggregation specification for groupby operations

    Provides clear naming for aggregation results when using multiple
    aggregation functions on the same column.

    Parameters:
        column: str
            Column name to aggregate
        aggfunc: str or callable
            Aggregation function name or function

    Examples:
        # Named aggregations
        df = cudf.DataFrame({
            'group': ['A', 'B', 'A', 'B'],
            'value': [1, 2, 3, 4]
        })

        result = df.groupby('group').agg(
            mean_value=cudf.NamedAgg('value', 'mean'),
            sum_value=cudf.NamedAgg('value', 'sum'),
            count_value=cudf.NamedAgg('value', 'count')
        )
    """
```
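
A runnable sketch of the named-aggregation form documented above, assuming cuDF is installed; note that NamedAgg takes the column first, then the aggregation (data is illustrative):

```python
import cudf

trades = cudf.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT", "MSFT", "MSFT"],
    "price":  [190.0, 192.5, 410.0, 405.5, 407.0],
    "qty":    [10, 5, 3, 7, 2],
})

# Each keyword becomes an output column named after that keyword
summary = trades.groupby("symbol").agg(
    avg_price=cudf.NamedAgg(column="price", aggfunc="mean"),
    total_qty=cudf.NamedAgg(column="qty", aggfunc="sum"),
)
print(summary)
```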

## Performance Optimizations

### GPU Memory Management
- **Columnar Operations**: Optimized for columnar data layout
- **Memory Pooling**: Efficient memory allocation for operations (see the sketch after these lists)
- **Zero-Copy**: Minimal data movement between manipulations
- **Automatic Broadcasting**: Efficient element-wise operations

### Parallel Algorithms
- **Hash-Based Joins**: GPU-optimized hash joins for merge operations
- **Parallel Sort**: Multi-key parallel sorting algorithms
- **Grouped Operations**: SIMD-optimized groupby aggregations
- **Vectorized Functions**: GPU kernels for element-wise operations

### Query Optimization
- **Kernel Fusion**: Combine multiple operations into single GPU kernels
- **Lazy Evaluation**: Defer computation until results are needed
- **Memory-Aware**: Automatically choose algorithms based on available memory
- **Cache Locality**: Optimize memory access patterns for GPU caches
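
Most of the bullets above are internal to the library; the knob users most often touch themselves is the RMM memory pool. A hedged sketch of enabling a pool before a sequence of manipulation calls, assuming the `rmm` package that ships alongside cuDF (the pool size is illustrative):

```python
import rmm
import cudf

# Pre-allocate a 1 GiB pool so repeated concat/merge/groupby calls reuse
# GPU memory instead of returning to the CUDA allocator each time.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

df = cudf.DataFrame({"key": [1, 2, 3] * 1000, "val": range(3000)})
out = cudf.concat([df, df]).groupby("key").sum()
print(out.head())
```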