# Data Manipulation

cuDF provides GPU-accelerated operations for reshaping, joining, aggregating, and transforming data. These operations run in parallel on the GPU, which pays off most on large datasets.

## Import Statements

```python
# Core manipulation functions
from cudf import concat, merge, pivot, pivot_table, melt, crosstab
from cudf import unstack, get_dummies

# Algorithm functions
from cudf import factorize, unique, cut

# Date/time and numeric conversion
from cudf import date_range, to_datetime, interval_range, DateOffset
from cudf import to_numeric

# Groupby operations
from cudf import Grouper, NamedAgg
```

## Concatenation

Combine cuDF objects along rows or columns, with control over index alignment and hierarchical keys.

```{ .api }
def concat(
    objs,
    axis=0,
    join='outer',
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=False,
    copy=True
) -> Union[DataFrame, Series]:
    """
    Concatenate cuDF objects along a particular axis with GPU acceleration

    Efficiently combines multiple DataFrames or Series along rows or columns
    with flexible joining and indexing options. GPU-optimized for large datasets.

    Parameters:
        objs: sequence of DataFrame, Series, or dict
            Objects to concatenate (list, tuple, or dict of objects)
        axis: int or str, default 0
            Axis to concatenate along (0/'index' for rows, 1/'columns' for columns)
        join: str, default 'outer'
            How to handle indexes on the other axis ('inner' or 'outer')
        ignore_index: bool, default False
            If True, reset index to default integer index
        keys: sequence, optional
            Construct hierarchical index using keys as outermost level
        levels: list of sequences, optional
            Specific levels to use for MultiIndex construction
        names: list, optional
            Names for levels in resulting hierarchical index
        verify_integrity: bool, default False
            Check whether new concatenated axis contains duplicates
        sort: bool, default False
            Sort non-concatenation axis if not already aligned
        copy: bool, default True
            If False, avoid copying data where possible

    Returns:
        Union[DataFrame, Series]: Concatenated result; a Series only when
        all inputs are Series and axis=0

    Examples:
        # Concatenate DataFrames vertically (rows)
        df1 = cudf.DataFrame({'A': [1, 2], 'B': [3, 4]})
        df2 = cudf.DataFrame({'A': [5, 6], 'B': [7, 8]})
        result = cudf.concat([df1, df2])  # 4 rows, 2 columns

        # Concatenate horizontally (columns)
        df3 = cudf.DataFrame({'C': [9, 10], 'D': [11, 12]})
        result = cudf.concat([df1, df3], axis=1)  # 2 rows, 4 columns

        # With hierarchical indexing
        result = cudf.concat([df1, df2], keys=['first', 'second'])

        # Ignore original indexes
        result = cudf.concat([df1, df2], ignore_index=True)
    """
```
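To make the `join` and `ignore_index` options concrete, here is a minimal pure-Python sketch of row-wise concatenation. It is a CPU-side model for illustration only, not cuDF's implementation: the `concat_rows` helper and the dict-based "frame" layout are assumptions made up for this example, and `None` stands in for a null value.

```python
def concat_rows(frames, join="outer", ignore_index=False):
    """Row-wise concatenation of dict-based frames (hypothetical model).

    Each frame is {"index": [...], "columns": {name: [values]}}.
    """
    names = [list(f["columns"]) for f in frames]
    if join == "outer":
        # Union of column names, in first-seen order.
        cols = list(dict.fromkeys(n for ns in names for n in ns))
    elif join == "inner":
        # Only columns present in every frame.
        cols = [n for n in names[0] if all(n in ns for ns in names[1:])]
    else:
        raise ValueError("join must be 'outer' or 'inner'")

    out = {c: [] for c in cols}
    index = []
    for f in frames:
        n_rows = len(f["index"])
        index.extend(f["index"])
        for c in cols:
            # Columns a frame lacks are filled with None (stand-in for null).
            out[c].extend(f["columns"].get(c, [None] * n_rows))
    if ignore_index:
        # Discard the original labels and number the rows 0..n-1.
        index = list(range(len(index)))
    return {"index": index, "columns": out}


df1 = {"index": [0, 1], "columns": {"A": [1, 2], "B": [3, 4]}}
df2 = {"index": [0, 1], "columns": {"A": [5, 6], "C": [7, 8]}}

outer = concat_rows([df1, df2])
inner = concat_rows([df1, df2], join="inner", ignore_index=True)
```

With `join="outer"` the result keeps columns `A`, `B`, and `C` and repeats the original index labels; with `join="inner"` only the shared column `A` survives, and `ignore_index=True` renumbers the rows.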

## Merging and Joining

Database-style joins between DataFrames, with inner, left, right, outer, and cross strategies.

```{ .api }
def merge(
    left,
    right,
    how='inner',
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=('_x', '_y'),
    copy=True,
    indicator=False,
    validate=None,
    method='hash'
) -> DataFrame:
    """
    Merge DataFrame objects with database-style join operations

    High-performance GPU joins with automatic optimization and support
    for various join algorithms. Handles large datasets efficiently.

    Parameters:
        left: DataFrame
            Left DataFrame to merge
        right: DataFrame
            Right DataFrame to merge
        how: str, default 'inner'
            Type of merge ('left', 'right', 'outer', 'inner', 'cross')
        on: label or list, optional
            Column or index level names to join on (must exist in both objects)
        left_on: label or list, optional
            Column or index level names to join on in left DataFrame
        right_on: label or list, optional
            Column or index level names to join on in right DataFrame
        left_index: bool, default False
            Use left DataFrame's index as join key
        right_index: bool, default False
            Use right DataFrame's index as join key
        sort: bool, default False
            Sort join keys lexicographically in result
        suffixes: tuple of str, default ('_x', '_y')
            Suffixes to apply to overlapping column names
        copy: bool, default True
            If True, always copy data; set to False to avoid copies where possible
        indicator: bool or str, default False
            Add column indicating source of each row
        validate: str, optional
            Check uniqueness of merge keys ('one_to_one', 'one_to_many', etc.)
        method: str, default 'hash'
            Join algorithm ('hash', 'sort')

    Returns:
        DataFrame: Merged DataFrame combining left and right

    Examples:
        # Inner join on common column
        left = cudf.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
        right = cudf.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
        result = cudf.merge(left, right, on='key')  # Returns A, B rows

        # Left join with explicit left_on/right_on
        # (the two key columns may have different names; here both are 'key')
        result = cudf.merge(
            left, right,
            left_on='key', right_on='key',
            how='left'
        )

        # Multiple key join (assumes df1 and df2 both contain 'key1' and 'key2')
        result = cudf.merge(df1, df2, on=['key1', 'key2'], how='outer')

        # Index-based join
        result = cudf.merge(
            left, right,
            left_index=True, right_index=True,
            how='inner'
        )
    """
```
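The `method='hash'` option refers to the classic build-and-probe hash join. The sketch below shows that idea in plain Python for `how='inner'` and `how='left'` on a single key. It is an illustration under stated assumptions, not how cuDF executes joins on the GPU: `hash_join` is a hypothetical helper, rows are plain dicts, and overlapping non-key column names (the `suffixes` case) are not handled.

```python
from collections import defaultdict


def hash_join(left, right, on, how="inner"):
    """Join two lists of row dicts on one key column (inner or left)."""
    # Build phase: index the right-hand rows by their key value.
    table = defaultdict(list)
    for row in right:
        table[row[on]].append(row)

    right_cols = [c for c in (right[0] if right else {}) if c != on]
    result = []
    # Probe phase: look up each left-hand key in the hash table.
    for lrow in left:
        matches = table.get(lrow[on], [])
        if matches:
            for rrow in matches:
                result.append({**lrow, **{c: rrow[c] for c in right_cols}})
        elif how == "left":
            # Unmatched left rows are kept, with None for right-hand columns.
            result.append({**lrow, **{c: None for c in right_cols}})
    return result


left = [{"key": "A", "value1": 1}, {"key": "B", "value1": 2}, {"key": "C", "value1": 3}]
right = [{"key": "A", "value2": 4}, {"key": "B", "value2": 5}, {"key": "D", "value2": 6}]

inner = hash_join(left, right, on="key")
left_join = hash_join(left, right, on="key", how="left")
```

The inner join keeps only keys `A` and `B`; the left join also keeps `C`, with `value2` set to `None`.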

## Reshaping Operations

Transform data layout between wide and long formats with pivoting and melting.

```{ .api }
def pivot(
    data,
    index=None,
    columns=None,
    values=None
) -> DataFrame:
    """
    Pivot data to reshape from long to wide format

    Reorganizes data by pivoting column values into new columns.
    GPU-accelerated for large pivot operations.

    Parameters:
        data: DataFrame
            Input DataFrame to pivot
        index: str, list, or array, optional
            Column(s) to use to make new DataFrame's index
        columns: str, list, or array
            Column(s) to use to make new DataFrame's columns
        values: str, list, or array, optional
            Column(s) to use for populating new DataFrame's values

    Returns:
        DataFrame: Pivoted DataFrame with reshaped data

    Examples:
        # Basic pivot
        df = cudf.DataFrame({
            'date': ['2023-01', '2023-01', '2023-02', '2023-02'],
            'variable': ['A', 'B', 'A', 'B'],
            'value': [1, 2, 3, 4]
        })
        result = cudf.pivot(df, index='date', columns='variable', values='value')

        # Multiple values columns (assumes df has 'value1' and 'value2' columns)
        result = cudf.pivot(df, columns='variable', values=['value1', 'value2'])
    """

def pivot_table(
    data,
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    sort=True
) -> DataFrame:
    """
    Create pivot table with aggregation functions

    Generalized pivot operation that applies aggregation functions to
    grouped data. Supports multiple aggregation functions and fill values.

    Parameters:
        data: DataFrame
            Input DataFrame to create pivot table from
        values: str, list, or array, optional
            Column(s) to aggregate
        index: str, list, or array, optional
            Keys to group by on pivot table index
        columns: str, list, or array, optional
            Keys to group by on pivot table columns
        aggfunc: function, list, dict, default 'mean'
            Aggregation function(s) to apply ('mean', 'sum', 'count', etc.)
        fill_value: scalar, optional
            Value to replace missing values with
        margins: bool, default False
            Add row/column margins (subtotals)
        dropna: bool, default True
            Drop columns whose entries are all NaN
        margins_name: str, default 'All'
            Name of margins row/column
        sort: bool, default True
            Sort resulting pivot table by index/columns

    Returns:
        DataFrame: Pivot table with aggregated values

    Examples:
        # Basic pivot table with aggregation
        df = cudf.DataFrame({
            'A': ['foo', 'foo', 'bar', 'bar'],
            'B': ['one', 'two', 'one', 'two'],
            'C': [1, 2, 3, 4],
            'D': [10, 20, 30, 40]
        })
        table = cudf.pivot_table(df, values='C', index='A', columns='B', aggfunc='sum')

        # Multiple aggregation functions
        table = cudf.pivot_table(
            df, values='C', index='A', columns='B',
            aggfunc=['sum', 'mean', 'count']
        )

        # With margins
        table = cudf.pivot_table(df, values='C', index='A', columns='B', margins=True)
    """

def melt(
    frame,
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name='value',
    col_level=None,
    ignore_index=True
) -> DataFrame:
    """
    Unpivot DataFrame from wide to long format (reverse of pivot)

    Transforms columns into rows by "melting" the DataFrame. Useful for
    converting wide-format data to long format for analysis.

    Parameters:
        frame: DataFrame
            DataFrame to melt
        id_vars: list of str, optional
            Column(s) to use as identifier variables
        value_vars: list of str, optional
            Column(s) to unpivot (default: all columns not in id_vars)
        var_name: str, optional
            Name for variable column (default: 'variable')
        value_name: str, default 'value'
            Name for value column
        col_level: int or str, optional
            Level to melt for MultiIndex columns
        ignore_index: bool, default True
            Reset index in result

    Returns:
        DataFrame: Melted DataFrame in long format

    Examples:
        # Basic melt
        df = cudf.DataFrame({
            'id': ['A', 'B'],
            'var1': [1, 3],
            'var2': [2, 4]
        })
        result = cudf.melt(df, id_vars=['id'])  # Long format

        # Specify columns to melt
        result = cudf.melt(
            df,
            id_vars=['id'],
            value_vars=['var1', 'var2'],
            var_name='variable',
            value_name='measurement'
        )
    """
```
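The relationship between `pivot` (long to wide) and `melt` (wide to long) is easiest to see as a round trip. The following pure-Python sketch models both on a list of row dicts. The helper names mirror the cuDF functions, but their signatures and the nested-dict "wide" layout are simplifications invented for this example, and the row order produced by this `melt` may differ from what cuDF returns.

```python
def pivot(rows, index, columns, values):
    """Long -> wide: one entry per `index` value, one key per `columns` value."""
    wide = {}
    for row in rows:
        cells = wide.setdefault(row[index], {})
        if row[columns] in cells:
            # A plain pivot does not aggregate, so a repeated
            # (index, column) pair cannot be placed in a single cell.
            raise ValueError("duplicate entries; an aggregating pivot table is needed")
        cells[row[columns]] = row[values]
    return wide


def melt(wide, index_name, var_name="variable", value_name="value"):
    """Wide -> long: one output row per (index, column) cell."""
    return [
        {index_name: idx, var_name: col, value_name: val}
        for idx, cells in wide.items()
        for col, val in cells.items()
    ]


long_rows = [
    {"date": "2023-01", "variable": "A", "value": 1},
    {"date": "2023-01", "variable": "B", "value": 2},
    {"date": "2023-02", "variable": "A", "value": 3},
    {"date": "2023-02", "variable": "B", "value": 4},
]

wide = pivot(long_rows, index="date", columns="variable", values="value")
round_trip = melt(wide, index_name="date")
```

Here `wide` has one entry per date with keys `A` and `B`, and melting it recovers the original long rows. The duplicate check is the reason `pivot_table` exists: it resolves repeated cells with an aggregation function instead of raising.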

## Cross-tabulation and Dummy Variables

Statistical cross-tabulation and categorical variable encoding.

```{ .api }
def crosstab(
    index,
    columns,
    values=None,
    rownames=None,
    colnames=None,
    aggfunc=None,
    margins=False,
    margins_name='All',
    dropna=True,
    normalize=False
) -> DataFrame:
    """
    Compute cross-tabulation of two or more factors

    Creates frequency table showing relationship between categorical variables.
    GPU-accelerated for large categorical datasets.

    Parameters:
        index: array-like, Series, or list of arrays/Series
            Values to group by in rows
        columns: array-like, Series, or list of arrays/Series
            Values to group by in columns
        values: array-like, optional
            Values to aggregate (default: frequency count)
        rownames: sequence, optional
            Names for row index levels
        colnames: sequence, optional
            Names for column index levels
        aggfunc: function, optional
            Aggregation function if values is specified
        margins: bool, default False
            Add row/column margins
        margins_name: str, default 'All'
            Name for margin row/column
        dropna: bool, default True
            Drop missing value combinations
        normalize: bool or str, default False
            Normalize by dividing by sum ('all', 'index', 'columns')

    Returns:
        DataFrame: Cross-tabulation table

    Examples:
        # Basic cross-tabulation
        a = cudf.Series(['foo', 'foo', 'bar', 'bar'])
        b = cudf.Series(['one', 'two', 'one', 'two'])
        result = cudf.crosstab(a, b)

        # With values and aggregation
        values = cudf.Series([1, 2, 3, 4])
        result = cudf.crosstab(a, b, values=values, aggfunc='sum')

        # Normalized
        result = cudf.crosstab(a, b, normalize=True)
    """

def get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
) -> DataFrame:
    """
    Convert categorical variables to dummy/indicator variables

    Creates binary columns for each category in categorical variables.
    Commonly used for machine learning feature encoding.

    Parameters:
        data: array-like, Series, or DataFrame
            Data to create dummy variables from
        prefix: str, list of str, or dict, optional
            Prefix for dummy column names
        prefix_sep: str, default '_'
            Separator between prefix and category name
        dummy_na: bool, default False
            Add column for missing values
        columns: list-like, optional
            Column names to encode (default: all categorical columns)
        sparse: bool, default False
            Return sparse matrix (not supported, included for compatibility)
        drop_first: bool, default False
            Drop first category to avoid multicollinearity
        dtype: numpy.dtype, optional
            Data type for dummy variables

    Returns:
        DataFrame: DataFrame with dummy variables

    Examples:
        # From Series
        s = cudf.Series(['a', 'b', 'c', 'a'])
        result = cudf.get_dummies(s)  # Creates 3 binary columns

        # From DataFrame with prefix
        df = cudf.DataFrame({'col': ['red', 'blue', 'red', 'green']})
        result = cudf.get_dummies(df, prefix='color')

        # Drop first category
        result = cudf.get_dummies(df, drop_first=True)
    """

def unstack(
    level=-1,
    fill_value=None
) -> DataFrame:
    """
    Pivot an index level to columns (DataFrame method for MultiIndex data)

    Transforms an index level into columns, effectively pivoting the data.
    Used with MultiIndex DataFrames to reshape hierarchical data.

    Parameters:
        level: int, str, or list, default -1
            Level(s) of index to unstack
        fill_value: scalar, optional
            Value to use for missing combinations

    Returns:
        DataFrame: DataFrame with unstacked index level as columns

    Examples:
        # Create MultiIndex DataFrame
        arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
        index = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
        df = cudf.DataFrame({'value': [10, 20, 30, 40]}, index=index)

        # Unstack inner level
        result = df.unstack()  # number level becomes columns

        # Unstack specific level
        result = df.unstack(level='letter')
    """
```
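As a rough illustration of what a frequency cross-tabulation and a dummy encoding compute, the sketch below reproduces both on plain Python lists. It is not cuDF code: the two helpers are hypothetical stand-ins with reduced signatures, results are nested dicts rather than DataFrames, and labels are simply sorted.

```python
from collections import Counter


def crosstab(index, columns):
    """Count co-occurrences of paired values from two equal-length sequences."""
    counts = Counter(zip(index, columns))
    row_labels = sorted(set(index))
    col_labels = sorted(set(columns))
    # Counter returns 0 for pairs that never occur together.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


def get_dummies(values, prefix=None, prefix_sep="_", drop_first=False):
    """One 0/1 indicator column per distinct value, in sorted order."""
    categories = sorted(set(values))
    if drop_first:
        # Dropping one category removes the redundant (collinear) column.
        categories = categories[1:]

    def name(cat):
        return f"{prefix}{prefix_sep}{cat}" if prefix is not None else str(cat)

    return {name(cat): [int(v == cat) for v in values] for cat in categories}


a = ["foo", "foo", "bar", "bar"]
b = ["one", "two", "one", "two"]
table = crosstab(a, b)

dummies = get_dummies(["red", "blue", "red", "green"], prefix="color", drop_first=True)
```

Every `(a, b)` pair occurs once, so each cell of `table` is 1. With `drop_first=True` the first sorted category (`blue`) is omitted, leaving `color_green` and `color_red`.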

## Algorithm Functions

Fundamental algorithms for data analysis and preprocessing.

```{ .api }
def factorize(
    values,
    sort=False,
    na_sentinel=-1,
    use_na_sentinel=True
) -> tuple[cupy.ndarray, Index]:
    """
    Encode input values as enumerated type or categorical variable

    Converts object array to integer codes and unique values. Useful for
    creating categorical encodings and memory-efficient representations.

    Parameters:
        values: array-like
            Sequence to factorize (Series, Index, or array-like)
        sort: bool, default False
            Sort unique values and codes
        na_sentinel: int, default -1
            Value to mark missing values with
        use_na_sentinel: bool, default True
            Whether to use sentinel value for missing data

    Returns:
        tuple: (codes, uniques)
            codes: cupy.ndarray of integer codes
            uniques: Index of unique values

    Examples:
        # Basic factorization
        values = cudf.Series(['red', 'blue', 'red', 'green'])
        codes, uniques = cudf.factorize(values)
        # codes: [0, 1, 0, 2], uniques: ['red', 'blue', 'green']

        # With sorting
        codes, uniques = cudf.factorize(values, sort=True)

        # Handle missing values
        values_na = cudf.Series(['a', None, 'b', 'a'])
        codes, uniques = cudf.factorize(values_na)
    """

def unique(values) -> Union[cupy.ndarray, Index]:
    """
    Return unique values from array-like object

    GPU-accelerated unique value extraction with automatic deduplication.
    Preserves data type and handles missing values appropriately.

    Parameters:
        values: array-like
            Input array, Series, or Index

    Returns:
        Union[cupy.ndarray, Index]: Unique values in same type as input

    Examples:
        # From Series
        s = cudf.Series([1, 2, 2, 3, 1, 4])
        unique_vals = cudf.unique(s)  # [1, 2, 3, 4]

        # From a Series of strings
        arr = cudf.Series(['a', 'b', 'a', 'c', 'b'])
        unique_vals = cudf.unique(arr)  # ['a', 'b', 'c']

        # Preserves data type
        dates = cudf.Series(['2023-01-01', '2023-01-02', '2023-01-01'])
        dates = cudf.to_datetime(dates)
        unique_dates = cudf.unique(dates)
    """

def cut(
    x,
    bins,
    right=True,
    labels=None,
    retbins=False,
    precision=3,
    include_lowest=False,
    duplicates='raise'
) -> Union[Series, tuple]:
    """
    Bin continuous values into discrete intervals

    Segments and sorts data values into bins. Useful for creating categorical
    variables from continuous data and histogram-like operations.

    Parameters:
        x: array-like
            Input array to be binned (1-dimensional)
        bins: int, sequence, or IntervalIndex
            Criteria for binning (number of bins or bin edges)
        right: bool, default True
            Whether intervals include right edge
        labels: array-like or False, optional
            Labels for returned bins (length must match number of bins)
        retbins: bool, default False
            Whether to return bins array
        precision: int, default 3
            Precision for bin edge display
        include_lowest: bool, default False
            Whether first interval should be left-inclusive
        duplicates: str, default 'raise'
            Treatment of duplicate bin edges ('raise' or 'drop')

    Returns:
        Union[Series, tuple]: Categorical Series with bin assignments
        If retbins=True, returns (binned_series, bin_edges)

    Examples:
        # Equal-width bins
        values = cudf.Series([1, 7, 5, 4, 6, 3])
        result = cudf.cut(values, bins=3)  # 3 equal-width bins

        # Custom bin edges
        result = cudf.cut(values, bins=[0, 3, 6, 9])

        # With custom labels
        result = cudf.cut(
            values,
            bins=3,
            labels=['low', 'medium', 'high']
        )

        # Return bin edges
        result, bin_edges = cudf.cut(values, bins=4, retbins=True)
    """
```
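The `sort` and `na_sentinel` options of `factorize` and the `right` option of `cut` are easier to follow with a small reference model. The sketch below implements both ideas on Python lists using only the standard library. These are simplified, hypothetical stand-ins rather than cuDF's implementation: this `cut` takes explicit sorted edges only and returns bin numbers instead of a categorical Series.

```python
from bisect import bisect_left, bisect_right


def factorize(values, sort=False, na_sentinel=-1):
    """Return (codes, uniques); None is treated as missing."""
    seen = dict.fromkeys(v for v in values if v is not None)
    # Without sorting, uniques keep first-appearance order.
    uniques = sorted(seen) if sort else list(seen)
    position = {u: i for i, u in enumerate(uniques)}
    codes = [position.get(v, na_sentinel) for v in values]
    return codes, uniques


def cut(values, edges, right=True):
    """Return each value's bin number for sorted `edges`, or None if outside."""
    out = []
    for v in values:
        # right=True  -> bins are (lo, hi]; right=False -> bins are [lo, hi)
        i = (bisect_left(edges, v) if right else bisect_right(edges, v)) - 1
        out.append(i if 0 <= i < len(edges) - 1 else None)
    return out


colors = ["red", "blue", "red", "green"]
codes, uniques = factorize(colors)
sorted_codes, sorted_uniques = factorize(colors, sort=True)
na_codes, na_uniques = factorize(["a", None, "b", "a"])

values = [1, 7, 5, 4, 6, 3]
right_closed = cut(values, [0, 3, 6, 9])
left_closed = cut(values, [0, 3, 6, 9], right=False)
```

The values 3 and 6 sit exactly on bin edges, so they are the ones that move to the next bin when `right` switches from `True` to `False`.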

## Date and Time Operations

Date/time generation, parsing, and offset arithmetic for temporal data analysis, plus numeric conversion.

```{ .api }
def date_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    tz=None,
    normalize=False,
    name=None,
    closed=None
) -> DatetimeIndex:
    """
    Generate sequence of dates with GPU acceleration

    Creates DatetimeIndex with regular frequency between start and end dates.
    Supports various frequency specifications and timezone handling.

    Parameters:
        start: str or datetime-like, optional
            Left bound for generating dates
        end: str or datetime-like, optional
            Right bound for generating dates
        periods: int, optional
            Number of periods to generate
        freq: str or DateOffset, default 'D'
            Frequency string ('D', 'H', 'min', 'S', 'MS', etc.)
        tz: str or tzinfo, optional
            Timezone name for localized DatetimeIndex
        normalize: bool, default False
            Normalize start/end dates to midnight
        name: str, optional
            Name of resulting DatetimeIndex
        closed: str, optional
            Make interval closed ('left', 'right', or None)

    Returns:
        DatetimeIndex: Fixed frequency DatetimeIndex

    Examples:
        # Basic date range
        dates = cudf.date_range('2023-01-01', '2023-01-10', freq='D')

        # By number of periods
        dates = cudf.date_range('2023-01-01', periods=10, freq='D')

        # Hourly frequency
        dates = cudf.date_range('2023-01-01', periods=24, freq='H')

        # With timezone
        dates = cudf.date_range('2023-01-01', periods=5, freq='D', tz='UTC')

        # Business days only
        dates = cudf.date_range('2023-01-01', periods=10, freq='B')
    """

def to_datetime(
    arg,
    errors='raise',
    dayfirst=False,
    yearfirst=False,
    utc=None,
    format=None,
    exact=True,
    unit=None,
    infer_datetime_format=False,
    origin='unix',
    cache=True
) -> Union[datetime, Series, DatetimeIndex]:
    """
    Convert argument to datetime with GPU acceleration

    Flexible datetime parsing with automatic format detection and
    error handling. Optimized for large-scale datetime conversions.

    Parameters:
        arg: int, float, str, datetime, list, tuple, array, Series, DataFrame
            Object to convert to datetime
        errors: str, default 'raise'
            Error handling ('raise', 'coerce', 'ignore')
        dayfirst: bool, default False
            Interpret first value as day in ambiguous cases
        yearfirst: bool, default False
            Interpret first value as year in ambiguous cases
        utc: bool, optional
            Return UTC DatetimeIndex if True
        format: str, optional
            strftime format to use for parsing
        exact: bool, default True
            Whether format must match exactly
        unit: str, optional
            Unit for numeric conversions ('D', 's', 'ms', 'us', 'ns')
        infer_datetime_format: bool, default False
            Attempt to infer format automatically
        origin: scalar, default 'unix'
            Define origin for numeric conversions
        cache: bool, default True
            Use cache for repeated conversion patterns

    Returns:
        Union[datetime, Series, DatetimeIndex]: Converted datetime object

    Examples:
        # String conversion
        dates = cudf.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])

        # With format specification
        dates = cudf.to_datetime(
            ['01/01/2023', '01/02/2023'],
            format='%m/%d/%Y'
        )

        # Numeric timestamps
        timestamps = [1609459200, 1609545600, 1609632000]  # Unix timestamps
        dates = cudf.to_datetime(timestamps, unit='s')

        # Error handling
        mixed = ['2023-01-01', 'invalid', '2023-01-03']
        dates = cudf.to_datetime(mixed, errors='coerce')  # Invalid -> NaT
    """

def interval_range(
    start=None,
    end=None,
    periods=None,
    freq=None,
    name=None,
    closed='right'
) -> IntervalIndex:
    """
    Generate sequence of intervals with fixed frequency

    Creates IntervalIndex with regular intervals between start and end.
    Useful for time-based and numeric interval operations.

    Parameters:
        start: numeric or datetime-like, optional
            Left bound for generating intervals
        end: numeric or datetime-like, optional
            Right bound for generating intervals
        periods: int, optional
            Number of intervals to generate
        freq: numeric, str, or DateOffset, optional
            Length of each interval
        name: str, optional
            Name of resulting IntervalIndex
        closed: str, default 'right'
            Which side of intervals is closed ('left', 'right', 'both', 'neither')

    Returns:
        IntervalIndex: Fixed frequency IntervalIndex

    Examples:
        # Numeric intervals
        intervals = cudf.interval_range(start=0, end=10, periods=5)

        # Date intervals
        intervals = cudf.interval_range(
            start='2023-01-01',
            end='2023-01-10',
            freq='2D'
        )

        # Custom frequency
        intervals = cudf.interval_range(start=0, periods=4, freq=2.5)
    """

class DateOffset:
    """
    Standard offset class for date arithmetic and frequency operations

    Base class for date offsets that can be added to datetime objects.
    Provides consistent interface for date manipulation operations.

    Parameters:
        n: int, default 1
            Number of offset periods
        **kwds:
            Offset components given as keywords (for example days, months, years)

    Examples:
        # Create date offset
        offset = cudf.DateOffset(days=1)

        # Add to datetime
        date = cudf.to_datetime('2023-01-01')
        new_date = date + offset

        # Use in date_range
        dates = cudf.date_range('2023-01-01', periods=5, freq=offset)
    """

def to_numeric(
    arg,
    errors='raise',
    downcast=None
) -> Union[Series, scalar]:
    """
    Convert argument to numeric type with GPU acceleration

    Attempts to convert object to numeric type with flexible error handling
    and optional downcasting for memory efficiency.

    Parameters:
        arg: scalar, list, tuple, array, Series
            Object to convert to numeric type
        errors: str, default 'raise'
            Error handling ('raise', 'coerce', 'ignore')
        downcast: str, optional
            Downcast to smallest possible numeric type ('integer', 'signed', 'unsigned', 'float')

    Returns:
        Union[Series, scalar]: Converted numeric object

    Examples:
        # String to numeric conversion
        strings = cudf.Series(['1', '2', '3.5', '4'])
        numeric = cudf.to_numeric(strings)

        # Error handling
        mixed = cudf.Series(['1', '2', 'invalid', '4'])
        numeric = cudf.to_numeric(mixed, errors='coerce')  # Invalid -> NaN

        # Downcast for memory efficiency
        large_ints = cudf.Series([1, 2, 3, 4])  # Default int64
        small_ints = cudf.to_numeric(large_ints, downcast='integer')  # Smallest int type
    """
```
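For readers who want to check the semantics without a GPU, the standard library can model a fixed-frequency range, a Unix-timestamp conversion with `unit='s'`, and `errors='coerce'`. The sketch below is an assumption-laden illustration: the `date_range` and `to_numeric` helpers are hypothetical, take a `timedelta` step instead of a frequency string, and return plain lists rather than cuDF objects.

```python
import math
from datetime import datetime, timedelta, timezone


def date_range(start, periods, step=timedelta(days=1)):
    """Fixed-frequency sequence: `periods` datetimes spaced `step` apart."""
    first = datetime.fromisoformat(start)
    return [first + i * step for i in range(periods)]


def to_numeric(values, errors="raise"):
    """Convert each value to float; with errors='coerce', failures become NaN."""
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            if errors == "coerce":
                out.append(math.nan)
            else:
                raise
    return out


days = date_range("2023-01-01", periods=3)
hours = date_range("2023-01-01", periods=2, step=timedelta(hours=1))

# Seconds since the Unix epoch, interpreted as UTC (the unit='s' case).
epoch = datetime.fromtimestamp(1609459200, tz=timezone.utc)

numbers = to_numeric(["1", "2", "invalid", "4"], errors="coerce")
```

The timestamp 1609459200 corresponds to 2021-01-01 00:00:00 UTC, and the unparsable string becomes NaN instead of raising.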

## Groupby Operations

Flexible grouping utilities for split-apply-combine operations.

```{ .api }
class Grouper:
    """
    Groupby specification object for complex grouping operations

    Provides detailed control over groupby operations including time-based
    grouping, level selection, and custom key functions.

    Parameters:
        key: str, optional
            Grouping key (column name for DataFrame, None for Series)
        level: int, str, or list, optional
            Level name or number for MultiIndex grouping
        freq: str or DateOffset, optional
            Frequency for time-based grouping
        axis: int, default 0
            Axis to group along
        sort: bool, default True
            Sort group keys

    Examples:
        # Time-based grouping
        df = cudf.DataFrame({
            'date': cudf.date_range('2023-01-01', periods=10, freq='D'),
            'value': range(10)
        })
        monthly = df.groupby(cudf.Grouper(key='date', freq='M')).sum()

        # MultiIndex grouping (assumes df has an index level named 'category')
        grouper = cudf.Grouper(level='category')
        result = df.groupby(grouper).mean()
    """

class NamedAgg:
    """
    Named aggregation specification for groupby operations

    Provides clear naming for aggregation results when using multiple
    aggregation functions on the same column.

    Parameters:
        column: str
            Column name to aggregate
        aggfunc: str or callable
            Aggregation function name or function

    Examples:
        # Named aggregations
        df = cudf.DataFrame({
            'group': ['A', 'B', 'A', 'B'],
            'value': [1, 2, 3, 4]
        })

        result = df.groupby('group').agg(
            mean_value=cudf.NamedAgg('value', 'mean'),
            sum_value=cudf.NamedAgg('value', 'sum'),
            count_value=cudf.NamedAgg('value', 'count')
        )
    """
```
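Named aggregation is split-apply-combine with an explicit output name for each `(column, function)` pair. The pure-Python sketch below models that pattern on a list of row dicts. The `groupby_agg` helper and the `AGG_FUNCS` table are hypothetical names invented for this illustration, and only three aggregation names are supported.

```python
from collections import defaultdict
from statistics import mean

# Aggregation names supported by this sketch (an assumption, not cuDF's list).
AGG_FUNCS = {"mean": mean, "sum": sum, "count": len}


def groupby_agg(rows, by, **named_aggs):
    """Group row dicts by `by`; each keyword is output_name=(column, func_name)."""
    # Split: collect the rows belonging to each group key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)

    result = {}
    for key in sorted(groups):  # mirrors sort=True on the group keys
        members = groups[key]
        # Apply each named aggregation, then combine under the output name.
        result[key] = {
            out_name: AGG_FUNCS[func]([m[col] for m in members])
            for out_name, (col, func) in named_aggs.items()
        }
    return result


rows = [
    {"group": "A", "value": 1},
    {"group": "B", "value": 2},
    {"group": "A", "value": 3},
    {"group": "B", "value": 4},
]

result = groupby_agg(
    rows,
    "group",
    mean_value=("value", "mean"),
    sum_value=("value", "sum"),
    count_value=("value", "count"),
)
```

Each group ends up with one entry per keyword argument, so three aggregations of the same `value` column stay distinguishable by name.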

## Performance Optimizations

### GPU Memory Management
- **Columnar Operations**: Optimized for columnar data layout
- **Memory Pooling**: Efficient memory allocation for operations
- **Zero-Copy**: Minimal data movement between manipulations
- **Automatic Broadcasting**: Efficient element-wise operations

### Parallel Algorithms
- **Hash-Based Joins**: GPU-optimized hash joins for merge operations
- **Parallel Sort**: Multi-key parallel sorting algorithms
- **Grouped Operations**: GPU-parallel groupby aggregations
- **Vectorized Functions**: GPU kernels for element-wise operations

### Query Optimization
- **Kernel Fusion**: Combine multiple operations into single GPU kernels
- **Lazy Evaluation**: Defer computation until results are needed
- **Memory-Aware**: Automatically choose algorithms based on available memory
- **Cache Locality**: Optimize memory access patterns for GPU caches