# String Operations

The `ak.str` functions provide string processing modeled after Apache Arrow's compute functions. They operate efficiently on arrays of strings and cover pattern matching, transformations, analysis, and categorical conversion. All functions also work on nested string arrays.

## Capabilities

### String Case Transformations

Functions for changing the case of string arrays while preserving array structure and passing missing values through unchanged.

```python { .api }
def str.capitalize(array):
    """
    Capitalize the first character of each string.

    Parameters:
    - array: Array of strings to capitalize

    Returns:
    Array with strings having first character capitalized
    """

def str.lower(array):
    """
    Convert strings to lowercase.

    Parameters:
    - array: Array of strings to convert

    Returns:
    Array with strings converted to lowercase
    """

def str.upper(array):
    """
    Convert strings to uppercase.

    Parameters:
    - array: Array of strings to convert

    Returns:
    Array with strings converted to uppercase
    """

def str.swapcase(array):
    """
    Swap case of each character in strings.

    Parameters:
    - array: Array of strings to swap case

    Returns:
    Array with case of each character swapped
    """

def str.title(array):
    """
    Convert strings to title case (capitalize first letter of each word).

    Parameters:
    - array: Array of strings to convert

    Returns:
    Array with strings converted to title case
    """
```
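
As a rough mental model of "preserving array structure", the sketch below applies Python's built-in string methods to a nested list and leaves `None` entries alone. The helper `map_strings` is hypothetical and is not part of the library; it also assumes the per-string results match Python's own `str.upper`, `str.title`, and `str.swapcase`.

```python
def map_strings(data, fn):
    """Apply fn to every string in a nested list, keeping None and nesting intact."""
    if data is None:
        return None  # missing values pass through untouched
    if isinstance(data, str):
        return fn(data)
    return [map_strings(item, fn) for item in data]


nested = [["alice", "bob"], None, ["CHARLIE smith"]]

upper = map_strings(nested, str.upper)       # [['ALICE', 'BOB'], None, ['CHARLIE SMITH']]
title = map_strings(nested, str.title)       # [['Alice', 'Bob'], None, ['Charlie Smith']]
swapped = map_strings(nested, str.swapcase)  # [['ALICE', 'BOB'], None, ['charlie SMITH']]
```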

### String Reversal and Ordering

Function for reversing the characters of each string.

```python { .api }
def str.reverse(array):
    """
    Reverse each string character by character.

    Parameters:
    - array: Array of strings to reverse

    Returns:
    Array with strings reversed
    """
```

### String Padding and Alignment

Functions for padding strings to specified widths with customizable fill characters and alignment options.

```python { .api }
def str.center(array, width, padding=" "):
    """
    Center strings in fields of specified width.

    Parameters:
    - array: Array of strings to center
    - width: int, minimum width of resulting strings
    - padding: str, character to use for padding (default space)

    Returns:
    Array with strings centered and padded to specified width
    """

def str.lpad(array, width, padding=" "):
    """
    Left-pad strings to specified width.

    Parameters:
    - array: Array of strings to pad
    - width: int, minimum width of resulting strings
    - padding: str, character to use for padding (default space)

    Returns:
    Array with strings left-padded to specified width
    """

def str.rpad(array, width, padding=" "):
    """
    Right-pad strings to specified width.

    Parameters:
    - array: Array of strings to pad
    - width: int, minimum width of resulting strings
    - padding: str, character to use for padding (default space)

    Returns:
    Array with strings right-padded to specified width
    """
```
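
The meaning of "minimum width" can be illustrated with plain Python strings. The functions `lpad` and `rpad` below are local stand-ins written for this sketch, built on `str.rjust` and `str.ljust`; they are an assumption about the per-string behaviour, not the library's code. The centering example uses an even amount of padding, because the side that receives the extra character for odd padding may differ between implementations.

```python
def lpad(s, width, padding=" "):
    # Padding goes on the left, so the text ends up right-aligned.
    return s.rjust(width, padding)


def rpad(s, width, padding=" "):
    # Padding goes on the right, so the text ends up left-aligned.
    return s.ljust(width, padding)


ids = ["7", "42", "12345"]

left_padded = [lpad(s, 4, "0") for s in ids]   # ['0007', '0042', '12345']
right_padded = [rpad(s, 4, ".") for s in ids]  # ['7...', '42..', '12345']
centered = "ab".center(6, "*")                 # '**ab**'
```

Strings already longer than `width` come back unchanged, which is why `"12345"` is not truncated.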

### String Trimming and Cleanup

Functions for removing unwanted characters from the beginning, end, or both ends of strings.

```python { .api }
def str.trim(array, characters=None):
    """
    Remove leading and trailing characters from strings.

    Parameters:
    - array: Array of strings to trim
    - characters: str, characters to remove (None for whitespace)

    Returns:
    Array with specified characters trimmed from both ends
    """

def str.ltrim(array, characters=None):
    """
    Remove leading characters from strings.

    Parameters:
    - array: Array of strings to trim
    - characters: str, characters to remove (None for whitespace)

    Returns:
    Array with specified characters trimmed from start
    """

def str.rtrim(array, characters=None):
    """
    Remove trailing characters from strings.

    Parameters:
    - array: Array of strings to trim
    - characters: str, characters to remove (None for whitespace)

    Returns:
    Array with specified characters trimmed from end
    """

def str.trim_whitespace(array):
    """
    Remove leading and trailing whitespace from strings.

    Parameters:
    - array: Array of strings to trim

    Returns:
    Array with whitespace trimmed from both ends
    """

def str.ltrim_whitespace(array):
    """
    Remove leading whitespace from strings.

    Parameters:
    - array: Array of strings to trim

    Returns:
    Array with whitespace trimmed from start
    """

def str.rtrim_whitespace(array):
    """
    Remove trailing whitespace from strings.

    Parameters:
    - array: Array of strings to trim

    Returns:
    Array with whitespace trimmed from end
    """
```
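
One point that is easy to misread is that `characters` names a set of individual characters rather than a prefix or suffix. The sketch below shows that reading with Python's `str.strip` family. The local `trim` function is illustrative only and assumes the array functions follow the same per-string rule.

```python
def trim(s, characters=None):
    # str.strip treats `characters` as a set, not as a literal prefix/suffix.
    return s.strip(characters)


raw = ["--=value=--", "  padded  ", "=-=x"]

trimmed_chars = [trim(s, "-=") for s in raw]  # ['value', '  padded  ', 'x']
trimmed_space = [trim(s) for s in raw]        # ['--=value=--', 'padded', '=-=x']
left_only = [s.lstrip("-=") for s in raw]     # ['value=--', '  padded  ', 'x']
right_only = [s.rstrip("-=") for s in raw]    # ['--=value', '  padded  ', '=-=x']
```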

### String Length and Analysis

Functions for analyzing string properties including length, character counts, and pattern occurrences.

```python { .api }
def str.length(array):
    """
    Get length of each string in characters.

    Parameters:
    - array: Array of strings to measure

    Returns:
    Array of integers representing string lengths
    """

def str.count_substring(array, pattern, ignore_case=False):
    """
    Count non-overlapping occurrences of substring in each string.

    Parameters:
    - array: Array of strings to search
    - pattern: str, substring pattern to count
    - ignore_case: bool, if True perform case-insensitive search

    Returns:
    Array of integers representing count of pattern occurrences
    """

def str.count_substring_regex(array, pattern, flags=0):
    """
    Count non-overlapping regex matches in each string.

    Parameters:
    - array: Array of strings to search
    - pattern: str, regular expression pattern to count
    - flags: int, regex flags (e.g., re.IGNORECASE)

    Returns:
    Array of integers representing count of pattern matches
    """
```
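
Two phrases in these docstrings carry most of the meaning: "in characters" (not bytes) and "non-overlapping". The following plain-Python sketch illustrates both with built-in `len`, `str.count`, and `re.findall`; it assumes the array functions count the same way per string and is not taken from the library.

```python
import re

words = ["banana", "aaaa", "h\u00e9llo", ""]

# Characters versus UTF-8 bytes: the accented letter is one character, two bytes.
lengths = [len(w) for w in words]                       # [6, 4, 5, 0]
byte_lengths = [len(w.encode("utf-8")) for w in words]  # [6, 4, 6, 0]

# Non-overlapping: "aaaa" contains "aa" twice, not three times.
count_aa = [w.count("aa") for w in words]  # [0, 2, 0, 0]
count_an = [w.count("an") for w in words]  # [2, 0, 0, 0]

# Regex counting: number of unaccented vowels per string.
vowels = [len(re.findall(r"[aeiou]", w)) for w in words]  # [3, 4, 1, 0]
```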

### String Search and Pattern Finding

Functions for locating patterns within strings using both literal and regular expression matching.

```python { .api }
def str.find_substring(array, pattern, start=0, end=None, ignore_case=False):
    """
    Find first occurrence of substring in each string.

    Parameters:
    - array: Array of strings to search
    - pattern: str, substring pattern to find
    - start: int, starting position for search
    - end: int, ending position for search (None for end of string)
    - ignore_case: bool, if True perform case-insensitive search

    Returns:
    Array of integers representing position of first match (-1 if not found)
    """

def str.find_substring_regex(array, pattern, flags=0):
    """
    Find first regex match position in each string.

    Parameters:
    - array: Array of strings to search
    - pattern: str, regular expression pattern to find
    - flags: int, regex flags (e.g., re.IGNORECASE)

    Returns:
    Array of integers representing position of first match (-1 if not found)
    """
```
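
The `-1` sentinel for "not found" works the same way as Python's `str.find`. The sketch below models the literal and regex variants for single strings. Both helper functions are hypothetical, and lowercasing both sides is only an approximation of case-insensitive search.

```python
import re


def find_substring(s, pattern, ignore_case=False):
    if ignore_case:
        s, pattern = s.lower(), pattern.lower()
    return s.find(pattern)  # str.find already returns -1 when absent


def find_regex(s, pattern):
    match = re.search(pattern, s)
    return match.start() if match else -1


paths = ["src/Main.py", "README", "docs/main.rst"]

literal = [find_substring(p, "main") for p in paths]                   # [-1, -1, 5]
folded = [find_substring(p, "main", ignore_case=True) for p in paths]  # [4, -1, 5]
extension = [find_regex(p, r"\.[a-z]+$") for p in paths]               # [8, -1, 9]
```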

### Character Type Predicates

Functions for testing character properties and string composition, useful for data validation and filtering.

```python { .api }
def str.is_alnum(array):
    """
    Test if all characters in strings are alphanumeric.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are alphanumeric
    """

def str.is_alpha(array):
    """
    Test if all characters in strings are alphabetic.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are alphabetic
    """

def str.is_ascii(array):
    """
    Test if all characters in strings are ASCII.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings contain only ASCII characters
    """

def str.is_decimal(array):
    """
    Test if all characters in strings are decimal digits.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are decimal
    """

def str.is_digit(array):
    """
    Test if all characters in strings are digits.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings contain only digits
    """

def str.is_lower(array):
    """
    Test if all cased characters in strings are lowercase.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are lowercase
    """

def str.is_numeric(array):
    """
    Test if all characters in strings are numeric.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are numeric
    """

def str.is_printable(array):
    """
    Test if all characters in strings are printable.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are printable
    """

def str.is_space(array):
    """
    Test if all characters in strings are whitespace.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings contain only whitespace
    """

def str.is_title(array):
    """
    Test if strings are in title case.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are in title case
    """

def str.is_upper(array):
    """
    Test if all cased characters in strings are uppercase.

    Parameters:
    - array: Array of strings to test

    Returns:
    Array of booleans indicating if strings are uppercase
    """
```
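
The difference between "decimal", "digit", and "numeric" is easiest to see on a few Unicode characters. The sketch below uses Python's own `str.isdecimal`, `str.isdigit`, and `str.isnumeric`, on the assumption that the array predicates follow the same Unicode categories; that correspondence is not confirmed by this document.

```python
# Digit five, superscript two, vulgar fraction one half, and a letter.
samples = ["5", "\u00b2", "\u00bd", "x"]

table = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}
# "5"      -> (True, True, True)
# "\u00b2" -> (False, True, True)
# "\u00bd" -> (False, False, True)
# "x"      -> (False, False, False)

# Cased-character tests ignore uncased characters such as digits.
upper_flags = ["ABC123".isupper(), "abc".isupper()]  # [True, False]
```

Each predicate is broader than the one before it, so `is_decimal` is the strictest choice when validating strings that must later parse as integers.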

### Pattern Matching and Boolean Tests

Functions for testing string patterns using various matching strategies including prefix/suffix, regex, and SQL-like patterns.

```python { .api }
def str.starts_with(array, pattern, ignore_case=False):
    """
    Test if strings start with specified pattern.

    Parameters:
    - array: Array of strings to test
    - pattern: str, pattern to match at start of strings
    - ignore_case: bool, if True perform case-insensitive matching

    Returns:
    Array of booleans indicating if strings start with pattern
    """

def str.ends_with(array, pattern, ignore_case=False):
    """
    Test if strings end with specified pattern.

    Parameters:
    - array: Array of strings to test
    - pattern: str, pattern to match at end of strings
    - ignore_case: bool, if True perform case-insensitive matching

    Returns:
    Array of booleans indicating if strings end with pattern
    """

def str.match_substring(array, pattern, ignore_case=False):
    """
    Test if strings contain specified substring.

    Parameters:
    - array: Array of strings to test
    - pattern: str, substring pattern to match
    - ignore_case: bool, if True perform case-insensitive matching

    Returns:
    Array of booleans indicating if strings contain pattern
    """

def str.match_substring_regex(array, pattern, flags=0):
    """
    Test if strings match regular expression pattern.

    Parameters:
    - array: Array of strings to test
    - pattern: str, regular expression pattern to match
    - flags: int, regex flags (e.g., re.IGNORECASE)

    Returns:
    Array of booleans indicating if strings match pattern
    """

def str.match_like(array, pattern, ignore_case=False, escape=None):
    """
    Test strings using SQL LIKE pattern matching.

    Parameters:
    - array: Array of strings to test
    - pattern: str, SQL LIKE pattern (% for any chars, _ for single char)
    - ignore_case: bool, if True perform case-insensitive matching
    - escape: str, escape character for literal % and _ (default None)

    Returns:
    Array of booleans indicating if strings match LIKE pattern
    """
```
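
SQL LIKE patterns can be understood as a small subset of regular expressions: `%` behaves like `.*`, `_` behaves like `.`, and the whole string must match. The sketch below makes that translation explicit using Python's `re` module. The functions `like_to_regex` and `match_like` are written for this illustration and deliberately omit escape-character handling.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def match_like(s, pattern, ignore_case=False):
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    # LIKE must cover the whole string, hence fullmatch rather than search.
    return re.fullmatch(like_to_regex(pattern), s, flags) is not None


emails = ["user@example.com", "admin@site.org", "Test@Example.com"]

exact_case = [match_like(e, "%@example.%") for e in emails]                  # [True, False, False]
any_case = [match_like(e, "%@example.%", ignore_case=True) for e in emails]  # [True, False, True]
three_letter_tld = [match_like(e, "%.___") for e in emails]                  # [True, True, True]
```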

### Set Membership Operations

Functions for testing string membership in collections and finding positions within value sets.

```python { .api }
def str.is_in(array, values):
    """
    Test if strings are in specified collection of values.

    Parameters:
    - array: Array of strings to test
    - values: Array or sequence of strings to test membership against

    Returns:
    Array of booleans indicating if strings are in value set
    """

def str.index_in(array, values):
    """
    Find index of strings in specified collection of values.

    Parameters:
    - array: Array of strings to find indices for
    - values: Array or sequence of strings to find indices in

    Returns:
    Array of integers representing index in values (-1 if not found)
    """
```
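
The relationship between the two functions can be sketched with a set and a dictionary. The plain-Python versions below follow the return conventions stated in the docstrings above, including the `-1` sentinel. Keeping the first position of a duplicated value is an assumption made for this sketch.

```python
def is_in(strings, values):
    allowed = set(values)
    return [s in allowed for s in strings]


def index_in(strings, values):
    positions = {}
    for i, v in enumerate(values):
        positions.setdefault(v, i)  # assumption: first position wins for duplicates
    return [positions.get(s, -1) for s in strings]


colors = ["red", "teal", "blue", "red"]
palette = ["blue", "red", "green"]

member = is_in(colors, palette)       # [True, False, True, True]
position = index_in(colors, palette)  # [1, -1, 0, 1]
```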

### String Replacement and Modification

Functions for replacing and modifying string content using literal patterns, regular expressions, or slice operations.

```python { .api }
def str.replace_substring(array, pattern, replacement, max_replacements=None):
    """
    Replace occurrences of substring with replacement string.

    Parameters:
    - array: Array of strings to modify
    - pattern: str, substring pattern to replace
    - replacement: str, replacement string
    - max_replacements: int, maximum number of replacements per string (None for all)

    Returns:
    Array with substring occurrences replaced
    """

def str.replace_substring_regex(array, pattern, replacement, max_replacements=None):
    """
    Replace regex matches with replacement string.

    Parameters:
    - array: Array of strings to modify
    - pattern: str, regular expression pattern to replace
    - replacement: str, replacement string (can include capture groups)
    - max_replacements: int, maximum number of replacements per string (None for all)

    Returns:
    Array with regex matches replaced
    """

def str.replace_slice(array, start, stop, replacement):
    """
    Replace string slice with replacement string.

    Parameters:
    - array: Array of strings to modify
    - start: int, start index of slice to replace
    - stop: int, stop index of slice to replace
    - replacement: str, replacement string

    Returns:
    Array with string slices replaced
    """

def str.repeat(array, repeats):
    """
    Repeat each string specified number of times.

    Parameters:
    - array: Array of strings to repeat
    - repeats: int or Array of ints, number of repetitions for each string

    Returns:
    Array with strings repeated
    """
```
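
To make `max_replacements`, capture-group references, and slice replacement concrete, here is a per-string model in plain Python. The three helper functions are hypothetical stand-ins. The `\1`-style backreference syntax shown is Python's `re` syntax and is assumed, not confirmed, to match what the library accepts.

```python
import re


def replace_substring(s, pattern, replacement, max_replacements=None):
    count = -1 if max_replacements is None else max_replacements
    return s.replace(pattern, replacement, count)


def replace_substring_regex(s, pattern, replacement, max_replacements=None):
    if max_replacements == 0:
        return s  # re.sub treats count=0 as "replace all", so handle zero here
    count = 0 if max_replacements is None else max_replacements
    return re.sub(pattern, replacement, s, count=count)


def replace_slice(s, start, stop, replacement):
    # The replacement need not have the same length as the removed slice.
    return s[:start] + replacement + s[stop:]


all_dashes = replace_substring("a-b-c", "-", "+")     # 'a+b+c'
first_dash = replace_substring("a-b-c", "-", "+", 1)  # 'a+b-c'
swapped = replace_substring_regex("10-20 30-40", r"(\d+)-(\d+)", r"\2-\1", 1)  # '20-10 30-40'
masked = replace_slice("4111222233334444", 0, 12, "*" * 12)  # '************4444'
repeated = "ab" * 3                                   # 'ababab'
```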

### String Extraction and Slicing

Functions for extracting parts of strings using position-based slicing or pattern-based extraction.

```python { .api }
def str.slice(array, start=0, stop=None, step=1):
    """
    Extract substring using slice notation.

    Parameters:
    - array: Array of strings to slice
    - start: int, start index (default 0)
    - stop: int, stop index (None for end of string)
    - step: int, step size (default 1)

    Returns:
    Array containing extracted substrings
    """

def str.extract_regex(array, pattern, flags=0):
    """
    Extract regex capture groups from strings.

    Parameters:
    - array: Array of strings to extract from
    - pattern: str, regular expression with capture groups
    - flags: int, regex flags (e.g., re.IGNORECASE)

    Returns:
    Array of tuples/records containing captured groups (None if no match)
    """
```
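
Per string, slicing follows ordinary `start:stop:step` rules, and extraction yields one record per match or nothing at all. The sketch below models both with Python's slice syntax and `re`. The local `extract_regex` helper and the group names `area` and `line` are illustrative assumptions. Named groups are used because they give each captured field a label.

```python
import re


def extract_regex(s, pattern):
    match = re.search(pattern, s)
    # One record (dict) keyed by group name, or None when nothing matches.
    return match.groupdict() if match else None


sliced = "abcdefgh"[1:6:2]  # 'bdf'
prefix = "abcdefgh"[:3]     # 'abc'

pattern = r"(?P<area>\d{3})-(?P<line>\d{4})"
found = extract_regex("call 555-0199 now", pattern)  # {'area': '555', 'line': '0199'}
missing = extract_regex("no number here", pattern)   # None
```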

### String Splitting and Joining

Functions for splitting strings into components and joining string arrays into single strings.

```python { .api }
def str.split_whitespace(array, max_splits=None):
    """
    Split strings on whitespace characters.

    Parameters:
    - array: Array of strings to split
    - max_splits: int, maximum number of splits per string (None for unlimited)

    Returns:
    Array of lists containing string components
    """

def str.split_pattern(array, pattern, max_splits=None):
    """
    Split strings on literal pattern.

    Parameters:
    - array: Array of strings to split
    - pattern: str, literal pattern to split on
    - max_splits: int, maximum number of splits per string (None for unlimited)

    Returns:
    Array of lists containing string components
    """

def str.split_pattern_regex(array, pattern, max_splits=None, flags=0):
    """
    Split strings using regular expression pattern.

    Parameters:
    - array: Array of strings to split
    - pattern: str, regular expression pattern to split on
    - max_splits: int, maximum number of splits per string (None for unlimited)
    - flags: int, regex flags (e.g., re.IGNORECASE)

    Returns:
    Array of lists containing string components
    """

def str.join(array, separator):
    """
    Join arrays of strings using separator.

    Parameters:
    - array: Array of string lists to join
    - separator: str, separator to use between elements

    Returns:
    Array of strings created by joining list elements
    """

def str.join_element_wise(array, separator):
    """
    Join corresponding elements from multiple string arrays.

    Parameters:
    - array: Array of string lists where each inner list contains strings to join
    - separator: str, separator to use between elements

    Returns:
    Array of strings created by joining corresponding elements
    """
```
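
Splitting and joining are inverse operations on a single string, and `max_splits` simply caps how many cuts are made. The plain-Python sketch below shows those ideas with `str.split`, `re.split`, and `str.join`. Modelling element-wise joining as a `zip` over parallel lists is an interpretation of the summary line above, not a statement about the library's calling convention.

```python
import re

line = "alpha beta gamma"
all_words = line.split()         # ['alpha', 'beta', 'gamma']
one_split = line.split(None, 1)  # ['alpha', 'beta gamma']

record = "a,b,c"
by_comma = record.split(",")    # ['a', 'b', 'c']
limited = record.split(",", 1)  # ['a', 'b,c']

# Regex split: a comma or semicolon followed by optional spaces.
mixed = re.split(r"[,;]\s*", "a, b;c")  # ['a', 'b', 'c']

# Joining undoes a literal split.
rejoined = ",".join(by_comma)  # 'a,b,c'

# Element-wise joining, modelled as zipping two parallel lists.
first = ["ada", "alan"]
last = ["lovelace", "turing"]
full = [" ".join(pair) for pair in zip(first, last)]  # ['ada lovelace', 'alan turing']
```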

### Categorical String Operations

Functions for working with categorical string data, enabling memory-efficient storage and processing of repeated string values.

```python { .api }
def str.to_categorical(array):
    """
    Convert string array to categorical representation.

    Parameters:
    - array: Array of strings to convert

    Returns:
    Array with categorical representation (indices + categories)
    """
```
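
The "indices + categories" representation is ordinary dictionary encoding: each distinct string is stored once, and every element becomes a small integer pointing at it. A minimal pure-Python sketch follows. The local `to_categorical` function and its first-seen ordering of categories are assumptions made for illustration, not the library's internals.

```python
def to_categorical(values):
    categories = []  # unique values, in first-seen order (an assumption)
    lookup = {}      # value -> index into categories
    indices = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(categories)
            categories.append(v)
        indices.append(lookup[v])
    return indices, categories


colors = ["red", "blue", "red", "green", "blue", "red"]
indices, categories = to_categorical(colors)
# indices    -> [0, 1, 0, 2, 1, 0]
# categories -> ['red', 'blue', 'green']

# Decoding recovers the original values.
decoded = [categories[i] for i in indices]
```

The saving grows with repetition: six strings are stored here as three strings plus six small integers.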

## Usage Examples

### Basic String Operations

```python
import awkward as ak

# Create array of strings
names = ak.Array(["alice", "bob", "CHARLIE", "diana"])

# Case transformations
upper_names = ak.str.upper(names)  # ["ALICE", "BOB", "CHARLIE", "DIANA"]
lower_names = ak.str.lower(names)  # ["alice", "bob", "charlie", "diana"]
title_names = ak.str.title(names)  # ["Alice", "Bob", "Charlie", "Diana"]

# String properties
lengths = ak.str.length(names)     # [5, 3, 7, 5]
is_upper = ak.str.is_upper(names)  # [False, False, True, False]
```

### String Filtering and Matching

```python
import awkward as ak

emails = ak.Array(["user@example.com", "admin@site.org", "test@example.com"])

# Pattern matching
has_example = ak.str.match_substring(emails, "example")  # [True, False, True]
starts_admin = ak.str.starts_with(emails, "admin")       # [False, True, False]
ends_com = ak.str.ends_with(emails, ".com")              # [True, False, True]

# Filter based on pattern
example_emails = emails[has_example]  # ["user@example.com", "test@example.com"]
```

### String Transformations

```python
import awkward as ak

# Nested string arrays
data = ak.Array([["hello world", "test"], ["python", "awkward array"]])

# Split strings
split_data = ak.str.split_whitespace(data)
# [[["hello", "world"], ["test"]], [["python"], ["awkward", "array"]]]

# Replace patterns
cleaned = ak.str.replace_substring(data, "test", "demo")
# [["hello world", "demo"], ["python", "awkward array"]]

# Extract parts
first_words = ak.str.split_whitespace(data)[:, :, 0]
# [["hello", "test"], ["python", "awkward"]]
```

### String Padding and Formatting

```python
import awkward as ak

numbers = ak.Array(["1", "22", "333"])

# Pad strings
left_padded = ak.str.lpad(numbers, 5, "0")  # ["00001", "00022", "00333"]
centered = ak.str.center(numbers, 5, "*")   # ["**1**", "*22**", "*333*"]

# Trim whitespace
messy = ak.Array([" hello ", " world ", "test"])
clean = ak.str.trim_whitespace(messy)  # ["hello", "world", "test"]
```

### Regular Expression Operations

```python
import awkward as ak

text = ak.Array(["Phone: 123-456-7890", "Call me at 555-123-4567", "No phone"])

# Extract phone numbers; named capture groups label the extracted fields
phone_pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
matches = ak.str.extract_regex(text, phone_pattern)

# Count pattern occurrences
digit_count = ak.str.count_substring_regex(text, r'\d')  # [10, 10, 0]

# Boolean matching
has_phone = ak.str.match_substring_regex(text, phone_pattern)  # [True, True, False]
```

### Advanced String Processing

```python
import awkward as ak

# String arrays with missing values
data = ak.Array([["alice", "bob"], None, ["charlie"]])

# Operations handle None gracefully
upper_data = ak.str.upper(data)  # [["ALICE", "BOB"], None, ["CHARLIE"]]

# Join string lists
sentences = ak.Array([["hello", "world"], ["python", "is", "great"]])
joined = ak.str.join(sentences, " ")  # ["hello world", "python is great"]

# Categorical conversion for memory efficiency
categories = ak.Array(["red", "blue", "red", "green", "blue", "red"])
categorical = ak.str.to_categorical(categories)  # More memory efficient
```