# UTF-8 String Processing

UTF-8 string manipulation for internal Spark use, with extensive string operations, collation support, and optimized storage. `UTF8String` is an alternative to Java's `String` class, optimized for big data processing, with support for direct memory access and collation-aware operations.

## Capabilities

### UTF8String Core Operations

A UTF-8 encoded string class that implements the `Comparable`, `Externalizable`, `KryoSerializable`, and `Cloneable` interfaces.

```java { .api }
public final class UTF8String implements Comparable<UTF8String>, Externalizable, KryoSerializable, Cloneable {
    // Construction and factory methods
    public UTF8String();
    public static UTF8String fromBytes(byte[] bytes);
    public static UTF8String fromBytes(byte[] bytes, int offset, int numBytes);
    public static UTF8String fromAddress(Object base, long offset, int numBytes);
    public static UTF8String fromString(String str);
    public static UTF8String blankString(int length);
    public static boolean isWhitespaceOrISOControl(int codePoint);
    public static int numBytesForFirstByte(byte b);

    // Constants
    public static final UTF8String EMPTY_UTF8;
    public static final UTF8String ZERO_UTF8;
    public static final UTF8String SPACE_UTF8;

    // Core access methods
    public Object getBaseObject();
    public long getBaseOffset();
    public int numBytes();
    public int numChars();
    public long getPrefix();
    public byte[] getBytes();
    public ByteBuffer getByteBuffer();
}
```
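
The difference between a byte count (`numBytes()`) and a character count (`numChars()`), and the idea behind deriving a sequence length from a lead byte (`numBytesForFirstByte`), can be illustrated without Spark on the classpath. The following is a minimal sketch using only the Java standard library. The class `Utf8Lengths` and its methods are hypothetical names chosen for illustration and are not Spark's implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Spark.
final class Utf8Lengths {
    // Length in bytes of a UTF-8 sequence, derived from its lead byte.
    // Bytes that cannot start a sequence are treated as length 1 in this sketch.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return 1;
    }

    // Counts code points by skipping continuation bytes (10xxxxxx).
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }

    // Convenience: encode a Java String as UTF-8 bytes.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For a string containing one accented letter among four ASCII letters, the byte count in this sketch is one larger than the code point count, which is the same gap that separates `numBytes()` from `numChars()`.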

### String Access and Validation

Methods for accessing individual bytes and code points, validating UTF-8 encoding, and converting between character and byte positions.

```java { .api }
// Character and byte access
public byte getByte(int byteIndex);
public int getChar(int charIndex);
public int codePointFrom(int byteIndex);

// Validation methods
public UTF8String makeValid();
public boolean isValid();
public boolean isFullAscii();

// Position conversion
public int charPosToByte(int charPos);
public int bytePosToChar(int bytePos);
```
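
To make the notion of UTF-8 validity concrete, the sketch below checks and repairs byte sequences with the Java standard library alone. It is an illustration of the general technique (strict decoding versus lenient decoding with U+FFFD replacement), not a description of how `isValid()` or `makeValid()` are implemented. The class `Utf8Validation` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Spark.
final class Utf8Validation {
    // Returns true if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes leniently: malformed sequences are replaced with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A strict check is useful before trusting bytes that arrive from an external source, while lenient decoding trades fidelity for always producing a usable string.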

### Substring Operations

Substring extraction methods with different indexing conventions; `substringSQL` uses SQL-style 1-based positions.

```java { .api }
public UTF8String substring(int start, int until);
public UTF8String substringSQL(int pos, int length);
public UTF8String copyUTF8String(int start, int end);
```
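
The SQL-style convention can be sketched over a plain Java `String`. This example assumes the common Hive/SQL `SUBSTR` rules: positions are 1-based, a position of 0 is treated like 1, and a negative position counts back from the end. The class `SqlSubstring` is a hypothetical name, and the code is an illustration of the convention rather than Spark's implementation.

```java
// Hypothetical helper for illustration only; not part of Spark.
final class SqlSubstring {
    // SQL-style substring measured in code points.
    static String substringSql(String s, int pos, int length) {
        int len = s.codePointCount(0, s.length());
        // Map the SQL position to a 0-based start index.
        int start = pos > 0 ? pos - 1 : (pos < 0 ? len + pos : 0);
        // Use long arithmetic so start + length cannot overflow.
        long end = (long) start + length;
        int from = Math.max(start, 0);
        int to = (int) Math.max(0, Math.min(end, len));
        if (to <= from) return "";
        int fromIdx = s.offsetByCodePoints(0, from);
        int toIdx = s.offsetByCodePoints(0, to);
        return s.substring(fromIdx, toIdx);
    }
}
```

Under these assumptions, `substringSql("Hello World", 1, 5)` and `substringSql("Hello World", 0, 5)` both select the first five characters, and a position of `-5` selects from the fifth character before the end.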

### Search Operations

String search and positional matching methods.

```java { .api }
public boolean contains(UTF8String substring);
public boolean matchAt(UTF8String s, int pos);
public boolean startsWith(UTF8String prefix);
public boolean endsWith(UTF8String suffix);
public int indexOf(UTF8String v, int start);
public int indexOfEmpty(int start);
public int find(UTF8String str, int start);
public int rfind(UTF8String str, int start);
public int findInSet(UTF8String match);
```
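
Of the methods above, `findInSet` is the least self-explanatory. The sketch below shows the MySQL-style `FIND_IN_SET` convention on plain Java strings, assuming that the receiver is a comma-separated list, that the result is a 1-based index, and that 0 means "not found" (including when the search value itself contains a comma). The class name `FindInSet` is hypothetical and the code does not reproduce Spark's implementation.

```java
// Hypothetical helper for illustration only; not part of Spark.
final class FindInSet {
    // Returns the 1-based position of match in a comma-separated list, or 0.
    static int findInSet(String commaSeparated, String match) {
        // A value containing a comma can never equal a single list item.
        if (match.indexOf(',') >= 0) return 0;
        // Limit -1 keeps trailing empty items.
        String[] items = commaSeparated.split(",", -1);
        for (int i = 0; i < items.length; i++) {
            if (items[i].equals(match)) return i + 1;
        }
        return 0;
    }
}
```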

### Case Conversion

Case conversion methods with both ASCII-only and full Unicode support.

```java { .api }
public UTF8String toUpperCase();
public UTF8String toUpperCaseAscii();
public UTF8String toLowerCase();
public UTF8String toLowerCaseAscii();
public UTF8String toTitleCase();
public UTF8String toTitleCaseICU();
```

### Trimming Operations

Trimming methods for removing whitespace or a custom set of characters.

```java { .api }
public UTF8String trim();
public UTF8String trimAll();
public UTF8String trim(UTF8String trimString);
public UTF8String trimLeft();
public UTF8String trimLeft(UTF8String trimString);
public UTF8String trimRight();
public UTF8String trimTrailingSpaces(int numSpaces);
public UTF8String trimRight(UTF8String trimString);
```

### String Manipulation

Methods for string transformation, padding, and replacement.

```java { .api }
public UTF8String reverse();
public UTF8String repeat(int times);
public UTF8String rpad(int len, UTF8String pad);
public UTF8String lpad(int len, UTF8String pad);
public UTF8String subStringIndex(UTF8String delim, int count);
public UTF8String replace(UTF8String search, UTF8String replace);
public UTF8String translate(Map<String, String> dict);
```

### Splitting Operations

String splitting with regex and SQL-style delimiters.

```java { .api }
public UTF8String[] split(UTF8String pattern, int limit);
public UTF8String[] splitSQL(UTF8String delimiter, int limit);
```

### Concatenation

Static methods for string concatenation, listed together with the static `toBinaryString` formatter.

```java { .api }
public static UTF8String concat(UTF8String... inputs);
public static UTF8String concatWs(UTF8String separator, UTF8String... inputs);
public static UTF8String toBinaryString(long val);
```

### Numeric Conversion

Methods for parsing strings as numeric values. The wrapper-based variants return `false` on failure instead of throwing; the `Exact` variants throw `NumberFormatException`.

```java { .api }
public boolean toLong(LongWrapper toLongResult);
public boolean toInt(IntWrapper intWrapper);
public boolean toShort(IntWrapper intWrapper);
public boolean toByte(IntWrapper intWrapper);
public long toLongExact();
public int toIntExact();
public short toShortExact();
public byte toByteExact();
```

### Comparison Operations

Comparison methods, including binary and collation-aware comparisons.

```java { .api }
public int compareTo(UTF8String other);
public int binaryCompare(UTF8String other);
public int semanticCompare(UTF8String other, int collationId);
public boolean equals(Object other);
public boolean binaryEquals(UTF8String other);
public boolean semanticEquals(UTF8String other, int collationId);
```

### I/O Operations

Methods for writing to memory, buffers, and streams, plus the Java and Kryo serialization hooks.

```java { .api }
public void writeToMemory(Object target, long targetOffset);
public void writeTo(ByteBuffer buffer);
public void writeTo(OutputStream out) throws IOException;
public void writeExternal(ObjectOutput out) throws IOException;
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void write(Kryo kryo, Output out);
public void read(Kryo kryo, Input in);
```

### Iterator Support

Iterator methods for forward and reverse code point traversal.

```java { .api }
public Iterator<Integer> codePointIterator();
public Iterator<Integer> codePointIterator(CodePointIteratorType iteratorMode);
public Iterator<Integer> reverseCodePointIterator();
public Iterator<Integer> reverseCodePointIterator(CodePointIteratorType iteratorMode);
```
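
Code point traversal differs from `char`-by-`char` traversal whenever a string contains characters outside the Basic Multilingual Plane. The sketch below shows forward and reverse code point walks over a plain Java `String` using only the standard library. `CodePointWalk` is a hypothetical name for illustration; the code is an analogue of the idea, not the `UTF8String` iterators themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

// Hypothetical helper for illustration only; not part of Spark.
final class CodePointWalk {
    // Collects code points from first to last.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        PrimitiveIterator.OfInt it = s.codePoints().iterator();
        while (it.hasNext()) {
            out.add(it.nextInt());
        }
        return out;
    }

    // Collects code points from last to first, stepping over surrogate pairs.
    static List<Integer> reverse(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }
}
```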

### Utility Methods

Utility methods for string conversion, copying, hashing, edit distance, and Soundex encoding.

```java { .api }
public String toString();
public String toValidString();
public UTF8String clone();
public UTF8String copy();
public int hashCode();
public int levenshteinDistance(UTF8String other);
public int levenshteinDistance(UTF8String other, int threshold);
public UTF8String soundex();
```
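
For readers unfamiliar with Levenshtein distance, the following sketch computes it over code points with the standard two-row dynamic programming approach. It is a textbook version written against plain Java strings; `EditDistance` is a hypothetical name, and the sketch makes no claim about how `levenshteinDistance` is implemented or how its `threshold` overload behaves.

```java
// Hypothetical helper for illustration only; not part of Spark.
final class EditDistance {
    // Minimum number of single code point insertions, deletions,
    // or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length; j++) {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }
}
```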

### UTF8StringBuilder

Helper class for building `UTF8String` objects by concatenating multiple UTF-8 encoded strings.

```java { .api }
public class UTF8StringBuilder {
    public UTF8StringBuilder();
    public UTF8StringBuilder(int initialSize);
    public void append(UTF8String value);
    public void append(String value);
    public void appendBytes(Object base, long offset, int length);
    public UTF8String build();
    public void appendCodePoint(int codePoint);
}
```

## Usage Examples

### Basic String Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

// Create UTF8String instances
UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("World");
UTF8String empty = UTF8String.EMPTY_UTF8;

// Basic properties
int bytes = str1.numBytes(); // Number of UTF-8 bytes
int chars = str1.numChars(); // Number of Unicode code points
boolean isAscii = str1.isFullAscii();

// UTF-8 validation and utility
boolean isWhitespace = UTF8String.isWhitespaceOrISOControl(0x0020); // Space character
byte firstByte = str1.getByte(0); // First UTF-8 byte of "Hello"
int bytesForChar = UTF8String.numBytesForFirstByte(firstByte); // Byte length of the character starting with firstByte

// Concatenation
UTF8String result = UTF8String.concat(str1, UTF8String.fromString(" "), str2);
UTF8String joined = UTF8String.concatWs(UTF8String.fromString(","), str1, str2);

// Conversion back to Java String
String javaString = result.toString();
```

### Substring and Search Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello World Example");

// Substring operations
UTF8String sub1 = text.substring(0, 5); // "Hello" (0-based, character-based)
UTF8String sub2 = text.substringSQL(1, 5); // "Hello" (SQL 1-based indexing)

// Search operations
boolean contains = text.contains(UTF8String.fromString("World"));
int index = text.indexOf(UTF8String.fromString("World"), 0);
boolean startsWith = text.startsWith(UTF8String.fromString("Hello"));
boolean endsWith = text.endsWith(UTF8String.fromString("Example"));

// Match at a specific position (6 is where "World" starts in this ASCII string)
boolean matches = text.matchAt(UTF8String.fromString("World"), 6);
```

### Case Conversion and Trimming

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString(" Hello World ");

// Case conversion
UTF8String upper = text.toUpperCase();
UTF8String lower = text.toLowerCase();
UTF8String title = text.toTitleCase();

// ASCII-only conversion (faster for ASCII strings)
UTF8String upperAscii = text.toUpperCaseAscii();
UTF8String lowerAscii = text.toLowerCaseAscii();

// Trimming operations
UTF8String trimmed = text.trim(); // Remove leading and trailing whitespace
UTF8String leftTrim = text.trimLeft(); // Remove leading whitespace
UTF8String rightTrim = text.trimRight(); // Remove trailing whitespace

// Custom character trimming
UTF8String customTrim = text.trim(UTF8String.fromString(" H"));
```

### String Manipulation

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello");

// String manipulation
UTF8String reversed = text.reverse(); // "olleH"
UTF8String repeated = text.repeat(3); // "HelloHelloHello"
UTF8String padded = text.rpad(10, UTF8String.fromString("*")); // "Hello*****"
UTF8String leftPadded = text.lpad(10, UTF8String.fromString("*")); // "*****Hello"

// Replace operations
UTF8String replaced = text.replace(
    UTF8String.fromString("ll"),
    UTF8String.fromString("XX")
); // "HeXXo"

// Translation using character mapping
Map<String, String> dict = new HashMap<>();
dict.put("l", "1");
dict.put("o", "0");
UTF8String translated = text.translate(dict); // "He110"
```

### Splitting and Parsing

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String csv = UTF8String.fromString("apple,banana,cherry");

// Split operations
UTF8String[] parts = csv.splitSQL(UTF8String.fromString(","), -1);
// Results in: ["apple", "banana", "cherry"]

// Numeric parsing
UTF8String number = UTF8String.fromString("12345");
try {
    long value = number.toLongExact(); // 12345L
    int intValue = number.toIntExact(); // 12345
} catch (NumberFormatException e) {
    // Handle parsing error
}

// Safe parsing with wrapper objects (LongWrapper is nested in UTF8String)
UTF8String.LongWrapper longWrapper = new UTF8String.LongWrapper();
if (number.toLong(longWrapper)) {
    long value = longWrapper.value; // Parsing succeeded
}
```

### String Building

```java
import org.apache.spark.unsafe.UTF8StringBuilder;
import org.apache.spark.unsafe.types.UTF8String;

// Efficient string building
UTF8StringBuilder builder = new UTF8StringBuilder();
builder.append(UTF8String.fromString("Hello"));
builder.append(" "); // Java string automatically converted
builder.append(UTF8String.fromString("World"));
builder.appendCodePoint(0x1F600); // Unicode emoji

UTF8String result = builder.build();
```

### Memory-Based String Operations

```java
import java.nio.charset.StandardCharsets;

import org.apache.spark.unsafe.types.UTF8String;
import org.apache.spark.unsafe.Platform;

// Create string from memory address
byte[] data = "Hello World".getBytes(StandardCharsets.UTF_8);
long address = Platform.allocateMemory(data.length);
Platform.copyMemory(data, Platform.BYTE_ARRAY_OFFSET, null, address, data.length);

// str points at the off-heap bytes; it must not be used after address is freed
UTF8String str = UTF8String.fromAddress(null, address, data.length);

// Write string to memory
long targetAddress = Platform.allocateMemory(str.numBytes());
str.writeToMemory(null, targetAddress);

// Clean up
Platform.freeMemory(address);
Platform.freeMemory(targetAddress);
```

### Collation-Aware Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("HELLO");

// Binary comparison (case-sensitive)
int binaryCompare = str1.binaryCompare(str2); // != 0

// Semantic comparison with collation ID
int collationId = 1; // Assumed here to identify a case-insensitive collation
int semanticCompare = str1.semanticCompare(str2, collationId); // == 0

// Semantic equality
boolean equal = str1.semanticEquals(str2, collationId); // true
```