Tessl Tile for maven/org.apache.spark/spark-unsafe_2.13@4.0.0

# UTF-8 String Processing

`UTF8String` is the UTF-8 encoded string type Spark uses internally. It offers a broad set of string operations, collation-aware comparison, and direct memory access, and serves as an alternative to Java's `String` class optimized for big data processing.

## Capabilities

### UTF8String Core Operations

UTF-8 encoded string class with string manipulation methods. It implements the `Comparable`, `Externalizable`, `KryoSerializable`, and `Cloneable` interfaces.

```java { .api }
public final class UTF8String implements Comparable<UTF8String>, Externalizable, KryoSerializable, Cloneable {
    // Construction and factory methods
    public UTF8String();
    public static UTF8String fromBytes(byte[] bytes);
    public static UTF8String fromBytes(byte[] bytes, int offset, int numBytes);
    public static UTF8String fromAddress(Object base, long offset, int numBytes);
    public static UTF8String fromString(String str);
    public static UTF8String blankString(int length);
    public static boolean isWhitespaceOrISOControl(int codePoint);
    public static int numBytesForFirstByte(byte b);

    // Constants
    public static final UTF8String EMPTY_UTF8;
    public static final UTF8String ZERO_UTF8;
    public static final UTF8String SPACE_UTF8;

    // Core access methods
    public Object getBaseObject();
    public long getBaseOffset();
    public int numBytes();
    public int numChars();
    public long getPrefix();
    public byte[] getBytes();
    public ByteBuffer getByteBuffer();
}
```

### String Access and Validation

Methods for accessing individual characters, bytes, and validating UTF-8 encoding.

```java { .api }
// Character and byte access
public byte getByte(int byteIndex);
public int getChar(int charIndex);
public int codePointFrom(int byteIndex);

// Validation methods
public UTF8String makeValid();
public boolean isValid();
public boolean isFullAscii();

// Position conversion
public int charPosToByte(int charPos);
public int bytePosToChar(int bytePos);
```
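
The methods above distinguish byte positions from character positions, which only coincide for ASCII text. The following plain-JDK sketch illustrates that distinction; it is not Spark's implementation, and the class and method names (`Utf8PositionsSketch`, `sequenceLength`, `countCodePoints`, `charPosToByte`) are hypothetical. It assumes well-formed UTF-8 input and treats invalid leading bytes as single-byte sequences for simplicity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; it neither uses nor reproduces Spark's code.
class Utf8PositionsSketch {
    // Length in bytes of a UTF-8 sequence, judged from its leading byte.
    // Simplification: continuation and invalid bytes are reported as length 1.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return 1;
    }

    // Number of code points: every byte that is not a continuation byte
    // (10xxxxxx) starts a new code point.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }

    // Byte offset at which the code point with index charPos starts.
    static int charPosToByte(byte[] utf8, int charPos) {
        int bytePos = 0;
        for (int i = 0; i < charPos && bytePos < utf8.length; i++) {
            bytePos += sequenceLength(utf8[bytePos]);
        }
        return bytePos;
    }

    public static void main(String[] args) {
        // "h", e-acute (2 bytes), "llo", euro sign (3 bytes)
        byte[] bytes = "h\u00e9llo\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);              // 9
        System.out.println(countCodePoints(bytes));    // 6
        System.out.println(charPosToByte(bytes, 2));   // 3
    }
}
```

Because a character index has to be translated by walking the bytes, character-based access on non-ASCII strings costs more than byte-based access in this sketch.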

### Substring Operations

Substring extraction methods, including character-based `substring` and `substringSQL` with SQL-style 1-based positions.

```java { .api }
public UTF8String substring(int start, int until);
public UTF8String substringSQL(int pos, int length);
public UTF8String copyUTF8String(int start, int end);
```

### Search Operations

Methods for substring search, prefix and suffix tests, and matching at a given position.

```java { .api }
public boolean contains(UTF8String substring);
public boolean matchAt(UTF8String s, int pos);
public boolean startsWith(UTF8String prefix);
public boolean endsWith(UTF8String suffix);
public int indexOf(UTF8String v, int start);
public int indexOfEmpty(int start);
public int find(UTF8String str, int start);
public int rfind(UTF8String str, int start);
public int findInSet(UTF8String match);
```

### Case Conversion

Case conversion methods with both ASCII-only and full Unicode support.

```java { .api }
public UTF8String toUpperCase();
public UTF8String toUpperCaseAscii();
public UTF8String toLowerCase();
public UTF8String toLowerCaseAscii();
public UTF8String toTitleCase();
public UTF8String toTitleCaseICU();
```

### Trimming Operations

Trimming methods for removing whitespace or a custom set of characters.

```java { .api }
public UTF8String trim();
public UTF8String trimAll();
public UTF8String trim(UTF8String trimString);
public UTF8String trimLeft();
public UTF8String trimLeft(UTF8String trimString);
public UTF8String trimRight();
public UTF8String trimTrailingSpaces(int numSpaces);
public UTF8String trimRight(UTF8String trimString);
```

### String Manipulation

Methods for string transformation, padding, and manipulation.

```java { .api }
public UTF8String reverse();
public UTF8String repeat(int times);
public UTF8String rpad(int len, UTF8String pad);
public UTF8String lpad(int len, UTF8String pad);
public UTF8String subStringIndex(UTF8String delim, int count);
public UTF8String replace(UTF8String search, UTF8String replace);
public UTF8String translate(Map<String, String> dict);
```

### Splitting Operations

String splitting with regex and SQL-style delimiters.

```java { .api }
public UTF8String[] split(UTF8String pattern, int limit);
public UTF8String[] splitSQL(UTF8String delimiter, int limit);
```

### Concatenation

Static methods for efficient string concatenation.

```java { .api }
public static UTF8String concat(UTF8String... inputs);
public static UTF8String concatWs(UTF8String separator, UTF8String... inputs);
public static UTF8String toBinaryString(long val);
```

### Numeric Conversion

Methods for parsing strings as numeric values. The wrapper-based variants report success through their boolean return value and store the parsed number in the wrapper; the `*Exact` variants return the value directly.

```java { .api }
public boolean toLong(LongWrapper toLongResult);
public boolean toInt(IntWrapper intWrapper);
public boolean toShort(IntWrapper intWrapper);
public boolean toByte(IntWrapper intWrapper);
public long toLongExact();
public int toIntExact();
public short toShortExact();
public byte toByteExact();
```
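
The two parsing styles listed above (a boolean result plus a wrapper, versus a direct return that can throw) can be sketched with the JDK alone. This is only an illustration of the calling pattern, not Spark's parser: `WrapperParsingSketch`, `LongHolder`, `tryParseLong`, and `parseLongExact` are hypothetical names, and the sketch delegates to `Long.parseLong`, whose accepted formats may differ from those of `UTF8String`.

```java
// Hypothetical names throughout; a plain-JDK sketch of the pattern only.
class WrapperParsingSketch {
    // Mutable result holder, standing in for the wrapper argument.
    static final class LongHolder {
        long value;
    }

    // Returns true and fills `out` on success; returns false instead of throwing.
    static boolean tryParseLong(String s, LongHolder out) {
        try {
            out.value = Long.parseLong(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Exact variant: lets NumberFormatException propagate to the caller.
    static long parseLongExact(String s) {
        return Long.parseLong(s);
    }

    public static void main(String[] args) {
        LongHolder holder = new LongHolder();
        if (tryParseLong("12345", holder)) {
            System.out.println(holder.value);               // 12345
        }
        System.out.println(tryParseLong("12x45", holder));  // false
    }
}
```

A single holder can be reused across many calls, so the caller decides how failures are handled without catching exceptions for every value.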

### Comparison Operations

Comparison methods, including binary and collation-aware comparisons.

```java { .api }
public int compareTo(UTF8String other);
public int binaryCompare(UTF8String other);
public int semanticCompare(UTF8String other, int collationId);
public boolean equals(Object other);
public boolean binaryEquals(UTF8String other);
public boolean semanticEquals(UTF8String other, int collationId);
```
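
The difference between a binary comparison and a collation-aware one can be illustrated without Spark. In the sketch below, `ComparisonSketch`, `binaryCompare`, and `lowercaseCompare` are hypothetical names; comparing lowercased strings is only a stand-in for a case-insensitive collation and is an assumption of this example, since Spark's collations are defined elsewhere and may behave differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration; it does not call into Spark.
class ComparisonSketch {
    // Byte-wise comparison of the UTF-8 encodings, treating bytes as unsigned.
    // The result is normalized to -1, 0, or 1.
    static int binaryCompare(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Integer.signum(Arrays.compareUnsigned(x, y));
    }

    // Stand-in for a case-insensitive collation: compare lowercased forms.
    static int lowercaseCompare(String a, String b) {
        return binaryCompare(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(binaryCompare("Hello", "HELLO"));     // 1
        System.out.println(lowercaseCompare("Hello", "HELLO"));  // 0
    }
}
```

The bytes are compared as unsigned values because multi-byte UTF-8 sequences use bytes at or above `0x80`, which are negative as Java `byte` values and would otherwise sort before ASCII.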

### I/O Operations

Methods for serialization and I/O operations.

```java { .api }
public void writeToMemory(Object target, long targetOffset);
public void writeTo(ByteBuffer buffer);
public void writeTo(OutputStream out) throws IOException;
public void writeExternal(ObjectOutput out) throws IOException;
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void write(Kryo kryo, Output out);
public void read(Kryo kryo, Input in);
```

### Iterator Support

Iterator methods for code point traversal.

```java { .api }
public Iterator<Integer> codePointIterator();
public Iterator<Integer> codePointIterator(CodePointIteratorType iteratorMode);
public Iterator<Integer> reverseCodePointIterator();
public Iterator<Integer> reverseCodePointIterator(CodePointIteratorType iteratorMode);
```

### Utility Methods

Utility methods for string conversion, cloning, hashing, edit distance, and Soundex encoding.

```java { .api }
public String toString();
public String toValidString();
public UTF8String clone();
public UTF8String copy();
public int hashCode();
public int levenshteinDistance(UTF8String other);
public int levenshteinDistance(UTF8String other, int threshold);
public UTF8String soundex();
```
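
For readers unfamiliar with the Levenshtein distance named above, here is a minimal plain-JDK sketch of the textbook two-row dynamic programming algorithm over code points. It is not Spark's implementation and omits the threshold variant; `LevenshteinSketch` and `distance` are hypothetical names.

```java
// Hypothetical illustration of the textbook algorithm; not Spark's code.
class LevenshteinSketch {
    // Minimum number of single code point insertions, deletions, and
    // substitutions needed to turn `a` into `b`.
    static int distance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];

        // Distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // delete all i code points of s
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```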

### UTF8StringBuilder

Helper class for building UTF8String objects by concatenating multiple UTF-8 encoded strings.

```java { .api }
public class UTF8StringBuilder {
    public UTF8StringBuilder();
    public UTF8StringBuilder(int initialSize);
    public void append(UTF8String value);
    public void append(String value);
    public void appendBytes(Object base, long offset, int length);
    public UTF8String build();
    public void appendCodePoint(int codePoint);
}
```
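
Appending a code point to a UTF-8 buffer means encoding it into one to four bytes. The sketch below shows the standard UTF-8 encoding rules using only the JDK; it is an assumption about what such an operation involves in general, not a description of how `UTF8StringBuilder` is implemented. `CodePointAppendSketch` and its `appendCodePoint` method are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of UTF-8 encoding; not Spark's builder.
class CodePointAppendSketch {
    // Encodes one Unicode code point as UTF-8 and appends it to `out`.
    static void appendCodePoint(ByteArrayOutputStream out, int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
            out.write(cp);
        } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                                // 4 bytes: 11110xxx followed by three 10xxxxxx
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendCodePoint(out, 'H');
        appendCodePoint(out, 0x1F600); // emoji, encoded as 4 bytes
        System.out.println(out.size()); // 5
    }
}
```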

## Usage Examples

### Basic String Operations

```java
import java.nio.charset.StandardCharsets;

import org.apache.spark.unsafe.types.UTF8String;

// Create UTF8String instances
UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("World");
UTF8String empty = UTF8String.EMPTY_UTF8;

// Basic properties
int bytes = str1.numBytes();     // Number of UTF-8 bytes
int chars = str1.numChars();     // Number of Unicode characters
boolean isAscii = str1.isFullAscii();

// UTF-8 validation and utility
boolean isWhitespace = UTF8String.isWhitespaceOrISOControl(0x0020); // Space character
byte firstByte = "Hello".getBytes(StandardCharsets.UTF_8)[0];
int bytesForChar = UTF8String.numBytesForFirstByte(firstByte); // Byte length of the UTF-8 sequence starting with this byte

// Concatenation
UTF8String result = UTF8String.concat(str1, UTF8String.fromString(" "), str2);
UTF8String joined = UTF8String.concatWs(UTF8String.fromString(","), str1, str2);

// Conversion back to Java String
String javaString = result.toString();
```

### Substring and Search Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello World Example");

// Substring operations
UTF8String sub1 = text.substring(0, 5);          // "Hello" (character-based)
UTF8String sub2 = text.substringSQL(1, 5);       // "Hello" (SQL 1-based indexing)

// Search operations
boolean contains = text.contains(UTF8String.fromString("World"));
int index = text.indexOf(UTF8String.fromString("World"), 0);
boolean startsWith = text.startsWith(UTF8String.fromString("Hello"));
boolean endsWith = text.endsWith(UTF8String.fromString("Example"));

// Pattern matching at specific position
boolean matches = text.matchAt(UTF8String.fromString("World"), 6);
```

### Case Conversion and Trimming

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("  Hello World  ");

// Case conversion
UTF8String upper = text.toUpperCase();
UTF8String lower = text.toLowerCase();
UTF8String title = text.toTitleCase();

// ASCII-only conversion (faster for ASCII strings)
UTF8String upperAscii = text.toUpperCaseAscii();
UTF8String lowerAscii = text.toLowerCaseAscii();

// Trimming operations
UTF8String trimmed = text.trim();                    // Remove whitespace
UTF8String leftTrim = text.trimLeft();               // Remove leading whitespace
UTF8String rightTrim = text.trimRight();             // Remove trailing whitespace

// Custom character trimming
UTF8String customTrim = text.trim(UTF8String.fromString(" H"));
```

### String Manipulation

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello");

// String manipulation
UTF8String reversed = text.reverse();               // "olleH"
UTF8String repeated = text.repeat(3);               // "HelloHelloHello"
UTF8String padded = text.rpad(10, UTF8String.fromString("*")); // "Hello*****"
UTF8String leftPadded = text.lpad(10, UTF8String.fromString("*")); // "*****Hello"

// Replace operations
UTF8String replaced = text.replace(
    UTF8String.fromString("ll"),
    UTF8String.fromString("XX")
); // "HeXXo"

// Translation using character mapping
Map<String, String> dict = new HashMap<>();
dict.put("l", "1");
dict.put("o", "0");
UTF8String translated = text.translate(dict); // "He110"
```

### Splitting and Parsing

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String csv = UTF8String.fromString("apple,banana,cherry");

// Split operations
UTF8String[] parts = csv.splitSQL(UTF8String.fromString(","), -1);
// Results in: ["apple", "banana", "cherry"]

// Numeric parsing
UTF8String number = UTF8String.fromString("12345");
try {
    long value = number.toLongExact();    // 12345L
    int intValue = number.toIntExact();   // 12345
} catch (NumberFormatException e) {
    // Handle parsing error
}

// Safe parsing with wrapper objects (LongWrapper is nested in UTF8String)
UTF8String.LongWrapper longWrapper = new UTF8String.LongWrapper();
if (number.toLong(longWrapper)) {
    long value = longWrapper.value;  // Parsing succeeded
}
```

### String Building

```java
import org.apache.spark.unsafe.UTF8StringBuilder;
import org.apache.spark.unsafe.types.UTF8String;

// Efficient string building
UTF8StringBuilder builder = new UTF8StringBuilder();
builder.append(UTF8String.fromString("Hello"));
builder.append(" ");  // Java string automatically converted
builder.append(UTF8String.fromString("World"));
builder.appendCodePoint(0x1F600); // Unicode emoji

UTF8String result = builder.build();
```

### Memory-Based String Operations

```java
import java.nio.charset.StandardCharsets;

import org.apache.spark.unsafe.Platform;
import org.apache.spark.unsafe.types.UTF8String;

// Copy the UTF-8 bytes into off-heap memory
byte[] data = "Hello World".getBytes(StandardCharsets.UTF_8);
long address = Platform.allocateMemory(data.length);
Platform.copyMemory(data, Platform.BYTE_ARRAY_OFFSET, null, address, data.length);

// Create string from memory address (a null base object means an absolute address)
UTF8String str = UTF8String.fromAddress(null, address, data.length);

// Write string to memory
long targetAddress = Platform.allocateMemory(str.numBytes());
str.writeToMemory(null, targetAddress);

// Clean up; str must not be used after its backing memory is freed
Platform.freeMemory(address);
Platform.freeMemory(targetAddress);
```

### Collation-Aware Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("HELLO");

// Binary comparison (case-sensitive)
int binaryCompare = str1.binaryCompare(str2);    // != 0

// Semantic comparison with collation ID
int collationId = 1; // Case-insensitive collation
int semanticCompare = str1.semanticCompare(str2, collationId);  // == 0

// Semantic equality
boolean equal = str1.semanticEquals(str2, collationId);  // true
```
