# UTF-8 String Processing

UTF8String is the UTF-8 encoded string type that Spark uses internally. It provides an extensive set of string operations, collation-aware comparison, and storage optimized for big data processing, including direct memory access, as an alternative to Java's `String` class.

## Capabilities

### UTF8String Core Operations

UTF-8 encoded string class with comprehensive string manipulation capabilities, implementing Comparable, Externalizable, KryoSerializable, and Cloneable interfaces.

```java { .api }
public final class UTF8String implements Comparable<UTF8String>, Externalizable, KryoSerializable, Cloneable {
    // Construction and factory methods
    public UTF8String();
    public static UTF8String fromBytes(byte[] bytes);
    public static UTF8String fromBytes(byte[] bytes, int offset, int numBytes);
    public static UTF8String fromAddress(Object base, long offset, int numBytes);
    public static UTF8String fromString(String str);
    public static UTF8String blankString(int length);
    public static boolean isWhitespaceOrISOControl(int codePoint);
    public static int numBytesForFirstByte(byte b);

    // Constants
    public static final UTF8String EMPTY_UTF8;
    public static final UTF8String ZERO_UTF8;
    public static final UTF8String SPACE_UTF8;

    // Core access methods
    public Object getBaseObject();
    public long getBaseOffset();
    public int numBytes();
    public int numChars();
    public long getPrefix();
    public byte[] getBytes();
    public ByteBuffer getByteBuffer();
}
```

### String Access and Validation

Methods for accessing individual characters, bytes, and validating UTF-8 encoding.

```java { .api }
// Character and byte access
public byte getByte(int byteIndex);
public int getChar(int charIndex);
public int codePointFrom(int byteIndex);

// Validation methods
public UTF8String makeValid();
public boolean isValid();
public boolean isFullAscii();

// Position conversion
public int charPosToByte(int charPos);
public int bytePosToChar(int bytePos);
```
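
Byte positions and character positions differ as soon as a string contains non-ASCII text, which is why the API exposes both and converts between them. The following is a minimal JDK-only sketch of that relationship, not Spark's implementation: the class `Utf8PositionSketch` and its methods are hypothetical names, the input is assumed to be valid UTF-8, and returning 0 for an invalid lead byte is a choice made here that may differ from what `numBytesForFirstByte` does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, JDK only. Assumes the input bytes are valid UTF-8.
final class Utf8PositionSketch {

    // Length (1 to 4) of the UTF-8 sequence introduced by this lead byte,
    // or 0 if the byte is a continuation byte or not a valid lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return 0;
    }

    // Number of code points encoded in the byte array.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        int pos = 0;
        while (pos < utf8.length) {
            pos += Math.max(1, sequenceLength(utf8[pos]));
            count++;
        }
        return count;
    }

    // Byte offset at which the code point with index charPos starts.
    static int charPosToByte(byte[] utf8, int charPos) {
        int bytePos = 0;
        for (int i = 0; i < charPos && bytePos < utf8.length; i++) {
            bytePos += Math.max(1, sequenceLength(utf8[bytePos]));
        }
        return bytePos;
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);              // byte count
        System.out.println(countCodePoints(bytes));    // code point count
        System.out.println(charPosToByte(bytes, 2));   // byte offset of the third code point
    }
}
```

For the string "h\u00e9llo", the accented letter occupies two bytes, so the byte count exceeds the code point count by one and the third code point starts at byte offset 3.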

### Substring Operations

Various substring extraction methods with different indexing strategies.

```java { .api }
public UTF8String substring(int start, int until);
public UTF8String substringSQL(int pos, int length);
public UTF8String copyUTF8String(int start, int end);
```
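
The difference between zero-based `substring(start, until)` and one-based `substringSQL(pos, length)` is easiest to see in code. The sketch below works on a plain Java `String` and is only an illustration: `SqlSubstringSketch` is a hypothetical name, and the handling of zero and negative positions follows the common SQL `SUBSTR` convention, which is an assumption here rather than something this document specifies.

```java
// Hypothetical helper, JDK only. Indexes by code point, not by Java char.
final class SqlSubstringSketch {

    static String substringSql(String s, int pos, int length) {
        int len = s.codePointCount(0, s.length());
        // Assumed convention: pos > 0 is one-based, pos == 0 means the first
        // character, pos < 0 counts back from the end of the string.
        long start = pos > 0 ? pos - 1L : (pos < 0 ? (long) len + pos : 0L);
        long end = start + length;
        // Clamp both bounds into [0, len] so out-of-range requests yield a shorter result.
        int from = (int) Math.max(0, Math.min(len, start));
        int to = (int) Math.max(from, Math.min(len, end));
        int fromIdx = s.offsetByCodePoints(0, from);
        int toIdx = s.offsetByCodePoints(0, to);
        return s.substring(fromIdx, toIdx);
    }

    public static void main(String[] args) {
        String text = "Hello World Example";
        System.out.println(substringSql(text, 1, 5));   // first five characters
        System.out.println(substringSql(text, 7, 5));   // five characters starting at position 7
        System.out.println(substringSql(text, -7, 7));  // last seven characters
    }
}
```

Indexing by code point keeps a supplementary character such as an emoji intact instead of splitting its surrogate pair.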

### Search Operations

Comprehensive string search and pattern matching capabilities.

```java { .api }
public boolean contains(UTF8String substring);
public boolean matchAt(UTF8String s, int pos);
public boolean startsWith(UTF8String prefix);
public boolean endsWith(UTF8String suffix);
public int indexOf(UTF8String v, int start);
public int indexOfEmpty(int start);
public int find(UTF8String str, int start);
public int rfind(UTF8String str, int start);
public int findInSet(UTF8String match);
```
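
Positional matching on UTF-8 data can be done directly on bytes, and prefix and suffix checks then fall out as special cases. The sketch below illustrates that idea with plain byte arrays. It is not Spark's code: `MatchAtSketch` and its methods are hypothetical, and treating `pos` as a byte offset is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, JDK 9+ only (uses the range overload of Arrays.equals).
final class MatchAtSketch {

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if the bytes of s occur in text starting at byte offset pos.
    static boolean matchAt(byte[] text, byte[] s, int pos) {
        if (pos < 0 || pos > text.length - s.length) {
            return false;
        }
        return Arrays.equals(text, pos, pos + s.length, s, 0, s.length);
    }

    static boolean startsWith(byte[] text, byte[] prefix) {
        return matchAt(text, prefix, 0);
    }

    static boolean endsWith(byte[] text, byte[] suffix) {
        // A suffix longer than the text gives a negative offset, which matchAt rejects.
        return matchAt(text, suffix, text.length - suffix.length);
    }

    public static void main(String[] args) {
        byte[] text = utf8("Hello World Example");
        System.out.println(matchAt(text, utf8("World"), 6));
        System.out.println(startsWith(text, utf8("Hello")));
        System.out.println(endsWith(text, utf8("Example")));
    }
}
```

Because valid UTF-8 is self-synchronizing, a byte-wise match of one valid string inside another always lines up with character boundaries, so no decoding is needed.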

### Case Conversion

Case conversion methods with both ASCII-only and full Unicode support.

```java { .api }
public UTF8String toUpperCase();
public UTF8String toUpperCaseAscii();
public UTF8String toLowerCase();
public UTF8String toLowerCaseAscii();
public UTF8String toTitleCase();
public UTF8String toTitleCaseICU();
```

### Trimming Operations

Various trimming methods for whitespace and custom character removal.

```java { .api }
public UTF8String trim();
public UTF8String trimAll();
public UTF8String trim(UTF8String trimString);
public UTF8String trimLeft();
public UTF8String trimLeft(UTF8String trimString);
public UTF8String trimRight();
public UTF8String trimTrailingSpaces(int numSpaces);
public UTF8String trimRight(UTF8String trimString);
```

### String Manipulation

Methods for string transformation, padding, and manipulation.

```java { .api }
public UTF8String reverse();
public UTF8String repeat(int times);
public UTF8String rpad(int len, UTF8String pad);
public UTF8String lpad(int len, UTF8String pad);
public UTF8String subStringIndex(UTF8String delim, int count);
public UTF8String replace(UTF8String search, UTF8String replace);
public UTF8String translate(Map<String, String> dict);
```
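
Padding to a target length raises two edge cases: a pad string longer than one character, and a target shorter than the input. The sketch below shows one way to handle both on plain Java strings. `PadSketch` is a hypothetical name, truncating when the target length is shorter follows the usual SQL `LPAD`/`RPAD` behavior and is an assumption here, and lengths are counted in Java chars, so the sketch assumes text without supplementary characters.

```java
// Hypothetical helper, JDK only. Counts length in Java chars (assumes BMP-only text).
final class PadSketch {

    static String rpad(String s, int len, String pad) {
        if (len <= s.length()) {
            return s.substring(0, Math.max(0, len));  // assumed SQL-style truncation
        }
        if (pad.isEmpty()) {
            return s;                                  // nothing to pad with
        }
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < len) {
            sb.append(pad);                            // repeat the pad string
        }
        sb.setLength(len);                             // cut a partially used final pad
        return sb.toString();
    }

    static String lpad(String s, int len, String pad) {
        if (len <= s.length()) {
            return s.substring(0, Math.max(0, len));
        }
        if (pad.isEmpty()) {
            return s;
        }
        int needed = len - s.length();
        StringBuilder fill = new StringBuilder();
        while (fill.length() < needed) {
            fill.append(pad);
        }
        fill.setLength(needed);
        return fill.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(rpad("Hello", 10, "*"));
        System.out.println(lpad("Hello", 10, "*"));
        System.out.println(rpad("Hello", 8, "ab"));
    }
}
```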

### Splitting Operations

String splitting with regex and SQL-style delimiters.

```java { .api }
public UTF8String[] split(UTF8String pattern, int limit);
public UTF8String[] splitSQL(UTF8String delimiter, int limit);
```

### Concatenation

Static methods for efficient string concatenation, plus a static helper for binary string conversion.

```java { .api }
public static UTF8String concat(UTF8String... inputs);
public static UTF8String concatWs(UTF8String separator, UTF8String... inputs);
public static UTF8String toBinaryString(long val);
```

### Numeric Conversion

Methods for parsing strings as numeric values with error handling.

```java { .api }
public boolean toLong(LongWrapper toLongResult);
public boolean toInt(IntWrapper intWrapper);
public boolean toShort(IntWrapper intWrapper);
public boolean toByte(IntWrapper intWrapper);
public long toLongExact();
public int toIntExact();
public short toShortExact();
public byte toByteExact();
```
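
The signatures above come in two shapes: a boolean-returning form that writes its result into a wrapper object, and an `Exact` form that returns the value directly. The sketch below illustrates why both shapes are useful, using plain Java strings. Everything in it is hypothetical (`SafeParseSketch`, `LongHolder`, `tryParseLong`, `parseLongExact`), and it accepts only an optional sign followed by decimal digits, which may be narrower than what the real methods accept.

```java
// Hypothetical helper, JDK only. Illustrates the wrapper-based and exact parsing shapes.
final class SafeParseSketch {

    // Stand-in for a result wrapper; the caller allocates it once and can reuse it.
    static final class LongHolder {
        long value;
    }

    // Returns false on malformed input or overflow instead of throwing.
    static boolean tryParseLong(String s, LongHolder out) {
        int n = s.length();
        if (n == 0) return false;
        int i = 0;
        boolean negative = s.charAt(0) == '-';
        if (negative || s.charAt(0) == '+') i++;
        if (i == n) return false;                        // sign without digits
        long limit = negative ? Long.MIN_VALUE : -Long.MAX_VALUE;
        long minBeforeMultiply = limit / 10;
        long result = 0;                                 // accumulated as a negative number
        for (; i < n; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) return false;    // not a decimal digit
            if (result < minBeforeMultiply) return false; // multiplying would overflow
            result *= 10;
            if (result < limit + digit) return false;    // subtracting would overflow
            result -= digit;
        }
        out.value = negative ? result : -result;
        return true;
    }

    // Exception-throwing variant built on top of the boolean one.
    static long parseLongExact(String s) {
        LongHolder holder = new LongHolder();
        if (!tryParseLong(s, holder)) {
            throw new NumberFormatException("invalid long: " + s);
        }
        return holder.value;
    }

    public static void main(String[] args) {
        LongHolder holder = new LongHolder();
        if (tryParseLong("12345", holder)) {
            System.out.println(holder.value);
        }
        System.out.println(tryParseLong("12a", holder));
    }
}
```

The boolean form suits hot loops over dirty data, since a failed parse costs a branch rather than an exception, while the exact form is more convenient when a bad value should abort the operation.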

### Comparison Operations

Various comparison methods including binary and collation-aware comparisons.

```java { .api }
public int compareTo(UTF8String other);
public int binaryCompare(UTF8String other);
public int semanticCompare(UTF8String other, int collationId);
public boolean equals(Object other);
public boolean binaryEquals(UTF8String other);
public boolean semanticEquals(UTF8String other, int collationId);
```
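
To make the distinction between binary and collation-aware comparison concrete, the sketch below compares UTF-8 bytes as unsigned values and then layers a simple case-insensitive comparison on top. It is an illustration under stated assumptions, not Spark's algorithm: `CompareSketch` is a hypothetical class, and lowercasing both sides before a binary comparison is just one simple stand-in for a case-insensitive collation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, JDK 9+ only (uses Arrays.compareUnsigned).
final class CompareSketch {

    // Lexicographic comparison of the UTF-8 bytes, treating each byte as unsigned.
    static int binaryCompare(String a, String b) {
        return Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));
    }

    // Simplified case-insensitive comparison: lowercase both sides, then compare bytes.
    static int lowercaseCompare(String a, String b) {
        return binaryCompare(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(binaryCompare("Hello", "HELLO") != 0);
        System.out.println(lowercaseCompare("Hello", "HELLO") == 0);
        System.out.println(binaryCompare("z", "\u00e9") < 0);
    }
}
```

The unsigned comparison matters because Java bytes are signed: multi-byte sequences start with bytes of 0x80 or above, which would sort before ASCII if compared as signed values. Comparing them unsigned makes byte order agree with code point order.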

### I/O Operations

Methods for serialization and I/O operations.

```java { .api }
public void writeToMemory(Object target, long targetOffset);
public void writeTo(ByteBuffer buffer);
public void writeTo(OutputStream out) throws IOException;
public void writeExternal(ObjectOutput out) throws IOException;
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void write(Kryo kryo, Output out);
public void read(Kryo kryo, Input in);
```

### Iterator Support

Iterator methods for code point traversal.

```java { .api }
public Iterator<Integer> codePointIterator();
public Iterator<Integer> codePointIterator(CodePointIteratorType iteratorMode);
public Iterator<Integer> reverseCodePointIterator();
public Iterator<Integer> reverseCodePointIterator(CodePointIteratorType iteratorMode);
```

### Utility Methods

Utility methods for cloning, hashing, and distance calculations.

```java { .api }
public String toString();
public String toValidString();
public UTF8String clone();
public UTF8String copy();
public int hashCode();
public int levenshteinDistance(UTF8String other);
public int levenshteinDistance(UTF8String other, int threshold);
public UTF8String soundex();
```
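
For readers unfamiliar with the distance calculation named above, the sketch below implements the textbook two-row dynamic programming form of Levenshtein distance over code points. It is a generic illustration rather than Spark's implementation, `LevenshteinSketch` is a hypothetical name, and the threshold overload is not modeled.

```java
// Hypothetical helper, JDK only. Textbook Levenshtein distance over code points.
final class LevenshteinSketch {

    static int distance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];  // distances for the previous row
        int[] curr = new int[t.length + 1];  // distances for the row being filled
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j;                     // turning "" into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i;                     // turning s[0..i) into "" takes i deletions
            for (int j = 1; j <= t.length; j++) {
                int substitution = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;                // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(distance("Hello", "Hello"));
    }
}
```

Keeping only two rows reduces memory from the full table to a size proportional to the second string's length while leaving the result unchanged.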

### UTF8StringBuilder

Helper class for building UTF8String objects by concatenating multiple UTF-8 encoded strings.

```java { .api }
public class UTF8StringBuilder {
    public UTF8StringBuilder();
    public UTF8StringBuilder(int initialSize);
    public void append(UTF8String value);
    public void append(String value);
    public void appendBytes(Object base, long offset, int length);
    public UTF8String build();
    public void appendCodePoint(int codePoint);
}
```

## Usage Examples

### Basic String Operations

```java
import java.nio.charset.StandardCharsets;
import org.apache.spark.unsafe.types.UTF8String;

// Create UTF8String instances
UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("World");
UTF8String empty = UTF8String.EMPTY_UTF8;

// Basic properties
int bytes = str1.numBytes(); // Number of UTF-8 bytes
int chars = str1.numChars(); // Number of Unicode code points
boolean isAscii = str1.isFullAscii();

// UTF-8 validation and utility
boolean isWhitespace = UTF8String.isWhitespaceOrISOControl(0x0020); // Space character
byte firstByte = "Hello".getBytes(StandardCharsets.UTF_8)[0]; // Explicit charset, not the platform default
int bytesForChar = UTF8String.numBytesForFirstByte(firstByte); // Number of bytes for UTF-8 character

// Concatenation
UTF8String result = UTF8String.concat(str1, UTF8String.fromString(" "), str2);
UTF8String joined = UTF8String.concatWs(UTF8String.fromString(","), str1, str2);

// Conversion back to Java String
String javaString = result.toString();
```

### Substring and Search Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello World Example");

// Substring operations
UTF8String sub1 = text.substring(0, 5); // "Hello" (character-based)
UTF8String sub2 = text.substringSQL(1, 5); // "Hello" (SQL 1-based indexing)

// Search operations
boolean contains = text.contains(UTF8String.fromString("World"));
int index = text.indexOf(UTF8String.fromString("World"), 0);
boolean startsWith = text.startsWith(UTF8String.fromString("Hello"));
boolean endsWith = text.endsWith(UTF8String.fromString("Example"));

// Pattern matching at specific position
boolean matches = text.matchAt(UTF8String.fromString("World"), 6);
```

### Case Conversion and Trimming

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString(" Hello World ");

// Case conversion
UTF8String upper = text.toUpperCase();
UTF8String lower = text.toLowerCase();
UTF8String title = text.toTitleCase();

// ASCII-only conversion (faster for ASCII strings)
UTF8String upperAscii = text.toUpperCaseAscii();
UTF8String lowerAscii = text.toLowerCaseAscii();

// Trimming operations
UTF8String trimmed = text.trim(); // Remove whitespace
UTF8String leftTrim = text.trimLeft(); // Remove leading whitespace
UTF8String rightTrim = text.trimRight(); // Remove trailing whitespace

// Custom character trimming
UTF8String customTrim = text.trim(UTF8String.fromString(" H"));
```

### String Manipulation

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.unsafe.types.UTF8String;

UTF8String text = UTF8String.fromString("Hello");

// String manipulation
UTF8String reversed = text.reverse(); // "olleH"
UTF8String repeated = text.repeat(3); // "HelloHelloHello"
UTF8String padded = text.rpad(10, UTF8String.fromString("*")); // "Hello*****"
UTF8String leftPadded = text.lpad(10, UTF8String.fromString("*")); // "*****Hello"

// Replace operations
UTF8String replaced = text.replace(
    UTF8String.fromString("ll"),
    UTF8String.fromString("XX")
); // "HeXXo"

// Translation using character mapping
Map<String, String> dict = new HashMap<>();
dict.put("l", "1");
dict.put("o", "0");
UTF8String translated = text.translate(dict); // "He110"
```

### Splitting and Parsing

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String csv = UTF8String.fromString("apple,banana,cherry");

// Split operations
UTF8String[] parts = csv.splitSQL(UTF8String.fromString(","), -1);
// Results in: ["apple", "banana", "cherry"]

// Numeric parsing
UTF8String number = UTF8String.fromString("12345");
try {
    long value = number.toLongExact(); // 12345L
    int intValue = number.toIntExact(); // 12345
} catch (NumberFormatException e) {
    // Handle parsing error
}

// Safe parsing with wrapper objects (LongWrapper is nested in UTF8String)
UTF8String.LongWrapper longWrapper = new UTF8String.LongWrapper();
if (number.toLong(longWrapper)) {
    long value = longWrapper.value; // Parsing succeeded
}
```

### String Building

```java
import org.apache.spark.unsafe.UTF8StringBuilder;
import org.apache.spark.unsafe.types.UTF8String;

// Efficient string building
UTF8StringBuilder builder = new UTF8StringBuilder();
builder.append(UTF8String.fromString("Hello"));
builder.append(" "); // Java string automatically converted
builder.append(UTF8String.fromString("World"));
builder.appendCodePoint(0x1F600); // Unicode emoji

UTF8String result = builder.build();
```

### Memory-Based String Operations

```java
import java.nio.charset.StandardCharsets;
import org.apache.spark.unsafe.types.UTF8String;
import org.apache.spark.unsafe.Platform;

// Create string from memory address
byte[] data = "Hello World".getBytes(StandardCharsets.UTF_8);
long address = Platform.allocateMemory(data.length);
Platform.copyMemory(data, Platform.BYTE_ARRAY_OFFSET, null, address, data.length);

UTF8String str = UTF8String.fromAddress(null, address, data.length);

// Write string to memory
long targetAddress = Platform.allocateMemory(str.numBytes());
str.writeToMemory(null, targetAddress);

// Clean up (str points at the freed memory and must not be used afterwards)
Platform.freeMemory(address);
Platform.freeMemory(targetAddress);
```

### Collation-Aware Operations

```java
import org.apache.spark.unsafe.types.UTF8String;

UTF8String str1 = UTF8String.fromString("Hello");
UTF8String str2 = UTF8String.fromString("HELLO");

// Binary comparison (case-sensitive)
int binaryCompare = str1.binaryCompare(str2); // != 0

// Semantic comparison with collation ID
int collationId = 1; // Case-insensitive collation
int semanticCompare = str1.semanticCompare(str2, collationId); // == 0

// Semantic equality
boolean equal = str1.semanticEquals(str2, collationId); // true
```