# Unicode Support

Comprehensive Unicode property, category, and script matching, with astral plane support for international text processing.

## Capabilities

### Unicode Token Syntax

XRegExp supports Unicode property matching via `\p{}` and `\P{}` tokens.

```javascript { .api }
// Unicode token patterns:
// \p{PropertyName}  - Match Unicode property
// \P{PropertyName}  - Match NOT Unicode property (negated)
// \p{^PropertyName} - Match NOT Unicode property (caret negation)
// \pL               - Single-letter shorthand for \p{Letter}
// \p{Type=Value}    - Match specific property type and value
// In a JavaScript string literal, double the backslash: '\\p{PropertyName}'
```

**Usage Examples:**

```javascript
const XRegExp = require('xregexp');

// Basic Unicode property matching
const letters = XRegExp('\\p{Letter}+', 'A');
letters.test('Hello世界'); // true - matches Unicode letters

// Negated properties
const nonDigits = XRegExp('\\P{Number}+', 'A');
nonDigits.test('abc'); // true - matches non-numeric characters

// Single-letter shorthand
// Astral mode (flag A) rejects Unicode tokens inside character classes,
// so alternation is used instead of [\pL\pN]
const identifiers = XRegExp('\\pL(?:\\pL|\\pN)*', 'A');
identifiers.test('變數名123'); // true - letter followed by letters/numbers

// Category matching
const punctuation = XRegExp('\\p{Punctuation}', 'A');
punctuation.test('!'); // true
punctuation.test('。'); // true - Unicode punctuation
```
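
Because patterns are passed to XRegExp as strings, every backslash in a token has to survive JavaScript's own string escaping first. The following is a minimal sketch of the two equivalent spellings, using only standard JavaScript; the variable names are illustrative and not part of XRegExp.

```javascript
// An ordinary string literal needs a doubled backslash to yield one backslash.
const escaped = '\\p{Letter}+';

// String.raw leaves backslashes untouched, so the pattern reads as written.
const raw = String.raw`\p{Letter}+`;

console.log(escaped === raw); // true
console.log(escaped.length);  // 11

// Doubling twice produces a pattern for a literal backslash followed by
// "p{Letter}", which no longer acts as a Unicode token.
const overEscaped = '\\\\p{Letter}+';
console.log(overEscaped.length); // 12
```

Either spelling can be passed as the pattern argument; `String.raw` tends to be easier to read when a pattern contains many tokens.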

### Unicode Data Management

Add custom Unicode character data for specialized matching.

```javascript { .api }
/**
 * Adds to the list of Unicode tokens that XRegExp regexes can match
 * @param data - Array of objects with named character ranges
 * @param typePrefix - Optional type prefix for all provided Unicode tokens
 */
function addUnicodeData(data: UnicodeCharacterRange[], typePrefix?: string): void;

interface UnicodeCharacterRange {
  /** The name of the character range */
  name: string;
  /** An alternate name for the character range */
  alias?: string;
  /** Needed when a token matches orphan high surrogates and also uses surrogate pairs */
  isBmpLast?: boolean;
  /** Can be used to avoid duplicating data by referencing the inverse of another token */
  inverseOf?: string;
  /** Character data for the Basic Multilingual Plane (U+0000-U+FFFF) */
  bmp?: string;
  /** Character data for astral code points (U+10000-U+10FFFF) */
  astral?: string;
}
```

**Usage Examples:**

```javascript
// Add custom Unicode token
XRegExp.addUnicodeData([{
  name: 'XDigit',
  alias: 'Hexadecimal',
  bmp: '0-9A-Fa-f'
}]);

// Use the custom token
XRegExp('\\p{XDigit}:\\p{Hexadecimal}+').test('0:3D'); // true

// Add an astral-only token with a type prefix.
// U+1F300-U+1F64F and U+1F680-U+1F6FF all lie outside the BMP, so they
// belong in `astral` (written as surrogate pairs) rather than in `bmp`.
XRegExp.addUnicodeData([{
  name: 'Emoji',
  astral: '\\uD83C[\\uDF00-\\uDFFF]|\\uD83D[\\uDC00-\\uDE4F\\uDE80-\\uDEFF]'
}], 'Custom');

// Use with type prefix (flag A is required because the token has no BMP data)
XRegExp('\\p{Custom=Emoji}', 'A').test('😀'); // true
```
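
Writing `astral` data by hand means converting code points into UTF-16 surrogate pairs. The sketch below shows that arithmetic, assuming `astral` strings are expressed as a high surrogate followed by a class of low surrogates. `toSurrogatePair`, `esc`, and `astralRange` are hypothetical helpers for illustration, not XRegExp functions, and `astralRange` only handles ranges that share a single high surrogate.

```javascript
// Hypothetical helper (not part of XRegExp): split an astral code point
// (U+10000-U+10FFFF) into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected an astral code point');
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Format a code unit as a regex-style \uXXXX escape.
const esc = (unit) => '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');

// Build astral data for a range whose endpoints share one high surrogate.
function astralRange(start, end) {
  const [startHigh, startLow] = toSurrogatePair(start);
  const [endHigh, endLow] = toSurrogatePair(end);
  if (startHigh !== endHigh) {
    throw new RangeError('Range spans more than one high surrogate');
  }
  return `${esc(startHigh)}[${esc(startLow)}-${esc(endLow)}]`;
}

console.log(astralRange(0x1F600, 0x1F64F)); // \uD83D[\uDE00-\uDE4F]
```

A range that crosses a high-surrogate boundary, such as U+1F300-U+1F5FF, would need to be split into one alternative per high surrogate and joined with `|`.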

## Built-in Unicode Categories

XRegExp includes the Unicode general categories, each available by full name or by its one- or two-letter alias:

### Letter Categories

```javascript
// All letters
XRegExp('\\p{Letter}', 'A').test('A'); // true
XRegExp('\\p{Letter}', 'A').test('文'); // true
XRegExp('\\p{L}', 'A').test('π'); // true (shorthand)

// Specific letter subcategories
XRegExp('\\p{Uppercase_Letter}', 'A').test('A'); // true
XRegExp('\\p{Lu}', 'A').test('A'); // true (shorthand)
XRegExp('\\p{Lowercase_Letter}', 'A').test('a'); // true
XRegExp('\\p{Ll}', 'A').test('a'); // true (shorthand)
XRegExp('\\p{Titlecase_Letter}', 'A').test('ǅ'); // true
XRegExp('\\p{Lt}', 'A').test('ǅ'); // true (shorthand)
```

### Number Categories

```javascript
// All numbers
XRegExp('\\p{Number}', 'A').test('5'); // true
XRegExp('\\p{Number}', 'A').test('Ⅴ'); // true (Roman numeral)
XRegExp('\\p{N}', 'A').test('½'); // true (shorthand)

// Specific number subcategories
XRegExp('\\p{Decimal_Number}', 'A').test('9'); // true
XRegExp('\\p{Nd}', 'A').test('9'); // true (shorthand)
XRegExp('\\p{Letter_Number}', 'A').test('Ⅴ'); // true
XRegExp('\\p{Nl}', 'A').test('Ⅴ'); // true (shorthand)
XRegExp('\\p{Other_Number}', 'A').test('½'); // true
XRegExp('\\p{No}', 'A').test('½'); // true (shorthand)
```

### Mark Categories

```javascript
// All marks (combining characters)
XRegExp('\\p{Mark}', 'A').test('\u0301'); // true (combining acute accent)
XRegExp('\\p{M}', 'A').test('\u0303'); // true (shorthand; combining tilde)

// Specific mark subcategories
XRegExp('\\p{Nonspacing_Mark}', 'A').test('\u0301'); // true
XRegExp('\\p{Mn}', 'A').test('\u0301'); // true (shorthand)
```

### Punctuation Categories

```javascript
// All punctuation
XRegExp('\\p{Punctuation}', 'A').test('!'); // true
XRegExp('\\p{Punctuation}', 'A').test('。'); // true (CJK period)
XRegExp('\\p{P}', 'A').test('?'); // true (shorthand)

// Specific punctuation subcategories
XRegExp('\\p{Open_Punctuation}', 'A').test('('); // true
XRegExp('\\p{Ps}', 'A').test('['); // true (shorthand)
XRegExp('\\p{Close_Punctuation}', 'A').test(')'); // true
XRegExp('\\p{Pe}', 'A').test(']'); // true (shorthand)
```

## Built-in Unicode Scripts

XRegExp supports Unicode script matching, by bare script name or with the explicit `Script=` prefix:

```javascript
// Latin script
XRegExp('\\p{Latin}', 'A').test('Hello'); // true
XRegExp('\\p{Script=Latin}', 'A').test('A'); // true (explicit syntax)

// Chinese/Japanese/Korean scripts
XRegExp('\\p{Han}', 'A').test('漢字'); // true (Chinese characters)
XRegExp('\\p{Hiragana}', 'A').test('ひらがな'); // true
XRegExp('\\p{Katakana}', 'A').test('カタカナ'); // true
XRegExp('\\p{Hangul}', 'A').test('한글'); // true (Korean)

// Arabic and Hebrew
XRegExp('\\p{Arabic}', 'A').test('العربية'); // true
XRegExp('\\p{Hebrew}', 'A').test('עברית'); // true

// Cyrillic
XRegExp('\\p{Cyrillic}', 'A').test('Кирилица'); // true

// Greek
XRegExp('\\p{Greek}', 'A').test('Ελληνικά'); // true
```

## Built-in Unicode Properties

XRegExp includes Unicode properties for specialized matching:

```javascript
// Alphabetic property (broader than Letter category)
XRegExp('\\p{Alphabetic}', 'A').test('A'); // true
XRegExp('\\p{Alphabetic}', 'A').test('文'); // true

// Whitespace property
XRegExp('\\p{White_Space}', 'A').test(' '); // true
XRegExp('\\p{White_Space}', 'A').test('\t'); // true

// Uppercase and Lowercase properties
XRegExp('\\p{Uppercase}', 'A').test('A'); // true
XRegExp('\\p{Lowercase}', 'A').test('a'); // true

// Math symbols, matched through the general category Sm (Math_Symbol)
XRegExp('\\p{Math_Symbol}', 'A').test('+'); // true
XRegExp('\\p{Sm}', 'A').test('∑'); // true (summation)

// Currency symbols, matched through the general category Sc (Currency_Symbol)
XRegExp('\\p{Currency_Symbol}', 'A').test('$'); // true
XRegExp('\\p{Sc}', 'A').test('€'); // true (shorthand)
```

## Astral Unicode Support

Flag `A` enables 21-bit Unicode support, so that Unicode tokens can match code points beyond the Basic Multilingual Plane:

### Astral Flag Usage

```javascript { .api }
// Flag A enables astral mode for Unicode tokens
// Required for code points above U+FFFF (outside BMP)
// Automatically added when XRegExp.install('astral') is called
// In astral mode, Unicode tokens cannot be used inside character classes
```

**Usage Examples:**

```javascript
// Without flag A - only BMP characters (U+0000-U+FFFF)
const bmpOnly = XRegExp('\\p{Letter}');
bmpOnly.test('A'); // true
bmpOnly.test('文'); // true
bmpOnly.test('𝒜'); // false (mathematical script capital A, U+1D49C)

// With flag A - full Unicode range (U+0000-U+10FFFF)
const fullUnicode = XRegExp('\\p{Letter}', 'A');
fullUnicode.test('A'); // true
fullUnicode.test('文'); // true
fullUnicode.test('𝒜'); // true (now works with astral support)

// Astral emoji support
// `Emoji` is not a general category; this assumes an Emoji token has been
// registered (for example with XRegExp.addUnicodeData)
const emoji = XRegExp('\\p{Emoji}', 'A');
emoji.test('😀'); // true (U+1F600)
emoji.test('🚀'); // true (U+1F680)
```
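
The BMP-only behavior follows from how JavaScript strings store astral characters. The sketch below uses only standard string methods to show that U+1D49C occupies two UTF-16 code units, neither of which is a letter on its own; the variable name is illustrative.

```javascript
const scriptA = '\u{1D49C}'; // 𝒜, mathematical script capital A

console.log(scriptA.length);                      // 2 (two UTF-16 code units)
console.log([...scriptA].length);                 // 1 (one code point)
console.log(scriptA.codePointAt(0).toString(16)); // "1d49c"
console.log(scriptA.charCodeAt(0).toString(16));  // "d835" (high surrogate)
console.log(scriptA.charCodeAt(1).toString(16));  // "dc9c" (low surrogate)
```

A regex that works one code unit at a time sees the two surrogates separately, which is why matching such characters as single letters requires astral mode.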

### Global Astral Mode

Enable astral mode for all new regexes:

```javascript
// Enable astral mode globally
XRegExp.install('astral');

// Now flag A is automatically added to all XRegExp regexes
const auto = XRegExp('\\p{Letter}'); // Automatically gets flag A
auto.test('𝒜'); // true

// Disable astral mode
XRegExp.uninstall('astral');
```
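
Since `install('astral')` changes behavior for every regex created afterwards, it can help to scope the change. The following is a minimal sketch of one way to do that; `withAstral` is a hypothetical helper, not an XRegExp function, and it receives the XRegExp object as a parameter. It relies on `XRegExp.isInstalled`, `XRegExp.install`, and `XRegExp.uninstall`.

```javascript
// Hypothetical helper: run `fn` with astral mode installed, then restore
// whatever state was in effect before the call.
function withAstral(XRegExp, fn) {
  const wasInstalled = XRegExp.isInstalled('astral');
  if (!wasInstalled) {
    XRegExp.install('astral');
  }
  try {
    return fn();
  } finally {
    // Only undo the change this helper made.
    if (!wasInstalled) {
      XRegExp.uninstall('astral');
    }
  }
}
```

A call might look like `withAstral(XRegExp, () => XRegExp('\\p{Letter}').test('𝒜'))`. Regexes built inside the callback keep their astral behavior after the helper returns, because the flag is applied when a regex is constructed.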

## Unicode Code Point Escapes

XRegExp supports extended Unicode escape syntax:

```javascript { .api }
// \u{N...} - Unicode code point escape with curly braces
// N... is a hexadecimal number of one or more digits, from 0 to 10FFFF
// Can include leading zeros
// Requires flag u for code points > U+FFFF
```

**Usage Examples:**

```javascript
// Basic Multilingual Plane characters
XRegExp('\\u{41}').test('A'); // true (U+0041)
XRegExp('\\u{3042}').test('あ'); // true (U+3042, Hiragana A)

// Astral characters (requires flag u)
XRegExp('\\u{1F600}', 'u').test('😀'); // true (U+1F600, grinning face)
XRegExp('\\u{1D49C}', 'u').test('𝒜'); // true (U+1D49C, math script A)

// Leading zeros allowed
XRegExp('\\u{0041}').test('A'); // true (same as \u{41})
XRegExp('\\u{00003042}').test('あ'); // true (same as \u{3042})
```
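
The rules above mirror how code points behave in plain JavaScript. As a rough illustration using only built-ins (no XRegExp calls; variable names are illustrative), leading zeros do not change the parsed value, and code points above U+FFFF need two code units, which is why they call for flag `u`:

```javascript
// Leading zeros do not change the parsed code point.
const short = parseInt('41', 16);
const padded = parseInt('0041', 16);
console.log(short === padded);            // true
console.log(String.fromCodePoint(short)); // "A"

// Code points above U+FFFF occupy two UTF-16 code units.
const face = String.fromCodePoint(0x1F600);
console.log(face.length);                 // 2

// Native regexes also accept \u{...} as a code point escape under flag u.
console.log(/^\u{1F600}$/u.test(face));   // true
```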

## Pattern Examples

### International Identifiers

```javascript
// Match programming identifiers with Unicode support.
// ID_Start and ID_Continue are approximated here with general categories.
// Alternation is used because astral mode rejects Unicode tokens inside
// character classes.
const identifier = XRegExp(
  '^(?:\\pL|\\p{Nl})(?:\\pL|\\p{Nl}|\\p{Mn}|\\p{Mc}|\\p{Nd}|\\p{Pc})*$',
  'A'
);
identifier.test('변수명'); // true (Korean)
identifier.test('переменная'); // true (Russian)
identifier.test('変数名'); // true (Japanese)
identifier.test('متغير'); // true (Arabic)
```

### Multi-script Text Processing

```javascript
// Extract words from mixed-script text
const words = XRegExp('\\p{Letter}+', 'gA');
const text = 'Hello 世界 مرحبا мир';
const matches = XRegExp.match(text, words, 'all');
// Result: ['Hello', '世界', 'مرحبا', 'мир']
```

### Unicode-aware Whitespace

```javascript
// Match all Unicode whitespace characters
const whitespace = XRegExp('\\p{White_Space}+', 'gA');
const text = 'word1\u2003word2\u2009word3'; // em space and thin space
XRegExp.split(text, whitespace);
// Result: ['word1', 'word2', 'word3']
```

### Diacritic Handling

```javascript
// Match words made of base letters, each optionally followed by combining marks.
// The group is repeated so that a whole word is matched, not a single letter.
const withDiacritics = XRegExp('(?:\\p{Letter}\\p{Mark}*)+', 'gA');
const text = 'café naïve résumé';
XRegExp.match(text, withDiacritics, 'all');
// Result: ['café', 'naïve', 'résumé'] (preserves combining characters)
```
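
Whether a string actually contains combining marks depends on its normalization form. The sketch below uses the standard `String.prototype.normalize` method to show the difference between precomposed and decomposed text; the variable names are illustrative.

```javascript
const precomposed = 'caf\u00E9';                 // é as one code point (U+00E9)
const decomposed = precomposed.normalize('NFD'); // e followed by U+0301

console.log(precomposed.length);                          // 4
console.log(decomposed.length);                           // 5
console.log(decomposed.charCodeAt(4).toString(16));       // "301"
console.log(decomposed.normalize('NFC') === precomposed); // true
```

In precomposed (NFC) text the accented letter is itself a letter, so the mark part of a pattern matches nothing; in decomposed (NFD) text the mark part is what keeps the accent attached to its base letter.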