Tessl Tile for npm/xregexp@5.1.0

# Unicode Support

Comprehensive Unicode property, category, and script matching, with astral plane support for international text processing.

## Capabilities

### Unicode Token Syntax

XRegExp supports Unicode property matching via `\p{}` and `\P{}` tokens.

```javascript { .api }
// Unicode token patterns:
// \p{PropertyName}     - Match Unicode property
// \P{PropertyName}     - Match NOT Unicode property (negated)
// \p{^PropertyName}    - Match NOT Unicode property (caret negation)
// \pL                  - Single letter shorthand for \p{Letter}
// \p{Type=Value}       - Match specific property type and value
//
// In a JavaScript string literal, escape the backslash: '\\p{Letter}'
```

**Usage Examples:**

```javascript
const XRegExp = require('xregexp');

// Basic Unicode property matching
const letters = XRegExp('\\p{Letter}+', 'A');
letters.test('Hello世界'); // true - matches Unicode letters

// Negated properties
const nonDigits = XRegExp('\\P{Number}+', 'A');
nonDigits.test('abc'); // true - matches non-numeric characters

// Single letter shortcuts
// (astral mode does not allow Unicode tokens inside character classes,
// so the continuation is written as an alternation)
const identifiers = XRegExp('\\pL(?:\\pL|\\pN)*', 'A');
identifiers.test('變數名123'); // true - letter followed by letters/numbers

// Category matching
const punctuation = XRegExp('\\p{Punctuation}', 'A');
punctuation.test('!'); // true
punctuation.test('。'); // true - Unicode punctuation
```
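
The token forms listed above share one shape: a `p` or `P`, followed by either a braced name (optionally with a caret or a `Type=` prefix) or a single letter. As a rough illustration only, the sketch below splits a token into those parts. `parseUnicodeToken` is a hypothetical helper written for this page, not an XRegExp API, and it does not check whether the name is actually defined.

```javascript
// Hypothetical helper (not part of XRegExp): split a Unicode token into parts.
const UNICODE_TOKEN = /^\\([pP])(?:\{(\^?)(?:(\w+)=)?([^}]+)\}|([A-Za-z]))$/;

function parseUnicodeToken(token) {
  const m = UNICODE_TOKEN.exec(token);
  if (!m) return null;
  const [, letter, caret, type, name, shortName] = m;
  // Combining \P with ^ would negate twice, so this sketch treats it as invalid
  if (letter === 'P' && caret) return null;
  return {
    name: shortName || name,   // 'Letter', 'L', 'Latin', ...
    type: type || null,        // 'Script' in \p{Script=Latin}, otherwise null
    negated: letter === 'P' || caret === '^',
  };
}

parseUnicodeToken('\\p{Script=Latin}'); // { name: 'Latin', type: 'Script', negated: false }
parseUnicodeToken('\\p{^Number}');      // { name: 'Number', type: null, negated: true }
parseUnicodeToken('\\pL');              // { name: 'L', type: null, negated: false }
```

The token strings use a doubled backslash because they are JavaScript string literals; the helper receives a single backslash.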

### Unicode Data Management

Add custom Unicode character data for specialized matching.

```javascript { .api }
/**
 * Adds to the list of Unicode tokens that XRegExp regexes can match
 * @param data - Array of objects with named character ranges
 * @param typePrefix - Optional type prefix for all provided Unicode tokens
 */
function addUnicodeData(data: UnicodeCharacterRange[], typePrefix?: string): void;

interface UnicodeCharacterRange {
  /** The name of the character range */
  name: string;
  /** An alternate name for the character range */
  alias?: string;
  /** Needed when token matches orphan high surrogates and uses surrogate pairs */
  isBmpLast?: boolean;
  /** Can be used to avoid duplicating data by referencing inverse of another token */
  inverseOf?: string;
  /** Character data for Basic Multilingual Plane (U+0000-U+FFFF) */
  bmp?: string;
  /** Character data for astral code points (U+10000-U+10FFFF) */
  astral?: string;
}
```

**Usage Examples:**

```javascript
// Add custom Unicode token
XRegExp.addUnicodeData([{
  name: 'XDigit',
  alias: 'Hexadecimal',
  bmp: '0-9A-Fa-f'
}]);

// Use the custom token
XRegExp('\\p{XDigit}:\\p{Hexadecimal}+').test('0:3D'); // true

// Add token with type prefix
// U+1F300-U+1F5FF, U+1F600-U+1F64F and U+1F680-U+1F6FF all lie above U+FFFF,
// so they go in `astral` (written as surrogate pairs), not in `bmp`
XRegExp.addUnicodeData([{
  name: 'Emoji',
  astral: '\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF\uDE00-\uDE4F\uDE80-\uDEFF]'
}], 'Custom');

// Use with type prefix (flag A is needed because the token has no BMP data)
XRegExp('\\p{Custom=Emoji}', 'A').test('😀'); // true
```
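
The `astral` field covers code points above U+FFFF, which JavaScript strings store as surrogate pairs. Assuming astral data is written as a regex fragment over surrogate pairs (an assumption based on how such data is usually laid out, so verify it against your XRegExp version), the sketch below derives that fragment from a code point range. `toSurrogatePair`, `escapeUnit`, and `astralRange` are hypothetical helpers, not XRegExp functions.

```javascript
// Hypothetical helpers (not part of XRegExp).

// Split an astral code point (U+10000-U+10FFFF) into its UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return {
    high: 0xD800 + (offset >> 10),   // top 10 bits
    low: 0xDC00 + (offset & 0x3FF),  // bottom 10 bits
  };
}

// Format a code unit as a \uXXXX escape for use in regex source text
function escapeUnit(unit) {
  return '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
}

// Build a regex fragment that matches every code point in [start, end]
function astralRange(start, end) {
  if (start < 0x10000 || end > 0x10FFFF || start > end) {
    throw new RangeError('Expected an astral range with start <= end');
  }
  const first = toSurrogatePair(start);
  const last = toSurrogatePair(end);
  const parts = [];
  for (let high = first.high; high <= last.high; high++) {
    // Each high surrogate covers a block of 1024 low surrogates
    const lowStart = high === first.high ? first.low : 0xDC00;
    const lowEnd = high === last.high ? last.low : 0xDFFF;
    const lows = lowStart === lowEnd
      ? escapeUnit(lowStart)
      : `${escapeUnit(lowStart)}-${escapeUnit(lowEnd)}`;
    parts.push(`${escapeUnit(high)}[${lows}]`);
  }
  return parts.join('|');
}

astralRange(0x1F600, 0x1F64F); // the text \uD83D[\uDE00-\uDE4F]
```

A range that crosses a high-surrogate boundary, such as U+1F300 to U+1F5FF, yields one alternative per high surrogate joined with `|`.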

## Built-in Unicode Categories

XRegExp includes the Unicode general categories. Each is available under a long name and a short alias:

### Letter Categories

```javascript
// All letters
XRegExp('\\p{Letter}', 'A').test('A');    // true
XRegExp('\\p{Letter}', 'A').test('文');   // true
XRegExp('\\p{L}', 'A').test('π');        // true (shorthand)

// Specific letter subcategories
XRegExp('\\p{Uppercase_Letter}', 'A').test('A');  // true
XRegExp('\\p{Lu}', 'A').test('A');               // true (shorthand)
XRegExp('\\p{Lowercase_Letter}', 'A').test('a');  // true
XRegExp('\\p{Ll}', 'A').test('a');               // true (shorthand)
XRegExp('\\p{Titlecase_Letter}', 'A').test('\u01C5'); // true (Dž, U+01C5)
XRegExp('\\p{Lt}', 'A').test('\u01C5');              // true (shorthand)
```

### Number Categories

```javascript
// All numbers
XRegExp('\\p{Number}', 'A').test('5');    // true
XRegExp('\\p{Number}', 'A').test('Ⅴ');   // true (Roman numeral)
XRegExp('\\p{N}', 'A').test('½');        // true (shorthand)

// Specific number subcategories
XRegExp('\\p{Decimal_Number}', 'A').test('9');   // true
XRegExp('\\p{Nd}', 'A').test('9');              // true (shorthand)
XRegExp('\\p{Letter_Number}', 'A').test('Ⅴ');   // true
XRegExp('\\p{Nl}', 'A').test('Ⅴ');              // true (shorthand)
XRegExp('\\p{Other_Number}', 'A').test('½');     // true
XRegExp('\\p{No}', 'A').test('½');              // true (shorthand)
```

### Mark Categories

```javascript
// All marks (combining characters), written as escapes so they stay visible
XRegExp('\\p{Mark}', 'A').test('\u0301');      // true (combining acute accent)
XRegExp('\\p{M}', 'A').test('\u0303');         // true (combining tilde, shorthand)

// Specific mark subcategories
XRegExp('\\p{Nonspacing_Mark}', 'A').test('\u0301');  // true
XRegExp('\\p{Mn}', 'A').test('\u0301');              // true (shorthand)
```

### Punctuation Categories

```javascript
// All punctuation
XRegExp('\\p{Punctuation}', 'A').test('!');  // true
XRegExp('\\p{Punctuation}', 'A').test('。'); // true (CJK period)
XRegExp('\\p{P}', 'A').test('?');           // true (shorthand)

// Specific punctuation subcategories
XRegExp('\\p{Open_Punctuation}', 'A').test('(');  // true
XRegExp('\\p{Ps}', 'A').test('[');               // true (shorthand)
XRegExp('\\p{Close_Punctuation}', 'A').test(')'); // true
XRegExp('\\p{Pe}', 'A').test(']');               // true (shorthand)
```
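
The short aliases in the examples above are compact but easy to misread. As an optional readability aid, the sketch below rewrites shorthand category tokens into their long names. `CATEGORY_NAMES` and `expandCategoryShorthand` are hypothetical, the table only covers the categories shown on this page, and the pattern matching is deliberately simple (it does not account for an escaped backslash before `p`).

```javascript
// Shorthand -> long name, limited to the categories shown in this document
const CATEGORY_NAMES = new Map([
  ['L', 'Letter'], ['Lu', 'Uppercase_Letter'], ['Ll', 'Lowercase_Letter'],
  ['Lt', 'Titlecase_Letter'], ['N', 'Number'], ['Nd', 'Decimal_Number'],
  ['Nl', 'Letter_Number'], ['No', 'Other_Number'], ['M', 'Mark'],
  ['Mn', 'Nonspacing_Mark'], ['P', 'Punctuation'], ['Ps', 'Open_Punctuation'],
  ['Pe', 'Close_Punctuation'], ['Sc', 'Currency_Symbol'],
]);

// Hypothetical helper: rewrite \p{Lu}, \P{Lu}, \p{^Lu} and \pL using long names
function expandCategoryShorthand(pattern) {
  return pattern.replace(
    /\\([pP])(?:\{(\^?)(\w+)\}|([A-Za-z]))/g,
    (token, p, caret = '', name, single) => {
      const longName = CATEGORY_NAMES.get(single || name);
      // Leave anything that is not a known shorthand untouched
      return longName ? `\\${p}{${caret}${longName}}` : token;
    }
  );
}

expandCategoryShorthand('\\pL(?:\\p{Nd}|\\P{Lu})');
// returns the text \p{Letter}(?:\p{Decimal_Number}|\P{Uppercase_Letter})
```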

## Built-in Unicode Scripts

XRegExp supports Unicode script matching:

```javascript
// Latin script
XRegExp('\\p{Latin}', 'A').test('Hello');     // true
XRegExp('\\p{Script=Latin}', 'A').test('A');  // true (explicit syntax)

// Chinese/Japanese/Korean scripts
XRegExp('\\p{Han}', 'A').test('漢字');        // true (Chinese characters)
XRegExp('\\p{Hiragana}', 'A').test('ひらがな'); // true
XRegExp('\\p{Katakana}', 'A').test('カタカナ');  // true
XRegExp('\\p{Hangul}', 'A').test('한글');      // true (Korean)

// Arabic and Hebrew
XRegExp('\\p{Arabic}', 'A').test('العربية');  // true
XRegExp('\\p{Hebrew}', 'A').test('עברית');    // true

// Cyrillic
XRegExp('\\p{Cyrillic}', 'A').test('Кирилица'); // true

// Greek
XRegExp('\\p{Greek}', 'A').test('Ελληνικά');   // true
```

## Built-in Unicode Properties

XRegExp includes Unicode properties for specialized matching. The symbol examples at the end use general categories (`Sm`, `Sc`):

```javascript
// Alphabetic property (broader than Letter category)
XRegExp('\\p{Alphabetic}', 'A').test('A');    // true
XRegExp('\\p{Alphabetic}', 'A').test('文');   // true

// Whitespace property
XRegExp('\\p{White_Space}', 'A').test(' ');   // true
XRegExp('\\p{White_Space}', 'A').test('\t');  // true (tab)

// Uppercase and Lowercase properties
XRegExp('\\p{Uppercase}', 'A').test('A');     // true
XRegExp('\\p{Lowercase}', 'A').test('a');     // true

// Math symbols (general category Sm)
XRegExp('\\p{Math_Symbol}', 'A').test('+');   // true
XRegExp('\\p{Sm}', 'A').test('∑');            // true (summation, shorthand)

// Currency symbols (general category Sc)
XRegExp('\\p{Currency_Symbol}', 'A').test('$'); // true
XRegExp('\\p{Sc}', 'A').test('€');             // true (shorthand)
```

## Astral Unicode Support

Flag `A` enables 21-bit Unicode support, so Unicode tokens can match code points beyond the Basic Multilingual Plane:

### Astral Flag Usage

```javascript { .api }
// Flag A enables astral mode for Unicode tokens
// Required for code points above U+FFFF (outside BMP)
// Automatically added when XRegExp.install('astral') is called
// In astral mode, Unicode tokens cannot be used inside character classes
```

**Usage Examples:**

```javascript
// Without flag A - only BMP characters (U+0000-U+FFFF)
const bmpOnly = XRegExp('\\p{Letter}');
bmpOnly.test('A');    // true
bmpOnly.test('文');  // true
bmpOnly.test('𝒜');   // false (mathematical script capital A, U+1D49C)

// With flag A - full Unicode range (U+0000-U+10FFFF)
const fullUnicode = XRegExp('\\p{Letter}', 'A');
fullUnicode.test('A');    // true
fullUnicode.test('文');  // true
fullUnicode.test('𝒜');   // true (now works with astral support)

// Astral emoji support
// (assumes an Emoji token has been registered, e.g. with XRegExp.addUnicodeData;
// unknown token names throw a SyntaxError)
const emoji = XRegExp('\\p{Emoji}', 'A');
emoji.test('😀');  // true (U+1F600)
emoji.test('🚀'); // true (U+1F680)
```
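
The reason `𝒜` fails without flag `A` is that JavaScript stores it as two UTF-16 code units, neither of which is a letter on its own. The sketch below makes that visible using only standard string methods. `describeCodePoints` is a hypothetical helper for illustration.

```javascript
// Hypothetical helper: list each code point in a string with its UTF-16 length
function describeCodePoints(str) {
  // Array.from iterates by code point, so a surrogate pair stays together
  return Array.from(str, (ch) => {
    const codePoint = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
      codeUnits: ch.length,         // 1 for BMP, 2 for astral
      astral: codePoint > 0xFFFF,
    };
  });
}

describeCodePoints('A\u{1D49C}');
// [ { char: 'A', codePoint: 'U+0041', codeUnits: 1, astral: false },
//   { char: '𝒜', codePoint: 'U+1D49C', codeUnits: 2, astral: true } ]
```

Any character reported with `astral: true` needs flag `A` to be matched by a Unicode token.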

### Global Astral Mode

Enable astral mode for all new regexes:

```javascript
// Enable astral mode globally
XRegExp.install('astral');

// Now flag A is automatically added to all XRegExp regexes
const auto = XRegExp('\\p{Letter}'); // Automatically gets flag A
auto.test('𝒜'); // true

// Disable astral mode
XRegExp.uninstall('astral');
```
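
Because `install('astral')` changes behavior for every regex created afterwards, it can help to limit the change to one block of code. The sketch below is one possible way to do that. `withAstral` is a hypothetical wrapper, not an XRegExp function; it takes the XRegExp object as a parameter and relies on `XRegExp.isInstalled` to restore the previous state.

```javascript
// Hypothetical wrapper: run `fn` with astral mode on, then restore the prior state
function withAstral(XRegExp, fn) {
  const wasInstalled = XRegExp.isInstalled('astral');
  XRegExp.install('astral');
  try {
    return fn();
  } finally {
    // Only turn astral mode off if this call was the one that turned it on
    if (!wasInstalled) {
      XRegExp.uninstall('astral');
    }
  }
}

// Usage, assuming `XRegExp` has been loaded elsewhere:
// const isLetter = withAstral(XRegExp, () => XRegExp('^\\p{Letter}$').test('𝒜'));
```

The `try`/`finally` ensures astral mode is restored even when the callback throws, for example on an unknown token name.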

## Unicode Code Point Escapes

XRegExp supports the extended `\u{N...}` Unicode escape syntax:

```javascript { .api }
// \u{N...} - Unicode code point escape with curly braces
// N... is a hexadecimal number of one or more digits, from 0 to 10FFFF
// Can include leading zeros
// Requires flag u for code points > U+FFFF
```

**Usage Examples:**

```javascript
// Basic Multilingual Plane characters
XRegExp('\\u{41}').test('A');        // true (U+0041)
XRegExp('\\u{3042}').test('あ');     // true (U+3042, Hiragana A)

// Astral characters (requires flag u)
XRegExp('\\u{1F600}', 'u').test('😀'); // true (U+1F600, grinning face)
XRegExp('\\u{1D49C}', 'u').test('𝒜');  // true (U+1D49C, math script A)

// Leading zeros allowed
XRegExp('\\u{0041}').test('A');       // true (same as \u{41})
XRegExp('\\u{00003042}').test('あ');  // true (same as \u{3042})
```
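
The rules for `\u{N...}` (hexadecimal digits, a value no greater than 10FFFF, leading zeros allowed, flag `u` above U+FFFF) can be written down as a small check. The sketch below restates those documented rules only; `checkCodePointEscape` is a hypothetical helper and is not how XRegExp itself processes the token.

```javascript
// Hypothetical helper: validate the digits inside \u{...} against the rules above
function checkCodePointEscape(hexDigits, flags = '') {
  if (!/^[0-9A-Fa-f]+$/.test(hexDigits)) {
    return { ok: false, reason: 'not a hexadecimal number' };
  }
  const codePoint = parseInt(hexDigits, 16); // leading zeros are ignored
  if (codePoint > 0x10FFFF) {
    return { ok: false, reason: 'above U+10FFFF' };
  }
  if (codePoint > 0xFFFF && !flags.includes('u')) {
    return { ok: false, reason: 'needs flag u' };
  }
  return { ok: true, codePoint };
}

checkCodePointEscape('00003042');   // { ok: true, codePoint: 12354 }  (0x3042)
checkCodePointEscape('1F600');      // { ok: false, reason: 'needs flag u' }
checkCodePointEscape('1F600', 'u'); // { ok: true, codePoint: 128512 } (0x1F600)
```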

## Pattern Examples

### International Identifiers

```javascript
// Match programming identifiers with Unicode support.
// This approximates ID_Start / ID_Continue with general categories: a letter,
// then letters, marks, numbers, or underscores. Unknown token names throw a
// SyntaxError, so only use property names that your XRegExp build defines.
const identifier = XRegExp('^\\p{L}(?:\\p{L}|\\p{M}|\\p{N}|_)*$', 'A');
identifier.test('변수명');     // true (Korean)
identifier.test('переменная'); // true (Russian)
identifier.test('変数名');     // true (Japanese)
identifier.test('متغير');     // true (Arabic)
```

### Multi-script Text Processing

```javascript
// Extract words from mixed-script text
const words = XRegExp('\\p{Letter}+', 'gA');
const text = 'Hello 世界 مرحبا мир';
const matches = XRegExp.match(text, words, 'all');
// Result: ['Hello', '世界', 'مرحبا', 'мир']
```

### Unicode-aware Whitespace

```javascript
// Match all Unicode whitespace characters
const whitespace = XRegExp('\\p{White_Space}+', 'gA');
const text = 'word1\u2003word2\u2009word3'; // em space and thin space
XRegExp.split(text, whitespace);
// Result: ['word1', 'word2', 'word3']
```

### Diacritic Handling

```javascript
// Match whole words made of base letters, each optionally followed by combining marks
// (the outer group repeats so that a match spans the word, not a single letter)
const withDiacritics = XRegExp('(?:\\p{Letter}\\p{Mark}*)+', 'gA');
const text = 'café naïve résumé';
XRegExp.match(text, withDiacritics, 'all');
// Result: ['café', 'naïve', 'résumé'] (combining marks stay attached to their letters)
```
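
Whether an accented letter actually contains a combining mark depends on how the text is normalized: `é` can be one precomposed code point (U+00E9) or `e` followed by U+0301. The sketch below uses the standard `String.prototype.normalize` to show the difference. `decompositionInfo` is a hypothetical helper, and it only counts marks from the Combining Diacritical Marks block (U+0300-U+036F) as a simplification.

```javascript
// Hypothetical helper: compare a string with its NFD (decomposed) form
function decompositionInfo(str) {
  const decomposed = str.normalize('NFD');
  return {
    originalLength: str.length,
    decomposedLength: decomposed.length,
    // Simplification: count only marks in U+0300-U+036F
    combiningMarks: (decomposed.match(/[\u0300-\u036F]/g) || []).length,
  };
}

decompositionInfo('caf\u00E9');
// { originalLength: 4, decomposedLength: 5, combiningMarks: 1 }
```

Text in decomposed form is where the `\p{Mark}*` part of the pattern above does its work; on precomposed text each accented letter is already a single `\p{Letter}`.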
