# Unicode Support

Comprehensive Unicode property, category, and script matching with astral plane support for international text processing.

## Capabilities

### Unicode Token Syntax

XRegExp supports Unicode property matching via `\p{}` and `\P{}` tokens.

```javascript { .api }
// Unicode token patterns:
// \p{PropertyName}  - Match Unicode property
// \P{PropertyName}  - Match NOT Unicode property (negated)
// \p{^PropertyName} - Match NOT Unicode property (caret negation)
// \pL               - Single letter shorthand for \p{Letter}
// \p{Type=Value}    - Match specific property type and value
```

**Usage Examples:**

```javascript
// Basic Unicode property matching
const letters = XRegExp('\\p{Letter}+', 'A');
letters.test('Hello世界'); // true - matches Unicode letters

// Negated properties
const nonDigits = XRegExp('\\P{Number}+', 'A');
nonDigits.test('abc'); // true - matches non-numeric characters

// Single letter shortcuts
// (astral mode does not allow Unicode tokens inside character classes,
// so the continuation uses alternation instead of [\pL\pN])
const identifiers = XRegExp('\\pL(?:\\pL|\\pN)*', 'A');
identifiers.test('變數名123'); // true - letter followed by letters/numbers

// Category matching
const punctuation = XRegExp('\\p{Punctuation}', 'A');
punctuation.test('!'); // true
punctuation.test('。'); // true - Unicode punctuation
```

### Unicode Data Management

Add custom Unicode character data for specialized matching.

```javascript { .api }
/**
 * Adds to the list of Unicode tokens that XRegExp regexes can match
 * @param data - Array of objects with named character ranges
 * @param typePrefix - Optional type prefix for all provided Unicode tokens
 */
function addUnicodeData(data: UnicodeCharacterRange[], typePrefix?: string): void;

interface UnicodeCharacterRange {
  /** The name of the character range */
  name: string;
  /** An alternate name for the character range */
  alias?: string;
  /** Needed when token matches orphan high surrogates and uses surrogate pairs */
  isBmpLast?: boolean;
  /** Can be used to avoid duplicating data by referencing inverse of another token */
  inverseOf?: string;
  /** Character data for Basic Multilingual Plane (U+0000-U+FFFF) */
  bmp?: string;
  /** Character data for astral code points (U+10000-U+10FFFF) */
  astral?: string;
}
```

**Usage Examples:**

```javascript
// Add custom Unicode token
XRegExp.addUnicodeData([{
  name: 'XDigit',
  alias: 'Hexadecimal',
  bmp: '0-9A-Fa-f'
}]);

// Use the custom token
XRegExp('\\p{XDigit}:\\p{Hexadecimal}+').test('0:3D'); // true

// Add token with type prefix
// These emoji ranges are all astral, so they go in `astral` as surrogate pairs:
// U+1F300-U+1F5FF, U+1F600-U+1F64F, U+1F680-U+1F6FF
XRegExp.addUnicodeData([{
  name: 'Emoji',
  astral: '\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF\uDE00-\uDE4F\uDE80-\uDEFF]'
}], 'Custom');

// Use with type prefix (flag A is required because the token has only astral data)
XRegExp('\\p{Custom=Emoji}', 'A').test('😀'); // true
```
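
Because `astral` data is written as UTF-16 surrogate pairs, it can help to compute the pairs rather than look them up by hand. The following is a minimal sketch in plain JavaScript: `toSurrogates`, `escapeUnit`, and `astralRange` are hypothetical helper names for illustration, not XRegExp APIs, and the sketch only handles ranges whose endpoints share one high surrogate. It assumes the resulting text is then pasted into an `astral` field like the one above.

```javascript
// Hypothetical helpers for illustration; none of these are XRegExp APIs.

// Split an astral code point into its [high, low] UTF-16 surrogate code units.
function toSurrogates(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected an astral code point (U+10000-U+10FFFF)');
  }
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Render one code unit as the six-character text "\uXXXX".
function escapeUnit(unit) {
  return '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
}

// Build a pattern fragment for a range whose endpoints share a high surrogate.
function astralRange(start, end) {
  const [startHigh, startLow] = toSurrogates(start);
  const [endHigh, endLow] = toSurrogates(end);
  if (startHigh !== endHigh || startLow > endLow) {
    throw new RangeError('Range must be ascending and share one high surrogate');
  }
  return `${escapeUnit(startHigh)}[${escapeUnit(startLow)}-${escapeUnit(endLow)}]`;
}

const emoticons = astralRange(0x1F600, 0x1F64F);
console.log(emoticons); // \uD83D[\uDE00-\uDE4F]

// Sanity check with a native regex in non-Unicode mode.
console.log(new RegExp(`^(?:${emoticons})$`).test('😀')); // true
```

A range that crosses a high-surrogate boundary (such as U+1F300-U+1F5FF) has to be split into one alternative per high surrogate, which this sketch deliberately rejects rather than attempts.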

## Built-in Unicode Categories

XRegExp includes comprehensive Unicode general categories:

### Letter Categories

```javascript
// All letters
XRegExp('\\p{Letter}', 'A').test('A'); // true
XRegExp('\\p{Letter}', 'A').test('文'); // true
XRegExp('\\p{L}', 'A').test('π'); // true (shorthand)

// Specific letter subcategories
XRegExp('\\p{Uppercase_Letter}', 'A').test('A'); // true
XRegExp('\\p{Lu}', 'A').test('A'); // true (shorthand)
XRegExp('\\p{Lowercase_Letter}', 'A').test('a'); // true
XRegExp('\\p{Ll}', 'A').test('a'); // true (shorthand)
XRegExp('\\p{Titlecase_Letter}', 'A').test('Dž'); // true
XRegExp('\\p{Lt}', 'A').test('Dž'); // true (shorthand)
```

### Number Categories

```javascript
// All numbers
XRegExp('\\p{Number}', 'A').test('5'); // true
XRegExp('\\p{Number}', 'A').test('Ⅴ'); // true (Roman numeral)
XRegExp('\\p{N}', 'A').test('½'); // true (shorthand)

// Specific number subcategories
XRegExp('\\p{Decimal_Number}', 'A').test('9'); // true
XRegExp('\\p{Nd}', 'A').test('9'); // true (shorthand)
XRegExp('\\p{Letter_Number}', 'A').test('Ⅴ'); // true
XRegExp('\\p{Nl}', 'A').test('Ⅴ'); // true (shorthand)
XRegExp('\\p{Other_Number}', 'A').test('½'); // true
XRegExp('\\p{No}', 'A').test('½'); // true (shorthand)
```

### Mark Categories

```javascript
// All marks (combining characters)
XRegExp('\\p{Mark}', 'A').test('́'); // true (combining acute)
XRegExp('\\p{M}', 'A').test('̃'); // true (shorthand)

// Specific mark subcategories
XRegExp('\\p{Nonspacing_Mark}', 'A').test('́'); // true
XRegExp('\\p{Mn}', 'A').test('́'); // true (shorthand)
```

### Punctuation Categories

```javascript
// All punctuation
XRegExp('\\p{Punctuation}', 'A').test('!'); // true
XRegExp('\\p{Punctuation}', 'A').test('。'); // true (CJK period)
XRegExp('\\p{P}', 'A').test('?'); // true (shorthand)

// Specific punctuation subcategories
XRegExp('\\p{Open_Punctuation}', 'A').test('('); // true
XRegExp('\\p{Ps}', 'A').test('['); // true (shorthand)
XRegExp('\\p{Close_Punctuation}', 'A').test(')'); // true
XRegExp('\\p{Pe}', 'A').test(']'); // true (shorthand)
```

## Built-in Unicode Scripts

XRegExp supports Unicode script matching:

```javascript
// Latin script
XRegExp('\\p{Latin}', 'A').test('Hello'); // true
XRegExp('\\p{Script=Latin}', 'A').test('A'); // true (explicit syntax)

// Chinese/Japanese/Korean scripts
XRegExp('\\p{Han}', 'A').test('漢字'); // true (Chinese characters)
XRegExp('\\p{Hiragana}', 'A').test('ひらがな'); // true
XRegExp('\\p{Katakana}', 'A').test('カタカナ'); // true
XRegExp('\\p{Hangul}', 'A').test('한글'); // true (Korean)

// Arabic and Hebrew
XRegExp('\\p{Arabic}', 'A').test('العربية'); // true
XRegExp('\\p{Hebrew}', 'A').test('עברית'); // true

// Cyrillic
XRegExp('\\p{Cyrillic}', 'A').test('Кирилица'); // true

// Greek
XRegExp('\\p{Greek}', 'A').test('Ελληνικά'); // true
```

## Built-in Unicode Properties

XRegExp includes Unicode properties for specialized matching, shown here alongside the related symbol categories:

```javascript
// Alphabetic property (broader than Letter category)
XRegExp('\\p{Alphabetic}', 'A').test('A'); // true
XRegExp('\\p{Alphabetic}', 'A').test('文'); // true

// Whitespace property
XRegExp('\\p{White_Space}', 'A').test(' '); // true
XRegExp('\\p{White_Space}', 'A').test('\t'); // true

// Uppercase and Lowercase properties
XRegExp('\\p{Uppercase}', 'A').test('A'); // true
XRegExp('\\p{Lowercase}', 'A').test('a'); // true

// Math symbols (Math_Symbol general category, short name Sm)
XRegExp('\\p{Math_Symbol}', 'A').test('+'); // true
XRegExp('\\p{Sm}', 'A').test('∑'); // true (summation)

// Currency symbols (Currency_Symbol general category, short name Sc)
XRegExp('\\p{Currency_Symbol}', 'A').test('$'); // true
XRegExp('\\p{Sc}', 'A').test('€'); // true (shorthand)
```

## Astral Unicode Support

Flag `A` enables 21-bit Unicode support for characters beyond the Basic Multilingual Plane.

### Astral Flag Usage

```javascript { .api }
// Flag A enables astral mode for Unicode tokens
// Required for code points above U+FFFF (outside BMP)
// Automatically added when XRegExp.install('astral') is called
```

**Usage Examples:**

```javascript
// Without flag A - only BMP characters (U+0000-U+FFFF)
const bmpOnly = XRegExp('\\p{Letter}');
bmpOnly.test('A'); // true
bmpOnly.test('文'); // true
bmpOnly.test('𝒜'); // false (mathematical script capital A, U+1D49C)

// With flag A - full Unicode range (U+0000-U+10FFFF)
const fullUnicode = XRegExp('\\p{Letter}', 'A');
fullUnicode.test('A'); // true
fullUnicode.test('文'); // true
fullUnicode.test('𝒜'); // true (now works with astral support)

// Astral emoji support
// (assumes a custom Emoji token registered with XRegExp.addUnicodeData)
const emoji = XRegExp('\\p{Emoji}', 'A');
emoji.test('😀'); // true (U+1F600)
emoji.test('🚀'); // true (U+1F680)
```
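
The `false` result without flag `A` follows from how JavaScript stores strings: an astral character occupies two UTF-16 code units (a surrogate pair), and a token limited to BMP data only ever matches one code unit at a time. The sketch below uses plain JavaScript, with no XRegExp calls, to show that representation for the same character.

```javascript
// Plain JavaScript, no XRegExp: inspect how an astral character is stored.
const scriptA = '\u{1D49C}'; // 𝒜, mathematical script capital A

console.log(scriptA.length); // 2 (two UTF-16 code units)
console.log(scriptA.codePointAt(0).toString(16)); // 1d49c
console.log(scriptA.charCodeAt(0).toString(16)); // d835 (high surrogate)
console.log(scriptA.charCodeAt(1).toString(16)); // dc9c (low surrogate)

// Iterating by code point sees a single character.
console.log([...scriptA].length); // 1

// A BMP character, by contrast, is a single code unit.
console.log('文'.length); // 1
```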

### Global Astral Mode

Enable astral mode for all new regexes:

```javascript
// Enable astral mode globally
XRegExp.install('astral');

// Now flag A is automatically added to all XRegExp regexes
const auto = XRegExp('\\p{Letter}'); // Automatically gets flag A
auto.test('𝒜'); // true

// Disable astral mode
XRegExp.uninstall('astral');
```

## Unicode Code Point Escapes

XRegExp supports extended Unicode escape syntax:

```javascript { .api }
// \u{N...} - Unicode code point escape with curly braces
// N... is a hexadecimal number of one or more digits, from 0 to 10FFFF
// Can include leading zeros
// Requires flag u for code points > U+FFFF
```

**Usage Examples:**

```javascript
// Basic Multilingual Plane characters
XRegExp('\\u{41}').test('A'); // true (U+0041)
XRegExp('\\u{3042}').test('あ'); // true (U+3042, Hiragana A)

// Astral characters (requires flag u)
XRegExp('\\u{1F600}', 'u').test('😀'); // true (U+1F600, grinning face)
XRegExp('\\u{1D49C}', 'u').test('𝒜'); // true (U+1D49C, math script A)

// Leading zeros allowed
XRegExp('\\u{0041}').test('A'); // true (same as \u{41})
XRegExp('\\u{00003042}').test('あ'); // true (same as \u{3042})
```
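
For comparison, native JavaScript regexes recognize `\u{N...}` only when the `u` flag is set. The sketch below uses plain `RegExp` rather than XRegExp to show that native behavior, with `String.fromCodePoint` as a quick way to see which character a code point names. It is offered as background on the `u` flag requirement, not as a description of XRegExp internals.

```javascript
// Plain JavaScript for comparison; no XRegExp calls here.
const grinning = String.fromCodePoint(0x1F600); // '😀'

// With the native u flag, \u{1F600} is a code point escape.
const withU = new RegExp('^\\u{1F600}$', 'u');
console.log(withU.test(grinning)); // true

// Leading zeros are accepted natively as well.
const padded = new RegExp('^\\u{0041}$', 'u');
console.log(padded.test('A')); // true

// Without the u flag, native regexes do not treat \u{...} as an escape.
const withoutU = new RegExp('^\\u{1F600}$');
console.log(withoutU.test(grinning)); // false
```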

## Pattern Examples

### International Identifiers

```javascript
// Match programming identifiers with Unicode support.
// Approximates ID_Start / ID_Continue with general categories:
// start = letters and letter numbers; continue adds marks, digits, connectors.
// Alternation is used because astral mode rejects tokens inside character classes.
const identifier = XRegExp(
  '^(?:\\pL|\\p{Nl})(?:\\pL|\\p{Nl}|\\p{Mn}|\\p{Mc}|\\p{Nd}|\\p{Pc})*$',
  'A'
);
identifier.test('변수명'); // true (Korean)
identifier.test('переменная'); // true (Russian)
identifier.test('変数名'); // true (Japanese)
identifier.test('متغير'); // true (Arabic)
```

### Multi-script Text Processing

```javascript
// Extract words from mixed-script text
const words = XRegExp('\\p{Letter}+', 'gA');
const text = 'Hello 世界 مرحبا мир';
const matches = XRegExp.match(text, words, 'all');
// Result: ['Hello', '世界', 'مرحبا', 'мир']
```

### Unicode-aware Whitespace

```javascript
// Match all Unicode whitespace characters
const whitespace = XRegExp('\\p{White_Space}+', 'gA');
const text = 'word1\u2003word2\u2009word3'; // em space and thin space
XRegExp.split(text, whitespace);
// Result: ['word1', 'word2', 'word3']
```

### Diacritic Handling

```javascript
// Match words built from base letters, each followed by any combining marks
const withDiacritics = XRegExp('(?:\\p{Letter}\\p{Mark}*)+', 'gA');
const text = 'café naïve résumé';
XRegExp.match(text, withDiacritics, 'all');
// Result: ['café', 'naïve', 'résumé'] (preserves combining characters)
```
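
Whether a `\p{Mark}*` suffix ever consumes anything depends on how the text is normalized: a precomposed `é` (U+00E9) contains no separate mark, while NFD text stores `e` followed by U+0301. The sketch below illustrates the difference with native `\p{L}` and `\p{M}` property escapes (ES2018, `u` flag) instead of XRegExp, purely to stay self-contained; the same distinction is assumed to carry over to XRegExp patterns built from `\p{Letter}` and `\p{Mark}`.

```javascript
// Plain JavaScript (ES2018 property escapes), not XRegExp.
const composed = 'caf\u00E9';                  // é as a single code point
const decomposed = composed.normalize('NFD');  // e followed by U+0301

console.log(composed.length);   // 4
console.log(decomposed.length); // 5

// One base letter plus any combining marks that follow it.
const letterWithMarks = /\p{L}\p{M}*/gu;

const composedUnits = composed.match(letterWithMarks);
const decomposedUnits = decomposed.match(letterWithMarks);

console.log(composedUnits.length);   // 4
console.log(decomposedUnits.length); // 4
console.log(decomposedUnits[3] === 'e\u0301'); // true

// Normalizing both sides makes the two spellings compare equal.
console.log(composed === decomposed.normalize('NFC')); // true
```

Normalizing input to one form before matching keeps results consistent regardless of how the source text encoded its accents.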