# Regex Tokenization

Core tokenization functionality that parses regular expression strings into structured token representations. It covers the features listed under Supported Regex Features, including groups, character classes, quantifiers, lookarounds, and special characters.

## Capabilities

### Tokenizer Function

Parses a regular expression string and returns a structured token tree representing the regex's components.

```typescript { .api }
/**
 * Tokenizes a regular expression string into structured tokens
 * @param regexpStr - String representation of a regular expression (without delimiters)
 * @returns Root token containing the parsed structure
 * @throws SyntaxError for invalid regular expressions
 */
function tokenizer(regexpStr: string): Root;
```

**Usage Examples:**

```typescript
import { tokenizer, types } from "ret";

// Simple character sequence
const simple = tokenizer("abc");
// Result: { type: types.ROOT, stack: [
//   { type: types.CHAR, value: 97 },  // 'a'
//   { type: types.CHAR, value: 98 },  // 'b'
//   { type: types.CHAR, value: 99 }   // 'c'
// ]}

// Alternation
const alternation = tokenizer("foo|bar");
// Result: { type: types.ROOT, options: [
//   [{ type: types.CHAR, value: 102 }, ...],  // 'foo'
//   [{ type: types.CHAR, value: 98 }, ...]    // 'bar'
// ]}

// Groups with quantifiers
const groups = tokenizer("(ab)+");
// Result: { type: types.ROOT, stack: [
//   { type: types.REPETITION, min: 1, max: Infinity, value: {
//     type: types.GROUP, remember: true, stack: [
//       { type: types.CHAR, value: 97 },  // 'a'
//       { type: types.CHAR, value: 98 }   // 'b'
//     ]
//   }}
// ]}
```
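
The results above are trees of nested `stack` and `options` arrays, so most consumers end up walking them recursively. The following is a minimal sketch of such a walk, not part of the library: `collectChars`, `RootToken`, and the local `types` object are hypothetical stand-ins declared here so the block runs without the package, and the numeric tags are placeholder values chosen for the sketch. The tree is written out by hand to match the `(ab)+` result shown above.

```typescript
// Local stand-ins for the documented token shapes (placeholder tag values).
const types = { ROOT: 0, GROUP: 1, REPETITION: 5, CHAR: 7 } as const;

type Token =
  | { type: typeof types.CHAR; value: number }
  | { type: typeof types.REPETITION; min: number; max: number; value: Token }
  | { type: typeof types.GROUP; remember: boolean; stack?: Token[]; options?: Token[][] };

interface RootToken {
  type: typeof types.ROOT;
  stack?: Token[];
  options?: Token[][];
}

// Collects every literal character in the tree, in left-to-right order.
function collectChars(node: RootToken | Token): string {
  switch (node.type) {
    case types.CHAR:
      return String.fromCharCode(node.value);
    case types.REPETITION:
      return collectChars(node.value);
    default: {
      // ROOT and GROUP hold either one sequence (stack) or several (options).
      const branches = node.options ?? (node.stack ? [node.stack] : []);
      return branches.map((branch) => branch.map(collectChars).join("")).join("");
    }
  }
}

// Hand-built copy of the documented result for "(ab)+".
const tree: RootToken = {
  type: types.ROOT,
  stack: [
    {
      type: types.REPETITION,
      min: 1,
      max: Infinity,
      value: {
        type: types.GROUP,
        remember: true,
        stack: [
          { type: types.CHAR, value: 97 },
          { type: types.CHAR, value: 98 },
        ],
      },
    },
  ],
};

console.log(collectChars(tree)); // "ab"
```

With the real package, the value returned by `tokenizer("(ab)+")` would take the place of the hand-built `tree`, provided the walk is extended to the remaining token types.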

### Error Handling

The tokenizer throws `SyntaxError` for invalid regular expressions. The examples below cover the errors it reports:

```typescript
import { tokenizer } from "ret";

// Invalid group - '?' followed by invalid character
try {
  tokenizer("(?_abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?_abc)/: Invalid group, character '_' after '?' at column X
}

// Nothing to repeat - quantifier with no token before it
try {
  tokenizer("foo|?bar");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo|?bar/: Nothing to repeat at column X
}

try {
  tokenizer("{1,3}foo");
} catch (error) {
  // SyntaxError: Invalid regular expression: /{1,3}foo/: Nothing to repeat at column X
}

try {
  tokenizer("foo(+bar)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo(+bar)/: Nothing to repeat at column X
}

// Unmatched closing parenthesis
try {
  tokenizer("hello)world");
} catch (error) {
  // SyntaxError: Invalid regular expression: /hello)world/: Unmatched ) at column X
}

// Unterminated group
try {
  tokenizer("(1(23)4");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(1(23)4/: Unterminated group
}

// Unterminated character class
try {
  tokenizer("[abc");
} catch (error) {
  // SyntaxError: Invalid regular expression: /[abc/: Unterminated character class
}

// Backslash at end of pattern (the string literal "test\\" is the pattern test\)
try {
  tokenizer("test\\");
} catch (error) {
  // SyntaxError: Invalid regular expression: /test\/: \ at end of pattern
}

// Invalid capture group name
try {
  tokenizer("(?<123>abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<123>abc)/: Invalid capture group name, character '1' after '<' at column X
}

// Unclosed capture group name
try {
  tokenizer("(?<name abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<name abc)/: Unclosed capture group name, expected '>', found ' ' at column X
}
```
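
When patterns come from user input, it can be convenient to turn the thrown `SyntaxError` into a return value. The sketch below shows one way to do that. `tryTokenize`, `TokenizeResult`, and `stubTokenizer` are hypothetical names introduced for illustration. The stub only imitates the end-of-pattern error so the block runs on its own; with the real package you would pass `tokenizer` from `"ret"` instead.

```typescript
// Models the documented signature: a function from pattern string to a token tree.
type Tokenize<T> = (regexpStr: string) => T;

type TokenizeResult<T> =
  | { ok: true; root: T }
  | { ok: false; message: string };

// Runs the tokenizer and converts a thrown SyntaxError into a value.
// Anything else is rethrown, since it does not describe a bad pattern.
function tryTokenize<T>(tokenize: Tokenize<T>, regexpStr: string): TokenizeResult<T> {
  try {
    return { ok: true, root: tokenize(regexpStr) };
  } catch (error) {
    if (error instanceof SyntaxError) {
      return { ok: false, message: error.message };
    }
    throw error;
  }
}

// Stand-in tokenizer (assumption): rejects a trailing backslash, otherwise
// returns the characters of the pattern. It is not the library's parser.
const stubTokenizer = (s: string): string[] => {
  if (s.charAt(s.length - 1) === "\\") {
    throw new SyntaxError("Invalid regular expression: /" + s + "/: \\ at end of pattern");
  }
  return s.split("");
};

const result = tryTokenize(stubTokenizer, "test\\");
console.log(result.ok ? "parsed" : result.message);
```

Returning a tagged result keeps the `try`/`catch` in one place and lets callers branch on `ok` without inspecting exception types.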

## Supported Regex Features

### Basic Characters

- **Literal characters**: Any character without special meaning
- **Escaped characters**: `\n`, `\t`, `\r`, `\f`, `\v`, `\0`
- **Unicode and hex escapes**: `\uXXXX`, `\xXX`
- **Control characters**: `\cX`

### Character Classes

- **Predefined classes**: `\d`, `\D`, `\w`, `\W`, `\s`, `\S`
- **Custom classes**: `[abc]`, `[^abc]`, `[a-z]`
- **Dot metacharacter**: `.` (any character except newline)

### Anchors and Positions

- **Line anchors**: `^` (start), `$` (end)
- **Word boundaries**: `\b`, `\B`

### Groups

- **Capturing groups**: `(pattern)`
- **Non-capturing groups**: `(?:pattern)`
- **Named groups**: `(?<name>pattern)`
- **Lookahead**: `(?=pattern)` (positive), `(?!pattern)` (negative)

### Quantifiers

- **Basic quantifiers**: `*` (0+), `+` (1+), `?` (0-1)
- **Precise quantifiers**: `{n}`, `{n,}`, `{n,m}`

### Alternation

- **Pipe operator**: `|` for alternative patterns

### Backreferences

- **Numeric references**: `\1`, `\2`, etc.
- **Octal character codes**: A numeric escape is read as an octal character code when its number exceeds the capture group count
168
169
## Token Structure Details
170
171
### Root Token
172
173
The top-level container for the entire regex:
174
175
```typescript { .api }
176
interface Root {
177
type: types.ROOT;
178
stack?: Token[]; // Sequential tokens (no alternation)
179
options?: Token[][]; // Alternative branches (with alternation)
180
flags?: string[]; // Optional regex flags
181
}
182
```
183
184
### Group Token
185
186
Represents parenthesized groups with various modifiers:
187
188
```typescript { .api }
189
interface Group {
190
type: types.GROUP;
191
stack?: Token[]; // Sequential tokens in group
192
options?: Token[][]; // Alternative branches in group
193
remember: boolean; // Whether group captures (true for capturing groups)
194
followedBy?: boolean; // Positive lookahead (?=)
195
notFollowedBy?: boolean; // Negative lookahead (?!)
196
lookBehind?: boolean; // Lookbehind assertions
197
name?: string; // Named capture group name
198
}
199
```
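
The modifier fields can be folded into a single label when reporting on a pattern. The sketch below is one possible reading of the fields as commented above. `GroupFlags` and `describeGroup` are hypothetical names. The interface does not say how `lookBehind` combines with the other flags, so the sketch treats it as its own category, which is an assumption.

```typescript
// Local mirror of the Group modifier fields (stack/options left out).
interface GroupFlags {
  remember: boolean;
  followedBy?: boolean;
  notFollowedBy?: boolean;
  lookBehind?: boolean;
  name?: string;
}

// Maps the modifier fields to a human-readable kind.
// Assertions are checked first, then naming, then plain capture.
function describeGroup(group: GroupFlags): string {
  if (group.lookBehind) return "lookbehind"; // assumption: reported as its own kind
  if (group.followedBy) return "positive lookahead";
  if (group.notFollowedBy) return "negative lookahead";
  if (group.name !== undefined) return "named capture";
  return group.remember ? "capturing" : "non-capturing";
}

console.log(describeGroup({ remember: true }));                     // "capturing"
console.log(describeGroup({ remember: false }));                    // "non-capturing"
console.log(describeGroup({ remember: false, followedBy: true }));  // "positive lookahead"
console.log(describeGroup({ remember: true, name: "year" }));       // "named capture"
```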
200
201
### Character and Set Tokens
202
203
Represent individual characters and character classes:
204
205
```typescript { .api }
206
interface Char {
207
type: types.CHAR;
208
value: number; // Character code
209
}
210
211
interface Set {
212
type: types.SET;
213
set: SetTokens; // Array of characters/ranges in the set
214
not: boolean; // Whether set is negated ([^...])
215
}
216
217
interface Range {
218
type: types.RANGE;
219
from: number; // Start character code
220
to: number; // End character code
221
}
222
```
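
To make the relationship between `set`, `not`, and character codes concrete, here is a small membership check. It is a sketch under stated assumptions: `SetTokens` is not defined in this section, so the code assumes it is an array of character, range, and nested set tokens, and that ranges are inclusive at both ends. The names `SetToken`, `SetMember`, and `setMatches`, along with the numeric tags, are local stand-ins rather than the package's exports.

```typescript
// Placeholder tags for this sketch only.
const types = { SET: 3, RANGE: 4, CHAR: 7 } as const;

interface CharToken { type: typeof types.CHAR; value: number }
interface RangeToken { type: typeof types.RANGE; from: number; to: number }
interface SetToken { type: typeof types.SET; set: SetMember[]; not: boolean }
type SetMember = CharToken | RangeToken | SetToken; // assumed shape of SetTokens

// Returns true when the character code is accepted by the set token.
function setMatches(token: SetToken, code: number): boolean {
  const hit = token.set.some((member) => {
    if (member.type === types.CHAR) return member.value === code;
    if (member.type === types.RANGE) return member.from <= code && code <= member.to;
    return setMatches(member, code); // nested set
  });
  // A negated set accepts exactly the codes its members reject.
  return token.not ? !hit : hit;
}

// Hand-built token for the class [^a-c_].
const notAbcOrUnderscore: SetToken = {
  type: types.SET,
  not: true,
  set: [
    { type: types.RANGE, from: 97, to: 99 }, // a-c
    { type: types.CHAR, value: 95 },         // _
  ],
};

console.log(setMatches(notAbcOrUnderscore, "b".charCodeAt(0))); // false
console.log(setMatches(notAbcOrUnderscore, "z".charCodeAt(0))); // true
```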
223
224
### Quantifier Tokens
225
226
Represent repetition patterns:
227
228
```typescript { .api }
229
interface Repetition {
230
type: types.REPETITION;
231
min: number; // Minimum repetitions
232
max: number; // Maximum repetitions (Infinity for unbounded)
233
value: Token; // Token being repeated
234
}
235
```
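
The `min`/`max` pair maps back onto the quantifier syntax listed under Quantifiers. The helper below sketches that mapping. `quantifierSuffix` and `RepetitionBounds` are hypothetical names, and the choice to prefer `*`, `+`, and `?` over their brace forms is a convention of this sketch.

```typescript
// Only the bounds are needed to print a quantifier.
interface RepetitionBounds {
  min: number;
  max: number;
}

// Renders the shortest quantifier syntax for the given bounds.
function quantifierSuffix({ min, max }: RepetitionBounds): string {
  if (min === 0 && max === Infinity) return "*";
  if (min === 1 && max === Infinity) return "+";
  if (min === 0 && max === 1) return "?";
  if (max === Infinity) return `{${min},}`;
  if (min === max) return `{${min}}`;
  return `{${min},${max}}`;
}

console.log(quantifierSuffix({ min: 1, max: Infinity })); // "+"
console.log(quantifierSuffix({ min: 2, max: 2 }));        // "{2}"
console.log(quantifierSuffix({ min: 2, max: Infinity })); // "{2,}"
console.log(quantifierSuffix({ min: 1, max: 3 }));        // "{1,3}"
```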
236
237
### Position and Reference Tokens
238
239
Represent anchors and backreferences:
240
241
```typescript { .api }
242
interface Position {
243
type: types.POSITION;
244
value: '$' | '^' | 'b' | 'B'; // Anchor/boundary type
245
}
246
247
interface Reference {
248
type: types.REFERENCE;
249
value: number; // Reference number
250
}
251
```
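
Word boundaries are stored as `'b'` and `'B'` without the backslash, and references as bare numbers, so turning these tokens back into pattern text needs a small amount of care. The following sketch illustrates this. `toSource`, `PositionToken`, `ReferenceToken`, and the numeric tags are stand-ins introduced here, not names from the package.

```typescript
// Placeholder tags for this sketch only.
const types = { POSITION: 2, REFERENCE: 6 } as const;

interface PositionToken {
  type: typeof types.POSITION;
  value: "$" | "^" | "b" | "B";
}

interface ReferenceToken {
  type: typeof types.REFERENCE;
  value: number;
}

// Renders a position or reference token as regex source text.
function toSource(token: PositionToken | ReferenceToken): string {
  if (token.type === types.REFERENCE) {
    return `\\${token.value}`; // e.g. 2 -> \2
  }
  // Boundaries need their backslash restored; ^ and $ are written as-is.
  return token.value === "b" || token.value === "B" ? `\\${token.value}` : token.value;
}

console.log(toSource({ type: types.POSITION, value: "^" }));  // "^"
console.log(toSource({ type: types.POSITION, value: "b" }));  // "\b"
console.log(toSource({ type: types.REFERENCE, value: 2 }));   // "\2"
```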