# Regex Tokenization

Core tokenization functionality that parses regular expression strings into structured token representations. It handles groups, character classes, quantifiers, lookarounds, and special characters.

## Capabilities

### Tokenizer Function

Parses a regular expression string and returns a structured token tree representing the regex's components.

```typescript { .api }
/**
 * Tokenizes a regular expression string into structured tokens
 * @param regexpStr - String representation of a regular expression (without delimiters)
 * @returns Root token containing the parsed structure
 * @throws SyntaxError for invalid regular expressions
 */
function tokenizer(regexpStr: string): Root;
```

**Usage Examples:**

```typescript
import { tokenizer, types } from "ret";

// Simple character sequence
const simple = tokenizer("abc");
// Result: { type: types.ROOT, stack: [
//   { type: types.CHAR, value: 97 }, // 'a'
//   { type: types.CHAR, value: 98 }, // 'b'
//   { type: types.CHAR, value: 99 }  // 'c'
// ]}

// Alternation
const alternation = tokenizer("foo|bar");
// Result: { type: types.ROOT, options: [
//   [{ type: types.CHAR, value: 102 }, ...], // 'foo'
//   [{ type: types.CHAR, value: 98 }, ...]   // 'bar'
// ]}

// Groups with quantifiers
const groups = tokenizer("(ab)+");
// Result: { type: types.ROOT, stack: [
//   { type: types.REPETITION, min: 1, max: Infinity, value: {
//     type: types.GROUP, remember: true, stack: [
//       { type: types.CHAR, value: 97 }, // 'a'
//       { type: types.CHAR, value: 98 }  // 'b'
//     ]
//   }}
// ]}
```
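
The results above store characters as numeric codes nested inside groups and repetitions. As a rough illustration of reading such a tree, the sketch below collects the literal characters in order. It works on a hand-built object shaped like the documented `(ab)+` result rather than calling `tokenizer`; the local `types` object, the simplified token shapes, and the `collectChars` helper are assumptions made for this example and are not part of the library.

```typescript
// Local stand-in for the library's `types`; the numeric values are placeholders.
const types = { GROUP: 1, REPETITION: 5, CHAR: 7 } as const;

// Simplified token shapes, limited to the fields shown in the examples above.
type CharTok = { type: typeof types.CHAR; value: number };
type GroupTok = { type: typeof types.GROUP; remember: boolean; stack: Tok[] };
type RepetitionTok = {
  type: typeof types.REPETITION;
  min: number;
  max: number;
  value: Tok;
};
type Tok = CharTok | GroupTok | RepetitionTok;

// Hypothetical helper: gathers literal characters depth-first, ignoring repeat counts.
function collectChars(tokens: Tok[]): string {
  let out = "";
  for (const t of tokens) {
    if (t.type === types.CHAR) {
      out += String.fromCharCode(t.value);
    } else if (t.type === types.GROUP) {
      out += collectChars(t.stack);
    } else {
      // REPETITION: visit the repeated token once.
      out += collectChars([t.value]);
    }
  }
  return out;
}

// Hand-built stack shaped like the documented result for "(ab)+".
const groupsStack: Tok[] = [
  {
    type: types.REPETITION,
    min: 1,
    max: Infinity,
    value: {
      type: types.GROUP,
      remember: true,
      stack: [
        { type: types.CHAR, value: 97 },
        { type: types.CHAR, value: 98 },
      ],
    },
  },
];

console.log(collectChars(groupsStack)); // "ab"
```

The sketch only covers a `stack`; a tree that uses `options` would need one pass per branch.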

### Error Handling

The tokenizer throws a `SyntaxError` for invalid regular expressions. The possible errors are:

```typescript
import { tokenizer } from "ret";

// Invalid group - '?' followed by invalid character
try {
  tokenizer("(?_abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?_abc)/: Invalid group, character '_' after '?' at column X
}

// Nothing to repeat - repetition token used inappropriately
try {
  tokenizer("foo|?bar");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo|?bar/: Nothing to repeat at column X
}

try {
  tokenizer("{1,3}foo");
} catch (error) {
  // SyntaxError: Invalid regular expression: /{1,3}foo/: Nothing to repeat at column X
}

try {
  tokenizer("foo(+bar)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo(+bar)/: Nothing to repeat at column X
}

// Unmatched closing parenthesis
try {
  tokenizer("hello)world");
} catch (error) {
  // SyntaxError: Invalid regular expression: /hello)world/: Unmatched ) at column X
}

// Unterminated group
try {
  tokenizer("(1(23)4");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(1(23)4/: Unterminated group
}

// Unterminated character class
try {
  tokenizer("[abc");
} catch (error) {
  // SyntaxError: Invalid regular expression: /[abc/: Unterminated character class
}

// Backslash at end of pattern (the string literal "test\\" is the five characters test\)
try {
  tokenizer("test\\");
} catch (error) {
  // SyntaxError: Invalid regular expression: /test\/: \ at end of pattern
}

// Invalid capture group name
try {
  tokenizer("(?<123>abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<123>abc)/: Invalid capture group name, character '1' after '<' at column X
}

// Unclosed capture group name
try {
  tokenizer("(?<name abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<name abc)/: Unclosed capture group name, expected '>', found ' ' at column X
}
```
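
Callers that validate user-supplied patterns may prefer a value over an exception. The sketch below shows one possible wrapper that turns a thrown `SyntaxError` into a result object. `tryParse`, `ParseResult`, and `fakeTokenizer` are hypothetical names introduced here; `fakeTokenizer` only imitates one of the messages above so that the example runs without the library installed.

```typescript
// Result of a guarded parse: either the parsed value or the error message.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; message: string };

// `parse` is any function with the tokenizer's shape: string in, tree out, throws on bad input.
function tryParse<T>(parse: (src: string) => T, src: string): ParseResult<T> {
  try {
    return { ok: true, value: parse(src) };
  } catch (error) {
    if (error instanceof SyntaxError) {
      return { ok: false, message: error.message };
    }
    throw error; // Not a syntax problem: let it propagate.
  }
}

// Hypothetical stand-in for `tokenizer`, used only to make this sketch runnable.
function fakeTokenizer(src: string): { length: number } {
  if (src.charAt(src.length - 1) === "\\") {
    throw new SyntaxError(
      "Invalid regular expression: /" + src + "/: \\ at end of pattern"
    );
  }
  return { length: src.length };
}

const good = tryParse(fakeTokenizer, "abc");
const bad = tryParse(fakeTokenizer, "test\\");
console.log(good.ok, bad.ok); // true false
```

Passing the real `tokenizer` as the first argument should work the same way, assuming it keeps the signature documented above.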

## Supported Regex Features

### Basic Characters

- **Literal characters**: Any character without special meaning
- **Escaped characters**: `\n`, `\t`, `\r`, `\f`, `\v`, `\0`
- **Unicode and hexadecimal escapes**: `\uXXXX`, `\xXX`
- **Control characters**: `\cX`

### Character Classes

- **Predefined classes**: `\d`, `\D`, `\w`, `\W`, `\s`, `\S`
- **Custom classes**: `[abc]`, `[^abc]`, `[a-z]`
- **Dot metacharacter**: `.` (any character except newline)

### Anchors and Positions

- **Line anchors**: `^` (start), `$` (end)
- **Word boundaries**: `\b`, `\B`

### Groups

- **Capturing groups**: `(pattern)`
- **Non-capturing groups**: `(?:pattern)`
- **Named groups**: `(?<name>pattern)`
- **Lookahead**: `(?=pattern)` (positive), `(?!pattern)` (negative)

### Quantifiers

- **Basic quantifiers**: `*` (0+), `+` (1+), `?` (0-1)
- **Precise quantifiers**: `{n}`, `{n,}`, `{n,m}`
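
To make the mapping from quantifier syntax to repetition bounds concrete, here is a small sketch that converts a quantifier's source text into a `{ min, max }` pair, with `Infinity` for an open upper bound. `quantifierBounds` and `Bounds` are hypothetical names for this example; the code follows the standard meaning of each quantifier and is not the library's parser.

```typescript
type Bounds = { min: number; max: number };

// Maps a quantifier's source text to its repetition bounds, or null if it is not a quantifier.
function quantifierBounds(q: string): Bounds | null {
  if (q === "*") return { min: 0, max: Infinity };
  if (q === "+") return { min: 1, max: Infinity };
  if (q === "?") return { min: 0, max: 1 };

  // Group 1: n, group 2: the optional ",m" part, group 3: m (may be empty).
  const m = /^\{(\d+)(,(\d*))?\}$/.exec(q);
  if (m === null) return null;

  const min = parseInt(m[1] ?? "", 10);
  if (m[2] === undefined) return { min, max: min }; // {n}
  if (m[3] === undefined || m[3] === "") return { min, max: Infinity }; // {n,}
  return { min, max: parseInt(m[3], 10) }; // {n,m}
}

console.log(quantifierBounds("{2,5}")); // { min: 2, max: 5 }
```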

### Alternation

- **Pipe operator**: `|` for alternative patterns

### Backreferences

- **Numeric references**: `\1`, `\2`, etc.
- **Octal character codes**: A numeric escape is read as an octal character code when its number exceeds the capture group count

## Token Structure Details

### Root Token

The top-level container for the entire regex:

```typescript { .api }
interface Root {
  type: types.ROOT;
  stack?: Token[]; // Sequential tokens (no alternation)
  options?: Token[][]; // Alternative branches (with alternation)
  flags?: string[]; // Optional regex flags
}
```
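
Because a root carries either `stack` or `options`, consumers often want a single shape to iterate over. The following sketch normalizes both cases into a list of branches. `branches`, `RootLike`, and `Tok` are hypothetical names and simplified shapes that mirror only the two optional fields from the interface above.

```typescript
// Minimal shapes mirroring the `stack` / `options` fields of the interface above.
type Tok = { type: number };
type RootLike = { stack?: Tok[]; options?: Tok[][] };

// Returns every alternative as a token sequence; a pattern without `|` yields one branch.
function branches(node: RootLike): Tok[][] {
  if (node.options !== undefined) return node.options;
  return [node.stack ?? []];
}

// Hand-built examples; the numeric `type` values are placeholders.
const sequential: RootLike = { stack: [{ type: 7 }, { type: 7 }] };
const alternated: RootLike = { options: [[{ type: 7 }], [{ type: 7 }, { type: 7 }]] };

console.log(branches(sequential).length); // 1
console.log(branches(alternated).length); // 2
```

The same helper should apply to any token that exposes the `stack` and `options` pair.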

### Group Token

Represents parenthesized groups with various modifiers:

```typescript { .api }
interface Group {
  type: types.GROUP;
  stack?: Token[]; // Sequential tokens in group
  options?: Token[][]; // Alternative branches in group
  remember: boolean; // Whether group captures (true for capturing groups)
  followedBy?: boolean; // Positive lookahead (?=)
  notFollowedBy?: boolean; // Negative lookahead (?!)
  lookBehind?: boolean; // Lookbehind assertions
  name?: string; // Named capture group name
}
```
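
The flags on a group token determine what kind of group it is. As an illustration, the sketch below labels a group from those flags. `groupKind` and `GroupLike` are hypothetical; the order in which the flags are checked is an assumption of this example, and how `lookBehind` combines with the other flags is not specified here, so the helper reports it without distinguishing positive from negative.

```typescript
// Only the flag fields from the interface above; `stack` / `options` are omitted.
type GroupLike = {
  remember: boolean;
  followedBy?: boolean;
  notFollowedBy?: boolean;
  lookBehind?: boolean;
  name?: string;
};

// Hypothetical helper: returns a human-readable label for a group token.
function groupKind(g: GroupLike): string {
  if (g.lookBehind) return "lookbehind";
  if (g.followedBy) return "positive lookahead";
  if (g.notFollowedBy) return "negative lookahead";
  if (g.name !== undefined) return "named capture: " + g.name;
  return g.remember ? "capturing" : "non-capturing";
}

console.log(groupKind({ remember: true })); // "capturing"
console.log(groupKind({ remember: false, followedBy: true })); // "positive lookahead"
```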

### Character and Set Tokens

Represent individual characters and character classes:

```typescript { .api }
interface Char {
  type: types.CHAR;
  value: number; // Character code
}

interface Set {
  type: types.SET;
  set: SetTokens; // Array of characters/ranges in the set
  not: boolean; // Whether set is negated ([^...])
}

interface Range {
  type: types.RANGE;
  from: number; // Start character code
  to: number; // End character code
}
```
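
To show how these three shapes fit together, the sketch below tests whether a character code belongs to a set token, honoring the `not` flag. The local `types` object, the `CharTok`, `RangeTok`, and `SetTok` aliases, and the `setMatches` helper are assumptions for this example; the set is limited to characters and ranges, as described in the comment on `set` above.

```typescript
// Local stand-in for the library's `types`; the numeric values are placeholders.
const types = { SET: 3, RANGE: 4, CHAR: 7 } as const;

type CharTok = { type: typeof types.CHAR; value: number };
type RangeTok = { type: typeof types.RANGE; from: number; to: number };
type SetTok = { type: typeof types.SET; set: (CharTok | RangeTok)[]; not: boolean };

// Hypothetical helper: true when `code` is accepted by the set.
function setMatches(s: SetTok, code: number): boolean {
  const hit = s.set.some((item) =>
    item.type === types.CHAR
      ? item.value === code
      : item.from <= code && code <= item.to
  );
  return s.not ? !hit : hit;
}

// Hand-built token for [a-cx]: a range a..c plus the single character x.
const abcx: SetTok = {
  type: types.SET,
  not: false,
  set: [
    { type: types.RANGE, from: 97, to: 99 },
    { type: types.CHAR, value: 120 },
  ],
};
// The negated form, [^a-cx].
const notAbcx: SetTok = { type: types.SET, not: true, set: abcx.set };

console.log(setMatches(abcx, "b".charCodeAt(0))); // true
console.log(setMatches(notAbcx, "b".charCodeAt(0))); // false
```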

### Quantifier Tokens

Represent repetition patterns:

```typescript { .api }
interface Repetition {
  type: types.REPETITION;
  min: number; // Minimum repetitions
  max: number; // Maximum repetitions (Infinity for unbounded)
  value: Token; // Token being repeated
}
```
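
As a quick illustration of how `min` and `max` relate to quantifier syntax, the sketch below renders a pair of bounds as the shortest equivalent quantifier. `quantifierSuffix` and `RepetitionLike` are hypothetical names for this example, and the choice of output form is an assumption rather than the library's own reconstruction logic.

```typescript
// Only the bounds from the interface above; the repeated `value` is not needed here.
type RepetitionLike = { min: number; max: number };

// Hypothetical helper: renders bounds as quantifier source text.
function quantifierSuffix(r: RepetitionLike): string {
  if (r.min === 0 && r.max === Infinity) return "*";
  if (r.min === 1 && r.max === Infinity) return "+";
  if (r.min === 0 && r.max === 1) return "?";
  if (r.max === Infinity) return "{" + r.min + ",}";
  if (r.min === r.max) return "{" + r.min + "}";
  return "{" + r.min + "," + r.max + "}";
}

console.log(quantifierSuffix({ min: 1, max: Infinity })); // "+"
console.log(quantifierSuffix({ min: 2, max: 5 })); // "{2,5}"
```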

### Position and Reference Tokens

Represent anchors and backreferences:

```typescript { .api }
interface Position {
  type: types.POSITION;
  value: '$' | '^' | 'b' | 'B'; // Anchor/boundary type
}

interface Reference {
  type: types.REFERENCE;
  value: number; // Reference number
}
```
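
Note that a position token stores `b` and `B` without the backslash. The sketch below renders position and reference tokens back to the source text they stand for, which makes that difference visible. The local `types` object, the `PositionTok` and `ReferenceTok` aliases, and the `sourceText` helper are assumptions made for this example.

```typescript
// Local stand-in for the library's `types`; the numeric values are placeholders.
const types = { POSITION: 2, REFERENCE: 6 } as const;

type PositionTok = { type: typeof types.POSITION; value: "$" | "^" | "b" | "B" };
type ReferenceTok = { type: typeof types.REFERENCE; value: number };

// Hypothetical helper: renders either token as regex source text.
function sourceText(t: PositionTok | ReferenceTok): string {
  if (t.type === types.REFERENCE) return "\\" + t.value;
  // Anchors are written bare; word boundaries need a leading backslash.
  return t.value === "^" || t.value === "$" ? t.value : "\\" + t.value;
}

const wordBoundary: PositionTok = { type: types.POSITION, value: "b" };
const secondGroup: ReferenceTok = { type: types.REFERENCE, value: 2 };

console.log(sourceText(wordBoundary)); // prints \b
console.log(sourceText(secondGroup)); // prints \2
```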