tessl/npm-ret

Tokenizes a string that represents a regular expression.

—

Pending

Overview

Eval results

Files

Regex Tokenization

Name: tessl/npm-ret
Author: tessl

Core tokenization functionality that parses regular expression strings into structured token representations. Handles all regex features including groups, character classes, quantifiers, lookarounds, and special characters.

Capabilities

Tokenizer Function

Parses a regular expression string and returns a structured token tree representing the regex's components.

/**
 * Tokenizes a regular expression string into structured tokens
 * @param regexpStr - String representation of a regular expression (without delimiters)
 * @returns Root token containing the parsed structure
 * @throws SyntaxError for invalid regular expressions
 */
function tokenizer(regexpStr: string): Root;

Usage Examples:

import { tokenizer, types } from "ret";

// Simple character sequence
const simple = tokenizer("abc");
// Result: { type: types.ROOT, stack: [
//   { type: types.CHAR, value: 97 },  // 'a'
//   { type: types.CHAR, value: 98 },  // 'b'
//   { type: types.CHAR, value: 99 }   // 'c'
// ]}

// Alternation
const alternation = tokenizer("foo|bar");
// Result: { type: types.ROOT, options: [
//   [{ type: types.CHAR, value: 102 }, ...], // 'foo'
//   [{ type: types.CHAR, value: 98 }, ...]   // 'bar'
// ]}

// Groups with quantifiers
const groups = tokenizer("(ab)+");
// Result: { type: types.ROOT, stack: [
//   { type: types.REPETITION, min: 1, max: Infinity, value: {
//     type: types.GROUP, remember: true, stack: [
//       { type: types.CHAR, value: 97 },  // 'a'
//       { type: types.CHAR, value: 98 }   // 'b'
//     ]
//   }}
// ]}

Error Handling

The tokenizer throws SyntaxError for invalid regular expressions. All possible errors include:

import { tokenizer } from "ret";

// Invalid group - '?' followed by invalid character
try {
  tokenizer("(?_abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?_abc)/: Invalid group, character '_' after '?' at column X
}

// Nothing to repeat - repetition token used inappropriately
try {
  tokenizer("foo|?bar");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo|?bar/: Nothing to repeat at column X
}

try {
  tokenizer("{1,3}foo");
} catch (error) {
  // SyntaxError: Invalid regular expression: /{1,3}foo/: Nothing to repeat at column X
}

try {
  tokenizer("foo(+bar)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /foo(+bar)/: Nothing to repeat at column X
}

// Unmatched closing parenthesis
try {
  tokenizer("hello)world");
} catch (error) {
  // SyntaxError: Invalid regular expression: /hello)world/: Unmatched ) at column X
}

// Unterminated group
try {
  tokenizer("(1(23)4");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(1(23)4/: Unterminated group
}

// Unterminated character class
try {
  tokenizer("[abc");
} catch (error) {
  // SyntaxError: Invalid regular expression: /[abc/: Unterminated character class
}

// Backslash at end of pattern
try {
  tokenizer("test\\");
} catch (error) {
  // SyntaxError: Invalid regular expression: /test\\/: \ at end of pattern
}

// Invalid capture group name
try {
  tokenizer("(?<123>abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<123>abc)/: Invalid capture group name, character '1' after '<' at column X
}

// Unclosed capture group name
try {
  tokenizer("(?<name abc)");
} catch (error) {
  // SyntaxError: Invalid regular expression: /(?<name abc)/: Unclosed capture group name, expected '>', found ' ' at column X
}

Supported Regex Features

Basic Characters

Literal characters: Any character not having special meaning
Escaped characters: \n, \t, \r, \f, \v, \0
Unicode escapes: \uXXXX, \xXX
Control characters: \cX

Character Classes

Predefined classes: \d, \D, \w, \W, \s, \S
Custom classes: [abc], [^abc], [a-z]
Dot metacharacter: . (any character except newline)

Anchors and Positions

Line anchors: ^ (start), $ (end)
Word boundaries: \b, \B

Groups

Capturing groups: (pattern)
Non-capturing groups: (?:pattern)
Named groups: (?<name>pattern)
Lookahead: (?=pattern) (positive), (?!pattern) (negative)

Quantifiers

Basic quantifiers: * (0+), + (1+), ? (0-1)
Precise quantifiers: {n}, {n,}, {n,m}

Alternation

Pipe operator: | for alternative patterns

Backreferences

Numeric references: \1, \2, etc.
Octal character codes: When reference numbers exceed capture group count

Token Structure Details

Root Token

The top-level container for the entire regex:

interface Root {
  type: types.ROOT;
  stack?: Token[];      // Sequential tokens (no alternation)
  options?: Token[][];  // Alternative branches (with alternation)
  flags?: string[];     // Optional regex flags
}

Group Token

Represents parenthesized groups with various modifiers:

interface Group {
  type: types.GROUP;
  stack?: Token[];      // Sequential tokens in group
  options?: Token[][];  // Alternative branches in group
  remember: boolean;    // Whether group captures (true for capturing groups)
  followedBy?: boolean; // Positive lookahead (?=)
  notFollowedBy?: boolean; // Negative lookahead (?!)
  lookBehind?: boolean; // Lookbehind assertions
  name?: string;        // Named capture group name
}

Character and Set Tokens

Represent individual characters and character classes:

interface Char {
  type: types.CHAR;
  value: number;        // Character code
}

interface Set {
  type: types.SET;
  set: SetTokens;       // Array of characters/ranges in the set
  not: boolean;         // Whether set is negated ([^...])
}

interface Range {
  type: types.RANGE;
  from: number;         // Start character code
  to: number;           // End character code
}

Quantifier Tokens

Represent repetition patterns:

interface Repetition {
  type: types.REPETITION;
  min: number;          // Minimum repetitions
  max: number;          // Maximum repetitions (Infinity for unbounded)
  value: Token;         // Token being repeated
}

Position and Reference Tokens

Represent anchors and backreferences:

interface Position {
  type: types.POSITION;
  value: '$' | '^' | 'b' | 'B'; // Anchor/boundary type
}

interface Reference {
  type: types.REFERENCE;
  value: number;        // Reference number
}

Install with Tessl CLI

npx tessl i tessl/npm-ret

docs