tessl/npm-re2

Bindings for RE2: fast, safe alternative to backtracking regular expression engines.

Buffer Support

Name: tessl/npm-re2
Author: tessl

Direct Buffer processing for efficient text operations without string conversion overhead.

Capabilities

Buffer Processing Overview

RE2 provides native support for Node.js Buffers, allowing direct processing of UTF-8 encoded binary data without conversion to JavaScript strings. This is particularly useful for:

Processing large text files efficiently
Working with binary protocols containing text patterns
Avoiding UTF-8 ↔ UTF-16 conversion overhead
Handling text data that may contain null bytes

Key Characteristics:

All Buffer inputs must be UTF-8 encoded
Positions and lengths are in bytes, not characters
Results are returned as Buffers when input is Buffer
Full Unicode support maintained
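
Because positions reported for Buffer input are byte offsets, they cannot be used directly to index a JavaScript string. The following is a minimal sketch of converting between the two using only Node's built-in Buffer API. The helper names byteOffsetToCharIndex and charIndexToByteOffset are illustrative and not part of RE2, and the byte offset is assumed to fall on a character boundary.

```javascript
// Hypothetical helpers (not part of RE2): convert between byte offsets in a
// UTF-8 Buffer and character indices in the equivalent JavaScript string.

// Byte offset -> string index: decode the prefix and measure its length.
function byteOffsetToCharIndex(buffer, byteOffset) {
  return buffer.subarray(0, byteOffset).toString("utf8").length;
}

// String index -> byte offset: measure the encoded size of the prefix.
function charIndexToByteOffset(text, charIndex) {
  return Buffer.byteLength(text.slice(0, charIndex), "utf8");
}

const text = "Hello 世界! Testing 123";
const buffer = Buffer.from(text, "utf8");

console.log(text.indexOf("123"));               // 18 (character index)
console.log(buffer.indexOf("123"));             // 22 (byte offset)
console.log(byteOffsetToCharIndex(buffer, 22)); // 18
console.log(charIndexToByteOffset(text, 18));   // 22
```

The two multi-byte characters (3 bytes each in UTF-8) account for the difference of 4 between the byte offset and the character index.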

Buffer Method Signatures

All core RE2 methods accept Buffer inputs. Methods that return matched text return Buffers, while test() returns a boolean and search() returns a byte offset:

/**
 * Buffer-compatible method signatures
 */
regex.exec(buffer: Buffer): RE2BufferExecArray | null;
regex.test(buffer: Buffer): boolean;
regex.match(buffer: Buffer): RE2BufferMatchArray | null;
regex.search(buffer: Buffer): number;
regex.replace(buffer: Buffer, replacement: string | Buffer | ((...args: any[]) => string | Buffer)): Buffer;
regex.split(buffer: Buffer, limit?: number): Buffer[];

Buffer Result Types

/**
 * Buffer-specific result interfaces
 */
interface RE2BufferExecArray extends Array<Buffer> {
  index: number;          // Match start position in bytes
  input: Buffer;         // Original Buffer input
  groups?: {             // Named groups as Buffers
    [key: string]: Buffer;
  };
}

interface RE2BufferMatchArray extends Array<Buffer> {
  index?: number;        // Match position in bytes (undefined for global)
  input?: Buffer;       // Original input (undefined for global)
  groups?: {            // Named groups as Buffers
    [key: string]: Buffer;
  };
}
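
When downstream code expects strings, a Buffer result can be converted in one place. The sketch below assumes a result shaped like the RE2BufferExecArray interface above. bufferMatchToStrings is a hypothetical helper, and passing non-Buffer entries through unchanged assumes that non-participating groups are reported as undefined, as with the native RegExp.

```javascript
// Hypothetical helper (not part of RE2): convert a Buffer-based match result
// into plain strings. The byte position is kept as-is.
function bufferMatchToStrings(match) {
  if (match === null) return null;
  // Decode Buffers; pass anything else (e.g. undefined) through unchanged.
  const decode = (b) => (Buffer.isBuffer(b) ? b.toString("utf8") : b);
  const groups = {};
  for (const [name, value] of Object.entries(match.groups || {})) {
    groups[name] = decode(value);
  }
  return {
    byteIndex: match.index,
    captures: Array.from(match, decode),
    groups,
  };
}

// Hand-built stand-in for a result of regex.exec(buffer)
const sample = Object.assign(
  [Buffer.from("user@example.com"), Buffer.from("user"), Buffer.from("example.com")],
  { index: 9, groups: { user: Buffer.from("user"), domain: Buffer.from("example.com") } }
);

console.log(bufferMatchToStrings(sample).captures);
// [ 'user@example.com', 'user', 'example.com' ]
```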

Buffer Usage Examples

Basic Buffer Operations:

const RE2 = require("re2");

// Create Buffer with UTF-8 text
const buffer = Buffer.from("Hello 世界! Testing 123", "utf8");
const regex = new RE2("\\d+");

// Test with Buffer
console.log(regex.test(buffer)); // true

// Find match in Buffer
const match = regex.exec(buffer);
console.log(match[0].toString()); // "123"
console.log(match.index);         // 22 (byte position; the character position is 18)

// Search in Buffer
const position = regex.search(buffer);
console.log(position); // 22 (byte position)

Buffer Replacement:

const RE2 = require("re2");

// Replace text in Buffer
const sourceBuffer = Buffer.from("test 123 and 456", "utf8");
const numberRegex = new RE2("\\d+", "g");

// Replace with string (returns Buffer)
const replaced1 = numberRegex.replace(sourceBuffer, "XXX");
console.log(replaced1.toString()); // "test XXX and XXX"

// Replace with Buffer
const replacement = Buffer.from("NUM", "utf8");
const replaced2 = numberRegex.replace(sourceBuffer, replacement);
console.log(replaced2.toString()); // "test NUM and NUM"

// Replace with function
const replacer = (match, offset, input) => {
  // toString() works whether match arrives as a string or a Buffer
  const num = parseInt(match.toString(), 10);
  return Buffer.from(String(num * 2), "utf8");
};
const doubled = numberRegex.replace(sourceBuffer, replacer);
console.log(doubled.toString()); // "test 246 and 912"
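
A replacer may receive its match argument as a string or as a Buffer, depending on configuration. The re2 README describes a useBuffers property on the replacer function that opts in to Buffer arguments and byte offsets; treat that as an assumption to verify against the installed version. Below is a sketch of a replacer that tolerates either form. doubleNumber is an illustrative name, and the RE2 call appears only in a comment so the block runs without the package.

```javascript
// Replacer that works whether `match` arrives as a string or a Buffer.
function doubleNumber(match) {
  const text = Buffer.isBuffer(match) ? match.toString("utf8") : String(match);
  return String(parseInt(text, 10) * 2);
}

// Assumed opt-in (per the re2 README) to receive Buffers and byte offsets:
doubleNumber.useBuffers = true;

// Intended use (requires the re2 package):
//   new RE2("\\d+", "g").replace(Buffer.from("test 123 and 456"), doubleNumber)

console.log(doubleNumber("123"));              // "246"
console.log(doubleNumber(Buffer.from("456"))); // "912"
```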

Buffer Splitting:

const RE2 = require("re2");

// Split Buffer by pattern
const data = Buffer.from("apple,banana,cherry", "utf8");
const commaRegex = new RE2(",");

const parts = commaRegex.split(data);
console.log(parts.length); // 3
console.log(parts[0].toString()); // "apple"
console.log(parts[1].toString()); // "banana"
console.log(parts[2].toString()); // "cherry"

// Each part is a Buffer
console.log(Buffer.isBuffer(parts[0])); // true

Named Groups with Buffers

Named capture groups are available on Buffer results, with each group value returned as a Buffer:

const RE2 = require("re2");

// Named groups in Buffer matching
const emailRegex = new RE2("(?<user>\\w+)@(?<domain>\\w+\\.\\w+)");
const emailBuffer = Buffer.from("Contact: user@example.com", "utf8");

const match = emailRegex.exec(emailBuffer);
console.log(match.groups.user.toString());   // "user"
console.log(match.groups.domain.toString()); // "example.com"

// Groups are also Buffers
console.log(Buffer.isBuffer(match.groups.user)); // true

UTF-8 Length Utilities

RE2 provides utility methods for calculating UTF-8 and UTF-16 lengths:

/**
 * Calculate UTF-8 byte length needed for UTF-16 string
 * @param str - UTF-16 string
 * @returns Number of bytes needed for UTF-8 encoding
 */
RE2.getUtf8Length(str: string): number;

/**
 * Calculate UTF-16 character length for UTF-8 Buffer
 * @param buffer - UTF-8 encoded Buffer
 * @returns Number of characters in UTF-16, or -1 on error
 */
RE2.getUtf16Length(buffer: Buffer): number;

Usage Examples:

const RE2 = require("re2");

// Calculate UTF-8 length for string
const text = "Hello 世界!";
const utf8Length = RE2.getUtf8Length(text);
console.log(utf8Length); // 13 (bytes needed for UTF-8)
console.log(text.length); // 9 (UTF-16 characters)

// Verify with actual Buffer
const buffer = Buffer.from(text, "utf8");
console.log(buffer.length); // 13 (matches calculated length)

// Calculate UTF-16 length for Buffer
const utf16Length = RE2.getUtf16Length(buffer);
console.log(utf16Length); // 9 (UTF-16 characters)

// Error handling
const invalidBuffer = Buffer.from([0xff, 0xfe, 0xfd]); // Invalid UTF-8
const errorResult = RE2.getUtf16Length(invalidBuffer);
console.log(errorResult); // -1 (indicates error)
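
For cross-checking these utilities, the same quantities can be computed with Node built-ins. This is a sketch for comparison only, not a description of how RE2 computes them. utf8Length and utf16LengthOrError are hypothetical names, and the -1 return value simply mirrors the convention documented above.

```javascript
// Built-in equivalents for cross-checking (not part of RE2).

// Bytes needed to encode a JavaScript string as UTF-8.
function utf8Length(str) {
  return Buffer.byteLength(str, "utf8");
}

// Length of the decoded string, or -1 if the Buffer is not valid UTF-8.
function utf16LengthOrError(buffer) {
  try {
    // fatal: true makes decode() throw on malformed input instead of
    // substituting U+FFFD; ignoreBOM: true keeps a leading BOM in the output.
    const decoder = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
    return decoder.decode(buffer).length;
  } catch {
    return -1;
  }
}

console.log(utf8Length("Hello 世界!"));                               // 13
console.log(utf16LengthOrError(Buffer.from("Hello 世界!", "utf8")));  // 9
console.log(utf16LengthOrError(Buffer.from([0xff, 0xfe, 0xfd])));     // -1
```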

Buffer Performance Considerations

Advantages:

No UTF-8 ↔ UTF-16 conversion overhead
Direct binary data processing
Memory efficient for large text files
Preserves exact byte boundaries

Considerations:

Positions and lengths are in bytes, not characters
Requires UTF-8 encoded input
Results need .toString() for string operations
More complex when mixing with string operations

Best Practices:

const RE2 = require("re2");
const fs = require("fs");

// Efficient large file processing
async function processLogFile(filename) {
  const buffer = await fs.promises.readFile(filename);
  const errorRegex = new RE2("ERROR:\\s*(.*)", "g");
  
  const errors = [];
  let match;
  while ((match = errorRegex.exec(buffer)) !== null) {
    errors.push({
      message: match[1].toString(),
      position: match.index,
      // Context window measured in bytes; its edges may cut through a multi-byte character
      context: buffer.subarray(
        Math.max(0, match.index - 50),
        match.index + match[0].length + 50
      ).toString()
    });
  }
  
  return errors;
}

// Mixed string/Buffer operations
function processWithContext(text) {
  // Use string for simple operations
  const regex = new RE2("\\w+@\\w+\\.\\w+", "g");
  const emails = text.match(regex);
  
  // Use Buffer for binary operations if needed
  if (emails && emails.length > 0) {
    const buffer = Buffer.from(text, "utf8");
    const firstEmailPos = regex.search(buffer);
    
    return {
      emails,
      firstEmailBytePosition: firstEmailPos
    };
  }
  
  return { emails: [], firstEmailBytePosition: -1 };
}
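
The context window in processLogFile is cut at fixed byte offsets, so it can begin or end in the middle of a multi-byte character, which decodes to U+FFFD replacement characters. One way to avoid that is sketched below with a hypothetical helper, sliceOnCharBoundaries, which assumes the Buffer holds valid UTF-8 and relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```javascript
// Hypothetical helper (not part of RE2): take a byte range from a UTF-8
// Buffer, shrinking it so that neither edge splits a multi-byte character.
function sliceOnCharBoundaries(buffer, start, end) {
  const isContinuation = (byte) => (byte & 0xc0) === 0x80;

  let s = Math.max(0, start);
  let e = Math.min(buffer.length, end);

  // Move the start forward past any continuation bytes.
  while (s < e && isContinuation(buffer[s])) s++;

  // If the byte just past the end is a continuation byte, the last character
  // is cut; move the end back to that character's lead byte.
  while (e > s && e < buffer.length && isContinuation(buffer[e])) e--;

  return buffer.subarray(s, e);
}

const sampleText = Buffer.from("a世b", "utf8"); // bytes: 61 e4 b8 96 62

console.log(sliceOnCharBoundaries(sampleText, 0, 2).toString()); // "a"
console.log(sliceOnCharBoundaries(sampleText, 2, 5).toString()); // "b"
console.log(sliceOnCharBoundaries(sampleText, 1, 4).toString()); // "世"
```

The helper only ever shrinks the requested range, so the returned context never extends past the requested window.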

Binary Data Patterns

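RE2 can process Buffers that mix binary data with text patterns. The binary bytes in this example (0x00 to 0x07) are single-byte UTF-8 code points, so the Buffer as a whole is still valid UTF-8: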

const RE2 = require("re2");

// Create Buffer with mixed binary and text data
const binaryData = Buffer.concat([
  Buffer.from([0x00, 0x01, 0x02]), // Binary header
  Buffer.from("START", "utf8"),     // Text marker
  Buffer.from([0x03, 0x04]),       // More binary data
  Buffer.from("Hello World", "utf8"), // Text content
  Buffer.from([0x05, 0x06, 0x07])  // Binary footer
]);

// Find text patterns in binary data
const textRegex = new RE2("[A-Z]+");
const textMatch = textRegex.exec(binaryData);
console.log(textMatch[0].toString()); // "START"
console.log(textMatch.index);         // 3 (after binary header)

// Extract all text from binary data
const wordRegex = new RE2("[a-zA-Z]+", "g");
const words = [];
let match;
while ((match = wordRegex.exec(binaryData)) !== null) {
  words.push(match[0].toString());
}
console.log(words); // ["START", "Hello", "World"]
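
The exec loop above can be wrapped in a reusable helper. The sketch below assumes the regex object exposes global, lastIndex, and exec in the same way as the native RegExp. collectMatches is a hypothetical name, and the guard for empty matches is a defensive assumption rather than documented RE2 behavior.

```javascript
// Hypothetical helper (not part of RE2): collect every match of a global
// regex in a Buffer, with its position and decoded text.
function collectMatches(regex, buffer) {
  if (!regex.global) {
    // Without the "g" flag, lastIndex never advances and the loop would not end.
    throw new Error("collectMatches requires a regex with the g flag");
  }
  regex.lastIndex = 0; // start from the beginning regardless of earlier use

  const results = [];
  let match;
  while ((match = regex.exec(buffer)) !== null) {
    results.push({ index: match.index, text: match[0].toString() });
    // Step past empty matches so the loop always makes progress.
    if (match[0].length === 0) regex.lastIndex++;
  }
  return results;
}

// Intended use (requires the re2 package):
//   const wordRegex = new RE2("[a-zA-Z]+", "g");
//   collectMatches(wordRegex, binaryData).map((m) => m.text);
```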
