tessl/pypi-regex

Alternative regular expression module providing enhanced pattern matching, fuzzy matching, and advanced Unicode support as a replacement for Python's re module.

Pattern Matching Functions

Name: tessl/pypi-regex
Author: tessl

Core functions for finding patterns in text, with capabilities beyond the standard re module. They add partial matching, optional GIL release for concurrent execution, timeouts, and pos/endpos position control.

Capabilities

Match at Start

Attempts to match a pattern at the beginning of a string (or at pos, if given). Unlike search, it does not scan forward for a later match.

def match(pattern, string, flags=0, pos=None, endpos=None, partial=False,
          concurrent=None, timeout=None, ignore_unused=False, **kwargs):
    """
    Try to apply the pattern at the start of the string, returning a Match object or None.
    
    Args:
        pattern (str): Regular expression pattern to match
        string (str): String to search in
        flags (int, optional): Regex flags to modify matching behavior
        pos (int, optional): Start position for matching (default: 0)
        endpos (int, optional): End position for matching (default: len(string))
        partial (bool, optional): Allow partial matches at end of string
        concurrent (bool, optional): Release GIL during matching for multithreading
        timeout (float, optional): Timeout in seconds for matching operation
        ignore_unused (bool, optional): Ignore unused keyword arguments
        **kwargs: Additional pattern compilation arguments
    
    Returns:
        Match object if pattern matches at start, None otherwise
    """

Usage Examples:

import regex

# Basic matching at start
result = regex.match(r'\d+', '123abc')
print(result.group())  # '123'

# Position control
result = regex.match(r'abc', 'xxabcyy', pos=2, endpos=5)
print(result.group())  # 'abc'

# Partial matching at string end
result = regex.match(r'hello world', 'hello wor', partial=True)
print(result.group())  # 'hello wor' (partial match)

# Timeout for complex patterns: TimeoutError is raised if the limit is exceeded
try:
    result = regex.match(r'(a+)+b', 'a' * 20, timeout=0.1)  # May time out
except TimeoutError:
    result = None
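
One way to use the timeout parameter defensively is to wrap the call and treat an expired timeout as "no match". The sketch below assumes the module raises the built-in TimeoutError when the limit is exceeded; safe_match is a hypothetical helper name, not part of regex.

```python
import regex


def safe_match(pattern, string, timeout=1.0):
    """Hypothetical helper: return a Match, or None on no match or timeout."""
    try:
        return regex.match(pattern, string, timeout=timeout)
    except TimeoutError:
        # The matching operation exceeded its time budget.
        return None


m = safe_match(r'\d+', '123abc')
print(m.group() if m else None)  # '123'
```

Collapsing a timeout into None is a design choice for illustration; code that must tell "no match" apart from "gave up" should let the exception propagate instead.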

Full String Match

Matches a pattern against the entire string (or the slice bounded by pos and endpos). The match must start at the beginning and finish at the end of that range.

def fullmatch(pattern, string, flags=0, pos=None, endpos=None, partial=False,
              concurrent=None, timeout=None, ignore_unused=False, **kwargs):
    """
    Try to apply the pattern against all of the string, returning a Match object or None.
    
    Args:
        pattern (str): Regular expression pattern to match
        string (str): String to match completely
        flags (int, optional): Regex flags to modify matching behavior
        pos (int, optional): Start position for matching (default: 0)
        endpos (int, optional): End position for matching (default: len(string))
        partial (bool, optional): Allow partial matches at end of string
        concurrent (bool, optional): Release GIL during matching for multithreading
        timeout (float, optional): Timeout in seconds for matching operation
        ignore_unused (bool, optional): Ignore unused keyword arguments
        **kwargs: Additional pattern compilation arguments
    
    Returns:
        Match object if pattern matches entire string, None otherwise
    """

Usage Examples:

import regex

# Complete string matching
result = regex.fullmatch(r'\d{3}-\d{2}-\d{4}', '123-45-6789')
print(result.group())  # '123-45-6789'

# Fails on partial match
result = regex.fullmatch(r'\d+', '123abc')
print(result)  # None (doesn't match entire string)

# With position bounds
result = regex.fullmatch(r'abc', 'xxabcyy', pos=2, endpos=5)
print(result.group())  # 'abc'
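
Combining fullmatch with partial=True is useful for validating input as it is typed: a partial match means the text so far could still become a full match. A minimal sketch, assuming the Match object's partial attribute reports this; classify is a hypothetical helper name.

```python
import regex

pin = regex.compile(r'\d{4}')  # a four-digit code


def classify(text):
    """Hypothetical helper: label text as 'complete', 'incomplete', or 'invalid'."""
    m = pin.fullmatch(text, partial=True)
    if m is None:
        return 'invalid'
    # m.partial is True when the input ran out before the pattern finished.
    return 'incomplete' if m.partial else 'complete'


for candidate in ['12', '1234', '12a']:
    print(candidate, classify(candidate))
```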

Search Through String

Scans through a string and returns the first location where the pattern matches. This is the most commonly used pattern matching function.

def search(pattern, string, flags=0, pos=None, endpos=None, partial=False,
           concurrent=None, timeout=None, ignore_unused=False, **kwargs):
    """
    Search through string looking for a match to the pattern, returning a Match object or None.
    
    Args:
        pattern (str): Regular expression pattern to search for
        string (str): String to search in
        flags (int, optional): Regex flags to modify matching behavior
        pos (int, optional): Start position for searching (default: 0)
        endpos (int, optional): End position for searching (default: len(string))
        partial (bool, optional): Allow partial matches at end of string
        concurrent (bool, optional): Release GIL during matching for multithreading
        timeout (float, optional): Timeout in seconds for matching operation
        ignore_unused (bool, optional): Ignore unused keyword arguments
        **kwargs: Additional pattern compilation arguments
    
    Returns:
        Match object for first match found, None if no match
    """

Usage Examples:

import regex

# Basic search
result = regex.search(r'\d+', 'abc123def')
print(result.group())  # '123'
print(result.span())   # (3, 6)

# Search with position bounds
result = regex.search(r'\w+', 'hello world test', pos=6, endpos=11)
print(result.group())  # 'world'

# Fuzzy search with error tolerance
result = regex.search(r'(?e)(hello){i<=1,d<=1,s<=1}', 'helo world')
print(result.group())  # 'helo' (found with 1 deletion error)

# Case-insensitive search
result = regex.search(r'python', 'I love PYTHON!', regex.IGNORECASE)
print(result.group())  # 'PYTHON'
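
Because search returns None when nothing matches, calling group() directly on the result can raise AttributeError. The sketch below shows one way to guard against that while extracting the match position; first_number is a hypothetical helper name.

```python
import regex


def first_number(text):
    """Hypothetical helper: return (value, start, end) for the first integer, or None."""
    m = regex.search(r'\d+', text)
    if m is None:
        return None
    return int(m.group()), m.start(), m.end()


print(first_number('abc123def'))  # (123, 3, 6)
print(first_number('no digits'))  # None
```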

Find All Matches

Returns all matches of a pattern in a string as a list. Matches are non-overlapping by default; pass overlapped=True to include overlapping ones. pos and endpos restrict the searched range.

def findall(pattern, string, flags=0, pos=None, endpos=None, overlapped=False,
            concurrent=None, timeout=None, ignore_unused=False, **kwargs):
    """
    Return a list of all matches in the string.
    
    Args:
        pattern (str): Regular expression pattern to find
        string (str): String to search in
        flags (int, optional): Regex flags to modify matching behavior
        pos (int, optional): Start position for searching (default: 0)
        endpos (int, optional): End position for searching (default: len(string))
        overlapped (bool, optional): Find overlapping matches
        concurrent (bool, optional): Release GIL during matching for multithreading
        timeout (float, optional): Timeout in seconds for matching operation
        ignore_unused (bool, optional): Ignore unused keyword arguments
        **kwargs: Additional pattern compilation arguments
    
    Returns:
        List of matched strings or tuples (for patterns with groups)
    """

Usage Examples:

import regex

# Find all numbers
numbers = regex.findall(r'\d+', 'Price: $123, Quantity: 45, Total: $5535')
print(numbers)  # ['123', '45', '5535']

# Find all email addresses
emails = regex.findall(r'\b\w+@\w+\.\w+\b', 'Contact: user@example.com or admin@site.org')
print(emails)  # ['user@example.com', 'admin@site.org']

# Find with groups
matches = regex.findall(r'(\w+):(\d+)', 'port:80, secure:443, admin:8080')
print(matches)  # [('port', '80'), ('secure', '443'), ('admin', '8080')]

# Overlapping matches
matches = regex.findall(r'\w\w', 'abcdef', overlapped=True)
print(matches)  # ['ab', 'bc', 'cd', 'de', 'ef']

# Non-overlapping (default)
matches = regex.findall(r'\w\w', 'abcdef')
print(matches)  # ['ab', 'cd', 'ef']
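
The shape of the list returned by findall depends on how many capture groups the pattern has. A short sketch of the three cases, using an illustrative input string:

```python
import regex

text = 'port:80, secure:443'

whole = regex.findall(r'\w+:\d+', text)      # no groups: whole matches
names = regex.findall(r'(\w+):\d+', text)    # one group: that group's text
pairs = regex.findall(r'(\w+):(\d+)', text)  # several groups: tuples

print(whole)  # ['port:80', 'secure:443']
print(names)  # ['port', 'secure']
print(pairs)  # [('port', '80'), ('secure', '443')]
```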

Find All Matches Iterator

Returns an iterator over all matches, providing memory-efficient processing for large texts or when you need Match objects with full details.

def finditer(pattern, string, flags=0, pos=None, endpos=None, overlapped=False,
             partial=False, concurrent=None, timeout=None, ignore_unused=False, **kwargs):
    """
    Return an iterator over all matches in the string.
    
    Args:
        pattern (str): Regular expression pattern to find
        string (str): String to search in
        flags (int, optional): Regex flags to modify matching behavior
        pos (int, optional): Start position for searching (default: 0)
        endpos (int, optional): End position for searching (default: len(string))
        overlapped (bool, optional): Find overlapping matches
        partial (bool, optional): Allow partial matches at end of string
        concurrent (bool, optional): Release GIL during matching for multithreading
        timeout (float, optional): Timeout in seconds for matching operation
        ignore_unused (bool, optional): Ignore unused keyword arguments
        **kwargs: Additional pattern compilation arguments
    
    Returns:
        Iterator yielding Match objects
    """

Usage Examples:

import regex

# Iterator over matches with full match info
text = 'Word1: 123, Word2: 456, Word3: 789'
for match in regex.finditer(r'(\w+): (\d+)', text):
    word, number = match.groups()
    start, end = match.span()
    print(f"Found '{word}: {number}' at positions {start}-{end}")

# Memory-efficient processing of large text
def process_large_text(text):
    word_count = 0
    for match in regex.finditer(r'\b\w+\b', text):
        word_count += 1
        # Process one match at a time without storing all matches
    return word_count

# Overlapping matches with iterator
text = 'aaaa'
for match in regex.finditer(r'aa', text, overlapped=True):
    print(f"Match: '{match.group()}' at {match.span()}")
# Output: Match: 'aa' at (0, 2)
#         Match: 'aa' at (1, 3)  
#         Match: 'aa' at (2, 4)
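
Since finditer yields Match objects, named groups can be read directly while iterating. A minimal sketch that builds a dictionary from an illustrative string; the group names name and port are chosen for this example only.

```python
import regex

text = 'port:80, secure:443, admin:8080'
pattern = r'(?P<name>\w+):(?P<port>\d+)'

# Consume matches one at a time and keep only the fields of interest.
ports = {
    m.group('name'): int(m.group('port'))
    for m in regex.finditer(pattern, text)
}
print(ports)  # {'port': 80, 'secure': 443, 'admin': 8080}
```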

Advanced Pattern Features

Fuzzy Matching

The regex module supports fuzzy (approximate) matching with configurable error limits:

import regex

# Basic fuzzy matching - allow up to 2 errors of any type
pattern = r'(?e)(python){e<=2}'
result = regex.search(pattern, 'pyhton is great')  # Matches: the transposed 'h' and 't' cost 2 errors

# Specific error types - insertions, deletions, substitutions
pattern = r'(?e)(hello){i<=1,d<=1,s<=1}'  # Allow 1 of each error type
result = regex.search(pattern, 'helo')  # Matches with 1 deletion

# Best match mode - find the best match instead of first
pattern = r'(?be)(test){e<=2}'
result = regex.search(pattern, 'testing text best')  # Finds 'test' (best match)
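
To see how many errors a fuzzy match actually used, the Match object exposes fuzzy_counts, a tuple of (substitutions, insertions, deletions). The sketch below, adapted from the kind of example given in the regex module's own documentation, contrasts a plain fuzzy match with one using the ENHANCEMATCH flag (?e).

```python
import regex

# Without ENHANCEMATCH, the first acceptable fit is kept,
# even if it uses an error that a different alternative would avoid.
raw = regex.fullmatch(r'(?:cats|cat){e<=1}', 'cat')
print(raw.fuzzy_counts)  # (substitutions, insertions, deletions)

# With ENHANCEMATCH (?e), the module tries to improve the fit.
enhanced = regex.fullmatch(r'(?e)(?:cats|cat){e<=1}', 'cat')
print(enhanced.fuzzy_counts)  # (0, 0, 0)
```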

Version Control

import regex

text = 'A sample pattern'  # illustrative input

# Version 0 (legacy re-compatible behavior)
result = regex.search(r'(?V0)pattern', text)

# Version 1 (enhanced behavior; case-insensitive matches use full case-folding by default)
result = regex.search(r'(?V1)pattern', text, regex.IGNORECASE)
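
The practical effect of full case-folding shows up with characters whose folded form has a different length, such as the German sharp s. A minimal sketch comparing the two versions on the same input, assuming the default flag settings for each version:

```python
import regex

sharp_s = 'stra\N{LATIN SMALL LETTER SHARP S}e'  # 'straße'

# Version 0 uses simple case-folding by default: 'ss' cannot match 'ß'.
v0 = regex.match(r'(?iV0)strasse', sharp_s)

# Version 1 uses full case-folding by default: 'ss' matches 'ß'.
v1 = regex.match(r'(?iV1)strasse', sharp_s)

print(v0)         # None
print(v1.span())  # (0, 6)
```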

Concurrent Execution

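import regex

complex_pattern = r'\b\w+\b'        # placeholder pattern
large_text = 'lorem ipsum ' * 1000  # placeholder input

# Release the GIL during matching so other Python threads can run
result = regex.search(complex_pattern, large_text, concurrent=True)

# Set a timeout in seconds; TimeoutError is raised if it is exceeded
result = regex.search(complex_pattern, large_text, timeout=5.0)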
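
The concurrent flag only pays off when several threads are matching at the same time. The sketch below shows one possible arrangement using a thread pool from the standard library; the documents list and the count_numbers helper are illustrative assumptions, and any real speed-up depends on the patterns and input sizes.

```python
import regex
from concurrent.futures import ThreadPoolExecutor

documents = ['alpha 1 beta 22', 'gamma 333', 'no numbers here']


def count_numbers(text):
    # concurrent=True asks regex to release the GIL while matching,
    # which is safe here because str objects are immutable.
    return len(regex.findall(r'\d+', text, concurrent=True))


with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_numbers, documents))

print(counts)  # [2, 1, 0]
```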
