
tessl/pypi-ply

Python implementation of lex and yacc parsing tools with LALR(1) algorithm and zero dependencies


docs/lexical-analysis.md

Lexical Analysis

The ply.lex module provides lexical analysis capabilities, converting raw text input into a stream of tokens using regular expressions. It supports multiple lexer states, user-defined error handling, and flexible token rule definitions.

Capabilities

Lexer Creation

Creates a lexer instance by analyzing token rules defined in the calling module or specified object. Automatically discovers token rules through naming conventions and validates the lexical specification.

def lex(*, module=None, object=None, debug=False, reflags=int(re.VERBOSE), debuglog=None, errorlog=None):
    """
    Build a lexer from token rules.

    Parameters:
    - module: Module containing token rules (default: calling module)
    - object: Object containing token rules (alternative to module)
    - debug: Enable debug mode for lexer construction
    - reflags: Regular expression flags (default: re.VERBOSE)
    - debuglog: Logger for debug output
    - errorlog: Logger for error messages

    Returns:
    Lexer instance
    """

Token Rule Decorator

Decorator that attaches a regular expression pattern to a token rule function. It is useful when a pattern is too complex for a docstring, or is assembled from other pattern strings, while keeping the pattern associated with its rule.

def TOKEN(r):
    """
    Decorator to add a regular expression pattern to a token rule function.

    Parameters:
    - r: Regular expression pattern string

    Returns:
    Decorated function with attached regex pattern
    """

Usage example:

from ply.lex import TOKEN

@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t
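
The decorator is most useful when a pattern is assembled from smaller fragments and therefore cannot live in the function's docstring; a sketch (the fragment names are illustrative):

```python
import ply.lex as lex
from ply.lex import TOKEN

tokens = ('ID',)
t_ignore = ' '

# Build the identifier pattern from reusable fragments
digit      = r'([0-9])'
nondigit   = r'([_A-Za-z])'
identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'

@TOKEN(identifier)
def t_ID(t):
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('foo _bar9')
values = [tok.value for tok in lexer]
print(values)   # ['foo', '_bar9']
```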

Standalone Lexer Execution

Runs a lexer in standalone mode for testing and debugging purposes, reading input from command line arguments or standard input.

def runmain(lexer=None, data=None):
    """
    Run lexer in standalone mode.

    Parameters:
    - lexer: Lexer instance to run (default: global lexer)
    - data: Input data to tokenize (default: from command line)
    """

Lexer Class

The main lexer class that performs tokenization of input strings. Supports stateful tokenization, error recovery, and position tracking.

class Lexer:
    def input(self, s):
        """
        Set the input string for tokenization.

        Parameters:
        - s: Input string to tokenize
        """

    def token(self):
        """
        Get the next token from input.

        Returns:
        LexToken instance or None if end of input
        """

    def clone(self, object=None):
        """
        Create a copy of the lexer.

        Parameters:
        - object: Object containing token rules (optional)

        Returns:
        New Lexer instance
        """

    def begin(self, state):
        """
        Change lexer to the specified state.

        Parameters:
        - state: State name to enter
        """

    def push_state(self, state):
        """
        Push current state and enter new state.

        Parameters:
        - state: State name to enter
        """

    def pop_state(self):
        """
        Pop state from stack and return to previous state.

        Returns:
        Previous state name
        """

    def current_state(self):
        """
        Get the current lexer state.

        Returns:
        Current state name
        """

    def skip(self, n):
        """
        Skip n characters in the input.

        Parameters:
        - n: Number of characters to skip
        """

    def __iter__(self):
        """
        Iterator interface for tokenization.

        Returns:
        Iterator object (self)
        """

    def __next__(self):
        """
        Get next token for iterator interface.

        Returns:
        Next LexToken or raises StopIteration
        """

    # Public attributes
    lineno: int     # Current line number
    lexpos: int     # Current position in input string
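
A sketch using the iterator interface together with the lineno and lexpos attributes; the rule names are illustrative, and t_newline is what keeps lineno current:

```python
import ply.lex as lex

tokens = ('WORD',)
t_WORD = r'[a-z]+'
t_ignore = ' '

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('foo bar\nbaz')
positions = [(tok.value, tok.lineno, tok.lexpos) for tok in lexer]
print(positions)   # [('foo', 1, 0), ('bar', 1, 4), ('baz', 2, 8)]
```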

Token Representation

Object representing a lexical token with type, value, and position information.

class LexToken:
    """
    Token object created by lexer.

    Attributes:
    - type: Token type string
    - value: Token value (original text or processed value)
    - lineno: Line number where token appears
    - lexpos: Character position in input string
    """
    type: str
    value: object   # original text, or a processed value set by the rule
    lineno: int
    lexpos: int

Lexer Error Handling

Exception class for lexical analysis errors and logging utilities.

class LexError(Exception):
    """Exception raised for lexical analysis errors."""

class PlyLogger:
    """
    Logging utility for PLY operations.
    Provides structured logging for lexer construction and operation.
    """

Token Rule Conventions

Basic Token Rules

Define tokens using variables or functions with t_ prefix:

# Simple token with literal regex
t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'  # Characters to ignore

# Token function with processing
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

reserved = {'if': 'IF', 'while': 'WHILE'}  # example reserved-word map

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Reclassify identifiers that are reserved words
    t.type = reserved.get(t.value, 'ID')
    return t
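
Putting the conventions above together, a runnable sketch of reserved-word handling (the reserved map and token names are examples):

```python
import ply.lex as lex

reserved = {'if': 'IF', 'while': 'WHILE'}
tokens = ('ID',) + tuple(reserved.values())
t_ignore = ' '

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')   # retype reserved words
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('if x while y')
kinds = [(tok.type, tok.value) for tok in lexer]
print(kinds)   # [('IF', 'if'), ('ID', 'x'), ('WHILE', 'while'), ('ID', 'y')]
```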

Special Token Functions

Special functions most lexers define: t_newline keeps lineno accurate (PLY does not update line numbers automatically), and t_error handles illegal characters (without it, the lexer raises LexError on bad input):

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

Multiple States

Support for lexer states to handle context-sensitive tokenization:

states = (
    ('comment', 'exclusive'),
    ('string', 'exclusive'),
)

# Rule in the INITIAL state that switches into the 'comment' state.
# (A rule named t_comment_start would only be active *inside* the
# 'comment' state, so it could never trigger the transition.)
def t_COMMENT(t):
    r'/\*'
    t.lexer.begin('comment')

# Rules prefixed with the state name apply only in that state
def t_comment_END(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_comment_error(t):
    t.lexer.skip(1)
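
A complete, runnable sketch of an exclusive comment state; token and state names are illustrative, and the rule entering the state lives in the INITIAL state:

```python
import ply.lex as lex

tokens = ('WORD',)
states = (('comment', 'exclusive'),)

t_ignore = ' '
t_WORD = r'[a-z]+'

def t_COMMENT(t):            # INITIAL-state rule: enter the comment state
    r'/\*'
    t.lexer.begin('comment')

def t_comment_END(t):        # active only inside the 'comment' state
    r'\*/'
    t.lexer.begin('INITIAL')

t_comment_ignore = ''        # exclusive states define their own ignore set

def t_comment_error(t):      # swallow comment body one character at a time
    t.lexer.skip(1)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('abc /* skip me */ def')
words = [tok.value for tok in lexer]
print(words)   # ['abc', 'def']
```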

Error Recovery

The lexer provides multiple mechanisms for handling errors:

  1. t_error() function: Called when illegal characters are encountered
  2. skip() method: Skip characters during error recovery
  3. LexError exception: Raised for critical lexer errors
  4. Logging: Comprehensive error reporting through PlyLogger
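
A sketch of the first two mechanisms working together: t_error() records the offending character, and skip() resumes scanning so tokenization continues (the NAME rule is illustrative):

```python
import ply.lex as lex

tokens = ('NAME',)
t_NAME = r'[a-z]+'
t_ignore = ' '

bad_chars = []

def t_error(t):
    bad_chars.append(t.value[0])   # record the offending character
    t.lexer.skip(1)                # resume scanning after it

lexer = lex.lex()
lexer.input('foo $ bar')
names = [tok.value for tok in lexer]
print(names, bad_chars)   # ['foo', 'bar'] ['$']
```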

Global Variables

When lex() is called, it also sets convenience globals in the ply.lex module namespace:

  • lexer: the most recently built Lexer instance
  • token: the bound method lexer.token
  • input: the bound method lexer.input

These allow simplified usage patterns (lex.input(data), then lex.token() in a loop) while the full Lexer API remains available.
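
A sketch of the simplified pattern; these globals are set by PLY 3.x, so verify against your installed version:

```python
import ply.lex as lex

tokens = ('NUM',)
t_NUM = r'\d+'
t_ignore = ' '

def t_error(t):
    t.lexer.skip(1)

lex.lex()             # sets lex.lexer, lex.token, lex.input
lex.input('7 8')      # same as lex.lexer.input('7 8')
first = lex.token()   # same as lex.lexer.token()
print(first.type, first.value)   # NUM 7
```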

Constants

StringTypes = (str, bytes)  # Acceptable string types for PLY

Install with Tessl CLI

npx tessl i tessl/pypi-ply
