Python implementation of lex and yacc parsing tools with LALR(1) algorithm and zero dependencies
—
The ply.lex module provides lexical analysis capabilities, converting raw text input into a stream of tokens using regular expressions and finite state machines. It supports multiple lexer states, comprehensive error handling, and flexible token rule definitions.
Creates a lexer instance by analyzing token rules defined in the calling module or specified object. Automatically discovers token rules through naming conventions and validates the lexical specification.
def lex(*, module=None, object=None, debug=False, reflags=int(re.VERBOSE), debuglog=None, errorlog=None):
    """
    Build a lexer from token rules.
    Parameters:
    - module: Module containing token rules (default: calling module)
    - object: Object containing token rules (alternative to module)
    - debug: Enable debug mode for lexer construction
    - reflags: Regular expression flags (default: re.VERBOSE)
    - debuglog: Logger for debug output
    - errorlog: Logger for error messages
    Returns:
    Lexer instance
    """

Decorator for adding a regular expression pattern to a token rule function, enabling more complex token processing while keeping the pattern associated with the function.
def TOKEN(r):
    """
    Decorator to add a regular expression pattern to a token rule function.
    Parameters:
    - r: Regular expression pattern string
    Returns:
    Decorated function with attached regex pattern
    """

Usage example:

@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t

Runs a lexer in standalone mode for testing and debugging purposes, reading input from command-line arguments or standard input.
def runmain(lexer=None, data=None):
    """
    Run lexer in standalone mode.
    Parameters:
    - lexer: Lexer instance to run (default: global lexer)
    - data: Input data to tokenize (default: from command line)
    """

The main lexer class that performs tokenization of input strings. Supports stateful tokenization, error recovery, and position tracking.
class Lexer:
    def input(self, s):
        """
        Set the input string for tokenization.
        Parameters:
        - s: Input string to tokenize
        """

    def token(self):
        """
        Get the next token from input.
        Returns:
        LexToken instance, or None at end of input
        """

    def clone(self, object=None):
        """
        Create a copy of the lexer.
        Parameters:
        - object: Object containing token rules (optional)
        Returns:
        New Lexer instance
        """

    def begin(self, state):
        """
        Change lexer to the specified state.
        Parameters:
        - state: State name to enter
        """

    def push_state(self, state):
        """
        Push the current state onto a stack and enter a new state.
        Parameters:
        - state: State name to enter
        """

    def pop_state(self):
        """
        Pop a state from the stack and return the lexer to it.
        """

    def current_state(self):
        """
        Get the current lexer state.
        Returns:
        Current state name
        """

    def skip(self, n):
        """
        Skip ahead n characters in the input.
        Parameters:
        - n: Number of characters to skip
        """

    def __iter__(self):
        """
        Iterator interface for tokenization.
        Returns:
        Iterator object (self)
        """

    def __next__(self):
        """
        Get the next token for the iterator interface.
        Returns:
        Next LexToken, or raises StopIteration at end of input
        """

    # Public attributes
    lineno: int   # Current line number
    lexpos: int   # Current position in the input string

Object representing a lexical token with type, value, and position information.
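For example, a lexer can be driven with the iterator protocol, and each token's position attributes inspected; a minimal runnable sketch with illustrative rules:

```python
import ply.lex as lex

tokens = ('WORD',)   # illustrative

t_ignore = ' '

def t_WORD(t):
    r'[a-z]+'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('foo\nbar baz')

# The iterator protocol yields LexToken objects until input is exhausted
seen = [(tok.type, tok.value, tok.lineno, tok.lexpos) for tok in lexer]
print(seen)   # [('WORD', 'foo', 1, 0), ('WORD', 'bar', 2, 4), ('WORD', 'baz', 2, 8)]
```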
class LexToken:
    """
    Token object created by the lexer.
    Attributes:
    - type: Token type string
    - value: Token value (original text or processed value)
    - lineno: Line number where the token appears
    - lexpos: Character position in the input string
    """
    type: str
    value: object
    lineno: int
    lexpos: int

Exception class for lexical analysis errors, and logging utilities.
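A sketch of how LexError might surface, assuming an illustrative rule set whose t_error deliberately raises instead of recovering:

```python
import ply.lex as lex

tokens = ('NUM',)   # illustrative
t_NUM = r'\d+'

def t_error(t):
    # Instead of skipping the bad character, abort tokenization
    raise lex.LexError(f"Illegal character {t.value[0]!r} at position {t.lexpos}", t.value)

lexer = lex.lex()
lexer.input('12?34')

message = None
try:
    while lexer.token():
        pass
except lex.LexError as e:
    message = e.args[0]

print(message)   # Illegal character '?' at position 2
```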
class LexError(Exception):
    """Exception raised for lexical analysis errors."""

class PlyLogger:
    """
    Logging utility for PLY operations.
    Provides structured logging for lexer construction and operation.
    """

Define tokens using variables or functions with a t_ prefix:
# Simple tokens with literal regexes
t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'   # Characters to ignore between tokens

# Token function with processing
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Check for reserved words
    t.type = reserved.get(t.value, 'ID')
    return t

Required functions for proper lexer operation:
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

Support for lexer states to handle context-sensitive tokenization:
states = (
    ('comment', 'exclusive'),
    ('string', 'exclusive'),
)

# Switch into the 'comment' state when '/*' is seen in INITIAL.
# Note: the entry rule must NOT carry the state prefix, or it would
# only apply inside the 'comment' state.
def t_COMMENT(t):
    r'/\*'
    t.lexer.begin('comment')

# Rules active only in the 'comment' state
def t_comment_END(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_comment_error(t):
    t.lexer.skip(1)

The lexer provides several mechanisms for handling errors: per-state t_error rules, the skip() method for discarding bad input, and the LexError exception for unrecoverable failures.
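Combining states with per-state error rules, a runnable sketch of a tiny brace-delimited sub-language using push_state()/pop_state() (all token and state names here are illustrative):

```python
import ply.lex as lex

tokens = ('LBRACE', 'RBRACE', 'TEXT')   # illustrative

states = (('braced', 'exclusive'),)

def t_LBRACE(t):
    r'\{'
    t.lexer.push_state('braced')   # remember INITIAL, enter 'braced'
    return t

def t_braced_RBRACE(t):
    r'\}'
    t.lexer.pop_state()            # return to whatever state we came from
    return t

def t_braced_TEXT(t):
    r'[^{}]+'
    return t

def t_TEXT(t):
    r'[^{}]+'
    return t

t_braced_ignore = ''               # exclusive states need their own ignore rule

def t_error(t):
    t.lexer.skip(1)

def t_braced_error(t):             # ...and their own error rule
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('a{b}c')
kinds = [tok.type for tok in lexer]
print(kinds)   # ['TEXT', 'LBRACE', 'TEXT', 'RBRACE', 'TEXT']
```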
When lex() is called, it sets global variables for convenience:

- lexer: Global lexer instance
- token: Global token function (reference to lexer.token())
- input: Global input function (reference to lexer.input())

These allow simplified usage patterns while maintaining access to the full Lexer API.
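Putting the pieces together, a minimal end-to-end sketch using the instance API (the token set, reserved words, and input text are illustrative):

```python
import ply.lex as lex

# Illustrative language: reserved words plus identifiers and integers
reserved = {'if': 'IF', 'else': 'ELSE'}
tokens = ('NUMBER', 'ID') + tuple(reserved.values())

t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')   # reclassify reserved words
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('if x else 42')
pairs = [(tok.type, tok.value) for tok in lexer]
print(pairs)   # [('IF', 'if'), ('ID', 'x'), ('ELSE', 'else'), ('NUMBER', 42)]
```

The global token()/input() shortcuts set by lex() behave the same as the instance methods used here.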
StringTypes = (str, bytes)   # Acceptable string types for PLY

Install with Tessl CLI
npx tessl i tessl/pypi-ply