# Lexical Analysis
The `ply.lex` module provides lexical analysis capabilities, converting raw text input into a stream of tokens using regular expressions and finite state machines. It supports multiple lexer states, comprehensive error handling, and flexible token rule definitions.
## Capabilities
### Lexer Creation
Creates a lexer instance by analyzing token rules defined in the calling module or specified object. Automatically discovers token rules through naming conventions and validates the lexical specification.
```python { .api }
def lex(*, module=None, object=None, debug=False, reflags=int(re.VERBOSE), debuglog=None, errorlog=None):
    """
    Build a lexer from token rules.

    Parameters:
    - module: Module containing token rules (default: calling module)
    - object: Object containing token rules (alternative to module)
    - debug: Enable debug mode for lexer construction
    - reflags: Regular expression flags (default: re.VERBOSE)
    - debuglog: Logger for debug output
    - errorlog: Logger for error messages

    Returns:
    Lexer instance
    """
```
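
The rule-discovery step can be illustrated with a standard-library sketch. This is not PLY's actual implementation, and the `rules` dict stands in for a module namespace: the idea is simply to collect `t_`-prefixed entries (skipping the special `t_ignore`) and combine them into one master pattern with named groups.

```python
import re

# Hypothetical rule namespace, mimicking a module that defines t_ rules
rules = {
    't_NUMBER': r'\d+',
    't_PLUS': r'\+',
    't_ignore': ' \t',
}

def build_master_pattern(namespace):
    """Sketch of lex()'s discovery step: gather t_-prefixed string rules
    into a single alternation, one named group per token type."""
    parts = []
    for name, pattern in namespace.items():
        if name.startswith('t_') and name != 't_ignore':
            parts.append(f'(?P<{name[2:]}>{pattern})')
    return re.compile('|'.join(parts))

master = build_master_pattern(rules)
m = master.match('42')
# m.lastgroup names the rule that matched; m.group() is the matched text
```

The named-group trick is what lets a single `re.match` call identify both the token's extent and its type in one step.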
### Token Rule Decorator
Decorator for adding regular expression patterns to token rule functions, enabling more complex token processing while maintaining the pattern association.
```python { .api }
def TOKEN(r):
    """
    Decorator to add a regular expression pattern to a token rule function.

    Parameters:
    - r: Regular expression pattern string

    Returns:
    Decorated function with the regex pattern attached
    """
```
Usage example:
```python
@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t
```
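
The decorator itself is small; a sketch consistent with the behavior described above simply attaches the pattern string to the function so the lexer builder can find it later:

```python
def TOKEN(r):
    """Sketch of the TOKEN decorator: attach the regex pattern
    to the rule function as a 'regex' attribute."""
    def set_regex(f):
        f.regex = r
        return f
    return set_regex

@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t

# t_NUMBER.regex now holds r'\d+'
```

This pattern is useful when the regex is too long or dynamic to fit comfortably in a docstring.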
### Standalone Lexer Execution
Runs a lexer in standalone mode for testing and debugging purposes, reading input from command line arguments or standard input.
```python { .api }
def runmain(lexer=None, data=None):
    """
    Run lexer in standalone mode.

    Parameters:
    - lexer: Lexer instance to run (default: global lexer)
    - data: Input data to tokenize (default: read from the file named on
      the command line, or from standard input)
    """
```
### Lexer Class
The main lexer class that performs tokenization of input strings. Supports stateful tokenization, error recovery, and position tracking.
```python { .api }
class Lexer:
    def input(self, s):
        """
        Set the input string for tokenization.

        Parameters:
        - s: Input string to tokenize
        """

    def token(self):
        """
        Get the next token from input.

        Returns:
        LexToken instance or None if end of input
        """

    def clone(self, object=None):
        """
        Create a copy of the lexer.

        Parameters:
        - object: Object containing token rules (optional)

        Returns:
        New Lexer instance
        """

    def begin(self, state):
        """
        Change lexer to the specified state.

        Parameters:
        - state: State name to enter
        """

    def push_state(self, state):
        """
        Push the current state onto the state stack and enter a new state.

        Parameters:
        - state: State name to enter
        """

    def pop_state(self):
        """
        Pop a state from the stack and make it the current state.
        """

    def current_state(self):
        """
        Get the current lexer state.

        Returns:
        Current state name
        """

    def skip(self, n):
        """
        Skip n characters in the input.

        Parameters:
        - n: Number of characters to skip
        """

    def __iter__(self):
        """
        Iterator interface for tokenization.

        Returns:
        Iterator object (self)
        """

    def __next__(self):
        """
        Get the next token for the iterator interface.

        Returns:
        Next LexToken, or raises StopIteration at end of input
        """

    # Public attributes
    lineno: int  # Current line number
    lexpos: int  # Current position in input string
```
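
Because of the iterator protocol, a lexer can be consumed with a plain `for` loop or `list()`. A minimal stand-in (not PLY's implementation; the `MiniLexer` class and its pattern are invented for illustration) shows the contract that `input()`, `token()`, and `__iter__`/`__next__` imply:

```python
import re

class MiniLexer:
    """Minimal stand-in following the Lexer interface sketched above:
    token() returns the next match or None, and iteration stops at None."""
    pattern = re.compile(r'\s*(\d+|\+)')  # toy grammar: integers and '+'

    def input(self, s):
        self.data = s
        self.lexpos = 0

    def token(self):
        m = self.pattern.match(self.data, self.lexpos)
        if not m:
            return None
        self.lexpos = m.end()
        return m.group(1)

    def __iter__(self):
        return self

    def __next__(self):
        t = self.token()
        if t is None:
            raise StopIteration
        return t

lexer = MiniLexer()
lexer.input('1 + 2')
tokens = list(lexer)   # consumes the input via the iterator protocol
```

Keeping `token()` as the single source of truth and routing `__next__` through it is the design choice that makes explicit and iterator-style consumption interchangeable.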
### Token Representation
Object representing a lexical token with type, value, and position information.
```python { .api }
class LexToken:
    """
    Token object created by the lexer.

    Attributes:
    - type: Token type string
    - value: Token value (original text or processed value)
    - lineno: Line number where the token appears
    - lexpos: Character position in the input string
    """
    type: str
    value: Any
    lineno: int
    lexpos: int
```
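
Hand-built tokens with these four attributes are handy when unit-testing a parser without running a real lexer. A hypothetical dataclass stand-in (the `FakeToken` name and repr format are illustrative, not part of the PLY API):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FakeToken:
    """Hand-built token carrying the same four attributes as LexToken."""
    type: str
    value: Any
    lineno: int
    lexpos: int

    def __repr__(self):
        # Compact repr in the style 'LexToken(type,value,lineno,lexpos)'
        return f'LexToken({self.type},{self.value!r},{self.lineno},{self.lexpos})'

tok = FakeToken('NUMBER', 42, 1, 0)
```

Anything that exposes `type`, `value`, `lineno`, and `lexpos` can stand in for a real token downstream.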
### Lexer Error Handling
Exception class for lexical analysis errors and logging utilities.
```python { .api }
class LexError(Exception):
    """Exception raised for lexical analysis errors."""

class PlyLogger:
    """
    Logging utility for PLY operations.
    Provides structured logging for lexer construction and operation.
    """
```
## Token Rule Conventions
### Basic Token Rules
Define tokens using variables or functions with `t_` prefix:
```python
# Keywords remapped by t_ID below
reserved = {'if': 'IF', 'while': 'WHILE'}

# Every token name must be declared in the tokens list
tokens = ['PLUS', 'MINUS', 'NUMBER', 'ID'] + list(reserved.values())

# Simple tokens with literal regexes
t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'  # Characters to ignore

# Token function with processing
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Check for reserved words
    t.type = reserved.get(t.value, 'ID')
    return t
```
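
Routing keywords through `t_ID` rather than giving each its own rule avoids a classic pitfall: a rule like `t_IF = r'if'` would also match the first two characters of an identifier such as `ifelse`. The lookup itself is a one-line dict remap; a self-contained sketch (the `reserved` contents here are illustrative):

```python
# Keyword table: identifier text -> token type
reserved = {'if': 'IF', 'while': 'WHILE'}

def classify(word):
    """Remap identifiers that are reserved words, as t_ID does;
    everything else stays a plain ID."""
    return reserved.get(word, 'ID')
```

Since the identifier regex always matches the whole word first, `ifelse` is classified as an `ID` and only an exact `if` becomes an `IF` token.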
### Special Token Functions
Functions conventionally defined for correct line tracking and error handling (`t_error` is effectively mandatory, since illegal characters otherwise raise `LexError`):
```python
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)
```
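
Note that only line numbers are tracked; columns are not. A common recipe derives the column from `lexpos` and the position of the preceding newline in the full input. A self-contained version (the `find_column` helper name is conventional, not part of the module API):

```python
def find_column(text, lexpos):
    """Compute the 1-based column of a token from the complete input
    text and the token's lexpos attribute."""
    line_start = text.rfind('\n', 0, lexpos) + 1
    return (lexpos - line_start) + 1

src = 'a = 1\nbb = 22'
# 'bb' starts at index 6, which is column 1 of line 2
```

Keeping the full input string around is the only requirement; the lexer itself does not need modification.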
### Multiple States
Support for lexer states to handle context-sensitive tokenization:
```python
states = (
    ('comment', 'exclusive'),
)

# Rule in the default (INITIAL) state: '/*' enters the comment state.
# Note the name: a 't_comment_...' prefix would bind the rule to the
# comment state itself, so the entry rule must not start with a state name.
def t_begin_comment(t):
    r'/\*'
    t.lexer.begin('comment')

# Rule active only in the comment state: '*/' returns to INITIAL
def t_comment_end(t):
    r'\*/'
    t.lexer.begin('INITIAL')

# Everything else inside a comment is skipped
def t_comment_error(t):
    t.lexer.skip(1)
```
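
For nested contexts, `push_state`/`pop_state` are preferable to `begin`, since they remember where to return. The discipline is a plain stack over `begin()`; a sketch of the mechanism (not PLY's code, names invented for illustration):

```python
class StateStack:
    """Sketch of the state-stack behavior behind push_state/pop_state."""
    def __init__(self):
        self.state = 'INITIAL'
        self.stack = []

    def begin(self, state):
        self.state = state

    def push_state(self, state):
        # Save the current state, then switch
        self.stack.append(self.state)
        self.begin(state)

    def pop_state(self):
        # Restore whatever state was active before the last push
        self.begin(self.stack.pop())

s = StateStack()
s.push_state('comment')   # INITIAL saved, now in 'comment'
s.pop_state()             # back to INITIAL
```

This is what makes constructs like comments inside strings (or vice versa) manageable without hard-coding the return state in each rule.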
## Error Recovery
The lexer provides multiple mechanisms for handling errors:
1. **t_error() function**: Called when illegal characters are encountered
2. **skip() method**: Skips characters during error recovery
3. **LexError exception**: Raised for critical lexer errors
4. **Logging**: Comprehensive error reporting through PlyLogger
## Global Variables
When `lex()` is called, it sets global variables for convenience:
- `lexer`: Global lexer instance
- `token`: Global token function (reference to `lexer.token()`)
- `input`: Global input function (reference to `lexer.input()`)
These allow for simplified usage patterns while maintaining access to the full Lexer API.
## Constants
```python { .api }
StringTypes = (str, bytes)  # Acceptable string types for PLY
```