# Tokens and Lexing

Token representation and lexical analysis, including the Token class for lexical units, lexer configuration, and specialized handling for indentation-sensitive languages.

## Capabilities

### Token Representation

The Token class represents lexical units produced by the lexer, inheriting from str while adding metadata.

```python { .api }
class Token(str):
    """
    String with meta-information representing a lexical token.
    Inherits from str, so it can be used anywhere a string is expected.
    """

    def __new__(cls, type_: str, value: str, start_pos: int = None,
                line: int = None, column: int = None, end_line: int = None,
                end_column: int = None, end_pos: int = None,
                pos_in_stream: int = None) -> 'Token':
        """
        Create a new token instance.

        Parameters:
        - type_: Token type name (terminal name from the grammar)
        - value: Token string value
        - start_pos: Starting position in the input text
        - line: Line number (1-based)
        - column: Column number (1-based)
        - end_line: Ending line number
        - end_column: Ending column number
        - end_pos: Ending position in the input text
        - pos_in_stream: Deprecated alias of start_pos

        Returns:
        Token: New token instance
        """

    def update(self, type_: str = None, value: str = None) -> 'Token':
        """
        Create an updated copy of the token with a new type and/or value.

        Parameters:
        - type_: New token type (optional)
        - value: New token value (optional)

        Returns:
        Token: Updated token copy
        """

    @classmethod
    def new_borrow_pos(cls, type_: str, value: str, borrow_t: 'Token') -> 'Token':
        """
        Create a token borrowing position information from another token.

        Parameters:
        - type_: Token type name
        - value: Token string value
        - borrow_t: Token to borrow position information from

        Returns:
        Token: New token with borrowed position
        """

    # Attributes
    type: str            # Token type name
    value: str           # Token string value (same as the str content)
    start_pos: int       # Start position in the input
    line: int            # Line number (1-based)
    column: int          # Column number (1-based)
    end_line: int        # End line number
    end_column: int      # End column number
    end_pos: int         # End position in the input
    pos_in_stream: int   # Deprecated alias of start_pos
```
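
Because Token subclasses str, the metadata simply rides along with ordinary string behavior. A minimal sketch of that pattern, for illustration only (MiniToken is a made-up name, not part of Lark):

```python
# Minimal sketch of the "str subclass with metadata" pattern that Token uses.
class MiniToken(str):
    __slots__ = ('type', 'line', 'column')

    def __new__(cls, type_, value, line=None, column=None):
        inst = super().__new__(cls, value)  # the str content IS the value
        inst.type = type_
        inst.line = line
        inst.column = column
        return inst

tok = MiniToken('NUMBER', '42', line=1, column=5)
print(tok == '42')         # compares equal to the plain string '42'
print(tok + '!')           # ordinary str operations still work
print(tok.type, tok.line)  # metadata travels with the value
```

Since such a token compares equal to its plain string value, tree-processing code can match tokens directly against string literals.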

### Indentation Handling

Post-lexer processor for handling Python-style indentation with INDENT/DEDENT tokens.
```python { .api }
82
class Indenter:
83
"""
84
PostLex processor for Python-like indentation handling.
85
Converts whitespace at line beginnings into INDENT/DEDENT tokens.
86
"""
87
88
def __init__(self, tab_len: int = 8):
89
"""
90
Initialize indenter.
91
92
Parameters:
93
- tab_len: Number of spaces equivalent to one tab
94
"""
95
96
def process(self, stream: Iterator[Token]) -> Iterator[Token]:
97
"""
98
Process token stream, converting indentation to INDENT/DEDENT tokens.
99
100
Parameters:
101
- stream: Input token stream
102
103
Returns:
104
Iterator[Token]: Stream with indentation tokens
105
"""
106
107
def handle_NL(self, token: Token) -> Iterator[Token]:
108
"""
109
Handle newline tokens for indentation tracking.
110
111
Parameters:
112
- token: Newline token
113
114
Returns:
115
Iterator[Token]: Processed tokens
116
"""
117
118
def handle_OPEN_PAREN(self, token: Token) -> Iterator[Token]:
119
"""
120
Handle opening parenthesis tokens.
121
122
Parameters:
123
- token: Opening parenthesis token
124
125
Returns:
126
Iterator[Token]: Processed tokens
127
"""
128
129
def handle_CLOSE_PAREN(self, token: Token) -> Iterator[Token]:
130
"""
131
Handle closing parenthesis tokens.
132
133
Parameters:
134
- token: Closing parenthesis token
135
136
Returns:
137
Iterator[Token]: Processed tokens
138
"""
139
140
# Configuration attributes
141
always_accept: Tuple[str, ...] = ('NL', 'COMMENT') # Always accepted tokens
142
NL_type: str = 'NL' # Newline token type
143
OPEN_PAREN_types: Tuple[str, ...] = () # Open paren types
144
CLOSE_PAREN_types: Tuple[str, ...] = () # Close paren types
145
INDENT_type: str = 'INDENT' # Indent token type
146
DEDENT_type: str = 'DEDENT' # Dedent token type
147
tab_len: int # Tab length in spaces
148
```
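
The core of `process` is a stack of indentation widths: a wider line pushes a level and emits INDENT, a narrower line pops levels and emits DEDENT until a match is found, and a width that matches no outer level is an error (Lark raises DedentError for this). A simplified sketch of that algorithm over plain source lines — the real Indenter works on Token streams and also tracks parenthesis nesting:

```python
# Sketch of the indent-stack algorithm behind Indenter.process (simplified).
def indent_tokens(lines, tab_len=8):
    """Return a list of 'INDENT'/'DEDENT' markers and ('LINE', text) events."""
    stack = [0]  # open indentation widths; 0 is always the outermost level
    out = []
    for line in lines:
        stripped = line.lstrip(' \t')
        if not stripped:
            continue  # blank lines don't affect indentation
        ws = line[:len(line) - len(stripped)]
        width = ws.count(' ') + ws.count('\t') * tab_len
        if width > stack[-1]:
            stack.append(width)
            out.append('INDENT')
        else:
            while width < stack[-1]:
                stack.pop()
                out.append('DEDENT')
            if width != stack[-1]:
                # analogous to Lark's DedentError
                raise ValueError('dedent does not match any outer indentation level')
        out.append(('LINE', stripped))
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        out.append('DEDENT')
    return out

events = indent_tokens(['if x:', '    y = 1', '    z = 2', 'w = 3'])
print(events)
```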

### Python-Specific Indenter

Pre-configured indenter for Python syntax.
```python { .api }
155
class PythonIndenter(Indenter):
156
"""
157
Indenter configured for Python language syntax.
158
"""
159
160
NL_type = 'NEWLINE'
161
OPEN_PAREN_types = ('LPAR', 'LSQB', 'LBRACE')
162
CLOSE_PAREN_types = ('RPAR', 'RSQB', 'RBRACE')
163
INDENT_type = 'INDENT'
164
DEDENT_type = 'DEDENT'
165
tab_len = 8
166
```

### Indentation Errors

Exception raised for invalid dedentation patterns.
```python { .api }
173
class DedentError(LarkError):
174
"""
175
Raised when dedentation doesn't match any previous indentation level.
176
"""
177
```

### Lexer Configuration

Configuration classes for lexer behavior and terminal definitions. Note that LexerConf is an internal class, and its exact constructor signature varies between Lark versions.
```python { .api }
184
class LexerConf:
185
"""
186
Configuration object for lexer components.
187
"""
188
189
def __init__(self, terminals: List, ignore: List = None,
190
g_regex_flags: int = 0, use_bytes: bool = False,
191
lexer_type: str = None, callbacks: Dict = None):
192
"""
193
Initialize lexer configuration.
194
195
Parameters:
196
- terminals: List of terminal definitions
197
- ignore: List of terminals to ignore
198
- g_regex_flags: Global regex flags
199
- use_bytes: Whether to use bytes instead of str
200
- lexer_type: Type of lexer to use
201
- callbacks: Token callbacks dictionary
202
"""
203
204
terminals: List # Terminal definitions
205
ignore: List # Ignored terminals
206
g_regex_flags: int # Global regex flags
207
use_bytes: bool # Use bytes input
208
lexer_type: str # Lexer type
209
callbacks: Dict # Token callbacks
210
```

### Terminal Definitions

Classes representing terminal symbol definitions in grammars.
```python { .api }
217
class TerminalDef:
218
"""
219
Definition of a terminal symbol in the grammar.
220
"""
221
222
def __init__(self, name: str, pattern, options: List = None):
223
"""
224
Initialize terminal definition.
225
226
Parameters:
227
- name: Terminal name
228
- pattern: Pattern object or string
229
- options: List of terminal options
230
"""
231
232
name: str # Terminal name
233
pattern: Pattern # Pattern for matching
234
options: List # Terminal options
235
```

### Lexer Classes

Core lexer implementations for tokenizing input text.
```python { .api }
242
class Lexer:
243
"""
244
Abstract base lexer class.
245
"""
246
247
def lex(self, text: str, dont_ignore: bool = False) -> Iterator[Token]:
248
"""
249
Tokenize input text.
250
251
Parameters:
252
- text: Input text to tokenize
253
- dont_ignore: Include normally ignored tokens
254
255
Returns:
256
Iterator[Token]: Token stream
257
"""
258
259
class TraditionalLexer(Lexer):
260
"""
261
Traditional regex-based lexer implementation.
262
"""
263
264
class LexerThread:
265
"""
266
Lexer state for incremental tokenization.
267
"""
268
269
def lex(self, stream: Iterator, newline_types: Set[str],
270
ignore_types: Set[str]) -> Iterator[Token]:
271
"""
272
Perform lexical analysis on character stream.
273
274
Parameters:
275
- stream: Character stream
276
- newline_types: Set of newline token types
277
- ignore_types: Set of token types to ignore
278
279
Returns:
280
Iterator[Token]: Token stream
281
"""
282
```

## Usage Examples

### Basic Token Usage
```python
289
from lark import Lark, Token
290
291
parser = Lark(grammar)
292
293
# Get tokens without parsing
294
tokens = list(parser.lex("x = 42"))
295
for token in tokens:
296
print(f"Type: {token.type}, Value: '{token.value}', Line: {token.line}")
297
298
# Tokens are strings
299
token = Token('IDENTIFIER', 'variable_name', line=1, column=5)
300
print(f"Token as string: {token}") # Prints: variable_name
301
print(f"Token type: {token.type}") # Prints: IDENTIFIER
302
```

### Creating Custom Tokens

```python
from lark import Token

# Create a token with position information
token = Token(
    'NUMBER', '123',
    start_pos=10,
    line=2,
    column=5,
    end_line=2,
    end_column=8,
    end_pos=13,
)

# Update the token's type
new_token = token.update(type_='INTEGER')
print(f"Updated type: {new_token.type}")

# Borrow position information from another token
borrowed = Token.new_borrow_pos('IDENTIFIER', 'x', token)
print(f"Borrowed position - Line: {borrowed.line}, Column: {borrowed.column}")
```

### Python Indentation Handling

```python
from lark import Lark
from lark.indenter import PythonIndenter

# Grammar for a Python-like syntax. PythonIndenter expects the terminal
# names _NEWLINE, _INDENT and _DEDENT. The indentation terminals are
# %declare'd because the indenter injects them after lexing; only inline
# whitespace is ignored, so newlines survive to become _NEWLINE tokens.
python_grammar = r"""
?start: _NEWLINE? stmt+

suite: _NEWLINE _INDENT stmt+ _DEDENT

stmt: simple_stmt _NEWLINE
    | compound_stmt

simple_stmt: expr_stmt
compound_stmt: if_stmt

if_stmt: "if" expr ":" suite

expr: NAME
expr_stmt: NAME "=" NUMBER

%import common.CNAME -> NAME
%import common.NUMBER

_NEWLINE: /(\r?\n[\t ]*)+/
%declare _INDENT _DEDENT

%ignore /[\t \f]+/
"""

# Use the Python indenter for handling indentation (postlex requires LALR)
parser = Lark(
    python_grammar,
    postlex=PythonIndenter(),
    parser='lalr'
)

# Parse indented code
code = '''
if x:
    y = 1
    z = 2
'''

tree = parser.parse(code)
print(tree.pretty())
```

### Custom Indenter

```python
from lark.indenter import Indenter

class CustomIndenter(Indenter):
    """Custom indenter for a specific syntax."""

    NL_type = 'NEWLINE'
    OPEN_PAREN_types = ('LPAREN', 'LBRACE')
    CLOSE_PAREN_types = ('RPAREN', 'RBRACE')
    INDENT_type = 'INDENT'
    DEDENT_type = 'DEDENT'
    tab_len = 4  # 4 spaces per indent level

# Use the custom indenter (postlex requires the LALR parser)
parser = Lark(grammar, postlex=CustomIndenter(), parser='lalr')
```

### Token Callbacks

```python
from lark import Lark, Token

def uppercase_identifiers(token):
    """Convert identifier tokens to uppercase."""
    if token.type == 'IDENTIFIER':
        return Token(token.type, token.value.upper(),
                     line=token.line, column=token.column)
    return token

def log_numbers(token):
    """Log all number tokens."""
    if token.type == 'NUMBER':
        print(f"Found number: {token.value} at line {token.line}")
    return token

# Apply callbacks during lexing
parser = Lark(
    grammar,
    lexer_callbacks={
        'IDENTIFIER': uppercase_identifiers,
        'NUMBER': log_numbers,
    }
)

result = parser.parse("x = 123")
```

### Lexer Configuration

```python
from lark import Lark
from lark.common import LexerConf
import re

# LexerConf is normally constructed internally by Lark; building one by hand
# is only needed when driving a lexer directly. terminal_list stands in for
# a list of TerminalDef objects.
lexer_conf = LexerConf(
    terminals=terminal_list,
    ignore=['WS', 'COMMENT'],
    g_regex_flags=re.IGNORECASE | re.MULTILINE,
    use_bytes=False,
    lexer_type='standard'
)

# When using Lark directly, choose the lexer by name instead
parser = Lark(grammar, lexer='basic')  # 'standard' in Lark versions before 1.0
```

### Position Tracking

```python
from lark import Lark

# Enable position tracking
parser = Lark(grammar, propagate_positions=True)
tree = parser.parse(text)

# Access position information
def print_positions(tree):
    if hasattr(tree, 'meta') and not tree.meta.empty:
        print(f"Rule '{tree.data}' at line {tree.meta.line}, "
              f"column {tree.meta.column}")

    for child in tree.children:
        if hasattr(child, 'children'):  # It's a Tree
            print_positions(child)
        else:  # It's a Token
            print(f"Token '{child.type}': '{child.value}' at "
                  f"line {child.line}, column {child.column}")

print_positions(tree)
```

### Advanced Token Processing

```python
from lark import Lark, Token

class TokenProcessor:
    """Advanced token processing with state."""

    def __init__(self):
        self.line_count = 0

    def process_newlines(self, token):
        if token.type == 'NEWLINE':
            self.line_count += 1
            # Tag the token value with a running line number
            return Token(token.type, f"\\n#{self.line_count}",
                         line=token.line, column=token.column)
        return token

processor = TokenProcessor()
parser = Lark(
    grammar,
    lexer_callbacks={'NEWLINE': processor.process_newlines}
)
```