# Lexical Analysis

The `ply.lex` module provides lexical analysis capabilities, converting raw text input into a stream of tokens using regular expressions and finite state machines. It supports multiple lexer states, comprehensive error handling, and flexible token rule definitions.

## Capabilities

### Lexer Creation

Creates a lexer instance by analyzing token rules defined in the calling module or specified object. Automatically discovers token rules through naming conventions and validates the lexical specification.

```python { .api }
def lex(*, module=None, object=None, debug=False, reflags=int(re.VERBOSE), debuglog=None, errorlog=None):
    """
    Build a lexer from token rules.

    Parameters:
    - module: Module containing token rules (default: calling module)
    - object: Object containing token rules (alternative to module)
    - debug: Enable debug mode for lexer construction
    - reflags: Regular expression flags (default: re.VERBOSE)
    - debuglog: Logger for debug output
    - errorlog: Logger for error messages

    Returns:
    Lexer instance
    """
```
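
A minimal sketch of building and running a lexer this way; the token names and input below are illustrative, not part of the API:

```python
import ply.lex as lex

# Declared token names; every t_ rule must correspond to one of these
tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'          # characters skipped silently

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)       # skip illegal characters one at a time

lexer = lex.lex()         # discovers the t_ rules in this module
lexer.input('3 + 4')
print([(tok.type, tok.value) for tok in lexer])
# → [('NUMBER', 3), ('PLUS', '+'), ('NUMBER', 4)]
```

Note that `t_NUMBER` both matches and converts: the returned token carries the processed `int` value rather than the matched text.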

### Token Rule Decorator

Decorator for adding regular expression patterns to token rule functions, enabling more complex token processing while maintaining the pattern association.

```python { .api }
def TOKEN(r):
    """
    Decorator to add a regular expression pattern to a token rule function.

    Parameters:
    - r: Regular expression pattern string

    Returns:
    Decorated function with attached regex pattern
    """
```

Usage example:

```python
@TOKEN(r'\d+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t
```
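
The main reason to use `TOKEN` rather than a docstring is that a docstring cannot be computed at runtime. A sketch where the pattern is assembled from parts (the `digit` helper is illustrative):

```python
import ply.lex as lex
from ply.lex import TOKEN

tokens = ('NUMBER',)
t_ignore = ' '

# The pattern is built from pieces; a docstring could not be
# constructed dynamically like this.
digit = r'[0-9]'

@TOKEN(digit + r'+')
def t_NUMBER(t):
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('10 20')
print([tok.value for tok in lexer])   # → [10, 20]
```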

### Standalone Lexer Execution

Runs a lexer in standalone mode for testing and debugging purposes, reading input from command line arguments or standard input.

```python { .api }
def runmain(lexer=None, data=None):
    """
    Run lexer in standalone mode.

    Parameters:
    - lexer: Lexer instance to run (default: global lexer)
    - data: Input data to tokenize (default: from command line)
    """
```

### Lexer Class

The main lexer class that performs tokenization of input strings. Supports stateful tokenization, error recovery, and position tracking.

```python { .api }
class Lexer:
    def input(self, s):
        """
        Set the input string for tokenization.

        Parameters:
        - s: Input string to tokenize
        """

    def token(self):
        """
        Get the next token from input.

        Returns:
        LexToken instance or None if end of input
        """

    def clone(self, object=None):
        """
        Create a copy of the lexer.

        Parameters:
        - object: Object containing token rules (optional)

        Returns:
        New Lexer instance
        """

    def begin(self, state):
        """
        Change lexer to the specified state.

        Parameters:
        - state: State name to enter
        """

    def push_state(self, state):
        """
        Push current state onto the stack and enter a new state.

        Parameters:
        - state: State name to enter
        """

    def pop_state(self):
        """
        Pop a state from the stack and return the lexer to it.
        """

    def current_state(self):
        """
        Get the current lexer state.

        Returns:
        Current state name
        """

    def skip(self, n):
        """
        Skip n characters in the input.

        Parameters:
        - n: Number of characters to skip
        """

    def __iter__(self):
        """
        Iterator interface for tokenization.

        Returns:
        Iterator object (self)
        """

    def __next__(self):
        """
        Get next token for iterator interface.

        Returns:
        Next LexToken or raises StopIteration
        """

    # Public attributes
    lineno: int   # Current line number
    lexpos: int   # Current position in input string
```
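
A sketch of driving these methods directly; the `WORD` grammar and input are illustrative:

```python
import ply.lex as lex

tokens = ('WORD',)
t_ignore = ' '

def t_WORD(t):
    r'[a-zA-Z]+'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)   # keep lineno accurate by hand

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('hello\nworld')

tok = lexer.token()
print(tok.type, tok.value, tok.lineno, tok.lexpos)   # WORD hello 1 0
tok = lexer.token()
print(tok.type, tok.value, tok.lineno, tok.lexpos)   # WORD world 2 6
print(lexer.token())                                 # None at end of input
```

Each `LexToken` carries the position data described below; `lineno` only advances because the `t_newline` rule updates it explicitly.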

### Token Representation

Object representing a lexical token with type, value, and position information.

```python { .api }
class LexToken:
    """
    Token object created by the lexer.

    Attributes:
    - type: Token type string
    - value: Token value (original text or processed value)
    - lineno: Line number where the token appears
    - lexpos: Character position in the input string
    """
    type: str
    value: Any
    lineno: int
    lexpos: int
```

### Lexer Error Handling

Exception class for lexical analysis errors and logging utilities.

```python { .api }
class LexError(Exception):
    """Exception raised for lexical analysis errors."""

class PlyLogger:
    """
    Logging utility for PLY operations.
    Provides structured logging for lexer construction and operation.
    """
```
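
A sketch of how `LexError` surfaces at tokenization time when no `t_error` rule is defined (the token rules and input here are illustrative):

```python
import ply.lex as lex
from ply.lex import LexError

tokens = ('NUMBER',)
t_NUMBER = r'\d+'
t_ignore = ' '

# Deliberately no t_error rule: the lexer logs a warning at build time,
# and an unmatched character raises LexError at tokenization time.
lexer = lex.lex()
lexer.input('12 $ 34')

try:
    while lexer.token() is not None:
        pass
except LexError as e:
    print('lexer error:', e)
```

Defining a `t_error` rule (as in the examples below) replaces the exception with recoverable, per-character handling.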

## Token Rule Conventions

### Basic Token Rules

Define tokens using variables or functions with `t_` prefix:

```python
# Simple tokens with literal regexes
t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'  # Characters to ignore

# Token function with processing
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # Check for reserved words
    t.type = reserved.get(t.value, 'ID')
    return t
```
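
The `t_ID` rule above assumes a `reserved` lookup table. A complete sketch with an illustrative keyword set:

```python
import ply.lex as lex

# Hypothetical reserved-word table; t_ID rewrites the token type on a match
reserved = {'if': 'IF', 'while': 'WHILE'}
tokens = ['ID', 'NUMBER'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')
    return t

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('if x 42')
print([tok.type for tok in lexer])   # → ['IF', 'ID', 'NUMBER']
```

Keeping keywords out of the regex and resolving them through one table avoids a separate rule per keyword and keeps the master pattern small.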

### Special Token Functions

Supporting functions for line tracking and error handling. `t_newline` keeps `lineno` accurate, and `t_error` is called when no rule matches (without it, illegal characters raise `LexError`):

```python
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)
```

### Multiple States

Support for lexer states to handle context-sensitive tokenization:

```python
states = (
    ('comment', 'exclusive'),
    ('string', 'exclusive'),
)

# Entering a state from INITIAL: the rule name must not begin with a
# state prefix, or the rule would only be active inside that state
def t_begin_comment(t):
    r'/\*'
    t.lexer.begin('comment')

# Rules prefixed with a state name apply only in that state
def t_comment_end(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_comment_error(t):
    t.lexer.skip(1)
```
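
A complete, runnable sketch of an exclusive comment state; the `WORD` token and input are illustrative. In an exclusive state only that state's rules apply, so everything inside the comment is discarded:

```python
import ply.lex as lex

states = (('comment', 'exclusive'),)
tokens = ('WORD',)

t_ignore = ' '
t_comment_ignore = ' '    # per-state ignore set

def t_WORD(t):
    r'[a-zA-Z]+'
    return t

def t_begin_comment(t):   # no state prefix, so it is active in INITIAL
    r'/\*'
    t.lexer.begin('comment')

def t_comment_end(t):     # active only in the 'comment' state
    r'\*/'
    t.lexer.begin('INITIAL')

def t_comment_error(t):   # discard anything else inside the comment
    t.lexer.skip(1)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('abc /* skipped */ def')
print([tok.value for tok in lexer])   # → ['abc', 'def']
```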

## Error Recovery

The lexer provides multiple mechanisms for handling errors:

1. **t_error() function**: Called when illegal characters are encountered
2. **skip() method**: Skip characters during error recovery
3. **LexError exception**: Raised for critical lexer errors
4. **Logging**: Comprehensive error reporting through PlyLogger

## Global Variables

When `lex()` is called, it sets global variables for convenience:

- `lexer`: Global lexer instance
- `token`: Global token function (reference to lexer.token())
- `input`: Global input function (reference to lexer.input())

These allow for simplified usage patterns while maintaining access to the full Lexer API.
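
A sketch of the simplified pattern, assuming the `token` and `input` globals described above (the token rule is illustrative):

```python
import ply.lex as lex

tokens = ('NUMBER',)
t_ignore = ' '

def t_NUMBER(t):
    r'\d+'
    return t

def t_error(t):
    t.lexer.skip(1)

lex.lex()             # also binds lex.lexer, lex.token, lex.input

lex.input('1 2 3')    # module-level shortcut for lex.lexer.input()
values = []
tok = lex.token()
while tok is not None:
    values.append(tok.value)
    tok = lex.token()
print(values)         # → ['1', '2', '3']
```

The globals always refer to the most recently built lexer, so explicit `Lexer` instances are safer when more than one lexer exists.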

## Constants

```python { .api }
StringTypes = (str, bytes)  # Acceptable string types for PLY
```