# Tokenize RT

A wrapper around the stdlib `tokenize` which roundtrips. tokenize-rt introduces two additional token types, ESCAPED_NL and UNIMPORTANT_WS, so that tokenizing and re-serializing a file reproduces it exactly; this makes it practical to build refactoring tools that modify Python source while preserving whitespace, comments, and formatting.

## Package Information

- **Package Name**: tokenize-rt
- **Language**: Python
- **Installation**: `pip install tokenize-rt`

## Core Imports

```python
import tokenize_rt
```

Commonly used imports for working with tokens:

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token
```

For additional utilities:

```python
from tokenize_rt import (
    ESCAPED_NL, UNIMPORTANT_WS, NON_CODING_TOKENS, NAMED_UNICODE_RE,
    Offset, reversed_enumerate, parse_string_literal,
    rfind_string_parts, curly_escape, _re_partition
)
```

## Basic Usage

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token

# Convert source code to tokens
source = '''
def hello():
    print("Hello, world!")
'''

# Tokenize with perfect roundtrip capability
tokens = src_to_tokens(source)

# Each token has name, src, line, and utf8_byte_offset
for token in tokens:
    if token.name not in {'UNIMPORTANT_WS', 'ESCAPED_NL'}:
        print(f'{token.name}: {token.src!r}')

# Convert back to source (perfect roundtrip)
reconstructed = tokens_to_src(tokens)
assert source == reconstructed

# Working with specific tokens
name_tokens = [t for t in tokens if t.name == 'NAME']
print(f"Found {len(name_tokens)} NAME tokens")

# Using token matching
for token in tokens:
    if token.matches(name='NAME', src='hello'):
        print(f"Found 'hello' at line {token.line}, offset {token.utf8_byte_offset}")
```

## Capabilities

### Core Tokenization

Convert between Python source code and token representations with perfect roundtrip capability, preserving all formatting including whitespace and escaped newlines.

```python { .api }
def src_to_tokens(src: str) -> list[Token]:
    """
    Convert a Python source code string to a list of tokens.

    Args:
        src (str): Python source code to tokenize

    Returns:
        list[Token]: List of Token objects representing the source
    """

def tokens_to_src(tokens: Iterable[Token]) -> str:
    """
    Convert an iterable of tokens back to a source code string.

    Args:
        tokens (Iterable[Token]): Tokens to convert back to source

    Returns:
        str: Reconstructed source code
    """

### Token Data Structures

Data structures for representing tokens and their positions within source code.

```python { .api }
class Offset(NamedTuple):
    """
    Represents a token offset with line and byte position information.
    """
    line: int | None = None
    utf8_byte_offset: int | None = None

class Token(NamedTuple):
    """
    Represents a tokenized element with position information.
    """
    name: str  # Token type name (from token.tok_name or custom types)
    src: str  # Source text of the token
    line: int | None = None  # Line number where the token appears
    utf8_byte_offset: int | None = None  # UTF-8 byte offset within the line

    @property
    def offset(self) -> Offset:
        """Return an Offset object for this token."""

    def matches(self, *, name: str, src: str) -> bool:
        """
        Check whether the token matches the given name and source.

        Args:
            name (str): Token name to match
            src (str): Token source to match

        Returns:
            bool: True if both name and src match
        """

### Token Navigation Utilities

Helper functions for working with token sequences, particularly useful for code refactoring and analysis tools.

```python { .api }
def reversed_enumerate(tokens: Sequence[Token]) -> Generator[tuple[int, Token]]:
    """
    Yield (index, token) pairs in reverse order.

    Args:
        tokens (Sequence[Token]): Token sequence to enumerate in reverse

    Yields:
        tuple[int, Token]: (index, token) pairs in reverse order
    """

def rfind_string_parts(tokens: Sequence[Token], i: int) -> tuple[int, ...]:
    """
    Find the indices of the string parts in a (joined) string literal.

    Args:
        tokens (Sequence[Token]): Token sequence to search
        i (int): Starting index (should be at the end of the string literal)

    Returns:
        tuple[int, ...]: Indices of the string parts, or an empty tuple if
        the token at i is not part of a string literal
    """

### String Literal Processing

Functions for parsing and processing Python string literals, including prefix extraction and escaping utilities.

```python { .api }
def parse_string_literal(src: str) -> tuple[str, str]:
    """
    Parse a string literal's source into (prefix, string) components.

    Args:
        src (str): String literal source code

    Returns:
        tuple[str, str]: (prefix, string) pair

    Example:
        >>> parse_string_literal('f"foo"')
        ('f', '"foo"')
    """

def curly_escape(s: str) -> str:
    """
    Escape curly braces in strings while preserving named unicode escapes.

    Args:
        s (str): String to escape

    Returns:
        str: String with curly braces escaped except in unicode names
    """

### Token Constants

Pre-defined constants for token classification and filtering.

```python { .api }
# Type import (for reference in signatures)
from re import Pattern

ESCAPED_NL: str
"""Constant for the escaped-newline token type."""

UNIMPORTANT_WS: str
"""Constant for the unimportant-whitespace token type."""

NON_CODING_TOKENS: frozenset[str]
"""
Set of token names that don't affect control flow or code:
{'COMMENT', ESCAPED_NL, 'NL', UNIMPORTANT_WS}
"""

NAMED_UNICODE_RE: Pattern[str]
"""Regular expression pattern for matching named unicode escapes."""

### Internal Utilities

Internal helper functions that are exposed and may be useful for advanced use cases.

```python { .api }
def _re_partition(regex: Pattern[str], s: str) -> tuple[str, str, str]:
    """
    Partition a string based on a regex match (internal helper function).

    Args:
        regex (Pattern[str]): Compiled regular expression pattern
        s (str): String to partition

    Returns:
        tuple[str, str, str]: (before_match, match, after_match), or
        (s, '', '') if there is no match
    """

### Command Line Interface

Command-line tool for tokenizing Python files and inspecting token sequences.

```python { .api }
def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface that tokenizes a file and prints tokens with positions.

    Args:
        argv (Sequence[str] | None): Command line arguments, or None for sys.argv

    Returns:
        int: Exit code (0 for success)
    """

## Advanced Usage Examples

### Token Filtering and Analysis

```python
from tokenize_rt import src_to_tokens, NON_CODING_TOKENS

source = '''
# This is a comment
def func():  # Another comment
    pass
'''

tokens = src_to_tokens(source)

# Filter out non-coding tokens
code_tokens = [t for t in tokens if t.name not in NON_CODING_TOKENS]
print("Code-only tokens:", [t.src for t in code_tokens])

# Find all comments
comments = [t for t in tokens if t.name == 'COMMENT']
print("Comments found:", [t.src for t in comments])
```

### String Literal Processing Examples

```python
from tokenize_rt import src_to_tokens, parse_string_literal, rfind_string_parts

# Parse string prefixes
prefix, string_part = parse_string_literal('f"Hello {name}!"')
print(f"Prefix: {prefix!r}, String: {string_part!r}")

# Find string parts in concatenated strings
source = '"first" "second" "third"'
tokens = src_to_tokens(source)

# Start from the last STRING token (not the trailing NEWLINE/ENDMARKER)
last_string = max(i for i, t in enumerate(tokens) if t.name == 'STRING')
string_indices = rfind_string_parts(tokens, last_string)
print("String part indices:", string_indices)
```

### Token Modification for Refactoring

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token

source = 'old_name = 42'
tokens = src_to_tokens(source)

# Replace 'old_name' with 'new_name'
modified_tokens = []
for token in tokens:
    if token.matches(name='NAME', src='old_name'):
        # Create a new token with the same position but different source
        modified_tokens.append(Token(
            name=token.name,
            src='new_name',
            line=token.line,
            utf8_byte_offset=token.utf8_byte_offset,
        ))
    else:
        modified_tokens.append(token)

result = tokens_to_src(modified_tokens)
print(result)  # new_name = 42
```