# Tokenize RT

A wrapper around the stdlib `tokenize` module that roundtrips. The tokenize-rt package provides lossless tokenization by introducing two additional token types, `ESCAPED_NL` and `UNIMPORTANT_WS`, which capture the formatting that `tokenize` normally discards. This makes it possible to build refactoring tools that modify Python source while preserving whitespace, comments, and layout exactly.

## Package Information

- **Package Name**: tokenize-rt
- **Language**: Python
- **Installation**: `pip install tokenize-rt`
## Core Imports

```python
import tokenize_rt
```

The most common imports for working with tokens:

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token
```

For additional utilities:

```python
from tokenize_rt import (
    ESCAPED_NL, UNIMPORTANT_WS, NON_CODING_TOKENS, NAMED_UNICODE_RE,
    Offset, reversed_enumerate, parse_string_literal,
    rfind_string_parts, curly_escape, _re_partition
)
```
## Basic Usage

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token

# Convert source code to tokens
source = '''
def hello():
    print("Hello, world!")
'''

# Tokenize with perfect roundtrip capability
tokens = src_to_tokens(source)

# Each token has name, src, line, and utf8_byte_offset
for token in tokens:
    if token.name not in {'UNIMPORTANT_WS', 'ESCAPED_NL'}:
        print(f'{token.name}: {token.src!r}')

# Convert back to source (perfect roundtrip)
reconstructed = tokens_to_src(tokens)
assert source == reconstructed

# Working with specific tokens
name_tokens = [t for t in tokens if t.name == 'NAME']
print(f"Found {len(name_tokens)} NAME tokens")

# Using token matching
for token in tokens:
    if token.matches(name='NAME', src='hello'):
        print(f"Found 'hello' at line {token.line}, offset {token.utf8_byte_offset}")
```
## Capabilities

### Core Tokenization

Convert between Python source code and token representations with perfect roundtrip capability, preserving all formatting including whitespace and escaped newlines.

```python { .api }
def src_to_tokens(src: str) -> list[Token]:
    """
    Convert a Python source code string to a list of tokens.

    Args:
        src (str): Python source code to tokenize

    Returns:
        list[Token]: List of Token objects representing the source
    """

def tokens_to_src(tokens: Iterable[Token]) -> str:
    """
    Convert an iterable of tokens back to a source code string.

    Args:
        tokens (Iterable[Token]): Tokens to convert back to source

    Returns:
        str: Reconstructed source code
    """
```
### Token Data Structures

Data structures for representing tokens and their positions within source code.

```python { .api }
class Offset(NamedTuple):
    """
    Represents a token offset with line and byte position information.
    """
    line: int | None = None
    utf8_byte_offset: int | None = None

class Token(NamedTuple):
    """
    Represents a tokenized element with position information.
    """
    name: str                            # Token type name (from token.tok_name or custom types)
    src: str                             # Source text of the token
    line: int | None = None              # Line number where the token appears
    utf8_byte_offset: int | None = None  # UTF-8 byte offset within the line

    @property
    def offset(self) -> Offset:
        """Return an Offset object for this token."""

    def matches(self, *, name: str, src: str) -> bool:
        """
        Check whether the token matches the given name and source.

        Args:
            name (str): Token name to match
            src (str): Token source to match

        Returns:
            bool: True if both name and src match
        """
```
### Token Navigation Utilities

Helper functions for working with token sequences, particularly useful for code refactoring and analysis tools.

```python { .api }
def reversed_enumerate(tokens: Sequence[Token]) -> Generator[tuple[int, Token]]:
    """
    Yield (index, token) pairs in reverse order.

    Args:
        tokens (Sequence[Token]): Token sequence to enumerate in reverse

    Yields:
        tuple[int, Token]: (index, token) pairs in reverse order
    """

def rfind_string_parts(tokens: Sequence[Token], i: int) -> tuple[int, ...]:
    """
    Find the indices of the string parts of a (joined) string literal.

    Args:
        tokens (Sequence[Token]): Token sequence to search
        i (int): Starting index (should be at the end of the string literal)

    Returns:
        tuple[int, ...]: Indices of the string parts, or an empty tuple if
        the token at i is not part of a string literal
    """
```
### String Literal Processing

Functions for parsing and processing Python string literals, including prefix extraction and escaping utilities.

```python { .api }
def parse_string_literal(src: str) -> tuple[str, str]:
    """
    Parse a string literal's source into (prefix, string) components.

    Args:
        src (str): String literal source code

    Returns:
        tuple[str, str]: (prefix, string) pair

    Example:
        >>> parse_string_literal('f"foo"')
        ('f', '"foo"')
    """

def curly_escape(s: str) -> str:
    """
    Escape curly braces in a string while preserving named unicode escapes.

    Args:
        s (str): String to escape

    Returns:
        str: String with curly braces doubled, except inside named
        unicode escapes
    """
```
### Token Constants

Pre-defined constants for token classification and filtering.

```python { .api }
# Type imports (for reference in signatures)
from re import Pattern

ESCAPED_NL: str
"""Constant for the escaped-newline token type."""

UNIMPORTANT_WS: str
"""Constant for the unimportant-whitespace token type."""

NON_CODING_TOKENS: frozenset[str]
"""
Set of token names that don't affect control flow or code:
{'COMMENT', ESCAPED_NL, 'NL', UNIMPORTANT_WS}
"""

NAMED_UNICODE_RE: Pattern[str]
"""Regular expression pattern for matching named unicode escapes."""
```
### Internal Utilities

Internal helper functions that are exposed and may be useful for advanced use cases.

```python { .api }
def _re_partition(regex: Pattern[str], s: str) -> tuple[str, str, str]:
    """
    Partition a string around the first regex match (internal helper).

    Args:
        regex (Pattern[str]): Compiled regular expression pattern
        s (str): String to partition

    Returns:
        tuple[str, str, str]: (before_match, match, after_match),
        or (s, '', '') if there is no match
    """
```
### Command Line Interface

Command-line tool for tokenizing Python files and inspecting token sequences.

```python { .api }
def main(argv: Sequence[str] | None = None) -> int:
    """
    Command-line interface that tokenizes a file and prints each token
    with its position.

    Args:
        argv (Sequence[str] | None): Command line arguments, or None for sys.argv

    Returns:
        int: Exit code (0 for success)
    """
```
## Advanced Usage Examples

### Token Filtering and Analysis

```python
from tokenize_rt import src_to_tokens, NON_CODING_TOKENS

source = '''
# This is a comment
def func():  # Another comment
    pass
'''

tokens = src_to_tokens(source)

# Filter out non-coding tokens
code_tokens = [t for t in tokens if t.name not in NON_CODING_TOKENS]
print("Code-only tokens:", [t.src for t in code_tokens])

# Find all comments
comments = [t for t in tokens if t.name == 'COMMENT']
print("Comments found:", [t.src for t in comments])
```
### String Literal Processing

```python
from tokenize_rt import src_to_tokens, parse_string_literal, rfind_string_parts

# Parse string prefixes
prefix, string_part = parse_string_literal('f"Hello {name}!"')
print(f"Prefix: {prefix!r}, String: {string_part!r}")

# Find string parts in concatenated strings
source = '"first" "second" "third"'
tokens = src_to_tokens(source)

# rfind_string_parts must start at the end of the string literal,
# i.e. at the last STRING token (not at the trailing NEWLINE/ENDMARKER)
last_string = max(i for i, t in enumerate(tokens) if t.name == 'STRING')
string_indices = rfind_string_parts(tokens, last_string)
print("String part indices:", string_indices)
```
### Token Modification for Refactoring

```python
from tokenize_rt import src_to_tokens, tokens_to_src, Token

source = 'old_name = 42'
tokens = src_to_tokens(source)

# Replace 'old_name' with 'new_name'
modified_tokens = []
for token in tokens:
    if token.matches(name='NAME', src='old_name'):
        # Create a new token with the same position but different source
        modified_tokens.append(Token(
            name=token.name,
            src='new_name',
            line=token.line,
            utf8_byte_offset=token.utf8_byte_offset
        ))
    else:
        modified_tokens.append(token)

result = tokens_to_src(modified_tokens)
print(result)  # new_name = 42
```