# HTML Utilities

Utility functions for converting between HTML and Telegraph's internal node format. These functions handle HTML parsing, validation, and conversion while respecting Telegraph's allowed tag restrictions.

## Capabilities

### HTML to Nodes Conversion

Convert HTML content to Telegraph's internal node format.

```python { .api }
def html_to_nodes(html_content: str) -> list:
    """
    Convert HTML content to Telegraph nodes format.

    Parameters:
    - html_content (str): HTML string to convert

    Returns:
    list: Telegraph nodes representation of the HTML

    Raises:
    NotAllowedTag: HTML contains tags not allowed by Telegraph
    InvalidHTML: HTML is malformed or has mismatched tags
    """
```

Usage examples:

```python
from telegraph.utils import html_to_nodes

# Simple HTML conversion
html = '<p>Hello <strong>world</strong>!</p>'
nodes = html_to_nodes(html)
print(nodes)
# Output: [{'tag': 'p', 'children': ['Hello ', {'tag': 'strong', 'children': ['world']}, '!']}]

# HTML with attributes
html = '<p><a href="https://example.com">Link</a></p>'
nodes = html_to_nodes(html)
print(nodes)
# Output: [{'tag': 'p', 'children': [{'tag': 'a', 'attrs': {'href': 'https://example.com'}, 'children': ['Link']}]}]

# HTML with images
html = '<figure><img src="/file/image.jpg" alt="Photo"><figcaption>Caption</figcaption></figure>'
nodes = html_to_nodes(html)
```
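
Once content is in node form, it can be traversed with ordinary recursion. The sketch below extracts the plain text from a node list; `nodes_to_text` is a hypothetical helper written for illustration and is not part of `telegraph.utils`.

```python
def nodes_to_text(nodes):
    """Concatenate the text of a Telegraph node list (hypothetical helper)."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            # Text node: keep the string as-is
            parts.append(node)
        else:
            # Element node: recurse into its children, if it has any
            parts.append(nodes_to_text(node.get('children', [])))
    return ''.join(parts)


nodes = [{'tag': 'p', 'children': ['Hello ', {'tag': 'strong', 'children': ['world']}, '!']}]
print(nodes_to_text(nodes))  # Hello world!
```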

### Nodes to HTML Conversion

Convert Telegraph nodes back to HTML format.

```python { .api }
def nodes_to_html(nodes: list) -> str:
    """
    Convert Telegraph nodes to HTML format.

    Parameters:
    - nodes (list): Telegraph nodes to convert

    Returns:
    str: HTML representation of the nodes
    """
```

Usage examples:

```python
from telegraph.utils import nodes_to_html

# Convert nodes to HTML
nodes = [
    {'tag': 'p', 'children': ['Hello ', {'tag': 'em', 'children': ['world']}, '!']}
]
html = nodes_to_html(nodes)
print(html)
# Output: '<p>Hello <em>world</em>!</p>'

# Nodes with attributes
nodes = [
    {'tag': 'p', 'children': [
        {'tag': 'a', 'attrs': {'href': 'https://example.com'}, 'children': ['Visit site']}
    ]}
]
html = nodes_to_html(nodes)
print(html)
# Output: '<p><a href="https://example.com">Visit site</a></p>'
```
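
To make the direction of this conversion concrete, here is a simplified, independent sketch of how a node list could be serialized to HTML. It is not the library's implementation: `render_nodes` and `VOID` are names assumed for illustration, and details such as how void elements are written may differ from what `nodes_to_html` produces.

```python
from html import escape

# Assumption: only the void tags that Telegraph allows are handled here
VOID = {'br', 'hr', 'img'}


def render_nodes(nodes):
    """Serialize a node list to an HTML string (illustrative sketch)."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            # Text nodes are escaped so that '<' and '&' stay literal
            out.append(escape(node))
            continue
        attrs = ''.join(
            ' {}="{}"'.format(name, escape(value, quote=True))
            for name, value in node.get('attrs', {}).items()
        )
        out.append('<{}{}>'.format(node['tag'], attrs))
        if node['tag'] not in VOID:
            # Non-void elements get their children and a closing tag
            out.append(render_nodes(node.get('children', [])))
            out.append('</{}>'.format(node['tag']))
    return ''.join(out)


nodes = [{'tag': 'p', 'children': [
    'Tom & Jerry ',
    {'tag': 'a', 'attrs': {'href': 'https://example.com'}, 'children': ['site']},
    {'tag': 'br'},
]}]
print(render_nodes(nodes))
# <p>Tom &amp; Jerry <a href="https://example.com">site</a><br></p>
```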

### Round-trip Conversion

You can convert HTML to nodes and back to HTML:

```python
from telegraph.utils import html_to_nodes, nodes_to_html

original_html = '<p>Test <strong>content</strong> with <em>formatting</em>.</p>'
nodes = html_to_nodes(original_html)
converted_html = nodes_to_html(nodes)
print(converted_html)
# Output: '<p>Test <strong>content</strong> with <em>formatting</em>.</p>'
```

## Node Format Structure

Telegraph nodes use a specific JSON structure:

### Text Nodes
Plain strings represent text content:
```python
"Hello world"
```

### Element Nodes
Dictionaries represent HTML elements:
```python
{
    'tag': 'p',                   # Required: HTML tag name
    'attrs': {'id': 'content'},   # Optional: attributes dict
    'children': ['Text content']  # Optional: child nodes list
}
```

### Common Node Examples

```python
# Paragraph with text
{'tag': 'p', 'children': ['Simple paragraph']}

# Bold text
{'tag': 'strong', 'children': ['Bold text']}

# Link with attributes
{'tag': 'a', 'attrs': {'href': 'https://example.com'}, 'children': ['Link text']}

# Image (void element)
{'tag': 'img', 'attrs': {'src': '/file/image.jpg', 'alt': 'Description'}}

# Nested elements
{'tag': 'p', 'children': [
    'Text with ',
    {'tag': 'strong', 'children': ['bold']},
    ' and ',
    {'tag': 'em', 'children': ['italic']},
    ' formatting.'
]}
```
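
The text and element shapes described above can be checked mechanically. The following is a minimal sketch of such a check; `check_node` is a hypothetical helper, and it validates only the structure (types of `tag`, `attrs`, and `children`), not whether the tag is allowed by Telegraph.

```python
def check_node(node):
    """Return True if `node` is a text node or a well-formed element node."""
    if isinstance(node, str):
        return True  # text node
    if not isinstance(node, dict) or not isinstance(node.get('tag'), str):
        return False  # element nodes must be dicts with a string 'tag'
    if not isinstance(node.get('attrs', {}), dict):
        return False  # 'attrs' is optional, but must be a dict when present
    children = node.get('children', [])
    # 'children' is optional, but must be a list of valid nodes when present
    return isinstance(children, list) and all(check_node(c) for c in children)


print(check_node({'tag': 'p', 'children': ['Text ', {'tag': 'em', 'children': ['here']}]}))  # True
print(check_node({'tag': 'p', 'children': 'not a list'}))  # False
```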

## Allowed HTML Tags

Telegraph supports a restricted set of HTML tags:

- **Text formatting**: `b`, `strong`, `i`, `em`, `u`, `s`, `code`
- **Structure**: `p`, `br`, `h3`, `h4`, `hr`, `blockquote`, `pre`
- **Lists**: `ul`, `ol`, `li`
- **Media**: `img`, `video`, `iframe`, `figure`, `figcaption`
- **Links**: `a`
- **Semantic**: `aside`
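
Before converting a document, it can be useful to list which of its tags fall outside this set. The sketch below does that with the standard library's `html.parser`; the `ALLOWED` set is copied from the list above, and `TagCollector` and `disallowed_tags` are hypothetical names, not library API.

```python
from html.parser import HTMLParser

# Copied from the allowed tag list above
ALLOWED = {
    'a', 'aside', 'b', 'blockquote', 'br', 'code', 'em', 'figcaption', 'figure',
    'h3', 'h4', 'hr', 'i', 'iframe', 'img', 'li', 'ol', 'p', 'pre', 's',
    'strong', 'u', 'ul', 'video',
}


class TagCollector(HTMLParser):
    """Record the name of every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already lowercased
        self.tags.add(tag)


def disallowed_tags(html):
    """Return a sorted list of tags in `html` that are not in ALLOWED."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.tags - ALLOWED)


print(disallowed_tags('<div><p>Hi <span>there</span></p></div>'))  # ['div', 'span']
```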

## HTML Processing Rules

### Whitespace Handling
- Multiple whitespace characters are collapsed to single spaces
- Leading/trailing whitespace is trimmed appropriately
- Whitespace in `<pre>` tags is preserved exactly

```python
from telegraph.utils import html_to_nodes, nodes_to_html

# Multiple spaces collapsed
html = '<p>Multiple    spaces   here</p>'
nodes = html_to_nodes(html)
result = nodes_to_html(nodes)
print(result)  # '<p>Multiple spaces here</p>'

# Preformatted text preserved
html = '<pre>  Code   with   spaces  </pre>'
nodes = html_to_nodes(html)
result = nodes_to_html(nodes)
print(result)  # '<pre>  Code   with   spaces  </pre>'
```
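
The collapsing rule can be illustrated on its own with a regular expression. This is a sketch of the general technique, assuming any run of whitespace (spaces, tabs, newlines) counts; `collapse_whitespace` is a hypothetical name, and the library's handling of trimming and `<pre>` content is not reproduced here.

```python
import re

# Any run of one or more whitespace characters
WHITESPACE = re.compile(r'\s+')


def collapse_whitespace(text):
    """Replace every run of whitespace in `text` with a single space."""
    return WHITESPACE.sub(' ', text)


print(repr(collapse_whitespace('Multiple   spaces\n\there')))  # 'Multiple spaces here'
```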

### Case Normalization
HTML tag names are automatically converted to lowercase:

```python
from telegraph.utils import html_to_nodes, nodes_to_html

html = '<P><STRONG>Upper case tags</STRONG></P>'
nodes = html_to_nodes(html)
result = nodes_to_html(nodes)
print(result)  # '<p><strong>Upper case tags</strong></p>'
```

## Error Handling

HTML utility functions raise specific exceptions for different error conditions:

```python
from telegraph.utils import html_to_nodes
from telegraph.exceptions import NotAllowedTag, InvalidHTML

# Handle disallowed tags
try:
    html = '<script>alert("bad")</script>'
    nodes = html_to_nodes(html)
except NotAllowedTag as e:
    print(f"Tag not allowed: {e}")

# Handle malformed HTML
try:
    html = '<p><strong>Unclosed tags</p>'
    nodes = html_to_nodes(html)
except InvalidHTML as e:
    print(f"Invalid HTML: {e}")

# Handle missing start tags
try:
    html = '</div><p>Content</p>'
    nodes = html_to_nodes(html)
except InvalidHTML as e:
    print(f"Missing start tag: {e}")
```
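
The structural errors shown above (mismatched, unclosed, or unopened tags) are the kind that a stack of open tags can detect. The sketch below shows that technique with the standard library only. `BalanceChecker` and `find_structure_errors` are hypothetical names, the message strings are illustrative rather than the library's, and unlike `html_to_nodes` this sketch collects problems instead of raising.

```python
from html.parser import HTMLParser

# Assumption: void tags never get a closing tag, so they are not tracked
VOID = {'br', 'hr', 'img'}


class BalanceChecker(HTMLParser):
    """Track open tags on a stack and record mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.stack:
            self.errors.append('"{}" missing start tag'.format(tag))
        elif self.stack[-1] != tag:
            self.errors.append('"{}" closed instead of "{}"'.format(tag, self.stack[-1]))
        else:
            self.stack.pop()


def find_structure_errors(html):
    """Return a list of structural problems found in `html`."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was never closed
    return checker.errors + ['"{}" is not closed'.format(t) for t in checker.stack]


print(find_structure_errors('</div><p>Content</p>'))  # ['"div" missing start tag']
print(find_structure_errors('<p>Fine<br></p>'))       # []
```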

## Integration with Telegraph API

Use utilities to work with different content formats:

```python
from telegraph import Telegraph
from telegraph.utils import html_to_nodes, nodes_to_html

telegraph = Telegraph(access_token='your_token')

# Create page with HTML, retrieve as nodes
html_content = '<p>Original <strong>HTML</strong> content.</p>'
response = telegraph.create_page(
    title='HTML Example',
    html_content=html_content
)

# Get page content as nodes
page = telegraph.get_page(response['path'], return_html=False)
nodes = page['content']

# Modify nodes programmatically
nodes.append({'tag': 'p', 'children': ['Added paragraph.']})

# Convert back to HTML and update page
updated_html = nodes_to_html(nodes)
telegraph.edit_page(
    response['path'],
    title='Updated HTML Example',
    html_content=updated_html
)
```

## Advanced Usage

### Custom Node Processing

```python
from telegraph.utils import html_to_nodes, nodes_to_html


def process_nodes(nodes):
    """Process nodes recursively to modify content."""
    processed = []
    for node in nodes:
        if isinstance(node, str):
            # Process text nodes
            processed.append(node.upper())
        elif isinstance(node, dict):
            # Process element nodes
            new_node = {'tag': node['tag']}
            if 'attrs' in node:
                new_node['attrs'] = node['attrs']
            if 'children' in node:
                new_node['children'] = process_nodes(node['children'])
            processed.append(new_node)
    return processed

# Apply custom processing
original_nodes = html_to_nodes('<p>Process <em>this</em> text.</p>')
modified_nodes = process_nodes(original_nodes)
result_html = nodes_to_html(modified_nodes)
print(result_html)  # '<p>PROCESS <em>THIS</em> TEXT.</p>'
```
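
The same recursive pattern works for read-only traversals. As a sketch, the hypothetical helper `collect_links` below gathers every `href` from `a` elements in a node list, in document order; it operates on plain node literals and does not call the library.

```python
def collect_links(nodes):
    """Return the href of every <a> element in `nodes`, in document order."""
    links = []
    for node in nodes:
        if isinstance(node, dict):
            if node['tag'] == 'a' and 'href' in node.get('attrs', {}):
                links.append(node['attrs']['href'])
            # Links may be nested inside other elements
            links.extend(collect_links(node.get('children', [])))
    return links


nodes = [{'tag': 'p', 'children': [
    'See ',
    {'tag': 'a', 'attrs': {'href': 'https://example.com'}, 'children': ['this site']},
    ' and ',
    {'tag': 'strong', 'children': [
        {'tag': 'a', 'attrs': {'href': 'https://example.org'}, 'children': ['that one']}
    ]},
]}]
print(collect_links(nodes))  # ['https://example.com', 'https://example.org']
```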

## Additional Utilities

### JSON Serialization

Utility function for Telegraph-compatible JSON serialization.

```python { .api }
def json_dumps(*args, **kwargs) -> str:
    """
    Serialize object to JSON string with Telegraph-compatible formatting.

    Uses compact separators and ensures proper Unicode handling.
    Arguments passed through to json.dumps() with optimized defaults.

    Returns:
    str: JSON string with compact formatting
    """
```

Usage example:

```python
from telegraph.utils import json_dumps

# Serialize nodes for Telegraph API
nodes = [{'tag': 'p', 'children': ['Hello, world!']}]
json_string = json_dumps(nodes)
print(json_string)  # Compact JSON output
```
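
To show what "compact separators" and Unicode handling mean in practice, the sketch below uses the standard `json.dumps` directly. It assumes the relevant settings are `separators=(',', ':')` and `ensure_ascii=False`; those exact arguments are an assumption for illustration, not a statement about how `json_dumps` is implemented.

```python
import json

nodes = [{'tag': 'p', 'children': ['Olá, world!']}]

# Assumed settings: no spaces after separators, non-ASCII kept as-is
compact = json.dumps(nodes, separators=(',', ':'), ensure_ascii=False)
default = json.dumps(nodes)

print(compact)  # [{"tag":"p","children":["Olá, world!"]}]
print(len(compact) < len(default))  # True
```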

### File Handling Utility

Context manager for handling file uploads with proper resource management.

```python { .api }
class FilesOpener:
    """
    Context manager for opening and managing file objects for upload.

    Parameters:
    - paths (str|list): File path(s) or file-like object(s)
    - key_format (str): Format string for file keys, defaults to 'file{}'
    """
    def __init__(self, paths, key_format: str = 'file{}'):
        pass

    def __enter__(self) -> list:
        """
        Open files and return list of (key, (filename, file_object, mimetype)) tuples.
        """
        pass

    def __exit__(self, type, value, traceback):
        """
        Close all opened files.
        """
        pass
```

Usage example:

```python
from telegraph.utils import FilesOpener

# Handle single file
with FilesOpener('image.jpg') as files:
    print(files)  # [('file0', ('file0', <file_object>, 'image/jpeg'))]

# Handle multiple files
with FilesOpener(['img1.png', 'img2.jpg']) as files:
    for key, (filename, file_obj, mimetype) in files:
        print(f"{key}: {filename} ({mimetype})")
```
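
The keys and MIME types in the tuples above can be reproduced without opening any files. The sketch below is an illustration only: `describe_files` is a hypothetical helper, and it assumes MIME types are guessed from the file name with the standard `mimetypes` module.

```python
import mimetypes


def describe_files(paths, key_format='file{}'):
    """Return (key, path, mimetype) for each path without opening the files."""
    if not isinstance(paths, list):
        paths = [paths]  # accept a single path as well as a list
    described = []
    for index, path in enumerate(paths):
        # guess_type returns (type, encoding); only the type is needed here
        mimetype, _encoding = mimetypes.guess_type(path)
        described.append((key_format.format(index), path, mimetype))
    return described


print(describe_files(['img1.png', 'img2.jpg']))
# [('file0', 'img1.png', 'image/png'), ('file1', 'img2.jpg', 'image/jpeg')]
```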

### Telegraph Constants

Important constants for HTML processing and validation.

```python { .api }
ALLOWED_TAGS: set = {
    'a', 'aside', 'b', 'blockquote', 'br', 'code', 'em', 'figcaption', 'figure',
    'h3', 'h4', 'hr', 'i', 'iframe', 'img', 'li', 'ol', 'p', 'pre', 's',
    'strong', 'u', 'ul', 'video'
}

VOID_ELEMENTS: set = {
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr'
}

BLOCK_ELEMENTS: set = {
    'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div', 'dl',
    'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2',
    'h3', 'h4', 'h5', 'h6', 'header', 'hgroup', 'hr', 'li', 'main', 'nav',
    'noscript', 'ol', 'output', 'p', 'pre', 'section', 'table', 'tfoot', 'ul',
    'video'
}
```

These constants can be imported and used for validation:

```python
from telegraph.utils import ALLOWED_TAGS, VOID_ELEMENTS, BLOCK_ELEMENTS

def validate_tag(tag_name):
    """Check if a tag is allowed by Telegraph."""
    return tag_name.lower() in ALLOWED_TAGS

def is_void_element(tag_name):
    """Check if a tag is a void element (self-closing)."""
    return tag_name.lower() in VOID_ELEMENTS

def is_block_element(tag_name):
    """Check if a tag is a block-level element."""
    return tag_name.lower() in BLOCK_ELEMENTS

# Usage
print(validate_tag('p'))        # True
print(validate_tag('script'))   # False
print(is_void_element('br'))    # True
print(is_block_element('p'))    # True
```

### Content Validation

```python
from telegraph import Telegraph
from telegraph.utils import html_to_nodes
from telegraph.exceptions import NotAllowedTag, InvalidHTML

telegraph = Telegraph(access_token='your_token')


def validate_content(html):
    """Validate HTML content for Telegraph compatibility."""
    try:
        # Conversion raises if the HTML is malformed or uses a disallowed tag
        html_to_nodes(html)
        return True, "Content is valid"
    except NotAllowedTag as e:
        return False, f"Contains disallowed tag: {e}"
    except InvalidHTML as e:
        return False, f"Invalid HTML structure: {e}"

# Validate before creating page
html = '<p>Valid content with <strong>formatting</strong>.</p>'
is_valid, message = validate_content(html)
if is_valid:
    telegraph.create_page(title='Validated Content', html_content=html)
else:
    print(f"Invalid content: {message}")
```