# Utility Functions

Helper functions for text processing, CSS parsing, character escaping, and table formatting. html2text uses these functions internally; they are also available for advanced use cases that need custom text processing.

## Capabilities

### Text Escaping and Processing

Functions for escaping Markdown-sensitive characters, either inside individual Markdown constructs or across whole document sections.

```python { .api }
def escape_md(text: str) -> str:
    """
    Escape markdown-sensitive characters within markdown constructs.

    Escapes characters that have special meaning in Markdown (like brackets,
    parentheses, backslashes) to prevent them from being interpreted as
    formatting when they should be literal text.

    Args:
        text: Text string to escape

    Returns:
        Text with markdown characters escaped with backslashes

    Example:
        >>> from html2text.utils import escape_md
        >>> escape_md("Some [text] with (special) chars")
        'Some \\[text\\] with \\(special\\) chars'
    """

def escape_md_section(text: str, snob: bool = False) -> str:
    """
    Escape markdown-sensitive characters across document sections.

    More comprehensive escaping for full document sections, handling
    various markdown constructs that could interfere with formatting.

    Args:
        text: Text string to escape
        snob: If True, escape additional characters for maximum safety

    Returns:
        Text with markdown characters properly escaped

    Example:
        >>> from html2text.utils import escape_md_section
        >>> escape_md_section("1. Item\\n2. Another", snob=True)
        '1\\. Item\\n2\\. Another'
    """
```
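The escaping described above can be pictured as a regular-expression substitution that puts a backslash in front of each sensitive character. The sketch below is a stand-alone illustration and not html2text's own code: the names `SENSITIVE` and `escape_brackets` are hypothetical, and the character set (backslash, square brackets, parentheses) is assumed from the description of `escape_md`.

```python
import re

# Assumed character set for this sketch: backslash, square brackets, parentheses.
SENSITIVE = re.compile(r"([\\\[\]\(\)])")


def escape_brackets(text: str) -> str:
    """Prefix every sensitive character with a single backslash."""
    # In the replacement template, "\\" emits one backslash and "\1" re-emits the match.
    return SENSITIVE.sub(r"\\\1", text)


print(escape_brackets("Some [text] with (special) chars"))
# Some \[text\] with \(special\) chars
```

Text without any of these characters passes through unchanged, so applying such a function to plain prose is harmless.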

### Table Formatting

Functions for formatting and aligning table content in text output.

```python { .api }
from typing import List

def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
    """
    Add padding to tables in text for consistent column alignment.

    Processes text containing markdown tables and adds appropriate padding
    to ensure all columns have consistent width for improved readability.

    Args:
        text: Text containing markdown tables to format
        right_margin: Additional padding spaces for right margin (default: 1)

    Returns:
        Text with properly padded and aligned tables

    Example:
        >>> from html2text.utils import pad_tables_in_text
        >>> table_text = "| Name | Age |\\n| Alice | 30 |\\n| Bob | 25 |"
        >>> padded = pad_tables_in_text(table_text)
        >>> print(padded)
        | Name | Age |
        | Alice | 30 |
        | Bob | 25 |
    """

def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Reformat table lines with consistent column widths.

    Takes raw table lines and reformats them with proper padding
    to create aligned columns.

    Args:
        lines: List of table row strings
        right_margin: Right margin padding in spaces

    Returns:
        List of reformatted table lines with consistent alignment
    """
```
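As a rough picture of what "consistent column widths" means, the sketch below pads every cell to the width of the longest cell in its column plus a right margin. It is an independent illustration, not the library's algorithm; `pad_rows` is a hypothetical name, and the sketch assumes each row is a pipe-delimited string.

```python
from typing import List


def pad_rows(rows: List[str], right_margin: int = 1) -> List[str]:
    """Pad every cell so that each column has the same width in all rows."""
    if not rows:
        return []
    # Split each row on "|" and trim whitespace around the cells.
    table = [[cell.strip() for cell in row.strip().strip("|").split("|")] for row in rows]
    n_cols = max(len(cells) for cells in table)
    # Fill short rows with empty cells so every row has n_cols entries.
    table = [cells + [""] * (n_cols - len(cells)) for cells in table]
    # Column width = longest cell in that column plus the right margin.
    widths = [max(len(cells[i]) for cells in table) + right_margin for i in range(n_cols)]
    return [
        "| " + "| ".join(cell.ljust(width) for cell, width in zip(cells, widths)) + "|"
        for cells in table
    ]


for line in pad_rows(["| Name | Age |", "| Alice | 30 |", "| Bob | 25 |"]):
    print(line)
# | Name  | Age |
# | Alice | 30  |
# | Bob   | 25  |
```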

### CSS and Style Processing

Functions for parsing CSS styles and processing element styling, particularly useful for Google Docs HTML.

```python { .api }
from typing import Dict, List, Optional

def dumb_property_dict(style: str) -> Dict[str, str]:
    """
    Parse CSS style string into property dictionary.

    Takes a CSS style string (like from a style attribute) and converts
    it into a dictionary of property-value pairs.

    Args:
        style: CSS style string with semicolon-separated property declarations

    Returns:
        Dictionary mapping CSS property names to values (both lowercased)

    Example:
        >>> from html2text.utils import dumb_property_dict
        >>> style = "color: red; font-size: 14px; font-weight: bold"
        >>> props = dumb_property_dict(style)
        >>> print(props)
        {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}
    """

def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
    """
    Parse CSS style definitions into a structured format.

    Simple CSS parser that extracts style rules and properties for
    processing HTML with inline styles or embedded CSS.

    Args:
        data: CSS string to parse

    Returns:
        Dictionary mapping selectors to property dictionaries

    Example:
        >>> from html2text.utils import dumb_css_parser
        >>> css = "p { color: red; font-size: 14px; }"
        >>> parsed = dumb_css_parser(css)
        >>> print(parsed)
        {'p': {'color': 'red', 'font-size': '14px'}}
    """

def element_style(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str]
) -> Dict[str, str]:
    """
    Compute final style attributes for an HTML element.

    Starts from a copy of the parent styles, then applies the styles of
    each CSS class on the element, then the inline style attribute, so
    later sources override earlier ones.

    Args:
        attrs: HTML element attributes dictionary
        style_def: CSS style definitions from stylesheet
        parent_style: Inherited styles from parent elements

    Returns:
        Dictionary of final computed styles for the element
    """

def google_text_emphasis(style: Dict[str, str]) -> List[str]:
    """
    Extract text emphasis styles from Google Docs CSS.

    Collects the values of the text-decoration, font-style, and
    font-weight properties (for example 'underline', 'italic', 'bold')
    when they are present.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        List of emphasis values found in the styles
    """

def google_fixed_width_font(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles specify a fixed-width (monospace) font.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if font-family is 'courier new' or 'consolas'
    """

def google_has_height(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles have explicit height defined.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if height property is explicitly set
    """

def google_list_style(style: Dict[str, str]) -> str:
    """
    Determine list type from Google Docs CSS styles.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        'ul' if list-style-type is 'disc', 'circle', 'square', or 'none';
        'ol' otherwise
    """
```
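The way parent, class, and inline styles combine can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not the library's code: `parse_declarations` and `cascade` are hypothetical names, and the sketch assumes class rules are stored under selector keys of the form `.classname`.

```python
from typing import Dict, Optional


def parse_declarations(style: str) -> Dict[str, str]:
    """Turn 'color: red; font-size: 14px' into a lowercased dictionary."""
    result: Dict[str, str] = {}
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed fragments
        name, value = declaration.split(":", 1)
        result[name.strip().lower()] = value.strip().lower()
    return result


def cascade(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str],
) -> Dict[str, str]:
    """Merge styles so that inline beats class, and class beats parent."""
    style = dict(parent_style)  # copy, so the parent's dictionary is not mutated
    for css_class in (attrs.get("class") or "").split():
        # Assumption of this sketch: class rules live under ".classname" keys.
        style.update(style_def.get("." + css_class, {}))
    style.update(parse_declarations(attrs.get("style") or ""))
    return style


rules = {".bold": {"font-weight": "bold", "color": "black"}}
print(cascade({"class": "bold", "style": "color: red"}, rules, {"margin": "5px"}))
# {'margin': '5px', 'font-weight': 'bold', 'color': 'red'}
```

Because each `update` overwrites earlier keys, the order of the calls is what encodes the precedence.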

### HTML Processing Utilities

Helper functions for processing HTML elements and attributes.

```python { .api }
from typing import Dict, Optional

def hn(tag: str) -> int:
    """
    Extract header level from HTML header tag name.

    Args:
        tag: HTML tag name (e.g., 'h1', 'h2', 'div')

    Returns:
        Header level (1-9) for tags of the form 'h<digit>', 0 for any
        other tag

    Example:
        >>> from html2text.utils import hn
        >>> hn('h1')
        1
        >>> hn('h3')
        3
        >>> hn('div')
        0
    """

def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
    """
    Extract starting number from ordered list attributes.

    Args:
        attrs: HTML element attributes dictionary

    Returns:
        Starting number for ordered list (adjusted for 0-based indexing)

    Example:
        >>> from html2text.utils import list_numbering_start
        >>> attrs = {'start': '5'}
        >>> list_numbering_start(attrs)  # start - 1, for internal counting
        4
    """

def skipwrap(
    para: str,
    wrap_links: bool,
    wrap_list_items: bool,
    wrap_tables: bool
) -> bool:
    """
    Determine if a paragraph should skip text wrapping.

    Analyzes paragraph content to decide whether it should be wrapped
    based on content type and wrapping configuration.

    Args:
        para: Paragraph text to analyze
        wrap_links: Whether to allow wrapping of links
        wrap_list_items: Whether to allow wrapping of list items
        wrap_tables: Whether to allow wrapping of tables

    Returns:
        True if paragraph should skip wrapping, False otherwise
    """
```
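The two attribute helpers above are small enough to sketch directly. The version below is illustrative only: `header_level` and `zero_based_start` are hypothetical names, and accepting any single non-zero digit after `h` is an assumption of this sketch.

```python
from typing import Dict, Optional


def header_level(tag: str) -> int:
    """Return the digit of an 'h<digit>' tag name, or 0 for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456789":
        return int(tag[1])
    return 0


def zero_based_start(attrs: Dict[str, Optional[str]]) -> int:
    """Return the 'start' attribute minus one, or 0 if it is missing or invalid."""
    try:
        return int(attrs.get("start") or "1") - 1
    except ValueError:
        return 0


print(header_level("h2"), header_level("div"))  # 2 0
print(zero_based_start({"start": "5"}))         # 4
```

Returning `start - 1` lets a caller increment the counter before emitting each list item, so the first item is numbered with the requested start value.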

### Character and Entity Processing

Functions for handling HTML entities and character replacements.

```python { .api }
# Character mapping constants
unifiable_n: Dict[int, str]
"""Mapping of Unicode code points to ASCII replacements."""

control_character_replacements: Dict[int, int]
"""Mapping of control characters to their Unicode replacements."""
```
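A dictionary keyed by code point, like the ones above, can be applied to a string with Python's built-in `str.translate`. The sketch below uses a small hypothetical table, `ASCII_FALLBACKS`, rather than the library's actual data, and `to_ascii` is likewise a made-up helper name.

```python
from typing import Dict

# Hypothetical sample table: code point -> ASCII replacement string.
ASCII_FALLBACKS: Dict[int, str] = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2014: "--",  # em dash
}


def to_ascii(text: str) -> str:
    """Replace each mapped character; unmapped characters are left untouched."""
    return text.translate(ASCII_FALLBACKS)


print(to_ascii("\u201cquoted\u201d \u2014 it\u2019s"))
# "quoted" -- it's
```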

## Usage Examples

### Text Escaping

```python
from html2text.utils import escape_md, escape_md_section

# Basic markdown escaping
text = "Some [bracketed] text with (parentheses)"
escaped = escape_md(text)
print(escaped)  # Some \[bracketed\] text with \(parentheses\)

# Section-level escaping with additional safety
content = """
1. First item
2. Second item
*Some emphasized text*
`Code with backticks`
"""

safe_content = escape_md_section(content, snob=True)
print(safe_content)
```

### Table Processing

```python
from html2text.utils import pad_tables_in_text

# Raw table text with inconsistent spacing
table_text = """
| Name | Age | City |
| Alice | 30 | New York |
| Bob | 25 | London |
| Charlie | 35 | Paris |
"""

# Add padding for consistent alignment
padded_table = pad_tables_in_text(table_text)
print(padded_table)
# Output will have consistent column widths
```

### CSS Processing

```python
from html2text.utils import dumb_css_parser, dumb_property_dict, element_style

# Parse inline CSS styles
inline_style = "color: red; font-size: 14px; font-weight: bold"
props = dumb_property_dict(inline_style)
print(props)
# Output: {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}

# Parse CSS styles
css_content = """
.bold { font-weight: bold; color: black; }
.italic { font-style: italic; }
p { margin: 10px; font-size: 14px; }
"""

styles = dumb_css_parser(css_content)
print(styles)

# Compute element styles
element_attrs = {
    'class': 'bold italic',
    'style': 'color: red; font-size: 16px;'
}

parent_styles = {'margin': '5px'}
final_styles = element_style(element_attrs, styles, parent_styles)
print(final_styles)
# Parent styles are applied first, then class styles, then inline styles
```

### HTML Tag Processing

```python
from html2text.utils import hn, list_numbering_start

# Extract header levels
print(hn('h1'))   # 1
print(hn('h3'))   # 3
print(hn('div'))  # 0

# Process list attributes
ol_attrs = {'start': '5', 'type': '1'}
start_num = list_numbering_start(ol_attrs)
print(start_num)  # 4 (adjusted for 0-based counting)
```

### Wrapping Analysis

```python
from html2text.utils import skipwrap

# Test different paragraph types
paragraphs = [
    "Regular paragraph text that can be wrapped normally.",
    "    This is a code block with leading spaces",
    "* This is a list item that might not wrap",
    "Here's a paragraph with [a link](http://example.com) in it.",
    "| Name | Age | - this looks like a table"
]

for para in paragraphs:
    should_skip = skipwrap(para, wrap_links=True, wrap_list_items=False, wrap_tables=False)
    print(f"Skip wrapping: {should_skip} - {para[:30]}...")
```

### Google Docs Style Processing

```python
from html2text.utils import (
    google_text_emphasis,
    google_fixed_width_font,
    google_list_style
)

# Analyze Google Docs styles
gdoc_style = {
    'font-weight': 'bold',
    'font-style': 'italic',
    'text-decoration': 'underline',
    'font-family': 'courier new'
}

emphasis = google_text_emphasis(gdoc_style)
print(f"Emphasis styles: {emphasis}")

is_monospace = google_fixed_width_font(gdoc_style)
print(f"Monospace font: {is_monospace}")

list_style = {
    'list-style-type': 'disc'
}
list_type = google_list_style(list_style)
print(f"List type: {list_type}")
```