Tessl Tile for pypi/html2text@2025.4.0

# Utility Functions

Helper functions for text processing, CSS parsing, character escaping, and table formatting. html2text uses these functions internally, and they are also available for advanced use cases that require custom text processing.

## Capabilities

### Text Escaping and Processing

Functions for escaping Markdown-sensitive characters, either inside inline constructs or across whole text sections.

```python { .api }
def escape_md(text: str) -> str:
    """
    Escape markdown-sensitive characters within markdown constructs.

    Escapes characters that have special meaning in Markdown (brackets,
    parentheses, backslashes) to prevent them from being interpreted as
    formatting when they should be literal text.

    Args:
        text: Text string to escape

    Returns:
        Text with markdown characters escaped with backslashes

    Example:
        >>> from html2text.utils import escape_md
        >>> escape_md("Some [text] with (special) chars")
        'Some \\[text\\] with \\(special\\) chars'
    """

def escape_md_section(text: str, snob: bool = False) -> str:
    """
    Escape markdown-sensitive characters across document sections.

    More comprehensive escaping for full document sections, handling
    line-level constructs (such as numbered-list markers) that could
    interfere with formatting.

    Args:
        text: Text string to escape
        snob: If True, escape additional characters for maximum safety

    Returns:
        Text with markdown characters properly escaped

    Example:
        >>> from html2text.utils import escape_md_section
        >>> escape_md_section("1. Item\n2. Another", snob=True)
        '1\\. Item\n2\\. Another'
    """
```
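
To make the escaping behaviour concrete, here is a minimal standalone sketch of backslash-escaping with a regular expression. It is not the library's implementation: `MD_CHARS` and `escape_md_sketch` are hypothetical names, and the character set (backslash, brackets, parentheses) is simply taken from the description of `escape_md` above.

```python
import re

# Hypothetical pattern for illustration: backslash, square brackets, parentheses.
MD_CHARS = re.compile(r"([\\\[\]\(\)])")


def escape_md_sketch(text: str) -> str:
    """Prefix every matched character with a single backslash."""
    # In the replacement template, "\\" is a literal backslash and "\1" is the match.
    return MD_CHARS.sub(r"\\\1", text)


print(escape_md_sketch("Some [text] with (special) chars"))
# Some \[text\] with \(special\) chars
```

Escaping the backslash itself matters: without it, text that already contains a backslash could not be told apart from an escape sequence after conversion.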

### Table Formatting

Functions for formatting and aligning table content in text output.

```python { .api }
def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
    """
    Add padding to tables in text for consistent column alignment.

    Pads the rows found between lines that contain
    html2text.config.TABLE_MARKER_FOR_PAD so that every column has a
    consistent width. The marker lines are dropped, and text outside
    them is returned unchanged.

    Args:
        text: Text containing marker-delimited markdown tables to format
        right_margin: Additional padding spaces for right margin (default: 1)

    Returns:
        Text with properly padded and aligned tables

    Example:
        >>> from html2text import config
        >>> from html2text.utils import pad_tables_in_text
        >>> marker = config.TABLE_MARKER_FOR_PAD
        >>> rows = "| Name | Age |\n| Alice | 30 |\n| Bob | 25 |"
        >>> padded = pad_tables_in_text(f"<{marker}>\n{rows}\n</{marker}>")
        >>> print(padded)  # rows are re-emitted with equal-width columns
    """

def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Reformat table lines with consistent column widths.

    Takes raw table lines and reformats them with proper padding
    to create aligned columns.

    Args:
        lines: List of table row strings
        right_margin: Right margin padding in spaces

    Returns:
        List of reformatted table lines with consistent alignment
    """
```
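
The idea of column padding can be sketched without the library. The helper below is an illustration only, under the assumption that rows use leading and trailing pipes; `pad_rows_sketch` is a hypothetical name and its output format is not guaranteed to match what `reformat_table` produces.

```python
from typing import List


def pad_rows_sketch(lines: List[str], right_margin: int = 1) -> List[str]:
    """Pad every cell so that columns line up (illustrative only)."""
    # Split each row into stripped cells, ignoring the outer pipes.
    rows = [[cell.strip() for cell in line.strip().strip("|").split("|")]
            for line in lines]
    # Give short rows empty cells so every row has the same column count.
    n_cols = max(len(row) for row in rows)
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its widest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(n_cols)]
    padded = []
    for row in rows:
        cells = [" " + cell.ljust(width) + " " * right_margin
                 for cell, width in zip(row, widths)]
        padded.append("|" + "|".join(cells) + "|")
    return padded


for line in pad_rows_sketch(["| Name | Age |", "| Alice | 30 |", "| Bob | 25 |"]):
    print(line)
# | Name  | Age |
# | Alice | 30  |
# | Bob   | 25  |
```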

### CSS and Style Processing

Functions for parsing CSS styles and processing element styling, particularly useful for Google Docs HTML.

```python { .api }
def dumb_property_dict(style: str) -> Dict[str, str]:
    """
    Parse CSS style string into property dictionary.

    Takes a CSS style string (like from a style attribute) and converts
    it into a dictionary of property-value pairs.

    Args:
        style: CSS style string with semicolon-separated property declarations

    Returns:
        Dictionary mapping CSS property names to values (both lowercased)

    Example:
        >>> from html2text.utils import dumb_property_dict
        >>> style = "color: red; font-size: 14px; font-weight: bold"
        >>> props = dumb_property_dict(style)
        >>> print(props)
        {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}
    """

def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
    """
    Parse CSS style definitions into a structured format.

    Simple CSS parser that extracts style rules and properties for
    processing HTML with inline styles or embedded CSS.

    Args:
        data: CSS string to parse

    Returns:
        Dictionary mapping selectors to property dictionaries

    Example:
        >>> css = "p { color: red; font-size: 14px; }"
        >>> parsed = dumb_css_parser(css)
        >>> print(parsed)
        {'p': {'color': 'red', 'font-size': '14px'}}
    """

def element_style(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str]
) -> Dict[str, str]:
    """
    Compute final style attributes for an HTML element.

    Starts from the parent styles, then applies CSS class styles, then
    inline styles, so later sources override earlier ones.

    Args:
        attrs: HTML element attributes dictionary
        style_def: CSS style definitions from stylesheet
        parent_style: Inherited styles from parent elements

    Returns:
        Dictionary of final computed styles for the element
    """

def google_text_emphasis(style: Dict[str, str]) -> List[str]:
    """
    Extract text emphasis styles from Google Docs CSS.

    Collects the values of the text-decoration, font-style, and
    font-weight properties to determine what text emphasis
    (bold, italic, underline, etc.) should be applied.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        List of emphasis values found in the styles
        (for example ['underline', 'italic', 'bold'])
    """

def google_fixed_width_font(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles specify a fixed-width (monospace) font.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if font-family is a recognized monospace font
        (such as 'courier new' or 'consolas')
    """

def google_has_height(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles have explicit height defined.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if height property is explicitly set
    """

def google_list_style(style: Dict[str, str]) -> str:
    """
    Determine list type from Google Docs CSS styles.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        'ul' when list-style-type is a bullet style (such as 'disc',
        'circle', 'square', or 'none'), otherwise 'ol'
    """
```
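
The cascade that `element_style` describes (parent styles, then class styles, then inline styles) can be sketched in plain Python. This is a hedged illustration, not the library's code: `parse_declarations` and `merge_styles_sketch` are hypothetical names, and the sketch assumes class rules are stored under keys of the form `.classname`.

```python
from typing import Dict, Optional


def parse_declarations(style: str) -> Dict[str, str]:
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}, lowercased."""
    pairs = (decl.split(":", 1) for decl in style.split(";") if ":" in decl)
    return {name.strip().lower(): value.strip().lower() for name, value in pairs}


def merge_styles_sketch(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str],
) -> Dict[str, str]:
    """Apply parent, class, then inline styles; later sources win."""
    style = dict(parent_style)  # copy so the parent's dict is not mutated
    for css_class in (attrs.get("class") or "").split():
        # Assumption: class rules are keyed as ".classname".
        style.update(style_def.get("." + css_class, {}))
    style.update(parse_declarations(attrs.get("style") or ""))
    return style


rules = {".bold": {"font-weight": "bold", "color": "black"}}
print(merge_styles_sketch({"class": "bold", "style": "color: red"}, rules, {"margin": "5px"}))
# {'margin': '5px', 'font-weight': 'bold', 'color': 'red'}
```

Copying the parent dictionary first keeps sibling elements from seeing each other's overrides.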

### HTML Processing Utilities

Helper functions for processing HTML elements and attributes.

```python { .api }
def hn(tag: str) -> int:
    """
    Extract header level from HTML header tag name.

    Args:
        tag: HTML tag name (e.g., 'h1', 'h2', 'div')

    Returns:
        Header level (1-9) for tags named 'h1' through 'h9',
        0 for non-header tags

    Example:
        >>> hn('h1')
        1
        >>> hn('h3')
        3
        >>> hn('div')
        0
    """

def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
    """
    Extract starting number from ordered list attributes.

    Args:
        attrs: HTML element attributes dictionary

    Returns:
        Starting number for ordered list (adjusted for 0-based indexing);
        0 if there is no usable 'start' attribute

    Example:
        >>> attrs = {'start': '5'}
        >>> list_numbering_start(attrs)
        4  # Returns start - 1 for internal counting
    """

def skipwrap(
    para: str,
    wrap_links: bool,
    wrap_list_items: bool,
    wrap_tables: bool
) -> bool:
    """
    Determine if a paragraph should skip text wrapping.

    Analyzes paragraph content to decide whether it should be wrapped
    based on content type and wrapping configuration.

    Args:
        para: Paragraph text to analyze
        wrap_links: Whether to allow wrapping of links
        wrap_list_items: Whether to allow wrapping of list items
        wrap_tables: Whether to allow wrapping of tables

    Returns:
        True if paragraph should skip wrapping, False otherwise
    """
```
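
As a rough illustration of the two attribute helpers, the sketch below reimplements their documented behaviour with the standard library only. The names `header_level_sketch` and `list_start_sketch` are hypothetical, and the handling of a missing or malformed `start` value (falling back to 0) is an assumption of this sketch.

```python
from typing import Dict, Optional


def header_level_sketch(tag: str) -> int:
    """Return the digit of a two-character 'hN' tag name, else 0."""
    if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit() and tag[1] != "0":
        return int(tag[1])
    return 0


def list_start_sketch(attrs: Dict[str, Optional[str]]) -> int:
    """Return start - 1 so a counter can be incremented before each item."""
    try:
        return int(attrs.get("start") or 1) - 1
    except ValueError:
        # Assumption: a non-numeric start attribute falls back to the default.
        return 0


print(header_level_sketch("h3"))        # 3
print(header_level_sketch("div"))       # 0
print(list_start_sketch({"start": "5"}))  # 4
```

Returning `start - 1` lets a caller increment the counter before emitting each item, so the first item printed carries the requested number.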

### Character and Entity Processing

Constants for handling HTML entities and character replacements.

```python { .api }
# Character mapping constants
unifiable_n: Dict[int, str]
"""Mapping of Unicode code points to ASCII replacements."""

control_character_replacements: Dict[int, int]
"""Mapping of control characters to their Unicode replacements."""
```
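
Dictionaries keyed by code point, like the two constants above, plug directly into Python's `str.translate`. The sketch below uses small hand-written sample tables rather than the library's real data; `SAMPLE_UNIFIABLE`, `SAMPLE_CONTROL`, and `asciify_sketch` are hypothetical names, and how html2text itself applies these mappings is not shown here.

```python
from typing import Dict

# Hypothetical sample tables for illustration, not the library's actual contents.
SAMPLE_UNIFIABLE: Dict[int, str] = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2026: "...",  # horizontal ellipsis
}
SAMPLE_CONTROL: Dict[int, int] = {
    0x80: 0x20AC,   # C1 control 0x80 is treated as EURO SIGN in HTML parsing
}


def asciify_sketch(text: str) -> str:
    """Replace mapped code points; unmapped characters pass through."""
    return text.translate(SAMPLE_UNIFIABLE)


print(asciify_sketch("\u201cHi\u201d\u2026 it\u2019s"))  # "Hi"... it's
print("\x80".translate(SAMPLE_CONTROL))                  # €
```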

## Usage Examples

### Text Escaping

```python
from html2text.utils import escape_md, escape_md_section

# Basic markdown escaping
text = "Some [bracketed] text with (parentheses)"
escaped = escape_md(text)
print(escaped)  # Some \[bracketed\] text with \(parentheses\)

# Section-level escaping with additional safety
content = """
1. First item
2. Second item
*Some emphasized text*
`Code with backticks`
"""

safe_content = escape_md_section(content, snob=True)
print(safe_content)
```

### Table Processing

```python
from html2text import config
from html2text.utils import pad_tables_in_text

# pad_tables_in_text only pads rows that sit between two lines containing
# config.TABLE_MARKER_FOR_PAD, so the table is wrapped in marker lines here.
marker = config.TABLE_MARKER_FOR_PAD

# Raw table text with inconsistent spacing
table_text = f"""<{marker}>
| Name | Age | City |
| Alice | 30 | New York |
| Bob | 25 | London |
| Charlie | 35 | Paris |
</{marker}>
"""

# Add padding for consistent alignment
padded_table = pad_tables_in_text(table_text)
print(padded_table)
# Output will have consistent column widths; the marker lines are removed
```

### CSS Processing

```python
from html2text.utils import dumb_css_parser, dumb_property_dict, element_style

# Parse inline CSS styles
inline_style = "color: red; font-size: 14px; font-weight: bold"
props = dumb_property_dict(inline_style)
print(props)
# Output: {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}

# Parse CSS styles
css_content = """
.bold { font-weight: bold; color: black; }
.italic { font-style: italic; }
p { margin: 10px; font-size: 14px; }
"""

styles = dumb_css_parser(css_content)
print(styles)

# Compute element styles
element_attrs = {
    'class': 'bold italic',
    'style': 'color: red; font-size: 16px;'
}

parent_styles = {'margin': '5px'}
final_styles = element_style(element_attrs, styles, parent_styles)
print(final_styles)
# Parent styles are applied first, then class styles, then inline styles,
# so the inline 'color: red' overrides the 'color: black' from .bold
```

### HTML Tag Processing

```python
from html2text.utils import hn, list_numbering_start

# Extract header levels
print(hn('h1'))    # 1
print(hn('h3'))    # 3
print(hn('div'))   # 0

# Process list attributes
ol_attrs = {'start': '5', 'type': '1'}
start_num = list_numbering_start(ol_attrs)
print(start_num)   # 4 (adjusted for 0-based counting)
```

### Wrapping Analysis

```python
from html2text.utils import skipwrap

# Test different paragraph types
paragraphs = [
    "Regular paragraph text that can be wrapped normally.",
    "    This is a code block with leading spaces",
    "* This is a list item that might not wrap",
    "Here's a paragraph with [a link](http://example.com) in it.",
    "| Name | Age | - this looks like a table"
]

for para in paragraphs:
    should_skip = skipwrap(para, wrap_links=True, wrap_list_items=False, wrap_tables=False)
    print(f"Skip wrapping: {should_skip} - {para[:30]}...")
```

### Google Docs Style Processing

```python
from html2text.utils import (
    google_text_emphasis,
    google_fixed_width_font,
    google_list_style
)

# Analyze Google Docs styles
gdoc_style = {
    'font-weight': 'bold',
    'font-style': 'italic',
    'text-decoration': 'underline',
    'font-family': 'courier new'
}

emphasis = google_text_emphasis(gdoc_style)
print(f"Emphasis styles: {emphasis}")

is_monospace = google_fixed_width_font(gdoc_style)
print(f"Monospace font: {is_monospace}")

list_style = {
    'list-style-type': 'disc'
}
list_type = google_list_style(list_style)
print(f"List type: {list_type}")
```
