Tessl Tile for pypi/html2text@2025.4.0

# Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text. Provides both simple one-shot conversion and advanced configurable conversion with extensive formatting options.

## Capabilities

### Simple HTML Conversion

Convenience function for straightforward HTML to Markdown conversion with minimal configuration.

```python { .api }
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text using default settings.

    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Text wrapping width, None uses config.BODY_WIDTH (default: None)

    Returns:
        Converted Markdown/text string

    Example:
        >>> import html2text
        >>> html = "<p><strong>Bold</strong> and <em>italic</em></p>"
        >>> print(html2text.html2text(html))
        **Bold** and _italic_
    """
```

### Advanced HTML Conversion

Full-featured HTML to text converter with extensive configuration options for fine-grained control over output formatting.

```python { .api }
class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with comprehensive configuration options.

    Inherits from html.parser.HTMLParser to handle HTML parsing and provides
    extensive customization for output formatting, link handling, table processing,
    and text styling.
    """

    def __init__(
        self,
        out: Optional[OutCallback] = None,
        baseurl: str = "",
        bodywidth: int = 78
    ) -> None:
        """
        Initialize HTML2Text converter.

        Args:
            out: Optional custom output callback function for handling text output
            baseurl: Base URL for resolving relative links (default: "")
            bodywidth: Maximum line width for text wrapping (default: 78)
        """

    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text with current configuration.

        This is the main conversion method that processes the HTML through
        the parser and returns the formatted output.

        Args:
            data: HTML string to convert

        Returns:
            Converted Markdown/text string

        Example:
            >>> h = html2text.HTML2Text()
            >>> h.ignore_links = True
            >>> html = "<p>Hello <a href='http://example.com'>world</a>!</p>"
            >>> print(h.handle(html))
            Hello world!
        """

    def feed(self, data: str) -> None:
        """
        Feed HTML data to the parser for processing.

        Args:
            data: HTML string to feed to parser
        """

    def finish(self) -> str:
        """
        Complete parsing and return formatted text output.

        Returns:
            Final formatted text string
        """

    def outtextf(self, s: str) -> None:
        """
        Default output callback function that appends text to internal buffer.

        This is the default implementation of the output callback that collects
        all text output into an internal list for final processing.

        Args:
            s: Text string to append to output buffer
        """

    def close(self) -> None:
        """
        Close the HTML parser and perform final cleanup.

        Inherited from HTMLParser, ensures proper parser cleanup.
        """

    def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]:
        """
        Find index of link with matching attributes in anchor list.

        Used internally for reference-style link processing to avoid
        duplicate link definitions.

        Args:
            attrs: Dictionary of HTML element attributes

        Returns:
            Index of matching anchor element or None if not found
        """
```

## HTML Element Support

html2text converts the following HTML elements:

### Text Formatting
- **Bold**: `<strong>`, `<b>` → `**text**`
- **Italic**: `<em>`, `<i>` → `_text_`
- **Code**: `<code>`, `<tt>`, `<kbd>` → `` `text` ``
- **Strikethrough**: `<del>`, `<strike>`, `<s>` → `~~text~~`
- **Quotes**: `<q>` → `"text"`
- **Superscript/Subscript**: `<sup>`, `<sub>` (configurable)

### Structure Elements
- **Headers**: `<h1>` through `<h6>` → `# Header`
- **Paragraphs**: `<p>` → paragraph breaks
- **Line breaks**: `<br>` → line breaks
- **Horizontal rules**: `<hr>` → `* * *`
- **Blockquotes**: `<blockquote>` → `> text`
- **Preformatted**: `<pre>` → indented code blocks

### Lists
- **Unordered lists**: `<ul>`, `<li>` → `* item`
- **Ordered lists**: `<ol>`, `<li>` → `1. item`
- **Nested lists**: Full support with proper indentation
- **Definition lists**: `<dl>`, `<dt>`, `<dd>`

### Links and Images
- **Links**: `<a>` → `[text](url)` or reference-style
- **Images**: `<img>` → `![alt](src)` or configurable formats
- **Automatic links**: URL detection and conversion

### Tables
- **Tables**: `<table>`, `<tr>`, `<td>`, `<th>` → Markdown tables
- **Table formatting**: Configurable padding and alignment
- **Complex tables**: Colspan handling and formatting options

## Usage Examples

### Basic Text Conversion

```python
import html2text

# Simple paragraph with formatting
html = """
<div>
    <h1>Main Title</h1>
    <p>This is a <strong>bold statement</strong> with some <em>emphasis</em>.</p>
    <p>Here's a <a href="https://example.com">link</a> and some <code>inline code</code>.</p>
</div>
"""

converter = html2text.HTML2Text()
markdown = converter.handle(html)
print(markdown)
```

### List Processing

```python
import html2text

# Unordered list with a nested ordered list
html = """
<ul>
    <li>First item</li>
    <li>Second item with <strong>bold text</strong></li>
    <li>Third item
        <ol>
            <li>Nested ordered item</li>
            <li>Another nested item</li>
        </ol>
    </li>
</ul>
"""

converter = html2text.HTML2Text()
result = converter.handle(html)
print(result)
```

### Table Conversion

```python
import html2text

# Header row followed by two data rows
html = """
<table>
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>City</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>30</td>
        <td>New York</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>25</td>
        <td>London</td>
    </tr>
</table>
"""

converter = html2text.HTML2Text()
converter.pad_tables = True  # Enable table padding
result = converter.handle(html)
print(result)
```

### Custom Output Handling

```python
import html2text

def custom_output(text):
    """Custom output handler that uppercases text."""
    print(text.upper(), end='')

html = "<p>Hello world!</p>"
converter = html2text.HTML2Text(out=custom_output)
converter.handle(html)  # Prints "HELLO WORLD!" via the callback
```
