# Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text. Provides both simple one-shot conversion and advanced configurable conversion with extensive formatting options.

## Capabilities

### Simple HTML Conversion

Convenience function for straightforward HTML to Markdown conversion with minimal configuration.

```python { .api }
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text using default settings.

    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Text wrapping width, None uses config.BODY_WIDTH (default: None)

    Returns:
        Converted Markdown/text string

    Example:
        >>> import html2text
        >>> html = "<p><strong>Bold</strong> and <em>italic</em></p>"
        >>> print(html2text.html2text(html))
        **Bold** and _italic_
    """
```
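To make the `baseurl` argument more concrete, the sketch below shows how relative links resolve against a base URL with the standard library's `urllib.parse.urljoin`. It assumes html2text resolves links in a comparable way. The `resolve` helper is hypothetical and is not part of the html2text API.

```python
from urllib.parse import urljoin


def resolve(baseurl: str, href: str) -> str:
    """Hypothetical helper: resolve href against baseurl."""
    return urljoin(baseurl, href)


# A relative path is resolved against the directory of the base URL.
print(resolve("https://example.com/docs/", "guide.html"))
# A root-relative path replaces the whole path component.
print(resolve("https://example.com/docs/", "/about"))
# With the default empty base URL, the link is returned unchanged.
print(resolve("", "guide.html"))
```

With an empty `baseurl`, relative links stay relative, so pass a base URL when the output needs links that work outside the original page.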

### Advanced HTML Conversion

Full-featured HTML to text converter with extensive configuration options for fine-grained control over output formatting.

```python { .api }
class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with comprehensive configuration options.

    Inherits from html.parser.HTMLParser to handle HTML parsing and provides
    extensive customization for output formatting, link handling, table processing,
    and text styling.
    """

    def __init__(
        self,
        out: Optional[OutCallback] = None,
        baseurl: str = "",
        bodywidth: int = 78
    ) -> None:
        """
        Initialize HTML2Text converter.

        Args:
            out: Optional custom output callback function for handling text output
            baseurl: Base URL for resolving relative links (default: "")
            bodywidth: Maximum line width for text wrapping (default: 78)
        """

    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text with current configuration.

        This is the main conversion method that processes the HTML through
        the parser and returns the formatted output.

        Args:
            data: HTML string to convert

        Returns:
            Converted Markdown/text string

        Example:
            >>> h = html2text.HTML2Text()
            >>> h.ignore_links = True
            >>> html = "<p>Hello <a href='http://example.com'>world</a>!</p>"
            >>> print(h.handle(html))
            Hello world!
        """

    def feed(self, data: str) -> None:
        """
        Feed HTML data to the parser for processing.

        Args:
            data: HTML string to feed to parser
        """

    def finish(self) -> str:
        """
        Complete parsing and return formatted text output.

        Returns:
            Final formatted text string
        """

    def outtextf(self, s: str) -> None:
        """
        Default output callback function that appends text to internal buffer.

        This is the default implementation of the output callback that collects
        all text output into an internal list for final processing.

        Args:
            s: Text string to append to output buffer
        """

    def close(self) -> None:
        """
        Close the HTML parser and perform final cleanup.

        Inherited from HTMLParser, ensures proper parser cleanup.
        """

    def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]:
        """
        Find index of link with matching attributes in anchor list.

        Used internally for reference-style link processing to avoid
        duplicate link definitions.

        Args:
            attrs: Dictionary of HTML element attributes

        Returns:
            Index of matching anchor element or None if not found
        """
```
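The general shape of a converter built on `html.parser.HTMLParser` with an output callback can be sketched with the standard library alone. This is a minimal illustration, not html2text's implementation: `TinyConverter` and `MARKERS` are hypothetical names, and only bold and italic tags are handled.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional

# Hypothetical tag-to-marker table covering only two inline styles.
MARKERS = {"strong": "**", "b": "**", "em": "_", "i": "_"}


class TinyConverter(HTMLParser):
    """Illustrative converter that emits text through an `out` callback."""

    def __init__(self, out: Optional[Callable[[str], None]] = None) -> None:
        super().__init__()
        self.pieces: List[str] = []
        # Without a custom callback, collect output in an internal list.
        self.out = out or self.pieces.append

    def handle_starttag(self, tag, attrs):
        self.out(MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.out(MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.out(data)

    def handle(self, data: str) -> str:
        """Feed the HTML, flush the parser, and return the collected text."""
        self.feed(data)
        self.close()
        return "".join(self.pieces)


converter = TinyConverter()
print(converter.handle("<p><strong>Bold</strong> and <em>italic</em></p>"))
# → **Bold** and _italic_
```

When a custom `out` callback is supplied in this sketch, the internal list stays empty and `handle` returns an empty string, because all text has been routed to the callback instead.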

## HTML Element Support

html2text converts the following HTML elements:

### Text Formatting
- **Bold**: `<strong>`, `<b>` → `**text**`
- **Italic**: `<em>`, `<i>` → `_text_`
- **Code**: `<code>`, `<tt>`, `<kbd>` → `` `text` ``
- **Strikethrough**: `<del>`, `<strike>`, `<s>` → `~~text~~`
- **Quotes**: `<q>` → `"text"`
- **Superscript/Subscript**: `<sup>`, `<sub>` (configurable)

### Structure Elements
- **Headers**: `<h1>` through `<h6>` → `# Header`
- **Paragraphs**: `<p>` → paragraph breaks
- **Line breaks**: `<br>` → line breaks
- **Horizontal rules**: `<hr>` → `* * *`
- **Blockquotes**: `<blockquote>` → `> text`
- **Preformatted**: `<pre>` → indented code blocks

### Lists
- **Unordered lists**: `<ul>`, `<li>` → `* item`
- **Ordered lists**: `<ol>`, `<li>` → `1. item`
- **Nested lists**: Full support with proper indentation
- **Definition lists**: `<dl>`, `<dt>`, `<dd>`
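Nested list handling comes down to tracking which lists are currently open. The sketch below illustrates that idea with the standard library only. `ListSketch` and `render_lists` are hypothetical names, the two-space indent per nesting level is an assumption, and html2text's actual whitespace may differ.

```python
from html.parser import HTMLParser
from typing import List


class ListSketch(HTMLParser):
    """Illustrative renderer for nested <ul>/<ol> lists (plain-text items only)."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[dict] = []  # one entry per currently open list
        self.lines: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.stack.append({"tag": tag, "count": 0})
        elif tag == "li" and self.stack:
            current = self.stack[-1]
            current["count"] += 1
            marker = "*" if current["tag"] == "ul" else f"{current['count']}."
            # Assumed convention: two spaces of indent per nesting level.
            indent = "  " * (len(self.stack) - 1)
            self.lines.append(f"{indent}{marker} ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.lines:
            self.lines[-1] += text


def render_lists(html: str) -> str:
    parser = ListSketch()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(render_lists(
    "<ul><li>First</li><li>Second"
    "<ol><li>Nested one</li><li>Nested two</li></ol></li></ul>"
))
```

Each ordered list keeps its own counter on the stack, so numbering restarts inside every nested `<ol>`.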

### Links and Images
- **Links**: `<a>` → `[text](url)` or reference-style
- **Images**: `<img>` → `![alt](src)` or configurable formats
- **Automatic links**: a link whose text equals its absolute URL → `<url>`
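Reference-style output moves URLs into numbered definitions, and a URL that appears more than once should share a single definition. The sketch below shows that deduplication idea in isolation. `reference_links` is a hypothetical helper, and the exact layout html2text emits for reference definitions may differ.

```python
from typing import Dict, List, Tuple


def reference_links(links: List[Tuple[str, str]]) -> str:
    """Render (text, url) pairs as reference-style links with shared definitions."""
    index_of: Dict[str, int] = {}  # url -> reference number
    body: List[str] = []
    for text, url in links:
        # Reuse the existing number when this URL has been seen before.
        number = index_of.setdefault(url, len(index_of) + 1)
        body.append(f"[{text}][{number}]")
    definitions = [f"[{number}]: {url}" for url, number in index_of.items()]
    return " ".join(body) + "\n\n" + "\n".join(definitions)


print(reference_links([
    ("docs", "https://example.com/docs"),
    ("home", "https://example.com"),
    ("manual", "https://example.com/docs"),
]))
```

Here `docs` and `manual` both point at reference 1, so only two definitions are written for three links.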

### Tables
- **Tables**: `<table>`, `<tr>`, `<td>`, `<th>` → Markdown tables
- **Table formatting**: Configurable padding and alignment
- **Complex tables**: Colspan handling and formatting options
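Table padding means widening every cell to its column's maximum width so the columns line up in plain text. The sketch below shows that calculation on already-extracted rows. `render_padded_table` is a hypothetical helper, it assumes every row has the same number of cells, and the pipe layout is illustrative rather than html2text's exact output.

```python
from typing import List


def render_padded_table(rows: List[List[str]]) -> str:
    """Render rows (first row is the header) as a padded Markdown table."""
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

    def fmt(row: List[str]) -> str:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        return "| " + " | ".join(cells) + " |"

    separator = "| " + " | ".join("-" * width for width in widths) + " |"
    return "\n".join([fmt(rows[0]), separator] + [fmt(row) for row in rows[1:]])


print(render_padded_table([
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "25", "London"],
]))
```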

## Usage Examples

### Basic Text Conversion

```python
import html2text

# Simple paragraph with formatting
html = """
<div>
    <h1>Main Title</h1>
    <p>This is a <strong>bold statement</strong> with some <em>emphasis</em>.</p>
    <p>Here's a <a href="https://example.com">link</a> and some <code>inline code</code>.</p>
</div>
"""

converter = html2text.HTML2Text()
markdown = converter.handle(html)
print(markdown)
```

### List Processing

```python
import html2text

# Unordered list with a nested ordered list
html = """
<ul>
    <li>First item</li>
    <li>Second item with <strong>bold text</strong></li>
    <li>Third item
        <ol>
            <li>Nested ordered item</li>
            <li>Another nested item</li>
        </ol>
    </li>
</ul>
"""

converter = html2text.HTML2Text()
result = converter.handle(html)
print(result)
```

### Table Conversion

```python
import html2text

html = """
<table>
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>City</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>30</td>
        <td>New York</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>25</td>
        <td>London</td>
    </tr>
</table>
"""

converter = html2text.HTML2Text()
converter.pad_tables = True  # Enable table padding
result = converter.handle(html)
print(result)
```

### Custom Output Handling

```python
import html2text


def custom_output(text):
    """Custom output handler that uppercases text."""
    print(text.upper(), end='')


html = "<p>Hello world!</p>"
converter = html2text.HTML2Text(out=custom_output)
converter.handle(html)  # Prints "HELLO WORLD!"
```