Turn HTML into equivalent Markdown-structured text.
npx @tessl/cli install tessl/pypi-html2text@2025.4.00
# html2text

A Python library that converts HTML into clean, readable plain ASCII text that is also valid Markdown. It provides both a programmatic API and a command-line interface, with configuration options for links, code blocks, tables, and formatting elements, while preserving the semantic structure of the source.

## Package Information

- **Package Name**: html2text
- **Language**: Python
- **Installation**: `pip install html2text`
- **Python Requirements**: >=3.9

## Core Imports

```python
import html2text
```

For basic usage:

```python
from html2text import html2text
```

For advanced usage with configuration:

```python
from html2text import HTML2Text
```

## Basic Usage

```python
import html2text

# Simple conversion using convenience function
html = "<p><strong>Bold text</strong> and <em>italic text</em></p>"
markdown = html2text.html2text(html)
print(markdown)
# Output: **Bold text** and _italic text_

# Advanced usage with configuration
h = html2text.HTML2Text()
h.ignore_links = True
h.body_width = 0  # No line wrapping
markdown = h.handle("<p>Hello <a href='http://example.com'>world</a>!</p>")
print(markdown)
# Output: Hello world!
```

## Architecture

html2text uses an HTML parser-based architecture:

- **HTML2Text Class**: Main converter inheriting from `html.parser.HTMLParser` with extensive configuration options
- **Configuration System**: Module-level defaults with per-instance overrides for all formatting options
- **Utility Functions**: Helper functions for CSS parsing, text escaping, and table formatting
- **Element Classes**: Data structures for managing links and lists during conversion
- **CLI Interface**: Command-line tool exposing all configuration options

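
To make the parser-based design concrete, here is a minimal sketch of a converter built on `html.parser.HTMLParser`. It is a toy that handles only `<p>`, `<strong>`, and `<em>`; the class name `TinyMarkdownConverter` and its methods are hypothetical and are not html2text's actual implementation.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Toy converter (hypothetical): handles only <p>, <strong>, and <em>."""

    # Markdown delimiter emitted at both the opening and closing tag.
    MARKERS = {"strong": "**", "em": "_"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing to the output.
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n\n")  # paragraphs end with a blank line
        else:
            self.parts.append(self.MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)

    def convert(self, html: str) -> str:
        """Parse one document and return the accumulated text."""
        self.parts = []
        self.reset()
        self.feed(html)
        self.close()
        return "".join(self.parts).strip()


converter = TinyMarkdownConverter()
print(converter.convert("<p><strong>Bold</strong> and <em>italic</em></p>"))
# prints: **Bold** and _italic_
```

The parser calls one handler per tag or text run, and the converter only has to decide what to emit for each event. The real `HTML2Text` class follows the same event-driven pattern with far more state.
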
## Capabilities

### Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text with configurable formatting options.

```python { .api }
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text.

    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links
        bodywidth: Text wrapping width (None uses default)

    Returns:
        Converted Markdown/text string
    """

class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with extensive configuration options.

    Args:
        out: Optional custom output callback function
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Maximum line width for text wrapping (default: 78)
    """

    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text.

        Args:
            data: HTML string to convert

        Returns:
            Converted Markdown/text string
        """
```

[Core Conversion](./core-conversion.md)

### Configuration Options

Comprehensive formatting and behavior configuration for customizing HTML to text conversion including link handling, text formatting, table processing, and output styling.

```python { .api }
# Link and Image Configuration
ignore_links: bool = False  # Skip all link formatting
ignore_mailto_links: bool = False  # Skip mailto links
inline_links: bool = True  # Use inline vs reference links
protect_links: bool = False  # Wrap links with angle brackets
ignore_images: bool = False  # Skip image formatting
images_to_alt: bool = False  # Replace images with alt text only

# Text Formatting Configuration
body_width: int = 78  # Text wrapping width (0 for no wrap)
unicode_snob: bool = False  # Use Unicode vs ASCII replacements
escape_snob: bool = False  # Escape all special characters
ignore_emphasis: bool = False  # Skip bold/italic formatting
single_line_break: bool = False  # Use single vs double line breaks

# Table Configuration
bypass_tables: bool = False  # Format tables as HTML vs Markdown
ignore_tables: bool = False  # Skip table formatting entirely
pad_tables: bool = False  # Pad table cells to equal width
```

[Configuration Options](./configuration.md)

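
Because these options are set as plain instance attributes, a misspelled name such as `h.ignore_link = True` creates a new attribute and changes nothing. The sketch below shows one possible guard; `apply_options` is a hypothetical helper, not part of html2text, and the stand-in object only exists so the example runs without the package installed.

```python
from types import SimpleNamespace
from typing import Any


def apply_options(converter: Any, **options: Any) -> Any:
    """Set each option on the converter, rejecting names it does not have."""
    for name, value in options.items():
        if not hasattr(converter, name):
            raise AttributeError(f"unknown option: {name!r}")
        setattr(converter, name, value)
    return converter


# Stand-in for a converter instance (assumption for illustration only);
# in real use, pass an html2text.HTML2Text() object instead.
stand_in = SimpleNamespace(ignore_links=False, body_width=78, pad_tables=False)
apply_options(stand_in, ignore_links=True, body_width=0)
print(stand_in.ignore_links, stand_in.body_width)  # prints: True 0
```

Calling `apply_options(stand_in, ignore_link=True)` raises `AttributeError` instead of silently doing nothing.
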
### Utility Functions

Helper functions for text processing, CSS parsing, character escaping, and table formatting used internally and available for advanced use cases.

```python { .api }
def escape_md(text: str) -> str:
    """Escape markdown-sensitive characters within markdown constructs."""

def escape_md_section(text: str, snob: bool = False) -> str:
    """Escape markdown-sensitive characters across document sections."""

def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
    """Add padding to tables in text for consistent column alignment."""
```

[Utility Functions](./utilities.md)

## Error Handling

html2text tolerates malformed HTML through its `HTMLParser` base class. The converter takes a `str`, so resolve character encoding issues before passing HTML to it:

```python
import html2text

# Decode explicitly; errors='ignore' silently drops undecodable bytes
with open('file.html', 'rb') as f:
    html_bytes = f.read()
html_text = html_bytes.decode('utf-8', errors='ignore')
markdown = html2text.html2text(html_text)
```

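
When the encoding of the input is not known in advance, one option is to try a short list of candidates before falling back to a lossy decode. The following is a minimal sketch under that assumption; `decode_html` and its default encoding list are hypothetical and not part of html2text.

```python
def decode_html(raw: bytes, encodings: tuple[str, ...] = ("utf-8", "cp1252")) -> str:
    """Return the first strict decode that succeeds, else a lossy UTF-8 decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


utf8_text = decode_html("<p>café</p>".encode("utf-8"))
legacy_text = decode_html("<p>café</p>".encode("cp1252"))
print(utf8_text == legacy_text)  # prints: True
```

The resulting string can then be passed to `html2text.html2text()` as in the example above. Unlike `errors='ignore'`, the fallback here leaves a visible replacement character where bytes could not be decoded.
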
## Command Line Interface

The package includes a command-line tool `html2text` with comprehensive configuration options:

```bash
# Basic usage
html2text input.html

# From stdin
echo "<p>Hello world</p>" | html2text

# With custom encoding
html2text input.html utf-8

# Common options
html2text --body-width=0 --ignore-links input.html
html2text --reference-links --pad-tables input.html
html2text --google-doc --hide-strikethrough gdoc.html
```

### CLI Options

**Text Formatting:**
- `--body-width=N` - Line width (0 for no wrapping, default: 78)
- `--single-line-break` - Use a single line break after block elements instead of two (requires `--body-width=0`)
- `--escape-all` - Escape all special characters for safer output

**Link Handling:**
- `--ignore-links` - Don't include any link formatting
- `--ignore-mailto-links` - Don't include mailto: links
- `--reference-links` - Use reference-style links instead of inline
- `--protect-links` - Wrap links with angle brackets
- `--no-wrap-links` - Don't wrap long links

**Image Handling:**
- `--ignore-images` - Don't include any image formatting
- `--images-as-html` - Keep images as raw HTML tags
- `--images-to-alt` - Replace images with alt text only
- `--images-with-size` - Include width/height in HTML image tags
- `--default-image-alt=TEXT` - Default alt text for images

**Table Formatting:**
- `--pad-tables` - Pad cells to equal column width
- `--bypass-tables` - Format tables as HTML instead of Markdown
- `--ignore-tables` - Skip table formatting entirely
- `--wrap-tables` - Allow table content wrapping

**List and Emphasis:**
- `--ignore-emphasis` - Don't include formatting for bold/italic
- `--dash-unordered-list` - Use dashes instead of asterisks for lists
- `--asterisk-emphasis` - Use asterisks instead of underscores for emphasis
- `--wrap-list-items` - Allow list item wrapping

**Google Docs Support:**
- `--google-doc` - Enable Google Docs-specific processing
- `--google-list-indent=N` - Pixels Google uses for list indentation (default: 36)
- `--hide-strikethrough` - Hide strikethrough text (use with `--google-doc`)

## Types

```python { .api }
from typing import Dict, List, Optional, Protocol

class OutCallback(Protocol):
    """Protocol for custom output callback functions."""
    def __call__(self, s: str) -> None: ...

class AnchorElement:
    """Represents link elements during processing."""
    attrs: Dict[str, Optional[str]]
    count: int
    outcount: int

class ListElement:
    """Represents list elements during processing."""
    name: str  # 'ul' or 'ol'
    num: int  # Current list item number
```
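
As an illustration of the `OutCallback` shape, any callable that accepts one string and returns `None` satisfies the protocol. The sketch below redefines the protocol locally so it is self-contained; `ChunkCollector` and `emit_all` are hypothetical names, and `emit_all` merely stands in for a converter, on the assumption that a callback is invoked once per emitted chunk of text.

```python
from typing import List, Protocol


class OutCallback(Protocol):
    """Local copy of the protocol shown above, for a self-contained sketch."""

    def __call__(self, s: str) -> None: ...


class ChunkCollector:
    """Hypothetical callback that stores every chunk it receives."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def __call__(self, s: str) -> None:
        self.chunks.append(s)


def emit_all(out: OutCallback, pieces: List[str]) -> None:
    """Stand-in for a converter: sends each piece of text to the callback."""
    for piece in pieces:
        out(piece)


collector = ChunkCollector()
emit_all(collector, ["Hello ", "world!"])
print("".join(collector.chunks))  # prints: Hello world!
```

A collector like this could be passed as the `out` argument of `HTML2Text` to capture output incrementally rather than reading the return value of `handle()`.
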