Convert HTML to markdown with extensive customization options for tag filtering, heading styles, and output formatting.
npx @tessl/cli install tessl/pypi-markdownify@1.2.00
# Markdownify

A comprehensive Python library for converting HTML to Markdown. Markdownify provides extensive customization options, including tag filtering (strip or convert specific tags), heading style control (ATX, ATX_CLOSED, or underlined/SETEXT), list formatting, code block handling, and table conversion with colspan support and header inference.

## Package Information

- **Package Name**: markdownify
- **Language**: Python
- **Installation**: `pip install markdownify`

## Core Imports

```python
from markdownify import markdownify
```

Or import the converter class directly:

```python
from markdownify import MarkdownConverter
```

You can also import constants for configuration:

```python
from markdownify import (
    markdownify, MarkdownConverter,
    ATX, ATX_CLOSED, UNDERLINED, SETEXT,
    SPACES, BACKSLASH, ASTERISK, UNDERSCORE,
    STRIP, LSTRIP, RSTRIP, STRIP_ONE
)
```

## Basic Usage

```python
from markdownify import MarkdownConverter, markdownify as md

# Simple HTML to Markdown conversion
html = '<b>Bold text</b> and <a href="http://example.com">a link</a>'
markdown = md(html)
print(markdown)  # **Bold text** and [a link](http://example.com)

# Convert with options
html = '<h1>Title</h1><p>Paragraph with <em>emphasis</em></p>'
markdown = md(html, heading_style='atx', strip=['em'])
print(markdown)  # # Title\n\nParagraph with emphasis

# Using the MarkdownConverter class for repeated conversions
converter = MarkdownConverter(
    heading_style='atx_closed',
    bullets='*+-',
    escape_misc=True
)
markdown1 = converter.convert('<h2>Section</h2><ul><li>Item 1</li></ul>')
markdown2 = converter.convert('<blockquote>Quote text</blockquote>')
```

## CLI Usage

```bash
# Convert HTML file to Markdown
markdownify input.html

# Convert from stdin
echo '<b>Bold</b>' | markdownify

# Basic formatting options
markdownify --heading-style=atx --bullets='*-+' input.html
markdownify --strong-em-symbol='_' --newline-style=backslash input.html

# Tag filtering (--strip and --convert take several tag names, so the
# input file goes first to keep it from being read as another tag)
markdownify input.html --strip a script style
markdownify input.html --convert h1 h2 p b i strong em

# Advanced options
markdownify --wrap --wrap-width=100 --table-infer-header input.html
markdownify --keep-inline-images-in h1 h2 --code-language=python input.html
markdownify --no-escape-asterisks --no-escape-underscores input.html
markdownify --sub-symbol='~' --sup-symbol='^' --bs4-options=lxml input.html
```

## Capabilities

### Primary Conversion Function

The main function for converting HTML to Markdown with comprehensive options.

```python { .api }
def markdownify(
    html: str,
    autolinks: bool = True,
    bs4_options: str | dict = 'html.parser',
    bullets: str = '*+-',
    code_language: str = '',
    code_language_callback: callable = None,
    convert: list = None,
    default_title: bool = False,
    escape_asterisks: bool = True,
    escape_underscores: bool = True,
    escape_misc: bool = False,
    heading_style: str = 'underlined',
    keep_inline_images_in: list = [],
    newline_style: str = 'spaces',
    strip: list = None,
    strip_document: str = 'strip',
    strip_pre: str = 'strip',
    strong_em_symbol: str = '*',
    sub_symbol: str = '',
    sup_symbol: str = '',
    table_infer_header: bool = False,
    wrap: bool = False,
    wrap_width: int = 80
) -> str:
    """
    Convert HTML to Markdown with extensive customization options.

    Parameters:
    - html: HTML string to convert
    - autolinks: Use automatic link style when link text matches href
    - bs4_options: BeautifulSoup parser options (string for parser name, or dict with 'features' key and other options)
    - bullets: String of bullet characters for nested lists (e.g., '*+-')
    - code_language: Default language for code blocks
    - code_language_callback: Function to determine code block language
    - convert: List of tags to convert (excludes all others if specified)
    - default_title: Use href as title when no title provided
    - escape_asterisks: Escape asterisk characters in text
    - escape_underscores: Escape underscore characters in text
    - escape_misc: Escape miscellaneous Markdown special characters
    - heading_style: Style for headings ('atx', 'atx_closed', 'underlined')
    - keep_inline_images_in: Parent tags that should keep inline images
    - newline_style: Style for line breaks ('spaces', 'backslash')
    - strip: List of tags to strip (excludes from conversion)
    - strip_document: Document-level whitespace stripping ('strip', 'lstrip', 'rstrip', None)
    - strip_pre: Pre-block whitespace stripping ('strip', 'strip_one', None)
    - strong_em_symbol: Symbol for strong/emphasis ('*', '_')
    - sub_symbol: Characters to surround subscript text
    - sup_symbol: Characters to surround superscript text
    - table_infer_header: Infer table headers when not explicitly marked
    - wrap: Wrap text paragraphs at specified width
    - wrap_width: Width for text wrapping

    Returns:
    Markdown string
    """
```

### MarkdownConverter Class

The main converter class providing configurable HTML to Markdown conversion with caching and extensibility.

```python { .api }
class MarkdownConverter:
    """
    Configurable HTML to Markdown converter with extensive customization options.
    Supports custom conversion methods for specific tags and provides caching for performance.
    """

    def __init__(
        self,
        autolinks: bool = True,
        bs4_options: str | dict = 'html.parser',
        bullets: str = '*+-',
        code_language: str = '',
        code_language_callback: callable = None,
        convert: list = None,
        default_title: bool = False,
        escape_asterisks: bool = True,
        escape_underscores: bool = True,
        escape_misc: bool = False,
        heading_style: str = 'underlined',
        keep_inline_images_in: list = [],
        newline_style: str = 'spaces',
        strip: list = None,
        strip_document: str = 'strip',
        strip_pre: str = 'strip',
        strong_em_symbol: str = '*',
        sub_symbol: str = '',
        sup_symbol: str = '',
        table_infer_header: bool = False,
        wrap: bool = False,
        wrap_width: int = 80
    ):
        """
        Initialize MarkdownConverter with configuration options.

        Parameters: Same as markdownify() function
        """

    def convert(self, html: str) -> str:
        """
        Convert HTML string to Markdown.

        Parameters:
        - html: HTML string to convert

        Returns:
        Markdown string
        """

    def convert_soup(self, soup) -> str:
        """
        Convert BeautifulSoup object to Markdown.

        Parameters:
        - soup: BeautifulSoup parsed HTML object

        Returns:
        Markdown string
        """
```

### Command Line Interface

Entry point for command-line HTML to Markdown conversion.

```python { .api }
def main(argv: list = None):
    """
    Command-line interface for markdownify.

    Parameters:
    - argv: Command line arguments (defaults to sys.argv[1:])

    Supports all conversion options as command-line flags:
    --strip, --convert, --autolinks, --heading-style, --bullets,
    --strong-em-symbol, --sub-symbol, --sup-symbol, --newline-style,
    --code-language, --no-escape-asterisks, --no-escape-underscores,
    --keep-inline-images-in, --table-infer-header, --wrap, --wrap-width,
    --bs4-options
    """
```

### Utility Functions

Helper functions for text processing and whitespace handling.

```python { .api }
def strip_pre(text: str) -> str:
    """
    Strip all leading and trailing newlines from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text
    """

def strip1_pre(text: str) -> str:
    """
    Strip one leading and trailing newline from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text with at most one leading/trailing newline removed
    """

def chomp(text: str) -> tuple:
    """
    Extract leading/trailing spaces from inline text to prevent malformed Markdown.

    Parameters:
    - text: Text to process

    Returns:
    Tuple of (prefix_space, suffix_space, stripped_text)
    """

def abstract_inline_conversion(markup_fn: callable) -> callable:
    """
    Factory function for creating inline tag conversion functions.

    Parameters:
    - markup_fn: Function that returns markup string for the tag

    Returns:
    Conversion function for inline tags
    """

def should_remove_whitespace_inside(el) -> bool:
    """
    Determine if whitespace should be removed inside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed inside the element
    """

def should_remove_whitespace_outside(el) -> bool:
    """
    Determine if whitespace should be removed outside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed outside the element
    """
```

## Configuration Constants

Style constants for configuring conversion behavior.

```python { .api }
# Heading styles
ATX = 'atx'                # # Heading
ATX_CLOSED = 'atx_closed'  # # Heading #
UNDERLINED = 'underlined'  # Heading\n=======
SETEXT = UNDERLINED        # Alias for UNDERLINED

# Newline styles for <br> tags
SPACES = 'spaces'          # Two spaces at end of line
BACKSLASH = 'backslash'    # Backslash at end of line

# Strong/emphasis symbols
ASTERISK = '*'             # **bold** and *italic*
UNDERSCORE = '_'           # __bold__ and _italic_

# Document/pre stripping options
STRIP = 'strip'            # Remove leading and trailing whitespace
LSTRIP = 'lstrip'          # Remove leading whitespace only
RSTRIP = 'rstrip'          # Remove trailing whitespace only
STRIP_ONE = 'strip_one'    # Remove one leading/trailing newline
```

## Custom Converter Extension

You can extend MarkdownConverter to create custom conversion behavior for specific tags:

```python
from markdownify import MarkdownConverter

class CustomConverter(MarkdownConverter):
    def convert_custom_tag(self, el, text, parent_tags):
        """Custom conversion for <custom-tag> elements."""
        return f"[CUSTOM: {text}]"

    def convert_img(self, el, text, parent_tags):
        """Override image conversion to add custom behavior."""
        result = super().convert_img(el, text, parent_tags)
        return result + "\n\n"  # Add extra newlines after images

# Usage
converter = CustomConverter()
html = '<custom-tag>content</custom-tag><img src="test.jpg" alt="Test">'
markdown = converter.convert(html)
```

## Error Handling

The library handles malformed HTML gracefully through BeautifulSoup's parsing capabilities. Invalid configuration options raise `ValueError` exceptions:

- Specifying both `strip` and `convert` options
- Invalid values for `heading_style`, `newline_style`, `strip_document`, or `strip_pre`
- Non-callable `code_language_callback`

## Dependencies

- **beautifulsoup4** (>=4.9,<5): HTML parsing and DOM manipulation
- **six** (>=1.15,<2): Python 2/3 compatibility utilities