Tessl Tile for pypi/markdownify@1.2.0

tessl/pypi-markdownify

Convert HTML to markdown with extensive customization options for tag filtering, heading styles, and output formatting.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/markdownify@1.2.x

To install, run

npx @tessl/cli install tessl/pypi-markdownify@1.2.0

# Markdownify

A comprehensive Python library for converting HTML to Markdown. Markdownify provides extensive customization options, including tag filtering (strip or convert specific tags), heading style control (ATX, closed ATX, or underlined/Setext), list formatting, code block handling, and table conversion with colspan support and header inference.

## Package Information

- **Package Name**: markdownify
- **Language**: Python
- **Installation**: `pip install markdownify`

## Core Imports

```python
from markdownify import markdownify
```

Or import the converter class directly:

```python
from markdownify import MarkdownConverter
```

You can also import constants for configuration:

```python
from markdownify import (
    markdownify, MarkdownConverter,
    ATX, ATX_CLOSED, UNDERLINED, SETEXT,
    SPACES, BACKSLASH, ASTERISK, UNDERSCORE,
    STRIP, LSTRIP, RSTRIP, STRIP_ONE
)
```

## Basic Usage

```python
from markdownify import markdownify as md

# Simple HTML to Markdown conversion
html = '<b>Bold text</b> and <a href="http://example.com">a link</a>'
markdown = md(html)
print(markdown)  # **Bold text** and [a link](http://example.com)

# Convert with options
html = '<h1>Title</h1><p>Paragraph with <em>emphasis</em></p>'
markdown = md(html, heading_style='atx', strip=['em'])
print(markdown)  # # Title\n\nParagraph with emphasis

# Using the MarkdownConverter class for repeated conversions
converter = MarkdownConverter(
    heading_style='atx_closed',
    bullets='*+-',
    escape_misc=True
)
markdown1 = converter.convert('<h2>Section</h2><ul><li>Item 1</li></ul>')
markdown2 = converter.convert('<blockquote>Quote text</blockquote>')
```

## CLI Usage

```bash
# Convert HTML file to Markdown
markdownify input.html

# Convert from stdin
echo '<b>Bold</b>' | markdownify

# Basic formatting options
markdownify --heading-style=atx --bullets='*-+' input.html
markdownify --strong-em-symbol='_' --newline-style=backslash input.html

# Tag filtering (the input file comes first so that the list-valued
# options do not consume it as another tag name)
markdownify input.html --strip a script style
markdownify input.html --convert h1 h2 p b i strong em

# Advanced options
markdownify --wrap --wrap-width=100 --table-infer-header input.html
markdownify --keep-inline-images-in h1 h2 --code-language=python input.html
markdownify --no-escape-asterisks --no-escape-underscores input.html
markdownify --sub-symbol='~' --sup-symbol='^' --bs4-options=lxml input.html
```

## Capabilities

### Primary Conversion Function

The main function for converting HTML to Markdown with comprehensive options.

```python { .api }
def markdownify(
    html: str,
    autolinks: bool = True,
    bs4_options: str | dict = 'html.parser',
    bullets: str = '*+-',
    code_language: str = '',
    code_language_callback: callable = None,
    convert: list = None,
    default_title: bool = False,
    escape_asterisks: bool = True,
    escape_underscores: bool = True,
    escape_misc: bool = False,
    heading_style: str = 'underlined',
    keep_inline_images_in: list = [],
    newline_style: str = 'spaces',
    strip: list = None,
    strip_document: str = 'strip',
    strip_pre: str = 'strip',
    strong_em_symbol: str = '*',
    sub_symbol: str = '',
    sup_symbol: str = '',
    table_infer_header: bool = False,
    wrap: bool = False,
    wrap_width: int = 80
) -> str:
    """
    Convert HTML to Markdown with extensive customization options.

    Parameters:
    - html: HTML string to convert
    - autolinks: Use automatic link style when link text matches href
    - bs4_options: BeautifulSoup parser options (string for parser name, or dict with 'features' key and other options)
    - bullets: String of bullet characters for nested lists (e.g., '*+-')
    - code_language: Default language for code blocks
    - code_language_callback: Function to determine code block language
    - convert: List of tags to convert (excludes all others if specified)
    - default_title: Use href as title when no title provided
    - escape_asterisks: Escape asterisk characters in text
    - escape_underscores: Escape underscore characters in text
    - escape_misc: Escape miscellaneous Markdown special characters
    - heading_style: Style for headings ('atx', 'atx_closed', 'underlined')
    - keep_inline_images_in: Parent tags that should keep inline images
    - newline_style: Style for line breaks ('spaces', 'backslash')
    - strip: List of tags to strip (excludes from conversion)
    - strip_document: Document-level whitespace stripping ('strip', 'lstrip', 'rstrip', None)
    - strip_pre: Pre-block whitespace stripping ('strip', 'strip_one', None)
    - strong_em_symbol: Symbol for strong/emphasis ('*', '_')
    - sub_symbol: Characters to surround subscript text
    - sup_symbol: Characters to surround superscript text
    - table_infer_header: Infer table headers when not explicitly marked
    - wrap: Wrap text paragraphs at specified width
    - wrap_width: Width for text wrapping

    Returns:
    Markdown string
    """
```

### MarkdownConverter Class

The main converter class providing configurable HTML to Markdown conversion with caching and extensibility.

```python { .api }
class MarkdownConverter:
    """
    Configurable HTML to Markdown converter with extensive customization options.
    Supports custom conversion methods for specific tags and provides caching for performance.
    """

    def __init__(
        self,
        autolinks: bool = True,
        bs4_options: str | dict = 'html.parser',
        bullets: str = '*+-',
        code_language: str = '',
        code_language_callback: callable = None,
        convert: list = None,
        default_title: bool = False,
        escape_asterisks: bool = True,
        escape_underscores: bool = True,
        escape_misc: bool = False,
        heading_style: str = 'underlined',
        keep_inline_images_in: list = [],
        newline_style: str = 'spaces',
        strip: list = None,
        strip_document: str = 'strip',
        strip_pre: str = 'strip',
        strong_em_symbol: str = '*',
        sub_symbol: str = '',
        sup_symbol: str = '',
        table_infer_header: bool = False,
        wrap: bool = False,
        wrap_width: int = 80
    ):
        """
        Initialize MarkdownConverter with configuration options.

        Parameters: Same as markdownify() function
        """

    def convert(self, html: str) -> str:
        """
        Convert HTML string to Markdown.

        Parameters:
        - html: HTML string to convert

        Returns:
        Markdown string
        """

    def convert_soup(self, soup) -> str:
        """
        Convert BeautifulSoup object to Markdown.

        Parameters:
        - soup: BeautifulSoup parsed HTML object

        Returns:
        Markdown string
        """
```

### Command Line Interface

Entry point for command-line HTML to Markdown conversion.

```python { .api }
def main(argv: list = None):
    """
    Command-line interface for markdownify.

    Parameters:
    - argv: Command line arguments (defaults to sys.argv[1:])

    Supports all conversion options as command-line flags:
    --strip, --convert, --autolinks, --heading-style, --bullets,
    --strong-em-symbol, --sub-symbol, --sup-symbol, --newline-style,
    --code-language, --no-escape-asterisks, --no-escape-underscores,
    --keep-inline-images-in, --table-infer-header, --wrap, --wrap-width,
    --bs4-options
    """
```

### Utility Functions

Helper functions for text processing and whitespace handling.

```python { .api }
def strip_pre(text: str) -> str:
    """
    Strip all leading and trailing newlines from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text
    """

def strip1_pre(text: str) -> str:
    """
    Strip one leading and trailing newline from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text with at most one leading/trailing newline removed
    """

def chomp(text: str) -> tuple:
    """
    Extract leading/trailing spaces from inline text to prevent malformed Markdown.

    Parameters:
    - text: Text to process

    Returns:
    Tuple of (prefix_space, suffix_space, stripped_text)
    """

def abstract_inline_conversion(markup_fn: callable) -> callable:
    """
    Factory function for creating inline tag conversion functions.

    Parameters:
    - markup_fn: Function that returns markup string for the tag

    Returns:
    Conversion function for inline tags
    """

def should_remove_whitespace_inside(el) -> bool:
    """
    Determine if whitespace should be removed inside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed inside the element
    """

def should_remove_whitespace_outside(el) -> bool:
    """
    Determine if whitespace should be removed outside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed outside the element
    """
```

## Configuration Constants

Style constants for configuring conversion behavior.

```python { .api }
# Heading styles
ATX = 'atx'                    # # Heading
ATX_CLOSED = 'atx_closed'      # # Heading #
UNDERLINED = 'underlined'      # Heading\n=======
SETEXT = UNDERLINED            # Alias for UNDERLINED

# Newline styles for <br> tags
SPACES = 'spaces'              # Two spaces at end of line
BACKSLASH = 'backslash'        # Backslash at end of line

# Strong/emphasis symbols
ASTERISK = '*'                 # **bold** and *italic*
UNDERSCORE = '_'               # __bold__ and _italic_

# Document/pre stripping options
STRIP = 'strip'                # Remove leading and trailing whitespace
LSTRIP = 'lstrip'              # Remove leading whitespace only
RSTRIP = 'rstrip'              # Remove trailing whitespace only
STRIP_ONE = 'strip_one'        # Remove one leading/trailing newline
```

## Custom Converter Extension

You can extend MarkdownConverter to create custom conversion behavior for specific tags:

```python
from markdownify import MarkdownConverter

class CustomConverter(MarkdownConverter):
    def convert_custom_tag(self, el, text, parent_tags):
        """Custom conversion for <custom-tag> elements."""
        return f"[CUSTOM: {text}]"

    def convert_img(self, el, text, parent_tags):
        """Override image conversion to add custom behavior."""
        result = super().convert_img(el, text, parent_tags)
        return result + "\n\n"  # Add extra newlines after images

# Usage
converter = CustomConverter()
html = '<custom-tag>content</custom-tag><img src="test.jpg" alt="Test">'
markdown = converter.convert(html)
```

## Error Handling

The library handles malformed HTML gracefully through BeautifulSoup's parsing capabilities. Invalid configuration options raise `ValueError` exceptions:

- Specifying both `strip` and `convert` options
- Invalid values for `heading_style`, `newline_style`, `strip_document`, or `strip_pre`
- Non-callable `code_language_callback`

## Dependencies

- **beautifulsoup4** (>=4.9,<5): HTML parsing and DOM manipulation
- **six** (>=1.15,<2): Python 2/3 compatibility utilities