tessl/pypi-markdownify

Convert HTML to markdown with extensive customization options for tag filtering, heading styles, and output formatting.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pypipkg:pypi/markdownify@1.2.x

To install, run `npx @tessl/cli install tessl/pypi-markdownify@1.2.0`.

# Markdownify

Markdownify is a Python library for converting HTML to Markdown. It offers extensive customization options, including tag filtering (strip or convert specific tags), heading style control (ATX, closed ATX, or underlined/Setext), list formatting, code block handling, and table conversion with colspan support and optional header inference.

## Package Information

- **Package Name**: markdownify
- **Language**: Python
- **Installation**: `pip install markdownify`

## Core Imports

```python
from markdownify import markdownify
```

Or import the converter class directly:

```python
from markdownify import MarkdownConverter
```

You can also import constants for configuration:

```python
from markdownify import (
    markdownify, MarkdownConverter,
    ATX, ATX_CLOSED, UNDERLINED, SETEXT,
    SPACES, BACKSLASH, ASTERISK, UNDERSCORE,
    STRIP, LSTRIP, RSTRIP, STRIP_ONE
)
```

## Basic Usage

```python
from markdownify import MarkdownConverter, markdownify as md

# Simple HTML to Markdown conversion
html = '<b>Bold text</b> and <a href="http://example.com">a link</a>'
markdown = md(html)
print(markdown)  # **Bold text** and [a link](http://example.com)

# Convert with options
html = '<h1>Title</h1><p>Paragraph with <em>emphasis</em></p>'
markdown = md(html, heading_style='atx', strip=['em'])
print(markdown)  # # Title\n\nParagraph with emphasis

# Using the MarkdownConverter class for repeated conversions
converter = MarkdownConverter(
    heading_style='atx_closed',
    bullets='*+-',
    escape_misc=True
)
markdown1 = converter.convert('<h2>Section</h2><ul><li>Item 1</li></ul>')
markdown2 = converter.convert('<blockquote>Quote text</blockquote>')
```

## CLI Usage

```bash
# Convert HTML file to Markdown
markdownify input.html

# Convert from stdin
echo '<b>Bold</b>' | markdownify

# Basic formatting options
markdownify --heading-style=atx --bullets='*-+' input.html
markdownify --strong-em-symbol='_' --newline-style=backslash input.html

# Tag filtering
# (--strip and --convert take a list of tags, so the input file goes first;
# otherwise it would be read as one more tag name)
markdownify input.html --strip a script style
markdownify input.html --convert h1 h2 p b i strong em

# Advanced options
markdownify --wrap --wrap-width=100 --table-infer-header input.html
markdownify --keep-inline-images-in h1 h2 --code-language=python input.html
markdownify --no-escape-asterisks --no-escape-underscores input.html
markdownify --sub-symbol='~' --sup-symbol='^' --bs4-options=lxml input.html
```

## Capabilities

### Primary Conversion Function

The main function for converting HTML to Markdown with comprehensive options.

```python { .api }
def markdownify(
    html: str,
    autolinks: bool = True,
    bs4_options: str | dict = 'html.parser',
    bullets: str = '*+-',
    code_language: str = '',
    code_language_callback: callable = None,
    convert: list = None,
    default_title: bool = False,
    escape_asterisks: bool = True,
    escape_underscores: bool = True,
    escape_misc: bool = False,
    heading_style: str = 'underlined',
    keep_inline_images_in: list = [],
    newline_style: str = 'spaces',
    strip: list = None,
    strip_document: str = 'strip',
    strip_pre: str = 'strip',
    strong_em_symbol: str = '*',
    sub_symbol: str = '',
    sup_symbol: str = '',
    table_infer_header: bool = False,
    wrap: bool = False,
    wrap_width: int = 80
) -> str:
    """
    Convert HTML to Markdown with extensive customization options.

    Parameters:
    - html: HTML string to convert
    - autolinks: Use automatic link style when link text matches href
    - bs4_options: BeautifulSoup parser options (string for parser name, or dict with 'features' key and other options)
    - bullets: String of bullet characters for nested lists (e.g., '*+-')
    - code_language: Default language for code blocks
    - code_language_callback: Function to determine code block language
    - convert: List of tags to convert (excludes all others if specified)
    - default_title: Use href as title when no title provided
    - escape_asterisks: Escape asterisk characters in text
    - escape_underscores: Escape underscore characters in text
    - escape_misc: Escape miscellaneous Markdown special characters
    - heading_style: Style for headings ('atx', 'atx_closed', 'underlined')
    - keep_inline_images_in: Parent tags that should keep inline images
    - newline_style: Style for line breaks ('spaces', 'backslash')
    - strip: List of tags to strip (excludes from conversion)
    - strip_document: Document-level whitespace stripping ('strip', 'lstrip', 'rstrip', None)
    - strip_pre: Pre-block whitespace stripping ('strip', 'strip_one', None)
    - strong_em_symbol: Symbol for strong/emphasis ('*', '_')
    - sub_symbol: Characters to surround subscript text
    - sup_symbol: Characters to surround superscript text
    - table_infer_header: Infer table headers when not explicitly marked
    - wrap: Wrap text paragraphs at specified width
    - wrap_width: Width for text wrapping

    Returns:
    Markdown string
    """
```

### MarkdownConverter Class

The main converter class providing configurable HTML to Markdown conversion with caching and extensibility.

```python { .api }
class MarkdownConverter:
    """
    Configurable HTML to Markdown converter with extensive customization options.
    Supports custom conversion methods for specific tags and provides caching for performance.
    """

    def __init__(
        self,
        autolinks: bool = True,
        bs4_options: str | dict = 'html.parser',
        bullets: str = '*+-',
        code_language: str = '',
        code_language_callback: callable = None,
        convert: list = None,
        default_title: bool = False,
        escape_asterisks: bool = True,
        escape_underscores: bool = True,
        escape_misc: bool = False,
        heading_style: str = 'underlined',
        keep_inline_images_in: list = [],
        newline_style: str = 'spaces',
        strip: list = None,
        strip_document: str = 'strip',
        strip_pre: str = 'strip',
        strong_em_symbol: str = '*',
        sub_symbol: str = '',
        sup_symbol: str = '',
        table_infer_header: bool = False,
        wrap: bool = False,
        wrap_width: int = 80
    ):
        """
        Initialize MarkdownConverter with configuration options.

        Parameters: Same as markdownify() function
        """

    def convert(self, html: str) -> str:
        """
        Convert HTML string to Markdown.

        Parameters:
        - html: HTML string to convert

        Returns:
        Markdown string
        """

    def convert_soup(self, soup) -> str:
        """
        Convert BeautifulSoup object to Markdown.

        Parameters:
        - soup: BeautifulSoup parsed HTML object

        Returns:
        Markdown string
        """
```

### Command Line Interface

Entry point for command-line HTML to Markdown conversion.

```python { .api }
def main(argv: list = None):
    """
    Command-line interface for markdownify.

    Parameters:
    - argv: Command line arguments (defaults to sys.argv[1:])

    Supports all conversion options as command-line flags:
    --strip, --convert, --autolinks, --heading-style, --bullets,
    --strong-em-symbol, --sub-symbol, --sup-symbol, --newline-style,
    --code-language, --no-escape-asterisks, --no-escape-underscores,
    --keep-inline-images-in, --table-infer-header, --wrap, --wrap-width,
    --bs4-options
    """
```

### Utility Functions

Helper functions for text processing and whitespace handling.

```python { .api }
def strip_pre(text: str) -> str:
    """
    Strip all leading and trailing newlines from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text
    """

def strip1_pre(text: str) -> str:
    """
    Strip one leading and trailing newline from preformatted text.

    Parameters:
    - text: Text to strip

    Returns:
    Stripped text with at most one leading/trailing newline removed
    """

def chomp(text: str) -> tuple:
    """
    Extract leading/trailing spaces from inline text to prevent malformed Markdown.

    Parameters:
    - text: Text to process

    Returns:
    Tuple of (prefix_space, suffix_space, stripped_text)
    """

def abstract_inline_conversion(markup_fn: callable) -> callable:
    """
    Factory function for creating inline tag conversion functions.

    Parameters:
    - markup_fn: Function that returns markup string for the tag

    Returns:
    Conversion function for inline tags
    """

def should_remove_whitespace_inside(el) -> bool:
    """
    Determine if whitespace should be removed inside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed inside the element
    """

def should_remove_whitespace_outside(el) -> bool:
    """
    Determine if whitespace should be removed outside a block-level element.

    Parameters:
    - el: HTML element to check

    Returns:
    True if whitespace should be removed outside the element
    """
```

## Configuration Constants

Style constants for configuring conversion behavior.

```python { .api }
# Heading styles
ATX = 'atx'                # # Heading
ATX_CLOSED = 'atx_closed'  # # Heading #
UNDERLINED = 'underlined'  # Heading\n=======
SETEXT = UNDERLINED        # Alias for UNDERLINED

# Newline styles for <br> tags
SPACES = 'spaces'          # Two spaces at end of line
BACKSLASH = 'backslash'    # Backslash at end of line

# Strong/emphasis symbols
ASTERISK = '*'             # **bold** and *italic*
UNDERSCORE = '_'           # __bold__ and _italic_

# Document/pre stripping options
STRIP = 'strip'            # Remove leading and trailing whitespace
LSTRIP = 'lstrip'          # Remove leading whitespace only
RSTRIP = 'rstrip'          # Remove trailing whitespace only
STRIP_ONE = 'strip_one'    # Remove one leading/trailing newline
```

## Custom Converter Extension

You can extend MarkdownConverter to create custom conversion behavior for specific tags:

```python
from markdownify import MarkdownConverter

class CustomConverter(MarkdownConverter):
    def convert_custom_tag(self, el, text, parent_tags):
        """Custom conversion for <custom-tag> elements."""
        return f"[CUSTOM: {text}]"

    def convert_img(self, el, text, parent_tags):
        """Override image conversion to add custom behavior."""
        result = super().convert_img(el, text, parent_tags)
        return result + "\n\n"  # Add extra newlines after images

# Usage
converter = CustomConverter()
html = '<custom-tag>content</custom-tag><img src="test.jpg" alt="Test">'
markdown = converter.convert(html)
```

## Error Handling

The library handles malformed HTML gracefully through BeautifulSoup's parsing capabilities. Invalid configuration options raise `ValueError` exceptions:

- Specifying both `strip` and `convert` options
- Invalid values for `heading_style`, `newline_style`, `strip_document`, or `strip_pre`
- Non-callable `code_language_callback`

## Dependencies

- **beautifulsoup4** (>=4.9,<5): HTML parsing and DOM manipulation
- **six** (>=1.15,<2): Python 2/3 compatibility utilities