tessl/pypi-html2text

Turn HTML into equivalent Markdown-structured text.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/html2text@2025.4.x

To install, run

npx @tessl/cli install tessl/pypi-html2text@2025.4.0

# html2text

html2text is a Python library that converts HTML into clean, readable plain ASCII text that is also valid Markdown. It provides both a programmatic API and a command-line interface, with configuration options for handling links, code blocks, tables, and formatting elements while preserving semantic structure.

## Package Information

- **Package Name**: html2text
- **Language**: Python
- **Installation**: `pip install html2text`
- **Python Requirements**: >=3.9

## Core Imports

```python
import html2text
```

For basic usage:

```python
from html2text import html2text
```

For advanced usage with configuration:

```python
from html2text import HTML2Text
```

## Basic Usage

```python
import html2text

# Simple conversion using convenience function
html = "<p><strong>Bold text</strong> and <em>italic text</em></p>"
markdown = html2text.html2text(html)
print(markdown)
# Output: **Bold text** and _italic text_

# Advanced usage with configuration
h = html2text.HTML2Text()
h.ignore_links = True
h.body_width = 0  # No line wrapping
markdown = h.handle("<p>Hello <a href='http://example.com'>world</a>!</p>")
print(markdown)
# Output: Hello world!
```

## Architecture

html2text is built around an HTML parser:

- **HTML2Text Class**: Main converter, inheriting from `html.parser.HTMLParser`, with extensive configuration options
- **Configuration System**: Module-level defaults with per-instance overrides for formatting options
- **Utility Functions**: Helper functions for CSS parsing, text escaping, and table formatting
- **Element Classes**: Data structures for managing links and lists during conversion
- **Command-Line Interface**: Command-line tool exposing the configuration options
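
To make the parser-based design concrete, here is a minimal sketch of the general technique using only the standard library. `TinyConverter` is a hypothetical toy written for this illustration; it is not html2text's implementation and handles only a few tags.

```python
from html.parser import HTMLParser


class TinyConverter(HTMLParser):
    """Toy converter (hypothetical): handles <p>, <strong>/<b>, and <em>/<i> only."""

    # Markdown markers emitted when these tags open or close.
    MARKERS = {"strong": "**", "b": "**", "em": "_", "i": "_"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing to the output.
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n\n")  # paragraphs end with a blank line
        else:
            self.parts.append(self.MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)

    def convert(self, html: str) -> str:
        self.reset()      # clear parser state left over from earlier input
        self.parts = []   # fresh output buffer for each call
        self.feed(html)
        self.close()
        return "".join(self.parts)


result = TinyConverter().convert("<p><strong>Bold</strong> and <em>italic</em></p>")
print(result)
```

The parser drives the conversion: each tag or text event appends a fragment to a buffer, and the buffer is joined at the end. A real converter additionally tracks state such as open links and list nesting, which is the role the element classes play.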

## Capabilities

### Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text with configurable formatting options.

```python { .api }
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text.

    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links
        bodywidth: Text wrapping width (None uses default)

    Returns:
        Converted Markdown/text string
    """

class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with extensive configuration options.

    Args:
        out: Optional custom output callback function
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Maximum line width for text wrapping (default: 78)
    """

    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text.

        Args:
            data: HTML string to convert

        Returns:
            Converted Markdown/text string
        """
```

[Core Conversion](./core-conversion.md)

### Configuration Options

Comprehensive formatting and behavior configuration for customizing HTML to text conversion, including link handling, text formatting, table processing, and output styling.

```python { .api }
# Link and Image Configuration
ignore_links: bool = False         # Skip all link formatting
ignore_mailto_links: bool = False  # Skip mailto links
inline_links: bool = True          # Use inline vs reference links
protect_links: bool = False        # Wrap links with angle brackets
ignore_images: bool = False        # Skip image formatting
images_to_alt: bool = False        # Replace images with alt text only

# Text Formatting Configuration
body_width: int = 78               # Text wrapping width (0 for no wrap)
unicode_snob: bool = False         # Use Unicode vs ASCII replacements
escape_snob: bool = False          # Escape all special characters
ignore_emphasis: bool = False      # Skip bold/italic formatting
single_line_break: bool = False    # Use single vs double line breaks

# Table Configuration
bypass_tables: bool = False        # Format tables as HTML vs Markdown
ignore_tables: bool = False        # Skip table formatting entirely
pad_tables: bool = False           # Pad table cells to equal width
```

[Configuration Options](./configuration.md)

### Utility Functions

Helper functions for text processing, CSS parsing, character escaping, and table formatting used internally and available for advanced use cases.

```python { .api }
def escape_md(text: str) -> str:
    """Escape markdown-sensitive characters within markdown constructs."""

def escape_md_section(text: str, snob: bool = False) -> str:
    """Escape markdown-sensitive characters across document sections."""

def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
    """Add padding to tables in text for consistent column alignment."""
```

[Utility Functions](./utilities.md)

## Error Handling

html2text handles malformed HTML gracefully through its `HTMLParser` base class. Resolve character-encoding issues before passing HTML to the converter:

```python
import html2text

# Decode explicitly if the encoding is uncertain
with open('file.html', 'rb') as f:
    html_bytes = f.read()

# errors='ignore' silently drops bytes that are not valid UTF-8
html_text = html_bytes.decode('utf-8', errors='ignore')
markdown = html2text.html2text(html_text)
```
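
Since `errors='ignore'` discards undecodable bytes without any signal, a fallback chain can be a gentler option. The sketch below uses only the standard library; `decode_html` is a hypothetical helper and the choice of `cp1252` as the second encoding is an assumption, not part of html2text.

```python
def decode_html(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode bytes by trying each encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


utf8_text = decode_html("<p>café</p>".encode("utf-8"))
legacy_text = decode_html("<p>café</p>".encode("cp1252"))
print(utf8_text)
print(legacy_text)
```

The decoded string can then be passed to `html2text.html2text` as in the example above. Using `errors="replace"` in the final step keeps a visible marker where data was lost instead of dropping it silently.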

## Command Line Interface

The package includes a command-line tool, `html2text`, with comprehensive configuration options:

```bash
# Basic usage
html2text input.html

# From stdin
echo "<p>Hello world</p>" | html2text

# With custom encoding
html2text input.html utf-8

# Common options
html2text --body-width=0 --ignore-links input.html
html2text --reference-links --pad-tables input.html
html2text --google-doc --hide-strikethrough gdoc.html
```

### CLI Options

**Text Formatting:**

- `--body-width=N` - Line width (0 for no wrapping, default: 78)
- `--single-line-break` - Use single line breaks instead of double
- `--escape-all` - Escape all special characters for safer output

**Link Handling:**

- `--ignore-links` - Don't include any link formatting
- `--ignore-mailto-links` - Don't include mailto: links
- `--reference-links` - Use reference-style links instead of inline
- `--protect-links` - Wrap links with angle brackets
- `--no-wrap-links` - Don't wrap long links

**Image Handling:**

- `--ignore-images` - Don't include any image formatting
- `--images-as-html` - Keep images as raw HTML tags
- `--images-to-alt` - Replace images with alt text only
- `--images-with-size` - Include width/height in HTML image tags
- `--default-image-alt=TEXT` - Default alt text for images

**Table Formatting:**

- `--pad-tables` - Pad cells to equal column width
- `--bypass-tables` - Format tables as HTML instead of Markdown
- `--ignore-tables` - Skip table formatting entirely
- `--wrap-tables` - Allow table content wrapping

**List and Emphasis:**

- `--ignore-emphasis` - Don't include formatting for bold/italic
- `--dash-unordered-list` - Use dashes instead of asterisks for lists
- `--asterisk-emphasis` - Use asterisks instead of underscores for emphasis
- `--wrap-list-items` - Allow list item wrapping

**Google Docs Support:**

- `--google-doc` - Enable Google Docs-specific processing
- `--google-list-indent=N` - Pixels Google uses for list indentation (default: 36)
- `--hide-strikethrough` - Hide strikethrough text (use with `--google-doc`)

## Types

```python { .api }
from typing import Dict, List, Optional, Protocol

class OutCallback(Protocol):
    """Protocol for custom output callback functions."""
    def __call__(self, s: str) -> None: ...

class AnchorElement:
    """Represents link elements during processing."""
    attrs: Dict[str, Optional[str]]
    count: int
    outcount: int

class ListElement:
    """Represents list elements during processing."""
    name: str  # 'ul' or 'ol'
    num: int   # Current list item number
```
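
To show what satisfying the `OutCallback` protocol looks like, here is a standard-library sketch. The protocol is redeclared locally so the example is self-contained; `ChunkCollector` and `emit` are hypothetical names, with `emit` standing in for a converter that pushes text through the callback.

```python
from typing import List, Protocol


class OutCallback(Protocol):
    """Local copy of the protocol, so this sketch runs on its own."""
    def __call__(self, s: str) -> None: ...


class ChunkCollector:
    """Hypothetical callback that stores each emitted chunk in a list."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def __call__(self, s: str) -> None:
        self.chunks.append(s)

    def text(self) -> str:
        return "".join(self.chunks)


def emit(out: OutCallback, pieces: List[str]) -> None:
    """Stand-in for a converter: sends each piece to the callback."""
    for piece in pieces:
        out(piece)


collector = ChunkCollector()
emit(collector, ["Hello ", "world", "!\n"])
print(collector.text())
```

Any callable that accepts a single string matches the protocol, so an object like `collector` is a candidate for the `out` argument of `HTML2Text` described earlier, for example to stream output as it is produced.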