# Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text. Provides both simple one-shot conversion and advanced configurable conversion with extensive formatting options.

## Capabilities

### Simple HTML Conversion

Convenience function for straightforward HTML to Markdown conversion with minimal configuration.

```python { .api }
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text using default settings.

    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Text wrapping width, None uses config.BODY_WIDTH (default: None)

    Returns:
        Converted Markdown/text string

    Example:
        >>> import html2text
        >>> html = "<p><strong>Bold</strong> and <em>italic</em></p>"
        >>> print(html2text.html2text(html))
        **Bold** and _italic_
    """
```

### Advanced HTML Conversion

Full-featured HTML to text converter with extensive configuration options for fine-grained control over output formatting.

```python { .api }
class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with comprehensive configuration options.

    Inherits from html.parser.HTMLParser to handle HTML parsing and provides
    extensive customization for output formatting, link handling, table processing,
    and text styling.
    """

    def __init__(
        self,
        out: Optional[OutCallback] = None,
        baseurl: str = "",
        bodywidth: int = 78
    ) -> None:
        """
        Initialize HTML2Text converter.

        Args:
            out: Optional custom output callback function for handling text output
            baseurl: Base URL for resolving relative links (default: "")
            bodywidth: Maximum line width for text wrapping (default: 78)
        """

    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text with current configuration.

        This is the main conversion method that processes the HTML through
        the parser and returns the formatted output.

        Args:
            data: HTML string to convert

        Returns:
            Converted Markdown/text string

        Example:
            >>> h = html2text.HTML2Text()
            >>> h.ignore_links = True
            >>> html = "<p>Hello <a href='http://example.com'>world</a>!</p>"
            >>> print(h.handle(html))
            Hello world!
        """

    def feed(self, data: str) -> None:
        """
        Feed HTML data to the parser for processing.

        Args:
            data: HTML string to feed to parser
        """

    def finish(self) -> str:
        """
        Complete parsing and return formatted text output.

        Returns:
            Final formatted text string
        """

    def outtextf(self, s: str) -> None:
        """
        Default output callback function that appends text to internal buffer.

        This is the default implementation of the output callback that collects
        all text output into an internal list for final processing.

        Args:
            s: Text string to append to output buffer
        """

    def close(self) -> None:
        """
        Close the HTML parser and perform final cleanup.

        Inherited from HTMLParser, ensures proper parser cleanup.
        """

    def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]:
        """
        Find index of link with matching attributes in anchor list.

        Used internally for reference-style link processing to avoid
        duplicate link definitions.

        Args:
            attrs: Dictionary of HTML element attributes

        Returns:
            Index of matching anchor element or None if not found
        """
```
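
To make the relationship between `feed()`, `handle()`, `finish()`, and the `out` callback more concrete, here is a toy model built only on the standard library. It is not html2text's implementation: `TinyConverter` and its `MARKERS` table are hypothetical names, and it handles only a handful of tags. It is meant only to show how an `HTMLParser` subclass can route parser events through an output callback that defaults to an internal buffer.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional


class TinyConverter(HTMLParser):
    """Toy converter: handles only <strong>/<b>, <em>/<i>, and <p>."""

    MARKERS = {"strong": "**", "b": "**", "em": "_", "i": "_"}

    def __init__(self, out: Optional[Callable[[str], None]] = None) -> None:
        super().__init__()
        self.chunks: List[str] = []
        # Fall back to the internal buffer when no callback is supplied.
        self.out = out if out is not None else self.chunks.append

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.out(self.MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.out(self.MARKERS[tag])
        elif tag == "p":
            self.out("\n\n")  # paragraph break

    def handle_data(self, data):
        self.out(data)

    def finish(self) -> str:
        self.close()  # let the base parser flush any pending input
        return "".join(self.chunks)

    def handle(self, data: str) -> str:
        self.feed(data)
        return self.finish()


converter = TinyConverter()
result = converter.handle("<p><strong>Bold</strong> and <em>italic</em></p>")
print(result)
```

When a custom callback is passed, the internal buffer stays empty and `handle()` returns an empty string in this toy model, because every chunk goes to the callback instead.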

## HTML Element Support

html2text converts the following HTML elements:

### Text Formatting
- **Bold**: `<strong>`, `<b>` → `**text**`
- **Italic**: `<em>`, `<i>` → `_text_`
- **Code**: `<code>`, `<tt>`, `<kbd>` → `` `text` ``
- **Strikethrough**: `<del>`, `<strike>`, `<s>` → `~~text~~`
- **Quotes**: `<q>` → `"text"`
- **Superscript/Subscript**: `<sup>`, `<sub>` (configurable)

### Structure Elements
- **Headers**: `<h1>` through `<h6>` → `# Header`
- **Paragraphs**: `<p>` → paragraph breaks
- **Line breaks**: `<br>` → line breaks
- **Horizontal rules**: `<hr>` → `* * *`
- **Blockquotes**: `<blockquote>` → `> text`
- **Preformatted**: `<pre>` → indented code blocks

### Lists
- **Unordered lists**: `<ul>`, `<li>` → `* item`
- **Ordered lists**: `<ol>`, `<li>` → `1. item`
- **Nested lists**: Full support with proper indentation
- **Definition lists**: `<dl>`, `<dt>`, `<dd>`

### Links and Images
- **Links**: `<a>` → `[text](url)` or reference-style
- **Images**: `<img>` → `![alt](src)` or configurable formats
- **Automatic links**: URL detection and conversion

### Tables
- **Tables**: `<table>`, `<tr>`, `<td>`, `<th>` → Markdown tables
- **Table formatting**: Configurable padding and alignment
- **Complex tables**: Colspan handling and formatting options

## Usage Examples

### Basic Text Conversion

```python
import html2text

# Simple paragraph with formatting
html = """
<div>
    <h1>Main Title</h1>
    <p>This is a <strong>bold statement</strong> with some <em>emphasis</em>.</p>
    <p>Here's a <a href="https://example.com">link</a> and some <code>inline code</code>.</p>
</div>
"""

converter = html2text.HTML2Text()
markdown = converter.handle(html)
print(markdown)
```

### List Processing

```python
import html2text

html = """
<ul>
    <li>First item</li>
    <li>Second item with <strong>bold text</strong></li>
    <li>Third item
        <ol>
            <li>Nested ordered item</li>
            <li>Another nested item</li>
        </ol>
    </li>
</ul>
"""

converter = html2text.HTML2Text()
result = converter.handle(html)
print(result)
```

### Table Conversion

```python
import html2text

html = """
<table>
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>City</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>30</td>
        <td>New York</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>25</td>
        <td>London</td>
    </tr>
</table>
"""

converter = html2text.HTML2Text()
converter.pad_tables = True  # Enable table padding
result = converter.handle(html)
print(result)
```

### Custom Output Handling

```python
import html2text

def custom_output(text):
    """Custom output handler that uppercases text."""
    print(text.upper(), end='')

html = "<p>Hello world!</p>"
converter = html2text.HTML2Text(out=custom_output)
converter.handle(html)  # custom_output prints "HELLO WORLD!"
```