
# Utility Functions

Helper functions for text processing, CSS parsing, character escaping, and table formatting. These functions are used internally by html2text and are also available for advanced use cases requiring custom text processing.

## Capabilities

### Text Escaping and Processing

Functions for escaping markdown characters and processing text sections safely.

```python { .api }

def escape_md(text: str) -> str:
    r"""
    Escape markdown-sensitive characters within markdown constructs.

    Escapes characters that have special meaning in Markdown (like brackets,
    parentheses, backslashes) to prevent them from being interpreted as
    formatting when they should be literal text.

    Args:
        text: Text string to escape

    Returns:
        Text with markdown characters escaped with backslashes

    Example:
        >>> from html2text.utils import escape_md
        >>> escape_md("Some [text] with (special) chars")
        'Some \\[text\\] with \\(special\\) chars'
    """

def escape_md_section(text: str, snob: bool = False) -> str:
    r"""
    Escape markdown-sensitive characters across document sections.

    More comprehensive escaping for full document sections, handling
    various markdown constructs that could interfere with formatting.

    Args:
        text: Text string to escape
        snob: If True, escape additional characters for maximum safety

    Returns:
        Text with markdown characters properly escaped

    Example:
        >>> from html2text.utils import escape_md_section
        >>> escape_md_section("1. Item\n2. Another", snob=True)
        '1\\. Item\n2\\. Another'
    """

```
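As a rough illustration of the escaping described above, the sketch below backslash-escapes brackets, parentheses, and backslashes with one regular expression. The names `MD_CHARS` and `escape_md_sketch` are hypothetical, and the pattern is an assumption made for illustration rather than html2text's own implementation.

```python
import re

# Hypothetical pattern: the characters this sketch treats as markdown-sensitive.
MD_CHARS = re.compile(r"([\\\[\]\(\)])")


def escape_md_sketch(text: str) -> str:
    """Prefix every matched character with a single backslash."""
    return MD_CHARS.sub(r"\\\1", text)


print(escape_md_sketch("Some [text] with (special) chars"))
# Some \[text\] with \(special\) chars
```

Escaping the backslash itself matters here: without it, an existing `\[` in the input would be indistinguishable from one the function added.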

### Table Formatting

Functions for formatting and aligning table content in text output.

```python { .api }

def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
    """
    Add padding to tables in text for consistent column alignment.

    Processes text containing markdown tables and adds appropriate padding
    to ensure all columns have consistent width for improved readability.

    Args:
        text: Text containing markdown tables to format
        right_margin: Additional padding spaces for right margin (default: 1)

    Returns:
        Text with properly padded and aligned tables

    Example:
        >>> table_text = "| Name | Age |\\n| Alice | 30 |\\n| Bob | 25 |"
        >>> padded = pad_tables_in_text(table_text)
        >>> print(padded)
        | Name | Age |
        | Alice | 30 |
        | Bob | 25 |
    """

def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Reformat table lines with consistent column widths.

    Takes raw table lines and reformats them with proper padding
    to create aligned columns.

    Args:
        lines: List of table row strings
        right_margin: Right margin padding in spaces

    Returns:
        List of reformatted table lines with consistent alignment
    """

```
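To make the idea of column padding concrete, here is a minimal standalone sketch that pads pipe-delimited rows so that each column shares one width. The function `pad_rows_sketch` is hypothetical and assumes a non-empty list of rows wrapped in outer pipes; the library's own functions may split and pad rows differently.

```python
from typing import List


def pad_rows_sketch(rows: List[str], right_margin: int = 1) -> List[str]:
    """Pad pipe-delimited rows so every column has the same width."""
    # Split each row into stripped cells, ignoring the outer pipes.
    table = [[cell.strip() for cell in row.strip().strip("|").split("|")] for row in rows]
    n_cols = max(len(cells) for cells in table)
    # Widest cell in each column, plus the requested right margin.
    widths = [
        max(len(cells[i]) for cells in table if i < len(cells)) + right_margin
        for i in range(n_cols)
    ]
    padded = []
    for cells in table:
        cells = cells + [""] * (n_cols - len(cells))  # fill short rows
        padded.append("| " + "| ".join(c.ljust(w) for c, w in zip(cells, widths)) + "|")
    return padded


for line in pad_rows_sketch(["| Name | Age |", "| Alice | 30 |", "| Bob | 25 |"]):
    print(line)
# | Name  | Age |
# | Alice | 30  |
# | Bob   | 25  |
```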

### CSS and Style Processing

Functions for parsing CSS styles and processing element styling, particularly useful for Google Docs HTML.

```python { .api }

def dumb_property_dict(style: str) -> Dict[str, str]:
    """
    Parse a CSS style string into a property dictionary.

    Takes a CSS style string (like from a style attribute) and converts
    it into a dictionary of property-value pairs.

    Args:
        style: CSS style string with semicolon-separated property declarations

    Returns:
        Dictionary mapping CSS property names to values (both lowercased)

    Example:
        >>> from html2text.utils import dumb_property_dict
        >>> style = "color: red; font-size: 14px; font-weight: bold"
        >>> props = dumb_property_dict(style)
        >>> print(props)
        {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}
    """

def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
    """
    Parse CSS style definitions into a structured format.

    Simple CSS parser that extracts style rules and properties for
    processing HTML with inline styles or embedded CSS.

    Args:
        data: CSS string to parse

    Returns:
        Dictionary mapping selectors to property dictionaries

    Example:
        >>> css = "p { color: red; font-size: 14px; }"
        >>> parsed = dumb_css_parser(css)
        >>> print(parsed)
        {'p': {'color': 'red', 'font-size': '14px'}}
    """

def element_style(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str]
) -> Dict[str, str]:
    """
    Compute final style attributes for an HTML element.

    Combines parent styles, CSS class styles, and inline styles to
    determine the effective styling for an element.

    Args:
        attrs: HTML element attributes dictionary
        style_def: CSS style definitions from stylesheet
        parent_style: Inherited styles from parent elements

    Returns:
        Dictionary of final computed styles for the element
    """

def google_text_emphasis(style: Dict[str, str]) -> List[str]:
    """
    Extract text emphasis styles from Google Docs CSS.

    Analyzes CSS style properties to determine what text emphasis
    (bold, italic, underline, etc.) should be applied.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        List of emphasis style names found in the styles
    """

def google_fixed_width_font(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles specify a fixed-width (monospace) font.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if styles specify a monospace font family
    """

def google_has_height(style: Dict[str, str]) -> bool:
    """
    Check if CSS styles have explicit height defined.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        True if height property is explicitly set
    """

def google_list_style(style: Dict[str, str]) -> str:
    """
    Determine list type from Google Docs CSS styles.

    Args:
        style: Dictionary of CSS style properties

    Returns:
        'ul' for unordered lists, 'ol' for ordered lists
    """

```
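The way parent, class, and inline styles can be combined is easier to see in code. The following is a minimal sketch under stated assumptions: class rules are keyed as `.classname` in the stylesheet dictionary, and later sources override earlier ones (parent, then class, then inline). The names `parse_declarations` and `cascade_sketch` are hypothetical and do not belong to html2text.

```python
from typing import Dict, Optional


def parse_declarations(style: str) -> Dict[str, str]:
    """Split 'prop: value; ...' into a lowercased property dictionary."""
    pairs = (decl.split(":", 1) for decl in style.split(";") if ":" in decl)
    return {name.strip().lower(): value.strip().lower() for name, value in pairs}


def cascade_sketch(
    attrs: Dict[str, Optional[str]],
    style_def: Dict[str, Dict[str, str]],
    parent_style: Dict[str, str],
) -> Dict[str, str]:
    """Merge parent, class, and inline styles; later sources override earlier ones."""
    style = dict(parent_style)  # start from the inherited values
    for css_class in (attrs.get("class") or "").split():
        # Assumption: class rules are stored under ".classname" keys.
        style.update(style_def.get("." + css_class, {}))
    style.update(parse_declarations(attrs.get("style") or ""))  # inline wins
    return style


sheet = {".bold": {"font-weight": "bold", "color": "black"}}
result = cascade_sketch({"class": "bold", "style": "Color: Red"}, sheet, {"margin": "5px"})
print(result)
# {'margin': '5px', 'font-weight': 'bold', 'color': 'red'}
```

Copying `parent_style` before updating keeps the caller's dictionary unchanged, which matters when the same parent styles are reused for several child elements.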

### HTML Processing Utilities

Helper functions for processing HTML elements and attributes.

```python { .api }

def hn(tag: str) -> int:
    """
    Extract header level from HTML header tag name.

    Args:
        tag: HTML tag name (e.g., 'h1', 'h2', 'div')

    Returns:
        Header level (1-6) for header tags, 0 for non-header tags

    Example:
        >>> hn('h1')
        1
        >>> hn('h3')
        3
        >>> hn('div')
        0
    """

def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
    """
    Extract starting number from ordered list attributes.

    Args:
        attrs: HTML element attributes dictionary

    Returns:
        Starting number for ordered list (adjusted for 0-based indexing)

    Example:
        >>> attrs = {'start': '5'}
        >>> list_numbering_start(attrs)
        4  # Returns start - 1 for internal counting
    """

def skipwrap(
    para: str,
    wrap_links: bool,
    wrap_list_items: bool,
    wrap_tables: bool
) -> bool:
    """
    Determine if a paragraph should skip text wrapping.

    Analyzes paragraph content to decide whether it should be wrapped
    based on content type and wrapping configuration.

    Args:
        para: Paragraph text to analyze
        wrap_links: Whether to allow wrapping of links
        wrap_list_items: Whether to allow wrapping of list items
        wrap_tables: Whether to allow wrapping of tables

    Returns:
        True if paragraph should skip wrapping, False otherwise
    """

```
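The documented return values for header tags and list start attributes can be reproduced with a few lines of plain Python. This is only a sketch of the described behavior; `header_level_sketch` and `list_start_sketch` are hypothetical names, and falling back to 0 for a non-numeric `start` value is an assumption of the sketch.

```python
from typing import Dict, Optional


def header_level_sketch(tag: str) -> int:
    """Return 1-6 for 'h1'..'h6' and 0 for any other tag name."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return 0


def list_start_sketch(attrs: Dict[str, Optional[str]]) -> int:
    """Return the 'start' attribute minus one, or 0 if it is missing or invalid."""
    try:
        return int(attrs.get("start") or 1) - 1
    except ValueError:
        return 0  # assumption: non-numeric values fall back to the default


print(header_level_sketch("h3"))          # 3
print(header_level_sketch("div"))         # 0
print(list_start_sketch({"start": "5"}))  # 4
```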

### Character and Entity Processing

Functions for handling HTML entities and character replacements.

```python { .api }

# Character mapping constants
unifiable_n: Dict[int, str]
"""Mapping of Unicode code points to ASCII replacements."""

control_character_replacements: Dict[int, int]
"""Mapping of control characters to their Unicode replacements."""

```
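A mapping keyed by code point, like the ones above, can be applied to a string with `str.translate`. The sketch below uses a small hypothetical table, `ASCII_FALLBACKS`, whose entries are chosen for illustration only and are not copied from html2text.

```python
from typing import Dict

# Hypothetical mapping for illustration: code point -> ASCII replacement.
ASCII_FALLBACKS: Dict[int, str] = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2014: "--",  # em dash
}


def to_ascii_sketch(text: str) -> str:
    """Replace mapped code points and leave every other character untouched."""
    return text.translate(ASCII_FALLBACKS)


print(to_ascii_sketch("\u201cIt\u2019s done\u201d \u2014 ok"))
# "It's done" -- ok
```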

## Usage Examples

### Text Escaping

```python

from html2text.utils import escape_md, escape_md_section

# Basic markdown escaping
text = "Some [bracketed] text with (parentheses)"
escaped = escape_md(text)
print(escaped)  # Some \[bracketed\] text with \(parentheses\)

# Section-level escaping with additional safety
content = """
1. First item
2. Second item
*Some emphasized text*
`Code with backticks`
"""

safe_content = escape_md_section(content, snob=True)
print(safe_content)

```

### Table Processing

```python

from html2text.utils import pad_tables_in_text

# Raw table text with inconsistent spacing
table_text = """
| Name | Age | City |
| Alice | 30 | New York |
| Bob | 25 | London |
| Charlie | 35 | Paris |
"""

# Add padding for consistent alignment
padded_table = pad_tables_in_text(table_text)
print(padded_table)
# Padded tables have consistent column widths.
# Note (assumption about html2text internals): padding appears to apply only to
# table regions delimited by the converter's internal table marker, so plain
# text without that marker may be returned unchanged.

```

### CSS Processing

```python

from html2text.utils import dumb_css_parser, dumb_property_dict, element_style

# Parse inline CSS styles
inline_style = "color: red; font-size: 14px; font-weight: bold"
props = dumb_property_dict(inline_style)
print(props)
# Output: {'color': 'red', 'font-size': '14px', 'font-weight': 'bold'}

# Parse CSS styles
css_content = """
.bold { font-weight: bold; color: black; }
.italic { font-style: italic; }
p { margin: 10px; font-size: 14px; }
"""

styles = dumb_css_parser(css_content)
print(styles)

# Compute element styles
element_attrs = {
    'class': 'bold italic',
    'style': 'color: red; font-size: 16px;'
}

parent_styles = {'margin': '5px'}
final_styles = element_style(element_attrs, styles, parent_styles)
print(final_styles)
# Will combine class styles, inline styles, and parent styles

```

### HTML Tag Processing

```python

from html2text.utils import hn, list_numbering_start

# Extract header levels
print(hn('h1'))   # 1
print(hn('h3'))   # 3
print(hn('div'))  # 0

# Process list attributes
ol_attrs = {'start': '5', 'type': '1'}
start_num = list_numbering_start(ol_attrs)
print(start_num)  # 4 (adjusted for 0-based counting)

```

### Wrapping Analysis

```python

from html2text.utils import skipwrap

# Test different paragraph types
paragraphs = [
    "Regular paragraph text that can be wrapped normally.",
    "    This is a code block with leading spaces",
    "* This is a list item that might not wrap",
    "Here's a paragraph with [a link](http://example.com) in it.",
    "| Name | Age | - this looks like a table"
]

for para in paragraphs:
    should_skip = skipwrap(para, wrap_links=True, wrap_list_items=False, wrap_tables=False)
    print(f"Skip wrapping: {should_skip} - {para[:30]}...")

```

### Google Docs Style Processing

```python

from html2text.utils import (
    google_text_emphasis,
    google_fixed_width_font,
    google_list_style
)

# Analyze Google Docs styles
gdoc_style = {
    'font-weight': 'bold',
    'font-style': 'italic',
    'text-decoration': 'underline',
    'font-family': 'courier new'
}

emphasis = google_text_emphasis(gdoc_style)
print(f"Emphasis styles: {emphasis}")

is_monospace = google_fixed_width_font(gdoc_style)
print(f"Monospace font: {is_monospace}")

list_style = {
    'list-style-type': 'disc'
}
list_type = google_list_style(list_style)
print(f"List type: {list_type}")

```