
# HTML Parsing with Modest Engine

The primary HTML5 parser, built on the Modest engine. It provides automatic encoding detection, CSS selector support, and DOM manipulation methods for extracting and modifying HTML content.

## Capabilities

### HTMLParser Class

Main parser class that handles HTML document parsing with automatic encoding detection and provides access to the parsed DOM tree.

```python { .api }
class HTMLParser:
    def __init__(
        self,
        html: str | bytes,
        detect_encoding: bool = True,
        use_meta_tags: bool = True,
        decode_errors: str = 'ignore'
    ):
        """
        Initialize HTML parser with content.

        Parameters:
        - html: HTML content as string or bytes
        - detect_encoding: Auto-detect encoding for bytes input
        - use_meta_tags: Use HTML meta tags for encoding detection
        - decode_errors: Error handling ('ignore', 'strict', 'replace')
        """
```

**Usage Example:**

```python
from selectolax.parser import HTMLParser

# Parse from string
parser = HTMLParser('<div>Hello <strong>world</strong>!</div>')

# Parse from bytes with encoding detection
html_bytes = b'<div>Caf\xe9</div>'
parser = HTMLParser(html_bytes, detect_encoding=True)

# Parse with strict error handling
parser = HTMLParser(html_content, decode_errors='strict')
```

### CSS Selector Methods

Query the DOM tree using CSS selectors to find matching elements.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find all elements matching CSS selector.

    Parameters:
    - query: CSS selector string

    Returns:
    List of Node objects matching the selector
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first element matching CSS selector.

    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, raise error when multiple matches exist

    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**

```python
# Find all paragraphs
paragraphs = parser.css('p')

# Find first heading with class
heading = parser.css_first('h1.title')

# Find with default value
nav = parser.css_first('nav', default=None)

# Strict mode - error if multiple matches
unique_element = parser.css_first('#unique-id', strict=True)

# Complex selectors
items = parser.css('div.content > ul li:nth-child(odd)')
```

### Tag-Based Selection

Select elements by tag name for simple element retrieval.

```python { .api }
def tags(self, name: str) -> list[Node]:
    """
    Find all elements with specified tag name.

    Parameters:
    - name: HTML tag name (e.g., 'div', 'p', 'a')

    Returns:
    List of Node objects with matching tag name
    """
```

**Usage Example:**

```python
# Get all links
links = parser.tags('a')

# Get all images
images = parser.tags('img')

# Get all divs
divs = parser.tags('div')
```

### Text Extraction

Extract text content from the parsed document.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from document body.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different elements
    - strip: Apply str.strip() to each text part

    Returns:
    Extracted text content as string
    """
```

**Usage Example:**

```python
# Get all text content
all_text = parser.text()

# Get text with custom separator
spaced_text = parser.text(separator=' | ')

# Get cleaned text
clean_text = parser.text(strip=True)

# Get only direct text (no children)
direct_text = parser.text(deep=False)
```

### DOM Tree Access

Access key parts of the HTML document structure.

```python { .api }
@property
def root(self) -> Node | None:
    """Returns root HTML element node."""

@property
def head(self) -> Node | None:
    """Returns HTML head element node."""

@property
def body(self) -> Node | None:
    """Returns HTML body element node."""

@property
def input_encoding(self) -> str:
    """Returns detected/used character encoding."""

@property
def raw_html(self) -> bytes:
    """Returns raw HTML bytes used for parsing."""

@property
def html(self) -> str | None:
    """Returns HTML representation of the entire document."""
```

**Usage Example:**

```python
# Access document structure
root = parser.root
head = parser.head
body = parser.body

# Check encoding
encoding = parser.input_encoding  # e.g., 'UTF-8'

# Get original bytes
original = parser.raw_html
```

### DOM Manipulation

Modify the HTML document structure by removing unwanted elements.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified tags from document.

    Parameters:
    - tags: List of tag names to remove
    - recursive: Remove all child nodes with the tag
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Remove tag wrappers while keeping content.

    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**

```python
# Remove script and style tags
parser.strip_tags(['script', 'style', 'noscript'])

# Remove tags recursively (including children)
parser.strip_tags(['iframe', 'object'], recursive=True)

# Unwrap formatting tags while keeping text
parser.unwrap_tags(['b', 'i', 'strong', 'em'])

# Clean up empty tags after unwrapping
parser.unwrap_tags(['span', 'div'], delete_empty=True)
```

### Advanced Selection and Matching

Additional methods for advanced element selection and content matching.

```python { .api }
def select(self, query: str | None = None) -> Selector:
    """
    Create advanced selector object with chaining support.

    Parameters:
    - query: Optional initial CSS selector

    Returns:
    Selector object supporting method chaining and filtering
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if any CSS selectors match elements in document.

    Parameters:
    - selectors: Tuple of CSS selector strings

    Returns:
    True if any selector matches elements, False otherwise
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any script tag contains specified text.

    Caches script tags on first call for performance.

    Parameters:
    - query: Text to search for in script content

    Returns:
    True if any script contains the text, False otherwise
    """

def script_srcs_contain(self, queries: tuple[str, ...]) -> bool:
    """
    Check if any script src attribute contains specified text.

    Caches values on first call for performance.

    Parameters:
    - queries: Tuple of text strings to search for in src attributes

    Returns:
    True if any script src contains any query text, False otherwise
    """
```

**Usage Example:**

```python
# Advanced selector with chaining
advanced_selector = parser.select('div.content')
# Further operations can be chained on the selector

# Check for CSS matches across document
important_selectors = ('.error', '.warning', '.critical')
has_important = parser.any_css_matches(important_selectors)

# Script content analysis
has_analytics = parser.scripts_contain('google-analytics')
has_tracking = parser.scripts_contain('facebook')

# Script source analysis
ad_scripts = ('ads.js', 'doubleclick', 'adsystem')
has_ads = parser.script_srcs_contain(ad_scripts)

# Content filtering based on scripts
if has_analytics or has_ads:
    print("Page contains tracking or ads")
    # Remove or flag for privacy
```

### Utility Functions

Additional utility functions for HTML element creation and parsing.

```python { .api }
def create_tag(tag: str) -> Node:
    """
    Create a new HTML element with specified tag name.

    Parameters:
    - tag: HTML tag name (e.g., 'div', 'p', 'img')

    Returns:
    New Node element with the specified tag
    """

def parse_fragment(html: str) -> list[Node]:
    """
    Parse HTML fragment into list of nodes without adding wrapper elements.

    Unlike HTMLParser, which adds missing html/head/body tags, this function
    returns nodes exactly as specified in the HTML fragment.

    Parameters:
    - html: HTML fragment string to parse

    Returns:
    List of Node objects representing the parsed HTML fragment
    """
```

**Usage Example:**

```python
from selectolax.parser import create_tag, parse_fragment

# Create new elements
div = create_tag('div')
paragraph = create_tag('p')
link = create_tag('a')

# Parse HTML fragments without wrappers
fragment_html = '<li>Item 1</li><li>Item 2</li><li>Item 3</li>'
list_items = parse_fragment(fragment_html)

# Use in DOM manipulation
container = create_tag('ul')
for item in list_items:
    container.insert_child(item)

print(container.html)  # <ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>
```

### Document Cloning

Create independent copies of parsed documents.

```python { .api }
def clone(self) -> HTMLParser:
    """
    Create a deep copy of the entire parsed document.

    Returns:
    New HTMLParser instance with identical content
    """
```

**Usage Example:**

```python
# Clone document for safe manipulation
original = HTMLParser(html_content)
copy = original.clone()

# Modify copy without affecting original
copy.strip_tags(['script', 'style'])
clean_text = copy.text(strip=True)

# Original remains unchanged
original_text = original.text()
```

### Text Processing

Advanced text manipulation methods that improve text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes to improve text extraction quality.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**

```python
# Clean up text after tag manipulation
parser = HTMLParser('<div><strong>Hello</strong> world!</div>')
content = parser.css_first('div')

# Remove formatting tags
parser.unwrap_tags(['strong'])
print(parser.text())  # Text may still be split across adjacent nodes: "Hello world!"

# Merge text nodes for cleaner output
parser.merge_text_nodes()
print(parser.text())  # Clean output: "Hello world!"
```