# HTML Parsing with Modest Engine

The primary HTML5 parser using the Modest engine. Provides comprehensive parsing capabilities with automatic encoding detection, CSS selector support, and DOM manipulation methods for extracting and modifying HTML content.

## Capabilities

### HTMLParser Class

Main parser class that handles HTML document parsing with automatic encoding detection and provides access to the parsed DOM tree.
```python { .api }
class HTMLParser:
    def __init__(
        self,
        html: str | bytes,
        detect_encoding: bool = True,
        use_meta_tags: bool = True,
        decode_errors: str = 'ignore'
    ):
        """
        Initialize HTML parser with content.

        Parameters:
        - html: HTML content as string or bytes
        - detect_encoding: Auto-detect encoding for bytes input
        - use_meta_tags: Use HTML meta tags for encoding detection
        - decode_errors: Error handling ('ignore', 'strict', 'replace')
        """
```

**Usage Example:**
```python
from selectolax.parser import HTMLParser

# Parse from string
parser = HTMLParser('<div>Hello <strong>world</strong>!</div>')

# Parse from bytes with encoding detection
html_bytes = b'<div>Caf\xe9</div>'
parser = HTMLParser(html_bytes, detect_encoding=True)

# Parse with strict error handling
parser = HTMLParser(html_content, decode_errors='strict')
```

### CSS Selector Methods

Query the DOM tree using CSS selectors to find matching elements.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find all elements matching CSS selector.

    Parameters:
    - query: CSS selector string

    Returns:
    List of Node objects matching the selector
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first element matching CSS selector.

    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, raise error when multiple matches exist

    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**
```python
# Find all paragraphs
paragraphs = parser.css('p')

# Find first heading with class
heading = parser.css_first('h1.title')

# Find with default value
nav = parser.css_first('nav', default=None)

# Strict mode - error if multiple matches
unique_element = parser.css_first('#unique-id', strict=True)

# Complex selectors
items = parser.css('div.content > ul li:nth-child(odd)')
```

### Tag-Based Selection

Select elements by tag name for simple element retrieval.

```python { .api }
def tags(self, name: str) -> list[Node]:
    """
    Find all elements with specified tag name.

    Parameters:
    - name: HTML tag name (e.g., 'div', 'p', 'a')

    Returns:
    List of Node objects with matching tag name
    """
```

**Usage Example:**
```python
# Get all links
links = parser.tags('a')

# Get all images
images = parser.tags('img')

# Get all divs
divs = parser.tags('div')
```

### Text Extraction

Extract text content from the parsed document.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from document body.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different elements
    - strip: Apply str.strip() to each text part

    Returns:
    Extracted text content as string
    """
```

**Usage Example:**
```python
# Get all text content
all_text = parser.text()

# Get text with custom separator
spaced_text = parser.text(separator=' | ')

# Get cleaned text
clean_text = parser.text(strip=True)

# Get only direct text (no children)
direct_text = parser.text(deep=False)
```

### DOM Tree Access

Access key parts of the HTML document structure.

```python { .api }
@property
def root(self) -> Node | None:
    """Returns root HTML element node."""

@property
def head(self) -> Node | None:
    """Returns HTML head element node."""

@property
def body(self) -> Node | None:
    """Returns HTML body element node."""

@property
def input_encoding(self) -> str:
    """Returns detected/used character encoding."""

@property
def raw_html(self) -> bytes:
    """Returns raw HTML bytes used for parsing."""

@property
def html(self) -> str | None:
    """Returns HTML representation of the entire document."""
```

**Usage Example:**
```python
# Access document structure
root = parser.root
head = parser.head
body = parser.body

# Check encoding
encoding = parser.input_encoding  # e.g., 'UTF-8'

# Get original bytes
original = parser.raw_html
```

### DOM Manipulation

Modify the HTML document structure by removing unwanted elements.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified tags from document.

    Parameters:
    - tags: List of tag names to remove
    - recursive: Remove all child nodes with the tag
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Remove tag wrappers while keeping content.

    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**
```python
# Remove script and style tags
parser.strip_tags(['script', 'style', 'noscript'])

# Remove tags recursively (including children)
parser.strip_tags(['iframe', 'object'], recursive=True)

# Unwrap formatting tags while keeping text
parser.unwrap_tags(['b', 'i', 'strong', 'em'])

# Clean up empty tags after unwrapping
parser.unwrap_tags(['span', 'div'], delete_empty=True)
```

### Advanced Selection and Matching

Additional methods for advanced element selection and content matching.

```python { .api }
def select(self, query: str | None = None) -> Selector:
    """
    Create advanced selector object with chaining support.

    Parameters:
    - query: Optional initial CSS selector

    Returns:
    Selector object supporting method chaining and filtering
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if any CSS selectors match elements in document.

    Parameters:
    - selectors: Tuple of CSS selector strings

    Returns:
    True if any selector matches elements, False otherwise
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any script tag contains specified text.

    Caches script tags on first call for performance.

    Parameters:
    - query: Text to search for in script content

    Returns:
    True if any script contains the text, False otherwise
    """

def script_srcs_contain(self, queries: tuple[str, ...]) -> bool:
    """
    Check if any script src attribute contains specified text.

    Caches values on first call for performance.

    Parameters:
    - queries: Tuple of text strings to search for in src attributes

    Returns:
    True if any script src contains any query text, False otherwise
    """
```

**Usage Example:**
```python
# Advanced selector with chaining
advanced_selector = parser.select('div.content')
# Further operations can be chained on the selector

# Check for CSS matches across document
important_selectors = ('.error', '.warning', '.critical')
has_important = parser.any_css_matches(important_selectors)

# Script content analysis
has_analytics = parser.scripts_contain('google-analytics')
has_tracking = parser.scripts_contain('facebook')

# Script source analysis
ad_scripts = ('ads.js', 'doubleclick', 'adsystem')
has_ads = parser.script_srcs_contain(ad_scripts)

# Content filtering based on scripts
if has_analytics or has_ads:
    print("Page contains tracking or ads")
    # Remove or flag for privacy
```
### Utility Functions

Additional utility functions for HTML element creation and parsing.

```python { .api }
def create_tag(tag: str) -> Node:
    """
    Create a new HTML element with specified tag name.

    Parameters:
    - tag: HTML tag name (e.g., 'div', 'p', 'img')

    Returns:
    New Node element with the specified tag
    """

def parse_fragment(html: str) -> list[Node]:
    """
    Parse HTML fragment into list of nodes without adding wrapper elements.

    Unlike HTMLParser, which adds missing html/head/body tags, this function
    returns nodes exactly as specified in the HTML fragment.

    Parameters:
    - html: HTML fragment string to parse

    Returns:
    List of Node objects representing the parsed HTML fragment
    """
```

**Usage Example:**
```python
from selectolax.parser import create_tag, parse_fragment

# Create new elements
div = create_tag('div')
paragraph = create_tag('p')
link = create_tag('a')

# Parse HTML fragments without wrappers
fragment_html = '<li>Item 1</li><li>Item 2</li><li>Item 3</li>'
list_items = parse_fragment(fragment_html)

# Use in DOM manipulation
container = create_tag('ul')
for item in list_items:
    container.insert_child(item)

print(container.html)  # <ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>
```

### Document Cloning

Create independent copies of parsed documents.

```python { .api }
def clone(self) -> HTMLParser:
    """
    Create a deep copy of the entire parsed document.

    Returns:
    New HTMLParser instance with identical content
    """
```

**Usage Example:**
```python
# Clone document for safe manipulation
original = HTMLParser(html_content)
copy = original.clone()

# Modify copy without affecting original
copy.strip_tags(['script', 'style'])
clean_text = copy.text(strip=True)

# Original remains unchanged
original_text = original.text()
```

### Text Processing

Advanced text manipulation methods for better text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes to improve text extraction quality.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**
```python
# Clean up text after tag manipulation
parser = HTMLParser('<div><strong>Hello</strong> world!</div>')

# Unwrapping leaves the text split across adjacent text nodes
parser.unwrap_tags(['strong'])
print(parser.text(separator=' '))  # Fragmented nodes add a double space: "Hello  world!"

# Merge text nodes for cleaner output
parser.merge_text_nodes()
print(parser.text(separator=' '))  # Clean output: "Hello world!"
```