# HTML Parsing with Modest Engine

The primary HTML5 parser using the Modest engine. Provides comprehensive parsing capabilities with automatic encoding detection, CSS selector support, and DOM manipulation methods for extracting and modifying HTML content.

## Capabilities

### HTMLParser Class

Main parser class that handles HTML document parsing with automatic encoding detection and provides access to the parsed DOM tree.
```python { .api }
class HTMLParser:
    def __init__(
        self,
        html: str | bytes,
        detect_encoding: bool = True,
        use_meta_tags: bool = True,
        decode_errors: str = 'ignore'
    ):
        """
        Initialize HTML parser with content.

        Parameters:
        - html: HTML content as string or bytes
        - detect_encoding: Auto-detect encoding for bytes input
        - use_meta_tags: Use HTML meta tags for encoding detection
        - decode_errors: Error handling ('ignore', 'strict', 'replace')
        """
```

**Usage Example:**
```python
from selectolax.parser import HTMLParser

# Parse from string
parser = HTMLParser('<div>Hello <strong>world</strong>!</div>')

# Parse from bytes with encoding detection
html_bytes = b'<div>Caf\xe9</div>'
parser = HTMLParser(html_bytes, detect_encoding=True)

# Parse with strict error handling
parser = HTMLParser(html_content, decode_errors='strict')
```

### CSS Selector Methods

Query the DOM tree using CSS selectors to find matching elements.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find all elements matching CSS selector.

    Parameters:
    - query: CSS selector string

    Returns:
    List of Node objects matching the selector
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first element matching CSS selector.

    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, raise error when multiple matches exist

    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**
```python
# Find all paragraphs
paragraphs = parser.css('p')

# Find first heading with class
heading = parser.css_first('h1.title')

# Find with default value
nav = parser.css_first('nav', default=None)

# Strict mode - error if multiple matches
unique_element = parser.css_first('#unique-id', strict=True)

# Complex selectors
items = parser.css('div.content > ul li:nth-child(odd)')
```

### Tag-Based Selection

Select elements by tag name for simple element retrieval.

```python { .api }
def tags(self, name: str) -> list[Node]:
    """
    Find all elements with specified tag name.

    Parameters:
    - name: HTML tag name (e.g., 'div', 'p', 'a')

    Returns:
    List of Node objects with matching tag name
    """
```

**Usage Example:**
```python
# Get all links
links = parser.tags('a')

# Get all images
images = parser.tags('img')

# Get all divs
divs = parser.tags('div')
```

### Text Extraction

Extract text content from the parsed document.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from document body.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different elements
    - strip: Apply str.strip() to each text part

    Returns:
    Extracted text content as string
    """
```

**Usage Example:**
```python
# Get all text content
all_text = parser.text()

# Get text with custom separator
spaced_text = parser.text(separator=' | ')

# Get cleaned text
clean_text = parser.text(strip=True)

# Get only direct text (no children)
direct_text = parser.text(deep=False)
```

### DOM Tree Access

Access key parts of the HTML document structure.

```python { .api }
@property
def root(self) -> Node | None:
    """Returns root HTML element node."""

@property
def head(self) -> Node | None:
    """Returns HTML head element node."""

@property
def body(self) -> Node | None:
    """Returns HTML body element node."""

@property
def input_encoding(self) -> str:
    """Returns detected/used character encoding."""

@property
def raw_html(self) -> bytes:
    """Returns raw HTML bytes used for parsing."""

@property
def html(self) -> str | None:
    """Returns HTML representation of the entire document."""
```

**Usage Example:**
```python
# Access document structure
root = parser.root
head = parser.head
body = parser.body

# Check encoding
encoding = parser.input_encoding  # e.g., 'UTF-8'

# Get original bytes
original = parser.raw_html
```

### DOM Manipulation

Modify the HTML document structure by removing unwanted elements.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified tags from document.

    Parameters:
    - tags: List of tag names to remove
    - recursive: Remove all child nodes with the tag
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Remove tag wrappers while keeping content.

    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**
```python
# Remove script and style tags
parser.strip_tags(['script', 'style', 'noscript'])

# Remove tags recursively (including children)
parser.strip_tags(['iframe', 'object'], recursive=True)

# Unwrap formatting tags while keeping text
parser.unwrap_tags(['b', 'i', 'strong', 'em'])

# Clean up empty tags after unwrapping
parser.unwrap_tags(['span', 'div'], delete_empty=True)
```

### Advanced Selection and Matching

Additional methods for advanced element selection and content matching.

```python { .api }
def select(self, query: str | None = None) -> Selector:
    """
    Create advanced selector object with chaining support.

    Parameters:
    - query: Optional initial CSS selector

    Returns:
    Selector object supporting method chaining and filtering
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if any CSS selectors match elements in document.

    Parameters:
    - selectors: Tuple of CSS selector strings

    Returns:
    True if any selector matches elements, False otherwise
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any script tag contains specified text.

    Caches script tags on first call for performance.

    Parameters:
    - query: Text to search for in script content

    Returns:
    True if any script contains the text, False otherwise
    """

def script_srcs_contain(self, queries: tuple[str, ...]) -> bool:
    """
    Check if any script src attribute contains specified text.

    Caches values on first call for performance.

    Parameters:
    - queries: Tuple of text strings to search for in src attributes

    Returns:
    True if any script src contains any query text, False otherwise
    """
```

**Usage Example:**
```python
# Advanced selector with chaining
advanced_selector = parser.select('div.content')
# Further operations can be chained on the selector

# Check for CSS matches across document
important_selectors = ('.error', '.warning', '.critical')
has_important = parser.any_css_matches(important_selectors)

# Script content analysis
has_analytics = parser.scripts_contain('google-analytics')
has_tracking = parser.scripts_contain('facebook')

# Script source analysis
ad_scripts = ('ads.js', 'doubleclick', 'adsystem')
has_ads = parser.script_srcs_contain(ad_scripts)

# Content filtering based on scripts
if has_analytics or has_ads:
    print("Page contains tracking or ads")
    # Remove or flag for privacy
```
### Utility Functions

Additional utility functions for HTML element creation and parsing.

```python { .api }
def create_tag(tag: str) -> Node:
    """
    Create a new HTML element with specified tag name.

    Parameters:
    - tag: HTML tag name (e.g., 'div', 'p', 'img')

    Returns:
    New Node element with the specified tag
    """

def parse_fragment(html: str) -> list[Node]:
    """
    Parse HTML fragment into list of nodes without adding wrapper elements.

    Unlike HTMLParser, which adds missing html/head/body tags, this function
    returns nodes exactly as specified in the HTML fragment.

    Parameters:
    - html: HTML fragment string to parse

    Returns:
    List of Node objects representing the parsed HTML fragment
    """
```

**Usage Example:**
```python
from selectolax.parser import create_tag, parse_fragment

# Create new elements
div = create_tag('div')
paragraph = create_tag('p')
link = create_tag('a')

# Parse HTML fragments without wrappers
fragment_html = '<li>Item 1</li><li>Item 2</li><li>Item 3</li>'
list_items = parse_fragment(fragment_html)

# Use in DOM manipulation
container = create_tag('ul')
for item in list_items:
    container.insert_child(item)

print(container.html)  # <ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>
```

### Document Cloning

Create independent copies of parsed documents.

```python { .api }
def clone(self) -> HTMLParser:
    """
    Create a deep copy of the entire parsed document.

    Returns:
    New HTMLParser instance with identical content
    """
```

**Usage Example:**
```python
# Clone document for safe manipulation
original = HTMLParser(html_content)
copy = original.clone()

# Modify copy without affecting original
copy.strip_tags(['script', 'style'])
clean_text = copy.text(strip=True)

# Original remains unchanged
original_text = original.text()
```

### Text Processing

Advanced text manipulation methods for better text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes to improve text extraction quality.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**
```python
# Clean up text after tag manipulation
parser = HTMLParser('<div><strong>Hello</strong> world!</div>')

# Unwrapping leaves the text split across adjacent text nodes
parser.unwrap_tags(['strong'])
print(parser.text(separator=' '))  # Fragmented nodes add a double space: "Hello  world!"

# Merge text nodes for cleaner output
parser.merge_text_nodes()
print(parser.text(separator=' '))  # Clean output: "Hello world!"
```