Tessl Tile for pypi/selectolax@0.3.0

# DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, structural navigation, and DOM modifications for both Node (Modest engine) and LexborNode (Lexbor engine) types.

## Capabilities

### Node Classes

HTML element representation with full DOM manipulation capabilities.

```python { .api }
class Node:
    """HTML node using Modest engine."""
    pass

class LexborNode:
    """HTML node using Lexbor engine."""
    pass
```

Both classes expose largely the same methods and properties; members that exist on only one engine are marked "Node only" or "LexborNode only" below.

### CSS Selection on Nodes

Apply CSS selectors to specific nodes for scoped element searching.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find descendant elements matching CSS selector.
    
    Parameters:
    - query: CSS selector string
    
    Returns:
    List of Node objects matching selector within this node's subtree
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first descendant element matching CSS selector.
    
    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, raise an error when more than one match exists
    
    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**
```python
# `parser` is an already constructed HTMLParser or LexborHTMLParser
# Find within specific container
container = parser.css_first('div.content')
if container:
    # Search only within container
    links = container.css('a')
    first_paragraph = container.css_first('p')
    
    # Nested selection
    important_items = container.css('ul.important li')
```

### Text Content Extraction

Extract text content from individual nodes with flexible formatting options.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from this node.
    
    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different child elements
    - strip: Apply str.strip() to each text part
    
    Returns:
    Text content as string
    """
```

**Usage Example:**
```python
# Get text from specific element
title = parser.css_first('h1').text()

# Get text with custom formatting
nav_text = nav_element.text(separator=' | ', strip=True)

# Get only direct text (no children)
button_text = button_element.text(deep=False)

# Extract from multiple elements
article_texts = [p.text(strip=True) for p in article.css('p')]
```

### Node Properties

Access structural information and content of HTML nodes.

```python { .api }
@property
def tag(self) -> str:
    """HTML tag name (e.g., 'div', 'p', 'a')."""

@property
def attributes(self) -> dict:
    """Read-only dictionary of element attributes."""

@property
def attrs(self) -> AttributeDict:
    """Mutable dictionary-like access to element attributes."""

@property
def parent(self) -> Node | None:
    """Parent node in DOM tree."""

@property
def next(self) -> Node | None:
    """Next sibling node."""

@property
def prev(self) -> Node | None:
    """Previous sibling node."""

@property
def child(self) -> Node | None:
    """First child node."""

@property
def last_child(self) -> Node | None:
    """Last child node."""

@property
def html(self) -> str:
    """HTML representation of this node and its children."""

@property
def id(self) -> str | None:
    """HTML id attribute value (Node only)."""

@property
def mem_id(self) -> int:
    """Memory address identifier for the node."""

@property
def tag_id(self) -> int:
    """Numeric tag identifier (LexborNode only)."""

@property
def first_child(self) -> Node | None:
    """First child node (alias for child in LexborNode)."""

@property
def raw_value(self) -> bytes:
    """Raw unparsed value of text node (Node only)."""

@property
def text_content(self) -> str | None:
    """Text content of this specific node only (not children)."""
```

**Usage Example:**
```python
# Access node properties
element = parser.css_first('div.content')

tag_name = element.tag  # 'div'
class_attr = element.attributes['class']  # 'content' (read-only)
parent_element = element.parent
next_sibling = element.next

# Navigate DOM tree
first_child = element.child
last_child = element.last_child

# Get HTML output
html_content = element.html

# Access additional properties
element_id = element.id  # HTML id attribute (if exists)
memory_id = element.mem_id  # Unique memory identifier

# Direct text content (no children)
text_node = parser.css_first('p').child  # Get text node
if text_node and text_node.text_content:
    direct_text = text_node.text_content  # Text of this node only
```

192
### Attribute Management
193

194
Dictionary-like interface for accessing and modifying HTML attributes.
195

196
```python { .api }
197
class AttributeDict:
198
    def __getitem__(self, key: str) -> str | None:
199
        """Get attribute value by name."""
200
    
201
    def __setitem__(self, key: str, value: str) -> None:
202
        """Set attribute value."""
203
    
204
    def __delitem__(self, key: str) -> None:
205
        """Remove attribute."""
206
    
207
    def __contains__(self, key: str) -> bool:
208
        """Check if attribute exists."""
209
    
210
    def get(self, key: str, default=None) -> str | None:
211
        """Get attribute with default value."""
212
    
213
    def sget(self, key: str, default: str = "") -> str:
214
        """Get attribute, return empty string for None values."""
215
    
216
    def keys(self) -> Iterator[str]:
217
        """Iterator over attribute names."""
218
    
219
    def values(self) -> Iterator[str | None]:
220
        """Iterator over attribute values."""
221
    
222
    def items(self) -> Iterator[tuple[str, str | None]]:
223
        """Iterator over (name, value) pairs."""
224
```
225

226
**Usage Example:**
227
```python
228
# Access attributes (read-only)
229
link = parser.css_first('a')
230
read_only_attrs = link.attributes  # dict
231
href = read_only_attrs['href']
232

233
# Access mutable attributes
234
attrs = link.attrs  # AttributeDict
235

236
# Get attributes with different methods
237
href = attrs['href']
238
title = attrs.get('title', 'No title')
239
class_name = attrs.sget('class', 'no-class')  # Returns "" instead of None
240

241
# Set and modify attributes (only works with attrs, not attributes)
242
attrs['target'] = '_blank'
243
attrs['rel'] = 'noopener'
244

245
# Check existence
246
has_id = 'id' in attrs
247

248
# Remove attributes
249
del attrs['onclick']
250

251
# Iterate attributes
252
for name, value in attrs.items():
253
    print(f"{name}: {value}")
254

255
# Read-only vs mutable comparison
256
print(link.attributes)  # {'href': 'example.com', 'class': 'link'}
257
link.attrs['new-attr'] = 'value'
258
print(link.attributes)  # {'href': 'example.com', 'class': 'link', 'new-attr': 'value'}
259
```
260

### DOM Modification

Modify document structure by adding, removing, and replacing elements.

```python { .api }
def remove(self) -> None:
    """Remove this node from DOM tree."""

def decompose(self) -> None:
    """Remove and destroy this node and all children."""

def unwrap(self) -> None:
    """Remove tag wrapper while keeping child content."""

def replace_with(self, value: str | bytes | Node) -> None:
    """Replace this node with text or another node."""

def insert_before(self, value: str | bytes | Node) -> None:
    """Insert text or node before this node."""

def insert_after(self, value: str | bytes | Node) -> None:
    """Insert text or node after this node."""

def insert_child(self, value: str | bytes | Node) -> None:
    """Insert text or node as child (at end) of this node."""
```

**Usage Example:**
```python
# Remove elements
script_tags = parser.css('script')
for script in script_tags:
    script.remove()

# Destroy elements completely
ads = parser.css('.advertisement')
for ad in ads:
    ad.decompose()

# Unwrap formatting tags
bold_tags = parser.css('b')
for bold in bold_tags:
    bold.unwrap()  # Keeps text, removes <b> wrapper

# Replace with text
old_img = parser.css_first('img')
if old_img:
    alt_text = old_img.attributes.get('alt', 'Image')
    old_img.replace_with(alt_text)  # Replace with text

# Replace with another node (assumes `parser` is a LexborHTMLParser)
from selectolax.lexbor import create_tag
next_img = parser.css_first('img')  # Next remaining <img>, if any
if next_img:
    new_img = create_tag('img')  # create_tag takes only the tag name
    new_img.attrs['src'] = 'new.jpg'
    new_img.attrs['alt'] = 'New image'
    next_img.replace_with(new_img)

# Insert text and nodes
container = parser.css_first('div.content')
container.insert_child('Added text at end')
container.insert_after('Text after container')
container.insert_before('Text before container')

# Insert HTML elements
new_paragraph = create_tag('p')
new_paragraph.attrs['class'] = 'inserted'
container.insert_child(new_paragraph)
```

### Bulk Operations

Perform operations on multiple elements efficiently.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified tags from this node's subtree.
    
    Parameters:
    - tags: List of tag names to remove
    - recursive: Also delete the child nodes of each removed tag
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Unwrap specified child tags while keeping content.
    
    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**
```python
# Clean up content section
content = parser.css_first('div.content')
if content:
    # Remove unwanted tags
    content.strip_tags(['script', 'style', 'noscript'])
    
    # Unwrap formatting tags
    content.unwrap_tags(['span', 'font'], delete_empty=True)

# Process article content
article = parser.css_first('article')
if article:
    # Remove all ads and tracking
    article.strip_tags(['iframe', 'object', 'embed'], recursive=True)
    
    # Clean up empty containers
    article.unwrap_tags(['div', 'span'], delete_empty=True)
```

### Node Iteration and Traversal

Iterate through child nodes and traverse the DOM tree structure.

```python { .api }
def iter(self, include_text: bool = False) -> Iterator[Node]:
    """
    Iterate over child nodes at current level (Node only).
    
    Parameters:
    - include_text: Include text nodes in iteration
    
    Yields:
    Node objects for each child element
    """

def traverse(self, include_text: bool = False) -> Iterator[Node]:
    """
    Depth-first traversal of all descendant nodes (Node only).
    
    Parameters:
    - include_text: Include text nodes in traversal
    
    Yields:
    Node objects in depth-first order
    """
```

**Usage Example:**
```python
# Iterate over direct children only
container = parser.css_first('div.content')
for child in container.iter():
    print(f"Child tag: {child.tag}")

# Include text nodes
for child in container.iter(include_text=True):
    if child.tag == '-text':
        print(f"Text content: {child.text()}")

# Traverse entire subtree
for node in container.traverse():
    print(f"Descendant: {node.tag}")

# Deep traversal including text
all_nodes = [node for node in container.traverse(include_text=True)]
```

### Text Node Processing

Merge adjacent text nodes for cleaner text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes within this node.
    
    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**
```python
from selectolax.parser import HTMLParser

# Clean up fragmented text nodes
html = '<div><strong>Hello</strong> <em>beautiful</em> world!</div>'
parser = HTMLParser(html)
container = parser.css_first('div')

# Remove formatting tags; their text stays behind as separate text nodes
container.unwrap_tags(['strong', 'em'])
print(container.text(separator=' ', strip=True))  # May contain extra spaces between fragments

# Merge text nodes for cleaner output
container.merge_text_nodes()
print(container.text(separator=' ', strip=True))  # Expected: "Hello beautiful world!"

# Works with any node
article = parser.css_first('article')
if article:
    # Clean up after removing unwanted tags
    article.strip_tags(['script', 'style'])
    article.merge_text_nodes()
    clean_text = article.text(strip=True)
```

### CSS Matching Utilities

Check if nodes match CSS selectors without retrieving results.

```python { .api }
def css_matches(self, selector: str) -> bool:
    """
    Check if this node matches CSS selector.
    
    Parameters:
    - selector: CSS selector string
    
    Returns:
    True if node matches selector, False otherwise
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if node matches any of multiple CSS selectors.
    
    Parameters:
    - selectors: Tuple of CSS selector strings
    
    Returns:
    True if node matches any selector, False otherwise
    """
```

**Usage Example:**
```python
# Check if element matches selector
element = parser.css_first('div')
is_content = element.css_matches('.content')
is_container = element.css_matches('.container')

# Check against multiple selectors
important_selectors = ('.important', '.critical', '.error')
is_important = element.any_css_matches(important_selectors)

# Conditional processing based on matching
if element.css_matches('.article'):
    # Process as article
    process_article(element)
elif element.css_matches('.sidebar'):
    # Process as sidebar
    process_sidebar(element)
```

### Advanced Text Extraction

Additional text extraction methods for specialized use cases.

```python { .api }
def text_lexbor(self) -> str:
    """
    Extract text using Lexbor's built-in method (LexborNode only).
    
    Uses the underlying Lexbor engine's native text extraction.
    Faster for simple text extraction without formatting options.
    
    Returns:
    Text content as string
    
    Raises:
    RuntimeError: If text extraction fails
    """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Use Lexbor's native text extraction
parser = LexborHTMLParser('<div>Hello <b>world</b>!</div>')
element = parser.css_first('div')

# Fast native text extraction
native_text = element.text_lexbor()  # "Hello world!"

# Compare with regular text method
regular_text = element.text()  # Same result but more options

# Use native method for performance-critical applications
articles = parser.css('article')
all_text = [article.text_lexbor() for article in articles]
```

### Advanced Selection Methods

Additional methods for enhanced selection and content analysis.

```python { .api }
def select(self, query: str = None) -> Selector:
    """
    Create advanced selector with chaining support (Node only).
    
    Parameters:
    - query: Optional initial CSS selector
    
    Returns:
    Selector object supporting method chaining
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any child script tags contain text (Node only).
    
    Caches script tags on first call for performance.
    
    Parameters:
    - query: Text to search for in script content
    
    Returns:
    True if any script contains the text, False otherwise
    """
```

**Usage Example:**
```python
from selectolax.parser import HTMLParser

# Advanced selector with chaining
container = parser.css_first('div.content')
selector = container.select('p.important')
# Can chain additional operations on selector

# Check for script content within specific nodes
article = parser.css_first('article')
has_tracking = article.scripts_contain('analytics')
has_ads = article.scripts_contain('adsystem')

# Raw value access for text nodes
html_with_entities = '<div>&#x3C;test&#x3E;</div>'
parser = HTMLParser(html_with_entities)
text_node = parser.css_first('div').child

print(text_node.text())  # "<test>" (parsed)
print(text_node.raw_value)  # b"&#x3C;test&#x3E;" (original)
```

### Node Creation and Cloning

Create new nodes and clone existing ones for DOM manipulation.

```python { .api }
# For LexborNode only
def create_tag(tag: str) -> LexborNode:
    """
    Create a new, empty HTML element (Lexbor engine only).
    
    Parameters:
    - tag: HTML tag name, e.g. "div"
    
    Returns:
    New LexborNode element; attributes are set afterwards through attrs
    """
```

**Usage Example:**
```python
from selectolax.lexbor import create_tag

# Create new elements, then set attributes through attrs
wrapper = create_tag('div')
wrapper.attrs['class'] = 'wrapper'
link = create_tag('a')
link.attrs['href'] = '#'
link.attrs['class'] = 'button'

# Create the pieces of a larger structure
container = create_tag('div')
header = create_tag('h2')
paragraph = create_tag('p')

# Note: Node insertion and complex DOM building
# requires working with the underlying parser APIs
```
