# DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, structural navigation, and DOM modifications for both Node (Modest engine) and LexborNode (Lexbor engine) types.

## Capabilities

### Node Classes

HTML element representation with full DOM manipulation capabilities.

```python { .api }
class Node:
    """HTML node using Modest engine."""
    pass

class LexborNode:
    """HTML node using Lexbor engine."""
    pass
```

Both classes expose largely the same interface. Members that exist on only one engine are marked `(Node only)` or `(LexborNode only)` in the sections below.

### CSS Selection on Nodes

Apply CSS selectors to specific nodes for scoped element searching.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find descendant elements matching CSS selector.

    Parameters:
    - query: CSS selector string

    Returns:
    List of Node objects matching selector within this node's subtree
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first descendant element matching CSS selector.

    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, raise an error when more than one element matches

    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**

```python
# Find within specific container
container = parser.css_first('div.content')
if container:
    # Search only within container
    links = container.css('a')
    first_paragraph = container.css_first('p')

    # Nested selection
    important_items = container.css('ul.important li')
```
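The three possible outcomes of `css_first` (no match, one match, several matches with `strict=True`) can be sketched in plain Python. This is a toy model of the contract documented above, not the library's implementation: `first_match` is a hypothetical helper, and the `ValueError` is an assumption chosen for illustration.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_match(matches: Sequence[T], default: Optional[T] = None,
                strict: bool = False) -> Optional[T]:
    """Hypothetical helper that mimics the css_first() contract described above."""
    if not matches:
        return default  # no match: fall back to the caller's default
    if strict and len(matches) > 1:
        # Assumption: the error type is chosen for illustration only
        raise ValueError(f"expected 1 match, found {len(matches)}")
    return matches[0]


no_match = first_match([], default="fallback")   # default is returned
first_of_two = first_match(["a", "b"])           # first element wins
try:
    first_match(["a", "b"], strict=True)
    strict_failed = False
except ValueError:
    strict_failed = True                         # strict mode rejects ambiguity
```

Passing `strict=True` is mainly useful when a selector is expected to be unique and a silent pick of the first match would hide a markup change.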

### Text Content Extraction

Extract text content from individual nodes with flexible formatting options.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from this node.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different child elements
    - strip: Apply str.strip() to each text part

    Returns:
    Text content as string
    """
```

**Usage Example:**

```python
# Get text from specific element
title = parser.css_first('h1').text()

# Get text with custom formatting
nav_text = nav_element.text(separator=' | ', strip=True)

# Get only direct text (no children)
button_text = button_element.text(deep=False)

# Extract from multiple elements
article_texts = [p.text(strip=True) for p in article.css('p')]
```
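How `deep`, `separator`, and `strip` interact can be sketched with a small stand-in tree. The `ToyNode` class and `extract_text` function below are hypothetical and only model the parameter descriptions above. They are not the library's code, and edge cases in the real engines may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a DOM node; text nodes use the tag '-text'."""
    tag: str
    value: str = ""
    children: List["ToyNode"] = field(default_factory=list)


def extract_text(node: ToyNode, deep: bool = True, separator: str = "",
                 strip: bool = False) -> str:
    parts: List[str] = []

    def collect(current: ToyNode) -> None:
        for child in current.children:
            if child.tag == "-text":
                # strip applies to each text part, before joining
                parts.append(child.value.strip() if strip else child.value)
            elif deep:
                collect(child)  # descend only when deep=True

    collect(node)
    return separator.join(parts)


nav = ToyNode("nav", children=[
    ToyNode("a", children=[ToyNode("-text", " Home ")]),
    ToyNode("a", children=[ToyNode("-text", " About ")]),
])
button = ToyNode("button", children=[
    ToyNode("-text", "Save"),
    ToyNode("span", children=[ToyNode("-text", " (draft)")]),
])

nav_text = extract_text(nav, separator=" | ", strip=True)  # "Home | About"
button_direct = extract_text(button, deep=False)           # "Save"
button_full = extract_text(button)                         # "Save (draft)"
```

In this model the separator is placed between text parts, so combining it with `strip=True` avoids stray whitespace around each part.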

### Node Properties

Access structural information and content of HTML nodes.

```python { .api }
@property
def tag(self) -> str:
    """HTML tag name (e.g., 'div', 'p', 'a')."""

@property
def attributes(self) -> dict:
    """Read-only dictionary of element attributes."""

@property
def attrs(self) -> AttributeDict:
    """Mutable dictionary-like access to element attributes."""

@property
def parent(self) -> Node | None:
    """Parent node in DOM tree."""

@property
def next(self) -> Node | None:
    """Next sibling node."""

@property
def prev(self) -> Node | None:
    """Previous sibling node."""

@property
def child(self) -> Node | None:
    """First child node."""

@property
def last_child(self) -> Node | None:
    """Last child node."""

@property
def html(self) -> str:
    """HTML representation of this node and its children."""

@property
def id(self) -> str | None:
    """HTML id attribute value (Node only)."""

@property
def mem_id(self) -> int:
    """Memory address identifier for the node."""

@property
def tag_id(self) -> int:
    """Numeric tag identifier (LexborNode only)."""

@property
def first_child(self) -> Node | None:
    """First child node (alias for child in LexborNode)."""

@property
def raw_value(self) -> bytes:
    """Raw unparsed value of text node (Node only)."""

@property
def text_content(self) -> str | None:
    """Text content of this specific node only (not children)."""
```

**Usage Example:**

```python
# Access node properties
element = parser.css_first('div.content')

tag_name = element.tag  # 'div'
class_attr = element.attributes['class']  # 'content' (read-only)
parent_element = element.parent
next_sibling = element.next

# Navigate DOM tree
first_child = element.child
last_child = element.last_child

# Get HTML output
html_content = element.html

# Access additional properties
element_id = element.id  # HTML id attribute (if exists)
memory_id = element.mem_id  # Unique memory identifier

# Direct text content (no children)
text_node = parser.css_first('p').child  # Get text node
if text_node and text_node.text_content:
    direct_text = text_node.text_content  # Text of this node only
```

### Attribute Management

Dictionary-like interface for accessing and modifying HTML attributes.

```python { .api }
class AttributeDict:
    def __getitem__(self, key: str) -> str | None:
        """Get attribute value by name."""

    def __setitem__(self, key: str, value: str) -> None:
        """Set attribute value."""

    def __delitem__(self, key: str) -> None:
        """Remove attribute."""

    def __contains__(self, key: str) -> bool:
        """Check if attribute exists."""

    def get(self, key: str, default=None) -> str | None:
        """Get attribute with default value."""

    def sget(self, key: str, default: str = "") -> str:
        """Get attribute, return empty string for None values."""

    def keys(self) -> Iterator[str]:
        """Iterator over attribute names."""

    def values(self) -> Iterator[str | None]:
        """Iterator over attribute values."""

    def items(self) -> Iterator[tuple[str, str | None]]:
        """Iterator over (name, value) pairs."""
```

**Usage Example:**

```python
# Access attributes (read-only)
link = parser.css_first('a')
read_only_attrs = link.attributes  # dict
href = read_only_attrs['href']

# Access mutable attributes
attrs = link.attrs  # AttributeDict

# Get attributes with different methods
href = attrs['href']
title = attrs.get('title', 'No title')
class_name = attrs.sget('class', 'no-class')  # 'no-class' if missing, "" if the attribute has no value

# Set and modify attributes (only works with attrs, not attributes)
attrs['target'] = '_blank'
attrs['rel'] = 'noopener'

# Check existence
has_id = 'id' in attrs

# Remove attributes (guard against a missing key)
if 'onclick' in attrs:
    del attrs['onclick']

# Iterate attributes
for name, value in attrs.items():
    print(f"{name}: {value}")

# Read-only vs mutable comparison
print(link.attributes)  # e.g. {'href': 'example.com', 'class': 'link'}
link.attrs['new-attr'] = 'value'
print(link.attributes)  # e.g. {'href': 'example.com', 'class': 'link', 'new-attr': 'value'}
```
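The difference between `get` and `sget` matters for attributes that are present but carry no value (for example `disabled` in `<input disabled>`), which are reported as `None`. The sketch below models that behaviour with a plain `dict` subclass. `ToyAttrs` is a hypothetical stand-in written for this illustration, not the library's `AttributeDict`.

```python
from typing import Optional


class ToyAttrs(dict):
    """Hypothetical dict-based model of the get()/sget() behaviour described above."""

    def sget(self, key: str, default: str = "") -> str:
        value: Optional[str] = self.get(key, default)
        # A valueless attribute is stored as None; sget() turns it into ""
        return "" if value is None else value


# Models <input disabled name="q">: 'disabled' has no value, so it maps to None
attrs = ToyAttrs({"disabled": None, "name": "q"})

plain = attrs.get("disabled")          # None: attribute exists but has no value
safe = attrs.sget("disabled")          # "": safe to pass to string operations
missing = attrs.sget("title", "n/a")   # "n/a": default used for a missing key
```

Using `sget` avoids `None` checks before calling string methods such as `.split()` or `.lower()` on an attribute value.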

### DOM Modification

Modify document structure by adding, removing, and replacing elements.

```python { .api }
def remove(self) -> None:
    """Remove this node from DOM tree."""

def decompose(self) -> None:
    """Remove and destroy this node and all children."""

def unwrap(self) -> None:
    """Remove tag wrapper while keeping child content."""

def replace_with(self, value: str | bytes | Node) -> None:
    """Replace this node with text or another node."""

def insert_before(self, value: str | bytes | Node) -> None:
    """Insert text or node before this node."""

def insert_after(self, value: str | bytes | Node) -> None:
    """Insert text or node after this node."""

def insert_child(self, value: str | bytes | Node) -> None:
    """Insert text or node as child (at end) of this node."""
```

**Usage Example:**

```python
# Remove elements
script_tags = parser.css('script')
for script in script_tags:
    script.remove()

# Destroy elements completely
ads = parser.css('.advertisement')
for ad in ads:
    ad.decompose()

# Unwrap formatting tags
bold_tags = parser.css('b')
for bold in bold_tags:
    bold.unwrap()  # Keeps text, removes <b> wrapper

# Replace with text
old_img = parser.css_first('img')
if old_img:
    alt_text = old_img.attributes.get('alt', 'Image')
    old_img.replace_with(alt_text)  # Replace with text

# Replace with another node
# create_tag from selectolax.lexbor returns a LexborNode, so `parser`
# must be a LexborHTMLParser for the node-based calls below
from selectolax.lexbor import create_tag
next_img = parser.css_first('img')  # old_img was already replaced above
if next_img:
    new_img = create_tag('img')  # takes only the tag name
    new_img.attrs['src'] = 'new.jpg'
    new_img.attrs['alt'] = 'New image'
    next_img.replace_with(new_img)

# Insert text and nodes
container = parser.css_first('div.content')
container.insert_child('Added text at end')
container.insert_after('Text after container')
container.insert_before('Text before container')

# Insert HTML elements
new_paragraph = create_tag('p')
new_paragraph.attrs['class'] = 'inserted'
container.insert_child(new_paragraph)
```

### Bulk Operations

Perform operations on multiple elements efficiently.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified child tags from this node.

    Parameters:
    - tags: List of tag names to remove
    - recursive: Remove all descendants with matching tags
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Unwrap specified child tags while keeping content.

    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**

```python
# Clean up content section
content = parser.css_first('div.content')
if content:
    # Remove unwanted tags
    content.strip_tags(['script', 'style', 'noscript'])

    # Unwrap formatting tags
    content.unwrap_tags(['span', 'font'], delete_empty=True)

# Process article content
article = parser.css_first('article')
if article:
    # Remove all ads and tracking
    article.strip_tags(['iframe', 'object', 'embed'], recursive=True)

    # Clean up empty containers
    article.unwrap_tags(['div', 'span'], delete_empty=True)
```
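The practical difference between stripping and unwrapping is whether the content inside the tag survives. The sketch below models both on a tiny nested-list tree. The functions reuse the names `strip_tags` and `unwrap_tags` purely for readability. They are hypothetical stand-ins that always recurse and ignore the `recursive` and `delete_empty` options.

```python
from typing import List, Set, Tuple, Union

# A child is either a text string or a (tag, children) pair
Child = Union[str, Tuple[str, list]]


def strip_tags(children: List[Child], tags: Set[str]) -> List[Child]:
    """Drop matching elements together with everything inside them."""
    kept: List[Child] = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)
        elif child[0] not in tags:
            kept.append((child[0], strip_tags(child[1], tags)))
    return kept


def unwrap_tags(children: List[Child], tags: Set[str]) -> List[Child]:
    """Drop matching wrappers but splice their content into the parent."""
    kept: List[Child] = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)
            continue
        tag, inner = child[0], unwrap_tags(child[1], tags)
        if tag in tags:
            kept.extend(inner)          # keep the content, lose the wrapper
        else:
            kept.append((tag, inner))
    return kept


def render(children: List[Child]) -> str:
    """Serialize the toy tree back to markup for inspection."""
    return "".join(
        c if isinstance(c, str) else f"<{c[0]}>{render(c[1])}</{c[0]}>"
        for c in children
    )


content = ["Hi ", ("b", ["bold"]), ("script", ["track()"]), " end"]
stripped = render(strip_tags(content, {"script"}))   # script and its code are gone
unwrapped = render(unwrap_tags(content, {"b"}))      # <b> is gone, "bold" stays
```

As a rule of thumb under this model, strip tags whose content is unwanted (scripts, styles) and unwrap tags whose content is wanted but whose markup is not (formatting wrappers).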

### Node Iteration and Traversal

Iterate through child nodes and traverse the DOM tree structure.

```python { .api }
def iter(self, include_text: bool = False) -> Iterator[Node]:
    """
    Iterate over child nodes at current level (Node only).

    Parameters:
    - include_text: Include text nodes in iteration

    Yields:
    Node objects for each child element
    """

def traverse(self, include_text: bool = False) -> Iterator[Node]:
    """
    Depth-first traversal of all descendant nodes (Node only).

    Parameters:
    - include_text: Include text nodes in traversal

    Yields:
    Node objects in depth-first order
    """
```
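The difference between level-only iteration and depth-first traversal can be sketched on a small stand-in tree. `ToyNode`, `iter_children`, and `traverse` below are hypothetical and follow the descriptions above. Whether the real `traverse` also yields the starting node is not stated there, so this sketch yields descendants only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a DOM node; text nodes use the tag '-text'."""
    tag: str
    children: List["ToyNode"] = field(default_factory=list)


def iter_children(node: ToyNode, include_text: bool = False) -> Iterator[ToyNode]:
    """Yield direct children only, optionally including text nodes."""
    for child in node.children:
        if include_text or child.tag != "-text":
            yield child


def traverse(node: ToyNode, include_text: bool = False) -> Iterator[ToyNode]:
    """Yield every descendant in depth-first (pre-order) sequence."""
    for child in node.children:
        if include_text or child.tag != "-text":
            yield child
        yield from traverse(child, include_text)


tree = ToyNode("div", [
    ToyNode("-text"),
    ToyNode("ul", [ToyNode("li"), ToyNode("li")]),
    ToyNode("p", [ToyNode("-text")]),
])

direct = [n.tag for n in iter_children(tree)]
deep = [n.tag for n in traverse(tree)]
deep_with_text = [n.tag for n in traverse(tree, include_text=True)]
```

Here `direct` contains only `ul` and `p`, while `deep` also visits both `li` elements before moving on to `p`.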

**Usage Example:**

```python
# Iterate over direct children only
container = parser.css_first('div.content')
for child in container.iter():
    print(f"Child tag: {child.tag}")

# Include text nodes
for child in container.iter(include_text=True):
    if child.tag == '-text':
        print(f"Text content: {child.text()}")

# Traverse entire subtree
for node in container.traverse():
    print(f"Descendant: {node.tag}")

# Deep traversal including text
all_nodes = [node for node in container.traverse(include_text=True)]
```

### Text Node Processing

Merge adjacent text nodes for cleaner text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes within this node.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**

```python
from selectolax.parser import HTMLParser

# Clean up fragmented text nodes
html = '<div><strong>Hello</strong> <em>beautiful</em> world!</div>'
parser = HTMLParser(html)
container = parser.css_first('div')

# Remove formatting tags
container.unwrap_tags(['strong', 'em'])
print(container.text())  # "Hello beautiful world!" (spread across several adjacent text nodes)

# Merge text nodes for cleaner output
# (the difference shows up when text() is called with a separator or strip=True)
container.merge_text_nodes()
print(container.text())  # "Hello beautiful world!" (now a single text node)

# Works with any node
article = parser.css_first('article')
if article:
    # Clean up after removing unwanted tags
    article.strip_tags(['script', 'style'])
    article.merge_text_nodes()
    clean_text = article.text(strip=True)
```

### CSS Matching Utilities

Check if nodes match CSS selectors without retrieving results.

```python { .api }
def css_matches(self, selector: str) -> bool:
    """
    Check if this node matches CSS selector.

    Parameters:
    - selector: CSS selector string

    Returns:
    True if node matches selector, False otherwise
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if node matches any of multiple CSS selectors.

    Parameters:
    - selectors: Tuple of CSS selector strings

    Returns:
    True if node matches any selector, False otherwise
    """
```

**Usage Example:**

```python
# Check if element matches selector
element = parser.css_first('div')
is_content = element.css_matches('.content')
is_container = element.css_matches('.container')

# Check against multiple selectors
important_selectors = ('.important', '.critical', '.error')
is_important = element.any_css_matches(important_selectors)

# Conditional processing based on matching
if element.css_matches('.article'):
    # Process as article
    process_article(element)
elif element.css_matches('.sidebar'):
    # Process as sidebar
    process_sidebar(element)
```

### Advanced Text Extraction

Additional text extraction methods for specialized use cases.

```python { .api }
def text_lexbor(self) -> str:
    """
    Extract text using Lexbor's built-in method (LexborNode only).

    Uses the underlying Lexbor engine's native text extraction.
    Faster for simple text extraction without formatting options.

    Returns:
    Text content as string

    Raises:
    RuntimeError: If text extraction fails
    """
```

**Usage Example:**

```python
from selectolax.lexbor import LexborHTMLParser

# Use Lexbor's native text extraction
parser = LexborHTMLParser('<div>Hello <b>world</b>!</div>')
element = parser.css_first('div')

# Fast native text extraction
native_text = element.text_lexbor()  # "Hello world!"

# Compare with regular text method
regular_text = element.text()  # Same result but more options

# Use native method for performance-critical applications
articles = parser.css('article')
all_text = [article.text_lexbor() for article in articles]
```

### Advanced Selection Methods

Additional methods for enhanced selection and content analysis.

```python { .api }
def select(self, query: str = None) -> Selector:
    """
    Create advanced selector with chaining support (Node only).

    Parameters:
    - query: Optional initial CSS selector

    Returns:
    Selector object supporting method chaining
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any child script tags contain text (Node only).

    Caches script tags on first call for performance.

    Parameters:
    - query: Text to search for in script content

    Returns:
    True if any script contains the text, False otherwise
    """
```

**Usage Example:**

```python
from selectolax.parser import HTMLParser

# Advanced selector with chaining
container = parser.css_first('div.content')
selector = container.select('p.important')
# Can chain additional operations on selector

# Check for script content within specific nodes
article = parser.css_first('article')
has_tracking = article.scripts_contain('analytics')
has_ads = article.scripts_contain('adsystem')

# Raw value access for text nodes
html_with_entities = '<div>&lt;test&gt;</div>'
parser = HTMLParser(html_with_entities)
text_node = parser.css_first('div').child

print(text_node.text())     # "<test>" (parsed)
print(text_node.raw_value)  # b"&lt;test&gt;" (original)
```
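The relationship between a raw value and parsed text can be illustrated with the standard library alone. In the sketch below `html.unescape` stands in for the entity decoding an HTML parser performs. It is an analogy for the parsed and original views of a text node, not the code path either engine uses, and the sample bytes are made up for the example.

```python
import html

# Bytes as they might appear in the source document (what a raw value preserves)
raw_value = b"Tom &amp; Jerry"

# Decoding entities gives the text a reader would see (what parsed text returns)
parsed_text = html.unescape(raw_value.decode("utf-8"))

# Escaping the parsed text again recovers the original markup for this input
round_trip = html.escape(parsed_text, quote=False).encode("utf-8")
```

Keeping the raw bytes is mainly useful when the exact source encoding of a text node matters, for example when diffing against the original document.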

### Node Creation and Cloning

Create new nodes and clone existing ones for DOM manipulation.

```python { .api }
# From selectolax.lexbor
def create_tag(tag: str) -> LexborNode:
    """
    Create a new, empty HTML element (Lexbor engine).

    Parameters:
    - tag: HTML tag name, e.g. 'div'

    Returns:
    New LexborNode element without attributes or children;
    set attributes afterwards through attrs
    """
```

**Usage Example:**

```python
from selectolax.lexbor import create_tag

# Create new elements (create_tag takes only the tag name;
# attributes are set afterwards through attrs)
wrapper = create_tag('div')
wrapper.attrs['class'] = 'wrapper'
link = create_tag('a')
link.attrs['href'] = '#'
link.attrs['class'] = 'button'

# Build more complex structures by nesting nodes with insert_child
container = create_tag('div')
container.attrs['class'] = 'container'
header = create_tag('h2')
header.attrs['class'] = 'title'
paragraph = create_tag('p')
paragraph.attrs['class'] = 'description'
container.insert_child(header)
container.insert_child(paragraph)
```