# DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, structural navigation, and DOM modifications for both Node (Modest engine) and LexborNode (Lexbor engine) types.

## Capabilities

### Node Classes

HTML element representation with full DOM manipulation capabilities.

```python { .api }
class Node:
    """HTML node using Modest engine."""
    pass

class LexborNode:
    """HTML node using Lexbor engine."""
    pass
```

Both classes expose largely the same interface. Members available on only one of them are marked "Node only" or "LexborNode only" in the listings below.

### CSS Selection on Nodes

Apply CSS selectors to specific nodes for scoped element searching.

```python { .api }
def css(self, query: str) -> list[Node]:
    """
    Find descendant elements matching CSS selector.

    Parameters:
    - query: CSS selector string

    Returns:
    List of Node objects matching selector within this node's subtree
    """

def css_first(self, query: str, default=None, strict: bool = False) -> Node | None:
    """
    Find first descendant element matching CSS selector.

    Parameters:
    - query: CSS selector string
    - default: Value to return if no match found
    - strict: If True, error when multiple matches exist

    Returns:
    First matching Node object or default value
    """
```

**Usage Example:**

```python
# Find within specific container
container = parser.css_first('div.content')
if container:
    # Search only within container
    links = container.css('a')
    first_paragraph = container.css_first('p')

    # Nested selection
    important_items = container.css('ul.important li')
```

### Text Content Extraction

Extract text content from individual nodes with flexible formatting options.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from this node.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different child elements
    - strip: Apply str.strip() to each text part

    Returns:
    Text content as string
    """
```

**Usage Example:**

```python
# Get text from specific element
title = parser.css_first('h1').text()

# Get text with custom formatting
nav_text = nav_element.text(separator=' | ', strip=True)

# Get only direct text (no children)
button_text = button_element.text(deep=False)

# Extract from multiple elements
article_texts = [p.text(strip=True) for p in article.css('p')]
```

### Node Properties

Access structural information and content of HTML nodes.

```python { .api }
@property
def tag(self) -> str:
    """HTML tag name (e.g., 'div', 'p', 'a')."""

@property
def attributes(self) -> dict:
    """Read-only dictionary of element attributes."""

@property
def attrs(self) -> AttributeDict:
    """Mutable dictionary-like access to element attributes."""

@property
def parent(self) -> Node | None:
    """Parent node in DOM tree."""

@property
def next(self) -> Node | None:
    """Next sibling node."""

@property
def prev(self) -> Node | None:
    """Previous sibling node."""

@property
def child(self) -> Node | None:
    """First child node."""

@property
def last_child(self) -> Node | None:
    """Last child node."""

@property
def html(self) -> str:
    """HTML representation of this node and its children."""

@property
def id(self) -> str | None:
    """HTML id attribute value (Node only)."""

@property
def mem_id(self) -> int:
    """Memory address identifier for the node."""

@property
def tag_id(self) -> int:
    """Numeric tag identifier (LexborNode only)."""

@property
def first_child(self) -> Node | None:
    """First child node (alias for child in LexborNode)."""

@property
def raw_value(self) -> bytes:
    """Raw unparsed value of text node (Node only)."""

@property
def text_content(self) -> str | None:
    """Text content of this specific node only (not children)."""
```

**Usage Example:**

```python
# Access node properties
element = parser.css_first('div.content')

tag_name = element.tag  # 'div'
class_attr = element.attributes['class']  # 'content' (read-only)
parent_element = element.parent
next_sibling = element.next

# Navigate DOM tree
first_child = element.child
last_child = element.last_child

# Get HTML output
html_content = element.html

# Access additional properties
element_id = element.id  # HTML id attribute (if exists)
memory_id = element.mem_id  # Unique memory identifier

# Direct text content (no children)
text_node = parser.css_first('p').child  # Get text node
if text_node and text_node.text_content:
    direct_text = text_node.text_content  # Text of this node only
```

### Attribute Management

Dictionary-like interface for accessing and modifying HTML attributes.

```python { .api }
class AttributeDict:
    def __getitem__(self, key: str) -> str | None:
        """Get attribute value by name."""

    def __setitem__(self, key: str, value: str) -> None:
        """Set attribute value."""

    def __delitem__(self, key: str) -> None:
        """Remove attribute."""

    def __contains__(self, key: str) -> bool:
        """Check if attribute exists."""

    def get(self, key: str, default=None) -> str | None:
        """Get attribute with default value."""

    def sget(self, key: str, default: str = "") -> str:
        """Get attribute, return empty string for None values."""

    def keys(self) -> Iterator[str]:
        """Iterator over attribute names."""

    def values(self) -> Iterator[str | None]:
        """Iterator over attribute values."""

    def items(self) -> Iterator[tuple[str, str | None]]:
        """Iterator over (name, value) pairs."""
```

**Usage Example:**

```python
# Access attributes (read-only)
link = parser.css_first('a')
read_only_attrs = link.attributes  # dict
href = read_only_attrs['href']

# Access mutable attributes
attrs = link.attrs  # AttributeDict

# Get attributes with different methods
href = attrs['href']
title = attrs.get('title', 'No title')
class_name = attrs.sget('class', 'no-class')  # Never None: a valueless attribute yields ""

# Set and modify attributes (only works with attrs, not attributes)
attrs['target'] = '_blank'
attrs['rel'] = 'noopener'

# Check existence
has_id = 'id' in attrs

# Remove attributes (guarded, since the attribute may be absent)
if 'onclick' in attrs:
    del attrs['onclick']

# Iterate attributes
for name, value in attrs.items():
    print(f"{name}: {value}")

# Read-only vs mutable comparison
print(link.attributes)  # {'href': 'example.com', 'class': 'link'}
link.attrs['new-attr'] = 'value'
print(link.attributes)  # {'href': 'example.com', 'class': 'link', 'new-attr': 'value'}
```

### DOM Modification

Modify document structure by adding, removing, and replacing elements.

```python { .api }
def remove(self) -> None:
    """Remove this node from DOM tree."""

def decompose(self) -> None:
    """Remove and destroy this node and all children."""

def unwrap(self) -> None:
    """Remove tag wrapper while keeping child content."""

def replace_with(self, value: str | bytes | Node) -> None:
    """Replace this node with text or another node."""

def insert_before(self, value: str | bytes | Node) -> None:
    """Insert text or node before this node."""

def insert_after(self, value: str | bytes | Node) -> None:
    """Insert text or node after this node."""

def insert_child(self, value: str | bytes | Node) -> None:
    """Insert text or node as child (at end) of this node."""
```

**Usage Example:**

```python
from selectolax.lexbor import create_tag

# Remove elements
script_tags = parser.css('script')
for script in script_tags:
    script.remove()

# Destroy elements completely
ads = parser.css('.advertisement')
for ad in ads:
    ad.decompose()

# Unwrap formatting tags
bold_tags = parser.css('b')
for bold in bold_tags:
    bold.unwrap()  # Keeps text, removes <b> wrapper

# Replace with text
old_img = parser.css_first('img')
if old_img:
    alt_text = old_img.attributes.get('alt') or 'Image'
    old_img.replace_with(alt_text)  # Replace with text

# Replace with another node (select again: the first image is already gone)
next_img = parser.css_first('img')
if next_img:
    new_img = create_tag('img', {'src': 'new.jpg', 'alt': 'New image'})
    next_img.replace_with(new_img)

# Insert text and nodes
container = parser.css_first('div.content')
container.insert_child('Added text at end')
container.insert_after('Text after container')
container.insert_before('Text before container')

# Insert HTML elements
new_paragraph = create_tag('p', {'class': 'inserted'})
container.insert_child(new_paragraph)
```

### Bulk Operations

Perform operations on multiple elements efficiently.

```python { .api }
def strip_tags(self, tags: list[str], recursive: bool = False) -> None:
    """
    Remove specified child tags from this node.

    Parameters:
    - tags: List of tag names to remove
    - recursive: Remove all descendants with matching tags
    """

def unwrap_tags(self, tags: list[str], delete_empty: bool = False) -> None:
    """
    Unwrap specified child tags while keeping content.

    Parameters:
    - tags: List of tag names to unwrap
    - delete_empty: Remove empty tags after unwrapping
    """
```

**Usage Example:**

```python
# Clean up content section
content = parser.css_first('div.content')
if content:
    # Remove unwanted tags
    content.strip_tags(['script', 'style', 'noscript'])

    # Unwrap formatting tags
    content.unwrap_tags(['span', 'font'], delete_empty=True)

# Process article content
article = parser.css_first('article')
if article:
    # Remove embedded content, including nested occurrences
    article.strip_tags(['iframe', 'object', 'embed'], recursive=True)

    # Flatten generic wrappers and drop the ones left empty
    article.unwrap_tags(['div', 'span'], delete_empty=True)
```

### Node Iteration and Traversal

Iterate through child nodes and traverse the DOM tree structure.

```python { .api }
def iter(self, include_text: bool = False) -> Iterator[Node]:
    """
    Iterate over child nodes at current level (Node only).

    Parameters:
    - include_text: Include text nodes in iteration

    Yields:
    Node objects for each child element
    """

def traverse(self, include_text: bool = False) -> Iterator[Node]:
    """
    Depth-first traversal of all descendant nodes (Node only).

    Parameters:
    - include_text: Include text nodes in traversal

    Yields:
    Node objects in depth-first order
    """
```

**Usage Example:**

```python
# Iterate over direct children only
container = parser.css_first('div.content')
for child in container.iter():
    print(f"Child tag: {child.tag}")

# Include text nodes
for child in container.iter(include_text=True):
    if child.tag == '-text':
        print(f"Text content: {child.text()}")

# Traverse entire subtree
for node in container.traverse():
    print(f"Descendant: {node.tag}")

# Deep traversal including text
all_nodes = [node for node in container.traverse(include_text=True)]
```

### Text Node Processing

Merge adjacent text nodes for cleaner text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes within this node.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**

```python
from selectolax.parser import HTMLParser

# Clean up fragmented text nodes
html = '<div><strong>Hello</strong> <em>beautiful</em> world!</div>'
parser = HTMLParser(html)
container = parser.css_first('div')

# Remove formatting tags
container.unwrap_tags(['strong', 'em'])
print(container.text())  # Text is intact but still split across several adjacent text nodes

# Merge text nodes for cleaner output
container.merge_text_nodes()
print(container.text())  # Same text, now held in a single text node

# Works with any node
article = parser.css_first('article')
if article:
    # Clean up after removing unwanted tags
    article.strip_tags(['script', 'style'])
    article.merge_text_nodes()
    clean_text = article.text(strip=True)
```

### CSS Matching Utilities

Check if nodes match CSS selectors without retrieving results.

```python { .api }
def css_matches(self, selector: str) -> bool:
    """
    Check if this node matches CSS selector.

    Parameters:
    - selector: CSS selector string

    Returns:
    True if node matches selector, False otherwise
    """

def any_css_matches(self, selectors: tuple[str, ...]) -> bool:
    """
    Check if node matches any of multiple CSS selectors.

    Parameters:
    - selectors: Tuple of CSS selector strings

    Returns:
    True if node matches any selector, False otherwise
    """
```

**Usage Example:**

```python
# Check if element matches selector
element = parser.css_first('div')
is_content = element.css_matches('.content')
is_container = element.css_matches('.container')

# Check against multiple selectors
important_selectors = ('.important', '.critical', '.error')
is_important = element.any_css_matches(important_selectors)

# Conditional processing based on matching
if element.css_matches('.article'):
    # Process as article
    process_article(element)
elif element.css_matches('.sidebar'):
    # Process as sidebar
    process_sidebar(element)
```

### Advanced Text Extraction

Additional text extraction methods for specialized use cases.

```python { .api }
def text_lexbor(self) -> str:
    """
    Extract text using Lexbor's built-in method (LexborNode only).

    Uses the underlying Lexbor engine's native text extraction.
    Faster for simple text extraction without formatting options.

    Returns:
    Text content as string

    Raises:
    RuntimeError: If text extraction fails
    """
```

**Usage Example:**

```python
from selectolax.lexbor import LexborHTMLParser

# Use Lexbor's native text extraction
parser = LexborHTMLParser('<div>Hello <b>world</b>!</div>')
element = parser.css_first('div')

# Fast native text extraction
native_text = element.text_lexbor()  # "Hello world!"

# Compare with regular text method
regular_text = element.text()  # Same result but more options

# Use native method for performance-critical applications
articles = parser.css('article')
all_text = [article.text_lexbor() for article in articles]
```

### Advanced Selection Methods

Additional methods for enhanced selection and content analysis.

```python { .api }
def select(self, query: str = None) -> Selector:
    """
    Create advanced selector with chaining support (Node only).

    Parameters:
    - query: Optional initial CSS selector

    Returns:
    Selector object supporting method chaining
    """

def scripts_contain(self, query: str) -> bool:
    """
    Check if any child script tags contain text (Node only).

    Caches script tags on first call for performance.

    Parameters:
    - query: Text to search for in script content

    Returns:
    True if any script contains the text, False otherwise
    """
```

**Usage Example:**

```python
from selectolax.parser import HTMLParser

# Advanced selector with chaining
container = parser.css_first('div.content')
selector = container.select('p.important')
# Can chain additional operations on selector

# Check for script content within specific nodes
article = parser.css_first('article')
has_tracking = article.scripts_contain('analytics')
has_ads = article.scripts_contain('adsystem')

# Raw value access for text nodes
html_with_entities = '<div>&#x3C;test&#x3E;</div>'
parser = HTMLParser(html_with_entities)
text_node = parser.css_first('div').child

print(text_node.text())  # "<test>" (parsed)
print(text_node.raw_value)  # b"&#x3C;test&#x3E;" (original)
```

### Node Creation and Cloning

Create new nodes and clone existing ones for DOM manipulation.

```python { .api }
# For LexborNode only
def create_tag(name: str, attrs: dict = None) -> LexborNode:
    """
    Create new HTML element (Lexbor engine only).

    Parameters:
    - name: HTML tag name
    - attrs: Dictionary of attributes

    Returns:
    New LexborNode element
    """
```

**Usage Example:**

```python
from selectolax.lexbor import create_tag

# Create new elements
wrapper = create_tag('div', {'class': 'wrapper'})
link = create_tag('a', {'href': '#', 'class': 'button'})

# Build complex structures
container = create_tag('div', {'class': 'container'})
header = create_tag('h2', {'class': 'title'})
paragraph = create_tag('p', {'class': 'description'})

# Note: node insertion and complex DOM building
# require working with the underlying parser APIs
```