# selectolax

A high-performance HTML5 parser with CSS selector support. selectolax ships two parsing backends, the Modest and Lexbor engines, behind a Python API that covers CSS selection, attribute access, text extraction, and DOM traversal and manipulation.

## Package Information

- **Package Name**: selectolax
- **Language**: Python
- **Installation**: `pip install selectolax`
- **Version**: 0.3.34

## Core Imports

**Modest engine (default)**:
```python
from selectolax.parser import HTMLParser
```

**Lexbor engine (enhanced CSS selectors)**:
```python
from selectolax.lexbor import LexborHTMLParser
```

**Utility functions**:
```python
# Element creation and fragment parsing.
# Both engines export the same names, so import from one engine only;
# a second import would shadow the first.
from selectolax.parser import create_tag, parse_fragment   # Modest
from selectolax.lexbor import create_tag, parse_fragment   # Lexbor

# Exception handling
from selectolax.lexbor import SelectolaxError
```

## Basic Usage

```python
from selectolax.parser import HTMLParser

# Parse HTML content
html = """
<html>
<head><title>Sample Page</title></head>
<body>
<div class="content">
<h1 id="title">Hello World</h1>
<p class="text">This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</div>
</body>
</html>
"""

# Create parser instance
tree = HTMLParser(html)

# Extract text using CSS selectors
title = tree.css_first('h1#title').text()
print(f"Title: {title}")  # Output: Title: Hello World

# Get all list items
items = [node.text() for node in tree.css('li')]
print(f"Items: {items}")  # Output: Items: ['Item 1', 'Item 2']

# Access attributes
title_id = tree.css_first('h1').attributes['id']
print(f"Title ID: {title_id}")  # Output: Title ID: title

# Extract all text content
all_text = tree.text(strip=True)
print(f"All text: {all_text}")
```

## Architecture

selectolax provides two high-performance HTML parsing engines:

- **Modest Engine**: The default parser providing comprehensive HTML5 parsing with CSS selectors
- **Lexbor Engine**: Enhanced parser with additional features like custom pseudo-classes (`:lexbor-contains`)

Both engines expose similar APIs through their respective parser classes (`HTMLParser` and `LexborHTMLParser`) and node classes (`Node` and `LexborNode`), allowing easy switching between backends while maintaining compatibility.

The parsing workflow involves:
1. **Parse**: Create parser instance with HTML content
2. **Select**: Use CSS selectors or tag-based queries to find elements
3. **Extract**: Get text content, attributes, or HTML structure
4. **Manipulate**: Modify DOM structure by adding, removing, or replacing elements

## Capabilities

### HTML Parsing with Modest Engine

The primary HTML5 parser using the Modest engine. Provides comprehensive parsing capabilities with automatic encoding detection, CSS selector support, and DOM manipulation methods.

```python { .api }
class HTMLParser:
    def __init__(self, html, detect_encoding=True, use_meta_tags=True, decode_errors='ignore'): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Modest Engine Parser](./modest-parser.md)

### Enhanced Parsing with Lexbor Engine

Alternative HTML5 parser using the Lexbor engine. Offers enhanced CSS selector capabilities including custom pseudo-classes for advanced text matching and improved performance characteristics.

```python { .api }
class LexborHTMLParser:
    def __init__(self, html): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Lexbor Engine Parser](./lexbor-parser.md)

### DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, and structural modifications.

```python { .api }
class Node:
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
    def remove(self) -> None: ...
    def decompose(self) -> None: ...
```

[Node Operations](./node-operations.md)

## Common Types

```python { .api }
from typing import Iterator

# HTML content input types
HtmlInput = str | bytes

# CSS selector query type
CssQuery = str

# Attribute dictionary interface
class AttributeDict:
    def __getitem__(self, key: str) -> str | None: ...
    def __setitem__(self, key: str, value: str) -> None: ...
    def __contains__(self, key: str) -> bool: ...
    def get(self, key: str, default=None) -> str | None: ...
    def keys(self) -> Iterator[str]: ...
    def values(self) -> Iterator[str | None]: ...
    def items(self) -> Iterator[tuple[str, str | None]]: ...

# Exception classes
class SelectolaxError(Exception):
    """Base exception for selectolax-related errors."""
    pass
```