# Enhanced Parsing with Lexbor Engine

Alternative HTML5 parser using the Lexbor engine. Offers enhanced CSS selector capabilities including custom pseudo-classes for advanced text matching, improved performance characteristics, and extended selector support beyond standard CSS.

## Capabilities

### LexborHTMLParser Class

Fast HTML parser with enhanced CSS selector support and custom pseudo-classes for advanced document querying.

```python { .api }
class LexborHTMLParser:
    def __init__(self, html: str | bytes):
        """
        Initialize Lexbor HTML parser.

        Parameters:
        - html: HTML content as string or bytes
        """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Parse HTML content
html = '<div><p>Hello world</p><p class="special">Special content</p></div>'
parser = LexborHTMLParser(html)

# Parse from bytes
html_bytes = b'<html><body><h1>Title</h1></body></html>'
parser = LexborHTMLParser(html_bytes)
```

### Enhanced CSS Selectors

Advanced CSS selector capabilities including custom pseudo-classes for text matching and extended selector support.

```python { .api }
def css(self, query: str) -> list[LexborNode]:
    """
    Find elements using enhanced CSS selectors.

    Supports standard CSS selectors plus Lexbor extensions:
    - :lexbor-contains("text") - case-sensitive text matching
    - :lexbor-contains("text" i) - case-insensitive text matching

    Parameters:
    - query: CSS selector with optional Lexbor extensions

    Returns:
    List of LexborNode objects matching the selector
    """

def css_first(self, query: str, default=None, strict: bool = False) -> LexborNode | None:
    """
    Find first element with enhanced CSS selectors.

    Parameters:
    - query: CSS selector string with optional Lexbor extensions
    - default: Value to return if no match is found
    - strict: If True, raise an error when more than one element matches

    Returns:
    First matching LexborNode object or default value
    """
```

**Usage Example:**
```python
# Standard CSS selectors
paragraphs = parser.css('p.content')
first_div = parser.css_first('div')

# Lexbor custom pseudo-classes - case sensitive
awesome_nodes = parser.css('p:lexbor-contains("awesome")')

# Lexbor custom pseudo-classes - case insensitive
case_insensitive = parser.css('p:lexbor-contains("HELLO" i)')

# Complex selectors with custom pseudo-classes
specific = parser.css('div.content p:lexbor-contains("important" i)')
```

### Tag-Based Selection

Select elements by tag name with improved performance over the Modest engine.

```python { .api }
def tags(self, name: str) -> list[LexborNode]:
    """
    Find all elements with specified tag name.

    Parameters:
    - name: HTML tag name (e.g., 'div', 'p', 'a')

    Returns:
    List of LexborNode objects with matching tag name
    """
```

**Usage Example:**
```python
# Get all links
links = parser.tags('a')

# Get all top-level (h1) headings
headings = parser.tags('h1')

# Get all list items
items = parser.tags('li')
```

### Text Extraction

Extract text content with enhanced performance and consistent behavior.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from document body.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different elements
    - strip: Apply str.strip() to each text part

    Returns:
    Extracted text content as string
    """
```

**Usage Example:**
```python
# Get all text content
all_text = parser.text()

# Get text with separators
separated_text = parser.text(separator=' | ')

# Strip leading and trailing whitespace from each text part
clean_text = parser.text(strip=True)

# Get only direct text content
direct_text = parser.text(deep=False)
```

### DOM Tree Access

Access document structure with enhanced node types and consistent interface.

```python { .api }
@property
def root(self) -> LexborNode | None:
    """Returns root HTML element node."""

@property
def head(self) -> LexborNode | None:
    """Returns HTML head element node."""

@property
def body(self) -> LexborNode | None:
    """Returns HTML body element node."""

@property
def html(self) -> str | None:
    """Returns HTML representation of the document."""

@property
def raw_html(self) -> bytes:
    """Returns raw HTML bytes used for parsing."""

@property
def selector(self) -> LexborCSSSelector:
    """Returns CSS selector instance for advanced queries."""
```

**Usage Example:**
```python
# Access document parts
root = parser.root
head = parser.head
body = parser.body

# Get HTML output
html_output = parser.html

# Access raw input
original = parser.raw_html

# Get selector for advanced operations
css_selector = parser.selector
```

### Advanced CSS Selector Interface

Direct access to the underlying CSS selector engine for advanced use cases.

```python { .api }
class LexborCSSSelector:
    def find(self, query: str, node: LexborNode) -> list[LexborNode]:
        """
        Find elements matching selector within given node.

        Parameters:
        - query: CSS selector string
        - node: Root node to search within

        Returns:
        List of matching LexborNode objects
        """

    def any_matches(self, query: str, node: LexborNode) -> bool:
        """
        Check if any elements match selector.

        Parameters:
        - query: CSS selector string
        - node: Root node to search within

        Returns:
        True if any matches exist, False otherwise
        """
```

**Usage Example:**
```python
# Get selector instance
selector = parser.selector

# Search within specific node
content_div = parser.css_first('div.content')
if content_div:
    matches = selector.find('p.important', content_div)

# Check for existence without retrieving
has_errors = selector.any_matches('.error', parser.root)
```

## Utility Functions

### Element Creation

Create new HTML elements programmatically.

```python { .api }
def create_tag(tag: str) -> LexborNode:
    """
    Create new HTML element with specified tag name.

    Parameters:
    - tag: HTML tag name (e.g., 'div', 'p', 'img')

    Returns:
    New LexborNode element with the specified tag
    """

def parse_fragment(html: str) -> list[LexborNode]:
    """
    Parse HTML fragment into list of nodes without adding wrapper elements.

    Unlike LexborHTMLParser which adds missing html/head/body tags, this function
    returns nodes exactly as specified in the HTML fragment.

    Parameters:
    - html: HTML fragment string to parse

    Returns:
    List of LexborNode objects representing the parsed HTML fragment
    """
```

**Usage Example:**
```python
from selectolax.lexbor import create_tag, parse_fragment

# Create simple elements
div = create_tag('div')
paragraph = create_tag('p')
link = create_tag('a')

# Parse HTML fragments without wrappers
fragment_html = '<span>Text 1</span><span>Text 2</span>'
spans = parse_fragment(fragment_html)

# Use in DOM manipulation
container = create_tag('div')
for span in spans:
    container.insert_child(span)

print(container.html)  # <div><span>Text 1</span><span>Text 2</span></div>
```

### Document Cloning

Create independent copies of parsed documents.

```python { .api }
def clone(self) -> LexborHTMLParser:
    """
    Create a deep copy of the entire parsed document.

    Returns:
    New LexborHTMLParser instance with identical content
    """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Sample input for this example
html_content = '<div><img src="a.png"><p>Text</p></div>'

# Clone document for safe manipulation
original = LexborHTMLParser(html_content)
backup = original.clone()

# Modify original without affecting backup
original.strip_tags(['img'])
processed_text = original.text(strip=True)

# Backup remains unchanged
original_html = backup.html
```

### Text Processing

Advanced text manipulation methods for better text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes to improve text extraction quality.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Clean up text after tag manipulation
parser = LexborHTMLParser('<div><em>Hello</em> <strong>world</strong>!</div>')

# Remove formatting tags, keeping their text
parser.unwrap_tags(['em', 'strong'])
# Each leftover text fragment is joined separately, so a separator
# can introduce stray spaces (for example before the "!")
print(parser.text(separator=' ', strip=True))

# Merge adjacent text nodes so they are extracted as one piece
parser.merge_text_nodes()
print(parser.text(separator=' ', strip=True))  # Clean output: "Hello world!"
```