Tessl Tile for pypi/selectolax@0.3.0

# Enhanced Parsing with Lexbor Engine

Alternative HTML5 parser using the Lexbor engine. Offers enhanced CSS selector capabilities including custom pseudo-classes for advanced text matching, improved performance characteristics, and extended selector support beyond standard CSS.

## Capabilities

### LexborHTMLParser Class

Fast HTML parser with enhanced CSS selector support and custom pseudo-classes for advanced document querying.

```python { .api }
class LexborHTMLParser:
    def __init__(self, html: str | bytes):
        """
        Initialize Lexbor HTML parser.

        Parameters:
        - html: HTML content as string or bytes
        """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Parse HTML content
html = '<div><p>Hello world</p><p class="special">Special content</p></div>'
parser = LexborHTMLParser(html)

# Parse from bytes
html_bytes = b'<html><body><h1>Title</h1></body></html>'
parser = LexborHTMLParser(html_bytes)
```

### Enhanced CSS Selectors

Advanced CSS selector capabilities including custom pseudo-classes for text matching and extended selector support.

```python { .api }
def css(self, query: str) -> list[LexborNode]:
    """
    Find elements using enhanced CSS selectors.

    Supports standard CSS selectors plus Lexbor extensions:
    - :lexbor-contains("text") - case-sensitive text matching
    - :lexbor-contains("text" i) - case-insensitive text matching

    Parameters:
    - query: CSS selector with optional Lexbor extensions

    Returns:
    List of LexborNode objects matching the selector
    """

def css_first(self, query: str, default=None, strict: bool = False) -> LexborNode | None:
    """
    Find first element with enhanced CSS selectors.

    Parameters:
    - query: CSS selector string with optional Lexbor extensions
    - default: Value to return if no match found
    - strict: If True, error when multiple matches exist

    Returns:
    First matching LexborNode object or default value
    """
```

**Usage Example:**
```python
# Standard CSS selectors
paragraphs = parser.css('p.content')
first_div = parser.css_first('div')

# Lexbor custom pseudo-classes - case sensitive
awesome_nodes = parser.css('p:lexbor-contains("awesome")')

# Lexbor custom pseudo-classes - case insensitive
case_insensitive = parser.css('p:lexbor-contains("HELLO" i)')

# Complex selectors with custom pseudo-classes
specific = parser.css('div.content p:lexbor-contains("important" i)')
```

### Tag-Based Selection

Select elements by tag name with improved performance over the Modest engine.

```python { .api }
def tags(self, name: str) -> list[LexborNode]:
    """
    Find all elements with specified tag name.

    Parameters:
    - name: HTML tag name (e.g., 'div', 'p', 'a')

    Returns:
    List of LexborNode objects with matching tag name
    """
```

**Usage Example:**
```python
# Get all links
links = parser.tags('a')

# Get all headings
headings = parser.tags('h1')

# Get all list items
items = parser.tags('li')
```

### Text Extraction

Extract text content with enhanced performance and consistent behavior.

```python { .api }
def text(self, deep: bool = True, separator: str = '', strip: bool = False) -> str:
    """
    Extract text content from document body.

    Parameters:
    - deep: Include text from child elements
    - separator: String to join text from different elements
    - strip: Apply str.strip() to each text part

    Returns:
    Extracted text content as string
    """
```

**Usage Example:**
```python
# Get all text content
all_text = parser.text()

# Get text with separators
separated_text = parser.text(separator=' | ')

# Get clean text without extra whitespace
clean_text = parser.text(strip=True)

# Get only direct text content
direct_text = parser.text(deep=False)
```

### DOM Tree Access

Access document structure with enhanced node types and consistent interface.

```python { .api }
@property
def root(self) -> LexborNode | None:
    """Returns root HTML element node."""

@property
def head(self) -> LexborNode | None:
    """Returns HTML head element node."""

@property
def body(self) -> LexborNode | None:
    """Returns HTML body element node."""

@property
def html(self) -> str | None:
    """Returns HTML representation of the document."""

@property
def raw_html(self) -> bytes:
    """Returns raw HTML bytes used for parsing."""

@property
def selector(self) -> LexborCSSSelector:
    """Returns CSS selector instance for advanced queries."""
```

**Usage Example:**
```python
# Access document parts
root = parser.root
head = parser.head
body = parser.body

# Get HTML output
html_output = parser.html

# Access raw input
original = parser.raw_html

# Get selector for advanced operations
css_selector = parser.selector
```

### Advanced CSS Selector Interface

Direct access to the underlying CSS selector engine for advanced use cases.

```python { .api }
class LexborCSSSelector:
    def find(self, query: str, node: LexborNode) -> list[LexborNode]:
        """
        Find elements matching selector within given node.

        Parameters:
        - query: CSS selector string
        - node: Root node to search within

        Returns:
        List of matching LexborNode objects
        """

    def any_matches(self, query: str, node: LexborNode) -> bool:
        """
        Check if any elements match selector.

        Parameters:
        - query: CSS selector string
        - node: Root node to search within

        Returns:
        True if any matches exist, False otherwise
        """
```

**Usage Example:**
```python
# Get selector instance
selector = parser.selector

# Search within specific node
content_div = parser.css_first('div.content')
if content_div:
    matches = selector.find('p.important', content_div)

# Check for existence without retrieving
has_errors = selector.any_matches('.error', parser.root)
```

## Utility Functions

### Element Creation

Create new HTML elements programmatically.

```python { .api }
def create_tag(tag: str) -> LexborNode:
    """
    Create new HTML element with specified tag name.

    Parameters:
    - tag: HTML tag name (e.g., 'div', 'p', 'img')

    Returns:
    New LexborNode element with the specified tag
    """

def parse_fragment(html: str) -> list[LexborNode]:
    """
    Parse HTML fragment into list of nodes without adding wrapper elements.

    Unlike LexborHTMLParser which adds missing html/head/body tags, this function
    returns nodes exactly as specified in the HTML fragment.

    Parameters:
    - html: HTML fragment string to parse

    Returns:
    List of LexborNode objects representing the parsed HTML fragment
    """
```

**Usage Example:**
```python
from selectolax.lexbor import create_tag, parse_fragment

# Create simple elements
div = create_tag('div')
paragraph = create_tag('p')
link = create_tag('a')

# Parse HTML fragments without wrappers
fragment_html = '<span>Text 1</span><span>Text 2</span>'
spans = parse_fragment(fragment_html)

# Use in DOM manipulation
container = create_tag('div')
for span in spans:
    container.insert_child(span)

print(container.html)  # <div><span>Text 1</span><span>Text 2</span></div>
```

### Document Cloning

Create independent copies of parsed documents.

```python { .api }
def clone(self) -> LexborHTMLParser:
    """
    Create a deep copy of the entire parsed document.

    Returns:
    New LexborHTMLParser instance with identical content
    """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Sample input for illustration; any HTML string works here
html_content = '<div><img src="photo.png"><p>Caption text</p></div>'

# Clone document for safe manipulation
original = LexborHTMLParser(html_content)
backup = original.clone()

# Modify original without affecting backup
original.strip_tags(['img'])
processed_text = original.text(strip=True)

# Backup remains unchanged
original_html = backup.html
```

### Text Processing

Advanced text manipulation methods for better text extraction.

```python { .api }
def merge_text_nodes(self) -> None:
    """
    Merge adjacent text nodes to improve text extraction quality.

    Useful after removing HTML tags to eliminate extra spaces
    and fragmented text caused by tag removal.
    """
```

**Usage Example:**
```python
from selectolax.lexbor import LexborHTMLParser

# Clean up text after tag manipulation
parser = LexborHTMLParser('<div><em>Hello</em> <strong>world</strong>!</div>')

# Remove formatting tags; their text stays behind as separate text nodes
parser.unwrap_tags(['em', 'strong'])

# With a separator, each leftover fragment is stripped and joined individually
print(parser.text(separator=' ', strip=True))  # May have extra spaces: "Hello  world !"

# Merge text nodes for cleaner output
parser.merge_text_nodes()
print(parser.text(separator=' ', strip=True))  # Clean output: "Hello world!"
```
