Tessl Tile for pypi/selectolax@0.3.0

tessl/pypi-selectolax

Fast HTML5 parser with CSS selectors using Modest and Lexbor engines

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/selectolax@0.3.x

To install, run

npx @tessl/cli install tessl/pypi-selectolax@0.3.0

# selectolax

A high-performance HTML5 parser with CSS selector support. selectolax ships two parsing backends, the Modest and Lexbor engines, and exposes a Python API for CSS selection, attribute access, text extraction, and DOM traversal and manipulation.

## Package Information

- **Package Name**: selectolax
- **Language**: Python
- **Installation**: `pip install selectolax`
- **Version**: 0.3.34

## Core Imports

**Modest engine (default)**:
```python
from selectolax.parser import HTMLParser
```

**Lexbor engine (enhanced CSS selectors)**:
```python
from selectolax.lexbor import LexborHTMLParser
```

**Utility functions**:
```python
# Element creation and fragment parsing.
# Both modules export the same names, so import them from one engine only;
# importing from both in the same module would shadow the first import.
from selectolax.parser import create_tag, parse_fragment
# from selectolax.lexbor import create_tag, parse_fragment

# Exception handling
from selectolax.lexbor import SelectolaxError
```

## Basic Usage

```python
from selectolax.parser import HTMLParser

# Parse HTML content
html = """
<html>
    <head><title>Sample Page</title></head>
    <body>
        <div class="content">
            <h1 id="title">Hello World</h1>
            <p class="text">This is a paragraph.</p>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
            </ul>
        </div>
    </body>
</html>
"""

# Create parser instance
tree = HTMLParser(html)

# Extract text using CSS selectors
title = tree.css_first('h1#title').text()
print(f"Title: {title}")  # Output: Title: Hello World

# Get all list items
items = [node.text() for node in tree.css('li')]
print(f"Items: {items}")  # Output: Items: ['Item 1', 'Item 2']

# Access attributes
title_id = tree.css_first('h1').attributes['id']
print(f"Title ID: {title_id}")  # Output: Title ID: title

# Extract all text content
all_text = tree.text(strip=True)
print(f"All text: {all_text}")
```

## Architecture

selectolax provides two high-performance HTML parsing engines:

- **Modest Engine**: The default parser, providing HTML5 parsing with CSS selectors
- **Lexbor Engine**: Enhanced parser with additional features such as custom pseudo-classes (`:lexbor-contains`)

Both engines expose similar APIs through their respective parser classes (`HTMLParser` and `LexborHTMLParser`) and node classes (`Node` and `LexborNode`), so code written against one backend can usually be moved to the other with few changes.

The parsing workflow involves:
1. **Parse**: Create parser instance with HTML content
2. **Select**: Use CSS selectors or tag-based queries to find elements
3. **Extract**: Get text content, attributes, or HTML structure
4. **Manipulate**: Modify DOM structure by adding, removing, or replacing elements

## Capabilities

### HTML Parsing with Modest Engine

The primary HTML5 parser using the Modest engine. Provides comprehensive parsing capabilities with automatic encoding detection, CSS selector support, and DOM manipulation methods.

```python { .api }
class HTMLParser:
    def __init__(self, html, detect_encoding=True, use_meta_tags=True, decode_errors='ignore'): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Modest Engine Parser](./modest-parser.md)

### Enhanced Parsing with Lexbor Engine

Alternative HTML5 parser using the Lexbor engine. Offers enhanced CSS selector capabilities including custom pseudo-classes for advanced text matching and improved performance characteristics.

```python { .api }
class LexborHTMLParser:
    def __init__(self, html): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Lexbor Engine Parser](./lexbor-parser.md)

### DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, and structural modifications.

```python { .api }
class Node:
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
    def remove(self) -> None: ...
    def decompose(self) -> None: ...
```

[Node Operations](./node-operations.md)

## Common Types

```python { .api }
from typing import Iterator

# HTML content input types
HtmlInput = str | bytes

# CSS selector query type
CssQuery = str

# Attribute dictionary interface
class AttributeDict:
    def __getitem__(self, key: str) -> str | None: ...
    def __setitem__(self, key: str, value: str) -> None: ...
    def __contains__(self, key: str) -> bool: ...
    def get(self, key: str, default=None) -> str | None: ...
    def keys(self) -> Iterator[str]: ...
    def values(self) -> Iterator[str | None]: ...
    def items(self) -> Iterator[tuple[str, str | None]]: ...

# Exception classes
class SelectolaxError(Exception):
    """Base exception for selectolax-related errors."""
    pass
```