tessl/pypi-selectolax

Fast HTML5 parser with CSS selectors using Modest and Lexbor engines

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/selectolax@0.3.x

To install, run

npx @tessl/cli install tessl/pypi-selectolax@0.3.0

# selectolax

A fast HTML5 parser with CSS selector support and two parsing backends, the Modest and Lexbor engines. selectolax exposes a Python API for parsing and manipulating HTML documents: CSS selection, attribute access, text extraction, and DOM traversal.

## Package Information

- **Package Name**: selectolax
- **Language**: Python
- **Installation**: `pip install selectolax`
- **Version**: 0.3.34

## Core Imports

**Modest engine (default)**:

```python
from selectolax.parser import HTMLParser
```

**Lexbor engine (enhanced CSS selectors)**:

```python
from selectolax.lexbor import LexborHTMLParser
```

**Utility functions**:

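```python
# Element creation and fragment parsing.
# Each backend exports its own pair; import from the module that matches
# the parser you use (importing both, as here, makes the second shadow the first).
from selectolax.parser import create_tag, parse_fragment
from selectolax.lexbor import create_tag, parse_fragment

# Exception handling
from selectolax.lexbor import SelectolaxError
```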

## Basic Usage

```python
from selectolax.parser import HTMLParser

# Parse HTML content
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <div class="content">
      <h1 id="title">Hello World</h1>
      <p class="text">This is a paragraph.</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
      </ul>
    </div>
  </body>
</html>
"""

# Create parser instance
tree = HTMLParser(html)

# Extract text using CSS selectors
title = tree.css_first('h1#title').text()
print(f"Title: {title}")  # Output: Title: Hello World

# Get all list items
items = [node.text() for node in tree.css('li')]
print(f"Items: {items}")  # Output: Items: ['Item 1', 'Item 2']

# Access attributes
title_id = tree.css_first('h1').attributes['id']
print(f"Title ID: {title_id}")  # Output: Title ID: title

# Extract all text content
all_text = tree.text(strip=True)
print(f"All text: {all_text}")
```

## Architecture

selectolax provides two high-performance HTML parsing engines:

- **Modest Engine**: The default parser providing comprehensive HTML5 parsing with CSS selectors
- **Lexbor Engine**: Enhanced parser with additional features like custom pseudo-classes (`:lexbor-contains`)

Both engines expose similar APIs through their respective parser classes (`HTMLParser` and `LexborHTMLParser`) and node classes (`Node` and `LexborNode`), allowing easy switching between backends while maintaining compatibility.

The parsing workflow involves:

1. **Parse**: Create parser instance with HTML content
2. **Select**: Use CSS selectors or tag-based queries to find elements
3. **Extract**: Get text content, attributes, or HTML structure
4. **Manipulate**: Modify DOM structure by adding, removing, or replacing elements

## Capabilities

### HTML Parsing with Modest Engine

The primary HTML5 parser using the Modest engine. Provides comprehensive parsing capabilities with automatic encoding detection, CSS selector support, and DOM manipulation methods.

```python { .api }
class HTMLParser:
    def __init__(self, html, detect_encoding=True, use_meta_tags=True, decode_errors='ignore'): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Modest Engine Parser](./modest-parser.md)

### Enhanced Parsing with Lexbor Engine

Alternative HTML5 parser using the Lexbor engine. Offers enhanced CSS selector capabilities including custom pseudo-classes for advanced text matching and improved performance characteristics.

```python { .api }
class LexborHTMLParser:
    def __init__(self, html): ...
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def tags(self, name: str) -> list: ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
```

[Lexbor Engine Parser](./lexbor-parser.md)

### DOM Node Operations

Comprehensive node manipulation capabilities for traversing, modifying, and extracting data from parsed HTML documents. Includes text extraction, attribute access, and structural modifications.

```python { .api }
class Node:
    def css(self, query: str) -> list: ...
    def css_first(self, query: str, default=None, strict=False): ...
    def text(self, deep=True, separator='', strip=False) -> str: ...
    def remove(self) -> None: ...
    def decompose(self) -> None: ...
```

[Node Operations](./node-operations.md)

## Common Types

```python { .api }
from typing import Iterator

# HTML content input types
HtmlInput = str | bytes

# CSS selector query type
CssQuery = str

# Attribute dictionary interface
class AttributeDict:
    def __getitem__(self, key: str) -> str | None: ...
    def __setitem__(self, key: str, value: str) -> None: ...
    def __contains__(self, key: str) -> bool: ...
    def get(self, key: str, default=None) -> str | None: ...
    def keys(self) -> Iterator[str]: ...
    def values(self) -> Iterator[str | None]: ...
    def items(self) -> Iterator[tuple[str, str | None]]: ...

# Exception classes
class SelectolaxError(Exception):
    """Base exception for selectolax-related errors."""
    pass
```