Tessl Tile for pypi/parsel@1.10.0

# Parsel

Parsel is a Python library for extracting data from HTML, XML, and JSON documents. Its `Selector` and `SelectorList` classes expose a single chainable API for XPath expressions, CSS selectors, JMESPath queries (for JSON), and regular expressions.

## Package Information

- **Package Name**: parsel
- **Language**: Python
- **Installation**: `pip install parsel`
- **Dependencies**: lxml, cssselect, jmespath, w3lib, packaging

## Core Imports

```python
from parsel import Selector, SelectorList
```

Direct module imports:

```python
from parsel import css2xpath
from parsel import xpathfuncs
```

## Basic Usage

```python
from parsel import Selector

# Parse HTML document
html = """
<html>
    <body>
        <h1>Hello, Parsel!</h1>
        <ul>
            <li><a href="http://example.com">Link 1</a></li>
            <li><a href="http://scrapy.org">Link 2</a></li>
        </ul>
        <script type="application/json">{"a": ["b", "c"]}</script>
    </body>
</html>
"""

selector = Selector(text=html)

# Extract text using CSS selectors
title = selector.css('h1::text').get()  # 'Hello, Parsel!'

# Select list items with CSS, then extract each link with XPath
for li in selector.css('ul > li'):
    href = li.xpath('.//@href').get()
    print(href)

# Extract and parse JSON content
json_data = selector.css('script::text').jmespath("a").getall()  # ['b', 'c']

# Use regular expressions (single backslash: in a raw string, r'\\w+' would match a literal backslash)
words = selector.xpath('//h1/text()').re(r'\w+')  # ['Hello', 'Parsel']
```

## Architecture

Parsel's architecture centers around two main classes:

- **Selector**: Wraps input data (HTML/XML/JSON/text) and provides selection methods
- **SelectorList**: List of Selector objects with chainable methods for batch operations

The library supports multiple parsing strategies:

- **HTML parsing**: Using lxml.html.HTMLParser with CSS pseudo-element support
- **XML parsing**: Using SafeXMLParser (extends lxml.etree.XMLParser) with namespace management
- **JSON parsing**: Native Python JSON parsing with JMESPath query support
- **Text parsing**: Plain text content with regex extraction

## Capabilities

### Document Parsing and Selection

Core functionality for parsing HTML, XML, JSON, and text documents with a unified selector interface supporting multiple query languages.

```python { .api }
class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None: ...

    def xpath(
        self,
        query: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> SelectorList["Selector"]: ...

    def css(self, query: str) -> SelectorList["Selector"]: ...

    def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]: ...
```

[Document Parsing and Selection](./parsing-selection.md)

### Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized data from selected elements with support for entity replacement and formatting.

```python { .api }
def get(self) -> Any: ...
def getall(self) -> List[str]: ...
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]: ...
def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]: ...

@property
def attrib(self) -> Dict[str, str]: ...
```

[Data Extraction](./data-extraction.md)

### SelectorList Operations

Batch operations on multiple selectors with chainable methods for filtering, extracting, and transforming collections of selected elements.

```python { .api }
class SelectorList(List["Selector"]):
    def xpath(
        self,
        xpath: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> "SelectorList[Selector]": ...

    def css(self, query: str) -> "SelectorList[Selector]": ...

    def jmespath(self, query: str, **kwargs: Any) -> "SelectorList[Selector]": ...

    def get(self, default: Optional[str] = None) -> Optional[str]: ...
    def getall(self) -> List[str]: ...
```

[SelectorList Operations](./selectorlist-operations.md)

### XML Namespace Management

Functionality for working with XML namespaces including registration, removal, and namespace-aware queries.

```python { .api }
def register_namespace(self, prefix: str, uri: str) -> None: ...
def remove_namespaces(self) -> None: ...
```

[XML Namespace Management](./xml-namespaces.md)

### Element Modification

Methods for removing and modifying DOM elements within the parsed document structure.

```python { .api }
def drop(self) -> None: ...
def remove(self) -> None: ...  # deprecated
```

[Element Modification](./element-modification.md)

### CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions with support for pseudo-elements and custom CSS features.

```python { .api }
def css2xpath(query: str) -> str: ...

class GenericTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...

class HTMLTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...
```

[CSS Translation](./css-translation.md)

### XPath Extension Functions

Custom XPath functions for enhanced element selection including CSS class checking and other utility functions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None: ...
def has_class(context: Any, *classes: str) -> bool: ...
def setup() -> None: ...
```

[XPath Extensions](./xpath-extensions.md)

## Types

```python { .api }
# Type aliases
_SelectorType = TypeVar("_SelectorType", bound="Selector")
_ParserType = Union[etree.XMLParser, etree.HTMLParser]
_TostringMethodType = Literal["html", "xml"]

# Exception classes
class CannotRemoveElementWithoutRoot(Exception): ...
class CannotRemoveElementWithoutParent(Exception): ...
class CannotDropElementWithoutParent(CannotRemoveElementWithoutParent): ...

# CSS Translator classes
class XPathExpr:
    textnode: bool
    attribute: Optional[str]

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None,
    ) -> "XPathExpr": ...
```
