# Parsel

Parsel is a Python library for extracting data from HTML, XML, and JSON documents. Its `Selector` and `SelectorList` classes expose a single chainable API that supports XPath expressions, CSS selectors, JMESPath queries (for JSON), and regular expressions.

## Package Information

- **Package Name**: parsel
- **Language**: Python
- **Installation**: `pip install parsel`
- **Dependencies**: lxml, cssselect, jmespath, w3lib, packaging

## Core Imports

```python
from parsel import Selector, SelectorList
```

Helper function and module imports:

```python
from parsel import css2xpath
from parsel import xpathfuncs
```

## Basic Usage

```python
from parsel import Selector

# Parse an HTML document
html = """
<html>
    <body>
        <h1>Hello, Parsel!</h1>
        <ul>
            <li><a href="http://example.com">Link 1</a></li>
            <li><a href="http://scrapy.org">Link 2</a></li>
        </ul>
        <script type="application/json">{"a": ["b", "c"]}</script>
    </body>
</html>
"""

selector = Selector(text=html)

# Extract text using a CSS selector
title = selector.css('h1::text').get()  # 'Hello, Parsel!'

# Select list items with CSS, then extract each link with a relative XPath
for li in selector.css('ul > li'):
    href = li.xpath('.//@href').get()
    print(href)

# Extract and parse JSON content embedded in the page
json_data = selector.css('script::text').jmespath("a").getall()  # ['b', 'c']

# Use regular expressions
words = selector.xpath('//h1/text()').re(r'\w+')  # ['Hello', 'Parsel']
```

## Architecture

Parsel's architecture centers on two main classes:

- **Selector**: Wraps input data (HTML, XML, JSON, or plain text) and provides selection methods
- **SelectorList**: A list of `Selector` objects with chainable methods for batch operations

The library supports multiple parsing strategies:

- **HTML parsing**: Uses `lxml.html.HTMLParser`, with CSS pseudo-element support
- **XML parsing**: Uses `SafeXMLParser` (extends `lxml.etree.XMLParser`), with namespace management
- **JSON parsing**: Uses native Python JSON parsing, with JMESPath query support
- **Text parsing**: Treats input as plain text, with regex extraction

## Capabilities

### Document Parsing and Selection

Parse HTML, XML, JSON, and plain-text documents and query them through one selector interface that supports several query languages.

```python { .api }
class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None: ...

    def xpath(
        self,
        query: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> SelectorList["Selector"]: ...

    def css(self, query: str) -> SelectorList["Selector"]: ...

    def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]: ...
```

[Document Parsing and Selection](./parsing-selection.md)

### Data Extraction and Content Retrieval

Methods that return text content, attributes, and serialized markup from selected elements. The regex methods can optionally replace character entities in their results.

```python { .api }
def get(self) -> Any: ...
def getall(self) -> List[str]: ...
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]: ...
def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]: ...

@property
def attrib(self) -> Dict[str, str]: ...
```

[Data Extraction](./data-extraction.md)

### SelectorList Operations

Batch operations on multiple selectors. Each query method applies to every element in the list and returns a new, flattened `SelectorList`, so calls can be chained.

```python { .api }
class SelectorList(List["Selector"]):
    def xpath(
        self,
        xpath: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> "SelectorList[Selector]": ...

    def css(self, query: str) -> "SelectorList[Selector]": ...

    def jmespath(self, query: str, **kwargs: Any) -> "SelectorList[Selector]": ...

    def get(self, default: Optional[str] = None) -> Optional[str]: ...
    def getall(self) -> List[str]: ...
```

[SelectorList Operations](./selectorlist-operations.md)

### XML Namespace Management

Functionality for working with XML namespaces, including registration, removal, and namespace-aware queries.

```python { .api }
def register_namespace(self, prefix: str, uri: str) -> None: ...
def remove_namespaces(self) -> None: ...
```

[XML Namespace Management](./xml-namespaces.md)

### Element Modification

Methods for removing elements from the parsed document tree. `drop()` is the current method; `remove()` is deprecated.

```python { .api }
def drop(self) -> None: ...
def remove(self) -> None: ...  # deprecated
```

[Element Modification](./element-modification.md)

### CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions, with support for pseudo-elements and custom CSS features.

```python { .api }
def css2xpath(query: str) -> str: ...

class GenericTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...

class HTMLTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...
```

[CSS Translation](./css-translation.md)

### XPath Extension Functions

Custom XPath functions that extend element selection, such as `has-class` for CSS class checks, plus a hook for registering your own functions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None: ...
def has_class(context: Any, *classes: str) -> bool: ...
def setup() -> None: ...
```

[XPath Extensions](./xpath-extensions.md)

## Types

```python { .api }
# Type aliases
_SelectorType = TypeVar("_SelectorType", bound="Selector")
_ParserType = Union[etree.XMLParser, etree.HTMLParser]
_TostringMethodType = Literal["html", "xml"]

# Exception classes
class CannotRemoveElementWithoutRoot(Exception): ...
class CannotRemoveElementWithoutParent(Exception): ...
class CannotDropElementWithoutParent(CannotRemoveElementWithoutParent): ...

# CSS Translator classes
class XPathExpr:
    textnode: bool
    attribute: Optional[str]

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None
    ) -> "XPathExpr": ...
```