# Document Parsing and Selection

Core functionality for parsing HTML, XML, JSON, and text documents through a unified selector interface that supports XPath, CSS selectors, and JMESPath.

## Capabilities

### Selector Initialization

Create Selector instances from various input formats with configurable parsing options.

```python { .api }
class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None:
        """
        Initialize a Selector for parsing and selecting from documents.

        Parameters:
        - text (str, optional): Text content to parse
        - type (str, optional): Document type - "html", "xml", "json", or "text"
        - body (bytes): Raw bytes content (alternative to text)
        - encoding (str): Character encoding for body content, defaults to "utf-8"
        - namespaces (dict, optional): XML namespace prefix mappings
        - root (Any, optional): Pre-parsed root element or data
        - base_url (str, optional): Base URL for resolving relative URLs
        - _expr (str, optional): Expression that created this selector
        - huge_tree (bool): Enable large document parsing support, defaults to True

        Raises:
        - ValueError: Invalid type or missing required arguments
        - TypeError: Invalid argument types
        """
```

**Usage Example:**

```python
from parsel import Selector

# Parse HTML text
html_selector = Selector(text="<html><body><h1>Title</h1></body></html>")

# Parse XML with explicit type
xml_selector = Selector(text="<root><item>data</item></root>", type="xml")

# Parse JSON
json_selector = Selector(text='{"name": "value", "items": [1, 2, 3]}', type="json")

# Parse from bytes with encoding
bytes_selector = Selector(body=b"<html><body>Content</body></html>", encoding="utf-8")

# Parse with XML namespaces
ns_selector = Selector(
    text="<root xmlns:ns='http://example.com'><ns:item>data</ns:item></root>",
    type="xml",
    namespaces={"ns": "http://example.com"}
)
```

### XPath Selection

Execute XPath expressions for precise element selection with namespace support and variable binding.

```python { .api }
def xpath(
    self,
    query: str,
    namespaces: Optional[Mapping[str, str]] = None,
    **kwargs: Any,
) -> SelectorList["Selector"]:
    """
    Find nodes matching the XPath query.

    Parameters:
    - query (str): XPath expression to execute
    - namespaces (dict, optional): Additional namespace prefix mappings
    - **kwargs: Variable bindings for XPath variables

    Returns:
        SelectorList: Collection of matching Selector objects

    Raises:
    - ValueError: Invalid XPath expression or unsupported selector type
    - XPathError: XPath syntax or evaluation errors
    """
```

**Usage Example:**

```python
from parsel import Selector

selector = Selector(text="""
<html>
    <body>
        <div class="content">
            <p>First paragraph</p>
            <p>Second paragraph</p>
        </div>
        <a href="http://example.com">Link</a>
    </body>
</html>
""")

# Select all paragraphs
paragraphs = selector.xpath('//p')

# Select text content
text_nodes = selector.xpath('//p/text()')

# Select attributes
hrefs = selector.xpath('//a/@href')

# Use XPath variables
links = selector.xpath('//a[@href=$url]', url="http://example.com")

# Complex XPath expressions
content_divs = selector.xpath('//div[@class="content"]//p[position()>1]')
```

### CSS Selection

Apply CSS selectors, including the non-standard `::text` and `::attr(name)` pseudo-elements for extracting text nodes and attribute values.

```python { .api }
def css(self, query: str) -> SelectorList["Selector"]:
    """
    Apply CSS selector and return matching elements.

    Parameters:
    - query (str): CSS selector expression

    Returns:
        SelectorList: Collection of matching Selector objects

    Raises:
    - ValueError: Invalid CSS selector or unsupported selector type
    - ExpressionError: CSS syntax errors
    """
```

**Usage Example:**

```python
from parsel import Selector

selector = Selector(text="""
<html>
    <body>
        <div class="container">
            <h1 id="title">Main Title</h1>
            <p class="intro">Introduction text</p>
            <ul>
                <li><a href="link1.html">Link 1</a></li>
                <li><a href="link2.html">Link 2</a></li>
            </ul>
        </div>
    </body>
</html>
""")

# Select by class
intro = selector.css('.intro')

# Select by ID
title = selector.css('#title')

# Select descendants
links = selector.css('.container a')

# Pseudo-element selectors for text content
title_text = selector.css('h1::text')

# Pseudo-element selectors for attributes
link_urls = selector.css('a::attr(href)')

# Complex selectors
first_link = selector.css('ul li:first-child a')
```

### JMESPath Selection

Query JSON data using JMESPath expressions for complex data extraction.

```python { .api }
def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]:
    """
    Find objects matching the JMESPath query for JSON data.

    Parameters:
    - query (str): JMESPath expression to apply
    - **kwargs: Additional options passed to jmespath.search()

    Returns:
        SelectorList: Collection of matching Selector objects with extracted data

    Note:
    - Works with JSON-type selectors or JSON content within HTML/XML elements
    - Results are wrapped in new Selector objects for chaining
    """
```

**Usage Example:**

```python
from parsel import Selector

# JSON document
json_text = '''
{
    "users": [
        {"name": "Alice", "age": 30, "email": "alice@example.com"},
        {"name": "Bob", "age": 25, "email": "bob@example.com"}
    ],
    "metadata": {
        "total": 2,
        "page": 1
    }
}
'''

selector = Selector(text=json_text, type="json")

# Extract all user names
names = selector.jmespath('users[*].name')

# Extract specific user
first_user = selector.jmespath('users[0]')

# Complex queries
adult_emails = selector.jmespath('users[?age >= `30`].email')

# Nested data extraction
metadata = selector.jmespath('metadata.total')

# JSON within HTML
html_with_json = """
<script type="application/json">
{"config": {"theme": "dark", "version": "1.0"}}
</script>
"""
html_selector = Selector(text=html_with_json)
theme = html_selector.css('script::text').jmespath('config.theme')
```

## Document Type Detection

Parsel detects the document type automatically unless `type` is specified explicitly:

- **HTML**: Default type for markup; parsed with lxml's HTML parser, which tolerates malformed markup
- **XML**: XML parsing with namespace support
- **JSON**: Native JSON parsing with JMESPath support
- **Text**: Plain text content for regex extraction

Auto-detection works by examining content structure:
- JSON: Input that parses as valid JSON is detected automatically
- XML: Not inferred from content; specify `type="xml"` explicitly, particularly when namespaces are involved
- HTML: Default fallback for markup content