# CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions with support for pseudo-elements and custom CSS features. Parsel extends standard CSS selector capabilities with additional pseudo-elements for enhanced data extraction.

## Capabilities

### CSS to XPath Conversion

Convert CSS selectors to equivalent XPath expressions for internal processing.

```python { .api }
def css2xpath(query: str) -> str:
    """
    Convert CSS selector to XPath expression.

    This is the main utility function for CSS-to-XPath translation using
    the HTMLTranslator with pseudo-element support.

    Parameters:
    - query (str): CSS selector to convert

    Returns:
    str: Equivalent XPath expression

    Examples:
    - 'div.class' -> "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' class ')]"
    - 'p::text' -> 'descendant-or-self::p/text()'
    - 'a::attr(href)' -> 'descendant-or-self::a/@href'
    """
```

**Usage Example:**

```python
from parsel import css2xpath

# Basic element selectors
div_xpath = css2xpath('div')
# Returns: 'descendant-or-self::div'

# Class selectors
class_xpath = css2xpath('.container')
# Returns: "descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]"

# ID selectors
id_xpath = css2xpath('#main')
# Returns: "descendant-or-self::*[@id = 'main']"

# Attribute selectors
attr_xpath = css2xpath('input[type="text"]')
# Returns: "descendant-or-self::input[@type = 'text']"

# Descendant selectors
desc_xpath = css2xpath('div p')
# Returns: 'descendant-or-self::div/descendant-or-self::*/p'

# Child selectors
child_xpath = css2xpath('ul > li')
# Returns: 'descendant-or-self::ul/li'

# Pseudo-element selectors (Parsel extension)
text_xpath = css2xpath('p::text')
# Returns: 'descendant-or-self::p/text()'

href_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'
```

### Generic XML Translator

CSS to XPath translator for generic XML documents.

```python { .api }
class GenericTranslator:
    """
    CSS to XPath translator for generic XML documents.

    Provides caching and pseudo-element support for XML parsing.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with caching.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression equivalent to CSS selector

        Note:
        - Results are cached (LRU cache with 256 entries)
        - Supports pseudo-elements ::text and ::attr()
        """
```

### HTML-Optimized Translator

CSS to XPath translator optimized for HTML documents.

```python { .api }
class HTMLTranslator:
    """
    CSS to XPath translator optimized for HTML documents.

    Provides HTML-specific optimizations and pseudo-element support.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with HTML optimizations.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression optimized for HTML parsing

        Note:
        - Cached results (LRU cache with 256 entries)
        - HTML-specific case handling and optimizations
        - Supports pseudo-elements ::text and ::attr()
        """
```

**Usage Example:**

```python
from parsel.csstranslator import GenericTranslator, HTMLTranslator

# Create translator instances
xml_translator = GenericTranslator()
html_translator = HTMLTranslator()

css_selector = 'article h2.title'

# Convert using XML translator
xml_xpath = xml_translator.css_to_xpath(css_selector)
# Returns XPath suitable for generic XML

# Convert using HTML translator
html_xpath = html_translator.css_to_xpath(css_selector)
# Returns XPath optimized for HTML parsing

# Both support pseudo-elements
text_css = 'p.content::text'
xml_text_xpath = xml_translator.css_to_xpath(text_css)
html_text_xpath = html_translator.css_to_xpath(text_css)
# Both return: 'descendant-or-self::p[@class and contains(...)]/text()'
```

### Extended XPath Expressions

Enhanced XPath expressions with pseudo-element support.

```python { .api }
class XPathExpr:
    """
    Extended XPath expression with pseudo-element support.

    Extends cssselect's XPathExpr to handle ::text and ::attr() pseudo-elements.
    """

    textnode: bool = False
    attribute: Optional[str] = None

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None,
    ) -> "XPathExpr":
        """
        Create XPathExpr from existing expression with pseudo-element flags.

        Parameters:
        - xpath: Base XPath expression
        - textnode (bool): Whether to target text nodes
        - attribute (str, optional): Attribute name to target

        Returns:
        XPathExpr: Extended expression with pseudo-element support
        """

    def __str__(self) -> str:
        """
        Convert to string representation with pseudo-element handling.

        Returns:
        str: XPath string with text() or @attribute suffixes as needed
        """
```

## Pseudo-Element Support

Parsel extends CSS selectors with custom pseudo-elements for enhanced data extraction.

### Text Node Selection

The `::text` pseudo-element selects the text nodes that are direct children of the matched elements.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="content">
    <h1>Main Title</h1>
    <p>First paragraph with <em>emphasis</em> text.</p>
    <p>Second paragraph.</p>
</div>
"""

selector = Selector(text=html)

# CSS with ::text pseudo-element
title_text = selector.css('h1::text').get()
# Returns: 'Main Title'

# Equivalent XPath (what css2xpath generates)
xpath_equivalent = css2xpath('h1::text')
# Returns: 'descendant-or-self::h1/text()'

# Manual XPath gives same result
manual_xpath = selector.xpath('//h1/text()').get()
# Returns: 'Main Title'

# Extract all text nodes from paragraphs
p_texts = selector.css('p::text').getall()
# Returns: ['First paragraph with ', ' text.', 'Second paragraph.']
# Note: The text inside the nested <em> element is excluded,
# but the text that follows it (' text.') is a separate direct text node
```

### Attribute Value Selection

The `::attr(name)` pseudo-element selects attribute values.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="links">
    <a href="https://example.com" title="Example Site">Example</a>
    <a href="https://google.com" title="Search Engine">Google</a>
    <img src="image.jpg" alt="Description" width="300">
</div>
"""

selector = Selector(text=html)

# Extract href attributes using ::attr() pseudo-element
hrefs = selector.css('a::attr(href)').getall()
# Returns: ['https://example.com', 'https://google.com']

# Extract title attributes
titles = selector.css('a::attr(title)').getall()
# Returns: ['Example Site', 'Search Engine']

# Extract image attributes
img_src = selector.css('img::attr(src)').get()
# Returns: 'image.jpg'

img_alt = selector.css('img::attr(alt)').get()
# Returns: 'Description'

# Check XPath conversion
attr_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'

# Equivalent manual XPath
manual_hrefs = selector.xpath('//a/@href').getall()
# Returns: ['https://example.com', 'https://google.com']
```

### Complex Pseudo-Element Combinations

Combine pseudo-elements with other CSS selector features.

**Usage Example:**

```python
from parsel import Selector

html = """
<article>
    <header>
        <h1 class="title">Article Title</h1>
        <p class="meta">Published on <time datetime="2024-01-15">January 15, 2024</time></p>
    </header>
    <section class="content">
        <p class="intro">Introduction paragraph.</p>
        <p class="body">Main content paragraph.</p>
    </section>
    <footer>
        <a href="/author/john" class="author-link">John Doe</a>
    </footer>
</article>
"""

selector = Selector(text=html)

# Complex selectors with pseudo-elements
article_title = selector.css('header h1.title::text').get()
# Returns: 'Article Title'

# Get datetime attribute from time element within meta paragraph
datetime_attr = selector.css('.meta time::attr(datetime)').get()
# Returns: '2024-01-15'

# Get author link URL
author_url = selector.css('footer .author-link::attr(href)').get()
# Returns: '/author/john'

# Get content paragraph texts (excluding intro)
content_texts = selector.css('section.content p.body::text').getall()
# Returns: ['Main content paragraph.']

# Combine descendant and pseudo-element selectors
intro_text = selector.css('article section .intro::text').get()
# Returns: 'Introduction paragraph.'
```

## Translation Internals

### Caching Mechanism

Both `GenericTranslator` and `HTMLTranslator` wrap `css_to_xpath` in an LRU cache for performance.

```python
from functools import lru_cache

# Cache configuration
@lru_cache(maxsize=256)
def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
    ...  # Translation logic with caching
```
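
The effect of the decorator can be observed with `cache_info()`. The sketch below is standard-library only and uses a hypothetical `translate` function as a stand-in for the real translation step; it is not Parsel's code.

```python
from functools import lru_cache

calls = 0  # counts how often the stand-in translation really runs


@lru_cache(maxsize=256)
def translate(css: str, prefix: str = "descendant-or-self::") -> str:
    """Toy stand-in for a translator; handles bare tag names only."""
    global calls
    calls += 1
    return prefix + css


first = translate("div")   # cache miss: the function body runs
second = translate("div")  # cache hit: the stored result is returned
info = translate.cache_info()
```

After these two calls, `info` reports one hit and one miss, and `calls` shows the function body ran only once.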

### Pseudo-Element Processing

The translation process handles pseudo-elements through dynamic dispatch:

1. **Parse CSS selector** using the cssselect library
2. **Detect pseudo-elements** (::text, ::attr())
3. **Generate base XPath** for element selection
4. **Apply pseudo-element transformations** (/text() or /@attribute)
5. **Return complete XPath** expression
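
The steps above can be sketched with a toy translator that looks up a handler method by name. Everything here is hypothetical and heavily simplified: `ToyTranslator`, its regular expression, and its two-argument handler signature are assumptions for illustration, and the real parsing is done by cssselect rather than a regex.

```python
import re
from typing import Optional


class ToyTranslator:
    """Hypothetical translator for 'tag', 'tag::text' and 'tag::attr(name)' only."""

    PATTERN = re.compile(
        r"^(?P<tag>[\w-]+)(?:::(?P<name>[\w-]+)(?:\((?P<arg>[\w-]+)\))?)?$"
    )

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        match = self.PATTERN.match(css)              # 1. parse the selector
        if match is None:
            raise ValueError(f"unsupported selector: {css!r}")
        name, arg = match["name"], match["arg"]      # 2. detect a pseudo-element
        xpath = prefix + match["tag"]                # 3. base XPath for the element
        if name is not None:
            kind = "functional" if arg is not None else "simple"
            # Dynamic dispatch: the handler is found by its method name
            method = getattr(self, f"xpath_{name}_{kind}_pseudo_element", None)
            if method is None:
                raise ValueError(f"unknown pseudo-element ::{name}")
            xpath = method(xpath, arg)               # 4. apply the transformation
        return xpath                                 # 5. complete expression

    def xpath_text_simple_pseudo_element(self, xpath: str, arg: Optional[str]) -> str:
        return xpath + "/text()"

    def xpath_attr_functional_pseudo_element(self, xpath: str, arg: Optional[str]) -> str:
        return f"{xpath}/@{arg}"


translator = ToyTranslator()
results = [translator.css_to_xpath(css) for css in ("p", "p::text", "a::attr(href)")]
```

With name-based dispatch, supporting another pseudo-element only requires adding a method that follows the naming pattern.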

### Performance Considerations

- **Caching**: Frequently used CSS selectors are cached for faster repeated access
- **Compilation**: A CSS selector is translated to XPath once and the result is reused while it stays in the cache
- **Memory usage**: Each decorated `css_to_xpath` method keeps at most 256 entries; the least recently used entries are evicted first
- **Thread safety**: Translators can be used safely across multiple threads