Tessl Tile for pypi/parsel@1.10.0

# CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions, with support for pseudo-elements and custom CSS features. Parsel extends standard CSS selectors with the `::text` and `::attr(name)` pseudo-elements for data extraction.

## Capabilities

### CSS to XPath Conversion

Convert CSS selectors to equivalent XPath expressions for internal processing.

```python { .api }
def css2xpath(query: str) -> str:
    """
    Convert CSS selector to XPath expression.

    This is the main utility function for CSS-to-XPath translation using
    the HTMLTranslator with pseudo-element support.

    Parameters:
    - query (str): CSS selector to convert

    Returns:
    str: Equivalent XPath expression

    Examples:
    - 'div.class' -> "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' class ')]"
    - 'p::text' -> 'descendant-or-self::p/text()'
    - 'a::attr(href)' -> 'descendant-or-self::a/@href'
    """
```

**Usage Example:**

```python
from parsel import css2xpath

# Basic element selectors
div_xpath = css2xpath('div')
# Returns: 'descendant-or-self::div'

# Class selectors
class_xpath = css2xpath('.container')
# Returns: "descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]"

# ID selectors
id_xpath = css2xpath('#main')
# Returns: "descendant-or-self::*[@id = 'main']"

# Attribute selectors
attr_xpath = css2xpath('input[type="text"]')
# Returns: "descendant-or-self::input[@type = 'text']"

# Descendant selectors
desc_xpath = css2xpath('div p')
# Returns: 'descendant-or-self::div/descendant-or-self::*/p'

# Child selectors
child_xpath = css2xpath('ul > li')
# Returns: 'descendant-or-self::ul/li'

# Pseudo-element selectors (Parsel extension)
text_xpath = css2xpath('p::text')
# Returns: 'descendant-or-self::p/text()'

href_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'
```

### Generic XML Translator

CSS to XPath translator for generic XML documents.

```python { .api }
class GenericTranslator:
    """
    CSS to XPath translator for generic XML documents.

    Provides caching and pseudo-element support for XML parsing.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with caching.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression equivalent to CSS selector

        Note:
        - Results are cached (LRU cache with 256 entries)
        - Supports pseudo-elements ::text and ::attr()
        """
```

### HTML-Optimized Translator

CSS to XPath translator optimized for HTML documents.

```python { .api }
class HTMLTranslator:
    """
    CSS to XPath translator optimized for HTML documents.

    Provides HTML-specific optimizations and pseudo-element support.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with HTML optimizations.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression optimized for HTML parsing

        Note:
        - Cached results (LRU cache with 256 entries)
        - HTML-specific case handling and optimizations
        - Supports pseudo-elements ::text and ::attr()
        """
```

**Usage Example:**

```python
from parsel.csstranslator import GenericTranslator, HTMLTranslator

# Create translator instances
xml_translator = GenericTranslator()
html_translator = HTMLTranslator()

css_selector = 'article h2.title'

# Convert using XML translator
xml_xpath = xml_translator.css_to_xpath(css_selector)
# Returns XPath suitable for generic XML

# Convert using HTML translator
html_xpath = html_translator.css_to_xpath(css_selector)
# Returns XPath optimized for HTML parsing

# Both support pseudo-elements
text_css = 'p.content::text'
xml_text_xpath = xml_translator.css_to_xpath(text_css)
html_text_xpath = html_translator.css_to_xpath(text_css)
# Both return: 'descendant-or-self::p[@class and contains(...)]/text()'
```

### Extended XPath Expressions

Enhanced XPath expressions with pseudo-element support.

```python { .api }
class XPathExpr:
    """
    Extended XPath expression with pseudo-element support.

    Extends cssselect's XPathExpr to handle ::text and ::attr() pseudo-elements.
    """

    textnode: bool = False
    attribute: Optional[str] = None

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None,
    ) -> "XPathExpr":
        """
        Create XPathExpr from existing expression with pseudo-element flags.

        Parameters:
        - xpath: Base XPath expression
        - textnode (bool): Whether to target text nodes
        - attribute (str, optional): Attribute name to target

        Returns:
        XPathExpr: Extended expression with pseudo-element support
        """

    def __str__(self) -> str:
        """
        Convert to string representation with pseudo-element handling.

        Returns:
        str: XPath string with text() or @attribute suffixes as needed
        """
```
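
To make the role of the `textnode` and `attribute` flags concrete, here is a simplified stand-in written with the standard library only. `PseudoXPath` is a hypothetical class for illustration, not Parsel's `XPathExpr`; it only models how a suffix could be appended when the expression is rendered, and ignores the edge cases a real implementation has to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PseudoXPath:
    """Hypothetical, simplified model of an XPath expression with pseudo-element flags."""

    path: str
    textnode: bool = False
    attribute: Optional[str] = None

    def __str__(self) -> str:
        rendered = self.path
        if self.textnode:
            # ::text selects the text node children of the element
            rendered += "/text()"
        if self.attribute is not None:
            # ::attr(name) selects the named attribute of the element
            rendered += f"/@{self.attribute}"
        return rendered


text_expr = str(PseudoXPath("descendant-or-self::p", textnode=True))
attr_expr = str(PseudoXPath("descendant-or-self::a", attribute="href"))
print(text_expr)
print(attr_expr)
```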

## Pseudo-Element Support

Parsel extends CSS selectors with custom pseudo-elements for enhanced data extraction.

### Text Node Selection

The `::text` pseudo-element selects the text nodes that are direct children of the matched elements.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="content">
    <h1>Main Title</h1>
    <p>First paragraph with <em>emphasis</em> text.</p>
    <p>Second paragraph.</p>
</div>
"""

selector = Selector(text=html)

# CSS with ::text pseudo-element
title_text = selector.css('h1::text').get()
# Returns: 'Main Title'

# Equivalent XPath (what css2xpath generates)
xpath_equivalent = css2xpath('h1::text')
# Returns: 'descendant-or-self::h1/text()'

# Manual XPath gives same result
manual_xpath = selector.xpath('//h1/text()').get()
# Returns: 'Main Title'

# Extract all text nodes from paragraphs
p_texts = selector.css('p::text').getall()
# Returns: ['First paragraph with ', ' text.', 'Second paragraph.']
# Note: the text inside the nested <em> element is excluded
```

### Attribute Value Selection

The `::attr(name)` pseudo-element selects the value of the named attribute on each matched element.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="links">
    <a href="https://example.com" title="Example Site">Example</a>
    <a href="https://google.com" title="Search Engine">Google</a>
    <img src="image.jpg" alt="Description" width="300">
</div>
"""

selector = Selector(text=html)

# Extract href attributes using ::attr() pseudo-element
hrefs = selector.css('a::attr(href)').getall()
# Returns: ['https://example.com', 'https://google.com']

# Extract title attributes
titles = selector.css('a::attr(title)').getall()
# Returns: ['Example Site', 'Search Engine']

# Extract image attributes
img_src = selector.css('img::attr(src)').get()
# Returns: 'image.jpg'

img_alt = selector.css('img::attr(alt)').get()
# Returns: 'Description'

# Check XPath conversion
attr_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'

# Equivalent manual XPath
manual_hrefs = selector.xpath('//a/@href').getall()
# Returns: ['https://example.com', 'https://google.com']
```

### Complex Pseudo-Element Combinations

Combine pseudo-elements with other CSS selector features.

**Usage Example:**

```python
from parsel import Selector

html = """
<article>
    <header>
        <h1 class="title">Article Title</h1>
        <p class="meta">Published on <time datetime="2024-01-15">January 15, 2024</time></p>
    </header>
    <section class="content">
        <p class="intro">Introduction paragraph.</p>
        <p class="body">Main content paragraph.</p>
    </section>
    <footer>
        <a href="/author/john" class="author-link">John Doe</a>
    </footer>
</article>
"""

selector = Selector(text=html)

# Complex selectors with pseudo-elements
article_title = selector.css('header h1.title::text').get()
# Returns: 'Article Title'

# Get datetime attribute from time element within meta paragraph
datetime_attr = selector.css('.meta time::attr(datetime)').get()
# Returns: '2024-01-15'

# Get author link URL
author_url = selector.css('footer .author-link::attr(href)').get()
# Returns: '/author/john'

# Get content paragraph texts (excluding intro)
content_texts = selector.css('section.content p.body::text').getall()
# Returns: ['Main content paragraph.']

# Combine descendant and pseudo-element selectors
intro_text = selector.css('article section .intro::text').get()
# Returns: 'Introduction paragraph.'
```

## Translation Internals

### Caching Mechanism

Both `GenericTranslator` and `HTMLTranslator` use LRU caching for performance.

```python
from functools import lru_cache

# Cache configuration (method shown outside its class for brevity)
@lru_cache(maxsize=256)
def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
    # Translation logic runs on a cache miss; the result is then memoized
    ...
```
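
The behaviour of `lru_cache` on a method can be observed with a small stand-in class. `DemoTranslator` below is hypothetical and its translation body is a placeholder; the point is only to show how hits and misses accumulate, and that the cache lives on the class while each entry is keyed by the instance plus its arguments.

```python
from functools import lru_cache


class DemoTranslator:
    """Hypothetical stand-in that mimics the caching decorator shown above."""

    @lru_cache(maxsize=256)
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        # Placeholder for real translation work
        return prefix + css


first, second = DemoTranslator(), DemoTranslator()

first.css_to_xpath("div")   # miss: first time this (instance, selector) is seen
first.css_to_xpath("div")   # hit: served from the cache
second.css_to_xpath("div")  # miss: a different instance is a different key

info = DemoTranslator.css_to_xpath.cache_info()
print(info)
```

Reusing one translator instance, as `css2xpath` does, gives the best hit rate under this scheme.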

### Pseudo-Element Processing

The translation process handles pseudo-elements through dynamic dispatch:

1. **Parse CSS selector** using the cssselect library
2. **Detect pseudo-elements** (`::text`, `::attr()`)
3. **Generate base XPath** for element selection
4. **Apply pseudo-element transformations** (`/text()` or `/@attribute`)
5. **Return complete XPath** expression
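
The steps above can be sketched with a toy translator that understands only bare tag names followed by an optional `::text` or `::attr(name)`. `toy_css_to_xpath` and its regular expression are hypothetical simplifications for illustration; Parsel itself delegates parsing to cssselect and supports the full selector grammar.

```python
import re

# Steps 1-2: a tag name, optionally followed by ::text or ::attr(name)
PSEUDO_RE = re.compile(
    r"^(?P<base>[A-Za-z][\w-]*)"
    r"(?:::(?P<pseudo>text|attr\((?P<attr>[\w-]+)\)))?$"
)


def toy_css_to_xpath(css: str, prefix: str = "descendant-or-self::") -> str:
    match = PSEUDO_RE.match(css.strip())
    if match is None:
        raise ValueError(f"unsupported selector: {css!r}")

    # Step 3: base XPath for the element
    xpath = prefix + match["base"]

    # Step 4: apply the pseudo-element transformation, if any
    if match["pseudo"] == "text":
        xpath += "/text()"
    elif match["attr"] is not None:
        xpath += "/@" + match["attr"]

    # Step 5: complete expression
    return xpath


print(toy_css_to_xpath("p::text"))
print(toy_css_to_xpath("a::attr(href)"))
```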

### Performance Considerations

- **Caching**: Frequently used CSS selectors are cached, so repeated translations of the same selector skip parsing
- **Compilation**: CSS selectors are compiled to XPath once and reused while they remain in the cache
- **Memory usage**: The LRU cache holds at most 256 entries; because `lru_cache` decorates the method, that limit is shared by all instances of a translator class
- **Thread safety**: Translators can be used safely across multiple threads
