Tessl Tile for pypi/parsel@1.10.0

# XPath Extension Functions

Parsel extends lxml's XPath engine with custom extension functions. It ships a built-in `has-class` function for CSS class checks and lets you register your own utility functions for web scraping and document processing.

## Capabilities

### Extension Function Registration

Register custom XPath functions for use in XPath expressions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None:
    """
    Register a custom extension function for XPath expressions.

    Parameters:
    - fname (str): Function name to register in XPath namespace
    - func (Callable, optional): Function to register, or None to remove

    Note:
    - Functions are registered in the global XPath namespace (None)
    - Registered functions persist for the lifetime of the process
    - Functions receive context parameter plus any XPath arguments
    - Setting func=None removes the function registration

    Function Signature:
    - func(context, *args) -> Any
    - context: lxml evaluation context
    - *args: Arguments passed from XPath expression
    """
```

**Usage Example:**

```python
from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define custom XPath function
def has_word(context, word):
    """Case-insensitive substring check on the element's leading text."""
    # .text holds only the text before the element's first child
    node_text = context.context_node.text or ""
    return word.lower() in node_text.lower()

# Register the function under the name used in XPath expressions
set_xpathfunc('has-word', has_word)

html = """
<div>
    <p>This paragraph contains Python programming content.</p>
    <p>This paragraph discusses JavaScript frameworks.</p>
    <p>This paragraph covers HTML markup basics.</p>
</div>
"""

selector = Selector(text=html)

# Use custom function in XPath
python_paragraphs = selector.xpath('//p[has-word("Python")]')
programming_content = python_paragraphs.xpath('.//text()').get()
# Returns: 'This paragraph contains Python programming content.'

# Remove the function
set_xpathfunc('has-word', None)

# Function is no longer available
# selector.xpath('//p[has-word("test")]')  # Would raise an error: function is unregistered
```

### Built-in Extension Setup

Initialize all built-in XPath extension functions.

```python { .api }
def setup() -> None:
    """
    Register all built-in XPath extension functions.

    Currently registers:
    - has-class: Check if element has specific CSS classes

    This function is called automatically when parsel is imported.
    """
```

### CSS Class Checking Function

Built-in XPath function for checking CSS class membership.

```python { .api }
def has_class(context: Any, *classes: str) -> bool:
    """
    XPath extension function to check if element has specific CSS classes.

    Parameters:
    - context: lxml XPath evaluation context (automatic)
    - *classes: CSS class names to check for

    Returns:
    bool: True if all specified classes are present in element's class attribute

    Raises:
    - ValueError: If no classes provided or arguments are not strings

    Note:
    - Handles HTML5 whitespace normalization
    - Requires ALL specified classes to be present (AND operation)
    - Case-sensitive class matching
    - Automatically registered as 'has-class' function
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container main-content active">
    <p class="text primary">Primary text paragraph</p>
    <p class="text secondary highlighted">Secondary text paragraph</p>
    <p class="text">Basic text paragraph</p>
    <span class="label important urgent">Urgent label</span>
</div>
"""

selector = Selector(text=html)

# Check for a single class
text_elements = selector.xpath('//p[has-class("text")]')
print(f"Elements with 'text' class: {len(text_elements)}")  # 3

# Check for multiple classes (all must be present)
primary_text = selector.xpath('//p[has-class("text", "primary")]')
print(f"Elements with both 'text' and 'primary': {len(primary_text)}")  # 1

# Check for multiple classes on a different element
urgent_labels = selector.xpath('//span[has-class("label", "important", "urgent")]')
print(f"Urgent important labels: {len(urgent_labels)}")  # 1

# Classes may be listed in any order and need not be adjacent
highlighted_secondary = selector.xpath('//p[has-class("secondary", "highlighted")]')
highlighted_text = highlighted_secondary.xpath('.//text()').get()
# Returns: 'Secondary text paragraph'

# Check container classes
main_containers = selector.xpath('//div[has-class("container", "main-content")]')
print(f"Main content containers: {len(main_containers)}")  # 1

# Non-matching example
nonexistent = selector.xpath('//p[has-class("text", "nonexistent")]')
print(f"Non-matching elements: {len(nonexistent)}")  # 0
```
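
The matching rules described for `has-class` (every listed class must be present, matching is case-sensitive, and HTML5 whitespace separates class names) can be sketched in plain Python. This is a simplified illustration, not parsel's source: `class_attr_has_all` is a hypothetical name, the whitespace set is assumed to be the five HTML5 space characters, and the argument validation that `has-class` performs is left out.

```python
import re
from typing import Optional

# Assumption: the five HTML5 space characters (space, tab, LF, FF, CR).
_HTML5_WHITESPACE = re.compile(r"[ \t\n\f\r]+")


def class_attr_has_all(class_attr: Optional[str], *classes: str) -> bool:
    """Return True if every name in classes is a whole token of class_attr."""
    if class_attr is None:
        return False  # element has no class attribute at all
    # Split on HTML5 whitespace only and drop empty strings from the edges.
    tokens = {token for token in _HTML5_WHITESPACE.split(class_attr) if token}
    return all(cls in tokens for cls in classes)


print(class_attr_has_all("text primary", "text"))                # True
print(class_attr_has_all("text\tsecondary\nhighlighted",
                         "secondary", "highlighted"))            # True
print(class_attr_has_all("text primary", "text", "missing"))     # False
print(class_attr_has_all("Text", "text"))                        # False (case-sensitive)
print(class_attr_has_all("text-primary", "text"))                # False (whole tokens only)
print(class_attr_has_all(None, "text"))                          # False (no class attribute)
```

The token comparison is what distinguishes this from a naive `contains(@class, "text")` test, which would also match `text-primary`.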

### Advanced XPath Function Usage

Combine custom XPath functions with `has-class` and standard XPath operators such as `and` and `not()`.

**Usage Example:**

```python
import re

from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define additional custom functions
def contains_number(context):
    """Check if the element's leading text contains any numeric digit."""
    node_text = context.context_node.text or ""
    return bool(re.search(r'\d', node_text))

def text_length_gt(context, min_length):
    """Check if the stripped leading text is longer than min_length characters."""
    node_text = context.context_node.text or ""
    return len(node_text.strip()) > int(min_length)

# Register functions
set_xpathfunc('contains-number', contains_number)
set_xpathfunc('text-length-gt', text_length_gt)

html = """
<article>
    <h1 class="title main">Article About Data Science in 2024</h1>
    <p class="intro short">Brief intro.</p>
    <p class="content long">This is a comprehensive paragraph about machine learning
       algorithms and their applications in modern data science. It contains detailed
       explanations and examples.</p>
    <p class="stats">Processing 1000 records per second with 95% accuracy.</p>
    <p class="conclusion">Final thoughts on the topic.</p>
</article>
"""

selector = Selector(text=html)

# Combine has-class with custom functions
long_content = selector.xpath('//p[has-class("content") and text-length-gt("50")]')
print(f"Long content paragraphs: {len(long_content)}")  # 1

# Find elements with numbers that have specific classes
stats_with_numbers = selector.xpath('//p[has-class("stats") and contains-number()]')
stats_text = stats_with_numbers.xpath('.//text()').get()
# Returns: 'Processing 1000 records per second with 95% accuracy.'

# Same pattern on a different element type
title_with_year = selector.xpath('//h1[has-class("title") and contains-number()]')
title_text = title_with_year.xpath('.//text()').get()
# Returns: 'Article About Data Science in 2024'

# Multiple custom functions
long_paragraphs_no_numbers = selector.xpath('//p[text-length-gt("20") and not(contains-number())]')
print(f"Long paragraphs without numbers: {len(long_paragraphs_no_numbers)}")  # 2
```
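
The custom functions in this section read `context.context_node.text`, which on lxml elements (as in the standard library's ElementTree API) holds only the text that precedes the first child element. If a paragraph contains inline markup such as `<b>`, words or digits inside the child are missed. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml to show the difference; `full_text` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET


def full_text(node: ET.Element) -> str:
    """Concatenate the text of a node and all of its descendants."""
    return "".join(node.itertext())


p = ET.fromstring("<p>Processing <b>1000</b> records per second.</p>")

leading_text = p.text or ""  # only the text before <b>
all_text = full_text(p)      # text of the whole subtree

print(repr(leading_text))                    # 'Processing '
print(repr(all_text))                        # 'Processing 1000 records per second.'
print(bool(re.search(r"\d", leading_text)))  # False
print(bool(re.search(r"\d", all_text)))      # True
```

lxml elements also provide `itertext()`, so the same join should carry over to an extension function by calling it on `context.context_node`.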

## Error Handling and Validation

The built-in `has-class` function validates its arguments and raises `ValueError` when it is called with no arguments or with arguments that are not strings.

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="test">
    <p class="item valid">Valid content</p>
    <p class="item">Basic content</p>
</div>
"""

selector = Selector(text=html)

# Test error conditions
try:
    # Empty class list - should raise ValueError
    result = selector.xpath('//p[has-class()]')
except ValueError as e:
    print(f"Expected error for empty classes: {type(e).__name__}")

try:
    # Non-string class argument - should raise ValueError
    # The XPath number 123 is not a string, so validation fails during evaluation
    result = selector.xpath('//p[has-class("valid", 123)]')
except ValueError as e:
    print(f"Error for non-string argument: {type(e).__name__}")

# Valid usage
valid_items = selector.xpath('//p[has-class("item", "valid")]')
print(f"Valid items found: {len(valid_items)}")  # 1
```

## Performance and Optimization

### Function Call Optimization

The built-in `has-class` function is designed for repeated calls within a single XPath evaluation:

- **Argument validation caching**: Validation results are cached per evaluation context, so arguments are checked once per evaluation rather than once per node
- **Whitespace processing**: HTML5 whitespace normalization is optimized
- **Context reuse**: Evaluation context is reused across function calls

**Performance Example:**

```python
from parsel import Selector

# Large HTML document with many elements
html = """
<div class="container">
""" + "\n".join([
    f'<p class="item type-{i % 3} {"active" if i % 5 == 0 else ""}">Item {i}</p>'
    for i in range(1000)
]) + """
</div>
"""

selector = Selector(text=html)

# One XPath evaluation checks all 1000 paragraphs;
# argument validation is cached for that evaluation
active_items = selector.xpath('//p[has-class("item", "active")]')
print(f"Found {len(active_items)} active items")  # 200 (every 5th item)

# Extract specific type with active status
active_type_0 = selector.xpath('//p[has-class("item", "type-0", "active")]')
print(f"Active type-0 items: {len(active_type_0)}")  # 67 (every 15th item)
```
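
The argument validation caching noted under Function Call Optimization can be pictured as a flag stored in a per-evaluation dictionary: arguments are checked the first time the function runs in an evaluation and skipped for every later node. The sketch below is an illustration under assumptions, not parsel's code: `FakeContext` is a hypothetical stand-in for the lxml evaluation context, `has_class_sketch` is a hypothetical name, and `eval_context` is assumed to be a dictionary shared by all calls within one evaluation.

```python
from typing import Any, Dict, Optional

validation_runs = 0  # counts how often the argument checks actually execute


class FakeContext:
    """Hypothetical stand-in for the context passed to extension functions."""

    def __init__(self, eval_context: Dict[str, Any], class_attr: Optional[str]) -> None:
        self.eval_context = eval_context  # shared by every call in one evaluation
        self.class_attr = class_attr      # class attribute of the current node


def has_class_sketch(context: FakeContext, *classes: Any) -> bool:
    global validation_runs
    if not context.eval_context.get("args_checked"):
        validation_runs += 1
        if not classes:
            raise ValueError("has-class must have at least 1 argument")
        if not all(isinstance(c, str) for c in classes):
            raise ValueError("has-class arguments must be strings")
        # Remember that this evaluation's arguments are already validated.
        context.eval_context["args_checked"] = True
    tokens = (context.class_attr or "").split()
    return all(c in tokens for c in classes)


# One simulated evaluation over 1000 nodes sharing a single dictionary.
shared: Dict[str, Any] = {}
matches = 0
for i in range(1000):
    attr = "item active" if i % 5 == 0 else "item"
    if has_class_sketch(FakeContext(shared, attr), "item", "active"):
        matches += 1

print(matches, validation_runs)  # 200 1
```

The arguments of a function call in an XPath expression are the same for every node visited, which is why checking them once per evaluation is sufficient in this sketch.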

### Memory Management

- **Context cleanup**: Extension functions don't hold references to DOM nodes
- **String processing**: Minimal string allocation for class checking
- **Cache efficiency**: Validation cache is scoped to evaluation context

## Integration with Standard XPath

Extension functions can be combined with standard XPath features such as positional predicates, string functions, attribute tests, and boolean operators:

```python
from parsel import Selector

html = """
<section class="products">
    <div class="product featured premium">Premium Product A</div>
    <div class="product featured">Featured Product B</div>
    <div class="product premium">Premium Product C</div>
    <div class="product">Basic Product D</div>
</section>
"""

selector = Selector(text=html)

# Combine with a positional predicate
first_featured = selector.xpath('(//div[has-class("product", "featured")])[1]')
first_featured_text = first_featured.xpath('.//text()').get()
# Returns: 'Premium Product A'

# Combine with text functions
premium_with_a = selector.xpath('//div[has-class("product", "premium") and contains(text(), "A")]')

# Combine with attribute checks
products_with_class = selector.xpath('//div[@class and has-class("product")]')
print(f"Products with class attribute: {len(products_with_class)}")  # 4

# Complex boolean logic
featured_or_premium = selector.xpath('//div[has-class("product") and (has-class("featured") or has-class("premium"))]')
print(f"Featured or premium products: {len(featured_or_premium)}")  # 3
```
