# XPath Extension Functions

Custom XPath functions for enhanced element selection, including CSS class checking and other utility functions. Parsel extends lxml's XPath capabilities with domain-specific functions for web scraping and document processing.

## Capabilities

### Extension Function Registration

Register custom XPath functions for use in XPath expressions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None:
    """
    Register a custom extension function for XPath expressions.

    Parameters:
    - fname (str): Function name to register in XPath namespace
    - func (Callable, optional): Function to register, or None to remove

    Note:
    - Functions are registered in the global XPath namespace (None)
    - Registered functions persist for the lifetime of the process
    - Functions receive context parameter plus any XPath arguments
    - Setting func=None removes the function registration

    Function Signature:
    - func(context, *args) -> Any
    - context: lxml evaluation context
    - *args: Arguments passed from XPath expression
    """
```

**Usage Example:**

```python
from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define custom XPath function
def has_word(context, word):
    """Check if element text contains a specific word."""
    node_text = context.context_node.text or ""
    return word.lower() in node_text.lower()

# Register the function
set_xpathfunc('has-word', has_word)

html = """
<div>
    <p>This paragraph contains Python programming content.</p>
    <p>This paragraph discusses JavaScript frameworks.</p>
    <p>This paragraph covers HTML markup basics.</p>
</div>
"""

selector = Selector(text=html)

# Use custom function in XPath
python_paragraphs = selector.xpath('//p[has-word("Python")]')
programming_content = python_paragraphs.xpath('.//text()').get()
# Returns: 'This paragraph contains Python programming content.'

# Remove the function
set_xpathfunc('has-word', None)

# Function is no longer available
# selector.xpath('//p[has-word("test")]')  # Would raise error
```
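
One caveat with `has_word` as written: an element's `.text` holds only the text that precedes its first child element, so words inside nested tags are missed. The sketch below contrasts that with joining `itertext()`. It uses the standard library's `xml.etree.ElementTree` and a `SimpleNamespace` as stand-ins for lxml's element and context objects (an assumption made so the example runs without an XPath engine; both element types expose `.text` and `.itertext()`). The names `has_word_direct` and `has_word_deep` are hypothetical.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace


def has_word_direct(context, word):
    """Match against the text before the first child element only."""
    node_text = context.context_node.text or ""
    return word.lower() in node_text.lower()


def has_word_deep(context, word):
    """Match against all descendant text of the node."""
    node_text = "".join(context.context_node.itertext())
    return word.lower() in node_text.lower()


# Stand-in for the context lxml passes in; only `context_node` is used here.
paragraph = ET.fromstring("<p>Learn <b>Python</b> today</p>")
context = SimpleNamespace(context_node=paragraph)

print(has_word_direct(context, "Python"))  # False: "Python" is inside <b>
print(has_word_deep(context, "Python"))    # True
print(has_word_direct(context, "learn"))   # True
```

Either variant can be registered with `set_xpathfunc`; which one fits depends on whether nested markup should count toward the match.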

### Built-in Extension Setup

Initialize all built-in XPath extension functions.

```python { .api }
def setup() -> None:
    """
    Register all built-in XPath extension functions.

    Currently registers:
    - has-class: Check if element has specific CSS classes

    This function is called automatically when parsel is imported.
    """
```

### CSS Class Checking Function

Built-in XPath function for checking CSS class membership.

```python { .api }
def has_class(context: Any, *classes: str) -> bool:
    """
    XPath extension function to check if element has specific CSS classes.

    Parameters:
    - context: lxml XPath evaluation context (automatic)
    - *classes: CSS class names to check for

    Returns:
    bool: True if all specified classes are present in element's class attribute

    Raises:
    - ValueError: If no classes provided or arguments are not strings

    Note:
    - Handles HTML5 whitespace normalization
    - Requires ALL specified classes to be present (AND operation)
    - Case-sensitive class matching
    - Automatically registered as 'has-class' function
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container main-content active">
    <p class="text primary">Primary text paragraph</p>
    <p class="text secondary highlighted">Secondary text paragraph</p>
    <p class="text">Basic text paragraph</p>
    <span class="label important urgent">Urgent label</span>
</div>
"""

selector = Selector(text=html)

# Check for single class
text_elements = selector.xpath('//p[has-class("text")]')
print(f"Elements with 'text' class: {len(text_elements)}")  # 3

# Check for multiple classes (all must be present)
primary_text = selector.xpath('//p[has-class("text", "primary")]')
print(f"Elements with both 'text' and 'primary': {len(primary_text)}")  # 1

# Check for multiple classes on different element
urgent_labels = selector.xpath('//span[has-class("label", "important", "urgent")]')
print(f"Urgent important labels: {len(urgent_labels)}")  # 1

# Complex combinations
highlighted_secondary = selector.xpath('//p[has-class("secondary", "highlighted")]')
highlighted_text = highlighted_secondary.xpath('.//text()').get()
# Returns: 'Secondary text paragraph'

# Check container classes
main_containers = selector.xpath('//div[has-class("container", "main-content")]')
print(f"Main content containers: {len(main_containers)}")  # 1

# Non-matching example
nonexistent = selector.xpath('//p[has-class("text", "nonexistent")]')
print(f"Non-matching elements: {len(nonexistent)}")  # 0
```
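
The matching rules listed for `has-class` (whole-token, case-sensitive, all classes required, whitespace-normalized) can be illustrated in plain Python. This is a simplified sketch of those semantics, not parsel's source. The helper name `class_attr_matches` is hypothetical, and the whitespace set is assumed to be the HTML5 ASCII whitespace characters: space, tab, line feed, form feed, and carriage return.

```python
import re

# ASCII whitespace as defined by HTML5: space, tab, LF, FF, CR (assumed set).
HTML5_WHITESPACE_RE = re.compile(r"[ \t\n\f\r]+")


def class_attr_matches(class_attr, *classes):
    """Return True if every name in `classes` is a whole token of `class_attr`."""
    if class_attr is None:
        return False
    # Collapse whitespace runs and pad with spaces so that each class
    # can be matched as the exact token " name ".
    padded = " " + HTML5_WHITESPACE_RE.sub(" ", class_attr) + " "
    return all(f" {name} " in padded for name in classes)


print(class_attr_matches("text primary", "text"))                # True
print(class_attr_matches("text\n\tprimary", "text", "primary"))  # True
print(class_attr_matches("text-primary", "text"))                # False
print(class_attr_matches("Text", "text"))                        # False
print(class_attr_matches(None, "text"))                          # False
```

Padding with spaces is what keeps `text` from matching `text-primary`, which a plain `contains(@class, "text")` test would accept.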

### Advanced XPath Function Usage

Combine custom XPath functions with standard XPath features.

**Usage Example:**

```python
import re

from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define additional custom functions
def contains_number(context):
    """Check if element text contains any numeric digits."""
    node_text = context.context_node.text or ""
    return bool(re.search(r'\d', node_text))

def text_length_gt(context, min_length):
    """Check if element text length is greater than specified value."""
    node_text = context.context_node.text or ""
    return len(node_text.strip()) > int(min_length)

# Register functions
set_xpathfunc('contains-number', contains_number)
set_xpathfunc('text-length-gt', text_length_gt)

html = """
<article>
    <h1 class="title main">Article About Data Science in 2024</h1>
    <p class="intro short">Brief intro.</p>
    <p class="content long">This is a comprehensive paragraph about machine learning
    algorithms and their applications in modern data science. It contains detailed
    explanations and examples.</p>
    <p class="stats">Processing 1000 records per second with 95% accuracy.</p>
    <p class="conclusion">Final thoughts on the topic.</p>
</article>
"""

selector = Selector(text=html)

# Combine has-class with custom functions
long_content = selector.xpath('//p[has-class("content") and text-length-gt("50")]')
print(f"Long content paragraphs: {len(long_content)}")  # 1

# Find elements with numbers that have specific classes
stats_with_numbers = selector.xpath('//p[has-class("stats") and contains-number()]')
stats_text = stats_with_numbers.xpath('.//text()').get()
# Returns: 'Processing 1000 records per second with 95% accuracy.'

# Complex conditions
title_with_year = selector.xpath('//h1[has-class("title") and contains-number()]')
title_text = title_with_year.xpath('.//text()').get()
# Returns: 'Article About Data Science in 2024'

# Multiple custom functions
long_paragraphs_no_numbers = selector.xpath('//p[text-length-gt("20") and not(contains-number())]')
print(f"Long paragraphs without numbers: {len(long_paragraphs_no_numbers)}")  # 2
```
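
The arguments a custom function receives are not always strings. Going by lxml's documented type mapping, string literals arrive as `str`, XPath numbers as `float`, booleans as `bool`, and node-sets (for example `.` or `@title`) as lists. The helper below is one possible way to normalize such values before using them; `xpath_arg_to_text` is a hypothetical name, not part of parsel or lxml, and the standard library's `ElementTree` stands in for an lxml element in the demonstration.

```python
import xml.etree.ElementTree as ET


def xpath_arg_to_text(value):
    """Coerce a value received from an XPath expression to a string."""
    if isinstance(value, list):      # node-set, e.g. `.` or `@title`
        if not value:
            return ""
        value = value[0]             # like XPath string(): use the first node
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, float):
        # XPath numbers arrive as floats; drop the ".0" on whole numbers.
        return str(int(value)) if value.is_integer() else str(value)
    if isinstance(value, str):       # literals, attribute and text() results
        return value
    # Anything else is treated as an element: join its descendant text.
    return "".join(value.itertext())


print(xpath_arg_to_text(50.0))                 # 50
print(xpath_arg_to_text(["first", "second"]))  # first
print(xpath_arg_to_text([ET.fromstring("<p>Hi <b>there</b></p>")]))  # Hi there
```

With a helper like this, `text_length_gt` could accept `text-length-gt(50)` and `text-length-gt("50")` alike.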

## Error Handling and Validation

`has-class` validates its arguments and raises `ValueError` when it is called with no arguments or with an argument that is not a string.

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="test">
    <p class="item valid">Valid content</p>
    <p class="item">Basic content</p>
</div>
"""

selector = Selector(text=html)

# Test error conditions
try:
    # Empty class list - should raise ValueError
    result = selector.xpath('//p[has-class()]')
except ValueError as e:
    print(f"Expected error for empty classes: {type(e).__name__}")

try:
    # Non-string class argument - should raise ValueError
    # Note: the XPath number 123 reaches has-class as a float, not a str
    result = selector.xpath('//p[has-class("valid", 123)]')
except ValueError as e:
    print(f"Error for non-string argument: {type(e).__name__}")

# Valid usage
valid_items = selector.xpath('//p[has-class("item", "valid")]')
print(f"Valid items found: {len(valid_items)}")  # 1
```

## Performance and Optimization

### Function Call Optimization

`has-class` is designed to be called once per candidate node, so its per-call work is kept small:

- **Argument validation caching**: Arguments are validated on the first call of an XPath evaluation and the result is recorded in the evaluation context, so later calls in the same evaluation skip the check
- **Whitespace processing**: HTML5 whitespace in the `class` attribute is normalized with a precompiled regular expression
- **Context reuse**: The same evaluation context is shared by every call within one XPath evaluation

**Performance Example:**

```python
from parsel import Selector

# Large HTML document with many elements
html = """
<div class="container">
""" + "\n".join([
    f'<p class="item type-{i % 3} {"active" if i % 5 == 0 else ""}">Item {i}</p>'
    for i in range(1000)
]) + """
</div>
"""

selector = Selector(text=html)

# Efficient batch processing with has-class
# Argument validation runs once per evaluation, not once per node
active_items = selector.xpath('//p[has-class("item", "active")]')
print(f"Found {len(active_items)} active items")  # 200

# Extract specific type with active status
active_type_0 = selector.xpath('//p[has-class("item", "type-0", "active")]')
print(f"Active type-0 items: {len(active_type_0)}")  # 67
```
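
The once-per-evaluation validation described in this section can be applied to custom functions too. The sketch below assumes lxml's `context.eval_context`, a dictionary for state local to one evaluation, and shows a hypothetical `has_any_class` function (an OR counterpart to `has-class`, not part of parsel). `fake_context` is a stand-in for lxml's context object so the example runs on its own, and `str.split()` is used as a simplification of HTML5 whitespace handling.

```python
from types import SimpleNamespace


def has_any_class(context, *classes):
    """Hypothetical extension function: True if the node has at least one class."""
    # Validate the arguments only on the first call of an evaluation,
    # then record that in the evaluation-local dictionary.
    if not context.eval_context.get("has_any_class_checked"):
        if not classes:
            raise ValueError("has-any-class needs at least 1 argument")
        if not all(isinstance(c, str) for c in classes):
            raise ValueError("has-any-class arguments must be strings")
        context.eval_context["has_any_class_checked"] = True

    node_classes = (context.context_node.get("class") or "").split()
    return any(c in node_classes for c in classes)


def fake_context(class_attr, eval_context):
    """Stand-in for lxml's context object, for illustration only."""
    node = SimpleNamespace(get=lambda name: class_attr if name == "class" else None)
    return SimpleNamespace(context_node=node, eval_context=eval_context)


shared = {}  # plays the role of one evaluation's eval_context
print(has_any_class(fake_context("item active", shared), "active", "new"))  # True
print(has_any_class(fake_context("item", shared), "active", "new"))         # False
print(shared)  # {'has_any_class_checked': True}
```

In real use the function would be registered with `set_xpathfunc('has-any-class', has_any_class)` and lxml would supply the context.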
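
### Memory Management

- **Context cleanup**: `has-class` reads the node's `class` attribute and returns a boolean; it keeps no reference to the node after the call
- **String processing**: Each call builds one normalized copy of the `class` attribute and checks membership by substring search
- **Cache efficiency**: The validation flag lives in the evaluation context rather than in module-level state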

## Integration with Standard XPath

Extension functions can be combined with standard XPath features such as positional predicates, string functions, attribute tests, and boolean operators:

```python
from parsel import Selector

html = """
<section class="products">
    <div class="product featured premium">Premium Product A</div>
    <div class="product featured">Featured Product B</div>
    <div class="product premium">Premium Product C</div>
    <div class="product">Basic Product D</div>
</section>
"""

selector = Selector(text=html)

# Combine with positional functions
first_featured = selector.xpath('(//div[has-class("product", "featured")])[1]')
first_featured_text = first_featured.xpath('.//text()').get()
# Returns: 'Premium Product A'

# Combine with text functions
premium_with_a = selector.xpath('//div[has-class("product", "premium") and contains(text(), "A")]')

# Combine with attribute checks
products_with_class = selector.xpath('//div[@class and has-class("product")]')
print(f"Products with class attribute: {len(products_with_class)}")  # 4

# Complex boolean logic
featured_or_premium = selector.xpath('//div[has-class("product") and (has-class("featured") or has-class("premium"))]')
print(f"Featured or premium products: {len(featured_or_premium)}")  # 3
```
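
For comparison, the class test that `has-class` performs can be approximated in standard XPath 1.0 with the `contains(concat(' ', normalize-space(@class), ' '), ' name ')` idiom. The sketch below builds such a predicate; `class_test` is a hypothetical helper, it assumes class names contain no quote characters, and `normalize-space` follows XML whitespace rules, so the result is a close approximation rather than an exact equivalent.

```python
def class_test(*classes):
    """Build a standard XPath 1.0 predicate that tests for CSS classes."""
    if not classes:
        raise ValueError("at least one class name is required")
    tests = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in classes
    ]
    # All classes must be present, mirroring the AND behaviour of has-class.
    return " and ".join(tests)


print(f"//div[{class_test('product', 'featured')}]")
```

The generated expression is noticeably longer than `//div[has-class("product", "featured")]`, which is the main convenience the extension function offers.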