# XPath Extension Functions

Custom XPath functions for enhanced element selection including CSS class checking and other utility functions. Parsel extends lxml's XPath capabilities with domain-specific functions for web scraping and document processing.

## Capabilities

### Extension Function Registration

Register custom XPath functions for use in XPath expressions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None:
    """
    Register a custom extension function for XPath expressions.

    Parameters:
    - fname (str): Function name to register in XPath namespace
    - func (Callable, optional): Function to register, or None to remove

    Note:
    - Functions are registered in the global XPath namespace (None)
    - Registered functions persist for the lifetime of the process
    - Functions receive context parameter plus any XPath arguments
    - Setting func=None removes the function registration

    Function Signature:
    - func(context, *args) -> Any
    - context: lxml evaluation context
    - *args: Arguments passed from XPath expression
    """
```

**Usage Example:**

```python
from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define custom XPath function
def has_word(context, word):
    """Check if element text contains a specific word."""
    # .text holds only the text that precedes the element's first child
    node_text = context.context_node.text or ""
    return word.lower() in node_text.lower()

# Register the function
set_xpathfunc('has-word', has_word)

html = """
<div>
    <p>This paragraph contains Python programming content.</p>
    <p>This paragraph discusses JavaScript frameworks.</p>
    <p>This paragraph covers HTML markup basics.</p>
</div>
"""

selector = Selector(text=html)

# Use custom function in XPath
python_paragraphs = selector.xpath('//p[has-word("Python")]')
programming_content = python_paragraphs.xpath('.//text()').get()
# Returns: 'This paragraph contains Python programming content.'

# Remove the function
set_xpathfunc('has-word', None)

# Function is no longer available
# selector.xpath('//p[has-word("test")]')  # Would raise error
```

### Built-in Extension Setup

Initialize all built-in XPath extension functions.

```python { .api }
def setup() -> None:
    """
    Register all built-in XPath extension functions.

    Currently registers:
    - has-class: Check if element has specific CSS classes

    This function is called automatically when parsel is imported.
    """
```

### CSS Class Checking Function

Built-in XPath function for checking CSS class membership.

```python { .api }
def has_class(context: Any, *classes: str) -> bool:
    """
    XPath extension function to check if element has specific CSS classes.

    Parameters:
    - context: lxml XPath evaluation context (automatic)
    - *classes: CSS class names to check for

    Returns:
    bool: True if all specified classes are present in element's class attribute

    Raises:
    - ValueError: If no classes provided or arguments are not strings

    Note:
    - Handles HTML5 whitespace normalization
    - Requires ALL specified classes to be present (AND operation)
    - Case-sensitive class matching
    - Automatically registered as 'has-class' function
    """
```
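
The matching rules listed in the notes above can be mimicked in plain Python. This is a sketch of the semantics only, not Parsel's implementation: `class_attr_has_all` and `HTML5_WHITESPACE` are names chosen for illustration, and the whitespace set assumes the HTML5 definition of ASCII whitespace (space, tab, LF, FF, CR).

```python
import re

# Assumption: HTML5 ASCII whitespace is space, tab, line feed, form feed, carriage return.
HTML5_WHITESPACE = " \t\n\r\x0c"
_collapse_whitespace = re.compile(f"[{HTML5_WHITESPACE}]+").sub


def class_attr_has_all(class_attr, *classes):
    """Hypothetical helper: True if every name in `classes` is a token of `class_attr`."""
    if not classes:
        raise ValueError("at least one class name is required")
    if not all(isinstance(cls, str) for cls in classes):
        raise ValueError("class names must be strings")
    if class_attr is None:
        return False
    # Pad with spaces and collapse whitespace runs so every token is space-delimited.
    padded = _collapse_whitespace(" ", f" {class_attr} ")
    # AND semantics: every requested class must appear as a whole token.
    return all(f" {cls} " in padded for cls in classes)


print(class_attr_has_all("text\tprimary\n", "text", "primary"))  # True
print(class_attr_has_all("textual", "text"))                     # False
print(class_attr_has_all("text", "Text"))                        # False
```

Padding the attribute value with spaces keeps `text` from matching inside `textual`, and the comparison is case-sensitive, as the notes state.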

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container main-content active">
    <p class="text primary">Primary text paragraph</p>
    <p class="text secondary highlighted">Secondary text paragraph</p>
    <p class="text">Basic text paragraph</p>
    <span class="label important urgent">Urgent label</span>
</div>
"""

selector = Selector(text=html)

# Check for single class
text_elements = selector.xpath('//p[has-class("text")]')
print(f"Elements with 'text' class: {len(text_elements)}")  # 3

# Check for multiple classes (all must be present)
primary_text = selector.xpath('//p[has-class("text", "primary")]')
print(f"Elements with both 'text' and 'primary': {len(primary_text)}")  # 1

# Check for multiple classes on different element
urgent_labels = selector.xpath('//span[has-class("label", "important", "urgent")]')
print(f"Urgent important labels: {len(urgent_labels)}")  # 1

# Complex combinations
highlighted_secondary = selector.xpath('//p[has-class("secondary", "highlighted")]')
highlighted_text = highlighted_secondary.xpath('.//text()').get()
# Returns: 'Secondary text paragraph'

# Check container classes
main_containers = selector.xpath('//div[has-class("container", "main-content")]')
print(f"Main content containers: {len(main_containers)}")  # 1

# Non-matching example
nonexistent = selector.xpath('//p[has-class("text", "nonexistent")]')
print(f"Non-matching elements: {len(nonexistent)}")  # 0
```

### Advanced XPath Function Usage

Combine custom XPath functions with standard XPath features.

**Usage Example:**

```python
import re

from parsel import Selector
from parsel.xpathfuncs import set_xpathfunc

# Define additional custom functions
def contains_number(context):
    """Check if element text contains any numeric digits."""
    node_text = context.context_node.text or ""
    return bool(re.search(r'\d', node_text))

def text_length_gt(context, min_length):
    """Check if element text length is greater than specified value."""
    node_text = context.context_node.text or ""
    return len(node_text.strip()) > int(min_length)

# Register functions
set_xpathfunc('contains-number', contains_number)
set_xpathfunc('text-length-gt', text_length_gt)

html = """
<article>
    <h1 class="title main">Article About Data Science in 2024</h1>
    <p class="intro short">Brief intro.</p>
    <p class="content long">This is a comprehensive paragraph about machine learning
    algorithms and their applications in modern data science. It contains detailed
    explanations and examples.</p>
    <p class="stats">Processing 1000 records per second with 95% accuracy.</p>
    <p class="conclusion">Final thoughts on the topic.</p>
</article>
"""

selector = Selector(text=html)

# Combine has-class with custom functions
long_content = selector.xpath('//p[has-class("content") and text-length-gt("50")]')
print(f"Long content paragraphs: {len(long_content)}")  # 1

# Find elements with numbers that have specific classes
stats_with_numbers = selector.xpath('//p[has-class("stats") and contains-number()]')
stats_text = stats_with_numbers.xpath('.//text()').get()
# Returns: 'Processing 1000 records per second with 95% accuracy.'

# Complex conditions
title_with_year = selector.xpath('//h1[has-class("title") and contains-number()]')
title_text = title_with_year.xpath('.//text()').get()
# Returns: 'Article About Data Science in 2024'

# Multiple custom functions
long_paragraphs_no_numbers = selector.xpath('//p[text-length-gt("20") and not(contains-number())]')
print(f"Long paragraphs without numbers: {len(long_paragraphs_no_numbers)}")  # 2
```

## Error Handling and Validation

The built-in `has-class` function validates its arguments and raises `ValueError` when it is called with no arguments or with an argument that is not a string.

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="test">
    <p class="item valid">Valid content</p>
    <p class="item">Basic content</p>
</div>
"""

selector = Selector(text=html)

# Test error conditions
try:
    # Empty class list - should raise ValueError
    result = selector.xpath('//p[has-class()]')
except ValueError as e:
    print(f"Expected error for empty classes: {type(e).__name__}")

try:
    # Non-string class argument - should raise ValueError
    # Note: the check runs during XPath evaluation, where 123 arrives as a number
    result = selector.xpath('//p[has-class("valid", 123)]')
except ValueError as e:
    print(f"Error for non-string argument: {type(e).__name__}")

# Valid usage
valid_items = selector.xpath('//p[has-class("item", "valid")]')
print(f"Valid items found: {len(valid_items)}")  # 1
```

## Performance and Optimization

### Function Call Optimization

XPath extension functions are designed for repeated use within a single query:

- **Argument validation caching**: `has-class` validates its arguments once per evaluation and caches the result in the evaluation context, so the check is not repeated for every node
- **Whitespace processing**: HTML5 whitespace in the `class` attribute is normalized once per element before class names are compared
- **Context reuse**: The same evaluation context is reused across function calls within one evaluation

**Performance Example:**

```python
from parsel import Selector

# Large HTML document with many elements
html = """
<div class="container">
""" + "\n".join([
    f'<p class="item type-{i % 3} {"active" if i % 5 == 0 else ""}">Item {i}</p>'
    for i in range(1000)
]) + """
</div>
"""

selector = Selector(text=html)

# Efficient batch processing with has-class
# Arguments are validated once per evaluation, not once per <p> element
active_items = selector.xpath('//p[has-class("item", "active")]')
print(f"Found {len(active_items)} active items")  # 200

# Extract specific type with active status
active_type_0 = selector.xpath('//p[has-class("item", "type-0", "active")]')
print(f"Active type-0 items: {len(active_type_0)}")  # 67
```
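
The argument validation caching described above can be pictured with a small stand-alone sketch. It is an illustration of the idea rather than Parsel's code: `FakeContext` is a hypothetical stand-in for the context object lxml passes to extension functions, its `eval_context` dictionary models the per-evaluation dictionary that lxml exposes, and `validate_once` and `validation_runs` are made-up names.

```python
class FakeContext:
    """Hypothetical stand-in for the context lxml passes to extension functions."""

    def __init__(self):
        # One dictionary per XPath evaluation, shared by every call made in it.
        self.eval_context = {}


validation_runs = []  # records each time the arguments are actually validated


def validate_once(context, *classes):
    """Validate `classes` only on the first call within an evaluation."""
    if context.eval_context.get("args_checked"):
        return  # already validated during this evaluation
    validation_runs.append(classes)
    if not classes:
        raise ValueError("at least one class name is required")
    if not all(isinstance(cls, str) for cls in classes):
        raise ValueError("class names must be strings")
    context.eval_context["args_checked"] = True


# One evaluation that visits 1000 nodes validates only once.
first_evaluation = FakeContext()
for _ in range(1000):
    validate_once(first_evaluation, "item", "active")

# A new evaluation starts with an empty dictionary and validates again.
second_evaluation = FakeContext()
validate_once(second_evaluation, "item")

print(len(validation_runs))  # 2
```

Scoping the flag to the evaluation context means a later query with different arguments is still validated, while the per-node cost within one query stays constant.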

### Memory Management

- **Context cleanup**: Extension functions don't hold references to DOM nodes
- **String processing**: Minimal string allocation for class checking
- **Cache efficiency**: Validation cache is scoped to evaluation context

## Integration with Standard XPath

Extension functions can be combined with standard XPath features such as positional predicates, text functions, attribute tests, and boolean operators:

```python
from parsel import Selector

html = """
<section class="products">
    <div class="product featured premium">Premium Product A</div>
    <div class="product featured">Featured Product B</div>
    <div class="product premium">Premium Product C</div>
    <div class="product">Basic Product D</div>
</section>
"""

selector = Selector(text=html)

# Combine with positional functions
first_featured = selector.xpath('(//div[has-class("product", "featured")])[1]')
first_featured_text = first_featured.xpath('.//text()').get()
# Returns: 'Premium Product A'

# Combine with text functions
premium_with_a = selector.xpath('//div[has-class("product", "premium") and contains(text(), "A")]')

# Combine with attribute checks
products_with_class = selector.xpath('//div[@class and has-class("product")]')
print(f"Products with class attribute: {len(products_with_class)}")  # 4

# Complex boolean logic
featured_or_premium = selector.xpath('//div[has-class("product") and (has-class("featured") or has-class("premium"))]')
print(f"Featured or premium products: {len(featured_or_premium)}")  # 3
```