# CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions, with support for pseudo-elements and custom CSS features. Parsel extends standard CSS selectors with two additional pseudo-elements, `::text` and `::attr(name)`, for data extraction.

## Capabilities

### CSS to XPath Conversion

Convert CSS selectors to equivalent XPath expressions for internal processing.

```python { .api }
def css2xpath(query: str) -> str:
    """
    Convert CSS selector to XPath expression.

    This is the main utility function for CSS-to-XPath translation using
    the HTMLTranslator with pseudo-element support.

    Parameters:
    - query (str): CSS selector to convert

    Returns:
    str: Equivalent XPath expression

    Examples:
    - 'div.class' -> "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' class ')]"
    - 'p::text' -> 'descendant-or-self::p/text()'
    - 'a::attr(href)' -> 'descendant-or-self::a/@href'
    """
```

**Usage Example:**

```python
from parsel import css2xpath

# Basic element selectors
div_xpath = css2xpath('div')
# Returns: 'descendant-or-self::div'

# Class selectors
class_xpath = css2xpath('.container')
# Returns: "descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' container ')]"

# ID selectors
id_xpath = css2xpath('#main')
# Returns: "descendant-or-self::*[@id = 'main']"

# Attribute selectors
attr_xpath = css2xpath('input[type="text"]')
# Returns: "descendant-or-self::input[@type = 'text']"

# Descendant selectors
desc_xpath = css2xpath('div p')
# Returns: 'descendant-or-self::div/descendant-or-self::*/p'

# Child selectors
child_xpath = css2xpath('ul > li')
# Returns: 'descendant-or-self::ul/li'

# Pseudo-element selectors (Parsel extension)
text_xpath = css2xpath('p::text')
# Returns: 'descendant-or-self::p/text()'

href_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'
```

### Generic XML Translator

CSS to XPath translator for generic XML documents.

```python { .api }
class GenericTranslator:
    """
    CSS to XPath translator for generic XML documents.

    Provides caching and pseudo-element support for XML parsing.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with caching.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression equivalent to CSS selector

        Note:
        - Results are cached (LRU cache with 256 entries)
        - Supports pseudo-elements ::text and ::attr()
        """
```

### HTML-Optimized Translator

CSS to XPath translator optimized for HTML documents.

```python { .api }
class HTMLTranslator:
    """
    CSS to XPath translator optimized for HTML documents.

    Provides HTML-specific optimizations and pseudo-element support.
    """

    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
        """
        Convert CSS selector to XPath with HTML optimizations.

        Parameters:
        - css (str): CSS selector to convert
        - prefix (str): XPath prefix for the query

        Returns:
        str: XPath expression optimized for HTML parsing

        Note:
        - Cached results (LRU cache with 256 entries)
        - HTML-specific case handling and optimizations
        - Supports pseudo-elements ::text and ::attr()
        """
```

**Usage Example:**

```python
from parsel.csstranslator import GenericTranslator, HTMLTranslator

# Create translator instances
xml_translator = GenericTranslator()
html_translator = HTMLTranslator()

css_selector = 'article h2.title'

# Convert using XML translator
xml_xpath = xml_translator.css_to_xpath(css_selector)
# Returns XPath suitable for generic XML

# Convert using HTML translator
html_xpath = html_translator.css_to_xpath(css_selector)
# Returns XPath optimized for HTML parsing

# Both support pseudo-elements
text_css = 'p.content::text'
xml_text_xpath = xml_translator.css_to_xpath(text_css)
html_text_xpath = html_translator.css_to_xpath(text_css)
# Both return: 'descendant-or-self::p[@class and contains(...)]/text()'
```

### Extended XPath Expressions

Enhanced XPath expressions with pseudo-element support.

```python { .api }
from typing import Optional

class XPathExpr:
    """
    Extended XPath expression with pseudo-element support.

    Extends cssselect's XPathExpr to handle ::text and ::attr() pseudo-elements.
    """

    textnode: bool = False
    attribute: Optional[str] = None

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None,
    ) -> "XPathExpr":
        """
        Create XPathExpr from existing expression with pseudo-element flags.

        Parameters:
        - xpath: Base XPath expression
        - textnode (bool): Whether to target text nodes
        - attribute (str, optional): Attribute name to target

        Returns:
        XPathExpr: Extended expression with pseudo-element support
        """

    def __str__(self) -> str:
        """
        Convert to string representation with pseudo-element handling.

        Returns:
        str: XPath string with text() or @attribute suffixes as needed
        """
```

## Pseudo-Element Support

Parsel extends CSS selectors with custom pseudo-elements for enhanced data extraction.

### Text Node Selection

The `::text` pseudo-element selects the text nodes that are direct children of the matched elements.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="content">
    <h1>Main Title</h1>
    <p>First paragraph with <em>emphasis</em> text.</p>
    <p>Second paragraph.</p>
</div>
"""

selector = Selector(text=html)

# CSS with ::text pseudo-element
title_text = selector.css('h1::text').get()
# Returns: 'Main Title'

# Equivalent XPath (what css2xpath generates)
xpath_equivalent = css2xpath('h1::text')
# Returns: 'descendant-or-self::h1/text()'

# Manual XPath gives same result
manual_xpath = selector.xpath('//h1/text()').get()
# Returns: 'Main Title'

# Extract all text nodes from paragraphs
p_texts = selector.css('p::text').getall()
# Returns: ['First paragraph with ', ' text.', 'Second paragraph.']
# Note: Excludes the text inside the nested <em> element ('emphasis')
```

### Attribute Value Selection

The `::attr(name)` pseudo-element selects the value of the named attribute on each matched element.

**Usage Example:**

```python
from parsel import Selector, css2xpath

html = """
<div class="links">
    <a href="https://example.com" title="Example Site">Example</a>
    <a href="https://google.com" title="Search Engine">Google</a>
    <img src="image.jpg" alt="Description" width="300">
</div>
"""

selector = Selector(text=html)

# Extract href attributes using ::attr() pseudo-element
hrefs = selector.css('a::attr(href)').getall()
# Returns: ['https://example.com', 'https://google.com']

# Extract title attributes
titles = selector.css('a::attr(title)').getall()
# Returns: ['Example Site', 'Search Engine']

# Extract image attributes
img_src = selector.css('img::attr(src)').get()
# Returns: 'image.jpg'

img_alt = selector.css('img::attr(alt)').get()
# Returns: 'Description'

# Check XPath conversion
attr_xpath = css2xpath('a::attr(href)')
# Returns: 'descendant-or-self::a/@href'

# Equivalent manual XPath
manual_hrefs = selector.xpath('//a/@href').getall()
# Returns: ['https://example.com', 'https://google.com']
```

### Complex Pseudo-Element Combinations

Combine pseudo-elements with other CSS selector features.

**Usage Example:**

```python
from parsel import Selector

html = """
<article>
    <header>
        <h1 class="title">Article Title</h1>
        <p class="meta">Published on <time datetime="2024-01-15">January 15, 2024</time></p>
    </header>
    <section class="content">
        <p class="intro">Introduction paragraph.</p>
        <p class="body">Main content paragraph.</p>
    </section>
    <footer>
        <a href="/author/john" class="author-link">John Doe</a>
    </footer>
</article>
"""

selector = Selector(text=html)

# Complex selectors with pseudo-elements
article_title = selector.css('header h1.title::text').get()
# Returns: 'Article Title'

# Get datetime attribute from time element within meta paragraph
datetime_attr = selector.css('.meta time::attr(datetime)').get()
# Returns: '2024-01-15'

# Get author link URL
author_url = selector.css('footer .author-link::attr(href)').get()
# Returns: '/author/john'

# Get content paragraph texts (excluding intro)
content_texts = selector.css('section.content p.body::text').getall()
# Returns: ['Main content paragraph.']

# Combine descendant and pseudo-element selectors
intro_text = selector.css('article section .intro::text').get()
# Returns: 'Introduction paragraph.'
```

## Translation Internals

### Caching Mechanism

Both `GenericTranslator` and `HTMLTranslator` wrap `css_to_xpath` in an LRU cache, so repeated translations of the same selector are served from memory.

```python
from functools import lru_cache

# Cache configuration (method shown outside its class for brevity)
@lru_cache(maxsize=256)
def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str:
    # Translation logic with caching
    ...
```

### Pseudo-Element Processing

The translation process handles pseudo-elements through dynamic dispatch:

1. **Parse CSS selector** using the cssselect library
2. **Detect pseudo-elements** (`::text`, `::attr()`)
3. **Generate base XPath** for element selection
4. **Apply pseudo-element transformations** (`/text()` or `/@attribute`)
5. **Return complete XPath** expression

### Performance Considerations

- **Caching**: Translated selectors are cached, so repeated use of the same CSS selector skips parsing and translation
- **Compilation**: Each distinct CSS selector is compiled to XPath once and reused while it remains in the cache
- **Memory usage**: The LRU cache holds at most 256 entries per translator class and evicts the least recently used entry first
- **Thread safety**: Translators can be used safely across multiple threads