# Document Parsing and Selection

Core functionality for parsing HTML, XML, JSON, and text documents with a unified selector interface that supports multiple query languages, including XPath, CSS selectors, and JMESPath.

## Capabilities

### Selector Initialization

Create `Selector` instances from text, raw bytes, or a pre-parsed root, with configurable parsing options.

```python { .api }
class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None:
        """
        Initialize a Selector for parsing and selecting from documents.

        Parameters:
        - text (str, optional): Text content to parse
        - type (str, optional): Document type - "html", "xml", "json", or "text"
        - body (bytes): Raw bytes content (alternative to text)
        - encoding (str): Character encoding for body content, defaults to "utf-8"
        - namespaces (dict, optional): XML namespace prefix mappings
        - root (Any, optional): Pre-parsed root element or data
        - base_url (str, optional): Base URL for resolving relative URLs
        - _expr (str, optional): Expression that created this selector
        - huge_tree (bool): Enable large document parsing support, defaults to True

        Raises:
        - ValueError: Invalid type or missing required arguments
        - TypeError: Invalid argument types
        """
```

**Usage Example:**

```python
from parsel import Selector

# Parse HTML text
html_selector = Selector(text="<html><body><h1>Title</h1></body></html>")

# Parse XML with explicit type
xml_selector = Selector(text="<root><item>data</item></root>", type="xml")

# Parse JSON
json_selector = Selector(text='{"name": "value", "items": [1, 2, 3]}', type="json")

# Parse from bytes with encoding
bytes_selector = Selector(body=b"<html><body>Content</body></html>", encoding="utf-8")

# Parse with XML namespaces
ns_selector = Selector(
    text="<root xmlns:ns='http://example.com'><ns:item>data</ns:item></root>",
    type="xml",
    namespaces={"ns": "http://example.com"}
)
```

### XPath Selection

Execute XPath expressions for precise element selection with namespace support and variable binding.

```python { .api }
def xpath(
    self,
    query: str,
    namespaces: Optional[Mapping[str, str]] = None,
    **kwargs: Any,
) -> SelectorList["Selector"]:
    """
    Find nodes matching the XPath query.

    Parameters:
    - query (str): XPath expression to execute
    - namespaces (dict, optional): Additional namespace prefix mappings
    - **kwargs: Variable bindings for XPath variables

    Returns:
    SelectorList: Collection of matching Selector objects

    Raises:
    - ValueError: Invalid XPath expression or unsupported selector type
    - XPathError: XPath syntax or evaluation errors
    """
```

**Usage Example:**

```python
from parsel import Selector

selector = Selector(text="""
<html>
  <body>
    <div class="content">
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <a href="http://example.com">Link</a>
  </body>
</html>
""")

# Select all paragraphs
paragraphs = selector.xpath('//p')

# Select text content
text_nodes = selector.xpath('//p/text()')

# Select attributes
hrefs = selector.xpath('//a/@href')

# Use XPath variables
links = selector.xpath('//a[@href=$url]', url="http://example.com")

# Complex XPath expressions
content_divs = selector.xpath('//div[@class="content"]//p[position()>1]')
```

### CSS Selection

Apply CSS selectors, including the non-standard `::text` and `::attr(name)` pseudo-elements for extracting text and attribute values.

```python { .api }
def css(self, query: str) -> SelectorList["Selector"]:
    """
    Apply CSS selector and return matching elements.

    Parameters:
    - query (str): CSS selector expression

    Returns:
    SelectorList: Collection of matching Selector objects

    Raises:
    - ValueError: Invalid CSS selector or unsupported selector type
    - ExpressionError: CSS syntax errors
    """
```

**Usage Example:**

```python
from parsel import Selector

selector = Selector(text="""
<html>
  <body>
    <div class="container">
      <h1 id="title">Main Title</h1>
      <p class="intro">Introduction text</p>
      <ul>
        <li><a href="link1.html">Link 1</a></li>
        <li><a href="link2.html">Link 2</a></li>
      </ul>
    </div>
  </body>
</html>
""")

# Select by class
intro = selector.css('.intro')

# Select by ID
title = selector.css('#title')

# Select descendants
links = selector.css('.container a')

# Pseudo-element selectors for text content
title_text = selector.css('h1::text')

# Pseudo-element selectors for attributes
link_urls = selector.css('a::attr(href)')

# Complex selectors
first_link = selector.css('ul li:first-child a')
```

### JMESPath Selection

Query JSON data using JMESPath expressions for complex data extraction.

```python { .api }
def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]:
    """
    Find objects matching the JMESPath query for JSON data.

    Parameters:
    - query (str): JMESPath expression to apply
    - **kwargs: Additional options passed to jmespath.search()

    Returns:
    SelectorList: Collection of matching Selector objects with extracted data

    Note:
    - Works with JSON-type selectors or JSON content within HTML/XML elements
    - Results are wrapped in new Selector objects for chaining
    """
```

**Usage Example:**

```python
from parsel import Selector

# JSON document
json_text = '''
{
    "users": [
        {"name": "Alice", "age": 30, "email": "alice@example.com"},
        {"name": "Bob", "age": 25, "email": "bob@example.com"}
    ],
    "metadata": {
        "total": 2,
        "page": 1
    }
}
'''

selector = Selector(text=json_text, type="json")

# Extract all user names
names = selector.jmespath('users[*].name')

# Extract specific user
first_user = selector.jmespath('users[0]')

# Complex queries
adult_emails = selector.jmespath('users[?age >= `30`].email')

# Nested data extraction
metadata = selector.jmespath('metadata.total')

# JSON within HTML
html_with_json = """
<script type="application/json">
{"config": {"theme": "dark", "version": "1.0"}}
</script>
"""
html_selector = Selector(text=html_with_json)
theme = html_selector.css('script::text').jmespath('config.theme')
```

## Document Type Detection

Parsel detects the document type automatically unless `type` is given explicitly. The supported types are:

- **HTML**: Default type, parsed with lxml's HTML parser, which tolerates malformed markup
- **XML**: Strict XML parsing with namespace support
- **JSON**: Native JSON parsing with JMESPath support
- **Text**: Plain text content for regex extraction

Auto-detection applies when `type` is omitted and works by examining the content:

- JSON: Selected automatically when the input is valid JSON
- XML: Not auto-detected; pass `type="xml"` explicitly, especially for documents that use XML namespaces
- HTML: Default fallback for all other content