tessl/pypi-parsel

Parsel is a library to extract data from HTML and XML using XPath and CSS selectors

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pkg:pypi/parsel@1.10.x

To install, run `npx @tessl/cli install tessl/pypi-parsel@1.10.0`.

# Parsel

Parsel is a library for extracting data from HTML, XML, and JSON documents. Its `Selector` and `SelectorList` classes expose a unified, chainable API that supports XPath expressions, CSS selectors, JMESPath queries for JSON, and regular expressions.

## Package Information

- **Package Name**: parsel
- **Language**: Python
- **Installation**: `pip install parsel`
- **Dependencies**: lxml, cssselect, jmespath, w3lib, packaging

## Core Imports

```python
from parsel import Selector, SelectorList
```

The CSS translation helper and the XPath functions module can also be imported directly:

```python
from parsel import css2xpath
from parsel import xpathfuncs
```

## Basic Usage

```python
from parsel import Selector

# Parse HTML document
html = """
<html>
  <body>
    <h1>Hello, Parsel!</h1>
    <ul>
      <li><a href="http://example.com">Link 1</a></li>
      <li><a href="http://scrapy.org">Link 2</a></li>
    </ul>
    <script type="application/json">{"a": ["b", "c"]}</script>
  </body>
</html>
"""

selector = Selector(text=html)

# Extract text using CSS selectors
title = selector.css('h1::text').get()  # 'Hello, Parsel!'

# Extract links using XPath
for li in selector.css('ul > li'):
    href = li.xpath('.//@href').get()
    print(href)

# Extract and parse JSON content
json_data = selector.css('script::text').jmespath("a").getall()  # ['b', 'c']

# Use regular expressions
words = selector.xpath('//h1/text()').re(r'\w+')  # ['Hello', 'Parsel']
```

## Architecture

Parsel's architecture centers around two main classes:

- **Selector**: Wraps input data (HTML/XML/JSON/text) and provides selection methods
- **SelectorList**: List of Selector objects with chainable methods for batch operations

The library supports multiple parsing strategies:

- **HTML parsing**: Using lxml.html.HTMLParser with CSS pseudo-element support
- **XML parsing**: Using SafeXMLParser (extends lxml.etree.XMLParser) with namespace management
- **JSON parsing**: Native Python JSON parsing with JMESPath query support
- **Text parsing**: Plain text content with regex extraction


## Capabilities


### Document Parsing and Selection

Core functionality for parsing HTML, XML, JSON, and text documents through a single selector interface that supports XPath, CSS, and JMESPath queries.

```python { .api }
class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None: ...

    def xpath(
        self,
        query: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> SelectorList["Selector"]: ...

    def css(self, query: str) -> SelectorList["Selector"]: ...

    def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]: ...
```

[Document Parsing and Selection](./parsing-selection.md)


### Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized markup from selected elements, with optional entity replacement in regex results.

```python { .api }
# Selector methods
def get(self) -> Any: ...
def getall(self) -> List[str]: ...
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]: ...
def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]: ...

@property
def attrib(self) -> Dict[str, str]: ...
```

[Data Extraction](./data-extraction.md)


### SelectorList Operations

Batch operations over multiple selectors, with chainable methods for filtering and extracting from collections of selected elements.

```python { .api }
class SelectorList(List["Selector"]):
    def xpath(
        self,
        xpath: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> "SelectorList[Selector]": ...

    def css(self, query: str) -> "SelectorList[Selector]": ...

    def jmespath(self, query: str, **kwargs: Any) -> "SelectorList[Selector]": ...

    def get(self, default: Optional[str] = None) -> Optional[str]: ...
    def getall(self) -> List[str]: ...
```

[SelectorList Operations](./selectorlist-operations.md)


### XML Namespace Management

Support for XML namespaces: registering prefixes, stripping namespaces from a document, and running namespace-aware queries.

```python { .api }
# Selector methods
def register_namespace(self, prefix: str, uri: str) -> None: ...
def remove_namespaces(self) -> None: ...
```

[XML Namespace Management](./xml-namespaces.md)


### Element Modification

Methods for removing elements from the parsed document tree.

```python { .api }
# Selector methods
def drop(self) -> None: ...
def remove(self) -> None: ...  # deprecated
```

[Element Modification](./element-modification.md)


### CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions with support for pseudo-elements and custom CSS features.

```python { .api }
def css2xpath(query: str) -> str: ...

class GenericTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...

class HTMLTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...
```

[CSS Translation](./css-translation.md)


### XPath Extension Functions

Custom XPath functions for element selection, including the CSS class check `has-class`, plus helpers for registering your own functions.

```python { .api }
def set_xpathfunc(fname: str, func: Optional[Callable]) -> None: ...
def has_class(context: Any, *classes: str) -> bool: ...
def setup() -> None: ...
```

[XPath Extensions](./xpath-extensions.md)


## Types

```python { .api }
# Type aliases
_SelectorType = TypeVar("_SelectorType", bound="Selector")
_ParserType = Union[etree.XMLParser, etree.HTMLParser]
_TostringMethodType = Literal["html", "xml"]

# Exception classes
class CannotRemoveElementWithoutRoot(Exception): ...
class CannotRemoveElementWithoutParent(Exception): ...
class CannotDropElementWithoutParent(CannotRemoveElementWithoutParent): ...

# CSS Translator classes
class XPathExpr:
    textnode: bool
    attribute: Optional[str]

    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr",
        textnode: bool = False,
        attribute: Optional[str] = None
    ) -> "XPathExpr": ...
```