# Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized data from selected elements with support for entity replacement, regex matching, and various output formats.

## Capabilities

### Content Serialization

Extract the full content of selected elements as strings with proper formatting.

```python { .api }
def get(self) -> Any:
    """
    Serialize and return the matched node content.

    Returns:
    - For HTML/XML: String representation with percent-encoded content unquoted
    - For JSON/text: Raw data as-is
    - For boolean values: "1" for True, "0" for False
    - For other types: String conversion

    Note:
    - Uses appropriate serialization method based on document type
    - Preserves XML/HTML structure in output
    """

def getall(self) -> List[str]:
    """
    Serialize and return the matched node in a 1-element list.

    Returns:
    List[str]: Single-element list containing serialized content
    """

# Legacy alias
extract = get
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="content">
    <p>First <strong>bold</strong> paragraph</p>
    <p>Second paragraph</p>
</div>
"""

selector = Selector(text=html)

# Extract full element with tags
full_content = selector.css('.content').get()
# Returns: '<div class="content">\n    <p>First <strong>bold</strong> paragraph</p>\n    <p>Second paragraph</p>\n</div>'

# Extract text content only (direct text nodes of each <p>;
# the text inside <strong> is not a direct child, so it is excluded)
text_only = selector.css('.content p::text').getall()
# Returns: ['First ', ' paragraph', 'Second paragraph']

# Extract as single item
first_text = selector.css('.content p::text').get()
# Returns: 'First '
```

### Regular Expression Matching

Apply regular expressions to extracted content with optional entity replacement.

```python { .api }
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]:
    """
    Apply regex and return list of matching strings.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    List[str]: All regex matches from the content

    Extraction rules:
    - Named group "extract": Returns only the named group content
    - Multiple numbered groups: Returns all groups flattened
    - No groups: Returns entire regex matches
    """

def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]:
    """
    Apply regex and return first matching string.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - default (str, optional): Value to return if no match found
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    str or None: First match or default value
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div>
    <p>Price: $25.99</p>
    <p>Discount: 15%</p>
    <p>Contact: user@example.com</p>
</div>
"""

selector = Selector(text=html)

# Extract all numbers
numbers = selector.css('div').re(r'\d+\.?\d*')
# Returns: ['25.99', '15']

# Extract email addresses
emails = selector.css('div').re(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: ['user@example.com']

# Extract with named groups
prices = selector.css('div').re(r'Price: \$(?P<extract>\d+\.\d+)')
# Returns: ['25.99']

# Get first match with default
first_number = selector.css('div').re_first(r'\d+', default='0')
# Returns: '25'

# Extract from specific elements
contact_email = selector.css('p:contains("Contact")').re_first(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: 'user@example.com'
```
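
The three extraction rules listed for `re()` can be illustrated with the standard `re` module alone. The following is a simplified sketch of those rules, not parsel's actual implementation. `extract_matches` is a hypothetical helper introduced here for illustration, and it assumes that the named-group rule returns only the first match.

```python
import re
from typing import List


def extract_matches(pattern: str, text: str) -> List[str]:
    """Hypothetical helper modelling the three documented extraction rules."""
    regex = re.compile(pattern)

    # Rule 1: a group named "extract" wins; only its content is returned.
    # Assumption: only the first match is considered in this case.
    if "extract" in regex.groupindex:
        match = regex.search(text)
        if match is None or match.group("extract") is None:
            return []
        return [match.group("extract")]

    # Rules 2 and 3: findall() yields tuples when the pattern has several
    # groups and plain strings otherwise, so tuples are flattened.
    flat: List[str] = []
    for item in regex.findall(text):
        if isinstance(item, tuple):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


text = "Price: $25.99, Discount: 15%"
named = extract_matches(r"Price: \$(?P<extract>\d+\.\d+)", text)
grouped = extract_matches(r"(\w+): \$?(\d+)", text)
whole = extract_matches(r"\d+\.?\d*", text)
print(named)    # ['25.99']
print(grouped)  # ['Price', '25', 'Discount', '15']
print(whole)    # ['25.99', '15']
```

Flattening is why a pattern with two groups returns a single flat list rather than a list of pairs; callers that need pairs would have to regroup the result themselves.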

### Attribute Access

Access element attributes through the attrib property.

```python { .api }
@property
def attrib(self) -> Dict[str, str]:
    """
    Return the attributes dictionary for underlying element.

    Returns:
    Dict[str, str]: All attributes as key-value pairs

    Note:
    - Empty dict for non-element nodes
    - Converts lxml attrib to standard dict
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container" id="main" data-value="123">
    <a href="https://example.com" target="_blank" title="External Link">Link</a>
    <img src="image.jpg" alt="Description" width="300" height="200">
</div>
"""

selector = Selector(text=html)

# Get all attributes of div
div_attrs = selector.css('div').attrib
# Returns: {'class': 'container', 'id': 'main', 'data-value': '123'}

# Get all attributes of link
link_attrs = selector.css('a').attrib
# Returns: {'href': 'https://example.com', 'target': '_blank', 'title': 'External Link'}

# Access specific attribute values
href_value = selector.css('a').attrib.get('href')
# Returns: 'https://example.com'

# Check for attribute existence
has_target = 'target' in selector.css('a').attrib
# Returns: True
```

### Entity Replacement

Control whether HTML entities are replaced in the strings returned by `re()` and `re_first()`.

**Usage Example:**

```python
from parsel import Selector

html = """
<p>Price: &lt; $100 &amp; shipping included &gt;</p>
<p>Copyright &copy; 2024</p>
"""

selector = Selector(text=html)

# With entity replacement (default): the regex runs on the serialized element,
# so the <p> tags are part of the match; &gt; is decoded, &lt; and &amp; are kept
text_with_entities = selector.css('p').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>', '<p>Copyright © 2024</p>']

# Without entity replacement: entities in the serialized markup are left as-is
# (&copy; was already decoded to © by the HTML parser)
text_raw = selector.css('p').re(r'.+', replace_entities=False)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included &gt;</p>', '<p>Copyright © 2024</p>']

# Specific entities are preserved (&amp; and &lt;)
mixed_content = selector.css('p:first-child').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>']
```
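
The rule "replace entities except `&amp;` and `&lt;`" can be modelled with the standard library. This is a minimal sketch of that rule only, not the code the library runs. `replace_entities_except`, `KEEP`, and `_ENTITY_RE` are hypothetical names introduced for illustration, and real entity handling covers more edge cases than this pattern does.

```python
import html
import re

# Entity names left untouched, matching the documented exceptions.
KEEP = frozenset({"lt", "amp"})

# Simplified pattern: named (&gt;) and decimal numeric (&#169;) entities.
_ENTITY_RE = re.compile(r"&(#?\w+);")


def replace_entities_except(text: str, keep: frozenset = KEEP) -> str:
    """Hypothetical helper: decode entities unless their name is in `keep`."""

    def _decode(match: "re.Match[str]") -> str:
        if match.group(1) in keep:
            return match.group(0)  # preserve e.g. &lt; and &amp;
        return html.unescape(match.group(0))  # decode via the standard library

    return _ENTITY_RE.sub(_decode, text)


decoded = replace_entities_except("a &lt; b &amp; c &gt; d &copy; &#169;")
print(decoded)  # a &lt; b &amp; c > d © ©
```

Keeping `&lt;` and `&amp;` encoded means the result can still be embedded in markup without text being misread as a tag or a new entity.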

## Content Type Handling

Different content types return appropriate data formats:

- **HTML/XML elements**: Serialized markup with proper encoding
- **Text nodes**: Plain text content
- **Attribute values**: String attribute values
- **JSON data**: Native Python objects (dict, list, etc.)
- **Boolean XPath results**: "1" for True, "0" for False
- **Numeric XPath results**: String representation of numbers

## Performance Considerations

- Use `get()` when only the first value is needed and `getall()` when all values are needed
- Regular expressions are compiled and cached automatically
- Entity replacement adds processing overhead; disable it if not needed
- Attribute access creates a new dict each time; cache the result if accessing it repeatedly