Tessl Tile for pypi/parsel@1.10.0

# Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized markup from selected elements, with support for regex matching and optional entity replacement.

## Capabilities

### Content Serialization

Serialize selected elements to strings, markup included.

```python { .api }
def get(self) -> Any:
    """
    Serialize and return the matched node content.

    Returns:
    - For HTML/XML: String representation with percent-encoded content unquoted
    - For JSON/text: Raw data as-is
    - For boolean values: "1" for True, "0" for False
    - For other types: String conversion

    Note:
    - Uses appropriate serialization method based on document type
    - Preserves XML/HTML structure in output
    """

def getall(self) -> List[str]:
    """
    Serialize and return the matched node in a 1-element list.

    Returns:
    List[str]: Single-element list containing serialized content
    """

# Legacy alias
extract = get
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="content">
    <p>First <strong>bold</strong> paragraph</p>
    <p>Second paragraph</p>
</div>
"""

selector = Selector(text=html)

# Extract full element with tags
full_content = selector.css('.content').get()
# Returns: '<div class="content">\n    <p>First <strong>bold</strong> paragraph</p>\n    <p>Second paragraph</p>\n</div>'

# Extract direct text nodes only (the text inside <strong> is not a direct child of <p>)
text_only = selector.css('.content p::text').getall()
# Returns: ['First ', ' paragraph', 'Second paragraph']

# Extract the first match as a single item
first_text = selector.css('.content p::text').get()
# Returns: 'First '
```

### Regular Expression Matching

Apply regular expressions to extracted content with optional entity replacement.

```python { .api }
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]:
    """
    Apply regex and return list of matching strings.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    List[str]: All regex matches from the content

    Extraction rules:
    - Named group "extract": Returns only the named group content, taken from the first match
    - Multiple numbered groups: Returns all groups flattened
    - No groups: Returns entire regex matches
    """

def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]:
    """
    Apply regex and return first matching string.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - default (str, optional): Value to return if no match found
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    str or None: First match or default value
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div>
    <p>Price: $25.99</p>
    <p>Discount: 15%</p>
    <p>Contact: user@example.com</p>
</div>
"""

selector = Selector(text=html)

# Extract all numbers
numbers = selector.css('div').re(r'\d+\.?\d*')
# Returns: ['25.99', '15']

# Extract email addresses
emails = selector.css('div').re(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: ['user@example.com']

# Extract with the named group "extract"
prices = selector.css('div').re(r'Price: \$(?P<extract>\d+\.\d+)')
# Returns: ['25.99']

# Get first match with default
first_number = selector.css('div').re_first(r'\d+', default='0')
# Returns: '25'

# Extract from specific elements
contact_email = selector.css('p:contains("Contact")').re_first(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: 'user@example.com'
```
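The three extraction rules can be illustrated with the standard `re` module alone. The helper `extract_matches` below is hypothetical: a simplified sketch of the rules as documented here, not parsel's actual code, and it leaves out entity replacement.

```python
import re
from typing import List, Union


def extract_matches(regex: Union[str, "re.Pattern[str]"], text: str) -> List[str]:
    """Hypothetical helper mirroring the documented extraction rules."""
    pattern = re.compile(regex) if isinstance(regex, str) else regex
    if "extract" in pattern.groupindex:
        # Named group "extract": only that group, taken from the first match
        match = pattern.search(text)
        if match is None or match.group("extract") is None:
            return []
        return [match.group("extract")]
    results: List[str] = []
    for item in pattern.findall(text):
        # findall() yields tuples when the pattern has several groups; flatten them
        if isinstance(item, tuple):
            results.extend(item)
        else:
            results.append(item)
    return results


text = "Price: $25.99, Discount: 15%"

no_groups = extract_matches(r"\d+\.?\d*", text)                    # whole matches
numbered = extract_matches(r"(\w+): \$?(\d+)", text)               # groups, flattened
named = extract_matches(r"Price: \$(?P<extract>\d+\.\d+)", text)   # named group only
```

Here `no_groups` is `['25.99', '15']`, `numbered` is `['Price', '25', 'Discount', '15']`, and `named` is `['25.99']`.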

### Attribute Access

Access element attributes through the `attrib` property.

```python { .api }
@property
def attrib(self) -> Dict[str, str]:
    """
    Return the attributes dictionary for underlying element.

    Returns:
    Dict[str, str]: All attributes as key-value pairs

    Note:
    - Empty dict for non-element nodes
    - Converts lxml attrib to standard dict
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container" id="main" data-value="123">
    <a href="https://example.com" target="_blank" title="External Link">Link</a>
    <img src="image.jpg" alt="Description" width="300" height="200">
</div>
"""

selector = Selector(text=html)

# css() returns a SelectorList; its attrib gives the attributes of the first match

# Get all attributes of div
div_attrs = selector.css('div').attrib
# Returns: {'class': 'container', 'id': 'main', 'data-value': '123'}

# Get all attributes of link
link_attrs = selector.css('a').attrib
# Returns: {'href': 'https://example.com', 'target': '_blank', 'title': 'External Link'}

# Access specific attribute values
href_value = selector.css('a').attrib.get('href')
# Returns: 'https://example.com'

# Check for attribute existence
has_target = 'target' in selector.css('a').attrib
# Returns: True
```

### Entity Replacement

Control HTML entity replacement in regex-based extraction through the `replace_entities` argument of `re()` and `re_first()`.

**Usage Example:**

```python
from parsel import Selector

html = """
<p>Price: &lt; $100 &amp; shipping included &gt;</p>
<p>Copyright &copy; 2024</p>
"""

selector = Selector(text=html)

# re() runs against the serialized element, so each match includes the <p> tags.
# The parser has already decoded &copy;, so it appears as © in both cases.

# With entity replacement (default): &gt; is decoded, &amp; and &lt; are kept
text_with_entities = selector.css('p').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>', '<p>Copyright © 2024</p>']

# Without entity replacement: the serialized text is returned untouched
text_raw = selector.css('p').re(r'.+', replace_entities=False)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included &gt;</p>', '<p>Copyright © 2024</p>']

# &amp; and &lt; are preserved even when replace_entities=True
mixed_content = selector.css('p:first-child').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>']
```
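The "replace everything except `&amp;` and `&lt;`" rule can be sketched with the standard library. The helper `replace_entities_sketch` and its `keep` parameter are hypothetical names chosen for this illustration; this is not the routine parsel itself uses.

```python
import html
import re

# Matches named (&gt;) and numeric (&#169;) character references
_ENTITY = re.compile(r"&(#?\w+);")


def replace_entities_sketch(text: str, keep=("amp", "lt")) -> str:
    """Hypothetical helper: decode entities except the names listed in keep."""
    def convert(match: "re.Match[str]") -> str:
        if match.group(1).lower() in keep:
            return match.group(0)  # leave &amp; and &lt; untouched
        return html.unescape(match.group(0))
    return _ENTITY.sub(convert, text)


raw = "Price: &lt; $100 &amp; shipping &gt; Copyright &copy; 2024"
decoded = replace_entities_sketch(raw)
```

With the default `keep`, `decoded` is `'Price: &lt; $100 &amp; shipping > Copyright © 2024'`.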

## Content Type Handling

Different content types return appropriate data formats:

- **HTML/XML elements**: Serialized markup with proper encoding
- **Text nodes**: Plain text content
- **Attribute values**: String attribute values
- **JSON data**: Native Python objects (dict, list, etc.)
- **Boolean XPath results**: "1" for True, "0" for False
- **Numeric XPath results**: String representation of numbers

## Performance Considerations

- Use `get()` when only the first match is needed and `getall()` when every match is needed
- String patterns are compiled automatically (Python's `re` module caches recently compiled patterns); a precompiled `Pattern` is also accepted
- Entity replacement adds processing overhead; pass `replace_entities=False` when it is not needed
- `attrib` builds a new dict on every access; store it in a variable when reading several attributes
