# Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized data from selected elements, with support for entity replacement, regex matching, and various output formats.

## Capabilities

### Content Serialization

Extract the full content of selected elements as strings with proper formatting.

```python { .api }
def get(self) -> Any:
    """
    Serialize and return the matched node content.

    Returns:
    - For HTML/XML: String representation with percent-encoded content unquoted
    - For JSON/text: Raw data as-is
    - For boolean values: "1" for True, "0" for False
    - For other types: String conversion

    Note:
    - Uses appropriate serialization method based on document type
    - Preserves XML/HTML structure in output
    """

def getall(self) -> List[str]:
    """
    Serialize and return the matched node in a 1-element list.

    Returns:
    List[str]: Single-element list containing serialized content
    """

# Legacy alias
extract = get
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="content">
    <p>First <strong>bold</strong> paragraph</p>
    <p>Second paragraph</p>
</div>
"""

selector = Selector(text=html)

# Extract full element with tags
full_content = selector.css('.content').get()
# Returns: '<div class="content">\n    <p>First <strong>bold</strong> paragraph</p>\n    <p>Second paragraph</p>\n</div>'

# Extract text content only (::text matches direct child text nodes of each <p>,
# so the text inside <strong> is not included)
text_only = selector.css('.content p::text').getall()
# Returns: ['First ', ' paragraph', 'Second paragraph']

# Extract as single item (first match only)
first_text = selector.css('.content p::text').get()
# Returns: 'First '
```

### Regular Expression Matching

Apply regular expressions to extracted content with optional entity replacement.

```python { .api }
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]:
    """
    Apply regex and return list of matching strings.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    List[str]: All regex matches from the content

    Extraction rules:
    - Named group "extract": Returns only the named group content
    - Multiple numbered groups: Returns all groups flattened
    - No groups: Returns entire regex matches
    """

def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]:
    """
    Apply regex and return first matching string.

    Parameters:
    - regex (str or Pattern): Regular expression pattern
    - default (str, optional): Value to return if no match found
    - replace_entities (bool): Replace HTML entities except &amp; and &lt;

    Returns:
    str or None: First match or default value
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div>
    <p>Price: $25.99</p>
    <p>Discount: 15%</p>
    <p>Contact: user@example.com</p>
</div>
"""

selector = Selector(text=html)

# Extract all numbers
numbers = selector.css('div').re(r'\d+\.?\d*')
# Returns: ['25.99', '15']

# Extract email addresses
emails = selector.css('div').re(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: ['user@example.com']

# Extract with the special named group "extract"
prices = selector.css('div').re(r'Price: \$(?P<extract>\d+\.\d+)')
# Returns: ['25.99']

# Get first match with default
first_number = selector.css('div').re_first(r'\d+', default='0')
# Returns: '25'

# Extract from specific elements
contact_email = selector.css('p:contains("Contact")').re_first(r'[\w.-]+@[\w.-]+\.\w+')
# Returns: 'user@example.com'
```

### Attribute Access

Access element attributes through the `attrib` property.

```python { .api }
@property
def attrib(self) -> Dict[str, str]:
    """
    Return the attributes dictionary for underlying element.

    Returns:
    Dict[str, str]: All attributes as key-value pairs

    Note:
    - Empty dict for non-element nodes
    - Converts lxml attrib to standard dict
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<div class="container" id="main" data-value="123">
    <a href="https://example.com" target="_blank" title="External Link">Link</a>
    <img src="image.jpg" alt="Description" width="300" height="200">
</div>
"""

selector = Selector(text=html)

# Get all attributes of div
div_attrs = selector.css('div').attrib
# Returns: {'class': 'container', 'id': 'main', 'data-value': '123'}

# Get all attributes of link
link_attrs = selector.css('a').attrib
# Returns: {'href': 'https://example.com', 'target': '_blank', 'title': 'External Link'}

# Access specific attribute values
href_value = selector.css('a').attrib.get('href')
# Returns: 'https://example.com'

# Check for attribute existence
has_target = 'target' in selector.css('a').attrib
# Returns: True
```

### Entity Replacement

Control whether `re()` and `re_first()` decode HTML character entities in their results, using the `replace_entities` parameter.

**Usage Example:**

```python
from parsel import Selector

html = """
<p>Price: &lt; $100 &amp; shipping included &gt;</p>
<p>Copyright &copy; 2024</p>
"""

selector = Selector(text=html)

# re() runs on the serialized element, so matches include the <p> tags.
# &copy; is already decoded by the HTML parser in both cases.

# With entity replacement (default): entities are decoded, except &amp; and &lt;
text_with_entities = selector.css('p').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>', '<p>Copyright © 2024</p>']

# Without entity replacement: entities in the serialized markup are left as-is
text_raw = selector.css('p').re(r'.+', replace_entities=False)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included &gt;</p>', '<p>Copyright © 2024</p>']

# Specific entities are preserved (&amp; and &lt;)
mixed_content = selector.css('p:first-child').re(r'.+', replace_entities=True)
# Returns: ['<p>Price: &lt; $100 &amp; shipping included ></p>']
```

## Content Type Handling

Different content types return appropriate data formats:

- **HTML/XML elements**: Serialized markup with proper encoding
- **Text nodes**: Plain text content
- **Attribute values**: String attribute values
- **JSON data**: Native Python objects (dict, list, etc.)
- **Boolean XPath results**: "1" for True, "0" for False
- **Numeric XPath results**: String representation of numbers

## Performance Considerations

- Use `get()` for single values, `getall()` for multiple values
- Regular expressions are compiled and cached automatically; a precompiled `Pattern` can also be passed
- Entity replacement adds processing overhead; pass `replace_entities=False` when it is not needed
- Attribute access creates a new dict each time, so store the result in a variable when reading several attributes