A modern CSS selector implementation for Beautiful Soup.
npx @tessl/cli install tessl/pypi-soupsieve@1.9.00
# Soupsieve

A modern CSS selector implementation for Beautiful Soup 4, providing comprehensive CSS selector support from CSS Level 1 through CSS Level 4 drafts. Soupsieve serves as the default selector engine for Beautiful Soup 4.7.0+ and can be used independently for sophisticated CSS-based element selection from HTML/XML documents.

## Package Information

- **Package Name**: soupsieve
- **Language**: Python
- **Installation**: `pip install soupsieve`

## Core Imports

```python
import soupsieve
```

Alternative import for shorter syntax:

```python
import soupsieve as sv
```

Specific functions and classes can be imported directly:

```python
from soupsieve import compile, select, match, SoupSieve, SelectorSyntaxError
```

## Basic Usage

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Create a soup object from HTML
html = """
<div class="container">
    <p id="intro">Introduction paragraph</p>
    <div class="content">
        <p class="highlight">Important content</p>
        <span>Additional info</span>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Basic selection - find all paragraphs
paragraphs = sv.select('p', soup)
print(f"Found {len(paragraphs)} paragraphs")

# Select with class
highlighted = sv.select('.highlight', soup)
if highlighted:
    print(f"Highlighted text: {highlighted[0].get_text()}")

# Select first match only
first_p = sv.select_one('p', soup)
print(f"First paragraph: {first_p.get_text()}")

# Test if element matches selector
intro = soup.find(id='intro')
if sv.match('#intro', intro):
    print("Element matches #intro selector")

# Compiled selectors for reuse
compiled = sv.compile('div.content > *')
children = compiled.select(soup)
print(f"Found {len(children)} direct children of .content")
```

## Architecture

Soupsieve's architecture centers around CSS parsing and matching:

- **Parser**: Converts CSS selector strings into structured selector objects
- **Matcher**: Evaluates selectors against Beautiful Soup elements using tree traversal
- **Compiler**: Provides caching and reusable compiled selector objects
- **Types**: Immutable data structures representing selector components

The library automatically handles HTML vs XML differences and provides namespace support for XML documents.
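
The parse-then-match flow described above can be sketched from the public API alone. This is a minimal illustration, assuming only `compile`, the `pattern` attribute, and `select` as documented in this file; the sample markup is invented for the example and the internal parser and matcher modules are not touched directly.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented sample markup for illustration
soup = BeautifulSoup('<ul><li class="a">one</li><li>two</li></ul>', 'html.parser')

# Parse/compile step: the selector string becomes a reusable SoupSieve object
pattern = sv.compile('li.a')
print(pattern.pattern)  # the original selector string is kept on the object

# Match step: the compiled object is evaluated against the Beautiful Soup tree
matched = pattern.select(soup)
texts = [el.get_text() for el in matched]
print(texts)
```

Module-level functions such as `sv.select` go through the same compile step internally, so the two styles are interchangeable for one-off queries.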

## Capabilities

### CSS Selector Functions

Core functions for selecting elements using CSS selectors. These provide the primary interface for CSS-based element selection.

```python { .api }
def select(select, tag, namespaces=None, limit=0, flags=0, **kwargs):
    """
    Select all matching elements under the specified tag.

    Parameters:
    - select: str, CSS selector string
    - tag: BeautifulSoup Tag or document to search within
    - namespaces: dict, optional namespace mappings for XML
    - limit: int, maximum results to return (0 = unlimited)
    - flags: int, selection flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    List of matching BeautifulSoup Tag objects
    """

def select_one(select, tag, namespaces=None, flags=0, **kwargs):
    """
    Select the first matching element.

    Parameters:
    - select: str, CSS selector string
    - tag: BeautifulSoup Tag or document to search within
    - namespaces: dict, optional namespace mappings for XML
    - flags: int, selection flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    First matching BeautifulSoup Tag object or None
    """

def iselect(select, tag, namespaces=None, limit=0, flags=0, **kwargs):
    """
    Iterate over matching elements (generator).

    Parameters:
    - select: str, CSS selector string
    - tag: BeautifulSoup Tag or document to search within
    - namespaces: dict, optional namespace mappings for XML
    - limit: int, maximum results to yield (0 = unlimited)
    - flags: int, selection flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Yields:
    BeautifulSoup Tag objects that match the selector
    """
```
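
As a quick illustration of how the three selection functions differ, the sketch below runs them against the same small document. The markup and variable names are made up for the example; only the signatures listed above are assumed.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented sample markup for illustration
soup = BeautifulSoup('<ul><li>a</li><li>b</li><li>c</li></ul>', 'html.parser')

all_items = sv.select('li', soup)           # list containing every match
first_two = sv.select('li', soup, limit=2)  # stop after two matches
first = sv.select_one('li', soup)           # a single Tag
missing = sv.select_one('table', soup)      # None when nothing matches

# iselect yields matches lazily instead of building a list up front
lazy_texts = [el.get_text() for el in sv.iselect('li', soup)]
```

`iselect` is the natural choice when the caller may stop early, since no intermediate list is built.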

### Element Matching and Filtering

Functions for testing individual elements and filtering collections.

```python { .api }
def match(select, tag, namespaces=None, flags=0, **kwargs):
    """
    Test if a tag matches the CSS selector.

    Parameters:
    - select: str, CSS selector string
    - tag: BeautifulSoup Tag to test
    - namespaces: dict, optional namespace mappings for XML
    - flags: int, matching flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    bool, True if tag matches selector, False otherwise
    """

def filter(select, iterable, namespaces=None, flags=0, **kwargs):
    """
    Filter a collection of tags by CSS selector.

    Parameters:
    - select: str, CSS selector string
    - iterable: collection of BeautifulSoup Tags to filter
    - namespaces: dict, optional namespace mappings for XML
    - flags: int, filtering flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    List of Tags from iterable that match the selector
    """

def closest(select, tag, namespaces=None, flags=0, **kwargs):
    """
    Find the closest matching ancestor element.
    The tag itself is tested first and is returned if it matches.

    Parameters:
    - select: str, CSS selector string
    - tag: BeautifulSoup Tag to start ancestor search from
    - namespaces: dict, optional namespace mappings for XML
    - flags: int, matching flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    The tag itself or its closest ancestor that matches selector, or None
    """
```
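
The following sketch exercises `match`, `filter`, and `closest` on one small tree. The markup is invented for the example, and the candidate list is built with Beautiful Soup's own `find_all`, which is an assumption about how a caller might obtain the collection to filter.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented sample markup for illustration
html = '<div id="outer"><section class="box"><p id="x">hi</p><span>s</span></section></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find(id='x')

# match: test one element, including its position in the tree
is_child_p = sv.match('section > p', p)

# filter: keep only the tags from a collection that match
candidates = soup.find_all(['p', 'span'])
only_spans = sv.filter('span', candidates)

# closest: walk upward from the tag until something matches
box = sv.closest('.box', p)
no_table = sv.closest('table', p)
```

Because the selector is evaluated against the real tree, `match` can use combinators such as `>` even though it is given a single element.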

### Selector Compilation and Caching

Functions for compiling selectors for reuse and managing the selector cache.

```python { .api }
def compile(pattern, namespaces=None, flags=0, **kwargs):
    """
    Compile CSS selector pattern into reusable SoupSieve object.

    Parameters:
    - pattern: str or SoupSieve, CSS selector string to compile
    - namespaces: dict, optional namespace mappings for XML
    - flags: int, compilation flags for advanced options
    - **kwargs: additional options including 'custom' selectors

    Returns:
    SoupSieve compiled selector object

    Raises:
    ValueError: if flags/namespaces/custom provided with SoupSieve input
    SelectorSyntaxError: for invalid CSS selector syntax
    """

def purge():
    """
    Clear the internal compiled selector cache.

    Returns:
    None
    """
```
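
The `ValueError` case in the docstring above is easy to miss, so here is a small sketch of it. It assumes only the behavior stated in this section: an already compiled selector passes through `compile` unchanged, and extra options are rejected for compiled input.

```python
import soupsieve as sv

compiled = sv.compile('p.note')

# Passing an already compiled selector back in returns it as-is
same = sv.compile(compiled)

# Supplying flags together with a compiled selector is rejected
try:
    sv.compile(compiled, flags=sv.DEBUG)
    rejected = False
except ValueError:
    rejected = True

# Drop cached compiled selectors, e.g. after a burst of one-off patterns
sv.purge()
```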

### Utility Functions

Helper functions for CSS identifier escaping.

```python { .api }
def escape(ident):
    """
    Escape CSS identifier for safe use in selectors.

    Parameters:
    - ident: str, identifier string to escape

    Returns:
    str, CSS-escaped identifier safe for use in selectors
    """
```
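
A typical reason to call `escape` is an ID or class name that contains characters with special meaning in CSS. The sketch below uses an invented ID containing a dot; without escaping, the dot would be read as the start of a class selector.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented sample markup: the id contains a '.', which is special in CSS
soup = BeautifulSoup('<div id="user.name">x</div>', 'html.parser')
raw_id = 'user.name'

# Unescaped, this means id="user" plus class="name", so nothing matches
unescaped = sv.select_one('#' + raw_id, soup)

# Escaped, the whole string is treated as one identifier
escaped = sv.select_one('#' + sv.escape(raw_id), soup)
```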

### Deprecated Comment Functions

Functions for extracting comments (deprecated, will be removed in future versions).

```python { .api }
def comments(tag, limit=0, flags=0, **kwargs):
    """
    Extract comments from tag tree [DEPRECATED].

    Parameters:
    - tag: BeautifulSoup Tag to search for comments
    - limit: int, maximum comments to return (0 = unlimited)
    - flags: int, unused flags parameter
    - **kwargs: additional unused options

    Returns:
    List of comment strings

    Note: Deprecated - not related to CSS selectors, will be removed
    """

def icomments(tag, limit=0, flags=0, **kwargs):
    """
    Iterate comments from tag tree [DEPRECATED].

    Parameters:
    - tag: BeautifulSoup Tag to search for comments
    - limit: int, maximum comments to yield (0 = unlimited)
    - flags: int, unused flags parameter
    - **kwargs: additional unused options

    Yields:
    Comment strings

    Note: Deprecated - not related to CSS selectors, will be removed
    """
```
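
Since these functions are marked for removal, code that needs comments may prefer to collect them with Beautiful Soup directly. The sketch below is one possible replacement, not something this package prescribes; it relies on Beautiful Soup's `Comment` class and the `string=` argument of `find_all`, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup, Comment

# Invented sample markup for illustration
soup = BeautifulSoup('<div><!-- first --><p>text</p><!-- second --></div>', 'html.parser')

# Collect every comment node in document order and trim surrounding whitespace
found_comments = [
    str(node).strip()
    for node in soup.find_all(string=lambda s: isinstance(s, Comment))
]
```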

## Classes

### SoupSieve

The main compiled selector class providing reusable CSS selector functionality with caching benefits.

```python { .api }
class SoupSieve:
    """
    Compiled CSS selector object for efficient reuse.

    Attributes:
    - pattern: str, original CSS selector pattern
    - selectors: internal parsed selector structure
    - namespaces: namespace mappings used during compilation
    - custom: custom selector definitions used during compilation
    - flags: compilation flags used during compilation
    """

    def match(self, tag):
        """
        Test if tag matches this compiled selector.

        Parameters:
        - tag: BeautifulSoup Tag to test

        Returns:
        bool, True if tag matches, False otherwise
        """

    def select(self, tag, limit=0):
        """
        Select all matching elements under tag using this compiled selector.

        Parameters:
        - tag: BeautifulSoup Tag or document to search within
        - limit: int, maximum results to return (0 = unlimited)

        Returns:
        List of matching BeautifulSoup Tag objects
        """

    def select_one(self, tag):
        """
        Select first matching element using this compiled selector.

        Parameters:
        - tag: BeautifulSoup Tag or document to search within

        Returns:
        First matching BeautifulSoup Tag object or None
        """

    def iselect(self, tag, limit=0):
        """
        Iterate matching elements using this compiled selector.

        Parameters:
        - tag: BeautifulSoup Tag or document to search within
        - limit: int, maximum results to yield (0 = unlimited)

        Yields:
        BeautifulSoup Tag objects that match the selector
        """

    def filter(self, iterable):
        """
        Filter collection of tags using this compiled selector.

        Parameters:
        - iterable: collection of BeautifulSoup Tags to filter

        Returns:
        List of Tags from iterable that match this selector
        """

    def closest(self, tag):
        """
        Find closest matching ancestor using this compiled selector.
        The tag itself is tested first and is returned if it matches.

        Parameters:
        - tag: BeautifulSoup Tag to start ancestor search from

        Returns:
        The tag itself or its closest ancestor that matches this selector, or None
        """

    def comments(self, tag, limit=0):
        """
        Extract comments using this selector [DEPRECATED].

        Parameters:
        - tag: BeautifulSoup Tag to search for comments
        - limit: int, maximum comments to return (0 = unlimited)

        Returns:
        List of comment strings

        Note: Deprecated - will be removed in future versions
        """

    def icomments(self, tag, limit=0):
        """
        Iterate comments using this selector [DEPRECATED].

        Parameters:
        - tag: BeautifulSoup Tag to search for comments
        - limit: int, maximum comments to yield (0 = unlimited)

        Yields:
        Comment strings

        Note: Deprecated - will be removed in future versions
        """
```
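
The methods on a compiled selector mirror the module-level functions, minus the selector argument. A short sketch, using invented markup and only the methods listed above:

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented sample markup for illustration
html = '<ul><li class="done">a</li><li>b</li><li class="done">c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('li')

done_selector = sv.compile('li.done')

match_flags = [done_selector.match(li) for li in items]  # one bool per item
done_items = done_selector.filter(items)                 # matching subset
first_done = done_selector.select_one(soup)              # first match or None
limited = done_selector.select(soup, limit=1)            # at most one result
```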

### Exception Classes

Exception types raised by soupsieve for error conditions.

```python { .api }
class SelectorSyntaxError(SyntaxError):
    """
    Exception raised for invalid CSS selector syntax.

    Attributes:
    - line: int, line number of syntax error (if available)
    - col: int, column number of syntax error (if available)
    - context: str, pattern context showing error location (if available)
    """

    def __init__(self, msg, pattern=None, index=None):
        """
        Initialize syntax error with optional location information.

        Parameters:
        - msg: str, error message
        - pattern: str, CSS pattern that caused error (optional)
        - index: int, character index of error in pattern (optional)
        """
```

### Constants

```python { .api }
DEBUG = 0x00001  # Debug flag constant for development and testing
```

## Types

### Namespace Support

```python { .api }
# Namespace dictionary for XML documents
Namespaces = dict[str, str]
# Example: {'html': 'http://www.w3.org/1999/xhtml', 'svg': 'http://www.w3.org/2000/svg'}

# Custom selector definitions (names must start with ':--')
CustomSelectors = dict[str, str]
# Example: {':--my-selector': 'div.custom-class', ':--important': '.highlight.critical'}
```

## Advanced Usage Examples

### Namespace-Aware Selection (XML)

```python
import soupsieve as sv
from bs4 import BeautifulSoup

xml_content = '''
<root xmlns:html="http://www.w3.org/1999/xhtml">
    <html:div class="content">
        <html:p>Namespaced paragraph</html:p>
    </html:div>
</root>
'''

# The 'xml' parser requires lxml to be installed
soup = BeautifulSoup(xml_content, 'xml')
namespaces = {'html': 'http://www.w3.org/1999/xhtml'}

# Select namespaced elements
divs = sv.select('html|div', soup, namespaces=namespaces)
paragraphs = sv.select('html|p', soup, namespaces=namespaces)
```

### Custom Selectors

```python
import soupsieve as sv
from bs4 import BeautifulSoup

html = '<div class="important highlight">Content</div><p class="note">Note</p>'
soup = BeautifulSoup(html, 'html.parser')

# Define custom selectors; names must start with ':--'
custom = {
    ':--special': '.important.highlight',
    ':--content': 'div, p'
}

# Use custom selectors
special_divs = sv.select(':--special', soup, custom=custom)
content_elements = sv.select(':--content', soup, custom=custom)
```

### Performance with Compiled Selectors

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Placeholder inputs for illustration; substitute your own documents and handler
document_list = [
    '<div class="container"><p>First</p><p>Second</p></div>',
    '<div class="container"><p class="excluded">Skipped</p></div>',
]

def process_matches(matches):
    for tag in matches:
        print(tag.get_text())

# Compile once, use many times for better performance
complex_selector = sv.compile('div.container > p:nth-child(odd):not(.excluded)')

# Use compiled selector on multiple documents
for html_content in document_list:
    soup = BeautifulSoup(html_content, 'html.parser')
    matches = complex_selector.select(soup)
    process_matches(matches)

# Clear cache when done with heavy selector use
sv.purge()
```

## Error Handling

```python
import soupsieve as sv
from soupsieve import SelectorSyntaxError
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>content</div>', 'html.parser')

try:
    # This will raise SelectorSyntaxError due to invalid CSS
    results = sv.select('div[invalid-syntax', soup)
except SelectorSyntaxError as e:
    print(f"CSS selector error: {e}")
    if e.line and e.col:
        print(f"Error at line {e.line}, column {e.col}")

try:
    # This will raise TypeError for invalid tag input
    results = sv.select('div', "not a tag object")
except TypeError as e:
    print(f"Invalid input type: {e}")
```