# Core Parsing

Primary BeautifulSoup class for parsing HTML and XML documents with configurable parser backends and automatic encoding detection. Handles malformed markup gracefully while providing access to the complete parse tree.

## Capabilities

### BeautifulSoup Parser

The main parsing class that converts HTML/XML markup into a navigable parse tree using pluggable parser backends.
```python { .api }
class BeautifulSoup(Tag):
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, **kwargs):
        """
        Parse HTML/XML markup into a navigable tree structure.

        Parameters:
        - markup: str, bytes, or file-like object containing HTML/XML
        - features: str or list, parser features ('html.parser', 'lxml', 'html5lib', 'xml')
        - builder: TreeBuilder instance (alternative to features)
        - parse_only: SoupStrainer to parse only matching elements
        - from_encoding: str, character encoding to assume for markup
        - **kwargs: deprecated arguments from BeautifulSoup 3.x

        Examples:
        - BeautifulSoup(html_string, 'html.parser')
        - BeautifulSoup(xml_string, 'lxml-xml')
        - BeautifulSoup(markup, 'html5lib')
        """
```

Usage Examples:

```python
# Parse HTML with different parsers
from bs4 import BeautifulSoup

html = '<html><body><p>Hello</p></body></html>'

# Built-in HTML parser (slower but always available)
soup = BeautifulSoup(html, 'html.parser')

# lxml parser (faster, requires the lxml package)
soup = BeautifulSoup(html, 'lxml')

# html5lib parser (most lenient, handles HTML5)
soup = BeautifulSoup(html, 'html5lib')

# XML parsing with lxml
xml = '<?xml version="1.0"?><root><item>data</item></root>'
soup = BeautifulSoup(xml, 'xml')  # or 'lxml-xml'

# Parse from a file
with open('document.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Parse with an explicit encoding
soup = BeautifulSoup(markup_bytes, 'html.parser', from_encoding='utf-8')
```

### Element Creation

Create new tags and strings that are associated with the soup object and can be inserted into the parse tree.

```python { .api }
def new_tag(self, name, namespace=None, nsprefix=None, attrs={}, **kwattrs):
    """
    Create a new Tag associated with this soup.

    Parameters:
    - name: str, tag name
    - namespace: str, XML namespace URI
    - nsprefix: str, XML namespace prefix
    - attrs: dict, attributes whose names are Python reserved words (e.g. 'class')
    - **kwattrs: tag attributes as keyword arguments

    Returns:
    Tag instance ready for insertion into parse tree
    """

def new_string(self, s, subclass=NavigableString):
    """
    Create a new NavigableString associated with this soup.

    Parameters:
    - s: str, string content
    - subclass: NavigableString subclass (Comment, CData, etc.)

    Returns:
    NavigableString instance ready for insertion
    """

def decode(self, pretty_print=False, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"):
    """
    Render the entire soup as a Unicode string.

    Parameters:
    - pretty_print: bool - format with indentation (default: False)
    - eventual_encoding: str - encoding named in the XML declaration, if XML (default: "utf-8")
    - formatter: str or function - entity formatting ("minimal", "html", "xml")

    Returns:
    str - Complete document as a Unicode string

    Note: BeautifulSoup.decode() differs from Tag.decode(), whose first
    parameter is indent_level.
    """
```

Usage Examples:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<html><body></body></html>', 'html.parser')

# Create a new tag with attributes; 'class' is a Python reserved word,
# so pass it through the attrs dict rather than as a keyword argument
new_div = soup.new_tag('div', id='main', attrs={'class': 'container'})
new_div.string = 'Content here'

# Create with namespace (XML)
new_item = soup.new_tag('item', namespace='http://example.com/ns')

# Create a navigable string
new_text = soup.new_string('Some text content')

# Create a comment
new_comment = soup.new_string('This is a comment', Comment)

# Insert into the tree
soup.body.append(new_div)
soup.body.append(new_comment)
```
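
The `decode()` method documented above is not exercised by the example, so here is a minimal sketch of rendering the same soup; the exact whitespace of pretty-printed output varies by version:

```python
# Render the modified document as a single Unicode string
print(soup.decode())

# Pretty-print with indentation (this is what soup.prettify() uses internally)
print(soup.decode(pretty_print=True))
```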

### Parsing Options

Control parsing behavior with features, filters, and encoding options.

```python { .api }
# Parser features: either a concrete parser name or a generic feature string
features = [
    'html.parser',  # Built-in Python HTML parser
    'lxml',         # lxml HTML parser (fast)
    'lxml-xml',     # lxml XML parser
    'xml',          # XML parsing mode (alias for lxml-xml)
    'html5lib',     # html5lib parser (lenient)
    'html',         # Generic HTML parsing mode
    'fast',         # Prefer faster parsers
    'permissive',   # Handle malformed markup
]

# Parse only specific elements
from bs4 import SoupStrainer

# Only parse div tags with class 'content'
parse_only = SoupStrainer('div', class_='content')
soup = BeautifulSoup(markup, 'html.parser', parse_only=parse_only)

# Only parse links
parse_only = SoupStrainer('a')
soup = BeautifulSoup(markup, 'html.parser', parse_only=parse_only)
```
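
A strainer prunes the tree at parse time, which cuts memory use on large documents; note that the html5lib backend does not support `parse_only` and parses the full document regardless. A minimal sketch of the effect (the markup here is illustrative):

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = '<div class="nav"><a href="/a">A</a></div><p>Body text</p><a href="/b">B</a>'

# The resulting soup contains only the matched elements
only_links = SoupStrainer('a')
soup = BeautifulSoup(markup, 'html.parser', parse_only=only_links)
print(soup.find_all('a'))  # both links; the <div> and <p> were never built
```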

### Parser Information

Access information about the parser used and document characteristics.

```python { .api }
# Parser properties
soup.builder                          # TreeBuilder instance used
soup.is_xml                           # Boolean, True if an XML parser was used
soup.original_encoding               # Detected encoding of the source markup
soup.declared_html_encoding          # Encoding declared in HTML meta tags
soup.contains_replacement_characters  # Whether encoding conversion lost data
```
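
A short sketch of inspecting these properties after a parse; the concrete values depend on the input and the installed parsers:

```python
from bs4 import BeautifulSoup

data = b'<html><head><meta charset="utf-8"></head><body>Hi</body></html>'
soup = BeautifulSoup(data, 'html.parser')

print(soup.is_xml)                           # False: an HTML parser was used
print(soup.original_encoding)                # e.g. 'utf-8', detected from the bytes
print(soup.declared_html_encoding)           # 'utf-8', from the meta tag
print(soup.contains_replacement_characters)  # False unless conversion lost data
```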

### Error Handling

Handle parsing errors and invalid markup gracefully.

```python { .api }
class FeatureNotFound(ValueError):
    """Raised when the requested parser features are not available"""

class ParserRejectedMarkup(Exception):
    """Raised when a parser cannot handle the provided markup"""
```

Usage Examples:

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    # This will fail if lxml is not installed
    soup = BeautifulSoup(markup, 'lxml')
except FeatureNotFound:
    # Fall back to the built-in parser
    soup = BeautifulSoup(markup, 'html.parser')

# Malformed markup is handled gracefully
malformed_html = '<html><body><p>Unclosed paragraph<div>Mixed nesting</body></html>'
soup = BeautifulSoup(malformed_html, 'html.parser')  # Parses successfully
```
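
`ParserRejectedMarkup` surfaces when no parser accepts the input (binary data, for example). A sketch of catching it, assuming a recent release where the exception is importable from the top-level `bs4` package (older versions expose it in `bs4.builder`):

```python
from bs4 import BeautifulSoup, ParserRejectedMarkup

def parse_or_none(markup):
    """Return a soup, or None when no parser accepts the markup."""
    try:
        return BeautifulSoup(markup, 'html.parser')
    except ParserRejectedMarkup:
        return None
```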

### Diagnostic Functions

Debug parsing issues and compare parser performance with diagnostic utilities.

```python { .api }
def diagnose(data):
    """
    Comprehensive diagnostic suite for troubleshooting parsing issues.

    Tests multiple parsers on the same data and shows results and errors.
    Useful for tech support and debugging parser selection problems.

    Parameters:
    - data: str, bytes, file-like object, or filename to parse

    Prints diagnostic information including:
    - Beautiful Soup version and Python version
    - Available parsers and their versions
    - Parse results from each parser
    - Exception traces for failed parsers
    """

def lxml_trace(data, html=True, **kwargs):
    """
    Print lxml parsing events to see raw parser behavior.

    Shows the underlying lxml events during parsing, without Beautiful Soup.

    Parameters:
    - data: str - markup to parse
    - html: bool - use HTML parser mode (default: True)
    - **kwargs: additional lxml parser options

    Prints events in the format: "event, tag, text"
    """

def htmlparser_trace(data):
    """
    Print HTMLParser events to see raw parser behavior.

    Shows the underlying HTMLParser events during parsing, without Beautiful Soup.

    Parameters:
    - data: str - markup to parse

    Prints events like: "TAG START", "DATA", "TAG END"
    """

def benchmark_parsers(num_elements=100000):
    """
    Basic performance benchmark comparing available parsers.

    Generates a large invalid HTML document and times parsing with
    different parser backends to compare performance.

    Parameters:
    - num_elements: int - size of the generated test document

    Prints timing results for each available parser
    """

def profile(num_elements=100000, parser="lxml"):
    """
    Profile Beautiful Soup parsing performance in detail.

    Uses cProfile to analyze where time is spent during parsing.

    Parameters:
    - num_elements: int - size of the generated test document
    - parser: str - parser to profile ("lxml", "html.parser", etc.)

    Returns profile statistics for analysis
    """
```

Usage Examples:

```python
from bs4.diagnose import diagnose, lxml_trace, htmlparser_trace, benchmark_parsers

# Debug parsing problems
problematic_html = '<html><body><p>Malformed HTML...'
diagnose(problematic_html)

# Compare parser performance
benchmark_parsers(50000)

# See raw parser events
lxml_trace('<p>Hello <b>world</b></p>')
htmlparser_trace('<p>Hello <em>world</em></p>')

# Profile for performance optimization
from bs4.diagnose import profile
profile(100000, 'lxml')
```

### Builder and Parser Configuration

Advanced parser configuration and tree builder architecture for customizing parsing behavior.

```python { .api }
class TreeBuilder:
    """
    Base class for parser backends that convert markup into Beautiful Soup trees.

    Used internally by BeautifulSoup to abstract different parser implementations.
    """
    features = []                     # List of supported feature strings
    is_xml = False                    # Whether this parser handles XML
    preserve_whitespace_tags = set()  # Tags that preserve whitespace
    empty_element_tags = None         # Tags that can be self-closing
    cdata_list_attributes = {}        # Attributes containing space-separated lists

class HTMLTreeBuilder(TreeBuilder):
    """
    Base class for HTML-specific tree builders.

    Defines HTML-specific parsing behavior and tag characteristics.
    """
    preserve_whitespace_tags = {'pre', 'textarea'}
    empty_element_tags = {'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'}

class TreeBuilderRegistry:
    """
    Registry for managing available parser backends.

    Automatically selects appropriate parsers based on requested features.
    """
    def register(self, treebuilder_class): ...
    def lookup(self, *features): ...

# Parser feature constants
FAST = 'fast'
PERMISSIVE = 'permissive'
STRICT = 'strict'
XML = 'xml'
HTML = 'html'
HTML_5 = 'html5'

# Global parser registry
builder_registry = TreeBuilderRegistry()
```
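
A brief sketch of driving parser selection through the registry directly; recent BeautifulSoup versions accept the looked-up TreeBuilder class (or an instance) via the `builder=` argument:

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Look up a builder by feature string; 'html.parser' is always available
builder_class = builder_registry.lookup('html.parser')

# Pass the builder instead of a features string
soup = BeautifulSoup('<p>Hello</p>', builder=builder_class)
print(type(soup.builder).__name__)  # e.g. 'HTMLParserTreeBuilder'
```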

### Encoding Detection and Processing

Handle character encoding detection and entity processing.

```python { .api }
class UnicodeDammit:
    """
    Automatic character encoding detection and conversion to Unicode.

    Handles encoding detection from HTML meta tags, XML declarations,
    byte order marks, and statistical analysis of byte patterns.
    """
    def __init__(self, markup, override_encodings=[], smart_quotes_to=None,
                 is_html=False, exclude_encodings=[]): ...

    @property
    def unicode_markup(self): ...     # Converted Unicode string
    @property
    def original_encoding(self): ...  # Detected source encoding

class EntitySubstitution:
    """
    HTML and XML entity encoding and decoding utilities.

    Handles conversion between Unicode characters and HTML/XML entities.
    """
    @classmethod
    def substitute_html(cls, s): ...             # Convert to HTML entities
    @classmethod
    def substitute_xml(cls, s): ...              # Convert to XML entities
    @classmethod
    def quoted_attribute_value(cls, value): ...  # Quote attribute values

class HTMLAwareEntitySubstitution(EntitySubstitution):
    """
    Entity substitution that preserves script and style tag contents.

    Avoids entity conversion in script and style tags, where it would
    break JavaScript or CSS code.
    """
    cdata_containing_tags = {'script', 'style'}
    preformatted_tags = {'pre'}
```

Usage Examples:

```python
from bs4.builder import builder_registry, FAST, PERMISSIVE
from bs4.dammit import UnicodeDammit, EntitySubstitution

# Check available parsers
available_parsers = []
for builder in builder_registry.builders:
    available_parsers.append(builder.features)
print("Available parsers:", available_parsers)

# Manual encoding detection (is_html=True enables meta-tag sniffing)
raw_data = b'<html><meta charset="latin1"><body>Caf\xe9</body></html>'
dammit = UnicodeDammit(raw_data, is_html=True)
print("Detected encoding:", dammit.original_encoding)
print("Unicode markup:", dammit.unicode_markup)

# Entity handling
text_with_entities = 'R&D <division> & "innovation"'
html_entities = EntitySubstitution.substitute_html(text_with_entities)
xml_entities = EntitySubstitution.substitute_xml(text_with_entities)
print("HTML entities:", html_entities)
print("XML entities:", xml_entities)

# Parser feature lookup
fast_parser = builder_registry.lookup(FAST)
permissive_html_parser = builder_registry.lookup(PERMISSIVE, 'html')
```