Tessl Tile for pypi/beautifulsoup4@4.3.0

tessl/pypi-beautifulsoup4

Python library for pulling data out of HTML and XML files with pluggable parser architecture and intuitive navigation API

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/beautifulsoup4@4.3.x

To install, run

npx @tessl/cli install tessl/pypi-beautifulsoup4@4.3.0

# Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work by providing a Pythonic API for parsing documents with malformed markup.

## Package Information

- **Package Name**: beautifulsoup4
- **Language**: Python
- **Installation**: `pip install beautifulsoup4`
- **Parser Dependencies**:
  - Built-in: `html.parser` (included with Python)
  - Optional: `pip install lxml` (faster, supports XML)
  - Optional: `pip install html5lib` (pure Python, handles HTML5)

## Core Imports

```python
from bs4 import BeautifulSoup
```

Additional classes for advanced usage:

```python
from bs4 import BeautifulSoup, Tag, NavigableString, Comment
from bs4 import CData, ProcessingInstruction, Doctype
from bs4 import SoupStrainer, ResultSet
```

Diagnostic and configuration imports:

```python
from bs4.diagnose import diagnose, lxml_trace, htmlparser_trace, benchmark_parsers, profile
from bs4.builder import builder_registry, TreeBuilder, HTMLTreeBuilder
from bs4.dammit import UnicodeDammit, EntitySubstitution
```
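
As a rough illustration of two of the configuration imports, the sketch below looks up a tree builder by feature name and escapes a string with `EntitySubstitution`. The feature name `no-such-parser` and the sample text are made up for the example.

```python
from bs4.builder import TreeBuilder, builder_registry
from bs4.dammit import EntitySubstitution

# Look up a tree builder class by feature name; None means nothing matched.
builder_class = builder_registry.lookup("html.parser")
print(builder_class is not None and issubclass(builder_class, TreeBuilder))  # True

# "no-such-parser" is a hypothetical feature name that no builder registers.
print(builder_registry.lookup("no-such-parser"))  # None

# Escape only the characters that are unsafe in XML text.
escaped = EntitySubstitution.substitute_xml("fish & chips < 5")
print(escaped)  # fish &amp; chips &lt; 5
```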

## Basic Usage

```python
from bs4 import BeautifulSoup

# Parse HTML content
html = '<html><head><title>Sample Page</title></head><body><p class="content">Hello, world!</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Navigate the parse tree
title = soup.title.string
print(title)  # "Sample Page"

# Find elements by tag
paragraph = soup.find('p')
print(paragraph.get_text())  # "Hello, world!"

# Find elements by CSS class
content = soup.find('p', class_='content')
print(content['class'])  # ['content']

# Use CSS selectors
content = soup.select('p.content')[0]
print(content.get_text())  # "Hello, world!"

# Modify the tree
new_tag = soup.new_tag('span', id='highlight')
new_tag.string = 'Important!'
paragraph.append(new_tag)

# Output modified HTML
print(soup.prettify())
```

## Architecture

Beautiful Soup uses a layered architecture that separates parsing from tree manipulation:

- **Parser Layer**: Pluggable parser backends (html.parser, lxml, html5lib) handle markup parsing with different performance and compliance characteristics
- **Parse Tree**: Hierarchical representation using PageElement base class with specialized Tag and NavigableString nodes
- **Navigation API**: Bidirectional tree traversal with parent/child/sibling relationships and document-order navigation
- **Search System**: Flexible element finding with CSS selectors, attribute matching, and callable filters
- **Encoding Handling**: Automatic character encoding detection and Unicode conversion via UnicodeDammit

This design enables Beautiful Soup to handle malformed markup gracefully while providing an intuitive Pythonic API for web scraping, document processing, and HTML/XML manipulation tasks.
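
The layers can be seen in a few lines. This is a minimal sketch that assumes the built-in `html.parser` backend; the malformed snippet and the Latin-1 byte string are invented sample inputs, and other parsers may repair the same markup differently.

```python
from bs4 import BeautifulSoup, NavigableString, Tag
from bs4.dammit import UnicodeDammit
from bs4.element import PageElement

# Parser layer: the markup is malformed on purpose (<b> and <p> are never closed).
soup = BeautifulSoup("<p>Unclosed <b>bold", "html.parser")

# Parse tree: Tag and NavigableString nodes share the PageElement base class.
bold = soup.b
text = bold.string
print(isinstance(bold, Tag), isinstance(bold, PageElement))             # True True
print(isinstance(text, NavigableString), isinstance(text, PageElement)) # True True

# Navigation API: every node knows its parent.
print(text.parent.name, bold.parent.name)  # b p

# Search system: find() returns the same node object.
print(soup.find("b") is bold)  # True

# Encoding handling: UnicodeDammit converts bytes to Unicode,
# trying the suggested encoding first.
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1"])
print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1
```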

## Capabilities

### Core Parsing

Primary BeautifulSoup class for parsing HTML and XML documents with configurable parser backends and encoding detection.

```python { .api }
class BeautifulSoup(Tag):
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, **kwargs): ...
    def new_tag(self, name, namespace=None, nsprefix=None, **attrs): ...
    def new_string(self, s, subclass=NavigableString): ...
```
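
A minimal sketch of the constructor options in use. `make_soup` is a hypothetical helper written for this example, not part of the library, and the sample markup is invented; it falls back to `html.parser` when `lxml` is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound, SoupStrainer


def make_soup(markup, preferred=("lxml", "html.parser"), **kwargs):
    """Hypothetical helper: try each parser in turn, skipping missing ones."""
    for features in preferred:
        try:
            return BeautifulSoup(markup, features, **kwargs)
        except FeatureNotFound:
            continue
    raise FeatureNotFound("none of the requested parsers is installed")


html = '<div><a href="/a">A</a><p>skip</p><a href="/b">B</a></div>'
print(make_soup(html).a["href"])  # /a

# parse_only: keep only <a> tags while parsing.
links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
print([a["href"] for a in links.find_all("a")])  # ['/a', '/b']

# from_encoding: state the encoding of byte input explicitly.
soup = BeautifulSoup(b"<p>caf\xe9</p>", "html.parser", from_encoding="latin-1")
print(soup.p.string)  # café

# new_tag and new_string create nodes that can be attached to this soup.
note = soup.new_tag("span", id="note")
note.append(soup.new_string("added"))
soup.p.append(note)
print(soup.p)  # <p>café<span id="note">added</span></p>
```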

[Core Parsing](./parsing.md)

### Tree Navigation

Navigate through the parse tree using parent-child relationships, sibling traversal, and document-order iteration with both property access and generator-based approaches.

```python { .api }
# Navigation properties
@property
def parent(self): ...
@property
def next_sibling(self): ...
@property
def previous_sibling(self): ...
@property
def next_element(self): ...
@property
def previous_element(self): ...

# Navigation generators
@property
def parents(self): ...
@property
def next_siblings(self): ...
@property
def previous_siblings(self): ...
@property
def next_elements(self): ...
@property
def previous_elements(self): ...
```
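
The following sketch walks a small list in each direction. The markup is an invented sample with no whitespace between tags, so siblings are tags rather than whitespace strings; `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

# Single-step properties
print(second.parent.name)              # ul
print(second.previous_sibling.string)  # one
print(second.next_sibling.string)      # three
print(second.next_element)             # two (the string inside the tag)
print(second.previous_element)         # one (the last node parsed before it)

# Generators yield every node in that direction
print([t.name for t in second.parents])              # ['ul', '[document]']
print([t.string for t in second.next_siblings])      # ['three']
print([t.string for t in second.previous_siblings])  # ['one']
```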

[Tree Navigation](./navigation.md)

### Element Search

Find elements using tag names, attributes, text content, CSS selectors, and custom matching functions with both single and multiple result options.

```python { .api }
def find(self, name=None, attrs={}, recursive=True, text=None, **kwargs): ...
def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs): ...
def select(self, selector): ...
def select_one(self, selector): ...

# Directional search
def find_next(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous(self, name=None, attrs={}, text=None, **kwargs): ...
def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_parent(self, name=None, attrs={}, **kwargs): ...
```
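
A short sketch of the main search styles against an invented document, assuming `html.parser`. It sticks to `find`, `find_all`, `select`, and the directional helpers.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main"><h1>Title</h1>'
    '<p class="intro">First</p><p>Second</p>'
    '<a href="/x">link</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

# By tag name, CSS class, attribute dict, and result limit
print(soup.find("p", class_="intro").get_text())    # First
print(len(soup.find_all("p")))                      # 2
print(len(soup.find_all("p", limit=1)))             # 1
print(soup.find_all(attrs={"href": "/x"})[0].name)  # a

# By callable: the function receives each tag and returns True to keep it
with_class = soup.find_all(lambda tag: tag.has_attr("class"))
print([tag.name for tag in with_class])             # ['p']

# By CSS selector
print([p.get_text() for p in soup.select("div#main > p")])  # ['First', 'Second']

# Directional search relative to a starting node
print(soup.h1.find_next_sibling("p").get_text())     # First
print(soup.a.find_previous_sibling("p").get_text())  # Second
print(soup.a.find_parent("div")["id"])               # main

# find() returns None when nothing matches
print(soup.find("table"))                            # None
```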

[Element Search](./search.md)

### Tree Modification

Modify the parse tree by inserting, removing, replacing elements and their attributes with automatic relationship maintenance.

```python { .api }
def extract(self): ...
def decompose(self): ...
def replace_with(self, *args): ...
def wrap(self, wrap_inside): ...
def unwrap(self): ...
def insert(self, position, new_child): ...
def insert_before(self, *args): ...
def insert_after(self, *args): ...
def append(self, tag): ...
def clear(self, decompose=False): ...
```
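
The sketch below chains several of these calls on an invented fragment, assuming `html.parser`. Each call is made with a single argument, and the comments show the serialized tree at that point.

```python
from bs4 import BeautifulSoup

html = "<div><p>Keep <b>bold</b> text</p><script>x()</script><i>old</i></div>"
soup = BeautifulSoup(html, "html.parser")

soup.script.decompose()  # remove the <script> tag and destroy it
soup.b.unwrap()          # drop the <b> tag but keep its contents
print(soup.p.get_text())  # Keep bold text

em = soup.new_tag("em")
em.string = "new"
soup.i.replace_with(em)        # swap <i>old</i> for <em>new</em>
em.wrap(soup.new_tag("span"))  # put a <span> around the <em>
print(soup.div)  # <div><p>Keep bold text</p><span><em>new</em></span></div>

removed = soup.span.extract()  # detach the subtree; it stays usable
print(removed)         # <span><em>new</em></span>
print(removed.parent)  # None

soup.div.insert(0, soup.new_tag("h2"))  # insert as the first child
soup.h2.append("Heading")               # plain strings are accepted
soup.p.clear()                          # remove all children of <p>
print(soup.div)  # <div><h2>Heading</h2><p></p></div>
```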

[Tree Modification](./modification.md)

### Content Extraction

Extract text content, attribute values, and formatted output from parse tree elements with flexible filtering and formatting options.

```python { .api }
def get_text(self, separator="", strip=False, types=(NavigableString,)): ...
def get(self, key, default=None): ...
def has_attr(self, key): ...
@property
def string(self): ...
@property
def strings(self): ...
@property
def stripped_strings(self): ...
@property
def text(self): ...
```
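
A brief sketch contrasting the text and attribute accessors on an invented fragment, assuming `html.parser`. Note that `string` is `None` when a tag has more than one child, while `get_text()` always joins every string underneath.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/home" class="nav link">Home</a><p> Hello <b>world</b> </p></div>'
soup = BeautifulSoup(html, "html.parser")
a, p = soup.a, soup.p

# Attribute access
print(a.get("href"))          # /home
print(a.get("title", "n/a"))  # n/a (default for a missing attribute)
print(a.has_attr("class"))    # True
print(a["class"])             # ['nav', 'link'] (class is multi-valued)

# Text access
print(a.string)                     # Home
print(p.string)                     # None (<p> has several children)
print(list(p.stripped_strings))     # ['Hello', 'world']
print(p.get_text(" ", strip=True))  # Hello world
print(repr(p.text))                 # ' Hello world '
```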

[Content Extraction](./content.md)

### Output and Serialization

Render parse tree elements as formatted HTML/XML with encoding control, pretty-printing, and entity substitution options.

```python { .api }
def encode(self, encoding="utf-8", indent_level=None, formatter="minimal", errors="xmlcharrefreplace"): ...
def decode(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def prettify(self, encoding=None, formatter="minimal"): ...
def decode_contents(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def encode_contents(self, indent_level=None, encoding="utf-8", formatter="minimal"): ...
```
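
A minimal sketch of the serialization calls on an invented paragraph, assuming `html.parser`. `decode` returns text, `encode` returns bytes, and the `formatter` argument controls entity substitution.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9 &amp; <b>tea</b></p>", "html.parser")
p = soup.p

# Text output; the default "minimal" formatter escapes only &, < and >
print(p.decode())                  # <p>café &amp; <b>tea</b></p>
print(p.decode(formatter="html"))  # <p>caf&eacute; &amp; <b>tea</b></p>
print(p.decode_contents())         # café &amp; <b>tea</b>

# Byte output; characters the target encoding cannot represent
# become numeric character references
print(p.encode("ascii"))  # b'<p>caf&#233; &amp; <b>tea</b></p>'
print(p.encode("utf-8"))  # b'<p>caf\xc3\xa9 &amp; <b>tea</b></p>'

# Indented output with one node per line
print(soup.prettify())
```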

[Output and Serialization](./output.md)

## Types

```python { .api }
class PageElement:
    """Base class for all parse tree elements"""

class NavigableString(str, PageElement):
    """Text content within tags"""

class PreformattedString(NavigableString):
    """Text that should preserve original formatting"""

class Tag(PageElement):
    """HTML/XML elements with attributes and children"""
    name: str
    attrs: dict
    contents: list

class Comment(NavigableString):
    """HTML/XML comments"""

class CData(NavigableString):
    """CDATA sections"""

class ProcessingInstruction(NavigableString):
    """XML processing instructions"""

class Doctype(NavigableString):
    """DOCTYPE declarations"""

class SoupStrainer:
    """Search criteria for filtering elements"""
    def __init__(self, name=None, attrs={}, text=None, **kwargs): ...

class ResultSet(list):
    """List of search results with source tracking"""

class FeatureNotFound(ValueError):
    """Raised when requested parser features are not available"""

class StopParsing(Exception):
    """Exception to stop parsing early"""

class ParserRejectedMarkup(Exception):
    """Raised when parser cannot handle the provided markup"""
```
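
Because `Comment` and `Doctype` subclass `NavigableString`, code that dispatches on node type has to test the specific classes first. The sketch below shows one way to do that; `describe` is a hypothetical helper written for this example, and the markup is an invented sample parsed with `html.parser`.

```python
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString, ResultSet, Tag

html = "<!DOCTYPE html><p>Text<!-- note --><b>bold</b></p>"
soup = BeautifulSoup(html, "html.parser")


def describe(node):
    """Hypothetical helper: label a node by its class."""
    # Comment and Doctype subclass NavigableString, so test them first.
    if isinstance(node, Doctype):
        return "doctype"
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    if isinstance(node, Tag):
        return "tag:" + node.name
    return "other"


labels = [describe(node) for node in soup.descendants]
print(labels)  # ['doctype', 'tag:p', 'text', 'comment', 'tag:b', 'text']

# find_all() returns a ResultSet, which behaves like a list.
results = soup.find_all("b")
print(isinstance(results, ResultSet), isinstance(results, list))  # True True
```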