Python library for pulling data out of HTML and XML files with pluggable parser architecture and intuitive navigation API
npx @tessl/cli install tessl/pypi-beautifulsoup4@4.3.00
# Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work by providing a Pythonic API for parsing documents with malformed markup.

## Package Information

- **Package Name**: beautifulsoup4
- **Language**: Python
- **Installation**: `pip install beautifulsoup4`
- **Parser Dependencies**:
  - Built-in: `html.parser` (included with Python)
  - Optional: `pip install lxml` (faster, supports XML)
  - Optional: `pip install html5lib` (pure Python, handles HTML5)

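
Because the optional parsers may or may not be installed, it can help to fall back to the built-in one. The following is a minimal sketch, not part of the library: `make_soup` and its `preferred` order are hypothetical names chosen for illustration, and it relies on `BeautifulSoup` raising `FeatureNotFound` when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next candidate.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used, soup.p.get_text())
```

Since `html.parser` ships with Python, the last candidate should always be available, so the `RuntimeError` branch is only a safeguard.
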
## Core Imports

```python
from bs4 import BeautifulSoup
```

Additional classes for advanced usage:

```python
from bs4 import BeautifulSoup, Tag, NavigableString, Comment
from bs4 import CData, ProcessingInstruction, Doctype
from bs4 import SoupStrainer, ResultSet
```

Diagnostic and configuration imports:

```python
from bs4.diagnose import diagnose, lxml_trace, htmlparser_trace, benchmark_parsers, profile
from bs4.builder import builder_registry, TreeBuilder, HTMLTreeBuilder
from bs4.dammit import UnicodeDammit, EntitySubstitution
```

## Basic Usage

```python
from bs4 import BeautifulSoup

# Parse HTML content
html = '<html><head><title>Sample Page</title></head><body><p class="content">Hello, world!</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Navigate the parse tree
title = soup.title.string
print(title)  # "Sample Page"

# Find elements by tag
paragraph = soup.find('p')
print(paragraph.get_text())  # "Hello, world!"

# Find elements by CSS class
content = soup.find('p', class_='content')
print(content['class'])  # ['content']

# Use CSS selectors
content = soup.select('p.content')[0]
print(content.get_text())  # "Hello, world!"

# Modify the tree
new_tag = soup.new_tag('span', id='highlight')
new_tag.string = 'Important!'
paragraph.append(new_tag)

# Output modified HTML
print(soup.prettify())
```

## Architecture

Beautiful Soup uses a layered architecture that separates parsing from tree manipulation:

- **Parser Layer**: Pluggable parser backends (html.parser, lxml, html5lib) handle markup parsing with different performance and compliance characteristics
- **Parse Tree**: Hierarchical representation using PageElement base class with specialized Tag and NavigableString nodes
- **Navigation API**: Bidirectional tree traversal with parent/child/sibling relationships and document-order navigation
- **Search System**: Flexible element finding with CSS selectors, attribute matching, and callable filters
- **Encoding Handling**: Automatic character encoding detection and Unicode conversion via UnicodeDammit

This design enables Beautiful Soup to handle malformed markup gracefully while providing an intuitive Pythonic API for web scraping, document processing, and HTML/XML manipulation tasks.

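
As a rough illustration of how these layers fit together, the sketch below decodes bytes with `UnicodeDammit`, hands the result to the `html.parser` backend, and inspects the node types in the resulting tree. The sample markup is invented for the example, and the `"utf-8"` hint is passed explicitly so the result does not depend on which detection libraries are installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag
from bs4.dammit import UnicodeDammit

# Encoding handling: convert raw bytes to str, trying "utf-8" first.
raw = "<p>café <b>au lait</b></p>".encode("utf-8")
dammit = UnicodeDammit(raw, ["utf-8"])
markup = dammit.unicode_markup

# Parser layer: the backend is chosen by name.
soup = BeautifulSoup(markup, "html.parser")

# Parse tree: a Tag's children are a mix of NavigableString and Tag nodes.
p = soup.p
kinds = [type(child).__name__ for child in p.children]
print(kinds)  # ['NavigableString', 'Tag']

# Navigation and search operate on the same tree.
print(soup.b.parent is p)          # True
print(soup.find("b").get_text())   # au lait
```
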
## Capabilities

### Core Parsing

Primary BeautifulSoup class for parsing HTML and XML documents with configurable parser backends and encoding detection.

```python { .api }
class BeautifulSoup(Tag):
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, **kwargs): ...
    def new_tag(self, name, namespace=None, nsprefix=None, **attrs): ...
    def new_string(self, s, subclass=NavigableString): ...
```

[Core Parsing](./parsing.md)

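
The constructor arguments and factory methods above can be combined as in the following sketch. The markup is made up for the example; `parse_only` is given a `SoupStrainer` so that only `<a>` elements end up in the tree, and `new_tag` and `new_string` create nodes that are then attached to the soup.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<div><a href="/a">A</a><p>skip me</p><a href="/b">B</a></div>'

# parse_only restricts the tree to elements matching the strainer.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['/a', '/b']

# new_tag and new_string build nodes; they join the tree once appended.
tag = soup.new_tag("a", href="/c")
tag.append(soup.new_string("C"))
soup.append(tag)
print(len(soup.find_all("a")))  # 3
```

Restricting the parse this way may save memory on large documents when only a few element types matter.
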
### Tree Navigation

Navigate through the parse tree using parent-child relationships, sibling traversal, and document-order iteration with both property access and generator-based approaches.

```python { .api }
# Navigation properties
@property
def parent(self): ...
@property
def next_sibling(self): ...
@property
def previous_sibling(self): ...
@property
def next_element(self): ...
@property
def previous_element(self): ...

# Navigation generators
@property
def parents(self): ...
@property
def next_siblings(self): ...
@property
def previous_siblings(self): ...
@property
def next_elements(self): ...
@property
def previous_elements(self): ...
```

[Tree Navigation](./navigation.md)

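
A short example of these properties in use, on invented markup. The list is written without whitespace between tags so that siblings are the `<li>` elements themselves rather than whitespace strings.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

# Single-step navigation
print(second.parent.name)                  # ul
print(second.previous_sibling.get_text())  # one
print(second.next_sibling.get_text())      # three

# next_element follows document order, so it is the string inside the tag
print(second.next_element)                 # two

# Generator-based navigation
parent_names = [p.name for p in second.parents]
later = [s.get_text() for s in second.next_siblings]
print(parent_names)  # ['ul', '[document]']
print(later)         # ['three']
```
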
### Element Search

Find elements using tag names, attributes, text content, CSS selectors, and custom matching functions with both single and multiple result options.

```python { .api }
def find(self, name=None, attrs={}, recursive=True, text=None, **kwargs): ...
def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs): ...
def select(self, selector): ...
def select_one(self, selector): ...

# Directional search
def find_next(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous(self, name=None, attrs={}, text=None, **kwargs): ...
def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_parent(self, name=None, attrs={}, **kwargs): ...
```

[Element Search](./search.md)

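
The sketch below exercises several of these search styles on invented markup: tag name, class matching, a regular expression on an attribute, a CSS selector, a callable filter, and directional search. It assumes a Beautiful Soup installation in which CSS selector support is available.

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="main">
  <a class="nav" href="/home">Home</a>
  <a class="nav external" href="https://example.com">Example</a>
  <p>Intro</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                       # first match only
nav_links = soup.find_all("a", class_="nav")      # matches any tag whose classes include "nav"
external = soup.find_all("a", href=re.compile(r"^https://"))
limited = soup.find_all("a", limit=1)             # stop after one result
by_css = soup.select("div#main > a.external")     # CSS selector, returns a list
intro = soup.select_one("p")                      # CSS selector, first match or None

# Callable filter: tags that have an href but are not marked "external"
internal = soup.find_all(
    lambda tag: tag.has_attr("href") and "external" not in tag.get("class", [])
)

# Directional search relative to an element
next_para = first_link.find_next_sibling("p")
container = first_link.find_parent("div")

print(len(nav_links), len(external), by_css[0].get_text(), container["id"])
```
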
### Tree Modification

Modify the parse tree by inserting, removing, and replacing elements and their attributes; parent and sibling relationships are maintained automatically.

```python { .api }
def extract(self): ...
def decompose(self): ...
def replace_with(self, *args): ...
def wrap(self, wrap_inside): ...
def unwrap(self): ...
def insert(self, position, new_child): ...
def insert_before(self, *args): ...
def insert_after(self, *args): ...
def append(self, tag): ...
def clear(self, decompose=False): ...
```

[Tree Modification](./modification.md)

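
To make the difference between these methods concrete, here is a small sequence of edits on invented markup, with the serialized tree captured after each stage.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>Hello <b>world</b></p><script>x()</script></div>", "html.parser"
)

# decompose() removes a tag from the tree and destroys it.
soup.script.decompose()

# unwrap() replaces a tag with its own contents.
soup.b.unwrap()
after_cleanup = str(soup)
print(after_cleanup)  # <div><p>Hello world</p></div>

# wrap() places the element inside a newly created tag.
soup.p.wrap(soup.new_tag("section"))

# insert_after() adds a sibling immediately after the element.
note = soup.new_tag("em")
note.string = "note"
soup.p.insert_after(note)
after_insert = str(soup)
print(after_insert)  # <div><section><p>Hello world</p><em>note</em></section></div>

# extract() removes an element but returns it, so it can be reused.
removed = soup.em.extract()
soup.p.replace_with(removed)
final = str(soup)
print(final)  # <div><section><em>note</em></section></div>
```
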
### Content Extraction

Extract text content, attribute values, and formatted output from parse tree elements with flexible filtering and formatting options.

```python { .api }
def get_text(self, separator="", strip=False, types=(NavigableString,)): ...
def get(self, key, default=None): ...
def has_attr(self, key): ...
@property
def string(self): ...
@property
def strings(self): ...
@property
def stripped_strings(self): ...
@property
def text(self): ...
```

[Content Extraction](./content.md)

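
The following example, on invented markup, shows how `separator` and `strip` change the result of `get_text`, and how `.string` differs from `.strings` when a tag has more than one child.

```python
from bs4 import BeautifulSoup

html = '<div id="card" class="a b"><h1> Title </h1><p>Body <i>text</i></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Text extraction
plain = div.get_text()                             # strings joined as-is
joined = div.get_text(separator="|", strip=True)   # each string stripped, then joined
pieces = list(div.stripped_strings)
print(repr(plain))   # ' Title Body text'
print(joined)        # Title|Body|text
print(pieces)        # ['Title', 'Body', 'text']

# .string is only set when a tag has exactly one child
print(repr(soup.h1.string))  # ' Title '
print(soup.p.string)         # None

# Attribute access
print(div.get("id"))              # card
print(div.get("data-x", "n/a"))   # n/a
print(div["class"])               # ['a', 'b']
print(div.has_attr("class"))      # True
```
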
### Output and Serialization

Render parse tree elements as formatted HTML/XML with encoding control, pretty-printing, and entity substitution options.

```python { .api }
def encode(self, encoding="utf-8", indent_level=None, formatter="minimal", errors="xmlcharrefreplace"): ...
def decode(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def prettify(self, encoding=None, formatter="minimal"): ...
def decode_contents(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def encode_contents(self, encoding="utf-8", indent_level=None, formatter="minimal", errors="xmlcharrefreplace"): ...
```

[Output and Serialization](./output.md)

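
A brief sketch of the difference between `decode` (returns `str`), `encode` (returns `bytes`), and the `*_contents` variants, using invented markup. The exact whitespace produced by `prettify` can vary between releases, so the example only prints it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café &amp; <b>tea</b></p>", "html.parser")
p = soup.p

# decode() returns a str; the default "minimal" formatter escapes & < > only.
text = p.decode()
print(text)  # <p>café &amp; <b>tea</b></p>

# encode() returns bytes; characters the target encoding cannot represent
# become numeric character references (errors="xmlcharrefreplace").
data = p.encode("ascii")
print(data)  # b'<p>caf&#233; &amp; <b>tea</b></p>'

# decode_contents() serializes the children without the enclosing tag.
inner = p.decode_contents()
print(inner)  # café &amp; <b>tea</b>

# The "html" formatter prefers named HTML entities where they exist.
as_entities = p.decode(formatter="html")
print(as_entities)

# prettify() returns an indented, multi-line str.
pretty = p.prettify()
print(pretty)
```
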
## Types

```python { .api }
class PageElement:
    """Base class for all parse tree elements"""

class NavigableString(str, PageElement):
    """Text content within tags"""

class PreformattedString(NavigableString):
    """Text that should preserve original formatting"""

class Tag(PageElement):
    """HTML/XML elements with attributes and children"""
    name: str
    attrs: dict
    contents: list

class Comment(PreformattedString):
    """HTML/XML comments"""

class CData(PreformattedString):
    """CDATA sections"""

class ProcessingInstruction(PreformattedString):
    """XML processing instructions"""

class Doctype(PreformattedString):
    """DOCTYPE declarations"""

class SoupStrainer:
    """Search criteria for filtering elements"""
    def __init__(self, name=None, attrs={}, text=None, **kwargs): ...

class ResultSet(list):
    """List of search results with source tracking"""

class FeatureNotFound(ValueError):
    """Raised when requested parser features are not available"""

class StopParsing(Exception):
    """Exception to stop parsing early"""

class ParserRejectedMarkup(Exception):
    """Raised when parser cannot handle the provided markup"""
```
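
Because `Comment` is itself a kind of `NavigableString`, type checks need to test for the more specific class first. The sketch below, on invented markup, classifies the children of a tag and then strips comments from the tree; the `kinds` list and its labels are illustrative names only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, ResultSet, Tag

soup = BeautifulSoup("<p>text<!-- hidden --><b>bold</b></p>", "html.parser")

kinds = []
for node in soup.p.children:
    # Check Comment before NavigableString, since it is a subclass.
    if isinstance(node, Comment):
        kinds.append(("comment", str(node)))
    elif isinstance(node, NavigableString):
        kinds.append(("string", str(node)))
    elif isinstance(node, Tag):
        kinds.append(("tag", node.name))
print(kinds)  # [('string', 'text'), ('comment', ' hidden '), ('tag', 'b')]

# find_all returns a ResultSet, which behaves like a list.
results = soup.find_all("b")
print(isinstance(results, ResultSet), len(results))  # True 1

# Remove every comment node from the tree.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()
cleaned = str(soup)
print(cleaned)  # <p>text<b>bold</b></p>
```
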