# Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work by providing a Pythonic API for parsing documents with malformed markup.

## Package Information

- **Package Name**: beautifulsoup4
- **Language**: Python
- **Installation**: `pip install beautifulsoup4`
- **Parser Dependencies**:
  - Built-in: `html.parser` (included with Python)
  - Optional: `pip install lxml` (faster, supports XML)
  - Optional: `pip install html5lib` (pure Python, handles HTML5)
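
The parser backend is chosen with the second argument to `BeautifulSoup`. A minimal sketch, assuming the optional parsers above are installed; if a requested parser is missing, `bs4` raises `FeatureNotFound`:

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Unclosed paragraph<li>stray list item"

# The built-in parser is always available.
soup = BeautifulSoup(markup, "html.parser")

# lxml is optional and generally faster; fall back if it is not installed.
try:
    soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    pass

# Different backends may repair malformed markup differently.
print(soup.prettify())
```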

## Core Imports

```python
from bs4 import BeautifulSoup
```

Additional classes for advanced usage:

```python
from bs4 import BeautifulSoup, Tag, NavigableString, Comment
from bs4 import CData, ProcessingInstruction, Doctype
from bs4 import SoupStrainer, ResultSet
```

Diagnostic and configuration imports:

```python
from bs4.diagnose import diagnose, lxml_trace, htmlparser_trace, benchmark_parsers, profile
from bs4.builder import builder_registry, TreeBuilder, HTMLTreeBuilder
from bs4.dammit import UnicodeDammit, EntitySubstitution
```
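
For example, `UnicodeDammit` can be used on its own to detect the encoding of raw bytes before parsing. A minimal sketch of that usage, with the candidate encodings passed as hints:

```python
from bs4.dammit import UnicodeDammit

raw = "Caf\xe9 au lait".encode("latin-1")

# Guess the encoding and convert the bytes to Unicode.
dammit = UnicodeDammit(raw, ["latin-1", "utf-8"])
print(dammit.original_encoding)  # e.g. "latin-1"
print(dammit.unicode_markup)     # "Café au lait"
```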

## Basic Usage

```python
from bs4 import BeautifulSoup

# Parse HTML content
html = '<html><head><title>Sample Page</title></head><body><p class="content">Hello, world!</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Navigate the parse tree
title = soup.title.string
print(title)  # "Sample Page"

# Find elements by tag
paragraph = soup.find('p')
print(paragraph.get_text())  # "Hello, world!"

# Find elements by CSS class
content = soup.find('p', class_='content')
print(content['class'])  # ['content']

# Use CSS selectors
content = soup.select('p.content')[0]
print(content.get_text())  # "Hello, world!"

# Modify the tree
new_tag = soup.new_tag('span', id='highlight')
new_tag.string = 'Important!'
paragraph.append(new_tag)

# Output modified HTML
print(soup.prettify())
```

## Architecture

Beautiful Soup uses a layered architecture that separates parsing from tree manipulation:

- **Parser Layer**: Pluggable parser backends (html.parser, lxml, html5lib) handle markup parsing with different performance and compliance characteristics
- **Parse Tree**: Hierarchical representation using the PageElement base class with specialized Tag and NavigableString nodes
- **Navigation API**: Bidirectional tree traversal with parent/child/sibling relationships and document-order navigation
- **Search System**: Flexible element finding with CSS selectors, attribute matching, and callable filters
- **Encoding Handling**: Automatic character encoding detection and Unicode conversion via UnicodeDammit

This design enables Beautiful Soup to handle malformed markup gracefully while providing an intuitive Pythonic API for web scraping, document processing, and HTML/XML manipulation tasks.
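
A short sketch of what that graceful handling looks like in practice, using the built-in parser: markup with unclosed tags is still turned into a searchable tree.

```python
from bs4 import BeautifulSoup

# Malformed markup: unclosed <b> and <p> tags.
broken = "<p>First<b>bold text<p>Second paragraph"

soup = BeautifulSoup(broken, "html.parser")

# The repaired tree can be searched and navigated as usual.
for p in soup.find_all("p"):
    print(p.get_text())
```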

## Capabilities

### Core Parsing

Primary BeautifulSoup class for parsing HTML and XML documents with configurable parser backends and encoding detection.

```python { .api }
class BeautifulSoup(Tag):
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, **kwargs): ...
    def new_tag(self, name, namespace=None, nsprefix=None, **attrs): ...
    def new_string(self, s, subclass=NavigableString): ...
```

[Core Parsing](./parsing.md)
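
A brief sketch of the constructor options and factory methods above; the `SoupStrainer` passed as `parse_only` restricts which parts of the document are parsed at all:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'

# Only parse <a> tags; everything else is skipped during parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
print([a["href"] for a in soup.find_all("a")])  # ['/a', '/b']

# Factory methods create new nodes attached to this soup's builder.
tag = soup.new_tag("a", href="/c")
tag.string = "C"
text = soup.new_string("plain text")
```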

### Tree Navigation

Navigate through the parse tree using parent-child relationships, sibling traversal, and document-order iteration with both property access and generator-based approaches.

```python { .api }
# Navigation properties
@property
def parent(self): ...
@property
def next_sibling(self): ...
@property
def previous_sibling(self): ...
@property
def next_element(self): ...
@property
def previous_element(self): ...

# Navigation generators
@property
def parents(self): ...
@property
def next_siblings(self): ...
@property
def previous_siblings(self): ...
@property
def next_elements(self): ...
@property
def previous_elements(self): ...
```

[Tree Navigation](./navigation.md)
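
A small usage sketch of these properties; the singular forms return one node, while the plural forms are generators over all enclosing or following nodes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

second = soup.find_all("li")[1]
print(second.parent.name)              # "ul"
print(second.previous_sibling.string)  # "one"
print(second.next_sibling.string)      # "three"

# Generators yield every match, however far away.
print([p.name for p in second.parents])          # ['ul', '[document]']
print([s.string for s in second.next_siblings])  # ['three']
```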

### Element Search

Find elements using tag names, attributes, text content, CSS selectors, and custom matching functions with both single and multiple result options.

```python { .api }
def find(self, name=None, attrs={}, recursive=True, text=None, **kwargs): ...
def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs): ...
def select(self, selector): ...
def select_one(self, selector): ...

# Directional search
def find_next(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous(self, name=None, attrs={}, text=None, **kwargs): ...
def find_next_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs): ...
def find_parent(self, name=None, attrs={}, **kwargs): ...
```

[Element Search](./search.md)
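
A short sketch combining attribute filters, a callable filter, and CSS selectors:

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="nav"><a href="https://example.com/a">A</a></div>
<div class="body"><a href="/local">B</a><a id="last" href="/c">C</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute matching, including regular expressions.
external = soup.find_all("a", href=re.compile(r"^https://"))

# Callable filter: any tag that carries an id attribute.
with_id = soup.find_all(lambda tag: tag.has_attr("id"))

# CSS selectors: select() returns a list, select_one() a single element.
body_links = soup.select("div.body a")
first_nav_link = soup.select_one("div.nav a")
print(len(external), len(with_id), len(body_links), first_nav_link["href"])
```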

### Tree Modification

Modify the parse tree by inserting, removing, and replacing elements and their attributes, with automatic relationship maintenance.

```python { .api }
def extract(self): ...
def decompose(self): ...
def replace_with(self, *args): ...
def wrap(self, wrap_inside): ...
def unwrap(self): ...
def insert(self, position, new_child): ...
def insert_before(self, *args): ...
def insert_after(self, *args): ...
def append(self, tag): ...
def clear(self, decompose=False): ...
```

[Tree Modification](./modification.md)
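
A minimal sketch of how these calls differ: `extract()` detaches a node but keeps it usable, `decompose()` destroys it, and `wrap()`/`unwrap()` adjust the surrounding structure.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>keep</p><p>drop</p><span>move</span></div>", "html.parser")

# extract() removes the node from the tree and returns it.
moved = soup.find("span").extract()

# decompose() removes the node and frees it; do not reuse it afterwards.
soup.find_all("p")[1].decompose()

# wrap() encloses an element in a new tag; unwrap() does the reverse.
soup.p.wrap(soup.new_tag("section"))
print(soup)   # <div><section><p>keep</p></section></div>
print(moved)  # <span>move</span>
```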

### Content Extraction

Extract text content, attribute values, and formatted output from parse tree elements with flexible filtering and formatting options.

```python { .api }
def get_text(self, separator="", strip=False, types=(NavigableString,)): ...
def get(self, key, default=None): ...
def has_attr(self, key): ...
@property
def string(self): ...
@property
def strings(self): ...
@property
def stripped_strings(self): ...
@property
def text(self): ...
```

[Content Extraction](./content.md)
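
A brief sketch contrasting `string`, `get_text()`, and `stripped_strings`, plus attribute access with `get()`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box">  <p>Hello</p>  <p>world</p>  </div>', "html.parser")
div = soup.find("div")

# .string is None here because the <div> holds more than one string.
print(div.string)                               # None
print(div.get_text(separator=" ", strip=True))  # "Hello world"
print(list(div.stripped_strings))               # ['Hello', 'world']

# Attribute access with a default for missing keys.
print(div.get("id"))         # "box"
print(div.get("class", []))  # []
print(div.has_attr("id"))    # True
```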

### Output and Serialization

Render parse tree elements as formatted HTML/XML with encoding control, pretty-printing, and entity substitution options.

```python { .api }
def encode(self, encoding="utf-8", indent_level=None, formatter="minimal", errors="xmlcharrefreplace"): ...
def decode(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def prettify(self, encoding=None, formatter="minimal"): ...
def decode_contents(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): ...
def encode_contents(self, encoding="utf-8", indent_level=None, formatter="minimal", errors="xmlcharrefreplace"): ...
```

[Output and Serialization](./output.md)
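
A short sketch of the output options; the `formatter` argument controls how entities are escaped:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9 &amp; <b>cr\u00e8me</b></p>", "html.parser")

# Unicode string output (decode) vs. bytes output (encode).
print(soup.decode())             # str
print(soup.encode("utf-8"))      # bytes
print(soup.p.decode_contents())  # inner HTML of <p> as a str

# prettify() renders one element per line with indentation.
print(soup.prettify())

# formatter="html" converts characters to named HTML entities where possible.
print(soup.decode(formatter="html"))  # e.g. "café" becomes "caf&eacute;"
```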

## Types

```python { .api }
class PageElement:
    """Base class for all parse tree elements"""

class NavigableString(str, PageElement):
    """Text content within tags"""

class PreformattedString(NavigableString):
    """Text that should preserve original formatting"""

class Tag(PageElement):
    """HTML/XML elements with attributes and children"""
    name: str
    attrs: dict
    contents: list

class Comment(NavigableString):
    """HTML/XML comments"""

class CData(NavigableString):
    """CDATA sections"""

class ProcessingInstruction(NavigableString):
    """XML processing instructions"""

class Doctype(NavigableString):
    """DOCTYPE declarations"""

class SoupStrainer:
    """Search criteria for filtering elements"""
    def __init__(self, name=None, attrs={}, text=None, **kwargs): ...

class ResultSet(list):
    """List of search results with source tracking"""

class FeatureNotFound(ValueError):
    """Raised when requested parser features are not available"""

class StopParsing(Exception):
    """Exception to stop parsing early"""

class ParserRejectedMarkup(Exception):
    """Raised when parser cannot handle the provided markup"""
```
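
A small sketch of how these node types appear while traversing a tree; `Comment` is a `NavigableString` subclass, so it is checked first when classifying nodes:

```python
from bs4 import BeautifulSoup, Tag, NavigableString, Comment

soup = BeautifulSoup("<p>text<!-- note --><b>bold</b></p>", "html.parser")

# Classify the direct children of the <p> tag.
for node in soup.p.contents:
    if isinstance(node, Comment):
        kind = "comment"
    elif isinstance(node, Tag):
        kind = "tag"
    elif isinstance(node, NavigableString):
        kind = "string"
    else:
        kind = "other"
    print(kind, repr(node))
```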