tessl/maven-org-jsoup--jsoup

Java HTML parser library implementing the WHATWG HTML5 specification for parsing, manipulating, and sanitizing HTML and XML documents.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: mavenpkg:maven/org.jsoup/jsoup@1.21.x

To install, run `npx @tessl/cli install tessl/maven-org-jsoup--jsoup@1.21.0`.


# jsoup


jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.


## Package Information

- **Package Name**: jsoup
- **Package Type**: Maven
- **Language**: Java
- **Group ID**: org.jsoup
- **Artifact ID**: jsoup
- **Installation**: Add to `pom.xml`:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.21.1</version>
</dependency>
```

## Core Imports

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
```

For HTTP connections:

```java
import org.jsoup.Connection;
```

For HTML sanitization:

```java
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
```

## Basic Usage

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

// Parse HTML from string
Document doc = Jsoup.parse("<html><body><p>Hello World!</p></body></html>");

// Parse HTML from URL (get() throws IOException)
Document webDoc = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0")
    .timeout(3000) // milliseconds
    .get();

// Extract data using CSS selectors
Elements links = doc.select("a[href]");
Element firstParagraph = doc.selectFirst("p"); // null if nothing matches
String title = doc.title();

// Manipulate DOM
firstParagraph.text("Updated text");
doc.body().append("<p>New paragraph</p>");

// Clean untrusted HTML
String userInput = "<p>Hi<script>alert(1)</script></p>"; // example untrusted input
String cleanHtml = Jsoup.clean(userInput, Safelist.basic());
```

## Architecture


jsoup is built around several key components:

- **Parsing Engine**: HTML5-compliant parser that handles malformed HTML gracefully
- **DOM API**: Document, Element, and Node classes providing jQuery-like manipulation methods
- **CSS Selectors**: Comprehensive CSS selector support for element selection and traversal
- **HTTP Client**: Built-in HTTP connection handling with session support and configuration options
- **Safety Features**: HTML sanitization with configurable allowlists to prevent XSS attacks
- **Flexible Input**: Support for parsing from strings, files, InputStreams, URLs, and Paths
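
To make the "handles malformed HTML gracefully" point concrete, here is a minimal sketch that feeds the parser markup with unclosed `<p>` tags and no `<html>` or `<body>` wrapper. The class name `MalformedHtmlSketch` and the sample markup are illustrative assumptions, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; only the Jsoup/Document calls come from the library.
class MalformedHtmlSketch {
    // Parses markup with unclosed <p> tags and no <html>/<head>/<body> wrapper.
    static Document parseSloppy() {
        return Jsoup.parse("<title>Demo</title><p>One<p>Two");
    }

    public static void main(String[] args) {
        Document doc = parseSloppy();
        System.out.println(doc.title());            // title is recovered into the head
        System.out.println(doc.select("p").size()); // the two paragraphs become siblings
        System.out.println(doc.body().html());      // normalized body markup
    }
}
```

The parser supplies the missing document shell and closes the first paragraph when the second one opens, which is the same recovery a browser would apply.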


## Capabilities


### HTML/XML Parsing


Core parsing functionality for converting HTML and XML strings, files, and streams into navigable DOM structures.

```java { .api }
// Parse from string
public static Document parse(String html);
public static Document parse(String html, String baseUri);

// Parse from file
public static Document parse(File file) throws IOException;
public static Document parse(File file, String charsetName) throws IOException;

// Parse fragments
public static Document parseBodyFragment(String bodyHtml);
public static Document parseBodyFragment(String bodyHtml, String baseUri);
```

[HTML/XML Parsing](./parsing.md)
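
As a rough illustration of the signatures above, the sketch below parses a string with a base URI so that relative links can be resolved, and parses a body fragment. The class `ParsingSketch`, its helper methods, and the sample inputs are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; helper names are illustrative only.
class ParsingSketch {
    // Resolves the first link against the base URI supplied at parse time.
    static String resolveLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    // Parses a fragment; jsoup wraps it in a document shell with a <body>.
    static String fragmentText(String fragment) {
        return Jsoup.parseBodyFragment(fragment).body().text();
    }

    public static void main(String[] args) {
        System.out.println(resolveLink("<a href=\"/docs\">Docs</a>", "https://example.com/"));
        System.out.println(fragmentText("<b>bold</b> text"));
    }
}
```

Without a base URI, `absUrl` has nothing to resolve a relative `href` against, so the two-argument `parse` overload is the one to reach for when the links matter.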


### DOM Manipulation


Document Object Model manipulation with Element and Node classes providing methods for traversing, modifying, and extracting content from parsed HTML.

```java { .api }
// Document methods
public Element body();
public String title();
public void title(String title);
public Element createElement(String tagName);

// Element methods
public String text();
public Element text(String text);
public String html();
public Element html(String html);
public Element attr(String attributeKey, String attributeValue);
public Element appendChild(Node child);
```

[DOM Manipulation](./dom-manipulation.md)
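
The following sketch strings several of the methods above together: it retitles a document, rewrites a paragraph, and appends a new element. The class `DomManipulationSketch` and the sample document are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class built only from the methods listed above.
class DomManipulationSketch {
    static Document edit() {
        Document doc = Jsoup.parse(
            "<html><head><title>Old</title></head><body><p>Hi</p></body></html>");

        doc.title("New title");

        // The sample document is known to contain a <p>, so selectFirst is non-null here.
        Element intro = doc.selectFirst("p");
        intro.text("Hello").attr("class", "lead"); // setters return the element, so calls chain

        Element note = doc.createElement("span").text("note");
        doc.body().appendChild(note);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(edit().body().html());
    }
}
```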


### CSS Selection


CSS selector engine for finding and filtering elements using familiar CSS syntax, plus bulk operations on element collections.

```java { .api }
// Selection methods
public Elements select(String cssQuery);
public Element selectFirst(String cssQuery);
public boolean is(String cssQuery);

// Elements collection operations
public Elements addClass(String className);
public Elements attr(String attributeKey, String attributeValue);
public String text();
public Elements remove();
```

[CSS Selection](./css-selection.md)
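
A small sketch of selection plus bulk operations on the resulting collection follows. The class `CssSelectionSketch`, its helpers, and the sample list markup are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the sample markup is made up for this example.
class CssSelectionSketch {
    static Document sample() {
        return Jsoup.parse(
            "<ul>"
          + "<li class=\"item\"><a href=\"/a\">A</a></li>"
          + "<li class=\"item done\"><a href=\"/b\">B</a></li>"
          + "<li>plain</li>"
          + "</ul>");
    }

    // Combined text of every link inside an .item list entry.
    static String linkText() {
        return sample().select("li.item a[href]").text();
    }

    // Tags the .item entries, removes the untagged ones, and counts what is left.
    static int pruneAndCount() {
        Document doc = sample();
        doc.select("li.item").addClass("seen");
        doc.select("li:not(.seen)").remove();
        return doc.select("li").size();
    }

    // Tests a single element against a selector.
    static boolean firstIsItem() {
        Element first = sample().selectFirst("li");
        return first != null && first.is(".item");
    }

    public static void main(String[] args) {
        System.out.println(linkText());
        System.out.println(pruneAndCount());
        System.out.println(firstIsItem());
    }
}
```

`Elements.text()` joins the text of each matched element with a space, and `selectFirst` returns `null` when nothing matches, hence the null check in `firstIsItem`.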


### HTTP Connection


HTTP client functionality for fetching web pages with full configuration control including headers, cookies, timeouts, and session management.

```java { .api }
// Connection creation
public static Connection connect(String url);
public static Connection newSession();

// Configuration methods
public Connection userAgent(String userAgent);
public Connection timeout(int millis);
public Connection cookie(String name, String value);
public Connection header(String name, String value);

// Execution methods
public Document get() throws IOException;
public Document post() throws IOException;
public Connection.Response execute() throws IOException;
```

[HTTP Connection](./http-connection.md)
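
The sketch below separates building a request from executing it, so the configuration can be inspected without touching the network. The class `HttpConnectionSketch`, the cookie and header values, and the target URL are assumptions for illustration; `fetchTitle` needs network access and is deliberately not called from `main`.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; header, cookie, and URL values are placeholders.
class HttpConnectionSketch {
    // Builds a configured connection; nothing is sent until get/post/execute is called.
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0")
            .timeout(5000) // milliseconds
            .header("Accept-Language", "en")
            .cookie("session", "abc123");
    }

    // Executes a GET and returns the page title. Requires network access.
    static String fetchTitle(String url) throws IOException {
        Document doc = configure(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        Connection conn = configure("https://example.com");
        // Inspect the pending request instead of sending it.
        System.out.println(conn.request().timeout());
        System.out.println(conn.request().header("Accept-Language"));
    }
}
```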


### HTML Sanitization


Security-focused HTML cleaning using configurable allowlists to prevent XSS attacks while preserving safe content.

```java { .api }
// Cleaning methods
public static String clean(String bodyHtml, Safelist safelist);
public static boolean isValid(String bodyHtml, Safelist safelist);

// Safelist presets
public static Safelist none();
public static Safelist basic();
public static Safelist relaxed();

// Cleaner class
public Document clean(Document dirtyDocument);
public boolean isValid(Document dirtyDocument);
```

[HTML Sanitization](./html-sanitization.md)
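
Here is a minimal sketch of the three entry points above applied to untrusted input, all using the `Safelist.basic()` preset. The class `SanitizationSketch`, its helper names, and the sample inputs are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; the choice of Safelist.basic() is just for the example.
class SanitizationSketch {
    // Returns the input with everything outside the safelist stripped.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Reports whether the input would pass through the cleaner unchanged.
    static boolean isAcceptable(String untrusted) {
        return Jsoup.isValid(untrusted, Safelist.basic());
    }

    // Same cleaning, but working on Document objects via Cleaner.
    static Document cleanDocument(String untrusted) {
        Document dirty = Jsoup.parseBodyFragment(untrusted);
        return new Cleaner(Safelist.basic()).clean(dirty);
    }

    public static void main(String[] args) {
        String input = "<p>Hi <b>there</b><script>alert(1)</script></p>";
        System.out.println(sanitize(input));
        System.out.println(isAcceptable(input));
        System.out.println(cleanDocument(input).body().html());
    }
}
```

`isValid` is useful for rejecting input outright, while `clean` is the choice when the safe portion of the markup should be kept.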


### Form Handling


HTML form processing with automatic form control discovery and submission capabilities through the HTTP connection system.

```java { .api }
// FormElement methods
public Elements elements();
public Connection submit();
public List<Connection.KeyVal> formData();

// Form data manipulation
public Connection data(String key, String value);
public Connection data(Map<String, String> data);
```

[Form Handling](./form-handling.md)
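
The sketch below locates a form, reads the data it would submit, and prepares (but does not send) the submission request. The class `FormHandlingSketch`, the form markup, and the base URI are assumptions for illustration; a base URI is supplied at parse time so the relative `action` can be resolved.

```java
import java.util.List;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

// Hypothetical demo class; the form markup is made up for this example.
class FormHandlingSketch {
    static FormElement sampleForm() {
        Document doc = Jsoup.parse(
            "<form action=\"/search\" method=\"get\">"
          + "<input name=\"q\" value=\"jsoup\">"
          + "<input type=\"checkbox\" name=\"safe\" value=\"on\" checked>"
          + "<input type=\"submit\" value=\"Go\">"
          + "</form>",
            "https://example.com/");
        // Elements.forms() narrows the selection to FormElement instances.
        return doc.select("form").forms().get(0);
    }

    public static void main(String[] args) {
        FormElement form = sampleForm();
        System.out.println(form.elements().size());

        // Only named, submittable controls contribute key/value pairs.
        List<Connection.KeyVal> data = form.formData();
        for (Connection.KeyVal kv : data) {
            System.out.println(kv.key() + "=" + kv.value());
        }

        // submit() prepares a Connection; nothing is sent until it is executed.
        Connection conn = form.submit().data("page", "2");
        System.out.println(conn.request().url());
    }
}
```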


## Core Types

```java { .api }
// Main document class extending Element
public class Document extends Element {
    public Element head();
    public Element body();
    public String title();
    public Document.OutputSettings outputSettings();
}

// HTML element with tag and attributes
public class Element extends Node {
    public String tagName();
    public String text();
    public String html();
    public Attributes attributes();
    public Elements children();
    public Element parent();
}

// Collection of elements with bulk operations
public class Elements extends ArrayList<Element> {
    public Elements select(String cssQuery);
    public String text();
    public Elements attr(String attributeKey, String attributeValue);
}

// HTTP connection interface
public interface Connection {
    Connection url(String url);
    Connection userAgent(String userAgent);
    Connection timeout(int millis);
    Document get() throws IOException;
    Document post() throws IOException;
}
```

## Exception Handling


jsoup defines several specific exceptions for different error conditions:

```java { .api }
// HTTP errors
public class HttpStatusException extends IOException {
    public int getStatusCode();
    public String getUrl();
}

// Unsupported content types
public class UnsupportedMimeTypeException extends IOException {
    public String getMimeType();
    public String getUrl();
}

// HTML serialization errors
public class SerializationException extends RuntimeException {
}
```
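
Because both HTTP-related exceptions extend `IOException`, a single catch block can tell them apart by type. The sketch below shows one way to do that; the class `ExceptionHandlingSketch`, the `describe` helper, and the message formats are assumptions for illustration, and the exceptions are constructed by hand so the example runs without network access.

```java
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;

// Hypothetical demo class; describe() and its message formats are illustrative.
class ExceptionHandlingSketch {
    // Turns a fetch failure into a short message, checking the specific types first.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException u = (UnsupportedMimeTypeException) e;
            return "Unsupported type " + u.getMimeType() + " at " + u.getUrl();
        }
        return "I/O failure: " + e.getMessage();
    }

    public static void main(String[] args) {
        // In real code these would come from: try { Jsoup.connect(url).get(); } catch (IOException e) { ... }
        System.out.println(describe(
            new HttpStatusException("HTTP error fetching URL", 404, "https://example.com/missing")));
        System.out.println(describe(
            new UnsupportedMimeTypeException("Unhandled content type", "image/png", "https://example.com/logo.png")));
        System.out.println(describe(new IOException("connection reset")));
    }
}
```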