# jsoup

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.

## Package Information

- **Package Name**: jsoup
- **Package Type**: Maven
- **Language**: Java
- **Group ID**: org.jsoup
- **Artifact ID**: jsoup
- **Installation**: Add to `pom.xml`:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.21.1</version>
</dependency>
```

## Core Imports

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
```

For HTTP connections:

```java
import org.jsoup.Connection;
```

For HTML sanitization:

```java
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
```

## Basic Usage

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

// Parse HTML from string
Document doc = Jsoup.parse("<html><body><p>Hello World!</p></body></html>");

// Parse HTML from URL (get() throws IOException on network or HTTP errors)
Document webDoc = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0")
    .timeout(3000) // milliseconds
    .get();

// Extract data using CSS selectors
Elements links = doc.select("a[href]");        // empty here: the sample HTML has no links
Element firstParagraph = doc.selectFirst("p"); // null when nothing matches
String title = doc.title();                    // empty string when there is no <title>

// Manipulate DOM
firstParagraph.text("Updated text");
doc.body().append("<p>New paragraph</p>");

// Clean untrusted HTML (userInput stands in for any untrusted markup)
String userInput = "<p>Hi<script>alert(1)</script></p>";
String cleanHtml = Jsoup.clean(userInput, Safelist.basic());
```

## Architecture

jsoup is built around several key components:

- **Parsing Engine**: HTML5-compliant parser that handles malformed HTML gracefully
- **DOM API**: Document, Element, and Node classes providing jQuery-like manipulation methods
- **CSS Selectors**: Comprehensive CSS selector support for element selection and traversal
- **HTTP Client**: Built-in HTTP connection handling with session support and configuration options
- **Safety Features**: HTML sanitization with configurable allowlists to prevent XSS attacks
- **Flexible Input**: Support for parsing from strings, files, InputStreams, URLs, and Paths

## Capabilities

### HTML/XML Parsing

Core parsing functionality for converting HTML and XML strings, files, and streams into navigable DOM structures.

```java { .api }
// Parse from string
public static Document parse(String html);
public static Document parse(String html, String baseUri);

// Parse from file
public static Document parse(File file) throws IOException;
public static Document parse(File file, String charsetName) throws IOException;

// Parse fragments
public static Document parseBodyFragment(String bodyHtml);
public static Document parseBodyFragment(String bodyHtml, String baseUri);
```
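A minimal sketch of how the string-based parse methods behave. The class name, the sample markup, and the `https://example.com/` base URI are illustrative assumptions, not part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParsingSketch {

    // Unclosed tags are repaired by the parser, so the text is still reachable.
    static String malformedText() {
        Document doc = Jsoup.parse("<p>Hello <b>World");
        return doc.select("p").text();
    }

    // The baseUri argument lets relative links resolve to absolute URLs.
    static String resolvedLink() {
        Document doc = Jsoup.parse("<a href='/docs'>Docs</a>", "https://example.com/");
        return doc.selectFirst("a").absUrl("href");
    }

    // A body fragment is wrapped in a document shell; its nodes land under body().
    static int fragmentChildCount() {
        Document doc = Jsoup.parseBodyFragment("<p>One</p><p>Two</p>");
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        System.out.println(malformedText());
        System.out.println(resolvedLink());
        System.out.println(fragmentChildCount());
    }
}
```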

[HTML/XML Parsing](./parsing.md)

### DOM Manipulation

Document Object Model manipulation with Element and Node classes providing methods for traversing, modifying, and extracting content from parsed HTML.

```java { .api }
// Document methods
public Element body();
public String title();
public void title(String title);
public Element createElement(String tagName);

// Element methods
public String text();
public Element text(String text);
public String html();
public Element html(String html);
public Element attr(String attributeKey, String attributeValue);
public Element appendChild(Node child);
```
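The methods above can be combined roughly as follows. This is a sketch only: the class name and the sample document are made up for illustration, and it assumes the selector matches (otherwise `selectFirst` returns `null`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomSketch {

    // Builds a small document and edits it using the setters listed above.
    static Document build() {
        Document doc = Jsoup.parse(
            "<html><head><title>Old</title></head><body><p>Hi</p></body></html>");

        doc.title("New title");

        // Setters return the element, so calls can be chained.
        Element p = doc.selectFirst("p");
        p.text("Updated").attr("class", "lead");

        // createElement makes a detached element; appendChild attaches it.
        Element note = doc.createElement("span").text("note");
        p.appendChild(note);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().body().html());
    }
}
```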

[DOM Manipulation](./dom-manipulation.md)

### CSS Selection

CSS selector engine for finding and filtering elements using familiar CSS syntax, plus bulk operations on element collections.

```java { .api }
// Selection methods
public Elements select(String cssQuery);
public Element selectFirst(String cssQuery);
public boolean is(String cssQuery);

// Elements collection operations
public Elements addClass(String className);
public Elements attr(String attributeKey, String attributeValue);
public String text();
public Elements remove();
```
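As a rough illustration of selection plus bulk operations, the sketch below runs a few selectors against a hypothetical list. The class name, markup, and the `seen` and `nofollow` values are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectSketch {

    static final String HTML =
        "<ul>"
        + "<li class='item'><a href='/a'>A</a></li>"
        + "<li class='item'><a href='/b'>B</a></li>"
        + "<li>C</li>"
        + "</ul>";

    // Bulk setters on Elements apply to every matched element.
    static Document markLinks() {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select("li.item > a[href]");
        links.addClass("seen").attr("rel", "nofollow");
        return doc;
    }

    // Elements.text() joins the text of each match with a single space.
    static String linkText() {
        return Jsoup.parse(HTML).select("li.item > a[href]").text();
    }

    // is() tests one element against a selector without searching below it.
    static boolean firstItemIsTagged() {
        return Jsoup.parse(HTML).selectFirst("li").is(".item");
    }

    // remove() detaches every matched element from the document.
    static int itemsAfterRemoval() {
        Document doc = Jsoup.parse(HTML);
        doc.select("li:not(.item)").remove();
        return doc.select("li").size();
    }

    public static void main(String[] args) {
        System.out.println(markLinks().body().html());
        System.out.println(linkText());
    }
}
```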

[CSS Selection](./css-selection.md)

### HTTP Connection

HTTP client functionality for fetching web pages with full configuration control including headers, cookies, timeouts, and session management.

```java { .api }
// Connection creation
public static Connection connect(String url);
public static Connection newSession();

// Configuration methods
public Connection userAgent(String userAgent);
public Connection timeout(int millis);
public Connection cookie(String name, String value);
public Connection header(String name, String value);

// Execution methods
public Document get() throws IOException;
public Document post() throws IOException;
public Connection.Response execute() throws IOException;
```
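The sketch below separates configuring a connection from executing it, which makes the configuration inspectable before any network traffic happens. The class name, user-agent string, header, and cookie values are placeholders, not recommended settings.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionSketch {

    // Builds a configured connection; nothing is sent until get(), post(), or execute().
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // placeholder value
            .timeout(5000)                                         // milliseconds
            .header("Accept-Language", "en")
            .cookie("session", "abc123");                          // placeholder value
    }

    // Executes the request and parses the response body into a Document.
    static String fetchTitle(String url) throws IOException {
        return configure(url).get().title();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```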

[HTTP Connection](./http-connection.md)

### HTML Sanitization

Security-focused HTML cleaning using configurable allowlists to prevent XSS attacks while preserving safe content.

```java { .api }
// Cleaning methods
public static String clean(String bodyHtml, Safelist safelist);
public static boolean isValid(String bodyHtml, Safelist safelist);

// Safelist presets
public static Safelist none();
public static Safelist basic();
public static Safelist relaxed();

// Cleaner class
public Document clean(Document dirtyDocument);
public boolean isValid(Document dirtyDocument);
```
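A small sketch of both entry points: the static `Jsoup.clean` and `Jsoup.isValid` helpers for body HTML strings, and the `Cleaner` class for already-parsed documents. The class name is hypothetical and `Safelist.basic()` is just one of the presets listed above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

class SanitizeSketch {

    // Returns body HTML with anything outside the basic safelist stripped.
    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // True only when the input already conforms to the safelist.
    static boolean isSafe(String untrusted) {
        return Jsoup.isValid(untrusted, Safelist.basic());
    }

    // Cleaner works on parsed documents and returns a new, cleaned Document.
    static Document cleanDocument(String html) {
        Document dirty = Jsoup.parse(html);
        return new Cleaner(Safelist.basic()).clean(dirty);
    }

    public static void main(String[] args) {
        String unsafe =
            "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(clean(unsafe));
        System.out.println(isSafe(unsafe));
    }
}
```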

[HTML Sanitization](./html-sanitization.md)

### Form Handling

HTML form processing with automatic form control discovery and submission capabilities through the HTTP connection system.

```java { .api }
// FormElement methods
public Elements elements();
public Connection submit();
public List<Connection.KeyVal> formData();

// Form data manipulation
public Connection data(String key, String value);
public Connection data(Map<String, String> data);
```
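The following sketch reads a form's data and prepares, but does not send, a submission. The search form, field names, and base URI are invented for the example; it obtains the `FormElement` through `Elements.forms()`, which is one of several ways to reach it.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

class FormSketch {

    static final String HTML =
        "<form action='/search' method='get'>"
        + "<input name='q' value='jsoup'>"
        + "<input type='submit' value='Go'>"
        + "</form>";

    // The base URI is needed so the relative action can resolve on submit().
    static FormElement firstForm() {
        Document doc = Jsoup.parse(HTML, "https://example.com/");
        return doc.select("form").forms().get(0);
    }

    public static void main(String[] args) {
        FormElement form = firstForm();

        // formData() collects the named controls; the unnamed submit button is skipped.
        for (Connection.KeyVal kv : form.formData()) {
            System.out.println(kv.key() + "=" + kv.value());
        }

        // submit() only prepares the request; extra data can be added before executing.
        Connection prepared = form.submit().data("page", "2");
        System.out.println(prepared.request().method() + " " + prepared.request().url());
    }
}
```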

[Form Handling](./form-handling.md)

## Core Types

```java { .api }
// Main document class extending Element
public class Document extends Element {
    public Element head();
    public Element body();
    public String title();
    public Document.OutputSettings outputSettings();
}

// HTML element with tag and attributes
public class Element extends Node {
    public String tagName();
    public String text();
    public String html();
    public Attributes attributes();
    public Elements children();
    public Element parent();
}

// Collection of elements with bulk operations
public class Elements extends ArrayList<Element> {
    public Elements select(String cssQuery);
    public String text();
    public Elements attr(String attributeKey, String attributeValue);
}

// HTTP connection interface
public interface Connection {
    Connection url(String url);
    Connection userAgent(String userAgent);
    Connection timeout(int millis);
    Document get() throws IOException;
    Document post() throws IOException;
}
```
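One way to read the hierarchy above: a `Document` can be used wherever an `Element` is expected, and `Elements` behaves like any Java list. The sketch below relies only on that; the class name and markup are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TypesSketch {

    // Collects the tag names of the direct children of the element with id "box".
    static List<String> childTags() {
        Document doc = Jsoup.parse("<div id='box'><h1>T</h1><p>a</p><p>b</p></div>");

        // A Document is itself an Element, so Element methods apply to it directly.
        Element root = doc;
        Element box = root.selectFirst("#box");

        // Elements is a list of Element, so it works with for-each.
        List<String> tags = new ArrayList<>();
        for (Element child : box.children()) {
            tags.add(child.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(childTags());
    }
}
```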

## Exception Handling

jsoup defines several specific exceptions for different error conditions:

```java { .api }
// HTTP errors
public class HttpStatusException extends IOException {
    public int getStatusCode();
    public String getUrl();
}

// Unsupported content types
public class UnsupportedMimeTypeException extends IOException {
    public String getMimeType();
    public String getUrl();
}

// HTML serialization errors
public class SerializationException extends RuntimeException {
}
```
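
Because both fetch-related exceptions extend `IOException`, a caller can catch the general type and branch on the specific one. The sketch below shows one possible way to do that; the class name, the `describe` helper, and the message formats are hypothetical.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;

class FetchSketch {

    // Turns a fetch failure into a short, human-readable description.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException m = (UnsupportedMimeTypeException) e;
            return "Unsupported type " + m.getMimeType() + " for " + m.getUrl();
        }
        return "I/O failure: " + e.getMessage();
    }

    // Returns the page title, or a description of what went wrong.
    static String fetchTitle(String url) {
        try {
            return Jsoup.connect(url).get().title();
        } catch (IOException e) {
            return describe(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```

Keeping `describe` separate from the network call means the error handling can be exercised without a live connection.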