Java HTML parser library implementing the WHATWG HTML5 specification for parsing, manipulating, and sanitizing HTML and XML documents.
npx @tessl/cli install tessl/maven-org-jsoup--jsoup@1.21.0

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for fetching URLs and for parsing, extracting, and manipulating data using DOM API methods, CSS selectors, and XPath selectors. jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.
pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.21.1</version>
</dependency>

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

For HTTP connections:
import org.jsoup.Connection;

For HTML sanitization:
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
// Parse HTML from string
Document doc = Jsoup.parse("<html><body><p>Hello World!</p></body></html>");
// Parse HTML from URL
Document webDoc = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0")
    .timeout(3000)
    .get();
// Extract data using CSS selectors
Elements links = doc.select("a[href]");
Element firstParagraph = doc.selectFirst("p");
String title = doc.title();
// Manipulate DOM
firstParagraph.text("Updated text");
doc.body().append("<p>New paragraph</p>");
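// Clean untrusted HTML (userInput is a placeholder for any untrusted HTML string)
String userInput = "<p>Hello<script>alert('x')</script></p>";
String cleanHtml = Jsoup.clean(userInput, Safelist.basic());

jsoup is built around several key components: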
Core parsing functionality for converting HTML and XML strings, files, and streams into navigable DOM structures.
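As a rough illustration of the parsing entry points described above, the sketch below parses a full document with a base URI and a body fragment. The class name `ParseSketch`, its helper methods, and the sample markup are made up for this example; only the `Jsoup.parse`, `Jsoup.parseBodyFragment`, `selectFirst`, `absUrl`, and `body().html()` calls come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseSketch {
    // Parse a complete document. The base URI lets jsoup resolve relative links.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.selectFirst("a").absUrl("href");
    }

    // Parse a fragment. jsoup places it inside the body of an empty document shell.
    static String fragmentBodyHtml(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical sample input, chosen only for illustration.
        System.out.println(resolveFirstLink("<a href='/docs'>Docs</a>", "https://example.com/"));
        // The unclosed <p> is closed by the parser, as a browser would do.
        System.out.println(fragmentBodyHtml("<p>One</p><p>Two"));
    }
}
```

Passing a base URI matters mainly when relative links need to be resolved later; without one, `absUrl` has nothing to resolve against.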
// Parse from string
public static Document parse(String html);
public static Document parse(String html, String baseUri);
// Parse from file
public static Document parse(File file) throws IOException;
public static Document parse(File file, String charsetName) throws IOException;
// Parse fragments
public static Document parseBodyFragment(String bodyHtml);
public static Document parseBodyFragment(String bodyHtml, String baseUri);

Document Object Model manipulation with Element and Node classes providing methods for traversing, modifying, and extracting content from parsed HTML.
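A minimal sketch of the kind of DOM manipulation described above, assuming a small hand-written document. The class name `DomSketch` and the sample markup are illustrative only; the jsoup calls used are `title`, `selectFirst`, `text`, `attr`, `createElement`, and `appendChild`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomSketch {
    static Document build() {
        Document doc = Jsoup.parse(
            "<html><head><title>Old</title></head><body><p>Hello</p></body></html>");

        // Replace the document title.
        doc.title("New title");

        // Setters return the element, so calls can be chained.
        Element p = doc.selectFirst("p");
        p.text("Updated").attr("class", "lead");

        // Create a new element owned by this document and attach it to the paragraph.
        Element note = doc.createElement("span").text("note");
        p.appendChild(note);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().body().html());
    }
}
```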
// Document methods
public Element body();
public String title();
public void title(String title);
public Element createElement(String tagName);
// Element methods
public String text();
public Element text(String text);
public String html();
public Element html(String html);
public Element attr(String attributeKey, String attributeValue);
public Element appendChild(Node child);

CSS selector engine for finding and filtering elements using familiar CSS syntax, plus bulk operations on element collections.
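The selection and bulk operations described above might be combined as in the following sketch. `SelectSketch`, `tagExternalLinks`, and the sample list markup are assumptions for illustration; the selector syntax (`a[href^=http]`, `:not(.external)`) and the `Elements` methods are part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectSketch {
    // Select links whose href starts with "http" and mark them in bulk.
    static Elements tagExternalLinks(Document doc) {
        Elements links = doc.select("a[href^=http]");
        // Bulk operations apply to every element in the collection.
        links.addClass("external").attr("rel", "nofollow");
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        Document doc = Jsoup.parse(
            "<ul>"
            + "<li><a href='https://a.example/'>A</a></li>"
            + "<li><a href='/local'>B</a></li>"
            + "<li><a href='http://c.example/'>C</a></li>"
            + "</ul>");
        Elements external = tagExternalLinks(doc);
        System.out.println(external.size() + " external links: " + external.text());
        // is(...) tests a single element against a selector.
        System.out.println(external.first().is("a.external"));
    }
}
```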
// Selection methods
public Elements select(String cssQuery);
public Element selectFirst(String cssQuery);
public boolean is(String cssQuery);
// Elements collection operations
public Elements addClass(String className);
public Elements attr(String attributeKey, String attributeValue);
public String text();
public Elements remove();

HTTP client functionality for fetching web pages with full configuration control including headers, cookies, timeouts, and session management.
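To make the configuration options above concrete, here is a hedged sketch that builds a connection and inspects the prepared request without sending it. `ConnectionSketch`, the header and cookie values, and the URL are placeholders; calling `get()`, `post()`, or `execute()` on the returned connection is what would actually perform the request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {
    // Build a configured connection. Nothing is sent until get(), post(), or execute().
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0")
            .timeout(3000)                    // milliseconds
            .header("Accept-Language", "en")
            .cookie("session", "abc123");     // placeholder cookie value
    }

    public static void main(String[] args) {
        Connection.Request req = configure("https://example.com").request();
        // Inspect the prepared request without touching the network.
        System.out.println(req.header("User-Agent"));
        System.out.println(req.timeout());
        System.out.println(req.cookie("session"));
    }
}
```

Keeping configuration in one helper like this is only one possible arrangement; `Jsoup.newSession()` is an alternative when several requests should share settings and cookies.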
// Connection creation
public static Connection connect(String url);
public static Connection newSession();
// Configuration methods
public Connection userAgent(String userAgent);
public Connection timeout(int millis);
public Connection cookie(String name, String value);
public Connection header(String name, String value);
// Execution methods
public Document get() throws IOException;
public Document post() throws IOException;
public Connection.Response execute() throws IOException;

Security-focused HTML cleaning using configurable allowlists to prevent XSS attacks while preserving safe content.
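The allowlist-based cleaning described above can be sketched as follows, assuming the `Safelist.basic()` preset is an acceptable policy. `CleanSketch`, its helper names, and the sample input are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {
    // Return a sanitized copy of untrusted body HTML, keeping only basic text formatting.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Check whether the input would already pass the same allowlist unchanged.
    static boolean alreadySafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical untrusted input containing a script element and an event handler.
        String dirty = "<p onclick='steal()'>Hi <b>there</b><script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
        System.out.println(alreadySafe(dirty));
        System.out.println(alreadySafe("<b>ok</b>"));
    }
}
```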
// Cleaning methods
public static String clean(String bodyHtml, Safelist safelist);
public static boolean isValid(String bodyHtml, Safelist safelist);
// Safelist presets
public static Safelist none();
public static Safelist basic();
public static Safelist relaxed();
// Cleaner class
public Document clean(Document dirtyDocument);
public boolean isValid(Document dirtyDocument);

HTML form processing with automatic form control discovery and submission capabilities through the HTTP connection system.
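A small sketch of the form handling described above, under the assumption that the page is parsed with a base URI so the form action can be resolved. `FormSketch`, `firstForm`, the markup, and the URL are illustrative; the connection returned by `submit()` is only prepared here, never executed.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

public class FormSketch {
    // Parse the page and return its first form. The base URI resolves the form action.
    static FormElement firstForm(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("form").forms().get(0);
    }

    public static void main(String[] args) {
        // Hypothetical search form.
        String html = "<form action='/search' method='get'>"
            + "<input type='text' name='q' value='jsoup'>"
            + "<input type='hidden' name='lang' value='en'>"
            + "</form>";
        FormElement form = firstForm(html, "https://example.com/");

        // Change a control's value before collecting the form data.
        form.selectFirst("[name=q]").val("html parser");
        for (Connection.KeyVal kv : form.formData()) {
            System.out.println(kv.key() + "=" + kv.value());
        }

        // submit() prepares a Connection; execute() would be needed to send it.
        Connection prepared = form.submit();
        System.out.println(prepared.request().method() + " " + prepared.request().url());
    }
}
```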
// FormElement methods
public Elements elements();
public Connection submit();
public List<Connection.KeyVal> formData();
// Form data manipulation
public Connection data(String key, String value);
public Connection data(Map<String, String> data);

// Main document class extending Element
public class Document extends Element {
    public Element head();
    public Element body();
    public String title();
    public Document.OutputSettings outputSettings();
}
// HTML element with tag and attributes
public class Element extends Node {
    public String tagName();
    public String text();
    public String html();
    public Attributes attributes();
    public Elements children();
    public Element parent();
}
// Collection of elements with bulk operations
public class Elements extends ArrayList<Element> {
    public Elements select(String cssQuery);
    public String text();
    public Elements attr(String attributeKey, String attributeValue);
}
// HTTP connection interface
public interface Connection {
    Connection url(String url);
    Connection userAgent(String userAgent);
    Connection timeout(int millis);
    Document get() throws IOException;
    Document post() throws IOException;
}

jsoup defines several specific exceptions for different error conditions:
// HTTP errors
public class HttpStatusException extends IOException {
    public int getStatusCode();
    public String getUrl();
}
// Unsupported content types
public class UnsupportedMimeTypeException extends IOException {
    public String getMimeType();
    public String getUrl();
}
// HTML serialization errors
public class SerializationException extends RuntimeException {
}
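
One way the exception types listed above might be handled is sketched below. `ErrorSketch` and `describe` are hypothetical names, and the exceptions are constructed by hand so the example runs without network access; in practice they would be caught around a call such as `Jsoup.connect(url).get()`.

```java
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;

public class ErrorSketch {
    // Turn a fetch failure into a short message, most specific type first.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException m = (UnsupportedMimeTypeException) e;
            return "Unsupported type " + m.getMimeType() + " for " + m.getUrl();
        }
        return "I/O failure: " + e.getMessage();
    }

    public static void main(String[] args) {
        // Exceptions built manually here purely for demonstration.
        System.out.println(describe(
            new HttpStatusException("HTTP error fetching URL", 404, "https://example.com/missing")));
        System.out.println(describe(
            new UnsupportedMimeTypeException("Unhandled content type", "image/png", "https://example.com/logo.png")));
        System.out.println(describe(new IOException("connection reset")));
    }
}
```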