
tessl/maven-org-jsoup--jsoup

Java HTML parser library implementing the WHATWG HTML5 specification for parsing, manipulating, and sanitizing HTML and XML documents.

Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.jsoup/jsoup@1.21.x

To install, run

npx @tessl/cli install tessl/maven-org-jsoup--jsoup@1.21.0


jsoup

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.

Package Information

  • Package Name: jsoup
  • Package Type: Maven
  • Language: Java
  • Group ID: org.jsoup
  • Artifact ID: jsoup
  • Installation: Add to pom.xml:
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.21.1</version>
    </dependency>

Core Imports

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

For HTTP connections:

import org.jsoup.Connection;

For HTML sanitization:

import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

Basic Usage

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

// Parse HTML from string
Document doc = Jsoup.parse("<html><body><p>Hello World!</p></body></html>");

// Parse HTML from URL (get() throws IOException on network or HTTP errors)
Document webDoc = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0")
    .timeout(3000) // milliseconds
    .get();

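// Extract data using CSS selectors
Elements links = doc.select("a[href]");        // empty (never null) when nothing matches
Element firstParagraph = doc.selectFirst("p"); // null when nothing matches
String title = doc.title();                    // empty string when there is no <title>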

// Manipulate DOM
firstParagraph.text("Updated text");
doc.body().append("<p>New paragraph</p>");

// Clean untrusted HTML
String userInput = "<p>Hi<script>alert(1)</script></p>"; // sample untrusted input
String cleanHtml = Jsoup.clean(userInput, Safelist.basic());

Architecture

jsoup is built around several key components:

  • Parsing Engine: HTML5-compliant parser that handles malformed HTML gracefully
  • DOM API: Document, Element, and Node classes providing jQuery-like manipulation methods
  • CSS Selectors: Comprehensive CSS selector support for element selection and traversal
  • HTTP Client: Built-in HTTP connection handling with session support and configuration options
  • Safety Features: HTML sanitization with configurable allowlists to prevent XSS attacks
  • Flexible Input: Support for parsing from strings, files, InputStreams, URLs, and Paths
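
As a rough illustration of the parsing engine's tolerance for malformed markup, the sketch below feeds it a string with unclosed tags and inspects the resulting tree. The class and method names (`MalformedHtmlSketch`, `parseMessy`) are hypothetical, and the example assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MalformedHtmlSketch {
    // Parses markup with unclosed <p> and <b> tags; the parser closes them
    // and supplies the missing html, head, and body elements.
    static Document parseMessy() {
        String messy = "<p>First<p>Second <b>bold";
        return Jsoup.parse(messy);
    }

    public static void main(String[] args) {
        Document doc = parseMessy();
        System.out.println(doc.select("p").size()); // two separate paragraphs
        System.out.println(doc.body().html());      // normalized markup
    }
}
```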

Capabilities

HTML/XML Parsing

Core parsing functionality for converting HTML and XML strings, files, and streams into navigable DOM structures.

// Parse from string
public static Document parse(String html);
public static Document parse(String html, String baseUri);

// Parse from file
public static Document parse(File file) throws IOException;
public static Document parse(File file, String charsetName) throws IOException;

// Parse fragments
public static Document parseBodyFragment(String bodyHtml);
public static Document parseBodyFragment(String bodyHtml, String baseUri);

Details: docs/parsing.md
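
A minimal sketch of two of the entry points listed above: `parse(String, String)` with a base URI, which lets relative links be resolved, and `parseBodyFragment(String)`, which wraps a fragment in a full document shell. The class name, method names, and sample URL are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParsingSketch {
    // Resolves a relative href against the base URI supplied at parse time.
    static String resolveLink() {
        Document doc = Jsoup.parse("<a href='/docs/intro'>Intro</a>", "https://example.com/");
        Element link = doc.selectFirst("a");
        return link.absUrl("href");
    }

    // Parses a body fragment and counts the elements placed directly in <body>.
    static int fragmentChildCount() {
        Document doc = Jsoup.parseBodyFragment("<p>One</p><p>Two</p>");
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        System.out.println(resolveLink());
        System.out.println(fragmentChildCount());
    }
}
```

Without a base URI, `absUrl` has nothing to resolve against and returns an empty string for relative links, so passing one is worth considering whenever links will be followed.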

DOM Manipulation

Document Object Model manipulation with Element and Node classes providing methods for traversing, modifying, and extracting content from parsed HTML.

// Document methods
public Element body();
public String title();
public void title(String title);
public Element createElement(String tagName);

// Element methods  
public String text();
public Element text(String text);
public String html();
public Element html(String html);
public Element attr(String attributeKey, String attributeValue);
public Element appendChild(Node child);

Details: docs/dom-manipulation.md
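
The methods above can be combined roughly as follows. This is a sketch under the assumption that jsoup is on the classpath; `DomSketch` and `buildDocument` are hypothetical names chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomSketch {
    // Parses a small document, then edits text, attributes, children, and title.
    static Document buildDocument() {
        Document doc = Jsoup.parse("<div id=main><p>Old</p></div>");

        Element p = doc.selectFirst("p");
        p.text("New");            // replaces the text content (input is escaped)
        p.attr("class", "note");  // sets or overwrites an attribute

        // Create an element owned by this document and append it to #main.
        Element span = doc.createElement("span").text("added");
        doc.getElementById("main").appendChild(span);

        doc.title("Demo");        // creates <title> in <head> if it is missing
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(buildDocument().html());
    }
}
```

Setter-style methods such as `text(String)` and `attr(String, String)` return the element itself, which is what makes the chained `createElement("span").text("added")` call possible.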

CSS Selection

CSS selector engine for finding and filtering elements using familiar CSS syntax, plus bulk operations on element collections.

// Selection methods
public Elements select(String cssQuery);
public Element selectFirst(String cssQuery);
public boolean is(String cssQuery);

// Elements collection operations
public Elements addClass(String className);
public Elements attr(String attributeKey, String attributeValue);
public String text();
public Elements remove();

Details: docs/css-selection.md
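
To make the difference between single-element selection and bulk `Elements` operations concrete, here is a small sketch. The sample markup and the names `SelectionSketch`, `itemText`, and `removeItems` are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectionSketch {
    static final String HTML =
        "<ul><li class=item>A</li><li class=item>B</li><li>C</li></ul>";

    // Returns the combined text of every li.item, joined by single spaces.
    static String itemText(Document doc) {
        return doc.select("li.item").text();
    }

    // Tags the matches, removes them from the tree, and reports how many <li> remain.
    static int removeItems(Document doc) {
        Elements items = doc.select("li.item");
        items.addClass("seen"); // applied to every matched element
        items.remove();         // detaches every matched element from the document
        return doc.select("li").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.selectFirst("li").is(".item")); // does the first li match?
        System.out.println(itemText(doc));
        System.out.println(removeItems(doc));
    }
}
```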

HTTP Connection

HTTP client functionality for fetching web pages with full configuration control including headers, cookies, timeouts, and session management.

// Connection creation
public static Connection connect(String url);
public static Connection newSession();

// Configuration methods
public Connection userAgent(String userAgent);
public Connection timeout(int millis);
public Connection cookie(String name, String value);
public Connection header(String name, String value);

// Execution methods
public Document get() throws IOException;
public Document post() throws IOException;
public Connection.Response execute() throws IOException;

Details: docs/http-connection.md
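
The configuration methods return the same `Connection`, so a request can be assembled in one chain and inspected before anything is sent. The sketch below assumes jsoup is on the classpath; the class name, user agent string, header, and cookie values are placeholders, and `fetchTitle` performs a real network call only when invoked.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {
    // Builds a configured connection; no network traffic happens here.
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("example-bot/1.0")   // placeholder user agent
            .timeout(5000)                  // milliseconds
            .header("Accept-Language", "en")
            .cookie("session", "abc123");   // placeholder cookie
    }

    // Executes a GET and returns the page title; requires network access.
    static String fetchTitle(String url) throws IOException {
        return configure(url).get().title();
    }

    public static void main(String[] args) throws IOException {
        Connection conn = configure("https://example.com/");
        System.out.println(conn.request().timeout());
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```

When several requests should share cookies and settings, starting from `Jsoup.newSession()` instead of `Jsoup.connect(url)` may be the better fit.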

HTML Sanitization

Security-focused HTML cleaning using configurable allowlists to prevent XSS attacks while preserving safe content.

// Cleaning methods
public static String clean(String bodyHtml, Safelist safelist);
public static boolean isValid(String bodyHtml, Safelist safelist);

// Safelist presets
public static Safelist none();
public static Safelist basic();
public static Safelist relaxed();

// Cleaner class
public Document clean(Document dirtyDocument);
public boolean isValid(Document dirtyDocument);

Details: docs/html-sanitization.md
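
A short sketch of allowlist-based cleaning with the `basic()` preset. The input string and the names `SanitizeSketch` and `sanitize` are hypothetical; the point is that tags and attributes outside the safelist are dropped while permitted markup survives.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeSketch {
    // Returns body HTML containing only tags and attributes allowed by Safelist.basic().
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Reports whether the input already conforms to the safelist, without modifying it.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hi <script>alert(1)</script><b onclick='x()'>there</b></p>";
        System.out.println(sanitize(dirty));
        System.out.println(isAlreadySafe(dirty));
    }
}
```

`isValid` is suited to rejecting input outright, while `clean` is suited to accepting it after stripping; which one applies depends on how the application wants to treat offending submissions.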

Form Handling

HTML form processing with automatic form control discovery and submission capabilities through the HTTP connection system.

// FormElement methods
public Elements elements();
public Connection submit();
public List<Connection.KeyVal> formData();

// Form data manipulation
public Connection data(String key, String value);
public Connection data(Map<String, String> data);

Details: docs/form-handling.md
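
The following sketch shows how form controls might be read from a parsed page. It assumes the parser yields a `FormElement` for `<form>` tags (so the cast is safe) and uses a made-up form and base URI; `FormSketch` and `readForm` are hypothetical names.

```java
import java.util.List;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

public class FormSketch {
    // Parses a small form and returns its submittable name/value pairs.
    static List<Connection.KeyVal> readForm() {
        String html = "<form action='/search' method='get'>"
            + "<input type='text' name='q' value='jsoup'>"
            + "<input type='hidden' name='lang' value='en'>"
            + "<input type='submit' value='Go'>" // no name, so not part of the form data
            + "</form>";
        // The base URI lets submit() resolve the relative action URL.
        Document doc = Jsoup.parse(html, "https://example.com/");
        FormElement form = (FormElement) doc.selectFirst("form");
        return form.formData();
    }

    public static void main(String[] args) {
        for (Connection.KeyVal kv : readForm()) {
            System.out.println(kv.key() + "=" + kv.value());
        }
    }
}
```

Calling `form.submit()` returns a prepared `Connection` rather than sending anything, so extra values can still be added with `data(key, value)` before `execute()`, `get()`, or `post()` is called.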

Core Types

// Main document class extending Element
public class Document extends Element {
    public Element head();
    public Element body();
    public String title();
    public Document.OutputSettings outputSettings();
}

// HTML element with tag and attributes
public class Element extends Node {
    public String tagName();
    public String text();
    public String html();
    public Attributes attributes();
    public Elements children();
    public Element parent();
}

// Collection of elements with bulk operations
public class Elements extends ArrayList<Element> {
    public Elements select(String cssQuery);
    public String text();
    public Elements attr(String attributeKey, String attributeValue);
}

// HTTP connection interface
public interface Connection {
    Connection url(String url);
    Connection userAgent(String userAgent);
    Connection timeout(int millis);
    Document get() throws IOException;
    Document post() throws IOException;
}
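
As a small illustration of how these types relate, the sketch below walks from an `Element` to its children and back to its parent. The helper name `childTags` and the sample markup are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CoreTypesSketch {
    // Collects the tag names of an element's direct child elements, in document order.
    static List<String> childTags(Element parent) {
        List<String> names = new ArrayList<>();
        for (Element child : parent.children()) {
            names.add(child.tagName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><h1>T</h1><p>a</p></div>");
        Element div = doc.selectFirst("div");
        System.out.println(childTags(div));
        System.out.println(div.selectFirst("p").parent().tagName());
    }
}
```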

Exception Handling

jsoup defines several specific exceptions for different error conditions:

// HTTP errors
public class HttpStatusException extends IOException {
    public int getStatusCode();
    public String getUrl();
}

// Unsupported content types
public class UnsupportedMimeTypeException extends IOException {
    public String getMimeType();
    public String getUrl();
}

// HTML serialization errors
public class SerializationException extends RuntimeException {
}
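
One way to handle the two checked exceptions is to catch `IOException` once and branch on its concrete type, as sketched below. `FetchErrorsSketch`, `describe`, and `fetchTitle` are hypothetical names, the message formats are arbitrary, and `fetchTitle` needs network access when called.

```java
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;

public class FetchErrorsSketch {
    // Turns a fetch failure into a short human-readable message.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            UnsupportedMimeTypeException m = (UnsupportedMimeTypeException) e;
            return "Unsupported type " + m.getMimeType() + " for " + m.getUrl();
        }
        return "I/O failure: " + e.getMessage();
    }

    // Fetches a page title, or returns a description of what went wrong.
    static String fetchTitle(String url) {
        try {
            return Jsoup.connect(url).get().title();
        } catch (IOException e) {
            return describe(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```

Both exceptions extend `IOException`, so they must be tested (or caught) before the general case, otherwise the status code and MIME type details are lost.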