tessl/maven-org-jsoup--jsoup

Java HTML parser library implementing the WHATWG HTML5 specification for parsing, manipulating, and sanitizing HTML and XML documents.

HTML/XML Parsing

Name: tessl/maven-org-jsoup--jsoup
Author: tessl

Core parsing functionality for converting HTML and XML strings, files, and streams into navigable DOM structures. jsoup implements the WHATWG HTML5 specification and handles malformed HTML gracefully.

Capabilities

Parse from String

Parse HTML content from strings with optional base URI for resolving relative URLs.

/**
 * Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.
 * @param html HTML to parse
 * @return Document with parsed HTML
 */
public static Document parse(String html);

/**
 * Parse HTML into a Document with base URI for resolving relative URLs.
 * @param html HTML to parse
 * @param baseUri The URL where the HTML was retrieved from
 * @return Document with parsed HTML
 */
public static Document parse(String html, String baseUri);

/**
 * Parse HTML with custom parser (e.g., XML parser).
 * @param html HTML to parse
 * @param baseUri Base URI for resolving relative URLs
 * @param parser Parser to use (Parser.htmlParser() or Parser.xmlParser())
 * @return Document with parsed content
 */
public static Document parse(String html, String baseUri, Parser parser);

Usage Examples:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Basic HTML parsing
Document doc = Jsoup.parse("<html><body><h1>Title</h1></body></html>");

// Parse with base URI for relative URL resolution
Document linkDoc = Jsoup.parse(
    "<html><body><a href='/page'>Link</a></body></html>",
    "https://example.com"
);

// Parse XML content
Document xmlDoc = Jsoup.parse(
    "<root><item id='1'>Value</item></root>", 
    "", 
    Parser.xmlParser()
);
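
The base URI has no visible effect until relative links are resolved. The sketch below illustrates that with `Element.absUrl`, and also reads an attribute back from an XML parse. The class and method names are hypothetical, chosen only for this example; the jsoup calls (`Jsoup.parse`, `selectFirst`, `absUrl`, `attr`) are the library's own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical demo class; only the jsoup calls come from the library.
class BaseUriExample {
    /** Resolves the first link's href against the base URI given at parse time. */
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        // absUrl returns an empty string when no absolute URL can be built
        return link == null ? "" : link.absUrl("href");
    }

    /** Reads an attribute from the first element matching cssQuery in an XML string. */
    static String xmlAttr(String xml, String cssQuery, String attrName) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element el = doc.selectFirst(cssQuery);
        return el == null ? "" : el.attr(attrName);
    }

    public static void main(String[] args) {
        String html = "<a href='/page'>Link</a>";
        System.out.println(resolveFirstLink(html, "https://example.com"));
        System.out.println(resolveFirstLink(html, "").isEmpty());
        System.out.println(xmlAttr("<root><item id='1'>Value</item></root>", "item", "id"));
    }
}
```

Without a base URI, the relative `href` is still available through `attr("href")`; only the absolute form is unavailable.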

Parse HTML Fragments

Parse partial HTML content intended as body fragments rather than complete documents.

/**
 * Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
 * @param bodyHtml body HTML fragment
 * @return Document with HTML fragment wrapped in basic document structure
 */
public static Document parseBodyFragment(String bodyHtml);

/**
 * Parse HTML fragment with base URI for relative URL resolution.
 * @param bodyHtml body HTML fragment
 * @param baseUri URL to resolve relative URLs against
 * @return Document with HTML fragment
 */
public static Document parseBodyFragment(String bodyHtml, String baseUri);

Usage Examples:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse HTML fragment
Document doc = Jsoup.parseBodyFragment("<p>Hello <b>world</b>!</p>");
Element body = doc.body(); // Access the generated body element

// Parse fragment with base URI
Document imgDoc = Jsoup.parseBodyFragment(
    "<img src='/image.jpg' alt='Image'>",
    "https://example.com"
);

Parse from File

Parse HTML content from files with automatic or specified character encoding detection.

/**
 * Parse the contents of a file as HTML with auto-detected charset.
 * @param file file to load HTML from (supports gzipped files)
 * @return Document with parsed HTML
 * @throws IOException if the file could not be found or read
 */
public static Document parse(File file) throws IOException;

/**
 * Parse file with specified character encoding.
 * @param file file to load HTML from
 * @param charsetName character set of file contents (null for auto-detection)
 * @return Document with parsed HTML
 * @throws IOException if the file could not be found, read, or charset is invalid
 */
public static Document parse(File file, String charsetName) throws IOException;

/**
 * Parse file with charset and base URI.
 * @param file file to load HTML from
 * @param charsetName character set (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @return Document with parsed HTML
 * @throws IOException if the file could not be found, read, or charset is invalid
 */
public static Document parse(File file, String charsetName, String baseUri) throws IOException;

/**
 * Parse file with custom parser.
 * @param file file to load HTML from
 * @param charsetName character set (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @param parser custom parser to use
 * @return Document with parsed content
 * @throws IOException if the file could not be found, read, or charset is invalid
 */
public static Document parse(File file, String charsetName, String baseUri, Parser parser) throws IOException;

Usage Examples:

import java.io.File;

// Parse with auto-detected encoding
Document indexDoc = Jsoup.parse(new File("index.html"));

// Parse with specific encoding
Document pageDoc = Jsoup.parse(new File("page.html"), "UTF-8");

// Parse with encoding and base URI
Document contentDoc = Jsoup.parse(
    new File("content.html"),
    "UTF-8",
    "https://example.com"
);

Parse from Path

Parse HTML content from java.nio.file.Path objects. These overloads mirror the File-based ones.

/**
 * Parse the contents of a file path as HTML with auto-detected charset.
 * @param path file path to load HTML from (supports gzipped files)
 * @return Document with parsed HTML
 * @throws IOException if the file could not be found or read
 */
public static Document parse(Path path) throws IOException;

/**
 * Parse path with specified character encoding.
 * @param path file path to load HTML from
 * @param charsetName character set of file contents (null for auto-detection)
 * @return Document with parsed HTML
 * @throws IOException if the path could not be found, read, or charset is invalid
 */
public static Document parse(Path path, String charsetName) throws IOException;

/**
 * Parse path with charset and base URI.
 * @param path file path to load HTML from
 * @param charsetName character set (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @return Document with parsed HTML
 * @throws IOException if the path could not be found, read, or charset is invalid
 */
public static Document parse(Path path, String charsetName, String baseUri) throws IOException;

/**
 * Parse path with custom parser.
 * @param path file path to load HTML from
 * @param charsetName character set (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @param parser custom parser to use
 * @return Document with parsed content
 * @throws IOException if the path could not be found, read, or charset is invalid
 */
public static Document parse(Path path, String charsetName, String baseUri, Parser parser) throws IOException;
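
A minimal sketch of the Path overloads, assuming a jsoup release that ships them (they are not present in older versions). It writes a temporary file and parses it back; the class name, method name, and the `https://example.com/` base URI are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class showing Jsoup.parse(Path, String, String).
class ParsePathExample {
    /** Writes html to a temporary file, parses it through the Path overload, returns the title. */
    static String titleViaPath(String html) {
        try {
            Path tmp = Files.createTempFile("jsoup-path-demo", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                Document doc = Jsoup.parse(tmp, "UTF-8", "https://example.com/");
                return doc.title();
            } finally {
                Files.deleteIfExists(tmp); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleViaPath(
            "<html><head><title>Demo</title></head><body></body></html>"));
    }
}
```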

Parse from InputStream

Parse HTML content from input streams, using either a specified character encoding or auto-detection (pass null as charsetName).

/**
 * Read an input stream, and parse it to a Document.
 * @param in input stream to read (will be closed after reading)
 * @param charsetName character set of stream contents (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @return Document with parsed HTML
 * @throws IOException if the stream could not be read or charset is invalid
 */
public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException;

/**
 * Parse InputStream with custom parser.
 * @param in input stream to read
 * @param charsetName character set (null for auto-detection)
 * @param baseUri base URI for resolving relative URLs
 * @param parser custom parser to use
 * @return Document with parsed content
 * @throws IOException if the stream could not be read or charset is invalid
 */
public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException;
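
The stream overloads can be exercised without touching the file system. This sketch wraps a byte array in a `ByteArrayInputStream`; the class and method names and the base URI are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing Jsoup.parse(InputStream, String, String).
class ParseStreamExample {
    /** Parses the bytes as HTML in the given charset and returns the first paragraph's text. */
    static String firstParagraph(byte[] htmlBytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            Document doc = Jsoup.parse(in, charsetName, "https://example.com/");
            Element p = doc.selectFirst("p");
            return p == null ? "" : p.text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<p>Hello stream</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstParagraph(bytes, "UTF-8"));
    }
}
```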

Parse from URL

Fetch and parse HTML content directly from URLs with timeout control.

/**
 * Fetch a URL, and parse it as HTML.
 * @param url URL to fetch (must be http or https)
 * @param timeoutMillis Connection and read timeout in milliseconds
 * @return Document with parsed HTML
 * @throws IOException if connection fails, times out, or returns error status
 * @throws HttpStatusException if HTTP response is not OK
 * @throws UnsupportedMimeTypeException if response MIME type is not supported
 */
public static Document parse(URL url, int timeoutMillis) throws IOException;

Usage Example:

import java.net.URL;

// Fetch and parse URL with 5-second timeout
Document doc = Jsoup.parse(new URL("https://example.com"), 5000);

Parser Configuration

Parser Class

Create and configure custom parsers for specific parsing requirements.

/**
 * HTML parser factory method.
 * @return Parser configured for HTML parsing
 */
public static Parser htmlParser();

/**
 * XML parser factory method.
 * @return Parser configured for XML parsing
 */
public static Parser xmlParser();

/**
 * Parse HTML input with this parser.
 * @param html HTML content to parse
 * @param baseUri base URI for relative URL resolution
 * @return Document with parsed content
 */
public Document parseInput(String html, String baseUri);

/**
 * Parse HTML fragment with context element.
 * @param fragment HTML fragment to parse
 * @param context Element providing parsing context
 * @param baseUri base URI for relative URLs
 * @return List of parsed nodes
 */
public List<Node> parseFragmentInput(String fragment, Element context, String baseUri);
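
The context element decides how a fragment is interpreted. As a sketch, the example below parses table cells in the context of an existing `<tr>`, where `<td>` tags are valid. The class and method names are hypothetical, and the surrounding table markup is an assumption used only to obtain a context element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class showing Parser.parseFragmentInput with a context element.
class FragmentContextExample {
    /** Parses cellsHtml as if it appeared inside a table row and returns the node names. */
    static List<String> parseCells(String cellsHtml) {
        Document doc = Jsoup.parse("<table><tr></tr></table>");
        Element row = doc.selectFirst("tr"); // context element for the fragment
        List<Node> nodes = Parser.htmlParser().parseFragmentInput(cellsHtml, row, "");
        List<String> names = new ArrayList<>();
        for (Node node : nodes) {
            names.add(node.nodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parseCells("<td>A</td><td>B</td>"));
    }
}
```

The method only reads the returned nodes. If you move them into another element, copy the list first, since the nodes may still be attached to a temporary parent.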

Parse Settings

Control case sensitivity and normalization behavior during parsing.

public class ParseSettings {
    /** Default HTML settings (case-insensitive tags and attributes) */
    public static final ParseSettings htmlDefault;
    
    /** Preserve case settings (case-sensitive tags and attributes) */
    public static final ParseSettings preserveCase;
    
    /**
     * Create custom parse settings.
     * @param preserveTagCase whether to preserve tag name case
     * @param preserveAttributeCase whether to preserve attribute name case
     */
    public ParseSettings(boolean preserveTagCase, boolean preserveAttributeCase);
}

Usage Examples:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.parser.ParseSettings;

// Create HTML parser with case-sensitive settings
Parser parser = Parser.htmlParser();
parser.settings(ParseSettings.preserveCase);

// Parse with custom parser (sample input shown for illustration)
String html = "<DIV><P>Text</P></DIV>";
Document doc = Jsoup.parse(html, "https://example.com", parser);

// XML parsing (the XML parser preserves case by default)
String xmlContent = "<Root><Item>Value</Item></Root>";
Document xmlDoc = Jsoup.parse(xmlContent, "", Parser.xmlParser());
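
To see what the settings change, this sketch parses the same markup twice and compares the resulting tag name. The class and method names are hypothetical; `ParseSettings.htmlDefault`, `ParseSettings.preserveCase`, and `Parser.settings` are the members described above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

// Hypothetical demo class comparing default and case-preserving HTML parsing.
class ParseSettingsExample {
    /** Returns the tag name of the first element in the body, as stored by the parser. */
    static String firstBodyTagName(String html, boolean preserveCase) {
        ParseSettings settings = preserveCase
            ? ParseSettings.preserveCase
            : ParseSettings.htmlDefault;
        Parser parser = Parser.htmlParser().settings(settings);
        Document doc = Jsoup.parse(html, "", parser);
        return doc.body().child(0).tagName();
    }

    public static void main(String[] args) {
        String html = "<SPAN>text</SPAN>";
        System.out.println(firstBodyTagName(html, false)); // default settings
        System.out.println(firstBodyTagName(html, true));  // case preserved
    }
}
```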

Error Handling and Position Tracking

Enable error tracking and position information during parsing for debugging and validation.

/**
 * Enable parse error tracking.
 * @param maxErrors maximum number of errors to track (0 disables error tracking)
 * @return this parser for chaining
 */
public Parser setTrackErrors(int maxErrors);

/**
 * Get parse errors if error tracking is enabled.
 * @return List of ParseError objects
 */
public List<ParseError> getErrors();

/**
 * Enable position tracking for parsed nodes.
 * @param trackPosition whether to track source positions
 * @return this parser for chaining
 */
public Parser setTrackPosition(boolean trackPosition);

Usage Example:

// Create parser with error tracking
Parser parser = Parser.htmlParser();
parser.setTrackErrors(50);  // Track up to 50 errors
parser.setTrackPosition(true);  // Track source positions

Document doc = parser.parseInput(html, baseUri);

// Check for parse errors
List<ParseError> errors = parser.getErrors();
if (!errors.isEmpty()) {
    System.out.println("Parse errors found: " + errors.size());
    for (ParseError error : errors) {
        System.out.println("Error: " + error.getErrorMessage());
    }
}
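
Position tracking becomes useful once you read the positions back. The sketch below counts tracked errors and looks up the source line of an element through `Element.sourceRange()`, assuming a jsoup version that provides that method. The class and method names are illustrative.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical demo class for error tracking and source positions.
class TrackingExample {
    /** Parses html and returns how many parse errors were recorded, up to maxErrors. */
    static int countErrors(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        return parser.getErrors().size();
    }

    /** Returns the line on which the first match starts, or -1 if it is not tracked. */
    static int startLineOf(String html, String cssQuery) {
        Parser parser = Parser.htmlParser().setTrackPosition(true);
        Document doc = parser.parseInput(html, "");
        Element el = doc.selectFirst(cssQuery);
        if (el == null || !el.sourceRange().isTracked()) {
            return -1;
        }
        return el.sourceRange().start().lineNumber();
    }

    public static void main(String[] args) {
        // The stray </div> has no matching start tag, so it is reported as an error
        System.out.println(countErrors("<p>One</p></div>", 10) > 0);
        System.out.println(startLineOf("<p>One</p>\n<p id=two>Two</p>", "#two"));
    }
}
```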

Character Encoding

When no character set is specified (charsetName is null), jsoup detects the encoding from:

Byte-order mark (BOM) in the input
<meta charset> declaration in HTML
http-equiv meta tag with charset information
UTF-8 fallback (if no encoding detected)

Encoding Priority:

BOM detection (a BOM overrides any other charset information)
Explicitly specified encoding parameter
HTML meta declarations
UTF-8 default

Detection depends on the BOM or meta declaration being present and correct, so pass an explicit charset when you know the encoding and the input carries no such hint.
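
As an illustration of meta-based detection, this sketch parses raw bytes with a null charset and reports which encoding jsoup settled on through `Document.charset()`. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing charset auto-detection from raw bytes.
class CharsetDetectionExample {
    /** Parses the bytes with no charset specified and returns the detected charset name. */
    static String detectedCharset(byte[] htmlBytes) {
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            Document doc = Jsoup.parse(in, null, ""); // null requests auto-detection
            return doc.charset().name();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>"
            .getBytes(StandardCharsets.ISO_8859_1);
        byte[] noHint = "<p>plain</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectedCharset(latin1));
        System.out.println(detectedCharset(noHint));
    }
}
```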

Install with Tessl CLI

npx tessl i tessl/maven-org-jsoup--jsoup
