tessl/npm-htmlparser2

Fast & forgiving HTML/XML parser with callback-based interface and DOM generation capabilities

htmlparser2

htmlparser2 is a fast and forgiving HTML/XML parser that provides both low-level callback-based parsing and high-level DOM generation. It is designed for speed and low memory allocation, accepts streaming input, tolerates malformed HTML, and can parse RSS, RDF, and Atom feeds.

Package Information

  • Package Name: htmlparser2
  • Package Type: npm
  • Language: TypeScript
  • Installation: npm install htmlparser2

Core Imports

import * as htmlparser2 from "htmlparser2";
import { Parser, parseDocument, parseFeed } from "htmlparser2";

For CommonJS:

const htmlparser2 = require("htmlparser2");
const { Parser, parseDocument, parseFeed } = require("htmlparser2");

WritableStream is exported from a separate entry point, not from the main module:

import { WritableStream } from "htmlparser2/WritableStream";

Basic Usage

import { parseDocument, Parser } from "htmlparser2";

// DOM parsing - parse complete HTML to DOM tree
const document = parseDocument("<div>Hello <b>world</b>!</div>");
console.log(document.children[0].children[1].children[0].data); // "world"

// Callback-based parsing - for minimal memory usage
const parser = new Parser({
  onopentag(name, attributes) {
    if (name === "script" && attributes.type === "text/javascript") {
      console.log("Found JavaScript!");
    }
  },
  ontext(text) {
    console.log("Text:", text);
  },
  onclosetag(tagname) {
    console.log("Closed:", tagname);
  }
});

parser.write("Xyz <script type='text/javascript'>const foo = 'bar';</script>");
parser.end();

Architecture

htmlparser2 is built around several key components:

  • Tokenizer: Low-level HTML/XML tokenization with state machine parsing
  • Parser: High-level parser that uses Tokenizer and fires callback events
  • Handler Interface: Standardized callback interface for parsing events
  • DOM Integration: Seamless integration with domhandler for DOM tree construction
  • Stream Support: WritableStream wrapper for Node.js streaming workflows
  • Feed Processing: Specialized support for RSS/Atom feed parsing
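
As a rough illustration of how these components fit together, the sketch below wires a Parser to a DomHandler by hand, which is approximately what parseDocument does in a single call. The sample markup and the variable names are illustrative only.

```typescript
import { Parser, DomHandler, parseDocument } from "htmlparser2";

const html = "<ul><li>one</li><li>two</li></ul>";

// DomHandler implements the Handler callbacks and builds a DOM tree from them.
const handler = new DomHandler();

// Parser drives the Tokenizer and forwards each parse event to the handler.
const manualParser = new Parser(handler);
manualParser.end(html);

// handler.root is the Document node holding the parsed tree.
const manualRoot = handler.root;

// parseDocument performs the same wiring for you.
const convenienceRoot = parseDocument(html);

console.log(manualRoot.children.length, convenienceRoot.children.length);
```

Wiring the pieces manually is mainly useful when you want to swap in a different handler; for ordinary DOM construction the helper function is simpler.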

Capabilities

DOM Parsing

High-level functions that parse HTML/XML into DOM trees built by domhandler. Well suited to scraping, template processing, and document analysis.

function parseDocument(data: string, options?: Options): Document;
/** @deprecated Use parseDocument instead */
function parseDOM(data: string, options?: Options): ChildNode[];
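
A minimal sketch of querying a parsed tree, assuming the helpers re-exported under the DomUtils namespace (getElementsByTagName and textContent from the domutils package). The markup and variable names are made up for the example.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

const page = parseDocument(
  '<nav><a href="/docs">Docs</a> <a href="/blog">Blog</a></nav>',
);

// Collect every <a> element below the document root (the search is recursive by default).
const anchors = DomUtils.getElementsByTagName("a", page.children);

// Pair each link target with its visible text.
const links = anchors.map((a) => ({
  href: a.attribs.href,
  text: DomUtils.textContent(a),
}));

console.log(links);
// [ { href: "/docs", text: "Docs" }, { href: "/blog", text: "Blog" } ]
```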

Callback-Based Parsing

Low-level Parser class with callback interface for memory-efficient streaming parsing. Ideal for large documents and real-time processing.

class Parser {
  constructor(cbs?: Partial<Handler> | null, options?: ParserOptions);
  write(chunk: string): void;
  end(chunk?: string): void;
}

interface Handler {
  onopentag(name: string, attribs: { [s: string]: string }, isImplied: boolean): void;
  ontext(data: string): void;
  onclosetag(name: string, isImplied: boolean): void;
  oncomment(data: string): void;
  // ... additional callback methods
}
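
The sketch below shows one way to use the callback interface incrementally: input arrives in two chunks, with the boundary deliberately placed inside an attribute value, and only the href values are retained. The URLs and names are placeholders.

```typescript
import { Parser } from "htmlparser2";

const hrefs: string[] = [];

const linkParser = new Parser({
  // Fired once per opening tag, after all of its attributes have been read.
  onopentag(name, attribs) {
    if (name === "a" && attribs.href !== undefined) {
      hrefs.push(attribs.href);
    }
  },
});

// The chunk boundary falls in the middle of the first href value.
linkParser.write('<p>See <a href="https://exam');
linkParser.write('ple.com/a">A</a> and <a href="/b">B</a></p>');
linkParser.end();

console.log(hrefs); // [ "https://example.com/a", "/b" ]
```

Because no tree is built, memory use stays roughly proportional to what the callbacks choose to keep rather than to the size of the document.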

Stream Processing

WritableStream integration for Node.js streams, enabling pipeline processing and integration with other stream-based tools.

class WritableStream extends Writable {
  constructor(cbs: Partial<Handler>, options?: ParserOptions);
}
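
A small sketch of pipeline usage, assuming Node.js and the separate "htmlparser2/WritableStream" entry point. The collectText helper is hypothetical and exists only for this example; in practice the source would more likely be a file or HTTP response stream.

```typescript
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { WritableStream } from "htmlparser2/WritableStream";

// Hypothetical helper: stream chunks of markup through the parser and
// return the concatenated text content.
async function collectText(chunks: string[]): Promise<string> {
  const pieces: string[] = [];

  const sink = new WritableStream({
    // Text may be reported in several pieces, so accumulate and join.
    ontext(text) {
      pieces.push(text);
    },
  });

  // pipeline resolves once the parser stream has finished.
  await pipeline(Readable.from(chunks), sink);
  return pieces.join("");
}

collectText(["<p>Hello, ", "<b>stream</b>", "!</p>"]).then((text) => {
  console.log(text); // "Hello, stream!"
});
```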

Feed Parsing

Specialized functionality for parsing RSS, RDF, and Atom feeds with automatic feed detection and structured data extraction.

function parseFeed(feed: string, options?: Options): Feed | null;
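
For illustration, a minimal sketch of parsing a hand-written RSS document. The feed content is invented for the example, and the null check covers input in which no feed root element is found.

```typescript
import { parseFeed } from "htmlparser2";

const rss = `<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
  </channel>
</rss>`;

// parseFeed parses in XML mode by default and returns null when the
// input does not contain a recognizable feed.
const feed = parseFeed(rss);

if (feed === null) {
  console.log("No feed found");
} else {
  console.log(feed.type, feed.title); // rss Example Feed
  for (const item of feed.items) {
    console.log(item.title, item.link);
  }
}
```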

Low-Level Tokenization

Direct access to the underlying tokenizer for custom parsing implementations and advanced use cases.

class Tokenizer {
  constructor(options: ParserOptions, cbs: Callbacks);
  write(chunk: string): void;
  end(): void;
}

Common Types

interface Options extends ParserOptions, DomHandlerOptions {}

interface DomHandlerOptions {
  /** Add a startIndex property to each node */
  withStartIndices?: boolean;
  /** Add an endIndex property to each node */
  withEndIndices?: boolean;
  /** Treat the markup as XML */
  xmlMode?: boolean;
}

interface ParserOptions {
  /** Enable XML parsing mode for feeds and XML documents */
  xmlMode?: boolean;
  /** Decode HTML entities in text content */
  decodeEntities?: boolean;
  /** Convert tag names to lowercase */
  lowerCaseTags?: boolean;
  /** Convert attribute names to lowercase */
  lowerCaseAttributeNames?: boolean;
  /** Recognize CDATA sections even in HTML mode */
  recognizeCDATA?: boolean;
  /** Recognize self-closing tags even in HTML mode */
  recognizeSelfClosing?: boolean;
  /** Custom tokenizer class to use */
  Tokenizer?: typeof Tokenizer;
}

// DOM types (from domhandler dependency)
interface Document extends Node {
  children: ChildNode[];
}

interface Element extends Node {
  name: string;
  attribs: { [name: string]: string };
  children: ChildNode[];
}

interface Text extends Node {
  type: "text";
  data: string;
}

interface Comment extends Node {
  type: "comment";
  data: string;
}

interface ProcessingInstruction extends Node {
  type: "directive";
  name: string;
  data: string;
}

type ChildNode = Element | Text | Comment | ProcessingInstruction;

// DOM Handler classes
class DomHandler {
  constructor(callback?: (error: Error | null, dom: ChildNode[]) => void, options?: DomHandlerOptions, elementCallback?: (element: Element) => void);
  root: Document;
}

/** @deprecated Use DomHandler instead */
const DefaultHandler = DomHandler;

// Feed types (from domutils dependency)  
interface Feed {
  type: string;
  title?: string;
  link?: string;
  description?: string;
  items: FeedItem[];
}

// Namespace exports
namespace ElementType {
  const Text: string;
  const Directive: string;
  const Comment: string;
  const Script: string;
  const Style: string;
  const Tag: string;
  const CDATA: string;
  const Doctype: string;
}

namespace DomUtils {
  function getFeed(dom: ChildNode[]): Feed | null;
  // Additional DOM manipulation utilities from domutils package
}
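
To make the effect of the options more concrete, the sketch below parses the same markup in HTML mode and in XML mode and compares tag and attribute name casing. It assumes the documented defaults, under which lowerCaseTags and lowerCaseAttributeNames are enabled unless xmlMode is set. The describeFirstElement helper is hypothetical.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

const sample = "<Item ID='1'>x</Item>";

// Hypothetical helper: parse the sample and report the first element's
// tag name and attribute names as the parser stored them.
function describeFirstElement(options?: { xmlMode?: boolean }) {
  const doc = parseDocument(sample, options);
  // A predicate that accepts every tag name; stop after the first match.
  const first = DomUtils.getElementsByTagName(() => true, doc.children, true, 1)[0];
  return { name: first.name, attributeNames: Object.keys(first.attribs) };
}

const htmlResult = describeFirstElement();
const xmlResult = describeFirstElement({ xmlMode: true });

console.log(htmlResult); // { name: "item", attributeNames: [ "id" ] }
console.log(xmlResult); // { name: "Item", attributeNames: [ "ID" ] }
```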

Describes: npmpkg:npm/htmlparser2@10.0.x