Fast & forgiving HTML/XML parser with callback-based interface and DOM generation capabilities
npx @tessl/cli install tessl/npm-htmlparser2@10.0.0
# htmlparser2

htmlparser2 is a fast and forgiving HTML/XML parser that provides both low-level callback-based parsing and high-level DOM generation. It is designed for high throughput with few memory allocations, and it supports streaming input, malformed HTML, and RSS/Atom feed parsing.

## Package Information

- **Package Name**: htmlparser2
- **Package Type**: npm
- **Language**: TypeScript
- **Installation**: `npm install htmlparser2`

## Core Imports

```typescript
import * as htmlparser2 from "htmlparser2";
import { Parser, parseDocument, parseFeed } from "htmlparser2";
```

For CommonJS:

```javascript
const htmlparser2 = require("htmlparser2");
const { Parser, parseDocument, parseFeed } = require("htmlparser2");
```

`WritableStream` is a separate export and is imported from its own entry point:

```typescript
import { WritableStream } from "htmlparser2/WritableStream";
```

## Basic Usage

```typescript
import { parseDocument, Parser } from "htmlparser2";

// DOM parsing: parse a complete HTML string into a DOM tree
const document = parseDocument("<div>Hello <b>world</b>!</div>");
// Plain property access is shown for brevity; under strict TypeScript,
// narrow each child to an element or text node first.
console.log(document.children[0].children[1].children[0].data); // "world"

// Callback-based parsing: handle events as they occur, for minimal memory usage
const parser = new Parser({
  onopentag(name, attributes) {
    if (name === "script" && attributes.type === "text/javascript") {
      console.log("Found JavaScript!");
    }
  },
  ontext(text) {
    console.log("Text:", text);
  },
  onclosetag(tagname) {
    console.log("Closed:", tagname);
  }
});

parser.write("Xyz <script type='text/javascript'>const foo = 'bar';</script>");
parser.end();
```

## Architecture

htmlparser2 is built around several key components:

- **Tokenizer**: Low-level HTML/XML tokenization with state machine parsing
- **Parser**: High-level parser that uses Tokenizer and fires callback events
- **Handler Interface**: Standardized callback interface for parsing events
- **DOM Integration**: Seamless integration with domhandler for DOM tree construction
- **Stream Support**: WritableStream wrapper for Node.js streaming workflows
- **Feed Processing**: Specialized support for RSS/Atom feed parsing

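
To make the relationship between these components concrete, the sketch below wires a `Parser` to a `DomHandler` by hand, which is roughly the plumbing that `parseDocument` hides. The input markup and the variable names are illustrative assumptions, not part of the package.

```typescript
import { Parser, DomHandler, DomUtils } from "htmlparser2";

// The handler implements the callback interface and builds a DOM tree.
const handler = new DomHandler();

// The parser drives the tokenizer and forwards its events to the handler.
const parser = new Parser(handler);
parser.write("<ul><li>one</li>");
parser.write("<li>two</li></ul>"); // input may arrive in several chunks
parser.end();

// After end(), the finished tree is available on handler.root.
const items = DomUtils.getElementsByTagName("li", handler.root);
const itemTexts = items.map((li) => DomUtils.textContent(li));
console.log(itemTexts); // ["one", "two"]
```

Any object that implements the same callbacks can take the place of `DomHandler`, which is why the callback interface is listed as its own component.
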
## Capabilities

### DOM Parsing

High-level functions for parsing HTML/XML into DOM trees using domhandler. Well suited to scraping, template processing, and document analysis.

```typescript { .api }
function parseDocument(data: string, options?: Options): Document;
/** @deprecated Use parseDocument instead */
function parseDOM(data: string, options?: Options): ChildNode[];
```

[DOM Parsing](./dom-parsing.md)

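
As a small scraping-style illustration, the following sketch parses a fragment with `parseDocument` and collects link targets using helpers from the `DomUtils` namespace. The markup and the shape of the `links` result are assumptions made for the example.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

const html = `
  <nav>
    <a href="/docs">Docs</a>
    <a href="/blog">Blog</a>
    <a>No href</a>
  </nav>`;

const doc = parseDocument(html);

// Find every <a> element, then keep only those that carry an href attribute.
const links = DomUtils.getElementsByTagName("a", doc)
  .filter((a) => a.attribs.href !== undefined)
  .map((a) => ({ href: a.attribs.href, text: DomUtils.textContent(a) }));

console.log(links);
// [ { href: "/docs", text: "Docs" }, { href: "/blog", text: "Blog" } ]
```
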
### Callback-Based Parsing

Low-level Parser class with callback interface for memory-efficient streaming parsing. Ideal for large documents and real-time processing.

```typescript { .api }
class Parser {
  constructor(cbs?: Partial<Handler> | null, options?: ParserOptions);
  write(chunk: string): void;
  end(chunk?: string): void;
}

interface Handler {
  onopentag(name: string, attribs: { [s: string]: string }, isImplied: boolean): void;
  ontext(data: string): void;
  onclosetag(name: string, isImplied: boolean): void;
  oncomment(data: string): void;
  // ... additional callback methods
}
```

[Callback-Based Parsing](./callback-parsing.md)

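
The sketch below shows one way to keep state across callbacks while input arrives in pieces: it collects `href` values and the text of an `<h1>` without building a DOM. The variable names and the chunk boundaries are assumptions chosen to show that markup can be split at arbitrary points.

```typescript
import { Parser } from "htmlparser2";

const hrefs: string[] = [];
let insideHeading = false;
let headingText = "";

const linkParser = new Parser({
  onopentag(name, attribs) {
    if (name === "h1") insideHeading = true;
    if (name === "a" && attribs.href) hrefs.push(attribs.href);
  },
  ontext(text) {
    // Text may be delivered in several pieces, so append rather than assign.
    if (insideHeading) headingText += text;
  },
  onclosetag(name) {
    if (name === "h1") insideHeading = false;
  },
});

// Chunks can split the markup anywhere, even in the middle of a tag.
linkParser.write("<h1>Exam");
linkParser.write("ple</h1><a hre");
linkParser.write('f="/a">A</a><a href="/b">B</a>');
linkParser.end();

console.log(headingText); // "Example"
console.log(hrefs); // ["/a", "/b"]
```
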
### Stream Processing

WritableStream integration for Node.js streams, enabling pipeline processing and integration with other stream-based tools.

```typescript { .api }
class WritableStream extends Writable {
  constructor(cbs: Partial<Handler>, options?: ParserOptions);
}
```

[Stream Processing](./stream-processing.md)

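
A minimal sketch of pipeline use follows, assuming a Node.js version that provides `node:stream/promises`. The helper `countOpenTags` is hypothetical and exists only for this example; in practice the source would typically be a file or HTTP response stream rather than an in-memory array.

```typescript
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { WritableStream } from "htmlparser2/WritableStream";

// Hypothetical helper: counts opening tags in a stream of HTML chunks.
async function countOpenTags(source: Readable): Promise<number> {
  let count = 0;
  const sink = new WritableStream({
    onopentag() {
      count += 1;
    },
  });
  // pipeline() resolves once the sink has consumed and flushed all input.
  await pipeline(source, sink);
  return count;
}

const chunks = Readable.from(["<main><p>Hel", "lo</p><p>World</p>", "</main>"]);
const tagCount = countOpenTags(chunks);
tagCount.then((n) => console.log(`open tags: ${n}`)); // open tags: 3
```
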
### Feed Parsing

Specialized functionality for parsing RSS, RDF, and Atom feeds with automatic feed detection and structured data extraction.

```typescript { .api }
function parseFeed(feed: string, options?: Options): Feed | null;
```

[Feed Parsing](./feed-parsing.md)

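
The following sketch runs `parseFeed` on a tiny hand-written RSS document and shows the `null` result for input that is not a feed. The feed content and URLs are made up for illustration.

```typescript
import { parseFeed } from "htmlparser2";

const rss = `<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
  </channel>
</rss>`;

const feed = parseFeed(rss);

// parseFeed returns null when no feed root element is found.
if (feed === null) {
  throw new Error("not a recognizable feed");
}
console.log(feed.type); // "rss"
console.log(feed.title); // "Example Feed"
console.log(feed.items.map((item) => item.title)); // ["First post"]

const notAFeed = parseFeed("<html><body>plain page</body></html>");
console.log(notAFeed); // null
```
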
### Low-Level Tokenization

Direct access to the underlying tokenizer for custom parsing implementations and advanced use cases.

```typescript { .api }
class Tokenizer {
  constructor(options: ParserOptions, cbs: Callbacks);
  write(chunk: string): void;
  end(chunk?: string): void;
}
```

[Low-Level Tokenization](./tokenization.md)

## Common Types

```typescript { .api }
interface Options extends ParserOptions, DomHandlerOptions {}

interface DomHandlerOptions {
  /** Include location information for nodes */
  withStartIndices?: boolean;
  /** Include end location information for nodes */
  withEndIndices?: boolean;
  /** Normalize whitespace in text content */
  normalizeWhitespace?: boolean;
}

interface ParserOptions {
  /** Enable XML parsing mode for feeds and XML documents */
  xmlMode?: boolean;
  /** Decode HTML entities in text content */
  decodeEntities?: boolean;
  /** Convert tag names to lowercase */
  lowerCaseTags?: boolean;
  /** Convert attribute names to lowercase */
  lowerCaseAttributeNames?: boolean;
  /** Recognize CDATA sections even in HTML mode */
  recognizeCDATA?: boolean;
  /** Recognize self-closing tags even in HTML mode */
  recognizeSelfClosing?: boolean;
  /** Custom tokenizer class to use */
  Tokenizer?: typeof Tokenizer;
}

// DOM types (from domhandler dependency)
interface Document extends Node {
  children: ChildNode[];
}

interface Element extends Node {
  name: string;
  attribs: { [name: string]: string };
  children: ChildNode[];
}

interface Text extends Node {
  type: "text";
  data: string;
}

interface Comment extends Node {
  type: "comment";
  data: string;
}

interface ProcessingInstruction extends Node {
  type: "directive";
  name: string;
  data: string;
}

type ChildNode = Element | Text | Comment | ProcessingInstruction;

// DOM Handler classes
class DomHandler {
  constructor(
    callback?: (error: Error | null, dom: ChildNode[]) => void,
    options?: DomHandlerOptions,
    elementCallback?: (element: Element) => void
  );
  root: Document;
}

/** @deprecated Use DomHandler instead */
const DefaultHandler = DomHandler;

// Feed types (from domutils dependency)
interface Feed {
  type: string;
  title?: string;
  link?: string;
  description?: string;
  items: FeedItem[];
}

// Namespace exports
namespace ElementType {
  const Text: string;
  const Directive: string;
  const Comment: string;
  const Script: string;
  const Style: string;
  const Tag: string;
  const CDATA: string;
  const Doctype: string;
}

namespace DomUtils {
  function getFeed(dom: ChildNode[]): Feed | null;
  // Additional DOM manipulation utilities from domutils package
}
```
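
To show how a few of these options and types interact, the sketch below parses the same string in HTML mode and in XML mode, narrows the first child to an element with `DomUtils.isTag`, and checks node kinds against `ElementType` constants. The sample markup is an assumption, and the sketch relies on the defaults that tag names are lowercased in HTML mode and preserved in XML mode.

```typescript
import { parseDocument, DomUtils, ElementType } from "htmlparser2";

const source = "<Item>Hi<!-- note --></Item>";

// HTML mode (the default) lowercases tag names; XML mode preserves case.
const htmlRoot = parseDocument(source).children[0];
const xmlRoot = parseDocument(source, {
  xmlMode: true,
  withStartIndices: true, // record each node's offset in the source
}).children[0];

// Narrow ChildNode to Element before reading element-only fields.
if (!DomUtils.isTag(htmlRoot) || !DomUtils.isTag(xmlRoot)) {
  throw new Error("expected an element as the first child");
}

console.log(htmlRoot.name); // "item"
console.log(xmlRoot.name); // "Item"

// ElementType constants identify node kinds without magic strings.
const kinds = xmlRoot.children.map((child) => child.type);
console.log(kinds[0] === ElementType.Text); // true
console.log(kinds[1] === ElementType.Comment); // true

// startIndex is populated only because withStartIndices was set.
console.log(xmlRoot.children[0].startIndex); // 6
```
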