tessl/npm-htmlparser2

Fast & forgiving HTML/XML parser with callback-based interface and DOM generation capabilities

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: npmpkg:npm/htmlparser2@10.0.x

To install, run

npx @tessl/cli install tessl/npm-htmlparser2@10.0.0

# htmlparser2

htmlparser2 is a fast, forgiving HTML/XML parser that offers both low-level callback-based parsing and high-level DOM generation. It is designed for high throughput with few memory allocations, and it supports streaming input, malformed HTML, and RSS, RDF, and Atom feeds.

## Package Information

- **Package Name**: htmlparser2
- **Package Type**: npm
- **Language**: TypeScript
- **Installation**: `npm install htmlparser2`

## Core Imports

```typescript
import * as htmlparser2 from "htmlparser2";
import { Parser, parseDocument, parseFeed } from "htmlparser2";
```

For CommonJS:

```javascript
const htmlparser2 = require("htmlparser2");
const { Parser, parseDocument, parseFeed } = require("htmlparser2");
```

For WritableStream (separate export):

```typescript
import { WritableStream } from "htmlparser2/WritableStream";
```

## Basic Usage

```typescript
import { parseDocument, Parser } from "htmlparser2";

// DOM parsing - parse complete HTML to DOM tree
const document = parseDocument("<div>Hello <b>world</b>!</div>");
console.log(document.children[0].children[1].children[0].data); // "world"

// Callback-based parsing - for minimal memory usage
const parser = new Parser({
  onopentag(name, attributes) {
    if (name === "script" && attributes.type === "text/javascript") {
      console.log("Found JavaScript!");
    }
  },
  ontext(text) {
    console.log("Text:", text);
  },
  onclosetag(tagname) {
    console.log("Closed:", tagname);
  }
});

parser.write("Xyz <script type='text/javascript'>const foo = 'bar';</script>");
parser.end();
```

## Architecture

htmlparser2 is built around several key components:

- **Tokenizer**: Low-level HTML/XML tokenization with state machine parsing
- **Parser**: High-level parser that uses Tokenizer and fires callback events
- **Handler Interface**: Standardized callback interface for parsing events
- **DOM Integration**: Seamless integration with domhandler for DOM tree construction
- **Stream Support**: WritableStream wrapper for Node.js streaming workflows
- **Feed Processing**: Specialized support for RSS/Atom feed parsing
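
As a rough illustration of how these components fit together, the sketch below wires a `DomHandler` into a `Parser` by hand, which is approximately what the high-level DOM functions do for you. The variable names (`handler`, `domParser`, `topLevelCount`) and the sample markup are made up for this example.

```typescript
import { Parser, DomHandler } from "htmlparser2";

// The handler implements the callback interface and builds a domhandler tree.
const handler = new DomHandler((error, dom) => {
  if (error) throw error;
  // `dom` holds the top-level nodes once the parser has ended.
  console.log("top-level nodes:", dom.length);
});

// The Parser runs the Tokenizer internally and forwards events to the handler.
const domParser = new Parser(handler);
domParser.write("<ul><li>one</li>");
domParser.write("<li>two</li></ul>");
domParser.end();

// handler.root is the Document wrapping everything that was parsed.
const topLevelCount = handler.root.children.length;
```

Building the tree this way is mainly useful when input arrives in chunks; for a complete string, `parseDocument` is the shorter route.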

## Capabilities

### DOM Parsing

High-level functions for parsing HTML/XML into DOM trees using domhandler. Perfect for scraping, template processing, and document analysis.

```typescript { .api }
function parseDocument(data: string, options?: Options): Document;
/** @deprecated Use parseDocument instead */
function parseDOM(data: string, options?: Options): ChildNode[];
```

[DOM Parsing](./dom-parsing.md)
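
For a scraping-style task, a minimal sketch might combine `parseDocument` with helpers from the `DomUtils` namespace. The markup and the `links` variable are illustrative only; `getElementsByTagName` and `textContent` come from the domutils package that `DomUtils` re-exports.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

const html =
  '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';
const doc = parseDocument(html);

// Collect every <a> element anywhere in the tree (recursive search).
const anchors = DomUtils.getElementsByTagName("a", doc.children, true);

// Reduce each element to the two fields we care about.
const links = anchors.map((a) => ({
  href: a.attribs.href,
  text: DomUtils.textContent(a),
}));

console.log(links);
```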

### Callback-Based Parsing

Low-level Parser class with callback interface for memory-efficient streaming parsing. Ideal for large documents and real-time processing.

```typescript { .api }
class Parser {
  constructor(cbs?: Partial<Handler> | null, options?: ParserOptions);
  write(chunk: string): void;
  end(chunk?: string): void;
}

interface Handler {
  onopentag(name: string, attribs: { [s: string]: string }, isImplied: boolean): void;
  ontext(data: string): void;
  onclosetag(name: string, isImplied: boolean): void;
  oncomment(data: string): void;
  // ... additional callback methods
}
```

[Callback-Based Parsing](./callback-parsing.md)
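
The sketch below shows one way the callback interface could be used on input that arrives in pieces: it collects `href` values without building a tree. The chunk boundaries, URL, and variable names (`hrefs`, `textLength`, `linkParser`) are invented for the example.

```typescript
import { Parser } from "htmlparser2";

const hrefs: string[] = [];
let textLength = 0;

const linkParser = new Parser({
  onopentag(name, attribs) {
    // Attributes are delivered as a plain object once the open tag is complete.
    if (name === "a" && attribs.href) hrefs.push(attribs.href);
  },
  ontext(data) {
    // Text may arrive in several pieces, so accumulate instead of assuming one call.
    textLength += data.length;
  },
});

// Feed the input in arbitrary chunks; the first chunk ends in the middle of a tag.
const chunks = [
  "<p>See <a hr",
  'ef="https://example.com/docs">the docs</a>',
  " for details.</p>",
];
for (const chunk of chunks) {
  linkParser.write(chunk);
}
linkParser.end();

console.log(hrefs, textLength);
```

Only the callbacks that are supplied get invoked, so a handler can stay as small as the task requires.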

### Stream Processing

WritableStream integration for Node.js streams, enabling pipeline processing and integration with other stream-based tools.

```typescript { .api }
class WritableStream extends Writable {
  constructor(cbs: Partial<Handler>, options?: ParserOptions);
}
```

[Stream Processing](./stream-processing.md)
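
As a hedged sketch of pipeline use, the example below pipes a Node.js `Readable` into a `WritableStream` and gathers `<h1>` text. The `collectHeadings` helper and the in-memory source are hypothetical; in practice the source would more likely be a file or HTTP response stream. It assumes a Node.js version that provides `node:stream/promises`.

```typescript
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { WritableStream } from "htmlparser2/WritableStream";

// Hypothetical helper: resolves with the text of every <h1> in the stream.
async function collectHeadings(source: Readable): Promise<string[]> {
  const headings: string[] = [];
  let inHeading = false;

  const parserStream = new WritableStream({
    onopentag(name) {
      if (name === "h1") {
        inHeading = true;
        headings.push("");
      }
    },
    ontext(text) {
      // Append, because heading text can be split across chunks.
      if (inHeading) headings[headings.length - 1] += text;
    },
    onclosetag(name) {
      if (name === "h1") inHeading = false;
    },
  });

  // pipeline resolves once the parser stream has consumed all input.
  await pipeline(source, parserStream);
  return headings;
}

const headingsPromise = collectHeadings(
  Readable.from(["<h1>Intro</h1><p>body</p><h1>Det", "ails</h1>"]),
);
headingsPromise.then((headings) => console.log(headings));
```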

### Feed Parsing

Specialized functionality for parsing RSS, RDF, and Atom feeds with automatic feed detection and structured data extraction.

```typescript { .api }
function parseFeed(feed: string, options?: Options): Feed | null;
```

[Feed Parsing](./feed-parsing.md)
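
A small sketch of feed parsing, using a hand-written RSS document as sample input. The feed content is made up, and the `null` branch covers input that is not recognised as a feed.

```typescript
import { parseFeed } from "htmlparser2";

const rss = `<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <description>Sample channel</description>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
  </channel>
</rss>`;

const feed = parseFeed(rss);

if (feed === null) {
  console.log("Input was not recognised as a feed");
} else {
  console.log(feed.type, feed.title);
  for (const item of feed.items) {
    console.log(item.title, item.link);
  }
}
```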

### Low-Level Tokenization

Direct access to the underlying tokenizer for custom parsing implementations and advanced use cases.

```typescript { .api }
class Tokenizer {
  constructor(options: ParserOptions, cbs: Callbacks);
  write(chunk: string): void;
  end(chunk?: string): void;
}
```

[Low-Level Tokenization](./tokenization.md)

## Common Types

```typescript { .api }
interface Options extends ParserOptions, DomHandlerOptions {}

interface DomHandlerOptions {
  /** Include location information for nodes */
  withStartIndices?: boolean;
  /** Include end location information for nodes */
  withEndIndices?: boolean;
  /** Normalize whitespace in text content */
  normalizeWhitespace?: boolean;
}

interface ParserOptions {
  /** Enable XML parsing mode for feeds and XML documents */
  xmlMode?: boolean;
  /** Decode HTML entities in text content */
  decodeEntities?: boolean;
  /** Convert tag names to lowercase */
  lowerCaseTags?: boolean;
  /** Convert attribute names to lowercase */
  lowerCaseAttributeNames?: boolean;
  /** Recognize CDATA sections even in HTML mode */
  recognizeCDATA?: boolean;
  /** Recognize self-closing tags even in HTML mode */
  recognizeSelfClosing?: boolean;
  /** Custom tokenizer class to use */
  Tokenizer?: typeof Tokenizer;
}

// DOM types (from domhandler dependency)
interface Document extends Node {
  children: ChildNode[];
}

interface Element extends Node {
  name: string;
  attribs: { [name: string]: string };
  children: ChildNode[];
}

interface Text extends Node {
  type: "text";
  data: string;
}

interface Comment extends Node {
  type: "comment";
  data: string;
}

interface ProcessingInstruction extends Node {
  type: "directive";
  name: string;
  data: string;
}

type ChildNode = Element | Text | Comment | ProcessingInstruction;

// DOM Handler classes
class DomHandler {
  constructor(
    callback?: (error: Error | null, dom: ChildNode[]) => void,
    options?: DomHandlerOptions,
    elementCallback?: (element: Element) => void
  );
  root: Document;
}

/** @deprecated Use DomHandler instead */
const DefaultHandler = DomHandler;

// Feed types (from domutils dependency)
interface Feed {
  type: string;
  title?: string;
  link?: string;
  description?: string;
  items: FeedItem[];
}

// Namespace exports
namespace ElementType {
  const Text: string;
  const Directive: string;
  const Comment: string;
  const Script: string;
  const Style: string;
  const Tag: string;
  const CDATA: string;
  const Doctype: string;
}

namespace DomUtils {
  function getFeed(dom: ChildNode[]): Feed | null;
  // Additional DOM manipulation utilities from domutils package
}
```
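
To make the effect of these options concrete, the sketch below parses the same input in HTML mode and in XML mode. The sample markup and variable names are illustrative; `DomUtils.findOne` is used here only as a convenient way to get the first element with a narrowed type.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

const source = "<Note><To>Ada</To></Note>";

// Default HTML mode: tag names are lowercased.
const asHtml = parseDocument(source);

// XML mode keeps the original case; the index options add position data to nodes.
const asXml = parseDocument(source, {
  xmlMode: true,
  withStartIndices: true,
  withEndIndices: true,
});

// findOne returns the first element matching the test, or null.
const htmlRoot = DomUtils.findOne(() => true, asHtml.children);
const xmlRoot = DomUtils.findOne(() => true, asXml.children);

console.log(htmlRoot?.name);
console.log(xmlRoot?.name, xmlRoot?.startIndex, xmlRoot?.endIndex);
```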