# HTML Parsing

Core HTML parsing functionality that converts HTML strings into abstract syntax trees. Parse5 implements the WHATWG HTML Living Standard parsing algorithm and handles malformed HTML gracefully.

## Capabilities

### Document Parsing

Parses a complete HTML document string into a document AST node.

```typescript { .api }
/**
 * Parses an HTML string into a complete document AST
 * @param html - Input HTML string to parse
 * @param options - Optional parsing configuration
 * @returns Document AST node representing the parsed HTML
 */
function parse<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  html: string,
  options?: ParserOptions<T>
): T['document'];
```

**Usage Examples:**

```typescript
import { parse } from "parse5";

// Parse a complete HTML document
const document = parse('<!DOCTYPE html><html><head><title>Test</title></head><body><h1>Hello World</h1></body></html>');

// Access document structure (nodeName is present on every node type)
console.log(document.childNodes[0].nodeName); // '#documentType'
console.log(document.childNodes[1].nodeName); // 'html'

// Parse with options
const documentWithLocation = parse('<html><body>Content</body></html>', {
  sourceCodeLocationInfo: true,
  scriptingEnabled: false
});
```
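Index-based access such as `childNodes[1]` depends on whether the input had a doctype, so a small recursive walker is often sturdier. The following is a minimal sketch rather than part of parse5: `TreeNode`, `collectNodes`, and `findFirst` are hypothetical names, and the tree is hand-built to mirror the `nodeName`/`childNodes` fields used above so the sketch runs without the library.

```typescript
// Hypothetical structural type: only the fields this sketch reads.
interface TreeNode {
  nodeName: string;
  childNodes?: TreeNode[];
}

// Depth-first, pre-order list of a node and all of its descendants.
function collectNodes(node: TreeNode, out: TreeNode[] = []): TreeNode[] {
  out.push(node);
  for (const child of node.childNodes ?? []) {
    collectNodes(child, out);
  }
  return out;
}

// Returns the first node with the given nodeName, or undefined if none exists.
function findFirst(root: TreeNode, nodeName: string): TreeNode | undefined {
  for (const node of collectNodes(root)) {
    if (node.nodeName === nodeName) return node;
  }
  return undefined;
}

// Hand-built tree shaped like the parsed document from the example above.
const sampleTree: TreeNode = {
  nodeName: "#document",
  childNodes: [
    { nodeName: "#documentType" },
    {
      nodeName: "html",
      childNodes: [
        { nodeName: "head", childNodes: [{ nodeName: "title", childNodes: [{ nodeName: "#text" }] }] },
        { nodeName: "body", childNodes: [{ nodeName: "h1", childNodes: [{ nodeName: "#text" }] }] },
      ],
    },
  ],
};

const visitedNames = collectNodes(sampleTree).map((n) => n.nodeName);
console.log(visitedNames.join(" > "));
console.log(findFirst(sampleTree, "h1") !== undefined); // true
```

Because the helpers read only `nodeName` and `childNodes`, they should accept a tree returned by `parse()` with the default tree adapter as well, though that pairing is an assumption here and not something the sketch exercises.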

### Fragment Parsing

Parses an HTML fragment, optionally relative to a context element. The context element changes how the fragment is parsed, matching what a browser does when `innerHTML` is assigned on that element. When no context element is given, parse5 uses a `<template>` element as the context.

```typescript { .api }
/**
 * Parses HTML fragment with context element
 * @param fragmentContext - Context element that affects parsing behavior
 * @param html - HTML fragment string to parse
 * @param options - Parsing configuration options
 * @returns DocumentFragment containing parsed nodes
 */
function parseFragment<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  fragmentContext: T['parentNode'] | null,
  html: string,
  options: ParserOptions<T>
): T['documentFragment'];

/**
 * Parses HTML fragment without context element
 * @param html - HTML fragment string to parse
 * @param options - Optional parsing configuration
 * @returns DocumentFragment containing parsed nodes
 */
function parseFragment<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  html: string,
  options?: ParserOptions<T>
): T['documentFragment'];
```

**Usage Examples:**

```typescript
import { parse, parseFragment } from "parse5";
import type { DefaultTreeAdapterMap } from "parse5";

type Element = DefaultTreeAdapterMap["element"];

// Parse fragment without context
const fragment = parseFragment('<div><span>Hello</span><p>World</p></div>');
console.log(fragment.childNodes.length); // 1
console.log(fragment.childNodes[0].nodeName); // 'div'

// Parse fragment with context for accurate parsing.
// The input has no doctype, so <html> is the document's first child.
const document = parse('<table></table>');
const htmlElement = document.childNodes[0] as Element;
const bodyElement = htmlElement.childNodes[1] as Element; // [0] is the implied <head>
const tableElement = bodyElement.childNodes[0] as Element;

const tableRowFragment = parseFragment(
  tableElement,
  '<tr><td>Cell content</td></tr>',
  { sourceCodeLocationInfo: true }
);
// In a <table> context the parser inserts an implied <tbody> around the row
console.log(tableRowFragment.childNodes[0].nodeName); // 'tbody'

// Parse template content (no context element needed)
const templateFragment = parseFragment('<div>Template content</div>');
```

### Advanced Parsing Options

`ParserOptions` controls the scripting flag, source location tracking, the tree adapter, and parse error reporting.

```typescript { .api }
interface ParserOptions<T extends TreeAdapterTypeMap> {
  /**
   * The scripting flag. If set to true, noscript element content
   * will be parsed as text. Defaults to true.
   */
  scriptingEnabled?: boolean;

  /**
   * Enables source code location information. When enabled, each node
   * will have a sourceCodeLocation property with position data.
   * Defaults to false.
   */
  sourceCodeLocationInfo?: boolean;

  /**
   * Specifies the tree adapter to use for creating and manipulating AST nodes.
   * Defaults to the built-in default tree adapter.
   */
  treeAdapter?: TreeAdapter<T>;

  /**
   * Error handling callback function. Called for each parsing error encountered.
   */
  onParseError?: ParserErrorHandler;
}
```

**Usage Examples:**

```typescript
import { parse } from "parse5";
import type { DefaultTreeAdapterMap } from "parse5";

type Element = DefaultTreeAdapterMap["element"];

// Enable location tracking for debugging
const documentWithLocations = parse('<div>Content</div>', {
  sourceCodeLocationInfo: true
});

// Each element parsed from the source has a sourceCodeLocation property.
// No doctype here, so the path is html [0] > body [1] > div [0].
const htmlElement = documentWithLocations.childNodes[0] as Element;
const bodyElement = htmlElement.childNodes[1] as Element;
const divElement = bodyElement.childNodes[0] as Element;
console.log(divElement.sourceCodeLocation);
// Output includes: { startLine: 1, startCol: 1, startOffset: 0, endLine: 1, endCol: 19, endOffset: 18 }
// along with the startTag and endTag locations

// Handle parsing errors
const errors: string[] = [];
const documentWithErrors = parse('<div><span></div>', {
  onParseError: (error) => {
    errors.push(`${error.code} at line ${error.startLine}`);
  }
});
console.log(errors); // e.g. includes 'missing-doctype at line 1'; exact codes depend on the input

// Parse <noscript> content as markup instead of text (parse5 never executes scripts)
const noScriptDocument = parse('<noscript>This content is visible</noscript>', {
  scriptingEnabled: false
});
```

### Parser Class (Advanced)

The `Parser` class that backs `parse()` and `parseFragment()` is also exported. It is an internal API, so prefer the two functions unless you need direct access to the parser instance.

```typescript { .api }
/**
 * Core HTML parser class. Internal API - use parse() and parseFragment() functions instead.
 */
class Parser<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap> {
  /**
   * Static method to parse HTML string into document
   */
  static parse<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
    html: string,
    options?: ParserOptions<T>
  ): T['document'];

  /**
   * Static method to get fragment parser instance
   */
  static getFragmentParser<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
    fragmentContext: T['parentNode'] | null,
    options?: ParserOptions<T>
  ): Parser<T>;

  /**
   * Get parsed fragment from fragment parser
   */
  getFragment(): T['documentFragment'];
}
```

## Common Parsing Patterns

### HTML Document Structure

```typescript
import { parse } from "parse5";

const html = '<!DOCTYPE html><html><head><title>Page</title></head><body><div>Content</div></body></html>';
const document = parse(html);

// Document structure:
// document
// ├── DocumentType node ('#documentType')
// └── Element node ('html')
//     ├── Element node ('head')
//     │   └── Element node ('title')
//     │       └── Text node ('Page')
//     └── Element node ('body')
//         └── Element node ('div')
//             └── Text node ('Content')
```
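
### Fragment Parsing with Context

```typescript
import { parse, parseFragment } from "parse5";
import type { DefaultTreeAdapterMap } from "parse5";

type Element = DefaultTreeAdapterMap["element"];

// Table rows are parsed relative to a <table> context element
const table = parse('<table></table>');
const htmlElement = table.childNodes[0] as Element; // no doctype, so <html> comes first
const bodyElement = htmlElement.childNodes[1] as Element;
const tableElement = bodyElement.childNodes[0] as Element;

// The typed overload that takes a context element also expects an options object
const fragment = parseFragment(tableElement, '<tr><td>Data</td></tr>', {});
// With the <table> context, the row is wrapped in an implied <tbody>, as with innerHTML in a browser
```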

### Error Recovery

Parse5 automatically recovers from many HTML errors:

```typescript
import { parse } from "parse5";

// Missing closing tags
const doc1 = parse('<div><p>Unclosed paragraph<div>Another div</div>');
// The nested <div> start tag implicitly closes the open <p>; the outer <div> is closed at end of input

// Misplaced elements
const doc2 = parse('<html><div>Content before body</div><body>Body content</body></html>');
// The parser creates an implied <head> and <body>, places the div in that body,
// and ignores the later <body> start tag
```

## Source Code Location Tracking

Parse5 provides comprehensive source code location tracking for debugging and development tools. When enabled, each parsed node includes detailed position information about its location in the original HTML source.

### Location Information Interface

```typescript { .api }
/**
 * Basic location information interface
 */
interface Location {
  /** One-based line index of the first character */
  startLine: number;
  /** One-based column index of the first character */
  startCol: number;
  /** Zero-based first character index */
  startOffset: number;
  /** One-based line index of the last character */
  endLine: number;
  /** One-based column index of the last character (after the character) */
  endCol: number;
  /** Zero-based last character index (after the character) */
  endOffset: number;
}

/**
 * Location information with attribute positions
 */
interface LocationWithAttributes extends Location {
  /** Start tag attributes' location info */
  attrs?: Record<string, Location>;
}

/**
 * Element location with start and end tag positions
 */
interface ElementLocation extends LocationWithAttributes {
  /** Element's start tag location info */
  startTag?: Location;
  /** Element's end tag location info (undefined if no closing tag) */
  endTag?: Location;
}
```
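The mix of one-based lines and columns with zero-based, end-exclusive offsets is easy to get wrong. As a rough illustration of how the two relate, the sketch below derives a line and column from an offset. `offsetToLineCol` is a hypothetical helper, not a parse5 function, and it assumes `"\n"` is the only line break in the source.

```typescript
// One-based line/column for a zero-based offset, following the Location conventions above.
// Assumption: only "\n" counts as a line break in this sketch.
function offsetToLineCol(source: string, offset: number): { line: number; col: number } {
  if (offset < 0 || offset > source.length) {
    throw new RangeError(`offset ${offset} is outside 0..${source.length}`);
  }
  let line = 1;
  let lineStart = 0; // offset of the first character of the current line
  for (let i = 0; i < offset; i++) {
    if (source[i] === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, col: offset - lineStart + 1 };
}

const singleLineSource = "<div>Content</div>";
console.log(offsetToLineCol(singleLineSource, 0));                       // { line: 1, col: 1 }
console.log(offsetToLineCol(singleLineSource, singleLineSource.length)); // { line: 1, col: 19 }
```

The second call shows why an 18-character element reports `endOffset: 18` and `endCol: 19`: both values point just past the last character.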

### Enabling Location Tracking

Location tracking is controlled through the `sourceCodeLocationInfo` option in `ParserOptions`:

```typescript
import { parse, parseFragment } from "parse5";

// Enable location tracking for document parsing
const document = parse('<div class="container">Hello <span>World</span></div>', {
  sourceCodeLocationInfo: true
});

// Enable location tracking for fragment parsing
const fragment = parseFragment('<p>Paragraph with <strong>emphasis</strong></p>', {
  sourceCodeLocationInfo: true
});
```

### Using Location Information

When location tracking is enabled, each node includes a `sourceCodeLocation` property:

```typescript
import { parse } from "parse5";
import type { DefaultTreeAdapterMap, Token } from "parse5";

// Assuming parse5 v7, where these types are reached through
// DefaultTreeAdapterMap and the Token namespace
type Element = DefaultTreeAdapterMap["element"];
type ElementLocation = Token.ElementLocation;

const html = `<div class="container">
  <h1>Title</h1>
  <p>Paragraph with <em>emphasis</em></p>
</div>`;

const document = parse(html, { sourceCodeLocationInfo: true });

// Navigate to elements (no doctype, so <html> is the first child)
const htmlElement = document.childNodes[0] as Element;
const bodyElement = htmlElement.childNodes[1] as Element; // [0] is the implied <head>
const divElement = bodyElement.childNodes[0] as Element;

// Access location information
const divLocation = divElement.sourceCodeLocation as ElementLocation;
console.log('Div element location:');
console.log(`  Start: line ${divLocation.startLine}, col ${divLocation.startCol}`);
console.log(`  End: line ${divLocation.endLine}, col ${divLocation.endCol}`);
console.log(`  Offset: ${divLocation.startOffset}-${divLocation.endOffset}`);

// Access start tag location
if (divLocation.startTag) {
  console.log('Start tag location:');
  console.log(`  <div class="container"> at line ${divLocation.startTag.startLine}`);
}

// Access end tag location
if (divLocation.endTag) {
  console.log('End tag location:');
  console.log(`  </div> at line ${divLocation.endTag.startLine}`);
}

// Access attribute locations
if (divLocation.attrs && divLocation.attrs.class) {
  const classLocation = divLocation.attrs.class;
  console.log(`Class attribute at line ${classLocation.startLine}, col ${classLocation.startCol}`);
}
```
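The `startTag` and `endTag` entries also make it possible to recover an element's inner source text, which lies between the end of the start tag and the start of the end tag. This is a minimal sketch with hypothetical names (`OffsetSpan`, `ElementSpan`, `getInnerSource`) that mirror only the offset fields of the interfaces above. It assumes that, when there is no end tag, the element's own `endOffset` is an acceptable upper bound.

```typescript
// Hypothetical subsets of Location and ElementLocation: offsets only.
interface OffsetSpan {
  startOffset: number;
  endOffset: number;
}
interface ElementSpan extends OffsetSpan {
  startTag?: OffsetSpan;
  endTag?: OffsetSpan;
}

// Returns the source text between an element's start and end tags,
// or null when the start tag position is unknown.
function getInnerSource(html: string, loc: ElementSpan): string | null {
  if (!loc.startTag) return null;
  const innerStart = loc.startTag.endOffset;
  // Assumption: without an explicit end tag, fall back to the element's own end.
  const innerEnd = loc.endTag ? loc.endTag.startOffset : loc.endOffset;
  return html.substring(innerStart, innerEnd);
}

// Hand-written offsets for '<div>Content</div>' (start tag 0-5, end tag 12-18).
const innerText = getInnerSource("<div>Content</div>", {
  startOffset: 0,
  endOffset: 18,
  startTag: { startOffset: 0, endOffset: 5 },
  endTag: { startOffset: 12, endOffset: 18 },
});
console.log(innerText); // "Content"
```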

### Location-Based Source Extraction

```typescript
import { parse } from "parse5";
import type { DefaultTreeAdapterMap, Token } from "parse5";

// Assuming parse5 v7, where these types are reached through
// DefaultTreeAdapterMap and the Token namespace
type Element = DefaultTreeAdapterMap["element"];
type ElementLocation = Token.ElementLocation;

class SourceExtractor {
  constructor(private html: string) {}

  // Full source of the element, from start tag through end tag
  getElementSource(element: Element): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location) return null;

    return this.html.substring(location.startOffset, location.endOffset);
  }

  // Source of the start tag only, including its attributes
  getStartTagSource(element: Element): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location?.startTag) return null;

    return this.html.substring(location.startTag.startOffset, location.startTag.endOffset);
  }

  // Source of a single attribute as written in the start tag
  getAttributeSource(element: Element, attrName: string): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    const attrLocation = location?.attrs?.[attrName];
    if (!attrLocation) return null;

    return this.html.substring(attrLocation.startOffset, attrLocation.endOffset);
  }

  // Numbered source lines around the element; lines it spans are marked with '>'
  getElementContext(element: Element, contextLines = 2): string[] | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location) return null;

    const lines = this.html.split('\n');
    const startLine = Math.max(0, location.startLine - 1 - contextLines);
    const endLine = Math.min(lines.length, location.endLine + contextLines);

    return lines.slice(startLine, endLine).map((line, index) => {
      const lineNumber = startLine + index + 1;
      const marker = lineNumber >= location.startLine && lineNumber <= location.endLine ? '>' : ' ';
      return `${marker} ${lineNumber.toString().padStart(3)}: ${line}`;
    });
  }
}

// Usage
const html = `<!DOCTYPE html>
<html>
<head>
  <title>Test Page</title>
</head>
<body>
  <div class="container">
    <h1>Main Title</h1>
    <p>Content paragraph</p>
  </div>
</body>
</html>`;

const document = parse(html, { sourceCodeLocationInfo: true });
const extractor = new SourceExtractor(html);

// Find the first element with the given tag name (depth-first)
function findElementByTagName(node: any, tagName: string): Element | null {
  if (node.tagName === tagName) return node;
  if (node.childNodes) {
    for (const child of node.childNodes) {
      const found = findElementByTagName(child, tagName);
      if (found) return found;
    }
  }
  return null;
}

const divElement = findElementByTagName(document, 'div');
if (divElement) {
  console.log('Element source:', extractor.getElementSource(divElement));
  console.log('Start tag source:', extractor.getStartTagSource(divElement));
  console.log('Class attribute source:', extractor.getAttributeSource(divElement, 'class'));
  console.log('Context:');
  console.log(extractor.getElementContext(divElement)?.join('\n'));
}
```

### Location-Aware Error Reporting

```typescript
import { parse } from "parse5";
import type { ParserError } from "parse5";

class LocationAwareErrorReporter {
  private errors: Array<{ error: ParserError; context: string }> = [];

  parseWithLocationTracking(html: string) {
    this.errors = []; // start fresh so repeated calls do not accumulate errors
    const lines = html.split('\n');

    const document = parse(html, {
      sourceCodeLocationInfo: true,
      onParseError: (error) => {
        const line = lines[error.startLine - 1] || '';
        // Columns are one-based; convert to zero-based string indices
        const contextStart = Math.max(0, error.startCol - 1 - 10);
        const contextEnd = Math.min(line.length, error.endCol - 1 + 10);
        const context = line.substring(contextStart, contextEnd);

        this.errors.push({ error, context });
      }
    });

    return { document, errors: this.errors };
  }

  generateErrorReport(): string {
    if (this.errors.length === 0) {
      return 'No parsing errors found.';
    }

    let report = `Found ${this.errors.length} parsing error(s):\n\n`;

    this.errors.forEach((item, index) => {
      const { error, context } = item;
      report += `${index + 1}. Error: ${error.code}\n`;
      report += `   Location: Line ${error.startLine}, Column ${error.startCol}\n`;
      report += `   Context: "${context}"\n`;
      report += `   Position: ${error.startOffset}-${error.endOffset}\n\n`;
    });

    return report;
  }
}

// Usage
const reporter = new LocationAwareErrorReporter();
const result = reporter.parseWithLocationTracking('<div><span></div>'); // <span> is never closed

console.log(reporter.generateErrorReport());
```
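A compiler-style caret line is another common way to present these positions. The sketch below is illustrative only: `ErrorPosition` is a hypothetical subset of the fields a `ParserError` carries, `renderCaret` and `repeatChar` are made-up helper names, and the sample error is hand-written rather than produced by the parser.

```typescript
// Hypothetical subset of the fields used from a parse error.
interface ErrorPosition {
  code: string;
  startLine: number;
  startCol: number;
  endLine: number;
  endCol: number;
}

// Small helper so the sketch does not depend on String.prototype.repeat.
function repeatChar(ch: string, count: number): string {
  let out = "";
  for (let i = 0; i < count; i++) out += ch;
  return out;
}

// Renders the offending source line with a caret marker underneath it.
function renderCaret(source: string, err: ErrorPosition): string {
  const line = source.split("\n")[err.startLine - 1] ?? "";
  // Columns are one-based and endCol points just past the last character.
  // For errors spanning several lines, mark only the first column.
  const width = err.endLine === err.startLine ? Math.max(1, err.endCol - err.startCol) : 1;
  const marker = repeatChar(" ", err.startCol - 1) + repeatChar("^", width);
  return `${err.code} (line ${err.startLine}, col ${err.startCol})\n${line}\n${marker}`;
}

// Hand-written sample: marks the '</div>' end tag in the markup from the usage above.
const sampleError: ErrorPosition = {
  code: "sample-error-code",
  startLine: 1,
  startCol: 12,
  endLine: 1,
  endCol: 18,
};
console.log(renderCaret("<div><span></div>", sampleError));
```

In a real reporter, the object passed to `onParseError` could be handed to `renderCaret` directly, assuming it exposes the same five fields.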

### Performance Considerations

Location tracking adds parsing time and memory overhead. A rough way to measure it on your own input:

```typescript
import { parse } from "parse5";

// Benchmark parsing with and without location tracking
function benchmarkParsing(html: string, iterations = 1000) {
  console.log('Benchmarking parsing performance...');

  // Without location tracking (performance.now() gives sub-millisecond resolution)
  const startWithout = performance.now();
  for (let i = 0; i < iterations; i++) {
    parse(html, { sourceCodeLocationInfo: false });
  }
  const timeWithout = performance.now() - startWithout;

  // With location tracking
  const startWith = performance.now();
  for (let i = 0; i < iterations; i++) {
    parse(html, { sourceCodeLocationInfo: true });
  }
  const timeWith = performance.now() - startWith;

  console.log(`Without location tracking: ${timeWithout.toFixed(1)}ms`);
  console.log(`With location tracking: ${timeWith.toFixed(1)}ms`);
  console.log(`Overhead: ${((timeWith - timeWithout) / timeWithout * 100).toFixed(1)}%`);
}

// Test with sample HTML
const sampleHtml = '<div><p>Hello</p><span>World</span></div>'.repeat(100);
benchmarkParsing(sampleHtml);
```

**Best Practices:**

1. **Enable only when needed**: Turn on location tracking for debugging, development tools, or error reporting, where source positions are actually used.
2. **Disable in production**: When location information isn't needed, leave `sourceCodeLocationInfo` at its default of `false` for better performance.
3. **Cache parsed results**: When location information is needed for multiple operations, parse once and reuse the result.
4. **Use selective extraction**: Copy out only the location data you need rather than keeping the whole location-annotated tree in memory.
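
The selective extraction suggested in the last item could look roughly like the sketch below, which copies element offsets into a compact index so the parsed tree can be released afterwards. All names here (`OffsetRange`, `LocatedNode`, `buildLocationIndex`) are hypothetical, and the tree is hand-built to mirror the `nodeName`, `childNodes`, and `sourceCodeLocation` fields described earlier.

```typescript
// Hypothetical structural types: only the fields this sketch reads.
interface OffsetRange {
  startOffset: number;
  endOffset: number;
}
interface LocatedNode {
  nodeName: string;
  sourceCodeLocation?: OffsetRange | null;
  childNodes?: LocatedNode[];
}

// Maps each element name to the offset ranges of its occurrences.
function buildLocationIndex(root: LocatedNode): Record<string, OffsetRange[]> {
  // A prototype-less object, so names such as "constructor" cannot collide with built-ins.
  const index: Record<string, OffsetRange[]> = Object.create(null);
  const visit = (node: LocatedNode): void => {
    const loc = node.sourceCodeLocation;
    // Skip nodes without a location and non-element nodes such as '#text'.
    if (loc && node.nodeName.charAt(0) !== "#") {
      const ranges = index[node.nodeName] || (index[node.nodeName] = []);
      // Copy just the two numbers so the index keeps no reference to the tree.
      ranges.push({ startOffset: loc.startOffset, endOffset: loc.endOffset });
    }
    for (const child of node.childNodes ?? []) visit(child);
  };
  visit(root);
  return index;
}

// Hand-built tree for '<div><p>Hello</p><span>World</span></div>';
// the implied html/head/body elements carry no location.
const locatedTree: LocatedNode = {
  nodeName: "#document",
  childNodes: [{
    nodeName: "html",
    childNodes: [
      { nodeName: "head" },
      { nodeName: "body", childNodes: [{
        nodeName: "div",
        sourceCodeLocation: { startOffset: 0, endOffset: 41 },
        childNodes: [
          { nodeName: "p", sourceCodeLocation: { startOffset: 5, endOffset: 17 },
            childNodes: [{ nodeName: "#text", sourceCodeLocation: { startOffset: 8, endOffset: 13 } }] },
          { nodeName: "span", sourceCodeLocation: { startOffset: 17, endOffset: 35 },
            childNodes: [{ nodeName: "#text", sourceCodeLocation: { startOffset: 23, endOffset: 28 } }] },
        ],
      }] },
    ],
  }],
};

const locationIndex = buildLocationIndex(locatedTree);
console.log(Object.keys(locationIndex).join(", ")); // div, p, span
```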