# HTML Parsing

Core HTML parsing functionality that converts HTML strings into abstract syntax trees. Parse5 implements the WHATWG HTML Living Standard parsing algorithm and handles malformed HTML gracefully.

## Capabilities

### Document Parsing

Parses a complete HTML document string into a document AST node.

```typescript { .api }
/**
 * Parses an HTML string into a complete document AST
 * @param html - Input HTML string to parse
 * @param options - Optional parsing configuration
 * @returns Document AST node representing the parsed HTML
 */
function parse<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  html: string,
  options?: ParserOptions<T>
): T['document'];
```

**Usage Examples:**

```typescript
import { parse } from "parse5";

// Parse a complete HTML document
const document = parse('<!DOCTYPE html><html><head><title>Test</title></head><body><h1>Hello World</h1></body></html>');

// Access document structure
console.log(document.childNodes[0].nodeName); // '#documentType'
console.log(document.childNodes[1].tagName); // 'html'

// Parse with options
const documentWithLocation = parse('<html><body>Content</body></html>', {
  sourceCodeLocationInfo: true,
  scriptingEnabled: false
});
```
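
Index chains such as `document.childNodes[1]` depend on the exact input (for example, whether a doctype is present), so a recursive lookup is often sturdier. The following is a minimal sketch, assuming nodes shaped like those of the default tree adapter (`nodeName`, optional `tagName` and `childNodes`). `NodeLike` and `findFirst` are hypothetical names and not part of parse5, and the sample tree is a hand-built, simplified stand-in for a parsed document.

```typescript
// Hypothetical structural type: only the fields this helper reads.
interface NodeLike {
  nodeName: string;
  tagName?: string;
  childNodes?: NodeLike[];
}

// Depth-first search for the first element with the given tag name.
function findFirst(node: NodeLike, tagName: string): NodeLike | null {
  if (node.tagName === tagName) return node;
  for (const child of node.childNodes ?? []) {
    const found = findFirst(child, tagName);
    if (found) return found;
  }
  return null;
}

// Simplified stand-in for the tree of a document containing <h1>Hello World</h1>.
const sampleTree: NodeLike = {
  nodeName: '#document',
  childNodes: [
    {
      nodeName: 'html',
      tagName: 'html',
      childNodes: [
        { nodeName: 'head', tagName: 'head', childNodes: [] },
        {
          nodeName: 'body',
          tagName: 'body',
          childNodes: [
            { nodeName: 'h1', tagName: 'h1', childNodes: [{ nodeName: '#text' }] },
          ],
        },
      ],
    },
  ],
};

const heading = findFirst(sampleTree, 'h1');
```

With a real document, the value returned by `parse()` could be passed in place of `sampleTree`. Note that this sketch only follows `childNodes`, so it would not descend into the content of `<template>` elements.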

### Fragment Parsing

Parses an HTML fragment, optionally relative to a context element. The context element changes how the fragment is interpreted, mirroring what a browser does when the same markup is assigned to that element's `innerHTML`.

```typescript { .api }
/**
 * Parses HTML fragment with context element
 * @param fragmentContext - Context element that affects parsing behavior
 * @param html - HTML fragment string to parse
 * @param options - Parsing configuration options
 * @returns DocumentFragment containing parsed nodes
 */
function parseFragment<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  fragmentContext: T['parentNode'] | null,
  html: string,
  options: ParserOptions<T>
): T['documentFragment'];

/**
 * Parses HTML fragment without context element
 * @param html - HTML fragment string to parse
 * @param options - Optional parsing configuration
 * @returns DocumentFragment containing parsed nodes
 */
function parseFragment<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
  html: string,
  options?: ParserOptions<T>
): T['documentFragment'];
```

**Usage Examples:**

```typescript
import { parse, parseFragment } from "parse5";

// Parse fragment without context
const fragment = parseFragment('<div><span>Hello</span><p>World</p></div>');
console.log(fragment.childNodes.length); // 1
console.log(fragment.childNodes[0].tagName); // 'div'

// Parse fragment with context for accurate parsing
const document = parse('<table></table>');
// The input has no doctype, so <html> is childNodes[0]: html > body > table
const tableElement = document.childNodes[0].childNodes[1].childNodes[0];

const tableRowFragment = parseFragment(
  tableElement,
  '<tr><td>Cell content</td></tr>',
  { sourceCodeLocationInfo: true }
);
// In a table context the parser inserts an implied <tbody> around the row
console.log(tableRowFragment.childNodes[0].tagName); // 'tbody'

// Parse template content
const templateFragment = parseFragment('<div>Template content</div>');
```

### Advanced Parsing Options

`ParserOptions` controls the scripting flag, source location tracking, the tree adapter, and parse error reporting.

```typescript { .api }
interface ParserOptions<T extends TreeAdapterTypeMap> {
  /**
   * The scripting flag. If set to true, noscript element content
   * will be parsed as text. Defaults to true.
   */
  scriptingEnabled?: boolean;

  /**
   * Enables source code location information. When enabled, each node
   * will have a sourceCodeLocation property with position data.
   * Defaults to false.
   */
  sourceCodeLocationInfo?: boolean;

  /**
   * Specifies the tree adapter to use for creating and manipulating AST nodes.
   * Defaults to the built-in default tree adapter.
   */
  treeAdapter?: TreeAdapter<T>;

  /**
   * Error handling callback function. Called for each parsing error encountered.
   */
  onParseError?: ParserErrorHandler;
}
```

**Usage Examples:**

```typescript
import { parse } from "parse5";

// Enable location tracking for debugging
const documentWithLocations = parse('<div>Content</div>', {
  sourceCodeLocationInfo: true
});

// Each element will have a sourceCodeLocation property.
// The input has no doctype, so <html> is childNodes[0]: html > body > div
const divElement = documentWithLocations.childNodes[0].childNodes[1].childNodes[0];
console.log(divElement.sourceCodeLocation);
// Output includes: { startLine: 1, startCol: 1, startOffset: 0, endLine: 1, endCol: 19, endOffset: 18 }
// along with startTag and endTag locations for the element

// Handle parsing errors
const errors: string[] = [];
const documentWithErrors = parse('<div><span></div>', {
  onParseError: (error) => {
    errors.push(`${error.code} at line ${error.startLine}`);
  }
});
// One '<code> at line <n>' entry per reported error; the exact codes depend on
// the input (this input has no doctype, so expect entries such as 'missing-doctype at line 1')
console.log(errors);

// Parse <noscript> content as markup instead of text (parse5 never executes scripts)
const noScriptDocument = parse('<noscript>This content is visible</noscript>', {
  scriptingEnabled: false
});
```

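### Parser Class (Advanced)

The `Parser` class underlies `parse()` and `parseFragment()`. It can be used directly when more control over the parsing process is needed, but it is an internal API, so prefer the two functions unless they are insufficient.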

```typescript { .api }
/**
 * Core HTML parser class. Internal API - use parse() and parseFragment() functions instead.
 */
class Parser<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap> {
  /**
   * Static method to parse HTML string into document
   */
  static parse<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
    html: string,
    options?: ParserOptions<T>
  ): T['document'];

  /**
   * Static method to get fragment parser instance
   */
  static getFragmentParser<T extends TreeAdapterTypeMap = DefaultTreeAdapterMap>(
    fragmentContext: T['parentNode'] | null,
    options?: ParserOptions<T>
  ): Parser<T>;

  /**
   * Get parsed fragment from fragment parser
   */
  getFragment(): T['documentFragment'];
}
```

## Common Parsing Patterns

### HTML Document Structure

```typescript
import { parse } from "parse5";

const html = '<!DOCTYPE html><html><head><title>Page</title></head><body><div>Content</div></body></html>';
const document = parse(html);

// Document structure:
// document
// ├── DocumentType node ('#documentType')
// └── Element node ('html')
//     ├── Element node ('head')
//     │   └── Element node ('title')
//     │       └── Text node ('Page')
//     └── Element node ('body')
//         └── Element node ('div')
//             └── Text node ('Content')
```

### Fragment Parsing with Context

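```typescript
import { parse, parseFragment } from "parse5";

// Table rows need a table context to be parsed as they would be inside a <table>
const table = parse('<table></table>');
// The input has no doctype, so <html> is childNodes[0]: html > body > table
const tableElement = table.childNodes[0].childNodes[1].childNodes[0];

// The context-first overload takes an options argument; an empty object is enough
const fragment = parseFragment(tableElement, '<tr><td>Data</td></tr>', {});
// The context decides how the markup is interpreted: under a <div> context,
// for instance, the <tr> and <td> tags would be dropped and only the text kept
```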

### Error Recovery

Parse5 automatically recovers from many HTML errors:

```typescript
import { parse } from "parse5";

// Missing closing tags
const doc1 = parse('<div><p>Unclosed paragraph<div>Another div</div>');
// Parser automatically closes the <p> when the nested <div> starts

// Misplaced elements
const doc2 = parse('<html><div>Content before body</div><body>Body content</body></html>');
// Parser creates the implied <head> and <body>, so the div ends up inside body
```
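
To see what recovery produced, it can help to print the resulting tree as an indented outline. The following is a minimal sketch, assuming nodes shaped like those of the default tree adapter (`nodeName`, `childNodes`, and `value` on text nodes). `OutlineNode` and `outline` are hypothetical helpers, and the sample tree is hand-built to mirror the structure expected for the outer `<div>` of the first example: the inner `<div>` becomes a sibling of `<p>`, not its child.

```typescript
// Hypothetical structural type: only the fields this helper reads.
interface OutlineNode {
  nodeName: string;
  value?: string; // text content, present on text nodes
  childNodes?: OutlineNode[];
}

// Returns one line per node, indented two spaces per depth level.
function outline(node: OutlineNode, depth = 0): string[] {
  const indent = new Array(depth + 1).join('  ');
  const label =
    node.nodeName === '#text'
      ? `#text ${JSON.stringify(node.value ?? '')}`
      : node.nodeName;
  let lines = [indent + label];
  for (const child of node.childNodes ?? []) {
    lines = lines.concat(outline(child, depth + 1));
  }
  return lines;
}

// Hand-built stand-in for the recovered outer <div> subtree.
const recovered: OutlineNode = {
  nodeName: 'div',
  childNodes: [
    { nodeName: 'p', childNodes: [{ nodeName: '#text', value: 'Unclosed paragraph' }] },
    { nodeName: 'div', childNodes: [{ nodeName: '#text', value: 'Another div' }] },
  ],
};

const outlineText = outline(recovered).join('\n');
// div
//   p
//     #text "Unclosed paragraph"
//   div
//     #text "Another div"
```

Passing a node returned by `parse()` instead of `recovered` would show the actual recovered structure for any input.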

## Source Code Location Tracking

Parse5 can record where each node came from in the original HTML source, which is useful for debugging and development tools. When enabled, each parsed node includes line, column, and offset information for its position in the source.

### Location Information Interface

```typescript { .api }
/**
 * Basic location information interface
 */
interface Location {
  /** One-based line index of the first character */
  startLine: number;
  /** One-based column index of the first character */
  startCol: number;
  /** Zero-based first character index */
  startOffset: number;
  /** One-based line index of the last character */
  endLine: number;
  /** One-based column index of the last character (points directly after the character) */
  endCol: number;
  /** Zero-based last character index (points directly after the character) */
  endOffset: number;
}

/**
 * Location information with attribute positions
 */
interface LocationWithAttributes extends Location {
  /** Start tag attributes' location info */
  attrs?: Record<string, Location>;
}

/**
 * Element location with start and end tag positions
 */
interface ElementLocation extends LocationWithAttributes {
  /** Element's start tag location info */
  startTag?: Location;
  /** Element's end tag location info (undefined if no closing tag) */
  endTag?: Location;
}
```
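
The fields describe the same span in two coordinate systems: lines and columns are one-based, offsets are zero-based, and every `end*` value points just past the last character. The sketch below derives a line and column from an offset under those conventions, which can be handy for cross-checking a `Location`. `Position` and `positionAt` are hypothetical names, and the sketch assumes lines are separated by `\n` only.

```typescript
interface Position {
  line: number; // one-based
  col: number; // one-based
}

// Converts a zero-based offset into a one-based line and column.
function positionAt(source: string, offset: number): Position {
  let line = 1;
  let lineStart = 0; // offset of the first character of the current line
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source.charAt(i) === '\n') {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, col: offset - lineStart + 1 };
}

const src = '<div>\n<p>Hi</p>\n</div>';

// Start of the <p> element: offset 6, the first character of line 2.
const pStart = positionAt(src, src.indexOf('<p>'));
// → { line: 2, col: 1 }

// End of the <p> element: the offset just past '</p>'.
const pEnd = positionAt(src, src.indexOf('</p>') + '</p>'.length);
// → { line: 2, col: 10 }
```

Here the `<p>` element occupies columns 1 through 9 of line 2, so its end column is 10, one past the last character.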

### Enabling Location Tracking

Location tracking is controlled through the `sourceCodeLocationInfo` option in `ParserOptions`:

```typescript
import { parse, parseFragment } from "parse5";

// Enable location tracking for document parsing
const document = parse('<div class="container">Hello <span>World</span></div>', {
  sourceCodeLocationInfo: true
});

// Enable location tracking for fragment parsing
const fragment = parseFragment('<p>Paragraph with <strong>emphasis</strong></p>', {
  sourceCodeLocationInfo: true
});
```

### Using Location Information

When location tracking is enabled, each node includes a `sourceCodeLocation` property:

```typescript
import { parse } from "parse5";
import type { Element, Location, ElementLocation } from "parse5";

const html = `<div class="container">
  <h1>Title</h1>
  <p>Paragraph with <em>emphasis</em></p>
</div>`;

const document = parse(html, { sourceCodeLocationInfo: true });

// Navigate to elements. The input has no doctype, so <html> is the first
// child of the document, and the div is the first child of the implied <body>.
const htmlElement = document.childNodes[0] as Element;
const bodyElement = htmlElement.childNodes[1] as Element;
const divElement = bodyElement.childNodes[0] as Element;

// Access location information
const divLocation = divElement.sourceCodeLocation as ElementLocation;
console.log('Div element location:');
console.log(`  Start: line ${divLocation.startLine}, col ${divLocation.startCol}`);
console.log(`  End: line ${divLocation.endLine}, col ${divLocation.endCol}`);
console.log(`  Offset: ${divLocation.startOffset}-${divLocation.endOffset}`);

// Access start tag location
if (divLocation.startTag) {
  console.log('Start tag location:');
  console.log(`  <div class="container"> at line ${divLocation.startTag.startLine}`);
}

// Access end tag location
if (divLocation.endTag) {
  console.log('End tag location:');
  console.log(`  </div> at line ${divLocation.endTag.startLine}`);
}

// Access attribute locations
if (divLocation.attrs && divLocation.attrs.class) {
  const classLocation = divLocation.attrs.class;
  console.log(`Class attribute at line ${classLocation.startLine}, col ${classLocation.startCol}`);
}
```

### Location-Based Source Extraction

```typescript
import { parse } from "parse5";
import type { Element, ElementLocation } from "parse5";

class SourceExtractor {
  constructor(private html: string) {}

  // Full source of the element, from its start tag to its end tag
  getElementSource(element: Element): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location) return null;

    return this.html.substring(location.startOffset, location.endOffset);
  }

  // Source of the start tag only, including its attributes
  getStartTagSource(element: Element): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location?.startTag) return null;

    return this.html.substring(location.startTag.startOffset, location.startTag.endOffset);
  }

  // Source of a single attribute as written in the start tag
  getAttributeSource(element: Element, attrName: string): string | null {
    const location = element.sourceCodeLocation as ElementLocation;
    const attrLocation = location?.attrs?.[attrName];
    if (!attrLocation) return null;

    return this.html.substring(attrLocation.startOffset, attrLocation.endOffset);
  }

  // Source lines of the element plus surrounding context, with '>' marking the element's own lines
  getElementContext(element: Element, contextLines = 2): string[] | null {
    const location = element.sourceCodeLocation as ElementLocation;
    if (!location) return null;

    const lines = this.html.split('\n');
    // Locations use one-based lines; slice() uses zero-based indices with an exclusive end
    const startLine = Math.max(0, location.startLine - 1 - contextLines);
    const endLine = Math.min(lines.length, location.endLine + contextLines);

    return lines.slice(startLine, endLine).map((line, index) => {
      const lineNumber = startLine + index + 1;
      const marker = lineNumber >= location.startLine && lineNumber <= location.endLine ? '>' : ' ';
      return `${marker} ${lineNumber.toString().padStart(3)}: ${line}`;
    });
  }
}

// Usage
const html = `<!DOCTYPE html>
<html>
<head>
  <title>Test Page</title>
</head>
<body>
  <div class="container">
    <h1>Main Title</h1>
    <p>Content paragraph</p>
  </div>
</body>
</html>`;

const document = parse(html, { sourceCodeLocationInfo: true });
const extractor = new SourceExtractor(html);

// Find the div element
function findElementByTagName(node: any, tagName: string): Element | null {
  if (node.tagName === tagName) return node;
  if (node.childNodes) {
    for (const child of node.childNodes) {
      const found = findElementByTagName(child, tagName);
      if (found) return found;
    }
  }
  return null;
}

const divElement = findElementByTagName(document, 'div');
if (divElement) {
  console.log('Element source:', extractor.getElementSource(divElement));
  console.log('Start tag source:', extractor.getStartTagSource(divElement));
  console.log('Class attribute source:', extractor.getAttributeSource(divElement, 'class'));
  console.log('Context:');
  console.log(extractor.getElementContext(divElement)?.join('\n'));
}
```

### Location-Aware Error Reporting

```typescript
import { parse } from "parse5";
import type { ParserError } from "parse5";

class LocationAwareErrorReporter {
  private errors: Array<{ error: ParserError; context: string }> = [];

  parseWithLocationTracking(html: string) {
    // Reset so repeated calls do not accumulate errors from earlier inputs
    this.errors = [];
    const lines = html.split('\n');

    const document = parse(html, {
      sourceCodeLocationInfo: true,
      onParseError: (error) => {
        const line = lines[error.startLine - 1] || '';
        // startCol/endCol are one-based; convert to zero-based string indices
        // and widen the window by up to 10 characters on each side
        const contextStart = Math.max(0, error.startCol - 1 - 10);
        const contextEnd = Math.min(line.length, error.endCol - 1 + 10);
        const context = line.substring(contextStart, contextEnd);

        this.errors.push({ error, context });
      }
    });

    return { document, errors: this.errors };
  }

  generateErrorReport(): string {
    if (this.errors.length === 0) {
      return 'No parsing errors found.';
    }

    let report = `Found ${this.errors.length} parsing error(s):\n\n`;

    this.errors.forEach((item, index) => {
      const { error, context } = item;
      report += `${index + 1}. Error: ${error.code}\n`;
      report += `   Location: Line ${error.startLine}, Column ${error.startCol}\n`;
      report += `   Context: "${context}"\n`;
      report += `   Position: ${error.startOffset}-${error.endOffset}\n\n`;
    });

    return report;
  }
}

// Usage
const reporter = new LocationAwareErrorReporter();
const result = reporter.parseWithLocationTracking('<div><span></div>'); // <span> is never closed

console.log(reporter.generateErrorReport());
```

### Performance Considerations

Location tracking adds overhead to both parsing time and memory usage. The time cost for a given input can be measured with a simple benchmark:

```typescript
import { parse } from "parse5";

// Benchmark parsing with and without location tracking
function benchmarkParsing(html: string, iterations = 1000) {
  console.log('Benchmarking parsing performance...');

  // Without location tracking
  const startWithout = Date.now();
  for (let i = 0; i < iterations; i++) {
    parse(html, { sourceCodeLocationInfo: false });
  }
  const timeWithout = Date.now() - startWithout;

  // With location tracking
  const startWith = Date.now();
  for (let i = 0; i < iterations; i++) {
    parse(html, { sourceCodeLocationInfo: true });
  }
  const timeWith = Date.now() - startWith;

  console.log(`Without location tracking: ${timeWithout}ms`);
  console.log(`With location tracking: ${timeWith}ms`);
  // Guard against division by zero when the baseline run is too fast to measure
  if (timeWithout > 0) {
    console.log(`Overhead: ${((timeWith - timeWithout) / timeWithout * 100).toFixed(1)}%`);
  }
}

// Test with sample HTML
const sampleHtml = '<div><p>Hello</p><span>World</span></div>'.repeat(100);
benchmarkParsing(sampleHtml);
```

**Best Practices:**

1. **Enable only when needed**: Location tracking should only be enabled for debugging, development tools, or error reporting scenarios
2. **Leave it off otherwise**: `sourceCodeLocationInfo` defaults to `false`; keep it that way for production parsing that does not use location information
3. **Cache parsed results**: When location information is needed for multiple operations, parse once and reuse the result
4. **Use selective extraction**: Instead of keeping all parsed data in memory, extract only the location information you need
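
The third practice (parse once and reuse the result) can be sketched as a small memoizing wrapper. This is an illustration under stated assumptions, not part of parse5: `cachedParser` is a hypothetical helper, the parser is passed in as a plain function (for example `(html) => parse(html, { sourceCodeLocationInfo: true })`), and the cache is keyed by the full source string and never evicts entries. A counting stand-in parser is used here so the sketch runs on its own.

```typescript
// Wraps a parse function so that each distinct input string is parsed only once.
function cachedParser<TDoc>(parseFn: (html: string) => TDoc): (html: string) => TDoc {
  const cache: Record<string, TDoc> = {};
  return (html: string): TDoc => {
    if (Object.prototype.hasOwnProperty.call(cache, html)) {
      return cache[html] as TDoc;
    }
    const doc = parseFn(html);
    cache[html] = doc;
    return doc;
  };
}

// Stand-in parser (hypothetical) that counts how often it is invoked.
let parseCalls = 0;
const fakeParse = (html: string) => {
  parseCalls++;
  return { sourceLength: html.length };
};

const parseOnce = cachedParser(fakeParse);
const firstResult = parseOnce('<p>Hello</p>');
const secondResult = parseOnce('<p>Hello</p>'); // served from the cache
```

Because cached results are shared between callers, this approach suits read-only use of the tree; code that mutates the returned nodes would affect every later caller.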