or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

content-integration.mddata-types-configuration.mdindex.mdmodule-configuration.mdserver-composables.mdxml-html-utilities.md

xml-html-utilities.mddocs/

0

# XML and HTML Utilities

1

2

Utility functions for parsing existing XML sitemaps and extracting sitemap metadata from HTML documents for analysis and integration purposes.

3

4

```typescript { .api }

5

import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';

6

import type { SitemapParseResult, SitemapWarning } from '@nuxtjs/sitemap/utils';

7

```

8

9

## Capabilities

10

11

### XML Sitemap Parsing

12

13

Parse existing XML sitemap content into structured data with validation and warning reporting.

14

15

```typescript { .api }

16

/**

17

* Parse XML sitemap content into structured data

18

* Handles both regular sitemaps and sitemap index files

19

* @param xml - Raw XML sitemap content as string

20

* @returns Promise resolving to parsed sitemap data with URLs and validation warnings

21

*/

22

function parseSitemapXml(xml: string): Promise<SitemapParseResult>;

23

24

interface SitemapParseResult {

25

/** Array of parsed sitemap URLs */

26

urls: SitemapUrlInput[];

27

/** Array of validation warnings encountered during parsing */

28

warnings: SitemapWarning[];

29

}

30

31

interface SitemapWarning {

32

/** Type of warning encountered */

33

type: 'validation';

34

/** Human-readable warning message */

35

message: string;

36

/** Context information about where the warning occurred */

37

context?: {

38

url?: string;

39

field?: string;

40

value?: unknown;

41

};

42

}

43

```

44

45

### HTML Metadata Extraction

46

47

Extract sitemap-relevant metadata from HTML documents for automatic discovery and analysis.

48

49

```typescript { .api }

50

/**

51

* Extract sitemap metadata from HTML document content

52

* Discovers images, videos, and other sitemap-relevant information

53

* @param html - Raw HTML content as string

54

* @param options - Optional configuration for metadata extraction

55

* @returns Array of sitemap URLs with discovered metadata

56

*/

57

function parseHtmlExtractSitemapMeta(

58

html: string,

59

options?: {

60

/** Whether to discover images in the HTML content */

61

images?: boolean;

62

/** Whether to discover videos in the HTML content */

63

videos?: boolean;

64

/** Whether to extract lastmod information */

65

lastmod?: boolean;

66

/** Whether to extract alternative language links */

67

alternatives?: boolean;

68

/** Function to resolve relative URLs to absolute URLs */

69

resolveUrl?: (url: string) => string;

70

}

71

): SitemapUrl[];

72

```

73

74

### Parsed Data Types

75

76

**URL Entry Structure**

77

78

```typescript { .api }

79

interface SitemapUrl {

80

/** URL location (required) */

81

loc: string;

82

/** Last modification date */

83

lastmod?: string | Date;

84

/** Change frequency indicator */

85

changefreq?: Changefreq;

86

/** Priority value between 0.0 and 1.0 */

87

priority?: 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1;

88

/** Alternative language versions */

89

alternatives?: AlternativeEntry[];

90

/** Google News metadata */

91

news?: GoogleNewsEntry;

92

/** Associated images */

93

images?: ImageEntry[];

94

/** Associated videos */

95

videos?: VideoEntry[];

96

}

97

98

type Changefreq =

99

| 'always'

100

| 'hourly'

101

| 'daily'

102

| 'weekly'

103

| 'monthly'

104

| 'yearly'

105

| 'never';

106

```

107

108

**Image Metadata Structure**

109

110

```typescript { .api }

111

interface ImageEntry {

112

/** Image URL location */

113

loc: string | URL;

114

/** Image caption text */

115

caption?: string;

116

/** Geographic location information */

117

geoLocation?: string;

118

/** Image title */

119

title?: string;

120

/** License URL */

121

license?: string | URL;

122

}

123

```

124

125

**Video Metadata Structure**

126

127

```typescript { .api }

128

interface VideoEntry {

129

/** Video title (required) */

130

title: string;

131

/** Video thumbnail URL (required) */

132

thumbnail_loc: string | URL;

133

/** Video description (required) */

134

description: string;

135

/** Direct video content URL */

136

content_loc?: string | URL;

137

/** Video player page URL */

138

player_loc?: string | URL;

139

/** Video duration in seconds */

140

duration?: number;

141

/** Video expiration date */

142

expiration_date?: Date | string;

143

/** Video rating (0.0 to 5.0) */

144

rating?: number;

145

/** View count */

146

view_count?: number;

147

/** Publication date */

148

publication_date?: Date | string;

149

/** Family-friendly flag */

150

family_friendly?: 'yes' | 'no' | boolean;

151

/** Geographic restrictions */

152

restriction?: Restriction;

153

/** Platform restrictions */

154

platform?: Platform;

155

/** Pricing information */

156

price?: PriceEntry[];

157

/** Subscription requirement */

158

requires_subscription?: 'yes' | 'no' | boolean;

159

/** Uploader information */

160

uploader?: {

161

uploader: string;

162

info?: string | URL;

163

};

164

/** Live content indicator */

165

live?: 'yes' | 'no' | boolean;

166

/** Content tags */

167

tag?: string | string[];

168

}

169

170

interface Restriction {

171

relationship: 'allow' | 'deny';

172

restriction: string;

173

}

174

175

interface Platform {

176

relationship: 'allow' | 'deny';

177

platform: string;

178

}

179

180

interface PriceEntry {

181

price?: number | string;

182

currency?: string;

183

type?: 'rent' | 'purchase' | 'package' | 'subscription';

184

}

185

```

186

187

**Alternative URL Structure**

188

189

```typescript { .api }

190

interface AlternativeEntry {

191

/** Language/locale code (hreflang attribute) */

192

hreflang: string;

193

/** Alternative URL */

194

href: string | URL;

195

}

196

```

197

198

**Google News Structure**

199

200

```typescript { .api }

201

interface GoogleNewsEntry {

202

/** News article title */

203

title: string;

204

/** Article publication date in W3C format */

205

publication_date: Date | string;

206

/** Publication information */

207

publication: {

208

/** Publication name as it appears on news.google.com */

209

name: string;

210

/** Publication language (ISO 639 code) */

211

language: string;

212

};

213

}

214

```

215

216

**Usage Examples:**

217

218

```typescript

219

// Parse an existing XML sitemap

220

import { parseSitemapXml } from '@nuxtjs/sitemap/utils';

221

222

const xmlContent = `<?xml version="1.0" encoding="UTF-8"?>

223

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

224

<url>

225

<loc>https://example.com/</loc>

226

<lastmod>2023-12-01</lastmod>

227

<changefreq>daily</changefreq>

228

<priority>1.0</priority>

229

</url>

230

<url>

231

<loc>https://example.com/about</loc>

232

<lastmod>invalid-date</lastmod>

233

<priority>0.8</priority>

234

</url>

235

</urlset>`;

236

237

const result = parseSitemapXml(xmlContent);

238

239

console.log(result.urls);

240

// [

241

// {

242

// loc: 'https://example.com/',

243

// lastmod: '2023-12-01',

244

// changefreq: 'daily',

245

// priority: 1.0

246

// },

247

// {

248

// loc: 'https://example.com/about',

249

// priority: 0.8

250

// }

251

// ]

252

253

console.log(result.warnings);

254

// [

255

// {

256

// type: 'invalid-date',

257

// message: 'Invalid lastmod date: invalid-date',

258

// context: 'https://example.com/about'

259

// }

260

// ]

261

262

// Extract metadata from HTML content

263

import { parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';

264

265

const htmlContent = `

266

<!DOCTYPE html>

267

<html>

268

<head>

269

<title>My Blog Post</title>

270

<meta property="og:image" content="https://example.com/hero.jpg">

271

<meta property="article:published_time" content="2023-12-01T10:00:00Z">

272

</head>

273

<body>

274

<h1>My Blog Post</h1>

275

<img src="/images/diagram.png" alt="Technical diagram">

276

<video src="/videos/demo.mp4" poster="/videos/demo-thumb.jpg">

277

<source src="/videos/demo.mp4" type="video/mp4">

278

</video>

279

</body>

280

</html>

281

`;

282

283

const metadata = parseHtmlExtractSitemapMeta(htmlContent);

284

285

console.log(metadata);

286

// [

287

// {

288

// images: [

289

// {

290

// loc: 'https://example.com/hero.jpg',

291

// title: 'My Blog Post'

292

// },

293

// {

294

// loc: '/images/diagram.png',

295

// caption: 'Technical diagram'

296

// }

297

// ],

298

// videos: [

299

// {

300

// title: 'My Blog Post',

301

// content_loc: '/videos/demo.mp4',

302

// thumbnail_loc: '/videos/demo-thumb.jpg'

303

// }

304

// ],

305

// lastmod: '2023-12-01T10:00:00Z'

306

// }

307

// ]

308

309

// Handle parsing errors gracefully

310

try {

311

const result = parseSitemapXml(invalidXml);

312

313

// Process results

314

result.urls.forEach(url => {

315

console.log(`Processing URL: ${url.loc}`);

316

});

317

318

// Handle warnings

319

if (result.warnings.length > 0) {

320

console.warn('Parsing warnings:');

321

result.warnings.forEach(warning => {

322

console.warn(`- ${warning.type}: ${warning.message}`);

323

});

324

}

325

} catch (error) {

326

console.error('Failed to parse sitemap XML:', error);

327

}

328

329

// Integration with existing sitemap generation

330

import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';

331

332

// Parse competitor's sitemap for analysis

333

const competitorSitemap = await $fetch('https://competitor.com/sitemap.xml');

334

const parsed = parseSitemapXml(competitorSitemap);

335

336

// Use parsed data to inform your sitemap structure

337

const competitorUrls = parsed.urls.map(url => ({

338

loc: url.loc.replace('competitor.com', 'mysite.com'),

339

priority: Math.max(0.1, (url.priority || 0.5) - 0.1) // Slightly lower priority

340

}));

341

342

// Extract metadata from rendered pages for automatic discovery

343

const pageHtml = await $fetch('https://mysite.com/blog/post-1');

344

const extractedMeta = parseHtmlExtractSitemapMeta(pageHtml);

345

346

// Combine with existing sitemap data

347

const enrichedUrl = {

348

loc: '/blog/post-1',

349

...extractedMeta[0], // Use discovered metadata

350

priority: 0.8

351

};

352

```

353

354

## Parsing Features

355

356

**XML Sitemap Support**

357

358

The XML parser supports:

359

- Standard sitemap XML format (sitemaps.org schema)

360

- Sitemap index files with nested sitemap references

361

- Image extensions (Google Image sitemaps)

362

- Video extensions (Google Video sitemaps)

363

- News extensions (Google News sitemaps)

364

- Alternative/hreflang entries for i18n

365

- Validation with detailed warning reporting

366

- Malformed XML handling with graceful degradation

367

368

**HTML Metadata Discovery**

369

370

The HTML parser extracts:

371

- Open Graph image and video metadata

372

- Structured data (JSON-LD, microdata)

373

- HTML img and video elements

374

- Meta tags for publication dates and modification times

375

- Link tags for alternative versions

376

- Title and description elements

377

- Validation of discovered URLs and content

378

379

**Error Handling and Validation**

380

381

Both parsers provide:

382

- Comprehensive validation of URLs, dates, and numeric values

383

- Warning collection for non-fatal parsing issues

384

- Graceful handling of malformed or incomplete data

385

- Context information for debugging parsing issues

386

- Type-safe output with proper TypeScript interfaces