# XML and HTML Utilities

Utility functions for parsing existing XML sitemaps and extracting sitemap metadata from HTML documents for analysis and integration purposes.

```typescript { .api }
import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';
import type { SitemapParseResult, SitemapWarning } from '@nuxtjs/sitemap/utils';
```

## Capabilities

### XML Sitemap Parsing

Parse existing XML sitemap content into structured data with validation and warning reporting.

```typescript { .api }
/**
 * Parse XML sitemap content into structured data
 * Handles both regular sitemaps and sitemap index files
 * @param xml - Raw XML sitemap content as string
 * @returns Promise resolving to parsed sitemap data with URLs and validation warnings
 */
function parseSitemapXml(xml: string): Promise<SitemapParseResult>;

interface SitemapParseResult {
  /** Array of parsed sitemap URLs */
  urls: SitemapUrlInput[];
  /** Array of validation warnings encountered during parsing */
  warnings: SitemapWarning[];
}

interface SitemapWarning {
  /** Type of warning encountered */
  type: 'validation';
  /** Human-readable warning message */
  message: string;
  /** Context information about where the warning occurred */
  context?: {
    url?: string;
    field?: string;
    value?: unknown;
  };
}
```
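
Because each warning may carry an optional `context.url`, it can be handy to bucket warnings per URL before reporting them. The sketch below is only an illustration: it redeclares the `SitemapWarning` shape locally so it runs on its own, and `groupWarningsByUrl` is a hypothetical helper, not part of the package.

```typescript
// Local mirror of the SitemapWarning shape documented above,
// redeclared so this sketch is self-contained.
interface SitemapWarning {
  type: 'validation';
  message: string;
  context?: { url?: string; field?: string; value?: unknown };
}

// Hypothetical helper: bucket warnings by the URL they refer to.
// Warnings without a context URL land under the '(no url)' key.
function groupWarningsByUrl(
  warnings: SitemapWarning[]
): Record<string, SitemapWarning[]> {
  const groups: Record<string, SitemapWarning[]> = {};
  for (const warning of warnings) {
    const key = warning.context?.url ?? '(no url)';
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(warning);
  }
  return groups;
}

// Sample data made up for illustration only.
const sampleWarnings: SitemapWarning[] = [
  { type: 'validation', message: 'Invalid lastmod date', context: { url: 'https://example.com/about', field: 'lastmod' } },
  { type: 'validation', message: 'Invalid priority', context: { url: 'https://example.com/about', field: 'priority' } },
  { type: 'validation', message: 'Missing loc' },
];

const grouped = groupWarningsByUrl(sampleWarnings);
Object.keys(grouped).forEach((url) => {
  console.log(`${url}: ${grouped[url].length} warning(s)`);
});
```

In real use, the `warnings` array of an awaited `parseSitemapXml` result would take the place of `sampleWarnings`.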

### HTML Metadata Extraction

Extract sitemap-relevant metadata from HTML documents for automatic discovery and analysis.

```typescript { .api }
/**
 * Extract sitemap metadata from HTML document content
 * Discovers images, videos, and other sitemap-relevant information
 * @param html - Raw HTML content as string
 * @param options - Optional configuration for metadata extraction
 * @returns Array of sitemap URLs with discovered metadata
 */
function parseHtmlExtractSitemapMeta(
  html: string,
  options?: {
    /** Whether to discover images in the HTML content */
    images?: boolean;
    /** Whether to discover videos in the HTML content */
    videos?: boolean;
    /** Whether to extract lastmod information */
    lastmod?: boolean;
    /** Whether to extract alternative language links */
    alternatives?: boolean;
    /** Function to resolve relative URLs to absolute URLs */
    resolveUrl?: (url: string) => string;
  }
): SitemapUrl[];
```
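
The `resolveUrl` option takes a function that maps a possibly relative URL to an absolute one. One way to build such a function is with the standard WHATWG `URL` constructor, as sketched below. `makeResolveUrl` and the base URL are assumptions made for this example; the package does not export this helper.

```typescript
// Hypothetical factory for a `resolveUrl` callback.
// `base` is the absolute URL of the page the HTML was fetched from.
function makeResolveUrl(base: string): (url: string) => string {
  return (url: string): string => new URL(url, base).toString();
}

const resolveUrl = makeResolveUrl('https://example.com/blog/post-1');

console.log(resolveUrl('/images/diagram.png'));
// https://example.com/images/diagram.png
console.log(resolveUrl('thumb.jpg'));
// https://example.com/blog/thumb.jpg
console.log(resolveUrl('https://cdn.example.com/hero.jpg'));
// https://cdn.example.com/hero.jpg
```

The resulting function could then be passed as `{ resolveUrl }` in the options argument, so that discovered image and video locations come back as absolute URLs.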

### Parsed Data Types

**URL Entry Structure**

```typescript { .api }
interface SitemapUrl {
  /** URL location (required) */
  loc: string;
  /** Last modification date */
  lastmod?: string | Date;
  /** Change frequency indicator */
  changefreq?: Changefreq;
  /** Priority value between 0.0 and 1.0 */
  priority?: 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1;
  /** Alternative language versions */
  alternatives?: AlternativeEntry[];
  /** Google News metadata */
  news?: GoogleNewsEntry;
  /** Associated images */
  images?: ImageEntry[];
  /** Associated videos */
  videos?: VideoEntry[];
}

type Changefreq =
  | 'always'
  | 'hourly'
  | 'daily'
  | 'weekly'
  | 'monthly'
  | 'yearly'
  | 'never';
```
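
Since `priority` is typed as a union of one-decimal literals, an arbitrary computed number does not type-check against it directly. A minimal sketch of one way to normalize a number into that union follows. `Priority` and `toPriority` are names invented for this example, and the fallback of `0.5` is the default priority defined by the sitemaps.org protocol.

```typescript
// Local alias for the literal union used by `SitemapUrl.priority`.
type Priority = 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1;

// Hypothetical helper: clamp to [0, 1] and round to one decimal place.
// Non-finite input falls back to 0.5, the sitemaps.org default priority.
function toPriority(value: number): Priority {
  if (!isFinite(value)) {
    return 0.5;
  }
  const clamped = Math.min(1, Math.max(0, value));
  return (Math.round(clamped * 10) / 10) as Priority;
}

console.log(toPriority(0.84)); // 0.8
console.log(toPriority(1.5));  // 1
console.log(toPriority(-0.2)); // 0
console.log(toPriority(NaN));  // 0.5
```

The cast is needed because TypeScript cannot prove that the rounded number is one of the eleven literals; the clamping and rounding are what make it safe.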

**Image Metadata Structure**

```typescript { .api }
interface ImageEntry {
  /** Image URL location */
  loc: string | URL;
  /** Image caption text */
  caption?: string;
  /** Geographic location information */
  geoLocation?: string;
  /** Image title */
  title?: string;
  /** License URL */
  license?: string | URL;
}
```

**Video Metadata Structure**

```typescript { .api }
interface VideoEntry {
  /** Video title (required) */
  title: string;
  /** Video thumbnail URL (required) */
  thumbnail_loc: string | URL;
  /** Video description (required) */
  description: string;
  /** Direct video content URL */
  content_loc?: string | URL;
  /** Video player page URL */
  player_loc?: string | URL;
  /** Video duration in seconds */
  duration?: number;
  /** Video expiration date */
  expiration_date?: Date | string;
  /** Video rating (0.0 to 5.0) */
  rating?: number;
  /** View count */
  view_count?: number;
  /** Publication date */
  publication_date?: Date | string;
  /** Family-friendly flag */
  family_friendly?: 'yes' | 'no' | boolean;
  /** Geographic restrictions */
  restriction?: Restriction;
  /** Platform restrictions */
  platform?: Platform;
  /** Pricing information */
  price?: PriceEntry[];
  /** Subscription requirement */
  requires_subscription?: 'yes' | 'no' | boolean;
  /** Uploader information */
  uploader?: {
    uploader: string;
    info?: string | URL;
  };
  /** Live content indicator */
  live?: 'yes' | 'no' | boolean;
  /** Content tags */
  tag?: string | string[];
}

interface Restriction {
  relationship: 'allow' | 'deny';
  restriction: string;
}

interface Platform {
  relationship: 'allow' | 'deny';
  platform: string;
}

interface PriceEntry {
  price?: number | string;
  currency?: string;
  type?: 'rent' | 'purchase' | 'package' | 'subscription';
}
```
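
Video entries discovered from HTML may be incomplete, so it can help to check them against the constraints documented above (three required fields, rating between 0.0 and 5.0) before adding them to a sitemap. The following is a rough sketch under that assumption. `VideoEntryLike` and `findVideoEntryProblems` are hypothetical names, and the check covers only the constraints stated in this section.

```typescript
// Loosened local mirror of the fields checked below, redeclared so the
// sketch runs on its own. Everything is optional because the input may
// be an incomplete, discovered entry.
interface VideoEntryLike {
  title?: string;
  thumbnail_loc?: string | URL;
  description?: string;
  rating?: number;
}

// Hypothetical helper: return a list of human-readable problems.
// An empty array means the documented constraints are satisfied.
function findVideoEntryProblems(entry: VideoEntryLike): string[] {
  const problems: string[] = [];
  if (!entry.title) problems.push('title is required');
  if (!entry.thumbnail_loc) problems.push('thumbnail_loc is required');
  if (!entry.description) problems.push('description is required');
  if (entry.rating !== undefined && (entry.rating < 0 || entry.rating > 5)) {
    problems.push('rating must be between 0.0 and 5.0');
  }
  return problems;
}

const incomplete = findVideoEntryProblems({
  title: 'My Blog Post',
  thumbnail_loc: '/videos/demo-thumb.jpg',
  rating: 7,
});
console.log(incomplete);
// [ 'description is required', 'rating must be between 0.0 and 5.0' ]
```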

**Alternative URL Structure**

```typescript { .api }
interface AlternativeEntry {
  /** Language/locale code (hreflang attribute) */
  hreflang: string;
  /** Alternative URL */
  href: string | URL;
}
```

**Google News Structure**

```typescript { .api }
interface GoogleNewsEntry {
  /** News article title */
  title: string;
  /** Article publication date in W3C format */
  publication_date: Date | string;
  /** Publication information */
  publication: {
    /** Publication name as it appears on news.google.com */
    name: string;
    /** Publication language (ISO 639 code) */
    language: string;
  };
}
```

**Usage Examples:**

```typescript
import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';

// Parse an existing XML sitemap
const xmlContent = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-12-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>invalid-date</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>`;

// parseSitemapXml returns a Promise, so the result must be awaited
const result = await parseSitemapXml(xmlContent);

console.log(result.urls);
// [
//   {
//     loc: 'https://example.com/',
//     lastmod: '2023-12-01',
//     changefreq: 'daily',
//     priority: 1.0
//   },
//   {
//     loc: 'https://example.com/about',
//     priority: 0.8
//   }
// ]

console.log(result.warnings);
// [
//   {
//     type: 'validation',
//     message: 'Invalid lastmod date: invalid-date',
//     context: { url: 'https://example.com/about' }
//   }
// ]

// Extract metadata from HTML content
const htmlContent = `
<!DOCTYPE html>
<html>
<head>
  <title>My Blog Post</title>
  <meta property="og:image" content="https://example.com/hero.jpg">
  <meta property="article:published_time" content="2023-12-01T10:00:00Z">
</head>
<body>
  <h1>My Blog Post</h1>
  <img src="/images/diagram.png" alt="Technical diagram">
  <video src="/videos/demo.mp4" poster="/videos/demo-thumb.jpg">
    <source src="/videos/demo.mp4" type="video/mp4">
  </video>
</body>
</html>
`;

const metadata = parseHtmlExtractSitemapMeta(htmlContent);

console.log(metadata);
// [
//   {
//     images: [
//       {
//         loc: 'https://example.com/hero.jpg',
//         title: 'My Blog Post'
//       },
//       {
//         loc: '/images/diagram.png',
//         caption: 'Technical diagram'
//       }
//     ],
//     videos: [
//       {
//         title: 'My Blog Post',
//         content_loc: '/videos/demo.mp4',
//         thumbnail_loc: '/videos/demo-thumb.jpg'
//       }
//     ],
//     lastmod: '2023-12-01T10:00:00Z'
//   }
// ]

// Handle parsing errors gracefully
// Deliberately truncated XML, used here only to exercise the error path
const invalidXml = '<urlset><url><loc>https://example.com/</loc>';

try {
  // Without `await`, a rejected Promise would bypass the catch block
  const result = await parseSitemapXml(invalidXml);

  // Process results
  result.urls.forEach(url => {
    console.log(`Processing URL: ${url.loc}`);
  });

  // Handle warnings
  if (result.warnings.length > 0) {
    console.warn('Parsing warnings:');
    result.warnings.forEach(warning => {
      console.warn(`- ${warning.type}: ${warning.message}`);
    });
  }
} catch (error) {
  console.error('Failed to parse sitemap XML:', error);
}

// Integration with existing sitemap generation

// Parse competitor's sitemap for analysis
// ($fetch is Nuxt's global fetch helper; request the body as text)
const competitorSitemap = await $fetch('https://competitor.com/sitemap.xml', { responseType: 'text' });
const parsed = await parseSitemapXml(competitorSitemap);

// Use parsed data to inform your sitemap structure
const competitorUrls = parsed.urls.map(url => ({
  loc: url.loc.replace('competitor.com', 'mysite.com'),
  priority: Math.max(0.1, (url.priority || 0.5) - 0.1) // Slightly lower priority
}));

// Extract metadata from rendered pages for automatic discovery
const pageHtml = await $fetch('https://mysite.com/blog/post-1', { responseType: 'text' });
const extractedMeta = parseHtmlExtractSitemapMeta(pageHtml);

// Combine with existing sitemap data
const enrichedUrl = {
  ...extractedMeta[0], // Use discovered metadata
  // Explicit fields come after the spread so they are not overwritten
  loc: '/blog/post-1',
  priority: 0.8
};
```
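
Once a sitemap has been parsed, a common follow-up is to keep only entries modified after some cutoff. The sketch below shows one way to do that with entries shaped like the parsed output above. `UrlWithLastmod` and `modifiedSince` are hypothetical names, and entries with a missing or unparseable `lastmod` are simply dropped, which is a design choice of this example rather than library behavior.

```typescript
// Minimal local shape covering only the fields this sketch needs.
interface UrlWithLastmod {
  loc: string;
  lastmod?: string | Date;
}

// Hypothetical helper: keep entries whose lastmod is at or after `cutoff`.
// Entries without a lastmod, or with an unparseable one, are excluded.
function modifiedSince<T extends UrlWithLastmod>(urls: T[], cutoff: Date): T[] {
  return urls.filter((url) => {
    if (url.lastmod === undefined) {
      return false;
    }
    const time =
      url.lastmod instanceof Date ? url.lastmod.getTime() : Date.parse(url.lastmod);
    return !isNaN(time) && time >= cutoff.getTime();
  });
}

// Sample data made up for illustration only.
const sampleUrls: UrlWithLastmod[] = [
  { loc: 'https://example.com/', lastmod: '2023-12-01' },
  { loc: 'https://example.com/old', lastmod: '2023-01-15' },
  { loc: 'https://example.com/about' },
  { loc: 'https://example.com/broken', lastmod: 'invalid-date' },
];

const recent = modifiedSince(sampleUrls, new Date('2023-06-01T00:00:00Z'));
console.log(recent.map((url) => url.loc));
// [ 'https://example.com/' ]
```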

## Parsing Features

**XML Sitemap Support**

The XML parser supports:
- Standard sitemap XML format (sitemaps.org schema)
- Sitemap index files with nested sitemap references
- Image extensions (Google Image sitemaps)
- Video extensions (Google Video sitemaps)
- News extensions (Google News sitemaps)
- Alternative/hreflang entries for i18n
- Validation with detailed warning reporting
- Malformed XML handling with graceful degradation
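
In the sitemaps.org protocol, a regular sitemap has a `<urlset>` root element and a sitemap index has a `<sitemapindex>` root. When routing documents before parsing, a quick check of the root element can be enough. The sketch below does this with a regular expression. `detectSitemapKind` is a hypothetical helper and says nothing about how the package itself tells the two apart; regex sniffing is also not a substitute for real XML parsing.

```typescript
type SitemapKind = 'urlset' | 'sitemapindex' | 'unknown';

// Hypothetical helper: guess the document kind from its root element name.
// Strips the XML declaration and comments, then reads the first element,
// ignoring an optional namespace prefix such as `sm:`.
function detectSitemapKind(xml: string): SitemapKind {
  const stripped = xml
    .replace(/<\?xml[^>]*\?>/g, '')
    .replace(/<!--[\s\S]*?-->/g, '');
  const match = stripped.match(/<(?:[\w.-]+:)?([\w.-]+)/);
  const root = match ? match[1] : '';
  if (root === 'urlset' || root === 'sitemapindex') {
    return root;
  }
  return 'unknown';
}

const regularXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>`;

const indexXml = `<?xml version="1.0" encoding="UTF-8"?>
<!-- generated index -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>`;

console.log(detectSitemapKind(regularXml)); // urlset
console.log(detectSitemapKind(indexXml));   // sitemapindex
console.log(detectSitemapKind('<html></html>')); // unknown
```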

**HTML Metadata Discovery**

The HTML parser extracts:
- Open Graph image and video metadata
- Structured data (JSON-LD, microdata)
- HTML img and video elements
- Meta tags for publication dates and modification times
- Link tags for alternative versions
- Title and description elements
- Validation of discovered URLs and content

**Error Handling and Validation**

Both parsers provide:
- Comprehensive validation of URLs, dates, and numeric values
- Warning collection for non-fatal parsing issues
- Graceful handling of malformed or incomplete data
- Context information for debugging parsing issues
- Type-safe output with proper TypeScript interfaces