Tessl Tile for npm/@nuxtjs/sitemap@7.4.0

# XML and HTML Utilities

Utility functions for parsing existing XML sitemaps and for extracting sitemap metadata from HTML documents, intended for analysis and integration.

```typescript { .api }
import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';
import type { SitemapParseResult, SitemapWarning } from '@nuxtjs/sitemap/utils';
```

## Capabilities

### XML Sitemap Parsing

Parse existing XML sitemap content into structured data, with validation and warning reporting.

```typescript { .api }
/**
 * Parse XML sitemap content into structured data
 * Handles both regular sitemaps and sitemap index files
 * @param xml - Raw XML sitemap content as string
 * @returns Promise resolving to parsed sitemap data with URLs and validation warnings
 */
function parseSitemapXml(xml: string): Promise<SitemapParseResult>;

interface SitemapParseResult {
  /** Array of parsed sitemap URLs */
  urls: SitemapUrlInput[];
  /** Array of validation warnings encountered during parsing */
  warnings: SitemapWarning[];
}

interface SitemapWarning {
  /** Type of warning encountered */
  type: 'validation';
  /** Human-readable warning message */
  message: string;
  /** Context information about where the warning occurred */
  context?: {
    url?: string;
    field?: string;
    value?: unknown;
  };
}
```
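
Because each warning may carry a `context.url`, one way to review a parse result is to bucket warnings by the URL they refer to. The following is a minimal sketch, not part of the package: `groupWarningsByUrl` and the sample data are hypothetical, and the `SitemapWarning` interface is copied locally from the declaration above so the snippet stands alone.

```typescript
// Local copy of the SitemapWarning shape documented above (keeps the sketch self-contained).
interface SitemapWarning {
  type: 'validation';
  message: string;
  context?: { url?: string; field?: string; value?: unknown };
}

// Hypothetical helper, not part of @nuxtjs/sitemap:
// bucket warnings by the URL they refer to, with a fallback key for warnings without one.
function groupWarningsByUrl(warnings: SitemapWarning[]): Map<string, SitemapWarning[]> {
  const groups = new Map<string, SitemapWarning[]>();
  for (const warning of warnings) {
    const key = warning.context?.url ?? '(no url)';
    const bucket = groups.get(key) ?? [];
    bucket.push(warning);
    groups.set(key, bucket);
  }
  return groups;
}

// Made-up sample data for illustration only.
const sampleWarnings: SitemapWarning[] = [
  { type: 'validation', message: 'Invalid lastmod', context: { url: 'https://example.com/about', field: 'lastmod' } },
  { type: 'validation', message: 'Invalid priority', context: { url: 'https://example.com/about', field: 'priority' } },
  { type: 'validation', message: 'Entry without context' },
];

const groupedWarnings = groupWarningsByUrl(sampleWarnings);
for (const [url, items] of groupedWarnings) {
  console.log(`${url}: ${items.length} warning(s)`);
}
```

In real use, the array passed in would be the `warnings` field of an awaited `parseSitemapXml` result.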

### HTML Metadata Extraction

Extract sitemap-relevant metadata from HTML documents for automatic discovery and analysis.

```typescript { .api }
/**
 * Extract sitemap metadata from HTML document content
 * Discovers images, videos, and other sitemap-relevant information
 * @param html - Raw HTML content as string
 * @param options - Optional configuration for metadata extraction
 * @returns Array of sitemap URLs with discovered metadata
 */
function parseHtmlExtractSitemapMeta(
  html: string,
  options?: {
    /** Whether to discover images in the HTML content */
    images?: boolean;
    /** Whether to discover videos in the HTML content */
    videos?: boolean;
    /** Whether to extract lastmod information */
    lastmod?: boolean;
    /** Whether to extract alternative language links */
    alternatives?: boolean;
    /** Function to resolve relative URLs to absolute URLs */
    resolveUrl?: (url: string) => string;
  }
): SitemapUrl[];
```
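
The `resolveUrl` option takes a function from a (possibly relative) URL to an absolute one. A minimal sketch of such a callback, built on the standard WHATWG `URL` constructor, could look like this; `createUrlResolver` and the `https://example.com` origin are assumptions made for illustration, not part of the package.

```typescript
// Hypothetical factory, not part of @nuxtjs/sitemap:
// returns a callback with the (url: string) => string shape that `resolveUrl` expects.
function createUrlResolver(siteUrl: string): (url: string) => string {
  return (url: string): string => {
    try {
      // Relative inputs are resolved against siteUrl; absolute inputs pass through unchanged.
      return new URL(url, siteUrl).href;
    } catch {
      // Leave values that cannot be parsed untouched rather than throwing.
      return url;
    }
  };
}

const resolveUrl = createUrlResolver('https://example.com');

console.log(resolveUrl('/images/diagram.png'));
console.log(resolveUrl('https://cdn.example.net/hero.jpg'));
```

The resulting function can then be passed as `{ resolveUrl }` in the options object of `parseHtmlExtractSitemapMeta`.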

### Parsed Data Types

**URL Entry Structure**

```typescript { .api }
interface SitemapUrl {
  /** URL location (required) */
  loc: string;
  /** Last modification date */
  lastmod?: string | Date;
  /** Change frequency indicator */
  changefreq?: Changefreq;
  /** Priority value between 0.0 and 1.0 */
  priority?: 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1;
  /** Alternative language versions */
  alternatives?: AlternativeEntry[];
  /** Google News metadata */
  news?: GoogleNewsEntry;
  /** Associated images */
  images?: ImageEntry[];
  /** Associated videos */
  videos?: VideoEntry[];
}

type Changefreq =
  | 'always'
  | 'hourly'
  | 'daily'
  | 'weekly'
  | 'monthly'
  | 'yearly'
  | 'never';
```
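
Since `priority` is typed as a union of one-decimal literals rather than `number`, computed values need to be snapped to that set before they type-check. The sketch below shows one way to do this; `toPriority` is a hypothetical helper, and falling back to `0.5` for non-finite input is an assumption (it mirrors the default priority in the sitemaps.org protocol).

```typescript
// Local copy of the priority union from SitemapUrl above.
type Priority = 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1;

// Hypothetical helper, not part of @nuxtjs/sitemap:
// clamp to [0, 1] and round to one decimal so the value matches the union.
function toPriority(value: number): Priority {
  if (!Number.isFinite(value)) {
    return 0.5; // assumption: fall back to the protocol's default priority
  }
  const clamped = Math.min(1, Math.max(0, value));
  return (Math.round(clamped * 10) / 10) as Priority;
}

// Floating-point arithmetic such as 0.8 - 0.1 does not yield exactly 0.7; snapping repairs it.
console.log(toPriority(0.8 - 0.1));
console.log(toPriority(1.4));
```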

**Image Metadata Structure**

```typescript { .api }
interface ImageEntry {
  /** Image URL location */
  loc: string | URL;
  /** Image caption text */
  caption?: string;
  /** Geographic location information */
  geoLocation?: string;
  /** Image title */
  title?: string;
  /** License URL */
  license?: string | URL;
}
```
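
Because `loc` may be either a string or a `URL` object, comparing image entries directly can miss duplicates. As a rough sketch (the `dedupeImages` helper and the sample entries are hypothetical), entries can be keyed on the stringified `loc`:

```typescript
// Local copy of the ImageEntry shape documented above.
interface ImageEntry {
  loc: string | URL;
  caption?: string;
  geoLocation?: string;
  title?: string;
  license?: string | URL;
}

// Hypothetical helper, not part of @nuxtjs/sitemap:
// keep the first entry seen for each location, comparing string and URL values alike.
function dedupeImages(images: ImageEntry[]): ImageEntry[] {
  const seen = new Set<string>();
  const unique: ImageEntry[] = [];
  for (const image of images) {
    const key = String(image.loc); // a URL object stringifies to its href
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(image);
  }
  return unique;
}

// Made-up sample data for illustration only.
const uniqueImages = dedupeImages([
  { loc: 'https://example.com/hero.jpg', title: 'Hero' },
  { loc: new URL('https://example.com/hero.jpg') },
  { loc: 'https://example.com/images/diagram.png', caption: 'Technical diagram' },
]);

console.log(uniqueImages.length);
```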

**Video Metadata Structure**

```typescript { .api }
interface VideoEntry {
  /** Video title (required) */
  title: string;
  /** Video thumbnail URL (required) */
  thumbnail_loc: string | URL;
  /** Video description (required) */
  description: string;
  /** Direct video content URL */
  content_loc?: string | URL;
  /** Video player page URL */
  player_loc?: string | URL;
  /** Video duration in seconds */
  duration?: number;
  /** Video expiration date */
  expiration_date?: Date | string;
  /** Video rating (0.0 to 5.0) */
  rating?: number;
  /** View count */
  view_count?: number;
  /** Publication date */
  publication_date?: Date | string;
  /** Family-friendly flag */
  family_friendly?: 'yes' | 'no' | boolean;
  /** Geographic restrictions */
  restriction?: Restriction;
  /** Platform restrictions */
  platform?: Platform;
  /** Pricing information */
  price?: PriceEntry[];
  /** Subscription requirement */
  requires_subscription?: 'yes' | 'no' | boolean;
  /** Uploader information */
  uploader?: {
    uploader: string;
    info?: string | URL;
  };
  /** Live content indicator */
  live?: 'yes' | 'no' | boolean;
  /** Content tags */
  tag?: string | string[];
}

interface Restriction {
  relationship: 'allow' | 'deny';
  restriction: string;
}

interface Platform {
  relationship: 'allow' | 'deny';
  platform: string;
}

interface PriceEntry {
  price?: number | string;
  currency?: string;
  type?: 'rent' | 'purchase' | 'package' | 'subscription';
}
```
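
Video entries discovered from HTML may be incomplete, so it can help to check the three required fields and the documented rating range before using them. This is only a sketch under those assumptions: `checkVideoEntry` and `VideoCandidate` are hypothetical names, and the checks cover nothing beyond what the interface comments above state.

```typescript
// Loosened, local view of VideoEntry: every field optional, as discovered data may be partial.
interface VideoCandidate {
  title?: string;
  thumbnail_loc?: string | URL;
  description?: string;
  rating?: number;
}

// Hypothetical helper, not part of @nuxtjs/sitemap:
// returns a list of human-readable problems; an empty list means the checks passed.
function checkVideoEntry(video: VideoCandidate): string[] {
  const problems: string[] = [];
  if (!video.title) problems.push('title is required');
  if (!video.thumbnail_loc) problems.push('thumbnail_loc is required');
  if (!video.description) problems.push('description is required');
  if (video.rating !== undefined && (video.rating < 0 || video.rating > 5)) {
    problems.push('rating must be between 0.0 and 5.0');
  }
  return problems;
}

// Made-up sample data for illustration only.
console.log(checkVideoEntry({ title: 'Demo', thumbnail_loc: '/videos/demo-thumb.jpg', description: 'Short demo', rating: 4.5 }));
console.log(checkVideoEntry({ title: 'Demo', rating: 7 }));
```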

**Alternative URL Structure**

```typescript { .api }
interface AlternativeEntry {
  /** Language/locale code (hreflang attribute) */
  hreflang: string;
  /** Alternative URL */
  href: string | URL;
}
```

**Google News Structure**

```typescript { .api }
interface GoogleNewsEntry {
  /** News article title */
  title: string;
  /** Article publication date in W3C format */
  publication_date: Date | string;
  /** Publication information */
  publication: {
    /** Publication name as it appears on news.google.com */
    name: string;
    /** Publication language (ISO 639 code) */
    language: string;
  };
}
```
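
`publication_date` accepts either a `Date` or a string in W3C format. One simple way to produce such a string is `Date.prototype.toISOString()`, whose UTC output is a valid W3C datetime. The helper below is a hypothetical sketch, not something the package provides.

```typescript
// Hypothetical helper, not part of @nuxtjs/sitemap:
// normalise a Date or date string into a W3C datetime string (UTC, ISO 8601).
function toW3cDate(value: Date | string): string {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new RangeError(`Invalid date: ${String(value)}`);
  }
  return date.toISOString();
}

console.log(toW3cDate('2023-12-01T10:00:00Z'));
console.log(toW3cDate(new Date(Date.UTC(2023, 11, 1, 10, 0, 0))));
```

Throwing on unparseable input is a design choice for this sketch; collecting a warning instead would be closer to how the parsers described here report problems.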

**Usage Examples:**

```typescript
import { parseSitemapXml, parseHtmlExtractSitemapMeta } from '@nuxtjs/sitemap/utils';

// Parse an existing XML sitemap
const xmlContent = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-12-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>invalid-date</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>`;

// parseSitemapXml returns a Promise, so the result must be awaited
const result = await parseSitemapXml(xmlContent);

console.log(result.urls);
// [
//   {
//     loc: 'https://example.com/',
//     lastmod: '2023-12-01',
//     changefreq: 'daily',
//     priority: 1.0
//   },
//   {
//     loc: 'https://example.com/about',
//     priority: 0.8
//   }
// ]

console.log(result.warnings);
// Illustrative output following the SitemapWarning shape; exact messages may differ
// [
//   {
//     type: 'validation',
//     message: 'Invalid lastmod date: invalid-date',
//     context: { url: 'https://example.com/about', field: 'lastmod', value: 'invalid-date' }
//   }
// ]

// Extract metadata from HTML content
const htmlContent = `
<!DOCTYPE html>
<html>
<head>
  <title>My Blog Post</title>
  <meta property="og:image" content="https://example.com/hero.jpg">
  <meta property="article:published_time" content="2023-12-01T10:00:00Z">
</head>
<body>
  <h1>My Blog Post</h1>
  <img src="/images/diagram.png" alt="Technical diagram">
  <video src="/videos/demo.mp4" poster="/videos/demo-thumb.jpg">
    <source src="/videos/demo.mp4" type="video/mp4">
  </video>
</body>
</html>
`;

// parseHtmlExtractSitemapMeta is synchronous and returns an array
const metadata = parseHtmlExtractSitemapMeta(htmlContent);

console.log(metadata);
// [
//   {
//     images: [
//       {
//         loc: 'https://example.com/hero.jpg',
//         title: 'My Blog Post'
//       },
//       {
//         loc: '/images/diagram.png',
//         caption: 'Technical diagram'
//       }
//     ],
//     videos: [
//       {
//         title: 'My Blog Post',
//         content_loc: '/videos/demo.mp4',
//         thumbnail_loc: '/videos/demo-thumb.jpg'
//       }
//     ],
//     lastmod: '2023-12-01T10:00:00Z'
//   }
// ]

// Handle parsing errors gracefully
// Placeholder malformed input (unclosed tags), for illustration only
const invalidXml = '<urlset><url><loc>https://example.com/</loc>';

try {
  const result = await parseSitemapXml(invalidXml);

  // Process results
  result.urls.forEach(url => {
    console.log(`Processing URL: ${url.loc}`);
  });

  // Handle warnings
  if (result.warnings.length > 0) {
    console.warn('Parsing warnings:');
    result.warnings.forEach(warning => {
      console.warn(`- ${warning.type}: ${warning.message}`);
    });
  }
} catch (error) {
  console.error('Failed to parse sitemap XML:', error);
}

// Integration with existing sitemap generation

// Parse competitor's sitemap for analysis ($fetch is Nuxt's global fetch helper)
const competitorSitemap = await $fetch<string>('https://competitor.com/sitemap.xml', { responseType: 'text' });
const parsed = await parseSitemapXml(competitorSitemap);

// Use parsed data to inform your sitemap structure
const competitorUrls = parsed.urls.map(url => ({
  loc: url.loc.replace('competitor.com', 'mysite.com'),
  // Slightly lower priority, rounded to one decimal to avoid floating-point drift
  priority: Math.max(0.1, Math.round(((url.priority || 0.5) - 0.1) * 10) / 10)
}));

// Extract metadata from rendered pages for automatic discovery
const pageHtml = await $fetch<string>('https://mysite.com/blog/post-1', { responseType: 'text' });
const extractedMeta = parseHtmlExtractSitemapMeta(pageHtml);

// Combine with existing sitemap data
const enrichedUrl = {
  loc: '/blog/post-1',
  ...extractedMeta[0], // Use discovered metadata
  priority: 0.8
};
```

## Parsing Features

**XML Sitemap Support**

The XML parser supports:
- Standard sitemap XML format (sitemaps.org schema)
- Sitemap index files with nested sitemap references
- Image extensions (Google Image sitemaps)
- Video extensions (Google Video sitemaps)
- News extensions (Google News sitemaps)
- Alternative/hreflang entries for i18n
- Validation with detailed warning reporting
- Malformed XML handling with graceful degradation

**HTML Metadata Discovery**

The HTML parser extracts:
- Open Graph image and video metadata
- Structured data (JSON-LD, microdata)
- HTML img and video elements
- Meta tags for publication dates and modification times
- Link tags for alternative versions
- Title and description elements
- Validation of discovered URLs and content

**Error Handling and Validation**

Both parsers provide:
- Comprehensive validation of URLs, dates, and numeric values
- Warning collection for non-fatal parsing issues
- Graceful handling of malformed or incomplete data
- Context information for debugging parsing issues
- Type-safe output with proper TypeScript interfaces
