Tessl Tile for npm/metascraper-url@5.49.0

# metascraper-url

metascraper-url is a plugin for the metascraper library that extracts URL information from HTML markup. It identifies the canonical URL of a webpage by checking, in priority order, the OpenGraph `og:url` meta tag, the Twitter `twitter:url` meta tags, the canonical link element, and the `x-default` alternate hreflang link, falling back to the input URL when none of them yields a result.

## Package Information

- **Package Name**: metascraper-url
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-url`

## Core Imports

```javascript
const metascraperUrl = require("metascraper-url");
```

For ES modules:

```javascript
import metascraperUrl from "metascraper-url";
```

## Basic Usage

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");

// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);

// Extract URL from HTML
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/canonical" />
      <link rel="canonical" href="https://example.com/alternate" />
    </head>
  </html>
`;

scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });
```

## Architecture

metascraper-url follows the standard metascraper plugin pattern:

- **Factory Function**: The main export is a factory function that returns a rules object
- **Rules Object**: Contains URL extraction rules and package metadata
- **Rule Chain**: Multiple extraction strategies are tried in priority order
- **Fallback Strategy**: Returns the input URL if no extraction rule succeeds
- **Helper Integration**: Uses @metascraper/helpers for URL validation and rule creation
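
The rule chain and fallback strategy described above can be sketched without any dependencies. This is a minimal model of the idea, not metascraper's actual rule runner: `firstMatch` and the stand-in rules are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: evaluate rules in order and return the first usable value.
// This models the concept only; metascraper's real rule runner is not shown here.
function firstMatch(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (typeof value === "string" && value.length > 0) return value;
  }
  return undefined;
}

// Stand-in rules: the first two find nothing, the last one is the fallback.
const rules = [
  () => undefined,   // e.g. no og:url tag present
  () => null,        // e.g. no canonical link present
  ({ url }) => url   // fallback: echo the input URL
];

const result = firstMatch(rules, { url: "https://example.com/original" });
console.log(result);
```

Because the fallback rule simply echoes its input, a chain that ends with it yields a URL whenever the caller supplies one.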

## Capabilities

### URL Extraction Factory

Creates a metascraper rules object for URL extraction from HTML markup.

```javascript { .api }
/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}
```
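
To make the `Rules` and `RuleOptions` shapes above more concrete, here is a small dependency-free sketch. It assumes a stub in place of Cheerio: `makeStubDom` and `exampleUrlRules` are hypothetical names, and the stub supports only the `$(selector).attr(name)` call used here, not the real Cheerio API.

```javascript
// Hypothetical stand-in for a Cheerio instance: maps a selector to its attributes.
// A real `htmlDom` is a full Cheerio API; this stub only supports `$(sel).attr(name)`.
function makeStubDom(elements) {
  return (selector) => ({
    attr: (name) => (elements[selector] || {})[name]
  });
}

// A factory whose return value has the same shape as `Rules` (illustrative only).
function exampleUrlRules() {
  return {
    url: [
      ({ htmlDom: $ }) => $('meta[property="og:url"]').attr("content"),
      ({ url }) => url
    ],
    pkgName: "example-url-rules"
  };
}

const rulesObject = exampleUrlRules();
const htmlDom = makeStubDom({
  'meta[property="og:url"]': { content: "https://example.com/canonical" }
});

// Each rule receives the same options object and returns a string or nothing.
const [ogRule, fallbackRule] = rulesObject.url;
const fromOg = ogRule({ htmlDom, url: "https://example.com/original" });
const fromFallback = fallbackRule({ htmlDom, url: "https://example.com/original" });
console.log(fromOg, fromFallback);
```

Keeping each rule a plain function of `{ htmlDom, url }` means rules can be exercised in isolation, as done here, without running a full scrape.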

### URL Extraction Rules

The plugin implements the following extraction strategies in order of priority:

#### OpenGraph URL Rule
Extracts URL from OpenGraph meta tag (`og:url`).

```javascript { .api }
// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)
```

#### Twitter URL Rules
Extracts URL from Twitter Card meta tags.

```javascript { .api }
// Twitter name attribute: meta[name="twitter:url"]
// Twitter property attribute: meta[property="twitter:url"]
// Attribute: content
// Priority: 2-3
```

#### Canonical Link Rule
Extracts URL from canonical link element.

```javascript { .api }
// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4
```

#### Alternate Hreflang Rule
Extracts URL from alternate hreflang link with x-default.

```javascript { .api }
// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5
```

#### Fallback Rule
Returns the input URL as final fallback.

```javascript { .api }
// Implementation: ({ url }) => url
// Priority: 6 (lowest)
```
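
The priority order listed above can be summarized as data. The sketch below is an illustration under stated assumptions rather than the plugin's source: `URL_SOURCES` and `resolveUrl` are hypothetical names, and `found` is a plain object standing in for the results of DOM queries.

```javascript
// Selectors and attributes as listed above, highest priority first.
const URL_SOURCES = [
  { selector: 'meta[property="og:url"]', attribute: "content" },
  { selector: 'meta[name="twitter:url"]', attribute: "content" },
  { selector: 'meta[property="twitter:url"]', attribute: "content" },
  { selector: 'link[rel="canonical"]', attribute: "href" },
  { selector: 'link[rel="alternate"][hreflang="x-default"]', attribute: "href" }
];

// Hypothetical helper: `found` maps a selector to the attribute value a DOM
// query would have returned. Reports the winning source and its 1-based priority.
function resolveUrl(found, inputUrl) {
  for (let i = 0; i < URL_SOURCES.length; i++) {
    const value = found[URL_SOURCES[i].selector];
    if (value) {
      return { url: value, priority: i + 1, source: URL_SOURCES[i].selector };
    }
  }
  // Nothing matched: the input URL is the lowest-priority answer.
  return { url: inputUrl, priority: URL_SOURCES.length + 1, source: "fallback" };
}

// No og:url here, so the twitter:url name attribute outranks the canonical link.
const winner = resolveUrl(
  {
    'meta[name="twitter:url"]': "https://example.com/twitter-url",
    'link[rel="canonical"]': "https://example.com/canonical-url"
  },
  "https://example.com/fallback"
);
console.log(winner.priority, winner.url);
```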

## Usage Examples

The examples below reuse the `scraper` instance from Basic Usage and assume an async context (an ES module or an `async` function) for `await`.

### With Multiple Meta Tags

```javascript
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/og-url" />
      <meta name="twitter:url" content="https://example.com/twitter-url" />
      <link rel="canonical" href="https://example.com/canonical-url" />
    </head>
  </html>
`;

// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"
```

### With Only Canonical Link

```javascript
const html = `
  <html>
    <head>
      <link rel="canonical" href="https://example.com/canonical" />
    </head>
  </html>
`;

// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"
```

### With No URL Meta Tags

```javascript
const html = `
  <html>
    <head>
      <title>Page Title</title>
    </head>
  </html>
`;

// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"
```

### Using with Other metascraper Plugins

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");

const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);

// `html` is the page markup and `url` is the address it was fetched from
const metadata = await scraper({ html, url });
// metadata.url, metadata.title, metadata.description all extracted
```

## Error Handling

- Individual extraction rules return `null` or `undefined` when they fail to find a valid URL
- Rules are tried sequentially until one succeeds
- URL validation and normalization are handled by `@metascraper/helpers`
- Malformed URLs are automatically filtered out
- The fallback rule ensures a URL is always returned (the input URL)
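
As a rough illustration of how malformed values can be filtered out, the sketch below validates candidates with the standard WHATWG `URL` constructor. This is an assumption made for illustration: `toValidUrl` is a hypothetical helper, and the checks that `@metascraper/helpers` actually performs may differ.

```javascript
// Hypothetical validator: returns a normalized absolute http(s) URL, or undefined.
// An assumption for illustration; @metascraper/helpers may apply different rules.
function toValidUrl(value, baseUrl) {
  if (typeof value !== "string" || value.trim() === "") return undefined;
  try {
    // With a base URL, relative values such as "/page" are resolved against it.
    const parsed = new URL(value.trim(), baseUrl);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return undefined;
    return parsed.href;
  } catch (error) {
    return undefined; // the URL parser rejected the value
  }
}

const candidates = ["not a url", "javascript:void(0)", "https://example.com/page"];
const valid = candidates.map((candidate) => toValidUrl(candidate)).filter(Boolean);
console.log(valid);
```

A rule that returns `undefined` for a rejected value lets the next rule in the chain run, which is how a malformed tag ends up being skipped rather than returned.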
