Get url property from HTML markup using metascraper rules
npx @tessl/cli install tessl/npm-metascraper-url@5.49.00
# metascraper-url

metascraper-url is a plugin for the metascraper library that extracts URL information from HTML markup. It implements multiple extraction strategies to identify the canonical URL of a webpage by checking for OpenGraph url meta tags, Twitter url meta tags, canonical link elements, and alternate hreflang links.

## Package Information

- **Package Name**: metascraper-url
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-url`

## Core Imports

```javascript
const metascraperUrl = require("metascraper-url");
```

For ES modules:

```javascript
import metascraperUrl from "metascraper-url";
```

## Basic Usage

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");

// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);

// Extract URL from HTML
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/canonical" />
      <link rel="canonical" href="https://example.com/alternate" />
    </head>
  </html>
`;

scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });
```

## Architecture

metascraper-url follows the standard metascraper plugin pattern:

- **Factory Function**: The main export is a factory function that returns a rules object
- **Rules Object**: Contains URL extraction rules and package metadata
- **Rule Chain**: Multiple extraction strategies tried in priority order
- **Fallback Strategy**: Returns input URL if no extraction rules succeed
- **Helper Integration**: Uses @metascraper/helpers for URL validation and rule creation

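
The pattern above can be sketched without any dependencies. This is a minimal illustration, not the plugin's actual source: `exampleUrlPlugin`, `firstMatch`, and `fakeDom` are hypothetical names introduced here, and real metascraper rules receive a Cheerio instance as `htmlDom` rather than the plain lookup function used below.

```javascript
// Hypothetical stand-in for a plugin factory: returns a rules object
// whose `url` array is ordered from highest to lowest priority.
function exampleUrlPlugin() {
  const rules = {
    url: [
      // Each rule receives { htmlDom, url } and returns a string or a falsy value.
      ({ htmlDom }) => htmlDom('meta[property="og:url"]'),
      ({ htmlDom }) => htmlDom('link[rel="canonical"]'),
      // Fallback: the input URL itself.
      ({ url }) => url
    ]
  };
  rules.pkgName = "example-url-plugin";
  return rules;
}

// Hypothetical runner: try each rule in order and keep the first truthy result.
function firstMatch(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value;
  }
  return null;
}

// Stubbed DOM lookup (assumption): maps a selector straight to an attribute value.
const fakeDom = values => selector => values[selector];

const plugin = exampleUrlPlugin();
const input = "https://example.com/original";

const fromCanonical = firstMatch(plugin.url, {
  htmlDom: fakeDom({ 'link[rel="canonical"]': "https://example.com/canonical" }),
  url: input
});
const fromFallback = firstMatch(plugin.url, { htmlDom: fakeDom({}), url: input });

console.log(fromCanonical); // "https://example.com/canonical"
console.log(fromFallback); // "https://example.com/original"
```

Because the fallback rule sits last in the array, the chain yields a value whenever an input URL is supplied.
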
## Capabilities

### URL Extraction Factory

Creates a metascraper rules object for URL extraction from HTML markup.

```javascript { .api }
/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}
```

### URL Extraction Rules

The plugin implements the following extraction strategies in order of priority:

#### OpenGraph URL Rule
Extracts URL from OpenGraph meta tag (`og:url`).

```javascript { .api }
// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)
```

#### Twitter URL Rules
Extracts URL from Twitter Card meta tags.

```javascript { .api }
// Twitter name attribute: meta[name="twitter:url"]
// Twitter property attribute: meta[property="twitter:url"]
// Attribute: content
// Priority: 2-3
```

#### Canonical Link Rule
Extracts URL from canonical link element.

```javascript { .api }
// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4
```

#### Alternate Hreflang Rule
Extracts URL from alternate hreflang link with x-default.

```javascript { .api }
// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5
```

#### Fallback Rule
Returns the input URL as final fallback.

```javascript { .api }
// Implementation: ({ url }) => url
// Priority: 6 (lowest)
```

## Usage Examples

### With Multiple Meta Tags

```javascript
// Assumes `scraper` from Basic Usage and an enclosing async function
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/og-url" />
      <meta name="twitter:url" content="https://example.com/twitter-url" />
      <link rel="canonical" href="https://example.com/canonical-url" />
    </head>
  </html>
`;

// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"
```

### With Only Canonical Link

```javascript
// Assumes `scraper` from Basic Usage and an enclosing async function
const html = `
  <html>
    <head>
      <link rel="canonical" href="https://example.com/canonical" />
    </head>
  </html>
`;

// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"
```

### With No URL Meta Tags

```javascript
// Assumes `scraper` from Basic Usage and an enclosing async function
const html = `
  <html>
    <head>
      <title>Page Title</title>
    </head>
  </html>
`;

// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"
```

### Using with Other metascraper Plugins

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");

const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);

// `html` and `url` are the page markup and its address, supplied by the caller;
// run inside an async function
const metadata = await scraper({ html, url });
// metadata.url, metadata.title, metadata.description all extracted
```

## Error Handling

- Individual extraction rules return `null` or `undefined` when they fail to find a valid URL
- Rules are tried sequentially until one succeeds
- URL validation and normalization are handled by `@metascraper/helpers`
- Malformed URLs are automatically filtered out
- The fallback rule ensures a URL is always returned (the input URL)
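
The filtering behavior listed above can be approximated with the WHATWG `URL` constructor that Node.js provides globally. This is a simplified sketch under stated assumptions, not the logic used by `@metascraper/helpers`: `isHttpUrl` and `resolveUrl` are hypothetical names, and the sketch accepts only absolute http(s) URLs without attempting any normalization.

```javascript
// Simplified validity check (assumption): accept only absolute http(s) URLs.
function isHttpUrl(candidate) {
  if (typeof candidate !== "string") return false;
  try {
    const { protocol } = new URL(candidate);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // new URL() throws a TypeError on unparseable input
  }
}

// Hypothetical resolver: the first valid candidate wins, otherwise the input URL.
function resolveUrl(candidates, inputUrl) {
  return candidates.find(isHttpUrl) ?? inputUrl;
}

console.log(
  resolveUrl([undefined, "not a url", "https://example.com/page"], "https://example.com/in")
); // "https://example.com/page"

console.log(
  resolveUrl([null, "javascript:void(0)"], "https://example.com/in")
); // "https://example.com/in"
```

Rejected candidates are simply skipped rather than raising an error, which mirrors the list above: a failed rule yields nothing and the next rule gets its turn.
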