# metascraper-url

metascraper-url is a plugin for the metascraper library that extracts URL information from HTML markup. It implements multiple extraction strategies to identify the canonical URL of a webpage by checking for OpenGraph url meta tags, Twitter url meta tags, canonical link elements, and alternate hreflang links.

## Package Information

- **Package Name**: metascraper-url
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-url`

## Core Imports

```javascript
const metascraperUrl = require("metascraper-url");
```

For ES modules:

```javascript
import metascraperUrl from "metascraper-url";
```

## Basic Usage

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");

// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);

// Extract URL from HTML
const html = `
<html>
  <head>
    <meta property="og:url" content="https://example.com/canonical" />
    <link rel="canonical" href="https://example.com/alternate" />
  </head>
</html>
`;

scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });
```

## Architecture

metascraper-url follows the standard metascraper plugin pattern:

- **Factory Function**: The main export is a factory function that returns a rules object
- **Rules Object**: Contains URL extraction rules and package metadata
- **Rule Chain**: Multiple extraction strategies are tried in priority order
- **Fallback Strategy**: Returns the input URL if no extraction rule succeeds
- **Helper Integration**: Uses @metascraper/helpers for URL validation and rule creation
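
The pattern in the list above can be sketched without any dependencies. This is a simplified illustration, not the package's source: `examplePlugin`, `runRules`, and the `ctx.canonical` field are hypothetical names, and the context object stands in for the parsed DOM that real rules receive.

```javascript
// Hypothetical plugin: a factory that returns a rules object.
function examplePlugin() {
  const rules = {
    url: [
      // Rules run in array order; each returns a value or null/undefined.
      (ctx) => ctx.canonical,
      // Fallback rule: hand back the input URL.
      (ctx) => ctx.url,
    ],
  };
  rules.pkgName = "example-plugin"; // metadata, not a rule list
  return rules;
}

// Hypothetical rule runner: for each property, keep the first non-empty result.
function runRules(ruleSet, ctx) {
  const result = {};
  for (const [prop, rules] of Object.entries(ruleSet)) {
    if (!Array.isArray(rules)) continue; // skip metadata such as pkgName
    for (const rule of rules) {
      const value = rule(ctx);
      if (value !== null && value !== undefined && value !== "") {
        result[prop] = value;
        break;
      }
    }
  }
  return result;
}

const withCanonical = runRules(examplePlugin(), {
  canonical: "https://example.com/canonical",
  url: "https://example.com/original",
});
const withoutCanonical = runRules(examplePlugin(), {
  url: "https://example.com/original",
});

console.log(withCanonical.url); // "https://example.com/canonical"
console.log(withoutCanonical.url); // "https://example.com/original"
```

Because the fallback rule sits last in the array, the `url` property resolves to the input URL whenever every earlier rule comes back empty.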

## Capabilities

### URL Extraction Factory

Creates a metascraper rules object for URL extraction from HTML markup.

```typescript { .api }
/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
declare function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}
```

### URL Extraction Rules

The plugin implements the following extraction strategies in order of priority:

#### OpenGraph URL Rule
Extracts URL from OpenGraph meta tag (`og:url`).

```javascript { .api }
// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)
```

#### Twitter URL Rules
Extracts URL from Twitter Card meta tags.

```javascript { .api }
// Twitter name attribute: meta[name="twitter:url"]
// Twitter property attribute: meta[property="twitter:url"]
// Attribute: content
// Priority: 2-3
```

#### Canonical Link Rule
Extracts URL from canonical link element.

```javascript { .api }
// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4
```

#### Alternate Hreflang Rule
Extracts URL from alternate hreflang link with x-default.

```javascript { .api }
// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5
```

#### Fallback Rule
Returns the input URL as final fallback.

```javascript { .api }
// Implementation: ({ url }) => url
// Priority: 6 (lowest)
```
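
The priority order listed above can also be written as data. The following is a minimal sketch under stated assumptions: `URL_SOURCES`, `pickUrl`, and the `found` map are hypothetical, and the map stands in for the attribute values a DOM query would return for each selector.

```javascript
// Documented lookups, highest priority first.
const URL_SOURCES = [
  { selector: 'meta[property="og:url"]', attribute: "content" },
  { selector: 'meta[name="twitter:url"]', attribute: "content" },
  { selector: 'meta[property="twitter:url"]', attribute: "content" },
  { selector: 'link[rel="canonical"]', attribute: "href" },
  { selector: 'link[rel="alternate"][hreflang="x-default"]', attribute: "href" },
];

// `found` is a hypothetical Map of selector -> attributes, standing in for DOM queries.
function pickUrl(found, inputUrl) {
  for (const { selector, attribute } of URL_SOURCES) {
    const value = found.get(selector)?.[attribute];
    if (value) return value; // first match wins
  }
  return inputUrl; // priority 6: fall back to the input URL
}

const found = new Map([
  ['meta[name="twitter:url"]', { content: "https://example.com/twitter-url" }],
  ['link[rel="canonical"]', { href: "https://example.com/canonical-url" }],
]);

const picked = pickUrl(found, "https://example.com/fallback");
const fallback = pickUrl(new Map(), "https://example.com/fallback");

console.log(picked); // "https://example.com/twitter-url"
console.log(fallback); // "https://example.com/fallback"
```

With no `og:url` present, the Twitter tag (priority 2) wins over the canonical link (priority 4), and an empty page falls through to the input URL.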

## Usage Examples

### With Multiple Meta Tags

```javascript
// Run inside an async function or an ES module; `scraper` is the instance from Basic Usage
const html = `
<html>
  <head>
    <meta property="og:url" content="https://example.com/og-url" />
    <meta name="twitter:url" content="https://example.com/twitter-url" />
    <link rel="canonical" href="https://example.com/canonical-url" />
  </head>
</html>
`;

// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"
```

### With Only Canonical Link

```javascript
const html = `
<html>
  <head>
    <link rel="canonical" href="https://example.com/canonical" />
  </head>
</html>
`;

// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"
```

### With No URL Meta Tags

```javascript
const html = `
<html>
  <head>
    <title>Page Title</title>
  </head>
</html>
`;

// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"
```

### Using with Other metascraper Plugins

```javascript
const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");

const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);

// `html` and `url` are the page markup and its address, obtained elsewhere
const metadata = await scraper({ html, url });
// metadata.url, metadata.title, metadata.description all extracted
```

## Error Handling

- Individual extraction rules return `null` or `undefined` when they fail to find a valid URL
- Rules are tried sequentially until one succeeds
- URL validation and normalization are handled by `@metascraper/helpers`
- Malformed URLs are automatically filtered out
- The fallback rule ensures a URL is always returned (the input URL)
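
To illustrate what filtering out malformed URLs can look like, here is a rough sketch built on the standard WHATWG `URL` class. It is an assumption for illustration only and not the `@metascraper/helpers` implementation; `toValidUrl` is a hypothetical name, and restricting results to `http:` and `https:` is a choice made for this example.

```javascript
// Hypothetical helper: returns an absolute http(s) URL string,
// or null when the candidate cannot be used.
function toValidUrl(candidate, baseUrl) {
  if (typeof candidate !== "string" || candidate.trim() === "") return null;
  try {
    // Relative candidates are resolved against the page URL.
    const parsed = new URL(candidate.trim(), baseUrl);
    const isWeb = parsed.protocol === "http:" || parsed.protocol === "https:";
    return isWeb ? parsed.href : null;
  } catch {
    // The URL constructor throws on input it cannot parse.
    return null;
  }
}

const base = "https://example.com/original";

console.log(toValidUrl("/page", base)); // "https://example.com/page"
console.log(toValidUrl("https://[bad", base)); // null
console.log(toValidUrl("javascript:alert(1)", base)); // null
console.log(toValidUrl(undefined, base)); // null
```

A rule wrapped this way returns `null` for unusable values, so the chain moves on to the next rule and eventually reaches the fallback.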