Get url property from HTML markup using metascraper rules
tessl install tessl/npm-metascraper-url@5.49.0

metascraper-url is a plugin for the metascraper library that extracts URL information from HTML markup. It implements multiple extraction strategies to identify the canonical URL of a webpage by checking for OpenGraph url meta tags, Twitter url meta tags, canonical link elements, and alternate hreflang links.

npm install metascraper-url

const metascraperUrl = require("metascraper-url");

For ES modules:

import metascraperUrl from "metascraper-url";

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);
// Extract URL from HTML
const html = `
<html>
<head>
<meta property="og:url" content="https://example.com/canonical" />
<link rel="canonical" href="https://example.com/alternate" />
</head>
</html>
`;
scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });

metascraper-url follows the standard metascraper plugin pattern:
Creates a metascraper rules object for URL extraction from HTML markup.
/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}

The plugin implements the following extraction strategies in order of priority:
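
To make "in order of priority" concrete, here is a minimal sketch of how a rule list like this one might be evaluated: rules are tried from first to last, and the first non-empty result wins. The `page` object and the `firstMatch` function are hypothetical stand-ins invented for this illustration (the real plugin receives a Cheerio DOM, and metascraper performs the iteration), so treat it as a model of the behaviour rather than the library's code.

```javascript
// Hypothetical stand-in for parsed markup: maps a selector to the attribute
// value found there. The real plugin queries a Cheerio DOM instead.
const page = {
  tags: {
    'link[rel="canonical"]': "https://example.com/canonical",
  },
  url: "https://example.com/fallback",
};

// Rules listed from highest to lowest priority, mirroring the documented order.
const urlRules = [
  ({ tags }) => tags['meta[property="og:url"]'],
  ({ tags }) => tags['meta[name="twitter:url"]'],
  ({ tags }) => tags['meta[property="twitter:url"]'],
  ({ tags }) => tags['link[rel="canonical"]'],
  ({ tags }) => tags['link[rel="alternate"][hreflang="x-default"]'],
  ({ url }) => url, // final fallback: the input URL
];

// Assumed runner behaviour: return the first non-empty rule result.
function firstMatch(rules, input) {
  for (const rule of rules) {
    const value = rule(input);
    if (value) return value;
  }
  return undefined;
}

console.log(firstMatch(urlRules, page)); // "https://example.com/canonical"
```

Because the last rule simply returns the input URL, the chain always produces a value as long as a URL was passed in. Each strategy is described in turn below.
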
Extracts URL from OpenGraph meta tag (og:url).
// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)

Extracts URL from Twitter Card meta tags.
// Twitter name attribute: meta[name="twitter:url"]
// Twitter property attribute: meta[property="twitter:url"]
// Attribute: content
// Priority: 2-3

Extracts URL from canonical link element.
// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4

Extracts URL from alternate hreflang link with x-default.
// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5

Returns the input URL as final fallback.
// Implementation: ({ url }) => url
// Priority: 6 (lowest)

const html = `
<html>
<head>
<meta property="og:url" content="https://example.com/og-url" />
<meta name="twitter:url" content="https://example.com/twitter-url" />
<link rel="canonical" href="https://example.com/canonical-url" />
</head>
</html>
`;
// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"

const html = `
<html>
<head>
<link rel="canonical" href="https://example.com/canonical" />
</head>
</html>
`;
// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"

const html = `
<html>
<head>
<title>Page Title</title>
</head>
</html>
`;
// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");
const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);
const metadata = await scraper({ html, url });
// metadata.url, metadata.title, metadata.description all extracted

Rules return null or undefined when they fail to find a valid URL.

Dependency: @metascraper/helpers
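
When a rule finds nothing usable it yields null or undefined, which is what allows a lower-priority strategy to take over. The sketch below shows one way such a validity check could look, using only the standard `URL` class. The `toValidUrl` helper is hypothetical and written for this illustration; it is an assumption about the general approach, not the actual code in @metascraper/helpers, which may normalize or validate differently.

```javascript
// Hypothetical helper: resolve a candidate against the page URL and return an
// absolute http(s) URL string, or undefined when the candidate is unusable.
function toValidUrl(candidate, baseUrl) {
  if (typeof candidate !== "string" || candidate.trim() === "") return undefined;
  try {
    const resolved = new URL(candidate.trim(), baseUrl);
    // Only accept web URLs; schemes such as "javascript:" are rejected.
    return /^https?:$/.test(resolved.protocol) ? resolved.href : undefined;
  } catch {
    // new URL() throws on input it cannot parse.
    return undefined;
  }
}

const base = "https://example.com/original";

// Candidate values in priority order, as the rules might produce them.
const candidates = [undefined, "javascript:void(0)", "/canonical"];

// Take the first candidate that validates, else fall back to the input URL.
const picked = candidates.map((c) => toValidUrl(c, base)).find(Boolean) ?? base;

console.log(picked); // "https://example.com/canonical"
```

Here the first two candidates are discarded (one is missing, one uses a non-web scheme), and the relative path is resolved against the input URL before being accepted.
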