tessl/npm-metascraper-url

Get url property from HTML markup using metascraper rules

metascraper-url

metascraper-url is a plugin for the metascraper library that extracts a page's URL from its HTML markup. It tries several sources in priority order: the OpenGraph og:url meta tag, the Twitter twitter:url meta tags, the canonical link element, and the alternate hreflang="x-default" link. If none of them yields a valid URL, it falls back to the input URL.

Package Information

  • Package Name: metascraper-url
  • Package Type: npm
  • Language: JavaScript
  • Installation: npm install metascraper-url

Core Imports

const metascraperUrl = require("metascraper-url");

For ES modules:

import metascraperUrl from "metascraper-url";

Basic Usage

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");

// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);

// Extract URL from HTML
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/canonical" />
      <link rel="canonical" href="https://example.com/alternate" />
    </head>
  </html>
`;

scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });

Architecture

metascraper-url follows the standard metascraper plugin pattern:

  • Factory Function: The main export is a factory function that returns a rules object
  • Rules Object: Contains URL extraction rules and package metadata
  • Rule Chain: Multiple extraction strategies tried in priority order
  • Fallback Strategy: Returns input URL if no extraction rules succeed
  • Helper Integration: Uses @metascraper/helpers for URL validation and rule creation
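
The pattern above can be sketched without any dependencies. This is a minimal illustration, not metascraper's actual implementation: exampleUrlPlugin and runRules are hypothetical names, rules are treated as synchronous for brevity, and the stand-in rules read from a plain page object rather than a Cheerio DOM.

```javascript
// Hypothetical factory with the same shape as a metascraper plugin:
// an object holding an ordered rule array plus a package name.
function exampleUrlPlugin() {
  const rules = {
    url: [
      // Stand-ins for DOM-based rules; `page.ogUrl` and `page.canonical`
      // are assumptions of this sketch.
      ({ page }) => page.ogUrl,
      ({ page }) => page.canonical,
      // Fallback: return the input URL unchanged.
      ({ url }) => url
    ]
  };
  rules.pkgName = "example-url-plugin";
  return rules;
}

// Hypothetical runner: try each rule in order, keep the first non-empty value.
function runRules(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (value !== null && value !== undefined && value !== "") return value;
  }
  return undefined;
}

const plugin = exampleUrlPlugin();
const withCanonical = runRules(plugin.url, {
  page: { canonical: "https://example.com/canonical" },
  url: "https://example.com/original"
});
const withNothing = runRules(plugin.url, {
  page: {},
  url: "https://example.com/original"
});
console.log(withCanonical); // "https://example.com/canonical"
console.log(withNothing);   // "https://example.com/original"
```

Because the fallback rule sits last in the array, the runner only reaches it when every earlier rule has returned an empty value.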

Capabilities

URL Extraction Factory

Creates a metascraper rules object for URL extraction from HTML markup.

/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}

URL Extraction Rules

The plugin implements the following extraction strategies in order of priority:

OpenGraph URL Rule

Extracts URL from OpenGraph meta tag (og:url).

// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)

Twitter URL Rules

Extracts URL from Twitter Card meta tags.

// Selector: meta[name="twitter:url"] (priority 2)
// Selector: meta[property="twitter:url"] (priority 3)
// Attribute: content

Canonical Link Rule

Extracts URL from canonical link element.

// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4

Alternate Hreflang Rule

Extracts URL from the alternate link element whose hreflang is x-default.

// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5

Fallback Rule

Returns the input URL as final fallback.

// Implementation: ({ url }) => url
// Priority: 6 (lowest)
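
The priority order above can be written down as data and walked with a short loop. The sketch below is illustrative only: URL_SOURCES and pickUrl are hypothetical names, the lookup callback stands in for a real DOM query, and the per-value URL validation that the plugin applies is left out.

```javascript
// Selector/attribute pairs in the priority order listed above.
const URL_SOURCES = [
  { selector: 'meta[property="og:url"]', attr: "content" },
  { selector: 'meta[name="twitter:url"]', attr: "content" },
  { selector: 'meta[property="twitter:url"]', attr: "content" },
  { selector: 'link[rel="canonical"]', attr: "href" },
  { selector: 'link[rel="alternate"][hreflang="x-default"]', attr: "href" }
];

// Hypothetical helper: `lookup(selector, attr)` stands in for a DOM query
// and returns the attribute value, or undefined when nothing matches.
function pickUrl(lookup, inputUrl) {
  for (const { selector, attr } of URL_SOURCES) {
    const value = lookup(selector, attr);
    if (value) return value;
  }
  return inputUrl; // priority 6: fall back to the input URL
}

// A page that has only a Twitter tag and a canonical link.
const found = new Map([
  ['meta[name="twitter:url"]', "https://example.com/twitter-url"],
  ['link[rel="canonical"]', "https://example.com/canonical-url"]
]);
const picked = pickUrl(selector => found.get(selector), "https://example.com/fallback");
const fallback = pickUrl(() => undefined, "https://example.com/fallback");
console.log(picked);   // "https://example.com/twitter-url"
console.log(fallback); // "https://example.com/fallback"
```

With no og:url present, the Twitter tag wins over the canonical link because it appears earlier in the list.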

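Usage Examples

The first three snippets reuse the scraper instance from Basic Usage and must run inside an async function (or an ES module, where top-level await is available).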

With Multiple Meta Tags

const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/og-url" />
      <meta name="twitter:url" content="https://example.com/twitter-url" />
      <link rel="canonical" href="https://example.com/canonical-url" />
    </head>
  </html>
`;

// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"

With Only Canonical Link

const html = `
  <html>
    <head>
      <link rel="canonical" href="https://example.com/canonical" />
    </head>
  </html>
`;

// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"

With No URL Meta Tags

const html = `
  <html>
    <head>
      <title>Page Title</title>
    </head>
  </html>
`;

// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"

Using with Other metascraper Plugins

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");

const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);

// `html` and `url` hold the page markup and its address, obtained elsewhere.
// CommonJS has no top-level await, so the call is wrapped in an async function.
(async () => {
  const metadata = await scraper({ html, url });
  // metadata.url, metadata.title, and metadata.description are all extracted
})();

Error Handling

  • Individual extraction rules return null or undefined when they fail to find a valid URL
  • Rules are tried sequentially until one succeeds
  • URL validation and normalization are handled by @metascraper/helpers
  • Malformed URLs are automatically filtered out
  • The fallback rule ensures a URL is always returned (the input URL)
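
To make the filtering step concrete, here is a rough sketch of what validating a candidate value might look like. It is not the code in @metascraper/helpers: toAbsoluteUrl is a hypothetical name, and resolving relative values against the input URL and accepting only http(s) are assumptions of this example. It relies only on the WHATWG URL class built into Node.js and browsers.

```javascript
// Hypothetical validator: returns an absolute http(s) URL string, or undefined.
function toAbsoluteUrl(value, baseUrl) {
  if (typeof value !== "string" || value.trim() === "") return undefined;
  try {
    // Relative values are resolved against the base (assumption of this sketch).
    const parsed = new URL(value.trim(), baseUrl);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return undefined;
    return parsed.href;
  } catch (_) {
    return undefined; // malformed input is filtered out
  }
}

const base = "https://example.com/articles/1";
const resolved = toAbsoluteUrl("/canonical", base);
const rejectedScheme = toAbsoluteUrl("javascript:void(0)", base);
const rejectedMalformed = toAbsoluteUrl("http://[bad", base);
console.log(resolved);          // "https://example.com/canonical"
console.log(rejectedScheme);    // undefined
console.log(rejectedMalformed); // undefined
```

A rule that returns undefined in this way simply hands control to the next rule in the chain, which is how the input URL ends up as the result when nothing else validates.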

Describes: npmpkg:npm/metascraper-url@5.49.x