
tessl/npm-metascraper-url

Get url property from HTML markup using metascraper rules

Workspace: tessl
Visibility: Public
Describes: npmpkg:npm/metascraper-url@5.49.x

To install, run

tessl install tessl/npm-metascraper-url@5.49.0


metascraper-url

metascraper-url is a plugin for the metascraper library that extracts the canonical URL of a webpage from its HTML markup. It checks, in priority order, the OpenGraph og:url meta tag, the twitter:url meta tags, the canonical link element, and the x-default alternate hreflang link, and falls back to the input URL when none of them yields a valid URL.

Package Information

  • Package Name: metascraper-url
  • Package Type: npm
  • Language: JavaScript
  • Installation: npm install metascraper-url

Core Imports

const metascraperUrl = require("metascraper-url");

For ES modules:

import metascraperUrl from "metascraper-url";

Basic Usage

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");

// Create a metascraper instance with the URL plugin
const scraper = metascraper([metascraperUrl()]);

// Extract URL from HTML
const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/canonical" />
      <link rel="canonical" href="https://example.com/alternate" />
    </head>
  </html>
`;

scraper({ html, url: "https://example.com/original" })
  .then(metadata => {
    console.log(metadata.url); // "https://example.com/canonical"
  });

Architecture

metascraper-url follows the standard metascraper plugin pattern:

  • Factory Function: The main export is a factory function; calling it returns a rules object
  • Rules Object: Holds the URL extraction rules under url and the package name under pkgName
  • Rule Chain: Extraction strategies are tried in priority order, and the first one that yields a valid URL wins
  • Fallback Strategy: If no extraction rule succeeds, the input URL is returned
  • Helper Integration: Uses @metascraper/helpers for URL validation and rule creation
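
The pattern above can be illustrated with a minimal, dependency-free sketch. It is not the package's source: createExampleRules, applyRules, and the ogUrl and canonical context fields are hypothetical names used here in place of real DOM lookups, and the runner only approximates how a host library might consume a rules object.

```javascript
// Hypothetical factory in the same shape as a metascraper plugin.
// `createExampleRules` and its context fields are illustrative only.
function createExampleRules() {
  const rules = {
    url: [
      ({ ogUrl }) => ogUrl,         // highest priority
      ({ canonical }) => canonical, // tried only if the rule above fails
      ({ url }) => url,             // fallback: the input URL
    ],
  };
  rules.pkgName = "example-url-rules"; // metadata, not a rule list
  return rules;
}

// Hypothetical runner: for each property, keep the first non-empty result.
function applyRules(rules, context) {
  const result = {};
  for (const [prop, ruleList] of Object.entries(rules)) {
    if (!Array.isArray(ruleList)) continue; // skips pkgName
    result[prop] = null;
    for (const rule of ruleList) {
      const value = rule(context);
      if (value) {
        result[prop] = value;
        break;
      }
    }
  }
  return result;
}

const input = { url: "https://example.com/input" };
console.log(applyRules(createExampleRules(), { ...input, canonical: "https://example.com/c" }));
// { url: 'https://example.com/c' }
console.log(applyRules(createExampleRules(), input));
// { url: 'https://example.com/input' }
```

Keeping the fallback as the last entry of the same array means the runner needs no special case for a page without URL metadata.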

Capabilities

URL Extraction Factory

Creates a metascraper rules object for URL extraction from HTML markup.

/**
 * Creates metascraper rules for URL extraction
 * @returns {Rules} Object containing URL extraction rules and package metadata
 */
function metascraperUrl(): Rules;

interface Rules {
  /** Array of URL extraction rules executed in priority order */
  url: RuleFunction[];
  /** Package name identifier for debugging purposes */
  pkgName: string;
}

type RuleFunction = (options: RuleOptions) => string | null | undefined;

interface RuleOptions {
  /** Cheerio DOM instance for HTML parsing */
  htmlDom: import("cheerio").CheerioAPI;
  /** Input URL for context and fallback */
  url: string;
}

URL Extraction Rules

The plugin implements the following extraction strategies in order of priority:

OpenGraph URL Rule

Extracts the URL from the OpenGraph meta tag (og:url).

// Selector: meta[property="og:url"]
// Attribute: content
// Priority: 1 (highest)

Twitter URL Rules

Extracts the URL from the Twitter Card meta tags (twitter:url), declared with either a name or a property attribute.

// Selector (name attribute): meta[name="twitter:url"]
// Selector (property attribute): meta[property="twitter:url"]
// Attribute: content
// Priority: 2-3

Canonical Link Rule

Extracts the URL from the canonical link element.

// Selector: link[rel="canonical"]
// Attribute: href
// Priority: 4

Alternate Hreflang Rule

Extracts the URL from the alternate link element whose hreflang is x-default.

// Selector: link[rel="alternate"][hreflang="x-default"]
// Attribute: href
// Priority: 5

Fallback Rule

Returns the input URL when none of the rules above produces a valid URL.

// Implementation: ({ url }) => url
// Priority: 6 (lowest)
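
The priority order above can be sketched as a table of selector and attribute pairs walked from top to bottom. This is an illustration under assumptions, not the plugin's implementation: createStubDom is a hypothetical stand-in that only mimics the $(selector).attr(name) call shape of Cheerio, and pickUrl skips the URL validation that the real plugin delegates to @metascraper/helpers.

```javascript
// Selector/attribute pairs in the documented priority order.
const URL_SOURCES = [
  ['meta[property="og:url"]', "content"],
  ['meta[name="twitter:url"]', "content"],
  ['meta[property="twitter:url"]', "content"],
  ['link[rel="canonical"]', "href"],
  ['link[rel="alternate"][hreflang="x-default"]', "href"],
];

// Hypothetical stub: `elements` maps a selector string to an attribute
// object. It performs no HTML parsing or CSS matching.
function createStubDom(elements) {
  return (selector) => ({
    attr: (name) => (elements[selector] || {})[name],
  });
}

// Walks the sources in order; falls back to the input URL when none match.
function pickUrl({ htmlDom, url }) {
  for (const [selector, attribute] of URL_SOURCES) {
    const value = htmlDom(selector).attr(attribute);
    if (value) return value;
  }
  return url;
}

const htmlDom = createStubDom({
  'link[rel="canonical"]': { href: "https://example.com/canonical" },
  'link[rel="alternate"][hreflang="x-default"]': { href: "https://example.com/default" },
});

// The canonical link outranks the x-default alternate link.
console.log(pickUrl({ htmlDom, url: "https://example.com/input" }));
// "https://example.com/canonical"
```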

Usage Examples

With Multiple Meta Tags

const html = `
  <html>
    <head>
      <meta property="og:url" content="https://example.com/og-url" />
      <meta name="twitter:url" content="https://example.com/twitter-url" />
      <link rel="canonical" href="https://example.com/canonical-url" />
    </head>
  </html>
`;

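// scraper is the instance created in Basic Usage.
// await requires an async function or an ES module.
// Will extract "https://example.com/og-url" (highest priority)
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/og-url"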

With Only Canonical Link

const html = `
  <html>
    <head>
      <link rel="canonical" href="https://example.com/canonical" />
    </head>
  </html>
`;

// Will extract from canonical link
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/canonical"

With No URL Meta Tags

const html = `
  <html>
    <head>
      <title>Page Title</title>
    </head>
  </html>
`;

// Will use fallback URL
const metadata = await scraper({ html, url: "https://example.com/fallback" });
console.log(metadata.url); // "https://example.com/fallback"

Using with Other metascraper Plugins

const metascraper = require("metascraper");
const metascraperUrl = require("metascraper-url");
const metascraperTitle = require("metascraper-title");
const metascraperDescription = require("metascraper-description");

const scraper = metascraper([
  metascraperUrl(),
  metascraperTitle(),
  metascraperDescription()
]);

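// html and url are assumed to be defined, as in the examples above.
// await requires an async function or an ES module.
const metadata = await scraper({ html, url });
// metadata.url, metadata.title, and metadata.description are all populated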

Error Handling

  • Individual extraction rules return null or undefined when they do not find a valid URL
  • Rules are tried sequentially until one returns a value
  • URL validation and normalization are handled by @metascraper/helpers
  • A malformed URL is treated as no result, so the next rule in the chain is tried
  • The fallback rule returns the input URL, so a URL is always returned
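
The filtering behavior described above can be sketched with the built-in WHATWG URL class. This is an assumption-laden illustration rather than the logic in @metascraper/helpers, whose exact checks may differ: toValidUrl, validated, and firstResult are hypothetical names, and restricting results to http and https is a choice made for this example.

```javascript
// Hypothetical validator: resolves `candidate` against `base` and accepts
// only http(s) results. Returns undefined for anything unusable.
function toValidUrl(candidate, base) {
  if (typeof candidate !== "string" || candidate.trim() === "") return undefined;
  try {
    const resolved = new URL(candidate.trim(), base);
    return /^https?:$/.test(resolved.protocol) ? resolved.href : undefined;
  } catch {
    return undefined; // the URL constructor throws on malformed input
  }
}

// Wraps a raw extractor so that invalid values count as "no result".
function validated(extractor) {
  return (context) => toValidUrl(extractor(context), context.url);
}

// Returns the first defined result, trying the rules in array order.
function firstResult(ruleList, context) {
  for (const rule of ruleList) {
    const value = rule(context);
    if (value) return value;
  }
  return undefined;
}

const exampleRules = [
  validated(({ ogUrl }) => ogUrl),
  validated(({ canonical }) => canonical),
  ({ url }) => url, // fallback: the input URL
];

// The malformed og value is skipped; the relative canonical href is resolved.
console.log(
  firstResult(exampleRules, {
    ogUrl: "http://exa mple.com",
    canonical: "/canonical",
    url: "https://example.com/page",
  })
);
// "https://example.com/canonical"
```

Because a rejected candidate simply yields undefined, a single bad meta tag does not prevent a lower-priority source from supplying the URL.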