tessl/npm-metascraper-author

Get author property from HTML markup using metascraper plugin rules

Workspace: tessl
Visibility: Public
Describes: npmpkg:npm/metascraper-author@5.49.x

To install, run

npx @tessl/cli install tessl/npm-metascraper-author@5.49.0

Metascraper Author

Metascraper Author is a plugin for the metascraper ecosystem that extracts author names from HTML markup. It applies a prioritized set of extraction rules covering JSON-LD structured data, meta tags, microdata, and semantic HTML elements.

Package Information

  • Package Name: metascraper-author
  • Package Type: npm
  • Language: JavaScript
  • Installation: npm install metascraper-author

Core Imports

const metascraperAuthor = require('metascraper-author');

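Note: This package ships CommonJS exports only; there is no native ES module build. In Node.js ES modules, a default import of a CommonJS package generally works through interop, but named imports are not available.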

Basic Usage

const metascraper = require('metascraper')([
  require('metascraper-author')()
]);

const html = `
  <html>
    <head>
      <meta name="author" content="John Doe">
    </head>
    <body>
      <article>
        <h1>Sample Article</h1>
        <p>Content here...</p>
      </article>
    </body>
  </html>
`;

const url = 'https://example.com/article';

(async () => {
  const metadata = await metascraper({ html, url });
  console.log(metadata.author); // "John Doe"
})();

Architecture

Metascraper Author follows the metascraper plugin architecture pattern:

  • Factory Function: Exports a function that returns a rules object compatible with the metascraper rule engine
  • Rule-based Extraction: Uses a prioritized array of extraction rules, each targeting different HTML patterns
  • Fallback Strategy: Tries the rules in order and falls back from structured, reliable sources (JSON-LD, meta tags) to looser heuristics (class names, bylines)
  • Validation Layer: Applies a strict two-word check to the looser rules so that incidental text is not returned as an author name
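
The fallback behaviour described above can be sketched in a few lines. This is a simplified, synchronous stand-in rather than metascraper's actual engine: `firstMatch`, the `page` object, and its field names are hypothetical, and real rules receive a Cheerio DOM instead of a plain object.

```javascript
// Hypothetical stand-in for the rule engine: run rules in order and
// return the first truthy result. Not metascraper's actual code.
function firstMatch(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value; // falsy results fall through to the next rule
  }
  return undefined;
}

// Toy rules over a plain object instead of a Cheerio DOM (assumption).
const rules = [
  ({ page }) => page.jsonLdAuthor, // most reliable source first
  ({ page }) => page.metaAuthor,   // then meta tags
  ({ page }) => page.bylineText    // loosest heuristic last
];

const matched = firstMatch(rules, {
  page: { jsonLdAuthor: undefined, metaAuthor: 'John Doe', bylineText: 'By staff' }
});
console.log(matched); // "John Doe"
```

Because the JSON-LD rule yields nothing here, the meta tag rule wins and the byline rule is never consulted.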

Capabilities

Rule Factory Function

Creates and returns extraction rules for identifying author information from HTML markup.

/**
 * Factory function that returns metascraper rules for author extraction
 * @returns {Rules} Rules object containing author extraction strategies
 */
function metascraperAuthor() {
  return {
    /** Array of extraction rules for author identification (RulesOptions[]) */
    author: [ /* ...rule functions, in priority order */ ],
    /** Package identifier for metascraper */
    pkgName: 'metascraper-author'
  };
}

/**
 * Rule extraction function type
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Rule execution context
 * @returns {string|null|undefined} Extracted value or null/undefined if not found
 */

/**
 * Rule execution context
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM instance
 * @property {string} url - Page URL for context
 */

/**
 * Metascraper rules object
 * @typedef {Object} Rules
 * @property {RulesOptions[]} [author] - Array of author extraction rules
 * @property {string} [pkgName] - Package identifier
 * @property {Function} [test] - Optional test function for conditional rule execution
 */
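
To make the typedefs above concrete, here is a minimal sketch of a rules object with the same shape. Everything in it is illustrative: `exampleAuthorRules`, the package name, the `test` condition, and `fakeDom` (a tiny stand-in that supports only `$(selector).attr(name)`) are hypothetical and not part of metascraper-author.

```javascript
// Hypothetical plugin with the same shape as the Rules typedef above.
function exampleAuthorRules() {
  const rules = {
    author: [
      // Each rule receives { htmlDom, url } and returns a string or a falsy value.
      ({ htmlDom }) => htmlDom('meta[name="author"]').attr('content')
    ],
    pkgName: 'example-author-rules' // hypothetical name
  };
  // Optional gate: only apply these rules to https pages (illustrative condition).
  rules.test = ({ url }) => url.startsWith('https://');
  return rules;
}

// Minimal stand-in for a Cheerio instance (assumption, not the real API surface).
const fakeDom = selector => ({
  attr: name =>
    selector === 'meta[name="author"]' && name === 'content' ? 'John Doe' : undefined
});

const plugin = exampleAuthorRules();
const context = { htmlDom: fakeDom, url: 'https://example.com/article' };
const extracted = plugin.test(context) ? plugin.author[0](context) : undefined;
console.log(extracted); // "John Doe"
```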

Extraction Rules

The plugin implements 13 extraction rules, grouped below into six categories and tried in priority order:

1. JSON-LD Structured Data

  • Extracts from author.name property in JSON-LD
  • Extracts from brand.name property as fallback

2. Meta Tags

  • <meta name="author" content="...">
  • <meta property="article:author" content="...">

3. Microdata

  • Elements with itemprop*="author" containing itemprop="name"
  • Elements with itemprop*="author"

4. Semantic HTML

  • Links with rel="author"

5. CSS Class-based Selectors (with strict validation)

  • Links with class containing "author"
  • Author class elements containing links
  • Links with href containing "/author/"

6. Alternative Patterns

  • Links with class containing "screenname"
  • Elements with class containing "author" (strict)
  • Elements with class containing "byline" (strict, excluding dates)
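
As a reading aid, the list above can be written down as data. The `target` strings paraphrase each bullet and are not the plugin's literal selectors; the `strict` flags mirror the "(strict)" annotations in the list, and the group labels are shorthand chosen for this sketch.

```javascript
// Catalogue of the 13 rules listed above, in priority order.
// `target` paraphrases each bullet; these are not the plugin's literal selectors.
const ruleCatalogue = [
  { group: 'JSON-LD', target: 'author.name', strict: false },
  { group: 'JSON-LD', target: 'brand.name', strict: false },
  { group: 'Meta tags', target: 'meta[name="author"]', strict: false },
  { group: 'Meta tags', target: 'meta[property="article:author"]', strict: false },
  { group: 'Microdata', target: 'itemprop*="author" containing itemprop="name"', strict: false },
  { group: 'Microdata', target: 'itemprop*="author"', strict: false },
  { group: 'Semantic HTML', target: 'link with rel="author"', strict: false },
  { group: 'CSS class', target: 'link with class containing "author"', strict: true },
  { group: 'CSS class', target: 'author-class element containing a link', strict: true },
  { group: 'CSS class', target: 'link with href containing "/author/"', strict: true },
  { group: 'Alternative', target: 'link with class containing "screenname"', strict: false },
  { group: 'Alternative', target: 'element with class containing "author"', strict: true },
  { group: 'Alternative', target: 'element with class containing "byline", dates excluded', strict: true }
];

const strictCount = ruleCatalogue.filter(rule => rule.strict).length;
const groups = [...new Set(ruleCatalogue.map(rule => rule.group))];

console.log(ruleCatalogue.length); // 13
console.log(strictCount);          // 5
console.log(groups.length);        // 6
```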

Internal Validation

The plugin wraps its looser rules in an internal validator so that short or incidental text is not mistaken for an author name:

/**
 * Internal strict validation function
 * Enforces stricter matching criteria for author extraction rules
 * @param {Function} rule - Base extraction rule to enhance
 * @returns {Function} Enhanced rule with strict validation
 */
const strict = rule => $ => {
  const value = rule($);
  // The value must start with two whitespace-separated words; otherwise the result is false
  return /^\S+\s+\S+/.test(value) && value;
};

Validation Features:

  • Word Count Validation: Author names must contain at least two words (regex: /^\S+\s+\S+/)
  • Strict Rules: The class-based and byline rules are wrapped in this validator, because those selectors often match text that is not a name
  • Content Filtering: Automatically filters out date values and invalid content from byline elements
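
A short sketch of how the strict wrapper behaves on a few inputs. The `strict` definition is repeated from above so the snippet runs on its own; `candidate` is a hypothetical helper that builds a rule returning a fixed string, standing in for a real DOM query.

```javascript
// Same wrapper as shown above, repeated so this snippet is self-contained.
const strict = rule => $ => {
  const value = rule($);
  return /^\S+\s+\S+/.test(value) && value;
};

// Hypothetical stand-in rules that ignore their argument and return a fixed candidate.
const candidate = text => () => text;

console.log(strict(candidate('John Doe'))());  // "John Doe"
console.log(strict(candidate('admin'))());     // false (only one word)
console.log(strict(candidate(' John Doe'))()); // false (leading whitespace)
console.log(strict(candidate(undefined))());   // false (nothing extracted)
```

Note that the pattern is anchored at the start of the string, so a value with leading whitespace is rejected unless it has been trimmed first.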

Dependencies

Runtime Dependencies

This package depends on @metascraper/helpers, which provides the following utility functions:

/**
 * Extract JSON-LD structured data values
 * @param {string} path - Dot-notation property path into the JSON-LD data (e.g., 'author.name')
 * @returns {Function} Rule function for extracting JSON-LD values
 */
const $jsonld = require('@metascraper/helpers').$jsonld;

/**
 * Filter and extract text content from DOM elements
 * @param {CheerioAPI} $ - Cheerio instance
 * @param {CheerioElement} elements - Selected elements
 * @param {Function} [filterFn] - Optional element filter function
 * @returns {string|null} Extracted and cleaned text content
 */
const $filter = require('@metascraper/helpers').$filter;

/**
 * Convert a mapping function into a metascraper rule
 * @param {Function} mapper - Function to process extracted values
 * @returns {Function} Metascraper-compatible rule function
 */
const toRule = require('@metascraper/helpers').toRule;

/**
 * Validate and parse date values
 * @param {string} value - Potential date string
 * @returns {boolean} True if value is a valid date
 */
const date = require('@metascraper/helpers').date;

/**
 * Clean and validate author strings
 * @param {string} value - Raw author value
 * @returns {string|null} Cleaned author string or null if invalid
 */
const author = require('@metascraper/helpers').author;

Validation and Quality Control

Taken together, the plugin and its helpers apply the following checks:

  • Word Count Validation: Author names must contain at least two words (the /^\S+\s+\S+/ pattern used by the strict wrapper)
  • Date Filtering: Automatically filters out date values from byline extractions
  • Empty Value Rejection: Rejects empty, undefined, or whitespace-only values
  • URL Filtering: Excludes URLs from being considered as author names
  • Content Sanitization: Cleans and normalizes extracted author strings
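
The checks listed above can be approximated with a small sanitizer. This is a sketch under stated assumptions, not the code in @metascraper/helpers: `cleanAuthor`, `looksLikeUrl`, and `looksLikeIsoDate` are hypothetical names, and the date check is a crude ISO-prefix test rather than real date parsing.

```javascript
// Hypothetical sanitizer; the real checks live in @metascraper/helpers
// and may differ in detail.
const looksLikeUrl = value => /^https?:\/\//i.test(value);
const looksLikeIsoDate = value => /^\d{4}-\d{2}-\d{2}/.test(value); // crude stand-in

function cleanAuthor(raw) {
  if (typeof raw !== 'string') return null;
  const value = raw.replace(/\s+/g, ' ').trim(); // collapse and trim whitespace
  if (value === '') return null;                 // reject empty values
  if (looksLikeUrl(value) || looksLikeIsoDate(value)) return null;
  return value;
}

console.log(cleanAuthor('  Jane\n   Roe '));          // "Jane Roe"
console.log(cleanAuthor('https://example.com/jane')); // null
console.log(cleanAuthor('2024-01-15'));               // null
console.log(cleanAuthor('   '));                      // null
```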

Error Handling

The plugin gracefully handles various edge cases:

  • Missing Elements: A rule yields a falsy value (false, null, or undefined) when its target elements are not found, and the next rule is tried
  • Invalid Content: A rule yields a falsy value for content that does not meet the validation criteria
  • Malformed HTML: Continues processing with other rules if one rule fails
  • Empty Documents: Handles documents with no author information without throwing errors

Integration with Metascraper

This plugin is designed to be used within the metascraper ecosystem:

  1. Install both metascraper and metascraper-author
  2. Initialize metascraper with the author plugin
  3. The plugin automatically contributes to the author property in extraction results
  4. The plugin can be combined with other metascraper plugins in the same rules array

Type Definitions

The metascraper-author package uses the following type definitions:

/**
 * Main export - factory function for creating author extraction rules
 * @returns {Rules} Metascraper rules object
 */
module.exports = function metascraperAuthor() {
  return {
    author: [ /* ...rule functions (RulesOptions[]), in priority order */ ],
    pkgName: 'metascraper-author'
  };
};

/**
 * Individual extraction rule function
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Extraction context
 * @returns {string|null|undefined} Extracted author, or null/undefined if not found
 */

/**
 * Context provided to each rule during extraction
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM API for HTML parsing
 * @property {string} url - Source URL for context and relative link resolution
 */

/**
 * Complete rules object returned by the factory function
 * @typedef {Object} Rules
 * @property {RulesOptions[]} author - Prioritized array of author extraction rules
 * @property {string} pkgName - Package identifier for debugging
 * @property {Function} [test] - Optional conditional test function
 */