Tessl Tile for npm/metascraper-author@5.49.0

tessl/npm-metascraper-author

Get author property from HTML markup using metascraper plugin rules

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:npm/metascraper-author@5.49.x

To install, run

npx @tessl/cli install tessl/npm-metascraper-author@5.49.0

# Metascraper Author

Metascraper Author is a specialized plugin for the metascraper ecosystem that extracts author information from HTML markup. It implements a comprehensive set of extraction rules to identify authors from various HTML structures including JSON-LD structured data, meta tags, microdata, and semantic HTML elements.

## Package Information

- **Package Name**: metascraper-author
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-author`

## Core Imports

```javascript
const metascraperAuthor = require('metascraper-author');
```

Note: This package ships CommonJS exports only and does not provide a native ES module build, so load it with `require`.

## Basic Usage

```javascript
const metascraper = require('metascraper')([
  require('metascraper-author')()
]);

const html = `
  <html>
    <head>
      <meta name="author" content="John Doe">
    </head>
    <body>
      <article>
        <h1>Sample Article</h1>
        <p>Content here...</p>
      </article>
    </body>
  </html>
`;

const url = 'https://example.com/article';

(async () => {
  const metadata = await metascraper({ html, url });
  console.log(metadata.author); // "John Doe"
})();
```

## Architecture

Metascraper Author follows the metascraper plugin architecture pattern:

- **Factory Function**: Exports a function that returns a rules object compatible with the metascraper rule engine
- **Rule-based Extraction**: Uses a prioritized array of extraction rules, each targeting a different HTML pattern
- **Fallback Strategy**: Tries the rules in priority order, starting with the most explicit sources (JSON-LD, meta tags) and falling back to looser class-name and byline heuristics
- **Validation Layer**: Applies strict validation to the heuristic rules so that extracted author names meet quality requirements

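The factory-plus-fallback pattern described above can be sketched in plain JavaScript. This is a minimal illustration under stated assumptions, not metascraper's actual engine: `exampleAuthorPlugin`, `runRules`, and the `page` object are hypothetical stand-ins (real rules receive a Cheerio instance as `htmlDom`), and the sketch is synchronous for brevity.

```javascript
// Hypothetical plugin factory: returns a rules object shaped like the one
// described above (an ordered `author` array plus a `pkgName`).
function exampleAuthorPlugin() {
  return {
    author: [
      // Most explicit source first: a pre-parsed meta tag value.
      ({ page }) => page.metaAuthor,
      // Looser heuristic last: text taken from a byline element.
      ({ page }) => page.bylineText
    ],
    pkgName: 'example-author-plugin'
  };
}

// Hypothetical runner: try each rule in order and return the first truthy value.
function runRules(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value;
  }
  return null; // no rule produced a usable value
}

const plugin = exampleAuthorPlugin();
const fromMeta = runRules(plugin.author, {
  page: { metaAuthor: 'John Doe', bylineText: 'Jane Roe' }
});
const fromByline = runRules(plugin.author, { page: { bylineText: 'Jane Roe' } });
const missing = runRules(plugin.author, { page: {} });

console.log(fromMeta);   // "John Doe" (the earlier rule wins)
console.log(fromByline); // "Jane Roe" (fallback rule)
console.log(missing);    // null
```

Ordering the array from explicit to heuristic sources means a page that declares its author in metadata is not affected by noisier markup further down the list.
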
## Capabilities

### Rule Factory Function

Creates and returns extraction rules for identifying author information from HTML markup.

```javascript { .api }
/**
 * Factory function that returns metascraper rules for author extraction
 * @returns {Rules} Rules object containing author extraction strategies
 */
function metascraperAuthor() {
  return {
    /** Array of extraction rules for author identification */
    author: [/* RulesOptions functions, in priority order */],
    /** Package identifier for metascraper */
    pkgName: 'metascraper-author'
  };
}

/**
 * Rule extraction function type
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Rule execution context
 * @returns {string|null|undefined} Extracted value or null/undefined if not found
 */

/**
 * Rule execution context
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM instance
 * @property {string} url - Page URL for context
 */

/**
 * Metascraper rules object
 * @typedef {Object} Rules
 * @property {RulesOptions[]} [author] - Array of author extraction rules
 * @property {string} [pkgName] - Package identifier
 * @property {Function} [test] - Optional test function for conditional rule execution
 */
```

### Extraction Rules

The plugin implements 13 extraction rules, grouped below by source and listed in priority order:

#### 1. JSON-LD Structured Data
- Extracts from `author.name` property in JSON-LD
- Extracts from `brand.name` property as fallback

#### 2. Meta Tags
- `<meta name="author" content="...">`
- `<meta property="article:author" content="...">`

#### 3. Microdata
- Elements with `itemprop*="author"` containing `itemprop="name"`
- Elements with `itemprop*="author"`

#### 4. Semantic HTML
- Links with `rel="author"`

#### 5. CSS Class-based Selectors (with strict validation)
- Links with class containing "author"
- Author class elements containing links
- Links with href containing "/author/"

#### 6. Alternative Patterns
- Links with class containing "screenname"
- Elements with class containing "author" (strict)
- Elements with class containing "byline" (strict, excluding dates)

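As an illustration of the first rule group, the sketch below reads `author.name` from a parsed JSON-LD object and falls back to `brand.name`. The `getPath` and `authorFromJsonLd` helpers are hypothetical and written only for this example; the plugin itself relies on `$jsonld` from `@metascraper/helpers`, whose internals are not shown here.

```javascript
// Hypothetical helper: walk a dot-separated path, returning undefined on a miss.
function getPath(obj, path) {
  return path
    .split('.')
    .reduce((node, key) => (node != null ? node[key] : undefined), obj);
}

// Hypothetical extractor: prefer `author.name`, then fall back to `brand.name`.
function authorFromJsonLd(jsonLdText) {
  let data;
  try {
    data = JSON.parse(jsonLdText);
  } catch (err) {
    return undefined; // malformed JSON-LD: let later rules try instead
  }
  for (const path of ['author.name', 'brand.name']) {
    const value = getPath(data, path);
    if (typeof value === 'string' && value.trim() !== '') {
      return value.trim();
    }
  }
  return undefined;
}

const article = '{"@type":"Article","author":{"name":"John Doe"}}';
const product = '{"@type":"Product","brand":{"name":"Acme Corp"}}';

console.log(authorFromJsonLd(article)); // "John Doe"
console.log(authorFromJsonLd(product)); // "Acme Corp"
console.log(authorFromJsonLd('{oops')); // undefined
```
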
### Internal Validation

The plugin uses internal validation mechanisms to ensure quality author extraction:

```javascript { .api }
/**
 * Internal strict validation function
 * Enforces stricter matching criteria for author extraction rules
 * @param {Function} rule - Base extraction rule to enhance
 * @returns {Function} Enhanced rule with strict validation
 */
const strict = rule => $ => {
  const value = rule($);
  // Returns the value if it contains at least two words, otherwise false
  return /^\S+\s+\S+/.test(value) && value;
};
```

**Validation Features:**
- **Word Count Validation**: Values from strict rules must contain at least two whitespace-separated words (regex: `/^\S+\s+\S+/`)
- **Strict Rules**: Only the heuristic, class-based extraction patterns are wrapped in this validation; structured sources such as JSON-LD and meta tags are not
- **Content Filtering**: Automatically filters out date values and invalid content from byline elements

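To make the two-word check concrete, the sketch below applies the same regular expression to a few sample values. The `strict` wrapper mirrors the snippet above; the base rules passed to it are hypothetical stubs that return fixed strings instead of querying a DOM.

```javascript
const REGEX_STRICT = /^\S+\s+\S+/;

// Same shape as the wrapper shown above: pass the value through only if it
// starts with two whitespace-separated words.
const strict = rule => $ => {
  const value = rule($);
  return REGEX_STRICT.test(value) && value;
};

// Hypothetical base rules returning fixed values for illustration.
const accepted = strict(() => 'John Doe')();
const singleWord = strict(() => 'admin')();
const leadingSpace = strict(() => ' John Doe')();
const noMatch = strict(() => undefined)();

console.log(accepted);     // "John Doe"
console.log(singleWord);   // false (only one word)
console.log(leadingSpace); // false (the pattern is anchored to a non-space character)
console.log(noMatch);      // false (nothing was extracted)
```

Because the pattern is anchored at the start of the string, values are expected to be trimmed before they reach this check.
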
## Dependencies

### Runtime Dependencies

This package depends on `@metascraper/helpers` which provides the following key utility functions:

```javascript { .api }
/**
 * Extract JSON-LD structured data values
 * @param {string} path - Dot-separated property path (e.g., 'author.name')
 * @returns {Function} Rule function for extracting JSON-LD values
 */
const $jsonld = require('@metascraper/helpers').$jsonld;

/**
 * Filter and extract text content from DOM elements
 * @param {CheerioAPI} $ - Cheerio instance
 * @param {CheerioElement} elements - Selected elements
 * @param {Function} [filterFn] - Optional element filter function
 * @returns {string|null} Extracted and cleaned text content
 */
const $filter = require('@metascraper/helpers').$filter;

/**
 * Convert a mapping function into a metascraper rule
 * @param {Function} mapper - Function to process extracted values
 * @returns {Function} Metascraper-compatible rule function
 */
const toRule = require('@metascraper/helpers').toRule;

/**
 * Validate and parse date values
 * @param {string} value - Potential date string
 * @returns {boolean} True if value is a valid date
 */
const date = require('@metascraper/helpers').date;

/**
 * Clean and validate author strings
 * @param {string} value - Raw author value
 * @returns {string|null} Cleaned author string or null if invalid
 */
const author = require('@metascraper/helpers').author;
```

## Validation and Quality Control

The plugin applies several validation steps before accepting a value:

- **Word Count Validation**: Values from strict rules must contain at least two words (enforced by the internal `REGEX_STRICT` pattern, `/^\S+\s+\S+/`)
- **Date Filtering**: Automatically filters out date values from byline extractions
- **Empty Value Rejection**: Rejects empty, undefined, or whitespace-only values
- **URL Filtering**: Excludes URLs from being considered as author names
- **Content Sanitization**: Cleans and normalizes extracted author strings

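The checks listed above could look roughly like the following. This is a hedged sketch only: `cleanAuthor` is a hypothetical function written for this example and is not the implementation of the `author` helper in `@metascraper/helpers`, whose exact rules may differ.

```javascript
// Hypothetical cleaner illustrating empty-value rejection, URL filtering,
// and whitespace normalization. Not the library's own code.
function cleanAuthor(raw) {
  if (typeof raw !== 'string') return null;      // undefined, null, numbers...
  const value = raw.replace(/\s+/g, ' ').trim(); // collapse runs of whitespace
  if (value === '') return null;                 // empty or whitespace-only
  if (/^https?:\/\//i.test(value)) return null;  // a URL is not an author name
  return value;
}

console.log(cleanAuthor('  John   Doe \n'));                  // "John Doe"
console.log(cleanAuthor('   '));                              // null
console.log(cleanAuthor('https://example.com/author/john'));  // null
console.log(cleanAuthor(undefined));                          // null
```
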
## Error Handling

The plugin gracefully handles various edge cases:

- **Missing Elements**: A rule returns a falsy value (`false`, `null`, or `undefined`) when its target elements are not found
- **Invalid Content**: Strict rules return `false` for content that doesn't meet validation criteria
- **Malformed HTML**: Continues processing with other rules if one rule fails
- **Empty Documents**: Handles documents with no author information without throwing errors

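Since a page may yield no author at all, calling code usually needs its own default. A minimal consumer-side sketch follows; `displayAuthor` and its fallback text are hypothetical, and the sketch only assumes that a missing author arrives as some falsy or non-string value.

```javascript
// Hypothetical helper for code that consumes metascraper results.
function displayAuthor(metadata, fallback = 'Unknown author') {
  const author = metadata && metadata.author;
  // Accept only non-empty strings; anything else gets the fallback.
  return typeof author === 'string' && author.trim() !== '' ? author : fallback;
}

console.log(displayAuthor({ author: 'John Doe' })); // "John Doe"
console.log(displayAuthor({ author: null }));       // "Unknown author"
console.log(displayAuthor({}));                     // "Unknown author"
console.log(displayAuthor({ author: false }, '-')); // "-"
```
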
## Integration with Metascraper

This plugin is designed to be used within the metascraper ecosystem:

1. Install both `metascraper` and `metascraper-author`
2. Initialize metascraper with the author plugin
3. The plugin populates the `author` property in extraction results
4. It can be combined with other metascraper plugins in the same rules array

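One way to picture how several plugins share a pipeline is to merge their rules objects by property name while keeping plugin order. The sketch below is an assumption about the general approach, not metascraper's own loading code; `mergeRules` and both plugin factories are hypothetical.

```javascript
// Hypothetical merge: collect rule arrays per property, in plugin order.
function mergeRules(plugins) {
  const merged = {};
  for (const plugin of plugins) {
    for (const [prop, rules] of Object.entries(plugin)) {
      if (!Array.isArray(rules)) continue; // skip non-rule keys such as pkgName
      merged[prop] = (merged[prop] || []).concat(rules);
    }
  }
  return merged;
}

// Hypothetical plugins with the same shape as the rules object described above.
const authorPlugin = () => ({
  author: [() => 'John Doe'],
  pkgName: 'example-author'
});
const titlePlugin = () => ({
  title: [() => 'Sample Article'],
  author: [() => 'Fallback Author'],
  pkgName: 'example-title'
});

const merged = mergeRules([authorPlugin(), titlePlugin()]);

console.log(Object.keys(merged).sort()); // [ 'author', 'title' ]
console.log(merged.author.length);       // 2
console.log(merged.author[0]());         // "John Doe" (the first plugin's rule comes first)
```

Under this assumption, the order of plugins in the array determines which rule gets the first chance to supply a shared property.
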
## Type Definitions

The metascraper-author package uses the following type definitions:

```javascript { .api }
/**
 * Main export - factory function for creating author extraction rules
 * @returns {Rules} Metascraper rules object
 */
module.exports = function metascraperAuthor() {
  return {
    author: [/* RulesOptions functions, in priority order */],
    pkgName: 'metascraper-author'
  };
};

/**
 * Individual extraction rule function
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Extraction context
 * @returns {string|null|undefined} Extracted author or null if not found
 */

/**
 * Context provided to each rule during extraction
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM API for HTML parsing
 * @property {string} url - Source URL for context and relative link resolution
 */

/**
 * Complete rules object returned by the factory function
 * @typedef {Object} Rules
 * @property {RulesOptions[]} author - Prioritized array of author extraction rules
 * @property {string} pkgName - Package identifier for debugging
 * @property {Function} [test] - Optional conditional test function
 */
```