Get author property from HTML markup using metascraper plugin rules
npx @tessl/cli install tessl/npm-metascraper-author@5.49.00
# Metascraper Author

Metascraper Author is a specialized plugin for the metascraper ecosystem that extracts author information from HTML markup. It implements a comprehensive set of extraction rules to identify authors from various HTML structures including JSON-LD structured data, meta tags, microdata, and semantic HTML elements.

## Package Information

- **Package Name**: metascraper-author
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-author`

## Core Imports

```javascript
const metascraperAuthor = require('metascraper-author');
```

Note: This package is published as CommonJS only and does not ship a native ES module build. ES module projects have to load it through Node.js CommonJS interoperability.

## Basic Usage

```javascript
const metascraper = require('metascraper')([
  require('metascraper-author')()
]);

const html = `
  <html>
    <head>
      <meta name="author" content="John Doe">
    </head>
    <body>
      <article>
        <h1>Sample Article</h1>
        <p>Content here...</p>
      </article>
    </body>
  </html>
`;

const url = 'https://example.com/article';

(async () => {
  const metadata = await metascraper({ html, url });
  console.log(metadata.author); // "John Doe"
})();
```

## Architecture

Metascraper Author follows the metascraper plugin architecture pattern:

- **Factory Function**: Exports a function that returns a rules object compatible with the metascraper rule engine
- **Rule-based Extraction**: Uses a prioritized array of extraction rules, each targeting different HTML patterns
- **Fallback Strategy**: Tries the rules in priority order, falling back from structured sources (JSON-LD, meta tags) to looser heuristics (class-name and `href` selectors)
- **Validation Layer**: Applies strict validation to the looser rules so that extracted author names meet quality requirements

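The prioritized fallback described above can be sketched in a few lines. This is a minimal illustration only, not metascraper's actual rule engine: `runRules`, `authorRules`, and the plain `page` object standing in for a parsed document are all hypothetical names introduced here.

```javascript
// Hypothetical, simplified rule runner (NOT metascraper's real engine).
// Each rule receives a context object and returns a string or a falsy value.
const runRules = (rules, context) => {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value; // the first truthy result wins
  }
  return null; // no rule produced a usable value
};

// Hypothetical rules over a plain object standing in for a parsed page,
// ordered from the most structured source to the loosest heuristic.
const authorRules = [
  ({ page }) => page.jsonLdAuthor, // structured data first
  ({ page }) => page.metaAuthor,   // then meta tags
  ({ page }) => page.bylineText    // loosest heuristic last
];

const context = {
  url: 'https://example.com/article',
  page: { metaAuthor: 'John Doe', bylineText: 'By staff' }
};

console.log(runRules(authorRules, context)); // "John Doe"
```

Because the first rule finds nothing, the runner falls through to the meta tag value and never consults the byline heuristic.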
## Capabilities

### Rule Factory Function

Creates and returns extraction rules for identifying author information from HTML markup.

```javascript { .api }
/**
 * Factory function that returns metascraper rules for author extraction
 * @returns {Rules} Rules object containing author extraction strategies
 */
function metascraperAuthor() {
  return {
    /** Array of extraction rules for author identification */
    author: [/* RulesOptions functions, in priority order */],
    /** Package identifier for metascraper */
    pkgName: 'metascraper-author'
  };
}

/**
 * Rule extraction function type
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Rule execution context
 * @returns {string|null|undefined} Extracted value or null/undefined if not found
 */

/**
 * Rule execution context
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM instance
 * @property {string} url - Page URL for context
 */

/**
 * Metascraper rules object
 * @typedef {Object} Rules
 * @property {RulesOptions[]} [author] - Array of author extraction rules
 * @property {string} [pkgName] - Package identifier
 * @property {Function} [test] - Optional test function for conditional rule execution
 */
```

### Extraction Rules

The plugin implements 13 different extraction strategies in priority order:

#### 1. JSON-LD Structured Data
- Extracts from `author.name` property in JSON-LD
- Extracts from `brand.name` property as fallback

#### 2. Meta Tags
- `<meta name="author" content="...">`
- `<meta property="article:author" content="...">`

#### 3. Microdata
- Elements with `itemprop*="author"` containing `itemprop="name"`
- Elements with `itemprop*="author"`

#### 4. Semantic HTML
- Links with `rel="author"`

#### 5. CSS Class-based Selectors (with strict validation)
- Links with class containing "author"
- Author class elements containing links
- Links with href containing "/author/"

#### 6. Alternative Patterns
- Links with class containing "screenname"
- Elements with class containing "author" (strict)
- Elements with class containing "byline" (strict, excluding dates)

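To make the first strategy concrete, here is a rough sketch of reading `author.name` from a JSON-LD payload with `brand.name` as the fallback. It is a simplified stand-in written for illustration, not the plugin's code: `authorFromJsonLd` is a hypothetical name, and the sketch parses a raw JSON string rather than working on a Cheerio document.

```javascript
// Hypothetical stand-in for a JSON-LD lookup; not the plugin's implementation.
const authorFromJsonLd = jsonText => {
  let data;
  try {
    data = JSON.parse(jsonText);
  } catch (_) {
    return undefined; // malformed JSON-LD: leave it to later rules
  }
  // A JSON-LD script may hold a single node or an array of nodes.
  const nodes = Array.isArray(data) ? data : [data];
  // Check `author.name` across all nodes before falling back to `brand.name`.
  for (const key of ['author', 'brand']) {
    for (const node of nodes) {
      const name = node?.[key]?.name;
      if (typeof name === 'string' && name.trim() !== '') return name.trim();
    }
  }
  return undefined;
};

const jsonLd = JSON.stringify({
  '@context': 'https://schema.org',
  '@type': 'Article',
  author: { '@type': 'Person', name: 'Jane Smith' },
  brand: { '@type': 'Brand', name: 'Example Media' }
});

console.log(authorFromJsonLd(jsonLd)); // "Jane Smith"
```

Looping over the keys in the outer loop keeps `author.name` ahead of `brand.name` even when the two live in different nodes, which mirrors the priority order listed above.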
### Internal Validation

The plugin uses internal validation mechanisms to ensure quality author extraction:

```javascript { .api }
/**
 * Internal strict validation function
 * Enforces stricter matching criteria for author extraction rules
 * @param {Function} rule - Base extraction rule to enhance
 * @returns {Function} Enhanced rule with strict validation
 */
const strict = rule => $ => {
  const value = rule($);
  // Must contain at least two words; otherwise the wrapped rule yields false
  return /^\S+\s+\S+/.test(value) && value;
};
```

**Validation Features:**
- **Word Count Validation**: Strict rules accept only values that contain at least two whitespace-separated words (regex: `/^\S+\s+\S+/`)
- **Strict Rules**: Only some extraction patterns, marked "strict" in the list above, apply this enhanced validation
- **Content Filtering**: Automatically filters out date values and invalid content from byline elements

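The effect of the strict wrapper is easiest to see on a few sample values. The sketch below repeats the wrapper shown above and applies it to stand-in rules that ignore their argument and return fixed strings; those stand-in rules are hypothetical and exist only for the demonstration.

```javascript
// Same pattern as the strict wrapper above; the constant name follows
// the REGEX_STRICT name this document uses for it.
const REGEX_STRICT = /^\S+\s+\S+/;

const strict = rule => $ => {
  const value = rule($);
  return REGEX_STRICT.test(value) && value;
};

// Hypothetical stand-in rules returning fixed values.
const fullName = strict(() => 'Jane Smith');
const singleWord = strict(() => 'admin');
const missing = strict(() => undefined);

console.log(fullName());   // "Jane Smith"
console.log(singleWord()); // false
console.log(missing());    // false
```

Note that a rejected value comes back as `false` rather than `null` or `undefined`, because the wrapper returns the result of the `&&` expression.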
## Dependencies

### Runtime Dependencies

This package depends on `@metascraper/helpers` which provides the following key utility functions:

```javascript { .api }
/**
 * Extract JSON-LD structured data values
 * @param {string} path - JSONPath expression (e.g., 'author.name')
 * @returns {Function} Rule function for extracting JSON-LD values
 */
const $jsonld = require('@metascraper/helpers').$jsonld;

/**
 * Filter and extract text content from DOM elements
 * @param {CheerioAPI} $ - Cheerio instance
 * @param {CheerioElement} elements - Selected elements
 * @param {Function} [filterFn] - Optional element filter function
 * @returns {string|null} Extracted and cleaned text content
 */
const $filter = require('@metascraper/helpers').$filter;

/**
 * Convert a mapping function into a metascraper rule
 * @param {Function} mapper - Function to process extracted values
 * @returns {Function} Metascraper-compatible rule function
 */
const toRule = require('@metascraper/helpers').toRule;

/**
 * Validate and parse date values
 * @param {string} value - Potential date string
 * @returns {boolean} True if value is a valid date
 */
const date = require('@metascraper/helpers').date;

/**
 * Clean and validate author strings
 * @param {string} value - Raw author value
 * @returns {string|null} Cleaned author string or null if invalid
 */
const author = require('@metascraper/helpers').author;
```

## Validation and Quality Control

The plugin includes comprehensive validation mechanisms:

- **Word Count Validation**: Under strict rules, author names must contain at least two words (using `REGEX_STRICT`, the `/^\S+\s+\S+/` pattern)
- **Date Filtering**: Automatically filters out date values from byline extractions
- **Empty Value Rejection**: Rejects empty, undefined, or whitespace-only values
- **URL Filtering**: Excludes URLs from being considered as author names
- **Content Sanitization**: Cleans and normalizes extracted author strings

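As a rough illustration of how checks like these combine, the sketch below chains them in one function. It is an assumption-laden stand-in, not the `author` helper from `@metascraper/helpers`: `cleanAuthor`, `looksLikeUrl`, and `looksLikeIsoDate` are hypothetical names, and the URL and date tests are deliberately simplistic (only `http(s)` URLs and ISO-style dates are recognized).

```javascript
// Hypothetical checks; the real helpers may use different, more thorough logic.
const looksLikeUrl = value => /^https?:\/\//i.test(value);
const looksLikeIsoDate = value => /^\d{4}-\d{2}-\d{2}/.test(value);

// Hypothetical cleaning function combining the checks listed above.
const cleanAuthor = value => {
  if (typeof value !== 'string') return null;           // undefined, null, non-strings
  const normalized = value.replace(/\s+/g, ' ').trim(); // collapse and trim whitespace
  if (normalized === '') return null;                   // empty or whitespace-only
  if (looksLikeUrl(normalized)) return null;            // URLs are not author names
  if (looksLikeIsoDate(normalized)) return null;        // dates are not author names
  return normalized;
};

console.log(cleanAuthor('  John   Doe '));                    // "John Doe"
console.log(cleanAuthor('https://example.com/author/jane'));  // null
console.log(cleanAuthor('2024-01-15'));                       // null
```

Normalizing whitespace before the other checks means that padded input such as `'  https://example.com  '` is still recognized as a URL and rejected.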
## Error Handling

The plugin gracefully handles various edge cases:

- **Missing Elements**: A rule yields a falsy value (`false` from strict rules) when its target elements are not found
- **Invalid Content**: Strict rules return `false` for content that doesn't meet validation criteria
- **Malformed HTML**: Continues processing with other rules if one rule yields no value
- **Empty Documents**: Handles documents with no author information without throwing errors

## Integration with Metascraper

This plugin is designed to be used within the metascraper ecosystem:

1. Install both `metascraper` and `metascraper-author`
2. Initialize metascraper with the author plugin
3. The plugin automatically contributes to the `author` property in extraction results
4. Works seamlessly with other metascraper plugins

## Type Definitions

The metascraper-author package uses the following type definitions:

```javascript { .api }
/**
 * Main export - factory function for creating author extraction rules
 * @returns {Rules} Metascraper rules object
 */
module.exports = function metascraperAuthor() {
  return {
    author: [/* RulesOptions functions, in priority order */],
    pkgName: 'metascraper-author'
  };
};

/**
 * Individual extraction rule function
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Extraction context
 * @returns {string|null|undefined} Extracted author or null if not found
 */

/**
 * Context provided to each rule during extraction
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM API for HTML parsing
 * @property {string} url - Source URL for context and relative link resolution
 */

/**
 * Complete rules object returned by the factory function
 * @typedef {Object} Rules
 * @property {RulesOptions[]} author - Prioritized array of author extraction rules
 * @property {string} pkgName - Package identifier for debugging
 * @property {Function} [test] - Optional conditional test function
 */
```