tessl/npm-metascraper-author

Get author property from HTML markup using metascraper plugin rules

Workspace: tessl
Visibility: Public
Describes: npmpkg:npm/metascraper-author@5.49.x

To install, run

npx @tessl/cli install tessl/npm-metascraper-author@5.49.0

# Metascraper Author

Metascraper Author is a specialized plugin for the metascraper ecosystem that extracts author information from HTML markup. It implements a comprehensive set of extraction rules to identify authors from various HTML structures including JSON-LD structured data, meta tags, microdata, and semantic HTML elements.

## Package Information

- **Package Name**: metascraper-author
- **Package Type**: npm
- **Language**: JavaScript
- **Installation**: `npm install metascraper-author`

## Core Imports

```javascript
const metascraperAuthor = require('metascraper-author');
```

Note: This package uses CommonJS exports only, so the examples here load it with `require`.

## Basic Usage

```javascript
const metascraper = require('metascraper')([
  require('metascraper-author')()
]);

const html = `
<html>
  <head>
    <meta name="author" content="John Doe">
  </head>
  <body>
    <article>
      <h1>Sample Article</h1>
      <p>Content here...</p>
    </article>
  </body>
</html>
`;

const url = 'https://example.com/article';

(async () => {
  const metadata = await metascraper({ html, url });
  console.log(metadata.author); // "John Doe"
})();
```

## Architecture

Metascraper Author follows the metascraper plugin architecture pattern:

- **Factory Function**: Exports a function that returns a rules object compatible with the metascraper rule engine
- **Rule-based Extraction**: Uses a prioritized array of extraction rules, each targeting a different HTML pattern
- **Fallback Strategy**: Tries the rules in priority order, from structured metadata down to class-name heuristics, falling back to the next rule when one yields no valid value
- **Validation Layer**: Applies strict validation to the heuristic rules so that extracted author names meet quality requirements
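
The pattern above can be sketched in plain JavaScript without installing anything. In this sketch, `examplePlugin`, `fakeDom`, and `resolve` are hypothetical names: `fakeDom` stands in for the Cheerio instance a real rule would receive, and `resolve` is a simplified assumption about how a rule engine might keep the first usable value. It is not metascraper's actual implementation.

```javascript
// Hypothetical stand-in for a parsed document: maps a selector to its text.
const fakeDom = selectors => selector => selectors[selector];

// A plugin is a factory that returns a rules object (shape taken from this document).
const examplePlugin = () => ({
  author: [
    ({ htmlDom }) => htmlDom('meta[name="author"]'),
    ({ htmlDom }) => htmlDom('a[rel="author"]')
  ],
  pkgName: 'example-author-plugin'
});

// Simplified resolver (an assumption, not metascraper's engine):
// try each rule in order and keep the first truthy result.
const resolve = (rules, context) => {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value;
  }
  return null;
};

const { author } = examplePlugin();
const url = 'https://example.com/article';

// Both sources present: the earlier rule (meta tag) wins.
const fromMeta = resolve(author, {
  htmlDom: fakeDom({ 'meta[name="author"]': 'John Doe', 'a[rel="author"]': 'Jane Roe' }),
  url
});

// Only the link is present: the resolver falls back to the second rule.
const fromLink = resolve(author, {
  htmlDom: fakeDom({ 'a[rel="author"]': 'Jane Roe' }),
  url
});

// Nothing matches: no author is reported.
const missing = resolve(author, { htmlDom: fakeDom({}), url });

console.log(fromMeta, fromLink, missing);
```

Ordering the array from the most trustworthy source to the least means a page with clean metadata never reaches the heuristic rules.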

## Capabilities

### Rule Factory Function

Creates and returns extraction rules for identifying author information from HTML markup.

```javascript { .api }
/**
 * Factory function that returns metascraper rules for author extraction
 * @returns {Rules} Rules object containing author extraction strategies
 */
function metascraperAuthor() {
  return {
    /** Array of extraction rules for author identification (type: RulesOptions[]) */
    author: [ /* rule functions, in priority order */ ],
    /** Package identifier for metascraper */
    pkgName: 'metascraper-author'
  };
}

/**
 * Rule extraction function type
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Rule execution context
 * @returns {string|null|undefined} Extracted value or null/undefined if not found
 */

/**
 * Rule execution context
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM instance
 * @property {string} url - Page URL for context
 */

/**
 * Metascraper rules object
 * @typedef {Object} Rules
 * @property {RulesOptions[]} [author] - Array of author extraction rules
 * @property {string} [pkgName] - Package identifier
 * @property {Function} [test] - Optional test function for conditional rule execution
 */
```

### Extraction Rules

The plugin implements 13 different extraction strategies in priority order:

#### 1. JSON-LD Structured Data

- Extracts from `author.name` property in JSON-LD
- Extracts from `brand.name` property as fallback

#### 2. Meta Tags

- `<meta name="author" content="...">`
- `<meta property="article:author" content="...">`

#### 3. Microdata

- Elements with `itemprop*="author"` containing `itemprop="name"`
- Elements with `itemprop*="author"`

#### 4. Semantic HTML

- Links with `rel="author"`

#### 5. CSS Class-based Selectors (with strict validation)

- Links with class containing "author"
- Author class elements containing links
- Links with href containing "/author/"

#### 6. Alternative Patterns

- Links with class containing "screenname"
- Elements with class containing "author" (strict)
- Elements with class containing "byline" (strict, excluding dates)

### Internal Validation

The plugin uses internal validation mechanisms to ensure quality author extraction:

```javascript { .api }
/**
 * Internal strict validation function
 * Enforces stricter matching criteria for author extraction rules
 * @param {Function} rule - Base extraction rule to enhance
 * @returns {Function} Enhanced rule with strict validation
 */
const strict = rule => $ => {
  const value = rule($);
  return /^\S+\s+\S+/.test(value) && value; // Must contain at least two words
};
```

**Validation Features:**

- **Word Count Validation**: Author names must contain at least two words (regex: `/^\S+\s+\S+/`)
- **Strict Rules**: Some extraction patterns use enhanced validation for better accuracy
- **Content Filtering**: Automatically filters out date values and invalid content from byline elements
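
To make the behavior of the `strict` wrapper concrete, the sketch below applies it to a few stub rules. The stub rules and their sample values are hypothetical; they ignore the `$` argument and return a fixed string so the example runs without Cheerio.

```javascript
// Same regex and wrapper as shown in this section.
const REGEX_STRICT = /^\S+\s+\S+/;

const strict = rule => $ => {
  const value = rule($);
  return REGEX_STRICT.test(value) && value;
};

// Hypothetical stub rules that return a fixed value instead of querying a DOM.
const fullName = strict(() => 'John Doe');
const singleWord = strict(() => 'admin');
const noValue = strict(() => undefined);
const leadingSpace = strict(() => ' John Doe');

const results = {
  fullName: fullName(),         // two words: the value passes through unchanged
  singleWord: singleWord(),     // one word: rejected
  noValue: noValue(),           // nothing extracted: rejected
  leadingSpace: leadingSpace()  // regex is anchored at the start, so leading whitespace fails
};

console.log(results);
```

A rejected value comes back as `false` rather than `null`, so code that calls a wrapped rule directly should test for truthiness instead of comparing against `null`.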

## Dependencies

### Runtime Dependencies

This package depends on `@metascraper/helpers` which provides the following key utility functions:

```javascript { .api }
/**
 * Extract JSON-LD structured data values
 * @param {string} path - JSONPath expression (e.g., 'author.name')
 * @returns {Function} Rule function for extracting JSON-LD values
 */
const $jsonld = require('@metascraper/helpers').$jsonld;

/**
 * Filter and extract text content from DOM elements
 * @param {CheerioAPI} $ - Cheerio instance
 * @param {CheerioElement} elements - Selected elements
 * @param {Function} [filterFn] - Optional element filter function
 * @returns {string|null} Extracted and cleaned text content
 */
const $filter = require('@metascraper/helpers').$filter;

/**
 * Convert a mapping function into a metascraper rule
 * @param {Function} mapper - Function to process extracted values
 * @returns {Function} Metascraper-compatible rule function
 */
const toRule = require('@metascraper/helpers').toRule;

/**
 * Validate and parse date values
 * @param {string} value - Potential date string
 * @returns {boolean} True if value is a valid date
 */
const date = require('@metascraper/helpers').date;

/**
 * Clean and validate author strings
 * @param {string} value - Raw author value
 * @returns {string|null} Cleaned author string or null if invalid
 */
const author = require('@metascraper/helpers').author;
```

## Validation and Quality Control

The plugin includes comprehensive validation mechanisms:

- **Word Count Validation**: Author names must contain at least two words (using `REGEX_STRICT`)
- **Date Filtering**: Automatically filters out date values from byline extractions
- **Empty Value Rejection**: Rejects empty, undefined, or whitespace-only values
- **URL Filtering**: Excludes URLs from being considered as author names
- **Content Sanitization**: Cleans and normalizes extracted author strings
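
The checks listed above can be illustrated with a small standalone filter. This is a sketch under stated assumptions, not the code in `@metascraper/helpers`: every helper name here (`isBlank`, `looksLikeUrl`, `looksLikeIsoDate`, `hasTwoWords`, `cleanAuthor`) is hypothetical, date detection is reduced to an ISO-style `YYYY-MM-DD` prefix, and the two-word check is applied to every value even though the plugin applies it only to its strict rules.

```javascript
// Hypothetical helpers; each mirrors one bullet from the list above.
const isBlank = v => typeof v !== 'string' || v.trim() === '';
const looksLikeUrl = v => /^https?:\/\//i.test(v);
const looksLikeIsoDate = v => /^\d{4}-\d{2}-\d{2}/.test(v); // simplification: ISO-style dates only
const hasTwoWords = v => /^\S+\s+\S+/.test(v);

// Returns a normalized author string, or null when any check fails.
const cleanAuthor = value => {
  if (isBlank(value)) return null;

  // Sanitization: trim the ends and collapse runs of whitespace.
  const normalized = value.trim().replace(/\s+/g, ' ');

  if (looksLikeUrl(normalized) || looksLikeIsoDate(normalized)) return null;
  return hasTwoWords(normalized) ? normalized : null;
};

const samples = [
  '  Jane   Roe ',
  '2024-01-15 10:00',
  'https://example.com/author/jane',
  '   ',
  'admin'
];

const cleaned = samples.map(cleanAuthor);
console.log(cleaned);
```

Running the checks from cheapest to most specific keeps each helper trivial to test on its own.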

## Error Handling

The plugin gracefully handles various edge cases:

- **Missing Elements**: Returns `false` when target elements are not found
- **Invalid Content**: Returns `false` for content that doesn't meet validation criteria
- **Malformed HTML**: Continues processing with other rules if one rule fails
- **Empty Documents**: Handles documents with no author information without throwing errors

## Integration with Metascraper

This plugin is designed to be used within the metascraper ecosystem:

1. Install both `metascraper` and `metascraper-author`
2. Initialize metascraper with the author plugin
3. The plugin automatically contributes to the `author` property in extraction results
4. Works seamlessly with other metascraper plugins

## Type Definitions

The metascraper-author package uses the following type definitions:

```javascript { .api }
/**
 * Main export - factory function for creating author extraction rules
 * @returns {Rules} Metascraper rules object
 */
module.exports = function metascraperAuthor() {
  return {
    // Type: RulesOptions[] (rule functions, in priority order)
    author: [],
    pkgName: 'metascraper-author'
  };
};

/**
 * Individual extraction rule function
 * @typedef {Function} RulesOptions
 * @param {RulesTestOptions} options - Extraction context
 * @returns {string|null|undefined} Extracted author or null if not found
 */

/**
 * Context provided to each rule during extraction
 * @typedef {Object} RulesTestOptions
 * @property {import('cheerio').CheerioAPI} htmlDom - Cheerio DOM API for HTML parsing
 * @property {string} url - Source URL for context and relative link resolution
 */

/**
 * Complete rules object returned by the factory function
 * @typedef {Object} Rules
 * @property {RulesOptions[]} author - Prioritized array of author extraction rules
 * @property {string} pkgName - Package identifier for debugging
 * @property {Function} [test] - Optional conditional test function
 */
```