# Configuration

Comprehensive configuration system for customizing Goose3 extraction behavior, including parser selection, language targeting, content identification patterns, network settings, and image handling options.

## Capabilities

### Configuration Class

The main configuration class that controls all aspects of the extraction process.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default values."""

    # Parser and processing options
    parser_class: str  # 'lxml' or 'soup'
    available_parsers: list  # Available parser names

    # Language and localization
    target_language: str  # Language code (e.g., 'en', 'es', 'zh')
    use_meta_language: bool  # Use meta tags for language detection

    # Network and fetching
    browser_user_agent: str  # User agent string for requests
    http_timeout: float  # HTTP request timeout in seconds
    http_auth: tuple  # HTTP authentication tuple (username, password)
    http_proxies: dict  # HTTP proxy configuration
    http_headers: dict  # Additional HTTP headers
    strict: bool  # Strict error handling for network issues

    # Image processing
    enable_image_fetching: bool  # Enable image downloading and processing
    local_storage_path: str  # Directory for storing downloaded images
    images_min_bytes: int  # Minimum image size in bytes
    imagemagick_convert_path: str  # Path to ImageMagick convert binary (unused)
    imagemagick_identify_path: str  # Path to ImageMagick identify binary (unused)

    # Content processing options
    parse_lists: bool  # Parse and format list elements
    pretty_lists: bool  # Pretty formatting for lists
    parse_headers: bool  # Parse header elements
    keep_footnotes: bool  # Keep footnote content

    # Content extraction patterns (properties with getters/setters)
    known_context_patterns: list  # Patterns for identifying article content
    known_publish_date_tags: list  # Patterns for extracting publication dates
    known_author_patterns: list  # Patterns for extracting author information

    # Advanced options
    stopwords_class: type  # Class for stopwords processing
    log_level: str  # Logging level

    # Methods
    def get_parser(self):
        """Retrieve the current parser class based on parser_class setting"""
        ...
```

### Pattern Helper Classes

Classes for defining custom content extraction patterns.

```python { .api }
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for identifying article content areas.

        Parameters:
        - attr: HTML attribute name (e.g., 'class', 'id')
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None):
        """
        Pattern for extracting publication dates.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - content: Name of attribute containing the date value
        - subcontent: JSON object key for nested data (optional)
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for extracting author information.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str
```
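
The constructor rule shared by the three pattern classes (either `attr` together with `value`, or `tag`) can be illustrated with a small stand-in class. This is a minimal sketch, not Goose3's implementation: `SketchPattern` and `is_valid` are hypothetical names, and raising `ValueError` is an assumption, since the listing only promises a generic `Exception`.

```python
class SketchPattern:
    """Hypothetical stand-in mirroring the documented constructor rule."""

    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        # Either (attr and value) or tag must be supplied.
        # ValueError is an assumption; it is a subclass of Exception.
        if not ((attr and value) or tag):
            raise ValueError("provide attr and value, or tag")
        self.attr = attr
        self.value = value
        self.tag = tag
        self.domain = domain


def is_valid(**kwargs):
    """Return True if the keyword arguments satisfy the constructor rule."""
    try:
        SketchPattern(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid(attr="class", value="article-body"))  # attr and value given
print(is_valid(tag="main"))                           # tag given
print(is_valid(attr="class"))                         # attr without value
```

Under this reading, an `attr` supplied without a `value` is rejected unless a `tag` is also given.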

### Configuration Examples

Basic configuration setup:

```python
from goose3 import Configuration

config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
config.browser_user_agent = 'Mozilla/5.0 Custom Agent'
```

Image extraction configuration:

```python
from goose3 import Configuration

config = Configuration()
config.enable_image_fetching = True
config.local_storage_path = '/tmp/goose_images'
```

Custom content patterns:

```python
from goose3 import Configuration, ArticleContextPattern

config = Configuration()

# Add custom article content pattern
custom_pattern = ArticleContextPattern(
    attr='class',
    value='article-body',
    domain='example.com'
)
config.known_context_patterns.append(custom_pattern)

# Add tag-based pattern
tag_pattern = ArticleContextPattern(tag='main')
config.known_context_patterns.append(tag_pattern)
```

Language-specific configuration:

```python
from goose3 import Configuration

# Chinese language support
from goose3.text import StopWordsChinese

config = Configuration()
config.target_language = 'zh'
config.use_meta_language = False
config.stopwords_class = StopWordsChinese

# Arabic language support
from goose3.text import StopWordsArabic

config = Configuration()
config.target_language = 'ar'
config.use_meta_language = False
config.stopwords_class = StopWordsArabic

# Korean language support
from goose3.text import StopWordsKorean

config = Configuration()
config.target_language = 'ko'
config.use_meta_language = False
config.stopwords_class = StopWordsKorean

# Automatic language detection
config = Configuration()
config.use_meta_language = True
```
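
One plausible reading of how `use_meta_language` and `target_language` interact is sketched below. The function `resolve_language` is hypothetical and not part of Goose3. Falling back to `target_language` when the page declares no language, and reducing a tag such as `es-ES` to its first two letters, are both assumptions made for illustration.

```python
from typing import Optional


def resolve_language(
    use_meta_language: bool,
    target_language: str,
    meta_language: Optional[str],
) -> str:
    """Pick the language code to use for stopword-based extraction.

    meta_language stands for whatever language the page declares in its
    meta tags, or None if it declares nothing.
    """
    if use_meta_language and meta_language:
        # Assumption: keep only the primary two-letter subtag.
        return meta_language[:2].lower()
    # Assumption: otherwise the configured target language applies.
    return target_language


# Page declares Spanish and meta detection is enabled.
print(resolve_language(True, "en", "es-ES"))
# Meta detection disabled, so the configured language is used.
print(resolve_language(False, "zh", "en"))
```

This is why the language-specific examples set `use_meta_language = False`: the explicit `target_language` then cannot be overridden by page metadata.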

Network and error handling:

```python
from goose3 import Configuration

# Lenient error handling
config = Configuration()
config.strict = False  # Don't raise network exceptions

# Custom user agent
config = Configuration()
config.browser_user_agent = 'MyBot/1.0 (Custom Web Crawler)'
```
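
The difference between strict and lenient handling can be sketched with a generic wrapper. This illustrates the general pattern only: `fetch_page` and `failing_fetcher` are hypothetical names, and returning `None` in lenient mode is an assumption, since this page does not say what Goose3 yields when a request fails with `strict = False`.

```python
from typing import Callable, Optional


def fetch_page(url: str, fetcher: Callable[[str], str], strict: bool) -> Optional[str]:
    """Call fetcher(url); re-raise network errors only in strict mode."""
    try:
        return fetcher(url)
    except OSError:
        # ConnectionError and TimeoutError are subclasses of OSError.
        if strict:
            raise
        return None  # Assumption: lenient mode swallows the error.


def failing_fetcher(url: str) -> str:
    """Stand-in for a network call that always fails."""
    raise ConnectionError(f"could not reach {url}")


# Lenient mode: the failure is absorbed.
print(fetch_page("https://example.com/article", failing_fetcher, strict=False))

# Strict mode: the failure propagates to the caller.
try:
    fetch_page("https://example.com/article", failing_fetcher, strict=True)
except ConnectionError as exc:
    print(f"raised: {exc}")
```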

### Default Patterns

Goose3 includes built-in content extraction patterns:

```python
# Default article content patterns
KNOWN_ARTICLE_CONTENT_PATTERNS = [
    ArticleContextPattern(attr="class", value="short-story"),
    ArticleContextPattern(attr="itemprop", value="articleBody"),
    ArticleContextPattern(attr="class", value="post-content"),
    ArticleContextPattern(attr="class", value="g-content"),
    ArticleContextPattern(attr="class", value="post-outer"),
    ArticleContextPattern(tag="article"),
]

# Available parsers
AVAILABLE_PARSERS = {
    "lxml": Parser,  # Default HTML parser
    "soup": ParserSoup,  # BeautifulSoup parser
}
```
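
A name-to-class mapping like `AVAILABLE_PARSERS` suggests a simple lookup behind `get_parser()`. The sketch below shows that idea with placeholder classes. `LxmlParserStub`, `SoupParserStub`, `PARSERS`, and `lookup_parser` are hypothetical, and raising `ValueError` for an unknown name is an assumption, not documented Goose3 behavior.

```python
class LxmlParserStub:
    """Placeholder for the lxml-backed parser class."""


class SoupParserStub:
    """Placeholder for the BeautifulSoup-backed parser class."""


# Hypothetical registry shaped like AVAILABLE_PARSERS.
PARSERS = {
    "lxml": LxmlParserStub,
    "soup": SoupParserStub,
}


def lookup_parser(parser_class: str) -> type:
    """Return the parser class registered under parser_class."""
    try:
        return PARSERS[parser_class]
    except KeyError:
        known = ", ".join(sorted(PARSERS))
        raise ValueError(
            f"unknown parser {parser_class!r}; expected one of: {known}"
        ) from None


print(lookup_parser("soup").__name__)
```

Keeping the registry as a plain dictionary means the list of valid names (the role `available_parsers` plays in the class listing) can be derived from its keys.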

### Advanced Configuration Usage

Combining multiple configuration options:

```python
from goose3 import Goose, Configuration, PublishDatePattern

config = Configuration()
config.parser_class = 'lxml'
config.target_language = 'en'
config.enable_image_fetching = True
config.local_storage_path = '/tmp/article_images'
config.strict = True
config.browser_user_agent = 'ArticleBot/1.0'

# Add custom publish date pattern
date_pattern = PublishDatePattern(
    attr='property',
    value='article:published_time',
    content='content'
)
config.known_publish_date_tags.append(date_pattern)

g = Goose(config)
article = g.extract(url='https://example.com/article')
```
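
To make the roles of `attr`, `value`, and `content` in a publish date pattern concrete, the following sketch applies the same three fields to a small HTML fragment using only the standard library. It is not how Goose3 performs the match internally: `MetaDateFinder` is a hypothetical helper, and the sample HTML and timestamp are invented for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaDateFinder(HTMLParser):
    """Find the first tag whose `attr` equals `value`, then record the
    attribute named by `content` (hypothetical helper)."""

    def __init__(self, attr: str, value: str, content: str):
        super().__init__()
        self.attr = attr
        self.value = value
        self.content = content
        self.found: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return  # Keep the first match only.
        attributes = dict(attrs)
        if attributes.get(self.attr) == self.value:
            self.found = attributes.get(self.content)


# Invented sample document.
SAMPLE_HTML = (
    "<html><head>"
    '<meta property="article:published_time" content="2024-01-15T08:30:00Z">'
    "</head><body></body></html>"
)

finder = MetaDateFinder(
    attr="property",
    value="article:published_time",
    content="content",
)
finder.feed(SAMPLE_HTML)
print(finder.found)  # 2024-01-15T08:30:00Z
```

Here `attr` and `value` select the element, while `content` names the attribute that carries the date string.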