# Configuration

Comprehensive configuration system for customizing Goose3 extraction behavior, including parser selection, language targeting, content identification patterns, network settings, and image handling options.

## Capabilities

### Configuration Class

The main configuration class that controls all aspects of the extraction process.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default values."""

    # Parser and processing options
    parser_class: str  # 'lxml' or 'soup'
    available_parsers: list  # Available parser names

    # Language and localization
    target_language: str  # Language code (e.g., 'en', 'es', 'zh')
    use_meta_language: bool  # Use meta tags for language detection

    # Network and fetching
    browser_user_agent: str  # User agent string for requests
    http_timeout: float  # HTTP request timeout in seconds
    http_auth: tuple  # HTTP authentication tuple (username, password)
    http_proxies: dict  # HTTP proxy configuration
    http_headers: dict  # Additional HTTP headers
    strict: bool  # Strict error handling for network issues

    # Image processing
    enable_image_fetching: bool  # Enable image downloading and processing
    local_storage_path: str  # Directory for storing downloaded images
    images_min_bytes: int  # Minimum image size in bytes
    imagemagick_convert_path: str  # Path to ImageMagick convert binary (unused)
    imagemagick_identify_path: str  # Path to ImageMagick identify binary (unused)

    # Content processing options
    parse_lists: bool  # Parse and format list elements
    pretty_lists: bool  # Pretty formatting for lists
    parse_headers: bool  # Parse header elements
    keep_footnotes: bool  # Keep footnote content

    # Content extraction patterns (properties with getters/setters)
    known_context_patterns: list  # Patterns for identifying article content
    known_publish_date_tags: list  # Patterns for extracting publication dates
    known_author_patterns: list  # Patterns for extracting author information

    # Advanced options
    stopwords_class: type  # Class for stopwords processing
    log_level: str  # Logging level

    # Methods
    def get_parser(self):
        """Retrieve the current parser class based on parser_class setting"""
        ...
```
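
The relationship between `parser_class`, `available_parsers`, and `get_parser()` can be pictured as a name-to-class lookup. The sketch below is a minimal illustration, not Goose3's implementation: `SketchConfig`, `LxmlParser`, and `SoupParser` are hypothetical stand-ins, and rejecting unknown names on assignment is an assumption about how such a setting might be validated.

```python
# Hypothetical stand-ins for the real parser classes (assumption, not Goose3 code).
class LxmlParser:
    """Placeholder for an lxml-backed parser."""


class SoupParser:
    """Placeholder for a BeautifulSoup-backed parser."""


# Registry mapping the string stored in parser_class to a parser class.
_PARSERS = {"lxml": LxmlParser, "soup": SoupParser}


class SketchConfig:
    """Illustrative configuration object that resolves a parser by name."""

    def __init__(self):
        self._parser_class = "lxml"  # default parser name

    @property
    def available_parsers(self):
        # Names that may be assigned to parser_class.
        return list(_PARSERS)

    @property
    def parser_class(self):
        return self._parser_class

    @parser_class.setter
    def parser_class(self, name):
        # Assumed behavior: reject names that are not registered.
        if name not in _PARSERS:
            raise ValueError(f"unknown parser {name!r}; expected one of {self.available_parsers}")
        self._parser_class = name

    def get_parser(self):
        # Return the class (not an instance) registered under the current name.
        return _PARSERS[self._parser_class]


cfg = SketchConfig()
cfg.parser_class = "soup"
parser_cls = cfg.get_parser()  # SoupParser in this sketch
```

Storing a short name rather than a class keeps the setting easy to supply from a plain dictionary or a settings file, at the cost of one lookup when the parser is needed.
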

### Pattern Helper Classes

Classes for defining custom content extraction patterns.

```python { .api }
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for identifying article content areas.

        Parameters:
        - attr: HTML attribute name (e.g., 'class', 'id')
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None):
        """
        Pattern for extracting publication dates.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - content: Name of attribute containing the date value
        - subcontent: JSON object key for nested data (optional)
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for extracting author information.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str
```
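
All three pattern classes share the same constructor rule: supply either an `attr`/`value` pair or a `tag`. The following sketch shows one way that rule could be expressed. `SketchPattern` and `is_valid` are hypothetical names, and the sketch raises `ValueError` (a subclass of `Exception`) where the documentation above only promises an `Exception`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchPattern:
    """Illustrative pattern: needs (attr and value) or tag; domain is optional."""

    attr: Optional[str] = None
    value: Optional[str] = None
    tag: Optional[str] = None
    domain: Optional[str] = None

    def __post_init__(self):
        # An attribute match is only usable when both halves are present.
        has_attr_pair = bool(self.attr and self.value)
        if not has_attr_pair and not self.tag:
            raise ValueError("provide either (attr and value) or tag")


def is_valid(**kwargs) -> bool:
    """Return True if the keyword arguments form a usable pattern."""
    try:
        SketchPattern(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid(attr="class", value="article-body"))  # attribute pair
print(is_valid(tag="main"))                          # tag only
print(is_valid(attr="class"))                        # attr without value
print(is_valid(domain="example.com"))                # domain alone is not enough
```

Under this rule a `domain` only narrows where a pattern applies; it never substitutes for the match criteria themselves.
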

### Configuration Examples

Basic configuration setup:

```python
from goose3 import Configuration

config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
config.browser_user_agent = 'Mozilla/5.0 Custom Agent'
```

Image extraction configuration:

```python
from goose3 import Configuration

config = Configuration()
config.enable_image_fetching = True  # Download and inspect candidate images
config.local_storage_path = '/tmp/goose_images'  # Where downloaded images are stored
```

Custom content patterns:

```python
from goose3 import Configuration, ArticleContextPattern

config = Configuration()

# Add custom article content pattern, limited to one domain
custom_pattern = ArticleContextPattern(
    attr='class',
    value='article-body',
    domain='example.com'
)
config.known_context_patterns.append(custom_pattern)

# Add tag-based pattern
tag_pattern = ArticleContextPattern(tag='main')
config.known_context_patterns.append(tag_pattern)
```

Language-specific configuration:

```python
from goose3 import Configuration

# Chinese language support
from goose3.text import StopWordsChinese

config = Configuration()
config.target_language = 'zh'
config.use_meta_language = False
config.stopwords_class = StopWordsChinese

# Arabic language support
from goose3.text import StopWordsArabic

config = Configuration()
config.target_language = 'ar'
config.use_meta_language = False
config.stopwords_class = StopWordsArabic

# Korean language support
from goose3.text import StopWordsKorean

config = Configuration()
config.target_language = 'ko'
config.use_meta_language = False
config.stopwords_class = StopWordsKorean

# Automatic language detection
config = Configuration()
config.use_meta_language = True
```

Network and error handling:

```python
from goose3 import Configuration

# Lenient error handling
config = Configuration()
config.strict = False  # Don't raise network exceptions

# Custom user agent
config = Configuration()
config.browser_user_agent = 'MyBot/1.0 (Custom Web Crawler)'
```

### Default Patterns

Goose3 includes built-in content extraction patterns:

```python
# Default article content patterns
KNOWN_ARTICLE_CONTENT_PATTERNS = [
    ArticleContextPattern(attr="class", value="short-story"),
    ArticleContextPattern(attr="itemprop", value="articleBody"),
    ArticleContextPattern(attr="class", value="post-content"),
    ArticleContextPattern(attr="class", value="g-content"),
    ArticleContextPattern(attr="class", value="post-outer"),
    ArticleContextPattern(tag="article"),
]

# Available parsers
AVAILABLE_PARSERS = {
    "lxml": Parser,  # Default HTML parser
    "soup": ParserSoup,  # BeautifulSoup parser
}
```

### Advanced Configuration Usage

Combining multiple configuration options:

```python
from goose3 import Goose, Configuration, PublishDatePattern

config = Configuration()
config.parser_class = 'lxml'
config.target_language = 'en'
config.enable_image_fetching = True
config.local_storage_path = '/tmp/article_images'
config.strict = True
config.browser_user_agent = 'ArticleBot/1.0'

# Add custom publish date pattern
date_pattern = PublishDatePattern(
    attr='property',
    value='article:published_time',
    content='content'
)
config.known_publish_date_tags.append(date_pattern)

g = Goose(config)
article = g.extract(url='https://example.com/article')
```
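
To make the `content` and `subcontent` fields of `PublishDatePattern` more concrete, the sketch below shows how a matched element's attributes might be read once `attr` and `value` have located it. This is an illustration only: `read_date_value` is a hypothetical helper, the `data-meta` attribute is invented for the example, and treating the value named by `content` as the JSON payload for `subcontent` is an assumption. Goose3 may obtain the JSON from elsewhere, such as the element's text.

```python
import json


def read_date_value(attributes, content, subcontent=None):
    """Return the raw date string named by content/subcontent, or None.

    attributes: dict of attribute name -> value for an already matched element.
    content:    name of the attribute that holds the date (or a JSON payload).
    subcontent: optional key to read from that payload when it is a JSON object.
    """
    raw = attributes.get(content)
    if raw is None or subcontent is None:
        return raw
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # payload is not valid JSON
    return data.get(subcontent) if isinstance(data, dict) else None


# <meta property="article:published_time" content="2024-01-15T08:30:00Z">
meta_attrs = {"property": "article:published_time", "content": "2024-01-15T08:30:00Z"}
plain_date = read_date_value(meta_attrs, content="content")

# Hypothetical element whose attribute carries a JSON object.
json_attrs = {"data-meta": '{"datePublished": "2024-01-15"}'}
nested_date = read_date_value(json_attrs, content="data-meta", subcontent="datePublished")
```

The returned value is still a raw string; turning it into a date object is a separate parsing step.
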