Tessl Tile for pypi/goose3@3.1.0

# Configuration

The `Configuration` class customizes Goose3 extraction behavior: parser selection, language targeting, content identification patterns, network settings, and image handling.

## Capabilities

### Configuration Class

The main configuration class that controls all aspects of the extraction process.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default values."""

    # Parser and processing options
    parser_class: str  # 'lxml' or 'soup'
    available_parsers: list  # Available parser names

    # Language and localization
    target_language: str  # Language code (e.g., 'en', 'es', 'zh')
    use_meta_language: bool  # Use meta tags for language detection

    # Network and fetching
    browser_user_agent: str  # User agent string for requests
    http_timeout: float  # HTTP request timeout in seconds
    http_auth: tuple  # HTTP authentication tuple (username, password)
    http_proxies: dict  # HTTP proxy configuration
    http_headers: dict  # Additional HTTP headers
    strict: bool  # Strict error handling for network issues

    # Image processing
    enable_image_fetching: bool  # Enable image downloading and processing
    local_storage_path: str  # Directory for storing downloaded images
    images_min_bytes: int  # Minimum image size in bytes
    imagemagick_convert_path: str  # Path to ImageMagick convert binary (unused)
    imagemagick_identify_path: str  # Path to ImageMagick identify binary (unused)

    # Content processing options
    parse_lists: bool  # Parse and format list elements
    pretty_lists: bool  # Pretty formatting for lists
    parse_headers: bool  # Parse header elements
    keep_footnotes: bool  # Keep footnote content

    # Content extraction patterns (properties with getters/setters)
    known_context_patterns: list  # Patterns for identifying article content
    known_publish_date_tags: list  # Patterns for extracting publication dates
    known_author_patterns: list  # Patterns for extracting author information

    # Advanced options
    stopwords_class: type  # Class for stopwords processing
    log_level: str  # Logging level

    # Methods
    def get_parser(self):
        """Retrieve the current parser class based on parser_class setting"""
        ...
```
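
In Python, assigning to a misspelled attribute name normally creates a new attribute instead of raising an error, so a typo in an option name can go unnoticed. The following is a minimal sketch of a guard against that. The helper `apply_settings` is hypothetical (it is not part of Goose3), and a `SimpleNamespace` with arbitrary placeholder values stands in for a real `Configuration` so the example runs without the library.

```python
from types import SimpleNamespace


def apply_settings(config, settings):
    """Set each option on config, rejecting names the object does not already have."""
    # Validate every name first so a bad key leaves the object untouched.
    unknown = [name for name in settings if not hasattr(config, name)]
    if unknown:
        raise AttributeError(
            f"unknown configuration option(s): {', '.join(sorted(unknown))}"
        )
    for name, value in settings.items():
        setattr(config, name, value)
    return config


# Stand-in carrying a few of the attribute names listed above (placeholder values).
# With Goose3 installed, a real goose3.Configuration() could be passed instead.
config = SimpleNamespace(http_timeout=30.0, strict=True, target_language="en")

apply_settings(config, {"http_timeout": 10.0, "strict": False})
print(config.http_timeout, config.strict)

try:
    apply_settings(config, {"http_timout": 5.0})  # deliberate typo
except AttributeError as exc:
    print(exc)
```

Checking all names before assigning any of them keeps the update all-or-nothing, which makes a failed call easier to reason about.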

### Pattern Helper Classes

Classes for defining custom content extraction patterns.

```python { .api }
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for identifying article content areas.

        Parameters:
        - attr: HTML attribute name (e.g., 'class', 'id')
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None):
        """
        Pattern for extracting publication dates.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - content: Name of attribute containing the date value
        - subcontent: JSON object key for nested data (optional)
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        """
        Pattern for extracting author information.

        Parameters:
        - attr: HTML attribute name
        - value: Attribute value to match
        - tag: HTML tag name to match
        - domain: Domain to which this pattern applies (optional)

        Note: Must provide either (attr and value) or tag

        Raises:
        - Exception: If neither (attr and value) nor tag is provided
        """

    attr: str
    value: str
    tag: str
    domain: str
```
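
The constructor rule shared by the three pattern classes (either `attr` with `value`, or `tag`) can be illustrated without the library. The sketch below uses a hypothetical stand-in class, `SketchPattern`, and a hypothetical helper, `is_valid`; it follows the rule as documented here, and the library's own check may differ in edge cases such as `attr` given without `value`.

```python
class SketchPattern:
    """Simplified stand-in that mirrors the documented constructor rule."""

    def __init__(self, *, attr=None, value=None, tag=None, domain=None):
        # Valid forms: attr and value together, or tag (both may be given).
        if not (attr and value) and not tag:
            raise Exception("provide `attr` and `value`, or `tag`")
        self.attr = attr
        self.value = value
        self.tag = tag
        self.domain = domain


def is_valid(**kwargs):
    """Return True if the keyword arguments satisfy the rule, else False."""
    try:
        SketchPattern(**kwargs)
    except Exception:
        return False
    return True


print(is_valid(attr="class", value="article-body"))  # True
print(is_valid(tag="main", domain="example.com"))    # True
print(is_valid(attr="class"))                        # False: value is missing
print(is_valid(domain="example.com"))                # False: domain alone is not enough
```

Note that `domain` only scopes a pattern; it never satisfies the rule by itself.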

### Configuration Examples

Basic configuration setup:

```python
from goose3 import Configuration

config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
config.browser_user_agent = 'Mozilla/5.0 Custom Agent'
```

Image extraction configuration:

```python
from goose3 import Configuration

config = Configuration()
config.enable_image_fetching = True
config.local_storage_path = '/tmp/goose_images'  # Directory for downloaded images
```

Custom content patterns:

```python
from goose3 import Configuration, ArticleContextPattern

config = Configuration()

# Add custom article content pattern
custom_pattern = ArticleContextPattern(
    attr='class',
    value='article-body',
    domain='example.com'
)
config.known_context_patterns.append(custom_pattern)

# Add tag-based pattern
tag_pattern = ArticleContextPattern(tag='main')
config.known_context_patterns.append(tag_pattern)
```

Language-specific configuration:

```python
from goose3 import Configuration

# Chinese language support
from goose3.text import StopWordsChinese

config = Configuration()
config.target_language = 'zh'
config.use_meta_language = False
config.stopwords_class = StopWordsChinese

# Arabic language support
from goose3.text import StopWordsArabic

config = Configuration()
config.target_language = 'ar'
config.use_meta_language = False
config.stopwords_class = StopWordsArabic

# Korean language support
from goose3.text import StopWordsKorean

config = Configuration()
config.target_language = 'ko'
config.use_meta_language = False
config.stopwords_class = StopWordsKorean

# Automatic language detection
config = Configuration()
config.use_meta_language = True
```

Network and error handling:

```python
from goose3 import Configuration

# Lenient error handling
config = Configuration()
config.strict = False  # Don't raise network exceptions

# Custom user agent
config = Configuration()
config.browser_user_agent = 'MyBot/1.0 (Custom Web Crawler)'
```

### Default Patterns

Goose3 includes built-in content extraction patterns and a registry of available parsers:

```python
# Default article content patterns
KNOWN_ARTICLE_CONTENT_PATTERNS = [
    ArticleContextPattern(attr="class", value="short-story"),
    ArticleContextPattern(attr="itemprop", value="articleBody"),
    ArticleContextPattern(attr="class", value="post-content"),
    ArticleContextPattern(attr="class", value="g-content"),
    ArticleContextPattern(attr="class", value="post-outer"),
    ArticleContextPattern(tag="article"),
]

# Available parsers
AVAILABLE_PARSERS = {
    "lxml": Parser,      # Default HTML parser
    "soup": ParserSoup,  # BeautifulSoup parser
}
```
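
A name-to-class mapping like `AVAILABLE_PARSERS` suggests that resolving `parser_class` amounts to a dictionary lookup. The sketch below illustrates that idea only; it is not Goose3's implementation of `get_parser()`. The placeholder classes, the `AVAILABLE_PARSERS_SKETCH` mapping, and the `resolve_parser` function are all hypothetical names introduced for illustration.

```python
class LxmlParserStub:
    """Placeholder for the default lxml-based parser class."""


class SoupParserStub:
    """Placeholder for the BeautifulSoup-based parser class."""


# Hypothetical registry shaped like the AVAILABLE_PARSERS mapping above.
AVAILABLE_PARSERS_SKETCH = {
    "lxml": LxmlParserStub,
    "soup": SoupParserStub,
}


def resolve_parser(parser_class):
    """Map a parser name to its class, failing loudly on unknown names."""
    try:
        return AVAILABLE_PARSERS_SKETCH[parser_class]
    except KeyError:
        choices = ", ".join(sorted(AVAILABLE_PARSERS_SKETCH))
        raise ValueError(
            f"unknown parser {parser_class!r}; expected one of: {choices}"
        ) from None


print(resolve_parser("soup").__name__)  # SoupParserStub

try:
    resolve_parser("html5lib")
except ValueError as exc:
    print(exc)
```

Listing the valid choices in the error message makes a mistyped `parser_class` value quick to diagnose.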

### Advanced Configuration Usage

Combining multiple configuration options:

```python
from goose3 import Goose, Configuration, PublishDatePattern

config = Configuration()
config.parser_class = 'lxml'
config.target_language = 'en'
config.enable_image_fetching = True
config.local_storage_path = '/tmp/article_images'
config.strict = True
config.browser_user_agent = 'ArticleBot/1.0'

# Add custom publish date pattern
date_pattern = PublishDatePattern(
    attr='property',
    value='article:published_time',
    content='content'
)
config.known_publish_date_tags.append(date_pattern)

g = Goose(config)
article = g.extract(url='https://example.com/article')
```
