# Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features. The `Configuration` class exposes the settings that control article processing, while the utility functions add capabilities such as full-text extraction and trending-topic discovery.

## Capabilities

### Configuration Management

Configuration options for customizing newspaper3k behavior.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default settings."""

    def get_language(self) -> str:
        """Get the current language setting."""

    def set_language(self, language: str):
        """
        Set the target language for processing.

        Parameters:
        - language: Two-character language code (e.g., 'en', 'es', 'fr')

        Raises:
        Exception: If the language code is invalid or not 2 characters
        """

    @staticmethod
    def get_stopwords_class(language: str):
        """
        Get the appropriate stopwords class for a language.

        Parameters:
        - language: Two-character language code

        Returns:
        Stopwords class for the specified language
        """

    @staticmethod
    def get_parser():
        """Get the HTML parser class (lxml-based Parser)."""
```
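
The validation rule documented for `set_language` can be illustrated without the library. The sketch below is a stand-in, not newspaper3k's own code: `LanguageSetting` is a hypothetical class that mirrors only the documented rule (reject a code that is empty or not exactly two characters), and the error message is made up for illustration.

```python
class LanguageSetting:
    """Hypothetical stand-in that mirrors the documented set_language rule."""

    def __init__(self, language: str = "en"):
        self._language = language

    def get_language(self) -> str:
        return self._language

    def set_language(self, language: str) -> None:
        # Documented behavior: raise if the code is missing or not 2 characters.
        if not language or len(language) != 2:
            raise Exception(f"Invalid language code: {language!r}")
        self._language = language


setting = LanguageSetting()
setting.set_language("es")
print(setting.get_language())  # es

try:
    setting.set_language("spa")  # three characters, so it is rejected
except Exception as exc:
    print(f"Rejected: {exc}")
```

Because the documented method raises a plain `Exception`, callers that accept user-supplied codes would likely wrap the call in a `try`/`except` as shown.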

### Configuration Properties

Attributes of a `Configuration` instance for fine-tuning extraction behavior, shown with their default values.

```python { .api }
# Content Validation Thresholds
MIN_WORD_COUNT: int = 300  # Minimum words for valid article
MIN_SENT_COUNT: int = 7  # Minimum sentences for valid article
MAX_TITLE: int = 200  # Maximum title length in characters
MAX_TEXT: int = 100000  # Maximum article text length
MAX_KEYWORDS: int = 35  # Maximum keywords to extract
MAX_AUTHORS: int = 10  # Maximum authors to extract
MAX_SUMMARY: int = 5000  # Maximum summary length
MAX_SUMMARY_SENT: int = 5  # Maximum summary sentences

# Caching and Storage
MAX_FILE_MEMO: int = 20000  # Max URLs cached per news source
memoize_articles: bool = True  # Cache articles between runs

# Media Processing
fetch_images: bool = True  # Download and process images
image_dimension_ration: float = 16/9.0  # Preferred image aspect ratio (name spelled as in the library)

# Network and Processing
follow_meta_refresh: bool = False  # Follow meta refresh redirects
use_meta_language: bool = True  # Use language from HTML meta tags
keep_article_html: bool = False  # Retain cleaned article HTML
http_success_only: bool = True  # Fail on HTTP error responses
request_timeout: int = 7  # HTTP request timeout in seconds
number_threads: int = 10  # Default thread count
thread_timeout_seconds: int = 1  # Thread timeout in seconds

# Language and Localization
language: str = 'en'  # Target language code
stopwords_class: type = StopWords  # Stopwords class for language

# HTTP Configuration
browser_user_agent: str  # HTTP User-Agent header
headers: dict = {}  # Additional HTTP headers
proxies: dict = {}  # Proxy configuration

# Debugging
verbose: bool = False  # Enable debug logging
```
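
To make the two validation thresholds concrete, here is a minimal sketch of how limits like `MIN_WORD_COUNT` and `MIN_SENT_COUNT` could gate a piece of text. It is not the library's validation logic: `meets_thresholds` is a hypothetical helper, and the sentence splitter is a naive regular expression chosen only to keep the example self-contained.

```python
import re

MIN_WORD_COUNT = 300  # default listed above
MIN_SENT_COUNT = 7    # default listed above


def meets_thresholds(text: str,
                     min_word_count: int = MIN_WORD_COUNT,
                     min_sent_count: int = MIN_SENT_COUNT) -> bool:
    """Hypothetical check: enough words and enough sentences."""
    word_count = len(text.split())
    # Naive split on ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return word_count >= min_word_count and len(sentences) >= min_sent_count


short_text = "Too short. Only two sentences."
long_text = " ".join(["This sentence has exactly six words."] * 60)

print(meets_thresholds(short_text))  # False
print(meets_thresholds(long_text))   # True
```

Raising either threshold makes the check stricter, which is the effect the "Content Quality Configuration" example relies on.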

### Utility Functions

Standalone functions for specialized processing and information retrieval.

```python { .api }
def fulltext(html: str, language: str = 'en') -> str:
    """
    Extract clean text content from raw HTML.

    Parameters:
    - html: Raw HTML string
    - language: Language code for processing (default: 'en')

    Returns:
    Extracted plain text content
    """

def hot() -> list | None:
    """
    Get trending topics from Google Trends.

    Returns:
    List of trending search terms, or None if the request failed
    """

def languages():
    """Print list of supported languages to console."""

def popular_urls() -> list:
    """
    Get list of popular news source URLs.

    Returns:
    List of pre-extracted popular news website URLs
    """
```

### Language Support Classes

Specialized stopwords classes for different languages.

```python { .api }
class StopWords:
    """Default English stopwords class."""

class StopWordsChinese(StopWords):
    """Chinese language stopwords."""

class StopWordsArabic(StopWords):
    """Arabic and Persian language stopwords."""

class StopWordsKorean(StopWords):
    """Korean language stopwords."""

class StopWordsHindi(StopWords):
    """Hindi language stopwords."""

class StopWordsJapanese(StopWords):
    """Japanese language stopwords."""
```
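
The relationship between a language code and these classes can be pictured as a lookup with a default. The sketch below uses empty placeholder classes with the same names so that it runs on its own. `pick_stopwords_class` is a hypothetical helper, and the specific codes in the table (`'zh'`, `'ar'`, `'fa'`, `'ko'`, `'hi'`, `'ja'`) are assumptions based on the class descriptions above rather than a statement of how `Configuration.get_stopwords_class` is implemented.

```python
# Placeholder classes: names match the listing above, bodies are empty.
class StopWords: ...
class StopWordsChinese(StopWords): ...
class StopWordsArabic(StopWords): ...
class StopWordsKorean(StopWords): ...
class StopWordsHindi(StopWords): ...
class StopWordsJapanese(StopWords): ...


# Assumed code-to-class table; Arabic and Persian share one class.
_SPECIALIZED = {
    "zh": StopWordsChinese,
    "ar": StopWordsArabic,
    "fa": StopWordsArabic,
    "ko": StopWordsKorean,
    "hi": StopWordsHindi,
    "ja": StopWordsJapanese,
}


def pick_stopwords_class(language: str) -> type:
    """Hypothetical lookup: specialized class if known, else the default."""
    return _SPECIALIZED.get(language, StopWords)


print(pick_stopwords_class("fa").__name__)  # StopWordsArabic
print(pick_stopwords_class("en").__name__)  # StopWords
```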

### Helper Functions

Additional utility functions for configuration and language support.

```python { .api }
def get_available_languages() -> list:
    """
    Get list of supported language codes.

    Returns:
    List of two-character language codes
    """

def print_available_languages():
    """Print supported languages to console."""

def extend_config(config: Configuration, config_items: dict) -> Configuration:
    """
    Merge configuration object with additional settings.

    Parameters:
    - config: Base Configuration object
    - config_items: Dictionary of additional configuration values

    Returns:
    Updated Configuration object
    """
```
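
The merge that `extend_config` describes can be sketched as copying dictionary entries onto an object's attributes. This is an illustration under stated assumptions, not the library's implementation: `merge_settings` is a hypothetical name, `SimpleNamespace` stands in for a `Configuration` object, and skipping keys that the object does not already define is an assumed policy.

```python
from types import SimpleNamespace


def merge_settings(config, config_items: dict):
    """Hypothetical merge: copy known keys from config_items onto config."""
    for key, value in config_items.items():
        # Assumption: keys the config does not already define are skipped.
        if hasattr(config, key):
            setattr(config, key, value)
    return config


base = SimpleNamespace(request_timeout=7, fetch_images=True)
merged = merge_settings(base, {"request_timeout": 15, "not_a_setting": 1})

print(merged.request_timeout)            # 15
print(merged.fetch_images)               # True
print(hasattr(merged, "not_a_setting"))  # False
```

Note that the sketch mutates and returns the same object, so `merged is base` holds here.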

## Usage Examples

### Basic Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration  # also exported as newspaper.Config

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False
config.request_timeout = 10

# Use with article
article = Article('http://spanish-news-site.com/article', config=config)
article.build()
```

### Multi-language Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Process articles in different languages
languages = ['en', 'es', 'fr', 'de']
articles = {}

for lang in languages:
    config = Configuration()
    config.set_language(lang)

    # Language-specific URL (example)
    url = f'http://news-site.com/{lang}/article'
    article = Article(url, config=config)
    article.build()

    articles[lang] = article
    print(f"{lang}: {article.title}")
```

### Performance Optimization

```python
from newspaper import build
from newspaper.configuration import Configuration

# High-performance configuration
config = Configuration()
config.number_threads = 20
config.thread_timeout_seconds = 2
config.request_timeout = 5
config.memoize_articles = True
config.fetch_images = False  # Skip images for speed

# Build source with optimized settings
source = build('http://news-site.com', config=config)
print(f"Fast processing: {len(source.articles)} articles discovered")
```

### Content Quality Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Strict content validation
config = Configuration()
config.MIN_WORD_COUNT = 800  # Require longer articles
config.MIN_SENT_COUNT = 15  # Require more sentences
config.MAX_KEYWORDS = 50  # Extract more keywords
config.MAX_SUMMARY_SENT = 10  # Longer summaries

# Use strict configuration
article = Article('http://long-form-article.com', config=config)
article.build()

if article.is_valid_body():
    # len(article.text) would count characters, so split into words first
    word_count = len(article.text.split())
    print(f"High-quality article: {word_count} words")
    print(f"Keywords: {len(article.keywords)}")
    # Rough sentence count: splits on periods only
    print(f"Summary sentences: {len(article.summary.split('.'))}")
```

### Network Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Custom network settings
config = Configuration()
config.browser_user_agent = 'MyBot/1.0'
config.headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate'
}
config.proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}
config.request_timeout = 15

# Use custom network settings
article = Article('http://example.com/article', config=config)
article.download()
```

### Language Detection and Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration
from newspaper.utils import get_available_languages

# Show supported languages
print("Supported languages:")
print(get_available_languages())

# Auto-detect and process
def process_with_language_detection(url):
    # First pass - detect language
    article = Article(url)
    article.download()
    article.parse()  # This extracts meta_lang

    detected_lang = article.meta_lang
    if detected_lang in get_available_languages():
        # Second pass with detected language
        config = Configuration()
        config.set_language(detected_lang)

        article_lang = Article(url, config=config)
        article_lang.build()
        return article_lang

    return article

# Process with language detection
result = process_with_language_detection('http://multilingual-site.com/article')
print(f"Language: {result.meta_lang}")
print(f"Title: {result.title}")
```

### Utility Functions Usage

```python
from newspaper import fulltext, hot, popular_urls

# Extract text from raw HTML
html_content = """
<html><body>
<h1>News Title</h1>
<p>This is the main article content with <a href="#">links</a> and formatting.</p>
</body></html>
"""

clean_text = fulltext(html_content, language='en')
print(f"Extracted text: {clean_text}")

# Get trending topics
try:
    trending = hot()
    if trending:
        print("Trending topics:", trending[:5])
except Exception as e:
    print(f"Could not fetch trending topics: {e}")

# Get popular news sources
popular_sources = popular_urls()
print(f"Popular sources: {len(popular_sources)} URLs")
for source in popular_sources[:5]:
    print(f"  {source}")
```

### Debug Configuration

```python
import logging

from newspaper import Article
from newspaper.configuration import Configuration

# Enable debug logging
config = Configuration()
config.verbose = True

# Set up logging to see debug output
logging.basicConfig(level=logging.DEBUG)

# Process with verbose output
article = Article('http://example.com/article', config=config)
article.build()  # Will show detailed debug information
```