Tessl Tile for pypi/newspaper3k@0.2.0

# Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features. The Configuration class provides extensive customization options for article processing, while utility functions offer additional capabilities like fulltext extraction and trending topic discovery.

## Capabilities

### Configuration Management

Comprehensive configuration options for customizing newspaper3k behavior.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default settings."""

    def get_language(self) -> str:
        """Get the current language setting."""

    def set_language(self, language: str):
        """
        Set the target language for processing.

        Parameters:
        - language: Two-character language code (e.g., 'en', 'es', 'fr')

        Raises:
        Exception: If language code is invalid or not 2 characters
        """

    @staticmethod
    def get_stopwords_class(language: str):
        """
        Get the appropriate stopwords class for a language.

        Parameters:
        - language: Two-character language code

        Returns:
        Stopwords class for the specified language
        """

    @staticmethod
    def get_parser():
        """Get the HTML parser class (lxml-based Parser)."""
```
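The two-character rule documented for `set_language` can be illustrated with a small stand-alone check. This is a sketch of the described behavior, not the library's source: the function name `validate_language_code` is hypothetical, and raising `ValueError` is a choice made here (the docstring above only says `Exception`).

```python
def validate_language_code(language: str) -> str:
    """Return the code unchanged if it is a two-character string, else raise."""
    if not isinstance(language, str) or len(language) != 2:
        raise ValueError(
            f"Language code must be exactly 2 characters, got {language!r}"
        )
    return language


# Try a few candidates; the last two violate the two-character rule.
for candidate in ("en", "es", "eng", ""):
    try:
        print(candidate, "->", validate_language_code(candidate))
    except ValueError as err:
        print(f"{candidate!r} rejected: {err}")
```

Validating the code before assigning it keeps a bad value from surfacing later as a harder-to-trace extraction problem.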

### Configuration Properties

Extensive configuration options for fine-tuning extraction behavior.

```python { .api }
# Content Validation Thresholds
MIN_WORD_COUNT: int = 300        # Minimum words for valid article
MIN_SENT_COUNT: int = 7          # Minimum sentences for valid article
MAX_TITLE: int = 200             # Maximum title length in characters
MAX_TEXT: int = 100000           # Maximum article text length
MAX_KEYWORDS: int = 35           # Maximum keywords to extract
MAX_AUTHORS: int = 10            # Maximum authors to extract
MAX_SUMMARY: int = 5000          # Maximum summary length
MAX_SUMMARY_SENT: int = 5        # Maximum summary sentences

# Caching and Storage
MAX_FILE_MEMO: int = 20000       # Max URLs cached per news source
memoize_articles: bool = True    # Cache articles between runs

# Media Processing
fetch_images: bool = True        # Download and process images
image_dimension_ration: float = 16/9.0  # Preferred image aspect ratio (the attribute name is spelled "ration" in the library)

# Network and Processing
follow_meta_refresh: bool = False    # Follow meta refresh redirects
use_meta_language: bool = True       # Use language from HTML meta tags
keep_article_html: bool = False      # Retain cleaned article HTML
http_success_only: bool = True       # Fail on HTTP error responses
request_timeout: int = 7             # HTTP request timeout in seconds
number_threads: int = 10             # Default thread count
thread_timeout_seconds: int = 1      # Thread timeout in seconds

# Language and Localization
language: str = 'en'                 # Target language code
stopwords_class: type = StopWords    # Stopwords class for language

# HTTP Configuration
browser_user_agent: str             # HTTP User-Agent header
headers: dict = {}                  # Additional HTTP headers
proxies: dict = {}                  # Proxy configuration

# Debugging
verbose: bool = False               # Enable debug logging
```
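To make the content validation thresholds concrete, the following sketch applies minimum word and sentence counts to a plain string. It is a simplified illustration only: the helper `meets_thresholds` and its naive regex sentence splitter are assumptions made here, and the library's own validity check may consider more than these two counts.

```python
import re

MIN_WORD_COUNT = 300  # default listed above
MIN_SENT_COUNT = 7    # default listed above


def meets_thresholds(text: str,
                     min_words: int = MIN_WORD_COUNT,
                     min_sents: int = MIN_SENT_COUNT) -> bool:
    """Hypothetical helper: True if text has enough words and sentences."""
    word_count = len(text.split())
    # Naive split after ., ! or ? followed by whitespace; drop empty pieces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return word_count >= min_words and len(sentences) >= min_sents


short_text = "Too short to count. Only two sentences."
long_text = " ".join(["This sentence has exactly six words."] * 60)

print(meets_thresholds(short_text))  # 7 words, 2 sentences
print(meets_thresholds(long_text))   # 360 words, 60 sentences
```

Raising either threshold, as in the strict-validation example later in this document, simply tightens the same comparison.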

### Utility Functions

Standalone functions for specialized processing and information retrieval.

```python { .api }
def fulltext(html: str, language: str = 'en') -> str:
    """
    Extract clean text content from raw HTML.

    Parameters:
    - html: Raw HTML string
    - language: Language code for processing (default: 'en')

    Returns:
    Extracted plain text content
    """

def hot() -> list:
    """
    Get trending topics from Google Trends.

    Returns:
    List of trending search terms, or None if failed
    """

def languages():
    """Print list of supported languages to console."""

def popular_urls() -> list:
    """
    Get list of popular news source URLs.

    Returns:
    List of pre-extracted popular news website URLs
    """
```
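Because `hot()` is documented as returning either a list or `None`, callers need to handle both outcomes. The sketch below shows one way to normalize such a result; `first_topics` is a hypothetical helper, and the lambdas are stubs standing in for a network-backed call so the example runs offline.

```python
from typing import Callable, List, Optional


def first_topics(fetch: Callable[[], Optional[List[str]]],
                 limit: int = 5) -> List[str]:
    """Call fetch() and return up to `limit` items, or [] if it returns None or raises."""
    try:
        topics = fetch()
    except Exception:
        return []
    return list(topics[:limit]) if topics else []


# Stub fetchers standing in for a call such as hot().
print(first_topics(lambda: ["a", "b", "c"], limit=2))
print(first_topics(lambda: None))
```

Returning an empty list in the failure case lets downstream code iterate without a separate `None` check.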

### Language Support Classes

Specialized stopwords classes for different languages.

```python { .api }
class StopWords:
    """Default English stopwords class."""

class StopWordsChinese(StopWords):
    """Chinese language stopwords."""

class StopWordsArabic(StopWords):
    """Arabic and Persian language stopwords."""

class StopWordsKorean(StopWords):
    """Korean language stopwords."""

class StopWordsHindi(StopWords):
    """Hindi language stopwords."""

class StopWordsJapanese(StopWords):
    """Japanese language stopwords."""
```
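The relationship between a language code and its stopwords class amounts to a lookup with a default. The following sketch shows that pattern with stand-in classes; the `Demo*` classes, the `_SPECIAL_CASES` table, and `pick_stopwords_class` are all hypothetical names and do not reproduce how `Configuration.get_stopwords_class` is implemented.

```python
class DemoStopWords:
    """Stand-in for the default stopwords class."""


class DemoStopWordsChinese(DemoStopWords):
    """Stand-in for a language-specific subclass."""


class DemoStopWordsArabic(DemoStopWords):
    """Stand-in shared by Arabic ('ar') and Persian ('fa')."""


# Hypothetical lookup table; codes not listed fall back to the default class.
_SPECIAL_CASES = {
    "zh": DemoStopWordsChinese,
    "ar": DemoStopWordsArabic,
    "fa": DemoStopWordsArabic,
}


def pick_stopwords_class(language: str) -> type:
    """Return the specialized class for a language code, else the default."""
    return _SPECIAL_CASES.get(language, DemoStopWords)


for code in ("en", "zh", "fa"):
    print(code, "->", pick_stopwords_class(code).__name__)
```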

### Helper Functions

Additional utility functions for configuration and language support.

```python { .api }
def get_available_languages() -> list:
    """
    Get list of supported language codes.

    Returns:
    List of two-character language codes
    """

def print_available_languages():
    """Print supported languages to console."""

def extend_config(config: Configuration, config_items: dict) -> Configuration:
    """
    Merge configuration object with additional settings.

    Parameters:
    - config: Base Configuration object
    - config_items: Dictionary of additional configuration values

    Returns:
    Updated Configuration object
    """
```
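Merging a dictionary of settings into a configuration object, as `extend_config` is described as doing, can be sketched with plain attribute assignment. The helper `apply_overrides` below is hypothetical, and `SimpleNamespace` stands in for a `Configuration` object; skipping keys the object does not already define is a design choice of this sketch, not something the signature above specifies.

```python
from types import SimpleNamespace


def apply_overrides(config, overrides: dict):
    """Set each override on config, skipping keys the object does not already define."""
    for key, value in overrides.items():
        if hasattr(config, key):
            setattr(config, key, value)
    return config


# SimpleNamespace stands in for a Configuration object in this sketch.
base = SimpleNamespace(request_timeout=7, fetch_images=True)
updated = apply_overrides(base, {"request_timeout": 15, "unknown_option": 1})

print(updated.request_timeout)
print(hasattr(updated, "unknown_option"))
```

Ignoring unknown keys guards against typos silently creating new attributes that nothing reads.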

## Usage Examples

### Basic Configuration

```python
from newspaper import Article
# Importing from the submodule works regardless of how the package root aliases the class
from newspaper.configuration import Configuration

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False
config.request_timeout = 10

# Use with article
article = Article('http://spanish-news-site.com/article', config=config)
article.build()
```

### Multi-language Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Process articles in different languages
languages = ['en', 'es', 'fr', 'de']
articles = {}

for lang in languages:
    config = Configuration()
    config.set_language(lang)

    # Language-specific URL (example)
    url = f'http://news-site.com/{lang}/article'
    article = Article(url, config=config)
    article.build()

    articles[lang] = article
    print(f"{lang}: {article.title}")
```

### Performance Optimization

```python
from newspaper import build
from newspaper.configuration import Configuration

# High-performance configuration
config = Configuration()
config.number_threads = 20
config.thread_timeout_seconds = 2
config.request_timeout = 5
config.memoize_articles = True
config.fetch_images = False  # Skip images for speed

# Build source with optimized settings
source = build('http://news-site.com', config=config)
print(f"Fast processing: {len(source.articles)} articles discovered")
```

### Content Quality Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Strict content validation
config = Configuration()
config.MIN_WORD_COUNT = 800      # Require longer articles
config.MIN_SENT_COUNT = 15       # Require more sentences
config.MAX_KEYWORDS = 50         # Extract more keywords
config.MAX_SUMMARY_SENT = 10     # Longer summaries

# Use strict configuration
article = Article('http://long-form-article.com', config=config)
article.build()

if article.is_valid_body():
    # Count whitespace-separated words, not characters
    print(f"High-quality article: {len(article.text.split())} words")
    print(f"Keywords: {len(article.keywords)}")
    # The summary is built by joining sentences with newlines
    print(f"Summary sentences: {len(article.summary.splitlines())}")
```

### Network Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Custom network settings
config = Configuration()
config.browser_user_agent = 'MyBot/1.0'
config.headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate'
}
config.proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}
config.request_timeout = 15

# Use custom network settings
article = Article('http://example.com/article', config=config)
article.download()
```

### Language Detection and Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration
from newspaper.utils import get_available_languages

# Show supported languages
print("Supported languages:")
print(get_available_languages())

# Auto-detect and process
def process_with_language_detection(url):
    # First pass - detect language
    article = Article(url)
    article.download()
    article.parse()  # This extracts meta_lang

    detected_lang = article.meta_lang
    if detected_lang in get_available_languages():
        # Second pass with detected language
        config = Configuration()
        config.set_language(detected_lang)

        article_lang = Article(url, config=config)
        article_lang.build()
        return article_lang

    return article

# Process with language detection
result = process_with_language_detection('http://multilingual-site.com/article')
print(f"Language: {result.meta_lang}")
print(f"Title: {result.title}")
```
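The decision at the heart of the two-pass approach, use the detected code if it is supported and otherwise fall back, can be isolated into a small pure function. The sketch below assumes a hypothetical helper `choose_language` and an illustrative subset of codes; it makes no network calls and does not depend on the library.

```python
from typing import Iterable, Optional


def choose_language(detected: Optional[str],
                    supported: Iterable[str],
                    default: str = "en") -> str:
    """Return the detected code if supported, otherwise the default."""
    code = (detected or "").strip().lower()
    return code if code in set(supported) else default


supported_codes = ["en", "es", "fr", "de"]  # illustrative subset, not the full list

print(choose_language("ES", supported_codes))  # normalized to lowercase
print(choose_language("", supported_codes))    # nothing detected
print(choose_language("xx", supported_codes))  # unsupported code
```

Keeping this choice separate from the download and parse steps makes the fallback behavior easy to test without fetching any page.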

### Utility Functions Usage

```python
from newspaper import fulltext, hot, popular_urls

# Extract text from raw HTML
html_content = """
<html><body>
<h1>News Title</h1>
<p>This is the main article content with <a href="#">links</a> and formatting.</p>
</body></html>
"""

clean_text = fulltext(html_content, language='en')
print(f"Extracted text: {clean_text}")

# Get trending topics
try:
    trending = hot()
    if trending:
        print("Trending topics:", trending[:5])
except Exception as e:
    print(f"Could not fetch trending topics: {e}")

# Get popular news sources
popular_sources = popular_urls()
print(f"Popular sources: {len(popular_sources)} URLs")
for source in popular_sources[:5]:
    print(f"  {source}")
```

### Debug Configuration

```python
import logging

from newspaper import Article
from newspaper.configuration import Configuration

# Enable debug logging
config = Configuration()
config.verbose = True

# Set up logging to see debug output
logging.basicConfig(level=logging.DEBUG)

# Process with verbose output
article = Article('http://example.com/article', config=config)
article.build()  # Will show detailed debug information
```