# Source Management

Functionality for working with news websites and domains as collections of articles. The Source class provides comprehensive capabilities for discovering, organizing, and processing articles from news sources, including category extraction, RSS feed processing, and batch operations.

## Capabilities

### Source Creation and Building

Create and initialize Source objects for news websites with automatic article discovery.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs):
        """
        Initialize a news source object.

        Parameters:
        - url: Homepage URL of the news source
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters

        Raises:
        Exception: If URL is invalid or malformed
        """

    def build(self):
        """
        Complete source processing pipeline: download homepage, parse structure,
        discover categories and feeds, generate article objects.
        """

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source:
    """
    Factory function to create and optionally build a Source object.

    Parameters:
    - url: Source homepage URL
    - dry: If True, create source without building (no downloads)
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
    Source object, built if dry=False
    """
```

### Content Download and Parsing

Download and parse source homepage and category pages.

```python { .api }
def download(self):
    """Download homepage HTML content."""

def parse(self):
    """Parse homepage HTML to extract source structure and metadata."""

def download_categories(self):
    """Download all category page HTML content using multi-threading."""

def download_feeds(self):
    """Download RSS/Atom feed content for all discovered feeds."""
```

### Content Discovery

Discover and organize source content including categories, feeds, and articles.

```python { .api }
def set_categories(self):
    """Discover and set category URLs from homepage."""

def set_feeds(self):
    """
    Discover and set RSS/Atom feed URLs.
    Checks common feed locations and category pages for feed links.
    """

def generate_articles(self):
    """
    Generate Article objects from discovered URLs.
    Creates articles from category pages and feed content.
    """

def set_description(self):
    """Extract and set source description from homepage metadata."""
```
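
To illustrate the kind of markup that feed discovery scans for, here is a minimal, self-contained sketch using only the standard library's `html.parser`. The `FeedLinkParser` class and the sample homepage HTML are illustrative assumptions, not the library's actual implementation, which also checks common feed paths and category pages.

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect hrefs from <link> tags that advertise RSS/Atom feeds."""
    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        if attributes.get('type') in ('application/rss+xml', 'application/atom+xml'):
            self.feed_urls.append(attributes.get('href'))

# Sample homepage markup: one feed link, one unrelated stylesheet link.
homepage = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss/all.xml">
<link rel="stylesheet" href="/main.css">
</head></html>
"""

parser = FeedLinkParser()
parser.feed(homepage)
print(parser.feed_urls)  # ['/rss/all.xml']
```

Discovered hrefs are usually relative, so a real pipeline would resolve them against the homepage URL before fetching.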

### Batch Processing

Process multiple articles from the source efficiently.

```python { .api }
def download_articles(self, thread_count_per_source: int = 1):
    """
    Download all source articles using multi-threading.

    Parameters:
    - thread_count_per_source: Number of threads to use for downloading
    """
```
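
The threading pattern behind this kind of batch download can be sketched with `concurrent.futures` from the standard library. The `fetch` function below is a stand-in for an HTTP request, so the sketch runs without network access; it is not the library's internal code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an article download; a real implementation would issue an HTTP GET.
def fetch(url):
    return f"<html>content of {url}</html>"

urls = [f"http://news-site.com/story-{i}" for i in range(10)]

# One worker pool with a bounded thread count, mirroring thread_count_per_source.
thread_count = 5
with ThreadPoolExecutor(max_workers=thread_count) as pool:
    # pool.map preserves input order, so pages[i] corresponds to urls[i].
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 10
```

A bounded pool keeps concurrency polite toward the target site; raising the thread count speeds up large sources at the cost of heavier load on the server.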

### Content Filtering

Filter and validate articles based on quality criteria.

```python { .api }
def purge_articles(self, reason: str, articles: list) -> list:
    """
    Filter articles based on validation criteria.

    Parameters:
    - reason: Filter type - 'url' for URL validation, 'body' for content validation
    - articles: List of articles to filter

    Returns:
    Filtered list of valid articles
    """
```
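
As a rough illustration of what URL-based purging involves, the sketch below applies a few structural checks with the standard library's `urllib.parse`. The `is_valid_url` helper is hypothetical; the library's actual heuristics are more involved (path depth, date patterns, excluded subdomains, and so on).

```python
from urllib.parse import urlparse

# Hypothetical helper showing the style of check a 'url' purge might apply.
def is_valid_url(url):
    parts = urlparse(url)
    # Require an HTTP(S) scheme, a host, and a non-trivial path.
    return parts.scheme in ('http', 'https') and bool(parts.netloc) and len(parts.path) > 1

urls = [
    'http://news-site.com/2024/politics/election-results',
    'ftp://news-site.com/file',
    'http://news-site.com/',
]
print([u for u in urls if is_valid_url(u)])
# ['http://news-site.com/2024/politics/election-results']
```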

### Source Properties

Access source information and discovered content.

```python { .api }
# Source Information
url: str          # Homepage URL
domain: str       # Domain name
scheme: str       # URL scheme (http/https)
brand: str        # Brand name extracted from domain
description: str  # Source description from metadata

# Content Collections
categories: list  # List of Category objects
feeds: list       # List of Feed objects
articles: list    # List of Article objects

# Content Data
html: str         # Homepage HTML content
doc: object       # lxml DOM object of homepage
logo_url: str     # Source logo URL
favicon: str      # Favicon URL

# Processing State
is_parsed: bool      # Whether source has been parsed
is_downloaded: bool  # Whether source has been downloaded
```
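
The `brand` property is described above as a name extracted from the domain. A rough, standard-library-only approximation of that extraction is sketched below; `rough_brand` is a hypothetical helper, and a production implementation would use a public-suffix-aware library to handle multi-part suffixes like `bbc.co.uk` correctly.

```python
from urllib.parse import urlparse

# Hypothetical sketch: take the domain label just before the TLD as the brand.
def rough_brand(url):
    labels = urlparse(url).netloc.split('.')
    # Drop a leading "www" so it is not mistaken for the brand.
    if labels[0] == 'www':
        labels = labels[1:]
    return labels[0]

print(rough_brand('http://www.cnn.com'))       # cnn
print(rough_brand('https://example.org/news')) # example
```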

### Helper Classes

Supporting classes for organizing source content.

```python { .api }
class Category:
    def __init__(self, url: str):
        """
        Represents a news category/section.

        Parameters:
        - url: Category page URL
        """

    url: str     # Category URL
    html: str    # Category page HTML
    doc: object  # lxml DOM object

class Feed:
    def __init__(self, url: str):
        """
        Represents an RSS/Atom feed.

        Parameters:
        - url: Feed URL
        """

    url: str  # Feed URL
    rss: str  # Feed content
```
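
To show what the raw XML held by a Feed object's `rss` attribute can look like, and how one might pull article links out of it, here is a self-contained sketch using the standard library's `xml.etree.ElementTree`. The sample feed content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample of the kind of document a Feed's rss attribute might hold.
rss_content = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><link>http://example.com/story-1</link></item>
  <item><link>http://example.com/story-2</link></item>
</channel></rss>"""

root = ET.fromstring(rss_content)
# Collect each item's <link> text, i.e. the candidate article URLs.
links = [item.findtext('link') for item in root.iter('item')]
print(links)  # ['http://example.com/story-1', 'http://example.com/story-2']
```

Real-world feeds vary widely (Atom namespaces, CDATA blocks, missing fields), which is why a dedicated feed parser is usually preferable to hand-rolled XML handling.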

## Usage Examples

### Basic Source Processing

```python
from newspaper import build

# Build source and discover articles
cnn_source = build('http://cnn.com')

print(f"Source: {cnn_source.brand}")
print(f"Articles found: {len(cnn_source.articles)}")
print(f"Categories: {len(cnn_source.categories)}")
print(f"Feeds: {len(cnn_source.feeds)}")

# Access discovered articles
for article in cnn_source.articles[:5]:
    print(f"Article URL: {article.url}")
```

### Manual Source Building

```python
from newspaper import Source

# Create source without automatic building
source = Source('http://example.com')

# Manual step-by-step processing
source.download()
source.parse()
source.set_categories()
source.download_categories()
source.set_feeds()
source.download_feeds()
source.generate_articles()

print(f"Generated {len(source.articles)} articles")
```

### Article Quality Filtering

```python
from newspaper import build

# Build source and filter articles
source = build('http://news-site.com')

# Filter by URL validity
valid_url_articles = source.purge_articles('url', source.articles)
print(f"Valid URL articles: {len(valid_url_articles)}")

# Download and filter by content quality
for article in valid_url_articles[:10]:
    article.download()
    article.parse()

valid_body_articles = source.purge_articles('body', valid_url_articles[:10])
print(f"Valid content articles: {len(valid_body_articles)}")
```

### Multi-threaded Article Processing

```python
from newspaper import build

# Build source
source = build('http://news-site.com')

# Download all articles with multiple threads
source.download_articles(thread_count_per_source=5)

# Process downloaded articles
for article in source.articles:
    if hasattr(article, 'html') and article.html:
        article.parse()
        if article.is_valid_body():
            article.nlp()
            print(f"Processed: {article.title}")
```

### Category and Feed Analysis

```python
from newspaper import build

source = build('http://news-site.com')

# Examine categories
print("Categories:")
for category in source.categories:
    print(f"  {category.url}")

# Examine feeds
print("Feeds:")
for feed in source.feeds:
    print(f"  {feed.url}")

# Source metadata
print(f"Description: {source.description}")
print(f"Logo: {source.logo_url}")
print(f"Favicon: {source.favicon}")
```

### Custom Configuration for Sources

```python
from newspaper import build, Configuration

# Create custom configuration
config = Configuration()
config.number_threads = 20
config.request_timeout = 10
config.language = 'fr'

# Build source with custom settings
source = build('http://french-news-site.com', config=config)
print(f"Articles discovered: {len(source.articles)}")
```