# Source Management

Functionality for working with news websites and domains as collections of articles. The Source class provides comprehensive capabilities for discovering, organizing, and processing articles from news sources, including category extraction, RSS feed processing, and batch operations.

## Capabilities

### Source Creation and Building

Create and initialize Source objects for news websites with automatic article discovery.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs):
        """
        Initialize a news source object.

        Parameters:
        - url: Homepage URL of the news source
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters

        Raises:
        Exception: If URL is invalid or malformed
        """

    def build(self):
        """
        Complete source processing pipeline: download homepage, parse structure,
        discover categories and feeds, generate article objects.
        """

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source:
    """
    Factory function to create and optionally build a Source object.

    Parameters:
    - url: Source homepage URL
    - dry: If True, create source without building (no downloads)
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
    Source object, built if dry=False
    """
```

### Content Download and Parsing

Download and parse source homepage and category pages.

```python { .api }
def download(self):
    """Download homepage HTML content."""

def parse(self):
    """Parse homepage HTML to extract source structure and metadata."""

def download_categories(self):
    """Download all category page HTML content using multi-threading."""

def download_feeds(self):
    """Download RSS/Atom feed content for all discovered feeds."""
```

### Content Discovery

Discover and organize source content including categories, feeds, and articles.

```python { .api }
def set_categories(self):
    """Discover and set category URLs from homepage."""

def set_feeds(self):
    """
    Discover and set RSS/Atom feed URLs.
    Checks common feed locations and category pages for feed links.
    """

def generate_articles(self):
    """
    Generate Article objects from discovered URLs.
    Creates articles from category pages and feed content.
    """

def set_description(self):
    """Extract and set source description from homepage metadata."""
```
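
To illustrate the kind of markup that feed discovery scans for, here is a minimal, self-contained sketch using only the standard library's `html.parser`. The `FeedLinkParser` class and the sample homepage HTML are illustrative assumptions, not the library's actual implementation, which also checks common feed paths and category pages.

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect hrefs from <link> tags that advertise RSS/Atom feeds."""
    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        if attributes.get('type') in ('application/rss+xml', 'application/atom+xml'):
            self.feed_urls.append(attributes.get('href'))

# Sample homepage markup: one feed link, one unrelated stylesheet link.
homepage = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss/all.xml">
<link rel="stylesheet" href="/main.css">
</head></html>
"""

parser = FeedLinkParser()
parser.feed(homepage)
print(parser.feed_urls)  # ['/rss/all.xml']
```

Discovered hrefs are usually relative, so a real pipeline would resolve them against the homepage URL before fetching.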

### Batch Processing

Process multiple articles from the source efficiently.

```python { .api }
def download_articles(self, thread_count_per_source: int = 1):
    """
    Download all source articles using multi-threading.

    Parameters:
    - thread_count_per_source: Number of threads to use for downloading
    """
```
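
The threading pattern behind this kind of batch download can be sketched with `concurrent.futures` from the standard library. The `fetch` function below is a stand-in for an HTTP request, so the sketch runs without network access; it is not the library's internal code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an article download; a real implementation would issue an HTTP GET.
def fetch(url):
    return f"<html>content of {url}</html>"

urls = [f"http://news-site.com/story-{i}" for i in range(10)]

# One worker pool with a bounded thread count, mirroring thread_count_per_source.
thread_count = 5
with ThreadPoolExecutor(max_workers=thread_count) as pool:
    # pool.map preserves input order, so pages[i] corresponds to urls[i].
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 10
```

A bounded pool keeps concurrency polite toward the target site; raising the thread count speeds up large sources at the cost of heavier load on the server.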

### Content Filtering

Filter and validate articles based on quality criteria.

```python { .api }
def purge_articles(self, reason: str, articles: list) -> list:
    """
    Filter articles based on validation criteria.

    Parameters:
    - reason: Filter type - 'url' for URL validation, 'body' for content validation
    - articles: List of articles to filter

    Returns:
    Filtered list of valid articles
    """
```
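
As a rough illustration of what URL-based purging involves, the sketch below applies a few structural checks with the standard library's `urllib.parse`. The `is_valid_url` helper is hypothetical; the library's actual heuristics are more involved (path depth, date patterns, excluded subdomains, and so on).

```python
from urllib.parse import urlparse

# Hypothetical helper showing the style of check a 'url' purge might apply.
def is_valid_url(url):
    parts = urlparse(url)
    # Require an HTTP(S) scheme, a host, and a non-trivial path.
    return parts.scheme in ('http', 'https') and bool(parts.netloc) and len(parts.path) > 1

urls = [
    'http://news-site.com/2024/politics/election-results',
    'ftp://news-site.com/file',
    'http://news-site.com/',
]
print([u for u in urls if is_valid_url(u)])
# ['http://news-site.com/2024/politics/election-results']
```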

### Source Properties

Access source information and discovered content.

```python { .api }
# Source Information
url: str          # Homepage URL
domain: str       # Domain name
scheme: str       # URL scheme (http/https)
brand: str        # Brand name extracted from domain
description: str  # Source description from metadata

# Content Collections
categories: list  # List of Category objects
feeds: list       # List of Feed objects
articles: list    # List of Article objects

# Content Data
html: str         # Homepage HTML content
doc: object       # lxml DOM object of homepage
logo_url: str     # Source logo URL
favicon: str      # Favicon URL

# Processing State
is_parsed: bool      # Whether source has been parsed
is_downloaded: bool  # Whether source has been downloaded
```
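
The `brand` property is described above as a name extracted from the domain. A rough, standard-library-only approximation of that extraction is sketched below; `rough_brand` is a hypothetical helper, and a production implementation would use a public-suffix-aware library to handle multi-part suffixes like `bbc.co.uk` correctly.

```python
from urllib.parse import urlparse

# Hypothetical sketch: take the domain label just before the TLD as the brand.
def rough_brand(url):
    labels = urlparse(url).netloc.split('.')
    # Drop a leading "www" so it is not mistaken for the brand.
    if labels[0] == 'www':
        labels = labels[1:]
    return labels[0]

print(rough_brand('http://www.cnn.com'))       # cnn
print(rough_brand('https://example.org/news')) # example
```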

### Helper Classes

Supporting classes for organizing source content.

```python { .api }
class Category:
    def __init__(self, url: str):
        """
        Represents a news category/section.

        Parameters:
        - url: Category page URL
        """

    url: str     # Category URL
    html: str    # Category page HTML
    doc: object  # lxml DOM object

class Feed:
    def __init__(self, url: str):
        """
        Represents an RSS/Atom feed.

        Parameters:
        - url: Feed URL
        """

    url: str  # Feed URL
    rss: str  # Feed content
```
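
To show what the raw XML held by a Feed object's `rss` attribute can look like, and how one might pull article links out of it, here is a self-contained sketch using the standard library's `xml.etree.ElementTree`. The sample feed content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample of the kind of document a Feed's rss attribute might hold.
rss_content = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><link>http://example.com/story-1</link></item>
  <item><link>http://example.com/story-2</link></item>
</channel></rss>"""

root = ET.fromstring(rss_content)
# Collect each item's <link> text, i.e. the candidate article URLs.
links = [item.findtext('link') for item in root.iter('item')]
print(links)  # ['http://example.com/story-1', 'http://example.com/story-2']
```

Real-world feeds vary widely (Atom namespaces, CDATA blocks, missing fields), which is why a dedicated feed parser is usually preferable to hand-rolled XML handling.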

## Usage Examples

### Basic Source Processing

```python
from newspaper import build

# Build source and discover articles
cnn_source = build('http://cnn.com')

print(f"Source: {cnn_source.brand}")
print(f"Articles found: {len(cnn_source.articles)}")
print(f"Categories: {len(cnn_source.categories)}")
print(f"Feeds: {len(cnn_source.feeds)}")

# Access discovered articles
for article in cnn_source.articles[:5]:
    print(f"Article URL: {article.url}")
```

### Manual Source Building

```python
from newspaper import Source

# Create source without automatic building
source = Source('http://example.com')

# Manual step-by-step processing
source.download()
source.parse()
source.set_categories()
source.download_categories()
source.set_feeds()
source.download_feeds()
source.generate_articles()

print(f"Generated {len(source.articles)} articles")
```

### Article Quality Filtering

```python
from newspaper import build

# Build source and filter articles
source = build('http://news-site.com')

# Filter by URL validity
valid_url_articles = source.purge_articles('url', source.articles)
print(f"Valid URL articles: {len(valid_url_articles)}")

# Download and filter by content quality
for article in valid_url_articles[:10]:
    article.download()
    article.parse()

valid_body_articles = source.purge_articles('body', valid_url_articles[:10])
print(f"Valid content articles: {len(valid_body_articles)}")
```

### Multi-threaded Article Processing

```python
from newspaper import build

# Build source
source = build('http://news-site.com')

# Download all articles with multiple threads
source.download_articles(thread_count_per_source=5)

# Process downloaded articles
for article in source.articles:
    if hasattr(article, 'html') and article.html:
        article.parse()
        if article.is_valid_body():
            article.nlp()
            print(f"Processed: {article.title}")
```

### Category and Feed Analysis

```python
from newspaper import build

source = build('http://news-site.com')

# Examine categories
print("Categories:")
for category in source.categories:
    print(f"  {category.url}")

# Examine feeds
print("Feeds:")
for feed in source.feeds:
    print(f"  {feed.url}")

# Source metadata
print(f"Description: {source.description}")
print(f"Logo: {source.logo_url}")
print(f"Favicon: {source.favicon}")
```

### Custom Configuration for Sources

```python
from newspaper import build, Configuration

# Create custom configuration
config = Configuration()
config.number_threads = 20
config.request_timeout = 10
config.language = 'fr'

# Build source with custom settings
source = build('http://french-news-site.com', config=config)
print(f"Articles discovered: {len(source.articles)}")
```