
# Source Management

Functionality for working with news websites and domains as collections of articles. The `Source` class discovers, organizes, and processes articles from a news source: it finds article URLs, extracts categories, processes RSS/Atom feeds, and supports multi-threaded batch operations.

## Capabilities

### Source Creation and Building

Create and initialize `Source` objects for news websites with automatic article discovery.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs):
        """
        Initialize a news source object.

        Parameters:
        - url: Homepage URL of the news source
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters

        Raises:
        Exception: If URL is invalid or malformed
        """

    def build(self):
        """
        Complete source processing pipeline: download homepage, parse structure,
        discover categories and feeds, generate article objects.
        """

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source:
    """
    Factory function to create and optionally build a Source object.

    Parameters:
    - url: Source homepage URL
    - dry: If True, create the source without building it (no downloads)
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
    Source object, built if dry=False
    """
```

### Content Download and Parsing

Download and parse the source homepage and category pages.

```python { .api }
def download(self):
    """Download homepage HTML content."""

def parse(self):
    """Parse homepage HTML to extract source structure and metadata."""

def download_categories(self):
    """Download all category page HTML content using multi-threading."""

def download_feeds(self):
    """Download RSS/Atom feed content for all discovered feeds."""
```

### Content Discovery

Discover and organize source content including categories, feeds, and articles.

```python { .api }
def set_categories(self):
    """Discover and set category URLs from the homepage."""

def set_feeds(self):
    """
    Discover and set RSS/Atom feed URLs.
    Checks common feed locations and category pages for feed links.
    """

def generate_articles(self):
    """
    Generate Article objects from discovered URLs.
    Creates articles from category pages and feed content.
    """

def set_description(self):
    """Extract and set the source description from homepage metadata."""
```
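The "common feed locations" that `set_feeds` checks are internal to the library, but the probing pattern is easy to picture. A minimal sketch, assuming an illustrative list of candidate paths (not newspaper's actual list):

```python
from urllib.parse import urljoin

# Illustrative candidate paths; the paths newspaper actually probes are internal.
COMMON_FEED_PATHS = ["/feed", "/rss", "/feeds", "/atom.xml", "/rss.xml"]

def candidate_feed_urls(homepage_url: str) -> list:
    """Build the list of feed URLs to probe for a given homepage."""
    return [urljoin(homepage_url, path) for path in COMMON_FEED_PATHS]

print(candidate_feed_urls("http://news-site.com"))
```

Each candidate would then be fetched and kept only if it parses as valid RSS/Atom.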

### Batch Processing

Process multiple articles from the source efficiently.

```python { .api }
def download_articles(self, thread_count_per_source: int = 1):
    """
    Download all source articles using multi-threading.

    Parameters:
    - thread_count_per_source: Number of threads to use for downloading
    """
```
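`download_articles` manages its thread pool internally. The same fan-out pattern can be sketched with the standard library; the `fetch` callable below is a stand-in for an HTTP GET, not newspaper code:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, thread_count=5):
    """Fetch every URL concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(fetch, urls))

# Stand-in fetcher; a real one would issue an HTTP GET and return the body.
html_pages = download_all(
    ["http://a.example", "http://b.example"],
    fetch=lambda url: f"<html>{url}</html>",
    thread_count=2,
)
```

`pool.map` preserves input order, so results line up with the article list even though fetches complete out of order.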

### Content Filtering

Filter and validate articles based on quality criteria.

```python { .api }
def purge_articles(self, reason: str, articles: list) -> list:
    """
    Filter articles based on validation criteria.

    Parameters:
    - reason: Filter type - 'url' for URL validation, 'body' for content validation
    - articles: List of articles to filter

    Returns:
    Filtered list of valid articles
    """
```
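The validation heuristics behind `purge_articles` are internal to newspaper; the filtering pattern itself is just a predicate dispatch on `reason`. A minimal sketch, with illustrative (assumed, much simpler than the real) checks and plain dicts standing in for Article objects:

```python
def purge(articles, reason):
    """Keep only the articles that pass the check named by `reason`."""
    checks = {
        # Illustrative heuristics; newspaper's real checks are more involved.
        "url": lambda a: a.get("url", "").startswith("http"),
        "body": lambda a: len(a.get("text", "")) >= 300,
    }
    is_valid = checks[reason]
    return [a for a in articles if is_valid(a)]

articles = [
    {"url": "http://ok.example", "text": "x" * 400},
    {"url": "ftp://bad.example", "text": ""},
]
```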

### Source Properties

Access source information and discovered content.

```python { .api }
# Source Information
url: str           # Homepage URL
domain: str        # Domain name
scheme: str        # URL scheme (http/https)
brand: str         # Brand name extracted from the domain
description: str   # Source description from metadata

# Content Collections
categories: list   # List of Category objects
feeds: list        # List of Feed objects
articles: list     # List of Article objects

# Content Data
html: str          # Homepage HTML content
doc: object        # lxml DOM object of the homepage
logo_url: str      # Source logo URL
favicon: str       # Favicon URL

# Processing State
is_parsed: bool     # Whether the source has been parsed
is_downloaded: bool # Whether the source has been downloaded
```
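To make `brand` concrete: it is the name taken from the domain, e.g. `'cnn'` for `http://cnn.com`. A simplified sketch of that derivation; newspaper's own parsing handles multi-part TLDs properly, so this first-label heuristic is an assumption for illustration only:

```python
from urllib.parse import urlparse

def brand_from_url(url: str) -> str:
    """Take the first non-'www' hostname label as the brand name."""
    host = urlparse(url).netloc          # e.g. 'www.cnn.com'
    labels = [l for l in host.split(".") if l != "www"]
    return labels[0] if labels else ""

print(brand_from_url("http://www.cnn.com"))  # 'cnn'
```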

### Helper Classes

Supporting classes for organizing source content.

```python { .api }
class Category:
    def __init__(self, url: str):
        """
        Represents a news category/section.

        Parameters:
        - url: Category page URL
        """

    url: str    # Category URL
    html: str   # Category page HTML
    doc: object # lxml DOM object

class Feed:
    def __init__(self, url: str):
        """
        Represents an RSS/Atom feed.

        Parameters:
        - url: Feed URL
        """

    url: str # Feed URL
    rss: str # Feed content
```

## Usage Examples

### Basic Source Processing

```python
from newspaper import build

# Build the source and discover articles
cnn_source = build('http://cnn.com')

print(f"Source: {cnn_source.brand}")
print(f"Articles found: {len(cnn_source.articles)}")
print(f"Categories: {len(cnn_source.categories)}")
print(f"Feeds: {len(cnn_source.feeds)}")

# Access discovered articles
for article in cnn_source.articles[:5]:
    print(f"Article URL: {article.url}")
```

### Manual Source Building

```python
from newspaper import Source

# Create the source without automatic building
source = Source('http://example.com')

# Manual step-by-step processing
source.download()
source.parse()
source.set_categories()
source.download_categories()
source.set_feeds()
source.download_feeds()
source.generate_articles()

print(f"Generated {len(source.articles)} articles")
```

### Article Quality Filtering

```python
from newspaper import build

# Build the source and filter its articles
source = build('http://news-site.com')

# Filter by URL validity
valid_url_articles = source.purge_articles('url', source.articles)
print(f"Valid URL articles: {len(valid_url_articles)}")

# Download and filter by content quality
for article in valid_url_articles[:10]:
    article.download()
    article.parse()

valid_body_articles = source.purge_articles('body', valid_url_articles[:10])
print(f"Valid content articles: {len(valid_body_articles)}")
```

### Multi-threaded Article Processing

```python
from newspaper import build

# Build the source
source = build('http://news-site.com')

# Download all articles with multiple threads
source.download_articles(thread_count_per_source=5)

# Process downloaded articles
for article in source.articles:
    if hasattr(article, 'html') and article.html:
        article.parse()
        if article.is_valid_body():
            article.nlp()
            print(f"Processed: {article.title}")
```

### Category and Feed Analysis

```python
from newspaper import build

source = build('http://news-site.com')

# Examine categories
print("Categories:")
for category in source.categories:
    print(f"  {category.url}")

# Examine feeds
print("Feeds:")
for feed in source.feeds:
    print(f"  {feed.url}")

# Source metadata
print(f"Description: {source.description}")
print(f"Logo: {source.logo_url}")
print(f"Favicon: {source.favicon}")
```

### Custom Configuration for Sources

```python
from newspaper import build, Configuration

# Create a custom configuration
config = Configuration()
config.number_threads = 20
config.request_timeout = 10
config.language = 'fr'

# Build the source with custom settings
source = build('http://french-news-site.com', config=config)
print(f"Articles discovered: {len(source.articles)}")
```