# Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features. The Configuration class provides extensive customization options for article processing, while utility functions offer additional capabilities like fulltext extraction and trending topic discovery.

## Capabilities

### Configuration Management

Comprehensive configuration options for customizing newspaper3k behavior.

```python { .api }
class Configuration:
    def __init__(self):
        """Initialize configuration with default settings."""

    def get_language(self) -> str:
        """Get the current language setting."""

    def set_language(self, language: str):
        """
        Set the target language for processing.

        Parameters:
        - language: Two-character language code (e.g., 'en', 'es', 'fr')

        Raises:
        Exception: If language code is invalid or not 2 characters
        """

    @staticmethod
    def get_stopwords_class(language: str):
        """
        Get the appropriate stopwords class for a language.

        Parameters:
        - language: Two-character language code

        Returns:
        Stopwords class for the specified language
        """

    @staticmethod
    def get_parser():
        """Get the HTML parser class (lxml-based Parser)."""
```
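The `set_language` contract above (a two-character code, otherwise an exception) can be mirrored in a small stand-alone check. This is a minimal sketch, not newspaper3k's own code: `validate_language_code` is a hypothetical helper, and it raises `ValueError` where the signature above only promises a generic `Exception`.

```python
def validate_language_code(language: str) -> str:
    """Return the normalized code, or raise if it is not two characters.

    Hypothetical helper for illustration; not part of newspaper3k.
    """
    if not isinstance(language, str) or len(language.strip()) != 2:
        raise ValueError(f"expected a 2-character language code, got {language!r}")
    return language.strip().lower()


# Try one valid code, one valid code in upper case, and one invalid code.
for candidate in ("en", "ES", "eng"):
    try:
        print(candidate, "->", validate_language_code(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

Running a check like this before calling `set_language` lets a caller report bad input in its own terms rather than catching a bare `Exception`.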

### Configuration Properties

Extensive configuration options for fine-tuning extraction behavior.

```python { .api }
# Content Validation Thresholds
MIN_WORD_COUNT: int = 300       # Minimum words for valid article
MIN_SENT_COUNT: int = 7         # Minimum sentences for valid article
MAX_TITLE: int = 200            # Maximum title length in characters
MAX_TEXT: int = 100000          # Maximum article text length
MAX_KEYWORDS: int = 35          # Maximum keywords to extract
MAX_AUTHORS: int = 10           # Maximum authors to extract
MAX_SUMMARY: int = 5000         # Maximum summary length
MAX_SUMMARY_SENT: int = 5       # Maximum summary sentences

# Caching and Storage
MAX_FILE_MEMO: int = 20000      # Max URLs cached per news source
memoize_articles: bool = True   # Cache articles between runs

# Media Processing
fetch_images: bool = True       # Download and process images
image_dimension_ration: float = 16/9.0  # Preferred image aspect ratio (name spelled as in the library)

# Network and Processing
follow_meta_refresh: bool = False   # Follow meta refresh redirects
use_meta_language: bool = True      # Use language from HTML meta tags
keep_article_html: bool = False     # Retain cleaned article HTML
http_success_only: bool = True      # Fail on HTTP error responses
request_timeout: int = 7            # HTTP request timeout in seconds
number_threads: int = 10            # Default thread count
thread_timeout_seconds: int = 1     # Thread timeout in seconds

# Language and Localization
language: str = 'en'                # Target language code
stopwords_class: type = StopWords   # Stopwords class for language

# HTTP Configuration
browser_user_agent: str             # HTTP User-Agent header
headers: dict = {}                  # Additional HTTP headers
proxies: dict = {}                  # Proxy configuration

# Debugging
verbose: bool = False               # Enable debug logging
```
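To make the two validation thresholds concrete, the sketch below shows how word and sentence minimums could gate a piece of text. It is an illustration under stated assumptions, not the library's validation logic: `Thresholds` and `meets_thresholds` are hypothetical names, and sentences are approximated with a simple regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Stand-in for the two validation settings listed above (same defaults)."""
    MIN_WORD_COUNT: int = 300
    MIN_SENT_COUNT: int = 7


def meets_thresholds(text: str, limits: Thresholds) -> bool:
    """Return True if text has enough words and enough sentences.

    Sentences are approximated by splitting on '.', '!' or '?' followed by
    whitespace; a real tokenizer would be more accurate.
    """
    words = text.split()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return (len(words) >= limits.MIN_WORD_COUNT
            and len(sentences) >= limits.MIN_SENT_COUNT)


sample = "Short piece. Only two sentences."
relaxed = Thresholds(MIN_WORD_COUNT=3, MIN_SENT_COUNT=2)

print(meets_thresholds(sample, Thresholds()))  # False: 5 words, 2 sentences
print(meets_thresholds(sample, relaxed))       # True: both minimums are met
```

Raising the minimums filters out stubs and teaser pages; lowering them admits short notices at the cost of more noise.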

### Utility Functions

Standalone functions for specialized processing and information retrieval.

```python { .api }
def fulltext(html: str, language: str = 'en') -> str:
    """
    Extract clean text content from raw HTML.

    Parameters:
    - html: Raw HTML string
    - language: Language code for processing (default: 'en')

    Returns:
    Extracted plain text content
    """

def hot() -> list:
    """
    Get trending topics from Google Trends.

    Returns:
    List of trending search terms, or None if failed
    """

def languages():
    """Print list of supported languages to console."""

def popular_urls() -> list:
    """
    Get list of popular news source URLs.

    Returns:
    List of pre-extracted popular news website URLs
    """
```
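Because `hot()` is documented as returning `None` on failure, callers may want to normalize the result before slicing or iterating over it. The following is a minimal sketch that assumes only that contract. `top_topics` and its `fetch` parameter are hypothetical names, and the lambdas stand in for `hot` so the example runs without network access.

```python
from typing import Callable, List, Optional


def top_topics(fetch: Callable[[], Optional[List[str]]], limit: int = 5) -> List[str]:
    """Call fetch() and always return a list, empty if it failed or gave None."""
    try:
        topics = fetch()
    except Exception:
        # Treat any error raised by the fetcher the same as "no data".
        return []
    return list(topics or [])[:limit]


print(top_topics(lambda: ["a", "b", "c"], limit=2))  # first two items
print(top_topics(lambda: None))                      # empty list
```

In real use the fetcher would be the `hot` function itself, passed as `top_topics(hot)`.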

### Language Support Classes

Specialized stopwords classes for different languages.

```python { .api }
class StopWords:
    """Default English stopwords class."""

class StopWordsChinese(StopWords):
    """Chinese language stopwords."""

class StopWordsArabic(StopWords):
    """Arabic and Persian language stopwords."""

class StopWordsKorean(StopWords):
    """Korean language stopwords."""

class StopWordsHindi(StopWords):
    """Hindi language stopwords."""

class StopWordsJapanese(StopWords):
    """Japanese language stopwords."""
```
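A lookup such as `get_stopwords_class` can be pictured as a dictionary dispatch with a default. The sketch below uses empty stand-in classes with the same names as above so that it runs on its own. The code-to-class table is an assumption inferred from the docstrings and ISO 639-1 codes, not copied from the library, and `pick_stopwords_class` is a hypothetical name.

```python
class StopWords:
    """Stand-in for the default class."""


class StopWordsChinese(StopWords):
    """Stand-in."""


class StopWordsArabic(StopWords):
    """Stand-in; documented above as covering Arabic and Persian."""


class StopWordsKorean(StopWords):
    """Stand-in."""


class StopWordsHindi(StopWords):
    """Stand-in."""


class StopWordsJapanese(StopWords):
    """Stand-in."""


# Assumed mapping of two-character codes to specialized classes.
LANGUAGE_CLASSES = {
    "zh": StopWordsChinese,
    "ar": StopWordsArabic,
    "fa": StopWordsArabic,
    "ko": StopWordsKorean,
    "hi": StopWordsHindi,
    "ja": StopWordsJapanese,
}


def pick_stopwords_class(language: str) -> type:
    """Return the specialized class for a code, or the default StopWords."""
    return LANGUAGE_CLASSES.get(language, StopWords)


print(pick_stopwords_class("fa").__name__)  # StopWordsArabic
print(pick_stopwords_class("en").__name__)  # StopWords
```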

### Helper Functions

Additional utility functions for configuration and language support.

```python { .api }
def get_available_languages() -> list:
    """
    Get list of supported language codes.

    Returns:
    List of two-character language codes
    """

def print_available_languages():
    """Print supported languages to console."""

def extend_config(config: Configuration, config_items: dict) -> Configuration:
    """
    Merge configuration object with additional settings.

    Parameters:
    - config: Base Configuration object
    - config_items: Dictionary of additional configuration values

    Returns:
    Updated Configuration object
    """
```
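The merge that `extend_config` describes can be sketched with a stand-in object and `setattr`. This is an illustration only: `DemoConfig` and `merge_settings` are hypothetical names, and skipping keys the object does not already define is an assumption about reasonable behavior rather than a statement about the real function.

```python
class DemoConfig:
    """Stand-in for Configuration with two of the documented settings."""

    def __init__(self):
        self.fetch_images = True
        self.request_timeout = 7


def merge_settings(config, config_items: dict):
    """Copy values from config_items onto config, skipping unknown keys."""
    for key, value in config_items.items():
        if hasattr(config, key):
            setattr(config, key, value)
    return config


cfg = merge_settings(DemoConfig(), {"request_timeout": 15, "not_a_setting": 1})
print(cfg.request_timeout)             # 15
print(cfg.fetch_images)                # True
print(hasattr(cfg, "not_a_setting"))   # False
```

Ignoring unknown keys keeps a typo in a settings dictionary from silently creating a new attribute that nothing reads.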

## Usage Examples

### Basic Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False
config.request_timeout = 10

# Use with article
article = Article('http://spanish-news-site.com/article', config=config)
article.build()
```

### Multi-language Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Process articles in different languages
languages = ['en', 'es', 'fr', 'de']
articles = {}

for lang in languages:
    config = Configuration()
    config.set_language(lang)

    # Language-specific URL (example)
    url = f'http://news-site.com/{lang}/article'
    article = Article(url, config=config)
    article.build()

    articles[lang] = article
    print(f"{lang}: {article.title}")
```

### Performance Optimization

```python
from newspaper import build
from newspaper.configuration import Configuration

# High-performance configuration
config = Configuration()
config.number_threads = 20
config.thread_timeout_seconds = 2
config.request_timeout = 5
config.memoize_articles = True
config.fetch_images = False  # Skip images for speed

# Build source with optimized settings
source = build('http://news-site.com', config=config)
print(f"Fast processing: {len(source.articles)} articles discovered")
```

### Content Quality Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Strict content validation
config = Configuration()
config.MIN_WORD_COUNT = 800   # Require longer articles
config.MIN_SENT_COUNT = 15    # Require more sentences
config.MAX_KEYWORDS = 50      # Extract more keywords
config.MAX_SUMMARY_SENT = 10  # Longer summaries

# Use strict configuration
article = Article('http://long-form-article.com', config=config)
article.build()

if article.is_valid_body():
    # len(article.text) would count characters, so split into words first
    print(f"High-quality article: {len(article.text.split())} words")
    print(f"Keywords: {len(article.keywords)}")
    print(f"Summary sentences: {len(article.summary.split('.'))}")
```

### Network Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Custom network settings
config = Configuration()
config.browser_user_agent = 'MyBot/1.0'
config.headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate'
}
config.proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}
config.request_timeout = 15

# Use custom network settings
article = Article('http://example.com/article', config=config)
article.download()
```

### Language Detection and Processing

```python
from newspaper import Article
from newspaper.configuration import Configuration
from newspaper.utils import get_available_languages

# Show supported languages
print("Supported languages:")
print(get_available_languages())

# Auto-detect and process
def process_with_language_detection(url):
    # First pass - detect language
    article = Article(url)
    article.download()
    article.parse()  # This extracts meta_lang

    detected_lang = article.meta_lang
    if detected_lang in get_available_languages():
        # Second pass with detected language
        config = Configuration()
        config.set_language(detected_lang)

        article_lang = Article(url, config=config)
        article_lang.build()
        return article_lang

    # Fall back to the first-pass article if the language is unsupported
    return article

# Process with language detection
result = process_with_language_detection('http://multilingual-site.com/article')
print(f"Language: {result.meta_lang}")
print(f"Title: {result.title}")
```

### Utility Functions Usage

```python
from newspaper import fulltext, hot, popular_urls

# Extract text from raw HTML
html_content = """
<html><body>
<h1>News Title</h1>
<p>This is the main article content with <a href="#">links</a> and formatting.</p>
</body></html>
"""

clean_text = fulltext(html_content, language='en')
print(f"Extracted text: {clean_text}")

# Get trending topics
try:
    trending = hot()
    if trending:
        print("Trending topics:", trending[:5])
except Exception as e:
    print(f"Could not fetch trending topics: {e}")

# Get popular news sources
popular_sources = popular_urls()
print(f"Popular sources: {len(popular_sources)} URLs")
for source in popular_sources[:5]:
    print(f"  {source}")
```

### Debug Configuration

```python
import logging

from newspaper import Article
from newspaper.configuration import Configuration

# Enable debug logging
config = Configuration()
config.verbose = True

# Set up logging to see debug output
logging.basicConfig(level=logging.DEBUG)

# Process with verbose output
article = Article('http://example.com/article', config=config)
article.build()  # Will show detailed debug information
```