Simplified python article discovery & extraction.
npx @tessl/cli install tessl/pypi-newspaper3k@0.2.00
# Newspaper3k

A comprehensive Python library for extracting and curating articles from web sources. Newspaper3k provides multi-threaded article downloading, intelligent text extraction from HTML, image and video extraction, keyword and summary generation using natural language processing, author and publication date detection, and multi-language support for over 10 languages including English, Chinese, German, and Arabic.

## Package Information

- **Package Name**: newspaper3k
- **Language**: Python
- **Installation**: `pip install newspaper3k`

## Core Imports

```python
import newspaper
```

Common imports for working with articles and sources:

```python
from newspaper import Article, Source, build, build_article, fulltext, __version__
from newspaper import Configuration, Config, NewsPool, news_pool
from newspaper import ArticleException, hot, languages, popular_urls
```

## Basic Usage

```python
import newspaper
from newspaper import Article, news_pool

# Basic article extraction
url = 'http://cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)

# Download and parse article
article.download()
article.parse()

# Access extracted content
print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text)
print(article.top_image)

# Extract keywords and summary using NLP
# (requires the NLTK 'punkt' tokenizer data)
article.nlp()
print(article.keywords)
print(article.summary)

# Build and process news sources
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    print(article.url)

# Multi-threaded processing of the first three discovered articles
articles = cnn_paper.articles[:3]
news_pool.set(articles)
news_pool.join()
```

## Architecture

The library is built around several core concepts:

- **Article**: Individual news articles with extraction, parsing, and NLP capabilities
- **Source**: News websites/domains that contain collections of articles
- **NewsPool**: Multi-threading framework for batch processing articles and sources
- **Configuration**: Customizable settings for extraction behavior, language processing, and validation thresholds

This design enables both single-article processing and large-scale news aggregation workflows, with configurable extraction parameters, caching mechanisms, and multi-language support that make it suitable for research applications, content curation systems, and automated journalism workflows.

## Capabilities

### Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. Supports text extraction, metadata parsing, image discovery, video extraction, and natural language processing.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs): ...
    def download(self, input_html=None, title=None, recursion_counter: int = 0): ...
    def parse(self): ...
    def nlp(self): ...
    def build(self): ...

def build_article(url: str = '', config=None, **kwargs) -> Article: ...
```

[Article Processing](./article-processing.md)

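
The `input_html` parameter of `download()` suggests that an article can be parsed from HTML that is already in hand, without a network request. The following is a minimal sketch of that flow, not the library's own helper: the name `parse_local_html` and the placeholder URL are assumptions made for illustration, and the article class is passed in as an argument so the helper can be exercised with a stand-in object.

```python
def parse_local_html(article_cls, html, url="http://example.com/story"):
    """Parse HTML that is already in memory instead of fetching `url`.

    `article_cls` is expected to behave like newspaper.Article: the
    constructor takes a URL and instances expose download() and parse().
    The helper name and default URL are hypothetical.
    """
    article = article_cls(url)
    # Supplying input_html hands the markup to the article directly,
    # so no HTTP request should be needed for this step.
    article.download(input_html=html)
    article.parse()
    return article
```

With the library installed, `parse_local_html(Article, html).title` should return the title extracted from `html`.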
### Source Management

Functionality for working with news websites and domains as collections of articles. Provides article discovery, category extraction, RSS feed processing, and batch operations.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs): ...
    def build(self): ...
    def download(self): ...
    def parse(self): ...

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source: ...
```

[Source Management](./source-management.md)

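
As a rough illustration of article discovery, the sketch below builds a source explicitly and lists a few of the article URLs it finds. The helper name `article_urls` and the `limit` parameter are hypothetical; the object passed in is assumed to behave like a `Source` that has not been built yet, for example one returned by `build(url, dry=True)`.

```python
def article_urls(source, limit=5):
    """Build a Source-like object and return up to `limit` article URLs.

    `source` is assumed to expose build() and, afterwards, an `articles`
    list whose items have a `url` attribute. The helper is illustrative.
    """
    # build() performs the site download and article discovery.
    source.build()
    return [article.url for article in source.articles[:limit]]
```

A possible call is `article_urls(newspaper.build('http://cnn.com', dry=True))`, which needs network access.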
### Multi-threading & Batch Processing

Thread pool management for processing multiple articles and sources concurrently. Enables efficient large-scale content extraction and processing.

```python { .api }
class NewsPool:
    def __init__(self, config=None): ...
    def set(self, news_list: list, threads_per_source: int = 1, override_threads=None): ...
    def join(self): ...

# Pre-instantiated pool
news_pool: NewsPool
```

[Multi-threading](./multithreading.md)

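
A batch download over several sources might be sketched as follows. The helper name `download_sources` is an assumption for illustration; `pool` is expected to behave like `news_pool`, and `sources` like a list of `Source` objects that have already been built.

```python
def download_sources(pool, sources, threads_per_source=2):
    """Download the articles of several built sources through a thread pool.

    `pool` is assumed to expose set() and join() like newspaper.news_pool;
    each item in `sources` is assumed to expose an `articles` list.
    """
    # Queue every source with the requested number of threads per source.
    pool.set(sources, threads_per_source=threads_per_source)
    # join() blocks until all queued downloads have finished.
    pool.join()
    # Downloading does not parse: parse() still has to be called per article.
    return [article for source in sources for article in source.articles]
```

With the library installed, a call could look like `download_sources(news_pool, [newspaper.build(u) for u in urls])`, where `urls` is a list of site URLs chosen by the caller.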
### Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features.

```python { .api }
class Configuration:
    def __init__(self): ...
    def set_language(self, language: str): ...
    def get_language(self) -> str: ...

# Configuration is also aliased as Config for convenience
Config = Configuration

def fulltext(html: str, language: str = 'en') -> str: ...
def hot() -> list: ...
def languages(): ...
def popular_urls() -> list: ...

# Version information
__version__: str  # Package version (currently "0.2.8")
```

[Configuration](./configuration.md)

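
Customizing extraction behavior usually comes down to setting a few attributes on a configuration object and passing it to `Article` or `build` through their `config` parameter. The sketch below assumes the attribute names `request_timeout` and `memoize_articles`, which are not listed in the API summary above and should be checked against the installed version; the helper name `apply_settings` is hypothetical.

```python
def apply_settings(config, language="en", request_timeout=10, memoize_articles=False):
    """Apply a few common settings to a Configuration-like object.

    `request_timeout` and `memoize_articles` are assumed attribute names;
    set_language() comes from the API summary in this document.
    """
    # Language is given as a two-letter code such as 'en' or 'de'.
    config.set_language(language)
    # Assumed meaning: seconds to wait for each HTTP request.
    config.request_timeout = request_timeout
    # Assumed meaning: False disables skipping of previously seen articles.
    config.memoize_articles = memoize_articles
    return config
```

A configured object could then be used as `Article(url, config=apply_settings(Config(), language="de"))`.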
## Exception Handling

```python { .api }
class ArticleException(Exception):
    """Exception raised for article-related errors during download or parsing."""

class ConcurrencyException(Exception):
    """Exception raised for thread pool operation errors."""
```

Common error scenarios:
- Network errors during article download
- HTML parsing failures
- Invalid URL formats
- Missing required article content
- Thread pool configuration or execution errors
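
One way to keep a batch job running when a single article fails is to catch the article-level exception and skip that URL. The helper below is a sketch under stated assumptions: the name `extract_text` and the `errors` parameter are hypothetical, and the article class is injected so the same code can be tried with a stand-in class.

```python
def extract_text(url, article_cls, errors=(Exception,)):
    """Return the article text for `url`, or None if extraction fails.

    `article_cls` is assumed to behave like newspaper.Article, and
    `errors` is the tuple of exception types to treat as recoverable.
    """
    article = article_cls(url)
    try:
        article.download()
        article.parse()
    except errors as exc:
        # Report and skip instead of aborting the whole batch.
        print(f"Skipping {url}: {exc}")
        return None
    return article.text
```

With the library installed, `extract_text(url, Article, errors=(ArticleException,))` narrows the handler to article-related failures and lets unrelated bugs surface.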