Tessl Tile for pypi/newspaper3k@0.2.0

tessl/pypi-newspaper3k

Simplified python article discovery & extraction.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/newspaper3k@0.2.x

To install, run

npx @tessl/cli install tessl/pypi-newspaper3k@0.2.0

# Newspaper3k

A comprehensive Python library for extracting and curating articles from web sources. Newspaper3k provides multi-threaded article downloading, intelligent text extraction from HTML, image and video extraction, keyword and summary generation using natural language processing, author and publication date detection, and multi-language support for over 10 languages including English, Chinese, German, and Arabic.

## Package Information

- **Package Name**: newspaper3k
- **Language**: Python
- **Installation**: `pip install newspaper3k`

## Core Imports

```python
import newspaper
```

Common imports for working with articles and sources:

```python
from newspaper import Article, Source, build, build_article, fulltext, __version__
from newspaper import Configuration, Config, NewsPool, news_pool
from newspaper import ArticleException, hot, languages, popular_urls
```

## Basic Usage

```python
import newspaper
from newspaper import Article, news_pool

# Basic article extraction
url = 'http://cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)

# Download and parse article
article.download()
article.parse()

# Access extracted content
print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text)
print(article.top_image)

# Extract keywords and summary using NLP
# (nlp() needs NLTK's 'punkt' tokenizer data to be installed)
article.nlp()
print(article.keywords)
print(article.summary)

# Build and process news sources
cnn_paper = newspaper.build('http://cnn.com')
for source_article in cnn_paper.articles:
    print(source_article.url)

# Multi-threaded processing: download several articles concurrently
articles = cnn_paper.articles[:3]
news_pool.set(articles)
news_pool.join()
```

## Architecture

The library is built around several core concepts:

- **Article**: Individual news articles with extraction, parsing, and NLP capabilities
- **Source**: News websites/domains that contain collections of articles
- **NewsPool**: Multi-threading framework for batch processing articles and sources
- **Configuration**: Customizable settings for extraction behavior, language processing, and validation thresholds

This design enables both single-article processing and large-scale news aggregation workflows, with configurable extraction parameters, caching mechanisms, and multi-language support that make it suitable for research applications, content curation systems, and automated journalism workflows.

## Capabilities

### Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. Supports text extraction, metadata parsing, image discovery, video extraction, and natural language processing.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs): ...
    def download(self, input_html=None, title=None, recursion_counter: int = 0): ...
    def parse(self): ...
    def nlp(self): ...
    def build(self): ...

def build_article(url: str = '', config=None, **kwargs) -> Article: ...
```

[Article Processing](./article-processing.md)
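
As a minimal sketch of the download-then-parse flow described in this section, the helper below gathers the attributes printed in Basic Usage into a dictionary. The name `extract_fields` is hypothetical and not part of newspaper3k. It accepts any object that exposes `download()`, `parse()`, and the listed attributes, such as a `newspaper.Article`.

```python
def extract_fields(article):
    """Download and parse an Article-like object, then collect common fields.

    `extract_fields` is an illustrative helper, not a newspaper3k function.
    """
    article.download()
    article.parse()
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
    }
```

With newspaper3k installed, `extract_fields(Article(url))` should return the same values that the Basic Usage example prints one by one.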

### Source Management

Functionality for working with news websites and domains as collections of articles. Provides article discovery, category extraction, RSS feed processing, and batch operations.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs): ...
    def build(self): ...
    def download(self): ...
    def parse(self): ...

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source: ...
```

[Source Management](./source-management.md)
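
To illustrate treating a source as a collection of articles, here is a small sketch that lists unique article URLs from a built source. `list_article_urls` and its `limit` parameter are hypothetical names introduced for this example. The function only assumes the object has an `articles` list whose items carry a `url` attribute, as in the Basic Usage loop.

```python
def list_article_urls(source, limit=10):
    """Return up to `limit` unique article URLs from a Source-like object.

    Illustrative helper only; it is not part of newspaper3k.
    """
    seen = set()
    urls = []
    for article in source.articles:
        if len(urls) >= limit:
            break
        if article.url in seen:
            continue  # skip duplicate URLs discovered via several categories
        seen.add(article.url)
        urls.append(article.url)
    return urls
```

A typical call would be `list_article_urls(newspaper.build('http://cnn.com'))`. Repeated builds of the same site may yield fewer articles, since the library caches previously seen URLs by default.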

### Multi-threading & Batch Processing

Thread pool management for processing multiple articles and sources concurrently. Enables efficient large-scale content extraction and processing.

```python { .api }
class NewsPool:
    def __init__(self, config=None): ...
    def set(self, news_list: list, threads_per_source: int = 1, override_threads=None): ...
    def join(self): ...

# Pre-instantiated pool
news_pool: NewsPool
```

[Multi-threading](./multithreading.md)
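
The `set()` then `join()` sequence can be wrapped in one call, sketched below under the assumption that each source exposes `url` and `articles`. `download_sources` is a hypothetical helper, not a library function. The `pool` argument is expected to behave like the pre-instantiated `news_pool`.

```python
def download_sources(pool, sources, threads_per_source=2):
    """Queue sources on a NewsPool-like object and wait for the downloads.

    Returns a mapping of source URL to the number of articles it holds.
    Illustrative helper only; it is not part of newspaper3k.
    """
    pool.set(sources, threads_per_source=threads_per_source)
    pool.join()  # blocks until every queued download has finished
    return {source.url: len(source.articles) for source in sources}
```

For example, `download_sources(news_pool, [cnn_paper, other_paper])` would download both sources concurrently, where `other_paper` stands for any second source built with `newspaper.build`. Each article still needs `parse()` afterwards before its text is available.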

### Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features.

```python { .api }
class Configuration:
    def __init__(self): ...
    def set_language(self, language: str): ...
    def get_language(self) -> str: ...

# Configuration is also aliased as Config for convenience
Config = Configuration

def fulltext(html: str, language: str = 'en') -> str: ...
def hot() -> list: ...
def languages(): ...
def popular_urls() -> list: ...

# Version information
__version__: str  # Package version (currently "0.2.8")
```

[Configuration](./configuration.md)
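
One way to customize extraction behavior is to set options on a configuration object before passing it as `config=` to `Article` or `build`. The sketch below is a hypothetical helper (`configure` is not a newspaper3k function) that sets the language through `set_language` and rejects misspelled option names. It assumes only that the object has a `set_language` method and plain attributes for its other options.

```python
def configure(config, language="en", **overrides):
    """Apply a language and attribute overrides to a Configuration-like object.

    Illustrative helper only; it is not part of newspaper3k.
    """
    config.set_language(language)
    for name, value in overrides.items():
        # Refuse unknown names so a typo does not silently create a new attribute.
        if not hasattr(config, name):
            raise AttributeError(f"unknown configuration option: {name}")
        setattr(config, name, value)
    return config
```

A call might look like `configure(Config(), language='de', request_timeout=10)`. The option name `request_timeout` is an assumption recalled from the library's `Configuration` class and should be checked against the installed version.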

## Exception Handling

```python { .api }
class ArticleException(Exception):
    """Exception raised for article-related errors during download or parsing."""

class ConcurrencyException(Exception):
    """Exception raised for thread pool operation errors."""
```

Common error scenarios:
- Network errors during article download
- HTML parsing failures
- Invalid URL formats
- Missing required article content
- Thread pool configuration or execution errors
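
The article-related scenarios above can be handled by catching `ArticleException` around the download and parse steps. The following is a minimal sketch, not a definitive pattern. `try_extract` is a hypothetical helper, and the fallback exception class exists only so the snippet can run where newspaper3k is not installed.

```python
try:
    from newspaper import ArticleException
except ImportError:
    # Assumption for illustration: a stand-in class so the sketch runs
    # without newspaper3k installed.
    class ArticleException(Exception):
        pass


def try_extract(article):
    """Return (text, None) on success or (None, error message) on failure.

    Illustrative helper only; it is not part of newspaper3k.
    """
    try:
        article.download()
        article.parse()
    except ArticleException as exc:
        return None, str(exc)
    return article.text, None
```

Callers can then skip or log failed URLs instead of aborting a whole batch, for example `text, error = try_extract(Article(url))`.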