tessl/pypi-newspaper3k

Simplified python article discovery & extraction.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pypipkg:pypi/newspaper3k@0.2.x

To install, run

npx @tessl/cli install tessl/pypi-newspaper3k@0.2.0

# Newspaper3k

A comprehensive Python library for extracting and curating articles from web sources. Newspaper3k provides multi-threaded article downloading, intelligent text extraction from HTML, image and video extraction, keyword and summary generation using natural language processing, author and publication date detection, and multi-language support for over 10 languages including English, Chinese, German, and Arabic.

## Package Information

- **Package Name**: newspaper3k
- **Language**: Python
- **Installation**: `pip install newspaper3k`

## Core Imports

```python
import newspaper
```

Common imports for working with articles and sources:

```python
from newspaper import Article, Source, build, build_article, fulltext, __version__
from newspaper import Configuration, Config, NewsPool, news_pool
from newspaper import ArticleException, hot, languages, popular_urls
```

## Basic Usage

```python
import newspaper
from newspaper import Article

# Basic article extraction
url = 'http://cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)

# Download and parse article
article.download()
article.parse()

# Access extracted content
print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text)
print(article.top_image)

# Extract keywords and summary using NLP
article.nlp()
print(article.keywords)
print(article.summary)

# Build and process news sources
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    print(article.url)

# Multi-threaded processing
from newspaper import news_pool
# Any list of Article objects works here; the first few discovered articles are reused
articles = cnn_paper.articles[:3]
news_pool.set(articles)
news_pool.join()
```

## Architecture

The library is built around several core concepts:

- **Article**: Individual news articles with extraction, parsing, and NLP capabilities
- **Source**: News websites/domains that contain collections of articles
- **NewsPool**: Multi-threading framework for batch processing articles and sources
- **Configuration**: Customizable settings for extraction behavior, language processing, and validation thresholds

This design supports both single-article processing and large-scale news aggregation. Configurable extraction parameters, caching, and multi-language support make it suitable for research applications, content curation systems, and automated journalism workflows.

## Capabilities

### Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. Supports text extraction, metadata parsing, image discovery, video extraction, and natural language processing.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs): ...
    def download(self, input_html=None, title=None, recursion_counter: int = 0): ...
    def parse(self): ...
    def nlp(self): ...
    def build(self): ...

def build_article(url: str = '', config=None, **kwargs) -> Article: ...
```

[Article Processing](./article-processing.md)
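
As a minimal sketch of the download and parse flow listed above, the helper below takes an already-constructed `Article` and, when `html` is supplied, passes it through the `input_html` parameter so that pre-fetched markup is used instead of a network request. The helper name `extract_fields`, the returned dictionary layout, and the `saved_html` variable in the usage comment are illustrative assumptions, not part of newspaper3k.

```python
def extract_fields(article, html=None):
    """Download (or load pre-fetched HTML into) an Article, parse it,
    and return a few extracted fields as a plain dict."""
    # When input_html is given, the supplied markup is used as the page source.
    article.download(input_html=html)
    article.parse()
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
    }


# Usage with the real library (requires `pip install newspaper3k`):
#   from newspaper import Article
#   fields = extract_fields(Article("https://example.com/story"), html=saved_html)
```

Accepting the article as a parameter keeps the helper easy to exercise with a stand-in object, which is handy when tests should not touch the network.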

### Source Management

Functionality for working with news websites and domains as collections of articles. Provides article discovery, category extraction, RSS feed processing, and batch operations.

```python { .api }
class Source:
    def __init__(self, url: str, config=None, **kwargs): ...
    def build(self): ...
    def download(self): ...
    def parse(self): ...

def build(url: str = '', dry: bool = False, config=None, **kwargs) -> Source: ...
```

[Source Management](./source-management.md)
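
To illustrate treating a source as a collection of articles, the sketch below collects the URLs of the first few articles from a built `Source`. The helper name `list_article_urls` and its `limit` parameter are hypothetical; it relies only on the `articles` list and the per-article `url` attribute shown in Basic Usage.

```python
def list_article_urls(source, limit=10):
    """Return up to `limit` article URLs discovered for a built Source."""
    return [article.url for article in source.articles[:limit]]


# Usage with the real library (requires `pip install newspaper3k`):
#   import newspaper
#   paper = newspaper.build("http://cnn.com")
#   for url in list_article_urls(paper, limit=5):
#       print(url)
```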

### Multi-threading & Batch Processing

Thread pool management for processing multiple articles and sources concurrently. Enables efficient large-scale content extraction and processing.

```python { .api }
class NewsPool:
    def __init__(self, config=None): ...
    def set(self, news_list: list, threads_per_source: int = 1, override_threads=None): ...
    def join(self): ...

# Pre-instantiated pool
news_pool: NewsPool
```

[Multi-threading](./multithreading.md)
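
A rough sketch of batch-downloading several built sources through a pool, using only the `set` and `join` methods listed above. The helper name `download_sources` and its return value (a mapping from each source's `url` to its article count) are assumptions made for illustration. In the 0.2.x source, `threads_per_source` appears to take effect only when every item is a `Source`, so for plain lists of `Article` objects `override_threads` is probably the parameter to set; this is worth checking against the installed version.

```python
def download_sources(pool, sources, threads_per_source=2):
    """Queue built Source objects on a NewsPool-like object, block until
    all downloads finish, and report how many articles each source holds."""
    pool.set(sources, threads_per_source=threads_per_source)
    pool.join()  # blocks until every queued download has completed
    return {source.url: len(source.articles) for source in sources}


# Usage with the real library (requires `pip install newspaper3k`):
#   import newspaper
#   from newspaper import news_pool
#   papers = [newspaper.build(u) for u in ("http://cnn.com", "http://slate.com")]
#   print(download_sources(news_pool, papers))
```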

### Configuration & Utilities

Configuration management, language support, and utility functions for customizing extraction behavior and accessing supplementary features.

```python { .api }
class Configuration:
    def __init__(self): ...
    def set_language(self, language: str): ...
    def get_language(self) -> str: ...

# Configuration is also aliased as Config for convenience
Config = Configuration

def fulltext(html: str, language: str = 'en') -> str: ...
def hot() -> list: ...
def languages(): ...
def popular_urls() -> list: ...

# Version information
__version__: str  # Package version (currently "0.2.8")
```

[Configuration](./configuration.md)
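
One possible way to customize a configuration before passing it to `Article` or `build` is sketched below. The helper `apply_settings` is hypothetical. The option names in the usage comment (`request_timeout`, `memoize_articles`) are attributes of the 0.2.x `Configuration` class as far as can be told, but they are not listed in the API summary above and should be treated as assumptions.

```python
def apply_settings(config, language="en", **overrides):
    """Set the language via set_language() and copy other overrides onto
    the config, rejecting names the config object does not already define."""
    config.set_language(language)
    for name, value in overrides.items():
        if not hasattr(config, name):
            # Guard against typos that would otherwise be silently ignored.
            raise AttributeError(f"unknown configuration option: {name}")
        setattr(config, name, value)
    return config


# Usage with the real library (requires `pip install newspaper3k`):
#   from newspaper import Article, Config
#   config = apply_settings(Config(), language="de",
#                           request_timeout=10, memoize_articles=False)
#   article = Article("https://example.com/story", config=config)
```

Checking `hasattr` first is a deliberate choice: configuration objects accept arbitrary attributes, so a misspelled option would otherwise have no effect and no error.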

## Exception Handling

```python { .api }
class ArticleException(Exception):
    """Exception raised for article-related errors during download or parsing."""

class ConcurrencyException(Exception):
    """Exception raised for thread pool operation errors."""
```

Common error scenarios:

- Network errors during article download
- HTML parsing failures
- Invalid URL formats
- Missing required article content
- Thread pool configuration or execution errors
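
The scenarios above can be handled along these lines. This is a minimal sketch, assuming a hypothetical helper `safe_parse` that returns `None` when `ArticleException` is raised; printing the message stands in for whatever logging the application uses.

```python
def safe_parse(url):
    """Return a parsed Article, or None if download/parse raised ArticleException."""
    # Imported lazily so that merely defining this helper does not require newspaper3k.
    from newspaper import Article, ArticleException

    article = Article(url)
    try:
        article.download()
        article.parse()
    except ArticleException as exc:
        print(f"skipping {url}: {exc}")
        return None
    return article


# Usage (requires `pip install newspaper3k`):
#   article = safe_parse("https://example.com/story")
#   if article is not None:
#       print(article.title)
```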