# Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. The Article class provides comprehensive capabilities for processing web articles including text extraction, metadata parsing, image discovery, video extraction, and natural language processing.

## Capabilities

### Article Creation and Building

Create and initialize Article objects, with full processing pipeline support.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs):
        """
        Initialize an article object.

        Parameters:
        - url: Article URL to process
        - title: Optional article title
        - source_url: Optional source website URL
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters
        """

    def build(self):
        """
        Complete article processing pipeline: download, parse, and NLP.
        Equivalent to calling download(), parse(), and nlp() in sequence.
        """

def build_article(url: str = '', config=None, **kwargs) -> Article:
    """
    Factory function to create an Article object.

    Parameters:
    - url: Article URL
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
    Article object ready for processing
    """
```
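
The equivalence stated in the `build()` docstring can be sketched with a small helper. This is an illustration only, not the library's implementation: `run_pipeline` and the `RecordingArticle` stand-in are hypothetical names, used so the sketch runs without network access or the library installed.

```python
from typing import Any, List


def run_pipeline(article: Any) -> Any:
    """Call the three stages in the order the build() docstring lists them."""
    article.download()
    article.parse()
    article.nlp()
    return article


class RecordingArticle:
    """Hypothetical stand-in for Article that only records the call order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def download(self) -> None:
        self.calls.append("download")

    def parse(self) -> None:
        self.calls.append("parse")

    def nlp(self) -> None:
        self.calls.append("nlp")


demo = run_pipeline(RecordingArticle())
print(demo.calls)  # ['download', 'parse', 'nlp']
```

Calling the stages separately is mainly useful when a step should be skipped, for example parsing without NLP.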

### Content Download

Download HTML content from article URLs with error handling and redirect support.

```python { .api }
def download(self, input_html: str = None, title: str = None, recursion_counter: int = 0):
    """
    Download article HTML content.

    Parameters:
    - input_html: Optional pre-downloaded HTML content
    - title: Optional title override
    - recursion_counter: Internal parameter for handling redirects

    Raises:
    ArticleException: If download fails due to network or HTTP errors
    """
```
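
The `input_html` parameter lets HTML obtained elsewhere (a cache, a crawler, a test fixture) go through the same pipeline. The helper below is a minimal sketch under that assumption; `parse_from_html` is a hypothetical name, and the function accepts any object exposing the documented `download()` and `parse()` methods.

```python
from typing import Any


def parse_from_html(article: Any, html: str) -> Any:
    """Hand pre-downloaded HTML to the article, then parse it."""
    # input_html is the documented parameter for pre-downloaded content.
    article.download(input_html=html)
    article.parse()
    return article
```

With the library installed, this would be called as `parse_from_html(Article(url), html)`, where `html` is a string you already hold.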

### Content Parsing

Parse downloaded HTML to extract article components including text, metadata, images, and structure.

```python { .api }
def parse(self):
    """
    Parse downloaded HTML content to extract article data.
    Extracts title, authors, text content, images, metadata, and publication date.

    Raises:
    ArticleException: If article has not been downloaded first
    """
```

### Natural Language Processing

Extract keywords and generate summaries from article text content.

```python { .api }
def nlp(self):
    """
    Perform natural language processing on parsed article text.
    Extracts keywords from title and body text, generates article summary.

    Raises:
    ArticleException: If article has not been downloaded and parsed first
    """
```

### Content Validation

Validate article URLs and content quality according to configurable criteria.

```python { .api }
def is_valid_url(self) -> bool:
    """
    Check if the article URL is valid for processing.

    Returns:
    bool: True if URL is valid, False otherwise
    """

def is_valid_body(self) -> bool:
    """
    Check if article content meets quality requirements.
    Validates word count, sentence count, title quality, and HTML content.

    Returns:
    bool: True if article body is valid, False otherwise

    Raises:
    ArticleException: If article has not been parsed first
    """

def is_media_news(self) -> bool:
    """
    Check if article is media-heavy (gallery, video, slideshow, etc.).

    Returns:
    bool: True if article is media-focused, False otherwise
    """
```
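
The three checks can be combined into a filter over a batch of already parsed articles. The sketch below shows one possible policy, not a library feature: `keep_text_articles` is a hypothetical helper, and dropping media-heavy articles is an assumption about what a text-focused pipeline would want.

```python
from typing import Any, Iterable, List


def keep_text_articles(articles: Iterable[Any]) -> List[Any]:
    """Return the articles that pass all three documented checks."""
    kept = []
    for article in articles:
        if not article.is_valid_url():
            continue  # URL not suitable for processing
        if not article.is_valid_body():
            continue  # body fails the quality requirements
        if article.is_media_news():
            continue  # assumed policy: skip galleries, videos, slideshows
        kept.append(article)
    return kept
```

Because `is_valid_body()` is documented to raise `ArticleException` on unparsed articles, the helper expects `parse()` to have been called on every item first.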

### Article Properties

Access extracted article data and metadata.

```python { .api }
# Content Properties
url: str  # Article URL
title: str  # Article title
text: str  # Main article body text
html: str  # Raw HTML content
article_html: str  # Cleaned article HTML content
summary: str  # Auto-generated summary

# Author and Date Information
authors: list  # List of article authors
publish_date: str  # Publication date

# Media Content
top_img: str  # Primary article image URL (alias: top_image)
imgs: list  # List of all image URLs (alias: images)
movies: list  # List of video URLs

# Metadata from HTML
meta_img: str  # Image URL from metadata
meta_keywords: list  # Keywords from HTML meta tags
meta_description: str  # Description from HTML meta
meta_lang: str  # Language from HTML meta
meta_favicon: str  # Favicon URL from meta
meta_data: dict  # Dictionary of all metadata
canonical_link: str  # Canonical URL from meta
tags: set  # Set of article tags

# Processing State
is_parsed: bool  # Whether article has been parsed
download_state: int  # Download status (ArticleDownloadState values)
download_exception_msg: str  # Error message if download failed

# Source Information
source_url: str  # URL of the parent news source

# Advanced Properties
top_node: object  # Main DOM node of article content
clean_top_node: object  # Clean copy of main DOM node
doc: object  # Full lxml DOM object
clean_doc: object  # Clean copy of DOM object
additional_data: dict  # Custom user data storage

# Extracted Content
keywords: list  # Keywords from NLP processing
```
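
A common follow-up is to flatten a few of these properties into a plain dictionary for storage or logging. The sketch below assumes only the property names listed above; `article_to_record`, `FIELDS`, and the `SimpleNamespace` stand-in are hypothetical and keep the example runnable without the library.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict

# Subset of the documented properties to export.
FIELDS = ("url", "title", "authors", "publish_date", "top_img", "keywords", "summary")


def article_to_record(article: Any) -> Dict[str, Any]:
    """Copy selected properties into a dict; missing ones become None."""
    record = {name: getattr(article, name, None) for name in FIELDS}
    # tags is documented as a set; sort it so the record is JSON-friendly.
    record["tags"] = sorted(getattr(article, "tags", ()))
    return record


# Stand-in object carrying a few of the documented attributes.
demo = SimpleNamespace(
    url="https://example.com/news/article",
    title="Example",
    authors=["A. Writer"],
    tags={"world", "news"},
)
# default=str guards against non-string values such as date objects.
print(json.dumps(article_to_record(demo), default=str, indent=2))
```

Fields that are only filled after `nlp()`, such as `keywords` and `summary`, would simply appear as empty or `None` when NLP has not been run.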

### Download State Constants

```python { .api }
class ArticleDownloadState:
    NOT_STARTED: int = 0  # Download not yet attempted
    FAILED_RESPONSE: int = 1  # Download failed due to network/HTTP error
    SUCCESS: int = 2  # Download completed successfully
```
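
Since `download_state` is a plain integer, log output is easier to read when it is mapped back to a label. The sketch below mirrors the three documented values in a local enum so it runs without the library; `DownloadState` and `describe_state` are hypothetical names, not part of the package.

```python
from enum import IntEnum


class DownloadState(IntEnum):
    """Local mirror of the documented ArticleDownloadState values."""

    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2


def describe_state(state: int) -> str:
    """Turn a download_state integer into a readable label."""
    try:
        return DownloadState(state).name.lower().replace("_", " ")
    except ValueError:
        # Value outside the three documented states.
        return f"unknown ({state})"


print(describe_state(2))  # success
```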

## Usage Examples

### Basic Article Processing

```python
from newspaper import Article

# Create and process article
article = Article('https://example.com/news/article')
article.download()
article.parse()

# Access extracted content
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Text length: {len(article.text)} characters")
print(f"Publication date: {article.publish_date}")
print(f"Top image: {article.top_img}")
```

### Full Processing with NLP

```python
from newspaper import build_article

# Build article with full processing
article = build_article('https://example.com/news/article')
article.build()  # download + parse + nlp

# Access NLP results
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
```

### Error Handling

```python
from newspaper import Article, ArticleException
from newspaper.article import ArticleDownloadState  # state constants live in newspaper.article

try:
    article = Article('https://example.com/news/article')
    article.download()

    # Check the recorded state before parsing
    if article.download_state == ArticleDownloadState.FAILED_RESPONSE:
        print(f"Download failed: {article.download_exception_msg}")
    else:
        article.parse()

        # Only run NLP on content that passes the quality checks
        if article.is_valid_body():
            article.nlp()
            print(f"Article processed successfully: {article.title}")
        else:
            print("Article content does not meet quality requirements")

except ArticleException as e:
    print(f"Article processing error: {e}")
```

### Custom Configuration

```python
from newspaper import Article
from newspaper.configuration import Configuration

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False

# Process article with custom settings
article = Article('https://example.com/news/article', config=config)
article.build()
```