# Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. The Article class covers text extraction, metadata parsing, image discovery, video extraction, and natural language processing (keyword extraction and summarization).

## Capabilities

### Article Creation and Building

Create and initialize Article objects, with full processing pipeline support.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs):
        """
        Initialize an article object.

        Parameters:
        - url: Article URL to process
        - title: Optional article title
        - source_url: Optional source website URL
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters
        """

    def build(self):
        """
        Complete article processing pipeline: download, parse, and NLP.
        Equivalent to calling download(), parse(), and nlp() in sequence.
        """

def build_article(url: str = '', config=None, **kwargs) -> Article:
    """
    Factory function to create an Article object.

    Parameters:
    - url: Article URL
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
        Article object ready for processing
    """
```

### Content Download

Download HTML content from article URLs with error handling and redirect support.

```python { .api }
def download(self, input_html: str = None, title: str = None, recursion_counter: int = 0):
    """
    Download article HTML content.

    Parameters:
    - input_html: Optional pre-downloaded HTML content
    - title: Optional title override
    - recursion_counter: Internal parameter for handling redirects

    Raises:
        ArticleException: If download fails due to network or HTTP errors
    """
```
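
When the HTML is already available (from a cache, a crawler, or a test fixture), the `input_html` parameter lets the caller supply it directly. A minimal sketch, assuming the HTML sits in a local file: `load_cached_html` is a hypothetical helper, and the article argument can be any object with the `download(input_html=...)` signature shown above. The demo uses a stand-in class so it runs without the library or a network connection.

```python
import tempfile
from pathlib import Path


def load_cached_html(article, html_path: str, encoding: str = "utf-8"):
    """Feed HTML from a local file to an Article-like object.

    Hypothetical helper: relies only on the documented
    download(input_html=...) parameter.
    """
    html = Path(html_path).read_text(encoding=encoding)
    article.download(input_html=html)
    return article


class _StubArticle:
    """Stand-in with the documented download() signature (demo only)."""

    def __init__(self):
        self.html = ""

    def download(self, input_html=None, title=None, recursion_counter=0):
        self.html = input_html or ""


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html><body><p>Cached copy</p></body></html>", encoding="utf-8")
    cached = load_cached_html(_StubArticle(), str(page))

print(cached.html)
```

With the real class, the same helper would be followed by `parse()` as usual.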

### Content Parsing

Parse downloaded HTML to extract article components including text, metadata, images, and structure.

```python { .api }
def parse(self):
    """
    Parse downloaded HTML content to extract article data.
    Extracts title, authors, text content, images, metadata, and publication date.

    Raises:
        ArticleException: If article has not been downloaded first
    """
```

### Natural Language Processing

Extract keywords and generate summaries from article text content.

```python { .api }
def nlp(self):
    """
    Perform natural language processing on parsed article text.
    Extracts keywords from title and body text, generates article summary.

    Raises:
        ArticleException: If article has not been downloaded and parsed first
    """
```

### Content Validation

Validate article URLs and content quality according to configurable criteria.

```python { .api }
def is_valid_url(self) -> bool:
    """
    Check if the article URL is valid for processing.

    Returns:
        bool: True if URL is valid, False otherwise
    """

def is_valid_body(self) -> bool:
    """
    Check if article content meets quality requirements.
    Validates word count, sentence count, title quality, and HTML content.

    Returns:
        bool: True if article body is valid, False otherwise

    Raises:
        ArticleException: If article has not been parsed first
    """

def is_media_news(self) -> bool:
    """
    Check if article is media-heavy (gallery, video, slideshow, etc.).

    Returns:
        bool: True if article is media-focused, False otherwise
    """
```

### Article Properties

Access extracted article data and metadata.

```python { .api }
# Content Properties
url: str  # Article URL
title: str  # Article title
text: str  # Main article body text
html: str  # Raw HTML content
article_html: str  # Cleaned article HTML content
summary: str  # Auto-generated summary

# Author and Date Information
authors: list  # List of article authors
publish_date: datetime  # Publication date (datetime object, or None if not found)

# Media Content
top_img: str  # Primary article image URL (alias: top_image)
imgs: list  # List of all image URLs (alias: images)
movies: list  # List of video URLs

# Metadata from HTML
meta_img: str  # Image URL from metadata
meta_keywords: list  # Keywords from HTML meta tags
meta_description: str  # Description from HTML meta
meta_lang: str  # Language from HTML meta
meta_favicon: str  # Favicon URL from meta
meta_data: dict  # Dictionary of all metadata
canonical_link: str  # Canonical URL from meta
tags: set  # Set of article tags

# Processing State
is_parsed: bool  # Whether article has been parsed
download_state: int  # Download status (ArticleDownloadState values)
download_exception_msg: str  # Error message if download failed

# Source Information
source_url: str  # URL of the parent news source

# Advanced Properties
top_node: object  # Main DOM node of article content
clean_top_node: object  # Clean copy of main DOM node
doc: object  # Full lxml DOM object
clean_doc: object  # Clean copy of DOM object
additional_data: dict  # Custom user data storage

# Extracted Content
keywords: list  # Keywords from NLP processing
```
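
Some of these properties are not directly JSON-serializable: `tags` is a set, and `publish_date` may be a `datetime` object rather than a string. The sketch below shows one way to export a subset of the properties; `article_to_dict` and `EXPORT_FIELDS` are hypothetical names, and the demo uses a `SimpleNamespace` stand-in rather than a real Article.

```python
import json
from datetime import datetime
from types import SimpleNamespace

# Subset of the properties listed above; the selection is arbitrary.
EXPORT_FIELDS = (
    "url", "title", "authors", "publish_date",
    "top_img", "tags", "keywords", "summary",
)


def article_to_dict(article) -> dict:
    """Collect selected properties into a JSON-serializable dict.

    Hypothetical helper: sets become sorted lists, datetime values become
    ISO 8601 strings, and missing attributes are stored as None.
    """
    record = {}
    for name in EXPORT_FIELDS:
        value = getattr(article, name, None)
        if isinstance(value, (set, frozenset)):
            value = sorted(value)
        elif isinstance(value, datetime):
            value = value.isoformat()
        record[name] = value
    return record


# Stand-in object carrying a few of the documented properties (demo only).
sample = SimpleNamespace(
    url="https://example.com/news/article",
    title="Example",
    authors=["A. Writer"],
    publish_date=datetime(2024, 1, 2, 3, 4, 5),
    tags={"world", "economy"},
)
record = article_to_dict(sample)
print(json.dumps(record, indent=2))
```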

### Download State Constants

```python { .api }
class ArticleDownloadState:
    NOT_STARTED: int = 0  # Download not yet attempted
    FAILED_RESPONSE: int = 1  # Download failed due to network/HTTP error
    SUCCESS: int = 2  # Download completed successfully
```
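
Because `download_state` is a plain integer, log output is easier to read when the value is mapped back to a name. The sketch below assumes the three values listed above; `DownloadState` and `describe_download_state` are hypothetical local names, not part of the library.

```python
from enum import IntEnum


class DownloadState(IntEnum):
    """Local mirror of the ArticleDownloadState values listed above."""

    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2


def describe_download_state(state: int) -> str:
    """Translate a raw download_state integer into a readable label."""
    try:
        return DownloadState(state).name
    except ValueError:
        # Values outside the documented range are reported as-is.
        return f"UNKNOWN({state})"


print(describe_download_state(2))  # SUCCESS
print(describe_download_state(7))  # UNKNOWN(7)
```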

## Usage Examples

### Basic Article Processing

```python
from newspaper import Article

# Create and process article
article = Article('https://example.com/news/article')
article.download()
article.parse()

# Access extracted content
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Text length: {len(article.text)} characters")
print(f"Publication date: {article.publish_date}")
print(f"Top image: {article.top_img}")
```

### Full Processing with NLP

```python
from newspaper import build_article

# Build article with full processing
article = build_article('https://example.com/news/article')
article.build()  # download + parse + nlp

# Access NLP results
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
```

### Error Handling

```python
from newspaper import Article, ArticleException
# ArticleDownloadState is defined in the newspaper.article module
from newspaper.article import ArticleDownloadState

try:
    article = Article('https://example.com/news/article')
    article.download()

    if article.download_state == ArticleDownloadState.FAILED_RESPONSE:
        print(f"Download failed: {article.download_exception_msg}")
    else:
        article.parse()

        if article.is_valid_body():
            article.nlp()
            print(f"Article processed successfully: {article.title}")
        else:
            print("Article content does not meet quality requirements")

except ArticleException as e:
    print(f"Article processing error: {e}")
```

### Custom Configuration

```python
from newspaper import Article
# Configuration is defined in newspaper.configuration
# (the top-level package exposes it under the name Config)
from newspaper.configuration import Configuration

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False

# Process article with custom settings
article = Article('https://example.com/news/article', config=config)
article.build()
```