Tessl Tile for pypi/newspaper3k@0.2.0

# Article Processing

Core functionality for downloading, parsing, and extracting content from individual news articles. The Article class provides comprehensive capabilities for processing web articles including text extraction, metadata parsing, image discovery, video extraction, and natural language processing.

## Capabilities

### Article Creation and Building

Create and initialize Article objects, with full processing pipeline support.

```python { .api }
class Article:
    def __init__(self, url: str, title: str = '', source_url: str = '', config=None, **kwargs):
        """
        Initialize an article object.

        Parameters:
        - url: Article URL to process
        - title: Optional article title
        - source_url: Optional source website URL
        - config: Configuration object for processing options
        - **kwargs: Additional configuration parameters
        """

    def build(self):
        """
        Complete article processing pipeline: download, parse, and NLP.
        Equivalent to calling download(), parse(), and nlp() in sequence.
        """

def build_article(url: str = '', config=None, **kwargs) -> Article:
    """
    Factory function to create an Article object.

    Parameters:
    - url: Article URL
    - config: Configuration object
    - **kwargs: Additional configuration parameters

    Returns:
    Article object ready for processing
    """
```
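
The `build()` docstring describes a fixed three-step sequence. A minimal sketch of that order, written against any object that exposes `download()`, `parse()`, and `nlp()`; `run_pipeline` and its `with_nlp` flag are hypothetical names used for illustration and are not part of newspaper.

```python
def run_pipeline(article, with_nlp=True):
    """Run the steps in the order build() is documented to use.

    `article` is any object exposing download(), parse(), and nlp().
    `with_nlp` is a hypothetical switch for skipping the NLP step when
    only text and metadata are needed.
    """
    article.download()   # fetch the HTML
    article.parse()      # extract title, text, and metadata
    if with_nlp:
        article.nlp()    # keywords and summary
    return article
```

Skipping the NLP step this way is only a convenience of the sketch; calling the three methods by hand gives the same control.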

### Content Download

Download HTML content from article URLs with error handling and redirect support.

```python { .api }
def download(self, input_html: str = None, title: str = None, recursion_counter: int = 0):
    """
    Download article HTML content.

    Parameters:
    - input_html: Optional pre-downloaded HTML content
    - title: Optional title override
    - recursion_counter: Internal parameter for handling redirects

    Raises:
    ArticleException: If download fails due to network or HTTP errors
    """
```
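
The `input_html` parameter accepts HTML that has already been fetched, for example from a cache or a test fixture. A small sketch of that use, assuming an article-like object with the documented `download()` and `parse()` methods; `parse_cached_html` is a hypothetical helper name.

```python
def parse_cached_html(article, html):
    """Feed already-fetched HTML to an article, then parse it."""
    if not html:
        raise ValueError("html must be a non-empty string")
    # Pass the stored markup through the documented input_html parameter.
    article.download(input_html=html)
    article.parse()
    return article
```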

### Content Parsing

Parse downloaded HTML to extract article components including text, metadata, images, and structure.

```python { .api }
def parse(self):
    """
    Parse downloaded HTML content to extract article data.
    Extracts title, authors, text content, images, metadata, and publication date.

    Raises:
    ArticleException: If article has not been downloaded first
    """
```

### Natural Language Processing

Extract keywords and generate summaries from article text content.

```python { .api }
def nlp(self):
    """
    Perform natural language processing on parsed article text.
    Extracts keywords from title and body text, generates article summary.

    Raises:
    ArticleException: If article has not been downloaded and parsed first
    """
```
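
Because `nlp()` is documented to raise when the article has not been parsed, callers may prefer to check the `is_parsed` property first. A minimal sketch under that assumption; `keywords_or_empty` is a hypothetical helper, and it relies only on the `is_parsed` and `keywords` properties listed in this document.

```python
def keywords_or_empty(article):
    """Return NLP keywords, or [] when the article is not ready for nlp()."""
    if not getattr(article, "is_parsed", False):
        return []          # avoid the exception nlp() would raise
    article.nlp()
    return list(article.keywords)
```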

### Content Validation

Validate article URLs and content quality according to configurable criteria.

```python { .api }
def is_valid_url(self) -> bool:
    """
    Check if the article URL is valid for processing.

    Returns:
    bool: True if URL is valid, False otherwise
    """

def is_valid_body(self) -> bool:
    """
    Check if article content meets quality requirements.
    Validates word count, sentence count, title quality, and HTML content.

    Returns:
    bool: True if article body is valid, False otherwise

    Raises:
    ArticleException: If article has not been parsed first
    """

def is_media_news(self) -> bool:
    """
    Check if article is media-heavy (gallery, video, slideshow, etc.).

    Returns:
    bool: True if article is media-focused, False otherwise
    """
```
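
The two main checks fit at different points: `is_valid_url()` can run before any network access, while `is_valid_body()` needs a parsed article. One possible way to combine them, sketched against an article-like object; `text_if_valid` is a hypothetical name.

```python
def text_if_valid(article):
    """Return the article text, or None if either validation check fails."""
    if not article.is_valid_url():
        return None              # skip the download entirely
    article.download()
    article.parse()
    if not article.is_valid_body():   # only meaningful after parse()
        return None
    return article.text
```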

### Article Properties

Access extracted article data and metadata.

```python { .api }
# Content Properties
url: str                    # Article URL
title: str                  # Article title
text: str                   # Main article body text
html: str                   # Raw HTML content
article_html: str           # Cleaned article HTML content
summary: str                # Auto-generated summary

# Author and Date Information
authors: list               # List of article authors
publish_date: datetime      # Publication date (datetime.datetime, or None if none was found)

# Media Content
top_img: str               # Primary article image URL (alias: top_image)
imgs: list                 # List of all image URLs (alias: images)
movies: list               # List of video URLs

# Metadata from HTML
meta_img: str              # Image URL from metadata
meta_keywords: list        # Keywords from HTML meta tags
meta_description: str      # Description from HTML meta
meta_lang: str             # Language from HTML meta
meta_favicon: str          # Favicon URL from meta
meta_data: dict            # Dictionary of all metadata
canonical_link: str        # Canonical URL from meta
tags: set                  # Set of article tags

# Processing State
is_parsed: bool            # Whether article has been parsed
download_state: int        # Download status (ArticleDownloadState values)
download_exception_msg: str # Error message if download failed

# Source Information
source_url: str            # URL of the parent news source

# Advanced Properties
top_node: object           # Main DOM node of article content
clean_top_node: object     # Clean copy of main DOM node
doc: object                # Full lxml DOM object
clean_doc: object          # Clean copy of DOM object
additional_data: dict      # Custom user data storage

# Extracted Content
keywords: list             # Keywords from NLP processing
```
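
These properties are plain attributes, so they can be copied into a dictionary for storage or logging. A rough sketch, assuming only the attribute names listed above; `FIELDS`, `article_to_record`, and `record_to_json` are hypothetical names chosen for this example.

```python
import json

# Hypothetical default selection of the attributes documented above.
FIELDS = ("url", "title", "authors", "publish_date", "top_img", "summary", "keywords")

def article_to_record(article, fields=FIELDS):
    """Copy selected attributes into a dict; missing ones become None."""
    record = {}
    for name in fields:
        value = getattr(article, name, None)
        if isinstance(value, (set, frozenset)):
            value = sorted(value)   # sets such as `tags` are not JSON-serialisable
        record[name] = value
    return record

def record_to_json(record):
    """Serialise a record; default=str covers values such as dates."""
    return json.dumps(record, default=str, sort_keys=True)
```

The DOM-related properties (`doc`, `top_node`, and their clean copies) are left out of the default field list because they are objects rather than simple values.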

### Download State Constants

```python { .api }
class ArticleDownloadState:
    NOT_STARTED: int = 0      # Download not yet attempted
    FAILED_RESPONSE: int = 1  # Download failed due to network/HTTP error
    SUCCESS: int = 2          # Download completed successfully
```
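
Since `download_state` is a plain integer, a readable label can help in logs. The sketch below mirrors the three values from the listing above as local constants so that it runs without newspaper installed; `describe_download` is a hypothetical helper.

```python
# Values mirrored from the ArticleDownloadState listing above.
NOT_STARTED, FAILED_RESPONSE, SUCCESS = 0, 1, 2

_LABELS = {
    NOT_STARTED: "not started",
    FAILED_RESPONSE: "failed",
    SUCCESS: "success",
}

def describe_download(state, error_msg=None):
    """Turn a download_state value into a short human-readable label."""
    label = _LABELS.get(state, f"unknown state {state}")
    if state == FAILED_RESPONSE and error_msg:
        return f"{label}: {error_msg}"   # include download_exception_msg if given
    return label
```

In real code the constants would come from `ArticleDownloadState` instead of being redefined.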

## Usage Examples

### Basic Article Processing

```python
from newspaper import Article

# Create and process article
article = Article('https://example.com/news/article')
article.download()
article.parse()

# Access extracted content
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Text length: {len(article.text)} characters")
print(f"Publication date: {article.publish_date}")
print(f"Top image: {article.top_img}")
```

### Full Processing with NLP

```python
from newspaper import build_article

# Build article with full processing
article = build_article('https://example.com/news/article')
article.build()  # download + parse + nlp

# Access NLP results
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
```

### Error Handling

```python
from newspaper import Article, ArticleException
from newspaper.article import ArticleDownloadState  # needed for the state check below

try:
    article = Article('https://example.com/news/article')
    article.download()

    if article.download_state == ArticleDownloadState.FAILED_RESPONSE:
        print(f"Download failed: {article.download_exception_msg}")
    else:
        article.parse()

        if article.is_valid_body():
            article.nlp()
            print(f"Article processed successfully: {article.title}")
        else:
            print("Article content does not meet quality requirements")

except ArticleException as e:
    print(f"Article processing error: {e}")
```

### Custom Configuration

```python
from newspaper import Article
# The package root exposes this class under the name Config;
# importing from the submodule keeps the Configuration name.
from newspaper.configuration import Configuration

# Create custom configuration
config = Configuration()
config.language = 'es'
config.MIN_WORD_COUNT = 500
config.fetch_images = False

# Process article with custom settings
article = Article('https://example.com/news/article', config=config)
article.build()
```
