Tessl Tile for pypi/goose3@3.1.0

# Goose3

A Python library for extracting article content, metadata, and media from web pages and HTML documents. Goose3 identifies the main article body by analyzing text density and DOM patterns, and filters out navigation, advertisements, and other non-content elements.

## Package Information

- **Package Name**: goose3
- **Language**: Python
- **Installation**: `pip install goose3`
- **Optional Dependencies**:
  - `pip install goose3[chinese]` - Chinese language support
  - `pip install goose3[arabic]` - Arabic language support
  - `pip install goose3[all]` - All language extensions

## Core Imports

```python
from goose3 import Goose
```

For configuration and data types:

```python
from goose3 import Goose, Configuration, Article, Image, Video
from goose3 import ArticleContextPattern, PublishDatePattern, AuthorPattern
```

For language-specific text processing:

```python
from goose3.text import StopWords, StopWordsChinese, StopWordsArabic, StopWordsKorean
```

## Basic Usage

```python
from goose3 import Goose

# Basic extraction from URL
g = Goose()
article = g.extract(url='https://example.com/article')

print(article.title)
print(article.cleaned_text)
print(article.meta_description)
if article.top_image:
    print(article.top_image.src)

# Extract from raw HTML
html_content = "<html>...</html>"
article = g.extract(raw_html=html_content)

# Using as context manager (recommended)
with Goose() as g:
    article = g.extract(url='https://example.com/article')
    print(article.title)
```

## Architecture

Goose3 uses a multi-stage extraction pipeline:

- **Network Fetcher**: Downloads web content with configurable user agents and request handling
- **Parser**: Processes HTML using lxml or BeautifulSoup with language-specific optimization
- **Content Extraction**: Identifies main article content using text density analysis and DOM patterns
- **Metadata Extraction**: Extracts titles, descriptions, publication dates, authors, and schema data
- **Media Detection**: Locates and extracts images and embedded videos
- **Language Processing**: Multi-language text analysis with specialized analyzers for Chinese, Arabic, and Korean

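To make the content-extraction stage more concrete, the sketch below scores candidate text blocks by how many stopwords they contain, on the assumption that running prose scores higher than menus or link lists. This is a simplified illustration, not Goose3's actual algorithm: the `STOPWORDS` set, `stopword_score`, and `pick_main_block` are hypothetical names, and real extraction works on DOM nodes rather than plain strings.

```python
import re

# Hypothetical, deliberately tiny stopword set (illustration only).
STOPWORDS = frozenset({
    "the", "a", "an", "of", "and", "to", "in", "is",
    "that", "it", "for", "on", "with", "as", "was",
})


def stopword_score(text: str) -> int:
    """Count stopword occurrences in a block of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for word in words if word in STOPWORDS)


def pick_main_block(blocks: list[str]) -> str:
    """Return the block with the highest stopword score (first wins on ties)."""
    return max(blocks, key=stopword_score)


blocks = [
    "Home | News | Sports | Contact",
    "The mayor said that the bridge is to reopen in May, and the work was on budget.",
    "Subscribe now",
]
main = pick_main_block(blocks)
print(main)
```

The navigation bar and the call to action contain no stopwords, so the sentence of prose is selected.
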
## Capabilities

### Core Extraction

Main article extraction functionality that processes URLs or HTML to extract clean text content, metadata, and media elements.

```python { .api }
class Goose:
    def __init__(self, config=None): ...
    def extract(self, url=None, raw_html=None) -> Article: ...
    def close(self): ...
    def shutdown_network(self): ...
```

[Core Extraction](./core-extraction.md)

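When several documents are processed with one extractor, the `close()` method in the signature above suggests releasing resources once at the end. A minimal sketch of that pattern follows, using `contextlib.closing` from the standard library. `extract_titles` and `StubExtractor` are hypothetical names; the stub only mimics the `extract()`/`close()` shape so the example runs without goose3 or network access.

```python
from contextlib import closing
from types import SimpleNamespace


def extract_titles(extractor, html_documents):
    """Extract each document with one shared extractor, then close it."""
    # closing() calls extractor.close() on exit, even if extract() raises.
    with closing(extractor):
        return [extractor.extract(raw_html=html).title for html in html_documents]


class StubExtractor:
    """Hypothetical stand-in with the same extract()/close() shape as Goose."""

    def __init__(self):
        self.closed = False

    def extract(self, url=None, raw_html=None):
        # Fake "extraction": treat the text inside <title> as the title.
        start = raw_html.index("<title>") + len("<title>")
        end = raw_html.index("</title>")
        return SimpleNamespace(title=raw_html[start:end])

    def close(self):
        self.closed = True


stub = StubExtractor()
titles = extract_titles(stub, ["<title>First</title>", "<title>Second</title>"])
print(titles, stub.closed)
```

With goose3 installed, passing `Goose()` in place of the stub should follow the same flow, since `Goose` exposes `extract(raw_html=...)` and `close()` as listed above.
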
### Configuration System

Comprehensive configuration options for customizing extraction behavior, including parser selection, language targeting, content patterns, and network settings.

```python { .api }
class Configuration:
    def __init__(self): ...

    # Key properties
    parser_class: str
    target_language: str
    browser_user_agent: str
    enable_image_fetching: bool
    strict: bool
    local_storage_path: str
```

[Configuration](./configuration.md)

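Because configuration overrides are plain property names, a typo can easily go unnoticed. The sketch below validates override names before they are used. It is an illustration under stated assumptions: `KNOWN_PROPERTIES` contains only the properties listed above (the real class may accept more), and `checked_settings` is a hypothetical helper, not part of goose3.

```python
# Property names copied from the listing above; the real class may accept more.
KNOWN_PROPERTIES = frozenset({
    "parser_class",
    "target_language",
    "browser_user_agent",
    "enable_image_fetching",
    "strict",
    "local_storage_path",
})


def checked_settings(**overrides):
    """Return the overrides as a dict, rejecting names not in KNOWN_PROPERTIES."""
    unknown = sorted(set(overrides) - KNOWN_PROPERTIES)
    if unknown:
        raise ValueError(f"unknown configuration properties: {', '.join(unknown)}")
    return dict(overrides)


settings = checked_settings(browser_user_agent="example-bot/1.0", strict=False)
print(settings)
```

The resulting dictionary could then be passed as the `config` argument of `Goose`, given that the `ConfigInput` alias in the Types section lists `dict` as an accepted form.
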
### Article Data Structure

Rich data structure containing extracted content, metadata, and media with comprehensive property access for all extracted information.

```python { .api }
class Article:
    @property
    def title(self) -> str: ...
    @property
    def cleaned_text(self) -> str: ...
    @property
    def top_image(self) -> Image: ...
    @property
    def movies(self) -> list[Video]: ...
    # ... additional properties
```

[Article Data](./article-data.md)

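As a sketch of how these properties might be consumed, the helper below condenses an article-like object into a small dictionary. `summarize` is a hypothetical function, and the `SimpleNamespace` object is a stand-in for a real `Article` so the example runs offline. It assumes `top_image` can be empty, which matches the `if article.top_image:` check in Basic Usage.

```python
from types import SimpleNamespace


def summarize(article, preview_chars=80):
    """Condense an article-like object into a small dictionary."""
    top_image = article.top_image
    return {
        "title": article.title,
        "preview": article.cleaned_text[:preview_chars],
        # top_image may be empty when no suitable image was found.
        "image_src": top_image.src if top_image else None,
        "video_count": len(article.movies),
    }


# Stand-in for a real Article (hypothetical data).
fake_article = SimpleNamespace(
    title="Bridge reopens",
    cleaned_text="The bridge reopened on Monday.",
    top_image=None,
    movies=[],
)
summary = summarize(fake_article)
print(summary)
```
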
### Media Extraction

Image and video extraction capabilities with support for metadata, dimensions, and embedded content from various platforms.

```python { .api }
class Image:
    src: str
    width: int
    height: int

class Video:
    src: str
    embed_code: str
    embed_type: str
    width: int
    height: int
```

[Media Extraction](./media-extraction.md)

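One way to work with extracted videos is to group their sources by `embed_type`, for example to handle different embed mechanisms separately. The following is a minimal sketch: `sources_by_embed_type` is a hypothetical helper, the `SimpleNamespace` objects stand in for `Video` instances, and the `embed_type` values shown are placeholders rather than a list of values the library is known to produce.

```python
from collections import defaultdict
from types import SimpleNamespace


def sources_by_embed_type(videos):
    """Group video source URLs by their embed_type attribute."""
    grouped = defaultdict(list)
    for video in videos:
        grouped[video.embed_type].append(video.src)
    return dict(grouped)


# Stand-ins for Video objects; the embed_type values are placeholders.
fake_videos = [
    SimpleNamespace(src="https://example.com/embed/1", embed_type="iframe"),
    SimpleNamespace(src="https://example.com/embed/2", embed_type="iframe"),
    SimpleNamespace(src="https://example.com/clip.swf", embed_type="object"),
]
grouped = sources_by_embed_type(fake_videos)
print(grouped)
```
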
143
## Types
144

145
```python { .api }
146
from typing import Union, Optional, List, Dict, Any
147

148
# Main extraction interface
149
ExtractInput = Union[str, None]  # URL or raw HTML
150
ConfigInput = Union[Configuration, dict, None]
151

152
# Pattern matching for content extraction
153
class ArticleContextPattern:
154
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
155
    attr: str
156
    value: str  
157
    tag: str
158
    domain: str
159

160
class PublishDatePattern:
161
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None): ...
162
    attr: str
163
    value: str
164
    content: str
165
    subcontent: str
166
    tag: str
167
    domain: str
168

169
class AuthorPattern:
170
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
171
    attr: str
172
    value: str
173
    tag: str
174
    domain: str
175

176
# Exception types
177
class NetworkError(RuntimeError):
178
    """Network-related errors during content fetching"""
179
    def __init__(self, status_code, reason): ...
180
    status_code: int  # HTTP status code
181
    reason: str       # HTTP reason phrase
182
    message: str      # Formatted error message
183

184
# Language-specific text processing classes
185
class StopWords:
186
    """Base stopwords class for English text processing"""
187
    def __init__(self, language: str = 'en'): ...
188

189
class StopWordsChinese(StopWords):
190
    """Chinese language stopwords for improved text analysis"""
191
    def __init__(self): ...
192

193
class StopWordsArabic(StopWords):
194
    """Arabic language stopwords for improved text analysis"""
195
    def __init__(self): ...
196

197
class StopWordsKorean(StopWords):
198
    """Korean language stopwords for improved text analysis"""
199
    def __init__(self): ...
200
```

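Since `NetworkError` carries a `status_code`, callers can use it to decide whether a failed fetch is worth retrying. The sketch below shows one possible policy. `should_retry`, the `RETRYABLE_STATUS_CODES` set, and `StubNetworkError` are hypothetical; the stub mirrors only the `status_code` and `reason` attributes listed above so the example runs without goose3, and the choice of retryable codes is an assumption, not library behavior.

```python
# Assumed policy: rate limiting and server-side errors are treated as transient.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def should_retry(error) -> bool:
    """Return True when the error carries a status code assumed to be transient."""
    return getattr(error, "status_code", None) in RETRYABLE_STATUS_CODES


class StubNetworkError(RuntimeError):
    """Hypothetical stand-in mirroring NetworkError's status_code and reason."""

    def __init__(self, status_code, reason):
        super().__init__(f"{status_code}: {reason}")
        self.status_code = status_code
        self.reason = reason


print(should_retry(StubNetworkError(503, "Service Unavailable")))
print(should_retry(StubNetworkError(404, "Not Found")))
```

In real use the check would sit inside an `except NetworkError as error:` block around `extract(url=...)`, assuming the library raises that exception when a fetch fails.
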