# Goose3

A Python library for extracting article content, metadata, and media from web pages and HTML documents. Goose3 identifies the main article content and filters out navigation, advertisements, and other non-content elements using text analysis.

## Package Information

- **Package Name**: goose3
- **Language**: Python
- **Installation**: `pip install goose3`
- **Optional Dependencies**:
  - `pip install goose3[chinese]` - Chinese language support
  - `pip install goose3[arabic]` - Arabic language support
  - `pip install goose3[all]` - All language extensions

## Core Imports

```python
from goose3 import Goose
```

For configuration and data types:

```python
from goose3 import Goose, Configuration, Article, Image, Video
from goose3 import ArticleContextPattern, PublishDatePattern, AuthorPattern
```

For language-specific text processing:

```python
from goose3.text import StopWords, StopWordsChinese, StopWordsArabic, StopWordsKorean
```

## Basic Usage

```python
from goose3 import Goose

# Basic extraction from URL
g = Goose()
article = g.extract(url='https://example.com/article')

print(article.title)
print(article.cleaned_text)
print(article.meta_description)
if article.top_image:
    print(article.top_image.src)

# Extract from raw HTML
html_content = "<html>...</html>"
article = g.extract(raw_html=html_content)

# Using as context manager (recommended)
with Goose() as g:
    article = g.extract(url='https://example.com/article')
    print(article.title)
```

## Architecture

Goose3 uses a multi-stage extraction pipeline:

- **Network Fetcher**: Downloads web content with configurable user agents and request handling
- **Parser**: Processes HTML using lxml or BeautifulSoup with language-specific optimization
- **Content Extraction**: Identifies main article content using text density analysis and DOM patterns
- **Metadata Extraction**: Extracts titles, descriptions, publication dates, authors, and schema data
- **Media Detection**: Locates and extracts images and embedded videos
- **Language Processing**: Multi-language text analysis with specialized analyzers for Chinese, Arabic, and Korean
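The content-extraction stage can be pictured with a toy scorer. The sketch below is not Goose3's actual algorithm; it only illustrates the general idea of ranking text blocks by how many common stopwords they contain, on the assumption that running prose has many and navigation or ad text has few. The stopword set, the `score_block` and `pick_main_block` names, and the sample blocks are all made up for illustration.

```python
import re

# Tiny illustrative stopword set (an assumption; real lists are far larger).
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is",
    "that", "it", "on", "for", "was", "with",
}


def score_block(text: str) -> int:
    """Count how many words in a block of text are stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for word in words if word in STOPWORDS)


def pick_main_block(blocks: list[str]) -> str:
    """Return the block with the highest stopword count."""
    return max(blocks, key=score_block)


blocks = [
    "Home | News | Sports | Contact",
    "The city council voted on Tuesday to approve the budget, "
    "and the mayor said that it was a win for the region.",
    "Subscribe now! 50% off",
]
main = pick_main_block(blocks)
print(score_block(main), main)
```

A real extractor would score DOM nodes rather than plain strings and combine several signals, but the ranking step has the same shape.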

## Capabilities

### Core Extraction

Main article extraction functionality that processes URLs or HTML to extract clean text content, metadata, and media elements.

```python { .api }
class Goose:
    def __init__(self, config=None): ...
    def extract(self, url=None, raw_html=None) -> Article: ...
    def close(self): ...
    def shutdown_network(self): ...
```

[Core Extraction](./core-extraction.md)

### Configuration System

Comprehensive configuration options for customizing extraction behavior, including parser selection, language targeting, content patterns, and network settings.

```python { .api }
class Configuration:
    def __init__(self): ...

    # Key properties
    parser_class: str
    target_language: str
    browser_user_agent: str
    enable_image_fetching: bool
    strict: bool
    local_storage_path: str
```

[Configuration](./configuration.md)
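As a rough illustration of how these properties might be set together, the sketch below builds a plain settings dict and hands it to `Goose`. The helper names `build_goose_config` and `make_extractor` are hypothetical. The `use_meta_language` key does not appear in the listing above; it is taken from the Goose3 README, so check it against the installed version. The reading of `strict` as "raise on fetch errors" is likewise an assumption.

```python
def build_goose_config(language=None, lenient=True, user_agent=None):
    """Build a plain dict of Goose3 settings (hypothetical helper)."""
    config = {
        # Assumption: strict mode controls whether fetch errors raise.
        "strict": not lenient,
        # Skip image downloads for faster, text-only extraction.
        "enable_image_fetching": False,
    }
    if language is not None:
        # Force a language instead of trusting the page's meta tags.
        config["target_language"] = language
        config["use_meta_language"] = False
    if user_agent is not None:
        config["browser_user_agent"] = user_agent
    return config


def make_extractor(config):
    """Create a Goose instance from a settings dict."""
    # Imported here so the dict builder above runs without goose3 installed.
    from goose3 import Goose

    return Goose(config)


print(build_goose_config(language="es", user_agent="my-crawler/1.0"))
```

Keeping the settings in a dict makes them easy to load from a file or log; a `Configuration` object with the same properties set as attributes would be the equivalent object-style form.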

### Article Data Structure

Rich data structure containing extracted content, metadata, and media with comprehensive property access for all extracted information.

```python { .api }
class Article:
    @property
    def title(self) -> str: ...
    @property
    def cleaned_text(self) -> str: ...
    @property
    def top_image(self) -> Image: ...
    @property
    def movies(self) -> list[Video]: ...
    # ... additional properties
```

[Article Data](./article-data.md)
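One common use of these properties is flattening an article into a JSON-friendly dict. The sketch below uses only the four properties listed above. The `article_to_dict` helper and its `max_chars` parameter are hypothetical, and the demo runs against a stand-in object rather than a real `Article` so that it needs no network access.

```python
import json
from types import SimpleNamespace


def article_to_dict(article, max_chars=200):
    """Flatten the listed Article properties into a JSON-friendly dict."""
    top_image = article.top_image
    return {
        "title": article.title,
        "excerpt": article.cleaned_text[:max_chars],
        # top_image may be missing, as the Basic Usage check suggests.
        "top_image": top_image.src if top_image else None,
        "videos": [video.src for video in article.movies],
    }


# Stand-in object with the same attribute names as Article, so the
# sketch runs without fetching a page.
fake_article = SimpleNamespace(
    title="Example headline",
    cleaned_text="Body text of the article.",
    top_image=None,
    movies=[SimpleNamespace(src="https://example.com/embed/1")],
)
print(json.dumps(article_to_dict(fake_article), indent=2))
```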

### Media Extraction

Image and video extraction capabilities with support for metadata, dimensions, and embedded content from various platforms.

```python { .api }
class Image:
    src: str
    width: int
    height: int

class Video:
    src: str
    embed_code: str
    embed_type: str
    width: int
    height: int
```

[Media Extraction](./media-extraction.md)

## Types

```python { .api }
from typing import Union, Optional, List, Dict, Any

# Main extraction interface
ExtractInput = Union[str, None]  # URL or raw HTML
ConfigInput = Union[Configuration, dict, None]

# Pattern matching for content extraction
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None): ...
    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str
    tag: str
    domain: str

# Exception types
class NetworkError(RuntimeError):
    """Network-related errors during content fetching"""
    def __init__(self, status_code, reason): ...
    status_code: int  # HTTP status code
    reason: str       # HTTP reason phrase
    message: str      # Formatted error message

# Language-specific text processing classes
class StopWords:
    """Base stopwords class for English text processing"""
    def __init__(self, language: str = 'en'): ...

class StopWordsChinese(StopWords):
    """Chinese language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsArabic(StopWords):
    """Arabic language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsKorean(StopWords):
    """Korean language stopwords for improved text analysis"""
    def __init__(self): ...
```
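Because `NetworkError` carries a `status_code`, callers can decide which failures are worth retrying. The sketch below retries only on 5xx statuses. `extract_with_retry`, `FakeNetworkError`, and `flaky_extract` are hypothetical names. The stand-in exception mirrors the constructor shape in the listing above so the demo runs offline, and it is not the library's class.

```python
import time


def extract_with_retry(extract, url, retry_on, attempts=3, delay=0.0):
    """Call extract(url=url), retrying when `retry_on` carries a 5xx status."""
    for attempt in range(1, attempts + 1):
        try:
            return extract(url=url)
        except retry_on as error:
            status = getattr(error, "status_code", None)
            is_server_error = status is not None and 500 <= status < 600
            # Give up on client errors and on the final attempt.
            if not is_server_error or attempt == attempts:
                raise
            time.sleep(delay)


# Stand-in with the same constructor shape as the NetworkError listing.
class FakeNetworkError(RuntimeError):
    def __init__(self, status_code, reason):
        super().__init__(f"{status_code}: {reason}")
        self.status_code = status_code
        self.reason = reason


calls = []


def flaky_extract(url=None):
    """Fail twice with a 503, then succeed."""
    calls.append(url)
    if len(calls) < 3:
        raise FakeNetworkError(503, "Service Unavailable")
    return f"article from {url}"


result = extract_with_retry(
    flaky_extract, "https://example.com/article", FakeNetworkError
)
print(result, len(calls))
```

With Goose3 this might be called as `extract_with_retry(g.extract, url, retry_on=NetworkError)`, assuming the configuration is set so that fetch failures raise instead of being swallowed.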