# Core Extraction

The main extraction functionality processes URLs or HTML documents to extract clean text content, metadata, and media elements. This module provides the primary Goose class interface and the extraction pipeline.

## Capabilities

### Primary Extraction Interface

The Goose class serves as the main entry point for all extraction operations, managing network connections, parser selection, and the complete extraction pipeline.

```python { .api }
class Goose:
    def __init__(self, config: Union[Configuration, dict, None] = None):
        """
        Initialize Goose extractor with optional configuration.

        Parameters:
        - config: Configuration object, dict of config options, or None for defaults

        Raises:
        - Exception: If local_storage_path is invalid when image fetching is enabled
        """

    def extract(self, url: Union[str, None] = None, raw_html: Union[str, None] = None) -> Article:
        """
        Extract article content from URL or raw HTML.

        Parameters:
        - url: URL to fetch and extract from
        - raw_html: Raw HTML string to extract from

        Returns:
        - Article: Extracted content and metadata

        Raises:
        - ValueError: If neither url nor raw_html is provided
        - NetworkError: Network-related errors during fetching
        - UnicodeDecodeError: Character encoding issues
        """

    def close(self):
        """
        Close network connection and perform cleanup.
        Automatically called when using as context manager or during garbage collection.
        """

    def shutdown_network(self):
        """
        Close the network connection specifically.
        Called automatically by close() method.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic cleanup."""
```

### Context Manager Usage

Goose supports the context manager protocol for automatic resource cleanup:

```python
from goose3 import Goose

with Goose() as g:
    article = g.extract(url="https://example.com/article")
    print(article.title)
# Network connection automatically closed
```
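To make the cleanup guarantee concrete, the sketch below uses a hypothetical stand-in class, `RecordingExtractor`, instead of Goose itself, so it runs without network access. It assumes only the behavior described above: `__exit__` calls `close()`, including when the body of the `with` block raises.

```python
class RecordingExtractor:
    """Hypothetical stand-in that mimics the cleanup protocol described above."""

    def __init__(self):
        self.closed = False

    def extract(self, url=None, raw_html=None):
        # Mirrors the documented contract: at least one input is required.
        if url is None and raw_html is None:
            raise ValueError("either url or raw_html is required")
        return {"url": url, "raw_html": raw_html}

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block


extractor = RecordingExtractor()
try:
    with extractor as g:
        g.extract()  # raises ValueError because no input was given
except ValueError:
    pass

print(extractor.closed)  # True: close() ran despite the exception
```

Without a `with` block, the same guarantee would need a `try`/`finally` that calls `close()` explicitly.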

### Configuration During Initialization

Pass configuration as a dict or a Configuration object:

```python
from goose3 import Configuration, Goose

# Dict configuration
g = Goose({
    'parser_class': 'soup',
    'target_language': 'es',
    'enable_image_fetching': True,
    'strict': False
})

# Configuration object
config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
g = Goose(config)
```
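When several extractors share most of their settings, it can help to build the options dict in one place. The following is a minimal sketch, not part of the library: `build_options` and `KNOWN_PARSERS` are hypothetical names, and the option keys are the ones used in the examples above.

```python
# Parser names that appear in this document; an assumption, not a library constant.
KNOWN_PARSERS = {"lxml", "soup"}


def build_options(base, **overrides):
    """Return a new options dict: base values updated with overrides."""
    options = {**base, **overrides}
    parser = options.get("parser_class")
    if parser is not None and parser not in KNOWN_PARSERS:
        raise ValueError(f"unknown parser_class: {parser!r}")
    return options


# Project-level settings chosen for this example (not the library's defaults).
base = {"parser_class": "lxml", "strict": False}
spanish = build_options(base, target_language="es", parser_class="soup")
print(spanish)
```

The resulting dict can be passed directly as the `config` argument, for example `Goose(spanish)`. The base dict is left unmodified, so it can be reused for other variants.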

### Extraction Modes

Extract from URL:

```python
g = Goose()
article = g.extract(url="https://example.com/news-article")
```

Extract from raw HTML:

```python
html_content = """
<html>
  <body>
    <h1>Article Title</h1>
    <p>Article content goes here...</p>
  </body>
</html>
"""
g = Goose()
article = g.extract(raw_html=html_content)
```
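If callers may hand over either kind of input, a small dispatcher can pick the right keyword argument. This is a sketch under stated assumptions: `looks_like_url`, `extract_from`, and the `EchoExtractor` stand-in are hypothetical helpers, and the URL check is a deliberately simple heuristic rather than anything the library does itself.

```python
def looks_like_url(source: str) -> bool:
    """Return True for a single whitespace-free http(s) URL."""
    text = source.strip()
    has_scheme = text.lower().startswith(("http://", "https://"))
    return has_scheme and not any(ch.isspace() for ch in text)


def extract_from(extractor, source: str):
    """Call extract() with url= or raw_html=, depending on the input."""
    if looks_like_url(source):
        return extractor.extract(url=source.strip())
    return extractor.extract(raw_html=source)


class EchoExtractor:
    """Hypothetical stand-in for Goose that echoes the arguments it receives."""

    def extract(self, url=None, raw_html=None):
        return {"url": url, "raw_html": raw_html}


stub = EchoExtractor()
from_url = extract_from(stub, " https://example.com/news-article ")
from_html = extract_from(stub, "<html><body><p>Hi</p></body></html>")
print(from_url["url"])   # https://example.com/news-article
print(from_html["url"])  # None
```

Because `extract_from` only relies on an `extract(url=..., raw_html=...)` method, a real `Goose()` instance could be passed in place of the stub.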

### Error Handling

```python
from goose3 import Goose
from goose3.network import NetworkError  # NetworkError is defined in goose3.network

g = Goose({'strict': True})  # Raise all network errors
try:
    article = g.extract(url="https://example.com/article")
except NetworkError as e:
    print(f"Network error: {e}")
except ValueError as e:
    print(f"Input error: {e}")
except UnicodeDecodeError as e:
    print(f"Encoding error: {e}")
```
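When processing many URLs, the same exception handling can be applied per item so that one failure does not abort the batch. The sketch below assumes a hypothetical helper, `extract_many`, and a stand-in `FlakyExtractor`; the tuple of recoverable exceptions is a parameter, so `NetworkError` could be added to it when the real library is used.

```python
def extract_many(extractor, urls, recoverable=(ValueError, UnicodeDecodeError)):
    """Extract each URL, collecting results and failures separately."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = extractor.extract(url=url)
        except recoverable as exc:
            failures[url] = exc  # keep the exception for later inspection
    return results, failures


class FlakyExtractor:
    """Hypothetical stand-in: rejects empty input, echoes everything else."""

    def extract(self, url=None, raw_html=None):
        if not url and not raw_html:
            raise ValueError("either url or raw_html is required")
        return f"article for {url}"


results, failures = extract_many(FlakyExtractor(), ["https://example.com/a", ""])
print(sorted(results))
print({url: type(exc).__name__ for url, exc in failures.items()})
```

Exceptions outside the `recoverable` tuple still propagate, which keeps unexpected errors visible instead of silently recording them.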

### Multi-language Support

Configure language targeting for better extraction:

```python
# Automatic language detection from meta tags
g = Goose({'use_meta_language': True})

# Force specific language
g = Goose({
    'use_meta_language': False,
    'target_language': 'es'  # Spanish
})

# Chinese language support
g = Goose({'target_language': 'zh'})

# Arabic language support
g = Goose({'target_language': 'ar'})
```
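The two patterns above (detect from meta tags, or force a language) can be captured in one small function. This is only an illustrative sketch: `language_options` is a hypothetical helper that returns an options dict, and it follows the "force specific language" form shown above by disabling `use_meta_language` whenever a language is given.

```python
def language_options(language=None):
    """Build the language-related options for a Goose config dict."""
    if language is None:
        # No explicit language: rely on the page's meta tags.
        return {"use_meta_language": True}
    # Explicit language: turn off meta detection and pin the target.
    return {"use_meta_language": False, "target_language": language}


print(language_options())
print(language_options("es"))
```

The returned dict could be passed as is, or merged with other options, when constructing the extractor.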

### Parser Selection

Choose between available HTML parsers:

```python
# Default lxml parser (faster, more robust)
g = Goose({'parser_class': 'lxml'})

# BeautifulSoup parser (more lenient)
g = Goose({'parser_class': 'soup'})
```
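One way to combine the two parsers is to try the faster one first and fall back to the more lenient one when no text comes back. The sketch below is an assumption-laden illustration, not library behavior: `extract_with_fallback` and `StubExtractor` are hypothetical, and the check relies on the article exposing a `cleaned_text` attribute, which is not shown elsewhere in this section.

```python
from types import SimpleNamespace


def extract_with_fallback(make_extractor, raw_html, parsers=("lxml", "soup")):
    """Try each parser in order; return (parser_name, article) for the first non-empty text."""
    if not parsers:
        raise ValueError("at least one parser name is required")
    article = None
    for name in parsers:
        extractor = make_extractor({"parser_class": name})
        try:
            article = extractor.extract(raw_html=raw_html)
        finally:
            extractor.close()  # release resources before trying the next parser
        if (getattr(article, "cleaned_text", "") or "").strip():
            return name, article
    return parsers[-1], article  # nothing produced text; return the last attempt


class StubExtractor:
    """Hypothetical stand-in: the 'lxml' variant yields no text, 'soup' does."""

    def __init__(self, options):
        self.parser = options["parser_class"]

    def extract(self, url=None, raw_html=None):
        text = "" if self.parser == "lxml" else "recovered text"
        return SimpleNamespace(cleaned_text=text)

    def close(self):
        pass


name, article = extract_with_fallback(StubExtractor, "<p>hi</p>")
print(name, article.cleaned_text)  # soup recovered text
```

With the real library, `make_extractor` could simply be the `Goose` class, since it accepts a dict of options.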