Tessl Tile for pypi/goose3@3.1.0

# Core Extraction

The core extraction module takes a URL or an HTML document and extracts clean text content, metadata, and media elements. It provides the primary `Goose` class interface and the extraction pipeline.

## Capabilities

### Primary Extraction Interface

The `Goose` class is the main entry point for all extraction operations. It manages network connections, parser selection, and the complete extraction pipeline.

```python { .api }
class Goose:
    def __init__(self, config: Union[Configuration, dict, None] = None):
        """
        Initialize Goose extractor with optional configuration.

        Parameters:
        - config: Configuration object, dict of config options, or None for defaults

        Raises:
        - Exception: If local_storage_path is invalid when image fetching is enabled
        """

    def extract(self, url: Union[str, None] = None, raw_html: Union[str, None] = None) -> Article:
        """
        Extract article content from URL or raw HTML.

        Parameters:
        - url: URL to fetch and extract from
        - raw_html: Raw HTML string to extract from

        Returns:
        - Article: Extracted content and metadata

        Raises:
        - ValueError: If neither url nor raw_html is provided
        - NetworkError: Network-related errors during fetching
        - UnicodeDecodeError: Character encoding issues
        """

    def close(self):
        """
        Close network connection and perform cleanup.
        Automatically called when using as context manager or during garbage collection.
        """

    def shutdown_network(self):
        """
        Close the network connection specifically.
        Called automatically by close() method.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic cleanup."""
```
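When a `with` block is awkward, for example when the extractor is built by a factory, the same cleanup can be arranged with `try`/`finally`. The sketch below is only an illustration: `extract_and_close` and its `factory` parameter are hypothetical names, not part of goose3, and the helper relies on nothing beyond the `extract()` and `close()` methods listed above.

```python
from typing import Any, Callable


def extract_and_close(factory: Callable[[], Any], **extract_kwargs: Any) -> Any:
    """Create an extractor, run one extraction, and always call close().

    `factory` is a hypothetical parameter: any zero-argument callable that
    returns an object with extract() and close() methods, such as the
    Goose class itself.
    """
    extractor = factory()
    try:
        return extractor.extract(**extract_kwargs)
    finally:
        # Runs whether extract() returned normally or raised.
        extractor.close()
```

Assuming goose3 is installed, a call might look like `article = extract_and_close(Goose, url="https://example.com/article")`.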

### Context Manager Usage

Goose supports the context manager protocol for automatic resource cleanup:

```python
from goose3 import Goose

with Goose() as g:
    article = g.extract(url="https://example.com/article")
    print(article.title)
# The network connection is closed automatically when the with block exits
```

### Configuration During Initialization

Pass configuration as a dict or as a `Configuration` object:

```python
from goose3 import Configuration, Goose

# Dict configuration
g = Goose({
    'parser_class': 'soup',
    'target_language': 'es',
    'enable_image_fetching': True,
    'strict': False
})

# Configuration object
config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
g = Goose(config)
```

### Extraction Modes

Extract from a URL:

```python
from goose3 import Goose

g = Goose()
article = g.extract(url="https://example.com/news-article")
```

Extract from raw HTML:

```python
from goose3 import Goose

html_content = """
<html>
<body>
<h1>Article Title</h1>
<p>Article content goes here...</p>
</body>
</html>
"""
g = Goose()
article = g.extract(raw_html=html_content)
```
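If the input may be either a URL or an HTML string, a small dispatcher can pick the matching keyword argument for `extract()`. This is a minimal sketch: `extraction_kwargs` is a hypothetical helper, and its rule (anything starting with `http://` or `https://` is a URL, everything else is raw HTML) is an assumption of this example, not goose3 behavior.

```python
from typing import Dict


def extraction_kwargs(source: str) -> Dict[str, str]:
    """Map an input string to the keyword argument extract() expects.

    Assumption: strings starting with http:// or https:// are URLs;
    every other non-empty string is treated as raw HTML.
    """
    if not source or not source.strip():
        raise ValueError("source must be a non-empty URL or HTML string")
    stripped = source.strip()
    if stripped.lower().startswith(("http://", "https://")):
        return {"url": stripped}
    return {"raw_html": source}
```

With a `Goose` instance `g`, the result can be unpacked directly: `article = g.extract(**extraction_kwargs(user_input))`.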

### Error Handling

```python
from goose3 import Goose, NetworkError

g = Goose({'strict': True})  # Raise all network errors
try:
    article = g.extract(url="https://example.com/article")
except NetworkError as e:
    print(f"Network error: {e}")
except ValueError as e:
    print(f"Input error: {e}")
except UnicodeDecodeError as e:
    print(f"Encoding error: {e}")
```
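Transient network failures are often worth retrying before giving up. The sketch below shows one possible approach, not something goose3 provides: `extract_with_retry`, its parameters, and the linear backoff are all assumptions of this example. It accepts any callable that takes a `url` keyword, so a bound `g.extract` can be passed in.

```python
import time
from typing import Any, Callable, Tuple, Type


def extract_with_retry(
    extract: Callable[..., Any],
    url: str,
    attempts: int = 3,
    delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Any:
    """Call extract(url=url), retrying when one of `retry_on` is raised.

    All names here are hypothetical; only the url keyword argument comes
    from the extract() signature documented above.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return extract(url=url)
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay * attempt)  # Linear backoff between tries.
```

Using the names from the error-handling example, a call might look like `extract_with_retry(g.extract, "https://example.com/article", retry_on=(NetworkError,))`. Restricting `retry_on` to network errors keeps input and encoding errors from being retried pointlessly.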

### Multi-language Support

Set the target language explicitly, or let Goose read it from the page's meta tags:

```python
from goose3 import Goose

# Automatic language detection from meta tags
g = Goose({'use_meta_language': True})

# Force specific language
g = Goose({
    'use_meta_language': False,
    'target_language': 'es'  # Spanish
})

# Chinese language support
g = Goose({'target_language': 'zh'})

# Arabic language support
g = Goose({'target_language': 'ar'})
```
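The two modes above can be folded into one helper that builds the config dict. This is only a sketch: `language_config` is a hypothetical function, and it uses just the `use_meta_language` and `target_language` keys shown in the examples.

```python
from typing import Dict, Optional, Union


def language_config(target_language: Optional[str] = None) -> Dict[str, Union[bool, str]]:
    """Build a config dict for language handling.

    With no argument, language detection from meta tags stays on.
    With a language code, meta detection is turned off and the code is forced.
    """
    if target_language is None:
        return {"use_meta_language": True}
    return {"use_meta_language": False, "target_language": target_language}
```

The result is a plain dict, so it can be passed straight to the constructor, as in `g = Goose(language_config("es"))`.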

### Parser Selection

Choose between the available HTML parsers:

```python
from goose3 import Goose

# Default lxml parser (faster, more robust)
g = Goose({'parser_class': 'lxml'})

# BeautifulSoup parser (more lenient)
g = Goose({'parser_class': 'soup'})
```
