tessl/pypi-goose3

HTML content / article extractor, web scraping for Python 3

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/goose3@3.1.x

To install, run

npx @tessl/cli install tessl/pypi-goose3@3.1.0

# Goose3

Goose3 is a Python library for extracting article content, metadata, and media from web pages and HTML documents. It identifies the main article content and filters out navigation, advertisements, and other non-content elements using text analysis.

## Package Information

- **Package Name**: goose3
- **Language**: Python
- **Installation**: `pip install goose3`
- **Optional Dependencies**:
  - `pip install goose3[chinese]` - Chinese language support
  - `pip install goose3[arabic]` - Arabic language support
  - `pip install goose3[all]` - All language extensions

## Core Imports

```python
from goose3 import Goose
```

For configuration and data types:

```python
from goose3 import Goose, Configuration, Article, Image, Video
from goose3 import ArticleContextPattern, PublishDatePattern, AuthorPattern
```

For language-specific text processing:

```python
from goose3.text import StopWords, StopWordsChinese, StopWordsArabic, StopWordsKorean
```

## Basic Usage

```python
from goose3 import Goose

# Basic extraction from URL
g = Goose()
article = g.extract(url='https://example.com/article')

print(article.title)
print(article.cleaned_text)
print(article.meta_description)
if article.top_image:
    print(article.top_image.src)

# Extract from raw HTML
html_content = "<html>...</html>"
article = g.extract(raw_html=html_content)

# Using as context manager (recommended)
with Goose() as g:
    article = g.extract(url='https://example.com/article')
    print(article.title)
```

## Architecture

Goose3 uses a multi-stage extraction pipeline:

- **Network Fetcher**: Downloads web content with configurable user agents and request handling
- **Parser**: Processes HTML using lxml or BeautifulSoup with language-specific optimization
- **Content Extraction**: Identifies main article content using text density analysis and DOM patterns
- **Metadata Extraction**: Extracts titles, descriptions, publication dates, authors, and schema data
- **Media Detection**: Locates and extracts images and embedded videos
- **Language Processing**: Multi-language text analysis with specialized analyzers for Chinese, Arabic, and Korean
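
To make the content-extraction stage more concrete, here is a deliberately simplified sketch of scoring text blocks by how many stopwords they contain, on the idea that running prose scores higher than menus or buttons. It is a stand-in for illustration only, not Goose3's actual scoring code; `STOPWORDS`, `stopword_score`, and `pick_main_block` are hypothetical names, and the word list is a tiny sample.

```python
import re

# Tiny illustrative stopword sample (an assumption for this sketch).
STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "is",
    "it", "that", "for", "on", "with", "as", "was",
})


def stopword_score(text: str) -> int:
    """Count stopword occurrences in a block of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for word in words if word in STOPWORDS)


def pick_main_block(blocks: list[str]) -> str:
    """Return the block with the highest stopword score (first wins on ties)."""
    return max(blocks, key=stopword_score)


blocks = [
    "Home | News | Sports | Contact",
    "The committee said that the proposal was sent to the council "
    "for review, and it is expected to pass.",
    "Subscribe now",
]
main = pick_main_block(blocks)
```

A real pipeline would score DOM nodes rather than plain strings and combine several signals, so treat this only as an intuition for why navigation text tends to lose to article text.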

## Capabilities

### Core Extraction

The `Goose` class is the main entry point: it takes a URL or raw HTML and returns an `Article` containing clean text content, metadata, and media elements.

```python { .api }
class Goose:
    def __init__(self, config=None): ...
    def extract(self, url=None, raw_html=None) -> Article: ...
    def close(self): ...
    def shutdown_network(self): ...
```

[Core Extraction](./core-extraction.md)
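
One way to wrap the extraction call is sketched below, assuming that fetch failures surface as `NetworkError`, which this document lists as a `RuntimeError` subclass. `fetch_clean_text` and `_default_factory` are hypothetical helper names; the factory parameter exists only so the function can be exercised with a stand-in object instead of a live network call.

```python
from typing import Callable, Optional


def _default_factory():
    # Imported lazily so this sketch loads even where goose3 is not installed.
    from goose3 import Goose
    return Goose()


def fetch_clean_text(url: str, goose_factory: Callable = _default_factory) -> Optional[str]:
    """Return the article's cleaned text, or None if extraction raised."""
    try:
        # The context manager releases the extractor's resources on exit.
        with goose_factory() as g:
            article = g.extract(url=url)
            return article.cleaned_text
    except RuntimeError as exc:
        # Broad on purpose: NetworkError is documented here as a RuntimeError.
        print(f"extraction failed for {url}: {exc}")
        return None
```

Calling `fetch_clean_text('https://example.com/article')` uses a real `Goose` instance; catching `RuntimeError` is a coarse choice, and narrower handling may be preferable once the exact exception is known.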

### Configuration System

Configuration options customize extraction behavior, including parser selection, language targeting, content patterns, and network settings.

```python { .api }
class Configuration:
    def __init__(self): ...

    # Key properties
    parser_class: str
    target_language: str
    browser_user_agent: str
    enable_image_fetching: bool
    strict: bool
    local_storage_path: str
```

[Configuration](./configuration.md)
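
As a minimal sketch of the properties listed above, the helper below assembles a plain dict keyed by those property names, relying on the `ConfigInput` type in this document, which indicates that a dict is accepted in place of a `Configuration`. `build_config` and `make_extractor` are hypothetical helpers, and the user-agent string is a placeholder.

```python
def build_config(user_agent: str, language: str = "en",
                 fetch_images: bool = False, strict: bool = False) -> dict:
    """Collect a few documented Configuration properties into a dict."""
    return {
        "browser_user_agent": user_agent,
        "target_language": language,
        "enable_image_fetching": fetch_images,
        # Assumption: strict controls whether fetch errors raise
        # instead of being swallowed.
        "strict": strict,
    }


def make_extractor(config: dict):
    """Create a Goose instance from the dict (requires goose3)."""
    from goose3 import Goose  # lazy import keeps this sketch loadable
    return Goose(config)


config = build_config("my-crawler/1.0 (placeholder)")
```

Keeping the settings in a dict makes them easy to load from a file or environment before handing them to `make_extractor(config)`.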

### Article Data Structure

The `Article` object holds the extracted content, metadata, and media, each exposed as a property.

```python { .api }
class Article:
    @property
    def title(self) -> str: ...
    @property
    def cleaned_text(self) -> str: ...
    @property
    def top_image(self) -> Image: ...
    @property
    def movies(self) -> list[Video]: ...
    # ... additional properties
```

[Article Data](./article-data.md)
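
The properties above can be flattened into a plain dict for logging or storage, as in the sketch below. `summarize` is a hypothetical helper that reads only the four properties shown in this section, and the `SimpleNamespace` object is a stand-in for a real `Article` so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize(article, preview_chars: int = 80) -> dict:
    """Flatten a few Article properties into a plain dict."""
    top_image = article.top_image
    return {
        "title": article.title,
        "preview": article.cleaned_text[:preview_chars],
        # top_image may be empty when no image was found.
        "image_src": top_image.src if top_image else None,
        "video_count": len(article.movies),
    }


# Stand-in object with the same attribute names as Article.
fake_article = SimpleNamespace(
    title="Example",
    cleaned_text="Body text. " * 20,
    top_image=None,
    movies=[],
)
summary = summarize(fake_article)
```

With a real extraction result, `summarize(g.extract(url=...))` would be called the same way.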

### Media Extraction

Image and video extraction capabilities with support for metadata, dimensions, and embedded content from various platforms.

```python { .api }
class Image:
    src: str
    width: int
    height: int

class Video:
    src: str
    embed_code: str
    embed_type: str
    width: int
    height: int
```

[Media Extraction](./media-extraction.md)
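
A small sketch of working with these fields follows: grouping video sources by `embed_type` and checking an image's orientation. `group_videos_by_type` and `is_landscape` are hypothetical helpers, and the `embed_type` values `"iframe"` and `"embed"` are illustrative assumptions rather than a list taken from this document.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_videos_by_type(videos) -> dict:
    """Map each embed_type to the list of video sources using it."""
    grouped = defaultdict(list)
    for video in videos:
        grouped[video.embed_type].append(video.src)
    return dict(grouped)


def is_landscape(image) -> bool:
    """True when both dimensions are known and width exceeds height."""
    return bool(image.width and image.height and image.width > image.height)


# Stand-ins with the same attribute names as Video and Image.
videos = [
    SimpleNamespace(src="https://example.com/v1", embed_type="iframe"),
    SimpleNamespace(src="https://example.com/v2", embed_type="embed"),
    SimpleNamespace(src="https://example.com/v3", embed_type="iframe"),
]
by_type = group_videos_by_type(videos)
```

In practice the inputs would be `article.movies` and `article.top_image` from an extraction result.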

## Types

```python { .api }
from typing import Union, Optional, List, Dict, Any

# Main extraction interface
ExtractInput = Union[str, None]  # URL or raw HTML
ConfigInput = Union[Configuration, dict, None]

# Pattern matching for content extraction
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None): ...
    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str
    tag: str
    domain: str

# Exception types
class NetworkError(RuntimeError):
    """Network-related errors during content fetching"""
    def __init__(self, status_code, reason): ...
    status_code: int  # HTTP status code
    reason: str  # HTTP reason phrase
    message: str  # Formatted error message

# Language-specific text processing classes
class StopWords:
    """Base stopwords class for English text processing"""
    def __init__(self, language: str = 'en'): ...

class StopWordsChinese(StopWords):
    """Chinese language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsArabic(StopWords):
    """Arabic language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsKorean(StopWords):
    """Korean language stopwords for improved text analysis"""
    def __init__(self): ...
```
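
As a rough sketch of feeding `ArticleContextPattern` from plain data, the helper below checks keyword specs against the four documented fields before constructing anything. `validate_spec`, `build_patterns`, and the sample class name `article-body` are hypothetical; the rule that a spec needs either `tag` or both `attr` and `value` is an assumption made for illustration and is not stated in this document.

```python
ALLOWED_KEYS = {"attr", "value", "tag", "domain"}


def validate_spec(spec: dict) -> dict:
    """Reject specs with unknown fields or no usable selector."""
    unknown = set(spec) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown pattern fields: {sorted(unknown)}")
    has_attr_pair = "attr" in spec and "value" in spec
    if not (has_attr_pair or "tag" in spec):
        # Assumed rule for this sketch, see the note above.
        raise ValueError("spec needs 'tag' or both 'attr' and 'value'")
    return spec


def build_patterns(specs):
    """Turn validated specs into ArticleContextPattern objects (requires goose3)."""
    from goose3 import ArticleContextPattern  # lazy import
    return [ArticleContextPattern(**validate_spec(spec)) for spec in specs]


specs = [
    {"attr": "class", "value": "article-body", "domain": "example.com"},
    {"tag": "article"},
]
checked = [validate_spec(spec) for spec in specs]
```

The constructor is keyword-only in the signature above, which is why the specs are expanded with `**`.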