# Core Extraction

The core extraction functionality processes URLs or HTML documents to extract clean text content, metadata, and media elements. This module provides the primary Goose class interface and the extraction pipeline.

## Capabilities

### Primary Extraction Interface

The Goose class serves as the main entry point for all extraction operations, managing network connections, parser selection, and the complete extraction pipeline.

```python { .api }
class Goose:
    def __init__(self, config: Union[Configuration, dict, None] = None):
        """
        Initialize Goose extractor with optional configuration.

        Parameters:
        - config: Configuration object, dict of config options, or None for defaults

        Raises:
        - Exception: If local_storage_path is invalid when image fetching is enabled
        """

    def extract(self, url: Union[str, None] = None, raw_html: Union[str, None] = None) -> Article:
        """
        Extract article content from URL or raw HTML.

        Parameters:
        - url: URL to fetch and extract from
        - raw_html: Raw HTML string to extract from

        Returns:
        - Article: Extracted content and metadata

        Raises:
        - ValueError: If neither url nor raw_html is provided
        - NetworkError: Network-related errors during fetching
        - UnicodeDecodeError: Character encoding issues
        """

    def close(self):
        """
        Close network connection and perform cleanup.
        Automatically called when using as context manager or during garbage collection.
        """

    def shutdown_network(self):
        """
        Close the network connection specifically.
        Called automatically by close() method.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic cleanup."""
```

### Context Manager Usage

Goose supports the context manager protocol for automatic resource cleanup:

```python
from goose3 import Goose

with Goose() as g:
    article = g.extract(url="https://example.com/article")
    print(article.title)
# Network connection automatically closed
```
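
The cleanup order that the docstrings describe (leaving the `with` block calls `close()`, which in turn calls `shutdown_network()`) can be made visible with a small stand-in class. This is only a sketch of the documented contract, not goose3's implementation: `StandInExtractor` and its `calls` list are hypothetical names introduced for illustration.

```python
class StandInExtractor:
    """Hypothetical stand-in that mimics the documented cleanup order."""

    def __init__(self):
        self.calls = []  # records cleanup calls in the order they happen

    def shutdown_network(self):
        self.calls.append("shutdown_network")

    def close(self):
        self.calls.append("close")
        self.shutdown_network()  # close() delegates network teardown

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # never suppress exceptions raised inside the block


with StandInExtractor() as extractor:
    pass  # extraction work would happen here

cleanup_order = extractor.calls
```

Outside a `with` block, the same guarantee needs an explicit `close()` call, typically in a `finally` clause.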

### Configuration During Initialization

Pass configuration as a dict or as a Configuration object:

```python
from goose3 import Configuration, Goose

# Dict configuration
g = Goose({
    'parser_class': 'soup',
    'target_language': 'es',
    'enable_image_fetching': True,
    'strict': False
})

# Configuration object
config = Configuration()
config.parser_class = 'soup'
config.target_language = 'es'
g = Goose(config)
```

### Extraction Modes

Extract from a URL:

```python
g = Goose()
article = g.extract(url="https://example.com/news-article")
```

Extract from raw HTML:

```python
html_content = """
<html>
<body>
    <h1>Article Title</h1>
    <p>Article content goes here...</p>
</body>
</html>
"""
g = Goose()
article = g.extract(raw_html=html_content)
```
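
When the input may be either a URL or an HTML string, a thin dispatcher can pick the right keyword argument. The sketch below is one possible approach, not part of goose3: `extract_from` is a hypothetical helper, and `_EchoExtractor` is a stand-in used only so the example runs without network access. Any object with an `extract(url=..., raw_html=...)` method, such as a Goose instance, could be passed instead.

```python
from urllib.parse import urlparse


def extract_from(extractor, source):
    """Call extractor.extract() with url= or raw_html=, based on the input.

    Hypothetical helper: treats http(s) strings with a host as URLs and
    everything else as raw HTML.
    """
    if not source or not source.strip():
        raise ValueError("source must be a non-empty URL or HTML string")
    candidate = source.strip()
    parsed = urlparse(candidate)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return extractor.extract(url=candidate)
    return extractor.extract(raw_html=source)


class _EchoExtractor:
    """Stand-in extractor that reports which argument it received."""

    def extract(self, url=None, raw_html=None):
        return ("url", url) if url is not None else ("raw_html", raw_html)


from_url = extract_from(_EchoExtractor(), "https://example.com/news-article")
from_html = extract_from(_EchoExtractor(), "<html><body><p>Hi</p></body></html>")
```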

### Error Handling

```python
from goose3 import Goose, NetworkError

g = Goose({'strict': True})  # Raise all network errors
try:
    article = g.extract(url="https://example.com/article")
except NetworkError as e:
    print(f"Network error: {e}")
except UnicodeDecodeError as e:
    # Must come before ValueError: UnicodeDecodeError is a ValueError subclass
    print(f"Encoding error: {e}")
except ValueError as e:
    print(f"Input error: {e}")
```
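
For batch jobs it can be convenient to turn these exceptions into return values. The following is a minimal sketch under stated assumptions: `safe_extract` and `_FailingExtractor` are hypothetical names, and the `network_errors` parameter defaults to `OSError` only so the example runs without goose3 installed. With goose3, `network_errors=(NetworkError,)` would be the natural choice.

```python
def safe_extract(extractor, url, network_errors=(OSError,)):
    """Return (article, None) on success or (None, message) on a handled failure."""
    try:
        return extractor.extract(url=url), None
    except network_errors as exc:
        return None, f"network: {exc}"
    except UnicodeDecodeError as exc:
        # Checked before ValueError because it is a ValueError subclass.
        return None, f"encoding: {exc}"
    except ValueError as exc:
        return None, f"input: {exc}"


class _FailingExtractor:
    """Stand-in extractor that always raises the exception it was given."""

    def __init__(self, exc):
        self._exc = exc

    def extract(self, url=None, raw_html=None):
        raise self._exc


url = "https://example.com/article"
network_result = safe_extract(_FailingExtractor(ConnectionError("timed out")), url)
encoding_result = safe_extract(
    _FailingExtractor(UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")),
    url,
)
input_result = safe_extract(_FailingExtractor(ValueError("no input")), url)
```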

### Multi-language Support

Configure language targeting for better extraction:

```python
# Automatic language detection from meta tags
g = Goose({'use_meta_language': True})

# Force specific language
g = Goose({
    'use_meta_language': False,
    'target_language': 'es'  # Spanish
})

# Chinese language support
g = Goose({'target_language': 'zh'})

# Arabic language support
g = Goose({'target_language': 'ar'})
```
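
The two modes (meta-tag detection versus a forced language) can be wrapped in a small helper that builds the relevant options. This is an illustrative sketch: `language_options` is a hypothetical function, and it only assembles a dict using the option names shown in this section.

```python
def language_options(language=None):
    """Build the language-related entries for a Goose config dict.

    With no language given, rely on the page's meta tags; otherwise
    disable meta-tag detection and force the given language code.
    """
    if language is None:
        return {'use_meta_language': True}
    return {'use_meta_language': False, 'target_language': language}


auto_options = language_options()
spanish_options = language_options('es')
```

The resulting dict can be passed directly to the constructor, for example `Goose(language_options('es'))`, or merged with other options first.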

### Parser Selection

Choose between the available HTML parsers:

```python
# Default lxml parser (faster, more robust)
g = Goose({'parser_class': 'lxml'})

# BeautifulSoup parser (more lenient)
g = Goose({'parser_class': 'soup'})
```
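
Since one parser is described as faster and the other as more lenient, one option is to try them in order and keep the first result that contains text. The sketch below assumes the article object exposes a `cleaned_text` attribute and that the extractor has a `close()` method. `extract_with_fallback`, `_StubArticle`, and `_StubExtractor` are hypothetical names, and the stubs exist only so the example runs without goose3.

```python
def extract_with_fallback(make_extractor, raw_html, parsers=('lxml', 'soup')):
    """Try each parser in order; return (parser_name, article) for the first with text.

    make_extractor is any callable that accepts a config dict and returns
    an extractor, for example the Goose class itself.
    """
    last_error = None
    for name in parsers:
        extractor = make_extractor({'parser_class': name})
        try:
            article = extractor.extract(raw_html=raw_html)
        except Exception as exc:  # remember the failure and try the next parser
            last_error = exc
            continue
        finally:
            extractor.close()  # always release resources for this attempt
        if getattr(article, 'cleaned_text', ''):
            return name, article
    raise RuntimeError("no parser produced any text") from last_error


class _StubArticle:
    def __init__(self, text):
        self.cleaned_text = text


class _StubExtractor:
    """Stand-in: pretends the lxml parser finds no text and soup succeeds."""

    def __init__(self, config):
        self.parser = config['parser_class']

    def extract(self, url=None, raw_html=None):
        return _StubArticle("" if self.parser == 'lxml' else "Article content")

    def close(self):
        pass


used_parser, article = extract_with_fallback(_StubExtractor, "<html></html>")
```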