tessl/pypi-goose3

HTML Content / Article Extractor, web scraping for Python 3

Describes
pypi package pkg:pypi/goose3@3.1.x

To install, run

npx @tessl/cli install tessl/pypi-goose3@3.1.0


Goose3

A comprehensive Python library for extracting article content, metadata, and media from web pages and HTML documents. Goose3 intelligently identifies main article content while filtering out navigation, advertisements, and other non-content elements using advanced text analysis algorithms.

Package Information

  • Package Name: goose3
  • Language: Python
  • Installation: pip install goose3
  • Optional Dependencies:
    • pip install goose3[chinese] - Chinese language support
    • pip install goose3[arabic] - Arabic language support
    • pip install goose3[all] - All language extensions

Core Imports

from goose3 import Goose

For configuration and data types:

from goose3 import Goose, Configuration, Article, Image, Video
from goose3 import ArticleContextPattern, PublishDatePattern, AuthorPattern

For language-specific text processing:

from goose3.text import StopWords, StopWordsChinese, StopWordsArabic, StopWordsKorean

Basic Usage

from goose3 import Goose

# Basic extraction from URL
g = Goose()
article = g.extract(url='https://example.com/article')

print(article.title)
print(article.cleaned_text)
print(article.meta_description)
if article.top_image:
    print(article.top_image.src)

# Extract from raw HTML
html_content = "<html>...</html>"
article = g.extract(raw_html=html_content)

# Using as context manager (recommended)
with Goose() as g:
    article = g.extract(url='https://example.com/article')
    print(article.title)

Architecture

Goose3 uses a multi-stage extraction pipeline:

  • Network Fetcher: Downloads web content with configurable user agents and request handling
  • Parser: Processes HTML using lxml or BeautifulSoup with language-specific optimization
  • Content Extraction: Identifies main article content using text density analysis and DOM patterns
  • Metadata Extraction: Extracts titles, descriptions, publication dates, authors, and schema data
  • Media Detection: Locates and extracts images and embedded videos
  • Language Processing: Multi-language text analysis with specialized analyzers for Chinese, Arabic, and Korean

Capabilities

Core Extraction

Main article extraction functionality that processes URLs or HTML to extract clean text content, metadata, and media elements.

class Goose:
    def __init__(self, config=None): ...
    def extract(self, url=None, raw_html=None) -> Article: ...
    def close(self): ...
    def shutdown_network(self): ...

See docs/core-extraction.md

Configuration System

Comprehensive configuration options for customizing extraction behavior, including parser selection, language targeting, content patterns, and network settings.

class Configuration:
    def __init__(self): ...
    
    # Key properties
    parser_class: str
    target_language: str
    browser_user_agent: str
    enable_image_fetching: bool
    strict: bool
    local_storage_path: str

See docs/configuration.md

Article Data Structure

Rich data structure containing extracted content, metadata, and media with comprehensive property access for all extracted information.

class Article:
    @property
    def title(self) -> str: ...
    @property  
    def cleaned_text(self) -> str: ...
    @property
    def top_image(self) -> Image: ...
    @property
    def movies(self) -> list[Video]: ...
    # ... additional properties

See docs/article-data.md

Media Extraction

Image and video extraction capabilities with support for metadata, dimensions, and embedded content from various platforms.

class Image:
    src: str
    width: int
    height: int

class Video:
    src: str
    embed_code: str
    embed_type: str
    width: int
    height: int

See docs/media-extraction.md

Types

from typing import Union, Optional, List, Dict, Any

# Main extraction interface
ExtractInput = Union[str, None]  # URL or raw HTML
ConfigInput = Union[Configuration, dict, None]

# Pattern matching for content extraction
class ArticleContextPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str  
    tag: str
    domain: str

class PublishDatePattern:
    def __init__(self, *, attr=None, value=None, content=None, subcontent=None, tag=None, domain=None): ...
    attr: str
    value: str
    content: str
    subcontent: str
    tag: str
    domain: str

class AuthorPattern:
    def __init__(self, *, attr=None, value=None, tag=None, domain=None): ...
    attr: str
    value: str
    tag: str
    domain: str

# Exception types
class NetworkError(RuntimeError):
    """Network-related errors during content fetching"""
    def __init__(self, status_code, reason): ...
    status_code: int  # HTTP status code
    reason: str       # HTTP reason phrase
    message: str      # Formatted error message

# Language-specific text processing classes
class StopWords:
    """Base stopwords class for English text processing"""
    def __init__(self, language: str = 'en'): ...

class StopWordsChinese(StopWords):
    """Chinese language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsArabic(StopWords):
    """Arabic language stopwords for improved text analysis"""
    def __init__(self): ...

class StopWordsKorean(StopWords):
    """Korean language stopwords for improved text analysis"""
    def __init__(self): ...