CtrlK
BlogDocsLog inGet started
Tessl Logo

tessl/pypi-html2text

Turn HTML into equivalent Markdown-structured text.

Pending
Overview
Eval results
Files

core-conversion.mddocs/

Core HTML Conversion

Primary conversion functionality for transforming HTML into Markdown or plain text. Provides both simple one-shot conversion and advanced configurable conversion with extensive formatting options.

Capabilities

Simple HTML Conversion

Convenience function for straightforward HTML to Markdown conversion with minimal configuration.

def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    """
    Convert HTML string to Markdown/text using default settings.
    
    Args:
        html: HTML string to convert
        baseurl: Base URL for resolving relative links (default: "")
        bodywidth: Text wrapping width, None uses config.BODY_WIDTH (default: None)
    
    Returns:
        Converted Markdown/text string
    
    Example:
        >>> import html2text
        >>> html = "<p><strong>Bold</strong> and <em>italic</em></p>"
        >>> print(html2text.html2text(html))
        **Bold** and _italic_
    """

Advanced HTML Conversion

Full-featured HTML to text converter with extensive configuration options for fine-grained control over output formatting.

class HTML2Text(html.parser.HTMLParser):
    """
    Advanced HTML to text converter with comprehensive configuration options.
    
    Inherits from html.parser.HTMLParser to handle HTML parsing and provides
    extensive customization for output formatting, link handling, table processing,
    and text styling.
    """
    
    def __init__(
        self,
        out: Optional[OutCallback] = None,
        baseurl: str = "",
        bodywidth: int = 78
    ) -> None:
        """
        Initialize HTML2Text converter.
        
        Args:
            out: Optional custom output callback function for handling text output
            baseurl: Base URL for resolving relative links (default: "")
            bodywidth: Maximum line width for text wrapping (default: 78)
        """
    
    def handle(self, data: str) -> str:
        """
        Convert HTML string to Markdown/text with current configuration.
        
        This is the main conversion method that processes the HTML through
        the parser and returns the formatted output.
        
        Args:
            data: HTML string to convert
            
        Returns:
            Converted Markdown/text string
            
        Example:
            >>> h = html2text.HTML2Text()
            >>> h.ignore_links = True
            >>> html = "<p>Hello <a href='http://example.com'>world</a>!</p>"
            >>> print(h.handle(html))
            Hello world!
        """
    
    def feed(self, data: str) -> None:
        """
        Feed HTML data to the parser for processing.
        
        Args:
            data: HTML string to feed to parser
        """
    
    def finish(self) -> str:
        """
        Complete parsing and return formatted text output.
        
        Returns:
            Final formatted text string
        """
    
    def outtextf(self, s: str) -> None:
        """
        Default output callback function that appends text to internal buffer.
        
        This is the default implementation of the output callback that collects
        all text output into an internal list for final processing.
        
        Args:
            s: Text string to append to output buffer
        """
    
    def close(self) -> None:
        """
        Close the HTML parser and perform final cleanup.
        
        Inherited from HTMLParser, ensures proper parser cleanup.
        """
    
    def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]:
        """
        Find index of link with matching attributes in anchor list.
        
        Used internally for reference-style link processing to avoid
        duplicate link definitions.
        
        Args:
            attrs: Dictionary of HTML element attributes
            
        Returns:
            Index of matching anchor element or None if not found
        """

HTML Element Support

html2text supports comprehensive HTML element conversion:

Text Formatting

  • Bold: <strong>, <b>**text**
  • Italic: <em>, <i>_text_
  • Code: <code>, <tt>, <kbd>`text`
  • Strikethrough: <del>, <strike>, <s>~~text~~
  • Quotes: <q>"text"
  • Superscript/Subscript: <sup>, <sub> (configurable)

Structure Elements

  • Headers: <h1> through <h6># Header
  • Paragraphs: <p> → paragraph breaks
  • Line breaks: <br> → line breaks
  • Horizontal rules: <hr>* * *
  • Blockquotes: <blockquote>> text
  • Preformatted: <pre> → indented code blocks

Lists

  • Unordered lists: <ul>, <li>* item
  • Ordered lists: <ol>, <li>1. item
  • Nested lists: Full support with proper indentation
  • Definition lists: <dl>, <dt>, <dd>

Links and Images

  • Links: <a>[text](url) or reference-style
  • Images: <img>![alt](src) or configurable formats
  • Automatic links: URL detection and conversion

Tables

  • Tables: <table>, <tr>, <td>, <th> → Markdown tables
  • Table formatting: Configurable padding and alignment
  • Complex tables: Colspan handling and formatting options

Usage Examples

Basic Text Conversion

import html2text

# Simple paragraph with formatting
html = """
<div>
    <h1>Main Title</h1>
    <p>This is a <strong>bold statement</strong> with some <em>emphasis</em>.</p>
    <p>Here's a <a href="https://example.com">link</a> and some <code>inline code</code>.</p>
</div>
"""

converter = html2text.HTML2Text()
markdown = converter.handle(html)
print(markdown)

List Processing

html = """
<ul>
    <li>First item</li>
    <li>Second item with <strong>bold text</strong></li>
    <li>Third item
        <ol>
            <li>Nested ordered item</li>
            <li>Another nested item</li>
        </ol>
    </li>
</ul>
"""

converter = html2text.HTML2Text()
result = converter.handle(html)
print(result)

Table Conversion

html = """
<table>
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>City</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>30</td>
        <td>New York</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>25</td>
        <td>London</td>
    </tr>
</table>
"""

converter = html2text.HTML2Text()
converter.pad_tables = True  # Enable table padding
result = converter.handle(html)
print(result)

Custom Output Handling

def custom_output(text):
    """Custom output handler that uppercases text."""
    print(text.upper(), end='')

html = "<p>Hello world!</p>"
converter = html2text.HTML2Text(out=custom_output)
converter.handle(html)  # Will print "HELLO WORLD!" in uppercase

Install with Tessl CLI

npx tessl i tessl/pypi-html2text

docs

configuration.md

core-conversion.md

index.md

utilities.md

tile.json