tessl/pypi-w3lib

tessl install tessl/pypi-w3lib@2.3.0

Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection

Agent Success

Agent success rate when using this tile

84%

Improvement

Agent success rate improvement when using this tile compared to baseline

0.91x

Baseline

Agent success rate without this tile

92%

evals/scenario-1/task.md

HTML Entity Converter

Build a text processing utility that converts HTML entity references into their corresponding Unicode characters. The utility should handle various types of HTML entities that appear in web content and provide options for selective entity preservation.

Requirements

Your implementation should process HTML text containing entities and convert them to readable Unicode characters:

  1. Entity Types: Support for named entities (like &amp;), decimal numeric entities (like &#65;), and hexadecimal numeric entities (like &#x41;).

  2. Selective Preservation: Provide the ability to keep specific entities unconverted. For example, preserve &lt;, &gt;, and &amp; while converting others.

  3. Entity Detection: Implement a check to determine if a given text contains any HTML entities before processing.

  4. Illegal Character Handling: Handle removal of illegal XML/HTML character references according to standard specifications.
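Requirements 1, 2, and 4 can be sketched with the standard library alone. This is a minimal illustration, not w3lib's implementation: the names ENTITY_RE, is_legal_xml_char, and convert_entities_sketch are hypothetical, and "illegal" is assumed here to mean a code point outside the XML 1.0 Char production, which the task does not spell out. The w3lib dependency named in this task provides w3lib.html.replace_entities for the same job, with its own rules for illegal references.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B;. The trailing semicolon is required here.
ENTITY_RE = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));")


def is_legal_xml_char(cp: int) -> bool:
    """Return True if cp falls inside the XML 1.0 Char production (assumed rule)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def convert_entities_sketch(text, keep=(), remove_illegal=True):
    """Convert named, decimal, and hexadecimal references to characters."""
    keep_names = {name.lower() for name in keep}

    def replace(match):
        dec, hexa, name = match.groups()
        if name is not None:
            # Preserved names and unknown names are left exactly as written.
            if name.lower() in keep_names or name not in name2codepoint:
                return match.group(0)
            return chr(name2codepoint[name])
        cp = int(dec) if dec is not None else int(hexa, 16)
        if is_legal_xml_char(cp):
            return chr(cp)
        # Illegal reference: drop it, or keep the raw text if asked to.
        return "" if remove_illegal else match.group(0)

    return ENTITY_RE.sub(replace, text)


print(convert_entities_sketch("&#65; &#x41; &amp;"))                      # A A &
print(convert_entities_sketch("&lt;div&gt; &copy;", keep=["lt", "gt"]))  # &lt;div&gt; ©
```

Unknown names and references without a closing semicolon pass through untouched in this sketch; a real implementation may choose differently.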

Test Cases

  • Converting text "Hello &amp; goodbye" produces "Hello & goodbye" @test
  • Converting text "Price: &euro;100" produces "Price: €100" @test
  • Converting text "Copyright &copy; 2024" produces "Copyright © 2024" @test
  • Converting text "&lt;div&gt;" with preserved entities ['lt', 'gt'] keeps "&lt;div&gt;" unchanged @test
  • Checking if "No entities here" contains entities returns False @test
  • Checking if "Has &nbsp; entity" contains entities returns True @test
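The two detection cases can be illustrated with a small regex check. This is a hedged, stdlib-only sketch: the pattern and the name contains_entity are hypothetical, the second input assumes the entity in that test case is a non-breaking space written as a named reference, and the check is purely syntactic (it does not verify that a name is a known entity). w3lib offers w3lib.html.has_entities for the same purpose.

```python
import re

# Hypothetical pattern: a named, decimal, or hexadecimal reference ending in ';'.
ENTITY_PATTERN = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def contains_entity(text: str) -> bool:
    """Return True if text holds at least one entity-shaped reference."""
    return ENTITY_PATTERN.search(text) is not None


print(contains_entity("No entities here"))   # False
print(contains_entity("Has &nbsp; entity"))  # True
print(contains_entity("Tom & Jerry"))        # False: a bare ampersand is not a reference
```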

Implementation

@generates

API

def convert_entities(text: str, keep: list[str] | None = None, remove_illegal: bool = True) -> str:
    """
    Convert HTML entities in text to Unicode characters.

    Args:
        text: The text containing HTML entities
        keep: Optional list of entity names to preserve (e.g., ['amp', 'lt', 'gt'])
        remove_illegal: Whether to remove illegal character references

    Returns:
        Text with entities converted to Unicode characters
    """
    pass

def has_entities(text: str) -> bool:
    """
    Check if text contains any HTML entities.

    Args:
        text: The text to check

    Returns:
        True if the text contains HTML entities, False otherwise
    """
    pass

Dependencies { .dependencies }

w3lib { .dependency }

Provides web utility functions for HTML processing.

@satisfied-by

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/w3lib@2.3.x